Course Name: Bioinformatics Tools for Fisheries (FBT 505) (Complete e-course content) Credit: 2 (1+1)

Module 1

Basics of Bioinformatics

Introduction to Bioinformatics Resources and Tools

Bioinformatics is an interdisciplinary field that combines biology, computer science, and information technology to analyze and interpret biological data. This module introduces the fundamental resources, platforms, databases, and software tools used in bioinformatics, laying the foundation for more advanced computational analysis in life sciences.

1. Overview of Bioinformatics Resources

Bioinformatics resources are essential tools for managing the massive amount of data generated by genomic, transcriptomic, proteomic, and structural biology research. These resources include:

Databases (primary and secondary)
Software tools for sequence alignment, visualization, and molecular modeling
Web-based platforms for genome browsing, annotation, and protein structure prediction

Applications:

Gene and protein identification
Phylogenetic analysis
Molecular docking and drug discovery
Structural and functional annotation

2. Platforms for Bioinformatics Analysis

Bioinformatics tools can run on various operating systems. Understanding the environment is crucial for effective tool deployment and scripting.

a. Linux-Based Platforms:

Most bioinformatics tools are developed for Unix/Linux due to its open-source flexibility and scripting power.
Popular distributions: Ubuntu, CentOS, Debian
Common command-line utilities: grep, awk, sed; common scripting languages: Bash, Python, Perl

b. Windows-Based Platforms:

GUI-based tools (e.g., MEGA, Geneious)
Compatibility layer tools like WSL (Windows Subsystem for Linux) are increasingly used

3. Bioinformatics Software Tools

These are used for analyzing DNA, RNA, and protein data:

Task	Software/Tool
Sequence Alignment	BLAST, Clustal Omega
Phylogenetic Tree Construction	MEGA, PhyML
Protein Structure Prediction	SWISS-MODEL, Phyre2
Data Visualization	Jalview, UCSF Chimera

Most of these tools are available online or as downloadable packages for local installations.

4. Biological Databases

Databases are categorized into primary and secondary types:

a. Primary Databases

Nucleotide Sequence Databases:
- GenBank (NCBI): Comprehensive database of publicly available DNA sequences
- ENA (European Nucleotide Archive, hosted by EMBL-EBI): European counterpart to GenBank
- DDBJ: DNA Data Bank of Japan
Protein Sequence Databases:
- UniProtKB: Central resource for protein sequence and annotation data
- PIR: Protein Information Resource

b. Secondary Databases

Secondary databases derive information by curating and analyzing primary data.

Examples:
- Pfam: Protein families and domains
- InterPro: Integrated resource of protein signatures

c. Structure Databases

These house 3D structures of biomolecules.

PDB (Protein Data Bank): Repository for 3D structural data of proteins and nucleic acids.
SCOP and CATH: Classification of protein structures

5. Analysis Packages

These combine multiple tools or scripts into workflows:

EMBOSS: Suite of command-line tools for sequence analysis
Bioconductor: R-based platform for genomics data
Galaxy: Web-based platform enabling users to run workflows without command-line knowledge
Geneious: Commercial package with integrated GUI for sequence editing, alignment, and annotation

Module 2

Sequence Alignment:

1. Introduction to Sequence Alignment

Sequence alignment is the process of arranging sequences to identify regions of similarity. These similarities may indicate functional, structural, or evolutionary relationships.

There are two main types of sequence alignment:

Pairwise Alignment: Comparison between two sequences.
Multiple Sequence Alignment (MSA): Simultaneous alignment of three or more sequences.

2. Dot Matrix Method

The dot matrix is a graphical method used for preliminary sequence comparison.

Key Features:

Plots one sequence on the X-axis and another on the Y-axis.
Dots are placed where residues (nucleotides/amino acids) match.
Diagonal lines indicate regions of high similarity or identity.

Uses:

Quick visualization of repetitive sequences.
Identification of insertions, deletions, or inversions.

Tools: Dotlet, EMBOSS dotmatcher
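
The dot matrix idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions (exact matches, an optional window size, two short made-up sequences); the function names `dot_matrix` and `render` are hypothetical and are not part of Dotlet or EMBOSS dotmatcher.

```python
def dot_matrix(seq_x, seq_y, window=1):
    """Return a grid of booleans: grid[i][j] is True where the window starting
    at seq_y[i] exactly matches the window starting at seq_x[j]."""
    rows = len(seq_y) - window + 1
    cols = len(seq_x) - window + 1
    return [
        [seq_y[i:i + window] == seq_x[j:j + window] for j in range(cols)]
        for i in range(rows)
    ]


def render(grid, seq_x, seq_y):
    """Draw the grid as text: seq_x across the top, seq_y down the side."""
    lines = ["  " + seq_x[:len(grid[0])]]
    for i, row in enumerate(grid):
        lines.append(seq_y[i] + " " + "".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    x = "GATTACA"   # made-up example sequences
    y = "GATCACA"
    print(render(dot_matrix(x, y), x, y))
```

A larger `window` value suppresses isolated chance matches, so that mainly the diagonal runs remain visible.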

3. Scoring Matrices: PAM and BLOSUM

Scoring matrices are essential in aligning protein sequences to assess substitution likelihoods.

a. PAM (Point Accepted Mutation) Matrix

Based on evolutionary models of accepted mutations.
PAM1 corresponds to one accepted point mutation per 100 residues (about 1% change); PAM250, extrapolated from PAM1, is used for more divergent sequences.
Assumes a common ancestor and slow mutation rates.

b. BLOSUM (Blocks Substitution Matrix)

Derived from conserved sequence blocks (e.g., in protein domains).
BLOSUM62 is the most commonly used matrix.
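Matrices are available for different levels of similarity: higher numbers (e.g., BLOSUM80) suit closely related sequences, lower numbers (e.g., BLOSUM45) suit more divergent ones.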

Matrix	Suitable For
PAM250	Distantly related proteins
BLOSUM62	Moderately similar proteins

4. Sequence Retrieval from Online Databases

To perform alignments, sequences are often retrieved from public databases.

Steps:

Access NCBI (https://www.ncbi.nlm.nih.gov) or UniProt (https://www.uniprot.org).
Use accession numbers, gene/protein names, or organism filters.
Download sequences in FASTA format for alignment.

Databases:

GenBank: Nucleotide sequences
UniProtKB: Protein sequences and annotations
RefSeq: Curated reference sequences
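
Once a FASTA file has been downloaded, it can be read with a short script. The sketch below assumes the usual FASTA layout (a header line starting with ">" followed by one or more sequence lines); the record names and sequences in `EXAMPLE` are made up for illustration, and `parse_fasta` is a hypothetical helper, not a function from any database toolkit.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping record ID -> sequence."""
    records = {}
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first word of the header; the rest is description.
            fields = line[1:].split()
            current_id = fields[0] if fields else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {name: "".join(chunks) for name, chunks in records.items()}


# Made-up records used only to demonstrate the format.
EXAMPLE = """>seq1 hypothetical example record
ATGGCCATTG
TAATGGGCCG
>seq2 another made-up record
ATGAAACGCA
"""

if __name__ == "__main__":
    for name, sequence in parse_fasta(EXAMPLE).items():
        print(name, len(sequence), sequence)
```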

5. Pairwise Sequence Alignment Using BLAST

BLAST (Basic Local Alignment Search Tool) is a widely used tool for comparing a query sequence with a database.

Types of BLAST:

blastn: Nucleotide vs. nucleotide
blastp: Protein vs. protein
blastx: Translated nucleotide vs. protein
tblastn: Protein vs. translated nucleotide

Figure 2: Types of BLAST

Steps to Run BLAST:

Paste query sequence or upload a file.
Select appropriate BLAST type.
Choose database (e.g., nr, Swiss-Prot).
Adjust parameters (e.g., matrix, gap penalties).
View alignment results with identity %, e-value, score.

Applications:

Gene annotation
Homology search
Ortholog/paralog identification

6. Multiple Sequence Alignment (MSA)

MSA is used to align three or more sequences simultaneously to identify conserved regions, motifs, or evolutionary patterns.

Common Tools:

Clustal Omega
MUSCLE
MAFFT
T-Coffee

Key Concepts:

Conserved regions may indicate functional or structural importance.
Gaps are introduced to maximize alignment score across all sequences.
Phylogenetic trees can be built from MSA results.

Applications:

Protein family studies
Primer design
Phylogenetics
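
To make the idea of conserved regions concrete, the following sketch scans an existing alignment column by column. It assumes the sequences are already aligned (equal length, "-" for gaps) and uses a made-up alignment; `conserved_columns` and `conservation_line` are illustrative names, not functions of Clustal Omega, MUSCLE, MAFFT, or T-Coffee.

```python
def conserved_columns(aligned):
    """Return 0-based indices of columns where all sequences share one non-gap residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    return [
        i for i in range(length)
        if aligned[0][i] != "-" and len({seq[i] for seq in aligned}) == 1
    ]


def conservation_line(aligned):
    """Build a marker line with '*' under fully conserved columns."""
    conserved = set(conserved_columns(aligned))
    return "".join("*" if i in conserved else " " for i in range(len(aligned[0])))


if __name__ == "__main__":
    alignment = ["ACG-T", "ACGAT", "ATGAT"]   # made-up aligned sequences
    for seq in alignment:
        print(seq)
    print(conservation_line(alignment))
```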

Module 3

Sequence analysis:

Sequence analysis is a critical area in bioinformatics that enables researchers to interpret raw DNA and protein sequences to extract meaningful biological information. This module focuses on how to retrieve, evaluate, and analyze sequencing data, particularly Sanger sequencing reads, and how to identify important genetic elements such as open reading frames (ORFs) and regulatory motifs. It also covers sequence variations such as SNPs and the analysis of expressed sequence tags (ESTs).

1. Retrieval of Sequences

Accurate sequence retrieval from public databases is the first step in any analysis pipeline.

Sources:

NCBI GenBank: Nucleotide sequences
ENA: European Nucleotide Archive
UniProt: Protein sequences and functional annotations

File Formats:

FASTA: For raw sequences
FASTQ: For sequences with quality scores
GenBank: Annotated format with feature tables

Tools: NCBI Entrez, EBI’s ENA browser, UniProt sequence fetcher

2. Sequence Quality Assessment

Before any downstream analysis, the quality of the sequence must be checked and filtered.

Parameters Checked:

Base quality scores (Q scores)
Ambiguous bases (N’s)
Read length and overall sequence coverage

Tools:

Phred: Quality scoring system for Sanger reads
FastQC: For next-generation sequencing data (can also assess Sanger files)
TrimGalore/Cutadapt: For trimming low-quality regions

Outcome: High-quality reads suitable for reliable assembly and annotation.
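
A Phred quality score Q is related to the estimated base-calling error probability P by Q = -10 log10(P), so Q20 means a 1 in 100 chance of error. The sketch below decodes a FASTQ quality string and trims a low-quality 3' tail. It assumes the Phred+33 character encoding, and the trimming rule is a deliberately simple one for illustration; it is not the algorithm used by Cutadapt or TrimGalore.

```python
def phred_scores(quality_string, offset=33):
    """Convert FASTQ quality characters to integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Error probability implied by a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def trim_low_quality_tail(sequence, quality_string, min_q=20, offset=33):
    """Remove trailing bases whose quality is below min_q (simple illustrative rule)."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_string[:end]


if __name__ == "__main__":
    seq, qual = "ACGTAC", "IIII##"   # made-up read: 'I' encodes Q40, '#' encodes Q2
    print(phred_scores(qual))
    print(trim_low_quality_tail(seq, qual))
```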

3. Assembly and Annotation of Sanger Sequencing Reads

Sanger reads are typically short and require assembly to form longer contiguous sequences (contigs).

a. Sequence Assembly

Types:
- De novo assembly: Without a reference
- Reference-guided assembly: Using known templates
Tools: CAP3, Phrap, Geneious, BioEdit

b. Sequence Annotation

Annotation involves identifying functional elements such as:

Genes and coding regions
Promoters and regulatory elements
Exons and introns

Tools: Artemis, NCBI ORF Finder, RAST server (for prokaryotes)

4. Identification of Cis-Acting Regulatory Elements

Cis-elements are short DNA sequences found near genes that regulate their expression.

Common Elements:

TATA box: Core promoter
CAAT box, GC box
Enhancer/silencer sequences

Databases and Tools:

PLACE and PlantCARE: Plant regulatory element databases
TRANSFAC and JASPAR: Transcription factor binding site prediction
MEME Suite: Motif discovery tool

Application: Identifying transcriptional control regions upstream of genes.
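
A simple way to look for a known cis-element is a pattern search over the upstream sequence. The sketch below assumes the commonly cited TATA box consensus TATA(A/T)A(A/T) and a made-up promoter fragment. It is plain pattern matching, not the position weight matrix scoring used by resources such as JASPAR or the motif discovery performed by the MEME Suite.

```python
import re

# Assumed consensus for illustration: TATA, then A or T, then A, then A or T.
# The lookahead lets overlapping occurrences be reported as well.
TATA_PATTERN = re.compile(r"(?=(TATA[AT]A[AT]))")


def find_motif(sequence, pattern=TATA_PATTERN):
    """Return (0-based start, matched text) for every occurrence of the pattern."""
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    upstream = "GGCTATAAAAGGC"   # made-up upstream fragment
    for start, hit in find_motif(upstream):
        print(f"motif {hit} at position {start + 1}")
```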

5. ORF (Open Reading Frame) Finding

ORFs are potential protein-coding regions within a DNA sequence.

Key Features:

Begin with a start codon (ATG)
End with a stop codon (TAA, TAG, TGA)
Located in correct reading frame without interruptions

Tools:

NCBI ORF Finder
GeneMark
ExPASy Translate Tool

Output: Predicted coding regions with possible protein translations.
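
The key features listed above translate directly into a scanning procedure. The sketch below is a minimal illustration that assumes the standard genetic code's stop codons, ATG as the only start codon, and an unambiguous A/C/G/T sequence. It reports the first ATG after each stop in every frame on both strands, and it is not the algorithm used by NCBI ORF Finder or GeneMark.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=3):
    """Return (strand, frame, start, end, orf) tuples.

    start and end are 0-based, end-exclusive coordinates on the scanned strand;
    the ORF includes its stop codon. min_codons counts start and stop codons.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        results.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None                   # close it at the stop codon
    return results


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATGACC"):   # made-up sequence
        print(orf)
```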

6. Signal Sequences in DNA and Proteins

Signal sequences direct the cellular localization or secretion of proteins.

a. In DNA:

Promoter regions
Ribosome binding sites
Polyadenylation signals

b. In Proteins:

N-terminal signal peptides (for secretory pathways)
Transmembrane domains

Tools:

SignalP: Predicts signal peptides in protein sequences
TargetP: Predicts subcellular localization
TMHMM: Predicts transmembrane regions

7. Data Analysis Tools for SNP and ESTs

a. SNP (Single Nucleotide Polymorphism) Analysis

SNPs are single-base changes in DNA that can affect traits or disease susceptibility.

Databases and Tools:

dbSNP (NCBI): Public SNP database
SNPdat, GATK, VCFtools: For annotation and analysis

Applications:

Marker-assisted selection
Genotyping and population studies

b. EST (Expressed Sequence Tag) Analysis

ESTs are short sequences from transcribed mRNA, used to identify gene transcripts.

Uses:

Gene discovery and annotation
Identification of tissue-specific expression

Tools:

UniGene: Clusters ESTs into genes (retired by NCBI in 2019; relevant mainly for legacy datasets)
ESTScan: Predicts coding regions in ESTs
blastx: Matches ESTs to protein sequences

Module 4

Phylogeny and evolution:

Understanding the evolutionary history of organisms and genes is a central task in bioinformatics. This module introduces the concepts and computational techniques used to infer evolutionary relationships, estimate divergence times, and construct phylogenetic trees. It also covers the principles of evolutionary theory, species classification, and statistical validation of evolutionary hypotheses.

1. Evolution of Genomes

Genome evolution is a dynamic process influenced by mutation, recombination, gene duplication, and horizontal gene transfer.

Key Concepts:

Gene gain and loss
Expansion of gene families
Mobile genetic elements
Genome rearrangements and size variation

Bioinformatics Relevance:
Comparative genomics and alignment of genome sequences reveal patterns of genome evolution and divergence.

2. Basic Forces of Evolution

Evolution is driven by several key forces:

Mutation: Source of genetic variation
Natural Selection: Favors advantageous traits
Genetic Drift: Random changes in allele frequencies, especially in small populations
Gene Flow: Exchange of genes between populations
Recombination: Shuffling of genes during reproduction

These forces shape the genetic structure and diversity of populations over time.

3. Variation and Divergence of Populations

Populations evolve and diverge due to geographic isolation, genetic drift, and environmental selection.

Important Measures:

Allele frequency changes
Genetic distance (e.g., Nei’s distance)
F_ST values (measure of population differentiation)

Applications: Used in population genetics, conservation biology, and epidemiological studies.
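
For a single locus with two alleles, F_ST can be computed from allele frequencies as F_ST = (H_T - H_S) / H_T, where H_S is the mean expected heterozygosity within subpopulations and H_T is the expected heterozygosity of the pooled population. The sketch below assumes equally sized subpopulations and made-up allele frequencies; the function names are illustrative.

```python
def heterozygosity(p):
    """Expected heterozygosity 2pq for a biallelic locus with allele frequency p."""
    return 2 * p * (1 - p)


def fst_biallelic(freqs):
    """F_ST for one biallelic locus.

    freqs holds the frequency of the same allele in each subpopulation;
    subpopulations are assumed to be of equal size.
    """
    h_s = sum(heterozygosity(p) for p in freqs) / len(freqs)   # within populations
    p_bar = sum(freqs) / len(freqs)
    h_t = heterozygosity(p_bar)                                # pooled population
    if h_t == 0:
        return 0.0   # locus is fixed everywhere: no differentiation to measure
    return (h_t - h_s) / h_t


if __name__ == "__main__":
    print(fst_biallelic([0.8, 0.2]))   # made-up frequencies in two populations
```

Values near 0 indicate little differentiation between subpopulations, while values near 1 indicate that they are fixed for different alleles.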

4. Estimation of Divergence Time

Estimating when two species or genes diverged helps in understanding evolutionary timelines.

Methods:

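Molecular Clock Hypothesis: Assumes a roughly constant rate of substitution over time
Synonymous vs. non-synonymous substitution rates: Synonymous changes are less affected by selection and are therefore often preferred for dating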
Calibration with fossil records or known divergence events

Tools: MEGA, BEAST, TimeTree
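
Under a strict molecular clock, divergence time can be approximated as T = d / (2r), where d is the corrected genetic distance between two sequences and r is the substitution rate per site per year along each lineage. The sketch below uses the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), as one simple choice of model. The sequences and the rate are made-up values for illustration, and this is not how MEGA or BEAST estimate divergence times.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences (gap columns skipped)."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion of differences p."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def divergence_time(distance, rate_per_site_per_year):
    """T = d / (2r): the factor 2 accounts for change along both lineages."""
    return distance / (2 * rate_per_site_per_year)


if __name__ == "__main__":
    a = "ACGTACGTAC"           # made-up aligned sequences
    b = "ACGTACGTAT"
    d = jukes_cantor(p_distance(a, b))
    rate = 1e-8                # hypothetical rate, substitutions per site per year
    print(d, divergence_time(d, rate))
```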

5. Phylogenetic Species Concept

Unlike traditional species definitions based on morphology, the phylogenetic species concept defines species as the smallest group sharing a common ancestor and diagnosable by unique traits.

Importance in Bioinformatics:
Helps define taxa based on genetic data and evolutionary lineage rather than phenotype alone.

6. Phylogenetic Trees and Cladistics

Phylogenetic Trees are graphical representations of evolutionary relationships.

Types:

Rooted trees: Show common ancestor and direction of evolution
Unrooted trees: Show relationships without time direction

Cladistics:

Method to classify organisms based on shared derived traits (synapomorphies)
Yields cladograms, a type of phylogenetic tree

7. Concepts of Monophyly, Paraphyly, and Polyphyly

Concept	Description	Example
Monophyly	A group containing an ancestor and all its descendants	Mammals
Paraphyly	A group with a common ancestor but not all descendants	Reptiles (excluding birds)
Polyphyly	A group whose members descend from different ancestors, with their most recent common ancestor excluded	Marine mammals (e.g., whales + seals)

Correct interpretation ensures accurate evolutionary classification.

8. Phylogenetic Tree Reconstruction Methods

A. Distance-Based Methods

These use genetic distance matrices to build trees:

UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
- Assumes constant rate of evolution (molecular clock)
- Produces rooted trees
Neighbour-Joining (NJ)
- Does not assume a constant rate
- More flexible and widely used for large datasets

Tools: MEGA, Phylip, PAUP*
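
The UPGMA procedure can be sketched as repeated merging of the two closest clusters, with the distance to a merged cluster taken as the size-weighted average of the distances to its parts. The example below is a minimal illustration with made-up distances; it returns only the tree topology as nested tuples (no branch lengths) and is not the implementation used in MEGA, Phylip, or PAUP*.

```python
from itertools import combinations


def upgma(labels, distances):
    """Cluster labels by UPGMA.

    distances maps frozenset({a, b}) -> distance for every pair of labels.
    Returns the topology as nested tuples, e.g. ("C", ("A", "B")).
    """
    clusters = {label: 1 for label in labels}   # cluster -> number of leaves
    d = dict(distances)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the new cluster: average weighted by cluster sizes.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


if __name__ == "__main__":
    # Made-up pairwise distances between four hypothetical taxa.
    pairs = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6,
             ("A", "D"): 10, ("B", "D"): 10, ("C", "D"): 10}
    dist = {frozenset(k): v for k, v in pairs.items()}
    print(upgma(["A", "B", "C", "D"], dist))
```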

B. Character-Based Methods

These examine each position in sequence alignment individually.

Maximum Parsimony (MP)
- Chooses the tree requiring the fewest evolutionary changes
- Simpler, but sensitive to homoplasy (shared traits not from common ancestry)
Maximum Likelihood (ML)
- Evaluates likelihood of a tree given a specific model of sequence evolution
- More accurate but computationally intensive
Bayesian Inference
- Uses probability distributions to estimate tree credibility
- Employs Markov Chain Monte Carlo (MCMC) simulations

Software: PhyML (ML), MrBayes (Bayesian), RAxML (ML), TNT (MP)

9. Tree Comparison and Statistical Validation

Once trees are constructed, they must be evaluated for reliability and accuracy.

Methods and Tools:

Bootstrap Analysis: Resampling method to assess tree branch support (100–1000 replicates)
Jackknife: Similar to bootstrapping, but each replicate deletes a fraction of the data and resamples without replacement
Likelihood Ratio Test (LRT): Compares likelihood of different tree models
Tree Comparisons: Tools like TreeCompare or Dendroscope compare topologies
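
Nonparametric bootstrapping resamples alignment columns with replacement and repeats the analysis on each pseudo-replicate. The sketch below shows the resampling step and, as a stand-in for full tree reconstruction, counts how often each pair of sequences is the closest pair. The alignment is made up, the "closest pair" statistic is a simplification chosen for brevity, and the function names are illustrative.

```python
import random
from collections import Counter
from itertools import combinations


def p_distance(a, b):
    """Proportion of differing positions between two aligned sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def bootstrap_columns(alignment, rng):
    """Resample alignment columns with replacement, keeping the original length."""
    length = len(next(iter(alignment.values())))
    picks = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def closest_pair_support(alignment, replicates=100, seed=0):
    """Fraction of bootstrap replicates in which each pair is the closest pair."""
    rng = random.Random(seed)   # fixed seed so the run is reproducible
    counts = Counter()
    for _ in range(replicates):
        sample = bootstrap_columns(alignment, rng)
        pair = min(
            combinations(sorted(sample), 2),
            key=lambda names: p_distance(sample[names[0]], sample[names[1]]),
        )
        counts[pair] += 1
    return {pair: n / replicates for pair, n in counts.items()}


if __name__ == "__main__":
    aln = {"A": "ACGTACGTAC", "B": "ACGTACGTAT", "C": "TCGAACCTAT"}   # made-up data
    print(closest_pair_support(aln, replicates=200))
```

In a real analysis a tree is built from every replicate, and the support for a branch is the percentage of replicate trees that contain it.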

10. Parametric Bootstrapping

A rigorous statistical test used to validate phylogenetic hypotheses by simulating datasets based on an estimated model.

Steps:

Estimate model and construct original tree
Simulate data under that model
Reconstruct trees from simulated data
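Compare the test statistic obtained from the real data (e.g., a likelihood difference between competing trees) with its distribution across the simulated datasets

Interpretation: If the observed statistic falls in the extreme tail of the simulated distribution, the hypothesis used for the simulation is rejected; otherwise the data are consistent with it.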

11. Limitations in Phylogenetic Analysis

Model selection biases (wrong substitution model)
Long-branch attraction (MP particularly susceptible)
Incomplete lineage sorting or horizontal gene transfer
Computational complexity with large datasets
Unequal evolutionary rates may mislead tree topology

Module 5

Practical:

Hands-On Bioinformatics: Practical Applications and Analysis

This module emphasizes practical skills in bioinformatics by guiding students through essential computational exercises. Learners will explore how to retrieve biological sequences, analyze them using various tools, and interpret the results for applications in molecular biology, genetics, and evolutionary studies.

1. Sequence Retrieval from Databases

Students will learn to access and download nucleotide and protein sequences from globally recognized biological databases.

Steps:

Go to NCBI (https://www.ncbi.nlm.nih.gov/) or UniProt (https://www.uniprot.org/)
Use gene name, organism, accession number, or keywords in the search bar.
Filter results using advanced search options (e.g., organism, molecule type, sequence length).
Download in FASTA format for downstream analysis.

Other Resources:

BOLD (Barcode of Life Data Systems): For species identification and DNA barcoding
DDBJ and EMBL-EBI: Additional sequence repositories

2. Refining Search Criteria

Efficient data mining requires refining database queries to retrieve the most relevant sequences.

Search Modifiers:

Use Boolean operators (AND, OR, NOT)
Apply filters (e.g., “complete genome,” “mitochondrial DNA”)
Specify database (e.g., RefSeq, TSA, EST)
Limit results by taxonomy, molecule type, or publication date

Outcome: Accurate and narrowed-down sequence results
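
The same Boolean logic can be written as a query string and sent to NCBI programmatically through the E-utilities `esearch` endpoint. The sketch below only builds the query and the URL and does not contact NCBI. The organism and gene in the example are illustrative choices, and `build_term` and `esearch_url` are hypothetical helper names.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(organism, gene=None, exclude_title_words=()):
    """Combine search fields with Boolean operators into an Entrez query string."""
    parts = [f'"{organism}"[Organism]']
    if gene:
        parts.append(f"{gene}[Gene]")
    term = " AND ".join(parts)
    for word in exclude_title_words:
        term += f" NOT {word}[Title]"
    return term


def esearch_url(term, db="nucleotide", retmax=20):
    """Return the esearch URL for a query; the request itself is left to the reader."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


if __name__ == "__main__":
    term = build_term("Cyprinus carpio", gene="COI", exclude_title_words=["partial"])
    print(term)
    print(esearch_url(term))
```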

3. Sequence Submission to Databases (NCBI GenBank / BOLD)

Understanding how to submit original sequences to databases is essential for scientific transparency and data sharing.

A. GenBank Submission via NCBI:

Use BankIt or the NCBI Submission Portal (the standalone Sequin tool named in older guides has been discontinued by NCBI)
Prepare the sequence in FASTA format
Annotate sequence with metadata (organism, gene name, product, source)
Upload using an NCBI account
Track submission status and receive accession number

B. BOLD Submission:

Targeted for DNA barcoding sequences (e.g., COI gene)
Requires taxonomic verification and geographic metadata
Upload through the BOLD Workbench

4. Pairwise Sequence Alignment (BLAST)

BLAST (Basic Local Alignment Search Tool) compares a query sequence to a database to find regions of local similarity.

Procedure:

Navigate to BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
Select appropriate program: blastn, blastp, blastx, etc.
Paste query or upload FASTA file
Choose database (e.g., nr, Swiss-Prot)
Adjust parameters (e.g., scoring matrix, word size, gap penalties)
Submit and analyze output

Interpreting Output:

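Score and E-value: Indicate statistical significance; a higher score and a lower E-value mean the match is less likely to have arisen by chance
% Identity: Percentage of identical residues within the aligned region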
Alignment: Shows matched regions and mismatches
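
Percent identity can be recomputed by hand from any pairwise alignment as identical positions divided by alignment length. The sketch below assumes two already aligned strings of equal length with "-" marking gaps, counts gap columns in the alignment length, and uses a made-up alignment; `percent_identity` is an illustrative name rather than a BLAST function.

```python
def percent_identity(aligned_a, aligned_b):
    """Identical positions as a percentage of alignment length (gap columns included)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Ignore columns that are gaps in both sequences; they carry no information.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if not (a == "-" and b == "-")]
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    query = "ACGT-A"     # made-up aligned query and subject
    subject = "ACCTGA"
    print(round(percent_identity(query, subject), 1))
```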

5. Multiple Sequence Alignment (Clustal Omega / ClustalW)

MSA helps identify conserved regions, motifs, and evolutionary relationships among sequences.

Steps:

Visit Clustal Omega (https://www.ebi.ac.uk/Tools/msa/clustalo/) or use ClustalW
Input three or more sequences in FASTA format
Run alignment with default or custom parameters
Download alignment results

Applications:

Identify functional domains
Guide primer design
Build phylogenetic trees

6. Identification of Open Reading Frames (ORFs)

ORFs are regions likely to encode proteins and are critical for gene annotation.

Tools:

NCBI ORF Finder (https://www.ncbi.nlm.nih.gov/orffinder/)
ExPASy Translate Tool
GeneMark (for gene prediction)

Steps:

Paste DNA sequence
Select genetic code and frame
View predicted ORFs and start/stop codons

7. Primer Designing

Designing primers is essential for PCR, sequencing, and gene cloning.

Tool: Primer3 (https://primer3.ut.ee/)

Input Requirements:

Target sequence (FASTA format)
Desired product size
Melting temperature (Tm) range
GC content range

Output:

Forward and reverse primer sequences
Tm, GC%, self-complementarity
Option to avoid secondary structures
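
The GC% and Tm values in the output can be checked roughly by hand. The sketch below uses the Wallace rule, Tm = 2(A+T) + 4(G+C), which is a quick approximation for short oligonucleotides; Primer3 itself uses a more detailed thermodynamic model, so its values will differ. The primer sequence and the acceptance ranges are made-up examples.

```python
def gc_percent(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Approximate melting temperature in degrees C by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def passes_filters(primer, tm_range=(55, 65), gc_range=(40, 60)):
    """Check a primer against hypothetical Tm and GC% acceptance ranges."""
    tm_ok = tm_range[0] <= wallace_tm(primer) <= tm_range[1]
    gc_ok = gc_range[0] <= gc_percent(primer) <= gc_range[1]
    return tm_ok and gc_ok


if __name__ == "__main__":
    primer = "ATGCATGCATGCATGCATGC"   # made-up 20-mer
    print(gc_percent(primer), wallace_tm(primer), passes_filters(primer))
```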

8. Restriction Site Identification

Identifying restriction enzyme sites is crucial in recombinant DNA techniques.

Tools:

NEBcutter (https://nc3.neb.com/NEBcutter/)
SnapGene Viewer (desktop tool)
Biology Workbench online tools

Steps:

Paste DNA sequence
Select enzymes from list or database
View restriction map and cut positions
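
The search step can be sketched as a plain string search for recognition sequences. The example below includes three well-known enzymes (EcoRI, G^AATTC; BamHI, G^GATCC; HindIII, A^AGCTT), treats the DNA as linear, and considers only the top strand, which is sufficient for these palindromic sites. The test sequence is made up, and the code is not how NEBcutter works internally.

```python
# Recognition site and top-strand cut offset (bases from the start of the site).
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def find_sites(seq, site):
    """Return 0-based start positions of every occurrence of a recognition site."""
    seq = seq.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)
    return positions


def digest(seq, enzyme_names):
    """Cut a linear sequence with the named enzymes and return the fragments."""
    cuts = sorted({
        pos + ENZYMES[name][1]
        for name in enzyme_names
        for pos in find_sites(seq, ENZYMES[name][0])
    })
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    dna = "TTGAATTCAAGGATCCTT"   # made-up sequence with one EcoRI and one BamHI site
    for name, (site, _) in ENZYMES.items():
        print(name, [p + 1 for p in find_sites(dna, site)])   # 1-based positions
    print([len(f) for f in digest(dna, ["EcoRI", "BamHI"])])
```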

9. Plasmid Map Drawing

Plasmid maps are essential for illustrating recombinant DNA constructs.

Software Tools:

SnapGene (Viewer) – Free for viewing and basic editing
Geneious (commercial) and ApE (A plasmid Editor; free) – Alternatives with fuller editing features

Features:

Annotate features (origin, antibiotic resistance, MCS)
Insert genes or tags
Display direction of transcription
Export maps as images or GenBank files

10. Protein Structure Prediction

Predicting 3D protein structures provides insight into function and interaction.

Tools:

SWISS-MODEL (https://swissmodel.expasy.org/)
Phyre2 (http://www.sbg.bio.ic.ac.uk/phyre2/)
AlphaFold Protein Structure Database

Steps:

Input protein sequence (FASTA)
Select template (automated or manual)
Generate 3D model
Download PDB files for visualization in PyMOL or Chimera

11. Phylogenetic Tree Construction

Phylogenetic trees represent evolutionary relationships among sequences.

Software Options:

MEGA (Molecular Evolutionary Genetics Analysis)
MrBayes (Bayesian inference)
Phylip and PAUP* (Parsimony and distance methods)

Procedure (Using MEGA):

Import aligned sequences (e.g., from ClustalW)
Choose method: Neighbor-Joining, UPGMA, Maximum Parsimony, or Maximum Likelihood
Customize options: bootstrap replicates, substitution model
Generate and visualize tree

Bayesian Workflow (MrBayes):

Convert alignment to NEXUS format
Set model parameters (e.g., GTR+G)
Run MCMC simulations
Analyze posterior probabilities
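
The first step of this workflow, converting an alignment to NEXUS, can be sketched as follows. The example writes only a basic NEXUS data block for DNA; the taxon names and sequences are made up, `to_nexus` is an illustrative helper, and dedicated converters or alignment editors are normally used for real data.

```python
def to_nexus(alignment):
    """Format an alignment (dict of name -> aligned DNA sequence) as a NEXUS data block."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    nchar = lengths.pop()
    width = max(len(name) for name in alignment)
    lines = [
        "#NEXUS",
        "begin data;",
        f"  dimensions ntax={len(alignment)} nchar={nchar};",
        "  format datatype=dna missing=? gap=-;",
        "  matrix",
    ]
    for name, seq in alignment.items():
        lines.append(f"  {name.ljust(width)}  {seq}")
    lines += ["  ;", "end;"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    aln = {"taxon_A": "ACGTAC", "taxon_B": "ACG-AC", "taxon_C": "ACGTAT"}   # made-up
    print(to_nexus(aln))
```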

12. Phylogenetic Tree Interpretation and Statistical Support

Tree Elements:

Branches: Represent evolutionary lineages
Nodes: Common ancestors
Branch Lengths: Proportional to evolutionary distance
Bootstrap Values / Posterior Probabilities: Support for each branch

Interpretation Tips:

Look for clusters (clades) with high support values
Use outgroups to root trees
Evaluate evolutionary relationships based on topology and branch lengths

Name of course instructor: Dr. Mohd Ashraf Rather, Assistant Professor, Division of Fish Genetics and Biotechnology, SKUAST-Kashmir

Email : mashraf38@skuastkashmir.ac.in