
Course Name: Bioinformatics Tools for Fisheries (FBT 505) (Complete e-course content) Credit: 2(1+1)

Module 1

Basics of Bioinformatics

Introduction to Bioinformatics Resources and Tools

Bioinformatics is an interdisciplinary field that combines biology, computer science, and information technology to analyze and interpret biological data. This module introduces the fundamental resources, platforms, databases, and software tools used in bioinformatics, laying the foundation for more advanced computational analysis in life sciences.


1. Overview of Bioinformatics Resources

Bioinformatics resources are essential tools for managing the massive amount of data generated by genomic, transcriptomic, proteomic, and structural biology research. These resources include:

  • Databases (primary and secondary)
  • Software tools for sequence alignment, visualization, and molecular modeling
  • Web-based platforms for genome browsing, annotation, and protein structure prediction



Applications:

  • Gene and protein identification
  • Phylogenetic analysis
  • Molecular docking and drug discovery
  • Structural and functional annotation

2. Platforms for Bioinformatics Analysis

Bioinformatics tools can run on various operating systems. Understanding the environment is crucial for effective tool deployment and scripting.

a. Linux-Based Platforms:

  • Most bioinformatics tools are developed for Unix/Linux due to its open-source flexibility and scripting power.
  • Popular distributions: Ubuntu, CentOS, Debian
  • Common command-line tools and scripting languages: grep, awk, sed, bash, python, perl

b. Windows-Based Platforms:

  • GUI-based tools (e.g., MEGA, Geneious)
  • Compatibility layer tools like WSL (Windows Subsystem for Linux) are increasingly used

3. Bioinformatics Software Tools

These are used for analyzing DNA, RNA, and protein data:

Task                             Software/Tool
Sequence Alignment               BLAST, Clustal Omega
Phylogenetic Tree Construction   MEGA, PhyML
Protein Structure Prediction     SWISS-MODEL, Phyre2
Data Visualization               Jalview, UCSF Chimera

Most of these tools are available online or as downloadable packages for local installations.


4. Biological Databases

Databases are categorized into primary and secondary types:

a. Primary Databases

  • Nucleotide Sequence Databases:
    • GenBank (NCBI): Comprehensive database of publicly available DNA sequences
    • ENA (European Nucleotide Archive, hosted by EMBL-EBI): European counterpart to GenBank
    • DDBJ: DNA Data Bank of Japan
  • Protein Sequence Databases:
    • UniProtKB: Central resource for protein sequence and annotation data
    • PIR: Protein Information Resource

b. Secondary Databases

Secondary databases derive information by curating and analyzing primary data.

  • Examples:
    • Pfam: Protein families and domains
    • InterPro: Integrated resource of protein signatures

c. Structure Databases

These house 3D structures of biomolecules.

  • PDB (Protein Data Bank): Repository for 3D structural data of proteins and nucleic acids.
  • SCOP and CATH: Classification of protein structures

5. Analysis Packages

These combine multiple tools or scripts into workflows:

  • EMBOSS: Suite of command-line tools for sequence analysis
  • Bioconductor: R-based platform for genomics data
  • Galaxy: Web-based platform enabling users to run workflows without command-line knowledge
  • Geneious: Commercial package with integrated GUI for sequence editing, alignment, and annotation


Module 2

Sequence Alignment:

1. Introduction to Sequence Alignment

Sequence alignment is the process of arranging sequences to identify regions of similarity. These similarities may indicate functional, structural, or evolutionary relationships.

There are two main types of sequence alignment:

  • Pairwise Alignment: Comparison between two sequences.
  • Multiple Sequence Alignment (MSA): Simultaneous alignment of three or more sequences.

2. Dot Matrix Method

The dot matrix is a graphical method used for preliminary sequence comparison.

Key Features:

  • Plots one sequence on the X-axis and another on the Y-axis.
  • Dots are placed where residues (nucleotides/amino acids) match.
  • Diagonal lines indicate regions of high similarity or identity.

Uses:

  • Quick visualization of repetitive sequences.
  • Identification of insertions, deletions, or inversions.

Tools: Dotlet, EMBOSS dotmatcher
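
The dot matrix idea described above can be sketched in a few lines of Python. This is a minimal illustration only, not how Dotlet or dotmatcher are implemented; the function name and the two toy sequences are assumptions made for the example.

```python
def dot_matrix(seq_x, seq_y):
    """Return one text row per residue of seq_y; columns follow seq_x.

    A '*' marks a position where the two residues match, '.' a mismatch.
    """
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


# Hypothetical toy sequences, chosen only for illustration.
for row in dot_matrix("GATTACA", "GATCACA"):
    print(row)
```

Runs of '*' along a diagonal correspond to the regions of similarity mentioned in the key features. Real tools usually add a sliding window and a threshold to suppress random matches.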


3. Scoring Matrices: PAM and BLOSUM

Scoring matrices are essential in aligning protein sequences to assess substitution likelihoods.

a. PAM (Point Accepted Mutation) Matrix

  • Based on evolutionary models of accepted mutations.
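  • PAM1 corresponds to about 1% accepted change (one accepted point mutation per 100 residues); higher-numbered matrices such as PAM250 are extrapolated from it for more divergent sequences.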
  • Assumes a common ancestor and slow mutation rates.

b. BLOSUM (Blocks Substitution Matrix)

  • Derived from conserved sequence blocks (e.g., in protein domains).
  • BLOSUM62 is the most commonly used matrix.
  • Better suited for aligning sequences with varying levels of similarity.

Matrix     Suitable For
PAM250     Distantly related proteins
BLOSUM62   Moderately similar proteins
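
Substitution matrix entries are log-odds scores: they compare how often two residues are observed aligned with how often that pairing would be expected by chance. The sketch below assumes the half-bit scaling (2 × log2 of the odds ratio, rounded) conventionally associated with BLOSUM62. The frequencies passed in are made-up numbers, not values taken from any real matrix.

```python
import math


def log_odds_score(pair_freq, freq_a, freq_b):
    """Half-bit log-odds score: round(2 * log2(observed / expected)).

    pair_freq : observed frequency of the aligned residue pair
    freq_a/b  : background frequencies of the two residues
    """
    expected = freq_a * freq_b
    return round(2 * math.log2(pair_freq / expected))


# Hypothetical frequencies: a pair seen four times more often than chance.
print(log_odds_score(0.04, 0.1, 0.1))
```

A positive score means the pair occurs more often than expected by chance (a likely conservative substitution), and a negative score means it occurs less often.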

4. Sequence Retrieval from Online Databases

To perform alignments, sequences are often retrieved from public databases.

Steps:

  1. Access NCBI (https://www.ncbi.nlm.nih.gov) or UniProt (https://www.uniprot.org).
  2. Use accession numbers, gene/protein names, or organism filters.
  3. Download sequences in FASTA format for alignment.

Databases:

  • GenBank: Nucleotide sequences
  • UniProtKB: Protein sequences and annotations
  • RefSeq: Curated reference sequences
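
Once a FASTA file has been downloaded, it usually has to be read into a program before alignment. The following is a minimal sketch of a FASTA reader in plain Python; the function name and the two example records are assumptions for illustration, and libraries such as Biopython offer more robust parsers.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}.

    The identifier is the first word after '>' on each header line.
    """
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            header = fields[0] if fields else ""
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical records, not real database entries.
example = """>seq1 hypothetical record
ATGGCC
TTAA
>seq2 another hypothetical record
GGGCCC
"""
print(parse_fasta(example))
```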

5. Pairwise Sequence Alignment Using BLAST

BLAST (Basic Local Alignment Search Tool) is a widely used tool for comparing a query sequence with a database.



Types of BLAST:

  • blastn: Nucleotide vs. nucleotide
  • blastp: Protein vs. protein
  • blastx: Translated nucleotide vs. protein
  • tblastn: Protein vs. translated nucleotide

Figure 2: Types of BLAST

Steps to Run BLAST:

  1. Paste query sequence or upload a file.
  2. Select appropriate BLAST type.
  3. Choose database (e.g., nr, Swiss-Prot).
  4. Adjust parameters (e.g., matrix, gap penalties).
  5. View alignment results with % identity, E-value, and score.

Applications:

  • Gene annotation
  • Homology search
  • Ortholog/paralog identification
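
BLAST is a fast heuristic, but the local alignment score it approximates can be computed exactly with the Smith-Waterman dynamic programming algorithm. The sketch below implements Smith-Waterman, not BLAST itself. The match, mismatch, and gap values are arbitrary illustrative choices, and a linear gap penalty is assumed.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)   # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical toy sequences sharing the motif "ACG".
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```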

6. Multiple Sequence Alignment (MSA)

MSA is used to align three or more sequences simultaneously to identify conserved regions, motifs, or evolutionary patterns.

Common Tools:

  • Clustal Omega
  • MUSCLE
  • MAFFT
  • T-Coffee

Key Concepts:

  • Conserved regions may indicate functional or structural importance.
  • Gaps are introduced to maximize alignment score across all sequences.
  • Phylogenetic trees can be built from MSA results.

Applications:

  • Protein family studies
  • Primer design
  • Phylogenetics
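
Conserved regions in an MSA can be located by scoring each alignment column. The sketch below uses one simple measure, the fraction of sequences that share the most common non-gap residue; this is an illustrative choice, not the scoring used by Clustal Omega or the other tools listed. The toy alignment is hypothetical.

```python
from collections import Counter


def column_conservation(alignment):
    """Per-column fraction of sequences sharing the most common residue.

    alignment : list of equal-length aligned sequences ('-' marks a gap)
    """
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n_seqs)
    return scores


# Hypothetical three-sequence alignment.
toy_alignment = ["ATG-CA", "ATGTCA", "ACG-CA"]
print(column_conservation(toy_alignment))
```

Columns scoring 1.0 are fully conserved; low scores flag variable or gap-rich positions.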








Module 3

Sequence analysis:

Sequence analysis is a critical area in bioinformatics that enables researchers to interpret raw DNA and protein sequences to extract meaningful biological information. This module focuses on how to retrieve, evaluate, and analyze sequencing data, particularly Sanger sequencing reads, and how to identify important genetic elements such as open reading frames (ORFs) and regulatory motifs. It also covers the analysis of sequence variations such as SNPs and of expressed sequence tags (ESTs).

1. Retrieval of Sequences

Accurate sequence retrieval from public databases is the first step in any analysis pipeline.

Sources:

  • NCBI GenBank: Nucleotide sequences
  • ENA: European Nucleotide Archive
  • UniProt: Protein sequences and functional annotations

File Formats:

  • FASTA: For raw sequences
  • FASTQ: For sequences with quality scores
  • GenBank: Annotated format with feature tables

Tools: NCBI Entrez, EBI’s ENA browser, UniProt sequence fetcher

2. Sequence Quality Assessment

Before any downstream analysis, the quality of the sequence must be checked and filtered.

Parameters Checked:

  • Base quality scores (Q scores)
  • Ambiguous bases (N’s)
  • Read length and overall sequence coverage

Tools:

  • Phred: Quality scoring system for Sanger reads
  • FastQC: For next-generation sequencing data (can also assess Sanger files)
  • TrimGalore/Cutadapt: For trimming low-quality regions

Outcome: High-quality reads suitable for reliable assembly and annotation.
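
A Phred quality score Q is related to the estimated base-calling error probability P by Q = -10 log10(P), so Q20 means a 1% chance of error. The sketch below decodes a FASTQ quality string, assuming the common Phred+33 ASCII encoding, and trims low-quality bases from the 3' end. The trimming rule is a deliberately simple illustration, not the algorithm used by TrimGalore or Cutadapt, and the function names are assumptions.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Convert a Phred score to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_low_quality_tail(sequence, quality_string, min_q=20):
    """Remove trailing bases whose Phred score is below min_q."""
    scores = phred_scores(quality_string)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: four high-quality bases followed by two poor ones.
print(trim_low_quality_tail("ACGTAC", "IIII#!"))
```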

3. Assembly and Annotation of Sanger Sequencing Reads

Individual Sanger reads are relatively short (typically several hundred to about 1,000 bases) and are assembled into longer contiguous sequences (contigs).

a. Sequence Assembly

  • Types:
    • De novo assembly: Without a reference
    • Reference-guided assembly: Using known templates
  • Tools: CAP3, Phrap, Geneious, BioEdit

b. Sequence Annotation

Annotation involves identifying functional elements such as:

  • Genes and coding regions
  • Promoters and regulatory elements
  • Exons and introns

Tools: Artemis, NCBI ORF Finder, RAST server (for prokaryotes)

4. Identification of Cis-Acting Regulatory Elements

Cis-elements are short DNA sequences found near genes that regulate their expression.

Common Elements:

  • TATA box: Core promoter
  • CAAT box, GC box
  • Enhancer/silencer sequences

Databases and Tools:

  • PLACE and PlantCARE: Plant regulatory element databases
  • TRANSFAC and JASPAR: Transcription factor binding site prediction
  • MEME Suite: Motif discovery tool

Application: Identifying transcriptional control regions upstream of genes.

5. ORF (Open Reading Frame) Finding

ORFs are potential protein-coding regions within a DNA sequence.

Key Features:

  • Begin with a start codon (ATG)
  • End with a stop codon (TAA, TAG, TGA)
  • Located in correct reading frame without interruptions

Tools:

  • NCBI ORF Finder
  • GeneMark
  • ExPASy Translate Tool

Figure 3: ORF Finding Tools


Output: Predicted coding regions with possible protein translations.
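
The three key features of an ORF translate directly into a small search routine. The sketch below scans only the three forward reading frames and uses the standard start and stop codons listed above. It is a simplified illustration, not the method used by NCBI ORF Finder or GeneMark, and the function name and `min_codons` parameter are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Find ATG...stop ORFs on the forward strand.

    Returns (start, end, frame) tuples with 0-based start and exclusive end.
    min_codons counts every codon in the ORF, including start and stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF
    return orfs


# Hypothetical sequence containing one short ORF (ATG AAA TAG).
print(find_orfs("CCATGAAATAGCC"))
```

A complete search would also scan the reverse complement to cover all six reading frames.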

6. Signal Sequences in DNA and Proteins

Signal sequences are short motifs in DNA or proteins that control gene expression or direct the cellular localization and secretion of proteins.

a. In DNA:

  • Promoter regions
  • Ribosome binding sites
  • Polyadenylation signals

b. In Proteins:

  • N-terminal signal peptides (for secretory pathways)
  • Transmembrane domains

Tools:

  • SignalP: Predicts signal peptides in protein sequences
  • TargetP: Predicts subcellular localization
  • TMHMM: Predicts transmembrane regions

7. Data Analysis Tools for SNP and ESTs

a. SNP (Single Nucleotide Polymorphism) Analysis

SNPs are single-base changes in DNA that can affect traits or disease susceptibility.

Databases and Tools:

  • dbSNP (NCBI): Public SNP database
  • SNPdat, GATK, VCFtools: For annotation and analysis

Applications:

  • Marker-assisted selection
  • Genotyping and population studies
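
At its simplest, SNP detection means comparing aligned sequences position by position. The sketch below reports single-base differences between two pre-aligned sequences. It is only an illustration of the idea; production tools such as GATK work from many reads and model sequencing error. The function name and toy sequences are assumptions.

```python
def find_snps(ref, alt):
    """List (position, ref_base, alt_base) for single-base differences.

    Both sequences must already be aligned to the same length.
    Positions are 1-based; columns containing a gap ('-') are skipped.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, r, a)
        for pos, (r, a) in enumerate(zip(ref, alt), start=1)
        if r != a and r != "-" and a != "-"
    ]


# Hypothetical aligned sequences differing at one site.
print(find_snps("ATGCCA", "ATGTCA"))
```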

b. EST (Expressed Sequence Tag) Analysis

ESTs are short sequences from transcribed mRNA, used to identify gene transcripts.

Uses:

  • Gene discovery and annotation
  • Identification of tissue-specific expression

Tools:

  • UniGene: Clusters ESTs into genes
  • ESTScan: Predicts coding regions in ESTs
  • blastx: Matches ESTs to protein sequences


Module 4

Phylogeny and evolution:

Understanding the evolutionary history of organisms and genes is a central task in bioinformatics. This module introduces the concepts and computational techniques used to infer evolutionary relationships, estimate divergence times, and construct phylogenetic trees. It also covers the principles of evolutionary theory, species classification, and statistical validation of evolutionary hypotheses.


1. Evolution of Genomes

Genome evolution is a dynamic process influenced by mutation, recombination, gene duplication, and horizontal gene transfer.

Key Concepts:

  • Gene gain and loss
  • Expansion of gene families
  • Mobile genetic elements
  • Genome rearrangements and size variation

Bioinformatics Relevance:
Comparative genomics and alignment of genome sequences reveal patterns of genome evolution and divergence.


2. Basic Forces of Evolution

Evolution is driven by several key forces:

  • Mutation: Source of genetic variation
  • Natural Selection: Favors advantageous traits
  • Genetic Drift: Random changes in allele frequencies, especially in small populations
  • Gene Flow: Exchange of genes between populations
  • Recombination: Shuffling of genes during reproduction


These forces shape the genetic structure and diversity of populations over time.

3. Variation and Divergence of Populations

Populations evolve and diverge due to geographic isolation, genetic drift, and environmental selection.

Important Measures:

  • Allele frequency changes
  • Genetic distance (e.g., Nei’s distance)
  • F_ST values (measure of population differentiation)

Applications: Used in population genetics, conservation biology, and epidemiological studies.
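
One common way to express F_ST is F_ST = (H_T - H_S) / H_T, where H_S is the mean expected heterozygosity within subpopulations and H_T is the expected heterozygosity of the pooled population. The sketch below assumes a single biallelic locus and equal-sized subpopulations. The allele frequencies are hypothetical.

```python
def expected_heterozygosity(p):
    """Expected heterozygosity 2pq for a biallelic locus with allele frequency p."""
    return 2 * p * (1 - p)


def fst(allele_freqs):
    """F_ST = (H_T - H_S) / H_T for equal-sized subpopulations.

    allele_freqs : frequency of one allele in each subpopulation
    """
    h_s = sum(expected_heterozygosity(p) for p in allele_freqs) / len(allele_freqs)
    p_bar = sum(allele_freqs) / len(allele_freqs)
    h_t = expected_heterozygosity(p_bar)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# Hypothetical allele frequencies in two subpopulations.
print(fst([0.2, 0.8]))
```

Values near 0 indicate little differentiation between subpopulations, while values near 1 indicate that they are fixed for different alleles.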

4. Estimation of Divergence Time

Estimating when two species or genes diverged helps in understanding evolutionary timelines.

Methods:

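  • Molecular Clock Hypothesis: Assumes an approximately constant rate of molecular change (substitutions) over time
  • Synonymous vs. non-synonymous substitutions: Synonymous changes are less affected by selection and are therefore often preferred for dating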
  • Calibration with fossil records or known divergence events

Tools: MEGA, BEAST, TimeTree
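
Under a strict molecular clock, divergence time follows from genetic distance as T = d / (2r), where d is the number of substitutions per site separating two sequences and r is the substitution rate per site per year along each lineage. The sketch below first corrects the observed proportion of differing sites with the Jukes-Cantor model, d = -(3/4) ln(1 - 4p/3). The choice of Jukes-Cantor and the example rate are assumptions for illustration; tools such as MEGA and BEAST use richer models.

```python
import math


def jukes_cantor_distance(p):
    """Substitutions per site estimated from the proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def divergence_time(distance, rate_per_site_per_year):
    """T = d / (2r): both lineages accumulate changes after the split."""
    return distance / (2 * rate_per_site_per_year)


# Hypothetical values: 10% observed differences, rate of 1e-8 per site per year.
d = jukes_cantor_distance(0.10)
print(d, divergence_time(d, 1e-8))
```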


5. Phylogenetic Species Concept

Unlike traditional species definitions based on morphology, the phylogenetic species concept defines species as the smallest group sharing a common ancestor and diagnosable by unique traits.

Importance in Bioinformatics:
Helps define taxa based on genetic data and evolutionary lineage rather than phenotype alone.


6. Phylogenetic Trees and Cladistics

Phylogenetic Trees are graphical representations of evolutionary relationships.

Types:

  • Rooted trees: Show common ancestor and direction of evolution
  • Unrooted trees: Show relationships without time direction

Cladistics:

  • Method to classify organisms based on shared derived traits (synapomorphies)
  • Yields cladograms, a type of phylogenetic tree

7. Concepts of Monophyly, Paraphyly, and Polyphyly

Concept     Description                                               Example
Monophyly   A group containing an ancestor and all its descendants    Mammals
Paraphyly   A group with a common ancestor but not all descendants    Reptiles (excluding birds)
Polyphyly   A group of unrelated organisms from different ancestors   Marine mammals (e.g., whales + seals)

Correct interpretation ensures accurate evolutionary classification.


8. Phylogenetic Tree Reconstruction Methods

A. Distance-Based Methods

These use genetic distance matrices to build trees:

  1. UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
    • Assumes constant rate of evolution (molecular clock)
    • Produces rooted trees
  2. Neighbour-Joining (NJ)
    • Does not assume a constant rate
    • More flexible and widely used for large datasets

Tools: MEGA, Phylip, PAUP*
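
UPGMA can be sketched compactly: repeatedly merge the two closest clusters and recompute distances to the new cluster as a size-weighted average. The version below returns only the tree topology in Newick-like form and omits branch lengths for brevity. The function signature and the toy distance matrix are assumptions for illustration.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster taxa with UPGMA and return the topology as a Newick-like string.

    distances : dict mapping (name_i, name_j) -> distance for every pair
    """
    sizes = {name: 1 for name in names}            # cluster label -> taxa count
    d = {frozenset(pair): value for pair, value in distances.items()}
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        merged = f"({a},{b})"
        new_size = sizes[a] + sizes[b]
        for other in sizes:
            if other in (a, b):
                continue
            # Average distance, weighted by the sizes of the merged clusters.
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / new_size
        del sizes[a], sizes[b]
        sizes[merged] = new_size
    return next(iter(sizes)) + ";"


# Hypothetical distance matrix for four taxa.
toy_distances = {
    ("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
    ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 10,
}
print(upgma(["A", "B", "C", "D"], toy_distances))
```

Neighbour-Joining follows a similar merge loop but selects pairs using a rate-corrected criterion, which is why it does not require a constant rate.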


B. Character-Based Methods

These examine each position in sequence alignment individually.

  1. Maximum Parsimony (MP)
    • Chooses the tree requiring the fewest evolutionary changes
    • Simpler, but sensitive to homoplasy (shared traits not from common ancestry)
  2. Maximum Likelihood (ML)
    • Evaluates likelihood of a tree given a specific model of sequence evolution
    • More accurate but computationally intensive
  3. Bayesian Inference
    • Uses probability distributions to estimate tree credibility
    • Employs Markov Chain Monte Carlo (MCMC) simulations

Software: PhyML (ML), MrBayes (Bayesian), RAxML (ML), TNT (MP)


9. Tree Comparison and Statistical Validation

Once trees are constructed, they must be evaluated for reliability and accuracy.

Methods and Tools:

  • Bootstrap Analysis: Resampling method to assess tree branch support (100–1000 replicates)
  • Jackknife: Similar to bootstrapping but excludes some data in each replicate
  • Likelihood Ratio Test (LRT): Compares likelihood of different tree models
  • Tree Comparisons: Tools like TreeCompare or Dendroscope compare topologies
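
The resampling step of a bootstrap analysis draws alignment columns at random with replacement until a pseudo-alignment of the original length is obtained. The sketch below shows only that step; each replicate would then be passed to a tree-building method and the resulting trees summarized. The toy alignment and the fixed random seed are assumptions made so the example is reproducible.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap replicate by resampling columns with replacement."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    # Apply the same column picks to every sequence to keep columns intact.
    return ["".join(seq[i] for i in picks) for seq in alignment]


# Hypothetical alignment of three sequences, five columns each.
toy_msa = ["ATGCA", "ATGTA", "ACGCA"]
rng = random.Random(0)
replicates = [bootstrap_alignment(toy_msa, rng) for _ in range(100)]
print(replicates[0])
```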

10. Parametric Bootstrapping

A rigorous statistical test used to validate phylogenetic hypotheses by simulating datasets based on an estimated model.

Steps:

  1. Estimate model and construct original tree
  2. Simulate data under that model
  3. Reconstruct trees from simulated data
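  4. Compare the test statistic (e.g., a likelihood difference) from the real data with its distribution across the simulated datasets

Interpretation: If the value observed for the real data lies in the extreme tail of the simulated distribution, the hypothesis used to simulate the data is rejected; otherwise the data are consistent with it.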


11. Limitations in Phylogenetic Analysis

  • Model selection biases (wrong substitution model)
  • Long-branch attraction (MP particularly susceptible)
  • Incomplete lineage sorting or horizontal gene transfer
  • Computational complexity with large datasets
  • Unequal evolutionary rates may mislead tree topology


Module 5

Practical:

Hands-On Bioinformatics: Practical Applications and Analysis

This module emphasizes practical skills in bioinformatics by guiding students through essential computational exercises. Learners will explore how to retrieve biological sequences, analyze them using various tools, and interpret the results for applications in molecular biology, genetics, and evolutionary studies.


1. Sequence Retrieval from Databases

Students will learn to access and download nucleotide and protein sequences from globally recognized biological databases.

Steps:

  • Go to NCBI (https://www.ncbi.nlm.nih.gov/) or UniProt (https://www.uniprot.org/)
  • Use gene name, organism, accession number, or keywords in the search bar.
  • Filter results using advanced search options (e.g., organism, molecule type, sequence length).
  • Download in FASTA format for downstream analysis.

Other Resources:

  • BOLD (Barcode of Life Data Systems): For species identification and DNA barcoding
  • DDBJ and EMBL-EBI: Additional sequence repositories

2. Refining Search Criteria

Efficient data mining requires refining database queries to retrieve the most relevant sequences.

Search Modifiers:

  • Use Boolean operators (AND, OR, NOT)
  • Apply filters (e.g., “complete genome,” “mitochondrial DNA”)
  • Specify database (e.g., RefSeq, TSA, EST)
  • Limit results by taxonomy, molecule type, or publication date

Outcome: Accurate and narrowed-down sequence results
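
Boolean operators and field tags can also be combined programmatically. The sketch below only builds a search URL for the NCBI E-utilities `esearch` endpoint and does not send any request. The species, gene name, and helper function are illustrative assumptions; `[Organism]` and `[Gene]` are Entrez search field tags.

```python
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nucleotide", retmax=20):
    """Return an esearch URL for the given Entrez query (no request is made)."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Hypothetical query: COI records from rohu (Labeo rohita).
query = '"Labeo rohita"[Organism] AND COI[Gene]'
print(build_esearch_url(query))
```

The same query string can be pasted directly into the NCBI search bar.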


3. Sequence Submission to Databases (NCBI GenBank / BOLD)

Understanding how to submit original sequences to databases is essential for scientific transparency and data sharing.

A. GenBank Submission via NCBI:

  • Use BankIt or Sequin submission tools
  • Prepare the sequence in FASTA format
  • Annotate sequence with metadata (organism, gene name, product, source)
  • Upload using an NCBI account
  • Track submission status and receive accession number

B. BOLD Submission:

  • Targeted for DNA barcoding sequences (e.g., COI gene)
  • Requires taxonomic verification and geographic metadata
  • Upload through the BOLD Workbench

4. Pairwise Sequence Alignment (BLAST)

BLAST (Basic Local Alignment Search Tool) compares a query sequence to a database to find regions of local similarity.

Procedure:

  • Navigate to BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
  • Select appropriate program: blastn, blastp, blastx, etc.
  • Paste query or upload FASTA file
  • Choose database (e.g., nr, Swiss-Prot)
  • Adjust parameters (e.g., scoring matrix, word size, gap penalties)
  • Submit and analyze output

Interpreting Output:

  • Score and E-value: Indicate statistical significance (a higher bit score and a lower E-value indicate a more significant match)
  • % Identity: Reflects sequence similarity
  • Alignment: Shows matched regions and mismatches
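
BLAST results can also be saved as a tab-separated hit table (the 12-column layout produced by BLAST+ with `-outfmt 6`) and filtered in a script. The sketch below parses such a table and keeps hits that pass E-value and identity thresholds. The thresholds and the two sample hits are hypothetical.

```python
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_blast_tabular(text):
    """Parse tab-separated BLAST hits into a list of dicts."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(BLAST_COLUMNS, line.split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            row[key] = float(row[key])
        hits.append(row)
    return hits


def significant_hits(hits, max_evalue=1e-5, min_identity=90.0):
    """Keep hits with a low E-value and a high percent identity."""
    return [h for h in hits
            if h["evalue"] <= max_evalue and h["pident"] >= min_identity]


# Hypothetical hit table with one strong and one weak match.
sample_table = "\n".join("\t".join(fields) for fields in [
    ["query1", "subjA", "98.50", "200", "3", "0", "1", "200", "11", "210", "1e-100", "360"],
    ["query1", "subjB", "75.00", "80", "20", "2", "5", "84", "40", "119", "0.5", "30.1"],
])
for hit in significant_hits(parse_blast_tabular(sample_table)):
    print(hit["sseqid"], hit["pident"], hit["evalue"])
```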

5. Multiple Sequence Alignment (ClustalW)

MSA helps identify conserved regions, motifs, and evolutionary relationships among sequences.

Steps:

Applications:

  • Identify functional domains
  • Guide primer design
  • Build phylogenetic trees

6. Identification of Open Reading Frames (ORFs)

ORFs are regions likely to encode proteins and are critical for gene annotation.

Tools: NCBI ORF Finder, GeneMark, ExPASy Translate Tool (as listed in Module 3)

Steps:

  • Paste DNA sequence
  • Select genetic code and frame
  • View predicted ORFs and start/stop codons

7. Primer Designing

Designing primers is essential for PCR, sequencing, and gene cloning.

Tool: Primer3 (https://primer3.ut.ee/)

Input Requirements:

  • Target sequence (FASTA format)
  • Desired product size
  • Melting temperature (Tm) range
  • GC content range

Output:

  • Forward and reverse primer sequences
  • Tm, GC%, self-complementarity
  • Option to avoid secondary structures
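
The primer properties reported in the output can be approximated by hand. The sketch below computes GC content and a rough melting temperature with the Wallace rule, Tm = 2(A+T) + 4(G+C) in °C, which is a quick estimate for short oligonucleotides and is not the thermodynamic calculation Primer3 performs. The example primer is hypothetical.

```python
def gc_content(primer):
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Rough melting temperature (deg C) by the Wallace rule."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(seq):
    """Reverse complement, e.g. to derive a reverse primer from the top strand."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Hypothetical 20-base primer.
primer = "ATGCGTACGTTAGCCTAGGA"
print(gc_content(primer), wallace_tm(primer), reverse_complement(primer))
```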

8. Restriction Site Identification

Identifying restriction enzyme sites is crucial in recombinant DNA techniques.

Tools:

Steps:

  • Paste DNA sequence
  • Select enzymes from list or database
  • View restriction map and cut positions
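
Finding restriction sites amounts to searching a sequence for each enzyme's recognition motif. The sketch below handles three well-known enzymes on a linear sequence and reports the cut position on the top strand. The small enzyme table, the function name, and the test sequence are illustrative assumptions; online tools draw on a full enzyme database.

```python
# Recognition site and top-strand cut offset within the site:
# EcoRI G^AATTC, BamHI G^GATCC, HindIII A^AGCTT.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def find_cut_positions(dna, enzymes=ENZYMES):
    """Return {enzyme: [cut positions]} for a linear sequence.

    A cut position is the number of bases to the left of the cut.
    """
    dna = dna.upper()
    cuts = {}
    for name, (site, offset) in enzymes.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start + offset)
            start = dna.find(site, start + 1)
        cuts[name] = positions
    return cuts


# Hypothetical sequence with one EcoRI and one BamHI site.
print(find_cut_positions("AAGAATTCTTGGATCCAA"))
```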

9. Plasmid Map Drawing

Plasmid maps are essential for illustrating recombinant DNA constructs.

Software Tools:

  • SnapGene (Viewer) – Free for viewing and basic editing
  • ApE (A plasmid Editor) – Free alternative
  • Geneious – Commercial package

Features:

  • Annotate features (origin, antibiotic resistance, MCS)
  • Insert genes or tags
  • Display direction of transcription
  • Export maps as images or GenBank files

10. Protein Structure Prediction

Predicting 3D protein structures provides insight into function and interaction.

Tools: SWISS-MODEL, Phyre2 (as listed in Module 1)

Steps:

  • Input protein sequence (FASTA)
  • Select template (automated or manual)
  • Generate 3D model
  • Download PDB files for visualization in PyMOL or Chimera

11. Phylogenetic Tree Construction

Phylogenetic trees represent evolutionary relationships among sequences.

Software Options:

  • MEGA (Molecular Evolutionary Genetics Analysis)
  • MrBayes (Bayesian inference)
  • Phylip and PAUP* (Parsimony and distance methods)

Procedure (Using MEGA):

  1. Import aligned sequences (e.g., from ClustalW)
  2. Choose method: Neighbor-Joining, UPGMA, Maximum Parsimony, or Maximum Likelihood
  3. Customize options: bootstrap replicates, substitution model
  4. Generate and visualize tree

Bayesian Workflow (MrBayes):

  1. Convert alignment to NEXUS format
  2. Set model parameters (e.g., GTR+G)
  3. Run MCMC simulations
  4. Analyze posterior probabilities

12. Phylogenetic Tree Interpretation and Statistical Support

Tree Elements:

  • Branches: Represent evolutionary lineages
  • Nodes: Common ancestors
  • Branch Lengths: Proportional to evolutionary distance
  • Bootstrap Values / Posterior Probabilities: Support for each branch

Interpretation Tips:

  • Look for clusters (clades) with high support values
  • Use outgroups to root trees
  • Evaluate evolutionary relationships based on topology and branch lengths

Name of course instructor: Dr. Mohd Ashraf Rather, Assistant Professor, Division of Fish Genetics and Biotechnology, SKUAST-Kashmir

Email: mashraf38@skuastkashmir.ac.in
