Module 1
Basics of Bioinformatics
Introduction to Bioinformatics Resources and Tools
Bioinformatics is an interdisciplinary field that combines biology, computer science, and information technology to analyze and interpret biological data. This module introduces the fundamental resources, platforms, databases, and software tools used in bioinformatics, laying the foundation for more advanced computational analysis in life sciences.
1. Overview of Bioinformatics Resources
Bioinformatics resources are essential tools for managing the massive amount of data generated by genomic, transcriptomic, proteomic, and structural biology research. These resources include:
- Databases (primary and secondary)
- Software tools for sequence alignment, visualization, and molecular modeling
- Web-based platforms for genome browsing, annotation, and protein structure prediction
Applications:
- Gene and protein identification
- Phylogenetic analysis
- Molecular docking and drug discovery
- Structural and functional annotation
2. Platforms for Bioinformatics Analysis
Bioinformatics tools can run on various operating systems. Understanding the environment is crucial for effective tool deployment and scripting.
a. Linux-Based Platforms:
- Most bioinformatics tools are developed for Unix/Linux due to its open-source flexibility and scripting power.
- Popular distributions: Ubuntu, CentOS, Debian
- Common command-line tools and scripting languages: grep, awk, sed, Bash, Python, Perl
b. Windows-Based Platforms:
- GUI-based tools (e.g., MEGA, Geneious)
- Compatibility layer tools like WSL (Windows Subsystem for Linux) are increasingly used
3. Bioinformatics Software Tools
These are used for analyzing DNA, RNA, and protein data:
Task | Software/Tool |
Sequence Alignment | BLAST, Clustal Omega |
Phylogenetic Tree Construction | MEGA, PhyML |
Protein Structure Prediction | SWISS-MODEL, Phyre2 |
Data Visualization | Jalview, UCSF Chimera |
Most of these tools are available online or as downloadable packages for local installations.
4. Biological Databases
Databases are categorized into primary and secondary types:
a. Primary Databases
- Nucleotide Sequence Databases:
- GenBank (NCBI): Comprehensive database of publicly available DNA sequences
- ENA (European Nucleotide Archive, hosted by EMBL-EBI): European counterpart to GenBank
- DDBJ: DNA Data Bank of Japan
- Protein Sequence Databases:
- UniProtKB: Central resource for protein sequence and annotation data
- PIR: Protein Information Resource
b. Secondary Databases
Secondary databases derive information by curating and analyzing primary data.
- Examples:
- Pfam: Protein families and domains
- InterPro: Integrated resource of protein signatures
c. Structure Databases
These house 3D structures of biomolecules.
- PDB (Protein Data Bank): Repository for 3D structural data of proteins and nucleic acids.
- SCOP and CATH: Classification of protein structures
5. Analysis Packages
These combine multiple tools or scripts into workflows:
- EMBOSS: Suite of command-line tools for sequence analysis
- Bioconductor: R-based platform for genomics data
- Galaxy: Web-based platform enabling users to run workflows without command-line knowledge
- Geneious: Commercial package with integrated GUI for sequence editing, alignment, and annotation
Course Name: Bioinformatics Tools for Fisheries (FBT 505) 2(1+1)
Module 2
Sequence Alignment:
1. Introduction to Sequence Alignment
Sequence alignment is the process of arranging sequences to identify regions of similarity. These similarities may indicate functional, structural, or evolutionary relationships.
There are two main types of sequence alignment:
- Pairwise Alignment: Comparison between two sequences.
- Multiple Sequence Alignment (MSA): Simultaneous alignment of three or more sequences.
2. Dot Matrix Method
The dot matrix is a graphical method used for preliminary sequence comparison.
Key Features:
- Plots one sequence on the X-axis and another on the Y-axis.
- Dots are placed where residues (nucleotides/amino acids) match.
- Diagonal lines indicate regions of high similarity or identity.
Uses:
- Quick visualization of repetitive sequences.
- Identification of insertions, deletions, or inversions.
Tools: Dotlet, EMBOSS dotmatcher
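The dot matrix idea can be sketched in a few lines of Python. This is a minimal illustration, not how Dotlet or dotmatcher work internally (those tools apply sliding windows and score thresholds to reduce noise); the function name `dot_matrix` is hypothetical.

```python
def dot_matrix(seq_x: str, seq_y: str) -> list[str]:
    """Return one text row per residue of seq_y.

    Columns follow seq_x (X-axis); '*' marks a matching residue, '.' a mismatch.
    """
    return [
        "".join("*" if x == y else "." for x in seq_x)
        for y in seq_y
    ]


# Print a small plot; runs of '*' along a diagonal indicate similar regions.
for row in dot_matrix("GATTACA", "GATCACA"):
    print(row)
```

Comparing a sequence with itself gives an unbroken main diagonal, and repeats appear as additional parallel diagonals.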
3. Scoring Matrices: PAM and BLOSUM
Scoring matrices are essential in aligning protein sequences to assess substitution likelihoods.
a. PAM (Point Accepted Mutation) Matrix
- Based on evolutionary models of accepted mutations.
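- PAM1 corresponds to one accepted point mutation per 100 residues (about 1% change); higher-numbered matrices such as PAM250 are extrapolated from PAM1 for more divergent sequences.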
- Assumes a common ancestor and slow mutation rates.
b. BLOSUM (Blocks Substitution Matrix)
- Derived from conserved sequence blocks (e.g., in protein domains).
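- BLOSUM62, built from blocks clustered at 62% identity, is the most commonly used matrix.
- Lower-numbered matrices (e.g., BLOSUM45) suit more divergent sequences; higher-numbered matrices (e.g., BLOSUM80) suit closely related sequences.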
Matrix | Suitable For |
PAM250 | Distantly related proteins |
BLOSUM62 | Moderately similar proteins |
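Entries in both matrix families are log-odds scores: the logarithm of how often a residue pair is observed in trusted alignments relative to how often it would occur by chance. The sketch below shows that calculation with made-up frequencies; the function name and the input values are assumptions for illustration, and the scale factor of 2 corresponds to the half-bit units used for BLOSUM62.

```python
import math


def log_odds_score(pair_freq: float, freq_a: float, freq_b: float,
                   scale: float = 2.0) -> int:
    """Rounded log-odds score for a residue pair.

    pair_freq      : observed frequency of the (ordered) aligned pair
    freq_a, freq_b : background frequencies of the two residues
    scale          : 2.0 gives half-bit units (an assumption of this sketch)
    """
    expected = freq_a * freq_b
    return round(scale * math.log2(pair_freq / expected))


# Hypothetical numbers: a pair seen 4x more often than chance scores +4,
# a pair seen 4x less often than chance scores -4.
print(log_odds_score(0.04, 0.1, 0.1))
print(log_odds_score(0.0025, 0.1, 0.1))
```

Positive scores mark substitutions that are tolerated more often than chance predicts; negative scores mark substitutions that are rarely accepted.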
4. Sequence Retrieval from Online Databases
To perform alignments, sequences are often retrieved from public databases.
Steps:
- Access NCBI (https://www.ncbi.nlm.nih.gov) or UniProt (https://www.uniprot.org).
- Use accession numbers, gene/protein names, or organism filters.
- Download sequences in FASTA format for alignment.
Databases:
- GenBank: Nucleotide sequences
- UniProtKB: Protein sequences and annotations
- RefSeq: Curated reference sequences
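Once downloaded, a FASTA file is simple to read: each record starts with a `>` header line followed by one or more sequence lines. The parser below is a minimal sketch (the function name `parse_fasta` is hypothetical); for real work a library such as Biopython is the usual choice.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {record_id: sequence}.

    The record ID is the first word after '>'; sequence lines are joined
    and converted to upper case.
    """
    records: dict[str, str] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else "unnamed"
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


example = ">seq1 example record\nACGT\nacgt\n>seq2\nTTGA\n"
print(parse_fasta(example))
```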
5. Pairwise Sequence Alignment Using BLAST
BLAST (Basic Local Alignment Search Tool) is a widely used tool for comparing a query sequence with a database.
Types of BLAST:
- blastn: Nucleotide vs. nucleotide
- blastp: Protein vs. protein
- blastx: Translated nucleotide vs. protein
- tblastn: Protein vs. translated nucleotide
Figure 2: Types of BLAST
Steps to Run BLAST:
- Paste query sequence or upload a file.
- Select appropriate BLAST type.
- Choose database (e.g., nr, Swiss-Prot).
- Adjust parameters (e.g., matrix, gap penalties).
- View alignment results with percent identity, E-value, and score.
Applications:
- Gene annotation
- Homology search
- Ortholog/paralog identification
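BLAST itself is a heuristic that finds short seed matches and extends them, which is too involved for a short example. The underlying notion of a local alignment score can be illustrated with the exact Smith-Waterman dynamic programming algorithm, sketched below with a simple match/mismatch scheme and linear gap penalty. The scoring values are assumptions chosen for illustration, not BLAST defaults.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so an alignment can start anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# The shared core "ACGT" gives 4 matches x 2 = 8.
print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Because scores are floored at zero, unrelated flanking regions do not penalize a well-matching core, which is what makes the alignment local.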
6. Multiple Sequence Alignment (MSA)
MSA is used to align three or more sequences simultaneously to identify conserved regions, motifs, or evolutionary patterns.
Common Tools:
- Clustal Omega
- MUSCLE
- MAFFT
- T-Coffee
Key Concepts:
- Conserved regions may indicate functional or structural importance.
- Gaps are introduced to maximize alignment score across all sequences.
- Phylogenetic trees can be built from MSA results.
Applications:
- Protein family studies
- Primer design
- Phylogenetics
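Conserved regions in an MSA can be located by scoring each alignment column. The sketch below uses one simple measure, the fraction of sequences sharing the most common residue in a column; this measure and the function name are assumptions of the example, and alignment viewers such as Jalview offer more refined scores.

```python
from collections import Counter


def column_conservation(aligned: list[str]) -> list[float]:
    """Fraction of sequences carrying the most common residue in each column.

    Gaps ('-') never count as the conserved residue but stay in the denominator.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need a non-empty set of equal-length aligned sequences")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


print(column_conservation(["ACG-", "ACGT", "AGGT"]))
```

Columns scoring 1.0 are fully conserved and are candidate sites for primers or functional motifs.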
Module 3
Sequence analysis:
Sequence analysis is a critical area in bioinformatics that enables researchers to interpret raw DNA and protein sequences to extract meaningful biological information. This module focuses on how to retrieve, evaluate, and analyze sequencing data, particularly Sanger sequencing reads, how to identify important genetic elements such as open reading frames (ORFs) and regulatory motifs, and how to analyze sequence variation (SNPs) and expressed sequence tags (ESTs).
1. Retrieval of Sequences
Accurate sequence retrieval from public databases is the first step in any analysis pipeline.
Sources:
- NCBI GenBank: Nucleotide sequences
- ENA: European Nucleotide Archive
- UniProt: Protein sequences and functional annotations
File Formats:
- FASTA: For raw sequences
- FASTQ: For sequences with quality scores
- GenBank: Annotated format with feature tables
Tools: NCBI Entrez, EBI’s ENA browser, UniProt sequence fetcher
2. Sequence Quality Assessment
Before any downstream analysis, the quality of the sequence must be checked and filtered.
Parameters Checked:
- Base quality scores (Q scores)
- Ambiguous bases (N’s)
- Read length and overall sequence coverage
Tools:
- Phred: Quality scoring system for Sanger reads
- FastQC: For next-generation sequencing data (can also assess Sanger reads once they are converted to FASTQ)
- TrimGalore/Cutadapt: For trimming low-quality regions
Outcome: High-quality reads suitable for reliable assembly and annotation.
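A Phred quality score Q relates to the base-calling error probability P by Q = -10 log10(P), so Q20 means a 1 in 100 chance of error and Q30 a 1 in 1,000 chance. The sketch below converts scores and decodes a FASTQ quality string, assuming the common Phred+33 ASCII encoding; the function names are hypothetical.

```python
def phred_to_error_probability(q: float) -> float:
    """Error probability for a Phred score: P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(quality_string: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(char) - offset for char in quality_string]


scores = decode_quality("I5!")          # 'I' -> 40, '5' -> 20, '!' -> 0
print(scores)
print([phred_to_error_probability(q) for q in scores])
```

A typical filter keeps bases or reads above a chosen threshold such as Q20; the threshold is an analysis decision, not a fixed rule.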
3. Assembly and Annotation of Sanger Sequencing Reads
Individual Sanger reads (typically up to about 1,000 bp) are often shorter than the region of interest, so overlapping reads are assembled into longer contiguous sequences (contigs).
a. Sequence Assembly
- Types:
- De novo assembly: Without a reference
- Reference-guided assembly: Using known templates
- Tools: CAP3, Phrap, Geneious, BioEdit
b. Sequence Annotation
Annotation involves identifying functional elements such as:
- Genes and coding regions
- Promoters and regulatory elements
- Exons and introns
Tools: Artemis, NCBI ORF Finder, RAST server (for prokaryotes)
4. Identification of Cis-Acting Regulatory Elements
Cis-acting elements are short DNA sequences, usually located near a gene, that regulate the expression of that gene.
Common Elements:
- TATA box: Core promoter
- CAAT box, GC box
- Enhancer/silencer sequences
Databases and Tools:
- PLACE and PlantCARE: Plant regulatory element databases
- TRANSFAC and JASPAR: Transcription factor binding site prediction
- MEME Suite: Motif discovery tool
Application: Identifying transcriptional control regions upstream of genes.
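A consensus motif can be searched for with a regular expression. The sketch below scans the forward strand for a simplified TATA-box consensus, TATA(A/T)A(A/T); that pattern and the function name are assumptions for illustration. Dedicated tools such as JASPAR-based scanners use position weight matrices rather than exact patterns.

```python
import re


def find_motif(seq: str, pattern: str = "TATA[AT]A[AT]") -> list[tuple[int, str]]:
    """Return (1-based start, matched text) for every motif occurrence.

    A lookahead is used so that overlapping occurrences are also reported.
    Only the given (forward) strand is scanned.
    """
    return [
        (m.start() + 1, m.group(1))
        for m in re.finditer(f"(?=({pattern}))", seq.upper())
    ]


print(find_motif("GGCTATAAAAGGC"))
```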
5. ORF (Open Reading Frame) Finding
ORFs are potential protein-coding regions within a DNA sequence.
Key Features:
- Begin with a start codon (ATG)
- End with a stop codon (TAA, TAG, TGA)
- Located in correct reading frame without interruptions
Tools:
- NCBI ORF Finder
- GeneMark
- ExPASy Translate Tool
Output: Predicted coding regions with possible protein translations.
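The key features listed above translate directly into a scan: read codons in each frame, open an ORF at ATG, and close it at the first in-frame stop codon. The sketch below covers only the three forward-strand frames and is an illustration, not a reimplementation of NCBI ORF Finder; the function name and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2) -> list[tuple[int, int, int]]:
    """Return (start, end, frame) for ORFs on the forward strand.

    Positions are 1-based and inclusive of the stop codon; frame is 1, 2, or 3.
    min_codons counts the start and stop codons as well.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3, frame + 1))
                start = None
    return orfs


print(find_orfs("CATGGCCTAAG"))
```

A complete search would repeat the scan on the reverse complement to cover all six reading frames.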
6. Signal Sequences in DNA and Proteins
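Signal sequences are short motifs recognized by cellular machinery: in DNA they mark where transcription, translation, or transcript processing takes place, and in proteins they direct cellular localization or secretion.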
a. In DNA:
- Promoter regions
- Ribosome binding sites
- Polyadenylation signals
b. In Proteins:
- N-terminal signal peptides (for secretory pathways)
- Transmembrane domains
Tools:
- SignalP: Predicts signal peptides in protein sequences
- TargetP: Predicts subcellular localization
- TMHMM: Predicts transmembrane regions
7. Data Analysis Tools for SNP and ESTs
a. SNP (Single Nucleotide Polymorphism) Analysis
SNPs are single-base changes in DNA that can affect traits or disease susceptibility.
Databases and Tools:
- dbSNP (NCBI): Public SNP database
- SNPdat, GATK, VCFtools: For annotation and analysis
Applications:
- Marker-assisted selection
- Genotyping and population studies
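At its simplest, a SNP is a position where two aligned sequences carry different bases. The sketch below compares a sample against a reference, assuming both are already aligned and of equal length; the function name is hypothetical. Production tools such as GATK call variants from many aligned reads with statistical models, which this example does not attempt.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for mismatches.

    Positions with gaps or ambiguous bases (e.g. 'N', '-') are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base)
        in enumerate(zip(reference.upper(), sample.upper()), start=1)
        if ref_base != alt_base and ref_base in "ACGT" and alt_base in "ACGT"
    ]


print(find_snps("ACGTACGT", "ACGAACCT"))
```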
b. EST (Expressed Sequence Tag) Analysis
ESTs are short, single-pass sequence reads derived from cDNA (reverse-transcribed mRNA), used to identify gene transcripts.
Uses:
- Gene discovery and annotation
- Identification of tissue-specific expression
Tools:
- UniGene: Clusters ESTs into genes (retired by NCBI in 2019; archived data remain available)
- ESTScan: Predicts coding regions in ESTs
- blastx: Matches translated ESTs to protein sequences
Module 4
Phylogeny and evolution:
Understanding the evolutionary history of organisms and genes is a central task in bioinformatics. This module introduces the concepts and computational techniques used to infer evolutionary relationships, estimate divergence times, and construct phylogenetic trees. It also covers the principles of evolutionary theory, species classification, and statistical validation of evolutionary hypotheses.
1. Evolution of Genomes
Genome evolution is a dynamic process influenced by mutation, recombination, gene duplication, and horizontal gene transfer.
Key Concepts:
- Gene gain and loss
- Expansion of gene families
- Mobile genetic elements
- Genome rearrangements and size variation
Bioinformatics Relevance:
Comparative genomics and alignment of genome sequences reveal patterns of genome evolution and divergence.
2. Basic Forces of Evolution
Evolution is driven by several key forces:
- Mutation: Source of genetic variation
- Natural Selection: Favors advantageous traits
- Genetic Drift: Random changes in allele frequencies, especially in small populations
- Gene Flow: Exchange of genes between populations
- Recombination: Shuffling of genes during reproduction
These forces shape the genetic structure and diversity of populations over time.
3. Variation and Divergence of Populations
Populations evolve and diverge due to geographic isolation, genetic drift, and environmental selection.
Important Measures:
- Allele frequency changes
- Genetic distance (e.g., Nei’s distance)
- F_ST values (measure of population differentiation)
Applications: Used in population genetics, conservation biology, and epidemiological studies.
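F_ST can be computed from heterozygosities as F_ST = (H_T - H_S) / H_T, where H_S is the mean expected heterozygosity within subpopulations and H_T is the expected heterozygosity of the pooled population. The sketch below does this for a single biallelic locus, assuming equally sized subpopulations; the function name and allele frequencies are illustrative.

```python
def fst_biallelic(allele_freqs: list[float]) -> float:
    """F_ST for one biallelic locus from per-population allele frequencies.

    Assumes equal subpopulation sizes (an assumption of this sketch).
    """
    if not allele_freqs:
        raise ValueError("need at least one population")
    p_bar = sum(allele_freqs) / len(allele_freqs)
    h_t = 2 * p_bar * (1 - p_bar)                      # pooled heterozygosity
    h_s = sum(2 * p * (1 - p) for p in allele_freqs) / len(allele_freqs)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


print(fst_biallelic([0.5, 0.5]))   # identical populations
print(fst_biallelic([1.0, 0.0]))   # fixed for different alleles
print(fst_biallelic([0.2, 0.8]))
```

Values near 0 indicate little differentiation between populations; values near 1 indicate that populations are fixed for different alleles.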
4. Estimation of Divergence Time
Estimating when two species or genes diverged helps in understanding evolutionary timelines.
Methods:
- Molecular Clock Hypothesis: Assumes a roughly constant substitution rate over time
- Synonymous vs. Non-synonymous substitutions
- Calibration with fossil records or known divergence events
Tools: MEGA, BEAST, TimeTree
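Under a strict molecular clock, divergence time follows from T = d / (2r), where d is the genetic distance between two sequences (substitutions per site) and r is the substitution rate per site per year along each lineage. The factor of 2 accounts for changes accumulating on both lineages since the split. The sketch below applies the formula with hypothetical values; tools such as BEAST use far richer models with relaxed clocks and calibration priors.

```python
def divergence_time(distance: float, rate_per_site_per_year: float) -> float:
    """Years since divergence under a strict molecular clock: T = d / (2r)."""
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return distance / (2 * rate_per_site_per_year)


# Hypothetical values: d = 0.02 substitutions/site, r = 1e-8 per site per year.
print(divergence_time(0.02, 1e-8))   # roughly one million years
```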
5. Phylogenetic Species Concept
Unlike traditional species definitions based on morphology, the phylogenetic species concept defines species as the smallest group sharing a common ancestor and diagnosable by unique traits.
Importance in Bioinformatics:
Helps define taxa based on genetic data and evolutionary lineage rather than phenotype alone.
6. Phylogenetic Trees and Cladistics
Phylogenetic Trees are graphical representations of evolutionary relationships.
Types:
- Rooted trees: Show common ancestor and direction of evolution
- Unrooted trees: Show relationships without time direction
Cladistics:
- Method to classify organisms based on shared derived traits (synapomorphies)
- Yields cladograms, a type of phylogenetic tree
7. Concepts of Monophyly, Paraphyly, and Polyphyly
Concept | Description | Example |
Monophyly | A group containing an ancestor and all its descendants | Mammals |
Paraphyly | A group with a common ancestor but not all descendants | Reptiles (excluding birds) |
Polyphyly | Group with unrelated organisms from different ancestors | Marine mammals (e.g., whales + seals) |
Correct interpretation ensures accurate evolutionary classification.
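Monophyly can be checked mechanically on a rooted tree: a group is monophyletic only if its members are exactly the leaves under one node. The sketch below represents a tree as nested tuples; this representation, the function names, and the simplified four-taxon tree are assumptions for illustration. It tests monophyly only and does not distinguish paraphyly from polyphyly.

```python
def leaves(node) -> frozenset:
    """All leaf names under a node (a leaf is a string, a clade is a tuple)."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def clades(node):
    """Yield the leaf set of the node and of every node beneath it."""
    yield leaves(node)
    if not isinstance(node, str):
        for child in node:
            yield from clades(child)


def is_monophyletic(tree, group) -> bool:
    """True if the group equals the full leaf set of some node in the tree."""
    return frozenset(group) in set(clades(tree))


# Simplified rooted tree: birds group with crocodiles, then lizards, then mammals.
tree = ((("crocodile", "bird"), "lizard"), "mammal")
print(is_monophyletic(tree, {"crocodile", "bird"}))     # True
print(is_monophyletic(tree, {"crocodile", "lizard"}))   # False: bird is excluded
```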
8. Phylogenetic Tree Reconstruction Methods
A. Distance-Based Methods
These use genetic distance matrices to build trees:
- UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
- Assumes constant rate of evolution (molecular clock)
- Produces rooted trees
- Neighbor-Joining (NJ)
- Does not assume a constant rate
- More flexible and widely used for large datasets
Tools: MEGA, Phylip, PAUP*
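Distance-based methods start from pairwise distances computed on an alignment. The sketch below computes the p-distance (proportion of differing sites) and applies the Jukes-Cantor correction, d = -(3/4) ln(1 - 4p/3), which accounts for multiple substitutions at the same site. Jukes-Cantor is chosen here only because it is the simplest substitution model; the function names are hypothetical.

```python
import math


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns containing a gap in either sequence are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for nucleotide sequences."""
    if p >= 0.75:
        raise ValueError("p-distance too large; sequences are saturated")
    return -0.75 * math.log(1 - 4 * p / 3)


p = p_distance("ACGTACGTAC", "ACGTACGTAA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as the raw p-distance, and the gap between them grows as sequences diverge.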
B. Character-Based Methods
These examine each position in sequence alignment individually.
- Maximum Parsimony (MP)
- Chooses the tree requiring the fewest evolutionary changes
- Simpler, but sensitive to homoplasy (shared traits not from common ancestry)
- Maximum Likelihood (ML)
- Evaluates likelihood of a tree given a specific model of sequence evolution
- More accurate but computationally intensive
- Bayesian Inference
- Uses probability distributions to estimate tree credibility
- Employs Markov Chain Monte Carlo (MCMC) simulations
Software: PhyML (ML), MrBayes (Bayesian), RAxML (ML), TNT (MP)
9. Tree Comparison and Statistical Validation
Once trees are constructed, they must be evaluated for reliability and accuracy.
Methods and Tools:
- Bootstrap Analysis: Resampling method to assess tree branch support (100–1000 replicates)
- Jackknife: Similar to bootstrapping but excludes some data in each replicate
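- Likelihood Ratio Test (LRT): Compares the fit of two nested models (e.g., substitution models, or a tree with and without a molecular clock) using the ratio of their likelihoods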
- Tree Comparisons: Tools like TreeCompare or Dendroscope compare topologies
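The resampling step of bootstrap analysis works on alignment columns: each replicate draws columns at random with replacement until it has as many columns as the original alignment. The sketch below shows only this step; building a tree from each replicate and counting how often each branch recurs is left to the phylogenetics software. The function name is hypothetical.

```python
import random


def bootstrap_alignment(aligned: list[str], rng: random.Random) -> list[str]:
    """Return one bootstrap replicate of an alignment.

    Columns are sampled with replacement; every sequence keeps its length.
    """
    if not aligned or len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need a non-empty set of equal-length aligned sequences")
    n_columns = len(aligned[0])
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    return ["".join(seq[i] for i in picks) for seq in aligned]


alignment = ["ACGT", "ACGA", "TCGA"]
rng = random.Random(42)          # fixed seed so the example is repeatable
for _ in range(3):
    print(bootstrap_alignment(alignment, rng))
```

Because whole columns are sampled, the relationship between sequences at each site is preserved in every replicate.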
10. Parametric Bootstrapping
A rigorous statistical test used to validate phylogenetic hypotheses by simulating datasets based on an estimated model.
Steps:
- Estimate model and construct original tree
- Simulate data under that model
- Reconstruct trees from simulated data
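- Compare the test statistic from the real data (e.g., a likelihood difference between competing trees) with its distribution across the simulated datasets
Interpretation: If the observed statistic falls in the extreme tail of the simulated distribution, the hypothesis used to simulate the data is rejected.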
11. Limitations in Phylogenetic Analysis
- Model selection biases (wrong substitution model)
- Long-branch attraction (MP particularly susceptible)
- Incomplete lineage sorting or horizontal gene transfer
- Computational complexity with large datasets
- Unequal evolutionary rates may mislead tree topology
Module 5
Practical:
Hands-On Bioinformatics: Practical Applications and Analysis
This module emphasizes practical skills in bioinformatics by guiding students through essential computational exercises. Learners will explore how to retrieve biological sequences, analyze them using various tools, and interpret the results for applications in molecular biology, genetics, and evolutionary studies.
1. Sequence Retrieval from Databases
Students will learn to access and download nucleotide and protein sequences from globally recognized biological databases.
Steps:
- Go to NCBI (https://www.ncbi.nlm.nih.gov/) or UniProt (https://www.uniprot.org/)
- Use gene name, organism, accession number, or keywords in the search bar.
- Filter results using advanced search options (e.g., organism, molecule type, sequence length).
- Download in FASTA format for downstream analysis.
Other Resources:
- BOLD (Barcode of Life Data Systems): For species identification and DNA barcoding
- DDBJ and ENA (EMBL-EBI): Additional sequence repositories
2. Refining Search Criteria
Efficient data mining requires refining database queries to retrieve the most relevant sequences.
Search Modifiers:
- Use Boolean operators (AND, OR, NOT)
- Apply filters (e.g., “complete genome,” “mitochondrial DNA”)
- Specify database (e.g., RefSeq, TSA, EST)
- Limit results by taxonomy, molecule type, or publication date
Outcome: Accurate and narrowed-down sequence results
3. Sequence Submission to Databases (NCBI GenBank / BOLD)
Understanding how to submit original sequences to databases is essential for scientific transparency and data sharing.
A. GenBank Submission via NCBI:
- Use BankIt or the NCBI Submission Portal (the older Sequin tool has been retired)
- Prepare the sequence in FASTA format
- Annotate sequence with metadata (organism, gene name, product, source)
- Upload using an NCBI account
- Track submission status and receive accession number
B. BOLD Submission:
- Targeted for DNA barcoding sequences (e.g., COI gene)
- Requires taxonomic verification and geographic metadata
- Upload through the BOLD Workbench
4. Pairwise Sequence Alignment (BLAST)
BLAST (Basic Local Alignment Search Tool) compares a query sequence to a database to find regions of local similarity.
Procedure:
- Navigate to BLAST (https://blast.ncbi.nlm.nih.gov/Blast.cgi)
- Select appropriate program: blastn, blastp, blastx, etc.
- Paste query or upload FASTA file
- Choose database (e.g., nr, Swiss-Prot)
- Adjust parameters (e.g., scoring matrix, word size, gap penalties)
- Submit and analyze output
Interpreting Output:
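- Score and E-value: Indicate statistical significance; a higher bit score and a lower E-value (closer to 0) mean the match is less likely to have occurred by chance
- % Identity: Percentage of aligned positions with identical residues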
- Alignment: Shows matched regions and mismatches
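Percent identity is straightforward to reproduce from an alignment: count the columns where both sequences carry the same residue and divide by the alignment length, gaps included. The sketch below assumes the two strings are rows of an existing pairwise alignment; the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Identical columns as a percentage of alignment length (gaps included)."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("need two non-empty aligned rows of equal length")
    matches = sum(
        x == y and x != "-"
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
    )
    return 100.0 * matches / len(aligned_a)


# 6 identical columns out of 8 gives 75.0
print(percent_identity("ACGT-ACG", "ACGTTACC"))
```

Percent identity alone can mislead for short alignments, which is why it is read together with the E-value and the alignment length.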
5. Multiple Sequence Alignment (Clustal Omega / ClustalW)
MSA helps identify conserved regions, motifs, and evolutionary relationships among sequences.
Steps:
- Visit Clustal Omega (https://www.ebi.ac.uk/Tools/msa/clustalo/) or use ClustalW
- Input three or more sequences in FASTA format
- Run alignment with default or custom parameters
- Download alignment results
Applications:
- Identify functional domains
- Guide primer design
- Build phylogenetic trees
6. Identification of Open Reading Frames (ORFs)
ORFs are regions likely to encode proteins and are critical for gene annotation.
Tools:
- NCBI ORF Finder (https://www.ncbi.nlm.nih.gov/orffinder/)
- ExPASy Translate Tool
- GeneMark (for gene prediction)
Steps:
- Paste DNA sequence
- Select genetic code and frame
- View predicted ORFs and start/stop codons
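Selecting a genetic code and frame amounts to choosing a codon table and a starting offset. The sketch below translates a DNA sequence in all six reading frames using the standard genetic code; other codes (for example the vertebrate mitochondrial code used for COI barcodes) would need a different table. Function names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for first, second, third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna: str) -> str:
    """Reverse complement of a DNA sequence (A<->T, C<->G)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str, frame: int = 0) -> str:
    """Translate from the given 0-based offset; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna: str) -> dict[str, str]:
    """Translations for frames +1..+3 (forward) and -1..-3 (reverse strand)."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(rc, offset)
    return frames


print(six_frame_translation("ATGGCCTAA"))
```

Stop codons appear as `*` in the output, so long stretches without `*` in a frame suggest candidate ORFs.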
7. Primer Designing
Designing primers is essential for PCR, sequencing, and gene cloning.
Tool: Primer3 (https://primer3.ut.ee/)
Input Requirements:
- Target sequence (FASTA format)
- Desired product size
- Melting temperature (Tm) range
- GC content range
Output:
- Forward and reverse primer sequences
- Tm, GC%, self-complementarity
- Option to avoid secondary structures
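Two of the reported primer properties are easy to estimate by hand. The sketch below computes GC content and a rough melting temperature using the Wallace rule, Tm = 2(A+T) + 4(G+C) in degrees Celsius. This rule is a swapped-in approximation for short oligonucleotides; Primer3 itself uses nearest-neighbor thermodynamic models, so its Tm values will differ. Function names and the example primer are hypothetical.

```python
def gc_content(primer: str) -> float:
    """GC content of a primer as a percentage."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer: str) -> int:
    """Rough Tm in degrees C by the Wallace rule: 2(A+T) + 4(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


primer = "ATGCATGCATGCATGCATGC"   # made-up 20-mer
print(gc_content(primer), wallace_tm(primer))
```

Forward and reverse primers are usually chosen so that their Tm values are close to each other.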
8. Restriction Site Identification
Identifying restriction enzyme sites is crucial in recombinant DNA techniques.
Tools:
- NEBcutter (https://nc3.neb.com/NEBcutter/)
- SnapGene Viewer (desktop tool)
- Biology Workbench online tools
Steps:
- Paste DNA sequence
- Select enzymes from list or database
- View restriction map and cut positions
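Finding restriction sites is a pattern search for each enzyme's recognition sequence. The sketch below looks for three common enzymes (EcoRI, BamHI, HindIII) in a linear sequence and reports where each recognition site starts. It is a minimal illustration with a hypothetical function name; NEBcutter additionally handles circular sequences, degenerate sites, and methylation sensitivity.

```python
# Recognition sequences (5'->3'). All three are palindromic, so scanning
# one strand is sufficient for these enzymes.
ENZYMES = {"EcoRI": "GAATTC", "BamHI": "GGATCC", "HindIII": "AAGCTT"}


def find_sites(dna: str, enzymes: dict[str, str] = ENZYMES) -> dict[str, list[int]]:
    """Return 1-based start positions of each enzyme's recognition site."""
    dna = dna.upper()
    hits = {}
    for name, site in enzymes.items():
        positions = []
        index = dna.find(site)
        while index != -1:
            positions.append(index + 1)
            index = dna.find(site, index + 1)
        hits[name] = positions
    return hits


print(find_sites("CCGAATTCGGGGATCCAA"))
```

An enzyme with an empty list does not cut the sequence, which is often the property wanted when choosing cloning sites that must not disrupt an insert.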
9. Plasmid Map Drawing
Plasmid maps are essential for illustrating recombinant DNA constructs.
Software Tools:
- SnapGene Viewer: Free for viewing and basic editing
- ApE (A plasmid Editor): Free alternative
- Geneious: Commercial package with plasmid mapping features
Features:
- Annotate features (origin, antibiotic resistance, MCS)
- Insert genes or tags
- Display direction of transcription
- Export maps as images or GenBank files
10. Protein Structure Prediction
Predicting 3D protein structures provides insight into function and interaction.
Tools:
- SWISS-MODEL (https://swissmodel.expasy.org/)
- Phyre2 (http://www.sbg.bio.ic.ac.uk/phyre2/)
- AlphaFold Protein Structure Database
Steps:
- Input protein sequence (FASTA)
- Select template (automated or manual)
- Generate 3D model
- Download PDB files for visualization in PyMOL or Chimera
11. Phylogenetic Tree Construction
Phylogenetic trees represent evolutionary relationships among sequences.
Software Options:
- MEGA (Molecular Evolutionary Genetics Analysis)
- MrBayes (Bayesian inference)
- Phylip and PAUP* (Parsimony and distance methods)
Procedure (Using MEGA):
- Import aligned sequences (e.g., from ClustalW)
- Choose method: Neighbor-Joining, UPGMA, Maximum Parsimony, or Maximum Likelihood
- Customize options: bootstrap replicates, substitution model
- Generate and visualize tree
Bayesian Workflow (MrBayes):
- Convert alignment to NEXUS format
- Set model parameters (e.g., GTR+G)
- Run MCMC simulations
- Analyze posterior probabilities
12. Phylogenetic Tree Interpretation and Statistical Support
Tree Elements:
- Branches: Represent evolutionary lineages
- Nodes: Common ancestors
- Branch Lengths: Proportional to evolutionary distance
- Bootstrap Values / Posterior Probabilities: Support for each branch
Interpretation Tips:
- Look for clusters (clades) with high support values
- Use outgroups to root trees
- Evaluate evolutionary relationships based on topology and branch lengths
Name of course instructor: Dr. Mohd Ashraf Rather, Assistant Professor, Division of Fish Genetics and Biotechnology, SKUAST-Kashmir
Email: mashraf38@skuastkashmir.ac.in