Molecular Biology and Bioinformatics Principles
I. Molecular Biology Fundamentals
Genes and Genomes
Every cell contains a complete set of genetic instructions—the genome—encoded in DNA and organized into genes packaged on chromosomes. A gene is a specific DNA sequence that encodes a functional product (usually a protein or an RNA). Genetic variation (mutations) underlies phenotypic differences, while environmental factors also contribute to traits.
DNA vs. RNA
Feature | DNA (Deoxyribonucleic Acid) | RNA (Ribonucleic Acid) |
---|---|---|
Sugar | Deoxyribose (no 2′–OH group) | Ribose (has 2′–OH group) |
Nitrogenous Bases | A, T, C, G | A, U, C, G (uracil replaces thymine) |
Structure | Usually double-stranded helix | Usually single-stranded (can form loops/folds) |
Stability | More stable (long-term storage) | Less stable (transient transcripts) |
Cellular Location | Nucleus (eukaryotes), nucleoid (prokaryotes) | Nucleus (as pre-mRNA) & cytoplasm (mRNA, tRNA) |
Primary Function | Genetic blueprint | Messenger (mRNA), transfer (tRNA), ribosomal (rRNA) roles. |
II. Central Dogma of Molecular Biology
Flow of Information:
Replication: DNA → DNA
Transcription: DNA → RNA
Translation: RNA → Protein
Exceptions exist (e.g., reverse transcription, RNA → DNA, in retroviruses), but sequence information does not flow from protein back to nucleic acid.
III. Transcription (DNA → RNA)
Enzyme: RNA polymerase reads the template strand (3′→5′) and synthesizes RNA (5′→3′).
Coding vs. Template Strands:
The coding strand has the same sequence as the RNA, except that the RNA carries U where the DNA has T.
The template strand is complementary to the RNA and is the strand the polymerase actually reads.
Eukaryotic Processing: 5′ cap, 3′ poly-A tail, and splicing out introns yield mature mRNA.
Prokaryotes lack a nucleus—transcription and translation are coupled, and mRNAs are used directly without capping or splicing.
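The coding/template relationship above can be sketched in a few lines of Python. This is a minimal illustration, not a model of real transcription: the function names are hypothetical, sequences are assumed to contain only A, C, G, T, and both strands are assumed to be written 5′→3′.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string (written 5'->3')."""
    return "".join(COMPLEMENT[base] for base in reversed(dna.upper()))

def transcribe_coding(coding_strand: str) -> str:
    """The mRNA matches the coding strand, with U in place of T."""
    return coding_strand.upper().replace("T", "U")

def transcribe_template(template_strand: str) -> str:
    """The template is read 3'->5', so the mRNA (5'->3') is its
    reverse complement with U in place of T."""
    return transcribe_coding(reverse_complement(template_strand))

coding = "ATGGTGCAC"
template = reverse_complement(coding)   # "GTGCACCAT"
print(transcribe_coding(coding))        # AUGGUGCAC
print(transcribe_template(template))    # AUGGUGCAC
```

Both routes give the same mRNA, which is the point of the coding vs. template distinction.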
IV. Translation (RNA → Protein)
Ribosome ‘reads’ mRNA codons (three-base units) in the 5′→3′ direction.
Start codon: AUG (Methionine) establishes reading frame.
Stop codons: UAA, UAG, UGA terminate synthesis.
tRNAs with complementary anticodons bring specific amino acids to the ribosome’s A (aminoacyl), P (peptidyl), and E (exit) sites.
Open Reading Frame (ORF): Region from start to stop codon; six possible frames (3 per strand). Tools like NCBI ORFfinder identify ORFs in a sequence.
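A six-frame ORF scan can be sketched as follows. This is a simplified illustration and not how ORFfinder itself is implemented: it assumes ATG as the only start codon, reports only the outermost ATG of each ORF, and gives minus-strand coordinates relative to the reverse-complemented sequence. The function name and the min_codons parameter are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMP = str.maketrans("ACGT", "TGCA")

def find_orfs(dna, min_codons=2):
    """Return (strand, frame, start, end, sequence) for each ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and refer to the strand scanned
    (for '-', the reverse complement of the input).
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", dna.translate(COMP)[::-1])):
        for frame in range(3):
            start = None
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open an ORF at the first ATG
                elif start is not None and codon in STOPS:
                    if (i + 3 - start) // 3 >= min_codons:
                        results.append((strand, frame, start, i + 3, seq[start:i + 3]))
                    start = None                   # close the ORF at the stop codon
    return results

print(find_orfs("ATGAAATAA"))  # [('+', 0, 0, 9, 'ATGAAATAA')]
```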
V. Genetic Mutations and Variations
Point Mutations (Substitutions):
Silent (Synonymous): Codon change but same amino acid (e.g., GAA→GAG both code Glu)—no protein change.
Missense (Nonsynonymous): Different amino acid (e.g., GAG (Glu)→GTG (Val))—effect depends on similarity.
Nonsense: Codon → stop codon (e.g., TGC→TGA)—truncated protein, often loss of function.
Insertions/Deletions (Indels):
Frameshift: Indel not a multiple of 3 shifts reading frame downstream, scrambles protein, often leading to an early stop—usually deleterious.
In-frame: Indel multiple of 3 adds/removes whole amino acids—impact varies by location and size.
Mutation Type | Description | Typical Consequence |
---|---|---|
Silent (Synonymous) | Codon changes but same amino acid | Usually no effect |
Missense (Nonsynonymous) | Codon changes to different amino acid | Benign to harmful depending on substitution |
Nonsense | Codon changes to stop | Premature termination; truncated protein |
Frameshift (Insertion/Deletion) | Indel not multiple of 3 | Alters downstream sequence; often nonfunctional protein |
In-frame Indel | Indel multiple of 3 | Adds/removes amino acids; varies in effect |
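The categories in the table can be checked mechanically against the standard genetic code. The sketch below is illustrative only: the function names are hypothetical, the "stop-loss" label is an extra category not listed in the table, and only single-codon substitutions and indel lengths are considered.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change as silent, missense, nonsense, or stop-loss."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"   # extra category, not part of the table above
    return "missense"

def classify_indel(length):
    """An indel keeps the reading frame only if its length is a multiple of 3."""
    if length <= 0:
        raise ValueError("indel length must be positive")
    return "in-frame" if length % 3 == 0 else "frameshift"

print(classify_substitution("GAA", "GAG"))  # silent
print(classify_substitution("GAG", "GTG"))  # missense
print(classify_substitution("TGC", "TGA"))  # nonsense
print(classify_indel(4))                    # frameshift
```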
VI. Bioinformatics Databases & Tools
Primary vs. Derived Databases:
Primary (Archive): Raw submissions (GenBank, EMBL, DDBJ, SRA, PDB)—may be redundant or uncurated.
Derived (Curated): Processed, non-redundant (RefSeq, UniProt/Swiss-Prot, CDD)—standardized, well-annotated.
GenBank vs. RefSeq:
GenBank: Submission-based, multiple entries per gene, accession.version (e.g., U12345.1).
RefSeq: Curated single records per biomolecule with stable accessions (e.g., NM_, NP_, NC_).
RefSeq Accession Prefixes:
Prefix | Type | Example |
---|---|---|
NC_ | Complete genomic molecule (chromosome, organelle, or viral genome) | NC_000001 |
NG_ | Genomic region (gene) | NG_007073 |
NM_ | Curated mRNA (transcript) | NM_000518 |
NP_ | Curated protein | NP_000509 |
NR_ | Curated non-coding RNA | NR_003285 |
XM_/XP_/XR_ | Model (predicted) sequences | XM_017001338, XP_017001338 |
Key NCBI Tools:
Entrez: Integrated search across sequence, literature (PubMed), structures.
ORFfinder: Identifies ORFs in DNA sequences.
BLAST: Heuristic local alignment search (BLASTN, BLASTP, BLASTX, TBLASTN/TBLASTX) with E-values indicating significance.
UCSC Genome Browser: Visualize genomic context and annotations.
VII. Sequence Alignment
7.1 Pairwise Alignment
Dot Plot: Visual match matrix; diagonal lines reveal similarity; adjustable stringency.
Dynamic Programming:
Needleman–Wunsch: Global end-to-end alignment.
Smith–Waterman: Local high-scoring segment alignment.
Scoring Schemes:
Matches: Positive scores.
Mismatches: Negative scores.
Gaps: Opening and extension penalties (e.g., –2, –1).
Percent Identity: (Identical matches / aligned length) × 100%.
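Percent identity can be computed directly from two rows of a gapped alignment. The sketch below assumes one common convention, in which columns containing a gap in either sequence are excluded from the aligned length; other tools count gap columns in the denominator, so reported values can differ. The function name is hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not compared:
        return 0.0
    matches = sum(x == y for x, y in compared)
    return 100.0 * matches / len(compared)

# 5 gap-free columns, 4 identical
print(percent_identity("ACGT-A", "ACCTGA"))  # 80.0
```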
7.2 Substitution Matrices (Proteins)
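PAM Matrices: Derived from accepted point mutations in closely related proteins; PAM1 corresponds to ~1% divergence and is extrapolated to higher numbers (e.g., PAM250) for distant comparisons.
BLOSUM Matrices: Empirical, derived from conserved blocks; BLOSUM62 is built from blocks clustered at 62% identity, and lower numbers (e.g., BLOSUM45) suit more divergent sequences. Choose the matrix based on expected divergence.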
7.3 BLAST
Method: Heuristic local alignment via “word” matches extended into High-Scoring Segment Pairs (HSPs).
E-value: Expected number of hits with a score at least this high that would occur by chance in a database of this size; lower values indicate greater significance.
Flavors:
BLASTN: Nucleotide vs. nucleotide.
BLASTP: Protein vs. protein.
BLASTX/TBLASTN/TBLASTX: Translated searches.
VIII. Multiple Sequence Alignment (MSA)
Definition & Purpose
Aligns three or more homologous sequences simultaneously so each column represents evolutionarily equivalent positions.
Key applications include identifying conserved motifs/domains, inferring functional sites, preparing input for phylogenetic analysis, and structure prediction.
Challenges vs. Pairwise Alignment
Exponential search space: aligning n sequences optimally is NP-hard.
Complexity increases steeply with sequence number and length.
1. Progressive Alignment
Workflow:
Compute all pairwise distances (e.g., percent identity or substitution-matrix scores).
Construct a guide tree (e.g., UPGMA or Neighbor-Joining) reflecting sequence relationships.
Align the most similar pair of sequences into a profile.
Iteratively align remaining sequences or profiles following the guide tree order.
Tools: ClustalW, Clustal Omega.
Advantages: Fast, scalable to large numbers of sequences.
Limitations: Errors in early pairwise alignments propagate (“once a gap, always a gap”).
2. Iterative Refinement & Consistency Methods
Iterative Refinement:
Repeatedly partition or realign subsets of sequences or the entire alignment to improve score (sum-of-pairs or likelihood).
Tools: MUSCLE, MAFFT. (PRANK, often grouped here, is better described as a phylogeny-aware progressive aligner.)
Consistency-Based Alignment:
Incorporate information from multiple pairwise alignments to enforce consistency across the MSA.
Tools: T-Coffee, ProbCons (uses hidden Markov models for posterior probabilities).
3. Template/Structure-Guided Alignment
Use known 3D structures to align sequences, preserving structural equivalences.
Tools: Expresso (3D-Coffee), PROMALS3D.
4. Scoring & Gap Penalties
Sum-of-Pairs Score: Sum of scores for all residue pairs in each column (using substitution matrices for proteins).
Profile-Profile Alignment: Aligns profiles (weighted residue frequencies) rather than raw sequences for later steps.
Gap Penalties:
Affine model: penalty = gap_open + (length – 1) × gap_extend, where gap_open covers the first gap position.
Adjust based on expected indel frequencies and structural regions.
5. Output Formats & Visualization
Formats: CLUSTAL (.aln), FASTA (with gaps), Stockholm (.sto).
Viewers: Jalview, MSAViewer, UGENE.
6. Quality Assessment & Post-Processing
Metrics: Column Score (fraction of columns aligned identically to a reference alignment), GUIDANCE scores, TCS (Transitive Consistency Score).
Filtering: trimAl, Gblocks—remove poorly aligned or divergent regions before downstream analyses.
Practical Tips
Choose algorithms based on dataset size, sequence similarity, and computational resources.
Inspect alignments manually; realign problematic regions.
Experiment with different substitution matrices (e.g., BLOSUM62 vs. BLOSUM45) and gap penalties.
Always trim unreliable columns prior to phylogenetic inference or motif discovery.
IX. Phylogenetic Tree Reconstruction
Concepts & Terminology
Leaf (Tip): Observed sequences.
Internal Node: Hypothetical ancestor.
Branch Length: Represents evolutionary change or time.
Topology: Branching order (unrooted vs. rooted).
Clade (Monophyletic Group): Ancestor plus all its descendants.
1. Input: MSA of Homologous Sequences
Quality of MSA directly impacts tree accuracy. Trim ambiguous regions.
2. Models of Sequence Evolution
Nucleotide Models: Jukes-Cantor, Kimura 2-parameter, General Time Reversible (GTR).
Protein Models: JTT, WAG, LG—incorporate amino acid substitution frequencies.
3. Tree-Building Methods
A. Distance-Based Methods
UPGMA (Unweighted Pair Group Method with Arithmetic Mean):
Assumes a molecular clock (constant rate); yields an ultrametric (rooted) tree.
Simple but unreliable if rates vary.
Neighbor-Joining (NJ):
No clock assumption; fast O(n³); produces an unrooted tree.
Widely used for exploratory analyses.
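To make the distance-based idea concrete, here is a minimal UPGMA sketch (average linkage weighted by cluster size). It is an illustration under simplifying assumptions, not a substitute for a phylogenetics package: the input format (a dict keyed by frozenset pairs of leaf names), the function name, and the three-decimal branch-length formatting are all choices made for this example.

```python
from itertools import combinations

def upgma(dist):
    """Build an ultrametric tree; dist maps frozenset({a, b}) -> distance.

    Returns a Newick string with branch lengths.
    """
    d = dict(dist)
    leaves = sorted({name for pair in dist for name in pair})
    # cluster id -> (newick text, number of leaves, node height)
    info = {name: (name, 1, 0.0) for name in leaves}
    while len(info) > 1:
        # pick the closest pair of current clusters
        a, b = min(combinations(sorted(info), 2), key=lambda p: d[frozenset(p)])
        (na, sa, ha), (nb, sb, hb) = info.pop(a), info.pop(b)
        height = d[frozenset((a, b))] / 2
        merged = f"({a}|{b})"
        newick = f"({na}:{height - ha:.3f},{nb}:{height - hb:.3f})"
        # size-weighted average distance from the new cluster to the others
        for c in info:
            d[frozenset((merged, c))] = (
                sa * d[frozenset((a, c))] + sb * d[frozenset((b, c))]
            ) / (sa + sb)
        info[merged] = (newick, sa + sb, height)
    return next(iter(info.values()))[0] + ";"

distances = {
    frozenset(("A", "B")): 2,
    frozenset(("A", "C")): 6,
    frozenset(("B", "C")): 6,
}
print(upgma(distances))  # ((A:1.000,B:1.000):2.000,C:3.000);
```

Every leaf ends up at the same distance from the root (3.0 here), which is the ultrametric property implied by the clock assumption.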
B. Character-Based Methods
Maximum Parsimony (MP):
Seeks the tree minimizing total character changes.
No explicit evolutionary model; vulnerable to long-branch attraction.
Maximum Likelihood (ML):
Finds the tree maximizing likelihood given a model of substitution.
Computationally intensive; tools: PhyML, RAxML, IQ-TREE.
Bayesian Inference:
Estimates the posterior distribution of trees under a model via MCMC.
Outputs clade posterior probabilities and can incorporate relaxed molecular clocks (BEAST, MrBayes).
4. Branch Support & Validation
Bootstrap Analysis:
Resample alignment columns with replacement to create replicates.
Reconstruct trees; the percentage of replicates supporting each clade equals the bootstrap value.
Values ≥70% are commonly treated as strong support, although this threshold is a convention rather than a strict rule.
Posterior Probabilities: From Bayesian inference; values >0.95 indicate strong support.
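The resampling and counting steps of a bootstrap analysis can be sketched as below. Tree building itself is left out: the per-replicate clade sets are assumed to come from whichever method is being bootstrapped. The function names and the data layout (alignment as a dict of equal-length strings, clades as frozensets of taxon names) are hypothetical.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement, keeping the same length."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}

def bootstrap_support(clade, replicate_clades):
    """Percentage of replicate trees that contain the given clade.

    replicate_clades: one set of frozensets (clades) per replicate tree.
    """
    target = frozenset(clade)
    hits = sum(1 for clades in replicate_clades if target in clades)
    return 100.0 * hits / len(replicate_clades)

aln = {"A": "ACGT", "B": "ACGA"}
replicate = bootstrap_alignment(aln, random.Random(0))  # seeded for repeatability

replicates = [
    {frozenset({"B", "C"})},
    {frozenset({"A", "B"})},
    {frozenset({"B", "C"})},
    {frozenset({"B", "C"})},
]
print(bootstrap_support({"B", "C"}, replicates))  # 75.0
```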
5. Tree Formats & Visualization
Newick Format:
(A:0.1,(B:0.2,C:0.2):0.3);
Nexus Format: Includes metadata, alignment blocks, and tree blocks.
Visualization: FigTree, iTOL (Interactive Tree Of Life), Dendroscope, ETE Toolkit.
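A Newick string such as the one above can be read with a small recursive parser. This sketch handles only plain names and branch lengths (no quoted labels, comments, or support values) and does little validation of malformed input; the function names and the dict-based node layout are hypothetical.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1
            children.append(node())
            while text[pos] == ",":
                pos += 1
                children.append(node())
            if text[pos] != ")":
                raise ValueError("expected ')'")
            pos += 1
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = node()
    if pos != len(text):
        raise ValueError("unexpected trailing text")
    return root

def leaf_depths(tree, depth=0.0):
    """Map each leaf name to its summed branch length from the root."""
    depth += tree["length"] or 0.0
    if not tree["children"]:
        return {tree["name"]: depth}
    depths = {}
    for child in tree["children"]:
        depths.update(leaf_depths(child, depth))
    return depths

tree = parse_newick("(A:0.1,(B:0.2,C:0.2):0.3);")
print(sorted(leaf_depths(tree)))  # ['A', 'B', 'C']
```

For this tree, A sits 0.1 from the root while B and C each sit 0.2 + 0.3 from it.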
6. Rooting & Molecular Clocks
Rooting:
Outgroup Rooting: Include a known distant relative.
Midpoint Rooting: Place the root at the midpoint of the longest path.
Molecular Clock Models:
Strict Clock: Assumes a constant rate; calibrate with fossil or sampling dates.
Relaxed Clock: Allows rate variation among branches.
Tools: BEAST for dating analyses.
7. Interpretation & Best Practices
Topology vs. Branch Lengths: Topology shows relationships; lengths indicate the amount of change.
Monophyly, Paraphyly, Polyphyly: Understand clade definitions.
Pitfalls:
Poor alignment regions lead to misleading branches.
Model misspecification leads to incorrect likelihoods or posterior probabilities.
Long-branch attraction especially affects parsimony.
Recommendations:
Use multiple tree-building approaches to compare topologies.
Employ adequate substitution models and partitioning (e.g., codon positions).
Report support values; collapse poorly supported nodes.
X. Practical Applications and Hands-On Examples
X.1. Transcription & Translation Basics
MCQs Key Answers: 1) OH group; 2) b, d; 3) UAA stop codon.
Translation: Transcribe DNA → mRNA (T → U), locate the first AUG to set the reading frame, then read codons 5′→3′ until a stop codon.
Example:
GAGCCAUGCAUUAUCUAGAUAGUAGGCUCUGAGAAUUUAUCUC
→ Met-His-Tyr-Leu-Asp-Ser-Arg-Leu.
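The worked example can be reproduced with a short translation routine. This is a minimal sketch assuming the standard genetic code and translation from the first AUG in the sequence; the function name is hypothetical.

```python
from itertools import product

# Standard genetic code for RNA codons, bases ordered U, C, A, G ("*" = stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return []
    peptide = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        peptide.append(THREE_LETTER[aa])
    return peptide

mrna = "GAGCCAUGCAUUAUCUAGAUAGUAGGCUCUGAGAAUUUAUCUC"
print("-".join(translate(mrna)))  # Met-His-Tyr-Leu-Asp-Ser-Arg-Leu
```

The AUG sits at position 6 of the sequence, and the ninth codon in that frame (UGA) ends translation.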
Promoter Location: Upstream of the transcription start site.
Mutations:
Upstream insertion (5′ UTR) → no protein change.
In-frame substitution (synonymous: AAU ⇄ AAC both Asn) → silent mutation.
Frameshift insertion in CDS → premature stop, truncated protein.
Stop-codon suppression by mutant tRNA → elongated protein.
X.2. Point Mutations & Splicing
Deletion Effects:
4-nt deletion → removes ≥1 aa + frameshift → likely truncated, nonfunctional.
3-nt deletion → removes 1 aa; possible amino acid substitution if crossing codons.
Insertion of 1 nt near start/end → frameshift → high probability of harm.
Substitution Types:
Silent (synonymous)
Conservative missense (similar property amino acid)
Non-conservative missense (different property amino acid)
Nonsense → early stop.
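Splice-site mutation: A single intronic base change abolishes the original acceptor site (AG); a downstream AG is used instead, skipping an exon and producing an mRNA 173 nt shorter. Because 173 is not a multiple of 3, the downstream reading frame also shifts.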
X.3. Beta-Globin Cluster & Allele Variation
Codon Translation:
atg gtg cac ctg act cct gag gag aag
→ MVHLTPEEK.
Variant at position 7 (counting the initiator Met as position 1): E → V if codon GAG → GTG.
Cluster Gene Order (5′→3′): ε – γ-G – γ-A – δ – β.
Strand Orientation: For genes on the reverse strand, alleles reported on the forward (reference) strand appear as complements of the coding-strand bases (A ↔ T, C ↔ G).
Allele Observations: Non-synonymous mutation changes residue; lack of reported allele frequency signals potential data error.
X.4. Gene Structure & Thalassemia
Gene Anatomy:
5′ UTR – exon 1 – intron – exon 2 – intron – exon 3 – 3′ UTR – poly-A signal/tail.
Transcription starts at the 5′ end of the 5′ UTR; translation typically begins at the first AUG after the 5′ UTR, in exon 1.
Beta-Thalassemia Mutations:
Nonsense (GAG → TAG in exon) → β⁰, no beta-globin.
Single-nt insertion → frameshift → β⁰.
4-nt deletion → frameshift → β⁰.
Splice-site (IVS I) G → A → aberrant splicing → β⁺, reduced beta-globin.
X.5. Synonymous Substitutions & Conservative Changes
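Synonymous (Silent) Substitutions: Occur most often at the third (wobble) codon position, though some first-position changes are also synonymous (e.g., CUA ↔ UUA, both Leu); protein sequence unchanged.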
Transition vs. Transversion:
Transition: purine ↔ purine (A ↔ G), pyrimidine ↔ pyrimidine (C ↔ T).
Transversion: purine ↔ pyrimidine.
Observation: Transitions are more frequent than transversions.
Conservative Amino Acid Changes:
Basic: Arg ↔ Lys; nonpolar: Ile, Val, Leu, Met.
Phylogenetic Signal: Shared gaps (indels) can mark clades (e.g., Afrotheria within Eutheria, versus marsupials).
X.6. HIV/SIV Phylogenetics
Data: 9,176 bp HIV-1 vs. SIV sequences; 9 protein-coding genes.
Tree: Strong bootstrap support shows HIV-1 clusters with SIVcpz, HIV-2 with non-chimpanzee SIVs.
Bootstrap Analysis: Percentage of replicates supporting each clade (≥70% indicates strong support).
X.7. Mitochondrial & MYH16 Pseudogene Analysis
Sequences: Human, Neanderthal, Denisovan, Pan, Gorilla, etc.
Newick Example:
(((Pan_paniscus:0.000001,Gorilla_gorilla:0.000001):0.06,(Homo_sapiens:0.000001,Pongo_pygmaeus:0.000001):0.12):0.12,Pan_troglodytes:0.000001);
Interpretation: Closest relatives are the taxa that share the most recent common node; the evolutionary distance between two taxa is the sum of branch lengths along the path connecting them.
Reference Accessions:
Human: NC_012920.1; Neanderthal: NC_011137.1; Denisovan: NC_013993.1.
Percent Identity: Human–Neanderthal ~98.72%; Human–Denisovan ~97.61%.
Residue Change: Isoleucine (I) → Valine (V) conservative mutation at specified codons.
Pseudogene 2-base Deletion: BK001410 MYH16 frameshift; truncated pseudogene.
XI. Essential Formulas (with Descriptions)
Percent Identity
Percent Identity = (Number of identical aligned positions / Aligned length) × 100%
Identical aligned positions: count of positions where residues match exactly.
Aligned length: total non-gap columns compared.
Percent Similarity
Percent Similarity = (Number of similar (conservative) residue pairs / Aligned length) × 100%
Similar pairs: substitutions between biochemically similar residues (e.g., Ile ↔ Val).
Transition/Transversion Ratio (κ)
κ = #Transitions (A ↔ G, C ↔ T) / #Transversions (purine ↔ pyrimidine)
Affine Gap Penalty
GapPenalty = GapOpen + (L – 1) × GapExtend
GapOpen: penalty for introducing a new gap.
GapExtend: penalty for each additional position in the same gap.
L: length of the gap in residues.
Alignment Score (S)
S = Σₖ s(aₖ,bₖ) – Σ_g [GapOpen + (L_g – 1) × GapExtend], for g = 1…G
s(aₖ,bₖ): substitution score for aligned residues.
G: number of gaps.
L_g: length of the g-th gap.
Gap penalties are positive quantities and are subtracted, consistent with the – d terms in the recurrences below.
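The affine penalty and the alignment score can be combined in one short function. This sketch assumes penalties are positive values subtracted from the substitution total, uses simple match/mismatch scores in place of a substitution matrix, and treats a run of consecutive gap columns in the same sequence as one gap. The function names and default values are hypothetical.

```python
def affine_gap_penalty(length, gap_open=2, gap_extend=1):
    """GapOpen + (L - 1) * GapExtend for a gap of L positions (0 if no gap)."""
    return gap_open + (length - 1) * gap_extend if length > 0 else 0

def alignment_score(a, b, match=1, mismatch=-1, gap_open=2, gap_extend=1):
    """Score two aligned rows: substitution scores minus affine gap penalties."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    run_len, run_side = 0, None          # current gap run and which row holds it
    for x, y in zip(a, b):
        side = "a" if x == "-" else "b" if y == "-" else None
        if side is not None and side == run_side:
            run_len += 1                 # extend the current gap
            continue
        score -= affine_gap_penalty(run_len, gap_open, gap_extend)  # close any open gap
        run_len, run_side = (1, side) if side else (0, None)
        if side is None:
            score += match if x == y else mismatch
    score -= affine_gap_penalty(run_len, gap_open, gap_extend)      # gap at the end
    return score

# 4 matches (+4), one gap of length 1 (-2), one gap of length 2 (-3)
print(alignment_score("ACGT--A", "AC-TGGA"))  # -1
```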
Needleman–Wunsch Recurrence (Global Alignment)
F(i,j) = max{ F(i-1,j-1) + s(xᵢ,yⱼ), F(i-1,j) – d, F(i,j-1) – d }
F(i,j): best score for prefixes x₁…xᵢ and y₁…yⱼ.
s(xᵢ,yⱼ): match/mismatch score.
d: gap penalty.
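The recurrence translates almost line for line into code. The sketch below uses a linear gap penalty d and simple match/mismatch scores (assumed values, standing in for a substitution matrix), fills the F matrix, and traces back one optimal alignment; when several alignments tie, it prefers the diagonal move. The function name is hypothetical.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i                 # leading gaps in y
    for j in range(1, m + 1):
        F[0][j] = -d * j                 # leading gaps in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Traceback from the bottom-right corner to the origin.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```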
Smith–Waterman Recurrence (Local Alignment)
H(i,j) = max{ 0, H(i-1,j-1) + s(xᵢ,yⱼ), H(i-1,j) – d, H(i,j-1) – d }
H(i,j): best local alignment score ending at xᵢ and yⱼ.
Zero bound: a cell never goes negative, so a local alignment can restart at any position.
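A local-alignment sketch differs from the global one in three places: the zero floor, starting the traceback at the highest-scoring cell, and stopping when a zero is reached. As before, the match/mismatch scores and linear gap penalty d are assumed values and the function name is hypothetical.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment; returns (best_score, aligned_x_segment, aligned_y_segment)."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] - d, H[i][j - 1] - d)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the best cell until the score drops to zero.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

print(smith_waterman("TTTACGTAAA", "GGACGTCC"))  # (8, 'ACGT', 'ACGT')
```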
p-distance
p = Number of differing sites / Total sites compared
Jukes–Cantor Model Distance
d = -¾ × ln(1 – (4/3) × p)
p: observed proportion of differences.
Kimura 2-Parameter Distance
d = -½ ln(1 – 2P – Q) – ¼ ln(1 – 2Q)
P: proportion of transitions.
Q: proportion of transversions.
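The p-distance, Jukes–Cantor, and Kimura 2-parameter formulas can be computed from a pair of aligned nucleotide sequences as sketched below. Sites where either sequence has a gap or an ambiguous base are skipped, which is an assumption of this example; the function names are hypothetical.

```python
import math

PURINES = {"A", "G"}

def site_proportions(a, b):
    """Return (P, Q): proportions of transitions and transversions."""
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    # A transition keeps the base class (purine or pyrimidine); a transversion swaps it.
    transitions = sum(1 for x, y in pairs
                      if x != y and (x in PURINES) == (y in PURINES))
    transversions = sum(1 for x, y in pairs
                        if x != y and (x in PURINES) != (y in PURINES))
    return transitions / len(pairs), transversions / len(pairs)

def p_distance(a, b):
    P, Q = site_proportions(a, b)
    return P + Q

def jukes_cantor(p):
    """d = -3/4 * ln(1 - 4p/3); defined only for p < 0.75."""
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(P, Q):
    """d = -1/2 * ln(1 - 2P - Q) - 1/4 * ln(1 - 2Q)."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

a = "AAAAAAAAAA"
b = "GAAAAAAAAC"   # one transition (A->G) and one transversion (A->C)
P, Q = site_proportions(a, b)
print(round(p_distance(a, b), 4))   # 0.2
print(round(jukes_cantor(P + Q), 4))
print(round(kimura_2p(P, Q), 4))
```

Both corrected distances come out larger than the raw p-distance, since they account for multiple substitutions at the same site.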
Bootstrap Support (%)
%Bootstrap = (#Replicates supporting clade / Total replicates) × 100%
BLAST E-value
E = K × m × n × e^(−λS)
m, n: query and database lengths.
S: raw score.
λ, K: statistical constants.
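The E-value formula is simple to evaluate once λ and K are known. In the sketch below the function name is hypothetical, and the λ and K values in the usage example are illustrative placeholders only; real values depend on the scoring matrix and gap penalties and are reported by BLAST itself.

```python
import math

def blast_evalue(raw_score, m, n, lam, K):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)

# Illustrative parameter values (assumed for this example)
lam, K = 0.267, 0.041
query_len, db_len = 300, 50_000_000

for score in (50, 100, 150):
    print(score, blast_evalue(score, query_len, db_len, lam, K))
```

The E-value falls exponentially as the score rises and grows linearly with the size of the search space (m × n), which is why the same score is less significant in a larger database.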
Reading Frame Count
6 (three reading frames per strand × two strands); each frame may contain ORFs.
Total Codons
4³ = 64 possible codons (61 sense codons + 3 stop codons)