Molecular Biology and Bioinformatics Principles

I. Molecular Biology Fundamentals

Genes and Genomes

  • Nearly every cell carries a complete set of genetic instructions, the genome, encoded in DNA and organized into genes on chromosomes. A gene is a specific DNA sequence that encodes a functional product (usually a protein or an RNA). Genetic variation (mutations) underlies heritable phenotypic differences; environmental factors also contribute to traits.

DNA vs. RNA

Feature | DNA (Deoxyribonucleic Acid) | RNA (Ribonucleic Acid)
Sugar | Deoxyribose (no 2′–OH group) | Ribose (has 2′–OH group)
Nitrogenous Bases | A, T, C, G | A, U, C, G (uracil replaces thymine)
Structure | Usually double-stranded helix | Usually single-stranded (can form loops/folds)
Stability | More stable (long-term storage) | Less stable (transient transcripts)
Cellular Location | Nucleus (eukaryotes), nucleoid (prokaryotes) | Nucleus (as pre-mRNA) & cytoplasm (mRNA, tRNA)
Primary Function | Genetic blueprint | Messenger (mRNA), transfer (tRNA), ribosomal (rRNA) roles

II. Central Dogma of Molecular Biology

Flow of Information:

  1. Replication: DNA → DNA

  2. Transcription: DNA → RNA

  3. Translation: RNA → Protein

Exceptions (e.g., reverse transcription in retroviruses) exist, but sequence information does not flow from protein back to nucleic acid.


III. Transcription (DNA → RNA)

  • Enzyme: RNA polymerase reads the template strand (3′→5′) and synthesizes RNA (5′→3′).

  • Coding vs. Template Strands:

    • Coding strand has same sequence as RNA (T→U).

    • Template strand is complementary and used for synthesis.

  • Eukaryotic Processing: 5′ cap, 3′ poly-A tail, and splicing out introns yield mature mRNA.

  • Prokaryotes lack a nucleus—transcription and translation are coupled, and mRNAs are used directly without capping or splicing.
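
The coding/template relationship above can be sketched in a few lines of Python. This is only an illustration of the base-pairing rules, not a model of the enzyme. The function names and the nine-base example are assumptions made for the demonstration, and only the bases A, C, G, T are handled.

```python
# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, written 5'->3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def transcribe_coding(coding_5to3: str) -> str:
    """mRNA has the coding-strand sequence with T replaced by U."""
    return coding_5to3.upper().replace("T", "U")


def transcribe_template(template_3to5: str) -> str:
    """Pair each template base (read 3'->5') with its RNA partner,
    which builds the mRNA in the 5'->3' direction."""
    pair = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pair[base] for base in template_3to5.upper())


coding = "ATGGTGCAC"      # hypothetical coding strand, 5'->3'
template = "TACCACGTG"    # the same locus, template strand written 3'->5'

print(transcribe_coding(coding))
print(transcribe_template(template))
print(reverse_complement(coding))   # template strand written 5'->3'
```

Both transcription calls give the same mRNA, which is the point of the coding/template distinction: the coding strand already reads like the transcript, while the template strand is what the polymerase actually pairs against.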


IV. Translation (RNA → Protein)

  • Ribosome ‘reads’ mRNA codons (three-base units) in the 5′→3′ direction.

  • Start codon: AUG (Methionine) establishes reading frame.

  • Stop codons: UAA, UAG, UGA terminate synthesis.

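  • tRNAs with complementary anticodons deliver specific amino acids to the ribosome: a charged tRNA enters at the A (aminoacyl) site, the growing chain is held at the P (peptidyl) site, and the uncharged tRNA leaves through the E (exit) site.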

  • Open Reading Frame (ORF): Region from start to stop codon; six possible frames (3 per strand). Tools like NCBI ORFfinder identify ORFs in a sequence.
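
As a rough sketch of the rules above, the snippet below translates an mRNA from its first AUG until a stop codon. It assumes the standard genetic code, written as a compact 64-letter string with bases in the order U, C, A, G. The function name is hypothetical, and the input is the mRNA example from section X.1.

```python
from itertools import product

# Standard genetic code; codons enumerated with bases in the order U, C, A, G.
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_from_first_aug(mrna: str) -> str:
    """Translate from the first AUG until a stop codon or the end of the mRNA.
    Returns one-letter amino acid codes; an empty string if there is no AUG."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":          # UAA, UAG or UGA
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = "GAGCCAUGCAUUAUCUAGAUAGUAGGCUCUGAGAAUUUAUCUC"
print(translate_from_first_aug(mrna))   # MHYLDSRL
```

The result MHYLDSRL is Met-His-Tyr-Leu-Asp-Ser-Arg-Leu in three-letter form. The sketch scans only the forward strand from the first AUG; a full ORF search would repeat this for all six reading frames.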


V. Genetic Mutations and Variations

Point Mutations (Substitutions):

  • Silent (Synonymous): Codon change but same amino acid (e.g., GAA→GAG both code Glu)—no protein change.

  • Missense (Nonsynonymous): Different amino acid (e.g., GAG (Glu)→GTG (Val))—effect depends on similarity.

  • Nonsense: Codon → stop codon (e.g., TGC→TGA)—truncated protein, often loss of function.

Insertions/Deletions (Indels):

  • Frameshift: Indel not a multiple of 3 shifts reading frame downstream, scrambles protein, often leading to an early stop—usually deleterious.

  • In-frame: Indel multiple of 3 adds/removes whole amino acids—impact varies by location and size.

Mutation Type | Description | Typical Consequence
Silent (Synonymous) | Codon changes but same amino acid | Usually no effect
Missense (Nonsynonymous) | Codon changes to different amino acid | Benign to harmful depending on substitution
Nonsense | Codon changes to stop | Premature termination; truncated protein
Frameshift (Insertion/Deletion) | Indel not multiple of 3 | Alters downstream sequence; often nonfunctional protein
In-frame Indel | Indel multiple of 3 | Adds/removes amino acids; varies in effect
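
The categories in this section can be checked mechanically for a single codon. The following is a minimal sketch assuming the standard genetic code and DNA codons (T rather than U). The function names are hypothetical, and a stop-to-sense change is reported as missense here for simplicity.

```python
from itertools import product

# Standard genetic code; DNA codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Label a codon substitution as silent, nonsense or missense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


def classify_indel(length: int) -> str:
    """An indel keeps the reading frame only if its length is a multiple of 3."""
    return "in-frame" if length % 3 == 0 else "frameshift"


print(classify_substitution("GAA", "GAG"))   # silent   (Glu -> Glu)
print(classify_substitution("GAG", "GTG"))   # missense (Glu -> Val)
print(classify_substitution("TGC", "TGA"))   # nonsense (Cys -> stop)
print(classify_indel(4), classify_indel(3))  # frameshift in-frame
```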

VI. Bioinformatics Databases & Tools

Primary vs. Derived Databases:

  • Primary (Archive): Raw submissions (GenBank, EMBL, DDBJ, SRA, PDB)—may be redundant or uncurated.

  • Derived (Curated): Processed, non-redundant (RefSeq, UniProt/Swiss-Prot, CDD)—standardized, well-annotated.

GenBank vs. RefSeq:

  • GenBank: Submission-based, multiple entries per gene, accession.version (e.g., U12345.1).

  • RefSeq: Curated single records per biomolecule with stable accessions (e.g., NM_, NP_, NC_).

RefSeq Accession Prefixes:

Prefix | Type | Example
NC_ | Chromosome/genome assembly | NC_000001
NG_ | Genomic region (gene) | NG_007073
NM_ | Curated mRNA (transcript) | NM_000518
NP_ | Curated protein | NP_000509
NR_ | Curated non-coding RNA | NR_003285
XM_/XP_/XR_ | Model (predicted) sequences | XM_017001338, XP_017001338
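
The prefix table lends itself to a simple lookup. The sketch below splits an accession.version string and reports the record type. The dictionary mirrors the table above and is not an exhaustive list of RefSeq prefixes, and the function name is hypothetical.

```python
# Record types keyed by RefSeq prefix, copied from the table above.
REFSEQ_PREFIXES = {
    "NC_": "chromosome/genome assembly",
    "NG_": "genomic region (gene)",
    "NM_": "curated mRNA (transcript)",
    "NP_": "curated protein",
    "NR_": "curated non-coding RNA",
    "XM_": "model (predicted) mRNA",
    "XP_": "model (predicted) protein",
    "XR_": "model (predicted) non-coding RNA",
}


def describe_accession(accession: str):
    """Split 'ACCESSION.VERSION' and look up the three-character prefix.
    Returns (accession, version or None, description)."""
    base, _, version = accession.partition(".")
    kind = REFSEQ_PREFIXES.get(base[:3], "not in the RefSeq prefix table")
    return base, version or None, kind


print(describe_accession("NC_012920.1"))
print(describe_accession("U12345.1"))
```

The second call uses the GenBank-style example from above, which carries no RefSeq prefix and so falls through to the default description.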

Key NCBI Tools:

  • Entrez: Integrated search across sequence, literature (PubMed), structures.

  • ORFfinder: Identifies ORFs in DNA sequences.

  • BLAST: Heuristic local alignment search (BLASTN, BLASTP, BLASTX, TBLASTN/TBLASTX) with E-values indicating significance.

  • UCSC Genome Browser: Visualize genomic context and annotations (hosted by UC Santa Cruz rather than NCBI, but commonly used alongside the NCBI tools).


VII. Sequence Alignment

7.1 Pairwise Alignment

  • Dot Plot: Visual match matrix; diagonal lines reveal similarity; adjustable stringency.

  • Dynamic Programming:

    • Needleman–Wunsch: Global end-to-end alignment.

    • Smith–Waterman: Local high-scoring segment alignment.

  • Scoring Schemes:

    • Matches: Positive scores.

    • Mismatches: Negative scores.

    • Gaps: Opening and extension penalties (e.g., –2, –1).

  • Percent Identity: (Identical matches / aligned length) × 100%.
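
A compact sketch of the two dynamic-programming methods follows. It computes only the optimal score (no traceback) with a linear gap penalty, matching the Needleman–Wunsch and Smith–Waterman recurrences in section XI. The default match, mismatch, and gap values are illustrative assumptions, and the gap penalty is a positive magnitude that is subtracted. The percent-identity helper counts only columns where neither sequence has a gap, which is one of several conventions in use.

```python
def alignment_score(x, y, match=1, mismatch=-1, gap=2, local=False):
    """Optimal global (Needleman-Wunsch) or local (Smith-Waterman) score."""
    rows, cols = len(x) + 1, len(y) + 1
    F = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment: leading gaps are penalised.
        for i in range(1, rows):
            F[i][0] = -gap * i
        for j in range(1, cols):
            F[0][j] = -gap * j
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if x[i - 1] == y[j - 1] else mismatch
            score = max(F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                        F[i - 1][j] - gap,     # gap in y
                        F[i][j - 1] - gap)     # gap in x
            if local:
                score = max(0, score)          # restart when the score goes negative
            F[i][j] = score
            best = max(best, score)
    return best if local else F[-1][-1]


def percent_identity(a, b):
    """Percent identity of two gapped, equal-length aligned strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(p, q) for p, q in zip(a, b) if p != "-" and q != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(p == q for p, q in pairs) / len(pairs)


print(alignment_score("ACGT", "AGT"))                          # 1
print(alignment_score("TTTACGTTT", "GGGACGGGG", local=True))   # 3
print(percent_identity("ACGT", "ACGA"))                        # 75.0
```

The first call needs one gap (three matches minus one gap penalty of 2), while the local call finds only the shared ACG segment and ignores the unrelated flanks.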

7.2 Substitution Matrices (Proteins)

  • PAM Matrices: Based on accepted mutations (PAM1 for ~1% divergence; PAM250 for distant comparisons).

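  • BLOSUM Matrices: Empirical, derived from conserved blocks; e.g., BLOSUM62 is built from sequences clustered at 62% identity. Choose the matrix based on expected divergence (lower BLOSUM numbers or higher PAM numbers for more distant sequences).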

7.3 BLAST

  • Method: Heuristic local alignment via “word” matches extended into High-Scoring Segment Pairs (HSPs).

  • E-value: Expected number of alignments scoring at least as well that would arise by chance in a database of that size; lower values indicate greater significance.

  • Flavors:

    • BLASTN: Nucleotide vs. nucleotide.

    • BLASTP: Protein vs. protein.

    • BLASTX/TBLASTN/TBLASTX: Translated searches.
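
To make the E-value concrete, here is a small sketch of the formula E = K × m × n × e^(−λS) from section XI. The default K and λ values are placeholders for illustration only; real values depend on the scoring matrix and gap costs, and BLAST uses effective rather than raw lengths. The function name is hypothetical.

```python
import math


def blast_evalue(raw_score, query_len, db_len, K=0.041, lam=0.267):
    """Expected number of chance hits scoring >= raw_score.
    K and lam are placeholder constants, not values for any specific search."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


low = blast_evalue(raw_score=50, query_len=100, db_len=1_000_000)
high = blast_evalue(raw_score=100, query_len=100, db_len=1_000_000)
bigger_db = blast_evalue(raw_score=50, query_len=100, db_len=2_000_000)

print(high < low)                 # a higher score gives a smaller E-value
print(bigger_db / low)            # doubling the database doubles E
```

The second comparison shows why the same alignment score becomes less significant as the database grows.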


VIII. Multiple Sequence Alignment (MSA)

Definition & Purpose

  • Aligns three or more homologous sequences simultaneously so each column represents evolutionarily equivalent positions.

  • Key applications include identifying conserved motifs/domains, inferring functional sites, preparing input for phylogenetic analysis, and structure prediction.

Challenges vs. Pairwise Alignment

  • Exponential search space: aligning n sequences optimally is NP-hard.

  • Complexity increases steeply with sequence number and length.

1. Progressive Alignment

  • Workflow:

    1. Compute all pairwise distances (e.g., percent identity or substitution-matrix scores).

    2. Construct a guide tree (e.g., UPGMA or Neighbor-Joining) reflecting sequence relationships.

    3. Align the most similar pair of sequences into a profile.

    4. Iteratively align remaining sequences or profiles following the guide tree order.

  • Tools: ClustalW, Clustal Omega.

  • Advantages: Fast, scalable to large numbers of sequences.

  • Limitations: Errors in early pairwise alignments propagate (“once a gap, always a gap”).

2. Iterative Refinement & Consistency Methods

  • Iterative Refinement:

    • Repeatedly partition or realign subsets of sequences or the entire alignment to improve score (sum-of-pairs or likelihood).

    • Tools: MUSCLE, MAFFT, PRANK.

  • Consistency-Based Alignment:

    • Incorporate information from multiple pairwise alignments to enforce consistency across the MSA.

    • Tools: T-Coffee, ProbCons (uses hidden Markov models for posterior probabilities).

3. Template/Structure-Guided Alignment

  • Use known 3D structures to align sequences, preserving structural equivalences.

  • Tools: Expresso (3D-Coffee), PROMALS3D.

4. Scoring & Gap Penalties

  • Sum-of-Pairs Score: Sum of scores for all residue pairs in each column (using substitution matrices for proteins).

  • Profile-Profile Alignment: Aligns profiles (weighted residue frequencies) rather than raw sequences for later steps.

  • Gap Penalties:

    • Affine model: penalty = gap_open + (length – 1) × gap_extend, so the first gap position costs gap_open and each additional position costs gap_extend.

    • Adjust based on expected indel frequencies and structural regions.
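
The sum-of-pairs score described above can be sketched as follows. For brevity this uses a simple match/mismatch scheme instead of a substitution matrix and charges a flat score for each residue-gap pair. Those scores, and the choice to score gap-gap pairs as zero, are assumptions made for the illustration.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols from the same alignment column."""
    if a == "-" and b == "-":
        return 0                      # gap-gap pairs are ignored here
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Sum pair scores over every pair of rows in every column."""
    if len(set(map(len, alignment))) > 1:
        raise ValueError("all aligned sequences must have the same length")
    total = 0
    for column in zip(*alignment):
        total += sum(pair_score(a, b) for a, b in combinations(column, 2))
    return total


msa = ["ACG",
       "ACG",
       "A-G"]
print(sum_of_pairs(msa))   # 3 + (1 - 2 - 2) + 3 = 3
```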

5. Output Formats & Visualization

  • Formats: CLUSTAL (.aln), FASTA (with gaps), Stockholm (.sto).

  • Viewers: Jalview, MSAViewer, UGENE.

6. Quality Assessment & Post-Processing

  • Metrics: Column Score (fraction of columns aligned identically to a reference alignment), GUIDANCE scores, TCS (Transitive Consistency Score).

  • Filtering: trimAl, Gblocks—remove poorly aligned or divergent regions before downstream analyses.

Practical Tips

  • Choose algorithms based on dataset size, sequence similarity, and computational resources.

  • Inspect alignments manually; realign problematic regions.

  • Experiment with different substitution matrices (e.g., BLOSUM62 vs. BLOSUM45) and gap penalties.

  • Always trim unreliable columns prior to phylogenetic inference or motif discovery.


IX. Phylogenetic Tree Reconstruction

Concepts & Terminology

  • Leaf (Tip): Observed sequences.

  • Internal Node: Hypothetical ancestor.

  • Branch Length: Represents evolutionary change or time.

  • Topology: Branching order (unrooted vs. rooted).

  • Clade (Monophyletic Group): Ancestor plus all its descendants.

1. Input: MSA of Homologous Sequences

  • Quality of MSA directly impacts tree accuracy. Trim ambiguous regions.

2. Models of Sequence Evolution

  • Nucleotide Models: Jukes-Cantor, Kimura 2-parameter, General Time Reversible (GTR).

  • Protein Models: JTT, WAG, LG—incorporate amino acid substitution frequencies.
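
For the two simplest nucleotide models, the corrected distances follow directly from the formulas in section XI. The sketch below counts transitions and transversions between two aligned sequences and applies the Jukes-Cantor and Kimura 2-parameter corrections. Function names are hypothetical, and columns containing gaps or ambiguous bases are skipped, which is an assumption of this sketch.

```python
import math

PURINES = {"A", "G"}


def site_proportions(a, b):
    """Return (P, Q): proportions of transitions and transversions."""
    transitions = transversions = sites = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue                          # skip gaps and ambiguity codes
        sites += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):  # both purines or both pyrimidines
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    return transitions / sites, transversions / sites


def jukes_cantor(p):
    """d = -3/4 ln(1 - 4p/3); only defined for p < 0.75."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P, Q):
    """d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q)."""
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)


P, Q = site_proportions("AAAAAAAAAA", "GAAAAAAAAC")
p = P + Q                                     # p-distance
print(p, jukes_cantor(p), kimura_2p(P, Q))
```

Both corrected distances come out larger than the raw p-distance, since the models account for multiple substitutions at the same site.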

3. Tree-Building Methods

A. Distance-Based Methods

  • UPGMA (Unweighted Pair Group Method with Arithmetic Mean):

    • Assumes a molecular clock (constant rate); yields an ultrametric (rooted) tree.

    • Simple but unreliable if rates vary.

  • Neighbor-Joining (NJ):

    • No clock assumption; fast O(n³); produces an unrooted tree.

    • Widely used for exploratory analyses.
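
UPGMA is simple enough to sketch directly. The version below returns only the rooted topology as nested tuples and omits branch heights. The input format (a dictionary keyed by frozenset pairs) and the function name are assumptions of this sketch, not a standard interface.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster labels by UPGMA.
    names: list of labels; dist: {frozenset({a, b}): distance}.
    Returns the rooted topology as nested tuples."""
    clusters = {name: 1 for name in names}    # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        for other in clusters:
            # Size-weighted average equals the arithmetic mean over all leaf pairs.
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


distances = {
    frozenset(("A", "B")): 2, frozenset(("C", "D")): 4,
    frozenset(("A", "C")): 10, frozenset(("A", "D")): 10,
    frozenset(("B", "C")): 10, frozenset(("B", "D")): 10,
}
print(upgma(["A", "B", "C", "D"], distances))   # (('A', 'B'), ('C', 'D'))
```

Because each merge averages distances without any rate correction, the result is only trustworthy when the molecular clock assumption roughly holds. Neighbor-Joining replaces the closest-pair criterion with a rate-corrected one.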

B. Character-Based Methods

  • Maximum Parsimony (MP):

    • Seeks the tree minimizing total character changes.

    • No explicit evolutionary model; vulnerable to long-branch attraction.

  • Maximum Likelihood (ML):

    • Finds the tree maximizing likelihood given a model of substitution.

    • Computationally intensive; tools: PhyML, RAxML, IQ-TREE.

  • Bayesian Inference:

    • Estimates the posterior distribution of trees under a model via MCMC.

    • Outputs clade posterior probabilities and can incorporate relaxed molecular clocks (BEAST, MrBayes).

4. Branch Support & Validation

  • Bootstrap Analysis:

    • Resample alignment columns with replacement to create replicates.

    • Reconstruct trees; the percentage of replicates supporting each clade equals the bootstrap value.

    • Values ≥70% are commonly treated as strong support.

  • Posterior Probabilities: From Bayesian inference; values >0.95 indicate strong support.
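
The resampling step of a bootstrap analysis can be sketched as below. Tree reconstruction itself is left out; the second helper assumes the clades of each replicate tree have already been extracted as sets of frozensets. Both function names and the toy alignment are hypothetical.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement; row order is preserved."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_support(replicate_clades, clade):
    """Percentage of replicate trees whose clade set contains `clade`."""
    hits = sum(clade in clades for clades in replicate_clades)
    return 100.0 * hits / len(replicate_clades)


alignment = ["ACGT", "ACGA", "TCGA"]
replicate = bootstrap_replicate(alignment, random.Random(0))
print(replicate)   # same dimensions, columns drawn from the original

# Clade sets from four hypothetical replicate trees.
trees = [{frozenset("AB")}, {frozenset("AB")}, {frozenset("AC")}, set()]
print(bootstrap_support(trees, frozenset("AB")))   # 50.0
```

Whole columns are resampled, so each replicate keeps the site patterns of the original alignment while changing how often each pattern appears.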

5. Tree Formats & Visualization

  • Newick Format: (A:0.1,(B:0.2,C:0.2):0.3);

  • Nexus Format: Includes metadata, alignment blocks, and tree blocks.

  • Visualization: FigTree, iTOL (Interactive Tree Of Life), Dendroscope, ETE Toolkit.
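
To show how the Newick example above encodes a tree, here is a minimal recursive parser. It handles only plain names, branch lengths, and nested parentheses; quoted labels, comments, and other extensions of the format are not supported. The dictionary layout and function names are assumptions of this sketch.

```python
def parse_newick(text):
    """Parse simple Newick into nested dicts with keys
    'name', 'length' (float or None) and 'children'."""
    text = text.strip().rstrip(";")
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1                          # consume "("
            children.append(node())
            while peek() == ",":
                pos += 1
                children.append(node())
            if peek() != ")":
                raise ValueError("unbalanced parentheses")
            pos += 1                          # consume ")"
        start = pos
        while peek() and peek() not in ",():":
            pos += 1
        name = text[start:pos]
        length = None
        if peek() == ":":
            pos += 1
            start = pos
            while peek() and peek() not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    tree = node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def tip_distances(node, dist=0.0):
    """Root-to-tip distance for every leaf."""
    dist += node["length"] or 0.0
    if not node["children"]:
        return {node["name"]: dist}
    result = {}
    for child in node["children"]:
        result.update(tip_distances(child, dist))
    return result


tree = parse_newick("(A:0.1,(B:0.2,C:0.2):0.3);")
print(tip_distances(tree))   # A is 0.1 from the root; B and C are each 0.5
```

A string with mismatched parentheses raises ValueError, which is a cheap sanity check before loading a tree into a viewer.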

6. Rooting & Molecular Clocks

  • Rooting:

    • Outgroup Rooting: Include a known distant relative.

    • Midpoint Rooting: Place the root at the midpoint of the longest path.

  • Molecular Clock Models:

    • Strict Clock: Assumes a constant rate; calibrate with fossil or sampling dates.

    • Relaxed Clock: Allows rate variation among branches.

    • Tools: BEAST for dating analyses.

7. Interpretation & Best Practices

  • Topology vs. Branch Lengths: Topology shows relationships; lengths indicate the amount of change.

  • Monophyly, Paraphyly, Polyphyly: Understand clade definitions.

  • Pitfalls:

    • Poor alignment regions lead to misleading branches.

    • Model misspecification leads to incorrect likelihoods or posterior probabilities.

    • Long-branch attraction especially affects parsimony.

  • Recommendations:

    • Use multiple tree-building approaches to compare topologies.

    • Employ adequate substitution models and partitioning (e.g., codon positions).

    • Report support values; collapse poorly supported nodes.


X. Practical Applications and Hands-On Examples

X.1. Transcription & Translation Basics

  • MCQs Key Answers: 1) OH group; 2) b, d; 3) UAA stop codon.

  • Translation: Transcribe DNA → mRNA (T → U), locate the first AUG to set the reading frame, then translate codon by codon until a stop codon.

    • Example: GAGCCAUGCAUUAUCUAGAUAGUAGGCUCUGAGAAUUUAUCUC → Met-His-Tyr-Leu-Asp-Ser-Arg-Leu.

  • Promoter Location: Upstream of the transcription start site.

  • Mutations:

    • Upstream insertion (5′ UTR) → no protein change.

    • In-frame substitution (synonymous: AAU ⇄ AAC both Asn) → silent mutation.

    • Frameshift insertion in CDS → premature stop, truncated protein.

    • Stop-codon suppression by mutant tRNA → elongated protein.

X.2. Point Mutations & Splicing

  • Deletion Effects:

    • 4-nt deletion → removes ≥1 aa + frameshift → likely truncated, nonfunctional.

    • 3-nt deletion → removes 1 aa; possible amino acid substitution if crossing codons.

  • Insertion of 1 nt near start/end → frameshift → high probability of harm.

  • Substitution Types:

    1. Silent (synonymous)

    2. Conservative missense (similar property amino acid)

    3. Non-conservative missense (different property amino acid)

    4. Nonsense → early stop.

  • Splice-site mutation: Single intronic base change abolishes original acceptor (AG), uses downstream AG → skips exon → mRNA 173 nt shorter.

X.3. Beta-Globin Cluster & Allele Variation

  • Codon Translation:

    • atg gtg cac ctg act cct gag gag aag → M V H L T P E E K (MVHLTPEEK).

    • Variant at position 7: E → V if codon GAG → GTG.

  • Cluster Gene Order (5′→3′): ε – γ-G – γ-A – δ – β.

  • Strand Orientation: For genes on the reverse strand, alleles reported relative to the forward strand appear as their complements (A ↔ T, C ↔ G).

  • Allele Observations: Non-synonymous mutation changes residue; lack of reported allele frequency signals potential data error.

X.4. Gene Structure & Thalassemia

  • Gene Anatomy:

    • 5′ UTR – exon 1 – intron – exon 2 – intron – exon 3 – 3′ UTR – poly-A signal/tail.

    • Transcription start site at 5′ end; translation begins at the first AUG in the exon.

  • Beta-Thalassemia Mutations:

    1. Nonsense (GAG → TAG in exon) → β⁰, no beta-globin.

    2. Single-nt insertion → frameshift → β⁰.

    3. 4-nt deletion → frameshift → β⁰.

    4. Splice-site (IVS I) G → A → aberrant splicing → β⁺, reduced beta-globin.

X.5. Synonymous Substitutions & Conservative Changes

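  • Synonymous (Silent) Substitutions: Occur most often at the third codon position (the wobble position), although some first-position changes are also synonymous (e.g., CUA ⇄ UUA, both Leu); protein sequence unchanged.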

  • Transition vs. Transversion:

    • Transition: purine ↔ purine (A ↔ G), pyrimidine ↔ pyrimidine (C ↔ T).

    • Transversion: purine ↔ pyrimidine.

    • Observation: Transitions are more frequent than transversions.

  • Conservative Amino Acid Changes:

    • Basic: Arg ↔ Lys; nonpolar: Ile, Val, Leu, Met.

  • Phylogenetic Signal: Shared gaps (indels) can mark clade distinctions (Afrotheria vs. other eutherians vs. marsupials).

X.6. HIV/SIV Phylogenetics

  • Data: 9,176 bp HIV-1 vs. SIV sequences; 9 protein-coding genes.

  • Tree: Strong bootstrap support shows HIV-1 clusters with SIVcpz, HIV-2 with non-chimpanzee SIVs.

  • Bootstrap Analysis: Percentage of replicates supporting each clade (≥70% indicates strong support).

X.7. Mitochondrial & MYH16 Pseudogene Analysis

  • Sequences: Human, Neanderthal, Denisovan, Pan, Gorilla, etc.

  • Newick Example:

    (((Pan_paniscus:0.000001,Gorilla_gorilla:0.000001):0.06,(Homo_sapiens:0.000001,Pongo_pygmaeus:0.000001):0.12):0.12,Pan_troglodytes:0.000001);

  • Interpretation: Closest relatives are the tips that share the most recent common ancestor (sister taxa); the evolutionary distance between two tips is the sum of the branch lengths along the path connecting them.

  • Reference Accessions:

    • Human: NC_012920.1; Neanderthal: NC_011137.1; Denisovan: NC_013993.1.

  • Percent Identity: Human–Neanderthal ~98.72%; Human–Denisovan ~97.61%.

  • Residue Change: Isoleucine (I) → Valine (V) conservative mutation at specified codons.

  • Pseudogene 2-base Deletion: BK001410 MYH16 frameshift; truncated pseudogene.


XI. Essential Formulas (with Descriptions)

  1. Percent Identity

    Percent Identity = (Number of identical aligned positions / Aligned length) × 100%

    • Identical aligned positions: count of positions where residues match exactly.

    • Aligned length: total non-gap columns compared.

  2. Percent Similarity

    Percent Similarity = (Number of similar (conservative) residue pairs / Aligned length) × 100%

    • Similar pairs: substitutions between biochemically similar residues (e.g., Ile ↔ Val).

  3. Transition/Transversion Ratio (κ)

    κ = #Transitions (A ↔ G, C ↔ T) / #Transversions (purine ↔ pyrimidine)

  4. Affine Gap Penalty

    GapPenalty = GapOpen + (L – 1) × GapExtend

    • GapOpen: penalty for introducing a new gap.

    • GapExtend: penalty for each additional position in the same gap.

    • L: length of the gap in residues.

  5. Alignment Score (S)

    S = Σₖ s(aₖ,bₖ) – Σ_{g=1..G} [GapOpen + (L_g – 1) × GapExtend]

    • s(aₖ,bₖ): substitution score for aligned residues.

    • G: number of gaps; GapOpen and GapExtend are positive penalties, so they are subtracted from the score.

    • L_g: length of the g-th gap.

  6. Needleman–Wunsch Recurrence (Global Alignment)

    F(i,j) = max{ F(i-1,j-1) + s(xᵢ,yⱼ), F(i-1,j) – d, F(i,j-1) – d }

    • F(i,j): best score for prefixes x₁…xᵢ and y₁…yⱼ.

    • s(xᵢ,yⱼ): match/mismatch score.

    • d: gap penalty.

  7. Smith–Waterman Recurrence (Local Alignment)

    H(i,j) = max{ 0, H(i-1,j-1)+s(xᵢ,yⱼ), H(i-1,j) – d, H(i,j-1) – d }

    • H(i,j): best local alignment score ending at positions i and j.

    • Zero bound ensures restarting at zero when negative.

  8. p-distance

    p = Number of differing sites / Total sites compared

  9. Jukes–Cantor Model Distance

    d = -¾ × ln(1 – 4/3 × p)

    • p: observed proportion of differences.

  10. Kimura 2-Parameter Distance

    d = -½ ln(1 – 2P – Q) – ¼ ln(1 – 2Q)

    • P: proportion of transitions.

    • Q: proportion of transversions.

  11. Bootstrap Support (%)

    %Bootstrap = (#Replicates supporting clade / Total replicates) × 100%

  12. BLAST E-value

    E = K × m × n × e^(−λS)

    • m, n: query and database lengths.

    • S: raw score.

    • λ, K: statistical constants that depend on the scoring system.

  13. Reading Frame Count

    6 (three reading frames per strand × two strands)

  14. Total Codons

    4³ = 64 possible codons