Molecular Biology and Bioinformatics Principles

Posted on Aug 16, 2025 in Computer Engineering

I. Molecular Biology Fundamentals

Genes and Genomes

Every cell contains a complete set of genetic instructions—the genome—encoded in DNA and organized into genes packaged on chromosomes. A gene is a specific DNA sequence that encodes a functional product (usually a protein or an RNA). Genetic variation (mutations) underlies phenotypic differences, while environmental factors also contribute to traits.

DNA vs. RNA

Feature	DNA (Deoxyribonucleic Acid)	RNA (Ribonucleic Acid)
Sugar	Deoxyribose (no 2′–OH group)	Ribose (has 2′–OH group)
Nitrogenous Bases	A, T, C, G	A, U, C, G (uracil replaces thymine)
Structure	Usually double-stranded helix	Usually single-stranded (can form loops/folds)
Stability	More stable (long-term storage)	Less stable (transient transcripts)
Cellular Location	Nucleus (eukaryotes), nucleoid (prokaryotes)	Nucleus (as pre-mRNA) & cytoplasm (mRNA, tRNA)
Primary Function	Genetic blueprint	Messenger (mRNA), transfer (tRNA), ribosomal (rRNA) roles.

II. Central Dogma of Molecular Biology

Flow of Information:

Replication: DNA → DNA
Transcription: DNA → RNA
Translation: RNA → Protein

Exceptions exist (e.g., reverse transcription in retroviruses copies RNA → DNA), but sequence information does not flow from protein back to nucleic acids.

III. Transcription (DNA → RNA)

Enzyme: RNA polymerase reads the template strand (3′→5′) and synthesizes RNA (5′→3′).
Coding vs. Template Strands:
- Coding strand has same sequence as RNA (T→U).
- Template strand is complementary and used for synthesis.
Eukaryotic Processing: 5′ cap, 3′ poly-A tail, and splicing out introns yield mature mRNA.
Prokaryotes lack a nucleus—transcription and translation are coupled, and mRNAs are used directly without capping or splicing.
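The strand relationships above can be sketched in a few lines of Python. This is a minimal illustration, not a library API: the function names and the short example sequence are assumptions made for this note.

```python
# Minimal sketch of coding/template strand relationships (names are illustrative).
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}

def reverse_complement(dna):
    """Return the opposite DNA strand, written 5'->3'."""
    return dna.upper().translate(DNA_COMPLEMENT)[::-1]

def transcribe_coding(coding_5to3):
    """The mRNA has the same sequence as the coding strand, with T replaced by U."""
    return coding_5to3.upper().replace("T", "U")

def transcribe_template(template_3to5):
    """Pair each template base (read 3'->5') with its RNA partner, giving mRNA 5'->3'."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3to5.upper())

coding = "ATGGTGCAC"      # coding strand, 5'->3'
template = "TACCACGTG"    # template strand, written 3'->5'
print(transcribe_coding(coding))      # AUGGUGCAC
print(transcribe_template(template))  # AUGGUGCAC
```

Both routes give the same mRNA, which is the point of the coding vs. template distinction.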

IV. Translation (RNA → Protein)

Ribosome ‘reads’ mRNA codons (three-base units) in the 5′→3′ direction.
Start codon: AUG (Methionine) establishes reading frame.
Stop codons: UAA, UAG, UGA terminate synthesis.
tRNAs with complementary anticodons bring specific amino acids to the ribosome’s A (aminoacyl), P (peptidyl), and E (exit) sites.
Open Reading Frame (ORF): Region from start to stop codon; six possible frames (3 per strand). Tools like NCBI ORFfinder identify ORFs in a sequence.
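A simplified six-frame ORF scan can make the idea concrete. The sketch below assumes the standard genetic code, accepts only ATG as a start, and reports one ORF per start-to-stop run; it is not how NCBI ORFfinder is implemented, and the function names and the `min_codons` parameter are hypothetical.

```python
from itertools import product

# Standard genetic code in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def reverse_complement(dna):
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_orfs(dna, min_codons=2):
    """Return (strand, frame, start, protein) for each ATG...stop run in six frames.

    `start` is the 0-based index on the strand being scanned.
    """
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein, start = None, None
            for i in range(frame, len(seq) - 2, 3):
                aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X = unknown codon
                if protein is None:
                    if aa == "M":              # open an ORF at the first ATG
                        protein, start = "M", i
                elif aa == "*":                # close the ORF at the stop codon
                    if len(protein) >= min_codons:
                        orfs.append((strand, frame, start, protein))
                    protein = None
                else:
                    protein += aa
    return orfs

print(find_orfs("ATGAAATAG"))  # [('+', 0, 0, 'MK')]
```

ORFs that run off the end of the sequence without a stop codon are ignored here; a real tool would usually offer options for those cases.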

V. Genetic Mutations and Variations

Point Mutations (Substitutions):

Silent (Synonymous): Codon change but same amino acid (e.g., GAA→GAG both code Glu)—no protein change.
Missense (Nonsynonymous): Different amino acid (e.g., GAG (Glu)→GTG (Val))—effect depends on similarity.
Nonsense: Codon → stop codon (e.g., TGC→TGA)—truncated protein, often loss of function.

Insertions/Deletions (Indels):

Frameshift: Indel not a multiple of 3 shifts reading frame downstream, scrambles protein, often leading to an early stop—usually deleterious.
In-frame: Indel multiple of 3 adds/removes whole amino acids—impact varies by location and size.

Mutation Type	Description	Typical Consequence
Silent (Synonymous)	Codon changes but same amino acid	Usually no effect
Missense (Nonsynonymous)	Codon changes to different amino acid	Benign to harmful depending on substitution
Nonsense	Codon changes to stop	Premature termination; truncated protein
Frameshift (Insertion/Deletion)	Indel not multiple of 3	Alters downstream sequence; often nonfunctional protein
In-frame Indel	Indel multiple of 3	Adds/removes amino acids; varies in effect
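The categories in the table can be checked mechanically. The following sketch assumes the standard genetic code and uses hypothetical helper names; the "stop-loss" label is an addition for completeness and is not part of the table above.

```python
from itertools import product

# Standard genetic code in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_substitution(ref_codon, alt_codon):
    """Classify a codon change as silent, missense, nonsense, or stop-loss."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "silent"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"   # not in the table above; included so the function is total
    return "missense"

def classify_indel(length):
    """Indels whose length is a multiple of 3 keep the reading frame."""
    return "in-frame" if length % 3 == 0 else "frameshift"

print(classify_substitution("GAA", "GAG"))  # silent
print(classify_substitution("GAG", "GTG"))  # missense
print(classify_substitution("TGC", "TGA"))  # nonsense
print(classify_indel(4))                    # frameshift
```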

VI. Bioinformatics Databases & Tools

Primary vs. Derived Databases:

Primary (Archive): Raw submissions (GenBank, EMBL, DDBJ, SRA, PDB)—may be redundant or uncurated.
Derived (Curated): Processed, non-redundant (RefSeq, UniProt/Swiss-Prot, CDD)—standardized, well-annotated.

GenBank vs. RefSeq:

GenBank: Submission-based, multiple entries per gene, accession.version (e.g., U12345.1).
RefSeq: Curated single records per biomolecule with stable accessions (e.g., NM_, NP_, NC_).

RefSeq Accession Prefixes:

Prefix	Type	Example
NC_	Chromosome/genome assembly	NC_000001
NG_	Genomic region (gene)	NG_007073
NM_	Curated mRNA (transcript)	NM_000518
NP_	Curated protein	NP_000509
NR_	Curated non-coding RNA	NR_003285
XM_/XP_/XR_	Model (predicted) sequences	XM_017001338, XP_017001338
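As a quick illustration, the prefix table can be turned into a small lookup. This is only a sketch: the function name is hypothetical, and it recognizes just the prefixes listed above.

```python
# Prefix meanings copied from the table above.
REFSEQ_PREFIXES = {
    "NC_": "chromosome/genome assembly",
    "NG_": "genomic region (gene)",
    "NM_": "curated mRNA (transcript)",
    "NP_": "curated protein",
    "NR_": "curated non-coding RNA",
    "XM_": "model (predicted) mRNA",
    "XP_": "model (predicted) protein",
    "XR_": "model (predicted) non-coding RNA",
}

def describe_accession(accession):
    """Split 'accession.version' and look up the RefSeq prefix, if any."""
    base, _, version = accession.strip().partition(".")
    kind = REFSEQ_PREFIXES.get(base[:3].upper(), "no RefSeq prefix from the table")
    return base, version or None, kind

print(describe_accession("NP_000509"))  # ('NP_000509', None, 'curated protein')
print(describe_accession("U12345.1"))   # ('U12345', '1', 'no RefSeq prefix from the table')
```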

Key NCBI Tools:

Entrez: Integrated search across sequence, literature (PubMed), structures.
ORFfinder: Identifies ORFs in DNA sequences.
BLAST: Heuristic local alignment search (BLASTN, BLASTP, BLASTX, TBLASTN/TBLASTX) with E-values indicating significance.
UCSC Genome Browser (hosted by UC Santa Cruz rather than NCBI): Visualize genomic context and annotations.

VII. Sequence Alignment

7.1 Pairwise Alignment

Dot Plot: Visual match matrix; diagonal lines reveal similarity; adjustable stringency.
Dynamic Programming:
- Needleman–Wunsch: Global end-to-end alignment.
- Smith–Waterman: Local high-scoring segment alignment.
Scoring Schemes:
- Matches: Positive scores.
- Mismatches: Negative scores.
- Gaps: Opening and extension penalties (e.g., –2, –1).
Percent Identity: (Identical matches / aligned length) × 100%.

7.2 Substitution Matrices (Proteins)

PAM Matrices: Based on accepted mutations (PAM1 for ~1% divergence; PAM250 for distant comparisons).
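BLOSUM Matrices: Empirical, derived from conserved blocks; e.g., BLOSUM62 is built from blocks clustered at 62% identity. Choose the matrix based on expected divergence: lower BLOSUM numbers (or higher PAM numbers) suit more distant sequences.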

7.3 BLAST

Method: Heuristic local alignment via “word” matches extended into High-Scoring Segment Pairs (HSPs).
E-value: Expected number of hits with an equal or better score that would arise by chance in a database of that size; lower values indicate greater significance.
Flavors:
- BLASTN: Nucleotide vs. nucleotide.
- BLASTP: Protein vs. protein.
- BLASTX/TBLASTN/TBLASTX: Translated searches.

VIII. Multiple Sequence Alignment (MSA)

Definition & Purpose

Aligns three or more homologous sequences simultaneously so each column represents evolutionarily equivalent positions.
Key applications include identifying conserved motifs/domains, inferring functional sites, preparing input for phylogenetic analysis, and structure prediction.

Challenges vs. Pairwise Alignment

Exponential search space: aligning n sequences optimally is NP-hard.
Complexity increases steeply with sequence number and length.

1. Progressive Alignment

Workflow:
1. Compute all pairwise distances (e.g., percent identity or substitution-matrix scores).
2. Construct a guide tree (e.g., UPGMA or Neighbor-Joining) reflecting sequence relationships.
3. Align the most similar pair of sequences into a profile.
4. Iteratively align remaining sequences or profiles following the guide tree order.
Tools: ClustalW, Clustal Omega.
Advantages: Fast, scalable to large numbers of sequences.
Limitations: Errors in early pairwise alignments propagate (“once a gap, always a gap”).

2. Iterative Refinement & Consistency Methods

Iterative Refinement:
- Repeatedly partition or realign subsets of sequences or the entire alignment to improve score (sum-of-pairs or likelihood).
- Tools: MUSCLE, MAFFT, PRANK.
Consistency-Based Alignment:
- Incorporate information from multiple pairwise alignments to enforce consistency across the MSA.
- Tools: T-Coffee, ProbCons (uses hidden Markov models for posterior probabilities).

3. Template/Structure-Guided Alignment

Use known 3D structures to align sequences, preserving structural equivalences.
Tools: Expresso (3D-Coffee), PROMALS3D.

4. Scoring & Gap Penalties

Sum-of-Pairs Score: Sum of scores for all residue pairs in each column (using substitution matrices for proteins).
Profile-Profile Alignment: Aligns profiles (weighted residue frequencies) rather than raw sequences for later steps.
Gap Penalties:
- Affine model: penalty = gap_open + (length – 1) × gap_extend, where gap_open covers the first gap position.
- Adjust based on expected indel frequencies and structural regions.

5. Output Formats & Visualization

Formats: CLUSTAL (.aln), FASTA (with gaps), Stockholm (.sto).
Viewers: Jalview, MSAViewer, UGENE.

6. Quality Assessment & Post-Processing

Metrics: Column Score (fraction of columns aligned identically to a reference alignment), GUIDANCE scores, TCS (Transitive Consistency Score).
Filtering: trimAl, Gblocks—remove poorly aligned or divergent regions before downstream analyses.

Practical Tips

Choose algorithms based on dataset size, sequence similarity, and computational resources.
Inspect alignments manually; realign problematic regions.
Experiment with different substitution matrices (e.g., BLOSUM62 vs. BLOSUM45) and gap penalties.
Always trim unreliable columns prior to phylogenetic inference or motif discovery.

IX. Phylogenetic Tree Reconstruction

Concepts & Terminology

Leaf (Tip): Observed sequences.
Internal Node: Hypothetical ancestor.
Branch Length: Represents evolutionary change or time.
Topology: Branching order (unrooted vs. rooted).
Clade (Monophyletic Group): Ancestor plus all its descendants.

1. Input: MSA of Homologous Sequences

Quality of MSA directly impacts tree accuracy. Trim ambiguous regions.

2. Models of Sequence Evolution

Nucleotide Models: Jukes-Cantor, Kimura 2-parameter, General Time Reversible (GTR).
Protein Models: JTT, WAG, LG—incorporate amino acid substitution frequencies.

3. Tree-Building Methods

A. Distance-Based Methods

UPGMA (Unweighted Pair Group Method with Arithmetic Mean):
- Assumes a molecular clock (constant rate); yields an ultrametric (rooted) tree.
- Simple but unreliable if rates vary.
Neighbor-Joining (NJ):
- No clock assumption; fast O(n³); produces an unrooted tree.
- Widely used for exploratory analyses.
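To show how a distance-based method turns a distance matrix into a tree, here is a compact UPGMA sketch that emits Newick text. The input format (a dict keyed by name pairs) and the function name are assumptions for this example; it favors readability over speed and breaks ties by sorted cluster label.

```python
from itertools import combinations

def upgma(dist):
    """Cluster taxa by UPGMA; `dist` maps (a, b) name pairs to distances."""
    d = {frozenset(pair): value for pair, value in dist.items()}
    # Each cluster id maps to (newick_text, number_of_tips, height_above_tips).
    clusters = {name: (name, 1, 0.0) for pair in dist for name in pair}
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(sorted(clusters), 2), key=lambda p: d[frozenset(p)])
        (text_a, size_a, h_a), (text_b, size_b, h_b) = clusters.pop(a), clusters.pop(b)
        height = d[frozenset((a, b))] / 2          # molecular-clock assumption
        merged = f"({text_a}:{height - h_a:.3f},{text_b}:{height - h_b:.3f})"
        for k in clusters:
            # Size-weighted average distance from the merged cluster to every other one.
            d[frozenset((merged, k))] = (
                size_a * d[frozenset((a, k))] + size_b * d[frozenset((b, k))]
            ) / (size_a + size_b)
        clusters[merged] = (merged, size_a + size_b, height)
    (text, _, _), = clusters.values()
    return text + ";"

print(upgma({("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0}))
# ((A:1.000,B:1.000):2.000,C:3.000);
```

Every tip ends up at the same distance from the root (3.0 here), which is the ultrametric property mentioned above.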

B. Character-Based Methods

Maximum Parsimony (MP):
- Seeks the tree minimizing total character changes.
- No explicit evolutionary model; vulnerable to long-branch attraction.
Maximum Likelihood (ML):
- Finds the tree maximizing likelihood given a model of substitution.
- Computationally intensive; tools: PhyML, RAxML, IQ-TREE.
Bayesian Inference:
- Estimates the posterior distribution of trees under a model via MCMC.
- Outputs clade posterior probabilities and can incorporate relaxed molecular clocks (BEAST, MrBayes).

4. Branch Support & Validation

Bootstrap Analysis:
- Resample alignment columns with replacement to create replicates.
- Reconstruct trees; the percentage of replicates supporting each clade equals the bootstrap value.
- Values ≥70% are commonly treated as strong support.
Posterior Probabilities: From Bayesian inference; values >0.95 indicate strong support.
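The column-resampling step of the bootstrap can be sketched as follows. The alignment is assumed to be a dict of equal-length strings, and the function names are hypothetical; tree reconstruction on each replicate is left to whatever method is being tested.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement, keeping rows in sync."""
    n_cols = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in picks) for name, seq in alignment.items()}

def bootstrap_support(replicates_with_clade, total_replicates):
    """Percentage of replicate trees that contain the clade."""
    return 100.0 * replicates_with_clade / total_replicates

alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "TCGTAC"}
replicate = bootstrap_replicate(alignment, random.Random(0))  # seeded for repeatability
print(replicate)
print(bootstrap_support(85, 100))  # 85.0
```

The same column indices are applied to every sequence, so each resampled column is still a real column of the original alignment.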

5. Tree Formats & Visualization

Newick Format: (A:0.1,(B:0.2,C:0.2):0.3);
Nexus Format: Includes metadata, alignment blocks, and tree blocks.
Visualization: FigTree, iTOL (Interactive Tree Of Life), Dendroscope, ETE Toolkit.
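A small recursive parser shows how the Newick example above encodes a tree. This is a sketch for simple strings only (no quoted labels or comments), and the dict layout and function names are assumptions made here rather than any toolkit's API.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = text.strip().rstrip(";")
    pos = 0

    def peek():
        return text[pos] if pos < len(text) else ""

    def node():
        nonlocal pos
        children = []
        if peek() == "(":
            pos += 1                       # consume '('
            children.append(node())
            while peek() == ",":
                pos += 1
                children.append(node())
            if peek() != ")":
                raise ValueError("unbalanced parentheses in Newick string")
            pos += 1                       # consume ')'
        start = pos
        while peek() and peek() not in ",():":
            pos += 1
        name, length = text[start:pos], None
        if peek() == ":":
            pos += 1
            start = pos
            while peek() and peek() not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters in Newick string")
    return root

def tip_depths(node, depth=0.0):
    """Sum branch lengths from the root down to each tip."""
    depth += node["length"] or 0.0
    if not node["children"]:
        return {node["name"]: depth}
    depths = {}
    for child in node["children"]:
        depths.update(tip_depths(child, depth))
    return depths

tree = parse_newick("(A:0.1,(B:0.2,C:0.2):0.3);")
print(tip_depths(tree))
```

Because the parser checks that every opening parenthesis is closed, it also serves as a quick sanity check on hand-edited Newick strings.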

6. Rooting & Molecular Clocks

Rooting:
- Outgroup Rooting: Include a known distant relative.
- Midpoint Rooting: Place the root at the midpoint of the longest path.
Molecular Clock Models:
- Strict Clock: Assumes a constant rate; calibrate with fossil or sampling dates.
- Relaxed Clock: Allows rate variation among branches.
- Tools: BEAST for dating analyses.

7. Interpretation & Best Practices

Topology vs. Branch Lengths: Topology shows relationships; lengths indicate the amount of change.
Monophyly, Paraphyly, Polyphyly: Understand clade definitions.
Pitfalls:
- Poor alignment regions lead to misleading branches.
- Model misspecification leads to incorrect likelihoods or posterior probabilities.
- Long-branch attraction especially affects parsimony.
Recommendations:
- Use multiple tree-building approaches to compare topologies.
- Employ adequate substitution models and partitioning (e.g., codon positions).
- Report support values; collapse poorly supported nodes.

X. Practical Applications and Hands-On Examples

X.1. Transcription & Translation Basics

MCQs Key Answers: 1) OH group; 2) b, d; 3) UAA stop codon.
Translation: Identify reading frame, transcribe DNA → mRNA (T → U), translate by AUG start.
- Example: GAGCCAUGCAUUAUCUAGAUAGUAGGCUCUGAGAAUUUAUCUC → Met-His-Tyr-Leu-Asp-Ser-Arg-Leu.
Promoter Location: Upstream of the transcription start site.
Mutations:
- Upstream insertion (5′ UTR) → no protein change.
- In-frame substitution (synonymous: AAU ⇄ AAC both Asn) → silent mutation.
- Frameshift insertion in CDS → premature stop, truncated protein.
- Stop-codon suppression by mutant tRNA → elongated protein.

X.2. Point Mutations & Splicing

Deletion Effects:
- 4-nt deletion → removes ≥1 aa + frameshift → likely truncated, nonfunctional.
- 3-nt deletion → removes 1 aa; possible amino acid substitution if crossing codons.
Insertion of 1 nt near start/end → frameshift → high probability of harm.
Substitution Types:
1. Silent (synonymous)
2. Conservative missense (similar property amino acid)
3. Non-conservative missense (different property amino acid)
4. Nonsense → early stop.
Splice-site mutation: Single intronic base change abolishes original acceptor (AG), uses downstream AG → skips exon → mRNA 173 nt shorter.

X.3. Beta-Globin Cluster & Allele Variation

Codon Translation:
- atg gtg cac ctg act cct gag gag aag → M V H L T P E E K.
- Variant at position 7 (counting the initiator Met as 1): E → V if codon GAG → GTG.
Cluster Gene Order (5′→3′): ε – γ-G – γ-A – δ – β.
Strand Orientation: For genes on the reverse strand, alleles reported relative to the forward genomic strand appear as their complements (A ↔ T, C ↔ G).
Allele Observations: Non-synonymous mutation changes residue; lack of reported allele frequency signals potential data error.

X.4. Gene Structure & Thalassemia

Gene Anatomy:
- 5′ UTR – exon 1 – intron – exon 2 – intron – exon 3 – 3′ UTR – poly-A signal/tail.
- The transcription start site marks the 5′ end of exon 1; translation begins at the start codon (typically the first AUG) downstream of the 5′ UTR.
Beta-Thalassemia Mutations:
1. Nonsense (GAG → TAG in exon) → β⁰, no beta-globin.
2. Single-nt insertion → frameshift → β⁰.
3. 4-nt deletion → frameshift → β⁰.
4. Splice-site (IVS I) G → A → aberrant splicing → β⁺, reduced beta-globin.

X.5. Synonymous Substitutions & Conservative Changes

Synonymous (Silent) Substitutions: Occur mostly at the third codon position, where the genetic code is most degenerate; protein sequence unchanged.
Transition vs. Transversion:
- Transition: purine ↔ purine (A ↔ G), pyrimidine ↔ pyrimidine (C ↔ T).
- Transversion: purine ↔ pyrimidine.
- Observation: Transitions are more frequent than transversions.
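Counting transitions and transversions between two aligned DNA sequences is straightforward. The sketch below assumes equal-length, already aligned strings and skips gaps and ambiguous bases; the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def substitution_type(a, b):
    """Classify a pair of unambiguous DNA bases (A, C, G, T only)."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "identical"
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return "transition"
    return "transversion"

def count_ti_tv(seq1, seq2):
    """Return (transitions, transversions) over aligned, unambiguous positions."""
    ti = tv = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue                      # skip gaps and ambiguity codes
        kind = substitution_type(a, b)
        ti += kind == "transition"
        tv += kind == "transversion"
    return ti, tv

print(count_ti_tv("AGCT", "GGTA"))  # (2, 1)
```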
Conservative Amino Acid Changes:
- Basic: Arg ↔ Lys; nonpolar: Ile, Val, Leu, Met.
Phylogenetic Signal: Shared gaps (indels) can mark clade distinctions (Afrotheria vs. other Eutheria vs. Marsupials).

X.6. HIV/SIV Phylogenetics

Data: HIV-1 and SIV sequences (~9,176 bp) covering 9 protein-coding genes.
Tree: Strong bootstrap support at key nodes shows HIV-1 clustering with SIVcpz and HIV-2 with non-chimpanzee SIVs.
Bootstrap Analysis: Percentage of replicates supporting each clade (≥70% indicates strong support).

X.7. Mitochondrial & MYH16 Pseudogene Analysis

Sequences: Human, Neanderthal, Denisovan, Pan, Gorilla, etc.

Newick Example:

(((Pan_paniscus:0.000001,Gorilla_gorilla:0.000001):0.06,(Homo_sapiens:0.000001,Pongo_pygmaeus:0.000001):0.12):0.12,Pan_troglodytes:0.000001);

Interpretation: Closest relatives are the taxa that share the most recent common node (sister taxa); the divergence between two tips is the sum of the branch lengths along the path connecting them.
Reference Accessions:
- Human: NC_012920.1; Neanderthal: NC_011137.1; Denisovan: NC_013993.1.
Percent Identity: Human–Neanderthal ~98.72%; Human–Denisovan ~97.61%.
Residue Change: Isoleucine (I) → Valine (V) conservative mutation at specified codons.
Pseudogene 2-base Deletion: A 2-base deletion in MYH16 (BK001410) causes a frameshift, leaving a truncated pseudogene.

XI. Essential Formulas (with Descriptions)

Percent Identity

Percent Identity = (Number of identical aligned positions / Aligned length) × 100%
- Identical aligned positions: count of positions where residues match exactly.
- Aligned length: total non-gap columns compared.
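A direct translation of this formula, following the definition given here (gap columns are excluded from the aligned length), might look like the sketch below. Note that some tools divide by the full alignment length instead, so results can differ; the function name is illustrative.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = matches = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap or y == gap:
            continue                  # gap columns are not compared
        compared += 1
        matches += x == y
    if compared == 0:
        raise ValueError("no non-gap columns to compare")
    return 100.0 * matches / compared

print(percent_identity("ACGT-A", "ACCTTA"))  # 80.0
```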
Percent Similarity

Percent Similarity = (Number of identical or similar (conservative) residue pairs / Aligned length) × 100%
- Similar pairs: substitutions between biochemically similar residues (e.g., Ile ↔ Val); identical pairs are normally counted as well, so similarity ≥ identity.
Transition/Transversion Ratio (κ)

κ = #Transitions (A ↔ G, C ↔ T) / #Transversions (purine ↔ pyrimidine)
Affine Gap Penalty

GapPenalty = GapOpen + (L – 1) × GapExtend
- GapOpen: penalty for introducing a new gap.
- GapExtend: penalty for each additional position in the same gap.
- L: length of the gap in residues.
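The affine formula can be applied to every gap run in an aligned sequence. In this sketch the penalties are positive numbers to be subtracted from the alignment score, and the helper names are hypothetical.

```python
import re

def affine_gap_penalty(length, gap_open, gap_extend):
    """GapOpen for the first position plus GapExtend for each further position."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return gap_open + (length - 1) * gap_extend

def total_gap_penalty(aligned, gap_open, gap_extend):
    """Sum the affine penalty over every run of '-' in one aligned sequence."""
    return sum(
        affine_gap_penalty(len(run), gap_open, gap_extend)
        for run in re.findall(r"-+", aligned)
    )

print(affine_gap_penalty(4, gap_open=2, gap_extend=1))           # 5
print(total_gap_penalty("AC--GT-A", gap_open=2, gap_extend=1))   # 5
```

One gap of length 4 costs 5 here, whereas four separate single-position gaps would cost 8, which is why affine penalties favor fewer, longer gaps.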
Alignment Score (S)

S = Σₖ s(aₖ,bₖ) – Σ_{g=1..G} [GapOpen + (L_g – 1) × GapExtend]
- s(aₖ,bₖ): substitution score for aligned residues.
- G: number of gaps.
- L_g: length of the g-th gap.
- Gap penalties are positive values, so they are subtracted from the substitution total.
Needleman–Wunsch Recurrence (Global Alignment)

F(i,j) = max{ F(i-1,j-1) + s(xᵢ,yⱼ), F(i-1,j) – d, F(i,j-1) – d }
- F(i,j): best score for prefixes x₁…xᵢ and y₁…yⱼ.
- s(xᵢ,yⱼ): match/mismatch score.
- d: gap penalty.
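The recurrence can be implemented directly with a score matrix and a traceback. This is a minimal sketch with a linear gap penalty d, as in the formula; the default scores (match +1, mismatch -1, d = 2) are arbitrary choices for illustration.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -d * i                  # leading gaps in y
    for j in range(1, m + 1):
        F[0][j] = -d * j                  # leading gaps in x
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

print(needleman_wunsch("ACGT", "AGT"))  # (1, 'ACGT', 'A-GT')
```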
Smith–Waterman Recurrence (Local Alignment)

H(i,j) = max{ 0, H(i-1,j-1)+s(xᵢ,yⱼ), H(i-1,j) – d, H(i,j-1) – d }
- H(i,j): best local alignment score ending at positions i and j.
- Zero bound ensures restarting at zero when negative.
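A matching sketch for the local recurrence follows. It differs from the global case in the zero floor, in starting the traceback at the highest-scoring cell, and in stopping when a zero is reached; the scores used (match +2, mismatch -1, d = 2) are illustrative assumptions.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment; returns (best_score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] - d, H[i][j - 1] - d)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until the score drops to zero.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

print(smith_waterman("TTTACGTAAA", "GGACGTCC"))  # (8, 'ACGT', 'ACGT')
```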
p-distance

p = Number of differing sites / Total sites compared
Jukes–Cantor Model Distance

d = -¾ × ln(1 – 4/3 × p)
- p: observed proportion of differences.
Kimura 2-Parameter Distance

d = -½ ln(1 – 2P – Q) – ¼ ln(1 – 2Q)

P: proportion of transitions.
Q: proportion of transversions.
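The three distance formulas above can be written as small functions. This is a sketch with hypothetical function names; it raises an error where the logarithm is undefined (sequences too divergent for the correction to apply).

```python
import math

def p_distance(seq1, seq2):
    """Proportion of differing sites between two aligned, gap-free sequences."""
    if len(seq1) != len(seq2) or not seq1:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a != b for a, b in zip(seq1, seq2)) / len(seq1)

def jukes_cantor(p):
    """d = -3/4 * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p must be below 0.75 for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)

def kimura_2p(P, Q):
    """d = -1/2 * ln(1 - 2P - Q) - 1/4 * ln(1 - 2Q)."""
    if 1 - 2 * P - Q <= 0 or 1 - 2 * Q <= 0:
        raise ValueError("P and Q are too large for the Kimura correction")
    return -0.5 * math.log(1 - 2 * P - Q) - 0.25 * math.log(1 - 2 * Q)

p = p_distance("ACGTACGTAC", "ACGTACGTAT")   # one difference in ten sites
print(p, round(jukes_cantor(p), 4))
```

When transversions are exactly twice as frequent as transitions (Q = 2P), the Kimura distance reduces to the Jukes-Cantor distance for p = P + Q, which is a handy consistency check.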

Bootstrap Support (%)

%Bootstrap = (#Replicates supporting clade / Total replicates) × 100%
BLAST E-value

E = K × m × n × e^(−λS)

m, n: effective lengths of the query and the database.
S: raw score.
λ, K: statistical constants.
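The E-value formula is easy to evaluate once λ and K are known. In the sketch below the parameter values are placeholders chosen for illustration, not constants taken from any particular BLAST run. The bit-score conversion S′ = (λS − ln K) / ln 2 is included as a supplement to the formula above, since it lets the same E-value be written as m × n × 2^(−S′).

```python
import math

def blast_evalue(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)

def bit_score(raw_score, K, lam):
    """Normalized score S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

# Placeholder parameters for illustration only.
K, lam = 0.04, 0.27
m, n = 300, 5_000_000
for S in (40, 80, 120):
    print(S, blast_evalue(S, m, n, K, lam))
```

Each increase in raw score multiplies E by a constant factor below one, so E-values fall off exponentially as scores rise.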

Reading Frames
6 (three reading frames per strand × two strands)
Total Codons
4³ = 64 possible codons (61 sense codons + 3 stop codons)