Compiler Design Fundamentals: Phases, Parsing, and Optimization Techniques

Phases of a Compiler Design Process

A compiler converts source code into machine code. The main phases are:

  • Lexical Analysis: Breaks the source into tokens and discards whitespace and comments.
  • Syntax Analysis: Checks the grammar structure of tokens.
  • Semantic Analysis: Checks meaning, types, and logical correctness.
  • Intermediate Code Generation: Creates machine-independent middle code.
  • Optimization: Improves code efficiency.
  • Code Generation: Converts intermediate code into machine code.
  • Symbol Table & Error Handling: Stores variable information and handles errors.
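The first phase above can be sketched concretely. Below is a minimal Python tokenizer for a hypothetical toy expression language; the token names and patterns are illustrative assumptions, not part of any real compiler:

```python
import re

# Token specification for a toy expression language (illustrative only).
# Listed in priority order: NUMBER is tried before IDENT and OP.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENT",  r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),   # whitespace is discarded, as the phase list notes
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(source):
    """Break source text into (kind, lexeme) tokens, dropping whitespace."""
    tokens = []
    for m in TOKEN_RE.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

For example, `tokenize("x = 3 + 42")` yields `IDENT`, `OP`, `NUMBER`, `OP`, `NUMBER` tokens in order, with the spaces removed.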

Compiler Bootstrapping Explained

Bootstrapping means writing a compiler in the very language it is meant to compile, starting from an existing compiler for another language. The process involves three main steps:

  1. Write a simple version of the compiler in an existing language.
  2. Create an advanced compiler for the new language.
  3. Use the new compiler to compile itself (self-hosting).

SLR Parsing Method Steps

SLR (Simple LR) is a bottom-up parsing method. The steps involved are:

  1. Create augmented grammar.
  2. Create LR(0) items and states.
  3. Build ACTION and GOTO tables.
  4. Use FOLLOW sets for reduce entries.
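Step 2 above, building LR(0) items, rests on the closure operation: every item with the dot before a non-terminal pulls in that non-terminal's productions. A minimal Python sketch, using a small hypothetical augmented grammar:

```python
# LR(0) closure for a tiny augmented grammar (hypothetical example):
#   S' -> S,   S -> ( S ) | x
GRAMMAR = {
    "S'": [["S"]],
    "S":  [["(", "S", ")"], ["x"]],
}
NONTERMINALS = set(GRAMMAR)

def closure(items):
    """Items are (head, body, dot) triples; expand non-terminals after the dot."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in NONTERMINALS:
                for prod in GRAMMAR[body[dot]]:
                    item = (body[dot], tuple(prod), 0)
                    if item not in items:
                        items.add(item)
                        changed = True
    return items
```

Starting from the kernel item S' -> .S, the closure adds S -> .(S) and S -> .x, giving the initial parser state.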

FIRST and FOLLOW Sets in Parsing

These sets are fundamental tools used in parsing table construction:

  • FIRST(X): The set of terminals that can appear at the beginning of some string derived from X.
  • FOLLOW(A): The set of terminals that can appear immediately after the non-terminal A in some sentential form.
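FIRST sets are computed by iterating to a fixed point: a production's FIRST symbols flow into its head until nothing changes. A Python sketch for a small hypothetical grammar (the grammar itself is only an example):

```python
# Grammar: E -> T E',  E' -> + T E' | eps,  T -> id   ("" denotes epsilon)
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["id"]],
}

def first_sets(grammar):
    """Iterate to a fixed point; "" stands for the empty string (epsilon)."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for head, prods in grammar.items():
            for prod in prods:
                derives_eps = True          # does the whole body derive epsilon?
                for sym in prod:
                    sym_first = first[sym] if sym in grammar else {sym}
                    new = (sym_first - {""}) - first[head]
                    if new:
                        first[head] |= new
                        changed = True
                    if "" not in sym_first:
                        derives_eps = False
                        break               # this symbol blocks epsilon
                if derives_eps and "" not in first[head]:
                    first[head].add("")
                    changed = True
    return first
```

Here FIRST(E) = {id}, FIRST(E') = {+, epsilon}, and FIRST(T) = {id}, matching a hand computation.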

Left Recursion Removal and Left Factoring

If a production's right-hand side begins with its own non-terminal (e.g., $A \rightarrow A \alpha \mid \beta$), the grammar has immediate left recursion, which top-down parsers cannot handle. Such a production must be converted into:

$A \rightarrow \beta A'$
$A' \rightarrow \alpha A' \mid \epsilon$

Left Factoring is applied when two or more productions for the same non-terminal start with the same prefix; the common prefix is factored out so a predictive parser can choose between the alternatives after reading it.
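The left-recursion transformation can be written directly. A minimal Python sketch, assuming productions are lists of symbols and the fresh non-terminal is named by appending a prime:

```python
def remove_left_recursion(head, productions):
    """Split A-productions into left-recursive (A -> A alpha) and the rest (A -> beta),
    then rebuild them as A -> beta A' and A' -> alpha A' | eps ([] denotes epsilon)."""
    recursive = [p[1:] for p in productions if p and p[0] == head]   # the alphas
    others    = [p for p in productions if not p or p[0] != head]    # the betas
    if not recursive:
        return {head: productions}       # nothing to do
    new = head + "'"
    return {
        head: [beta + [new] for beta in others],              # A  -> beta A'
        new:  [alpha + [new] for alpha in recursive] + [[]],  # A' -> alpha A' | eps
    }
```

Applied to $A \rightarrow A a \mid b$, this produces $A \rightarrow b A'$ and $A' \rightarrow a A' \mid \epsilon$, exactly the scheme above.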

Directed Acyclic Graph (DAG) Construction

DAG construction removes repeated sub-expressions. For example, if an expression like (a + b) appears twice, a single node is created and reused, improving efficiency.
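Node reuse is commonly implemented by keying each node on its operator and operands (hash-consing). A minimal Python sketch of that idea:

```python
class DAG:
    """Expression DAG that reuses a node whenever (op, left, right) repeats."""
    def __init__(self):
        self.nodes = {}                  # (op, left, right) -> node id

    def node(self, op, left=None, right=None):
        """Return the id of an existing identical node, or create a new one."""
        key = (op, left, right)
        if key not in self.nodes:
            self.nodes[key] = len(self.nodes)
        return self.nodes[key]

dag = DAG()
a, b = dag.node("a"), dag.node("b")
s1 = dag.node("+", a, b)   # first occurrence of a + b
s2 = dag.node("+", a, b)   # second occurrence reuses the same node
```

Both occurrences of (a + b) map to one node, so only three nodes exist in total: a, b, and the shared sum.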

Three Address Code (TAC)

TAC breaks complex expressions into simple steps, typically using temporary variables. Each instruction has at most three operands.
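A small Python sketch of TAC generation, assuming expressions are nested `(op, left, right)` tuples; the temporary-naming scheme is an illustrative choice:

```python
def to_tac(expr, code, temps):
    """expr is a variable name (str) or an (op, left, right) tuple.
    Emits instructions into `code`; returns the name holding the result."""
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    l = to_tac(left, code, temps)
    r = to_tac(right, code, temps)
    temps[0] += 1
    t = f"t{temps[0]}"
    code.append(f"{t} = {l} {op} {r}")   # at most three operands per instruction
    return t

code = []
to_tac(("+", "a", ("*", "b", "c")), code, [0])
# code is now: ["t1 = b * c", "t2 = a + t1"]
```

The nested expression a + b * c is flattened into two simple steps, each naming one temporary result.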

Nondeterministic vs. Deterministic Finite Automata (NFA/DFA)

NFA (Nondeterministic Finite Automaton)
Allows multiple possible transitions for a given input symbol and can include $\epsilon$-moves (transitions without consuming input).
DFA (Deterministic Finite Automaton)
Requires one definite transition per symbol and does not allow $\epsilon$-moves. A DFA is generated from an NFA using subset construction.
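The two operations at the heart of subset construction are the epsilon-closure and the move on a symbol. A minimal Python sketch over a tiny hand-written NFA (states and transitions are a hypothetical example; `""` marks an epsilon move):

```python
# NFA: epsilon move 0 -> 1, and on 'a': 1 -> 2.
NFA = {(0, ""): {1}, (1, "a"): {2}}

def eps_closure(states):
    """All states reachable from `states` using only epsilon moves."""
    stack, seen = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in NFA.get((s, ""), ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return frozenset(seen)

def move(states, symbol):
    """All NFA states reachable on `symbol` from any state in the set."""
    return frozenset(t for s in states for t in NFA.get((s, symbol), ()))

start = eps_closure({0})                  # DFA start state: {0, 1}
after_a = eps_closure(move(start, "a"))   # DFA state after reading 'a': {2}
```

Each set of NFA states becomes one DFA state, which is why the resulting automaton is deterministic.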

Symbol Table Implementation and Purpose

The Symbol Table stores crucial information about program elements, including variables, types, functions, scope, and memory location. It is usually implemented using hash tables for efficient lookup.
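A minimal sketch of such a table in Python, using one dict (hash table) per scope; the stored attributes are illustrative:

```python
class SymbolTable:
    """Stack of hash tables, one per scope; lookup searches innermost-first."""
    def __init__(self):
        self.scopes = [{}]                      # global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def define(self, name, info):
        self.scopes[-1][name] = info            # e.g. type, memory location

    def lookup(self, name):
        for scope in reversed(self.scopes):     # innermost declaration wins
            if name in scope:
                return scope[name]
        return None

table = SymbolTable()
table.define("x", {"type": "int"})
table.enter_scope()
table.define("x", {"type": "float"})            # shadows the outer x
```

While the inner scope is open, `lookup("x")` finds the float declaration; after `exit_scope()`, the outer int declaration is visible again.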

Peephole Optimization Techniques

Peephole optimization is a small-window optimization technique applied to the generated code. It performs tasks such as:

  • Removing unnecessary instructions.
  • Replacing slower code sequences with faster equivalents.
  • Folding constants (performing constant calculations at compile time).
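Two of these tasks can be sketched in a few lines. The pass below works on hypothetical TAC-style strings of the form `"dst = a op b"`; the instruction format is an assumption for illustration:

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def peephole(code):
    """Fold constant arithmetic and drop useless self-assignments (x = x)."""
    out = []
    for instr in code:
        dst, rhs = instr.split(" = ", 1)
        parts = rhs.split()
        # constant folding: evaluate "<int> <op> <int>" at compile time
        if (len(parts) == 3 and parts[1] in OPS
                and parts[0].isdigit() and parts[2].isdigit()):
            rhs = str(OPS[parts[1]](int(parts[0]), int(parts[2])))
        if rhs == dst:                      # remove unnecessary instruction
            continue
        out.append(f"{dst} = {rhs}")
    return out
```

For example, `peephole(["t1 = 2 + 3", "x = x", "y = t1"])` returns `["t1 = 5", "y = t1"]`: the addition is folded and the self-assignment is removed.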

Intermediate Code Generation (IR)

This phase generates an intermediate representation (IR), such as Three Address Code (TAC), that is independent of the target machine architecture.

Classification and Recovery of Compiler Errors

Compiler errors are typically classified into four main types:

  • Lexical Errors: Invalid characters or illegal tokens.
  • Syntax Errors: Wrong structural arrangement of tokens (grammar violations).
  • Semantic Errors: Type mismatch, undeclared variables, or logical inconsistencies.
  • Runtime Errors: Errors detected during program execution (e.g., division by zero, array bounds violation).

Error recovery techniques include panic mode, phrase level recovery, error productions, and global correction.
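Panic mode, the simplest of these techniques, discards input until a synchronizing token (such as a statement terminator) is found. A minimal Python sketch over a token list; the sync-token set is an illustrative assumption:

```python
SYNC = {";", "}"}   # hypothetical synchronizing tokens

def panic_mode_skip(tokens, pos):
    """Skip tokens from the error position until just past a sync token,
    so parsing can resume at the start of the next statement."""
    while pos < len(tokens) and tokens[pos] not in SYNC:
        pos += 1
    return min(pos + 1, len(tokens))   # consume the sync token too, if present

tokens = ["x", "=", "@", "@", ";", "y", "=", "1"]
resume = panic_mode_skip(tokens, 2)    # error detected at the first '@'
```

After skipping, parsing resumes at `y`, the first token of the next statement, so one syntax error does not trigger a cascade of spurious ones.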