Compiler Design Phases: From Lexical Analysis to Code Generation

Lexical Analysis

The first phase of a compiler, lexical analysis or scanning, involves reading the source program’s character stream and grouping them into meaningful sequences called lexemes. For each lexeme, the lexical analyzer generates a token, which is then passed to the syntax analysis phase. The token consists of a token-name, an abstract symbol used during syntax analysis, and an attribute-value pointing to the token’s symbol table entry. The symbol table stores information like name and type, which are necessary for semantic analysis and code generation.

For instance, consider the assignment statement:

position = initial + rate * 60

The lexemes in this statement would be mapped into tokens as follows:

  1. position: Token <id, 1>, where ‘id’ represents identifier and ‘1’ points to the symbol table entry for ‘position’.
  2. =: Token <=, >
  3. initial: Token <id, 2>, where ‘2’ points to the symbol table entry for ‘initial’.
  4. +: Token <+, >
  5. rate: Token <id, 3>, where ‘3’ points to the symbol table entry for ‘rate’.
  6. *: Token <*, >
  7. 60: Token <60, >

The lexical analyzer discards blanks and passes these tokens to the next phase.

MWOk00AAAAASUVORK5CYII= wH2lLhz23NbbgAAAABJRU5ErkJggg==

Syntax Analysis

The second phase, syntax analysis or parsing, uses the token-names from the lexical analyzer to create a tree-like representation of the grammatical structure, often a syntax tree. In a syntax tree, interior nodes represent operations, and their children represent the arguments. The syntax tree for the example statement would look like this:

position = initial + rate * 60

The tree illustrates the order of operations, with multiplication taking precedence over addition due to operator precedence rules.

Semantic Analysis

The semantic analyzer checks the source program for semantic consistency using the syntax tree and symbol table information. It also collects type information for later use during intermediate code generation. Type checking ensures that operators have compatible operands. For example, if an array index requires an integer, using a floating-point number would result in an error. The language may allow certain type conversions, such as converting an integer to a floating-point number when used with a floating-point operator.

In our example, if ‘position’, ‘initial’, and ‘rate’ are declared as floating-point numbers and ’60’ is an integer, the type checker would identify the need to convert ’60’ to a floating-point number before the multiplication.

Intermediate Code Generation

Many compilers generate an intermediate representation after syntax and semantic analysis. This low-level representation acts as a program for an abstract machine and should be easy to produce and translate into the target machine code. The intermediate code generator might produce three-address code for our example:

  • t1 = inttofloat(60)
  • t2 = id3 * t1
  • t3 = id2 + t2
  • id1 = t3

Three-address instructions have at most one operator on the right side, ensuring a fixed order of operations. Temporary names hold intermediate values, and some instructions may have fewer than three operands.

Code Optimization

The code optimization phase aims to improve the intermediate code for better target code generation, often focusing on speed, code size, or power consumption. A simple optimization in our example would be to precompute the floating-point value of ’60’ at compile time, eliminating the ‘inttofloat’ operation. Additionally, since ‘t3’ is used only once, the code can be further simplified:

t1 = id3 * 60.0
id1 = id2 + t1

Code Generation

The code generator translates the optimized intermediate representation into the target language, such as machine code. It assigns registers or memory locations to variables and converts intermediate instructions into machine instructions. For our example, using registers R1 and R2, the code might look like this:

  • LDF R2, id3
  • MULF R2, R2, #60.0
  • LDF R1, id2
  • ADDF R1, R1, R2
  • STF id1, R1

This code loads values, performs calculations, and stores the final result, effectively implementing the original assignment statement.

Symbol-Table Management

The symbol table is a data structure that stores records for each variable name, including their attributes. It is designed for efficient lookup and retrieval of information during compilation.