Project built with Gini
Welcome to Comp, an educational compiler project written in Go. The goal of this project is to demystify how compilers work by building one from scratch, showing every phase of the translation pipeline from source code to executable machine code (via C/GCC).
All of this wouldn't be possible if it weren't for this guy, Tlaceby, and my college's old books and slides.
A compiler is a program that translates source code written in a high-level programming language into another language (often low-level machine code or intermediate representations). This process is divided into several well-defined phases:
graph TD
Src[Source Code] --> Lex[1. Lexer / Lexical Analyzer]
Lex --> |Tokens| Par[2. Parser / Syntactic Analyzer]
Par --> |AST / ATS| Sem[3. Semantic Analyzer]
Sem --> |Decorated AST & Symbol Table| Codegen[4. Interpreter / Code Generator]
Codegen --> |C Code / GCC| Bin[5. Target Executable / Output]
Below is a detailed breakdown of each step, including its meaning, what it does, code/conceptual examples, and suggested study topics.
The Lexer (short for Lexical Analyzer) is the first phase of a compiler. It reads the raw stream of characters representing the source code and groups them into meaningful sequences called Tokens while discarding trivia like whitespace, tabs, and comments.
- Scans the source file character by character.
- Recognizes patterns (using regular expressions or state machines) that form words of the language.
- Produces a stream of Tokens (e.g.,
IDENTIFIER,NUMBER,PLUS,ASSIGN,IF,EOF). - Tracks metadata like line and column numbers for error reporting.
Source Input:
let x = 5 + 10;Lexer Output (Token Stream):
[]Token{
{Type: TOKEN_LET, Literal: "let", Line: 1, Column: 1},
{Type: TOKEN_IDENT, Literal: "x", Line: 1, Column: 5},
{Type: TOKEN_ASSIGN, Literal: "=", Line: 1, Column: 7},
{Type: TOKEN_INT, Literal: "5", Line: 1, Column: 9},
{Type: TOKEN_PLUS, Literal: "+", Line: 1, Column: 11},
{Type: TOKEN_INT, Literal: "10", Line: 1, Column: 13},
{Type: TOKEN_SEMICOLON,Literal: ";", Line: 1, Column: 15},
{Type: TOKEN_EOF, Literal: "", Line: 1, Column: 16},
}- Regular Expressions (Regex): Used to define patterns for tokens (e.g., numeric literals, identifiers).
- Finite Automata: Finite State Machines (DFAs and NFAs) used to recognize lexical patterns efficiently.
- Lexer Generators: Tools like
lex,flex, or Go'sgolex. - Buffer Management: Handling large source files efficiently using sliding windows or double buffering.
The Parser takes the flat sequence of tokens generated by the Lexer and determines whether they form valid sentences according to the grammar rules of the programming language.
- Validates the syntax of the token stream.
- Implements the grammar rules (syntax rules) of the language.
- Organizes the tokens into a hierarchical tree structure, typically an Abstract Syntax Tree (AST).
- Reports syntax errors (e.g., mismatched parentheses, missing semicolons).
For the statement let x = 5 + 10;, the parser matches it against the rule:
VariableDeclaration -> "let" Identifier "=" Expression ";"
It parses the expression 5 + 10 recursively, recognizing the operator precedence (addition) and groups them as a binary expression node.
- Context-Free Grammars (CFGs): Formal notations like Backus-Naur Form (BNF) or Extended BNF (EBNF) used to define programming language syntax.
- Parsing Algorithms:
- Top-down Parsers: Recursive Descent Parsers, LL(k) parsers.
- Bottom-up Parsers: LR, LALR, SLR parsers (often used by parser generators).
- Pratt Parsing (Top-Down Operator Precedence): A clean and highly readable hand-written parsing technique ideal for parsing expressions and binary operators.
- Parser Generators: Tools like
yacc,bison, orgoyacc.
The AST (Abstract Syntax Tree, sometimes referred to as ATS / Abstract Syntax Structure) is a hierarchical, tree-like data structure that represents the logical syntactic structure of the source code. Unlike a Concrete Syntax Tree (Parse Tree), the AST omits syntactic sugar and punctuation details (like parentheses, commas, or semicolons) and focuses purely on the semantics of the structure.
- Acts as the central intermediate representation (IR) of the program structure.
- Represents expressions, statements, declarations, and control flow in a nested tree structure.
- Serves as the primary structure traversed by the Semantic Analyzer, Interpreter, and Code Generator.
In Go, an AST node representing the variable declaration let x = 5 + 10; might look like:
type LetStatement struct {
Token Token // The 'let' token
Name *Identifier // Node representing 'x'
Value Expression // Node representing '5 + 10'
}
type InfixExpression struct {
Token Token // The '+' token
Left Expression // Node representing '5'
Operator string // "+"
Right Expression // Node representing '10'
}Visual Tree Representation:
LetStatement (Name: x)
|
InfixExpression (+)
/ \
Integer(5) Integer(10)
- Tree Traversal Algorithms: Depth-First Search (DFS), Pre-order, Post-order, and In-order traversals of AST nodes.
- Design Patterns (Visitor Pattern): A design pattern commonly used in object-oriented languages to write clean, modular compiler passes (like type checking or codegen) without modifying the AST node classes/structs.
- AST Node Modeling in Go: Leveraging Go interfaces (
Node,Statement,Expression) to build a type-safe tree hierarchy.
The Semantic Analyzer inspects the AST to ensure the program makes sense and follows the semantic rules of the language that cannot be easily verified by the context-free grammar of the parser.
- Scope Resolution: Ensures variables, functions, and types are declared before use, and handles variable scoping rules.
- Type Checking: Verifies that operators are applied to compatible types (e.g., you cannot add a string to an integer, or assign a float to a boolean variable).
- Symbol Table Construction: Maintains a "Symbol Table" containing information about declarations, scopes, types, and variables.
- Control Flow Analysis: Ensures functions return values where required, and keywords like
breakorcontinueare only used inside loops.
Suppose we have the following code snippet:
let x = 5;
let y = x + "hello"; // Semantic Error!- Lexer successfully tokens it (all characters are valid).
- Parser successfully parses it (matches
VariableDeclarationsyntax). - Semantic Analyzer looks up the type of
x(integer) and the literal"hello"(string), checks the operator+rule, and throws a compile-time semantic error:Type Mismatch: cannot add int and string.
- Symbol Tables (Environment): Designing scoping mechanisms (lexical scope, dynamic scope, nested environments).
- Type Systems: Static vs dynamic typing, type inference (e.g., Hindley-Milner type inference), type coercion, and type safety.
- Attribute Grammars: Formalisms for assigning semantic values (attributes) to the nodes of the syntax tree.
The Interpreter or Code Generator (Builder) is the final backend phase.
- An Interpreter directly executes the AST node-by-node or compiles it to virtual machine bytecode to execute it on the fly.
- A Code Generator (Builder) translates the AST into a target language (like C, assembly, or LLVM IR). This output is then compiled into a final native executable using an external toolchain, such as GCC or Clang.
- Traversing the decorated AST.
- If building (compiling):
- Emits equivalent C code or Assembly instructions from each AST node.
- Writes the generated code to a file (e.g.,
output.c). - Invokes an external compiler (like
gcc) via system commands to build the executable binary.
- If interpreting:
- Evaluates expressions on the fly using runtime variables/environment.
- Produces immediate execution results.
For the AST of let x = 5 + 10;, the Code Generator translates it to:
#include <stdio.h>
int main() {
int x = 5 + 10;
return 0;
}Then, the compiler invokes GCC under the hood:
gcc output.c -o myprogram- Target Code Generation: Emitting machine code, intermediate C code, or Assembly (x86_64, ARM).
- Intermediate Representation (IR): 3-Address Code (3AC), Static Single Assignment (SSA) form, or LLVM IR.
- Compiler Optimization: Constant folding, loop unrolling, dead code elimination, register allocation.
- Runtime Systems & Memory Management: Garbage collection, stack frames, heap allocation.