This repository contains a Python implementation of a toy C compiler, that passes all of the tests for Chapter 1 of nlsandler / writing-a-c-compiler-tests. It uses the Visitor Pattern to traverse and process the Abstract Syntax Tree (AST) during the compilation process. The compiler is designed as a learning tool to understand the key stages of compilation, including lexical analysis, parsing, assembly ast generation, and emmitting assembly code.
The compiler is implemented in Python and follows a modular design, with each stage of the compilation process clearly separated. The Visitor Pattern is used extensively to traverse and manipulate the AST, making the code modular, extensible, and easy to understand.
- Lexical Analysis: Converts the source code into a stream of tokens.
- Parsing: Builds an Abstract Syntax Tree (AST) from the tokens.
- assembly ast generation: Generates an assembly AST.
- Emmiting code: Emits assmebly code and links it into an executable using GCC.
- Visitor Pattern: Used to traverse and process the AST during parsing and code generation.
The compiler follows these stages to transform a C source file into an executable:
-
Preprocessing:
- The source file is preprocessed using GCC to handle macros and includes.
- Output: A preprocessed
.i
file.
-
Lexical Analysis:
- The preprocessed file is tokenized into a list of tokens (e.g., keywords, identifiers, operators).
- Output: A list of
Token
objects.
-
Parsing:
- The tokens are parsed into an Abstract Syntax Tree (AST) using a recursive descent parser.
- Output: An AST representing the structure of the program.
-
Assembly AST Generation
- Generates an assembly AST.
-
Assembly Code:
- The AST is traversed using the Visitor Pattern to generate assembly code.
- Output: Assembly code in a
.s
file.
-
Assembly and Linking:
- The assembly code is converted into machine code and linked into an executable using GCC.
- Output: An executable file (
.exe
on Windows or no extension on Unix-like systems).
The Visitor Pattern is used to separate the logic for traversing the AST from the operations performed on it. This makes the code modular and allows for easy extension. For example:
- ParserPrinter: A visitor that prints the AST in a human-readable format.
- GeneratorPrinter: A visitor that generates assembly code from the AST.
- CodeEmitter: A visitor that emits machine code from the assembly AST.
Each visitor implements a visit
method for each node type in the AST, allowing for custom behavior at each step.
- Python 3.7 or higher
- GCC (for preprocessing and linking)
- Linux or Mac or Windows OS
-
Clone the repository:
git clone https://github.com/https://github.com/JamesSullivan/python-visitor-c-compiler.git cd python-visitor-c-compiler
-
Run the compiler:
python compiler.py <path-to-source-file.c> [options]
--lex
: Stop after lexical analysis and print the tokens.--parse
: Stop after parsing and printing the AST.--codegen
: Stop after generating and printing the assembly AST.-S
: Stop after generating the assembly file (.s
).
python compiler.py return_2.c --parse
-
Input File (
return_2.c
):int main(void) { return 2; }
-
Preprocessing:
- Generates
return_2.i
.
- Generates
-
Lexical Analysis:
- Outputs a list of tokens:
Token(type='INT', value='int', line=1, column=1) Token(type='ID', value='main', line=1, column=5) Token(type='LPAREN', value='(', line=1, column=9) Token(type='VOID', value='void', line=1, column=10) Token(type='RPAREN', value=')', line=1, column=14) Token(type='LCUB', value='{', line=1, column=16) Token(type='NEWLINE', value='\n', line=1, column=17) Token(type='RETURN', value='return', line=2, column=2) Token(type='CONSTANT', value='2', line=2, column=9) Token(type='END', value=';', line=2, column=10) Token(type='NEWLINE', value='\n', line=2, column=11) Token(type='RCUB', value='}', line=3, column=1) Token(type='NEWLINE', value='\n', line=3, column=2)
- Outputs a list of tokens:
-
Parsing:
- Outputs the AST:
Program Function: main Return: 2
- Outputs the AST:
-
Assembly AST Generation
- Outputs the Assembly AST:
Program: Function: main Mov: Source: Immediate(2) Destination: Register(eax) Ret
- Outputs the Assembly AST:
-
Code Generation:
- Outputs assembly code:
.global main main: movl $2, %eax ret .section .note.GNU-stack,"",@progbits
- Outputs assembly code:
-
Assembly and Linking:
- Generates an executable file (
return2
orreturn2.exe
).
- Generates an executable file (
python-visitor-c-compiler/
├── assembly/ # Assembly AST generation module
│ ├── ast.py
│ ├── generator.py
│ └── generatorprinter.py
├── parser/ # Parsing module to create AST
│ ├── ast.py
│ ├── parser.py
│ └── parserprinter.py
├── codeemitter.py # Assembly code emission
├── compiler.py # Main driver program
├── compiler.sh # Convenience for Linux users
├── lexer.py # Tokenizes source code
├── LICENSE
├── mistake_lex.c # Example C source with token error
├── mistake_parse.c # Example C source with syntax error
├── README.md
└── return_2.c # Example C valid source file
Contributions are welcome! If you find a bug or want to add a feature, please open an issue or submit a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.
- From the book "Writing a C Compiler"
- Also available at Nora Sandler's Blog
Happy compiling! 🚀