A PEG tokenizer/parser-generator.
This is the program that generates the tokenizer and parser for the Daisho programming language.
Given the specification of a grammar, pgen
generates a very fast
tokenizer and parser for that grammar.
// Keywords
CLASS: "class";
PLUS: "+";
// A state machine that tokenizes single line comments.
SLCOM: (2, 3) {
(0, '/') -> 1;
(1, '/') -> 2;
(2, [^\n]) -> 2;
(2, [\n]) -> 3;
};
// A state machine that tokenizes whitespace.
WS: 1 {
((0, 1), [ \n\r\t]) -> 1;
};
/* Single and multiline C comments are allowed in `.peg` files. */
/* pgen's syntax, written in itself. */
PERCENT: "%";
LESSTHAN "<";
GREATERTHAN: ">";
COMMA: "";
// And so on...
grammar <- (directive / definition)*
directive <- PERCENT LOWERIDENT (&(!eol) WS)* eol
definition <- LOWERIDENT variables? ARROW slashexpr
variables <- LESSTHAN variable (COMMA variable)* GREATERTHAN
variable <- (!(GREATERTHAN / COMMA) wildcard)*
slashexpr <- modexprlist (DIV modexprlist)*
modexprlist <- modexpr*
modexpr <- (LOWERIDENT COLON)? // Variable assignment
(AMPERSAND / EXCLAIMATION)* // Operators
baseexpr // The modified expression
(QUESTION / STAR / PLUS)* // More Operators
(PIPE (STRING / baseexpr))? // Error handlers
baseexpr <- UPPERIDENT // Token to match
/ LOWERIDENT !(LT / ARROW) // Rule to call
/ CODEEXPR // Code to execute
/ OPENPAREN slashexpr CLOSEPAREN
%node EOL
eol <- {
bool iseol = 0;
if (ctx->pos >= ctx->len) {
iseol = 1;
} else if (ctx->tokens[ctx->pos - 1].line < ctx->tokens[ctx->pos].line) {
iseol = 1;
}
ret = iseol ? leaf(EOL) : NULL;
}
wildcard <- {
rule = pgen_astnode_leaf(ctx->alloc, ctx->tokens[ctx->pos++].kind);
}
For more a more precise description of the grammar, see pgen_grammar.peg
. Or just find me and ask me.
The syntax of pgen is based on the paper "Parsing Expression Grammars: A Recognition-Based Syntactic Foundation"
by Bryan Ford. This specific parser is inspired by packcc
by Arihiro Yoshida. You may see many commonalities. The main difference is that while packcc
parses at the source level,
pgen
introduces a tokenizer and parses the token stream instead.
/
- Try to match the left side, then try to match the right side. Returns the first that matches. Otherwise fail.&
- Try to parse, perform the match, but rewind back to the starting position and return SUCC. Otherwise fail as usual.!
- Try to parse, return SUCC on no match and fail on match.?
- Optionally match, returning either the result, or SUCC if no match. Does not cause the rule to fail.*
- Match zero or more. Returns SUCC.+
- Match one or more. Returns SUCC, or fails if no matches.()
- Matches if all expressions inside match. Returns SUCC or the single match within if there's only one.{}
- Code to insert into the parser. Assign toret
for the return value of this expression, orrule
for the rule.:
- Capture the info from a match inside a variable in the current rule.|
- Register an error using the string or expression on the right, and exit all parsing.
%oom
- Define the action that should be taken when out of memory%node
- Define an ASTNode kind%preinclude
- Include a file before astnode, but after support libs%include
- Include a file after astnode, but before the parser%postinclude
- Include a file after the parser%predefine
- #define something before astnode, but after support libs%define
- #define something after astnode, but before the parser%postdefine
- #define something after the parser%precode
- Insert code before astnode, but after support libs%code
- Insert code after astnode, but before the parser%postcode
- Insert code after the parser%extra
- Add fields to the astnode%extrainit
- Add initialization to the astnode%errextra
- Add fields to errors%errextrainit
- Add initialization to errors
rec(label)
- Record the parser's state to a labelrew(label)
- Rewind the parser's state to a labelnode(kind, children...)
- Create an astnode with a kind name and fixed number of childrenkind(name)
- Get the enum value of an astnode kind namelist(kind)
- Create an astnode with a kind name and a dynamic number of childrenleaf(kind)
- Create an astnode with no childrenadd(list, node)
- Add an astnode as a child to an astnode created by list()has(node)
- 0 if the node is NULL or SUCC, 1 otherwise.repr(node, ofnode)
- Set the string representation of the current node to another node'ssrepr(node, string)
- Set the string representation of node to a cstringcprepr(node, cps, len)
- Set the string representation of node to a codepoint stringexpect(kind, cap) - Parses a token the same way
TOKEN` does. Returns the astnode if cap(tured).
INFO(msg)
- Log an error to ctx->errlist with the position and severity 0.WARNING(msg)
- Log an error to ctx->errlist with the position and severity 1.ERROR(msg)
- Log an error to ctx->errlist with the position and severity 2.FATAL(msg)
- Log an error to ctx->errlist with the position and severity 3, and sets ctx->exit = 1.INFO_F(msg, freefn)
- INFO(), but pgen_allocator_destroy(ctx->alloc) calls freefn(msg).WARNING_F(msg, freefn)
- WARNING(), but pgen_allocator_destroy(ctx->alloc) calls freefn(msg).ERROR_F(msg, freefn)
- ERROR(), but pgen_allocator_destroy(ctx->alloc) calls freefn(msg).FATAL_F(msg, freefn)
- FATAL(), but pgen_allocator_destroy(ctx->alloc) calls freefn(msg).
There's documentation, but realistically you're not going to figure everything out on your own. Talk to me, submit an issue, send me an email, or find me on Discord, and I can walk you through how to use it.
C code in Code expressions are parsed by matching left and right curly braces. Therefore, it could get confused if you write something like { ret = ...; ret->str = "}"; }
. Instead of using "{"
or "}"
, you can use the macros LBSTR
/RBSTR
.
See examples/pl0.c
for the full example put together.
char *input_str = NULL;
size_t input_len = 0;
readFile("pl0.pl0", &input_str, &input_len);
codepoint_t *cps = NULL;
size_t cpslen = 0;
if (!UTF8_decode(input_str, input_len, &cps, &cpslen))
fprintf(stderr, "Could not decode to UTF32.\n"), exit(1);
You will have to create some sort of list data structure to hold the tokens.
Here, we add a token to that list with add_tok
. You will have to roll your
own.
This is also the step where you can discard any tokens you don't want. This you can parse comments and whitespace as tokens, and then ignore them.
The .kind
member of your token struct will contain what kind of token it is,
as described by your .tok
file. When there are no more tokens left to parse,
.kind
of the returned token will be PL0_TOK_STREAMEND
. You can also create
and append your own LANG_TOK_STREAMBEGIN
token at the beginning, if you wish.
pl0_tokenizer tokenizer;
pl0_tokenizer_init(&tokenizer, cps, cpslen);
pl0_token tok;
do {
tok = pl0_nextToken(&tokenizer);
// Discard whitespace, comments, and end of stream,
// add other tokens to the list.
if (!(tok.kind == PL0_TOK_SLCOM | tok.kind == PL0_TOK_MLCOM |
tok.kind == PL0_TOK_WS | tok.kind == PL0_TOK_STREAMEND))
add_tok(tok);
} while (tok.kind != PL0_TOK_STREAMEND);
pgen_allocator allocator = pgen_allocator_new();
pl0_parser_ctx parser;
pl0_parser_ctx_init(&parser, &allocator, toklist.buf, toklist.size);
Any rule can be an entry point for your parser. The function generated for each rule has the signature:
lang_astnode_t *lang_parse_rulename(lang_parser_ctx* ctx);
For our, pl0
example, program
is the rule we want, and we call it to parse the abstract syntax tree like so:
pl0_astnode_t *ast = pl0_parse_program(&parser);
pgen_allocator_destroy(&allocator); // The whole AST is freed with the allocator
free(toklist.buf); // The list of tokens (provide your own)
free(cps); // The file as UTF32
free(input_str); // The file as UTF8
More comprehensive documentation on these things will come eventually.
- Design an algorithm for merging state machines
- Multiple
%node
declarations in one - State Machine automaton state reachability analysis
- Regex tokenizer rules.
- Add a flag to warn on token/astnode kinds not used in the parser
- Rethink Token/Node print functions
- PGEN_RUNTIME_INCLUDE scope guard
- Rewrite memory allocator with GC and option to leak.
- Compiler option to generate runner C file
%drop
tokens%main
and%input
, to complement daisho-explorer.
The license for pgen
is GPLv3. The license applies only to the files already in this repository. The code that you generate using pgen
belongs to you (or whoever has the copyright to the .peg
file it was generated from.
However, if you modify or distribute pgen
itself then that must still follow the rules of the GPL.
If these terms are not acceptable for you, please contact me with your use case.