README
Recursive descent
This is a recursive-descent parser building utility tool written in javascript and inspired by ANTLR. You input a grammar definition and a program conforming to that grammar, the generator will output a syntax tree and a few tools to manipulate and walk that tree.
Features
- LL(1) parser, with 1 token lookahead.
- Backtracking to parse ambiguous grammars.
- Strict LL(1) checking which finds and throws errors at left-recursion, common prefix, used but undefined rules, ambiguous token capturing caused by *?+ operators and rules that potentially producing no tokens.
- Warns about defined but unused (unreachable) rules.
- Rudimentary tree walker.
- Optional trimming of tree branches by removing nodes that have a single non-terminal.
- Optional removal of non-semantic tokens.
- Optional Subrule flattening.
- Lexer customization via callbacks.
Installation
The program can run on node.js. No special build for browsers exists, but it should run on browsers with the help of a module bundler.
Node: npm install recursive-descent
If you cloned the GIT repository, you will need to install npm dependencies.
Usage
Quick example
Note that String.raw
is used to avoid escaping backslashes in regexps. This simulates reading the grammar from a file.
const grammar = String.raw`
//this is a comment
//special ignore token to allow and ignore whitespace
ignore = ~[\s]+~ ;
number = ~[0-9]+(\.[0-9]+)?~ ;
//special program rule to mark the root node
program : expression ;
//< operator flattens subrules
expression : term <('+' term | '-' term)* ;
term : factor <('*' factor | '/' factor)* ;
factor : '-'? atom ;
atom
: number
| '(' expression ')'
;
`;
const recdec = require("recursive-descent");
const rules = recdec.bnfParse(grammar);
const rootNode = recdec.parse(rules, "3 + 4 * 5");
recdec.print(rootNode);
Produces:
program
├─┬ expression
│ ├─┬ term
│ │ └─┬ factor
│ │ └─┬ atom
│ │ └── number: 3
│ ├── string: +
│ └─┬ term
│ ├─┬ factor
│ │ └─┬ atom
│ │ └── number: 4
│ ├── string: *
│ └─┬ factor
│ └─┬ atom
│ └── number: 5
└── #eof:
Grammar definition
Grammar definition syntax is based on a variation of an EBNF (Extended Bachus Naur Form).
A grammar file is a list of rule definitions. Each rule must end with a semicolon. Whitespace is unimportant. There exists a single line comment, which starts with '//' and ends with a new line.
There are 2 types of rules: Lexer rules and parser rules. Lexer rules are used for matching tokens from an input program, rule name and rule body are separated with a '=' sign whereas parser rules use ':'. Lexer rules are not required and tokens can be marked by their literal expressions anywhere in the grammar code. Lexer rules, however, are convenient as it allows matching a token with a name.
All rules may have alternatives, which are separated by '|' sign. Expressions within an alternative can also be grouped with parentheses, which are converted to anonymous rules behind the scenes. Example:
variableDeclaration
: 'var' identifier
| 'let' identifier
;
This could also be written as
variableDeclaration : ('var' | 'let') identifier ;
Lexer rules can contain only one expression per alternative and this expression can represent a string, a regexp or a user defined function from which a token is produced.
String: A string expression is single quote delimited. Like 'if'.
if = 'if'
Regexp: A regexp expression is delimited by '~' symbol. Right after the closing ~, you can put an 'i' to match tokens case insensitively. Such as
~[a-z]+~i
Don't put ^, $ or any other anchors. ^ is automatically inserted at the beginning.number = ~[0-9]+(\.[0-9]+)?~ ;
identifier = ~[A-Za-z][A-Za-z0-9_]*~ ;
Function: Expression consists of a '