README
FLEX.JS
FLEX.JS - Fast lexer (tokenizer, scanner) for JavaScript inspired by FLEX lexer generator.
Description
This is a library for creating scanners: programs which recognized lexical patterns in text. It analyzes its input for occurrences of the regular expressions. Whenever it finds one, it executes the corresponding JavaScript code.
This lexer is inspired by well-known FLEX lexer generator for C. See more: http://westes.github.io/flex/manual/ and https://github.com/westes/flex
What is common between FLEX and FLEX.JS
- general idea, rules, logic
- the same default behavior where it is possible to implement
- the same examples and the same output
Differences between FLEX and FLEX.JS
- FLEX is lexer generator but FLEX.JS is configurable lexer class.
- FLEX.JS uses JavaScript regular expression, this fact effect on syntax and some limitations.
- FLEX could work with streams/buffers but FLEX.JS works with fixed-size strings.
- REJECT action is not a branch, code after REJECT will be executed, but action return value will be ignored.
- Trailing context besides primitive $ is not supported. Lookahead assertion could be used instead but lookahead value is not used to increase weight for expression. (TODO: should be fixable)
- support for buffer
- support buffer switch
- custom input handler
- yywrap()
- EOF rule handling is slightly different (TODO: fix?)
- custom output buffer
- Track line number (TODO: fix)
Simple example
A simple example of float value scanner:
var Lexer = require('flex-js');
var lexer = new Lexer();
// options
lexer.setIgnoreCase(true); // does not make sense for this scanner, just for reference
// definitions
lexer.addDefinition('DIGIT', /[0-9]/);
// rules
lexer.addRule(/{DIGIT}+\.{DIGIT}+/, function (lexer) {
console.log('Found float: ' + lexer.text);
});
lexer.addRule(/\s+/);
// code
lexer.setSource('1.2 3.4 5.6');
lexer.lex();
Format of configuration
The lexer configuration consists of three sections:
- options
- definitions
- rules
- user code
The options section contains configuration for lexer, like default case sensivity behavior.
The definitions section contains declarations of simple name definitions to simplify the scanner specification, and declarations of start conditions, which are explained in a later section. Name definitions have the form:
lexer.addDefinition(name, definition);
The "name" is a word beginning with a letter or an underscore ('_') followed by zero or more letters, digits, '_', or '-' (dash). The definition can subsequently be referred to using "{name}", which will expand to "(definition)". For example,
lexer.addDefinition('DIGIT', /[0-9]/);
lexer.addDefinition('ID', /[a-z][a-z0-9]*/);
defines "DIGIT" to be a regular expression which matches a single digit, and "ID" to be a regular expression which matches a letter followed by zero-or-more letters-or-digits. A subsequent reference to {DIGIT}+"."{DIGIT}*
is identical to ([0-9])+"."([0-9])*
and matches one-or-more digits followed by a '.' followed by zero-or-more digits.
There is no way to set case sensivity flag per definition, only per pattern or globally for whole lexer instance.
The rules section of the lexer configuration contains a series of rules of the form:
lexer.addRule(pattern, action);
lexer.addRules(rules);
lexer.addStateRule(state, pattern, action);
lexer.addStateRules(state, rules);
Lexer defaults and definitions should be added before adding new rules.
Options
- Ignore Case - case sensivity could be set via
setIgnoreCase(false)
orsetIgnoreCase(true)
. By defalt lexer is case sensitive. - Debug Mode - debug mode could be enabled with
setDebugEnabled(true)
. In debug mode lexer will output on console state, expression and matched value for each accepted value. - track line number (TODO)
- read from stdin or custom file handler without boilerplate (TODO)
- echo to stdout, stderr or custom file handler (TODO)
States
addStateRule('s', 'r', action);
- anr
, but only in start conditions
(see below for discussion of start conditions)addStateRule(['s1', 's2', 's3'], 'r', action);
- same, but in any of start conditionss1
,s2
, ors3
addStateRule('*', 'r', action);
oraddStateRule(Lexer.STATE_ANY, 'r', action);
- anr
in any start condition, even an exclusive one.addStateRule(['s1', 's2'], '<<EOF>>', action);
oraddStateRule(['s1', 's2'], Lexer.RULE_EOF, action);
- an end-of-file when in start conditions1
ors2
Patterns
Read more here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions
The patterns are written using an standard syntex for JavaScript regular expressions + string values are also supported. These are:
"[xyz]\"foo"
- the literal string:[xyz]"foo
/x/
- match the characterx
/./
- any unicode character except newline/[xyz]/
- a "character class"; in this case, the pattern matches either anx
, ay
, or az
/[abj-oZ]/
- a "character class" with a range in it; matches ana
, ab
, any letter fromj
througho
, or aZ
/[^A-Z]/
- a "negated character class", i.e., any character but those in the class. In this case, any character EXCEPT an uppercase letter./[^A-Z\n]/
any character EXCEPT an uppercase letter or a newline/r*/
- zero or more r's, where r is any regular expression/r+/
- one or more r's/r?/
- zero or one r's (that is, "an optional r")/r{2,5}/
- anywhere from two to five r's/r{2,}/
- two or more r's/r{4}/
- exactly 4 r's/{name}/
- the expansion of the "name" definition (see above)/[\b]/
- a backspace (U+0008
), you need to use square brackets if you want to match a literal backspace character. (Not to be confused with\b
.)/\b/
- a word boundary. A word boundary matches the position where a word character is not followed or preceded by another word-character. Note that a matched word boundary is not included in the match. In other words, the length of a matched word boundary is zero. (Not to be confused with[\b]
.)/\B/
- a non-word boundary. This matches a position where the previous and next character are of the same type: Either both must be words, or both must be non-words. The beginning and end of a string are considered non-words./\cX/
- whereX
is a character ranging fromA
toZ
. Matches a control character in a string./\d/
- a digit character. Equivalent to[0-9]
./\D/
- a non-digit character. Equivalent to[^0-9]
./\f/
- a form feed (U+000C
)./\n/
- a line feed (U+000A
)./\r/
- a carriage return (U+000D
)./\s/
- a single white space character, including space, tab, form feed, line feed./\S/
- a single character other than white space./\t/
- a tab (U+0009
)./\v/
- a vertical tab (U+000B
)./\w/
- any alphanumeric character including the underscore. Equivalent to[A-Za-z0-9_]
./\W/
- any non-word character. Equivalent to[^A-Za-z0-9_]
./\n/
- wheren
is a positive integer, a back reference to the last substring matching then
parenthetical in the regular expression (counting left parentheses)./\0/
- a NULL (U+0000
) character./\xhh/
- the character with the codehh
(two hexadecimal digits)/\uhhhh/
- caracter with the codehhhh
(four hexadecimal digits)./\u{hhhh}/
- (only whenu
flag is set) the character with the Unicode valuehhhh
(hexadecimal digits)./(r)/
- match anr
; parentheses are used to override precedence (see below)/rs/
- the regular expressionr
followed by the regular expressions
; called "concatenation"/r|s/
- either anr
or ans
/^r/
- anr
, but only at the beginning of a line (i.e., which just starting to scan, or right after a newline has been scanned)./r$/
- anr
, but only at the end of a line (i.e., just before a newline)./x(?=y)/
- anx
only ifx
is followed byy
. This is called a lookahead./x(?!y)/
- anx
only ifx
is not followed byy
. This is called a negated lookahead./\x/
- a backslash that precedes a non-special character indicates that the next character is special and is not to be interpreted literally. A backslash that precedes a special character indicates that the next character is not special and should be interpreted literally."<<EOF>>"
orLexer.RULE_EOF
- an end-of-file.
Note that inside of a character class, all regular expression operators lose their special meaning except escape ('') and the character class operators, '-', ']', and, at the beginning of the class, '^'.
The regular expressions listed above are grouped according to precedence, from highest precedence at the top to lowest at the bottom. Those grouped together have equal precedence. For example, foo|bar*
is the same as (foo)|(ba(r*))
since the '*' operator has higher precedence than concatenation, and concatenation higher than alternation ('|'). This pattern therefore matches either the string "foo" or the string "ba" followed by zero-or-more r's. To match "foo" or zero-or-more "bar"'s, use: foo|(bar)*
and to match zero-or-more "foo"'s-or-"bar"'s: (foo|bar)*
Some notes on patterns:
A negated character class such as the example "[^A-Z]" above will match a newline unless "\n" (or an equivalent escape sequence) is one of the characters explicitly present in the negated character class (e.g., "[^A-Z\n]"). This is unlike how many other regular expression tools treat negated character classes, but unfortunately the inconsistency is historically entrenched. Matching newlines means that a pattern like [^"]* can match the entire input unless there's another quote in the input.
A rule can have at most one instance of trailing context (the '