gelex

Generic lexer, WIP.

Usage no npm install needed!

<script type="module">
  import gelex from 'https://cdn.skypack.dev/gelex';
</script>

README

Gelex

Generic lexer, WIP.

Install

Run

npm install gelex

Use

Get the library reference

const gelex = require('gelex')

Create a lexer definition

const def = gelex.definition()

Define a rule, with token name and expression:

def.define('integer', '[0123456789][0123456789]*');

The text between [ and ] describe optional characters.

The asterisc * indicates zero or more occurrences

Optional characters could be defined in ranges using -:

def.define('name', '[a-zA-Z_][a-zA-Z0-9_]*');

Define a delimited text (a string):

def.defineText('string', '"', '"');

The second argument is the starting text delimiter and the third argument is the ending text delimiter.

Escaped characters could be optionally defined:

def.defineText('string', '"', '"',
    {
        escape: '\\',
        escaped: { 'n': '\n', 'r': '\r', 't': '\t' }
    }
);

The escaped field is a map from mapped character and its final representation. A escaped character not included in this map is mapped to itself, ie: an escaped double quote is mapped to a double quote in the above definition.

Define many rules in one, using an array:

def.define('delimiter', [ '{', '}', ',', ';' ]);
def.define('operator', [ '+', '-', '*', '/', '==', '===', '**', '^', '!', '|', '||', '&', '&&' ]);

It is equivalent to define each rule:

def.define('delimiter', '{' );
def.define('delimiter', '}' );
...

Matching any character using [.]:

def.define('anychar', '[.]');

Define a rule with transform

def.define('symbol', '#[a-zA-Z_][a-zA-Z_]*', trfn);

where trfn is a function that receives the scanned text value and returns another value. Example:

def.define('symbol', '#[a-zA-Z_][a-zA-Z_]*', 
    function (value) {
        return value.substring(1); // removing initial #
    });

Define a custom rule

def.define('character', rule);

where rule is an object with two functions:

  • first(): returns the initial characters for the rule
  • match(scanner): returns null if no match, or the detected value as string if there is a match.

Example, a rule detecting $a as the character a (as in Smalltalk):

function CharacterRule(ch) {
    this.first = function () { return ch; };
    
    this.match = function (scanner) {
        if (scanner.peek() !== ch)
            return null;
        
        scanner.scan();
        
        return scanner.scan();
    };
}

Define a comment

def.defineComment('/*', '*/');

The first argument is the starting text delimiter. The second argument is the ending text delimiter. Current version does not support nested comments, yet.

A comment is processed like an space character.

Define a line comment, giving only one argument:

def.defineComment('//');

Create and use a lexer:

const lexer = def.lexer();

const token = lexer.next();

Each token is retrieved in order invoking lexer next function. It returns null when the tokens are exhausted.

Each token is an object with fields:

  • type: the token type name, defined using the define function; ie integer.
  • value: the string value of the token
  • begin: start position in input text
  • end: end position in input text

Example:

const gelex = require('../..');
const def = gelex.definition();

def.define('integer', '[0123456789][0123456789]*');
def.define('name', '[a-zA-Z_][a-zA-Z0-9_]*');
def.define('delimiter', [ '{', '}', ',', ';' ]);
def.define('operator', [ '+', '-', '*', '/', '==', '===', '**', '^', '!', '|', '||', '&', '&&' ]);
def.defineText('string', "'", "'");
def.defineText('string', '"', '"');

const lexer = def.lexer('1 2 42 foo bar + * {},===== "foo" "bar"');

let token;

while (token = lexer.next())
    console.dir(token);

Expected output:

Expected output:

{ type: 'unknown', value: '1', begin: 0, end: 0 }
{ type: 'unknown', value: '2', begin: 2, end: 2 }
{ type: 'integer', value: '42', begin: 4, end: 5 }
{ type: 'name', value: 'foo', begin: 7, end: 9 }
{ type: 'name', value: 'bar', begin: 11, end: 13 }
{ type: 'operator', value: '+', begin: 15, end: 15 }
{ type: 'operator', value: '*', begin: 17, end: 17 }
{ type: 'delimiter', value: '{', begin: 19, end: 19 }
{ type: 'delimiter', value: '}', begin: 20, end: 20 }
{ type: 'delimiter', value: ',', begin: 21, end: 21 }
{ type: 'operator', value: '===', begin: 22, end: 24 }
{ type: 'operator', value: '==', begin: 25, end: 26 }
{ type: 'string', value: 'foo', begin: 28, end: 32 }
{ type: 'string', value: 'bar', begin: 34, end: 38 }

Versions

  • Version 0.0.1, first version.
  • Version 0.0.2, fixing ManyRule.
  • Version 0.0.3, detecting unclosed strings, match any character rule.
  • Version 0.0.4, custom rule
  • Version 0.0.4, custom rule
  • Version 0.0.5, transform function in define
  • Version 0.0.6, char function in Lexer
  • Version 0.0.7, seek function in Lexer

Previous work

Samples

References

TBD

To Do

  • Support nested comments
  • Detect unclosed comments
  • Programming language sample

License

MIT

Contribution

Feel free to file issues and submit pull requests — contributions are welcome.

If you submit a pull request, please be sure to add or update corresponding test cases, and ensure that npm test continues to pass.