@candlelib/wind

Tokenizer

Lightweight Lexer and Tokenizer

v0.4.0

\ ˈwīnd \ - to raise to a high level [as of excitement or tension]

Install

NPM

npm install --save @candlelib/wind

Usage

note: This package uses ES2015 module syntax and its source files have the .mjs extension. To include it in a project, you may need to run Node with the --experimental-modules flag, or use a bundler that supports ES modules, such as Rollup.

import wind from "@candlelib/wind"

const sample_string = "The 2345 a 0x3456 + 'a string'";

let lexer = wind(sample_string);

//Example

lexer.text                      //=> "The"
lexer.n.tx                      //=> "2345"
lexer.n.text                    //=> "a"
lexer.assert("a")               // matches the current token and advances
lexer.text                      //=> "0x3456"
lexer.ty == lexer.types.number  //=> true

Wind Lexer

import { Lexer } from "@candlelib/wind"

Constructor

new Lexer ( string [ , INCLUDE_WHITE_SPACE_TOKENS ] )

  • string - The input string to parse.
  • INCLUDE_WHITE_SPACE_TOKENS - Flag to include white space tokens such as TABS and NEW_LINE.

note: the default export wind has the same form as the Lexer constructor function and is called without the new keyword.

let lexer = wind ( string [ , INCLUDE_WHITE_SPACE_TOKENS ] )

Properties

  • char (Read-Only) - Number
      The char offset of the token relative to the line.

  • CHARACTERS_ONLY - Boolean
      If true, the Lexer will only produce tokens that are one character in length.

  • END (Read-Only) - Boolean
      If true, the Lexer has reached the end of the input string.

  • IGNORE_WHITE_SPACE - Boolean
      If true white_space and new_line tokens will not be generated.

  • line (Read-Only) - Number
      The index of the current line the token is located at.

  • off - Number
      The absolute index position of the current token measured from the beginning of the input string.

  • p - Wind Lexer
      A pointer cache to a peeking Lexer.

  • PARSE_STRING - Boolean
      If true, string tokens are not generated; the contents of each string are tokenized individually instead.

  • sl - Number
      The length of the input string. Changing sl will cause the Lexer to stop parsing once off+token_length >= sl.

  • str - String
      The string that is being tokenized.

  • string (Read-Only) - String
      Returns the result of slice()

  • string_length (Read-Only) - Number
      The length of the remaining string to be parsed. Same as lex.sl - lex.off.

  • text - String
      The string value for the current token.

  • tl - Number
      The size of the current token.

  • type - Number
      The current token type. See types.

  • types - Object
      Proxy to types object.

  • ch
      The first character of the current token.
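
The offset properties above relate to one another by simple arithmetic. The sketch below models them with a plain object instead of a real Lexer, so it runs without the package installed. The state object is hypothetical, and the idea that text equals the tl characters starting at off is an assumption made for illustration, not something taken from the library source.

```javascript
// Plain-object model of the documented fields; not a Wind Lexer instance.
const str = "The 2345 a";
const state = {
  str,             // the string being tokenized
  off: 4,          // absolute offset of the current token ("2345")
  tl: 4,           // length of the current token
  sl: str.length,  // parse limit, here the full input length
};

// Documented relation: string_length is the same as sl - off.
const stringLength = state.sl - state.off;

// Assumption: the token text is the tl characters starting at off.
const tokenText = state.str.slice(state.off, state.off + state.tl);

// Documented stop condition: parsing ends once off + token_length >= sl.
const atLimit = state.off + state.tl >= state.sl;

console.log(stringLength, tokenText, atLimit); // 6 2345 false
```

Lowering sl in this model to 8 makes atLimit true, which mirrors how a reduced sl ends parsing early.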

Alias properties

  • n
      Property proxy for next().

  • string
      Returns the result of slice().

  • token
      Property proxy for copy()

  • tx
      Proxy for text.

  • ty
      Proxy for type.

  • pos
      Proxy for off.

  • pk
      Property proxy for peek().

Methods

  • Lexer - assert ( text )
      Compares the current token's text value to the argument text. If the values are the same, the lexer advances to the next token. If they are not equal, an error is thrown.

    • Returns Lexer to allow method chaining.
  • Lexer - assertCharacter ( char )
      Same as assert() except compares a single character only.

    • Returns Lexer to allow method chaining.
  • Lexer - comment ( [ ASSERT [ , marker ] ] )
      Skips to the end of the comment section if the current token is / and the peek token is / or *. If true is passed for the ASSERT argument then an error is thrown if the current token plus the peek token is not /* or //.

    • Returns Lexer to allow method chaining.
  • Lexer - copy ( [ destination ])
      Copies the value of the lexer to destination. destination defaults to a new Wind Lexer.

  • Lexer - fence ( [ marker ] )
      Reduces the input string's parse length by the value of marker.off. marker must be a Wind Lexer that has the same input string as the Wind Lexer that fence() is called on.

    • Returns Lexer to allow method chaining.
  • Lexer - next ( [ marker ] )
      Advances the marker to the next token in its input string. Returns marker or null if the end of the input string has been reached. marker defaults to the calling Wind Lexer object, which means this will be returned if no value is passed as marker.

    • Returns Lexer to allow method chaining.
  • Lexer - peek ( [ marker [ , peek_marker ] ] )
      Returns another Wind Lexer that is advanced one token ahead of marker. marker defaults to this and peek_marker defaults to p. A new Wind Lexer is created if no value is passed as peek_marker and marker.p is null.

  • Lexer - reset ( )
      Resets the lexer completely. After this is called, the lexer must be given a new input string before it can begin parsing again.

    • Returns Lexer to allow method chaining.
  • Lexer - resetHead ( )
      Reset the lexer to the beginning of the string.

    • Returns Lexer to allow method chaining.
  • Lexer - setString ( string [ , RESET ] )
      Changes the input string to string. If the optional RESET argument is true then resetHead() is also called.

    • Returns Lexer to allow method chaining.
  • String - slice ( [ start ] )
      Returns a substring of the input string that starts at start and ends at the value of off. If start is undefined then the substring starts at off and ends at sl.

  • Lexer - sync ( [ marker ] )
      Copies the current values of the marker object to the Wind Lexer. marker defaults to the value of p.

    • Returns Lexer to allow method chaining.
  • throw ( message )
      Throws a new Error with a custom message and information to indicate where in the input string the current token is positioned.

  • String - toString ( )
      Returns the result of slice().

  • trim ( )
      Creates and returns new Lexer with leading and trailing whitespace and line terminator characters removed from the input string.
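
Because assert() and next() return the Lexer, a small parser can be written as a chain of calls. The following is a minimal sketch of that pattern. To stay runnable without the package, it uses a hypothetical stand-in (makeStandInLexer) that splits on white space and mimics only text, END, next() and assert(); it is not the Wind implementation, and parseAssignments is an illustrative name. In this stand-in next() always returns the lexer, whereas the real next() is documented as able to return null at the end of input.

```javascript
// Stand-in lexer for illustration only. It mimics just the members used
// below: text, END, next(), assert().
function makeStandInLexer(input) {
  const tokens = input.split(/\s+/).filter(Boolean);
  let index = 0;
  return {
    get text() { return tokens[index]; },
    get END() { return index >= tokens.length; },
    next() { index++; return this; },
    assert(expected) {
      if (this.text !== expected) {
        throw new Error(`Expected "${expected}" but found "${this.text}"`);
      }
      return this.next();
    },
  };
}

// Parses repeated `name = value ;` statements into an object.
function parseAssignments(lexer) {
  const result = {};
  while (!lexer.END) {
    const name = lexer.text;
    // assert() returns the lexer, so reading the value can be chained.
    const value = lexer.next().assert("=").text;
    lexer.next().assert(";");
    result[name] = value;
  }
  return result;
}

const parsed = parseAssignments(makeStandInLexer("a = 1 ; b = 2 ;"));
console.log(parsed); // { a: '1', b: '2' }
```

With a real Wind Lexer the same parseAssignments function should apply in principle, since it only relies on members documented above, but that has not been checked against the library here.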

Alias Methods

  • a ( text )
      Proxy for assert(text).

  • aC ( char )
      Proxy for assertCharacter(char).

  • r ( )
      Proxy for reset().

  • s ( [ start ] )
      Proxy for slice(start).

Types

There are 10 types of tokens that the Wind Lexer will create. Type identifiers can be accessed through wind.types, Lexer.types, and the types property in Lexer instances. Each type is identified by a power-of-2 value, so several types can be combined with bitwise OR and tested against the current token in a single comparison:

(lexer.type & (lexer.types.identifier | lexer.types.symbol)) ? true : false;

  • types.identifier or types.id
      Any set of characters beginning with _|a-z|A-Z, and followed by 0-9|a-z|A-Z|-|_|#|$.

  • types.number or types.num
      Any set of characters beginning with 0-9|., and followed by 0-9|..

  • types.string or types.str
      A set of characters beginning with either ' or " and ending with a matching ' or ".

  • types.open_bracket or types.ob
      A single character from the set <|(|{|[.

  • types.close_bracket or types.cb
      A single character from the set >|)|}|].

  • types.operator or types.op
      A single character from the set *|+|<|=|>|\|&|%|!|||^|:.

  • types.new_line or types.nl
      A single newline (LF) character. It may also be a CRLF pair if the input string has Windows-style new lines.

  • types.white_space or types.ws
      An uninterrupted set of tab or space characters.

  • types.symbol or types.sym
      All other characters not defined by the above, with each symbol token being comprised of one character.

  • types.data_link or types.dl
      A data link ASCII character, followed by two more characters and another data link character.
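
To show why power-of-2 identifiers make combined type checks cheap, here is a small self-contained sketch. The numeric values in this types object are hypothetical, since the README does not state which power of 2 each type uses, and isAnyOf is an illustrative helper rather than part of the library.

```javascript
// Hypothetical flag values: only the "power of 2" property is documented.
const types = {
  identifier: 1,
  number: 2,
  string: 4,
  symbol: 8,
};

// True when `type` matches at least one flag set in `mask`.
function isAnyOf(type, mask) {
  return (type & mask) !== 0;
}

// Combine two types into one mask with bitwise OR.
const wordOrSymbol = types.identifier | types.symbol;

const results = [
  isAnyOf(types.identifier, wordOrSymbol),
  isAnyOf(types.symbol, wordOrSymbol),
  isAnyOf(types.number, wordOrSymbol),
];
console.log(results); // [ true, true, false ]
```

With a real Lexer the equivalent check would use lexer.type and lexer.types in place of the hypothetical constants.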