README
Drew
Declarative Rewriting Expressions
Apply a query to input that has been cut up into chunks, or tokens, so you can find certain structures with ease and analyze or rewrite them.
The goal of this library is to make it easy to manipulate, investigate, and rewrite tokenized input. Drew lets you work on tokens in a similar way to how string.replace
works on strings in JavaScript. Similar, but not the same.
You take input, pre-processed by a lexer of your choice, and apply a query to it. A callback is called whenever a match is found.
Language presets
The repo contains example scripts for two languages and one for plain text. They will all parse your input in the designated language and supply the required built-in macros and constants for IS_BLACK and IS_NEWLINE. You can override them through the options parameter, if you want to.
src/drew_js.js contains drewJs, which you can call as drewJs(jsCode, query, callback, options).
src/drew_css.js contains drewCss, which you can call as drewCss(cssCode, query, callback, options).
src/drew_txt.js contains drewTxt, which you can call as drewTxt(txtCode, query, callback, options).
NPM
There's a build on npm. You should be able to get it through npm install drew. When you require('drew') you get access to the compiler, the runtime, and the logging tools. The language presets described above are not on npm; you can get them from github.
There is an example on npm that is put in the dist dir as well.
Introduction
To put Drew to work you call the main exported function drew. It looks like this:
drew(tokens, query, macros, constants, callback, options);
After calling this function Drew will apply the query to the tokens and call your callback whenever there is a match. The options can tell Drew what to do after a match was found and processed, like continue or stop.
Simple example (also found in src/example.js):
var input = 'hello, world!';
var query = '^^[/[a-z]/]';
function callback(token) {
token.value = token.value.toUpperCase();
}
// you must always define a macro or constant for IS_BLACK and IS_NEWLINE
var textMacros = {
IS_BLACK: '!(` ` | `\t` | IS_NEWLINE)',
IS_NEWLINE: 'LF | CR',
LF: '`\\x0A`',
CR: '`\\x0D`',
};
var textConstants = {}; // none needed
// and to run:
var drew = require('./drew');
var splitter = require('../lib/splitter');
var tokens = splitter(input);
drew(tokens, query, textMacros, textConstants, callback);
console.log(tokens.map(function (t) { return t.value; }).join(''));
// -> 'Hello, world!'
Drew allows you to search through "tokenized" input. This means the input string is cut up into chunks called "tokens". What exactly constitutes a token really depends on the language. In natural language (but also in general) it would be words, spaces, and punctuation. Tokens can have a specific type, like "identifier" or "string".
Drew only cares about a more global classification which assigns a label to a subset of all the tokens. These tokens are considered "black". All tokens, including the black tokens, are "white". The term originates from "whitespace": when doing these kinds of searches you often don't care about the whitespace (or the comments). In that case you can search through just the black tokens and don't have to worry whether there are one or two spaces, some newlines, or a comment between two tokens.
The above example uses the built-in string "splitter" to cut the input up in a list of simple tokens:
[
{ value: 'h' },
{ value: 'e' },
{ value: 'l' },
{ value: 'l' },
{ value: 'o' },
{ value: ',' },
{ value: ' ' },
{ value: 'w' },
{ value: 'o' },
{ value: 'r' },
{ value: 'l' },
{ value: 'd' },
{ value: '!' },
]
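The built-in splitter itself is not documented here, but for plain text a character-per-token splitter can be as small as the following sketch (the token shape matches the list above; any extra fields the real lib/splitter might attach are omitted):

```javascript
// Minimal stand-in for the plain-text splitter: one token per character.
// The real lib/splitter may add more metadata; this only sets `value`.
function splitter(input) {
  return input.split('').map(function (c) {
    return { value: c };
  });
}

splitter('hi!');
// -> [ { value: 'h' }, { value: 'i' }, { value: '!' } ]
```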
For specific languages you'll need to use specific parsers. Drew comes with a parser for JavaScript (ES5) called ZeParser2 and a parser for CSS called GssParser. They deliver the tokens required for Drew to do its work.
You can define macros and constants and use them inside queries. Macros can be seen as recursive (sub)query definitions. Constants, on the other hand, are symbols that execute actual code and whose output is coerced to a boolean. Both macros and constants are only used to match the value of a single token. This means that you can define [SPACE_OR_TAB] to mean [` ` | `\t`], but defining SPACE_THEN_TAB to mean [` `][`\t`] will NOT work. Constants must always be an "expression" as well.
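The single-token restriction can be pictured in plain JavaScript. This is an illustration only, not Drew's actual API: both a macro and a constant ultimately reduce to a test against one token at a time, which is why a two-token sequence like SPACE_THEN_TAB cannot be expressed as one:

```javascript
// Illustration: whatever a macro expands to, or a constant evaluates,
// the result is a per-token check. It sees exactly one token.value.
var SPACE_OR_TAB = function (token) {
  return token.value === ' ' || token.value === '\t';
};

SPACE_OR_TAB({ value: ' ' });  // -> true
SPACE_OR_TAB({ value: 'x' });  // -> false
```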
Drew applies the search in a recursive descent fashion, with naive backtracking. Your query is processed from left to right, and the parser moves forward through the input for as long as the current part of the query matches. When a partial match fails, the parser "backtracks" (moves back) and tries the next alternative of the query. It does so over and over until no part of the query can still match, and bails in that case. The parser applies the query starting at each token until a complete match is found. What happens after a match depends on the options: Drew can either stop completely, continue after the match, or continue with the would-be-next token regardless of a match.
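The strategy described above can be sketched with plain predicates standing in for query atoms. This is not Drew's implementation, just the shape of the algorithm: try the whole query at each start token, and move forward while the atoms keep matching:

```javascript
// Naive left-to-right matcher: `query` is an array of per-token predicates.
// Returns the index of the first token where the whole query matches, or -1.
function findMatch(tokens, query) {
  for (var start = 0; start < tokens.length; start++) {
    var i = start;
    var ok = true;
    for (var q = 0; q < query.length; q++) {
      if (i >= tokens.length || !query[q](tokens[i])) { ok = false; break; }
      i++;
    }
    if (ok) return start; // complete match starting at `start`
  }
  return -1; // no start position matched the full query
}

var tokens = 'a,b'.split('').map(function (c) { return { value: c }; });
findMatch(tokens, [
  function (t) { return t.value === ','; },
  function (t) { return t.value === 'b'; },
]);
// -> 1
```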
The callback can be called with simply the start of a match when the query matches in full. The query can also fully control how the beginning and/or end of partial matches are passed back to the callback, either as a single object or as individual parameters. I've named these "designators" and you can read more about them below.
Drew doesn't return anything itself; instead you should manipulate the tokens directly and reconstruct the transformed source code after Drew finishes running:
tokens.map(function (t) { return t.value; }).join('')
Queries
Drew queries look a bit like regular expressions. But since the goal of Drew is to work on tokens, tokens are explicitly delimited: [] for "white tokens" and {} for "black tokens". Black tokens automatically skip tokens that do not match the macro IS_BLACK, which you must define yourself. Conceptually this macro will want to skip whitespace, newlines, and comments. Drew doesn't really care about the actual value of the macro, though, so if you want to use it to skip all tokens with the word "sheep" you are free to do so.
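The skipping behavior of black tokens can be pictured as follows; a sketch only, with isBlack standing in for whatever your IS_BLACK macro evaluates to:

```javascript
// Advance from index i to the next token that satisfies isBlack,
// silently stepping over "white-only" tokens (spaces, comments, ...).
function nextBlack(tokens, i, isBlack) {
  while (i < tokens.length && !isBlack(tokens[i])) i++;
  return i; // index of the next black token, or tokens.length if none
}

var tokens = 'a  b'.split('').map(function (c) { return { value: c }; });
var isBlack = function (t) { return t.value !== ' '; };
nextBlack(tokens, 1, isBlack); // -> 3 (skips the two spaces)
```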
A query consists of "outside matching conditions" and "inside matching conditions". Outside conditions include the token wrapper, groups, seeks, invert, and line starts or ends. These conditions are more meta and apply to the type or position of the token in the token stream. Inside conditions should mostly concern themselves with the token.value contents, like matching the value directly or with an actual regular expression.
The atom of a query is an outer matching condition, optionally with an inner matching condition followed by an optional quantifier and optional designators (in that order). Whitespace consists of actual spaces, tabs, and newlines.
Additionally there are three types of comments that can occur anywhere they don't change the meaning of a query (I would say "between tokens", but that's probably too confusing in this context). All comments start with a colon. While the :: and ::: comments are equivalent to single- and multi-line comments in JavaScript, the single colon comment is a simplified comment that is ended implicitly by the next part of the query or explicitly by a semi-colon.
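Going by the description above, the three comment styles would look roughly like this inside a query (illustrative only; the [`a`] and [`b`] atoms are just placeholders):

```
[`a`] : a simple comment ended by a semicolon ; [`b`]
[`a`] :: a line comment, ended by the newline
[`b`]
[`a`] ::: a multi-line
comment ::: [`b`]
```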
Query language CFG
The "cfg language" used below is hopefully pretty self-explanatory.
Whitespace
Whitespace, newlines, and comments can occur anywhere between other tokens in a query as long as they do not break up other tokens. In other words, all whitespace tokens can be considered to be a single space, regardless of what the actual representation of the token looks like.
- whitespace: ' ' | '\t' | newline | comment
- comment: comment-simple | comment-line | comment-multi
- comment-simple: ':' simple-comment-chars [';']
- simple-comment-chars: simple-comment-char simple-comment-chars | simple-comment-char
- simple-comment-char: /[a-zA-Z0-9\s_]+/
- comment-line: '::' anything-except-newline newline
- comment-multi: ':::' anything-except-triple-colons ':::'
- newline: '\n' // (and maybe \r and \r\n)
Atoms
An atom matches a token and carries an optional quantifier and optional designators. An atom is a single token, a group of tokens, some form of seek, or a conditional line start/end boundary.
Atom core
- atoms: atom-complete | atom-complete atoms
- atom-complete: atom [quantifier] [designation]
- atom: white-token | black-token | atom-group | line-boundary | seek
- white-token: '[' conditions ']'
- black-token: '{' conditions '}'
- atom-group: '(' atoms ')'
- line-boundary: '^^' | '^' | '