TagSoup
TagSoup is the fastest pure JS SAX/DOM XML/HTML parser.
- It is the fastest;
- Tiny and tree-shakable, just 6.5 kB gzipped;
- Streaming support with SAX and DOM parsers for XML and HTML;
- Extremely low memory consumption;
- Forgives malformed tag nesting and missing end tags;
- Parses HTML attributes in the same way your browser does, see tests for more details;
- Recognizes CDATA, processing instructions, and DOCTYPE.
npm install --save-prod tag-soup
Usage
⚠️ API documentation is available here.
SAX
import {createSaxParser} from 'tag-soup';
// Or use
// import {createXmlSaxParser, createHtmlSaxParser} from 'tag-soup';
const saxParser = createSaxParser({
  startTag(token) {
    console.log(token); // → {tokenType: 1, name: 'foo', …}
  },
  endTag(token) {
    console.log(token); // → {tokenType: 101, name: 'foo', …}
  },
});
saxParser.parse('<foo>okay');
The SAX parser invokes callbacks during parsing.
Callbacks receive tokens that represent structures read from the input. Tokens are pooled objects: when a handler callback returns, the token goes back to the pool and is reused. Object pooling drastically reduces memory consumption and makes it cheap to pass a lot of data to callbacks.
If you need to retain a token after the callback finishes, use token.clone(), which returns a deep copy of the token.
startTag and endTag callbacks are always invoked in the correct order, even if tags in the input are incorrectly nested or missing.
For self-closing tags, only the startTag callback is invoked.
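The effect of pooling on retained tokens can be illustrated with a small self-contained sketch. It does not use TagSoup: the Token shape, createToken, and emitStartTags are hypothetical stand-ins, and a single reused object plays the role of the pool.

```typescript
// Hypothetical token shape; real tokens carry more fields.
interface Token {
  name: string;
  clone(): Token;
}

function createToken(name: string): Token {
  const token: Token = {
    name,
    // Returns an independent copy that the pool never touches.
    clone: () => createToken(token.name),
  };
  return token;
}

// A single reused object stands in for the whole pool.
const pooledToken = createToken('');

function emitStartTags(names: string[], handler: (token: Token) => void): void {
  for (const name of names) {
    pooledToken.name = name; // the same object is overwritten for every tag
    handler(pooledToken);
  }
}

const retained: Token[] = [];
const cloned: Token[] = [];

emitStartTags(['foo', 'bar'], (token) => {
  retained.push(token); // keeps a reference to the pooled object
  cloned.push(token.clone()); // keeps a snapshot
});

const retainedNames = retained.map((token) => token.name);
const clonedNames = cloned.map((token) => token.name);

console.log(retainedNames); // → ['bar', 'bar']
console.log(clonedNames); // → ['foo', 'bar']
```

Both retained references point at the same reused object, so they show the last tag seen, while the clones keep the values from the moment of the callback.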
Defaults
All SAX parser factories accept two arguments: the handler with callbacks and options. The most generic parser factory, createSaxParser, doesn't have any defaults.
For createXmlSaxParser, the defaults are xmlParserOptions:
- CDATA sections, processing instructions and self-closing tags are recognized;
- XML entities are decoded in text and attribute values;
- Tag and attribute names are preserved as is.
For createHtmlSaxParser, the defaults are htmlParserOptions:
- CDATA sections and processing instructions are treated as comments;
- Self-closing tags are treated as start tags;
- Tags like p, li, td and others follow implicit end rules, so <p>foo<p>bar is parsed as <p>foo</p><p>bar</p>;
- Tag and attribute names are converted to lower case;
- Legacy HTML entities are decoded in text and attribute values.
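The implicit end rule can be sketched with a stack of open tags. This is only an illustration under simplifying assumptions, not TagSoup's implementation: closeImplicitTags and implicitEndTags are hypothetical names, the tag list is a small subset, attributes are not supported, and a new tag simply ends the nearest open tag of the same name.

```typescript
// Hypothetical subset of tags that end implicitly; not TagSoup's actual list.
const implicitEndTags = ['p', 'li', 'td'];

function closeImplicitTags(input: string): string {
  const openTags: string[] = [];
  let output = '';

  // Closes open tags from the top of the stack down to the given index.
  const closeDownTo = (index: number): void => {
    while (openTags.length > index) {
      output += '</' + openTags.pop() + '>';
    }
  };

  // Matches a start tag, an end tag, or a run of text. Attributes are not supported.
  const tokenRe = /<(\/?)([a-zA-Z]+)>|([^<]+)/g;
  let match: RegExpExecArray | null;

  while ((match = tokenRe.exec(input)) !== null) {
    if (match[3] !== undefined) {
      output += match[3];
      continue;
    }
    const name = match[2].toLowerCase();
    const openIndex = openTags.lastIndexOf(name);

    if (match[1] === '/') {
      // An end tag closes everything up to its start tag; orphan end tags are ignored.
      if (openIndex !== -1) closeDownTo(openIndex);
    } else {
      // A new <p> ends the nearest open <p> together with the tags nested in it.
      if (implicitEndTags.indexOf(name) !== -1 && openIndex !== -1) closeDownTo(openIndex);
      openTags.push(name);
      output += '<' + name + '>';
    }
  }
  closeDownTo(0); // the end of input closes whatever is still open
  return output;
}

console.log(closeImplicitTags('<p>foo<p>bar')); // → '<p>foo</p><p>bar</p>'
```

Because the sketch tracks only tag names, it would also close a nested list item when it should not; a real HTML dialect needs scoping rules on top of this.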
You can alter how the parser works through options which give you fine-grained control over parsing dialect.
By default, TagSoup uses speedy-entities to decode XML and HTML entities. A parser created by createHtmlSaxParser decodes only legacy HTML entities, which keeps the bundle size small.
To decode all HTML entities, use the snippet below. It adds 10 kB gzipped to the bundle size.
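import {createHtmlSaxParser} from 'tag-soup';
import {decodeHtml} from 'speedy-entities/lib/full';
// The handler comes first, the options second.
const htmlParser = createHtmlSaxParser({/*callbacks*/}, {
  decodeText: decodeHtml,
  decodeAttribute: decodeHtml,
});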
With speedy-entities you can create a custom decoder that recognizes custom entities.
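Any function that maps a string to a string fits the shape of the decodeText and decodeAttribute options shown above. The sketch below is a hand-written decoder, not the speedy-entities API: createSimpleDecoder is a hypothetical helper, and it handles only named entities that end with a semicolon.

```typescript
// Hypothetical helper: builds a decoder from a map of entity names to replacements.
function createSimpleDecoder(entities: Record<string, string>): (input: string) => string {
  return (input) =>
    input.replace(/&([a-zA-Z0-9]+);/g, (match: string, name: string) =>
      // Unknown entities are left in the text untouched.
      Object.prototype.hasOwnProperty.call(entities, name) ? entities[name] : match
    );
}

const decode = createSimpleDecoder({amp: '&', lt: '<', gt: '>', foo: 'bar'});

console.log(decode('&foo; &amp; &unknown;')); // → 'bar & &unknown;'
```

A decoder like this trades completeness for size: numeric character references and legacy entities without a trailing semicolon are not recognized.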
The list of legacy HTML entities
aacute, Aacute, acirc, Acirc, acute, aelig, AElig, agrave, Agrave, amp, AMP, aring, Aring, atilde, Atilde, auml, Auml, brvbar, ccedil, Ccedil, cedil, cent, copy, COPY, curren, deg, divide, eacute, Eacute, ecirc, Ecirc, egrave, Egrave, eth, ETH, euml, Euml, frac12, frac14, frac34, gt, GT, iacute, Iacute, icirc, Icirc, iexcl, igrave, Igrave, iquest, iuml, Iuml, laquo, lt, LT, macr, micro, middot, nbsp, not, ntilde, Ntilde, oacute, Oacute, ocirc, Ocirc, ograve, Ograve, ordf, ordm, oslash, Oslash, otilde, Otilde, ouml, Ouml, para, plusmn, pound, quot, QUOT, raquo, reg, REG, sect, shy, sup1, sup2, sup3, szlig, thorn, THORN, times, uacute, Uacute, ucirc, Ucirc, ugrave, Ugrave, uml, uuml, Uuml, yacute, Yacute, yen and yuml.
Streaming
SAX parsers support streaming. You can use saxParser.write(chunk) to parse input data chunk by chunk.
const saxParser = createSaxParser({/*callbacks*/});
saxParser.write('<foo>ok');
// Triggers the startTag callback for the "foo" tag.
saxParser.write('ay');
// Doesn't trigger any callbacks.
saxParser.write('</foo>');
// Triggers text callback for "okay" and endTag callback for "foo" tag.
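The reason the second write stays silent is that text cannot be reported until the parser knows where it ends. The following self-contained sketch mimics that buffering; createStreamingSketch is a hypothetical name, events are recorded as plain strings, and only bare tags without attributes are handled.

```typescript
// Hypothetical streaming tokenizer that records events as strings.
function createStreamingSketch(events: string[]) {
  let buffer = '';
  return {
    write(chunk: string): void {
      buffer += chunk;
      for (;;) {
        if (buffer.charAt(0) === '<') {
          const end = buffer.indexOf('>');
          if (end === -1) return; // the tag is incomplete, wait for more input
          const tag = buffer.slice(1, end);
          events.push(tag.charAt(0) === '/' ? 'endTag:' + tag.slice(1) : 'startTag:' + tag);
          buffer = buffer.slice(end + 1);
        } else {
          const next = buffer.indexOf('<');
          if (next === -1) return; // the text may continue in the next chunk
          events.push('text:' + buffer.slice(0, next));
          buffer = buffer.slice(next);
        }
      }
    },
  };
}

const recordedEvents: string[] = [];
const sketchParser = createStreamingSketch(recordedEvents);

sketchParser.write('<foo>ok');
const afterFirstWrite = recordedEvents.slice(); // ['startTag:foo']

sketchParser.write('ay');
const afterSecondWrite = recordedEvents.slice(); // still ['startTag:foo']

sketchParser.write('</foo>');
console.log(recordedEvents); // → ['startTag:foo', 'text:okay', 'endTag:foo']
```

Holding the text back until the next tag arrives is what lets the two chunks "ok" and "ay" surface as a single "okay" event.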
DOM
import {createDomParser} from 'tag-soup';
// Or use
// import {createXmlDomParser, createHtmlDomParser} from 'tag-soup';
// Minimal DOM handler example
const domParser = createDomParser<any>({
  element(token) {
    return {tagName: token.name, children: []};
  },
  // The text callback creates the text node that is read below.
  text(token) {
    return {data: token.data};
  },
  appendChild(parentNode, node) {
    parentNode.children.push(node);
  },
});
const domNode = domParser.parse('<foo>okay');
console.log(domNode[0].children[0].data); // → 'okay'
The DOM parser assembles a node tree using a handler that describes how nodes are created and appended.
The generic parser factory createDomParser requires a handler to be provided.
Both createXmlDomParser and createHtmlDomParser use domHandler if no other handler is provided, and use default options (xmlParserOptions and htmlParserOptions respectively) which can be overridden.
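How a handler drives tree assembly can be shown with a small generic sketch. It is not TagSoup code: TreeHandler, TreeEvent, and assemble are hypothetical, and the input is a prepared list of events rather than markup. The point is that the assembler never inspects nodes itself, it only asks the handler to create and attach them.

```typescript
// Hypothetical handler shape: the assembler knows nothing about the node type T.
interface TreeHandler<T> {
  element(name: string): T;
  text(data: string): T;
  appendChild(parent: T, child: T): void;
}

type TreeEvent =
  | {type: 'start'; name: string}
  | {type: 'end'}
  | {type: 'text'; data: string};

function assemble<T>(events: TreeEvent[], handler: TreeHandler<T>): T[] {
  const roots: T[] = [];
  const ancestors: T[] = [];

  // Attaches a node to the innermost open element, or to the root list.
  const attach = (node: T): void => {
    if (ancestors.length === 0) {
      roots.push(node);
    } else {
      handler.appendChild(ancestors[ancestors.length - 1], node);
    }
  };

  for (const event of events) {
    if (event.type === 'start') {
      const node = handler.element(event.name);
      attach(node);
      ancestors.push(node);
    } else if (event.type === 'end') {
      ancestors.pop();
    } else {
      attach(handler.text(event.data));
    }
  }
  return roots;
}

interface PlainNode {
  tagName?: string;
  data?: string;
  children: PlainNode[];
}

const tree = assemble<PlainNode>(
  [{type: 'start', name: 'foo'}, {type: 'text', data: 'okay'}, {type: 'end'}],
  {
    element: (name) => ({tagName: name, children: []}),
    text: (data) => ({data, children: []}),
    appendChild: (parent, child) => {
      parent.children.push(child);
    },
  }
);

console.log(tree[0].children[0].data); // → 'okay'
```

Swapping the handler is enough to build a different node type, which is the same separation that lets a default handler be replaced by a custom one.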
Streaming
DOM parsers support streaming. You can use domParser.write(chunk) to parse input data chunk by chunk.
const domParser = createXmlDomParser();
domParser.write('<foo>ok');
// → [{nodeType: 1, tagName: 'foo', children: [], …}]
domParser.write('ay');
// → [{nodeType: 1, tagName: 'foo', children: [], …}]
domParser.write('</foo>');
// → [{nodeType: 1, tagName: 'foo', children: [{nodeType: 3, data: 'okay', …}], …}]
Performance
To run a performance test, use npm ci && npm run build && npm run perf.
Large input
Performance was measured by parsing a 3.81 MB HTML file.
Results are in operations per second. Higher is better.
SAX benchmark
Parser | Ops/sec |
---|---|
createSaxParser ¹ | 36.3 ± 0.8% |
createXmlSaxParser ¹ | 30.7 ± 0.5% |
createHtmlSaxParser ¹ | 23.7 ± 0.5% |
createSaxParser | 29.2 ± 0.5% |
createXmlSaxParser | 26.1 ± 0.5% |
createHtmlSaxParser | 19.9 ± 0.5% |
@fb55/htmlparser2 | 14.3 ± 0.5% |
@isaacs/sax-js | 1.7 ± 4.6% |
¹ Parsers were provided a handler with a single text callback. This configuration can be useful if you want to strip tags from the input.
DOM benchmark
Parser | Ops/sec |
---|---|
createDomParser | 13.7 ± 0.5% |
createXmlDomParser | 12.6 ± 0.5% |
createHtmlDomParser | 10.6 ± 0.5% |
@fb55/htmlparser2 | 8.4 ± 0.5% |
@inikulin/parse5 | 2.8 ± 0.7% |
Small input
Performance was measured by parsing 258 files, 95 kB in size on average, from htmlparser-benchmark.
Results are in operations per second. Higher is better.
SAX benchmark
Parser | Ops/sec |
---|---|
createSaxParser | 1 998.0 ± 0.1% |
createXmlSaxParser | 1 734.1 ± 0.1% |
createHtmlSaxParser | 1 285.4 ± 0.1% |
@fb55/htmlparser2 | 717.5 ± 0.2% |
DOM benchmark
Parser | Ops/sec |
---|---|
createDomParser | 1 087.1 ± 0.2% |
createXmlDomParser | 853.5 ± 0.2% |
createHtmlDomParser | 668.0 ± 0.2% |
@fb55/htmlparser2 | 457.7 ± 0.2% |
@inikulin/parse5 | 50.8 ± 0.4% |
Limitations
TagSoup doesn't reproduce every element structure that a browser builds from malformed HTML.
For example, consider the following markup:
<p><strong>okay
<p>nope
With DOMParser this markup would be transformed to:
<p><strong>okay</strong></p>
<p><strong>nope</strong></p>
TagSoup doesn't insert the second strong tag:
<p><strong>okay</strong></p>
<p>nope</p> <!-- Note the absent "strong" tag -->