README
vgno-article-parser
Parses the article markup (HTML) provided by DrPublish/DrLib and translates it into a JSON-serializable structure.
Installing
npm install --save vgno-article-parser
Usage
Parse a response from DrLib:
var parseArticle = require('vgno-article-parser');
var request = require('request');
request({
    url: 'http://drlib.url.no/articles/10131048.json',
    json: true
}, function(err, res, body) {
    if (err) {
        throw err;
    }

    var parsed = parseArticle(body);

    // Result is an object keyed by the same keys as within `contents.web` of the drlib response:
    // motto, title, leadAsset, preamble, story etc.
    console.log(parsed.story);
});
Parse a specific chunk of HTML:
var articleParser = require('vgno-article-parser');
var htmlString = '<div>some html string</div>';
var parsed = articleParser.parseHtml(htmlString);
// Result is a tree of nodes. The root is an array, and each node can have a `children` property
console.log(parsed);
Testing / developing
git clone git@github.schibsted.io:vg/vgno-article-parser.git && cd vgno-article-parser
npm install
npm test
Adding entities
Adding new entities is fairly simple:
- Add a file to src/entities which parses the node into a serializable format
  - Ensure that this.type is set, and that it is a unique, descriptive value
  - node (first argument) is a plain object that cheerio returns when parsing an HTML node
    - At this stage, the children are unparsed (this allows you to skip parsing unused nodes)
    - To traverse the children in a jQuery-like way, simply call cheerio(node) and use its API
  - Set properties on itself (this.someAttribute = parsedThing)
- Add a reference to the entity in src/entities/index.js
- If the new entity is an overlooked HTML tag, add it to src/entity-factory.js under tags; otherwise you will have to provide a sniffer (see below).
- Write one or more tests to ensure that things are working as expected and won't regress over time.
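As a rough illustration of the first step, an entity file might look something like the sketch below. The constructor shape, the Quote name, the cite attribute and the 'quote' type are all hypothetical; only this.type, the node first argument and setting properties on this come from the steps above. It assumes the node carries its HTML attributes on node.attribs, as the nodes produced by cheerio's parser do, and it leaves the children unparsed.

```javascript
// Hypothetical entity, e.g. src/entities/quote.js (name and fields are assumptions)
function Quote(node) {
    // Unique, descriptive type for this entity
    this.type = 'quote';

    // `node` is a plain object; attributes are assumed to live on `node.attribs`
    var attribs = node.attribs || {};

    // Set parsed values as properties on the entity itself
    this.cite = attribs.cite || null;
}

module.exports = Quote;

// Usage with a hand-built node, shaped like a parsed <blockquote cite="...">
var entity = new Quote({
    type: 'tag',
    name: 'blockquote',
    attribs: { cite: 'http://some.url/source' },
    children: []
});

console.log(JSON.stringify(entity));
```

Because the entity only sets plain properties on itself, the instance serializes directly with JSON.stringify.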
Sniffers
Sniffers are simple functions that detect if a given node should be treated as a specific entity. If the node does not match your entity, return false. Otherwise, return the entity type that you want to assign it. This allows a single sniffer to instantiate different entity types based on the node attributes.
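A minimal sketch of such a function, assuming the node exposes its HTML attributes on node.attribs (as cheerio's parsed nodes do). The dp-article-image class is taken from the example markup further down; the dp-article-video class, the article-video type and the function name are made up, purely to show one sniffer returning different entity types.

```javascript
// Hypothetical sniffer, e.g. src/sniffers/dp-plugin.js (file name is an assumption)
function sniffDpPlugin(node) {
    var className = (node.attribs && node.attribs['class']) || '';
    var classes = className.split(/\s+/);

    if (classes.indexOf('dp-article-image') !== -1) {
        return 'article-image';
    }

    // Made-up second case, only to show one sniffer yielding several types
    if (classes.indexOf('dp-article-video') !== -1) {
        return 'article-video';
    }

    // Not ours: let the next sniffer have a look
    return false;
}

module.exports = sniffDpPlugin;

var imageNode = {
    name: 'div',
    attribs: { 'class': 'dp-plugin-element dp-article-image dp-float-none ' }
};
var plainNode = { name: 'div', attribs: {} };

console.log(sniffDpPlugin(imageNode)); // article-image
console.log(sniffDpPlugin(plainNode)); // false
```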
Creating a sniffer is simple:
- Add a file to src/sniffers which exposes a single function. It takes a single argument (node) and should return as stated above.
- Add a reference to the sniffer in src/sniffers/index.js. Note that the order of the sniffers matters here. Think of it like a switch statement: returning an entity in one of the sniffers at the top of the list will prevent the other sniffers from taking a look and possibly finding a better match.
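The switch-statement behaviour can be sketched as follows. This is not the library's actual dispatch code, only an illustration of first-match-wins ordering; runSniffers, both sniffers and the returned type names are hypothetical.

```javascript
// Two made-up sniffers: a broad one and a more specific one
function sniffAnyDiv(node) {
    return node.name === 'div' ? 'block' : false;
}

function sniffImageDiv(node) {
    var className = (node.attribs && node.attribs['class']) || '';
    return className.indexOf('dp-article-image') !== -1 ? 'article-image' : false;
}

// Returns the result of the first sniffer that does not return false
function runSniffers(sniffers, node) {
    for (var i = 0; i < sniffers.length; i++) {
        var type = sniffers[i](node);
        if (type !== false) {
            return type;
        }
    }
    return false;
}

var node = { name: 'div', attribs: { 'class': 'dp-article-image' } };

// Broad sniffer first: the specific one never gets to look at the node
console.log(runSniffers([sniffAnyDiv, sniffImageDiv], node)); // block

// Specific sniffer first: the better match wins
console.log(runSniffers([sniffImageDiv, sniffAnyDiv], node)); // article-image
```

In other words, specific sniffers generally belong nearer the top of the list than broad ones.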
Example
Given the following HTML:
<div>
    <h2>Chapter one: The fury of the seas</h2>
    <p>It was a cold day, according to <a href="http://espen.codes/">The Hooverdam</a>. Then again, he always complained about being cold. Make no mistake, however; the sea was angry that day.</p>
    <p>
        After only a few hours, the hull was <em>riddled</em> with <strong>holes</strong>.<br />
        GoodFire didn't mind, of course. Being a crab, he was used to the sea.
        <div id="dp-article-image229" class="dp-plugin-element dp-article-image dp-plugin-src-images dp-float-none ">
            <div class="dp-article-image-container">
                <div>
                    <img id="dp-image2135218-22993469" src="http://some.url/image.jpg" width="988" height="621" alt="" />
                    <div class="dp-article-image-title">Such title</div>
                    <div class="dp-article-image-description">Some description</div>
                    <div class="dp-article-image-byline">Foto: Whatever</div>
                </div>
            </div>
        </div>
        “I bet we'll hit the rocks before nightfall”, shouted <abbr title="Espen Volden">The Riddler</abbr>. He turned around just in time to see the monumental arms of the Kraken tear the battered ship in two.
    </p>
    <h3>A new beginning</h3>
    <p>The Hooverdam, dazed and confused, found himself throwing up water on a beach...</p>
</div>
This is the excerpts of a fantastic, unwritten, imaginary book:
"The Adventures of Crabman, The Riddler and The Hooverdam"
When run through the parser and JSON-encoded, it looks like the following:
[{
"type": "block",
"attributes": {},
"children": [{
"type": "heading",
"level": 2,
"attributes": {},
"children": [{
"type": "text",
"content": "Chapter one: The fury of the seas"
}]
}, {
"type": "paragraph",
"attributes": {},
"children": [{
"type": "text",
"content": "It was a cold day, according to"
}, {
"type": "link",
"attributes": {},
"to": "http://espen.codes/",
"children": [{
"type": "text",
"content": "The Hooverdam"
}]
}, {
"type": "text",
"content": ". Then again, he always complained about being cold. Make no mistake, however; the sea was angry that day."
}]
}, {
"type": "paragraph",
"attributes": {},
"children": [{
"type": "text",
"content": "After only a few hours, the hull was"
}, {
"type": "emphasis",
"attributes": {},
"children": [{
"type": "text",
"content": "riddled"
}]
}, {
"type": "text",
"content": "with"
}, {
"type": "strong",
"attributes": {},
"children": [{
"type": "text",
"content": "holes"
}]
}, {
"type": "text",
"content": "."
}, {
"type": "linebreak"
}, {
"type": "text",
"content": "GoodFire didn't mind, of course. Being a crab, he was used to the sea."
}, {
"type": "article-image",
"url": "http://some.url/image.jpg",
"title": "Such title",
"description": "Some description",
"byline": "Foto: Whatever"
}, {
"type": "text",
"content": "“I bet we'll hit the rocks before nightfall”, shouted"
}, {
"type": "abbreviation",
"attributes": {},
"title": "Espen Volden"
}, {
"type": "text",
"content": ". He turned around just in time to see the monumental arms of the Kraken tear the battered ship in two."
}]
}, {
"type": "heading",
"level": 3,
"attributes": {},
"children": [{
"type": "text",
"content": "A new beginning"
}]
}, {
"type": "paragraph",
"attributes": {},
"children": [{
"type": "text",
"content": "The Hooverdam, dazed and confused, found himself throwing up water on a beach..."
}]
}]
}, {
"type": "text",
"content": "This is the excerpts of a fantastic, unwritten, imaginary book:\n\"The Adventures of Crabman, The Riddler and The Hooverdam\""
}]
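Since the output is a plain tree, consuming it usually comes down to a recursive walk. The sketch below is not part of the library; toPlainText is a hypothetical helper that concatenates the content of every text node, run here against a small hand-built tree in the same shape as the output above.

```javascript
// Hypothetical helper: collect the text content of a parsed tree
function toPlainText(nodes) {
    return nodes.map(function(node) {
        if (node.type === 'text') {
            return node.content;
        }

        // Nodes without children (e.g. linebreak) contribute nothing
        return node.children ? toPlainText(node.children) : '';
    }).join('');
}

var tree = [{
    type: 'heading',
    level: 3,
    attributes: {},
    children: [{ type: 'text', content: 'A new beginning' }]
}, {
    type: 'linebreak'
}, {
    type: 'paragraph',
    attributes: {},
    children: [{ type: 'text', content: 'The Hooverdam, dazed and confused' }]
}];

console.log(toPlainText(tree));
```

The same pattern, switching on node.type and recursing into children, should extend to rendering the tree as HTML or any other format.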
Credits
Created by Espen Hovlandsdal in my spare time. Be gentle and respectful when leaving feedback/issues, please ;-)