vgno-article-parser

Parses article HTML from DrPublish/DrLib into JSON-representable entities

Parses the article markup (HTML) provided by DrPublish/DrLib and translates it into a JSON-serializable structure.

Installing

npm install --save vgno-article-parser

Usage

Parse a response from DrLib:

var parseArticle = require('vgno-article-parser');
var request = require('request');

request({
    url: 'http://drlib.url.no/articles/10131048.json',
    json: true
}, function(err, res, body) {
    if (err) {
        throw err;
    }

    var parsed = parseArticle(body);

    // The result is an object with the same keys as `contents.web` in the DrLib response:
    // motto, title, leadAsset, preamble, story etc.
    console.log(parsed.story);
});

Parse a specific chunk of HTML:

var articleParser = require('vgno-article-parser');
var htmlString = '<div>some html string</div>';

var parsed = articleParser.parseHtml(htmlString);

// Result is a tree of nodes. The root is an array, and each node may have a `children` property
console.log(parsed);
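
Since the result is a plain tree, it can be consumed with a small recursive walk. The following is a minimal sketch, not part of the library: walk and findByType are hypothetical helpers, and the sample tree is written by hand using the node shapes shown in the Example section (a link node with a `to` property) instead of calling parseHtml.

```javascript
// Depth-first walk over a parsed tree. The root is an array of nodes and
// each node may carry a `children` array.
function walk(nodes, visit) {
    nodes.forEach(function (node) {
        visit(node);
        if (Array.isArray(node.children)) {
            walk(node.children, visit);
        }
    });
}

// Hypothetical helper: collect every node of a given type.
function findByType(nodes, type) {
    var found = [];
    walk(nodes, function (node) {
        if (node.type === type) {
            found.push(node);
        }
    });
    return found;
}

// Hand-written sample tree; real input would come from parseHtml().
var sampleTree = [{
    type: 'paragraph',
    attributes: {},
    children: [
        { type: 'text', content: 'According to' },
        {
            type: 'link',
            attributes: {},
            to: 'http://espen.codes/',
            children: [{ type: 'text', content: 'The Hooverdam' }]
        }
    ]
}];

// List the target of every link in the tree.
var links = findByType(sampleTree, 'link');
console.log(links.map(function (link) { return link.to; }));
```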

Testing / developing

git clone git@github.schibsted.io:vg/vgno-article-parser.git && cd vgno-article-parser
npm install
npm test

Adding entities

Adding new entities is fairly simple:

  1. Add a file to src/entities which parses the node into a serializable format
    • Ensure that this.type is set, and that it is a unique, descriptive value
    • node (first argument) is the plain object that cheerio returns when parsing an HTML node
    • At this stage, the children are unparsed (this allows you to skip parsing unused nodes)
    • To traverse the children in a jQuery-like way, simply call cheerio(node) and use its API
    • Set properties on the entity itself (this.someAttribute = parsedThing)
  2. Add a reference to the entity in src/entities/index.js
  3. If the new entity is an HTML tag that has been overlooked, add it to src/entity-factory.js under tags. Otherwise you will have to provide a sniffer (see below).
  4. Write one or more tests to ensure that things work as expected and do not regress over time.
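
As a rough illustration of step 1, an entity could look something like the sketch below. Everything here is an assumption made for the example: the PullQuote name, the 'pull-quote' type, the cite and text properties, and the idea that the entity is a constructor invoked with new. To keep the sketch runnable without dependencies it reads the node object directly (name, attribs, children, and data on text nodes) instead of wrapping it with cheerio(node).

```javascript
// Hypothetical entity, as it might live in src/entities/pull-quote.js.
// `node` is the plain object produced for an HTML element; its children
// are still unparsed at this point.
function PullQuote(node) {
    // A unique, descriptive type.
    this.type = 'pull-quote';

    // Set serializable properties on the entity itself.
    this.cite = (node.attribs && node.attribs.cite) || null;
    this.text = collectText(node.children || []).trim();
}

// Concatenate the `data` of all descendant text nodes.
function collectText(children) {
    return children.map(function (child) {
        if (child.type === 'text') {
            return child.data;
        }
        return collectText(child.children || []);
    }).join('');
}

// Hand-built stand-in for the node of
// <blockquote cite="http://some.url/">Such <em>quote</em></blockquote>
var node = {
    type: 'tag',
    name: 'blockquote',
    attribs: { cite: 'http://some.url/' },
    children: [
        { type: 'text', data: 'Such ' },
        {
            type: 'tag',
            name: 'em',
            attribs: {},
            children: [{ type: 'text', data: 'quote' }]
        }
    ]
};

var entity = new PullQuote(node);
console.log(JSON.stringify(entity));

// In a real entity file the constructor would then be exported,
// e.g. module.exports = PullQuote;
```

Keeping the entity limited to plain properties is what makes the final tree JSON-serializable.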

Sniffers

Sniffers are simple functions that detect if a given node should be treated as a specific entity. If the node does not match your entity, return false. Otherwise, return the entity type that you want to assign to it. This allows a single sniffer to instantiate different entity types based on the node's attributes.

Creating a sniffer is simple:

  1. Add a file to src/sniffers which exposes a single function. It takes a single argument (node) and should return false or an entity type, as described above.
  2. Add a reference to the sniffer in src/sniffers/index.js. Note that the order of the sniffers matters here. Think of it like a switch statement: if a sniffer near the top of the list returns an entity type, the sniffers below it never get to look at the node and possibly find a better match.
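
The sketch below illustrates both points under stated assumptions. The two sniffers, the 'plugin' entity type and the sniff dispatcher are hypothetical and only mimic the behaviour described above; they are not the library's actual code. The class names and the 'article-image' type are taken from the Example section.

```javascript
// Split a node's class attribute into a list of class names.
function classList(node) {
    return ((node.attribs && node.attribs.class) || '').split(/\s+/);
}

// Hypothetical sniffer: returns an entity type, or false when the node
// does not match.
function articleImageSniffer(node) {
    if (node.name === 'div' && classList(node).indexOf('dp-article-image') !== -1) {
        return 'article-image';
    }
    return false;
}

// Second hypothetical sniffer: a catch-all for any plugin element.
function pluginSniffer(node) {
    if (classList(node).indexOf('dp-plugin-element') !== -1) {
        return 'plugin';
    }
    return false;
}

// Hypothetical dispatcher: the first sniffer that returns a type wins,
// so the order of the list decides the outcome.
function sniff(node, sniffers) {
    for (var i = 0; i < sniffers.length; i++) {
        var type = sniffers[i](node);
        if (type !== false) {
            return type;
        }
    }
    return false;
}

// Hand-built stand-in for the image container in the Example section.
var imageNode = {
    type: 'tag',
    name: 'div',
    attribs: {
        id: 'dp-article-image229',
        class: 'dp-plugin-element dp-article-image dp-plugin-src-images dp-float-none '
    },
    children: []
};

// Specific sniffer first: the node is recognised as an article image.
console.log(sniff(imageNode, [articleImageSniffer, pluginSniffer]));

// Catch-all first: the more specific sniffer never gets to run.
console.log(sniff(imageNode, [pluginSniffer, articleImageSniffer]));
```

Placing the most specific sniffers first avoids the situation in the second call, where the catch-all shadows the better match.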

Example

Given the following HTML:

<div>
    <h2>Chapter one: The fury of the seas</h2>
    <p>It was a cold day, according to <a href="http://espen.codes/">The Hooverdam</a>. Then again, he always complained about being cold. Make no mistake, however; the sea was angry that day.</p>
    <p>
        After only a few hours, the hull was <em>riddled</em> with <strong>holes</strong>.<br />
        GoodFire didn't mind, of course. Being a crab, he was used to the sea.

        <div id="dp-article-image229" class="dp-plugin-element dp-article-image dp-plugin-src-images dp-float-none ">
            <div class="dp-article-image-container">
                <div>
                    <img id="dp-image2135218-22993469" src="http://some.url/image.jpg" width="988" height="621" alt="" />
                    <div class="dp-article-image-title">Such title</div>
                    <div class="dp-article-image-description">Some description</div>
                    <div class="dp-article-image-byline">Foto: Whatever</div>
                </div>
            </div>
        </div>

        “I bet we'll hit the rocks before nightfall”, shouted <abbr title="Espen Volden">The Riddler</abbr>. He turned around just in time to see the monumental arms of the Kraken tear the battered ship in two.
    </p>

    <h3>A new beginning</h3>
    <p>The Hooverdam, dazed and confused, found himself throwing up water on a beach...</p>
</div>

This is the excerpts of a fantastic, unwritten, imaginary book:
"The Adventures of Crabman, The Riddler and The Hooverdam"

When run through the parser and JSON-encoded, the result looks like the following:

[{
    "type": "block",
    "attributes": {},
    "children": [{
        "type": "heading",
        "level": 2,
        "attributes": {},
        "children": [{
            "type": "text",
            "content": "Chapter one: The fury of the seas"
        }]
    }, {
        "type": "paragraph",
        "attributes": {},
        "children": [{
            "type": "text",
            "content": "It was a cold day, according to"
        }, {
            "type": "link",
            "attributes": {},
            "to": "http://espen.codes/",
            "children": [{
                "type": "text",
                "content": "The Hooverdam"
            }]
        }, {
            "type": "text",
            "content": ". Then again, he always complained about being cold. Make no mistake, however; the sea was angry that day."
        }]
    }, {
        "type": "paragraph",
        "attributes": {},
        "children": [{
            "type": "text",
            "content": "After only a few hours, the hull was"
        }, {
            "type": "emphasis",
            "attributes": {},
            "children": [{
                "type": "text",
                "content": "riddled"
            }]
        }, {
            "type": "text",
            "content": "with"
        }, {
            "type": "strong",
            "attributes": {},
            "children": [{
                "type": "text",
                "content": "holes"
            }]
        }, {
            "type": "text",
            "content": "."
        }, {
            "type": "linebreak"
        }, {
            "type": "text",
            "content": "GoodFire didn't mind, of course. Being a crab, he was used to the sea."
        }, {
            "type": "article-image",
            "url": "http://some.url/image.jpg",
            "title": "Such title",
            "description": "Some description",
            "byline": "Foto: Whatever"
        }, {
            "type": "text",
            "content": "“I bet we'll hit the rocks before nightfall”, shouted"
        }, {
            "type": "abbreviation",
            "attributes": {},
            "title": "Espen Volden"
        }, {
            "type": "text",
            "content": ". He turned around just in time to see the monumental arms of the Kraken tear the battered ship in two."
        }]
    }, {
        "type": "heading",
        "level": 3,
        "attributes": {},
        "children": [{
            "type": "text",
            "content": "A new beginning"
        }]
    }, {
        "type": "paragraph",
        "attributes": {},
        "children": [{
            "type": "text",
            "content": "The Hooverdam, dazed and confused, found himself throwing up water on a beach..."
        }]
    }]
}, {
    "type": "text",
    "content": "This is the excerpts of a fantastic, unwritten, imaginary book:\n\"The Adventures of Crabman, The Riddler and The Hooverdam\""
}]

Credits

Created by Espen Hovlandsdal in my spare time. Be gentle and respectful when leaving feedback/issues, please ;-)