README
EbriScrap
EbriScrap is a tool that parse a HTML string, and return a JS object with all information that you need.
Installing
With Yarn
yarn add ebri-scrap
With NPM
npm install ebri-scrap
Examples
Go to Examples to see EbriScrap in action.
Documentation
Configuration
Root EbriScrap configuration can be a string
, an object
, or an array
.
Configuration Item
Field
In order to extract a single value, you have to use a field configuartion item. You have to specify one selector, up to one extractor and as many formators as you want !
Here is an example: "selector | extract:extractor1 | format:formator1 | format:formator2"
.
Selector
It should be a valid Cheerio / CSS selector. Example: h1
.
Extractor
text
(default): it calls.innerText
on the HTML element matched by the selector.Example:
const html = '<div>Hello world</div>'; const config = 'div'; // or const config = 'div | extract:text'; parse(html, config); // Output: "Hello world"
html
: it returns the raw HTML of the HTML element matched by the selector.Example:
const html = '<div>Hello world</div>'; const config = 'div | extract:html'; parse(html, config); // Output: "<div>Hello world</div>"
prop
: it returns a property of the HTML element matched by the selector.Example:
const html = '<a href="/unicorn-world">Hello world</div>'; const config = 'a | extract:prop:href'; parse(html, config); // Output: "/unicorn-world"
css
: it returns a css of the HTML element matched by the selector. Warning: it only works with style property on the element !Example:
const html = '<div style="font-size: 42px">Hello world</div>'; const config = 'a | extract:css:font-size'; parse(html, config); // Output: "42px"
Formattors
number
: strip all no-digit characters and parse as floatExample:
const html = '<div>42</div>'; const config = 'div | format:number'; parse(html, config); // Output: 42
regex
: find and replace with a text, using a regular expression. This formator needs two parameters:format:<THE_REGEX>:<REPLACEMENT_STRING>
Example:
const html = '<div>42</div>'; const config = 'div | format:regex:4(.*):$1'; parse(html, config); // Output: 2
url
: add a base url if the path is relative This formator needs one parameter:format:<BASE_URL>
Example:
const html = '<a href="/unicorn-world">Hello world</div>'; /* WARNING: as https://one-fake-domain.com contains colons, quotes (single or double) are mandatory ! */ const config = "a | extract:prop:href | format:url:'https://one-fake-domain.com'"; parse(html, config); // Output: "https://one-fake-domain.com/unicorn-world"
html-to-text
: replace<br>
,<p>
,<div>
with new lines, and then, returns text.one-line-string
: replace all new lines (\n
), tabs (\t
) and multi spaces with a single spacetrim
: remove leading and ending spaces
Group
A group configuration is a dictionary in which keys are the keys of the output object, and values are a piece of EbriScrapConfiguration (another group configuration, a field or an array configuration).
Example:
const html = `
<section>
<h1>What a wonderful world</h1>
<p>Lorem Ipsum...</p>
</section>`;
const config = {
title: 'h1',
content: 'p',
};
parse(html, config); // Output: { title: 'What a wonderful world': content: 'Lorem Ipsum...' }
Array
Array configuration are a bit more complicated. It is an array, with a single item with additional information:
containerSelector
: the selector of the container (It should be a valid Cheerio / CSS selector.)itemSelector
: the selector on which you want to iterate (It should be a valid Cheerio / CSS selector.)data
: a Field/Group/Array configurationincludeSiblings
: optional include siblings of selected item (see example below)
Example:
const html = `
<ul>
<li>
<p>Content 1</p>
</li>
<li>
<p>Content 2</p>
</li>
<li>
<p>Content 3</p>
</li>
</ul>`;
const config = [
{
containerSelector: 'ul',
itemSelector: 'li',
data: 'p',
},
];
parse(html, config); // Output: ['Content 1', 'Content 2', 'Content 3']
const html = `<body>
<h1>Title 1</h1>
<p>Text 1.1</p>
<p>Text 1.2</p>
<p>Text 1.3</p>
<h1>Title 2</h1>
<p>Text 2.1</p>
<p>Text 2.2</p>
<p>Text 2.3</p>
</body>`;
const config = [
{
containerSelector: 'section',
itemSelector: 'h1',
includeSiblings: true,
data: {
title: 'h1',
text: 'p'
},
},
];
parse(html, config);
/* Output: [
{ title: 'Title 1', text: 'Text 1.1Text 1.2Text 1.3' },
{ title: 'Title 2', text: 'Text 2.1Text 2.2Text 2.3' },
] */
Walkthrough example
Say we have the following HTML:
<html>
<head>
<title>EbriScrap</title>
</head>
<body>
<table>
<thead>
<tr>
<th>JS Library</th>
<th>Link</th>
</tr>
</thead>
<tbody>
<tr>
<td>Lodash</td>
<td><a href="https://github.com/lodash/lodash">Github</a></td>
</tr>
<tr>
<td>Cheerio</td>
<td><a href="https://github.com/cheeriojs/cheerio">Github</a></td>
</tr>
<tr>
<td>jQuery</td>
<td><a href="https://github.com/jquery/jquery">Github</a></td>
</tr>
</tbody>
</table>
</body>
</html>
And we want to get the following object:
{
name: 'EbriScrap',
librairies: [
{
name: 'Lodash',
link: 'https://github.com/lodash/lodash'
},
{
name: 'Cheerio',
link: 'https://github.com/cheeriojs/cheerio'
},
{
name: 'jQuery',
link: 'https://github.com/jquery/jquery'
}
]
}
Here is what you have to do:
Name
It is a regular text field. We want to extract the text in <title>
:
const nameConfig = 'title';
Library name
One again, it is a regular text field. We want to extract the text in the first <td>
:
const libNameConfig = 'td:first-of-type';
Library link
This time, we want to get the link in the href
property of the <a>
in the second <td>
:
const libLinkConfig = 'td:nth-of-type(2) a | extract:prop:href';
Library
Now that we have name and url, we want to create an object with two keys (name
and url
):
{
name: /* Library name configuration item */,
link: /* Library link configuration item */
}
// so we have:
{
name: 'td:first-of-type',
link: 'td:nth-of-type(2) a | extract:prop:href'
}
Libraries
We want to apply librairy configuration on every <tr>
in <tbody>
:
[
{
containerSelector: 'tbody',
itemSelector: 'tr',
data: /* Library configuration item */
}
]
Resulting configuration
Here is the full configuration:
{
name: 'title',
librairies: [
{
containerSelector: 'tbody',
itemSelector: 'tr',
data: {
name: 'td:first-of-type',
link: 'td:nth-of-type(2) a | extract:prop:href'
},
}
]
}