parsz

Parsing language and engine for the web

Usage (no npm install needed!)

<script type="module">
  import parsz from 'https://cdn.skypack.dev/parsz';
</script>


pársz

- A tool for parsing the web

Usage

Install globally from npm/yarn

$ npm install -g parsz

View the available options in the help menu

$ parsz --help

Use a "parselet" as a recipe/filter to parse a website.

The structure of the parselet is JSON.

Here is an example of a parselet for grabbing business data from a Yelp page:

{
  "name": "h1|trim",
  "phone": ".biz-phone|trim",
  "address": "address|trim",
  "reviews(.review)": [{
    "date": "meta[itemprop=datePublished] @content",
    "name": ".user-name a",
    "comment": ".review-content p"
  }]
}

As a module

You can also use parsz as a module:

import parsz from 'parsz';

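// Assumption: the parselet is passed as a plain object and the URL as a string.
// Both values below are placeholders.
const parselet = { title: 'h1|trim' };
const url = 'https://example.com';

parsz(parselet, url).then(data => {
  // Do something with the data
  console.log(data);
});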

Tips

parsz is a general-purpose, flexible tool. Here are some tips for getting started.

Grabbing a list of data

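Put a reference selector in parentheses after the key name and use an array as the value. The selectors inside the array's object are then resolved relative to each element matched by the reference selector.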

{
  "users(.user)": [{
    "name": ".name",
    "age": ".age"
  }]
}
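To make the key syntax concrete, here is a minimal sketch of how a key such as "users(.user)" could be split into an output name and a reference selector. The function name splitKey is hypothetical and this is not taken from parsz's source; it only illustrates the convention described above.

```javascript
// Illustrative only: not parsz's actual implementation.
// Splits a parselet key such as "users(.user)" into the output name
// and the reference selector given in parentheses.
function splitKey(key) {
  const match = /^([^(]+)\((.+)\)$/.exec(key);
  if (!match) {
    // Plain key without a reference selector, e.g. "name".
    return { name: key, scope: null };
  }
  return { name: match[1], scope: match[2] };
}

console.log(splitKey('users(.user)'));
console.log(splitKey('name'));
```

The first call yields the name "users" with the scope ".user"; the second has no parentheses, so its scope is null.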

Use transformation functions on data

Add a pipe (|) and the transformation name after the data selector.

{
  "user": {
    "name": ".name|trim",
    "age": ".age|parseInt",
    "worth": ".age|parseFloat",
    "someNumber": ".age|floor"
  }
}
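As a rough illustration of what a piped rule means, the sketch below splits a rule at the pipe and applies a named transformation to some already-extracted text. The lookup table and the function applyTransform are hypothetical; the names mirror the ones used in the example above, but the real set of transformations in parsz, and how it implements them, may differ.

```javascript
// Hypothetical lookup table, for illustration only.
const transforms = new Map([
  ['trim', (text) => text.trim()],
  ['parseInt', (text) => parseInt(text, 10)],
  ['parseFloat', (text) => parseFloat(text)],
  ['floor', (text) => Math.floor(parseFloat(text))],
]);

// Splits "selector|transform" and applies the transformation to rawText,
// which stands in for the text that the selector would have matched.
// Simplification: assumes the selector itself contains no "|".
function applyTransform(rule, rawText) {
  const [selector, name] = rule.split('|').map((part) => part.trim());
  if (!name) {
    return { selector, value: rawText };
  }
  const fn = transforms.get(name);
  if (!fn) {
    throw new Error(`Unknown transformation: ${name}`);
  }
  return { selector, value: fn(rawText) };
}

console.log(applyTransform('.name|trim', '  Ada  '));
console.log(applyTransform('.age|parseInt', '42'));
```

A rule without a pipe passes the text through unchanged, and an unknown transformation name raises an error instead of silently returning the raw text.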

If you would like to see a helpful transformation function added, please open an issue.

Grabbing an attribute

Use an at sign (@) after the selector to reference an attribute.

{
  "user": {
    "name": ".name",
    "nickname": ".name@data-nickname"
  }
}
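The following sketch shows one way a rule could be separated into its selector and attribute parts. The function splitAttribute is hypothetical and not part of parsz. It tolerates whitespace before the at sign, since the Yelp example earlier writes "meta[itemprop=datePublished] @content" with a space.

```javascript
// Illustrative only: not parsz's actual implementation.
// Splits "selector@attribute" at the last "@" and trims both parts.
function splitAttribute(rule) {
  const at = rule.lastIndexOf('@');
  if (at === -1) {
    // No attribute requested: the rule is just a selector.
    return { selector: rule.trim(), attribute: null };
  }
  return {
    selector: rule.slice(0, at).trim(),
    attribute: rule.slice(at + 1).trim(),
  };
}

console.log(splitAttribute('.name@data-nickname'));
console.log(splitAttribute('meta[itemprop=datePublished] @content'));
```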

Grabbing remote data

Use a tilde (~) followed by a link selector in parentheses in the key to reference external content. The mapping (value) is resolved relative to that new external scope.

{
  "user": {
    "name": ".name",
    "company~(a.company)": {
      "name": ".company-name",
      "address": ".company-address"
    }
  }
}
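Below is a small sketch of the two steps that a remote key implies: splitting the key into its name and link selector, and turning the link's href into an absolute URL. It assumes that the link's href attribute is what gets followed, which the README does not state explicitly. The functions splitRemoteKey and resolveLink are hypothetical, and the URLs are placeholders. Only the standard URL class is used.

```javascript
// Illustrative only: not parsz's actual implementation.
// Splits a key such as "company~(a.company)" into name and link selector.
function splitRemoteKey(key) {
  const match = /^([^~]+)~\((.+)\)$/.exec(key);
  if (!match) {
    return null; // Not a remote key.
  }
  return { name: match[1], linkSelector: match[2] };
}

// Resolves a possibly relative href against the URL of the current page.
function resolveLink(href, pageUrl) {
  return new URL(href, pageUrl).href;
}

console.log(splitRemoteKey('company~(a.company)'));
console.log(resolveLink('/companies/acme', 'https://example.com/users/1'));
```

Resolving against the page URL matters because links on a page are often relative, while fetching the external content needs an absolute address.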

Have fun!