great-reaper

Scrape and collect data from URLs, HTML, JSON, and more.

Usage (no npm install needed):

<script type="module">
  import greatReaper from 'https://cdn.skypack.dev/great-reaper';
</script>

README

Great reaper

great-reaper scrapes collections of data from web pages, using friendly jQuery-like (CSS) selectors to describe the scraping strategy.

Installation

npm install great-reaper

Examples

Get the top 3 Hacker News stories:

reap('https://news.ycombinator.com/')
    .group('table tr:nth-child(3) table tr')
    .map({
        title: '.title a',
        url: '.title a@href'
    })
    .limit(3)
    .then(console.log);

Results:

[ { title: 'Engineer Anti-Patterns',
    url: 'http://dtrace.org/blogs/eschrock/2012/08/14/engineer-anti-patterns/' },
  { title: 'Hotel Wi-Fi blocking: Marriott is bad, and should feel bad',
    url: 'http://www.economist.com/blogs/gulliver/2015/01/hotel-wi-fi-blocking' },
  { title: 'Can\'t you just turn up the volume?',
    url: 'https://medium.com/@Amp/cant-you-just-turn-up-the-volume-4ecb7fc422a' } ]
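
The field specs in the `map` call above combine a CSS selector with an optional `@attribute` suffix. As a minimal sketch, assuming the part after the last `@` names an attribute and that a bare selector means the element's text, such a spec could be split like this. `parseFieldSpec` is a hypothetical helper for illustration, not part of great-reaper:

```javascript
// Hypothetical helper (not great-reaper's API): split a field spec such as
// '.title a@href' into a CSS selector and an optional attribute name.
function parseFieldSpec(spec) {
    const at = spec.lastIndexOf('@');
    if (at === -1) {
        // No '@' present: assume the field is the element's text content.
        return { selector: spec, attribute: null };
    }
    return {
        selector: spec.slice(0, at).trim(),
        attribute: spec.slice(at + 1).trim()
    };
}

const titleSpec = parseFieldSpec('.title a');      // text of the link
const urlSpec = parseFieldSpec('.title a@href');   // href attribute of the link
```

Under this reading, `title` takes the link text and `url` takes the link's `href`, which matches the results shown above.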

Use transforms

Get hot questions from Stack Overflow, with URLs.

The question links on the page are relative, so we prefix them with the site origin to get absolute URLs.

reap('http://stackoverflow.com/?tab=hot')
    .group('.question-summary')
    .map({
        question: '.question-hyperlink',
        url: '.question-hyperlink@href',
        views: '.views .mini-counts'
    })
    .transform({
        question: reap.t().lowercase(),
        url: reap.t().prefix('http://stackoverflow.com'),
        views: reap.t().int()
    })
    .then(console.log);

Results:

[ { question: 'program breaks from switch java',
    url: 'http://stackoverflow.com/questions/27840619/program-breaks-from-switch-java',
    views: 49 },
  { question: 'what is the z at the end of date',
    url: 'http://stackoverflow.com/questions/27840670/what-is-the-z-at-the-end-of-date',
    views: 28 },
  { question: 'convert array of objects into object',
    url: 'http://stackoverflow.com/questions/27840109/convert-array-of-objects-into-object',
    views: 18 }, .... ]
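
Prefixing works when every link is root-relative. As a hedged alternative sketch, the standard `URL` constructor resolves a link against a base and leaves already-absolute links untouched. `toAbsolute` is a hypothetical helper, written so it could be passed as a custom per-field transform function:

```javascript
// Hypothetical helper: resolve a possibly-relative link against a base URL
// using the standard WHATWG URL constructor (global in browsers and Node.js).
function toAbsolute(base) {
    return function (val) {
        return new URL(val, base).href;
    };
}

const absolutize = toAbsolute('http://stackoverflow.com');

const fromRelative = absolutize('/questions/27840619/program-breaks-from-switch-java');
const fromAbsolute = absolutize('http://example.com/already/absolute');
```

Here `fromRelative` becomes `http://stackoverflow.com/questions/27840619/program-breaks-from-switch-java`, while `fromAbsolute` is returned unchanged.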

You can also chain transforms:

    ...
    .transform({
        summary: reap.t().lowercase().trim()
    })
    ...
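
Chaining presumably applies each transform to the output of the previous one, left to right. A plain-JavaScript sketch of that assumed behavior follows. `chain`, `lowercase` and `trim` here are illustrative stand-ins, not the library's internals:

```javascript
// Illustrative stand-in for a transform chain: run each step on the
// result of the previous one, in the order the steps were given.
function chain(...steps) {
    return (value) => steps.reduce((current, step) => step(current), value);
}

const lowercase = (s) => s.toLowerCase();
const trim = (s) => s.trim();

// Assumed to behave like reap.t().lowercase().trim()
const summary = chain(lowercase, trim);

const cleaned = summary('  Hello World \n');   // 'hello world'
```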

Transforms

reap.transforms contains the basic transform functions.

reap.t().trim()

Trims the field value

reap.t().prefix(string)

Prepend string to field value

reap.t().postfix(string)

Append string to field value

reap.t().lowercase()

Lowercase field value

reap.t().slice(from, to)

Slices the field value, same as String.prototype.slice

reap.t().split(separator)

Splits the string using the given separator and returns an array

reap.t().join(glue)

Joins the array using the given glue and returns a string

reap.t().int()

Typecasts the field value to int

reap.t().float()

Typecasts the field value to float
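
For orientation, the built-in transforms listed above read like thin wrappers over standard string and array operations. The sketch below shows the plain-JavaScript operation each one is assumed to correspond to. In particular, mapping `int()` and `float()` to `parseInt` and `parseFloat` is an assumption, not something this README confirms:

```javascript
// Assumed plain-JavaScript counterparts of the built-in transforms.
const trimmed = '  hot question  '.trim();                    // trim()
const prefixed = 'http://stackoverflow.com' + '/questions';   // prefix(string)
const postfixed = 'report' + '.json';                         // postfix(string)
const lowered = 'Engineer Anti-Patterns'.toLowerCase();       // lowercase()
const sliced = 'Views: 49'.slice(7, 9);                       // slice(from, to)
const parts = 'a,b,c'.split(',');                             // split(separator)
const joined = parts.join(' | ');                             // join(glue)
const asInt = parseInt('49 views', 10);                       // int(), assumed
const asFloat = parseFloat('3.14');                           // float(), assumed
```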

Custom transforms

You can use a custom transform function:

    ...
    .transform(function (item) {
        if (item.type === 'good') {
            item.status = 'good item';
        }

        return item;
    })
    ...

Or apply a transform to a specific field:

    ...
    .transform({
        status: function (val) {
            return 'status: ' + val.toLowerCase();
        }
    })
    ...
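
To make the difference between the two forms concrete, here is a self-contained sketch of how an item-level function and a per-field map might each be applied to a collection. `applyTransform` is a hypothetical helper. Skipping fields that are missing from an item is an assumption of this sketch:

```javascript
// Hypothetical helper: apply either an item-level transform function
// or an object mapping field names to per-field transform functions.
function applyTransform(items, transform) {
    return items.map((item) => {
        if (typeof transform === 'function') {
            // Item-level form: the function receives the whole item.
            return transform(item);
        }
        // Per-field form: each function receives only that field's value.
        const result = { ...item };
        for (const [field, fn] of Object.entries(transform)) {
            if (field in result) {
                result[field] = fn(result[field]);
            }
        }
        return result;
    });
}

const items = [{ type: 'good', status: 'NEW' }];

const byField = applyTransform(items, {
    status: (val) => 'status: ' + val.toLowerCase()
});
const byItem = applyTransform(items, (item) => ({ ...item, status: 'good item' }));
```

With this input, `byField[0].status` is `'status: new'` and `byItem[0].status` is `'good item'`.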

Filter results

Filters let you remove unwanted items from the collection:

    ...
    .filter(function (item) {
        return item.type === 'good';
    })
    ...

Property-specific filters:

    ...
    .filter({
        type: function (type) {
            return type === 'good';
        }
    })
    ...
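
The two filter forms can be pictured with a small stand-alone sketch. `applyFilter` is a hypothetical helper, and it assumes that when several properties are given, an item must pass all of their predicates to be kept:

```javascript
// Hypothetical helper: keep items using either an item-level predicate
// or an object mapping property names to per-property predicates.
function applyFilter(items, filter) {
    if (typeof filter === 'function') {
        return items.filter(filter);
    }
    // Assumption: every listed property predicate must return true.
    return items.filter((item) =>
        Object.entries(filter).every(([field, predicate]) => predicate(item[field]))
    );
}

const rows = [
    { id: 1, type: 'good' },
    { id: 2, type: 'bad' }
];

const keptByItem = applyFilter(rows, (item) => item.type === 'good');
const keptByField = applyFilter(rows, { type: (type) => type === 'good' });
```

Both calls keep only the item with `id: 1`.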

LICENSE

MIT