hapi-goldwasher

A plugin for Hapi.js to run goldwasher as a scraping API on the web.



In practice it works as a scraper proxy that returns the scraped information in the selected format, defaulting to JSON.

Installation

npm install hapi-goldwasher

If you aren't already running a hapi server, you also need to install hapi to run the example:

npm install hapi

Options

When registering the plugin with hapi, you have several options, none of them required:

  • path - the endpoint you mount the plugin on. Defaults to /goldwasher.
  • maxRedirects - the maximum number of redirects the scraper will accept before giving up. Defaults to 5.
  • cors - a CORS object. Defaults to false. See hapi docs for more information.
  • raw - enable raw output mode. This allows output=raw, which returns the raw, scraped result, usually HTML.
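
To make the documented defaults concrete, here is a minimal sketch of which option values apply when you leave some of them out. It is not the plugin's own code: DOCUMENTED_DEFAULTS and effectiveOptions are hypothetical names used only for illustration, and raw is omitted from the defaults because no default is documented for it.

```javascript
// Defaults as stated in the options list.
// ASSUMPTION: `raw` has no documented default, so it is left out here.
var DOCUMENTED_DEFAULTS = {
  path: '/goldwasher',
  maxRedirects: 5,
  cors: false
};

// Hypothetical helper: merges user-supplied options over the documented defaults.
function effectiveOptions(userOptions) {
  return Object.assign({}, DOCUMENTED_DEFAULTS, userOptions || {});
}

// Only maxRedirects and raw are overridden; path and cors keep their defaults.
var options = effectiveOptions({ maxRedirects: 2, raw: true });
console.log(options);
```

Any option you do pass replaces the corresponding default; everything else falls back to the documented value.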

Parameters

  • url - the URL to scrape. Required.
  • selector - cheerio (jQuery) selector, a selection of target tags. Defaults to the default of goldwasher, usually 'h1, h2, h3, h4, h5, h6, p'.
  • search - only pick results containing these terms. Matching ignores case and special characters.
  • limit - limit number of results.
  • output - output format (json, xml, atom, rss or - if enabled - raw).
  • filterTexts - stop texts that should be excluded.
  • filterKeywords - stop words that should be excluded as keywords.
  • filterLocale - stop words from external JSON file (see documentation on goldwasher).
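
As an illustration of how these parameters might be combined, the sketch below builds a request URL for the endpoint. It assumes the parameters are sent as query-string parameters on a GET request, and that the server runs on localhost:7979 with the default path. buildScrapeUrl is a hypothetical helper, not part of the plugin.

```javascript
// Hypothetical helper: builds a request URL for a hapi-goldwasher endpoint.
// ASSUMPTION: parameters are passed as query-string parameters.
function buildScrapeUrl(base, params) {
  var endpoint = new URL(base);
  Object.keys(params).forEach(function (key) {
    // searchParams.set() takes care of URL-encoding each value.
    endpoint.searchParams.set(key, String(params[key]));
  });
  return endpoint.toString();
}

// Assumed host and port; the path is the documented default.
var requestUrl = buildScrapeUrl('http://localhost:7979/goldwasher', {
  url: 'https://example.com',
  selector: 'h1, h2',
  limit: 5,
  output: 'xml'
});
console.log(requestUrl);
```

Building the URL with searchParams rather than string concatenation keeps the scraped url and the selector correctly encoded, which matters because both usually contain reserved characters.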

Example

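var Hapi = require('hapi');
// Load the plugin as installed from npm.
// (From inside the plugin's own repository, require('./index') instead.)
var HapiGoldwasher = require('hapi-goldwasher');

// Note: server.connection() and callback-style register() belong to hapi releases before v17.
var server = new Hapi.Server();
server.connection({ port: 7979 });

// Mount the plugin and allow cross-origin requests from any origin.
server.register({
  register: HapiGoldwasher,
  options: {
    path: '/goldwasher',
    cors: {
      origin: ['*']
    }
  }
}, function(err) {
  if (err) {
    throw err;
  }

  // Start listening only after the plugin has registered successfully.
  server.start(function(err) {
    if (err) {
      throw err;
    }
    console.log('Server running at: ' + server.info.uri);
  });
});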

Go to the server URI and you will be presented with a JSON response containing documentation. I recommend using something like the Chrome JSON Formatter for readability.