discovery-web-crawler

Crawls a website and populates a Watson Discovery Collection.

License: ISC

Install

npm install discovery-web-crawler

Usage

The following snippet crawls the Watson stories pages on the IBM website and indexes them in a Watson Discovery collection.

const DiscoveryWebCrawler = require('discovery-web-crawler')

const crawler = new DiscoveryWebCrawler({
    serviceUrl: 'YOUR_SERVICE_URL',
    apikey: 'YOUR_APIKEY',
    environmentId: 'YOUR_ENVIRONMENT_ID',
    collectionId: 'YOUR_COLLECTION_ID',

    url: 'https://www.ibm.com/watson/stories/',                                 // Starting point URL
    maxDepth: 3,                                                                // Max crawler depth
    fetchCondition: queueItem => queueItem.path.startsWith('/watson/'),         // Condition to crawl this URL
    urlCondition: url => !url.match('/list'),                                   // Condition to index this URL
    parse: async $ => ({ text: $('main').text().replace(/\s+/g, ' ').trim() }), // Cheerio API to extract JSON from HTML content
})
crawler.start()
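
To get a feel for how the two conditions in the snippet above behave before running a real crawl, they can be tried in isolation against a few sample URLs. This is only a rough sketch: `toQueueItem` and `normalize` are hypothetical helpers introduced here and are not part of discovery-web-crawler, and the sketch assumes the queue item exposes the URL path as `path`, as the `fetchCondition` above implies. It does not reproduce how the crawler combines the conditions internally.

```javascript
// Same predicates as in the usage snippet, pulled out so they can be tested alone.
const fetchCondition = queueItem => queueItem.path.startsWith('/watson/')
const urlCondition = url => !url.match('/list')

// Hypothetical helper (NOT part of discovery-web-crawler): builds an object with
// only the `path` property that fetchCondition reads. Assumption: the real queue
// item carries the URL path in `path`.
const toQueueItem = url => ({ path: new URL(url).pathname })

// Hypothetical helper: the whitespace clean-up applied to the extracted text
// in the `parse` option (collapse runs of whitespace, then trim the ends).
const normalize = text => text.replace(/\s+/g, ' ').trim()

const samples = [
    'https://www.ibm.com/watson/stories/',
    'https://www.ibm.com/watson/stories/list',
    'https://www.ibm.com/cloud/',
]

// Evaluate each predicate independently for every sample URL.
const results = samples.map(url => ({
    url,
    fetch: fetchCondition(toQueueItem(url)),
    index: urlCondition(url),
}))

console.table(results)
console.log(JSON.stringify(normalize('  Watson \n\t stories  ')))
```

Checking the predicates this way makes it easier to spot a condition that is too broad or too narrow before any documents are sent to the collection.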


Run tests

npm run test

Author

👤 Marco Cardoso

Show your support

Give a ⭐️ if this project helped you!