norch-crawlers

A Node.js crawler library for quickly and easily building versatile crawlers. It makes working with request and cheerio a little easier, so you don't have to write the same standard code over and over again.

Usage: no npm install needed!

<script type="module">
  import norchCrawlers from 'https://cdn.skypack.dev/norch-crawlers';
</script>

Functions

  • Play nice with servers: Wait between each request.
  • Get the 'next' and 'last' URLs for pagination scenarios.
  • Write the list synchronously to file at the end
  • Serving header info
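
The first and third points above can be sketched roughly as follows. This is a minimal illustration using only Node.js built-ins, not the norch-crawlers API: `crawlPolitely`, `sleep`, the `delayMs` and `outFile` options, and the caller-supplied `fetchPage` function are all hypothetical names chosen for this example.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper, not part of norch-crawlers.
// `fetchPage(url)` is an assumed async function that returns the page body.
async function crawlPolitely(urls, fetchPage, { delayMs = 1000, outFile } = {}) {
  const results = [];
  for (let i = 0; i < urls.length; i++) {
    // Play nice with servers: pause before every request except the first.
    if (i > 0) await sleep(delayMs);
    results.push({ url: urls[i], body: await fetchPage(urls[i]) });
  }
  // One synchronous write once the whole crawl has finished.
  if (outFile) fs.writeFileSync(outFile, JSON.stringify(results, null, 2));
  return results;
}

// Usage with a stand-in fetcher, so the sketch runs without network access.
const fakeFetch = async (url) => `<html>${url}</html>`;
const outFile = path.join(os.tmpdir(), 'crawl-results.json');

crawlPolitely(['https://example.com/a', 'https://example.com/b'], fakeFetch, { delayMs: 50, outFile })
  .then((results) => console.log(`Saved ${results.length} pages to ${outFile}`));
```

Requests run strictly one after another, so the delay is a lower bound on the gap between two requests. Writing once at the end keeps the code simple, at the cost of losing results if the process dies mid-crawl.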

Examples

  • List crawling: Crawl paginated lists for URLs
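
As a rough idea of what finding the 'next' and 'last' URLs on a paginated list page can involve, here is a small sketch. It assumes the page marks its pagination links with rel="next" and rel="last", which many sites do but not all. `getPaginationUrls` is a hypothetical helper, not a function exported by norch-crawlers, and it uses a regular expression only to stay dependency-free; with cheerio you would use a selector instead.

```javascript
// Hypothetical helper, not part of norch-crawlers: finds pagination links
// in an HTML string, assuming they are marked with rel="next" / rel="last".
function getPaginationUrls(html, baseUrl) {
  const result = { next: null, last: null };
  // Collect every opening <a ...> tag, then read its attributes one by one,
  // so the order of rel and href inside the tag does not matter.
  const anchorTags = html.match(/<a\b[^>]*>/gi) || [];
  for (const tag of anchorTags) {
    const rel = /\brel\s*=\s*["']([^"']*)["']/i.exec(tag);
    const href = /\bhref\s*=\s*["']([^"']*)["']/i.exec(tag);
    if (!rel || !href) continue;
    // Resolve relative links against the URL of the current page.
    const url = new URL(href[1], baseUrl).href;
    const rels = rel[1].toLowerCase().split(/\s+/);
    if (rels.includes('next') && result.next === null) result.next = url;
    if (rels.includes('last') && result.last === null) result.last = url;
  }
  return result;
}

const page = `
  <ul><li><a href="/item/1">One</a></li></ul>
  <a rel="next" href="/list?page=2">Next</a>
  <a href="/list?page=9" rel="last">Last</a>`;

const links = getPaginationUrls(page, 'https://example.com/list?page=1');
console.log(links);
```

A list crawler can keep requesting `links.next` until it is `null` or equals the page it just fetched, collecting the item URLs from each page along the way.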

Planned functionality

  • Item crawling
  • Pagination iteration, second version
  • Define which domain(s) to crawl
  • Site-crawl - Add found URLs to crawl queue
  • Write content asynchronously (append to file) throughout crawling.
  • Follow robots.txt
  • Check if new content
  • Check if updated content
  • Overwrite the crawler header and set the 'from' field.
  • Crawl with headless browser.