express-xml-scrapper

Specify the sitemap index and download the xml information locally or on a node.js server.

Usage no npm install needed!

<script type="module">
  import expressXmlScrapper from 'https://cdn.skypack.dev/express-xml-scrapper';
</script>

README

Express Node XML Scraper

A XML Scraper with a front-end facing Express server that scrapes for remote XML files while at the same time offering the saved sitemaps on the Node Express server.

Set-up

  • Run npm install to install all the dependancies including the simple-xml-scrapper app that will crawl the site.
node server.js --sitemap_index_url=http://www.nytimes-se.com/nytse/sitemap.xml

Express Server

  • After everything is successfully created, a local directory /rendered_sitemaps will be created.
  • Visit http://localhost:3001/rendered_sitemaps for additional information on how generated sitemap directories can be accessed.
  • Sitemaps will be have a directory created per domain. http://www.nytimes-se.com/nytse/sitemap.xml will create a nytimes-se.com/ directory and contain all related sitemaps from within that domain.

Cron

  • This node application features a running internal cron job that can be set using a regular cron expression and using the cron_schedule= flag within the CLI command for this job. The following command will scrap on the 30 minute marker the specified sitemap.
node server.js --sitemap_index_url=http://www.nytimes-se.com/nytse/sitemap.xml --cron_schedule="30 * * * *"

Saving

  • Files by default are saved into an XML format. Plans for CSV formatting will be made available.