stream-sitemap-parser

Receive any type of sitemap stream and parse it, streaming back a list of URLs or any errors found.

Usage (no npm install needed!)

<script type="module">
  import streamSitemapParser from 'https://cdn.skypack.dev/stream-sitemap-parser';
</script>


Usage

const fs = require('fs');
const { fetch, verify, getRules } = require('stream-sitemap-parser');

fs.createReadStream(file)
  .pipe(fetch())
  .on('data', function (url) {
    // each chunk contains a URL and all of its attributes:
    // {
    //   loc: 'www.google.com',
    //   lastmod: '2017-01-01T00:00:00.000Z',
    //   changefreq: 'monthly',
    //   priority: '0.8',
    //   alternate: [
    //     {
    //       href: 'https://www.google.com/es/',
    //       hreflang: 'es'
    //     }
    //   ]
    // }
  })

verify(fs.createReadStream(file))
  .then(result => {
    // result is an object describing any warnings or errors found while parsing the sitemap:
    // {
    //   messages: [
    //     {
    //       type: 'tooManyTags',
    //       details: {
    //         parent: 'url',
    //         tag: 'loc'
    //       }
    //     }
    //   ],
    //   alternates: [
    //     {
    //       loc: 'https://www.google.com',
    //       alternate: [
    //         {
    //           href: 'https://www.google.com/es/',
    //           hreflang: 'es'
    //         }
    //       ]
    //     }
    //   ]
    // }
  })

getRules();
// returns an object containing all of the parser's loaded rules

Both fetch and verify accept an options object:

fetch({ contentType, domain, maxSize, maxUrls })

verify(sitemapStream, { contentType, domain, maxSize, maxUrls })

contentType defaults to xml. Set it to txt when streaming a plain-text sitemap.

domain defaults to null. Set it to a domain to ensure that every parsed URL belongs to that domain.

maxSize defaults to 50MB. Set it to any size to ensure the stream cannot grow larger than that.

maxUrls defaults to 50000. Set it to any value to stop parsing once that many URLs have been read.