web-tree-crawler

A naive web crawler that builds a tree of URLs under a domain using web-tree.

Note: This software is intended for personal learning and testing purposes.

How it works

You pass web-tree-crawler a URL, and it tries to discover and visit as many URLs under that domain as it can within a time limit. When time is up or it has run out of URLs, web-tree-crawler spits out a tree of the URLs it visited. There are several configuration options; see the usage sections below.
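
Conceptually, the crawl is a breadth-first loop over a URL queue with a deadline. The sketch below is illustrative only, not this package's source: fetchPage is a hypothetical helper that resolves to a page's HTML, pages are fetched one at a time rather than numRequests at a time, and it stays on a single origin whereas web-tree-crawler also descends into subdomains.

'use strict'

// Illustrative sketch only -- not web-tree-crawler's actual implementation.
// Assumes a hypothetical fetchPage(url) that resolves to the page's HTML.
async function crawlDomain (startUrl, timeLimitSeconds) {
  const origin = new URL(startUrl).origin
  const deadline = Date.now() + timeLimitSeconds * 1000
  const visited = new Set()
  const queue = [startUrl]

  while (queue.length > 0 && Date.now() < deadline) {
    const url = queue.shift()
    if (visited.has(url)) continue
    visited.add(url)

    const html = await fetchPage(url)

    // Pull hrefs out of the page and queue the ones under the same origin
    for (const [, href] of html.matchAll(/href="([^"]+)"/g)) {
      let next
      try { next = new URL(href, url) } catch { continue }
      if (next.origin === origin && !visited.has(next.href)) {
        queue.push(next.href)
      }
    }
  }

  return visited // the set of URLs the printed tree is built from
}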

Install

npm i web-tree-crawler

CLI

Usage

Usage: [option=] web-tree-crawler <url>

Options:
  format, f       The output format of the tree (default="string")
  headers, h      File containing headers to send with each request
  numRequests, n  The number of requests to send at a time (default=200)
  outFile, o      Write the tree to a file instead of stdout
  pathList, p     File containing paths to initially crawl
  timeLimit, t    The max number of seconds to run (default=120)
  verbose, v      Log info and progress to stdout
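
Since options are plain environment-style assignments before the command (as the examples below show), several of them should combine in one invocation in any POSIX shell:

$ f=html t=60 v=true web-tree-crawler <url>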

Examples

Crawl and print tree to stdout

$ h=/path/to/file web-tree-crawler <url>

.com
  .domain
    .subdomain1
      /foo
        /bar
      .subdomain-of-subdomain1
        /baz
          ?q=1
    .subdomain2
...

And to print an HTML tree...

$ f=html web-tree-crawler <url>

...

Crawl and write tree to file

$ o=/path/to/file web-tree-crawler <url>

Wrote tree to file!

Crawl with verbose logging

$ v=true web-tree-crawler <url>

Visited "<url>"
Visited "<another-url>"
...

JS

Usage

/**
 * This is the main exported function that crawls and resolves the URL tree.
 *
 * @param  {String}   url
 * @param  {Object}   [opts = {}]
 * @param  {Object}   [opts.headers]           - headers to send with each request
 * @param  {Number}   [opts.numRequests = 200] - the number of requests to send at a time
 * @param  {String[]} [opts.startPaths]        - paths to initially crawl
 * @param  {Number}   [opts.timeLimit = 120]   - the max number of seconds to run for
 * @param  {Boolean}  [opts.verbose]           - if true, logs info and progress to stdout
 * @param  {}         [opts....]               - additional options for #lib.request()
 *
 * @return {Promise}
 */

Example

'use strict'

const crawl = require('web-tree-crawler')

// The URL is illustrative; opts may be omitted (see the parameters above)
crawl('https://example.com')
  .then(tree => console.log(tree))   // the resolved tree of visited URLs
  .catch(err => console.error(err))
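
And a fuller call passing some of the documented options (the URL and values here are only examples):

'use strict'

const crawl = require('web-tree-crawler')

crawl('https://example.com', {
  headers: { 'user-agent': 'my-crawler' }, // sent with each request
  startPaths: ['/sitemap.xml', '/blog'],   // paths to initially crawl
  numRequests: 50,                         // requests to send at a time
  timeLimit: 60                            // stop after 60 seconds
})
  .then(tree => console.log(tree))
  .catch(err => console.error(err))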

Test

npm test

Lint

npm run lint

Documentation

npm run doc

Generates the docs and opens them in your browser.

Contributing

Please do!

If you find a bug, want a feature added, or just have a question, feel free to open an issue. You're also welcome to create a pull request addressing an issue; push your changes to a feature branch and request a merge into develop.

Make sure linting and tests pass and coverage is 💯 before creating a pull request!