
Epic Link Crawler

A simple in depth links crawler. You can easily collect all the links available on a website.

Installation

$ npm i epic-link-crawler --save

Usage

// Crawl all the links from the google.com homepage.

const epicLinkCrawler = require('epic-link-crawler');

const crawler = new epicLinkCrawler();

crawler.init("https://google.com", {
    depth: 5,
    strict: true,
    cache: true,
}).then(() => {
    return crawler.crawl();
}).then(data => {
    console.log(data);
}).catch(error => {
    console.error(error);
});

/**
 * Expected results at depth 1
 *
 * [
    'https://play.google.com/?hl=en&tab=w8',
    'https://mail.google.com/mail/?tab=wm',
    'https://drive.google.com/?tab=wo',
    'https://www.google.com/calendar?tab=wc',
    'https://photos.google.com/?tab=wq&pageId=none',
    'https://docs.google.com/document/?usp=docs_alc',
    'https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=https://www.google.com/',
    'https://www.google.com/setprefs?sig=0_QLWlMq1910erDBng9UqCXn8pCmQ%3D&hl=ur&source=homepage&sa=X&ved=0ahUKEwiqg6bSwrnpAhUHrxoKHV8bCgQQ2ZgBCAU',
    'https://www.google.com/setprefs?sig=0_QLWlMq1910erDBng9UqCXn8pCmQ%3D&hl=ps&source=homepage&sa=X&ved=0ahUKEwiqg6bSwrnpAhUHrxoKHV8bCgQQ2ZgBCAY',
    'https://www.google.com/setprefs?sig=0_QLWlMq1910erDBng9UqCXn8pCmQ%3D&hl=sd&source=homepage&sa=X&ved=0ahUKEwiqg6bSwrnpAhUHrxoKHV8bCgQQ2ZgBCAc',
    'https://www.google.com/setprefdomain?prefdom=PK&prev=https://www.google.com.pk/&sig=K_BE-rlArupsHUl4I9PADVcxBLCNg%3D',
    'https://google.com/preferences?hl=en',
    'https://google.com/advanced_search?hl=en-PK&authuser=0',
    'https://google.com/intl/en/ads',
    'https://google.com/intl/en/about.html',
    'https://google.com/intl/en/policies/privacy',
    'https://google.com/intl/en/policies/terms'
    ]

 */

Options

Just three options are supported for now.

  • depth - 1 to 5 (default: 1) | Crawling depth.
  • strict - boolean (default: true) | Set to false to also collect links to external websites.
  • cache - boolean (default: true) | Speeds up repeated crawls by caching fetched data.

Methods

  • init: (url: string, { depth, strict, cache }?: options) => Promise - Initialize the crawler.
  • blackList: (fingerPrintList: (string | RegExp)[]) => this - Blacklist links matching the given fingerprints.
  • validUrl: (url: string) => Promise - Validate a URL.
  • config: ({ depth, strict, cache }?: options) => this - Update the configuration.
  • getContent: (url: string) => Promise - Fetch the content of a URL.
  • clearCache: () => this - Clear the previously crawled cache.
  • collectLinks: (content: any) => string[] - Collect all links from HTML content.
  • crawl: (url?: string) => Promise - Start crawling.
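
The blackList signature accepts both string and RegExp fingerprints. A minimal sketch of how such fingerprints could be matched against collected links; isBlackListed is a hypothetical helper for illustration, not the library's internals:

```javascript
// Sketch only: assumed matching semantics for string | RegExp fingerprints.
// Strings are treated as substrings, RegExps are tested against the URL.
function isBlackListed(url, fingerPrintList) {
  return fingerPrintList.some(fp =>
    fp instanceof RegExp ? fp.test(url) : url.includes(fp)
  );
}

const blackList = ['/setprefs', /accounts\.google\.com/];
const links = [
  'https://google.com/preferences?hl=en',
  'https://www.google.com/setprefs?hl=ur',
  'https://accounts.google.com/ServiceLogin?hl=en',
];

// Keep only links that match none of the fingerprints.
console.log(links.filter(u => !isBlackListed(u, blackList)));
// → [ 'https://google.com/preferences?hl=en' ]
```

This would prune the noisy setprefs and login links seen in the depth-1 sample output above before deeper crawling.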