microfrontier

A web crawler frontier implementation in TypeScript backed by Redis. MicroFrontier is a scalable and distributed frontier implemented through Redis Queues.



  • Fast Ingestion & High throughput
  • Multiple priority queues
  • Custom priority strategy
  • Per-Hostname crawl rate limit or default delay fallback
  • Easy to use HTTP Microservice
  • Multi-processing support
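The per-hostname rate limit with a default delay fallback can be pictured with a small in-memory sketch. This is not MicroFrontier's code (the real service keeps its state in Redis); the `HostGate` class, its method name, and the per-host override map are hypothetical names used only for illustration.

```typescript
// Hypothetical in-memory politeness gate, for illustration only.
class HostGate {
  private nextAllowed = new Map<string, number>();
  private defaultDelayMs: number;
  private perHostDelayMs: Record<string, number>;

  constructor(defaultDelayMs: number, perHostDelayMs: Record<string, number> = {}) {
    this.defaultDelayMs = defaultDelayMs;
    this.perHostDelayMs = perHostDelayMs;
  }

  // Returns true and reserves the next slot if the URL's host may be crawled at `now`.
  tryAcquire(url: string, now: number = Date.now()): boolean {
    const host = new URL(url).hostname;
    const readyAt = this.nextAllowed.get(host) ?? 0;
    if (now < readyAt) return false;
    // Use the host-specific delay when one is set, else fall back to the default.
    const delay = this.perHostDelayMs[host] ?? this.defaultDelayMs;
    this.nextAllowed.set(host, now + delay);
    return true;
  }
}
```

A gate created with `new HostGate(1000)` would let one URL per host through each second, which matches the `defaultCrawlDelay: 1000` value in the default configuration if that value is read as milliseconds (an assumption).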

Figure: example of a Mercator frontier queue [1].

Usage

MicroFrontier can be used as a JavaScript library (SDK), from the command line, or as a Docker deployment.

Command Line

Install microfrontier with:

npm i -g microfrontier

Run microfrontier:

microfrontier --port 3035 --redis:host localhost  # see Configuration for other parameters

As a package

Npm:

npm i microfrontier

Yarn:

yarn add microfrontier

Docker

docker pull adileo/microfrontier

Configuration

| ENV VAR             | CLI PARAM             | Description                                                                     | Default value |
|---------------------|-----------------------|---------------------------------------------------------------------------------|---------------|
| host                | --host                | Host name to start the microservice HTTP server.                                | 127.0.0.1     |
| port                | --port                | Port to start the microservice HTTP server.                                     | 8090          |
| redis_host          | --redis:host          | Redis server host.                                                              | 127.0.0.1     |
| redis_port          | --redis:port          | Redis server port.                                                              | 6379          |
| redis_*             | --redis:*             | Parameters are interpreted by nconf and passed to ioredis as the client config. |               |
| config_frontierName | --config:frontierName | Prefix used for Redis keys.                                                     | frontier      |
| config_*            | --config:*            | Parameters are interpreted by nconf.                                            | see below     |

Default values for config_* / --config:*:

{
    frontierName: 'frontier',
    priorities: {
        'high':     {probability: 0.6},
        'normal':   {probability: 0.3},
        'low':      {probability: 0.1},
    },
    defaultCrawlDelay: 1000
}
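One plausible reading of the `probability` fields is that each dequeue first picks a priority queue at random, weighted by those values. The sketch below shows that weighted choice in isolation; whether MicroFrontier selects queues exactly this way is an assumption, and `pickPriority` is a hypothetical helper, not part of the package.

```typescript
type Priorities = Record<string, { probability: number }>;

// Picks a priority name with chance proportional to its probability.
// `rand` is injectable so the choice can be made deterministic.
function pickPriority(priorities: Priorities, rand: () => number = Math.random): string {
  const entries = Object.entries(priorities);
  if (entries.length === 0) throw new Error("no priorities configured");
  const total = entries.reduce((sum, [, p]) => sum + p.probability, 0);
  let threshold = rand() * total;
  for (const [name, p] of entries) {
    threshold -= p.probability;
    if (threshold < 0) return name;
  }
  // Guard against floating-point drift on the last bucket.
  return entries[entries.length - 1][0];
}

// Same weights as the default configuration above.
const priorities: Priorities = {
  high: { probability: 0.6 },
  normal: { probability: 0.3 },
  low: { probability: 0.1 },
};
```

Dividing by the running total means the weights do not have to sum to exactly 1, which keeps a custom priority set from silently starving its last queue.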

How to

Adding a URL to the frontier

Via HTTP

curl --location --request POST 'http://127.0.0.1:8090/frontier' \
--header 'Content-Type: application/json' \
--data-raw '{
    "url": "http://www.example.com",
    "priority": "normal",
    "meta": {
        "foo": "bar"
    }
}'
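The same request can be issued from TypeScript. This is a sketch that mirrors the curl call above using the standard `fetch` API (a global in Node 18+ and in browsers); `buildAddRequest` and `addUrl` are hypothetical helper names, and treating any non-2xx status as a failure is an assumption about the service.

```typescript
interface AddUrlBody {
  url: string;
  priority: string;
  meta?: Record<string, unknown>;
}

interface AddRequest {
  endpoint: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds the same request as the curl call, without sending it.
function buildAddRequest(baseUrl: string, body: AddUrlBody): AddRequest {
  return {
    endpoint: new URL("/frontier", baseUrl).toString(),
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Sends the request; assumes the service answers with a 2xx status on success.
async function addUrl(baseUrl: string, body: AddUrlBody): Promise<void> {
  const { endpoint, init } = buildAddRequest(baseUrl, body);
  const res = await fetch(endpoint, init);
  if (!res.ok) throw new Error(`frontier responded with HTTP ${res.status}`);
}
```

For example, `addUrl("http://127.0.0.1:8090", { url: "http://www.example.com", priority: "normal", meta: { foo: "bar" } })` sends the same payload as the curl command.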

Via SDK

import { URLFrontier } from "microfrontier"

// `config`: your frontier settings (see Configuration; the exact shape is not shown here)
const frontier = new URLFrontier(config)

frontier.add("http://www.example.com", "normal", {"foo": "bar"}).then(() => {
    console.log('URL added')
})

Getting a URL from the frontier

Via HTTP

curl --location --request GET 'http://127.0.0.1:8090/frontier'

Via SDK

import { URLFrontier } from "microfrontier"

// `config`: your frontier settings (see Configuration; the exact shape is not shown here)
const frontier = new URLFrontier(config)

frontier.get().then((item) => {
    // {url: "http://www.example.com", meta: {"foo":"bar"}}
})
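In a crawler, `get()` is usually called in a loop by a worker. The sketch below shows one way to structure such a loop against any object with a compatible `get()` method. The `FrontierSource` interface and `drain` function are hypothetical, and the loop assumes that an empty frontier resolves to a falsy value, which the text above does not confirm.

```typescript
interface FrontierItem {
  url: string;
  meta?: Record<string, unknown>;
}

// Minimal shape the loop relies on; URLFrontier's get() is assumed to fit it.
interface FrontierSource {
  get(): Promise<FrontierItem | null | undefined>;
}

// Pulls up to `maxItems` URLs and hands each to `handle`.
// Stops early when get() yields nothing and returns the number processed.
async function drain(
  frontier: FrontierSource,
  handle: (item: FrontierItem) => Promise<void>,
  maxItems: number,
): Promise<number> {
  let processed = 0;
  while (processed < maxItems) {
    const item = await frontier.get();
    if (!item) break; // assumption: an empty frontier resolves to a falsy value
    await handle(item);
    processed++;
  }
  return processed;
}
```

Bounding the loop with `maxItems` keeps a single worker run finite; a long-running worker would instead sleep and retry when the frontier is empty.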

Citations

[1]: High-Performance Web Crawling - Marc Najork, Allan Heydon