snoospider
A Node.js spider for scraping reddit.
Features
(See documentation for comprehensive features and examples.)
snoospider lets you scrape submissions, comments, and replies from a given subreddit within a specified Unix time frame (in seconds), without getting bogged down by the learning curve of reddit's API, the snoowrap wrapper for that API, or the Bluebird promises the wrapper uses.
If the directory option is supplied to an instance of snoospider, the spider writes JSON files with all fields and metadata (most importantly the body field of comments) to that relative directory. This makes it easy to analyze the files with another tool, e.g., R and its RJSONIO package. A callback option can also be supplied to pass the scraped JSON arrays into a function such as console.log(...), or any other function, for direct processing in JavaScript without any file I/O.
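For direct processing without file I/O, the callback can do more than console.log. Below is a minimal sketch of such a callback; the shape of the scraped array (a leading metadata object followed by one object per submission containing a comments array whose entries have a body field) is assumed from the File Output section below, so adjust the property access to match your actual output.

// A minimal sketch of a callback that prints every comment body it receives.
// ASSUMPTION: the array it is passed mirrors the File Output example below,
// i.e., element 0 is snoospider metadata and each later element has a
// `comments` array whose entries have `body` and `replies` fields.
function printCommentBodies(results) {
    for (let i = 1; i < results.length; i++) {
        const comments = results[i].comments || [];
        for (const comment of comments) {
            console.log(comment.body);
        }
    }
}

// Then supply it to the spider: callback: printCommentBodies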
NOTE: If your use case falls outside of snoospider's scope, then you should move on to snoowrap—it is much more powerful than snoospider, but its learning curve is far greater for complex tasks.
Installation
First, to install snoospider as a dependency for your project, run:
npm install snoospider --save
Second, set up OAuth by running reddit-oauth-helper and following the directions:
npm install -g reddit-oauth-helper
reddit-oauth-helper
- Select permanent token duration.
- Select the read and mysubreddits scopes. With the account you provide to reddit-oauth-helper, you must subscribe on reddit to the subreddits you want to scrape.
Third, reddit-oauth-helper should have printed some JSON output; you will copy parts of it into another file. Create a file called credentials.json and fill in your information from reddit and reddit-oauth-helper:
{
    "client_id": "",
    "client_secret": "",
    "refresh_token": "",
    "author": "/u/YourRedditUsername"
}
Usage
You may create a JavaScript file like this:
'use strict';

// Start of the crawl window: Feb. 1, 2016, 21:00 UTC, converted to seconds.
let currentCrawlTime = Date.UTC(2016, 1, 1, 21) / 1000;

const SNOOSPIDER = require('snoospider'),
    CREDENTIALS = require('path/to/credentials.json'),
    OPTIONS = {
        subreddit: 'funny',
        startUnixTime: currentCrawlTime,
        endUnixTime: currentCrawlTime + 60 * 60, // crawl a one-hour window
        numSubmissions: 3,
        directory: './',
        callback: console.log,
        sort: 'top',
        comments: {
            depth: 1,
            limit: 1
        }
    };

let spider = new SNOOSPIDER(CREDENTIALS, OPTIONS);

spider.crawl();
This file, let's say it is called test.js, can be run with the following command:
node --harmony test.js
Based on the provided options, spider.crawl() will write one output file and also log all results to the console.
A few notes on the example file:
- The --harmony flag must be used because snoospider uses ES6 syntax.
- If options.comments is not specified, only submissions are crawled (see the sketch after these notes).
- directory, callback, or both must be specified.
- callback is simply a function that executes after the spider is done crawling. Declare a parameter for it if you want it to receive the scraped data.
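For instance, here is a minimal sketch of a submissions-only crawl that relies on directory alone; it simply drops the comments and callback options from the example above, which should be all that is needed based on these notes.

'use strict';

// Submissions-only crawl: no `comments` option, so comments are not fetched,
// and only `directory` is given, so results go straight to JSON files.
const SNOOSPIDER = require('snoospider'),
    CREDENTIALS = require('path/to/credentials.json'),
    START = Date.UTC(2016, 1, 1, 21) / 1000,
    OPTIONS = {
        subreddit: 'funny',
        startUnixTime: START,
        endUnixTime: START + 60 * 60,
        numSubmissions: 3,
        directory: './',
        sort: 'top'
    };

new SNOOSPIDER(CREDENTIALS, OPTIONS).crawl();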
Advanced Usage
The following code outputs files of submissions and corresponding comments for each day of February 2016, from 9pm to 10pm UTC (1pm to 2pm PST).
'use strict';

// Start of the first crawl window: Feb. 1, 2016, 21:00 UTC, in seconds.
let currentCrawlTime = Date.UTC(2016, 1, 1, 21) / 1000;

const DAY_IN_SECONDS = 24 * 60 * 60,
    HOUR_IN_SECONDS = 60 * 60,
    END_FEB = Date.UTC(2016, 1, 29, 23, 59, 59) / 1000,
    SNOOSPIDER = require('path/to/snoospider/src/snoospider.js'),
    CREDENTIALS = require('path/to/credentials.json'),
    OPTIONS = {
        subreddit: 'sports',
        startUnixTime: currentCrawlTime,
        endUnixTime: currentCrawlTime + HOUR_IN_SECONDS,
        numSubmissions: 8,
        directory: './output/',
        callback: step,
        sort: 'comments',
        comments: {
            depth: 1,
            limit: 2
        }
    };

let spider = new SNOOSPIDER(CREDENTIALS, OPTIONS);

// Advance the crawl window by one day and crawl again until February ends.
function step() {
    currentCrawlTime += DAY_IN_SECONDS;
    spider.setStartUnixTime(currentCrawlTime);
    spider.setEndUnixTime(currentCrawlTime + HOUR_IN_SECONDS);
    if (currentCrawlTime < END_FEB) spider.crawl();
}

spider.crawl();
Note how step is passed as the callback to the spider instance, allowing sequential crawling: each day's crawl starts only after the previous one has finished.
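If you only want a bounded run, a variant of step along the following lines should work; it reuses the constants and spider from the example above, and the crawl counter and limit are illustrative additions, not part of snoospider.

// A sketch of a bounded variant of `step`: stop after a fixed number of
// crawls. Only setStartUnixTime, setEndUnixTime, and crawl() from the
// example above are used; the counter and MAX_CRAWLS are hypothetical.
let crawlsDone = 0;
const MAX_CRAWLS = 7; // e.g., the first week of February only

function boundedStep() {
    crawlsDone += 1;
    if (crawlsDone >= MAX_CRAWLS) {
        console.log('Finished after', crawlsDone, 'crawls.');
        return;
    }
    currentCrawlTime += DAY_IN_SECONDS;
    spider.setStartUnixTime(currentCrawlTime);
    spider.setEndUnixTime(currentCrawlTime + HOUR_IN_SECONDS);
    spider.crawl();
}

// Pass it to the spider instead of step: callback: boundedStep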
File Output
Output files should look something like this, with filenames of the form {subreddit}-{Unix time in milliseconds}:
[
    {
        "program": "snoospider",
        "version": "0.13.0",
        "blame": "/u/YourRedditUsername",
        "parameters": {
            "subreddit": "funny",
            "startUnixTime": 1436369760,
            "endUnixTime": 1439048160,
            "numSubmissions": 3,
            "directory": "./",
            "comments": {
                "depth": 1,
                "limit": 1
            }
        }
    },
    {
        "...SUBMISSION METADATA...": {
            "comments": [
                {
                    "...": "...",
                    "replies": [
                        {
                            "...": "..."
                        }
                    ]
                }
            ]
        }
    },
    {
        "...AND SO ON": "AND SO FORTH..."
    }
]
Note that the first object in the JSON array is metadata associated with snoospider.
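Since the files are plain JSON, they can also be read straight back into Node. The sketch below uses a hypothetical output filename; substitute one of the {subreddit}-{Unix time in milliseconds} files snoospider actually wrote to your directory.

'use strict';

const FS = require('fs');

// Hypothetical filename; use a real file from your output directory.
const DATA = JSON.parse(FS.readFileSync('path/to/output/funny-1454360400000', 'utf8')),
    METADATA = DATA[0],          // first element: snoospider metadata
    SUBMISSIONS = DATA.slice(1); // remaining elements: submissions and comments

console.log('Scraped by', METADATA.program, METADATA.version);
console.log('Submissions scraped:', SUBMISSIONS.length);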