snoospider
A Node.js spider for scraping reddit.
Features
(See documentation for comprehensive features and examples.)
snoospider lets you scrape submissions, comments, and replies from a given subreddit within a specified Unix time frame (in seconds), without getting bogged down by the learning curve of reddit's API, the snoowrap wrapper for that API, or the Bluebird promises the wrapper uses.
If the directory option is supplied to an instance of snoospider, the spider writes JSON files with all fields and metadata (most importantly the body field of comments) to that relative directory. This makes it easy to analyze the files with another tool, e.g., R and its RJSONIO package. A callback option can also be supplied to pass the scraped JSON arrays into a function such as console.log(...), or any other function, for direct processing in JavaScript without any file I/O.
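For direct processing without file I/O, the callback can do more than console.log. Below is a minimal sketch of such a callback; the shape of the scraped array (a leading metadata object followed by one object per submission containing a comments array whose entries have a body field) is assumed from the File Output section below, so adjust the property access to match your actual output.

// A minimal sketch of a callback that prints every comment body it receives.
// ASSUMPTION: the array it is passed mirrors the File Output example below,
// i.e., element 0 is snoospider metadata and each later element has a
// `comments` array whose entries have `body` and `replies` fields.
function printCommentBodies(results) {
    for (let i = 1; i < results.length; i++) {
        const comments = results[i].comments || [];
        for (const comment of comments) {
            console.log(comment.body);
        }
    }
}

// Then supply it to the spider: callback: printCommentBodies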
NOTE: If your use case falls outside of snoospider's scope, then you should move on to snoowrap—it is much more powerful than snoospider, but its learning curve is far greater for complex tasks.
Installation
First, to install snoospider as a dependency for your project, run:
npm install snoospider --save
Second, set up OAuth by running reddit-oauth-helper and following the directions:
npm install -g reddit-oauth-helper
reddit-oauth-helper
- Select permanent token duration.
- Select the read and mysubreddits scopes. With the account you provide to reddit-oauth-helper, you must subscribe on reddit to the subreddits you want to scrape.
Third, reddit-oauth-helper should have printed some JSON output; you will copy parts of it into another file. Create a file called credentials.json and fill in your information from reddit and reddit-oauth-helper:
{
    "client_id": "",
    "client_secret": "",
    "refresh_token": "",
    "author": "/u/YourRedditUsername"
}
Usage
You may create a JavaScript file like this:
'use strict';

// Start of the crawl window: Feb. 1, 2016, 21:00 UTC, converted to seconds.
let currentCrawlTime = Date.UTC(2016, 1, 1, 21) / 1000;

const SNOOSPIDER = require('snoospider'),
    CREDENTIALS = require('path/to/credentials.json'),
    OPTIONS = {
        subreddit: 'funny',
        startUnixTime: currentCrawlTime,
        endUnixTime: currentCrawlTime + 60 * 60, // crawl a one-hour window
        numSubmissions: 3,
        directory: './',
        callback: console.log,
        sort: 'top',
        comments: {
            depth: 1,
            limit: 1
        }
    };

let spider = new SNOOSPIDER(CREDENTIALS, OPTIONS);

spider.crawl();
This file, let's say it is called test.js, can be run with the following command:
node --harmony test.js
Based on the provided options, spider.crawl() will write one output file and also log all results to the console.
A few notes on the example file:
- The --harmony flag must be used because snoospider uses ES6 syntax.
- If options.comments is not specified, only submissions are crawled (see the sketch after these notes).
- directory, callback, or both must be specified.
- callback is simply a function that executes after the spider is done crawling. Declare a parameter for it if you want it to receive the scraped data.
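For instance, here is a minimal sketch of a submissions-only crawl that relies on directory alone; it simply drops the comments and callback options from the example above, which should be all that is needed based on these notes.

'use strict';

// Submissions-only crawl: no `comments` option, so comments are not fetched,
// and only `directory` is given, so results go straight to JSON files.
const SNOOSPIDER = require('snoospider'),
    CREDENTIALS = require('path/to/credentials.json'),
    START = Date.UTC(2016, 1, 1, 21) / 1000,
    OPTIONS = {
        subreddit: 'funny',
        startUnixTime: START,
        endUnixTime: START + 60 * 60,
        numSubmissions: 3,
        directory: './',
        sort: 'top'
    };

new SNOOSPIDER(CREDENTIALS, OPTIONS).crawl();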
Advanced Usage
The following code outputs files of submissions and corresponding comments for each day of February 2016, from 9pm to 10pm UTC (1pm to 2pm PST).
'use strict';

// Start of the first crawl window: Feb. 1, 2016, 21:00 UTC, in seconds.
let currentCrawlTime = Date.UTC(2016, 1, 1, 21) / 1000;

const DAY_IN_SECONDS = 24 * 60 * 60,
    HOUR_IN_SECONDS = 60 * 60,
    END_FEB = Date.UTC(2016, 1, 29, 23, 59, 59) / 1000,
    SNOOSPIDER = require('path/to/snoospider/src/snoospider.js'),
    CREDENTIALS = require('path/to/credentials.json'),
    OPTIONS = {
        subreddit: 'sports',
        startUnixTime: currentCrawlTime,
        endUnixTime: currentCrawlTime + HOUR_IN_SECONDS,
        numSubmissions: 8,
        directory: './output/',
        callback: step,
        sort: 'comments',
        comments: {
            depth: 1,
            limit: 2
        }
    };

let spider = new SNOOSPIDER(CREDENTIALS, OPTIONS);

// Advance the crawl window by one day and crawl again until February ends.
function step() {
    currentCrawlTime += DAY_IN_SECONDS;
    spider.setStartUnixTime(currentCrawlTime);
    spider.setEndUnixTime(currentCrawlTime + HOUR_IN_SECONDS);
    if (currentCrawlTime < END_FEB) spider.crawl();
}

spider.crawl();
Note how step is passed as the callback to the spider instance, allowing sequential crawling: each day's crawl starts only after the previous one has finished.
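If you only want a bounded run, a variant of step along the following lines should work; it reuses the constants and spider from the example above, and the crawl counter and limit are illustrative additions, not part of snoospider.

// A sketch of a bounded variant of `step`: stop after a fixed number of
// crawls. Only setStartUnixTime, setEndUnixTime, and crawl() from the
// example above are used; the counter and MAX_CRAWLS are hypothetical.
let crawlsDone = 0;
const MAX_CRAWLS = 7; // e.g., the first week of February only

function boundedStep() {
    crawlsDone += 1;
    if (crawlsDone >= MAX_CRAWLS) {
        console.log('Finished after', crawlsDone, 'crawls.');
        return;
    }
    currentCrawlTime += DAY_IN_SECONDS;
    spider.setStartUnixTime(currentCrawlTime);
    spider.setEndUnixTime(currentCrawlTime + HOUR_IN_SECONDS);
    spider.crawl();
}

// Pass it to the spider instead of step: callback: boundedStep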
File Output
Output files should look something like this, with filenames of the form {subreddit}-{Unix time in milliseconds}:
[
    {
        "program": "snoospider",
        "version": "0.13.0",
        "blame": "/u/YourRedditUsername",
        "parameters": {
            "subreddit": "funny",
            "startUnixTime": 1436369760,
            "endUnixTime": 1439048160,
            "numSubmissions": 3,
            "directory": "./",
            "comments": {
                "depth": 1,
                "limit": 1
            }
        }
    },
    {
        "...SUBMISSION METADATA...": {
            "comments": [
                {
                    "...": "...",
                    "replies": [
                        {
                            "...": "..."
                        }
                    ]
                }
            ]
        }
    },
    {
        "...AND SO ON": "AND SO FORTH..."
    }
]
Note that the first object in the JSON array is metadata associated with snoospider.
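Since the files are plain JSON, they can also be read straight back into Node. The sketch below uses a hypothetical output filename; substitute one of the {subreddit}-{Unix time in milliseconds} files snoospider actually wrote to your directory.

'use strict';

const FS = require('fs');

// Hypothetical filename; use a real file from your output directory.
const DATA = JSON.parse(FS.readFileSync('path/to/output/funny-1454360400000', 'utf8')),
    METADATA = DATA[0],          // first element: snoospider metadata
    SUBMISSIONS = DATA.slice(1); // remaining elements: submissions and comments

console.log('Scraped by', METADATA.program, METADATA.version);
console.log('Submissions scraped:', SUBMISSIONS.length);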