#omega-red
Aggressive (rate limit disobeying) scraper for reddit.
##Quick Start
Add in the subreddits you want to scrape into the config.json
file.
{
"law": {
"law": 200,
"legaled": 200,
"cyberlaws": 200,
"isthislegal": 200,
"legalnews": 200,
"lsat": 200
},
"technology": {
"technology": 200,
"android": 200,
"bitcoin": 200,
"programming": 200,
"apple": 200
}
}
Each object in the config.json
is a "metareddit", a tag for a group of subreddits. Each of the keys of the object is a subreddit, with its value the number of threads to scrape.
In this case, I'm scraping two "meta"s, each having their own subreddits.
The scraper will try its best to approximate the number of threads specified subject to errors and length of the subreddit.
To start the scraper, run
$ node omega-red.js
The scraper will then run.
##Nitty Gritty
###Buffering The scraper uses a buffered write stream, meaning that writes to the disk will happen in bursts. Hence, observing file size to see progress will be inaccurate since quite a large amount of data can be cached.
###Rate Limits
Reddit will rate limit excessive queries. Hence, (increasingly towards the end of scraping) omega-red will produce error messages such as Rate limited. Waiting 531 ms
. It uses an exponential backoff for waiting time (with maximum of 30,000ms). This is aimed at reducing the likelihood of multiple scrapers jamming the query.
###Outputs
Outputs two .csv
s: threads.csv
and comments.csv
. Both CSVs have no headers. The meaning of the columns are:
####Threads
['text', 'title', 'url', 'id', 'subreddit', 'meta', 'time', 'author', 'ups', 'downs', 'authorlinkkarma', 'authorcommentkarma', 'authorisgold']
text
: text of the threadtitle
: title of the threadurl
: url of the threadid
: unique ID of the threadsubreddit
: subreddit that the thread belongs tometa
: meta tag assigned to the subreddit of the thread inconfig.json
time
: timestamp of the threadauthor
: username of the author of the threadups
: number of ups the thread has receiveddowns
: number of downs the thread has receivedauthorlinkkarma
: the author's link karmaauthorcommentkarma
: the author's comment karmaauthorisgold
:1
if the author has gold status,0
otherwise
####Comments
['text', 'id', 'subreddit', 'meta', 'time', 'author', 'ups', 'downs', 'authorlinkkarma', 'authorcommentkarma', 'authorisgold']
text
: text of the commentid
: unique ID of the commentsubreddit
: subreddit that the thread belongs tometa
: meta tag assigned to the subreddit of the thread inconfig.json
time
: timestamp of the threadauthor
: username of the author of the threadups
: number of ups the thread has receiveddowns
: number of downs the thread has receivedauthorlinkkarma
: the author's link karmaauthorcommentkarma
: the author's comment karmaauthorisgold
:1
if the author has gold status,0
otherwise
All text is normalized to lower case, tokenized using a TreebankTokenizer from natural, then joined with spaces. This results in punctuation being separated from words, a desired effect.