I am working on a little side project that involves mining Reddit data. The script fetches a listing of all posts on different subreddits and copies the data to a Google spreadsheet for further analysis (more on the project later).
Reddit, unlike most websites, allows web scraping as long as crawler scripts make no more than one request every two seconds to the Reddit servers (see rules). You don’t even need a developer account or an API key to scrape Reddit.
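To make the two-second rule concrete, here is a rough sketch of how a script could work out how long to pause before its next request. The names `MIN_GAP_MS` and `waitTimeMs` are made up for this illustration, and the code assumes the newer V8 runtime in Apps Script.

```javascript
// Minimum gap between requests, taken from the two-second rule above.
const MIN_GAP_MS = 2000;

/**
 * Returns how many milliseconds to pause before the next request.
 * lastRequestAt and now are timestamps in milliseconds (e.g. Date.now());
 * pass null for lastRequestAt if no request has been made yet.
 */
function waitTimeMs(lastRequestAt, now, minGapMs = MIN_GAP_MS) {
  if (lastRequestAt === null) return 0; // first request: nothing to wait for
  const elapsed = now - lastRequestAt;
  return Math.max(0, minGapMs - elapsed); // never return a negative pause
}

// Inside Apps Script, the pause itself could be done with Utilities.sleep:
//   Utilities.sleep(waitTimeMs(lastRequestAt, Date.now()));
```

Keeping the calculation separate from the actual sleep call makes it easy to check the numbers without waiting around.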
There are popular tools like wget, SiteSucker (Mac) or HTTrack Website Copier (Windows) that can download entire websites for offline use, but they are mostly useless for scraping Reddit data since the site doesn’t use page numbers and the content of pages is constantly changing. A post may be listed on the first page of a subreddit but could find itself on the third page the next second as other posts are voted to the top.
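Instead of page numbers, Reddit’s JSON listings hand back an `after` cursor that you pass along to get the next batch of posts. The sketch below shows the idea. The function names and the sample data are invented for illustration, and it assumes you are reading the `new` listing at 100 posts per request.

```javascript
// Builds the URL for one batch of posts. Pass null as `after` for the first batch.
function buildListingUrl(subreddit, after) {
  let url = "https://www.reddit.com/r/" + encodeURIComponent(subreddit) +
            "/new.json?limit=100";
  if (after) url += "&after=" + encodeURIComponent(after);
  return url;
}

// Pulls the interesting fields and the next cursor out of a parsed listing.
function readListing(listing) {
  const posts = listing.data.children.map(function (child) {
    return { title: child.data.title, url: child.data.url, score: child.data.score };
  });
  return { posts: posts, after: listing.data.after };
}

// Hand-made sample in the shape of a Reddit listing (the values are made up).
const sampleListing = {
  kind: "Listing",
  data: {
    after: "t3_example2",
    children: [
      { kind: "t3", data: { title: "First tip", url: "https://example.com/1", score: 10 } },
      { kind: "t3", data: { title: "Second tip", url: "https://example.com/2", score: 7 } }
    ]
  }
};

const firstBatch = readListing(sampleListing);
const nextUrl = buildListingUrl("LifeProTips", firstBatch.after);
```

When `after` comes back as `null`, there is no next batch to request.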
While there exist PHP and Python libraries for scraping Reddit, they are too complicated for non-techies. Fortunately, there’s always Google Apps Script to the rescue. Here’s what you can do to pull data from any subreddit automatically.
- Open the Google Sheet and choose File -> Make a copy to copy the sheet to your Google Drive.
- Go to Tools -> Script editor and copy-paste the Reddit Scraper Script. You can change “LifeProTips” to any other subreddit name.
- While in the script editor, choose Run -> Run and authorize the script.
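If you are curious what such a script might look like inside, here is a minimal sketch. It is not the Reddit Scraper Script itself. The function names, the column layout and the choice of the `new` listing are all assumptions made for illustration. It fetches a single batch of posts and appends them to the active sheet.

```javascript
// Hypothetical sketch, not the actual Reddit Scraper Script.
const SUBREDDIT = "LifeProTips"; // change to any other subreddit name

// Fetches one batch of posts; returns the listing data, or null on an HTTP error.
function fetchPage(subreddit, after) {
  let url = "https://www.reddit.com/r/" + subreddit + "/new.json?limit=100";
  if (after) url += "&after=" + after;
  const response = UrlFetchApp.fetch(url, { muteHttpExceptions: true });
  if (response.getResponseCode() !== 200) return null;
  return JSON.parse(response.getContentText()).data;
}

// Writes one row per post below the last filled row of the sheet.
// Assumed columns: date, title, URL, score.
function appendPosts(sheet, children) {
  if (children.length === 0) return;
  const rows = children.map(function (child) {
    const post = child.data;
    return [new Date(post.created_utc * 1000), post.title, post.url, post.score];
  });
  sheet.getRange(sheet.getLastRow() + 1, 1, rows.length, rows[0].length)
       .setValues(rows);
}

// Entry point: this is the function you would pick under Run in the editor.
function scrapeOnePage() {
  const sheet = SpreadsheetApp.getActiveSheet();
  const page = fetchPage(SUBREDDIT, null);
  if (page) appendPosts(sheet, page.children);
}
```

The first time you run a function like this, Apps Script asks for permission to fetch external URLs and to edit the spreadsheet, which is the authorization step mentioned above.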
That’s it. The script will run in the background, automatically pulling content from Reddit into the Google spreadsheet, and it stops automatically once all the posts* of that subreddit have been fetched.
[*] Every subreddit displays a maximum of 1,000 posts; you can’t go beyond that number even while manually browsing a subreddit.
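The stopping rule can be sketched as a simple loop: keep asking for the next batch until Reddit stops handing back a cursor or the cap is reached. This is only an illustration. `collectPosts`, `getPage` and `MAX_POSTS` are hypothetical names, and `getPage` stands in for whatever function actually fetches a batch and returns `{ posts, after }`.

```javascript
const MAX_POSTS = 1000; // the listing cap mentioned in the footnote

/**
 * Collects posts batch by batch.
 * getPage(after) must return { posts: [...], after: cursorOrNull }.
 */
function collectPosts(getPage, maxPosts = MAX_POSTS) {
  const posts = [];
  let after = null; // null asks for the first batch
  while (posts.length < maxPosts) {
    const page = getPage(after);
    if (page.posts.length === 0) break; // nothing more to read
    posts.push(...page.posts);
    after = page.after;
    if (!after) break; // no cursor means this was the last batch
  }
  return posts.slice(0, maxPosts); // trim any overshoot from the final batch
}
```

A real run would also pause between batches to stay within the request limit.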