Reddit Scraping
For this exercise, we’re going to practice some of the processes that you’ll need to start making use of open source tools. Now that we’ve started getting comfortable with the command line, we’re ready to begin using command line tools, which allow for much more robust data scraping than our Google Sheets examples.
We’re going to install a Reddit Scraper that runs on Python, the language that powers most data tools. You’ll need the burner Reddit account we discussed in the pre-institute materials. We’ll also be installing Python tools that you’ll be using again when we discuss advanced Twitter data scraping later this week. If you are on a Mac, you’ll need to go through some extra steps to get configured: see the Mac supplement in this module.
First, you’ll need Python on your system. Start by installing Anaconda from the Individual Edition installers – https://www.anaconda.com/products/individual – and select the Python 3.7 installer that matches your operating system.
Don’t change any of the options; just accept the defaults. Be patient – this might take a while.
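If you want to confirm the install worked before moving on, open the Anaconda PowerShell Prompt (or Terminal on a Mac) and check the version numbers. This is just a quick sanity check – the exact numbers on your machine will differ:

conda --version      # should print something like "conda 4.x.x"
python --version     # should report a Python 3.7.x build if you chose the 3.7 installer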
The Stage One video walks through these steps:
- We’re going to start by cloning a repository using Git Bash, the same tool we used previously to share our research question. This time, you’re going to work with someone else’s code, a Universal Reddit Scraper: https://github.com/JosephLai241/Universal-Reddit-Scraper – notice how the README at the bottom of the page documents how to use it.
- Start by opening Git Bash, navigating to the place you want to work using the directory commands from last time, and typing git clone https://github.com/JosephLai241/Universal-Reddit-Scraper.git (the full sequence is sketched just after this list).
- Open Visual Studio Code and open the Universal-Reddit-Scraper folder you just cloned. If this is your first time opening a Python file, you’ll be prompted to install an extension: this will color-code the file to make it easier to read. You’ll need to replace the placeholders on lines 27-31 of scraper.py; for that, you need a Reddit app.
- Log into the account you created previously, go to https://old.reddit.com/prefs/apps and create an app called “Reddit Scraper”. Select the “script” option, since this will be for personal use. It doesn’t need an about URL, but the redirect URI should be http://localhost:8080
- Select create app. Copy the 14-character code under “personal use script” into the spot on line 27, and the secret into line 28, between the quotation marks. Set the app name to Reddit Scraper, and add your Reddit username and password on lines 30 and 31. Make sure to save scraper.py again when you’re done.
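If it helps to see Stage One’s command-line portion in one place, here is a sketch of the Git Bash session (the Documents folder is just an example – use whatever working folder you chose):

cd ~/Documents                    # or wherever you want the project to live
git clone https://github.com/JosephLai241/Universal-Reddit-Scraper.git
cd Universal-Reddit-Scraper
ls                                # you should see scraper.py and requirements.txt listed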
The Stage Two video walks through the rest of the process:
- Now we need to open the “Anaconda PowerShell Prompt” from our earlier install. Once in, use the commands from last time to navigate to the “Universal-Reddit-Scraper” folder you created. We’ll need to install the requirements – type:
pip install -r requirements.txt
- Once in the folder, type python ./scraper.py -h
“python” tells your system to use Python to run the file, “./scraper.py” tells it which file, and “-h” indicates we want to see the help text.
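Putting those steps together, the Anaconda PowerShell session looks roughly like this (the path is just an example – use wherever you cloned the repository):

cd ~/Documents/Universal-Reddit-Scraper   # navigate into the cloned folder
pip install -r requirements.txt           # install the Python packages the scraper needs
python ./scraper.py -h                    # print the scraper's built-in help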
- The help output provides a breakdown of the available commands and options. Here’s an example to try:
python ./scraper.py -r kotakuinaction2 T 50 --json
Note the syntax: “python ./scraper.py” is the same as before. “-r” indicates we are using the tool to scrape subreddits. “kotakuinaction2” is the name of the subreddit we want. “T” indicates we want top posts. “50” is the number of posts, and “--json” sets the output format.
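To get a feel for varying the arguments, here are a couple of variations that stay within the syntax shown above (the subreddit names and counts are just examples; check the README for the other category letters and output options it documents):

python ./scraper.py -r AskHistorians T 25 --json    # top 25 posts from r/AskHistorians, saved as JSON
python ./scraper.py -r news T 100 --json            # top 100 posts from r/news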
Go back to Visual Studio Code – if your command succeeded, you’ll now see a .json file with the results. Now try reading the documentation on GitHub and writing your own command to get data you can use.
PRO TIP: Want to run a variation on the query you just ran without typing the whole thing again? Press the up arrow to bring back your previous command – you can step back through multiple commands this way – then edit it into the query you want to run now.
Take a screenshot of your final .json file and submit it for credit.

