Advanced Data Scraping
Now it’s time to see the work you’ve done setting up your computer for Python pay off! There are lots of open source scrapers available at varying levels of complexity, and once your system is configured, it’s much easier to try out different tools. In this exercise, we’re going to put that into action: below are a few pre-selected scrapers with simple configurations that will give you new options for collecting data. For this exercise, we are deliberately not providing a walkthrough video; instead, there are a few tips to get you started with each. Use our tips and the documentation from the repositories to work your way through challenges as they arise – and of course, we’re always around for questions on #help-tech.
This process of picking up new command line tools will get easier with practice, but the most important skill you can learn is problem-solving as you go. Here are a few tips to help with general problems that might arise:
– Read your error messages! Whenever you enter a command, you’ll get a response from the system. If that response includes words like “error” or “failure,” you can copy the message and google it to figure out what might be causing it. Common problems with these types of tools include authentication errors (usually caused by having the wrong credentials stored in the script or tool); permissions errors (which come from writing to a directory you don’t have permission for, and can often be solved by moving to a user folder, such as your Documents folder, or using the “Run as Administrator” option covered in the last video); and requirements errors, which come up when you try to run a script without first installing the libraries it depends on (see the example after this list).
– Recheck the documentation. The GitHub repository for a project might link to tutorials or video guides that can help you get started on larger projects. Make sure to read everything provided when you’re trying to solve a problem – often, it’ll be caused by a missed step in setup or configuration. Don’t dive into technical tutorials out of order; that can cause a lot of frustration.
– Google and Stack Overflow are your friends. The more specific your question, the more likely you are to find someone with a similar problem getting help on Stack Overflow, a community-driven site for code support. This can be especially helpful for general configuration problems that come from accidental installations or differing operating system requirements.
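For example, a requirements error usually names the missing library right in the message, and that name is exactly what you hand to pip. Here’s what that exchange might look like (the script name is a made-up example, but the error format is what Python actually prints):

    python scraper.py
    ModuleNotFoundError: No module named 'requests'
    pip install requests

Once the install finishes, running the script again should get you past that particular error.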
Choose one of the following tools, give it a try, and report back to #help-tech with any tricks or tips you’d offer others trying the same tool. Think about how you would explain its use to someone else – that’s a great way to know whether you’ve figured it out yourself:
Option One: Instagram Scraper
Go to: https://github.com/arc298/instagram-scraper and use the ReadMe for installation instructions. You’ll need to have followed yesterday’s tutorials so that “pip” commands work on your system, and we recommend using the Anaconda Powershell prompt. You’ll also need an Instagram account username and password to run the script. Before you try a test scrape, make a folder in a place where you don’t mind pulling in a lot of data, and move to that folder on the command line using the “cd” commands we’ve worked with so far (see the sketch below). Be patient – grabbing photos is data intensive.
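To give you a feel for the flow, here’s a rough sketch of a whole session. The account names are placeholders, and the -u/-p login flags match the ReadMe at the time of writing – double-check it for the current options:

    pip install instagram-scraper
    mkdir instagram-data
    cd instagram-data
    instagram-scraper accounttoscrape -u yourusername -p yourpassword

The downloaded photos land in a subfolder of whatever directory you run the command from, which is why we suggest making a dedicated folder first.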
Option Two: Archive of Our Own Scraper
Go to: https://github.com/radiolarian/AO3Scraper and read the ReadMe. You’ll need to clone the project as in yesterday’s example by moving to a new folder and using “git clone https://github.com/radiolarian/AO3Scraper.git”. Make sure to install all three dependencies using the pip install commands in the instructions before trying to run a search query. Depending on the number of fics you request, this can also take a while – be patient, and note that grabbing text is a two-step process: the first command identifies what you want, and the second pulls it down (see the sketch below).
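Here’s a sketch of the full sequence, assuming the dependency and script names below still match the repository’s ReadMe (they did at the time of writing, but confirm before running – and the “...” stands for the arguments the ReadMe describes, not something you type literally):

    git clone https://github.com/radiolarian/AO3Scraper.git
    cd AO3Scraper
    pip install requests
    pip install bs4
    pip install unidecode
    python ao3_work_ids.py ...      (step one: build a list of work IDs from a search)
    python ao3_get_fanfics.py ...   (step two: download the text for those IDs)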
Option Three: Facebook Scraper
Go to: https://github.com/kevinzg/facebook-scraper and read the ReadMe: this one is a simple, single install command. It is not designed to use credentials or to scrape private groups; instead, it focuses on public Facebook pages. To run it, use the command facebook-scraper followed by the name of the page you want to scrape – the “name” part of the page’s URL, facebook.com/name (see the sketch below). As with the other tools, the output will be saved in whatever directory you run the command from.
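A minimal sketch of a session might look like this. The page name is a placeholder, and the --filename and --pages options appeared in the ReadMe at the time of writing – confirm them before relying on them:

    pip install facebook-scraper
    facebook-scraper --filename page_posts.csv --pages 5 somepagename

This writes the scraped posts to page_posts.csv in your current directory.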
Submit two (2) screenshots of your data collection process for credit.

