Automated web data gathering¶

Intro and Objectives¶

We’ll look at several approaches to getting data from websites, including automated downloading, web scraping of HTML, and using web APIs.

Readings¶

Scraping for Craft Beers - A Dataset Creation Tutorial - BeautifulSoup and Beer. This blog post has some very important information that you should read before getting started on web scraping.

Downloads and other resources¶

downloads_webdata.zip

Activities¶

In the Downloads file you’ll find a Jupyter notebook entitled index.ipynb. This notebook has descriptions of, and links to, the notebooks listed below, each of which demos some aspect of web scraping.

You only need to work through the notebooks on topics you want to learn about. Some of them may require you to pip install a library first.

Here’s what the index notebook contains:

Introduction and preview¶

Jupyter notebook: getting_data_web.ipynb

overview of web scraping methods

strategies

be nice

preview of the web scraping tasks we’ll do

SCREENCAST: Overview of web scraping (6:01)
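
As a rough illustration of what "be nice" can look like in code, here is a minimal sketch that checks a site's robots.txt rules and picks a delay between requests. It uses only the standard library's urllib.robotparser. The robots.txt content, the user agent name, and the example.com URLs are all made up for illustration.

```python
import time
from urllib import robotparser

# Hypothetical robots.txt content. A real scraper would fetch it with
# parser.set_url("https://example.com/robots.txt") followed by parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, urls, user_agent="my-course-scraper"):
    """Return only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay(parser, user_agent="my-course-scraper", default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay if it gives one."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


parser = build_parser(ROBOTS_TXT)
urls = [
    "https://example.com/data/page1.html",
    "https://example.com/private/secret.html",
]
for url in allowed_urls(parser, urls):
    print("OK to fetch:", url)
    time.sleep(0)  # in a real scraper, sleep for polite_delay(parser) seconds
```

Checking robots.txt is only one part of being a good citizen; a site's terms of use may add further restrictions.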

Automated downloading of data files¶

Jupyter notebook: downloading_files.ipynb

csv and json data

data from analytics.usa.gov

SCREENCAST: Downloading csv and json files (4:05)
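
To give a feel for automated downloading, here is a minimal sketch using only the Python standard library. It is not the code from the notebook. The sample CSV and JSON payloads and the example.com URL are made up, and the parsing is shown on in-memory text so the example runs without a network connection.

```python
import csv
import io
import json
from urllib.request import urlopen


def fetch_text(url, timeout=30):
    """Download a URL and return its body decoded as UTF-8 text."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_csv(text):
    """Turn CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def parse_json(text):
    """Turn JSON text into Python objects (dicts, lists, and so on)."""
    return json.loads(text)


# Made-up sample payloads standing in for downloaded files.
sample_csv = "domain,visits\nexample.gov,120\nexample.org,45\n"
sample_json = '{"name": "top-domains", "data": [{"domain": "example.gov", "visits": 120}]}'

rows = parse_csv(sample_csv)
report = parse_json(sample_json)
print(rows[0]["domain"], rows[0]["visits"])
print(report["data"][0]["visits"])

# With a real file you would call something like:
# rows = parse_csv(fetch_text("https://example.com/some-report.csv"))
```

Note that the csv module returns every value as a string, so numeric columns need converting before analysis.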

Intro to web scraping with Beautiful Soup¶

Jupyter notebook: web_scraping_beautifulsoup.ipynb

BeautifulSoup, a very widely used web scraping package

scraping links out of one of my course websites

SCREENCAST: Intro to Beautiful Soup (18:12)
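
Here is a small sketch of scraping links with BeautifulSoup (you may need to pip install beautifulsoup4). The HTML snippet and the example.com base URL are made up and stand in for a downloaded course page. This is one reasonable way to do it, not necessarily how the notebook does it.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded course page.
HTML = """
<html><body>
  <h1>Course schedule</h1>
  <a href="/notes/week1.html">Week 1 notes</a>
  <a href="https://example.com/data.csv">Data file</a>
  <a name="top">No href here</a>
</body></html>
"""


def extract_links(html, base_url):
    """Return (link text, absolute URL) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # urljoin resolves relative hrefs against the page's base URL.
        links.append((a.get_text(strip=True), urljoin(base_url, a["href"])))
    return links


links = extract_links(HTML, "https://example.com/course/")
for text, url in links:
    print(text, "->", url)
```

Passing href=True to find_all skips anchor tags that have no href attribute, such as the named anchor in the sample.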

Web scrape html table of basketball stats into pandas DataFrame¶

Jupyter notebook: scrape_nba_playerstats.ipynb

data used by fantasy basketball leagues

get data out of html table and into pandas

some basic data wrangling in pandas to deal with player trades

SCREENCAST: Scraping NBA data from html table (8:49)
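
The general idea of pulling an HTML table into pandas and then handling traded players can be sketched as follows. The table contents, the column names, and the convention of a "TOT" row holding a traded player's combined stats are all assumptions made for illustration; check how your actual source reports trades. It needs the beautifulsoup4 and pandas packages.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Made-up table. The "TOT" row for a traded player is an assumption about
# how the source site reports combined stats across teams.
HTML = """
<table id="stats">
  <tr><th>Player</th><th>Tm</th><th>PTS</th></tr>
  <tr><td>Player A</td><td>TOT</td><td>900</td></tr>
  <tr><td>Player A</td><td>AAA</td><td>500</td></tr>
  <tr><td>Player A</td><td>BBB</td><td>400</td></tr>
  <tr><td>Player B</td><td>CCC</td><td>750</td></tr>
</table>
"""


def table_to_dataframe(html, table_id, numeric_cols=()):
    """Read the table with the given id into a DataFrame."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find("table", id=table_id).find_all("tr")
    header = [th.get_text(strip=True) for th in rows[0].find_all("th")]
    data = [[td.get_text(strip=True) for td in row.find_all("td")] for row in rows[1:]]
    df = pd.DataFrame(data, columns=header)
    for col in numeric_cols:
        df[col] = pd.to_numeric(df[col])  # scraped cells arrive as strings
    return df


def one_row_per_player(df):
    """Keep the TOT row for traded players and the single row for everyone else."""
    is_total = df["Tm"] == "TOT"
    traded_players = df.loc[is_total, "Player"]
    keep = is_total | ~df["Player"].isin(traded_players)
    return df[keep].reset_index(drop=True)


stats = table_to_dataframe(HTML, "stats", numeric_cols=["PTS"])
season = one_row_per_player(stats)
print(season)
```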

Scrape financial data from SEC XBRL pages¶

Jupyter notebook: pcda_xbrl_soup.ipynb

tons of financial data available via EDGAR

XBRL provides structure to financial statements

use BeautifulSoup to find financial elements of interest from 10-K filings

Using Scrapy and XPath for web scraping¶

Jupyter notebook: etsy_xpath_scrapy.ipynb

XPath selectors provide a powerful way to navigate the DOM of web pages

we’ll scrape product info from Etsy

SCREENCAST: Using XPath to scrape Etsy data (11:24)
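
To show what XPath selectors look like without installing anything, here is a sketch that uses the limited XPath subset built into the standard library's xml.etree.ElementTree. This is a swapped-in technique, not Scrapy, and it only works on well-formed markup. The page snippet, class names, and product data are made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for a product listing page.
PAGE = """
<html><body>
  <div class="listing">
    <h3 class="title">Ceramic mug</h3>
    <span class="price">18.00</span>
  </div>
  <div class="listing">
    <h3 class="title">Wool scarf</h3>
    <span class="price">32.50</span>
  </div>
  <div class="ad"><h3 class="title">Sponsored</h3></div>
</body></html>
"""


def scrape_products(page):
    """Return a list of {title, price} dicts, one per listing block."""
    root = ET.fromstring(page.strip())
    products = []
    # .//div[@class='listing'] selects every <div>, at any depth,
    # whose class attribute is exactly "listing".
    for listing in root.findall(".//div[@class='listing']"):
        # Paths inside the loop are relative to the current listing element.
        title = listing.findtext("h3[@class='title']")
        price = listing.findtext("span[@class='price']")
        products.append({"title": title, "price": float(price)})
    return products


products = scrape_products(PAGE)
for product in products:
    print(product["title"], product["price"])
```

The pattern of selecting a repeating container first and then using relative paths inside it carries over to full XPath engines such as the one in Scrapy, which also cope with messy real-world HTML.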

Using web APIs¶

Jupyter notebook: web_api.ipynb

web APIs provide a powerful, and usually easier, way to get data from websites

many sites provide APIs for developers, data journalists, people like us

many APIs require an API key to use

websites change, so code that works today may break later

examples are from eBird and the NY Times
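
The basic pattern for calling a web API that needs a key can be sketched as below, using only the standard library. The endpoint, the query parameter names, the X-Api-Key header name, and the sample response are all hypothetical; each real API documents its own URLs and its own way of passing the key.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/v1/observations"


def build_url(base_url, params):
    """Append URL-encoded query parameters to a base URL."""
    return f"{base_url}?{urlencode(params)}"


def call_api(url, api_key, timeout=30):
    """Send a GET request with the key in a header and decode the JSON body."""
    # The header name is hypothetical; some APIs expect the key as a
    # query parameter instead.
    request = Request(url, headers={"X-Api-Key": api_key})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


url = build_url(BASE_URL, {"region": "US-MI", "max_results": 5})
print(url)

# Made-up response body standing in for what call_api(url, key) might return.
sample_body = '[{"species": "Blue Jay", "count": 3}, {"species": "Mallard", "count": 12}]'
records = json.loads(sample_body)
total = sum(record["count"] for record in records)
print(total)
```

It is a good habit to keep the key out of your notebook, for example by reading it from an environment variable, so that it does not end up in a shared file.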

Web APIs + data wrangling with pandas¶

Jupyter notebook: web_api_usgs.ipynb

even with web APIs, data wrangling often needed to get data in shape for analysis

we’ll get stream flow data for the Paint Creek from the USGS

two different approaches for dealing with a tricky header section and getting into a pandas DataFrame

time series plot of discharge rate using Seaborn

SCREENCAST: Using web APIs (24:17)
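
One possible way to handle a tricky header section is sketched below; it is not necessarily either of the two approaches in the notebook. It assumes a tab-delimited download in which comment lines start with #, followed by a column-name line and then a line of field-format codes that must be skipped. The sample text, column names, and site number are made up. It needs pandas.

```python
import io

import pandas as pd

# Made-up text in the assumed layout: '#' comment lines, a column-name
# line, a line of field-format codes, then the data rows.
RAW = (
    "# Some descriptive header text\n"
    "# More header text\n"
    "agency_cd\tsite_no\tdatetime\tdischarge_cfs\n"
    "5s\t15s\t20d\t14n\n"
    "USGS\t00000000\t2020-01-01 00:00\t12.5\n"
    "USGS\t00000000\t2020-01-01 00:15\t12.9\n"
)


def read_tabbed_with_header(text):
    """Strip comment lines and the format-code line, then load into pandas."""
    lines = [line for line in text.splitlines() if not line.startswith("#")]
    # lines[0] holds the column names, lines[1] the format codes (dropped).
    cleaned = "\n".join([lines[0]] + lines[2:])
    return pd.read_csv(
        io.StringIO(cleaned),
        sep="\t",
        dtype={"site_no": str},  # keep leading zeros in the site number
        parse_dates=["datetime"],
    )


flow = read_tabbed_with_header(RAW)
print(flow.dtypes)
print(flow)
```

Cleaning the text before handing it to read_csv keeps the pandas call simple, at the cost of holding the whole file in memory as a list of lines.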

Using CSS selectors to explore government data metadata¶

Jupyter notebook: web_gov_datametadata_css.ipynb

CSS selectors provide yet another way of finding content of interest on a web page

both BeautifulSoup and lxml support CSS selectors
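
As a quick illustration of CSS selectors in BeautifulSoup, the sketch below uses the select and select_one methods on a made-up snippet. The element ids, class names, and dataset entries are invented and do not come from any real government data site.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a page that lists datasets.
HTML = """
<ul id="datasets">
  <li class="dataset"><a href="/d/1">Air quality</a><span class="format">CSV</span></li>
  <li class="dataset"><a href="/d/2">Bridge inspections</a><span class="format">JSON</span></li>
  <li class="note">Not a dataset</li>
</ul>
"""


def dataset_summaries(html):
    """Return (title, href, format) tuples for each dataset list item."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # "ul#datasets > li.dataset" means: <li> elements with class "dataset"
    # that are direct children of the <ul> whose id is "datasets".
    for item in soup.select("ul#datasets > li.dataset"):
        link = item.select_one("a[href]")
        fmt = item.select_one("span.format")
        results.append((link.get_text(strip=True), link["href"], fmt.get_text(strip=True)))
    return results


summaries = dataset_summaries(HTML)
for title, href, fmt in summaries:
    print(title, href, fmt)
```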

Explore (OPTIONAL)¶

Chapters 11-13 of Data Wrangling with Python (Kazil and Jarmul) do a really nice job of introducing the various ways of getting data from the web using Python.
Katharine Jarmul’s and Jackie Kazil’s GitHub pages - Lots of good web scraping related repos.