Patrik Wikström & Brenda Moon
- Introduction
- Step 1. Extract album title from a single item on a single page
- Step 2. Extract all album titles on a single page
- Step 3. Extract all review data from a single page
- Step 4. Structure the data extraction as a function
- Step 5. Store the data in a dataframe and save to disk
- Step 6. Restructure the code for clarity
- Step 7. Plotting, tiny stat analysis and improved I/O
- Final. Support for multiple pages
This workshop uses Python 3 and the Python modules listed below.
One of the easiest ways to install Python and the packages relevant for web scraping is to use Anaconda, developed by Continuum Analytics. Note, however, that Python is available in two versions, 2.7 and 3.4. We will be using Python 3.4, so make sure you download the correct one.
Anaconda can be downloaded from https://www.continuum.io/downloads
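Once Anaconda is installed, a quick way to confirm that the Python 3 interpreter is the one being used is to check the version from within Python. The following is a minimal sketch; the helper name `is_python3` is hypothetical and not part of the workshop code.

```python
import sys

def is_python3(version_info=sys.version_info):
    """Return True if the given version tuple belongs to Python 3."""
    return version_info[0] == 3

# Show the version of the interpreter that is running this code.
print("Running Python", sys.version.split()[0])

if not is_python3():
    print("This workshop expects Python 3; please install the Python 3 version of Anaconda.")
```

Running this in a notebook cell or at the Python prompt prints the interpreter version, plus a warning if the interpreter is not Python 3.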
We use three Python modules in this workshop. The full documentation for each is available on their websites:
- Requests - get webpages from URLs
- BeautifulSoup - select text out of webpages
- Pandas - Python data analysis library
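Before starting, it can help to check that all three modules are installed. The sketch below uses only the standard library to look each module up without importing it. It assumes BeautifulSoup version 4, which is imported under the name `bs4`; the helper `missing_modules` and the list `WORKSHOP_MODULES` are hypothetical names introduced for illustration.

```python
import importlib.util

# Import names of the three workshop modules.
# Assumption: BeautifulSoup 4, which is imported as "bs4".
WORKSHOP_MODULES = ["requests", "bs4", "pandas"]

def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(WORKSHOP_MODULES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All workshop modules are available.")
```

If any module is reported as not installed, it can usually be added through Anaconda's package manager before continuing.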
These pages are written as a Jupyter Notebook.