Introduction to Web Scraping using Python

QUT Digital Media Research Centre

Patrik Wikström & Brenda Moon

Notebooks

Introduction
Step 1. Extract album title from a single item on a single page
Step 2. Extract all album titles on a single page
Step 3. Extract all review data from a single page
Step 4. Structure the data extraction as a function
Step 5. Store the data in a dataframe and save to disk
Step 6. Restructure the code for clarity
Step 7. Plotting, tiny stat analysis and improved I/O
Final. Support for multiple pages

Installation

This workshop uses Python 3 and the Python modules listed below.

One of the easiest ways to install Python and the packages relevant to web scraping is to use Anaconda, developed by Continuum Analytics. Note, however, that Python is available in two versions, 2.7 and 3.4. We will be using Python 3.4, so make sure you download the correct one.

Anaconda can be downloaded from https://www.continuum.io/downloads
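
Once Python is installed, a short check like the one below can confirm that the interpreter is Python 3 and that the workshop modules can be found. This is only a sketch: the helper names `missing_modules` and `is_python3` are made up for illustration, and the import names (`requests`, `bs4`, `pandas`) are assumed to be the usual ones for the three modules used in the workshop.

```python
import sys
from importlib.util import find_spec

# Import names assumed for the workshop modules.
# Note that BeautifulSoup is imported as "bs4".
REQUIRED_MODULES = ["requests", "bs4", "pandas"]


def missing_modules(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


def is_python3():
    """Return True when the running interpreter is a Python 3 release."""
    return sys.version_info.major == 3


print("Python version:", sys.version.split()[0])
print("Running Python 3:", is_python3())
print("Missing modules:", missing_modules(REQUIRED_MODULES) or "none")
```

If any module is reported as missing, it needs to be installed before the notebooks will run.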

Python modules used in this workshop

We use three Python modules in this workshop. The full documentation for each is available on its website:

Requests - gets webpages from URLs
BeautifulSoup - selects text out of webpages
Pandas - Python data analysis library
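
To give a feel for how the three modules fit together, here is a minimal sketch. It is not taken from the workshop notebooks: the HTML snippet, the class names (`review`, `album`, `score`) and the function names `fetch_html` and `parse_reviews` are all hypothetical, chosen so the parsing part runs without network access.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical review page, inlined so the example runs offline.
SAMPLE_HTML = """
<html><body>
  <div class="review"><h2 class="album">First Album</h2><span class="score">7.5</span></div>
  <div class="review"><h2 class="album">Second Album</h2><span class="score">8.0</span></div>
</body></html>
"""


def fetch_html(url):
    """Download a page with Requests and return its HTML as text."""
    import requests  # imported here so the offline part below runs without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # raise an error on 4xx/5xx responses
    return response.text


def parse_reviews(html):
    """Use BeautifulSoup to pull the album title and score out of each review block."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for review in soup.find_all("div", class_="review"):
        rows.append({
            "album": review.find("h2", class_="album").get_text(strip=True),
            "score": float(review.find("span", class_="score").get_text(strip=True)),
        })
    return rows


# Pandas turns the list of dictionaries into a table, one row per review.
reviews = pd.DataFrame(parse_reviews(SAMPLE_HTML))
print(reviews)
```

For a real page, the string returned by `fetch_html(url)` would be passed to `parse_reviews` in place of `SAMPLE_HTML`, with the tag and class names adjusted to match that page's markup.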

These pages are Jupyter notebooks.

