This tutorial gives a short, practical introduction to web scraping. It is not meant to be an extensive or by any means complete guide to the topic. If you are interested in a more detailed treatment, I invite you to take a look at the other excellent tutorials available online.
Web scraping is the process of extracting data from websites or other online sources and copying it into a structured form (e.g., a database) that enables further retrieval and analysis.
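To make the definition concrete, here is a minimal sketch that turns an HTML table into a list of records. It uses only Python's standard library (`html.parser`) rather than the two methods covered later in this tutorial. The HTML snippet, the town names, the numbers, and the names `TableRowParser` and `scrape_table` are all made up for illustration.

```python
from html.parser import HTMLParser

# Placeholder HTML standing in for a downloaded page.
# The towns and population figures are invented, not real data.
SAMPLE_HTML = """
<table>
  <tr><th>Town</th><th>Population</th></tr>
  <tr><td>Town A</td><td>1,200</td></tr>
  <tr><td>Town B</td><td>3,450</td></tr>
</table>
"""


class TableRowParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []         # finished rows, each a list of cell strings
        self._row = None       # row currently being read, or None
        self._in_cell = False  # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def scrape_table(html):
    """Return one dict per data row, keyed by the header cells."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


records = scrape_table(SAMPLE_HTML)
print(records)
```

Each record is a plain dictionary, so the unstructured markup has become data that can be filtered, sorted, or stored. Note that the population values are still strings at this point; converting them to numbers is part of the cleaning step.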
In this tutorial, we are going to extract demographic information (e.g., country, state and population) about Colombian towns from Wikipedia.
The tutorial is written in Python and uses two of the many available methods for pulling the data: Beautiful Soup and Pandas.
The tutorial is divided into the following four sections:
- Section 1: Method Beautiful Soup
- Section 2: Method Pandas
- Section 3: Structuring and cleaning the data
- Section 4: Data saving
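As a rough preview of how these steps fit together, the sketch below parses a small inline HTML table with Beautiful Soup, loads it into a Pandas DataFrame, cleans the population column, and writes the result as CSV. It is an illustration under stated assumptions, not the code used in the sections themselves: the HTML, the town and state names, and the numbers are placeholders, and the CSV is written to an in-memory buffer instead of a file.

```python
import io

import pandas as pd
from bs4 import BeautifulSoup

# Placeholder HTML standing in for a Wikipedia table.
# The towns, states and population figures are invented, not real data.
SAMPLE_HTML = """
<table>
  <tr><th>Town</th><th>State</th><th>Population</th></tr>
  <tr><td>Town A</td><td>State X</td><td>1,200</td></tr>
  <tr><td>Town B</td><td>State Y</td><td>3,450</td></tr>
</table>
"""

# Pull the data: read every row of the first table with Beautiful Soup.
# "html.parser" is Python's built-in parser, so no extra parser is required.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("table")
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]

# Structure the data: the first row holds the column names.
df = pd.DataFrame(rows[1:], columns=rows[0])

# Clean the data: drop thousands separators and convert to integers.
df["Population"] = (
    df["Population"].str.replace(",", "", regex=False).astype(int)
)

# Save the data: written to an in-memory buffer here;
# passing a file path to to_csv would write to disk instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Pandas can also read HTML tables directly with `pandas.read_html`, which returns a list of DataFrames; it relies on an additional parser library such as lxml being installed.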
To run the tutorial, you can download the notebook or the plain Python version (.py) to your local machine and run it there. Alternatively, you can run the tutorial's notebook online via Binder or Google Colab by clicking the badges at the top of this page.
If you have any questions or comments, don't hesitate to post them in the issues.