Topic: Web scraping

What is web scraping

On the internet, data is read by a browser, which renders the data according to a specification. A website must therefore be machine-readable. HTML specifies the content and layout of web pages and is rendered by the browser into a reader-ready document.

|---------|                    linked /-----------\
|         | reads /-----------\  in   | Resources |
| Browser | <---- | HTML Page | <---- | json, jpg |
|         |       \-----------/       | csv ...   |
|---------|                           \-----------/
  • Web scraping means downloading the content of a URL like <uni-giessen.de> at regular intervals, not manually but with the help of a computer program.
  • Most of the time a scraper will save the content of the URL it just read into a file or a database.
  • It will either save the raw contents of the website or post-process them first. See the example below: when scraping a car park page to determine its occupancy and eventually plot it over time, we don't need to save the whole web page. Instead we post-process it and save only the data relevant for determining the occupancy, in this case only the lines starting with <li>Parkhaus.

Example HTML content of a German website for car parks:

<html>
  <head>
  </head>
  <body>
    <h1>Parkhäuser in Gießen</h1>
    <ul>
      <li>Parkhaus A - 30/50 belegt</li>
      <li>Parkhaus B - 35/60 belegt</li>
    </ul>
    <a href="data.csv">Data</a>
    <img src="graph.jpg" />
  </body>
</html>
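
A minimal post-processing sketch in Python (not part of the original notes): it assumes the page above was saved locally as parkhaus.html, keeps only the <li>Parkhaus lines and turns them into CSV rows, using only the standard library.

import re
from datetime import datetime, timezone

# Assumed local copy of the scraped page.
with open("parkhaus.html", encoding="utf-8") as f:
    html = f.read()

# Keep only the relevant lines, i.e. the <li>Parkhaus ... entries,
# and pull out the name, the occupied spaces and the capacity.
pattern = re.compile(r"<li>(Parkhaus .+?) - (\d+)/(\d+) belegt</li>")

timestamp = datetime.now(timezone.utc).isoformat()
for name, occupied, capacity in pattern.findall(html):
    # One CSV row per car park, ready to be appended to a time series file.
    print(f"{timestamp},{name},{occupied},{capacity}")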

How to build a web scraper

  • To download at regular intervals you need a computer that is turned on 24/7. You can either use:
    • cloud computing power, e.g. GitHub Actions, which is free of charge, or
    • your own server with a cron job. If you are not that tech-savvy, you can save the web page manually several times a day and try to analyze that, to grasp the gist of it.

If you just save the web page, no special program is needed. If you post-process your web page first, you need to write a bit of code to extract the relevant data. But most of the time you will have to process the HTML at least during the analysis phase.
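
As an illustration of the first approach, here is a minimal sketch in Python (not part of the original notes) that downloads a page and saves it under a timestamped file name; the URL is only a placeholder.

import urllib.request
from datetime import datetime, timezone

# Placeholder URL - replace with the page you actually want to archive.
URL = "https://example.org/parkhaeuser"

# Timestamped file name so repeated runs do not overwrite earlier snapshots.
stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%SZ")
with urllib.request.urlopen(URL, timeout=30) as response:
    raw = response.read()

with open(f"snapshot-{stamp}.html", "wb") as f:
    f.write(raw)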

Exploring the HTML contents of a web page is best done with the dev tools in your browser or with Jupyter Notebooks (see the previous notes).

Caveats

You should think about getting notified when something goes wrong. If you don't, various failures can go unnoticed and cause loss of data.

What are the possible failures?

  • The target website's content changes structurally, or the content moves to another URL structure
  • The remote server hosting the website bans your IP
  • Also possible, but often not probable: the URL itself changes

If you track the failures of your automated scraping, you can react to these kinds of failures easily.

If you don't want to depend on the structure of the web resource you are scraping, you can save the whole content instead of extracting specific XML paths. Depending on the size of the scraped contents, this naturally increases the time needed for post-processing during the analysis phase.
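
As an illustration of tracking failures, a small sketch in Python (not part of the original notes): it checks the HTTP status, verifies that an expected marker is still present in the page, and exits with a non-zero status so that a scheduler such as GitHub Actions or cron can notify you. The URL and marker are only placeholders.

import sys
import urllib.request
from urllib.error import HTTPError, URLError

URL = "https://example.org/parkhaeuser"  # placeholder URL
EXPECTED_MARKER = "Parkhaus"             # string we expect if the structure is unchanged

try:
    with urllib.request.urlopen(URL, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
except (HTTPError, URLError) as err:
    # Covers bans (e.g. HTTP 403), moved URLs (HTTP 404) and network problems.
    print(f"Request failed: {err}", file=sys.stderr)
    sys.exit(1)

if EXPECTED_MARKER not in html:
    # The page still loads, but its structure apparently changed.
    print("Expected marker not found - the page structure may have changed", file=sys.stderr)
    sys.exit(1)

print("Scrape OK")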

Note: HTML documents can be transformed to a plain text format via pandoc (see the notes from the last Hacky Hour).
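
A small sketch of such a conversion (not part of the original notes), calling pandoc from Python; it assumes pandoc is installed and that a local file page.html exists.

import subprocess

# Convert an HTML snapshot to plain text; the file names are only illustrative.
subprocess.run(
    ["pandoc", "page.html", "-f", "html", "-t", "plain", "-o", "page.txt"],
    check=True,
)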

Examples

  • German Railway-News scraper: this saves the contents of the rail news from the homepage of the German railways. The scraping method is based on the git-history pattern of Simon Willison (and was forked from his ca-fires repo), where he uses GitHub Actions as a free cloud computing resource to assemble time series data from git commits. It has about 4.7 k commits at the time of writing.

  • Traffic news from the publicly available website of the Hesse police. The dataset ends on the 16th of May, when the web server no longer allowed the GitHub IP to access the website.

The homepage of Simon Willison also has many more great examples.

Apart from that, I like the CCC talks (in German) "BahnMining" and "SpiegelMining" by David Kriesel, where he scrapes data from the website of the German railways and from one of the biggest German news outlets to visualize and draw conclusions from the publicly available data.

Use cases

  • Documenting changes of a static or dynamic web resource
  • Time series analysis of a static web resource over time
  • Enables users to see changes over time
  • Data analysis

Tasks you can solve with web scraping are

  • Transformation and cleaning of data
  • Data visualisation
  • Statistical modelling

What is data analysis

Speaker: Martin