Skip to content

Latest commit

 

History

History
24 lines (14 loc) · 1.42 KB

README.md

File metadata and controls

24 lines (14 loc) · 1.42 KB

real-time-page-tracker

This repo contains the code used to run this service: https://covid-data.wmflabs.org/

The main components are:

  • PageCrawler.py: Giving a set of Wikidata Item as seeds - where len(set(seeds) >0- discover related wikidata items and using the sitelinks returns a list of pages related with the seed(s). The output is written and sqlite db, in two tables: ** pagesPerProjectTable: Information about the Wikipedia articles, such as: project,page,url,wikilink,wikidataItem ** itemsInfoTable: Information about the Wikidata Items and their relation with the seeds.

  • getEdits.py: A rudimentary crawler to count the number of edits in each of the pages in pagesPerProjectTable. Save all edits with user,timestamp,page_title,project in 'revisions' table in sqlite.

  • app.py: It is Flask server to run https://covid-data.wmflabs.org/. It offers some endpoinds to download the data in JSON, and also some visualizations and statistics about the data.

To understand the methodology behind the Crawler please follow this notebook.

TODOs:

  • Replace/improve geEdits.py with direct connection to the Wiki Replicas.
  • Add more interactive visualizations,
  • Move the seeds list to a configuration file.
  • Clean the code ;)