A collection of scrapers to obtain documents from German parliaments and related public institutions.
To run a full scrape with the current legislative term:
memorious run <scraper_name>
By default, documents (usually PDFs) and their metadata (JSON files) are stored in ./data/<scraper_name>
All scrapers accept these options (unless otherwise mentioned in the detailed descriptions for each scraper) via env vars to filter the scraping:
- DOCUMENT_TYPES - major_interpellation or minor_interpellation (Große Anfrage / Kleine Anfrage)
- LEGISLATIVE_TERMS - an integer, refer to the detailed scraper description for possible values
- START_DATE - a date (isoformat) to scrape only documents published since this date
- END_DATE - a date (isoformat) to scrape only documents published until this date
For example, to scrape all minor interpellations for Bayern from the last (not current) legislative term but only since 2018:
DOCUMENT_TYPES=minor_interpellation LEGISLATIVE_TERMS=17 START_DATE=2018-01-01 memorious run by
By default, scrapers only execute requests and download documents they
have not seen before. To disable this behaviour, set
MEMORIOUS_INCREMENTAL=false
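For example, to force a full re-scrape of Brandenburg, ignoring previously seen documents:
MEMORIOUS_INCREMENTAL=false memorious run bb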
German state parliaments:
- bb - Landtag Brandenburg
- be - Abgeordnetenhaus Berlin
- bw - Landtag von Baden-Württemberg
- by - Bayerischer Landtag
- hh - Hamburgische Bürgerschaft
- he - Hessischer Landtag
- mv - Landtag Mecklenburg-Vorpommern
- ni - Landtag Niedersachsen
- rp - Landtag Rheinland-Pfalz
- st - Landtag von Sachsen-Anhalt
- th - Thüringer Landtag
Other scrapers:
- dip - Dokumentations- und Informationssystem für Parlamentsmaterialien - API
- parlamentsspiegel - Parlamentsspiegel (gemeinsames Informationssystem der Landesparlamente)
- sehrgutachten - Gutachten der Wissenschaftlichen Dienste
- vsberichte - Verfassungsschutzberichte des Bundes und der Länder
Landtag Brandenburg
memorious run bb
The scraper uses the starweb implementation via this form: https://www.parlamentsdokumentation.brandenburg.de/starweb/LBB/ELVIS/servlet.starweb?path=LBB/ELVIS/LISSH.web&AdvancedSearch=yes
Legislative terms: current 7, earliest 1
Unfortunately, Brandenburg returns no results with answers for the types "Kleine Anfrage" or "Große Anfrage", so the DOCUMENT_TYPES option is unusable.
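For example, to scrape an earlier term without any document type filter (the term number is only an illustration):
LEGISLATIVE_TERMS=6 memorious run bb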
Abgeordnetenhaus Berlin
memorious run be
The backend used to be starweb, but recently (2021-06-24) changed to something completely new, which could still be something starweb-related, but according to the urls it is called "portala". The scraper still requires some refining to work properly with the date / document_type options; for now, a START_DATE is always required to run.
The scraper sends some JSON that looks like an Elasticsearch query, built from a query template, via POST to this endpoint: https://pardok.parlament-berlin.de/portala/browse.tt.html
Although the new frontend looks fancy, that doesn't mean the service is performant: with too large queries (a date range longer than a few months) it will shut down and return a 502 error.
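For example, a run constrained to a short date range (the dates are only an illustration):
START_DATE=2021-05-01 END_DATE=2021-06-30 memorious run be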
Legislative terms: current 18, earliest 11
Document types: written_interpellation (both "Große" and "Kleine" Anfragen)
Landtag von Baden-Württemberg
memorious run bw
For convenience, the scraper directly uses the XHR request result from this base site: https://www.landtag-bw.de/home/dokumente/drucksachen.html
There is no explicit option for LEGISLATIVE_TERMS, but to filter for the actual terms of BW, you can use START_DATE and END_DATE ranges that match the terms.
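For example, to approximate the 16th legislative term (the boundary dates below are assumptions; verify them before relying on the results):
START_DATE=2016-05-11 END_DATE=2021-05-10 memorious run bw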
Document types: minor_interpellation, major_interpellation
Bayerischer Landtag
memorious run by
The scraper uses this result page: https://www.bayern.landtag.de/parlament/dokumente/drucksachen/?dokumentenart=Drucksache&anzahl_treffer=10
Legislative terms: current 18, earliest 1 (but useful metadata starts at 5 [1962-66])
Document types: minor_interpellation, major_interpellation
Hessischer Landtag
memorious run he
The scraper uses the starweb implementation via this form: http://starweb.hessen.de/starweb/LIS/servlet.starweb?path=LIS/PdPi.web
Legislative terms: current 20, earliest 14 (or: 8?) // TODO
Document types: minor_interpellation, major_interpellation
Hamburgische Bürgerschaft
memorious run hh
The scraper uses the parldok [5.4.1] implementation via this form: https://www.buergerschaft-hh.de/parldok/formalkriterien
Document types: minor_interpellation, major_interpellation
Legislative terms: current 22, earliest 16
Landtag Mecklenburg-Vorpommern
memorious run mv
The scraper uses the parldok [5.6.0] implementation via this form: https://www.dokumentation.landtag-mv.de/parldok/formalkriterien/
Document types: minor_interpellation, major_interpellation
Legislative terms: current 7, earliest 1
Landtag Niedersachsen
memorious run ni
The scraper uses the starweb implementation via this form: https://www.nilas.niedersachsen.de/starweb/NILAS/servlet.starweb?path=NILAS/lissh.web
Legislative terms: current 18, earliest 10
Landtag Rheinland-Pfalz
memorious run rp
The scraper uses the starweb implementation via this form: https://opal.rlp.de/starweb/OPAL_extern/servlet.starweb?path=OPAL_extern/PDOKU.web
Legislative terms: current 18, earliest 11
Document types: minor_interpellation, major_interpellation
Landtag von Sachsen-Anhalt
memorious run st
The scraper uses the starweb implementation via this form: https://padoka.landtag.sachsen-anhalt.de/starweb/PADOKA/servlet.starweb?path=PADOKA/LISSH.web&AdvancedSuche
Legislative terms: current 7, earliest 1
Document types: minor_interpellation, major_interpellation
Thüringer Landtag
memorious run th
The scraper uses the parldok [5.6.5] implementation via this form: http://parldok.thueringen.de/ParlDok/formalkriterien/
Document types: minor_interpellation, major_interpellation
Legislative terms: current 7, earliest 1
Dokumentations- und Informationssystem für Parlamentsmaterialien - API
memorious run dip
There is a really nice API. The scraper uses this base url (with the public api key): https://search.dip.bundestag.de/api/v1/drucksache?apikey=N64VhW8.yChkBUIJeosGojQ7CSR2xwLf3Qy7Apw464&f.zuordnung=BT
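To inspect the raw API output, the same base url can be queried directly (this is just the url from above, no additional parameters):
curl "https://search.dip.bundestag.de/api/v1/drucksache?apikey=N64VhW8.yChkBUIJeosGojQ7CSR2xwLf3Qy7Apw464&f.zuordnung=BT"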
Document types: minor_interpellation, major_interpellation
Parlamentsspiegel (gemeinsames Informationssystem der Landesparlamente)
memorious run parlamentsspiegel
The "Parlamentsspiegel" is an official aggregator page for the document systems of the german state parliaments.
The scraper uses this index page with configurable get parameters: https://www.parlamentsspiegel.de/home/suchergebnisseparlamentsspiegel.html?view=kurz&sortierung=dat_desc&vorgangstyp=ANFRAGE&datumVon=15.05.2021
The "Parlamentsspiegel" doesn't distinguish between minor and major
interpellations for the requests, so the DOCUMENT_TYPES
option is not
available.
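The date options should still apply; for example (assuming the generic START_DATE option maps to the datumVon parameter visible in the index url above):
START_DATE=2021-05-15 memorious run parlamentsspiegel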
Ausarbeitungen der Wissenschaftlichen Dienste des Deutschen Bundestages
memorious run sehrgutachten
Other than the name suggests, it is not technically based on https://sehrgutachten.de but scrapes the website of the Bundestag directly.
This scraper obtains documents from the Wissenschaftliche Dienste by calling and parsing this ajax endpoint: https://www.bundestag.de/ajax/filterlist/de/dokumente/ausarbeitungen/474644-474644/?limit=10
There are no DOCUMENT_TYPES and LEGISLATIVE_TERMS options, but START_DATE and END_DATE are available.
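For example, to only scrape documents published since the start of 2021:
START_DATE=2021-01-01 memorious run sehrgutachten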
Verfassungsschutzberichte des Bundes und der Länder
memorious run vsberichte
Scraped from the API of https://vsberichte.de
This scraper doesn't need to run frequently, as there is a new report only once a year.
There are no filter options available.
The scrapers are based upon memorious. Therefore, for each scraper there is a yaml file in ./dokukratie/ that defines how the scraper should run.
Some scrapers work with just a yaml definition, like Bayern: ./dokukratie/by.yml (see the generic sketch below)
Some others have their own custom python implementation, like Baden-Württemberg: ./dokukratie/scrapers/bw.py
Some others share the same software for their document database backend/frontend, mainly starweb or parldok:
starweb - used by: bb, he, ni, rp, st (and formerly be)
Code: ./dokukratie/scrapers/starweb.py
parldok - used by: hh, mv, th
Code: ./dokukratie/scrapers/parldok.py
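For the yaml-only scrapers, the definition follows the generic memorious pipeline format. A minimal sketch of such a pipeline (the stage layout, urls and paths are illustrative, not the actual contents of ./dokukratie/by.yml):

name: example_scraper
description: An illustrative memorious crawler definition
pipeline:
  init:
    # seed the crawler with one or more start urls
    method: seed
    params:
      urls:
        - https://www.example-parliament.de/dokumente
    handle:
      pass: fetch
  fetch:
    # download each queued url
    method: fetch
    handle:
      pass: store
  store:
    # write documents and their metadata to disk
    method: directory
    params:
      path: data/example_scraper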
The scrapers generate a metadata database for mmmeta to consume.
This is useful for client applications to track the state of files without downloading the actual files, e.g. to know which files are already consumed and to only download newer ones.
How to use mmmeta for dokukratie:
Currently used version: 0.4.0
pip install mmmeta
aws s3 sync s3://<bucket_name>/<scraper_name>/_mmmeta ./data/<scraper_name>
This will download the necessary metadata csv files (./db/) and config.yml
Either use the env var MMMETA=./data/<scraper_name> or jump into the base directory ./data/<scraper_name> where the subdirectory _mmmeta exists.
mmmeta update
or, within python applications:
from mmmeta import mmmeta
# init:
m = mmmeta() # env var MMMETA
# OR
m = mmmeta("./data/<scraper_name>")
# update (or generate) local state
m.update()
If this runs into sqlalchemy migration problems, there is an attempt to fix it (perhaps make a backup of the local state.db before):
mmmeta update --cleanup
or, within python applications:
m.update(cleanup=True)
This will clean up data in the state.db according to config.yml but will leave columns starting with an underscore untouched.
Soft-deleted files (files that no longer exist in the s3 bucket for some reason...) are marked with __deleted=1 and have a __deleted_reason property.
for file in m.files:
    # `file` has metadata as dictionary keys, e.g.:
    publisher = file["publisher"]
    # ...
    # s3 location:
    file.remote.uri
    # alter state data, e.g.:
    # as a convention, local state data should start with _
    # to not confuse it with the remote metadata
    file["_foo"] = bar
    file["_downloaded"] = True
    file.save()
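Building on this, a consumer can combine the soft-delete markers with the local state convention; a minimal sketch, assuming every file row carries the __deleted column described above:

for file in m.files:
    # skip files that were soft deleted in the remote bucket
    if file["__deleted"]:
        print("skipping:", file["__deleted_reason"])
        continue
    # process the file, then remember that in the local state
    file["_downloaded"] = True
    file.save()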
Install basic dependencies:
make install
Additional dependencies for local development:
make install.dev
Additional dependencies for production deployment (i.e. psycopg2):
make install.prod
Install test utils:
make install.test
Then,
make test
This will run through all the scrapers (see details in ./tests/test_scrapers.py) with different combinations of input parameters and stop after the first document downloaded.
Or, to test only a specific scraper:
make test.<scraper_name>
Test all scrapers with the starweb implementation:
make test.starweb
Test all scrapers with the parldok implementation:
make test.parldok