ETL pipelines for the RKI Metadata Exchange.
The Metadata Exchange (MEx) project is committed to improving the retrieval of RKI research data and projects. How? By focusing on metadata: instead of providing the actual research data directly, the MEx metadata catalog captures descriptive information about research data and activities. On this basis, we want to make the data FAIR [1] so that it can be shared with others.
Via MEx, metadata will be made findable, accessible and shareable, as well as available for further research. The goal is to get an overview of what research data is available, understand its context, and know what needs to be considered for subsequent use.
RKI cooperated with D4L data4life gGmbH for a pilot phase where the vision of a FAIR metadata catalog was explored and concepts and prototypes were developed. The partnership has ended with the successful conclusion of the pilot phase.
After an internal launch, the metadata will also be made publicly available, so that external researchers and the interested (professional) public can find research data from the RKI.
For further details, please consult our project page.
Contact
For more information, please feel free to email us at [email protected].
Robert Koch-Institut
Nordufer 20
13353 Berlin
Germany
The mex-extractors package implements a variety of ETL pipelines to extract metadata from primary data sources using a range of different technologies and protocols. Then, we transform the metadata into a standardized format using models provided by mex-common. The last step in this process is to load the harmonized metadata into a sink (file output, API upload, etc).
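The three stages described above can be sketched in a few lines of Python. This is only an illustration of the extract, transform, load pattern: `HarmonizedRecord`, `extract`, `transform`, `load` and `run_pipeline` are hypothetical names invented for this example and are not part of mex-extractors or mex-common, and the in-memory list and JSON string merely stand in for a real primary source and sink.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass
class HarmonizedRecord:
    """Hypothetical stand-in for a standardized model; not a mex-common class."""

    identifier: str
    title: str


def extract(raw_rows: Iterable[dict[str, str]]) -> Iterator[dict[str, str]]:
    """Extract: yield raw rows from a primary source (here, an in-memory list)."""
    yield from raw_rows


def transform(rows: Iterable[dict[str, str]]) -> Iterator[HarmonizedRecord]:
    """Transform: map source-specific field names onto the standardized model."""
    for row in rows:
        yield HarmonizedRecord(identifier=row["id"], title=row["name"].strip())


def load(records: Iterable[HarmonizedRecord]) -> str:
    """Load: serialize records as newline-delimited JSON, standing in for a sink."""
    return "\n".join(
        json.dumps(asdict(record), sort_keys=True) for record in records
    )


def run_pipeline(raw_rows: Iterable[dict[str, str]]) -> str:
    """Chain the three stages; generators keep the rows streaming between them."""
    return load(transform(extract(raw_rows)))
```

Calling `run_pipeline([{"id": "a1", "name": " Study A "}])` would return a single JSON line with the fields `identifier` and `title`. Keeping each stage a separate function makes it possible to swap the source or the sink without touching the transformation.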
This package is licensed under the MIT license. All other software components of the MEx project are open-sourced under the same license as well.
- on unix, consider using pyenv https://github.com/pyenv/pyenv
  - get pyenv: `curl https://pyenv.run | bash`
  - install 3.11: `pyenv install 3.11`
  - switch version: `pyenv global 3.11`
  - run `make install`
- on windows, consider using pyenv-win https://pyenv-win.github.io/pyenv-win/
  - follow https://pyenv-win.github.io/pyenv-win/#quick-start
  - install 3.11: `pyenv install 3.11`
  - switch version: `pyenv global 3.11`
  - run `.\mex.bat install`
- run all linters with `pdm lint`
- run only unit tests with `pdm unit`
- run unit and integration tests with `pdm test`
- update boilerplate files with `cruft update`
- update global requirements in `requirements.txt` manually
- update git hooks with `pre-commit autoupdate`
- update package dependencies using `pdm update-all`
- update github actions in `.github/workflows/*.yml` manually
- run `pdm release RULE` to release a new version, where RULE determines which part of the version to update and is one of `major`, `minor`, `patch`.
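To illustrate what the RULE argument selects, here is a minimal sketch of a version bump. It assumes the version follows a three-part MAJOR.MINOR.PATCH scheme; `bump_version` is a hypothetical helper written for this example and is not how `pdm release` is implemented.

```python
def bump_version(version: str, rule: str) -> str:
    """Return the next version string for a MAJOR.MINOR.PATCH version.

    Hypothetical helper for illustration only; the three-part scheme
    is an assumption, not something this README specifies.
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if rule == "major":
        # A major bump resets the lower-order parts.
        return f"{major + 1}.0.0"
    if rule == "minor":
        # A minor bump resets only the patch part.
        return f"{major}.{minor + 1}.0"
    if rule == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"rule must be one of major, minor, patch; got {rule!r}")
```

Under these assumptions, bumping `1.2.3` with the rule `minor` gives `1.3.0`, and any rule other than the three listed raises a `ValueError`.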
- build image with `make image`
- run directly using docker `make run`
- start with docker compose `make start`
- run `pdm run {command} --help` to print instructions
- run `pdm run {command} --debug` for interactive debugging
- run `pdm run dagster dev` to launch a local dagster UI
- `pdm run all-extractors` executes all extractors; execute only in local or dev environment
- `pdm run artificial` creates deterministic artificial sample data; execute only in local or dev environment
- `pdm run biospecimen` extracts sources from the Biospecimen excel files
- `pdm run blueant` extracts sources from the Blue Ant project management software
- `pdm run confluence-vvt` extracts sources from the VVT confluence page
- `pdm run datscha-web` extracts sources from the datscha web app
- `pdm run ff-projects` extracts sources from the FF Projects excel file
- `pdm run ifsg` extracts sources from the ifsg database
- `pdm run international-projects` extracts sources from the international projects excel
- `pdm run grippeweb` extracts grippeweb metadata from the grippeweb database
- `pdm run odk` extracts ODK survey data from excel files
- `pdm run open-data` extracts Open Data sources from the Zenodo API
- `pdm run organigram` extracts organizational units from a JSON file
- `pdm run rdmo` extracts sources from RDMO using its REST API
- `pdm run seq-repo` extracts sources from the seq-repo JSON file
- `pdm run sumo` extracts sumo data from xlsx files
- `pdm run synopse` extracts synopse data from report-server exports
- `pdm run voxco` extracts voxco data from voxco JSON files
Footnotes
[1] FAIR refers to the so-called FAIR data principles: guidelines to make data Findable, Accessible, Interoperable and Reusable.