VFB Pipeline 2 comprises five servers/services and six data pipelines:
- Pipeline 2 servers:
  - VFB knowledge base (vfb-kb)
  - VFB triple store (vfb-triplestore)
  - SOLr + preconfigured VFB SOLr core (vfb-solr)
  - owlery (vfb-owlery)
  - VFB Neo4J production instance (vfb-prod)
- Pipeline 2 data pipelines:
  - Transform KB1 to KB2 (vfb-kb2kb) [to be obsoleted]
  - Validate KB (vfb-validate)
  - Data collection (vfb-collect-data)
  - Triple store ingestion (vfb-updatetriplestore)
  - Data transformation and dumps for production instances (vfb-dumps)
  - VFB production instance ingestion (vfb-update-prod)
Servers and data pipelines are combined into six general sub-pipelines, which are configured as Jenkins jobs (currently located here). This documentation describes all six sub-pipelines in detail, including the role the individual servers and data pipelines play in each. All high-level documentation, including images, can be found in the vfb-pipeline-config repo. Note: there was once a pipeline server named vfb-integration-api, which has since been discarded in favour of vfb-dumps.
- Summary: This pipeline loads the current KB from backup, applies a series of transformation steps, and validates the resulting version of the KB for VFB Schema compliance. The finalised KB is backed up and spun up again from backup to clear caches. Components:
  - vfb-kb (deployment of the VFB knowledge base)
  - vfb-kb2kb (provisional data pipeline managing the migration from KB1 to KB2)
  - vfb-validate (validation pipeline to check if KB2 is in the correct basic shape for neo4j2owl)
- Jenkins job
- Dependents: pip-triplestore
- Image: virtualflybrain/docker-neo4j-knowledgebase:neo2owl (dockerhub)
- Git: https://github.com/VirtualFlyBrain/docker-neo4j-knowledgebase
- Dockerfile
- Jenkins job
- Summary: The VFB KB instance loads the VFB KB Archive and deploys it as a Neo4J instance that includes the neo2owl plugin. This plugin allows loading OWL ontologies into Neo4J according to a specific schema, as well as serialising the (valid) Neo4J graph into OWL.
- Access: http://kbl.p2.virtualflybrain.org/browser/ (post pipeline), http://kb.p2.virtualflybrain.org/browser/ (spin off from backup)
- There is nothing especially important about vfb-kb, other than that it comes in two flavours. The pipeline spins off a KB instance which gets a tiny bit of pre-processing in vfb-collect-data (basically setting the labels correctly). This instance is spun up from backup only for the pipeline run, and thrown away after the pipeline is finished. It is important to note that this system means that the vfb-kb pipeline edition is not necessarily exactly the same as the vfb-kb curation edition - the pipeline edition corresponds to the curation edition at the time of the last backup. So, if you want them to correspond exactly, you need to make sure the backup step is run right before vfb-kb is spun up.
- Currently the Dockerfile with the neo4j2owl plugin (and APOC!) is on a branch, so be careful when merging it in!
- Image: virtualflybrain/vfb-pipeline-kb2kb:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-kb2kb
- Dockerfile
- Jenkins job
- Summary: The image encapsulates a (python/cypher-based) pipeline to transform the original version of the KB into a schema-compliant version.
- Currently, in order to perform the KB2KB migration, a table is required to decide what type an entity is. This table is currently on a branch in the pipeline repo, which needs to be taken into account when the pipeline is merged in. It can probably be merged, but this should be done with the usual care (pull request, review that nothing important has changed accidentally - this branch was created years ago).
- The script that performs the change is on an unmerged branch in the VirtualFlyBrain/VFB_neo4j repo.
- The script should be obsoleted once the migration to KB2 is completed.
- Image: virtualflybrain/vfb-pipeline-validatekb:latest (dockerhub)
- Dockerfile
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-validatekb
- Jenkins job
- Summary: The image encapsulates a (python/cypher-based) pipeline to check whether the current state of the KB is schema-compliant.
- Results of the validation can be read in the latest console output of the Jenkins pip_vfb-kb pipeline.
- The actual validation process is implemented as a python script. It is not complete in terms of validation coverage, but it does check a few things, such as that every node has at least one OWL base type, an IRI, and so on. Each test is a single function in the script, so it should be fairly easy to read them over (a sketch of one such check is shown below).
- The tests are implemented as a report that is printed as part of the Jenkins job - currently the pipeline does not break if there is a validation error!
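- For illustration, a minimal sketch of what one such validation check might look like, using the official neo4j Python driver. The connection details, labels, and the check itself are hypothetical simplifications, not the actual script:

```python
from neo4j import GraphDatabase

# Hypothetical connection details; the real script would read these from its environment.
driver = GraphDatabase.driver("bolt://vfb-kb:7687", auth=("neo4j", "password"))

# Hypothetical set of OWL base type labels expected on every node.
OWL_BASE_TYPES = {"Class", "Individual", "ObjectProperty", "AnnotationProperty"}

def check_nodes_have_base_type_and_iri(tx):
    """Report nodes that lack an OWL base type label or an iri property."""
    failures = []
    for record in tx.run("MATCH (n) RETURN id(n) AS id, labels(n) AS labels, n.iri AS iri"):
        if not OWL_BASE_TYPES.intersection(record["labels"]):
            failures.append((record["id"], "no OWL base type label"))
        if not record["iri"]:
            failures.append((record["id"], "missing iri"))
    return failures

with driver.session() as session:
    for node_id, reason in session.execute_read(check_nodes_have_base_type_and_iri):
        # Printed into the Jenkins console; note that the job does not fail on errors.
        print(f"VALIDATION ERROR: node {node_id}: {reason}")
```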
- Summary: This pipeline deploys an empty triplestore, collects all VFB-relevant data (including KB and ontologies), and pre-processes and loads the collected data into the triplestore. Components:
  - vfb-triplestore (deploying the triplestore)
  - vfb-collect-data (data collection and preprocessing pipeline for all VFB data)
  - vfb-update-triplestore (loading collected data into the triplestore)
- Jenkins job
- Depends on: pip-kb
- Dependents: pip-dumps
- Image: yyz1989/rdf4j:latest (dockerhub)
- Git: We do not maintain this, see ticket
- Summary: The triplestore is currently an unspectacular default implementation of rdf4j-server. We make use of a simple in-memory store that is configured here. The container is maintained elsewhere (see the Docker Hub page of the image for details).
- Triplestore access:
- Example SPARQL query against the UI (a Python sketch of querying the same repository via the REST API is shown below)
- Repo summary
- We should probably migrate away from this particular image of rdf4j towards our own VFB one, because there is a danger that this container gets removed or updated, causing problems for us (though this is not likely).
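- For illustration, a minimal sketch of running a SPARQL query against the rdf4j server's standard REST API from Python. The host and repository name are hypothetical and depend on the deployment:

```python
import requests

# Hypothetical endpoint: rdf4j exposes each repository at /repositories/<name>.
ENDPOINT = "http://localhost:8080/rdf4j-server/repositories/vfb"

QUERY = """
SELECT ?s ?label WHERE {
  ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 10
"""

# rdf4j implements the standard SPARQL protocol: a GET with a `query`
# parameter returns results in the format requested via the Accept header.
response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()
for binding in response.json()["results"]["bindings"]:
    print(binding["s"]["value"], binding["label"]["value"])
```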
- Image: virtualflybrain/vfb-pipeline-collectdata:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-collectdata
- Dockerfile
- Summary: This container encapsulates a process that downloads a number of source ontologies, obtains the OWL version of the VFB KB, and applies a number of ROBOT-based pre-processing steps, in particular: extracting modules/slices of external ontologies, running consistency checks, and serialising as ttl for quicker ingest into the triplestore. It also contains the data embargo pipeline and has some provisions for SHACL validation.
- Exporting the KB into OWL2 is managed through a custom procedure (exportOWL()) implemented in the neo4j2owl plugin.
- The plugin is documented in detail in the repo's README.
- The process is encoded here. It performs the following steps (a Python sketch of the ROBOT-based steps follows after this list):
  - Exporting the KB to OWL using the above neo4j2owl:exportOWL() procedure.
  - Removing embargoed data. The technique applied here is based on using ROBOT query and encoding the embargo logic as SPARQL queries (combined with ROBOT remove).
  - Downloading external ontologies.
    - Ontologies in vfb_fullontologies.txt are imported in their entirety.
    - Ontologies in vfb_slices.txt are sliced. The slice corresponds to a BOTTOM module that has the combined signature of all ontologies in the fullontologies section with the signature of the KB.
    - Note: there is an annoying hack in there that should be fixed, simply by removing the if/else in this code block. First, though, we need to understand why this process is so slow (ROBOT memory?).
  - All ontologies are converted to turtle.
  - The KB is checked using a SHACL validation engine.
  - All ontologies ready to be imported into the triplestore are gzipped.
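- For illustration, a minimal sketch of the ROBOT-based slicing, embargo, and conversion steps as they might be driven from Python, assuming `robot` is on the PATH. The file names and the update query are hypothetical placeholders, not the actual pipeline inputs:

```python
import subprocess

def run(cmd):
    """Run a ROBOT command and fail loudly if it errors."""
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# Extract a BOTTOM (BOT) module from an external ontology, using a signature
# file combining the KB signature with that of the full ontologies.
run([
    "robot", "extract",
    "--method", "BOT",
    "--input", "external_ontology.owl",
    "--term-file", "combined_signature.txt",
    "--output", "external_ontology_slice.owl",
])

# Remove embargoed data: the embargo logic is encoded as a SPARQL UPDATE
# (hypothetical file name) applied to the exported KB via ROBOT query.
run([
    "robot", "query",
    "--input", "kb.owl",
    "--update", "delete_embargoed.ru",
    "--output", "kb_public.owl",
])

# Convert to turtle for quicker ingest into the triplestore.
run([
    "robot", "convert",
    "--input", "kb_public.owl",
    "--format", "ttl",
    "--output", "kb_public.ttl",
])
```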
- Image: virtualflybrain/vfb-pipeline-updatetriplestore:latest (dockerhub)
- Dockerfile
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-updatetriplestore
- Summary: This container encapsulates a process that (1) sets up the triplestore's vfb database and (2) loads all of the ttl files generated by vfb-collect-data into vfb-triplestore. The image contains the configuration details of the triplestore, like the choice of triplestore engine.
- The process does nothing more than load the ontologies and data collected in the previous step into the triple store (a minimal sketch of such a load is shown below).
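- For illustration, a minimal sketch of loading gzipped turtle files into rdf4j via its standard REST API. The host, repository name, and data directory are hypothetical:

```python
import gzip
from pathlib import Path

import requests

# Hypothetical endpoint; repository name and host depend on the deployment.
STATEMENTS = "http://localhost:8080/rdf4j-server/repositories/vfb/statements"

# Load every gzipped turtle file produced by vfb-collect-data. POSTing to the
# /statements endpoint with a turtle content type appends the data (this is
# the standard rdf4j REST API).
for path in Path("/data/collected").glob("*.ttl.gz"):
    with gzip.open(path, "rb") as f:
        response = requests.post(
            STATEMENTS,
            data=f.read(),
            headers={"Content-Type": "text/turtle"},
        )
    response.raise_for_status()
    print(f"Loaded {path.name}")
```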
- Summary: This pipeline transforms the knowledge graph in the triplestore into various custom dumps used by downstream services such as the VFB Neo4J production instance, owlery and solr.
- Jenkins pipeline
- Depends on: pip-triplestore
- Dependents: pip-owlery, pip-prod
- Image: virtualflybrain/vfb-pipeline-dumps:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-dumps
- Summary: The VFB dumps pipeline accesses the triple store to obtain data dumps, which it then munges, transforms, and enriches for various downstream purposes such as vfb-prod ingestion, owlery ingestion and vfb-solr ingestion.
- Dockerfile
- Example access: http://virtualflybrain.org/data/VFB/OWL contains all the data that is generated by this pipeline. This generated data is loaded into the various downstream tools.
- The process performs the following steps (all encoded in the Makefile):
  - Build dump for vfb-owlery (all logical axioms in the triplestore)
  - Build dump for vfb-prod (VFB production instance)
  - Build dump for vfb-solr (special json file, created using python)
- There is a new section in the config file called filters, which should be pretty self-explanatory. The main thing to know is that the ['iri_prefix'] filter actually checks whether the listed string is contained somewhere in the IRI - so in our case, VFBc_ would have worked as well. The ['neo4j_node_label'] filter simply filters out every entity that also has a particular node label associated with it (a sketch of these semantics is shown below).
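- For illustration, a minimal sketch of the filter semantics described above. The config shape, label names, and entity records are hypothetical simplifications of the actual pipeline config:

```python
# Hypothetical, simplified mirror of the 'filters' section of the config file.
FILTERS = {
    "iri_prefix": ["VFBc_"],             # matched anywhere in the IRI, not just the start
    "neo4j_node_label": ["Deprecated"],  # entities with this label are filtered out
}

ENTITIES = [
    {"iri": "http://virtualflybrain.org/reports/VFBc_00000001", "labels": ["Class"]},
    {"iri": "http://virtualflybrain.org/reports/VFBc_00000002", "labels": ["Class", "Deprecated"]},
]

def keep(entity):
    """Keep an entity only if its IRI contains one of the iri_prefix strings
    and it carries none of the excluded node labels."""
    matches_prefix = any(s in entity["iri"] for s in FILTERS["iri_prefix"])
    has_excluded_label = any(l in entity["labels"] for l in FILTERS["neo4j_node_label"])
    return matches_prefix and not has_excluded_label

# Only the first entity survives: the second carries the excluded label.
print([e["iri"] for e in ENTITIES if keep(e)])
```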
- The vfb-solr pipeline is a bit more involved and also relies on the general pipeline config file.
- There is a new section in the dumps.Makefile (around line 66) that allows adding arbitrary SPARQL CONSTRUCT queries to the produced dumps. This can be useful, for example, to materialise ad hoc neo labels. To add a new dump:
  - pick a name and add it to the correct DUMPS variable (DUMPS_SOLR, DUMPS_PDB, DUMPS_OWLERY)
  - create a new SPARQL query in sparql/, naming it 'construct_name.sparql', e.g. sparql/construct_image_names.sparql. Note that non-sparql goals, like 'inferred_annotation', need to be added separately.
- Summary: This pipeline deploys the Owlery webservice which is used by VFB to answer ontology queries (no special config).
- Depends on: vfb-dumps
- Dependents: None (Geppetto)
- Image: virtualflybrain/owlery-vfb:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/owlery-vfb
- Dockerfile
- Summary: Deployment of Owlery, a web-service for accessing basic reasoning methods of an ontology.
- Example access: Get subclasses of a term (a hedged Python sketch of such a request is shown below)
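- For illustration, a sketch of what such a query might look like against Owlery's HTTP API, assuming the standard Owlery endpoints, a kb named 'vfb', and a hypothetical host. The term IRI is an arbitrary FBbt placeholder:

```python
import requests

# Hypothetical Owlery endpoint; host and kb name depend on the deployment.
OWLERY = "http://owlery.p2.virtualflybrain.org/kbs/vfb"

# Owlery exposes reasoning endpoints such as /subclasses, which take a class
# expression (here an IRI in angle brackets) via the `object` parameter.
response = requests.get(
    f"{OWLERY}/subclasses",
    params={
        "object": "<http://purl.obolibrary.org/obo/FBbt_00003624>",
        "direct": "false",  # assumed flag: include indirect subclasses
    },
)
response.raise_for_status()
print(response.json())
```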
- Summary: This pipeline deploys the production instance of the VFB neo4j database and loads all the relevant data.
- Depends on: pip-dumps
- Jenkins pipeline
- Dependents: None (Geppetto)
- Image: virtualflybrain/vfb-prod:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-prod
- Dockerfile
- Summary: Deploys an empty, configured instance of a Neo4J database with the neo2owl plugin, and APOC tools.
- Access: http://pdb.p2.virtualflybrain.org/browser/
- Note that this image is used for all runtime deployments of neo4j databases across the VFB ecosystem (check branches of the vfb-prod container)
- Image: virtualflybrain/vfb-pipeline-update-prod:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-update-prod
- Dockerfile
- Summary: The update-prod container currently takes an ontology (from the integration layer) and loads it into the Neo4J instance (vfb-prod) using the neo2owl plugin. Process:
  - Loading the ontology using the neo4j2owl:owl2Import() procedure.
  - Setting a number of indices (see detailed notes below).
- You can set additional pipeline post-processing steps, like indices, by editing this file. Note that this file can be used to set arbitrary post-processing cypher queries, not just indices (contrary to the file name). Essentially, all listed cypher queries are executed in order right after the PDB import is completed (a sketch of this mechanism is shown below).
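- For illustration, a minimal sketch of executing such an ordered list of post-processing Cypher statements from Python. The statements, labels, and connection details are hypothetical examples, not the actual configured queries:

```python
from neo4j import GraphDatabase

# Hypothetical connection details for vfb-prod.
driver = GraphDatabase.driver("bolt://vfb-prod:7687", auth=("neo4j", "password"))

# An ordered list of post-processing statements: indices first, then an
# arbitrary clean-up query - mirroring how the config file is not limited
# to index creation despite its name.
POST_PROCESSING = [
    "CREATE INDEX entity_iri IF NOT EXISTS FOR (n:Entity) ON (n.iri)",
    "CREATE INDEX entity_label IF NOT EXISTS FOR (n:Entity) ON (n.label)",
    "MATCH (n) WHERE n.label IS NULL SET n.label = '(unlabeled)'",
]

with driver.session() as session:
    for statement in POST_PROCESSING:
        # Executed in order, right after the PDB import completes.
        session.run(statement).consume()
        print(f"Executed: {statement}")
```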
- The possible configuration settings for the neo4j2owl:owl2Import() procedure are described here. The configuration is stored here.
- Image: virtualflybrain/vfb-solr (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-solr
- Dockerfile
- Summary: An essentially unchanged Solr 8 image instance.
- Jenkins pipeline
- The pipeline is currently deployed as a series of connected Jenkins jobs.
- Every sub-pipeline has a Jenkins job that can be restarted manually. Every sub-pipeline will trigger all of its dependents: if the pip_vfb-dumps pipeline is started, it will automatically trigger the pip_vfb-prod and pip_vfb-owlery pipelines to redeploy as well.
- The whole pipeline can be restarted by simply triggering the pip_vfb-kb pipeline to be re-run. This will trigger all downstream sub-pipelines.
- The whole pipeline is re-run every night at 4am.