VFB Pipeline 2 comprises five servers/services and six data pipelines:
- Pipeline 2 servers:
  - VFB knowledge base (vfb-kb)
  - VFB triple store (vfb-triplestore)
  - SOLr + preconfigured VFB SOLr core (vfb-solr)
  - owlery (vfb-owlery)
  - VFB Neo4J production instance (vfb-prod)
- Pipeline 2 data pipelines:
  - Transform KB1 to KB2 (vfb-kb2kb) [to be obsoleted]
  - Validate KB (vfb-validate)
  - Data collection (vfb-collect-data)
  - Triple store ingestion (vfb-updatetriplestore)
  - Data transformation and dumps for production instances (vfb-dumps)
  - VFB production instance ingestion (vfb-update-prod)
Servers and data pipelines are combined into six general sub-pipelines, which are configured as Jenkins jobs (currently located here). This documentation describes all six sub-pipelines in detail, including the role the individual servers and data pipelines play in each. All high-level documentation, including images, can be found in the vfb-pipeline-config repo. Note: there was once a pipeline server named vfb-integration-api, which has since been discarded in favour of vfb-dumps.
- Summary: This pipeline loads the current KB from backup, applies a series of transformation steps, and validates the resulting version of the KB for VFB Schema compliance. The finalised KB is backed up and spun up again from backup to clear caches. Components:
  - vfb-kb (deployment of the VFB knowledge base)
  - vfb-kb2kb (provisional data pipeline managing the migration from KB1 to KB2)
  - vfb-validate (validation pipeline to check if KB2 is in the correct basic shape for neo4j2owl)
- Jenkins job
- Dependents: pip-triplestore
- Image: virtualflybrain/docker-neo4j-knowledgebase:neo2owl (dockerhub)
- Git: https://github.com/VirtualFlyBrain/docker-neo4j-knowledgebase
- Dockerfile
- Jenkins job
- Summary: The VFB KB instance loads the VFB KB Archive and deploys it as a Neo4J instance that includes the neo2owl plugin. This plugin allows loading OWL ontologies into Neo4J according to a specific schema, as well as serialising the (valid) Neo4J graph into OWL.
- Access: http://kbl.p2.virtualflybrain.org/browser/ (post pipeline), http://kb.p2.virtualflybrain.org/browser/ (spin off from backup)
- There is nothing especially important about vfb-kb, other than that it comes in two flavours. The pipeline spins off a KB instance which gets a tiny bit of pre-processing in vfb-collect-data (basically setting the labels correctly). This instance is spun up from backup only for the pipeline run, and thrown away after the pipeline is finished. It is important to note that this system means that the vfb-kb pipeline edition is not necessarily exactly the same as the vfb-kb curation edition - the pipeline edition corresponds to the curation edition at the time of the last backup. So, if you want them to correspond exactly, you need to make sure the backup step is run right before vfb-kb is spun up.
- Currently the Dockerfile with the neo4j2owl plugin (and APOC!) is on a branch, so be careful when merging it in!
- Image: virtualflybrain/vfb-pipeline-kb2kb:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-kb2kb
- Dockerfile
- Jenkins job
- Summary: The image encapsulates a (python/cypher-based) pipeline to transform the original version of the KB into a schema-compliant version.
- Currently, in order to perform the KB2KB migration, a table is required to decide what type an entity is. This table is currently on a branch in the pipeline repo, which needs to be taken into account when the pipeline is merged in. It can probably be merged, but this should be done with the usual care (pull request, review that nothing important has changed accidentally - this branch was created years ago).
- The script that performs the change is on an unmerged branch in the VirtualFlyBrain/VFB_neo4j repo.
- The script should be obsoleted once the migration to KB2 is completed.
- Image: virtualflybrain/vfb-pipeline-validatekb:latest (dockerhub)
- Dockerfile
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-validatekb
- Jenkins job
- Summary: The image encapsulates a (python/cypher-based) pipeline to check whether the current state of the KB is schema-compliant.
- Results of the validation can be read in the latest console output of the Jenkins pip_vfb-kb pipeline.
- The actual validation process is implemented as a python script. It is not complete in terms of validation coverage, but it does check a few things, such as that every node has at least one OWL base type, an IRI, and so on. Each test is a single function in the script, so it should be fairly easy to read them over (a sketch of one such check is shown below).
- The tests are implemented as a report that is printed as part of the Jenkins job - currently the pipeline does not break if there is a validation error!
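- For illustration, a minimal sketch of what one such validation check might look like, using the official neo4j Python driver. The connection details, labels, and the check itself are hypothetical simplifications, not the actual script:

```python
from neo4j import GraphDatabase

# Hypothetical connection details; the real script would read these from its environment.
driver = GraphDatabase.driver("bolt://vfb-kb:7687", auth=("neo4j", "password"))

# Hypothetical set of OWL base type labels expected on every node.
OWL_BASE_TYPES = {"Class", "Individual", "ObjectProperty", "AnnotationProperty"}

def check_nodes_have_base_type_and_iri(tx):
    """Report nodes that lack an OWL base type label or an iri property."""
    failures = []
    for record in tx.run("MATCH (n) RETURN id(n) AS id, labels(n) AS labels, n.iri AS iri"):
        if not OWL_BASE_TYPES.intersection(record["labels"]):
            failures.append((record["id"], "no OWL base type label"))
        if not record["iri"]:
            failures.append((record["id"], "missing iri"))
    return failures

with driver.session() as session:
    for node_id, reason in session.execute_read(check_nodes_have_base_type_and_iri):
        # Printed into the Jenkins console; note that the job does not fail on errors.
        print(f"VALIDATION ERROR: node {node_id}: {reason}")
```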
- Summary: This pipeline deploys an empty triplestore, collects all VFB-relevant data (including KB and ontologies), and pre-processes and loads the collected data into the triplestore. Components:
  - vfb-triplestore (deploying the triplestore)
  - vfb-collect-data (data collection and preprocessing pipeline for all VFB data)
  - vfb-update-triplestore (loading collected data into the triplestore)
- Jenkins job
- Depends on: pip-kb
- Dependents: pip-dumps
- Image: yyz1989/rdf4j:latest (dockerhub)
- Git: We do not maintain this, see ticket
- Summary: The triplestore is currently an unspectacular default implementation of rdf4j-server. We make use of a simple in-memory store that is configured here. The container is maintained elsewhere (see the Docker Hub page of the image for details).
- Triplestore access:
- Example SPARQL query against the UI (a Python sketch of querying the same repository via the REST API is shown below)
- Repo summary
- We should probably migrate away from this particular image of rdf4j towards our own VFB one, because there is a danger that this container gets removed or updated, causing problems for us (though this is not likely).
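- For illustration, a minimal sketch of running a SPARQL query against the rdf4j server's standard REST API from Python. The host and repository name are hypothetical and depend on the deployment:

```python
import requests

# Hypothetical endpoint: rdf4j exposes each repository at /repositories/<name>.
ENDPOINT = "http://localhost:8080/rdf4j-server/repositories/vfb"

QUERY = """
SELECT ?s ?label WHERE {
  ?s <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 10
"""

# rdf4j implements the standard SPARQL protocol: a GET with a `query`
# parameter returns results in the format requested via the Accept header.
response = requests.get(
    ENDPOINT,
    params={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
response.raise_for_status()
for binding in response.json()["results"]["bindings"]:
    print(binding["s"]["value"], binding["label"]["value"])
```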
- Image: virtualflybrain/vfb-pipeline-collectdata:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-collectdata
- Dockerfile
- Summary: This container encapsulates a process that downloads a number of source ontologies, obtains the OWL version of the VFB KB, and applies a number of ROBOT-based pre-processing steps, in particular: extracting modules/slices of external ontologies, running consistency checks, and serialising as ttl for quicker ingest into the triplestore. It also contains the data embargo pipeline and has some provisions for SHACL validation.
- Exporting the KB into OWL2 is managed through a custom procedure (exportOWL()) implemented in the neo4j2owl plugin.
- The plugin is documented in detail in the repo's README.
- The process is encoded here. It performs the following steps (a Python sketch of the ROBOT-based steps follows after this list):
  - Exporting the KB to OWL using the above neo4j2owl:exportOWL() procedure.
  - Removing embargoed data. The technique applied here is based on using ROBOT query and encoding the embargo logic as SPARQL queries (combined with ROBOT remove).
  - Downloading external ontologies.
    - Ontologies in vfb_fullontologies.txt are imported in their entirety.
    - Ontologies in vfb_slices.txt are sliced. The slice corresponds to a BOTTOM module that has the combined signature of all ontologies in the fullontologies section with the signature of the KB.
    - Note: there is an annoying hack in there that should be fixed, simply by removing the if/else in this code block. First, though, we need to understand why this process is so slow (ROBOT memory?).
  - All ontologies are converted to turtle.
  - The KB is checked using a SHACL validation engine.
  - All ontologies ready to be imported into the triplestore are gzipped.
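- For illustration, a minimal sketch of the ROBOT-based slicing, embargo, and conversion steps as they might be driven from Python, assuming `robot` is on the PATH. The file names and the update query are hypothetical placeholders, not the actual pipeline inputs:

```python
import subprocess

def run(cmd):
    """Run a ROBOT command and fail loudly if it errors."""
    print(" ".join(cmd))
    subprocess.run(cmd, check=True)

# Extract a BOTTOM (BOT) module from an external ontology, using a signature
# file combining the KB signature with that of the full ontologies.
run([
    "robot", "extract",
    "--method", "BOT",
    "--input", "external_ontology.owl",
    "--term-file", "combined_signature.txt",
    "--output", "external_ontology_slice.owl",
])

# Remove embargoed data: the embargo logic is encoded as a SPARQL UPDATE
# (hypothetical file name) applied to the exported KB via ROBOT query.
run([
    "robot", "query",
    "--input", "kb.owl",
    "--update", "delete_embargoed.ru",
    "--output", "kb_public.owl",
])

# Convert to turtle for quicker ingest into the triplestore.
run([
    "robot", "convert",
    "--input", "kb_public.owl",
    "--format", "ttl",
    "--output", "kb_public.ttl",
])
```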
- Image: virtualflybrain/vfb-pipeline-updatetriplestore:latest (dockerhub)
- Dockerfile
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-updatetriplestore
- Summary: This container encapsulates a process that (1) sets up the triplestore's vfb database and (2) loads all of the ttl files generated by vfb-collect-data into vfb-triplestore. The image contains the configuration details of the triplestore, like the choice of triplestore engine.
- The process does nothing more than load the ontologies and data collected in the previous step into the triple store (a minimal sketch of such a load is shown below).
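- For illustration, a minimal sketch of loading gzipped turtle files into rdf4j via its standard REST API. The host, repository name, and data directory are hypothetical:

```python
import gzip
from pathlib import Path

import requests

# Hypothetical endpoint; repository name and host depend on the deployment.
STATEMENTS = "http://localhost:8080/rdf4j-server/repositories/vfb/statements"

# Load every gzipped turtle file produced by vfb-collect-data. POSTing to the
# /statements endpoint with a turtle content type appends the data (this is
# the standard rdf4j REST API).
for path in Path("/data/collected").glob("*.ttl.gz"):
    with gzip.open(path, "rb") as f:
        response = requests.post(
            STATEMENTS,
            data=f.read(),
            headers={"Content-Type": "text/turtle"},
        )
    response.raise_for_status()
    print(f"Loaded {path.name}")
```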
- Summary: This pipeline transforms the knowledge graph in the triplestore into various custom dumps used by downstream services such as the VFB Neo4J production instance, owlery and solr.
- Jenkins pipeline
- Depends on: pip-triplestore
- Dependents: pip-owlery, pip-prod
- Image: virtualflybrain/vfb-pipeline-dumps:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-dumps
- Summary: The VFB dumps pipeline accesses the triple store to obtain data dumps, which it then munges, transforms, and enriches for various downstream purposes such as vfb-prod ingestion, owlery ingestion and vfb-solr ingestion.
- Dockerfile
- Example access: http://virtualflybrain.org/data/VFB/OWL contains all the data that is generated by this pipeline. This generated data is loaded into the various downstream tools.
- The process performs the following steps (all encoded in the Makefile):
  - Build dump for vfb-owlery (all logical axioms in the triplestore)
  - Build dump for vfb-prod (VFB production instance)
  - Build dump for vfb-solr (special json file, created using python)
- There is a new section in the config file called filters, which should be pretty self-explanatory. The main thing to know is that the ['iri_prefix'] filter actually checks whether the listed string is contained somewhere in the IRI - so in our case, VFBc_ would have worked as well. The ['neo4j_node_label'] filter simply filters out every entity that also has a particular node label associated with it (a sketch of these semantics is shown below).
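- For illustration, a minimal sketch of the filter semantics described above. The config shape, label names, and entity records are hypothetical simplifications of the actual pipeline config:

```python
# Hypothetical, simplified mirror of the 'filters' section of the config file.
FILTERS = {
    "iri_prefix": ["VFBc_"],             # matched anywhere in the IRI, not just the start
    "neo4j_node_label": ["Deprecated"],  # entities with this label are filtered out
}

ENTITIES = [
    {"iri": "http://virtualflybrain.org/reports/VFBc_00000001", "labels": ["Class"]},
    {"iri": "http://virtualflybrain.org/reports/VFBc_00000002", "labels": ["Class", "Deprecated"]},
]

def keep(entity):
    """Keep an entity only if its IRI contains one of the iri_prefix strings
    and it carries none of the excluded node labels."""
    matches_prefix = any(s in entity["iri"] for s in FILTERS["iri_prefix"])
    has_excluded_label = any(l in entity["labels"] for l in FILTERS["neo4j_node_label"])
    return matches_prefix and not has_excluded_label

# Only the first entity survives: the second carries the excluded label.
print([e["iri"] for e in ENTITIES if keep(e)])
```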
- The vfb-solr pipeline is a bit more involved and also relies on the general pipeline config file.
- There is a new section in the dumps.Makefile (around line 66) that allows adding arbitrary SPARQL CONSTRUCT queries to the produced dumps. This can be useful, for example, to materialise ad hoc neo labels. To add a new dump:
  - pick a name and add it to the correct DUMPS variable (DUMPS_SOLR, DUMPS_PDB, DUMPS_OWLERY)
  - create a new SPARQL query in sparql/, naming it 'construct_name.sparql', e.g. sparql/construct_image_names.sparql. Note that non-sparql goals, like 'inferred_annotation', need to be added separately.
- Summary: This pipeline deploys the Owlery webservice which is used by VFB to answer ontology queries (no special config).
- Depends on: vfb-dumps
- Dependents: None (Geppetto)
- Image: virtualflybrain/owlery-vfb:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/owlery-vfb
- Dockerfile
- Summary: Deployment of Owlery, a web-service for accessing basic reasoning methods of an ontology.
- Example access: Get subclasses of a term (a hedged Python sketch of such a request is shown below)
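- For illustration, a sketch of what such a query might look like against Owlery's HTTP API, assuming the standard Owlery endpoints, a kb named 'vfb', and a hypothetical host. The term IRI is an arbitrary FBbt placeholder:

```python
import requests

# Hypothetical Owlery endpoint; host and kb name depend on the deployment.
OWLERY = "http://owlery.p2.virtualflybrain.org/kbs/vfb"

# Owlery exposes reasoning endpoints such as /subclasses, which take a class
# expression (here an IRI in angle brackets) via the `object` parameter.
response = requests.get(
    f"{OWLERY}/subclasses",
    params={
        "object": "<http://purl.obolibrary.org/obo/FBbt_00003624>",
        "direct": "false",  # assumed flag: include indirect subclasses
    },
)
response.raise_for_status()
print(response.json())
```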
- Summary: This pipeline deploys the production instance of the VFB neo4j database and loads all the relevant data.
- Depends on: pip-dumps
- Jenkins pipeline
- Dependents: None (Geppetto)
- Image: virtualflybrain/vfb-prod:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-prod
- Dockerfile
- Summary: Deploys an empty, configured instance of a Neo4J database with the neo2owl plugin, and APOC tools.
- Access: http://pdb.p2.virtualflybrain.org/browser/
- Note that this image is used for all runtime deployments of neo4j databases across the VFB ecosystem (check branches of the vfb-prod container)
- Image: virtualflybrain/vfb-pipeline-update-prod:latest (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-pipeline-update-prod
- Dockerfile
- Summary: The update-prod container currently takes an ontology (from the integration layer) and loads it into the Neo4J instance (vfb-prod) using the neo2owl plugin. Process:
  - Loading the ontology using the neo4j2owl:owl2Import() procedure.
  - Setting a number of indices (see detailed notes below).
- You can set additional pipeline post-processing steps, like indices, by editing this file. Note that this file can be used to set arbitrary post-processing cypher queries, not just indices (contrary to the file name). Essentially, all listed cypher queries are executed in order right after the PDB import is completed (a sketch of this mechanism is shown below).
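- For illustration, a minimal sketch of executing such an ordered list of post-processing Cypher statements from Python. The statements, labels, and connection details are hypothetical examples, not the actual configured queries:

```python
from neo4j import GraphDatabase

# Hypothetical connection details for vfb-prod.
driver = GraphDatabase.driver("bolt://vfb-prod:7687", auth=("neo4j", "password"))

# An ordered list of post-processing statements: indices first, then an
# arbitrary clean-up query - mirroring how the config file is not limited
# to index creation despite its name.
POST_PROCESSING = [
    "CREATE INDEX entity_iri IF NOT EXISTS FOR (n:Entity) ON (n.iri)",
    "CREATE INDEX entity_label IF NOT EXISTS FOR (n:Entity) ON (n.label)",
    "MATCH (n) WHERE n.label IS NULL SET n.label = '(unlabeled)'",
]

with driver.session() as session:
    for statement in POST_PROCESSING:
        # Executed in order, right after the PDB import completes.
        session.run(statement).consume()
        print(f"Executed: {statement}")
```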
- The possible configuration settings for the neo4j2owl:owl2Import() procedure are described here. The configuration is stored here.
- Image: virtualflybrain/vfb-solr (dockerhub)
- Git: https://github.com/VirtualFlyBrain/vfb-solr
- Dockerfile
- Summary: An essentially unchanged Solr 8 image instance.
- Jenkins pipeline
- The pipeline is currently deployed as a series of connected Jenkins jobs.
- Every sub-pipeline has a Jenkins job that can be restarted manually. Every sub-pipeline will trigger all of its dependents: if the pip_vfb-dumps pipeline is started, it will automatically trigger the pip_vfb-prod and pip_vfb-owlery pipelines to redeploy as well.
- The whole pipeline can be restarted by simply triggering the pip_vfb-kb pipeline to be re-run. This will trigger all downstream sub-pipelines.
- The whole pipeline is re-run every night at 4am.