Welcome to the main repository of the RDF-Connect Orchestrator. This project attempts to facilitate RDF-Connect pipelines between any combination of programming languages and runtimes. Extensibility is at the core of this project, allowing you to bring RDF-Connect to new languages using existing or new methods of interprocess communication.
RDF-Connect introduces a fair amount of new concepts, each with their own terms and names. Below is a brief overview of the most important things to know before heading in.
- Orchestrator: the main control pane of an RDF-Connect pipeline, responsible for starting runners and processors as well as message brokering between different pipeline stages.
- Runner: a process which interprets the received control commands from the orchestrator and executes those to load processors, instantiate them into pipeline stages, and handles incoming and outgoing messages. Typically, a runner specializes in a specific programming language or runtime. For example, our standard Python runner can import and instantiate any kind of Python processor as long as it conforms to the expected abstract class.
- Processor: an implementation of some logic. These can be arbitrarily simple or complex, ranging from reading a file from disk to inference of a machine learning model. Processors can accept type-safe arguments, including destination-agnostic readers and writers.
- Readers and Writers: simple wrapper objects (or similar) which provide simple APIs to interact with the messaging system, such as asynchronous
send
andreceive
methods. - Stage: a processor which has been instantiated with concrete arguments and reader/writer targets.
Some lingo specific to the packaging and distributing is the following.
- Package: a collection of runners and processors, published locally or online.
- Dependency: a package required to load and execute a pipeline.
- Target: the runner which should execute a given processor.
- Prepare statements: commands which must be executed to build the runners and processors in a package and create the runtime environment if required. For example, executing a compilation step using the build system.
We provide a convenient and lean Docker image to get started quickly. Kotlin processors are supported out-of-the-box, but you may need to extend this image to host certain targets, such as node
and npm
for the TypeScript runner.
docker pull ghcr.io/rdf-connect/orchestrator:latest
This repository contains a homebrew formula.
brew install rdf-connect/orchestrator https://github.com/rdf-connect/orchestrator
We provide a number of runners and processors to fast-track the development of pipelines, as well as to provide a reference on how to implement your own. All of these are exported by our index.ttl
file, and can be loaded using a single dependency. Learn more here.
<MyPipeline> rdfc:dependency <https://github.com/rdf-connect/orchestrator.git> .
The Protobuf declarations are available in a separate repository. Make sure to initialize its respective submodule, since otherwise your Gradle build will fail.
git submodule update --init --recursive
You can of course also use the --recurse-submodules
switch with git clone
.
Some of the pipelines found in this repository depend on repositories hosted on GitHub, requiring an authentication token. These can be configured using the environment variables listed in .env.example
. It suffices to copy this file to .env
and fill in the variables.
This project automatically publishes Maven packages and a Docker image whenever the projectVersion
field in gradle.properties
is updated.
This repository uses pre-commit
for automatic linting and formatting. To get started, run the following command.
pre-commit install
Conventional commits are enforced, but require an additional command to register the hook.
pre-commit install --hook-type commit-msg
Due to the high amount of languages in this repository, we provide a simple shell script which formats most (if not all) files in this project. Please make sure your contributions conform to these formatters at all times.
We use Detekt for static analysis. The configuration file can be found at detekt.yml
, but note that we build upon the base config and therefore require the --build-upon-default-config
switch along with --config detekt.yml
. For example:
detekt --config detekt.yml --build-upon-default-config --input rdfc-core/src/main
The following section aims to give you an initial understanding of the project structure as well as provide motivation for certain design decisions. Note however that this project makes use of KDoc, so API and implementation details are available separately. This document only covers the project conceptually.
Command line interface - rdfc-cli
This is the only module which contains an executable, and it's scope is extremely limited. All different execution modes (such as install
, validate
, exec
) are mapped to a respective function, which in turn calls into the other rdfc
libraries. If at one point a function in this package provides a large amount of functionality, moving it into a different module should be considered.
The rdfc-cli
module can also be seen as a facade, since it wraps many aspects of the orchestrator into convenient and simple function calls.
Utility code - rdfc-core
This module is a collection of wrapper classes, extensions, and utility code.
Intermediate Representation - rdfc-intermediate
A collection of data classes which together model an RDF-Connect configuration.
The classes in this module are prefixed with IR
, which stands for intermediate representation. We're taking some liberties with the definition here, but essentially we refer to the fact that (aside from parsing) we never execute queries against the RDF model itself. Rather, we extract the values as soon as possible into these data classes to achieve better separation of concern.
Orchestrator - rdfc-orchestrator
rdfc-orchestrator
is the heart of the project. It takes responsibility of the following tasks.
- Accept tbe intermediate representation of a pipeline as parameter.
- Instantiate the runners listed in that pipeline.
- Forward pipeline stage declarations to the respective runners.
- Facilitate message brokering during the execution of the pipeline, including control messages such as
Close Channel
.
Communication typically passes four distinct components. A message is a tuple of raw bytes, and it's target URI.
- Processor #1: a processor can use a writer to submit a message to the system. It will forward the message to the runner.
- Runner #1: receives the message from a processor and forwards it to the orchestrator stub. Any protocol can be used here, but at the time of writing the project provides support for gRPC only.
- Orchestrator Stub #1: receives the message from the runner and forwards it to the central broker.
- Central Broker: receives the message , attempts to match the destination URI to a stub, and forwards it.
- Orchestrator Stub #2: receives the message from the broker and forwards it to it's corresponding runner.
- Runner #2: receives the message from the stub and attempts to match the URI against a specific processor to forward to.
- Processor #2: receives the message from the runner and buffers it in the corresponding reader.
Note that in the Kotlin runner, the stub and runner are implemented side-by-side in a single class.
Parser - rdfc-parser
Responsible for parsing a configuration to intermediate representation. The interface does not specify what type of configuration language that must be used, and instead only exposes methods which take in a file path as parameter. By default, only RDF in the Turtle syntax is supported.
Processor - rdfc-processor
This module exposes an abstract class which Kotlin-based processors must extend for the default runner implementation.