/zɛtəl/
xETL is a versatile orchestration library for sequencing the execution of programs. While it can be used to build ETL (Extract, Transform, Load) pipelines, its simplicity and flexibility make it suitable for a wide range of tasks.
Its design is inspired by the following set of principles:
- Minimize complexity by embracing the Unix Philosophy.
- Maximize ease-of-use by reusing concepts from the POSIX standards as much as possible.
The result is a simple yet powerful library that is easy to learn.
It is also unopinionated. The library itself is written in Python, but a job can be composed of tasks written in virtually any language.
There are only three main concepts to learn in order to build a xETL job.
The Job is the highest level of abstraction in xETL. It outlines a sequence of Commands, their execution order, dependencies, and inputs.
It defines a sequence of tasks in the form of a Directed Acyclic Graph (DAG).
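To make the DAG idea concrete, here is a minimal Python sketch of how an execution order can be derived from command dependencies. It is not xETL's implementation: the `dependencies` mapping and the `execution_order` helper are assumptions made for illustration, and the ordering comes from the standard library's `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical mapping: each command name points to the set of commands
# that must finish before it can run.
dependencies = {
    "download": set(),
    "grayscale": {"download"},
}


def execution_order(deps):
    """Return an order in which every command runs after its dependencies."""
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)  # ['download', 'grayscale']
```

Because the graph must be acyclic, a job whose commands depend on each other in a loop has no valid order, and the sorter rejects it.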
The Command is a node in the Job's DAG. The term is taken from the Command Pattern, as it contains the necessary parameters for executing a single Task at a given time.
The Task is a minimal, reusable and composable unit of execution. While Jobs and Commands are purely metadata, a Task will actually execute a program.
The Task describes how to execute a program as well as its environment variables. It can run most types of executables, such as a Bash script, Python script, binary application, or shell utility. It can even run another nested xETL job, which can help break down more complex jobs.
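As a rough illustration of what "executing a program with its environment variables" can look like, the Python sketch below runs a script through an interpreter command and passes inputs through the environment. The `run_task` function is hypothetical and is not part of xETL's API. It also uses `/bin/sh -c` for portability where a task might specify `/bin/bash -c`.

```python
import os
import shlex
import subprocess


def run_task(interpreter, script, env):
    """Run `script` through `interpreter`, exposing `env` as environment variables."""
    # "/bin/sh -c" becomes ["/bin/sh", "-c"]; the script is passed as one argument.
    argv = shlex.split(interpreter) + [script]
    # Layer the task's variables on top of the current process environment.
    merged_env = {**os.environ, **env}
    completed = subprocess.run(argv, env=merged_env, capture_output=True, text=True)
    return completed.returncode, completed.stdout


code, out = run_task(
    "/bin/sh -c",
    'echo "saving to $OUTPUT"',
    {"OUTPUT": "/tmp/example.png"},
)
```

Passing inputs through the environment keeps the script itself unaware of the orchestrator: it reads `$OUTPUT` just as it would when run by hand.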
Let's build a simple job that does two things:
- download an image from a web server
- convert that image to grayscale
We'll start by defining a task for each of these activities.
tasks/download/manifest.yml
name: download
env:
  IMAGE_URL: URL to the image to download
  OUTPUT: File path to save the file
run:
  interpreter: /bin/bash -c
  script: |
    mkdir -p "$(dirname "$OUTPUT")"
    curl -o "$OUTPUT" "$IMAGE_URL"
tasks/grayscale/manifest.yml
name: grayscale
env:
  INPUT: File path to input image
  OUTPUT: File path to output image
run:
  interpreter: /bin/bash -c
  script: |
    mkdir -p "$(dirname "$OUTPUT")"
    convert "$INPUT" -colorspace Gray "$OUTPUT"
We can now write a job that will make use of these tasks:
job.yml
name: fetch-grayscale
description: Download an image and convert it to grayscale
data: ./data
tasks: ./tasks
commands:
  - task: download
    env:
      IMAGE_URL: https://www.python.org/static/img/[email protected]
      OUTPUT: ${job.data}/source/download.png
  - task: grayscale
    env:
      INPUT: ${previous.env.OUTPUT}
      OUTPUT: ${job.data}/final/grayscale.png
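The job above uses `${...}` placeholders such as `${job.data}` and `${previous.env.OUTPUT}`. The Python sketch below shows one plausible way such placeholders could be resolved. It assumes that `previous` refers to the preceding command and that a dotted path is a chain of dictionary lookups. The `resolve` and `lookup` helpers are hypothetical and do not reflect xETL's actual resolver, which may also normalize paths.

```python
import re

# Matches ${some.dotted.path} and captures the path inside the braces.
PLACEHOLDER = re.compile(r"\$\{([^}]+)\}")


def lookup(context, dotted_path):
    """Walk nested dictionaries following a dotted path such as 'job.data'."""
    value = context
    for key in dotted_path.split("."):
        value = value[key]
    return value


def resolve(text, context):
    """Replace every ${path} placeholder in `text` with its value from `context`."""
    return PLACEHOLDER.sub(lambda m: str(lookup(context, m.group(1))), text)


job = {"data": "./data"}

# First command: only job-level values are available.
download_env = {
    "OUTPUT": resolve("${job.data}/source/download.png", {"job": job}),
}

# Second command: `previous` exposes the already-resolved env of the first one.
grayscale_input = resolve(
    "${previous.env.OUTPUT}",
    {"job": job, "previous": {"env": download_env}},
)
print(grayscale_input)  # ./data/source/download.png
```

Resolving commands in order means each one only ever sees values that were already computed, which is what lets the grayscale step pick up the download step's output path without repeating it.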
That's it! This job can now be executed with:
$ python -m xetl example/job.yml
Loading job manifest at: /Users/user/src/xETL/example/job.yml
╭──╴Executing job: fetch-grayscale ╶╴╴╶ ╶
│ Parsed manifest for job: fetch-grayscale
│ Discovering tasks at paths: ['/Users/user/src/xETL/example/tasks']
│ Loading task at: /Users/user/src/xETL/example/tasks/download/manifest.yml
│ Loading task at: /Users/user/src/xETL/example/tasks/grayscale/manifest.yml
│ Available tasks detected:
│ - download
│ - grayscale
┏━━╸Executing command 1 of 2 ━╴╴╶ ╶
┃ name: null
┃ description: null
┃ task: download
┃ env:
┃ IMAGE_URL: https://www.python.org/static/img/[email protected]
┃ OUTPUT: /Users/user/src/xETL/example/data/download_source.png
┃ skip: false
┃╭──╴Executing task: download ─╴╴╶ ╶
┃│2024-02-05 22:21:48.633┊ % Total % Received % Xferd Average Speed Time Time Time Current
┃│2024-02-05 22:21:48.633┊ Dload Upload Total Spent Left Speed
┃│2024-02-05 22:21:48.633┊
┃│2024-02-05 22:21:48.743┊ 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
┃│2024-02-05 22:21:48.744┊ 100 15770 100 15770 0 0 139k 0 --:--:-- --:--:-- --:--:-- 140k
┃╰──╴Return code: 0 ─╴╴╶ ╶
┃
┏━━╸Executing command 2 of 2 ━╴╴╶ ╶
┃ name: null
┃ description: null
┃ task: grayscale
┃ env:
┃ INPUT: /Users/user/src/xETL/example/data/download_source.png
┃ OUTPUT: /Users/user/src/xETL/example/data/grayscale.png
┃ skip: false
┃╭──╴Executing task: grayscale ─╴╴╶ ╶
┃╰──╴Return code: 0 ─╴╴╶ ╶
│ Done! \o/