Zyp: A compact transformation engine
A data model and implementation for a compact transformation engine
written in Python.

- Based on JSON Pointer (RFC 6901), JMESPath, and transon
- Implemented using `attrs` and `cattrs`
- Includes built-in transformation functions `to_datetime` and
  `to_unixtime`
- Ability to marshal and unmarshal its representation to/from JSON and
  YAML
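
As an illustration of the approach described above, here is a minimal sketch of applying converter functions to values addressed by JSON Pointer (RFC 6901). It uses only the standard library. The names `split_pointer`, `convert_at`, and this `to_unixtime` are hypothetical helpers written for this example and are not Zyp's actual API; the naive-timestamps-are-UTC rule is likewise an assumption.

```python
from datetime import datetime, timezone
from typing import Any, Callable, List


def split_pointer(pointer: str) -> List[str]:
    """Split an RFC 6901 JSON Pointer into unescaped reference tokens."""
    if pointer == "":
        return []
    if not pointer.startswith("/"):
        raise ValueError(f"Invalid JSON Pointer: {pointer!r}")
    # Per RFC 6901, decode `~1` first, then `~0`.
    return [t.replace("~1", "/").replace("~0", "~") for t in pointer[1:].split("/")]


def convert_at(document: Any, pointer: str, converter: Callable[[Any], Any]) -> None:
    """Replace the value addressed by `pointer` with `converter(value)`, in place."""
    tokens = split_pointer(pointer)
    if not tokens:
        raise ValueError("Cannot replace the document root in place")
    parent = document
    # Walk down to the container holding the addressed value.
    for token in tokens[:-1]:
        parent = parent[int(token)] if isinstance(parent, list) else parent[token]
    key = int(tokens[-1]) if isinstance(parent, list) else tokens[-1]
    parent[key] = converter(parent[key])


def to_unixtime(value: str) -> float:
    """Hypothetical converter: ISO 8601 string to Unix time (assumes UTC when naive)."""
    dt = datetime.fromisoformat(value)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.timestamp()


record = {"meta": {"timestamp": "2024-08-14T12:00:00"}, "value": "42.5"}
convert_at(record, "/meta/timestamp", to_unixtime)
convert_at(record, "/value", float)
```

Addressing values by pointer keeps the transformation rules declarative: a rule is just a pair of pointer and converter, which is the kind of structure that can be serialized to JSON or YAML.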
amotl committed Aug 14, 2024
1 parent 998ce02 commit 46f895b
Showing 38 changed files with 1,727 additions and 15 deletions.
64 changes: 63 additions & 1 deletion .github/workflows/tests.yml
@@ -104,7 +104,7 @@ jobs:
          pip install "setuptools>=64" --upgrade
          # Install package in editable mode.
          pip install --use-pep517 --prefer-binary --editable=.[develop,test,mongodb]
          pip install --use-pep517 --prefer-binary --editable=.[mongodb,develop,test]
      - name: Run linters and software tests
        run: poe check
@@ -120,3 +120,65 @@ jobs:
          env_vars: OS,PYTHON
          name: codecov-umbrella
          fail_ci_if_error: true


  test-zyp:
    name: "
    Zyp: Python ${{ matrix.python-version }}
    "
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: ['ubuntu-latest']
        python-version: ['3.8', '3.9', '3.12']

    env:
      OS: ${{ matrix.os }}
      PYTHON: ${{ matrix.python-version }}

    steps:

      - name: Acquire sources
        uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          architecture: x64
          cache: 'pip'
          cache-dependency-path:
            pyproject.toml

      - name: Set up project
        run: |
          # `setuptools 64.0.0` adds support for editable install hooks (PEP 660).
          # https://github.com/pypa/setuptools/blob/main/CHANGES.rst#v6400
          pip install "setuptools>=64" --upgrade
          # Install package in editable mode.
          pip install --use-pep517 --prefer-binary --editable=.[zyp,develop,test]
      - name: Set timezone
        uses: szenius/[email protected]
        with:
          timezoneLinux: "Europe/Berlin"
          timezoneMacos: "Europe/Berlin"
          timezoneWindows: "European Standard Time"

      - name: Run linters and software tests
        run: poe check

      # https://github.com/codecov/codecov-action
      - name: Upload coverage results to Codecov
        uses: codecov/codecov-action@v4
        env:
          CODECOV_TOKEN: ${{ secrets.CODECOV_TOKEN }}
        with:
          files: ./coverage.xml
          flags: zyp
          env_vars: OS,PYTHON
          name: codecov-umbrella
          fail_ci_if_error: true
2 changes: 2 additions & 0 deletions CHANGES.md
@@ -1,6 +1,8 @@
# Changelog

## Unreleased
- Added `BucketTransformation`, a minimal transformation engine
based on JSON Pointer (RFC 6901).
- Added documentation using Sphinx and Read the Docs

## 2024/08/05 v0.0.3
20 changes: 13 additions & 7 deletions README.md
@@ -10,7 +10,6 @@
[![License](https://img.shields.io/pypi/l/commons-codec.svg)](https://pypi.org/project/commons-codec/)

## About
Data decoding, encoding, conversion, and translation utilities.

> A codec is a device or computer program that encodes or decodes a data stream or signal.
> Codec is a portmanteau of coder/decoder.
@@ -21,18 +20,23 @@ Data decoding, encoding, conversion, and translation utilities.
> -- https://en.wikipedia.org/wiki/Codec
## What's Inside
- **Decoders:** A collection of reusable utilities with minimal dependencies for
transcoding purposes, mostly collected from other projects like
- [Change Data Capture (CDC)]: **Transformer components** for converting CDC event messages to
SQL statements.

- A collection of reusable utilities with minimal dependencies for
**decoding and transcoding** purposes, mostly collected from other projects like
[Kotori](https://kotori.readthedocs.io/) and [LorryStream](https://lorrystream.readthedocs.io/),
in order to provide them as a standalone package for broader use cases.

- Transformers for [Change Data Capture (CDC)] messages to SQL statements.
- [Zyp], a generic and compact **transformation engine** written in Python, for data
decoding, encoding, conversion, translation, transformation, and cleansing purposes,
to be used as a pipeline element for data pre- and/or post-processing.

## Installation
The package is available from [PyPI] at [commons-codec].
To install the most recent version, run:
To install the most recent version, including support for MongoDB and Zyp, run:
```shell
pip install --upgrade commons-codec
pip install --upgrade 'commons-codec[mongodb,zyp]'
```

## Usage
@@ -47,7 +51,7 @@ Kudos to the authors of all the many software components this library is
vendoring and building upon.

### Similar Projects
See [prior art].
See [prior art] and [Zyp research].

### Contributing
The `commons-codec` package is an open source project, and is
@@ -69,8 +73,10 @@ within the header sections of relevant files.
[Apache Commons Codec]: https://commons.apache.org/proper/commons-codec/
[Change Data Capture (CDC)]: https://en.wikipedia.org/wiki/Change_data_capture
[commons-codec]: https://pypi.org/project/commons-codec/
[Zyp research]: https://commons-codec.readthedocs.io/zyp/research.html
[documentation]: https://commons-codec.readthedocs.io/
[examples]: https://github.com/daq-tools/commons-codec/tree/main/examples
[managed on GitHub]: https://github.com/daq-tools/commons-codec
[prior art]: https://commons-codec.readthedocs.io/prior-art.html
[PyPI]: https://pypi.org/
[Zyp]: https://commons-codec.readthedocs.io/zyp/
3 changes: 3 additions & 0 deletions doc/backlog.md
@@ -12,3 +12,6 @@
- [ ] MongoDB: Implement stream resumption using `start_after`
- [ ] Feature: Filter by events, e.g. Ignore "delete" events?
- [ ] Integration Testing the "example" programs?
- [ ] Improve capabilities of DMS translator
https://github.com/daq-tools/commons-codec/issues/11
- https://github.com/supabase/pg_replicate
2 changes: 1 addition & 1 deletion doc/decode.md
@@ -1,4 +1,4 @@
# Various Decoders
# Decoder Collection

`commons-codec` includes telemetry data decoders for individual popular sensor
appliances.
3 changes: 2 additions & 1 deletion doc/index.md
@@ -23,7 +23,7 @@
| [LorryStream]

```{include} readme.md
:start-line: 12
:start-line: 11
```


@@ -34,6 +34,7 @@
cdc/index
decode
zyp/index
```

```{toctree}
48 changes: 48 additions & 0 deletions doc/zyp/backlog.md
@@ -0,0 +1,48 @@
# Zyp Backlog

## Iteration +1
- Refactor module namespace to `zyp`
- Documentation
- CLI interface
- Apply to MongoDB Table Loader in CrateDB Toolkit

## Iteration +2
Demonstrate!
- math expressions
- omit key (recursively)
- combine keys
- filter on keys and/or values
- Pathological cases like "Not defined" in typed fields like `TIMESTAMP`
- Use simpleeval, like Meltano, and provide the same built-in functions
- https://sdk.meltano.com/en/v0.39.1/stream_maps.html#other-built-in-functions-and-names
- https://github.com/MeltanoLabs/meltano-map-transform/pull/255
- https://github.com/MeltanoLabs/meltano-map-transform/issues/252
- Use JSONPath, see https://sdk.meltano.com/en/v0.39.1/code_samples.html#use-a-jsonpath-expression-to-extract-the-next-page-url-from-a-hateoas-response
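
One of the items above, omitting a key recursively, can be sketched with the standard library alone. This is a minimal illustration of the idea, not part of Zyp; the function name `omit_keys` and the sample document are assumptions made for the example.

```python
from typing import Any, Set


def omit_keys(data: Any, keys: Set[str]) -> Any:
    """Return a copy of `data` with the given keys removed at every nesting level."""
    if isinstance(data, dict):
        # Drop matching keys, then recurse into the remaining values.
        return {k: omit_keys(v, keys) for k, v in data.items() if k not in keys}
    if isinstance(data, list):
        return [omit_keys(item, keys) for item in data]
    # Scalars are returned unchanged.
    return data


document = {"_id": 1, "name": "foo", "children": [{"_id": 2, "name": "bar"}]}
cleaned = omit_keys(document, {"_id"})
```

The function builds new containers instead of mutating its input, so the original document stays available for later pipeline stages.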

## Iteration +3
- Moksha transformations on Buckets
- Investigate using JSON Schema
- Fluent API interface
- https://github.com/Halvani/alphabetic
- Mappers do not support external API lookups.
To add external API lookups, you can either (a) land all your data and
then join it using a transformation tool like dbt, or (b) create a custom
mapper plugin with inline lookup logic.
=> Example from Luftdatenpumpe, using a reverse geocoder
- [ ] Define schema
https://sdk.meltano.com/en/latest/typing.html
- https://docs.meltano.com/guide/v2-migration/#migrate-to-an-adapter-specific-dbt-transformer
- https://github.com/meltano/sdk/blob/v0.39.1/singer_sdk/mapper.py

## Fluent API Interface

```python

from zyp.model.fluent import FluentTransformation

# Parentheses let the method chain span several lines.
transformation = (
    FluentTransformation()
    .jmes("records[?starts_with(location, 'B')]")
    .rename_fields({"_id": "id"})
    .convert_values({"/id": "int", "/value": "float"}, type="pointer-python")
    .jq(".[] |= (.value /= 100)")
)
```
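
The chaining style shown above relies on each method returning the object itself. A minimal sketch of that pattern follows, using only the standard library. The class `FluentPipeline` and its `apply` method are hypothetical stand-ins for illustration; they do not reproduce `FluentTransformation`, and field names are matched by plain key rather than by JSON Pointer.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


class FluentPipeline:
    """Hypothetical stand-in that collects transformation steps via method chaining."""

    def __init__(self) -> None:
        self._steps: List[Callable[[Record], Record]] = []

    def rename_fields(self, mapping: Dict[str, str]) -> "FluentPipeline":
        self._steps.append(
            lambda record: {mapping.get(key, key): value for key, value in record.items()}
        )
        return self  # Returning `self` is what enables chaining.

    def convert_values(self, converters: Dict[str, Callable[[Any], Any]]) -> "FluentPipeline":
        self._steps.append(
            lambda record: {
                key: converters[key](value) if key in converters else value
                for key, value in record.items()
            }
        )
        return self

    def apply(self, record: Record) -> Record:
        # Run the collected steps in the order they were declared.
        for step in self._steps:
            record = step(record)
        return record


pipeline = (
    FluentPipeline()
    .rename_fields({"_id": "id"})
    .convert_values({"id": int, "value": float})
)
result = pipeline.apply({"_id": "123", "value": "42.42"})
```

Collecting steps first and applying them later keeps the pipeline definition separate from the data, so one definition can be reused across many records.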