Singer/Meltano: Add examples {singerfile,github}-to-cratedb #190

Draft
wants to merge 2 commits into
base: main
72 changes: 72 additions & 0 deletions .github/workflows/test-singer-meltano.yml
@@ -0,0 +1,72 @@
name: Singer/Meltano

on:
  pull_request:
    branches: ~
    paths:
    - '.github/workflows/test-singer-meltano.yml'
    - 'framework/singer-meltano/**'
    - 'requirements.txt'
  push:
    branches: [ main ]
    paths:
    - '.github/workflows/test-singer-meltano.yml'
    - 'framework/singer-meltano/**'
    - 'requirements.txt'

  # Allow job to be triggered manually.
  workflow_dispatch:

  # Run job each night after CrateDB nightly has been published.
  schedule:
    - cron: '0 3 * * *'

# Cancel in-progress jobs when pushing to the same branch.
concurrency:
  cancel-in-progress: true
  group: ${{ github.workflow }}-${{ github.ref }}

jobs:
  test:
    name: "
     Python: ${{ matrix.python-version }}
     CrateDB: ${{ matrix.cratedb-version }}
     on ${{ matrix.os }}"
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false
      matrix:
        os: [ 'ubuntu-latest' ]
        python-version: [ '3.10', '3.11' ]
        cratedb-version: [ 'nightly' ]

    services:
      cratedb:
        image: crate/crate:nightly
        ports:
          - 4200:4200
          - 5432:5432

    steps:

      - name: Acquire sources
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          architecture: x64
          cache: 'pip'
          cache-dependency-path: |
            requirements.txt
            framework/singer-meltano/requirements.txt
            framework/singer-meltano/requirements-dev.txt

      - name: Install utilities
        run: |
          pip install -r requirements.txt

      - name: Validate framework/singer-meltano
        run: |
          ngr test --accept-no-venv framework/singer-meltano
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,8 +1,9 @@
.DS_Store
.idea
.env
.venv*
__pycache__
.coverage
coverage.xml
mlruns/
archive/
logs.log
2 changes: 2 additions & 0 deletions framework/singer-meltano/.gitignore
@@ -0,0 +1,2 @@
.meltano
output
45 changes: 45 additions & 0 deletions framework/singer-meltano/README.md
@@ -0,0 +1,45 @@
# Meltano Examples

Concise examples of working with [CrateDB] and [Meltano] to design and
run flexible ELT tasks. All recipes use [meltano-target-cratedb] to load
data into CrateDB.

## What's inside

- `singerfile-to-cratedb`: Acquire data from a Singer file, and load it into
  a CrateDB database table.

- `github-to-cratedb`: Acquire repository metadata from the GitHub API, and
  load it, separated per entity, into 32 CrateDB database tables.

## Prerequisites

Before running the examples in the subdirectories, make sure to install
Meltano and its dependencies.

```shell
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

## Usage

Then explore the individual Meltano projects: either invoke them from within
their directories, or use the `--cwd` option from the root folder.

```shell
meltano --cwd github-to-cratedb install
meltano --cwd github-to-cratedb run tap-github target-cratedb
```

## Software Tests
```shell
pip install -r requirements-dev.txt
poe check
```


[CrateDB]: https://cratedb.com/product
[Meltano]: https://meltano.com/
[meltano-target-cratedb]: https://github.com/crate-workbench/meltano-target-cratedb
82 changes: 82 additions & 0 deletions framework/singer-meltano/github-to-cratedb/README.md
@@ -0,0 +1,82 @@
# Meltano GitHub -> CrateDB example

## About

Acquire repository metadata from the GitHub API, and insert it into CrateDB
database tables, using [meltano-target-cratedb].

It follows the canonical example demonstrated in the [Meltano Getting Started Tutorial].

## Configuration

### tap-github

For accessing the GitHub API, you will need an authentication token. It
can be acquired at [GitHub Developer Settings » Tokens].

To configure the recipe, store the token in the `TAP_GITHUB_AUTH_TOKEN`
environment variable, either interactively, or by creating a dotenv
configuration file `.env`.

```shell
TAP_GITHUB_AUTH_TOKEN='ghp_hmQR3XTFWkfIcuyjRTBuVrRt6mnL1j2mMPT8'
```

Then, in `meltano.yml`, identify the `tap-github` section in `plugins.extractors`,
and adjust the value of `config.repositories` to correspond to the repository
you intend to scrape.

### target-cratedb

Within the `target-cratedb` section of `plugins.loaders`, adjust
`config.sqlalchemy_url` to match your database connectivity settings.
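Plugin settings can usually be supplied through environment variables instead of editing `meltano.yml`. The sketch below derives the variable name for a given plugin and setting. It assumes Meltano's convention of `<PLUGIN>_<SETTING>` in upper case with dashes replaced by underscores, which is consistent with `TAP_GITHUB_AUTH_TOKEN` above; the helper name `meltano_env_var` is hypothetical, and nested setting names are not handled.

```python
def meltano_env_var(plugin_name: str, setting_name: str) -> str:
    """Derive the environment variable name for a plugin setting.

    Assumption: upper-cased `<plugin>_<setting>`, with dashes
    replaced by underscores.
    """
    return f"{plugin_name}_{setting_name}".replace("-", "_").upper()


print(meltano_env_var("tap-github", "auth_token"))
# → TAP_GITHUB_AUTH_TOKEN
print(meltano_env_var("target-cratedb", "sqlalchemy_url"))
# → TARGET_CRATEDB_SQLALCHEMY_URL
```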


## Usage

Install dependencies.
```shell
meltano install
```

Invoke data transfer to JSONL files.
```shell
meltano run tap-github target-jsonl
cat output/commits.jsonl
```

Invoke data transfer to CrateDB database.
```shell
meltano run tap-github target-cratedb
```

## Screenshot

Enjoy the release notes.
```sql
SELECT repo, tag_name, body FROM melty.releases ORDER BY tag_name DESC;
```

![image](https://github.com/crate-workbench/cratedb-toolkit/assets/453543/ac37c9cc-8e42-4c7c-84aa-64498bf48f4d)
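The same query can be issued without a SQL client, through CrateDB's HTTP endpoint at `/_sql`. This is a minimal sketch using only the Python standard library. It assumes CrateDB listens on `localhost:4200` without authentication, and the function names `build_sql_request` and `run_query` are hypothetical.

```python
import json
import urllib.request


def build_sql_request(stmt: str, base_url: str = "http://localhost:4200") -> urllib.request.Request:
    """Build a POST request for CrateDB's HTTP endpoint, carrying one SQL statement."""
    payload = json.dumps({"stmt": stmt}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_sql",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(stmt: str, base_url: str = "http://localhost:4200"):
    """Submit the statement and return a `(column names, rows)` tuple."""
    request = build_sql_request(stmt, base_url)
    with urllib.request.urlopen(request, timeout=10) as response:
        result = json.load(response)
    return result["cols"], result["rows"]
```

For example, `run_query("SELECT repo, tag_name FROM melty.releases ORDER BY tag_name DESC")` would return the column names and the matching rows once the pipeline has loaded data.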

## Troubleshooting

If you see an error like the following in the output, verify the GitHub
authentication token stored in the `TAP_GITHUB_AUTH_TOKEN` environment variable.
```python
singer_sdk.exceptions.RetriableAPIError: 401 Client Error: b'{"message":"This endpoint requires you to be authenticated.","documentation_url":"https://docs.github.com/graphql/guides/forming-calls-with-graphql#authenticating-with-graphql"}' (Reason: Unauthorized) for path: /graphql cmd_type=elb consumer=False name=tap-github producer=True stdio=stderr string_id=tap-github
```
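To rule out a bad token before re-running the pipeline, you can probe the GitHub REST API directly. The sketch below sends an authenticated request to `https://api.github.com/user` and treats a 401 response as an invalid token. The function names `build_auth_request` and `token_is_valid` are hypothetical, and the check says nothing about whether the token has the scopes the tap needs.

```python
import urllib.error
import urllib.request


def build_auth_request(token: str) -> urllib.request.Request:
    """Build an authenticated request against the GitHub REST API."""
    return urllib.request.Request(
        "https://api.github.com/user",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )


def token_is_valid(token: str) -> bool:
    """Return True if GitHub accepts the token, False on 401 Unauthorized."""
    try:
        with urllib.request.urlopen(build_auth_request(token), timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError as error:
        if error.code == 401:
            return False
        # Other failures (rate limiting, server errors) are not token problems.
        raise
```

A typical call is `token_is_valid(os.environ["TAP_GITHUB_AUTH_TOKEN"])`, after importing `os`.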

## Development
In order to link the sandbox to a development installation of [meltano-target-cratedb],
configure the `pip_url` of the component like this:
```yaml
pip_url: --editable=/path/to/sources/meltano-target-cratedb
```


[GitHub Developer Settings » Tokens]: https://github.com/settings/tokens
[Meltano Getting Started Tutorial]: https://docs.meltano.com/getting-started/part1
[meltano-target-cratedb]: https://github.com/crate-workbench/meltano-target-cratedb
[tap-github]: https://hub.meltano.com/extractors/tap-github/
[target-jsonl]: https://hub.meltano.com/loaders/target-jsonl/
51 changes: 51 additions & 0 deletions framework/singer-meltano/github-to-cratedb/meltano.yml
@@ -0,0 +1,51 @@
# A Meltano project is just a directory on your filesystem containing text-based files.
# At a minimum, a Meltano project must contain a project file named `meltano.yml`,
# which contains your project configuration, and tells Meltano that a particular
# directory is a Meltano project.
---
version: 1
default_environment: dev
send_anonymous_usage_stats: false
project_id: f14797b9-9d1c-414c-851c-c91e08ddbc2e

environments:
- name: dev
- name: staging
- name: prod

plugins:

  # Configure data source.
  # In Singer jargon, it is an "extractor", wrapped into a "tap".
  extractors:

  - name: tap-github
    variant: cratedb
    namespace: cratedb
    pip_url: git+https://github.com/crate-workbench/tap-github.git@cratedb
    # Note: Configure your GitHub repository here.
    config:
      start_date: '2023-12-01'
      repositories:
      - crate-workbench/cratedb-toolkit

  # Configure data sinks.
  # In Singer jargon, it is a "loader", wrapped into a "target".
  loaders:

  - name: target-jsonl
    variant: andyh1203
    pip_url: target-jsonl

  - name: target-cratedb
    namespace: cratedb
    variant: cratedb
    # Acquire from PyPI.
    pip_url: meltano-target-cratedb
    # Acquire from GitHub.
    # pip_url: git+https://github.com/crate-workbench/meltano-target-cratedb.git

    # Note: Configure your database server and credentials here.
    config:
      sqlalchemy_url: crate://crate@localhost/
      add_record_metadata: true