Resources to enable SQL queries over Snowplow bad rows
A "bad row" is any payload that hits a Snowplow pipeline, but which cannot be successfully parsed or processed. Bad rows are preserved by writing them to cloud storage (e.g. S3 or GCS) as newline-delimited JSON objects.
Athena on AWS and BigQuery on GCP are tools that let you query your bad rows, using the cloud storage files as a back-end data source. For example:
SELECT data.failure.messages FROM adapter_failures
WHERE data.failure.timestamp > timestamp '2020-04-01'
This is great for debugging your pipeline without the need to load your bad rows into a separate database.
To create tables in Athena or BigQuery, you need to provide table definitions corresponding to the JSON schema of your bad row files. Each bad row type (e.g. schema violations, adapter failures) has its own JSON schema, and therefore requires its own separate table.
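As a rough idea of what such a table definition looks like, here is a minimal Athena sketch for adapter failures, assuming the OpenX JSON SerDe and a hypothetical S3 location; only a small subset of columns is shown, whereas the definitions in this repo cover each bad row type's complete JSON schema:

```sql
-- Illustrative sketch only: the column list is truncated and the S3 location
-- is a hypothetical placeholder; use the full table definitions from this repo.
CREATE EXTERNAL TABLE IF NOT EXISTS adapter_failures (
  data struct<
    failure: struct<
      messages: array<string>,
      vendor: string,
      version: string
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bad-rows-bucket/adapter_failures/';
```

Once the table exists, nested fields can be selected directly, as in the `data.failure.messages` query above.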
This repo is split into:
- Resources and instructions for Athena if your bad rows are in AWS S3
- Resources and instructions for BigQuery if your bad rows are in Google Cloud Storage