Resources to enable SQL queries over Snowplow bad rows
A "bad row" is any payload that hits a Snowplow pipeline, but which cannot be successfully parsed or processed. Bad rows are preserved by writing them to cloud storage (e.g. S3 or GCS) as newline-delimited JSON objects.
Athena on AWS and BigQuery on GCP are tools that let you query your bad rows, using the cloud storage files as a back-end data source. For example:
SELECT data.failure.messages FROM adapter_failures
WHERE data.failure.timestamp > timestamp '2020-04-01'
This is great for debugging your pipeline without the need to load your bad rows into a separate database.
To create tables in Athena or BigQuery, you need to provide table definitions corresponding to the JSON schema of your bad row files. Each bad row type (e.g. schema violations, adapter failures) has its own JSON schema, and therefore requires its own separate table.
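As a rough idea of what such a table definition looks like, here is a minimal Athena sketch for adapter failures, assuming the OpenX JSON SerDe and a hypothetical S3 location; only a small subset of columns is shown, whereas the definitions in this repo cover each bad row type's complete JSON schema:

```sql
-- Illustrative sketch only: the column list is truncated and the S3 location
-- is a hypothetical placeholder; use the full table definitions from this repo.
CREATE EXTERNAL TABLE IF NOT EXISTS adapter_failures (
  data struct<
    failure: struct<
      messages: array<string>,
      vendor: string,
      version: string
    >
  >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://your-bad-rows-bucket/adapter_failures/';
```

Once the table exists, nested fields can be selected directly, as in the `data.failure.messages` query above.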
This repo is split into:
- Resources and instructions for Athena if your bad rows are in AWS S3
- Resources and instructions for BigQuery if your bad rows are in Google Cloud Storage