book-sales-athena

This sample project explores the possibilities of merging two semi-structured json datasets using AWS Athena and deliver the result on a set of static GitLab pages.

Please note: This GitHub project provides an overview over the resources needed to build the solution. It isn't in itself runnable and requires an AWS account as well as a GitLab instance. If you'd like to build this yourself though, this repo should provide a good starting point.

Sample use case

A made-up company, AmazingBooks, Inc., merges with a competitor, MediocreBooks, Inc. For the launch of their new, merged online book store, the most popular books of both companies will be part of a special launch offer. The marketing folks now need to merge the book sales data of both companies. The problem is: We can't access any databases directly and all the IT staff of both companies can provide us with is a bunch of (incompatible) JSON exports with book sales data.

In order for the marketing team to work on campaigns for the most popular books, they need to have a full view on the data:

What are the most popular books overall?

To answer the question, we need to merge the data from both the AmazingBooks and the MediocreBooks systems. Athena to the rescue!

Technologies used

GitLab CI/CD
GitLab Pages with MkDocs
AWS Athena
AWS Lambda
AWS S3
docker and a number of shell tools such as jq and the AWS CLI

How it works

The data pipeline is held together by a GitLab CI/CD pipeline (see gitlab-ci.yaml) and looks as follows:

Data is delivered by the source systems (subsequently referred to as system a and system b) in json format and stored to s3://your-bucket-name-here/data-delivery/.
The data is sanatized (in this case, brought into json lines format) and stored in separate s3 directories (s3://your-bucket-name-here/system-a/; s3://your-bucket-name-here/system-b/) for Athena to pick up
A lambda function is invoked to trigger an Athena query (this assumes that the required Athena database, tables, and views are present to generate the report)
Upon triggering, the result file is retrieved from the results directory (s3://your-bucket-name-here/query-results/lambda-triggered/)
MkDocs generation is triggered and the result file is copied over to the static website directory once the build is finished

Athena Resources

Please find DDL statements for the required Athena resources in the aws-recources/athena subdirectory.

Lambda resources

Please find the code for the Lambda function in the aws-recources/lambda subdirectory.

Sample data

Please find the JSON sample data in the testdata subdirectory.

IAM Resources

The following IAM Resources are required for things to work:

IAM Exec Role for the lambda function
- Policy for Athena Execution
- Policy for s3 access
IAM User for GitLab with a set of AWS access credentials for GitLab to use
- Policy for S3 access
- Policy for Lambda invocation

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
aws-resources		aws-resources
docs		docs
testdata		testdata
.gitignore		.gitignore
.gitlab-ci.yml		.gitlab-ci.yml
LICENSE		LICENSE
README.md		README.md
mkdocs.yml		mkdocs.yml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

book-sales-athena

Sample use case

Technologies used

How it works

Athena Resources

Lambda resources

Sample data

IAM Resources

About

Releases

Packages

Languages

License

codecentric/athena-gitlab-data-pipeline

Folders and files

Latest commit

History

Repository files navigation

book-sales-athena

Sample use case

Technologies used

How it works

Athena Resources

Lambda resources

Sample data

IAM Resources

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages