Google-BigQuery-Airflow

This project shows how quickly and easily data can be managed with Apache Airflow and the Google Cloud Platform. As an example, it integrates English Premier League data into BigQuery.

The daily data is uploaded to Google Cloud Storage. The ETL job then starts automatically and imports the data into BigQuery, where it can be analyzed.
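Conceptually, the import step amounts to a BigQuery load job against the files in the bucket. A rough manual equivalent is sketched below; the dataset, table and file names are only illustrative, the actual logic lives in the DAGs under dags/.

# Illustrative only: load one CSV from the bucket into a BigQuery table
bq load --source_format=CSV --autodetect --skip_leading_rows=1 \
    premier_league.matchweek gs://${BUCKET}/data/matchweek.csv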

Requirements

  • Google Cloud SDK
    • Before you start, update the components:
      gcloud components update
    • Accessing a Cloud Composer environment requires the Kubernetes command-line client kubectl:
      gcloud components install kubectl
  • jq - a lightweight command-line tool for parsing and manipulating JSON (used below to read values from gcloud's JSON output)

Just run the ./scripts/google_init.sh script, which sets up a new GCP project, Cloud Composer and BigQuery, and deploys this workflow.
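For orientation, the setup roughly boils down to the following gcloud and bq calls; the outline below is illustrative, the authoritative steps are in scripts/google_init.sh.

# Illustrative outline only - see scripts/google_init.sh for the real steps
gcloud projects create ${GC_PROJECT_ID}
gcloud services enable composer.googleapis.com bigquery.googleapis.com
gcloud composer environments create ${GC_PROJECT_ID} --location ${LOCATION}
bq mk --dataset ${GC_PROJECT_ID}:premier_league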

The environment's configuration provides some useful information, including the location of the Cloud Storage bucket and the link to the Airflow Web UI.
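Both values can also be read from the command line with gcloud and jq, assuming the environment is named after the project as in the commands below:

gcloud composer environments describe ${GC_PROJECT_ID} \
	 --location ${LOCATION} --format=json | jq -r '.config.dagGcsPrefix, .config.airflowUri'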

When the environment is in place, kick off the DAGs via the Airflow Web UI or run:

gcloud composer environments run ${GC_PROJECT_ID} \
	 --location ${LOCATION} trigger_dag -- matchweek_data_to_gc

gcloud composer environments run ${GC_PROJECT_ID} \
	 --location ${LOCATION} trigger_dag -- scorer_data_to_gc
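To check whether the runs went through, the same wrapper can invoke Airflow's list_dag_runs command, for example:

gcloud composer environments run ${GC_PROJECT_ID} \
	 --location ${LOCATION} list_dag_runs -- scorer_data_to_gc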

Upload Premier League Data

Upload the data to Storage with:

./scripts/google_upload_data.sh
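If you prefer to copy files by hand, gsutil does the same job; the bucket name and paths below are placeholders:

gsutil cp data/*.csv gs://${BUCKET}/data/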

Cleaning up

gcloud projects delete ${GC_PROJECT_ID}
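Deleting the project removes everything, including the Composer environment and the BigQuery dataset. If you only want to tear down the (billed) Composer environment and keep the project, you can delete the environment on its own, again assuming it is named after the project:

gcloud composer environments delete ${GC_PROJECT_ID} --location ${LOCATION}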

Local Deployment

Set the project ID as an environment variable.

export GC_PROJECT_ID=[YOUR_GC_PROJECT_ID]

Create a key for the Service Account and store it at airflow/data/keyfile.json.
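The key can be created with gcloud; the service account name below is only an example, use an account that has the required BigQuery and Cloud Storage permissions:

gcloud iam service-accounts keys create airflow/data/keyfile.json \
	 --iam-account="airflow@${GC_PROJECT_ID}.iam.gserviceaccount.com"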

Start the Airflow webserver with docker-compose up, then execute the ./scripts/local/init.sh script to create the required Airflow variables and connections.

Airflow will be available via http://localhost:8080.

Development

To set up a local development environment, install pipenv: pip install pipenv.

Then run SLUGIFY_USES_TEXT_UNIDECODE=yes pipenv install.

Open the folder with PyCharm and mark both dags/ and plugins/ as source folders.
