This project demonstrates how quickly and easily data pipelines can be managed with Airflow and Google Cloud Platform. As a sample, it integrates English Premier League data into BigQuery.
The daily data is loaded into Google Cloud Storage. An ETL job then starts automatically and imports it into BigQuery, where it can be analyzed.
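Conceptually, each DAG run does the equivalent of a BigQuery load from Cloud Storage. A minimal sketch of that step, assuming placeholder dataset, table, and object names rather than the ones the DAGs actually use:
# Load a CSV from Cloud Storage into a BigQuery table (all names are placeholders)
bq load --source_format=CSV --autodetect \
  epl.matchweek \
  gs://[YOUR_BUCKET]/data/matchweek.csv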
- Google Cloud SDK
- Before starting, update the components:
gcloud components update
- Accessing a Cloud Composer environment requires the Kubernetes command-line client kubectl:
gcloud components install kubectl
- jq - a lightweight command-line tool for JSON parsing and manipulation
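A quick way to check that the tooling is in place (output differs by version):
gcloud --version
kubectl version --client
jq --version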
Just run the ./scripts/google_init.sh script, which sets up a new GCP project, Cloud Composer, and BigQuery, and deploys this workflow.
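As a rough sketch, the script runs commands of this kind; the environment name, dataset name, and flags below are assumptions, so check the script itself for the exact commands:
# Create and configure the project (sketch only)
gcloud projects create ${GC_PROJECT_ID}
gcloud services enable composer.googleapis.com bigquery.googleapis.com
# Create the Composer environment and a BigQuery dataset (names are assumptions)
gcloud composer environments create ${GC_PROJECT_ID} --location ${LOCATION}
bq mk --dataset ${GC_PROJECT_ID}:epl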
The environment configuration shows some useful information, including the location of the Cloud Storage bucket and the link to the Airflow web UI.
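If you prefer the command line, the same details can be read with gcloud, assuming the Composer environment is named as in the trigger commands below (${GC_PROJECT_ID}):
# Print the DAG bucket and the Airflow web UI URL
gcloud composer environments describe ${GC_PROJECT_ID} \
  --location ${LOCATION} --format="value(config.dagGcsPrefix)"
gcloud composer environments describe ${GC_PROJECT_ID} \
  --location ${LOCATION} --format="value(config.airflowUri)"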
When the environment is in place, kick off the DAGs via the Airflow Web UI or run:
gcloud composer environments run ${GC_PROJECT_ID} \
--location ${LOCATION} trigger_dag -- matchweek_data_to_gc && \
gcloud composer environments run ${GC_PROJECT_ID} \
--location ${LOCATION} trigger_dag -- scorer_data_to_gc
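The state of the triggered runs can then be inspected with the list_dag_runs subcommand, for example:
gcloud composer environments run ${GC_PROJECT_ID} \
  --location ${LOCATION} list_dag_runs -- matchweek_data_to_gc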
Upload the data to Cloud Storage with:
./scripts/google_upload_data.sh
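Alternatively, a single file can be copied into the bucket by hand; the local file and bucket paths below are placeholders:
gsutil cp data/matchweek.csv gs://[YOUR_BUCKET]/data/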
To clean everything up afterwards, delete the project:
gcloud projects delete ${GC_PROJECT_ID}
Set the project ID as an environment variable:
export GC_PROJECT_ID=[YOUR_GC_PROJECT_ID]
Create a key for the service account and store it at airflow/data/keyfile.json.
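A minimal sketch of creating such a key with gcloud; the service account name is a placeholder:
gcloud iam service-accounts keys create airflow/data/keyfile.json \
  --iam-account=[SERVICE_ACCOUNT_NAME]@${GC_PROJECT_ID}.iam.gserviceaccount.com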
Start the Airflow webserver with docker-compose up, then execute the ./scripts/local/init.sh script to create the required variables and connections. Airflow will then be available at http://localhost:8080.
To set up a local development environment, install pipenv with pip install pipenv, then run SLUGIFY_USES_TEXT_UNIDECODE=yes pipenv install.
Open the folder with PyCharm and mark both dags/ and plugins/ as source folders.