This project streams tweets from the Twitter API into Google BigQuery, with Google Cloud Storage (GCS) serving as a dead-letter store for records that fail processing. The pipeline processes real-time data with Apache Beam and makes it available for transformation and analysis in BigQuery. Provisioning of GCP resources is automated with Terraform.
- Real-Time Streaming: Streams tweets into Google Pub/Sub for ingestion.
- Dual Pipelines (see the sketch after this list):
  - ETL Pipeline: Aggregates tweet counts per minute and writes them to BigQuery.
  - ELT Pipeline: Streams raw tweets into BigQuery for further transformation.
- Dead-Letter Handling: Stores unprocessable tweets in Google Cloud Storage.
- Scalability: Uses Google Dataflow for distributed data processing.
- Automated GCP Provisioning: Deploys GCP resources like Pub/Sub, BigQuery, and GCS using Terraform.
- Customizable: Easily adaptable to different Twitter stream rules and GCP setups.
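
As a rough illustration of the ETL side, the sketch below shows a per-minute windowed count in Apache Beam. The subscription argument, the one-minute window, and the trailing `WriteToBigQuery` step are assumptions for illustration, not copied from the project code:

```python
# Illustrative sketch (not the project's exact code): count tweets per
# one-minute window after reading them from a Pub/Sub subscription.
import json

import apache_beam as beam
from apache_beam.transforms import window


def minute_counts(pipeline, subscription):
    return (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=subscription)
        | "ParseJson" >> beam.Map(json.loads)
        # Assign each tweet to a fixed one-minute window.
        | "WindowPerMinute" >> beam.WindowInto(window.FixedWindows(60))
        # Count the tweets that fall into each window.
        | "PairWithOne" >> beam.Map(lambda _tweet: ("tweets", 1))
        | "CountPerMinute" >> beam.CombinePerKey(sum)
        # A beam.io.WriteToBigQuery step (table taken from configuration)
        # would follow here in the real pipeline.
    )
```

The ELT pipeline skips the aggregation step and writes the parsed records straight to BigQuery.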
- Programming Language: Python
- Frameworks/Libraries:
  - Apache Beam: Data processing
  - Tweepy: Twitter API integration (see the streaming sketch below)
  - Google Cloud Pub/Sub: Message ingestion
  - BigQuery Schema Generator: Dynamic schema generation
- Infrastructure as Code: Terraform
- Cloud Platform: Google Cloud Platform (GCP)
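
To show how Tweepy and Pub/Sub fit together, here is a minimal, hypothetical forwarder. The class name and the example stream rule are assumptions; the environment-variable names match the ones listed later in this README:

```python
# Hypothetical sketch: forward matching tweets from the Twitter v2 stream
# into a Pub/Sub topic as JSON messages.
import json
import os

import tweepy
from google.cloud import pubsub_v1


class PubSubForwarder(tweepy.StreamingClient):
    """Publishes each incoming tweet to a Pub/Sub topic."""

    def __init__(self, bearer_token, project_id, topic_id):
        super().__init__(bearer_token)
        self._publisher = pubsub_v1.PublisherClient()
        self._topic_path = self._publisher.topic_path(project_id, topic_id)

    def on_tweet(self, tweet):
        payload = json.dumps({"id": tweet.id, "text": tweet.text}).encode("utf-8")
        self._publisher.publish(self._topic_path, data=payload)


if __name__ == "__main__":
    client = PubSubForwarder(
        os.environ["TWITTER_BEARER_TOKEN"],
        os.environ["GCP_PROJECT_ID"],
        os.environ["PUBSUB_TOPIC"],
    )
    # The rule value is only an example; the project's own stream rules may differ.
    client.add_rules(tweepy.StreamRule("data engineering lang:en"))
    client.filter()
```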
- Sign up for a Twitter Developer account.
- Create an app and generate a Bearer Token.
- Create a GCP account and a project.
- Ensure billing is enabled for the project.
- Install the following tools:
  - Terraform: Provision GCP resources.
  - Python 3.8+: Run the pipeline.
  - Google Cloud SDK: Authenticate and manage GCP resources.
- Navigate to the `terraform` directory and apply the Terraform scripts:

  ```bash
  cd terraform
  terraform init
  terraform apply
  ```

- Set up a Python virtual environment and install the required libraries:

  ```bash
  make setup
  make install
  ```
- Open two terminal windows:

  - Terminal 1: Stream tweets into Pub/Sub:

    ```bash
    make stream_tweets
    ```

  - Terminal 2: Process tweets into BigQuery:

    ```bash
    make process_tweets
    ```

- Logs are written to `tweet.log` for debugging.
- Data is written to BigQuery tables or to GCS (dead-letter).
- Destroy GCP resources to save costs:

  ```bash
  terraform destroy
  ```
- **Input:** Streams tweets from the Twitter API into Google Pub/Sub.
- **Processing:**
  - ETL Pipeline: Aggregates tweets per minute.
  - ELT Pipeline: Writes raw tweets to BigQuery.
- **Output:**
  - Processed data is written to BigQuery.
  - Failed records are stored in GCS for inspection (see the dead-letter sketch below).
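
For the dead-letter path, a common Beam pattern is to tag records that fail parsing and write them out separately. The sketch below assumes that pattern; the function names and the output prefix are illustrative, not the project's exact implementation:

```python
# Illustrative dead-letter routing: records that fail JSON parsing are tagged
# and written out to GCS instead of breaking the main pipeline.
import json

import apache_beam as beam
from apache_beam import pvalue
from apache_beam.transforms import window


class ParseTweet(beam.DoFn):
    def process(self, message):
        try:
            yield json.loads(message.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            # Route unparseable payloads to the 'dead_letter' output.
            yield pvalue.TaggedOutput("dead_letter", message)


def split_good_and_bad(messages, dead_letter_prefix):
    results = messages | "Parse" >> beam.ParDo(ParseTweet()).with_outputs(
        "dead_letter", main="parsed"
    )
    (
        results.dead_letter
        | "DecodeForInspection" >> beam.Map(lambda raw: raw.decode("utf-8", errors="replace"))
        # Window before writing so the file sink also works on an unbounded stream.
        | "WindowDeadLetters" >> beam.WindowInto(window.FixedWindows(300))
        | "WriteDeadLetters" >> beam.io.WriteToText(dead_letter_prefix, num_shards=1)
    )
    return results.parsed
```

Here `dead_letter_prefix` would point at a path under the dead-letter bucket, for example `gs://<DEAD_LETTER_BUCKET>/failed/`.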
Ensure the following environment variables are set in the Makefile:

- `TWITTER_BEARER_TOKEN`: Your Twitter API Bearer Token.
- `GCP_PROJECT_ID`: GCP Project ID.
- `PUBSUB_TOPIC`: Pub/Sub topic for streaming tweets.
- `GCP_REGION`: GCP region.
- `GCP_STAGING_BUCKET`: Staging bucket for Apache Beam.
- `GCP_TEMP_BUCKET`: Temporary bucket for Apache Beam.
- `BIGQUERY_AGGREGATE_TABLE`: BigQuery table for aggregated data.
- `BIGQUERY_RAW_TABLE`: BigQuery table for raw data.
- `DEAD_LETTER_BUCKET`: GCS bucket for dead-letter data.
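
Assuming the Makefile exports these variables into the environment, the processing job could assemble its Dataflow options from them roughly as follows (the `gs://` prefixes and suffix paths are assumptions):

```python
# Hypothetical sketch: build Dataflow pipeline options from the environment
# variables listed above.
import os

from apache_beam.options.pipeline_options import PipelineOptions


def dataflow_options():
    return PipelineOptions(
        streaming=True,
        runner="DataflowRunner",
        project=os.environ["GCP_PROJECT_ID"],
        region=os.environ["GCP_REGION"],
        # Assumes the bucket variables hold bare bucket names, not full gs:// URIs.
        staging_location=f"gs://{os.environ['GCP_STAGING_BUCKET']}/staging",
        temp_location=f"gs://{os.environ['GCP_TEMP_BUCKET']}/temp",
    )
```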
- `main.py`: Streams tweets from the Twitter API to Pub/Sub.
- `process_tweets.py`: Processes tweets into BigQuery.
- `terraform/`: Terraform scripts for provisioning GCP resources.
- `Makefile`: Commands for pipeline automation.
- `requirements.txt`: Python dependencies.
- Add support for more complex transformations in Apache Beam.
- Implement additional data quality checks and error handling (one possible shape is sketched after this list).
- Enable multi-region deployment for improved reliability.
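
As an example of the data quality checks mentioned above, a hedged sketch of one possible validation step (the required field names are placeholders, not part of the current pipeline):

```python
# Hypothetical data-quality check: drop tweets missing required fields and
# count the rejects via a Beam metric so they can be monitored.
import apache_beam as beam
from apache_beam.metrics import Metrics

REQUIRED_FIELDS = ("id", "text")  # placeholder field names


class DropIncompleteTweets(beam.DoFn):
    def __init__(self):
        self.rejected = Metrics.counter(self.__class__, "rejected_tweets")

    def process(self, tweet):
        if all(tweet.get(field) for field in REQUIRED_FIELDS):
            yield tweet
        else:
            # Keep a running count of rejected records for monitoring.
            self.rejected.inc()
```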
- Error: Insufficient Permissions:
  - Ensure the GCP Service Account has the required roles (e.g., `roles/pubsub.admin`, `roles/bigquery.admin`).
- Terraform State Issues:
  - Delete the local `terraform/state` directory and reinitialize Terraform.
- Pipeline Errors:
  - Check logs in `tweet.log` and debug issues in data formatting or schema mismatches.
This project is open-source and licensed under the MIT License.