This project streams tweets from the Twitter API into Google BigQuery, with Google Cloud Storage (GCS) serving as a dead-letter store for records that fail processing. The pipeline processes real-time data with Apache Beam and makes it available for transformation and analysis in BigQuery. Provisioning of GCP resources is automated with Terraform.
- Real-Time Streaming: Streams tweets into Google Pub/Sub for ingestion.
- Dual Pipelines (see the sketch after this list):
  - ETL Pipeline: Aggregates tweet counts per minute and writes them to BigQuery.
  - ELT Pipeline: Streams raw tweets into BigQuery for further transformation.
- Dead-Letter Handling: Stores unprocessable tweets in Google Cloud Storage.
- Scalability: Uses Google Dataflow for distributed data processing.
- Automated GCP Provisioning: Deploys GCP resources like Pub/Sub, BigQuery, and GCS using Terraform.
- Customizable: Easily adaptable to different Twitter stream rules and GCP setups.
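
As a rough illustration of the ETL side, the sketch below shows a per-minute windowed count in Apache Beam. The subscription argument, the one-minute window, and the trailing `WriteToBigQuery` step are assumptions for illustration, not copied from the project code:

```python
# Illustrative sketch (not the project's exact code): count tweets per
# one-minute window after reading them from a Pub/Sub subscription.
import json

import apache_beam as beam
from apache_beam.transforms import window


def minute_counts(pipeline, subscription):
    return (
        pipeline
        | "ReadFromPubSub" >> beam.io.ReadFromPubSub(subscription=subscription)
        | "ParseJson" >> beam.Map(json.loads)
        # Assign each tweet to a fixed one-minute window.
        | "WindowPerMinute" >> beam.WindowInto(window.FixedWindows(60))
        # Count the tweets that fall into each window.
        | "PairWithOne" >> beam.Map(lambda _tweet: ("tweets", 1))
        | "CountPerMinute" >> beam.CombinePerKey(sum)
        # A beam.io.WriteToBigQuery step (table taken from configuration)
        # would follow here in the real pipeline.
    )
```

The ELT pipeline skips the aggregation step and writes the parsed records straight to BigQuery.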
- Programming Language: Python
- Frameworks/Libraries:
  - Apache Beam: Data processing
  - Tweepy: Twitter API integration (see the streaming sketch below)
  - Google Cloud Pub/Sub: Message ingestion
  - BigQuery Schema Generator: Dynamic schema generation
- Infrastructure as Code: Terraform
- Cloud Platform: Google Cloud Platform (GCP)
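
To show how Tweepy and Pub/Sub fit together, here is a minimal, hypothetical forwarder. The class name and the example stream rule are assumptions; the environment-variable names match the ones listed later in this README:

```python
# Hypothetical sketch: forward matching tweets from the Twitter v2 stream
# into a Pub/Sub topic as JSON messages.
import json
import os

import tweepy
from google.cloud import pubsub_v1


class PubSubForwarder(tweepy.StreamingClient):
    """Publishes each incoming tweet to a Pub/Sub topic."""

    def __init__(self, bearer_token, project_id, topic_id):
        super().__init__(bearer_token)
        self._publisher = pubsub_v1.PublisherClient()
        self._topic_path = self._publisher.topic_path(project_id, topic_id)

    def on_tweet(self, tweet):
        payload = json.dumps({"id": tweet.id, "text": tweet.text}).encode("utf-8")
        self._publisher.publish(self._topic_path, data=payload)


if __name__ == "__main__":
    client = PubSubForwarder(
        os.environ["TWITTER_BEARER_TOKEN"],
        os.environ["GCP_PROJECT_ID"],
        os.environ["PUBSUB_TOPIC"],
    )
    # The rule value is only an example; the project's own stream rules may differ.
    client.add_rules(tweepy.StreamRule("data engineering lang:en"))
    client.filter()
```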
- Sign up for a Twitter Developer account.
- Create an app and generate a Bearer Token.
- Create a GCP account and a project.
- Ensure billing is enabled for the project.
- Install the following tools:
  - Terraform: Provision GCP resources.
  - Python 3.8+: Run the pipeline.
  - Google Cloud SDK: Authenticate and manage GCP resources.
- Navigate to the `terraform` directory and apply the Terraform scripts:

  ```bash
  cd terraform
  terraform init
  terraform apply
  ```

- Set up a Python virtual environment and install the required libraries:

  ```bash
  make setup
  make install
  ```
- Open two terminal windows:

  - Terminal 1: Stream tweets into Pub/Sub:

    ```bash
    make stream_tweets
    ```

  - Terminal 2: Process tweets into BigQuery:

    ```bash
    make process_tweets
    ```

- Logs are written to `tweet.log` for debugging.
- Data is written to BigQuery tables or to GCS (dead-letter).
- Destroy GCP resources to save costs:

  ```bash
  terraform destroy
  ```
- **Input:** Streams tweets from the Twitter API into Google Pub/Sub.
- **Processing:**
  - ETL Pipeline: Aggregates tweets per minute.
  - ELT Pipeline: Writes raw tweets to BigQuery.
- **Output:**
  - Processed data is written to BigQuery.
  - Failed records are stored in GCS for inspection (see the dead-letter sketch below).
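
For the dead-letter path, a common Beam pattern is to tag records that fail parsing and write them out separately. The sketch below assumes that pattern; the function names and the output prefix are illustrative, not the project's exact implementation:

```python
# Illustrative dead-letter routing: records that fail JSON parsing are tagged
# and written out to GCS instead of breaking the main pipeline.
import json

import apache_beam as beam
from apache_beam import pvalue
from apache_beam.transforms import window


class ParseTweet(beam.DoFn):
    def process(self, message):
        try:
            yield json.loads(message.decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError):
            # Route unparseable payloads to the 'dead_letter' output.
            yield pvalue.TaggedOutput("dead_letter", message)


def split_good_and_bad(messages, dead_letter_prefix):
    results = messages | "Parse" >> beam.ParDo(ParseTweet()).with_outputs(
        "dead_letter", main="parsed"
    )
    (
        results.dead_letter
        | "DecodeForInspection" >> beam.Map(lambda raw: raw.decode("utf-8", errors="replace"))
        # Window before writing so the file sink also works on an unbounded stream.
        | "WindowDeadLetters" >> beam.WindowInto(window.FixedWindows(300))
        | "WriteDeadLetters" >> beam.io.WriteToText(dead_letter_prefix, num_shards=1)
    )
    return results.parsed
```

Here `dead_letter_prefix` would point at a path under the dead-letter bucket, for example `gs://<DEAD_LETTER_BUCKET>/failed/`.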
Ensure the following environment variables are set in the Makefile:

- `TWITTER_BEARER_TOKEN`: Your Twitter API Bearer Token.
- `GCP_PROJECT_ID`: GCP Project ID.
- `PUBSUB_TOPIC`: Pub/Sub topic for streaming tweets.
- `GCP_REGION`: GCP region.
- `GCP_STAGING_BUCKET`: Staging bucket for Apache Beam.
- `GCP_TEMP_BUCKET`: Temporary bucket for Apache Beam.
- `BIGQUERY_AGGREGATE_TABLE`: BigQuery table for aggregated data.
- `BIGQUERY_RAW_TABLE`: BigQuery table for raw data.
- `DEAD_LETTER_BUCKET`: GCS bucket for dead-letter data.
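
Assuming the Makefile exports these variables into the environment, the processing job could assemble its Dataflow options from them roughly as follows (the `gs://` prefixes and suffix paths are assumptions):

```python
# Hypothetical sketch: build Dataflow pipeline options from the environment
# variables listed above.
import os

from apache_beam.options.pipeline_options import PipelineOptions


def dataflow_options():
    return PipelineOptions(
        streaming=True,
        runner="DataflowRunner",
        project=os.environ["GCP_PROJECT_ID"],
        region=os.environ["GCP_REGION"],
        # Assumes the bucket variables hold bare bucket names, not full gs:// URIs.
        staging_location=f"gs://{os.environ['GCP_STAGING_BUCKET']}/staging",
        temp_location=f"gs://{os.environ['GCP_TEMP_BUCKET']}/temp",
    )
```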
- `main.py`: Streams tweets from the Twitter API to Pub/Sub.
- `process_tweets.py`: Processes tweets into BigQuery.
- `terraform/`: Terraform scripts for provisioning GCP resources.
- `Makefile`: Commands for pipeline automation.
- `requirements.txt`: Python dependencies.
- Add support for more complex transformations in Apache Beam.
- Implement additional data quality checks and error handling (one possible shape is sketched after this list).
- Enable multi-region deployment for improved reliability.
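
As an example of the data quality checks mentioned above, a hedged sketch of one possible validation step (the required field names are placeholders, not part of the current pipeline):

```python
# Hypothetical data-quality check: drop tweets missing required fields and
# count the rejects via a Beam metric so they can be monitored.
import apache_beam as beam
from apache_beam.metrics import Metrics

REQUIRED_FIELDS = ("id", "text")  # placeholder field names


class DropIncompleteTweets(beam.DoFn):
    def __init__(self):
        self.rejected = Metrics.counter(self.__class__, "rejected_tweets")

    def process(self, tweet):
        if all(tweet.get(field) for field in REQUIRED_FIELDS):
            yield tweet
        else:
            # Keep a running count of rejected records for monitoring.
            self.rejected.inc()
```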
- Error: Insufficient Permissions:
  - Ensure the GCP Service Account has the required roles (e.g., `roles/pubsub.admin`, `roles/bigquery.admin`).
- Terraform State Issues:
  - Delete the local `terraform/state` directory and reinitialize Terraform.
- Pipeline Errors:
  - Check logs in `tweet.log` and debug issues in data formatting or schema mismatches.
This project is open-source and licensed under the MIT License.