This project demonstrates an event-driven architecture for web scraping and processing tasks using AWS services. The architecture includes a scraper job running on AWS Batch, which collects data from web pages and stores it in AWS S3. The processing job, triggered by AWS EventBridge once scraping completes, processes the scraped data and updates a Google Sheet with the updated price of a particular product.
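For illustration, here is a minimal sketch of the scraper-side S3 upload, assuming Python with boto3. The bucket name, key layout, and record format are placeholders, not the project's actual values.

```python
import json
from datetime import datetime, timezone

import boto3

# Hypothetical bucket name and key layout; adjust to your own setup.
BUCKET = "my-scraper-results"


def store_scrape_result(product_id: str, price: float) -> str:
    """Write one scraped record to S3 as JSON and return its object key."""
    s3 = boto3.client("s3")
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    key = f"scraped/{product_id}/{timestamp}.json"
    body = json.dumps({"product_id": product_id, "price": price, "scraped_at": timestamp})
    s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))
    return key
```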
- Scalable: The architecture can scale to handle large volumes of web scraping and processing tasks.
- Automated: The processing job is triggered automatically after the scraping job completes, reducing manual intervention.
- Parallel Scraping: The scraper job can scrape multiple web pages simultaneously, improving efficiency (see the sketch after this list).
- Flexible: The architecture can be adapted to different scraping and processing requirements.
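A minimal sketch of what parallel scraping can look like, assuming Python with `requests` and a bounded thread pool; the URL list and parsing are placeholders and not taken from the project code.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Placeholder URLs; the real job would read its targets from configuration.
URLS = [
    "https://example.com/product/1",
    "https://example.com/product/2",
    "https://example.com/product/3",
]


def fetch(url: str) -> tuple[str, int, str]:
    """Fetch one page and return (url, status code, body)."""
    resp = requests.get(url, timeout=10)
    return url, resp.status_code, resp.text


def scrape_all(urls: list[str], workers: int = 8) -> list[tuple[str, int, str]]:
    """Fetch many pages concurrently with a bounded thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))


if __name__ == "__main__":
    for url, status, _body in scrape_all(URLS):
        print(url, status)
```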
- AWS Batch: For running the scraper job at scale.
- AWS EventBridge: For triggering the processing job after the scraping job completes.
- AWS S3: For storing scraped data and logs.
- AWS ECS: For running Docker containers.
- AWS ECR: For storing Docker images.
- Google Sheets API: For updating a Google Sheet with processed data.
- Configure Google Sheets API:
  - Enable the Google Sheets API in the Google Cloud Console.
  - Create credentials (OAuth 2.0 client ID) for your project.
  - Download the credentials JSON file and save it as `credentials.json` in the project directory.
  - Share the Google Sheet with the email address specified in the credentials JSON file.
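The following is a minimal sketch of the Sheets update step, assuming the `gspread` library and a credentials file that contains a client email (i.e. a service-account key, which the share-with-email step suggests). The sheet name and cell are placeholders; the project itself may use a different client library or auth flow.

```python
import gspread

# Assumes credentials.json is a service-account key whose client email has
# been granted edit access to the target sheet (placeholder names below).
gc = gspread.service_account(filename="credentials.json")
sheet = gc.open("product-prices").sheet1  # hypothetical sheet name


def write_price(row: int, price: float) -> None:
    """Write the latest scraped price into column B of the given row."""
    sheet.update_cell(row, 2, price)


write_price(2, 19.99)
```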
- Set Up AWS Services: Set up AWS Batch, AWS EventBridge, AWS S3 and AWS ECR.
- Deploy Docker Images: Deploy the Docker images for the scraper job and processing job to AWS ECR.
- Configure EventBridge: Set up an EventBridge rule to trigger the processing job after the scraping job completes (a rule-creation sketch follows this list).
- Run the Scraping Job: Run the scraper job on AWS Batch to start scraping web pages.
- Monitor and Troubleshoot: Monitor the progress of the scraping and processing jobs and troubleshoot any issues.
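As a concrete illustration of the EventBridge wiring, here is a hedged sketch using boto3 that reacts to a successful Batch job and submits the processing job as a Batch target. All names and ARNs are placeholders, and the actual rule could equally be configured through the console or infrastructure-as-code.

```python
import json

import boto3

events = boto3.client("events")

# Match AWS Batch state-change events for a successfully finished scraper job.
# The job-definition ARN is a placeholder.
pattern = {
    "source": ["aws.batch"],
    "detail-type": ["Batch Job State Change"],
    "detail": {
        "status": ["SUCCEEDED"],
        "jobDefinition": ["arn:aws:batch:us-east-1:123456789012:job-definition/scraper-job:1"],
    },
}

events.put_rule(
    Name="trigger-processing-after-scrape",
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

# Submit the processing job to a Batch job queue when the rule fires.
# Queue ARN, role ARN, and job definition are placeholders.
events.put_targets(
    Rule="trigger-processing-after-scrape",
    Targets=[
        {
            "Id": "processing-job",
            "Arn": "arn:aws:batch:us-east-1:123456789012:job-queue/processing-queue",
            "RoleArn": "arn:aws:iam::123456789012:role/eventbridge-batch-role",
            "BatchParameters": {
                "JobDefinition": "processing-job",
                "JobName": "process-scraped-data",
            },
        }
    ],
)
```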
- Clone the repository: `git clone https://github.com/aj225patel/parallel-web-scraper-on-cloud.git`
- Set up AWS services following the instructions in the README.md file.
- Configure the Google Sheets API following the instructions in the README.md file.
- Deploy the Docker images for the scraper job and processing job to AWS ECR.
- Set up an EventBridge rule to trigger the processing job after the scraping job completes.
- Run the scraping job on AWS Batch (a job-submission sketch follows this list).
- Monitor the progress of the scraping and processing jobs and troubleshoot any issues.
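To make the last two steps concrete, here is a hedged sketch of submitting the scraper job to AWS Batch and polling its status with boto3. The job name, queue, and job definition are placeholders rather than the project's actual resource names.

```python
import time

import boto3

batch = boto3.client("batch")

# Placeholder resource names; substitute your own queue and job definition.
response = batch.submit_job(
    jobName="scraper-job-run",
    jobQueue="scraper-queue",
    jobDefinition="scraper-job",
)
job_id = response["jobId"]
print(f"Submitted job {job_id}")

# Poll until the job reaches a terminal state (SUCCEEDED or FAILED).
while True:
    job = batch.describe_jobs(jobs=[job_id])["jobs"][0]
    status = job["status"]
    print(f"Status: {status}")
    if status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(30)
```

On a successful run, the EventBridge rule described above fires and starts the processing job automatically, so no manual follow-up is needed beyond monitoring.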
- This project was inspired by the need for a scalable and automated web scraping solution.
- Special thanks to the AWS documentation and community for valuable insights and resources.