This is a small side project that crawls Telegram's Fragment platform to extract data about phone numbers, and provides a RESTful API, a WebSocket API, and a chart-based visualization of the data.
The goal of this project is to extract data and basic insights about Telegram's number auctions, and to learn more about the Play framework, Scala, Terraform, and AWS.
- Scrapy (Python) for the crawler
- Play framework (Scala) for the API server
- Plotly for data visualization
- MongoDB for data persistence
- Terraform for infrastructure provisioning
- AWS for cloud infrastructure
- MongoDB Atlas as the managed database service
This project uses Amazon Web Services (AWS) for infrastructure provisioning. The infrastructure is organized into components, each residing in its own directory under the `fragmenty-infra` directory:
- Elastic Container Service (ECS) - Deploys and manages the containerized applications
- MongoDB Atlas - Hosts the MongoDB instance for data persistence
The ECS infrastructure is set up using Terraform and includes the following resources (a minimal sketch of how they fit together follows the list):
- Elastic Container Registry (ECR) for storing container images
- ECS Cluster, ECS Service, and ECS Task Definition for running the containerized applications
- AWS Lambda for running the Scrapy crawler periodically
- Application Load Balancer (ALB) for distributing traffic to the ECS tasks
- Route 53 for managing DNS records
- AWS Certificate Manager (ACM) for SSL certificate provisioning
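The exact resource definitions live in the `fragmenty-infra` modules; the sketch below only illustrates how an ECR repository, cluster, task definition, and service typically relate. All names, sizes, and the referenced variables are assumptions, not the project's actual configuration:

```hcl
# Illustrative sketch only -- names and values are assumptions.
resource "aws_ecr_repository" "api" {
  name = "fragmenty-api"
}

resource "aws_ecs_cluster" "main" {
  name = "fragmenty"
}

resource "aws_ecs_task_definition" "api" {
  family                   = "fragmenty-api"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = var.execution_role_arn # assumed IAM role for pulling from ECR

  container_definitions = jsonencode([{
    name         = "api"
    image        = "${aws_ecr_repository.api.repository_url}:latest"
    essential    = true
    portMappings = [{ containerPort = 9000 }] # Play's default HTTP port
  }])
}

resource "aws_ecs_service" "api" {
  name            = "fragmenty-api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 1
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnet_ids # assumed variable
    security_groups = [var.service_sg_id]    # assumed variable
  }
}
```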
The MongoDB Atlas infrastructure is also set up using Terraform and consists of the following resources (sketched after the list):
- MongoDB Atlas Cluster
- MongoDB Atlas Database Users
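With the `mongodbatlas` provider, these two resources generally look like the following. The project ID, names, region, and instance size here are placeholders, not the repo's real values:

```hcl
# Illustrative sketch -- all identifiers are assumptions.
resource "mongodbatlas_cluster" "main" {
  project_id                  = var.atlas_project_id
  name                        = "fragmenty"
  provider_name               = "AWS"
  provider_region_name        = "US_EAST_1"
  provider_instance_size_name = "M10"
}

resource "mongodbatlas_database_user" "api" {
  project_id         = var.atlas_project_id
  username           = "fragmenty-api"
  password           = var.atlas_db_password
  auth_database_name = "admin"

  roles {
    role_name     = "readWrite"
    database_name = "fragmenty"
  }
}
```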
*Visualized Terraform graph*
The deployment process is automated using Terraform. The `external.tf` file extracts the latest Git commit SHA for the `spider` and `api` modules; these SHAs are used as container image tags. Terraform uses `container_build_push.tf` to build the container images and push them to ECR. The `ecs.tf` file contains the resources required to run the containerized applications on ECS.
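A common way to feed a Git SHA into Terraform (and roughly the shape of what `external.tf` does, though the actual script is the repo's own) is an `external` data source that shells out to `git`. The submodule path and local name below are assumptions:

```hcl
# Sketch: derive an image tag from a submodule's HEAD commit.
data "external" "spider_sha" {
  program = ["sh", "-c",
    "printf '{\"sha\": \"%s\"}' \"$(git -C ../fragmenty-spider rev-parse --short HEAD)\""
  ]
}

locals {
  # Used as the container image tag when building and pushing to ECR.
  spider_image_tag = data.external.spider_sha.result.sha
}
```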
The Lambda function, defined in `lambda.tf`, is responsible for running the Scrapy crawler periodically. The function is triggered by a CloudWatch Event Rule that specifies the desired frequency.
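The usual wiring for such a schedule is a rule, a target, and a permission allowing CloudWatch Events to invoke the function. The names, the `rate(1 hour)` expression, and the `aws_lambda_function.spider` reference below are assumptions standing in for whatever `lambda.tf` actually defines:

```hcl
# Sketch of a scheduled trigger for the crawler Lambda.
resource "aws_cloudwatch_event_rule" "crawl" {
  name                = "fragmenty-crawl"
  schedule_expression = "rate(1 hour)" # assumed frequency
}

resource "aws_cloudwatch_event_target" "crawl" {
  rule = aws_cloudwatch_event_rule.crawl.name
  arn  = aws_lambda_function.spider.arn # assumed to be defined in lambda.tf
}

resource "aws_lambda_permission" "allow_events" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.spider.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.crawl.arn
}
```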
The `loadbalancer.tf` file defines an Application Load Balancer (ALB) that routes traffic to the ECS tasks. Route 53 provides the custom domain name, with the SSL certificate provisioned through ACM, as specified in the `route53.tf` file.
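The DNS and TLS pieces typically connect like this. The domain, zone variable, and the `aws_lb.main` / `aws_lb_target_group.api` references are placeholders (and the ACM validation records are omitted for brevity):

```hcl
# Sketch of the DNS/TLS wiring -- identifiers are assumptions.
resource "aws_acm_certificate" "api" {
  domain_name       = "api.example.com"
  validation_method = "DNS"
}

resource "aws_route53_record" "api" {
  zone_id = var.zone_id
  name    = "api.example.com"
  type    = "A"

  alias {
    name                   = aws_lb.main.dns_name
    zone_id                = aws_lb.main.zone_id
    evaluate_target_health = true
  }
}

resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = aws_acm_certificate.api.arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api.arn
  }
}
```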
This project includes two Git submodules:
- `fragmenty-api` - This submodule contains the source code for the API server, which is built using the Play framework and Scala. The `fragmenty-api` directory contains a Dockerfile for building the container image, configuration files, and the application's source code.
- `fragmenty-spider` - This submodule contains the source code for the Scrapy crawler that extracts data from Telegram's Fragment platform. The `fragmenty-spider` directory contains a Dockerfile for building the container image, a build script, a sample environment file, and the Scrapy spider's source code.
These submodules are checked out automatically when the main repository is cloned with the `--recurse-submodules` option:

```sh
git clone --recurse-submodules https://github.com/Maders/fragmenty.git
```
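If the repository was already cloned without that flag, the submodules can be fetched afterwards with the standard Git command:

```sh
git submodule update --init --recursive
```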
To initialize the working directory, run the following command in the respective component directory:

```sh
terraform init
```

To apply the infrastructure changes, run the following command in the respective component directory:

```sh
terraform apply
```

To destroy the infrastructure resources, run the following command in the respective component directory:

```sh
terraform destroy
```
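For example, assuming the ECS component lives in a directory such as `fragmenty-infra/ecs` (the actual layout is the repo's own), a full provisioning cycle looks like:

```sh
# Hypothetical component directory; see fragmenty-infra/ for the real names.
cd fragmenty-infra/ecs
terraform init   # download providers and set up state
terraform plan   # preview the changes before applying
terraform apply  # provision the resources
```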