This repository holds the code for the first stage of our Brand Sentiment Tracking system: the AWS Batch Article Extractor. When a job is submitted on AWS, this application loads all the WARC files published in CC-NEWS on the previous day (unless a range is specified using `START_DATE` and `END_DATE`), extracts all valid articles from them, and pushes the results to an S3 bucket (specified by `S3_BUCKET_NAME`).
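As an illustrative sketch (not code from this repository), the CC-NEWS WARC files for a given day can be located by date. The `warc_listing_url` helper below is hypothetical and assumes Common Crawl's `crawl-data/CC-NEWS/<year>/<month>/warc.paths.gz` listing layout:

```python
from datetime import datetime, timedelta

def warc_listing_url(date):
    # Assumption: Common Crawl publishes a monthly listing of CC-NEWS WARC
    # paths at crawl-data/CC-NEWS/<year>/<month>/warc.paths.gz.
    return ("https://data.commoncrawl.org/crawl-data/CC-NEWS/"
            f"{date:%Y/%m}/warc.paths.gz")

# For "the previous day", the listing covering that day's month is fetched,
# then filtered down to the WARC files with matching timestamps.
yesterday = datetime(2024, 5, 2) - timedelta(days=1)
print(warc_listing_url(yesterday))
```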
This application is designed to run in Docker. Build the image using:

```sh
docker build -t article-extractor:latest .
```
To run on AWS, make sure the image is built for `amd64`:

```sh
docker buildx build --platform=linux/amd64 -t article-extractor:latest .
```
To run this package locally, you must have Python 3.8, Java 8 and Apache Spark 1.3.1 installed on your machine.
Install the package requirements using pip:

```sh
pip install -r requirements.txt
```
`ExtractionJob` is the main class that carries out the downloading, extraction and saving of articles stored in WARC files. It is designed to be fully independent of the other classes, so multiple jobs can run concurrently without any shared memory.

The main method of this class is `extract_warc()`, which, given the job's initialisation, completes the whole process of downloading, extracting and saving the results from a single WARC file.
The following snippet downloads and saves the first 50 articles from a WARC file published today. Note that there is no check to ensure the WARC was published on the date passed.

```python
from datetime import datetime

from extractor import ExtractionJob

job = ExtractionJob("full-url-to-warc", ["*"],
                    datetime.now(),
                    "./path/to/parquet/files")
job.extract_warc(limit=50)
```
The `ArticleExtractor` class finds all the relevant WARC files to process and creates multiple `ExtractionJob`s, using a process pool to speed up extraction. The main method for this class is `download_articles()`, which extracts all the articles that match a list of URL patterns (based on the `glob` library) and the start and end dates passed to the function.
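A minimal sketch of the process-pool pattern described above, using Python's standard `multiprocessing` module. The `extract_one` worker here is a stand-in for a real `ExtractionJob`, not code from this repository:

```python
from multiprocessing import Pool

def extract_one(warc_url):
    # Stand-in for ExtractionJob(...).extract_warc(): each worker handles
    # one WARC file with no shared state, so jobs can run fully in parallel.
    return f"processed {warc_url}"

if __name__ == "__main__":
    warc_urls = [f"warc-{i}.warc.gz" for i in range(4)]
    # Each WARC URL is dispatched to a separate worker process.
    with Pool(processes=4) as pool:
        results = pool.map(extract_one, warc_urls)
    print(results)
```

Because each job is independent, no locks or shared memory are needed; the pool simply maps WARC files to workers.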
The `ArticleToParquetS3` class handles uploading all the parquet files saved in a local directory to S3, using Apache Spark. The main method of this class is `run()`, which combines all the classes above, as well as a final upload step to S3.
Configuration for the job in `main.py` is passed as environment variables:

- `S3_BUCKET_NAME`: The name of the bucket to push the parquet files to in S3.
- `URL_PATTERNS`: A list of glob patterns to match the article's URL against when filtering.
- `START_DATE`: The start date in ISO format to filter the WARC files by.
- `END_DATE`: The end date (not inclusive) in ISO format to filter the WARC files by.
- `MAX_RECORDS`: The maximum number of articles to save to each parquet file.
If `START_DATE` or `END_DATE` is not specified, the extraction will run for the previous day. If no URL patterns are passed, all URLs will be accepted.
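To illustrate how glob-style URL patterns behave, here is a self-contained sketch using Python's standard `fnmatch` module. The patterns and URLs are made-up examples, and the repository's own matching may differ in detail:

```python
from fnmatch import fnmatch

# Example glob patterns, as might be passed via URL_PATTERNS.
patterns = ["*bbc.co.uk*", "*.example.com/news/*"]

urls = [
    "https://www.bbc.co.uk/news/world-12345",
    "https://blog.example.com/news/launch",
    "https://unrelated.org/article",
]

# A URL is kept if it matches any pattern; a single "*" accepts everything.
matched = [u for u in urls if any(fnmatch(u, p) for p in patterns)]
print(matched)
```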
To run unit tests locally, use:

```sh
pytest extractor
```

To run them in Docker, use:

```sh
docker run --entrypoint=./test.sh article-extractor:latest
```
Make sure you pass the `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` environment variables to test uploading to and downloading from AWS.
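For example, the credentials can be forwarded from the host environment with `docker run -e` (a sketch; the variable values come from your own AWS account):

```sh
# Passing -e VAR without a value forwards the host's value into the container.
docker run \
  -e AWS_ACCESS_KEY_ID \
  -e AWS_SECRET_ACCESS_KEY \
  --entrypoint=./test.sh \
  article-extractor:latest
```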