Mapping the commons towards an open ledger and CC Search.
This repository contains the methods used to identify over 1.4 billion Creative Commons licensed works. The challenge is that these works are dispersed throughout the web, and identifying them requires a combination of techniques. Two approaches are currently being explored:
- Web crawl data
- Application Programming Interfaces (API Data)
The Common Crawl Foundation provides an open repository of petabyte-scale web crawl data. A new dataset is published at the end of each month comprising over 200 TiB of uncompressed data.
The data is available in three file formats:
- WARC (Web ARChive): the entire raw data, including HTTP response metadata, WARC metadata, etc.
- WET: extracted plaintext from each webpage.
- WAT: extracted HTML metadata, e.g. HTTP headers and hyperlinks.
For more information about these formats, please see the Common Crawl documentation.
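As an illustration of how these archives can be consumed downstream, the sketch below fetches a single record from a WARC file with an HTTP Range request, given its byte offset and length (the kind of location information recorded by the pipeline described below). This is a hedged example: the warcio library, the example file path, and the offset/length values are assumptions, not part of this repository.

```python
# Illustrative only: fetch one page's contents from a Common Crawl WARC file
# using an HTTP Range request. The warcio dependency, the file path, and the
# offset/length values are assumptions made for the sake of the example.
import requests
from warcio.archiveiterator import ArchiveIterator

WARC_PATH = "crawl-data/CC-MAIN-2019-35/segments/example/warc/example.warc.gz"  # placeholder
offset, length = 1_234_567, 8_910  # would come from the WAT/parquet output

resp = requests.get(
    f"https://data.commoncrawl.org/{WARC_PATH}",
    headers={"Range": f"bytes={offset}-{offset + length - 1}"},
    stream=True,
)

# The ranged response is itself a valid single-record gzipped WARC stream.
for record in ArchiveIterator(resp.raw):
    if record.rec_type == "response":
        html = record.content_stream().read()
        print(record.rec_headers.get_header("WARC-Target-URI"), len(html))
```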
CC Catalog uses the AWS Data Pipeline service to automatically create an Amazon EMR cluster of 100 c4.8xlarge instances that parses the WAT archives to identify all domains that link to creativecommons.org. Due to the volume of data, Apache Spark is used to streamline the processing. The output of this methodology is a series of parquet files that contain:
- each domain and its respective content path and query string (i.e. the exact webpage that links to creativecommons.org),
- the CC-referenced hyperlink (which may indicate a license),
- HTML metadata in JSON format, indicating the number of images on each webpage and any other domains that it references,
- the location of the webpage in the WARC file so that the page contents can be found.
The steps above are performed in ExtractCCLinks.py.
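As a rough sketch of the kind of processing involved (not the actual contents of ExtractCCLinks.py), a PySpark job over WAT files might look like the following; the input and output paths and the simplified record handling are assumptions made for illustration.

```python
# A simplified, illustrative sketch of WAT processing (not the code in
# ExtractCCLinks.py): scan WAT metadata records for outbound links to
# creativecommons.org and write the matches out as parquet.
import json
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.appName("cc-links-sketch").getOrCreate()

# Placeholder input path; the real job reads Common Crawl WAT segments.
lines = spark.sparkContext.textFile("path/to/wat/*.warc.wat.gz")

def extract_cc_links(line):
    """Yield (page_url, cc_link) rows for WAT records that link to creativecommons.org."""
    if "creativecommons.org" not in line or not line.startswith("{"):
        return []
    try:
        envelope = json.loads(line)["Envelope"]
        page_url = envelope["WARC-Header-Metadata"].get("WARC-Target-URI", "")
        links = (envelope["Payload-Metadata"]
                 .get("HTTP-Response-Metadata", {})
                 .get("HTML-Metadata", {})
                 .get("Links", []))
        return [Row(page_url=page_url, cc_link=link.get("url", ""))
                for link in links if "creativecommons.org" in link.get("url", "")]
    except (ValueError, KeyError):
        return []

cc_links = spark.createDataFrame(lines.flatMap(extract_cc_links))
cc_links.write.mode("overwrite").parquet("output/cc_links_parquet")
```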
Apache Airflow is used to manage the workflow for various API ETL jobs which pull and process data from a number of open APIs on the internet.
The Airflow DAGs defined in common_api_workflows.py manage daily ETL jobs for the following platforms, by running the linked scripts:
Airflow DAGs, defined in their own files, also run the following scripts daily:
In the future, we'll migrate to the latter style of Airflow DAGs and accompanying Provider API Scripts.
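For readers unfamiliar with that pattern, the sketch below shows the general shape of a per-provider daily DAG. The provider name, task logic, and default arguments are hypothetical; they are not the project's actual code.

```python
# A hedged sketch of the "one DAG per provider" pattern described above.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # Airflow 1.10.x import path


def pull_example_provider_data(**context):
    # Placeholder for a provider API script's entry point; 'ds' is the
    # execution date string that Airflow injects into the context.
    print("Pulling records for", context["ds"], "from the example provider API")


default_args = {
    "owner": "example",
    "depends_on_past": False,
    "start_date": datetime(2020, 1, 1),
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

dag = DAG(
    dag_id="example_provider_workflow",
    default_args=default_args,
    schedule_interval="@daily",
    catchup=False,
)

pull_data = PythonOperator(
    task_id="pull_example_provider_data",
    python_callable=pull_example_provider_data,
    provide_context=True,  # pre-Airflow-2 way of receiving the execution context
    dag=dag,
)
```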
The Airflow DAG defined in monthlyWorkflow.py handles the monthly jobs that are scheduled to run on the 15th day of each month at 16:00 UTC. This workflow is reserved for long-running jobs or APIs that do not have date filtering capabilities, so the data is reprocessed monthly to keep the catalog updated. The following tasks are performed:
The Airflow DAG defined in loader_workflow.py runs every minute, and loads the oldest file which has not been modified in the last 15 minutes into the upstream database. It includes some data preprocessing steps.
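A hedged illustration of that selection rule follows (not the project's loader code): pick the oldest file in an output directory that has not been modified within the last 15 minutes, so files that may still be written to are left alone. The directory path and .tsv extension are assumptions.

```python
# Illustrative only: select the oldest "stable" output file for loading.
import time
from pathlib import Path

MATURITY_SECONDS = 15 * 60  # files younger than this may still be written to


def oldest_stable_file(output_dir="/tmp/workflow_output"):
    now = time.time()
    candidates = [
        path for path in Path(output_dir).glob("*.tsv")
        if now - path.stat().st_mtime > MATURITY_SECONDS
    ]
    # Oldest first; None if nothing is ready to load yet.
    return min(candidates, key=lambda path: path.stat().st_mtime, default=None)
```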
- Brooklyn Museum
- NYPL
- Cleveland Public Library
There are a number of scripts in the directory src/cc_catalog_airflow/dags/provider_api_scripts which pull data from open APIs; their output is eventually loaded into a database to be indexed for searching on CC Search. These run in a different environment than the PySpark portion of the project, and so have their own dependency requirements.
You'll need `docker` and `docker-compose` installed on your machine, with versions new enough to use version 3 of Docker Compose `.yml` files.
To set up environment variables, navigate to the `src/cc_catalog_airflow` directory and run `cp env.template .env`. If needed, fill in API keys or other secrets and variables in `.env`; this is not needed if you only want to run the tests. There is a `docker-compose.yml` provided in the `src/cc_catalog_airflow` directory, so from that directory, run `docker-compose up -d`.
This results, among other things, in the following running containers:

- `cc_catalog_airflow_webserver_1`
- `cc_catalog_airflow_postgres_1`

and some networking setup so that they can communicate. Note:

- `cc_catalog_airflow_webserver_1` is running the Apache Airflow daemon, and also has a few development tools (e.g., `pytest`) installed.
- `cc_catalog_airflow_postgres_1` is running PostgreSQL, and is set up with some databases and tables to emulate the production environment. It also provides a database for Airflow to store its running state.
- The directory containing the DAG files, as well as their dependencies, will be mounted to the `/usr/local/airflow/dags` directory in the `cc_catalog_airflow_webserver_1` container.
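To confirm that both containers came up, you can run `docker-compose ps` (a standard Docker Compose command, not specific to this project) from the same directory; it lists the project's containers and their state.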
At this stage, you can run the tests via `docker exec cc_catalog_airflow_webserver_1 /usr/local/airflow/.local/bin/pytest`.
Edits to the source files or tests can be made on your local machine, then tests can be run in the container via the above command to see the effects.
If you'd like, it's possible to log in to the webserver container via `docker exec -it cc_catalog_airflow_webserver_1 /bin/bash`. It's also possible to attach to the running command process of the webserver container via `docker attach --sig-proxy=false cc_catalog_airflow_webserver_1`. Attaching in this manner lets you see the output from both the Airflow webserver and scheduler, which can be useful for debugging purposes. To leave the container (but keep it running), press Ctrl-C on *nix platforms. To see the Airflow web UI, point your browser to `localhost:9090`.
If you'd like to bring down the containers, run `docker-compose down` from the `src/cc_catalog_airflow` directory. To reset the test DB (wiping out all databases, schemata, and tables), run `docker-compose down`, then `rm -r /tmp/docker_postgres_data/`.
For the PySpark portion of the project, the prerequisites are:

- JDK 9.0.1
- Python 3.6
- Pytest 4.3.1
- Spark 2.2.1
- Airflow 1.10.4

Install the Python dependencies with `pip install -r requirements.txt`, and run the tests with `python -m pytest tests/test_ExtractCCLinks.py`.
See the list of contributors who participated in this project.