Set of scripts to deduplicate and annotate monolingual text corpora. Originally started as a monotextor-like pipeline under Slurm HPCs.
This pipeline needs as input directories structured as $COLLECTION_NAME/$BATCH/{metadata,text,lang}.zst from warc2text-runner.
The first step will merge, for each batch, the three line-aligned JSONL files into a single JSONL, where each document is a JSON object containing the text and all the metadata (from metadata.zst and lang.zst).
Then, for each collection, all the batches will be read sequentially and documents will be placed into separate folders for each detected language and, within each language folder, divided into batches if needed.
In this step, documents whose prediction probability (prob) for the first lang field is less than 0.5 are discarded.
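As a rough illustration, the following is a minimal sketch of this preparation, assuming the three .zst files are line-aligned JSONL and that each line is a JSON object (the exact keys inside text.zst and lang.zst, and the file name handling, are assumptions; the real pipeline merges first and applies the 0.5 probability filter during language splitting):

import io, json
import zstandard  # pip install zstandard

def read_jsonl_zst(path):
    # Stream-decode a .zst compressed JSONL file line by line
    with open(path, "rb") as fh:
        reader = io.TextIOWrapper(
            zstandard.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
        for line in reader:
            yield json.loads(line)

def merge_batch(batch_dir):
    # The three files are line-aligned: the i-th line of each belongs to the same document
    metas = read_jsonl_zst(f"{batch_dir}/metadata.zst")
    texts = read_jsonl_zst(f"{batch_dir}/text.zst")
    langs = read_jsonl_zst(f"{batch_dir}/lang.zst")
    for meta, text, lang in zip(metas, texts, langs):
        doc = {**meta, **text, **lang}
        # Discard documents whose top language prediction is below 0.5
        if doc.get("prob", [0.0])[0] < 0.5:
            continue
        print(json.dumps(doc, ensure_ascii=False))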
After the data has been prepared, the actual processing takes place, with near-deduplication and annotation.
Near-deduplication is performed at document level and across all collections. For each language, a MinHash LSH index is built using a modified version of the gaoya library that is able to work with larger scale data. After an index containing the hashes of all the documents is built, the connected components are computed with the Union-Find algorithm. Then all the unique documents and one document per cluster are kept. The input of this step is JSONL files and the output is the same format with near-duplicates removed.
The process is divided into two HyperQueue jobs.
The first one (./10.index) reads all the documents, indexes them, builds the connected components and stores the Union-Find vector on disk.
Then, the second one (./10.dedup) reads the Union-Find vector and the documents, discarding near-duplicates according to what the vector indicates.
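As a rough illustration of the index/dedup split (the real pipeline uses the modified gaoya library in Rust; the datasketch library, character 5-gram shingles, the 0.8 threshold and the batch.jsonl file name below are assumptions for this sketch):

import json
from datasketch import MinHash, MinHashLSH  # stand-in for the gaoya fork

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i+5] for i in range(len(text) - 4)}:  # character 5-grams
        m.update(shingle.encode("utf-8"))
    return m

def find(parent, x):                      # Union-Find with path compression
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# Pass 1 (10.index): index all documents and build the Union-Find vector
docs = [json.loads(l) for l in open("batch.jsonl")]
lsh = MinHashLSH(threshold=0.8, num_perm=128)
parent = list(range(len(docs)))
for i, doc in enumerate(docs):
    m = minhash(doc["text"])
    for j in lsh.query(m):                # near-duplicate candidates already indexed
        parent[find(parent, i)] = find(parent, j)
    lsh.insert(i, m)

# Pass 2 (10.dedup): keep unique documents and one representative per cluster
seen = set()
for i, doc in enumerate(docs):
    root = find(parent, i)
    if root not in seen:
        seen.add(root)
        print(json.dumps(doc, ensure_ascii=False))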
For very large languages, a distributed approach has been implemented in the gaoya fork. This distributed technique runs multiple indexing jobs, where each job stores only one of the MinHash bands. Thus, all the documents can be indexed in memory; otherwise, tens of terabytes would be needed if a single process indexed all the bands.
With this approach, each job computes its own Union-Find vector and stores it on disk. The dedup step is performed the same way, but all the vectors are read and merged at the beginning.
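A minimal sketch of how the per-band Union-Find vectors could be merged at the start of the dedup step (how the real fork serializes and loads the vectors is not specified here and this is only an assumption about the merge logic):

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def merge_union_find(vectors):
    # All vectors index the same documents; unify each document with its
    # parent from every per-band vector to obtain the global clustering.
    merged = list(range(len(vectors[0])))
    for vec in vectors:
        for i, p in enumerate(vec):
            merged[find(merged, i)] = find(merged, p)
    return merged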
To comply with the robots.txt directives of each web domain, this pipeline includes optional annotation of documents that are not allowed to be crawled.
To do this, the WARCs containing the robots.txt files for each collection have to be provided in the same input directory structure described in the merge-batching step.
Afterwards, the robots.txt processing can be executed in parallel to the main pipeline and will generate a list of disallowed URLs for each collection, which will be used by the annotation step.
The procedure does the following, in one job per collection:
- Extract all the URLs for that collection and create a compressed index with FST.
- Extract all the documents in JSONL format containing robots.txt files from the WARCs.
- Parse the robots.txt files and produce a list of allowed and disallowed URL patterns for our relevant user agents (*, ia-archiver, CCBot).
- Use the patterns to query the URL index and generate the list of disallowed URLs (see the sketch below).
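A minimal sketch of the parsing and querying idea, using Python's standard robotparser instead of whatever parser the pipeline actually uses, and with the FST URL index replaced by a plain list for illustration:

from urllib.robotparser import RobotFileParser

USER_AGENTS = ["*", "ia-archiver", "CCBot"]

def disallowed_urls(robots_txt, urls):
    # Parse one robots.txt body and return the URLs that are not crawlable
    # by all of our relevant user agents.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls
            if not all(parser.can_fetch(ua, u) for ua in USER_AGENTS)]

robots = "User-agent: *\nDisallow: /private/\n"
urls = ["https://example.com/index.html", "https://example.com/private/doc.html"]
print(disallowed_urls(robots, urls))  # -> ['https://example.com/private/doc.html']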
The annotation step consists of adding multiple metadata fields to each document (using annotate.py):
- id: unique id for the document, derived from the WARC file, URL and timestamp (f, u, ts fields). A sketch follows this list.
- seg_langs: segment-level language identification. An array of size equal to the number of segments in the document (each segment being delimited by a \n). The language identification tool for this step was heliport, a fast port of HeLI-OTS trained with the same data as the document-level language identifier.
- robots: robots.txt compliance (whether the document has been disallowed for crawling).
- monofixer to fix encoding issues and remove HTML entities. This step does not add any metadata field, it just fixes the document text.
- pii: look for PII with multilingual-pii-tool. In case any match is found, the field specifies the Unicode character offsets for every match.
- filter: whether the document matches any of the filtering criteria.
- doc_scores: document quality scores with web-docs-scorer. An array where the first position is the overall quality score and the rest are the sub-scores used to determine the overall score. All of the scores range from 0 to 10.
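A minimal illustration of how the id and seg_langs fields could be derived (the exact hash used for id is an assumption, and heliport is replaced by a placeholder function; the real annotate.py may differ):

import hashlib, json

def annotate(doc, identify_lang):
    # Derive a unique id from the WARC file, URL and timestamp fields
    # (hash choice is an assumption)
    doc["id"] = hashlib.md5(
        (doc["f"] + doc["u"] + doc["ts"]).encode("utf-8")).hexdigest()
    # One language label per newline-delimited segment
    doc["seg_langs"] = [identify_lang(seg) for seg in doc["text"].split("\n")]
    return doc

doc = {"f": "80716-00467.warc.gz", "u": "https://www.example.com/some_text",
       "ts": "2021-05-09T10:26:25Z", "text": "paragraph one\nparagraph two"}
print(json.dumps(annotate(doc, lambda seg: "eng_Latn"), ensure_ascii=False))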
The output of this step will contain the same documents as the input, with the added metadata.
The annotation process adds a new metadata field (filter) to each document that indicates whether the document should be kept or not and, when not, the discarding reason.
Possible values are (a sketch of these checks follows the list):
- keep: the document does not match any of the filtering criteria.
- adult_ut1: the URL of the document matches one of the domains in the UT1 adult list. To perform matches, full domains are searched in the list. If they do not match, a second iteration searches again with the subdomains removed.
- length_XX: the text of the document has fewer than XX characters. Default: 500.
- word_avg_X: the average number of words per segment is less than X. Default: 5.
- char_avg_X: the average number of characters per segment is less than X. This is used for Chinese, Japanese and Korean. Default: 10.
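A minimal sketch of the length and average-based checks, assuming the defaults above (500 characters, 5 words, 10 characters); the UT1 domain lookup is omitted and the exact set of CJK language codes is an assumption:

def filter_label(text, lang, min_len=500, min_word_avg=5, min_char_avg=10):
    segments = text.split("\n")
    if len(text) < min_len:
        return f"length_{min_len}"
    if lang in ("zho_Hans", "zho_Hant", "jpn_Jpan", "kor_Hang"):
        # CJK languages: check average characters per segment instead of words
        if sum(len(s) for s in segments) / len(segments) < min_char_avg:
            return f"char_avg_{min_char_avg}"
    elif sum(len(s.split()) for s in segments) / len(segments) < min_word_avg:
        return f"word_avg_{min_word_avg}"
    return "keep"

print(filter_label("short text", "eng_Latn"))  # -> 'length_500'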
The previous step added lots of metadata for cleaning purposes, but no documents were removed.
To do this, the 30.clean.sh step needs to be run.
This step will create a new version of the corpus, removing all the documents that do not meet all of these criteria:
- The filter field value is keep.
- The robots field value is allowed.
- The overall document score (first value of the array: doc_scores[0]) is equal to or higher than 5.
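These criteria translate into a simple per-document check; a minimal sketch of the idea (30.clean.sh may implement it differently):

import json, sys

def keep(doc):
    return (doc.get("filter") == "keep"
            and doc.get("robots") == "allowed"
            and doc.get("doc_scores", [0])[0] >= 5)

for line in sys.stdin:
    if keep(json.loads(line)):
        sys.stdout.write(line)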
To avoid conflicts with the software and modules installed on the cluster, and to be friendlier to the cluster filesystem by dramatically decreasing the number of files needed for the software installation, a Singularity container needs to be built. The build procedure can be performed on a local machine with these simple steps:
# docker build -t monotextor:latest .
# singularity build -F monotextor.sif monotextor.def
# rsync monotextor.sif user@your-cluster:/path/to/monotextor-slurm
This requires Docker and Singularity to be installed on the local machine, and the cluster has to support execution of Singularity containers.
For the second and third steps, the pipeline uses HyperQueue to schedule jobs.
To set it up, follow the HQ installation instructions, placing its binary in any directory of $PATH.
Then, it is recommended to start a terminal multiplexer like screen or tmux and run hq server start in a separate window.
After that, the pipeline can start running. HQ is only needed from 10.dedup.sh onwards because the merge-batching submits jobs directly to SLURM, so the HQ server does not need to be up during that process.
Copy .env.example to .env and edit the variables accordingly.
The needed variables are:
SBATCH_ACCOUNT Project account number.
SLURM_LOGS_DIR Directory where all the job logs will be stored.
WORKSPACE Directory where all the processing output will be stored.
FLASH_TMP Temporary directory for merge-batch parallel step (recommended flash storage partition).
COLLECTIONS Associative array with collection names and paths.
An example of the collections array looks like this:
INPUT_DIR=/scratch/project_XXXXXX/w2t-runner-output
declare -A COLLECTIONS=(
["cc13"]=$INPUT_DIR/CC-MAIN-2013
["cc14"]=$INPUT_DIR/CC-MAIN-2014
...
["cc22"]=$INPUT_DIR/CC-MAIN-2022
["cc23"]=$INPUT_DIR/CC-MAIN-2023
["wide5"]="$INPUT_DIR/wide00005"
["wide6"]="$INPUT_DIR/wide00006"
["wide10"]="$INPUT_DIR/wide00010"
...
["wide15"]="$INPUT_DIR/wide00015"
["wide16"]="$INPUT_DIR/wide00016"
["wide17"]="$INPUT_DIR/wide00017"
)
The pipeline will join multiple collections/crawls if they are specified as a pattern in the same array entry.
In the latest HPLT release there was more than one Common Crawl collection for the same year, so, for example, CC-MAIN-2014-28 and CC-MAIN-2014-48 are combined as cc14.
This allows a more balanced processing during deduplication.
Running the pipeline is pretty simple. First, for each collection, we need to run the merge-batching with
./00.merge-batching.sh <collection_name>
For example:
./00.merge-batching.sh wide16
where the collection name is one of the keys in the $COLLECTIONS array.
After batching is finished for all the collections, run the deduplication with
./10.dedup.sh
The submission script will create the list of tasks, where each collection-language pair is a task, then ask for confirmation.
After confirmation, the script will block, showing the progress of all the tasks.
Be aware that this process may take hours or days, so it is recommended to run it in a terminal multiplexer like tmux or screen, to be able to detach and close the SSH connection to the cluster without killing the process.
Optional: while doing the merge-batching or dedup steps, the robots.txt processing can run in parallel. This step, however, does not use HQ, therefore it has to be submitted manually:
source .env # load env to have the collections array variable available
for i in ${!COLLECTIONS[@]}; do
sbatch -J $i-robotstxt --mem-per-cpu 7500 09.robotstxt $i
done
It is recommended, as shown in the example above, to use large memory nodes (1 TB in this case, on the LUMI cluster).
When all the deduplication tasks have finished, the annotation can be executed. For the annotation step the same logic applies, just run
./20.processing.sh
The output will be available at $WORKSPACE/annotated.
Optional: merge each language into a single directory:
source .env
srun -A $SBATCH_ACCOUNT --pty -p small --ntasks 1 --cpus-per-task 128 --mem-per-cpu 1750 -t 24:00:00 bash
module load parallel
parallel -j64 bash 21.merge-collections ::: `cat langs`
That will be available at $WORKSPACE/collections-merged.
Optional: run the cleaning from the merged collections:
./30.clean.sh
The output will be available at $WORKSPACE/cleaned.
The output format is JSONL, where each line is a valid JSON object representing a full document with all its metadata and text content. For example, the resulting JSON will be something like:
{"f":"./path/to/80716-00467.warc.gz","o":578687,"s":9202,"rs":102649,
"u":"https://www.example.com/some_text","c":"text/html","ts":"2021-05-09T10:26:25Z",
"collection":"wide17",
"lang":["eng_Latn","fra_Latn","deu_Latn"],"prob":[0.7479,0.076,0.0492],
"text":"this is paragraph1\nthis is paragraph2\nthis is paragraph3",
"seg_langs": ["eng_Latn","eng_Latn","eng_Latn"],
"id":"b8ff3519ba78334d6f63ed20239c42ce",
"filter":"word_avg_5",
"pii":[[23,34],[41,45]],
"doc_scores":[7.7,9.7,10.0,9.8,10.0,9.8,10.0,3.0,0.0],
"robots":"allowed"}
{"f":"./path/to/80716-00468.warc.gz","o":579437,"s":1100,"rs":44535,
...
"text":"another paragraph\n...",
...
In each document's text field, the paragraphs are concatenated using newline separators.
The first 7 fields are inherited from the warc2text HTML extraction from the WARCs, explained here, with the exception of the p field, which is replaced by text, and l, which is replaced by lang and prob, describing the top three identified languages for the document and their prediction probabilities.
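For example, a consumer of this output can pair each text segment with its predicted language from seg_langs (a minimal sketch, assuming the annotated format shown above):

import json, sys

for line in sys.stdin:
    doc = json.loads(line)
    # Segments and seg_langs are aligned one-to-one
    for segment, lang in zip(doc["text"].split("\n"), doc.get("seg_langs", [])):
        print(f"{lang}\t{segment}")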