Releases: hplt-project/monotextor-slurm
v2.0
v1.0
Initial release of the monotexting pipeline on LUMI.
- Extraction of text from warc2text directories and division of the input into batches.
- Processing of the documents and addition of Monocleaner metadata (fluency score and language identification for each segment).
- Conversion to JSONL format.
- Near-deduplication with MinHash (98c1717).
- Cleaning filters: https://github.com/hplt-project/monotextor-slurm/tree/v1.0#cleaning
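
For readers unfamiliar with the near-deduplication step listed above, the following is a minimal sketch of the general MinHash idea. It is not the pipeline's actual implementation: the function names, the `text` field, the word 3-gram shingling, the 64 hash functions, and the 0.8 threshold are all assumptions chosen for illustration, and the pairwise comparison shown here would not scale to a real corpus without an index such as LSH.

```python
import hashlib


def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in the text."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_perm=64):
    """For each seeded hash function, keep the minimum hash over all shingles."""
    doc_shingles = shingles(text)
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in doc_shingles
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(docs, threshold=0.8):
    """Keep a document only if it is not too similar to any document already kept."""
    kept = []  # list of (document, signature) pairs
    for doc in docs:
        sig = minhash_signature(doc["text"])  # "text" is a hypothetical field name
        if all(estimated_jaccard(sig, other) < threshold for _, other in kept):
            kept.append((doc, sig))
    return [doc for doc, _ in kept]
```

Calling `deduplicate` on a list of dictionaries with a `text` key returns the same list with near-duplicates of earlier documents dropped, in the original order.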