This is an experimental implementation of density sampling for performing semantic deduplication of large corpora. The main idea is to remove semi-duplicates from a large dataset to allow for faster and more accurate training.
The script aims to implement a very efficient way of calculating these density scores, based on the ideas in the following papers (a short illustrative sketch of the core idea follows the reference list):
1. Coleman, Benjamin, and Anshumali Shrivastava. "Sub-linear RACE sketches for approximate kernel density estimation on streaming data." Proceedings of The Web Conference 2020. 2020.
2. Coleman, Benjamin, Richard Baraniuk, and Anshumali Shrivastava. "Sub-linear memory sketches for near neighbor search on streaming data." International Conference on Machine Learning. PMLR, 2020.
3. Coleman, Benjamin, and Anshumali Shrivastava. "A one-pass distributed and private sketch for kernel sums with applications to machine learning at scale." Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security. 2021.
4. Coleman, Benjamin, et al. "One-pass diversified sampling with application to terabyte-scale genomic sequence streams." International Conference on Machine Learning. PMLR, 2022.
5. Liu, Zichang, et al. "One-Pass Distribution Sketch for Measuring Data Heterogeneity in Federated Learning." Advances in Neural Information Processing Systems 36 (2024).
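To give a flavour of the approach, here is a minimal, illustrative RACE-style sketch (repeated arrays of count estimators over p-stable LSH buckets) for approximate kernel density estimation. This is an assumption-laden reimplementation for intuition only, not the code in this repository; the parameter names mirror the command-line flags used further down.

```python
import numpy as np

class RaceSketch:
    """Minimal RACE-style sketch for approximate kernel density estimation
    (after Coleman & Shrivastava, WWW 2020). Illustrative only; this is not
    the implementation in create_density_scores.py."""

    def __init__(self, dim, sketch_reps=1000, sketch_range=20000,
                 bandwidth=0.035, seed=0):
        rng = np.random.default_rng(seed)
        # One p-stable (Gaussian) LSH function per repetition.
        self.W = rng.standard_normal((sketch_reps, dim))
        self.b = rng.uniform(0.0, bandwidth, size=sketch_reps)
        self.bandwidth = bandwidth
        self.range = sketch_range
        self.counts = np.zeros((sketch_reps, sketch_range), dtype=np.int64)
        self.rows = np.arange(sketch_reps)
        self.n = 0

    def _buckets(self, x):
        # h_i(x) = floor((w_i . x + b_i) / bandwidth), folded into the counter range.
        h = np.floor((self.W @ x + self.b) / self.bandwidth).astype(np.int64)
        return h % self.range

    def add(self, x):
        # Increment one counter per repetition for this embedding.
        self.counts[self.rows, self._buckets(x)] += 1
        self.n += 1

    def score(self, x):
        # Average counter value at x's buckets approximates the kernel density
        # of x under the stream of embeddings added so far.
        return self.counts[self.rows, self._buckets(x)].mean() / max(self.n, 1)
```

The --sketch_reps, --sketch_range and --kernel_bandwidth flags of create_density_scores.py shown below appear to correspond to these parameters.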
The script assumes the following directory structure:
main/
|-- original_corpus/
|-- paths/
|-- embeddings/
|-- normalised_embeddings/
|-- scratch/
|-- density_scores/
|-- final/
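If the layout does not exist yet, it can be created with a few lines of Python (a hypothetical convenience snippet, not part of the repository):

```python
from pathlib import Path

# Create the expected folder layout under main/ (names taken from the tree above).
for name in ["original_corpus", "paths", "embeddings", "normalised_embeddings",
             "scratch", "density_scores", "final"]:
    Path("main", name).mkdir(parents=True, exist_ok=True)
```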
The first part creates embeddings. Currently it uses sentence-transformers/all-MiniLM-L6-v2, which creates 384-dimensional embeddings. This can be replaced with any other encoder model from Hugging Face. The default model already L2-normalises its embeddings, so they can be saved directly to normalised_embeddings/. The script reads the text field of the jsonlines file. If your corpus is in Parquet, please use utils/convert_parquet_to_jsonlines.py first.
Note that this script takes quite a long time to run even on fast computers. It works on single files, so that it can easily be parallelised.
python create_embeddings.py --input_file myfile.jsonl --paths_dir paths --embeddings_dir embeddings --emb_size 384
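For reference, a rough sketch of what this step amounts to is shown below, assuming the sentence-transformers package is installed; the real create_embeddings.py, its on-disk format and file naming may differ.

```python
import json
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative sketch of the embedding step, not the repository's script.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

texts = []
with open("original_corpus/myfile.jsonl", encoding="utf-8") as f:
    for line in f:
        texts.append(json.loads(line)["text"])  # the script reads the text field

# all-MiniLM-L6-v2 produces 384-dimensional vectors; with
# normalize_embeddings=True they are already L2-normalised, so they can go
# straight to normalised_embeddings/ (the .npy layout here is an assumption).
emb = model.encode(texts, batch_size=256, normalize_embeddings=True)
np.save("normalised_embeddings/myfile.npy", np.asarray(emb, dtype=np.float32))
```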
Note that for the default model the embeddings are already normalised. If you need to use another model that does not normalise the output, please use the script create_normalised_embeddings.py.
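That normalisation step presumably amounts to rescaling each embedding to unit L2 norm; a minimal, hypothetical equivalent (assuming embeddings are stored as .npy files, which may not match the repository's format) looks like this:

```python
import numpy as np

# Rescale every embedding row to unit L2 norm before density scoring.
emb = np.load("embeddings/myfile.npy")
norms = np.linalg.norm(emb, axis=1, keepdims=True)
np.save("normalised_embeddings/myfile.npy", emb / np.clip(norms, 1e-12, None))
```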
The next step is to create the density scores. This script should take roughly an hour per GB of data.
python create_density_scores.py --input_folder normalised_embeddings --output_folder density_scores --kernel_bandwidth 0.035 --sketch_reps 1000 --sketch_range 20000
There is also an experimental mode that works directly on the non-normalised embeddings:
python create_density_scores.py --embedding_input_folder embeddings --json_output_folder nonormalised_density_scores --nonormalise
TODO:
- Does not work
- Batching is weird
- Not merged with the jsonlines file
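The merge of the density scores back into the jsonlines corpus is still on the TODO list, but the intended downstream use suggested by the papers above is to thin out dense regions. A purely hypothetical illustration (the score file layout, the "density_score" field name and the final/ output convention are all assumptions):

```python
import json
import numpy as np

# Hypothetical final step, not part of the repository: keep each document with
# probability inversely proportional to its density score, so semi-duplicates
# in dense regions are thinned out while rare documents are kept.
rng = np.random.default_rng(0)
with open("density_scores/myfile.jsonl", encoding="utf-8") as f:
    scores = np.array([json.loads(line)["density_score"] for line in f])
scores = np.maximum(scores, 1e-12)                     # guard against zeros
keep_prob = np.clip(scores.min() / scores, 0.0, 1.0)   # rarest documents -> 1.0

with open("original_corpus/myfile.jsonl", encoding="utf-8") as fin, \
     open("final/myfile.jsonl", "w", encoding="utf-8") as fout:
    for line, p in zip(fin, keep_prob):
        if rng.random() < p:
            fout.write(line)
```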