Regarding deduplication #79
Comments
+1 - given the fuzzy deduplication hashes, is there a simple/suggested way to cluster and sample them? Thanks for the great work!
Hi @kimcando and @ManuelFay and thanks for your questions!
Yes, we ran the entire dataset through a Bloomfilter for exact deduplication and published the duplicate ids as separate files (mirroring the dataset structure). Important to note is that the duplicates were deliberately kept in the dataset so that everyone can experiment with and study duplication in the training data.
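For illustration, here is a minimal, self-contained sketch of the idea behind Bloom-filter exact deduplication. This is not the pipeline code: the filter size, the number of hash positions, and hashing the raw document text as the key are assumptions chosen for readability (the actual pipeline deduplicated on .wet document hashes).

```python
# Minimal sketch of exact deduplication with a Bloom filter (illustration only,
# not the RedPajama pipeline code). Filter size, hash count, and document keys
# are assumptions chosen for readability.
import hashlib


class BloomFilter:
    """A tiny Bloom filter over a fixed-size bit array."""

    def __init__(self, num_bits: int = 1 << 24, num_hashes: int = 7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key: str):
        # Derive `num_hashes` bit positions from a single sha256 digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        for i in range(self.num_hashes):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "little") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def exact_dedup(documents):
    """Yield documents whose text hash has not (probably) been seen before."""
    seen = BloomFilter()
    for doc in documents:
        key = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if key in seen:
            continue  # likely an exact duplicate; skip (or record its id)
        seen.add(key)
        yield doc


if __name__ == "__main__":
    docs = [{"id": 1, "text": "hello"}, {"id": 2, "text": "hello"}, {"id": 3, "text": "world"}]
    print([d["id"] for d in exact_dedup(docs)])  # -> [1, 3]
```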
This is correct, we compute the minhash signatures in the same pass as the other quality signals. Note that this is just the signatures; to do fuzzy deduplication, you need to run LSH on these (see below on how to run this).
the dataset we provide comes with the minhash signatures, but not with the deduplication clusters. These need to be computed using the script in `app/src/run_lsh.py`. Here is a minimal example you can run from the root of the repository:

1) Download listings

```bash
DATA_ROOT="${HOME}/path/to/data"  # make sure this is an absolute path
mkdir -p "${DATA_ROOT}/listings"
listings_file="listings/en-2023-06-head_middle.txt"
wget "https://data.together.xyz/redpajama-data-v2/v1.0.0/${listings_file}" -O "${DATA_ROOT}/${listings_file}"
```

2) Download MinHash signatures

```bash
# read the first 5 lines here to run the example
head -n5 "${DATA_ROOT}/${listings_file}" | while read line;
do
  url="https://data.together.xyz/redpajama-data-v2/v1.0.0/minhash/${line}.minhash.parquet"
  dest="${DATA_ROOT}/minhash/${line}.minhash.parquet"
  mkdir -p "$(dirname "$dest")"
  wget "$url" -O "$dest"
  echo "minhash/${line}.minhash.parquet" >> "${DATA_ROOT}/minhash_listings.txt"
done
```

3) Run LSH at similarity level 0.7

```bash
cd app/
python3 src/run_lsh.py \
  --input_base_uri "file://${DATA_ROOT}/" \
  --output_dir "${DATA_ROOT}/minhash_clusters/" \
  --similarity 0.7 \
  --num_perm 128 \
  --listings "${DATA_ROOT}/minhash_listings.txt"
```

This will result in one parquet file for each input file, containing the MinHash cluster id for every (fuzzy duplicate) document in the corresponding input file.
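To then drop fuzzy duplicates using those cluster files, a rough sketch along the following lines should work. The column names `doc_id` and `cluster_id` are assumptions for illustration; inspect the parquet files produced by `run_lsh.py` for the actual schema.

```python
# Rough sketch: keep one document per MinHash cluster.
# Column names ("doc_id", "cluster_id") are assumptions -- check the actual
# schema of the parquet files written by run_lsh.py.
import glob
import pandas as pd

cluster_files = glob.glob("/path/to/data/minhash_clusters/**/*.parquet", recursive=True)
clusters = pd.concat(pd.read_parquet(f) for f in cluster_files)

# For every cluster, keep the first document id and mark the rest as duplicates.
duplicate_ids = set()
for _, group in clusters.groupby("cluster_id"):
    duplicate_ids.update(group["doc_id"].iloc[1:])

print(f"{len(duplicate_ids)} documents would be dropped as fuzzy duplicates")
```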
Thanks for replying. My follow-up question: when RedPajama-V2 is used for training models, a considerable amount of data has to be deduplicated. In that situation (e.g., to handle 20 trillion tokens), could you give me some hints on how many cores you used?
absolutely, the current LSH implementation does not scale to the entire dataset. I think to do full fuzzy deduplication, you will need to use multiple nodes (the implementation of MinHashLSH provided by BigCode here is probably a good starting point). With that said, the single-node LSH implementation in this repo is still workable for subsets of the data: to run LSH on 200M documents, we used a machine with 500GB RAM and 64 cores, and it took ~40 minutes. The exact (.wet document hash based) dedupe with the Bloom filter ran on the same machine in ~3.5 days for the 25B English documents.
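To make the subset approach concrete, one possible pattern is to split the listings into fixed-size chunks and run the single-node script per chunk. This is only a sketch: paths and the chunk size are placeholders, and duplicates spanning two chunks will not be found this way (for cross-chunk dedup you need a distributed MinHashLSH such as the BigCode implementation mentioned above).

```python
# Sketch: run the single-node LSH script on fixed-size chunks of the listings.
# Assumes a full minhash_listings.txt has already been built (see step 2 above).
import subprocess
from pathlib import Path

data_root = Path("/path/to/data")   # same DATA_ROOT as in the example above
listings = (data_root / "minhash_listings.txt").read_text().splitlines()
chunk_size = 10_000                 # number of minhash files per LSH run

for i in range(0, len(listings), chunk_size):
    chunk_id = i // chunk_size
    chunk_file = data_root / f"minhash_listings.chunk{chunk_id:04d}.txt"
    chunk_file.write_text("\n".join(listings[i:i + chunk_size]) + "\n")
    subprocess.run(
        [
            "python3", "src/run_lsh.py",
            "--input_base_uri", f"file://{data_root}/",
            "--output_dir", str(data_root / f"minhash_clusters_chunk{chunk_id:04d}"),
            "--similarity", "0.7",
            "--num_perm", "128",
            "--listings", str(chunk_file),
        ],
        cwd="app",   # run from the repo's app/ directory, as in the example above
        check=True,
    )
```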
Is it possible to replace minhash with simhash? IIRC, dedup on exact match of simhash signatures is sufficient to remove near-duplicate documents.
Hi @edwardzjl, you can use simhash for near deduplication, but you need to explicitly compute new hashes for that.
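For anyone who wants to try that route, below is a small self-contained sketch of computing 64-bit simhash signatures over whitespace tokens and deduplicating on exact signature matches. The tokenization, hash width, and exact-match criterion are all simplifications; this is not part of the RedPajama tooling.

```python
# Self-contained 64-bit simhash sketch (not part of the RedPajama pipeline).
# Tokenization and the exact-signature-match criterion are simplifications.
import hashlib
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash signature from whitespace-separated tokens."""
    weights = [0] * bits
    for token, count in Counter(text.lower().split()).items():
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        for b in range(bits):
            weights[b] += count if (h >> b) & 1 else -count
    return sum(1 << b for b in range(bits) if weights[b] > 0)


def dedup_by_simhash(documents):
    """Keep only the first document for each distinct simhash signature."""
    seen = set()
    for doc in documents:
        sig = simhash(doc["text"])
        if sig in seen:
            continue
        seen.add(sig)
        yield doc


if __name__ == "__main__":
    # Docs 1 and 2 contain the same bag of tokens, so they get identical signatures.
    docs = [
        {"id": 1, "text": "the quick brown fox jumps over the lazy dog"},
        {"id": 2, "text": "the lazy dog jumps over the quick brown fox"},
        {"id": 3, "text": "an entirely different document about something else"},
    ]
    print([d["id"] for d in dedup_by_simhash(docs)])  # -> [1, 3]
```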
Hey,
thank you in advance for your great work and sharing the data :)
I read the README and the Hugging Face dataset details, and it was unclear to me whether fuzzy deduplication has actually been applied to this dataset.
I understand that
Therefore, my question is: has fuzzy deduplication already been applied to the provided dataset?
If so, could you please share how many cores you used (if in a distributed environment, how many instances and of which type), and how long it took?
Cheeeers!!