Fast near-duplicate matching is a method for quickly finding near-duplicate spans in a document by utilizing the Rabin-Karp algorithm. This algorithm can be used to count the near-duplicates of a query in a pre-training corpus, facilitating the analysis of memorization in Large Language Models (LLMs).
This repository contains the implementation of the fast near-duplicate matching algorithm and the benchmark for the algorithm in Rust. The core functionalities is provided as a crate neardup
.
- Input: Suffix
$s$ , document$d$ , and$n$ of$n$ -gram - Output: Whether
$d$ has a span near-duplicate to$s$
def fast_near_duplicate_matching(s: list[int], d: list[int], n: int, threshold: float) -> bool:
l_s = len(s)
l_d = len(d)
H = set(ngram(s, n))
for i in range(max(l_d - l_s, 0)):
if d[i:i+n] in H:
for j in range(max(i - l_s + n, 0), i):
t = d[j:j+l_s]
if Jaccard_W(s, t) >= threshold:
return True
return False
You can use fast hash functions like fxhash or rolling hash.
When the size of
You can count the near-duplicates of queries in the document (sample queries (2 queries) and documents (100 documents) are in the sample_data
folder) by running the following command:
$ cargo run --release -- --search-dir ./sample_data --query-path ./sample_data/query.jsonl --threshold 0.6 --n 10
[2024-08-07T10:59:40Z INFO neardup] query_list_all: 2
[2024-08-07T10:59:40Z INFO neardup] search_path_list len: 1
[2024-08-07T10:59:40Z INFO neardup] path: "./sample_data/pythia-00000-00999.jsonl.gz" start loading token_ids_list
[2024-08-07T10:59:40Z INFO neardup] loaded token_ids_list
[2024-08-07T10:59:40Z INFO neardup] query: 0 count: 1
[2024-08-07T10:59:40Z INFO neardup] query: 1 count: 1
[2024-08-07T10:59:40Z INFO neardup] path idx: 0 finished
[2024-08-07T10:59:40Z INFO neardup] count: [1, 1]
You can download the Pythia dataset from here After downloading the dataset, you can convert the dataset to the format that this program can read by running the following command:
$ python scripts/prepare_pythia.py --output_dir path/to/output/folder --pythia_data_path path/to/merged/folder/document
Then, you can count the near-duplicates of queries in the Pythia dataset by running the following command:
$ cargo run --release -- --search-dir path/to/output/folder --query-path path/to/output/folder/query.jsonl --threshold 0.6 --n 10
You can run the benchmark for the three methods:
- Fast near-duplicate matching with fxhash
- Fast near-duplicate matching with rolling hash
- Naive near-duplicate matching
$ cargo bench
@inproceedings{kiyomaru-etal-2024-comprehensive-analysis,
title = "A Comprehensive Analysis of Memorization in Large Language Models",
author = "Kiyomaru, Hirokazu and
Sugiura, Issa and
Kawahara, Daisuke and
Kurohashi, Sadao",
booktitle = "Proceedings of the 17th International Natural Language Generation Conference",
month = sep,
year = "2024",
address = "Tokyo, Japan",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.inlg-main.45",
pages = "584--596",
}