Fast Near-duplicate Matching

Fast near-duplicate matching is a method for quickly finding near-duplicate spans in a document by utilizing the Rabin-Karp algorithm. This algorithm can be used to count the near-duplicates of a query in a pre-training corpus, facilitating the analysis of memorization in Large Language Models (LLMs).

This repository contains the implementation of the fast near-duplicate matching algorithm and the benchmark for the algorithm in Rust. The core functionalities is provided as a crate neardup.

Method

Fast Near-duplicate Matching

Input: Suffix $s$, document $d$, and $n$ of $n$-gram
Output: Whether $d$ has a span near-duplicate to $s$

Pseudo-code in Python

def fast_near_duplicate_matching(s: list[int], d: list[int], n: int, threshold: float) -> bool:
    l_s = len(s)
    l_d = len(d)
    H = set(ngram(s, n))
    for i in range(max(l_d - l_s, 0)):
        if d[i:i+n] in H:
            for j in range(max(i - l_s + n, 0), i):
                t = d[j:j+l_s]
                if Jaccard_W(s, t) >= threshold:
                    return True
    return False

You can use fast hash functions like fxhash or rolling hash.

When the size of $n$ of $n$-gram is small, fxhash is faster than rolling hash. However, when the size of $n$ is large, rolling hash is faster than fxhash because the rolling hash can calculate the hash value of the next $n$-gram in $O(1)$ time.

How to run

You can count the near-duplicates of queries in the document (sample queries (2 queries) and documents (100 documents) are in the sample_data folder) by running the following command:

$ cargo run --release -- --search-dir ./sample_data --query-path ./sample_data/query.jsonl --threshold 0.6 --n 10
[2024-08-07T10:59:40Z INFO  neardup] query_list_all: 2
[2024-08-07T10:59:40Z INFO  neardup] search_path_list len: 1
[2024-08-07T10:59:40Z INFO  neardup] path: "./sample_data/pythia-00000-00999.jsonl.gz" start loading token_ids_list
[2024-08-07T10:59:40Z INFO  neardup] loaded token_ids_list
[2024-08-07T10:59:40Z INFO  neardup] query: 0 count: 1
[2024-08-07T10:59:40Z INFO  neardup] query: 1 count: 1
[2024-08-07T10:59:40Z INFO  neardup] path idx: 0 finished
[2024-08-07T10:59:40Z INFO  neardup] count: [1, 1]

Count near-duplicates in the Pythia dataset

You can download the Pythia dataset from here After downloading the dataset, you can convert the dataset to the format that this program can read by running the following command:

$ python scripts/prepare_pythia.py --output_dir path/to/output/folder --pythia_data_path path/to/merged/folder/document

Then, you can count the near-duplicates of queries in the Pythia dataset by running the following command:

$ cargo run --release -- --search-dir path/to/output/folder --query-path path/to/output/folder/query.jsonl --threshold 0.6 --n 10

Benchmark

You can run the benchmark for the three methods:

Fast near-duplicate matching with fxhash
Fast near-duplicate matching with rolling hash
Naive near-duplicate matching

$ cargo bench

Citation

@inproceedings{kiyomaru-etal-2024-comprehensive-analysis,
    title = "A Comprehensive Analysis of Memorization in Large Language Models",
    author = "Kiyomaru, Hirokazu  and
      Sugiura, Issa  and
      Kawahara, Daisuke  and
      Kurohashi, Sadao",
    booktitle = "Proceedings of the 17th International Natural Language Generation Conference",
    month = sep,
    year = "2024",
    address = "Tokyo, Japan",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.inlg-main.45",
    pages = "584--596",
}

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
benches		benches
sample_data		sample_data
scripts		scripts
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Fast Near-duplicate Matching

Method

Fast Near-duplicate Matching

Pseudo-code in Python

How to run

Count near-duplicates in the Pythia dataset

Benchmark

Citation

References

About

Releases

Packages

Languages

License

speed1313/fast-near-duplicate-matching

Folders and files

Latest commit

History

Repository files navigation

Fast Near-duplicate Matching

Method

Fast Near-duplicate Matching

Pseudo-code in Python

How to run

Count near-duplicates in the Pythia dataset

Benchmark

Citation

References

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages