Fast Near-duplicate Matching

Fast near-duplicate matching is a method for quickly finding near-duplicate spans in a document by utilizing the Rabin-Karp algorithm. This algorithm can be used to count the near-duplicates of a query in a pre-training corpus, facilitating the analysis of memorization in Large Language Models (LLMs).

This repository contains a Rust implementation of the fast near-duplicate matching algorithm, along with benchmarks for it. The core functionality is provided as the neardup crate.

Method

Fast Near-duplicate Matching

  • Input: Suffix $s$, document $d$, and $n$-gram size $n$
  • Output: Whether $d$ has a span near-duplicate to $s$

Pseudo-code in Python

def fast_near_duplicate_matching(s: list[int], d: list[int], n: int, threshold: float) -> bool:
    l_s = len(s)
    l_d = len(d)
    # Hash every n-gram of the suffix s; ngram must yield hashable tuples.
    H = set(ngram(s, n))
    # Scan every n-gram of d for a hash hit.
    for i in range(max(l_d - n + 1, 0)):
        if tuple(d[i:i+n]) in H:
            # Verify every length-l_s span of d that contains this n-gram.
            for j in range(max(i - l_s + n, 0), min(i, l_d - l_s) + 1):
                t = d[j:j+l_s]
                if Jaccard_W(s, t) >= threshold:
                    return True
    return False
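The pseudo-code relies on two helpers, `ngram` and `Jaccard_W`. A minimal sketch of what they might look like; the multiset n-gram Jaccard used here is an assumption for illustration, not necessarily the paper's exact weighting:

```python
from collections import Counter

def ngram(seq: list[int], n: int) -> list[tuple[int, ...]]:
    """All contiguous n-grams of seq, as hashable tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def Jaccard_W(s: list[int], t: list[int], n: int = 10) -> float:
    """Weighted (multiset) Jaccard similarity over n-grams.

    NOTE: this multiset definition is an assumption for illustration;
    the paper's exact weighting may differ.
    """
    a, b = Counter(ngram(s, n)), Counter(ngram(t, n))
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0
```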

You can use a fast hash function such as fxhash, or a rolling hash.

When the $n$-gram size $n$ is small, fxhash is faster than a rolling hash. When $n$ is large, however, the rolling hash is faster, because it can compute the hash of the next $n$-gram in $O(1)$ time instead of rehashing all $n$ tokens.
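The $O(1)$ update is the Rabin-Karp trick: subtract the outgoing token's contribution, shift, and add the incoming token. A minimal polynomial rolling hash sketch in Python (the base and modulus here are illustrative choices, not the values used by the crate):

```python
def rolling_hashes(tokens: list[int], n: int,
                   base: int = 1_000_003, mod: int = (1 << 61) - 1) -> list[int]:
    """Hash of every n-gram of tokens; each hash after the first costs O(1)."""
    if len(tokens) < n:
        return []
    # Hash of the first window, Horner-style.
    h = 0
    for tok in tokens[:n]:
        h = (h * base + tok) % mod
    hashes = [h]
    top = pow(base, n - 1, mod)  # weight of the outgoing token
    for i in range(n, len(tokens)):
        # Drop tokens[i - n], shift by base, append tokens[i].
        h = ((h - tokens[i - n] * top) * base + tokens[i]) % mod
        hashes.append(h)
    return hashes
```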

How to run

You can count the near-duplicates of queries in documents (two sample queries and 100 sample documents are provided in the sample_data folder) by running the following command:

$ cargo run --release -- --search-dir ./sample_data --query-path ./sample_data/query.jsonl --threshold 0.6 --n 10
[2024-08-07T10:59:40Z INFO  neardup] query_list_all: 2
[2024-08-07T10:59:40Z INFO  neardup] search_path_list len: 1
[2024-08-07T10:59:40Z INFO  neardup] path: "./sample_data/pythia-00000-00999.jsonl.gz" start loading token_ids_list
[2024-08-07T10:59:40Z INFO  neardup] loaded token_ids_list
[2024-08-07T10:59:40Z INFO  neardup] query: 0 count: 1
[2024-08-07T10:59:40Z INFO  neardup] query: 1 count: 1
[2024-08-07T10:59:40Z INFO  neardup] path idx: 0 finished
[2024-08-07T10:59:40Z INFO  neardup] count: [1, 1]

Count near-duplicates in the Pythia dataset

You can download the Pythia dataset from here. After downloading the dataset, you can convert it to the format that this program can read by running the following command:

$ python scripts/prepare_pythia.py --output_dir path/to/output/folder --pythia_data_path path/to/merged/folder/document

Then, you can count the near-duplicates of queries in the Pythia dataset by running the following command:

$ cargo run --release -- --search-dir path/to/output/folder --query-path path/to/output/folder/query.jsonl --threshold 0.6 --n 10

Benchmark

You can run the benchmark comparing three methods:

  • Fast near-duplicate matching with fxhash
  • Fast near-duplicate matching with rolling hash
  • Naive near-duplicate matching
$ cargo bench
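For contrast, the naive baseline in the third bullet must score every length-$|s|$ span of $d$, which is exactly the work the $n$-gram prefilter avoids. A minimal sketch (the `similarity` argument is a placeholder for the benchmark's actual weighted-Jaccard function):

```python
def naive_near_duplicate_matching(s: list[int], d: list[int],
                                  threshold: float, similarity) -> bool:
    """Check every length-len(s) span of d directly, with no n-gram prefilter."""
    l_s = len(s)
    return any(similarity(s, d[j:j + l_s]) >= threshold
               for j in range(len(d) - l_s + 1))
```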

Citation

@inproceedings{kiyomaru-etal-2024-comprehensive-analysis,
    title = "A Comprehensive Analysis of Memorization in Large Language Models",
    author = "Kiyomaru, Hirokazu  and
      Sugiura, Issa  and
      Kawahara, Daisuke  and
      Kurohashi, Sadao",
    booktitle = "Proceedings of the 17th International Natural Language Generation Conference",
    month = sep,
    year = "2024",
    address = "Tokyo, Japan",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.inlg-main.45",
    pages = "584--596",
}
