On indexing long documents #1541

superhans · 2023-06-02T20:56:40Z

Do you have a guide or recommended best practices for indexing long documents and searching within individual documents?

Say I have a collection of long documents, each 10,000+ tokens, and I want to do dense retrieval on them.

One way would be to split each long document into chunks of 512 tokens (or whatever size) and index each chunk. This is identical to the DPR case.

But done this way, at search time I'm searching across all 512-token chunks in the entire corpus. What I would like to do at search time is search only within a particular long document (in other words, filter by long document first and then do the search).
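To make the "filter by long document first" idea concrete, here is a minimal sketch in plain Python. It is not Pyserini code: the function names are hypothetical, whitespace splitting stands in for a real model tokenizer, and the `parent#n` chunk id scheme is just an illustrative choice. It only shows how chunks can be grouped by parent document so that a search can be limited to one document's chunks.

```python
from collections import defaultdict

def chunk_tokens(tokens, size=512):
    """Split a token list into consecutive windows of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def build_chunk_table(docs, size=512):
    """Map each parent document id to its list of (chunk_id, tokens) pairs."""
    table = defaultdict(list)
    for doc_id, text in docs.items():
        # Assumption: whitespace tokens stand in for the encoder's real tokens.
        tokens = text.split()
        for n, window in enumerate(chunk_tokens(tokens, size), start=1):
            # Hypothetical id scheme: parent id, a "#" delimiter, segment number.
            table[doc_id].append((f"{doc_id}#{n}", window))
    return dict(table)

# Toy corpus: one document of 1100 tokens, one of 300 tokens.
docs = {"doc1": "a " * 1100, "doc2": "b " * 300}
table = build_chunk_table(docs)

# "Filter first": only the chunks of doc1 are candidates for the search step.
candidates = table["doc1"]
print([chunk_id for chunk_id, _ in candidates])  # ['doc1#1', 'doc1#2', 'doc1#3']
```

With a table like this, the scoring step (dense or sparse) would only ever see the chunks of the selected document.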

superhans · 2023-06-02T22:51:41Z

superhans
Jun 2, 2023
Author

I took a look at the discussion in #1372 and also at the relevant sections of your publication (https://cs.uwaterloo.ca/~jimmylin/publications/Ma_etal_SIGIR2022.pdf), which used the segmented ms_marco_v2 corpus.

I guess a rephrasing of my original question is: if I ask a question about a particular long document, how do I ensure that all the other long documents in my index are excluded from the search?

lintool · 2023-06-02T23:05:10Z

lintool
Jun 2, 2023
Maintainer

Look here: https://castorini.github.io/pyserini/2cr/msmarco-v1-doc.html

Consider the BM25 doc segmented condition:

python -m pyserini.search.lucene \
  --threads 16 --batch-size 128 \
  --index msmarco-v1-doc-segmented-slim \
  --topics dl19-doc \
  --output run.msmarco-v1-doc.bm25-doc-segmented-default.dl19.txt \
  --bm25 --k1 0.9 --b 0.4 --hits 10000 --max-passage-hits 1000 --max-passage

The key is --hits 10000 --max-passage-hits 1000 --max-passage:

--max-passage is known as the MaxP technique: https://arxiv.org/pdf/1905.09217.pdf
--hits 10000: start with the top 10k passages
--max-passage-hits 1000: after applying MaxP, retain only the top 1k
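For readers unfamiliar with MaxP, the aggregation step can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Pyserini's implementation: the function name is hypothetical, the scores are made up, and the "#" delimiter between document id and segment number is an assumption.

```python
def max_passage(hits, delimiter="#", keep=1000):
    """Collapse passage-level hits into document-level hits.

    Each parent document gets the score of its best-scoring passage,
    and only the top `keep` documents are retained.
    """
    best = {}
    for seg_id, score in hits:
        doc_id = seg_id.split(delimiter)[0]  # "doc2#3" -> "doc2"
        if doc_id not in best or score > best[doc_id]:
            best[doc_id] = score
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:keep]

# Toy passage-level hits as (segment id, score) pairs.
hits = [
    ("doc2#3", 9.1),
    ("doc1#0", 8.4),
    ("doc2#1", 7.7),
    ("doc3#5", 6.0),
    ("doc1#2", 5.2),
]
doc_ranking = max_passage(hits, keep=2)
print(doc_ranking)  # [('doc2', 9.1), ('doc1', 8.4)]
```

Note that MaxP ranks documents against each other; by itself it does not limit the search to one chosen document.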

superhans · 2023-06-02T23:30:12Z

superhans
Jun 2, 2023
Author

I will try this out, but just to make sure I am communicating my original question clearly:

If I have k long documents, d_1…d_k, each segmented into passage-level chunks, and I want to restrict my query to search only d_i, does your suggestion guarantee that I retrieve passages only from document d_i (and from no other docs)? Or do I have to build a separate index for each long document to ensure this?

3 replies

lintool Jun 3, 2023
Maintainer

You'll have to index each of your passages as doc1#1, doc1#2, ... (you'll have to double-check the code, but IIRC the delimiter is #).
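With ids of that shape, one possible way to keep only passages from a single document is to filter the returned hits by id prefix. The sketch below is plain Python, not a Pyserini feature, and nothing in this thread confirms it as the recommended approach; the function name is hypothetical and the "#" delimiter is assumed as above.

```python
def restrict_to_document(hits, doc_id, delimiter="#"):
    """Keep only the hits whose segment id belongs to `doc_id`.

    The delimiter is part of the prefix so that "doc1" does not
    accidentally match segments of "doc10".
    """
    prefix = doc_id + delimiter
    return [(seg_id, score) for seg_id, score in hits if seg_id.startswith(prefix)]

# Toy passage-level hits as (segment id, score) pairs, best first.
hits = [
    ("doc10#1", 9.0),
    ("doc1#2", 8.5),
    ("doc2#1", 8.0),
    ("doc1#1", 7.5),
]
only_doc1 = restrict_to_document(hits, "doc1")
print(only_doc1)  # [('doc1#2', 8.5), ('doc1#1', 7.5)]
```

A filter like this guarantees that no other document shows up in the output, but it can only keep passages that were retrieved in the first place, so the initial hit list has to be deep enough to include the target document's passages.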

superhans Jun 6, 2023
Author

Aah. But does the --max-passage trick work with dense retrievers as well?

lintool Jun 6, 2023
Maintainer

Should work: https://github.com/castorini/pyserini/blob/master/pyserini/search/faiss/__main__.py#L168
