-
I took a look at the discussion in #1372 and at the relevant sections of your publication (https://cs.uwaterloo.ca/~jimmylin/publications/Ma_etal_SIGIR2022.pdf), which used the segmented ms_marco_v2 corpus. I guess a rephrasing of my original question is: if I want to ask a question about one particular long document, how do I ensure that all the other long documents in my index are excluded from the search?
-
Look here: https://castorini.github.io/pyserini/2cr/msmarco-v1-doc.html. Consider the BM25 doc segmented condition:
The key is
-
I will try this out, but just to make sure I am communicating my original question clearly: if I have k long documents, d_1…d_k, each segmented into passage-level chunks, and I want to restrict my query to search only d_i, does your suggestion guarantee that I retrieve passages only from document d_i (and from no other documents)? Or do I have to build a separate index for each long document to ensure this?
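To make the guarantee I am asking about concrete, here is a minimal sketch of one option: search the shared index, then keep only hits whose parent document is d_i. It assumes segment ids look like `<docid>#<segment>`; the helper names and the sample hits are hypothetical and are not part of any library.

```python
def parent_doc_id(segment_id: str) -> str:
    """Return the parent document id, assuming ids of the form '<docid>#<segment>'."""
    return segment_id.split("#", 1)[0]


def filter_hits_to_doc(hits, target_doc_id, k):
    """Keep at most k (segment_id, score) hits whose parent is target_doc_id.

    The input order (best score first) is preserved.
    """
    kept = [(sid, score) for sid, score in hits
            if parent_doc_id(sid) == target_doc_id]
    return kept[:k]


# Hypothetical ranked list from a search over the whole corpus, best first.
hits = [("D2#0", 9.1), ("D1#3", 8.7), ("D3#1", 8.2), ("D1#0", 7.9)]
filtered = filter_hits_to_doc(hits, "D1", k=10)
print(filtered)
```

Filtering after the search does ensure that only passages from d_i come back, but it can return fewer than k of them (or none) if segments from other documents fill the initial ranked list, so the initial retrieval depth would have to be large.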
-
Do you have a guide, or recommended best practices, for indexing long documents and searching within individual documents?
Say I have a collection of long documents of 10000+ tokens each, and I want to do dense retrieval on them.
One way would be to split each long document into chunks of 512 tokens (or whatever size) and index each chunk. This is identical to the DPR case.
But done this way, at search time I am searching over all 512-token chunks across the entire corpus. What I would like to do at search time is search only within one particular long document (in other words, filter by long document first and then do the search).
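As a rough illustration of the "filter first, then search" behaviour described above, here is a self-contained sketch in plain Python. Everything in it is an assumption made for illustration: `toy_embed` is a hashed bag-of-words stand-in for a real dense encoder, the chunk size is 4 tokens instead of 512 so the example stays small, and the per-document dictionary is a hypothetical structure, not how Pyserini stores its indexes.

```python
import math
from collections import defaultdict

CHUNK_SIZE = 4  # tokens per chunk; would be 512 (or similar) in the real setup


def chunk_tokens(tokens, size=CHUNK_SIZE):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def toy_embed(tokens, dim=16):
    """Stand-in for a dense encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for tok in tokens:
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(docs):
    """Map each doc_id to its list of (chunk_id, vector) entries."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for n, chunk in enumerate(chunk_tokens(text.split())):
            index[doc_id].append((f"{doc_id}#{n}", toy_embed(chunk)))
    return index


def search_within(index, doc_id, query, k=3):
    """Score only the chunks of `doc_id`; other documents are never touched."""
    q = toy_embed(query.split())
    scored = [(cid, sum(a * b for a, b in zip(q, vec)))
              for cid, vec in index.get(doc_id, [])]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


docs = {
    "d1": "bm25 ranks passages by term overlap with the query text",
    "d2": "dense retrieval encodes passages and queries into vectors",
}
index = build_index(docs)
results = search_within(index, "d2", "dense vectors", k=2)
print(results)
```

Because candidates are drawn only from the chunk list of the chosen document, passages from other documents cannot appear in the results. The cost is that a per-document mapping has to be kept alongside (or instead of) one flat corpus-wide index.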