Can't find terms in BM25 index when retrieving the document frequency for terms #1742
My goal is to efficiently get the document frequency of every unique term in my documents. To do this, I use Pyserini to build a BM25 index and then call a function that returns the document frequency. First, I have a document corpus, and I process each document by:
to get the documents for building the BM25 index. Second, I call:
to build the BM25 index. Finally, I iterate over the vocabulary and call the function get_df to return the document frequency:
However, I find that some words in the
utilize a different analyzer? Thanks.
Replies: 1 comment 5 replies
The issue might be related to stemming. Check this out: https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md
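To make the stemming point concrete, here is a minimal sketch, not the code from the original post. It assumes the index was built with Pyserini's default analyzer, so the stored terms are already stemmed and a raw vocabulary word may not match any of them. The helper names (`load_index_reader`, `df_from_index`, `df_for_raw_word`) are made up for illustration, and the import path may differ between Pyserini versions.

```python
def load_index_reader(index_dir):
    """Open a Lucene index with Pyserini (imported lazily; requires Java)."""
    # Assumption: this import path matches your Pyserini version. Some
    # releases expose the class elsewhere or under another name, so
    # check the usage-indexreader doc linked above.
    from pyserini.index.lucene import IndexReader
    return IndexReader(index_dir)


def df_from_index(reader):
    """Map every term stored in the index to its document frequency.

    The keys are the analyzed (e.g. stemmed) forms, not the raw words.
    """
    return {t.term: t.df for t in reader.terms()}


def df_for_raw_word(reader, word):
    """Look up the document frequency of an unprocessed word.

    The word is analyzed first so that it matches what the index stores.
    """
    tokens = reader.analyze(word)
    if not tokens:
        # The analyzer dropped the word entirely (e.g. a stopword).
        return 0
    # Only the first token is used; a word that the analyzer splits into
    # several tokens would need separate handling.
    df, _cf = reader.get_term_counts(tokens[0], analyzer=None)
    return df
```

If all you need is the document frequency of every term, iterating `reader.terms()` avoids the mismatch altogether, because it walks the index's own vocabulary instead of one computed outside the index.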