Can't find terms in BM25 index when retrieving the document frequency for terms #1742

zhiyuanpeng · 2023-12-09T00:56:21Z

My goal is to efficiently get the document frequency of every unique term in my documents. To do this, I use pyserini to build a BM25 index and then call a function that returns the document frequency.

First, I have a document corpus and I process each document by:

    from tqdm import tqdm
    from pyserini.analysis import Analyzer, get_lucene_analyzer

    # Default Lucene analyzer: tokenizes, removes stop words, applies Porter stemming
    analyzer = get_lucene_analyzer()
    default_en_analyzer = Analyzer(analyzer)

    # LmdbDataset, input_db_path and keys are defined elsewhere (not shown)
    items_data = LmdbDataset(input_db_path)
    docs, vocabs, lmdb_docs, diff_ids_num = [], set(), [], 0
    for key in tqdm(keys):
        item = items_data[key]
        # Analyze the string into a list of tokens
        tokens = default_en_analyzer.analyze(item)
        text = " ".join(tokens)
        if not text or text.isspace():
            continue
        # Collect the analyzed tokens into the vocabulary
        vocabs.update(tokens)
        # Build the record that will be written to JSON
        title = ""
        data = {"id": key, "title": title, "contents": text}
        docs.append(data)

to get the documents for building the BM25 index (docs) and the vocabulary (vocabs). I use DefaultEnglishAnalyzer to tokenize the documents, remove stop words, and apply Porter stemming.
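For reference, the loop above only collects the records in memory. A minimal sketch of writing them out as JSON Lines (one object per line with `id` and `contents` fields) could look like the following. The helper name `write_jsonl`, the file name `docs.jsonl`, and the example records are assumptions made for illustration, not part of the original code.

```python
import json
import os
import tempfile

def write_jsonl(docs, doc_dir, filename="docs.jsonl"):
    """Write one JSON object per line; return the path of the file written."""
    os.makedirs(doc_dir, exist_ok=True)
    path = os.path.join(doc_dir, filename)
    with open(path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(json.dumps(doc, ensure_ascii=False) + "\n")
    return path

# Hypothetical records in the same shape as the `data` dicts built above.
example_docs = [
    {"id": "d1", "title": "", "contents": "first exampl document"},
    {"id": "d2", "title": "", "contents": "second exampl document"},
]
out_path = write_jsonl(example_docs, tempfile.mkdtemp())
```

The directory passed as `doc_dir` would then be the one given to `-input` when indexing.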

Second, I call:

    python -m pyserini.index -collection JsonCollection \
            -generator DefaultLuceneDocumentGenerator -threads {threads} \
            -input {self.doc_dir} -index {self.index_dir} -storeRaw \
            -storePositions -storeDocvectors

to build the BM25 index.

Finally, I iterate over vocabs and call the function get_df to return the document frequency of each term:

    import os
    from os.path import join

    from pyserini.pyclass import autoclass

    # Lucene classes accessed through pyjnius
    FSDirectory = autoclass('org.apache.lucene.store.FSDirectory')
    Paths = autoclass('java.nio.file.Paths')
    DirectoryReader = autoclass('org.apache.lucene.index.DirectoryReader')
    Term = autoclass('org.apache.lucene.index.Term')

    # cwd was not defined in the original snippet; assumed to be the working directory
    cwd = os.getcwd()
    index_dir = join(cwd, "baselines", "BM25", "index")
    reader = DirectoryReader.open(FSDirectory.open(Paths.get(index_dir)))

    def get_df(term_str):
        # Look the term up exactly as given in the "contents" field; no analysis is applied
        term = Term("contents", term_str)
        return reader.docFreq(term)

However, I find that some words in vocabs have a document frequency of 0, which is wrong because these words were collected from the documents using DefaultEnglishAnalyzer. Does this command:

    python -m pyserini.index -collection JsonCollection \
            -generator DefaultLuceneDocumentGenerator -threads {threads} \
            -input {self.doc_dir} -index {self.index_dir} -storeRaw \
            -storePositions -storeDocvectors

utilize a different analyzer? Thanks.

Answered by lintool

lintool · 2023-12-09T00:59:12Z

Maintainer

Issues might be related to stemming. Check this out: https://github.com/castorini/pyserini/blob/master/docs/usage-indexreader.md
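The linked page documents Pyserini's own `IndexReader`, whose `get_term_counts(term, analyzer=None)` call looks up a term that has already been analyzed and returns a `(df, cf)` pair. A minimal sketch of using it to list the vocabulary terms that come back with a document frequency of 0 might look like this. The helper name `zero_df_terms` is hypothetical, and the sketch assumes that absent terms are reported with a df of 0.

```python
def zero_df_terms(index_reader, vocab):
    """Return the vocab terms whose document frequency in the index is 0.

    `index_reader` is expected to expose get_term_counts(term, analyzer=None)
    returning (df, cf), as described on the linked usage page.
    """
    missing = []
    for term in sorted(vocab):
        # analyzer=None: the term is treated as already analyzed
        df, _cf = index_reader.get_term_counts(term, analyzer=None)
        if df == 0:
            missing.append(term)
    return missing

# Usage sketch (import path may differ between Pyserini versions):
#   from pyserini.index.lucene import IndexReader
#   reader = IndexReader(index_dir)
#   print(zero_df_terms(reader, vocabs))
```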

5 replies

zhiyuanpeng Dec 9, 2023
Author

Thanks, that file is very helpful. I have applied pyserini's:

    from pyserini.analysis import Analyzer, get_lucene_analyzer
    analyzer = get_lucene_analyzer()
    default_en_analyzer = Analyzer(analyzer)

to tokenize, stem, and remove stop words, so the words in vocabs are already stemmed. One possible reason is that the command I used to build the index uses a different analyzer:

    python -m pyserini.index -collection JsonCollection \
            -generator DefaultLuceneDocumentGenerator -threads {threads} \
            -input {self.doc_dir} -index {self.index_dir} -storeRaw \
            -storePositions -storeDocvectors

to process my documents. What's the analyzer utilized in the above command? I need the same analyzer to process my documents. Thanks.

lintool Dec 9, 2023
Maintainer

Look here: https://github.com/castorini/anserini/blob/master/src/main/java/io/anserini/index/IndexCollection.java#L407

zhiyuanpeng Dec 9, 2023
Author

Thanks, I have run this command:

    python -m pyserini.index -collection JsonCollection \
            -generator DefaultLuceneDocumentGenerator -threads {threads} \
            -input {self.doc_dir} -index {self.index_dir} -storeRaw \
            -storePositions -storeDocvectors

step by step, and found that DefaultEnglishAnalyzer is used. Before building the index, I process the documents with DefaultEnglishAnalyzer and then call the command to build the index, which means the documents are processed by DefaultEnglishAnalyzer twice. That may be why I can't find some words in the built index. I will modify the code and try again.
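The double-analysis effect described above can be reproduced with a small, self-contained toy. Note that `toy_stem` is a deliberately simplistic suffix stripper invented for this illustration, not Lucene's Porter stemmer, and the two example documents are made up. The point is only that a stemmer applied twice can produce different tokens than a stemmer applied once, so a vocabulary collected after one pass may not match an index built after two.

```python
from collections import defaultdict

def toy_stem(token):
    """Strip at most one suffix per call (toy rules, NOT the Porter stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def toy_analyze(text):
    """Lowercase, split on whitespace, and stem each token once."""
    return [toy_stem(tok) for tok in text.lower().split()]

def build_df(docs_tokens):
    """Map each term to the number of documents that contain it."""
    df = defaultdict(int)
    for tokens in docs_tokens:
        for term in set(tokens):
            df[term] += 1
    return df

raw_docs = ["weddings and meetings", "the wedding planner"]

# Pass 1: the pre-processing step; the vocabulary is collected here.
once = [toy_analyze(doc) for doc in raw_docs]
vocab = {tok for tokens in once for tok in tokens}

# Pass 2: the indexer analyzes the already-analyzed text a second time.
twice = [toy_analyze(" ".join(tokens)) for tokens in once]
df = build_df(twice)

# Vocabulary terms that no longer exist in the "index".
missing = sorted(t for t in vocab if df.get(t, 0) == 0)
print(missing)  # ['meeting', 'wedding']
```

Here "weddings" becomes "wedding" after the first pass and "wedd" after the second, so looking up "wedding" finds nothing.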

lintool Dec 9, 2023
Maintainer

If you've already tokenized, then you should use the -pretokenized option during indexing, like:
https://github.com/castorini/anserini/blob/master/docs/regressions/regressions-msmarco-passage-splade-pp-ed.md
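As a rough sketch, the indexing command from earlier in the thread with `-pretokenized` appended could be assembled in Python as below. The helper name `build_index_command` and the placeholder paths are hypothetical, and the sketch assumes that `pyserini.index` passes the flag through to the indexer in the same way as the other single-dash options used in the original command.

```python
import sys

def build_index_command(doc_dir, index_dir, threads=1):
    """Return the argument list for indexing text that is already tokenized."""
    return [
        sys.executable, "-m", "pyserini.index",
        "-collection", "JsonCollection",
        "-generator", "DefaultLuceneDocumentGenerator",
        "-threads", str(threads),
        "-input", doc_dir,
        "-index", index_dir,
        "-storeRaw", "-storePositions", "-storeDocvectors",
        # Flag named in the reply above; assumed to make the indexer take the
        # whitespace-separated tokens in "contents" without analyzing them again.
        "-pretokenized",
    ]

cmd = build_index_command("path/to/docs", "path/to/index", threads=4)
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell quoting problems such as a stray trailing quote.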

zhiyuanpeng Dec 9, 2023
Author

@lintool Hi Prof. Lin, thanks for your help!
