Hi @ugm2, thank you for sharing this insight. Yes, for the integration of external datasets from BEIR we use the file format. Storing a preprocessed version of these files online as Haystack `Document` objects isn't really feasible, and it would also make it more complex to add further datasets to BEIR. However, I understand that this unfortunately creates overhead in your use case. If you are interested, two pointers I can give are a test case with a classification node in the indexing pipeline: and an exemplary YAML file containing multiple different indexing pipelines: https://github.com/deepset-ai/haystack/blob/797c20c966fe46308f646e02d662ca87155a9d4a/test/samples/pipeline/test.haystack-pipeline.yml
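For readers who haven't used YAML-defined pipelines: a minimal sketch of what such an indexing pipeline definition can look like. The component names and parameters here are illustrative assumptions, not copied from the linked file:

```yaml
# Hypothetical Haystack v1 pipeline config; names/params are illustrative.
version: ignore
components:
  - name: DocumentStore
    type: InMemoryDocumentStore
  - name: TextConverter        # turns raw files into Document objects
    type: TextConverter
  - name: Preprocessor
    type: PreProcessor
    params:
      split_by: word
      split_length: 200
  - name: Retriever
    type: BM25Retriever
    params:
      document_store: DocumentStore

pipelines:
  - name: indexing
    nodes:
      - name: TextConverter
        inputs: [File]         # files enter the pipeline here
      - name: Preprocessor
        inputs: [TextConverter]
      - name: DocumentStore
        inputs: [Preprocessor]
```

The `File` input at the head of the pipeline is exactly why a converter node is needed when the dataset arrives as files on disk.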
Hi Community! 😊
I was trying to use the `eval_beir()` functionality to start comparing different pipelines and found that the documents from the chosen datasets are stored as files and then passed to `indexing_pipeline.run()`:
`haystack/haystack/pipelines/base.py`, lines 2252 to 2262 at commit `60f678e`
This is a problem, I think, because normally (at least in my case) I don't use files as input but Haystack `Document` objects.
It's a bit of an overhead having to manually attach a `TextConverter` node every time you want to evaluate a pipeline, IMO.
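The overhead described here is essentially a Document-to-file round trip: in-memory content has to be written out to text files just so a file-based indexing pipeline can read it back in. A stdlib-only sketch of that detour (the `docs` list, `dump_docs_to_files` helper, and file layout are illustrative assumptions, not Haystack code):

```python
import tempfile
from pathlib import Path

# Illustrative stand-ins for Haystack Document objects: content that is
# already in memory, but that eval_beir() would expect on disk as files.
docs = [
    {"id": "d1", "content": "BEIR is a benchmark for zero-shot IR."},
    {"id": "d2", "content": "Haystack pipelines index documents."},
]

def dump_docs_to_files(docs, out_dir):
    """Write each document's content to <id>.txt so that a file-based
    indexing pipeline (with a TextConverter node) can re-read it."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for doc in docs:
        path = out_dir / f"{doc['id']}.txt"
        path.write_text(doc["content"], encoding="utf-8")
        paths.append(path)
    return paths

tmp_dir = tempfile.mkdtemp()
file_paths = dump_docs_to_files(docs, tmp_dir)
# These paths would then be fed to indexing_pipeline.run(file_paths=...),
# which converts the files straight back into Document objects.
print(len(file_paths))  # 2
```

The round trip is pure overhead when the content already exists as `Document` objects, which is the point being raised above.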