Difficulty Reproducing BM25 Results #58

VecherVhatuX · 2024-03-20T15:06:45Z

I've run into problems trying to reproduce the BM25 results reported in the documentation. Specifically, I have two questions:

How does the script handle files with more context than the tokenizer can support? Is there a filtering mechanism in place to manage such instances?
Could you provide more details on how the parameter k is utilized in the script and its impact on the results?
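To make the two questions concrete, here is a minimal sketch of one way a top-k cutoff and a token budget could interact when choosing retrieved files. It is an illustration only, not confirmed to be what the repository's script does: `select_files`, `max_tokens`, and the whitespace token counter are all hypothetical names introduced for this example.

```python
from typing import Callable, List, Tuple


def select_files(
    ranked_files: List[Tuple[str, str]],  # (path, contents), best BM25 score first
    k: int,
    max_tokens: int,
    count_tokens: Callable[[str], int],
) -> List[str]:
    """Keep at most k top-ranked files whose combined token count fits max_tokens.

    Hypothetical helper: the real script may filter differently.
    """
    selected: List[str] = []
    used = 0
    for path, contents in ranked_files[:k]:
        cost = count_tokens(contents)
        if used + cost > max_tokens:
            # Stop at the first file that would overflow the budget.
            # (An alternative policy would be to skip it and keep going.)
            break
        selected.append(path)
        used += cost
    return selected


# Toy data: three files of 40, 50, and 5 whitespace-separated tokens.
ranked = [("a.py", "x " * 40), ("b.py", "x " * 50), ("c.py", "x " * 5)]


def whitespace_count(text: str) -> int:
    # Stand-in for a real tokenizer's token count.
    return len(text.split())


print(select_files(ranked, k=3, max_tokens=60, count_tokens=whitespace_count))
print(select_files(ranked, k=2, max_tokens=100, count_tokens=whitespace_count))
```

Under this policy, raising k can never remove a file that a smaller k already selected, so any drop in recall at larger k would have to come from somewhere else in the pipeline.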

I would appreciate any guidance or suggestions on how to address these issues to achieve the expected BM25 results.

Moreover, the tokenizer is being created for each instance, rather than being kept in memory. This seems to be inefficient and could potentially affect performance.
Also, the tokenization process does not appear to be parallelized. As a result, processing is slow, and when running the test dataset overnight, the scores achieved are lower than expected.
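As a rough sketch of what I mean, the tokenizer could be built once and reused, with the per-file token counting fanned out over a worker pool. Everything here is hypothetical: `WhitespaceTokenizer` is a stand-in for the real tokenizer, and `get_tokenizer` and `count_tokens_parallel` are names made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import List


class WhitespaceTokenizer:
    """Hypothetical stand-in for an expensive-to-construct tokenizer."""

    instances_created = 0  # tracks how many times the tokenizer was built

    def __init__(self) -> None:
        WhitespaceTokenizer.instances_created += 1

    def encode(self, text: str) -> List[str]:
        return text.split()


@lru_cache(maxsize=None)
def get_tokenizer() -> WhitespaceTokenizer:
    # Built on first call, then served from the cache on every later call.
    return WhitespaceTokenizer()


def count_tokens(text: str) -> int:
    return len(get_tokenizer().encode(text))


def count_tokens_parallel(texts: List[str], max_workers: int = 4) -> List[int]:
    get_tokenizer()  # warm the cache before any worker starts
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order in its results.
        return list(pool.map(count_tokens, texts))


counts = count_tokens_parallel(["def f(): pass", "import os", "x"])
print(counts, WhitespaceTokenizer.instances_created)
```

Threads only help if the tokenizer releases the GIL while encoding. For a pure-Python tokenizer, a process pool with one tokenizer per worker would be the analogous design.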


Adefful · 2024-03-20T15:14:40Z

Issue: Discrepancy in Retrieval Results and Metrics

Description

We're encountering inconsistencies in the retrieval results that do not match the stated metrics, specifically when evaluating on a dataset with 27,000 contexts at k=50, and across different values of k for a dataset with 13,000 contexts. The discrepancies are evident in the average, all, and any recall metrics, deviating from the expected BM25 recall values.
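For reference, this is how we understand the three metrics, written as a small sketch. The definitions are our assumption (per-instance recall averaged over instances, the share of instances with every gold file retrieved, and the share with at least one gold file retrieved), and `recall_metrics` is a hypothetical helper, not a function from the repository.

```python
from typing import Dict, List, Set


def recall_metrics(retrieved: List[Set[str]], gold: List[Set[str]]) -> Dict[str, float]:
    """Return avg / all / any recall as percentages over instances with gold files."""
    per_instance: List[float] = []
    all_hits = 0
    any_hits = 0
    for found, expected in zip(retrieved, gold):
        if not expected:
            continue  # no gold files: recall is undefined for this instance
        hits = len(found & expected)
        per_instance.append(hits / len(expected))
        all_hits += int(hits == len(expected))
        any_hits += int(hits > 0)
    n = len(per_instance)
    if n == 0:
        raise ValueError("no instances with gold files")
    return {
        "avg": 100 * sum(per_instance) / n,
        "all": 100 * all_hits / n,
        "any": 100 * any_hits / n,
    }


# Toy example: full hit, partial hit, and miss.
retrieved = [{"a.py", "b.py"}, {"a.py"}, {"z.py"}]
gold = [{"a.py", "b.py"}, {"a.py", "c.py"}, {"q.py"}]
print(recall_metrics(retrieved, gold))
```

If the reported numbers use different definitions, that alone could explain part of the gap.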

Detailed Observations

For 27,000 Contexts at k=50

The observed metrics are as follows:

Metric	Observed Value
Avg Recall	36.54
All Recall	32.26
Any Recall	42.95

Compared to the expected BM25 recall metrics:

Context Size	Avg Recall	All Recall	Any Recall
13k	29.58	26.09	34.77
27k	44.41	39.83	51.27
50k	51.06	45.90	58.38

For 13,000 Contexts with Different k Values

The metric values increase as k decreases. This is counterintuitive: a smaller k means fewer retrieved files can be included, so recall should not improve.

At k=10:

Metric	Value
Avg Recall	24.21
All Recall	21.17
Any Recall	29.11

At k=50:

Metric	Value
Avg Recall	22.71
All Recall	19.78
Any Recall	27.50

At k=3 (most instances are not considered because they have no retrieved files):

Metric	Value
Avg Recall	29.53
All Recall	25.92
Any Recall	35.09
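One possible explanation for the k=3 numbers, sketched below, is that instances with no retrieved files are dropped from the average rather than counted as zero, which shrinks the denominator. This is a guess on our part, not something we have confirmed in the code, and `average_recall` is a hypothetical helper.

```python
from typing import List, Tuple


def average_recall(per_instance: List[Tuple[float, int]], skip_empty: bool) -> float:
    """Average recall in percent.

    per_instance holds (recall, number_of_retrieved_files) pairs.
    With skip_empty=True, instances with zero retrieved files are excluded.
    """
    kept = [recall for recall, n_retrieved in per_instance
            if n_retrieved > 0 or not skip_empty]
    if not kept:
        raise ValueError("no instances left to average")
    return 100 * sum(kept) / len(kept)


# Toy data: two instances retrieved nothing and therefore score 0.
rows = [(1.0, 2), (0.0, 0), (0.0, 0), (0.5, 3)]
print(average_recall(rows, skip_empty=True))
print(average_recall(rows, skip_empty=False))
```

In this toy case the average doubles when the empty instances are skipped, so the metric can rise even though nothing more was retrieved.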

Missing Gold Files for Specific Commits

Additionally, there are instances with missing gold files, indicated by warnings during the retrieval process. Examples include django__django-15272 and sympy__sympy-18667.

Test Dataset Used

The test sample for this evaluation is derived from the provided test dataset.

Request

We kindly ask for an investigation into these discrepancies, particularly focusing on:

The apparent inconsistency in retrieval metrics versus the expected BM25 recall values.
The impact of varying k on the metric results, especially when the number of retrieved files is limited or excessive. In theory, k should not affect the metrics once all relevant files are retrieved.
The issue concerning instances with missing gold files.

We believe addressing these points will greatly enhance the accuracy and reliability of the retrieval process, aligning it more closely with the expected outcomes. Thank you for your attention to these matters.

john-b-yang · 2024-06-17T17:08:26Z

Tagging @carlosejimenez to address this.

dayuyang1999 · 2024-10-25T05:57:12Z

A related question: why is recall a number larger than 1? For example, what does 29.58 mean for the 13k Avg BM25 recall?

john-b-yang · 2024-10-25T18:28:06Z

@dayuyang1999 Oh I think those are just percentages (29.58%, not an absolute value). We should've put the percentage signs there.

VecherVhatuX mentioned this issue Apr 9, 2024

Running create_text_dataset.py gets Killed and takes too long #88

Open

john-b-yang added the evaluation and inference labels and removed the evaluation label Apr 15, 2024

john-b-yang assigned carlosejimenez Jun 17, 2024

john-b-yang added the in progress label Jun 17, 2024
