Issue encountered
It would be good to have a system for evaluating both the relevance of the documents retrieved by the RAG pipeline and how well the LLM uses them when producing its response. My first intuition would be a multi-stage system: on the one hand, a retrieval evaluation in the manner of Sentence Transformers' information-retrieval evaluator (@tomaarsen), and on the other hand, an evaluation of the model's ability to interpret the multi-document context provided by the RAG system, using metrics such as:
- Answer Relevancy
- Faithfulness
- Contextual Precision
- Contextual Recall
- Contextual Relevancy
- Tool Correctness
- Hallucination
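For the retrieval half, here is a minimal sketch of what an evaluation in the style of Sentence Transformers' `InformationRetrievalEvaluator` could look like. The model name, query/corpus IDs, and texts are toy examples chosen for illustration, not part of any existing benchmark:

```python
# Minimal sketch of the retrieval-side evaluation, assuming the Sentence
# Transformers InformationRetrievalEvaluator; the data below is toy data.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("all-MiniLM-L6-v2")

# Query/corpus IDs mapped to text, plus the gold relevance judgments.
queries = {"q1": "What does lighteval evaluate?"}
corpus = {
    "d1": "lighteval is a toolkit for evaluating LLMs on many benchmarks.",
    "d2": "Sentence Transformers provides sentence embedding models.",
}
relevant_docs = {"q1": {"d1"}}

ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="rag-retrieval",
)

# Reports retrieval metrics such as MAP, MRR, NDCG and Recall@k
# (recent sentence-transformers versions return them as a dict).
results = ir_evaluator(model)
print(results)
```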
Solution/Feature
For the moment, it seems to me that the simplest way to implement this system is to combine two tools, sentence-transformers and DeepEval, and then aggregate their results into a single synthetic indicator. However, this approach pulls in a number of extra libraries and requires connecting the same document database to both the sentence-transformers and DeepEval evaluators.
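For the generation half, a hedged sketch of the DeepEval side, assuming its `LLMTestCase` and RAG metrics (which use an LLM judge under the hood, so a judge model or API key must be configured). The inputs and the plain averaging into a "synthetic indicator" are only illustrative:

```python
# Rough sketch of the generation-side evaluation with DeepEval; assumes a judge
# model is configured (e.g. via OPENAI_API_KEY). The averaging is illustrative.
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
)

test_case = LLMTestCase(
    input="What does lighteval evaluate?",
    actual_output="lighteval evaluates LLMs on a wide range of benchmarks.",
    expected_output="It evaluates large language models on many benchmarks.",
    retrieval_context=[
        "lighteval is a toolkit for evaluating LLMs on many benchmarks.",
    ],
)

metrics = [
    AnswerRelevancyMetric(),
    FaithfulnessMetric(),
    ContextualPrecisionMetric(),
    ContextualRecallMetric(),
]

scores = {}
for metric in metrics:
    metric.measure(test_case)
    scores[type(metric).__name__] = metric.score

# Naive "synthetic indicator": unweighted mean of the individual metric scores.
scores["synthetic_indicator"] = sum(scores.values()) / len(scores)
print(scores)
```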
On the information retrieval (IR) side of things, we have a solid and extensive leaderboard called BEIR (Benchmark for Information Retrieval), which is included in MTEB (Massive Text Embedding Benchmark) under the Retrieval tab: https://huggingface.co/spaces/mteb/leaderboard. This will soon be extended to MMTEB with more languages, etc.
However, as you're noticing, this is only one part of the puzzle for RAG applications. I'm not very familiar with the various RAG evaluation tools out there, so I can't provide much feedback on that part I'm afraid.
Hi! Thanks for the issue. I'm not very familiar with RAG evaluation either, and adding it to lighteval would require some effort. If you are familiar with it, we would love it if you could open a PR with a proof of concept!
Adding RAG support would require adding a RAG evaluation function in base_model.py, as well as providing the documents in the requests. Do you have a particular benchmark in mind?
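Purely as a hypothetical illustration of "documents in the requests" plus a dedicated generation path (none of these class or method names exist in lighteval today):

```python
# Hypothetical sketch only: RAGRequest, BaseModelWithRAG and rag_generate are
# made-up names, not part of lighteval's current API. The idea is to carry the
# retrieved documents alongside the query and prepend them to the prompt so the
# metrics above can score faithfulness and contextual relevancy against them.
from dataclasses import dataclass, field


@dataclass
class RAGRequest:
    query: str
    documents: list[str] = field(default_factory=list)  # retrieved context


class BaseModelWithRAG:
    def greedy_until(self, prompt: str) -> str:
        raise NotImplementedError  # provided by the concrete model class

    def rag_generate(self, request: RAGRequest) -> str:
        context = "\n\n".join(request.documents)
        prompt = f"Context:\n{context}\n\nQuestion: {request.query}\nAnswer:"
        return self.greedy_until(prompt)
```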