
[FT] Evaluation using a multi-document RAG based on statistical tools and LLM as judge #379

Open
louisbrulenaudet opened this issue Oct 30, 2024 · 2 comments
Labels
feature request New feature/request

Comments

@louisbrulenaudet

Issue encountered

It would be good to have a system for evaluating both the relevance of the RAG retrieval and the LLM's use of it when producing the response. My first intuition would be a multi-stage system: on the one hand, an evaluation in the manner of Sentence Transformers' information retrieval evaluator (@tomaarsen), and on the other hand, an evaluation of the model's ability to interpret the documents in the context provided by the (multi-document) RAG system, in order to test:

  • Answer Relevancy
  • Faithfulness
  • Contextual Precision
  • Contextual Recall
  • Contextual Relevancy
  • Tool Correctness
  • Hallucination

Solution/Feature

For the moment, the simplest way to implement this system seems to be to combine two tools, sentence-transformers and DeepEval, and then aggregate the results into a single synthetic indicator. However, this method requires a multitude of libraries and a link between the data needed by the sentence-transformers and DeepEval evaluators.
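A rough sketch of what combining the two tools could look like. The model name, the toy data, and the aggregation weights below are purely illustrative, and DeepEval's LLM-as-judge metrics also need a judge model configured (e.g. an OpenAI API key):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Stage 1: retrieval quality of the embedding model (toy corpus for illustration).
retriever = SentenceTransformer("all-MiniLM-L6-v2")
queries = {"q1": "What is the capital of France?"}
corpus = {"d1": "Paris is the capital of France.", "d2": "Berlin is the capital of Germany."}
relevant_docs = {"q1": {"d1"}}
ir_evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="rag-ir")
ir_results = ir_evaluator(retriever)  # MAP/NDCG/recall@k (a dict of metrics in recent versions)

# Stage 2: LLM-as-judge metrics on the generated answer and the retrieved context.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)
answer_relevancy = AnswerRelevancyMetric()
faithfulness = FaithfulnessMetric()
answer_relevancy.measure(test_case)
faithfulness.measure(test_case)

# Naive aggregation into a single synthetic indicator; a real version would
# also fold in the retrieval metrics from stage 1 with chosen weights.
synthetic_score = 0.5 * answer_relevancy.score + 0.5 * faithfulness.score
print(ir_results, synthetic_score)
```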

louisbrulenaudet added the feature request (New feature/request) label on Oct 30, 2024
@tomaarsen
Member

On the information retrieval (IR) side of things, we have a solid and extensive leaderboard called BEIR (Benchmark for Information Retrieval), which is included in MTEB (Massive Text Embedding Benchmark) under the Retrieval tab: https://huggingface.co/spaces/mteb/leaderboard. This will soon be extended to MMTEB with more languages, etc.

However, as you're noticing, this is only one part of the puzzle for RAG applications. I'm not very familiar with the various RAG evaluation tools out there, so I can't provide much feedback on that part I'm afraid.

  • Tom Aarsen

@NathanHB
Member

NathanHB commented Nov 4, 2024

Hi! Thanks for the issue. I'm not very familiar with RAG evaluation either, and adding it to lighteval would require some effort. If you are familiar with it, we would love it if you could open a PR with a proof of concept!

Adding RAG would require adding a RAG evaluation function in the base_model.py model, as well as providing the documents in the requests. Do you have a particular benchmark in mind to add?
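For reference, a purely hypothetical sketch of the shape this could take. None of the names below (RAGRequest, rag_greedy_until, model.generate) are lighteval's actual API; the point is only that each request would carry the retrieved documents alongside the query:

```python
from dataclasses import dataclass, field


@dataclass
class RAGRequest:
    """Hypothetical request that bundles the query with its retrieved context."""
    query: str
    documents: list[str] = field(default_factory=list)  # retrieved context for this query
    gold_answer: str | None = None  # reference answer, if the benchmark provides one


def rag_greedy_until(model, request: RAGRequest, max_new_tokens: int = 256) -> str:
    """Fold the retrieved documents into the prompt and generate an answer."""
    context = "\n\n".join(request.documents)
    prompt = f"Context:\n{context}\n\nQuestion: {request.query}\nAnswer:"
    # `model.generate` is a stand-in for whatever generation entry point the
    # backend exposes; the generated answer would then be scored by the judge metrics.
    return model.generate(prompt, max_new_tokens=max_new_tokens)
```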
