feat: Automatic evaluation of RAG pipeline #17
Merged
Ticket
As part of https://navalabs.atlassian.net/browse/DST-180, we first want to provide a way to automatically evaluate the impact of various changes to the RAG pipeline on accuracy.
Changes
- Adds `eval.py`, which uses the Phoenix evals prompt to compare human ground truth with AI-generated answers and logs the results to a .csv file (a rough sketch of the flow is included under "Context for reviewers" below)
- Updates `ingest.py` to accept `chunk_size` and `overlap_size` as parameters, and to provide a `silent` parameter so they don't print to the console (also sketched below)

Context for reviewers
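For orientation, the evaluation flow looks roughly like the sketch below. This is a hedged sketch, not the exact `eval.py` in this PR: it assumes Phoenix's `phoenix.evals` module (`llm_classify`, `OpenAIModel`, and the built-in QA prompt template), and the column names, model name, and CSV path are illustrative.

```python
# Rough sketch of the eval flow, not the exact eval.py in this PR.
# Assumes phoenix.evals (older Phoenix releases expose the same names under
# phoenix.experimental.evals, and OpenAIModel takes model_name= instead of model=).
import pandas as pd
from phoenix.evals import (
    OpenAIModel,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    llm_classify,
)

# One row per question: the built-in QA template reads the "input" (question),
# "reference" (human ground truth), and "output" (AI-generated answer) columns.
df = pd.DataFrame(
    {
        "input": ["Example question from the ground-truth set?"],
        "reference": ["The answer a human marked as correct."],
        "output": ["The answer the RAG pipeline generated."],
    }
)

eval_model = OpenAIModel(model="gpt-4-turbo-preview")  # requires OPENAI_API_KEY

results = llm_classify(
    dataframe=df,
    model=eval_model,
    template=QA_PROMPT_TEMPLATE,
    rails=list(QA_PROMPT_RAILS_MAP.values()),  # ["correct", "incorrect"]
    provide_explanation=True,
)

# Write the verdicts next to the questions/answers for later comparison runs.
pd.concat([df, results], axis=1).to_csv("eval_results.csv", index=False)
```

Similarly, the `ingest.py` change amounts to threading two chunking knobs and a verbosity flag through the entry point. The signature below is illustrative, not the actual one in this PR:

```python
# Hypothetical shape of the updated ingest entry point; the real ingest.py
# signature and chunking logic may differ.
def ingest(docs: list[str], chunk_size: int = 512, overlap_size: int = 64, silent: bool = False) -> list[str]:
    chunks: list[str] = []
    step = max(chunk_size - overlap_size, 1)
    for doc in docs:
        # Slide a chunk_size window over each document, overlapping by overlap_size.
        for start in range(0, len(doc), step):
            chunks.append(doc[start : start + chunk_size])
    if not silent:
        print(f"Ingested {len(docs)} documents into {len(chunks)} chunks")
    return chunks
```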
Testing
Run `python eval.py`. If you use one of the OpenAI LLMs (either for generating answers or evaluating them), set `OPENAI_API_KEY` in your environment.

You can experiment with different `parameters` -- but note that it can quickly get expensive. Trying 18 total combinations and using GPT-4 Turbo to generate half of the answers + evaluate all of the answers cost $3.18.
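To give a sense of how the combination count grows, a sweep like the hypothetical one below yields 3 × 3 × 2 = 18 runs. The parameter values, model names, and the commented-out `ingest(...)` call are illustrative, not the exact grid behind the $3.18 figure.

```python
# Illustrative parameter sweep: 3 chunk sizes x 3 overlaps x 2 generation
# models = 18 combinations. Values and model names are placeholders.
from itertools import product

chunk_sizes = [256, 512, 1024]
overlap_sizes = [0, 64, 128]
generation_models = ["gpt-4-turbo-preview", "gpt-3.5-turbo"]

for chunk_size, overlap_size, model in product(chunk_sizes, overlap_sizes, generation_models):
    # Re-ingest with this chunking, e.g.:
    # ingest(docs, chunk_size=chunk_size, overlap_size=overlap_size, silent=True)
    # ...then generate answers with `model` and evaluate them as in eval.py.
    print(f"chunk_size={chunk_size} overlap_size={overlap_size} model={model}")
```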