
feat: Automatic evaluation of RAG pipeline #17

Merged

merged 6 commits into main from kb/evals on Apr 19, 2024

Conversation

KevinJBoyer
Collaborator

Ticket

As part of https://navalabs.atlassian.net/browse/DST-180, the first step is to provide a way to automatically evaluate how changes to the RAG pipeline affect accuracy.

Changes

  • Add eval.py, which uses the Phoenix evals prompt to compare human ground truth with AI-generated answers and logs the results to a .csv file
  • Update ingest.py to accept chunk_size and overlap_size as parameters, and add a silent parameter so it doesn't print to the console
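The evaluation loop described above can be sketched as follows. This is a minimal illustration, not the code in eval.py: the `judge` callable stands in for the Phoenix evals prompt, and all names here (`evaluate_answers`, the row keys, the output filename) are hypothetical.

```python
import csv
from typing import Callable, Iterable

def evaluate_answers(
    rows: Iterable[dict],  # each row: {"question", "ground_truth", "ai_answer"}
    judge: Callable[[str, str, str], str],  # returns a verdict label, e.g. "correct"
    out_path: str = "eval_results.csv",
) -> list[dict]:
    """Compare human ground truth with AI-generated answers; log results to CSV."""
    results = []
    for row in rows:
        # In the real pipeline, the judge is an LLM prompted with the
        # question, the reference answer, and the generated answer.
        verdict = judge(row["question"], row["ground_truth"], row["ai_answer"])
        results.append({**row, "verdict": verdict})
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(
            f, fieldnames=["question", "ground_truth", "ai_answer", "verdict"]
        )
        writer.writeheader()
        writer.writerows(results)
    return results
```

Keeping the judge behind a plain callable makes the loop cheap to test: a deterministic stub can replace the LLM call without touching the CSV-logging logic.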

Context for reviewers

Testing

  • Run python eval.py. If you use one of the OpenAI LLMs (either for generating answers or for evaluating them), set OPENAI_API_KEY in your environment.
  • You can change which parameters are tested by updating parameters, but note that it can get expensive quickly: trying 18 total combinations, with GPT-4 Turbo generating half of the answers and evaluating all of them, cost $3.18.
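To see why cost grows quickly, the combinations are a cross product of the tuned parameters. The values below are hypothetical (the real grid lives in eval.py's parameters); they are chosen only to illustrate how a small grid reaches the 18 runs mentioned above.

```python
from itertools import product

# Hypothetical parameter grid -- illustrative values only.
chunk_sizes = [256, 512, 1024]
overlap_sizes = [0, 64, 128]
generation_models = ["gpt-4-turbo", "local-model"]

combinations = list(product(chunk_sizes, overlap_sizes, generation_models))
# 3 chunk sizes * 3 overlaps * 2 models = 18 runs. Each run re-ingests with
# the given chunking, then generates and evaluates answers, so LLM cost
# scales linearly with the size of this list.
print(len(combinations))
```

Adding even one more value to any dimension multiplies the total run count, which is why trimming the grid is the first lever for controlling cost.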

Collaborator

@yoomlam left a comment


Nice, clean code!

@KevinJBoyer merged commit 9c614ca into main on Apr 19, 2024
2 checks passed
@KevinJBoyer deleted the kb/evals branch on April 19, 2024 at 13:48