What is the best way to evaluate code generation? #1457
GauravRanganath started this conversation in General
I'm working on creating an eval for the following dataset for my own research purposes: https://huggingface.co/datasets/mbpp. The dataset is a series of mostly basic Python problems.
I've converted the dataset to work with OpenAI Evals; however, I'm unsure what the best way to evaluate code generation would be. I think a model-graded evaluation makes sense, but I was surprised that there isn't an existing YAML for code generation. The closest I could find was the SQL one.
Is the best approach to take the SQL model-graded eval and modify it to work better for evaluating Python code?
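For context, here is a rough sketch of what I had in mind, adapted from the shape of the SQL modelgraded spec. The `python_code` key and the prompt wording are just my draft, and the field names are based on my reading of sql.yaml, so they may not match the current evals schema exactly:

```yaml
# Hypothetical modelgraded spec adapted from sql.yaml; key names may not
# match the current evals registry schema exactly.
python_code:
  prompt: |-
    You are comparing a submitted answer to an expert answer on a given Python coding problem. Here is the data:
    [BEGIN DATA]
    ************
    [Problem]: {input}
    ************
    [Expert]: {ideal}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

    Compare the correctness of the submitted code with the expert answer. Ignore differences in style, variable names, and comments. Answer by selecting one of the following options:
      "Correct": The submitted code solves the problem and would produce the same results as the expert answer.
      "Incorrect": The submitted code does not solve the problem, produces different results, or would raise an error.
  choice_scores:
    "Correct": 1.0
    "Incorrect": 0.0
  choice_strings:
    - "Correct"
    - "Incorrect"
  input_outputs:
    input: completion
```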