What is the best way to evaluate code generation? #1457
GauravRanganath started this conversation in General
I'm working on creating an eval for the following dataset for my own research purposes: https://huggingface.co/datasets/mbpp. The dataset is a series of mostly basic Python problems.
I've converted the dataset to work with OpenAI Evals; however, I'm unsure what the best way to evaluate code generation would be. I think a model-graded evaluation makes sense, but I was surprised that there isn't an existing YAML for code generation. The closest I could find was the SQL one.
Is the best approach to take the SQL model-graded eval and modify it to work better for evaluating Python code?
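For context, here is a rough sketch of what I had in mind, adapted from the shape of the SQL modelgraded spec. The `python_code` key and the prompt wording are just my draft, and the field names are based on my reading of sql.yaml, so they may not match the current evals schema exactly:

```yaml
# Hypothetical modelgraded spec adapted from sql.yaml; key names may not
# match the current evals registry schema exactly.
python_code:
  prompt: |-
    You are comparing a submitted answer to an expert answer on a given Python coding problem. Here is the data:
    [BEGIN DATA]
    ************
    [Problem]: {input}
    ************
    [Expert]: {ideal}
    ************
    [Submission]: {completion}
    ************
    [END DATA]

    Compare the correctness of the submitted code with the expert answer. Ignore differences in style, variable names, and comments. Answer by selecting one of the following options:
      "Correct": The submitted code solves the problem and would produce the same results as the expert answer.
      "Incorrect": The submitted code does not solve the problem, produces different results, or would raise an error.
  choice_scores:
    "Correct": 1.0
    "Incorrect": 0.0
  choice_strings:
    - "Correct"
    - "Incorrect"
  input_outputs:
    input: completion
```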