LLM As Judge should describe programs as expected and actual #287

jlewi · 2024-10-08T23:54:31Z

Right now, the LLM as judge refers to the two programs as program1 and program2 in its explanations.

This is likely because our prompt labels the programs with "" and "" (foyle/app/pkg/eval/judge_prompt.tmpl, line 14 in a52dcde).

We should change these labels to "" and "" so the judge describes the programs in terms of expected and actual.

Here's an example explanation issued by the judge:

The two programs perform distinct operations.

Program 1:

Uses jq to extract the .requestHtml field from a JSON file and stores the result in another file.
Then, it displays the content of the newly created file using cat.
Program 2:

Utilizes curl to make an HTTP POST request to fetch logs data and store the response in a JSON file.
Checks for successful execution and displays an error message if the curl command fails.

The two programs have no overlapping functionality and perform entirely different tasks. Therefore, they are not equivalent.

jlewi added the good first issue Good for newcomers label Oct 8, 2024
