Reproduce llm-as-judge with fixed prompt
binkjakub committed Aug 29, 2024
1 parent 5f79cce commit e27516c
Showing 6 changed files with 575 additions and 76 deletions.
1 change: 1 addition & 0 deletions configs/llm_judge.yaml
@@ -4,3 +4,4 @@ defaults:

 answers_file: ???
 out_metric_file: ???
+prompt: ???
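In Hydra/OmegaConf-style configs, `???` marks a mandatory value that must be supplied at runtime (e.g. via a CLI override), so this change makes the judge prompt an explicit required input rather than a hard-coded default. A minimal stdlib-only sketch of that convention (the `require` helper and config keys other than those in the diff are illustrative, not from the repo):

```python
MISSING = "???"  # sentinel for "mandatory, not yet provided"

def require(cfg: dict, key: str):
    """Return cfg[key], failing loudly if it is still the '???' placeholder."""
    value = cfg.get(key, MISSING)
    if value == MISSING:
        raise ValueError(f"Mandatory config value '{key}' is missing")
    return value

# Mirror of configs/llm_judge.yaml after this commit.
cfg = {"answers_file": "???", "out_metric_file": "???", "prompt": "???"}

# Without an override, resolving `prompt` fails.
try:
    require(cfg, "prompt")
    prompt_missing = False
except ValueError:
    prompt_missing = True

# A runtime override (e.g. `prompt=...` on the command line) resolves it.
cfg["prompt"] = "fixed-judge-prompt"
prompt = require(cfg, "prompt")
```

Making the prompt mandatory also means every metrics file can be traced back to the exact judge prompt used, which is what "with fixed prompt" in the commit title is about.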
38 changes: 19 additions & 19 deletions data/experiments/predict/en-court-instruct/metrics_judge_summary.md
@@ -1,15 +1,15 @@
 | llm | assessment | citation | date | judges |
 |:-------------------------------------------------|:----------------|:----------------|:----------------|:----------------|
-| Unsloth-Llama-3-8B-Instruct | (Correct) | 0.017 (± 0.001) | 0.051 (± 0.000) | 0.038 (± 0.001) |
-| Unsloth-Llama-3-8B-Instruct | (Disagreement) | 0.034 (± 0.001) | 0.000 (± 0.000) | 0.003 (± 0.000) |
+| Unsloth-Llama-3-8B-Instruct | (Correct) | 0.017 (± 0.001) | 0.051 (± 0.000) | 0.038 (± 0.000) |
+| Unsloth-Llama-3-8B-Instruct | (Disagreement) | 0.034 (± 0.001) | 0.000 (± 0.000) | 0.004 (± 0.000) |
 | Unsloth-Llama-3-8B-Instruct | (Subset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
-| Unsloth-Llama-3-8B-Instruct | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.010 (± 0.001) |
+| Unsloth-Llama-3-8B-Instruct | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.009 (± 0.001) |
 | Unsloth-Llama-3-8B-Instruct | (empty-answer) | 0.949 (± 0.000) | 0.949 (± 0.000) | 0.949 (± 0.000) |
 | Unsloth-Llama-3-8B-Instruct | (non-evaluable) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
-| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Correct) | 0.853 (± 0.003) | 0.844 (± 0.002) | 0.449 (± 0.003) |
-| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Disagreement) | 0.005 (± 0.000) | 0.013 (± 0.000) | 0.136 (± 0.001) |
-| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Subset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.037 (± 0.001) |
-| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.236 (± 0.002) |
+| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Correct) | 0.853 (± 0.003) | 0.844 (± 0.002) | 0.446 (± 0.002) |
+| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Disagreement) | 0.005 (± 0.000) | 0.013 (± 0.000) | 0.188 (± 0.001) |
+| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Subset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.032 (± 0.001) |
+| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.192 (± 0.002) |
 | Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (empty-answer) | 0.142 (± 0.003) | 0.142 (± 0.003) | 0.142 (± 0.003) |
 | Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (non-evaluable) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
 | Unsloth-Mistral-Nemo-Instruct-2407 | (Correct) | 0.001 (± 0.000) | 0.001 (± 0.000) | 0.001 (± 0.000) |
@@ -18,21 +18,21 @@
 | Unsloth-Mistral-Nemo-Instruct-2407 | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
 | Unsloth-Mistral-Nemo-Instruct-2407 | (empty-answer) | 0.999 (± 0.000) | 0.999 (± 0.000) | 0.999 (± 0.000) |
 | Unsloth-Mistral-Nemo-Instruct-2407 | (non-evaluable) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
-| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Correct) | 0.889 (± 0.001) | 0.882 (± 0.001) | 0.487 (± 0.002) |
-| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Disagreement) | 0.006 (± 0.000) | 0.013 (± 0.000) | 0.111 (± 0.002) |
-| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Subset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.031 (± 0.001) |
-| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.265 (± 0.003) |
+| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Correct) | 0.889 (± 0.001) | 0.882 (± 0.001) | 0.486 (± 0.001) |
+| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Disagreement) | 0.006 (± 0.000) | 0.013 (± 0.000) | 0.163 (± 0.002) |
+| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Subset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.026 (± 0.001) |
+| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.220 (± 0.002) |
 | Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (empty-answer) | 0.105 (± 0.001) | 0.105 (± 0.001) | 0.105 (± 0.001) |
 | Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (non-evaluable) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
-| open_ai_gpt-4o | (Correct) | 0.927 (± nan) | 0.920 (± nan) | 0.470 (± nan) |
-| open_ai_gpt-4o | (Disagreement) | 0.007 (± nan) | 0.015 (± nan) | 0.132 (± nan) |
-| open_ai_gpt-4o | (Subset) | 0.000 (± nan) | 0.000 (± nan) | 0.013 (± nan) |
-| open_ai_gpt-4o | (Superset) | 0.000 (± nan) | 0.000 (± nan) | 0.320 (± nan) |
+| open_ai_gpt-4o | (Correct) | 0.927 (± nan) | 0.920 (± nan) | 0.468 (± nan) |
+| open_ai_gpt-4o | (Disagreement) | 0.007 (± nan) | 0.015 (± nan) | 0.200 (± nan) |
+| open_ai_gpt-4o | (Subset) | 0.000 (± nan) | 0.000 (± nan) | 0.011 (± nan) |
+| open_ai_gpt-4o | (Superset) | 0.000 (± nan) | 0.000 (± nan) | 0.256 (± nan) |
 | open_ai_gpt-4o | (empty-answer) | 0.066 (± nan) | 0.065 (± nan) | 0.065 (± nan) |
 | open_ai_gpt-4o | (non-evaluable) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
-| open_ai_gpt-4o-mini | (Correct) | 0.962 (± nan) | 0.956 (± nan) | 0.496 (± nan) |
-| open_ai_gpt-4o-mini | (Disagreement) | 0.007 (± nan) | 0.013 (± nan) | 0.122 (± nan) |
-| open_ai_gpt-4o-mini | (Subset) | 0.000 (± nan) | 0.000 (± nan) | 0.016 (± nan) |
-| open_ai_gpt-4o-mini | (Superset) | 0.000 (± nan) | 0.000 (± nan) | 0.335 (± nan) |
+| open_ai_gpt-4o-mini | (Correct) | 0.962 (± nan) | 0.956 (± nan) | 0.495 (± nan) |
+| open_ai_gpt-4o-mini | (Disagreement) | 0.007 (± nan) | 0.013 (± nan) | 0.182 (± nan) |
+| open_ai_gpt-4o-mini | (Subset) | 0.000 (± nan) | 0.000 (± nan) | 0.015 (± nan) |
+| open_ai_gpt-4o-mini | (Superset) | 0.000 (± nan) | 0.000 (± nan) | 0.277 (± nan) |
 | open_ai_gpt-4o-mini | (empty-answer) | 0.031 (± nan) | 0.031 (± nan) | 0.031 (± nan) |
 | open_ai_gpt-4o-mini | (non-evaluable) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
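The table appears to report, per model and per extracted field (citation, date, judges), the fraction of examples the LLM judge assigned to each assessment category, aggregated as mean (± std) across runs; `nan` std is consistent with a single run for the API models. A sketch of that aggregation under those assumptions (category names are taken from the table; the function names and sample data are hypothetical):

```python
from statistics import mean, pstdev

CATEGORIES = ["Correct", "Disagreement", "Subset", "Superset",
              "empty-answer", "non-evaluable"]

def category_fractions(labels):
    """Fraction of examples per judge-assigned category in one run."""
    n = len(labels)
    return {c: sum(1 for lab in labels if lab == c) / n for c in CATEGORIES}

def aggregate(runs):
    """Mean (± population std) of each category fraction across runs.

    With a single run the std is undefined, reported as nan -- matching
    the `(± nan)` entries for the single-run API models in the table.
    """
    out = {}
    for c in CATEGORIES:
        vals = [r[c] for r in runs]
        std = pstdev(vals) if len(vals) > 1 else float("nan")
        out[c] = (mean(vals), std)
    return out

# Hypothetical judge labels for one field across two runs of four examples.
run1 = category_fractions(["Correct", "Correct", "Superset", "empty-answer"])
run2 = category_fractions(["Correct", "Disagreement", "Superset", "empty-answer"])
agg = aggregate([run1, run2])
```

Note that within each column the six category fractions sum to 1 per run, which is a useful sanity check when regenerating such a summary.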