Commit

Reproduce evaluation on fixed parsing

binkjakub committed Aug 29, 2024
1 parent 264e623 commit 941b92f
Showing 15 changed files with 548 additions and 147 deletions.
@@ -5,3 +5,5 @@
/metrics_7312.json
/metrics_42.json
/judge_metrics_997.json
/judge_metrics_42.json
/judge_metrics_7312.json
@@ -5,3 +5,5 @@
/metrics_7312.json
/metrics_997.json
/judge_metrics_997.json
/judge_metrics_42.json
/judge_metrics_7312.json
@@ -5,3 +5,5 @@
/metrics_7312.json
/metrics_997.json
/judge_metrics_997.json
/judge_metrics_42.json
/judge_metrics_7312.json
@@ -5,3 +5,5 @@
/metrics_7312.json
/metrics_997.json
/judge_metrics_997.json
/judge_metrics_42.json
/judge_metrics_7312.json
data/experiments/predict/en-court-instruct/metrics_judge_summary.md (64 changes: 38 additions & 26 deletions)
@@ -1,26 +1,38 @@
| llm | assessment | citation | date | judges |
|:-------------------------------------------------|:----------------|:--------------|:--------------|:--------------|
| Unsloth-Llama-3-8B-Instruct | (Correct) | 0.016 (± nan) | 0.051 (± nan) | 0.038 (± nan) |
| Unsloth-Llama-3-8B-Instruct | (Disagreement) | 0.035 (± nan) | 0.000 (± nan) | 0.004 (± nan) |
| Unsloth-Llama-3-8B-Instruct | (Subset) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
| Unsloth-Llama-3-8B-Instruct | (Superset) | 0.000 (± nan) | 0.000 (± nan) | 0.009 (± nan) |
| Unsloth-Llama-3-8B-Instruct | (empty-answer) | 0.949 (± nan) | 0.949 (± nan) | 0.949 (± nan) |
| Unsloth-Llama-3-8B-Instruct | (non-evaluable) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Correct) | 0.856 (± nan) | 0.847 (± nan) | 0.453 (± nan) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Disagreement) | 0.005 (± nan) | 0.013 (± nan) | 0.137 (± nan) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Subset) | 0.000 (± nan) | 0.000 (± nan) | 0.036 (± nan) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Superset) | 0.000 (± nan) | 0.000 (± nan) | 0.234 (± nan) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (empty-answer) | 0.140 (± nan) | 0.140 (± nan) | 0.140 (± nan) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (non-evaluable) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (Correct) | 0.001 (± nan) | 0.001 (± nan) | 0.001 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (Disagreement) | 0.001 (± nan) | 0.000 (± nan) | 0.001 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (Subset) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (Superset) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (empty-answer) | 0.999 (± nan) | 0.999 (± nan) | 0.999 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (non-evaluable) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Correct) | 0.889 (± nan) | 0.882 (± nan) | 0.487 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Disagreement) | 0.005 (± nan) | 0.013 (± nan) | 0.109 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Subset) | 0.000 (± nan) | 0.000 (± nan) | 0.033 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Superset) | 0.000 (± nan) | 0.000 (± nan) | 0.266 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (empty-answer) | 0.105 (± nan) | 0.105 (± nan) | 0.105 (± nan) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (non-evaluable) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
| llm | assessment | citation | date | judges |
|:-------------------------------------------------|:----------------|:----------------|:----------------|:----------------|
| Unsloth-Llama-3-8B-Instruct | (Correct) | 0.017 (± 0.001) | 0.051 (± 0.000) | 0.038 (± 0.001) |
| Unsloth-Llama-3-8B-Instruct | (Disagreement) | 0.034 (± 0.001) | 0.000 (± 0.000) | 0.003 (± 0.000) |
| Unsloth-Llama-3-8B-Instruct | (Subset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
| Unsloth-Llama-3-8B-Instruct | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.010 (± 0.001) |
| Unsloth-Llama-3-8B-Instruct | (empty-answer) | 0.949 (± 0.000) | 0.949 (± 0.000) | 0.949 (± 0.000) |
| Unsloth-Llama-3-8B-Instruct | (non-evaluable) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Correct) | 0.853 (± 0.003) | 0.844 (± 0.002) | 0.449 (± 0.003) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Disagreement) | 0.005 (± 0.000) | 0.013 (± 0.000) | 0.136 (± 0.001) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Subset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.037 (± 0.001) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.236 (± 0.002) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (empty-answer) | 0.142 (± 0.003) | 0.142 (± 0.003) | 0.142 (± 0.003) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned-en | (non-evaluable) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (Correct) | 0.001 (± 0.000) | 0.001 (± 0.000) | 0.001 (± 0.000) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (Disagreement) | 0.001 (± 0.000) | 0.000 (± 0.000) | 0.001 (± 0.000) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (Subset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (empty-answer) | 0.999 (± 0.000) | 0.999 (± 0.000) | 0.999 (± 0.000) |
| Unsloth-Mistral-Nemo-Instruct-2407 | (non-evaluable) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Correct) | 0.889 (± 0.001) | 0.882 (± 0.001) | 0.487 (± 0.002) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Disagreement) | 0.006 (± 0.000) | 0.013 (± 0.000) | 0.111 (± 0.002) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Subset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.031 (± 0.001) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.265 (± 0.003) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (empty-answer) | 0.105 (± 0.001) | 0.105 (± 0.001) | 0.105 (± 0.001) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned-en | (non-evaluable) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
| open_ai_gpt-4o | (Correct) | 0.927 (± nan) | 0.920 (± nan) | 0.470 (± nan) |
| open_ai_gpt-4o | (Disagreement) | 0.007 (± nan) | 0.015 (± nan) | 0.132 (± nan) |
| open_ai_gpt-4o | (Subset) | 0.000 (± nan) | 0.000 (± nan) | 0.013 (± nan) |
| open_ai_gpt-4o | (Superset) | 0.000 (± nan) | 0.000 (± nan) | 0.320 (± nan) |
| open_ai_gpt-4o | (empty-answer) | 0.066 (± nan) | 0.065 (± nan) | 0.065 (± nan) |
| open_ai_gpt-4o | (non-evaluable) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
| open_ai_gpt-4o-mini | (Correct) | 0.962 (± nan) | 0.956 (± nan) | 0.496 (± nan) |
| open_ai_gpt-4o-mini | (Disagreement) | 0.007 (± nan) | 0.013 (± nan) | 0.122 (± nan) |
| open_ai_gpt-4o-mini | (Subset) | 0.000 (± nan) | 0.000 (± nan) | 0.016 (± nan) |
| open_ai_gpt-4o-mini | (Superset) | 0.000 (± nan) | 0.000 (± nan) | 0.335 (± nan) |
| open_ai_gpt-4o-mini | (empty-answer) | 0.031 (± nan) | 0.031 (± nan) | 0.031 (± nan) |
| open_ai_gpt-4o-mini | (non-evaluable) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
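
The updated table reports each judge metric as mean (± std) instead of (± nan), which matches the per-seed result files added to the ignore lists above (judge_metrics_42.json, judge_metrics_997.json, judge_metrics_7312.json). Below is a minimal sketch of how such an aggregation could be computed; the JSON layout, field names, and helper functions are assumptions for illustration, not the repository's actual evaluation code.

```python
# Minimal sketch (assumed file layout): aggregate per-seed judge metrics into
# the "mean (± std)" cells shown in metrics_judge_summary.md.
# The flat {"metric_name": value} JSON structure below is hypothetical;
# the real judge_metrics_<seed>.json files may be nested differently.
import json
from pathlib import Path
from statistics import mean, stdev

SEEDS = [42, 997, 7312]  # seeds implied by the ignored judge_metrics_<seed>.json files


def load_seed_metrics(results_dir: Path, seed: int) -> dict:
    """Load one seed's judge metrics, e.g. {"citation": 0.017, "date": 0.051, ...}."""
    with open(results_dir / f"judge_metrics_{seed}.json") as f:
        return json.load(f)


def aggregate(results_dir: Path) -> dict:
    """Return a 'mean (± std)' string per metric, aggregated across all seeds."""
    per_seed = [load_seed_metrics(results_dir, s) for s in SEEDS]
    summary = {}
    for key in per_seed[0]:
        values = [m[key] for m in per_seed]
        spread = stdev(values) if len(values) > 1 else float("nan")
        summary[key] = f"{mean(values):.3f} (± {spread:.3f})"
    return summary


if __name__ == "__main__":
    # Hypothetical usage against one experiment directory from this commit.
    print(aggregate(Path("data/experiments/predict/en-court-instruct")))
```

The rows for open_ai_gpt-4o and open_ai_gpt-4o-mini keep (± nan), consistent with a single run where a standard deviation cannot be computed.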
@@ -1,2 +1,3 @@
/outputs_997.json
/metrics_997.json
/judge_metrics_997.json
@@ -1,2 +1,3 @@
/outputs_997.json
/metrics_997.json
/judge_metrics_997.json
@@ -47,4 +47,16 @@
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned | (Subset) | 0.002 (± 0.000) | 0.000 (± 0.000) | 0.012 (± 0.000) | 0.009 (± 0.001) | 0.021 (± 0.002) | 0.016 (± 0.003) | 0.014 (± 0.005) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned | (Superset) | 0.002 (± 0.000) | 0.000 (± 0.000) | 0.013 (± 0.000) | 0.020 (± 0.002) | 0.101 (± 0.003) | 0.001 (± 0.001) | 0.000 (± 0.000) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned | (empty-answer) | 0.088 (± 0.001) | 0.068 (± 0.002) | 0.161 (± 0.001) | 0.144 (± 0.002) | 0.252 (± 0.000) | 0.273 (± 0.007) | 0.332 (± 0.004) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned | (non-evaluable) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
| open_ai_gpt-4o | (Correct) | 0.939 (± nan) | 0.978 (± nan) | 0.915 (± nan) | 0.662 (± nan) | 0.019 (± nan) | 0.684 (± nan) | 0.972 (± nan) |
| open_ai_gpt-4o | (Disagreement) | 0.032 (± nan) | 0.022 (± nan) | 0.049 (± nan) | 0.135 (± nan) | 0.394 (± nan) | 0.001 (± nan) | 0.017 (± nan) |
| open_ai_gpt-4o | (Subset) | 0.012 (± nan) | 0.000 (± nan) | 0.011 (± nan) | 0.040 (± nan) | 0.015 (± nan) | 0.311 (± nan) | 0.001 (± nan) |
| open_ai_gpt-4o | (Superset) | 0.017 (± nan) | 0.000 (± nan) | 0.015 (± nan) | 0.163 (± nan) | 0.354 (± nan) | 0.002 (± nan) | 0.011 (± nan) |
| open_ai_gpt-4o | (empty-answer) | 0.001 (± nan) | 0.001 (± nan) | 0.011 (± nan) | 0.001 (± nan) | 0.218 (± nan) | 0.003 (± nan) | 0.001 (± nan) |
| open_ai_gpt-4o | (non-evaluable) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
| open_ai_gpt-4o-mini | (Correct) | 0.928 (± nan) | 0.978 (± nan) | 0.909 (± nan) | 0.555 (± nan) | 0.029 (± nan) | 0.943 (± nan) | 0.975 (± nan) |
| open_ai_gpt-4o-mini | (Disagreement) | 0.034 (± nan) | 0.022 (± nan) | 0.059 (± nan) | 0.175 (± nan) | 0.672 (± nan) | 0.001 (± nan) | 0.015 (± nan) |
| open_ai_gpt-4o-mini | (Subset) | 0.007 (± nan) | 0.000 (± nan) | 0.015 (± nan) | 0.017 (± nan) | 0.039 (± nan) | 0.053 (± nan) | 0.000 (± nan) |
| open_ai_gpt-4o-mini | (Superset) | 0.030 (± nan) | 0.000 (± nan) | 0.015 (± nan) | 0.252 (± nan) | 0.181 (± nan) | 0.002 (± nan) | 0.009 (± nan) |
| open_ai_gpt-4o-mini | (empty-answer) | 0.001 (± nan) | 0.001 (± nan) | 0.003 (± nan) | 0.001 (± nan) | 0.080 (± nan) | 0.002 (± nan) | 0.001 (± nan) |
| open_ai_gpt-4o-mini | (non-evaluable) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) | 0.000 (± nan) |
@@ -2,10 +2,10 @@
|:----------------------------------------------|:-----------------|:----------------|:----------------|:------------------|:----------------|:----------------|:----------------|:----------------|
| Bielik-7B-Instruct-v0.1 | 0.354 (± 0.001) | 0.000 (± 0.000) | 0.001 (± 0.000) | 0.001 (± 0.000) | 0.001 (± 0.000) | 0.001 (± 0.000) | 0.000 (± 0.000) | 0.000 (± 0.000) |
| Bielik-7B-Instruct-v0.1-fine-tuned | 0.717 (± 0.000) | 0.890 (± 0.007) | 0.863 (± 0.007) | 0.886 (± 0.007) | 0.879 (± 0.007) | 0.465 (± 0.004) | 0.639 (± 0.001) | 0.459 (± 0.002) |
| Unsloth-Llama-3-8B-Instruct | 0.579 (± 0.001) | 0.865 (± 0.000) | 0.948 (± 0.001) | 0.882 (± 0.026) | 0.902 (± 0.011) | 0.312 (± 0.042) | 0.741 (± 0.002) | 0.665 (± 0.022) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned | 0.747 (± 0.000) | 0.916 (± 0.001) | 0.920 (± 0.002) | 0.902 (± 0.000) | 0.906 (± 0.001) | 0.442 (± 0.001) | 0.812 (± 0.003) | 0.805 (± 0.004) |
| Unsloth-Mistral-7B-Instruct-v0.3 | 0.574 (± 0.001) | 0.397 (± 0.005) | 0.470 (± 0.004) | 0.404 (± 0.005) | 0.424 (± 0.003) | 0.159 (± 0.002) | 0.436 (± 0.003) | 0.159 (± 0.001) |
| Unsloth-Mistral-7B-Instruct-v0.3-fine-tuned | 0.634 (± 0.001) | 0.547 (± 0.003) | 0.549 (± 0.003) | 0.543 (± 0.003) | 0.544 (± 0.003) | 0.366 (± 0.002) | 0.534 (± 0.002) | 0.533 (± 0.001) |
| Unsloth-Llama-3-8B-Instruct | 0.579 (± 0.001) | 0.863 (± 0.002) | 0.946 (± 0.002) | 0.909 (± 0.002) | 0.912 (± 0.003) | 0.362 (± 0.002) | 0.735 (± 0.004) | 0.686 (± 0.004) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned | 0.747 (± 0.000) | 0.913 (± 0.001) | 0.917 (± 0.002) | 0.906 (± 0.000) | 0.908 (± 0.001) | 0.528 (± 0.000) | 0.815 (± 0.003) | 0.810 (± 0.004) |
| Unsloth-Mistral-7B-Instruct-v0.3 | 0.574 (± 0.001) | 0.402 (± 0.004) | 0.463 (± 0.004) | 0.427 (± 0.004) | 0.434 (± 0.004) | 0.198 (± 0.002) | 0.438 (± 0.003) | 0.174 (± 0.002) |
| Unsloth-Mistral-7B-Instruct-v0.3-fine-tuned | 0.634 (± 0.001) | 0.549 (± 0.003) | 0.547 (± 0.004) | 0.545 (± 0.003) | 0.545 (± 0.003) | 0.385 (± 0.002) | 0.534 (± 0.002) | 0.534 (± 0.002) |
| Unsloth-Mistral-Nemo-Instruct-2407 | 0.520 (± 0.001) | 0.732 (± 0.006) | 0.759 (± 0.005) | 0.687 (± 0.006) | 0.619 (± 0.006) | 0.267 (± 0.002) | 0.690 (± 0.008) | 0.600 (± 0.004) |
| Unsloth-Mistral-Nemo-Instruct-2407-fine-tuned | 0.708 (± 0.001) | 0.900 (± 0.001) | 0.843 (± 0.000) | 0.818 (± 0.001) | 0.826 (± 0.001) | 0.503 (± 0.002) | 0.693 (± 0.007) | 0.642 (± 0.007) |
| open_ai_gpt-4o | 0.651 (± nan) | 0.955 (± nan) | 0.986 (± nan) | 0.971 (± nan) | 0.917 (± nan) | 0.502 (± nan) | 0.834 (± nan) | 0.990 (± nan) |
@@ -1,2 +1,3 @@
/outputs_997.json
/metrics_997.json
/judge_metrics_997.json
@@ -1,2 +1,3 @@
/outputs_997.json
/metrics_997.json
/judge_metrics_997.json