Add llm-as-judge preliminary results (too much non-evaluable)
binkjakub committed Aug 11, 2024
1 parent fadb582 commit 1d8df86
Showing 15 changed files with 693 additions and 130 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -96,7 +96,7 @@ celerybeat-schedule
*.sage.py

# dotenv
-.env
+*.env

# virtualenv
.venv
3 changes: 3 additions & 0 deletions configs/api_model/llama_3.1_8b_instruct.yaml
@@ -0,0 +1,3 @@
name: llama3.1 # == meta-llama/Meta-Llama-3.1-70B-Instruct
endpoint: https://services.clarin-pl.eu/api/v1/oapi
use_langsmith: false
46 changes: 0 additions & 46 deletions configs/api_model/llama_3_8b_instruct_q8_0.yaml

This file was deleted.

@@ -0,0 +1,9 @@
/outputs_42.json
/outputs_7312.json
/outputs_997.json
/metrics_42.json
/metrics_7312.json
/metrics_997.json
/judge_metrics_7312.json
/judge_metrics_42.json
/judge_metrics_997.json
@@ -0,0 +1,9 @@
/outputs_42.json
/outputs_7312.json
/outputs_997.json
/metrics_42.json
/metrics_7312.json
/metrics_997.json
/judge_metrics_42.json
/judge_metrics_997.json
/judge_metrics_7312.json
@@ -4,3 +4,6 @@
/metrics_997.json
/metrics_7312.json
/metrics_42.json
/judge_metrics_997.json
/judge_metrics_42.json
/judge_metrics_7312.json
@@ -4,3 +4,6 @@
/metrics_997.json
/metrics_42.json
/metrics_7312.json
/judge_metrics_42.json
/judge_metrics_997.json
/judge_metrics_7312.json
@@ -0,0 +1,22 @@
| llm | assessment | court_name | date | department_name | judges | legal_bases | recorder | signature |
|:--------------------------------------------|:----------------|:----------------|:----------------|:------------------|:----------------|:----------------|:----------------|:----------------|
| Unsloth-Llama-3-8B-Instruct | (Correct) | 0.076 (± 0.001) | 0.000 (± 0.000) | 0.119 (± 0.000) | 0.129 (± 0.002) | 0.009 (± 0.001) | 0.013 (± 0.002) | 0.743 (± 0.004) |
| Unsloth-Llama-3-8B-Instruct | (Disagreement) | 0.066 (± 0.001) | 0.031 (± 0.001) | 0.097 (± 0.001) | 0.037 (± 0.001) | 0.428 (± 0.009) | 0.015 (± 0.001) | 0.111 (± 0.001) |
| Unsloth-Llama-3-8B-Instruct | (Subset) | 0.136 (± 0.003) | 0.000 (± 0.000) | 0.168 (± 0.003) | 0.070 (± 0.001) | 0.033 (± 0.001) | 0.447 (± 0.003) | 0.011 (± 0.001) |
| Unsloth-Llama-3-8B-Instruct | (Superset) | 0.061 (± 0.001) | 0.000 (± 0.000) | 0.022 (± 0.001) | 0.178 (± 0.005) | 0.174 (± 0.007) | 0.001 (± 0.000) | 0.007 (± 0.001) |
| Unsloth-Llama-3-8B-Instruct | (non-evaluable) | 0.661 (± 0.002) | 0.969 (± 0.001) | 0.595 (± 0.001) | 0.586 (± 0.005) | 0.356 (± 0.006) | 0.524 (± 0.003) | 0.129 (± 0.004) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned | (Correct) | 0.028 (± 0.000) | 0.000 (± 0.000) | 0.003 (± 0.000) | 0.024 (± 0.001) | 0.015 (± 0.001) | 0.002 (± 0.000) | 0.002 (± 0.000) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned | (Disagreement) | 0.035 (± 0.001) | 0.022 (± 0.000) | 0.065 (± 0.001) | 0.021 (± 0.001) | 0.619 (± 0.004) | 0.025 (± 0.002) | 0.015 (± 0.000) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned | (Subset) | 0.004 (± 0.001) | 0.000 (± 0.000) | 0.021 (± 0.001) | 0.027 (± 0.001) | 0.090 (± 0.003) | 0.036 (± 0.002) | 0.020 (± 0.002) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned | (Superset) | 0.005 (± 0.000) | 0.000 (± 0.000) | 0.006 (± 0.000) | 0.047 (± 0.000) | 0.064 (± 0.001) | 0.002 (± 0.001) | 0.002 (± 0.000) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned | (non-evaluable) | 0.928 (± 0.001) | 0.978 (± 0.000) | 0.905 (± 0.001) | 0.881 (± 0.001) | 0.212 (± 0.004) | 0.936 (± 0.004) | 0.961 (± 0.002) |
| Unsloth-Mistral-7B-Instruct-v0.3 | (Correct) | 0.006 (± 0.005) | 0.000 (± 0.000) | 0.040 (± 0.034) | 0.024 (± 0.021) | 0.004 (± 0.003) | 0.001 (± 0.001) | 0.165 (± 0.143) |
| Unsloth-Mistral-7B-Instruct-v0.3 | (Disagreement) | 0.038 (± 0.027) | 0.029 (± 0.015) | 0.064 (± 0.043) | 0.060 (± 0.047) | 0.206 (± 0.088) | 0.017 (± 0.008) | 0.146 (± 0.076) |
| Unsloth-Mistral-7B-Instruct-v0.3 | (Subset) | 0.002 (± 0.002) | 0.001 (± 0.002) | 0.046 (± 0.040) | 0.017 (± 0.015) | 0.054 (± 0.040) | 0.029 (± 0.025) | 0.010 (± 0.004) |
| Unsloth-Mistral-7B-Instruct-v0.3 | (Superset) | 0.353 (± 0.038) | 0.010 (± 0.016) | 0.133 (± 0.124) | 0.312 (± 0.086) | 0.130 (± 0.133) | 0.037 (± 0.033) | 0.133 (± 0.221) |
| Unsloth-Mistral-7B-Instruct-v0.3 | (non-evaluable) | 0.601 (± 0.005) | 0.959 (± 0.003) | 0.716 (± 0.006) | 0.586 (± 0.006) | 0.605 (± 0.005) | 0.916 (± 0.001) | 0.547 (± 0.002) |
| Unsloth-Mistral-7B-Instruct-v0.3-fine-tuned | (Correct) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.002 (± 0.000) | 0.017 (± 0.001) | 0.025 (± 0.001) | 0.008 (± 0.001) | 0.000 (± 0.000) |
| Unsloth-Mistral-7B-Instruct-v0.3-fine-tuned | (Disagreement) | 0.028 (± 0.002) | 0.015 (± 0.000) | 0.035 (± 0.001) | 0.024 (± 0.001) | 0.202 (± 0.002) | 0.017 (± 0.001) | 0.004 (± 0.001) |
| Unsloth-Mistral-7B-Instruct-v0.3-fine-tuned | (Subset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.001 (± 0.000) | 0.008 (± 0.001) | 0.042 (± 0.001) | 0.004 (± 0.000) | 0.004 (± 0.001) |
| Unsloth-Mistral-7B-Instruct-v0.3-fine-tuned | (Superset) | 0.000 (± 0.000) | 0.000 (± 0.000) | 0.004 (± 0.000) | 0.009 (± 0.000) | 0.022 (± 0.001) | 0.001 (± 0.000) | 0.000 (± 0.000) |
| Unsloth-Mistral-7B-Instruct-v0.3-fine-tuned | (non-evaluable) | 0.971 (± 0.002) | 0.985 (± 0.000) | 0.958 (± 0.001) | 0.943 (± 0.002) | 0.709 (± 0.002) | 0.970 (± 0.001) | 0.992 (± 0.001) |
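
A quick sanity check on the table above: for each model and field, the five assessment shares (Correct, Disagreement, Subset, Superset, non-evaluable) should sum to roughly 1. The sketch below parses cells of the `mean (± std)` form and checks this for one column. The helper names `parse_cell` and `category_shares` are hypothetical and not part of this commit; the cell values are copied from the `court_name` column for Unsloth-Llama-3-8B-Instruct.

```python
import re
from typing import Dict, Tuple

# Cells copied from the court_name column for Unsloth-Llama-3-8B-Instruct.
court_name_cells = {
    "Correct": "0.076 (± 0.001)",
    "Disagreement": "0.066 (± 0.001)",
    "Subset": "0.136 (± 0.003)",
    "Superset": "0.061 (± 0.001)",
    "non-evaluable": "0.661 (± 0.002)",
}

# Matches cells such as "0.076 (± 0.001)".
CELL_RE = re.compile(r"^\s*([0-9.]+)\s*\(±\s*([0-9.]+)\)\s*$")


def parse_cell(cell: str) -> Tuple[float, float]:
    """Split a 'mean (± std)' cell into (mean, std). Hypothetical helper."""
    match = CELL_RE.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return float(match.group(1)), float(match.group(2))


def category_shares(cells: Dict[str, str]) -> Dict[str, float]:
    """Keep only the mean share of each assessment category."""
    return {category: parse_cell(cell)[0] for category, cell in cells.items()}


shares = category_shares(court_name_cells)
total = sum(shares.values())
# Allow a small tolerance, since each cell is rounded to three decimals.
assert abs(total - 1.0) < 0.005
print(f"non-evaluable share: {shares['non-evaluable']:.3f}, total: {total:.3f}")
```

Run against every column, the same check makes it easy to see how much of each field the judge marked as non-evaluable.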
@@ -1,4 +1,6 @@
| llm | full_text_chrf | court_name | date | department_name | judges | legal_bases | recorder | signature |
|:--------------------------------------------|:-----------------|:----------------|:----------------|:------------------|:----------------|:----------------|:----------------|:----------------|
| Unsloth-Llama-3-8B-Instruct | 0.579 (± 0.001) | 0.882 (± 0.000) | 0.983 (± 0.001) | 0.905 (± 0.001) | 0.919 (± 0.000) | 0.400 (± 0.004) | 0.739 (± 0.003) | 0.735 (± 0.002) |
| Unsloth-Llama-3-8B-Instruct-fine-tuned | 0.747 (± 0.000) | 0.983 (± 0.001) | 0.989 (± 0.000) | 0.962 (± 0.001) | 0.964 (± 0.000) | 0.509 (± 0.000) | 0.957 (± 0.004) | 0.981 (± 0.002) |
| Unsloth-Mistral-7B-Instruct-v0.3 | 0.574 (± 0.001) | 0.800 (± 0.003) | 0.966 (± 0.002) | 0.830 (± 0.006) | 0.845 (± 0.004) | 0.360 (± 0.005) | 0.925 (± 0.003) | 0.451 (± 0.003) |
| Unsloth-Mistral-7B-Instruct-v0.3-fine-tuned | 0.634 (± 0.001) | 0.983 (± 0.001) | 0.988 (± 0.000) | 0.980 (± 0.001) | 0.974 (± 0.001) | 0.748 (± 0.004) | 0.986 (± 0.001) | 0.993 (± 0.001) |
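
The `mean (± std)` cells are presumably aggregated over the three seeded runs (42, 7312, 997) whose `metrics_*.json` files appear in this commit. A minimal sketch of such an aggregation follows, under loud assumptions: the per-seed metrics are taken to be flat field-to-score mappings, the spread is the sample standard deviation, and `aggregate_seeds` is a hypothetical helper. The input numbers are illustrative only and are not taken from the results.

```python
from statistics import mean, stdev
from typing import Dict


def aggregate_seeds(per_seed: Dict[int, Dict[str, float]]) -> Dict[str, str]:
    """Format each field as 'mean (± std)' across seeds. Hypothetical helper."""
    # Only aggregate fields that every seed reports.
    fields = sorted(set.intersection(*(set(m) for m in per_seed.values())))
    summary = {}
    for field in fields:
        values = [metrics[field] for metrics in per_seed.values()]
        # Assumption: sample standard deviation; a single run has no spread.
        spread = stdev(values) if len(values) > 1 else 0.0
        summary[field] = f"{mean(values):.3f} (± {spread:.3f})"
    return summary


# Illustrative scores only; the structure of metrics_<seed>.json is assumed.
per_seed_metrics = {
    42: {"court_name": 0.5},
    7312: {"court_name": 0.6},
    997: {"court_name": 0.7},
}
summary = aggregate_seeds(per_seed_metrics)
print(summary)
```

If the repository uses the population standard deviation instead, `statistics.pstdev` would be the drop-in replacement.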