Commit

Merge branch 'align_score_eval_docs' into develop
drazvan committed Nov 27, 2023
2 parents f88acb1 + 9ed7b19 commit 8e2088c
Showing 3 changed files with 46 additions and 22 deletions.
34 changes: 24 additions & 10 deletions nemoguardrails/eval/README.md
@@ -118,9 +118,20 @@ Results on _banking_ dataset, metric used is accuracy.
## Input and Output Rails

### Fact-checking Rails
In the Guardrails library, we provide two approaches out of the box for the fact-checking rail; these are colloquially referred to as AskLLM and AlignScore in the rest of this documentation. For more details, read the [library guide](./../../docs/user_guides/guardrails-library.md).

The default fact checking rail is implemented as an entailment prediction problem. Given an evidence and the predicted answer, we make an LLM call to predict whether the answer is grounded in the evidence or not.
#### AskLLM
In this approach, the fact-checking rail is implemented as an entailment prediction problem. Given an evidence passage and the predicted answer, we prompt an LLM to answer "yes" or "no" depending on whether the answer is grounded in the evidence. This is the default approach.
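
The prompt wording and the `llm` callable below are illustrative assumptions rather than the library's actual prompt template or API; this is only a minimal sketch of the idea:

```python
# Illustrative AskLLM-style fact check (assumed prompt wording; not the
# library's actual prompt template).
def ask_llm_fact_check(llm, evidence: str, answer: str) -> bool:
    """Return True if the LLM judges `answer` to be grounded in `evidence`."""
    prompt = (
        "You are given evidence and a response.\n"
        f"Evidence: {evidence}\n"
        f"Response: {answer}\n"
        "Is the response grounded in the evidence? Answer 'yes' or 'no'."
    )
    # `llm` is any callable mapping a prompt string to a completion string.
    verdict = llm(prompt).lower().strip()
    return verdict.startswith("yes")
```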

#### AlignScore
This approach is based on the AlignScore model [Zha et al. 2023](https://aclanthology.org/2023.acl-long.634.pdf). Given an evidence passage and the predicted answer, the model is fine-tuned to predict that the two are aligned when:

1. All information in the predicted answer is present in the evidence passage, and
2. None of the information in the predicted answer contradicts the evidence passage.

The response is a value between 0.0 and 1.0. In our testing, the best average accuracies were observed with a threshold of 0.7.

Please see the [user guide documentation](./../../docs/user_guides/guardrails-library.md#alignscore) for detailed steps on how to configure your deployment to use AlignScore.
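
As a minimal sketch of how the threshold is applied (the `get_align_score` helper below is a placeholder for however your deployment obtains the score, e.g. an in-process model or a RESTful scoring service; it is not part of the library API):

```python
# Illustrative thresholding of an AlignScore result at 0.7.
ALIGN_SCORE_THRESHOLD = 0.7

def get_align_score(evidence: str, answer: str) -> float:
    """Placeholder: return an alignment score in [0.0, 1.0] from your AlignScore deployment."""
    raise NotImplementedError("Wire this up to your AlignScore model or service.")

def is_factual(evidence: str, answer: str) -> bool:
    """Treat the answer as factual only when the alignment score clears the threshold."""
    return get_align_score(evidence, answer) >= ALIGN_SCORE_THRESHOLD
```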

#### Evaluation
To run the fact-checking rail evaluation, you can use the following CLI command:

```nemoguardrails evaluate fact-checking```
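
For example, to evaluate a specific model on a given number of samples (the option names below are inferred from the Typer definitions in `evaluate.py` shown later in this commit and may differ in your installed version):

```
nemoguardrails evaluate fact-checking --model-name gpt-3.5-turbo-instruct --num-samples 50
```
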
@@ -150,19 +161,22 @@ More details on how to set up the data in the right format and run the evaluation

#### Evaluation Results

We evaluate the performance of the fact checking rail on the [MSMARCO](https://huggingface.co/datasets/ms_marco) dataset. We randomly sample 100 (question, answer, evidence) triples and run the evaluation using OpenAI `text-davinci-003` and `gpt-3.5-turbo` models.
Evaluation Date - Nov 23, 2023.

Evaluation Date - June 02, 2023.
We evaluate the performance of the fact checking rail on the [MSMARCO](https://huggingface.co/datasets/ms_marco) dataset using the Ask LLM and the AlignScore approaches. To build the dataset, we randomly sample 100 (question, correct answer, evidence) triples, and then, for each triple, build a non-factual or incorrect answer to yield 100 (question, incorrect answer, evidence) triples.

We breakdown the performance into positive entailment accuracy and negative entailment accuracy. Positive entailment accuracy is the accuracy of the model in correctly identifying answers that are grounded in the evidence passage. Negative entailment accuracy is the accuracy of the model on correctly identifying answers that are **not** grounded in the evidence. Details on how to create synthetic negative examples can be found [here](./data/factchecking/README.md)
We break down the performance into positive entailment accuracy and negative entailment accuracy. Positive entailment accuracy is the accuracy of the model in correctly identifying answers that are grounded in the evidence passage. Negative entailment accuracy is the accuracy of the model in correctly identifying answers that are **not** supported by the evidence. Details on how to create synthetic negative examples can be found [here](./data/factchecking/README.md).

| Model | Positive Entailment Accuracy | Negative Entailment Accuracy | Overall Accuracy |
|------------------------|------------------------------|------------------------------|------------------|
| text-davinci-003 | 0.83 | 0.87 | 0.85 |
| gpt-3.5-turbo | 0.87 | 0.80 | 0.83 |
| gpt-3.5-turbo-instruct | 0.94 | 0.70 | 0.82 |
| nemollm-43b | 0.80 | 0.83 | 0.81 |
| Model | Positive Entailment Accuracy | Negative Entailment Accuracy | Overall Accuracy | Average Time Per Checked Fact (ms) |
|------------------------|------------------------------|------------------------------|------------------|------------------------------------|
| text-davinci-003 | 70.0% | **93.0%** | 81.5% | 272.2ms |
| gpt-3.5-turbo | 76.0% | 89.0% | 82.5% | 435.1ms |
| gpt-3.5-turbo-instruct | **92.0%** | 69.0% | 80.5% | 188.8ms |
| align_score-base* | 81.0% | 88.0% | 84.5% | **23.0ms** ^ |
| align_score-large* | 87.0% | 90.0% | **88.5%** | 46.0ms ^ |

*The threshold used for align_score is 0.7, i.e. an align_score >= 0.7 is considered a factual statement, and an align_score < 0.7 signifies an incorrect statement.
^When the AlignScore model is loaded in-memory and inference is carried out without network overheads, i.e., not as a RESTful service.
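
For reference, the overall accuracy above is the simple average of the positive and negative entailment accuracies, since each split contains 100 samples; a small sketch of the computation:

```python
# Overall accuracy is the mean of positive and negative entailment accuracy,
# i.e. (pos_correct + neg_correct) / (2 * num_samples) when both splits have
# the same number of samples.
def overall_accuracy(pos_correct: int, neg_correct: int, num_samples: int) -> float:
    """`num_samples` is the size of each split (here: 100 positives, 100 negatives)."""
    return (pos_correct + neg_correct) / (2 * num_samples)

# Example: the align_score-base row above -> (81 + 88) / 200 = 0.845, i.e. 84.5%.
print(overall_accuracy(81, 88, 100))  # 0.845
```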

### Moderation Rails

4 changes: 2 additions & 2 deletions nemoguardrails/eval/cli/evaluate.py
@@ -183,10 +183,10 @@ def fact_checking(
    ),
    llm: str = typer.Option("openai", help="LLM provider to be used for fact checking"),
    model_name: str = typer.Option(
        "text-davinci-003", help="Model name ex. text-davinci-003"
        "gpt-3.5-turbo-instruct", help="Model name ex. gpt-3.5-turbo-instruct"
    ),
    num_samples: int = typer.Option(50, help="Number of samples to be evaluated"),
    create_negatives: bool = typer.Argument(
    create_negatives: bool = typer.Option(
        True, help="create synthetic negative samples"
    ),
    output_dir: str = typer.Option(
30 changes: 20 additions & 10 deletions nemoguardrails/eval/evaluate_factcheck.py
@@ -15,6 +15,7 @@

import json
import os
import time

import tqdm
import typer
@@ -107,6 +108,7 @@ def check_facts(self, split="positive"):

        fact_check_predictions = []
        num_correct = 0
        total_time = 0

        for sample in tqdm.tqdm(self.dataset):
            assert (
@@ -122,10 +124,13 @@
                answer = sample["incorrect_answer"]
                label = "no"

            start_time = time.time()
            fact_check_prompt = self.llm_task_manager.render_task_prompt(
                Task.FACT_CHECKING, {"evidence": evidence, "response": answer}
            )
            fact_check = self.llm(fact_check_prompt)
            end_time = time.time()
            time.sleep(0.5)  # avoid rate-limits
            fact_check = fact_check.lower().strip()

            if label in fact_check:
@@ -139,23 +144,23 @@
"label": label,
}
fact_check_predictions.append(prediction)
total_time += end_time - start_time

return fact_check_predictions, num_correct
return fact_check_predictions, num_correct, total_time

    def run(self):
        """
        Run the fact checking evaluation and print the results.
        """

        if self.create_negatives:
            self.dataset = self.create_negative_samples(self.dataset)

        print("Checking facts - positive entailment")
        positive_fact_check_predictions, pos_num_correct = self.check_facts(
        positive_fact_check_predictions, pos_num_correct, pos_time = self.check_facts(
            split="positive"
        )
        print("Checking facts - negative entailment")
        negative_fact_check_predictions, neg_num_correct = self.check_facts(
        negative_fact_check_predictions, neg_num_correct, neg_time = self.check_facts(
            split="negative"
        )

@@ -165,6 +170,11 @@
f"Overall Accuracy: {(pos_num_correct + neg_num_correct)/(2*len(self.dataset))* 100}"
)

print("---Time taken per sample:---")
print(
f"Ask LLM ({self.model_config.model}):\t{(pos_time+neg_time)*1000/(2*len(self.dataset)):.1f}ms"
)

if self.write_outputs:
dataset_name = os.path.basename(self.dataset_path).split(".")[0]
with open(
@@ -181,28 +191,28 @@


def main(
    data_path: str = typer.Option(
        "data/factchecking/sample.json",
    dataset_path: str = typer.Option(
        "./data/factchecking/sample.json",
        help="Path to the folder containing the dataset",
    ),
    llm: str = typer.Option("openai", help="LLM provider to be used for fact checking"),
    model_name: str = typer.Option(
        "text-davinci-003", help="Model name ex. text-davinci-003"
        "gpt-3.5-turbo-instruct", help="Model name ex. gpt-3.5-turbo-instruct"
    ),
    num_samples: int = typer.Option(50, help="Number of samples to be evaluated"),
    create_negatives: bool = typer.Argument(
    create_negatives: bool = typer.Option(
        True, help="create synthetic negative samples"
    ),
    output_dir: str = typer.Option(
        "outputs/factchecking",
        "eval_outputs/factchecking",
        help="Path to the folder where the outputs will be written",
    ),
    write_outputs: bool = typer.Option(
        True, help="Write outputs to the output directory"
    ),
):
    fact_check = FactCheckEvaluation(
        data_path,
        dataset_path,
        llm,
        model_name,
        num_samples,
