Commit

Merge branch 'align_score_eval_docs' into develop
drazvan committed Nov 27, 2023
2 parents f88acb1 + 9ed7b19 commit 8e2088c
Showing 3 changed files with 46 additions and 22 deletions.
34 changes: 24 additions & 10 deletions nemoguardrails/eval/README.md
@@ -118,9 +118,20 @@ Results on _banking_ dataset, metric used is accuracy.
## Input and Output Rails

### Fact-checking Rails
In the Guardrails library, we provide two approaches out of the box for the fact-checking rail; these are colloquially referred to as AskLLM and AlignScore in the rest of this documentation. For more details, read the [library guide](./../../docs/user_guides/guardrails-library.md).

The default fact checking rail is implemented as an entailment prediction problem. Given an evidence and the predicted answer, we make an LLM call to predict whether the answer is grounded in the evidence or not.
#### AskLLM
In this approach, the fact-checking rail is implemented as an entailment prediction problem. Given an evidence passage and the predicted answer, we prompt an LLM to answer "yes" or "no" depending on whether the answer is grounded in the evidence. This is the default approach.
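
The prompt wording and the `llm` callable below are illustrative assumptions rather than the library's actual prompt template or API; this is only a minimal sketch of the idea:

```python
# Illustrative AskLLM-style fact check (assumed prompt wording; not the
# library's actual prompt template).
def ask_llm_fact_check(llm, evidence: str, answer: str) -> bool:
    """Return True if the LLM judges `answer` to be grounded in `evidence`."""
    prompt = (
        "You are given evidence and a response.\n"
        f"Evidence: {evidence}\n"
        f"Response: {answer}\n"
        "Is the response grounded in the evidence? Answer 'yes' or 'no'."
    )
    # `llm` is any callable mapping a prompt string to a completion string.
    verdict = llm(prompt).lower().strip()
    return verdict.startswith("yes")
```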

#### AlignScore
This approach is based on the AlignScore model [Zha et al. 2023](https://aclanthology.org/2023.acl-long.634.pdf). Given an evidence passage and the predicted answer, the model is fine-tuned to predict that the two are aligned when:

1. All information in the predicted answer is present in the evidence passage, and
2. None of the information in the predicted answer contradicts the evidence passage.

The response is a value between 0.0 and 1.0. In our testing, the best average accuracies were observed with a threshold of 0.7.

Please see the [user guide documentation](./../../docs/user_guides/guardrails-library.md#alignscore) for detailed steps on how to configure your deployment to use AlignScore.
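
As a minimal sketch of how the threshold is applied (the `get_align_score` helper below is a placeholder for however your deployment obtains the score, e.g. an in-process model or a RESTful scoring service; it is not part of the library API):

```python
# Illustrative thresholding of an AlignScore result at 0.7.
ALIGN_SCORE_THRESHOLD = 0.7

def get_align_score(evidence: str, answer: str) -> float:
    """Placeholder: return an alignment score in [0.0, 1.0] from your AlignScore deployment."""
    raise NotImplementedError("Wire this up to your AlignScore model or service.")

def is_factual(evidence: str, answer: str) -> bool:
    """Treat the answer as factual only when the alignment score clears the threshold."""
    return get_align_score(evidence, answer) >= ALIGN_SCORE_THRESHOLD
```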

#### Evaluation
To run the fact-checking rail evaluation, you can use the following CLI command:

```nemoguardrails evaluate fact-checking```
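
For example, to evaluate a specific model on a given number of samples (the option names below are inferred from the Typer definitions in `evaluate.py` shown later in this commit and may differ in your installed version):

```
nemoguardrails evaluate fact-checking --model-name gpt-3.5-turbo-instruct --num-samples 50
```
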
@@ -150,19 +161,22 @@ More details on how to set up the data in the right format and run the evaluation

#### Evaluation Results

We evaluate the performance of the fact checking rail on the [MSMARCO](https://huggingface.co/datasets/ms_marco) dataset. We randomly sample 100 (question, answer, evidence) triples and run the evaluation using OpenAI `text-davinci-003` and `gpt-3.5-turbo` models.
Evaluation Date - Nov 23, 2023.

Evaluation Date - June 02, 2023.
We evaluate the performance of the fact checking rail on the [MSMARCO](https://huggingface.co/datasets/ms_marco) dataset using the Ask LLM and the AlignScore approaches. To build the dataset, we randomly sample 100 (question, correct answer, evidence) triples, and then, for each triple, build a non-factual or incorrect answer to yield 100 (question, incorrect answer, evidence) triples.

We breakdown the performance into positive entailment accuracy and negative entailment accuracy. Positive entailment accuracy is the accuracy of the model in correctly identifying answers that are grounded in the evidence passage. Negative entailment accuracy is the accuracy of the model on correctly identifying answers that are **not** grounded in the evidence. Details on how to create synthetic negative examples can be found [here](./data/factchecking/README.md)
We break down the performance into positive entailment accuracy and negative entailment accuracy. Positive entailment accuracy is the accuracy of the model in correctly identifying answers that are grounded in the evidence passage. Negative entailment accuracy is the accuracy of the model in correctly identifying answers that are **not** supported by the evidence. Details on how to create synthetic negative examples can be found [here](./data/factchecking/README.md).

| Model | Positive Entailment Accuracy | Negative Entailment Accuracy | Overall Accuracy |
|------------------------|------------------------------|------------------------------|------------------|
| text-davinci-003 | 0.83 | 0.87 | 0.85 |
| gpt-3.5-turbo | 0.87 | 0.80 | 0.83 |
| gpt-3.5-turbo-instruct | 0.94 | 0.70 | 0.82 |
| nemollm-43b | 0.80 | 0.83 | 0.81 |
| Model | Positive Entailment Accuracy | Negative Entailment Accuracy | Overall Accuracy | Average Time Per Checked Fact (ms) |
|------------------------|------------------------------|------------------------------|------------------|------------------------------------|
| text-davinci-003 | 70.0% | **93.0%** | 81.5% | 272.2ms |
| gpt-3.5-turbo | 76.0% | 89.0% | 82.5% | 435.1ms |
| gpt-3.5-turbo-instruct | **92.0%** | 69.0% | 80.5% | 188.8ms |
| align_score-base* | 81.0% | 88.0% | 84.5% | **23.0ms** ^ |
| align_score-large* | 87.0% | 90.0% | **88.5%** | 46.0ms ^ |

*The threshold used for align_score is 0.7, i.e. an align_score >= 0.7 is considered a factual statement, and an align_score < 0.7 signifies an incorrect statement.
^When the AlignScore model is loaded in-memory and inference is carried out without network overheads, i.e., not as a RESTful service.
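
For reference, the overall accuracy above is the simple average of the positive and negative entailment accuracies, since each split contains 100 samples; a small sketch of the computation:

```python
# Overall accuracy is the mean of positive and negative entailment accuracy,
# i.e. (pos_correct + neg_correct) / (2 * num_samples) when both splits have
# the same number of samples.
def overall_accuracy(pos_correct: int, neg_correct: int, num_samples: int) -> float:
    """`num_samples` is the size of each split (here: 100 positives, 100 negatives)."""
    return (pos_correct + neg_correct) / (2 * num_samples)

# Example: the align_score-base row above -> (81 + 88) / 200 = 0.845, i.e. 84.5%.
print(overall_accuracy(81, 88, 100))  # 0.845
```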

### Moderation Rails

4 changes: 2 additions & 2 deletions nemoguardrails/eval/cli/evaluate.py
@@ -183,10 +183,10 @@ def fact_checking(
    ),
    llm: str = typer.Option("openai", help="LLM provider to be used for fact checking"),
    model_name: str = typer.Option(
        "text-davinci-003", help="Model name ex. text-davinci-003"
        "gpt-3.5-turbo-instruct", help="Model name ex. gpt-3.5-turbo-instruct"
    ),
    num_samples: int = typer.Option(50, help="Number of samples to be evaluated"),
    create_negatives: bool = typer.Argument(
    create_negatives: bool = typer.Option(
        True, help="create synthetic negative samples"
    ),
    output_dir: str = typer.Option(
30 changes: 20 additions & 10 deletions nemoguardrails/eval/evaluate_factcheck.py
@@ -15,6 +15,7 @@

import json
import os
import time

import tqdm
import typer
@@ -107,6 +108,7 @@ def check_facts(self, split="positive"):

        fact_check_predictions = []
        num_correct = 0
        total_time = 0

        for sample in tqdm.tqdm(self.dataset):
            assert (
@@ -122,10 +124,13 @@
                answer = sample["incorrect_answer"]
                label = "no"

            start_time = time.time()
            fact_check_prompt = self.llm_task_manager.render_task_prompt(
                Task.FACT_CHECKING, {"evidence": evidence, "response": answer}
            )
            fact_check = self.llm(fact_check_prompt)
            end_time = time.time()
            time.sleep(0.5)  # avoid rate-limits
            fact_check = fact_check.lower().strip()

            if label in fact_check:
@@ -139,23 +144,23 @@
"label": label,
}
fact_check_predictions.append(prediction)
total_time += end_time - start_time

return fact_check_predictions, num_correct
return fact_check_predictions, num_correct, total_time

    def run(self):
        """
        Run the fact checking evaluation and print the results.
        """

        if self.create_negatives:
            self.dataset = self.create_negative_samples(self.dataset)

        print("Checking facts - positive entailment")
        positive_fact_check_predictions, pos_num_correct = self.check_facts(
        positive_fact_check_predictions, pos_num_correct, pos_time = self.check_facts(
            split="positive"
        )
        print("Checking facts - negative entailment")
        negative_fact_check_predictions, neg_num_correct = self.check_facts(
        negative_fact_check_predictions, neg_num_correct, neg_time = self.check_facts(
            split="negative"
        )

@@ -165,6 +170,11 @@
f"Overall Accuracy: {(pos_num_correct + neg_num_correct)/(2*len(self.dataset))* 100}"
)

print("---Time taken per sample:---")
print(
f"Ask LLM ({self.model_config.model}):\t{(pos_time+neg_time)*1000/(2*len(self.dataset)):.1f}ms"
)

if self.write_outputs:
dataset_name = os.path.basename(self.dataset_path).split(".")[0]
with open(
@@ -181,28 +191,28 @@


def main(
    data_path: str = typer.Option(
        "data/factchecking/sample.json",
    dataset_path: str = typer.Option(
        "./data/factchecking/sample.json",
        help="Path to the folder containing the dataset",
    ),
    llm: str = typer.Option("openai", help="LLM provider to be used for fact checking"),
    model_name: str = typer.Option(
        "text-davinci-003", help="Model name ex. text-davinci-003"
        "gpt-3.5-turbo-instruct", help="Model name ex. gpt-3.5-turbo-instruct"
    ),
    num_samples: int = typer.Option(50, help="Number of samples to be evaluated"),
    create_negatives: bool = typer.Argument(
    create_negatives: bool = typer.Option(
        True, help="create synthetic negative samples"
    ),
    output_dir: str = typer.Option(
        "outputs/factchecking",
        "eval_outputs/factchecking",
        help="Path to the folder where the outputs will be written",
    ),
    write_outputs: bool = typer.Option(
        True, help="Write outputs to the output directory"
    ),
):
    fact_check = FactCheckEvaluation(
        data_path,
        dataset_path,
        llm,
        model_name,
        num_samples,
