From 25bd733b716b82967d1e758c4f0a9f54dbbaa26b Mon Sep 17 00:00:00 2001
From: Scott Condron
Date: Wed, 31 Jan 2024 14:31:36 +0000
Subject: [PATCH] Prompts eval (#504)

* move custom columns, remove wandbaddons
* add prompts_evaluation
* clean metadata
* undo deleting custom column tut
* clean up

---------

Co-authored-by: Thomas Capelle
---
 colabs/prompts/prompts_evaluation.ipynb | 494 ++++++++++++++++++++++++
 1 file changed, 494 insertions(+)
 create mode 100644 colabs/prompts/prompts_evaluation.ipynb

diff --git a/colabs/prompts/prompts_evaluation.ipynb b/colabs/prompts/prompts_evaluation.ipynb
new file mode 100644
index 00000000..b1ac7fc5
--- /dev/null
+++ b/colabs/prompts/prompts_evaluation.ipynb
@@ -0,0 +1,494 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "280912a4-7121-4185-addf-e5dc315f9b03",
+   "metadata": {},
+   "source": [
+    "\"Open\n",
+    ""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bafb5303-fa1d-4abc-89f3-de10e8d282c8",
+   "metadata": {},
+   "source": [
+    "\"Weights\n",
+    ""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b76f1564",
+   "metadata": {},
+   "source": [
+    "# Iterate and Evaluate LLM applications"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a3e61dd9",
+   "metadata": {},
+   "source": [
+    "Building an AI application is an experimental process: you usually don't know in advance how a given system will perform on your task. To iterate on an application, you need a way to tell whether it's improving, and a common practice is to evaluate it against the same dataset every time something changes.\n",
+    "\n",
+    "This tutorial will show you how to:\n",
+    "- track input prompts and pipeline settings with `wandb.config`\n",
+    "- track final evaluation metrics, e.g. F1 score or scores from LLM judges, with `wandb.log`\n",
+    "- track individual model predictions and metadata in `W&B Tables`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c9720603",
+   "metadata": {},
+   "source": [
+    "We'll track the F1 score for extracting named entities from an example news headlines dataset in `explosion/prodigy-recipes`, published by the https://prodi.gy/ team.\n",
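+    "\n",
+    "Before diving in, here is the bare W&B logging pattern this tutorial builds on, as a minimal sketch (the project name, config values and metric values below are placeholders, not results from this notebook):\n",
+    "\n",
+    "```python\n",
+    "import wandb\n",
+    "\n",
+    "# Start a run and record the settings you're experimenting with\n",
+    "wandb.init(project=\"my-llm-eval\", config={\"model\": \"gpt-3.5-turbo\", \"temperature\": 0.7})\n",
+    "\n",
+    "# Log summary metrics for this configuration\n",
+    "wandb.log({\"f1\": 0.72})\n",
+    "\n",
+    "# Log individual predictions in a table for later inspection\n",
+    "table = wandb.Table(columns=[\"input\", \"prediction\", \"score\"])\n",
+    "table.add_data(\"example input\", \"example prediction\", 1.0)\n",
+    "wandb.log({\"predictions\": table})\n",
+    "\n",
+    "# End the run\n",
+    "wandb.finish()\n",
+    "```"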
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "52c11c15",
+   "metadata": {},
+   "source": [
+    "# Setup\n",
+    "## Download Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b983c7b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!curl -O https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/annotated_news_headlines-ORG-PERSON-LOCATION-ner.jsonl"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "28bba1c5",
+   "metadata": {},
+   "source": [
+    "## Installation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4b99b883",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install wandb openai"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "58518d19",
+   "metadata": {},
+   "source": [
+    "## Create a W&B account and log in"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3a935c78",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import wandb\n",
+    "wandb.login()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3834be0a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from functools import partial\n",
+    "import timeit\n",
+    "import openai\n",
+    "from concurrent.futures import ThreadPoolExecutor\n",
+    "\n",
+    "# Load the annotated news headlines\n",
+    "data = []\n",
+    "with open('annotated_news_headlines-ORG-PERSON-LOCATION-ner.jsonl') as f:\n",
+    "    for line in f:\n",
+    "        data.append(json.loads(line))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ba4ad8a1",
+   "metadata": {},
+   "source": [
+    "# Format data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2fd386fd",
+   "metadata": {},
+   "source": [
+    "Here we drop the fields we're not using and format the examples for our task."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "34004089",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def clean_examples():\n",
+    "    labelled_examples = []\n",
+    "    for example in data:\n",
+    "        entities = []\n",
+    "        if 'spans' in example:\n",
+    "            for span in example['spans']:\n",
+    "                start = span['start']\n",
+    "                end = span['end']\n",
+    "                label = span['label']\n",
+    "                # Extract the corresponding text from tokens\n",
+    "                text = ''\n",
+    "                for token in example['tokens']:\n",
+    "                    if token['start'] >= start and token['end'] <= end:\n",
+    "                        text += token['text'] + ' '\n",
+    "                entities.append(text.rstrip())\n",
+    "        labelled_examples.append({'text': example['text'], 'entities': entities})\n",
+    "    return labelled_examples\n",
+    "\n",
+    "labelled_examples = clean_examples()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3e898d60",
+   "metadata": {},
+   "source": [
+    "# Set up LLM boilerplate"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3ada324f",
+   "metadata": {},
+   "source": [
+    "We'll call the OpenAI API (you'll need an OpenAI API key) with a prompt template, replacing the `<text>` placeholder with our input, and ask it to extract the entities. We'll also grab useful metadata from the OpenAI response for logging.\n",
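+    "\n",
+    "If you haven't configured a key yet, one way to provide it in a notebook is shown below as a sketch: the `openai` client reads the `OPENAI_API_KEY` environment variable, and `getpass` simply keeps the key out of the notebook output.\n",
+    "\n",
+    "```python\n",
+    "import os\n",
+    "from getpass import getpass\n",
+    "\n",
+    "# The OpenAI client picks up OPENAI_API_KEY from the environment\n",
+    "if not os.environ.get(\"OPENAI_API_KEY\"):\n",
+    "    os.environ[\"OPENAI_API_KEY\"] = getpass(\"Paste your OpenAI API key: \")\n",
+    "```"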
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c630e402",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def extract_entities_with_template(text, template_prompt, system_prompt, model, temperature):\n",
+    "    start_time = timeit.default_timer()\n",
+    "    # Fill the <text> placeholder in the prompt template with the input text\n",
+    "    prompt = template_prompt.replace('<text>', text)\n",
+    "    from openai import OpenAI\n",
+    "    client = OpenAI()\n",
+    "    response = client.chat.completions.create(\n",
+    "        model=model,\n",
+    "        messages=[\n",
+    "            {\n",
+    "                \"role\": \"system\",\n",
+    "                \"content\": system_prompt\n",
+    "            },\n",
+    "            {\n",
+    "                \"role\": \"user\",\n",
+    "                \"content\": prompt\n",
+    "            }\n",
+    "        ],\n",
+    "        temperature=temperature,\n",
+    "    )\n",
+    "    text = response.choices[0].message.content\n",
+    "    entities = list(filter(None, text.split('\\n')))\n",
+    "    usage = response.usage\n",
+    "    prompt_tokens = usage.prompt_tokens\n",
+    "    completion_tokens = usage.completion_tokens\n",
+    "    total_tokens = usage.total_tokens\n",
+    "    end_time = timeit.default_timer()\n",
+    "    elapsed = end_time - start_time\n",
+    "    return {\n",
+    "        'entities': entities,\n",
+    "        'model': model,\n",
+    "        'prompt': prompt,\n",
+    "        'elapsed': elapsed,\n",
+    "        'prompt_tokens': prompt_tokens,\n",
+    "        'completion_tokens': completion_tokens,\n",
+    "        'total_tokens': total_tokens\n",
+    "    }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b9ee3ea0",
+   "metadata": {},
+   "source": [
+    "# Calculate Metric"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2a2ebb75",
+   "metadata": {},
+   "source": [
+    "Here, we define an evaluation metric for our task.\n",
+    "Note: it's not shown here, but you could also use an LLM to evaluate your task if it's not as straightforward to score as this one."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0ad120db",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def calculate_f1(extracted_entities, ground_truth_entities):\n",
+    "    extracted_set = set(map(str.lower, extracted_entities))\n",
+    "    ground_truth_set = set(map(str.lower, ground_truth_entities))\n",
+    "    tp_examples = extracted_set & ground_truth_set\n",
+    "    tp = len(tp_examples)\n",
+    "    fp_examples = extracted_set - ground_truth_set\n",
+    "    fp = len(fp_examples)\n",
+    "    fn_examples = ground_truth_set - extracted_set\n",
+    "    fn = len(fn_examples)\n",
+    "    precision = tp / (tp + fp) if (tp + fp) else 0\n",
+    "    recall = tp / (tp + fn) if (tp + fn) else 0\n",
+    "    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0\n",
+    "    return f1, tp, fp, fn"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "43a50281",
+   "metadata": {},
+   "source": [
+    "# Perform inference in parallel"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6cb082bd",
+   "metadata": {},
+   "source": [
+    "Running evaluations can be slow, so here is some useful code to run the OpenAI calls for your examples in parallel. None of this is specific to W&B, but it's useful to have nonetheless.\n",
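+    "\n",
+    "If the `functools.partial` + `ThreadPoolExecutor` combination below is new to you, here it is in isolation first, as a toy sketch (nothing here is specific to W&B or OpenAI):\n",
+    "\n",
+    "```python\n",
+    "from functools import partial\n",
+    "from concurrent.futures import ThreadPoolExecutor\n",
+    "\n",
+    "def add(x, y):\n",
+    "    return x + y\n",
+    "\n",
+    "# executor.map wants a function of one argument, so fix `y` with partial\n",
+    "add_ten = partial(add, y=10)\n",
+    "\n",
+    "with ThreadPoolExecutor(max_workers=4) as executor:\n",
+    "    results = list(executor.map(add_ten, [1, 2, 3]))\n",
+    "\n",
+    "print(results)  # [11, 12, 13]\n",
+    "```"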
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7256eba2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def inference(examples, system_prompt, template_prompt, model, temperature):\n",
+    "    extracted = []\n",
+    "    # Make a partial function that calls OpenAI with the prompts and settings baked in;\n",
+    "    # this is needed because executor.map wants a function with one argument\n",
+    "    openai_func = partial(extract_entities_with_template, model=model,\n",
+    "                          system_prompt=system_prompt, template_prompt=template_prompt,\n",
+    "                          temperature=temperature)\n",
+    "    # Run the model to extract the entities\n",
+    "    start_time = timeit.default_timer()\n",
+    "    with ThreadPoolExecutor(max_workers=8) as executor:\n",
+    "        for i in executor.map(openai_func, [t['text'] for t in examples]):\n",
+    "            extracted.append(i)\n",
+    "    end_time = timeit.default_timer()\n",
+    "    elapsed = end_time - start_time\n",
+    "    return extracted, elapsed\n",
+    "\n",
+    "model = 'gpt-3.5-turbo'\n",
+    "temperature = 0.7\n",
+    "template = '''\n",
+    "text: <text>\n",
+    "Return the entities as a list with a new line between each entity.\n",
+    "'''\n",
+    "system_prompt = 'You are an excellent entity extractor reading newspapers and extracting orgs, people and locations. Extract the entities from the following sentence.'\n",
+    "extracted, elapsed = inference(labelled_examples[:1], system_prompt, template, model, temperature)\n",
+    "print(extracted[0])\n",
+    "print(labelled_examples[0]['text'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7a615f80",
+   "metadata": {},
+   "source": [
+    "# Evaluate extracted entities, save in W&B Table for inspection later"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5f7622cd",
+   "metadata": {},
+   "source": [
+    "Here, we calculate our metric across all of our predictions and log them to a `wandb.Table` for later inspection."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fb2da2ed",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def evaluate(extracted, labelled_examples):\n",
+    "    total_tp, total_fp, total_fn = 0, 0, 0\n",
+    "    eval_table = wandb.Table(columns=['pred', 'truth', 'f1', 'tp', 'fp', 'fn',\n",
+    "                                      'prompt_tokens', 'completion_tokens', 'total_tokens'])\n",
+    "    for pred, gt in zip(extracted, labelled_examples):\n",
+    "        f1, tp, fp, fn = calculate_f1(pred['entities'], gt['entities'])\n",
+    "        total_tp += tp\n",
+    "        total_fp += fp\n",
+    "        total_fn += fn\n",
+    "        eval_table.add_data(\n",
+    "            pred['entities'], gt['entities'], f1, tp, fp, fn,\n",
+    "            pred['prompt_tokens'], pred['completion_tokens'], pred['total_tokens']\n",
+    "        )\n",
+    "    wandb.log({'eval_table': eval_table})\n",
+    "    overall_precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) else 0\n",
+    "    overall_recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) else 0\n",
+    "    overall_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall) if (overall_precision + overall_recall) else 0\n",
+    "    return overall_precision, overall_recall, overall_f1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "763055df",
+   "metadata": {},
+   "source": [
+    "# Run our pipeline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6bc8d556",
+   "metadata": {},
+   "source": [
+    "To start logging to W&B, call `wandb.init` and pass in a `config` with the settings you're currently experimenting with.\n",
+    "\n",
+    "As you experiment, call `wandb.log` to log your metrics to W&B. Finally, call `wandb.finish` to stop tracking; everything logged in between is grouped into one \"Run\" in W&B.\n",
+    "\n",
+    "You'll be given a link to W&B to see all of your logs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "530c96fb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "NUM_EXAMPLES = 50\n",
+    "wandb.init(project='prompts_eval', config={\n",
+    "    'system_prompt': system_prompt,\n",
+    "    'template': template,\n",
+    "    'model': model,\n",
+    "    'temperature': temperature\n",
+    "})\n",
+    "extracted, elapsed = inference(labelled_examples[:NUM_EXAMPLES],\n",
+    "                               system_prompt, template, model, temperature)\n",
+    "overall_precision, overall_recall, overall_f1 = evaluate(extracted,\n",
+    "                                                         labelled_examples[:NUM_EXAMPLES])\n",
+    "total_tokens_sum = sum([pred['total_tokens'] for pred in extracted])\n",
+    "completion_tokens_sum = sum([pred['completion_tokens'] for pred in extracted])\n",
+    "prompt_tokens_sum = sum([pred['prompt_tokens'] for pred in extracted])\n",
+    "wandb.log({'precision': overall_precision,\n",
+    "           'recall': overall_recall,\n",
+    "           'f1': overall_f1,\n",
+    "           'time_elapsed_total': elapsed,\n",
+    "           'prompt_tokens': prompt_tokens_sum,\n",
+    "           'completion_tokens': completion_tokens_sum,\n",
+    "           'total_tokens': total_tokens_sum\n",
+    "           })\n",
+    "wandb.finish()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ad4af593",
+   "metadata": {},
+   "source": [
+    "# Set up experiments\n",
+    "\n",
+    "Start a W&B run per experiment with `wandb.init`, storing the experiment details in the `config` argument. Log results with `wandb.log` and call `wandb.finish` to end each experiment. Below, we loop over all the options in a small grid search to find the best configuration."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "149b0700",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "system_prompts = ['Extract the entities from the following sentence.',\n",
+    "                  'You are an excellent entity extractor reading newspapers and extracting orgs, people and locations. Extract the entities from the following sentence.']\n",
+    "for system_prompt in system_prompts:\n",
+    "    for temperature in [0.2, 0.6, 0.9]:\n",
+    "        for model in ['gpt-3.5-turbo', 'gpt-3.5-turbo-1106']:\n",
+    "            wandb.init(project='prompts_eval', config={\n",
+    "                'system_prompt': system_prompt,\n",
+    "                'template': template,\n",
+    "                'model': model,\n",
+    "                'temperature': temperature\n",
+    "            })\n",
+    "            extracted, elapsed = inference(labelled_examples[:NUM_EXAMPLES],\n",
+    "                                           system_prompt, template, model, temperature)\n",
+    "            overall_precision, overall_recall, overall_f1 = evaluate(extracted,\n",
+    "                                                                     labelled_examples[:NUM_EXAMPLES])\n",
+    "            total_tokens_sum = sum([pred['total_tokens'] for pred in extracted])\n",
+    "            completion_tokens_sum = sum([pred['completion_tokens'] for pred in extracted])\n",
+    "            prompt_tokens_sum = sum([pred['prompt_tokens'] for pred in extracted])\n",
+    "            wandb.log({'precision': overall_precision,\n",
+    "                       'recall': overall_recall,\n",
+    "                       'f1': overall_f1,\n",
+    "                       'time_elapsed_total': elapsed,\n",
+    "                       'prompt_tokens': prompt_tokens_sum,\n",
+    "                       'completion_tokens': completion_tokens_sum,\n",
+    "                       'total_tokens': total_tokens_sum\n",
+    "                       })\n",
+    "            wandb.finish()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5ba13a8c",
+   "metadata": {},
+   "source": [
+    "# Conclusion\n",
+    "\n",
+    "You've learned how to use W&B to track evaluations of your LLM applications.\n",
+    "You've used `wandb.init` to start tracking, `wandb.log` to log summary evaluation metrics, and `wandb.Table` to track individual predictions & scores.\n",
+    "We've also shared some best practices for structuring your code so it's easier to run evaluations in parallel and track every iteration."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f5cceefe",
+   "metadata": {},
+   "source": [
+    "# Trace your LLM application\n",
+    "\n",
+    "If you want to go further and you're building more complex pipelines of LLM calls, you can leverage W&B Prompts to view traces of your application and see the inputs & outputs of each LLM or function call.\n",
+    "\n",
+    "Learn more about W&B Prompts in the documentation here: [https://docs.wandb.ai/guides/prompts](https://docs.wandb.ai/guides/prompts)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}