diff --git a/colabs/prompts/prompts_evaluation.ipynb b/colabs/prompts/prompts_evaluation.ipynb
new file mode 100644
index 00000000..981fdb30
--- /dev/null
+++ b/colabs/prompts/prompts_evaluation.ipynb
@@ -0,0 +1,506 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "b76f1564",
+ "metadata": {},
+ "source": [
+ "# Iterate and Evaluate LLM applications"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a3e61dd9",
+ "metadata": {},
+ "source": [
+ "AI application building is an experimental process where you likely don't know how a given system will perform on your task. To iterate on an application, we need a way to evaluate if it's improving. To do so, a common practice is to test it against the same dataset when there is a change.\n",
+ "\n",
+ "This tutorial will show you how to:\n",
+ "- track input prompts and pipeline settings with `wandb.config`\n",
+ "- track final evaluation metrics e.g. F1 score or scores from LLM judges, with `wandb.log`\n",
+ "- track individual model predictions and metadata in `W&B Tables`"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c9720603",
+ "metadata": {},
+ "source": [
+ "We'll track F1 score on extracting named entities from an example news headlines dataset from `explosion/prodigy-recipes` from the https://prodi.gy/ team."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "52c11c15",
+ "metadata": {},
+ "source": [
+ "# Setup\n",
+ "## Download Data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "b983c7b8",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!curl -O https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/annotated_news_headlines-ORG-PERSON-LOCATION-ner.jsonl"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "28bba1c5",
+ "metadata": {},
+ "source": [
+ "## Installation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4b99b883",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install wandb openai"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "58518d19",
+ "metadata": {},
+ "source": [
+ "## Create a W&B account and log in"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3a935c78",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import wandb\n",
+ "wandb.login()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3834be0a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import json\n",
+ "from functools import partial\n",
+ "import timeit\n",
+ "import openai\n",
+ "from concurrent.futures import ThreadPoolExecutor\n",
+ "data = []\n",
+ "with open('annotated_news_headlines-ORG-PERSON-LOCATION-ner.jsonl') as f:\n",
+ " for line in f:\n",
+ " data.append(json.loads(line))"
+ ]
+ },
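+ {
+ "cell_type": "markdown",
+ "id": "0e1f2a3b",
+ "metadata": {},
+ "source": [
+ "Before formatting the data, it helps to see what a raw record looks like. The next cell is just a quick sanity check (it isn't part of the pipeline): it prints the top-level fields of the first record; the `text`, `spans`, and `tokens` fields are the ones the formatting code below relies on."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "1f2a3b4c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Quick sanity check (not part of the pipeline): inspect one raw record\n",
+ "print(list(data[0].keys()))\n",
+ "print(data[0]['text'])\n",
+ "print(data[0].get('spans', []))"
+ ]
+ },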
+ {
+ "cell_type": "markdown",
+ "id": "ba4ad8a1",
+ "metadata": {},
+ "source": [
+ "# Format data"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2fd386fd",
+ "metadata": {},
+ "source": [
+ "Here we just remove data we're not using and format the examples for our task."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "34004089",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def clean_examples():\n",
+ " labelled_examples = []\n",
+ " for example in data:\n",
+ " entities = []\n",
+ " if 'spans' in example:\n",
+ " for span in example['spans']:\n",
+ " start = span['start']\n",
+ " end = span['end']\n",
+ " label = span['label']\n",
+ " # Extract the corresponding text from tokens\n",
+ " text = ''\n",
+ " for token in example['tokens']:\n",
+ " if token['start'] >= start and token['end'] <= end:\n",
+ " text += token['text'] + ' '\n",
+ " entities.append(text.rstrip())\n",
+ " labelled_examples.append({'text': example['text'], 'entities': entities})\n",
+ " return labelled_examples\n",
+ "\n",
+ "labelled_examples = clean_examples()"
+ ]
+ },
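+ {
+ "cell_type": "markdown",
+ "id": "2a3b4c5d",
+ "metadata": {},
+ "source": [
+ "As another quick check (again, not part of the pipeline), we can print one cleaned example to confirm the shape the rest of the notebook expects: a dict with the headline under `text` and the ground-truth entity strings under `entities`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "3b4c5d6e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Each cleaned example is a dict with the headline text and its ground-truth entities\n",
+ "print(labelled_examples[0])"
+ ]
+ },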
+ {
+ "cell_type": "markdown",
+ "id": "3e898d60",
+ "metadata": {},
+ "source": [
+ "# Set up LLM boilerplate"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3ada324f",
+ "metadata": {},
+ "source": [
+ "We'll call `openai` (you'll need to add an OpenAI API key) with a given prompt to extract the entities and replace `` with our input. We'll also grab useful metadata from the openai response for logging."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "c630e402",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def extract_entities_with_template(text, template_prompt, system_prompt, model, temperature):\n",
+ " start_time = timeit.default_timer()\n",
+ " prompt=template_prompt.replace('', text)\n",
+ " from openai import OpenAI\n",
+ " client = OpenAI()\n",
+ " response = client.chat.completions.create(\n",
+ " model=model,\n",
+ " messages=[\n",
+ " {\n",
+ " \"role\": \"system\",\n",
+ " \"content\": system_prompt\n",
+ " },\n",
+ " {\n",
+ " \"role\": \"user\",\n",
+ " \"content\": prompt\n",
+ " }\n",
+ " ],\n",
+ " temperature=temperature,\n",
+ " )\n",
+ " text = response.choices[0].message.content\n",
+ " entities = list(filter(None, text.split('\\n')))\n",
+ " usage = response.usage\n",
+ " prompt_tokens = usage.prompt_tokens\n",
+ " completion_tokens = usage.completion_tokens\n",
+ " total_tokens = usage.total_tokens\n",
+ " end_time = timeit.default_timer()\n",
+ " elapsed = end_time - start_time\n",
+ " return {\n",
+ " 'entities': entities,\n",
+ " 'model': model,\n",
+ " 'prompt': prompt,\n",
+ " 'elapsed': elapsed,\n",
+ " 'prompt_tokens': prompt_tokens,\n",
+ " 'completion_tokens': completion_tokens,\n",
+ " 'total_tokens': total_tokens\n",
+ " }"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b9ee3ea0",
+ "metadata": {},
+ "source": [
+ "# Calculate Metric"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2a2ebb75",
+ "metadata": {},
+ "source": [
+ "Here, we make an evaluation metric for our task. \n",
+ "Note: It's not shown here, but you could also use an LLM to evaluate your task if it's not as straight forward to evaluate as this task."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0ad120db",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def calculate_f1(extracted_entities, ground_truth_entities):\n",
+ " extracted_set = set(map(str.lower, extracted_entities))\n",
+ " ground_truth_set = set(map(str.lower, ground_truth_entities))\n",
+ " tp_examples = extracted_set & ground_truth_set\n",
+ " tp = len(tp_examples)\n",
+ " fp_examples = extracted_set - ground_truth_set\n",
+ " fp = len(fp_examples)\n",
+ " fn_examples = ground_truth_set - extracted_set\n",
+ " fn = len(fn_examples)\n",
+ " precision = tp / (tp + fp) if (tp + fp) else 0\n",
+ " recall = tp / (tp + fn) if (tp + fn) else 0\n",
+ " f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0\n",
+ " return f1, tp, fp, fn"
+ ]
+ },
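+ {
+ "cell_type": "markdown",
+ "id": "4c5d6e7f",
+ "metadata": {},
+ "source": [
+ "The note above mentions that you could use an LLM as a judge when a simple metric like F1 is hard to define. The next cell is a minimal, hypothetical sketch of that pattern and isn't used by the rest of this notebook: it reuses the same OpenAI chat-completions call as above, and the judge prompt, the 0-10 scale, and the score parsing are illustrative assumptions."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "5d6e7f8a",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Hypothetical sketch: score a prediction with an LLM judge instead of exact matching\n",
+ "from openai import OpenAI\n",
+ "\n",
+ "def llm_judge_score(predicted_entities, ground_truth_entities, model='gpt-3.5-turbo'):\n",
+ "    client = OpenAI()\n",
+ "    judge_prompt = (\n",
+ "        f'Predicted entities: {predicted_entities}. '\n",
+ "        f'Ground-truth entities: {ground_truth_entities}. '\n",
+ "        'Grade how well the prediction matches the ground truth. '\n",
+ "        'Reply with a single integer from 0 (completely wrong) to 10 (perfect).'\n",
+ "    )\n",
+ "    response = client.chat.completions.create(\n",
+ "        model=model,\n",
+ "        messages=[{'role': 'user', 'content': judge_prompt}],\n",
+ "        temperature=0,\n",
+ "    )\n",
+ "    reply = response.choices[0].message.content.strip()\n",
+ "    try:\n",
+ "        return int(reply) / 10  # normalize to 0-1 so it is comparable to F1\n",
+ "    except ValueError:\n",
+ "        return None  # the judge did not return a parseable integer\n",
+ "\n",
+ "# Example usage (requires an OpenAI API key; uncomment to try):\n",
+ "# print(llm_judge_score(['Apple', 'Tim Cook'], ['Apple', 'Tim Cook', 'California']))"
+ ]
+ },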
+ {
+ "cell_type": "markdown",
+ "id": "43a50281",
+ "metadata": {},
+ "source": [
+ "# Perform inference in parallel"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6cb082bd",
+ "metadata": {},
+ "source": [
+ "Running evaluations can be a bit slow. To speed it up, here is a bit of useful code to gather your examples in parallel. None of this is specific to W&B, but it's useful to have nonetheless."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7256eba2",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def inference(examples, system_prompt, template_prompt, model, temperature):\n",
+ " extracted = []\n",
+ " # making a new function to openai which has the template\n",
+ " # this is needed because exectutor.map wants a func with one arg\n",
+ " openai_func = partial(extract_entities_with_template, model=model, \n",
+ " system_prompt=system_prompt, template_prompt=template_prompt, \n",
+ " temperature=temperature)\n",
+ " # Run the model to extract the entities\n",
+ " start_time = timeit.default_timer()\n",
+ " with ThreadPoolExecutor(max_workers=8) as executor:\n",
+ " for i in executor.map(openai_func, [t['text'] for t in examples]):\n",
+ " extracted.append(i)\n",
+ " end_time = timeit.default_timer()\n",
+ " elapsed = end_time - start_time\n",
+ " return extracted, elapsed\n",
+ "\n",
+ "model = 'gpt-3.5-turbo'\n",
+ "temperature = 0.7\n",
+ "template = '''\n",
+ "text: \n",
+ "Return the entities as a list with a new line between each entity.\n",
+ "'''\n",
+ "system_prompt = 'You are an excellent entity extractor reading newspapers and extracting orgs, people and locations. Extract the entities from the follow sentence.'\n",
+ "extracted, elapsed = inference(labelled_examples[:1], system_prompt, template, model, temperature)\n",
+ "print(extracted[0]) \n",
+ "print(labelled_examples[0]['text'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "7a615f80",
+ "metadata": {},
+ "source": [
+ "# Evaluate extracted entities, save in W&B Table for inspection later"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f7622cd",
+ "metadata": {},
+ "source": [
+ "Here, we calcualte our metric across all of our predictions and log them to a `wandb.Table` for later inspection."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "fb2da2ed",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def evaluate(extracted, labelled_examples):\n",
+ " total_tp, total_fp, total_fn = 0,0,0\n",
+ " eval_table = wandb.Table(columns=['pred', 'truth', 'f1', 'tp', 'fp', 'fn', \n",
+ " 'prompt_tokens', 'completion_tokens', 'total_tokens'])\n",
+ " for pred, gt in zip(extracted, labelled_examples):\n",
+ " f1, tp, fp, fn = calculate_f1(pred['entities'], gt['entities'])\n",
+ " total_tp += tp\n",
+ " total_fp += fp\n",
+ " total_fn += f1\n",
+ " eval_table.add_data(\n",
+ " pred['entities'], gt['entities'], f1, tp, fp, fn, \n",
+ " pred['prompt_tokens'], pred['completion_tokens'], pred['total_tokens']\n",
+ " )\n",
+ " wandb.log({'eval_table': eval_table})\n",
+ " overall_precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) else 0\n",
+ " overall_recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) else 0\n",
+ " overall_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall) if (overall_precision + overall_recall) else 0\n",
+ " return overall_precision, overall_recall, overall_f1"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "763055df",
+ "metadata": {},
+ "source": [
+ "# Run our pipeline:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6bc8d556",
+ "metadata": {},
+ "source": [
+ "To start logging to W&B, you can call `wandb.init` and pass in the config to track the configurations you're experimenting with currently.\n",
+ "\n",
+ "As you experiment, you can call `wandb.log` to track your work. This will log the metrics to W&B. Finally, we'll call `wandb.finish` to stop tracking. This will be tracked as one \"Run\" in W&B. \n",
+ "\n",
+ "You'll be given a link to W&B to see all of your logs."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "530c96fb",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "NUM_EXAMPLES = 50\n",
+ "wandb.init(project='prompts_eval', config={\n",
+ " 'system_prompt': system_prompt,\n",
+ " 'template': template,\n",
+ " 'model': model,\n",
+ " 'temperature': temperature\n",
+ " })\n",
+ "extracted, elapsed = inference(labelled_examples[:NUM_EXAMPLES],\n",
+ " system_prompt, template, model, temperature)\n",
+ "overall_precision, overall_recall, overall_f1 = evaluate(extracted, \n",
+ " labelled_examples[:NUM_EXAMPLES])\n",
+ "total_tokens_sum = sum([pred['total_tokens'] for pred in extracted])\n",
+ "completion_tokens_sum = sum([pred['completion_tokens'] for pred in extracted])\n",
+ "prompt_tokens_sum = sum([pred['prompt_tokens'] for pred in extracted])\n",
+ "wandb.log({'precision': overall_precision,\n",
+ " 'recall': overall_recall,\n",
+ " 'f1': overall_f1,\n",
+ " 'time_elapsed_total': elapsed,\n",
+ " 'prompt_tokens': prompt_tokens_sum,\n",
+ " 'completion_tokens': completion_tokens_sum,\n",
+ " 'total_tokens': total_tokens_sum\n",
+ " })\n",
+ "wandb.finish()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ad4af593",
+ "metadata": {},
+ "source": [
+ "# Set up experiments\n",
+ "\n",
+ "Start a W&B run per experiment with `wandb.init`, store experiment details in `config` arg. Log results with `wandb.log`. Call `wandb.finish` to end experiment. Loop over all options in grid search to find best configuration."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "149b0700",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "system_prompts = ['Extract the entities from the follow sentence.', \n",
+ " 'You are an excellent entity extractor reading newspapers and extracting orgs, people and locations. Extract the entities from the follow sentence.']\n",
+ "for system_prompt in system_prompts:\n",
+ " for temperature in [0.2, 0.6, 0.9]:\n",
+ " for model in ['gpt-3.5-turbo', 'gpt-3.5-turbo-1106']:\n",
+ " wandb.init(project='prompts_eval', config={\n",
+ " 'system_prompt':system_prompt,\n",
+ " 'template': template,\n",
+ " 'model': model,\n",
+ " 'temperature': temperature\n",
+ " })\n",
+ " extracted, elapsed = inference(labelled_examples[:NUM_EXAMPLES],\n",
+ " system_prompt, template, model, temperature)\n",
+ " overall_precision, overall_recall, overall_f1 = evaluate(extracted, \n",
+ " labelled_examples[:NUM_EXAMPLES])\n",
+ " total_tokens_sum = sum([pred['total_tokens'] for pred in extracted])\n",
+ " completion_tokens_sum = sum([pred['completion_tokens'] for pred in extracted])\n",
+ " prompt_tokens_sum = sum([pred['prompt_tokens'] for pred in extracted])\n",
+ " wandb.log({'precision': overall_precision,\n",
+ " 'recall': overall_recall,\n",
+ " 'f1': overall_f1,\n",
+ " 'time_elapsed_total': elapsed,\n",
+ " 'prompt_tokens': prompt_tokens_sum,\n",
+ " 'completion_tokens': completion_tokens_sum,\n",
+ " 'total_tokens': total_tokens_sum\n",
+ " })\n",
+ " wandb.finish()"
+ ]
+ },
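+ {
+ "cell_type": "markdown",
+ "id": "6e7f8a9b",
+ "metadata": {},
+ "source": [
+ "After the grid search finishes, you can compare runs in the W&B workspace, or query them programmatically with the W&B public API. The next cell is a sketch of the latter; it assumes your default W&B entity is configured, that the project is named `prompts_eval` as above, and that each run logged an `f1` summary metric."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7f8a9b0c",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Sketch: find the highest-F1 run from the grid search via the W&B public API.\n",
+ "# Assumes your default entity is configured; otherwise pass 'your-entity/prompts_eval'.\n",
+ "api = wandb.Api()\n",
+ "runs = api.runs('prompts_eval')\n",
+ "\n",
+ "scored = [(run.summary.get('f1', 0), run) for run in runs]\n",
+ "best_f1, best_run = max(scored, key=lambda pair: pair[0])\n",
+ "print(f'Best F1: {best_f1:.3f} ({best_run.name})')\n",
+ "print('Config:', {key: best_run.config.get(key) for key in ('model', 'temperature')})"
+ ]
+ },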
+ {
+ "cell_type": "markdown",
+ "id": "5ba13a8c",
+ "metadata": {},
+ "source": [
+ "# Conclusion\n",
+ "\n",
+ "You've learned how to use W&B to track evaluations of your LLM applications. \n",
+ "You've used `wandb.init` to start tracking, `wandb.log` to log summary evaluation metrics and `wandb.Table` to track individual predictions & scores. \n",
+ "We've also shared some best practices to format your code to make it easier to run evaluations in parallel and track every iteration."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f5cceefe",
+ "metadata": {},
+ "source": [
+ "# Trace your LLM application\n",
+ "\n",
+ "If you want to learn more and you're using complex pipelines of LLM calls, you can leverage W&B Prompts to view traces of your application and see inputs & ouputs of each LLM or function call. \n",
+ "\n",
+ "Learn more about W&B Prompts in the documentation here: [https://docs.wandb.ai/guides/prompts](https://docs.wandb.ai/guides/prompts)"
+ ]
+ }
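+ ,
+ {
+ "cell_type": "markdown",
+ "id": "8a9b0c1d",
+ "metadata": {},
+ "source": [
+ "As a concrete starting point, the next cell is a minimal, hypothetical sketch that wraps a single call from this notebook in a trace span via the `Trace` class from `wandb.sdk.data_types.trace_tree`. Treat the exact arguments as assumptions drawn from the W&B Prompts docs, and confirm them against the documentation linked above."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "9b0c1d2e",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Hypothetical sketch: log one LLM call from this notebook as a W&B Prompts trace.\n",
+ "# Check the docs linked above for the current Trace interface before relying on it.\n",
+ "import time\n",
+ "from wandb.sdk.data_types.trace_tree import Trace\n",
+ "\n",
+ "wandb.init(project='prompts_eval')\n",
+ "\n",
+ "start_ms = round(time.time() * 1000)\n",
+ "result = extract_entities_with_template(labelled_examples[0]['text'], template,\n",
+ "                                        system_prompt, model, temperature)\n",
+ "end_ms = round(time.time() * 1000)\n",
+ "\n",
+ "llm_span = Trace(\n",
+ "    name='entity_extraction',\n",
+ "    kind='llm',\n",
+ "    status_code='success',\n",
+ "    metadata={'model': result['model'], 'temperature': temperature,\n",
+ "              'total_tokens': result['total_tokens']},\n",
+ "    start_time_ms=start_ms,\n",
+ "    end_time_ms=end_ms,\n",
+ "    inputs={'system_prompt': system_prompt, 'prompt': result['prompt']},\n",
+ "    outputs={'entities': ', '.join(result['entities'])},\n",
+ ")\n",
+ "llm_span.log(name='entity_extraction_trace')\n",
+ "wandb.finish()"
+ ]
+ }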
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.12"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}