From 25bd733b716b82967d1e758c4f0a9f54dbbaa26b Mon Sep 17 00:00:00 2001
From: Scott Condron
Date: Wed, 31 Jan 2024 14:31:36 +0000
Subject: [PATCH] Prompts eval (#504)

* move custom columns, remove wandbaddons
* add prompts_evaluation
* clean metadata
* undo deleting custom column tut
* clean up

---------

Co-authored-by: Thomas Capelle
---
 colabs/prompts/prompts_evaluation.ipynb | 494 ++++++++++++++++++++++++
 1 file changed, 494 insertions(+)
 create mode 100644 colabs/prompts/prompts_evaluation.ipynb

diff --git a/colabs/prompts/prompts_evaluation.ipynb b/colabs/prompts/prompts_evaluation.ipynb
new file mode 100644
index 00000000..b1ac7fc5
--- /dev/null
+++ b/colabs/prompts/prompts_evaluation.ipynb
@@ -0,0 +1,494 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "280912a4-7121-4185-addf-e5dc315f9b03",
+   "metadata": {},
+   "source": [
+    "\"Open\n",
+    ""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "bafb5303-fa1d-4abc-89f3-de10e8d282c8",
+   "metadata": {},
+   "source": [
+    "\"Weights\n",
+    ""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b76f1564",
+   "metadata": {},
+   "source": [
+    "# Iterate and Evaluate LLM applications"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a3e61dd9",
+   "metadata": {},
+   "source": [
+    "Building an AI application is an experimental process: you usually don't know in advance how a given system will perform on your task. To iterate on an application, you need a way to tell whether it's improving, and a common practice is to evaluate it against the same dataset every time something changes.\n",
+    "\n",
+    "This tutorial will show you how to:\n",
+    "- track input prompts and pipeline settings with `wandb.config`\n",
+    "- track final evaluation metrics, e.g. F1 score or scores from LLM judges, with `wandb.log`\n",
+    "- track individual model predictions and metadata in `W&B Tables`"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "c9720603",
+   "metadata": {},
+   "source": [
+    "We'll track the F1 score for extracting named entities from an example news headlines dataset in `explosion/prodigy-recipes`, published by the https://prodi.gy/ team.\n",
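+    "\n",
+    "Before diving in, here is the bare W&B logging pattern this tutorial builds on, as a minimal sketch (the project name, config values and metric values below are placeholders, not results from this notebook):\n",
+    "\n",
+    "```python\n",
+    "import wandb\n",
+    "\n",
+    "# Start a run and record the settings you're experimenting with\n",
+    "wandb.init(project=\"my-llm-eval\", config={\"model\": \"gpt-3.5-turbo\", \"temperature\": 0.7})\n",
+    "\n",
+    "# Log summary metrics for this configuration\n",
+    "wandb.log({\"f1\": 0.72})\n",
+    "\n",
+    "# Log individual predictions in a table for later inspection\n",
+    "table = wandb.Table(columns=[\"input\", \"prediction\", \"score\"])\n",
+    "table.add_data(\"example input\", \"example prediction\", 1.0)\n",
+    "wandb.log({\"predictions\": table})\n",
+    "\n",
+    "# End the run\n",
+    "wandb.finish()\n",
+    "```"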
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "52c11c15",
+   "metadata": {},
+   "source": [
+    "# Setup\n",
+    "## Download Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b983c7b8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!curl -O https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/annotated_news_headlines-ORG-PERSON-LOCATION-ner.jsonl"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "28bba1c5",
+   "metadata": {},
+   "source": [
+    "## Installation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "4b99b883",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install wandb openai"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "58518d19",
+   "metadata": {},
+   "source": [
+    "## Create a W&B account and log in"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3a935c78",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import wandb\n",
+    "wandb.login()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "3834be0a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import json\n",
+    "from functools import partial\n",
+    "import timeit\n",
+    "import openai\n",
+    "from concurrent.futures import ThreadPoolExecutor\n",
+    "\n",
+    "# Load the annotated news headlines\n",
+    "data = []\n",
+    "with open('annotated_news_headlines-ORG-PERSON-LOCATION-ner.jsonl') as f:\n",
+    "    for line in f:\n",
+    "        data.append(json.loads(line))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ba4ad8a1",
+   "metadata": {},
+   "source": [
+    "# Format data"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2fd386fd",
+   "metadata": {},
+   "source": [
+    "Here we drop the fields we're not using and format the examples for our task."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "34004089",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def clean_examples():\n",
+    "    labelled_examples = []\n",
+    "    for example in data:\n",
+    "        entities = []\n",
+    "        if 'spans' in example:\n",
+    "            for span in example['spans']:\n",
+    "                start = span['start']\n",
+    "                end = span['end']\n",
+    "                label = span['label']\n",
+    "                # Extract the corresponding text from tokens\n",
+    "                text = ''\n",
+    "                for token in example['tokens']:\n",
+    "                    if token['start'] >= start and token['end'] <= end:\n",
+    "                        text += token['text'] + ' '\n",
+    "                entities.append(text.rstrip())\n",
+    "        labelled_examples.append({'text': example['text'], 'entities': entities})\n",
+    "    return labelled_examples\n",
+    "\n",
+    "labelled_examples = clean_examples()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3e898d60",
+   "metadata": {},
+   "source": [
+    "# Set up LLM boilerplate"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3ada324f",
+   "metadata": {},
+   "source": [
+    "We'll call the OpenAI API (you'll need an OpenAI API key) with a prompt template, replacing the `<text>` placeholder with our input, and ask it to extract the entities. We'll also grab useful metadata from the OpenAI response for logging.\n",
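+    "\n",
+    "If you haven't configured a key yet, one way to provide it in a notebook is shown below as a sketch: the `openai` client reads the `OPENAI_API_KEY` environment variable, and `getpass` simply keeps the key out of the notebook output.\n",
+    "\n",
+    "```python\n",
+    "import os\n",
+    "from getpass import getpass\n",
+    "\n",
+    "# The OpenAI client picks up OPENAI_API_KEY from the environment\n",
+    "if not os.environ.get(\"OPENAI_API_KEY\"):\n",
+    "    os.environ[\"OPENAI_API_KEY\"] = getpass(\"Paste your OpenAI API key: \")\n",
+    "```"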
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c630e402",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def extract_entities_with_template(text, template_prompt, system_prompt, model, temperature):\n",
+    "    start_time = timeit.default_timer()\n",
+    "    # Fill the <text> placeholder in the prompt template with the input text\n",
+    "    prompt = template_prompt.replace('<text>', text)\n",
+    "    from openai import OpenAI\n",
+    "    client = OpenAI()\n",
+    "    response = client.chat.completions.create(\n",
+    "        model=model,\n",
+    "        messages=[\n",
+    "            {\n",
+    "                \"role\": \"system\",\n",
+    "                \"content\": system_prompt\n",
+    "            },\n",
+    "            {\n",
+    "                \"role\": \"user\",\n",
+    "                \"content\": prompt\n",
+    "            }\n",
+    "        ],\n",
+    "        temperature=temperature,\n",
+    "    )\n",
+    "    text = response.choices[0].message.content\n",
+    "    entities = list(filter(None, text.split('\\n')))\n",
+    "    usage = response.usage\n",
+    "    prompt_tokens = usage.prompt_tokens\n",
+    "    completion_tokens = usage.completion_tokens\n",
+    "    total_tokens = usage.total_tokens\n",
+    "    end_time = timeit.default_timer()\n",
+    "    elapsed = end_time - start_time\n",
+    "    return {\n",
+    "        'entities': entities,\n",
+    "        'model': model,\n",
+    "        'prompt': prompt,\n",
+    "        'elapsed': elapsed,\n",
+    "        'prompt_tokens': prompt_tokens,\n",
+    "        'completion_tokens': completion_tokens,\n",
+    "        'total_tokens': total_tokens\n",
+    "    }"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b9ee3ea0",
+   "metadata": {},
+   "source": [
+    "# Calculate Metric"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2a2ebb75",
+   "metadata": {},
+   "source": [
+    "Here, we define an evaluation metric for our task.\n",
+    "Note: it's not shown here, but you could also use an LLM to evaluate your task if it's not as straightforward to score as this one."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0ad120db",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def calculate_f1(extracted_entities, ground_truth_entities):\n",
+    "    extracted_set = set(map(str.lower, extracted_entities))\n",
+    "    ground_truth_set = set(map(str.lower, ground_truth_entities))\n",
+    "    tp_examples = extracted_set & ground_truth_set\n",
+    "    tp = len(tp_examples)\n",
+    "    fp_examples = extracted_set - ground_truth_set\n",
+    "    fp = len(fp_examples)\n",
+    "    fn_examples = ground_truth_set - extracted_set\n",
+    "    fn = len(fn_examples)\n",
+    "    precision = tp / (tp + fp) if (tp + fp) else 0\n",
+    "    recall = tp / (tp + fn) if (tp + fn) else 0\n",
+    "    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0\n",
+    "    return f1, tp, fp, fn"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "43a50281",
+   "metadata": {},
+   "source": [
+    "# Perform inference in parallel"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6cb082bd",
+   "metadata": {},
+   "source": [
+    "Running evaluations can be slow, so here is some useful code to run the OpenAI calls for your examples in parallel. None of this is specific to W&B, but it's useful to have nonetheless.\n",
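+    "\n",
+    "If the `functools.partial` + `ThreadPoolExecutor` combination below is new to you, here it is in isolation first, as a toy sketch (nothing here is specific to W&B or OpenAI):\n",
+    "\n",
+    "```python\n",
+    "from functools import partial\n",
+    "from concurrent.futures import ThreadPoolExecutor\n",
+    "\n",
+    "def add(x, y):\n",
+    "    return x + y\n",
+    "\n",
+    "# executor.map wants a function of one argument, so fix `y` with partial\n",
+    "add_ten = partial(add, y=10)\n",
+    "\n",
+    "with ThreadPoolExecutor(max_workers=4) as executor:\n",
+    "    results = list(executor.map(add_ten, [1, 2, 3]))\n",
+    "\n",
+    "print(results)  # [11, 12, 13]\n",
+    "```"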
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7256eba2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def inference(examples, system_prompt, template_prompt, model, temperature):\n",
+    "    extracted = []\n",
+    "    # Make a partial function that calls OpenAI with the prompts and settings baked in;\n",
+    "    # this is needed because executor.map wants a function with one argument\n",
+    "    openai_func = partial(extract_entities_with_template, model=model,\n",
+    "                          system_prompt=system_prompt, template_prompt=template_prompt,\n",
+    "                          temperature=temperature)\n",
+    "    # Run the model to extract the entities\n",
+    "    start_time = timeit.default_timer()\n",
+    "    with ThreadPoolExecutor(max_workers=8) as executor:\n",
+    "        for i in executor.map(openai_func, [t['text'] for t in examples]):\n",
+    "            extracted.append(i)\n",
+    "    end_time = timeit.default_timer()\n",
+    "    elapsed = end_time - start_time\n",
+    "    return extracted, elapsed\n",
+    "\n",
+    "model = 'gpt-3.5-turbo'\n",
+    "temperature = 0.7\n",
+    "template = '''\n",
+    "text: <text>\n",
+    "Return the entities as a list with a new line between each entity.\n",
+    "'''\n",
+    "system_prompt = 'You are an excellent entity extractor reading newspapers and extracting orgs, people and locations. Extract the entities from the following sentence.'\n",
+    "extracted, elapsed = inference(labelled_examples[:1], system_prompt, template, model, temperature)\n",
+    "print(extracted[0])\n",
+    "print(labelled_examples[0]['text'])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7a615f80",
+   "metadata": {},
+   "source": [
+    "# Evaluate extracted entities, save in W&B Table for inspection later"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5f7622cd",
+   "metadata": {},
+   "source": [
+    "Here, we calculate our metric across all of our predictions and log them to a `wandb.Table` for later inspection."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "fb2da2ed",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def evaluate(extracted, labelled_examples):\n",
+    "    total_tp, total_fp, total_fn = 0, 0, 0\n",
+    "    eval_table = wandb.Table(columns=['pred', 'truth', 'f1', 'tp', 'fp', 'fn',\n",
+    "                                      'prompt_tokens', 'completion_tokens', 'total_tokens'])\n",
+    "    for pred, gt in zip(extracted, labelled_examples):\n",
+    "        f1, tp, fp, fn = calculate_f1(pred['entities'], gt['entities'])\n",
+    "        total_tp += tp\n",
+    "        total_fp += fp\n",
+    "        total_fn += fn\n",
+    "        eval_table.add_data(\n",
+    "            pred['entities'], gt['entities'], f1, tp, fp, fn,\n",
+    "            pred['prompt_tokens'], pred['completion_tokens'], pred['total_tokens']\n",
+    "        )\n",
+    "    wandb.log({'eval_table': eval_table})\n",
+    "    overall_precision = total_tp / (total_tp + total_fp) if (total_tp + total_fp) else 0\n",
+    "    overall_recall = total_tp / (total_tp + total_fn) if (total_tp + total_fn) else 0\n",
+    "    overall_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall) if (overall_precision + overall_recall) else 0\n",
+    "    return overall_precision, overall_recall, overall_f1"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "763055df",
+   "metadata": {},
+   "source": [
+    "# Run our pipeline"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6bc8d556",
+   "metadata": {},
+   "source": [
+    "To start logging to W&B, call `wandb.init` and pass in a `config` with the settings you're currently experimenting with.\n",
+    "\n",
+    "As you experiment, call `wandb.log` to log your metrics to W&B. Finally, call `wandb.finish` to stop tracking; everything logged in between is grouped into one \"Run\" in W&B.\n",
+    "\n",
+    "You'll be given a link to W&B to see all of your logs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "530c96fb",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "NUM_EXAMPLES = 50\n",
+    "wandb.init(project='prompts_eval', config={\n",
+    "    'system_prompt': system_prompt,\n",
+    "    'template': template,\n",
+    "    'model': model,\n",
+    "    'temperature': temperature\n",
+    "})\n",
+    "extracted, elapsed = inference(labelled_examples[:NUM_EXAMPLES],\n",
+    "                               system_prompt, template, model, temperature)\n",
+    "overall_precision, overall_recall, overall_f1 = evaluate(extracted,\n",
+    "                                                         labelled_examples[:NUM_EXAMPLES])\n",
+    "total_tokens_sum = sum([pred['total_tokens'] for pred in extracted])\n",
+    "completion_tokens_sum = sum([pred['completion_tokens'] for pred in extracted])\n",
+    "prompt_tokens_sum = sum([pred['prompt_tokens'] for pred in extracted])\n",
+    "wandb.log({'precision': overall_precision,\n",
+    "           'recall': overall_recall,\n",
+    "           'f1': overall_f1,\n",
+    "           'time_elapsed_total': elapsed,\n",
+    "           'prompt_tokens': prompt_tokens_sum,\n",
+    "           'completion_tokens': completion_tokens_sum,\n",
+    "           'total_tokens': total_tokens_sum\n",
+    "           })\n",
+    "wandb.finish()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ad4af593",
+   "metadata": {},
+   "source": [
+    "# Set up experiments\n",
+    "\n",
+    "Start a W&B run per experiment with `wandb.init`, storing the experiment details in the `config` argument. Log results with `wandb.log` and call `wandb.finish` to end each experiment. Below, we loop over all the options in a small grid search to find the best configuration."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "149b0700",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "system_prompts = ['Extract the entities from the following sentence.',\n",
+    "                  'You are an excellent entity extractor reading newspapers and extracting orgs, people and locations. Extract the entities from the following sentence.']\n",
+    "for system_prompt in system_prompts:\n",
+    "    for temperature in [0.2, 0.6, 0.9]:\n",
+    "        for model in ['gpt-3.5-turbo', 'gpt-3.5-turbo-1106']:\n",
+    "            wandb.init(project='prompts_eval', config={\n",
+    "                'system_prompt': system_prompt,\n",
+    "                'template': template,\n",
+    "                'model': model,\n",
+    "                'temperature': temperature\n",
+    "            })\n",
+    "            extracted, elapsed = inference(labelled_examples[:NUM_EXAMPLES],\n",
+    "                                           system_prompt, template, model, temperature)\n",
+    "            overall_precision, overall_recall, overall_f1 = evaluate(extracted,\n",
+    "                                                                     labelled_examples[:NUM_EXAMPLES])\n",
+    "            total_tokens_sum = sum([pred['total_tokens'] for pred in extracted])\n",
+    "            completion_tokens_sum = sum([pred['completion_tokens'] for pred in extracted])\n",
+    "            prompt_tokens_sum = sum([pred['prompt_tokens'] for pred in extracted])\n",
+    "            wandb.log({'precision': overall_precision,\n",
+    "                       'recall': overall_recall,\n",
+    "                       'f1': overall_f1,\n",
+    "                       'time_elapsed_total': elapsed,\n",
+    "                       'prompt_tokens': prompt_tokens_sum,\n",
+    "                       'completion_tokens': completion_tokens_sum,\n",
+    "                       'total_tokens': total_tokens_sum\n",
+    "                       })\n",
+    "            wandb.finish()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5ba13a8c",
+   "metadata": {},
+   "source": [
+    "# Conclusion\n",
+    "\n",
+    "You've learned how to use W&B to track evaluations of your LLM applications.\n",
+    "You've used `wandb.init` to start tracking, `wandb.log` to log summary evaluation metrics, and `wandb.Table` to track individual predictions & scores.\n",
+    "We've also shared some best practices for structuring your code so it's easier to run evaluations in parallel and track every iteration."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f5cceefe",
+   "metadata": {},
+   "source": [
+    "# Trace your LLM application\n",
+    "\n",
+    "If you want to go further and you're building more complex pipelines of LLM calls, you can leverage W&B Prompts to view traces of your application and see the inputs & outputs of each LLM or function call.\n",
+    "\n",
+    "Learn more about W&B Prompts in the documentation here: [https://docs.wandb.ai/guides/prompts](https://docs.wandb.ai/guides/prompts)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}