From 83624c2c23544aec1d37c03038cb98c6ea560741 Mon Sep 17 00:00:00 2001 From: niklas Date: Mon, 30 Oct 2023 12:45:34 +0100 Subject: [PATCH] Classification & Evaluation notebooks --- README.md | 16 +- src/examples/classification.ipynb | 454 ++++++++++++++++++++ src/examples/embedding_based_classify.ipynb | 120 ------ src/examples/evaluation.ipynb | 194 +++++++++ src/examples/single_label_classify.ipynb | 378 ---------------- 5 files changed, 656 insertions(+), 506 deletions(-) create mode 100644 src/examples/classification.ipynb delete mode 100644 src/examples/embedding_based_classify.ipynb create mode 100644 src/examples/evaluation.ipynb delete mode 100644 src/examples/single_label_classify.ipynb diff --git a/README.md b/README.md index 8a77950df..9b0e81d6c 100644 --- a/README.md +++ b/README.md @@ -18,14 +18,14 @@ The key features of the Intelligence Layer are: Not sure where to start? Familiarize yourself with the Intelligence Layer using the below notebooks. -| Order | Task | Description | Notebook 📓 | -| ----- | ------------------------------ | --------------------------------------- | ------------------------------------------------------------------------------- | -| 1 | Summarization | Summarize a document | [summarize.ipynb](./src/examples/summarize.ipynb) | -| 2 | Question Answering | Various approaches for QA | [qa.ipynb](./src/examples/qa.ipynb) | -| 3 | Quickstart task | Build a custom task for your use case | [quickstart_task.ipynb](./src/examples/quickstart_task.ipynb) | -| 4 | Single label Classification | Conduct zero-shot text classification | [single_label_classify.ipynb](./src/examples/single_label_classify.ipynb) | -| 5 | Embedding based Classification | Classify texts on the basis of examples | [embedding_based_classify.ipynb](./src/examples/embedding_based_classify.ipynb) | -| 6 | Document Index | Connect your proprietary knowledge base | [document_index.ipynb](./src/examples/document_index.ipynb) | +| Order | Task | Description | Notebook 📓 | 
+| ----- | ------------------ | ----------------------------------------- | ------------------------------------------------------------- | +| 1 | Summarization | Summarize a document | [summarize.ipynb](./src/examples/summarize.ipynb) | +| 2 | Question Answering | Various approaches for QA | [qa.ipynb](./src/examples/qa.ipynb) | +| 3 | Quickstart task | Build a custom task for your use case | [quickstart_task.ipynb](./src/examples/quickstart_task.ipynb) | +| 4 | Classification | Learn about two methods of classification | [classification.ipynb](./src/examples/classification.ipynb) | +| 5 | Evaluation | Evaluate LLM-based methodologies | [evaluation.ipynb](./src/examples/evaluation.ipynb) | +| 6 | Document Index | Connect your proprietary knowledge base | [document_index.ipynb](./src/examples/document_index.ipynb) | ## Getting started with the Jupyter Notebooks diff --git a/src/examples/classification.ipynb b/src/examples/classification.ipynb new file mode 100644 index 000000000..91b016597 --- /dev/null +++ b/src/examples/classification.ipynb @@ -0,0 +1,454 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Classification\n", + "\n", + "Language models offer unprecedented capabilities in understanding and generating human-like text.\n", + "One of the pressing issues in their application is the classification of vast amounts of data.\n", + "Traditional methods often require manual labeling and can be time-consuming and prone to errors.\n", + "LLMs, on the other hand, can swiftly process and categorize enormous datasets with minimal human intervention.\n", + "By leveraging LLMs for classification tasks, organizations can unlock insights from their data more efficiently, streamline their workflows, and harness the full potential of their information assets.\n", + "\n", + "In this notebook, we present two alternative ways of classifying text using Aleph Alpha's Luminous models.\n", + "First, let's have a look at single-label 
classification using prompting.\n", + "\n", + "### Prompt-based single-label classification\n", + "\n", + "Single-label classification refers to the task of categorizing data points into one of n distinct categories or classes.\n", + "In this type of classification, each input is assigned to only one class, ensuring that no overlap exists between categories.\n", + "Common applications of single-label classification include email spam detection, where emails are classified as either \"spam\" or \"not spam\", or sentiment classification, where a text can be \"positive\", \"negative\" or \"neutral\".\n", + "When trying to solve this issue in a prompt-based manner, our primary goal is to construct a prompt that instructs the model to accurately predict the correct class for any given input.\n", + "\n", + "### When should you use prompt-based classification?\n", + "\n", + "We recommend using this type of classification when...\n", + "- ...the labels are easily understood (they don't require explanation or examples).\n", + "- ...the labels cannot be recognized purely by their semantic meaning.\n", + "- ...many examples for each label aren't readily available.\n", + "\n", + "### Example snippet\n", + "\n", + "Running the following code will instantiate a `SingleLabelClassify` that leverages a prompt for classification.\n", + "We can now enter any `ClassifyInput` so that the task returns each label along with its probability.\n", + "In addition, note the `debug_log`, which will give a comprehensive overview of the result.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from os import getenv\n", + "\n", + "from aleph_alpha_client import Client\n", + "\n", + "from intelligence_layer.use_cases.classify.single_label_classify import ClassifyInput, SingleLabelClassify\n", + "from intelligence_layer.core.task import Chunk\n", + "from intelligence_layer.core.logger import InMemoryDebugLogger\n", + "\n", + 
"text_to_classify = Chunk(\"In the distant future, a space exploration party embarked on a thrilling journey to the uncharted regions of the galaxy. \\n\\\n", + "With excitement in their hearts and the cosmos as their canvas, they ventured into the unknown, discovering breathtaking celestial wonders. \\n\\\n", + "As they gazed upon distant stars and nebulas, they forged unforgettable memories that would forever bind them as pioneers of the cosmos.\")\n", + "labels = [\"happy\", \"angry\", \"sad\"]\n", + "client = Client(getenv(\"AA_TOKEN\"))\n", + "task = SingleLabelClassify(client)\n", + "input = ClassifyInput(\n", + " chunk=text_to_classify,\n", + " labels=labels\n", + ")\n", + "\n", + "debug_log = InMemoryDebugLogger(name=\"classify\")\n", + "output = task.run(input, debug_log)\n", + "for label, score in output.scores.items():\n", + " print(f\"{label}: {round(score, 4)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### How does this implementation work?\n", + "\n", + "We prompt the model multiple times, each time supplying the text, or chunk, and one label at a time.\n", + "Note that we also supply each label, rather than letting the model generate it.\n", + "\n", + "To further explain this, let's start with a more familiar case.\n", + "Intuitively, one would probably prompt a model like so:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from aleph_alpha_client import PromptTemplate\n", + "\n", + "prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)\n", + "print(prompt_template.to_prompt(text=text_to_classify, label=\"\").items[0].text)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The model would then complete our instruction, thus generating a matching label.\n", + "\n", + "In the case of single-label classification, however, we already know all possible classes beforehand.\n", + "Because of this, all we are 
interested in is the probability that the model would have generated our specific classes.\n", + "To get this probability, we can prompt the model with each of our classes and ask it to return the \"logprobs\" for the text.\n", + "\n", + "In the case of prompt-based classification, the base prompt looks something like this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)\n", + "print(prompt_template.to_prompt(text=text_to_classify, label=\" \" +labels[0]).items[0].text)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As you can see, we have the same prompt, but with a potential label candidate already filled in.\n", + "Now, we will ask the model to evaluate the likelihood of this label, i.e. completion.\n", + "\n", + "Our request will not generate any tokens, but instead return the log probability of this completion given the previous tokens.\n", + "This is called an `EchoTask`.\n", + "Let's have a look at just one of these tasks triggered by our classification run.\n", + "\n", + "In particular, note the `expected_completion` in the `Input` and the `prob` for the token \" angry\" in the `Output`.\n", + "Feel free to ignore the big `Complete` task dump in the middle." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "debug_log.logs[-1].logs[0].logs[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have the logprobs, we just need to do some calculations to turn them into a final score.\n", + "\n", + "To turn the logprobs into our end scores, we first normalize our probabilities.\n", + "For this, we utilize a probability tree." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from intelligence_layer.use_cases.classify.single_label_classify import TreeNode\n", + "from intelligence_layer.core.logger import LogEntry\n", + "\n", + "task_log = debug_log.logs[-1]\n", + "normalized_probs_logs = [log_entry.value for log_entry in task_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == \"Normalized Probs\"]\n", + "log = normalized_probs_logs[-1]\n", + "\n", + "root = TreeNode()\n", + "for probs in log.values():\n", + " root.insert_without_calculation(probs)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, we take the product of all the paths to get the following results:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for label, score in output.scores.items():\n", + " print(f\"{label}: {round(score, 5)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The example mentioned before is rather straightforward, but there are some situations when it isn't as obvious as a single token.\n", + "\n", + "What if we take some labels that have overlapping tokens?\n", + "This makes the calculation a bit more complicated:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from intelligence_layer.use_cases.classify.single_label_classify import SingleLabelClassify, ClassifyInput\n", + "from intelligence_layer.core.logger import LogEntry\n", + "\n", + "\n", + "labels = [\"Space party\", \"Space exploration\", \"Space exploration party\"]\n", + "task = SingleLabelClassify(client)\n", + "input = ClassifyInput(\n", + " chunk=text_to_classify,\n", + " labels=labels\n", + ")\n", + "logger = InMemoryDebugLogger(name=\"classify\")\n", + "output = task.run(input, logger)\n", + "task_log = logger.logs[-1]\n", + "normalized_probs_logs = 
[log_entry.value for log_entry in task_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == \"Normalized Probs\"]\n", + "log = normalized_probs_logs.pop()\n", + "\n", + "root = TreeNode()\n", + "for probs in log.values():\n", + " root.insert_without_calculation(probs)\n", + "\n", + "print(\"End scores:\")\n", + "for label, score in output.scores.items():\n", + " print(f\"{label}: {round(score, 4)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here, the three classes have some overlapping tokens, namely \"Space\" and \"exploration\".\n", + "\"party\" is not overlapping, because it occurs in two different places (after \"Space\" and after \"exploration\").\n", + "\n", + "Cool, so we have now figured out how to do prompt-based classification.\n", + "Let's have a look at another classification use-case!" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Embedding-based multi-label classification\n", + "\n", + "Large language model embeddings offer a powerful approach to text classification.\n", + "In particular, such embeddings can be seen as a numerical representation of the meaning of a text.\n", + "Utilizing this, we can provide textual examples for each label and embed them to create a representation for each label in vector space.\n", + "\n", + "**Or, in more detail**:\n", + "In this method, each example from various classes is transformed into a vector representation using the embeddings from the language model.\n", + "These embedded vectors capture the semantic essence of the text.\n", + "Once this is done, clusters of embeddings are formed for each class, representing the centroid or the average meaning of the examples within that class.\n", + "When a new piece of text needs to be classified, it is first embedded using the same language model.\n", + "This new embedded vector is then compared to the pre-defined clusters for each class using cosine similarity.\n", + "The class whose 
cluster is closest to the new text's embedding is then assigned to the text, thereby achieving classification.\n", + "This method leverages the deep semantic understanding of large language models to classify texts with high accuracy and nuance.\n", + "\n", + "### When should you use embedding-based classification?\n", + "\n", + "We recommend using this type of classification when...\n", + "- ...proper classification requires fine-grained control over the classes' definitions.\n", + "- ...the labels can be defined mostly or purely by the semantic meaning of the examples.\n", + "- ...examples for each label are readily available.\n", + "\n", + "### Example snippet\n", + "\n", + "Let's start by instantiating a classifier for sentiment classification." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from intelligence_layer.use_cases.classify.embedding_based_classify import EmbeddingBasedClassify, LabelWithExamples\n", + "\n", + "\n", + "labels_with_examples = [\n", + " LabelWithExamples(\n", + " name=\"positive\",\n", + " examples=[\n", + " \"I really like this.\",\n", + " \"Wow, your hair looks great!\",\n", + " \"We're so in love.\",\n", + " \"That truly was the best day of my life!\",\n", + " \"What a great movie.\"\n", + " ],\n", + " ),\n", + " LabelWithExamples(\n", + " name=\"negative\",\n", + " examples=[\n", + " \"I really dislike this.\",\n", + " \"Ugh, Your hair looks horrible!\",\n", + " \"We're not in love anymore.\",\n", + " \"My day was very bad, I did not have a good time.\",\n", + " \"They make terrible food.\"\n", + " ],\n", + " ),\n", + "]\n", + "classify = EmbeddingBasedClassify(labels_with_examples, client)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "There are several things to note here, in particular:\n", + "- This time, we instantiated our classification task with a number of `LabelWithExamples`.\n", + "- The examples provided should reflect the 
spectrum of texts expected in the intended usage domain of this classifier.\n", + "- This cell took some time to run.\n", + "This is because we instantiate a retriever in the background, which also requires us to embed the provided examples.\n", + "\n", + "With that being said, let's run an unknown example!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "classify_input = ClassifyInput(\n", + " chunk=\"It was very awkward with him, I did not enjoy it.\",\n", + " labels=frozenset(l.name for l in labels_with_examples)\n", + ")\n", + "logger = InMemoryDebugLogger(name=\"Classify\")\n", + "result = classify.run(classify_input, logger)\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Nice, we correctly identified the new example.\n", + "\n", + "Again, let's appreciate the difference of this result compared to `SingleLabelClassify`'s result.\n", + "- The probabilities do not add up to 1.\n", + "In fact, we have no way of predicting what the sum of all scores will be.\n", + "In some cases, individual scores may even be negative.\n", + "All we know is that the highest score is likely to correspond to the best fitting label, provided we delivered good examples.\n", + "- We were much quicker to obtain a result.\n", + "\n", + "Because all examples are pre-embedded, this classifier is much cheaper to operate as it only requires a single embedding-task to be sent to the Aleph Alpha API.\n", + "\n", + "Let's try another example. 
This time, we expect the outcome to be positive.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "classify_input = ClassifyInput(\n", + " chunk=\"We used to be not like each other, but this changed a lot.\",\n", + " labels=frozenset(l.name for l in labels_with_examples)\n", + ")\n", + "logger = InMemoryDebugLogger(name=\"Classify\")\n", + "result = classify.run(classify_input, logger)\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Unfortunately, we wrongly classify this text as negative.\n", + "To be fair, it is a difficult example.\n", + "But no worries, let's simply include this failing example in our list of label examples and try again!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from os import getenv\n", + "\n", + "from aleph_alpha_client import Client\n", + "\n", + "from intelligence_layer.use_cases.classify.embedding_based_classify import EmbeddingBasedClassify, LabelWithExamples\n", + "\n", + "\n", + "client = Client(getenv(\"AA_TOKEN\"))\n", + "labels_with_examples = [\n", + " LabelWithExamples(\n", + " name=\"positive\",\n", + " examples=[\n", + " \"I really like this.\",\n", + " \"Wow, your hair looks great!\",\n", + " \"We're so in love.\",\n", + " \"That truly was the best day of my life!\",\n", + " \"What a great movie.\",\n", + " \"We used to be not like each other, but this changed a lot.\" # failing example\n", + " ],\n", + " ),\n", + " LabelWithExamples(\n", + " name=\"negative\",\n", + " examples=[\n", + " \"I really dislike this.\",\n", + " \"Ugh, Your hair looks horrible!\",\n", + " \"We're not in love anymore.\",\n", + " \"My day was very bad, I did not have a good time.\",\n", + " \"They make terrible food.\"\n", + " ],\n", + " ),\n", + "]\n", + "classify = EmbeddingBasedClassify(labels_with_examples, client)\n", + "\n", + "logger = 
InMemoryDebugLogger(name=\"Classify\")\n", + "result = classify.run(classify_input, logger)\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Nice, we now correctly classify this example!\n", + "\n", + "One advantage of the `EmbeddingBasedClassify` approach is that we can easily tweak our labels by adding new examples.\n", + "In essence, this lets us correct a misclassification as soon as we notice it, so we are unlikely to make the same mistake twice.\n", + "As we add more examples, the method tends to become more precise.\n", + "\n", + "You now have an overview of these two main methods of classification!\n", + "Feel free to tweak these methods and play around with their parameters to fine-tune them to your specific use case." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "3.10-intelligence", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.4" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/src/examples/embedding_based_classify.ipynb b/src/examples/embedding_based_classify.ipynb deleted file mode 100644 index 14e5b1121..000000000 --- a/src/examples/embedding_based_classify.ipynb +++ /dev/null @@ -1,120 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Embedding-Based Classification\n", - "\n", - "Large language model embeddings offer a powerful approach to text classification.\n", - "In this method, each example from various classes is transformed into a vector representation using the embeddings from the language model.\n", - "These embedded vectors capture the semantic essence of the text.\n", - "Once this is done, clusters of embeddings are formed for each class, representing the centroid or the average meaning of the examples within that 
class.\n", - "When a new piece of text needs to be classified, it is first embedded using the same language model.\n", - "This new embedded vector is then compared to the pre-defined clusters for each class using a cosine similarity.\n", - "The class whose cluster is closest to the new text's embedding is then assigned to the text, thereby achieving classification.\n", - "This method leverages the deep semantic understanding of large language models to classify texts with high accuracy and nuance.\n", - "\n", - "### When should you use embedding-based classification?\n", - "\n", - "We recommend using this type of classification when...\n", - "- ...proper classification requires fine-grained control over the classes' definitions.\n", - "- ...the labels can be defined mostly or purely by the semantic meaning of the examples.\n", - "- ...examples for each label are readily available.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's start by instantiating a classifier for sentiment classification." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from os import getenv\n", - "\n", - "from aleph_alpha_client import Client\n", - "\n", - "from intelligence_layer.use_cases.classify.embedding_based_classify import EmbeddingBasedClassify, LabelWithExamples\n", - "\n", - "\n", - "client = Client(getenv(\"AA_TOKEN\"))\n", - "labels_with_examples = [\n", - " LabelWithExamples(\n", - " name=\"positive\",\n", - " examples=[\n", - " \"I really like this.\",\n", - " \"Wow, your hair looks great!\",\n", - " \"We're so in love.\",\n", - " \"That truly was the best day of my life!\",\n", - " \"What a great movie.\"\n", - " ],\n", - " ),\n", - " LabelWithExamples(\n", - " name=\"negative\",\n", - " examples=[\n", - " \"I really dislike this.\",\n", - " \"Ugh, Your hair looks horrible!\",\n", - " \"We're not in love anymore.\",\n", - " \"My day was very bad, I did not have a good time.\",\n", - " \"They make terrible food.\"\n", - " ],\n", - " ),\n", - "]\n", - "classify = EmbeddingBasedClassify(labels_with_examples, client)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Alright, let's classify a new example!" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from intelligence_layer.core.logger import InMemoryDebugLogger\n", - "from intelligence_layer.use_cases.classify.classify import ClassifyInput\n", - "\n", - "\n", - "classify_input = ClassifyInput(\n", - " chunk=\"It was very awkward with him, I did not enjoy it.\",\n", - " labels=frozenset(l.name for l in labels_with_examples)\n", - ")\n", - "logger = InMemoryDebugLogger(name=\"Classify\")\n", - "result = classify.run(classify_input, logger)\n", - "result" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "3.10-intelligence", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -} diff --git a/src/examples/evaluation.ipynb b/src/examples/evaluation.ipynb new file mode 100644 index 000000000..2aa1818fe --- /dev/null +++ b/src/examples/evaluation.ipynb @@ -0,0 +1,194 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Evaluating LLM-based tasks\n", + "\n", + "Evaluating LLM-based use cases is pivotal for several reasons.\n", + "First, with the myriad of methods available, comparability becomes essential.\n", + "By systematically evaluating different approaches, we can discern which techniques are more effective or suited for specific tasks, fostering a deeper understanding of their strengths and weaknesses.\n", + "Secondly, optimization plays a significant role. 
Without proper evaluation metrics and rigorous testing, it becomes challenging to fine-tune methods and/or models to achieve their maximum potential.\n", + "Moreover, drawing comparisons with state-of-the-art (SOTA) and open-source methods is crucial.\n", + "Such comparisons not only provide benchmarks but also enable users to determine the value-added by proprietary or newer models over freely available counterparts.\n", + "\n", + "However, evaluating LLMs, especially in the domain of text generation, presents unique challenges.\n", + "Text generation is inherently subjective, and what one evaluator deems coherent and relevant, another might find disjointed or off-topic. This subjectivity complicates the establishment of universal evaluation standards, making it imperative to approach LLM evaluation with a multifaceted and comprehensive strategy.\n", + "\n", + "### Evaluating classification use-cases\n", + "\n", + "To (at least for now) evade the elusive issue described in the last paragraph, let's have a look at an easier to evaluate methodology: classification.\n", + "Make sure that you have familiarized yourself with the `SingleLabelClassify` and `EmbeddingBasedClassify` prior to starting this notebook.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we need to instantiate an evaluator that takes our classify methodology (`task`) and some datapoints and returns some evaluation metrics.\n", + "\n", + "First, let's evaluate a single example and see what happens." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "from intelligence_layer.use_cases.classify.classify import ClassifyEvaluator\n", + "\n", + "evaluator = ClassifyEvaluator(task)\n", + "classify_input = ClassifyInput(\n", + " chunk=Chunk(\"This is good\"),\n", + " labels=frozenset({\"positive\", \"negative\"}),\n", + " )\n", + "evaluation_logger = InMemoryDebugLogger(name=\"evaluation logger\")\n", + "expected_output = \"positive\"\n", + "evaluation = evaluator.evaluate(\n", + " input=classify_input, logger=evaluation_logger, expected_output=[expected_output]\n", + ")\n", + "\n", + "print(\"The task result:\", evaluation.output.scores)\n", + "print(\"The expected output:\", expected_output)\n", + "print(\"The eval result:\", evaluation.correct)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Cool!\n", + "Let's now try to find a dataset to use.\n", + "We found this [dataset](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) on Hugging Face. Let's see if we can get an evaluation going!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset = load_dataset(\"cardiffnlp/tweet_topic_multi\")\n", + "test_set_name = \"validation_random\"\n", + "data = list(dataset[test_set_name])[:10] # this has 573 datapoints, let's take a look at 10 for now\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We need to transform our dataset into the required format. \n", + "Therefore, let's check out what it looks like." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "data[1]\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Accordingly, this must be translated into the interface of our `Evaluator`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "from intelligence_layer.core.evaluator import Example, Dataset\n", + "\n", + "\n", + "all_labels = list(set(c for d in data for c in d[\"label_name\"]))\n", + "dataset = Dataset(\n", + " name=\"tweet topics\",\n", + " examples=[\n", + " Example(\n", + " input=ClassifyInput(\n", + " chunk=Chunk(d[\"text\"]),\n", + " labels=all_labels\n", + " ),\n", + " expected_output=d[\"label_name\"]\n", + " ) for d in data\n", + " ]\n", + ")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Ok, let's run this!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "evaluation_logger = InMemoryDebugLogger(name=\"evaluation logger\")\n", + "result = evaluator.evaluate_dataset(dataset=dataset, logger=evaluation_logger)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Checking out the results..." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "print(\"Percentage correct:\", result.percentage_correct)\n", + "print(\"First example:\", result.evaluations[0])\n" + ] + } + ], + "metadata": { + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/src/examples/single_label_classify.ipynb b/src/examples/single_label_classify.ipynb deleted file mode 100644 index 4d49d67e5..000000000 --- a/src/examples/single_label_classify.ipynb +++ /dev/null @@ -1,378 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Single Label Classification\n", - "\n", - "Single-label classification, also known as single-class or binary classification, refers to the task of categorizing data points into one of n distinct categories or classes.\n", - "In this type of classification, each input is assigned to only one class, ensuring that no overlap exists between categories.\n", - "Common applications of single-label classification include email spam detection, where emails are classified as either \"spam\" or \"not spam\", or sentiment classification, where a text can be \"positive\", \"negative\" or \"neutral\".\n", - "The primary goal is to train a model that can accurately predict the correct class for any given input based on its features.\n", - "\n", - "### Prompt-based classification\n", - "\n", - "Here, we'll use a purely prompt-based approach for classification.\n", - "\n", - "### When should you use prompt-based classification?\n", - "\n", - "We recommend using this type of classification when...\n", - "- ...the labels are easily understood (they don't require explanation or examples).\n", - "- ...the labels cannot be recognized purely by their semantic meaning.\n", - "- ...many examples for each label aren't readily available.\n", - "\n", - "### Example snippet\n", - "\n", - 
"Running the following code will instantiate a prompt-based classifier with a debug level for the log.\n", - "Then it will classify the text given in `ClassifyInput`.\n", - "The contents of the `debug_log` will be shown below.\n", - "It gives an overview of the steps taken to get the result.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from os import getenv\n", - "\n", - "from aleph_alpha_client import Client\n", - "\n", - "from intelligence_layer.use_cases.classify.single_label_classify import ClassifyInput, SingleLabelClassify\n", - "from intelligence_layer.core.task import Chunk\n", - "from intelligence_layer.core.logger import InMemoryDebugLogger\n", - "\n", - "text_to_classify = Chunk(\"In the distant future, a space exploration party embarked on a thrilling journey to the uncharted regions of the galaxy. \\n\\\n", - "With excitement in their hearts and the cosmos as their canvas, they ventured into the unknown, discovering breathtaking celestial wonders. 
\\n\\\n", - "As they gazed upon distant stars and nebulas, they forged unforgettable memories that would forever bind them as pioneers of the cosmos.\")\n", - "labels = [\"happy\", \"angry\", \"sad\"]\n", - "client = Client(getenv(\"AA_TOKEN\"))\n", - "task = SingleLabelClassify(client)\n", - "input = ClassifyInput(\n", - " chunk=text_to_classify,\n", - " labels=labels\n", - ")\n", - "\n", - "debug_log = InMemoryDebugLogger(name=\"classify\")\n", - "output = task.run(input, debug_log)\n", - "for label, score in output.scores.items():\n", - " print(f\"{label}: {round(score, 4)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### How does this implementation work?\n", - "\n", - "For prompt-based classification, we prompt the model multiple times with the text we want to classify and each of our classes.\n", - "Instead of letting the model generate the class it thinks fits the text best, we ask it for the probability for each class.\n", - "\n", - "To further explain this, let's start with a more familiar case.\n", - "Intuitively, one would probably prompt a model like so:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from aleph_alpha_client import PromptTemplate\n", - "\n", - "prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)\n", - "print(prompt_template.to_prompt(text=text_to_classify, label=\"\").items[0].text)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The model would then answer our question and generate a class or label that it thinks fits the text best.\n", - "\n", - "In the case of classification, however, we already know all possible classes beforehand.\n", - "Because of this, all we are interested in is the probability that the model would have generated our specific classes.\n", - "To get this probability, we can prompt the model with each of our classes and ask it to return the \"logprobs\" for the 
text.\n", - "\n", - "In the case of prompt-based classification, the base prompt looks something like this:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "prompt_template = PromptTemplate(SingleLabelClassify.PROMPT_TEMPLATE)\n", - "print(prompt_template.to_prompt(text=text_to_classify, label=\" \" +labels[0]).items[0].text)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As you can see, we have the same prompt, but with a potential label candidate already filled in.\n", - "\n", - "Now, we will ask the model to evaluate the likelihood of this completion.\n", - "\n", - "Our request will now not generate any tokens, but instead return the log probability of this completion given the previous tokens." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now that we have the logprobs, we just need to do some calculations to turn them into a final score.\n", - "\n", - "To turn the logprobs into our end scores, we first normalize our probabilities.\n", - "For this, we utilize a probability tree." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from intelligence_layer.use_cases.classify.single_label_classify import TreeNode\n", - "from intelligence_layer.core.logger import LogEntry\n", - "\n", - "task_log = debug_log.logs[-1]\n", - "normalized_probs_logs = [log_entry.value for log_entry in task_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == \"Normalized Probs\"]\n", - "log = normalized_probs_logs[-1]\n", - "\n", - "root = TreeNode()\n", - "for probs in log.values():\n", - " root.insert_without_calculation(probs)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally, we take the product of all the paths to get the following results:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "for label, score in output.scores.items():\n", - " print(f\"{label}: {round(score, 5)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The example mentioned before is rather straightforward, but there are some situations when it isn't as obvious as a single token.\n", - "\n", - "What if we take some classes that have some overlap?\n", - "In the following example, some of the classes overlap in the tokens they have.\n", - "This makes the calculation a bit more complicated:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from intelligence_layer.use_cases.classify.single_label_classify import SingleLabelClassify, ClassifyInput\n", - "from intelligence_layer.core.logger import LogEntry\n", - "\n", - "\n", - "labels = [\"Space party\", \"Space exploration\", \"Space exploration party\"]\n", - "task = SingleLabelClassify(client)\n", - "input = ClassifyInput(\n", - " chunk=text_to_classify,\n", - " labels=labels\n", - ")\n", - "logger = InMemoryDebugLogger(name=\"classify\")\n", - "output = task.run(input, 
logger)\n", - "task_log = logger.logs[-1]\n", - "normalized_probs_logs = [log_entry.value for log_entry in task_log.logs if isinstance(log_entry, LogEntry) and log_entry.message == \"Normalized Probs\"]\n", - "log = normalized_probs_logs.pop()\n", - "\n", - "root = TreeNode()\n", - "for probs in log.values():\n", - " root.insert_without_calculation(probs)\n", - "\n", - "print(\"End scores:\")\n", - "for label, score in output.scores.items():\n", - " print(f\"{label}: {round(score, 4)}\")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here, the three classes have some overlapping tokens, namely \"Space\", and \"exploration\".\n", - "\"party\" is not overlapping, because it occurs in two different places (after \"Space\" and after \"exploration\")." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Cool!\n", - "Now, let's evaluate how well our new methodology is working.\n", - "For this, we will first look for classification datasets to use.\n", - "We found this [dataset](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) on huggingface, let's see if we can get an evaluation going!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from datasets import load_dataset\n", - "\n", - "dataset = load_dataset(f\"cardiffnlp/tweet_topic_multi\")\n", - "test_set_name = \"validation_random\"\n", - "data = list(dataset[test_set_name])[:10] # this has 573 datapoints, let's take a look at 20 for now\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next, we need to instantiate an evaluator that takes our classify methodology (`task`) and some datapoints and returns some evaluation metrics.\n", - "\n", - "First, let's evaluate a single example and see what happens." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from intelligence_layer.use_cases.classify.classify import ClassifyEvaluator\n", - "\n", - "evaluator = ClassifyEvaluator(task)\n", - "classify_input = ClassifyInput(\n", - " chunk=Chunk(\"This is good\"),\n", - " labels=frozenset({\"positive\", \"negative\"}),\n", - " )\n", - "evaluation_logger = InMemoryDebugLogger(name=\"evaluation logger\")\n", - "expected_output = \"positive\"\n", - "evaluation = evaluator.evaluate(\n", - " input=classify_input, logger=evaluation_logger, expected_output=[expected_output]\n", - ")\n", - "\n", - "print(\"The task result:\", evaluation.output.scores)\n", - "print(\"The expected output:\", expected_output)\n", - "print(\"The eval result:\", evaluation.correct)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We need to transform our dataset into the required format. \n", - "Therefore, let's check out what it looks like." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data[1]\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Accordingly, this must be translated into the interface of our `Evaluator`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from intelligence_layer.core.evaluator import Example, Dataset\n", - "\n", - "\n", - "all_labels = list(set(c for d in data for c in d[\"label_name\"]))\n", - "dataset = Dataset(\n", - " name=\"tweet topics\",\n", - " examples=[\n", - " Example(\n", - " input=ClassifyInput(\n", - " chunk=d[Chunk(\"text\")],\n", - " labels=all_labels\n", - " ),\n", - " expected_output=d[\"label_name\"]\n", - " ) for d in data\n", - " ]\n", - ")\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Ok, let's run this!" 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "evaluation_logger = InMemoryDebugLogger(name=\"evaluation logger\")\n", - "result = evaluator.evaluate_dataset(dataset=dataset, logger=evaluation_logger)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Checking out the results..." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "print(\"Percentage correct:\", result.percentage_correct)\n", - "print(\"First example:\", result.evaluations[0])\n" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "3.10-intelligence", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.4" - } - }, - "nbformat": 4, - "nbformat_minor": 2 -}