From 1eb16e6c05d6061f6e90f94427cd5b2fd5b9ad91 Mon Sep 17 00:00:00 2001 From: Florian Schepers Date: Thu, 21 Mar 2024 17:06:13 +0100 Subject: [PATCH] doc: Revise and update README.md, Concepts.md and tutorial jupyter notebooks TASK: IL-305 --- Concepts.md | 69 +++++---- README.md | 18 +-- src/examples/classification.ipynb | 72 +++++----- src/examples/document_index.ipynb | 52 +++---- src/examples/evaluation.ipynb | 132 +++++++++-------- src/examples/fastapi_tutorial.md | 51 ++++--- src/examples/human_evaluation.ipynb | 113 ++++++++------- src/examples/performance_tips.ipynb | 216 ++++++++++++++++++++++++---- src/examples/qa.ipynb | 121 ++++++++++------ src/examples/quickstart_task.ipynb | 120 +++++++++------- src/examples/summarization.ipynb | 58 ++++---- 11 files changed, 608 insertions(+), 414 deletions(-) diff --git a/Concepts.md b/Concepts.md index 9464d05c5..2e072d908 100644 --- a/Concepts.md +++ b/Concepts.md @@ -2,12 +2,12 @@ The main focus of the Intelligence Layer is to enable developers to -- implement their LLM use cases by building upon existing and composing existing functionality and providing insights into - the runtime behavior of these +- implement their LLM use cases by building upon and composing existing functionalities +- obtain insights into the runtime behavior of their implementations - iteratively improve their implementations or compare them to existing implementations by evaluating them against - a given set of example + a given set of examples -Both focus points are described in more detail in the following sections. +How these focus points are realized in the Intelligence Layer are described in more detail in the following sections. ## Task @@ -18,8 +18,8 @@ transforms an input-parameter to an output like a function in mathematics. Task: Input -> Output ``` -In Python this is expressed through an abstract class with type-parameters and the abstract method `do_run` -where the actual transformation is implemented: +In Python this is realized by an abstract class with type-parameters and the abstract method `do_run` +in which the actual transformation is implemented: ```Python class Task(ABC, Generic[Input, Output]): @@ -30,13 +30,13 @@ class Task(ABC, Generic[Input, Output]): ``` `Input` and `Output` are normal Python datatypes that can be serialized from and to JSON. For this the Intelligence -Layer relies on [Pydantic](https://docs.pydantic.dev/). The types that can actually be used are defined in form -of the type-alias [`PydanticSerializable`](src/intelligence_layer/core/tracer/tracer.py#L44). +Layer relies on [Pydantic](https://docs.pydantic.dev/). The used types are defined in form +of type-aliases PydanticSerializable. The second parameter `task_span` is used for [tracing](#Trace) which is described below. -`do_run` is the method that needs to be implemented for a concrete task. The external interface of a -task is its `run` method: +`do_run` is the method that implements a concrete task and has to be provided by the user. It will be executed by the external interface method `run` of a +task: ```Python class Task(ABC, Generic[Input, Output]): @@ -45,7 +45,7 @@ class Task(ABC, Generic[Input, Output]): ... ``` -Its signature differs only in the parameters regarding [tracing](#Trace). +The signatures of the `do_run` and `run` methods differ only in the [tracing](#Trace) parameters. ### Levels of abstraction @@ -56,17 +56,17 @@ with an LLM on a very generic or even technical level. 
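Regardless of where a task sits on this spectrum, a concrete implementation always follows the same pattern: define Pydantic models for `Input` and `Output` and implement `do_run`. The following minimal sketch illustrates this pattern; the `Greeting*` names are made up for illustration and are not part of the Intelligence Layer:

```Python
from pydantic import BaseModel

from intelligence_layer.core import NoOpTracer, Task, TaskSpan


class GreetingInput(BaseModel):
    name: str


class GreetingOutput(BaseModel):
    greeting: str


class GreetingTask(Task[GreetingInput, GreetingOutput]):
    def do_run(self, input: GreetingInput, task_span: TaskSpan) -> GreetingOutput:
        # A real task would call a model or compose sub-tasks here.
        return GreetingOutput(greeting=f"Hello, {input.name}!")


# The external interface is `run`, which also takes care of tracing.
output = GreetingTask().run(GreetingInput(name="World"), NoOpTracer())
```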
Examples for higher level tasks (Use Cases) are: -- Answering a question based on a gievn document: `QA: (Document, Question) -> Answer` +- Answering a question based on a given document: `QA: (Document, Question) -> Answer` - Generate a summary of a given document: `Summary: Document -> Summary` Examples for lower level tasks are: -- Let the model generate text based on an instruacton and some context: `Instruct: (Context, Instruction) -> Completion` +- Let the model generate text based on an instruction and some context: `Instruct: (Context, Instruction) -> Completion` - Chunk a text in smaller pieces at optimized boundaries (typically to make it fit into an LLM's context-size): `Chunk: Text -> [Chunk]` ### Composability -Tasks compose. Typically you would build higher level tasks from lower level tasks. Given a task you can draw a dependency graph +Typically you would build higher level tasks from lower level tasks. Given a task you can draw a dependency graph that illustrates which sub-tasks it is using and in turn which sub-tasks they are using. This graph typically forms a hierarchy or more general a directed acyclic graph. The following drawing shows this graph for the Intelligence Layer's `RecursiveSummarize` task: @@ -76,8 +76,8 @@ task: ### Trace -A task implements a workflow. It processes its input, passes it on to sub-tasks, processes the outputs of sub-tasks -to build its own output. This workflow can be represented in a trace. For this a task's `run` method takes a `Tracer` +A task implements a workflow. It processes its input, passes it on to sub-tasks, processes the outputs of the sub-tasks +and builds its own output. This workflow can be represented in a trace. For this a task's `run` method takes a `Tracer` that takes care of storing details on the steps of this workflow like the tasks that have been invoked along with their input and output and timing information. The following illustration shows the trace of an MultiChunkQa-task: @@ -86,9 +86,9 @@ input and output and timing information. The following illustration shows the tr To represent this tracing defines the following concepts: - A `Tracer` is passed to a task's `run` method and provides methods for opening `Span`s or `TaskSpan`s. -- A `Span` is a `Tracer` and allows for grouping multiple logs and duration together as a single, logical step in the +- A `Span` is a `Tracer` and allows to group multiple logs and runtime durations together as a single, logical step in the workflow. -- A `TaskSpan` is a `Span` and allows for grouping multiple logs together, as well as the task's specific input, output. +- A `TaskSpan` is a `Span` that allows to group multiple logs together with the task's specific input and output. An opened `TaskSpan` is passed to `Task.do_run`. Since a `TaskSpan` is a `Tracer` a `do_run` implementation can pass this instance on to `run` methods of sub-tasks. @@ -127,8 +127,8 @@ The evaluation process helps to: ### Dataset -The basis of an evaluation is a set of examples for the specific task-type to be evaluated. A single example -consists out of : +The basis of an evaluation is a set of examples for the specific task-type to be evaluated. A single `Example` +consists of: - an instance of the `Input` for the specific task and - optionally an _expected output_ that can be anything that makes sense in context of the specific evaluation (e.g. @@ -139,6 +139,7 @@ consists out of : To enable reproducibility of evaluations datasets are immutable. 
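In code, creating such a dataset could look like the following sketch. It uses the in-memory dataset repository and a classification input as in the evaluation tutorial; the dataset name is only an illustrative placeholder:

```Python
from intelligence_layer.core import TextChunk
from intelligence_layer.evaluation import Example, InMemoryDatasetRepository
from intelligence_layer.use_cases import ClassifyInput

dataset_repository = InMemoryDatasetRepository()

example = Example(
    input=ClassifyInput(
        chunk=TextChunk("This is good"),
        labels=frozenset({"positive", "negative"}),
    ),
    expected_output="positive",
)

# Store the example as an immutable dataset in the repository.
dataset = dataset_repository.create_dataset(
    examples=[example], dataset_name="classify-demo"
)
```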
A single dataset can be used to evaluate all tasks of the same type, i.e. with the same `Input` and `Output` types. + ### Evaluation Process The Intelligence Layer supports different kinds of evaluation techniques. Most important are: @@ -153,13 +154,12 @@ The Intelligence Layer supports different kinds of evaluation techniques. Most i a single output, but it is easier to compare two different outputs and decide which one is better. An example use case could be summarization. -To support these techniques the Intelligence Layer differantiates between 3 consecutive steps: +To support these techniques the Intelligence Layer differentiates between 3 consecutive steps: 1. Run a task by feeding it all inputs of a dataset and collecting all outputs -2. Evaluate the outputs of one or several - runs and produce an evaluation result for each example. Typically a single run is evaluated if absolute +2. Evaluate the outputs of one or several runs and produce an evaluation result for each example. Typically a single run is evaluated if absolute metrics can be computed and several runs are evaluated when the outputs of runs shall be compared. -3. Aggregate the evaluation results of one or several evaluation runs into a single object containing the aggregated +1. Aggregate the evaluation results of one or several evaluation runs into a single object containing the aggregated metrics. Aggregating over several evaluation runs supports amending a previous comparison result with comparisons of new runs without the need to re-execute the previous comparisons again. @@ -171,7 +171,7 @@ The following table shows how these three steps are represented in code: | 2. Evaluate | `Evaluator` | `EvaluationLogic` | `EvaluationRepository` | | 3. Aggregate | `Aggregator` | `AggregationLogic` | `AggregationRepository` | -The column +Columns explained - Executor lists concrete implementations provided by the Intelligence Layer. - Custom Logic lists abstract classes that need to be implemented with the custom logic. - Repository lists abstract classes for storing intermediate results. The Intelligence Layer provides @@ -180,24 +180,23 @@ The column ### Data Storage During an evaluation process a lot of intermediate data is created before the final aggregated result can be produced. -To avoid that expensive computations have to be repeated if new results should be produced based on previous ones +To avoid that expensive computations have to be repeated if new results are to be produced based on previous ones all intermediate results are persisted. For this the different executor-classes make use of repositories. There are the following Repositories: -- The `DatasetRepository` offers methods to manage datasets. The `Runner` uses it to read all examples of a dataset to feed - then to the `Task`. -- The `RunRepository` is responsible for storing a task's output (in form of a `ExampleOutput`) for each example of a dataset +- The `DatasetRepository` offers methods to manage datasets. The `Runner` uses it to read all examples of a dataset and feed + them to the `Task`. +- The `RunRepository` is responsible for storing a task's output (in form of an `ExampleOutput`) for each example of a dataset which are created when a `Runner` runs a task using this dataset. At the end of a run a `RunOverview` is stored containing some metadata concerning the run. The `Evaluator` reads these outputs given a list of runs it should evaluate to create an evaluation result for each example of the dataset. 
- The `EvaluationRepository` enables the `Evaluator` to store the individual evaluation result (in form of an `ExampleEvaluation`) - for each example and an `EvaluationOverview` - and makes them available to the `Aggregator`. + for each example and an `EvaluationOverview` and makes them available to the `Aggregator`. - The `AggregationRepository` stores the `AggregationOverview` containing the aggregated metrics on request of the `Aggregator`. -The following diagramms illustrate how the different concepts play together in case of the different types of evaluations. +The following diagrams illustrate how the different concepts play together in case of different evaluation types.
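Translated into code, wiring these repositories into the executor classes could look like the following sketch. It follows the evaluation tutorial of this repository and uses the in-memory repositories together with a classification task; the registered names are placeholders and the constructor signatures are taken from that tutorial:

```Python
from intelligence_layer.evaluation import (
    Aggregator,
    Evaluator,
    InMemoryAggregationRepository,
    InMemoryDatasetRepository,
    InMemoryEvaluationRepository,
    InMemoryRunRepository,
    Runner,
)
from intelligence_layer.use_cases import (
    PromptBasedClassify,
    SingleLabelClassifyAggregationLogic,
    SingleLabelClassifyEvaluationLogic,
)

task = PromptBasedClassify()
dataset_repository = InMemoryDatasetRepository()
run_repository = InMemoryRunRepository()
evaluation_repository = InMemoryEvaluationRepository()
aggregation_repository = InMemoryAggregationRepository()

# Step 1: run the task on all examples of a dataset.
runner = Runner(task, dataset_repository, run_repository, "prompt-based-classify")

# Step 2: evaluate the stored outputs example by example.
evaluator = Evaluator(
    dataset_repository,
    run_repository,
    evaluation_repository,
    "single-label-classify",
    SingleLabelClassifyEvaluationLogic(),
)

# Step 3: aggregate the individual evaluations into a single result.
aggregator = Aggregator(
    evaluation_repository,
    aggregation_repository,
    "single-label-classify",
    SingleLabelClassifyAggregationLogic(),
)
```

Each executor persists its results (`RunOverview`s, `ExampleEvaluation`s and the `AggregationOverview`) in its repository, as the diagrams below illustrate.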
@@ -219,13 +218,13 @@ The next diagram illustrates the more complex case of a relative evaluation.
Process of a relative Evaluation
-1. Multiple `Runner`s read the same dataset and produce for different `Task`s corresponding `Output`s. +1. Multiple `Runner`s read the same dataset and produce the corresponding `Output`s for different `Task`s. 2. For each run all `Output`s are stored in the `RunRepository`. -3. The `Evaluator` gets as input previous evaluations (that were produced on basis of the same dataset, but different `Task`s) and the new runs of the previous step. +3. The `Evaluator` gets as input previous evaluations (that were produced on basis of the same dataset, but by different `Task`s) and the new runs of the current task. 4. Given the previous evaluations and the new runs the `Evaluator` can read the `ExampleOutput`s of both the new runs and the runs associated to previous evaluations, collect all that belong to a single `Example` and pass them along with the `Example` to the `EvaluationLogic` to compute an `Evaluation`. 5. Each `Evaluation` gets wrapped in an `ExampleEvaluation` and is stored in the `EvaluationRepository`. 6. The `Aggregator` reads all `ExampleEvaluation` from all involved evaluations and feeds them to the `AggregationLogic` to produce a `AggregatedEvaluation`. -7. The `AggregatedEvalution` is wrapped in an `AggregationOverview` and stoed in the `AggregationRepository`. +7. The `AggregatedEvalution` is wrapped in an `AggregationOverview` and stored in the `AggregationRepository`. diff --git a/README.md b/README.md index 757f2bab2..0c166b4db 100644 --- a/README.md +++ b/README.md @@ -46,12 +46,12 @@ The environment can be activated via `poetry shell`. See the official poetry doc ### Getting started with the Jupyter Notebooks -After running the local installation steps, there are two environment variables that have to be set before you can start running the examples. +After running the local installation steps, you can select if you want to use the Aleph-Alpha API or an on-prem setup by setting the environment variables accordingly. --- **Using the Aleph-Alpha API** \ \ -You will need an [Aleph Alpha access token](https://docs.aleph-alpha.com/docs/account/#create-a-new-token) to run the examples. +In the Intelligence Layer the Aleph-Alpha API (`https://api.aleph-alpha.com`) is set as default host URL. However, you will need an [Aleph Alpha access token](https://docs.aleph-alpha.com/docs/account/#create-a-new-token) to run the examples. Set your access token with ```bash @@ -62,13 +62,15 @@ export AA_TOKEN= **Using an on-prem setup** \ \ -The default host url in the project is set to `https://api.aleph-alpha.com`. This can be changed by setting the `CLIENT_URL` environment variable: +In case you want to use an on-prem endpoint you will have to change the host URL by setting the `CLIENT_URL` environment variable: ```bash export CLIENT_URL= ``` -The program will warn you if no `CLIENT_URL` is explicitly set. +The program will warn you in case no `CLIENT_URL` is set explicitly set. + +Note, that you will still have to provide an Aleph Alpha access token as described above. --- After correctly setting up the environment variables you can run the jupyter notebooks. @@ -188,10 +190,10 @@ Not sure where to start? Familiarize yourself with the Intelligence Layer using If you prefer you can also read about the [concepts](Concepts.md) first. ## Tutorials -The tutorials aim to guide you through implementing several common use-cases with the Intelligence Layer. They introduce you to key concepts and enable you to create your own use-cases. 
+The tutorials aim to guide you through implementing several common use-cases with the Intelligence Layer. They introduce you to key concepts and enable you to create your own use-cases. In general the tutorials are build in a way that you can simply hop into the topic you are most interested in. However, for starters we recommend to read through the `Summarization` tutorial first. It explains the core concepts of the intelligence layer in more depth while for the other tutorials we assume that these concepts are known. -| Order | Topic | Description | Notebook 📓 | -| ----- | ------------------ | ---------------------------------------------------- | --------------------------------------------------------------- | +| Order | Topic | Description | Notebook 📓 | +| ----- | ------------------ |------------------------------------------------------|-----------------------------------------------------------------| | 1 | Summarization | Summarize a document | [summarization.ipynb](./src/examples/summarization.ipynb) | | 2 | Question Answering | Various approaches for QA | [qa.ipynb](./src/examples/qa.ipynb) | | 3 | Classification | Learn about two methods of classification | [classification.ipynb](./src/examples/classification.ipynb) | @@ -200,7 +202,7 @@ The tutorials aim to guide you through implementing several common use-cases wit | 6 | Document Index | Connect your proprietary knowledge base | [document_index.ipynb](./src/examples/document_index.ipynb) | | 7 | Human Evaluation | Connect to Argilla for manual evaluation | [human_evaluation.ipynb](./src/examples/human_evaluation.ipynb) | | 8 | Performance tips | Contains some small tips for performance | [performance_tips.ipynb](./src/examples/performance_tips.ipynb) | -| 9 | Deployment | Shows how to deploy a Task in a minimal FastAPI app. | [fastapi_example.py](./src/examples/fastapi_example.py) | +| 9 | Deployment | Shows how to deploy a Task in a minimal FastAPI app. | [fastapi_tutorial.md](./src/examples/fastapi_tutorial.md) | ## How-Tos The how-tos are quick lookups about how to do things. Compared to the tutorials, they are shorter and do not explain the concepts they are using in-depth. diff --git a/src/examples/classification.ipynb b/src/examples/classification.ipynb index 71f1dfec6..b9bf2e0f3 100644 --- a/src/examples/classification.ipynb +++ b/src/examples/classification.ipynb @@ -1,5 +1,24 @@ { "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from dotenv import load_dotenv\n", + "from intelligence_layer.core import InMemoryTracer, LogEntry, TextChunk\n", + "from intelligence_layer.use_cases import (\n", + " ClassifyInput,\n", + " EmbeddingBasedClassify,\n", + " LabelWithExamples,\n", + " PromptBasedClassify,\n", + " TreeNode,\n", + ")\n", + "\n", + "load_dotenv()" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -31,8 +50,8 @@ "\n", "### Example snippet\n", "\n", - "Running the following code will instantiate a `PromptBasedClassify` that leverages a prompt for classification.\n", - "We can now enter any `ClassifyInput` so that the task returns each label along with its probability.\n", + "Running the following code will instantiate a `PromptBasedClassify`-task that leverages a prompt for classification.\n", + "We can pass any `ClassifyInput` to the task and it returns each label along with its probability.\n", "In addition, note the `tracer`, which will give a comprehensive overview of the result." 
] }, @@ -42,23 +61,6 @@ "metadata": {}, "outputs": [], "source": [ - "from dotenv import load_dotenv\n", - "\n", - "\n", - "load_dotenv()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from intelligence_layer.connectors import LimitedConcurrencyClient\n", - "from intelligence_layer.core import TextChunk, InMemoryTracer\n", - "from intelligence_layer.use_cases import ClassifyInput, PromptBasedClassify\n", - "\n", - "\n", "text_to_classify = TextChunk(\n", " \"In the distant future, a space exploration party embarked on a thrilling journey to the uncharted regions of the galaxy. \\n\\\n", "With excitement in their hearts and the cosmos as their canvas, they ventured into the unknown, discovering breathtaking celestial wonders. \\n\\\n", @@ -71,7 +73,6 @@ "tracer = InMemoryTracer()\n", "output = task.run(input, tracer)\n", "\n", - "# Let's see the results:\n", "for label, score in output.scores.items():\n", " print(f\"{label}: {round(score, 4)}\")" ] @@ -105,19 +106,20 @@ "source": [ "The model would then complete our instruction, thus generating a matching label.\n", "\n", - "In the case of single-label classification, however, we already know all possible classes beforehand.\n", - "Because of this, all we are interested in is the probability that the model would have generated our specific classes.\n", - "To get this probability, we can prompt the model with each of our classes and ask it to return the \"logprobs\" for the text.\n" + "In case of single-label classification, however, we already know all possible classes beforehand.\n", + "Thus, all we are interested in is the probability that the model would have generated our specific class for the given input.\n", + "To get this probability, we modify the model such that it does not generate any token but returns the logarithmic probabilities (logprops) of the completion instead. From this we then extract the probability with which our class would have been selected. This process is called an `EchoTask`.\n", + "\n", + "Let's have a look at just one of these tasks triggered by our classification run.\n", + "\n", + "Feel free to ignore the big `Complete` task dump in the middle.\n", + "Instead, focus on the `expected_completion` in the `Input` and the `prob` for the token \" angry\" in the `Output`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Our request will not generate any tokens, but instead return the log probability of this completion given the previous tokens.\n", - "This is called an `EchoTask`.\n", - "Let's have a look at just one of these tasks triggered by our classification run.\n", - "\n", "In particular, note the `expected_completion` in the `Input` and the `prob` for the token \" angry\" in the `Output`.\n", "Feel free to ignore the big `Complete` task dump in the middle." 
] @@ -147,9 +149,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.use_cases import TreeNode\n", - "from intelligence_layer.core import LogEntry\n", - "\n", "task_log = tracer.entries[-1]\n", "normalized_probs_logs = [\n", " log_entry.value\n", @@ -217,9 +216,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.use_cases import EmbeddingBasedClassify, LabelWithExamples\n", - "\n", - "\n", "labels_with_examples = [\n", " LabelWithExamples(\n", " name=\"positive\",\n", @@ -243,8 +239,7 @@ " ),\n", "]\n", "\n", - "client = LimitedConcurrencyClient.from_env()\n", - "classify = EmbeddingBasedClassify(labels_with_examples, client=client)" + "classify = EmbeddingBasedClassify(labels_with_examples)" ] }, { @@ -323,9 +318,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.use_cases import EmbeddingBasedClassify, LabelWithExamples\n", - "\n", - "\n", "labels_with_examples = [\n", " LabelWithExamples(\n", " name=\"positive\",\n", @@ -349,7 +341,7 @@ " ],\n", " ),\n", "]\n", - "classify = EmbeddingBasedClassify(labels_with_examples, client=client)\n", + "classify = EmbeddingBasedClassify(labels_with_examples)\n", "\n", "tracer = InMemoryTracer()\n", "result = classify.run(classify_input, tracer)\n", @@ -387,7 +379,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.0" + "version": "3.11.7" } }, "nbformat": 4, diff --git a/src/examples/document_index.ipynb b/src/examples/document_index.ipynb index 090dcffda..dd3c89bbd 100644 --- a/src/examples/document_index.ipynb +++ b/src/examples/document_index.ipynb @@ -7,7 +7,25 @@ "outputs": [], "source": [ "%load_ext autoreload\n", - "%autoreload 2" + "%autoreload 2\n", + "\n", + "from dotenv import load_dotenv\n", + "from os import getenv\n", + "from intelligence_layer.connectors import (\n", + " CollectionPath,\n", + " DocumentContents,\n", + " DocumentIndexClient,\n", + " DocumentIndexRetriever,\n", + " DocumentPath,\n", + " LimitedConcurrencyClient,\n", + ")\n", + "from intelligence_layer.core import InMemoryTracer\n", + "from intelligence_layer.use_cases import (\n", + " RetrieverBasedQa,\n", + " RetrieverBasedQaInput,\n", + ")\n", + "\n", + "load_dotenv()" ] }, { @@ -55,20 +73,10 @@ "metadata": {}, "outputs": [], "source": [ - "from os import getenv\n", - "\n", - "from dotenv import load_dotenv\n", - "\n", - "from intelligence_layer.connectors import DocumentIndexClient\n", - "\n", - "load_dotenv()\n", - "\n", - "\n", "document_index = DocumentIndexClient(\n", " token=getenv(\"AA_TOKEN\"),\n", " base_document_index_url=\"https://document-index.aleph-alpha.com\",\n", - ")\n", - "?document_index" + ")" ] }, { @@ -90,9 +98,6 @@ "outputs": [], "source": [ "# change this value if you want to use a collection of a different name\n", - "from intelligence_layer.connectors import CollectionPath\n", - "\n", - "\n", "COLLECTION = \"demo\"\n", "\n", "collection_path = CollectionPath(namespace=NAMESPACE, collection=COLLECTION)\n", @@ -192,9 +197,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.connectors import DocumentContents, DocumentPath\n", - "\n", - "\n", "for doc in documents:\n", " document_path = DocumentPath(\n", " collection_path=collection_path, document_name=doc[\"name\"]\n", @@ -237,8 +239,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.connectors import DocumentIndexRetriever\n", - "\n", "document_index_retriever = DocumentIndexRetriever(\n", " document_index=document_index,\n", " 
namespace=NAMESPACE,\n", @@ -269,16 +269,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.connectors import LimitedConcurrencyClient\n", - "\n", - "from intelligence_layer.use_cases import (\n", - " RetrieverBasedQa,\n", - " RetrieverBasedQaInput,\n", - " SingleChunkQa,\n", - ")\n", - "from intelligence_layer.core import InMemoryTracer\n", - "\n", - "\n", "client = LimitedConcurrencyClient.from_env()\n", "retriever_qa = RetrieverBasedQa(document_index_retriever)\n", "\n", @@ -327,7 +317,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.0" + "version": "3.11.7" } }, "nbformat": 4, diff --git a/src/examples/evaluation.ipynb b/src/examples/evaluation.ipynb index 8f0d9a266..a6e22d2df 100644 --- a/src/examples/evaluation.ipynb +++ b/src/examples/evaluation.ipynb @@ -1,5 +1,42 @@ { "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from collections import defaultdict\n", + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from intelligence_layer.connectors import LimitedConcurrencyClient\n", + "from intelligence_layer.core import NoOpTracer, TextChunk\n", + "from intelligence_layer.evaluation import (\n", + " Aggregator,\n", + " Evaluator,\n", + " Example,\n", + " InMemoryAggregationRepository,\n", + " InMemoryDatasetRepository,\n", + " InMemoryEvaluationRepository,\n", + " InMemoryRunRepository,\n", + " Runner,\n", + ")\n", + "from intelligence_layer.use_cases import (\n", + " ClassifyInput,\n", + " EmbeddingBasedClassify,\n", + " LabelWithExamples,\n", + " MultiLabelClassifyAggregationLogic,\n", + " MultiLabelClassifyEvaluationLogic,\n", + " PromptBasedClassify,\n", + " SingleLabelClassifyAggregationLogic,\n", + " SingleLabelClassifyEvaluation,\n", + " SingleLabelClassifyEvaluationLogic,\n", + ")\n", + "from typing import Any, Mapping, Sequence\n", + "\n", + "load_dotenv()" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -23,14 +60,14 @@ "Well, unlike other tasks such as QA, the result of a classification task is more or less binary (true/false).\n", "There are very few grey areas, as it is unlikely that a classification result is somewhat or \"half\" correct.\n", "\n", - "Make sure that you have familiarized yourself with the `PromptBasedClassify` prior to starting this notebook.\n" + "Make sure that you have familiarized yourself with the [PromptBasedClassify](classification.ipynb#prompt-based-single-label-classification) prior to starting this notebook.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "First, we need to instantiate our task, an evaluator for it and a repository that stores the evaluation results along with tracing information for the evaluated examples.\n" + "First, we need to instantiate our task, as well as, a runner, an evaluator and an aggregator for it. 
Furthermore, we need the corresponding repositories that store the results of each step along with tracing information.\n" ] }, { @@ -39,35 +76,16 @@ "metadata": {}, "outputs": [], "source": [ - "from dotenv import load_dotenv\n", - "\n", - "from intelligence_layer.connectors import LimitedConcurrencyClient\n", - "from intelligence_layer.evaluation import (\n", - " Evaluator,\n", - " InMemoryEvaluationRepository,\n", - " InMemoryRunRepository,\n", - " InMemoryDatasetRepository,\n", - " InMemoryAggregationRepository,\n", - " Runner,\n", - " Aggregator,\n", - ")\n", - "from intelligence_layer.use_cases import (\n", - " PromptBasedClassify,\n", - " SingleLabelClassifyEvaluationLogic,\n", - " SingleLabelClassifyAggregationLogic,\n", - ")\n", - "\n", - "load_dotenv()\n", - "\n", "task = PromptBasedClassify()\n", "dataset_repository = InMemoryDatasetRepository()\n", "run_repository = InMemoryRunRepository()\n", "evaluation_repository = InMemoryEvaluationRepository()\n", + "evaluation_logic = SingleLabelClassifyEvaluationLogic()\n", "aggregation_repository = InMemoryAggregationRepository()\n", "aggregation_logic = SingleLabelClassifyAggregationLogic()\n", - "evaluation_logic = SingleLabelClassifyEvaluationLogic()\n", "\n", "\n", + "runner = Runner(task, dataset_repository, run_repository, \"prompt-based-classify\")\n", "evaluator = Evaluator(\n", " dataset_repository,\n", " run_repository,\n", @@ -80,8 +98,7 @@ " aggregation_repository,\n", " \"single-label-classify\",\n", " aggregation_logic,\n", - ")\n", - "runner = Runner(task, dataset_repository, run_repository, \"prompt-based-classify\")" + ")" ] }, { @@ -97,11 +114,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.core import TextChunk, NoOpTracer\n", - "from intelligence_layer.use_cases import ClassifyInput\n", - "from intelligence_layer.evaluation import Example\n", - "\n", - "\n", "classify_input = ClassifyInput(\n", " chunk=TextChunk(\"This is good\"),\n", " labels=frozenset({\"positive\", \"negative\"}),\n", @@ -123,9 +135,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Cool!\n", + "Perfect! The example was classified correctly.\n", "\n", - "Let's have a look at this [dataset](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) for more elaborate evaluation." + "Next, we will have a look at this pre-defined [dataset of tweets](https://huggingface.co/cardiffnlp/tweet-topic-21-multi) for more elaborate evaluation." ] }, { @@ -134,8 +146,6 @@ "metadata": {}, "outputs": [], "source": [ - "from datasets import load_dataset\n", - "\n", "dataset = load_dataset(\"cardiffnlp/tweet_topic_multi\")\n", "test_set_name = \"validation_random\"\n", "all_data = list(dataset[test_set_name])\n", @@ -175,9 +185,9 @@ "\n", "```\n", "\n", - "We want the `input` in each `Example` to mimic the input of an actual task, therefore we must every time include the text (chunk) and all possible labels.\n", + "We want the `input` in each `Example` to mimic the input of an actual task. Therefore, we have to always include the text (chunk) and all possible labels.\n", "The `expected_output` shall correspond to anything we wish to compare our generated output to.\n", - "In this case, that means the correct class(es)." + "In this case, that means the correct class(es), i.e., the label name(s)." 
] }, { @@ -186,14 +196,14 @@ "metadata": {}, "outputs": [], "source": [ - "all_labels = list(set(c for d in data for c in d[\"label_name\"]))\n", + "all_labels = list(set(label_name for item in data for label_name in item[\"label_name\"]))\n", "dataset = dataset_repository.create_dataset(\n", " examples=[\n", " Example(\n", - " input=ClassifyInput(chunk=TextChunk(d[\"text\"]), labels=all_labels),\n", - " expected_output=d[\"label_name\"],\n", + " input=ClassifyInput(chunk=TextChunk(item[\"text\"]), labels=all_labels),\n", + " expected_output=item[\"label_name\"],\n", " )\n", - " for d in data\n", + " for item in data\n", " ],\n", " dataset_name=\"tweet_topic_multi\",\n", ")" @@ -242,8 +252,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.use_cases import SingleLabelClassifyEvaluation\n", - "\n", "print(\"Percentage correct:\", aggregation_overview.statistics.percentage_correct)\n", "print(\n", " \"First example:\",\n", @@ -258,10 +266,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For the sake of comparison, let's see if we can achieve a better result with our EmbeddingBasedClassifier.\n", - "Here, we have to provide some example for each class.\n", + "As an alternative to the `PromptBasedClassify` we now gonne use the `EmbeddingBasedClassify` for multi label classifications.\n", + "In this case, we have to provide some example for each class.\n", "\n", - "We can even reuse our data repositories" + "We can even reuse our data repositories:" ] }, { @@ -270,23 +278,13 @@ "metadata": {}, "outputs": [], "source": [ - "from collections import defaultdict\n", - "from typing import Any, Mapping, Sequence\n", - "from intelligence_layer.use_cases import (\n", - " MultiLabelClassifyEvaluationLogic,\n", - " MultiLabelClassifyAggregationLogic,\n", - " EmbeddingBasedClassify,\n", - " LabelWithExamples,\n", - ")\n", - "\n", - "\n", "def build_labels_and_examples(hf_data: Any) -> Mapping[str, Sequence[str]]:\n", " examples = defaultdict(list)\n", - " for d in hf_data:\n", - " labels = d[\"label_name\"]\n", + " for item in hf_data:\n", + " labels = item[\"label_name\"]\n", " for label in labels:\n", " if len(examples[label]) < 20:\n", - " examples[label].append(d[\"text\"])\n", + " examples[label].append(item[\"text\"])\n", " return examples\n", "\n", "\n", @@ -298,9 +296,15 @@ " for name, examples in build_labels_and_examples(all_data[25:]).items()\n", " ],\n", ")\n", - "eval_logic = MultiLabelClassifyEvaluationLogic(threshold=0.6)\n", + "eval_logic = MultiLabelClassifyEvaluationLogic(threshold=0.60)\n", "aggregation_logic = MultiLabelClassifyAggregationLogic()\n", "\n", + "embedding_based_classify_runner = Runner(\n", + " embedding_based_classify,\n", + " dataset_repository,\n", + " run_repository,\n", + " \"embedding-based-classify\",\n", + ")\n", "embedding_based_classify_evaluator = Evaluator(\n", " dataset_repository,\n", " run_repository,\n", @@ -313,12 +317,6 @@ " aggregation_repository,\n", " \"multi-label-classify\",\n", " aggregation_logic,\n", - ")\n", - "embedding_based_classify_runner = Runner(\n", - " embedding_based_classify,\n", - " dataset_repository,\n", - " run_repository,\n", - " \"embedding-based-classify\",\n", ")" ] }, @@ -357,7 +355,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Apparently, our method has a great recall value, but we tend to falsely predict labels at times.\n", + "Apparently, our method has a great recall value, i.e. all relevant labels are retrieved. 
However, the low precision value indicates that we tend to falsely predict labels at times.\n", "\n", "Note, that the evaluation criteria for the multiple label approach are a lot harsher; we evaluate whether we correctly predict all labels & not just one of the correct ones!" ] @@ -391,7 +389,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.11.7" } }, "nbformat": 4, diff --git a/src/examples/fastapi_tutorial.md b/src/examples/fastapi_tutorial.md index a3dbbd4ed..cae5ea25f 100644 --- a/src/examples/fastapi_tutorial.md +++ b/src/examples/fastapi_tutorial.md @@ -1,8 +1,8 @@ # Tutorial: Extending a FastAPI App with the Aleph-Alpha Intelligence Layer -In this tutorial, a basic [FastAPI](https://fastapi.tiangolo.com) app is extended with new route at which a summary for a given text can be retrieved, using the _Aleph-Alpha Intelligence Layer_, and it's _Luminous_ control models. +In this tutorial, a basic [FastAPI](https://fastapi.tiangolo.com) app is extended with a new route at which a summary for a given text can be retrieved, using the _Aleph-Alpha Intelligence Layer_, and it's _Luminous_ control models. -The full source code for this tutorial app can be found at the end and in [src/examples/fastapi_example.py](./fastapi_example.py). +The full source code for this example app can be found at the end of this tutorial and in [src/examples/fastapi_example.py](./fastapi_example.py). ## Basic FastAPI App @@ -26,12 +26,12 @@ This application can be started from the command line with the [Hypercorn](https hypercorn fastapi_example:app --bind localhost:8000 ``` -In a successful run, you should see a message similar to +If the start-up was successful, you should see a message similar to ```cmd [2024-03-07 14:00:55 +0100] [6468] [INFO] Running on http://:8000 (CTRL + C to quit) ``` -Now that the server is running, we can make a `GET` request via `cURL`: +Now that the server is running, we can perform a `GET` request via `cURL`: ```bash curl -X GET http://localhost:8000 ``` @@ -40,21 +40,26 @@ You should get Hello World ``` -After successfully starting the basic FastAPI app, the next step is to add a route to make use of the Intelligence Layer. +After successfully starting the basic FastAPI app, the next step is to add a route that makes use of the Intelligence Layer. ## Adding the Intelligence Layer to the application -The building blocks of the Intelligence Layer for applications are `Tasks`. In general, a task implements the `Task` interface and defines an `Input` and an `Output`. Multiple tasks can be chained to create more complex applications. -Here, we will make use of the pre-built task `SteerableSingleChunkSummarize` of the Intelligence Layer. This task defines as it's input the `SingleChunkSummarizeInput` class, and as it's output the `SummarizeOutput` class. -As many other tasks, the `SteerableSingleChunkSummarize` task makes use of a `ControlModel`, and in turn, the `ControlModel` needs access to the Aleph-Alpha backend via a `AlephAlphaClientProtocol` client. +The building blocks of the Intelligence Layer for applications are `Tasks`. In general, a task implements the `Task` +interface and defines an `Input` and an `Output`. Multiple tasks can be chained to create more complex applications. +Here, we will make use of the pre-built task `SteerableSingleChunkSummarize`. This task defines `SingleChunkSummarizeInput` +as it's input, and `SummarizeOutput` as it's output. 
+Like many other tasks, the `SteerableSingleChunkSummarize` task makes use of a `ControlModel`. The +`ControlModel` itself needs access to the Aleph-Alpha backend via a `AlephAlphaClientProtocol` client. In short, the hierarchy is as follows: ![task_dependencies.drawio.svg](task_dependencies.drawio.svg) We make use of the built-in [Dependency Injection](https://fastapi.tiangolo.com/reference/dependencies/) of FastAPI to -resolve this hierarchy automatically. In this framework, the defaults for parameters are dynamically created with the `Depends(func)` annotation, where `func` is a function that returns the default value. +resolve this hierarchy automatically. In this framework, the defaults for the parameters are dynamically created with +the `Depends(func)` annotation, where `func` is a function that returns the default value. -So, first, we define our client-generating function. For that, we provide the host URL and a valid Aleph-Alpha token, which are stored in an `.env`-file. +So, first, we define our client-generating function. For that, we provide the host URL and a valid Aleph-Alpha token, +which are stored in an `.env`-file. ```python import os @@ -71,7 +76,7 @@ def client() -> Client: ``` Next, we create a `ControlModel`. In this case, we make use of the `LuminousControlModel`, which takes -an `AlephAlphaClientProtocol` that we default to the previously defined `client`. +an `AlephAlphaClientProtocol` that we let default to the previously defined `client`. ```python from typing import Annotated @@ -84,9 +89,11 @@ def default_model(app_client: Annotated[AlephAlphaClientProtocol, Depends(client ``` -Finally, we create the actual `Task`. For our example, we choose the `SteerableSingleChunkSummarize` of the Intelligence Layer. -The `Input` of this task is a `SingleChunkSummarizeInput`, which consists of the text to summarize as the field `chunk`, and the desired `Language` as the field `language`. -The `Output` of this task is a `SummarizeOutput` and contains the `summary` as text, and number of generated tokens for the `summary` as the field `generated_tokens`. +Finally, we create the actual `Task`. For our example, we choose the `SteerableSingleChunkSummarize`. +The `Input` of this task is a `SingleChunkSummarizeInput`, consisting of the text to summarize as the `chunk` field, +and the desired `Language` as the `language` field. +The `Output` of this task is a `SummarizeOutput` and contains the `summary` as text, +and number of generated tokens for the `summary` as the `generated_tokens` field. ```python from intelligence_layer.use_cases import SteerableSingleChunkSummarize @@ -118,10 +125,11 @@ def summary_task_route( return task.run(input, NoOpTracer()) ``` -This concludes the refactoring to add an Intelligence-Layer task to the FastAPI app. After restarting the server, we can call our endpoint via a command such as the following (`` with the text you want to summarize): +This concludes the addition of an Intelligence-Layer task to the FastAPI app. 
After restarting the server, we can call +our newly created endpoint via a command such as the following: ```bash -curl -X POST http://localhost:8000/summary -H "Content-Type: application/json" -d '{"chunk": "", "language": {"iso_639_1": "en"}}' +curl -X POST http://localhost:8000/summary -H "Content-Type: application/json" -d '{"chunk": "", "language": {"iso_639_1": "en"}}' ``` ## Add Authorization to the Routes @@ -129,9 +137,11 @@ curl -X POST http://localhost:8000/summary -H "Content-Type: application/json" - Typically, authorization is needed to control access to endpoints. Here, we will give a minimal example of how a per-route authorization system could be implemented in the minimal example app. -The authorization system makes use of two parts: An `AuthService` that checks whether the user is allowed to access a given site, and a `PermissionsChecker` that is called on each route access and in turn calls the `AuthService`. +The authorization system makes use of two parts: An `AuthService` that checks whether the user is allowed to access a +given site, and a `PermissionsChecker` that is called on each route access and in turn calls the `AuthService`. -For this minimal example, the `AuthService` is simply a stub. You will want to implement a concrete authorization service depending on your needs. +For this minimal example, the `AuthService` is simply a stub. You will want to implement a concrete authorization service +tailored to your needs. ```python from typing import Sequence @@ -149,7 +159,10 @@ class AuthService: return True ``` -When the `PermissionsChecker` is created, `permissions` can be passed in to define which roles, e.g. "user" or "admin", are allowed to access which website. The `PermissionsChecker` implements the `__call__` function, so that it can be used as a function in the `dependencies` argument of each route via `Depends`, see extended definition of the `summary_task_route` further below. +With this `PermissionsChecker`, `permissions` can be passed in to define which roles, e.g. "user" or "admin", +are allowed to access which endpoints. The `PermissionsChecker` implements the `__call__` function, so that it can be +used as a function in the `dependencies` argument of each route via `Depends`. For more details see the extended +definition of the `summary_task_route` further below. 
```python from fastapi import HTTPException, Request diff --git a/src/examples/human_evaluation.ipynb b/src/examples/human_evaluation.ipynb index d88e22496..709def263 100644 --- a/src/examples/human_evaluation.ipynb +++ b/src/examples/human_evaluation.ipynb @@ -1,5 +1,50 @@ { "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from datasets import load_dataset\n", + "from dotenv import load_dotenv\n", + "from intelligence_layer.evaluation import (\n", + " ArgillaAggregator,\n", + " AggregationLogic,\n", + " ArgillaEvaluator,\n", + " ArgillaEvaluationLogic,\n", + " ArgillaEvaluationRepository,\n", + " Example,\n", + " InMemoryAggregationRepository,\n", + " InMemoryDatasetRepository,\n", + " InMemoryEvaluationRepository,\n", + " InMemoryRunRepository,\n", + " RecordDataSequence,\n", + " Runner,\n", + " SuccessfulExampleOutput,\n", + ")\n", + "from intelligence_layer.connectors import (\n", + " ArgillaEvaluation,\n", + " DefaultArgillaClient,\n", + " Field,\n", + " LimitedConcurrencyClient,\n", + " Question,\n", + " RecordData,\n", + ")\n", + "from intelligence_layer.core import (\n", + " CompleteOutput,\n", + " Instruct,\n", + " InstructInput,\n", + " LuminousControlModel,\n", + ")\n", + "from pydantic import BaseModel\n", + "from typing import Iterable, cast\n", + "\n", + "load_dotenv()\n", + "\n", + "client = LimitedConcurrencyClient.from_env()" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -36,53 +81,6 @@ "- password: `1234`" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "from typing import Iterable, cast\n", - "\n", - "from datasets import load_dataset\n", - "from dotenv import load_dotenv\n", - "from pydantic import BaseModel\n", - "\n", - "from intelligence_layer.connectors import (\n", - " LimitedConcurrencyClient,\n", - " Question,\n", - " ArgillaEvaluation,\n", - " DefaultArgillaClient,\n", - " Field,\n", - " RecordData,\n", - ")\n", - "from intelligence_layer.core import (\n", - " InstructInput,\n", - " Instruct,\n", - " CompleteOutput,\n", - " LuminousControlModel,\n", - ")\n", - "from intelligence_layer.evaluation import (\n", - " ArgillaEvaluator,\n", - " AggregationLogic,\n", - " RecordDataSequence,\n", - " ArgillaEvaluationLogic,\n", - " ArgillaEvaluationRepository,\n", - " Example,\n", - " InMemoryDatasetRepository,\n", - " InMemoryEvaluationRepository,\n", - " InMemoryRunRepository,\n", - " InMemoryAggregationRepository,\n", - " Runner,\n", - " SuccessfulExampleOutput,\n", - " ArgillaAggregator,\n", - ")\n", - "\n", - "load_dotenv()\n", - "\n", - "client = LimitedConcurrencyClient.from_env()" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -141,7 +139,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For our dataset repository, we could use a FileDatasetRepository or an InMemoryDatasetRepository." + "For our dataset repository, we can either use a FileDatasetRepository or an InMemoryDatasetRepository." ] }, { @@ -173,7 +171,8 @@ "## Task Setup\n", "\n", "We use an Instruction task to run the examples in the Instruct dataset.\n", - "In addition, we define an `EvaluationRepository` to save the results and a `Runner` to generate the completions from the model for our dataset." + "In addition, we define a `Runner` to generate the completions from the model for our dataset\n", + "and a `RunRepository` to save the results." 
] }, { @@ -222,14 +221,14 @@ "source": [ "![Argilla Interface](../../assets/argilla_interface.png)\n", "\n", - "In the Argilla UI, we see our model input (Instruction) and output (Model Completion) on the left side.\n", - "These are defined using the `fields` list.\n", - "The field names have to match the content keys from the `RecordData` that we will define in our `InstructArgillaEvaluationLogic`.\n", + "In the Argilla UI, our model input (Instruction) and output (Model Completion) will be shown on the left hand side.\n", + "They are defined below using the `fields` list.\n", + "Note that the field names have to match the content keys from the `RecordData` which we will define later in our `InstructArgillaEvaluationLogic`.\n", "\n", - "On the right side of the UI, we see our rating interface.\n", - "This can serve a number of Questions to be rated.\n", + "On the right side of the UI, the rating interface will be shown.\n", + "It is used to serve a number of questions that can be rated by the user.\n", "Currently, only integer scales are accepted.\n", - "The `name` property is used to access the human ratings in the aggregation step" + "The `name` property will later be used to access the human ratings in the aggregation step" ] }, { @@ -387,7 +386,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We can access once we have evaluated some examples " + "Once we have evaluated the examples in the Argilla UI we can retrive the evaluation results via the `ArgillaAggregator`." ] }, { @@ -425,7 +424,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.12" + "version": "3.11.7" } }, "nbformat": 4, diff --git a/src/examples/performance_tips.ipynb b/src/examples/performance_tips.ipynb index aac04fed4..b4b25a0b6 100644 --- a/src/examples/performance_tips.ipynb +++ b/src/examples/performance_tips.ipynb @@ -2,14 +2,40 @@ "cells": [ { "cell_type": "markdown", - "id": "d8767b2a", - "metadata": {}, + "id": "8cd7dfb528a28b66", + "metadata": { + "collapsed": false + }, "source": [ "# How to get more done in less time\n", "The following notebook contains tips for the following problems:\n", - " - A single task that takes very long to complete\n", - " - Running one task multiple times\n", - " - Running several different tasks at the same time" + " - A single task that takes very long to complete\n", + " - Running one task multiple times\n", + " - Running several different tasks at the same time\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "941eeae4325336d6", + "metadata": { + "ExecuteTime": { + "end_time": "2024-03-28T10:51:28.180489190Z", + "start_time": "2024-03-28T10:51:28.087623138Z" + }, + "collapsed": false + }, + "outputs": [], + "source": [ + "## Imports\n", + "\n", + "import time\n", + "from typing import Any\n", + "\n", + "from intelligence_layer.core import Task, TaskSpan, NoOpTracer\n", + "from itertools import repeat\n", + "from concurrent.futures import ThreadPoolExecutor" ] }, { @@ -21,9 +47,9 @@ "With a single long running task, consider the following:\n", " - If there are other calculations to do, consider using `ThreadPool.submit`, together with `result`\n", " - See [here](#submit_example) for an example\n", - " - If this is not the case consider to:\n", - " - Choose a faster model. The `base` model is faster than `extended`, `extended` is faster than `supreme`\n", - " - Choose tasks that perform fewer LLM operations. 
E.g.: `MultiChunkQa` usually takes longer than `SingleChunkQa`" + " - If this is not the case consider:\n", + " - Choosing a faster model. The `base` model is faster than `extended`, `extended` is faster than `supreme`\n", + " - Choosing tasks that perform fewer LLM operations. E.g.: `MultiChunkQa` usually takes longer than `SingleChunkQa`" ] }, { @@ -39,16 +65,37 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 12, "id": "04dac517", - "metadata": {}, - "outputs": [], + "metadata": { + "ExecuteTime": { + "end_time": "2024-03-28T10:51:30.280192577Z", + "start_time": "2024-03-28T10:51:28.204771132Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n" + ] + }, + { + "data": { + "text/plain": [ + "['A', 'B', 'C', 'D']" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], "source": [ - "import time\n", - "from typing import Any\n", - "\n", - "from intelligence_layer.core import Task, TaskSpan, NoOpTracer\n", - "\n", "\n", "class DummyTask(Task):\n", " def do_run(self, input: Any, task_span: TaskSpan) -> Any:\n", @@ -81,9 +128,14 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 13, "id": "8959fcec-dc54-4137-9cb8-3a9c70d6a3d0", - "metadata": {}, + "metadata": { + "ExecuteTime": { + "end_time": "2024-03-28T10:51:30.281331699Z", + "start_time": "2024-03-28T10:51:30.253672599Z" + } + }, "outputs": [], "source": [ "# Second long running task\n", @@ -108,18 +160,52 @@ "metadata": {}, "source": [ "\n", - "The following shows how single tasks can be submitted to a ThreadPool. \n", + "The individual tasks can then be submitted to a ThreadPool. \n", "This is especially useful when there are other things to do while running tasks." 
] }, { "cell_type": "code", - "execution_count": null, + "execution_count": 14, "id": "6c88c3a2", - "metadata": {}, - "outputs": [], + "metadata": { + "ExecuteTime": { + "end_time": "2024-03-28T10:51:32.252184650Z", + "start_time": "2024-03-28T10:51:30.254258611Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task2 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task2 complete\n", + "Task1 complete\n", + "Task2 complete\n", + "Task1 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task1 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task1 complete\n", + "Task 1 result: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n", + "Task 2 result: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]\n" + ] + } + ], "source": [ - "from concurrent.futures import ThreadPoolExecutor\n", + "\n", "\n", "with ThreadPoolExecutor(max_workers=2) as executor:\n", " task_1_result = executor.submit(task_1.run_concurrently, task_input_1, tracer)\n", @@ -141,12 +227,46 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 15, "id": "6b71469e", - "metadata": {}, - "outputs": [], + "metadata": { + "ExecuteTime": { + "end_time": "2024-03-28T10:51:34.271759294Z", + "start_time": "2024-03-28T10:51:32.247545036Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task 1 result: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n", + "Task 2 result: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]\n" + ] + } + ], "source": [ - "from itertools import repeat\n", + "\n", "\n", "jobs = list(zip(repeat(task_1), task_input_1)) + list(zip(repeat(task_2), task_input_2))\n", "\n", @@ -166,10 +286,44 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 16, "id": "de3fe114", - "metadata": {}, - "outputs": [], + "metadata": { + "ExecuteTime": { + "end_time": "2024-03-28T10:51:36.286713303Z", + "start_time": "2024-03-28T10:51:34.263930848Z" + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Task1 completeTask2 complete\n", + "Task2 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task1 complete\n", + "Task2 complete\n", + "Task1 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task2 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task1 complete\n", + "Task 1 result: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]\n", + "Task 2 result: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]\n" + ] + } + ], "source": [ "with ThreadPoolExecutor(max_workers=2) as executor:\n", " results = list(\n", @@ -211,7 +365,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.10.12" + "version": "3.10.13" } }, "nbformat": 4, diff --git a/src/examples/qa.ipynb b/src/examples/qa.ipynb index eecf3d6aa..0205deeac 100644 --- 
a/src/examples/qa.ipynb +++ b/src/examples/qa.ipynb @@ -1,37 +1,48 @@ { "cells": [ { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": {}, + "outputs": [], "source": [ - "# Question and Answer\n", - "\n", - "A common use case for using large language models is to generate answers to questions based on a given piece of text.\n", - "We will be focusing on the open-book Q&A use case, where we provide the model with a piece of text we think is relevant to the question and ask it to answer based on it.\n", + "from dotenv import load_dotenv\n", + "from intelligence_layer.core import (\n", + " DetectLanguage,\n", + " DetectLanguageInput,\n", + " InMemoryTracer,\n", + " Language,\n", + " LuminousControlModel,\n", + " NoOpTracer,\n", + ")\n", + "from intelligence_layer.use_cases import (\n", + " LongContextQa,\n", + " LongContextQaInput,\n", + " MultipleChunkQa,\n", + " MultipleChunkQaInput,\n", + " SingleChunkQa,\n", + " SingleChunkQaInput,\n", + ")\n", + "from IPython.display import Pretty\n", "\n", - "To start, we first need to instantiate an Aleph Alpha `Client` so that we can make calls to the Aleph Alpha API, which we use to get access to the Aleph Alpha family of models." + "load_dotenv()" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "from intelligence_layer.connectors import LimitedConcurrencyClient\n", - "\n", - "from dotenv import load_dotenv\n", - "\n", - "load_dotenv()\n", + "# Question and Answer\n", "\n", - "client = LimitedConcurrencyClient.from_env()" + "A common use case for using large language models is to generate answers to questions based on a given piece of text.\n", + "We will be focusing on the open-book Q&A use case, where we provide the model with a piece of text we think is relevant to the question and ask the model to answer the question based on the given text." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "Now, let's grab a piece of text we want to ask a question about. We can start with a random Wikipedia article about [\"Surface micromachining\"](https://en.wikipedia.org/wiki/Surface_micromachining)" + "Let's grab a piece of text we want to ask a question about. 
We can start with a random Wikipedia article about [\"Surface micromachining\"](https://en.wikipedia.org/wiki/Surface_micromachining)" ] }, { @@ -59,12 +70,11 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's see if we can use the `SingleChunkQa`-task to answer questions about this text!\n", + "We can use the `SingleChunkQa`-task to answer questions about this text!\n", "This particular task is optimized for shorter texts that fit into the model's context window.\n", "The main things of interest are that you can provide a `QaInput`, which consists of a `question` you want to ask and a `text` to base that answer upon.\n", - "We also have to provide a tracer, but disregard its output for now.\n", "\n", - "The output will be a `QaOutput`, which will include an `answer` (if it can find one in the text) and `highlights`, which point to the source text's section where the results are from.\n" + "The output will be a `QaOutput`, which will include an `answer` (if it can find one in the text) and `highlights` which mark the most relevant sections of the input text for the generated answer.\n" ] }, { @@ -73,16 +83,16 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.core import LuminousControlModel\n", - "from intelligence_layer.use_cases import SingleChunkQa, SingleChunkQaInput\n", - "from intelligence_layer.core import NoOpTracer\n", - "from IPython.display import Pretty\n", - "\n", + "# Define a question you want to ask about the input text\n", "question = \"What are some benefits of surface micro-machining?\"\n", + "\n", + "# Bundle the input text and the question into a SingleChunkQaInput\n", "input = SingleChunkQaInput(chunk=text, question=question)\n", "\n", - "model = LuminousControlModel(name=\"luminous-supreme-control\", client=client)\n", + "# Define a LuminousControlModel and instantiate a SingleChunkQa task\n", + "model = LuminousControlModel(name=\"luminous-supreme-control\")\n", "single_chunk_qa = SingleChunkQa(model=model)\n", + "\n", "output = single_chunk_qa.run(input, NoOpTracer())\n", "\n", "Pretty(output.answer)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nice, we extracted some advantages!\n", "\n", - "If you want to investigate based on which part of the input text was used to produce the answer, you can use the `highlights`.\n", - "Under the hood, it uses the explainability feature of the Aleph Alpha inference stack." + "If you want to see which parts of the input text the answer was based on, you can use the `highlights` property of the `SingleChunkQaOutput`.\n", + "Under the hood, it uses the explainability feature of the Aleph Alpha inference stack. Each highlight in the `highlights` list contains the start and end cursor positions of the relevant text section and a score indicating its degree of relevance."
] }, { @@ -107,6 +117,20 @@ "output.highlights" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(\n", + " f\"Highlight 1 (Score {round(output.highlights[0].score,2)}): {text[output.highlights[0].start:output.highlights[0].end]}\"\n", + ")\n", + "print(\n", + " f\"Highlight 2 (Score {round(output.highlights[1].score,2)}): {text[output.highlights[1].start:output.highlights[1].end]}\"\n", + ")" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -143,7 +167,13 @@ "## Language detection\n", "\n", "You can also ask questions about documents in languages other than English.\n", - "Our models support 5 European languages 🇬🇧 🇩🇪 🇪🇸 🇫🇷 🇮🇹.\n", + "Our models support 5 European languages:\n", + "- English - 'en'\n", + "- German - 'de'\n", + "- Spanish - 'es'\n", + "- French - 'fr'\n", + "- Italian - 'it'\n", + " \n", "We provide you with some tools making it easier to detect the language in the document." ] }, @@ -153,16 +183,14 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.core import DetectLanguage, DetectLanguageInput, Language\n", - "\n", - "my_i_have_no_idea_what_language_document = \"\"\"Rom begann ab dem 5. Jahrhundert v. Chr. mit einer immer rascheren Expansion in Mittelitalien (Eroberung von Veji 396 v. Chr.), musste dabei aber auch schwere Rückschläge verkraften. Der „Galliersturm“ unter Brennus hinterließ psychologisch tiefe Spuren, wobei die Schlacht an der Allia am 18. Juli (wahrscheinlich) 387 v. Chr. als „dies ater“ („schwarzer Tag“) in die Geschichte Roms einging. Es folgten die Samnitenkriege (343–341 v. Chr.; 326–304 v. Chr.; 298–290 v. Chr.) und der Latinerkrieg (um 340–338 v. Chr.). Rom schuf schließlich ein weitverzweigtes Bündnisgeflecht. So wurden an strategisch wichtigen Orten Kolonien angelegt und Bündnisse mit mehreren italischen Stämmen geschlossen, die jedoch nicht das römische Bürgerrecht erhielten.\n", + "document_with_unknown_language = \"\"\"Rom begann ab dem 5. Jahrhundert v. Chr. mit einer immer rascheren Expansion in Mittelitalien (Eroberung von Veji 396 v. Chr.), musste dabei aber auch schwere Rückschläge verkraften. Der „Galliersturm“ unter Brennus hinterließ psychologisch tiefe Spuren, wobei die Schlacht an der Allia am 18. Juli (wahrscheinlich) 387 v. Chr. als „dies ater“ („schwarzer Tag“) in die Geschichte Roms einging. Es folgten die Samnitenkriege (343–341 v. Chr.; 326–304 v. Chr.; 298–290 v. Chr.) und der Latinerkrieg (um 340–338 v. Chr.). Rom schuf schließlich ein weitverzweigtes Bündnisgeflecht. So wurden an strategisch wichtigen Orten Kolonien angelegt und Bündnisse mit mehreren italischen Stämmen geschlossen, die jedoch nicht das römische Bürgerrecht erhielten.\n", "\n", "Aus dieser Zeit seiner Geschichte ging Rom als straffes Staatswesen mit schlagkräftiger Armee und starkem Drang zur Ausdehnung hervor. Damit waren die Grundlagen für seinen weiteren Aufstieg geschaffen. Konkurrierende Mächte stellten auf der Italischen Halbinsel die Stadtstaaten der Etrusker nördlich von Rom, die Kelten in der Po-Ebene und die griechischen Kolonien in Süditalien dar.\n", "\n", "Im 3. Jahrhundert v. Chr. setzte sich Rom gegen die Samniten und andere italische Stämme durch. Nach und nach fiel die gesamte Halbinsel an Rom (außer Oberitalien, welches erst später annektiert wurde). Im Süden verleibte sich die Republik um 275 v. Chr. die dortigen griechischen Stadtstaaten ein, nachdem es während des Pyrrhischen Krieges gelungen war, den hellenistischen Hegemon Pyrrhos I. 
von Epiros abzuwehren. Mit dieser Expansion kam Rom allerdings in Konflikt mit der bisher Rom freundlich gesinnten Handelsrepublik Karthago (im heutigen Tunesien), was zu den Punischen Kriegen führte.\"\"\"\n", "\n", "lang_detection_input = DetectLanguageInput(\n", - " text=my_i_have_no_idea_what_language_document,\n", + " text=document_with_unknown_language,\n", " possible_languages=[Language(l) for l in [\"en\", \"de\", \"es\", \"fr\", \"it\"]],\n", ")\n", "language = DetectLanguage().run(lang_detection_input, NoOpTracer())\n", @@ -186,7 +214,7 @@ "question = \"Wie viele Samnitenkriege gab es & wann fanden sie statt?\"\n", "\n", "input = SingleChunkQaInput(\n", - " chunk=my_i_have_no_idea_what_language_document,\n", + " chunk=document_with_unknown_language,\n", " question=question,\n", " language=language.best_fit,\n", ")\n", @@ -201,14 +229,14 @@ "source": [ "## Multi-chunk QA\n", "\n", - "One of the problems with the approach above is that your input (text and question) have to fit into the context size of the LLM, which depends on the model being used.\n", - "If your input exceeds this limit, you may want to use a `MultipleChunkQa`.\n", + "Sometimes you might have multiple texts you want to provide as context for your question. In this case, the `MultipleChunkQa`-task might be the better option. The workflow of this task consists of the following steps:\n", + "1. The task takes multiple text chunks and a question as input.\n", + "2. It runs the model for each chunk, generating an individual answer per chunk.\n", + "3. It generates a final answer based on the combination of the intermediate answers.\n", "\n", - "In this case, out QA-workflow becomes slightly more complicated and consists of the following steps:\n", - "First, the model generates an answer based on each chunk and the given question.\n", - "Then all the intermediate answers are combined into a single answer.\n", + "Note that for the `MultipleChunkQa` the combined length of all input chunks is **not** limited by the context window of the model. Thus, `MultipleChunkQa` provides one option to deal with long input texts by splitting them into multiple chunks. However, in the section [Long context QA](#long-context-qa) below we will present a more sophisticated approach to handling QA-tasks for long input texts.\n", "\n", - "Let's have a look at an example where two chunks lead to different parts of the final answer.\n", + "Now let's have a look at an example where two chunks lead to different parts of the final answer.\n", "\n", "This time, let's also use a proper debug log, so that we can see what happens under the hood!" ] }, @@ -219,9 +247,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.core import InMemoryTracer\n", - "from intelligence_layer.use_cases import MultipleChunkQa, MultipleChunkQaInput\n", - "\n", "chunks = [\n", " 'Around 1997, Goldenvoice was struggling to book concerts against larger companies, and they were unable to offer guarantees as high as their competitors, such as SFX Entertainment. Tollett said, \"We were getting our ass kicked financially. We were losing a lot of bands. And we couldn\\'t compete with the money.\" As a result, the idea of a music festival was conceived, and Tollett began to brainstorm ideas for one with multiple venues. 
His intent was to book trendy artists who were not necessarily chart successes: \"Maybe if you put a bunch of them together, that might be a magnet for a lot of people.\" While attending the 1997 Glastonbury Festival, Tollett handed out pamphlets to artists and talent managers that featured pictures of the Empire Polo Club and pitched a possible festival there. In contrast to the frequently muddy conditions at Glastonbury caused by rain, he recalled, \"We had this pamphlet... showing sunny Coachella. Everyone was laughing.\"',\n", " \"Rock am Ring wurde erstmals 1985 veranstaltet und war ursprünglich als ein einmaliges Ereignis geplant. Aufgrund des großen Erfolges mit 75.000 Zuschauern entschloss man sich jedoch, diese Veranstaltung jedes Jahr stattfinden zu lassen. Der Einbruch der Zuschauerzahlen 1988 hatte eine zweijährige Pause zur Folge. 1991 startete das größte deutsche Rockfestival mit einem überarbeiteten Konzept erneut. Ein neues Hauptaugenmerk wurde darauf gelegt, dem Publikum mehr Newcomer vorzustellen. So traten unter anderem die zu diesem Zeitpunkt eher unbekannten INXS oder Alanis Morissette bei Rock am Ring vor großem Publikum auf.\",\n", @@ -262,6 +287,13 @@ "tracer" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "From the trace we can see that the `MultipleChunkQa`-task runs the `SingleChunkQa` task twice, once for each chunk, and then combines both answers in a final `Complete`." + ] + }, { "cell_type": "markdown", "metadata": {}, "source": [ @@ -325,9 +357,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.use_cases import LongContextQa, LongContextQaInput\n", - "from intelligence_layer.core import InMemoryTracer\n", - "\n", "question = \"What is the name of the book about Robert Moses?\"\n", "input = LongContextQaInput(text=long_text, question=question)\n", @@ -382,7 +411,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.0" + "version": "3.11.7" } }, "nbformat": 4, diff --git a/src/examples/quickstart_task.ipynb b/src/examples/quickstart_task.ipynb index 39aae45d4..3d739e638 100644 --- a/src/examples/quickstart_task.ipynb +++ b/src/examples/quickstart_task.ipynb @@ -1,5 +1,43 @@ { "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from aleph_alpha_client import Prompt\n", + "from dotenv import load_dotenv\n", + "from intelligence_layer.core import (\n", + " AlephAlphaModel,\n", + " CompleteInput,\n", + " InMemoryTracer,\n", + " NoOpTracer,\n", + " Task,\n", + " TaskSpan,\n", + ")\n", + "from intelligence_layer.evaluation import (\n", + " Example,\n", + " InMemoryDatasetRepository,\n", + " InMemoryRunRepository,\n", + " Runner,\n", + ")\n", + "from intelligence_layer.evaluation import (\n", + " Aggregator,\n", + " AggregationLogic,\n", + " Evaluator,\n", + " Example,\n", + " InMemoryAggregationRepository,\n", + " InMemoryEvaluationRepository,\n", + " SingleOutputEvaluationLogic,\n", + ")\n", + "from pydantic import BaseModel\n", + "from statistics import mean\n", + "from typing import Iterable\n", + "\n", + "load_dotenv()" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -13,6 +51,8 @@ "To do so, we will leverage `luminous-base` and a few-shot prompt to generate matching keywords for variable input texts.\n", "Next, we will build an evaluator to check how well our extractor performs.\n", "\n", + "## Initial task setup\n", + "\n", "Let's start with the interface of any generic task. 
The full `Task` interface can be found here: [../intelligence_layer/task.py](../intelligence_layer/task.py).\n", "However, to initially set up a `Task`, there are only a few parts relevant to us. For now, we shall only care about the following part of the interface:\n", "\n", @@ -37,12 +77,6 @@ "metadata": {}, "outputs": [], "source": [ - "from dotenv import load_dotenv\n", - "from pydantic import BaseModel\n", - "\n", - "load_dotenv()\n", - "\n", - "\n", "class KeywordExtractionInput(BaseModel):\n", " \"\"\"This is the text we will extract keywords from\"\"\"\n", "\n", @@ -80,11 +114,6 @@ "metadata": {}, "outputs": [], "source": [ - "from aleph_alpha_client import Prompt\n", - "from intelligence_layer.core import Task, TaskSpan, CompleteInput\n", - "from intelligence_layer.core import AlephAlphaModel\n", - "\n", - "\n", "class KeywordExtractionTask(Task[KeywordExtractionInput, KeywordExtractionOutput]):\n", " PROMPT_TEMPLATE: str = \"\"\"Identify matching keywords for each text.\n", "###\n", @@ -109,7 +138,7 @@ "\n", " def _create_complete_input(self, text: str) -> Prompt:\n", " prompt = Prompt.from_text(self.PROMPT_TEMPLATE.format(text=text))\n", - " # Explain stop sequences here.\n", + " # Stop sequences end the completion early: a newline or the '###' few-shot separator marks the end of the generated keyword line.\n", " model_input = CompleteInput(\n", " prompt=prompt,\n", " stop_sequences=[\"\\n\", \"###\"],\n", @@ -143,9 +172,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.core import InMemoryTracer\n", - "\n", - "\n", "task = KeywordExtractionTask()\n", "text = \"Computer vision describes the processing of an image by a machine using external devices (e.g., a scanner) into a digital description of that image for further processing. An example of this is optical character recognition (OCR), the recognition and processing of images containing text. Further processing and final classification of the image is often done using artificial intelligence methods. The goal of this field is to enable computers to process visual tasks that were previously reserved for humans.\"\n", "\n", @@ -161,6 +187,8 @@ "source": [ "Looks great!\n", "\n", + "## Evaluation\n", + "\n", "Now that our task is set up, we can start evaluating its performance.\n", "\n", "For this, we will have to set up an evaluator. The evaluator requires an `EvaluationLogic` and an `AggregationLogic` object. 
\n", @@ -281,16 +309,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.core import NoOpTracer\n", - "from intelligence_layer.evaluation import (\n", - " InMemoryDatasetRepository,\n", - " InMemoryRunRepository,\n", - " Runner,\n", - " Example,\n", - ")\n", - "from statistics import mean\n", - "from typing import Iterable\n", - "\n", "dataset_repository = InMemoryDatasetRepository()\n", "run_repository = InMemoryRunRepository()\n", "\n", @@ -320,14 +338,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.evaluation import (\n", - " Evaluator,\n", - " InMemoryEvaluationRepository,\n", - " Example,\n", - " SingleOutputEvaluationLogic,\n", - ")\n", - "\n", - "\n", "class KeywordExtractionEvaluationLogic(\n", " SingleOutputEvaluationLogic[\n", " KeywordExtractionInput,\n", @@ -341,9 +351,9 @@ " example: Example[KeywordExtractionInput, KeywordExtractionOutput],\n", " output: KeywordExtractionExpectedOutput,\n", " ) -> KeywordExtractionEvaluation:\n", - " true_positives = output.keywords & output.keywords\n", - " false_positives = output.keywords - output.keywords\n", - " false_negatives = output.keywords - output.keywords\n", + " true_positives = output.keywords & example.expected_output.keywords\n", + " false_positives = output.keywords - example.expected_output.keywords\n", + " false_negatives = example.expected_output.keywords - output.keywords\n", " return KeywordExtractionEvaluation(\n", " true_positive_rate=len(true_positives) / len(output.keywords),\n", " true_positives=true_positives,\n", @@ -382,7 +392,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "To aggregate the evaluation results, we have to implement a method doing this in a `AggregationLogic` class." + "To aggregate the evaluation results, we have to implement a method doing this in an `AggregationLogic` class." ] }, { @@ -391,14 +401,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.evaluation import (\n", - " InMemoryAggregationRepository,\n", - " Example,\n", - " Aggregator,\n", - " AggregationLogic,\n", - ")\n", - "\n", - "\n", "class KeywordExtractionAggregationLogic(\n", " AggregationLogic[\n", " KeywordExtractionEvaluation,\n", @@ -408,9 +410,11 @@ " def aggregate(\n", " self, evaluations: Iterable[KeywordExtractionEvaluation]\n", " ) -> KeywordExtractionAggregatedEvaluation:\n", - " eval_list = list(evaluations)\n", + " evaluation_list = list(evaluations)\n", " true_positive_rate = (\n", - " mean(e.true_positive_rate for e in eval_list) if eval_list else 0\n", + " mean(evaluation.true_positive_rate for evaluation in evaluation_list)\n", + " if evaluation_list\n", + " else 0\n", " )\n", " return KeywordExtractionAggregatedEvaluation(\n", " average_true_positive_rate=true_positive_rate\n", @@ -421,7 +425,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Let's create now an aggregator and generate evaluation statistics from the previously generated evaluation results." + "Let's now create an aggregator and generate evaluation statistics from the previously generated evaluation results." 
] }, { @@ -457,8 +461,6 @@ "metadata": {}, "outputs": [], "source": [ - "from pprint import pprint\n", - "\n", "dataset_id = dataset_repository.create_dataset(\n", " examples=[\n", " Example(input=model_input, expected_output=expected_output),\n", @@ -494,12 +496,19 @@ "print(aggregation_overview)" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have now run our first evaluation on this tiny dataset.\n", - "Let's have a more detailed look at the debug log of one example run." + "Let's take a more detailed look at the debug log of one of the example runs." ] }, { @@ -513,8 +522,10 @@ " dataset_id, evaluator.input_type(), evaluator.expected_output_type()\n", " )\n", ")\n", + "examples.sort(key=lambda x: x.input.text)\n", + "print(examples[1].input.text)\n", "last_example_result = run_repository.example_trace(\n", - " next(iter(aggregation_overview.run_overviews())).id, examples[-1].id\n", + " next(iter(aggregation_overview.run_overviews())).id, examples[1].id\n", ")\n", "last_example_result.trace" ] }, @@ -536,6 +547,7 @@ "5. **Metrics**: Several metrics generated by our `KeywordExtractionTaskEvaluationLogic`.\n", "\n", "Let's have a look at the evaluation results.\n", + "\n", "Here, we can see that the model returned \"behavi*o*ral economics\" as a keyword.\n", "However, in the `false_negatives`, we can see that we did indeed expect this phrase, but with a different spelling: \"behavi*ou*ral economics\".\n", "Thus, the debug log helped us easily identify this misalignment between our dataset and the model's generation." ] }, @@ -549,7 +561,7 @@ "source": [ "last_example_result = evaluation_repository.example_evaluation(\n", " next(iter(aggregation_overview.evaluation_overviews)).id,\n", - " examples[-1].id,\n", + " examples[1].id,\n", " KeywordExtractionEvaluation,\n", ")\n", "print(last_example_result.result)" ] }, @@ -585,7 +597,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.11.7" } }, "nbformat": 4, diff --git a/src/examples/summarization.ipynb b/src/examples/summarization.ipynb index 23426256b..a8bf4da0d 100644 --- a/src/examples/summarization.ipynb +++ b/src/examples/summarization.ipynb @@ -1,5 +1,31 @@ { "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from dotenv import load_dotenv\n", + "from intelligence_layer.core import (\n", + " CompleteInput,\n", + " InMemoryTracer,\n", + " Language,\n", + " LuminousControlModel,\n", + " NoOpTracer,\n", + " TextChunk,\n", + ")\n", + "from intelligence_layer.use_cases import (\n", + " RecursiveSummarize,\n", + " RecursiveSummarizeInput,\n", + " SingleChunkSummarizeInput,\n", + " SteerableLongContextSummarize,\n", + " SteerableSingleChunkSummarize,\n", + ")\n", + "\n", + "load_dotenv()" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -18,12 +44,6 @@ "metadata": {}, "outputs": [], "source": [ - "from dotenv import load_dotenv\n", - "\n", - "from intelligence_layer.core import CompleteInput, LuminousControlModel, NoOpTracer\n", - "\n", - "load_dotenv()\n", - "\n", "# first, we define a `LuminousControlModel`; this model holds much more information than a simple string...\n", "# it knows how to call the API, how to complete, how to tokenize and more...\n", "model = LuminousControlModel(\"luminous-base-control\")\n", @@ -104,8 +124,8 @@ "source": [ "Cool, our summary now uses information from 
the document!\n", "\n", - "At the core of the Intelligence Layer is the concept of a `Task`; a task could be any process involving an LLM, for example our summary use-case here.\n", - "By using a summarization-task, instead of the above method, we can isolate the responsibility for summarizing in said task.\n", + "At the core of the Intelligence Layer is the concept of a `Task`; a task could be any process involving an LLM, for example our summarization use-case here.\n", + "By using a summarization-task, we can isolate the responsibility for summarizing in said task.\n", "We can then simply export this task, evaluate it or deploy it into productive settings.\n", "\n", "The IL has pre-buiilt `Task`s for summarizing texts. Let's try this out.\n" ] }, @@ -117,13 +137,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.core import TextChunk, Language\n", - "from intelligence_layer.use_cases import (\n", - " SteerableSingleChunkSummarize,\n", - " SingleChunkSummarizeInput,\n", - ")\n", - "\n", - "\n", "# instantiating a `SteerableSingleChunkSummarize` with our model from before\n", "single_chunk_summarize = SteerableSingleChunkSummarize(model)\n", @@ -144,7 +157,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Cool!\n", + "Awesome!\n", "\n", "Notice how the output is quite similar but we did not have to provide the instruction. It is in fact embedded in our task.\n", "\n", @@ -187,8 +200,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.use_cases import SteerableLongContextSummarize\n", - "\n", "# again, we can simply insert our model from before\n", "long_context_summarize = SteerableLongContextSummarize(model=model)" ] }, @@ -266,8 +277,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.use_cases import RecursiveSummarize, RecursiveSummarizeInput\n", - "\n", "# notice how tasks are composable\n", "# we instantiate our recursive summarization strategy with our previous long context strategy\n", "recursive_summarize = RecursiveSummarize(long_context_summarize)\n", @@ -298,9 +307,6 @@ "metadata": {}, "outputs": [], "source": [ - "from intelligence_layer.core import InMemoryTracer\n", - "from IPython.display import Pretty\n", - "\n", "# for our experiment, we use an `InMemoryTracer`\n", "# you may also use other tracers, for example the `FileTracer` if you want to persist the traces\n", "tracer = InMemoryTracer()\n", @@ -319,9 +325,9 @@ "Don't be afraid, I promise this trace makes sense!\n", "\n", "Notice how each block is labeled with the task that was run as well as the respective inputs and outputs.\n", - "We can now can a better insight into which tasks are doing what. If we find a mistake, we can double down to figure out what went wrong.\n", + "We can now obtain better insight into which task is doing what. If we find a mistake, we can drill down to figure out what went wrong.\n", "\n", - "Cool! You are now familiar with the basics concepts of a `Model`, a `Task` and summarization using the Intelligence Layer.\n" + "Great! You are now familiar with the basic concepts of a `Model`, a `Task` and summarization using the Intelligence Layer.\n" ] } ], @@ -341,7 +347,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.4" + "version": "3.11.7" } }, "nbformat": 4,