feat: Rename how-to implement incremental evaluation and make it more…

… concise (#864) TASK: IL-313
Aleph-Alpha · May 23, 2024 · 86c5e2e · 86c5e2e
1 parent 48b09b0
commit 86c5e2e
Show file tree

Hide file tree

Showing 4 changed files with 155 additions and 207 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,7 +6,7 @@
 ...
 
 ### New Features
- - Add `how_to_implement_complete_incremental_evaluation_flow`
+ - Add `how_to_implement_incremental_evaluation`.
 
 ### Fixes
 - The document index client now correctly URL-encodes document names in its queries.

diff --git a/README.md b/README.md
@@ -180,7 +180,7 @@ The how-tos are quick lookups about how to do things. Compared to the tutorials,
 | [...retrieve data for analysis](./src/documentation/how_tos/how_to_retrieve_data_for_analysis.ipynb)                                                   | Retrieve experiment data in multiple different ways                        |
 | [...implement a custom human evaluation](./src/documentation/how_tos/how_to_human_evaluation_via_argilla.ipynb)                                        | Necessary steps to create an evaluation with humans as a judge via Argilla |
 | [...implement elo evaluations](./src/documentation/how_tos/how_to_implement_elo_evaluations.ipynb)                                                     | Evaluate runs and create ELO ranking for them                               |
-| [...implement complete incremental evaluation flow](./src/documentation/how_tos/how_to_implement_complete_incremental_evaluation_flow.ipynb)           | Run complete incremental evaluation flow from runner to aggretation
+| [...implement incremental evaluation](./src/documentation/how_tos/how_to_implement_incremental_evaluation.ipynb)           | Implement and run an incremental evaluation
 # Models
 
 Currently, we support a bunch of models accessible via the Aleph Alpha API. Depending on your local setup, you may even have additional models available.

diff --git a/src/documentation/how_tos/how_to_implement_complete_incremental_evaluation_flow.ipynb b/src/documentation/how_tos/how_to_implement_complete_incremental_evaluation_flow.ipynb
diff --git a/src/documentation/how_tos/how_to_implement_incremental_evaluation.ipynb b/src/documentation/how_tos/how_to_implement_incremental_evaluation.ipynb
@@ -0,0 +1,153 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from documentation.how_tos.example_data import (\n",
+    "    DummyAggregationLogic,\n",
+    "    DummyEvaluation,\n",
+    "    DummyExample,\n",
+    "    example_data,\n",
+    ")\n",
+    "from intelligence_layer.evaluation import (\n",
+    "    Aggregator,\n",
+    "    Example,\n",
+    "    IncrementalEvaluationLogic,\n",
+    "    IncrementalEvaluator,\n",
+    "    InMemoryAggregationRepository,\n",
+    "    InMemoryEvaluationRepository,\n",
+    "    SuccessfulExampleOutput,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# How to implement incremental evaluation\n",
+    "This notebook outlines how to perform evaluations in an incremental fashion, i.e., adding additional runs to your existing evaluations without the need for recalculation.\n",
+    "    \n",
+    "## Step-by-Step Guide\n",
+    "0. Run your tasks on the datasets on which you want to evaluate them (see [here](./how_to_run_a_task_on_a_dataset.ipynb))\n",
+    "   - When evaluating multiple runs, all of them need the same data types \n",
+    "1. Initialize all necessary repositories and define your `IncrementalEvaluationLogic`; It is similar to a normal `EvaluationLogic` (see [here](./how_to_implement_a_simple_evaluation_and_aggregation_logic.ipynb)) but you additionally have to implement your own `do_incremental_evaluate` method\n",
+    "2. Initialize an `IncrementalEvaluator` with the repositories and your custom `IncrementalEvaluationLogic`\n",
+    "3. Call the `evaluate_runs` method of the `IncrementalEvaluator`\n",
+    "4. Aggregate your evaluations using the [standard aggregation](./how_to_aggregate_evaluations.ipynb) or using a [custom aggregation logic](./how_to_implement_a_simple_evaluation_and_aggregation_logic.ipynb)\n",
+    "\n",
+    "#### Steps for addition of new runs \n",
+    "5. Call the `evaluate_additional_runs` method of the `IncrementalEvaluator`:\n",
+    "   - `run_ids`: Runs to be included in the evaluation results, including those that have been evaluated before\n",
+    "   - `previous_evaluation_ids`: Runs **not** to be re-evaluated, depending on the specific implementation of the `do_incremental_evaluate` method\n",
+    "6. Aggregate all your `EvaluationOverview`s"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Step 0\n",
+    "examples = [\n",
+    "    DummyExample(input=\"input1\", expected_output=\"expected_output1\", data=\"data1\")\n",
+    "]\n",
+    "my_example_data = example_data()\n",
+    "\n",
+    "dataset_repository = my_example_data.dataset_repository\n",
+    "run_repository = my_example_data.run_repository\n",
+    "\n",
+    "# Step 1\n",
+    "evaluation_repository = InMemoryEvaluationRepository()\n",
+    "aggregation_repository = InMemoryAggregationRepository()\n",
+    "\n",
+    "\n",
+    "class DummyIncrementalEvaluationLogic(\n",
+    "    IncrementalEvaluationLogic[str, str, str, DummyEvaluation]\n",
+    "):\n",
+    "    def do_incremental_evaluate(\n",
+    "        self,\n",
+    "        example: Example[str, str],\n",
+    "        outputs: list[SuccessfulExampleOutput[str]],\n",
+    "        already_evaluated_outputs: list[list[SuccessfulExampleOutput[str]]],\n",
+    "    ) -> DummyEvaluation:\n",
+    "        return DummyEvaluation(eval=\"DummyEvalResult\")\n",
+    "\n",
+    "\n",
+    "# Step 2\n",
+    "incremental_evaluator = IncrementalEvaluator(\n",
+    "    dataset_repository,\n",
+    "    run_repository,\n",
+    "    evaluation_repository,\n",
+    "    \"My incremental evaluation\",\n",
+    "    DummyIncrementalEvaluationLogic(),\n",
+    ")\n",
+    "\n",
+    "# Step 3\n",
+    "incremental_evaluator.evaluate_runs(my_example_data.run_overview_1.id)\n",
+    "\n",
+    "# Step 4\n",
+    "aggregation_logic = DummyAggregationLogic()\n",
+    "aggregator = Aggregator(\n",
+    "    evaluation_repository, aggregation_repository, \"MyAggregator\", aggregation_logic\n",
+    ")\n",
+    "aggregation_overview = aggregator.aggregate_evaluation(\n",
+    "    *evaluation_repository.evaluation_overview_ids()\n",
+    ")\n",
+    "print(aggregation_overview)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "## Addition of new task/run\n",
+    "# Step 5\n",
+    "run_ids = [my_example_data.run_overview_1.id, my_example_data.run_overview_1.id]\n",
+    "incremental_evaluator.evaluate_additional_runs(\n",
+    "    *run_ids,\n",
+    "    previous_evaluation_ids=evaluation_repository.evaluation_overview_ids(),\n",
+    ")\n",
+    "\n",
+    "# Step 6\n",
+    "second_aggregation_overview = aggregator.aggregate_evaluation(\n",
+    "    *evaluation_repository.evaluation_overview_ids()\n",
+    ")\n",
+    "print(second_aggregation_overview)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.8"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}