feat: Add How-to describing complete incremental evaluation workflow
TASK: IL-313
1 parent
20db23b
commit c0370ba
Showing 5 changed files with 225 additions and 4 deletions.
205 changes: 205 additions & 0 deletions
src/documentation/how_tos/how_to_implement_complete_incremental_evaluation_flow.ipynb
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,205 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from documentation.how_tos.example_data import (\n", | |
"    DummyAggregationLogic,\n", | |
"    DummyEvaluation,\n", | |
"    DummyExample,\n", | |
"    DummyTask,\n", | |
")\n", | |
"from intelligence_layer.evaluation import (\n", | |
"    Aggregator,\n", | |
"    IncrementalEvaluator,\n", | |
"    InMemoryAggregationRepository,\n", | |
"    InMemoryEvaluationRepository,\n", | |
"    InMemoryRunRepository,\n", | |
"    Runner,\n", | |
")\n", | |
"from intelligence_layer.evaluation.dataset.domain import Example\n", | |
"from intelligence_layer.evaluation.dataset.in_memory_dataset_repository import (\n", | |
"    InMemoryDatasetRepository,\n", | |
")\n", | |
"from intelligence_layer.evaluation.evaluation.evaluator.incremental_evaluator import (\n", | |
"    IncrementalEvaluationLogic,\n", | |
")\n", | |
"from intelligence_layer.evaluation.run.domain import SuccessfulExampleOutput" | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# How to implement complete incremental evaluation workflows from running (multiple) tasks to aggregation\n", | ||
"This notebook outlines how to:\n", | ||
" - run multiple tasks and configurations on the same dataset\n", | ||
" - perform evaluations incrementally, i.e., add further runs to your existing evaluations without re-evaluating the runs that are already covered\n", | |
" - run aggregation on these evaluations\n", | ||
" \n", | ||
"## Step-by-Step Guide\n", | ||
"1. Setup:\n", | ||
"- Initialize all necessary repositories: \n", | ||
" - dataset\n", | ||
" - run\n", | ||
" - evaluation\n", | ||
" - aggregation\n", | ||
"- Create dataset from example(s)\n", | ||
"- Initialize task(s)\n", | |
"- Initialize `Runner` for each task \n", | ||
"2. Run task(s) for the dataset (see [here](./how_to_run_a_task_on_a_dataset.ipynb))\n", | ||
"3. Compose a list of IDs of runs you want to evaluate.\n", | ||
"4. Define and initialize an `IncrementalEvaluationLogic`. This is similar to a normal `EvaluationLogic` (see [here](./how_to_implement_a_simple_evaluation_and_aggregation_logic.ipynb)), but you also have to implement your own `do_incremental_evaluate` method.\n", | |
"5. Initialize an `IncrementalEvaluator` with the repositories and your custom `IncrementalEvaluationLogic`\n", | ||
"6. Call the `evaluate_runs` method of the `IncrementalEvaluator` to evaluate the run(s) and create a single `EvaluationOverview`\n", | ||
"7. Aggregate your evaluation of the run(s) using the [standard aggregation](./how_to_aggregate_evaluations.ipynb) or using a [custom aggregation logic](./how_to_implement_a_simple_evaluation_and_aggregation_logic.ipynb)\n", | ||
"\n", | ||
"#### Steps for adding new runs\n", | |
"8. Define and run some new task(s)\n", | |
"9. Compile a list of the IDs of the previous evaluations whose runs should not be re-evaluated\n", | |
"10. Call the `evaluate_additional_runs` method of the `IncrementalEvaluator`:\n", | |
" - `run_ids`: Runs to be included in the evaluation results, including those that have been evaluated before\n", | |
" - `previous_evaluation_ids`: IDs of previous evaluations whose runs are **not** to be re-evaluated; how they are skipped depends on the specific implementation of the `do_incremental_evaluate` method\n", | |
"11. Aggregate all your `EvaluationOverview`s in your `EvaluationRepository`" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Preparation\n", | ||
"examples = [\n", | ||
"    DummyExample(input=\"input1\", expected_output=\"expected_output1\", data=\"data1\")\n", | |
"]\n", | ||
"\n", | ||
"# Step 1\n", | ||
"dataset_repository = InMemoryDatasetRepository()\n", | ||
"run_repository = InMemoryRunRepository()\n", | ||
"evaluation_repository = InMemoryEvaluationRepository()\n", | ||
"aggregation_repository = InMemoryAggregationRepository()\n", | ||
"\n", | ||
"my_dataset = dataset_repository.create_dataset(examples, \"MyDataset\")\n", | ||
"\n", | ||
"first_task = DummyTask()\n", | ||
"first_runner = Runner(first_task, dataset_repository, run_repository, \"MyFirstRun\")\n", | ||
"\n", | ||
"# Step 2\n", | ||
"first_run_overview = first_runner.run_dataset(my_dataset.id)\n", | ||
"print(f\"ID of first run: {first_run_overview.id}\")\n", | ||
"\n", | ||
"# Step 3\n", | ||
"run_overview_ids_for_first_evaluation = []\n", | ||
"for run_overview in run_repository.run_overviews():\n", | |
"    if (\n", | |
"        run_overview.description == \"MyFirstRun\"\n", | |
"    ):  # Filter for the runs you want to include in the evaluation\n", | |
"        run_overview_ids_for_first_evaluation.append(run_overview.id)\n", | |
"\n", | ||
"\n", | ||
"# Step 4\n", | ||
"class DummyIncrementalEvaluationLogic(\n", | |
"    IncrementalEvaluationLogic[str, str, str, DummyEvaluation]\n", | |
"):\n", | |
"    def do_incremental_evaluate(\n", | |
"        self,\n", | |
"        example: Example[str, str],\n", | |
"        outputs: list[SuccessfulExampleOutput[str]],\n", | |
"        already_evaluated_outputs: list[list[SuccessfulExampleOutput[str]]],\n", | |
"    ) -> DummyEvaluation:\n", | |
"        output_str = \"(\" + (\", \".join(o.output for o in outputs)) + \")\"\n", | |
"        return DummyEvaluation(\n", | |
"            eval=f\"{example.input}, {example.expected_output}, {output_str}, {already_evaluated_outputs} -> evaluation\"\n", | |
"        )\n", | |
"\n", | ||
"\n", | ||
"incremental_evaluation_logic = DummyIncrementalEvaluationLogic()\n", | ||
"\n", | ||
"# Step 5\n", | ||
"incremental_evaluator = IncrementalEvaluator(\n", | |
"    dataset_repository,\n", | |
"    run_repository,\n", | |
"    evaluation_repository,\n", | |
"    \"My incremental evaluation\",\n", | |
"    incremental_evaluation_logic,\n", | |
")\n", | |
"\n", | |
"# Step 6\n", | |
"evaluation_overview_first_task = incremental_evaluator.evaluate_runs(\n", | |
"    *run_overview_ids_for_first_evaluation\n", | |
")\n", | |
"\n", | ||
"# Step 7\n", | ||
"aggregation_logic = DummyAggregationLogic()\n", | ||
"aggregator = Aggregator(\n", | |
"    evaluation_repository, aggregation_repository, \"MyAggregator\", aggregation_logic\n", | |
")\n", | |
"first_aggregation_overview = aggregator.aggregate_evaluation(\n", | |
"    *evaluation_repository.evaluation_overview_ids()\n", | |
")\n", | |
"print(f\"First aggregation: {first_aggregation_overview}\")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"## Addition of new task/run\n", | ||
"# Step 8\n", | ||
"second_task = DummyTask()\n", | ||
"second_runner = Runner(second_task, dataset_repository, run_repository, \"MySecondRun\")\n", | ||
"second_run_overview = second_runner.run_dataset(my_dataset.id)\n", | ||
"print(f\"ID of second run: {second_run_overview.id}\")\n", | ||
"\n", | ||
"# Step 9\n", | ||
"previous_evaluation_ids = evaluation_repository.evaluation_overview_ids()\n", | |
"\n", | |
"# Step 10\n", | |
"incremental_evaluator.evaluate_additional_runs(\n", | |
"    *run_repository.run_overview_ids(),\n", | |
"    previous_evaluation_ids=previous_evaluation_ids,\n", | |
")\n", | |
"\n", | ||
"# Step 11\n", | ||
"second_aggregation_overview = aggregator.aggregate_evaluation(\n", | ||
"    *evaluation_repository.evaluation_overview_ids()\n", | |
")\n", | ||
"print(second_aggregation_overview)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.8" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 2 | ||
} |
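The `DummyIncrementalEvaluationLogic` in the notebook above only formats its arguments into a string, so it does not show how `already_evaluated_outputs` can save work. The following is a minimal, library-free sketch of that idea, not the Intelligence Layer implementation. `Output`, `incremental_pairwise_evaluate`, `longer_output_wins`, and the `compare` callback are hypothetical names introduced for illustration. It assumes that the evaluation compares runs pairwise and that each inner list of `already_evaluated_outputs` holds the outputs covered by one earlier evaluation.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable


@dataclass(frozen=True)
class Output:
    """Hypothetical stand-in for the successful output of one run."""

    run_id: str
    output: str


def incremental_pairwise_evaluate(
    outputs: list[Output],
    already_evaluated_outputs: list[list[Output]],
    compare: Callable[[Output, Output], str],
) -> dict[tuple[str, str], str]:
    """Compare only those pairs of runs that no earlier evaluation covered."""
    # Assumption: an earlier evaluation compared every pair among its outputs.
    covered: set[frozenset[str]] = set()
    for previous in already_evaluated_outputs:
        for a, b in combinations(previous, 2):
            covered.add(frozenset((a.run_id, b.run_id)))

    results: dict[tuple[str, str], str] = {}
    for a, b in combinations(outputs, 2):
        if frozenset((a.run_id, b.run_id)) in covered:
            continue  # Already compared earlier, so skip the costly call.
        results[(a.run_id, b.run_id)] = compare(a, b)
    return results


def longer_output_wins(a: Output, b: Output) -> str:
    """Toy comparison: the run with the longer output wins (ties go to b)."""
    return a.run_id if len(a.output) > len(b.output) else b.run_id


run_1 = Output("run-1", "short")
run_2 = Output("run-2", "a longer answer")
run_3 = Output("run-3", "the longest answer of all")

# First evaluation: nothing was evaluated before, so the one pair is compared.
first = incremental_pairwise_evaluate([run_1, run_2], [], longer_output_wins)

# A third run is added: only the two pairs involving run-3 are compared.
second = incremental_pairwise_evaluate(
    [run_1, run_2, run_3], [[run_1, run_2]], longer_output_wins
)
```

In this sketch the second call performs two comparisons instead of three. With an expensive comparison, such as a model-based grader, skipping pairs that were already covered is where an incremental evaluation would pay off.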