From be6c0cca584301a6c0d4316b0f0b9e19f0be2aea Mon Sep 17 00:00:00 2001 From: Nicholas Junge Date: Tue, 26 Mar 2024 15:26:46 +0100 Subject: [PATCH] Change guide to HuggingFace model benchmarking Now advertises memos as the canonical way of using caching+lazy loading in combination for memory efficiency. Changes the example structure to three standalone Python files instead of a project. This is because we disentangled the files by removing the argument parser, and trivialized the runner.py file. Changes the guide name to mention HuggingFace as the tech of interest in this example. --- docs/tutorials/artifact_benchmarking.md | 109 ------------------ docs/tutorials/huggingface.md | 107 +++++++++++++++++ docs/tutorials/index.md | 2 +- examples/artifact_benchmarking/pyproject.toml | 19 --- .../artifact_benchmarking/requirements.txt | 47 -------- .../artifact_benchmarking/src/__init__.py | 0 .../src/training/__init__.py | 0 .../src => huggingface}/benchmark.py | 0 .../src => huggingface}/runner.py | 0 .../src/training => huggingface}/training.py | 0 mkdocs.yml | 2 +- 11 files changed, 109 insertions(+), 177 deletions(-) delete mode 100644 docs/tutorials/artifact_benchmarking.md create mode 100644 docs/tutorials/huggingface.md delete mode 100644 examples/artifact_benchmarking/pyproject.toml delete mode 100644 examples/artifact_benchmarking/requirements.txt delete mode 100644 examples/artifact_benchmarking/src/__init__.py delete mode 100644 examples/artifact_benchmarking/src/training/__init__.py rename examples/{artifact_benchmarking/src => huggingface}/benchmark.py (100%) rename examples/{artifact_benchmarking/src => huggingface}/runner.py (100%) rename examples/{artifact_benchmarking/src/training => huggingface}/training.py (100%) diff --git a/docs/tutorials/artifact_benchmarking.md b/docs/tutorials/artifact_benchmarking.md deleted file mode 100644 index a1b375a..0000000 --- a/docs/tutorials/artifact_benchmarking.md +++ /dev/null @@ -1,109 +0,0 @@ -# Benchmark on saved models -There is a high likelihood that you, at some point, find yourself wanting to benchmark models that were trained previously. -In this guide we will walk through how we can accomplish this with nnbench. - -## Example: Named Entity Recognition -We will start with an aside that talks through the setup of the example we will use in this guide. -If you are only interested in the application of nnbench, you can skip this section. - -There are lots of reasons why you could want to retrieve saved models for benchmarking. -Among them these are reviewing the work of colleagues, comparing experimental performances to an existing benchmark, or dealing with models that require significant compute such that in-place retraining is impractical. -For this example, we deal with a named entity recognition (NER) model that is based on the pre-trained encoder-decoder transformer [BERT](https://arxiv.org/abs/1810.04805). -The model is trained on the [CoNLLpp dataset](https://huggingface.co/datasets/conllpp) which consists of sentences from news stories where words were tagged with Person, Organization, Location, or Miscellaneous if they referred to entities. -Words are assigned an out-of-entity label if they do not represent an entity. - -### Model Training -You find the code to train the model in the nnbench [repository](https://github.com/aai-institute/nnbench) in the directory `examples/artifact_benchmarking/src/training/training.py`. 
-If you want to skip running the training script but still want to reproduce this example, you can take any BERT model fine tuned for NER with the CoNLL dataset family. -You find many on the Huggingface model hub, for example [this one](https://huggingface.co/dslim/bert-base-NER). You need to download the `model.safetensors`, `config.json`, `tokenizer_config.json`, and `tokenizer.json` files. -If you want to train your own model, continue below. - -There is some necessary preprocessing and data wrangling to train the model. -We will not go into the details here. But if you are interested in a more thorough walkthrough, look into this [resource](https://huggingface.co/learn/nlp-course/chapter7/2?fw=pt) by Huggingface which served as the basis for this example. -It is not feasible to train the model on a CPU. If you do not have access to a GPU you can use free GPU instances on [Google Colab](https://colab.research.google.com/). -When you open a new notebook there make sure to select a GPU instance in the upper right corner. -The you can upload the `training.py`. -You can ignore An eventual warning that the data is not persisted. -Next, install the necessary dependencies: `!pip install datasets transformers[torch]`. -Google Colab comes with some dependencies already installed in the environment. -Hence, if you are working with a different GPU instance, make sure to install everything from the `pyproject.toml` in the `examples/artifact_benchmarking` folder. -Next you can execute the `training.py` with `!python training.py`. -This will train two BERT models ("bert-base-uncased" and "distilbert-base-uncased") which we can compare using nnbench. -If you want, you can adapt the training script to train other models by editing the tuples in the `tokenizers_and_models` list at the bottom of the training script. -The training of the models takes around 10 minutes. -Once it is done, download the respective files and save them to your disk. -They should be the same mentioned above. -We will need the path to the files for benchmarking later. - -### The Benchmarks -The benchmarking code is found in the `examples/artifact_benchmarking/benchmark.py`. -We calculate precision, recall, accuracy, and f1 scores for the whole test set and specific labels. -Additionally, we obtain information about the model such as its memory footprint and inference time. -We are not walking through the whole file but instead point out certain design choices as an inspiration to you. -If you are interested in a more detailed walkthrough on how to set up benchmarks, you can find it [here](../guides/benchmarks.md). -Notable design choices in this benchmark are that we factored out the evaluation loop as it is necessary for all evaluation metrics. We cache it using the `functools.lru_cache` decorator so the evaluation loop runs only once per benchmark run instead of once per metric which greatly reduces runtime. -We use `nnbench.parametrize` to get the per-class metrics. -As the parametrization method needs the same arguments for each benchmark, we use Python's builtin `functools.partial` to fill the arguments. -One noteworthy subtlety here is that we need to call the partial immediately so it returns the pre-filled `nnbench.parametrize` decorators. -If we don't do that, the `runner.collect` does not find the respective benchmarks. - -## Running Benchmarks with saved Artifacts -Now that we have explained the example, let's jump into the benchmarking code. 
-You find it in the nnbench repository in `examples/artifact_benchmarking/src/runner.py`. - -The benchmarking code is written to be executed as a script that consumes any number of file paths to models to benchmark as arguments, `python runner.py /path/to/model1 /path/to/model2`. -The parsing of the arguments is handled by the last lines in the script which then calls the `main()` function: - -```python ---8<-- "examples/artifact_benchmarking/src/runner.py:91:95" -``` - -### Artifact Classes -The `main()` function first sets up the benchmark reporter and the runner. -Then we create a list of models. These models are instances of a `TokenClassificationModel`, a custom class we implemented which inherits from the `Artifact` base class. - -```python ---8<-- "examples/artifact_benchmarking/src/runner.py:30:37" -``` - -The `Artifact` class is a typesafe wrapper around serialized data of any kind. -It allows for lazy deserialization of artifacts from a path attribute. -This attribute is set by the `ArtifactLoader` (which we will cover in a moment) that supplied a path to the local artifact disk storage. -In our derived class, we have to override the `deserialize()` method to properly load the artifact value into memory. - -```python ---8<-- "examples/artifact_benchmarking/src/runner.py:56:57" -``` - -The `deserialize()` method has to set the `self._value` attribute to the value we want to access later. -In this case, we assign it a tuple containing the Huggingface Transformers model and tokenizer. - -We do similar with the CoNLLpp dataset. - -```python ---8<-- "examples/artifact_benchmarking/src/runner.py:30:37" -``` - -In this case, we store the `datasets.Dataset` object as well ass a dictionary which maps the label id to a semantic string in the `_value` attribute. -The value that we store in the `_value` of an artifact can be of any kind and that we use tuples in both instances here is a circumstance. - -### Artifact Loaders -Upon instantiation of an `Artifact` or derived classes we need to supply an `ArtifactLoader` (or a class derived from it). `ArtifactLoader`s are classes that implement a `load()` method that resolves to a path-like object or string which points to the local storage location of the artifact. This method is used by the Artifact class. - -For our models, we use the provided `LocalArtifactLoader` which consumes a path and passes it on later. - -```python ---8<-- "examples/artifact_benchmarking/src/runner.py:52:54" -``` - -We have a little more logic with respect to the dataset as we handle the train test split as well. - -```python ---8<-- "examples/artifact_benchmarking/src/runner.py:19:27" -``` - -Now we execute the benchmark in the loop over the different models. - -```python ---8<-- "examples/artifact_benchmarking/src/runner.py:59:88" -``` diff --git a/docs/tutorials/huggingface.md b/docs/tutorials/huggingface.md new file mode 100644 index 0000000..40f36d3 --- /dev/null +++ b/docs/tutorials/huggingface.md @@ -0,0 +1,107 @@ +# Benchmarking HuggingFace models on a dataset +There is a high likelihood that you, at some point, find yourself wanting to benchmark previously trained models. +This guide shows you how to do it for a HuggingFace model with nnbench. + +## Example: Named Entity Recognition +We start with a small tangent about the example setup that we will use in this guide. +If you are only interested in the application of nnbench, you can skip this section. + +There are lots of reasons why you could want to retrieve saved models for benchmarking. 
+Among them are reviewing the work of colleagues, comparing model performance to an existing benchmark, or dealing with models that require so much compute that retraining them in place is impractical.
+
+For this example, we look at a named entity recognition (NER) model based on the pre-trained transformer encoder [BERT](https://arxiv.org/abs/1810.04805) from HuggingFace.
+The model is trained on the [CoNLLpp dataset](https://huggingface.co/datasets/conllpp), which consists of sentences from news stories in which words are tagged as Person, Organization, Location, or Miscellaneous if they refer to entities.
+Words are assigned an out-of-entity label if they do not represent an entity.
+
+## Model Training
+You can find the code to train the model in the nnbench [repository](https://github.com/aai-institute/nnbench/tree/main/examples/huggingface).
+If you want to skip running the training script but still want to reproduce this example, you can take any BERT model fine-tuned for NER on the CoNLL dataset family.
+You can find many on the HuggingFace model hub, for example [this one](https://huggingface.co/dslim/bert-base-NER). You need to download the `model.safetensors`, `config.json`, `tokenizer_config.json`, and `tokenizer.json` files.
+If you want to train your own model, continue below.
+
+There is some necessary preprocessing and data wrangling to train the model.
+We will not go into the details here, but if you are interested in a more thorough walkthrough, look into this [resource](https://huggingface.co/learn/nlp-course/chapter7/2?fw=pt) by HuggingFace, which served as the basis for this example.
+
+It is not feasible to train the model on a CPU. If you do not have access to a GPU, you can use free GPU instances on [Google Colab](https://colab.research.google.com/).
+When opening a new Colab notebook, make sure to select a GPU instance in the upper right corner.
+Then, you can upload `training.py`. You can ignore any warnings about the data not being persisted.
+
+Next, install the necessary dependencies: `!pip install datasets transformers[torch]`. Google Colab comes with some dependencies already installed in the environment.
+Hence, if you are working with a different GPU instance, make sure to install the required dependencies (at minimum `datasets` and `transformers[torch]`) yourself.
+
+Finally, you can execute the training script with `!python training.py`.
+This will train two BERT models ("bert-base-uncased" and "distilbert-base-uncased"), which we can compare using nnbench.
+If you want, you can adapt the training script to train other models by editing the tuples in the `tokenizers_and_models` list at the bottom of the training script.
+The training of the models takes around 10 minutes.
+
+Once it is done, download the respective files and save them to your disk.
+They should be the same files mentioned above. We will need the paths to the files for benchmarking later.
+
+## The benchmarks
+
+The benchmarking code is found in `examples/huggingface/benchmark.py`.
+We calculate precision, recall, accuracy, and F1 scores for the whole test set and for specific labels.
+Additionally, we obtain information about the model such as its memory footprint and inference time.
+
+We will not walk through the whole file, but instead point out certain design choices as inspiration.
+If you are interested in a more detailed walkthrough on how to set up benchmarks, you can find it [here](../guides/benchmarks.md).
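+
+To give a rough idea of the shape of such a benchmark before we look at the actual file, here is a minimal sketch of a single metric benchmark.
+It assumes the standard `@nnbench.benchmark` decorator; the `model` and `dataset` parameters and the accuracy computation are illustrative placeholders, not the real code from `benchmark.py`.
+
+```python
+# Illustrative sketch only - not the actual contents of examples/huggingface/benchmark.py.
+from typing import Any
+
+import nnbench
+
+
+@nnbench.benchmark
+def accuracy(model: Any, dataset: Any) -> float:
+    """Token-level accuracy over the test set (placeholder logic, assumed interfaces)."""
+    correct, total = 0, 0
+    for example in dataset:
+        predictions = model(example["tokens"])  # assumed callable model interface
+        labels = example["ner_tags"]
+        correct += sum(int(p == label) for p, label in zip(predictions, labels))
+        total += len(labels)
+    return correct / total
+```
+
+The real benchmarks follow this shape, but share a cached evaluation loop and draw their inputs from memos, as described below.
+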
+One notable design choice in this benchmark is that we factored out the evaluation loop, since it is needed for all evaluation metrics.
+We cache it using the `functools.cache` decorator so that the evaluation loop runs only once per benchmark run instead of once per metric, which greatly reduces runtime.
+
+We also use `nnbench.parametrize` to get the per-class metrics.
+As the parametrization method needs the same arguments for each benchmark, we use Python's built-in `functools.partial` to fill in the arguments.
+
+```python
+--8<-- "examples/huggingface/benchmark.py:131:139"
+```
+
+!!! Tip
+    In this parametrization, the model path is hardcoded to "dslim/distilbert-NER" on the HuggingFace Hub.
+    When benchmarking other models, be sure to change this path to the actual model you want to benchmark.
+
+After this, the benchmarking code is actually very simple, as in most of the other examples.
+You can find it in the nnbench repository in `examples/huggingface/runner.py`.
+
+## Custom memo classes
+
+The parametrization contains a list of models, each an instance of `TokenClassificationModelMemo`, a custom class we implemented that inherits from the `nnbench.Memo` class.
+A big advantage of a memo in this case is its ability to lazy-load models and to later evict the loaded models from the cache again.
+
+```python
+--8<-- "examples/huggingface/benchmark.py:23:35"
+```
+
+The `Memo` class is a generic wrapper around serialized data of any kind.
+It allows for lazy deserialization of artifacts from uniquely identifying metadata such as storage paths, checksums, or, in our case, model names on the HuggingFace Hub.
+In our derived class, we have to override the `Memo.__call__()` method to properly load the memoized value into memory.
+
+We do something similar with the CoNLLpp dataset.
+
+```python
+--8<-- "examples/huggingface/benchmark.py:51:64"
+```
+
+In this case, we lazy-load the `datasets.Dataset` object.
+In the following `IndexLabelMapMemo` class, we store a dictionary mapping the label ID to a semantic string.
+
+```python
+--8<-- "examples/huggingface/benchmark.py:67:82"
+```
+
+!!! Info
+    There is no need to type-hint `TokenClassificationModelMemo`s in the corresponding benchmarks -
+    the benchmark runner takes care of filling in the memoized values in place of the memos themselves.
+
+Because we implemented our memoized values as four different memo classes, the benchmark input parameters stay modular -
+we only need to reference a memo where it is actually used. Considering the recall benchmarks:
+
+```python
+--8<-- "examples/huggingface/benchmark.py:174:204"
+```
+
+we see that the memoized `index_label_mapping` argument is only needed in the per-class benchmark, so it never has to be passed to the main recall computation.
+
+!!! Tip
+    When implementing memos for a benchmark workload, using only one value per memo at the cost of another class definition is often worth it,
+    since you have more direct control over what goes into your benchmarks and can avoid unused parameters altogether.
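+
+As a rough illustration of that advice, here is a minimal sketch of a single-value memo for the test split of the dataset.
+It follows the behavior described above - subclassing `nnbench.Memo` and overriding `__call__()` - but the class name `CoNLLppTestSplitMemo`, the generic `Memo[Dataset]` subscript, and the attribute-based caching are illustrative assumptions rather than the literal code from `benchmark.py`.
+
+```python
+# Illustrative sketch only - the real memo classes live in examples/huggingface/benchmark.py.
+from typing import Optional
+
+import nnbench
+from datasets import Dataset, load_dataset
+
+
+class CoNLLppTestSplitMemo(nnbench.Memo[Dataset]):
+    """Lazily loads the CoNLLpp test split the first time a benchmark requests it."""
+
+    _dataset: Optional[Dataset] = None  # populated on first access
+
+    def __call__(self) -> Dataset:
+        # The (potentially expensive) download and load only happens on first access.
+        if self._dataset is None:
+            self._dataset = load_dataset("conllpp", split="test")
+        return self._dataset
+```
+
+Passing such a memo through `nnbench.parametrize`, as shown for the real memo classes above, means the dataset is only downloaded when a benchmark actually asks for it.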
diff --git a/docs/tutorials/index.md b/docs/tutorials/index.md index 3f8e177..31e552f 100644 --- a/docs/tutorials/index.md +++ b/docs/tutorials/index.md @@ -8,4 +8,4 @@ Click any of the links below for inspiration on how to use nnbench in your proje * [Using a streamlit web app to dispatch benchmarks](streamlit.md) * [Analyzing benchmark results at scale with duckDB](duckdb.md) * [Streaming benchmark results to a cloud database (Google BigQuery)](bq.md) -* [How to benchmark previously trained models via artifacts](artifact_benchmarking.md) +* [How to benchmark pre-trained HuggingFace models with memos](huggingface.md) diff --git a/examples/artifact_benchmarking/pyproject.toml b/examples/artifact_benchmarking/pyproject.toml deleted file mode 100644 index 0ddb1b9..0000000 --- a/examples/artifact_benchmarking/pyproject.toml +++ /dev/null @@ -1,19 +0,0 @@ -[build-system] -requires = ["setuptools>=45", "setuptools-scm[toml]>=7.1"] -build-backend = "setuptools.build_meta" - -[project] -name = "nnbench-saved-artifacts" -requires-python = ">= 3.8" -version = "0.1.0" -description = "Integration example of Prefect with nnbench." -readme = "docs/guides/prefect.md" -license = { text = "Apache-2.0" } -dependencies = [ - "transformers", - "datasets", - "torch", - "nnbench@git+https://github.com/aai-institute/nnbench.git" -] -maintainers = [{name="Max Mynter", email="m.mynter@appliedai-institute.de"}] -authors = [{ name = "appliedAI Initiative", email = "info+oss@appliedai.de" }] diff --git a/examples/artifact_benchmarking/requirements.txt b/examples/artifact_benchmarking/requirements.txt deleted file mode 100644 index 96fd684..0000000 --- a/examples/artifact_benchmarking/requirements.txt +++ /dev/null @@ -1,47 +0,0 @@ -# -# This file is autogenerated by pip-compile with Python 3.11 -# by the following command: -# -# pip-compile --extra=dev --no-annotate -# -aiohttp==3.9.3 -aiosignal==1.3.1 -attrs==23.2.0 -certifi==2024.2.2 -charset-normalizer==3.3.2 -datasets==2.18.0 -dill==0.3.8 -filelock==3.13.1 -frozenlist==1.4.1 -fsspec[http]==2024.2.0 -huggingface-hub==0.21.3 -idna==3.6 -jinja2==3.1.3 -markupsafe==2.1.5 -mpmath==1.3.0 -multidict==6.0.5 -multiprocess==0.70.16 -networkx==3.2.1 -numpy==1.26.4 -packaging==23.2 -pandas==2.2.1 -pyarrow==15.0.0 -pyarrow-hotfix==0.6 -python-dateutil==2.9.0.post0 -pytz==2024.1 -pyyaml==6.0.1 -regex==2023.12.25 -requests==2.31.0 -safetensors==0.4.2 -six==1.16.0 -sympy==1.12 -tabulate==0.9.0 -tokenizers==0.15.2 -torch==2.2.1 -tqdm==4.66.2 -transformers==4.38.2 -typing-extensions==4.10.0 -tzdata==2024.1 -urllib3==2.2.1 -xxhash==3.4.1 -yarl==1.9.4 diff --git a/examples/artifact_benchmarking/src/__init__.py b/examples/artifact_benchmarking/src/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/examples/artifact_benchmarking/src/training/__init__.py b/examples/artifact_benchmarking/src/training/__init__.py deleted file mode 100644 index e69de29..0000000 diff --git a/examples/artifact_benchmarking/src/benchmark.py b/examples/huggingface/benchmark.py similarity index 100% rename from examples/artifact_benchmarking/src/benchmark.py rename to examples/huggingface/benchmark.py diff --git a/examples/artifact_benchmarking/src/runner.py b/examples/huggingface/runner.py similarity index 100% rename from examples/artifact_benchmarking/src/runner.py rename to examples/huggingface/runner.py diff --git a/examples/artifact_benchmarking/src/training/training.py b/examples/huggingface/training.py similarity index 100% rename from 
examples/artifact_benchmarking/src/training/training.py rename to examples/huggingface/training.py diff --git a/mkdocs.yml b/mkdocs.yml index 0e29be4..80df33c 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -26,7 +26,7 @@ nav: - guides/transforms.md - Examples: - tutorials/index.md - - tutorials/artifact_benchmarking.md + - tutorials/huggingface.md - tutorials/mnist.md - tutorials/prefect.md - tutorials/streamlit.md