Add Unitxt Multimodality Support #2364

Open · wants to merge 4 commits into base: main
1 change: 1 addition & 0 deletions .gitignore
@@ -8,6 +8,7 @@ build
dist
*.egg-info
venv
.venv/
.vscode/
temp
__pycache__
52 changes: 8 additions & 44 deletions lm_eval/tasks/unitxt/README.md
@@ -1,15 +1,17 @@
# Unitxt

Unitxt is a library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. These components are centralized in the Unitxt-Catalog, thus fostering collaboration and exploration in modern textual data workflows.

The full Unitxt catalog can be viewed in an [online explorer](https://unitxt.readthedocs.io/en/latest/docs/demo.html).

Read more about Unitxt at [www.unitxt.ai](https://www.unitxt.ai/).

### Paper

Title: `Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI`
Abstract: `https://arxiv.org/abs/2401.14019`

Unitxt is a library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. These components are centralized in the Unitxt-Catalog, thus fostering collaboration and exploration in modern textual data workflows.
Abstract: [link](https://arxiv.org/abs/2401.14019)

The full Unitxt catalog can be viewed in an online explorer. `https://unitxt.readthedocs.io/en/latest/docs/demo.html`

Homepage: https://unitxt.readthedocs.io/en/latest/index.html

### Citation

@@ -36,42 +38,4 @@ The full list of Unitxt tasks currently supported can be seen under `tasks/unitx

### Adding tasks

You can add additional tasks from the Unitxt catalog by generating new LM-Eval yaml files for these datasets.

The Unitxt task yaml files are generated via the `generate_yamls.py` script in the `tasks/unitxt` directory.

To add a yaml file for an existing Unitxt dataset which is not yet in LM-Eval:
1. Add the card name to the `unitxt_datasets` file in the `tasks/unitxt` directory.
2. The `generate_yamls.py` script defines the default Unitxt [template](https://unitxt.readthedocs.io/en/latest/docs/adding_template.html) used for each kind of NLP task in its `default_template_per_task` dictionary. If the dataset belongs to a Unitxt task type not previously used in LM-Eval, you will need to add a default template for it to that dictionary.

```
default_template_per_task = {
"tasks.classification.multi_label" : "templates.classification.multi_label.title" ,
"tasks.classification.multi_class" : "templates.classification.multi_class.title" ,
"tasks.summarization.abstractive" : "templates.summarization.abstractive.full",
"tasks.regression.two_texts" : "templates.regression.two_texts.simple",
"tasks.qa.with_context.extractive" : "templates.qa.with_context.simple",
"tasks.grammatical_error_correction" : "templates.grammatical_error_correction.simple",
"tasks.span_labeling.extraction" : "templates.span_labeling.extraction.title"
}
```
3. Run `python generate_yamls.py` (this will generate yaml files for all the datasets listed in the `unitxt_datasets` file); a sketch of a generated file is shown below.
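For orientation, a generated task yaml typically pairs a task name with the shared `unitxt` include stub and a recipe string, as in the sketch below; the dataset, card, and template names here are hypothetical placeholders, not actual catalog entries.

```
task: my_dataset
include: unitxt
recipe: card=cards.my_dataset,template=templates.classification.multi_class.title
```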

If you want to add a new dataset to the Unitxt catalog, see the Unitxt documentation:

https://unitxt.readthedocs.io/en/latest/docs/adding_dataset.html



### Checklist

For adding novel benchmarks/datasets to the library:
* [ ] Is the task an existing benchmark in the literature?
* [ ] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
See the [adding tasks guide](https://www.unitxt.ai/en/latest/docs/lm_eval.html#).
3 changes: 3 additions & 0 deletions lm_eval/tasks/unitxt/doc_vqa.yaml
@@ -0,0 +1,3 @@
task: doc_vqa
include: unitxt_multimodal
recipe: card=cards.doc_vqa.en,template=templates.qa.with_context.title
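As a usage sketch (not part of this diff), the new task could then be invoked through the harness CLI with a multimodal backend; the `hf-multimodal` model type and the checkpoint below are assumptions for illustration rather than something this PR prescribes.

```
lm_eval --model hf-multimodal \
  --model_args pretrained=llava-hf/llava-1.5-7b-hf \
  --tasks doc_vqa \
  --batch_size 1
```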
63 changes: 58 additions & 5 deletions lm_eval/tasks/unitxt/task.py
@@ -4,9 +4,12 @@
Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively.
"""

import importlib.util
import re
from functools import partial
from typing import Optional
from typing import Any, Dict, Optional

import datasets
import evaluate

from lm_eval.api.instance import Instance
@@ -25,6 +28,10 @@
"""


def is_unitxt_installed() -> bool:
return importlib.util.find_spec("unitxt") is not None


def score(items, metric):
predictions, references = zip(*items)
evaluator = evaluate.load("unitxt/metric")
@@ -41,17 +48,30 @@ def __init__(
self,
config: Optional[dict] = None,
) -> None:
if config is None:
config = {}
assert "recipe" in config, "Unitxt task must have a 'recipe' string."
super().__init__(
config={
"metadata": {"version": self.VERSION},
"dataset_kwargs": {"trust_remote_code": True},
"dataset_name": config["recipe"],
"dataset_path": "unitxt/data",
}
)
self.image_decoder = datasets.Image()
self.metrics = self.dataset["test"][0]["metrics"]

def download(self, dataset_kwargs: Optional[Dict[str, Any]] = None) -> None:
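# Prefer the locally installed unitxt loader when available; otherwise fall back to the "unitxt/data" Hub loading script.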
if is_unitxt_installed():
from unitxt import load_dataset

self.dataset = load_dataset(self.DATASET_NAME)
else:
self.dataset = datasets.load_dataset(
name=self.DATASET_NAME,
path="unitxt/data",
trust_remote_code=True,
)

def has_training_docs(self):
return "train" in self.dataset

@@ -79,6 +99,9 @@ def should_decontaminate(self):
def doc_to_target(self, doc):
return doc["target"]

def get_arguments(self, doc, ctx):
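# Overridden by UnitxtMultiModal below to attach image inputs to each request.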
return (ctx, {"until": ["\n"]})

def construct_requests(self, doc, ctx, **kwargs):
"""Uses RequestFactory to construct Requests and returns an iterable of
Requests which will be sent to the LM.
@@ -90,12 +113,11 @@ def construct_requests(self, doc, ctx, **kwargs):
language description, as well as the few shot examples, and the question
part of the document for `doc`.
"""

return [
Instance(
request_type="generate_until",
doc=doc,
arguments=(ctx, {"until": ["\n"]}),
arguments=self.get_arguments(doc, ctx),
idx=0,
**kwargs,
)
@@ -140,3 +162,34 @@ def higher_is_better(self):
whether a higher value of the submetric is better
"""
return {metric.replace("metrics.", ""): True for metric in self.metrics}


# Matches a full <img src="..."> tag (optionally self-closing); used to swap the tag for the <image> placeholder.
images_regex = r'<img\s+src=["\'](.*?)["\']\s*/?>'
# Captures only the src path from an <img> tag.
image_source_regex = r'<img\s+src=["\'](.*?)["\']'


def extract_images(text, instance):
image_sources = re.findall(image_source_regex, text)
images = []
for image_source in image_sources:
current = instance
for key in image_source.split("/"):
if key.isdigit():
key = int(key)
current = current[key]
images.append(current)
return images


class UnitxtMultiModal(Unitxt):
MULTIMODAL = True

def doc_to_text(self, doc):
return re.sub(images_regex, "<image>", doc["source"])

def doc_to_image(self, doc):
images = extract_images(doc["source"], doc)
return [self.image_decoder.decode_example(image) for image in images]

def get_arguments(self, doc, ctx):
return (ctx, {"until": ["\n"]}, {"visual": self.doc_to_image(doc)})
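To illustrate the new multimodal pathway: each `<img>` tag's `src` is treated as a path into the document itself, with every segment indexing into nested dicts or lists, while `doc_to_text` swaps the tag for the harness's `<image>` placeholder. A minimal sketch with a hypothetical document (field names are not taken from a real Unitxt card):

```
# Hypothetical document layout; real Unitxt cards define their own fields.
doc = {
    "source": 'What is shown? <img src="media/images/0">',
    "media": {"images": [{"bytes": b"...", "path": None}]},
}

extract_images(doc["source"], doc)
# -> [{"bytes": b"...", "path": None}], later decoded via datasets.Image()

re.sub(images_regex, "<image>", doc["source"])
# -> 'What is shown? <image>'
```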
1 change: 1 addition & 0 deletions lm_eval/tasks/unitxt/unitxt_multimodal
@@ -0,0 +1 @@
class: !function task.UnitxtMultiModal
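Additional multimodal datasets from the Unitxt catalog should follow the same pattern as `doc_vqa.yaml`: a new yaml file that includes this `unitxt_multimodal` stub and supplies a recipe. The card and template names below are hypothetical placeholders, not entries verified against the catalog.

```
task: chart_qa
include: unitxt_multimodal
recipe: card=cards.chart_qa,template=templates.qa.with_context.title
```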