Add IPEX models (#516)
* add IPEX model and README

update ipex modeling and add case for text-generation and text-classification

Signed-off-by: Wang, Yi A <[email protected]>

* fix style

* IPEX modeling refactorization

* typo

* remove use cache arg when loading model

* fix style

* move tests

* remove readme

* add test

* add warning if use_cache mismatch

* fix

* format

* update setup

* add use_cache attribute


---------

Signed-off-by: Wang, Yi A <[email protected]>
Co-authored-by: Feng, Jiqing <[email protected]>
echarlaix and jiqing-feng authored Jan 26, 2024
1 parent a622f4d commit 805e737
Showing 18 changed files with 969 additions and 190 deletions.
5 changes: 4 additions & 1 deletion README.md
@@ -6,6 +6,8 @@

🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.

[Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) is an open-source library that provides optimizations for both eager mode and graph mode. Compared to eager mode, graph mode in PyTorch* normally yields better performance thanks to optimization techniques such as operation fusion.
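As a rough illustration of what these graph-mode optimizations look like in practice, here is a minimal sketch; the checkpoint name and the `bfloat16` dtype are arbitrary choices for the example, not part of this change:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForSequenceClassification

# Load an eager-mode PyTorch model and let IPEX apply its optimizations
# (operator fusion, weight prepacking, ...) before running inference.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
)
model.eval()
model = ipex.optimize(model, dtype=torch.bfloat16)
```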

Intel [Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the use of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies so that users can easily generate quantized models. Users can apply static, dynamic and quantization-aware training approaches while specifying an expected accuracy criterion. It also supports different weight-pruning techniques, enabling the creation of pruned models for a predefined sparsity target.
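For context, post-training quantization through this integration typically looks like the following sketch; the checkpoint, the dynamic approach and the output directory are illustrative assumptions, not part of this diff:

```python
from transformers import AutoModelForSequenceClassification
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Dynamic post-training quantization driven by Neural Compressor.
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(quantization_config=quantization_config, save_directory="quantized_model")
```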

[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
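A minimal sketch of that OpenVINO workflow, assuming an arbitrary text-classification checkpoint; `export=True` converts the PyTorch weights to the OpenVINO IR on the fly:

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("He's a dreadful magician."))
```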
@@ -19,6 +21,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
|:-----------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------|
| [Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"` |
| [OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
| [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) | `pip install --upgrade-strategy eager "optimum[ipex]"` |

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

@@ -37,7 +40,7 @@ or to install from source including dependencies:
python -m pip install "optimum-intel[extras]"@git+https://github.com/huggingface/optimum-intel.git
```

where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
where `extras` can be one or more of `ipex`, `neural-compressor`, `openvino`, `nncf`.

# Quick tour

4 changes: 2 additions & 2 deletions docs/source/reference_inc.mdx
@@ -43,8 +43,8 @@ specific language governing permissions and limitations under the License.

## INCModelForCausalLM

[[autodoc]] neural_compressor.modeling_decoder.INCModelForCausalLM
[[autodoc]] neural_compressor.modeling_base.INCModelForCausalLM

## INCModelForSeq2SeqLM

[[autodoc]] neural_compressor.modeling_base.INCModelForSeq2SeqLM
[[autodoc]] neural_compressor.modeling_base.INCModelForSeq2SeqLM
25 changes: 21 additions & 4 deletions optimum/intel/__init__.py
@@ -35,9 +35,20 @@
    if not is_ipex_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    _import_structure["utils.dummy_ipex_objects"] = ["inference_mode"]
    from .utils import dummy_ipex_objects

    _import_structure["utils.dummy_ipex_objects"] = [
        name for name in dir(dummy_ipex_objects) if not name.startswith("_")
    ]
else:
    _import_structure["ipex"] = ["inference_mode"]
    _import_structure["ipex"] = [
        "inference_mode",
        "IPEXModelForCausalLM",
        "IPEXModelForSequenceClassification",
        "IPEXModelForMaskedLM",
        "IPEXModelForTokenClassification",
    ]


try:
if not (is_openvino_available() and is_nncf_available()):
@@ -144,9 +155,15 @@
    if not is_ipex_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    from .utils.dummy_ipex_objects import inference_mode
    from .utils.dummy_ipex_objects import *
else:
    from .ipex import inference_mode
    from .ipex import (
        IPEXModelForCausalLM,
        IPEXModelForMaskedLM,
        IPEXModelForSequenceClassification,
        IPEXModelForTokenClassification,
        inference_mode,
    )

try:
    if not (is_openvino_available() and is_nncf_available()):
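With the `optimum/intel/__init__.py` changes above, the new IPEX classes become importable from the top-level package. A rough usage sketch follows; the checkpoint name is arbitrary, and whether the `export=True` path is already wired up in this commit (as it is for the other Optimum Intel model classes) is an assumption:

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import IPEXModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
# Export the PyTorch model to TorchScript and wrap it in the IPEX-optimized class.
model = IPEXModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("This movie was surprisingly good."))
```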
4 changes: 4 additions & 0 deletions optimum/intel/generation/modeling.py
@@ -66,6 +66,7 @@ def prepare_jit_inputs(model: PreTrainedModel, task: str, use_cache: bool = Fals

def jit_trace(model: PreTrainedModel, task: str, use_cache: bool = False):
    model_inputs = prepare_jit_inputs(model, task, use_cache)
    model.config.return_dict = False
    # check if the model_inputs is correct.
    model(**model_inputs)
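The new `model.config.return_dict = False` line matters because TorchScript tracing expects tensors or tuples of tensors as outputs, while Transformers models return `ModelOutput` objects when `return_dict=True`. A minimal standalone sketch of the same idea, with an arbitrary checkpoint and hand-built inputs standing in for `prepare_jit_inputs`:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.eval()

# Return tuples instead of ModelOutput objects so torch.jit.trace can handle the outputs.
model.config.return_dict = False
inputs = tokenizer("TorchScript prefers tuple outputs.", return_tensors="pt")
with torch.no_grad():
    traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]), strict=False)
```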

@@ -106,6 +107,9 @@ def __init__(
        self.normalized_config = NormalizedConfigManager.get_normalized_config_class(config.model_type)(config)
        self.model_dtype = kwargs.get("model_dtype", None)

        logger.warning(
            f"The class `{self.__class__}` has been deprecated and will be removed in optimum-intel v1.14, please use IPEXModel instead"
        )
        if isinstance(model, torch.jit.ScriptModule):
            self.input_names = {
                inputs.debugName().split(".")[0] for inputs in model.graph.inputs() if inputs.debugName() != "self"
7 changes: 7 additions & 0 deletions optimum/intel/ipex/__init__.py
@@ -1 +1,8 @@
from optimum.intel.ipex.modeling_base import (
    IPEXModelForCausalLM,
    IPEXModelForMaskedLM,
    IPEXModelForSequenceClassification,
    IPEXModelForTokenClassification,
)

from .inference import inference_mode
141 changes: 78 additions & 63 deletions optimum/intel/ipex/inference.py
@@ -1,3 +1,19 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ruff: noqa

import logging
from typing import Union

@@ -7,7 +23,30 @@
from transformers.pipelines import Pipeline
from transformers.utils import is_ipex_available

from ..generation.modeling import TSModelForCausalLM, jit_trace
from ...exporters.tasks import TasksManager
from ..generation.modeling import jit_trace
from .modeling_base import (
    IPEXModel,
    IPEXModelForCausalLM,
    IPEXModelForMaskedLM,
    IPEXModelForSequenceClassification,
    IPEXModelForTokenClassification,
    IPEXBloomForCausalLM,
    IPEXMPTForCausalLM,
    IPEXOPTForCausalLM,
    IPEXGPTBigCodeForCausalLM,
)


from .utils import _HEAD_TO_AUTOMODELS


_MODEL_TYPE_TO_AUTOMODELS = {
    "bloom": IPEXBloomForCausalLM,
    "mpt": IPEXMPTForCausalLM,
    "opt": IPEXOPTForCausalLM,
    "big_code": IPEXGPTBigCodeForCausalLM,
}


logger = logging.getLogger(__name__)
@@ -42,17 +81,6 @@ def __getattr__(self, item):
            return self.item


class _ModelGenerationWrapper(_ModelFallbackWrapper):
    def __getattr__(self, item):
        if not item.startswith("__"):
            try:
                return getattr(self._optimized, item)
            except Exception:
                return getattr(self._default, item)
        else:
            return self.item


@add_start_docstrings(
    """
    inference_mode is an Intel specific context-manager analogous to PyTorch's inference_mode to use for inference
@@ -66,7 +94,6 @@ def __init__(
        self,
        model: Union[nn.Module, Pipeline],
        dtype: torch.dtype = torch.float32,
        jit: bool = False,
        **kwargs,
    ):
        """
@@ -88,65 +115,53 @@ def __init__(
        self._dtype = dtype
        self._graph_mode = False  # Let's keep for future use when it doesn't hang anymore
        self._original = None
        self._jit = jit

        if "jit" in kwargs:
            logger.warning(
                "`jit` is deprecated and will be removed in a future version. Use `IPEXModel` to load and export your model to TorchScript instead."
            )
        self._jit = kwargs.pop("jit", False)

    def __enter__(self):
        if self._model.framework == "pt":
            with torch.inference_mode():
                try:
                    ipex.enable_onednn_fusion(True)
                    if isinstance(self._model, Pipeline):
                        self._original = self._model.model

                        model = ipex.optimize(
                            self._model.model,
                            dtype=self._dtype,
                            graph_mode=self._graph_mode,
                            level="O1",
                            auto_kernel_selection=True,
                        )

                        # Enable automatic mixed precision (AMP) if we are going to target `bfloat16`
                        with torch.cpu.amp.autocast(
                            enabled=(self._dtype == torch.bfloat16 and self._original.dtype != torch.bfloat16)
                        ), torch.no_grad():
                            if self._jit:
                                try:
                                    use_cache = False
                                    if hasattr(self._original.config, "use_cache") and self._original.config.use_cache:
                                        use_cache = True
                                    model = jit_trace(
                                        model=model,
                                        task=self._model.task,
                                        use_cache=use_cache,
                                    )
                                    if self._model.task == "text-generation":
                                        model = TSModelForCausalLM(
                                            model=model,
                                            config=self._original.config,
                                            use_cache=use_cache,
                                            model_dtype=self._original.dtype,
                                        )
                                except Exception as e:
                                    logger.warning(f"failed to use PyTorch jit mode due to: {e}.")
                            # Patching model with the new one
                            self._model.model = _ModelGenerationWrapper(model, self._original)
                            return self._model
                    else:
                        self._original = self._model
                        model = ipex.optimize(
                            self._model,
                            dtype=self._dtype,
                            graph_mode=self._graph_mode,
                            level="O1",
                            auto_kernel_selection=True,
                    self._original = self._model.model if isinstance(self._model, Pipeline) else self._model
                    model = ipex.optimize(
                        self._original,
                        dtype=self._dtype,
                        graph_mode=self._graph_mode,
                        level="O1",
                        auto_kernel_selection=True,
                    )
                    if self._jit:
                        use_cache = getattr(self._original.config, "use_cache", False)
                        task = (
                            self._model.task
                            if isinstance(self._model, Pipeline)
                            else TasksManager._infer_task_from_model_or_model_class(model)
                        )
                        if task in _HEAD_TO_AUTOMODELS:
                            model = jit_trace(model, task, use_cache)
                            model_type = getattr(self._original.config, "model_type", "").replace("_", "-")

                            if task == "text-generation" and model_type in _MODEL_TYPE_TO_AUTOMODELS.keys():
                                auto_model_class = _MODEL_TYPE_TO_AUTOMODELS[model_type]
                            else:
                                auto_model_class = eval(_HEAD_TO_AUTOMODELS[task])

                            model = auto_model_class(model, self._original.config, use_cache=use_cache)

                    # Enable automatic mixed precision (AMP) if we are going to target `bfloat16`
                    with torch.cpu.amp.autocast(enabled=self._dtype == torch.bfloat16):
                        if isinstance(self._model, Pipeline):
                            # Patching model with the new one
                            self._model.model = _ModelFallbackWrapper(model, self._original)
                            return self._model
                        return model

                        # Enable automatic mixed precision (AMP) if we are going to target `bfloat16`
                        with torch.cpu.amp.autocast(
                            enabled=(self._dtype == torch.bfloat16 and self._original.dtype != torch.bfloat16)
                        ):
                            return model
                except RuntimeError:
                    return self._model
        else:
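Taken together, the reworked `inference_mode` context manager can still be driven roughly as below; the pipeline task, checkpoint and `bfloat16` target are arbitrary choices for the sketch, and `jit=True` is the now-deprecated path that this commit routes through the new IPEX model classes:

```python
import torch
from transformers import pipeline
from optimum.intel import inference_mode

# Any PyTorch-backed transformers pipeline; this checkpoint is illustrative.
pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

# `jit=True` now only emits a deprecation warning and is consumed via **kwargs.
with inference_mode(pipe, dtype=torch.bfloat16, jit=True) as opt_pipe:
    print(opt_pipe("A rather pleasant surprise."))
```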
