Add IPEX models (#516)
* add IPEX model and README

update ipex modeling and add case for text-generation and text-classification

Signed-off-by: Wang, Yi A <[email protected]>

* fix style

* IPEX modeling refactorization

* typo

* remove use cache arg when loading model

* fix style

* move tests

* remove readme

* add test

* add warning if use_cache mismatch

* fix

* format

* update setup

* add use_cache attribute


---------

Signed-off-by: Wang, Yi A <[email protected]>
Co-authored-by: Feng, Jiqing <[email protected]>
echarlaix and jiqing-feng authored Jan 26, 2024
1 parent a622f4d commit 805e737
Showing 18 changed files with 969 additions and 190 deletions.
5 changes: 4 additions & 1 deletion README.md
@@ -6,6 +6,8 @@

🤗 Optimum Intel is the interface between the 🤗 Transformers and Diffusers libraries and the different tools and libraries provided by Intel to accelerate end-to-end pipelines on Intel architectures.

[Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) is an open-source library that provides optimizations for both eager mode and graph mode. Compared to eager mode, graph mode in PyTorch* normally yields better performance thanks to optimization techniques such as operation fusion.
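As a rough illustration of what these graph-mode optimizations look like in practice, here is a minimal sketch; the checkpoint name and the `bfloat16` dtype are arbitrary choices for the example, not part of this change:

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForSequenceClassification

# Load an eager-mode PyTorch model and let IPEX apply its optimizations
# (operator fusion, weight prepacking, ...) before running inference.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
)
model.eval()
model = ipex.optimize(model, dtype=torch.bfloat16)
```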

Intel [Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) is an open-source library enabling the use of the most popular compression techniques such as quantization, pruning and knowledge distillation. It supports automatic accuracy-driven tuning strategies so that users can easily generate quantized models. Users can apply static, dynamic and quantization-aware training approaches while specifying an expected accuracy criterion. It also supports different weight-pruning techniques, enabling the creation of pruned models for a predefined sparsity target.
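For context, post-training quantization through this integration typically looks like the following sketch; the checkpoint, the dynamic approach and the output directory are illustrative assumptions, not part of this diff:

```python
from transformers import AutoModelForSequenceClassification
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Dynamic post-training quantization driven by Neural Compressor.
quantization_config = PostTrainingQuantConfig(approach="dynamic")
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(quantization_config=quantization_config, save_directory="quantized_model")
```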

[OpenVINO](https://docs.openvino.ai/latest/index.html) is an open-source toolkit that enables high performance inference capabilities for Intel CPUs, GPUs, and special DL inference accelerators ([see](https://docs.openvino.ai/latest/openvino_docs_OV_UG_supported_plugins_Supported_Devices.html) the full list of supported devices). It is supplied with a set of tools to optimize your models with compression techniques such as quantization, pruning and knowledge distillation. Optimum Intel provides a simple interface to optimize your Transformers and Diffusers models, convert them to the OpenVINO Intermediate Representation (IR) format and run inference using OpenVINO Runtime.
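A minimal sketch of that OpenVINO workflow, assuming an arbitrary text-classification checkpoint; `export=True` converts the PyTorch weights to the OpenVINO IR on the fly:

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import OVModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
model = OVModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("He's a dreadful magician."))
```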
@@ -19,6 +21,7 @@ To install the latest release of 🤗 Optimum Intel with the corresponding requi
|:-----------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------|
| [Intel Neural Compressor](https://www.intel.com/content/www/us/en/developer/tools/oneapi/neural-compressor.html) | `pip install --upgrade-strategy eager "optimum[neural-compressor]"` |
| [OpenVINO](https://docs.openvino.ai/latest/index.html) | `pip install --upgrade-strategy eager "optimum[openvino,nncf]"` |
| [Intel Extension for PyTorch](https://intel.github.io/intel-extension-for-pytorch/#introduction) | `pip install --upgrade-strategy eager "optimum[ipex]"` |

The `--upgrade-strategy eager` option is needed to ensure `optimum-intel` is upgraded to the latest version.

@@ -37,7 +40,7 @@ or to install from source including dependencies:
python -m pip install "optimum-intel[extras]"@git+https://github.com/huggingface/optimum-intel.git
```

where `extras` can be one or more of `neural-compressor`, `openvino`, `nncf`.
where `extras` can be one or more of `ipex`, `neural-compressor`, `openvino`, `nncf`.

# Quick tour

4 changes: 2 additions & 2 deletions docs/source/reference_inc.mdx
@@ -43,8 +43,8 @@ specific language governing permissions and limitations under the License.

## INCModelForCausalLM

[[autodoc]] neural_compressor.modeling_decoder.INCModelForCausalLM
[[autodoc]] neural_compressor.modeling_base.INCModelForCausalLM

## INCModelForSeq2SeqLM

[[autodoc]] neural_compressor.modeling_base.INCModelForSeq2SeqLM
[[autodoc]] neural_compressor.modeling_base.INCModelForSeq2SeqLM
25 changes: 21 additions & 4 deletions optimum/intel/__init__.py
@@ -35,9 +35,20 @@
    if not is_ipex_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    _import_structure["utils.dummy_ipex_objects"] = ["inference_mode"]
    from .utils import dummy_ipex_objects

    _import_structure["utils.dummy_ipex_objects"] = [
        name for name in dir(dummy_ipex_objects) if not name.startswith("_")
    ]
else:
    _import_structure["ipex"] = ["inference_mode"]
    _import_structure["ipex"] = [
        "inference_mode",
        "IPEXModelForCausalLM",
        "IPEXModelForSequenceClassification",
        "IPEXModelForMaskedLM",
        "IPEXModelForTokenClassification",
    ]


try:
if not (is_openvino_available() and is_nncf_available()):
@@ -144,9 +155,15 @@
    if not is_ipex_available():
        raise OptionalDependencyNotAvailable()
except OptionalDependencyNotAvailable:
    from .utils.dummy_ipex_objects import inference_mode
    from .utils.dummy_ipex_objects import *
else:
    from .ipex import inference_mode
    from .ipex import (
        IPEXModelForCausalLM,
        IPEXModelForMaskedLM,
        IPEXModelForSequenceClassification,
        IPEXModelForTokenClassification,
        inference_mode,
    )

try:
    if not (is_openvino_available() and is_nncf_available()):
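With the `optimum/intel/__init__.py` changes above, the new IPEX classes become importable from the top-level package. A rough usage sketch follows; the checkpoint name is arbitrary, and whether the `export=True` path is already wired up in this commit (as it is for the other Optimum Intel model classes) is an assumption:

```python
from transformers import AutoTokenizer, pipeline
from optimum.intel import IPEXModelForSequenceClassification

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
# Export the PyTorch model to TorchScript and wrap it in the IPEX-optimized class.
model = IPEXModelForSequenceClassification.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("This movie was surprisingly good."))
```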
4 changes: 4 additions & 0 deletions optimum/intel/generation/modeling.py
@@ -66,6 +66,7 @@ def prepare_jit_inputs(model: PreTrainedModel, task: str, use_cache: bool = Fals

def jit_trace(model: PreTrainedModel, task: str, use_cache: bool = False):
    model_inputs = prepare_jit_inputs(model, task, use_cache)
    model.config.return_dict = False
    # check if the model_inputs is correct.
    model(**model_inputs)
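The new `model.config.return_dict = False` line matters because TorchScript tracing expects tensors or tuples of tensors as outputs, while Transformers models return `ModelOutput` objects when `return_dict=True`. A minimal standalone sketch of the same idea, with an arbitrary checkpoint and hand-built inputs standing in for `prepare_jit_inputs`:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # illustrative checkpoint
model = AutoModelForSequenceClassification.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model.eval()

# Return tuples instead of ModelOutput objects so torch.jit.trace can handle the outputs.
model.config.return_dict = False
inputs = tokenizer("TorchScript prefers tuple outputs.", return_tensors="pt")
with torch.no_grad():
    traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]), strict=False)
```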

@@ -106,6 +107,9 @@ def __init__(
        self.normalized_config = NormalizedConfigManager.get_normalized_config_class(config.model_type)(config)
        self.model_dtype = kwargs.get("model_dtype", None)

        logger.warning(
            f"The class `{self.__class__}` has been deprecated and will be removed in optimum-intel v1.14, please use IPEXModel instead"
        )
        if isinstance(model, torch.jit.ScriptModule):
            self.input_names = {
                inputs.debugName().split(".")[0] for inputs in model.graph.inputs() if inputs.debugName() != "self"
7 changes: 7 additions & 0 deletions optimum/intel/ipex/__init__.py
@@ -1 +1,8 @@
from optimum.intel.ipex.modeling_base import (
    IPEXModelForCausalLM,
    IPEXModelForMaskedLM,
    IPEXModelForSequenceClassification,
    IPEXModelForTokenClassification,
)

from .inference import inference_mode
141 changes: 78 additions & 63 deletions optimum/intel/ipex/inference.py
@@ -1,3 +1,19 @@
# Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ruff: noqa

import logging
from typing import Union

@@ -7,7 +23,30 @@
from transformers.pipelines import Pipeline
from transformers.utils import is_ipex_available

from ..generation.modeling import TSModelForCausalLM, jit_trace
from ...exporters.tasks import TasksManager
from ..generation.modeling import jit_trace
from .modeling_base import (
    IPEXModel,
    IPEXModelForCausalLM,
    IPEXModelForMaskedLM,
    IPEXModelForSequenceClassification,
    IPEXModelForTokenClassification,
    IPEXBloomForCausalLM,
    IPEXMPTForCausalLM,
    IPEXOPTForCausalLM,
    IPEXGPTBigCodeForCausalLM,
)


from .utils import _HEAD_TO_AUTOMODELS


_MODEL_TYPE_TO_AUTOMODELS = {
    "bloom": IPEXBloomForCausalLM,
    "mpt": IPEXMPTForCausalLM,
    "opt": IPEXOPTForCausalLM,
    "big_code": IPEXGPTBigCodeForCausalLM,
}


logger = logging.getLogger(__name__)
@@ -42,17 +81,6 @@ def __getattr__(self, item):
            return self.item


class _ModelGenerationWrapper(_ModelFallbackWrapper):
    def __getattr__(self, item):
        if not item.startswith("__"):
            try:
                return getattr(self._optimized, item)
            except Exception:
                return getattr(self._default, item)
        else:
            return self.item


@add_start_docstrings(
    """
    inference_mode is an Intel specific context-manager analogous to PyTorch's inference_mode to use for inference
@@ -66,7 +94,6 @@ def __init__(
        self,
        model: Union[nn.Module, Pipeline],
        dtype: torch.dtype = torch.float32,
        jit: bool = False,
        **kwargs,
    ):
        """
@@ -88,65 +115,53 @@ def __init__(
        self._dtype = dtype
        self._graph_mode = False  # Let's keep for future use when it doesn't hang anymore
        self._original = None
        self._jit = jit

        if "jit" in kwargs:
            logger.warning(
                "`jit` is deprecated and will be removed in a future version. Use `IPEXModel` to load and export your model to TorchScript instead."
            )
        self._jit = kwargs.pop("jit", False)

    def __enter__(self):
        if self._model.framework == "pt":
            with torch.inference_mode():
                try:
                    ipex.enable_onednn_fusion(True)
                    if isinstance(self._model, Pipeline):
                        self._original = self._model.model

                        model = ipex.optimize(
                            self._model.model,
                            dtype=self._dtype,
                            graph_mode=self._graph_mode,
                            level="O1",
                            auto_kernel_selection=True,
                        )

                        # Enable automatic mixed precision (AMP) if we are going to target `bfloat16`
                        with torch.cpu.amp.autocast(
                            enabled=(self._dtype == torch.bfloat16 and self._original.dtype != torch.bfloat16)
                        ), torch.no_grad():
                            if self._jit:
                                try:
                                    use_cache = False
                                    if hasattr(self._original.config, "use_cache") and self._original.config.use_cache:
                                        use_cache = True
                                    model = jit_trace(
                                        model=model,
                                        task=self._model.task,
                                        use_cache=use_cache,
                                    )
                                    if self._model.task == "text-generation":
                                        model = TSModelForCausalLM(
                                            model=model,
                                            config=self._original.config,
                                            use_cache=use_cache,
                                            model_dtype=self._original.dtype,
                                        )
                                except Exception as e:
                                    logger.warning(f"failed to use PyTorch jit mode due to: {e}.")
                            # Patching model with the new one
                            self._model.model = _ModelGenerationWrapper(model, self._original)
                            return self._model
                    else:
                        self._original = self._model
                        model = ipex.optimize(
                            self._model,
                            dtype=self._dtype,
                            graph_mode=self._graph_mode,
                            level="O1",
                            auto_kernel_selection=True,
                    self._original = self._model.model if isinstance(self._model, Pipeline) else self._model
                    model = ipex.optimize(
                        self._original,
                        dtype=self._dtype,
                        graph_mode=self._graph_mode,
                        level="O1",
                        auto_kernel_selection=True,
                    )
                    if self._jit:
                        use_cache = getattr(self._original.config, "use_cache", False)
                        task = (
                            self._model.task
                            if isinstance(self._model, Pipeline)
                            else TasksManager._infer_task_from_model_or_model_class(model)
                        )
                        if task in _HEAD_TO_AUTOMODELS:
                            model = jit_trace(model, task, use_cache)
                            model_type = getattr(self._original.config, "model_type", "").replace("_", "-")

                            if task == "text-generation" and model_type in _MODEL_TYPE_TO_AUTOMODELS.keys():
                                auto_model_class = _MODEL_TYPE_TO_AUTOMODELS[model_type]
                            else:
                                auto_model_class = eval(_HEAD_TO_AUTOMODELS[task])

                            model = auto_model_class(model, self._original.config, use_cache=use_cache)

                    # Enable automatic mixed precision (AMP) if we are going to target `bfloat16`
                    with torch.cpu.amp.autocast(enabled=self._dtype == torch.bfloat16):
                        if isinstance(self._model, Pipeline):
                            # Patching model with the new one
                            self._model.model = _ModelFallbackWrapper(model, self._original)
                            return self._model
                        return model

                        # Enable automatic mixed precision (AMP) if we are going to target `bfloat16`
                        with torch.cpu.amp.autocast(
                            enabled=(self._dtype == torch.bfloat16 and self._original.dtype != torch.bfloat16)
                        ):
                            return model
                except RuntimeError:
                    return self._model
        else:
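Taken together, the reworked `inference_mode` context manager can still be driven roughly as below; the pipeline task, checkpoint and `bfloat16` target are arbitrary choices for the sketch, and `jit=True` is the now-deprecated path that this commit routes through the new IPEX model classes:

```python
import torch
from transformers import pipeline
from optimum.intel import inference_mode

# Any PyTorch-backed transformers pipeline; this checkpoint is illustrative.
pipe = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

# `jit=True` now only emits a deprecation warning and is consumed via **kwargs.
with inference_mode(pipe, dtype=torch.bfloat16, jit=True) as opt_pipe:
    print(opt_pipe("A rather pleasant surprise."))
```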
