diff --git a/Dockerfile.hpu b/Dockerfile.hpu
index d18fc016387bf..aa1502cc5ee8b 100644
--- a/Dockerfile.hpu
+++ b/Dockerfile.hpu
@@ -1,4 +1,4 @@
-FROM vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
+FROM vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
 
 COPY ./ /workspace/vllm
 
diff --git a/README_GAUDI.md b/README_GAUDI.md
index 0a095ddd0bd73..a1b8d53122929 100644
--- a/README_GAUDI.md
+++ b/README_GAUDI.md
@@ -11,7 +11,7 @@ Please follow the instructions provided in the [Gaudi Installation Guide](https:
 - OS: Ubuntu 22.04 LTS
 - Python: 3.10
 - Intel Gaudi accelerator
-- Intel Gaudi software version 1.18.0
+- Intel Gaudi software version 1.19.0
 
 ## Quick start using Dockerfile
 ```
@@ -44,13 +44,29 @@ It is highly recommended to use the latest Docker image from Intel Gaudi vault.
 Use the following commands to run a Docker image:
 
 ```{.console}
-$ docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
-$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
+$ docker pull vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
+$ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
 ```
 
-### Build and Install vLLM-fork
+### Build and Install vLLM
 
-Currently, the latest features and performance optimizations are developed in Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork) and we periodically upstream them to vLLM main repo. To install latest [HabanaAI/vLLM-fork](https://github.com/HabanaAI/vllm-fork), run the following:
+Currently, there are multiple ways to install vLLM with Intel® Gaudi®; pick one of the options below:
+
+#### 1. Build and Install the stable version
+
+Periodically, we release vLLM to align with Intel® Gaudi® software releases. The stable version is released with a tag and contains the fully validated features and performance optimizations of Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork). To install the stable release from [HabanaAI/vLLM-fork](https://github.com/HabanaAI/vllm-fork), run the following:
+
+```{.console}
+$ git clone https://github.com/HabanaAI/vllm-fork.git
+$ cd vllm-fork
+$ git checkout v0.6.4.post2+Gaudi-1.19.0
+$ pip install -r requirements-hpu.txt
+$ python setup.py develop
+```
+
+#### 2. Build and Install the latest from vLLM-fork
+
+The latest features and performance optimizations are developed in Gaudi's [vLLM-fork](https://github.com/HabanaAI/vllm-fork) and we periodically upstream them to the vLLM main repository. To install the latest [HabanaAI/vLLM-fork](https://github.com/HabanaAI/vllm-fork), run the following:
 
 ```{.console}
 $ git clone https://github.com/HabanaAI/vllm-fork.git
@@ -59,6 +75,16 @@ $ git checkout habana_main
 $ pip install -r requirements-hpu.txt
 $ python setup.py develop
 ```
+#### 3. Build and Install from vLLM upstream
+
+If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
+
+```{.console}
+$ git clone https://github.com/vllm-project/vllm.git
+$ cd vllm
+$ pip install -r requirements-hpu.txt
+$ python setup.py develop
+```
 
 # Supported Features
 
@@ -71,11 +97,11 @@ $ python setup.py develop
 - Inference with [HPU Graphs](https://docs.habana.ai/en/latest/PyTorch/Inference_on_PyTorch/Inference_Using_HPU_Graphs.html)
   for accelerating low-batch latency and throughput
 - Attention with Linear Biases (ALiBi)
 - INC quantization
+- LoRA adapters
 
 # Unsupported Features
 
 - Beam search
-- LoRA adapters
 - AWQ quantization
 - Prefill chunking (mixed-batch inferencing)
 
@@ -112,7 +138,7 @@ Currently in vLLM for HPU we support four execution modes, depending on selected
 | 1                  | 1                  | PyTorch lazy mode  |
 
 > [!WARNING]
-> In 1.18.0, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
+> In the current release, all modes utilizing `PT_HPU_LAZY_MODE=0` are highly experimental and should only be used for validating functional correctness. Their performance will be improved in future releases. For the best performance, please use HPU Graphs or PyTorch lazy mode.
 
 ## Bucketing mechanism
 
diff --git a/docs/source/getting_started/gaudi-installation.rst b/docs/source/getting_started/gaudi-installation.rst
index 79d40293fd470..672b2d883100b 100644
--- a/docs/source/getting_started/gaudi-installation.rst
+++ b/docs/source/getting_started/gaudi-installation.rst
@@ -17,8 +17,8 @@ Requirements
 
 - OS: Ubuntu 22.04 LTS
 - Python: 3.10
-- Intel Gaudi accelerator
-- Intel Gaudi software version 1.18.0
+- Intel® Gaudi® AI Accelerator
+- Intel Gaudi software version 1.19.0
 
 
 Quick start using Dockerfile
@@ -63,23 +63,29 @@ Use the following commands to run a Docker image:
 
 .. code:: console
 
-   $ docker pull vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
-   $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.18.0/ubuntu22.04/habanalabs/pytorch-installer-2.4.0:latest
+   $ docker pull vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
+   $ docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
 
 Build and Install vLLM
 ~~~~~~~~~~~~~~~~~~~~~~
 
-To build and install vLLM from source, run:
+Currently, there are multiple ways to install vLLM with Intel® Gaudi®; pick one of the options below:
+
+1. Build and Install the stable version
+
+Periodically, we release vLLM to align with Intel® Gaudi® software releases. The stable version is released with a tag and contains the fully validated features and performance optimizations of Gaudi's `vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__. To install the stable release from `HabanaAI/vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__, run the following:
+
+.. code:: console
 
-   $ git clone https://github.com/vllm-project/vllm.git
-   $ cd vllm
+   $ git clone https://github.com/HabanaAI/vllm-fork.git
+   $ cd vllm-fork
+   $ git checkout v0.6.4.post2+Gaudi-1.19.0
    $ pip install -r requirements-hpu.txt
    $ python setup.py develop
 
+2. Build and Install the latest from vLLM-fork
 
-Currently, the latest features and performance optimizations are developed in Gaudi's `vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__ and we periodically upstream them to vLLM main repo. To install latest `HabanaAI/vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__, run the following:
+The latest features and performance optimizations are developed in Gaudi's `vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__ and we periodically upstream them to the vLLM main repository. To install the latest `HabanaAI/vLLM-fork <https://github.com/HabanaAI/vllm-fork>`__, run the following:
 
 .. code:: console
 
@@ -89,6 +95,16 @@ Currently, the latest features and performance optimizations are developed in Ga
    $ pip install -r requirements-hpu.txt
    $ python setup.py develop
 
+3. Build and Install from vLLM upstream
+
+If you prefer to build and install directly from the main vLLM source, to which we periodically upstream new features, run the following:
+
+.. code:: console
+
+   $ git clone https://github.com/vllm-project/vllm.git
+   $ cd vllm
+   $ pip install -r requirements-hpu.txt
+   $ python setup.py develop
 
 Supported Features
 ==================
 
@@ -107,12 +123,12 @@ Supported Features
   for accelerating low-batch latency and throughput
 - Attention with Linear Biases (ALiBi)
 - INC quantization
+- LoRA adapters
 
 Unsupported Features
 ====================
 
 - Beam search
-- LoRA adapters
 - AWQ quantization
 - Prefill chunking (mixed-batch inferencing)
 
@@ -186,7 +202,7 @@ Currently in vLLM for HPU we support four execution modes, depending on selected
      - PyTorch lazy mode
 
 .. warning::
-   In 1.18.0, all modes utilizing ``PT_HPU_LAZY_MODE=0`` are highly experimental and should be only used for validating functional correctness. Their performance will be improved in the next releases. For obtaining the best performance in 1.18.0, please use HPU Graphs, or PyTorch lazy mode.
+   In the current release, all modes utilizing ``PT_HPU_LAZY_MODE=0`` are highly experimental and should only be used for validating functional correctness. Their performance will be improved in future releases. For the best performance, please use HPU Graphs or PyTorch lazy mode.
 
 
 Bucketing mechanism
diff --git a/docs/source/serving/compatibility_matrix.rst b/docs/source/serving/compatibility_matrix.rst
index f629b3ca78318..7c0a34efd30ce 100644
--- a/docs/source/serving/compatibility_matrix.rst
+++ b/docs/source/serving/compatibility_matrix.rst
@@ -305,6 +305,7 @@ Feature x Hardware
      - Hopper
      - CPU
      - AMD
+     - Gaudi
    * - :ref:`CP `
      - `✗ `__
      - ✅
@@ -313,6 +314,7 @@ Feature x Hardware
      - ✅
      - ✗
      - ✅
+     - ✗
    * - :ref:`APC `
      - `✗ `__
      - ✅
@@ -321,6 +323,7 @@ Feature x Hardware
      - ✅
      - ✗
      - ✅
+     - ✅
    * - :ref:`LoRA `
      - ✅
      - ✅
@@ -329,6 +332,7 @@ Feature x Hardware
      - ✅
      - `✗ `__
      - ✅
+     - ✅
    * - :abbr:`prmpt adptr (Prompt Adapter)`
      - ✅
      - ✅
@@ -337,6 +341,7 @@ Feature x Hardware
      - ✅
      - `✗ `__
      - ✅
+     - ✗
    * - :ref:`SD `
      - ✅
      - ✅
@@ -345,6 +350,7 @@ Feature x Hardware
      - ✅
      - ✅
      - ✅
+     - ✅
    * - CUDA graph
      - ✅
      - ✅
@@ -353,6 +359,7 @@ Feature x Hardware
      - ✅
      - ✗
      - ✅
+     - ✗
    * - :abbr:`enc-dec (Encoder-Decoder Models)`
      - ✅
      - ✅
@@ -361,6 +368,7 @@ Feature x Hardware
      - ✅
      - ✅
      - ✗
+     - ✅
    * - :abbr:`logP (Logprobs)`
      - ✅
      - ✅
@@ -369,6 +377,7 @@ Feature x Hardware
      - ✅
      - ✅
      - ✅
+     - ✅
    * - :abbr:`prmpt logP (Prompt Logprobs)`
      - ✅
      - ✅
@@ -377,6 +386,7 @@ Feature x Hardware
      - ✅
      - ✅
      - ✅
+     - ✅
    * - :abbr:`async output (Async Output Processing)`
      - ✅
      - ✅
@@ -385,6 +395,7 @@ Feature x Hardware
      - ✅
      - ✗
      - ✗
+     - ✅
    * - multi-step
      - ✅
      - ✅
@@ -393,6 +404,7 @@ Feature x Hardware
      - ✅
      - `✗ `__
      - ✅
+     - ✅
    * - :abbr:`MM (Multimodal)`
      - ✅
      - ✅
@@ -401,6 +413,7 @@ Feature x Hardware
      - ✅
      - ✅
      - ✅
+     - ✅
    * - best-of
      - ✅
      - ✅
@@ -409,6 +422,7 @@ Feature x Hardware
      - ✅
      - ✅
      - ✅
+     - ✅
    * - beam-search
      - ✅
      - ✅
@@ -417,6 +431,7 @@ Feature x Hardware
      - ✅
      - ✅
      - ✅
+     - ✗
    * - :abbr:`guided dec (Guided Decoding)`
      - ✅
      - ✅
@@ -425,3 +440,4 @@ Feature x Hardware
      - ✅
      - ✅
      - ✅
+     - ✅
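Not part of the patch above, but as a quick sanity check after installing through any of the documented options, a minimal offline-inference sketch can confirm that vLLM runs on the Gaudi setup. This is only an illustrative assumption, not text from the PR: the model name is a placeholder, and the script assumes vLLM was installed per README_GAUDI.md or gaudi-installation.rst with an HPU visible in the container.

```python
# Minimal smoke-test sketch for a vLLM install on Gaudi (illustrative only).
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is"]
sampling_params = SamplingParams(temperature=0.0, max_tokens=32)

# Placeholder model; any small model you have access to should work.
llm = LLM(model="facebook/opt-125m")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```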