From 53550069ebc24ad94c2fe94a376bf7ee23028f67 Mon Sep 17 00:00:00 2001
From: dudeperf3ct
Date: Thu, 3 Oct 2024 13:00:32 +0530
Subject: [PATCH] Update vLLM documentation

---
 .../component-guide/model-deployers/vllm.md | 43 +++++++++++++++++--
 1 file changed, 40 insertions(+), 3 deletions(-)

diff --git a/docs/book/component-guide/model-deployers/vllm.md b/docs/book/component-guide/model-deployers/vllm.md
index 85cbf85935c..e05f801bc58 100644
--- a/docs/book/component-guide/model-deployers/vllm.md
+++ b/docs/book/component-guide/model-deployers/vllm.md
@@ -4,8 +4,17 @@ description: Deploying your LLM locally with vLLM.
 
 # vLLM
 
+[vLLM](https://docs.vllm.ai/en/latest/) is a fast and easy-to-use library for LLM inference and serving.
+
 ## When to use it?
 
+You should use the vLLM Model Deployer if you want to:
+
+* Deploy large language models with state-of-the-art serving throughput behind an OpenAI-compatible API server
+* Benefit from continuous batching of incoming requests
+* Quantize models with GPTQ, AWQ, INT4, INT8, or FP8
+* Use features such as PagedAttention, speculative decoding, and chunked prefill
+
 ## How do you deploy it?
 
 The vLLM Model Deployer flavor is provided by the vLLM ZenML integration, so you need to install it on your local machine to be able to deploy your models. You can do this by running the following command:
@@ -24,14 +33,42 @@ The ZenML integration will provision a local vLLM deployment server as a daemon
 
 ## How do you use it?
 
+If you'd like to see this in action, check out this example of a [deployment pipeline](https://github.com/zenml-io/zenml-projects/blob/79f67ea52c3908b9b33c9a41eef18cb7d72362e8/llm-vllm-deployer/pipelines/deploy_pipeline.py#L25).
+
+### Deploy an LLM
+
+The [vllm_model_deployer_step](https://github.com/zenml-io/zenml-projects/blob/79f67ea52c3908b9b33c9a41eef18cb7d72362e8/llm-vllm-deployer/steps/vllm_deployer.py#L32) exposes a `VLLMDeploymentService` that you can use in your pipeline. Here is an example snippet:
+
+```python
+
+from zenml import pipeline
+from typing import Annotated
+from steps.vllm_deployer import vllm_model_deployer_step
+from zenml.integrations.vllm.services.vllm_deployment import VLLMDeploymentService
+
+
+@pipeline()
+def deploy_vllm_pipeline(
+    model: str,
+    timeout: int = 1200,
+) -> Annotated[VLLMDeploymentService, "GPT2"]:
+    service = vllm_model_deployer_step(
+        model=model,
+        timeout=timeout,
+    )
+    return service
+```
+
+Here is an [example](https://github.com/zenml-io/zenml-projects/tree/79f67ea52c3908b9b33c9a41eef18cb7d72362e8/llm-vllm-deployer) of running a GPT-2 model using vLLM.
+
 #### Configuration
 
 Within the `VLLMDeploymentService` you can configure:
 
-* `model`: Name or path of the huggingface model to use.
-* `tokenizer`: Name or path of the huggingface tokenizer to use. If unspecified, model name or path will be used.
+* `model`: Name or path of the Hugging Face model to use.
+* `tokenizer`: Name or path of the Hugging Face tokenizer to use. If unspecified, model name or path will be used.
 * `served_model_name`: The model name(s) used in the API. If not specified, the model name will be the same as the `model` argument.
-* `trust_remote_code`: Trust remote code from huggingface.
+* `trust_remote_code`: Trust remote code from Hugging Face.
 * `tokenizer_mode`: The tokenizer mode. Allowed choices: ['auto', 'slow', 'mistral']
 * `dtype`: Data type for model weights and activations. Allowed choices: ['auto', 'half', 'float16', 'bfloat16', 'float', 'float32']
 * `revision`: The specific model version to use. It can be a branch name, a tag name, or a commit id. If unspecified, will use the default version.
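
Once the daemon is running, the deployment exposes an OpenAI-compatible API, so you can talk to the served GPT-2 model with the standard `openai` client. The snippet below is only a minimal sketch: the base URL (vLLM's default port is 8000) and the served model name `gpt2` are assumptions, so take the actual endpoint from the running `VLLMDeploymentService` or the deployer's output.

```python
# Minimal sketch: query the locally deployed, OpenAI-compatible vLLM server.
# The base URL and model name are assumptions -- read the real values from
# the running VLLMDeploymentService / the deployer's output.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed local endpoint of the vLLM daemon
    api_key="not-needed",                 # vLLM does not verify the key unless configured to
)

# GPT-2 is a plain language model, so use the completions endpoint rather than chat.
response = client.completions.create(
    model="gpt2",        # assumed served model name
    prompt="ZenML is",
    max_tokens=32,
)
print(response.choices[0].text)
```

Because the API is OpenAI-compatible, the same client code works for any other model you deploy with this flavor; only the model name and endpoint change.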