diff --git a/docs/deployment/self-deployment/overview.mdx b/docs/deployment/self-deployment/overview.mdx
index 0b9e327..b17611a 100644
--- a/docs/deployment/self-deployment/overview.mdx
+++ b/docs/deployment/self-deployment/overview.mdx
@@ -4,13 +4,14 @@ title: Self-deployment
 slug: overview
 ---
 
-Mistral AI provides ready-to-use Docker images on the Github registry. The weights are distributed separately.
+Mistral AI models can be self-deployed on your own infrastructure through various
+inference engines. We recommend using [vLLM](https://vllm.readthedocs.io/), a
+highly-optimized Python-only serving framework which can expose an OpenAI-compatible
+API.
 
-To run these images, you need a cloud virtual machine matching the requirements for a given model. These requirements can be found in the [model description](/getting-started/models).
+Other inference engine alternatives include
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and
+[TGI](https://huggingface.co/docs/text-generation-inference/index).
 
-We recommend three different serving frameworks for our models :
-- [vLLM](https://vllm.readthedocs.io/): A python only serving framework which deploys an API matching OpenAI's spec. vLLM provides paged attention kernel to improve serving throughput.
-- NVidias's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) served with Nvidia's [Triton Inference Server](https://github.com/triton-inference-server) : TensorRT-LLM provides a DSL to build fast inference engines with dedicated kernels for large language models. Triton Inference Server allows efficient serving of these inference engines.
-- [TGI](https://huggingface.co/docs/text-generation-inference/index): A toolkit for deploying LLMs, including OpenAI's spec, grammars, production monitoring, and tools functionality.
-
-These images can be run locally, or on your favorite cloud provider, using [SkyPilot](https://skypilot.readthedocs.io/en/latest/).
+You can also leverage specific tools to facilitate infrastructure management, such as
+[SkyPilot](https://skypilot.readthedocs.io) or [Cerebrium](https://www.cerebrium.ai).
diff --git a/docs/deployment/self-deployment/vllm.mdx b/docs/deployment/self-deployment/vllm.mdx
index 09ff1e5..c7af9b4 100644
--- a/docs/deployment/self-deployment/vllm.mdx
+++ b/docs/deployment/self-deployment/vllm.mdx
@@ -7,117 +7,311 @@ sidebar_position: 3.31
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
 
-vLLM can be deployed using a docker image we provide, or directly from the python package.
+[vLLM](https://github.com/vllm-project/vllm) is an open-source LLM inference and serving
+engine. It is particularly appropriate as a target platform for self-deploying Mistral
+models on-premise.
 
-:::info
-If you are deploying a given model for the first time, you will first need to go to the model's card page on the HuggingFace website then accept the conditions of access.
+## Pre-requisites
 
-This is a one-time operation for each model and does not affect their license terms.
+- The hardware requirements for vLLM are listed on its [installation documentation page](https://docs.vllm.ai/en/latest/getting_started/installation.html).
+- By default, vLLM sources the model weights from Hugging Face. To access Mistral model
+  repositories you need to be authenticated on Hugging Face, so an access
+  token `HF_TOKEN` with the `READ` permission will be required. You should also make sure that you have
+  accepted the conditions of access on each model card page.
+- If you already have the model artifacts on your infrastructure you can use + them directly by pointing vLLM to their local path instead of a Hugging Face + model ID. In this scenario you will be able to skip all Hugging Face related + setup steps. -::: -## With docker +## Getting started + +The following sections will guide you through the process of deploying and +querying Mistral models on vLLM. + +### Installing vLLM + +- Create a Python virtual environment and install the `vllm` package (version + `>=0.6.1.post1` to ensure maximum compatibility with all Mistral models). + +- Authenticate on the HuggingFace Hub using your access token `$HF_TOKEN` : + ```bash + huggingface-cli login --token $HF_TOKEN + ``` + +### Offline mode inference + +When using vLLM in _offline mode_ the model is loaded and used for one-off +batch inference workloads. -On a GPU-enabled host, you can run the Mistral AI LLM Inference image with the following command to download the model from Hugging Face: - - -```bash -docker run --gpus all \ - -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \ - ghcr.io/mistralai/mistral-src/vllm:latest \ - --host 0.0.0.0 \ - --model mistralai/Mistral-7B-Instruct-v0.2 -``` - - - - -```bash -docker run --gpus all \ - -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \ - ghcr.io/mistralai/mistral-src/vllm:latest \ - --host 0.0.0.0 \ - --model mistralai/Mixtral-8x7B-Instruct-v0.1 \ - --tensor-parallel-size 2 # adapt to your GPUs \ - --load-format pt # needed since both `pt` and `safetensors` are available -``` - - - ```bash - docker run --gpus all \ - -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \ - ghcr.io/mistralai/mistral-src/vllm:latest \ - --host 0.0.0.0 \ - --model mistralai/Mixtral-8x22B-Instruct-v0.1 \ - --tensor-parallel-size 4 # adapt to your GPUs \ - ``` - + + + ```python + from vllm import LLM + from vllm.sampling_params import SamplingParams + + model_name = "mistralai/Mistral-NeMo-Instruct-2407" + sampling_params = SamplingParams(max_tokens=8192) + + llm = LLM( + model=model_name, + tokenizer_mode="mistral", + load_format="mistral", + config_format="mistral", + ) + + messages = [ + { + "role": "user", + "content": "Who is the best French painter. Answer with detailed explanations.", + } + ] + + res = llm.chat(messages=messages, sampling_params=sampling_params) + print(res[0].outputs[0].text) + + ``` + + + + Suppose you want to caption the following images: +
+  <img src="/img/laptop.png" alt="laptop"/>
+  <img src="/img/countryside.png" alt="countryside"/>
+  <img src="/img/vintage_car.png" alt="vintage car"/>
+ + You can do so by running the following code: + + ```python + from vllm import LLM + from vllm.sampling_params import SamplingParams + + model_name = "mistralai/Pixtral-12B-2409" + max_img_per_msg = 3 + + sampling_params = SamplingParams(max_tokens=8192) + llm = LLM( + model=model_name, + tokenizer_mode="mistral", + load_format="mistral", + config_format="mistral", + limit_mm_per_prompt={"image": max_img_per_msg}, + ) + + urls = [f"https://picsum.photos/id/{id}/512/512" for id in ["1", "11", "111"]] + + messages = [ + { + "role": "user", + "content": [ + {"type": "text", "text": "Describe this image"}, + ] + [{"type": "image_url", "image_url": {"url": f"{u}"}} for u in urls], + }, + ] + + res = llm.chat(messages=messages, sampling_params=sampling_params) + print(res[0].outputs[0].text) + ``` +
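+
+If the model artifacts are already present on your infrastructure (see the
+pre-requisites above), a local directory can be passed to vLLM in place of a
+Hugging Face model ID. Below is a minimal sketch of that variant of the offline
+example; the `/opt/models/mistral-nemo-instruct-2407` path is purely hypothetical
+and is assumed to contain the files from the corresponding Hugging Face repository:
+
+```python
+from vllm import LLM
+from vllm.sampling_params import SamplingParams
+
+# Hypothetical local directory holding the downloaded model artifacts
+local_model_path = "/opt/models/mistral-nemo-instruct-2407"
+
+llm = LLM(
+    model=local_model_path,  # a local path works in place of a Hugging Face model ID
+    tokenizer_mode="mistral",
+    load_format="mistral",
+    config_format="mistral",
+)
+
+messages = [{"role": "user", "content": "Summarize what vLLM does in one sentence."}]
+res = llm.chat(messages=messages, sampling_params=SamplingParams(max_tokens=128))
+print(res[0].outputs[0].text)
+```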
-Where `HF_TOKEN` is an environment variable containing your [Hugging Face user access token](https://huggingface.co/docs/hub/security-tokens). -This will spawn a vLLM instance exposing an OpenAI-like API, as documented in the [API section](/api). +### Server mode inference +In _server mode_, vLLM spawns an HTTP server that continuously +waits for clients to connect and send requests concurrently. +The server exposes a REST API that implements the OpenAI protocol, +allowing you to directly reuse existing code relying on the OpenAI API. + + + + Start the inference server to deploy your model, e.g. for Mistral NeMo: + + ```bash + vllm serve mistralai/Mistral-Nemo-Instruct-2407 \ + --tokenizer_mode mistral \ + --config_format mistral \ + --load_format mistral + ``` + + You can now run inference requests with text input: + + + + ```bash + curl --location 'http://localhost:8000/v1/chat/completions' \ + --header 'Content-Type: application/json' \ + --header 'Authorization: Bearer token' \ + --data '{ + "model": "mistralai/Mistral-Nemo-Instruct-2407", + "messages": [ + { + "role": "user", + "content": "Who is the best French painter? Answer in one short sentence." + } + ] + }' + ``` + + + ```python + import httpx + + url = 'http://localhost:8000/v1/chat/completions' + headers = { + 'Content-Type': 'application/json', + 'Authorization': 'Bearer token' + } + data = { + "model": "mistralai/Mistral-Nemo-Instruct-2407", + "messages": [ + { + "role": "user", + "content": "Who is the best French painter? Answer in one short sentence." + } + ] + } + + response = httpx.post(url, headers=headers, json=data) + + print(response.json()) + + ``` + + + + + + + + +Start the inference server to deploy your model, e.g. for Pixtral-12B: + + ```bash + vllm serve mistralai/Pixtral-12B-2409 \ + --tokenizer_mode mistral \ + --config_format mistral \ + --load_format mistral + ``` :::info -If your GPU has CUDA capabilities below 8.0, you will see the error `ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your XXX GPU has compute capability 7.0`. You need to pass the parameter `--dtype half` to the Docker command line. +- The default number of image inputs per prompt is set to 1. To increase it, set the + `--limit_mm_per_prompt` option (e.g. `--limit_mm_per_prompt 'image=4'`). + +- If you encounter memory issues, set the `--max_model_len` option to reduce the + memory requirements of vLLM (e.g. `--max_model_len 16384`). More troubleshooting + details can be found in the + [vLLM documentation](https://qwen.readthedocs.io/en/latest/deployment/vllm.html#troubleshooting). ::: -The dockerfile for this image can be found on our [reference implementation github](https://github.com/mistralai/mistral-src/blob/main/deploy/Dockerfile). +You can now run inference requests with images and text inputs. Suppose you +want to caption the following image: -## Without docker +
+<img src="/img/doggo.png" alt="dog"/>
+
-Alternatively, you can directly spawn a vLLM server on a GPU-enabled host with Cuda 11.8. +You can prompt the model and retrieve its response like so: + + + ```bash + curl --location 'http://localhost:8000/v1/chat/completions' \ + --header 'Content-Type: application/json' \ + --header 'Authorization: Bearer token' \ + --data '{ + "model": "mistralai/Pixtral-12B-2409", + "messages": [ + { + "role": "user", + "content": [ + {"type" : "text", "text": "Describe this image in a short sentence."}, + {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}} + ] + } + ] + }' + ``` + + + ```python + import httpx -### Install vLLM + url = "http://localhost:8000/v1/chat/completions" + headers = {"Content-Type": "application/json", "Authorization": "Bearer token"} + data = { + "model": "mistralai/Pixtral-12B-2409", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "Describe this image in a short sentence."}, + { + "type": "image_url", + "image_url": {"url": "https://picsum.photos/id/237/200/300"}, + }, + ], + } + ], + } -Firstly you need to install vLLM (or use `conda add vllm` if you are using Anaconda): + response = httpx.post(url, headers=headers, json=data) -```bash -pip install vllm -``` + print(response.json()) + ``` + + -### Log in to the Hugging Face hub -You will also need to log in to the Hugging Face hub using: -```bash -huggingface-cli login -``` +
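+
+Because the server implements the OpenAI chat completions protocol, you can also
+query it with the official `openai` Python client instead of raw HTTP calls. A
+minimal sketch, assuming the `openai` package (v1 or later) is installed and the
+Pixtral server started above is running locally:
+
+```python
+from openai import OpenAI
+
+# The vLLM server does not check credentials unless it was started with --api-key,
+# so any placeholder string works as the API key here.
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
+
+response = client.chat.completions.create(
+    model="mistralai/Pixtral-12B-2409",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "Describe this image in a short sentence."},
+                {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}},
+            ],
+        }
+    ],
+    max_tokens=128,
+)
+print(response.choices[0].message.content)
+```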
+
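+
+If a single GPU does not have enough memory for the model you are serving, vLLM can
+shard the weights across several devices with tensor parallelism. As a sketch, assuming
+a host with two GPUs (adjust the value to your hardware):
+
+```bash
+vllm serve mistralai/Pixtral-12B-2409 \
+    --tokenizer_mode mistral \
+    --config_format mistral \
+    --load_format mistral \
+    --tensor-parallel-size 2
+```
+
+The offline `LLM(...)` entry point accepts the equivalent `tensor_parallel_size`
+argument.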
-### Run the OpenAI compatible inference endpoint +## Deploying with Docker + +If you are looking to deploy vLLM as a containerized inference server you can leverage +the project's official Docker image (see more details in the +[vLLM Docker documentation](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)). + +- Set the HuggingFace access token environment variable in your shell: + ```bash + export HF_TOKEN=your-access-token + ``` + +- Run the Docker command to start the container: + + + ```bash + docker run --runtime nvidia --gpus all \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \ + -p 8000:8000 \ + --ipc=host \ + vllm/vllm-openai:latest \ + --model mistralai/Mistral-NeMo-Instruct-2407 \ + --tokenizer_mode mistral \ + --load_format mistral \ + --config_format mistral + ``` + + + ```bash + docker run --runtime nvidia --gpus all \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \ + -p 8000:8000 \ + --ipc=host \ + vllm/vllm-openai:latest \ + --model mistralai/Pixtral-12B-2409 \ + --tokenizer_mode mistral \ + --load_format mistral \ + --config_format mistral + ``` + + + +Once the container is up and running you will be able to run inference on your model +using the same code as in a standalone deployment. -You can then use the following command to start the server: - - - -```bash -python -u -m vllm.entrypoints.openai.api_server \ - --host 0.0.0.0 \ - --model mistralai/Mistral-7B-Instruct-v0.2 -``` - - - -```bash -python -u -m vllm.entrypoints.openai.api_server \ - --host 0.0.0.0 \ - --model mistralai/Mixtral-8X7B-Instruct-v0.1 \ - --tensor-parallel-size 2 # adapt to your GPUs \ - --load-format pt # needed since both `pt` and `safetensors` are available -``` - - - - - -```bash -python -u -m vllm.entrypoints.openai.api_server \ - --host 0.0.0.0 \ - --model mistralai/Mixtral-8X22B-Instruct-v0.1 \ - --tensor-parallel-size 4 # adapt to your GPUs \ -``` - - - diff --git a/static/img/countryside.png b/static/img/countryside.png new file mode 100644 index 0000000..9969d3a Binary files /dev/null and b/static/img/countryside.png differ diff --git a/static/img/doggo.png b/static/img/doggo.png new file mode 100644 index 0000000..8d261e6 Binary files /dev/null and b/static/img/doggo.png differ diff --git a/static/img/laptop.png b/static/img/laptop.png new file mode 100644 index 0000000..78532d9 Binary files /dev/null and b/static/img/laptop.png differ diff --git a/static/img/vintage_car.png b/static/img/vintage_car.png new file mode 100644 index 0000000..822b853 Binary files /dev/null and b/static/img/vintage_car.png differ diff --git a/version.txt b/version.txt index 3e24c46..c8fe2be 100644 --- a/version.txt +++ b/version.txt @@ -1 +1 @@ -v0.0.80 +v0.0.15