diff --git a/docs/deployment/self-deployment/overview.mdx b/docs/deployment/self-deployment/overview.mdx
index 0b9e327..b17611a 100644
--- a/docs/deployment/self-deployment/overview.mdx
+++ b/docs/deployment/self-deployment/overview.mdx
@@ -4,13 +4,14 @@ title: Self-deployment
slug: overview
---
-Mistral AI provides ready-to-use Docker images on the Github registry. The weights are distributed separately.
+Mistral AI models can be self-deployed on your own infrastructure through various
+inference engines. We recommend using [vLLM](https://vllm.readthedocs.io/), a
+highly optimized Python serving framework that can expose an OpenAI-compatible
+API.
-To run these images, you need a cloud virtual machine matching the requirements for a given model. These requirements can be found in the [model description](/getting-started/models).
+Alternative inference engines include
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and
+[TGI](https://huggingface.co/docs/text-generation-inference/index).
-We recommend three different serving frameworks for our models :
-- [vLLM](https://vllm.readthedocs.io/): A python only serving framework which deploys an API matching OpenAI's spec. vLLM provides paged attention kernel to improve serving throughput.
-- NVidias's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) served with Nvidia's [Triton Inference Server](https://github.com/triton-inference-server) : TensorRT-LLM provides a DSL to build fast inference engines with dedicated kernels for large language models. Triton Inference Server allows efficient serving of these inference engines.
-- [TGI](https://huggingface.co/docs/text-generation-inference/index): A toolkit for deploying LLMs, including OpenAI's spec, grammars, production monitoring, and tools functionality.
-
-These images can be run locally, or on your favorite cloud provider, using [SkyPilot](https://skypilot.readthedocs.io/en/latest/).
+You can also rely on dedicated tools such as
+[SkyPilot](https://skypilot.readthedocs.io) or [Cerebrium](https://www.cerebrium.ai)
+to simplify infrastructure management.
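+
+For example, with SkyPilot a vLLM deployment can be described in a short task file
+and launched on your cloud account. The sketch below is hypothetical: the file name,
+accelerator choice, and model are placeholders to adapt to your setup.
+
+```yaml
+# serve.yaml -- hypothetical SkyPilot task sketch; adapt the accelerator and model
+resources:
+  accelerators: A100:1
+  ports: 8000
+
+setup: |
+  pip install "vllm>=0.6.1.post1"
+  # Authenticate to Hugging Face here if the model weights are gated,
+  # e.g. huggingface-cli login --token $HF_TOKEN
+
+run: |
+  vllm serve mistralai/Mistral-Nemo-Instruct-2407 \
+    --tokenizer_mode mistral --config_format mistral --load_format mistral
+```
+
+You would then launch it with `sky launch -c mistral-vllm serve.yaml`; refer to the
+SkyPilot documentation for the exact syntax and available options.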
diff --git a/docs/deployment/self-deployment/vllm.mdx b/docs/deployment/self-deployment/vllm.mdx
index 09ff1e5..c7af9b4 100644
--- a/docs/deployment/self-deployment/vllm.mdx
+++ b/docs/deployment/self-deployment/vllm.mdx
@@ -7,117 +7,311 @@ sidebar_position: 3.31
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
-vLLM can be deployed using a docker image we provide, or directly from the python package.
+[vLLM](https://github.com/vllm-project/vllm) is an open-source LLM inference and serving
+engine. It is particularly well suited as a target platform for self-deploying Mistral
+models on-premises.
-:::info
-If you are deploying a given model for the first time, you will first need to go to the model's card page on the HuggingFace website then accept the conditions of access.
+## Pre-requisites
+
-This is a one-time operation for each model and does not affect their license terms.
+- The hardware requirements for vLLM are listed on its [installation documentation page](https://docs.vllm.ai/en/latest/getting_started/installation.html).
+- By default, vLLM sources the model weights from Hugging Face. To access Mistral model
+ repositories you need to be authenticated on Hugging Face, so an access
+  token `HF_TOKEN` with the `READ` permission is required. Also make sure you have
+  accepted the conditions of access on each model's card page.
+- If you already have the model artifacts on your infrastructure, you can use
+  them directly by pointing vLLM to their local path instead of a Hugging Face
+  model ID, as sketched below. In that case you can skip all Hugging Face-related
+  setup steps.
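+
+  For instance, a minimal sketch of this scenario (the directory path is hypothetical
+  and should point to wherever the model artifacts are stored):
+
+  ```bash
+  # Hypothetical local directory holding the model weights and tokenizer files;
+  # no Hugging Face authentication is needed in this case.
+  vllm serve /path/to/Mistral-Nemo-Instruct-2407 \
+      --tokenizer_mode mistral \
+      --config_format mistral \
+      --load_format mistral
+  ```
+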
-:::
-## With docker
+## Getting started
+
+The following sections will guide you through the process of deploying and
+querying Mistral models on vLLM.
+
+### Installing vLLM
+
+- Create a Python virtual environment and install the `vllm` package (version
+  `>=0.6.1.post1` to ensure maximum compatibility with all Mistral models); a
+  minimal sketch follows this list.
+
+- Authenticate on the Hugging Face Hub using your access token `$HF_TOKEN`:
+ ```bash
+ huggingface-cli login --token $HF_TOKEN
+ ```
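+
+A minimal sketch of the environment setup, assuming a plain `venv`-based workflow
+(the environment name `vllm-env` is arbitrary):
+
+```bash
+# Create and activate an isolated environment, then install a vLLM version
+# compatible with all Mistral models.
+python -m venv vllm-env
+source vllm-env/bin/activate
+pip install "vllm>=0.6.1.post1"
+```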
+
+### Offline mode inference
+
+When using vLLM in _offline mode_, the model is loaded and used for one-off
+batch inference workloads.
-On a GPU-enabled host, you can run the Mistral AI LLM Inference image with the following command to download the model from Hugging Face:
-
-
-```bash
-docker run --gpus all \
- -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
- ghcr.io/mistralai/mistral-src/vllm:latest \
- --host 0.0.0.0 \
- --model mistralai/Mistral-7B-Instruct-v0.2
-```
-
-
-
-
-```bash
-docker run --gpus all \
- -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
- ghcr.io/mistralai/mistral-src/vllm:latest \
- --host 0.0.0.0 \
- --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
- --tensor-parallel-size 2 # adapt to your GPUs \
- --load-format pt # needed since both `pt` and `safetensors` are available
-```
-
-
- ```bash
- docker run --gpus all \
- -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \
- ghcr.io/mistralai/mistral-src/vllm:latest \
- --host 0.0.0.0 \
- --model mistralai/Mixtral-8x22B-Instruct-v0.1 \
- --tensor-parallel-size 4 # adapt to your GPUs \
- ```
-
+
+
+ ```python
+ from vllm import LLM
+ from vllm.sampling_params import SamplingParams
+
+    model_name = "mistralai/Mistral-Nemo-Instruct-2407"
+ sampling_params = SamplingParams(max_tokens=8192)
+
+ llm = LLM(
+ model=model_name,
+ tokenizer_mode="mistral",
+ load_format="mistral",
+ config_format="mistral",
+ )
+
+ messages = [
+ {
+ "role": "user",
+            "content": "Who is the best French painter? Answer with detailed explanations.",
+ }
+ ]
+
+ res = llm.chat(messages=messages, sampling_params=sampling_params)
+ print(res[0].outputs[0].text)
+
+ ```
+
+
+
+ Suppose you want to caption the following images:
+
+
+
+
+
+
+ You can do so by running the following code:
+
+ ```python
+ from vllm import LLM
+ from vllm.sampling_params import SamplingParams
+
+ model_name = "mistralai/Pixtral-12B-2409"
+ max_img_per_msg = 3
+
+ sampling_params = SamplingParams(max_tokens=8192)
+ llm = LLM(
+ model=model_name,
+ tokenizer_mode="mistral",
+ load_format="mistral",
+ config_format="mistral",
+ limit_mm_per_prompt={"image": max_img_per_msg},
+ )
+
+ urls = [f"https://picsum.photos/id/{id}/512/512" for id in ["1", "11", "111"]]
+
+ messages = [
+ {
+ "role": "user",
+ "content": [
+ {"type": "text", "text": "Describe this image"},
+ ] + [{"type": "image_url", "image_url": {"url": f"{u}"}} for u in urls],
+ },
+ ]
+
+ res = llm.chat(messages=messages, sampling_params=sampling_params)
+ print(res[0].outputs[0].text)
+ ```
+
-Where `HF_TOKEN` is an environment variable containing your [Hugging Face user access token](https://huggingface.co/docs/hub/security-tokens).
-This will spawn a vLLM instance exposing an OpenAI-like API, as documented in the [API section](/api).
+### Server mode inference
+
+In _server mode_, vLLM spawns an HTTP server that continuously
+waits for clients to connect and send requests concurrently.
+The server exposes a REST API that implements the OpenAI protocol,
+allowing you to directly reuse existing code relying on the OpenAI API.
+
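+Because the server implements the OpenAI protocol, existing OpenAI client code can
+usually be pointed at it as-is. For instance, here is a minimal sketch using the
+official `openai` Python package, assuming the Mistral NeMo server shown below is
+already running on `localhost:8000`:
+
+```python
+from openai import OpenAI
+
+# vLLM does not enforce an API key unless the server is started with `--api-key`,
+# so any placeholder string works here.
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
+
+response = client.chat.completions.create(
+    model="mistralai/Mistral-Nemo-Instruct-2407",
+    messages=[
+        {
+            "role": "user",
+            "content": "Who is the best French painter? Answer in one short sentence.",
+        }
+    ],
+)
+print(response.choices[0].message.content)
+```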
+
+
+ Start the inference server to deploy your model, e.g. for Mistral NeMo:
+
+ ```bash
+ vllm serve mistralai/Mistral-Nemo-Instruct-2407 \
+ --tokenizer_mode mistral \
+ --config_format mistral \
+ --load_format mistral
+ ```
+
+ You can now run inference requests with text input:
+
+
+
+ ```bash
+ curl --location 'http://localhost:8000/v1/chat/completions' \
+ --header 'Content-Type: application/json' \
+ --header 'Authorization: Bearer token' \
+ --data '{
+ "model": "mistralai/Mistral-Nemo-Instruct-2407",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Who is the best French painter? Answer in one short sentence."
+ }
+ ]
+ }'
+ ```
+
+
+ ```python
+ import httpx
+
+ url = 'http://localhost:8000/v1/chat/completions'
+ headers = {
+ 'Content-Type': 'application/json',
+ 'Authorization': 'Bearer token'
+ }
+ data = {
+ "model": "mistralai/Mistral-Nemo-Instruct-2407",
+ "messages": [
+ {
+ "role": "user",
+ "content": "Who is the best French painter? Answer in one short sentence."
+ }
+ ]
+ }
+
+ response = httpx.post(url, headers=headers, json=data)
+
+ print(response.json())
+
+ ```
+
+
+
+
+
+
+
+
+Start the inference server to deploy your model, e.g. for Pixtral-12B:
+
+ ```bash
+ vllm serve mistralai/Pixtral-12B-2409 \
+ --tokenizer_mode mistral \
+ --config_format mistral \
+ --load_format mistral
+ ```
:::info
-If your GPU has CUDA capabilities below 8.0, you will see the error `ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your XXX GPU has compute capability 7.0`. You need to pass the parameter `--dtype half` to the Docker command line.
+- The default number of image inputs per prompt is set to 1. To increase it, set the
+ `--limit_mm_per_prompt` option (e.g. `--limit_mm_per_prompt 'image=4'`).
+
+- If you encounter memory issues, set the `--max_model_len` option to reduce the
+ memory requirements of vLLM (e.g. `--max_model_len 16384`). More troubleshooting
+  details can be found in this
+  [vLLM troubleshooting guide](https://qwen.readthedocs.io/en/latest/deployment/vllm.html#troubleshooting).
:::
-The dockerfile for this image can be found on our [reference implementation github](https://github.com/mistralai/mistral-src/blob/main/deploy/Dockerfile).
+You can now run inference requests with images and text inputs. Suppose you
+want to caption the following image:
-## Without docker
+
+
+
+
-Alternatively, you can directly spawn a vLLM server on a GPU-enabled host with Cuda 11.8.
+You can prompt the model and retrieve its response like so:
+
+
+ ```bash
+ curl --location 'http://localhost:8000/v1/chat/completions' \
+ --header 'Content-Type: application/json' \
+ --header 'Authorization: Bearer token' \
+ --data '{
+ "model": "mistralai/Pixtral-12B-2409",
+ "messages": [
+ {
+ "role": "user",
+ "content": [
+ {"type" : "text", "text": "Describe this image in a short sentence."},
+ {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}}
+ ]
+ }
+ ]
+ }'
+ ```
+
+
+ ```python
+ import httpx
-### Install vLLM
+ url = "http://localhost:8000/v1/chat/completions"
+ headers = {"Content-Type": "application/json", "Authorization": "Bearer token"}
+ data = {
+ "model": "mistralai/Pixtral-12B-2409",
+ "messages": [
+ {
+ "role": "user",
+ "content": [
+ {"type": "text", "text": "Describe this image in a short sentence."},
+ {
+ "type": "image_url",
+ "image_url": {"url": "https://picsum.photos/id/237/200/300"},
+ },
+ ],
+ }
+ ],
+ }
-Firstly you need to install vLLM (or use `conda add vllm` if you are using Anaconda):
+ response = httpx.post(url, headers=headers, json=data)
-```bash
-pip install vllm
-```
+ print(response.json())
+ ```
+
+
-### Log in to the Hugging Face hub
-You will also need to log in to the Hugging Face hub using:
-```bash
-huggingface-cli login
-```
+
+
-### Run the OpenAI compatible inference endpoint
+## Deploying with Docker
+
+If you are looking to deploy vLLM as a containerized inference server, you can leverage
+the project's official Docker image (see more details in the
+[vLLM Docker documentation](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)).
+
+- Set the Hugging Face access token environment variable in your shell:
+ ```bash
+ export HF_TOKEN=your-access-token
+ ```
+
+- Run the Docker command to start the container:
+
+
+ ```bash
+ docker run --runtime nvidia --gpus all \
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
+ --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
+ -p 8000:8000 \
+ --ipc=host \
+ vllm/vllm-openai:latest \
+        --model mistralai/Mistral-Nemo-Instruct-2407 \
+ --tokenizer_mode mistral \
+ --load_format mistral \
+ --config_format mistral
+ ```
+
+
+ ```bash
+ docker run --runtime nvidia --gpus all \
+ -v ~/.cache/huggingface:/root/.cache/huggingface \
+ --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \
+ -p 8000:8000 \
+ --ipc=host \
+ vllm/vllm-openai:latest \
+ --model mistralai/Pixtral-12B-2409 \
+ --tokenizer_mode mistral \
+ --load_format mistral \
+ --config_format mistral
+ ```
+
+
+
+Once the container is up and running, you can run inference on your model
+using the same code as in a standalone deployment.
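+
+For example, a quick sanity check that the containerized server is reachable is to
+list the models it serves (the `/v1/models` route is part of the OpenAI-compatible
+API exposed by vLLM):
+
+```bash
+curl http://localhost:8000/v1/models
+```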
-You can then use the following command to start the server:
-
-
-
-```bash
-python -u -m vllm.entrypoints.openai.api_server \
- --host 0.0.0.0 \
- --model mistralai/Mistral-7B-Instruct-v0.2
-```
-
-
-
-```bash
-python -u -m vllm.entrypoints.openai.api_server \
- --host 0.0.0.0 \
- --model mistralai/Mixtral-8X7B-Instruct-v0.1 \
- --tensor-parallel-size 2 # adapt to your GPUs \
- --load-format pt # needed since both `pt` and `safetensors` are available
-```
-
-
-
-
-
-```bash
-python -u -m vllm.entrypoints.openai.api_server \
- --host 0.0.0.0 \
- --model mistralai/Mixtral-8X22B-Instruct-v0.1 \
- --tensor-parallel-size 4 # adapt to your GPUs \
-```
-
-
-
diff --git a/static/img/countryside.png b/static/img/countryside.png
new file mode 100644
index 0000000..9969d3a
Binary files /dev/null and b/static/img/countryside.png differ
diff --git a/static/img/doggo.png b/static/img/doggo.png
new file mode 100644
index 0000000..8d261e6
Binary files /dev/null and b/static/img/doggo.png differ
diff --git a/static/img/laptop.png b/static/img/laptop.png
new file mode 100644
index 0000000..78532d9
Binary files /dev/null and b/static/img/laptop.png differ
diff --git a/static/img/vintage_car.png b/static/img/vintage_car.png
new file mode 100644
index 0000000..822b853
Binary files /dev/null and b/static/img/vintage_car.png differ
diff --git a/version.txt b/version.txt
index 3e24c46..c8fe2be 100644
--- a/version.txt
+++ b/version.txt
@@ -1 +1 @@
-v0.0.80
+v0.0.15