diff --git a/docs/deployment/self-deployment/overview.mdx b/docs/deployment/self-deployment/overview.mdx
index 0b9e327..b17611a 100644
--- a/docs/deployment/self-deployment/overview.mdx
+++ b/docs/deployment/self-deployment/overview.mdx
@@ -4,13 +4,14 @@ title: Self-deployment
 slug: overview
 ---
 
-Mistral AI provides ready-to-use Docker images on the Github registry. The weights are distributed separately.
+Mistral AI models can be self-deployed on your own infrastructure through various
+inference engines. We recommend using [vLLM](https://vllm.readthedocs.io/), a
+highly-optimized Python-only serving framework which can expose an OpenAI-compatible
+API.
 
-To run these images, you need a cloud virtual machine matching the requirements for a given model. These requirements can be found in the [model description](/getting-started/models).
+Other inference engine alternatives include
+[TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and
+[TGI](https://huggingface.co/docs/text-generation-inference/index).
 
-We recommend three different serving frameworks for our models :
-- [vLLM](https://vllm.readthedocs.io/): A python only serving framework which deploys an API matching OpenAI's spec. vLLM provides paged attention kernel to improve serving throughput.
-- NVidias's [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) served with Nvidia's [Triton Inference Server](https://github.com/triton-inference-server) : TensorRT-LLM provides a DSL to build fast inference engines with dedicated kernels for large language models. Triton Inference Server allows efficient serving of these inference engines.
-- [TGI](https://huggingface.co/docs/text-generation-inference/index): A toolkit for deploying LLMs, including OpenAI's spec, grammars, production monitoring, and tools functionality.
-
-These images can be run locally, or on your favorite cloud provider, using [SkyPilot](https://skypilot.readthedocs.io/en/latest/).
+You can also leverage specific tools to facilitate infrastructure management, such as
+[SkyPilot](https://skypilot.readthedocs.io) or [Cerebrium](https://www.cerebrium.ai).
diff --git a/docs/deployment/self-deployment/vllm.mdx b/docs/deployment/self-deployment/vllm.mdx
index 09ff1e5..c7af9b4 100644
--- a/docs/deployment/self-deployment/vllm.mdx
+++ b/docs/deployment/self-deployment/vllm.mdx
@@ -7,117 +7,311 @@ sidebar_position: 3.31
 import Tabs from '@theme/Tabs';
 import TabItem from '@theme/TabItem';
 
-vLLM can be deployed using a docker image we provide, or directly from the python package.
+[vLLM](https://github.com/vllm-project/vllm) is an open-source LLM inference and serving
+engine. It is particularly appropriate as a target platform for self-deploying Mistral
+models on-premise.
 
-:::info
-If you are deploying a given model for the first time, you will first need to go to the model's card page on the HuggingFace website then accept the conditions of access.
+## Pre-requisites
 
-This is a one-time operation for each model and does not affect their license terms.
+- The hardware requirements for vLLM are listed on its [installation documentation page](https://docs.vllm.ai/en/latest/getting_started/installation.html).
+- By default, vLLM sources the model weights from Hugging Face. To access Mistral model
+  repositories you need to be authenticated on Hugging Face, so an access
+  token `HF_TOKEN` with the `READ` permission will be required. You should also make sure that you have
+  accepted the conditions of access on each model card page.
+- If you already have the model artifacts on your infrastructure you can use + them directly by pointing vLLM to their local path instead of a Hugging Face + model ID. In this scenario you will be able to skip all Hugging Face related + setup steps. -::: -## With docker +## Getting started + +The following sections will guide you through the process of deploying and +querying Mistral models on vLLM. + +### Installing vLLM + +- Create a Python virtual environment and install the `vllm` package (version + `>=0.6.1.post1` to ensure maximum compatibility with all Mistral models). + +- Authenticate on the HuggingFace Hub using your access token `$HF_TOKEN` : + ```bash + huggingface-cli login --token $HF_TOKEN + ``` + +### Offline mode inference + +When using vLLM in _offline mode_ the model is loaded and used for one-off +batch inference workloads. -On a GPU-enabled host, you can run the Mistral AI LLM Inference image with the following command to download the model from Hugging Face: - - -```bash -docker run --gpus all \ - -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \ - ghcr.io/mistralai/mistral-src/vllm:latest \ - --host 0.0.0.0 \ - --model mistralai/Mistral-7B-Instruct-v0.2 -``` - - - - -```bash -docker run --gpus all \ - -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \ - ghcr.io/mistralai/mistral-src/vllm:latest \ - --host 0.0.0.0 \ - --model mistralai/Mixtral-8x7B-Instruct-v0.1 \ - --tensor-parallel-size 2 # adapt to your GPUs \ - --load-format pt # needed since both `pt` and `safetensors` are available -``` - - - ```bash - docker run --gpus all \ - -e HF_TOKEN=$HF_TOKEN -p 8000:8000 \ - ghcr.io/mistralai/mistral-src/vllm:latest \ - --host 0.0.0.0 \ - --model mistralai/Mixtral-8x22B-Instruct-v0.1 \ - --tensor-parallel-size 4 # adapt to your GPUs \ - ``` - + + + ```python + from vllm import LLM + from vllm.sampling_params import SamplingParams + + model_name = "mistralai/Mistral-NeMo-Instruct-2407" + sampling_params = SamplingParams(max_tokens=8192) + + llm = LLM( + model=model_name, + tokenizer_mode="mistral", + load_format="mistral", + config_format="mistral", + ) + + messages = [ + { + "role": "user", + "content": "Who is the best French painter. Answer with detailed explanations.", + } + ] + + res = llm.chat(messages=messages, sampling_params=sampling_params) + print(res[0].outputs[0].text) + + ``` + + + + Suppose you want to caption the following images: +
+  <img src="/img/laptop.png" alt="laptop"/>
+  <img src="/img/countryside.png" alt="countryside"/>
+  <img src="/img/vintage_car.png" alt="vintage car"/>
+ + You can do so by running the following code: + + ```python + from vllm import LLM + from vllm.sampling_params import SamplingParams + + model_name = "mistralai/Pixtral-12B-2409" + max_img_per_msg = 3 + + sampling_params = SamplingParams(max_tokens=8192) + llm = LLM( + model=model_name, + tokenizer_mode="mistral", + load_format="mistral", + config_format="mistral", + limit_mm_per_prompt={"image": max_img_per_msg}, + ) + + urls = [f"https://picsum.photos/id/{id}/512/512" for id in ["1", "11", "111"]] + + messages = [ + { + "role": "user", + "content": [ + {"type": "text", "text": "Describe this image"}, + ] + [{"type": "image_url", "image_url": {"url": f"{u}"}} for u in urls], + }, + ] + + res = llm.chat(messages=messages, sampling_params=sampling_params) + print(res[0].outputs[0].text) + ``` +
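+
+If the model artifacts are already present on your infrastructure (see the
+pre-requisites above), a local directory can be passed to vLLM in place of a
+Hugging Face model ID. Below is a minimal sketch of that variant of the offline
+example; the `/opt/models/mistral-nemo-instruct-2407` path is purely hypothetical
+and is assumed to contain the files from the corresponding Hugging Face repository:
+
+```python
+from vllm import LLM
+from vllm.sampling_params import SamplingParams
+
+# Hypothetical local directory holding the downloaded model artifacts
+local_model_path = "/opt/models/mistral-nemo-instruct-2407"
+
+llm = LLM(
+    model=local_model_path,  # a local path works in place of a Hugging Face model ID
+    tokenizer_mode="mistral",
+    load_format="mistral",
+    config_format="mistral",
+)
+
+messages = [{"role": "user", "content": "Summarize what vLLM does in one sentence."}]
+res = llm.chat(messages=messages, sampling_params=SamplingParams(max_tokens=128))
+print(res[0].outputs[0].text)
+```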
-Where `HF_TOKEN` is an environment variable containing your [Hugging Face user access token](https://huggingface.co/docs/hub/security-tokens). -This will spawn a vLLM instance exposing an OpenAI-like API, as documented in the [API section](/api). +### Server mode inference +In _server mode_, vLLM spawns an HTTP server that continuously +waits for clients to connect and send requests concurrently. +The server exposes a REST API that implements the OpenAI protocol, +allowing you to directly reuse existing code relying on the OpenAI API. + + + + Start the inference server to deploy your model, e.g. for Mistral NeMo: + + ```bash + vllm serve mistralai/Mistral-Nemo-Instruct-2407 \ + --tokenizer_mode mistral \ + --config_format mistral \ + --load_format mistral + ``` + + You can now run inference requests with text input: + + + + ```bash + curl --location 'http://localhost:8000/v1/chat/completions' \ + --header 'Content-Type: application/json' \ + --header 'Authorization: Bearer token' \ + --data '{ + "model": "mistralai/Mistral-Nemo-Instruct-2407", + "messages": [ + { + "role": "user", + "content": "Who is the best French painter? Answer in one short sentence." + } + ] + }' + ``` + + + ```python + import httpx + + url = 'http://localhost:8000/v1/chat/completions' + headers = { + 'Content-Type': 'application/json', + 'Authorization': 'Bearer token' + } + data = { + "model": "mistralai/Mistral-Nemo-Instruct-2407", + "messages": [ + { + "role": "user", + "content": "Who is the best French painter? Answer in one short sentence." + } + ] + } + + response = httpx.post(url, headers=headers, json=data) + + print(response.json()) + + ``` + + + + + + + + +Start the inference server to deploy your model, e.g. for Pixtral-12B: + + ```bash + vllm serve mistralai/Pixtral-12B-2409 \ + --tokenizer_mode mistral \ + --config_format mistral \ + --load_format mistral + ``` :::info -If your GPU has CUDA capabilities below 8.0, you will see the error `ValueError: Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your XXX GPU has compute capability 7.0`. You need to pass the parameter `--dtype half` to the Docker command line. +- The default number of image inputs per prompt is set to 1. To increase it, set the + `--limit_mm_per_prompt` option (e.g. `--limit_mm_per_prompt 'image=4'`). + +- If you encounter memory issues, set the `--max_model_len` option to reduce the + memory requirements of vLLM (e.g. `--max_model_len 16384`). More troubleshooting + details can be found in the + [vLLM documentation](https://qwen.readthedocs.io/en/latest/deployment/vllm.html#troubleshooting). ::: -The dockerfile for this image can be found on our [reference implementation github](https://github.com/mistralai/mistral-src/blob/main/deploy/Dockerfile). +You can now run inference requests with images and text inputs. Suppose you +want to caption the following image: -## Without docker +
+<img src="/img/doggo.png" alt="dog"/>
+
-Alternatively, you can directly spawn a vLLM server on a GPU-enabled host with Cuda 11.8. +You can prompt the model and retrieve its response like so: + + + ```bash + curl --location 'http://localhost:8000/v1/chat/completions' \ + --header 'Content-Type: application/json' \ + --header 'Authorization: Bearer token' \ + --data '{ + "model": "mistralai/Pixtral-12B-2409", + "messages": [ + { + "role": "user", + "content": [ + {"type" : "text", "text": "Describe this image in a short sentence."}, + {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}} + ] + } + ] + }' + ``` + + + ```python + import httpx -### Install vLLM + url = "http://localhost:8000/v1/chat/completions" + headers = {"Content-Type": "application/json", "Authorization": "Bearer token"} + data = { + "model": "mistralai/Pixtral-12B-2409", + "messages": [ + { + "role": "user", + "content": [ + {"type": "text", "text": "Describe this image in a short sentence."}, + { + "type": "image_url", + "image_url": {"url": "https://picsum.photos/id/237/200/300"}, + }, + ], + } + ], + } -Firstly you need to install vLLM (or use `conda add vllm` if you are using Anaconda): + response = httpx.post(url, headers=headers, json=data) -```bash -pip install vllm -``` + print(response.json()) + ``` + + -### Log in to the Hugging Face hub -You will also need to log in to the Hugging Face hub using: -```bash -huggingface-cli login -``` +
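+
+Because the server implements the OpenAI chat completions protocol, you can also
+query it with the official `openai` Python client instead of raw HTTP calls. A
+minimal sketch, assuming the `openai` package (v1 or later) is installed and the
+Pixtral server started above is running locally:
+
+```python
+from openai import OpenAI
+
+# The vLLM server does not check credentials unless it was started with --api-key,
+# so any placeholder string works as the API key here.
+client = OpenAI(base_url="http://localhost:8000/v1", api_key="token")
+
+response = client.chat.completions.create(
+    model="mistralai/Pixtral-12B-2409",
+    messages=[
+        {
+            "role": "user",
+            "content": [
+                {"type": "text", "text": "Describe this image in a short sentence."},
+                {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/200/300"}},
+            ],
+        }
+    ],
+    max_tokens=128,
+)
+print(response.choices[0].message.content)
+```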
+
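+
+If a single GPU does not have enough memory for the model you are serving, vLLM can
+shard the weights across several devices with tensor parallelism. As a sketch, assuming
+a host with two GPUs (adjust the value to your hardware):
+
+```bash
+vllm serve mistralai/Pixtral-12B-2409 \
+    --tokenizer_mode mistral \
+    --config_format mistral \
+    --load_format mistral \
+    --tensor-parallel-size 2
+```
+
+The offline `LLM(...)` entry point accepts the equivalent `tensor_parallel_size`
+argument.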
-### Run the OpenAI compatible inference endpoint +## Deploying with Docker + +If you are looking to deploy vLLM as a containerized inference server you can leverage +the project's official Docker image (see more details in the +[vLLM Docker documentation](https://docs.vllm.ai/en/latest/serving/deploying_with_docker.html)). + +- Set the HuggingFace access token environment variable in your shell: + ```bash + export HF_TOKEN=your-access-token + ``` + +- Run the Docker command to start the container: + + + ```bash + docker run --runtime nvidia --gpus all \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \ + -p 8000:8000 \ + --ipc=host \ + vllm/vllm-openai:latest \ + --model mistralai/Mistral-NeMo-Instruct-2407 \ + --tokenizer_mode mistral \ + --load_format mistral \ + --config_format mistral + ``` + + + ```bash + docker run --runtime nvidia --gpus all \ + -v ~/.cache/huggingface:/root/.cache/huggingface \ + --env "HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}" \ + -p 8000:8000 \ + --ipc=host \ + vllm/vllm-openai:latest \ + --model mistralai/Pixtral-12B-2409 \ + --tokenizer_mode mistral \ + --load_format mistral \ + --config_format mistral + ``` + + + +Once the container is up and running you will be able to run inference on your model +using the same code as in a standalone deployment. -You can then use the following command to start the server: - - - -```bash -python -u -m vllm.entrypoints.openai.api_server \ - --host 0.0.0.0 \ - --model mistralai/Mistral-7B-Instruct-v0.2 -``` - - - -```bash -python -u -m vllm.entrypoints.openai.api_server \ - --host 0.0.0.0 \ - --model mistralai/Mixtral-8X7B-Instruct-v0.1 \ - --tensor-parallel-size 2 # adapt to your GPUs \ - --load-format pt # needed since both `pt` and `safetensors` are available -``` - - - - - -```bash -python -u -m vllm.entrypoints.openai.api_server \ - --host 0.0.0.0 \ - --model mistralai/Mixtral-8X22B-Instruct-v0.1 \ - --tensor-parallel-size 4 # adapt to your GPUs \ -``` - - - diff --git a/static/img/countryside.png b/static/img/countryside.png new file mode 100644 index 0000000..9969d3a Binary files /dev/null and b/static/img/countryside.png differ diff --git a/static/img/doggo.png b/static/img/doggo.png new file mode 100644 index 0000000..8d261e6 Binary files /dev/null and b/static/img/doggo.png differ diff --git a/static/img/laptop.png b/static/img/laptop.png new file mode 100644 index 0000000..78532d9 Binary files /dev/null and b/static/img/laptop.png differ diff --git a/static/img/vintage_car.png b/static/img/vintage_car.png new file mode 100644 index 0000000..822b853 Binary files /dev/null and b/static/img/vintage_car.png differ diff --git a/version.txt b/version.txt index 3e24c46..c8fe2be 100644 --- a/version.txt +++ b/version.txt @@ -1 +1 @@ -v0.0.80 +v0.0.15