Add docker container support #1271

Open
wants to merge 16 commits into base: main
61 changes: 61 additions & 0 deletions docker/README.md
@@ -0,0 +1,61 @@
## Containers for the mlc_llm REST API and instant command line interface (CLI) chat with LLMs

A set of docker container templates for scaled production deployment of GPU-accelerated mlc_llm. They are based on the recent work on the SLM and JIT flows and on the OpenAI-compatible APIs, including function calling.

These containers are designed to be:

* minimalist - nothing non-essential is included; you can layer on your own security policy, for example
* non-opinionated - use CNCF k8s, docker compose, swarm, or whatever you have for orchestration
* adaptive and composable - nobody knows what you intend to do with these containers, and we don't guess
* compatible - with multi-GPU support maturing and batching still in testing, these containers should survive upcoming changes without needing to be severely revamped
* practical NOW - usable and deployable TODAY with 2024/2025 level workstation/consumer hardware and mlc-ai

### Structure

Base containers are segregated by GPU acceleration stack. See the README in each sub-folder for more information.
```
cuda
|-- cuda122


rocm
|-- rocm57

bin

test
```

The `bin` folder has the template executables that will start the containers.

The `test` folder contains the tests.

#### Community contribution

This structure makes it easy for the greater community to contribute new, tested templates for other CUDA and ROCm releases, for example.

#### Greatly enhanced out-of-box UX

Managing the huge physical size of LLM weights is a major hurdle when deploying modern LLMs in production or experimental environments at any scale. Couple this with the need to compile a neural-network support library for every combination and permutation of supported GPU hardware and OS, and an _impossibly frustrating_ out-of-box user experience is guaranteed.

The latest improvements to the JIT and SLM flows in MLC_LLM specifically address this. These docker container templates further enhance the out-of-box UX, down to a single easy-to-use command line (with automatic management of cached LLM weights).

Users of such images can simply decide to run "llama2 7b on cuda 12.2" and, with one single command, immediately pull down an image onto their workstation and run AI apps served by an already GPU-accelerated Llama 2. The weights are downloaded directly from Hugging Face and converted _specifically for their GPU hardware and OS_ the first time the command is executed; any subsequent invocation starts _instantly_ using the already converted weights.

As an example, the command to start an interactive chat with this LLM on a CUDA 12.2 accelerated Linux system is:

```
startcuda122chat.sh Llama-2-7b-chat-hf-q4f32_1
```

One container template is supplied for REST API serving, and another is available for interactive command line chat with any supported LLM.
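
For example, a minimal sketch of the serving counterpart (the model name and prompt are illustrative; the endpoint and default port `8000` follow the `serve` scripts and test clients included below):

```
# Start the OpenAI-compatible REST server (CUDA 12.2 image); the model name is an example
sh ./startcuda122serve.sh Llama-2-7b-chat-hf-q4f32_1

# From another terminal, call the chat completions endpoint on the default port 8000
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "HF://mlc-ai/Llama-2-7b-chat-hf-q4f32_1-MLC", "messages": [{"role": "user", "content": "write a haiku"}], "stream": false}'
```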


##### Compatibility with future improvements

There is no loss of flexibility in using these containers; the REST API implementation already supports batching, i.e. the ability to handle multiple concurrent inferences at the same time. Any future improvements in MLC_AI should be picked up without major changes to these templates, for example by simply rebuilding the base images from the latest nightlies.
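
As an illustrative sketch only (the model name and prompts are placeholders), several requests can be fired concurrently at a single running `serve` container and are handled by the same engine:

```
# Two concurrent chat-completion requests against one running server
for prompt in "write a haiku" "name three rivers in Europe"; do
  curl -s http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC\", \"messages\": [{\"role\": \"user\", \"content\": \"$prompt\"}], \"stream\": false}" &
done
wait   # wait for both background requests to finish
```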

#### Tests

Tests are kept global because they apply to mlc_ai running on any supported GPU configuration.

26 changes: 26 additions & 0 deletions docker/bin/README.md
@@ -0,0 +1,26 @@
### Container startup scripts


> NOTE: Please make sure you are in the `bin` directory when starting these scripts, and that you have write permissions to the `cache` directory there; the scripts use that `cache` folder to cache all model weights and custom compiled libraries.
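
For example (paths are relative to the repository root; adjust to your layout):

```
# Run the scripts from the `bin` directory, with a writable `cache` folder present
cd docker/bin
mkdir -p cache
```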

The supported `<model name>` values (at any time) can be obtained from [MLC AI's Huggingface Repo](https://huggingface.co/mlc-ai). There are *88 supported models at the time of writing*, with hundreds more coming soon.

![image](https://github.com/Sing-Li/dockertest/assets/122633/e1068b42-cfe1-4385-8c71-0791d2987d8b)

Some currently popular `model names` that our community is actively exploring include:

* `Llama-2-7b-chat-hf-q4f16_1`
* `Mistral-7B-Instruct-v0.2-q4f16_1`
* `gemma-7b-it-q4f16_2`
* `phi-1_5-q4f32_1`

Try using these `<model name>` values when parameterizing the scripts.

You can modify the `serve` scripts directly to bind a specific network interface on a multi-homed system (the default is `0.0.0.0`, i.e. all interfaces) and to change the listening port (the default is `8000`); see the sketch after the table below.

|Command | Description | Usage|
|-------|------|------|
|`startcuda122chat.sh` | starts a command line interactive chat with the specified LLM on a CUDA 12.2 Linux system | `sh ./startcuda122chat.sh <mlc model name>`|
|`startcuda122serve.sh` | runs a server handling multiple concurrent REST API calls to the specified LLM on a CUDA 12.2 Linux system| `sh ./startcuda122serve.sh <mlc model name>`|
|`startrocm57chat.sh` | starts a command line interactive chat with the specified LLM on a ROCm 5.7 Linux system | `sh ./startrocm57chat.sh <mlc model name>`|
|`startrocm57serve.sh` | runs a server handling multiple concurrent REST API calls to the specified LLM on a ROCm 5.7 Linux system| `sh ./startrocm57serve.sh <mlc model name>`|
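
As an illustrative sketch of the customization mentioned above, the `docker run` line inside `startcuda122serve.sh` could be changed as follows (the interface address and port are example values only):

```
docker run --gpus all --rm --network host -v ./cache:/root/.cache \
  mlcllmcuda122:v0.1 serve HF://mlc-ai/$1-MLC --host 192.168.1.10 --port 9000
```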
1 change: 1 addition & 0 deletions docker/bin/startcuda122chat.sh
@@ -0,0 +1 @@
docker run --gpus all --rm -it --network host -v ./cache:/root/.cache mlcllmcuda122:v0.1 chat HF://mlc-ai/$1-MLC
1 change: 1 addition & 0 deletions docker/bin/startcuda122serve.sh
@@ -0,0 +1 @@
docker run --gpus all --rm --network host -v ./cache:/root/.cache mlcllmcuda122:v0.1 serve HF://mlc-ai/$1-MLC --host 0.0.0.0 --port 8000
1 change: 1 addition & 0 deletions docker/bin/startrocm57chat.sh
@@ -0,0 +1 @@
docker run --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --rm --network host -v ./cache:/root/.cache mlcllmrocm57:v0.1 chat HF://mlc-ai/$1-MLC
1 change: 1 addition & 0 deletions docker/bin/startrocm57serve.sh
@@ -0,0 +1 @@
docker run --device=/dev/kfd --device=/dev/dri --security-opt seccomp=unconfined --group-add video --rm --network host -v ./cache:/root/.cache mlcllmrocm57:v0.1 serve HF://mlc-ai/$1-MLC --host 0.0.0.0 --port 8000
22 changes: 22 additions & 0 deletions docker/cuda/cuda122/Dockerfile
@@ -0,0 +1,22 @@
FROM nvidia/cuda:12.2.2-devel-ubuntu22.04

ENV MLC_PATH /mlcllm

# setup python 3 and pip, load the mlc-ai nightlies

RUN apt update && \
apt install --yes python3.11 pip git git-lfs && \
pip install --pre -U -f https://mlc.ai/wheels \
mlc-llm-nightly-cu122 mlc-ai-nightly-cu122 &&\
mkdir -p $MLC_PATH

VOLUME ${MLC_PATH}

WORKDIR ${MLC_PATH}


ENTRYPOINT ["mlc_llm"]

CMD ["chat", "HF://mlc-ai/Llama-2-7b-chat-hf-q4f32_1-MLC"]


7 changes: 7 additions & 0 deletions docker/cuda/cuda122/README.md
@@ -0,0 +1,7 @@
## Base mlc_llm docker image for Cuda 12.2 systems

Make sure you run:

`sh ./buildimage.sh`

This will build the base docker image for CUDA 12.2 from the latest nightly. The resulting image will be in your local registry; you can push it further to any deployment registry. The image is very large (about 18.4GB) since it includes the full CUDA toolkit and support libraries.
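
For example, a minimal sketch of pushing the freshly built image to a deployment registry (the registry host and repository path are placeholders):

```
# Tag the locally built image for your registry and push it (names are illustrative)
docker tag mlcllmcuda122:v0.1 registry.example.com/mlc/mlcllmcuda122:v0.1
docker push registry.example.com/mlc/mlcllmcuda122:v0.1
```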
3 changes: 3 additions & 0 deletions docker/cuda/cuda122/buildimage.sh
@@ -0,0 +1,3 @@
docker build --no-cache -t mlcllmcuda122:v0.1 -f ./Dockerfile .


21 changes: 21 additions & 0 deletions docker/rocm/rocm57/Dockerfile
@@ -0,0 +1,21 @@
# NOTE: This Dockerfile is based on ROCm 5.7
FROM rocm/dev-ubuntu-22.04:5.7-complete

ENV MLC_PATH /mlcllm

# setup python 3 and pip, load the mlc-ai nightlies

RUN apt update && \
apt install --yes python3.11 pip git git-lfs && \
pip install --pre -U -f https://mlc.ai/wheels \
mlc-llm-nightly-rocm57 mlc-ai-nightly-rocm57 &&\
mkdir -p $MLC_PATH

VOLUME ${MLC_PATH}

WORKDIR ${MLC_PATH}

ENTRYPOINT ["mlc_llm"]

CMD ["chat", "HF://mlc-ai/Llama-2-7b-chat-hf-q4f32_1-MLC"]

7 changes: 7 additions & 0 deletions docker/rocm/rocm57/README.md
@@ -0,0 +1,7 @@
## Base mlc_llm docker image for ROCm 5.7 systems

Make sure you run:

`sh ./buildimage.sh`

This will build the base docker image for ROCm 5.7 from the latest nightly. The resulting image will be in your local registry; you can push it further to any deployment registry. The image is very large (about 28.1GB) since it includes the full ROCm toolkit and support libraries.
1 change: 1 addition & 0 deletions docker/rocm/rocm57/buildimage.sh
@@ -0,0 +1 @@
docker build --no-cache -t mlcllmrocm57:v0.1 -f ./Dockerfile .
8 changes: 8 additions & 0 deletions docker/test/README.md
@@ -0,0 +1,8 @@
## Tests for mlc_llm `serve`

Simple test programs for REST API serving (including the function calling / tools pattern) when using mlc_llm `serve` with any of the hundreds of supported models.

|Test name|Description|
|------------|---------------|
|`sample_client_for-testing.py`|Calls the chat completion REST API once without streaming, then again with streaming, and displays the output. Make sure you modify the `payload` model name field to match the actual LLM you are testing.|
|`functioncall.py`|An actual function calling example utilizing the OpenAI-compatible API _tools_ field. Make sure you modify the `payload` model name field to match the actual LLM you are testing. This example will only work with models fine-tuned for function calling, including many Mixtral/Mistral derivatives.|
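
A minimal sketch of running these tests against a locally served model (the served model must match the `model` field in each test's `payload`, and the Python `requests` package is assumed to be installed):

```
# Start a server first (see docker/bin), e.g.:
sh ./startcuda122serve.sh Mistral-7B-Instruct-v0.2-q4f16_1

# Then, from this test folder:
python3 sample_client_for-testing.py
python3 functioncall.py
```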
41 changes: 41 additions & 0 deletions docker/test/functioncall.py
@@ -0,0 +1,41 @@
import requests
import json

tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
},
}
]

payload = {
"model": "HF://mlc-ai/gorilla-openfunctions-v2-q4f16_1-MLC",
# "model": "HF://mlc-ai/gemma-2b-it-q4f16_1-MLC",
"messages": [
{
"role": "user",
"content": "What is the current weather in Pittsburgh, PA in fahrenheit?",
}
],
"stream": False,
"tools": tools,
}

r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
print(f"{r.json()['choices'][0]['message']['tool_calls'][0]['function']}\n")

# Output: {'name': 'get_current_weather', 'arguments': {'location': 'Pittsburgh, PA', 'unit': 'fahrenheit'}}
45 changes: 45 additions & 0 deletions docker/test/sample_client_for-testing.py
@@ -0,0 +1,45 @@
import requests
import json

class color:
PURPLE = '\033[95m'
CYAN = '\033[96m'
DARKCYAN = '\033[36m'
BLUE = '\033[94m'
GREEN = '\033[92m'
YELLOW = '\033[93m'
RED = '\033[91m'
BOLD = '\033[1m'
UNDERLINE = '\033[4m'
END = '\033[0m'

# Get a response using a prompt without streaming
payload = {
# "model": "HF://mlc-ai/gemma-2b-it-q4f16_1-MLC",
"model": "HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC",
"messages": [{"role": "user", "content": "write a haiku"}],
"stream": False
}
r = requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload)
print(f"{color.BOLD}Without streaming:{color.END}\n{color.GREEN}{r.json()['choices'][0]['message']['content']}{color.END}\n")


payload = {
# "model": "HF://mlc-ai/gemma-2b-it-q4f16_1-MLC",
"model": "HF://mlc-ai/Mistral-7B-Instruct-v0.2-q4f16_1-MLC",
"messages": [{"role": "user", "content": "Write a 500 words essay about the civil war"}],
"stream": True
}

print(f"{color.BOLD}With streaming:{color.END}")
with requests.post("http://127.0.0.1:8000/v1/chat/completions", json=payload, stream=True) as r:
for chunk in r.iter_content(chunk_size=None):
chunk = chunk.decode("utf-8")
if "[DONE]" in chunk[6:]:
break
response = json.loads(chunk[6:])
content = response["choices"][0]["delta"].get("content", "")
print(f"{color.GREEN}{content}{color.END}", end="", flush=True)

print("\n")