feat(scripts): clean LLM latency benchmark (#132)
* refactor(scripts): update latency script
* docs(readme): update script instructions
* docs(latency): update latency benchmark
* feat(scripts): add docker for latency bench
* docs(scripts): add a README
* docs(readme): update readme
Showing 7 changed files with 193 additions and 81 deletions.
This file was deleted.
@@ -0,0 +1,17 @@
FROM python:3.11-alpine3.19

WORKDIR /app

# set environment variables
ENV PYTHONDONTWRITEBYTECODE 1
ENV PYTHONUNBUFFERED 1
ENV PYTHONPATH "${PYTHONPATH}:/app"

# install dependencies
RUN set -eux \
    && pip install --no-cache-dir uv \
    && uv pip install --no-cache --system requests==2.31.0 tqdm==4.66.2 numpy==1.26.4 \
    && rm -rf /root/.cache

# copy script
COPY ./evaluate_latency.py /app/evaluate.py
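For reference, the image can also be built and smoke-tested outside of Docker Compose. A minimal sketch, assuming it is run from the directory containing this Dockerfile and `evaluate_latency.py` (the tag mirrors the compose file below):
```
# build the evaluator image locally (tag matches docker-compose.yml)
docker build -t quackai/evaluator:latest .
# smoke test: the script is copied to /app/evaluate.py by the COPY above
docker run --rm quackai/evaluator:latest python /app/evaluate.py --help
```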
@@ -0,0 +1,65 @@
# LLM throughput benchmark

## The benchmark

You crave perfect code suggestions, but you don't know whether a given model fits your latency requirements? This benchmark is meant to answer exactly that.

We ran our tests on the following hardware:

- [NVIDIA GeForce RTX 3060](https://www.nvidia.com/fr-fr/geforce/graphics-cards/30-series/rtx-3060-3060ti/) (mobile)*
- [NVIDIA GeForce RTX 3070](https://www.nvidia.com/fr-fr/geforce/graphics-cards/30-series/rtx-3070-3070ti/) ([Scaleway GPU-3070-S](https://www.scaleway.com/en/pricing/?tags=compute))
- [NVIDIA A10](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) ([Lambda Cloud gpu_1x_a10](https://lambdalabs.com/service/gpu-cloud#pricing))
- [NVIDIA A10G](https://www.nvidia.com/en-us/data-center/products/a10-gpu/) ([AWS g5.xlarge](https://aws.amazon.com/ec2/instance-types/g5/))

*The laptop setup pairs the RTX 3060 (mobile) with an [Intel(R) Core(TM) i7-12700H](https://ark.intel.com/content/www/us/en/ark/products/132228/intel-core-i7-12700h-processor-24m-cache-up-to-4-70-ghz.html) CPU.*

with the following LLMs (cf. [Ollama hub](https://ollama.com/library)):

- Deepseek Coder 6.7b - instruct ([Ollama](https://ollama.com/library/deepseek-coder), [HuggingFace](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-instruct))
- OpenCodeInterpreter 6.7b ([Ollama](https://ollama.com/pxlksr/opencodeinterpreter-ds), [HuggingFace](https://huggingface.co/m-a-p/OpenCodeInterpreter-DS-6.7B), [paper](https://arxiv.org/abs/2402.14658))
- Dolphin Mistral 7b ([Ollama](https://ollama.com/library/dolphin-mistral), [HuggingFace](https://huggingface.co/cognitivecomputations/dolphin-2.6-mistral-7b-dpo-laser), [paper](https://arxiv.org/abs/2310.06825))
- Coming soon: StarChat v2 ([HuggingFace](https://huggingface.co/HuggingFaceH4/starchat2-15b-v0.1), [paper](https://arxiv.org/abs/2402.19173))

and the following quantization formats: q3_K_M, q4_K_M, q5_K_M.

This [benchmark](latency.csv) was performed over 5 iterations on 4 different sequences, including on a **laptop**, to better reflect the performance common users can expect.
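Throughput figures like these can be read off Ollama's API: a non-streamed `/api/generate` response carries `prompt_eval_count`/`prompt_eval_duration` for ingestion and `eval_count`/`eval_duration` for generation, with durations in nanoseconds, so tok/s is `count / duration * 1e9`. A minimal sketch of a single measurement, assuming Ollama listens on the default port and the model is already pulled (the script's actual logic lives in `evaluate_latency.py`):
```
# single non-streamed completion; the JSON response includes
# prompt_eval_count/prompt_eval_duration (ingestion) and
# eval_count/eval_duration (generation), durations in nanoseconds
curl -s http://localhost:11434/api/generate -d '{
  "model": "deepseek-coder:6.7b-instruct-q4_K_M",
  "prompt": "def fibonacci(n):",
  "stream": false
}'
```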

## Run it on your hardware

### Local setup

Quite simply, start the Docker services:
```
docker compose up -d --wait
```
Pull the model you want to evaluate:
```
docker compose exec -T ollama ollama pull MODEL
```
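For instance, with one of the benchmarked models (any tag from the [Ollama hub](https://ollama.com/library) works):
```
docker compose exec -T ollama ollama pull deepseek-coder:6.7b-instruct-q4_K_M
```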

And run the evaluation:
```
docker compose exec -T evaluator python scripts/ollama/evaluate_latency.py MODEL
```

### Remote instance

Start the evaluator only:
```
docker compose up -d evaluator --wait
```
And run the evaluation by targeting your remote instance:
```
docker compose exec -T evaluator python scripts/ollama/evaluate_latency.py MODEL --endpoint http://HOST:PORT
```
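Before launching a remote run, it may be worth checking that the endpoint is reachable from inside the evaluator container. Since the image ships with `requests`, a one-liner does it (a sketch, reusing the `HOST:PORT` placeholder; `/api/tags` lists the models pulled on the instance):
```
# requests is installed in the evaluator image (see the Dockerfile)
docker compose exec -T evaluator python -c "import requests; print(requests.get('http://HOST:PORT/api/tags', timeout=5).json())"
```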

*All script arguments can be checked using `python scripts/ollama/evaluate_latency.py --help`*

### Others

Here are the results for other LLMs that have only been evaluated on the laptop GPU:

| Model | Ingestion mean (std) | Generation mean (std) |
| ----- | -------------------- | --------------------- |
| [tinyllama:1.1b-chat-v1-q4_0](https://ollama.com/library/tinyllama:1.1b-chat-v1-q4_0) | 2014.63 tok/s (±12.62) | 227.13 tok/s (±2.26) |
| [dolphin-phi:2.7b-v2.6-q4_0](https://ollama.com/library/dolphin-phi:2.7b-v2.6-q4_0) | 684.07 tok/s (±3.85) | 122.25 tok/s (±0.87) |
| [dolphin-mistral:7b-v2.6](https://ollama.com/library/dolphin-mistral:7b-v2.6) | 291.94 tok/s (±0.4) | 60.56 tok/s (±0.15) |
@@ -0,0 +1,37 @@
version: '3.7'

services:
  ollama:
    image: ollama/ollama:0.1.29
    ports:
      - "11434:11434"
    volumes:
      - "$HOME/.ollama:/root/.ollama"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: serve
    healthcheck:
      test: ["CMD-SHELL", "ollama --help"]
      interval: 10s
      timeout: 5s
      retries: 3

  evaluator:
    image: quackai/evaluator:latest
    build: .
    depends_on:
      ollama:
        condition: service_healthy
    environment:
      - OLLAMA_ENDPOINT=http://ollama:11434
    volumes:
      - ./evaluate_latency.py:/app/evaluate.py
    command: sleep infinity

volumes:
  ollama:
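Once the stack is up, a quick sanity check confirms both services are running and Ollama answers on the mapped port (a sketch, assuming `curl` is available on the host):
```
# both services should report healthy/running
docker compose ps
# /api/tags lists the models pulled locally
curl -s http://localhost:11434/api/tags
```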
@@ -0,0 +1,22 @@
model,hardware,ingestion_mean (tok/s),ingestion_std (tok/s),generation_mean (tok/s),generation_std (tok/s)
deepseek-coder:6.7b-instruct-q5_K_M,NVIDIA RTX 3060 (laptop),35.43,3.46,23.68,0.74
deepseek-coder:6.7b-instruct-q4_K_M,NVIDIA RTX 3060 (laptop),72.27,10.69,36.82,1.25
deepseek-coder:6.7b-instruct-q3_K_M,NVIDIA RTX 3060 (laptop),90.1,32.43,50.34,1.28
pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M,NVIDIA RTX 3060 (laptop),78.94,10.2,37.95,1.65
dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M,NVIDIA RTX 3060 (laptop),126.75,31.5,50.05,0.84
dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M,NVIDIA RTX 3060 (laptop),89.47,29.91,47.09,0.67
deepseek-coder:6.7b-instruct-q4_K_M,NVIDIA RTX 3070 (Scaleway GPU-3070-S),266.98,95.63,75.53,1.56
deepseek-coder:6.7b-instruct-q3_K_M,NVIDIA RTX 3070 (Scaleway GPU-3070-S),141.43,50.4,73.69,1.61
pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M,NVIDIA RTX 3070 (Scaleway GPU-3070-S),285.81,73.55,75.14,3.13
dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M,NVIDIA RTX 3070 (Scaleway GPU-3070-S),234.2,79.38,71.54,1
dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M,NVIDIA RTX 3070 (Scaleway GPU-3070-S),114.54,38.24,69.29,0.98
deepseek-coder:6.7b-instruct-q4_K_M,NVIDIA A10 (Lambda Cloud gpu_1x_a10),208.65,74.02,78.68,1.64
deepseek-coder:6.7b-instruct-q3_K_M,NVIDIA A10 (Lambda Cloud gpu_1x_a10),111.84,39.9,71.66,1.75
pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M,NVIDIA A10 (Lambda Cloud gpu_1x_a10),226.66,65.65,77.26,2.72
dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M,NVIDIA A10 (Lambda Cloud gpu_1x_a10),202.43,69.55,73.9,0.87
dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M,NVIDIA A10 (Lambda Cloud gpu_1x_a10),112.82,38.46,66.98,0.79
deepseek-coder:6.7b-instruct-q4_K_M,A10G (AWS g5.xlarge),186.81,66.03,79.62,1.52
deepseek-coder:6.7b-instruct-q3_K_M,A10G (AWS g5.xlarge),99.83,35.41,84.47,1.69
pxlksr/opencodeinterpreter-ds:6.7b-Q4_K_M,A10G (AWS g5.xlarge),212.08,86.58,79.02,3.35
dolphin-mistral:7b-v2.6-dpo-laser-q4_K_M,A10G (AWS g5.xlarge),187.2,62.24,75.91,1
dolphin-mistral:7b-v2.6-dpo-laser-q3_K_M,A10G (AWS g5.xlarge),102.36,34.29,81.23,1.02
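To skim these results without a spreadsheet, a shell one-liner over the CSV works (a sketch; field 5 is `generation_mean (tok/s)`). For example, the five fastest model/hardware pairs by generation speed:
```
# skip the header, sort numerically on generation_mean, keep model/hardware/speed
tail -n +2 latency.csv | sort -t, -k5,5 -rn | cut -d, -f1,2,5 | head -5
```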