From 801f99b154e44b857c67da3ecca3f1745dc2ad0b Mon Sep 17 00:00:00 2001
From: jiangzishan
Date: Tue, 19 Nov 2024 22:07:14 +0800
Subject: [PATCH] add llm_perf doc.

---
 byte_infer_perf/llm_perf/README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/byte_infer_perf/llm_perf/README.md b/byte_infer_perf/llm_perf/README.md
index 4327db23..72ad43a4 100644
--- a/byte_infer_perf/llm_perf/README.md
+++ b/byte_infer_perf/llm_perf/README.md
@@ -24,6 +24,12 @@ You can run following command automate all steps with chatglm2 model on GPU back
 python3 byte_infer_perf/llm_perf/launch.py --hardware_type GPU --task chatglm2-torch-fp16-6b
 ```
 
+## Split model
+Splitting the model is required when it is too large to fit on a single GPU. Except for chatglm2-6b, models must be split manually with the corresponding `split_model.py` script under `backends/GPU/model_impl`, e.g. `split_mixtral.py`; `chatglm2-6b` is split automatically online.
+
+After splitting the model, you will find a subdirectory `TP8` (tp_size=8) under the model directory.
+
+
 ## Test accuracy (single query with specify prompt)
 Launch a server running mixtral-8x22b (tp_size=8, max_batch_size=8) with following command:
 ```shell