From 801f99b154e44b857c67da3ecca3f1745dc2ad0b Mon Sep 17 00:00:00 2001
From: jiangzishan
Date: Tue, 19 Nov 2024 22:07:14 +0800
Subject: [PATCH] add llm_perf doc.

---
 byte_infer_perf/llm_perf/README.md | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/byte_infer_perf/llm_perf/README.md b/byte_infer_perf/llm_perf/README.md
index 4327db23..72ad43a4 100644
--- a/byte_infer_perf/llm_perf/README.md
+++ b/byte_infer_perf/llm_perf/README.md
@@ -24,6 +24,12 @@ You can run following command automate all steps with chatglm2 model on GPU back
 python3 byte_infer_perf/llm_perf/launch.py --hardware_type GPU --task chatglm2-torch-fp16-6b
 ```
 
+## Split model
+Splitting the model is required when it is too large to fit on a single GPU. Except for chatglm2-6b, models must be split manually with the corresponding `split_model.py` script under `backends/GPU/model_impl`, e.g. `split_mixtral.py`; `chatglm2-6b` is split automatically online.
+
+After splitting the model, you will find a subdirectory `TP8` (tp_size=8) under the model directory.
+
+
 ## Test accuracy (single query with specify prompt)
 Launch a server running mixtral-8x22b (tp_size=8, max_batch_size=8) with following command:
 ```shell