From 0dbf19b2de548a5d97fd327b9207dbfe44301784 Mon Sep 17 00:00:00 2001
From: Zishan Jiang
Date: Fri, 28 Jun 2024 11:01:56 +0800
Subject: [PATCH] Update llm_perf README.md

---
 byte_infer_perf/llm_perf/README.md | 61 ++++++++++++++++++++++--------
 1 file changed, 46 insertions(+), 15 deletions(-)

diff --git a/byte_infer_perf/llm_perf/README.md b/byte_infer_perf/llm_perf/README.md
index 4807ce28..f0d0cadb 100644
--- a/byte_infer_perf/llm_perf/README.md
+++ b/byte_infer_perf/llm_perf/README.md
@@ -1,32 +1,63 @@
 # Byte LLM Perf
-
-Vendors can refer to this document for guidance on building backend: [Byte LLM Perf](https://bytemlperf.ai/zh/guide/inference_llm_vendor.html)
-
 ## Requirements
 * Python >= 3.8
-* torch==2.1.0
+* torch >= 2.1.0
 
 ## Installation
 ```shell
-pip3 install torch==2.1.0
+# modify according to your torch version and hardware
+pip3 install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
+
+# install required packages
 pip3 install -r requirements.txt
 ```
 
 ## Quick Start
-Please be sure to complete the installation steps before proceeding with the following steps.
-
-To start llm_perf, there are 3 steps:
-1. Download opensource model weights(.pt file)
-2. Download model output logits in specific input case(.npy file)
-3. Start accuracy and performance test case
+Please be sure to complete the installation steps before proceeding with the following steps:
+1. Modify the task workload, for example [chatglm2-torch-fp16-6b.json](https://github.com/bytedance/ByteMLPerf/blob/main/byte_infer_perf/llm_perf/workloads/chatglm2-torch-fp16-6b.json).
+2. Download model weights using prepare_model.sh or huggingface-cli.
+3. Download model output logits for the specific input cases (.npy files) using prepare_model.sh.
+4. Start the accuracy and performance tests.
 
 You can run following command automate all steps with chatglm2 model on GPU backend
 ```shell
 python3 byte_infer_perf/llm_perf/launch.py --hardware_type GPU --task chatglm2-torch-fp16-6b
 ```
 
+## Demo Project
+[GPU Backend](https://github.com/bytedance/ByteMLPerf/tree/main/byte_infer_perf/llm_perf/backends/GPU) provides a demo project that implements LLM inference for chatglm2-6b on A100 with the following features:
+- Separate functional components:
+    * Scheduler
+        - custom scheduling of tasks
+    * Inferencer
+        - transfers tasks into real model inputs and gets outputs
+    * Mp Engine
+        - deals with TP logic using multiple processes
+    * Sampler
+        - post-processing logic
+    * Ckpt Loader
+        - custom checkpoint loader with split logic that matches the TP logic
+    * Custom model implementation
+        - custom model implementation using the hardware backend's torch realization
+- Separate scheduling logic:
+    * Context: one task, input_ids shape is [1, q_len]
+    * Decode: multiple tasks, input_ids shape up to [max_batch_size, 1]
+- Tensor parallelism
+- KV cache
+
+The demo project is intended to provide a reference implementation; there is no guarantee of optimal performance.
+More technical details will be provided later on [ByteMLPerf](https://bytemlperf.ai).
+
+
+## Vendor Integration
+Vendors can refer to this document for guidance on building a backend: [Byte LLM Perf](https://bytemlperf.ai/zh/guide/inference_llm_vendor.html)
+
 ## Models
-The list of supported models is:
-* [chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
-* [chinese-llama-2-13b](https://huggingface.co/hfl/chinese-llama-2-13b)
-* [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)
+The following models are planned to be supported:
+* [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
+* [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B)
+* [tiiuae/falcon-180B](https://huggingface.co/tiiuae/falcon-180B)
+* [mistralai/Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1)
+
+The following models are outdated and will be removed in future versions:
+* [hfl/chinese-llama-2-13b](https://huggingface.co/hfl/chinese-llama-2-13b)
+
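As a reading aid for the Demo Project section added by this patch, the following is a minimal, illustrative sketch of the context/decode split it describes: a context (prefill) step runs one task with input_ids of shape [1, q_len], while a decode step batches several tasks with input_ids of shape up to [max_batch_size, 1]. All names here (Request, build_context_input, build_decode_input, MAX_BATCH_SIZE) are hypothetical and are not part of the llm_perf codebase or of this patch.

```python
# Illustrative sketch only: context vs. decode batching as described in the
# Demo Project notes. Class and function names are hypothetical, not llm_perf APIs.
from dataclasses import dataclass, field
from typing import List

import torch

MAX_BATCH_SIZE = 8  # assumed upper bound on the decode batch


@dataclass
class Request:
    prompt_ids: List[int]                               # tokenized prompt
    generated_ids: List[int] = field(default_factory=list)


def build_context_input(req: Request) -> torch.Tensor:
    # Context (prefill) phase: a single task, input_ids shape [1, q_len].
    return torch.tensor([req.prompt_ids], dtype=torch.long)


def build_decode_input(reqs: List[Request]) -> torch.Tensor:
    # Decode phase: multiple tasks batched together, one new token per task,
    # input_ids shape up to [max_batch_size, 1].
    batch = reqs[:MAX_BATCH_SIZE]
    last_tokens = [(r.generated_ids or r.prompt_ids)[-1] for r in batch]
    return torch.tensor(last_tokens, dtype=torch.long).unsqueeze(1)


if __name__ == "__main__":
    reqs = [Request(prompt_ids=[1, 2, 3, 4]), Request(prompt_ids=[5, 6])]
    print(build_context_input(reqs[0]).shape)   # torch.Size([1, 4])
    print(build_decode_input(reqs).shape)       # torch.Size([2, 1])
```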
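Similarly, the Ckpt Loader bullet mentions split logic that matches the TP layout. The sketch below shows one common way such a split can be expressed with torch.chunk; split_for_tp is a hypothetical helper, not an actual llm_perf API, and a real loader would also need to distinguish column-parallel from row-parallel layers and handle weights that do not divide evenly across ranks.

```python
# Illustrative sketch only: sharding a checkpoint weight across TP ranks.
from typing import List

import torch


def split_for_tp(weight: torch.Tensor, tp_size: int, dim: int) -> List[torch.Tensor]:
    # Each TP rank loads one shard; the split dimension depends on whether the
    # layer is column-parallel (output dim) or row-parallel (input dim).
    assert weight.size(dim) % tp_size == 0, "weight must divide evenly across ranks"
    return list(torch.chunk(weight, tp_size, dim=dim))


if __name__ == "__main__":
    full_weight = torch.randn(4096, 4096)
    shards = split_for_tp(full_weight, tp_size=2, dim=0)
    print([tuple(s.shape) for s in shards])  # [(2048, 4096), (2048, 4096)]
```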