From 0dbf19b2de548a5d97fd327b9207dbfe44301784 Mon Sep 17 00:00:00 2001
From: Zishan Jiang
Date: Fri, 28 Jun 2024 11:01:56 +0800
Subject: [PATCH] Update llm_perf README.md

---
 byte_infer_perf/llm_perf/README.md | 61 ++++++++++++++++++++++--------
 1 file changed, 46 insertions(+), 15 deletions(-)

diff --git a/byte_infer_perf/llm_perf/README.md b/byte_infer_perf/llm_perf/README.md
index 4807ce28..f0d0cadb 100644
--- a/byte_infer_perf/llm_perf/README.md
+++ b/byte_infer_perf/llm_perf/README.md
@@ -1,32 +1,63 @@
 # Byte LLM Perf
-
-Vendors can refer to this document for guidance on building backend: [Byte LLM Perf](https://bytemlperf.ai/zh/guide/inference_llm_vendor.html)
-
 ## Requirements
 * Python >= 3.8
-* torch==2.1.0
+* torch >= 2.1.0
 
 ## Installation
 ```shell
-pip3 install torch==2.1.0
+# modify according to your torch version and hardware
+pip3 install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118
+
+# install required packages
 pip3 install -r requirements.txt
 ```
 
 ## Quick Start
-Please be sure to complete the installation steps before proceeding with the following steps.
-
-To start llm_perf, there are 3 steps:
-1. Download opensource model weights(.pt file)
-2. Download model output logits in specific input case(.npy file)
-3. Start accuracy and performance test case
+Please be sure to complete the installation steps before proceeding with the following steps:
+1. Modify the task workload, for example [chatglm2-torch-fp16-6b.json](https://github.com/bytedance/ByteMLPerf/blob/main/byte_infer_perf/llm_perf/workloads/chatglm2-torch-fp16-6b.json).
+2. Download model weights using prepare_model.sh or huggingface-cli.
+3. Download model output logits for the specific input cases (.npy files) using prepare_model.sh.
+4. Start the accuracy and performance tests.
 
 You can run following command automate all steps with chatglm2 model on GPU backend
 ```shell
 python3 byte_infer_perf/llm_perf/launch.py --hardware_type GPU --task chatglm2-torch-fp16-6b
 ```
 
+## Demo Project
+[GPU Backend](https://github.com/bytedance/ByteMLPerf/tree/main/byte_infer_perf/llm_perf/backends/GPU) provides a demo project that implements LLM inference for chatglm2-6b on A100 with the following features:
+- Separate functional components:
+    * Scheduler
+        - custom scheduling of tasks
+    * Inferencer
+        - transfers tasks into real model inputs and gets outputs
+    * Mp Engine
+        - deals with TP logic using multiple processes
+    * Sampler
+        - post-processing logic
+    * Ckpt Loader
+        - custom checkpoint loader with split logic that matches the TP logic
+    * Custom model implementation
+        - custom model implementation using the hardware backend's torch realization
+- Separate scheduling logic:
+    * Context: one task, input_ids shape is [1, q_len]
+    * Decode: multiple tasks, input_ids shape up to [max_batch_size, 1]
+- Tensor parallelism
+- KV cache
+
+The demo project is intended to provide a reference implementation; there is no guarantee of optimal performance.
+More technical details will be provided later on [ByteMLPerf](https://bytemlperf.ai).
+
+
+## Vendor Integration
+Vendors can refer to this document for guidance on building a backend: [Byte LLM Perf](https://bytemlperf.ai/zh/guide/inference_llm_vendor.html)
+
 ## Models
-The list of supported models is:
-* [chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
-* [chinese-llama-2-13b](https://huggingface.co/hfl/chinese-llama-2-13b)
-* [Mixtral-8x7B-v0.1](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1)
+The following models are planned to be supported:
+* [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
+* [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B)
+* [tiiuae/falcon-180B](https://huggingface.co/tiiuae/falcon-180B)
+* [mistralai/Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1)
+
+The following models are outdated and will be removed in future versions:
+* [hfl/chinese-llama-2-13b](https://huggingface.co/hfl/chinese-llama-2-13b)
+
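As a reading aid for the Demo Project section added by this patch, the following is a minimal, illustrative sketch of the context/decode split it describes: a context (prefill) step runs one task with input_ids of shape [1, q_len], while a decode step batches several tasks with input_ids of shape up to [max_batch_size, 1]. All names here (Request, build_context_input, build_decode_input, MAX_BATCH_SIZE) are hypothetical and are not part of the llm_perf codebase or of this patch.

```python
# Illustrative sketch only: context vs. decode batching as described in the
# Demo Project notes. Class and function names are hypothetical, not llm_perf APIs.
from dataclasses import dataclass, field
from typing import List

import torch

MAX_BATCH_SIZE = 8  # assumed upper bound on the decode batch


@dataclass
class Request:
    prompt_ids: List[int]                               # tokenized prompt
    generated_ids: List[int] = field(default_factory=list)


def build_context_input(req: Request) -> torch.Tensor:
    # Context (prefill) phase: a single task, input_ids shape [1, q_len].
    return torch.tensor([req.prompt_ids], dtype=torch.long)


def build_decode_input(reqs: List[Request]) -> torch.Tensor:
    # Decode phase: multiple tasks batched together, one new token per task,
    # input_ids shape up to [max_batch_size, 1].
    batch = reqs[:MAX_BATCH_SIZE]
    last_tokens = [(r.generated_ids or r.prompt_ids)[-1] for r in batch]
    return torch.tensor(last_tokens, dtype=torch.long).unsqueeze(1)


if __name__ == "__main__":
    reqs = [Request(prompt_ids=[1, 2, 3, 4]), Request(prompt_ids=[5, 6])]
    print(build_context_input(reqs[0]).shape)   # torch.Size([1, 4])
    print(build_decode_input(reqs).shape)       # torch.Size([2, 1])
```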
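Similarly, the Ckpt Loader bullet mentions split logic that matches the TP layout. The sketch below shows one common way such a split can be expressed with torch.chunk; split_for_tp is a hypothetical helper, not an actual llm_perf API, and a real loader would also need to distinguish column-parallel from row-parallel layers and handle weights that do not divide evenly across ranks.

```python
# Illustrative sketch only: sharding a checkpoint weight across TP ranks.
from typing import List

import torch


def split_for_tp(weight: torch.Tensor, tp_size: int, dim: int) -> List[torch.Tensor]:
    # Each TP rank loads one shard; the split dimension depends on whether the
    # layer is column-parallel (output dim) or row-parallel (input dim).
    assert weight.size(dim) % tp_size == 0, "weight must divide evenly across ranks"
    return list(torch.chunk(weight, tp_size, dim=dim))


if __name__ == "__main__":
    full_weight = torch.randn(4096, 4096)
    shards = split_for_tp(full_weight, tp_size=2, dim=0)
    print([tuple(s.shape) for s in shards])  # [(2048, 4096), (2048, 4096)]
```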