# Byte LLM Perf

## Requirements
* Python >= 3.8
* torch >= 2.1.0

## Installation
```shell
# modify according to torch version and hardware
pip3 install torch==2.1.0 --index-url https://download.pytorch.org/whl/cu118

# install required packages
pip3 install -r requirements.txt
```
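
After installation, an optional sanity check can confirm the setup. This is a minimal sketch, assuming a CUDA build of torch; it only verifies that the expected version imports and can see a GPU:
```python
# Optional sanity check: verify the installed torch version and GPU visibility.
import torch

print(torch.__version__)          # expect >= 2.1.0
print(torch.cuda.is_available())  # True if the CUDA build can see a GPU
```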

## Quick Start
Please be sure to complete the installation steps before proceeding with the following steps:
1. Modify the task workload, for example [chatglm2-torch-fp16-6b.json](https://github.com/bytedance/ByteMLPerf/blob/main/byte_infer_perf/llm_perf/workloads/chatglm2-torch-fp16-6b.json); a programmatic sketch of tweaking it follows the launch command below.
2. Download model weights using prepare_model.sh or huggingface-cli.
3. Download the model's reference output logits for specific input cases (.npy files) using prepare_model.sh.
4. Start the accuracy and performance tests.

You can run the following command to automate all steps with the chatglm2 model on the GPU backend:
```shell
python3 byte_infer_perf/llm_perf/launch.py --hardware_type GPU --task chatglm2-torch-fp16-6b
```
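
As a sketch of step 1, the workload file can also be adjusted programmatically. The field name below is an illustrative assumption, not a documented key; the authoritative schema is whatever the linked JSON file actually contains:
```python
# Illustrative sketch only: read, tweak, and rewrite a workload config.
# "test_perf" is a hypothetical field name; check the real JSON for its schema.
import json

path = "byte_infer_perf/llm_perf/workloads/chatglm2-torch-fp16-6b.json"

with open(path) as f:
    workload = json.load(f)

workload["test_perf"] = True  # hypothetical toggle for the performance test

with open(path, "w") as f:
    json.dump(workload, f, indent=4)
```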

## Demo Project
[GPU Backend](https://github.com/bytedance/ByteMLPerf/tree/main/byte_infer_perf/llm_perf/backends/GPU) provides a demo project that implements LLM inference of chatglm2-6b on an A100 with the following features:
- Separate functional components:
  * Scheduler
    - custom task scheduling
  * Inferencer
    - converts tasks into model inputs and collects model outputs
  * Mp Engine
    - handles tensor-parallel (TP) execution across multiple processes
  * Sampler
    - post-processing and sampling logic
  * Ckpt Loader
    - custom checkpoint loader whose weight-splitting matches the TP layout
  * Custom model implementation
    - model code built on the hardware backend's torch implementation
- Separate scheduling logic (illustrated in the sketch after this list):
  * Context: one task, input_ids shape is [1, q_len]
  * Decode: multiple tasks, input_ids shape up to [max_batch_size, 1]
- Tensor parallelism
- KV cache
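
The two scheduling phases can be pictured with the tensor shapes given above. This is a minimal sketch, not the demo's actual code; the vocabulary size and lengths are made-up values:
```python
# Sketch of the two scheduling phases and their input_ids shapes.
import torch

vocab_size, max_batch_size = 32000, 8

# Context phase: a single task runs its whole prompt at once.
q_len = 16
context_input_ids = torch.randint(0, vocab_size, (1, q_len))       # [1, q_len]

# Decode phase: every running task contributes exactly one new token.
num_running = 3  # <= max_batch_size
decode_input_ids = torch.randint(0, vocab_size, (num_running, 1))  # up to [max_batch_size, 1]

print(context_input_ids.shape, decode_input_ids.shape)
```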

The demo project is intended as a reference implementation; there is no guarantee of optimal performance. More technical details will be provided later on [ByteMLPerf](https://bytemlperf.ai).


## Vendor Integration
Vendors can refer to this document for guidance on building backend: [Byte LLM Perf](https://bytemlperf.ai/zh/guide/inference_llm_vendor.html)

## Models
The following models are planned to be supported:
* [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b)
* [meta-llama/Meta-Llama-3-70B](https://huggingface.co/meta-llama/Meta-Llama-3-70B)
* [tiiuae/falcon-180B](https://huggingface.co/tiiuae/falcon-180B)
* [mistralai/Mixtral-8x22B-v0.1](https://huggingface.co/mistralai/Mixtral-8x22B-v0.1)

The following models are outdated and will be removed in future versions:
* [hfl/chinese-llama-2-13b](https://huggingface.co/hfl/chinese-llama-2-13b)
