OPT

This document explains how to build the OPT model using TensorRT-LLM and run it on a single GPU, on a single node with multiple GPUs, or on multiple nodes with multiple GPUs.

Overview

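The TensorRT-LLM OPT implementation can be found in tensorrt_llm/models/opt/model.py. The TensorRT-LLM OPT example code is located in examples/opt. There is one file:

  • convert_checkpoint.py, which converts the weights from the HF Transformers format to the TensorRT-LLM format (see step 2 of the Usage section).

In addition, there are two shared files in the parent folder examples for inference and evaluation; one of them, ../summarize.py, is used in the summarization step of the Usage section.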

Support Matrix

  • FP16
  • INT8 & INT4 Weight-Only
  • Tensor Parallel

Usage

The next two sections describe how to convert the weights from the HuggingFace (HF) Transformers format to the TensorRT-LLM format.

1. Download weights from HuggingFace Transformers

You have to make sure git-lfs is properly installed to load the checkpoints.

pip install -r requirements.txt && sudo apt-get install git-lfs

There are four different checkpoints available. Use one of the following commands to fetch the checkpoint you are interested in.

# OPT-125M
git-lfs clone https://huggingface.co/facebook/opt-125m

# OPT-350M
git-lfs clone https://huggingface.co/facebook/opt-350m

# OPT-2.7B
git-lfs clone https://huggingface.co/facebook/opt-2.7b

# OPT-66B
git-lfs clone https://huggingface.co/facebook/opt-66b

2. Convert weights from HF Transformers to TensorRT-LLM format

# OPT-125M
python3 convert_checkpoint.py --model_dir ./opt-125m \
                --dtype float16 \
                --output_dir ./opt/125M/trt_ckpt/fp16/1-gpu/

# OPT-350M
python3 convert_checkpoint.py --model_dir ./opt-350m \
                --dtype float16 \
                --output_dir ./opt/350M/trt_ckpt/fp16/1-gpu/

# OPT-2.7B
python3 convert_checkpoint.py --model_dir ./opt-2.7b \
                --dtype float16 \
                --output_dir ./opt/2.7B/trt_ckpt/fp16/1-gpu/

# OPT-66B
python3 convert_checkpoint.py --model_dir ./opt-66b \
                --dtype float16 \
                --world_size 4 \
                --output_dir ./opt/66B/trt_ckpt/fp16/4-gpu/ \
                --workers 2

3. Build TensorRT engine(s)

# OPT-125M
trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/1-gpu/ \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --max_batch_size 8 \
                --max_input_len 924 \
                --max_output_len 100 \
                --output_dir ./opt/125M/trt_engines/fp16/1-gpu/

# OPT-350M
trtllm-build --checkpoint_dir ./opt/350M/trt_ckpt/fp16/1-gpu/ \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --max_batch_size 8 \
                --max_input_len 924 \
                --max_output_len 100 \
                --output_dir ./opt/350M/trt_engines/fp16/1-gpu/

# OPT-2.7B
trtllm-build --checkpoint_dir ./opt/2.7B/trt_ckpt/fp16/1-gpu/ \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --max_batch_size 8 \
                --max_input_len 924 \
                --max_output_len 100 \
                --output_dir ./opt/2.7B/trt_engines/fp16/1-gpu/

# OPT-66B
trtllm-build --checkpoint_dir ./opt/66B/trt_ckpt/fp16/4-gpu/ \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --max_batch_size 8 \
                --max_input_len 924 \
                --max_output_len 100 \
                --output_dir ./opt/66B/trt_engines/fp16/4-gpu/ \
                --workers 2

4. Summarization using the OPT model

The following section describes how to run a TensorRT-LLM OPT model to summarize the articles from the cnn_dailymail dataset. For each summary, the script can compute the ROUGE scores and use the ROUGE-1 score to validate the implementation. The script can also perform the same summarization using the HF OPT model.

# OPT-125M
python3 ../summarize.py --engine_dir ./opt/125M/trt_engines/fp16/1-gpu/ \
                        --test_hf \
                        --batch_size 1 \
                        --test_trt_llm \
                        --hf_model_dir opt-125m \
                        --data_type fp16 \
                        --check_accuracy \
                        --tensorrt_llm_rouge1_threshold=14

# OPT-350M
python3 ../summarize.py --engine_dir ./opt/350M/trt_engines/fp16/1-gpu/ \
                        --test_hf \
                        --batch_size 1 \
                        --test_trt_llm \
                        --hf_model_dir opt-350m \
                        --data_type fp16 \
                        --check_accuracy \
                        --tensorrt_llm_rouge1_threshold=20

# OPT-2.7B
python3 ../summarize.py --engine_dir ./opt/2.7B/trt_engines/fp16/1-gpu/ \
                        --test_hf \
                        --batch_size 1 \
                        --test_trt_llm \
                        --hf_model_dir opt-2.7b \
                        --data_type fp16 \
                        --check_accuracy \
                        --tensorrt_llm_rouge1_threshold=20

# OPT-66B
mpirun -n 4 --allow-run-as-root \
    python3 ../summarize.py --engine_dir ./opt/66B/trt_engines/fp16/4-gpu/ \
                            --batch_size 1 \
                            --test_trt_llm \
                            --hf_model_dir opt-66b \
                            --data_type fp16 \
                            --check_accuracy \
                            --tensorrt_llm_rouge1_threshold=20
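
The accuracy check in the commands above compares a ROUGE-1 score against a threshold. As a rough illustration of what such a check measures, here is a minimal sketch of a unigram-overlap F-score. It assumes whitespace tokenization and a score scaled to 0-100 (a guess consistent with thresholds such as 14 and 20); the function names are hypothetical, and the metric used by summarize.py may differ in tokenization, stemming, and aggregation.

```python
from collections import Counter


def rouge1_f(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F-score on a 0-100 scale (whitespace tokens, lowercased)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Clipped unigram overlap: each token counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 100.0 * 2 * precision * recall / (precision + recall)


def check_accuracy(references, candidates, threshold):
    """Return (mean score, passed) for paired reference/candidate summaries."""
    scores = [rouge1_f(r, c) for r, c in zip(references, candidates)]
    mean = sum(scores) / len(scores)
    return mean, mean >= threshold


mean, passed = check_accuracy(
    ["the cat sat on the mat"],
    ["the cat lay on the mat"],
    threshold=14,
)
print(f"mean ROUGE-1: {mean:.2f}, passed: {passed}")
```

In this toy pair, five of the six unigrams overlap, so the score is far above the threshold; real summaries of cnn_dailymail articles score much lower, which is why the thresholds in the commands are small.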

Fused MultiHead Attention (FMHA)

You can enable the FMHA kernels for OPT by adding --enable_context_fmha to the invocation of trtllm-build. Note that it is disabled by default because of possible accuracy issues due to the use of Flash Attention.

If the default fp16 accumulation (--enable_context_fmha) does not meet your accuracy requirements, you can enable fp32 accumulation by adding --enable_context_fmha_fp32_acc. Expect a performance drop when doing so.

Note that --enable_context_fmha and --enable_context_fmha_fp32_acc must be used together with --use_gpt_attention_plugin float16.
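
To see why the accumulator precision can matter, the sketch below sums the same fp16 inputs once with a half-precision accumulator and once with a single-precision accumulator. It emulates the rounding with Python's struct module and is only an illustration of the numerical effect; it is not the FMHA kernel, and the helper names are hypothetical.

```python
import struct


def round_to(fmt: str, x: float) -> float:
    """Round x to the nearest value representable in the given struct format."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]


def accumulate(values, acc_fmt: str) -> float:
    """Sum fp16 inputs, rounding the running total to acc_fmt after every add."""
    total = 0.0
    for v in values:
        v16 = round_to("<e", v)  # inputs are stored as fp16 in both cases
        total = round_to(acc_fmt, total + v16)
    return total


values = [1.0] * 4096
fp16_sum = accumulate(values, "<e")  # half-precision accumulator
fp32_sum = accumulate(values, "<f")  # single-precision accumulator

# fp16 has an 11-bit significand, so above 2048 it can no longer represent
# odd integers and adding 1.0 is rounded away.
print(fp16_sum, fp32_sum)  # 2048.0 4096.0
```

The half-precision total stalls at 2048 while the single-precision total is exact, which is the kind of error that fp32 accumulation avoids at the cost of speed.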

Tensor Parallelism for Embedding Lookup Table

Since the embedding lookup table can be several gigabytes in size, we can distribute this weight across multiple GPUs to reduce the memory consumption per GPU.
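
As a back-of-the-envelope illustration, the sketch below computes the size of an embedding table and the share each GPU holds once the table is split evenly. The vocab_size and hidden_size values are placeholders chosen for illustration (an assumption, not taken from this document); read the real values from your checkpoint's configuration.

```python
def embedding_bytes(vocab_size: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Size of a (vocab_size, hidden_size) table; 2 bytes per element for fp16."""
    return vocab_size * hidden_size * bytes_per_elem


def per_gpu_bytes(vocab_size: int, hidden_size: int, world_size: int,
                  bytes_per_elem: int = 2) -> int:
    """Per-GPU share when the table is split evenly across world_size GPUs."""
    return embedding_bytes(vocab_size, hidden_size, bytes_per_elem) // world_size


# Placeholder dimensions (assumption): adjust to your model.
vocab_size, hidden_size, world_size = 50272, 9216, 4

total = embedding_bytes(vocab_size, hidden_size)
share = per_gpu_bytes(vocab_size, hidden_size, world_size)
print(f"full table: {total / 2**20:.1f} MiB, per GPU: {share / 2**20:.1f} MiB")
```

The saving scales linearly with the number of GPUs, regardless of which dimension is sharded.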

1. Enable this feature

To enable this feature, add the flag --use_parallel_embedding to the convert_checkpoint.py invocation, as in the examples in 2.1 and 2.2.

2. Choose the dimension for tensor parallelism

Assuming the embedding lookup table has shape (vocab_size, hidden_size), we can shard it along either the vocab_size dimension (--embedding_sharding_dim 0) or the hidden_size dimension (--embedding_sharding_dim 1).
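
The two sharding choices can be sketched in plain Python as follows. This is a conceptual illustration under stated assumptions, not the TensorRT-LLM implementation: the function names are hypothetical, the table is a small list of lists, and the way partial results are combined (a sum for dimension 0, a concatenation for dimension 1) is one common scheme.

```python
def shard_rows(table, world_size):
    """Dim 0 (vocab_size): each rank keeps a contiguous block of rows."""
    rows = len(table) // world_size
    return [table[r * rows:(r + 1) * rows] for r in range(world_size)]


def shard_cols(table, world_size):
    """Dim 1 (hidden_size): each rank keeps a slice of every row."""
    cols = len(table[0]) // world_size
    return [[row[r * cols:(r + 1) * cols] for row in table]
            for r in range(world_size)]


def lookup_rows(shards, token_id):
    """Only the rank owning token_id returns its row; the others return zeros.
    Summing the partial results plays the role of an all-reduce."""
    rows, hidden = len(shards[0]), len(shards[0][0])
    partials = []
    for rank, shard in enumerate(shards):
        local = token_id - rank * rows
        partials.append(shard[local] if 0 <= local < rows else [0.0] * hidden)
    return [sum(vals) for vals in zip(*partials)]


def lookup_cols(shards, token_id):
    """Every rank returns its slice of the row; concatenating the slices
    plays the role of an all-gather."""
    out = []
    for shard in shards:
        out.extend(shard[token_id])
    return out


vocab_size, hidden_size, world_size = 8, 4, 2
table = [[float(v * 10 + h) for h in range(hidden_size)] for v in range(vocab_size)]

# Both schemes must reproduce the unsharded lookup for every token.
for token_id in range(vocab_size):
    assert lookup_rows(shard_rows(table, world_size), token_id) == table[token_id]
    assert lookup_cols(shard_cols(table, world_size), token_id) == table[token_id]
```

Sharding along vocab_size means only one rank holds the row for a given token, whereas sharding along hidden_size means every rank holds part of every row.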

2.1 To shard the embedding lookup table along the hidden_size dimension, set the flags --use_parallel_embedding --embedding_sharding_dim 1. Here is an example:

python3 convert_checkpoint.py --model_dir ./opt-125m \
                --dtype float16 \
                --output_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
                --world_size 2 \
                --use_parallel_embedding \
                --embedding_sharding_dim 1

2.2 To shard the embedding lookup table along the vocab_size dimension, set the flags --use_parallel_embedding --embedding_sharding_dim 0.

A lookup plugin is also provided to support tensor parallelism along the vocab_size dimension; it is enabled by passing --use_lookup_plugin to trtllm-build.

  • An example of sharding along the vocab_size dimension with the lookup plugin:

python3 convert_checkpoint.py --model_dir ./opt-125m \
                --dtype float16 \
                --output_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
                --world_size 2 \
                --use_parallel_embedding \
                --embedding_sharding_dim 0

trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --use_lookup_plugin float16 \
                --max_batch_size 8 \
                --max_input_len 924 \
                --max_output_len 100 \
                --output_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
                --workers 2

mpirun -n 2 --allow-run-as-root \
      python3 ../summarize.py --engine_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
                        --batch_size 1 \
                        --test_trt_llm \
                        --hf_model_dir opt-125m \
                        --data_type fp16 \
                        --check_accuracy \
                        --tensorrt_llm_rouge1_threshold=14

  • An example of sharding along the vocab_size dimension without the lookup plugin:

trtllm-build --checkpoint_dir ./opt/125M/trt_ckpt/fp16/2-gpu/ \
                --use_gemm_plugin float16 \
                --use_gpt_attention_plugin float16 \
                --max_batch_size 8 \
                --max_input_len 924 \
                --max_output_len 100 \
                --output_dir ./opt/125M/trt_engines/fp16/2-gpu/ \
                --workers 2