GPT

Introduction

This document describes what FasterTransformer provides for the GPT model and explains the workflow and optimizations. We also provide a guide to help users run the GPT model on FasterTransformer. Finally, we provide benchmarks to demonstrate the speed of FasterTransformer on GPT.

GPT is a variant of the decoding model: it has no encoder module and no cross multi-head attention, and it uses GeLU as the activation. In 2020, OpenAI showed in their paper that using a very large model and a large amount of training data can significantly improve the capacity of the GPT model. However, such a model cannot fit on a single GPU. For example, the largest model, GPT-3, has 175 billion parameters, which take about 350 GB in half precision. Therefore, multiple GPUs, or even multiple nodes, are necessary. To address the latency and memory bottlenecks caused by the model size, FasterTransformer provides highly efficient kernels, optimized memory usage, and model parallelism on multiple frameworks.
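
The memory figure above is simple arithmetic, sketched below. The helper names and the 80 GB per-GPU capacity are assumptions made for illustration, and the estimate covers the weights only (no activations or k/v cache).

```python
import math


def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def min_gpus_for_weights(num_params: float, bytes_per_param: int, gpu_memory_gb: float) -> int:
    """Lower bound on the GPU count needed just to hold the weights."""
    return math.ceil(weight_memory_gb(num_params, bytes_per_param) / gpu_memory_gb)


GPT3_PARAMS = 175e9   # 175 billion parameters
HALF_BYTES = 2        # FP16/BF16 use 2 bytes per parameter
ASSUMED_GPU_GB = 80   # assumption: an 80 GB GPU

print(weight_memory_gb(GPT3_PARAMS, HALF_BYTES))                      # 350.0
print(min_gpus_for_weights(GPT3_PARAMS, HALF_BYTES, ASSUMED_GPU_GB))  # 5
```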

Supported features

Checkpoint converter
- Huggingface
- Megatron
- Nemo Megatron
- TensorFlow
Data type
- FP32
- FP16
- BF16
- INT8 weight only PTQ.
  - Limitations:
    - Hidden sizes must be a multiple of 64 after weights are split for TP.
    - The kernel typically only gives performance benefits for small batch (typically less than 32 or 64) and when weight matrices are large.
    - Weight only PTQ only works for FP16/BF16 compute.
    - Only supported on Volta and newer architectures.
  - Note:
    - Weights are preprocessed offline based on the current GPU to optimize the weight alignment for consumption by tensorcores. Currently, we directly consume FP32/BF16/FP16 weights and quantize them just before inference. If we want to store quantized weights, they MUST be preprocessed for the GPU intended to be used with inference.
    - When using the torch APIs, int8 mode is only available via the Parallel GPT Op. The Parallel GPT Op can also be used on single GPU.
- INT8 with SmoothQuant
- FP8 (Experimental)
Feature
- Multi-GPU multi-node inference
- Dynamic random seed
- Stop tokens
- Beam search and sampling are both supported
- Loading FP32 or FP16 weights
Frameworks
- TensorFlow
- PyTorch
- C++
- Triton backend
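
The limitations listed for INT8 weight-only PTQ can be checked before picking a tensor-parallel size. The sketch below is only an illustration: the function name is hypothetical, and it assumes that "hidden size" means a dimension that is divided evenly across the tensor-parallel ranks.

```python
def weight_only_int8_ok(hidden_size: int, tensor_para_size: int, compute_dtype: str) -> bool:
    """Check two of the documented weight-only PTQ limitations.

    - compute must be FP16 or BF16
    - the per-rank hidden size must be a multiple of 64 after the TP split
    """
    if compute_dtype not in ("fp16", "bf16"):
        return False
    if hidden_size % tensor_para_size != 0:
        return False  # assumption: the dimension must split evenly across ranks
    return (hidden_size // tensor_para_size) % 64 == 0


print(weight_only_int8_ok(4096, 8, "fp16"))  # 4096 / 8 = 512, a multiple of 64
print(weight_only_int8_ok(1600, 8, "fp16"))  # 1600 / 8 = 200, not a multiple of 64
```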

Model architecture

Workflow

Fig 1. Workflow of GPT model.

Fig 1 demonstrates the workflow of FasterTransformer GPT. Unlike BERT and the encoder-decoder structure, GPT receives some input ids as context and generates the respective output ids as a response. In this workflow, the major bottleneck is the GptDecoderLayer (transformer block), because its time increases linearly with the number of layers. In GPT-3, the GptDecoderLayer takes about 95% of the total time.

FasterTransformer splits the whole workflow into 2 parts. The first part is "computing the k/v cache of the context (input ids)", and the second part is "auto-regressively generating the output ids". The operations of these two parts are similar, but the shapes of the tensors in the SelfAttention are different. So, we use 2 different implementations to handle the two cases, as demonstrated in Fig 2. In DecoderSelfAttention, the sequence length of the query is always 1, so we use a custom fused masked multi-head attention kernel. On the other hand, the sequence length of the query in ContextSelfAttention is the maximum input length, so we use cuBLAS to leverage the tensor cores.
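
The two phases can be illustrated with a toy single-head attention in plain Python. This is a minimal sketch, not FasterTransformer's kernels: it assumes identity projections (each token vector serves as its own query, key and value), and all function names are hypothetical. It shows that a decode step with query length 1 over the cached keys and values gives the same result as recomputing attention over the whole sequence.

```python
import math


def attend(query, keys, values):
    """Softmax attention of a single query vector over a list of keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [
        sum(w * value[d] for w, value in zip(weights, values)) / total
        for d in range(len(values[0]))
    ]


def context_phase(prompt):
    """Fill the k/v cache for the whole prompt; return the cache and last output."""
    k_cache, v_cache, last = [], [], None
    for token in prompt:  # causal mask: position i only sees positions 0..i
        k_cache.append(token)
        v_cache.append(token)
        last = attend(token, k_cache, v_cache)
    return k_cache, v_cache, last


def decode_step(token, k_cache, v_cache):
    """One auto-regressive step: query length is 1, the cache grows by one entry."""
    k_cache.append(token)
    v_cache.append(token)
    return attend(token, k_cache, v_cache)


prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

k_cache, v_cache, _ = context_phase(prompt)
cached = decode_step(new_token, k_cache, v_cache)

# Recomputing everything from scratch yields the same output for the new token.
_, _, recomputed = context_phase(prompt + [new_token])
print(cached == recomputed)
```

In the real context phase all query positions are processed together as batched matrix multiplications; the loop here only mimics the causal mask.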

Fig 2. Comparison between different self attention.

Fig 3. Workflow of GPT with tensor parallelism.

The following examples demonstrate how to run the multi-GPU and multi-node GPT model.

examples/cpp/multi_gpu_gpt_example.cc: It uses MPI to organize all GPUs.
examples/cpp/multi_gpu_gpt_triton_example.cc: It uses threading for intra node, and MPI for inter node. This example also demonstrates how to use Triton backend API of FasterTransformer to run the GPT model.
examples/pytorch/gpt/multi_gpu_gpt_example.py: This example is similar to examples/cpp/multi_gpu_gpt_example.cc, but encapsulates the instance of FasterTransformer in a PyTorch OP.

In summary, the workflow to run the GPT model is:

1. Initialize the NCCL comm and set the ranks of tensor parallel and pipeline parallel by MPI or threading.
2. Load weights by the ranks of tensor parallel, pipeline parallel and other model hyper-parameters.
3. Create the instance of ParallelGpt by the ranks of tensor parallel, pipeline parallel and other model hyper-parameters.
4. Receive the request from the client and convert the request to the format of input tensors for ParallelGpt.
5. Run forward.
6. Convert the output tensors of ParallelGpt to the response of the client and return the response.

In the C++ example code, we skip step 4 and step 6 and load the request from examples/cpp/multi_gpu_gpt/start_ids.csv. In the PyTorch example code, the request comes from the PyTorch side. In the Triton example code, we have a complete example from step 1 to step 6.
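
The rank setup in the first step can be sketched as a mapping from a global (MPI) rank to a tensor-parallel rank and a pipeline-parallel rank. The layout below, where consecutive global ranks form one tensor-parallel group, is an assumption made for illustration, and `parallel_ranks` is a hypothetical helper rather than a FasterTransformer API.

```python
def parallel_ranks(world_rank: int, tensor_para_size: int, pipeline_para_size: int):
    """Map a global rank to (tensor_para_rank, pipeline_para_rank).

    Assumed layout: ranks 0..tensor_para_size-1 form pipeline stage 0,
    the next tensor_para_size ranks form stage 1, and so on.
    """
    world_size = tensor_para_size * pipeline_para_size
    if not 0 <= world_rank < world_size:
        raise ValueError(f"rank {world_rank} is outside a world of size {world_size}")
    return world_rank % tensor_para_size, world_rank // tensor_para_size


# 8 processes with tensor_para_size=4 and pipeline_para_size=2
for rank in range(8):
    tp_rank, pp_rank = parallel_ranks(rank, 4, 2)
    print(f"global rank {rank}: tensor rank {tp_rank}, pipeline rank {pp_rank}")
```

With this layout, the ranks that exchange the most data (one tensor-parallel group) are neighbours, which fits the recommendation to keep tensor parallelism inside a node.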

The source codes are put in src/fastertransformer/models/multi_gpu_gpt/ParallelGpt.cc. The arguments, input tensors and output tensors of GPT:

Constructor of GPT

Classification	Name	Data Type	Description
[0]	max_batch_size	size_t	Deprecated, move to input
[1]	max_seq_len	size_t	Deprecated, move to input
[2]	max_input_len	size_t	Deprecated, move to input
[3]	beam_width	size_t	Deprecated, move to input
[4]	head_num	size_t	Head number for model configuration
[5]	size_per_head	size_t	Size per head for model configuration
[6]	inter_size	size_t	The inter size of feed forward network. It is often set to 4 * head_num * size_per_head.
[7]	num_layer	size_t	Number of transformer layers for model configuration
[8]	vocab_size	int	Vocabulary size for model configuration
[9]	start_id	int	Start id for vocabulary
[18]	temperature	float	Deprecated, move to input
[19]	len_penalty	float	Deprecated, move to input
[20]	repetition_penalty	float	Deprecated, move to input
[21]	tensor_para	NcclParam	Tensor Parallel information, which is declared in `src/fastertransformer/utils/nccl_utils.h`
[22]	pipeline_para	NcclParam	Pipeline Parallel information, which is declared in `src/fastertransformer/utils/nccl_utils.h`
[23]	stream	cudaStream_t	CUDA stream
[24]	cublas_wrapper	cublasMMWrapper*	Pointer of cuBLAS wrapper, which is declared in `src/fastertransformer/utils/cublasMMWrapper.h`
[26]	is_free_buffer_after_forward	bool	If setting to be `true`, FasterTransformer will allocate buffer before forward, and free buffer after forward. When the allocator is based on memory pool, setting to `true` may help reducing the memory usage during inference.
[27]	cuda_device_prop	cudaDeviceProp*	Pointer of CUDA device properties, which is used to get the properties of hardware like size of shared memory
[28]	sparse	bool	Is using sparsity. Experimental feature
[29]	int8_mode	int	0 means no quantization. 1 means weight-only PTQ (experimental feature). 2 means weight and activation quantization (experimental feature).
[30]	custom_all_reduce_comm	AbstractCustomComm	Custom all reduction communication for custom all reduction in model parallelism. It is only supported in 8-way tensor parallelism
[31]	enable_custom_all_reduce	int	Flag of enabling custom all reduction or not
[32]	remove_padding	bool	Remove the padding of input ids or not in context phase.
[33]	shared_contexts_ratio	float	Ratio that controls the use of the shared contexts optimization. If the compact size (that accounts only for unique prompts) is less than ratio * batch size, use the optimized implementation. Setting shared_contexts_ratio=0 deactivates the optimization.
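
Two of the constructor arguments above encode simple rules, sketched below under stated assumptions. Both helper names are hypothetical; the sketch only mirrors the descriptions in the table (the common `inter_size` choice and the `shared_contexts_ratio` condition) and is not the library's implementation.

```python
def default_inter_size(head_num: int, size_per_head: int) -> int:
    """The common choice: 4 * hidden size, where hidden = head_num * size_per_head."""
    return 4 * head_num * size_per_head


def use_shared_contexts(prompts, shared_contexts_ratio: float) -> bool:
    """Apply the documented rule: compact size < ratio * batch size."""
    if shared_contexts_ratio <= 0:
        return False  # a ratio of 0 deactivates the optimization
    compact_size = len({tuple(p) for p in prompts})  # number of unique prompts
    return compact_size < shared_contexts_ratio * len(prompts)


print(default_inter_size(12, 128))  # 6144

batch = [[1, 2], [1, 2], [1, 2], [3]]  # 4 requests, 2 unique prompts
print(use_shared_contexts(batch, 1.0))
print(use_shared_contexts(batch, 0.0))
```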

Input of GPT

Name	Tensor/Parameter Shape	Location	Data Type	Description
input_ids	[batch_size, max_input_length]	GPU	int	The input ids (context)
input_lengths	[batch_size]	GPU	int	The lengths of input ids
prompt_learning_task_name_ids	[batch_size]	CPU	int	Optional. Task name ids for prompt learning.
output_seq_len	[batch_size]	CPU	uint32_t	The largest number of tokens you hope for results. Note that it contains the input length
stop_words_list	[batch_size, 2, stop_words_length]	GPU	int	Optional. When FT generates words in this list, it will stop the generation. An extension of stop id
bad_words_list	[batch_size, 2, bad_words_length]	GPU	int	Optional. The words in the list will never be sampled.
repetition_penalty	[1] or [batch_size]	CPU	float	Optional. Repetition penalty applied to logits for both beam search and sampling. Exclusive with presence_penalty.
presence_penalty	[1] or [batch_size]	CPU	float	Optional. Presence penalty - additive type of repetition penalty - applied to logits for both beam search and sampling. Exclusive with repetition_penalty.
min_length	[1] or [batch_size]	CPU	int	Optional. Minimum number of tokens to generate
random_seed	[1] or [batch_size]	CPU	unsigned long long int	Optional. Random seed to initialize the random table in sampling.
request_prompt_lengths	[batch_size]	GPU	int	Optional. Length of prefix soft prompt embedding. This describes how many tokens of soft prompt embedding in each sentence.
request_prompt_embedding	[batch_size, max_prompt_length, hidden_units]	GPU	float/half/bfloat16	Optional. FT will concat them with results of embedding lookup kernel. For prefix soft prompt embedding, the type must be float; for p/prompt tuning, the type is same to weight.
request_prompt_type	[batch_size]	CPU	int	Optional. Prompt type of request. This is necessary when user pass the prompt embedding by input
is_return_context_cum_log_probs	[1]	CPU	bool	Optional. Return the cumulative log probability of context or not
is_return_context_embeddings	[1]	CPU	bool	Optional. Return the sum of context tokens encodings or not
session_len	[1]	CPU	uint32	Optional. The maximum time length allowed during the whole interactive generation. Only used for interactive generation feature
continue_gen	[1]	CPU	bool	Optional. A flag to tell FasterTransformer to not discard previous tokens and continue producing token based on previous generations. Only used for interactive generation feature
memory_len	[1]	CPU	uint32	Optional. The maximum time memory used in attention modules. Reduces the memory footprint, but the quality of generation might degrade.
top_p_decay	[batch_size]	GPU	float	Optional. decay values for top_p sampling
top_p_min	[batch_size]	GPU	float	Optional. min top_p values for top p sampling
top_p_reset_ids	[batch_size]	GPU	uint32	Optional. reset ids for resetting top_p values for top p sampling
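
As an illustration of the first rows of the input table, the sketch below pads a batch of token-id lists into the `[batch_size, max_input_length]` layout and derives `input_lengths` and `output_seq_len`. The helper name and the padding id are hypothetical, and using `max_input_length + tokens_to_generate` for every request is an assumption; the table only states that `output_seq_len` includes the input length.

```python
def build_request(prompts, tokens_to_generate: int, pad_id: int):
    """Build plain-Python versions of input_ids, input_lengths and output_seq_len."""
    max_input_length = max(len(p) for p in prompts)
    return {
        # [batch_size, max_input_length], right-padded with pad_id
        "input_ids": [p + [pad_id] * (max_input_length - len(p)) for p in prompts],
        # [batch_size], the true length of each prompt
        "input_lengths": [len(p) for p in prompts],
        # [batch_size], counts the input tokens as well as the generated ones
        "output_seq_len": [max_input_length + tokens_to_generate] * len(prompts),
    }


request = build_request([[11, 12, 13], [21]], tokens_to_generate=8, pad_id=0)
print(request["input_ids"])       # [[11, 12, 13], [21, 0, 0]]
print(request["input_lengths"])   # [3, 1]
print(request["output_seq_len"])  # [11, 11]
```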

Output of GPT

Name	Tensor/Parameter Shape	Location	Data Type	Description
output_ids	[batch_size, beam_width, max_output_seq_len]	GPU	int	The output ids. It contains the input_ids and generated ids
sequence_length	[batch_size, beam_width]	GPU	int	The lengths of output ids
output_log_probs	[batch_size, beam_width, request_output_seq_len]	GPU	float	Optional. It records the log probability of logits at each step for sampling.
cum_log_probs	[batch_size, beam_width]	GPU	float	Optional. Cumulative log probability of generated sentences
context_embeddings	[batch_size, beam_width, hidden_units]	GPU	float	Optional. Sum of context tokens encodings.

The beam_width value is set by the output shape directly. When the beam_width of output_ids is larger than 1, FT uses beam search to generate tokens; otherwise, FT uses top-k or top-p sampling. When the inputs of beam search and sampling are invalid, such as beam width 1, top k 0 and top p 0.0, FT runs greedy search automatically.
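
The selection rule described above can be written down as a small function. This is a sketch of the documented behaviour only; `select_decoding` is a hypothetical name, not part of the FasterTransformer API.

```python
def select_decoding(beam_width: int, top_k: int = 0, top_p: float = 0.0) -> str:
    """Return the decoding strategy implied by the request parameters."""
    if beam_width > 1:
        return "beam_search"
    if top_k > 0 or top_p > 0.0:
        return "sampling"
    # beam width 1, top k 0 and top p 0.0: fall back to greedy search
    return "greedy"


print(select_decoding(beam_width=4))             # beam_search
print(select_decoding(beam_width=1, top_k=4))    # sampling
print(select_decoding(beam_width=1, top_p=0.6))  # sampling
print(select_decoding(beam_width=1))             # greedy
```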

Optimization

Kernel optimization: many kernels are based on the kernels of the decoder and decoding modules, which are already highly optimized. To avoid recomputing the previous keys and values, we allocate a buffer to store them at each step. Although this takes some additional memory, it saves the cost of recomputing, of allocating a buffer at each step, and of concatenation.
Memory optimization: Unlike traditional models such as BERT, GPT-3 has 175 billion parameters, taking 350 GB even if we store the model in half precision. Therefore, we must reduce the memory usage of the other parts. In FasterTransformer, we reuse the memory buffers across decoder layers. Since GPT-3 has 96 layers, these buffers need only 1/96 of the memory that per-layer buffers would take.
Model parallelism: For the GPT model, FasterTransformer provides both tensor parallelism and pipeline parallelism. For tensor parallelism, FasterTransformer follows the idea of Megatron. For both the self-attention block and the feed forward network block, we split the weights of the first matrix multiplication by row and split the weights of the second matrix multiplication by column. With this optimization, each transformer block needs only 2 reduction operations. The workflow is demonstrated in Fig 3. For pipeline parallelism, FasterTransformer splits the whole batch of requests into multiple micro batches and hides the bubble of communication. FasterTransformer adjusts the micro batch size automatically for different cases. Users can adjust the model parallelism by modifying the gpt_config.ini file. We recommend using tensor parallelism within a node and pipeline parallelism across nodes, because tensor parallelism requires more NCCL communication.
Multiple frameworks: Besides the C++ source code, FasterTransformer also provides a TensorFlow op, a PyTorch op and a Triton backend. Currently, the TensorFlow op only supports a single GPU, while the PyTorch op and the Triton backend support multi-GPU and multi-node. To avoid the additional work of splitting the model for model parallelism, FasterTransformer also provides a tool to split and convert a Megatron model to binary files, which FasterTransformer can then load directly.
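
The tensor-parallel split of the feed forward network can be checked numerically with a toy example. The sketch below assumes the layout `y = GeLU(x · W1) · W2` and splits both weights along the intermediate dimension; whether that is called a row or a column split depends on how the weight matrices are stored. All names are hypothetical, and the final sum stands in for the single all-reduce that this block needs.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gelu(m):
    """Element-wise GeLU (exact erf form)."""
    return [[0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in row] for row in m]


def ffn(x, w1, w2):
    return matmul(gelu(matmul(x, w1)), w2)


def shard_ffn_weights(w1, w2, parts):
    """Give each rank one slice of the intermediate dimension of both weights."""
    step = len(w2) // parts
    shards = []
    for rank in range(parts):
        lo, hi = rank * step, (rank + 1) * step
        shards.append(([row[lo:hi] for row in w1], w2[lo:hi]))
    return shards


def ffn_tensor_parallel(x, w1, w2, parts):
    # Each rank computes a partial output from its own shard, with no communication.
    partials = [ffn(x, a, b) for a, b in shard_ffn_weights(w1, w2, parts)]
    # Stand-in for the all-reduce: sum the partial outputs element-wise.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


x = [[1.0, -2.0], [0.5, 3.0]]                              # [tokens, hidden]
w1 = [[0.1, -0.2, 0.3, 0.4], [0.5, 0.6, -0.7, 0.8]]        # [hidden, inter]
w2 = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5], [-1.0, 2.0]]    # [inter, hidden]

reference = ffn(x, w1, w2)
parallel = ffn_tensor_parallel(x, w1, w2, parts=2)
print(reference)
print(parallel)
```

Because the activation is element-wise, each rank can apply it to its own slice, so the two matrix multiplications need only one reduction at the end.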

Inference Options

We provide environment variables to tune for specific use cases.

Name	Description	Default	Values accepted
`FMHA_ENABLE`	enable the fused multi-head attention kernels (fp16 accumulation)	disabled	`ON` = enable fmha, otherwise disabled
`CONTEXT_ATTENTION_BMM1_HALF_ACCUM`	use fp16 accumulation for the qk gemm; only makes a difference for the unfused multi-head attention kernels	fp32 accumulation	`ON` = fp16 accumulation, otherwise fp32 accumulation

Setup

The following guide demonstrates how to run the C++, PyTorch and Triton backend examples.

Requirements

CMake >= 3.8 for Tensorflow, CMake >= 3.13 for PyTorch
CUDA 11.0 or newer version
NCCL 2.10 or newer version
Python: Only verified on Python 3
Tensorflow: Verified on 1.15; 1.13 and 1.14 should work.
PyTorch: Verified on 1.8.0; >= 1.5.0 should work.

We recommend using an nvcr image like nvcr.io/nvidia/tensorflow:22.09-tf1-py3 or nvcr.io/nvidia/pytorch:22.09-py3.

These components are readily available within the NGC TensorFlow Docker image below.

Ensure you have the following components:

NVIDIA Docker and NGC container are recommended
NVIDIA Pascal or Volta or Turing or Ampere based GPU

For more information about how to get started with NGC containers, see the following sections from the NVIDIA GPU Cloud Documentation and the Deep Learning Documentation:

Getting Started Using NVIDIA GPU Cloud
Accessing And Pulling From The NGC Container Registry
Running TensorFlow
Running PyTorch

For those unable to use the NGC container, to set up the required environment or create your own container, see the versioned NVIDIA Container Support Matrix.

Build the FasterTransformer

Prepare

You can choose the tensorflow version and python version you want. Here, we list some possible images:

To achieve the best performance, we recommend using the latest image. For example, run the image `nvcr.io/nvidia/tensorflow:22.09-tf1-py3` with

```bash
nvidia-docker run -ti --shm-size 5g --rm nvcr.io/nvidia/tensorflow:22.09-tf1-py3 bash
git clone https://github.com/NVIDIA/FasterTransformer.git
mkdir -p FasterTransformer/build
cd FasterTransformer/build
git submodule init && git submodule update
```

Build the project

Note: the xx of -DSM=xx in following scripts means the compute capability of your GPU. The following table shows the compute capability of common GPUs.

GPU	compute capability
P40	61
P4	61
V100	70
T4	75
A100	80
A30	80
A10	86

By default, -DSM is set to 70, 75, 80 and 86. Setting more -DSM values makes compilation take longer, so we suggest setting -DSM only for the device you use. Here, we use xx as a placeholder for convenience.

build with C++

cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON ..
make -j12
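
To avoid typing the compute capability by hand, the command above can be generated from a lookup table. This is a small sketch: the helper name is hypothetical, the dictionary copies only some rows of the table above, and the flags are the ones shown in the C++ build command.

```python
# A subset of the compute-capability table in this guide.
COMPUTE_CAPABILITY = {"P4": 61, "V100": 70, "T4": 75, "A100": 80, "A30": 80, "A10": 86}


def cmake_command(gpu: str, multi_gpu: bool = True) -> str:
    """Build the C++ cmake command line for a single GPU model."""
    flags = [f"-DSM={COMPUTE_CAPABILITY[gpu]}", "-DCMAKE_BUILD_TYPE=Release"]
    if multi_gpu:
        flags.append("-DBUILD_MULTI_GPU=ON")
    return " ".join(["cmake", *flags, ".."])


print(cmake_command("A100"))
# cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_MULTI_GPU=ON ..
```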

build with TensorFlow

Users need to set the path of TensorFlow. For example, if we use nvcr.io/nvidia/tensorflow:22.09-tf1-py3, then

cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_TF=ON -DTF_PATH=/usr/local/lib/python3.8/dist-packages/tensorflow_core/ -DBUILD_MULTI_GPU=ON ..
make -j12

build with PyTorch
```
cmake -DSM=xx -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON ..
make -j12
```
This will build the TorchScript custom class. Please make sure that PyTorch >= 1.5.0 is installed.

How to use

Prepare

Install required tools

pip install -r ../examples/pytorch/gpt/requirement.txt

To run GPT on C++, users need to convert the TensorFlow or PyTorch checkpoint to binary files and then load them through the FasterTransformer C++ API. Unfortunately, there is no published large model, so users can only verify correctness with smaller models. Currently, FasterTransformer provides two kinds of samples. The first uses the checkpoint of the OpenAI GPT-2 model (which is trained by TensorFlow); the other uses the checkpoint of Megatron (which is trained by PyTorch).

Download vocab and merge table

They can be used in both OpenAI GPT-2 and Megatron.

wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json -P ../models
wget https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt -P ../models

Download openai-gpt model and convert

To convert the OpenAI GPT model to binary, FasterTransformer provides a tool examples/tensorflow/gpt/utils/openai_gpt_ckpt_converter.py to convert the checkpoint. The converter requires the following arguments:

-i: The path of megatron model
-o: The output path of converted model
-t_g: The tensor parallel size to train the model
-i_g: The tensor parallel size we hope for inference
-h_n: Number of heads, which is the hyper-parameter of the model

mkdir -p ../models/openai-gpt-models/
# Usage: python ../examples/tensorflow/gpt/utils/download_gpt2_model.py <model_name>
python ../examples/tensorflow/gpt/utils/download_gpt2_model.py 124M
mv models/124M ../models/openai-gpt-models/
python ../examples/tensorflow/gpt/utils/openai_gpt_ckpt_converter.py -o ../models/openai-gpt-models/c-model/124m/ -i ../models/openai-gpt-models/124M/model.ckpt -g 1 # convert 124M model with 1 TP mode
python ../examples/tensorflow/gpt/utils/openai_gpt_ckpt_converter.py -o ../models/openai-gpt-models/c-model/124m/ -i ../models/openai-gpt-models/124M/model.ckpt -g 4 # convert 124M model with 4 TP mode

The OpenAI repo provides several models, including 124M, 355M, 774M and 1558M.

Download megatron model and convert

To convert the Megatron GPT model to binary, FasterTransformer provides a tool examples/pytorch/gpt/utils/megatron_ckpt_convert.py to convert the checkpoint.

wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir -p ../models/megatron-models/345m
unzip megatron_lm_345m_v0.0.zip -d ../models/megatron-models/345m
export PYTHONPATH=$PWD/..:${PYTHONPATH}
python ../examples/pytorch/gpt/utils/megatron_ckpt_convert.py \
        -head_num 16 \
        -i ../models/megatron-models/345m/release/ \
        -o ../models/megatron-models/c-model/345m/ \
        -t_g 1 \
        -i_g 1 \
        --vocab-path ../models/gpt2-vocab.json \
        --merges-path ../models/gpt2-merges.txt
python ../examples/pytorch/gpt/utils/megatron_ckpt_convert.py \
        -head_num 16 \
        -i ../models/megatron-models/345m/release/ \
        -o ../models/megatron-models/c-model/345m/ \
        -t_g 1 \
        -i_g 8 \
        --vocab-path ../models/gpt2-vocab.json \
        --merges-path ../models/gpt2-merges.txt

where t_g is the number of GPUs used for TP during training, and i_g is the number of GPUs used for TP during inference.

Note that there are different checkpoint versions of Megatron. The version of the checkpoint above is 0.

For a model trained with pipeline parallelism, or when the checkpoint version is 3, you don't need to specify head_num or checkpoint_version, as they can be retrieved from model_args.

python ../examples/pytorch/gpt/utils/megatron_ckpt_convert.py -i ../models/megatron-models/345m/release/ -o ../models/megatron-models/c-model/345m/ -i_g 1

Download onnx model and convert

Note that the original gpt2-10.onnx model at https://github.com/onnx/models/raw/master/text/machine_comprehension/gpt-2/model/gpt2-10.onnx has been removed, and the model at the new link https://github.com/onnx/models/blob/main/text/machine_comprehension/gpt-2/model/gpt2-10.onnx cannot be loaded by onnx successfully.

To convert the ONNX GPT model to binary, FasterTransformer provides a tool examples/onnx/multi_gpu_gpt/onnx_ckpt_convert.py to convert the checkpoint.

wget https://github.com/onnx/models/blob/main/text/machine_comprehension/gpt-2/model/gpt2-10.onnx
python ../examples/onnx/multi_gpu_gpt/onnx_ckpt_convert.py -i gpt2-10.onnx -o ../models/onnx-models/c-model/124m/ -i_g 1
python ../examples/onnx/multi_gpu_gpt/onnx_ckpt_convert.py -i gpt2-10.onnx -o ../models/onnx-models/c-model/124m/ -i_g 4

Download huggingface gpt model and convert

git clone https://huggingface.co/gpt2-xl
python ../examples/pytorch/gpt/utils/huggingface_gpt_convert.py -i gpt2-xl/ -o ../models/huggingface-models/c-model/gpt2-xl -i_g 1

Run GPT

Run GPT on C++ with multiple GPUs

1.1 Generate the gemm_config.ini file.
Data Type = 0 (FP32) or 1 (FP16) or 2 (BF16)

./bin/gpt_gemm <batch_size> <beam_width> <max_input_len> <head_number> <size_per_head> <inter_size> <vocab_size> <data_type> <tensor_para_size> <is_append>
E.g., ./bin/gpt_gemm 8 1 32 12 128 6144 51200 1 1 1

If the application may use multiple different shapes (like different batch sizes), users can run the tool multiple times and set is_append to true. For example:

./bin/gpt_gemm 8 1 32 12 128 6144 51200 1 1 0 # bs 8, not append, will create a new gemm_config.ini
./bin/gpt_gemm 16 1 32 12 128 6144 51200 1 1 1 # bs 16, append results to the existing gemm_config.ini
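
When several shapes are needed, the command lines can be generated instead of typed. The sketch below follows the argument order and the data type ids given above; the helper names are hypothetical, and only the first call creates a new file while later calls append.

```python
DATA_TYPE_IDS = {"fp32": 0, "fp16": 1, "bf16": 2}


def gpt_gemm_command(batch_size, beam_width, max_input_len, head_num, size_per_head,
                     inter_size, vocab_size, data_type, tensor_para_size, is_append):
    """Format one gpt_gemm invocation in the documented argument order."""
    args = [batch_size, beam_width, max_input_len, head_num, size_per_head,
            inter_size, vocab_size, DATA_TYPE_IDS[data_type], tensor_para_size,
            int(is_append)]
    return " ".join(["./bin/gpt_gemm"] + [str(a) for a in args])


def gemm_commands_for_batch_sizes(batch_sizes, **model):
    """The first command creates gemm_config.ini; the following ones append to it."""
    return [
        gpt_gemm_command(batch_size=bs, is_append=(i > 0), **model)
        for i, bs in enumerate(batch_sizes)
    ]


commands = gemm_commands_for_batch_sizes(
    [8, 16], beam_width=1, max_input_len=32, head_num=12, size_per_head=128,
    inter_size=6144, vocab_size=51200, data_type="fp16", tensor_para_size=1,
)
for command in commands:
    print(command)
```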

1.2 Run GPT on C++

Users can see the details of arguments in examples/cpp/multi_gpu_gpt/gpt_config.ini. It controls the model path, model size, tensor parallelism size, and some hyper-parameters.

./bin/multi_gpu_gpt_example

Then use the following script to convert the token ids to sentences.

python ../examples/pytorch/gpt/utils/gpt_token_converter.py --vocab_file=../models/gpt2-vocab.json  --bpe_file=../models/gpt2-merges.txt

By setting the data_type in gpt_config.ini to fp16 or bf16, users can run the GPT model in FP16 or BF16.

1.3 Run with tensor parallelism (TP), pipeline parallelism (PP)

Users can use tensor_para_size and pipeline_para_size in gpt_config.ini to control the size of model parallelism. Note that the number of processes must equal tensor_para_size * pipeline_para_size.

mpirun -n 8 ./bin/multi_gpu_gpt_example
python ../examples/pytorch/gpt/utils/gpt_token_converter.py --vocab_file=../models/gpt2-vocab.json  --bpe_file=../models/gpt2-merges.txt
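
The process count passed to mpirun can be derived from the configuration rather than kept in sync by hand. The sketch below parses an inline sample with Python's configparser; the section name `ft_instance_hyperparameter` and the helper name are assumptions made for illustration, so check the section names in your own gpt_config.ini.

```python
import configparser

# Hypothetical excerpt; only the two key names are taken from this guide.
SAMPLE_CONFIG = """
[ft_instance_hyperparameter]
tensor_para_size=4
pipeline_para_size=2
"""


def mpirun_command(config_text: str,
                   section: str = "ft_instance_hyperparameter",
                   binary: str = "./bin/multi_gpu_gpt_example") -> str:
    """Launch exactly tensor_para_size * pipeline_para_size processes."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    tensor_para_size = parser.getint(section, "tensor_para_size")
    pipeline_para_size = parser.getint(section, "pipeline_para_size")
    return f"mpirun -n {tensor_para_size * pipeline_para_size} {binary}"


print(mpirun_command(SAMPLE_CONFIG))  # mpirun -n 8 ./bin/multi_gpu_gpt_example
```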

1.4 Run GPT on multiple nodes

Since the C++ sample code uses MPI to communicate, it extends to multiple nodes easily, except that users need to set up the network environment for communication between the nodes. The following scripts show how to run multi-node inference on slurm.

srun -N2 -n2 -t 600 --pty bash # Assume we get 2 nodes: prm-dgx-09 and prm-dgx-10
srun -N2 -n2 docker pull nvcr.io/nvidia/tensorflow:22.09-tf1-py3

srun -N2 -n2  nvidia-docker run -itd --shm-size 5g --rm --privileged --network=host --pid=host --cap-add=IPC_LOCK --device=/dev/infiniband -v $PWD:$PWD -w $PWD --name ft-test nvcr.io/nvidia/tensorflow:22.09-tf1-py3 /bin/bash

srun -N2 -n2  nvidia-docker exec -i --env SLURM_NTASKS --env SLURM_NODEID --env SLURM_PROCID --env SLURM_STEP_NODELIST --env SLURMD_NODENAME --privileged ft-test bash -c "mkdir /root/.ssh && cp $PWD/ssh/* /root/.ssh && chmod 700 /root/.ssh && chmod 640 /root/.ssh/authorized_keys2 && chmod 400 /root/.ssh/id_rsa && apt-get update && apt-get install ssh -y && mkdir /run/sshd/ && /usr/sbin/sshd -p 11068 && nvidia-smi -lgc 1530"

nvidia-docker exec -ti ft-test bash
cd FasterTransformer/build
mpirun --allow-run-as-root -np 2 -H prm-dgx-09:1,prm-dgx-10:1 -mca plm_rsh_args "-p 11068" ./bin/multi_gpu_gpt_example
srun -N2 -n2 docker stop ft-test

Run GPT on PyTorch

Basically, gpt_example.py shows how to declare a model, load a checkpoint, forward context inputs, and get generated outputs in PyTorch.

For generating outputs based on context inputs, create a text file including the context inputs (line by line) and set --sample_file_input to the text file path. (By default, the script will generate outputs without context inputs.) Set --sample_file_output to write the outputs to a file. Use --data_type fp16/bf16 to run in FP16 or BF16.

Run with -h to see more settings.

python ../examples/pytorch/gpt/multi_gpu_gpt_example.py -h

2.1 Run GPT with TP and PP on a single node (NVIDIA DGX A100). Note that the number of processes must equal tensor_para_size * pipeline_para_size.

# No parallelism (tensor_para_size=1, pipeline_para_size=1)
python ../examples/pytorch/gpt/multi_gpu_gpt_example.py

# TP (tensor_para_size=8, pipeline_para_size=1)
mpirun -n 8 --allow-run-as-root python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --tensor_para_size=8 --pipeline_para_size=1 --ckpt_path="/workspace/fastertransformer/models/megatron-models/c-model/345m/8-gpu"

# PP (tensor_para_size=1, pipeline_para_size=8)
mpirun -n 8 --allow-run-as-root python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --tensor_para_size=1 --pipeline_para_size=8 --ckpt_path="/workspace/fastertransformer/models/megatron-models/c-model/345m/1-gpu"

# TP and PP (tensor_para_size=4, pipeline_para_size=2)
mpirun -n 8 --allow-run-as-root python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --tensor_para_size=4 --pipeline_para_size=2 --ckpt_path="/workspace/fastertransformer/models/megatron-models/c-model/345m/4-gpu"

2.2 Run GPT with TP and PP on single-node/multi-node (NVIDIA SuperPOD)

Set up in interactive mode

```bash
srun -A devtech -J devtech-gpt:gpt -p luna -N1 --mpi=pmix --ntasks-per-node=8 --container-image nvcr.io/nvidia/pytorch:22.09-py3 --container-mounts /lustre/fsw/devtech/hpc-devtech/dahn/FasterTransformer:/workspace/fastertransformer --container-workdir /workspace/fastertransformer --pty bash

mkdir build && cd build
cmake -DSM=80 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON .. && make -j12
```

Run on single-node

* tensor_para_size=8, pipeline_para_size=1

```bash
srun -A devtech -p luna -N1 --mpi=pmix --ntasks-per-node=8 --container-image nvcr.io/nvidia/pytorch:22.09-py3 --container-mounts /lustre/fsw/devtech/hpc-devtech/dahn/FasterTransformer:/workspace/fastertransformer --container-workdir /workspace/fastertransformer/build python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --tensor_para_size=8 --pipeline_para_size=1 --ckpt_path="/workspace/fastertransformer/models/megatron-models/c-model/345m/8-gpu"
```

Run on multi-node

* tensor_para_size=8, pipeline_para_size=2

```bash
srun -A devtech -p luna -N2 --mpi=pmix --ntasks-per-node=8 --container-image nvcr.io/nvidia/pytorch:22.09-py3 --container-mounts /lustre/fsw/devtech/hpc-devtech/dahn/FasterTransformer:/workspace/fastertransformer --container-workdir /workspace/fastertransformer/build python ../examples/pytorch/gpt/multi_gpu_gpt_example.py --tensor_para_size=8 --pipeline_para_size=2 --ckpt_path="/workspace/fastertransformer/models/megatron-models/c-model/345m/8-gpu"
```

2.3 Run LAMBADA test on PyTorch

Download the data set and run the test:

```bash
wget https://github.com/cybertronai/bflm/raw/master/lambada_test.jsonl -P ../models/megatron-models
export PYTHONPATH=$PWD/../:$PYTHONPATH
python ../examples/pytorch/gpt/utils/update_gpt_config.py \
        --model-dir ../models/megatron-models/c-model/345m/1-gpu/ \
        --config-ini-path ../models/megatron-models/c-model/345m/1-gpu/config.ini \
        --pipeline-para-size 1 \
        --tensor-para-size 1 \
        --max-seq-len 512 \
        --beam-width 1 \
        --sampling-top-k 1 \
        --sampling-top-p 0 \
        --data-type fp16
python ../examples/pytorch/gpt/lambada_task_example.py \
       --batch-size 64 \
       --checkpoint-path ../models/megatron-models/c-model/345m/1-gpu/ \
       --lib-path lib/libth_transformer.so \
       --lambada-path ../models/megatron-models/lambada_test.jsonl
```

Run GPT on tensorflow

Follow Download openai-gpt model and convert to prepare the model. Assume the TF model is put in ../models/openai-gpt-models/.

./bin/gpt_gemm 4 1 32 12 64 3072 50257 1 1
python ../examples/tensorflow/gpt/gpt_example.py --batch_size=4 \
                                                 --length=32 \
                                                 --top_k=4 \
                                                 --top_p=0.6 \
                                                 --data_type=fp16 \
                                                 --models_dir=../models/openai-gpt-models/

Note that the TensorFlow op only supports a single GPU.

Run GPT with prompts

GPT now supports p/prompt-tuning. It works with NeMo checkpoints and prompt learning.

Convert the prompt weights

Use the examples/pytorch/gpt/utils/nemo_ckpt_convert.py to convert the NeMo Megatron Prompt Weights. It will automatically generate configuration needed for triton backend inference.

Note that you need to specify start_id and end_id yourself to make sure that they are consistent with the tokenizer.

Run GPT with C++ example

You need to specify the example gpt_config.ini like below to enable the p/prompt_tuning feature.
```
[gptj_6B]
head_num=16
size_per_head=256
vocab_size=50400
decoder_layers=28
rotary_embedding=64
start_id=50256
end_id=50256
inter_size=16384
num_tasks=2
prompt_learning_type=2

;prompt learning example (soft prompt doesn't need it)
[gptj_6B_task_0]
task_name=task_0
prompt_length=5

[gptj_6B_task_1]
task_name=task_1
prompt_length=10
```
task_name and prompt_length are specified for loading prompt weights. prompt_learning_start_id is needed for checking whether ids are prompts or normal input ids.

prompt_learning_type:
- no prompt: 0
- soft_prompt: 1
- prefix_prompt: 2
- p/prompt_tuning: 3
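
The configuration above can be read with Python's configparser, as sketched below. The enum mirrors the `prompt_learning_type` values listed above. The section naming `<model>_task_<index>` is inferred from the example rather than from a specification, and the function name is hypothetical.

```python
import configparser
from enum import IntEnum


class PromptLearningType(IntEnum):
    NO_PROMPT = 0
    SOFT_PROMPT = 1
    PREFIX_PROMPT = 2
    P_PROMPT_TUNING = 3


CONFIG_TEXT = """
[gptj_6B]
num_tasks=2
prompt_learning_type=2

;prompt learning example (soft prompt doesn't need it)
[gptj_6B_task_0]
task_name=task_0
prompt_length=5

[gptj_6B_task_1]
task_name=task_1
prompt_length=10
"""


def load_prompt_tasks(config_text: str, model_name: str):
    """Return the prompt learning type and a {task_name: prompt_length} mapping."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    model = parser[model_name]
    prompt_type = PromptLearningType(model.getint("prompt_learning_type", fallback=0))
    tasks = {}
    for task_id in range(model.getint("num_tasks", fallback=0)):
        section = parser[f"{model_name}_task_{task_id}"]  # assumed naming scheme
        tasks[section["task_name"]] = section.getint("prompt_length")
    return prompt_type, tasks


prompt_type, tasks = load_prompt_tasks(CONFIG_TEXT, "gptj_6B")
print(prompt_type.name, tasks)
```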

Run Meta OPT

Meta OPT and OpenAI GPT do not differ much in structure, so they share the same model and Triton backend classes.
You need to convert the Hugging Face Meta OPT models to FasterTransformer format with examples/pytorch/gpt/utils/huggingface_opt_convert.py.

Run OPT on C++ with multiple GPUs

Users can see the details of arguments in examples/cpp/multi_gpu_gpt/gpt_config.ini. It controls the model path, model size, tensor parallelism size, and some hyper-parameters.
To run Meta OPT models, you need to add an additional configuration, model_variant, which controls layernorm_eps, layernorm_type, activation_type, and has_post_decoder_layernorm.

For example, the configuration of the OPT 125M model looks like:
```
[opt_125M]
head_num=12
size_per_head=64
vocab_size=50272
decoder_layers=12
start_id=2
end_id=2
inter_size=3072
model_variant=opt-pre ;define variant structure
```
There are two model variants: opt-pre = pre_layernorm, opt-post = post_layernorm.
Note that the model has a post-decoder layernorm when layernorm_type is pre_layernorm.
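
The two statements above can be captured in a small lookup. This is a sketch, not the library's code: it assumes the second variant is spelled `opt-post`, the function name is hypothetical, and it leaves out layernorm_eps and activation_type because their values are not given here.

```python
# Mapping stated in the text; the spelling "opt-post" is an assumption.
LAYERNORM_TYPE = {"opt-pre": "pre_layernorm", "opt-post": "post_layernorm"}


def variant_settings(model_variant: str) -> dict:
    """Derive the layernorm settings implied by model_variant."""
    layernorm_type = LAYERNORM_TYPE[model_variant]
    return {
        "layernorm_type": layernorm_type,
        # The post-decoder layernorm is present only for pre_layernorm models.
        "has_post_decoder_layernorm": layernorm_type == "pre_layernorm",
    }


print(variant_settings("opt-pre"))
print(variant_settings("opt-post"))
```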

1.1 Support for w8a8 int8 mode with OPT (preview)

FasterTransformer supports having certain operations with both weights and activations in int8. To keep high accuracy with your model, we recommend SmoothQuant models. Fig 4 presents the data flow. You can convert a regular OPT model to a SmoothQuant one with this repo. You must also generate activation records for calibrating the scaling factors. With these, you can convert the SmoothQuant model for w8a8 inference in FT:
```
python3 examples/pytorch/gpt/utils/huggingface_opt_convert.py -i ../smoothquant/opt-1.3b-smooth/ -o ../nlp-models/ft/test/opt-1.3b-int8/ -i_g 1 -act_scale ../smoothquant/opt-1.3b-smooth.scales.pt
```
Then, set the int8_mode to 2 in examples/cpp/gpt/gpt_config.ini and run bin/multi_gpu_gpt_example. Note that this optimization only supports OPT with pre-layernorm (opt-pre).

Fig 4. SmoothQuant workflow.
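
To give some intuition for why the activation records matter, here is a minimal pure-Python sketch of the smoothing step as described in the SmoothQuant paper: per-input-channel factors move activation outliers into the weights without changing the matrix product. It is not FasterTransformer's or the SmoothQuant repo's implementation; the function names, the toy data, and alpha=0.5 are assumptions for illustration.

```python
def smoothing_scales(act_absmax, weight, alpha=0.5):
    """Per-input-channel factors s_j = max|X_j|**alpha / max|W_j|**(1 - alpha)."""
    scales = []
    for j, act_max in enumerate(act_absmax):
        weight_max = max(abs(v) for v in weight[j])  # row j of W multiplies input channel j
        scales.append(act_max ** alpha / weight_max ** (1.0 - alpha))
    return scales


def smooth(x, weight, scales):
    """Divide activations by s_j and multiply the matching weight rows by s_j."""
    x_s = [[v / scales[j] for j, v in enumerate(row)] for row in x]
    w_s = [[v * scales[j] for v in row] for j, row in enumerate(weight)]
    return x_s, w_s


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


# Toy data: input channel 1 carries an activation outlier.
x = [[0.5, 40.0], [-1.0, -60.0]]
w = [[0.2, -0.4], [0.1, 0.3]]
act_absmax = [max(abs(row[j]) for row in x) for j in range(2)]  # the "activation record"
scales = smoothing_scales(act_absmax, w)
x_s, w_s = smooth(x, w, scales)

print(matmul(x, w))
print(matmul(x_s, w_s))  # same product, but x_s has a much smaller dynamic range
```

After smoothing, both the activations and the weights have a range that is easier to cover with int8, which is the motivation for using a SmoothQuant model in w8a8 mode.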

Run OPT on PyTorch

We provide a summarization example for Meta OPT models. See examples/pytorch/gpt/opt_summarization.py.

Note that the summarization tests are run with topk = 2, so the ROUGE scores of HF and FT often differ.
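
The score gap comes from sampling: with topk = 2 each step draws randomly among the two most likely tokens, so two runs (or two frameworks with different random number streams) can pick different tokens. The sketch below shows generic top-k sampling in plain Python; `top_k_sample` is a hypothetical helper, not the sampling kernel used by FT or HF.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample a token id among the k largest logits, weighted by softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)
    weights = [math.exp(logits[i] - peak) for i in top]  # unnormalised softmax
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
# With k=2 only token ids 1 and 3 can be drawn, but which one depends on the seed.
samples = [top_k_sample(logits, 2, random.Random(seed)) for seed in range(10)]
print(samples)
```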

Run on opt-125m model

git lfs clone https://huggingface.co/facebook/opt-125m
python ../examples/pytorch/gpt/utils/huggingface_opt_convert.py \
      -i opt-125m/ \
      -o opt-125m/c-model/ \
      -i_g 1
python3 ../examples/pytorch/gpt/opt_summarization.py \
        --summarize \
        --test_hf \
        --max_ite 20 \
        --ft_model_location opt-125m/c-model \
        --hf_model_name opt-125m

The results are similar to:

Hugging Face (total latency: 9.258284 sec)
rouge1 : 20.36984889475218
rouge2 : 4.854345624891912
rougeL : 14.82866480289381
rougeLsum : 18.23638863809613
Faster Transformers (total latency: 3.9376330000000004 sec)
rouge1 : 26.676168312282357
rouge2 : 10.004052949342602
rougeL : 19.20934213532261
rougeLsum : 24.243496576656323

Run on opt-350m model

git lfs clone https://huggingface.co/facebook/opt-350m
python ../examples/pytorch/gpt/utils/huggingface_opt_convert.py \
      -i opt-350m/ \
      -o opt-350m/c-model/ \
      -i_g 1
python3 ../examples/pytorch/gpt/opt_summarization.py \
        --summarize \
        --test_hf \
        --max_ite 20 \
        --ft_model_location opt-350m/c-model \
        --hf_model_name opt-350m \
        --data_type fp16

The results are similar to:

Hugging Face (total latency: 21.961627 sec)
rouge1 : 28.939621379501467
rouge2 : 9.858278077813752
rougeL : 19.159853526952528
rougeLsum : 26.120654334830885
Faster Transformers (total latency: 6.293255999999998 sec)
rouge1 : 26.80687566772978
rouge2 : 8.639787737378661
rougeL : 18.90520115636779
rougeLsum : 24.372302912676407

We can also run OPT summarization with int8

python3 ../examples/pytorch/gpt/opt_summarization.py \
        --summarize \
        --test_hf \
        --max_ite 20 \
        --ft_model_location opt-350m/c-model \
        --hf_model_name opt-350m \
        --data_type fp16 \
        --int8_mode 1

The results are similar to (from RTX 3090):

Hugging Face (total latency: 17.364539 sec)
rouge1 : 29.781707569865045
rouge2 : 10.400027824789843
rougeL : 20.295983024772482
rougeLsum : 26.529982852324874
Faster Transformers (total latency: 6.088986 sec)
rouge1 : 26.744781183506355
rouge2 : 7.118945671926842
rougeL : 17.357590762660852
rougeLsum : 24.31072167607998

Run OPT with Triton Backends

The model configuration is generated automatically when converting the Meta OPT models.
You can then use the converted weights and the configuration file to serve the model with Triton servers. An example of the config.ini generated by the conversion:

[gpt]
model_name = opt-350m/
head_num = 16
size_per_head = 64
inter_size = 4096
max_pos_seq_len = 2048
num_layer = 24
layernorm_eps = 1e-5
layernorm_type = post_layernorm
activation_type = Relu
has_post_decoder_layernorm = 0
vocab_size = 50272
start_id = 2
end_id = 2
weight_data_type = fp32

Run BLOOM

BLOOM is a variant of the GPT model that uses ALiBi, which does not need a learnt positional encoding and allows the model to generate sequences longer than the sequence length used in training. BLOOM also has a structure similar to OpenAI GPT, so, as with OPT, FT provides the BLOOM model through the GPT classes as a variant. Users can convert a pretrained Huggingface BLOOM model into FasterTransformer format by using examples/pytorch/gpt/utils/huggingface_bloom_convert.py.
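
As a minimal sketch of the idea behind ALiBi, the code below builds the per-head slopes and the linear distance penalty that is added to the attention scores in place of a positional embedding. It follows the ALiBi formulation for power-of-two head counts; the function names are made up for this example, and FT's kernels may compute the bias in a different but equivalent form.

```python
def alibi_slopes(num_heads):
    """Head i (1-based) gets slope 2 ** (-8 * i / num_heads)."""
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only covers power-of-two head counts")
    return [2.0 ** (-8.0 * i / num_heads) for i in range(1, num_heads + 1)]


def alibi_bias(slope, seq_len):
    """bias[q][k] = -slope * (q - k) for k <= q; future keys are left to the causal mask."""
    return [[-slope * (q - k) if k <= q else 0.0 for k in range(seq_len)]
            for q in range(seq_len)]


slopes = alibi_slopes(8)
print(slopes[0], slopes[-1])
print(alibi_bias(slopes[0], 3))
```

Because the penalty depends only on the distance between query and key, it is defined for any sequence length, which is what allows generation beyond the training length.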

Run BLOOM on C++ with multiple GPUs

Users can find the details of the parameters in examples/cpp/multi_gpu_gpt/gpt_config.ini, which controls the checkpoint path, model size, tensor parallelism size, and other hyper-parameters. As with OPT, we need to set an additional configuration, model_variant=bloom. For example, the configuration of the bloom-560m model looks like:
```
[bloom_560M]
head_num=16
size_per_head=64
vocab_size=250880
decoder_layers=24
start_id=1
end_id=3
inter_size=4096
model_variant=bloom ; define variant structure
```

Run BLOOM on PyTorch

We provide a LAMBADA task example for BLOOM model. Please see examples/pytorch/gpt/bloom_lambada.py.

Run on bloom-560m model

git clone https://huggingface.co/bigscience/bloom-560m
python ../examples/pytorch/gpt/utils/huggingface_bloom_convert.py \
    --input-dir bloom-560m \
    --output-dir bloom-560m/c-model \
    -tp 1 -p 4 -v
wget https://github.com/cybertronai/bflm/raw/master/lambada_test.jsonl -P ../datasets/lambada
# Run HF benchmark
python ../examples/pytorch/gpt/bloom_lambada.py \
    --tokenizer-path bloom-560m \
    --dataset-path ../datasets/lambada/lambada_test.jsonl \
    --test-hf --show-progress
# Run FT benchmark
python ../examples/pytorch/gpt/bloom_lambada.py \
    --checkpoint-path bloom-560m/c-model/1-gpu \
    --tokenizer-path bloom-560m \
    --dataset-path ../datasets/lambada/lambada_test.jsonl \
    --show-progress

The result accuracy will be around 35.3% in both cases.

(HF) Accuracy: 35.3775% (1823/5153) (elapsed time: 23.3663 sec)
(FT) Accuracy: 35.3386% (1821/5153) (elapsed time: 10.8444 sec)

Run BLOOM with Triton Backends

As with OPT, the configuration is generated automatically when converting into an FT checkpoint, so we can run the model through a Triton server without any further steps. An example of the config.ini generated by the conversion:
```
  [gpt]
  model_name=bloom-560m/
  num_layer=24
  head_num=16
  inter_size=4096
  size_per_head=64
  vocab_size=250880
  layernorm_eps=1e-05
  weight_data_type=fp32
  tensor_para_size=1
  start_id=1
  end_id=2
```

gpt with triton backend

Details are in transformer_backend

GPT with MOE

We use the checkpoint provided by modelscope. This checkpoint is trained on a Chinese dataset, so we test with some Chinese texts. In addition, Megatron-DeepSpeed needs some modifications to load the MOE checkpoint. We have put the modified Megatron-DeepSpeed code in the moe_ft branch of https://github.com/byshiue/Megatron-DeepSpeed/.

pip install git+https://github.com/microsoft/DeepSpeed.git
git clone https://github.com/byshiue/Megatron-DeepSpeed/ -b moe_ft
pip install Megatron-DeepSpeed/
pip install jieba
pip install -r ../examples/pytorch/gpt/requirement.txt

git lfs clone https://www.modelscope.cn/PAI/nlp_gpt3_text-generation_0.35B_MoE-64.git
mv nlp_gpt3_text-generation_0.35B_MoE-64 ../models
PYTHONPATH=$PWD/../ python ../examples/pytorch/gpt/utils/megatron_gpt_moe_ckpt_convert.py \
                    --input-dir ../models/nlp_gpt3_text-generation_0.35B_MoE-64/model \
                    --saved-dir ../models/nlp_gpt3_text-generation_0.35B_MoE-64/model/c-models \
                    --infer-gpu-num 1 \
                    --vocab-path ../models/gpt2-vocab.json \
                    --merges-path ../models/gpt2-merges.txt

echo \
'据悉,自驾
“首金”花落谁家,无疑' > sample_input_file.txt

python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py \
        --tensor_para_size=1 \
        --pipeline_para_size=1 \
        --ckpt_path=../models/nlp_gpt3_text-generation_0.35B_MoE-64/model/c-models/1-gpu/ \
        --data_type=fp16 \
        --vocab_file=../models/nlp_gpt3_text-generation_0.35B_MoE-64/tokenizer.json \
        --vocab_size=51200 \
        --start_id=7 \
        --end_id=7 \
        --sample_input_file=sample_input_file.txt \
        --use_jieba_tokenizer

The output should look like:

[INFO] batch 0, beam 0:
[Context]
据悉,自驾

[Output]
游的人数正在逐年增加,而且越来越多的人选择自驾游,而且越来越多的人选择自驾

[INFO] batch 1, beam 0:
[Context]
“首金”花落谁家,无疑

[Output]
是一场精彩的“战役”。 “首金”花落谁家,是一场精彩的“战役”。

modelscope also provides a 27B checkpoint, which fits in a single A100-80GB under FP16 and gives higher quality.
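
The claim about fitting on one GPU can be sanity-checked with simple arithmetic on the weight size. The sketch below counts weights only (no k/v cache, activations, or workspace) and takes the quoted parameter count at face value; `weight_memory_gb` is a hypothetical helper.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params, data_type):
    """Approximate weight footprint in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * BYTES_PER_ELEMENT[data_type] / 1e9


print(weight_memory_gb(27_000_000_000, "fp16"))   # 27B checkpoint under FP16
print(weight_memory_gb(175_000_000_000, "fp16"))  # GPT-3 175B under FP16
```

Under these assumptions the 27B weights need about 54 GB, which leaves room on an 80 GB device, while 175B parameters need about 350 GB and therefore require model parallelism.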

FT also supports GPT-MOE with model parallelism.

PYTHONPATH=$PWD/../ python ../examples/pytorch/gpt/utils/megatron_gpt_moe_ckpt_convert.py \
                    --input-dir ../models/nlp_gpt3_text-generation_0.35B_MoE-64/model \
                    --saved-dir ../models/nlp_gpt3_text-generation_0.35B_MoE-64/model/c-models \
                    --infer-gpu-num 2 \
                    --vocab-path ../models/gpt2-vocab.json \
                    --merges-path ../models/gpt2-merges.txt

mpirun -n 2 python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py \
        --tensor_para_size=2 \
        --pipeline_para_size=1 \
        --ckpt_path=../models/nlp_gpt3_text-generation_0.35B_MoE-64/model/c-models/2-gpu/ \
        --data_type=fp16 \
        --vocab_file=../models/nlp_gpt3_text-generation_0.35B_MoE-64/tokenizer.json \
        --vocab_size=51200 \
        --start_id=7 \
        --end_id=7 \
        --sample_input_file=sample_input_file.txt \
        --use_jieba_tokenizer

mpirun -n 2 python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py \
        --tensor_para_size=1 \
        --pipeline_para_size=2 \
        --ckpt_path=../models/nlp_gpt3_text-generation_0.35B_MoE-64/model/c-models/1-gpu/ \
        --data_type=fp16 \
        --vocab_file=../models/nlp_gpt3_text-generation_0.35B_MoE-64/tokenizer.json \
        --vocab_size=51200 \
        --start_id=7 \
        --end_id=7 \
        --sample_input_file=sample_input_file.txt \
        --use_jieba_tokenizer

GPT with FP8 (Experimental)

Note that FP8 is supported starting from Hopper and CUDA 11.8. Here, we use the docker image nvcr.io/nvidia/pytorch:22.10-py3 to demonstrate:

mkdir build
cd build
cmake -DSM=90 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON -DBUILD_MULTI_GPU=ON -DENABLE_FP8=ON ..
make -j12
wget --content-disposition https://api.ngc.nvidia.com/v2/models/nvidia/megatron_lm_345m/versions/v0.0/zip -O megatron_lm_345m_v0.0.zip
mkdir models/345m/ -p
unzip megatron_lm_345m_v0.0.zip -d ./models/345m

export PYTHONPATH=$PWD/..:${PYTHONPATH}
python3 ../examples/pytorch/gpt/utils/megatron_fp8_ckpt_convert.py \
      -i ./models/345m/release \
      -o ./models/345m/c-model/ \
      -i_g 1 \
      -head_num 16 \
      -trained_tensor_parallel_size 1

python3 ../examples/pytorch/gpt/gpt_summarization.py \
        --data_type fp8 \
        --lib_path ./lib/libth_transformer.so \
        --summarize \
        --ft_model_location ./models/345m/c-model/

The checkpoint does not contain quantization scales, so FT initializes them with identity scales directly. Even so, the accuracy is still good, as shown below:

rouge1 : 23.264943073521202
rouge2 : 6.43987431806994
rougeL : 16.517620811297537
rougeLsum : 21.24054457217973

Advanced features

generate different sentences and enable shared context

The model downloading and conversion are described in Download megatron model and convert.

A common request is to take a single input and return multiple results generated with different random seeds. To achieve this, we can duplicate the input several times and set a different random seed for each sentence in the batch. You can enable this by adding --enable_random_seed. Otherwise, all random seeds are set to 0 by default.

For example, we prepare an input with batch size 4 in which all sentences are the same.

for i in {1..4} ; do echo " Article : (CNN)James Best, best known for his portrayal of bumbling sheriff Rosco P. Coltrane on TV's \"The Dukes of Hazzard,\" died Monday after a brief illness. He was 88. Best died in hospice in Hickory, North Carolina, of complications from pneumonia, said Steve Latshaw, a longtime friend and Hollywood colleague. Although he'd been a busy actor for decades in theater and in Hollywood, Best didn't become famous until 1979, when \"The Dukes of Hazzard's\" cornpone charms began beaming into millions of American homes almost every Friday night. For seven seasons, Best's Rosco P. Coltrane chased the moonshine-running Duke boys back and forth across the back roads of fictitious Hazzard County, Georgia, although his \"hot pursuit\" usually ended with him crashing his patrol car. Although Rosco was slow-witted and corrupt, Best gave him a childlike enthusiasm that got laughs and made him endearing. His character became known for his distinctive \"kew-kew-kew\" chuckle and for goofy catchphrases such as \"cuff 'em and stuff 'em! \" upon making an arrest. Among the most popular shows on TV in the early '80s, \"The Dukes of Hazzard\" ran until 1985 and spawned TV movies, an animated series and video games. Several of Best's \"Hazzard\" co-stars paid tribute to the late actor on social media. \"I laughed and learned more from Jimmie in one hour than from anyone else in a whole year,\" co-star John Schneider, who played Bo Duke, said on Twitter. \"Give Uncle Jesse my love when you see him dear friend.\" \"Jimmy Best was the most constantly creative person I have ever known,\" said Ben Jones, who played mechanic Cooter on the show, in a Facebook post. 
\"Every minute of his long life was spent acting, writing, producing, painting, teaching, fishing, or involved in another of his life's many passions.\" Born Jewel Guy on July 26, 1926, in Powderly, Kentucky, Best was orphaned at 3 and adopted by Armen and Essa Best, who renamed him James and raised him in rural Indiana. Best served in the Army during World War II before launching his acting career. TL;DR: " >> sample_input.txt ; done

Then, we run the multi_gpu_gpt_example.py with --enable_random_seed:

python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py  \
        --ckpt_path ../models/megatron-models/c-model/345m/1-gpu/ \
        --vocab_file ../models/gpt2-vocab.json  \
        --merges_file ../models/gpt2-merges.txt  \
        --sample_input_file sample_input.txt \
        --max_batch_size 4 \
        --time  \
        --top_p 0.9 \
        --top_k 0 \
        --shared_contexts_ratio 0.0 \
        --enable_random_seed \
        --output_len 8

You can see that the results differ slightly, and the program shows the time cost like:

[INFO] GPT time costs: 64.25 ms

Although this method achieves our target, computing the same duplicated inputs repeatedly is wasteful. So, we can set --shared_contexts_ratio to compute the duplicated inputs only once in the context phase:

python3 ../examples/pytorch/gpt/multi_gpu_gpt_example.py  \
        --ckpt_path ../models/megatron-models/c-model/345m/1-gpu/ \
        --vocab_file ../models/gpt2-vocab.json  \
        --merges_file ../models/gpt2-merges.txt  \
        --sample_input_file sample_input.txt \
        --max_batch_size 4 \
        --time  \
        --top_p 0.9 \
        --top_k 0 \
        --shared_contexts_ratio 1.0 \
        --enable_random_seed \
        --output_len 8

You can see that inference is faster than the original run:

[INFO] GPT time costs: 41.69 ms

Notes:

- The results with shared_context enabled and disabled may differ because the shapes of the GEMMs change. This does not affect the quality of generation.
- We use a short output_len in this example to demonstrate the benefit of shared_context. In real applications, the more duplicated inputs there are and the longer the input length is compared to the output length, the more speedup shared_context brings.
- Since the additional overhead of enabling shared_context is negligible, we enable it by default.
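
The idea of computing duplicated inputs once can be sketched as a deduplication step before the context phase. This is a simplified illustration, not FT's implementation: the helper names are hypothetical, and the rule that sharing is triggered when the number of distinct contexts is at most shared_contexts_ratio times the batch size is an assumption that merely agrees with the 0.0 (off) and 1.0 (on) settings used above.

```python
def find_shared_contexts(batch_input_ids):
    """Return (unique_contexts, batch_to_unique) so each distinct context is processed once."""
    unique_contexts = []
    index_of = {}
    batch_to_unique = []
    for ids in batch_input_ids:
        key = tuple(ids)
        if key not in index_of:
            index_of[key] = len(unique_contexts)
            unique_contexts.append(list(ids))
        batch_to_unique.append(index_of[key])
    return unique_contexts, batch_to_unique


def should_share(batch_input_ids, shared_contexts_ratio):
    """Assumed trigger rule: share when distinct contexts <= ratio * batch size."""
    unique_contexts, _ = find_shared_contexts(batch_input_ids)
    return len(unique_contexts) <= shared_contexts_ratio * len(batch_input_ids)


batch = [[818, 262, 938]] * 4  # four identical requests
unique_contexts, batch_to_unique = find_shared_contexts(batch)
print(len(unique_contexts), batch_to_unique)
print(should_share(batch, 1.0), should_share(batch, 0.0))
```

The context phase then runs on `unique_contexts` only, and `batch_to_unique` tells each batch entry which result to reuse before decoding with its own random seed.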

Interactive generation

Fig 5. GPT generate some outputs by some inputs

Fig 6. New inputs with previous texts and some additional new input ids.

In some scenarios (like chatting), new requests are related to previous requests. Currently, users can pass all previous inputs and outputs as a new input to FT to make it generate a new reply from the previous texts, as shown in Fig 5 and Fig 6. However, this means that the k/v cache of all previous inputs and outputs must be recomputed, which wastes time when the context is very long.

Fig 7. The workflow of generation with interactive generation

To achieve better performance and avoid useless computation, we add a new flag, continue_gen, to GPT. When this flag is on, FT keeps all results during generation and assumes the user will provide more text. FT then does not recompute the k/v cache of the results it already has, and only computes the k/v cache of the new ids. The workflow becomes what we demonstrate in Fig 7. To avoid allocating the memory buffer again, users also need to set session_len to the maximum sequence length of the final sentence, not just that of the intermediate sentence.
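
The bookkeeping behind this can be sketched with a toy session object that counts how many token ids need k/v computation. Everything here is illustrative: `InteractiveSession` and `fake_next_token` are hypothetical stand-ins, and no real attention or cache tensors are involved.

```python
class InteractiveSession:
    """Toy model of continue_gen: k/v entries are computed once per token id."""

    def __init__(self, session_len):
        self.session_len = session_len  # capacity for the whole conversation
        self.token_ids = []             # every input and generated id so far
        self.cached_len = 0             # ids whose k/v entries are already stored
        self.kv_computed = 0            # counter kept only for illustration

    def step(self, new_input_ids, output_len, next_token):
        if len(self.token_ids) + len(new_input_ids) + output_len > self.session_len:
            raise ValueError("session_len is too small for this request")
        self.token_ids.extend(new_input_ids)
        for _ in range(output_len):
            # Only ids that are not cached yet need k/v computation.
            self.kv_computed += len(self.token_ids) - self.cached_len
            self.cached_len = len(self.token_ids)
            self.token_ids.append(next_token(self.token_ids))
        return list(self.token_ids)


def fake_next_token(token_ids):
    # Hypothetical stand-in for the model: returns a deterministic dummy id.
    return len(token_ids)


session = InteractiveSession(session_len=80)
intermediate = session.step(list(range(8)), output_len=32, next_token=fake_next_token)
final = session.step(list(range(100, 108)), output_len=32, next_token=fake_next_token)
print(len(intermediate), len(final), session.kv_computed)
```

In this toy run the second request only processes its 8 new ids plus the tokens it generates. Without the kept cache, it would have to process all 48 context ids again, which is the cost that continue_gen avoids.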

We use multi_gpu_gpt_interactive_example to demonstrate how to use this feature. In this example, we first load examples/cpp/multi_gpu_gpt/start_ids.csv (all input lengths are 8):

818, 262, 938, 3155, 286, 1528, 11, 257
198, 464, 968, 8221, 2732, 286, 15198, 318
464, 968, 1971, 12056, 423, 257, 649, 1182
464, 968, 1971, 3782, 468, 3199, 663, 5079
818, 257, 1445, 326, 481, 1884, 787, 340
464, 968, 1971, 12056, 6, 5859, 41683, 423
198, 198, 464, 5398, 4332, 628, 628, 198
464, 717, 640, 314, 2497, 262, 3807, 11

then generate 32 tokens with continue_gen=true to get intermediate results (saved in out.interm):

818 262 938 3155 286 1528 11 257 1256 286 661 423 587 4737 502 546 262 649 1492 11 290 314 1053 587 2111 284 3280 617 286 262 2683 326 661 423 587 4737 502 13 198 198
198 464 968 8221 2732 286 15198 318 1762 351 262 1181 338 9358 5011 284 5004 262 1266 835 284 1445 262 4979 13 198 1 1135 821 1016 284 307 2045 379 262 1266 835 284 1445 262
464 968 1971 12056 423 257 649 1182 3985 11 290 339 338 257 3516 508 338 587 1088 262 4652 329 257 890 640 13 679 338 257 3516 508 338 587 1088 262 4652 329 257 890 640
464 968 1971 3782 468 3199 663 5079 1351 286 262 995 338 749 14212 661 13 198 464 1351 11 543 373 14102 416 262 968 1971 3782 11 318 1912 319 257 5526 286 517 621 352 11
818 257 1445 326 481 1884 787 340 4577 329 262 1664 284 3677 663 7303 11 262 1664 468 4987 284 3677 663 10171 287 262 1664 284 257 1448 286 7713 2957 416 262 2839 13598 4081 309
464 968 1971 12056 6 5859 41683 423 587 257 1263 636 286 262 1074 338 1943 428 1622 13 198 464 12056 423 587 1498 284 1057 262 2613 6840 11 290 484 423 587 1498 284 1057 262
198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332
464 717 640 314 2497 262 3807 11 314 373 588 11 705 5812 616 1793 11 428 318 523 3608 2637 314 373 588 11 705 40 765 284 307 287 428 3807 2637 314 373 588 11 705

Next, we load another set of inputs from examples/cpp/multi_gpu_gpt/interactive_inputs_ids (all input lengths are 8 again):

5962, 11, 314, 561, 588, 284, 910, 326
11125, 286, 2844, 291, 5028, 422, 262, 7627
392, 257, 1913, 1998, 351, 1353, 12, 28282
830, 34643, 11, 7602, 11, 4708, 6332, 1938
5, 38328, 763, 13, 1119, 481, 2148, 257
3245, 355, 257, 22080, 1074, 13, 4042, 286
14150, 26443, 262, 1230, 338, 1410, 284, 3958
5195, 4398, 470, 314, 7342, 340, 2961, 30

and pass them into FT again (note that we only need to pass the new ids because FT already records all previous ids). FT then appends these new ids to the output ids, computes k/v caches for only these new ids, and generates another 32 tokens as a new response (the results are saved in out):

818 262 938 3155 286 1528 11 257 1256 286 661 423 587 4737 502 546 262 649 1492 11 290 314 1053 587 2111 284 3280 617 286 262 2683 326 661 423 587 4737 502 13 198 198 5962 11 314 561 588 284 910 326 314 1101 407 257 4336 286 262 1492 13 314 892 340 338 257 1310 1165 881 286 257 366 10919 611 1 1492 13 314 892 340 338 257 1310 1165
198 464 968 8221 2732 286 15198 318 1762 351 262 1181 338 9358 5011 284 5004 262 1266 835 284 1445 262 4979 13 198 1 1135 821 1016 284 307 2045 379 262 1266 835 284 1445 262 11125 286 2844 291 5028 422 262 7627 7784 15296 284 262 7421 7784 15296 553 531 42743 6523 3899 1024 33246 271 13 198 464 42743 318 635 2045 379 262 5885 286 3867 262 4979 422 262 7421
464 968 1971 12056 423 257 649 1182 3985 11 290 339 338 257 3516 508 338 587 1088 262 4652 329 257 890 640 13 679 338 257 3516 508 338 587 1088 262 4652 329 257 890 640 392 257 1913 1998 351 1353 12 28282 18370 13 679 338 257 3516 508 338 587 1088 262 4652 329 257 890 640 13 679 338 257 3516 508 338 587 1088 262 4652 329 257 890 640 13
464 968 1971 3782 468 3199 663 5079 1351 286 262 995 338 749 14212 661 13 198 464 1351 11 543 373 14102 416 262 968 1971 3782 11 318 1912 319 257 5526 286 517 621 352 11 830 34643 11 7602 11 4708 6332 1938 290 584 14212 661 13 198 464 1351 318 14102 416 262 968 1971 3782 290 318 3199 319 262 3052 286 262 7533 13 198 464 1351 318 20633 416 262
818 257 1445 326 481 1884 787 340 4577 329 262 1664 284 3677 663 7303 11 262 1664 468 4987 284 3677 663 10171 287 262 1664 284 257 1448 286 7713 2957 416 262 2839 13598 4081 309 5 38328 763 13 1119 481 2148 257 2472 286 720 16 13 20 2997 287 5003 290 4283 13 198 464 1730 318 2938 284 1969 287 262 1218 2063 286 428 614 13 198 464 1664 531 340
464 968 1971 12056 6 5859 41683 423 587 257 1263 636 286 262 1074 338 1943 428 1622 13 198 464 12056 423 587 1498 284 1057 262 2613 6840 11 290 484 423 587 1498 284 1057 262 3245 355 257 22080 1074 13 4042 286 262 640 11 262 12056 423 587 1498 284 1057 262 2613 6840 11 290 484 423 587 1498 284 1057 262 3245 355 257 22080 1074 13 198 464 12056 423
198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 628 628 198 198 464 5398 4332 14150 26443 262 1230 338 1410 284 3958 262 779 286 262 1573 366 16991 1 287 262 1499 338 1743 3303 13 198 198 464 1230 338 1410 284 3958 262 779 286 262 1573 366 16991 1 287
464 717 640 314 2497 262 3807 11 314 373 588 11 705 5812 616 1793 11 428 318 523 3608 2637 314 373 588 11 705 40 765 284 307 287 428 3807 2637 314 373 588 11 705 5195 4398 470 314 7342 340 2961 30 4162 4398 470 314 1775 340 878 8348 314 373 588 11 705 40 765 284 307 287 428 3807 2637 314 373 588 11 705 40 765 284 307 287 428

Performance

Hardware settings (A100 SuperPod architecture):

Intra node: 8xA100-80GBs (with mclk 1593MHz, pclk 1410MHz) with AMD EPYC 7742 64-Core Processor, linked by NVSwitch
Inter node: Linked by Infiniband, 8x200Gb/s NICs

Large model inference with model parallel

We demonstrate the inference time of Megatron and FasterTransformer on Triton, and show the speedup of FasterTransformer compared to Megatron for GPT-175B and GPT-89B. In the GPT experiments, we set the following parameters:

Performance of Megatron-530B

head_num = 128
size_per_head = 160
num_layers = 105
data_type = FP16
vocab_size = 51200
top_p = 0.9

TP means tensor parallelism, PP means pipeline parallelism.

Fig 8. Latency on input length 60, output length 20. TP means tensor parallelism and PP means pipeline parallelism.

Fig 9. Throughput per GPU on input length 60, output length 20. TP means tensor parallelism and PP means pipeline parallelism.

Fig 10. Latency with output length fixed at 20 and 16-way tensor parallelism, for different input lengths and batch sizes.

Fig 11. Latency with input length fixed at 128 and 16-way tensor parallelism, for different output lengths and batch sizes.

Batch Size	Input Length	Output Length	Latency of TP-16, PP-1 (ms)	Latency of TP-32, PP-1 (ms)	Latency of TP-8, PP-3 (ms)
1	20	8	565	431	842
2	20	8	598	455	860
4	20	8	616	493	867
8	20	8	660	523	929
16	20	8	730	575	1049
32	20	8	865	672	1283
64	20	8	1191	942	1722
128	20	8	1862	1431	2124
256	20	8	3341	2483	3140

1	60	20	1379	1037	2085
2	60	20	1515	1110	2122
4	60	20	1512	1198	2184
8	60	20	1631	1295	2367
16	60	20	1868	1454	2753
32	60	20	2361	1804	3543
64	60	20	3383	2646	4117
128	60	20	5406	4099	5319
256	60	20	OOM	7203	8318

1	128	8	585	451	866
2	128	8	667	508	932
4	128	8	765	606	1097
8	128	8	990	766	1434
16	128	8	1377	1074	2104
32	128	8	2251	1741	2623
64	128	8	4002	3114	3578
128	128	8	OOM	5784	5512
256	128	8	OOM	11232	9614

Performance of GPT-175B

head_num = 96
size_per_head = 128
num_layers = 96
data_type = FP16
vocab_size = 51200
top_p = 0.9
tensor_parallel_size = 8 with NVLink
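
The model-size label can be cross-checked against the hyper-parameters above with a rough parameter count. The estimate below is a common approximation, not a figure from FasterTransformer: it assumes inter_size = 4 * hidden size and ignores biases, layernorm weights, and positional embeddings; `estimate_gpt_params` is a hypothetical helper.

```python
def estimate_gpt_params(head_num, size_per_head, num_layers, vocab_size):
    """Approximate parameter count of a GPT-style decoder."""
    hidden = head_num * size_per_head
    # Per layer: 4*h^2 for attention (Q, K, V, output) + 8*h^2 for the FFN with inter_size = 4h.
    per_layer = 12 * hidden * hidden
    embedding = vocab_size * hidden
    return num_layers * per_layer + embedding


gpt_175b = estimate_gpt_params(head_num=96, size_per_head=128, num_layers=96, vocab_size=51200)
megatron_530b = estimate_gpt_params(head_num=128, size_per_head=160, num_layers=105, vocab_size=51200)
print(f"{gpt_175b / 1e9:.1f}B, {megatron_530b / 1e9:.1f}B")
```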

Batch_size	Input Seqlen	Output Seqlen	Megatron Latency (ms)	FT Latency (ms)	FT Speedup
1	128	8	660.38	488.86	1.35
2	128	8	687.34	509.47	1.35
4	128	8	1004.88	629.64	1.60
8	128	8	1705.07	749.86	2.27
12	128	8	2365.02	886.24	2.67
16	128	8	3111.57	1037.47	3.00
20	128	8	3723.73	1135.72	3.28
32	128	8	5778.72	1547.44	3.73

1	512	32	2384.78	1719.96	1.39
2	512	32	2503.24	1830.56	1.37
4	512	32	3658.65	2092.56	1.75
8	512	32	6238.79	2629.97	2.37
16	512	32	11409.53	3706.23	3.08
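
The FT Speedup column appears to be the Megatron latency divided by the FT latency, rounded to two decimals. A short check on a few rows of the table above (the `ft_speedup` helper is only for illustration):

```python
def ft_speedup(megatron_ms, ft_ms):
    """Speedup of FT over Megatron for one table row, rounded like the table."""
    return round(megatron_ms / ft_ms, 2)


# (batch size, Megatron latency in ms, FT latency in ms), copied from the GPT-175B table.
rows = [
    (1, 660.38, 488.86),
    (32, 5778.72, 1547.44),
    (16, 11409.53, 3706.23),
]
for batch_size, megatron_ms, ft_ms in rows:
    print(batch_size, ft_speedup(megatron_ms, ft_ms))
```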

Performance of GPT-89B

head_num = 96
size_per_head = 128
num_layers = 48
data_type = FP16
vocab_size = 51200
top_p = 0.9
tensor_parallel_size = 8 with NVLink

Batch_size	Input Seqlen	Output Seqlen	Megatron Latency (ms)	FT Latency (ms)	FT Speedup
1	128	8	342.86	279.44	1.23
2	128	8	369.43	280.24	1.32
4	128	8	540.97	317.71	1.70
8	128	8	912.46	377.50	2.42
12	128	8	1263.39	445.46	2.84
16	128	8	1663.39	524.80	3.17
20	128	8	1991.16	575.83	3.46
32	128	8	3086.85	786.57	3.92

1	512	32	1244.81	887.52	1.40
2	512	32	1357.54	940.11	1.44
4	512	32	1970.08	1133.22	1.74
8	512	32	3341.66	1415.02	2.36
16	512	32	6090.07	1952.2	3.12

Performance of GPT-20B

head_num = 48
size_per_head = 128
num_layers = 44
data_type = FP16
vocab_size = 51200
top_p = 0.9

TP means tensor parallelism

Batch_size	Input Length	Output Length	Latency of single GPU (ms)	Latency of 2-way TP (ms)	Latency of 4-way TP (ms)	Latency of 8-way TP (ms)
1	20	8	225	147	102	89
2	20	8	225	152	108	94
4	20	8	228	158	113	100
8	20	8	239	169	121	107
16	20	8	268	191	133	113
32	20	8	331	230	155	127
64	20	8	452	314	200	169
128	20	8	726	484	318	256
256	20	8	1352	844	533	416

1	60	20	560	358	248	212
2	60	20	562	378	262	222
4	60	20	582	393	274	236
8	60	20	635	429	299	247
16	60	20	748	510	345	272
32	60	20	933	620	418	325
64	60	20	1352	887	574	454
128	60	20	2218	1384	928	699
256	60	20	4141	2424	1574	1152

1	128	20	566	362	254	217
2	128	20	580	385	267	227
4	128	20	629	421	290	244
8	128	20	740	487	333	267
16	128	20	931	618	405	312
32	128	20	1335	862	547	418
64	128	20	2157	1379	832	634
128	128	20	3830	2365	1439	1072
256	128	20	OOM	4414	2639	1943

1	80	200	5609	3532	2438	2053
2	80	200	5588	3682	2544	2095
4	80	200	5661	3797	2646	2206
8	80	200	5838	3984	2741	2268
16	80	200	6167	4356	2964	2307
32	80	200	6864	4817	3233	2566
64	80	200	8290	6003	3815	3173
128	80	200	OOM	7884	5239	4303
256	80	200	OOM	12007	7603	6087

1	200	200	5648	3544	2481	2080
2	200	200	5686	3739	2597	2131
4	200	200	5830	3876	2719	2249
8	200	200	6146	4123	2851	2338
16	200	200	6815	4672	3152	2475
32	200	200	8111	5488	3634	2811
64	200	200	10766	7256	4536	3621
128	200	200	OOM	10538	6618	5229
256	200	200	OOM	OOM	10447	7895

Performance of GPT-6.7B

head_num = 32
size_per_head = 128
num_layers = 32
data_type = FP16
vocab_size = 51200
top_p = 0.9
tensor_para_size = 1

Batch_size	Input Seqlen	Output Seqlen	FT Latency (ms)	Memory Usage (GB)
1	128	8	98.29	15.55
2	128	8	106.74	15.66
4	128	8	123.47	15.87
8	128	8	162.51	16.31
16	128	8	241.16	17.19
32	128	8	400.35	18.84
64	128	8	718.07	22.17

1	512	32	384.70	15.96
2	512	32	425.88	16.30
4	512	32	514.93	16.99
8	512	32	699.62	18.72
16	512	32	1068.88	22.17
32	512	32	1814.03	28.73
64	512	32	3306.41	41.84

Performance of GPT-1.3B

head_num = 32
size_per_head = 64
num_layers = 24
data_type = FP16
vocab_size = 51200
top_p = 0.9
tensor_para_size = 1

Batch_size	Input Seqlen	Output Seqlen	FT Latency (ms)	Memory Usage (GB)
1	128	8	36.76	8.67
2	128	8	39.16	5.39
4	128	8	43.32	5.49
8	128	8	52.92	5.66
16	128	8	74.44	6.00
32	128	8	116.74	6.66
64	128	8	201.71	7.97

1	512	32	135.85	5.58
2	512	32	150.57	5.71
4	512	32	178.25	5.97
8	512	32	232.11	6.64
16	512	32	345.96	7.98
32	512	32	578.52	10.52
64	512	32	1036.21	15.61

Performance of GPT-350M

head_num = 16
size_per_head = 64
num_layers = 24
data_type = FP16
vocab_size = 51200
top_p = 0.9
tensor_para_size = 1

Batch_size	Input Seqlen	Output Seqlen	FT Latency (ms)	Memory Usage (GB)
1	128	8	25.43	3.43
2	128	8	26.42	3.46
4	128	8	28.00	3.51
8	128	8	32.56	3.61
16	128	8	42.87	3.78
32	128	8	62.61	4.13
64	128	8	104.51	4.81

1	512	32	92.01	3.57
2	512	32	97.87	3.65
4	512	32	110.70	3.78
8	512	32	136.45	4.12
16	512	32	189.91	4.80
32	512	32	296.15	6.09
64	512	32	529.18	8.67
