
llama 70b model fusion and sharding #18175

Merged: 15 commits into microsoft:main on Nov 2, 2023

Conversation

@frank-dong-ms (Contributor)

Description

Support llama-70b model fusion and sharding

Motivation and Context

This change enables sharding and exporting the llama-70b model into ONNX, since the model is too large for a single GPU.
It also fuses the llama-70b model, whose repeat_kv pattern differs from that of llama-7b and llama-13b.
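
For reference, repeat_kv appears in Llama-style grouped-query attention, where a small number of key/value heads is shared across a larger number of query heads and must be expanded before the attention computation; llama-70b uses grouped-query attention while llama-7b and llama-13b use standard multi-head attention, which is why the fusion pattern differs. A minimal PyTorch sketch of the operation (mirroring the common open-source implementation, not this PR's fusion code):

import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand (batch, num_kv_heads, seq_len, head_dim) to
    # (batch, num_kv_heads * n_rep, seq_len, head_dim) so the grouped
    # key/value heads line up with the query heads.
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:  # multi-head attention case: nothing to expand
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, seq_len, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)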

@github-advanced-security (bot) left a comment


lintrunner found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@tianleiwu (Contributor)

Lint/Python format pipeline failed. Please run lintrunner -a and fix all warnings.
See https://github.com/microsoft/onnxruntime/blob/main/docs/Coding_Conventions_and_Standards.md#linting

@kunal-vaishnavi (Contributor)

Since device-id is being removed in favor of adding CUDA_VISIBLE_DEVICES=<comma-separated list of device ids to use> to support multi-GPU use, can you update its usage in benchmark_all.py?

parser.add_argument(
    "--device-id",
    type=int,
    default=0,
    help="GPU device ID",
)

It would also be useful in the README to show an example command with CUDA_VISIBLE_DEVICES for both benchmark.py and benchmark_all.py (just as there's an example command with CUDA_VISIBLE_DEVICES for export).
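
As an illustration of the suggested direction, here is a sketch (assumed code, not taken from this PR) of how a script can resolve its GPU from CUDA_VISIBLE_DEVICES plus the MPI local rank instead of a --device-id flag; CUDA renumbers the visible devices from 0, so the local rank maps directly onto them:

import os

# Sketch only: with CUDA_VISIBLE_DEVICES=4,5 set, the two visible GPUs are
# addressed as cuda:0 and cuda:1 inside each process, so the Open MPI local
# rank can pick the device without any --device-id flag.
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
visible = os.environ.get("CUDA_VISIBLE_DEVICES")
num_visible = len(visible.split(",")) if visible else 1
device = f"cuda:{local_rank % num_visible}"
print(f"local rank {local_rank} -> {device}")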

tianleiwu previously approved these changes on Nov 2, 2023
@frank-dong-ms (Contributor, Author)

/azp run Windows GPU CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@frank-dong-ms (Contributor, Author)

/azp run Windows GPU CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@frank-dong-ms (Contributor, Author)

/azp run Windows GPU TensorRT CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

elif "OMPI_COMM_WORLD_LOCAL_RANK" in os.environ:
from mpi4py import MPI

comm = MPI.COMM_WORLD # noqa: F841

Check notice (Code scanning / CodeQL): Unused local variable. Variable comm is not used.
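
For context, importing mpi4py's MPI module initializes the MPI runtime as a side effect, which is why the snippet assigns the otherwise-unused comm (the noqa: F841 silences the Python linter; CodeQL flags it separately). A minimal sketch of the usual pattern, under those assumptions rather than this PR's actual code:

import os

# Sketch: bind each MPI-launched process to one GPU.
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI's mpirun for every process.
if "OMPI_COMM_WORLD_LOCAL_RANK" in os.environ:
    from mpi4py import MPI  # the import itself initializes MPI

    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    world_size = MPI.COMM_WORLD.Get_size()  # total number of ranks
    print(f"rank {local_rank} of {world_size} -> cuda:{local_rank}")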
@frank-dong-ms merged commit dabd395 into microsoft:main on Nov 2, 2023 (84 of 86 checks passed).
tianleiwu pushed a commit that referenced this pull request on Nov 2, 2023.
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request on Mar 22, 2024.