
llama 70b model fusion and sharding #18175

Merged: 15 commits into microsoft:main on Nov 2, 2023

Conversation

@frank-dong-ms (Contributor)

Description

Support llama-70b model fusion and sharding

Motivation and Context

This change enables sharding and exporting the llama-70b model into ONNX, since the model is too large for a single GPU.
It also fuses the llama-70b model, whose repeat_kv pattern differs from that of llama-7b and llama-13b.
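
For reference, repeat_kv appears in Llama-style grouped-query attention, where a small number of key/value heads is shared across a larger number of query heads and must be expanded before the attention computation; llama-70b uses grouped-query attention while llama-7b and llama-13b use standard multi-head attention, which is why the fusion pattern differs. A minimal PyTorch sketch of the operation (mirroring the common open-source implementation, not this PR's fusion code):

import torch

def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    # Expand (batch, num_kv_heads, seq_len, head_dim) to
    # (batch, num_kv_heads * n_rep, seq_len, head_dim) so the grouped
    # key/value heads line up with the query heads.
    batch, num_kv_heads, seq_len, head_dim = hidden_states.shape
    if n_rep == 1:  # multi-head attention case: nothing to expand
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(
        batch, num_kv_heads, n_rep, seq_len, head_dim
    )
    return hidden_states.reshape(batch, num_kv_heads * n_rep, seq_len, head_dim)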

@github-advanced-security (bot) left a comment


lintrunner found more than 10 potential problems in the proposed changes. Check the Files changed tab for more details.

@tianleiwu (Contributor)

Lint/Python format pipeline failed. Please run lintrunner -a and fix all warnings.
See https://github.com/microsoft/onnxruntime/blob/main/docs/Coding_Conventions_and_Standards.md#linting

@kunal-vaishnavi (Contributor)

Since device-id is being removed in favor of adding CUDA_VISIBLE_DEVICES=<comma-separated list of device ids to use> to support multi-GPU use, can you update its usage in benchmark_all.py?

parser.add_argument(
    "--device-id",
    type=int,
    default=0,
    help="GPU device ID",
)

It would also be useful in the README to show an example command with CUDA_VISIBLE_DEVICES for both benchmark.py and benchmark_all.py (just as there's an example command with CUDA_VISIBLE_DEVICES for export).
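
As an illustration of the suggested direction, here is a sketch (assumed code, not taken from this PR) of how a script can resolve its GPU from CUDA_VISIBLE_DEVICES plus the MPI local rank instead of a --device-id flag; CUDA renumbers the visible devices from 0, so the local rank maps directly onto them:

import os

# Sketch only: with CUDA_VISIBLE_DEVICES=4,5 set, the two visible GPUs are
# addressed as cuda:0 and cuda:1 inside each process, so the Open MPI local
# rank can pick the device without any --device-id flag.
local_rank = int(os.environ.get("OMPI_COMM_WORLD_LOCAL_RANK", "0"))
visible = os.environ.get("CUDA_VISIBLE_DEVICES")
num_visible = len(visible.split(",")) if visible else 1
device = f"cuda:{local_rank % num_visible}"
print(f"local rank {local_rank} -> {device}")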

tianleiwu previously approved these changes on Nov 2, 2023
@frank-dong-ms (Contributor, Author)

/azp run Windows GPU CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@frank-dong-ms (Contributor, Author)

/azp run Windows GPU CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

@frank-dong-ms (Contributor, Author)

/azp run Windows GPU TensorRT CI Pipeline


Azure Pipelines successfully started running 1 pipeline(s).

elif "OMPI_COMM_WORLD_LOCAL_RANK" in os.environ:
from mpi4py import MPI

comm = MPI.COMM_WORLD # noqa: F841

Check notice (Code scanning / CodeQL): Unused local variable. Variable comm is not used.
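
For context, importing mpi4py's MPI module initializes the MPI runtime as a side effect, which is why the snippet assigns the otherwise-unused comm (the noqa: F841 silences the Python linter; CodeQL flags it separately). A minimal sketch of the usual pattern, under those assumptions rather than this PR's actual code:

import os

# Sketch: bind each MPI-launched process to one GPU.
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI's mpirun for every process.
if "OMPI_COMM_WORLD_LOCAL_RANK" in os.environ:
    from mpi4py import MPI  # the import itself initializes MPI

    local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
    world_size = MPI.COMM_WORLD.Get_size()  # total number of ranks
    print(f"rank {local_rank} of {world_size} -> cuda:{local_rank}")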
@frank-dong-ms merged commit dabd395 into microsoft:main on Nov 2, 2023 (84 of 86 checks passed).
tianleiwu pushed a commit that referenced this pull request on Nov 2, 2023.
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request on Mar 22, 2024.