Commit
### Description

This PR reduces the memory usage when exporting and benchmarking LLaMA.

### Motivation and Context

- Exporting: the PyTorch model is deleted from memory immediately after a successful export, instead of after both exporting and converting the ONNX model to the desired precision.
- Benchmarking: in the ONNX model with GroupQueryAttention, the KV cache inputs share the same GPU memory across the prompt and token-generation benchmarks.
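The export-side change is purely about ordering: free the source model before the precision conversion allocates its output, so the two never coexist in memory. A minimal, self-contained sketch of that effect (no PyTorch needed; `bytearray` buffers stand in for the model weights and ONNX artifacts, and all sizes are made-up illustrations, not the real LLaMA footprint):

```python
import tracemalloc

MB = 1024 * 1024


def export_then_convert(free_model_early: bool) -> int:
    """Toy export + precision-conversion pipeline; returns peak bytes allocated.

    bytearray buffers stand in for real tensors; sizes are arbitrary.
    """
    tracemalloc.start()
    model = bytearray(50 * MB)      # stand-in for the PyTorch weights
    onnx_fp32 = bytearray(50 * MB)  # stand-in for the exported ONNX model
    if free_model_early:
        # The PR's ordering: drop the PyTorch model right after export,
        # before the converted model is allocated.
        del model
    onnx_fp16 = bytearray(25 * MB)  # stand-in for the precision-converted model
    peak = tracemalloc.get_traced_memory()[1]
    tracemalloc.stop()
    del onnx_fp32, onnx_fp16
    return peak


# Freeing the model before conversion lowers the peak allocation.
assert export_then_convert(free_model_early=True) < export_then_convert(free_model_early=False)
```

With these toy sizes the early-free ordering peaks at roughly 100 MB (weights + exported model) while the late-free ordering peaks at roughly 125 MB (all three buffers alive at once); the saving is the size of the converted model's working set.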
1 parent 2b95e74 · commit d1b85f5
Showing 4 changed files with 248 additions and 186 deletions.