From 12890e57fdfc5876f840a672816cde1e02b894bd Mon Sep 17 00:00:00 2001
From: Kunal Vaishnavi
Date: Fri, 19 Jan 2024 07:04:37 +0000
Subject: [PATCH] Update prerequisites

---
 .../tools/transformers/models/llama/README.md | 31 +++++----------------
 .../models/llama/requirements.txt             |  2 +-
 2 files changed, 8 insertions(+), 25 deletions(-)

diff --git a/onnxruntime/python/tools/transformers/models/llama/README.md b/onnxruntime/python/tools/transformers/models/llama/README.md
index e7bcc19635f40..f9552e02d74b9 100644
--- a/onnxruntime/python/tools/transformers/models/llama/README.md
+++ b/onnxruntime/python/tools/transformers/models/llama/README.md
@@ -42,23 +42,6 @@ $ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama
 
 To make this option compatible with [Hugging Face's Optimum](https://github.com/huggingface/optimum), you will need to create `config.json` and `generation_config.json` for your model and store them in the same directory as your ONNX models. For example, you can find those JSON files for LLaMA-2 7B on Hugging Face [here](https://huggingface.co/meta-llama/Llama-2-7b-hf).
 
-As indicated in `requirements.txt`, you will also need to install Optimum from source. Once installed, you will need to modify `ORTModelForCausalLM.forward` in `optimum/optimum/onnxruntime/modeling_decoder.py` as follows:
-
-```
-# Before
-if self.use_cache:
-    if past_key_values is not None:
-        input_ids = input_ids[:, -1:]
-    # Flatten the past_key_values (no need to flatten for models using multi-query attn)
-
-
-# After
-if self.use_cache:
-    if past_key_values is not None:
-        input_ids = input_ids[:, -1:] if past_key_values[0][0].shape[2] != 0 else input_ids
-    # Flatten the past_key_values (no need to flatten for models using multi-query attn)
-```
-
 ### Option 2: from [Microsoft's custom export](https://github.com/microsoft/Llama-2-Onnx)
 
 Please follow the [README instructions](https://github.com/microsoft/Llama-2-Onnx#before-you-start) in the custom export of LLaMA-2.
@@ -254,7 +237,7 @@ Here are some examples of how you can benchmark LLaMA-2.
 1. PyTorch without `torch.compile`, FP32
 
 ```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
     --benchmark-type hf-pt-eager \
     --model-name meta-llama/Llama-2-7b-hf \
     --precision fp32 \
@@ -266,7 +249,7 @@ python3 -m models.llama.benchmark \
 2. PyTorch with `torch.compile`, FP16
 
 ```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
     --benchmark-type hf-pt-compile \
     --model-name meta-llama/Llama-2-7b-hf \
     --precision fp16 \
@@ -278,7 +261,7 @@ python3 -m models.llama.benchmark \
 3. Optimum + ONNX Runtime, FP32, export via Optimum or convert_to_onnx
 
 ```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
     --benchmark-type hf-ort \
     --hf-ort-dir-path ./Llama-2-7b-hf-onnx/ \
     --model-name meta-llama/Llama-2-7b-hf \
@@ -291,7 +274,7 @@ python3 -m models.llama.benchmark \
 4. Optimum + ONNX Runtime, FP16, export via Optimum or convert_to_onnx
 
 ```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
     --benchmark-type hf-ort \
     --hf-ort-dir-path ./Llama-2-7b-hf-onnx/ \
     --model-name meta-llama/Llama-2-7b-hf \
@@ -304,7 +287,7 @@ python3 -m models.llama.benchmark \
 5. ONNX Runtime, FP32, Microsoft custom export
 
 ```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
     --benchmark-type ort-msft \
     --ort-model-path ./llama-2-onnx/7B_float32/ONNX/LlamaV2_7B_float32.onnx \
     --model-name meta-llama/Llama-2-7b-hf \
@@ -316,7 +299,7 @@ python3 -m models.llama.benchmark \
 6. ONNX Runtime, FP16, Microsoft custom export
 
 ```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
     --benchmark-type ort-msft \
     --ort-model-path ./llama-2-onnx/7B_float16/ONNX/LlamaV2_7B_float16.onnx \
     --model-name meta-llama/Llama-2-7b-hf \
@@ -367,7 +350,7 @@ You can profile a variant by adding the `--profile` flag and providing one batch
 ### Benchmark All
 You can use `benchmark_all.py` to benchmark across various options and automatically store the results in a CSV file. Here is an example.
 ```
-python3 -m models.llama.benchmark_all \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_all \
     --hf-pt-eager \
     --hf-pt-compile \
     --hf-ort-dir-path ./llama2-7b-fp16/ \
diff --git a/onnxruntime/python/tools/transformers/models/llama/requirements.txt b/onnxruntime/python/tools/transformers/models/llama/requirements.txt
index 4210f36982aef..b72c972e7a16a 100644
--- a/onnxruntime/python/tools/transformers/models/llama/requirements.txt
+++ b/onnxruntime/python/tools/transformers/models/llama/requirements.txt
@@ -1,4 +1,4 @@
-git+https://github.com/huggingface/optimum.git
+optimum>=1.14.1
 transformers>=4.33.2
 torch>=2.2.0.dev20230920
 onnx>=1.14.0
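Note on the change above (not part of the patch): the patch pins `optimum>=1.14.1` in `requirements.txt` and drops the README instructions for installing Optimum from source and hand-editing `ORTModelForCausalLM.forward`, so the exported model can be consumed with a stock Optimum release. Below is a minimal, hedged sketch of that flow. It assumes the ONNX model was exported to a local `./Llama-2-7b-hf-onnx/` directory (the same placeholder path the README's benchmark commands use) containing the `config.json` and `generation_config.json` files the README asks you to add, and that you have access to the gated `meta-llama/Llama-2-7b-hf` tokenizer; the prompt, provider, and directory name are illustrative.

```
# Hedged sketch, not part of the patch: load the exported LLaMA-2 ONNX model through
# a released Optimum build (>= 1.14.1) instead of a patched source install.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

# Assumed local export directory that also holds config.json / generation_config.json.
onnx_dir = "./Llama-2-7b-hf-onnx/"

# Tokenizer comes from the original Hugging Face repo (gated; requires an access token).
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Load the ONNX model; switch the provider to "CUDAExecutionProvider" for GPU runs.
model = ORTModelForCausalLM.from_pretrained(onnx_dir, provider="CPUExecutionProvider")

inputs = tokenizer("ONNX Runtime is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```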