Update prerequisites
kunal-vaishnavi committed Jan 19, 2024
1 parent f589fdf commit 12890e5
Showing 2 changed files with 8 additions and 25 deletions.
31 changes: 7 additions & 24 deletions onnxruntime/python/tools/transformers/models/llama/README.md
@@ -42,23 +42,6 @@ $ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama

To make this option compatible with [Hugging Face's Optimum](https://github.com/huggingface/optimum), you will need to create `config.json` and `generation_config.json` for your model and store them in the same directory as your ONNX models. For example, you can find those JSON files for LLaMA-2 7B on Hugging Face [here](https://huggingface.co/meta-llama/Llama-2-7b-hf).

-As indicated in `requirements.txt`, you will also need to install Optimum from source. Once installed, you will need to modify `ORTModelForCausalLM.forward` in `optimum/optimum/onnxruntime/modeling_decoder.py` as follows:
-
-```
-# Before
-if self.use_cache:
-    if past_key_values is not None:
-        input_ids = input_ids[:, -1:]
-# Flatten the past_key_values (no need to flatten for models using multi-query attn)
-
-# After
-if self.use_cache:
-    if past_key_values is not None:
-        input_ids = input_ids[:, -1:] if past_key_values[0][0].shape[2] != 0 else input_ids
-# Flatten the past_key_values (no need to flatten for models using multi-query attn)
-```
-
### Option 2: from [Microsoft's custom export](https://github.com/microsoft/Llama-2-Onnx)

Please follow the [README instructions](https://github.com/microsoft/Llama-2-Onnx#before-you-start) in the custom export of LLaMA-2.
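
The Optimum-compatibility note above says the exported ONNX files need `config.json` and `generation_config.json` stored alongside them. As a minimal sketch (not part of this commit; the directory name and prompt are assumptions), loading such a directory with Optimum looks roughly like this:

```
# Sketch only: load an exported LLaMA-2 ONNX directory with Optimum.
# Assumes the directory contains the decoder ONNX model plus config.json
# and generation_config.json; the path and prompt are placeholders.
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_dir = "./Llama-2-7b-hf-onnx/"  # assumed export output directory
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = ORTModelForCausalLM.from_pretrained(model_dir, use_cache=True)

inputs = tokenizer("What is ONNX Runtime?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```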
@@ -254,7 +237,7 @@ Here are some examples of how you can benchmark LLaMA-2.

1. PyTorch without `torch.compile`, FP32
```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
--benchmark-type hf-pt-eager \
--model-name meta-llama/Llama-2-7b-hf \
--precision fp32 \
@@ -266,7 +249,7 @@ python3 -m models.llama.benchmark \

2. PyTorch with `torch.compile`, FP16
```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
--benchmark-type hf-pt-compile \
--model-name meta-llama/Llama-2-7b-hf \
--precision fp16 \
@@ -278,7 +261,7 @@ python3 -m models.llama.benchmark \

3. Optimum + ONNX Runtime, FP32, export via Optimum or convert_to_onnx
```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
--benchmark-type hf-ort \
--hf-ort-dir-path ./Llama-2-7b-hf-onnx/ \
--model-name meta-llama/Llama-2-7b-hf \
@@ -291,7 +274,7 @@ python3 -m models.llama.benchmark \

4. Optimum + ONNX Runtime, FP16, export via Optimum or convert_to_onnx
```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
--benchmark-type hf-ort \
--hf-ort-dir-path ./Llama-2-7b-hf-onnx/ \
--model-name meta-llama/Llama-2-7b-hf \
@@ -304,7 +287,7 @@ python3 -m models.llama.benchmark \

5. ONNX Runtime, FP32, Microsoft custom export
```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
--benchmark-type ort-msft \
--ort-model-path ./llama-2-onnx/7B_float32/ONNX/LlamaV2_7B_float32.onnx \
--model-name meta-llama/Llama-2-7b-hf \
@@ -316,7 +299,7 @@ python3 -m models.llama.benchmark \

6. ONNX Runtime, FP16, Microsoft custom export
```
-python3 -m models.llama.benchmark \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
--benchmark-type ort-msft \
--ort-model-path ./llama-2-onnx/7B_float16/ONNX/LlamaV2_7B_float16.onnx \
--model-name meta-llama/Llama-2-7b-hf \
@@ -367,7 +350,7 @@ You can profile a variant by adding the `--profile` flag and providing one batch
### Benchmark All
You can use `benchmark_all.py` to benchmark across various options and automatically store the results in a CSV file. Here is an example.
```
-python3 -m models.llama.benchmark_all \
+CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_all \
--hf-pt-eager \
--hf-pt-compile \
--hf-ort-dir-path ./llama2-7b-fp16/ \
2 changes: 1 addition & 1 deletion onnxruntime/python/tools/transformers/models/llama/requirements.txt
@@ -1,4 +1,4 @@
-git+https://github.com/huggingface/optimum.git
+optimum>=1.14.1
transformers>=4.33.2
torch>=2.2.0.dev20230920
onnx>=1.14.0
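
With the source install of Optimum replaced by a released package, one quick way to confirm an environment satisfies the updated pins (a sketch, not part of the repository) is:

```
# Sketch: check installed versions against the updated requirements.txt pins.
import onnx
import optimum
import torch
import transformers

print("optimum:     ", optimum.__version__)       # expect >= 1.14.1
print("transformers:", transformers.__version__)  # expect >= 4.33.2
print("torch:       ", torch.__version__)         # expect >= 2.2.0.dev20230920
print("onnx:        ", onnx.__version__)          # expect >= 1.14.0
```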
