
Speed up Llama2 CPU throughput in bench by 1.69x with IOBinding #19853

Merged 2 commits into main from bowbao/bench_iobinding on Mar 12, 2024

Conversation

@BowenBao (Contributor) commented Mar 11, 2024

Description

Always set `use_io_binding=True` when using optimum.onnxruntime, unless there is a special case.

Motivation and Context

By default, `ORTModel` under optimum.onnxruntime chooses an appropriate `use_io_binding` value based on the execution provider and use case:

    use_io_binding (`Optional[bool]`, defaults to `None`):
        Whether to use IOBinding during inference to avoid memory copy between the host and device, or between numpy/torch tensors and ONNX Runtime ORTValue. Defaults to `True` if the execution provider is CUDAExecutionProvider. For [`~onnxruntime.ORTModelForCausalLM`], defaults to `True` on CPUExecutionProvider; in all other cases, defaults to `False`.
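For illustration, here is a minimal sketch of loading a causal LM through optimum.onnxruntime with the flag passed explicitly; the model path and tokenizer usage are assumptions, not taken from the benchmark script:

```python
# Minimal sketch (not the actual benchmark script): loading a causal LM via
# optimum.onnxruntime. "llama2-onnx" is a hypothetical local ONNX export.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_dir = "llama2-onnx"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Leaving use_io_binding=None lets optimum pick the default described above;
# it is set explicitly here only to make the choice visible.
model = ORTModelForCausalLM.from_pretrained(
    model_dir,
    provider="CPUExecutionProvider",
    use_io_binding=True,
)

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model(**inputs)
```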

For the Llama token benchmark, using IOBinding yields an almost 2x speedup, even on CPU. This is because this particular model produces a large number of outputs (>60). Without IOBinding, each output is copied from OrtValue to a numpy array, which adds significant overhead to the overall run time.
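To make the copy behavior concrete, here is a minimal sketch using the raw onnxruntime IOBinding API; `model.onnx` and the single `input_ids` input are illustrative assumptions, not the actual Llama2 export:

```python
# Minimal sketch, assuming a hypothetical "model.onnx" with one "input_ids"
# input. sess.run() materializes every output as a numpy array, while
# run_with_iobinding() leaves outputs as OrtValues until they are needed.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_ids = np.ones((1, 512), dtype=np.int64)

# Plain run: one OrtValue -> numpy copy per output.
np_outputs = sess.run(None, {"input_ids": input_ids})

# IOBinding: bind input and outputs, then run without per-output copies.
binding = sess.io_binding()
binding.bind_cpu_input("input_ids", input_ids)
for out in sess.get_outputs():
    binding.bind_output(out.name, device_type="cpu")
sess.run_with_iobinding(binding)
ort_outputs = binding.get_outputs()  # list of OrtValue, converted lazily
```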

Evaluating the Llama2 `model(inputs)` step with `past_key_values`:

|                 | Before (w/o IOBinding, CPU) | After (w/ IOBinding, CPU) |
|-----------------|-----------------------------|---------------------------|
| Batch size      | 1                           | 1                         |
| Sequence length | 512                         | 512                       |
| Latency         | 0.4519 s                    | 0.2663 s                  |
| Throughput      | 2.2130 tps                  | 3.7557 tps                |
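For reference, the two numbers are related by throughput = batch_size / latency. A hypothetical timing loop of this shape (not the actual benchmark script) would produce them:

```python
# Hypothetical sketch of a per-step timing loop; the real benchmark script
# in this PR may differ. Throughput here is batch_size / latency.
import time

def measure(model, inputs, batch_size=1, warmup=5, iters=20):
    for _ in range(warmup):          # discard warm-up runs
        model(**inputs)
    start = time.perf_counter()
    for _ in range(iters):
        model(**inputs)
    latency = (time.perf_counter() - start) / iters  # seconds per step
    return latency, batch_size / latency             # (s, tokens/s)
```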

@kunal-vaishnavi (Contributor) commented:

Should we manually set `use_io_binding=True` anyway for readability purposes and add a comment explaining why it is always true? When reading the code, it might otherwise be confusing why `use_io_binding` is not enabled explicitly.

@BowenBao (Contributor, Author) commented:

The other perspective is that `use_io_binding` can be viewed as an implementation detail of optimum, and we trust optimum to pick the best choice. This way there is less mental burden in writing the benchmark script. But I'm also okay with your proposal.

I wonder whether using IOBinding / dlpack is always better, and whether we should use it to improve the default `sess.run`.

@BowenBao merged commit 742595b into main on Mar 12, 2024, with 95 checks passed, and deleted the bowbao/bench_iobinding branch on March 12, 2024 at 16:41.