
Speed up Llama2 CPU throughput in bench by 1.69x with IOBinding #19853

Merged 2 commits into main from bowbao/bench_iobinding on Mar 12, 2024

Conversation

@BowenBao (Contributor) commented Mar 11, 2024

Description

Always set `use_io_binding=True` when using optimum.onnxruntime, unless there is a special case.

Motivation and Context

By default, `ORTModel` under optimum.onnxruntime chooses an appropriate `use_io_binding` value based on the execution provider and use case:

    use_io_binding (`Optional[bool]`, defaults to `None`):
        Whether to use IOBinding during inference to avoid memory copy between the host and device, or between numpy/torch tensors and ONNX Runtime ORTValue. Defaults to `True` if the execution provider is CUDAExecutionProvider. For [`~onnxruntime.ORTModelForCausalLM`], defaults to `True` on CPUExecutionProvider; in all other cases, defaults to `False`.
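For illustration, here is a minimal sketch of loading a causal LM through optimum.onnxruntime with the flag passed explicitly; the model path and tokenizer usage are assumptions, not taken from the benchmark script:

```python
# Minimal sketch (not the actual benchmark script): loading a causal LM via
# optimum.onnxruntime. "llama2-onnx" is a hypothetical local ONNX export.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_dir = "llama2-onnx"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(model_dir)

# Leaving use_io_binding=None lets optimum pick the default described above;
# it is set explicitly here only to make the choice visible.
model = ORTModelForCausalLM.from_pretrained(
    model_dir,
    provider="CPUExecutionProvider",
    use_io_binding=True,
)

inputs = tokenizer("Hello", return_tensors="pt")
outputs = model(**inputs)
```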

For the Llama token benchmark, using IOBinding yields an almost 2x speedup, even on CPU. This is because this particular model produces a large number of outputs (>60). Without IOBinding, each output is copied from OrtValue to a numpy array, which adds significant overhead to the overall run time.
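To make the copy behavior concrete, here is a minimal sketch using the raw onnxruntime IOBinding API; `model.onnx` and the single `input_ids` input are illustrative assumptions, not the actual Llama2 export:

```python
# Minimal sketch, assuming a hypothetical "model.onnx" with one "input_ids"
# input. sess.run() materializes every output as a numpy array, while
# run_with_iobinding() leaves outputs as OrtValues until they are needed.
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_ids = np.ones((1, 512), dtype=np.int64)

# Plain run: one OrtValue -> numpy copy per output.
np_outputs = sess.run(None, {"input_ids": input_ids})

# IOBinding: bind input and outputs, then run without per-output copies.
binding = sess.io_binding()
binding.bind_cpu_input("input_ids", input_ids)
for out in sess.get_outputs():
    binding.bind_output(out.name, device_type="cpu")
sess.run_with_iobinding(binding)
ort_outputs = binding.get_outputs()  # list of OrtValue, converted lazily
```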

Evaluating the Llama2 `model(inputs)` step with `past_key_values`:

|                 | Before (w/o IOBinding, CPU) | After (w/ IOBinding, CPU) |
|-----------------|-----------------------------|---------------------------|
| Batch size      | 1                           | 1                         |
| Sequence length | 512                         | 512                       |
| Latency         | 0.4519 s                    | 0.2663 s                  |
| Throughput      | 2.2130 tps                  | 3.7557 tps                |
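For reference, the two numbers are related by throughput = batch_size / latency. A hypothetical timing loop of this shape (not the actual benchmark script) would produce them:

```python
# Hypothetical sketch of a per-step timing loop; the real benchmark script
# in this PR may differ. Throughput here is batch_size / latency.
import time

def measure(model, inputs, batch_size=1, warmup=5, iters=20):
    for _ in range(warmup):          # discard warm-up runs
        model(**inputs)
    start = time.perf_counter()
    for _ in range(iters):
        model(**inputs)
    latency = (time.perf_counter() - start) / iters  # seconds per step
    return latency, batch_size / latency             # (s, tokens/s)
```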

@kunal-vaishnavi (Contributor) commented:

Should we manually set `use_io_binding=True` anyway for readability purposes and add a comment explaining why it is always true? When reading the code, it might otherwise be confusing why `use_io_binding` is not enabled explicitly.

@BowenBao (Contributor, Author) commented:

The other perspective is that `use_io_binding` can be viewed as an implementation detail of optimum, and we trust optimum to pick the best choice. This way there is less mental burden in writing the benchmark script. But I'm also okay with your proposal.

I wonder whether using IOBinding / dlpack is always better, and whether we should use it to improve the default `sess.run`.

@BowenBao merged commit 742595b into main on Mar 12, 2024, with 95 checks passed, and deleted the bowbao/bench_iobinding branch on March 12, 2024 at 16:41.