
FusedMHARunnerFP16v2 will cause onnxruntime coredump when multiple host threads run session.run() #22262

Closed
zwyao opened this issue Sep 29, 2024 · 3 comments


zwyao commented Sep 29, 2024

Describe the issue

In my BERT model, when I use head-size == 32, the attention CUDA kernel causes an ORT coredump; the error message says "CUDA illegal memory access was encountered".
I found the reason: FusedMHARunnerFP16v2 does not support concurrent execution.

To reproduce

attention_bug_fix.txt

This is my fix code.

Urgency

No response

Platform

Linux

OS Version

1.18.0

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

1.18.0 master

ONNX Runtime API

C++

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

zwyao (Author) commented Sep 29, 2024

I find the bug is not fixed in the latest version, 1.19.1.

tianleiwu (Contributor) commented Sep 30, 2024

@zwyao,
Thread safety of the self-attention FusedMHARunnerFP16v2 was fixed in #21420. There was another fix for cross-attention.
The bug was resolved in the 1.19.0 release. Please try 1.19.2.

zwyao (Author) commented Sep 30, 2024

> @zwyao, Thread safety of the self-attention FusedMHARunnerFP16v2 was fixed in #21420. There was another fix for cross-attention. The bug was resolved in the 1.19.0 release. Please try 1.19.2.

emmm, thanks
