[ROCm] fix: obtain AMD GPU memory info through rocm_smi library #21190
Conversation
@microsoft-github-policy-service agree [company="AMD"]
@microsoft-github-policy-service agree company="AMD Inc."
Good idea.
a9c8672
From: Tianlei Wu
Sent: Friday, June 28, 2024 07:13
Subject: Re: [microsoft/onnxruntime] [ROCm] fix: obtain AMD GPU memory info through rocm_smi library (PR #21190)
How about logic like the following:

```cpp
const auto status = hipMemGetInfo(free, total);
if (status != hipSuccess) {
  ROCMSMI_CALL_THROW(rsmi_init(0));
  ROCMSMI_CALL_THROW(rsmi_dev_memory_total_get(deviceId, RSMI_MEM_TYPE_VIS_VRAM, total));
  ROCMSMI_CALL_THROW(rsmi_dev_memory_usage_get(deviceId, RSMI_MEM_TYPE_VIS_VRAM, &used));
  *free = *total - used;
  ROCMSMI_CALL_THROW(rsmi_shut_down());
}
```
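For context, `ROCMSMI_CALL_THROW` above is an error-checking wrapper around `rocm_smi` calls. A minimal sketch of what such a macro might look like, using a stubbed status enum in place of the real `rsmi_status_t` so the example is self-contained (the real type and its values come from the `rocm_smi` headers):

```cpp
#include <sstream>
#include <stdexcept>

// Stub standing in for rocm_smi's status type (assumption: the real
// library returns a success value of 0).
enum rsmi_status_t { RSMI_STATUS_SUCCESS = 0, RSMI_STATUS_ERROR = 1 };

// Sketch of an error-checking macro in the spirit of ROCMSMI_CALL_THROW:
// evaluate the call once, and throw with the call text and status code
// if it did not succeed.
#define ROCMSMI_CALL_THROW(expr)                                  \
  do {                                                            \
    rsmi_status_t status_ = (expr);                               \
    if (status_ != RSMI_STATUS_SUCCESS) {                         \
      std::ostringstream oss_;                                    \
      oss_ << "rocm_smi call failed: " #expr " -> " << status_;   \
      throw std::runtime_error(oss_.str());                       \
    }                                                             \
  } while (0)
```

Wrapping every `rsmi_*` call this way keeps the fallback path short while still surfacing which call failed and with what status.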
/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline
/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline
/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline
Pipelines were unable to run due to a timeout waiting for the pull request to finish merging.
@hann-wang, the Python format pipeline failed. Please fix it by running lintrunner at the repository root.
/azp run orttraining-amd-gpu-ci-pipeline
Azure Pipelines successfully started running 1 pipeline(s).
got it, thank you!
/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline
/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline
/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline
Azure Pipelines successfully started running 3 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
Description
Previously, ROCMExecutionProvider used `hipMemGetInfo` to obtain the total and available memory sizes. However, this API has been broken since ROCm 5.7. In this PR, we use the `rocm_smi` library instead of `hipMemGetInfo`.

Motivation and Context
The `hipMemGetInfo` API has been broken since ROCm 5.7, and inference with ROCMExecutionProvider leads to errors. MIOpen has a brute-force fix for this (https://github.com/ROCm/MIOpen/blob/911e67189592c311374940493f2099f3abced60d/src/hip/handlehip.cpp#L72). Instead of hard-coding the available memory to 16 GB, I suppose we could obtain memory info through the `rocm_smi` library, as in this PR.