[ROCm] fix: obtain AMD GPU memory info through rocm_smi library #21190
Conversation
@microsoft-github-policy-service agree [company="AMD"]
@microsoft-github-policy-service agree company="AMD Inc."
Good idea.
a9c8672
From: Tianlei Wu
Sent: Friday, June 28, 2024 07:13
Subject: Re: [microsoft/onnxruntime] [ROCm] fix: obtain AMD GPU memory info through rocm_smi library (PR #21190)
How about logic like the following:

```cpp
const auto status = hipMemGetInfo(free, total);
if (status != hipSuccess) {
  ROCMSMI_CALL_THROW(rsmi_init(0));
  ROCMSMI_CALL_THROW(rsmi_dev_memory_total_get(deviceId, RSMI_MEM_TYPE_VIS_VRAM, total));
  ROCMSMI_CALL_THROW(rsmi_dev_memory_usage_get(deviceId, RSMI_MEM_TYPE_VIS_VRAM, &used));
  *free = *total - used;
  ROCMSMI_CALL_THROW(rsmi_shut_down());
}
```
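For context, `ROCMSMI_CALL_THROW` above is an error-checking wrapper around `rocm_smi` calls. A minimal sketch of what such a macro might look like, using a stubbed status enum in place of the real `rsmi_status_t` so the example is self-contained (the real type and its values come from the `rocm_smi` headers):

```cpp
#include <sstream>
#include <stdexcept>

// Stub standing in for rocm_smi's status type (assumption: the real
// library returns a success value of 0).
enum rsmi_status_t { RSMI_STATUS_SUCCESS = 0, RSMI_STATUS_ERROR = 1 };

// Sketch of an error-checking macro in the spirit of ROCMSMI_CALL_THROW:
// evaluate the call once, and throw with the call text and status code
// if it did not succeed.
#define ROCMSMI_CALL_THROW(expr)                                  \
  do {                                                            \
    rsmi_status_t status_ = (expr);                               \
    if (status_ != RSMI_STATUS_SUCCESS) {                         \
      std::ostringstream oss_;                                    \
      oss_ << "rocm_smi call failed: " #expr " -> " << status_;   \
      throw std::runtime_error(oss_.str());                       \
    }                                                             \
  } while (0)
```

Wrapping every `rsmi_*` call this way keeps the fallback path short while still surfacing which call failed and with what status.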
/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline
/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline
/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline
Pipelines were unable to run due to a timeout waiting for the pull request to finish merging.
@hann-wang, the Python format pipeline failed. Please fix it by running lintrunner at the repository root.
/azp run orttraining-amd-gpu-ci-pipeline
Azure Pipelines successfully started running 1 pipeline(s).
got it, thank you!
/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline
/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline
/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline
Azure Pipelines successfully started running 3 pipeline(s).
Azure Pipelines successfully started running 10 pipeline(s).
Description
Previously, ROCMExecutionProvider used `hipMemGetInfo` to obtain the total and available memory sizes. However, this API has been broken since ROCm 5.7. In this PR, we use the `rocm_smi` library instead of `hipMemGetInfo`.

Motivation and Context
The `hipMemGetInfo` API has been broken since ROCm 5.7, and inference with ROCMExecutionProvider leads to errors. MIOpen has a brute-force fix for this (https://github.com/ROCm/MIOpen/blob/911e67189592c311374940493f2099f3abced60d/src/hip/handlehip.cpp#L72). Instead of hard-coding the available memory to 16 GB, I suppose we could obtain memory info through the `rocm_smi` library, as in this PR.