[BUG] subprocess.CalledProcessError #6585

Open
jagadish-amd opened this issue Sep 27, 2024 · 4 comments · May be fixed by #6587

@jagadish-amd

jagadish-amd commented Sep 27, 2024

Since PR #6498, we are seeing the following errors on ROCm:
[rank7]: raise child_exception_type(errno_num, err_msg, err_filename)
[rank7]: FileNotFoundError: [Errno 2] No such file or directory: "/opt/rocm/bin/rocminfo | grep -Eo -m1 'Wavefront Size:[[:space:]]+[0-9]+' | grep -Eo '[0-9]+'"
[rank0]: Traceback (most recent call last):

Building extension module transformer_inference...
Using envvar MAX_JOBS (32) as the number of workers...
Using /root/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
Using /root/.cache/torch_extensions/py39_cpu as PyTorch extensions root...
ninja: error: build.ninja:8: unexpected '='

[rank4]: Traceback (most recent call last):
[rank4]: File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2107, in _run_ninja_build
[rank4]: subprocess.run(
[rank4]: File "/opt/conda/envs/py_3.9/lib/python3.9/subprocess.py", line 528, in run
[rank4]: raise CalledProcessError(retcode, process.args,
[rank4]: subprocess.CalledProcessError: Command '['ninja', '-v', '-j', '32']' returned non-zero exit status 1.

Minimal code to reproduce the issue:
import deepspeed
deepspeed.ops.op_builder.InferenceBuilder().load()

@loadams @jithunnair-amd @pruthvistony @jeffdaily

@jagadish-amd
Author

ping @tjruwase

@jagadish-amd
Author

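shlex.split() does not work as expected for these command strings (rocm_wavefront_size_cmd, rocm_gpu_arch_cmd): both contain shell pipes, and shlex.split() only tokenizes the string, so each `|` becomes a plain argument instead of connecting the commands.
The patch below resolves the errors by partially bringing back the old code.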

diff --git a/op_builder/builder.py b/op_builder/builder.py
index e935a179..aa564f74 100644
--- a/op_builder/builder.py
+++ b/op_builder/builder.py
@@ -254,7 +254,7 @@ class OpBuilder(ABC):
         rocm_gpu_arch_cmd = str(rocm_info) + " | grep -o -m 1 'gfx.*'"
         try:
             safe_cmd = shlex.split(rocm_gpu_arch_cmd)
-            result = subprocess.check_output(safe_cmd)
+            result = subprocess.check_output(rocm_gpu_arch_cmd, shell=True)
             rocm_gpu_arch = result.decode('utf-8').strip()
         except subprocess.CalledProcessError:
             rocm_gpu_arch = ""
@@ -273,7 +273,7 @@ class OpBuilder(ABC):
             rocm_info) + " | grep -Eo -m1 'Wavefront Size:[[:space:]]+[0-9]+' | grep -Eo '[0-9]+'"
         try:
             safe_cmd = shlex.split(rocm_wavefront_size_cmd)
-            result = subprocess.check_output(rocm_wavefront_size_cmd)
+            result = subprocess.check_output(rocm_wavefront_size_cmd, shell=True)
             rocm_wavefront_size = result.decode('utf-8').strip()
         except subprocess.CalledProcessError:
             rocm_wavefront_size = "32"

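Note that in the checked-in code, get_rocm_wavefront_size() computes safe_cmd but never uses it: the raw command string is passed to subprocess.check_output() without shell=True, which is why the FileNotFoundError above reports the entire pipeline as the missing file.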
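
For anyone who wants to see the behavior in isolation, here is a minimal sketch. It is not the DeepSpeed code: `echo` is a hypothetical stand-in for rocminfo, and it assumes a POSIX system with `echo` and `grep` on the PATH.

```python
import shlex
import subprocess

# Hypothetical stand-in for the rocminfo pipeline (echo replaces rocminfo).
cmd = "echo 'Wavefront Size: 64' | grep -Eo '[0-9]+'"

# shlex.split() only tokenizes; the pipe becomes an ordinary argument.
tokens = shlex.split(cmd)

# Without a shell, every token after the first is passed to echo as an
# argument, so the "pipeline" is printed literally instead of executed.
literal = subprocess.check_output(tokens).decode("utf-8").strip()

# Passing the raw string without shell=True makes POSIX look for a program
# whose name is the whole string, which does not exist.
try:
    subprocess.check_output(cmd)
    raised_file_not_found = False
except FileNotFoundError:
    raised_file_not_found = True

# With shell=True the shell interprets the pipe and grep filters the output.
piped = subprocess.check_output(cmd, shell=True).decode("utf-8").strip()

print(tokens)
print(literal)
print(raised_file_not_found)
print(piped)
```

The third case corresponds to the wavefront-size failure in the traceback, and the first two show why passing safe_cmd alone would not have produced the intended result either.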

@tjruwase
Contributor

@jagadish-amd, thanks! This fix looks good, since no user input is involved in the commands.
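
If avoiding shell=True is ever preferred (for example, if user input did end up in these commands), one option is to run rocminfo on its own and do the filtering in Python. This is only a sketch under assumptions, not what the patch does: the helper names parse_wavefront_size, parse_gpu_arch and query_rocminfo are hypothetical, the default path is taken from the traceback above, and the fallback values ("32" and "") mirror the except branches in the diff.

```python
import re
import subprocess
from typing import Sequence


def parse_wavefront_size(rocminfo_output: str, default: str = "32") -> str:
    """Return the first 'Wavefront Size' value, or the default if absent."""
    match = re.search(r"Wavefront Size:[ \t]+([0-9]+)", rocminfo_output)
    return match.group(1) if match else default


def parse_gpu_arch(rocminfo_output: str) -> str:
    """Return the first 'gfx...' match up to end of line, or '' if absent."""
    match = re.search(r"gfx.*", rocminfo_output)
    return match.group(0).strip() if match else ""


def query_rocminfo(argv: Sequence[str] = ("/opt/rocm/bin/rocminfo",)) -> str:
    """Run rocminfo without a shell; return '' if it is missing or fails."""
    try:
        return subprocess.check_output(list(argv)).decode("utf-8")
    except (subprocess.CalledProcessError, FileNotFoundError):
        return ""


if __name__ == "__main__":
    output = query_rocminfo()
    print(parse_gpu_arch(output), parse_wavefront_size(output))
```

Because the command is a fixed argument list and the parsing happens in Python, no shell is involved and nothing needs quoting.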

@tjruwase
Contributor

@jagadish-amd, can you please share a PR with this fix?

jagadish-amd added a commit to jagadish-amd/DeepSpeed that referenced this issue Sep 27, 2024
Fixes microsoft#6585
Use shell=True for subprocess.check_output() in case of ROCm
commands. Do not use shlex.split() since command string has
wildcard expansion.

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
@jagadish-amd jagadish-amd linked a pull request Sep 27, 2024 that will close this issue

2 participants