Something get wrong when run “aio_” and "gds_" file(DeepNVMe) #6567

Open
niebowen666 opened this issue Sep 24, 2024 · 11 comments

Comments

@niebowen666

Describe the bug
I couldn't run the DeepNVMe demo properly.
It fails with:

collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
subprocess.run(
File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

It seems that something is wrong with ninja; the "build.ninja" file appears to be missing.
Has anybody else run into this?

ds_report output
[2024-09-24 17:35:01,773] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

async_io ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
fp_quantizer ........... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
gds .................... [NO] ....... [OKAY]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
inference_core_ops ..... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
cutlass_ops ............ [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
transformer_inference .. [NO] ....... [NO]
quantizer .............. [NO] ....... [OKAY]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
ragged_device_ops ...... [NO] ....... [NO]
[WARNING] NVIDIA Inference is only supported on Pascal and newer architectures
ragged_ops ............. [NO] ....... [NO]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
[WARNING] using untested triton version (3.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch']
torch version .................... 2.4.1+cu121
deepspeed install path ........... ['/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed']
deepspeed info ................... 0.15.1, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.4, cuda 12.1
shared memory (/dev/shm) size .... 125.75 GB

System info (please complete the following information):

  • OS: Ubuntu 22.04
  • Python 3.9.18

conda list

Name Version Build Channel

_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
_sysroot_linux-64_curr_repodata_hack 3 h69a702a_16 conda-forge
annotated-types 0.7.0 pypi_0 pypi
binutils_impl_linux-64 2.40 ha1999f0_7 conda-forge
binutils_linux-64 2.40 hb3c18ed_3 conda-forge
bzip2 1.0.8 h4bc722e_7 conda-forge
c-ares 1.19.1 h5eee18b_0 anaconda
ca-certificates 2024.8.30 hbcca054_0 conda-forge
cmake 3.26.4 h96355d8_0 anaconda
cuda 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-cccl 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-command-line-tools 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-compiler 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-cudart 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-cudart-dev 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-cudart-static 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-cuobjdump 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-cupti 12.1.62 0 nvidia/label/cuda-12.1.0
cuda-cupti-static 12.1.62 0 nvidia/label/cuda-12.1.0
cuda-cuxxfilt 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-demo-suite 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-documentation 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-driver-dev 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-gdb 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-libraries 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-libraries-dev 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-libraries-static 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-nsight 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nsight-compute 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-nvcc 12.1.105 0 nvidia
cuda-nvdisasm 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvml-dev 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvprof 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvprune 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvrtc 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvrtc-dev 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvrtc-static 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-nvtx 12.1.66 0 nvidia/label/cuda-12.1.0
cuda-nvvp 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-opencl 12.1.56 0 nvidia/label/cuda-12.1.0
cuda-opencl-dev 12.1.56 0 nvidia/label/cuda-12.1.0
cuda-profiler-api 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-runtime 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-sanitizer-api 12.1.55 0 nvidia/label/cuda-12.1.0
cuda-toolkit 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-tools 12.1.0 0 nvidia/label/cuda-12.1.0
cuda-visual-tools 12.1.0 0 nvidia/label/cuda-12.1.0
deepspeed 0.15.1 pypi_0 pypi
expat 2.6.3 h6a678d5_0 anaconda
filelock 3.16.1 pypi_0 pypi
fsspec 2024.9.0 pypi_0 pypi
gcc 14.1.0 h6f9ffa1_1 conda-forge
gcc_impl_linux-64 14.1.0 h3c94d91_1 conda-forge
gcc_linux-64 14.1.0 h3f71edc_3 conda-forge
gds-tools 1.6.0.25 0 nvidia/label/cuda-12.1.0
gxx 14.1.0 h6f9ffa1_1 conda-forge
gxx_impl_linux-64 14.1.0 h8d00ecb_1 conda-forge
gxx_linux-64 14.1.0 hc55ae77_3 conda-forge
hjson 3.1.0 pypi_0 pypi
jinja2 3.1.4 pypi_0 pypi
kernel-headers_linux-64 3.10.0 h4a8ded7_16 conda-forge
krb5 1.20.1 h143b758_1 anaconda
ld_impl_linux-64 2.40 hf3520f5_7 conda-forge
libcublas 12.1.0.26 0 nvidia/label/cuda-12.1.0
libcublas-dev 12.1.0.26 0 nvidia/label/cuda-12.1.0
libcublas-static 12.1.0.26 0 nvidia/label/cuda-12.1.0
libcufft 11.0.2.4 0 nvidia/label/cuda-12.1.0
libcufft-dev 11.0.2.4 0 nvidia/label/cuda-12.1.0
libcufft-static 11.0.2.4 0 nvidia/label/cuda-12.1.0
libcufile 1.6.0.25 0 nvidia/label/cuda-12.1.0
libcufile-dev 1.6.0.25 0 nvidia/label/cuda-12.1.0
libcufile-static 1.6.0.25 0 nvidia/label/cuda-12.1.0
libcurand 10.3.2.56 0 nvidia/label/cuda-12.1.0
libcurand-dev 10.3.2.56 0 nvidia/label/cuda-12.1.0
libcurand-static 10.3.2.56 0 nvidia/label/cuda-12.1.0
libcurl 7.88.1 h251f7ec_2 anaconda
libcusolver 11.4.4.55 0 nvidia/label/cuda-12.1.0
libcusolver-dev 11.4.4.55 0 nvidia/label/cuda-12.1.0
libcusolver-static 11.4.4.55 0 nvidia/label/cuda-12.1.0
libcusparse 12.0.2.55 0 nvidia/label/cuda-12.1.0
libcusparse-dev 12.0.2.55 0 nvidia/label/cuda-12.1.0
libcusparse-static 12.0.2.55 0 nvidia/label/cuda-12.1.0
libedit 3.1.20230828 h5eee18b_0 anaconda
libev 4.33 h7f8727e_1 anaconda
libffi 3.4.2 h7f98852_5 conda-forge
libgcc 14.1.0 h77fa898_1 conda-forge
libgcc-devel_linux-64 14.1.0 h5d3d1c9_101 conda-forge
libgcc-ng 14.1.0 h69a702a_1 conda-forge
libgomp 14.1.0 h77fa898_1 conda-forge
libnghttp2 1.57.0 h2d74bed_0 anaconda
libnpp 12.0.2.50 0 nvidia/label/cuda-12.1.0
libnpp-dev 12.0.2.50 0 nvidia/label/cuda-12.1.0
libnpp-static 12.0.2.50 0 nvidia/label/cuda-12.1.0
libnsl 2.0.1 hd590300_0 conda-forge
libnvjitlink 12.1.55 0 nvidia/label/cuda-12.1.0
libnvjitlink-dev 12.1.55 0 nvidia/label/cuda-12.1.0
libnvjpeg 12.1.0.39 0 nvidia/label/cuda-12.1.0
libnvjpeg-dev 12.1.0.39 0 nvidia/label/cuda-12.1.0
libnvjpeg-static 12.1.0.39 0 nvidia/label/cuda-12.1.0
libnvvm-samples 12.1.55 0 nvidia/label/cuda-12.1.0
libsanitizer 14.1.0 hcba0ae0_1 conda-forge
libsqlite 3.45.2 h2797004_0 conda-forge
libssh2 1.11.0 h251f7ec_0 anaconda
libstdcxx 14.1.0 hc0a3c3a_1 conda-forge
libstdcxx-devel_linux-64 14.1.0 h5d3d1c9_101 conda-forge
libstdcxx-ng 14.1.0 h4852527_1 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libuv 1.48.0 h5eee18b_0 anaconda
libxcrypt 4.4.36 hd590300_1 conda-forge
libzlib 1.2.13 h4ab18f5_6 conda-forge
lz4-c 1.9.4 h6a678d5_1 anaconda
markupsafe 2.1.5 pypi_0 pypi
mpmath 1.3.0 pypi_0 pypi
ncurses 6.4 h6a678d5_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
networkx 3.2.1 pypi_0 pypi
ninja 1.11.1.1 pypi_0 pypi
nsight-compute 2023.1.0.15 0 nvidia/label/cuda-12.1.0
numpy 2.0.2 pypi_0 pypi
nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
nvidia-ml-py 12.560.30 pypi_0 pypi
nvidia-nccl-cu12 2.20.5 pypi_0 pypi
nvidia-nvjitlink-cu12 12.6.68 pypi_0 pypi
nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
openssl 3.3.2 hb9d3cd8_0 conda-forge
packaging 24.1 pypi_0 pypi
pillow 10.4.0 pypi_0 pypi
pip 24.2 py39h06a4308_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
psutil 6.0.0 pypi_0 pypi
py-cpuinfo 9.0.0 pypi_0 pypi
pydantic 2.9.2 pypi_0 pypi
pydantic-core 2.23.4 pypi_0 pypi
python 3.9.18 h0755675_1_cpython conda-forge
readline 8.2 h5eee18b_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
rhash 1.4.3 hdbd6064_0 anaconda
setuptools 75.1.0 py39h06a4308_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
sqlite 3.45.2 h2c6b66d_0 conda-forge
sympy 1.13.3 pypi_0 pypi
sysroot_linux-64 2.17 h4a8ded7_16 conda-forge
tk 8.6.13 noxft_h4845f30_101 conda-forge
torch 2.4.1 pypi_0 pypi
torchaudio 2.4.1 pypi_0 pypi
torchvision 0.19.1 pypi_0 pypi
tqdm 4.66.5 pypi_0 pypi
triton 3.0.0 pypi_0 pypi
typing-extensions 4.12.2 pypi_0 pypi
tzdata 2024a h04d1e81_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
wheel 0.44.0 py39h06a4308_0 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
xz 5.4.6 h5eee18b_1 http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
zlib 1.2.13 h4ab18f5_6 conda-forge
zstd 1.5.5 hc292b87_2 anaconda

@niebowen666 niebowen666 added the bug and training labels Sep 24, 2024
@jomayeri
Contributor

What command did you run?

@jomayeri jomayeri self-assigned this Sep 24, 2024
@niebowen666
Author

The commands are "python aio_store_cpu_tensor.py --nvme_folder tensor/" and "python gds_store_gpu_tensor.py --nvme_folder tensor/"

@niebowen666
Author

@jomayeri

@niebowen666
Author

niebowen666 commented Sep 25, 2024

The error information:

Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/1] /root/anaconda3/envs/deepspeed/bin/x86_64-conda-linux-gnu-c++ deepspeed_py_io_handle.o deepspeed_py_aio.o deepspeed_py_aio_handle.o deepspeed_aio_thread.o deepspeed_aio_utils.o deepspeed_aio_common.o deepspeed_aio_types.o deepspeed_cpu_op.o deepspeed_aio_op_desc.o deepspeed_py_copy.o deepspeed_pin_tensor.o py_ds_aio.o -shared -L/root/anaconda3/envs/deepspeed -L/root/anaconda3/envs/deepspeed/lib64 -laio -lcuda -lcudart -L/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o async_io.so
FAILED: async_io.so
/root/anaconda3/envs/deepspeed/bin/x86_64-conda-linux-gnu-c++ deepspeed_py_io_handle.o deepspeed_py_aio.o deepspeed_py_aio_handle.o deepspeed_aio_thread.o deepspeed_aio_utils.o deepspeed_aio_common.o deepspeed_aio_types.o deepspeed_cpu_op.o deepspeed_aio_op_desc.o deepspeed_py_copy.o deepspeed_pin_tensor.o py_ds_aio.o -shared -L/root/anaconda3/envs/deepspeed -L/root/anaconda3/envs/deepspeed/lib64 -laio -lcuda -lcudart -L/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o async_io.so
/root/anaconda3/envs/deepspeed/bin/../lib/gcc/x86_64-conda-linux-gnu/14.1.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
subprocess.run(
File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 40, in
main()
File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 18, in main
aio_handle = AsyncIOBuilder().load().aio_handle()
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
return self.jit_load(verbose)
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
op_module = load(name=self.name,
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1312, in load
return _jit_compile(
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
_write_ninja_file_and_build_library(
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1834, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'async_io'
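
For readers skimming the log above: the final CalledProcessError and RuntimeError only report that ninja failed, while the actual cause is usually a compiler or linker diagnostic a few lines earlier. The following is a minimal sketch of pulling those lines out of a build log. The function name and the list of marker strings are assumptions made for illustration, not part of DeepSpeed or PyTorch.

```python
# Hypothetical helper: pick out the lines of a build log that look like
# compiler or linker diagnostics. The marker list is an assumption and
# can be extended for other toolchains.
DIAGNOSTIC_MARKERS = (
    "cannot find -l",       # ld could not resolve a -l<name> library
    "undefined reference",  # ld found the library but not the symbol
    ": error:",             # gcc/g++/collect2 style errors
    "fatal error:",         # e.g. a missing header
)


def root_cause_lines(build_log: str) -> list:
    """Return the log lines that contain one of the diagnostic markers."""
    return [
        line.strip()
        for line in build_log.splitlines()
        if any(marker in line for marker in DIAGNOSTIC_MARKERS)
    ]


if __name__ == "__main__":
    # Shortened excerpt of the log posted in this thread.
    sample = """\
FAILED: async_io.so
x86_64-conda-linux-gnu/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
RuntimeError: Error building extension 'async_io'
"""
    for line in root_cause_lines(sample):
        print(line)
```

On the excerpt above this keeps only the "cannot find -lcuda" line and the collect2 line, which is where the diagnosis in the following reply starts.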

@jomayeri
Contributor

Based on the output, it looks like the build can't link against the CUDA library: "cannot find -lcuda: No such file or directory". Your ds_report shows CUDA installed, so you might try setting the CUDA_HOME environment variable to point to the location of the CUDA install and rebuilding.
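
One way to check this by hand is to see whether any of the -L directories on the failing link line actually contains the libraries requested with -l. The sketch below does that for a link command copied from the log. Both function names are hypothetical, and it only looks in the explicit -L directories; the real linker also searches its default system paths, so a "not found" here is a hint rather than proof.

```python
import shlex
from pathlib import Path


def parse_link_command(command: str):
    """Split a linker command line into its -L search dirs and -l library names."""
    search_dirs, libraries = [], []
    for token in shlex.split(command):
        if token.startswith("-L") and len(token) > 2:
            search_dirs.append(token[2:])
        elif token.startswith("-l") and len(token) > 2:
            libraries.append(token[2:])
    return search_dirs, libraries


def locate_library(name: str, search_dirs):
    """Return every lib<name>.so or lib<name>.a found in the given directories."""
    hits = []
    for directory in search_dirs:
        for suffix in (".so", ".a"):
            candidate = Path(directory) / f"lib{name}{suffix}"
            if candidate.exists():
                hits.append(str(candidate))
    return hits


if __name__ == "__main__":
    # Abbreviated version of the failing command from the log above
    # (object files and the torch libraries are left out).
    command = (
        "c++ py_ds_aio.o -shared "
        "-L/root/anaconda3/envs/deepspeed -L/root/anaconda3/envs/deepspeed/lib64 "
        "-laio -lcuda -lcudart -o async_io.so"
    )
    dirs, libs = parse_link_command(command)
    for lib in libs:
        print(lib, locate_library(lib, dirs) or "not found in the -L directories")
```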

@niebowen666
Author

niebowen666 commented Sep 26, 2024

@jomayeri
Since I installed DeepSpeed with Anaconda, I set the CUDA_HOME environment variable to /root/anaconda3/envs/deepspeed/, which is the conda virtual environment named deepspeed:

echo $CUDA_HOME
/root/anaconda3/envs/deepspeed/

Is that right?

Besides, I also set CUDNN_HOME to /root/anaconda3/envs/deepspeed/.
That didn't help either.

The error is:
FAILED: async_io.so
/root/anaconda3/envs/deepspeed/bin/x86_64-conda-linux-gnu-c++ deepspeed_py_io_handle.o deepspeed_py_aio.o deepspeed_py_aio_handle.o deepspeed_aio_thread.o deepspeed_aio_utils.o deepspeed_aio_common.o deepspeed_aio_types.o deepspeed_cpu_op.o deepspeed_aio_op_desc.o deepspeed_py_copy.o deepspeed_pin_tensor.o py_ds_aio.o -shared -L/root/anaconda3/envs/deepspeed/ -L/root/anaconda3/envs/deepspeed/lib64 -laio -lcuda -lcudart -L/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/lib -lc10 -ltorch_cpu -ltorch -ltorch_python -o async_io.so
/root/anaconda3/envs/deepspeed/bin/../lib/gcc/x86_64-conda-linux-gnu/14.1.0/../../../../x86_64-conda-linux-gnu/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2105, in _run_ninja_build
subprocess.run(
File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 528, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 40, in
main()
File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 18, in main
aio_handle = AsyncIOBuilder().load().aio_handle()
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 531, in load
return self.jit_load(verbose)
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 578, in jit_load
op_module = load(name=self.name,
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1312, in load
return _jit_compile(
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1722, in _jit_compile
_write_ninja_file_and_build_library(
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 1834, in _write_ninja_file_and_build_library
_run_ninja_build(
File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/torch/utils/cpp_extension.py", line 2121, in _run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'async_io'

@jomayeri
Contributor

No, CUDA is not installed as part of DeepSpeed. It is typically installed under /usr/local; you can run whereis cuda to find it. Do commands like nvidia-smi work on the system?
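
As a rough complement to whereis, a small script can test which candidate directories actually contain an executable bin/nvcc, since that is the file a CUDA_HOME-based build looks for. This is only a sketch: the function name and the list of candidate locations are assumptions based on common Linux layouts, not something DeepSpeed prescribes.

```python
import glob
import os
from pathlib import Path


def find_cuda_homes(candidates):
    """Return the candidate directories that contain an executable bin/nvcc."""
    homes = []
    for root in candidates:
        nvcc = Path(root) / "bin" / "nvcc"
        if nvcc.is_file() and os.access(nvcc, os.X_OK):
            homes.append(str(Path(root)))
    return homes


if __name__ == "__main__":
    # Assumed candidate locations; adjust for the machine at hand.
    candidates = [
        "/usr/local/cuda",
        *sorted(glob.glob("/usr/local/cuda-*")),
        "/usr/lib/cuda",
    ]
    env_home = os.environ.get("CUDA_HOME")
    if env_home:
        candidates.insert(0, env_home)
    print(find_cuda_homes(candidates) or "no directory with bin/nvcc found")
```

Any directory this prints is a reasonable value to try for CUDA_HOME; an empty result suggests the toolkit is missing or incomplete.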

@niebowen666
Author

niebowen666 commented Sep 27, 2024

@jomayeri Yes, whereis cuda found CUDA in /usr/lib:

whereis cuda
cuda: /usr/lib/cuda

Then I set CUDA_HOME and CUDNN_HOME to /usr/lib/cuda:

export CUDA_HOME=/usr/lib/cuda
export CUDNN_HOME=/usr/lib/cuda
source ~/.bashrc

echo $CUDA_HOME
/usr/lib/cuda
echo $CUDNN_HOME
/usr/lib/cuda

Finally, I ran the command python aio_store_cpu_tensor.py --nvme_folder tensor/
It gave me a new error:

[2024-09-27 08:46:29,890] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 3, in <module>
    from deepspeed.ops.op_builder import AsyncIOBuilder
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 25, in <module>
    from . import ops
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
    from ..git_version_info import compatible_ops as __compatible_ops__
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/git_version_info.py", line 29, in <module>
    op_compatible = builder.is_compatible()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 35, in is_compatible
    sys_cuda_major, _ = installed_cuda_version()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 53, in installed_cuda_version
    output = subprocess.check_output([cuda_home + "/bin/nvcc", "-V"], universal_newlines=True)
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/lib/cuda/bin/nvcc'

It seems there is no nvcc in the "cuda" directory?

ls /usr/lib/cuda
bin  include  lib64  nvvm  version.txt
ls /usr/lib/cuda/bin
there is nothing

I also found a CUDA directory in /usr/local:

ls /usr/local
bin  cuda-12.1  etc  games  include  kernelobjects  lib  man  mysql  pgsql  sbin  share  src  ssl

I then reset the CUDA_HOME and CUDNN_HOME environment variables to /usr/local/cuda-12.1 and ran python aio_store_cpu_tensor.py --nvme_folder tensor/
The same error occurred:

[2024-09-27 09:06:52,673] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/home/nbw/DeepNVMe_demo/aio_store_cpu_tensor.py", line 3, in <module>
    from deepspeed.ops.op_builder import AsyncIOBuilder
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 25, in <module>
    from . import ops
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
    from ..git_version_info import compatible_ops as __compatible_ops__
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/git_version_info.py", line 29, in <module>
    op_compatible = builder.is_compatible()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 35, in is_compatible
    sys_cuda_major, _ = installed_cuda_version()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 53, in installed_cuda_version
    output = subprocess.check_output([cuda_home + "/bin/nvcc", "-V"], universal_newlines=True)
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda-12.1//bin/nvcc'

It also affects ds_report:
ds_report

[2024-09-27 09:05:40,449] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Traceback (most recent call last):
  File "/root/anaconda3/envs/deepspeed/bin/ds_report", line 3, in <module>
    from deepspeed.env_report import cli_main
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/__init__.py", line 25, in <module>
    from . import ops
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/__init__.py", line 15, in <module>
    from ..git_version_info import compatible_ops as __compatible_ops__
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/git_version_info.py", line 29, in <module>
    op_compatible = builder.is_compatible()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/fp_quantizer.py", line 35, in is_compatible
    sys_cuda_major, _ = installed_cuda_version()
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/site-packages/deepspeed/ops/op_builder/builder.py", line 53, in installed_cuda_version
    output = subprocess.check_output([cuda_home + "/bin/nvcc", "-V"], universal_newlines=True)
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 424, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/root/anaconda3/envs/deepspeed/lib/python3.9/subprocess.py", line 1837, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/cuda-12.1//bin/nvcc'

My nvcc information is as follows:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
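
The tracebacks above show installed_cuda_version() running "<CUDA_HOME>/bin/nvcc -V", so the toolkit version comes from output like the block just shown. Below is a minimal sketch of extracting the release number from that text. The function name and the regular expression are assumptions for illustration and are not DeepSpeed's actual parsing code.

```python
import re

# nvcc -V output as posted in this thread.
NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Mon_Apr__3_17:16:06_PDT_2023
Cuda compilation tools, release 12.1, V12.1.105
Build cuda_12.1.r12.1/compiler.32688072_0
"""


def parse_nvcc_release(nvcc_output: str):
    """Return (major, minor) from the 'release X.Y' field of nvcc -V output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("no 'release X.Y' field found in nvcc output")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    print(parse_nvcc_release(NVCC_OUTPUT))  # prints (12, 1)
```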

nvidia-smi

Fri Sep 27 09:09:58 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.28.03              Driver Version: 560.28.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce GTX TITAN X     Off |   00000000:86:00.0 Off |                  N/A |
| 18%   47C    P0             44W /  250W |       0MiB /  12288MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

@niebowen666
Author

So CUDA is not installed properly, right?

 ls /usr/local/cuda-12.1/
nsight-compute-2023.1.0  nsight-systems-2023.1.2  nvvm  targets

@niebowen666
Author

ls /usr/lib/cuda/
bin  include  lib64  nvvm  version.txt
ls /usr/lib/cuda/bin
nothing
ls /usr/lib/cuda/include/
nothing
ls /usr/lib/cuda/lib64/
nothing
ls /usr/lib/cuda/nvvm/
libdevice

@jomayeri
Contributor

Yes, it looks like you should reinstall CUDA.
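
After reinstalling, a quick sanity check is to confirm that the directory used as CUDA_HOME contains the pieces a JIT build needs, since the listings above show bin, include and lib64 all empty. The sketch below reports which expected files are missing. The function name and the file list are assumptions based on the usual Linux toolkit layout and may need adjusting for other install methods.

```python
import os
from pathlib import Path

# Files a typical Linux CUDA toolkit install is assumed to provide.
EXPECTED_FILES = (
    "bin/nvcc",                # compiler driver that DeepSpeed invokes
    "include/cuda_runtime.h",  # runtime API header
    "lib64/libcudart.so",      # runtime library used at link time
)


def missing_toolkit_files(cuda_home, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under cuda_home."""
    root = Path(cuda_home)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # Falls back to /usr/local/cuda when CUDA_HOME is unset (an assumed default).
    home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
    missing = missing_toolkit_files(home)
    if missing:
        print(f"{home} looks incomplete, missing: {missing}")
    else:
        print(f"{home} contains the expected toolkit files")
```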
