Conda package build returning internal cuFile error. #508
Output from
active environment : img_cuda12.2-kvikio
active env location : /home/fstrug/.conda/envs/img_cuda12.2-kvikio
shell level : 2
user config file : /home/fstrug/.condarc
populated config files : /opt/conda/.condarc
conda version : 24.7.1
conda-build version : not installed
python version : 3.10.15.final.0
solver : libmamba (default)
virtual packages : __archspec=1=zen3
__conda=24.7.1=0
__cuda=12.2=0
__glibc=2.34=0
__linux=6.3.12=0
__unix=0=0
base environment : /opt/conda (read only)
conda av data dir : /opt/conda/etc/conda
conda av metadata url : None
channel URLs : https://conda.anaconda.org/conda-forge/linux-64
https://conda.anaconda.org/conda-forge/noarch
package cache : /opt/conda/pkgs
/home/fstrug/.conda/pkgs
envs directories : /home/fstrug/.conda/envs
/opt/conda/envs
platform : linux-64
user-agent : conda/24.7.1 requests/2.32.3 CPython/3.10.15 Linux/6.3.12-200.fc38.x86_64 almalinux/9.4 glibc/2.34 solver/libmamba conda-libmamba-solver/24.7.0 libmambapy/1.5.9
UID:GID : 57561:5063
netrc file : None
offline mode : False
# Name Version Build Channel
_libgcc_mutex 0.1 conda_forge conda-forge
_openmp_mutex 4.5 2_gnu conda-forge
annotated-types 0.7.0 pyhd8ed1ab_0 conda-forge
asciitree 0.3.3 py_2 conda-forge
asttokens 2.4.1 pyhd8ed1ab_0 conda-forge
attr 2.5.1 h166bdaf_1 conda-forge
awkward 2.6.6 pypi_0 pypi
awkward-cpp 35 pypi_0 pypi
aws-c-auth 0.7.22 h96bc93b_2 conda-forge
aws-c-cal 0.6.14 h88a6e22_1 conda-forge
aws-c-common 0.9.19 h4ab18f5_0 conda-forge
aws-c-compression 0.2.18 h83b837d_6 conda-forge
aws-c-event-stream 0.4.2 ha47c788_12 conda-forge
aws-c-http 0.8.1 h29d6fba_17 conda-forge
aws-c-io 0.14.8 h21d4f22_5 conda-forge
aws-c-mqtt 0.10.4 h759edc4_4 conda-forge
aws-c-s3 0.5.9 h594631b_3 conda-forge
aws-c-sdkutils 0.1.16 h83b837d_2 conda-forge
aws-checksums 0.1.18 h83b837d_6 conda-forge
aws-crt-cpp 0.26.9 he3a8b3b_0 conda-forge
aws-sdk-cpp 1.11.329 hba8bd5f_3 conda-forge
binutils_impl_linux-64 2.40 ha1999f0_2 conda-forge
bokeh 3.4.1 pyhd8ed1ab_0 conda-forge
boost-histogram 1.4.1 py311h9547e67_0 conda-forge
brotli 1.1.0 hd590300_1 conda-forge
brotli-bin 1.1.0 hd590300_1 conda-forge
brotli-python 1.1.0 py311hb755f60_1 conda-forge
bzip2 1.0.8 hd590300_5 conda-forge
c-ares 1.28.1 hd590300_0 conda-forge
ca-certificates 2024.8.30 hbcca054_0 conda-forge
cachetools 5.3.3 pyhd8ed1ab_0 conda-forge
certifi 2024.8.30 pyhd8ed1ab_0 conda-forge
cffi 1.16.0 py311hb3a22ac_0 conda-forge
click 8.1.7 unix_pyh707e725_0 conda-forge
click-default-group 1.2.4 pyhd8ed1ab_0 conda-forge
cloudpickle 3.0.0 pyhd8ed1ab_0 conda-forge
coffea 2024.3.0 pyhd8ed1ab_0 conda-forge
colorama 0.4.6 pyhd8ed1ab_0 conda-forge
comm 0.2.2 pyhd8ed1ab_0 conda-forge
contourpy 1.2.1 py311h9547e67_0 conda-forge
correctionlib 2.5.0 py311h9e0f504_1 conda-forge
cramjam 2.8.3 py311h46250e7_0 conda-forge
cuda-cccl_linux-64 12.2.140 ha770c72_0 conda-forge
cuda-crt-dev_linux-64 12.2.140 ha770c72_1 conda-forge
cuda-crt-tools 12.2.140 ha770c72_1 conda-forge
cuda-cudart 12.2.140 hd3aeb46_0 conda-forge
cuda-cudart-dev 12.2.140 hd3aeb46_0 conda-forge
cuda-cudart-dev_linux-64 12.2.140 h59595ed_0 conda-forge
cuda-cudart-static 12.2.140 hd3aeb46_0 conda-forge
cuda-cudart-static_linux-64 12.2.140 h59595ed_0 conda-forge
cuda-cudart_linux-64 12.2.140 h59595ed_0 conda-forge
cuda-libraries 12.5.0 0 nvidia
cuda-nsight-compute 12.2.2 0 nvidia/label/cuda-12.2.2
cuda-nvcc 12.4.131 0 nvidia
cuda-nvcc-dev_linux-64 12.2.140 ha770c72_1 conda-forge
cuda-nvcc-impl 12.2.140 hd3aeb46_1 conda-forge
cuda-nvcc-tools 12.2.140 hd3aeb46_1 conda-forge
cuda-nvprof 12.4.127 0 nvidia
cuda-nvrtc 12.2.140 hd3aeb46_0 conda-forge
cuda-nvvm-dev_linux-64 12.2.140 ha770c72_1 conda-forge
cuda-nvvm-impl 12.2.140 h59595ed_1 conda-forge
cuda-nvvm-tools 12.2.140 h59595ed_1 conda-forge
cuda-opencl 12.4.127 0 nvidia
cuda-python 12.5.0 py311h817de4b_0 conda-forge
cuda-version 12.2 he2b69de_3 conda-forge
cudf 24.06.00 cuda12_py311_240605_g7c706cc400_0 rapidsai
cupy 13.1.0 py311hf829483_4 conda-forge
cupy-core 13.1.0 py311he1e6e68_4 conda-forge
cycler 0.12.1 pyhd8ed1ab_0 conda-forge
cytoolz 0.12.3 py311h459d7ec_0 conda-forge
dask 2024.5.2 pyhd8ed1ab_0 conda-forge
dask-awkward 2024.3.0 pyhd8ed1ab_0 conda-forge
dask-core 2024.5.2 pyhd8ed1ab_0 conda-forge
dask-expr 1.1.2 pyhd8ed1ab_0 conda-forge
dask-histogram 2024.3.0 pyhd8ed1ab_0 conda-forge
debugpy 1.8.1 py311hb755f60_0 conda-forge
decorator 5.1.1 pyhd8ed1ab_0 conda-forge
distributed 2024.5.2 pyhd8ed1ab_0 conda-forge
dlpack 0.8 h59595ed_3 conda-forge
entrypoints 0.4 pyhd8ed1ab_0 conda-forge
exceptiongroup 1.2.0 pyhd8ed1ab_2 conda-forge
executing 2.0.1 pyhd8ed1ab_0 conda-forge
fasteners 0.17.3 pyhd8ed1ab_0 conda-forge
fastparquet 2024.5.0 py311h18e1886_0 conda-forge
fastrlock 0.8.2 py311hb755f60_2 conda-forge
fmt 10.2.1 h00ab1b0_0 conda-forge
fonttools 4.53.0 py311h331c9d8_0 conda-forge
freetype 2.12.1 h267a509_2 conda-forge
fsspec 2024.6.0 pyhff2d567_0 conda-forge
gcc 12.4.0 h236703b_1 conda-forge
gcc_impl_linux-64 12.4.0 hb2e57f8_1 conda-forge
gettext 0.22.5 h59595ed_2 conda-forge
gettext-tools 0.22.5 h59595ed_2 conda-forge
gflags 2.2.2 he1b5a44_1004 conda-forge
glog 0.7.0 hed5481d_0 conda-forge
hepconvert 1.3.4 pyhd8ed1ab_0 conda-forge
hist 2.7.3 ha770c72_0 conda-forge
hist-base 2.7.3 pyhd8ed1ab_0 conda-forge
histoprint 2.4.0 pyhd8ed1ab_0 conda-forge
iminuit 2.25.2 py311hb755f60_0 conda-forge
importlib-metadata 7.1.0 pyha770c72_0 conda-forge
importlib_metadata 7.1.0 hd8ed1ab_0 conda-forge
ipykernel 6.29.3 pyhd33586a_0 conda-forge
ipython 8.25.0 pyh707e725_0 conda-forge
jedi 0.19.1 pyhd8ed1ab_0 conda-forge
jinja2 3.1.4 pyhd8ed1ab_0 conda-forge
jit 0.2.7 pypi_0 pypi
jupyter_client 8.6.2 pyhd8ed1ab_0 conda-forge
jupyter_core 5.7.2 py311h38be061_0 conda-forge
kernel-headers_linux-64 3.10.0 he073ed8_17 conda-forge
keyutils 1.6.1 h166bdaf_0 conda-forge
kiwisolver 1.4.5 py311h9547e67_1 conda-forge
krb5 1.21.2 h659d440_0 conda-forge
kvikio 24.06.00 cuda12_py311_240605_gd3f15ec_0 rapidsai
lcms2 2.16 hb7c19ff_0 conda-forge
ld_impl_linux-64 2.40 hf3520f5_2 conda-forge
lerc 4.0.0 h27087fc_0 conda-forge
libabseil 20240116.2 cxx17_h59595ed_0 conda-forge
libarrow 16.1.0 hcb6531f_6_cpu conda-forge
libarrow-acero 16.1.0 hac33072_6_cpu conda-forge
libarrow-dataset 16.1.0 hac33072_6_cpu conda-forge
libarrow-substrait 16.1.0 h7e0c224_6_cpu conda-forge
libasprintf 0.22.5 h661eb56_2 conda-forge
libasprintf-devel 0.22.5 h661eb56_2 conda-forge
libblas 3.9.0 22_linux64_openblas conda-forge
libbrotlicommon 1.1.0 hd590300_1 conda-forge
libbrotlidec 1.1.0 hd590300_1 conda-forge
libbrotlienc 1.1.0 hd590300_1 conda-forge
libcap 2.69 h0f662aa_0 conda-forge
libcblas 3.9.0 22_linux64_openblas conda-forge
libcrc32c 1.1.2 h9c3ff4c_0 conda-forge
libcublas 12.2.5.6 hd3aeb46_0 conda-forge
libcudf 24.06.00 cuda12_240605_g7c706cc400_0 rapidsai
libcufft 11.0.8.103 hd3aeb46_0 conda-forge
libcufile 1.7.2.10 hd3aeb46_0 conda-forge
libcufile-dev 1.7.2.10 hd3aeb46_0 conda-forge
libcurand 10.3.3.141 hd3aeb46_0 conda-forge
libcurl 8.8.0 hca28451_0 conda-forge
libcusolver 11.5.2.141 hd3aeb46_0 conda-forge
libcusparse 12.1.2.141 hd3aeb46_0 conda-forge
libdeflate 1.20 hd590300_0 conda-forge
libdrm 2.4.120 hd590300_0 conda-forge
libedit 3.1.20191231 he28a2e2_2 conda-forge
libev 4.33 hd590300_2 conda-forge
libevent 2.1.12 hf998b51_1 conda-forge
libexpat 2.6.2 h59595ed_0 conda-forge
libffi 3.4.2 h7f98852_5 conda-forge
libgcc 14.1.0 h77fa898_1 conda-forge
libgcc-devel_linux-64 12.4.0 ha4f9413_101 conda-forge
libgcc-ng 14.1.0 h69a702a_1 conda-forge
libgcrypt 1.10.3 hd590300_0 conda-forge
libgettextpo 0.22.5 h59595ed_2 conda-forge
libgettextpo-devel 0.22.5 h59595ed_2 conda-forge
libgfortran-ng 13.2.0 h69a702a_7 conda-forge
libgfortran5 13.2.0 hca663fb_7 conda-forge
libgomp 14.1.0 h77fa898_1 conda-forge
libgoogle-cloud 2.24.0 h2736e30_0 conda-forge
libgoogle-cloud-storage 2.24.0 h3d9a0c8_0 conda-forge
libgpg-error 1.49 h4f305b6_0 conda-forge
libgrpc 1.62.2 h15f2491_0 conda-forge
libjpeg-turbo 3.0.0 hd590300_1 conda-forge
libkvikio 24.06.00 cuda12_240605_gd3f15ec_0 rapidsai
liblapack 3.9.0 22_linux64_openblas conda-forge
libllvm14 14.0.6 hcd5def8_4 conda-forge
libnghttp2 1.58.0 h47da74e_1 conda-forge
libnpp 12.2.5.30 0 nvidia
libnsl 2.0.1 hd590300_0 conda-forge
libnvfatbin 12.4.127 0 nvidia
libnvjitlink 12.2.140 hd3aeb46_0 conda-forge
libnvjpeg 12.3.1.117 0 nvidia
libopenblas 0.3.27 pthreads_h413a1c8_0 conda-forge
libparquet 16.1.0 h6a7eafb_6_cpu conda-forge
libpciaccess 0.18 hd590300_0 conda-forge
libpng 1.6.43 h2797004_0 conda-forge
libprotobuf 4.25.3 h08a7969_0 conda-forge
libre2-11 2023.09.01 h5a48ba9_2 conda-forge
librmm 24.06.00 cuda12_240605_gd889275f_0 rapidsai
libsanitizer 12.4.0 h46f95d5_1 conda-forge
libsodium 1.0.18 h36c2ea0_1 conda-forge
libsqlite 3.45.3 h2797004_0 conda-forge
libssh2 1.11.0 h0841786_0 conda-forge
libstdcxx 14.1.0 hc0a3c3a_1 conda-forge
libstdcxx-ng 13.2.0 hc0a3c3a_7 conda-forge
libsystemd0 255 h3516f8a_1 conda-forge
libthrift 0.19.0 hb90f79a_1 conda-forge
libtiff 4.6.0 h1dd3fc0_3 conda-forge
libunwind 1.6.2 h9c3ff4c_0 conda-forge
libutf8proc 2.8.0 h166bdaf_0 conda-forge
libuuid 2.38.1 h0b41bf4_0 conda-forge
libwebp-base 1.4.0 hd590300_0 conda-forge
libxcb 1.15 h0b41bf4_0 conda-forge
libxcrypt 4.4.36 hd590300_1 conda-forge
libzlib 1.3.1 h4ab18f5_1 conda-forge
llvmlite 0.43.0 py311hbde99c3_0 conda-forge
locket 1.0.0 pyhd8ed1ab_0 conda-forge
lz4 4.3.3 py311h38e4bf4_0 conda-forge
lz4-c 1.9.4 hcb278e6_0 conda-forge
markdown-it-py 3.0.0 pyhd8ed1ab_0 conda-forge
markupsafe 2.1.5 py311h459d7ec_0 conda-forge
matplotlib-base 3.8.4 py311ha4ca890_2 conda-forge
matplotlib-inline 0.1.7 pyhd8ed1ab_0 conda-forge
mdurl 0.1.2 pyhd8ed1ab_0 conda-forge
mplhep 0.3.48 pyhd8ed1ab_0 conda-forge
mplhep_data 0.0.3 pyhd8ed1ab_0 conda-forge
msgpack-python 1.0.8 py311h52f7536_0 conda-forge
munkres 1.1.4 pyh9f0ad1d_0 conda-forge
ncurses 6.5 h59595ed_0 conda-forge
nest-asyncio 1.6.0 pyhd8ed1ab_0 conda-forge
nsight-compute 2023.2.2.3 0 nvidia/label/cuda-12.2.2
numba 0.60.0 py311h4bc866e_0 conda-forge
numcodecs 0.11.0 py311hcafe171_1 conda-forge
numpy 1.26.4 py311h64a7726_0 conda-forge
nvcomp 3.0.6 h10b603f_0 conda-forge
nvtop 3.1.0 hefaacde_0 conda-forge
nvtx 0.2.10 py311h459d7ec_0 conda-forge
openjpeg 2.5.2 h488ebb8_0 conda-forge
openssl 3.3.2 hb9d3cd8_0 conda-forge
orc 2.0.1 h17fec99_1 conda-forge
packaging 24.0 pyhd8ed1ab_0 conda-forge
pandas 2.2.2 py311h14de704_1 conda-forge
parso 0.8.4 pyhd8ed1ab_0 conda-forge
partd 1.4.2 pyhd8ed1ab_0 conda-forge
pexpect 4.9.0 pyhd8ed1ab_0 conda-forge
pickleshare 0.7.5 py_1003 conda-forge
pillow 10.3.0 py311h18e6fac_0 conda-forge
pip 24.0 pyhd8ed1ab_0 conda-forge
platformdirs 4.2.2 pyhd8ed1ab_0 conda-forge
prompt-toolkit 3.0.46 pyha770c72_0 conda-forge
psutil 5.9.8 py311h459d7ec_0 conda-forge
pthread-stubs 0.4 h36c2ea0_1001 conda-forge
ptyprocess 0.7.0 pyhd3deb0d_0 conda-forge
pure_eval 0.2.2 pyhd8ed1ab_0 conda-forge
py-spy 0.3.14 h87a5ac0_0 conda-forge
pyarrow 16.1.0 py311h781c19f_1 conda-forge
pyarrow-core 16.1.0 py311h8e2c35d_1_cpu conda-forge
pyarrow-hotfix 0.6 pyhd8ed1ab_0 conda-forge
pycparser 2.22 pyhd8ed1ab_0 conda-forge
pydantic 2.7.3 pyhd8ed1ab_0 conda-forge
pydantic-core 2.18.4 py311h5ecf98a_0 conda-forge
pygments 2.18.0 pyhd8ed1ab_0 conda-forge
pynvjitlink 0.2.3 py311hdaa3023_0 rapidsai
pyparsing 3.1.2 pyhd8ed1ab_0 conda-forge
pysocks 1.7.1 pyha2e5f31_6 conda-forge
python 3.11.9 hb806964_0_cpython conda-forge
python-dateutil 2.9.0 pyhd8ed1ab_0 conda-forge
python-tzdata 2024.1 pyhd8ed1ab_0 conda-forge
python-xxhash 3.4.1 py311h459d7ec_0 conda-forge
python_abi 3.11 4_cp311 conda-forge
pytz 2024.1 pyhd8ed1ab_0 conda-forge
pyyaml 6.0.1 py311h459d7ec_1 conda-forge
pyzmq 26.0.3 py311h08a0b41_0 conda-forge
re2 2023.09.01 h7f4b329_2 conda-forge
readline 8.2 h8228510_1 conda-forge
rich 13.7.1 pyhd8ed1ab_0 conda-forge
rmm 24.06.00 cuda12_py311_240605_gd889275f_0 rapidsai
s2n 1.4.15 he19d79f_0 conda-forge
scipy 1.13.1 py311h517d4fd_0 conda-forge
setuptools 70.0.0 pyhd8ed1ab_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
snappy 1.2.0 hdb0a2a9_1 conda-forge
sortedcontainers 2.4.0 pyhd8ed1ab_0 conda-forge
spdlog 1.12.0 hd2e6256_2 conda-forge
stack_data 0.6.2 pyhd8ed1ab_0 conda-forge
sysroot_linux-64 2.17 h4a8ded7_17 conda-forge
tblib 3.0.0 pyhd8ed1ab_0 conda-forge
tk 8.6.13 noxft_h4845f30_101 conda-forge
toml 0.10.2 pyhd8ed1ab_0 conda-forge
toolz 0.12.1 pyhd8ed1ab_0 conda-forge
tornado 6.4 py311h459d7ec_0 conda-forge
tqdm 4.66.4 pyhd8ed1ab_0 conda-forge
traitlets 5.14.3 pyhd8ed1ab_0 conda-forge
typing-extensions 4.12.1 hd8ed1ab_0 conda-forge
typing_extensions 4.12.1 pyha770c72_0 conda-forge
tzdata 2024a h0c530f3_0 conda-forge
uhi 0.4.0 pyhd8ed1ab_0 conda-forge
uproot 5.3.7 ha770c72_0 conda-forge
uproot-base 5.3.7 pyhd8ed1ab_0 conda-forge
urllib3 2.2.1 pyhd8ed1ab_0 conda-forge
wcwidth 0.2.13 pyhd8ed1ab_0 conda-forge
wheel 0.43.0 pyhd8ed1ab_1 conda-forge
xorg-libxau 1.0.11 hd590300_0 conda-forge
xorg-libxdmcp 1.1.3 h7f98852_0 conda-forge
xxhash 0.8.2 hd590300_0 conda-forge
xyzservices 2024.4.0 pyhd8ed1ab_0 conda-forge
xz 5.2.6 h166bdaf_0 conda-forge
yaml 0.2.5 h7f98852_2 conda-forge
zarr 2.18.2 pyhd8ed1ab_0 conda-forge
zeromq 4.3.5 h75354e8_4 conda-forge
zict 3.0.0 pyhd8ed1ab_0 conda-forge
zipp 3.17.0 pyhd8ed1ab_0 conda-forge
zstandard 0.22.0 py311hb6f056b_1 conda-forge
zstd 1.5.6 ha6fb4c9_0 conda-forge
Originally posted by @madsbk in #378
Some system information:
OS: AlmaLinux 9 version=6.3.12-200.fc38.x86_64
CPU: AMD EPYC 7543 32-Core Processor
Architecture: x86_64
GPU: NVIDIA A100 80GB PCIe
Driver Version: 535.129.03
CUDA Version: 12.2
Kvikio environment created with
Hmm, running on a
Maybe it is something in
{
// NOTE : Application can override custom configuration via export CUFILE_ENV_PATH_JSON=<filepath>
// e.g : export CUFILE_ENV_PATH_JSON="/home/<xxx>/cufile.json"
"logging": {
// log directory, if not enabled will create log file under current working directory
//"dir": "/home/<xxxx>",
// NOTICE|ERROR|WARN|INFO|DEBUG|TRACE (in decreasing order of severity)
"level": "ERROR"
},
"profile": {
// nvtx profiling on/off
"nvtx": false,
// cufile stats level(0-3)
"cufile_stats": 0
},
"execution" : {
// max number of workitems in the queue;
"max_io_queue_depth": 128,
// max number of host threads per gpu to spawn for parallel IO
"max_io_threads" : 4,
// enable support for parallel IO
"parallel_io" : true,
// minimum IO threshold before splitting the IO
"min_io_threshold_size_kb" : 8192,
// maximum parallelism for a single request
"max_request_parallelism" : 4
},
"properties": {
// max IO chunk size (parameter should be multiples of 64K) used by cuFileRead/Write internally per IO request
"max_direct_io_size_kb" : 16384,
// device memory size (parameter should be 4K aligned) for reserving bounce buffers for the entire GPU
"max_device_cache_size_kb" : 131072,
// limit on maximum device memory size (parameter should be 4K aligned) that can be pinned for a given process
"max_device_pinned_mem_size_kb" : 33554432,
// true or false (true will enable asynchronous io submission to nvidia-fs driver)
// Note : currently the overall IO will still be synchronous
"use_poll_mode" : false,
// maximum IO request size (parameter should be 4K aligned) within or equal to which library will use polling for IO completion
"poll_mode_max_size_kb": 4,
// allow p2pdma, this will enable use of cuFile without nvme patches
"use_pci_p2pdma": false,
// allow compat mode, this will enable use of cuFile posix read/writes
"allow_compat_mode": true,
// enable GDS write support for RDMA based storage
"gds_rdma_write_support": true,
// GDS batch size
"io_batchsize": 128,
// enable io priority w.r.t compute streams
// valid options are "default", "low", "med", "high"
"io_priority": "default",
// client-side rdma addr list for user-space file-systems(e.g ["10.0.1.0", "10.0.2.0"])
"rdma_dev_addr_list": [ ],
// load balancing policy for RDMA memory registration(MR), (RoundRobin, RoundRobinMaxMin)
// In RoundRobin, MRs will be distributed uniformly across NICS closest to a GPU
// In RoundRobinMaxMin, MRs will be distributed across NICS closest to a GPU
// with minimal sharing of NICS acros GPUS
"rdma_load_balancing_policy": "RoundRobin",
//32-bit dc key value in hex
//"rdma_dc_key": "0xffeeddcc",
//To enable/disable different rdma OPs use the below bit map
//Bit 0 - If set enables Local RDMA WRITE
//Bit 1 - If set enables Remote RDMA WRITE
//Bit 2 - If set enables Remote RDMA READ
//Bit 3 - If set enables REMOTE RDMA Atomics
//Bit 4 - If set enables Relaxed ordering.
//"rdma_access_mask": "0x1f",
// In platforms where IO transfer to a GPU will cause cross RootPort PCie transfers, enabling this feature
// might help improve overall BW provided there exists a GPU(s) with Root Port common to that of the storage NIC(s).
// If this feature is enabled, please provide the ip addresses used by the mount either in file-system specific
// section for mount_table or in the rdma_dev_addr_list property in properties section
"rdma_dynamic_routing": false,
// The order describes the sequence in which a policy is selected for dynamic routing for cross Root Port transfers
// If the first policy is not applicable, it will fallback to the next and so on.
// policy GPU_MEM_NVLINKS: use GPU memory with NVLink to transfer data between GPUs
// policy GPU_MEM: use GPU memory with PCIe to transfer data between GPUs
// policy SYS_MEM: use system memory with PCIe to transfer data to GPU
// policy P2P: use P2P PCIe to transfer across between NIC and GPU
"rdma_dynamic_routing_order": [ "GPU_MEM_NVLINKS", "GPU_MEM", "SYS_MEM", "P2P" ]
},
"fs": {
"generic": {
// for unaligned writes, setting it to true will, cuFileWrite use posix write internally instead of regular GDS write
"posix_unaligned_writes" : false
},
"beegfs" : {
// IO threshold for read/write (param should be 4K aligned)) equal to or below which cuFile will use posix read/write
"posix_gds_min_kb" : 0
// To restrict the IO to selected IP list, when dynamic routing is enabled
// if using a single BeeGFS mount, provide the ip addresses here
//"rdma_dev_addr_list" : []
// if using multiple lustre mounts, provide ip addresses used by respective mount here
//"mount_table" : {
// "/beegfs/client1" : {
// "rdma_dev_addr_list" : ["172.172.1.40", "172.172.1.42"]
// },
// "/beegfs/client2" : {
// "rdma_dev_addr_list" : ["172.172.2.40", "172.172.2.42"]
// }
//}
},
"lustre": {
// IO threshold for read/write (param should be 4K aligned)) equal to or below which cuFile will use posix read/write
"posix_gds_min_kb" : 16
// To restrict the IO to selected IP list, when dynamic routing is enabled
// if using a single lustre mount, provide the ip addresses here (use : sudo lnetctl net show)
//"rdma_dev_addr_list" : []
// if using multiple lustre mounts, provide ip addresses used by respective mount here
//"mount_table" : {
// "/lustre/ai200_01/client" : {
// "rdma_dev_addr_list" : ["172.172.1.40", "172.172.1.42"]
// },
// "/lustre/ai200_02/client" : {
// "rdma_dev_addr_list" : ["172.172.2.40", "172.172.2.42"]
// }
//}
},
"nfs": {
// To restrict the IO to selected IP list, when dynamic routing is enabled
//"rdma_dev_addr_list" : []
//"mount_table" : {
// "/mnt/nfsrdma_01/" : {
// "rdma_dev_addr_list" : []
// },
// "/mnt/nfsrdma_02/" : {
// "rdma_dev_addr_list" : []
// }
//}
},
"gpfs": {
//allow GDS writes with GPFS
"gds_write_support": false,
//allow Async support
"gds_async_support": true
//"rdma_dev_addr_list" : []
//"mount_table" : {
// "/mnt/gpfs_01" : {
// "rdma_dev_addr_list" : []
// },
// "/mnt/gpfs_02/" : {
// "rdma_dev_addr_list" : []
// }
//}
},
"weka": {
// enable/disable RDMA write
"rdma_write_support" : false
}
},
"denylist": {
// specify list of vendor driver modules to deny for nvidia-fs (e.g. ["nvme" , "nvme_rdma"])
"drivers": [ ],
// specify list of block devices to prevent IO using cuFile (e.g. [ "/dev/nvme0n1" ])
"devices": [ ],
// specify list of mount points to prevent IO using cuFile (e.g. ["/mnt/test"])
"mounts": [ ],
// specify list of file-systems to prevent IO using cuFile (e.g ["lustre", "wekafs"])
"filesystems": [ ]
},
"miscellaneous": {
// enable only for enforcing strict checks at API level for debugging
"api_check_aggressive": false
}
}
This file does not exist on our system.
Yes, it should be safe to use my config. You can use
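The NOTE at the top of the config quoted above says an application can override the configuration through `CUFILE_ENV_PATH_JSON`. A minimal sketch of doing that from Python follows. It assumes a reduced config containing only two keys taken from the file above is acceptable to cuFile, and the temporary file location is purely illustrative.

```python
import json
import os
import tempfile

# Reduced config using only keys that appear in the cufile.json quoted above.
# Assumption: cuFile falls back to defaults for everything omitted here.
config = {
    "logging": {"level": "ERROR"},
    "properties": {"allow_compat_mode": True},
}

# Hypothetical location; any readable path should work.
config_path = os.path.join(tempfile.mkdtemp(), "cufile.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Presumably this has to be set before cuFile reads its configuration,
# so set it before importing kvikio in the same process.
os.environ["CUFILE_ENV_PATH_JSON"] = config_path
print(config_path)
```

The same effect can be had from the shell with the `export CUFILE_ENV_PATH_JSON=<filepath>` form shown in the config comment.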
Using the
28-10-2024 17:38:08:134 [pid=8196 tid=8196] ERROR 0:140 unable to load, liburcu-bp.so.6
28-10-2024 17:38:08:134 [pid=8196 tid=8196] ERROR 0:140 unable to load, liburcu-bp.so.1
28-10-2024 17:38:08:134 [pid=8196 tid=8196] WARN 0:168 failed to open /proc/driver/nvidia-fs/devcount error: No such file or directory
28-10-2024 17:38:08:134 [pid=8196 tid=8196] NOTICE cufio-drv:720 running in compatible mode
28-10-2024 17:38:08:423 [pid=8196 tid=8196] ERROR 0:140 unable to load, libnuma.so.1.0.0
28-10-2024 17:38:08:423 [pid=8196 tid=8196] ERROR 0:91 dlopen error libnuma.so.1.0.0: cannot open shared object file: No such file or directory
28-10-2024 17:38:08:427 [pid=8196 tid=8196] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:526
28-10-2024 17:38:08:427 [pid=8196 tid=8196] ERROR cufio-fs:742 error getting volume attributes error for device: dev_no: 0:526
28-10-2024 17:38:08:427 [pid=8196 tid=8196] ERROR cufio-obj:215 unable to get volume attributes for fd 36
28-10-2024 17:38:08:427 [pid=8196 tid=8196] ERROR cufio:310 cuFileHandleRegister error, failed to allocate file object
28-10-2024 17:38:08:427 [pid=8196 tid=8196] ERROR cufio:338 cuFileHandleRegister error: internal error
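For longer logs it can help to pull out only the ERROR entries. The following is a rough sketch; the regular expression is inferred from the lines quoted above rather than from any documented log format, and `parse_cufile_log` is a made-up helper name.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted above; not an official format spec.
LOG_LINE = re.compile(
    r"^(?P<date>\d{2}-\d{2}-\d{4}) (?P<time>\d{2}:\d{2}:\d{2}:\d{3}) "
    r"\[pid=(?P<pid>\d+) tid=(?P<tid>\d+)\] "
    r"(?P<level>[A-Z]+)\s+(?P<source>\S+:\d+) (?P<message>.*)$"
)

def parse_cufile_log(text):
    """Return one dict per line that matches LOG_LINE; other lines are skipped."""
    entries = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries

# A few lines copied from the log in this thread.
sample = """\
28-10-2024 17:38:08:134 [pid=8196 tid=8196] ERROR 0:140 unable to load, liburcu-bp.so.6
28-10-2024 17:38:08:134 [pid=8196 tid=8196] WARN 0:168 failed to open /proc/driver/nvidia-fs/devcount error: No such file or directory
28-10-2024 17:38:08:134 [pid=8196 tid=8196] NOTICE cufio-drv:720 running in compatible mode
28-10-2024 17:38:08:427 [pid=8196 tid=8196] ERROR cufio:338 cuFileHandleRegister error: internal error
"""

entries = parse_cufile_log(sample)
levels = Counter(e["level"] for e in entries)
errors = [e["message"] for e in entries if e["level"] == "ERROR"]
print(levels)
print(errors)
```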
28-10-2024 17:38:08:134 [pid=8196 tid=8196] WARN 0:168 failed to open /proc/driver/nvidia-fs/devcount error: No such file or directory
$ sudo lsmod | grep nvidia_fs
28-10-2024 17:38:08:427 [pid=8196 tid=8196] ERROR cufio-fs:322 error creating udev_device for block device dev_no: 0:526
This error indicates that the library does not recognize the block device type behind the filesystem and so cannot get the volume attributes. Where is the dataset located?
$ lsblk
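The two checks above can also be done without `sudo` from Python. This is a sketch under some assumptions: it reads `/proc/modules` (the file `lsmod` formats) to look for `nvidia_fs`, and it reports the major:minor device number of the filesystem holding a path, for comparison with the `dev_no` in the log. The helper names and the dataset path are placeholders. On Linux a major number of 0 usually means the filesystem is not backed by a block device (for example a network or overlay mount), in which case `lsblk` would not list it.

```python
import os

def nvidia_fs_loaded(modules_path="/proc/modules"):
    """Return True if a module named nvidia_fs is listed, False if it is not,
    and None if the module list cannot be read (e.g. not on Linux)."""
    try:
        with open(modules_path) as f:
            return any(line.split()[0] == "nvidia_fs" for line in f if line.strip())
    except OSError:
        return None

def device_number(path):
    """Return (major, minor) of the device that holds `path`."""
    st = os.stat(path)
    return os.major(st.st_dev), os.minor(st.st_dev)

# Hypothetical dataset location; replace with the real path.
dataset_path = "."
major, minor = device_number(dataset_path)
print("nvidia_fs loaded:", nvidia_fs_loaded())
print(f"{dataset_path} is on device {major}:{minor}")
```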
Originally posted by @fstrug in #378