
herro inference error #71

Open
1Wencai opened this issue Dec 2, 2024 · 1 comment
1Wencai commented Dec 2, 2024

Hello,
The following is my command:
singularity run --nv /data/a/zhangwencai/software/herro.sif inference --read-alns /data/b/zhangwencai/ultra_long/japo_fromGuoSong/minimap2_alignment -t 1 -b 1 -m /data/a/zhangwencai/software/herro/model_R9_v0.1.pt /data/b/zhangwencai/ultra_long/japo_fromGuoSong/DY48490_ONT_UL_200kb.fastq DY48490_ONT_UL_200kb_herro.fasta

The following is the error output:
[00:00:05] Parsed 10543 reads. [00:00:00] Processing 1/? batch ⡀ thread '' panicked at /herro/src/inference.rs:209:70:
Cannot load model.: Torch("CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero.\nException raised from device_count_impl at ../c10/cuda/CUDAFunctions.cpp:69 (most recent call first):\nframe #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x6b (0x7f5cd385a6bb in /libs/libtorch/lib/libc10.so)\nframe #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0xc9 (0x7f5cd3855769 in /libs/libtorch/lib/libc10.so)\nframe #2: c10::cuda::device_count_ensure_non_zero() + 0x117 (0x7f5cd324b027 in /libs/libtorch/lib/libc10_cuda.so)\nframe #3: + 0x103931a (0x7f5ced03931a in /libs/libtorch/lib/libtorch_cuda.so)\nframe #4: + 0x2c30f36 (0x7f5ceec30f36 in /libs/libtorch/lib/libtorch_cuda.so)\nframe #5: + 0x2c30ffb (0x7f5ceec30ffb in /libs/libtorch/lib/libtorch_cuda.so)\nframe #6: at::_ops::empty_strided::redispatch(c10::DispatchKeySet, c10::ArrayRefc10::SymInt, c10::ArrayRefc10::SymInt, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional) + 0x1fb (0x7f5cd5eb71fb in /libs/libtorch/lib/libtorch_cpu.so)\nframe #7: + 0x25ebc75 (0x7f5cd61ebc75 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #8: at::_ops::empty_strided::call(c10::ArrayRefc10::SymInt, c10::ArrayRefc10::SymInt, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional) + 0x168 (0x7f5cd5ef2328 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #9: + 0x1701f5f (0x7f5cd5301f5f in /libs/libtorch/lib/libtorch_cpu.so)\nframe #10: at::native::_to_copy(at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x17e3 (0x7f5cd56a6cf3 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #11: + 0x27d3603 (0x7f5cd63d3603 in 
/libs/libtorch/lib/libtorch_cpu.so)\nframe #12: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x103 (0x7f5cd5b93c83 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #13: + 0x25f01c8 (0x7f5cd61f01c8 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #14: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x103 (0x7f5cd5b93c83 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #15: + 0x3a66271 (0x7f5cd7666271 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #16: + 0x3a6681b (0x7f5cd766681b in /libs/libtorch/lib/libtorch_cpu.so)\nframe #17: at::_ops::_to_copy::call(at::Tensor const&, c10::optionalc10::ScalarType, c10::optionalc10::Layout, c10::optionalc10::Device, c10::optional, bool, c10::optionalc10::MemoryFormat) + 0x201 (0x7f5cd5c16651 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #18: at::native::to(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optionalc10::MemoryFormat) + 0xfd (0x7f5cd56a505d in /libs/libtorch/lib/libtorch_cpu.so)\nframe #19: + 0x29a5612 (0x7f5cd65a5612 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #20: at::_ops::to_device::call(at::Tensor const&, c10::Device, c10::ScalarType, bool, bool, c10::optionalc10::MemoryFormat) + 0x1c1 (0x7f5cd5d95cd1 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #21: torch::jit::Unpickler::readInstruction() + 0x1719 (0x7f5cd8766789 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #22: torch::jit::Unpickler::run() + 0xa8 (0x7f5cd8767988 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #23: torch::jit::Unpickler::parse_ivalue() + 0x2e (0x7f5cd876953e in /libs/libtorch/lib/libtorch_cpu.so)\nframe #24: torch::jit::readArchiveAndTensors(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, 
std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, c10::optional<std::function<c10::StrongTypePtr (c10::QualifiedName const&)> >, c10::optional<std::function<c10::intrusive_ptr<c10::ivalue::Object, c10::detail::intrusive_target_default_null_typec10::ivalue::Object > (c10::StrongTypePtr, c10::IValue)> >, c10::optionalc10::Device, caffe2::serialize::PyTorchStreamReader&, c10::Type::SingletonOrSharedTypePtrc10::Type (*)(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&), std::shared_ptrtorch::jit::DeserializationStorageContext) + 0x529 (0x7f5cd87241a9 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #25: + 0x4b08c4b (0x7f5cd8708c4b in /libs/libtorch/lib/libtorch_cpu.so)\nframe #26: + 0x4b0b04b (0x7f5cd870b04b in /libs/libtorch/lib/libtorch_cpu.so)\nframe #27: torch::jit::import_ir_module(std::shared_ptrtorch::jit::CompilationUnit, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, c10::optionalc10::Device, std::unordered_map<std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::__cxx11::basic_string<char, std::char_traits, std::allocator >, std::hash<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits, std::allocator > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits, std::allocator > const, std::__cxx11::basic_string<char, std::char_traits, std::allocator > > > >&, bool, bool) + 0x3a2 (0x7f5cd870f6c2 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #28: torch::jit::import_ir_module(std::shared_ptrtorch::jit::CompilationUnit, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, c10::optionalc10::Device, bool) + 0x92 (0x7f5cd870fa42 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #29: torch::jit::load(std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&, 
c10::optionalc10::Device, bool) + 0xd1 (0x7f5cd870fb71 in /libs/libtorch/lib/libtorch_cpu.so)\nframe #30: + 0x1ee52e (0x55cf434da52e in herro)\nframe #31: + 0xd4bc9 (0x55cf433c0bc9 in herro)\nframe #32: + 0x1062b6 (0x55cf433f22b6 in herro)\nframe #33: + 0xc0aec (0x55cf433acaec in herro)\nframe #34: + 0xf56e5 (0x55cf433e16e5 in herro)\nframe #35: + 0x15ae9b (0x55cf43446e9b in herro)\nframe #36: + 0x94ac3 (0x7f5cd366bac3 in /lib/x86_64-linux-gnu/libc.so.6)\nframe #37: clone + 0x44 (0x7f5cd36fca04 in /lib/x86_64-linux-gnu/libc.so.6)\n")
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
Aborted (core dumped)
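The libtorch error above specifically calls out the environment, e.g. "changing env variable CUDA_VISIBLE_DEVICES after program start". As a first diagnostic step, it can help to dump the CUDA-related environment exactly as the process sees it. Below is a minimal stdlib-only sketch; the helper name `cuda_env_report` is mine, not part of herro or torch:

```python
import os

# Hypothetical diagnostic helper: report the CUDA-related environment
# variables that, when misconfigured (per the error message above),
# commonly trigger "CUDA unknown error" under a container runtime.
def cuda_env_report(environ=None):
    environ = os.environ if environ is None else environ
    keys = ("CUDA_VISIBLE_DEVICES", "NVIDIA_VISIBLE_DEVICES", "LD_LIBRARY_PATH")
    return {key: environ.get(key, "<unset>") for key in keys}

if __name__ == "__main__":
    for key, value in cuda_env_report().items():
        print(f"{key}={value}")
```

Running this both on the host and inside the container (via `singularity exec --nv`) and comparing the output can show whether the `--nv` flag is actually exposing the GPU environment to the container.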

Could you please tell me what is causing this error, and what I should do?

Below are my CUDA version and GPU details:

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Aug_14_10:10:22_PDT_2024
Cuda compilation tools, release 12.6, V12.6.68
Build cuda_12.6.r12.6/compiler.34714021_0

nvidia-smi
Mon Dec 2 09:55:33 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03 Driver Version: 560.35.03 CUDA Version: 12.6 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA RTX A6000 Off | 00000000:31:00.0 Off | Off |
| 30% 58C P0 80W / 300W | 1MiB / 49140MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+

Best wishes,
WenCai


dominikstanojevic commented Dec 5, 2024

Hi,

Could you please check whether you can run PyTorch with CUDA on your host machine? For example, start Python (with torch installed) and run:

import torch
torch.cuda.is_available()

You should see True.
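A slightly more complete version of that check, which also reports the torch version and visible devices and degrades gracefully when torch is not installed. This is a sketch; the helper name `cuda_status` is mine, not a torch API:

```python
# Host-side CUDA sanity check, expanding on the two-line snippet above.
# `cuda_status` is a hypothetical helper, not part of torch or herro.
def cuda_status():
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}
    status = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "cuda_available": torch.cuda.is_available(),
    }
    if status["cuda_available"]:
        status["device_count"] = torch.cuda.device_count()
        status["device_0"] = torch.cuda.get_device_name(0)
    return status

if __name__ == "__main__":
    for key, value in cuda_status().items():
        print(f"{key}: {value}")
```

If `cuda_available` is True on the host but the herro container still fails, the problem is likely in how the container sees the GPU rather than in the host driver setup.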

Best,
Dominik
