Segmentation Fault or Memory Access Exception During Test or Evaluation #61

Open
fdchiu opened this issue Sep 11, 2024 · 1 comment

fdchiu commented Sep 11, 2024

Most of the time I run into a segmentation fault with no Python exception shown. When I run the same test/evaluation at other times, I get a Python exception instead:

File "/home/david/project/rvt/RVT-master/modules/utils/detection.py", line 41, in add_backbone_features
self.features[k].append(v[selected_indices] if selected_indices is not None else v)

It fails consistently at line 41. Does anyone know why this happens and how to resolve it?
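CUDA reports kernel errors asynchronously, so line 41 is not necessarily the operation that caused the illegal memory access. One way to rule out the indexing on that line is a host-side bounds check just before `v[selected_indices]`. This is a minimal debugging sketch, not part of RVT: `check_row_indices` is a hypothetical helper, and it assumes `selected_indices` is a sequence of integer row indices into the first (batch) dimension. If it is a boolean mask instead, compare its length with `v.shape[0]` rather than using this check. Converting a CUDA index tensor with `.tolist()` forces a device sync, so use it only while debugging.

```python
from typing import Sequence


def check_row_indices(num_rows: int, indices: Sequence[int]) -> None:
    """Hypothetical debugging helper (not part of RVT).

    Raises a readable IndexError on the host if any index falls outside the
    range that integer indexing accepts for a dimension of size `num_rows`,
    i.e. [-num_rows, num_rows).
    """
    bad = [int(i) for i in indices if not (-num_rows <= int(i) < num_rows)]
    if bad:
        raise IndexError(
            f"{len(bad)} index(es) out of range for a dimension of size "
            f"{num_rows}: {bad[:10]}"
        )


# Assumed usage inside add_backbone_features, right before line 41
# (names taken from the traceback; the surrounding code is not shown here):
#     if selected_indices is not None:
#         idx = (selected_indices.tolist() if hasattr(selected_indices, "tolist")
#                else list(selected_indices))
#         check_row_indices(v.shape[0], idx)
```

If the check never fires while the crash persists, the bad access most likely happened in an earlier kernel and was only reported here.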

Command line for test:
python validation.py dataset=gen1 dataset.path=/home/david/project/rvt/gen1 checkpoint=/home/david/project/rvt/rvt-s.ckpt use_test_set=0 hardware.gpus=0 +experiment/gen1="small.yaml" batch_size.eval=8 model.postprocess.confidence_threshold=0.001
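Because the segmentation fault usually leaves no Python traceback, Python's standard-library `faulthandler` module can help: once enabled, it prints the Python stack of every thread when the process receives a fatal signal, which shows what was executing at the moment of the crash. The same effect is available without code changes by running `python -X faulthandler validation.py ...` or by setting `PYTHONFAULTHANDLER=1`. A minimal sketch, assuming it is placed near the top of `validation.py`:

```python
import faulthandler

# On SIGSEGV, SIGFPE, SIGABRT, SIGBUS or SIGILL (for example a crash inside a
# C/CUDA extension), dump the Python traceback of all threads to stderr.
faulthandler.enable(all_threads=True)
```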

--------------------------- exception log----------------------------------
Loaded model weights from checkpoint at /home/david/project/rvt/rvt-s.ckpt
Validation DataLoader 0: : 0it [00:00, ?it/s]
Error executing job with overrides: ['dataset=gen1', 'dataset.path=/home/david/project/rvt/gen1', 'checkpoint=/home/david/project/rvt/rvt-s.ckpt', 'use_test_set=0', 'hardware.gpus=0', '+experiment/gen1=small.yaml', 'batch_size.eval=8', 'model.postprocess.confidence_threshold=0.001']
Traceback (most recent call last):
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 736, in _validate_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
results = self._run_stage()
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1174, in _run_stage
return self._run_evaluate()
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1214, in _run_evaluate
eval_loop_results = self._evaluation_loop.run()
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
self.advance(*args, **kwargs)
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 137, in advance
output = self._evaluation_step(**kwargs)
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 234, in _evaluation_step
output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1480, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/david/anaconda3/envs/rvt/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 390, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/home/david/project/rvt/RVT-master/modules/detection.py", line 283, in validation_step
return self._val_test_step_impl(batch=batch, mode=Mode.VAL)
File "/home/david/project/rvt/RVT-master/modules/detection.py", line 249, in _val_test_step_impl
backbone_feature_selector.add_backbone_features(backbone_features=backbone_features,
File "/home/david/project/rvt/RVT-master/modules/utils/detection.py", line 41, in add_backbone_features
self.features[k].append(v[selected_indices] if selected_indices is not None else v)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

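Following the log's own suggestion, the run can be repeated with `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous so the traceback points at the operation that actually failed (at a significant speed cost). A minimal sketch that builds the same command as in this issue with that variable set in a child process environment, plus `PYTHONFAULTHANDLER=1` for the segfault case. `debug_run_spec` is a hypothetical helper; the overrides are copied from the command line above.

```python
import os
import sys
from typing import Dict, List, Mapping, Optional, Tuple

# Hydra overrides copied from the command line in this issue.
OVERRIDES = [
    "dataset=gen1",
    "dataset.path=/home/david/project/rvt/gen1",
    "checkpoint=/home/david/project/rvt/rvt-s.ckpt",
    "use_test_set=0",
    "hardware.gpus=0",
    "+experiment/gen1=small.yaml",
    "batch_size.eval=8",
    "model.postprocess.confidence_threshold=0.001",
]


def debug_run_spec(
    overrides: List[str],
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Hypothetical helper: return (argv, env) for a debug re-run of validation.py."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"  # synchronous kernel launches
    env["PYTHONFAULTHANDLER"] = "1"    # Python traceback on fatal signals
    argv = [sys.executable, "validation.py", *overrides]
    return argv, env
```

Usage, from the `RVT-master` directory: `argv, env = debug_run_spec(OVERRIDES)` followed by `subprocess.run(argv, env=env)`. Setting the variables in the child's environment ensures they are present before CUDA is initialized in that process.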


fdchiu commented Sep 11, 2024

conda list:

# packages in environment at /home/david/anaconda3/envs/rvt:

# Name Version Build Channel

_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
aiohappyeyeballs 2.4.0 pypi_0 pypi
aiohttp 3.10.5 pypi_0 pypi
aiosignal 1.3.1 pypi_0 pypi
antlr4-python3-runtime 4.9.3 pypi_0 pypi
appdirs 1.4.4 pypi_0 pypi
async-timeout 4.0.3 pypi_0 pypi
attrs 24.2.0 pypi_0 pypi
bbox-visualizer 0.1.0 pypi_0 pypi
blas 1.0 mkl
ca-certificates 2024.7.2 h06a4308_0
certifi 2024.8.30 pypi_0 pypi
charset-normalizer 3.3.2 pypi_0 pypi
click 8.1.7 pypi_0 pypi
contourpy 1.3.0 pypi_0 pypi
cycler 0.12.1 pypi_0 pypi
docker-pycreds 0.4.0 pypi_0 pypi
einops 0.6.0 pypi_0 pypi
filelock 3.16.0 pypi_0 pypi
fonttools 4.53.1 pypi_0 pypi
frozenlist 1.4.1 pypi_0 pypi
fsspec 2024.9.0 pypi_0 pypi
gitdb 4.0.11 pypi_0 pypi
gitpython 3.1.43 pypi_0 pypi
h5py 3.8.0 pypi_0 pypi
hdf5plugin 4.4.0 pypi_0 pypi
hydra-core 1.3.2 pypi_0 pypi
idna 3.8 pypi_0 pypi
importlib-resources 6.4.4 pypi_0 pypi
intel-openmp 2023.1.0 hdb19cb5_46306
jinja2 3.1.4 pypi_0 pypi
kiwisolver 1.4.7 pypi_0 pypi
ld_impl_linux-64 2.38 h1181459_1
libffi 3.4.4 h6a678d5_1
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libstdcxx-ng 11.2.0 h1234567_1
lightning-utilities 0.11.7 pypi_0 pypi
llvmlite 0.43.0 pypi_0 pypi
markupsafe 2.1.5 pypi_0 pypi
matplotlib 3.9.2 pypi_0 pypi
mkl 2023.1.0 h213fc3f_46344
mkl-service 2.4.0 py39h5eee18b_1
mkl_fft 1.3.10 py39h5eee18b_0
mkl_random 1.2.7 py39h1128e8f_0
mpmath 1.3.0 pypi_0 pypi
multidict 6.0.5 pypi_0 pypi
ncurses 6.4 h6a678d5_0
networkx 3.2.1 pypi_0 pypi
numba 0.60.0 pypi_0 pypi
numpy 2.0.2 pypi_0 pypi
numpy-base 1.26.4 py39hb5e798b_0
nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi
nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
nvidia-nccl-cu12 2.18.1 pypi_0 pypi
nvidia-nvjitlink-cu12 12.6.68 pypi_0 pypi
nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
omegaconf 2.3.0 pypi_0 pypi
opencv-python 4.6.0.66 pypi_0 pypi
openssl 3.0.15 h5eee18b_0
packaging 24.1 pypi_0 pypi
pandas 1.5.3 pypi_0 pypi
pathtools 0.1.2 pypi_0 pypi
pillow 10.4.0 pypi_0 pypi
pip 24.2 py39h06a4308_0
platformdirs 4.3.2 pypi_0 pypi
plotly 5.13.1 pypi_0 pypi
protobuf 3.20.1 pypi_0 pypi
psutil 6.0.0 pypi_0 pypi
pycocotools 2.0.6 pypi_0 pypi
pyparsing 3.1.4 pypi_0 pypi
python 3.9.19 h955ad1f_1
python-dateutil 2.9.0.post0 pypi_0 pypi
pytorch-lightning 1.8.6 pypi_0 pypi
pytz 2024.1 pypi_0 pypi
pyyaml 6.0.2 pypi_0 pypi
readline 8.2 h5eee18b_0
requests 2.32.3 pypi_0 pypi
sentry-sdk 2.13.0 pypi_0 pypi
setproctitle 1.3.3 pypi_0 pypi
setuptools 72.1.0 py39h06a4308_0
six 1.16.0 pypi_0 pypi
smmap 5.0.1 pypi_0 pypi
sqlite 3.45.3 h5eee18b_0
strenum 0.4.10 pypi_0 pypi
sympy 1.13.2 pypi_0 pypi
tabulate 0.9.0 pypi_0 pypi
tbb 2021.8.0 hdb19cb5_0
tenacity 9.0.0 pypi_0 pypi
tensorboardx 2.6.2.2 pypi_0 pypi
tk 8.6.14 h39e8969_0
torch 2.1.0 pypi_0 pypi
torchdata 0.7.0 pypi_0 pypi
torchmetrics 1.4.1 pypi_0 pypi
torchvision 0.16.0 pypi_0 pypi
tqdm 4.66.5 pypi_0 pypi
triton 2.1.0 pypi_0 pypi
typing-extensions 4.12.2 pypi_0 pypi
tzdata 2024a h04d1e81_0
urllib3 2.2.2 pypi_0 pypi
wandb 0.17.2 pypi_0 pypi
wheel 0.43.0 py39h06a4308_0
xz 5.4.6 h5eee18b_1
yarl 1.10.0 pypi_0 pypi
zipp 3.20.1 pypi_0 pypi
zlib 1.2.13 h5eee18b_1
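For a bug report, a short summary of the packages closest to the failing code path can be easier to scan than the full listing. A minimal sketch using the standard-library `importlib.metadata` (Python 3.8+); the selection in `KEY_PACKAGES` is an assumption, and the versions reported come from installed package metadata, which can differ from conda's view of the environment.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Assumed selection of distributions relevant to this crash; adjust as needed.
KEY_PACKAGES = (
    "torch",
    "torchvision",
    "pytorch-lightning",
    "numpy",
    "numba",
    "h5py",
    "hdf5plugin",
)


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if missing."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(KEY_PACKAGES).items():
        print(f"{name:20s} {version or 'not installed'}")
```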

(rvt) david@:RVT-master$ echo $LD_LIBRARY_PATH
/usr/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu::/home/david/anaconda3/lib/:/home/david/anaconda3/envs/tf/lib/python3.9/site-packages/nvidia/cudnn/lib:/home/david/anaconda3/lib/:/home/david/anaconda3/envs/tf/lib/python3.9/site-packages/nvidia/cudnn/lib

echo $HDF5_PLUGIN_PATH
(empty)
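The `LD_LIBRARY_PATH` shown above lists several directories twice, contains an empty entry (`::`, which the ld.so(8) man page says is interpreted as the current working directory), and points at a cuDNN directory inside a different conda environment (`envs/tf`). A minimal sketch for auditing such a value: `ld_path_report` is a hypothetical helper that splits on colons only (ld.so also accepts semicolons as separators) and flags empty and repeated entries.

```python
import os
from typing import List, Optional


def ld_path_report(value: Optional[str] = None) -> dict:
    """Split an LD_LIBRARY_PATH-style string and flag empty and duplicate entries."""
    if value is None:
        value = os.environ.get("LD_LIBRARY_PATH", "")
    entries: List[str] = value.split(":") if value else []
    seen = set()
    duplicates: List[str] = []
    for entry in entries:
        # Record each non-empty directory once, at its first repeat.
        if entry and entry in seen and entry not in duplicates:
            duplicates.append(entry)
        seen.add(entry)
    return {
        "entries": entries,
        "empty_entries": sum(1 for e in entries if e == ""),
        "duplicates": duplicates,
    }


if __name__ == "__main__":
    report = ld_path_report()
    for key, val in report.items():
        print(f"{key}: {val}")
```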
