Randomly crash when training #33
Comments
I have not encountered that issue myself but it might help if you can post the output of:
Hi @magehrig Thank you for your reply. Here are the outputs.
packages in environment at /home/xky/anaconda3/envs/rvt: Name Version Build Channel
Your conda list differs from the one I get when I follow the install instructions in the README.
Can you check whether you get the same errors when you set up a fresh conda env using the instructions in the README, without any additional installations?
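One thing worth checking alongside a fresh env: the traceback posted in this thread shows pytorch_lightning being loaded from /home/xky/.local/lib/python3.9/site-packages, while torch and h5py come from the conda env under /home/xky/anaconda3/envs/rvt. The following is a minimal sketch for spotting such mixed installs; the helper names `outside_prefix` and `modules_outside_env` are made up for illustration and are not part of the repository.

```python
import importlib.util
import sys
from pathlib import Path


def outside_prefix(origin, prefix):
    """Return True if the file path `origin` does not live under `prefix`."""
    try:
        Path(origin).resolve().relative_to(Path(prefix).resolve())
        return False
    except ValueError:
        return True


def modules_outside_env(names, prefix=sys.prefix):
    """Map each importable module in `names` to its file if it is outside `prefix`.

    Hypothetical helper: it only inspects where a module would be loaded from,
    without importing it.
    """
    found = {}
    for name in names:
        spec = importlib.util.find_spec(name)
        if spec is not None and spec.origin and outside_prefix(spec.origin, prefix):
            found[name] = spec.origin
    return found


if __name__ == "__main__":
    # Module names taken from the traceback in this thread.
    for name, origin in modules_outside_env(["pytorch_lightning", "torch", "h5py"]).items():
        print(f"{name} is loaded from outside the active env: {origin}")
```

If anything is reported, setting the PYTHONNOUSERSITE environment variable before launching Python keeps the user site-packages directory (~/.local/...) off sys.path, which may help rule out a mixed install as a factor.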
Initially I set up my conda env exactly as described in the README, but it failed for some packages. For example, I can't install h5py from conda-forge, which is why I used pip install.
Looking for: ['h5py=3.8.0', 'blosc-hdf5-plugin=1.0.0', 'hydra-core=1.3.2', 'einops=0.6.0', 'torchdata=0.6.0', 'tqdm', 'numba', 'pytorch=2.0.0', 'torchvision=0.15.0', 'pytorch-cuda=11.8']
pytorch/linux-64 Using cache
Pinned packages:
Could not solve for environment specs
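When some packages end up installed via pip instead of conda, it can help to compare what is actually installed against the versions the solver was asked for. Below is a minimal sketch using only the standard library. The pins are copied from the "Looking for:" line above; the function name `compare_to_pins` is hypothetical, and the sketch assumes the conda and pip distribution names match for the packages listed (this is not true for every package, e.g. pytorch vs. torch, so those are left out).

```python
from importlib import metadata

# Pins copied from the "Looking for:" line; only packages whose conda and pip
# names are assumed to be identical are included here.
PINS = {
    "h5py": "3.8.0",
    "hydra-core": "1.3.2",
    "einops": "0.6.0",
    "torchdata": "0.6.0",
}


def compare_to_pins(pins, get_version=metadata.version):
    """Return {name: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed is None or not installed.startswith(expected):
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in compare_to_pins(PINS).items():
        print(f"{name}: expected {expected}, found {installed}")
```

Note that blosc-hdf5-plugin is not a Python package, so it cannot be checked this way; `conda list` is still needed for that one.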
@magehrig add_anaconda_token: True
Thank you for your patience!
I'm having the same problem: the program just crashes randomly. Did you solve it?
Whenever I run train.py, the process crashes intermittently, and I get the following error message:
  File "/home/xky/RVT-master/train.py", line 141, in main
    trainer.fit(model=module, ckpt_path=ckpt_path, datamodule=data_module)
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 603, in fit
    call._call_and_handle_interrupt(
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 38, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 645, in _fit_impl
    self._run(model, ckpt_path=self.ckpt_path)
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1098, in _run
    results = self._run_stage()
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1177, in _run_stage
    self._run_train()
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1200, in _run_train
    self.fit_loop.run()
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/loops/fit_loop.py", line 267, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.on_advance_end()
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 251, in on_advance_end
    self._run_validation()
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 310, in _run_validation
    self.val_loop.run()
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 152, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/loops/loop.py", line 199, in run
    self.advance(*args, **kwargs)
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 121, in advance
    batch = next(data_fetcher)
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 184, in __next__
    return self.fetching_function()
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 258, in fetching_function
    self._fetch_next_batch(self.dataloader_iter)
  File "/home/xky/.local/lib/python3.9/site-packages/pytorch_lightning/utilities/fetching.py", line 280, in _fetch_next_batch
    batch = next(iterator)
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 634, in __next__
    data = self._next_data()
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 678, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 41, in fetch
    data = next(self.dataset_iter)
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 144, in __next__
    return self._get_next()
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 132, in _get_next
    result = next(self.iterator)
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 215, in wrap_next
    result = next_func(*args, **kwargs)
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/datapipe.py", line 369, in __next__
    return next(self._datapipe_iter)
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 144, in __next__
    return self._get_next()
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 132, in _get_next
    result = next(self.iterator)
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 185, in wrap_generator
    response = gen.send(request)
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/iter/combining.py", line 589, in __iter__
    yield from zip(*iterators)
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 185, in wrap_generator
    response = gen.send(request)
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torchdata/datapipes/iter/util/zip_longest.py", line 56, in __iter__
    value = next(iterators[i])
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 185, in wrap_generator
    response = gen.send(request)
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/iter/combining.py", line 52, in __iter__
    for data in dp:
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torch/utils/data/datapipes/_hook_iterator.py", line 185, in wrap_generator
    response = gen.send(request)
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/torchdata/datapipes/map/util/converter.py", line 47, in __iter__
    yield self.datapipe[idx]
  File "/home/xky/RVT-master/data/genx_utils/sequence_for_streaming.py", line 152, in __getitem__
    ev_repr = self._get_event_repr_torch(start_idx=start_idx, end_idx=end_idx)
  File "/home/xky/RVT-master/data/genx_utils/sequence_base.py", line 91, in _get_event_repr_torch
    ev_repr = h5f['data'][start_idx:end_idx]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/xky/anaconda3/envs/rvt/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 768, in __getitem__
    return self._fast_reader.read(args)
  File "h5py/_selector.pyx", line 376, in h5py._selector.Reader.read
OSError: Can't read data (filter returned failure during read)
This exception is thrown by __iter__ of MapToIterConverterIterDataPipe(datapipe=SequenceForIter, indices=range(0, 41))
There is a similar issue, #10, but none of the methods mentioned there solved the problem for me.
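Since the traceback ends in `h5f['data'][start_idx:end_idx]` raising "filter returned failure during read", one way to narrow things down outside of training is to read the dataset block by block and record which index ranges fail. The sketch below is an illustration under assumptions, not code from the repository: the function names `find_unreadable_ranges` and `scan_file` and the block size are hypothetical, and only the dataset key `'data'` is taken from the traceback.

```python
def find_unreadable_ranges(dataset, length, block=64):
    """Read dataset[start:stop] block by block and collect ranges raising OSError.

    `dataset` can be any object that supports slicing, e.g. an h5py dataset.
    Returns a list of (start, stop, error_message) tuples.
    """
    bad = []
    for start in range(0, length, block):
        stop = min(start + block, length)
        try:
            dataset[start:stop]  # forces decompression of the chunks in this range
        except OSError as err:
            bad.append((start, stop, str(err)))
    return bad


def scan_file(path, key="data"):
    """Open one HDF5 file read-only and scan the dataset stored under `key`."""
    import h5py  # imported lazily so the helper above has no dependencies

    with h5py.File(path, "r") as h5f:
        dset = h5f[key]
        return find_unreadable_ranges(dset, dset.shape[0])
```

A possible reading of the result, offered tentatively: if every range of every file fails, the compression filter itself may not be loadable in this environment; if the same few ranges fail on every run, the file may be damaged; and if a standalone scan never fails, the cause may lie elsewhere in the training setup.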