
Segmentation fault after 14 epochs training #18

Open
hnrna opened this issue Mar 23, 2023 · 2 comments

hnrna commented Mar 23, 2023

When I try to train on a custom dataset, a segmentation fault occurs after 14 epochs of training complete.

How can I fix it?

dataset: custom data (training set: 475, validation set: 53)

image size: 960*480

Environment and configuration are as follows:
Python: 3.7.16
PyTorch: 1.13.1
CUDA: 11.6

Training command:
python train.py --warmup --checkpoint 1 --win_size 10 --train_ps 320 --env _self_dataset --gpu 6,7

error info:

... (1-13 epoch info)
------------------------------------------------------------------                                                                                        
Epoch: 14       Time: 61.3263   Loss: 6.0985    LearningRate 0.000200
------------------------------------------------------------------
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1120, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/multiprocessing/queues.py", line 104, in get
    if not self._poll(timeout):
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/multiprocessing/connection.py", line 414, in _poll
    r = wait([self], timeout)
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/multiprocessing/connection.py", line 921, in wait
    ready = selector.select(timeout)
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 4112666) is killed by signal: Segmentation fault.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 174, in <module>
    for ii, data_val in enumerate((val_loader), 0):
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1316, in _next_data
    idx, data = self._get_data()
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1282, in _get_data
    success, data = self._try_get_data()
  File "/DATA_EDS/x123/anaconda3/envs/ShadowFormer/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1133, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 4112666, 4112989) exited unexpectedly

hnrna commented Mar 23, 2023

It looks like the segmentation fault occurs once the program reaches the evaluation stage.
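
Since the traceback points at the validation DataLoader workers, one way to isolate it is to rebuild the val_loader single-process. A minimal sketch, where val_dataset stands in for whatever dataset object train.py actually constructs (placeholder, not the repository's real code):

# Diagnostic sketch (assumption, not the repo's actual code): run the
# validation loader in the main process so no worker subprocess can segfault,
# and see whether the crash at the start of evaluation disappears.
from torch.utils.data import DataLoader

val_loader = DataLoader(
    val_dataset,        # placeholder for the dataset built in train.py
    batch_size=1,
    shuffle=False,
    num_workers=0,      # single-process loading: no worker processes to crash
    pin_memory=True,
)

# If multi-process loading is needed, switching the tensor sharing strategy
# sometimes helps when workers die from shared-memory / file-descriptor limits:
# import torch.multiprocessing as mp
# mp.set_sharing_strategy('file_system')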

@Shyla1999

Hello, may I ask how you managed to train successfully? My training is too slow.
