Segmentation fault when trying to reproduce on Stanford 3D Dataset #55

Closed
FengZicai opened this issue Dec 29, 2021 · 0 comments

Comments

@FengZicai

I hit a segmentation fault when running:

./scripts/train_stanford.sh 4 "default" "--stanford3d_path ./Stanford3D"

When I set --num_workers to 0, it reports the following:

./scripts/train_stanford.sh: line 34: 30654 Segmentation fault python3 -m main --dataset StanfordArea5Dataset --batch_size $BATCH_SIZE --scheduler PolyLR --model Res16UNet34 --conv1_kernel_size 5 --log_dir $LOG_DIR --lr 1e-1 --max_iter 60000 --data_aug_color_trans_ratio 0.05 --data_aug_color_jitter_std 0.005 $3 2>&1
     30655 Done | tee -a "$LOG"

When I set --num_workers to 1, it reports the following:

yq01-qianmo-com-127-2-22 12/29 14:21:28 ===> Start testing
ERROR: Unexpected segmentation fault encountered in worker.
Traceback (most recent call last):
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/queues.py", line 113, in get
    return _ForkingPickler.loads(res)
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/multiprocessing/reductions.py", line 289, in rebuild_storage_fd
    fd = df.detach()
  File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/resource_sharer.py", line 57, in detach
    with _resource_sharer.get_connection(self._id) as conn:
  File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/resource_sharer.py", line 87, in get_connection
    c = Client(address, authkey=process.current_process().authkey)
  File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/connection.py", line 492, in Client
    c = SocketClient(address)
  File "/miniconda3/envs/py3-mink/lib/python3.7/multiprocessing/connection.py", line 620, in SocketClient
    s.connect(address)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/miniconda3/envs/py3-mink/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/miniconda3/envs/py3-mink/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/spatiotemporalsegmentation/main.py", line 162, in <module>
    main()
  File "/spatiotemporalsegmentation/main.py", line 157, in main
    test(model, test_data_loader, config)
  File "/spatiotemporalsegmentation/lib/test.py", line 98, in test
    coords, input, target = data_iter.next()
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
    data = self._next_data()
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
    idx, data = self._get_data()
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
    success, data = self._try_get_data()
  File "/miniconda3/envs/py3-mink/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
    raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 5202) exited unexpectedly

I tried this on two computers and with different version combinations:

  • cuda 11.1
  • MinkowskiEngine 0.5.4
  • pytorch 1.9.0

or

  • cuda 10.2
  • MinkowskiEngine 0.4.3
  • pytorch 1.5.0 or 1.7.1 or 1.9.0 or 1.10.2

Could you please tell me which version of MinkowskiEngine I should use?
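
To make the combinations above easier to compare, a small script along these lines could print the versions that are actually imported at runtime in each environment. This is only a sketch: `report_versions` is a hypothetical helper, and it assumes each package exposes a `__version__` attribute (falling back to "unknown" when it does not).

```python
import importlib


def report_versions(module_names):
    """Return {module name: version string} for the given importable modules.

    Hypothetical helper: it assumes a package exposes ``__version__`` and
    reports "unknown" when it does not, or "not installed" when the import fails.
    """
    versions = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            versions[name] = "not installed"
            continue
        versions[name] = getattr(module, "__version__", "unknown")
    return versions


if __name__ == "__main__":
    # The package names below are the ones mentioned in this issue.
    for name, version in report_versions(["torch", "MinkowskiEngine"]).items():
        print(f"{name}: {version}")
```

For the CUDA side, `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the system toolkit.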

I also stepped through the code and found that the problem occurs at line 96 of lib/test.py:

        coords, input, target = data_iter.next()
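
One way to narrow this down further might be to bypass the DataLoader and load samples one at a time in the main process, printing each index before it is loaded. The sketch below is an assumption about how that could be done, not code from the repository: `probe_dataset` is a hypothetical helper, and it only requires an object that supports `len()` and indexing.

```python
def probe_dataset(dataset, indices=None):
    """Load samples one by one and report the first index that fails.

    Returns the first index whose loading raises a Python exception, or
    None if every sample loads. Each index is printed and flushed before
    loading, so if the process dies with a segmentation fault the last
    printed index identifies the sample being loaded at that moment.
    """
    if indices is None:
        indices = range(len(dataset))
    for i in indices:
        print(f"loading sample {i}", flush=True)
        try:
            dataset[i]
        except Exception as exc:
            print(f"sample {i} raised {exc!r}", flush=True)
            return i
    return None


if __name__ == "__main__":
    # Toy stand-in for a real dataset: a plain list loads without errors.
    print(probe_dataset(["a", "b", "c"]))
```

With a PyTorch DataLoader, the underlying dataset is available as its `dataset` attribute, so `probe_dataset(test_data_loader.dataset)` would be the analogous call here. Running the interpreter with `python3 -X faulthandler` additionally prints the Python traceback at the point of a segmentation fault.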

I have been troubled by this problem for several days. Could you please provide me with some ideas to solve this problem?
