Dear @xinntao,

I have two questions regarding running the EDVR model.
I realized that every time I submit the job to a different GPU (even if it is of the same type, e.g. Titan X), I have to run `rm -rf build/` and `python setup.py develop` again; otherwise I get `error in modulated_deformable_im2col_cuda; no kernel image is available for execution on the device`. I suspect this has something to do with the develop-mode (dynamic) installation. Right now I keep three copies of the same repo in order to run three jobs simultaneously. Is this the way to go, or is there a better option?
I followed one of the suggestions in another post and installed PyTorch 1.4 and torchvision 0.5 with cudatoolkit 10.1.
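For context, here is a minimal sketch (assuming a standard PyTorch install) that prints the compute capability of each visible GPU. My understanding is that the compiled deformable-convolution extension needs a kernel image for that architecture, and that building once with `TORCH_CUDA_ARCH_LIST` covering all target architectures (e.g. `TORCH_CUDA_ARCH_LIST="6.1;7.0;7.5" python setup.py develop`) might avoid the per-GPU rebuilds, though I have not verified this:

```python
import torch

# Print the compute capability of every visible GPU. The compiled
# deformable-conv CUDA extension must contain a kernel image for the
# architecture of the GPU the job lands on; otherwise the
# "no kernel image is available for execution on the device" error is raised.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(i)
        name = torch.cuda.get_device_name(i)
        print(f'GPU {i}: {name}, compute capability {major}.{minor}')
else:
    print('CUDA is not available on this node')
```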
I always get the following error when `cv2.imdecode()` returns None, and sometimes I even get an explicit PNG CRC error. It looks as if the training PNGs are somehow corrupted, even though I verified all images before training. Did you encounter this problem before? Is it related to multi-process data loading? It happens every time, especially when I turn off the TSA module and set the number of frames to 1.
Traceback (most recent call last):
  File "basicsr/train.py", line 252, in <module>
    main()
  File "basicsr/train.py", line 234, in main
    train_data = prefetcher.next()
  File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/data/prefetch_dataloader.py", line 76, in next
    return next(self.loader)
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 838, in _next_data
    return self._process_data(data)
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/_utils.py", line 394, in reraise
    raise self.exc_type(msg)
AttributeError: Caught AttributeError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/itet-stor/huangsha/net_scratch/conda_envs/test/lib/python3.7/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/data/moving_cityscape_dataset.py", line 147, in __getitem__
    img_gt = imfrombytes(img_bytes, float32=True)
  File "/scratch_net/biwidl216/huangsha/BasicSR_1/basicsr/utils/img_util.py", line 125, in imfrombytes
    img = img.astype(np.float32) / 255.
AttributeError: 'NoneType' object has no attribute 'astype'

/scratch/slurm/spool/job219938/slurm_script: line 31: 16770 Bus error    python -u basicsr/train.py -opt options/train/EDVR/train_EDVR_DARK_20_frame_window_1_patch_64.yml
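For reference, here is a minimal sketch of how I re-verify the PNGs using the same decode path as `imfrombytes()` (the dataset root below is a placeholder for my actual path); any file that `cv2.imdecode()` cannot decode is reported:

```python
import cv2
import numpy as np
from pathlib import Path

# Re-decode every PNG with cv2.imdecode(), the same call used by imfrombytes(),
# and report files that fail to decode (imdecode returns None on corrupt data).
dataset_root = Path('datasets/moving_cityscapes')  # placeholder path
for png_path in sorted(dataset_root.rglob('*.png')):
    buf = np.fromfile(str(png_path), dtype=np.uint8)
    img = cv2.imdecode(buf, cv2.IMREAD_UNCHANGED)
    if img is None:
        print(f'Failed to decode: {png_path}')
```

All files pass this check before training, which is why I suspect the problem is related to reading under multi-process data loading rather than to the files on disk.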
Thank you very much for your help :) and best wishes!