
RuntimeError: Failed to fetch video idx 168596 from /data/k400/train/salsa_dancing/EY6MSW3zkr8_000048_000058.avi; after 99 trials #558

Open
Christinepan881 opened this issue Jun 16, 2022 · 12 comments

Comments

@Christinepan881

When I use the MViT config to run the code on the K400 dataset, I get the following errors:
...
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 3
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 5
Failed to decode video idx 72108 from /data/k400/train/filling_eyebrows/1m50SSGbG2k_000148_000158.avi; trial 99
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 15
Failed to decode video idx 139676 from /data/k400/train/playing_paintball/coNWv_D7Fyk_000135_000145.avi; trial 95
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 6
Failed to decode video idx 205437 from /data/k400/train/taking_a_shower/U540GFOTF6U_000002_000012.avi; trial 99
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 16
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 7
Failed to decode video idx 154000 from /data/k400/train/punching_bag/BNwpN8GFixE_000010_000020.avi; trial 0
Failed to decode video idx 139676 from /data/k400/train/playing_paintball/coNWv_D7Fyk_000135_000145.avi; trial 96
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 4
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 17
Failed to decode video idx 138602 from /data/k400/train/playing_monopoly/Hn_o3mu9peY_000040_000050.avi; trial 8
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 5
Failed to decode video idx 86337 from /data/k400/train/headbanging/c6JhdcwPHQU_000002_000012.avi; trial 97
Failed to decode video idx 170537 from /data/k400/train/scuba_diving/dQQK-KSp_pE_000044_000054.avi; trial 18
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 6
Failed to decode video idx 204993 from /data/k400/train/tai_chi/qV7j-jQCH3M_000027_000037.avi; trial 0
Failed to decode video idx 31483 from /data/k400/train/changing_oil/csJFMaPl9Og_000370_000380.avi; trial 7
Failed to decode video idx 86337 from /data/k400/train/headbanging/c6JhdcwPHQU_000002_000012.avi; trial 98
Traceback (most recent call last):
File "tools/run_net.py", line 45, in
main()
File "tools/run_net.py", line 26, in main
launch_job(cfg=cfg, init_method=args.init_method, func=train)
File "/data/home/SlowFast/slowfast/utils/misc.py", line 296, in launch_job
torch.multiprocessing.spawn(
File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 2 terminated with the following error:
Traceback (most recent call last):
File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/data/home/SlowFast/slowfast/utils/multiprocessing.py", line 60, in run
ret = func(cfg)
File "/data/home/SlowFast/tools/train_net.py", line 708, in train
train_epoch(
File "/data/home/SlowFast/tools/train_net.py", line 86, in train_epoch
for cur_iter, (inputs, labels, index, time, meta) in enumerate(
File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in next
data = self._next_data()
File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1203, in _next_data
return self._process_data(data)
File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1229, in _process_data
data.reraise()
File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/_utils.py", line 434, in reraise
raise exception
RuntimeError: Caught RuntimeError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 287, in _worker_loop
data = fetcher.fetch(index)
File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/data/home/miniconda/envs/test0/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 49, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/data/home/SlowFast/slowfast/datasets/kinetics.py", line 488, in getitem
raise RuntimeError(
RuntimeError: Failed to fetch video idx 168596 from /data/k400/train/salsa_dancing/EY6MSW3zkr8_000048_000058.avi; after 99 trials

I have checked the data paths, and there is no problem with them.

Does anyone know the reason?
Thanks!

@kkk55596

Hi, have you solved this problem yet? I am also running into it.

@alpargun

alpargun commented Jul 14, 2022

This is due to the torchvision backend used for video decoding. Some people mentioned that building torchvision from source solves this issue; however, I haven't been able to fix it that way yet.
This issue already discusses the problem, and a possible solution is to switch the video decoding backend to PyAV instead. In the YAML config file, you can add:

DATA:
  DECODING_BACKEND: pyav

to switch to the PyAV backend. However, the PyAV backend introduces another error related to changed data types, caused by a recent commit; this pull request already solves that problem. I applied the changes from that pull request and can now run the framework with the PyAV backend.
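
For reference, here is a minimal sketch (not part of SlowFast; the path is just the one from the error log) of how you could check with PyAV directly which clips fail to decode, before switching the backend:

# Minimal check of which clips PyAV can decode; assumes the `av` package is installed.
import av

video_paths = [
    "/data/k400/train/salsa_dancing/EY6MSW3zkr8_000048_000058.avi",  # placeholder path from the log
]

for path in video_paths:
    try:
        with av.open(path) as container:
            n_frames = sum(1 for _ in container.decode(video=0))
        print(f"OK    {path} ({n_frames} frames)")
    except Exception as exc:  # PyAV raises decode/IO errors here for broken clips
        print(f"FAIL  {path}: {exc}")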

@haooooooqi
Contributor

Thanks for playing with pysf.
You might get the issue fixed if you preprocess the video to the same format?
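
For anyone who wants to try that suggestion, here is a minimal sketch (not part of SlowFast) that re-encodes all clips to a uniform H.264/MP4 format; it assumes ffmpeg is available on PATH, and the directories are placeholders:

# Re-encode every .avi clip under src_dir to H.264 .mp4 under dst_dir.
# Assumes ffmpeg is installed and on PATH; adjust the placeholder paths.
import subprocess
from pathlib import Path

src_dir = Path("/data/k400/train")
dst_dir = Path("/data/k400_mp4/train")

for src in src_dir.rglob("*.avi"):
    dst = dst_dir / src.relative_to(src_dir).with_suffix(".mp4")
    dst.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-c:v", "libx264", "-pix_fmt", "yuv420p", "-an", str(dst)],
        check=False,  # keep going even if a single clip fails to transcode
    )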

@kkk55596

kkk55596 commented Aug 3, 2022

I solved this problem by re-installing torchvision from source.
After that, I can use the following setting:

DATA:
  DECODING_BACKEND: torchvision
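
If you take this route, a quick sanity check like the sketch below (not part of SlowFast; the path is a placeholder from the error log) can confirm that the rebuilt torchvision can actually read one of the failing clips before you launch a full training run:

# Verify that torchvision's video reader can decode a clip.
# Assumes torchvision was built with video support; the path is a placeholder.
import torchvision
from torchvision.io import read_video

torchvision.set_video_backend("video_reader")  # the backend compiled when building from source
video, audio, info = read_video(
    "/data/k400/train/salsa_dancing/EY6MSW3zkr8_000048_000058.avi",
    pts_unit="sec",
)
print(video.shape, info)  # expect a (T, H, W, C) tensor and the clip's fps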

@alpargun

alpargun commented Aug 5, 2022

Which torch and torchvision versions are you using? Thanks!

@poincarelee

The pull request you mentioned did solve the problem.
However, I ran into another issue: the top-1 error (and the top-5 error as well) does not decrease steadily. In one epoch the top-1 error was 37.5%, while in a later epoch it rose to 50%, and the final top-1 accuracy is 42.14% (top-5: 72.81%), which is much lower than reported in the paper, as shown below:

[screenshot of training results]
I trained X3D on the HMDB51 dataset.
Is there anything wrong with the training code?

@alpargun

alpargun commented Sep 9, 2022

I haven't trained on the HMDB51 dataset yet, but I see two possibilities:

  • Check the paper again to see whether they pre-trained on another dataset to obtain the published results.
  • HMDB51 does not have a config file in the SlowFast repo, so the parameter values in your config can affect performance; for the other datasets, SlowFast already provides configs tuned for the corresponding data.

@poincarelee

You are right.
The Kinetics and AVA datasets are preferred. I referred to another dataset's config file (Kinetics') and adapted it for HMDB51. K400 is quite a bit larger, so training takes much longer. I am now working on K400 and use about 10% of it for training, which still takes about 3 days.

@poincarelee

poincarelee commented Sep 16, 2022

@alpargun
Hi, I have trained on the K400 dataset, but the top-1 and top-5 errors seem weird.

[screenshot of training log]
As shown in the picture above, at epoch 105 the top-1 error is still 81.25% on some batches, while on others it is 56% or 43%. Most batches within an epoch are near 50%, but there are always a few at 80% or 70%; the top-5 error also fluctuates but does not show such a trend. Have you encountered this problem before?

@Patrick-CH

Thanks for playing with pysf. You might get the issue fixed if you preprocess the video to the same format?

I have tried that. Even after preprocessing the videos to the same .mp4 format, the problem still exists.

@alpargun

Hi, you might find the INSTALL.md file in my SlowFast fork useful for updated installation steps. I would suggest PyTorch <= 1.13.1, as I had similar problems with 2.0.

Following the INSTALL.md file, I suggest installing PyTorch together with TorchVision. I recently set up SlowFast on multiple Ubuntu 20.04 machines and a MacBook following this updated INSTALL.md, and had no problems.
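
As a quick sanity check after installing, a short snippet like this sketch prints the installed versions and confirms that a video backend can be selected:

# Print the installed torch/torchvision versions and check the video backend.
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)

# Raises an error if PyAV is not available in this environment.
torchvision.set_video_backend("pyav")
print("video backend:", torchvision.get_video_backend())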

@ConvAndConv

I face the same issue with torch==2.0.0 and torchvision==0.15.1, using the Kinetics config slowfast_8x8_r50.yaml. How can I fix it without downgrading the torch version? Thanks!
