Apparent locking issues when running across multiple GPUs #283
Comments
I've been using Claude Opus for AI Python coding on complex tasks, and I frequently just throw the entire codebase (as one file) in as context and ask the AI to find and fix problems. Are you using the GPU context because it's faster? Are you on Intel? I'm looking to do some processing on 35,000 videos, and currently it's taking 1-5 minutes per video. Claude Opus may have found your locking problem here.
Oh, I've been meaning to do a proper writeup on video decoding for a few months now and just haven't had the time. Quick notes for now on what we learned from processing video at reasonably large scale (millions of files / billions of frames / a few hundred TB of data):
Throwing the whole repo into an LLM is a technique I hadn't thought of, so it's interesting to see what it came back with! Realistically I'm not likely to get the time to dive into the decord codebase and see how accurate the suggestions are, but if it were possible to get the threading and context locking sorted out, that would be a big win for the library.
I've noticed an interesting issue when running on multi-GPU machines: although selecting `gpu(N)` as the decoding context initially works as expected, overall throughput when running multiple processes drops off very rapidly until only one process shows activity on a single GPU, sometimes with occasional very short bursts of processing from the others. This happens even when the processes are totally independent (started separately from different `screen` sessions, operating on entirely different files, and using separate GPUs), which leads me to think a hardware- or system-level locking mechanism is being applied globally rather than per-process, since it occurs even between separate Python instances.

My working theory is that it could be falling through to a global lock of some kind due to setting `decoder_info_.vidLock = nullptr;`, but so far that hasn't brought us closer to a fix. It would be very helpful to hear whether anyone else has (or hasn't!) run into similar issues. Possibly related to #187 and/or #159?
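Not from the issue itself, but a common way to rule out cross-device context sharing when debugging this kind of symptom is to pin each independent process to a single GPU with `CUDA_VISIBLE_DEVICES`, so every worker sees exactly one device as `gpu(0)`. A minimal sketch, in which the `worker.py` script name and its arguments are hypothetical:

```python
import os
import subprocess
import sys

def worker_env(gpu_id):
    """Build an environment in which a child process sees only one GPU.

    With CUDA_VISIBLE_DEVICES restricted to a single device, the
    worker's gpu(0) context maps to that physical GPU, so any
    remaining stall would have to come from a truly global
    (driver- or library-level) lock rather than shared per-device state.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return env

def launch_workers(num_gpus, file_lists):
    """Spawn one isolated decoding process per GPU.

    worker.py is a hypothetical script that would open its files
    with a gpu(0) decoding context.
    """
    procs = []
    for gpu_id in range(num_gpus):
        procs.append(subprocess.Popen(
            [sys.executable, "worker.py", *file_lists[gpu_id]],
            env=worker_env(gpu_id),
        ))
    return procs
```

If throughput still collapses to a single active process even under this isolation, that would point more strongly at a global lock of the kind suspected around `decoder_info_.vidLock`.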