Impractical GPU memory requirements #43
Comments
Hey @SebastienTs, there are a number of reasons the memory usage is often way more than you might expect:
I would suggest you try pad_mode='2357'. Apart from that, the only other practical option is to chunk the arrays as in Tile-by-tile deconvolution using dask.ipynb.
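A minimal sketch of where that padding option goes, using the `fd_restoration.RichardsonLucyDeconvolver` call mentioned in the reply below (`img` and `psf` are placeholder arrays and the iteration count is illustrative, not values from this thread):

```python
import numpy as np
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

# Placeholder volumes standing in for the real 1024x1024x19 image and 32x32x16 PSF.
img = np.random.rand(19, 1024, 1024).astype(np.float32)
psf = np.random.rand(16, 32, 32).astype(np.float32)

# pad_mode='2357' pads each axis to the next 2/3/5/7-smooth length rather than
# the next power of two, which can shrink the FFT work buffers.
algo = fd_restoration.RichardsonLucyDeconvolver(n_dims=3, pad_mode='2357').initialize()
res = algo.run(fd_data.Acquisition(data=img, kernel=psf), niter=25)
result = res.data
```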
Thanks a lot for your reply! I had 1, 2 and 5 in mind, but even then, do you really believe that 3 and 4 could explain the remaining 30x memory overhead (from 270 MB to 8 GB)? If that is the case I can sleep peacefully, but it sounds like a real lot to me and I want to make sure that nothing is misconfigured or extremely suboptimal for the TensorFlow version I am using. I have not seen any noticeable reduction in memory usage by using pad_mode='2357' when invoking fd_restoration.RichardsonLucyDeconvolver. I would happily consider the cucim alternative that is recommended, but unfortunately my code needs to run on a Windows box.
Hm, well, 10x wouldn't surprise me too much, but 30x does seem extreme. When it comes to potential TF issues I really have no idea. You should take a look at this too if you haven't seen it: #42 (comment). Some of those alternatives to this library may be Windows-friendly.
Have a look in my repo: I basically use dask to divide the images and assemble them again when the GPU memory is not enough. This is the Bio-Formats version (older, might have some tweaks to be done). They should be able to run on Google Colaboratory if you'd like to tweak around. You also need the libraries at: Hope it helps. I can do 2048x1024, times two, on my 6 GB laptop. The other option is to add the "RAM option" that shares RAM and vRAM; it's still a lot faster than plain RAM only.
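For reference, a rough sketch of the tile-and-stitch idea with dask (this is not the code from that repo; the chunk sizes, overlap depth, and iteration count are illustrative assumptions):

```python
import numpy as np
import dask.array as da
from flowdec import data as fd_data
from flowdec import restoration as fd_restoration

img = np.random.rand(19, 2048, 2048).astype(np.float32)  # placeholder volume
psf = np.random.rand(16, 32, 32).astype(np.float32)      # placeholder PSF

algo = fd_restoration.RichardsonLucyDeconvolver(n_dims=3, pad_mode='2357').initialize()

def deconv_tile(tile):
    # Each tile is deconvolved independently, so only one tile's FFT buffers
    # live on the GPU at a time.
    return algo.run(fd_data.Acquisition(data=tile, kernel=psf), niter=25).data

# Keep Z whole and split XY into tiles small enough for the available GPU memory.
tiles = da.from_array(img, chunks=(19, 512, 512))

# Overlap neighbouring tiles by roughly the PSF extent to hide seams, then stitch.
result = tiles.map_overlap(
    deconv_tile, depth=(0, 16, 16), boundary='reflect', dtype=np.float32
).compute(scheduler='single-threaded')
```

The single-threaded scheduler is used here so that several tiles do not compete for the GPU at the same time.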
While indeed extremely fast, the GPU memory requirement is impractical on my setup: about 8 GB for a 1024x1024x19 image (16-bit) and a tiny 32x32x16 PSF. For images slightly above 1024x1024 (same number of Z slices), I can only run the code on an RTX 3090 (24 GB)!
The problem seems to stem from the FFT CUDA kernel. The error reported is:
tensorflow/stream_executor/cuda/cuda_fft.cc:253] failed to allocate work area.
tensorflow/stream_executor/cuda/cuda_fft.cc:430] Initialize Params: rank: 3 elem_count: 32 input_embed: 32 input_stride: 1 input_distance: 536870912 output_embed: 32 output_stride: 1 output_distance: 536870912 batch_count: 1
tensorflow/stream_executor/cuda/cuda_fft.cc:439] failed to initialize batched cufft plan with customized allocator:
Something is probably not right in the code... does anybody know of a workaround?
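One general TensorFlow-side mitigation worth trying (a sketch assuming TF 2.x; it is not confirmed to fix this particular cuFFT workspace failure) is to enable on-demand GPU memory growth so TensorFlow does not reserve nearly the whole card up front:

```python
import tensorflow as tf

# Must run before any GPU op is executed. Ask TensorFlow to grow its GPU
# allocation on demand instead of grabbing most of the card at startup,
# leaving more headroom for the cuFFT work area.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
```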