sv4d,torch.cuda.OutOfMemoryError #393

Open
shengxiao-zhou opened this issue Aug 1, 2024 · 12 comments

Comments

@shengxiao-zhou

When I ran the SV4D quickstart, I got an out-of-memory error, although the SV3D quickstart ran successfully. Following suggestions from others, I reduced decoding_t from 14 to 1, but it still failed. I am using a 40 GB A100. Has anyone encountered a similar problem and solved it? I would be very grateful for any help.

The full error output is below:

(sv4d) [zhoushengxiao@gpu3 SV4D]$ python scripts/sampling/simple_video_sample_4d.py --input_path assets/test_video1.mp4 --output_folder outputs/sv4d
Reading assets/test_video1.mp4
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
VideoTransformerBlock is using checkpointing
Initialized embedder #0: FrozenOpenCLIPImagePredictionEmbedder with 683800065 params. Trainable: False
Initialized embedder #1: VideoPredictionEmbedderWithEncoder with 83653863 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #3: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #4: ConcatTimestepEmbedderND with 0 params. Trainable: False
Restored from checkpoints/sv3d_p.safetensors with 0 missing and 0 unexpected keys
/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
Initialized embedder #0: FrozenOpenCLIPImagePredictionEmbedder with 683800065 params. Trainable: False
Initialized embedder #1: VideoPredictionEmbedderWithEncoder with 83653863 params. Trainable: False
Initialized embedder #2: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #3: ConcatTimestepEmbedderND with 0 params. Trainable: False
Initialized embedder #4: VideoPredictionEmbedderWithEncoder with 83653863 params. Trainable: False
Initialized embedder #5: VideoPredictionEmbedderWithEncoder with 83653863 params. Trainable: False
Restored from checkpoints/sv4d.safetensors with 0 missing and 0 unexpected keys
Sampling anchor frames [ 4 8 12 16 20]
Traceback (most recent call last):
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/scripts/sampling/simple_video_sample_4d.py", line 236, in <module>
Fire(sample)
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/scripts/sampling/simple_video_sample_4d.py", line 170, in sample
samples = run_img2vid(
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/scripts/demo/sv4d_helpers.py", line 705, in run_img2vid
samples = do_sample(
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/scripts/demo/sv4d_helpers.py", line 764, in do_sample
c, uc = model.conditioner.get_unconditional_conditioning(
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/sgm/modules/encoders/modules.py", line 183, in get_unconditional_conditioning
c = self(batch_c, force_cond_zero_embeddings)
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/sgm/modules/encoders/modules.py", line 132, in forward
emb_out = embedder(batch[embedder.input_key])
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/sgm/modules/encoders/modules.py", line 1019, in forward
out = self.encoder.encode(vid[n * n_samples : (n + 1) * n_samples])
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/sgm/models/autoencoder.py", line 472, in encode
z = self.encoder(x)
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/sgm/modules/diffusionmodules/model.py", line 584, in forward
h = self.down[i_level].block[i_block](hs[-1], temb)
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/sgm/modules/diffusionmodules/model.py", line 134, in forward
h = nonlinearity(h)
File "/mnt/lustre/GPU3/home/zhoushengxiao/workspace/codes/SV4D/sv4d/lib/python3.10/site-packages/sgm/modules/diffusionmodules/model.py", line 49, in nonlinearity
return x * torch.sigmoid(x)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 6.33 GiB (GPU 0; 39.38 GiB total capacity; 31.96 GiB already allocated; 3.20 GiB free; 35.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
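
As a reading aid, the figures in the error message can be compared directly. The sketch below only parses the message text quoted above; the helper name parse_oom_message is hypothetical and not part of PyTorch. It shows that the single 6.33 GiB request is larger than both the free memory and the memory PyTorch has reserved but not allocated. The traceback also ends in the conditioner's encoder (encoder.encode), not in decoding, which may be why lowering decoding_t alone did not help.

```python
import re

# Message text copied from the traceback above.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 6.33 GiB (GPU 0; 39.38 GiB total "
    "capacity; 31.96 GiB already allocated; 3.20 GiB free; 35.65 GiB reserved "
    "in total by PyTorch)"
)


def parse_oom_message(message):
    """Extract the GiB figures from an OOM message in the format shown above.

    Hypothetical helper for illustration; the message format can differ
    between PyTorch versions.
    """
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "total": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
        "reserved": r"([\d.]+) GiB reserved",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find {name!r} in message")
        stats[name] = float(match.group(1))
    return stats


stats = parse_oom_message(OOM_MESSAGE)
# Memory held by PyTorch's caching allocator but not backing live tensors.
cached_unused = stats["reserved"] - stats["allocated"]
# How far the request exceeds the memory reported as free.
shortfall = stats["requested"] - stats["free"]
print(f"reserved but unallocated: {cached_unused:.2f} GiB")
print(f"request exceeds free memory by: {shortfall:.2f} GiB")
```

Because the request is this large relative to what is left, shrinking the tensor passed to the encoder (fewer frames per call or a smaller image size) seems more likely to help than allocator tuning alone, though that is an inference from the numbers rather than something confirmed in this thread.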

@ymxie97
Collaborator

ymxie97 commented Aug 2, 2024

We are currently working on reducing the memory consumption (#394). Will merge to main branch soon. Thanks for your patience!

@tppqt

tppqt commented Aug 3, 2024

WechatIMG1587
Why can't multiple GPUs be used? Could you add this feature?

@tppqt

tppqt commented Aug 3, 2024

By the way, must the resolution be 576x576?

@ymxie97
Collaborator

ymxie97 commented Aug 3, 2024

You can change the image size with --img_size=512.

@ymxie97
Collaborator

ymxie97 commented Aug 3, 2024

Hi, are you using the latest commit? The memory usage should not be that high. The default values of encoding_t and decoding_t have been changed to 8 and 4, respectively, in the latest commit.
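
For readers wondering why smaller encoding_t and decoding_t values lower peak memory: the traceback above shows the encoder being called on slices of the video (vid[n * n_samples : (n + 1) * n_samples]), so only a few frames are processed at a time. The sketch below illustrates that general chunking idea in plain Python. The names process_in_chunks and fake_encode are hypothetical and this is not the repository's implementation; the link between these parameters and the slice size is presumed, not confirmed here.

```python
def process_in_chunks(frames, fn, chunk_size):
    """Apply fn to frames chunk_size items at a time and join the results.

    Peak memory is bounded by one chunk's worth of work instead of the
    whole sequence, at the cost of more calls to fn.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        outputs.extend(fn(chunk))
    return outputs


# Stand-in for an encoder: records how many frames it saw per call.
chunk_sizes = []


def fake_encode(chunk):
    chunk_sizes.append(len(chunk))
    return [frame * 2 for frame in chunk]


frames = list(range(21))
encoded = process_in_chunks(frames, fake_encode, chunk_size=8)
print(chunk_sizes)  # sizes of the chunks passed to the stand-in encoder
```

With 21 frames and a chunk size of 8, the stand-in encoder is called three times; a chunk size of 1 would trade speed for the smallest per-call footprint.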

@tppqt

tppqt commented Aug 3, 2024

OK, I will try it.

@shengxiao-zhou
Author

Thank you very much for your answer. After the code update three days ago, I was able to complete the quickstart successfully. However, I ran into another problem: why are my inference results so poor? I used a video from the assets/sv4d_videos directory you provided as input and ran SV4D inference. The output video does not match what is shown on the GitHub project homepage and is noticeably less sharp. Are further settings needed?

000004_diag.mp4

@ymxie97
Collaborator

ymxie97 commented Aug 6, 2024

@shengxiao-zhou Thank you for your interest in SV4D. The website showcases the 4D results of this case rather than the novel view video results, and there are some differences between them.

I suggest adjusting image_frame_ratio (--image_frame_ratio=) or increasing the number of denoising steps (--num_steps=) to see whether this improves the results.
----Update----
We tried --image_frame_ratio=0.99 and --num_steps=40:

sv4d_final.7.mp4

@DevikalyanDas

Hello @ymxie97 ,

How can I obtain the camera parameters (extrinsics and intrinsics) for the generated videos?

Best,
Devi

@ymingxie

ymingxie commented Aug 9, 2024

@DevikalyanDas The extrinsics can be obtained from the polar and azimuth angles (https://github.com/Stability-AI/generative-models/blob/main/scripts/sampling/simple_video_sample_4d.py#L119-L122). You can set the camera distance to 2 and fov_degree to 33.9.

@DevikalyanDas

Hello @ymingxie, thanks for your reply. From fov_degree I can obtain the intrinsics (assuming c_x and c_y are at the image center, (576/2, 576/2)). For the extrinsics, which camera coordinate system is used? I need it to build the look_at matrix that gives the world-to-camera view transformation.
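
The intrinsics computation mentioned here can be sketched as follows, using the standard pinhole relation f = (size / 2) / tan(fov / 2). It assumes a square 576x576 image, a field of view that spans the full image side, square pixels, and a principal point at the image center. Those are assumptions for illustration, not details confirmed by the maintainers, and intrinsics_from_fov is a hypothetical helper name.

```python
import math


def intrinsics_from_fov(fov_degree, image_size):
    """Build a 3x3 pinhole intrinsic matrix for a square image.

    Assumes the field of view covers the full image side and the
    principal point sits at the image center.
    """
    half_size = image_size / 2.0
    focal = half_size / math.tan(math.radians(fov_degree) / 2.0)
    return [
        [focal, 0.0, half_size],
        [0.0, focal, half_size],
        [0.0, 0.0, 1.0],
    ]


K = intrinsics_from_fov(fov_degree=33.9, image_size=576)
for row in K:
    print(row)
```

If a different output size is used (for example via --img_size=512), the focal length in pixels and the principal point scale with the image size under the same assumptions.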

@ymingxie

ymingxie commented Aug 10, 2024

@DevikalyanDas We used the OpenCV camera coordinate system (right-handed, +z forward, +x right).
To convert spherical into Cartesian coordinates, you can check this, which gives the camera position in Cartesian coordinates.
Then you can get the rotation as follows (assuming up_vector is (0, 0, 1) and the target position is (0, 0, 0)):
z_axis = camera_position - target_position
z_axis /= norm(z_axis)
x_axis = cross(up_vector, z_axis)
x_axis /= norm(x_axis)
y_axis = cross(z_axis, x_axis)
y_axis /= norm(y_axis)
Camera matrix = [
[x_axis[0], y_axis[0], z_axis[0], camera_position[0]],
[x_axis[1], y_axis[1], z_axis[1], camera_position[1]],
[x_axis[2], y_axis[2], z_axis[2], camera_position[2]],
[0, 0, 0, 1],
]
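
The recipe above can be written as a small runnable sketch in plain Python. The spherical convention used here (polar angle measured from +z, azimuth measured in the xy-plane from +x) is an assumption, since the reference linked in the comment is not preserved, and the function names are hypothetical. The matrix is assembled exactly as in the pseudocode: its columns are the camera axes and the camera position expressed in world coordinates.

```python
import math


def spherical_to_cartesian(radius, polar_rad, azimuth_rad):
    """Assumed convention: polar from +z, azimuth from +x in the xy-plane."""
    return [
        radius * math.sin(polar_rad) * math.cos(azimuth_rad),
        radius * math.sin(polar_rad) * math.sin(azimuth_rad),
        radius * math.cos(polar_rad),
    ]


def normalize(v):
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def cross(a, b):
    return [
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    ]


def look_at_matrix(camera_position, target=(0.0, 0.0, 0.0), up=(0.0, 0.0, 1.0)):
    """Follow the pseudocode above: axes as columns, position as last column.

    Breaks down when the camera sits directly above or below the target,
    because cross(up, z_axis) is then the zero vector.
    """
    z_axis = normalize([c - t for c, t in zip(camera_position, target)])
    x_axis = normalize(cross(up, z_axis))
    y_axis = normalize(cross(z_axis, x_axis))
    return [
        [x_axis[0], y_axis[0], z_axis[0], camera_position[0]],
        [x_axis[1], y_axis[1], z_axis[1], camera_position[1]],
        [x_axis[2], y_axis[2], z_axis[2], camera_position[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


# Camera at distance 2 on the +x axis (polar 90 degrees, azimuth 0).
position = spherical_to_cartesian(2.0, math.radians(90.0), 0.0)
pose = look_at_matrix(position)
for row in pose:
    print([round(value, 6) for value in row])
```

Note that z_axis as written points from the target toward the camera. Whether a sign flip is needed to match the stated +z-forward convention is not settled in this thread, so it is worth checking against a rendered frame before relying on the matrix.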
