Training setting #7

yongjinColinChoi · 2024-11-02T01:49:35Z

Hello, I have some questions about the training settings.

In the inference code, there is a line that says:

conditional_latents_mask = mask_token.repeat(bsz_cfg, num_frames-2, 1, latent_h, latent_w)

It seems like two batches were used for CFG, but instead of using 0 for the unconditional part, the same values as the conditional part were repeated. Is there a specific reason for this approach? Was the model trained entirely with conditional training without any separate unconditional training?

Also, in the original SVD Xtend code, a learning rate of 1e-5 is typically used, but the Framer paper mentions using a learning rate of 1e-4. Is there a specific reason for this difference?
The SVD pretrained model used here generates 25 frames at a resolution of 1024x576, but isn’t there also a model that generates 14 frames at 512x320? The frame setting seems closer to the latter; is there a reason for choosing the former model?

encounter1997 · 2024-11-02T03:12:18Z

Hi, thanks for your interest!

The mask token is not crucial for video interpolation. We added it because we believe it will be useful when fine-tuning the model for related tasks such as video inpainting. The model is trained with conditioning_dropout_prob=0.1. For the unconditional branch in CFG, the start and end image VAE latents are not concatenated for conditioning.
We trained the model with a learning rate of 1e-5, and the first training stage takes 100k iterations. We are sorry for the mistake and will update the arXiv paper ASAP.
The model was trained at a lower resolution to reduce training cost. We are now working on training at the 1024x576 resolution.
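
For readers following along, here is a minimal sketch of how conditioning dropout during training and the two-branch CFG mix at inference fit together. It is not the Framer implementation: it assumes that "not concatenated" can be approximated by zeroing the conditioning values, as InstructPix2Pix-style training scripts commonly do, and it uses plain Python lists in place of tensors. The function names `sample_keep_mask`, `apply_conditioning_dropout`, and `cfg_combine` are hypothetical.

```python
import random


def sample_keep_mask(batch_size, drop_prob, rng):
    """Return one flag per sample: 0.0 drops the conditioning, 1.0 keeps it."""
    return [0.0 if rng.random() < drop_prob else 1.0 for _ in range(batch_size)]


def apply_conditioning_dropout(cond_latents, keep_mask):
    """Zero the conditioning values of every sample whose flag is 0.0.

    Assumption: zeroing stands in for 'not concatenating' the start/end latents.
    """
    return [[keep * v for v in sample] for sample, keep in zip(cond_latents, keep_mask)]


def cfg_combine(pred_uncond, pred_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward the conditional one by guidance_scale."""
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]


# Training side: roughly 10% of samples lose their conditioning
# (drop_prob mirrors conditioning_dropout_prob=0.1 from the reply above).
rng = random.Random(0)
cond = [[0.5, -1.0], [2.0, 0.25], [1.5, 1.5], [-0.75, 3.0]]  # toy values
keep = sample_keep_mask(len(cond), drop_prob=0.1, rng=rng)
cond_after_dropout = apply_conditioning_dropout(cond, keep)

# Inference side: combine the two branch outputs (toy numbers).
guided = cfg_combine([1.0, 2.0], [3.0, 6.0], guidance_scale=2.0)
```

Because some training samples see no conditioning, the same network can serve as the unconditional branch at inference, which is why no separately trained unconditional model is needed.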

Please let us know if you have further questions.

yongjinColinChoi · 2024-11-02T03:19:18Z

Thanks for your reply!

In your second point, what does “first stage” refer to? The paper mentions that during UNet training, only the temporal layer and input convolution were trained initially before training ControlNet. Does “first stage” mean UNet training? Or is the UNet training itself divided into multiple stages?

encounter1997 · 2024-11-02T03:21:34Z

The "first stage" means UNet training, while the "second stage" refers to controlling branch training. The model is trained only for these two stages.
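
The two-stage schedule described in this thread can be written down as a small config sketch. Only the values stated here are filled in; anything the thread does not specify is left as `None` rather than guessed. The `TrainingStage` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TrainingStage:
    name: str
    trains: str                     # which part of the model is optimized
    learning_rate: Optional[float]  # None = not stated in this thread
    iterations: Optional[int]       # None = not stated in this thread


STAGES = [
    # Stage 1: UNet training, 100k iterations, learning rate 1e-5 (from the thread).
    TrainingStage("stage 1", "UNet", 1e-5, 100_000),
    # Stage 2: controlling branch training; hyperparameters are not given here.
    TrainingStage("stage 2", "controlling branch", None, None),
]


def unspecified_fields(stage):
    """List the hyperparameters that still need to be looked up for a stage."""
    return [f for f in ("learning_rate", "iterations") if getattr(stage, f) is None]
```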

yongjinColinChoi · 2024-11-04T03:06:20Z

Thank you.

I’m training based on Hugging Face’s Framer inference code and SVD Xtend, but after around 2000 steps, the results become completely distorted, and there’s a loss explosion. I’m wondering if there are any setting changes, aside from architecture, made in SVD Xtend that I should be aware of.

encounter1997 · 2024-11-04T11:22:57Z

We did not encounter this problem during model fine-tuning. Here are several training details:

Enable xformers
Use fp16
Only fine-tune the temporal modules in UNet
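
As an illustration of the last point, the sketch below picks trainable parameters by name. The keyword "temporal" and the example parameter names are assumptions about how the UNet names its modules, and "conv_in" is included only because the question earlier in the thread mentions that the paper also trains the input convolution. Check the names against your own model before relying on them.

```python
def select_trainable(param_names, keywords=("temporal", "conv_in")):
    """Return the parameter names that contain any of the given keywords."""
    return [n for n in param_names if any(k in n for k in keywords)]


# Hypothetical parameter names, for illustration only.
example_names = [
    "conv_in.weight",
    "down_blocks.0.attentions.0.transformer_blocks.0.attn1.to_q.weight",
    "down_blocks.0.attentions.0.temporal_transformer_blocks.0.attn1.to_q.weight",
    "down_blocks.0.resnets.0.temporal_res_block.conv1.weight",
    "conv_out.weight",
]

trainable = select_trainable(example_names)
for name in trainable:
    print(name)
```

With a PyTorch model, the same selection would typically be applied by iterating over `named_parameters()` and calling `requires_grad_(True)` only on the selected names, with every other parameter frozen.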
