
Training setting #7

Open
yongjinColinChoi opened this issue Nov 2, 2024 · 5 comments

Comments

@yongjinColinChoi

Hello, I have some questions about the training settings.

  1. In the inference code, there is a line that says (see the sketch after this list):

conditional_latents_mask = mask_token.repeat(bsz_cfg, num_frames-2, 1, latent_h, latent_w)

It seems that two batches are used for CFG, but instead of using zeros for the unconditional part, the same values as the conditional part are repeated. Is there a specific reason for this approach? Was the model trained entirely with conditional training, without any separate unconditional training?

  2. Also, the original SVD Xtend code typically uses a learning rate of 1e-5, but the Framer paper mentions a learning rate of 1e-4. Is there a specific reason for this difference?

  3. The SVD pretrained model used here generates 25 frames at a resolution of 1024x576, but isn't there also a model that generates 14 frames at 512x320? The frame setting seems closer to the latter; is there a reason for choosing the former?
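For concreteness, a minimal sketch of the two constructions being contrasted in question 1. The shapes, the channel count, and the zero-initialized `mask_token` are illustrative assumptions; in the actual code `mask_token` is presumably a learned parameter:

```python
import torch

# Illustrative shapes (assumptions, not Framer's actual values).
bsz, num_frames, latent_h, latent_w = 1, 14, 40, 72
latent_c = 4
bsz_cfg = bsz * 2  # conditional + unconditional halves for CFG

# Presumably a learned token of shape (1, 1, C, 1, 1); zeros here for illustration.
mask_token = torch.zeros(1, 1, latent_c, 1, 1)

# As in the inference code: the same mask token fills both CFG halves.
conditional_latents_mask = mask_token.repeat(
    bsz_cfg, num_frames - 2, 1, latent_h, latent_w
)

# The alternative the question asks about: zeros for the unconditional half.
cond_half = mask_token.repeat(bsz, num_frames - 2, 1, latent_h, latent_w)
alt_mask = torch.cat([torch.zeros_like(cond_half), cond_half], dim=0)
```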

@encounter1997
Member

encounter1997 commented Nov 2, 2024

Hi, thanks for your interest!

  1. The mask token is not crucial for video interpolation. We added it because we believe it would be useful when fine-tuning the model for related tasks such as video inpainting. The model is trained with conditioning_dropout_prob=0.1; for the unconditional branch in CFG, the start- and end-image VAE latents are not concatenated for conditioning (see the sketch after this list).
  2. We trained the model with a learning rate of 1e-5, and the first training stage takes 100k iterations. We apologize for the mistake and will update the arXiv paper as soon as possible.
  3. The model was trained at a lower resolution to reduce the training cost. We are now working on training at the 1024x576 resolution.
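For reference, a minimal sketch of what conditioning dropout with conditioning_dropout_prob=0.1 typically looks like in SVD Xtend-style training loops; the function name and masking details are assumptions, not Framer's actual implementation:

```python
import torch

conditioning_dropout_prob = 0.1  # as stated above

def maybe_drop_conditioning(cond_latents, prob=conditioning_dropout_prob):
    """With probability `prob` per sample, zero the image-conditioning latents
    so the model also learns the unconditional branch used by CFG at inference.
    (Sketch only; the actual Framer training code may mask differently.)"""
    bsz = cond_latents.shape[0]
    keep = (torch.rand(bsz, device=cond_latents.device) >= prob)
    keep = keep.view(bsz, *([1] * (cond_latents.dim() - 1)))
    return cond_latents * keep.to(cond_latents.dtype)
```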

Please let us know if you have further questions.

@yongjinColinChoi
Author

Thanks for your reply!

Regarding your second point, what does "first stage" refer to? The paper mentions that during UNet training, only the temporal layers and the input convolution are trained before the ControlNet is trained. Does "first stage" mean the UNet training? Or is the UNet training itself divided into multiple stages?

@encounter1997
Member


The "first stage" means UNet training, while the "second stage" refers to controlling branch training. The model is trained only for these two stages.

@yongjinColinChoi
Author

Thank you.

I'm training based on Hugging Face's Framer inference code and SVD Xtend, but after around 2,000 steps the results become completely distorted and the loss explodes. I'm wondering whether there are any training-setting changes relative to SVD Xtend, aside from the architecture, that I should be aware of.

@encounter1997
Member


We did not encounter this problem during model fine-tuning. Here are several training details (see the sketch after this list):

  1. Enable xformers.
  2. Use fp16.
  3. Fine-tune only the temporal modules in the UNet.
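A minimal sketch of these three settings in a diffusers/accelerate training script; the checkpoint id is illustrative, and selecting temporal modules by the substring "temporal" is an assumption about the UNet's parameter names:

```python
import torch
from accelerate import Accelerator
from diffusers import UNetSpatioTemporalConditionModel

# 2. fp16 mixed precision via accelerate (as in SVD Xtend).
accelerator = Accelerator(mixed_precision="fp16")

# Checkpoint id is illustrative; use the weights you are actually fine-tuning.
unet = UNetSpatioTemporalConditionModel.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid", subfolder="unet"
)

# 1. Memory-efficient attention (requires the xformers package).
unet.enable_xformers_memory_efficient_attention()

# 3. Fine-tune only the temporal modules; substring matching is an assumption.
params_to_train = []
for name, param in unet.named_parameters():
    param.requires_grad = "temporal" in name
    if param.requires_grad:
        params_to_train.append(param)

optimizer = torch.optim.AdamW(params_to_train, lr=1e-5)  # LR per this thread
unet, optimizer = accelerator.prepare(unet, optimizer)
```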
