Training setting #7
Comments
Hi, thanks for your interest!
Please let us know if you have further questions.
Thanks for your reply! In your second point, what does “first stage” refer to? The paper mentions that during UNet training, only the temporal layer and input convolution were trained initially before training ControlNet. Does “first stage” mean UNet training? Or is the UNet training itself divided into multiple stages?
The "first stage" means UNet training, while the "second stage" refers to controlling branch training. The model is trained only for these two stages.
Thank you. I’m training based on Hugging Face’s Framer inference code and SVD Xtend, but after around 2000 steps, the results become completely distorted, and there’s a loss explosion. I’m wondering if there are any setting changes, aside from architecture, made in SVD Xtend that I should be aware of.
We did not encounter this problem during model fine-tuning. Here are several details for training.
Hello, I have some questions about the training settings.
conditional_latents_mask = mask_token.repeat(bsz_cfg, num_frames-2, 1, latent_h, latent_w)
It seems like two batches were used for CFG, but instead of using 0 for the unconditional part, the same values as the conditional part were repeated. Is there a specific reason for this approach? Was the model trained entirely with conditional training without any separate unconditional training?
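To make the observation above concrete, here is a minimal sketch of what that `repeat` call produces. The shape of `mask_token` (`1, 1, 4, 1, 1`), the batch size, the frame count, and the latent size are illustrative assumptions, not values taken from the Framer code; only the `repeat` line itself comes from the snippet quoted above.

```python
import torch

torch.manual_seed(0)

# Illustrative values only (assumptions, not the repository's settings).
bsz = 1
bsz_cfg = 2 * bsz          # two copies of the batch, as used for CFG
num_frames = 14
latent_h, latent_w = 40, 64

# Hypothetical mask token: one value per latent channel, broadcast elsewhere.
mask_token = torch.randn(1, 1, 4, 1, 1)

# The line quoted in the question: the token is tiled over the batch,
# the (num_frames - 2) intermediate frames, and the spatial dimensions.
conditional_latents_mask = mask_token.repeat(
    bsz_cfg, num_frames - 2, 1, latent_h, latent_w
)

# Split the CFG batch into its two halves and compare them.
first_half, second_half = conditional_latents_mask.chunk(2, dim=0)

print(conditional_latents_mask.shape)
print(torch.equal(first_half, second_half))
```

Because `repeat` tiles the same token along the batch dimension, the two halves of the CFG batch hold identical mask values under these assumptions, which is the behavior the question is asking about. Whether the rest of the conditioning differs between the two halves is not shown by this line.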
Also, in the original SVD Xtend code, a learning rate of 1e-5 is typically used, but the Framer paper mentions using a learning rate of 1e-4. Is there a specific reason for this difference?
The SVD pretrained model used here generates 25 frames at a resolution of 1024x576, but isn’t there also a model that generates 14 frames at 512x320? The frame setting seems closer to the latter; is there a reason for choosing the former model?