forked from NVIDIA/NeMo
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
SDXL improvements (and support for Draft+) [DRAFT PR] (NVIDIA#9543)
* add slurm files to .gitignore
* add differentiable decode to SDXL VAE
* Optionally return predicted noise during the single step sampling process
* also change `get_gamma` as a new function to use inside other functions which may interact with sampling (e.g. draft+)
* debugging sdunet converter script
* Added SD/SDXL conversion script from HF to NeMo
* added 'from_nemo' config for VAE
* tmp commit, please make changes (oci is super slow, cannot even run vim)
* new inference yaml works
* add logging to autoencoder
* !(dont squash) Added enabling support for LinearWrapper for SDLoRA
* added samples_per_batch and fsdp arguments to SDXL inference
* added extra optionally wrapper to FSDP
* remove unncessary comments
* remove unnecessary comments
* Apply isort and black reformatting

Signed-off-by: yaoyu-33 <[email protected]>

---------

Signed-off-by: yaoyu-33 <[email protected]>
Co-authored-by: Rohit Jena <[email protected]>
Co-authored-by: Yu Yao <[email protected]>
Co-authored-by: yaoyu-33 <[email protected]>
1 parent f9c3a8b, commit 55d6e62
Showing 22 changed files with 880 additions and 89 deletions.
@@ -17,7 +17,6 @@ trainer:
  enable_model_summary: True
  limit_val_batches: 0


exp_manager:
  exp_dir: null
  name: ${name}
examples/multimodal/text_to_image/stable_diffusion/conf/sd_xl_infer_v2.yaml (189 additions, 0 deletions)
@@ -0,0 +1,189 @@
trainer:
  devices: 1
  num_nodes: 1
  accelerator: gpu
  precision: 32
  logger: False # logger provided by exp_manager
  enable_checkpointing: False
  use_distributed_sampler: False
  max_epochs: -1 # PTL default. In practice, max_steps will be reached first.
  max_steps: -1 # consumed_samples = global_step * micro_batch_size * data_parallel_size * accumulate_grad_batches
  log_every_n_steps: 10
  accumulate_grad_batches: 1 # do not modify, grad acc is automatic for training megatron models
  gradient_clip_val: 1.0
  benchmark: False
  enable_model_summary: True
  limit_val_batches: 0


infer:
  num_samples_per_batch: 1
  num_samples: 4
  prompt:
    - "A professional photograph of an astronaut riding a pig"
    - 'A photo of a Shiba Inu dog with a backpack riding a bike. It is wearing sunglasses and a beach hat.'
    - 'A cute corgi lives in a house made out of sushi.'
    - 'A high contrast portrait of a very happy fuzzy panda dressed as a chef in a high end kitchen making dough. There is a painting of flowers on the wall behind him.'
    - 'A brain riding a rocketship heading towards the moon.'
  negative_prompt: ""
  seed: 123


sampling:
  base:
    sampler: EulerEDMSampler
    width: 512
    height: 512
    steps: 50
    discretization: "LegacyDDPMDiscretization"
    guider: "VanillaCFG"
    thresholder: "None"
    scale: 5.0
    img2img_strength: 1.0
    sigma_min: 0.0292
    sigma_max: 14.6146
    rho: 3.0
    s_churn: 0.0
    s_tmin: 0.0
    s_tmax: 999.0
    s_noise: 1.0
    eta: 1.0
    order: 4
    orig_width: 512
    orig_height: 512
    crop_coords_top: 0
    crop_coords_left: 0
    aesthetic_score: 5.0
    negative_aesthetic_score: 5.0

# model:
#   is_legacy: False

use_refiner: False
use_fp16: False # use fp16 model weights
out_path: ./output

base_model_config: /opt/NeMo/examples/multimodal/generative/stable_diffusion/conf/sd_xl_base.yaml
refiner_config: /opt/NeMo/examples/multimodal/generative/stable_diffusion/conf/sd_xl_refiner.yaml

model:
  scale_factor: 0.13025
  disable_first_stage_autocast: True
  is_legacy: False
  restore_from_path: ""

  fsdp: False
  fsdp_set_buffer_dtype: null
  fsdp_sharding_strategy: 'full'
  use_cpu_initialization: True
  # hidden_size: 4
  # pipeline_model_parallel_size: 4

  optim:
    name: fused_adam
    lr: 1e-4
    weight_decay: 0.0
    betas:
      - 0.9
      - 0.999
    sched:
      name: WarmupHoldPolicy
      warmup_steps: 10
      hold_steps: 10000000000000 # Incredibly large value to hold the lr as constant

  denoiser_config:
    _target_: nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.denoiser.DiscreteDenoiser
    num_idx: 1000

    weighting_config:
      _target_: nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.denoiser_weighting.EpsWeighting
    scaling_config:
      _target_: nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.denoiser_scaling.EpsScaling
    discretization_config:
      _target_: nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.discretizer.LegacyDDPMDiscretization

  unet_config:
    _target_: nemo.collections.multimodal.modules.stable_diffusion.diffusionmodules.openaimodel.UNetModel
    from_pretrained: /opt/nemo-aligner/checkpoints/sdxl/unet_nemo.ckpt
    from_NeMo: True
    adm_in_channels: 2816
    num_classes: sequential
    use_checkpoint: False
    in_channels: 4
    out_channels: 4
    model_channels: 320
    attention_resolutions: [ 4, 2 ]
    num_res_blocks: 2
    channel_mult: [ 1, 2, 4 ]
    num_head_channels: 64
    use_spatial_transformer: True
    use_linear_in_transformer: True
    transformer_depth: [ 1, 2, 10 ] # note: the first is unused (due to attn_res starting at 2) 32, 16, 8 --> 64, 32, 16
    context_dim: 2048
    image_size: 64 # unused
    # spatial_transformer_attn_type: softmax #note: only default softmax is supported now
    legacy: False
    use_flash_attention: False

  first_stage_config:
    # _target_: nemo.collections.multimodal.models.stable_diffusion.ldm.autoencoder.AutoencoderKLInferenceWrapper
    _target_: nemo.collections.multimodal.models.text_to_image.stable_diffusion.ldm.autoencoder.AutoencoderKLInferenceWrapper
    from_pretrained: /opt/nemo-aligner/checkpoints/sdxl/vae_nemo.ckpt
    from_NeMo: True
    embed_dim: 4
    monitor: val/rec_loss
    ddconfig:
      attn_type: vanilla
      double_z: true
      z_channels: 4
      resolution: 256
      in_channels: 3
      out_ch: 3
      ch: 128
      ch_mult: [ 1, 2, 4, 4 ]
      num_res_blocks: 2
      attn_resolutions: [ ]
      dropout: 0.0
    lossconfig:
      target: torch.nn.Identity

  conditioner_config:
    _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.GeneralConditioner
    emb_models:
      # crossattn cond
      - is_trainable: False
        input_key: txt
        emb_model:
          _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenCLIPEmbedder
          layer: hidden
          layer_idx: 11
      # crossattn and vector cond
      - is_trainable: False
        input_key: txt
        emb_model:
          _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.FrozenOpenCLIPEmbedder2
          arch: ViT-bigG-14
          version: laion2b_s39b_b160k
          freeze: True
          layer: penultimate
          always_return_pooled: True
          legacy: False
      # vector cond
      - is_trainable: False
        input_key: original_size_as_tuple
        emb_model:
          _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.ConcatTimestepEmbedderND
          outdim: 256 # multiplied by two
      # vector cond
      - is_trainable: False
        input_key: crop_coords_top_left
        emb_model:
          _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.ConcatTimestepEmbedderND
          outdim: 256 # multiplied by two
      # vector cond
      - is_trainable: False
        input_key: target_size_as_tuple
        emb_model:
          _target_: nemo.collections.multimodal.modules.stable_diffusion.encoders.modules.ConcatTimestepEmbedderND
          outdim: 256 # multiplied by two
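
Reviewer note: the widths in `unet_config` can be cross-checked against the `conditioner_config` entries in this file. The sketch below is not part of the commit. It assumes the CLIP text encoder emits 768-wide hidden states, that ViT-bigG-14 emits 1280-wide text features, that each size-style conditioner embeds two scalars (the "multiplied by two" comment suggests this), and that the VAE halves the resolution once between consecutive `ch_mult` levels. All constant and function names are hypothetical.

```python
# Values copied from sd_xl_infer_v2.yaml in this diff.
CONTEXT_DIM = 2048           # unet_config.context_dim
ADM_IN_CHANNELS = 2816       # unet_config.adm_in_channels
EMBEDDER_OUTDIM = 256        # outdim of each ConcatTimestepEmbedderND
VAE_CH_MULT = [1, 2, 4, 4]   # first_stage_config.ddconfig.ch_mult
Z_CHANNELS = 4               # first_stage_config.ddconfig.z_channels

# Assumptions, not stated in the config itself.
CLIP_L_WIDTH = 768           # assumed hidden width of FrozenCLIPEmbedder
OPENCLIP_BIGG_WIDTH = 1280   # assumed text width of ViT-bigG-14
VALUES_PER_SIZE_KEY = 2      # e.g. (height, width) or (top, left)

SIZE_KEYS = ["original_size_as_tuple", "crop_coords_top_left", "target_size_as_tuple"]


def crossattn_width() -> int:
    """Width of the cross-attention context: both text encoders concatenated."""
    return CLIP_L_WIDTH + OPENCLIP_BIGG_WIDTH


def vector_width() -> int:
    """Width of the vector conditioning: pooled text features plus size embeddings."""
    size_embeddings = len(SIZE_KEYS) * VALUES_PER_SIZE_KEY * EMBEDDER_OUTDIM
    return OPENCLIP_BIGG_WIDTH + size_embeddings


def latent_shape(width: int, height: int) -> tuple:
    """Latent (channels, height, width), assuming one 2x downsample per extra ch_mult level."""
    factor = 2 ** (len(VAE_CH_MULT) - 1)
    return (Z_CHANNELS, height // factor, width // factor)


print("crossattn width:", crossattn_width(), "expected", CONTEXT_DIM)
print("vector width:", vector_width(), "expected", ADM_IN_CHANNELS)
print("latent shape for 512x512:", latent_shape(512, 512))
```

Under those assumptions both sums agree with the configured `context_dim` and `adm_in_channels`, so a mismatch after editing the conditioner list would point to a config error rather than a checkpoint problem.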
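
Reviewer note: the `sigma_min: 0.0292` and `sigma_max: 14.6146` values under `sampling.base` look like the endpoints of a scaled-linear DDPM noise schedule. The sketch below recomputes them in plain Python. It assumes `LegacyDDPMDiscretization` uses 1000 timesteps with `linear_start=0.00085` and `linear_end=0.012`, which are the usual Stable Diffusion defaults and are not shown in this diff. The function name is hypothetical and this is not the NeMo implementation.

```python
import math


def legacy_ddpm_sigmas(num_timesteps=1000, linear_start=0.00085, linear_end=0.012):
    """Return sigma_t = sqrt((1 - alpha_bar_t) / alpha_bar_t) for a scaled-linear beta schedule.

    The defaults are assumed values, not taken from the commit.
    """
    root_start, root_end = math.sqrt(linear_start), math.sqrt(linear_end)
    alpha_bar = 1.0
    sigmas = []
    for i in range(num_timesteps):
        # Betas are linear in sqrt-space, then squared.
        root = root_start + (root_end - root_start) * i / (num_timesteps - 1)
        beta = root * root
        alpha_bar *= 1.0 - beta
        sigmas.append(math.sqrt((1.0 - alpha_bar) / alpha_bar))
    return sigmas


sigmas = legacy_ddpm_sigmas()
print(f"sigma_min ~ {sigmas[0]:.4f}, sigma_max ~ {sigmas[-1]:.4f}")
```

If the assumed defaults hold, the first and last entries should land close to the configured `sigma_min` and `sigma_max`, which would make those two keys redundant with the discretization rather than independent knobs.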