
Support ControlNet LoRA #1936

Closed · wants to merge 1 commit
Conversation

@huchenlei (Collaborator) commented Aug 21, 2023

This PR ports the following change from ComfyUI (comfyanonymous/ComfyUI@d6e4b34).

There are several things this PR aims to achieve:

  1. Correctly recognize ControlNet LoRA and convert its state_dict to acceptable format
  2. Use custom linear, conv_nd operations when a ControlNet LoRA is loaded
  3. Create yaml configs for ControlNet LoRA models
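Regarding points 1 and 2: a ControlNet LoRA ships low-rank up/down factors rather than full weights, and the custom ops apply those factors on the fly. Below is a minimal numpy sketch of the idea; the names `lora_down`/`lora_up` and the `alpha` scale are illustrative, not the actual state_dict keys.

```python
import numpy as np

def lora_linear(x, weight, lora_down, lora_up, alpha=1.0):
    """Linear layer with a low-rank LoRA delta applied on the fly.

    The effective weight is W + alpha * (up @ down), but we never
    materialize the full delta: we route x through the two small
    rank-r factors instead.
    """
    base = x @ weight.T
    delta = (x @ lora_down.T) @ lora_up.T  # rank-r path
    return base + alpha * delta

rng = np.random.default_rng(0)
d_in, d_out, rank = 8, 4, 2
x = rng.standard_normal((3, d_in))
W = rng.standard_normal((d_out, d_in))
down = rng.standard_normal((rank, d_in))  # projects down to rank
up = rng.standard_normal((d_out, rank))   # projects back up

y = lora_linear(x, W, down, up, alpha=0.5)

# Equivalent to merging the delta into W once, up front:
y_merged = x @ (W + 0.5 * (up @ down)).T
assert np.allclose(y, y_merged)
```

The equivalence check at the end shows why both strategies work: either apply the factors inside a custom op on every forward pass, or merge `alpha * up @ down` into the base weight once.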

Questions to be answered:

  1. Will this approach support composing multiple ControlNet LoRAs?
  2. Some code is ported from a GPL-licensed repo (Comfy). We probably need to either rewrite it or contact Comfy to resolve the license conflict.

scripts/hook.py Outdated
@@ -763,7 +769,7 @@ def hacked_group_norm_forward(self, *args, **kwargs):
gn_modules = [model.middle_block]
model.middle_block.gn_weight = 0

- input_block_indices = [4, 5, 7, 8, 10, 11]
+ input_block_indices = [4, 5, 7, 8, 10, 11] if not getattr(process.sd_model, 'is_sdxl', False) else [4, 5, 7, 8]
@huchenlei (Collaborator, Author):
SDXL only has 9 input blocks.

scripts/hook.py Outdated
@@ -600,8 +606,8 @@ def forward(self, x, timesteps=None, context=None, **kwargs):
h = aligned_adding(h, total_controlnet_embedding.pop(), require_inpaint_hijack)

# U-Net Decoder
- for i, module in enumerate(self.output_blocks):
-     h = th.cat([h, aligned_adding(hs.pop(), total_controlnet_embedding.pop(), require_inpaint_hijack)], dim=1)
+ for module, (hs_item, controlnet_embedding_item) in zip(self.output_blocks, reversed(list(zip(hs, total_controlnet_embedding)))):
@huchenlei (Collaborator, Author):
hs has 9 tensor elements, while total_controlnet_embedding has 10 tensor elements plus 2 zero paddings at the end.

This change aligns the two arrays, but I am not sure it is the correct fix here.
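As a toy illustration of the alignment (string stand-ins, not real tensors): `zip` stops at the shorter sequence, so the extra elements at the tail of `total_controlnet_embedding` are dropped, and `reversed` yields the pairs in decoder order.

```python
# Toy stand-ins: 9 skip connections vs. 10 embeddings + 2 zero pads.
hs = [f"h{i}" for i in range(9)]
total_controlnet_embedding = [f"c{i}" for i in range(10)] + [0, 0]

# zip() truncates to the 9 shared positions, discarding the extra
# embedding and the zero padding at the tail; reversed() then walks
# the pairs deepest-first, matching the decoder's iteration order.
pairs = list(reversed(list(zip(hs, total_controlnet_embedding))))

assert len(pairs) == 9
assert pairs[0] == ("h8", "c8")   # deepest skip connection first
assert pairs[-1] == ("h0", "c0")  # shallowest last
```

Whether this index-wise pairing (h_i with c_i) is the semantically correct alignment is exactly the uncertainty noted in the comment above.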

@huchenlei (Collaborator, Author):
That impl does not handle a trailing 0 for the middle block weights. Revised the impl to calculate the array length based on the model type.

@huchenlei (Collaborator, Author)

Currently blocked on a transformer arch issue. See details in #1933 (comment)

In Comfy's implementation, the transformer depth is different at each level.

label_emb is another issue, as there are weights for that field in the LoRA state dict.

@huchenlei (Collaborator, Author)

OK, I have verified that the LoRA can run on A1111; it just requires a change in ldm's SpatialTransformer. I will look for a workaround later that does not require modifying the ldm source.

[Screenshot: Screen Capture 022 - Stable Diffusion - localhost]

@huchenlei (Collaborator, Author) commented Aug 22, 2023

ComfyUI has the following lines doing some extra adjustment on the input emb:

        if self.num_classes is not None:
            assert y.shape[0] == x.shape[0]
            emb = emb + self.label_emb(y)

where label_emb is a network that converts adm to emb's shape:

self.label_emb = nn.Sequential(
    nn.Sequential(
        linear(adm_in_channels, time_embed_dim, dtype=self.dtype),
        nn.SiLU(),
        linear(time_embed_dim, time_embed_dim, dtype=self.dtype),
    )
)

And adm in comfy is calculated differently for refiner and base SDXL models:
https://github.com/comfyanonymous/ComfyUI/blob/763b0cf024c8fd462343ab0a8cfdab099714168b/comfy/model_base.py#L157C1-L207

I am pretty sure something similar must also exist in A1111, and we just need to perform some mapping to get it into a format that the label_emb network accepts.
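At the shape level, label_emb is just a two-layer MLP that maps the adm vector into the timestep-embedding space, after which it is added to emb. Here is a numpy stand-in for the nn.Sequential above; the 2816/1280 dimensions are what I believe SDXL base uses, so treat them as an assumption.

```python
import numpy as np

def silu(x):
    # SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def label_emb(y, w1, b1, w2, b2):
    """Two-layer MLP: adm_in_channels -> time_embed_dim -> time_embed_dim."""
    h = silu(y @ w1 + b1)
    return h @ w2 + b2

rng = np.random.default_rng(0)
batch, adm_in_channels, time_embed_dim = 2, 2816, 1280  # assumed SDXL base sizes
y = rng.standard_normal((batch, adm_in_channels))       # stand-in for adm
emb = rng.standard_normal((batch, time_embed_dim))      # timestep embedding

# Random small weights just to exercise the shapes.
w1 = rng.standard_normal((adm_in_channels, time_embed_dim)) * 0.01
b1 = np.zeros(time_embed_dim)
w2 = rng.standard_normal((time_embed_dim, time_embed_dim)) * 0.01
b2 = np.zeros(time_embed_dim)

emb = emb + label_emb(y, w1, b1, w2, b2)  # the `emb + self.label_emb(y)` step
assert emb.shape == (batch, time_embed_dim)
```

The key point is that the shapes line up: whatever adm A1111 computes for SDXL, it only needs to match adm_in_channels for label_emb to consume it.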

However, from an experiment commenting out the + self.label_emb(y) line, the results do not seem that different using the base model (left: no label_emb; right: with label_emb):

[Screenshot: Screen Capture 023 - Compare the difference between images - Diff Checker - www.diffchecker.com]

Both results follow the guidance map, so we can probably live without it for now.

@huchenlei (Collaborator, Author)

Some testing on rank-128 size-S models:

Canny:
[images: tmprsib75w5, 00012-1]

Depth:
[images: tmp9zngl_yq, 00013-1]

Re-color:
[images: tmpw4aoqgxs, 00015-1]

Using multiple units seems to be fine, but produces results with generally worse quality. (Kinda expected?)

@lllyasviel PTAL

@FurkanGozukara

thank you so much amazing work

@George0726 (Contributor)

thank you so much amazing and fast work

@lllyasviel (Collaborator) commented Aug 23, 2023

I need to check whether something is wrong in the impl before it gets merged; the saturation seems a bit weird, and I am not sure whether some of those artifacts are limitations of the released models. Also, we should merge this after webui 1.6.x, since there are many differences between 1.5.x and 1.6.x.

@huchenlei (Collaborator, Author)

> I need to take a look if there is something wrong in the impl before it is going to merge, the saturation seems a bit weird and I am not sure that some of those artifacts are limitations of the released models. Also, we should begin to merge those after webui 1.6.x since there are much differences between 1.5.x and 1.6.x

I tested the generations under the 1.5.1 release tag and the current HEAD of the dev branch; both produce similar results.

I am not sure about the saturation issue. You can check out the branch and do some testing yourself. My prompt is very simple here:
1man holding microphone
neg: abstract, black and white

The ComfyUI results I attached earlier are probably not comparable, as the workflow they offer is img2img. Here are some canny img2img generation results with A1111:
[image: grid-0000]

Commits:
- 🚧 handle lora statedict
- 🔧 Fix yaml config issue
- Workaround all obvious errors
- Correctly overwrite Linear/Conv2D
- wip
- 🐛 Fix transformer depth issue
- 🐛 Fix SDXL middle block missing issue
- rename config
- nits
@lllyasviel (Collaborator)

Where did you find control-lora-sdxl.yaml?

@lllyasviel (Collaborator)

Wait, it seems that they are using a LoRA to construct a ControlNet to control XL...
Why don't they directly use a LoRA to control XL?

@FurkanGozukara

> wait. it seems that they are using a lora to construct a controlnet to control xl ... why dont they directly use a lora to control xl?

I don't think they are as pro as you.

So many people right now are expecting ControlNet for auto1111.

@lllyasviel (Collaborator)

to be continued in #1952

@lllyasviel closed this Aug 23, 2023
@huchenlei (Collaborator, Author)

> where you find control-lora-sdxl.yaml?

I dumped the config out of Comfy and made that yaml.

Labels: enhancement (New feature or request)