
Fixed the use of undenoised latent from the DPM++ scheduler #7

Open
czkoko wants to merge 1 commit into main

Conversation

@czkoko commented Mar 29, 2024

  • The latent returned by the step function in the DPM++ scheduler is not denoised; it appears to be the prediction for the next time step.
  • I don't know much about the internals, and I'm not sure this is the right fix, but it does solve the picture quality problem (a minimal sketch of the workaround follows the comparison images below).
  • The following is a picture quality comparison of DPM++ 2M Karras before and after the fix.

[Images: picture quality comparison of DPM++ 2M Karras before and after the fix]
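
A minimal sketch of the pipeline-level idea, for illustration only: the protocol surface below is hypothetical, mirroring the GuernikaKit-style API quoted later in this thread, not the real interface. The point is that `step()` hands back the next sample x_{t-1}, while `modelOutputs` holds the converted, denoised x0 predictions, and the latter is what should be decoded.

```swift
import CoreML

// Hypothetical minimal scheduler surface (assumed names, not GuernikaKit's
// actual API): step() returns the *next* sample x_{t-1}, while modelOutputs
// stores the converted, denoised x0 predictions.
protocol DenoisingScheduler {
    var modelOutputs: [MLShapedArray<Float32>] { get }
    mutating func step(
        output: MLShapedArray<Float32>,
        timeStep: Double,
        sample: MLShapedArray<Float32>
    ) -> MLShapedArray<Float32>
}

// One denoising step: advance the latent as usual, but hand back the
// scheduler's last denoised prediction for decoding instead of the raw
// step output.
func advance<S: DenoisingScheduler>(
    _ scheduler: inout S,
    noise: MLShapedArray<Float32>,
    timeStep: Double,
    latent: inout MLShapedArray<Float32>
) -> MLShapedArray<Float32> {
    latent = scheduler.step(output: noise, timeStep: timeStep, sample: latent)
    return scheduler.modelOutputs.last ?? latent  // denoised latent to decode
}
```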

@nosferatu500

LGTM

Now I see why I don't have the issue you described a while ago in my app: I do exactly the same thing in my own implementation.

@czkoko (Author) commented Mar 29, 2024

> LGTM
>
> Now I see why I don't have the issue you described a while ago in my app: I do exactly the same thing in my own implementation.

Yes, this problem seems to have existed since Guernika came out. I took a lot of detours before I finally found out that the problem was here.

@SpiraMira

> LGTM
>
> Now I see why I don't have the issue you described a while ago in my app: I do exactly the same thing in my own implementation.

> Yes, this problem seems to have existed since Guernika came out. I took a lot of detours before I finally found out that the problem was here.

LGTM

Great bug fix! I just checked the Hugging Face diffusion application and the denoised latent is derived the same way there.

Again, great find.

@GuiyeC (Contributor) commented Mar 31, 2024

@czkoko thanks for the PR. I'm not able to reproduce this with either SD1.5 or SDXL; I don't see any real difference in the outputs. Maybe you could clarify how you generated those results. Also, if this is a fix for one specific scheduler, I think it would make sense to implement it in Schedulers; it feels really hacky having it here.

@SpiraMira could you explain what you meant by "the hugging face diffusion application does this"? Can you share where? I rechecked DPMSolverMultistepScheduler and I don't see it.

@SpiraMira commented Apr 1, 2024

> @czkoko thanks for the PR. I'm not able to reproduce this with either SD1.5 or SDXL; I don't see any real difference in the outputs. Maybe you could clarify how you generated those results. Also, if this is a fix for one specific scheduler, I think it would make sense to implement it in Schedulers; it feels really hacky having it here.
>
> @SpiraMira could you explain what you meant by "the hugging face diffusion application does this"? Can you share where? I rechecked DPMSolverMultistepScheduler and I don't see it.

Hi @GuiyeC - by that I mean the ml-stable-diffusion demo app at https://github.com/huggingface/swift-coreml-diffusers (written mostly by folks at Hugging Face).

The issue is at the pipeline level, not in the scheduler itself. Below is an excerpt of their pipeline denoising loop (in StableDiffusionPipeline.generateImages()). The source for their final decoded image(s) is:

denoisedLatents[i] = scheduler[i].modelOutputs.last ?? latents[i]

as opposed to just latents[i].

The last time I checked, the DPMSolverMultistep scheduler implementations were the same. Do you think there's an issue there?

Note: all this said, I too am straining to see a marked difference between the two approaches. @czkoko, more details on the model, pipeline, scheduler, and prompts used would be helpful.

```swift
// De-noising loop
...

// Predict noise residuals from latent samples
// and current time step conditioned on hidden states
var noise = try unet.predictNoise(
    latents: latentUnetInput,
    timeStep: t,
    hiddenStates: hiddenStates,
    additionalResiduals: additionalResiduals
)

noise = performGuidance(noise, config.guidanceScale)

// Have the scheduler compute the previous (t-1) latent
// sample given the predicted noise and current sample
for i in 0..<config.imageCount {
    latents[i] = scheduler[i].step(
        output: noise[i],
        timeStep: t,
        sample: latents[i]
    )

    // Keep the scheduler's denoised (x0) prediction alongside the raw step output
    denoisedLatents[i] = scheduler[i].modelOutputs.last ?? latents[i]
}

let currentLatentSamples = config.useDenoisedIntermediates ? denoisedLatents : latents

// Report progress
let progress = Progress(
    pipeline: self,
    prompt: config.prompt,
    step: step,
    stepCount: timeSteps.count,
    currentLatentSamples: currentLatentSamples,
    configuration: config
)
if !progressHandler(progress) {
    // Stop if requested by handler
    return []
}
...

if reduceMemory {
    controlNet?.unloadResources()
    unet.unloadResources()
}

// Decode the latent samples to images
return try decodeToImages(denoisedLatents, configuration: config)
```

@GuiyeC (Contributor) commented Apr 1, 2024

@SpiraMira I still don't think this is an issue in the pipeline. They are using Apple's ml-stable-diffusion, and they may just be using the same hack to fix it, but I think the scheduler should be returning the correct output, especially since this doesn't happen with other schedulers.

@czkoko (Author) commented Apr 1, 2024

@GuiyeC

  • There is no problem with SD 1.5.

  • I think you should test with good fine-tuned models; they no longer need the refiner. The picture quality from the SDXL base model is quite poor. Generating realistic portrait photos makes the quality problems easier to spot in the details; with cartoon styles they may be easy to miss.

  • Next, I will test with the better Crystal Clear XL:

prompt: cinematic, A fair-skinned woman. 35mm photograph, highly detailed
seed: 3850049176
step: 15
cfg: 5.0

Guernika app (DPM++ 2M): there are obviously serious picture quality problems. If you turn on TAE decoding to preview the image, you will see that the picture quality in the last few steps is better than in the final image.
[Image: Guernika app output, DPM++ 2M, showing the quality problems]

The following uses the same parameters, with this PR's fix applied.

DPM++ 2M:
[Image: fixed DPM++ 2M output]

DPM++ 2M Karras:
[Image: fixed DPM++ 2M Karras output]

@czkoko (Author) commented Apr 1, 2024

The following ZIP contains 15 images from a simple command-line program that decodes the latent at each step, built on GuernikaKit 1.6.1 with this PR not applied.

You can see that the final image adds a lot of residual noise compared with step 14; with DPM++ 2M it is even more obvious.

dpmpp2mKarras.zip
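
A minimal sketch of such a per-step dump, assuming a pipeline and configuration already in scope and the progressHandler API from the pipeline excerpt quoted earlier; decodeLatent and writePNG are hypothetical helpers, and the method and Progress field names follow the ml-stable-diffusion-style excerpt rather than verified GuernikaKit 1.6.1 API:

```swift
import CoreML
import CoreGraphics

// Hypothetical helper (assumption): decode one latent via the VAE, or a
// TAE preview decoder as mentioned above.
func decodeLatent(_ latent: MLShapedArray<Float32>) -> CGImage? {
    // VAE/TAE decode would go here.
    nil
}

// Hypothetical helper (assumption): encode the image as PNG and save it.
func writePNG(_ image: CGImage?, to path: String) {
    // PNG encoding and file write would go here.
}

// Dump the intermediate latents at every step so the final image can be
// compared against step 14, 13, ... as in the attached ZIP.
let images = try pipeline.generateImages(configuration: config) { progress in
    if let latent = progress.currentLatentSamples.first {
        writePNG(decodeLatent(latent), to: "step-\(progress.step).png")
    }
    return true // returning false would abort generation
}
```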

@GuiyeC (Contributor) commented Apr 1, 2024

@czkoko I'm not saying there is no bug, but I still think this is a bug in the scheduler, not in the pipeline. The scheduler should return the correct output; if it's the last step, it should return the latent ready for decoding.

Are you seeing this with any other schedulers?

@czkoko (Author) commented Apr 1, 2024

I don't know much about it, but compared with other schedulers, it looks like prevSample needs to be processed by the convertModelOutput function before the step function returns. The step functions in DPMSolverMultistepScheduler and DPMSolverSinglestepScheduler return prevSample directly.

@GuiyeC (Contributor) commented Apr 1, 2024

This seems to be the same issue.

@czkoko (Author) commented Apr 1, 2024

@GuiyeC
If you want to solve the problem in the scheduler, maybe it should look like this. I have tested it and the picture quality returns to normal. Do you think this is right?

```swift
public func step(
    // ......
    if lowerOrderStepped < solverOrder {
        lowerOrderStepped += 1
    }
    if stepIndex == timeSteps.count - 1 {
        return convertModelOutput(modelOutput: output, timestep: t, sample: prevSample)
    }
    return prevSample
}
```

@GuiyeC (Contributor) commented Apr 1, 2024

You could return modelOutput earlier and without reconverting the output:

```swift
public func step(
    output: MLShapedArray<Float32>,
    timeStep t: Double,
    sample: MLShapedArray<Float32>,
    generator: RandomGenerator
) -> MLShapedArray<Float32> {
    ...

    // Convert and store the denoised (x0) prediction for this step
    let modelOutput = convertModelOutput(modelOutput: output, timestep: t, sample: sample)
    if modelOutputs.count == solverOrder { modelOutputs.removeFirst() }
    modelOutputs.append(modelOutput)

    // On the final step, return the denoised latent, ready for decoding
    if stepIndex == timeSteps.count - 1 {
        return modelOutput
    }

    let prevSample: MLShapedArray<Float32>
    ...
    return prevSample
}
```

I would still like to see how they've solved it in other implementations, but maybe this is good enough as a temporary fix.

@SpiraMira

> @czkoko I'm not saying there is no bug, but I still think this is a bug in the scheduler, not in the pipeline. The scheduler should return the correct output; if it's the last step, it should return the latent ready for decoding.
>
> Are you seeing this with any other schedulers?

@GuiyeC and @czkoko - all the non-solver schedulers already return the denoised latent, so modelOutput[0] == latent. I just double-checked in the debugger (in my own Guernika-based app), so for them the proposed fix will have no effect (bug or not).

I dug deeper into the GuernikaKit Schedulers DPMSolverMultistepScheduler code (vs. Apple's version) and noticed a difference in the secondOrderUpdate function: the last weightedSum parameters are different. The test case uses 15 steps, so (I believe) it triggers the secondOrderUpdate call. Could this be an issue? Also, which version of DPMSolverMultistepScheduler is the correct one?
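
For reference, this is the second-order (midpoint) DPM-Solver++ update that those weightedSum calls implement, as I read it from the diffusers DPMSolverMultistepScheduler (my own transcription, worth double-checking against both codebases):

$$
x_t = \frac{\sigma_t}{\sigma_{s_0}}\, x_{s_0} - \alpha_t \left(e^{-h} - 1\right) D_0 - \frac{1}{2}\, \alpha_t \left(e^{-h} - 1\right) D_1,
\qquad D_0 = m_0, \quad D_1 = \frac{m_0 - m_1}{r_0}, \quad r_0 = \frac{h_0}{h},
$$

where $m_0, m_1$ are the two most recently stored denoised model outputs (the converted $x_0$ predictions) and $h$, $h_0$ are the current and previous log-SNR step sizes.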

@SpiraMira

Correction: my bad, it looks like the GuernikaKit Schedulers version was just refactored a bit, so it is functionally the same.

@czkoko (Author) commented Apr 2, 2024

Liuliu's Draw Things has just been open-sourced. Maybe you can find some solutions in his code; his DPM++ SDE has no problem with SDXL, but it takes twice as long.
https://github.com/drawthingsai/draw-things-community/tree/main/Libraries/SwiftDiffusion/Sources/Samplers

@SpiraMira

> Liuliu's Draw Things has just been open-sourced. Maybe you can find some solutions in his code; his DPM++ SDE has no problem with SDXL, but it takes twice as long. https://github.com/drawthingsai/draw-things-community/tree/main/Libraries/SwiftDiffusion/Sources/Samplers

@czkoko - is DPM++ 2M or SDE the problem child? I thought it was 2M Karras. Also, which XL model are you testing with? Would love to be able to reproduce your images on my side...

@czkoko (Author) commented Apr 4, 2024

> @czkoko - is DPM++ 2M or SDE the problem child? I thought it was 2M Karras. Also, which XL model are you testing with? Would love to be able to reproduce your images on my side...

@SpiraMira

  • DPM++ 2M, DPM++ 2M Karras, DPM++ SDE, and DPM++ SDE Karras all have problems; of these, DPM++ 2M and DPM++ 2M Karras can be solved with the current method.
  • But SDE seems to have other problems. For example, the SDXL picture comes out hazy no matter how many steps are used. After merging the Lightning LoRA, the haze disappears, but the colors shift toward red.
  • I test with Crystal Clear XL.

@SpiraMira

> • DPM++ 2M, DPM++ 2M Karras, DPM++ SDE, and DPM++ SDE Karras all have problems; of these, DPM++ 2M and DPM++ 2M Karras can be solved with the current method.
> • But SDE seems to have other problems. For example, the SDXL picture comes out hazy no matter how many steps are used. After merging the Lightning LoRA, the haze disappears, but the colors shift toward red.
> • I test with Crystal Clear XL.

@czkoko - So I played around with updating our current DPMSolverMultistepScheduler along the lines of this Hugging Face patch for a similar problem: https://github.com/huggingface/diffusers/pull/6477/files#diff-517cce3913a4b16e1d17a0b945a920e400aa5553471df6cd85f71fc8f079b4b4, with some qualitative success on the 2M scheduler.

The most significant piece of their patch affects the step function's lowerOrderFinal evaluation:

```python
# Improve numerical stability for small number of steps
lower_order_final = (self.step_index == len(self.timesteps) - 1) and (
    self.config.euler_at_final
    or (self.config.lower_order_final and len(self.timesteps) < 15)
    or self.config.final_sigmas_type == "zero"
)
```

My version:

```swift
let lowerOrderFinal = stepIndex == timeSteps.count - 1 &&
    ((useLowerOrderFinal && timeSteps.count < 15) || !useFinalTrainingScheduleSigma)
```

By setting my useFinalTrainingScheduleSigma flag to false, I achieve the same effect. Here are some shots with the 2M scheduler:

  • I used a CoreML version of Crystal Clear XL (8-bit, 1024x1024) from https://huggingface.co/coreml-community/coreml-crystalClearXL_SDXL_8-bit/tree/main/original, using your seed, prompt, and settings as best I could.

  • My results with Karras have always been acceptable, so I didn't focus on Karras yet.

  • I haven't looked at the single-step schedulers.

  • 2M scheduler results only.

  • The first row is my fix, steps 17...14 in reverse order.

  • The second row is the original 2M, steps 17...14 in reverse order (noise beyond 15 steps, as expected).

[Image: 2M scheduler comparison grid, fixed vs. original, steps 17...14]

The rest of the patch appends either the final training-schedule sigma or 0 during initialization. I've implemented it, but I'm still not sure of its overall effect on our schedulers.
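
A sketch of that initialization choice transcribed to Swift; the enum and function names here are my assumptions for illustration, not GuernikaKit API. In the diffusers patch, final_sigmas_type selects between the training schedule's last sigma and zero:

```swift
// Hypothetical flag mirroring diffusers' final_sigmas_type (an assumption,
// not GuernikaKit API).
enum FinalSigmasType { case sigmaMin, zero }

// Append the sigma used for the very last step during schedule setup.
func appendFinalSigma(to sigmas: inout [Float], type: FinalSigmasType) {
    switch type {
    case .sigmaMin:
        // Reuse the training schedule's smallest sigma (the pre-patch behavior).
        sigmas.append(sigmas.last ?? 0)
    case .zero:
        // sigma = 0 at the final step makes the last sample the fully
        // denoised x0 prediction, which is what gets decoded.
        sigmas.append(0)
    }
}
```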

The results look similar to the proposed hack, but this approach may be backed by more solid science and research (assuming, of course, that Hugging Face didn't just hack it too before it made it into their baseline). I'm not an expert; I would love to hear from @GuiyeC about this. (I can submit some code for review if you like.)
