diff --git a/.github/workflows/build_documentation.yml b/.github/workflows/build_documentation.yml index bd45b08d24f7..67229d634c91 100644 --- a/.github/workflows/build_documentation.yml +++ b/.github/workflows/build_documentation.yml @@ -16,7 +16,7 @@ jobs: install_libgl1: true package: diffusers notebook_folder: diffusers_doc - languages: en ko zh + languages: en ko zh ja secrets: token: ${{ secrets.HUGGINGFACE_PUSH }} diff --git a/.github/workflows/build_pr_documentation.yml b/.github/workflows/build_pr_documentation.yml index 18b606ca754c..f5b666ee27ff 100644 --- a/.github/workflows/build_pr_documentation.yml +++ b/.github/workflows/build_pr_documentation.yml @@ -15,4 +15,4 @@ jobs: pr_number: ${{ github.event.number }} install_libgl1: true package: diffusers - languages: en ko zh + languages: en ko zh ja diff --git a/PHILOSOPHY.md b/PHILOSOPHY.md index 6c2a7dd1b528..df1b0e4ddd43 100644 --- a/PHILOSOPHY.md +++ b/PHILOSOPHY.md @@ -70,7 +70,7 @@ The following design principles are followed: - Pipelines should be used **only** for inference. - Pipelines should be very readable, self-explanatory, and easy to tweak. - Pipelines should be designed to build on top of each other and be easy to integrate into higher-level APIs. -- Pipelines are **not** intended to be feature-complete user interfaces. For future complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner) +- Pipelines are **not** intended to be feature-complete user interfaces. For future complete user interfaces one should rather have a look at [InvokeAI](https://github.com/invoke-ai/InvokeAI), [Diffuzers](https://github.com/abhishekkrthakur/diffuzers), and [lama-cleaner](https://github.com/Sanster/lama-cleaner). - Every pipeline should have one and only one way to run it via a `__call__` method. The naming of the `__call__` arguments should be shared across all pipelines. - Pipelines should be named after the task they are intended to solve. - In almost all cases, novel diffusion pipelines shall be implemented in a new pipeline folder/file. @@ -104,7 +104,7 @@ The following design principles are followed: - Schedulers all inherit from `SchedulerMixin` and `ConfigMixin`. - Schedulers can be easily swapped out with the [`ConfigMixin.from_config`](https://huggingface.co/docs/diffusers/main/en/api/configuration#diffusers.ConfigMixin.from_config) method as explained in detail [here](./using-diffusers/schedulers.md). - Every scheduler has to have a `set_num_inference_steps`, and a `step` function. `set_num_inference_steps(...)` has to be called before every denoising process, *i.e.* before `step(...)` is called. -- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon +- Every scheduler exposes the timesteps to be "looped over" via a `timesteps` attribute, which is an array of timesteps the model will be called upon. - The `step(...)` function takes a predicted model output and the "current" sample (x_t) and returns the "previous", slightly more denoised sample (x_t-1). - Given the complexity of diffusion schedulers, the `step` function does not expose all the complexity and can be a bit of a "black box". - In almost all cases, novel schedulers shall be implemented in a new scheduling file. 
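As a minimal sketch of the scheduler contract described above (assuming the `google/ddpm-cat-256` checkpoint and the `set_timesteps` method that current schedulers expose for the `set_num_inference_steps(...)` step), a bare denoising loop looks roughly like this:

```py
import torch
from diffusers import DDPMScheduler, UNet2DModel

# Assumed example checkpoint; any UNet2DModel/scheduler pair follows the same contract.
scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256")
model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda")

# Set the number of denoising steps before any call to step(...)
scheduler.set_timesteps(50)

# Start from pure noise and "loop over" the exposed timesteps attribute
sample = torch.randn(1, 3, model.config.sample_size, model.config.sample_size).to("cuda")
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = model(sample, t).sample
    # step(...) maps the model output and the current sample x_t to the slightly more denoised x_t-1
    sample = scheduler.step(noise_pred, t, sample).prev_sample
```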
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index 718feeaa1171..cef8f474c00e 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -34,6 +34,8 @@ title: Load safetensors - local: using-diffusers/other-formats title: Load different Stable Diffusion formats + - local: using-diffusers/loading_adapters + title: Load adapters - local: using-diffusers/push_to_hub title: Push files to the Hub title: Loading & Hub @@ -81,8 +83,8 @@ - local: using-diffusers/custom_pipeline_examples title: Community pipelines - local: using-diffusers/contribute_pipeline - title: How to contribute a community pipeline - title: Pipelines for Inference + title: Contribute a community pipeline + title: Specific pipeline examples - sections: - local: training/overview title: Overview @@ -168,8 +170,6 @@ title: Custom normalization layers - local: api/attnprocessor title: Attention Processor - - local: api/diffusion_pipeline - title: Diffusion Pipeline - local: api/logging title: Logging - local: api/configuration @@ -254,6 +254,8 @@ title: Kandinsky - local: api/pipelines/kandinsky_v22 title: Kandinsky 2.2 + - local: api/pipelines/latent_consistency_models + title: Latent Consistency Models - local: api/pipelines/latent_diffusion title: Latent Diffusion - local: api/pipelines/panorama @@ -370,6 +372,8 @@ title: KDPM2AncestralDiscreteScheduler - local: api/schedulers/dpm_discrete title: KDPM2DiscreteScheduler + - local: api/schedulers/lcm + title: LCMScheduler - local: api/schedulers/lms_discrete title: LMSDiscreteScheduler - local: api/schedulers/pndm diff --git a/docs/source/en/api/diffusion_pipeline.md b/docs/source/en/api/diffusion_pipeline.md deleted file mode 100644 index d99443002469..000000000000 --- a/docs/source/en/api/diffusion_pipeline.md +++ /dev/null @@ -1,36 +0,0 @@ - - -# Pipelines - -The [`DiffusionPipeline`] is the quickest way to load any pretrained diffusion pipeline from the [Hub](https://huggingface.co/models?library=diffusers) for inference. - - - -You shouldn't use the [`DiffusionPipeline`] class for training or finetuning a diffusion model. Individual -components (for example, [`UNet2DModel`] and [`UNet2DConditionModel`]) of diffusion pipelines are usually trained individually, so we suggest directly working with them instead. - - - -The pipeline type (for example [`StableDiffusionPipeline`]) of any diffusion pipeline loaded with [`~DiffusionPipeline.from_pretrained`] is automatically -detected and pipeline components are loaded and passed to the `__init__` function of the pipeline. - -Any pipeline object can be saved locally with [`~DiffusionPipeline.save_pretrained`]. 
- -## DiffusionPipeline - -[[autodoc]] DiffusionPipeline - - all - - __call__ - - device - - to - - components diff --git a/docs/source/en/api/models/controlnet.md b/docs/source/en/api/models/controlnet.md index e02adde8a1bc..58359723a08e 100644 --- a/docs/source/en/api/models/controlnet.md +++ b/docs/source/en/api/models/controlnet.md @@ -12,13 +12,13 @@ By default the [`ControlNetModel`] should be loaded with [`~ModelMixin.from_pret from the original format using [`FromOriginalControlnetMixin.from_single_file`] as follows: ```py -from diffusers import StableDiffusionControlnetPipeline, ControlNetModel +from diffusers import StableDiffusionControlNetPipeline, ControlNetModel url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth" # can also be a local path controlnet = ControlNetModel.from_single_file(url) url = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors" # can also be a local path -pipe = StableDiffusionControlnetPipeline.from_single_file(url, controlnet=controlnet) +pipe = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet) ``` ## ControlNetModel diff --git a/docs/source/en/api/pipelines/latent_consistency_models.md b/docs/source/en/api/pipelines/latent_consistency_models.md new file mode 100644 index 000000000000..d8e47be2c257 --- /dev/null +++ b/docs/source/en/api/pipelines/latent_consistency_models.md @@ -0,0 +1,44 @@ +# Latent Consistency Models + +Latent Consistency Models (LCMs) were proposed in [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://arxiv.org/abs/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. + +The abstract of the [paper](https://arxiv.org/pdf/2310.04378.pdf) is as follows: + +*Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference.* + +A demo for the [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) checkpoint can be found [here](https://huggingface.co/spaces/SimianLuo/Latent_Consistency_Model). + +This pipeline was contributed by [luosiallen](https://luosiallen.github.io/) and [dg845](https://github.com/dg845). + +```python +import torch +from diffusers import DiffusionPipeline + +pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float32) + +# To save GPU memory, torch.float16 can be used, but it may compromise image quality. 
+pipe.to(torch_device="cuda", torch_dtype=torch.float32) + +prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k" + +# num_inference_steps can be set to 1~50 steps. LCM supports fast inference even with <= 4 steps. Recommended: 1~8 steps. +num_inference_steps = 4 + +images = pipe(prompt=prompt, num_inference_steps=num_inference_steps, guidance_scale=8.0).images +``` + +## LatentConsistencyModelPipeline + +[[autodoc]] LatentConsistencyModelPipeline + - all + - __call__ + - enable_freeu + - disable_freeu + - enable_vae_slicing + - disable_vae_slicing + - enable_vae_tiling + - disable_vae_tiling + +## StableDiffusionPipelineOutput + +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/overview.md b/docs/source/en/api/pipelines/overview.md index 625e4d661d00..9caf5c6b4121 100644 --- a/docs/source/en/api/pipelines/overview.md +++ b/docs/source/en/api/pipelines/overview.md @@ -12,16 +12,74 @@ specific language governing permissions and limitations under the License. # Pipelines -Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible and they can be adapted to use different scheduler or even model components. +Pipelines provide a simple way to run state-of-the-art diffusion models in inference by bundling all of the necessary components (multiple independently-trained models, schedulers, and processors) into a single end-to-end class. Pipelines are flexible and they can be adapted to use different schedulers or even model components. -All pipelines are built from the base [`DiffusionPipeline`] class which provides basic functionality for loading, downloading, and saving all the components. +All pipelines are built from the base [`DiffusionPipeline`] class which provides basic functionality for loading, downloading, and saving all the components. Specific pipeline types (for example [`StableDiffusionPipeline`]) loaded with [`~DiffusionPipeline.from_pretrained`] are automatically detected and the pipeline components are loaded and passed to the `__init__` function of the pipeline. -Pipelines do not offer any training functionality. You'll notice PyTorch's autograd is disabled by decorating the [`~DiffusionPipeline.__call__`] method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should not be used for training. If you're interested in training, please take a look at the [Training](../traininig/overview) guides instead! +You shouldn't use the [`DiffusionPipeline`] class for training. Individual components (for example, [`UNet2DModel`] and [`UNet2DConditionModel`]) of diffusion pipelines are usually trained individually, so we suggest directly working with them instead. + +
+ +Pipelines do not offer any training functionality. You'll notice PyTorch's autograd is disabled by decorating the [`~DiffusionPipeline.__call__`] method with a [`torch.no_grad`](https://pytorch.org/docs/stable/generated/torch.no_grad.html) decorator because pipelines should not be used for training. If you're interested in training, please take a look at the [Training](../../training/overview) guides instead!
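A quick, illustrative way to see both points, that the concrete pipeline class is detected from the checkpoint and that `__call__` runs with autograd disabled, is a sketch along these lines (the `runwayml/stable-diffusion-v1-5` checkpoint and the `"pt"` output type are just assumptions for the example):

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
print(type(pipeline).__name__)  # StableDiffusionPipeline, detected from the checkpoint

# Request tensors instead of PIL images so the autograd state of the output can be inspected
images = pipeline("an astronaut riding a horse", output_type="pt").images
print(images.requires_grad)  # False, because __call__ runs under torch.no_grad()
```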
+The table below lists all the pipelines currently available in 🤗 Diffusers and the tasks they support. Click on a pipeline to view its abstract and published paper. + +| Pipeline | Tasks | +|---|---| +| [AltDiffusion](alt_diffusion) | image2image | +| [Attend-and-Excite](attend_and_excite) | text2image | +| [Audio Diffusion](audio_diffusion) | image2audio | +| [AudioLDM](audioldm) | text2audio | +| [AudioLDM2](audioldm2) | text2audio | +| [BLIP Diffusion](blip_diffusion) | text2image | +| [Consistency Models](consistency_models) | unconditional image generation | +| [ControlNet](controlnet) | text2image, image2image, inpainting | +| [ControlNet with Stable Diffusion XL](controlnet_sdxl) | text2image | +| [Cycle Diffusion](cycle_diffusion) | image2image | +| [Dance Diffusion](dance_diffusion) | unconditional audio generation | +| [DDIM](ddim) | unconditional image generation | +| [DDPM](ddpm) | unconditional image generation | +| [DeepFloyd IF](deepfloyd_if) | text2image, image2image, inpainting, super-resolution | +| [DiffEdit](diffedit) | inpainting | +| [DiT](dit) | text2image | +| [GLIGEN](gligen) | text2image | +| [InstructPix2Pix](pix2pix) | image editing | +| [Kandinsky](kandinsky) | text2image, image2image, inpainting, interpolation | +| [Kandinsky 2.2](kandinsky_v22) | text2image, image2image, inpainting | +| [Latent Diffusion](latent_diffusion) | text2image, super-resolution | +| [LDM3D](ldm3d_diffusion) | text2image, text-to-3D | +| [MultiDiffusion](panorama) | text2image | +| [MusicLDM](musicldm) | text2audio | +| [PaintByExample](paint_by_example) | inpainting | +| [ParaDiGMS](paradigms) | text2image | +| [Pix2Pix Zero](pix2pix_zero) | image editing | +| [PNDM](pndm) | unconditional image generation | +| [RePaint](repaint) | inpainting | +| [ScoreSdeVe](score_sde_ve) | unconditional image generation | +| [Self-Attention Guidance](self_attention_guidance) | text2image | +| [Semantic Guidance](semantic_stable_diffusion) | text2image | +| [Shap-E](shap_e) | text-to-3D, image-to-3D | +| [Spectrogram Diffusion](spectrogram_diffusion) | | +| [Stable Diffusion](stable_diffusion/overview) | text2image, image2image, depth2image, inpainting, image variation, latent upscaler, super-resolution | +| [Stable Diffusion Model Editing](model_editing) | model editing | +| [Stable Diffusion XL](stable_diffusion_xl) | text2image, image2image, inpainting | +| [Stable unCLIP](stable_unclip) | text2image, image variation | +| [KarrasVe](karras_ve) | unconditional image generation | +| [T2I Adapter](adapter) | text2image | +| [Text2Video](text_to_video) | text2video, video2video | +| [Text2Video Zero](text_to_video_zero) | text2video | +| [UnCLIP](unclip) | text2image, image variation | +| [Unconditional Latent Diffusion](latent_diffusion_uncond) | unconditional image generation | +| [UniDiffuser](unidiffuser) | text2image, image2text, image variation, text variation, unconditional image generation, unconditional audio generation | +| [Value-guided planning](value_guided_sampling) | value guided sampling | +| [Versatile Diffusion](versatile_diffusion) | text2image, image variation | +| [VQ Diffusion](vq_diffusion) | text2image | +| [Wuerstchen](wuerstchen) | text2image | + ## DiffusionPipeline [[autodoc]] DiffusionPipeline diff --git a/docs/source/en/api/schedulers/lcm.md b/docs/source/en/api/schedulers/lcm.md new file mode 100644 index 000000000000..fb55e52ac1f3 --- /dev/null +++ b/docs/source/en/api/schedulers/lcm.md @@ -0,0 +1,9 @@ +# Latent Consistency Model Multistep Scheduler + +## Overview + 
+The multistep and onestep scheduler (Algorithm 3) was introduced alongside latent consistency models in the paper [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://arxiv.org/abs/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. +This scheduler should be able to generate good samples from [`LatentConsistencyModelPipeline`] in 1-8 steps. + +## LCMScheduler +[[autodoc]] LCMScheduler diff --git a/docs/source/en/index.md b/docs/source/en/index.md index f2012abc6970..f4cf2e2114ec 100644 --- a/docs/source/en/index.md +++ b/docs/source/en/index.md @@ -22,7 +22,7 @@ specific language governing permissions and limitations under the License. The library has three main components: -- State-of-the-art [diffusion pipelines](api/pipelines/overview) for inference with just a few lines of code. +- State-of-the-art diffusion pipelines for inference with just a few lines of code. There are many pipelines in 🤗 Diffusers; check out the table in the pipeline [overview](api/pipelines/overview) for a complete list of available pipelines and the tasks they solve. - Interchangeable [noise schedulers](api/schedulers/overview) for balancing trade-offs between generation speed and quality. - Pretrained [models](api/models) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems. @@ -45,54 +45,4 @@ The library has three main components:

Technical descriptions of how 🤗 Diffusers classes and methods work.

- - -## Supported pipelines - -| Pipeline | Paper/Repository | Tasks | -|---|---|:---:| -| [alt_diffusion](./api/pipelines/alt_diffusion) | [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation | -| [audio_diffusion](./api/pipelines/audio_diffusion) | [Audio Diffusion](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation | -| [controlnet](./api/pipelines/controlnet) | [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation | -| [cycle_diffusion](./api/pipelines/cycle_diffusion) | [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation | -| [dance_diffusion](./api/pipelines/dance_diffusion) | [Dance Diffusion](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation | -| [ddpm](./api/pipelines/ddpm) | [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation | -| [ddim](./api/pipelines/ddim) | [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation | -| [if](./if) | [**IF**](./api/pipelines/if) | Image Generation | -| [if_img2img](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation | -| [if_inpainting](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation | -| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation | -| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image | -| [latent_diffusion_uncond](./api/pipelines/latent_diffusion_uncond) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation | -| [paint_by_example](./api/pipelines/paint_by_example) | [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting | -| [pndm](./api/pipelines/pndm) | [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation | -| [score_sde_ve](./api/pipelines/score_sde_ve) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | -| [score_sde_vp](./api/pipelines/score_sde_vp) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | -| [semantic_stable_diffusion](./api/pipelines/semantic_stable_diffusion) | [Semantic Guidance](https://arxiv.org/abs/2301.12247) | Text-Guided Generation | -| [stable_diffusion_adapter](./api/pipelines/stable_diffusion/adapter) | [**T2I-Adapter**](https://arxiv.org/abs/2302.08453) | Image-to-Image Text-Guided Generation | - -| [stable_diffusion_text2img](./api/pipelines/stable_diffusion/text2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation | -| [stable_diffusion_img2img](./api/pipelines/stable_diffusion/img2img) | [Stable 
Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation | -| [stable_diffusion_inpaint](./api/pipelines/stable_diffusion/inpaint) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting | -| [stable_diffusion_panorama](./api/pipelines/stable_diffusion/panorama) | [MultiDiffusion](https://multidiffusion.github.io/) | Text-to-Panorama Generation | -| [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) | Text-Guided Image Editing| -| [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) | Text-Guided Image Editing | -| [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation | -| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation Unconditional Image Generation | -| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation | -| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [Stable Diffusion Latent Upscaler](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image | -| [stable_diffusion_model_editing](./api/pipelines/stable_diffusion/model_editing) | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://time-diffusion.github.io/) | Text-to-Image Model Editing | -| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation | -| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting | -| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Depth-Conditional Stable Diffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) | Depth-to-Image Generation | -| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image | -| [stable_diffusion_safe](./api/pipelines/stable_diffusion_safe) | [Safe Stable Diffusion](https://arxiv.org/abs/2211.05105) | Text-Guided Generation | -| [stable_unclip](./stable_unclip) | Stable unCLIP | Text-to-Image Generation | -| [stable_unclip](./stable_unclip) | Stable unCLIP | Image-to-Image Text-Guided Generation | -| [stochastic_karras_ve](./api/pipelines/stochastic_karras_ve) | [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation | -| [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation | -| [unclip](./api/pipelines/unclip) | 
[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125)(implementation by [kakaobrain](https://github.com/kakaobrain/karlo)) | Text-to-Image Generation | -| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation | -| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation | -| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation | -| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation | -| [stable_diffusion_ldm3d](./api/pipelines/stable_diffusion/ldm3d_diffusion) | [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) | Text to Image and Depth Generation | + \ No newline at end of file diff --git a/docs/source/en/installation.md b/docs/source/en/installation.md index 1a0951bf7bba..ee15fb56384d 100644 --- a/docs/source/en/installation.md +++ b/docs/source/en/installation.md @@ -12,12 +12,10 @@ specific language governing permissions and limitations under the License. # Installation -Install 🤗 Diffusers for whichever deep learning library you're working with. +🤗 Diffusers is tested on Python 3.8+, PyTorch 1.7.0+, and Flax. Follow the installation instructions below for the deep learning library you are using: -🤗 Diffusers is tested on Python 3.8+, PyTorch 1.7.0+ and Flax. Follow the installation instructions below for the deep learning library you are using: - -- [PyTorch](https://pytorch.org/get-started/locally/) installation instructions. -- [Flax](https://flax.readthedocs.io/en/latest/) installation instructions. +- [PyTorch](https://pytorch.org/get-started/locally/) installation instructions +- [Flax](https://flax.readthedocs.io/en/latest/) installation instructions ## Install with pip @@ -37,7 +35,7 @@ Activate the virtual environment: source .env/bin/activate ``` -🤗 Diffusers also relies on the 🤗 Transformers library, and you can install both with the following command: +You should also install 🤗 Transformers because 🤗 Diffusers relies on its models: @@ -54,9 +52,7 @@ pip install diffusers["flax"] transformers ## Install from source -Before installing 🤗 Diffusers from source, make sure you have `torch` and 🤗 Accelerate installed. - -For `torch` installation, refer to the `torch` [installation](https://pytorch.org/get-started/locally/#start-locally) guide. +Before installing 🤗 Diffusers from source, make sure you have PyTorch and 🤗 Accelerate installed. To install 🤗 Accelerate: @@ -64,7 +60,7 @@ To install 🤗 Accelerate: pip install accelerate ``` -Install 🤗 Diffusers from source with the following command: +Then install 🤗 Diffusers from source: ```bash pip install git+https://github.com/huggingface/diffusers @@ -75,7 +71,7 @@ The `main` version is useful for staying up-to-date with the latest developments For instance, if a bug has been fixed since the last official release but a new release hasn't been rolled out yet. However, this means the `main` version may not always be stable. 
We strive to keep the `main` version operational, and most issues are usually resolved within a few hours or a day. -If you run into a problem, please open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose), so we can fix it even sooner! +If you run into a problem, please open an [Issue](https://github.com/huggingface/diffusers/issues/new/choose) so we can fix it even sooner! ## Editable install @@ -123,17 +119,29 @@ git pull Your Python environment will find the `main` version of 🤗 Diffusers on the next run. -## Notice on telemetry logging +## Cache + +Model weights and files are downloaded from the Hub to a cache, which is usually in your home directory. You can change the cache location by specifying the `HF_HOME` or `HUGGINGFACE_HUB_CACHE` environment variables or configuring the `cache_dir` parameter in methods like [`~DiffusionPipeline.from_pretrained`]. + +Cached files allow you to run 🤗 Diffusers offline. To prevent 🤗 Diffusers from connecting to the internet, set the `HF_HUB_OFFLINE` environment variable to `True` and 🤗 Diffusers will only load previously downloaded files in the cache. + +```shell +export HF_HUB_OFFLINE=True +``` + +For more details about managing and cleaning the cache, take a look at the [caching](https://huggingface.co/docs/huggingface_hub/guides/manage-cache) guide. + +## Telemetry logging -Our library gathers telemetry information during `from_pretrained()` requests. -This data includes the version of Diffusers and PyTorch/Flax, the requested model or pipeline class, -and the path to a pre-trained checkpoint if it is hosted on the Hub. +Our library gathers telemetry information during [`~DiffusionPipeline.from_pretrained`] requests. +The data gathered includes the version of 🤗 Diffusers and PyTorch/Flax, the requested model or pipeline class, +and the path to a pretrained checkpoint if it is hosted on the Hugging Face Hub. This usage data helps us debug issues and prioritize new features. -Telemetry is only sent when loading models and pipelines from the HuggingFace Hub, -and is not collected during local usage. +Telemetry is only sent when loading models and pipelines from the Hub, +and it is not collected if you're loading local files. -We understand that not everyone wants to share additional information, and we respect your privacy, -so you can disable telemetry collection by setting the `DISABLE_TELEMETRY` environment variable from your terminal: +We understand that not everyone wants to share additional information, and we respect your privacy. +You can disable telemetry collection by setting the `DISABLE_TELEMETRY` environment variable from your terminal: On Linux/MacOS: ```bash diff --git a/docs/source/en/using-diffusers/contribute_pipeline.md b/docs/source/en/using-diffusers/contribute_pipeline.md index 501847ad20e7..15b4b20ab34a 100644 --- a/docs/source/en/using-diffusers/contribute_pipeline.md +++ b/docs/source/en/using-diffusers/contribute_pipeline.md @@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License.
--> -# How to contribute a community pipeline +# Contribute a community pipeline diff --git a/docs/source/en/using-diffusers/custom_pipeline_examples.md b/docs/source/en/using-diffusers/custom_pipeline_examples.md index 2f47d1b26c6c..555292568349 100644 --- a/docs/source/en/using-diffusers/custom_pipeline_examples.md +++ b/docs/source/en/using-diffusers/custom_pipeline_examples.md @@ -14,273 +14,106 @@ specific language governing permissions and limitations under the License. [[open-in-colab]] -> **For more information about community pipelines, please have a look at [this issue](https://github.com/huggingface/diffusers/issues/841).** + -**Community** examples consist of both inference and training examples that have been added by the community. -Please have a look at the following table to get an overview of all community examples. Click on the **Code Example** to get a copy-and-paste ready code example that you can try out. -If a community doesn't work as expected, please open an issue and ping the author on it. +For more context about the design choices behind community pipelines, please have a look at [this issue](https://github.com/huggingface/diffusers/issues/841). -| Example | Description | Code Example | Colab | Author | -|:---------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:------------------------------------------------------------------|:-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------:| -| CLIP Guided Stable Diffusion | Doing CLIP guidance for text to image generation with Stable Diffusion | [CLIP Guided Stable Diffusion](#clip-guided-stable-diffusion) | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/CLIP_Guided_Stable_diffusion_with_diffusers.ipynb) | [Suraj Patil](https://github.com/patil-suraj/) | -| One Step U-Net (Dummy) | Example showcasing of how to use Community Pipelines (see https://github.com/huggingface/diffusers/issues/841) | [One Step U-Net](#one-step-unet) | - | [Patrick von Platen](https://github.com/patrickvonplaten/) | -| Stable Diffusion Interpolation | Interpolate the latent space of Stable Diffusion between different prompts/seeds | [Stable Diffusion Interpolation](#stable-diffusion-interpolation) | - | [Nate Raw](https://github.com/nateraw/) | -| Stable Diffusion Mega | **One** Stable Diffusion Pipeline with all functionalities of [Text2Image](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py), [Image2Image](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_img2img.py) and [Inpainting](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint.py) | 
[Stable Diffusion Mega](#stable-diffusion-mega) | - | [Patrick von Platen](https://github.com/patrickvonplaten/) | -| Long Prompt Weighting Stable Diffusion | **One** Stable Diffusion Pipeline without tokens length limit, and support parsing weighting in prompt. | [Long Prompt Weighting Stable Diffusion](#long-prompt-weighting-stable-diffusion) | - | [SkyTNT](https://github.com/SkyTNT) | -| Speech to Image | Using automatic-speech-recognition to transcribe text and Stable Diffusion to generate images | [Speech to Image](#speech-to-image) | - | [Mikail Duzenli](https://github.com/MikailINTech) + + +Community pipelines allow you to get creative and build your own unique pipelines to share with the community. You can find all community pipelines in the [diffusers/examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community) folder along with inference and training examples for how to use them. This guide showcases some of the community pipelines and hopefully it'll inspire you to create your own (feel free to open a PR with your own pipeline and we will merge it!). + +To load a community pipeline, use the `custom_pipeline` argument in [`DiffusionPipeline`] to specify one of the files in [diffusers/examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community): -To load a custom pipeline you just need to pass the `custom_pipeline` argument to `DiffusionPipeline`, as one of the files in `diffusers/examples/community`. Feel free to send a PR with your own pipelines, we will merge them quickly. ```py pipe = DiffusionPipeline.from_pretrained( "CompVis/stable-diffusion-v1-4", custom_pipeline="filename_in_the_community_folder", use_safetensors=True ) ``` -## Example usages +If a community pipeline doesn't work as expected, please open a GitHub issue and mention the author. -### CLIP Guided Stable Diffusion +You can learn more about community pipelines in the how to [load community pipelines](custom_pipeline_overview) and how to [contribute a community pipeline](contribute_pipeline) guides. -CLIP guided stable diffusion can help to generate more realistic images -by guiding stable diffusion at every denoising step with an additional CLIP model. +## Multilingual Stable Diffusion -The following code requires roughly 12GB of GPU RAM. +The multilingual Stable Diffusion pipeline uses a pretrained [XLM-RoBERTa](https://huggingface.co/papluca/xlm-roberta-base-language-detection) to identify a language and the [mBART-large-50](https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt) model to handle the translation. This allows you to generate images from text in 20 languages. 
-```python -from diffusers import DiffusionPipeline -from transformers import CLIPImageProcessor, CLIPModel +```py +from PIL import Image import torch - - -feature_extractor = CLIPImageProcessor.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K") -clip_model = CLIPModel.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K", torch_dtype=torch.float16) - - -guided_pipeline = DiffusionPipeline.from_pretrained( - "CompVis/stable-diffusion-v1-4", - custom_pipeline="clip_guided_stable_diffusion", - clip_model=clip_model, - feature_extractor=feature_extractor, - torch_dtype=torch.float16, - use_safetensors=True, -) -guided_pipeline.enable_attention_slicing() -guided_pipeline = guided_pipeline.to("cuda") - -prompt = "fantasy book cover, full moon, fantasy forest landscape, golden vector elements, fantasy magic, dark light night, intricate, elegant, sharp focus, illustration, highly detailed, digital painting, concept art, matte, art by WLOP and Artgerm and Albert Bierstadt, masterpiece" - -generator = torch.Generator(device="cuda").manual_seed(0) -images = [] -for i in range(4): - image = guided_pipeline( - prompt, - num_inference_steps=50, - guidance_scale=7.5, - clip_guidance_scale=100, - num_cutouts=4, - use_cutouts=False, - generator=generator, - ).images[0] - images.append(image) - -# save images locally -for i, img in enumerate(images): - img.save(f"./clip_guided_sd/image_{i}.png") -``` - -The `images` list contains a list of PIL images that can be saved locally or displayed directly in a google colab. -Generated images tend to be of higher qualtiy than natively using stable diffusion. E.g. the above script generates the following images: - -![clip_guidance](https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/clip_guidance/merged_clip_guidance.jpg). - -### One Step Unet - -The dummy "one-step-unet" can be run as follows: - -```python from diffusers import DiffusionPipeline - -pipe = DiffusionPipeline.from_pretrained("google/ddpm-cifar10-32", custom_pipeline="one_step_unet") -pipe() -``` - -**Note**: This community pipeline is not useful as a feature, but rather just serves as an example of how community pipelines can be added (see https://github.com/huggingface/diffusers/issues/841). - -### Stable Diffusion Interpolation - -The following code can be run on a GPU of at least 8GB VRAM and should take approximately 5 minutes. - -```python -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained( - "CompVis/stable-diffusion-v1-4", - torch_dtype=torch.float16, - safety_checker=None, # Very important for videos...lots of false positives while interpolating - custom_pipeline="interpolate_stable_diffusion", - use_safetensors=True, -).to("cuda") -pipe.enable_attention_slicing() - -frame_filepaths = pipe.walk( - prompts=["a dog", "a cat", "a horse"], - seeds=[42, 1337, 1234], - num_interpolation_steps=16, - output_dir="./dreams", - batch_size=4, - height=512, - width=512, - guidance_scale=8.5, - num_inference_steps=50, +from diffusers.utils import make_image_grid +from transformers import ( + pipeline, + MBart50TokenizerFast, + MBartForConditionalGeneration, ) -``` - -The output of the `walk(...)` function returns a list of images saved under the folder as defined in `output_dir`. You can use these images to create videos of stable diffusion. 
- -> **Please have a look at https://github.com/nateraw/stable-diffusion-videos for more in-detail information on how to create videos using stable diffusion as well as more feature-complete functionality.** - -### Stable Diffusion Mega - -The Stable Diffusion Mega Pipeline lets you use the main use cases of the stable diffusion pipeline in a single class. - -```python -#!/usr/bin/env python3 -from diffusers import DiffusionPipeline -import PIL -import requests -from io import BytesIO -import torch +device = "cuda" if torch.cuda.is_available() else "cpu" +device_dict = {"cuda": 0, "cpu": -1} -def download_image(url): - response = requests.get(url) - return PIL.Image.open(BytesIO(response.content)).convert("RGB") +# add language detection pipeline +language_detection_model_ckpt = "papluca/xlm-roberta-base-language-detection" +language_detection_pipeline = pipeline("text-classification", + model=language_detection_model_ckpt, + device=device_dict[device]) +# add model for language translation +trans_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt") +trans_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt").to(device) -pipe = DiffusionPipeline.from_pretrained( +diffuser_pipeline = DiffusionPipeline.from_pretrained( "CompVis/stable-diffusion-v1-4", - custom_pipeline="stable_diffusion_mega", + custom_pipeline="multilingual_stable_diffusion", + detection_pipeline=language_detection_pipeline, + translation_model=trans_model, + translation_tokenizer=trans_tokenizer, torch_dtype=torch.float16, - use_safetensors=True, -) -pipe.to("cuda") -pipe.enable_attention_slicing() - - -### Text-to-Image - -images = pipe.text2img("An astronaut riding a horse").images - -### Image-to-Image - -init_image = download_image( - "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" ) -prompt = "A fantasy landscape, trending on artstation" - -images = pipe.img2img(prompt=prompt, image=init_image, strength=0.75, guidance_scale=7.5).images - -### Inpainting - -img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" -mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" -init_image = download_image(img_url).resize((512, 512)) -mask_image = download_image(mask_url).resize((512, 512)) - -prompt = "a cat sitting on a bench" -images = pipe.inpaint(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.75).images -``` - -As shown above this one pipeline can run all both "text-to-image", "image-to-image", and "inpainting" in one pipeline. - -### Long Prompt Weighting Stable Diffusion - -The Pipeline lets you input prompt without 77 token length limit. And you can increase words weighting by using "()" or decrease words weighting by using "[]" -The Pipeline also lets you use the main use cases of the stable diffusion pipeline in a single class. 
- -#### pytorch - -```python -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained( - "hakurei/waifu-diffusion", custom_pipeline="lpw_stable_diffusion", torch_dtype=torch.float16, use_safetensors=True -) -pipe = pipe.to("cuda") - -prompt = "best_quality (1girl:1.3) bow bride brown_hair closed_mouth frilled_bow frilled_hair_tubes frills (full_body:1.3) fox_ear hair_bow hair_tubes happy hood japanese_clothes kimono long_sleeves red_bow smile solo tabi uchikake white_kimono wide_sleeves cherry_blossoms" -neg_prompt = "lowres, bad_anatomy, error_body, error_hair, error_arm, error_hands, bad_hands, error_fingers, bad_fingers, missing_fingers, error_legs, bad_legs, multiple_legs, missing_legs, error_lighting, error_shadow, error_reflection, text, error, extra_digit, fewer_digits, cropped, worst_quality, low_quality, normal_quality, jpeg_artifacts, signature, watermark, username, blurry" - -pipe.text2img(prompt, negative_prompt=neg_prompt, width=512, height=512, max_embeddings_multiples=3).images[0] -``` - -#### onnxruntime - -```python -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained( - "CompVis/stable-diffusion-v1-4", - custom_pipeline="lpw_stable_diffusion_onnx", - revision="onnx", - provider="CUDAExecutionProvider", - use_safetensors=True, -) +diffuser_pipeline.enable_attention_slicing() +diffuser_pipeline = diffuser_pipeline.to(device) -prompt = "a photo of an astronaut riding a horse on mars, best quality" -neg_prompt = "lowres, bad anatomy, error body, error hair, error arm, error hands, bad hands, error fingers, bad fingers, missing fingers, error legs, bad legs, multiple legs, missing legs, error lighting, error shadow, error reflection, text, error, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry" +prompt = ["a photograph of an astronaut riding a horse", + "Una casa en la playa", + "Ein Hund, der Orange isst", + "Un restaurant parisien"] -pipe.text2img(prompt, negative_prompt=neg_prompt, width=512, height=512, max_embeddings_multiples=3).images[0] +images = diffuser_pipeline(prompt).images +grid = make_image_grid(images, rows=2, cols=2) +grid ``` -if you see `Token indices sequence length is longer than the specified maximum sequence length for this model ( *** > 77 ) . Running this sequence through the model will result in indexing errors`. Do not worry, it is normal. - -### Speech to Image - -The following code can generate an image from an audio sample using pre-trained OpenAI whisper-small and Stable Diffusion. - -```Python -import torch - -import matplotlib.pyplot as plt -from datasets import load_dataset -from diffusers import DiffusionPipeline -from transformers import ( - WhisperForConditionalGeneration, - WhisperProcessor, -) - - -device = "cuda" if torch.cuda.is_available() else "cpu" - -ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation") +
+ +
-audio_sample = ds[3] +## MagicMix -text = audio_sample["text"].lower() -speech_data = audio_sample["audio"]["array"] +[MagicMix](https://huggingface.co/papers/2210.16056) is a pipeline that can mix an image and text prompt to generate a new image that preserves the image structure. The `mix_factor` determines how much influence the prompt has on the layout generation, `kmin` controls the number of steps during the content generation process, and `kmax` determines how much information is kept in the layout of the original image. -model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small").to(device) -processor = WhisperProcessor.from_pretrained("openai/whisper-small") +```py +from diffusers import DiffusionPipeline, DDIMScheduler +from diffusers.utils import load_image -diffuser_pipeline = DiffusionPipeline.from_pretrained( +pipeline = DiffusionPipeline.from_pretrained( "CompVis/stable-diffusion-v1-4", - custom_pipeline="speech_to_image_diffusion", - speech_model=model, - speech_processor=processor, - torch_dtype=torch.float16, - use_safetensors=True, -) - -diffuser_pipeline.enable_attention_slicing() -diffuser_pipeline = diffuser_pipeline.to(device) + custom_pipeline="magic_mix", + scheduler = DDIMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler"), +).to('cuda') -output = diffuser_pipeline(speech_data) -plt.imshow(output.images[0]) +img = load_image("https://user-images.githubusercontent.com/59410571/209578593-141467c7-d831-4792-8b9a-b17dc5e47816.jpg") +mix_img = pipeline(img, prompt="bed", kmin = 0.3, kmax = 0.5, mix_factor = 0.5) +mix_img ``` -This example produces the following image: -![image](https://user-images.githubusercontent.com/45072645/196901736-77d9c6fc-63ee-4072-90b0-dc8b903d63e3.png) \ No newline at end of file +
+
+ +
image prompt
+
+
+ +
image and text prompt mix
+
+
\ No newline at end of file diff --git a/docs/source/en/using-diffusers/loading_adapters.md b/docs/source/en/using-diffusers/loading_adapters.md new file mode 100644 index 000000000000..0514688721d1 --- /dev/null +++ b/docs/source/en/using-diffusers/loading_adapters.md @@ -0,0 +1,300 @@ + + +# Load adapters + +[[open-in-colab]] + +There are several [training](../training/overview) techniques for personalizing diffusion models to generate images of a specific subject or images in certain styles. Each of these training methods produces a different type of adapter. Some of the adapters generate an entirely new model, while other adapters only modify a smaller set of embeddings or weights. This means the loading process for each adapter is also different. + +This guide will show you how to load DreamBooth, textual inversion, and LoRA weights. + + + +Feel free to browse the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer), [LoRA the Explorer](https://huggingface.co/spaces/multimodalart/LoraTheExplorer), and the [Diffusers Models Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery) for checkpoints and embeddings to use. + + + +## DreamBooth + +[DreamBooth](https://dreambooth.github.io/) finetunes an *entire diffusion model* on just several images of a subject to generate images of that subject in new styles and settings. This method works by using a special word in the prompt that the model learns to associate with the subject image. Of all the training methods, DreamBooth produces the largest file size (usually a few GBs) because it is a full checkpoint model. + +Let's load the [herge_style](https://huggingface.co/sd-dreambooth-library/herge-style) checkpoint, which is trained on just 10 images drawn by Hergé, to generate images in that style. For it to work, you need to include the special word `herge_style` in your prompt to trigger the checkpoint: + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained("sd-dreambooth-library/herge-style", torch_dtype=torch.float16).to("cuda") +prompt = "A cute herge_style brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration" +image = pipeline(prompt).images[0] +``` + +
+ +
+ +## Textual inversion + +[Textual inversion](https://textual-inversion.github.io/) is very similar to DreamBooth and it can also personalize a diffusion model to generate certain concepts (styles, objects) from just a few images. This method works by training and finding new embeddings that represent the images you provide with a special word in the prompt. As a result, the diffusion model weights stay the same and the training process produces a relatively tiny (a few KBs) file. + +Because textual inversion creates embeddings, it cannot be used on its own like DreamBooth and requires another model. + +```py +from diffusers import AutoPipelineForText2Image
+import torch + +pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda") +``` + +Now you can load the textual inversion embeddings with the [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] method and generate some images. Let's load the [sd-concepts-library/gta5-artwork](https://huggingface.co/sd-concepts-library/gta5-artwork) embeddings and you'll need to include the special word `<gta5-artwork>` in your prompt to trigger it: + +```py +pipeline.load_textual_inversion("sd-concepts-library/gta5-artwork") +prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, <gta5-artwork> style" +image = pipeline(prompt).images[0] +``` + +
+ +
+ +Textual inversion can also be trained on undesirable things to create *negative embeddings* that discourage a model from generating images with those undesirable traits, such as blurry images or extra fingers on a hand. This can be an easy way to quickly improve your prompt. You'll also load the embeddings with [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`], but this time, you'll need two more parameters: + +- `weight_name`: specifies the weight file to load if the file was saved in the 🤗 Diffusers format with a specific name or if the file is stored in the A1111 format +- `token`: specifies the special word to use in the prompt to trigger the embeddings + +Let's load the [sayakpaul/EasyNegative-test](https://huggingface.co/sayakpaul/EasyNegative-test) embeddings: + +```py +pipeline.load_textual_inversion( + "sayakpaul/EasyNegative-test", weight_name="EasyNegative.safetensors", token="EasyNegative" +) +``` + +Now you can use the `token` to generate an image with the negative embeddings: + +```py +prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration, EasyNegative" +negative_prompt = "EasyNegative" + +image = pipeline(prompt, negative_prompt=negative_prompt, num_inference_steps=50).images[0] +``` + +
+ +
+ +## LoRA + +[Low-Rank Adaptation (LoRA)](https://huggingface.co/papers/2106.09685) is a popular training technique because it is fast and generates smaller file sizes (a couple hundred MBs). Like the other methods in this guide, LoRA can train a model to learn new styles from just a few images. It works by inserting new weights into the diffusion model and then only the new weights are trained instead of the entire model. This makes LoRAs faster to train and easier to store. + + + +LoRA is a very general training technique that can be used with other training methods. For example, it is common to train a model with DreamBooth and LoRA. + + + +LoRAs also need to be used with another model: + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") +``` + +Then use the [`~loaders.LoraLoaderMixin.load_lora_weights`] method to load the [ostris/super-cereal-sdxl-lora](https://huggingface.co/ostris/super-cereal-sdxl-lora) weights and specify the weights filename from the repository: + +```py +pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora", weight_name="cereal_box_sdxl_v1.safetensors") +prompt = "bears, pizza bites" +image = pipeline(prompt).images[0] +``` + +
+ +
+ +The [`~loaders.LoraLoaderMixin.load_lora_weights`] method loads LoRA weights into both the UNet and text encoder. It is the preferred way for loading LoRAs because it can handle cases where: + +- the LoRA weights don't have separate identifiers for the UNet and text encoder +- the LoRA weights have separate identifiers for the UNet and text encoder + +But if you only need to load LoRA weights into the UNet, then you can use the [`~loaders.UNet2DConditionLoadersMixin.load_attn_procs`] method. Let's load the [jbilcke-hf/sdxl-cinematic-1](https://huggingface.co/jbilcke-hf/sdxl-cinematic-1) LoRA: + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") +pipeline.unet.load_attn_procs("jbilcke-hf/sdxl-cinematic-1", weight_name="pytorch_lora_weights.safetensors") + +# use cnmt in the prompt to trigger the LoRA +prompt = "A cute cnmt eating a slice of pizza, stunning color scheme, masterpiece, illustration" +image = pipeline(prompt).images[0] +``` + +
+ +
+ + + +For both [`~loaders.LoraLoaderMixin.load_lora_weights`] and [`~loaders.UNet2DConditionLoadersMixin.load_attn_procs`], you can pass the `cross_attention_kwargs={"scale": 0.5}` parameter to adjust how much of the LoRA weights to use. A value of `0` is the same as only using the base model weights, and a value of `1` is equivalent to using the fully finetuned LoRA. + + + +To unload the LoRA weights, use the [`~loaders.LoraLoaderMixin.unload_lora_weights`] method to discard the LoRA weights and restore the model to its original weights: + +```py +pipeline.unload_lora_weights() +``` + +### Load multiple LoRAs + +It can be fun to use multiple LoRAs together to create something entirely new and unique. The [`~loaders.LoraLoaderMixin.fuse_lora`] method allows you to fuse the LoRA weights with the original weights of the underlying model. + + + +Fusing the weights can lead to a speedup in inference latency because you don't need to separately load the base model and LoRA! You can save your fused pipeline with [`~DiffusionPipeline.save_pretrained`] to avoid loading and fusing the weights every time you want to use the model. + + + +Load an initial model: + +```py +from diffusers import StableDiffusionXLPipeline, AutoencoderKL +import torch + +vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16) +pipeline = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", + vae=vae, + torch_dtype=torch.float16, +).to("cuda") +``` + +Then load the LoRA checkpoint and fuse it with the original weights. The `lora_scale` parameter controls how much to scale the output by with the LoRA weights. It is important to make the `lora_scale` adjustments in the [`~loaders.LoraLoaderMixin.fuse_lora`] method because it won't work if you try to pass `scale` to the `cross_attention_kwargs` in the pipeline. + +If you need to reset the original model weights for any reason (use a different `lora_scale`), you should use the [`~loaders.LoraLoaderMixin.unfuse_lora`] method. + +```py +pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl") +pipeline.fuse_lora(lora_scale=0.7) + +# to unfuse the LoRA weights +pipeline.unfuse_lora() +``` + +Then fuse this pipeline with the next set of LoRA weights: + +```py +pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora") +pipeline.fuse_lora(lora_scale=0.7) +``` + + + +You can't unfuse multiple LoRA checkpoints so if you need to reset the model to its original weights, you'll need to reload it. + + + +Now you can generate an image that uses the weights from both LoRAs: + +```py +prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration" +image = pipeline(prompt).images[0] +``` + +### 🤗 PEFT + + + +Read the [Inference with 🤗 PEFT](../tutorials/using_peft_for_inference) tutorial to learn more its integration with 🤗 Diffusers and how you can easily work with and juggle multiple adapters. + + + +Another way you can load and use multiple LoRAs is to specify the `adapter_name` parameter in [`~loaders.LoraLoaderMixin.load_lora_weights`]. This method takes advantage of the 🤗 PEFT integration. 
For example, load and name both LoRA weights: + +```py +from diffusers import DiffusionPipeline +import torch + +pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") +pipeline.load_lora_weights("ostris/ikea-instructions-lora-sdxl", weight_name="ikea_instructions_xl_v1_5.safetensors", adapter_name="ikea") +pipeline.load_lora_weights("ostris/super-cereal-sdxl-lora", weight_name="cereal_box_sdxl_v1.safetensors", adapter_name="cereal") +``` + +Now use the [`~loaders.UNet2DConditionLoadersMixin.set_adapters`] to activate both LoRAs, and you can configure how much weight each LoRA should have on the output: + +```py +pipeline.set_adapters(["ikea", "cereal"], adapter_weights=[0.7, 0.5]) +``` + +Then generate an image: + +```py +prompt = "A cute brown bear eating a slice of pizza, stunning color scheme, masterpiece, illustration" +image = pipeline(prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}).images[0] +``` + +### Kohya and TheLastBen + +Other popular LoRA trainers from the community include those by [Kohya](https://github.com/kohya-ss/sd-scripts/) and [TheLastBen](https://github.com/TheLastBen/fast-stable-diffusion). These trainers create different LoRA checkpoints than those trained by 🤗 Diffusers, but they can still be loaded in the same way. + +Let's download the [Blueprintify SD XL 1.0](https://civitai.com/models/150986/blueprintify-sd-xl-10) checkpoint from [Civitai](https://civitai.com/): + +```py +!wget https://civitai.com/api/download/models/168776 -O blueprintify-sd-xl-10.safetensors +``` + +Load the LoRA checkpoint with the [`~loaders.LoraLoaderMixin.load_lora_weights`] method, and specify the filename in the `weight_name` parameter: + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to("cuda") +pipeline.load_lora_weights("path/to/weights", weight_name="blueprintify-sd-xl-10.safetensors") +``` + +Generate an image: + +```py +# use bl3uprint in the prompt to trigger the LoRA +prompt = "bl3uprint, a highly detailed blueprint of the eiffel tower, explaining how to build all parts, many txt, blueprint grid backdrop" +image = pipeline(prompt).images[0] +``` + + + +Some limitations of using Kohya LoRAs with 🤗 Diffusers include: + +- Images may not look like those generated by UIs - like ComfyUI - for multiple reasons which are explained [here](https://github.com/huggingface/diffusers/pull/4287/#issuecomment-1655110736). +- [LyCORIS checkpoints](https://github.com/KohakuBlueleaf/LyCORIS) aren't fully supported. The [`~loaders.LoraLoaderMixin.load_lora_weights`] method loads LyCORIS checkpoints with LoRA and LoCon modules, but Hada and LoKR are not supported. + + + +Loading a checkpoint from TheLastBen is very similar. 
For example, to load the [TheLastBen/William_Eggleston_Style_SDXL](https://huggingface.co/TheLastBen/William_Eggleston_Style_SDXL) checkpoint: + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") +pipeline.load_lora_weights("TheLastBen/William_Eggleston_Style_SDXL", weight_name="wegg.safetensors") + +# use by william eggleston in the prompt to trigger the LoRA +prompt = "a house by william eggleston, sunrays, beautiful, sunlight, sunrays, beautiful" +image = pipeline(prompt=prompt).images[0] +``` \ No newline at end of file diff --git a/docs/source/en/using-diffusers/pipeline_overview.md b/docs/source/en/using-diffusers/pipeline_overview.md index 4ee25b51dc6f..6d3ee7cc61ce 100644 --- a/docs/source/en/using-diffusers/pipeline_overview.md +++ b/docs/source/en/using-diffusers/pipeline_overview.md @@ -14,4 +14,4 @@ specific language governing permissions and limitations under the License. A pipeline is an end-to-end class that provides a quick and easy way to use a diffusion system for inference by bundling independently trained models and schedulers together. Certain combinations of models and schedulers define specific pipeline types, like [`StableDiffusionXLPipeline`] or [`StableDiffusionControlNetPipeline`], with specific capabilities. All pipeline types inherit from the base [`DiffusionPipeline`] class; pass it any checkpoint, and it'll automatically detect the pipeline type and load the necessary components. -This section introduces you to some of the more complex pipelines like Stable Diffusion XL, ControlNet, and DiffEdit, which require additional inputs. You'll also learn how to use a distilled version of the Stable Diffusion model to speed up inference, how to control randomness on your hardware when generating images, and how to create a community pipeline for a custom task like generating images from speech. \ No newline at end of file +This section demonstrates how to use specific pipelines such as Stable Diffusion XL, ControlNet, and DiffEdit. You'll also learn how to use a distilled version of the Stable Diffusion model to speed up inference, how to create reproducible pipelines, and how to use and contribute community pipelines. \ No newline at end of file diff --git a/docs/source/en/using-diffusers/textual_inversion_inference.md b/docs/source/en/using-diffusers/textual_inversion_inference.md index 0ca4ecc58d4e..821b8ec6745a 100644 --- a/docs/source/en/using-diffusers/textual_inversion_inference.md +++ b/docs/source/en/using-diffusers/textual_inversion_inference.md @@ -4,7 +4,7 @@ The [`StableDiffusionPipeline`] supports textual inversion, a technique that enables a model like Stable Diffusion to learn a new concept from just a few sample images. This gives you more control over the generated images and allows you to tailor the model towards specific concepts. You can get started quickly with a collection of community created concepts in the [Stable Diffusion Conceptualizer](https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer). -This guide will show you how to run inference with textual inversion using a pre-learned concept from the Stable Diffusion Conceptualizer. If you're interested in teaching a model new concepts with textual inversion, take a look at the [Textual Inversion](./training/text_inversion) training guide. 
+This guide will show you how to run inference with textual inversion using a pre-learned concept from the Stable Diffusion Conceptualizer. If you're interested in teaching a model new concepts with textual inversion, take a look at the [Textual Inversion](../training/text_inversion) training guide. Login to your Hugging Face account: diff --git a/docs/source/ja/_toctree.yml b/docs/source/ja/_toctree.yml new file mode 100644 index 000000000000..7af1f9f2b28d --- /dev/null +++ b/docs/source/ja/_toctree.yml @@ -0,0 +1,10 @@ +- sections: + - local: index + title: 🧨 Diffusers + - local: quicktour + title: 簡単な案内 + - local: stable_diffusion + title: 効果的で効率的な拡散モデル + - local: installation + title: インストール + title: はじめに \ No newline at end of file diff --git a/docs/source/ja/index.md b/docs/source/ja/index.md new file mode 100644 index 000000000000..6e8ba78dd55f --- /dev/null +++ b/docs/source/ja/index.md @@ -0,0 +1,98 @@ + + +

+
+ +
+

+ +# Diffusers + +🤗 Diffusers は、画像や音声、さらには分子の3D構造を生成するための、最先端の事前学習済みDiffusion Model(拡散モデル)を提供するライブラリです。シンプルな生成ソリューションをお探しの場合でも、独自の拡散モデルをトレーニングしたい場合でも、🤗 Diffusers はその両方をサポートするモジュール式のツールボックスです。我々のライブラリは、[性能より使いやすさ](conceptual/philosophy#usability-over-performance)、[簡単よりシンプル](conceptual/philosophy#simple-over-easy)、[抽象化よりカスタマイズ性](conceptual/philosophy#tweakable-contributorfriendly-over-abstraction)に重点を置いて設計されています。 + +このライブラリには3つの主要コンポーネントがあります: + +- 最先端の[拡散パイプライン](api/pipelines/overview)で数行のコードで生成が可能です。 +- 交換可能な[ノイズスケジューラ](api/schedulers/overview)で生成速度と品質のトレードオフのバランスをとれます。 +- 事前に訓練された[モデル](api/models)は、ビルディングブロックとして使用することができ、スケジューラと組み合わせることで、独自のエンドツーエンドの拡散システムを作成することができます。 + + + +## Supported pipelines + +| Pipeline | Paper/Repository | Tasks | +|---|---|:---:| +| [alt_diffusion](./api/pipelines/alt_diffusion) | [AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities](https://arxiv.org/abs/2211.06679) | Image-to-Image Text-Guided Generation | +| [audio_diffusion](./api/pipelines/audio_diffusion) | [Audio Diffusion](https://github.com/teticio/audio-diffusion.git) | Unconditional Audio Generation | +| [controlnet](./api/pipelines/controlnet) | [Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) | Image-to-Image Text-Guided Generation | +| [cycle_diffusion](./api/pipelines/cycle_diffusion) | [Unifying Diffusion Models' Latent Space, with Applications to CycleDiffusion and Guidance](https://arxiv.org/abs/2210.05559) | Image-to-Image Text-Guided Generation | +| [dance_diffusion](./api/pipelines/dance_diffusion) | [Dance Diffusion](https://github.com/williamberman/diffusers.git) | Unconditional Audio Generation | +| [ddpm](./api/pipelines/ddpm) | [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) | Unconditional Image Generation | +| [ddim](./api/pipelines/ddim) | [Denoising Diffusion Implicit Models](https://arxiv.org/abs/2010.02502) | Unconditional Image Generation | +| [if](./if) | [**IF**](./api/pipelines/if) | Image Generation | +| [if_img2img](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation | +| [if_inpainting](./if) | [**IF**](./api/pipelines/if) | Image-to-Image Generation | +| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Text-to-Image Generation | +| [latent_diffusion](./api/pipelines/latent_diffusion) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752)| Super Resolution Image-to-Image | +| [latent_diffusion_uncond](./api/pipelines/latent_diffusion_uncond) | [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) | Unconditional Image Generation | +| [paint_by_example](./api/pipelines/paint_by_example) | [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://arxiv.org/abs/2211.13227) | Image-Guided Image Inpainting | +| [pndm](./api/pipelines/pndm) | [Pseudo Numerical Methods for Diffusion Models on Manifolds](https://arxiv.org/abs/2202.09778) | Unconditional Image Generation | +| [score_sde_ve](./api/pipelines/score_sde_ve) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | Unconditional Image Generation | +| [score_sde_vp](./api/pipelines/score_sde_vp) | [Score-Based Generative Modeling through Stochastic Differential Equations](https://openreview.net/forum?id=PxTIG12RRHS) | 
Unconditional Image Generation | +| [semantic_stable_diffusion](./api/pipelines/semantic_stable_diffusion) | [Semantic Guidance](https://arxiv.org/abs/2301.12247) | Text-Guided Generation | +| [stable_diffusion_adapter](./api/pipelines/stable_diffusion/adapter) | [**T2I-Adapter**](https://arxiv.org/abs/2302.08453) | Image-to-Image Text-Guided Generation | - +| [stable_diffusion_text2img](./api/pipelines/stable_diffusion/text2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-to-Image Generation | +| [stable_diffusion_img2img](./api/pipelines/stable_diffusion/img2img) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Image-to-Image Text-Guided Generation | +| [stable_diffusion_inpaint](./api/pipelines/stable_diffusion/inpaint) | [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release) | Text-Guided Image Inpainting | +| [stable_diffusion_panorama](./api/pipelines/stable_diffusion/panorama) | [MultiDiffusion](https://multidiffusion.github.io/) | Text-to-Panorama Generation | +| [stable_diffusion_pix2pix](./api/pipelines/stable_diffusion/pix2pix) | [InstructPix2Pix: Learning to Follow Image Editing Instructions](https://arxiv.org/abs/2211.09800) | Text-Guided Image Editing| +| [stable_diffusion_pix2pix_zero](./api/pipelines/stable_diffusion/pix2pix_zero) | [Zero-shot Image-to-Image Translation](https://pix2pixzero.github.io/) | Text-Guided Image Editing | +| [stable_diffusion_attend_and_excite](./api/pipelines/stable_diffusion/attend_and_excite) | [Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models](https://arxiv.org/abs/2301.13826) | Text-to-Image Generation | +| [stable_diffusion_self_attention_guidance](./api/pipelines/stable_diffusion/self_attention_guidance) | [Improving Sample Quality of Diffusion Models Using Self-Attention Guidance](https://arxiv.org/abs/2210.00939) | Text-to-Image Generation Unconditional Image Generation | +| [stable_diffusion_image_variation](./stable_diffusion/image_variation) | [Stable Diffusion Image Variations](https://github.com/LambdaLabsML/lambda-diffusers#stable-diffusion-image-variations) | Image-to-Image Generation | +| [stable_diffusion_latent_upscale](./stable_diffusion/latent_upscale) | [Stable Diffusion Latent Upscaler](https://twitter.com/StabilityAI/status/1590531958815064065) | Text-Guided Super Resolution Image-to-Image | +| [stable_diffusion_model_editing](./api/pipelines/stable_diffusion/model_editing) | [Editing Implicit Assumptions in Text-to-Image Diffusion Models](https://time-diffusion.github.io/) | Text-to-Image Model Editing | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-to-Image Generation | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Image Inpainting | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Depth-Conditional Stable Diffusion](https://github.com/Stability-AI/stablediffusion#depth-conditional-stable-diffusion) | Depth-to-Image Generation | +| [stable_diffusion_2](./api/pipelines/stable_diffusion_2) | [Stable Diffusion 2](https://stability.ai/blog/stable-diffusion-v2-release) | Text-Guided Super Resolution Image-to-Image | +| [stable_diffusion_safe](./api/pipelines/stable_diffusion_safe) | [Safe Stable Diffusion](https://arxiv.org/abs/2211.05105) | Text-Guided Generation | +| 
[stable_unclip](./stable_unclip) | Stable unCLIP | Text-to-Image Generation | +| [stable_unclip](./stable_unclip) | Stable unCLIP | Image-to-Image Text-Guided Generation | +| [stochastic_karras_ve](./api/pipelines/stochastic_karras_ve) | [Elucidating the Design Space of Diffusion-Based Generative Models](https://arxiv.org/abs/2206.00364) | Unconditional Image Generation | +| [text_to_video_sd](./api/pipelines/text_to_video) | [Modelscope's Text-to-video-synthesis Model in Open Domain](https://modelscope.cn/models/damo/text-to-video-synthesis/summary) | Text-to-Video Generation | +| [unclip](./api/pipelines/unclip) | [Hierarchical Text-Conditional Image Generation with CLIP Latents](https://arxiv.org/abs/2204.06125)(implementation by [kakaobrain](https://github.com/kakaobrain/karlo)) | Text-to-Image Generation | +| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Text-to-Image Generation | +| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Image Variations Generation | +| [versatile_diffusion](./api/pipelines/versatile_diffusion) | [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://arxiv.org/abs/2211.08332) | Dual Image and Text Guided Generation | +| [vq_diffusion](./api/pipelines/vq_diffusion) | [Vector Quantized Diffusion Model for Text-to-Image Synthesis](https://arxiv.org/abs/2111.14822) | Text-to-Image Generation | +| [stable_diffusion_ldm3d](./api/pipelines/stable_diffusion/ldm3d_diffusion) | [LDM3D: Latent Diffusion Model for 3D](https://arxiv.org/abs/2305.10853) | Text to Image and Depth Generation | diff --git a/docs/source/ja/installation.md b/docs/source/ja/installation.md new file mode 100644 index 000000000000..dbfd19d6cb7a --- /dev/null +++ b/docs/source/ja/installation.md @@ -0,0 +1,145 @@ + + +# インストール + +お使いのディープラーニングライブラリに合わせてDiffusersをインストールできます。 + +🤗 DiffusersはPython 3.8+、PyTorch 1.7.0+、Flaxでテストされています。使用するディープラーニングライブラリの以下のインストール手順に従ってください: + +- [PyTorch](https://pytorch.org/get-started/locally/)のインストール手順。 +- [Flax](https://flax.readthedocs.io/en/latest/)のインストール手順。 + +## pip でインストール + +Diffusersは[仮想環境](https://docs.python.org/3/library/venv.html)の中でインストールすることが推奨されています。 +Python の仮想環境についてよく知らない場合は、こちらの [ガイド](https://packaging.python.org/guides/installing-using-pip-and-virtual-environments/) を参照してください。 +仮想環境は異なるプロジェクトの管理を容易にし、依存関係間の互換性の問題を回避します。 + +ではさっそく、プロジェクトディレクトリに仮想環境を作ってみます: + +```bash +python -m venv .env +``` + +仮想環境をアクティブにします: + +```bash +source .env/bin/activate +``` + +🤗 Diffusers もまた 🤗 Transformers ライブラリに依存しており、以下のコマンドで両方をインストールできます: + + + +```bash +pip install diffusers["torch"] transformers +``` + + +```bash +pip install diffusers["flax"] transformers +``` + + + +## ソースからのインストール + +ソースから🤗 Diffusersをインストールする前に、`torch`と🤗 Accelerateがインストールされていることを確認してください。 + +`torch`のインストールについては、`torch` [インストール](https://pytorch.org/get-started/locally/#start-locally)ガイドを参照してください。 + +🤗 Accelerateをインストールするには: + +```bash +pip install accelerate +``` + +以下のコマンドでソースから🤗 Diffusersをインストールできます: + +```bash +pip install git+https://github.com/huggingface/diffusers +``` + +このコマンドは最新の `stable` バージョンではなく、最先端の `main` バージョンをインストールします。 +`main`バージョンは最新の開発に対応するのに便利です。 +例えば、前回の公式リリース以降にバグが修正されたが、新しいリリースがまだリリースされていない場合などには都合がいいです。 +しかし、これは `main` バージョンが常に安定しているとは限らないです。 +私たちは `main` 
バージョンを運用し続けるよう努力しており、ほとんどの問題は通常数時間から1日以内に解決されます。 +もし問題が発生した場合は、[Issue](https://github.com/huggingface/diffusers/issues/new/choose) を開いてください! + +## 編集可能なインストール + +以下の場合、編集可能なインストールが必要です: + +* ソースコードの `main` バージョンを使用する。 +* 🤗 Diffusers に貢献し、コードの変更をテストする必要がある場合。 + +リポジトリをクローンし、次のコマンドで 🤗 Diffusers をインストールしてください: + +```bash +git clone https://github.com/huggingface/diffusers.git +cd diffusers +``` + + + +```bash +pip install -e ".[torch]" +``` + + +```bash +pip install -e ".[flax]" +``` + + + +これらのコマンドは、リポジトリをクローンしたフォルダと Python のライブラリパスをリンクします。 +Python は通常のライブラリパスに加えて、クローンしたフォルダの中を探すようになります。 +例えば、Python パッケージが通常 `~/anaconda3/envs/main/lib/python3.8/site-packages/` にインストールされている場合、Python はクローンした `~/diffusers/` フォルダも同様に参照します。 + + + +ライブラリを使い続けたい場合は、`diffusers`フォルダを残しておく必要があります。 + + + +これで、以下のコマンドで簡単にクローンを最新版の🤗 Diffusersにアップデートできます: + +```bash +cd ~/diffusers/ +git pull +``` + +Python環境は次の実行時に `main` バージョンの🤗 Diffusersを見つけます。 + +## テレメトリー・ロギングに関するお知らせ + +このライブラリは `from_pretrained()` リクエスト中にデータを収集します。 +このデータには Diffusers と PyTorch/Flax のバージョン、要求されたモデルやパイプラインクラスが含まれます。 +また、Hubでホストされている場合は、事前に学習されたチェックポイントへのパスが含まれます。 +この使用データは問題のデバッグや新機能の優先順位付けに役立ちます。 +テレメトリーはHuggingFace Hubからモデルやパイプラインをロードするときのみ送信されます。ローカルでの使用中は収集されません。 + +我々は、すべての人が追加情報を共有したくないことを理解し、あなたのプライバシーを尊重します。 +そのため、ターミナルから `DISABLE_TELEMETRY` 環境変数を設定することで、データ収集を無効にすることができます: + +Linux/MacOSの場合 +```bash +export DISABLE_TELEMETRY=YES +``` + +Windows の場合 +```bash +set DISABLE_TELEMETRY=YES +``` diff --git a/docs/source/ja/quicktour.md b/docs/source/ja/quicktour.md new file mode 100644 index 000000000000..04c93af4168c --- /dev/null +++ b/docs/source/ja/quicktour.md @@ -0,0 +1,316 @@ + + +[[open-in-colab]] + +# 簡単な案内 + +拡散モデル(Diffusion Model)は、ランダムな正規分布から段階的にノイズ除去するように学習され、画像や音声などの目的のものを生成できます。これは生成AIに多大な関心を呼び起こしました。インターネット上で拡散によって生成された画像の例を見たことがあるでしょう。🧨 Diffusersは、誰もが拡散モデルに広くアクセスできるようにすることを目的としたライブラリです。 + +この案内では、開発者または日常的なユーザーに関わらず、🧨 Diffusers を紹介し、素早く目的のものを生成できるようにします!このライブラリには3つの主要コンポーネントがあります: + +* [`DiffusionPipeline`]は事前に学習された拡散モデルからサンプルを迅速に生成するために設計された高レベルのエンドツーエンドクラス。 +* 拡散システムを作成するためのビルディングブロックとして使用できる、人気のある事前学習された[モデル](./api/models)アーキテクチャとモジュール。 +* 多くの異なる[スケジューラ](./api/schedulers/overview) - ノイズがどのようにトレーニングのために加えられるか、そして生成中にどのようにノイズ除去された画像を生成するかを制御するアルゴリズム。 + +この案内では、[`DiffusionPipeline`]を生成に使用する方法を紹介し、モデルとスケジューラを組み合わせて[`DiffusionPipeline`]の内部で起こっていることを再現する方法を説明します。 + + + +この案内は🧨 Diffusers [ノートブック](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/diffusers_intro.ipynb)を簡略化したもので、すぐに使い始めることができます。Diffusers 🧨のゴール、設計哲学、コアAPIの詳細についてもっと知りたい方は、ノートブックをご覧ください! 
+ + + +始める前に必要なライブラリーがすべてインストールされていることを確認してください: + +```py +# uncomment to install the necessary libraries in Colab +#!pip install --upgrade diffusers accelerate transformers +``` + +- [🤗 Accelerate](https://huggingface.co/docs/accelerate/index)生成とトレーニングのためのモデルのロードを高速化します +- [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview)ような最も一般的な拡散モデルを実行するには、[🤗 Transformers](https://huggingface.co/docs/transformers/index)が必要です。 + +## 拡散パイプライン + +[`DiffusionPipeline`]は事前学習された拡散システムを生成に使用する最も簡単な方法です。これはモデルとスケジューラを含むエンドツーエンドのシステムです。[`DiffusionPipeline`]は多くの作業/タスクにすぐに使用することができます。また、サポートされているタスクの完全なリストについては[🧨Diffusersの概要](./api/pipelines/overview#diffusers-summary)の表を参照してください。 + +| **タスク** | **説明** | **パイプライン** +|------------------------------|--------------------------------------------------------------------------------------------------------------|-----------------| +| Unconditional Image Generation | 正規分布から画像生成 | [unconditional_image_generation](./using-diffusers/unconditional_image_generation) | +| Text-Guided Image Generation | 文章から画像生成 | [conditional_image_generation](./using-diffusers/conditional_image_generation) | +| Text-Guided Image-to-Image Translation | 画像と文章から新たな画像生成 | [img2img](./using-diffusers/img2img) | +| Text-Guided Image-Inpainting | 画像、マスク、および文章が指定された場合に、画像のマスクされた部分を文章をもとに修復 | [inpaint](./using-diffusers/inpaint) | +| Text-Guided Depth-to-Image Translation | 文章と深度推定によって構造を保持しながら画像生成 | [depth2img](./using-diffusers/depth2img) | + +まず、[`DiffusionPipeline`]のインスタンスを作成し、ダウンロードしたいパイプラインのチェックポイントを指定します。 +この[`DiffusionPipeline`]はHugging Face Hubに保存されている任意の[チェックポイント](https://huggingface.co/models?library=diffusers&sort=downloads)を使用することができます。 +この案内では、[`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)チェックポイントでテキストから画像へ生成します。 + + + +[Stable Diffusion]モデルについては、モデルを実行する前にまず[ライセンス](https://huggingface.co/spaces/CompVis/stable-diffusion-license)を注意深くお読みください。🧨 Diffusers は、攻撃的または有害なコンテンツを防ぐために [`safety_checker`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/safety_checker.py) を実装していますが、モデルの改良された画像生成機能により、潜在的に有害なコンテンツが生成される可能性があります。 + + + +モデルを[`~DiffusionPipeline.from_pretrained`]メソッドでロードします: + +```python +>>> from diffusers import DiffusionPipeline + +>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) +``` +[`DiffusionPipeline`]は全てのモデリング、トークン化、スケジューリングコンポーネントをダウンロードしてキャッシュします。Stable Diffusionパイプラインは[`UNet2DConditionModel`]と[`PNDMScheduler`]などで構成されています: + +```py +>>> pipeline +StableDiffusionPipeline { + "_class_name": "StableDiffusionPipeline", + "_diffusers_version": "0.13.1", + ..., + "scheduler": [ + "diffusers", + "PNDMScheduler" + ], + ..., + "unet": [ + "diffusers", + "UNet2DConditionModel" + ], + "vae": [ + "diffusers", + "AutoencoderKL" + ] +} +``` + +このモデルはおよそ14億個のパラメータで構成されているため、GPU上でパイプラインを実行することを強く推奨します。 +PyTorchと同じように、ジェネレータオブジェクトをGPUに移すことができます: + +```python +>>> pipeline.to("cuda") +``` + +これで、文章を `pipeline` に渡して画像を生成し、ノイズ除去された画像にアクセスできるようになりました。デフォルトでは、画像出力は[`PIL.Image`](https://pillow.readthedocs.io/en/stable/reference/Image.html?highlight=image#the-image-class)オブジェクトでラップされます。 + +```python +>>> image = pipeline("An image of a squirrel in Picasso style").images[0] +>>> image +``` + +
+ +
+ +`save`関数で画像を保存できます: + +```python +>>> image.save("image_of_squirrel_painting.png") +``` + +### ローカルパイプライン + +ローカルでパイプラインを使用することもできます。唯一の違いは、最初にウェイトをダウンロードする必要があることです: + +```bash +!git lfs install +!git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 +``` + +保存したウェイトをパイプラインにロードします: + +```python +>>> pipeline = DiffusionPipeline.from_pretrained("./stable-diffusion-v1-5", use_safetensors=True) +``` + +これで、上のセクションと同じようにパイプラインを動かすことができます。 + +### スケジューラの交換 + +スケジューラーによって、ノイズ除去のスピードや品質のトレードオフが異なります。どれが自分に最適かを知る最善の方法は、実際に試してみることです!Diffusers 🧨の主な機能の1つは、スケジューラを簡単に切り替えることができることです。例えば、デフォルトの[`PNDMScheduler`]を[`EulerDiscreteScheduler`]に置き換えるには、[`~diffusers.ConfigMixin.from_config`]メソッドでロードできます: + +```py +>>> from diffusers import EulerDiscreteScheduler + +>>> pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", use_safetensors=True) +>>> pipeline.scheduler = EulerDiscreteScheduler.from_config(pipeline.scheduler.config) +``` + +新しいスケジューラを使って画像を生成し、その違いに気づくかどうか試してみてください! + +次のセクションでは、[`DiffusionPipeline`]を構成するコンポーネント(モデルとスケジューラ)を詳しく見て、これらのコンポーネントを使って猫の画像を生成する方法を学びます。 + +## モデル + +ほとんどのモデルはノイズの多いサンプルを取り、各タイムステップで*残りのノイズ*を予測します(他のモデルは前のサンプルを直接予測するか、速度または[`v-prediction`](https://github.com/huggingface/diffusers/blob/5e5ce13e2f89ac45a0066cb3f369462a3cf1d9ef/src/diffusers/schedulers/scheduling_ddim.py#L110)を予測するように学習します)。モデルを混ぜて他の拡散システムを作ることもできます。 + +モデルは[`~ModelMixin.from_pretrained`]メソッドで開始されます。このメソッドはモデルをローカルにキャッシュするので、次にモデルをロードするときに高速になります。この案内では、[`UNet2DModel`]をロードします。これは基本的な画像生成モデルであり、猫画像で学習されたチェックポイントを使います: + +```py +>>> from diffusers import UNet2DModel + +>>> repo_id = "google/ddpm-cat-256" +>>> model = UNet2DModel.from_pretrained(repo_id, use_safetensors=True) +``` + +モデルのパラメータにアクセスするには、`model.config` を呼び出せます: + +```py +>>> model.config +``` + +モデル構成は🧊凍結🧊されたディクショナリであり、モデル作成後にこれらのパラメー タを変更することはできません。これは意図的なもので、最初にモデル・アーキテクチャを定義するために使用されるパラメータが同じままであることを保証します。他のパラメータは生成中に調整することができます。 + +最も重要なパラメータは以下の通りです: + +* sample_size`: 入力サンプルの高さと幅。 +* `in_channels`: 入力サンプルの入力チャンネル数。 +* down_block_types` と `up_block_types`: UNet アーキテクチャを作成するために使用されるダウンサンプリングブロックとアップサンプリングブロックのタイプ。 +* block_out_channels`: ダウンサンプリングブロックの出力チャンネル数。逆順でアップサンプリングブロックの入力チャンネル数にも使用されます。 +* layer_per_block`: 各 UNet ブロックに含まれる ResNet ブロックの数。 + +このモデルを生成に使用するには、ランダムな画像の形の正規分布を作成します。このモデルは複数のランダムな正規分布を受け取ることができるため`batch`軸を入れます。入力チャンネル数に対応する`channel`軸も必要です。画像の高さと幅に対応する`sample_size`軸を持つ必要があります: + +```py +>>> import torch + +>>> torch.manual_seed(0) + +>>> noisy_sample = torch.randn(1, model.config.in_channels, model.config.sample_size, model.config.sample_size) +>>> noisy_sample.shape +torch.Size([1, 3, 256, 256]) +``` + +画像生成には、ノイズの多い画像と `timestep` をモデルに渡します。`timestep`は入力画像がどの程度ノイズが多いかを示します。これは、モデルが拡散プロセスにおける自分の位置を決定するのに役立ちます。モデルの出力を得るには `sample` メソッドを使用します: + +```py +>>> with torch.no_grad(): +... 
noisy_residual = model(sample=noisy_sample, timestep=2).sample +``` + +しかし、実際の例を生成するには、ノイズ除去プロセスをガイドするスケジューラが必要です。次のセクションでは、モデルをスケジューラと組み合わせる方法を学びます。 + +## スケジューラ + +スケジューラは、モデルの出力(この場合は `noisy_residual` )が与えられたときに、ノイズの多いサンプルからノイズの少ないサンプルへの移行を管理します。 + + + + +🧨 Diffusersは拡散システムを構築するためのツールボックスです。[`DiffusionPipeline`]は事前に構築された拡散システムを使い始めるのに便利な方法ですが、独自のモデルとスケジューラコンポーネントを個別に選択してカスタム拡散システムを構築することもできます。 + + + +この案内では、[`DDPMScheduler`]を[`~diffusers.ConfigMixin.from_config`]メソッドでインスタンス化します: + +```py +>>> from diffusers import DDPMScheduler + +>>> scheduler = DDPMScheduler.from_config(repo_id) +>>> scheduler +DDPMScheduler { + "_class_name": "DDPMScheduler", + "_diffusers_version": "0.13.1", + "beta_end": 0.02, + "beta_schedule": "linear", + "beta_start": 0.0001, + "clip_sample": true, + "clip_sample_range": 1.0, + "num_train_timesteps": 1000, + "prediction_type": "epsilon", + "trained_betas": null, + "variance_type": "fixed_small" +} +``` + + + +💡 スケジューラがどのようにコンフィギュレーションからインスタンス化されるかに注目してください。モデルとは異なり、スケジューラは学習可能な重みを持たず、パラメーターを持ちません! + + + +最も重要なパラメータは以下の通りです: + +* num_train_timesteps`: ノイズ除去処理の長さ、言い換えれば、ランダムな正規分布をデータサンプルに処理するのに必要なタイムステップ数です。 +* `beta_schedule`: 生成とトレーニングに使用するノイズスケジュールのタイプ。 +* `beta_start` と `beta_end`: ノイズスケジュールの開始値と終了値。 + +少しノイズの少ない画像を予測するには、スケジューラの [`~diffusers.DDPMScheduler.step`] メソッドに以下を渡します: モデルの出力、`timestep`、現在の `sample`。 + +```py +>>> less_noisy_sample = scheduler.step(model_output=noisy_residual, timestep=2, sample=noisy_sample).prev_sample +>>> less_noisy_sample.shape +``` + +`less_noisy_sample`は次の`timestep`に渡すことができ、そこでさらにノイズが少なくなります! + +では、すべてをまとめて、ノイズ除去プロセス全体を視覚化してみましょう。 + +まず、ノイズ除去された画像を後処理して `PIL.Image` として表示する関数を作成します: + +```py +>>> import PIL.Image +>>> import numpy as np + + +>>> def display_sample(sample, i): +... image_processed = sample.cpu().permute(0, 2, 3, 1) +... image_processed = (image_processed + 1.0) * 127.5 +... image_processed = image_processed.numpy().astype(np.uint8) + +... image_pil = PIL.Image.fromarray(image_processed[0]) +... display(f"Image at step {i}") +... display(image_pil) +``` + +ノイズ除去処理を高速化するために入力とモデルをGPUに移します: + +```py +>>> model.to("cuda") +>>> noisy_sample = noisy_sample.to("cuda") +``` + +ここで、ノイズが少なくなったサンプルの残りのノイズを予測するノイズ除去ループを作成し、スケジューラを使ってさらにノイズの少ないサンプルを計算します: + +```py +>>> import tqdm + +>>> sample = noisy_sample + +>>> for i, t in enumerate(tqdm.tqdm(scheduler.timesteps)): +... # 1. predict noise residual +... with torch.no_grad(): +... residual = model(sample, t).sample + +... # 2. compute less noisy image and set x_t -> x_t-1 +... sample = scheduler.step(residual, t, sample).prev_sample + +... # 3. optionally look at image +... if (i + 1) % 50 == 0: +... display_sample(sample, i + 1) +``` + +何もないところから猫が生成されるのを、座って見てください!😻 + +
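+最終的な `sample` を画像ファイルとして残したい場合は、`display_sample` と同じ後処理で `PIL.Image` に変換して保存できます(ファイル名は一例です):
+
+```py
+>>> # display_sample と同じ後処理で最終サンプルを PIL 画像へ変換して保存する(ファイル名は一例)
+>>> image_processed = sample.cpu().permute(0, 2, 3, 1)
+>>> image_processed = (image_processed + 1.0) * 127.5
+>>> image_processed = image_processed.numpy().astype(np.uint8)
+>>> PIL.Image.fromarray(image_processed[0]).save("ddpm_cat.png")
+```
+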
+ +
+ +## 次のステップ + +このクイックツアーで、🧨ディフューザーを使ったクールな画像をいくつか作成できたと思います!次のステップとして + +* モデルをトレーニングまたは微調整については、[training](./tutorials/basic_training)チュートリアルを参照してください。 +* 様々な使用例については、公式およびコミュニティの[training or finetuning scripts](https://github.com/huggingface/diffusers/tree/main/examples#-diffusers-examples)の例を参照してください。 +* スケジューラのロード、アクセス、変更、比較については[Using different Schedulers](./using-diffusers/schedulers)ガイドを参照してください。 +* プロンプトエンジニアリング、スピードとメモリの最適化、より高品質な画像を生成するためのヒントやトリックについては、[Stable Diffusion](./stable_diffusion)ガイドを参照してください。 +* 🧨 Diffusers の高速化については、最適化された [PyTorch on a GPU](./optimization/fp16)のガイド、[Stable Diffusion on Apple Silicon (M1/M2)](./optimization/mps)と[ONNX Runtime](./optimization/onnx)を参照してください。 diff --git a/docs/source/ja/stable_diffusion.md b/docs/source/ja/stable_diffusion.md new file mode 100644 index 000000000000..fb5afc49435b --- /dev/null +++ b/docs/source/ja/stable_diffusion.md @@ -0,0 +1,260 @@ + + +# 効果的で効率的な拡散モデル + +[[open-in-colab]] + +[`DiffusionPipeline`]を使って特定のスタイルで画像を生成したり、希望する画像を生成したりするのは難しいことです。多くの場合、[`DiffusionPipeline`]を何度か実行してからでないと満足のいく画像は得られません。しかし、何もないところから何かを生成するにはたくさんの計算が必要です。生成を何度も何度も実行する場合、特にたくさんの計算量が必要になります。 + +そのため、パイプラインから*計算*(速度)と*メモリ*(GPU RAM)の効率を最大限に引き出し、生成サイクル間の時間を短縮することで、より高速な反復処理を行えるようにすることが重要です。 + +このチュートリアルでは、[`DiffusionPipeline`]を用いて、より速く、より良い計算を行う方法を説明します。 + +まず、[`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5)モデルをロードします: + +```python +from diffusers import DiffusionPipeline + +model_id = "runwayml/stable-diffusion-v1-5" +pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True) +``` + +ここで使用するプロンプトの例は年老いた戦士の長の肖像画ですが、ご自由に変更してください: + +```python +prompt = "portrait photo of a old warrior chief" +``` + +## Speed + + + +💡 GPUを利用できない場合は、[Colab](https://colab.research.google.com/)のようなGPUプロバイダーから無料で利用できます! + + + +画像生成を高速化する最も簡単な方法の1つは、PyTorchモジュールと同じようにGPU上にパイプラインを配置することです: + +```python +pipeline = pipeline.to("cuda") +``` + +同じイメージを使って改良できるようにするには、[`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html)を使い、[reproducibility](./using-diffusers/reproducibility)の種を設定します: + +```python +import torch + +generator = torch.Generator("cuda").manual_seed(0) +``` + +これで画像を生成できます: + +```python +image = pipeline(prompt, generator=generator).images[0] +image +``` + +
+ +
+
+
+この処理にはT4 GPUで~30秒かかりました(割り当てられているGPUがT4より優れている場合はもっと速いかもしれません)。デフォルトでは、[`DiffusionPipeline`]は完全な`float32`精度で生成を50ステップ実行します。`float16`のような低い精度に変更するか、推論ステップ数を減らすことで高速化することができます。
+
+まずは `float16` でモデルをロードして画像を生成してみましょう:
+
+```python
+import torch
+
+pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True)
+pipeline = pipeline.to("cuda")
+generator = torch.Generator("cuda").manual_seed(0)
+image = pipeline(prompt, generator=generator).images[0]
+image
+```
+
+ +
+ +今回、画像生成にかかった時間はわずか11秒で、以前より3倍近く速くなりました! + + + +💡 パイプラインは常に `float16` で実行することを強くお勧めします。 + + + +生成ステップ数を減らすという方法もあります。より効率的なスケジューラを選択することで、出力品質を犠牲にすることなくステップ数を減らすことができます。`compatibles`メソッドを呼び出すことで、[`DiffusionPipeline`]の現在のモデルと互換性のあるスケジューラを見つけることができます: + +```python +pipeline.scheduler.compatibles +[ + diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler, + diffusers.schedulers.scheduling_unipc_multistep.UniPCMultistepScheduler, + diffusers.schedulers.scheduling_k_dpm_2_discrete.KDPM2DiscreteScheduler, + diffusers.schedulers.scheduling_deis_multistep.DEISMultistepScheduler, + diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler, + diffusers.schedulers.scheduling_dpmsolver_multistep.DPMSolverMultistepScheduler, + diffusers.schedulers.scheduling_ddpm.DDPMScheduler, + diffusers.schedulers.scheduling_dpmsolver_singlestep.DPMSolverSinglestepScheduler, + diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler, + diffusers.schedulers.scheduling_heun_discrete.HeunDiscreteScheduler, + diffusers.schedulers.scheduling_pndm.PNDMScheduler, + diffusers.schedulers.scheduling_euler_ancestral_discrete.EulerAncestralDiscreteScheduler, + diffusers.schedulers.scheduling_ddim.DDIMScheduler, +] +``` + +Stable Diffusionモデルはデフォルトで[`PNDMScheduler`]を使用します。このスケジューラは通常~50の推論ステップを必要としますが、[`DPMSolverMultistepScheduler`]のような高性能なスケジューラでは~20または25の推論ステップで済みます。[`ConfigMixin.from_config`]メソッドを使用すると、新しいスケジューラをロードすることができます: + +```python +from diffusers import DPMSolverMultistepScheduler + +pipeline.scheduler = DPMSolverMultistepScheduler.from_config(pipeline.scheduler.config) +``` + +ここで `num_inference_steps` を20に設定します: + +```python +generator = torch.Generator("cuda").manual_seed(0) +image = pipeline(prompt, generator=generator, num_inference_steps=20).images[0] +image +``` + +
+ +
+ +推論時間をわずか4秒に短縮することに成功した!⚡️ + +## メモリー + +パイプラインのパフォーマンスを向上させるもう1つの鍵は、消費メモリを少なくすることです。一度に生成できる画像の数を確認する最も簡単な方法は、`OutOfMemoryError`(OOM)が発生するまで、さまざまなバッチサイズを試してみることです。 + +文章と `Generators` のリストから画像のバッチを生成する関数を作成します。各 `Generator` にシードを割り当てて、良い結果が得られた場合に再利用できるようにします。 + +```python +def get_inputs(batch_size=1): + generator = [torch.Generator("cuda").manual_seed(i) for i in range(batch_size)] + prompts = batch_size * [prompt] + num_inference_steps = 20 + + return {"prompt": prompts, "generator": generator, "num_inference_steps": num_inference_steps} +``` + +`batch_size=4`で開始し、どれだけメモリを消費したかを確認します: + +```python +from diffusers.utils import make_image_grid + +images = pipeline(**get_inputs(batch_size=4)).images +make_image_grid(images, 2, 2) +``` + +大容量のRAMを搭載したGPUでない限り、上記のコードはおそらく`OOM`エラーを返したはずです!メモリの大半はクロスアテンションレイヤーが占めています。この処理をバッチで実行する代わりに、逐次実行することでメモリを大幅に節約できます。必要なのは、[`~DiffusionPipeline.enable_attention_slicing`]関数を使用することだけです: + +```python +pipeline.enable_attention_slicing() +``` + +今度は`batch_size`を8にしてみてください! + +```python +images = pipeline(**get_inputs(batch_size=8)).images +make_image_grid(images, rows=2, cols=4) +``` + +
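+自分の環境でどこまでバッチサイズを増やせるかを調べたい場合は、例えば次のように `OutOfMemoryError` が出るまで試してみる方法もあります(あくまで一例のスケッチです):
+
+```python
+# 一例: OOM が発生するまでバッチサイズを増やして上限を探る
+for batch_size in (1, 2, 4, 8, 16):
+    try:
+        pipeline(**get_inputs(batch_size=batch_size))
+        print(f"batch_size={batch_size}: OK")
+    except torch.cuda.OutOfMemoryError:
+        print(f"batch_size={batch_size}: OOM")
+        torch.cuda.empty_cache()  # 失敗後に確保済みのメモリを解放しておく
+        break
+```
+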
+ +
+
+
+以前は4枚の画像のバッチを生成することさえできませんでしたが、今では8枚の画像のバッチを1枚あたり~3.5秒で生成できます!これはおそらく、品質を犠牲にすることなくT4 GPUでできる最速の処理速度です。
+
+## 品質
+
+前の2つのセクションでは、`fp16` を使ってパイプラインの速度を最適化する方法、よりパフォーマンスの高いスケジューラーを使って生成ステップ数を減らす方法、アテンションスライスを有効にしてメモリ消費量を減らす方法について学びました。今度は、生成される画像の品質を向上させる方法に焦点を当てます。
+
+### より良いチェックポイント
+
+最も単純なステップは、より良いチェックポイントを使うことです。Stable Diffusionモデルは良い出発点であり、公式発表以来、いくつかの改良版もリリースされています。しかし、新しいバージョンを使ったからといって、自動的に良い結果が得られるわけではありません。最良の結果を得るためには、自分でさまざまなチェックポイントを試してみたり、ちょっとした研究([ネガティブプロンプト](https://minimaxir.com/2022/11/stable-diffusion-negative-prompt/)の使用など)をしたりする必要があります。
+
+この分野が成長するにつれて、特定のスタイルを生み出すために微調整された、より質の高いチェックポイントが増えています。[Hub](https://huggingface.co/models?library=diffusers&sort=downloads)や[Diffusers Gallery](https://huggingface.co/spaces/huggingface-projects/diffusers-gallery)を探索して、興味のあるものを見つけてみてください!
+
+### より良いパイプラインコンポーネント
+
+現在のパイプラインコンポーネントを新しいバージョンに置き換えてみることもできます。Stability AIが提供する最新の[autoencoder](https://huggingface.co/stabilityai/stable-diffusion-2-1/tree/main/vae)をパイプラインにロードし、画像を生成してみましょう:
+
+```python
+from diffusers import AutoencoderKL
+
+vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to("cuda")
+pipeline.vae = vae
+images = pipeline(**get_inputs(batch_size=8)).images
+make_image_grid(images, rows=2, cols=4)
+```
+
+ +
+
+
+### より良いプロンプト・エンジニアリング
+
+画像を生成するために使用する文章は、*プロンプトエンジニアリング*と呼ばれる分野が作られるほど、非常に重要です。プロンプト・エンジニアリングで考慮すべき点は以下の通りです:
+
+- 生成したい画像やその類似画像は、インターネット上にどのように保存されているか?
+- 私が望むスタイルにモデルを誘導するために、どのような追加詳細を与えるべきか?
+
+このことを念頭に置いて、プロンプトに色やより質の高いディテールを含めるように改良してみましょう:
+
+```python
+prompt += ", tribal panther make up, blue on red, side profile, looking away, serious eyes"
+prompt += " 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta"
+```
+
+新しいプロンプトで画像のバッチを生成しましょう:
+
+```python
+images = pipeline(**get_inputs(batch_size=8)).images
+make_image_grid(images, rows=2, cols=4)
+```
+
+ +
+ +かなりいいです!種が`1`の`Generator`に対応する2番目の画像に、被写体の年齢に関するテキストを追加して、もう少し手を加えてみましょう: + +```python +prompts = [ + "portrait photo of the oldest warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a old warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", + "portrait photo of a young warrior chief, tribal panther make up, blue on red, side profile, looking away, serious eyes 50mm portrait photography, hard rim lighting photography--beta --ar 2:3 --beta --upbeta", +] + +generator = [torch.Generator("cuda").manual_seed(1) for _ in range(len(prompts))] +images = pipeline(prompt=prompts, generator=generator, num_inference_steps=25).images +make_image_grid(images, 2, 2) +``` + +
+ +
+ +## 次のステップ + +このチュートリアルでは、[`DiffusionPipeline`]を最適化して計算効率とメモリ効率を向上させ、生成される出力の品質を向上させる方法を学びました。パイプラインをさらに高速化することに興味があれば、以下のリソースを参照してください: + +- [PyTorch 2.0](./optimization/torch2.0)と[`torch.compile`](https://pytorch.org/docs/stable/generated/torch.compile.html)がどのように生成速度を5-300%高速化できるかを学んでください。A100 GPUの場合、画像生成は最大50%速くなります! +- PyTorch 2が使えない場合は、[xFormers](./optimization/xformers)をインストールすることをお勧めします。このライブラリのメモリ効率の良いアテンションメカニズムは PyTorch 1.13.1 と相性が良く、高速化とメモリ消費量の削減を同時に実現します。 +- モデルのオフロードなど、その他の最適化テクニックは [this guide](./optimization/fp16) でカバーされています。 diff --git a/examples/community/README.md b/examples/community/README.md index 1073240d8b94..b4fb44c25385 100755 --- a/examples/community/README.md +++ b/examples/community/README.md @@ -45,6 +45,7 @@ FABRIC - Stable Diffusion with feedback Pipeline | pipeline supports feedback fr sketch inpaint - Inpainting with non-inpaint Stable Diffusion | sketch inpaint much like in automatic1111 | [Masked Im2Im Stable Diffusion Pipeline](#stable-diffusion-masked-im2im) | - | [Anatoly Belikov](https://github.com/noskill) | prompt-to-prompt | change parts of a prompt and retain image structure (see [paper page](https://prompt-to-prompt.github.io/)) | [Prompt2Prompt Pipeline](#prompt2prompt-pipeline) | - | [Umer H. Adil](https://twitter.com/UmerHAdil) | | Latent Consistency Pipeline | Implementation of [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://arxiv.org/abs/2310.04378) | [Latent Consistency Pipeline](#latent-consistency-pipeline) | - | [Simian Luo](https://github.com/luosiallen) | +| Latent Consistency Img2img Pipeline | Img2img pipeline for Latent Consistency Models | [Latent Consistency Img2Img Pipeline](#latent-consistency-img2img-pipeline) | - | [Logan Zoellner](https://github.com/nagolinc) | To load a custom pipeline you just need to pass the `custom_pipeline` argument to `DiffusionPipeline`, as one of the files in `diffusers/examples/community`. Feel free to send a PR with your own pipelines, we will merge them quickly. @@ -2185,3 +2186,35 @@ images = pipe(prompt=prompt, num_inference_steps=num_inference_steps, guidance_s For any questions or feedback, feel free to reach out to [Simian Luo](https://github.com/luosiallen). You can also try this pipeline directly in the [🚀 official spaces](https://huggingface.co/spaces/SimianLuo/Latent_Consistency_Model). + + + +### Latent Consistency Img2img Pipeline + +This pipeline extends the Latent Consistency Pipeline to allow it to take an input image. + +```py +from diffusers import DiffusionPipeline +import torch + +pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", custom_pipeline="latent_consistency_img2img") + +# To save GPU memory, torch.float16 can be used, but it may compromise image quality. +pipe.to(torch_device="cuda", torch_dtype=torch.float32) +``` + +- 2. Run inference with as little as 4 steps: + +```py +prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k" + + +input_image=Image.open("myimg.png") + +strength = 0.5 #strength =0 (no change) strength=1 (completely overwrite image) + +# Can be set to 1~50 steps. LCM support fast inference even <= 4 steps. Recommend: 1~8 steps. 
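+# NOTE: the `Image.open` call above assumes `from PIL import Image` has been imported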
+num_inference_steps = 4 + +images = pipe(prompt=prompt, image=input_image, strength=strength, num_inference_steps=num_inference_steps, guidance_scale=8.0, lcm_origin_steps=50, output_type="pil").images +``` diff --git a/examples/community/latent_consistency_img2img.py b/examples/community/latent_consistency_img2img.py new file mode 100644 index 000000000000..cc40d41eab6e --- /dev/null +++ b/examples/community/latent_consistency_img2img.py @@ -0,0 +1,829 @@ +# Copyright 2023 Stanford University Team and The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# DISCLAIMER: This code is strongly influenced by https://github.com/pesser/pytorch_diffusion +# and https://github.com/hojonathanho/diffusion + +import math +from dataclasses import dataclass +from typing import Any, Dict, List, Optional, Tuple, Union + +import numpy as np +import PIL.Image +import torch +from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer + +from diffusers import AutoencoderKL, ConfigMixin, DiffusionPipeline, SchedulerMixin, UNet2DConditionModel, logging +from diffusers.configuration_utils import register_to_config +from diffusers.image_processor import PipelineImageInput, VaeImageProcessor +from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput +from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker +from diffusers.utils import BaseOutput +from diffusers.utils.torch_utils import randn_tensor + + +logger = logging.get_logger(__name__) # pylint: disable=invalid-name + + +class LatentConsistencyModelImg2ImgPipeline(DiffusionPipeline): + _optional_components = ["scheduler"] + + def __init__( + self, + vae: AutoencoderKL, + text_encoder: CLIPTextModel, + tokenizer: CLIPTokenizer, + unet: UNet2DConditionModel, + scheduler: "LCMSchedulerWithTimestamp", + safety_checker: StableDiffusionSafetyChecker, + feature_extractor: CLIPImageProcessor, + requires_safety_checker: bool = True, + ): + super().__init__() + + scheduler = ( + scheduler + if scheduler is not None + else LCMSchedulerWithTimestamp( + beta_start=0.00085, beta_end=0.0120, beta_schedule="scaled_linear", prediction_type="epsilon" + ) + ) + + self.register_modules( + vae=vae, + text_encoder=text_encoder, + tokenizer=tokenizer, + unet=unet, + scheduler=scheduler, + safety_checker=safety_checker, + feature_extractor=feature_extractor, + ) + self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) + self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor) + + def _encode_prompt( + self, + prompt, + device, + num_images_per_prompt, + prompt_embeds: None, + ): + r""" + Encodes the prompt into text encoder hidden states. + Args: + prompt (`str` or `List[str]`, *optional*): + prompt to be encoded + device: (`torch.device`): + torch device + num_images_per_prompt (`int`): + number of images that should be generated per prompt + prompt_embeds (`torch.FloatTensor`, *optional*): + Pre-generated text embeddings. 
Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not + provided, text embeddings will be generated from `prompt` input argument. + """ + + if prompt is not None and isinstance(prompt, str): + pass + elif prompt is not None and isinstance(prompt, list): + len(prompt) + else: + prompt_embeds.shape[0] + + if prompt_embeds is None: + text_inputs = self.tokenizer( + prompt, + padding="max_length", + max_length=self.tokenizer.model_max_length, + truncation=True, + return_tensors="pt", + ) + text_input_ids = text_inputs.input_ids + untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="pt").input_ids + + if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal( + text_input_ids, untruncated_ids + ): + removed_text = self.tokenizer.batch_decode( + untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1] + ) + logger.warning( + "The following part of your input was truncated because CLIP can only handle sequences up to" + f" {self.tokenizer.model_max_length} tokens: {removed_text}" + ) + + if hasattr(self.text_encoder.config, "use_attention_mask") and self.text_encoder.config.use_attention_mask: + attention_mask = text_inputs.attention_mask.to(device) + else: + attention_mask = None + + prompt_embeds = self.text_encoder( + text_input_ids.to(device), + attention_mask=attention_mask, + ) + prompt_embeds = prompt_embeds[0] + + if self.text_encoder is not None: + prompt_embeds_dtype = self.text_encoder.dtype + elif self.unet is not None: + prompt_embeds_dtype = self.unet.dtype + else: + prompt_embeds_dtype = prompt_embeds.dtype + + prompt_embeds = prompt_embeds.to(dtype=prompt_embeds_dtype, device=device) + + bs_embed, seq_len, _ = prompt_embeds.shape + # duplicate text embeddings for each generation per prompt, using mps friendly method + prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1) + prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1) + + # Don't need to get uncond prompt embedding because of LCM Guided Distillation + return prompt_embeds + + def run_safety_checker(self, image, device, dtype): + if self.safety_checker is None: + has_nsfw_concept = None + else: + if torch.is_tensor(image): + feature_extractor_input = self.image_processor.postprocess(image, output_type="pil") + else: + feature_extractor_input = self.image_processor.numpy_to_pil(image) + safety_checker_input = self.feature_extractor(feature_extractor_input, return_tensors="pt").to(device) + image, has_nsfw_concept = self.safety_checker( + images=image, clip_input=safety_checker_input.pixel_values.to(dtype) + ) + return image, has_nsfw_concept + + def prepare_latents( + self, + image, + timestep, + batch_size, + num_channels_latents, + height, + width, + dtype, + device, + latents=None, + generator=None, + ): + shape = (batch_size, num_channels_latents, height // self.vae_scale_factor, width // self.vae_scale_factor) + + if not isinstance(image, (torch.Tensor, PIL.Image.Image, list)): + raise ValueError( + f"`image` has to be of type `torch.Tensor`, `PIL.Image.Image` or list but is {type(image)}" + ) + + image = image.to(device=device, dtype=dtype) + + # batch_size = batch_size * num_images_per_prompt + + if image.shape[1] == 4: + init_latents = image + + else: + if isinstance(generator, list) and len(generator) != batch_size: + raise ValueError( + f"You have passed a list of generators of length {len(generator)}, but requested an effective batch" + f" size of {batch_size}. 
Make sure the batch size matches the length of the generators." + ) + + elif isinstance(generator, list): + init_latents = [ + self.vae.encode(image[i : i + 1]).latent_dist.sample(generator[i]) for i in range(batch_size) + ] + init_latents = torch.cat(init_latents, dim=0) + else: + init_latents = self.vae.encode(image).latent_dist.sample(generator) + + init_latents = self.vae.config.scaling_factor * init_latents + + if batch_size > init_latents.shape[0] and batch_size % init_latents.shape[0] == 0: + # expand init_latents for batch_size + ( + f"You have passed {batch_size} text prompts (`prompt`), but only {init_latents.shape[0]} initial" + " images (`image`). Initial images are now duplicating to match the number of text prompts. Note" + " that this behavior is deprecated and will be removed in a version 1.0.0. Please make sure to update" + " your script to pass as many initial images as text prompts to suppress this warning." + ) + # deprecate("len(prompt) != len(image)", "1.0.0", deprecation_message, standard_warn=False) + additional_image_per_prompt = batch_size // init_latents.shape[0] + init_latents = torch.cat([init_latents] * additional_image_per_prompt, dim=0) + elif batch_size > init_latents.shape[0] and batch_size % init_latents.shape[0] != 0: + raise ValueError( + f"Cannot duplicate `image` of batch size {init_latents.shape[0]} to {batch_size} text prompts." + ) + else: + init_latents = torch.cat([init_latents], dim=0) + + shape = init_latents.shape + noise = randn_tensor(shape, generator=generator, device=device, dtype=dtype) + + # get latents + init_latents = self.scheduler.add_noise(init_latents, noise, timestep) + latents = init_latents + + return latents + + if latents is None: + latents = torch.randn(shape, dtype=dtype).to(device) + else: + latents = latents.to(device) + # scale the initial noise by the standard deviation required by the scheduler + latents = latents * self.scheduler.init_noise_sigma + return latents + + def get_w_embedding(self, w, embedding_dim=512, dtype=torch.float32): + """ + see https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298 + Args: + timesteps: torch.Tensor: generate embedding vectors at these timesteps + embedding_dim: int: dimension of the embeddings to generate + dtype: data type of the generated embeddings + Returns: + embedding vectors with shape `(len(timesteps), embedding_dim)` + """ + assert len(w.shape) == 1 + w = w * 1000.0 + + half_dim = embedding_dim // 2 + emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1) + emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb) + emb = w.to(dtype)[:, None] * emb[None, :] + emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1) + if embedding_dim % 2 == 1: # zero pad + emb = torch.nn.functional.pad(emb, (0, 1)) + assert emb.shape == (w.shape[0], embedding_dim) + return emb + + def get_timesteps(self, num_inference_steps, strength, device): + # get the original timestep using init_timestep + init_timestep = min(int(num_inference_steps * strength), num_inference_steps) + + t_start = max(num_inference_steps - init_timestep, 0) + timesteps = self.scheduler.timesteps[t_start * self.scheduler.order :] + + return timesteps, num_inference_steps - t_start + + @torch.no_grad() + def __call__( + self, + prompt: Union[str, List[str]] = None, + image: PipelineImageInput = None, + strength: float = 0.8, + height: Optional[int] = 768, + width: Optional[int] = 768, + guidance_scale: float = 7.5, + num_images_per_prompt: Optional[int] = 1, + latents: 
Optional[torch.FloatTensor] = None, + num_inference_steps: int = 4, + lcm_origin_steps: int = 50, + prompt_embeds: Optional[torch.FloatTensor] = None, + output_type: Optional[str] = "pil", + return_dict: bool = True, + cross_attention_kwargs: Optional[Dict[str, Any]] = None, + ): + # 0. Default height and width to unet + height = height or self.unet.config.sample_size * self.vae_scale_factor + width = width or self.unet.config.sample_size * self.vae_scale_factor + + # 2. Define call parameters + if prompt is not None and isinstance(prompt, str): + batch_size = 1 + elif prompt is not None and isinstance(prompt, list): + batch_size = len(prompt) + else: + batch_size = prompt_embeds.shape[0] + + device = self._execution_device + # do_classifier_free_guidance = guidance_scale > 0.0 # In LCM Implementation: cfg_noise = noise_cond + cfg_scale * (noise_cond - noise_uncond) , (cfg_scale > 0.0 using CFG) + + # 3. Encode input prompt + prompt_embeds = self._encode_prompt( + prompt, + device, + num_images_per_prompt, + prompt_embeds=prompt_embeds, + ) + + # 3.5 encode image + image = self.image_processor.preprocess(image) + + # 4. Prepare timesteps + self.scheduler.set_timesteps(strength, num_inference_steps, lcm_origin_steps) + # timesteps = self.scheduler.timesteps + # timesteps, num_inference_steps = self.get_timesteps(num_inference_steps, 1.0, device) + timesteps = self.scheduler.timesteps + latent_timestep = timesteps[:1].repeat(batch_size * num_images_per_prompt) + + print("timesteps: ", timesteps) + + # 5. Prepare latent variable + num_channels_latents = self.unet.config.in_channels + latents = self.prepare_latents( + image, + latent_timestep, + batch_size * num_images_per_prompt, + num_channels_latents, + height, + width, + prompt_embeds.dtype, + device, + latents, + ) + bs = batch_size * num_images_per_prompt + + # 6. Get Guidance Scale Embedding + w = torch.tensor(guidance_scale).repeat(bs) + w_embedding = self.get_w_embedding(w, embedding_dim=256).to(device=device, dtype=latents.dtype) + + # 7. 
LCM MultiStep Sampling Loop: + with self.progress_bar(total=num_inference_steps) as progress_bar: + for i, t in enumerate(timesteps): + ts = torch.full((bs,), t, device=device, dtype=torch.long) + latents = latents.to(prompt_embeds.dtype) + + # model prediction (v-prediction, eps, x) + model_pred = self.unet( + latents, + ts, + timestep_cond=w_embedding, + encoder_hidden_states=prompt_embeds, + cross_attention_kwargs=cross_attention_kwargs, + return_dict=False, + )[0] + + # compute the previous noisy sample x_t -> x_t-1 + latents, denoised = self.scheduler.step(model_pred, i, t, latents, return_dict=False) + + # # call the callback, if provided + # if i == len(timesteps) - 1: + progress_bar.update() + + denoised = denoised.to(prompt_embeds.dtype) + if not output_type == "latent": + image = self.vae.decode(denoised / self.vae.config.scaling_factor, return_dict=False)[0] + image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype) + else: + image = denoised + has_nsfw_concept = None + + if has_nsfw_concept is None: + do_denormalize = [True] * image.shape[0] + else: + do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept] + + image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize) + + if not return_dict: + return (image, has_nsfw_concept) + + return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept) + + +@dataclass +# Copied from diffusers.schedulers.scheduling_ddpm.DDPMSchedulerOutput with DDPM->DDIM +class LCMSchedulerOutput(BaseOutput): + """ + Output class for the scheduler's `step` function output. + Args: + prev_sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images): + Computed sample `(x_{t-1})` of previous timestep. `prev_sample` should be used as next model input in the + denoising loop. + pred_original_sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images): + The predicted denoised sample `(x_{0})` based on the model output from the current timestep. + `pred_original_sample` can be used to preview progress or for guidance. + """ + + prev_sample: torch.FloatTensor + denoised: Optional[torch.FloatTensor] = None + + +# Copied from diffusers.schedulers.scheduling_ddpm.betas_for_alpha_bar +def betas_for_alpha_bar( + num_diffusion_timesteps, + max_beta=0.999, + alpha_transform_type="cosine", +): + """ + Create a beta schedule that discretizes the given alpha_t_bar function, which defines the cumulative product of + (1-beta) over time from t = [0,1]. + Contains a function alpha_bar that takes an argument t and transforms it to the cumulative product of (1-beta) up + to that part of the diffusion process. + Args: + num_diffusion_timesteps (`int`): the number of betas to produce. + max_beta (`float`): the maximum beta to use; use values lower than 1 to + prevent singularities. + alpha_transform_type (`str`, *optional*, default to `cosine`): the type of noise schedule for alpha_bar. 
+ Choose from `cosine` or `exp` + Returns: + betas (`np.ndarray`): the betas used by the scheduler to step the model outputs + """ + if alpha_transform_type == "cosine": + + def alpha_bar_fn(t): + return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2 + + elif alpha_transform_type == "exp": + + def alpha_bar_fn(t): + return math.exp(t * -12.0) + + else: + raise ValueError(f"Unsupported alpha_transform_type: {alpha_transform_type}") + + betas = [] + for i in range(num_diffusion_timesteps): + t1 = i / num_diffusion_timesteps + t2 = (i + 1) / num_diffusion_timesteps + betas.append(min(1 - alpha_bar_fn(t2) / alpha_bar_fn(t1), max_beta)) + return torch.tensor(betas, dtype=torch.float32) + + +def rescale_zero_terminal_snr(betas): + """ + Rescales betas to have zero terminal SNR, based on https://arxiv.org/pdf/2305.08891.pdf (Algorithm 1) + Args: + betas (`torch.FloatTensor`): + the betas that the scheduler is being initialized with. + Returns: + `torch.FloatTensor`: rescaled betas with zero terminal SNR + """ + # Convert betas to alphas_bar_sqrt + alphas = 1.0 - betas + alphas_cumprod = torch.cumprod(alphas, dim=0) + alphas_bar_sqrt = alphas_cumprod.sqrt() + + # Store old values. + alphas_bar_sqrt_0 = alphas_bar_sqrt[0].clone() + alphas_bar_sqrt_T = alphas_bar_sqrt[-1].clone() + + # Shift so the last timestep is zero. + alphas_bar_sqrt -= alphas_bar_sqrt_T + + # Scale so the first timestep is back to the old value. + alphas_bar_sqrt *= alphas_bar_sqrt_0 / (alphas_bar_sqrt_0 - alphas_bar_sqrt_T) + + # Convert alphas_bar_sqrt to betas + alphas_bar = alphas_bar_sqrt**2 # Revert sqrt + alphas = alphas_bar[1:] / alphas_bar[:-1] # Revert cumprod + alphas = torch.cat([alphas_bar[0:1], alphas]) + betas = 1 - alphas + + return betas + + +class LCMSchedulerWithTimestamp(SchedulerMixin, ConfigMixin): + """ + This class modifies LCMScheduler to add a timestamp argument to `set_timesteps`. + + + `LCMScheduler` extends the denoising procedure introduced in denoising diffusion probabilistic models (DDPMs) with + non-Markovian guidance. + This model inherits from [`SchedulerMixin`] and [`ConfigMixin`]. Check the superclass documentation for the generic + methods the library implements for all schedulers such as loading and saving. + Args: + num_train_timesteps (`int`, defaults to 1000): + The number of diffusion steps to train the model. + beta_start (`float`, defaults to 0.0001): + The starting `beta` value of inference. + beta_end (`float`, defaults to 0.02): + The final `beta` value. + beta_schedule (`str`, defaults to `"linear"`): + The beta schedule, a mapping from a beta range to a sequence of betas for stepping the model. Choose from + `linear`, `scaled_linear`, or `squaredcos_cap_v2`. + trained_betas (`np.ndarray`, *optional*): + Pass an array of betas directly to the constructor to bypass `beta_start` and `beta_end`. + clip_sample (`bool`, defaults to `True`): + Clip the predicted sample for numerical stability. + clip_sample_range (`float`, defaults to 1.0): + The maximum magnitude for sample clipping. Valid only when `clip_sample=True`. + set_alpha_to_one (`bool`, defaults to `True`): + Each diffusion step uses the alphas product value at that step and at the previous one. For the final step + there is no previous alpha. When this option is `True` the previous alpha product is fixed to `1`, + otherwise it uses the alpha value at step 0. + steps_offset (`int`, defaults to 0): + An offset added to the inference steps.
You can use a combination of `offset=1` and + `set_alpha_to_one=False` to make the last step use step 0 for the previous alpha product like in Stable + Diffusion. + prediction_type (`str`, defaults to `epsilon`, *optional*): + Prediction type of the scheduler function; can be `epsilon` (predicts the noise of the diffusion process), + `sample` (directly predicts the noisy sample) or `v_prediction` (see section 2.4 of [Imagen + Video](https://imagen.research.google/video/paper.pdf) paper). + thresholding (`bool`, defaults to `False`): + Whether to use the "dynamic thresholding" method. This is unsuitable for latent-space diffusion models such + as Stable Diffusion. + dynamic_thresholding_ratio (`float`, defaults to 0.995): + The ratio for the dynamic thresholding method. Valid only when `thresholding=True`. + sample_max_value (`float`, defaults to 1.0): + The threshold value for dynamic thresholding. Valid only when `thresholding=True`. + timestep_spacing (`str`, defaults to `"leading"`): + The way the timesteps should be scaled. Refer to Table 2 of the [Common Diffusion Noise Schedules and + Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) for more information. + rescale_betas_zero_snr (`bool`, defaults to `False`): + Whether to rescale the betas to have zero terminal SNR. This enables the model to generate very bright and + dark samples instead of limiting it to samples with medium brightness. Loosely related to + [`--offset_noise`](https://github.com/huggingface/diffusers/blob/74fd735eb073eb1d774b1ab4154a0876eb82f055/examples/dreambooth/train_dreambooth.py#L506). + """ + + # _compatibles = [e.name for e in KarrasDiffusionSchedulers] + order = 1 + + @register_to_config + def __init__( + self, + num_train_timesteps: int = 1000, + beta_start: float = 0.0001, + beta_end: float = 0.02, + beta_schedule: str = "linear", + trained_betas: Optional[Union[np.ndarray, List[float]]] = None, + clip_sample: bool = True, + set_alpha_to_one: bool = True, + steps_offset: int = 0, + prediction_type: str = "epsilon", + thresholding: bool = False, + dynamic_thresholding_ratio: float = 0.995, + clip_sample_range: float = 1.0, + sample_max_value: float = 1.0, + timestep_spacing: str = "leading", + rescale_betas_zero_snr: bool = False, + ): + if trained_betas is not None: + self.betas = torch.tensor(trained_betas, dtype=torch.float32) + elif beta_schedule == "linear": + self.betas = torch.linspace(beta_start, beta_end, num_train_timesteps, dtype=torch.float32) + elif beta_schedule == "scaled_linear": + # this schedule is very specific to the latent diffusion model. + self.betas = ( + torch.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps, dtype=torch.float32) ** 2 + ) + elif beta_schedule == "squaredcos_cap_v2": + # Glide cosine schedule + self.betas = betas_for_alpha_bar(num_train_timesteps) + else: + raise NotImplementedError(f"{beta_schedule} is not implemented for {self.__class__}") + + # Rescale for zero SNR + if rescale_betas_zero_snr: + self.betas = rescale_zero_terminal_snr(self.betas) + + self.alphas = 1.0 - self.betas + self.alphas_cumprod = torch.cumprod(self.alphas, dim=0) + + # At every step in ddim, we are looking into the previous alphas_cumprod + # For the final step, there is no previous alphas_cumprod because we are already at 0 + # `set_alpha_to_one` decides whether we set this parameter simply to one or + # whether we use the final alpha of the "non-previous" one.
+ self.final_alpha_cumprod = torch.tensor(1.0) if set_alpha_to_one else self.alphas_cumprod[0] + + # standard deviation of the initial noise distribution + self.init_noise_sigma = 1.0 + + # setable values + self.num_inference_steps = None + self.timesteps = torch.from_numpy(np.arange(0, num_train_timesteps)[::-1].copy().astype(np.int64)) + + def scale_model_input(self, sample: torch.FloatTensor, timestep: Optional[int] = None) -> torch.FloatTensor: + """ + Ensures interchangeability with schedulers that need to scale the denoising model input depending on the + current timestep. + Args: + sample (`torch.FloatTensor`): + The input sample. + timestep (`int`, *optional*): + The current timestep in the diffusion chain. + Returns: + `torch.FloatTensor`: + A scaled input sample. + """ + return sample + + def _get_variance(self, timestep, prev_timestep): + alpha_prod_t = self.alphas_cumprod[timestep] + alpha_prod_t_prev = self.alphas_cumprod[prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod + beta_prod_t = 1 - alpha_prod_t + beta_prod_t_prev = 1 - alpha_prod_t_prev + + variance = (beta_prod_t_prev / beta_prod_t) * (1 - alpha_prod_t / alpha_prod_t_prev) + + return variance + + # Copied from diffusers.schedulers.scheduling_ddpm.DDPMScheduler._threshold_sample + def _threshold_sample(self, sample: torch.FloatTensor) -> torch.FloatTensor: + """ + "Dynamic thresholding: At each sampling step we set s to a certain percentile absolute pixel value in xt0 (the + prediction of x_0 at timestep t), and if s > 1, then we threshold xt0 to the range [-s, s] and then divide by + s. Dynamic thresholding pushes saturated pixels (those near -1 and 1) inwards, thereby actively preventing + pixels from saturation at each step. We find that dynamic thresholding results in significantly better + photorealism as well as better image-text alignment, especially when using very large guidance weights." + https://arxiv.org/abs/2205.11487 + """ + dtype = sample.dtype + batch_size, channels, height, width = sample.shape + + if dtype not in (torch.float32, torch.float64): + sample = sample.float() # upcast for quantile calculation, and clamp not implemented for cpu half + + # Flatten sample for doing quantile calculation along each image + sample = sample.reshape(batch_size, channels * height * width) + + abs_sample = sample.abs() # "a certain percentile absolute pixel value" + + s = torch.quantile(abs_sample, self.config.dynamic_thresholding_ratio, dim=1) + s = torch.clamp( + s, min=1, max=self.config.sample_max_value + ) # When clamped to min=1, equivalent to standard clipping to [-1, 1] + + s = s.unsqueeze(1) # (batch_size, 1) because clamp will broadcast along dim=0 + sample = torch.clamp(sample, -s, s) / s # "we threshold xt0 to the range [-s, s] and then divide by s" + + sample = sample.reshape(batch_size, channels, height, width) + sample = sample.to(dtype) + + return sample + + def set_timesteps( + self, stength, num_inference_steps: int, lcm_origin_steps: int, device: Union[str, torch.device] = None + ): + """ + Sets the discrete timesteps used for the diffusion chain (to be run before inference). + Args: + num_inference_steps (`int`): + The number of diffusion steps used when generating samples with a pre-trained model. 
+ """ + + if num_inference_steps > self.config.num_train_timesteps: + raise ValueError( + f"`num_inference_steps`: {num_inference_steps} cannot be larger than `self.config.train_timesteps`:" + f" {self.config.num_train_timesteps} as the unet model trained with this scheduler can only handle" + f" maximal {self.config.num_train_timesteps} timesteps." + ) + + self.num_inference_steps = num_inference_steps + + # LCM Timesteps Setting: # Linear Spacing + c = self.config.num_train_timesteps // lcm_origin_steps + lcm_origin_timesteps = ( + np.asarray(list(range(1, int(lcm_origin_steps * stength) + 1))) * c - 1 + ) # LCM Training Steps Schedule + skipping_step = len(lcm_origin_timesteps) // num_inference_steps + timesteps = lcm_origin_timesteps[::-skipping_step][:num_inference_steps] # LCM Inference Steps Schedule + + self.timesteps = torch.from_numpy(timesteps.copy()).to(device) + + def get_scalings_for_boundary_condition_discrete(self, t): + self.sigma_data = 0.5 # Default: 0.5 + + # By dividing 0.1: This is almost a delta function at t=0. + c_skip = self.sigma_data**2 / ((t / 0.1) ** 2 + self.sigma_data**2) + c_out = (t / 0.1) / ((t / 0.1) ** 2 + self.sigma_data**2) ** 0.5 + return c_skip, c_out + + def step( + self, + model_output: torch.FloatTensor, + timeindex: int, + timestep: int, + sample: torch.FloatTensor, + eta: float = 0.0, + use_clipped_model_output: bool = False, + generator=None, + variance_noise: Optional[torch.FloatTensor] = None, + return_dict: bool = True, + ) -> Union[LCMSchedulerOutput, Tuple]: + """ + Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion + process from the learned model outputs (most often the predicted noise). + Args: + model_output (`torch.FloatTensor`): + The direct output from learned diffusion model. + timestep (`float`): + The current discrete timestep in the diffusion chain. + sample (`torch.FloatTensor`): + A current instance of a sample created by the diffusion process. + eta (`float`): + The weight of noise for added noise in diffusion step. + use_clipped_model_output (`bool`, defaults to `False`): + If `True`, computes "corrected" `model_output` from the clipped predicted original sample. Necessary + because predicted original sample is clipped to [-1, 1] when `self.config.clip_sample` is `True`. If no + clipping has happened, "corrected" `model_output` would coincide with the one provided as input and + `use_clipped_model_output` has no effect. + generator (`torch.Generator`, *optional*): + A random number generator. + variance_noise (`torch.FloatTensor`): + Alternative to generating noise with `generator` by directly providing the noise for the variance + itself. Useful for methods such as [`CycleDiffusion`]. + return_dict (`bool`, *optional*, defaults to `True`): + Whether or not to return a [`~schedulers.scheduling_lcm.LCMSchedulerOutput`] or `tuple`. + Returns: + [`~schedulers.scheduling_utils.LCMSchedulerOutput`] or `tuple`: + If return_dict is `True`, [`~schedulers.scheduling_lcm.LCMSchedulerOutput`] is returned, otherwise a + tuple is returned where the first element is the sample tensor. + """ + if self.num_inference_steps is None: + raise ValueError( + "Number of inference steps is 'None', you need to run 'set_timesteps' after creating the scheduler" + ) + + # 1. get previous step value + prev_timeindex = timeindex + 1 + if prev_timeindex < len(self.timesteps): + prev_timestep = self.timesteps[prev_timeindex] + else: + prev_timestep = timestep + + # 2. 
compute alphas, betas + alpha_prod_t = self.alphas_cumprod[timestep] + alpha_prod_t_prev = self.alphas_cumprod[prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod + + beta_prod_t = 1 - alpha_prod_t + beta_prod_t_prev = 1 - alpha_prod_t_prev + + # 3. Get scalings for boundary conditions + c_skip, c_out = self.get_scalings_for_boundary_condition_discrete(timestep) + + # 4. Different Parameterization: + parameterization = self.config.prediction_type + + if parameterization == "epsilon": # noise-prediction + pred_x0 = (sample - beta_prod_t.sqrt() * model_output) / alpha_prod_t.sqrt() + + elif parameterization == "sample": # x-prediction + pred_x0 = model_output + + elif parameterization == "v_prediction": # v-prediction + pred_x0 = alpha_prod_t.sqrt() * sample - beta_prod_t.sqrt() * model_output + + # 4. Denoise model output using boundary conditions + denoised = c_out * pred_x0 + c_skip * sample + + # 5. Sample z ~ N(0, I), For MultiStep Inference + # Noise is not used for one-step sampling. + if len(self.timesteps) > 1: + noise = torch.randn(model_output.shape).to(model_output.device) + prev_sample = alpha_prod_t_prev.sqrt() * denoised + beta_prod_t_prev.sqrt() * noise + else: + prev_sample = denoised + + if not return_dict: + return (prev_sample, denoised) + + return LCMSchedulerOutput(prev_sample=prev_sample, denoised=denoised) + + # Copied from diffusers.schedulers.scheduling_ddpm.DDPMScheduler.add_noise + def add_noise( + self, + original_samples: torch.FloatTensor, + noise: torch.FloatTensor, + timesteps: torch.IntTensor, + ) -> torch.FloatTensor: + # Make sure alphas_cumprod and timestep have same device and dtype as original_samples + alphas_cumprod = self.alphas_cumprod.to(device=original_samples.device, dtype=original_samples.dtype) + timesteps = timesteps.to(original_samples.device) + + sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5 + sqrt_alpha_prod = sqrt_alpha_prod.flatten() + while len(sqrt_alpha_prod.shape) < len(original_samples.shape): + sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1) + + sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5 + sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten() + while len(sqrt_one_minus_alpha_prod.shape) < len(original_samples.shape): + sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1) + + noisy_samples = sqrt_alpha_prod * original_samples + sqrt_one_minus_alpha_prod * noise + return noisy_samples + + # Copied from diffusers.schedulers.scheduling_ddpm.DDPMScheduler.get_velocity + def get_velocity( + self, sample: torch.FloatTensor, noise: torch.FloatTensor, timesteps: torch.IntTensor + ) -> torch.FloatTensor: + # Make sure alphas_cumprod and timestep have same device and dtype as sample + alphas_cumprod = self.alphas_cumprod.to(device=sample.device, dtype=sample.dtype) + timesteps = timesteps.to(sample.device) + + sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5 + sqrt_alpha_prod = sqrt_alpha_prod.flatten() + while len(sqrt_alpha_prod.shape) < len(sample.shape): + sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1) + + sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5 + sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten() + while len(sqrt_one_minus_alpha_prod.shape) < len(sample.shape): + sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1) + + velocity = sqrt_alpha_prod * noise - sqrt_one_minus_alpha_prod * sample + return velocity + + def __len__(self): + return self.config.num_train_timesteps diff --git 
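For readers who want to try the community img2img pipeline and `LCMSchedulerWithTimestamp` defined above, the sketch below shows one plausible way to load and run it. The checkpoint id `SimianLuo/LCM_Dreamshaper_v7` and the `custom_pipeline` name `latent_consistency_img2img` are assumptions about how this community pipeline is published, not guarantees made by this diff.

```py
# Hedged sketch: running the community LCM img2img pipeline defined above.
# The checkpoint id and custom_pipeline name are assumptions, not part of this diff.
import torch
from PIL import Image

from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",                # assumed LCM checkpoint
    custom_pipeline="latent_consistency_img2img",  # assumed community pipeline name
)
pipe.to("cuda")

# Placeholder init image; any 512x512 RGB image works here.
init_image = Image.new("RGB", (512, 512), color=(128, 128, 128))

# LCM needs only a few steps; guidance is folded into the model through the
# w-embedding the pipeline computes, so no negative prompt or CFG batch is used.
image = pipe(
    prompt="a photo of a cat wearing a spacesuit",
    image=init_image,
    strength=0.5,
    num_inference_steps=4,
    guidance_scale=8.0,
).images[0]
image.save("lcm_img2img.png")
```

For intuition on the schedule `set_timesteps` builds: with the defaults (`num_train_timesteps=1000`, `lcm_origin_steps=50`) and `strength=1.0`, the spacing constant is `c = 20`, the origin schedule is `[19, 39, ..., 999]`, the skipping step for `num_inference_steps=4` is `12`, and the inference timesteps come out as `[999, 759, 519, 279]`.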
a/examples/dreambooth/train_dreambooth.py b/examples/dreambooth/train_dreambooth.py index 6ad79a47deb5..6d59ee4de383 100644 --- a/examples/dreambooth/train_dreambooth.py +++ b/examples/dreambooth/train_dreambooth.py @@ -1167,7 +1167,7 @@ def compute_text_embeddings(prompt): if args.resume_from_checkpoint != "latest": path = os.path.basename(args.resume_from_checkpoint) else: - # Get the mos recent checkpoint + # Get the most recent checkpoint dirs = os.listdir(args.output_dir) dirs = [d for d in dirs if d.startswith("checkpoint")] dirs = sorted(dirs, key=lambda x: int(x.split("-")[1])) @@ -1364,7 +1364,7 @@ def compute_text_embeddings(prompt): if global_step >= args.max_train_steps: break - # Create the pipeline using using the trained modules and save it. + # Create the pipeline using the trained modules and save it. accelerator.wait_for_everyone() if accelerator.is_main_process: pipeline_args = {} diff --git a/examples/text_to_image/train_text_to_image_flax.py b/examples/text_to_image/train_text_to_image_flax.py index ac3afcbaba12..63ea53c52a11 100644 --- a/examples/text_to_image/train_text_to_image_flax.py +++ b/examples/text_to_image/train_text_to_image_flax.py @@ -208,6 +208,12 @@ def parse_args(): ), ) parser.add_argument("--local_rank", type=int, default=-1, help="For distributed training: local_rank") + parser.add_argument( + "--from_pt", + action="store_true", + default=False, + help="Flag to indicate whether to convert models from PyTorch.", + ) args = parser.parse_args() env_local_rank = int(os.environ.get("LOCAL_RANK", -1)) @@ -374,16 +380,31 @@ def collate_fn(examples): # Load models and create wrapper for stable diffusion tokenizer = CLIPTokenizer.from_pretrained( - args.pretrained_model_name_or_path, revision=args.revision, subfolder="tokenizer" + args.pretrained_model_name_or_path, + from_pt=args.from_pt, + revision=args.revision, + subfolder="tokenizer", ) text_encoder = FlaxCLIPTextModel.from_pretrained( - args.pretrained_model_name_or_path, revision=args.revision, subfolder="text_encoder", dtype=weight_dtype + args.pretrained_model_name_or_path, + from_pt=args.from_pt, + revision=args.revision, + subfolder="text_encoder", + dtype=weight_dtype, ) vae, vae_params = FlaxAutoencoderKL.from_pretrained( - args.pretrained_model_name_or_path, revision=args.revision, subfolder="vae", dtype=weight_dtype + args.pretrained_model_name_or_path, + from_pt=args.from_pt, + revision=args.revision, + subfolder="vae", + dtype=weight_dtype, ) unet, unet_params = FlaxUNet2DConditionModel.from_pretrained( - args.pretrained_model_name_or_path, revision=args.revision, subfolder="unet", dtype=weight_dtype + args.pretrained_model_name_or_path, + from_pt=args.from_pt, + revision=args.revision, + subfolder="unet", + dtype=weight_dtype, ) # Optimization diff --git a/src/diffusers/__init__.py b/src/diffusers/__init__.py index 42f352c029c8..9d146ac233c2 100644 --- a/src/diffusers/__init__.py +++ b/src/diffusers/__init__.py @@ -142,6 +142,7 @@ "KarrasVeScheduler", "KDPM2AncestralDiscreteScheduler", "KDPM2DiscreteScheduler", + "LCMScheduler", "PNDMScheduler", "RePaintScheduler", "SchedulerMixin", @@ -226,6 +227,7 @@ "KandinskyV22Pipeline", "KandinskyV22PriorEmb2EmbPipeline", "KandinskyV22PriorPipeline", + "LatentConsistencyModelPipeline", "LDMTextToImagePipeline", "MusicLDMPipeline", "PaintByExamplePipeline", @@ -499,6 +501,7 @@ KarrasVeScheduler, KDPM2AncestralDiscreteScheduler, KDPM2DiscreteScheduler, + LCMScheduler, PNDMScheduler, RePaintScheduler, SchedulerMixin, @@ -564,6 +567,7 @@ 
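Once the `LCMScheduler` and `LatentConsistencyModelPipeline` entries above are in `__init__.py`, both classes can be imported from the package root. A minimal text-to-image sketch, assuming the same LCM checkpoint id as before:

```py
# Hedged sketch of the first-class LCM text-to-image pipeline exported at the top level.
# "SimianLuo/LCM_Dreamshaper_v7" is an assumed checkpoint id, not part of this diff.
import torch

from diffusers import LatentConsistencyModelPipeline

pipe = LatentConsistencyModelPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float16
)
pipe.to("cuda")

# The LCMScheduler shipped with the checkpoint is picked up automatically;
# 4-8 inference steps is the typical operating range for LCMs.
image = pipe(
    prompt="a watercolor painting of a lighthouse at dawn",
    num_inference_steps=4,
    guidance_scale=8.0,
).images[0]
image.save("lcm_txt2img.png")
```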
KandinskyV22Pipeline, KandinskyV22PriorEmb2EmbPipeline, KandinskyV22PriorPipeline, + LatentConsistencyModelPipeline, LDMTextToImagePipeline, MusicLDMPipeline, PaintByExamplePipeline, diff --git a/src/diffusers/configuration_utils.py b/src/diffusers/configuration_utils.py index 9bc25155a0b6..a67fa9d41ca5 100644 --- a/src/diffusers/configuration_utils.py +++ b/src/diffusers/configuration_utils.py @@ -485,10 +485,18 @@ def extract_init_dict(cls, config_dict, **kwargs): # remove attributes from orig class that cannot be expected orig_cls_name = config_dict.pop("_class_name", cls.__name__) - if orig_cls_name != cls.__name__ and hasattr(diffusers_library, orig_cls_name): + if ( + isinstance(orig_cls_name, str) + and orig_cls_name != cls.__name__ + and hasattr(diffusers_library, orig_cls_name) + ): orig_cls = getattr(diffusers_library, orig_cls_name) unexpected_keys_from_orig = cls._get_init_keys(orig_cls) - expected_keys config_dict = {k: v for k, v in config_dict.items() if k not in unexpected_keys_from_orig} + elif not isinstance(orig_cls_name, str) and not isinstance(orig_cls_name, (list, tuple)): + raise ValueError( + "Make sure that the `_class_name` is of type string or list of string (for custom pipelines)." + ) # remove private attributes config_dict = {k: v for k, v in config_dict.items() if not k.startswith("_")} diff --git a/src/diffusers/loaders.py b/src/diffusers/loaders.py index e36088e4645d..67043866be6e 100644 --- a/src/diffusers/loaders.py +++ b/src/diffusers/loaders.py @@ -3087,13 +3087,13 @@ def from_single_file(cls, pretrained_model_link_or_path, **kwargs): Examples: ```py - from diffusers import StableDiffusionControlnetPipeline, ControlNetModel + from diffusers import StableDiffusionControlNetPipeline, ControlNetModel url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth" # can also be a local path model = ControlNetModel.from_single_file(url) url = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors" # can also be a local path - pipe = StableDiffusionControlnetPipeline.from_single_file(url, controlnet=controlnet) + pipe = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet) ``` """ # import here to avoid circular dependency @@ -3171,7 +3171,7 @@ def from_single_file(cls, pretrained_model_link_or_path, **kwargs): ) if torch_dtype is not None: - controlnet.to(torch_dtype=torch_dtype) + controlnet.to(dtype=torch_dtype) return controlnet diff --git a/src/diffusers/models/activations.py b/src/diffusers/models/activations.py index e66d90040fd2..8b75162ba597 100644 --- a/src/diffusers/models/activations.py +++ b/src/diffusers/models/activations.py @@ -21,6 +21,15 @@ from .lora import LoRACompatibleLinear +ACTIVATION_FUNCTIONS = { + "swish": nn.SiLU(), + "silu": nn.SiLU(), + "mish": nn.Mish(), + "gelu": nn.GELU(), + "relu": nn.ReLU(), +} + + def get_activation(act_fn: str) -> nn.Module: """Helper function to get activation function from string. @@ -30,14 +39,10 @@ def get_activation(act_fn: str) -> nn.Module: Returns: nn.Module: Activation function. 
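The `ACTIVATION_FUNCTIONS` table above replaces the old if/elif chain, and the added `.lower()` call makes the lookup case-insensitive. A small sketch of the behaviour this implies; note that the table holds pre-built module instances, so repeated lookups return the same (stateless) object:

```py
# Hedged sketch of how the refactored get_activation behaves.
from torch import nn

from diffusers.models.activations import get_activation

act = get_activation("SiLU")            # case-insensitive after .lower()
assert isinstance(act, nn.SiLU)

# The mapping stores instances, so two lookups share the same module object.
assert get_activation("gelu") is get_activation("GELU")

try:
    get_activation("tanh")              # not in ACTIVATION_FUNCTIONS
except ValueError as err:
    print(err)                          # Unsupported activation function: tanh
```

Sharing instances is safe here because all of these activations are stateless and parameter-free.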
""" - if act_fn in ["swish", "silu"]: - return nn.SiLU() - elif act_fn == "mish": - return nn.Mish() - elif act_fn == "gelu": - return nn.GELU() - elif act_fn == "relu": - return nn.ReLU() + + act_fn = act_fn.lower() + if act_fn in ACTIVATION_FUNCTIONS: + return ACTIVATION_FUNCTIONS[act_fn] else: raise ValueError(f"Unsupported activation function: {act_fn}") diff --git a/src/diffusers/models/attention_processor.py b/src/diffusers/models/attention_processor.py index 9856f3c7739c..efed305a0e96 100644 --- a/src/diffusers/models/attention_processor.py +++ b/src/diffusers/models/attention_processor.py @@ -40,14 +40,50 @@ class Attention(nn.Module): A cross attention layer. Parameters: - query_dim (`int`): The number of channels in the query. + query_dim (`int`): + The number of channels in the query. cross_attention_dim (`int`, *optional*): The number of channels in the encoder_hidden_states. If not given, defaults to `query_dim`. - heads (`int`, *optional*, defaults to 8): The number of heads to use for multi-head attention. - dim_head (`int`, *optional*, defaults to 64): The number of channels in each head. - dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use. + heads (`int`, *optional*, defaults to 8): + The number of heads to use for multi-head attention. + dim_head (`int`, *optional*, defaults to 64): + The number of channels in each head. + dropout (`float`, *optional*, defaults to 0.0): + The dropout probability to use. bias (`bool`, *optional*, defaults to False): Set to `True` for the query, key, and value linear layers to contain a bias parameter. + upcast_attention (`bool`, *optional*, defaults to False): + Set to `True` to upcast the attention computation to `float32`. + upcast_softmax (`bool`, *optional*, defaults to False): + Set to `True` to upcast the softmax computation to `float32`. + cross_attention_norm (`str`, *optional*, defaults to `None`): + The type of normalization to use for the cross attention. Can be `None`, `layer_norm`, or `group_norm`. + cross_attention_norm_num_groups (`int`, *optional*, defaults to 32): + The number of groups to use for the group norm in the cross attention. + added_kv_proj_dim (`int`, *optional*, defaults to `None`): + The number of channels to use for the added key and value projections. If `None`, no projection is used. + norm_num_groups (`int`, *optional*, defaults to `None`): + The number of groups to use for the group norm in the attention. + spatial_norm_dim (`int`, *optional*, defaults to `None`): + The number of channels to use for the spatial normalization. + out_bias (`bool`, *optional*, defaults to `True`): + Set to `True` to use a bias in the output linear layer. + scale_qk (`bool`, *optional*, defaults to `True`): + Set to `True` to scale the query and key by `1 / sqrt(dim_head)`. + only_cross_attention (`bool`, *optional*, defaults to `False`): + Set to `True` to only use cross attention and not added_kv_proj_dim. Can only be set to `True` if + `added_kv_proj_dim` is not `None`. + eps (`float`, *optional*, defaults to 1e-5): + An additional value added to the denominator in group normalization that is used for numerical stability. + rescale_output_factor (`float`, *optional*, defaults to 1.0): + A factor to rescale the output by dividing it with this value. + residual_connection (`bool`, *optional*, defaults to `False`): + Set to `True` to add the residual connection to the output. 
+ _from_deprecated_attn_block (`bool`, *optional*, defaults to `False`): + Set to `True` if the attention block is loaded from a deprecated state dict. + processor (`AttnProcessor`, *optional*, defaults to `None`): + The attention processor to use. If `None`, defaults to `AttnProcessor2_0` if `torch 2.x` is used and + `AttnProcessor` otherwise. """ def __init__( @@ -57,7 +93,7 @@ def __init__( heads: int = 8, dim_head: int = 64, dropout: float = 0.0, - bias=False, + bias: bool = False, upcast_attention: bool = False, upcast_softmax: bool = False, cross_attention_norm: Optional[str] = None, @@ -71,7 +107,7 @@ def __init__( eps: float = 1e-5, rescale_output_factor: float = 1.0, residual_connection: bool = False, - _from_deprecated_attn_block=False, + _from_deprecated_attn_block: bool = False, processor: Optional["AttnProcessor"] = None, ): super().__init__() @@ -172,7 +208,17 @@ def __init__( def set_use_memory_efficient_attention_xformers( self, use_memory_efficient_attention_xformers: bool, attention_op: Optional[Callable] = None - ): + ) -> None: + r""" + Set whether to use memory efficient attention from `xformers` or not. + + Args: + use_memory_efficient_attention_xformers (`bool`): + Whether to use memory efficient attention from `xformers` or not. + attention_op (`Callable`, *optional*): + The attention operation to use. Defaults to `None` which uses the default attention operation from + `xformers`. + """ is_lora = hasattr(self, "processor") and isinstance( self.processor, LORA_ATTENTION_PROCESSORS, @@ -294,7 +340,14 @@ def set_use_memory_efficient_attention_xformers( self.set_processor(processor) - def set_attention_slice(self, slice_size): + def set_attention_slice(self, slice_size: int) -> None: + r""" + Set the slice size for attention computation. + + Args: + slice_size (`int`): + The slice size for attention computation. + """ if slice_size is not None and slice_size > self.sliceable_head_dim: raise ValueError(f"slice_size {slice_size} has to be smaller or equal to {self.sliceable_head_dim}.") @@ -315,7 +368,16 @@ def set_attention_slice(self, slice_size): self.set_processor(processor) - def set_processor(self, processor: "AttnProcessor", _remove_lora=False): + def set_processor(self, processor: "AttnProcessor", _remove_lora: bool = False) -> None: + r""" + Set the attention processor to use. + + Args: + processor (`AttnProcessor`): + The attention processor to use. + _remove_lora (`bool`, *optional*, defaults to `False`): + Set to `True` to remove LoRA layers from the model. + """ if hasattr(self, "processor") and _remove_lora and self.to_q.lora_layer is not None: deprecate( "set_processor to offload LoRA", @@ -342,6 +404,16 @@ def set_processor(self, processor: "AttnProcessor", _remove_lora=False): self.processor = processor def get_processor(self, return_deprecated_lora: bool = False) -> "AttentionProcessor": + r""" + Get the attention processor in use. + + Args: + return_deprecated_lora (`bool`, *optional*, defaults to `False`): + Set to `True` to return the deprecated LoRA attention processor. + + Returns: + "AttentionProcessor": The attention processor in use. 
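The `set_processor`/`get_processor` pair documented above is the hook the library uses to swap attention implementations at runtime. A hedged sketch of switching a loaded UNet to the PyTorch 2.0 processor; the checkpoint id is only an example:

```py
# Hedged sketch: swapping attention processors through the documented hooks.
import torch

from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor2_0

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)

# Route every attention layer through scaled_dot_product_attention (PyTorch 2.x).
pipe.unet.set_attn_processor(AttnProcessor2_0())

# The UNet exposes the currently installed processors as a name -> instance mapping.
print(pipe.unet.attn_processors)
```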
+ """ if not return_deprecated_lora: return self.processor @@ -421,7 +493,29 @@ def get_processor(self, return_deprecated_lora: bool = False) -> "AttentionProce return lora_processor - def forward(self, hidden_states, encoder_hidden_states=None, attention_mask=None, **cross_attention_kwargs): + def forward( + self, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + **cross_attention_kwargs, + ) -> torch.Tensor: + r""" + The forward method of the `Attention` class. + + Args: + hidden_states (`torch.Tensor`): + The hidden states of the query. + encoder_hidden_states (`torch.Tensor`, *optional*): + The hidden states of the encoder. + attention_mask (`torch.Tensor`, *optional*): + The attention mask to use. If `None`, no mask is applied. + **cross_attention_kwargs: + Additional keyword arguments to pass along to the cross attention. + + Returns: + `torch.Tensor`: The output of the attention layer. + """ # The `Attention` class can call different attention processors / attention functions # here we simply pass along all tensors to the selected processor class # For standard processors that are defined here, `**cross_attention_kwargs` is empty @@ -433,14 +527,36 @@ def forward(self, hidden_states, encoder_hidden_states=None, attention_mask=None **cross_attention_kwargs, ) - def batch_to_head_dim(self, tensor): + def batch_to_head_dim(self, tensor: torch.Tensor) -> torch.Tensor: + r""" + Reshape the tensor from `[batch_size, seq_len, dim]` to `[batch_size // heads, seq_len, dim * heads]`. `heads` + is the number of heads initialized while constructing the `Attention` class. + + Args: + tensor (`torch.Tensor`): The tensor to reshape. + + Returns: + `torch.Tensor`: The reshaped tensor. + """ head_size = self.heads batch_size, seq_len, dim = tensor.shape tensor = tensor.reshape(batch_size // head_size, head_size, seq_len, dim) tensor = tensor.permute(0, 2, 1, 3).reshape(batch_size // head_size, seq_len, dim * head_size) return tensor - def head_to_batch_dim(self, tensor, out_dim=3): + def head_to_batch_dim(self, tensor: torch.Tensor, out_dim: int = 3) -> torch.Tensor: + r""" + Reshape the tensor from `[batch_size, seq_len, dim]` to `[batch_size, seq_len, heads, dim // heads]` `heads` is + the number of heads initialized while constructing the `Attention` class. + + Args: + tensor (`torch.Tensor`): The tensor to reshape. + out_dim (`int`, *optional*, defaults to `3`): The output dimension of the tensor. If `3`, the tensor is + reshaped to `[batch_size * heads, seq_len, dim // heads]`. + + Returns: + `torch.Tensor`: The reshaped tensor. + """ head_size = self.heads batch_size, seq_len, dim = tensor.shape tensor = tensor.reshape(batch_size, seq_len, head_size, dim // head_size) @@ -451,7 +567,20 @@ def head_to_batch_dim(self, tensor, out_dim=3): return tensor - def get_attention_scores(self, query, key, attention_mask=None): + def get_attention_scores( + self, query: torch.Tensor, key: torch.Tensor, attention_mask: torch.Tensor = None + ) -> torch.Tensor: + r""" + Compute the attention scores. + + Args: + query (`torch.Tensor`): The query tensor. + key (`torch.Tensor`): The key tensor. + attention_mask (`torch.Tensor`, *optional*): The attention mask to use. If `None`, no mask is applied. + + Returns: + `torch.Tensor`: The attention probabilities/scores. 
+ """ dtype = query.dtype if self.upcast_attention: query = query.float() @@ -485,7 +614,25 @@ def get_attention_scores(self, query, key, attention_mask=None): return attention_probs - def prepare_attention_mask(self, attention_mask, target_length, batch_size, out_dim=3): + def prepare_attention_mask( + self, attention_mask: torch.Tensor, target_length: int, batch_size: int, out_dim: int = 3 + ) -> torch.Tensor: + r""" + Prepare the attention mask for the attention computation. + + Args: + attention_mask (`torch.Tensor`): + The attention mask to prepare. + target_length (`int`): + The target length of the attention mask. This is the length of the attention mask after padding. + batch_size (`int`): + The batch size, which is used to repeat the attention mask. + out_dim (`int`, *optional*, defaults to `3`): + The output dimension of the attention mask. Can be either `3` or `4`. + + Returns: + `torch.Tensor`: The prepared attention mask. + """ head_size = self.heads if attention_mask is None: return attention_mask @@ -514,7 +661,17 @@ def prepare_attention_mask(self, attention_mask, target_length, batch_size, out_ return attention_mask - def norm_encoder_hidden_states(self, encoder_hidden_states): + def norm_encoder_hidden_states(self, encoder_hidden_states: torch.Tensor) -> torch.Tensor: + r""" + Normalize the encoder hidden states. Requires `self.norm_cross` to be specified when constructing the + `Attention` class. + + Args: + encoder_hidden_states (`torch.Tensor`): Hidden states of the encoder. + + Returns: + `torch.Tensor`: The normalized encoder hidden states. + """ assert self.norm_cross is not None, "self.norm_cross must be defined to call self.norm_encoder_hidden_states" if isinstance(self.norm_cross, nn.LayerNorm): @@ -542,12 +699,12 @@ class AttnProcessor: def __call__( self, attn: Attention, - hidden_states, - encoder_hidden_states=None, - attention_mask=None, - temb=None, - scale=1.0, - ): + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + temb: Optional[torch.FloatTensor] = None, + scale: float = 1.0, + ) -> torch.Tensor: residual = hidden_states args = () if USE_PEFT_BACKEND else (scale,) @@ -624,12 +781,12 @@ class CustomDiffusionAttnProcessor(nn.Module): def __init__( self, - train_kv=True, - train_q_out=True, - hidden_size=None, - cross_attention_dim=None, - out_bias=True, - dropout=0.0, + train_kv: bool = True, + train_q_out: bool = True, + hidden_size: Optional[int] = None, + cross_attention_dim: Optional[int] = None, + out_bias: bool = True, + dropout: float = 0.0, ): super().__init__() self.train_kv = train_kv @@ -648,7 +805,13 @@ def __init__( self.to_out_custom_diffusion.append(nn.Linear(hidden_size, hidden_size, bias=out_bias)) self.to_out_custom_diffusion.append(nn.Dropout(dropout)) - def __call__(self, attn: Attention, hidden_states, encoder_hidden_states=None, attention_mask=None): + def __call__( + self, + attn: Attention, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + ) -> torch.Tensor: batch_size, sequence_length, _ = hidden_states.shape attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size) if self.train_q_out: @@ -707,7 +870,14 @@ class AttnAddedKVProcessor: encoder. 
""" - def __call__(self, attn: Attention, hidden_states, encoder_hidden_states=None, attention_mask=None, scale=1.0): + def __call__( + self, + attn: Attention, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + scale: float = 1.0, + ) -> torch.Tensor: residual = hidden_states hidden_states = hidden_states.view(hidden_states.shape[0], hidden_states.shape[1], -1).transpose(1, 2) batch_size, sequence_length, _ = hidden_states.shape @@ -767,7 +937,14 @@ def __init__(self): "AttnAddedKVProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0." ) - def __call__(self, attn: Attention, hidden_states, encoder_hidden_states=None, attention_mask=None, scale=1.0): + def __call__( + self, + attn: Attention, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + scale: float = 1.0, + ) -> torch.Tensor: residual = hidden_states hidden_states = hidden_states.view(hidden_states.shape[0], hidden_states.shape[1], -1).transpose(1, 2) batch_size, sequence_length, _ = hidden_states.shape @@ -833,7 +1010,13 @@ class XFormersAttnAddedKVProcessor: def __init__(self, attention_op: Optional[Callable] = None): self.attention_op = attention_op - def __call__(self, attn: Attention, hidden_states, encoder_hidden_states=None, attention_mask=None): + def __call__( + self, + attn: Attention, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + ) -> torch.Tensor: residual = hidden_states hidden_states = hidden_states.view(hidden_states.shape[0], hidden_states.shape[1], -1).transpose(1, 2) batch_size, sequence_length, _ = hidden_states.shape @@ -906,9 +1089,11 @@ def __call__( attention_mask: Optional[torch.FloatTensor] = None, temb: Optional[torch.FloatTensor] = None, scale: float = 1.0, - ): + ) -> torch.FloatTensor: residual = hidden_states + args = () if USE_PEFT_BACKEND else (scale,) + if attn.spatial_norm is not None: hidden_states = attn.spatial_norm(hidden_states, temb) @@ -936,15 +1121,15 @@ def __call__( if attn.group_norm is not None: hidden_states = attn.group_norm(hidden_states.transpose(1, 2)).transpose(1, 2) - query = attn.to_q(hidden_states, scale=scale) + query = attn.to_q(hidden_states, *args) if encoder_hidden_states is None: encoder_hidden_states = hidden_states elif attn.norm_cross: encoder_hidden_states = attn.norm_encoder_hidden_states(encoder_hidden_states) - key = attn.to_k(encoder_hidden_states, scale=scale) - value = attn.to_v(encoder_hidden_states, scale=scale) + key = attn.to_k(encoder_hidden_states, *args) + value = attn.to_v(encoder_hidden_states, *args) query = attn.head_to_batch_dim(query).contiguous() key = attn.head_to_batch_dim(key).contiguous() @@ -957,7 +1142,7 @@ def __call__( hidden_states = attn.batch_to_head_dim(hidden_states) # linear proj - hidden_states = attn.to_out[0](hidden_states, scale=scale) + hidden_states = attn.to_out[0](hidden_states, *args) # dropout hidden_states = attn.to_out[1](hidden_states) @@ -984,12 +1169,12 @@ def __init__(self): def __call__( self, attn: Attention, - hidden_states, - encoder_hidden_states=None, - attention_mask=None, - temb=None, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + temb: Optional[torch.FloatTensor] = None, scale: 
float = 1.0, - ): + ) -> torch.FloatTensor: residual = hidden_states if attn.spatial_norm is not None: @@ -1089,12 +1274,12 @@ class CustomDiffusionXFormersAttnProcessor(nn.Module): def __init__( self, - train_kv=True, - train_q_out=False, - hidden_size=None, - cross_attention_dim=None, - out_bias=True, - dropout=0.0, + train_kv: bool = True, + train_q_out: bool = False, + hidden_size: Optional[int] = None, + cross_attention_dim: Optional[int] = None, + out_bias: bool = True, + dropout: float = 0.0, attention_op: Optional[Callable] = None, ): super().__init__() @@ -1115,7 +1300,13 @@ def __init__( self.to_out_custom_diffusion.append(nn.Linear(hidden_size, hidden_size, bias=out_bias)) self.to_out_custom_diffusion.append(nn.Dropout(dropout)) - def __call__(self, attn: Attention, hidden_states, encoder_hidden_states=None, attention_mask=None): + def __call__( + self, + attn: Attention, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + ) -> torch.FloatTensor: batch_size, sequence_length, _ = ( hidden_states.shape if encoder_hidden_states is None else encoder_hidden_states.shape ) @@ -1195,12 +1386,12 @@ class CustomDiffusionAttnProcessor2_0(nn.Module): def __init__( self, - train_kv=True, - train_q_out=True, - hidden_size=None, - cross_attention_dim=None, - out_bias=True, - dropout=0.0, + train_kv: bool = True, + train_q_out: bool = True, + hidden_size: Optional[int] = None, + cross_attention_dim: Optional[int] = None, + out_bias: bool = True, + dropout: float = 0.0, ): super().__init__() self.train_kv = train_kv @@ -1219,7 +1410,13 @@ def __init__( self.to_out_custom_diffusion.append(nn.Linear(hidden_size, hidden_size, bias=out_bias)) self.to_out_custom_diffusion.append(nn.Dropout(dropout)) - def __call__(self, attn: Attention, hidden_states, encoder_hidden_states=None, attention_mask=None): + def __call__( + self, + attn: Attention, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + ) -> torch.FloatTensor: batch_size, sequence_length, _ = hidden_states.shape attention_mask = attn.prepare_attention_mask(attention_mask, sequence_length, batch_size) if self.train_q_out: @@ -1288,10 +1485,16 @@ class SlicedAttnProcessor: `attention_head_dim` must be a multiple of the `slice_size`. 
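`SlicedAttnProcessor` above is what backs the memory-saving `enable_attention_slicing` helper on pipelines. A sketch of how it is typically exercised; the checkpoint id is only an example:

```py
# Hedged sketch: attention slicing trades a little speed for lower peak memory.
import torch

from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# "auto" lets the pipeline pick a slice size; an integer slice size must evenly
# divide `attention_head_dim`, per the SlicedAttnProcessor docstring above.
pipe.enable_attention_slicing("auto")

image = pipe("an isometric pixel-art castle", num_inference_steps=25).images[0]

# Turn it back off when memory headroom is not a concern.
pipe.disable_attention_slicing()
```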
""" - def __init__(self, slice_size): + def __init__(self, slice_size: int): self.slice_size = slice_size - def __call__(self, attn: Attention, hidden_states, encoder_hidden_states=None, attention_mask=None): + def __call__( + self, + attn: Attention, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + ) -> torch.FloatTensor: residual = hidden_states input_ndim = hidden_states.ndim @@ -1372,7 +1575,14 @@ class SlicedAttnAddedKVProcessor: def __init__(self, slice_size): self.slice_size = slice_size - def __call__(self, attn: "Attention", hidden_states, encoder_hidden_states=None, attention_mask=None, temb=None): + def __call__( + self, + attn: "Attention", + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + temb: Optional[torch.FloatTensor] = None, + ) -> torch.FloatTensor: residual = hidden_states if attn.spatial_norm is not None: @@ -1446,20 +1656,26 @@ def __call__(self, attn: "Attention", hidden_states, encoder_hidden_states=None, class SpatialNorm(nn.Module): """ - Spatially conditioned normalization as defined in https://arxiv.org/abs/2209.09002 + Spatially conditioned normalization as defined in https://arxiv.org/abs/2209.09002. + + Args: + f_channels (`int`): + The number of channels for input to group normalization layer, and output of the spatial norm layer. + zq_channels (`int`): + The number of channels for the quantized vector as described in the paper. """ def __init__( self, - f_channels, - zq_channels, + f_channels: int, + zq_channels: int, ): super().__init__() self.norm_layer = nn.GroupNorm(num_channels=f_channels, num_groups=32, eps=1e-6, affine=True) self.conv_y = nn.Conv2d(zq_channels, f_channels, kernel_size=1, stride=1, padding=0) self.conv_b = nn.Conv2d(zq_channels, f_channels, kernel_size=1, stride=1, padding=0) - def forward(self, f, zq): + def forward(self, f: torch.FloatTensor, zq: torch.FloatTensor) -> torch.FloatTensor: f_size = f.shape[-2:] zq = F.interpolate(zq, size=f_size, mode="nearest") norm_f = self.norm_layer(f) @@ -1481,9 +1697,18 @@ class LoRAAttnProcessor(nn.Module): The dimension of the LoRA update matrices. network_alpha (`int`, *optional*): Equivalent to `alpha` but it's usage is specific to Kohya (A1111) style LoRAs. + kwargs (`dict`): + Additional keyword arguments to pass to the `LoRALinearLayer` layers. """ - def __init__(self, hidden_size, cross_attention_dim=None, rank=4, network_alpha=None, **kwargs): + def __init__( + self, + hidden_size: int, + cross_attention_dim: Optional[int] = None, + rank: int = 4, + network_alpha: Optional[int] = None, + **kwargs, + ): super().__init__() self.hidden_size = hidden_size @@ -1510,7 +1735,7 @@ def __init__(self, hidden_size, cross_attention_dim=None, rank=4, network_alpha= self.to_v_lora = LoRALinearLayer(cross_attention_dim or v_hidden_size, v_hidden_size, v_rank, network_alpha) self.to_out_lora = LoRALinearLayer(out_hidden_size, out_hidden_size, out_rank, network_alpha) - def __call__(self, attn: Attention, hidden_states, *args, **kwargs): + def __call__(self, attn: Attention, hidden_states: torch.FloatTensor, *args, **kwargs) -> torch.FloatTensor: self_cls_name = self.__class__.__name__ deprecate( self_cls_name, @@ -1545,9 +1770,18 @@ class LoRAAttnProcessor2_0(nn.Module): The dimension of the LoRA update matrices. 
network_alpha (`int`, *optional*): Equivalent to `alpha` but it's usage is specific to Kohya (A1111) style LoRAs. + kwargs (`dict`): + Additional keyword arguments to pass to the `LoRALinearLayer` layers. """ - def __init__(self, hidden_size, cross_attention_dim=None, rank=4, network_alpha=None, **kwargs): + def __init__( + self, + hidden_size: int, + cross_attention_dim: Optional[int] = None, + rank: int = 4, + network_alpha: Optional[int] = None, + **kwargs, + ): super().__init__() if not hasattr(F, "scaled_dot_product_attention"): raise ImportError("AttnProcessor2_0 requires PyTorch 2.0, to use it, please upgrade PyTorch to 2.0.") @@ -1576,7 +1810,7 @@ def __init__(self, hidden_size, cross_attention_dim=None, rank=4, network_alpha= self.to_v_lora = LoRALinearLayer(cross_attention_dim or v_hidden_size, v_hidden_size, v_rank, network_alpha) self.to_out_lora = LoRALinearLayer(out_hidden_size, out_hidden_size, out_rank, network_alpha) - def __call__(self, attn: Attention, hidden_states, *args, **kwargs): + def __call__(self, attn: Attention, hidden_states: torch.FloatTensor, *args, **kwargs) -> torch.FloatTensor: self_cls_name = self.__class__.__name__ deprecate( self_cls_name, @@ -1615,16 +1849,17 @@ class LoRAXFormersAttnProcessor(nn.Module): operator. network_alpha (`int`, *optional*): Equivalent to `alpha` but it's usage is specific to Kohya (A1111) style LoRAs. - + kwargs (`dict`): + Additional keyword arguments to pass to the `LoRALinearLayer` layers. """ def __init__( self, - hidden_size, - cross_attention_dim, - rank=4, + hidden_size: int, + cross_attention_dim: int, + rank: int = 4, attention_op: Optional[Callable] = None, - network_alpha=None, + network_alpha: Optional[int] = None, **kwargs, ): super().__init__() @@ -1654,7 +1889,7 @@ def __init__( self.to_v_lora = LoRALinearLayer(cross_attention_dim or v_hidden_size, v_hidden_size, v_rank, network_alpha) self.to_out_lora = LoRALinearLayer(out_hidden_size, out_hidden_size, out_rank, network_alpha) - def __call__(self, attn: Attention, hidden_states, *args, **kwargs): + def __call__(self, attn: Attention, hidden_states: torch.FloatTensor, *args, **kwargs) -> torch.FloatTensor: self_cls_name = self.__class__.__name__ deprecate( self_cls_name, @@ -1687,10 +1922,19 @@ class LoRAAttnAddedKVProcessor(nn.Module): The number of channels in the `encoder_hidden_states`. rank (`int`, defaults to 4): The dimension of the LoRA update matrices. - + network_alpha (`int`, *optional*): + Equivalent to `alpha` but it's usage is specific to Kohya (A1111) style LoRAs. + kwargs (`dict`): + Additional keyword arguments to pass to the `LoRALinearLayer` layers. 
""" - def __init__(self, hidden_size, cross_attention_dim=None, rank=4, network_alpha=None): + def __init__( + self, + hidden_size: int, + cross_attention_dim: Optional[int] = None, + rank: int = 4, + network_alpha: Optional[int] = None, + ): super().__init__() self.hidden_size = hidden_size @@ -1704,7 +1948,7 @@ def __init__(self, hidden_size, cross_attention_dim=None, rank=4, network_alpha= self.to_v_lora = LoRALinearLayer(hidden_size, hidden_size, rank, network_alpha) self.to_out_lora = LoRALinearLayer(hidden_size, hidden_size, rank, network_alpha) - def __call__(self, attn: Attention, hidden_states, *args, **kwargs): + def __call__(self, attn: Attention, hidden_states: torch.FloatTensor, *args, **kwargs) -> torch.FloatTensor: self_cls_name = self.__class__.__name__ deprecate( self_cls_name, @@ -1762,7 +2006,7 @@ def __call__(self, attn: Attention, hidden_states, *args, **kwargs): CustomDiffusionAttnProcessor, CustomDiffusionXFormersAttnProcessor, CustomDiffusionAttnProcessor2_0, - # depraceted + # deprecated LoRAAttnProcessor, LoRAAttnProcessor2_0, LoRAXFormersAttnProcessor, diff --git a/src/diffusers/models/t5_film_transformer.py b/src/diffusers/models/t5_film_transformer.py index 1c41e656a9db..26ff3f6b8127 100644 --- a/src/diffusers/models/t5_film_transformer.py +++ b/src/diffusers/models/t5_film_transformer.py @@ -12,6 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. import math +from typing import Optional, Tuple import torch from torch import nn @@ -23,6 +24,28 @@ class T5FilmDecoder(ModelMixin, ConfigMixin): + r""" + T5 style decoder with FiLM conditioning. + + Args: + input_dims (`int`, *optional*, defaults to `128`): + The number of input dimensions. + targets_length (`int`, *optional*, defaults to `256`): + The length of the targets. + d_model (`int`, *optional*, defaults to `768`): + Size of the input hidden states. + num_layers (`int`, *optional*, defaults to `12`): + The number of `DecoderLayer`'s to use. + num_heads (`int`, *optional*, defaults to `12`): + The number of attention heads to use. + d_kv (`int`, *optional*, defaults to `64`): + Size of the key-value projection vectors. + d_ff (`int`, *optional*, defaults to `2048`): + The number of dimensions in the intermediate feed-forward layer of `DecoderLayer`'s. + dropout_rate (`float`, *optional*, defaults to `0.1`): + Dropout probability. + """ + @register_to_config def __init__( self, @@ -63,7 +86,7 @@ def __init__( self.post_dropout = nn.Dropout(p=dropout_rate) self.spec_out = nn.Linear(d_model, input_dims, bias=False) - def encoder_decoder_mask(self, query_input, key_input): + def encoder_decoder_mask(self, query_input: torch.FloatTensor, key_input: torch.FloatTensor) -> torch.FloatTensor: mask = torch.mul(query_input.unsqueeze(-1), key_input.unsqueeze(-2)) return mask.unsqueeze(-3) @@ -125,7 +148,27 @@ def forward(self, encodings_and_masks, decoder_input_tokens, decoder_noise_time) class DecoderLayer(nn.Module): - def __init__(self, d_model, d_kv, num_heads, d_ff, dropout_rate, layer_norm_epsilon=1e-6): + r""" + T5 decoder layer. + + Args: + d_model (`int`): + Size of the input hidden states. + d_kv (`int`): + Size of the key-value projection vectors. + num_heads (`int`): + Number of attention heads. + d_ff (`int`): + Size of the intermediate feed-forward layer. + dropout_rate (`float`): + Dropout probability. + layer_norm_epsilon (`float`, *optional*, defaults to `1e-6`): + A small value used for numerical stability to avoid dividing by zero. 
+ """ + + def __init__( + self, d_model: int, d_kv: int, num_heads: int, d_ff: int, dropout_rate: float, layer_norm_epsilon: float = 1e-6 + ): super().__init__() self.layer = nn.ModuleList() @@ -152,13 +195,13 @@ def __init__(self, d_model, d_kv, num_heads, d_ff, dropout_rate, layer_norm_epsi def forward( self, - hidden_states, - conditioning_emb=None, - attention_mask=None, - encoder_hidden_states=None, - encoder_attention_mask=None, + hidden_states: torch.FloatTensor, + conditioning_emb: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.Tensor] = None, + encoder_attention_mask: Optional[torch.Tensor] = None, encoder_decoder_position_bias=None, - ): + ) -> Tuple[torch.FloatTensor]: hidden_states = self.layer[0]( hidden_states, conditioning_emb=conditioning_emb, @@ -183,7 +226,21 @@ def forward( class T5LayerSelfAttentionCond(nn.Module): - def __init__(self, d_model, d_kv, num_heads, dropout_rate): + r""" + T5 style self-attention layer with conditioning. + + Args: + d_model (`int`): + Size of the input hidden states. + d_kv (`int`): + Size of the key-value projection vectors. + num_heads (`int`): + Number of attention heads. + dropout_rate (`float`): + Dropout probability. + """ + + def __init__(self, d_model: int, d_kv: int, num_heads: int, dropout_rate: float): super().__init__() self.layer_norm = T5LayerNorm(d_model) self.FiLMLayer = T5FiLMLayer(in_features=d_model * 4, out_features=d_model) @@ -192,10 +249,10 @@ def __init__(self, d_model, d_kv, num_heads, dropout_rate): def forward( self, - hidden_states, - conditioning_emb=None, - attention_mask=None, - ): + hidden_states: torch.FloatTensor, + conditioning_emb: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + ) -> torch.FloatTensor: # pre_self_attention_layer_norm normed_hidden_states = self.layer_norm(hidden_states) @@ -211,7 +268,23 @@ def forward( class T5LayerCrossAttention(nn.Module): - def __init__(self, d_model, d_kv, num_heads, dropout_rate, layer_norm_epsilon): + r""" + T5 style cross-attention layer. + + Args: + d_model (`int`): + Size of the input hidden states. + d_kv (`int`): + Size of the key-value projection vectors. + num_heads (`int`): + Number of attention heads. + dropout_rate (`float`): + Dropout probability. + layer_norm_epsilon (`float`): + A small value used for numerical stability to avoid dividing by zero. + """ + + def __init__(self, d_model: int, d_kv: int, num_heads: int, dropout_rate: float, layer_norm_epsilon: float): super().__init__() self.attention = Attention(query_dim=d_model, heads=num_heads, dim_head=d_kv, out_bias=False, scale_qk=False) self.layer_norm = T5LayerNorm(d_model, eps=layer_norm_epsilon) @@ -219,10 +292,10 @@ def __init__(self, d_model, d_kv, num_heads, dropout_rate, layer_norm_epsilon): def forward( self, - hidden_states, - key_value_states=None, - attention_mask=None, - ): + hidden_states: torch.FloatTensor, + key_value_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + ) -> torch.FloatTensor: normed_hidden_states = self.layer_norm(hidden_states) attention_output = self.attention( normed_hidden_states, @@ -234,14 +307,30 @@ def forward( class T5LayerFFCond(nn.Module): - def __init__(self, d_model, d_ff, dropout_rate, layer_norm_epsilon): + r""" + T5 style feed-forward conditional layer. + + Args: + d_model (`int`): + Size of the input hidden states. + d_ff (`int`): + Size of the intermediate feed-forward layer. 
+ dropout_rate (`float`): + Dropout probability. + layer_norm_epsilon (`float`): + A small value used for numerical stability to avoid dividing by zero. + """ + + def __init__(self, d_model: int, d_ff: int, dropout_rate: float, layer_norm_epsilon: float): super().__init__() self.DenseReluDense = T5DenseGatedActDense(d_model=d_model, d_ff=d_ff, dropout_rate=dropout_rate) self.film = T5FiLMLayer(in_features=d_model * 4, out_features=d_model) self.layer_norm = T5LayerNorm(d_model, eps=layer_norm_epsilon) self.dropout = nn.Dropout(dropout_rate) - def forward(self, hidden_states, conditioning_emb=None): + def forward( + self, hidden_states: torch.FloatTensor, conditioning_emb: Optional[torch.FloatTensor] = None + ) -> torch.FloatTensor: forwarded_states = self.layer_norm(hidden_states) if conditioning_emb is not None: forwarded_states = self.film(forwarded_states, conditioning_emb) @@ -252,7 +341,19 @@ def forward(self, hidden_states, conditioning_emb=None): class T5DenseGatedActDense(nn.Module): - def __init__(self, d_model, d_ff, dropout_rate): + r""" + T5 style feed-forward layer with gated activations and dropout. + + Args: + d_model (`int`): + Size of the input hidden states. + d_ff (`int`): + Size of the intermediate feed-forward layer. + dropout_rate (`float`): + Dropout probability. + """ + + def __init__(self, d_model: int, d_ff: int, dropout_rate: float): super().__init__() self.wi_0 = nn.Linear(d_model, d_ff, bias=False) self.wi_1 = nn.Linear(d_model, d_ff, bias=False) @@ -260,7 +361,7 @@ def __init__(self, d_model, d_ff, dropout_rate): self.dropout = nn.Dropout(dropout_rate) self.act = NewGELUActivation() - def forward(self, hidden_states): + def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor: hidden_gelu = self.act(self.wi_0(hidden_states)) hidden_linear = self.wi_1(hidden_states) hidden_states = hidden_gelu * hidden_linear @@ -271,7 +372,17 @@ def forward(self, hidden_states): class T5LayerNorm(nn.Module): - def __init__(self, hidden_size, eps=1e-6): + r""" + T5 style layer normalization module. + + Args: + hidden_size (`int`): + Size of the input hidden states. + eps (`float`, `optional`, defaults to `1e-6`): + A small value used for numerical stability to avoid dividing by zero. + """ + + def __init__(self, hidden_size: int, eps: float = 1e-6): """ Construct a layernorm module in the T5 style. No bias and no subtraction of mean. """ @@ -279,7 +390,7 @@ def __init__(self, hidden_size, eps=1e-6): self.weight = nn.Parameter(torch.ones(hidden_size)) self.variance_epsilon = eps - def forward(self, hidden_states): + def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor: # T5 uses a layer_norm which only scales and doesn't shift, which is also known as Root Mean # Square Layer Normalization https://arxiv.org/abs/1910.07467 thus variance is calculated # w/o mean and there is no bias. Additionally we want to make sure that the accumulation for @@ -307,14 +418,20 @@ def forward(self, input: torch.Tensor) -> torch.Tensor: class T5FiLMLayer(nn.Module): """ - FiLM Layer + T5 style FiLM Layer. + + Args: + in_features (`int`): + Number of input features. + out_features (`int`): + Number of output features. 
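FiLM conditioning, as implemented by `T5FiLMLayer` above, predicts a per-feature scale and shift from the conditioning embedding and applies `x * (1 + scale) + shift`. A standalone sketch of the same computation, with shapes chosen purely for illustration:

```py
# Hedged, standalone sketch of the FiLM modulation used by T5FiLMLayer.
import torch
from torch import nn

d_model = 8
batch, seq_len = 2, 5

# The layer projects the conditioning embedding to 2 * out_features: scale and shift.
scale_bias = nn.Linear(4 * d_model, d_model * 2, bias=False)

x = torch.randn(batch, seq_len, d_model)               # hidden states
conditioning_emb = torch.randn(batch, 1, 4 * d_model)  # broadcasts over seq_len

emb = scale_bias(conditioning_emb)
scale, shift = torch.chunk(emb, 2, dim=-1)
out = x * (1 + scale) + shift                          # feature-wise affine modulation

print(out.shape)  # torch.Size([2, 5, 8])
```

The `4 * d_model` input width mirrors how the decoder layers above construct the layer (`in_features=d_model * 4`, `out_features=d_model`).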
""" - def __init__(self, in_features, out_features): + def __init__(self, in_features: int, out_features: int): super().__init__() self.scale_bias = nn.Linear(in_features, out_features * 2, bias=False) - def forward(self, x, conditioning_emb): + def forward(self, x: torch.FloatTensor, conditioning_emb: torch.FloatTensor) -> torch.FloatTensor: emb = self.scale_bias(conditioning_emb) scale, shift = torch.chunk(emb, 2, -1) x = x * (1 + scale) + shift diff --git a/src/diffusers/models/transformer_temporal.py b/src/diffusers/models/transformer_temporal.py index d59284875736..55c9e6968a32 100644 --- a/src/diffusers/models/transformer_temporal.py +++ b/src/diffusers/models/transformer_temporal.py @@ -12,7 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. from dataclasses import dataclass -from typing import Optional +from typing import Any, Dict, Optional import torch from torch import nn @@ -48,11 +48,15 @@ class TransformerTemporalModel(ModelMixin, ConfigMixin): num_layers (`int`, *optional*, defaults to 1): The number of layers of Transformer blocks to use. dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use. cross_attention_dim (`int`, *optional*): The number of `encoder_hidden_states` dimensions to use. - sample_size (`int`, *optional*): The width of the latent images (specify if the input is **discrete**). - This is fixed during training since it is used to learn a number of position embeddings. - activation_fn (`str`, *optional*, defaults to `"geglu"`): Activation function to use in feed-forward. attention_bias (`bool`, *optional*): Configure if the `TransformerBlock` attention should contain a bias parameter. + sample_size (`int`, *optional*): The width of the latent images (specify if the input is **discrete**). + This is fixed during training since it is used to learn a number of position embeddings. + activation_fn (`str`, *optional*, defaults to `"geglu"`): + Activation function to use in feed-forward. See `diffusers.models.activations.get_activation` for supported + activation functions. + norm_elementwise_affine (`bool`, *optional*): + Configure if the `TransformerBlock` should use learnable elementwise affine parameters for normalization. double_self_attention (`bool`, *optional*): Configure if each `TransformerBlock` should contain two self-attention layers. """ @@ -106,14 +110,14 @@ def __init__( def forward( self, - hidden_states, - encoder_hidden_states=None, - timestep=None, - class_labels=None, - num_frames=1, - cross_attention_kwargs=None, + hidden_states: torch.FloatTensor, + encoder_hidden_states: Optional[torch.LongTensor] = None, + timestep: Optional[torch.LongTensor] = None, + class_labels: torch.LongTensor = None, + num_frames: int = 1, + cross_attention_kwargs: Optional[Dict[str, Any]] = None, return_dict: bool = True, - ): + ) -> TransformerTemporalModelOutput: """ The [`TransformerTemporal`] forward method. @@ -123,7 +127,7 @@ def forward( encoder_hidden_states ( `torch.LongTensor` of shape `(batch size, encoder_hidden_states dim)`, *optional*): Conditional embeddings for cross attention layer. If not given, cross-attention defaults to self-attention. - timestep ( `torch.long`, *optional*): + timestep ( `torch.LongTensor`, *optional*): Used to indicate denoising step. Optional timestep to be applied as an embedding in `AdaLayerNorm`. class_labels ( `torch.LongTensor` of shape `(batch size, num classes)`, *optional*): Used to indicate class labels conditioning. 
Optional class labels to be applied as an embedding in diff --git a/src/diffusers/models/unet_1d_blocks.py b/src/diffusers/models/unet_1d_blocks.py index 84ae48e0f8c4..74a2f1681ead 100644 --- a/src/diffusers/models/unet_1d_blocks.py +++ b/src/diffusers/models/unet_1d_blocks.py @@ -12,6 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. import math +from typing import Optional, Tuple, Union import torch import torch.nn.functional as F @@ -24,17 +25,17 @@ class DownResnetBlock1D(nn.Module): def __init__( self, - in_channels, - out_channels=None, - num_layers=1, - conv_shortcut=False, - temb_channels=32, - groups=32, - groups_out=None, - non_linearity=None, - time_embedding_norm="default", - output_scale_factor=1.0, - add_downsample=True, + in_channels: int, + out_channels: Optional[int] = None, + num_layers: int = 1, + conv_shortcut: bool = False, + temb_channels: int = 32, + groups: int = 32, + groups_out: Optional[int] = None, + non_linearity: Optional[str] = None, + time_embedding_norm: str = "default", + output_scale_factor: float = 1.0, + add_downsample: bool = True, ): super().__init__() self.in_channels = in_channels @@ -65,7 +66,7 @@ def __init__( if add_downsample: self.downsample = Downsample1D(out_channels, use_conv=True, padding=1) - def forward(self, hidden_states, temb=None): + def forward(self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor: output_states = () hidden_states = self.resnets[0](hidden_states, temb) @@ -86,16 +87,16 @@ def forward(self, hidden_states, temb=None): class UpResnetBlock1D(nn.Module): def __init__( self, - in_channels, - out_channels=None, - num_layers=1, - temb_channels=32, - groups=32, - groups_out=None, - non_linearity=None, - time_embedding_norm="default", - output_scale_factor=1.0, - add_upsample=True, + in_channels: int, + out_channels: Optional[int] = None, + num_layers: int = 1, + temb_channels: int = 32, + groups: int = 32, + groups_out: Optional[int] = None, + non_linearity: Optional[str] = None, + time_embedding_norm: str = "default", + output_scale_factor: float = 1.0, + add_upsample: bool = True, ): super().__init__() self.in_channels = in_channels @@ -125,7 +126,12 @@ def __init__( if add_upsample: self.upsample = Upsample1D(out_channels, use_conv_transpose=True) - def forward(self, hidden_states, res_hidden_states_tuple=None, temb=None): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Optional[Tuple[torch.FloatTensor, ...]] = None, + temb: Optional[torch.FloatTensor] = None, + ) -> torch.FloatTensor: if res_hidden_states_tuple is not None: res_hidden_states = res_hidden_states_tuple[-1] hidden_states = torch.cat((hidden_states, res_hidden_states), dim=1) @@ -144,7 +150,7 @@ def forward(self, hidden_states, res_hidden_states_tuple=None, temb=None): class ValueFunctionMidBlock1D(nn.Module): - def __init__(self, in_channels, out_channels, embed_dim): + def __init__(self, in_channels: int, out_channels: int, embed_dim: int): super().__init__() self.in_channels = in_channels self.out_channels = out_channels @@ -155,7 +161,7 @@ def __init__(self, in_channels, out_channels, embed_dim): self.res2 = ResidualTemporalBlock1D(in_channels // 2, in_channels // 4, embed_dim=embed_dim) self.down2 = Downsample1D(out_channels // 4, use_conv=True) - def forward(self, x, temb=None): + def forward(self, x: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor: x = self.res1(x, temb) x 
= self.down1(x) x = self.res2(x, temb) @@ -166,13 +172,13 @@ def forward(self, x, temb=None): class MidResTemporalBlock1D(nn.Module): def __init__( self, - in_channels, - out_channels, - embed_dim, + in_channels: int, + out_channels: int, + embed_dim: int, num_layers: int = 1, add_downsample: bool = False, add_upsample: bool = False, - non_linearity=None, + non_linearity: Optional[str] = None, ): super().__init__() self.in_channels = in_channels @@ -203,7 +209,7 @@ def __init__( if self.upsample and self.downsample: raise ValueError("Block cannot downsample and upsample") - def forward(self, hidden_states, temb): + def forward(self, hidden_states: torch.FloatTensor, temb: torch.FloatTensor) -> torch.FloatTensor: hidden_states = self.resnets[0](hidden_states, temb) for resnet in self.resnets[1:]: hidden_states = resnet(hidden_states, temb) @@ -217,14 +223,14 @@ def forward(self, hidden_states, temb): class OutConv1DBlock(nn.Module): - def __init__(self, num_groups_out, out_channels, embed_dim, act_fn): + def __init__(self, num_groups_out: int, out_channels: int, embed_dim: int, act_fn: str): super().__init__() self.final_conv1d_1 = nn.Conv1d(embed_dim, embed_dim, 5, padding=2) self.final_conv1d_gn = nn.GroupNorm(num_groups_out, embed_dim) self.final_conv1d_act = get_activation(act_fn) self.final_conv1d_2 = nn.Conv1d(embed_dim, out_channels, 1) - def forward(self, hidden_states, temb=None): + def forward(self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor: hidden_states = self.final_conv1d_1(hidden_states) hidden_states = rearrange_dims(hidden_states) hidden_states = self.final_conv1d_gn(hidden_states) @@ -235,7 +241,7 @@ def forward(self, hidden_states, temb=None): class OutValueFunctionBlock(nn.Module): - def __init__(self, fc_dim, embed_dim, act_fn="mish"): + def __init__(self, fc_dim: int, embed_dim: int, act_fn: str = "mish"): super().__init__() self.final_block = nn.ModuleList( [ @@ -245,7 +251,7 @@ def __init__(self, fc_dim, embed_dim, act_fn="mish"): ] ) - def forward(self, hidden_states, temb): + def forward(self, hidden_states: torch.FloatTensor, temb: torch.FloatTensor) -> torch.FloatTensor: hidden_states = hidden_states.view(hidden_states.shape[0], -1) hidden_states = torch.cat((hidden_states, temb), dim=-1) for layer in self.final_block: @@ -275,14 +281,14 @@ def forward(self, hidden_states, temb): class Downsample1d(nn.Module): - def __init__(self, kernel="linear", pad_mode="reflect"): + def __init__(self, kernel: str = "linear", pad_mode: str = "reflect"): super().__init__() self.pad_mode = pad_mode kernel_1d = torch.tensor(_kernels[kernel]) self.pad = kernel_1d.shape[0] // 2 - 1 self.register_buffer("kernel", kernel_1d) - def forward(self, hidden_states): + def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor: hidden_states = F.pad(hidden_states, (self.pad,) * 2, self.pad_mode) weight = hidden_states.new_zeros([hidden_states.shape[1], hidden_states.shape[1], self.kernel.shape[0]]) indices = torch.arange(hidden_states.shape[1], device=hidden_states.device) @@ -292,14 +298,14 @@ def forward(self, hidden_states): class Upsample1d(nn.Module): - def __init__(self, kernel="linear", pad_mode="reflect"): + def __init__(self, kernel: str = "linear", pad_mode: str = "reflect"): super().__init__() self.pad_mode = pad_mode kernel_1d = torch.tensor(_kernels[kernel]) * 2 self.pad = kernel_1d.shape[0] // 2 - 1 self.register_buffer("kernel", kernel_1d) - def forward(self, hidden_states, temb=None): + def forward(self, 
hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor: hidden_states = F.pad(hidden_states, ((self.pad + 1) // 2,) * 2, self.pad_mode) weight = hidden_states.new_zeros([hidden_states.shape[1], hidden_states.shape[1], self.kernel.shape[0]]) indices = torch.arange(hidden_states.shape[1], device=hidden_states.device) @@ -309,7 +315,7 @@ def forward(self, hidden_states, temb=None): class SelfAttention1d(nn.Module): - def __init__(self, in_channels, n_head=1, dropout_rate=0.0): + def __init__(self, in_channels: int, n_head: int = 1, dropout_rate: float = 0.0): super().__init__() self.channels = in_channels self.group_norm = nn.GroupNorm(1, num_channels=in_channels) @@ -329,7 +335,7 @@ def transpose_for_scores(self, projection: torch.Tensor) -> torch.Tensor: new_projection = projection.view(new_projection_shape).permute(0, 2, 1, 3) return new_projection - def forward(self, hidden_states): + def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor: residual = hidden_states batch, channel_dim, seq = hidden_states.shape @@ -367,7 +373,7 @@ def forward(self, hidden_states): class ResConvBlock(nn.Module): - def __init__(self, in_channels, mid_channels, out_channels, is_last=False): + def __init__(self, in_channels: int, mid_channels: int, out_channels: int, is_last: bool = False): super().__init__() self.is_last = is_last self.has_conv_skip = in_channels != out_channels @@ -384,7 +390,7 @@ def __init__(self, in_channels, mid_channels, out_channels, is_last=False): self.group_norm_2 = nn.GroupNorm(1, out_channels) self.gelu_2 = nn.GELU() - def forward(self, hidden_states): + def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor: residual = self.conv_skip(hidden_states) if self.has_conv_skip else hidden_states hidden_states = self.conv_1(hidden_states) @@ -401,7 +407,7 @@ def forward(self, hidden_states): class UNetMidBlock1D(nn.Module): - def __init__(self, mid_channels, in_channels, out_channels=None): + def __init__(self, mid_channels: int, in_channels: int, out_channels: Optional[int] = None): super().__init__() out_channels = in_channels if out_channels is None else out_channels @@ -429,7 +435,7 @@ def __init__(self, mid_channels, in_channels, out_channels=None): self.attentions = nn.ModuleList(attentions) self.resnets = nn.ModuleList(resnets) - def forward(self, hidden_states, temb=None): + def forward(self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor: hidden_states = self.down(hidden_states) for attn, resnet in zip(self.attentions, self.resnets): hidden_states = resnet(hidden_states) @@ -441,7 +447,7 @@ def forward(self, hidden_states, temb=None): class AttnDownBlock1D(nn.Module): - def __init__(self, out_channels, in_channels, mid_channels=None): + def __init__(self, out_channels: int, in_channels: int, mid_channels: Optional[int] = None): super().__init__() mid_channels = out_channels if mid_channels is None else mid_channels @@ -460,7 +466,7 @@ def __init__(self, out_channels, in_channels, mid_channels=None): self.attentions = nn.ModuleList(attentions) self.resnets = nn.ModuleList(resnets) - def forward(self, hidden_states, temb=None): + def forward(self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor: hidden_states = self.down(hidden_states) for resnet, attn in zip(self.resnets, self.attentions): @@ -471,7 +477,7 @@ def forward(self, hidden_states, temb=None): class DownBlock1D(nn.Module): - def __init__(self, 
out_channels, in_channels, mid_channels=None): + def __init__(self, out_channels: int, in_channels: int, mid_channels: Optional[int] = None): super().__init__() mid_channels = out_channels if mid_channels is None else mid_channels @@ -484,7 +490,7 @@ def __init__(self, out_channels, in_channels, mid_channels=None): self.resnets = nn.ModuleList(resnets) - def forward(self, hidden_states, temb=None): + def forward(self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor: hidden_states = self.down(hidden_states) for resnet in self.resnets: @@ -494,7 +500,7 @@ def forward(self, hidden_states, temb=None): class DownBlock1DNoSkip(nn.Module): - def __init__(self, out_channels, in_channels, mid_channels=None): + def __init__(self, out_channels: int, in_channels: int, mid_channels: Optional[int] = None): super().__init__() mid_channels = out_channels if mid_channels is None else mid_channels @@ -506,7 +512,7 @@ def __init__(self, out_channels, in_channels, mid_channels=None): self.resnets = nn.ModuleList(resnets) - def forward(self, hidden_states, temb=None): + def forward(self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor: hidden_states = torch.cat([hidden_states, temb], dim=1) for resnet in self.resnets: hidden_states = resnet(hidden_states) @@ -515,7 +521,7 @@ def forward(self, hidden_states, temb=None): class AttnUpBlock1D(nn.Module): - def __init__(self, in_channels, out_channels, mid_channels=None): + def __init__(self, in_channels: int, out_channels: int, mid_channels: Optional[int] = None): super().__init__() mid_channels = out_channels if mid_channels is None else mid_channels @@ -534,7 +540,12 @@ def __init__(self, in_channels, out_channels, mid_channels=None): self.resnets = nn.ModuleList(resnets) self.up = Upsample1d(kernel="cubic") - def forward(self, hidden_states, res_hidden_states_tuple, temb=None): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + ) -> torch.FloatTensor: res_hidden_states = res_hidden_states_tuple[-1] hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1) @@ -548,7 +559,7 @@ def forward(self, hidden_states, res_hidden_states_tuple, temb=None): class UpBlock1D(nn.Module): - def __init__(self, in_channels, out_channels, mid_channels=None): + def __init__(self, in_channels: int, out_channels: int, mid_channels: Optional[int] = None): super().__init__() mid_channels = in_channels if mid_channels is None else mid_channels @@ -561,7 +572,12 @@ def __init__(self, in_channels, out_channels, mid_channels=None): self.resnets = nn.ModuleList(resnets) self.up = Upsample1d(kernel="cubic") - def forward(self, hidden_states, res_hidden_states_tuple, temb=None): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + ) -> torch.FloatTensor: res_hidden_states = res_hidden_states_tuple[-1] hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1) @@ -574,7 +590,7 @@ def forward(self, hidden_states, res_hidden_states_tuple, temb=None): class UpBlock1DNoSkip(nn.Module): - def __init__(self, in_channels, out_channels, mid_channels=None): + def __init__(self, in_channels: int, out_channels: int, mid_channels: Optional[int] = None): super().__init__() mid_channels = in_channels if mid_channels is None else mid_channels @@ -586,7 +602,12 @@ def 
__init__(self, in_channels, out_channels, mid_channels=None): self.resnets = nn.ModuleList(resnets) - def forward(self, hidden_states, res_hidden_states_tuple, temb=None): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + ) -> torch.FloatTensor: res_hidden_states = res_hidden_states_tuple[-1] hidden_states = torch.cat([hidden_states, res_hidden_states], dim=1) @@ -596,7 +617,20 @@ def forward(self, hidden_states, res_hidden_states_tuple, temb=None): return hidden_states -def get_down_block(down_block_type, num_layers, in_channels, out_channels, temb_channels, add_downsample): +DownBlockType = Union[DownResnetBlock1D, DownBlock1D, AttnDownBlock1D, DownBlock1DNoSkip] +MidBlockType = Union[MidResTemporalBlock1D, ValueFunctionMidBlock1D, UNetMidBlock1D] +OutBlockType = Union[OutConv1DBlock, OutValueFunctionBlock] +UpBlockType = Union[UpResnetBlock1D, UpBlock1D, AttnUpBlock1D, UpBlock1DNoSkip] + + +def get_down_block( + down_block_type: str, + num_layers: int, + in_channels: int, + out_channels: int, + temb_channels: int, + add_downsample: bool, +) -> DownBlockType: if down_block_type == "DownResnetBlock1D": return DownResnetBlock1D( in_channels=in_channels, @@ -614,7 +648,9 @@ def get_down_block(down_block_type, num_layers, in_channels, out_channels, temb_ raise ValueError(f"{down_block_type} does not exist.") -def get_up_block(up_block_type, num_layers, in_channels, out_channels, temb_channels, add_upsample): +def get_up_block( + up_block_type: str, num_layers: int, in_channels: int, out_channels: int, temb_channels: int, add_upsample: bool +) -> UpBlockType: if up_block_type == "UpResnetBlock1D": return UpResnetBlock1D( in_channels=in_channels, @@ -632,7 +668,15 @@ def get_up_block(up_block_type, num_layers, in_channels, out_channels, temb_chan raise ValueError(f"{up_block_type} does not exist.") -def get_mid_block(mid_block_type, num_layers, in_channels, mid_channels, out_channels, embed_dim, add_downsample): +def get_mid_block( + mid_block_type: str, + num_layers: int, + in_channels: int, + mid_channels: int, + out_channels: int, + embed_dim: int, + add_downsample: bool, +) -> MidBlockType: if mid_block_type == "MidResTemporalBlock1D": return MidResTemporalBlock1D( num_layers=num_layers, @@ -648,7 +692,9 @@ def get_mid_block(mid_block_type, num_layers, in_channels, mid_channels, out_cha raise ValueError(f"{mid_block_type} does not exist.") -def get_out_block(*, out_block_type, num_groups_out, embed_dim, out_channels, act_fn, fc_dim): +def get_out_block( + *, out_block_type: str, num_groups_out: int, embed_dim: int, out_channels: int, act_fn: str, fc_dim: int +) -> Optional[OutBlockType]: if out_block_type == "OutConv1DBlock": return OutConv1DBlock(num_groups_out, out_channels, embed_dim, act_fn) elif out_block_type == "ValueFunction": diff --git a/src/diffusers/models/unet_2d_blocks.py b/src/diffusers/models/unet_2d_blocks.py index cfaedd717bef..e404cef224ff 100644 --- a/src/diffusers/models/unet_2d_blocks.py +++ b/src/diffusers/models/unet_2d_blocks.py @@ -32,31 +32,31 @@ def get_down_block( - down_block_type, - num_layers, - in_channels, - out_channels, - temb_channels, - add_downsample, - resnet_eps, - resnet_act_fn, - transformer_layers_per_block=1, - num_attention_heads=None, - resnet_groups=None, - cross_attention_dim=None, - downsample_padding=None, - dual_cross_attention=False, - use_linear_projection=False, - only_cross_attention=False, - upcast_attention=False, - 
resnet_time_scale_shift="default", - attention_type="default", - resnet_skip_time_act=False, - resnet_out_scale_factor=1.0, - cross_attention_norm=None, - attention_head_dim=None, - downsample_type=None, - dropout=0.0, + down_block_type: str, + num_layers: int, + in_channels: int, + out_channels: int, + temb_channels: int, + add_downsample: bool, + resnet_eps: float, + resnet_act_fn: str, + transformer_layers_per_block: int = 1, + num_attention_heads: Optional[int] = None, + resnet_groups: Optional[int] = None, + cross_attention_dim: Optional[int] = None, + downsample_padding: Optional[int] = None, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, + only_cross_attention: bool = False, + upcast_attention: bool = False, + resnet_time_scale_shift: str = "default", + attention_type: str = "default", + resnet_skip_time_act: bool = False, + resnet_out_scale_factor: float = 1.0, + cross_attention_norm: Optional[str] = None, + attention_head_dim: Optional[int] = None, + downsample_type: Optional[str] = None, + dropout: float = 0.0, ): # If attn head dim is not defined, we default it to the number of heads if attention_head_dim is None: @@ -241,33 +241,33 @@ def get_down_block( def get_up_block( - up_block_type, - num_layers, - in_channels, - out_channels, - prev_output_channel, - temb_channels, - add_upsample, - resnet_eps, - resnet_act_fn, - resolution_idx=None, - transformer_layers_per_block=1, - num_attention_heads=None, - resnet_groups=None, - cross_attention_dim=None, - dual_cross_attention=False, - use_linear_projection=False, - only_cross_attention=False, - upcast_attention=False, - resnet_time_scale_shift="default", - attention_type="default", - resnet_skip_time_act=False, - resnet_out_scale_factor=1.0, - cross_attention_norm=None, - attention_head_dim=None, - upsample_type=None, - dropout=0.0, -): + up_block_type: str, + num_layers: int, + in_channels: int, + out_channels: int, + prev_output_channel: int, + temb_channels: int, + add_upsample: bool, + resnet_eps: float, + resnet_act_fn: str, + resolution_idx: Optional[int] = None, + transformer_layers_per_block: int = 1, + num_attention_heads: Optional[int] = None, + resnet_groups: Optional[int] = None, + cross_attention_dim: Optional[int] = None, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, + only_cross_attention: bool = False, + upcast_attention: bool = False, + resnet_time_scale_shift: str = "default", + attention_type: str = "default", + resnet_skip_time_act: bool = False, + resnet_out_scale_factor: float = 1.0, + cross_attention_norm: Optional[str] = None, + attention_head_dim: Optional[int] = None, + upsample_type: Optional[str] = None, + dropout: float = 0.0, +) -> nn.Module: # If attn head dim is not defined, we default it to the number of heads if attention_head_dim is None: logger.warn( @@ -498,7 +498,7 @@ def __init__(self, in_channels: int, out_channels: int, act_fn: str): ) self.fuse = nn.ReLU() - def forward(self, x): + def forward(self, x: torch.FloatTensor) -> torch.FloatTensor: return self.fuse(self.conv(x) + self.skip(x)) @@ -546,8 +546,8 @@ def __init__( attn_groups: Optional[int] = None, resnet_pre_norm: bool = True, add_attention: bool = True, - attention_head_dim=1, - output_scale_factor=1.0, + attention_head_dim: int = 1, + output_scale_factor: float = 1.0, ): super().__init__() resnet_groups = resnet_groups if resnet_groups is not None else min(in_channels // 4, 32) @@ -617,7 +617,7 @@ def __init__( self.attentions = nn.ModuleList(attentions) self.resnets = 
nn.ModuleList(resnets) - def forward(self, hidden_states, temb=None): + def forward(self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor: hidden_states = self.resnets[0](hidden_states, temb) for attn, resnet in zip(self.attentions, self.resnets[1:]): if attn is not None: @@ -640,13 +640,13 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - output_scale_factor=1.0, - cross_attention_dim=1280, - dual_cross_attention=False, - use_linear_projection=False, - upcast_attention=False, - attention_type="default", + num_attention_heads: int = 1, + output_scale_factor: float = 1.0, + cross_attention_dim: int = 1280, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, + upcast_attention: bool = False, + attention_type: str = "default", ): super().__init__() @@ -785,12 +785,12 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - attention_head_dim=1, - output_scale_factor=1.0, - cross_attention_dim=1280, - skip_time_act=False, - only_cross_attention=False, - cross_attention_norm=None, + attention_head_dim: int = 1, + output_scale_factor: float = 1.0, + cross_attention_dim: int = 1280, + skip_time_act: bool = False, + only_cross_attention: bool = False, + cross_attention_norm: Optional[str] = None, ): super().__init__() @@ -866,7 +866,7 @@ def forward( attention_mask: Optional[torch.FloatTensor] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - ): + ) -> torch.FloatTensor: cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {} lora_scale = cross_attention_kwargs.get("scale", 1.0) @@ -910,10 +910,10 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - attention_head_dim=1, - output_scale_factor=1.0, - downsample_padding=1, - downsample_type="conv", + attention_head_dim: int = 1, + output_scale_factor: float = 1.0, + downsample_padding: int = 1, + downsample_type: str = "conv", ): super().__init__() resnets = [] @@ -989,7 +989,13 @@ def __init__( else: self.downsamplers = None - def forward(self, hidden_states, temb=None, upsample_size=None, cross_attention_kwargs=None): + def forward( + self, + hidden_states: torch.FloatTensor, + temb: Optional[torch.FloatTensor] = None, + upsample_size: Optional[int] = None, + cross_attention_kwargs: Optional[Dict[str, Any]] = None, + ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {} lora_scale = cross_attention_kwargs.get("scale", 1.0) @@ -1028,16 +1034,16 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - cross_attention_dim=1280, - output_scale_factor=1.0, - downsample_padding=1, - add_downsample=True, - dual_cross_attention=False, - use_linear_projection=False, - only_cross_attention=False, - upcast_attention=False, - attention_type="default", + num_attention_heads: int = 1, + cross_attention_dim: int = 1280, + output_scale_factor: float = 1.0, + downsample_padding: int = 1, + add_downsample: bool = True, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, + only_cross_attention: bool = False, + upcast_attention: bool = False, + attention_type: str = "default", ): super().__init__() resnets = [] 
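# Editor's note (not part of this patch): a minimal sketch of exercising the
# typed `get_down_block` factory annotated above. The argument values below
# are purely illustrative, and this is internal API whose defaults and module
# path may change between diffusers releases.
import torch
from diffusers.models.unet_2d_blocks import get_down_block

down_block = get_down_block(
    down_block_type="DownBlock2D",
    num_layers=2,
    in_channels=64,
    out_channels=128,
    temb_channels=512,
    add_downsample=True,
    resnet_eps=1e-5,
    resnet_act_fn="silu",
    resnet_groups=32,
    downsample_padding=1,
)
sample = torch.randn(1, 64, 32, 32)  # (batch, channels, height, width)
temb = torch.randn(1, 512)           # timestep embedding
# The typed forward contract: returns the downsampled hidden states plus a
# tuple of intermediate states used later as skip connections.
hidden_states, res_samples = down_block(sample, temb)
# hidden_states has shape (1, 128, 16, 16); res_samples holds one tensor per resnet/downsampler.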
@@ -1114,8 +1120,8 @@ def forward( attention_mask: Optional[torch.FloatTensor] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - additional_residuals=None, - ): + additional_residuals: Optional[torch.FloatTensor] = None, + ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: output_states = () lora_scale = cross_attention_kwargs.get("scale", 1.0) if cross_attention_kwargs is not None else 1.0 @@ -1188,9 +1194,9 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_downsample=True, - downsample_padding=1, + output_scale_factor: float = 1.0, + add_downsample: bool = True, + downsample_padding: int = 1, ): super().__init__() resnets = [] @@ -1227,7 +1233,9 @@ def __init__( self.gradient_checkpointing = False - def forward(self, hidden_states, temb=None, scale: float = 1.0): + def forward( + self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None, scale: float = 1.0 + ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: output_states = () for resnet in self.resnets: @@ -1273,9 +1281,9 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_downsample=True, - downsample_padding=1, + output_scale_factor: float = 1.0, + add_downsample: bool = True, + downsample_padding: int = 1, ): super().__init__() resnets = [] @@ -1310,7 +1318,7 @@ def __init__( else: self.downsamplers = None - def forward(self, hidden_states, scale: float = 1.0): + def forward(self, hidden_states: torch.FloatTensor, scale: float = 1.0) -> torch.FloatTensor: for resnet in self.resnets: hidden_states = resnet(hidden_states, temb=None, scale=scale) @@ -1333,10 +1341,10 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - attention_head_dim=1, - output_scale_factor=1.0, - add_downsample=True, - downsample_padding=1, + attention_head_dim: int = 1, + output_scale_factor: float = 1.0, + add_downsample: bool = True, + downsample_padding: int = 1, ): super().__init__() resnets = [] @@ -1393,7 +1401,7 @@ def __init__( else: self.downsamplers = None - def forward(self, hidden_states, scale: float = 1.0): + def forward(self, hidden_states: torch.FloatTensor, scale: float = 1.0) -> torch.FloatTensor: for resnet, attn in zip(self.resnets, self.attentions): hidden_states = resnet(hidden_states, temb=None, scale=scale) cross_attention_kwargs = {"scale": scale} @@ -1418,9 +1426,9 @@ def __init__( resnet_time_scale_shift: str = "default", resnet_act_fn: str = "swish", resnet_pre_norm: bool = True, - attention_head_dim=1, - output_scale_factor=np.sqrt(2.0), - add_downsample=True, + attention_head_dim: int = 1, + output_scale_factor: float = np.sqrt(2.0), + add_downsample: bool = True, ): super().__init__() self.attentions = nn.ModuleList([]) @@ -1487,7 +1495,13 @@ def __init__( self.downsamplers = None self.skip_conv = None - def forward(self, hidden_states, temb=None, skip_sample=None, scale: float = 1.0): + def forward( + self, + hidden_states: torch.FloatTensor, + temb: Optional[torch.FloatTensor] = None, + skip_sample: Optional[torch.FloatTensor] = None, + scale: float = 1.0, + ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...], torch.FloatTensor]: output_states = () for resnet, attn in zip(self.resnets, self.attentions): @@ -1520,9 +1534,9 @@ def __init__( resnet_time_scale_shift: str = "default", 
resnet_act_fn: str = "swish", resnet_pre_norm: bool = True, - output_scale_factor=np.sqrt(2.0), - add_downsample=True, - downsample_padding=1, + output_scale_factor: float = np.sqrt(2.0), + add_downsample: bool = True, + downsample_padding: int = 1, ): super().__init__() self.resnets = nn.ModuleList([]) @@ -1568,7 +1582,13 @@ def __init__( self.downsamplers = None self.skip_conv = None - def forward(self, hidden_states, temb=None, skip_sample=None, scale: float = 1.0): + def forward( + self, + hidden_states: torch.FloatTensor, + temb: Optional[torch.FloatTensor] = None, + skip_sample: Optional[torch.FloatTensor] = None, + scale: float = 1.0, + ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...], torch.FloatTensor]: output_states = () for resnet in self.resnets: @@ -1600,9 +1620,9 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_downsample=True, - skip_time_act=False, + output_scale_factor: float = 1.0, + add_downsample: bool = True, + skip_time_act: bool = False, ): super().__init__() resnets = [] @@ -1651,7 +1671,9 @@ def __init__( self.gradient_checkpointing = False - def forward(self, hidden_states, temb=None, scale: float = 1.0): + def forward( + self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None, scale: float = 1.0 + ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: output_states = () for resnet in self.resnets: @@ -1698,13 +1720,13 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - attention_head_dim=1, - cross_attention_dim=1280, - output_scale_factor=1.0, - add_downsample=True, - skip_time_act=False, - only_cross_attention=False, - cross_attention_norm=None, + attention_head_dim: int = 1, + cross_attention_dim: int = 1280, + output_scale_factor: float = 1.0, + add_downsample: bool = True, + skip_time_act: bool = False, + only_cross_attention: bool = False, + cross_attention_norm: Optional[str] = None, ): super().__init__() @@ -1788,7 +1810,7 @@ def forward( attention_mask: Optional[torch.FloatTensor] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - ): + ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: output_states = () cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {} @@ -1856,7 +1878,7 @@ def __init__( resnet_eps: float = 1e-5, resnet_act_fn: str = "gelu", resnet_group_size: int = 32, - add_downsample=False, + add_downsample: bool = False, ): super().__init__() resnets = [] @@ -1891,7 +1913,9 @@ def __init__( self.gradient_checkpointing = False - def forward(self, hidden_states, temb=None, scale: float = 1.0): + def forward( + self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None, scale: float = 1.0 + ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: output_states = () for resnet in self.resnets: @@ -1933,7 +1957,7 @@ def __init__( dropout: float = 0.0, num_layers: int = 4, resnet_group_size: int = 32, - add_downsample=True, + add_downsample: bool = True, attention_head_dim: int = 64, add_self_attention: bool = False, resnet_eps: float = 1e-5, @@ -1996,7 +2020,7 @@ def forward( attention_mask: Optional[torch.FloatTensor] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - ): + ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: 
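# Editor's note (not part of this patch): the `scale` entry that these block
# forwards read via cross_attention_kwargs.get("scale", 1.0) is the LoRA
# scale. A hedged sketch of how it is typically supplied through the public
# pipeline API; the model id and LoRA checkpoint path below are illustrative only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("path/to/lora")  # hypothetical LoRA checkpoint
# The dict is forwarded down to the UNet blocks, whose forward() methods
# pass the scale on to their resnets and attention processors.
image = pipe(
    "an astronaut riding a horse",
    cross_attention_kwargs={"scale": 0.5},
).images[0]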
output_states = () lora_scale = cross_attention_kwargs.get("scale", 1.0) if cross_attention_kwargs is not None else 1.0 @@ -2065,9 +2089,9 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - attention_head_dim=1, - output_scale_factor=1.0, - upsample_type="conv", + attention_head_dim: int = 1, + output_scale_factor: float = 1.0, + upsample_type: str = "conv", ): super().__init__() resnets = [] @@ -2142,7 +2166,14 @@ def __init__( self.resolution_idx = resolution_idx - def forward(self, hidden_states, res_hidden_states_tuple, temb=None, upsample_size=None, scale: float = 1.0): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + upsample_size: Optional[int] = None, + scale: float = 1.0, + ) -> torch.FloatTensor: for resnet, attn in zip(self.resnets, self.attentions): # pop res hidden states res_hidden_states = res_hidden_states_tuple[-1] @@ -2170,7 +2201,7 @@ def __init__( out_channels: int, prev_output_channel: int, temb_channels: int, - resolution_idx: int = None, + resolution_idx: Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, transformer_layers_per_block: Union[int, Tuple[int]] = 1, @@ -2179,15 +2210,15 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - cross_attention_dim=1280, - output_scale_factor=1.0, - add_upsample=True, - dual_cross_attention=False, - use_linear_projection=False, - only_cross_attention=False, - upcast_attention=False, - attention_type="default", + num_attention_heads: int = 1, + cross_attention_dim: int = 1280, + output_scale_factor: float = 1.0, + add_upsample: bool = True, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, + only_cross_attention: bool = False, + upcast_attention: bool = False, + attention_type: str = "default", ): super().__init__() resnets = [] @@ -2264,7 +2295,7 @@ def forward( upsample_size: Optional[int] = None, attention_mask: Optional[torch.FloatTensor] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - ): + ) -> torch.FloatTensor: lora_scale = cross_attention_kwargs.get("scale", 1.0) if cross_attention_kwargs is not None else 1.0 is_freeu_enabled = ( getattr(self, "s1", None) @@ -2343,7 +2374,7 @@ def __init__( prev_output_channel: int, out_channels: int, temb_channels: int, - resolution_idx: int = None, + resolution_idx: Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, resnet_eps: float = 1e-6, @@ -2351,8 +2382,8 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_upsample=True, + output_scale_factor: float = 1.0, + add_upsample: bool = True, ): super().__init__() resnets = [] @@ -2386,7 +2417,14 @@ def __init__( self.gradient_checkpointing = False self.resolution_idx = resolution_idx - def forward(self, hidden_states, res_hidden_states_tuple, temb=None, upsample_size=None, scale: float = 1.0): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + upsample_size: Optional[int] = None, + scale: float = 1.0, + ) -> torch.FloatTensor: is_freeu_enabled = ( getattr(self, "s1", None) and getattr(self, "s2", None) @@ -2444,7 +2482,7 @@ def __init__( self, in_channels: int, out_channels: int, - resolution_idx: int = None, + resolution_idx: 
Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, resnet_eps: float = 1e-6, @@ -2452,9 +2490,9 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_upsample=True, - temb_channels=None, + output_scale_factor: float = 1.0, + add_upsample: bool = True, + temb_channels: Optional[int] = None, ): super().__init__() resnets = [] @@ -2486,7 +2524,9 @@ def __init__( self.resolution_idx = resolution_idx - def forward(self, hidden_states, temb=None, scale: float = 1.0): + def forward( + self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None, scale: float = 1.0 + ) -> torch.FloatTensor: for resnet in self.resnets: hidden_states = resnet(hidden_states, temb=temb, scale=scale) @@ -2502,7 +2542,7 @@ def __init__( self, in_channels: int, out_channels: int, - resolution_idx: int = None, + resolution_idx: Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, resnet_eps: float = 1e-6, @@ -2510,10 +2550,10 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - attention_head_dim=1, - output_scale_factor=1.0, - add_upsample=True, - temb_channels=None, + attention_head_dim: int = 1, + output_scale_factor: float = 1.0, + add_upsample: bool = True, + temb_channels: Optional[int] = None, ): super().__init__() resnets = [] @@ -2568,7 +2608,9 @@ def __init__( self.resolution_idx = resolution_idx - def forward(self, hidden_states, temb=None, scale: float = 1.0): + def forward( + self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None, scale: float = 1.0 + ) -> torch.FloatTensor: for resnet, attn in zip(self.resnets, self.attentions): hidden_states = resnet(hidden_states, temb=temb, scale=scale) cross_attention_kwargs = {"scale": scale} @@ -2588,16 +2630,16 @@ def __init__( prev_output_channel: int, out_channels: int, temb_channels: int, - resolution_idx: int = None, + resolution_idx: Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, resnet_eps: float = 1e-6, resnet_time_scale_shift: str = "default", resnet_act_fn: str = "swish", resnet_pre_norm: bool = True, - attention_head_dim=1, - output_scale_factor=np.sqrt(2.0), - add_upsample=True, + attention_head_dim: int = 1, + output_scale_factor: float = np.sqrt(2.0), + add_upsample: bool = True, ): super().__init__() self.attentions = nn.ModuleList([]) @@ -2675,7 +2717,14 @@ def __init__( self.resolution_idx = resolution_idx - def forward(self, hidden_states, res_hidden_states_tuple, temb=None, skip_sample=None, scale: float = 1.0): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + skip_sample=None, + scale: float = 1.0, + ) -> Tuple[torch.FloatTensor, torch.FloatTensor]: for resnet in self.resnets: # pop res hidden states res_hidden_states = res_hidden_states_tuple[-1] @@ -2711,16 +2760,16 @@ def __init__( prev_output_channel: int, out_channels: int, temb_channels: int, - resolution_idx: int = None, + resolution_idx: Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, resnet_eps: float = 1e-6, resnet_time_scale_shift: str = "default", resnet_act_fn: str = "swish", resnet_pre_norm: bool = True, - output_scale_factor=np.sqrt(2.0), - add_upsample=True, - upsample_padding=1, + output_scale_factor: float = np.sqrt(2.0), + add_upsample: bool = True, + upsample_padding: int = 1, ): super().__init__() self.resnets = nn.ModuleList([]) @@ 
-2776,7 +2825,14 @@ def __init__( self.resolution_idx = resolution_idx - def forward(self, hidden_states, res_hidden_states_tuple, temb=None, skip_sample=None, scale: float = 1.0): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + skip_sample=None, + scale: float = 1.0, + ) -> Tuple[torch.FloatTensor, torch.FloatTensor]: for resnet in self.resnets: # pop res hidden states res_hidden_states = res_hidden_states_tuple[-1] @@ -2809,7 +2865,7 @@ def __init__( prev_output_channel: int, out_channels: int, temb_channels: int, - resolution_idx: int = None, + resolution_idx: Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, resnet_eps: float = 1e-6, @@ -2817,9 +2873,9 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_upsample=True, - skip_time_act=False, + output_scale_factor: float = 1.0, + add_upsample: bool = True, + skip_time_act: bool = False, ): super().__init__() resnets = [] @@ -2871,7 +2927,14 @@ def __init__( self.gradient_checkpointing = False self.resolution_idx = resolution_idx - def forward(self, hidden_states, res_hidden_states_tuple, temb=None, upsample_size=None, scale: float = 1.0): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + upsample_size: Optional[int] = None, + scale: float = 1.0, + ) -> torch.FloatTensor: for resnet in self.resnets: # pop res hidden states res_hidden_states = res_hidden_states_tuple[-1] @@ -2911,7 +2974,7 @@ def __init__( out_channels: int, prev_output_channel: int, temb_channels: int, - resolution_idx: int = None, + resolution_idx: Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, resnet_eps: float = 1e-6, @@ -2919,13 +2982,13 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - attention_head_dim=1, - cross_attention_dim=1280, - output_scale_factor=1.0, - add_upsample=True, - skip_time_act=False, - only_cross_attention=False, - cross_attention_norm=None, + attention_head_dim: int = 1, + cross_attention_dim: int = 1280, + output_scale_factor: float = 1.0, + add_upsample: bool = True, + skip_time_act: bool = False, + only_cross_attention: bool = False, + cross_attention_norm: Optional[str] = None, ): super().__init__() resnets = [] @@ -3013,7 +3076,7 @@ def forward( attention_mask: Optional[torch.FloatTensor] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - ): + ) -> torch.FloatTensor: cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {} lora_scale = cross_attention_kwargs.get("scale", 1.0) @@ -3082,7 +3145,7 @@ def __init__( resnet_eps: float = 1e-5, resnet_act_fn: str = "gelu", resnet_group_size: Optional[int] = 32, - add_upsample=True, + add_upsample: bool = True, ): super().__init__() resnets = [] @@ -3120,7 +3183,14 @@ def __init__( self.gradient_checkpointing = False self.resolution_idx = resolution_idx - def forward(self, hidden_states, res_hidden_states_tuple, temb=None, upsample_size=None, scale: float = 1.0): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + upsample_size: Optional[int] = None, + scale: float = 1.0, + ) -> 
torch.FloatTensor: res_hidden_states_tuple = res_hidden_states_tuple[-1] if res_hidden_states_tuple is not None: hidden_states = torch.cat([hidden_states, res_hidden_states_tuple], dim=1) @@ -3164,7 +3234,7 @@ def __init__( resnet_eps: float = 1e-5, resnet_act_fn: str = "gelu", resnet_group_size: int = 32, - attention_head_dim=1, # attention dim_head + attention_head_dim: int = 1, # attention dim_head cross_attention_dim: int = 768, add_upsample: bool = True, upcast_attention: bool = False, @@ -3248,7 +3318,7 @@ def forward( upsample_size: Optional[int] = None, attention_mask: Optional[torch.FloatTensor] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - ): + ) -> torch.FloatTensor: res_hidden_states_tuple = res_hidden_states_tuple[-1] if res_hidden_states_tuple is not None: hidden_states = torch.cat([hidden_states, res_hidden_states_tuple], dim=1) @@ -3310,11 +3380,18 @@ class KAttentionBlock(nn.Module): attention_head_dim (`int`): The number of channels in each head. dropout (`float`, *optional*, defaults to 0.0): The dropout probability to use. cross_attention_dim (`int`, *optional*): The size of the encoder_hidden_states vector for cross attention. - activation_fn (`str`, *optional*, defaults to `"geglu"`): Activation function to be used in feed-forward. - num_embeds_ada_norm (: - obj: `int`, *optional*): The number of diffusion steps used during training. See `Transformer2DModel`. - attention_bias (: - obj: `bool`, *optional*, defaults to `False`): Configure if the attentions should contain a bias parameter. + attention_bias (`bool`, *optional*, defaults to `False`): + Configure if the attention layers should contain a bias parameter. + upcast_attention (`bool`, *optional*, defaults to `False`): + Set to `True` to upcast the attention computation to `float32`. + temb_channels (`int`, *optional*, defaults to 768): + The number of channels in the token embedding. + add_self_attention (`bool`, *optional*, defaults to `False`): + Set to `True` to add self-attention to the block. + cross_attention_norm (`str`, *optional*, defaults to `None`): + The type of normalization to use for the cross attention. Can be `None`, `layer_norm`, or `group_norm`. + group_size (`int`, *optional*, defaults to 32): + The number of groups to separate the channels into for group normalization. """ def __init__( @@ -3360,10 +3437,10 @@ def __init__( cross_attention_norm=cross_attention_norm, ) - def _to_3d(self, hidden_states, height, weight): + def _to_3d(self, hidden_states: torch.FloatTensor, height: int, weight: int) -> torch.FloatTensor: return hidden_states.permute(0, 2, 3, 1).reshape(hidden_states.shape[0], height * weight, -1) - def _to_4d(self, hidden_states, height, weight): + def _to_4d(self, hidden_states: torch.FloatTensor, height: int, weight: int) -> torch.FloatTensor: return hidden_states.permute(0, 2, 1).reshape(hidden_states.shape[0], -1, height, weight) def forward( @@ -3376,7 +3453,7 @@ def forward( attention_mask: Optional[torch.FloatTensor] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - ): + ) -> torch.FloatTensor: cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {} # 1. 
Self-Attention diff --git a/src/diffusers/models/vae.py b/src/diffusers/models/vae.py index 36983eefc01f..da08bc360942 100644 --- a/src/diffusers/models/vae.py +++ b/src/diffusers/models/vae.py @@ -12,7 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. from dataclasses import dataclass -from typing import Optional +from typing import Optional, Tuple import numpy as np import torch @@ -27,7 +27,7 @@ @dataclass class DecoderOutput(BaseOutput): - """ + r""" Output of decoding method. Args: @@ -39,16 +39,39 @@ class DecoderOutput(BaseOutput): class Encoder(nn.Module): + r""" + The `Encoder` layer of a variational autoencoder that encodes its input into a latent representation. + + Args: + in_channels (`int`, *optional*, defaults to 3): + The number of input channels. + out_channels (`int`, *optional*, defaults to 3): + The number of output channels. + down_block_types (`Tuple[str, ...]`, *optional*, defaults to `("DownEncoderBlock2D",)`): + The types of down blocks to use. See `~diffusers.models.unet_2d_blocks.get_down_block` for available + options. + block_out_channels (`Tuple[int, ...]`, *optional*, defaults to `(64,)`): + The number of output channels for each block. + layers_per_block (`int`, *optional*, defaults to 2): + The number of layers per block. + norm_num_groups (`int`, *optional*, defaults to 32): + The number of groups for normalization. + act_fn (`str`, *optional*, defaults to `"silu"`): + The activation function to use. See `~diffusers.models.activations.get_activation` for available options. + double_z (`bool`, *optional*, defaults to `True`): + Whether to double the number of output channels for the last block. + """ + def __init__( self, - in_channels=3, - out_channels=3, - down_block_types=("DownEncoderBlock2D",), - block_out_channels=(64,), - layers_per_block=2, - norm_num_groups=32, - act_fn="silu", - double_z=True, + in_channels: int = 3, + out_channels: int = 3, + down_block_types: Tuple[str, ...] = ("DownEncoderBlock2D",), + block_out_channels: Tuple[int, ...] = (64,), + layers_per_block: int = 2, + norm_num_groups: int = 32, + act_fn: str = "silu", + double_z: bool = True, ): super().__init__() self.layers_per_block = layers_per_block @@ -107,7 +130,8 @@ def __init__( self.gradient_checkpointing = False - def forward(self, x): + def forward(self, x: torch.FloatTensor) -> torch.FloatTensor: + r"""The forward method of the `Encoder` class.""" sample = x sample = self.conv_in(sample) @@ -152,16 +176,38 @@ def custom_forward(*inputs): class Decoder(nn.Module): + r""" + The `Decoder` layer of a variational autoencoder that decodes its latent representation into an output sample. + + Args: + in_channels (`int`, *optional*, defaults to 3): + The number of input channels. + out_channels (`int`, *optional*, defaults to 3): + The number of output channels. + up_block_types (`Tuple[str, ...]`, *optional*, defaults to `("UpDecoderBlock2D",)`): + The types of up blocks to use. See `~diffusers.models.unet_2d_blocks.get_up_block` for available options. + block_out_channels (`Tuple[int, ...]`, *optional*, defaults to `(64,)`): + The number of output channels for each block. + layers_per_block (`int`, *optional*, defaults to 2): + The number of layers per block. + norm_num_groups (`int`, *optional*, defaults to 32): + The number of groups for normalization. + act_fn (`str`, *optional*, defaults to `"silu"`): + The activation function to use. See `~diffusers.models.activations.get_activation` for available options. 
+ norm_type (`str`, *optional*, defaults to `"group"`): + The normalization type to use. Can be either `"group"` or `"spatial"`. + """ + def __init__( self, - in_channels=3, - out_channels=3, - up_block_types=("UpDecoderBlock2D",), - block_out_channels=(64,), - layers_per_block=2, - norm_num_groups=32, - act_fn="silu", - norm_type="group", # group, spatial + in_channels: int = 3, + out_channels: int = 3, + up_block_types: Tuple[str, ...] = ("UpDecoderBlock2D",), + block_out_channels: Tuple[int, ...] = (64,), + layers_per_block: int = 2, + norm_num_groups: int = 32, + act_fn: str = "silu", + norm_type: str = "group", # group, spatial ): super().__init__() self.layers_per_block = layers_per_block @@ -227,7 +273,8 @@ def __init__( self.gradient_checkpointing = False - def forward(self, z, latent_embeds=None): + def forward(self, z: torch.FloatTensor, latent_embeds: Optional[torch.FloatTensor] = None) -> torch.FloatTensor: + r"""The forward method of the `Decoder` class.""" sample = z sample = self.conv_in(sample) @@ -283,6 +330,16 @@ def custom_forward(*inputs): class UpSample(nn.Module): + r""" + The `UpSample` layer of a variational autoencoder that upsamples its input. + + Args: + in_channels (`int`, *optional*, defaults to 3): + The number of input channels. + out_channels (`int`, *optional*, defaults to 3): + The number of output channels. + """ + def __init__( self, in_channels: int, @@ -294,6 +351,7 @@ def __init__( self.deconv = nn.ConvTranspose2d(in_channels, out_channels, kernel_size=4, stride=2, padding=1) def forward(self, x: torch.FloatTensor) -> torch.FloatTensor: + r"""The forward method of the `UpSample` class.""" x = torch.relu(x) x = self.deconv(x) return x @@ -342,6 +400,7 @@ def __init__( self.layers = nn.Sequential(*layers) def forward(self, x: torch.FloatTensor, mask=None) -> torch.FloatTensor: + r"""The forward method of the `MaskConditionEncoder` class.""" out = {} for l in range(len(self.layers)): layer = self.layers[l] @@ -352,19 +411,38 @@ def forward(self, x: torch.FloatTensor, mask=None) -> torch.FloatTensor: class MaskConditionDecoder(nn.Module): - """The `MaskConditionDecoder` should be used in combination with [`AsymmetricAutoencoderKL`] to enhance the model's - decoder with a conditioner on the mask and masked image.""" + r"""The `MaskConditionDecoder` should be used in combination with [`AsymmetricAutoencoderKL`] to enhance the model's + decoder with a conditioner on the mask and masked image. + + Args: + in_channels (`int`, *optional*, defaults to 3): + The number of input channels. + out_channels (`int`, *optional*, defaults to 3): + The number of output channels. + up_block_types (`Tuple[str, ...]`, *optional*, defaults to `("UpDecoderBlock2D",)`): + The types of up blocks to use. See `~diffusers.models.unet_2d_blocks.get_up_block` for available options. + block_out_channels (`Tuple[int, ...]`, *optional*, defaults to `(64,)`): + The number of output channels for each block. + layers_per_block (`int`, *optional*, defaults to 2): + The number of layers per block. + norm_num_groups (`int`, *optional*, defaults to 32): + The number of groups for normalization. + act_fn (`str`, *optional*, defaults to `"silu"`): + The activation function to use. See `~diffusers.models.activations.get_activation` for available options. + norm_type (`str`, *optional*, defaults to `"group"`): + The normalization type to use. Can be either `"group"` or `"spatial"`. 
+ """ def __init__( self, - in_channels=3, - out_channels=3, - up_block_types=("UpDecoderBlock2D",), - block_out_channels=(64,), - layers_per_block=2, - norm_num_groups=32, - act_fn="silu", - norm_type="group", # group, spatial + in_channels: int = 3, + out_channels: int = 3, + up_block_types: Tuple[str, ...] = ("UpDecoderBlock2D",), + block_out_channels: Tuple[int, ...] = (64,), + layers_per_block: int = 2, + norm_num_groups: int = 32, + act_fn: str = "silu", + norm_type: str = "group", # group, spatial ): super().__init__() self.layers_per_block = layers_per_block @@ -437,7 +515,14 @@ def __init__( self.gradient_checkpointing = False - def forward(self, z, image=None, mask=None, latent_embeds=None): + def forward( + self, + z: torch.FloatTensor, + image: Optional[torch.FloatTensor] = None, + mask: Optional[torch.FloatTensor] = None, + latent_embeds: Optional[torch.FloatTensor] = None, + ) -> torch.FloatTensor: + r"""The forward method of the `MaskConditionDecoder` class.""" sample = z sample = self.conv_in(sample) @@ -539,7 +624,14 @@ class VectorQuantizer(nn.Module): # backwards compatibility we use the buggy version by default, but you can # specify legacy=False to fix it. def __init__( - self, n_e, vq_embed_dim, beta, remap=None, unknown_index="random", sane_index_shape=False, legacy=True + self, + n_e: int, + vq_embed_dim: int, + beta: float, + remap=None, + unknown_index: str = "random", + sane_index_shape: bool = False, + legacy: bool = True, ): super().__init__() self.n_e = n_e @@ -553,6 +645,7 @@ def __init__( self.remap = remap if self.remap is not None: self.register_buffer("used", torch.tensor(np.load(self.remap))) + self.used: torch.Tensor self.re_embed = self.used.shape[0] self.unknown_index = unknown_index # "random" or "extra" or integer if self.unknown_index == "extra": @@ -567,7 +660,7 @@ def __init__( self.sane_index_shape = sane_index_shape - def remap_to_used(self, inds): + def remap_to_used(self, inds: torch.LongTensor) -> torch.LongTensor: ishape = inds.shape assert len(ishape) > 1 inds = inds.reshape(ishape[0], -1) @@ -581,7 +674,7 @@ def remap_to_used(self, inds): new[unknown] = self.unknown_index return new.reshape(ishape) - def unmap_to_all(self, inds): + def unmap_to_all(self, inds: torch.LongTensor) -> torch.LongTensor: ishape = inds.shape assert len(ishape) > 1 inds = inds.reshape(ishape[0], -1) @@ -591,7 +684,7 @@ def unmap_to_all(self, inds): back = torch.gather(used[None, :][inds.shape[0] * [0], :], 1, inds) return back.reshape(ishape) - def forward(self, z): + def forward(self, z: torch.FloatTensor) -> Tuple[torch.FloatTensor, torch.FloatTensor, Tuple]: # reshape z -> (batch, height, width, channel) and flatten z = z.permute(0, 2, 3, 1).contiguous() z_flattened = z.view(-1, self.vq_embed_dim) @@ -610,7 +703,7 @@ def forward(self, z): loss = torch.mean((z_q.detach() - z) ** 2) + self.beta * torch.mean((z_q - z.detach()) ** 2) # preserve gradients - z_q = z + (z_q - z).detach() + z_q: torch.FloatTensor = z + (z_q - z).detach() # reshape back to match original input shape z_q = z_q.permute(0, 3, 1, 2).contiguous() @@ -625,7 +718,7 @@ def forward(self, z): return z_q, loss, (perplexity, min_encodings, min_encoding_indices) - def get_codebook_entry(self, indices, shape): + def get_codebook_entry(self, indices: torch.LongTensor, shape: Tuple[int, ...]) -> torch.FloatTensor: # shape specifying (batch, height, width, channel) if self.remap is not None: indices = indices.reshape(shape[0], -1) # add batch axis @@ -633,7 +726,7 @@ def get_codebook_entry(self, 
indices, shape): indices = indices.reshape(-1) # flatten again # get quantized latent vectors - z_q = self.embedding(indices) + z_q: torch.FloatTensor = self.embedding(indices) if shape is not None: z_q = z_q.view(shape) @@ -644,7 +737,7 @@ def get_codebook_entry(self, indices, shape): class DiagonalGaussianDistribution(object): - def __init__(self, parameters, deterministic=False): + def __init__(self, parameters: torch.Tensor, deterministic: bool = False): self.parameters = parameters self.mean, self.logvar = torch.chunk(parameters, 2, dim=1) self.logvar = torch.clamp(self.logvar, -30.0, 20.0) @@ -664,7 +757,7 @@ def sample(self, generator: Optional[torch.Generator] = None) -> torch.FloatTens x = self.mean + self.std * sample return x - def kl(self, other=None): + def kl(self, other: "DiagonalGaussianDistribution" = None) -> torch.Tensor: if self.deterministic: return torch.Tensor([0.0]) else: @@ -680,23 +773,40 @@ def kl(self, other=None): dim=[1, 2, 3], ) - def nll(self, sample, dims=[1, 2, 3]): + def nll(self, sample: torch.Tensor, dims: Tuple[int, ...] = [1, 2, 3]) -> torch.Tensor: if self.deterministic: return torch.Tensor([0.0]) logtwopi = np.log(2.0 * np.pi) return 0.5 * torch.sum(logtwopi + self.logvar + torch.pow(sample - self.mean, 2) / self.var, dim=dims) - def mode(self): + def mode(self) -> torch.Tensor: return self.mean class EncoderTiny(nn.Module): + r""" + The `EncoderTiny` layer is a simpler version of the `Encoder` layer. + + Args: + in_channels (`int`): + The number of input channels. + out_channels (`int`): + The number of output channels. + num_blocks (`Tuple[int, ...]`): + Each value of the tuple represents a Conv2d layer followed by `value` number of `AutoencoderTinyBlock`'s to + use. + block_out_channels (`Tuple[int, ...]`): + The number of output channels for each block. + act_fn (`str`): + The activation function to use. See `~diffusers.models.activations.get_activation` for available options. + """ + def __init__( self, in_channels: int, out_channels: int, - num_blocks: int, - block_out_channels: int, + num_blocks: Tuple[int, ...], + block_out_channels: Tuple[int, ...], act_fn: str, ): super().__init__() @@ -718,7 +828,8 @@ def __init__( self.layers = nn.Sequential(*layers) self.gradient_checkpointing = False - def forward(self, x): + def forward(self, x: torch.FloatTensor) -> torch.FloatTensor: + r"""The forward method of the `EncoderTiny` class.""" if self.training and self.gradient_checkpointing: def create_custom_forward(module): @@ -740,12 +851,31 @@ def custom_forward(*inputs): class DecoderTiny(nn.Module): + r""" + The `DecoderTiny` layer is a simpler version of the `Decoder` layer. + + Args: + in_channels (`int`): + The number of input channels. + out_channels (`int`): + The number of output channels. + num_blocks (`Tuple[int, ...]`): + Each value of the tuple represents a Conv2d layer followed by `value` number of `AutoencoderTinyBlock`'s to + use. + block_out_channels (`Tuple[int, ...]`): + The number of output channels for each block. + upsampling_scaling_factor (`int`): + The scaling factor to use for upsampling. + act_fn (`str`): + The activation function to use. See `~diffusers.models.activations.get_activation` for available options. 
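For context on how the typed `DiagonalGaussianDistribution` above is used, here is a minimal sketch, assuming the import path of this file at the time of the diff; all shapes are illustrative.

import torch

from diffusers.models.vae import DiagonalGaussianDistribution

# The encoder output stacks mean and logvar along the channel axis, so 8 channels
# here parameterize a 4-channel latent.
parameters = torch.randn(1, 8, 32, 32)
dist = DiagonalGaussianDistribution(parameters)

latents = dist.sample(generator=torch.Generator().manual_seed(0))
print(latents.shape)      # torch.Size([1, 4, 32, 32])
print(dist.mode().shape)  # the mean, i.e. the deterministic "sample"
print(dist.kl().shape)    # per-example KL against a standard normal: torch.Size([1])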
+ """ + def __init__( self, in_channels: int, out_channels: int, - num_blocks: int, - block_out_channels: int, + num_blocks: Tuple[int, ...], + block_out_channels: Tuple[int, ...], upsampling_scaling_factor: int, act_fn: str, ): @@ -772,7 +902,8 @@ def __init__( self.layers = nn.Sequential(*layers) self.gradient_checkpointing = False - def forward(self, x): + def forward(self, x: torch.FloatTensor) -> torch.FloatTensor: + r"""The forward method of the `DecoderTiny` class.""" # Clamp. x = torch.tanh(x / 3) * 3 diff --git a/src/diffusers/models/vq_model.py b/src/diffusers/models/vq_model.py index 0c15300af213..08ad122c3891 100644 --- a/src/diffusers/models/vq_model.py +++ b/src/diffusers/models/vq_model.py @@ -53,10 +53,12 @@ class VQModel(ModelMixin, ConfigMixin): Tuple of upsample block types. block_out_channels (`Tuple[int]`, *optional*, defaults to `(64,)`): Tuple of block output channels. + layers_per_block (`int`, *optional*, defaults to `1`): Number of layers per block. act_fn (`str`, *optional*, defaults to `"silu"`): The activation function to use. latent_channels (`int`, *optional*, defaults to `3`): Number of channels in the latent space. sample_size (`int`, *optional*, defaults to `32`): Sample input size. num_vq_embeddings (`int`, *optional*, defaults to `256`): Number of codebook vectors in the VQ-VAE. + norm_num_groups (`int`, *optional*, defaults to `32`): Number of groups for normalization layers. vq_embed_dim (`int`, *optional*): Hidden dim of codebook vectors in the VQ-VAE. scaling_factor (`float`, *optional*, defaults to `0.18215`): The component-wise standard deviation of the trained latent space computed using the first batch of the @@ -65,6 +67,8 @@ class VQModel(ModelMixin, ConfigMixin): diffusion model. When decoding, the latents are scaled back to the original scale with the formula: `z = 1 / scaling_factor * z`. For more details, refer to sections 4.3.2 and D.1 of the [High-Resolution Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) paper. + norm_type (`str`, *optional*, defaults to `"group"`): + Type of normalization layer to use. Can be one of `"group"` or `"spatial"`. """ @register_to_config @@ -72,9 +76,9 @@ def __init__( self, in_channels: int = 3, out_channels: int = 3, - down_block_types: Tuple[str] = ("DownEncoderBlock2D",), - up_block_types: Tuple[str] = ("UpDecoderBlock2D",), - block_out_channels: Tuple[int] = (64,), + down_block_types: Tuple[str, ...] = ("DownEncoderBlock2D",), + up_block_types: Tuple[str, ...] = ("UpDecoderBlock2D",), + block_out_channels: Tuple[int, ...] 
= (64,), layers_per_block: int = 1, act_fn: str = "silu", latent_channels: int = 3, diff --git a/src/diffusers/pipelines/__init__.py b/src/diffusers/pipelines/__init__.py index 19fe2f72d447..df7a89fc1b81 100644 --- a/src/diffusers/pipelines/__init__.py +++ b/src/diffusers/pipelines/__init__.py @@ -109,6 +109,7 @@ "KandinskyV22PriorEmb2EmbPipeline", "KandinskyV22PriorPipeline", ] + _import_structure["latent_consistency_models"] = ["LatentConsistencyModelPipeline"] _import_structure["latent_diffusion"].extend(["LDMTextToImagePipeline"]) _import_structure["musicldm"] = ["MusicLDMPipeline"] _import_structure["paint_by_example"] = ["PaintByExamplePipeline"] @@ -331,6 +332,7 @@ KandinskyV22PriorEmb2EmbPipeline, KandinskyV22PriorPipeline, ) + from .latent_consistency_models import LatentConsistencyModelPipeline from .latent_diffusion import LDMTextToImagePipeline from .musicldm import MusicLDMPipeline from .paint_by_example import PaintByExamplePipeline diff --git a/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint_sd_xl.py b/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint_sd_xl.py index 6c5d9a3993d4..46c9f25b6eb6 100644 --- a/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint_sd_xl.py +++ b/src/diffusers/pipelines/controlnet/pipeline_controlnet_inpaint_sd_xl.py @@ -896,8 +896,20 @@ def get_timesteps(self, num_inference_steps, strength, device, denoising_start=N - (denoising_start * self.scheduler.config.num_train_timesteps) ) ) - timesteps = list(filter(lambda ts: ts < discrete_timestep_cutoff, timesteps)) - return torch.tensor(timesteps), len(timesteps) + + num_inference_steps = (timesteps < discrete_timestep_cutoff).sum().item() + if self.scheduler.order == 2 and num_inference_steps % 2 == 0: + # if the scheduler is a 2nd order scheduler we might have to do +1 + # because `num_inference_steps` might be even given that every timestep + # (except the highest one) is duplicated. If `num_inference_steps` is even it would + # mean that we cut the timesteps in the middle of the denoising step + # (between 1st and 2nd devirative) which leads to incorrect results. 
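The replacement above swaps the Python-level `filter` for tensor operations; a toy walk-through of the counting and tail slicing, assuming a hypothetical second-order schedule with duplicated timesteps:

import torch

# Hypothetical duplicated schedule from a 2nd-order scheduler (values are illustrative).
timesteps = torch.tensor([800, 600, 600, 400, 400, 200, 200])
discrete_timestep_cutoff = 500

# Count the surviving timesteps instead of filtering into a Python list.
num_inference_steps = (timesteps < discrete_timestep_cutoff).sum().item()  # 4

# An even count would stop between the two halves of a duplicated step, so one extra
# timestep is kept; this mirrors the `scheduler.order == 2` branch above.
if num_inference_steps % 2 == 0:
    num_inference_steps += 1

# Timesteps are descending, so the kept steps come from the tail.
print(timesteps[-num_inference_steps:])  # tensor([600, 400, 400, 200, 200])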
By adding 1 + # we ensure that the denoising process always ends after the 2nd derivate step of the scheduler + num_inference_steps = num_inference_steps + 1 + + # because t_n+1 >= t_n, we slice the timesteps starting from the end + timesteps = timesteps[-num_inference_steps:] + return timesteps, num_inference_steps return timesteps, num_inference_steps - t_start diff --git a/src/diffusers/pipelines/latent_consistency_models/__init__.py b/src/diffusers/pipelines/latent_consistency_models/__init__.py new file mode 100644 index 000000000000..03d2f516adaf --- /dev/null +++ b/src/diffusers/pipelines/latent_consistency_models/__init__.py @@ -0,0 +1,22 @@ +from typing import TYPE_CHECKING + +from ...utils import ( + _LazyModule, +) + + +_import_structure = {"pipeline_latent_consistency_models": ["LatentConsistencyModelPipeline"]} + + +if TYPE_CHECKING: + from .pipeline_latent_consistency_models import LatentConsistencyModelPipeline + +else: + import sys + + sys.modules[__name__] = _LazyModule( + __name__, + globals()["__file__"], + _import_structure, + module_spec=__spec__, + ) diff --git a/src/diffusers/pipelines/latent_consistency_models/pipeline_latent_consistency_models.py b/src/diffusers/pipelines/latent_consistency_models/pipeline_latent_consistency_models.py new file mode 100644 index 000000000000..04dcef4152d4 --- /dev/null +++ b/src/diffusers/pipelines/latent_consistency_models/pipeline_latent_consistency_models.py @@ -0,0 +1,673 @@ +# Copyright 2023 Stanford University Team and The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# DISCLAIMER: This code is strongly influenced by https://github.com/pesser/pytorch_diffusion +# and https://github.com/hojonathanho/diffusion + +import inspect +from typing import Any, Callable, Dict, List, Optional, Union + +import torch +from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer + +from ...image_processor import VaeImageProcessor +from ...loaders import FromSingleFileMixin, LoraLoaderMixin, TextualInversionLoaderMixin +from ...models import AutoencoderKL, UNet2DConditionModel +from ...models.lora import adjust_lora_scale_text_encoder +from ...schedulers import LCMScheduler +from ...utils import ( + USE_PEFT_BACKEND, + logging, + scale_lora_layers, + unscale_lora_layers, +) +from ...utils.torch_utils import randn_tensor +from ..pipeline_utils import DiffusionPipeline +from ..stable_diffusion import StableDiffusionPipelineOutput, StableDiffusionSafetyChecker + + +logger = logging.get_logger(__name__) # pylint: disable=invalid-name + + +class LatentConsistencyModelPipeline( + DiffusionPipeline, TextualInversionLoaderMixin, LoraLoaderMixin, FromSingleFileMixin +): + r""" + Pipeline for text-to-image generation using a latent consistency model. + + This model inherits from [`DiffusionPipeline`]. Check the superclass documentation for the generic methods + implemented for all pipelines (downloading, saving, running on a particular device, etc.). 
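The `_LazyModule` registration above, together with the entry added to `diffusers/pipelines/__init__.py`, makes the class importable from either level while deferring the heavy module import to first attribute access; a quick sketch:

# Both import paths resolve to the same class object; the underlying module is only
# imported when the attribute is first accessed.
from diffusers.pipelines import LatentConsistencyModelPipeline
from diffusers.pipelines.latent_consistency_models import LatentConsistencyModelPipeline as LCMPipeline

assert LatentConsistencyModelPipeline is LCMPipeline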
+ + The pipeline also inherits the following loading methods: + - [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] for loading textual inversion embeddings + - [`~loaders.LoraLoaderMixin.load_lora_weights`] for loading LoRA weights + - [`~loaders.LoraLoaderMixin.save_lora_weights`] for saving LoRA weights + - [`~loaders.FromSingleFileMixin.from_single_file`] for loading `.ckpt` files + + Args: + vae ([`AutoencoderKL`]): + Variational Auto-Encoder (VAE) model to encode and decode images to and from latent representations. + text_encoder ([`~transformers.CLIPTextModel`]): + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). + tokenizer ([`~transformers.CLIPTokenizer`]): + A `CLIPTokenizer` to tokenize text. + unet ([`UNet2DConditionModel`]): + A `UNet2DConditionModel` to denoise the encoded image latents. + scheduler ([`SchedulerMixin`]): + A scheduler to be used in combination with `unet` to denoise the encoded image latents. Currently only + supports [`LCMScheduler`]. + safety_checker ([`StableDiffusionSafetyChecker`]): + Classification module that estimates whether generated images could be considered offensive or harmful. + Please refer to the [model card](https://huggingface.co/runwayml/stable-diffusion-v1-5) for more details + about a model's potential harms. + feature_extractor ([`~transformers.CLIPImageProcessor`]): + A `CLIPImageProcessor` to extract features from generated images; used as inputs to the `safety_checker`. + requires_safety_checker (`bool`, *optional*, defaults to `True`): + Whether the pipeline requires a safety checker component. + """ + model_cpu_offload_seq = "text_encoder->unet->vae" + _optional_components = ["safety_checker", "feature_extractor"] + _exclude_from_cpu_offload = ["safety_checker"] + + def __init__( + self, + vae: AutoencoderKL, + text_encoder: CLIPTextModel, + tokenizer: CLIPTokenizer, + unet: UNet2DConditionModel, + scheduler: LCMScheduler, + safety_checker: StableDiffusionSafetyChecker, + feature_extractor: CLIPImageProcessor, + requires_safety_checker: bool = True, + ): + super().__init__() + + if safety_checker is None and requires_safety_checker: + logger.warning( + f"You have disabled the safety checker for {self.__class__} by passing `safety_checker=None`. Ensure" + " that you abide to the conditions of the Stable Diffusion license and do not expose unfiltered" + " results in services or applications open to the public. Both the diffusers team and Hugging Face" + " strongly recommend to keep the safety filter enabled in all public facing circumstances, disabling" + " it only for use-cases that involve analyzing network behavior or auditing its results. For more" + " information, please have a look at https://github.com/huggingface/diffusers/pull/254 ." + ) + + if safety_checker is not None and feature_extractor is None: + raise ValueError( + "Make sure to define a feature extractor when loading {self.__class__} if you want to use the safety" + " checker. If you do not want to use the safety checker, you can pass `'safety_checker=None'` instead." 
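The warning and error above describe the opt-out path for the safety checker; a hedged sketch, where the checkpoint name is illustrative and `requires_safety_checker=False` keeps the config consistent with the missing component:

import torch
from diffusers import DiffusionPipeline

# Passing safety_checker=None alone triggers the warning above; also flipping
# requires_safety_checker avoids it for use cases where that is acceptable.
pipe = DiffusionPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7",
    safety_checker=None,
    requires_safety_checker=False,
    torch_dtype=torch.float32,
)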
+ ) + + self.register_modules( + vae=vae, + text_encoder=text_encoder, + tokenizer=tokenizer, + unet=unet, + scheduler=scheduler, + safety_checker=safety_checker, + feature_extractor=feature_extractor, + ) + self.vae_scale_factor = 2 ** (len(self.vae.config.block_out_channels) - 1) + self.image_processor = VaeImageProcessor(vae_scale_factor=self.vae_scale_factor) + self.register_to_config(requires_safety_checker=requires_safety_checker) + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_slicing + def enable_vae_slicing(self): + r""" + Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to + compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. + """ + self.vae.enable_slicing() + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_slicing + def disable_vae_slicing(self): + r""" + Disable sliced VAE decoding. If `enable_vae_slicing` was previously enabled, this method will go back to + computing decoding in one step. + """ + self.vae.disable_slicing() + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_vae_tiling + def enable_vae_tiling(self): + r""" + Enable tiled VAE decoding. When this option is enabled, the VAE will split the input tensor into tiles to + compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow + processing larger images. + """ + self.vae.enable_tiling() + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_vae_tiling + def disable_vae_tiling(self): + r""" + Disable tiled VAE decoding. If `enable_vae_tiling` was previously enabled, this method will go back to + computing decoding in one step. + """ + self.vae.disable_tiling() + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.enable_freeu + def enable_freeu(self, s1: float, s2: float, b1: float, b2: float): + r"""Enables the FreeU mechanism as in https://arxiv.org/abs/2309.11497. + + The suffixes after the scaling factors represent the stages where they are being applied. + + Please refer to the [official repository](https://github.com/ChenyangSi/FreeU) for combinations of the values + that are known to work well for different pipelines such as Stable Diffusion v1, v2, and Stable Diffusion XL. + + Args: + s1 (`float`): + Scaling factor for stage 1 to attenuate the contributions of the skip features. This is done to + mitigate "oversmoothing effect" in the enhanced denoising process. + s2 (`float`): + Scaling factor for stage 2 to attenuate the contributions of the skip features. This is done to + mitigate "oversmoothing effect" in the enhanced denoising process. + b1 (`float`): Scaling factor for stage 1 to amplify the contributions of backbone features. + b2 (`float`): Scaling factor for stage 2 to amplify the contributions of backbone features. 
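Because `enable_freeu` only stores the four scaling factors on the UNet, toggling it is cheap; a sketch with values in the range the FreeU authors suggest for SD v1-style UNets (the exact numbers and the checkpoint are illustrative):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float32).to("cuda")

# b1/b2 amplify backbone features, s1/s2 attenuate skip features, per the docstring above.
pipe.enable_freeu(s1=0.9, s2=0.2, b1=1.2, b2=1.4)
image = pipe("a cozy cabin in a snowy forest", num_inference_steps=4, guidance_scale=8.0).images[0]
pipe.disable_freeu()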
+ """ + if not hasattr(self, "unet"): + raise ValueError("The pipeline must have `unet` for using FreeU.") + self.unet.enable_freeu(s1=s1, s2=s2, b1=b1, b2=b2) + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.disable_freeu + def disable_freeu(self): + """Disables the FreeU mechanism if enabled.""" + self.unet.disable_freeu() + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.encode_prompt + def encode_prompt( + self, + prompt, + device, + num_images_per_prompt, + do_classifier_free_guidance, + negative_prompt=None, + prompt_embeds: Optional[torch.FloatTensor] = None, + negative_prompt_embeds: Optional[torch.FloatTensor] = None, + lora_scale: Optional[float] = None, + clip_skip: Optional[int] = None, + ): + r""" + Encodes the prompt into text encoder hidden states. + + Args: + prompt (`str` or `List[str]`, *optional*): + prompt to be encoded + device: (`torch.device`): + torch device + num_images_per_prompt (`int`): + number of images that should be generated per prompt + do_classifier_free_guidance (`bool`): + whether to use classifier free guidance or not + negative_prompt (`str` or `List[str]`, *optional*): + The prompt or prompts not to guide the image generation. If not defined, one has to pass + `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is + less than `1`). + prompt_embeds (`torch.FloatTensor`, *optional*): + Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not + provided, text embeddings will be generated from `prompt` input argument. + negative_prompt_embeds (`torch.FloatTensor`, *optional*): + Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt + weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input + argument. + lora_scale (`float`, *optional*): + A LoRA scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded. + clip_skip (`int`, *optional*): + Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that + the output of the pre-final layer will be used for computing the prompt embeddings. 
+ """ + # set lora scale so that monkey patched LoRA + # function of text encoder can correctly access it + if lora_scale is not None and isinstance(self, LoraLoaderMixin): + self._lora_scale = lora_scale + + # dynamically adjust the LoRA scale + if not USE_PEFT_BACKEND: + adjust_lora_scale_text_encoder(self.text_encoder, lora_scale) + else: + scale_lora_layers(self.text_encoder, lora_scale) + + if prompt is not None and isinstance(prompt, str): + batch_size = 1 + elif prompt is not None and isinstance(prompt, list): + batch_size = len(prompt) + else: + batch_size = prompt_embeds.shape[0] + + if prompt_embeds is None: + # textual inversion: procecss multi-vector tokens if necessary + if isinstance(self, TextualInversionLoaderMixin): + prompt = self.maybe_convert_prompt(prompt, self.tokenizer) + + text_inputs = self.tokenizer( + prompt, + padding="max_length", + max_length=self.tokenizer.model_max_length, + truncation=True, + return_tensors="pt", + ) + text_input_ids = text_inputs.input_ids + untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="pt").input_ids + + if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal( + text_input_ids, untruncated_ids + ): + removed_text = self.tokenizer.batch_decode( + untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1] + ) + logger.warning( + "The following part of your input was truncated because CLIP can only handle sequences up to" + f" {self.tokenizer.model_max_length} tokens: {removed_text}" + ) + + if hasattr(self.text_encoder.config, "use_attention_mask") and self.text_encoder.config.use_attention_mask: + attention_mask = text_inputs.attention_mask.to(device) + else: + attention_mask = None + + if clip_skip is None: + prompt_embeds = self.text_encoder(text_input_ids.to(device), attention_mask=attention_mask) + prompt_embeds = prompt_embeds[0] + else: + prompt_embeds = self.text_encoder( + text_input_ids.to(device), attention_mask=attention_mask, output_hidden_states=True + ) + # Access the `hidden_states` first, that contains a tuple of + # all the hidden states from the encoder layers. Then index into + # the tuple to access the hidden states from the desired layer. + prompt_embeds = prompt_embeds[-1][-(clip_skip + 1)] + # We also need to apply the final LayerNorm here to not mess with the + # representations. The `last_hidden_states` that we typically use for + # obtaining the final prompt representations passes through the LayerNorm + # layer. 
+ prompt_embeds = self.text_encoder.text_model.final_layer_norm(prompt_embeds) + + if self.text_encoder is not None: + prompt_embeds_dtype = self.text_encoder.dtype + elif self.unet is not None: + prompt_embeds_dtype = self.unet.dtype + else: + prompt_embeds_dtype = prompt_embeds.dtype + + prompt_embeds = prompt_embeds.to(dtype=prompt_embeds_dtype, device=device) + + bs_embed, seq_len, _ = prompt_embeds.shape + # duplicate text embeddings for each generation per prompt, using mps friendly method + prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1) + prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1) + + # get unconditional embeddings for classifier free guidance + if do_classifier_free_guidance and negative_prompt_embeds is None: + uncond_tokens: List[str] + if negative_prompt is None: + uncond_tokens = [""] * batch_size + elif prompt is not None and type(prompt) is not type(negative_prompt): + raise TypeError( + f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !=" + f" {type(prompt)}." + ) + elif isinstance(negative_prompt, str): + uncond_tokens = [negative_prompt] + elif batch_size != len(negative_prompt): + raise ValueError( + f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:" + f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches" + " the batch size of `prompt`." + ) + else: + uncond_tokens = negative_prompt + + # textual inversion: procecss multi-vector tokens if necessary + if isinstance(self, TextualInversionLoaderMixin): + uncond_tokens = self.maybe_convert_prompt(uncond_tokens, self.tokenizer) + + max_length = prompt_embeds.shape[1] + uncond_input = self.tokenizer( + uncond_tokens, + padding="max_length", + max_length=max_length, + truncation=True, + return_tensors="pt", + ) + + if hasattr(self.text_encoder.config, "use_attention_mask") and self.text_encoder.config.use_attention_mask: + attention_mask = uncond_input.attention_mask.to(device) + else: + attention_mask = None + + negative_prompt_embeds = self.text_encoder( + uncond_input.input_ids.to(device), + attention_mask=attention_mask, + ) + negative_prompt_embeds = negative_prompt_embeds[0] + + if do_classifier_free_guidance: + # duplicate unconditional embeddings for each generation per prompt, using mps friendly method + seq_len = negative_prompt_embeds.shape[1] + + negative_prompt_embeds = negative_prompt_embeds.to(dtype=prompt_embeds_dtype, device=device) + + negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_images_per_prompt, 1) + negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1) + + if isinstance(self, LoraLoaderMixin) and USE_PEFT_BACKEND: + # Retrieve the original scale by scaling back the LoRA layers + unscale_lora_layers(self.text_encoder, lora_scale) + + return prompt_embeds, negative_prompt_embeds + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.run_safety_checker + def run_safety_checker(self, image, device, dtype): + if self.safety_checker is None: + has_nsfw_concept = None + else: + if torch.is_tensor(image): + feature_extractor_input = self.image_processor.postprocess(image, output_type="pil") + else: + feature_extractor_input = self.image_processor.numpy_to_pil(image) + safety_checker_input = self.feature_extractor(feature_extractor_input, return_tensors="pt").to(device) + image, has_nsfw_concept = self.safety_checker( + 
images=image, clip_input=safety_checker_input.pixel_values.to(dtype) + ) + return image, has_nsfw_concept + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_latents + def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, device, generator, latents=None): + shape = (batch_size, num_channels_latents, height // self.vae_scale_factor, width // self.vae_scale_factor) + if isinstance(generator, list) and len(generator) != batch_size: + raise ValueError( + f"You have passed a list of generators of length {len(generator)}, but requested an effective batch" + f" size of {batch_size}. Make sure the batch size matches the length of the generators." + ) + + if latents is None: + latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype) + else: + latents = latents.to(device) + + # scale the initial noise by the standard deviation required by the scheduler + latents = latents * self.scheduler.init_noise_sigma + return latents + + def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32): + """ + See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298 + + Args: + timesteps (`torch.Tensor`): + generate embedding vectors at these timesteps + embedding_dim (`int`, *optional*, defaults to 512): + dimension of the embeddings to generate + dtype: + data type of the generated embeddings + + Returns: + `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)` + """ + assert len(w.shape) == 1 + w = w * 1000.0 + + half_dim = embedding_dim // 2 + emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1) + emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb) + emb = w.to(dtype)[:, None] * emb[None, :] + emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1) + if embedding_dim % 2 == 1: # zero pad + emb = torch.nn.functional.pad(emb, (0, 1)) + assert emb.shape == (w.shape[0], embedding_dim) + return emb + + # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs + def prepare_extra_step_kwargs(self, generator, eta): + # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature + # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers. 
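For intuition, `get_guidance_scale_embedding` above is a sinusoidal (Fourier) embedding of the shifted guidance weight; a standalone sketch, where the batch size and `embedding_dim` are illustrative (the pipeline itself uses `unet.config.time_cond_proj_dim`):

import torch


def guidance_scale_embedding(w: torch.Tensor, embedding_dim: int = 512, dtype=torch.float32) -> torch.Tensor:
    # Mirrors the method above: scale w, then build sin/cos features at log-spaced frequencies.
    w = w * 1000.0
    half_dim = embedding_dim // 2
    emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1)
    emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb)
    emb = w.to(dtype)[:, None] * emb[None, :]
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
    if embedding_dim % 2 == 1:
        emb = torch.nn.functional.pad(emb, (0, 1))
    return emb


w = torch.full((4,), 8.0 - 1.0)  # guidance_scale - 1, as in the call step further below
print(guidance_scale_embedding(w, embedding_dim=256).shape)  # torch.Size([4, 256])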
+ # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502 + # and should be between [0, 1] + + accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys()) + extra_step_kwargs = {} + if accepts_eta: + extra_step_kwargs["eta"] = eta + + # check if the scheduler accepts generator + accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys()) + if accepts_generator: + extra_step_kwargs["generator"] = generator + return extra_step_kwargs + + # Currently StableDiffusionPipeline.check_inputs with negative prompt stuff removed + def check_inputs( + self, + prompt: Union[str, List[str]], + height: int, + width: int, + callback_steps: int, + prompt_embeds: Optional[torch.FloatTensor] = None, + ): + if height % 8 != 0 or width % 8 != 0: + raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.") + + if (callback_steps is None) or ( + callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0) + ): + raise ValueError( + f"`callback_steps` has to be a positive integer but is {callback_steps} of type" + f" {type(callback_steps)}." + ) + + if prompt is not None and prompt_embeds is not None: + raise ValueError( + f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to" + " only forward one of the two." + ) + elif prompt is None and prompt_embeds is None: + raise ValueError( + "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined." + ) + elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)): + raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}") + + @torch.no_grad() + def __call__( + self, + prompt: Union[str, List[str]] = None, + height: Optional[int] = None, + width: Optional[int] = None, + num_inference_steps: int = 4, + original_inference_steps: int = None, + guidance_scale: float = 8.5, + num_images_per_prompt: Optional[int] = 1, + generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None, + latents: Optional[torch.FloatTensor] = None, + prompt_embeds: Optional[torch.FloatTensor] = None, + output_type: Optional[str] = "pil", + return_dict: bool = True, + callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None, + callback_steps: int = 1, + cross_attention_kwargs: Optional[Dict[str, Any]] = None, + clip_skip: Optional[int] = None, + ): + r""" + The call function to the pipeline for generation. + + Args: + prompt (`str` or `List[str]`, *optional*): + The prompt or prompts to guide image generation. If not defined, you need to pass `prompt_embeds`. + height (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): + The height in pixels of the generated image. + width (`int`, *optional*, defaults to `self.unet.config.sample_size * self.vae_scale_factor`): + The width in pixels of the generated image. + num_inference_steps (`int`, *optional*, defaults to 50): + The number of denoising steps. More denoising steps usually lead to a higher quality image at the + expense of slower inference. + original_inference_steps (`int`, *optional*): + The original number of inference steps use to generate a linearly-spaced timestep schedule, from which + we will draw `num_inference_steps` evenly spaced timesteps from as our final timestep schedule, + following the Skipping-Step method in the paper (see Section 4.3). 
If not set this will default to the + scheduler's `original_inference_steps` attribute. + guidance_scale (`float`, *optional*, defaults to 7.5): + A higher guidance scale value encourages the model to generate images closely linked to the text + `prompt` at the expense of lower image quality. Guidance scale is enabled when `guidance_scale > 1`. + Note that the original latent consistency models paper uses a different CFG formulation where the + guidance scales are decreased by 1 (so in the paper formulation CFG is enabled when `guidance_scale > + 0`). + num_images_per_prompt (`int`, *optional*, defaults to 1): + The number of images to generate per prompt. + generator (`torch.Generator` or `List[torch.Generator]`, *optional*): + A [`torch.Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) to make + generation deterministic. + latents (`torch.FloatTensor`, *optional*): + Pre-generated noisy latents sampled from a Gaussian distribution, to be used as inputs for image + generation. Can be used to tweak the same generation with different prompts. If not provided, a latents + tensor is generated by sampling using the supplied random `generator`. + prompt_embeds (`torch.FloatTensor`, *optional*): + Pre-generated text embeddings. Can be used to easily tweak text inputs (prompt weighting). If not + provided, text embeddings are generated from the `prompt` input argument. + output_type (`str`, *optional*, defaults to `"pil"`): + The output format of the generated image. Choose between `PIL.Image` or `np.array`. + return_dict (`bool`, *optional*, defaults to `True`): + Whether or not to return a [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] instead of a + plain tuple. + callback (`Callable`, *optional*): + A function that calls every `callback_steps` steps during inference. The function is called with the + following arguments: `callback(step: int, timestep: int, latents: torch.FloatTensor)`. + callback_steps (`int`, *optional*, defaults to 1): + The frequency at which the `callback` function is called. If not specified, the callback is called at + every step. + cross_attention_kwargs (`dict`, *optional*): + A kwargs dictionary that if specified is passed along to the [`AttentionProcessor`] as defined in + [`self.processor`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py). + clip_skip (`int`, *optional*): + Number of layers to be skipped from CLIP while computing the prompt embeddings. A value of 1 means that + the output of the pre-final layer will be used for computing the prompt embeddings. + + Returns: + [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] or `tuple`: + If `return_dict` is `True`, [`~pipelines.stable_diffusion.StableDiffusionPipelineOutput`] is returned, + otherwise a `tuple` is returned where the first element is a list with the generated images and the + second element is a list of `bool`s indicating whether the corresponding generated image contains + "not-safe-for-work" (nsfw) content. + """ + # 0. Default height and width to unet + height = height or self.unet.config.sample_size * self.vae_scale_factor + width = width or self.unet.config.sample_size * self.vae_scale_factor + + # 1. Check inputs. Raise error if not correct + self.check_inputs(prompt, height, width, callback_steps, prompt_embeds) + + # 2. 
Define call parameters + if prompt is not None and isinstance(prompt, str): + batch_size = 1 + elif prompt is not None and isinstance(prompt, list): + batch_size = len(prompt) + else: + batch_size = prompt_embeds.shape[0] + + device = self._execution_device + # do_classifier_free_guidance = guidance_scale > 1.0 + + # 3. Encode input prompt + lora_scale = cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None + + # NOTE: when a LCM is distilled from an LDM via latent consistency distillation (Algorithm 1) with guided + # distillation, the forward pass of the LCM learns to approximate sampling from the LDM using CFG with the + # unconditional prompt "" (the empty string). Due to this, LCMs currently do not support negative prompts. + prompt_embeds, _ = self.encode_prompt( + prompt, + device, + num_images_per_prompt, + False, + negative_prompt=None, + prompt_embeds=prompt_embeds, + negative_prompt_embeds=None, + lora_scale=lora_scale, + clip_skip=clip_skip, + ) + + # 4. Prepare timesteps + self.scheduler.set_timesteps(num_inference_steps, device, original_inference_steps=original_inference_steps) + timesteps = self.scheduler.timesteps + + # 5. Prepare latent variable + num_channels_latents = self.unet.config.in_channels + latents = self.prepare_latents( + batch_size * num_images_per_prompt, + num_channels_latents, + height, + width, + prompt_embeds.dtype, + device, + generator, + latents, + ) + bs = batch_size * num_images_per_prompt + + # 6. Get Guidance Scale Embedding + # NOTE: We use the Imagen CFG formulation that StableDiffusionPipeline uses rather than the original LCM paper + # CFG formulation, so we need to subtract 1 from the input guidance_scale. + # LCM CFG formulation: cfg_noise = noise_cond + cfg_scale * (noise_cond - noise_uncond), (cfg_scale > 0.0 using CFG) + w = torch.tensor(guidance_scale - 1).repeat(bs) + w_embedding = self.get_guidance_scale_embedding(w, embedding_dim=self.unet.config.time_cond_proj_dim).to( + device=device, dtype=latents.dtype + ) + + # 7. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline + extra_step_kwargs = self.prepare_extra_step_kwargs(generator, None) + + # 8. 
LCM MultiStep Sampling Loop: + num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order + with self.progress_bar(total=num_inference_steps) as progress_bar: + for i, t in enumerate(timesteps): + latents = latents.to(prompt_embeds.dtype) + + # model prediction (v-prediction, eps, x) + model_pred = self.unet( + latents, + t, + timestep_cond=w_embedding, + encoder_hidden_states=prompt_embeds, + cross_attention_kwargs=cross_attention_kwargs, + return_dict=False, + )[0] + + # compute the previous noisy sample x_t -> x_t-1 + latents, denoised = self.scheduler.step(model_pred, t, latents, **extra_step_kwargs, return_dict=False) + + # call the callback, if provided + if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0): + progress_bar.update() + if callback is not None and i % callback_steps == 0: + step_idx = i // getattr(self.scheduler, "order", 1) + callback(step_idx, t, latents) + + denoised = denoised.to(prompt_embeds.dtype) + if not output_type == "latent": + image = self.vae.decode(denoised / self.vae.config.scaling_factor, return_dict=False)[0] + image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype) + else: + image = denoised + has_nsfw_concept = None + + if has_nsfw_concept is None: + do_denormalize = [True] * image.shape[0] + else: + do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept] + + image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize) + + # Offload all models + self.maybe_free_model_hooks() + + if not return_dict: + return (image, has_nsfw_concept) + + return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept) diff --git a/src/diffusers/pipelines/pipeline_utils.py b/src/diffusers/pipelines/pipeline_utils.py index bad23a60293f..512cf8d56718 100644 --- a/src/diffusers/pipelines/pipeline_utils.py +++ b/src/diffusers/pipelines/pipeline_utils.py @@ -33,8 +33,6 @@ from requests.exceptions import HTTPError from tqdm.auto import tqdm -import diffusers - from .. import __version__ from ..configuration_utils import ConfigMixin from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT @@ -305,13 +303,23 @@ def maybe_raise_or_warn( ) -def get_class_obj_and_candidates(library_name, class_name, importable_classes, pipelines, is_pipeline_module): +def get_class_obj_and_candidates( + library_name, class_name, importable_classes, pipelines, is_pipeline_module, component_name=None, cache_dir=None +): """Simple helper method to retrieve class object of module as well as potential parent class objects""" + component_folder = os.path.join(cache_dir, component_name) + if is_pipeline_module: pipeline_module = getattr(pipelines, library_name) class_obj = getattr(pipeline_module, class_name) class_candidates = {c: class_obj for c in importable_classes.keys()} + elif os.path.isfile(os.path.join(component_folder, library_name + ".py")): + # load custom component + class_obj = get_class_from_dynamic_module( + component_folder, module_file=library_name + ".py", class_name=class_name + ) + class_candidates = {c: class_obj for c in importable_classes.keys()} else: # else we just import it from the library. 
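Putting the new pipeline, scheduler registration, and lazy imports together, an end-to-end sketch of text-to-image generation with a latent consistency model; the checkpoint name is illustrative and assumes a repo whose `model_index.json` points at `LatentConsistencyModelPipeline` and `LCMScheduler`:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float32)
pipe.to("cuda")

# LCMs need very few steps; guidance is folded into the model via the w-embedding,
# so there is no negative prompt and guidance_scale is shifted by -1 internally.
image = pipe(
    prompt="Self-portrait oil painting, a beautiful cyborg with golden hair, 8k",
    num_inference_steps=4,
    guidance_scale=8.0,
    original_inference_steps=50,  # coarse schedule the 4 steps are drawn from
).images[0]
image.save("lcm.png")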
library = importlib.import_module(library_name) @@ -323,7 +331,15 @@ def get_class_obj_and_candidates(library_name, class_name, importable_classes, p def _get_pipeline_class( - class_obj, config, load_connected_pipeline=False, custom_pipeline=None, cache_dir=None, revision=None + class_obj, + config, + load_connected_pipeline=False, + custom_pipeline=None, + repo_id=None, + hub_revision=None, + class_name=None, + cache_dir=None, + revision=None, ): if custom_pipeline is not None: if custom_pipeline.endswith(".py"): @@ -331,11 +347,19 @@ def _get_pipeline_class( # decompose into folder & file file_name = path.name custom_pipeline = path.parent.absolute() + elif repo_id is not None: + file_name = f"{custom_pipeline}.py" + custom_pipeline = repo_id else: file_name = CUSTOM_PIPELINE_FILE_NAME return get_class_from_dynamic_module( - custom_pipeline, module_file=file_name, cache_dir=cache_dir, revision=revision + custom_pipeline, + module_file=file_name, + class_name=class_name, + repo_id=repo_id, + cache_dir=cache_dir, + revision=revision if hub_revision is None else hub_revision, ) if class_obj != DiffusionPipeline: @@ -383,11 +407,18 @@ def load_sub_model( variant: str, low_cpu_mem_usage: bool, cached_folder: Union[str, os.PathLike], + revision: str = None, ): """Helper method to load the module `name` from `library_name` and `class_name`""" # retrieve class candidates class_obj, class_candidates = get_class_obj_and_candidates( - library_name, class_name, importable_classes, pipelines, is_pipeline_module + library_name, + class_name, + importable_classes, + pipelines, + is_pipeline_module, + component_name=name, + cache_dir=cached_folder, ) load_method_name = None @@ -414,14 +445,15 @@ def load_sub_model( load_method = getattr(class_obj, load_method_name) # add kwargs to loading method + diffusers_module = importlib.import_module(__name__.split(".")[0]) loading_kwargs = {} if issubclass(class_obj, torch.nn.Module): loading_kwargs["torch_dtype"] = torch_dtype - if issubclass(class_obj, diffusers.OnnxRuntimeModel): + if issubclass(class_obj, diffusers_module.OnnxRuntimeModel): loading_kwargs["provider"] = provider loading_kwargs["sess_options"] = sess_options - is_diffusers_model = issubclass(class_obj, diffusers.ModelMixin) + is_diffusers_model = issubclass(class_obj, diffusers_module.ModelMixin) if is_transformers_available(): transformers_version = version.parse(version.parse(transformers.__version__).base_version) @@ -501,7 +533,8 @@ class DiffusionPipeline(ConfigMixin, PushToHubMixin): def register_modules(self, **kwargs): # import it here to avoid circular import - from diffusers import pipelines + diffusers_module = importlib.import_module(__name__.split(".")[0]) + pipelines = getattr(diffusers_module, "pipelines") for name, module in kwargs.items(): # retrieve library @@ -1080,11 +1113,21 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P # 3. 
Load the pipeline class, if using custom module then load it from the hub # if we load from explicit class, let's use it + custom_class_name = None + if os.path.isfile(os.path.join(cached_folder, f"{custom_pipeline}.py")): + custom_pipeline = os.path.join(cached_folder, f"{custom_pipeline}.py") + elif isinstance(config_dict["_class_name"], (list, tuple)) and os.path.isfile( + os.path.join(cached_folder, f"{config_dict['_class_name'][0]}.py") + ): + custom_pipeline = os.path.join(cached_folder, f"{config_dict['_class_name'][0]}.py") + custom_class_name = config_dict["_class_name"][1] + pipeline_class = _get_pipeline_class( cls, config_dict, load_connected_pipeline=load_connected_pipeline, custom_pipeline=custom_pipeline, + class_name=custom_class_name, cache_dir=cache_dir, revision=custom_revision, ) @@ -1223,6 +1266,7 @@ def load_module(name, value): variant=variant, low_cpu_mem_usage=low_cpu_mem_usage, cached_folder=cached_folder, + revision=revision, ) logger.info( f"Loaded {name} as {class_name} from `{name}` subfolder of {pretrained_model_name_or_path}." @@ -1542,6 +1586,10 @@ def download(cls, pretrained_model_name, **kwargs) -> Union[str, os.PathLike]: will never be downloaded. By default `use_onnx` defaults to the `_is_onnx` class attribute which is `False` for non-ONNX pipelines and `True` for ONNX pipelines. ONNX weights include both files ending with `.onnx` and `.pb`. + trust_remote_code (`bool`, *optional*, defaults to `False`): + Whether or not to allow for custom pipelines and components defined on the Hub in their own files. This + option should only be set to `True` for repositories you trust and in which you have read the code, as + it will execute code present on the Hub on your local machine. Returns: `os.PathLike`: @@ -1569,6 +1617,7 @@ def download(cls, pretrained_model_name, **kwargs) -> Union[str, os.PathLike]: use_safetensors = kwargs.pop("use_safetensors", None) use_onnx = kwargs.pop("use_onnx", None) load_connected_pipeline = kwargs.pop("load_connected_pipeline", False) + trust_remote_code = kwargs.pop("trust_remote_code", False) allow_pickle = False if use_safetensors is None: @@ -1604,15 +1653,34 @@ def download(cls, pretrained_model_name, **kwargs) -> Union[str, os.PathLike]: ) config_dict = cls._dict_from_json_file(config_file) - ignore_filenames = config_dict.pop("_ignore_files", []) # retrieve all folder_names that contain relevant files - folder_names = [k for k, v in config_dict.items() if isinstance(v, list)] + folder_names = [k for k, v in config_dict.items() if isinstance(v, list) and k != "_class_name"] filenames = {sibling.rfilename for sibling in info.siblings} model_filenames, variant_filenames = variant_compatible_siblings(filenames, variant=variant) + diffusers_module = importlib.import_module(__name__.split(".")[0]) + pipelines = getattr(diffusers_module, "pipelines") + + # optionally create a custom component <> custom file mapping + custom_components = {} + for component in folder_names: + module_candidate = config_dict[component][0] + + if module_candidate is None: + continue + + candidate_file = os.path.join(component, module_candidate + ".py") + + if candidate_file in filenames: + custom_components[component] = module_candidate + elif module_candidate not in LOADABLE_CLASSES and not hasattr(pipelines, module_candidate): + raise ValueError( + f"{candidate_file} as defined in `model_index.json` does not exist in {pretrained_model_name} and is not a module in 'diffusers/pipelines'." 
+ ) + if len(variant_filenames) == 0 and variant is not None: deprecation_message = ( f"You are trying to load the model files of the `variant={variant}`, but no such modeling files are available." @@ -1636,12 +1704,21 @@ def download(cls, pretrained_model_name, **kwargs) -> Union[str, os.PathLike]: model_folder_names = {os.path.split(f)[0] for f in model_filenames if os.path.split(f)[0] in folder_names} + custom_class_name = None + if custom_pipeline is None and isinstance(config_dict["_class_name"], (list, tuple)): + custom_pipeline = config_dict["_class_name"][0] + custom_class_name = config_dict["_class_name"][1] + # all filenames compatible with variant will be added allow_patterns = list(model_filenames) # allow all patterns from non-model folders # this enables downloading schedulers, tokenizers, ... allow_patterns += [f"{k}/*" for k in folder_names if k not in model_folder_names] + # add custom component files + allow_patterns += [f"{k}/{f}.py" for k, f in custom_components.items()] + # add custom pipeline file + allow_patterns += [f"{custom_pipeline}.py"] if f"{custom_pipeline}.py" in filenames else [] # also allow downloading config.json files with the model allow_patterns += [os.path.join(k, "config.json") for k in model_folder_names] @@ -1652,12 +1729,32 @@ def download(cls, pretrained_model_name, **kwargs) -> Union[str, os.PathLike]: CUSTOM_PIPELINE_FILE_NAME, ] + load_pipe_from_hub = custom_pipeline is not None and f"{custom_pipeline}.py" in filenames + load_components_from_hub = len(custom_components) > 0 + + if load_pipe_from_hub and not trust_remote_code: + raise ValueError( + f"The repository for {pretrained_model_name} contains custom code in {custom_pipeline}.py which must be executed to correctly " + f"load the model. You can inspect the repository content at https://hf.co/{pretrained_model_name}/blob/main/{custom_pipeline}.py.\n" + f"Please pass the argument `trust_remote_code=True` to allow custom code to be run." + ) + + if load_components_from_hub and not trust_remote_code: + raise ValueError( + f"The repository for {pretrained_model_name} contains custom code in {'.py, '.join([os.path.join(k, v) for k,v in custom_components.items()])} which must be executed to correctly " + f"load the model. You can inspect the repository content at {', '.join([f'https://hf.co/{pretrained_model_name}/{k}/{v}.py' for k,v in custom_components.items()])}.\n" + f"Please pass the argument `trust_remote_code=True` to allow custom code to be run." 
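The checks above gate both flavours of remote code: a custom pipeline file at the repo root and custom component modules inside component folders, both declared in `model_index.json`. A sketch of the layout and the opt-in flag, where all names are hypothetical and `from_pretrained` is assumed to forward `trust_remote_code` to `download` as wired in this PR:

from diffusers import DiffusionPipeline

# Hypothetical Hub repo layout for a pipeline with remote code:
#   my_pipeline.py          -> defines MyPipeline
#   unet/my_unet.py         -> defines MyUNetModel
#   model_index.json        -> {"_class_name": ["my_pipeline", "MyPipeline"],
#                               "unet": ["my_unet", "MyUNetModel"], ...}
#
# Loading such a repo executes its code locally, hence the explicit opt-in.
pipe = DiffusionPipeline.from_pretrained("user/my-custom-pipeline", trust_remote_code=True)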
+ ) + # retrieve passed components that should not be downloaded pipeline_class = _get_pipeline_class( cls, config_dict, load_connected_pipeline=load_connected_pipeline, custom_pipeline=custom_pipeline, + repo_id=pretrained_model_name if load_pipe_from_hub else None, + hub_revision=revision, + class_name=custom_class_name, cache_dir=cache_dir, revision=custom_revision, ) @@ -1754,9 +1851,10 @@ def download(cls, pretrained_model_name, **kwargs) -> Union[str, os.PathLike]: # retrieve pipeline class from local file cls_name = cls.load_config(os.path.join(cached_folder, "model_index.json")).get("_class_name", None) - cls_name = cls_name[4:] if cls_name.startswith("Flax") else cls_name + cls_name = cls_name[4:] if isinstance(cls_name, str) and cls_name.startswith("Flax") else cls_name - pipeline_class = getattr(diffusers, cls_name, None) + diffusers_module = importlib.import_module(__name__.split(".")[0]) + pipeline_class = getattr(diffusers_module, cls_name, None) if isinstance(cls_name, str) else None if pipeline_class is not None and pipeline_class._load_connected_pipes: modelcard = ModelCard.load(os.path.join(cached_folder, "README.md")) diff --git a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_img2img.py b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_img2img.py index 825c74ce0707..57d00af82106 100644 --- a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_img2img.py +++ b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_img2img.py @@ -553,8 +553,20 @@ def get_timesteps(self, num_inference_steps, strength, device, denoising_start=N - (denoising_start * self.scheduler.config.num_train_timesteps) ) ) - timesteps = list(filter(lambda ts: ts < discrete_timestep_cutoff, timesteps)) - return torch.tensor(timesteps), len(timesteps) + + num_inference_steps = (timesteps < discrete_timestep_cutoff).sum().item() + if self.scheduler.order == 2 and num_inference_steps % 2 == 0: + # if the scheduler is a 2nd order scheduler we might have to do +1 + # because `num_inference_steps` might be even given that every timestep + # (except the highest one) is duplicated. If `num_inference_steps` is even it would + # mean that we cut the timesteps in the middle of the denoising step + # (between 1st and 2nd devirative) which leads to incorrect results. 
By adding 1 + # we ensure that the denoising process always ends after the 2nd derivate step of the scheduler + num_inference_steps = num_inference_steps + 1 + + # because t_n+1 >= t_n, we slice the timesteps starting from the end + timesteps = timesteps[-num_inference_steps:] + return timesteps, num_inference_steps return timesteps, num_inference_steps - t_start diff --git a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py index 535cc7268305..11ae0a0d85f0 100644 --- a/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py +++ b/src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py @@ -838,8 +838,20 @@ def get_timesteps(self, num_inference_steps, strength, device, denoising_start=N - (denoising_start * self.scheduler.config.num_train_timesteps) ) ) - timesteps = list(filter(lambda ts: ts < discrete_timestep_cutoff, timesteps)) - return torch.tensor(timesteps), len(timesteps) + + num_inference_steps = (timesteps < discrete_timestep_cutoff).sum().item() + if self.scheduler.order == 2 and num_inference_steps % 2 == 0: + # if the scheduler is a 2nd order scheduler we might have to do +1 + # because `num_inference_steps` might be even given that every timestep + # (except the highest one) is duplicated. If `num_inference_steps` is even it would + # mean that we cut the timesteps in the middle of the denoising step + # (between 1st and 2nd devirative) which leads to incorrect results. By adding 1 + # we ensure that the denoising process always ends after the 2nd derivate step of the scheduler + num_inference_steps = num_inference_steps + 1 + + # because t_n+1 >= t_n, we slice the timesteps starting from the end + timesteps = timesteps[-num_inference_steps:] + return timesteps, num_inference_steps return timesteps, num_inference_steps - t_start diff --git a/src/diffusers/pipelines/versatile_diffusion/modeling_text_unet.py b/src/diffusers/pipelines/versatile_diffusion/modeling_text_unet.py index d936666d6139..9066c47c56c6 100644 --- a/src/diffusers/pipelines/versatile_diffusion/modeling_text_unet.py +++ b/src/diffusers/pipelines/versatile_diffusion/modeling_text_unet.py @@ -1508,9 +1508,9 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_downsample=True, - downsample_padding=1, + output_scale_factor: float = 1.0, + add_downsample: bool = True, + downsample_padding: int = 1, ): super().__init__() resnets = [] @@ -1547,7 +1547,9 @@ def __init__( self.gradient_checkpointing = False - def forward(self, hidden_states, temb=None, scale: float = 1.0): + def forward( + self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None, scale: float = 1.0 + ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: output_states = () for resnet in self.resnets: @@ -1596,16 +1598,16 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - cross_attention_dim=1280, - output_scale_factor=1.0, - downsample_padding=1, - add_downsample=True, - dual_cross_attention=False, - use_linear_projection=False, - only_cross_attention=False, - upcast_attention=False, - attention_type="default", + num_attention_heads: int = 1, + cross_attention_dim: int = 1280, + output_scale_factor: float = 1.0, + downsample_padding: int = 1, + add_downsample: bool = True, + 
dual_cross_attention: bool = False, + use_linear_projection: bool = False, + only_cross_attention: bool = False, + upcast_attention: bool = False, + attention_type: str = "default", ): super().__init__() resnets = [] @@ -1682,8 +1684,8 @@ def forward( attention_mask: Optional[torch.FloatTensor] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - additional_residuals=None, - ): + additional_residuals: Optional[torch.FloatTensor] = None, + ) -> Tuple[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: output_states = () lora_scale = cross_attention_kwargs.get("scale", 1.0) if cross_attention_kwargs is not None else 1.0 @@ -1751,7 +1753,7 @@ def __init__( prev_output_channel: int, out_channels: int, temb_channels: int, - resolution_idx: int = None, + resolution_idx: Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, resnet_eps: float = 1e-6, @@ -1759,8 +1761,8 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_upsample=True, + output_scale_factor: float = 1.0, + add_upsample: bool = True, ): super().__init__() resnets = [] @@ -1794,7 +1796,14 @@ def __init__( self.gradient_checkpointing = False self.resolution_idx = resolution_idx - def forward(self, hidden_states, res_hidden_states_tuple, temb=None, upsample_size=None, scale: float = 1.0): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + upsample_size: Optional[int] = None, + scale: float = 1.0, + ) -> torch.FloatTensor: is_freeu_enabled = ( getattr(self, "s1", None) and getattr(self, "s2", None) @@ -1855,7 +1864,7 @@ def __init__( out_channels: int, prev_output_channel: int, temb_channels: int, - resolution_idx: int = None, + resolution_idx: Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, transformer_layers_per_block: Union[int, Tuple[int]] = 1, @@ -1864,15 +1873,15 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - cross_attention_dim=1280, - output_scale_factor=1.0, - add_upsample=True, - dual_cross_attention=False, - use_linear_projection=False, - only_cross_attention=False, - upcast_attention=False, - attention_type="default", + num_attention_heads: int = 1, + cross_attention_dim: int = 1280, + output_scale_factor: float = 1.0, + add_upsample: bool = True, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, + only_cross_attention: bool = False, + upcast_attention: bool = False, + attention_type: str = "default", ): super().__init__() resnets = [] @@ -1949,7 +1958,7 @@ def forward( upsample_size: Optional[int] = None, attention_mask: Optional[torch.FloatTensor] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - ): + ) -> torch.FloatTensor: lora_scale = cross_attention_kwargs.get("scale", 1.0) if cross_attention_kwargs is not None else 1.0 is_freeu_enabled = ( getattr(self, "s1", None) @@ -2066,8 +2075,8 @@ def __init__( attn_groups: Optional[int] = None, resnet_pre_norm: bool = True, add_attention: bool = True, - attention_head_dim=1, - output_scale_factor=1.0, + attention_head_dim: int = 1, + output_scale_factor: float = 1.0, ): super().__init__() resnet_groups = resnet_groups if resnet_groups is not None else min(in_channels // 4, 32) @@ -2138,7 +2147,7 @@ def __init__( self.attentions = nn.ModuleList(attentions) 
self.resnets = nn.ModuleList(resnets) - def forward(self, hidden_states, temb=None): + def forward(self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None) -> torch.FloatTensor: hidden_states = self.resnets[0](hidden_states, temb) for attn, resnet in zip(self.attentions, self.resnets[1:]): if attn is not None: @@ -2162,13 +2171,13 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - output_scale_factor=1.0, - cross_attention_dim=1280, - dual_cross_attention=False, - use_linear_projection=False, - upcast_attention=False, - attention_type="default", + num_attention_heads: int = 1, + output_scale_factor: float = 1.0, + cross_attention_dim: int = 1280, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, + upcast_attention: bool = False, + attention_type: str = "default", ): super().__init__() @@ -2308,12 +2317,12 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - attention_head_dim=1, - output_scale_factor=1.0, - cross_attention_dim=1280, - skip_time_act=False, - only_cross_attention=False, - cross_attention_norm=None, + attention_head_dim: int = 1, + output_scale_factor: float = 1.0, + cross_attention_dim: int = 1280, + skip_time_act: bool = False, + only_cross_attention: bool = False, + cross_attention_norm: Optional[str] = None, ): super().__init__() @@ -2389,7 +2398,7 @@ def forward( attention_mask: Optional[torch.FloatTensor] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - ): + ) -> torch.FloatTensor: cross_attention_kwargs = cross_attention_kwargs if cross_attention_kwargs is not None else {} lora_scale = cross_attention_kwargs.get("scale", 1.0) diff --git a/src/diffusers/schedulers/__init__.py b/src/diffusers/schedulers/__init__.py index c6d1ee6d1006..85fd9d25e5da 100644 --- a/src/diffusers/schedulers/__init__.py +++ b/src/diffusers/schedulers/__init__.py @@ -56,6 +56,7 @@ _import_structure["scheduling_k_dpm_2_ancestral_discrete"] = ["KDPM2AncestralDiscreteScheduler"] _import_structure["scheduling_k_dpm_2_discrete"] = ["KDPM2DiscreteScheduler"] _import_structure["scheduling_karras_ve"] = ["KarrasVeScheduler"] + _import_structure["scheduling_lcm"] = ["LCMScheduler"] _import_structure["scheduling_pndm"] = ["PNDMScheduler"] _import_structure["scheduling_repaint"] = ["RePaintScheduler"] _import_structure["scheduling_sde_ve"] = ["ScoreSdeVeScheduler"] @@ -145,6 +146,7 @@ from .scheduling_k_dpm_2_ancestral_discrete import KDPM2AncestralDiscreteScheduler from .scheduling_k_dpm_2_discrete import KDPM2DiscreteScheduler from .scheduling_karras_ve import KarrasVeScheduler + from .scheduling_lcm import LCMScheduler from .scheduling_pndm import PNDMScheduler from .scheduling_repaint import RePaintScheduler from .scheduling_sde_ve import ScoreSdeVeScheduler diff --git a/src/diffusers/schedulers/scheduling_lcm.py b/src/diffusers/schedulers/scheduling_lcm.py new file mode 100644 index 000000000000..1ee430623da4 --- /dev/null +++ b/src/diffusers/schedulers/scheduling_lcm.py @@ -0,0 +1,529 @@ +# Copyright 2023 Stanford University Team and The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +# DISCLAIMER: This code is strongly influenced by https://github.com/pesser/pytorch_diffusion +# and https://github.com/hojonathanho/diffusion + +import math +from dataclasses import dataclass +from typing import List, Optional, Tuple, Union + +import numpy as np +import torch + +from ..configuration_utils import ConfigMixin, register_to_config +from ..utils import BaseOutput, logging +from ..utils.torch_utils import randn_tensor +from .scheduling_utils import SchedulerMixin + + +logger = logging.get_logger(__name__) # pylint: disable=invalid-name + + +@dataclass +class LCMSchedulerOutput(BaseOutput): + """ + Output class for the scheduler's `step` function output. + + Args: + prev_sample (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images): + Computed sample `(x_{t-1})` of previous timestep. `prev_sample` should be used as next model input in the + denoising loop. + denoised (`torch.FloatTensor` of shape `(batch_size, num_channels, height, width)` for images): + The predicted denoised sample `(x_{0})` based on the model output from the current timestep. + `denoised` can be used to preview progress or for guidance. + """ + + prev_sample: torch.FloatTensor + denoised: Optional[torch.FloatTensor] = None + + +# Copied from diffusers.schedulers.scheduling_ddpm.betas_for_alpha_bar +def betas_for_alpha_bar( + num_diffusion_timesteps, + max_beta=0.999, + alpha_transform_type="cosine", +): + """ + Create a beta schedule that discretizes the given alpha_t_bar function, which defines the cumulative product of + (1-beta) over time from t = [0,1]. + + Contains a function alpha_bar that takes an argument t and transforms it to the cumulative product of (1-beta) up + to that part of the diffusion process. + + + Args: + num_diffusion_timesteps (`int`): the number of betas to produce. + max_beta (`float`): the maximum beta to use; use values lower than 1 to + prevent singularities. + alpha_transform_type (`str`, *optional*, default to `cosine`): the type of noise schedule for alpha_bar. + Choose from `cosine` or `exp` + + Returns: + betas (`np.ndarray`): the betas used by the scheduler to step the model outputs + """ + if alpha_transform_type == "cosine": + + def alpha_bar_fn(t): + return math.cos((t + 0.008) / 1.008 * math.pi / 2) ** 2 + + elif alpha_transform_type == "exp": + + def alpha_bar_fn(t): + return math.exp(t * -12.0) + + else: + raise ValueError(f"Unsupported alpha_tranform_type: {alpha_transform_type}") + + betas = [] + for i in range(num_diffusion_timesteps): + t1 = i / num_diffusion_timesteps + t2 = (i + 1) / num_diffusion_timesteps + betas.append(min(1 - alpha_bar_fn(t2) / alpha_bar_fn(t1), max_beta)) + return torch.tensor(betas, dtype=torch.float32) + + +# Copied from diffusers.schedulers.scheduling_ddim.rescale_zero_terminal_snr +def rescale_zero_terminal_snr(betas: torch.FloatTensor) -> torch.FloatTensor: + """ + Rescales betas to have zero terminal SNR Based on https://arxiv.org/pdf/2305.08891.pdf (Algorithm 1) + + + Args: + betas (`torch.FloatTensor`): + the betas that the scheduler is being initialized with.
+ + Returns: + `torch.FloatTensor`: rescaled betas with zero terminal SNR + """ + # Convert betas to alphas_bar_sqrt + alphas = 1.0 - betas + alphas_cumprod = torch.cumprod(alphas, dim=0) + alphas_bar_sqrt = alphas_cumprod.sqrt() + + # Store old values. + alphas_bar_sqrt_0 = alphas_bar_sqrt[0].clone() + alphas_bar_sqrt_T = alphas_bar_sqrt[-1].clone() + + # Shift so the last timestep is zero. + alphas_bar_sqrt -= alphas_bar_sqrt_T + + # Scale so the first timestep is back to the old value. + alphas_bar_sqrt *= alphas_bar_sqrt_0 / (alphas_bar_sqrt_0 - alphas_bar_sqrt_T) + + # Convert alphas_bar_sqrt to betas + alphas_bar = alphas_bar_sqrt**2 # Revert sqrt + alphas = alphas_bar[1:] / alphas_bar[:-1] # Revert cumprod + alphas = torch.cat([alphas_bar[0:1], alphas]) + betas = 1 - alphas + + return betas + + +class LCMScheduler(SchedulerMixin, ConfigMixin): + """ + `LCMScheduler` extends the denoising procedure introduced in denoising diffusion probabilistic models (DDPMs) with + non-Markovian guidance. + + This model inherits from [`SchedulerMixin`] and [`ConfigMixin`]. [`~ConfigMixin`] takes care of storing all config + attributes that are passed in the scheduler's `__init__` function, such as `num_train_timesteps`. They can be + accessed via `scheduler.config.num_train_timesteps`. [`SchedulerMixin`] provides general loading and saving + functionality via the [`SchedulerMixin.save_pretrained`] and [`~SchedulerMixin.from_pretrained`] functions. + + Args: + num_train_timesteps (`int`, defaults to 1000): + The number of diffusion steps to train the model. + beta_start (`float`, defaults to 0.0001): + The starting `beta` value of inference. + beta_end (`float`, defaults to 0.02): + The final `beta` value. + beta_schedule (`str`, defaults to `"linear"`): + The beta schedule, a mapping from a beta range to a sequence of betas for stepping the model. Choose from + `linear`, `scaled_linear`, or `squaredcos_cap_v2`. + trained_betas (`np.ndarray`, *optional*): + Pass an array of betas directly to the constructor to bypass `beta_start` and `beta_end`. + original_inference_steps (`int`, *optional*, defaults to 50): + The default number of inference steps used to generate a linearly-spaced timestep schedule, from which we + will ultimately take `num_inference_steps` evenly spaced timesteps to form the final timestep schedule. + clip_sample (`bool`, defaults to `True`): + Clip the predicted sample for numerical stability. + clip_sample_range (`float`, defaults to 1.0): + The maximum magnitude for sample clipping. Valid only when `clip_sample=True`. + set_alpha_to_one (`bool`, defaults to `True`): + Each diffusion step uses the alphas product value at that step and at the previous one. For the final step + there is no previous alpha. When this option is `True` the previous alpha product is fixed to `1`, + otherwise it uses the alpha value at step 0. + steps_offset (`int`, defaults to 0): + An offset added to the inference steps. You can use a combination of `offset=1` and + `set_alpha_to_one=False` to make the last step use step 0 for the previous alpha product like in Stable + Diffusion. + prediction_type (`str`, defaults to `epsilon`, *optional*): + Prediction type of the scheduler function; can be `epsilon` (predicts the noise of the diffusion process), + `sample` (directly predicts the noisy sample`) or `v_prediction` (see section 2.4 of [Imagen + Video](https://imagen.research.google/video/paper.pdf) paper). + thresholding (`bool`, defaults to `False`): + Whether to use the "dynamic thresholding" method. 
This is unsuitable for latent-space diffusion models such + as Stable Diffusion. + dynamic_thresholding_ratio (`float`, defaults to 0.995): + The ratio for the dynamic thresholding method. Valid only when `thresholding=True`. + sample_max_value (`float`, defaults to 1.0): + The threshold value for dynamic thresholding. Valid only when `thresholding=True`. + timestep_spacing (`str`, defaults to `"leading"`): + The way the timesteps should be scaled. Refer to Table 2 of the [Common Diffusion Noise Schedules and + Sample Steps are Flawed](https://huggingface.co/papers/2305.08891) for more information. + rescale_betas_zero_snr (`bool`, defaults to `False`): + Whether to rescale the betas to have zero terminal SNR. This enables the model to generate very bright and + dark samples instead of limiting it to samples with medium brightness. Loosely related to + [`--offset_noise`](https://github.com/huggingface/diffusers/blob/74fd735eb073eb1d774b1ab4154a0876eb82f055/examples/dreambooth/train_dreambooth.py#L506). + """ + + order = 1 + + @register_to_config + def __init__( + self, + num_train_timesteps: int = 1000, + beta_start: float = 0.00085, + beta_end: float = 0.012, + beta_schedule: str = "scaled_linear", + trained_betas: Optional[Union[np.ndarray, List[float]]] = None, + original_inference_steps: int = 50, + clip_sample: bool = False, + clip_sample_range: float = 1.0, + set_alpha_to_one: bool = True, + steps_offset: int = 0, + prediction_type: str = "epsilon", + thresholding: bool = False, + dynamic_thresholding_ratio: float = 0.995, + sample_max_value: float = 1.0, + timestep_spacing: str = "leading", + rescale_betas_zero_snr: bool = False, + ): + if trained_betas is not None: + self.betas = torch.tensor(trained_betas, dtype=torch.float32) + elif beta_schedule == "linear": + self.betas = torch.linspace(beta_start, beta_end, num_train_timesteps, dtype=torch.float32) + elif beta_schedule == "scaled_linear": + # this schedule is very specific to the latent diffusion model. + self.betas = ( + torch.linspace(beta_start**0.5, beta_end**0.5, num_train_timesteps, dtype=torch.float32) ** 2 + ) + elif beta_schedule == "squaredcos_cap_v2": + # Glide cosine schedule + self.betas = betas_for_alpha_bar(num_train_timesteps) + else: + raise NotImplementedError(f"{beta_schedule} is not implemented for {self.__class__}") + + # Rescale for zero SNR + if rescale_betas_zero_snr: + self.betas = rescale_zero_terminal_snr(self.betas) + + self.alphas = 1.0 - self.betas + self.alphas_cumprod = torch.cumprod(self.alphas, dim=0) + + # At every step in ddim, we are looking into the previous alphas_cumprod + # For the final step, there is no previous alphas_cumprod because we are already at 0 + # `set_alpha_to_one` decides whether we set this parameter simply to one or + # whether we use the final alpha of the "non-previous" one.
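# Illustrative aside, not part of the patch above (values assume the defaults in this file): with
# beta_start = 0.00085 under the scaled_linear schedule, alphas_cumprod[0] = 1 - 0.00085 ≈ 0.99915,
# so the two branches below differ only marginally at the final denoising step: `set_alpha_to_one=True`
# pins the "previous" alpha product to exactly 1.0, while `False` reuses the near-1 value at step 0.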
+ self.final_alpha_cumprod = torch.tensor(1.0) if set_alpha_to_one else self.alphas_cumprod[0] + + # standard deviation of the initial noise distribution + self.init_noise_sigma = 1.0 + + # setable values + self.num_inference_steps = None + self.timesteps = torch.from_numpy(np.arange(0, num_train_timesteps)[::-1].copy().astype(np.int64)) + + self._step_index = None + + # Copied from diffusers.schedulers.scheduling_euler_discrete.EulerDiscreteScheduler._init_step_index + def _init_step_index(self, timestep): + if isinstance(timestep, torch.Tensor): + timestep = timestep.to(self.timesteps.device) + + index_candidates = (self.timesteps == timestep).nonzero() + + # The sigma index that is taken for the **very** first `step` + # is always the second index (or the last index if there is only 1) + # This way we can ensure we don't accidentally skip a sigma in + # case we start in the middle of the denoising schedule (e.g. for image-to-image) + if len(index_candidates) > 1: + step_index = index_candidates[1] + else: + step_index = index_candidates[0] + + self._step_index = step_index.item() + + @property + def step_index(self): + return self._step_index + + def scale_model_input(self, sample: torch.FloatTensor, timestep: Optional[int] = None) -> torch.FloatTensor: + """ + Ensures interchangeability with schedulers that need to scale the denoising model input depending on the + current timestep. + + Args: + sample (`torch.FloatTensor`): + The input sample. + timestep (`int`, *optional*): + The current timestep in the diffusion chain. + Returns: + `torch.FloatTensor`: + A scaled input sample. + """ + return sample + + # Copied from diffusers.schedulers.scheduling_ddpm.DDPMScheduler._threshold_sample + def _threshold_sample(self, sample: torch.FloatTensor) -> torch.FloatTensor: + """ + "Dynamic thresholding: At each sampling step we set s to a certain percentile absolute pixel value in xt0 (the + prediction of x_0 at timestep t), and if s > 1, then we threshold xt0 to the range [-s, s] and then divide by + s. Dynamic thresholding pushes saturated pixels (those near -1 and 1) inwards, thereby actively preventing + pixels from saturation at each step. We find that dynamic thresholding results in significantly better + photorealism as well as better image-text alignment, especially when using very large guidance weights." 
+ + https://arxiv.org/abs/2205.11487 + """ + dtype = sample.dtype + batch_size, channels, *remaining_dims = sample.shape + + if dtype not in (torch.float32, torch.float64): + sample = sample.float() # upcast for quantile calculation, and clamp not implemented for cpu half + + # Flatten sample for doing quantile calculation along each image + sample = sample.reshape(batch_size, channels * np.prod(remaining_dims)) + + abs_sample = sample.abs() # "a certain percentile absolute pixel value" + + s = torch.quantile(abs_sample, self.config.dynamic_thresholding_ratio, dim=1) + s = torch.clamp( + s, min=1, max=self.config.sample_max_value + ) # When clamped to min=1, equivalent to standard clipping to [-1, 1] + s = s.unsqueeze(1) # (batch_size, 1) because clamp will broadcast along dim=0 + sample = torch.clamp(sample, -s, s) / s # "we threshold xt0 to the range [-s, s] and then divide by s" + + sample = sample.reshape(batch_size, channels, *remaining_dims) + sample = sample.to(dtype) + + return sample + + def set_timesteps( + self, + num_inference_steps: int, + device: Union[str, torch.device] = None, + original_inference_steps: Optional[int] = None, + ): + """ + Sets the discrete timesteps used for the diffusion chain (to be run before inference). + + Args: + num_inference_steps (`int`): + The number of diffusion steps used when generating samples with a pre-trained model. + device (`str` or `torch.device`, *optional*): + The device to which the timesteps should be moved to. If `None`, the timesteps are not moved. + original_inference_steps (`int`, *optional*): + The original number of inference steps, which will be used to generate a linearly-spaced timestep + schedule (which is different from the standard `diffusers` implementation). We will then take + `num_inference_steps` timesteps from this schedule, evenly spaced in terms of indices, and use that as + our final timestep schedule. If not set, this will default to the `original_inference_steps` attribute. + """ + + if num_inference_steps > self.config.num_train_timesteps: + raise ValueError( + f"`num_inference_steps`: {num_inference_steps} cannot be larger than `self.config.train_timesteps`:" + f" {self.config.num_train_timesteps} as the unet model trained with this scheduler can only handle" + f" maximal {self.config.num_train_timesteps} timesteps." + ) + + self.num_inference_steps = num_inference_steps + original_steps = ( + original_inference_steps if original_inference_steps is not None else self.original_inference_steps + ) + + if original_steps > self.config.num_train_timesteps: + raise ValueError( + f"`original_steps`: {original_steps} cannot be larger than `self.config.train_timesteps`:" + f" {self.config.num_train_timesteps} as the unet model trained with this scheduler can only handle" + f" maximal {self.config.num_train_timesteps} timesteps." + ) + + if num_inference_steps > original_steps: + raise ValueError( + f"`num_inference_steps`: {num_inference_steps} cannot be larger than `original_inference_steps`:" + f" {original_steps} because the final timestep schedule will be a subset of the" + f" `original_inference_steps`-sized initial timestep schedule." + ) + + # LCM Timesteps Setting + # Currently, only linear spacing is supported. 
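# Illustrative worked example, not part of the patch above (numbers assume the defaults
# `num_train_timesteps=1000` and `original_inference_steps=50`): c = 1000 // 50 = 20, so the LCM
# "training" schedule built below is [19, 39, 59, ..., 999]. For `num_inference_steps=4`,
# skipping_step = 50 // 4 = 12, and slicing every 12th entry starting from the end yields the
# descending inference schedule [999, 759, 519, 279]. A standalone sketch of the same computation:
#
#     import numpy as np
#     c = 1000 // 50
#     lcm_origin_timesteps = np.asarray(list(range(1, 50 + 1))) * c - 1
#     print(lcm_origin_timesteps[::-(50 // 4)][:4])  # -> [999 759 519 279]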
+ c = self.config.num_train_timesteps // original_steps + # LCM Training Steps Schedule + lcm_origin_timesteps = np.asarray(list(range(1, original_steps + 1))) * c - 1 + skipping_step = len(lcm_origin_timesteps) // num_inference_steps + # LCM Inference Steps Schedule + timesteps = lcm_origin_timesteps[::-skipping_step][:num_inference_steps] + + self.timesteps = torch.from_numpy(timesteps.copy()).to(device=device, dtype=torch.long) + + self._step_index = None + + def get_scalings_for_boundary_condition_discrete(self, t): + self.sigma_data = 0.5 # Default: 0.5 + + # By dividing 0.1: This is almost a delta function at t=0. + c_skip = self.sigma_data**2 / ((t / 0.1) ** 2 + self.sigma_data**2) + c_out = (t / 0.1) / ((t / 0.1) ** 2 + self.sigma_data**2) ** 0.5 + return c_skip, c_out + + def step( + self, + model_output: torch.FloatTensor, + timestep: int, + sample: torch.FloatTensor, + generator: Optional[torch.Generator] = None, + return_dict: bool = True, + ) -> Union[LCMSchedulerOutput, Tuple]: + """ + Predict the sample from the previous timestep by reversing the SDE. This function propagates the diffusion + process from the learned model outputs (most often the predicted noise). + + Args: + model_output (`torch.FloatTensor`): + The direct output from learned diffusion model. + timestep (`float`): + The current discrete timestep in the diffusion chain. + sample (`torch.FloatTensor`): + A current instance of a sample created by the diffusion process. + generator (`torch.Generator`, *optional*): + A random number generator. + return_dict (`bool`, *optional*, defaults to `True`): + Whether or not to return a [`~schedulers.scheduling_lcm.LCMSchedulerOutput`] or `tuple`. + Returns: + [`~schedulers.scheduling_utils.LCMSchedulerOutput`] or `tuple`: + If return_dict is `True`, [`~schedulers.scheduling_lcm.LCMSchedulerOutput`] is returned, otherwise a + tuple is returned where the first element is the sample tensor. + """ + if self.num_inference_steps is None: + raise ValueError( + "Number of inference steps is 'None', you need to run 'set_timesteps' after creating the scheduler" + ) + + if self.step_index is None: + self._init_step_index(timestep) + + # 1. get previous step value + prev_step_index = self.step_index + 1 + if prev_step_index < len(self.timesteps): + prev_timestep = self.timesteps[prev_step_index] + else: + prev_timestep = timestep + + # 2. compute alphas, betas + alpha_prod_t = self.alphas_cumprod[timestep] + alpha_prod_t_prev = self.alphas_cumprod[prev_timestep] if prev_timestep >= 0 else self.final_alpha_cumprod + + beta_prod_t = 1 - alpha_prod_t + beta_prod_t_prev = 1 - alpha_prod_t_prev + + # 3. Get scalings for boundary conditions + c_skip, c_out = self.get_scalings_for_boundary_condition_discrete(timestep) + + # 4. Compute the predicted original sample x_0 based on the model parameterization + if self.config.prediction_type == "epsilon": # noise-prediction + predicted_original_sample = (sample - beta_prod_t.sqrt() * model_output) / alpha_prod_t.sqrt() + elif self.config.prediction_type == "sample": # x-prediction + predicted_original_sample = model_output + elif self.config.prediction_type == "v_prediction": # v-prediction + predicted_original_sample = alpha_prod_t.sqrt() * sample - beta_prod_t.sqrt() * model_output + else: + raise ValueError( + f"prediction_type given as {self.config.prediction_type} must be one of `epsilon`, `sample` or" + " `v_prediction` for `LCMScheduler`." + ) + + # 5. 
Clip or threshold "predicted x_0" + if self.config.thresholding: + predicted_original_sample = self._threshold_sample(predicted_original_sample) + elif self.config.clip_sample: + predicted_original_sample = predicted_original_sample.clamp( + -self.config.clip_sample_range, self.config.clip_sample_range + ) + + # 6. Denoise model output using boundary conditions + denoised = c_out * predicted_original_sample + c_skip * sample + + # 7. Sample and inject noise z ~ N(0, I) for MultiStep Inference + # Noise is not used for one-step sampling. + if len(self.timesteps) > 1: + noise = randn_tensor(model_output.shape, generator=generator, device=model_output.device) + prev_sample = alpha_prod_t_prev.sqrt() * denoised + beta_prod_t_prev.sqrt() * noise + else: + prev_sample = denoised + + # upon completion increase step index by one + self._step_index += 1 + + if not return_dict: + return (prev_sample, denoised) + + return LCMSchedulerOutput(prev_sample=prev_sample, denoised=denoised) + + # Copied from diffusers.schedulers.scheduling_ddpm.DDPMScheduler.add_noise + def add_noise( + self, + original_samples: torch.FloatTensor, + noise: torch.FloatTensor, + timesteps: torch.IntTensor, + ) -> torch.FloatTensor: + # Make sure alphas_cumprod and timestep have same device and dtype as original_samples + alphas_cumprod = self.alphas_cumprod.to(device=original_samples.device, dtype=original_samples.dtype) + timesteps = timesteps.to(original_samples.device) + + sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5 + sqrt_alpha_prod = sqrt_alpha_prod.flatten() + while len(sqrt_alpha_prod.shape) < len(original_samples.shape): + sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1) + + sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5 + sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten() + while len(sqrt_one_minus_alpha_prod.shape) < len(original_samples.shape): + sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1) + + noisy_samples = sqrt_alpha_prod * original_samples + sqrt_one_minus_alpha_prod * noise + return noisy_samples + + # Copied from diffusers.schedulers.scheduling_ddpm.DDPMScheduler.get_velocity + def get_velocity( + self, sample: torch.FloatTensor, noise: torch.FloatTensor, timesteps: torch.IntTensor + ) -> torch.FloatTensor: + # Make sure alphas_cumprod and timestep have same device and dtype as sample + alphas_cumprod = self.alphas_cumprod.to(device=sample.device, dtype=sample.dtype) + timesteps = timesteps.to(sample.device) + + sqrt_alpha_prod = alphas_cumprod[timesteps] ** 0.5 + sqrt_alpha_prod = sqrt_alpha_prod.flatten() + while len(sqrt_alpha_prod.shape) < len(sample.shape): + sqrt_alpha_prod = sqrt_alpha_prod.unsqueeze(-1) + + sqrt_one_minus_alpha_prod = (1 - alphas_cumprod[timesteps]) ** 0.5 + sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.flatten() + while len(sqrt_one_minus_alpha_prod.shape) < len(sample.shape): + sqrt_one_minus_alpha_prod = sqrt_one_minus_alpha_prod.unsqueeze(-1) + + velocity = sqrt_alpha_prod * noise - sqrt_one_minus_alpha_prod * sample + return velocity + + def __len__(self): + return self.config.num_train_timesteps diff --git a/src/diffusers/utils/dummy_pt_objects.py b/src/diffusers/utils/dummy_pt_objects.py index 8e95dde52caf..890f836c73c6 100644 --- a/src/diffusers/utils/dummy_pt_objects.py +++ b/src/diffusers/utils/dummy_pt_objects.py @@ -825,6 +825,21 @@ def from_pretrained(cls, *args, **kwargs): requires_backends(cls, ["torch"]) +class LCMScheduler(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, 
*args, **kwargs): + requires_backends(self, ["torch"]) + + @classmethod + def from_config(cls, *args, **kwargs): + requires_backends(cls, ["torch"]) + + @classmethod + def from_pretrained(cls, *args, **kwargs): + requires_backends(cls, ["torch"]) + + class PNDMScheduler(metaclass=DummyObject): _backends = ["torch"] diff --git a/src/diffusers/utils/dummy_torch_and_transformers_objects.py b/src/diffusers/utils/dummy_torch_and_transformers_objects.py index d831cc49b495..3b5e3ad4e07d 100644 --- a/src/diffusers/utils/dummy_torch_and_transformers_objects.py +++ b/src/diffusers/utils/dummy_torch_and_transformers_objects.py @@ -482,6 +482,21 @@ def from_pretrained(cls, *args, **kwargs): requires_backends(cls, ["torch", "transformers"]) +class LatentConsistencyModelPipeline(metaclass=DummyObject): + _backends = ["torch", "transformers"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch", "transformers"]) + + @classmethod + def from_config(cls, *args, **kwargs): + requires_backends(cls, ["torch", "transformers"]) + + @classmethod + def from_pretrained(cls, *args, **kwargs): + requires_backends(cls, ["torch", "transformers"]) + + class LDMTextToImagePipeline(metaclass=DummyObject): _backends = ["torch", "transformers"] diff --git a/src/diffusers/utils/outputs.py b/src/diffusers/utils/outputs.py index 802c699eb9cc..a057b506aec0 100644 --- a/src/diffusers/utils/outputs.py +++ b/src/diffusers/utils/outputs.py @@ -51,6 +51,21 @@ class BaseOutput(OrderedDict):
""" + def __init_subclass__(cls) -> None: + """Register subclasses as pytree nodes. + + This is necessary to synchronize gradients when using `torch.nn.parallel.DistributedDataParallel` with + `static_graph=True` with modules that output `ModelOutput` subclasses. + """ + if is_torch_available(): + import torch.utils._pytree + + torch.utils._pytree._register_pytree_node( + cls, + torch.utils._pytree._dict_flatten, + lambda values, context: cls(**torch.utils._pytree._dict_unflatten(values, context)), + ) + def __post_init__(self): class_fields = fields(self) diff --git a/tests/others/test_outputs.py b/tests/others/test_outputs.py index 492e71f0ba31..cf709d93f709 100644 --- a/tests/others/test_outputs.py +++ b/tests/others/test_outputs.py @@ -7,6 +7,7 @@ import PIL.Image from diffusers.utils.outputs import BaseOutput +from diffusers.utils.testing_utils import require_torch @dataclass @@ -69,3 +70,24 @@ def test_outputs_serialization(self): assert dir(outputs_orig) == dir(outputs_copy) assert dict(outputs_orig) == dict(outputs_copy) assert vars(outputs_orig) == vars(outputs_copy) + + @require_torch + def test_torch_pytree(self): + # ensure torch.utils._pytree treats ModelOutput subclasses as nodes (and not leaves) + # this is important for DistributedDataParallel gradient synchronization with static_graph=True + import torch + import torch.utils._pytree + + data = np.random.rand(1, 3, 4, 4) + x = CustomOutput(images=data) + self.assertFalse(torch.utils._pytree._is_leaf(x)) + + expected_flat_outs = [data] + expected_tree_spec = torch.utils._pytree.TreeSpec(CustomOutput, ["images"], [torch.utils._pytree.LeafSpec()]) + + actual_flat_outs, actual_tree_spec = torch.utils._pytree.tree_flatten(x) + self.assertEqual(expected_flat_outs, actual_flat_outs) + self.assertEqual(expected_tree_spec, actual_tree_spec) + + unflattened_x = torch.utils._pytree.tree_unflatten(actual_flat_outs, actual_tree_spec) + self.assertEqual(x, unflattened_x) diff --git a/tests/pipelines/controlnet/test_controlnet_img2img.py b/tests/pipelines/controlnet/test_controlnet_img2img.py index 3113836f5d0a..5a7f70eb488a 100644 --- a/tests/pipelines/controlnet/test_controlnet_img2img.py +++ b/tests/pipelines/controlnet/test_controlnet_img2img.py @@ -72,7 +72,7 @@ class ControlNetImg2ImgPipelineFastTests( def get_dummy_components(self): torch.manual_seed(0) unet = UNet2DConditionModel( - block_out_channels=(32, 64), + block_out_channels=(4, 8), layers_per_block=2, sample_size=32, in_channels=4, @@ -80,15 +80,17 @@ def get_dummy_components(self): down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"), up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"), cross_attention_dim=32, + norm_num_groups=1, ) torch.manual_seed(0) controlnet = ControlNetModel( - block_out_channels=(32, 64), + block_out_channels=(4, 8), layers_per_block=2, in_channels=4, down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"), cross_attention_dim=32, conditioning_embedding_out_channels=(16, 32), + norm_num_groups=1, ) torch.manual_seed(0) scheduler = DDIMScheduler( @@ -100,12 +102,13 @@ def get_dummy_components(self): ) torch.manual_seed(0) vae = AutoencoderKL( - block_out_channels=[32, 64], + block_out_channels=[4, 8], in_channels=3, out_channels=3, down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"], up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"], latent_channels=4, + norm_num_groups=2, ) torch.manual_seed(0) text_encoder_config = CLIPTextConfig( @@ -186,7 +189,7 @@ class StableDiffusionMultiControlNetPipelineFastTests( def 
get_dummy_components(self): torch.manual_seed(0) unet = UNet2DConditionModel( - block_out_channels=(32, 64), + block_out_channels=(4, 8), layers_per_block=2, sample_size=32, in_channels=4, @@ -194,6 +197,7 @@ def get_dummy_components(self): down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"), up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"), cross_attention_dim=32, + norm_num_groups=1, ) torch.manual_seed(0) @@ -203,23 +207,25 @@ def init_weights(m): m.bias.data.fill_(1.0) controlnet1 = ControlNetModel( - block_out_channels=(32, 64), + block_out_channels=(4, 8), layers_per_block=2, in_channels=4, down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"), cross_attention_dim=32, conditioning_embedding_out_channels=(16, 32), + norm_num_groups=1, ) controlnet1.controlnet_down_blocks.apply(init_weights) torch.manual_seed(0) controlnet2 = ControlNetModel( - block_out_channels=(32, 64), + block_out_channels=(4, 8), layers_per_block=2, in_channels=4, down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"), cross_attention_dim=32, conditioning_embedding_out_channels=(16, 32), + norm_num_groups=1, ) controlnet2.controlnet_down_blocks.apply(init_weights) @@ -233,12 +239,13 @@ def init_weights(m): ) torch.manual_seed(0) vae = AutoencoderKL( - block_out_channels=[32, 64], + block_out_channels=[4, 8], in_channels=3, out_channels=3, down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"], up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"], latent_channels=4, + norm_num_groups=2, ) torch.manual_seed(0) text_encoder_config = CLIPTextConfig( diff --git a/tests/pipelines/latent_consistency_models/__init__.py b/tests/pipelines/latent_consistency_models/__init__.py new file mode 100644 index 000000000000..e69de29bb2d1 diff --git a/tests/pipelines/latent_consistency_models/test_latent_consistency_models.py b/tests/pipelines/latent_consistency_models/test_latent_consistency_models.py new file mode 100644 index 000000000000..0ef33a688eae --- /dev/null +++ b/tests/pipelines/latent_consistency_models/test_latent_consistency_models.py @@ -0,0 +1,195 @@ +import gc +import unittest + +import numpy as np +import torch +from transformers import CLIPTextConfig, CLIPTextModel, CLIPTokenizer + +from diffusers import ( + AutoencoderKL, + LatentConsistencyModelPipeline, + LCMScheduler, + UNet2DConditionModel, +) +from diffusers.utils.testing_utils import ( + enable_full_determinism, + require_torch_gpu, + slow, + torch_device, +) + +from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS +from ..test_pipelines_common import PipelineLatentTesterMixin, PipelineTesterMixin + + +enable_full_determinism() + + +class LatentConsistencyModelPipelineFastTests(PipelineLatentTesterMixin, PipelineTesterMixin, unittest.TestCase): + pipeline_class = LatentConsistencyModelPipeline + params = TEXT_TO_IMAGE_PARAMS - {"negative_prompt", "negative_prompt_embeds"} + batch_params = TEXT_TO_IMAGE_BATCH_PARAMS - {"negative_prompt"} + image_params = TEXT_TO_IMAGE_IMAGE_PARAMS + image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS + + def get_dummy_components(self): + torch.manual_seed(0) + unet = UNet2DConditionModel( + block_out_channels=(4, 8), + layers_per_block=1, + sample_size=32, + in_channels=4, + out_channels=4, + down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"), + up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"), + cross_attention_dim=32, + norm_num_groups=2, + time_cond_proj_dim=32, + ) + scheduler = LCMScheduler( + beta_start=0.00085, + beta_end=0.012, + 
beta_schedule="scaled_linear", + clip_sample=False, + set_alpha_to_one=False, + ) + torch.manual_seed(0) + vae = AutoencoderKL( + block_out_channels=[4, 8], + in_channels=3, + out_channels=3, + down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"], + up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"], + latent_channels=4, + norm_num_groups=2, + ) + torch.manual_seed(0) + text_encoder_config = CLIPTextConfig( + bos_token_id=0, + eos_token_id=2, + hidden_size=32, + intermediate_size=64, + layer_norm_eps=1e-05, + num_attention_heads=8, + num_hidden_layers=3, + pad_token_id=1, + vocab_size=1000, + ) + text_encoder = CLIPTextModel(text_encoder_config) + tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip") + + components = { + "unet": unet, + "scheduler": scheduler, + "vae": vae, + "text_encoder": text_encoder, + "tokenizer": tokenizer, + "safety_checker": None, + "feature_extractor": None, + "requires_safety_checker": False, + } + return components + + def get_dummy_inputs(self, device, seed=0): + if str(device).startswith("mps"): + generator = torch.manual_seed(seed) + else: + generator = torch.Generator(device=device).manual_seed(seed) + inputs = { + "prompt": "A painting of a squirrel eating a burger", + "generator": generator, + "num_inference_steps": 2, + "guidance_scale": 6.0, + "output_type": "np", + } + return inputs + + def test_lcm_onestep(self): + device = "cpu" # ensure determinism for the device-dependent torch.Generator + + components = self.get_dummy_components() + pipe = LatentConsistencyModelPipeline(**components) + pipe = pipe.to(torch_device) + pipe.set_progress_bar_config(disable=None) + + inputs = self.get_dummy_inputs(device) + inputs["num_inference_steps"] = 1 + output = pipe(**inputs) + image = output.images + assert image.shape == (1, 64, 64, 3) + + image_slice = image[0, -3:, -3:, -1] + expected_slice = np.array([0.1441, 0.5304, 0.5452, 0.1361, 0.4011, 0.4370, 0.5326, 0.3492, 0.3637]) + assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-3 + + def test_lcm_multistep(self): + device = "cpu" # ensure determinism for the device-dependent torch.Generator + + components = self.get_dummy_components() + pipe = LatentConsistencyModelPipeline(**components) + pipe = pipe.to(torch_device) + pipe.set_progress_bar_config(disable=None) + + inputs = self.get_dummy_inputs(device) + output = pipe(**inputs) + image = output.images + assert image.shape == (1, 64, 64, 3) + + image_slice = image[0, -3:, -3:, -1] + # TODO: get expected slice + expected_slice = np.array([0.1540, 0.5205, 0.5458, 0.1200, 0.3983, 0.4350, 0.5386, 0.3522, 0.3614]) + assert np.abs(image_slice.flatten() - expected_slice).max() < 2e-2 + + def test_inference_batch_single_identical(self): + super().test_inference_batch_single_identical(expected_max_diff=5e-4) + + +@slow +@require_torch_gpu +class LatentConsistencyModelPipelineSlowTests(unittest.TestCase): + def setUp(self): + gc.collect() + torch.cuda.empty_cache() + + def get_inputs(self, device, generator_device="cpu", dtype=torch.float32, seed=0): + generator = torch.Generator(device=generator_device).manual_seed(seed) + latents = np.random.RandomState(seed).standard_normal((1, 4, 64, 64)) + latents = torch.from_numpy(latents).to(device=device, dtype=dtype) + inputs = { + "prompt": "a photograph of an astronaut riding a horse", + "latents": latents, + "generator": generator, + "num_inference_steps": 3, + "guidance_scale": 7.5, + "output_type": "np", + } + return inputs + + def test_lcm_onestep(self): + pipe = 
LatentConsistencyModelPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", safety_checker=None) + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + pipe = pipe.to(torch_device) + pipe.set_progress_bar_config(disable=None) + + inputs = self.get_inputs(torch_device) + inputs["num_inference_steps"] = 1 + image = pipe(**inputs).images + assert image.shape == (1, 512, 512, 3) + + image_slice = image[0, -3:, -3:, -1].flatten() + expected_slice = np.array([0.1025, 0.0911, 0.0984, 0.0981, 0.0901, 0.0918, 0.1055, 0.0940, 0.0730]) + assert np.abs(image_slice - expected_slice).max() < 1e-3 + + def test_lcm_multistep(self): + pipe = LatentConsistencyModelPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", safety_checker=None) + pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + pipe = pipe.to(torch_device) + pipe.set_progress_bar_config(disable=None) + + inputs = self.get_inputs(torch_device) + image = pipe(**inputs).images + assert image.shape == (1, 512, 512, 3) + + image_slice = image[0, -3:, -3:, -1].flatten() + expected_slice = np.array([0.01855, 0.01855, 0.01489, 0.01392, 0.01782, 0.01465, 0.01831, 0.02539, 0.0]) + assert np.abs(image_slice - expected_slice).max() < 1e-3 diff --git a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl.py b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl.py index 4906670890e8..a9e0cc4671c6 100644 --- a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl.py +++ b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl.py @@ -32,7 +32,7 @@ UNet2DConditionModel, UniPCMultistepScheduler, ) -from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, torch_device +from diffusers.utils.testing_utils import enable_full_determinism, require_torch_gpu, slow, torch_device from ..pipeline_params import TEXT_TO_IMAGE_BATCH_PARAMS, TEXT_TO_IMAGE_IMAGE_PARAMS, TEXT_TO_IMAGE_PARAMS from ..test_pipelines_common import PipelineLatentTesterMixin, PipelineTesterMixin, SDXLOptionalComponentsTesterMixin @@ -301,6 +301,107 @@ def test_stable_diffusion_xl_img2img_prompt_embeds_only(self): # make sure that it's equal assert np.abs(image_slice_1.flatten() - image_slice_2.flatten()).max() < 1e-4 + def test_stable_diffusion_two_xl_mixture_of_denoiser_fast(self): + components = self.get_dummy_components() + pipe_1 = StableDiffusionXLPipeline(**components).to(torch_device) + pipe_1.unet.set_default_attn_processor() + pipe_2 = StableDiffusionXLImg2ImgPipeline(**components).to(torch_device) + pipe_2.unet.set_default_attn_processor() + + def assert_run_mixture( + num_steps, + split, + scheduler_cls_orig, + expected_tss, + num_train_timesteps=pipe_1.scheduler.config.num_train_timesteps, + ): + inputs = self.get_dummy_inputs(torch_device) + inputs["num_inference_steps"] = num_steps + + class scheduler_cls(scheduler_cls_orig): + pass + + pipe_1.scheduler = scheduler_cls.from_config(pipe_1.scheduler.config) + pipe_2.scheduler = scheduler_cls.from_config(pipe_2.scheduler.config) + + # Let's retrieve the number of timesteps we want to use + pipe_1.scheduler.set_timesteps(num_steps) + expected_steps = pipe_1.scheduler.timesteps.tolist() + + if pipe_1.scheduler.order == 2: + expected_steps_1 = list(filter(lambda ts: ts >= split, expected_tss)) + expected_steps_2 = expected_steps_1[-1:] + list(filter(lambda ts: ts < split, expected_tss)) + expected_steps = expected_steps_1 + expected_steps_2 + else: + expected_steps_1 = list(filter(lambda ts: ts >= split, expected_tss)) + expected_steps_2 = 
list(filter(lambda ts: ts < split, expected_tss)) + + # now we monkey patch step `done_steps` + # list into the step function for testing + done_steps = [] + old_step = copy.copy(scheduler_cls.step) + + def new_step(self, *args, **kwargs): + done_steps.append(args[1].cpu().item()) # args[1] is always the passed `t` + return old_step(self, *args, **kwargs) + + scheduler_cls.step = new_step + + inputs_1 = { + **inputs, + **{ + "denoising_end": 1.0 - (split / num_train_timesteps), + "output_type": "latent", + }, + } + latents = pipe_1(**inputs_1).images[0] + + assert expected_steps_1 == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}" + + inputs_2 = { + **inputs, + **{ + "denoising_start": 1.0 - (split / num_train_timesteps), + "image": latents, + }, + } + pipe_2(**inputs_2).images[0] + + assert expected_steps_2 == done_steps[len(expected_steps_1) :] + assert expected_steps == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}" + + steps = 10 + for split in [300, 700]: + for scheduler_cls_timesteps in [ + (EulerDiscreteScheduler, [901, 801, 701, 601, 501, 401, 301, 201, 101, 1]), + ( + HeunDiscreteScheduler, + [ + 901.0, + 801.0, + 801.0, + 701.0, + 701.0, + 601.0, + 601.0, + 501.0, + 501.0, + 401.0, + 401.0, + 301.0, + 301.0, + 201.0, + 201.0, + 101.0, + 101.0, + 1.0, + 1.0, + ], + ), + ]: + assert_run_mixture(steps, split, scheduler_cls_timesteps[0], scheduler_cls_timesteps[1]) + + @slow def test_stable_diffusion_two_xl_mixture_of_denoiser(self): components = self.get_dummy_components() pipe_1 = StableDiffusionXLPipeline(**components).to(torch_device) @@ -328,8 +429,13 @@ class scheduler_cls(scheduler_cls_orig): pipe_1.scheduler.set_timesteps(num_steps) expected_steps = pipe_1.scheduler.timesteps.tolist() - expected_steps_1 = list(filter(lambda ts: ts >= split, expected_tss)) - expected_steps_2 = list(filter(lambda ts: ts < split, expected_tss)) + if pipe_1.scheduler.order == 2: + expected_steps_1 = list(filter(lambda ts: ts >= split, expected_tss)) + expected_steps_2 = expected_steps_1[-1:] + list(filter(lambda ts: ts < split, expected_tss)) + expected_steps = expected_steps_1 + expected_steps_2 + else: + expected_steps_1 = list(filter(lambda ts: ts >= split, expected_tss)) + expected_steps_2 = list(filter(lambda ts: ts < split, expected_tss)) # now we monkey patch step `done_steps` # list into the step function for testing @@ -579,6 +685,7 @@ def new_step(self, *args, **kwargs): ]: assert_run_mixture(steps, split, scheduler_cls_timesteps[0], scheduler_cls_timesteps[1]) + @slow def test_stable_diffusion_three_xl_mixture_of_denoiser(self): components = self.get_dummy_components() pipe_1 = StableDiffusionXLPipeline(**components).to(torch_device) @@ -611,13 +718,18 @@ class scheduler_cls(scheduler_cls_orig): split_1_ts = num_train_timesteps - int(round(num_train_timesteps * split_1)) split_2_ts = num_train_timesteps - int(round(num_train_timesteps * split_2)) - expected_steps_1 = expected_steps[:split_1_ts] - expected_steps_2 = expected_steps[split_1_ts:split_2_ts] - expected_steps_3 = expected_steps[split_2_ts:] - expected_steps_1 = list(filter(lambda ts: ts >= split_1_ts, expected_steps)) - expected_steps_2 = list(filter(lambda ts: ts >= split_2_ts and ts < split_1_ts, expected_steps)) - expected_steps_3 = list(filter(lambda ts: ts < split_2_ts, expected_steps)) + if pipe_1.scheduler.order == 2: + expected_steps_1 = list(filter(lambda ts: ts >= split_1_ts, expected_steps)) + expected_steps_2 = expected_steps_1[-1:] + list( + 
filter(lambda ts: ts >= split_2_ts and ts < split_1_ts, expected_steps) + ) + expected_steps_3 = expected_steps_2[-1:] + list(filter(lambda ts: ts < split_2_ts, expected_steps)) + expected_steps = expected_steps_1 + expected_steps_2 + expected_steps_3 + else: + expected_steps_1 = list(filter(lambda ts: ts >= split_1_ts, expected_steps)) + expected_steps_2 = list(filter(lambda ts: ts >= split_2_ts and ts < split_1_ts, expected_steps)) + expected_steps_3 = list(filter(lambda ts: ts < split_2_ts, expected_steps)) # now we monkey patch step `done_steps` # list into the step function for testing diff --git a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_inpaint.py b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_inpaint.py index 7e3698d8ca16..8f1a983b562e 100644 --- a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_inpaint.py +++ b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_inpaint.py @@ -32,7 +32,7 @@ UNet2DConditionModel, UniPCMultistepScheduler, ) -from diffusers.utils.testing_utils import enable_full_determinism, floats_tensor, require_torch_gpu, torch_device +from diffusers.utils.testing_utils import enable_full_determinism, floats_tensor, require_torch_gpu, slow, torch_device from ..pipeline_params import TEXT_GUIDED_IMAGE_INPAINTING_BATCH_PARAMS, TEXT_GUIDED_IMAGE_INPAINTING_PARAMS from ..test_pipelines_common import PipelineLatentTesterMixin, PipelineTesterMixin @@ -294,6 +294,66 @@ def test_stable_diffusion_xl_refiner(self): assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2 + def test_stable_diffusion_two_xl_mixture_of_denoiser_fast(self): + components = self.get_dummy_components() + pipe_1 = StableDiffusionXLInpaintPipeline(**components).to(torch_device) + pipe_1.unet.set_default_attn_processor() + pipe_2 = StableDiffusionXLInpaintPipeline(**components).to(torch_device) + pipe_2.unet.set_default_attn_processor() + + def assert_run_mixture( + num_steps, split, scheduler_cls_orig, num_train_timesteps=pipe_1.scheduler.config.num_train_timesteps + ): + inputs = self.get_dummy_inputs(torch_device) + inputs["num_inference_steps"] = num_steps + + class scheduler_cls(scheduler_cls_orig): + pass + + pipe_1.scheduler = scheduler_cls.from_config(pipe_1.scheduler.config) + pipe_2.scheduler = scheduler_cls.from_config(pipe_2.scheduler.config) + + # Let's retrieve the number of timesteps we want to use + pipe_1.scheduler.set_timesteps(num_steps) + expected_steps = pipe_1.scheduler.timesteps.tolist() + + split_ts = num_train_timesteps - int(round(num_train_timesteps * split)) + + if pipe_1.scheduler.order == 2: + expected_steps_1 = list(filter(lambda ts: ts >= split_ts, expected_steps)) + expected_steps_2 = expected_steps_1[-1:] + list(filter(lambda ts: ts < split_ts, expected_steps)) + expected_steps = expected_steps_1 + expected_steps_2 + else: + expected_steps_1 = list(filter(lambda ts: ts >= split_ts, expected_steps)) + expected_steps_2 = list(filter(lambda ts: ts < split_ts, expected_steps)) + + # now we monkey patch step `done_steps` + # list into the step function for testing + done_steps = [] + old_step = copy.copy(scheduler_cls.step) + + def new_step(self, *args, **kwargs): + done_steps.append(args[1].cpu().item()) # args[1] is always the passed `t` + return old_step(self, *args, **kwargs) + + scheduler_cls.step = new_step + + inputs_1 = {**inputs, **{"denoising_end": split, "output_type": "latent"}} + latents = pipe_1(**inputs_1).images[0] + + assert expected_steps_1 == done_steps, f"Failure with 
{scheduler_cls.__name__} and {num_steps} and {split}" + + inputs_2 = {**inputs, **{"denoising_start": split, "image": latents}} + pipe_2(**inputs_2).images[0] + + assert expected_steps_2 == done_steps[len(expected_steps_1) :] + assert expected_steps == done_steps, f"Failure with {scheduler_cls.__name__} and {num_steps} and {split}" + + for steps in [7, 20]: + assert_run_mixture(steps, 0.33, EulerDiscreteScheduler) + assert_run_mixture(steps, 0.33, HeunDiscreteScheduler) + + @slow def test_stable_diffusion_two_xl_mixture_of_denoiser(self): components = self.get_dummy_components() pipe_1 = StableDiffusionXLInpaintPipeline(**components).to(torch_device) @@ -318,11 +378,14 @@ class scheduler_cls(scheduler_cls_orig): expected_steps = pipe_1.scheduler.timesteps.tolist() split_ts = num_train_timesteps - int(round(num_train_timesteps * split)) - expected_steps_1 = expected_steps[:split_ts] - expected_steps_2 = expected_steps[split_ts:] - expected_steps_1 = list(filter(lambda ts: ts >= split_ts, expected_steps)) - expected_steps_2 = list(filter(lambda ts: ts < split_ts, expected_steps)) + if pipe_1.scheduler.order == 2: + expected_steps_1 = list(filter(lambda ts: ts >= split_ts, expected_steps)) + expected_steps_2 = expected_steps_1[-1:] + list(filter(lambda ts: ts < split_ts, expected_steps)) + expected_steps = expected_steps_1 + expected_steps_2 + else: + expected_steps_1 = list(filter(lambda ts: ts >= split_ts, expected_steps)) + expected_steps_2 = list(filter(lambda ts: ts < split_ts, expected_steps)) # now we monkey patch step `done_steps` # list into the step function for testing @@ -357,6 +420,7 @@ def new_step(self, *args, **kwargs): ]: assert_run_mixture(steps, split, scheduler_cls) + @slow def test_stable_diffusion_three_xl_mixture_of_denoiser(self): components = self.get_dummy_components() pipe_1 = StableDiffusionXLInpaintPipeline(**components).to(torch_device) @@ -389,13 +453,18 @@ class scheduler_cls(scheduler_cls_orig): split_1_ts = num_train_timesteps - int(round(num_train_timesteps * split_1)) split_2_ts = num_train_timesteps - int(round(num_train_timesteps * split_2)) - expected_steps_1 = expected_steps[:split_1_ts] - expected_steps_2 = expected_steps[split_1_ts:split_2_ts] - expected_steps_3 = expected_steps[split_2_ts:] - expected_steps_1 = list(filter(lambda ts: ts >= split_1_ts, expected_steps)) - expected_steps_2 = list(filter(lambda ts: ts >= split_2_ts and ts < split_1_ts, expected_steps)) - expected_steps_3 = list(filter(lambda ts: ts < split_2_ts, expected_steps)) + if pipe_1.scheduler.order == 2: + expected_steps_1 = list(filter(lambda ts: ts >= split_1_ts, expected_steps)) + expected_steps_2 = expected_steps_1[-1:] + list( + filter(lambda ts: ts >= split_2_ts and ts < split_1_ts, expected_steps) + ) + expected_steps_3 = expected_steps_2[-1:] + list(filter(lambda ts: ts < split_2_ts, expected_steps)) + expected_steps = expected_steps_1 + expected_steps_2 + expected_steps_3 + else: + expected_steps_1 = list(filter(lambda ts: ts >= split_1_ts, expected_steps)) + expected_steps_2 = list(filter(lambda ts: ts >= split_2_ts and ts < split_1_ts, expected_steps)) + expected_steps_3 = list(filter(lambda ts: ts < split_2_ts, expected_steps)) # now we monkey patch step `done_steps` # list into the step function for testing diff --git a/tests/pipelines/test_pipelines.py b/tests/pipelines/test_pipelines.py index 13861b581c9b..875fd787c8b0 100644 --- a/tests/pipelines/test_pipelines.py +++ b/tests/pipelines/test_pipelines.py @@ -862,6 +862,58 @@ def test_run_custom_pipeline(self): # 
compare output to https://huggingface.co/hf-internal-testing/diffusers-dummy-pipeline/blob/main/pipeline.py#L102 assert output_str == "This is a test" + + def test_remote_components(self): + # make sure that trust remote code has to be passed + with self.assertRaises(ValueError): + pipeline = DiffusionPipeline.from_pretrained("hf-internal-testing/tiny-sdxl-custom-components") + + # Check that only loading custom components "my_unet", "my_scheduler" works + pipeline = DiffusionPipeline.from_pretrained( + "hf-internal-testing/tiny-sdxl-custom-components", trust_remote_code=True + ) + + assert pipeline.config.unet == ("diffusers_modules.local.my_unet_model", "MyUNetModel") + assert pipeline.config.scheduler == ("diffusers_modules.local.my_scheduler", "MyScheduler") + assert pipeline.__class__.__name__ == "StableDiffusionXLPipeline" + + pipeline = pipeline.to(torch_device) + images = pipeline("test", num_inference_steps=2, output_type="np")[0] + + assert images.shape == (1, 64, 64, 3) + + # Check that only loading custom components "my_unet", "my_scheduler" and explicit custom pipeline works + pipeline = DiffusionPipeline.from_pretrained( + "hf-internal-testing/tiny-sdxl-custom-components", custom_pipeline="my_pipeline", trust_remote_code=True + ) + + assert pipeline.config.unet == ("diffusers_modules.local.my_unet_model", "MyUNetModel") + assert pipeline.config.scheduler == ("diffusers_modules.local.my_scheduler", "MyScheduler") + assert pipeline.__class__.__name__ == "MyPipeline" + + pipeline = pipeline.to(torch_device) + images = pipeline("test", num_inference_steps=2, output_type="np")[0] + + assert images.shape == (1, 64, 64, 3) + + def test_remote_auto_custom_pipe(self): + # make sure that trust remote code has to be passed + with self.assertRaises(ValueError): + pipeline = DiffusionPipeline.from_pretrained("hf-internal-testing/tiny-sdxl-custom-all") + + # Check that only loading custom components "my_unet", "my_scheduler" and auto custom pipeline works + pipeline = DiffusionPipeline.from_pretrained( + "hf-internal-testing/tiny-sdxl-custom-all", trust_remote_code=True + ) + + assert pipeline.config.unet == ("diffusers_modules.local.my_unet_model", "MyUNetModel") + assert pipeline.config.scheduler == ("diffusers_modules.local.my_scheduler", "MyScheduler") + assert pipeline.__class__.__name__ == "MyPipeline" + + pipeline = pipeline.to(torch_device) + images = pipeline("test", num_inference_steps=2, output_type="np")[0] + + assert images.shape == (1, 64, 64, 3) + def test_local_custom_pipeline_repo(self): local_custom_pipeline_path = get_tests_dir("fixtures/custom_pipeline") pipeline = DiffusionPipeline.from_pretrained( diff --git a/tests/schedulers/test_scheduler_lcm.py b/tests/schedulers/test_scheduler_lcm.py new file mode 100644 index 000000000000..48b68fa47ddc --- /dev/null +++ b/tests/schedulers/test_scheduler_lcm.py @@ -0,0 +1,244 @@ +import tempfile +from typing import Dict, List, Tuple + +import torch + +from diffusers import LCMScheduler +from diffusers.utils.testing_utils import torch_device + +from .test_schedulers import SchedulerCommonTest + + +class LCMSchedulerTest(SchedulerCommonTest): + scheduler_classes = (LCMScheduler,) + forward_default_kwargs = (("num_inference_steps", 10),) + + def get_scheduler_config(self, **kwargs): + config = { + "num_train_timesteps": 1000, + "beta_start": 0.00085, + "beta_end": 0.0120, + "beta_schedule": "scaled_linear", + "prediction_type": "epsilon", + } + + config.update(**kwargs) + return config + + @property + def default_valid_timestep(self):
kwargs = dict(self.forward_default_kwargs) + num_inference_steps = kwargs.pop("num_inference_steps", None) + + scheduler_config = self.get_scheduler_config() + scheduler = self.scheduler_classes[0](**scheduler_config) + + scheduler.set_timesteps(num_inference_steps) + timestep = scheduler.timesteps[-1] + return timestep + + def test_timesteps(self): + for timesteps in [100, 500, 1000]: + # 0 is not guaranteed to be in the timestep schedule, but timesteps - 1 is + self.check_over_configs(time_step=timesteps - 1, num_train_timesteps=timesteps) + + def test_betas(self): + for beta_start, beta_end in zip([0.0001, 0.001, 0.01, 0.1], [0.002, 0.02, 0.2, 2]): + self.check_over_configs(time_step=self.default_valid_timestep, beta_start=beta_start, beta_end=beta_end) + + def test_schedules(self): + for schedule in ["linear", "scaled_linear", "squaredcos_cap_v2"]: + self.check_over_configs(time_step=self.default_valid_timestep, beta_schedule=schedule) + + def test_prediction_type(self): + for prediction_type in ["epsilon", "v_prediction"]: + self.check_over_configs(time_step=self.default_valid_timestep, prediction_type=prediction_type) + + def test_clip_sample(self): + for clip_sample in [True, False]: + self.check_over_configs(time_step=self.default_valid_timestep, clip_sample=clip_sample) + + def test_thresholding(self): + self.check_over_configs(time_step=self.default_valid_timestep, thresholding=False) + for threshold in [0.5, 1.0, 2.0]: + for prediction_type in ["epsilon", "v_prediction"]: + self.check_over_configs( + time_step=self.default_valid_timestep, + thresholding=True, + prediction_type=prediction_type, + sample_max_value=threshold, + ) + + def test_time_indices(self): + # Get default timestep schedule. + kwargs = dict(self.forward_default_kwargs) + num_inference_steps = kwargs.pop("num_inference_steps", None) + + scheduler_config = self.get_scheduler_config() + scheduler = self.scheduler_classes[0](**scheduler_config) + + scheduler.set_timesteps(num_inference_steps) + timesteps = scheduler.timesteps + for t in timesteps: + self.check_over_forward(time_step=t) + + def test_inference_steps(self): + # Hardcoded for now + for t, num_inference_steps in zip([99, 39, 19], [10, 25, 50]): + self.check_over_forward(time_step=t, num_inference_steps=num_inference_steps) + + # Override test_add_noise_device because the hardcoded num_inference_steps of 100 doesn't work + # for LCMScheduler under default settings + def test_add_noise_device(self, num_inference_steps=10): + for scheduler_class in self.scheduler_classes: + scheduler_config = self.get_scheduler_config() + scheduler = scheduler_class(**scheduler_config) + scheduler.set_timesteps(num_inference_steps) + + sample = self.dummy_sample.to(torch_device) + scaled_sample = scheduler.scale_model_input(sample, 0.0) + self.assertEqual(sample.shape, scaled_sample.shape) + + noise = torch.randn_like(scaled_sample).to(torch_device) + t = scheduler.timesteps[5][None] + noised = scheduler.add_noise(scaled_sample, noise, t) + self.assertEqual(noised.shape, scaled_sample.shape) + + # Override test_from_save_pretrained because it hardcodes a timestep of 1 + def test_from_save_pretrained(self): + kwargs = dict(self.forward_default_kwargs) + num_inference_steps = kwargs.pop("num_inference_steps", None) + + for scheduler_class in self.scheduler_classes: + timestep = self.default_valid_timestep + + scheduler_config = self.get_scheduler_config() + scheduler = scheduler_class(**scheduler_config) + + sample = self.dummy_sample + residual = 0.1 * sample + + with 
tempfile.TemporaryDirectory() as tmpdirname: + scheduler.save_config(tmpdirname) + new_scheduler = scheduler_class.from_pretrained(tmpdirname) + + scheduler.set_timesteps(num_inference_steps) + new_scheduler.set_timesteps(num_inference_steps) + + kwargs["generator"] = torch.manual_seed(0) + output = scheduler.step(residual, timestep, sample, **kwargs).prev_sample + + kwargs["generator"] = torch.manual_seed(0) + new_output = new_scheduler.step(residual, timestep, sample, **kwargs).prev_sample + + assert torch.sum(torch.abs(output - new_output)) < 1e-5, "Scheduler outputs are not identical" + + # Override test_step_shape because uses 0 and 1 as hardcoded timesteps + def test_step_shape(self): + kwargs = dict(self.forward_default_kwargs) + num_inference_steps = kwargs.pop("num_inference_steps", None) + + for scheduler_class in self.scheduler_classes: + scheduler_config = self.get_scheduler_config() + scheduler = scheduler_class(**scheduler_config) + + sample = self.dummy_sample + residual = 0.1 * sample + + scheduler.set_timesteps(num_inference_steps) + + timestep_0 = scheduler.timesteps[-2] + timestep_1 = scheduler.timesteps[-1] + + output_0 = scheduler.step(residual, timestep_0, sample, **kwargs).prev_sample + output_1 = scheduler.step(residual, timestep_1, sample, **kwargs).prev_sample + + self.assertEqual(output_0.shape, sample.shape) + self.assertEqual(output_0.shape, output_1.shape) + + # Override test_set_scheduler_outputs_equivalence since it uses 0 as a hardcoded timestep + def test_scheduler_outputs_equivalence(self): + def set_nan_tensor_to_zero(t): + t[t != t] = 0 + return t + + def recursive_check(tuple_object, dict_object): + if isinstance(tuple_object, (List, Tuple)): + for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object.values()): + recursive_check(tuple_iterable_value, dict_iterable_value) + elif isinstance(tuple_object, Dict): + for tuple_iterable_value, dict_iterable_value in zip(tuple_object.values(), dict_object.values()): + recursive_check(tuple_iterable_value, dict_iterable_value) + elif tuple_object is None: + return + else: + self.assertTrue( + torch.allclose( + set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5 + ), + msg=( + "Tuple and dict output are not equal. Difference:" + f" {torch.max(torch.abs(tuple_object - dict_object))}. Tuple has `nan`:" + f" {torch.isnan(tuple_object).any()} and `inf`: {torch.isinf(tuple_object)}. Dict has" + f" `nan`: {torch.isnan(dict_object).any()} and `inf`: {torch.isinf(dict_object)}." 
+ ), + ) + + kwargs = dict(self.forward_default_kwargs) + num_inference_steps = kwargs.pop("num_inference_steps", 50) + + timestep = self.default_valid_timestep + + for scheduler_class in self.scheduler_classes: + scheduler_config = self.get_scheduler_config() + scheduler = scheduler_class(**scheduler_config) + + sample = self.dummy_sample + residual = 0.1 * sample + + scheduler.set_timesteps(num_inference_steps) + kwargs["generator"] = torch.manual_seed(0) + outputs_dict = scheduler.step(residual, timestep, sample, **kwargs) + + scheduler.set_timesteps(num_inference_steps) + kwargs["generator"] = torch.manual_seed(0) + outputs_tuple = scheduler.step(residual, timestep, sample, return_dict=False, **kwargs) + + recursive_check(outputs_tuple, outputs_dict) + + def full_loop(self, num_inference_steps=10, seed=0, **config): + scheduler_class = self.scheduler_classes[0] + scheduler_config = self.get_scheduler_config(**config) + scheduler = scheduler_class(**scheduler_config) + + model = self.dummy_model() + sample = self.dummy_sample_deter + generator = torch.manual_seed(seed) + + scheduler.set_timesteps(num_inference_steps) + + for t in scheduler.timesteps: + residual = model(sample, t) + sample = scheduler.step(residual, t, sample, generator).prev_sample + + return sample + + def test_full_loop_onestep(self): + sample = self.full_loop(num_inference_steps=1) + + result_sum = torch.sum(torch.abs(sample)) + result_mean = torch.mean(torch.abs(sample)) + + # TODO: get expected sum and mean + assert abs(result_sum.item() - 18.7097) < 1e-2 + assert abs(result_mean.item() - 0.0244) < 1e-3 + + def test_full_loop_multistep(self): + sample = self.full_loop(num_inference_steps=10) + + result_sum = torch.sum(torch.abs(sample)) + result_mean = torch.mean(torch.abs(sample)) + + # TODO: get expected sum and mean + assert abs(result_sum.item() - 280.5618) < 1e-2 + assert abs(result_mean.item() - 0.3653) < 1e-3
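The `full_loop` helper in the new test file mirrors how `LCMScheduler` is driven in practice: build the scheduler, call `set_timesteps(...)`, then repeatedly call `step(...)` on the model output and keep `prev_sample`. Below is a minimal standalone sketch of that loop for orientation only; the small `UNet2DModel` configuration, the tensor shapes, and the choice of 4 inference steps are illustrative assumptions and are not part of this diff.

import torch

from diffusers import LCMScheduler, UNet2DModel

# Same scheduler config as `get_scheduler_config` in the test above.
scheduler = LCMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.0120,
    beta_schedule="scaled_linear",
    prediction_type="epsilon",
)

# A tiny, randomly initialized UNet standing in for a real denoiser (placeholder config).
unet = UNet2DModel(
    sample_size=32,
    in_channels=3,
    out_channels=3,
    block_out_channels=(32, 64),
    down_block_types=("DownBlock2D", "AttnDownBlock2D"),
    up_block_types=("AttnUpBlock2D", "UpBlock2D"),
)

sample = torch.randn(1, 3, 32, 32)  # start from pure noise
generator = torch.manual_seed(0)

scheduler.set_timesteps(num_inference_steps=4)
for t in scheduler.timesteps:
    with torch.no_grad():
        residual = unet(sample, t).sample  # predicted noise (epsilon)
    # The generator matters because the LCM update re-noises the sample between steps.
    sample = scheduler.step(residual, t, sample, generator=generator).prev_sample

That re-noising step is also why the tests above reseed a generator before every `step(...)` call when comparing outputs.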