diff --git a/.github/ISSUE_TEMPLATE/bug-report.yml b/.github/ISSUE_TEMPLATE/bug-report.yml index e9e672e72dd5..0073cb793698 100644 --- a/.github/ISSUE_TEMPLATE/bug-report.yml +++ b/.github/ISSUE_TEMPLATE/bug-report.yml @@ -1,5 +1,5 @@ name: "\U0001F41B Bug Report" -description: Report a bug on diffusers +description: Report a bug on Diffusers labels: [ "bug" ] body: - type: markdown @@ -10,7 +10,7 @@ body: Thus, issues are of the same importance as pull requests when contributing to this library ❀️. In order to make your issue as **useful for the community as possible**, let's try to stick to some simple guidelines: - 1. Please try to be as precise and concise as possible. - *Give your issue a fitting title. Assume that someone which very limited knowledge of diffusers can understand your issue. Add links to the source code, documentation other issues, pull requests etc...* + *Give your issue a fitting title. Assume that someone with very limited knowledge of Diffusers can understand your issue. Add links to the source code, documentation, other issues, pull requests etc...* - 2. If your issue is about something not working, **always** provide a reproducible code snippet. The reader should be able to reproduce your issue by **only copy-pasting your code snippet into a Python shell**. *The community cannot solve your issue if it cannot reproduce it. If your bug is related to training, add your training script and make everything needed to train public. Otherwise, just add a simple Python code snippet.* - 3. Add the **minimum** amount of code / context that is needed to understand, reproduce your issue. @@ -19,7 +19,7 @@ body: - type: markdown attributes: value: | - For more in-detail information on how to write good issues you can have a look [here](https://huggingface.co/course/chapter8/5?fw=pt) + For more in-detail information on how to write good issues you can have a look [here](https://huggingface.co/course/chapter8/5?fw=pt). - type: textarea id: bug-description attributes: @@ -47,7 +47,7 @@ body: attributes: label: System Info description: Please share your system info with us. You can run the command `diffusers-cli env` and copy-paste its output below. - placeholder: diffusers version, platform, python version, ... + placeholder: Diffusers version, platform, Python version, ... validations: required: true - type: textarea @@ -55,7 +55,7 @@ body: attributes: label: Who can help? description: | - Your issue will be replied to more quickly if you can figure out the right person to tag with @ + Your issue will be replied to more quickly if you can figure out the right person to tag with @. If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of **who to tag**.
All issues are read by one of the core maintainers, so if you don't know who to tag, just leave this blank and @@ -66,7 +66,7 @@ body: Questions on DiffusionPipeline (Saving, Loading, From pretrained, ...): Questions on pipelines: - - Stable Diffusion @yiyixuxu @DN6 @patrickvonplaten @sayakpaul @patrickvonplaten + - Stable Diffusion @yiyixuxu @DN6 @sayakpaul @patrickvonplaten - Stable Diffusion XL @yiyixuxu @sayakpaul @DN6 @patrickvonplaten - Kandinsky @yiyixuxu @patrickvonplaten - ControlNet @sayakpaul @yiyixuxu @DN6 @patrickvonplaten diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml index 304c02ca9cc4..ffc3ddc5dc39 100644 --- a/.github/ISSUE_TEMPLATE/config.yml +++ b/.github/ISSUE_TEMPLATE/config.yml @@ -1,7 +1,4 @@ contact_links: - - name: Blank issue - url: https://github.com/huggingface/diffusers/issues/new - about: Other - name: Forum - url: https://discuss.huggingface.co/ - about: General usage questions and community discussions \ No newline at end of file + url: https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63 + about: General usage questions and community discussions diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md index 24405ec4fa1d..42f93232c1de 100644 --- a/.github/ISSUE_TEMPLATE/feature_request.md +++ b/.github/ISSUE_TEMPLATE/feature_request.md @@ -1,5 +1,5 @@ --- -name: "\U0001F680 Feature request" +name: "\U0001F680 Feature Request" about: Suggest an idea for this project title: '' labels: '' @@ -8,13 +8,13 @@ assignees: '' --- **Is your feature request related to a problem? Please describe.** -A clear and concise description of what the problem is. Ex. I'm always frustrated when [...] +A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]. -**Describe the solution you'd like** +**Describe the solution you'd like.** A clear and concise description of what you want to happen. -**Describe alternatives you've considered** +**Describe alternatives you've considered.** A clear and concise description of any alternative solutions or features you've considered. -**Additional context** +**Additional context.** Add any other context or screenshots about the feature request here. diff --git a/.github/ISSUE_TEMPLATE/new-model-addition.yml b/.github/ISSUE_TEMPLATE/new-model-addition.yml index 2055599e44cd..432e287dd334 100644 --- a/.github/ISSUE_TEMPLATE/new-model-addition.yml +++ b/.github/ISSUE_TEMPLATE/new-model-addition.yml @@ -1,5 +1,5 @@ -name: "\U0001F31F New model/pipeline/scheduler addition" -description: Submit a proposal/request to implement a new diffusion model / pipeline / scheduler +name: "\U0001F31F New Model/Pipeline/Scheduler Addition" +description: Submit a proposal/request to implement a new diffusion model/pipeline/scheduler labels: [ "New model/pipeline/scheduler" ] body: @@ -19,7 +19,7 @@ body: description: | Please note that if the model implementation isn't available or if the weights aren't open-source, we are less likely to implement it in `diffusers`. options: - - label: "The model implementation is available" + - label: "The model implementation is available." - label: "The model weights are available (Only relevant if addition is not a scheduler)." 
- type: textarea diff --git a/.github/ISSUE_TEMPLATE/translate.md b/.github/ISSUE_TEMPLATE/translate.md new file mode 100644 index 000000000000..3471ec9640d7 --- /dev/null +++ b/.github/ISSUE_TEMPLATE/translate.md @@ -0,0 +1,29 @@ +--- +name: 🌐 Translating a New Language? +about: Start a new translation effort in your language +title: '[] Translating docs to ' +labels: WIP +assignees: '' + +--- + + + +Hi! + +Let's bring the documentation to all the -speaking community 🌐. + +Who would want to translate? Please follow the πŸ€— [TRANSLATING guide](https://github.com/huggingface/diffusers/blob/main/docs/TRANSLATING.md). Here is a list of the files ready for translation. Let us know in this issue if you'd like to translate any, and we'll add your name to the list. + +Some notes: + +* Please translate using an informal tone (imagine you are talking with a friend about Diffusers πŸ€—). +* Please translate in a gender-neutral way. +* Add your translations to the folder called `` inside the [source folder](https://github.com/huggingface/diffusers/tree/main/docs/source). +* Register your translation in `/_toctree.yml`; please follow the order of the [English version](https://github.com/huggingface/diffusers/blob/main/docs/source/en/_toctree.yml). +* Once you're finished, open a pull request and tag this issue by including #issue-number in the description, where issue-number is the number of this issue. Please ping @stevhliu for review. +* πŸ™‹ If you'd like others to help you with the translation, you can also post in the πŸ€— [forums](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63). + +Thank you so much for your help! πŸ€— diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md index d8c6a821a3b8..53be591fe2a6 100644 --- a/.github/PULL_REQUEST_TEMPLATE.md +++ b/.github/PULL_REQUEST_TEMPLATE.md @@ -19,10 +19,10 @@ Fixes # (issue) - [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case). - [ ] Did you read the [contributor guideline](https://github.com/huggingface/diffusers/blob/main/CONTRIBUTING.md)? - [ ] Did you read our [philosophy doc](https://github.com/huggingface/diffusers/blob/main/PHILOSOPHY.md) (important for complex PRs)? -- [ ] Was this discussed/approved via a Github issue or the [forum](https://discuss.huggingface.co/)? Please add a link to it if that's the case. +- [ ] Was this discussed/approved via a GitHub issue or the [forum](https://discuss.huggingface.co/c/discussion-related-to-httpsgithubcomhuggingfacediffusers/63)? Please add a link to it if that's the case. - [ ] Did you make sure to update the documentation with your changes? Here are the [documentation guidelines](https://github.com/huggingface/diffusers/tree/main/docs), and - [here are tips on formatting docstrings](https://github.com/huggingface/transformers/tree/main/docs#writing-source-documentation). + [here are tips on formatting docstrings](https://github.com/huggingface/diffusers/tree/main/docs#writing-source-documentation). - [ ] Did you write any new necessary tests? @@ -31,7 +31,7 @@ Fixes # (issue) Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR. - +


@@ -14,7 +30,10 @@ GitHub release - Contributor Covenant + Contributor Covenant + + + X account

@@ -26,11 +45,11 @@ - State-of-the-art [diffusion pipelines](https://huggingface.co/docs/diffusers/api/pipelines/overview) that can be run in inference with just a few lines of code. - Interchangeable noise [schedulers](https://huggingface.co/docs/diffusers/api/schedulers/overview) for different diffusion speeds and output quality. -- Pretrained [models](https://huggingface.co/docs/diffusers/api/models) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems. +- Pretrained [models](https://huggingface.co/docs/diffusers/api/models/overview) that can be used as building blocks, and combined with schedulers, for creating your own end-to-end diffusion systems. ## Installation -We recommend installing πŸ€— Diffusers in a virtual environment from PyPi or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/) and [Flax](https://flax.readthedocs.io/en/latest/#installation), please refer to their official documentation. +We recommend installing πŸ€— Diffusers in a virtual environment from PyPI or Conda. For more details about installing [PyTorch](https://pytorch.org/get-started/locally/) and [Flax](https://flax.readthedocs.io/en/latest/#installation), please refer to their official documentation. ### PyTorch @@ -60,7 +79,7 @@ Please refer to the [How to use Stable Diffusion in Apple Silicon](https://huggi ## Quickstart -Generating outputs is super easy with πŸ€— Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 4000+ checkpoints): +Generating outputs is super easy with πŸ€— Diffusers. To generate an image from text, use the `from_pretrained` method to load any pretrained diffusion model (browse the [Hub](https://huggingface.co/models?library=diffusers&sort=downloads) for 15000+ checkpoints): ```python from diffusers import DiffusionPipeline @@ -77,14 +96,13 @@ You can also dig into the models and schedulers toolbox to build your own diffus from diffusers import DDPMScheduler, UNet2DModel from PIL import Image import torch -import numpy as np scheduler = DDPMScheduler.from_pretrained("google/ddpm-cat-256") model = UNet2DModel.from_pretrained("google/ddpm-cat-256").to("cuda") scheduler.set_timesteps(50) sample_size = model.config.sample_size -noise = torch.randn((1, 3, sample_size, sample_size)).to("cuda") +noise = torch.randn((1, 3, sample_size, sample_size), device="cuda") input = noise for t in scheduler.timesteps: @@ -119,8 +137,7 @@ You can look out for [issues](https://github.com/huggingface/diffusers/issues) y - See [New model/pipeline](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+pipeline%2Fmodel%22) to contribute exciting new diffusion models / diffusion pipelines - See [New scheduler](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22New+scheduler%22) -Also, say πŸ‘‹ in our public Discord channel Join us on Discord. We discuss the hottest trends about diffusion models, help each other with contributions, personal projects or -just hang out β˜•. +Also, say πŸ‘‹ in our public Discord channel Join us on Discord. We discuss the hottest trends about diffusion models, help each other with contributions, personal projects or just hang out β˜•. ## Popular Tasks & Pipelines @@ -143,12 +160,12 @@ just hang out β˜•. 
Text-to-Image - unclip + unCLIP kakaobrain/karlo-v1-alpha Text-to-Image - DeepFloyd IF + DeepFloyd IF DeepFloyd/IF-I-XL-v1.0 @@ -158,12 +175,12 @@ just hang out β˜•. Text-guided Image-to-Image - Controlnet + ControlNet lllyasviel/sd-controlnet-canny Text-guided Image-to-Image - Instruct Pix2Pix + InstructPix2Pix timbrooks/instruct-pix2pix @@ -173,7 +190,7 @@ just hang out β˜•. Text-guided Image Inpainting - Stable Diffusion Inpaint + Stable Diffusion Inpainting runwayml/stable-diffusion-inpainting @@ -204,9 +221,9 @@ just hang out β˜•. - https://github.com/deep-floyd/IF - https://github.com/bentoml/BentoML - https://github.com/bmaltais/kohya_ss -- +3000 other amazing GitHub repositories πŸ’ͺ +- +6000 other amazing GitHub repositories πŸ’ͺ -Thank you for using us ❀️ +Thank you for using us ❀️. ## Credits diff --git a/docs/TRANSLATING.md b/docs/TRANSLATING.md index b5a88812f30a..b64ac9fd8d68 100644 --- a/docs/TRANSLATING.md +++ b/docs/TRANSLATING.md @@ -1,10 +1,22 @@ + + ### Translating the Diffusers documentation into your language As part of our mission to democratize machine learning, we'd love to make the Diffusers library available in many more languages! Follow the steps below if you want to help translate the documentation into your language πŸ™. **πŸ—žοΈ Open an issue** -To get started, navigate to the [Issues](https://github.com/huggingface/diffusers/issues) page of this repo and check if anyone else has opened an issue for your language. If not, open a new issue by selecting the "Translation template" from the "New issue" button. +To get started, navigate to the [Issues](https://github.com/huggingface/diffusers/issues) page of this repo and check if anyone else has opened an issue for your language. If not, open a new issue by selecting the "🌐 Translating a New Language?" from the "New issue" button. Once an issue exists, post a comment to indicate which chapters you'd like to work on, and we'll add your name to the list. @@ -16,7 +28,7 @@ First, you'll need to [fork the Diffusers repo](https://docs.github.com/en/get-s Once you've forked the repo, you'll want to get the files on your local machine for editing. You can do that by cloning the fork with Git as follows: ```bash -git clone https://github.com/YOUR-USERNAME/diffusers.git +git clone https://github.com/<YOUR-USERNAME>/diffusers.git ``` **πŸ“‹ Copy-paste the English version with a new language code** @@ -29,10 +41,10 @@ You'll only need to copy the files in the [`docs/source/en`](https://github.com/ ```bash cd ~/path/to/diffusers/docs -cp -r source/en source/LANG-ID +cp -r source/en source/<LANG-ID> ``` -Here, `LANG-ID` should be one of the ISO 639-1 or ISO 639-2 language codes -- see [here](https://www.loc.gov/standards/iso639-2/php/code_list.php) for a handy table. +Here, `<LANG-ID>` should be one of the ISO 639-1 or ISO 639-2 language codes -- see [here](https://www.loc.gov/standards/iso639-2/php/code_list.php) for a handy table. **✍️ Start translating** The fun part comes - translating the text! The first thing we recommend is translating the part of the `_toctree.yml` file that corresponds to your doc chapter. This file is used to render the table of contents on the website. -> πŸ™‹ If the `_toctree.yml` file doesn't yet exist for your language, you can create one by copy-pasting from the English version and deleting the sections unrelated to your chapter. Just make sure it exists in the `docs/source/LANG-ID/` directory!
+> πŸ™‹ If the `_toctree.yml` file doesn't yet exist for your language, you can create one by copy-pasting from the English version and deleting the sections unrelated to your chapter. Just make sure it exists in the `docs/source/<LANG-ID>/` directory! The fields you should add are `local` (with the name of the file containing the translation; e.g. `autoclass_tutorial`), and `title` (with the title of the doc in your language; e.g. `Load pretrained instances with an AutoClass`) -- as a reference, here is the `_toctree.yml` for [English](https://github.com/huggingface/diffusers/blob/main/docs/source/en/_toctree.yml): diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml index c7c330f000d0..9d3c0b462d88 100644 --- a/docs/source/en/_toctree.yml +++ b/docs/source/en/_toctree.yml @@ -19,7 +19,6 @@ title: Train a diffusion model - local: tutorials/using_peft_for_inference title: Inference with PEFT - title: Tutorials - sections: - sections: - local: using-diffusers/loading_overview @@ -72,8 +71,6 @@ title: Overview - local: using-diffusers/sdxl title: Stable Diffusion XL - - local: using-diffusers/lcm - title: Latent Consistency Models - local: using-diffusers/kandinsky title: Kandinsky - local: using-diffusers/controlnet @@ -92,6 +89,10 @@ title: Community pipelines - local: using-diffusers/contribute_pipeline title: Contribute a community pipeline + - local: using-diffusers/inference_with_lcm_lora + title: Latent Consistency Model-LoRA + - local: using-diffusers/inference_with_lcm + title: Latent Consistency Model title: Specific pipeline examples - sections: - local: training/overview @@ -100,26 +101,36 @@ title: Create a dataset for training - local: training/adapt_a_model title: Adapt a model to a new task - - local: training/unconditional_training - title: Unconditional image generation - - local: training/text_inversion - title: Textual Inversion - - local: training/dreambooth - title: DreamBooth - - local: training/text2image - title: Text-to-image - - local: training/lora - title: Low-Rank Adaptation of Large Language Models (LoRA) - - local: training/controlnet - title: ControlNet - - local: training/instructpix2pix - title: InstructPix2Pix Training - - local: training/custom_diffusion - title: Custom Diffusion - - local: training/t2i_adapters - title: T2I-Adapters - - local: training/ddpo - title: Reinforcement learning training with DDPO + - sections: + - local: training/unconditional_training + title: Unconditional image generation + - local: training/text2image + title: Text-to-image + - local: training/sdxl + title: Stable Diffusion XL + - local: training/kandinsky + title: Kandinsky 2.2 + - local: training/wuerstchen + title: Wuerstchen + - local: training/controlnet + title: ControlNet + - local: training/t2i_adapters + title: T2I-Adapters + - local: training/instructpix2pix + title: InstructPix2Pix + title: Models + - sections: + - local: training/text_inversion + title: Textual Inversion + - local: training/dreambooth + title: DreamBooth + - local: training/lora + title: LoRA + - local: training/custom_diffusion + title: Custom Diffusion + - local: training/ddpo + title: Reinforcement learning training with DDPO + title: Methods title: Training - sections: - local: using-diffusers/other-modalities @@ -231,7 +242,7 @@ - local: api/pipelines/auto_pipeline title: AutoPipeline - local: api/pipelines/blip_diffusion - title: BLIP Diffusion + title: BLIP-Diffusion - local: api/pipelines/consistency_models title: Consistency Models - local: api/pipelines/controlnet @@ -267,13
+278,13 @@ - local: api/pipelines/musicldm title: MusicLDM - local: api/pipelines/paint_by_example - title: Paint By Example + title: Paint by Example - local: api/pipelines/paradigms title: Parallel Sampling of Diffusion Models - local: api/pipelines/pix2pix_zero title: Pix2Pix Zero - local: api/pipelines/pixart - title: PixArt + title: PixArt-Ξ± - local: api/pipelines/pndm title: PNDM - local: api/pipelines/repaint diff --git a/docs/source/en/api/pipelines/alt_diffusion.md b/docs/source/en/api/pipelines/alt_diffusion.md index a8bd115db450..d0326affbb63 100644 --- a/docs/source/en/api/pipelines/alt_diffusion.md +++ b/docs/source/en/api/pipelines/alt_diffusion.md @@ -16,7 +16,7 @@ AltDiffusion was proposed in [AltCLIP: Altering the Language Encoder in CLIP for The abstract from the paper is: -*In this work, we present a conceptually simple and effective method to train a strong bilingual multimodal representation model. Starting from the pretrained multimodal representation model CLIP released by OpenAI, we switched its text encoder with a pretrained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k- CN, and COCO-CN. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding.* +*In this work, we present a conceptually simple and effective method to train a strong bilingual/multilingual multimodal representation model. Starting from the pre-trained multimodal representation model CLIP released by OpenAI, we altered its text encoder with a pre-trained multilingual text encoder XLM-R, and aligned both languages and image representations by a two-stage training schema consisting of teacher learning and contrastive learning. We validate our method through evaluations of a wide range of tasks. We set new state-of-the-art performances on a bunch of tasks including ImageNet-CN, Flicker30k-CN, COCO-CN and XTD. Further, we obtain very close performances with CLIP on almost all tasks, suggesting that one can simply alter the text encoder in CLIP for extended capabilities such as multilingual understanding. Our models and code are available at [this https URL](https://github.com/FlagAI-Open/FlagAI).* ## Tips @@ -44,4 +44,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) [[autodoc]] pipelines.alt_diffusion.AltDiffusionPipelineOutput - all - - __call__ \ No newline at end of file + - __call__ diff --git a/docs/source/en/api/pipelines/animatediff.md b/docs/source/en/api/pipelines/animatediff.md index 6e328c2f7a4c..422d345b9057 100644 --- a/docs/source/en/api/pipelines/animatediff.md +++ b/docs/source/en/api/pipelines/animatediff.md @@ -14,11 +14,11 @@ specific language governing permissions and limitations under the License. ## Overview -[AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://arxiv.org/abs/2307.04725) by Yuwei Guo, Ceyuan Yang*, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai +[AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning](https://arxiv.org/abs/2307.04725) by Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, Bo Dai. 
The abstract of the paper is the following: -With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs. Code and pre-trained weights will be publicly available at this https URL . +*With the advance of text-to-image models (e.g., Stable Diffusion) and corresponding personalization techniques such as DreamBooth and LoRA, everyone can manifest their imagination into high-quality images at an affordable cost. Subsequently, there is a great demand for image animation techniques to further combine generated static images with motion dynamics. In this report, we propose a practical framework to animate most of the existing personalized text-to-image models once and for all, saving efforts in model-specific tuning. At the core of the proposed framework is to insert a newly initialized motion modeling module into the frozen text-to-image model and train it on video clips to distill reasonable motion priors. Once trained, by simply injecting this motion modeling module, all personalized versions derived from the same base T2I readily become text-driven models that produce diverse and personalized animated images. We conduct our evaluation on several public representative personalized text-to-image models across anime pictures and realistic photographs, and demonstrate that our proposed framework helps these models generate temporally smooth animation clips while preserving the domain and diversity of their outputs. Code and pre-trained weights will be publicly available at [this https URL](https://animatediff.github.io/).* ## Available Pipelines @@ -28,7 +28,7 @@ With the advance of text-to-image models (e.g., Stable Diffusion) and correspond ## Available checkpoints -Motion Adapter checkpoints can be found under [guoyww](https://huggingface.co/guoyww/). These checkpoints are meant to work with any model based on Stable Diffusion 1.4/1.5 +Motion Adapter checkpoints can be found under [guoyww](https://huggingface.co/guoyww/). These checkpoints are meant to work with any model based on Stable Diffusion 1.4/1.5. 
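As a quick illustration of how a Motion Adapter checkpoint plugs into a Stable Diffusion 1.4/1.5 base model, here is a minimal sketch; the hub IDs (`guoyww/animatediff-motion-adapter-v1-5-2`, `runwayml/stable-diffusion-v1-5`) and scheduler settings are illustrative assumptions, and the fuller usage example in the doc below remains the reference.

```python
import torch
from diffusers import AnimateDiffPipeline, DDIMScheduler, MotionAdapter
from diffusers.utils import export_to_gif

# Motion modeling module trained on video clips (assumed checkpoint name)
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")

# Any Stable Diffusion 1.4/1.5-based model can serve as the frozen text-to-image backbone
pipe = AnimateDiffPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", motion_adapter=adapter, torch_dtype=torch.float16
).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(
    pipe.scheduler.config, beta_schedule="linear", clip_sample=False
)

output = pipe(
    prompt="a panda surfing a wave, high quality",
    num_frames=16,
    num_inference_steps=25,
    guidance_scale=7.5,
)
export_to_gif(output.frames[0], "animation.gif")
```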
## Usage example @@ -211,6 +211,11 @@ export_to_gif(frames, "animation.gif") + + +Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + ## AnimateDiffPipeline @@ -227,4 +232,3 @@ export_to_gif(frames, "animation.gif") ## AnimateDiffPipelineOutput [[autodoc]] pipelines.animatediff.AnimateDiffPipelineOutput - diff --git a/docs/source/en/api/pipelines/attend_and_excite.md b/docs/source/en/api/pipelines/attend_and_excite.md index b61e24823e46..94f33cf1d0b6 100644 --- a/docs/source/en/api/pipelines/attend_and_excite.md +++ b/docs/source/en/api/pipelines/attend_and_excite.md @@ -16,7 +16,7 @@ Attend-and-Excite for Stable Diffusion was proposed in [Attend-and-Excite: Atten The abstract from the paper is: -*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.* +*Recent text-to-image generative models have demonstrated an unparalleled ability to generate diverse and creative imagery guided by a target text prompt. While revolutionary, current state-of-the-art diffusion models may still fail in generating images that fully convey the semantics in the given text prompt. We analyze the publicly available Stable Diffusion model and assess the existence of catastrophic neglect, where the model fails to generate one or more of the subjects from the input prompt. Moreover, we find that in some cases the model also fails to correctly bind attributes (e.g., colors) to their corresponding subjects. To help mitigate these failure cases, we introduce the concept of Generative Semantic Nursing (GSN), where we seek to intervene in the generative process on the fly during inference time to improve the faithfulness of the generated images. Using an attention-based formulation of GSN, dubbed Attend-and-Excite, we guide the model to refine the cross-attention units to attend to all subject tokens in the text prompt and strengthen - or excite - their activations, encouraging the model to generate all subjects described in the text prompt. We compare our approach to alternative approaches and demonstrate that it conveys the desired concepts more faithfully across a range of text prompts.* You can find additional information about Attend-and-Excite on the [project page](https://attendandexcite.github.io/Attend-and-Excite/), the [original codebase](https://github.com/AttendAndExcite/Attend-and-Excite), or try it out in a [demo](https://huggingface.co/spaces/AttendAndExcite/Attend-and-Excite). 
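To make the Generative Semantic Nursing idea above concrete, here is a hedged sketch of the `StableDiffusionAttendAndExcitePipeline` API, where `token_indices` marks the subject tokens whose cross-attention activations are strengthened during denoising; the checkpoint and token indices are assumptions for this example prompt.

```python
import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

prompt = "a cat and a frog"
# pipe.get_indices(prompt) helps look up token positions;
# here 2 -> "cat" and 5 -> "frog" are the subjects to attend to and excite
image = pipe(
    prompt=prompt,
    token_indices=[2, 5],
    guidance_scale=7.5,
    num_inference_steps=50,
    max_iter_to_alter=25,
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("cat_and_frog.png")
```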
@@ -34,4 +34,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) ## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/audio_diffusion.md b/docs/source/en/api/pipelines/audio_diffusion.md index 5a90689b4a7b..3d140fe202a6 100644 --- a/docs/source/en/api/pipelines/audio_diffusion.md +++ b/docs/source/en/api/pipelines/audio_diffusion.md @@ -14,8 +14,6 @@ specific language governing permissions and limitations under the License. [Audio Diffusion](https://github.com/teticio/audio-diffusion) is by Robert Dargavel Smith, and it leverages the recent advances in image generation from diffusion models by converting audio samples to and from Mel spectrogram images. -The original codebase, training scripts and example notebooks can be found at [teticio/audio-diffusion](https://github.com/teticio/audio-diffusion). - Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. diff --git a/docs/source/en/api/pipelines/audioldm.md b/docs/source/en/api/pipelines/audioldm.md index f3e625fcbf21..43fb0f1a3bf4 100644 --- a/docs/source/en/api/pipelines/audioldm.md +++ b/docs/source/en/api/pipelines/audioldm.md @@ -19,9 +19,9 @@ sound effects, human speech and music. The abstract from the paper is: -*Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at https://audioldm.github.io.* +*Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. 
By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at [this https URL](https://audioldm.github.io/).* -The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM). +The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM). ## Tips @@ -47,4 +47,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - __call__ ## AudioPipelineOutput -[[autodoc]] pipelines.AudioPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.AudioPipelineOutput diff --git a/docs/source/en/api/pipelines/audioldm2.md b/docs/source/en/api/pipelines/audioldm2.md index 3d0c332653f8..89bb6b8cc922 100644 --- a/docs/source/en/api/pipelines/audioldm2.md +++ b/docs/source/en/api/pipelines/audioldm2.md @@ -12,36 +12,23 @@ specific language governing permissions and limitations under the License. # AudioLDM 2 -AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) -by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate -text-conditional sound effects, human speech and music. - -Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2 -is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two -text encoder models are used to compute the text embeddings from a prompt input: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap) -and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). These text embeddings -are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/main/api/pipelines/audioldm2#diffusers.AudioLDM2ProjectionModel). -A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively -predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding -vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2UNet2DConditionModel) -of AudioLDM 2 is unique in the sense that it takes **two** cross-attention embeddings, as opposed to one cross-attention -conditioning, as in most other LDMs. +AudioLDM 2 was proposed in [AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining](https://arxiv.org/abs/2308.05734) by Haohe Liu et al. AudioLDM 2 takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music. 
+ +Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM 2 is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from text embeddings. Two text encoder models are used to compute the text embeddings from a prompt input: the text-branch of [CLAP](https://huggingface.co/docs/transformers/main/en/model_doc/clap) and the encoder of [Flan-T5](https://huggingface.co/docs/transformers/main/en/model_doc/flan-t5). These text embeddings are then projected to a shared embedding space by an [AudioLDM2ProjectionModel](https://huggingface.co/docs/diffusers/main/api/pipelines/audioldm2#diffusers.AudioLDM2ProjectionModel). A [GPT2](https://huggingface.co/docs/transformers/main/en/model_doc/gpt2) _language model (LM)_ is used to auto-regressively predict eight new embedding vectors, conditional on the projected CLAP and Flan-T5 embeddings. The generated embedding vectors and Flan-T5 text embeddings are used as cross-attention conditioning in the LDM. The [UNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2UNet2DConditionModel) of AudioLDM 2 is unique in the sense that it takes **two** cross-attention embeddings, as opposed to one cross-attention conditioning, as in most other LDMs. The abstract of the paper is the following: -*Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called language of audio (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate new state-of-the-art or competitive performance to previous approaches.* +*Although audio generation shares commonalities across different types of audio, such as speech, music, and sound effects, designing models for each type requires careful consideration of specific objectives and biases that can significantly differ from those of other types. To bring us closer to a unified perspective of audio generation, this paper proposes a framework that utilizes the same learning method for speech, music, and sound effect generation. Our framework introduces a general representation of audio, called "language of audio" (LOA). Any audio can be translated into LOA based on AudioMAE, a self-supervised pre-trained representation learning model. In the generation process, we translate any modalities into LOA by using a GPT-2 model, and we perform self-supervised audio generation learning with a latent diffusion model conditioned on LOA. 
The proposed framework naturally brings advantages such as in-context learning abilities and reusable self-supervised pretrained AudioMAE and latent diffusion models. Experiments on the major benchmarks of text-to-audio, text-to-music, and text-to-speech demonstrate state-of-the-art or competitive performance against previous approaches. Our code, pretrained model, and demo are available at [this https URL](https://audioldm.github.io/audioldm2).* -This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be -found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2). +This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). The original codebase can be found at [haoheliu/audioldm2](https://github.com/haoheliu/audioldm2). ## Tips ### Choosing a checkpoint -AudioLDM2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio -generation. The third checkpoint is trained exclusively on text-to-music generation. +AudioLDM2 comes in three variants. Two of these checkpoints are applicable to the general task of text-to-audio generation. The third checkpoint is trained exclusively on text-to-music generation. -All checkpoints share the same model size for the text encoders and VAE. They differ in the size and depth of the UNet. +All checkpoints share the same model size for the text encoders and VAE. They differ in the size and depth of the UNet. See table below for details on the three checkpoints: | Checkpoint | Task | UNet Model Size | Total Model Size | Training Data / h | @@ -54,7 +41,7 @@ See table below for details on the three checkpoints: * Descriptive prompt inputs work best: use adjectives to describe the sound (e.g. "high quality" or "clear") and make the prompt context specific (e.g. "water stream in a forest" instead of "stream"). * It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with. -* Using a **negative prompt** can significantly improve the quality of the generated waveform, by guiding the generation away from terms that correspond to poor quality audio. Try using a negative prompt of "Low quality." +* Using a **negative prompt** can significantly improve the quality of the generated waveform, by guiding the generation away from terms that correspond to poor quality audio. Try using a negative prompt of "Low quality." ### Controlling inference @@ -63,7 +50,7 @@ See table below for details on the three checkpoints: ### Evaluating generated waveforms: -* The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation +* The quality of the generated waveforms can vary significantly based on the seed. Try generating with different seeds until you find a satisfactory generation. * Multiple waveforms can be generated in one go: set `num_waveforms_per_prompt` to a value greater than 1. Automatic scoring will be performed between the generated waveforms and prompt text, and the audios ranked from best to worst accordingly. The following example demonstrates how to construct good music generation using the aforementioned tips: [example](https://huggingface.co/docs/diffusers/main/en/api/pipelines/audioldm2#diffusers.AudioLDM2Pipeline.__call__.example). 
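Putting these tips together, a minimal sketch might look like the following; the checkpoint name and settings are assumptions, and the linked `__call__` example above is the canonical one.

```python
import scipy
import torch
from diffusers import AudioLDM2Pipeline

pipe = AudioLDM2Pipeline.from_pretrained("cvssp/audioldm2", torch_dtype=torch.float16).to("cuda")

# descriptive, context-specific prompt plus a "Low quality." negative prompt
audio = pipe(
    prompt="Techno music with a strong, upbeat tempo and high melodic riffs",
    negative_prompt="Low quality.",
    num_inference_steps=200,
    audio_length_in_s=10.0,
    num_waveforms_per_prompt=3,  # candidates are auto-scored and ranked best to worst
    generator=torch.Generator("cuda").manual_seed(0),
).audios[0]  # take the best-ranked waveform

scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```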
@@ -88,4 +75,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - forward ## AudioPipelineOutput -[[autodoc]] pipelines.AudioPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.AudioPipelineOutput diff --git a/docs/source/en/api/pipelines/auto_pipeline.md b/docs/source/en/api/pipelines/auto_pipeline.md index 68a0ede6d2fa..e9b932f33dd2 100644 --- a/docs/source/en/api/pipelines/auto_pipeline.md +++ b/docs/source/en/api/pipelines/auto_pipeline.md @@ -35,18 +35,18 @@ image = pipeline(prompt, num_inference_steps=25).images[0] -Check out the [AutoPipeline](/tutorials/autopipeline) tutorial to learn how to use this API! +Check out the [AutoPipeline](../../tutorials/autopipeline) tutorial to learn how to use this API! `AutoPipeline` supports text-to-image, image-to-image, and inpainting for the following diffusion models: -- [Stable Diffusion](./stable_diffusion) +- [Stable Diffusion](./stable_diffusion/overview) - [ControlNet](./controlnet) - [Stable Diffusion XL (SDXL)](./stable_diffusion/stable_diffusion_xl) -- [DeepFloyd IF](./if) -- [Kandinsky](./kandinsky) -- [Kandinsky 2.2](./kandinsky#kandinsky-22) +- [DeepFloyd IF](./deepfloyd_if) +- [Kandinsky 2.1](./kandinsky) +- [Kandinsky 2.2](./kandinsky_v22) ## AutoPipelineForText2Image @@ -56,7 +56,6 @@ Check out the [AutoPipeline](/tutorials/autopipeline) tutorial to learn how to u - from_pretrained - from_pipe - ## AutoPipelineForImage2Image [[autodoc]] AutoPipelineForImage2Image @@ -70,5 +69,3 @@ Check out the [AutoPipeline](/tutorials/autopipeline) tutorial to learn how to u - all - from_pretrained - from_pipe - - diff --git a/docs/source/en/api/pipelines/blip_diffusion.md b/docs/source/en/api/pipelines/blip_diffusion.md index 490287a224eb..b2fa5de2508c 100644 --- a/docs/source/en/api/pipelines/blip_diffusion.md +++ b/docs/source/en/api/pipelines/blip_diffusion.md @@ -1,13 +1,25 @@ -# Blip Diffusion + + +# BLIP-Diffusion + +BLIP-Diffusion was proposed in [BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing](https://arxiv.org/abs/2305.14720). It enables zero-shot subject-driven generation and control-guided zero-shot generation. The abstract from the paper is: -*Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. 
We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications.* +*Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and difficulties preserving the subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control which consumes inputs of subject images and text prompts. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representation aligned with the text. Then we design a subject representation learning task which enables a diffusion model to leverage such visual representation and generates new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, and efficient fine-tuning for customized subject with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Project page at [this https URL](https://dxli94.github.io/BLIP-Diffusion-website/).* -The original codebase can be found at [salesforce/LAVIS](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion). You can find the official BLIP Diffusion checkpoints under the [hf.co/SalesForce](https://hf.co/SalesForce) organization. +The original codebase can be found at [salesforce/LAVIS](https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion). You can find the official BLIP-Diffusion checkpoints under the [hf.co/SalesForce](https://hf.co/SalesForce) organization. `BlipDiffusionPipeline` and `BlipDiffusionControlNetPipeline` were contributed by [`ayushtues`](https://github.com/ayushtues/). diff --git a/docs/source/en/api/pipelines/consistency_models.md b/docs/source/en/api/pipelines/consistency_models.md index 26f73e88b409..afdee2c0c8e9 100644 --- a/docs/source/en/api/pipelines/consistency_models.md +++ b/docs/source/en/api/pipelines/consistency_models.md @@ -1,10 +1,22 @@ + + # Consistency Models Consistency Models were proposed in [Consistency Models](https://huggingface.co/papers/2303.01469) by Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. The abstract from the paper is: -*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. 
Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256. * +*Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.* The original codebase can be found at [openai/consistency_models](https://github.com/openai/consistency_models), and additional checkpoints are available at [openai](https://huggingface.co/openai). @@ -27,17 +39,18 @@ For an additional speed-up, use `torch.compile` to generate multiple images in < + pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) # Multistep sampling - # Timesteps can be explicitly specified; the particular timesteps below are from the original Github repo: + # Timesteps can be explicitly specified; the particular timesteps below are from the original GitHub repo: # https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L83 for _ in range(10): image = pipe(timesteps=[17, 0]).images[0] image.show() ``` + ## ConsistencyModelPipeline [[autodoc]] ConsistencyModelPipeline - all - __call__ ## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/docs/source/en/api/pipelines/controlnet.md b/docs/source/en/api/pipelines/controlnet.md index 5604c0cd1a2d..0f636af79b77 100644 --- a/docs/source/en/api/pipelines/controlnet.md +++ b/docs/source/en/api/pipelines/controlnet.md @@ -12,13 +12,13 @@ specific language governing permissions and limitations under the License. # ControlNet -ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala. +ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. 
With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. The abstract from the paper is: -*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.* +*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* This model was contributed by [takuma104](https://huggingface.co/takuma104). ❀️ @@ -67,7 +67,6 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - load_textual_inversion ## StableDiffusionPipelineOutput - [[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput ## FlaxStableDiffusionControlNetPipeline @@ -76,5 +75,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - __call__ ## FlaxStableDiffusionControlNetPipelineOutput - -[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/controlnet_sdxl.md b/docs/source/en/api/pipelines/controlnet_sdxl.md index bea83f2603a4..755f18341d20 100644 --- a/docs/source/en/api/pipelines/controlnet_sdxl.md +++ b/docs/source/en/api/pipelines/controlnet_sdxl.md @@ -12,13 +12,13 @@ specific language governing permissions and limitations under the License. # ControlNet with Stable Diffusion XL -ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang and Maneesh Agrawala. 
+ControlNet was introduced in [Adding Conditional Control to Text-to-Image Diffusion Models](https://huggingface.co/papers/2302.05543) by Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. With a ControlNet model, you can provide an additional control image to condition and control Stable Diffusion generation. For example, if you provide a depth map, the ControlNet model generates an image that'll preserve the spatial information from the depth map. It is a more flexible and accurate way to control the image generation process. The abstract from the paper is: -*We present a neural network structure, ControlNet, to control pretrained large diffusion models to support additional input conditions. The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on a personal devices. Alternatively, if powerful computation clusters are available, the model can scale to large amounts (millions to billions) of data. We report that large diffusion models like Stable Diffusion can be augmented with ControlNets to enable conditional inputs like edge maps, segmentation maps, keypoints, etc. This may enrich the methods to control large diffusion models and further facilitate related applications.* +*We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, eg, edges, depth, segmentation, human pose, etc, with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models.* You can find additional smaller Stable Diffusion XL (SDXL) ControlNet checkpoints from the πŸ€— [Diffusers](https://huggingface.co/diffusers) Hub organization, and browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) checkpoints on the Hub. @@ -28,7 +28,7 @@ You can find additional smaller Stable Diffusion XL (SDXL) ControlNet checkpoint -If you don't see a checkpoint you're interested in, you can train your own SDXL ControlNet with our [training script](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md). +If you don't see a checkpoint you're interested in, you can train your own SDXL ControlNet with our [training script](../../../../../examples/controlnet/README_sdxl). 
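Loading one of these checkpoints works the same way as for the Stable Diffusion ControlNets; below is a minimal, hedged sketch (the checkpoint names are assumptions, substitute any SDXL ControlNet from the Hub):

```py
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

# checkpoint names below are assumptions; swap in any SDXL ControlNet from the Hub
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0-small", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()
```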
@@ -50,6 +50,6 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) [[autodoc]] StableDiffusionXLControlNetInpaintPipeline - all - __call__ -## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file +## StableDiffusionPipelineOutput +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/cycle_diffusion.md b/docs/source/en/api/pipelines/cycle_diffusion.md index 99f7fb9b518d..13ada0594a6a 100644 --- a/docs/source/en/api/pipelines/cycle_diffusion.md +++ b/docs/source/en/api/pipelines/cycle_diffusion.md @@ -16,7 +16,7 @@ Cycle Diffusion is a text guided image-to-image generation model proposed in [Un The abstract from the paper is: -*Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs.* +*Diffusion models have achieved unprecedented performance in generative modeling. The commonly-adopted formulation of the latent code of diffusion models is a sequence of gradually denoised samples, as opposed to the simpler (e.g., Gaussian) latent space of GANs, VAEs, and normalizing flows. This paper provides an alternative, Gaussian formulation of the latent space of various diffusion models, as well as an invertible DPM-Encoder that maps images into the latent space. While our formulation is purely based on the definition of diffusion models, we demonstrate several intriguing consequences. (1) Empirically, we observe that a common latent space emerges from two diffusion models trained independently on related domains. In light of this finding, we propose CycleDiffusion, which uses DPM-Encoder for unpaired image-to-image translation. Furthermore, applying CycleDiffusion to text-to-image diffusion models, we show that large-scale text-to-image diffusion models can be used as zero-shot image-to-image editors. (2) One can guide pre-trained diffusion models and GANs by controlling the latent codes in a unified, plug-and-play formulation based on energy-based models. Using the CLIP model and a face recognition model as guidance, we demonstrate that diffusion models have better coverage of low-density sub-populations and individuals than GANs. 
The code is publicly available at [this https URL](https://github.com/ChenWu98/cycle-diffusion).* @@ -30,4 +30,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - __call__ ## StableDiffusionPiplineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/dance_diffusion.md b/docs/source/en/api/pipelines/dance_diffusion.md index 94e6e7bd797a..fcf52a1ec081 100644 --- a/docs/source/en/api/pipelines/dance_diffusion.md +++ b/docs/source/en/api/pipelines/dance_diffusion.md @@ -16,7 +16,6 @@ specific language governing permissions and limitations under the License. Dance Diffusion is the first in a suite of generative audio tools for producers and musicians released by [Harmonai](https://github.com/Harmonai-org). -The original codebase of this implementation can be found at [Harmonai-org](https://github.com/Harmonai-org/sample-generator). @@ -30,4 +29,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - __call__ ## AudioPipelineOutput -[[autodoc]] pipelines.AudioPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.AudioPipelineOutput diff --git a/docs/source/en/api/pipelines/ddim.md b/docs/source/en/api/pipelines/ddim.md index c2bf95c4e566..5c876806f600 100644 --- a/docs/source/en/api/pipelines/ddim.md +++ b/docs/source/en/api/pipelines/ddim.md @@ -26,4 +26,4 @@ The original codebase can be found at [ermongroup/ddim](https://github.com/ermon - __call__ ## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/docs/source/en/api/pipelines/deepfloyd_if.md b/docs/source/en/api/pipelines/deepfloyd_if.md index 7769b71d38dc..8168c6577979 100644 --- a/docs/source/en/api/pipelines/deepfloyd_if.md +++ b/docs/source/en/api/pipelines/deepfloyd_if.md @@ -10,32 +10,31 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# DeepFloyd IF +# DeepFloyd IF ## Overview -DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. -The model is a modular composed of a frozen text encoder and three cascaded pixel diffusion modules: +DeepFloyd IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding. +The model is a modular composed of a frozen text encoder and three cascaded pixel diffusion modules: - Stage 1: a base model that generates 64x64 px image based on text prompt, -- Stage 2: a 64x64 px => 256x256 px super-resolution model, and a +- Stage 2: a 64x64 px => 256x256 px super-resolution model, and - Stage 3: a 256x256 px => 1024x1024 px super-resolution model -Stage 1 and Stage 2 utilize a frozen text encoder based on the T5 transformer to extract text embeddings, -which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. -Stage 3 is [Stability's x4 Upscaling model](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler). -The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset. 
+Stage 1 and Stage 2 utilize a frozen text encoder based on the T5 transformer to extract text embeddings, which are then fed into a UNet architecture enhanced with cross-attention and attention pooling. +Stage 3 is [Stability AI's x4 Upscaling model](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler). +The result is a highly efficient model that outperforms current state-of-the-art models, achieving a zero-shot FID score of 6.66 on the COCO dataset. Our work underscores the potential of larger UNet architectures in the first stage of cascaded diffusion models and depicts a promising future for text-to-image synthesis. ## Usage Before you can use IF, you need to accept its usage conditions. To do so: -1. Make sure to have a [Hugging Face account](https://huggingface.co/join) and be logged in +1. Make sure to have a [Hugging Face account](https://huggingface.co/join) and be logged in. 2. Accept the license on the model card of [DeepFloyd/IF-I-XL-v1.0](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0). Accepting the license on the stage I model card will auto accept for the other IF models. -3. Make sure to login locally. Install `huggingface_hub` +3. Make sure to login locally. Install `huggingface_hub`: ```sh pip install huggingface_hub --upgrade ``` -run the login function in a Python shell +run the login function in a Python shell: ```py from huggingface_hub import login @@ -48,7 +47,7 @@ and enter your [Hugging Face Hub access token](https://huggingface.co/docs/hub/s Next we install `diffusers` and dependencies: ```sh -pip install diffusers accelerate transformers safetensors +pip install -q diffusers accelerate transformers ``` The following sections give more in-detail examples of how to use IF. Specifically: @@ -73,20 +72,17 @@ The following sections give more in-detail examples of how to use IF. Specifical - *Stage-3* - [stabilityai/stable-diffusion-x4-upscaler](https://huggingface.co/stabilityai/stable-diffusion-x4-upscaler) -**Demo** -[![Hugging Face Spaces](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue)](https://huggingface.co/spaces/DeepFloyd/IF) **Google Colab** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/deepfloyd_if_free_tier_google_colab.ipynb) ### Text-to-Image Generation -By default diffusers makes use of [model cpu offloading](https://huggingface.co/docs/diffusers/optimization/fp16#model-offloading-for-fast-inference-and-memory-savings) -to run the whole IF pipeline with as little as 14 GB of VRAM. +By default diffusers makes use of [model cpu offloading](../../optimization/memory#model-offloading) to run the whole IF pipeline with as little as 14 GB of VRAM. 
```python from diffusers import DiffusionPipeline -from diffusers.utils import pt_to_pil +from diffusers.utils import pt_to_pil, make_image_grid import torch # stage 1 @@ -117,48 +113,43 @@ generator = torch.manual_seed(1) prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt) # stage 1 -image = stage_1( +stage_1_output = stage_1( prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt" ).images -pt_to_pil(image)[0].save("./if_stage_I.png") +#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png") # stage 2 -image = stage_2( - image=image, +stage_2_output = stage_2( + image=stage_1_output, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images -pt_to_pil(image)[0].save("./if_stage_II.png") +#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png") # stage 3 -image = stage_3(prompt=prompt, image=image, noise_level=100, generator=generator).images -image[0].save("./if_stage_III.png") +stage_3_output = stage_3(prompt=prompt, image=stage_2_output, noise_level=100, generator=generator).images +#stage_3_output[0].save("./if_stage_III.png") +make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=3) ``` ### Text Guided Image-to-Image Generation The same IF model weights can be used for text-guided image-to-image translation or image variation. -In this case just make sure to load the weights using the [`IFInpaintingPipeline`] and [`IFInpaintingSuperResolutionPipeline`] pipelines. +In this case just make sure to load the weights using the [`IFImg2ImgPipeline`] and [`IFImg2ImgSuperResolutionPipeline`] pipelines. **Note**: You can also directly move the weights of the text-to-image pipelines to the image-to-image pipelines -without loading them twice by making use of the [`~DiffusionPipeline.components()`] function as explained [here](#converting-between-different-pipelines). +without loading them twice by making use of the [`~DiffusionPipeline.components`] argument as explained [here](#converting-between-different-pipelines).
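As a minimal sketch of that note, the components of an already-loaded text-to-image pipeline can be passed straight into the image-to-image pipeline class so the weights are only downloaded once:

```py
import torch
from diffusers import IFPipeline, IFImg2ImgPipeline

pipe = IFPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)

# reuse the already-instantiated components instead of calling from_pretrained() a second time
pipe_img2img = IFImg2ImgPipeline(**pipe.components)
```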
```python from diffusers import IFImg2ImgPipeline, IFImg2ImgSuperResolutionPipeline, DiffusionPipeline -from diffusers.utils import pt_to_pil - +from diffusers.utils import pt_to_pil, load_image, make_image_grid import torch -from PIL import Image -import requests -from io import BytesIO - # download image url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" -response = requests.get(url) -original_image = Image.open(BytesIO(response.content)).convert("RGB") +original_image = load_image(url) original_image = original_image.resize((768, 512)) # stage 1 @@ -189,29 +180,30 @@ generator = torch.manual_seed(1) prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt) # stage 1 -image = stage_1( +stage_1_output = stage_1( image=original_image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images -pt_to_pil(image)[0].save("./if_stage_I.png") +#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png") # stage 2 -image = stage_2( - image=image, +stage_2_output = stage_2( + image=stage_1_output, original_image=original_image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt", ).images -pt_to_pil(image)[0].save("./if_stage_II.png") +#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png") # stage 3 -image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images -image[0].save("./if_stage_III.png") +stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images +#stage_3_output[0].save("./if_stage_III.png") +make_image_grid([original_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=4) ``` ### Text Guided Inpainting Generation @@ -224,24 +216,16 @@ without loading them twice by making use of the [`~DiffusionPipeline.components( ```python from diffusers import IFInpaintingPipeline, IFInpaintingSuperResolutionPipeline, DiffusionPipeline -from diffusers.utils import pt_to_pil +from diffusers.utils import pt_to_pil, load_image, make_image_grid import torch -from PIL import Image -import requests -from io import BytesIO - # download image url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/person.png" -response = requests.get(url) -original_image = Image.open(BytesIO(response.content)).convert("RGB") -original_image = original_image +original_image = load_image(url) # download mask url = "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/if/glasses_mask.png" -response = requests.get(url) -mask_image = Image.open(BytesIO(response.content)) -mask_image = mask_image +mask_image = load_image(url) # stage 1 stage_1 = IFInpaintingPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) @@ -271,7 +255,7 @@ generator = torch.manual_seed(1) prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt) # stage 1 -image = stage_1( +stage_1_output = stage_1( image=original_image, mask_image=mask_image, prompt_embeds=prompt_embeds, @@ -279,11 +263,11 @@ image = stage_1( generator=generator, output_type="pt", ).images -pt_to_pil(image)[0].save("./if_stage_I.png") +#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png") # stage 2 -image = stage_2( - image=image, +stage_2_output = stage_2( + image=stage_1_output, original_image=original_image, mask_image=mask_image, prompt_embeds=prompt_embeds, @@ -291,11 +275,12 @@
generator=generator, output_type="pt", ).images -pt_to_pil(image)[0].save("./if_stage_II.png") +#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png") # stage 3 -image = stage_3(prompt=prompt, image=image, generator=generator, noise_level=100).images -image[0].save("./if_stage_III.png") +stage_3_output = stage_3(prompt=prompt, image=stage_2_output, generator=generator, noise_level=100).images +#stage_3_output[0].save("./if_stage_III.png") +make_image_grid([original_image, mask_image, pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0], stage_3_output[0]], rows=1, cols=5) ``` ### Converting between different pipelines @@ -332,13 +317,13 @@ pipe.to("cuda") You can also run the diffusion process for a shorter number of timesteps. -This can either be done with the `num_inference_steps` argument +This can either be done with the `num_inference_steps` argument: ```py pipe("", num_inference_steps=30) ``` -Or with the `timesteps` argument +Or with the `timesteps` argument: ```py from diffusers.pipelines.deepfloyd_if import fast27_timesteps @@ -347,8 +332,7 @@ pipe("", timesteps=fast27_timesteps) ``` When doing image variation or inpainting, you can also decrease the number of timesteps -with the strength argument. The strength argument is the amount of noise to add to -the input image which also determines how many steps to run in the denoising process. +with the strength argument. The strength argument is the amount of noise to add to the input image which also determines how many steps to run in the denoising process. A smaller number will vary the image less but run faster. ```py @@ -362,18 +346,19 @@ You can also use [`torch.compile`](../../optimization/torch2.0). Note that we ha with IF and it might not give expected results. ```py +from diffusers import DiffusionPipeline import torch pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16) pipe.to("cuda") -pipe.text_encoder = torch.compile(pipe.text_encoder) -pipe.unet = torch.compile(pipe.unet) +pipe.text_encoder = torch.compile(pipe.text_encoder, mode="reduce-overhead", fullgraph=True) +pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) ``` ### Optimizing for memory -When optimizing for GPU memory, we can use the standard diffusers cpu offloading APIs. +When optimizing for GPU memory, we can use the standard diffusers CPU offloading APIs. Either the model based CPU offloading, @@ -410,23 +395,21 @@ pipe = DiffusionPipeline.from_pretrained( prompt_embeds, negative_embeds = pipe.encode_prompt("") ``` -For CPU RAM constrained machines like google colab free tier where we can't load all -model components to the CPU at once, we can manually only load the pipeline with -the text encoder or unet when the respective model components are needed. +For CPU RAM constrained machines like Google Colab free tier where we can't load all model components to the CPU at once, we can manually only load the pipeline with +the text encoder or UNet when the respective model components are needed.
```py from diffusers import IFPipeline, IFSuperResolutionPipeline import torch import gc from transformers import T5EncoderModel -from diffusers.utils import pt_to_pil +from diffusers.utils import pt_to_pil, make_image_grid text_encoder = T5EncoderModel.from_pretrained( "DeepFloyd/IF-I-XL-v1.0", subfolder="text_encoder", device_map="auto", load_in_8bit=True, variant="8bit" ) # text to image - pipe = DiffusionPipeline.from_pretrained( "DeepFloyd/IF-I-XL-v1.0", text_encoder=text_encoder, # pass the previously instantiated 8bit text encoder @@ -448,14 +431,14 @@ pipe = IFPipeline.from_pretrained( ) generator = torch.Generator().manual_seed(0) -image = pipe( +stage_1_output = pipe( prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt", generator=generator, ).images -pt_to_pil(image)[0].save("./if_stage_I.png") +#pt_to_pil(stage_1_output)[0].save("./if_stage_I.png") # Remove the pipeline so we can load the super-resolution pipeline del pipe @@ -469,24 +452,24 @@ pipe = IFSuperResolutionPipeline.from_pretrained( ) generator = torch.Generator().manual_seed(0) -image = pipe( - image=image, +stage_2_output = pipe( + image=stage_1_output, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt", generator=generator, ).images -pt_to_pil(image)[0].save("./if_stage_II.png") +#pt_to_pil(stage_2_output)[0].save("./if_stage_II.png") +make_image_grid([pt_to_pil(stage_1_output)[0], pt_to_pil(stage_2_output)[0]], rows=1, cols=2) ``` - ## Available Pipelines: | Pipeline | Tasks | Colab |---|---|:---:| | [pipeline_if.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if.py) | *Text-to-Image Generation* | - | -| [pipeline_if_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if.py) | *Text-to-Image Generation* | - | +| [pipeline_if_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_superresolution.py) | *Text-to-Image Generation* | - | | [pipeline_if_img2img.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img.py) | *Image-to-Image Generation* | - | | [pipeline_if_img2img_superresolution.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_img2img_superresolution.py) | *Image-to-Image Generation* | - | | [pipeline_if_inpainting.py](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/deepfloyd_if/pipeline_if_inpainting.py) | *Image-to-Image Generation* | - | diff --git a/docs/source/en/api/pipelines/diffedit.md b/docs/source/en/api/pipelines/diffedit.md index 2ba7f9092907..7ab6ab2391e9 100644 --- a/docs/source/en/api/pipelines/diffedit.md +++ b/docs/source/en/api/pipelines/diffedit.md @@ -22,7 +22,7 @@ The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❀️ -## Tips +## Tips * The pipeline can generate masks that can be fed into other inpainting pipelines. * In order to generate an image using this pipeline, both an image mask (source and target prompts can be manually specified or generated, and passed to [`~StableDiffusionDiffEditPipeline.generate_mask`]) @@ -42,7 +42,7 @@ the phrases including "cat" to `negative_prompt` and "dog" to `prompt`. * Swap the `source_prompt` and `target_prompt` in the arguments to `generate_mask`.
* Change the input prompt in [`~StableDiffusionDiffEditPipeline.invert`] to include "dog". * Swap the `prompt` and `negative_prompt` in the arguments to call the pipeline to generate the final edited image. -* The source and target prompts, or their corresponding embeddings, can also be automatically generated. Please refer to the [DiffEdit](/using-diffusers/diffedit) guide for more details. +* The source and target prompts, or their corresponding embeddings, can also be automatically generated. Please refer to the [DiffEdit](../../using-diffusers/diffedit) guide for more details. ## StableDiffusionDiffEditPipeline [[autodoc]] StableDiffusionDiffEditPipeline @@ -52,4 +52,4 @@ the phrases including "cat" to `negative_prompt` and "dog" to `prompt`. - __call__ ## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/dit.md b/docs/source/en/api/pipelines/dit.md index 147d3ccbcab2..9e49a3bd68e7 100644 --- a/docs/source/en/api/pipelines/dit.md +++ b/docs/source/en/api/pipelines/dit.md @@ -32,4 +32,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - __call__ ## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/docs/source/en/api/pipelines/kandinsky.md b/docs/source/en/api/pipelines/kandinsky.md index 30bc29a5e12e..12073d4a14e7 100644 --- a/docs/source/en/api/pipelines/kandinsky.md +++ b/docs/source/en/api/pipelines/kandinsky.md @@ -9,7 +9,7 @@ specific language governing permissions and limitations under the License. # Kandinsky 2.1 -Kandinsky 2.1 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey) and [Denis Dimitrov](https://github.com/denndimitrov). +Kandinsky 2.1 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), and [Denis Dimitrov](https://github.com/denndimitrov). The description from it's GitHub page is: @@ -23,13 +23,19 @@ Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) + + +Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## KandinskyPriorPipeline [[autodoc]] KandinskyPriorPipeline - all - __call__ - interpolate - + ## KandinskyPipeline [[autodoc]] KandinskyPipeline diff --git a/docs/source/en/api/pipelines/kandinsky_v22.md b/docs/source/en/api/pipelines/kandinsky_v22.md index 350b96c3a9be..3a32eb42412a 100644 --- a/docs/source/en/api/pipelines/kandinsky_v22.md +++ b/docs/source/en/api/pipelines/kandinsky_v22.md @@ -9,7 +9,7 @@ specific language governing permissions and limitations under the License. 
# Kandinsky 2.2 -Kandinsky 2.1 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey) and [Denis Dimitrov](https://github.com/denndimitrov). +Kandinsky 2.2 is created by [Arseniy Shakhmatov](https://github.com/cene555), [Anton Razzhigaev](https://github.com/razzant), [Aleksandr Nikolich](https://github.com/AlexWortega), [Vladimir Arkhipkin](https://github.com/oriBetelgeuse), [Igor Pavlov](https://github.com/boomb0om), [Andrey Kuznetsov](https://github.com/kuznetsoffandrey), and [Denis Dimitrov](https://github.com/denndimitrov). The description from it's GitHub page is: @@ -23,6 +23,12 @@ Check out the [Kandinsky Community](https://huggingface.co/kandinsky-community) + + +Make sure to check out the schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## KandinskyV22PriorPipeline [[autodoc]] KandinskyV22PriorPipeline diff --git a/docs/source/en/api/pipelines/latent_consistency_models.md b/docs/source/en/api/pipelines/latent_consistency_models.md index 1a7c14fb1a77..e5d4beba2bed 100644 --- a/docs/source/en/api/pipelines/latent_consistency_models.md +++ b/docs/source/en/api/pipelines/latent_consistency_models.md @@ -1,10 +1,22 @@ + + # Latent Consistency Models -Latent Consistency Models (LCMs) were proposed in [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://arxiv.org/abs/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. +Latent Consistency Models (LCMs) were proposed in [Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference](https://huggingface.co/papers/2310.04378) by Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. -The abstract of the [paper](https://arxiv.org/pdf/2310.04378.pdf) is as follows: +The abstract of the paper is as follows: -*Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference.* +*Latent Diffusion models (LDMs) have achieved remarkable results in synthesizing high-resolution images. 
However, the iterative sampling process is computationally intensive and leads to slow generation. Inspired by Consistency Models (song et al.), we propose Latent Consistency Models (LCMs), enabling swift inference with minimal steps on any pre-trained LDMs, including Stable Diffusion (rombach et al). Viewing the guided reverse diffusion process as solving an augmented probability flow ODE (PF-ODE), LCMs are designed to directly predict the solution of such ODE in latent space, mitigating the need for numerous iterations and allowing rapid, high-fidelity sampling. Efficiently distilled from pre-trained classifier-free guided diffusion models, a high-quality 768 x 768 2~4-step LCM takes only 32 A100 GPU hours for training. Furthermore, we introduce Latent Consistency Fine-tuning (LCF), a novel method that is tailored for fine-tuning LCMs on customized image datasets. Evaluation on the LAION-5B-Aesthetics dataset demonstrates that LCMs achieve state-of-the-art text-to-image generation performance with few-step inference. Project Page: [this https URL](https://latent-consistency-models.github.io/).* A demo for the [SimianLuo/LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) checkpoint can be found [here](https://huggingface.co/spaces/SimianLuo/Latent_Consistency_Model). diff --git a/docs/source/en/api/pipelines/latent_diffusion.md b/docs/source/en/api/pipelines/latent_diffusion.md index 8fed7d335407..de6f96bea19a 100644 --- a/docs/source/en/api/pipelines/latent_diffusion.md +++ b/docs/source/en/api/pipelines/latent_diffusion.md @@ -18,7 +18,7 @@ The abstract from the paper is: *By decomposing the image formation process into a sequential application of denoising autoencoders, diffusion models (DMs) achieve state-of-the-art synthesis results on image data and beyond. Additionally, their formulation allows for a guiding mechanism to control the image generation process without retraining. However, since these models typically operate directly in pixel space, optimization of powerful DMs often consumes hundreds of GPU days and inference is expensive due to sequential evaluations. To enable DM training on limited computational resources while retaining their quality and flexibility, we apply them in the latent space of powerful pretrained autoencoders. In contrast to previous work, training diffusion models on such a representation allows for the first time to reach a near-optimal point between complexity reduction and detail preservation, greatly boosting visual fidelity. By introducing cross-attention layers into the model architecture, we turn diffusion models into powerful and flexible generators for general conditioning inputs such as text or bounding boxes and high-resolution synthesis becomes possible in a convolutional manner. Our latent diffusion models (LDMs) achieve a new state of the art for image inpainting and highly competitive performance on various tasks, including unconditional image generation, semantic scene synthesis, and super-resolution, while significantly reducing computational requirements compared to pixel-based DMs.* -The original codebase can be found at [Compvis/latent-diffusion](https://github.com/CompVis/latent-diffusion). +The original codebase can be found at [CompVis/latent-diffusion](https://github.com/CompVis/latent-diffusion). 
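As a rough usage sketch, the pretrained text-to-image LDM weights on the Hub can be loaded directly with [`DiffusionPipeline`] (the checkpoint name below is an assumption based on the publicly hosted `CompVis` weights):

```py
import torch
from diffusers import DiffusionPipeline

# checkpoint name is an assumption; it resolves to the LDM text-to-image pipeline
pipe = DiffusionPipeline.from_pretrained("CompVis/ldm-text2im-large-256")
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

image = pipe("a painting of a squirrel eating a burger", num_inference_steps=50).images[0]
image.save("ldm_generated.png")
```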
@@ -37,4 +37,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - __call__ ## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/docs/source/en/api/pipelines/model_editing.md b/docs/source/en/api/pipelines/model_editing.md index 32f24ebc088f..2d94a50e4355 100644 --- a/docs/source/en/api/pipelines/model_editing.md +++ b/docs/source/en/api/pipelines/model_editing.md @@ -32,4 +32,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - all ## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/musicldm.md b/docs/source/en/api/pipelines/musicldm.md index 0936b306cd85..896f707c76d7 100644 --- a/docs/source/en/api/pipelines/musicldm.md +++ b/docs/source/en/api/pipelines/musicldm.md @@ -13,20 +13,17 @@ specific language governing permissions and limitations under the License. # MusicLDM MusicLDM was proposed in [MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies](https://huggingface.co/papers/2308.01546) by Ke Chen, Yusong Wu, Haohe Liu, Marianna Nezhurina, Taylor Berg-Kirkpatrick, Shlomo Dubnov. -MusicLDM takes a text prompt as input and predicts the corresponding music sample. +MusicLDM takes a text prompt as input and predicts the corresponding music sample. -Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm/overview), +Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview) and [AudioLDM](https://huggingface.co/docs/diffusers/api/pipelines/audioldm), MusicLDM is a text-to-music _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap) latents. -MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to -the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies -encourages the model to interpolate between the training samples, but stay within the domain of the training data. The -result is generated music that is more diverse while staying faithful to the corresponding style. +MusicLDM is trained on a corpus of 466 hours of music data. Beat-synchronous data augmentation strategies are applied to the music samples, both in the time domain and in the latent space. Using beat-synchronous data augmentation strategies encourages the model to interpolate between the training samples, but stay within the domain of the training data. The result is generated music that is more diverse while staying faithful to the corresponding style. The abstract of the paper is the following: -*In this paper, we present MusicLDM, a state-of-the-art text-to-music model that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. 
Then, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, to encourage the model to generate music more diverse while still staying faithful to the corresponding style.* +*Diffusion models have shown promising results in cross-modal generation tasks, including text-to-image and text-to-audio generation. However, generating music, as a special type of audio, presents unique challenges due to limited availability of music data and sensitive issues related to copyright and plagiarism. In this paper, to tackle these challenges, we first construct a state-of-the-art text-to-music model, MusicLDM, that adapts Stable Diffusion and AudioLDM architectures to the music domain. We achieve this by retraining the contrastive language-audio pretraining model (CLAP) and the Hifi-GAN vocoder, as components of MusicLDM, on a collection of music data samples. Then, to address the limitations of training data and to avoid plagiarism, we leverage a beat tracking model and propose two different mixup strategies for data augmentation: beat-synchronous audio mixup and beat-synchronous latent mixup, which recombine training audio directly or via a latent embeddings space, respectively. Such mixup strategies encourage the model to interpolate between musical training samples and generate new music within the convex hull of the training data, making the generated music more diverse while still staying faithful to the corresponding style. In addition to popular evaluation metrics, we design several new evaluation metrics based on CLAP score to demonstrate that our proposed MusicLDM and beat-synchronous mixup strategies improve both the quality and novelty of generated music, as well as the correspondence between input text and generated music.* This pipeline was contributed by [sanchit-gandhi](https://huggingface.co/sanchit-gandhi). 
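A minimal text-to-audio sketch is shown below; the checkpoint name and call arguments are assumptions based on the pipeline's standard text-to-audio interface:

```py
import torch
from scipy.io import wavfile
from diffusers import MusicLDMPipeline

# checkpoint name is an assumption; substitute the officially released MusicLDM weights
pipe = MusicLDMPipeline.from_pretrained("ucsd-reach/musicldm", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=200, audio_length_in_s=10.0).audios[0]

# the vocoder outputs audio at a 16 kHz sampling rate
wavfile.write("techno.wav", rate=16000, data=audio)
```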
@@ -52,4 +49,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) ## MusicLDMPipeline [[autodoc]] MusicLDMPipeline - all - - __call__ \ No newline at end of file + - __call__ diff --git a/docs/source/en/api/pipelines/overview.md b/docs/source/en/api/pipelines/overview.md index 9caf5c6b4121..a7f4f477ef82 100644 --- a/docs/source/en/api/pipelines/overview.md +++ b/docs/source/en/api/pipelines/overview.md @@ -31,6 +31,7 @@ The table below lists all the pipelines currently available in πŸ€— Diffusers an | Pipeline | Tasks | |---|---| | [AltDiffusion](alt_diffusion) | image2image | +| [AnimateDiff](animatediff) | text2video | | [Attend-and-Excite](attend_and_excite) | text2image | | [Audio Diffusion](audio_diffusion) | image2audio | | [AudioLDM](audioldm) | text2audio | @@ -46,33 +47,35 @@ The table below lists all the pipelines currently available in πŸ€— Diffusers an | [DeepFloyd IF](deepfloyd_if) | text2image, image2image, inpainting, super-resolution | | [DiffEdit](diffedit) | inpainting | | [DiT](dit) | text2image | -| [GLIGEN](gligen) | text2image | +| [GLIGEN](stable_diffusion/gligen) | text2image | | [InstructPix2Pix](pix2pix) | image editing | -| [Kandinsky](kandinsky) | text2image, image2image, inpainting, interpolation | +| [Kandinsky 2.1](kandinsky) | text2image, image2image, inpainting, interpolation | | [Kandinsky 2.2](kandinsky_v22) | text2image, image2image, inpainting | +| [Latent Consistency Models](latent_consistency_models) | text2image | | [Latent Diffusion](latent_diffusion) | text2image, super-resolution | -| [LDM3D](ldm3d_diffusion) | text2image, text-to-3D | +| [LDM3D](stable_diffusion/ldm3d_diffusion) | text2image, text-to-3D | | [MultiDiffusion](panorama) | text2image | | [MusicLDM](musicldm) | text2audio | -| [PaintByExample](paint_by_example) | inpainting | +| [Paint by Example](paint_by_example) | inpainting | | [ParaDiGMS](paradigms) | text2image | | [Pix2Pix Zero](pix2pix_zero) | image editing | +| [PixArt-Ξ±](pixart) | text2image | | [PNDM](pndm) | unconditional image generation | | [RePaint](repaint) | inpainting | -| [ScoreSdeVe](score_sde_ve) | unconditional image generation | +| [Score SDE VE](score_sde_ve) | unconditional image generation | | [Self-Attention Guidance](self_attention_guidance) | text2image | | [Semantic Guidance](semantic_stable_diffusion) | text2image | | [Shap-E](shap_e) | text-to-3D, image-to-3D | | [Spectrogram Diffusion](spectrogram_diffusion) | | | [Stable Diffusion](stable_diffusion/overview) | text2image, image2image, depth2image, inpainting, image variation, latent upscaler, super-resolution | | [Stable Diffusion Model Editing](model_editing) | model editing | -| [Stable Diffusion XL](stable_diffusion_xl) | text2image, image2image, inpainting | +| [Stable Diffusion XL](stable_diffusion/stable_diffusion_xl) | text2image, image2image, inpainting | | [Stable unCLIP](stable_unclip) | text2image, image variation | -| [KarrasVe](karras_ve) | unconditional image generation | -| [T2I Adapter](adapter) | text2image | +| [Stochastic Karras VE](stochastic_karras_ve) | unconditional image generation | +| [T2I-Adapter](stable_diffusion/adapter) | text2image | | [Text2Video](text_to_video) | text2video, video2video | -| [Text2Video Zero](text_to_video_zero) | text2video | -| [UnCLIP](unclip) | text2image, image variation | +| [Text2Video-Zero](text_to_video_zero) | text2video | +| [unCLIP](unclip) | text2image, image variation | | [Unconditional Latent Diffusion](latent_diffusion_uncond) | unconditional image 
generation | | [UniDiffuser](unidiffuser) | text2image, image2text, image variation, text variation, unconditional image generation, unconditional audio generation | | [Value-guided planning](value_guided_sampling) | value guided sampling | diff --git a/docs/source/en/api/pipelines/paint_by_example.md b/docs/source/en/api/pipelines/paint_by_example.md index d04a378a09d3..b89e80cbb254 100644 --- a/docs/source/en/api/pipelines/paint_by_example.md +++ b/docs/source/en/api/pipelines/paint_by_example.md @@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Paint By Example +# Paint by Example [Paint by Example: Exemplar-based Image Editing with Diffusion Models](https://huggingface.co/papers/2211.13227) is by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, Fang Wen. @@ -22,7 +22,7 @@ The original codebase can be found at [Fantasy-Studio/Paint-by-Example](https:// ## Tips -PaintByExample is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. The checkpoint is warm-started from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) to inpaint partly masked images conditioned on example and reference images. +Paint by Example is supported by the official [Fantasy-Studio/Paint-by-Example](https://huggingface.co/Fantasy-Studio/Paint-by-Example) checkpoint. The checkpoint is warm-started from [CompVis/stable-diffusion-v1-4](https://huggingface.co/CompVis/stable-diffusion-v1-4) to inpaint partly masked images conditioned on example and reference images. @@ -36,4 +36,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - __call__ ## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/panorama.md b/docs/source/en/api/pipelines/panorama.md index 4ad5624a44c7..8aa86112aa89 100644 --- a/docs/source/en/api/pipelines/panorama.md +++ b/docs/source/en/api/pipelines/panorama.md @@ -22,19 +22,12 @@ You can find additional information about MultiDiffusion on the [project page](h ## Tips -While calling [`StableDiffusionPanoramaPipeline`], it's possible to specify the `view_batch_size` parameter to be > 1. +While calling [`StableDiffusionPanoramaPipeline`], it's possible to specify the `view_batch_size` parameter to be > 1. For some GPUs with high performance, this can speedup the generation process and increase VRAM usage. To generate panorama-like images make sure you pass the width parameter accordingly. We recommend a width value of 2048 which is the default. -Circular padding is applied to ensure there are no stitching artifacts when working with -panoramas to ensure a seamless transition from the rightmost part to the leftmost part. -By enabling circular padding (set `circular_padding=True`), the operation applies additional -crops after the rightmost point of the image, allowing the model to "see” the transition -from the rightmost part to the leftmost part. This helps maintain visual consistency in -a 360-degree sense and creates a proper β€œpanorama” that can be viewed using 360-degree -panorama viewers. 
When decoding latents in Stable Diffusion, circular padding is applied -to ensure that the decoded latents match in the RGB space. +Circular padding is applied when working with panoramas to avoid stitching artifacts and ensure a seamless transition from the rightmost part to the leftmost part. By enabling circular padding (set `circular_padding=True`), the operation applies additional crops after the rightmost point of the image, allowing the model to β€œsee” the transition from the rightmost part to the leftmost part. This helps maintain visual consistency in a 360-degree sense and creates a proper β€œpanorama” that can be viewed using 360-degree panorama viewers. When decoding latents in Stable Diffusion, circular padding is applied to ensure that the decoded latents match in the RGB space. For example, without circular padding, there is a stitching artifact (default): ![img](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/indoor_%20no_circular_padding.png) @@ -54,4 +47,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - all ## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/paradigms.md b/docs/source/en/api/pipelines/paradigms.md index 4606b1f53eb6..ca2fedc796df 100644 --- a/docs/source/en/api/pipelines/paradigms.md +++ b/docs/source/en/api/pipelines/paradigms.md @@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License. The abstract from the paper is: -*Diffusion models are powerful generative models but suffer from slow sampling, often taking 1000 sequential denoising steps for one sample. As a result, considerable efforts have been directed toward reducing the number of denoising steps, but these methods hurt sample quality. Instead of reducing the number of denoising steps (trading quality for speed), in this paper we explore an orthogonal approach: can we run the denoising steps in parallel (trading compute for speed)?
In spite of the sequential nature of the denoising steps, we show that surprisingly it is possible to parallelize sampling via Picard iterations, by guessing the solution of future denoising steps and iteratively refining until convergence. With this insight, we present ParaDiGMS, a novel method to accelerate the sampling of pretrained diffusion models by denoising multiple steps in parallel. ParaDiGMS is the first diffusion sampling method that enables trading compute for speed and is even compatible with existing fast sampling techniques such as DDIM and DPMSolver. Using ParaDiGMS, we improve sampling speed by 2-4x across a range of robotics and image generation models, giving state-of-the-art sampling speeds of 0.2s on 100-step DiffusionPolicy and 14.6s on 1000-step StableDiffusion-v2 with no measurable degradation of task reward, FID score, or CLIP score.* The original codebase can be found at [AndyShih12/paradigms](https://github.com/AndyShih12/paradigms), and the pipeline was contributed by [AndyShih12](https://github.com/AndyShih12). ❀️ @@ -26,17 +26,14 @@ This pipeline improves sampling speed by running denoising steps in parallel, at Therefore, it is better to call this pipeline when running on multiple GPUs. Otherwise, without enough GPU bandwidth sampling may be even slower than sequential sampling. -The two parameters to play with are `parallel` (batch size) and `tolerance`. -- If it fits in memory, for a 1000-step DDPM you can aim for a batch size of around 100 -(for example, 8 GPUs and `batch_per_device=12` to get `parallel=96`). A higher batch size -may not fit in memory, and lower batch size gives less parallelism. -- For tolerance, using a higher tolerance may get better speedups but can risk sample quality degradation. -If there is quality degradation with the default tolerance, then use a lower tolerance like `0.001`. +The two parameters to play with are `parallel` (batch size) and `tolerance`. +- If it fits in memory, for a 1000-step DDPM you can aim for a batch size of around 100 (for example, 8 GPUs and `batch_per_device=12` to get `parallel=96`). A higher batch size may not fit in memory, and lower batch size gives less parallelism. +- For tolerance, using a higher tolerance may get better speedups but can risk sample quality degradation. If there is quality degradation with the default tolerance, then use a lower tolerance like `0.001`. For a 1000-step DDPM on 8 A100 GPUs, you can expect around a 3x speedup from [`StableDiffusionParadigmsPipeline`] compared to the [`StableDiffusionPipeline`] by setting `parallel=80` and `tolerance=0.1`. -πŸ€— Diffusers offers [distributed inference support](../training/distributed_inference) for generating multiple prompts +πŸ€— Diffusers offers [distributed inference support](../../training/distributed_inference) for generating multiple prompts in parallel on multiple GPUs. But [`StableDiffusionParadigmsPipeline`] is designed for speeding up sampling of a single prompt by using multiple GPUs. 
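As a hedged sketch of how these knobs fit together (the checkpoint choice and the `DataParallel` wrapping are assumptions; `parallel` and `tolerance` are the call arguments described above):

```py
import torch
from diffusers import DDPMParallelScheduler, StableDiffusionParadigmsPipeline

ngpu, batch_per_device = torch.cuda.device_count(), 5

# a parallel scheduler is required for parallel sampling
scheduler = DDPMParallelScheduler.from_pretrained("stabilityai/stable-diffusion-2", subfolder="scheduler")
pipe = StableDiffusionParadigmsPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2", scheduler=scheduler, torch_dtype=torch.float16
).to("cuda")

# shard the parallel batch of denoising steps across all visible GPUs
pipe.wrapped_unet = torch.nn.DataParallel(pipe.unet, device_ids=list(range(ngpu)))

image = pipe(
    "a photo of an astronaut riding a horse on mars",
    parallel=ngpu * batch_per_device,  # number of denoising steps evaluated in parallel
    tolerance=0.1,                     # convergence tolerance of the Picard iterations
).images[0]
```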
diff --git a/docs/source/en/api/pipelines/pix2pix_zero.md b/docs/source/en/api/pipelines/pix2pix_zero.md index 9d43667c068b..6d7b9fb31471 100644 --- a/docs/source/en/api/pipelines/pix2pix_zero.md +++ b/docs/source/en/api/pipelines/pix2pix_zero.md @@ -20,7 +20,7 @@ The abstract from the paper is: You can find additional information about Pix2Pix Zero on the [project page](https://pix2pixzero.github.io/), [original codebase](https://github.com/pix2pixzero/pix2pix-zero), and try it out in a [demo](https://huggingface.co/spaces/pix2pix-zero-library/pix2pix-zero-demo). -## Tips +## Tips * The pipeline can be conditioned on real input images. Check out the code examples below to know more. * The pipeline exposes two arguments namely `source_embeds` and `target_embeds` @@ -29,12 +29,11 @@ you wanted to translate from "cat" to "dog". In this case, the edit direction wi this in the pipeline, you simply have to set the embeddings related to the phrases including "cat" to `source_embeds` and "dog" to `target_embeds`. Refer to the code example below for more details. * When you're using this pipeline from a prompt, specify the _source_ concept in the prompt. Taking -the above example, a valid input prompt would be: "a high resolution painting of a **cat** in the style of van gough". +the above example, a valid input prompt would be: "a high resolution painting of a **cat** in the style of van gogh". * If you wanted to reverse the direction in the example above, i.e., "dog -> cat", then it's recommended to: * Swap the `source_embeds` and `target_embeds`. - * Change the input prompt to include "dog". -* To learn more about how the source and target embeddings are generated, refer to the [original -paper](https://arxiv.org/abs/2302.03027). Below, we also provide some directions on how to generate the embeddings. + * Change the input prompt to include "dog". +* To learn more about how the source and target embeddings are generated, refer to the [original paper](https://arxiv.org/abs/2302.03027). Below, we also provide some directions on how to generate the embeddings. * Note that the quality of the outputs generated with this pipeline is dependent on how good the `source_embeds` and `target_embeds` are. Please, refer to [this discussion](#generating-source-and-target-embeddings) for some suggestions on the topic. ## Available Pipelines: @@ -79,23 +78,22 @@ for url in [src_embs_url, target_embs_url]: src_embeds = torch.load(src_embs_url.split("/")[-1]) target_embeds = torch.load(target_embs_url.split("/")[-1]) -images = pipeline( +image = pipeline( prompt, source_embeds=src_embeds, target_embeds=target_embeds, num_inference_steps=50, cross_attention_guidance_amount=0.15, -).images -images[0].save("edited_image_dog.png") +).images[0] +image ``` ### Based on an input image When the pipeline is conditioned on an input image, we first obtain an inverted -noise from it using a `DDIMInverseScheduler` with the help of a generated caption. Then -the inverted noise is used to start the generation process. +noise from it using a `DDIMInverseScheduler` with the help of a generated caption. Then the inverted noise is used to start the generation process. 
-First, let's load our pipeline: 
+First, let's load our pipeline:

```py
import torch
@@ -119,25 +117,25 @@ pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler
pipeline.enable_model_cpu_offload()
```

-Then, we load an input image for conditioning and obtain a suitable caption for it: 
+Then, we load an input image for conditioning and obtain a suitable caption for it:

```py
-import requests
-from PIL import Image
+from diffusers.utils import load_image

img_url = "https://github.com/pix2pixzero/pix2pix-zero/raw/main/assets/test_images/cats/cat_6.png"
-raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB").resize((512, 512))
+raw_image = load_image(img_url).resize((512, 512))

caption = pipeline.generate_caption(raw_image)
+caption
```

-Then we employ the generated caption and the input image to get the inverted noise: 
+Then we employ the generated caption and the input image to get the inverted noise:

-```py 
+```py
generator = torch.manual_seed(0)
inv_latents = pipeline.invert(caption, image=raw_image, generator=generator).latents
```

-Now, generate the image with edit directions: 
+Now, generate the image with edit directions:

```py
# See the "Generating source and target embeddings" section below to
@@ -159,16 +157,16 @@ image = pipeline(
    latents=inv_latents,
    negative_prompt=caption,
).images[0]
-image.save("edited_image.png")
+image
```

-## Generating source and target embeddings 
+## Generating source and target embeddings

The authors originally used the [GPT-3 API](https://openai.com/api/) to generate the source and target
captions for discovering edit directions. However, we can also leverage open source and public models for the same
purpose. Below, we provide an end-to-end example with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model for
generating captions and [CLIP](https://huggingface.co/docs/transformers/model_doc/clip) for
-computing embeddings on the generated captions. 
+computing embeddings on the generated captions.

**1. Load the generation model**:

@@ -180,7 +178,7 @@ tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16)
```

-**2. Construct a starting prompt**: 
+**2. Construct a starting prompt**:

```py
source_concept = "cat"
@@ -193,11 +191,11 @@ target_text = f"Provide a caption for images containing a {target_concept}. "
"The captions should be in English and should be no longer than 150 characters."
```

-Here, we're interested in the "cat -> dog" direction. 
+Here, we're interested in the "cat -> dog" direction.

**3. Generate captions**:

-We can use a utility like so for this purpose. 
+We can use a utility like so for this purpose.

```py
def generate_captions(input_prompt):
@@ -214,17 +212,18 @@ And then we just call it to generate our captions:

```py
source_captions = generate_captions(source_text)
target_captions = generate_captions(target_concept)
+print(source_captions, target_captions, sep='\n')
```

We encourage you to play around with the different parameters supported by the
`generate()` method ([documentation](https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.generation_tf_utils.TFGenerationMixin.generate)) for the generation quality you are looking for.

-**4. Load the embedding model**: 
+**4. Load the embedding model**:

Here, we need to use the same text encoder model used by the subsequent Stable Diffusion model.
-```py -from diffusers import StableDiffusionPix2PixZeroPipeline +```py +from diffusers import StableDiffusionPix2PixZeroPipeline pipeline = StableDiffusionPix2PixZeroPipeline.from_pretrained( "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16 @@ -236,8 +235,8 @@ text_encoder = pipeline.text_encoder **5. Compute embeddings**: -```py -import torch +```py +import torch def embed_captions(sentences, tokenizer, text_encoder, device="cuda"): with torch.no_grad(): @@ -261,23 +260,29 @@ target_embeddings = embed_captions(target_captions, tokenizer, text_encoder) And you're done! [Here](https://colab.research.google.com/drive/1tz2C1EdfZYAPlzXXbTnf-5PRBiR8_R1F?usp=sharing) is a Colab Notebook that you can use to interact with the entire process. -Now, you can use these embeddings directly while calling the pipeline: +Now, you can use these embeddings directly while calling the pipeline: ```py from diffusers import DDIMScheduler pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) -images = pipeline( +image = pipeline( prompt, source_embeds=source_embeddings, target_embeds=target_embeddings, num_inference_steps=50, cross_attention_guidance_amount=0.15, -).images -images[0].save("edited_image_dog.png") +).images[0] +image ``` + + +Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## StableDiffusionPix2PixZeroPipeline [[autodoc]] StableDiffusionPix2PixZeroPipeline - __call__ diff --git a/docs/source/en/api/pipelines/pixart.md b/docs/source/en/api/pipelines/pixart.md index 5c84d039ed28..6fa44cd508e4 100644 --- a/docs/source/en/api/pipelines/pixart.md +++ b/docs/source/en/api/pipelines/pixart.md @@ -10,7 +10,7 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# PixArt +# PixArt-Ξ± ![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/pixart/header_collage.png) @@ -24,13 +24,20 @@ You can find the original codebase at [PixArt-alpha/PixArt-alpha](https://github Some notes about this pipeline: -* It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](./dit.md). -* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details. +* It uses a Transformer backbone (instead of a UNet) for denoising. As such it has a similar architecture as [DiT](./dit). +* It was trained using text conditions computed from T5. This aspect makes the pipeline better at following complex text prompts with intricate details. * It is good at producing high-resolution images at different aspect ratios. To get the best results, the authors recommend some size brackets which can be found [here](https://github.com/PixArt-alpha/PixArt-alpha/blob/08fbbd281ec96866109bdd2cdb75f2f58fb17610/diffusion/data/datasets/utils.py). * It rivals the quality of state-of-the-art text-to-image generation systems (as of this writing) such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient than them. 
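To tie the notes above together, here is a minimal text-to-image sketch. The `PixArt-alpha/PixArt-XL-2-1024-MS` checkpoint name is an assumption; substitute whichever PixArt-Ξ± checkpoint you intend to use.

```py
# Minimal sketch: despite the Transformer backbone and T5 text encoder under the hood,
# the pipeline is called like any other text-to-image pipeline.
import torch
from diffusers import PixArtAlphaPipeline

pipe = PixArtAlphaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-XL-2-1024-MS",  # assumed checkpoint name
    torch_dtype=torch.float16,
).to("cuda")

prompt = "A small cactus wearing a straw hat, photorealistic, golden hour lighting"
image = pipe(prompt=prompt).images[0]
image
```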
+ + +Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## PixArtAlphaPipeline [[autodoc]] PixArtAlphaPipeline - all - - __call__ \ No newline at end of file + - __call__ + \ No newline at end of file diff --git a/docs/source/en/api/pipelines/pndm.md b/docs/source/en/api/pipelines/pndm.md index 96b1fc0f99d3..162e7934dc22 100644 --- a/docs/source/en/api/pipelines/pndm.md +++ b/docs/source/en/api/pipelines/pndm.md @@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License. # PNDM -[Pseudo Numerical methods for Diffusion Models on manifolds](https://huggingface.co/papers/2202.09778) (PNDM) is by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao. +[Pseudo Numerical Methods for Diffusion Models on Manifolds](https://huggingface.co/papers/2202.09778) (PNDM) is by Luping Liu, Yi Ren, Zhijie Lin and Zhou Zhao. The abstract from the paper is: @@ -32,4 +32,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - __call__ ## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/docs/source/en/api/pipelines/score_sde_ve.md b/docs/source/en/api/pipelines/score_sde_ve.md index 374e93557506..cc9c8574f92d 100644 --- a/docs/source/en/api/pipelines/score_sde_ve.md +++ b/docs/source/en/api/pipelines/score_sde_ve.md @@ -32,4 +32,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - __call__ ## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/docs/source/en/api/pipelines/self_attention_guidance.md b/docs/source/en/api/pipelines/self_attention_guidance.md index 3b18fec39bf8..408e62daf988 100644 --- a/docs/source/en/api/pipelines/self_attention_guidance.md +++ b/docs/source/en/api/pipelines/self_attention_guidance.md @@ -32,4 +32,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - all ## StableDiffusionOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/semantic_stable_diffusion.md b/docs/source/en/api/pipelines/semantic_stable_diffusion.md index cb02cc3c34d0..d7b393447cf8 100644 --- a/docs/source/en/api/pipelines/semantic_stable_diffusion.md +++ b/docs/source/en/api/pipelines/semantic_stable_diffusion.md @@ -12,12 +12,12 @@ specific language governing permissions and limitations under the License. # Semantic Guidance -Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Diffusion using Semantic Dimensions](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation. +Semantic Guidance for Diffusion Models was proposed in [SEGA: Instructing Text-to-Image Models using Semantic Guidance](https://huggingface.co/papers/2301.12247) and provides strong semantic control over image generation. Small changes to the text prompt usually result in entirely different output images. 
However, with SEGA a variety of changes to the image are enabled that can be controlled easily and intuitively, while staying true to the original image composition. The abstract from the paper is: -*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on a variety of tasks and provide evidence for its versatility and flexibility.* +*Text-to-image diffusion models have recently received a lot of interest for their astonishing ability to produce high-fidelity images from text only. However, achieving one-shot generation that aligns with the user's intent is nearly impossible, yet small changes to the input prompt often result in very different images. This leaves the user with little semantic control. To put the user in control, we show how to interact with the diffusion process to flexibly steer it along semantic directions. This semantic guidance (SEGA) generalizes to any generative architecture using classifier-free guidance. More importantly, it allows for subtle and extensive edits, changes in composition and style, as well as optimizing the overall artistic conception. We demonstrate SEGA's effectiveness on both latent and pixel-based diffusion models such as Stable Diffusion, Paella, and DeepFloyd-IF using a variety of tasks, thus providing strong evidence for its versatility, flexibility, and improvements over existing methods.* diff --git a/docs/source/en/api/pipelines/shap_e.md b/docs/source/en/api/pipelines/shap_e.md index 80f303b07887..bbf904afb5c8 100644 --- a/docs/source/en/api/pipelines/shap_e.md +++ b/docs/source/en/api/pipelines/shap_e.md @@ -9,7 +9,7 @@ specific language governing permissions and limitations under the License. # Shap-E -The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewon Jun from [OpenAI](https://github.com/openai). +The Shap-E model was proposed in [Shap-E: Generating Conditional 3D Implicit Functions](https://huggingface.co/papers/2305.02463) by Alex Nichol and Heewoo Jun from [OpenAI](https://github.com/openai). 
The abstract from the paper is: @@ -34,4 +34,4 @@ See the [reuse components across pipelines](../../using-diffusers/loading#reuse- - __call__ ## ShapEPipelineOutput -[[autodoc]] pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.shap_e.pipeline_shap_e.ShapEPipelineOutput diff --git a/docs/source/en/api/pipelines/spectrogram_diffusion.md b/docs/source/en/api/pipelines/spectrogram_diffusion.md index 54c3745d48cc..cc9ff3e45646 100644 --- a/docs/source/en/api/pipelines/spectrogram_diffusion.md +++ b/docs/source/en/api/pipelines/spectrogram_diffusion.md @@ -34,4 +34,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - __call__ ## AudioPipelineOutput -[[autodoc]] pipelines.AudioPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.AudioPipelineOutput diff --git a/docs/source/en/api/pipelines/stable_diffusion/adapter.md b/docs/source/en/api/pipelines/stable_diffusion/adapter.md index cf3aca4bfa52..0e2e7fd250fc 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/adapter.md +++ b/docs/source/en/api/pipelines/stable_diffusion/adapter.md @@ -20,7 +20,7 @@ Using the pretrained models we can provide control images (for example, a depth The abstract of the paper is the following: -*The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate structure control is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and small T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, and achieve rich control and editing effects. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.* +*The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., color and structure) is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly. Specifically, we propose to learn simple and lightweight T2I-Adapters to align internal knowledge in T2I models with external control signals, while freezing the original large T2I models. In this way, we can train various adapters according to different conditions, achieving rich control and editing effects in the color and structure of the generation results. Further, the proposed T2I-Adapters have attractive properties of practical value, such as composability and generalization ability. 
Extensive experiments demonstrate that our T2I-Adapter has promising generation quality and a wide range of applications.* This model was contributed by the community contributor [HimariO](https://github.com/HimariO) ❀️ . @@ -33,7 +33,7 @@ This model was contributed by the community contributor [HimariO](https://github ## Usage example with the base model of StableDiffusion-1.4/1.5 -In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5. +In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-1.4/1.5. All adapters use the same pipeline. 1. Images are first converted into the appropriate *control image* format. @@ -42,7 +42,7 @@ All adapters use the same pipeline. Let's have a look at a simple example using the [Color Adapter](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1). ```python -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid image = load_image("https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_ref.png") ``` @@ -83,20 +83,21 @@ Finally, pass the prompt and control image to the pipeline ```py # fix the random seed, so you will get the same result as the example -generator = torch.manual_seed(7) +generator = torch.Generator("cuda").manual_seed(7) out_image = pipe( "At night, glowing cubes in front of the beach", image=color_palette, generator=generator, ).images[0] +make_image_grid([image, color_palette, out_image], rows=1, cols=3) ``` ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/color_output.png) ## Usage example with the base model of StableDiffusion-XL -In the following we give a simple example of how to use a *T2IAdapter* checkpoint with Diffusers for inference based on StableDiffusion-XL. +In the following we give a simple example of how to use a *T2I-Adapter* checkpoint with Diffusers for inference based on StableDiffusion-XL. All adapters use the same pipeline. 1. Images are first downloaded into the appropriate *control image* format. @@ -105,7 +106,7 @@ All adapters use the same pipeline. Let's have a look at a simple example using the [Sketch Adapter](https://huggingface.co/Adapter/t2iadapter/tree/main/sketch_sdxl_1.0). 
```python -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid sketch_image = load_image("https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch.png").convert("L") ``` @@ -121,10 +122,9 @@ from diffusers import ( StableDiffusionXLAdapterPipeline, DDPMScheduler ) -from diffusers.models.unet_2d_condition import UNet2DConditionModel model_id = "stabilityai/stable-diffusion-xl-base-1.0" -adapter = T2IAdapter.from_pretrained("Adapter/t2iadapter", subfolder="sketch_sdxl_1.0",torch_dtype=torch.float16, adapter_type="full_adapter_xl") +adapter = T2IAdapter.from_pretrained("Adapter/t2iadapter", subfolder="sketch_sdxl_1.0", torch_dtype=torch.float16, adapter_type="full_adapter_xl") scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler") pipe = StableDiffusionXLAdapterPipeline.from_pretrained( @@ -141,12 +141,13 @@ Finally, pass the prompt and control image to the pipeline generator = torch.Generator().manual_seed(42) sketch_image_out = pipe( - prompt="a photo of a dog in real world, high quality", - negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality", - image=sketch_image, - generator=generator, + prompt="a photo of a dog in real world, high quality", + negative_prompt="extra digit, fewer digits, cropped, worst quality, low quality", + image=sketch_image, + generator=generator, guidance_scale=7.5 ).images[0] +make_image_grid([sketch_image, sketch_image_out], rows=1, cols=2) ``` ![img](https://huggingface.co/Adapter/t2iadapter/resolve/main/sketch_output.png) @@ -159,7 +160,7 @@ Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://hu | Model Name | Control Image Overview| Control Image Example | Generated Image Example | |---|---|---|---| -|[TencentARC/t2iadapter_color_sd14v1](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1)
*Trained with spatial color palette* | A image with 8x8 color palette.||| +|[TencentARC/t2iadapter_color_sd14v1](https://huggingface.co/TencentARC/t2iadapter_color_sd14v1)
*Trained with spatial color palette* | An image with 8x8 color palette.||| |[TencentARC/t2iadapter_canny_sd14v1](https://huggingface.co/TencentARC/t2iadapter_canny_sd14v1)
*Trained with canny edge detection* | A monochrome image with white edges on a black background.||| |[TencentARC/t2iadapter_sketch_sd14v1](https://huggingface.co/TencentARC/t2iadapter_sketch_sd14v1)
*Trained with [PidiNet](https://github.com/zhuoinoulu/pidinet) edge detection* | A hand-drawn monochrome image with white outlines on a black background.||| |[TencentARC/t2iadapter_depth_sd14v1](https://huggingface.co/TencentARC/t2iadapter_depth_sd14v1)
*Trained with Midas depth estimation* | A grayscale image with black representing deep areas and white representing shallow areas.||| @@ -181,9 +182,7 @@ Non-diffusers checkpoints can be found under [TencentARC/T2I-Adapter](https://hu Here we use the keypose adapter for the character posture and the depth adapter for creating the scene. ```py -import torch -from PIL import Image -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid cond_keypose = load_image( "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_sample_input.png" @@ -191,7 +190,7 @@ cond_keypose = load_image( cond_depth = load_image( "https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png" ) -cond = [[cond_keypose, cond_depth]] +cond = [cond_keypose, cond_depth] prompt = ["A man walking in an office room with a nice view"] ``` @@ -202,12 +201,13 @@ The two control images look as such: ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/depth_sample_input.png) -`MultiAdapter` combines keypose and depth adapters. +`MultiAdapter` combines keypose and depth adapters. `adapter_conditioning_scale` balances the relative influence of the different adapters. ```py -from diffusers import StableDiffusionAdapterPipeline, MultiAdapter +import torch +from diffusers import StableDiffusionAdapterPipeline, MultiAdapter, T2IAdapter adapters = MultiAdapter( [ @@ -221,19 +221,20 @@ pipe = StableDiffusionAdapterPipeline.from_pretrained( "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16, adapter=adapters, -) +).to("cuda") -images = pipe(prompt, cond, adapter_conditioning_scale=[0.8, 0.8]) +image = pipe(prompt, cond, adapter_conditioning_scale=[0.8, 0.8]).images[0] +make_image_grid([cond_keypose, cond_depth, image], rows=1, cols=3) ``` ![img](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/t2i-adapter/keypose_depth_sample_output.png) -## T2I Adapter vs ControlNet +## T2I-Adapter vs ControlNet -T2I-Adapter is similar to [ControlNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet). -T2i-Adapter uses a smaller auxiliary network which is only run once for the entire diffusion process. -However, T2I-Adapter performs slightly worse than ControlNet. +T2I-Adapter is similar to [ControlNet](https://huggingface.co/docs/diffusers/main/en/api/pipelines/controlnet). +T2I-Adapter uses a smaller auxiliary network which is only run once for the entire diffusion process. +However, T2I-Adapter performs slightly worse than ControlNet. ## StableDiffusionAdapterPipeline [[autodoc]] StableDiffusionAdapterPipeline diff --git a/docs/source/en/api/pipelines/stable_diffusion/depth2img.md b/docs/source/en/api/pipelines/stable_diffusion/depth2img.md index 09814f387b72..f7c8f2de9420 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/depth2img.md +++ b/docs/source/en/api/pipelines/stable_diffusion/depth2img.md @@ -12,11 +12,11 @@ specific language governing permissions and limitations under the License. # Depth-to-image -The Stable Diffusion model can also infer depth based on an image using [MiDas](https://github.com/isl-org/MiDaS). This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure. +The Stable Diffusion model can also infer depth based on an image using [MiDaS](https://github.com/isl-org/MiDaS). 
This allows you to pass a text prompt and an initial image to condition the generation of new images as well as a `depth_map` to preserve the image structure. -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! @@ -37,4 +37,4 @@ If you're interested in using one of the official checkpoints for a task, explor ## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/stable_diffusion/inpaint.md b/docs/source/en/api/pipelines/stable_diffusion/inpaint.md index dc935d0bd17b..362ad325ac85 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/inpaint.md +++ b/docs/source/en/api/pipelines/stable_diffusion/inpaint.md @@ -23,7 +23,7 @@ text-to-image Stable Diffusion checkpoints, such as -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! @@ -54,4 +54,4 @@ If you're interested in using one of the official checkpoints for a task, explor ## FlaxStableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md index 0775485e68db..bdb113f6e465 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md +++ b/docs/source/en/api/pipelines/stable_diffusion/latent_upscale.md @@ -16,7 +16,7 @@ The Stable Diffusion latent upscaler model was created by [Katherine Crowson](ht -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! 
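As a quick illustration of the latent upscaler discussed on this page, here is a minimal sketch of the usual two-stage workflow; the checkpoint names and the `output_type="latent"` handoff are assumptions to verify against the full guide.

```py
# Minimal sketch: generate latents with a base pipeline, then upscale them 2x
# with the latent upscaler before decoding to the final image.
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionLatentUpscalePipeline

base = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
    "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
generator = torch.manual_seed(33)

# keep the base output in latent space so the upscaler can consume it directly
low_res_latents = base(prompt, generator=generator, output_type="latent").images

image = upscaler(
    prompt=prompt,
    image=low_res_latents,
    num_inference_steps=20,
    guidance_scale=0,
    generator=generator,
).images[0]
```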
@@ -35,4 +35,4 @@ If you're interested in using one of the official checkpoints for a task, explor ## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/stable_diffusion/overview.md b/docs/source/en/api/pipelines/stable_diffusion/overview.md index fe30e7177dbf..fb4f2739dd2b 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/overview.md +++ b/docs/source/en/api/pipelines/stable_diffusion/overview.md @@ -34,7 +34,7 @@ The table below summarizes the available Stable Diffusion pipelines, their suppo Supported tasks - Space + πŸ€— Space @@ -165,4 +165,4 @@ img2img = StableDiffusionImg2ImgPipeline(**text2img.components) inpaint = StableDiffusionInpaintPipeline(**text2img.components) # now you can use text2img(...), img2img(...), inpaint(...) just like the call methods of each respective pipeline -``` \ No newline at end of file +``` diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md index d44e9f507830..75f36ba335a6 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_2.md @@ -14,12 +14,12 @@ specific language governing permissions and limitations under the License. Stable Diffusion 2 is a text-to-image _latent diffusion_ model built upon the work of the original [Stable Diffusion](https://stability.ai/blog/stable-diffusion-public-release), and it was led by Robin Rombach and Katherine Crowson from [Stability AI](https://stability.ai/) and [LAION](https://laion.ai/). -*The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels. +*The Stable Diffusion 2.0 release includes robust text-to-image models trained using a brand new text encoder (OpenCLIP), developed by LAION with support from Stability AI, which greatly improves the quality of the generated images compared to earlier V1 releases. The text-to-image models in this release can generate images with default resolutions of both 512x512 pixels and 768x768 pixels. These models are trained on an aesthetic subset of the [LAION-5B dataset](https://laion.ai/blog/laion-5b/) created by the DeepFloyd team at Stability AI, which is then further filtered to remove adult content using [LAION’s NSFW filter](https://openreview.net/forum?id=M3Y74vmsMcY).* For more details about how Stable Diffusion 2 works and how it differs from the original Stable Diffusion, please refer to the official [announcement post](https://stability.ai/blog/stable-diffusion-v2-release). -The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img) so check out it's API documentation for how to use Stable Diffusion 2. We recommend using the [`DPMSolverMultistepScheduler`] as it's currently the fastest scheduler. +The architecture of Stable Diffusion 2 is more or less identical to the original [Stable Diffusion model](./text2img) so check out it's API documentation for how to use Stable Diffusion 2. 
We recommend using the [`DPMSolverMultistepScheduler`] as it gives a reasonable speed/quality trade-off and can be run with as little as 20 steps. Stable Diffusion 2 is available for tasks like text-to-image, inpainting, super-resolution, and depth-to-image: @@ -35,7 +35,7 @@ Here are some examples for how to use Stable Diffusion 2 for each task: -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! @@ -55,30 +55,21 @@ pipe = pipe.to("cuda") prompt = "High quality photo of an astronaut riding a horse in space" image = pipe(prompt, num_inference_steps=25).images[0] -image.save("astronaut.png") +image ``` ## Inpainting ```py -import PIL -import requests import torch -from io import BytesIO - from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler - - -def download_image(url): - response = requests.get(url) - return PIL.Image.open(BytesIO(response.content)).convert("RGB") - +from diffusers.utils import load_image, make_image_grid img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" -init_image = download_image(img_url).resize((512, 512)) -mask_image = download_image(mask_url).resize((512, 512)) +init_image = load_image(img_url).resize((512, 512)) +mask_image = load_image(mask_url).resize((512, 512)) repo_id = "stabilityai/stable-diffusion-2-inpainting" pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16") @@ -88,17 +79,14 @@ pipe = pipe.to("cuda") prompt = "Face of a yellow cat, high resolution, sitting on a park bench" image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0] - -image.save("yellow_cat.png") +make_image_grid([init_image, mask_image, image], rows=1, cols=3) ``` ## Super-resolution ```py -import requests -from PIL import Image -from io import BytesIO from diffusers import StableDiffusionUpscalePipeline +from diffusers.utils import load_image, make_image_grid import torch # load model and scheduler @@ -108,22 +96,19 @@ pipeline = pipeline.to("cuda") # let's download an image url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png" -response = requests.get(url) -low_res_img = Image.open(BytesIO(response.content)).convert("RGB") +low_res_img = load_image(url) low_res_img = low_res_img.resize((128, 128)) prompt = "a white cat" upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0] -upscaled_image.save("upsampled_cat.png") +make_image_grid([low_res_img.resize((512, 512)), upscaled_image.resize((512, 512))], rows=1, cols=2) ``` ## Depth-to-image ```py import torch -import requests -from PIL import Image - from diffusers import StableDiffusionDepth2ImgPipeline +from diffusers.utils import load_image, 
make_image_grid pipe = StableDiffusionDepth2ImgPipeline.from_pretrained( "stabilityai/stable-diffusion-2-depth", @@ -132,8 +117,9 @@ pipe = StableDiffusionDepth2ImgPipeline.from_pretrained( url = "http://images.cocodataset.org/val2017/000000039769.jpg" -init_image = Image.open(requests.get(url, stream=True).raw) +init_image = load_image(url) prompt = "two tigers" -n_propmt = "bad, deformed, ugly, bad anotomy" -image = pipe(prompt=prompt, image=init_image, negative_prompt=n_propmt, strength=0.7).images[0] -``` \ No newline at end of file +negative_prompt = "bad, deformed, ugly, bad anotomy" +image = pipe(prompt=prompt, image=init_image, negative_prompt=negative_prompt, strength=0.7).images[0] +make_image_grid([init_image, image], rows=1, cols=2) +``` diff --git a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md index d257a6e91edc..74f4cba08354 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md +++ b/docs/source/en/api/pipelines/stable_diffusion/stable_diffusion_xl.md @@ -23,7 +23,7 @@ The abstract from the paper is: - Using SDXL with a DPM++ scheduler for less than 50 steps is known to produce [visual artifacts](https://github.com/huggingface/diffusers/issues/5433) because the solver becomes numerically unstable. To fix this issue, take a look at this [PR](https://github.com/huggingface/diffusers/pull/5541) which recommends for ODE/SDE solvers: - set `use_karras_sigmas=True` or `lu_lambdas=True` to improve image quality - set `euler_at_final=True` if you're using a solver with uniform step sizes (DPM++2M or DPM++2M SDE) -- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't for for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0). +- Most SDXL checkpoints work best with an image size of 1024x1024. Image sizes of 768x768 and 512x512 are also supported, but the results aren't as good. Anything below 512x512 is not recommended and likely won't be for default checkpoints like [stabilityai/stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0). - SDXL can pass a different prompt for each of the text encoders it was trained on. We can even pass different parts of the same prompt to the text encoders. - SDXL output images can be improved by making use of a refiner model in an image-to-image setting. - SDXL offers `negative_original_size`, `negative_crops_coords_top_left`, and `negative_target_size` to negatively condition the model on image resolution and cropping parameters. @@ -32,7 +32,7 @@ The abstract from the paper is: To learn how to use SDXL for various tasks, how to optimize performance, and other usage examples, take a look at the [Stable Diffusion XL](../../../using-diffusers/sdxl) guide. -Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! +Check out the [Stability AI](https://huggingface.co/stabilityai) Hub organization for the official base and refiner model checkpoints! 
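Since the tips above mention dual prompts and negative micro-conditioning without showing them, here is a minimal sketch; treat the argument names as assumptions to double-check against the [`StableDiffusionXLPipeline`] signature.

```py
# Minimal sketch: pass a different prompt to each of SDXL's two text encoders and
# negatively condition the model on a low-resolution layout.
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
).to("cuda")

image = pipe(
    prompt="a photo of an astronaut riding a horse",          # routed to the first text encoder
    prompt_2="oil painting, thick expressive brush strokes",  # routed to the second text encoder
    negative_original_size=(512, 512),    # steer away from low-resolution training buckets
    negative_target_size=(1024, 1024),
    # negative_crops_coords_top_left is also available for negative cropping conditioning
).images[0]
```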
diff --git a/docs/source/en/api/pipelines/stable_diffusion/text2img.md b/docs/source/en/api/pipelines/stable_diffusion/text2img.md index 8d09602d8605..75d0b305d22f 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/text2img.md +++ b/docs/source/en/api/pipelines/stable_diffusion/text2img.md @@ -20,7 +20,7 @@ The abstract from the paper is: -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! @@ -56,4 +56,4 @@ If you're interested in using one of the official checkpoints for a task, explor ## FlaxStableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.FlaxStableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/stable_diffusion/upscale.md b/docs/source/en/api/pipelines/stable_diffusion/upscale.md index 0bad9be0dcd4..d8df718d9d36 100644 --- a/docs/source/en/api/pipelines/stable_diffusion/upscale.md +++ b/docs/source/en/api/pipelines/stable_diffusion/upscale.md @@ -16,7 +16,7 @@ The Stable Diffusion upscaler diffusion model was created by the researchers and -Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! +Make sure to check out the Stable Diffusion [Tips](overview#tips) section to learn how to explore the tradeoff between scheduler speed and quality, and how to reuse pipeline components efficiently! If you're interested in using one of the official checkpoints for a task, explore the [CompVis](https://huggingface.co/CompVis), [Runway](https://huggingface.co/runwayml), and [Stability AI](https://huggingface.co/stabilityai) Hub organizations! @@ -34,4 +34,4 @@ If you're interested in using one of the official checkpoints for a task, explor ## StableDiffusionPipelineOutput -[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.stable_diffusion.StableDiffusionPipelineOutput diff --git a/docs/source/en/api/pipelines/stable_unclip.md b/docs/source/en/api/pipelines/stable_unclip.md index 739d357ddcdf..2942cefec4a9 100644 --- a/docs/source/en/api/pipelines/stable_unclip.md +++ b/docs/source/en/api/pipelines/stable_unclip.md @@ -22,12 +22,10 @@ The abstract from the paper is: ## Tips -Stable unCLIP takes `noise_level` as input during inference which determines how much noise is added -to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default, -we do not add any additional noise to the image embeddings (`noise_level = 0`). +Stable unCLIP takes `noise_level` as input during inference which determines how much noise is added to the image embeddings. A higher `noise_level` increases variation in the final un-noised images. By default, we do not add any additional noise to the image embeddings (`noise_level = 0`). 
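Because `noise_level` is only described in prose here, a small sketch of how it can be passed on an image variation call may help; the checkpoint name is an assumption and the image path is a placeholder.

```py
# Minimal sketch: noise_level=0 (default) leaves the image embeddings untouched,
# while larger values add noise to them and increase variation in the output.
import torch
from diffusers import StableUnCLIPImg2ImgPipeline
from diffusers.utils import load_image

pipe = StableUnCLIPImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-unclip", torch_dtype=torch.float16  # assumed checkpoint
).to("cuda")

init_image = load_image("path/to/your/image.png")  # placeholder; use your own image

image = pipe(init_image, prompt="A fantasy landscape", noise_level=500).images[0]
```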
### Text-to-Image Generation -Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha) +Stable unCLIP can be leveraged for text-to-image generation by pipelining it with the prior model of KakaoBrain's open source DALL-E 2 replication [Karlo](https://huggingface.co/kakaobrain/karlo-v1-alpha): ```python import torch @@ -60,12 +58,12 @@ pipe = StableUnCLIPPipeline.from_pretrained( pipe = pipe.to("cuda") wave_prompt = "dramatic wave, the Oceans roar, Strong wave spiral across the oceans as the waves unfurl into roaring crests; perfect wave form; perfect wave shape; dramatic wave shape; wave shape unbelievable; wave; wave shape spectacular" -images = pipe(prompt=wave_prompt).images -images[0].save("waves.png") +image = pipe(prompt=wave_prompt).images[0] +image ``` -For text-to-image we use `stabilityai/stable-diffusion-2-1-unclip-small` as it was trained on CLIP ViT-L/14 embedding, the same as the Karlo model prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend its use. +For text-to-image we use `stabilityai/stable-diffusion-2-1-unclip-small` as it was trained on CLIP ViT-L/14 embedding, the same as the Karlo model prior. [stabilityai/stable-diffusion-2-1-unclip](https://hf.co/stabilityai/stable-diffusion-2-1-unclip) was trained on OpenCLIP ViT-H, so we don't recommend its use. @@ -90,12 +88,19 @@ images[0].save("variation_image.png") Optionally, you can also pass a prompt to `pipe` such as: -```python +```python prompt = "A fantasy landscape, trending on artstation" -images = pipe(init_image, prompt=prompt).images -images[0].save("variation_image_two.png") +image = pipe(init_image, prompt=prompt).images[0] +image ``` + + + +Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## StableUnCLIPPipeline [[autodoc]] StableUnCLIPPipeline @@ -108,7 +113,6 @@ images[0].save("variation_image_two.png") - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention - ## StableUnCLIPImg2ImgPipeline [[autodoc]] StableUnCLIPImg2ImgPipeline @@ -120,6 +124,6 @@ images[0].save("variation_image_two.png") - disable_vae_slicing - enable_xformers_memory_efficient_attention - disable_xformers_memory_efficient_attention - + ## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/docs/source/en/api/pipelines/stochastic_karras_ve.md b/docs/source/en/api/pipelines/stochastic_karras_ve.md index 3db24e80ca94..0e3f1a5b8333 100644 --- a/docs/source/en/api/pipelines/stochastic_karras_ve.md +++ b/docs/source/en/api/pipelines/stochastic_karras_ve.md @@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License. The abstract from the paper: -*We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. 
This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of an existing ImageNet-64 model from 2.07 to near-SOTA 1.55.* +*We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of a previously trained ImageNet-64 model from 2.07 to near-SOTA 1.55, and after re-training with our proposed improvements to a new SOTA of 1.36.* @@ -30,4 +30,4 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) - __call__ ## ImagePipelineOutput -[[autodoc]] pipelines.ImagePipelineOutput \ No newline at end of file +[[autodoc]] pipelines.ImagePipelineOutput diff --git a/docs/source/en/api/pipelines/text_to_video.md b/docs/source/en/api/pipelines/text_to_video.md index e6e081cfa645..244bb2e43b74 100644 --- a/docs/source/en/api/pipelines/text_to_video.md +++ b/docs/source/en/api/pipelines/text_to_video.md @@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License. -πŸ§ͺ This pipeline is for research purposes only. +πŸ§ͺ This pipeline is for research purposes only. @@ -26,13 +26,13 @@ The abstract from the paper is: You can find additional information about Text-to-Video on the [project page](https://modelscope.cn/models/damo/text-to-video-synthesis/summary), [original codebase](https://github.com/modelscope/modelscope/), and try it out in a [demo](https://huggingface.co/spaces/damo-vilab/modelscope-text-to-video-synthesis). Official checkpoints can be found at [damo-vilab](https://huggingface.co/damo-vilab) and [cerspense](https://huggingface.co/cerspense). 
-## Usage example +## Usage example ### `text-to-video-ms-1.7b` Let's start by generating a short video with the default length of 16 frames (2s at 8 fps): -```python +```python import torch from diffusers import DiffusionPipeline from diffusers.utils import export_to_video @@ -88,7 +88,7 @@ video_path = export_to_video(video_frames) video_path ``` -Here are some sample outputs: +Here are some sample outputs: @@ -118,8 +118,9 @@ which can then be upscaled using [`VideoToVideoSDPipeline`] and [`cerspense/zero ```py import torch -from diffusers import DiffusionPipeline +from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler from diffusers.utils import export_to_video +from PIL import Image pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16) pipe.enable_model_cpu_offload() @@ -152,7 +153,7 @@ video_path = export_to_video(video_frames) video_path ``` -Here are some sample outputs: +Here are some sample outputs:
@@ -166,6 +167,12 @@ Here are some sample outputs:
+ + +Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## TextToVideoSDPipeline [[autodoc]] TextToVideoSDPipeline - all diff --git a/docs/source/en/api/pipelines/text_to_video_zero.md b/docs/source/en/api/pipelines/text_to_video_zero.md index b64d72db0187..626e75f94936 100644 --- a/docs/source/en/api/pipelines/text_to_video_zero.md +++ b/docs/source/en/api/pipelines/text_to_video_zero.md @@ -12,12 +12,7 @@ specific language governing permissions and limitations under the License. # Text2Video-Zero -[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by -Levon Khachatryan, -Andranik Movsisyan, -Vahram Tadevosyan, -Roberto Henschel, -[Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com). +[Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators](https://huggingface.co/papers/2303.13439) is by Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, [Zhangyang Wang](https://www.ece.utexas.edu/people/faculty/atlas-wang), Shant Navasardyan, [Humphrey Shi](https://www.humphreyshi.com). Text2Video-Zero enables zero-shot video generation using either: 1. A textual prompt @@ -35,16 +30,15 @@ Our key modifications include (i) enriching the latent codes of the generated fr Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data.* -You can find additional information about Text-to-Video Zero on the [project page](https://text2video-zero.github.io/), [paper](https://arxiv.org/abs/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero). +You can find additional information about Text2Video-Zero on the [project page](https://text2video-zero.github.io/), [paper](https://arxiv.org/abs/2303.13439), and [original codebase](https://github.com/Picsart-AI-Research/Text2Video-Zero). ## Usage example ### Text-To-Video -To generate a video from prompt, run the following python command +To generate a video from prompt, run the following Python code: ```python import torch -import imageio from diffusers import TextToVideoZeroPipeline model_id = "runwayml/stable-diffusion-v1-5" @@ -63,18 +57,17 @@ You can change these parameters in the pipeline call: * Video length: * `video_length`, the number of frames video_length to be generated. 
Default: `video_length=8` -We an also generate longer videos by doing the processing in a chunk-by-chunk manner: +We can also generate longer videos by doing the processing in a chunk-by-chunk manner: ```python import torch -import imageio from diffusers import TextToVideoZeroPipeline import numpy as np model_id = "runwayml/stable-diffusion-v1-5" pipe = TextToVideoZeroPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") seed = 0 -video_length = 8 -chunk_size = 4 +video_length = 24 #24 Γ· 4fps = 6 seconds +chunk_size = 8 prompt = "A panda is playing guitar on times square" # Generate the video chunk-by-chunk @@ -122,7 +115,7 @@ To generate a video from prompt with additional pose control frame_count = 8 pose_images = [Image.fromarray(reader.get_data(i)) for i in range(frame_count)] ``` - To extract pose from actual video, read [ControlNet documentation](./stable_diffusion/controlnet). + To extract pose from actual video, read [ControlNet documentation](controlnet). 3. Run `StableDiffusionControlNetPipeline` with our custom attention processor @@ -152,13 +145,12 @@ To generate a video from prompt with additional pose control ### Text-To-Video with Edge Control -To generate a video from prompt with additional pose control, -follow the steps described above for pose-guided generation using [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny). +To generate a video from prompt with additional Canny edge control, follow the same steps described above for pose-guided generation using [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny). ### Video Instruct-Pix2Pix -To perform text-guided video editing (with [InstructPix2Pix](./stable_diffusion/pix2pix)): +To perform text-guided video editing (with [InstructPix2Pix](pix2pix)): 1. Download a demo video @@ -196,12 +188,12 @@ To perform text-guided video editing (with [InstructPix2Pix](./stable_diffusion/ ``` -### DreamBooth specialization +### DreamBooth specialization Methods **Text-To-Video**, **Text-To-Video with Pose Control** and **Text-To-Video with Edge Control** -can run with custom [DreamBooth](../training/dreambooth) models, as shown below for +can run with custom [DreamBooth](../../training/dreambooth) models, as shown below for [Canny edge ControlNet model](https://huggingface.co/lllyasviel/sd-controlnet-canny) and -[Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model +[Avatar style DreamBooth](https://huggingface.co/PAIR/text2video-zero-controlnet-canny-avatar) model: 1. Download a demo video @@ -250,6 +242,11 @@ can run with custom [DreamBooth](../training/dreambooth) models, as shown below You can filter out some available DreamBooth-trained models with [this link](https://huggingface.co/models?search=dreambooth). + + +Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. 
+ + ## TextToVideoZeroPipeline [[autodoc]] TextToVideoZeroPipeline @@ -257,4 +254,4 @@ You can filter out some available DreamBooth-trained models with [this link](htt - __call__ ## TextToVideoPipelineOutput -[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.text_to_video_synthesis.pipeline_text_to_video_zero.TextToVideoPipelineOutput diff --git a/docs/source/en/api/pipelines/unclip.md b/docs/source/en/api/pipelines/unclip.md index 0cb5dc54dc29..da076ae8320c 100644 --- a/docs/source/en/api/pipelines/unclip.md +++ b/docs/source/en/api/pipelines/unclip.md @@ -9,13 +9,13 @@ specific language governing permissions and limitations under the License. # unCLIP -[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in πŸ€— Diffusers comes from kakaobrain's [karlo]((https://github.com/kakaobrain/karlo)). +[Hierarchical Text-Conditional Image Generation with CLIP Latents](https://huggingface.co/papers/2204.06125) is by Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, Mark Chen. The unCLIP model in πŸ€— Diffusers comes from kakaobrain's [karlo](https://github.com/kakaobrain/karlo). The abstract from the paper is following: *Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.* -You can find lucidrains DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch). +You can find lucidrains' DALL-E 2 recreation at [lucidrains/DALLE2-pytorch](https://github.com/lucidrains/DALLE2-pytorch). diff --git a/docs/source/en/api/pipelines/unidiffuser.md b/docs/source/en/api/pipelines/unidiffuser.md index cc59b168711c..5da194e320cc 100644 --- a/docs/source/en/api/pipelines/unidiffuser.md +++ b/docs/source/en/api/pipelines/unidiffuser.md @@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License. The UniDiffuser model was proposed in [One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale](https://huggingface.co/papers/2303.06555) by Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, Jun Zhu. -The abstract from the [paper](https://arxiv.org/abs/2303.06555) is: +The abstract from the paper is: *This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. 
Our key insight is -- learning diffusion models for marginal, conditional, and joint distributions can be unified as predicting the noise in the perturbed data, where the perturbation levels (i.e. timesteps) can be different for different modalities. Inspired by the unified view, UniDiffuser learns all distributions simultaneously with a minimal modification to the original diffusion model -- perturbs data in all modalities instead of a single modality, inputs individual timesteps in different modalities, and predicts the noise of all modalities instead of a single modality. UniDiffuser is parameterized by a transformer for diffusion models to handle input types of different modalities. Implemented on large-scale paired image-text data, UniDiffuser is able to perform image, text, text-to-image, image-to-text, and image-text pair generation by setting proper timesteps without additional overhead. In particular, UniDiffuser is able to produce perceptually realistic samples in all tasks and its quantitative results (e.g., the FID and CLIP score) are not only superior to existing general-purpose models but also comparable to the bespoken models (e.g., Stable Diffusion and DALL-E 2) in representative tasks (e.g., text-to-image generation).* @@ -54,7 +54,7 @@ image.save("unidiffuser_joint_sample_image.png") print(text) ``` -This is also called "joint" generation in the UniDiffusers paper, since we are sampling from the joint image-text distribution. +This is also called "joint" generation in the UniDiffuser paper, since we are sampling from the joint image-text distribution. Note that the generation task is inferred from the inputs used when calling the pipeline. It is also possible to manually specify the unconditional generation task ("mode") manually with [`UniDiffuserPipeline.set_joint_mode`]: @@ -65,7 +65,7 @@ pipe.set_joint_mode() sample = pipe(num_inference_steps=20, guidance_scale=8.0) ``` -When the mode is set manually, subsequent calls to the pipeline will use the set mode without attempting the infer the mode. +When the mode is set manually, subsequent calls to the pipeline will use the set mode without attempting to infer the mode. You can reset the mode with [`UniDiffuserPipeline.reset_mode`], after which the pipeline will once again infer the mode. You can also generate only an image or only text (which the UniDiffuser paper calls "marginal" generation since we sample from the marginal distribution of images and text, respectively): @@ -100,7 +100,7 @@ prompt = "an elephant under the sea" sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0) t2i_image = sample.images[0] -t2i_image.save("unidiffuser_text2img_sample_image.png") +t2i_image ``` The `text2img` mode requires that either an input `prompt` or `prompt_embeds` be supplied. You can set the `text2img` mode manually with [`UniDiffuserPipeline.set_text_to_image_mode`]. @@ -133,7 +133,7 @@ The `img2text` mode requires that an input `image` be supplied. You can set the ### Image Variation -The UniDiffuser authors suggest performing image variation through a "round-trip" generation method, where given an input image, we first perform an image-to-text generation, and the perform a text-to-image generation on the outputs of the first generation. +The UniDiffuser authors suggest performing image variation through a "round-trip" generation method, where given an input image, we first perform an image-to-text generation, and then perform a text-to-image generation on the outputs of the first generation. 
This produces a new image which is semantically similar to the input image: ```python @@ -147,7 +147,7 @@ model_id_or_path = "thu-ml/unidiffuser-v1" pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16) pipe.to(device) -# Image variation can be performed with a image-to-text generation followed by a text-to-image generation: +# Image variation can be performed with an image-to-text generation followed by a text-to-image generation: # 1. Image-to-text generation image_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/unidiffuser/unidiffuser_example_image.jpg" init_image = load_image(image_url).resize((512, 512)) @@ -164,7 +164,6 @@ final_image.save("unidiffuser_image_variation_sample.png") ### Text Variation - Similarly, text variation can be performed on an input prompt with a text-to-image generation followed by a image-to-text generation: ```python @@ -191,10 +190,16 @@ final_prompt = sample.text[0] print(final_prompt) ``` + + +Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## UniDiffuserPipeline [[autodoc]] UniDiffuserPipeline - all - __call__ ## ImageTextPipelineOutput -[[autodoc]] pipelines.ImageTextPipelineOutput \ No newline at end of file +[[autodoc]] pipelines.ImageTextPipelineOutput diff --git a/docs/source/en/api/pipelines/value_guided_sampling.md b/docs/source/en/api/pipelines/value_guided_sampling.md index 0509b196b578..01b7717f49f8 100644 --- a/docs/source/en/api/pipelines/value_guided_sampling.md +++ b/docs/source/en/api/pipelines/value_guided_sampling.md @@ -22,11 +22,17 @@ This pipeline is based on the [Planning with Diffusion for Flexible Behavior Syn The abstract from the paper is: -*Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility*. +*Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. 
In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.* -You can find additional information about the model on the [project page](https://diffusion-planning.github.io/), the [original codebase](https://github.com/jannerm/diffuser), or try it out in a demo [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/reinforcement_learning_with_diffusers.ipynb). +You can find additional information about the model on the [project page](https://diffusion-planning.github.io/), the [original codebase](https://github.com/jannerm/diffuser), or try it out in a demo [notebook](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/reinforcement_learning_with_diffusers.ipynb). The script to run the model is available [here](https://github.com/huggingface/diffusers/tree/main/examples/reinforcement_learning). + + +Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-components-across-pipelines) section to learn how to efficiently load the same components into multiple pipelines. + + + ## ValueGuidedRLPipeline -[[autodoc]] diffusers.experimental.ValueGuidedRLPipeline \ No newline at end of file +[[autodoc]] diffusers.experimental.ValueGuidedRLPipeline diff --git a/docs/source/en/api/pipelines/versatile_diffusion.md b/docs/source/en/api/pipelines/versatile_diffusion.md index 1ddde1393157..953f4822486a 100644 --- a/docs/source/en/api/pipelines/versatile_diffusion.md +++ b/docs/source/en/api/pipelines/versatile_diffusion.md @@ -12,11 +12,11 @@ specific language governing permissions and limitations under the License. # Versatile Diffusion -Versatile Diffusion was proposed in [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://huggingface.co/papers/2211.08332) by Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi . +Versatile Diffusion was proposed in [Versatile Diffusion: Text, Images and Variations All in One Diffusion Model](https://huggingface.co/papers/2211.08332) by Xingqian Xu, Zhangyang Wang, Eric Zhang, Kai Wang, Humphrey Shi. The abstract from the paper is: -*The recent advances in diffusion models have set an impressive milestone in many generation tasks. Trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest in academia and industry. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-flow network, dubbed Versatile Diffusion (VD), that handles text-to-image, image-to-text, image-variation, and text-variation in one unified model. 
Moreover, we generalize VD to a unified multi-flow multimodal diffusion framework with grouped layers, swappable streams, and other propositions that can process modalities beyond images and text. Through our experiments, we demonstrate that VD and its underlying framework have the following merits: a) VD handles all subtasks with competitive quality; b) VD initiates novel extensions and applications such as disentanglement of style and semantic, image-text dual-guided generation, etc.; c) Through these experiments and applications, VD provides more semantic insights of the generated outputs.* +*Recent advances in diffusion models have set an impressive milestone in many generation tasks, and trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-task multimodal network, dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image, image-to-text, and variations in one unified model. The pipeline design of VD instantiates a unified multi-flow diffusion framework, consisting of sharable and swappable layer modules that enable the crossmodal generality beyond images and text. Through extensive experiments, we demonstrate that VD successfully achieves the following: a) VD outperforms the baseline approaches and handles all its base tasks with competitive quality; b) VD enables novel extensions such as disentanglement of style and semantics, dual- and multi-context blending, etc.; c) The success of our multi-flow multimodal framework over images and text may inspire further diffusion-based universal AI research.* ## Tips diff --git a/docs/source/en/api/pipelines/wuerstchen.md b/docs/source/en/api/pipelines/wuerstchen.md index 29f1530dc338..127c6df9413e 100644 --- a/docs/source/en/api/pipelines/wuerstchen.md +++ b/docs/source/en/api/pipelines/wuerstchen.md @@ -1,15 +1,27 @@ + + # WΓΌrstchen -[WΓΌrstchen: Efficient Pretraining of Text-to-Image Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, Mats L. Richter and Christopher Pal and Marc Aubreville. +[Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models](https://huggingface.co/papers/2306.00637) is by Pablo Pernias, Dominic Rampas, Mats L. Richter and Christopher Pal and Marc Aubreville. The abstract from the paper is: -*We introduce WΓΌrstchen, a novel technique for text-to-image synthesis that unites competitive performance with unprecedented cost-effectiveness and ease of training on constrained hardware. Building on recent advancements in machine learning, our approach, which utilizes latent diffusion strategies at strong latent image compression rates, significantly reduces the computational burden, typically associated with state-of-the-art models, while preserving, if not enhancing, the quality of generated images. Wuerstchen achieves notable speed improvements at inference time, thereby rendering real-time applications more viable. One of the key advantages of our method lies in its modest training requirements of only 9,200 GPU hours, slashing the usual costs significantly without compromising the end performance. In a comparison against the state-of-the-art, we found the approach to yield strong competitiveness. 
This paper opens the door to a new line of research that prioritizes both performance and computational accessibility, hence democratizing the use of sophisticated AI technologies. Through Wuerstchen, we demonstrate a compelling stride forward in the realm of text-to-image synthesis, offering an innovative path to explore in future research.* +*We introduce WΓΌrstchen, a novel architecture for text-to-image synthesis that combines competitive performance with unprecedented cost-effectiveness for large-scale text-to-image diffusion models. A key contribution of our work is to develop a latent diffusion technique in which we learn a detailed but extremely compact semantic image representation used to guide the diffusion process. This highly compressed representation of an image provides much more detailed guidance compared to latent representations of language and this significantly reduces the computational requirements to achieve state-of-the-art results. Our approach also improves the quality of text-conditioned image generation based on our user preference study. The training requirements of our approach consists of 24,602 A100-GPU hours - compared to Stable Diffusion 2.1's 200,000 GPU hours. Our approach also requires less training data to achieve these results. Furthermore, our compact latent representations allows us to perform inference over twice as fast, slashing the usual costs and carbon footprint of a state-of-the-art (SOTA) diffusion model significantly, without compromising the end performance. In a broader comparison against SOTA models our approach is substantially more efficient and compares favorably in terms of image quality. We believe that this work motivates more emphasis on the prioritization of both performance and computational accessibility.* ## WΓΌrstchen Overview -WΓΌrstchen is a diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by magnitudes. Training on 1024x1024 images is way more expensive than training on 32x32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. WΓΌrstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. WΓΌrstchen employs a two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://huggingface.co/papers/2306.00637) ). A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, while also allowing cheaper and faster inference. +WΓΌrstchen is a diffusion model, whose text-conditional model works in a highly compressed latent space of images. Why is this important? Compressing data can reduce computational costs for both training and inference by magnitudes. Training on 1024x1024 images is way more expensive than training on 32x32. Usually, other works make use of a relatively small compression, in the range of 4x - 8x spatial compression. WΓΌrstchen takes this to an extreme. Through its novel design, we achieve a 42x spatial compression. This was unseen before because common methods fail to faithfully reconstruct detailed images after 16x spatial compression. 
WΓΌrstchen employs a two-stage compression, what we call Stage A and Stage B. Stage A is a VQGAN, and Stage B is a Diffusion Autoencoder (more details can be found in the [paper](https://huggingface.co/papers/2306.00637)). A third model, Stage C, is learned in that highly compressed latent space. This training requires fractions of the compute used for current top-performing models, while also allowing cheaper and faster inference. ## WΓΌrstchen v2 comes to Diffusers @@ -21,7 +33,7 @@ After the initial paper release, we have improved numerous things in the archite - Better quality -We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are: +We are releasing 3 checkpoints for the text-conditional image generation model (Stage C). Those are: - v2-base - v2-aesthetic @@ -45,7 +57,7 @@ pipe = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dty caption = "Anthropomorphic cat dressed as a fire fighter" images = pipe( - caption, + caption, width=1024, height=1536, prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS, @@ -90,7 +102,8 @@ decoder_output = decoder_pipeline( negative_prompt=negative_prompt, guidance_scale=0.0, output_type="pil", -).images +).images[0] +decoder_output ``` ## Speed-Up Inference @@ -113,6 +126,7 @@ after 1024x1024 is 1152x1152 The original codebase, as well as experimental ideas, can be found at [dome272/Wuerstchen](https://github.com/dome272/Wuerstchen). + ## WuerstchenCombinedPipeline [[autodoc]] WuerstchenCombinedPipeline @@ -139,8 +153,8 @@ The original codebase, as well as experimental ideas, can be found at [dome272/W ```bibtex @misc{pernias2023wuerstchen, - title={Wuerstchen: Efficient Pretraining of Text-to-Image Models}, - author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher Pal and Marc Aubreville}, + title={Wuerstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models}, + author={Pablo Pernias and Dominic Rampas and Mats L. Richter and Christopher J. Pal and Marc Aubreville}, year={2023}, eprint={2306.00637}, archivePrefix={arXiv}, diff --git a/docs/source/en/optimization/memory.md b/docs/source/en/optimization/memory.md index 281b65df8d8c..42a1bcea8fb5 100644 --- a/docs/source/en/optimization/memory.md +++ b/docs/source/en/optimization/memory.md @@ -194,9 +194,9 @@ unet_runs_per_experiment = 50 # load inputs def generate_inputs(): - sample = torch.randn(2, 4, 64, 64).half().cuda() - timestep = torch.rand(1).half().cuda() * 999 - encoder_hidden_states = torch.randn(2, 77, 768).half().cuda() + sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16) + timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999 + encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16) return sample, timestep, encoder_hidden_states diff --git a/docs/source/en/training/controlnet.md b/docs/source/en/training/controlnet.md index 40632d67b81e..4be2cbc93252 100644 --- a/docs/source/en/training/controlnet.md +++ b/docs/source/en/training/controlnet.md @@ -12,245 +12,247 @@ specific language governing permissions and limitations under the License. # ControlNet -[Adding Conditional Control to Text-to-Image Diffusion Models](https://arxiv.org/abs/2302.05543) (ControlNet) by Lvmin Zhang and Maneesh Agrawala. +[ControlNet](https://hf.co/papers/2302.05543) models are adapters trained on top of another pretrained model. It allows for a greater degree of control over image generation by conditioning the model with an additional input image. 
The input image can be a canny edge, depth map, human pose, and many more. -This example is based on the [training example in the original ControlNet repository](https://github.com/lllyasviel/ControlNet/blob/main/docs/train.md). It trains a ControlNet to fill circles using a [small synthetic dataset](https://huggingface.co/datasets/fusing/fill50k). +If you're training on a GPU with limited vRAM, you should try enabling the `gradient_checkpointing`, `gradient_accumulation_steps`, and `mixed_precision` parameters in the training command. You can also reduce your memory footprint by using memory-efficient attention with [xFormers](../optimization/xformers). JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn't support gradient checkpointing or xFormers. You should have a GPU with >30GB of memory if you want to train faster with Flax. -## Installing the dependencies +This guide will explore the [train_controlnet.py](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet.py) training script to help you become familiar with it, and how you can adapt it for your own use-case. -Before running the scripts, make sure to install the library's training dependencies. +Before running the script, make sure you install the library from source: - - -To successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the installation up to date. We update the example scripts frequently and install example-specific requirements. - - - -To do this, execute the following steps in a new virtual environment: ```bash git clone https://github.com/huggingface/diffusers cd diffusers -pip install -e . +pip install . ``` -Then navigate into the [example folder](https://github.com/huggingface/diffusers/tree/main/examples/controlnet) +Then navigate to the example folder containing the training script and install the required dependencies for the script you're using: + + + ```bash cd examples/controlnet +pip install -r requirements.txt ``` + + + +If you have access to a TPU, the Flax training script runs even faster! Let's run the training script on the [Google Cloud TPU VM](https://cloud.google.com/tpu/docs/run-calculation-jax). Create a single TPU v4-8 VM and connect to it: -Now run: ```bash -pip install -r requirements.txt +ZONE=us-central2-b +TPU_TYPE=v4-8 +VM_NAME=hg_flax + +gcloud alpha compute tpus tpu-vm create $VM_NAME \ + --zone $ZONE \ + --accelerator-type $TPU_TYPE \ + --version tpu-vm-v4-base + +gcloud alpha compute tpus tpu-vm ssh $VM_NAME --zone $ZONE -- \ +``` + +Install JAX 0.4.5: + +```bash +pip install "jax[tpu]==0.4.5" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html +``` + +Then install the required dependencies for the Flax script: + +```bash +cd examples/controlnet +pip install -r requirements_flax.txt ``` -And initialize an [πŸ€—Accelerate](https://github.com/huggingface/accelerate/) environment with: + + + + + +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. 
+ + + +Initialize an πŸ€— Accelerate environment: ```bash accelerate config ``` -Or for a default πŸ€—Accelerate configuration without answering questions about your environment: +To setup a default πŸ€— Accelerate environment without choosing any configurations: ```bash accelerate config default ``` -Or if your environment doesn't support an interactive shell like a notebook: +Or if your environment doesn't support an interactive shell, like a notebook, you can use: -```python +```bash from accelerate.utils import write_basic_config write_basic_config() ``` -## Circle filling dataset +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. -The original dataset is hosted in the ControlNet [repo](https://huggingface.co/lllyasviel/ControlNet/blob/main/training/fill50k.zip), but we re-uploaded it [here](https://huggingface.co/datasets/fusing/fill50k) to be compatible with πŸ€— Datasets so that it can handle the data loading within the training script. + -Our training examples use [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) because that is what the original set of ControlNet models was trained on. However, ControlNet can be trained to augment any compatible Stable Diffusion model (such as [`CompVis/stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4)) or [`stabilityai/stable-diffusion-2-1`](https://huggingface.co/stabilityai/stable-diffusion-2-1). +The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet.py) and let us know if you have any questions or concerns. -To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. + -## Training +## Script parameters -Download the following images to condition our training with: +The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/controlnet/train_controlnet.py#L231) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like. -```sh -wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png +For example, to speedup training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command: -wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png +```bash +accelerate launch train_controlnet.py \ + --mixed_precision="fp16" ``` -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. 
+Many of the basic and important parameters are described in the [Text-to-image](text2image#script-parameters) training guide, so this guide just focuses on the relevant parameters for ControlNet: -The training script creates and saves a `diffusion_pytorch_model.bin` file in your repository. +- `--max_train_samples`: the number of training samples; this can be lowered for faster training, but if you want to stream really large datasets, you'll need to include this parameter and the `--streaming` parameter in your training command +- `--gradient_accumulation_steps`: number of update steps to accumulate before the backward pass; this allows you to train with a bigger batch size than your GPU memory can typically handle -```bash -export MODEL_DIR="runwayml/stable-diffusion-v1-5" -export OUTPUT_DIR="path to save model" +### Min-SNR weighting + +The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting `epsilon` (noise) or `v_prediction`, but Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script. +Add the `--snr_gamma` parameter and set it to the recommended value of 5.0: + +```bash accelerate launch train_controlnet.py \ - --pretrained_model_name_or_path=$MODEL_DIR \ - --output_dir=$OUTPUT_DIR \ - --dataset_name=fusing/fill50k \ - --resolution=512 \ - --learning_rate=1e-5 \ - --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \ - --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \ - --train_batch_size=4 \ - --push_to_hub + --snr_gamma=5.0 ``` -This default configuration requires ~38GB VRAM. +## Training script -By default, the training script logs outputs to tensorboard. Pass `--report_to wandb` to use Weights & -Biases. +As with the script parameters, a general walkthrough of the training script is provided in the [Text-to-image](text2image#training-script) training guide. Instead, this guide takes a look at the relevant parts of the ControlNet script. -Gradient accumulation with a smaller batch size can be used to reduce training requirements to ~20 GB VRAM. +The training script has a [`make_train_dataset`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/controlnet/train_controlnet.py#L582) function for preprocessing the dataset with image transforms and caption tokenization. You'll see that in addition to the usual caption tokenization and image transforms, the script also includes transforms for the conditioning image. -```bash -export MODEL_DIR="runwayml/stable-diffusion-v1-5" -export OUTPUT_DIR="path to save model" + -accelerate launch train_controlnet.py \ - --pretrained_model_name_or_path=$MODEL_DIR \ - --output_dir=$OUTPUT_DIR \ - --dataset_name=fusing/fill50k \ - --resolution=512 \ - --learning_rate=1e-5 \ - --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \ - --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \ - --train_batch_size=1 \ - --gradient_accumulation_steps=4 \ - --push_to_hub -``` +If you're streaming a dataset on a TPU, performance may be bottlenecked by the πŸ€— Datasets library which is not optimized for images. 
To ensure maximum throughput, you're encouraged to explore other dataset formats like [WebDataset](https://webdataset.github.io/webdataset/), [TorchData](https://github.com/pytorch/data), and [TensorFlow Datasets](https://www.tensorflow.org/datasets/tfless_tfds). -## Training with multiple GPUs + -`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch) -for running distributed training with `accelerate`. Here is an example command: +```py +conditioning_image_transforms = transforms.Compose( + [ + transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR), + transforms.CenterCrop(args.resolution), + transforms.ToTensor(), + ] +) +``` -```bash -export MODEL_DIR="runwayml/stable-diffusion-v1-5" -export OUTPUT_DIR="path to save model" +Within the [`main()`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/controlnet/train_controlnet.py#L713) function, you'll find the code for loading the tokenizer, text encoder, scheduler and models. This is also where the ControlNet model is loaded either from existing weights or randomly initialized from a UNet: -accelerate launch --mixed_precision="fp16" --multi_gpu train_controlnet.py \ - --pretrained_model_name_or_path=$MODEL_DIR \ - --output_dir=$OUTPUT_DIR \ - --dataset_name=fusing/fill50k \ - --resolution=512 \ - --learning_rate=1e-5 \ - --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \ - --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \ - --train_batch_size=4 \ - --mixed_precision="fp16" \ - --tracker_project_name="controlnet-demo" \ - --report_to=wandb \ - --push_to_hub +```py +if args.controlnet_model_name_or_path: + logger.info("Loading existing controlnet weights") + controlnet = ControlNetModel.from_pretrained(args.controlnet_model_name_or_path) +else: + logger.info("Initializing controlnet weights from unet") + controlnet = ControlNetModel.from_unet(unet) ``` -## Example results +The [optimizer](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/controlnet/train_controlnet.py#L871) is set up to update the ControlNet parameters: -#### After 300 steps with batch size 8 +```py +params_to_optimize = controlnet.parameters() +optimizer = optimizer_class( + params_to_optimize, + lr=args.learning_rate, + betas=(args.adam_beta1, args.adam_beta2), + weight_decay=args.adam_weight_decay, + eps=args.adam_epsilon, +) +``` -| | | -|-------------------|:-------------------------:| -| | red circle with blue background | -![conditioning image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png) | ![red circle with blue background](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/red_circle_with_blue_background_300_steps.png) | -| | cyan circle with brown floral background | -![conditioning image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png) | ![cyan circle with brown floral background](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/cyan_circle_with_brown_floral_background_300_steps.png) | +Finally, in the [training 
loop](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/controlnet/train_controlnet.py#L943), the conditioning text embeddings and image are passed to the down and mid-blocks of the ControlNet model: +```py +encoder_hidden_states = text_encoder(batch["input_ids"])[0] +controlnet_image = batch["conditioning_pixel_values"].to(dtype=weight_dtype) + +down_block_res_samples, mid_block_res_sample = controlnet( + noisy_latents, + timesteps, + encoder_hidden_states=encoder_hidden_states, + controlnet_cond=controlnet_image, + return_dict=False, +) +``` -#### After 6000 steps with batch size 8: +If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process. -| | | -|-------------------|:-------------------------:| -| | red circle with blue background | -![conditioning image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png) | ![red circle with blue background](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/red_circle_with_blue_background_6000_steps.png) | -| | cyan circle with brown floral background | -![conditioning image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png) | ![cyan circle with brown floral background](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/cyan_circle_with_brown_floral_background_6000_steps.png) | +## Launch the script -## Training on a 16 GB GPU +Now you're ready to launch the training script! πŸš€ -Enable the following optimizations to train on a 16GB GPU: +This guide uses the [fusing/fill50k](https://huggingface.co/datasets/fusing/fill50k) dataset, but remember, you can create and use your own dataset if you want (see the [Create a dataset for training](create_dataset) guide). -- Gradient checkpointing -- bitsandbyte's 8-bit optimizer (take a look at the [installation]((https://github.com/TimDettmers/bitsandbytes#requirements--installation) instructions if you don't already have it installed) +Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model and `OUTPUT_DIR` to where you want to save the model. 
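For example, mirroring the full command shown further down in this guide (note that the command there names the first variable `MODEL_DIR`):

```bash
export MODEL_DIR="runwayml/stable-diffusion-v1-5"
export OUTPUT_DIR="path/to/save/model"
```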
-Now you can launch the training script: +Download the following images to condition your training with: ```bash -export MODEL_DIR="runwayml/stable-diffusion-v1-5" -export OUTPUT_DIR="path to save model" - -accelerate launch train_controlnet.py \ - --pretrained_model_name_or_path=$MODEL_DIR \ - --output_dir=$OUTPUT_DIR \ - --dataset_name=fusing/fill50k \ - --resolution=512 \ - --learning_rate=1e-5 \ - --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \ - --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \ - --train_batch_size=1 \ - --gradient_accumulation_steps=4 \ - --gradient_checkpointing \ - --use_8bit_adam \ - --push_to_hub +wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png +wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png ``` -## Training on a 12 GB GPU +One more thing before you launch the script! Depending on the GPU you have, you may need to enable certain optimizations to train a ControlNet. The default configuration in this script requires ~38GB of vRAM. If you're training on more than one GPU, add the `--multi_gpu` parameter to the `accelerate launch` command. -Enable the following optimizations to train on a 12GB GPU: -- Gradient checkpointing -- bitsandbyte's 8-bit optimizer (take a look at the [installation]((https://github.com/TimDettmers/bitsandbytes#requirements--installation) instructions if you don't already have it installed) -- xFormers (take a look at the [installation](https://huggingface.co/docs/diffusers/training/optimization/xformers) instructions if you don't already have it installed) -- set gradients to `None` + + -```bash -export MODEL_DIR="runwayml/stable-diffusion-v1-5" -export OUTPUT_DIR="path to save model" +On a 16GB GPU, you can use bitsandbytes 8-bit optimizer and gradient checkpointing to optimize your training run. Install bitsandbytes: +```py +pip install bitsandbytes +``` + +Then, add the following parameter to your training command: + +```bash accelerate launch train_controlnet.py \ - --pretrained_model_name_or_path=$MODEL_DIR \ - --output_dir=$OUTPUT_DIR \ - --dataset_name=fusing/fill50k \ - --resolution=512 \ - --learning_rate=1e-5 \ - --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \ - --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \ - --train_batch_size=1 \ - --gradient_accumulation_steps=4 \ - --gradient_checkpointing \ - --use_8bit_adam \ - --enable_xformers_memory_efficient_attention \ - --set_grads_to_none \ - --push_to_hub + --gradient_checkpointing \ + --use_8bit_adam \ ``` -When using `enable_xformers_memory_efficient_attention`, please make sure to install `xformers` by `pip install xformers`. + + -## Training on an 8 GB GPU +On a 12GB GPU, you'll need bitsandbytes 8-bit optimizer, gradient checkpointing, xFormers, and set the gradients to `None` instead of zero to reduce your memory-usage. -We have not exhaustively tested DeepSpeed support for ControlNet. While the configuration does -save memory, we have not confirmed whether the configuration trains successfully. You will very likely -have to make changes to the config to have a successful training run. 
+```bash +accelerate launch train_controlnet.py \ + --use_8bit_adam \ + --gradient_checkpointing \ + --enable_xformers_memory_efficient_attention \ + --set_grads_to_none \ +``` -Enable the following optimizations to train on a 8GB GPU: -- Gradient checkpointing -- bitsandbyte's 8-bit optimizer (take a look at the [installation]((https://github.com/TimDettmers/bitsandbytes#requirements--installation) instructions if you don't already have it installed) -- xFormers (take a look at the [installation](https://huggingface.co/docs/diffusers/training/optimization/xformers) instructions if you don't already have it installed) -- set gradients to `None` -- DeepSpeed stage 2 with parameter and optimizer offloading -- fp16 mixed precision + + -[DeepSpeed](https://www.deepspeed.ai/) can offload tensors from VRAM to either -CPU or NVME. This requires significantly more RAM (about 25 GB). +On a 8GB GPU, you'll need to use [DeepSpeed](https://www.deepspeed.ai/) to offload some of the tensors from the vRAM to either the CPU or NVME to allow training with less GPU memory. -You'll have to configure your environment with `accelerate config` to enable DeepSpeed stage 2. +Run the following command to configure your πŸ€— Accelerate environment: -The configuration file should look like this: +```bash +accelerate config +``` + +During configuration, confirm that you want to use DeepSpeed stage 2. Now it should be possible to train on under 8GB vRAM by combining DeepSpeed stage 2, fp16 mixed precision, and offloading the model parameters and the optimizer state to the CPU. The drawback is that this requires more system RAM (~25 GB). See the [DeepSpeed documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more configuration options. Your configuration file should look something like: -```yaml +```bash compute_environment: LOCAL_MACHINE deepspeed_config: gradient_accumulation_steps: 4 @@ -261,73 +263,104 @@ deepspeed_config: distributed_type: DEEPSPEED ``` - +You should also change the default Adam optimizer to DeepSpeed’s optimized version of Adam [`deepspeed.ops.adam.DeepSpeedCPUAdam`](https://deepspeed.readthedocs.io/en/latest/optimizers.html#adam-cpu) for a substantial speedup. Enabling `DeepSpeedCPUAdam` requires your system’s CUDA toolchain version to be the same as the one installed with PyTorch. -See [documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more DeepSpeed configuration options. +bitsandbytes 8-bit optimizers don’t seem to be compatible with DeepSpeed at the moment. - +That's it! You don't need to add any additional parameters to your training command. -Changing the default Adam optimizer to DeepSpeed's Adam -`deepspeed.ops.adam.DeepSpeedCPUAdam` gives a substantial speedup but -it requires a CUDA toolchain with the same version as PyTorch. 8-bit optimizer -does not seem to be compatible with DeepSpeed at the moment. 
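For reference, a complete stage 2 configuration with CPU offloading along the lines described above might look roughly like the following; the offload fields and `zero3_init_flag` are assumptions about what `accelerate config` writes out, not a copy of a generated file:

```yaml
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 4
  offload_optimizer_device: cpu  # optimizer state is kept in system RAM
  offload_param_device: cpu      # model parameters are offloaded to system RAM
  zero3_init_flag: false
  zero_stage: 2                  # ZeRO stage 2, as selected during configuration
distributed_type: DEEPSPEED
mixed_precision: fp16
```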
+ + + + + ```bash export MODEL_DIR="runwayml/stable-diffusion-v1-5" -export OUTPUT_DIR="path to save model" +export OUTPUT_DIR="path/to/save/model" accelerate launch train_controlnet.py \ --pretrained_model_name_or_path=$MODEL_DIR \ --output_dir=$OUTPUT_DIR \ --dataset_name=fusing/fill50k \ --resolution=512 \ + --learning_rate=1e-5 \ --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \ --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \ --train_batch_size=1 \ --gradient_accumulation_steps=4 \ - --gradient_checkpointing \ - --enable_xformers_memory_efficient_attention \ - --set_grads_to_none \ - --mixed_precision fp16 \ --push_to_hub ``` -## Inference + + -The trained model can be run with the [`StableDiffusionControlNetPipeline`]. -Set `base_model_path` and `controlnet_path` to the values `--pretrained_model_name_or_path` and -`--output_dir` were respectively set to in the training script. +With Flax, you can [profile your code](https://jax.readthedocs.io/en/latest/profiling.html) by adding the `--profile_steps==5` parameter to your training command. Install the Tensorboard profile plugin: -```py -from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, UniPCMultistepScheduler -from diffusers.utils import load_image -import torch +```bash +pip install tensorflow tensorboard-plugin-profile +tensorboard --logdir runs/fill-circle-100steps-20230411_165612/ +``` -base_model_path = "path to model" -controlnet_path = "path to controlnet" +Then you can inspect the profile at [http://localhost:6006/#profile](http://localhost:6006/#profile). -controlnet = ControlNetModel.from_pretrained(controlnet_path, torch_dtype=torch.float16, use_safetensors=True) -pipe = StableDiffusionControlNetPipeline.from_pretrained( - base_model_path, controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True -) + -# speed up diffusion process with faster scheduler and memory optimization -pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) -# remove following line if xformers is not installed -pipe.enable_xformers_memory_efficient_attention() +If you run into version conflicts with the plugin, try uninstalling and reinstalling all versions of TensorFlow and Tensorboard. The debugging functionality of the profile plugin is still experimental, and not all views are fully functional. The `trace_viewer` cuts off events after 1M, which can result in all your device traces getting lost if for example, you profile the compilation step by accident. -pipe.enable_model_cpu_offload() + + +```bash +python3 train_controlnet_flax.py \ + --pretrained_model_name_or_path=$MODEL_DIR \ + --output_dir=$OUTPUT_DIR \ + --dataset_name=fusing/fill50k \ + --resolution=512 \ + --learning_rate=1e-5 \ + --validation_image "./conditioning_image_1.png" "./conditioning_image_2.png" \ + --validation_prompt "red circle with blue background" "cyan circle with brown floral background" \ + --validation_steps=1000 \ + --train_batch_size=2 \ + --revision="non-ema" \ + --from_pt \ + --report_to="wandb" \ + --tracker_project_name=$HUB_MODEL_ID \ + --num_train_epochs=11 \ + --push_to_hub \ + --hub_model_id=$HUB_MODEL_ID +``` + + + + +Once training is complete, you can use your newly trained model for inference! 
+ +```py +from diffusers import StableDiffusionControlNetPipeline, ControlNetModel +from diffusers.utils import load_image +import torch + +controlnet = ControlNetModel.from_pretrained("path/to/controlnet", torch_dtype=torch.float16) +pipeline = StableDiffusionControlNetPipeline.from_pretrained( + "path/to/base/model", controlnet=controlnet, torch_dtype=torch.float16 +).to("cuda") control_image = load_image("./conditioning_image_1.png") prompt = "pale golden rod circle with old lace background" -# generate image generator = torch.manual_seed(0) image = pipe(prompt, num_inference_steps=20, generator=generator, image=control_image).images[0] - image.save("./output.png") ``` ## Stable Diffusion XL -Training with [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) is also supported via the `train_controlnet_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/README_sdxl.md). +Stable Diffusion XL (SDXL) is a powerful text-to-image model that generates high-resolution images, and it adds a second text-encoder to its architecture. Use the [`train_controlnet_sdxl.py`](https://github.com/huggingface/diffusers/blob/main/examples/controlnet/train_controlnet_sdxl.py) script to train a ControlNet adapter for the SDXL model. + +The SDXL training script is discussed in more detail in the [SDXL training](sdxl) guide. + +## Next steps + +Congratulations on training your own ControlNet! To learn more about how to use your new model, the following guides may be helpful: + +- Learn how to [use a ControlNet](../using-diffusers/controlnet) for inference on a variety of tasks. \ No newline at end of file diff --git a/docs/source/en/training/custom_diffusion.md b/docs/source/en/training/custom_diffusion.md index 153ae81f1216..6601a7a93284 100644 --- a/docs/source/en/training/custom_diffusion.md +++ b/docs/source/en/training/custom_diffusion.md @@ -10,149 +10,276 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Custom Diffusion training example +# Custom Diffusion -[Custom Diffusion](https://arxiv.org/abs/2212.04488) is a method to customize text-to-image models like Stable Diffusion given just a few (4~5) images of a subject. -The `train_custom_diffusion.py` script shows how to implement the training procedure and adapt it for stable diffusion. +[Custom Diffusion](https://huggingface.co/papers/2212.04488) is a training technique for personalizing image generation models. Like Textual Inversion, DreamBooth, and LoRA, Custom Diffusion only requires a few (~4-5) example images. This technique works by only training weights in the cross-attention layers, and it uses a special word to represent the newly learned concept. Custom Diffusion is unique because it can also learn multiple concepts at the same time. -This training example was contributed by [Nupur Kumari](https://nupurkmr9.github.io/) (one of the authors of Custom Diffusion). +If you're training on a GPU with limited vRAM, you should try enabling xFormers with `--enable_xformers_memory_efficient_attention` for faster training with lower vRAM requirements (16GB). To save even more memory, add `--set_grads_to_none` in the training argument to set the gradients to `None` instead of zero (this option can cause some issues, so if you experience any, try removing this parameter). 
-## Running locally with PyTorch +This guide will explore the [train_custom_diffusion.py](https://github.com/huggingface/diffusers/blob/main/examples/custom_diffusion/train_custom_diffusion.py) script to help you become more familiar with it, and how you can adapt it for your own use-case. -### Installing the dependencies - -Before running the scripts, make sure to install the library's training dependencies: - -**Important** - -To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: +Before running the script, make sure you install the library from source: ```bash git clone https://github.com/huggingface/diffusers cd diffusers -pip install -e . +pip install . ``` -Then cd into the [example folder](https://github.com/huggingface/diffusers/tree/main/examples/custom_diffusion) +Navigate to the example folder with the training script and install the required dependencies: -``` +```bash cd examples/custom_diffusion +pip install -r requirements.txt +pip install clip-retrieval ``` -Now run + -```bash -pip install -r requirements.txt -pip install clip-retrieval -``` +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. -And initialize an [πŸ€—Accelerate](https://github.com/huggingface/accelerate/) environment with: + + +Initialize an πŸ€— Accelerate environment: ```bash accelerate config ``` -Or for a default accelerate configuration without answering questions about your environment +To setup a default πŸ€— Accelerate environment without choosing any configurations: ```bash accelerate config default ``` -Or if your environment doesn't support an interactive shell e.g. a notebook +Or if your environment doesn't support an interactive shell, like a notebook, you can use: -```python +```bash from accelerate.utils import write_basic_config write_basic_config() ``` -### Cat example 😺 -Now let's get our dataset. Download dataset from [here](https://www.cs.cmu.edu/~custom-diffusion/assets/data.zip) and unzip it. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. + + + +The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/custom_diffusion/train_custom_diffusion.py) and let us know if you have any questions or concerns. + + -We also collect 200 real images using `clip-retrieval` which are combined with the target images in the training dataset as a regularization. This prevents overfitting to the given target image. The following flags enable the regularization `with_prior_preservation`, `real_prior` with `prior_loss_weight=1.`. 
-The `class_prompt` should be the category name same as target image. The collected real images are with text captions similar to the `class_prompt`. The retrieved image are saved in `class_data_dir`. You can disable `real_prior` to use generated images as regularization. To collect the real images use this command first before training. +## Script parameters + +The training script contains all the parameters to help you customize your training run. These are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/custom_diffusion/train_custom_diffusion.py#L319) function. The function comes with default values, but you can also set your own values in the training command if you'd like. + +For example, to change the resolution of the input image: ```bash -pip install clip-retrieval -python retrieve.py --class_prompt cat --class_data_dir real_reg/samples_cat --num_class_images 200 +accelerate launch train_custom_diffusion.py \ + --resolution=256 ``` -**___Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model.___** +Many of the basic parameters are described in the [DreamBooth](dreambooth#script-parameters) training guide, so this guide focuses on the parameters unique to Custom Diffusion: + +- `--freeze_model`: freezes the key and value parameters in the cross-attention layer; the default is `crossattn_kv`, but you can set it to `crossattn` to train all the parameters in the cross-attention layer +- `--concepts_list`: to learn multiple concepts, provide a path to a JSON file containing the concepts +- `--modifier_token`: a special word used to represent the learned concept +- `--initializer_token`: + +### Prior preservation loss + +Prior preservation loss is a method that uses a model's own generated samples to help it learn how to generate more diverse images. Because these generated sample images belong to the same class as the images you provided, they help the model retain what it has learned about the class and how it can use what it already knows about the class to make new compositions. -The script creates and saves model checkpoints and a `pytorch_custom_diffusion_weights.bin` file in your repository. +Many of the parameters for prior preservation loss are described in the [DreamBooth](dreambooth#prior-preservation-loss) training guide. + +### Regularization + +Custom Diffusion includes training the target images with a small set of real images to prevent overfitting. As you can imagine, this can be easy to do when you're only training on a few images! Download 200 real images with `clip_retrieval`. The `class_prompt` should be the same category as the target images. These images are stored in `class_data_dir`. 
```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export OUTPUT_DIR="path-to-save-model" -export INSTANCE_DIR="./data/cat" +python retrieve.py --class_prompt cat --class_data_dir real_reg/samples_cat --num_class_images 200 +``` +To enable regularization, add the following parameters: + +- `--with_prior_preservation`: whether to use prior preservation loss +- `--prior_loss_weight`: controls the influence of the prior preservation loss on the model +- `--real_prior`: whether to use a small set of real images to prevent overfitting + +```bash accelerate launch train_custom_diffusion.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --class_data_dir=./real_reg/samples_cat/ \ - --with_prior_preservation --real_prior --prior_loss_weight=1.0 \ - --class_prompt="cat" --num_class_images=200 \ - --instance_prompt="photo of a cat" \ - --resolution=512 \ - --train_batch_size=2 \ - --learning_rate=1e-5 \ - --lr_warmup_steps=0 \ - --max_train_steps=250 \ - --scale_lr --hflip \ - --modifier_token "" \ - --push_to_hub + --with_prior_preservation \ + --prior_loss_weight=1.0 \ + --class_data_dir="./real_reg/samples_cat" \ + --class_prompt="cat" \ + --real_prior=True \ ``` -**Use `--enable_xformers_memory_efficient_attention` for faster training with lower VRAM requirement (16GB per GPU). Follow [this guide](https://github.com/facebookresearch/xformers) for installation instructions.** +## Training script + + + +A lot of the code in the Custom Diffusion training script is similar to the [DreamBooth](dreambooth#training-script) script. This guide instead focuses on the code that is relevant to Custom Diffusion. + + + +The Custom Diffusion training script has two dataset classes: + +- [`CustomDiffusionDataset`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/custom_diffusion/train_custom_diffusion.py#L165): preprocesses the images, class images, and prompts for training +- [`PromptDataset`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/custom_diffusion/train_custom_diffusion.py#L148): prepares the prompts for generating class images + +Next, the `modifier_token` is [added to the tokenizer](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/custom_diffusion/train_custom_diffusion.py#L811), converted to token ids, and the token embeddings are resized to account for the new `modifier_token`. Then the `modifier_token` embeddings are initialized with the embeddings of the `initializer_token`. All parameters in the text encoder are frozen, except for the token embeddings since this is what the model is trying to learn to associate with the concepts. + +```py +params_to_freeze = itertools.chain( + text_encoder.text_model.encoder.parameters(), + text_encoder.text_model.final_layer_norm.parameters(), + text_encoder.text_model.embeddings.position_embedding.parameters(), +) +freeze_params(params_to_freeze) +``` + +Now you'll need to add the [Custom Diffusion weights](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/custom_diffusion/train_custom_diffusion.py#L911C3-L911C3) to the attention layers. This is a really important step for getting the shape and size of the attention weights correct, and for setting the appropriate number of attention processors in each UNet block. 
+ +```py +st = unet.state_dict() +for name, _ in unet.attn_processors.items(): + cross_attention_dim = None if name.endswith("attn1.processor") else unet.config.cross_attention_dim + if name.startswith("mid_block"): + hidden_size = unet.config.block_out_channels[-1] + elif name.startswith("up_blocks"): + block_id = int(name[len("up_blocks.")]) + hidden_size = list(reversed(unet.config.block_out_channels))[block_id] + elif name.startswith("down_blocks"): + block_id = int(name[len("down_blocks.")]) + hidden_size = unet.config.block_out_channels[block_id] + layer_name = name.split(".processor")[0] + weights = { + "to_k_custom_diffusion.weight": st[layer_name + ".to_k.weight"], + "to_v_custom_diffusion.weight": st[layer_name + ".to_v.weight"], + } + if train_q_out: + weights["to_q_custom_diffusion.weight"] = st[layer_name + ".to_q.weight"] + weights["to_out_custom_diffusion.0.weight"] = st[layer_name + ".to_out.0.weight"] + weights["to_out_custom_diffusion.0.bias"] = st[layer_name + ".to_out.0.bias"] + if cross_attention_dim is not None: + custom_diffusion_attn_procs[name] = attention_class( + train_kv=train_kv, + train_q_out=train_q_out, + hidden_size=hidden_size, + cross_attention_dim=cross_attention_dim, + ).to(unet.device) + custom_diffusion_attn_procs[name].load_state_dict(weights) + else: + custom_diffusion_attn_procs[name] = attention_class( + train_kv=False, + train_q_out=False, + hidden_size=hidden_size, + cross_attention_dim=cross_attention_dim, + ) +del st +unet.set_attn_processor(custom_diffusion_attn_procs) +custom_diffusion_layers = AttnProcsLayers(unet.attn_processors) +``` + +The [optimizer](https://github.com/huggingface/diffusers/blob/84cd9e8d01adb47f046b1ee449fc76a0c32dc4e2/examples/custom_diffusion/train_custom_diffusion.py#L982) is initialized to update the cross-attention layer parameters: + +```py +optimizer = optimizer_class( + itertools.chain(text_encoder.get_input_embeddings().parameters(), custom_diffusion_layers.parameters()) + if args.modifier_token is not None + else custom_diffusion_layers.parameters(), + lr=args.learning_rate, + betas=(args.adam_beta1, args.adam_beta2), + weight_decay=args.adam_weight_decay, + eps=args.adam_epsilon, +) +``` + +In the [training loop](https://github.com/huggingface/diffusers/blob/84cd9e8d01adb47f046b1ee449fc76a0c32dc4e2/examples/custom_diffusion/train_custom_diffusion.py#L1048), it is important to only update the embeddings for the concept you're trying to learn. This means setting the gradients of all the other token embeddings to zero: + +```py +if args.modifier_token is not None: + if accelerator.num_processes > 1: + grads_text_encoder = text_encoder.module.get_input_embeddings().weight.grad + else: + grads_text_encoder = text_encoder.get_input_embeddings().weight.grad + index_grads_to_zero = torch.arange(len(tokenizer)) != modifier_token_id[0] + for i in range(len(modifier_token_id[1:])): + index_grads_to_zero = index_grads_to_zero & ( + torch.arange(len(tokenizer)) != modifier_token_id[i] + ) + grads_text_encoder.data[index_grads_to_zero, :] = grads_text_encoder.data[ + index_grads_to_zero, : + ].fill_(0) +``` -To track your experiments using Weights and Biases (`wandb`) and to save intermediate results (which we HIGHLY recommend), follow these steps: +## Launch the script -* Install `wandb`: `pip install wandb`. -* Authorize: `wandb login`. -* Then specify a `validation_prompt` and set `report_to` to `wandb` while launching training. 
You can also configure the following related arguments:
-    * `num_validation_images`
-    * `validation_steps`
+Once you’ve made all your changes or you’re okay with the default configuration, you’re ready to launch the training script! πŸš€
 
-Here is an example command:
+In this guide, you'll download and use these example [cat images](https://www.cs.cmu.edu/~custom-diffusion/assets/data.zip). You can also create and use your own dataset if you want (see the [Create a dataset for training](create_dataset) guide).
+
+Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model, `INSTANCE_DIR` to the path where you just downloaded the cat images to, and `OUTPUT_DIR` to where you want to save the model. You'll use `<new1>` as the special word to tie the newly learned embeddings to. The script creates and saves model checkpoints and a `pytorch_custom_diffusion_weights.bin` file to your repository.
+
+To monitor training progress with Weights and Biases, add the `--report_to=wandb` parameter to the training command and specify a validation prompt with `--validation_prompt`. This is useful for debugging and saving intermediate results.
+
+
+
+If you're training on human faces, the Custom Diffusion team has found the following parameters to work well:
+
+- `--learning_rate=5e-6`
+- `--max_train_steps` can be anywhere between 1000 and 2000
+- `--freeze_model=crossattn`
+- use at least 15-20 images to train with
+
+
+
+
+
 ```bash
+export MODEL_NAME="CompVis/stable-diffusion-v1-4"
+export OUTPUT_DIR="path-to-save-model"
+export INSTANCE_DIR="./data/cat"
+
 accelerate launch train_custom_diffusion.py \
   --pretrained_model_name_or_path=$MODEL_NAME \
   --instance_data_dir=$INSTANCE_DIR \
   --output_dir=$OUTPUT_DIR \
   --class_data_dir=./real_reg/samples_cat/ \
-  --with_prior_preservation --real_prior --prior_loss_weight=1.0 \
-  --class_prompt="cat" --num_class_images=200 \
+  --with_prior_preservation \
+  --real_prior \
+  --prior_loss_weight=1.0 \
+  --class_prompt="cat" \
+  --num_class_images=200 \
   --instance_prompt="photo of a cat" \
   --resolution=512 \
   --train_batch_size=2 \
   --learning_rate=1e-5 \
   --lr_warmup_steps=0 \
   --max_train_steps=250 \
-  --scale_lr --hflip \
+  --scale_lr \
+  --hflip \
   --modifier_token "<new1>" \
   --validation_prompt="<new1> cat sitting in a bucket" \
   --report_to="wandb" \
   --push_to_hub
 ```
-Here is an example [Weights and Biases page](https://wandb.ai/sayakpaul/custom-diffusion/runs/26ghrcau) where you can check out the intermediate results along with other training details.
+
+
-If you specify `--push_to_hub`, the learned parameters will be pushed to a repository on the Hugging Face Hub. Here is an [example repository](https://huggingface.co/sayakpaul/custom-diffusion-cat).
+Custom Diffusion can also learn multiple concepts if you provide a [JSON](https://github.com/adobe-research/custom-diffusion/blob/main/assets/concept_list.json) file with some details about each concept it should learn (a minimal sketch of such a file is shown below).
-### Training on multiple concepts 🐱πŸͺ΅
-
-Provide a [json](https://github.com/adobe-research/custom-diffusion/blob/main/assets/concept_list.json) file with the info about each concept, similar to [this](https://github.com/ShivamShrirao/diffusers/blob/main/examples/dreambooth/train_dreambooth.py).
-
-To collect the real images run this command for each concept in the json file.
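The JSON file is a list with one entry per concept. As a rough sketch (the key names follow the linked `concept_list.json` example; the prompts, directories, and `<new1>`/`<new2>` tokens are placeholders to adapt to your own data), you could generate it like this:

```py
import json

# Hypothetical two-concept list; adjust the prompts and directories to your own data.
concepts_list = [
    {
        "instance_prompt": "photo of a <new1> cat",
        "class_prompt": "cat",
        "instance_data_dir": "./data/cat",
        "class_data_dir": "./real_reg/samples_cat",
    },
    {
        "instance_prompt": "photo of a <new2> wooden pot",
        "class_prompt": "wooden pot",
        "instance_data_dir": "./data/wooden_pot",
        "class_data_dir": "./real_reg/samples_wooden_pot",
    },
]

# Write the file that --concepts_list points to.
with open("concept_list.json", "w") as f:
    json.dump(concepts_list, f, indent=4)
```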
+Run clip-retrieval to collect some real images to use for regularization:
 
 ```bash
 pip install clip-retrieval
 python retrieve.py --class_prompt {} --class_data_dir {} --num_class_images 200
 ```
 
-And then we're ready to start training!
+Then you can launch the script:
 
 ```bash
 export MODEL_NAME="CompVis/stable-diffusion-v1-4"
@@ -162,73 +289,40 @@ accelerate launch train_custom_diffusion.py \
   --pretrained_model_name_or_path=$MODEL_NAME \
   --output_dir=$OUTPUT_DIR \
   --concepts_list=./concept_list.json \
-  --with_prior_preservation --real_prior --prior_loss_weight=1.0 \
+  --with_prior_preservation \
+  --real_prior \
+  --prior_loss_weight=1.0 \
   --resolution=512 \
   --train_batch_size=2 \
   --learning_rate=1e-5 \
   --lr_warmup_steps=0 \
   --max_train_steps=500 \
   --num_class_images=200 \
-  --scale_lr --hflip \
+  --scale_lr \
+  --hflip \
   --modifier_token "<new1>+<new2>" \
   --push_to_hub
 ```
 
-Here is an example [Weights and Biases page](https://wandb.ai/sayakpaul/custom-diffusion/runs/3990tzkg) where you can check out the intermediate results along with other training details.
-
-### Training on human faces
-
-For fine-tuning on human faces we found the following configuration to work better: `learning_rate=5e-6`, `max_train_steps=1000 to 2000`, and `freeze_model=crossattn` with at least 15-20 images.
-
-To collect the real images use this command first before training.
-
-```bash
-pip install clip-retrieval
-python retrieve.py --class_prompt person --class_data_dir real_reg/samples_person --num_class_images 200
-```
-
-Then start training!
-
-```bash
-export MODEL_NAME="CompVis/stable-diffusion-v1-4"
-export OUTPUT_DIR="path-to-save-model"
-export INSTANCE_DIR="path-to-images"
-
-accelerate launch train_custom_diffusion.py \
-  --pretrained_model_name_or_path=$MODEL_NAME \
-  --instance_data_dir=$INSTANCE_DIR \
-  --output_dir=$OUTPUT_DIR \
-  --class_data_dir=./real_reg/samples_person/ \
-  --with_prior_preservation --real_prior --prior_loss_weight=1.0 \
-  --class_prompt="person" --num_class_images=200 \
-  --instance_prompt="photo of a person" \
-  --resolution=512 \
-  --train_batch_size=2 \
-  --learning_rate=5e-6 \
-  --lr_warmup_steps=0 \
-  --max_train_steps=1000 \
-  --scale_lr --hflip --noaug \
-  --freeze_model crossattn \
-  --modifier_token "<new1>" \
-  --enable_xformers_memory_efficient_attention \
-  --push_to_hub
-```
+
+
-## Inference
+Once training is finished, you can use your new Custom Diffusion model for inference.
-
-Once you have trained a model using the above command, you can run inference using the below command. Make sure to include the `modifier token` (e.g. \<new1\> in above example) in your prompt.
+
+
-```python
+```py
 import torch
 from diffusers import DiffusionPipeline
 
-pipe = DiffusionPipeline.from_pretrained(
-    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16, use_safetensors=True
+pipeline = DiffusionPipeline.from_pretrained(
+    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16,
 ).to("cuda")
-pipe.unet.load_attn_procs("path-to-save-model", weight_name="pytorch_custom_diffusion_weights.bin")
-pipe.load_textual_inversion("path-to-save-model", weight_name="<new1>.bin")
+pipeline.unet.load_attn_procs("path-to-save-model", weight_name="pytorch_custom_diffusion_weights.bin")
+pipeline.load_textual_inversion("path-to-save-model", weight_name="<new1>.bin")
 
-image = pipe(
+image = pipeline(
     "<new1> cat sitting in a bucket",
     num_inference_steps=100,
     guidance_scale=6.0,
@@ -237,47 +331,20 @@ image = pipe(
 image.save("cat.png")
 ```
 
-It's possible to directly load these parameters from a Hub repository:
+
+
-```python
+```py
 import torch
 from huggingface_hub.repocard import RepoCard
 from diffusers import DiffusionPipeline
 
-model_id = "sayakpaul/custom-diffusion-cat"
-card = RepoCard.load(model_id)
-base_model_id = card.data.to_dict()["base_model"]
+model_id = "sayakpaul/custom-diffusion-cat-wooden-pot"
+pipeline = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
+pipeline.unet.load_attn_procs(model_id, weight_name="pytorch_custom_diffusion_weights.bin")
+pipeline.load_textual_inversion(model_id, weight_name="<new1>.bin")
+pipeline.load_textual_inversion(model_id, weight_name="<new2>.bin")
 
-pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda")
-pipe.unet.load_attn_procs(model_id, weight_name="pytorch_custom_diffusion_weights.bin")
-pipe.load_textual_inversion(model_id, weight_name="<new1>.bin")
-
-image = pipe(
-    "<new1> cat sitting in a bucket",
-    num_inference_steps=100,
-    guidance_scale=6.0,
-    eta=1.0,
-).images[0]
-image.save("cat.png")
-```
-
-Here is an example of performing inference with multiple concepts:
-
-```python
-import torch
-from huggingface_hub.repocard import RepoCard
-from diffusers import DiffusionPipeline
-
-model_id = "sayakpaul/custom-diffusion-cat-wooden-pot"
-card = RepoCard.load(model_id)
-base_model_id = card.data.to_dict()["base_model"]
-
-pipe = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda")
-pipe.unet.load_attn_procs(model_id, weight_name="pytorch_custom_diffusion_weights.bin")
-pipe.load_textual_inversion(model_id, weight_name="<new1>.bin")
-pipe.load_textual_inversion(model_id, weight_name="<new2>.bin")
-
-image = pipe(
+image = pipeline(
     "the <new1> cat sculpture in the style of a <new2> wooden pot",
     num_inference_steps=100,
     guidance_scale=6.0,
@@ -286,20 +353,11 @@ image = pipe(
 image.save("multi-subject.png")
 ```
 
-Here, `cat` and `wooden pot` refer to the multiple concepts.
-
-### Inference from a training checkpoint
-
-You can also perform inference from one of the complete checkpoint saved during the training process, if you used the `--checkpointing_steps` argument.
-
-TODO.
-
-## Set grads to none
-
-To save even more memory, pass the `--set_grads_to_none` argument to the script. This will set grads to None instead of zero. However, be aware that it changes certain behaviors, so if you start experiencing any problems, remove this argument.
+
+
-More info: https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html
+## Next steps
-## Experimental results
+Congratulations on training a model with Custom Diffusion!
πŸŽ‰ To learn more: -You can refer to [our webpage](https://www.cs.cmu.edu/~custom-diffusion/) that discusses our experiments in detail. +- Read the [Multi-Concept Customization of Text-to-Image Diffusion](https://www.cs.cmu.edu/~custom-diffusion/) blog post to learn more details about the experimental results from the Custom Diffusion team. \ No newline at end of file diff --git a/docs/source/en/training/dreambooth.md b/docs/source/en/training/dreambooth.md index 30a20a971966..e71d2ea7bbe7 100644 --- a/docs/source/en/training/dreambooth.md +++ b/docs/source/en/training/dreambooth.md @@ -12,430 +12,287 @@ specific language governing permissions and limitations under the License. # DreamBooth -[DreamBooth](https://arxiv.org/abs/2208.12242) is a method to personalize text-to-image models like Stable Diffusion given just a few (3-5) images of a subject. It allows the model to generate contextualized images of the subject in different scenes, poses, and views. +[DreamBooth](https://huggingface.co/papers/2208.12242) is a training technique that updates the entire diffusion model by training on just a few images of a subject or style. It works by associating a special word in the prompt with the example images. -![Dreambooth examples from the project's blog](https://dreambooth.github.io/DreamBooth_files/teaser_static.jpg) -Dreambooth examples from the project's blog. +If you're training on a GPU with limited vRAM, you should try enabling the `gradient_checkpointing` and `mixed_precision` parameters in the training command. You can also reduce your memory footprint by using memory-efficient attention with [xFormers](../optimization/xformers). JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn't support gradient checkpointing or xFormers. You should have a GPU with >30GB of memory if you want to train faster with Flax. -This guide will show you how to finetune DreamBooth with the [`CompVis/stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4) model for various GPU sizes, and with Flax. All the training scripts for DreamBooth used in this guide can be found [here](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth) if you're interested in digging deeper and seeing how things work. +This guide will explore the [train_dreambooth.py](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py) script to help you become more familiar with it, and how you can adapt it for your own use-case. -Before running the scripts, make sure you install the library's training dependencies. We also recommend installing 🧨 Diffusers from the `main` GitHub branch: +Before running the script, make sure you install the library from source: ```bash -pip install git+https://github.com/huggingface/diffusers -pip install -U -r diffusers/examples/dreambooth/requirements.txt +git clone https://github.com/huggingface/diffusers +cd diffusers +pip install . ``` -xFormers is not part of the training requirements, but we recommend you [install](../optimization/xformers) it if you can because it could make your training faster and less memory intensive. 
+Navigate to the example folder with the training script and install the required dependencies for the script you're using: -After all the dependencies have been set up, initialize a [πŸ€— Accelerate](https://github.com/huggingface/accelerate/) environment with: + + ```bash -accelerate config +cd examples/dreambooth +pip install -r requirements.txt ``` -To setup a default πŸ€— Accelerate environment without choosing any configurations: + + ```bash -accelerate config default +cd examples/dreambooth +pip install -r requirements_flax.txt ``` -Or if your environment doesn't support an interactive shell like a notebook, you can use: - -```py -from accelerate.utils import write_basic_config - -write_basic_config() -``` + + -Finally, download a [few images of a dog](https://huggingface.co/datasets/diffusers/dog-example) to DreamBooth with: + -```py -from huggingface_hub import snapshot_download - -local_dir = "./dog" -snapshot_download( - "diffusers/dog-example", - local_dir=local_dir, - repo_type="dataset", - ignore_patterns=".gitattributes", -) -``` - -To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. - -## Finetuning - - - -DreamBooth finetuning is very sensitive to hyperparameters and easy to overfit. We recommend you take a look at our [in-depth analysis](https://huggingface.co/blog/dreambooth) with recommended settings for different subjects to help you choose the appropriate hyperparameters. +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. - - -Set the `INSTANCE_DIR` environment variable to the path of the directory containing the dog images. - -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`] argument. The `instance_prompt` argument is a text prompt that contains a unique identifier, such as `sks`, and the class the image belongs to, which in this example is `a photo of a sks dog`. +Initialize an πŸ€— Accelerate environment: ```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export INSTANCE_DIR="./dog" -export OUTPUT_DIR="path_to_saved_model" -``` - -Then you can launch the training script (you can find the full training script [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py)) with the following command: - -```bash -accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a photo of sks dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --gradient_accumulation_steps=1 \ - --learning_rate=5e-6 \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --max_train_steps=400 \ - --push_to_hub +accelerate config ``` - - -If you have access to TPUs or want to train even faster, you can try out the [Flax training script](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_flax.py). The Flax training script doesn't support gradient checkpointing or gradient accumulation, so you'll need a GPU with at least 30GB of memory. 
-Before running the script, make sure you have the requirements installed: +To setup a default πŸ€— Accelerate environment without choosing any configurations: ```bash -pip install -U -r requirements.txt +accelerate config default ``` -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`] argument. The `instance_prompt` argument is a text prompt that contains a unique identifier, such as `sks`, and the class the image belongs to, which in this example is `a photo of a sks dog`. - -Now you can launch the training script with the following command: +Or if your environment doesn't support an interactive shell, like a notebook, you can use: ```bash -export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" -export INSTANCE_DIR="./dog" -export OUTPUT_DIR="path-to-save-model" +from accelerate.utils import write_basic_config -python train_dreambooth_flax.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a photo of sks dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --learning_rate=5e-6 \ - --max_train_steps=400 \ - --push_to_hub +write_basic_config() ``` - - -## Finetuning with prior-preserving loss +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. -Prior preservation is used to avoid overfitting and language-drift (check out the [paper](https://arxiv.org/abs/2208.12242) to learn more if you're interested). For prior preservation, you use other images of the same class as part of the training process. The nice thing is that you can generate those images using the Stable Diffusion model itself! The training script will save the generated images to a local path you specify. + -The authors recommend generating `num_epochs * num_samples` images for prior preservation. In most cases, 200-300 images work well. - - - -```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path_to_class_images" -export OUTPUT_DIR="path_to_saved_model" +The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth.py) and let us know if you have any questions or concerns. 
-accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --gradient_accumulation_steps=1 \ - --learning_rate=5e-6 \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub -``` - - -```bash -export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path-to-class-images" -export OUTPUT_DIR="path-to-save-model" - -python train_dreambooth_flax.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --learning_rate=5e-6 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub -``` - - - -## Finetuning the text encoder and UNet + -The script also allows you to finetune the `text_encoder` along with the `unet`. In our experiments (check out the [Training Stable Diffusion with DreamBooth using 🧨 Diffusers](https://huggingface.co/blog/dreambooth) post for more details), this yields much better results, especially when generating images of faces. +## Script parameters -Training the text encoder requires additional memory and it won't fit on a 16GB GPU. You'll need at least 24GB VRAM to use this option. +DreamBooth is very sensitive to training hyperparameters, and it is easy to overfit. Read the [Training Stable Diffusion with Dreambooth using 🧨 Diffusers](https://huggingface.co/blog/dreambooth) blog post for recommended settings for different subjects to help you choose the appropriate hyperparameters. -Pass the `--train_text_encoder` argument to the training script to enable finetuning the `text_encoder` and `unet`: +The training script offers many parameters for customizing your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L228) function. The parameters are set with default values that should work pretty well out-of-the-box, but you can also set your own values in the training command if you'd like. 
- - -```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path_to_class_images" -export OUTPUT_DIR="path_to_saved_model" +For example, to train in the bf16 format: +```bash accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --train_text_encoder \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --use_8bit_adam \ - --gradient_checkpointing \ - --learning_rate=2e-6 \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub + --mixed_precision="bf16" ``` - - -```bash -export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path-to-class-images" -export OUTPUT_DIR="path-to-save-model" -python train_dreambooth_flax.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --train_text_encoder \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --learning_rate=2e-6 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub -``` - - +Some basic and important parameters to know and specify are: -## Finetuning with LoRA +- `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model +- `--instance_data_dir`: path to a folder containing the training dataset (example images) +- `--instance_prompt`: the text prompt that contains the special word for the example images +- `--train_text_encoder`: whether to also train the text encoder +- `--output_dir`: where to save the trained model +- `--push_to_hub`: whether to push the trained model to the Hub +- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful if for some reason training is interrupted, you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command -You can also use Low-Rank Adaptation of Large Language Models (LoRA), a fine-tuning technique for accelerating training large models, on DreamBooth. For more details, take a look at the [LoRA training](./lora#dreambooth) guide. +### Min-SNR weighting -## Saving checkpoints while training +The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting `epsilon` (noise) or `v_prediction`, but Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script. -It's easy to overfit while training with Dreambooth, so sometimes it's useful to save regular checkpoints during the training process. One of the intermediate checkpoints might actually work better than the final model! 
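To make the weighting strategy above more concrete, here is a small self-contained sketch of the idea for the `epsilon` prediction type (an illustration, not the script's exact code); the SNR values and losses below are made-up placeholders:

```py
import torch

# Hypothetical SNR values for four sampled timesteps and the recommended gamma.
snr = torch.tensor([0.2, 1.0, 5.0, 30.0])
snr_gamma = 5.0

# Clamp the per-timestep weight at snr_gamma so easy, high-SNR timesteps
# don't dominate the loss.
mse_loss_weights = torch.clamp(snr, max=snr_gamma) / snr

# Placeholder per-example MSE losses; the weighted mean is what gets backpropagated.
per_example_loss = torch.rand(4)
loss = (per_example_loss * mse_loss_weights).mean()

print(mse_loss_weights)  # tensor([1.0000, 1.0000, 1.0000, 0.1667])
```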
Pass the following argument to the training script to enable saving checkpoints: +Add the `--snr_gamma` parameter and set it to the recommended value of 5.0: ```bash - --checkpointing_steps=500 +accelerate launch train_dreambooth.py \ + --snr_gamma=5.0 ``` -This saves the full training state in subfolders of your `output_dir`. Subfolder names begin with the prefix `checkpoint-`, followed by the number of steps performed so far; for example, `checkpoint-1500` would be a checkpoint saved after 1500 training steps. +### Prior preservation loss -### Resume training from a saved checkpoint +Prior preservation loss is a method that uses a model's own generated samples to help it learn how to generate more diverse images. Because these generated sample images belong to the same class as the images you provided, they help the model retain what it has learned about the class and how it can use what it already knows about the class to make new compositions. -If you want to resume training from any of the saved checkpoints, you can pass the argument `--resume_from_checkpoint` to the script and specify the name of the checkpoint you want to use. You can also use the special string `"latest"` to resume from the last saved checkpoint (the one with the largest number of steps). For example, the following would resume training from the checkpoint saved after 1500 steps: +- `--with_prior_preservation`: whether to use prior preservation loss +- `--prior_loss_weight`: controls the influence of the prior preservation loss on the model +- `--class_data_dir`: path to a folder containing the generated class sample images +- `--class_prompt`: the text prompt describing the class of the generated sample images ```bash - --resume_from_checkpoint="checkpoint-1500" +accelerate launch train_dreambooth.py \ + --with_prior_preservation \ + --prior_loss_weight=1.0 \ + --class_data_dir="path/to/class/images" \ + --class_prompt="text prompt describing class" ``` -This is a good opportunity to tweak some of your hyperparameters if you wish. +### Train text encoder -### Inference from a saved checkpoint +To improve the quality of the generated outputs, you can also train the text encoder in addition to the UNet. This requires additional memory and you'll need a GPU with at least 24GB of vRAM. If you have the necessary hardware, then training the text encoder produces better results, especially when generating images of faces. Enable this option by: -Saved checkpoints are stored in a format suitable for resuming training. They not only include the model weights, but also the state of the optimizer, data loaders, and learning rate. +```bash +accelerate launch train_dreambooth.py \ + --train_text_encoder +``` -If you have **`"accelerate>=0.16.0"`** installed, use the following code to run -inference from an intermediate checkpoint. 
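To make the prior preservation loss described earlier more concrete, the sketch below shows roughly how the instance loss and the class (prior) loss are combined into a single training loss; the tensors are random placeholders standing in for the model's noise predictions and targets:

```py
import torch
import torch.nn.functional as F

# With prior preservation, each batch stacks instance examples and class examples.
model_pred = torch.randn(4, 4, 64, 64)  # placeholder noise predictions
target = torch.randn(4, 4, 64, 64)      # placeholder targets
prior_loss_weight = 1.0

# Split the stacked batch back into its instance and prior halves.
model_pred, model_pred_prior = torch.chunk(model_pred, 2, dim=0)
target, target_prior = torch.chunk(target, 2, dim=0)

instance_loss = F.mse_loss(model_pred.float(), target.float(), reduction="mean")
prior_loss = F.mse_loss(model_pred_prior.float(), target_prior.float(), reduction="mean")

# --prior_loss_weight controls how strongly the prior term pulls on the model.
loss = instance_loss + prior_loss_weight * prior_loss
```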
+## Training script -```python -from diffusers import DiffusionPipeline, UNet2DConditionModel -from transformers import CLIPTextModel -import torch +DreamBooth comes with its own dataset classes: -# Load the pipeline with the same arguments (model, revision) that were used for training -model_id = "CompVis/stable-diffusion-v1-4" +- [`DreamBoothDataset`](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L604): preprocesses the images and class images, and tokenizes the prompts for training +- [`PromptDataset`](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L738): generates the prompt embeddings to generate the class images -unet = UNet2DConditionModel.from_pretrained("/sddata/dreambooth/daruma-v2-1/checkpoint-100/unet") +If you enabled [prior preservation loss](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L842), the class images are generated here: -# if you have trained with `--args.train_text_encoder` make sure to also load the text encoder -text_encoder = CLIPTextModel.from_pretrained("/sddata/dreambooth/daruma-v2-1/checkpoint-100/text_encoder") +```py +sample_dataset = PromptDataset(args.class_prompt, num_new_images) +sample_dataloader = torch.utils.data.DataLoader(sample_dataset, batch_size=args.sample_batch_size) -pipeline = DiffusionPipeline.from_pretrained( - model_id, unet=unet, text_encoder=text_encoder, dtype=torch.float16, use_safetensors=True -) -pipeline.to("cuda") +sample_dataloader = accelerator.prepare(sample_dataloader) +pipeline.to(accelerator.device) -# Perform inference, or save, or push to the hub -pipeline.save_pretrained("dreambooth-pipeline") +for example in tqdm( + sample_dataloader, desc="Generating class images", disable=not accelerator.is_local_main_process +): + images = pipeline(example["prompt"]).images ``` -If you have **`"accelerate<0.16.0"`** installed, you need to convert it to an inference pipeline first: +Next is the [`main()`](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L799) function which handles setting up the dataset for training and the training loop itself. 
The script loads the [tokenizer](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L898), [scheduler and models](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L912C1-L912C1): -```python -from accelerate import Accelerator -from diffusers import DiffusionPipeline - -# Load the pipeline with the same arguments (model, revision) that were used for training -model_id = "CompVis/stable-diffusion-v1-4" -pipeline = DiffusionPipeline.from_pretrained(model_id, use_safetensors=True) +```py +# Load the tokenizer +if args.tokenizer_name: + tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_name, revision=args.revision, use_fast=False) +elif args.pretrained_model_name_or_path: + tokenizer = AutoTokenizer.from_pretrained( + args.pretrained_model_name_or_path, + subfolder="tokenizer", + revision=args.revision, + use_fast=False, + ) + +# Load scheduler and models +noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler") +text_encoder = text_encoder_cls.from_pretrained( + args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision +) -accelerator = Accelerator() +if model_has_vae(args): + vae = AutoencoderKL.from_pretrained( + args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision + ) +else: + vae = None -# Use text_encoder if `--train_text_encoder` was used for the initial training -unet, text_encoder = accelerator.prepare(pipeline.unet, pipeline.text_encoder) +unet = UNet2DConditionModel.from_pretrained( + args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision +) +``` -# Restore state from a checkpoint path. You have to use the absolute path here. 
-accelerator.load_state("/sddata/dreambooth/daruma-v2-1/checkpoint-100") +Then, it's time to [create the training dataset](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L1073) and DataLoader from `DreamBoothDataset`: -# Rebuild the pipeline with the unwrapped models (assignment to .unet and .text_encoder should work too) -pipeline = DiffusionPipeline.from_pretrained( - model_id, - unet=accelerator.unwrap_model(unet), - text_encoder=accelerator.unwrap_model(text_encoder), - use_safetensors=True, +```py +train_dataset = DreamBoothDataset( + instance_data_root=args.instance_data_dir, + instance_prompt=args.instance_prompt, + class_data_root=args.class_data_dir if args.with_prior_preservation else None, + class_prompt=args.class_prompt, + class_num=args.num_class_images, + tokenizer=tokenizer, + size=args.resolution, + center_crop=args.center_crop, + encoder_hidden_states=pre_computed_encoder_hidden_states, + class_prompt_encoder_hidden_states=pre_computed_class_prompt_encoder_hidden_states, + tokenizer_max_length=args.tokenizer_max_length, ) -# Perform inference, or save, or push to the hub -pipeline.save_pretrained("dreambooth-pipeline") +train_dataloader = torch.utils.data.DataLoader( + train_dataset, + batch_size=args.train_batch_size, + shuffle=True, + collate_fn=lambda examples: collate_fn(examples, args.with_prior_preservation), + num_workers=args.dataloader_num_workers, +) ``` -## Optimizations for different GPU sizes +Lastly, the [training loop](https://github.com/huggingface/diffusers/blob/072e00897a7cf4302c347a63ec917b4b8add16d4/examples/dreambooth/train_dreambooth.py#L1151) takes care of the remaining steps such as converting images to latent space, adding noise to the input, predicting the noise residual, and calculating the loss. -Depending on your hardware, there are a few different ways to optimize DreamBooth on GPUs from 16GB to just 8GB! +If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process. -### xFormers +## Launch the script -[xFormers](https://github.com/facebookresearch/xformers) is a toolbox for optimizing Transformers, and it includes a [memory-efficient attention](https://facebookresearch.github.io/xformers/components/ops.html#module-xformers.ops) mechanism that is used in 🧨 Diffusers. You'll need to [install xFormers](./optimization/xformers) and then add the following argument to your training script: +You're now ready to launch the training script! πŸš€ -```bash - --enable_xformers_memory_efficient_attention -``` +For this guide, you'll download some images of a [dog](https://huggingface.co/datasets/diffusers/dog-example) and store them in a directory. But remember, you can create and use your own dataset if you want (see the [Create a dataset for training](create_dataset) guide). -xFormers is not available in Flax. +```py +from huggingface_hub import snapshot_download -### Set gradients to none +local_dir = "./dog" +snapshot_download( + "diffusers/dog-example", + local_dir=local_dir, + repo_type="dataset", + ignore_patterns=".gitattributes", +) +``` + +Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model, `INSTANCE_DIR` to the path where you just downloaded the dog images to, and `OUTPUT_DIR` to where you want to save the model. 
You'll use `sks` as the special word to tie the training to. -Another way you can lower your memory footprint is to [set the gradients](https://pytorch.org/docs/stable/generated/torch.optim.Optimizer.zero_grad.html) to `None` instead of zero. However, this may change certain behaviors, so if you run into any issues, try removing this argument. Add the following argument to your training script to set the gradients to `None`: +If you're interested in following along with the training process, you can periodically save generated images as training progresses. Add the following parameters to the training command: ```bash - --set_grads_to_none +--validation_prompt="a photo of a sks dog" +--num_validation_images=4 +--validation_steps=100 ``` -### 16GB GPU +One more thing before you launch the script! Depending on the GPU you have, you may need to enable certain optimizations to train DreamBooth. -With the help of gradient checkpointing and [bitsandbytes](https://github.com/TimDettmers/bitsandbytes) 8-bit optimizer, it's possible to train DreamBooth on a 16GB GPU. Make sure you have bitsandbytes installed: + + -```bash +On a 16GB GPU, you can use bitsandbytes 8-bit optimizer and gradient checkpointing to help you train a DreamBooth model. Install bitsandbytes: + +```py pip install bitsandbytes ``` -Then pass the `--use_8bit_adam` option to the training script: +Then, add the following parameter to your training command: ```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path_to_class_images" -export OUTPUT_DIR="path_to_saved_model" - accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --gradient_accumulation_steps=2 --gradient_checkpointing \ + --gradient_checkpointing \ --use_8bit_adam \ - --learning_rate=5e-6 \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub ``` -### 12GB GPU + + -To run DreamBooth on a 12GB GPU, you'll need to enable gradient checkpointing, the 8-bit optimizer, xFormers, and set the gradients to `None`: +On a 12GB GPU, you'll need bitsandbytes 8-bit optimizer, gradient checkpointing, xFormers, and set the gradients to `None` instead of zero to reduce your memory-usage. 
```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export INSTANCE_DIR="./dog" -export CLASS_DIR="path-to-class-images" -export OUTPUT_DIR="path-to-save-model" - accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ - --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ - --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --gradient_accumulation_steps=1 --gradient_checkpointing \ --use_8bit_adam \ + --gradient_checkpointing \ --enable_xformers_memory_efficient_attention \ --set_grads_to_none \ - --learning_rate=2e-6 \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --push_to_hub ``` -### 8 GB GPU + + -For 8GB GPUs, you'll need the help of [DeepSpeed](https://www.deepspeed.ai/) to offload some -tensors from the VRAM to either the CPU or NVME, enabling training with less GPU memory. +On a 8GB GPU, you'll need [DeepSpeed](https://www.deepspeed.ai/) to offload some of the tensors from the vRAM to either the CPU or NVME to allow training with less GPU memory. Run the following command to configure your πŸ€— Accelerate environment: @@ -443,268 +300,148 @@ Run the following command to configure your πŸ€— Accelerate environment: accelerate config ``` -During configuration, confirm that you want to use DeepSpeed. Now it's possible to train on under 8GB VRAM by combining DeepSpeed stage 2, fp16 mixed precision, and offloading the model parameters and the optimizer state to the CPU. The drawback is that this requires more system RAM, about 25 GB. See [the DeepSpeed documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more configuration options. +During configuration, confirm that you want to use DeepSpeed. Now it should be possible to train on under 8GB vRAM by combining DeepSpeed stage 2, fp16 mixed precision, and offloading the model parameters and the optimizer state to the CPU. The drawback is that this requires more system RAM (~25 GB). See the [DeepSpeed documentation](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) for more configuration options. + +You should also change the default Adam optimizer to DeepSpeed’s optimized version of Adam [`deepspeed.ops.adam.DeepSpeedCPUAdam`](https://deepspeed.readthedocs.io/en/latest/optimizers.html#adam-cpu) for a substantial speedup. Enabling `DeepSpeedCPUAdam` requires your system’s CUDA toolchain version to be the same as the one installed with PyTorch. -You should also change the default Adam optimizer to DeepSpeed's optimized version of Adam -[`deepspeed.ops.adam.DeepSpeedCPUAdam`](https://deepspeed.readthedocs.io/en/latest/optimizers.html#adam-cpu) for a substantial speedup. Enabling `DeepSpeedCPUAdam` requires your system's CUDA toolchain version to be the same as the one installed with PyTorch. +bitsandbytes 8-bit optimizers don’t seem to be compatible with DeepSpeed at the moment. -8-bit optimizers don't seem to be compatible with DeepSpeed at the moment. +That's it! You don't need to add any additional parameters to your training command. 
-Launch training with the following command: + + + + + ```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" +export MODEL_NAME="runwayml/stable-diffusion-v1-5" export INSTANCE_DIR="./dog" -export CLASS_DIR="path_to_class_images" export OUTPUT_DIR="path_to_saved_model" accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ + --pretrained_model_name_or_path=$MODEL_NAME \ --instance_data_dir=$INSTANCE_DIR \ - --class_data_dir=$CLASS_DIR \ --output_dir=$OUTPUT_DIR \ - --with_prior_preservation --prior_loss_weight=1.0 \ --instance_prompt="a photo of sks dog" \ - --class_prompt="a photo of dog" \ --resolution=512 \ --train_batch_size=1 \ - --sample_batch_size=1 \ - --gradient_accumulation_steps=1 --gradient_checkpointing \ + --gradient_accumulation_steps=1 \ --learning_rate=5e-6 \ --lr_scheduler="constant" \ --lr_warmup_steps=0 \ - --num_class_images=200 \ - --max_train_steps=800 \ - --mixed_precision=fp16 \ + --max_train_steps=400 \ --push_to_hub ``` -## Inference - -Once you have trained a model, specify the path to where the model is saved, and use it for inference in the [`StableDiffusionPipeline`]. Make sure your prompts include the special `identifier` used during training (`sks` in the previous examples). - -If you have **`"accelerate>=0.16.0"`** installed, you can use the following code to run -inference from an intermediate checkpoint: - -```python -from diffusers import DiffusionPipeline -import torch + + -model_id = "path_to_saved_model" -pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda") - -prompt = "A photo of sks dog in a bucket" -image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0] +```bash +export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" +export INSTANCE_DIR="./dog" +export OUTPUT_DIR="path-to-save-model" -image.save("dog-bucket.png") +python train_dreambooth_flax.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --instance_data_dir=$INSTANCE_DIR \ + --output_dir=$OUTPUT_DIR \ + --instance_prompt="a photo of sks dog" \ + --resolution=512 \ + --train_batch_size=1 \ + --learning_rate=5e-6 \ + --max_train_steps=400 \ + --push_to_hub ``` -You may also run inference from any of the [saved training checkpoints](#inference-from-a-saved-checkpoint). + + -## IF +Once training is complete, you can use your newly trained model for inference! -You can use the lora and full dreambooth scripts to train the text to image [IF model](https://huggingface.co/DeepFloyd/IF-I-XL-v1.0) and the stage II upscaler -[IF model](https://huggingface.co/DeepFloyd/IF-II-L-v1.0). + -Note that IF has a predicted variance, and our finetuning scripts only train the models predicted error, so for finetuned IF models we switch to a fixed -variance schedule. The full finetuning scripts will update the scheduler config for the full saved model. However, when loading saved LoRA weights, you -must also update the pipeline's scheduler config. +Can't wait to try your model for inference before training is complete? 🀭 Make sure you have the latest version of πŸ€— Accelerate installed. ```py -from diffusers import DiffusionPipeline - -pipe = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", use_safetensors=True) - -pipe.load_lora_weights("") - -# Update scheduler config to fixed variance schedule -pipe.scheduler = pipe.scheduler.__class__.from_config(pipe.scheduler.config, variance_type="fixed_small") -``` - -Additionally, a few alternative cli flags are needed for IF. 
-
-`--resolution=64`: IF is a pixel space diffusion model. In order to operate on un-compressed pixels, the input images are of a much smaller resolution.
-
-`--pre_compute_text_embeddings`: IF uses [T5](https://huggingface.co/docs/transformers/model_doc/t5) for its text encoder. In order to save GPU memory, we pre compute all text embeddings and then de-allocate
-T5.
-
-`--tokenizer_max_length=77`: T5 has a longer default text length, but the default IF encoding procedure uses a smaller number.
-
-`--text_encoder_use_attention_mask`: T5 passes the attention mask to the text encoder.
-
-### Tips and Tricks
-We find LoRA to be sufficient for finetuning the stage I model as the low resolution of the model makes representing finegrained detail hard regardless.
-
-For common and/or not-visually complex object concepts, you can get away with not-finetuning the upscaler. Just be sure to adjust the prompt passed to the
-upscaler to remove the new token from the instance prompt. I.e. if your stage I prompt is "a sks dog", use "a dog" for your stage II prompt.
+from diffusers import DiffusionPipeline, UNet2DConditionModel
+from transformers import CLIPTextModel
+import torch
 
-For finegrained detail like faces that aren't present in the original training set, we find that full finetuning of the stage II upscaler is better than
-LoRA finetuning stage II.
+unet = UNet2DConditionModel.from_pretrained("path/to/model/checkpoint-100/unet")
 
-For finegrained detail like faces, we find that lower learning rates along with larger batch sizes work best.
+# if you have trained with `--args.train_text_encoder` make sure to also load the text encoder
+text_encoder = CLIPTextModel.from_pretrained("path/to/model/checkpoint-100/text_encoder")
 
-For stage II, we find that lower learning rates are also needed.
+pipeline = DiffusionPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5", unet=unet, text_encoder=text_encoder, torch_dtype=torch.float16,
+).to("cuda")
 
-We found experimentally that the DDPM scheduler with the default larger number of denoising steps to sometimes work better than the DPM Solver scheduler
-used in the training scripts.
+image = pipeline("A photo of sks dog in a bucket", num_inference_steps=50, guidance_scale=7.5).images[0]
+image.save("dog-bucket.png")
+```
 
-### Stage II additional validation images
+
-The stage II validation requires images to upscale, we can download a downsized version of the training set:
+
+
 ```py
-from huggingface_hub import snapshot_download
+from diffusers import DiffusionPipeline
+import torch
 
-local_dir = "./dog_downsized"
-snapshot_download(
-    "diffusers/dog-example-downsized",
-    local_dir=local_dir,
-    repo_type="dataset",
-    ignore_patterns=".gitattributes",
-)
+pipeline = DiffusionPipeline.from_pretrained("path_to_saved_model", torch_dtype=torch.float16, use_safetensors=True).to("cuda")
+image = pipeline("A photo of sks dog in a bucket", num_inference_steps=50, guidance_scale=7.5).images[0]
+image.save("dog-bucket.png")
 ```
 
-### IF stage I LoRA Dreambooth
-This training configuration requires ~28 GB VRAM.
+ + -```sh -export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0" -export INSTANCE_DIR="dog" -export OUTPUT_DIR="dreambooth_dog_lora" - -accelerate launch train_dreambooth_lora.py \ - --report_to wandb \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a sks dog" \ - --resolution=64 \ - --train_batch_size=4 \ - --gradient_accumulation_steps=1 \ - --learning_rate=5e-6 \ - --scale_lr \ - --max_train_steps=1200 \ - --validation_prompt="a sks dog" \ - --validation_epochs=25 \ - --checkpointing_steps=100 \ - --pre_compute_text_embeddings \ - --tokenizer_max_length=77 \ - --text_encoder_use_attention_mask -``` - -### IF stage II LoRA Dreambooth - -`--validation_images`: These images are upscaled during validation steps. - -`--class_labels_conditioning=timesteps`: Pass additional conditioning to the UNet needed for stage II. - -`--learning_rate=1e-6`: Lower learning rate than stage I. - -`--resolution=256`: The upscaler expects higher resolution inputs - -```sh -export MODEL_NAME="DeepFloyd/IF-II-L-v1.0" -export INSTANCE_DIR="dog" -export OUTPUT_DIR="dreambooth_dog_upscale" -export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png" - -python train_dreambooth_lora.py \ - --report_to wandb \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a sks dog" \ - --resolution=256 \ - --train_batch_size=4 \ - --gradient_accumulation_steps=1 \ - --learning_rate=1e-6 \ - --max_train_steps=2000 \ - --validation_prompt="a sks dog" \ - --validation_epochs=100 \ - --checkpointing_steps=500 \ - --pre_compute_text_embeddings \ - --tokenizer_max_length=77 \ - --text_encoder_use_attention_mask \ - --validation_images $VALIDATION_IMAGES \ - --class_labels_conditioning=timesteps -``` +```py +import jax +import numpy as np +from flax.jax_utils import replicate +from flax.training.common_utils import shard +from diffusers import FlaxStableDiffusionPipeline -### IF Stage I Full Dreambooth -`--skip_save_text_encoder`: When training the full model, this will skip saving the entire T5 with the finetuned model. You can still load the pipeline -with a T5 loaded from the original model. +pipeline, params = FlaxStableDiffusionPipeline.from_pretrained("path-to-your-trained-model", dtype=jax.numpy.bfloat16) -`use_8bit_adam`: Due to the size of the optimizer states, we recommend training the full XL IF model with 8bit adam. +prompt = "A photo of sks dog in a bucket" +prng_seed = jax.random.PRNGKey(0) +num_inference_steps = 50 -`--learning_rate=1e-7`: For full dreambooth, IF requires very low learning rates. With higher learning rates model quality will degrade. Note that it is -likely the learning rate can be increased with larger batch sizes. +num_samples = jax.device_count() +prompt = num_samples * [prompt] +prompt_ids = pipeline.prepare_inputs(prompt) -Using 8bit adam and a batch size of 4, the model can be trained in ~48 GB VRAM. 
+# shard inputs and rng +params = replicate(params) +prng_seed = jax.random.split(prng_seed, jax.device_count()) +prompt_ids = shard(prompt_ids) -```sh -export MODEL_NAME="DeepFloyd/IF-I-XL-v1.0" +images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images +images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:]))) +image.save("dog-bucket.png") +``` -export INSTANCE_DIR="dog" -export OUTPUT_DIR="dreambooth_if" + + -accelerate launch train_dreambooth.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a photo of sks dog" \ - --resolution=64 \ - --train_batch_size=4 \ - --gradient_accumulation_steps=1 \ - --learning_rate=1e-7 \ - --max_train_steps=150 \ - --validation_prompt "a photo of sks dog" \ - --validation_steps 25 \ - --text_encoder_use_attention_mask \ - --tokenizer_max_length 77 \ - --pre_compute_text_embeddings \ - --use_8bit_adam \ - --set_grads_to_none \ - --skip_save_text_encoder \ - --push_to_hub -``` +## LoRA -### IF Stage II Full Dreambooth +LoRA is a training technique for significantly reducing the number of trainable parameters. As a result, training is faster and it is easier to store the resulting weights because they are a lot smaller (~100MBs). Use the [train_dreambooth_lora.py](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora.py) script to train with LoRA. -`--learning_rate=5e-6`: With a smaller effective batch size of 4, we found that we required learning rates as low as -1e-8. +The LoRA training script is discussed in more detail in the [LoRA training](lora) guide. -`--resolution=256`: The upscaler expects higher resolution inputs +## Stable Diffusion XL -`--train_batch_size=2` and `--gradient_accumulation_steps=6`: We found that full training of stage II particularly with -faces required large effective batch sizes. +Stable Diffusion XL (SDXL) is a powerful text-to-image model that generates high-resolution images, and it adds a second text-encoder to its architecture. Use the [train_dreambooth_lora_sdxl.py](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sdxl.py) script to train a SDXL model with LoRA. -```sh -export MODEL_NAME="DeepFloyd/IF-II-L-v1.0" -export INSTANCE_DIR="dog" -export OUTPUT_DIR="dreambooth_dog_upscale" -export VALIDATION_IMAGES="dog_downsized/image_1.png dog_downsized/image_2.png dog_downsized/image_3.png dog_downsized/image_4.png" +The SDXL training script is discussed in more detail in the [SDXL training](sdxl) guide. -accelerate launch train_dreambooth.py \ - --report_to wandb \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a sks dog" \ - --resolution=256 \ - --train_batch_size=2 \ - --gradient_accumulation_steps=6 \ - --learning_rate=5e-6 \ - --max_train_steps=2000 \ - --validation_prompt="a sks dog" \ - --validation_steps=150 \ - --checkpointing_steps=500 \ - --pre_compute_text_embeddings \ - --tokenizer_max_length=77 \ - --text_encoder_use_attention_mask \ - --validation_images $VALIDATION_IMAGES \ - --class_labels_conditioning timesteps \ - --push_to_hub -``` +## Next steps -## Stable Diffusion XL +Congratulations on training your DreamBooth model! 
To learn more about how to use your new model, the following guide may be helpful: -We support fine-tuning of the UNet and text encoders shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with DreamBooth and LoRA via the `train_dreambooth_lora_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md). \ No newline at end of file +- Learn how to [load a DreamBooth](../using-diffusers/loading_adapters) model for inference if you trained your model with LoRA. \ No newline at end of file diff --git a/docs/source/en/training/instructpix2pix.md b/docs/source/en/training/instructpix2pix.md index efbc2f298a7a..7e17af2cd988 100644 --- a/docs/source/en/training/instructpix2pix.md +++ b/docs/source/en/training/instructpix2pix.md @@ -10,208 +10,243 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# InstructPix2Pix +# InstructPix2Pix -[InstructPix2Pix](https://arxiv.org/abs/2211.09800) is a method to fine-tune text-conditioned diffusion models such that they can follow an edit instruction for an input image. Models fine-tuned using this method take the following as inputs: +[InstructPix2Pix](https://hf.co/papers/2211.09800) is a Stable Diffusion model trained to edit images from human-provided instructions. For example, your prompt can be "turn the clouds rainy" and the model will edit the input image accordingly. This model is conditioned on the text prompt (or editing instruction) and the input image. -
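+
+As a quick illustration of that conditioning (a minimal sketch using the publicly released [timbrooks/instruct-pix2pix](https://huggingface.co/timbrooks/instruct-pix2pix) checkpoint rather than a model you trained yourself), inference takes both an edit instruction and an input image:
+
+```py
+import torch
+from diffusers import StableDiffusionInstructPix2PixPipeline
+from diffusers.utils import load_image
+
+# Load the pretrained InstructPix2Pix checkpoint (swap in your own model id after training).
+pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained(
+    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
+).to("cuda")
+
+# The model is conditioned on an input image *and* an edit instruction.
+image = load_image("https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png")
+edited_image = pipeline("turn the clouds rainy", image=image).images[0]
+edited_image.save("rainy.png")
+```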

-[removed figure: instructpix2pix-inputs]

+This guide will explore the [train_instruct_pix2pix.py](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py) training script to help you become familiar with it, and how you can adapt it for your own use-case. -The output is an "edited" image that reflects the edit instruction applied on the input image: +Before running the script, make sure you install the library from source: -

-[removed figure: instructpix2pix-output]

- -The `train_instruct_pix2pix.py` script (you can find the it [here](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py)) shows how to implement the training procedure and adapt it for Stable Diffusion. - -***Disclaimer: Even though `train_instruct_pix2pix.py` implements the InstructPix2Pix -training procedure while being faithful to the [original implementation](https://github.com/timothybrooks/instruct-pix2pix) we have only tested it on a [small-scale dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples). This can impact the end results. For better results, we recommend longer training runs with a larger dataset. [Here](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) you can find a large dataset for InstructPix2Pix training.*** - -## Running locally with PyTorch - -### Installing the dependencies - -Before running the scripts, make sure to install the library's training dependencies: - -**Important** - -To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: ```bash git clone https://github.com/huggingface/diffusers cd diffusers -pip install -e . +pip install . ``` -Then cd in the example folder -```bash -cd examples/instruct_pix2pix -``` +Then navigate to the example folder containing the training script and install the required dependencies for the script you're using: -Now run ```bash +cd examples/instruct_pix2pix pip install -r requirements.txt ``` -And initialize an [πŸ€—Accelerate](https://github.com/huggingface/accelerate/) environment with: + + +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. + + + +Initialize an πŸ€— Accelerate environment: ```bash accelerate config ``` -Or for a default accelerate configuration without answering questions about your environment +To setup a default πŸ€— Accelerate environment without choosing any configurations: ```bash accelerate config default ``` -Or if your environment doesn't support an interactive shell e.g. a notebook +Or if your environment doesn't support an interactive shell, like a notebook, you can use: -```python +```bash from accelerate.utils import write_basic_config write_basic_config() ``` -### Toy example +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. + + + +The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix.py) and let us know if you have any questions or concerns. -As mentioned before, we'll use a [small toy dataset](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) for training. 
The dataset -is a smaller version of the [original dataset](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered) used in the InstructPix2Pix paper. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. + -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. You'll also need to specify the dataset name in `DATASET_ID`: +## Script parameters + +The training script has many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L65) function. Default values are provided for most parameters that work pretty well, but you can also set your own values in the training command if you'd like. + +For example, to increase the resolution of the input image: ```bash -export MODEL_NAME="runwayml/stable-diffusion-v1-5" -export DATASET_ID="fusing/instructpix2pix-1000-samples" +accelerate launch train_instruct_pix2pix.py \ + --resolution=512 \ ``` -Now, we can launch training. The script saves all the components (`feature_extractor`, `scheduler`, `text_encoder`, `unet`, etc) in a subfolder in your repository. +Many of the basic and important parameters are described in the [Text-to-image](text2image#script-parameters) training guide, so this guide just focuses on the relevant parameters for InstructPix2Pix: -```bash -accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --dataset_name=$DATASET_ID \ - --enable_xformers_memory_efficient_attention \ - --resolution=256 --random_flip \ - --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \ - --max_train_steps=15000 \ - --checkpointing_steps=5000 --checkpoints_total_limit=1 \ - --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \ - --conditioning_dropout_prob=0.05 \ - --mixed_precision=fp16 \ - --seed=42 \ - --push_to_hub +- `--original_image_column`: the original image before the edits are made +- `--edited_image_column`: the image after the edits are made +- `--edit_prompt_column`: the instructions to edit the image +- `--conditioning_dropout_prob`: the dropout probability for the edited image and edit prompts during training which enables classifier-free guidance (CFG) for one or both conditioning inputs + +## Training script + +The dataset preprocessing code and training loop are found in the [`main()`](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L374) function. This is where you'll make your changes to the training script to adapt it for your own use-case. + +As with the script parameters, a walkthrough of the training script is provided in the [Text-to-image](text2image#training-script) training guide. Instead, this guide takes a look at the InstructPix2Pix relevant parts of the script. 
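+
+Before digging into the code, here is a sketch of how the InstructPix2Pix-specific parameters above might be combined in a launch command (the column names are placeholders and depend on your dataset's schema):
+
+```bash
+accelerate launch train_instruct_pix2pix.py \
+  --original_image_column="input_image" \
+  --edited_image_column="edited_image" \
+  --edit_prompt_column="edit_prompt" \
+  --conditioning_dropout_prob=0.05
+```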
+
+The script begins by modifying the [number of input channels](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L445) in the first convolutional layer of the UNet to account for InstructPix2Pix's additional conditioning image:
+
+```py
+in_channels = 8
+out_channels = unet.conv_in.out_channels
+unet.register_to_config(in_channels=in_channels)
+
+with torch.no_grad():
+    new_conv_in = nn.Conv2d(
+        in_channels, out_channels, unet.conv_in.kernel_size, unet.conv_in.stride, unet.conv_in.padding
+    )
+    new_conv_in.weight.zero_()
+    new_conv_in.weight[:, :4, :, :].copy_(unet.conv_in.weight)
+    unet.conv_in = new_conv_in
+```
+
+These UNet parameters are [updated](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L545C1-L551C6) by the optimizer:
+
+```py
+optimizer = optimizer_cls(
+    unet.parameters(),
+    lr=args.learning_rate,
+    betas=(args.adam_beta1, args.adam_beta2),
+    weight_decay=args.adam_weight_decay,
+    eps=args.adam_epsilon,
+)
```
-Additionally, we support performing validation inference to monitor training progress
-with Weights and Biases. You can enable this feature with `report_to="wandb"`:
+Next, the edited images and edit instructions are [preprocessed](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L624) and [tokenized](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L610C24-L610C24). It is important that the same image transformations are applied to the original and edited images.
+
+```py
+def preprocess_train(examples):
+    preprocessed_images = preprocess_images(examples)
+
+    original_images, edited_images = preprocessed_images.chunk(2)
+    original_images = original_images.reshape(-1, 3, args.resolution, args.resolution)
+    edited_images = edited_images.reshape(-1, 3, args.resolution, args.resolution)
+
+    examples["original_pixel_values"] = original_images
+    examples["edited_pixel_values"] = edited_images
+
+    captions = list(examples[edit_prompt_column])
+    examples["input_ids"] = tokenize_captions(captions)
+    return examples
+```
+
+Finally, in the [training loop](https://github.com/huggingface/diffusers/blob/64603389da01082055a901f2883c4810d1144edb/examples/instruct_pix2pix/train_instruct_pix2pix.py#L730), it starts by encoding the edited images into latent space:
+
+```py
+latents = vae.encode(batch["edited_pixel_values"].to(weight_dtype)).latent_dist.sample()
+latents = latents * vae.config.scaling_factor
+```
+
+Then, the script applies dropout to the original image and edit instruction embeddings to support CFG. This is what enables the model to modulate the influence of the edit instruction and original image on the edited image.
+ +```py +encoder_hidden_states = text_encoder(batch["input_ids"])[0] +original_image_embeds = vae.encode(batch["original_pixel_values"].to(weight_dtype)).latent_dist.mode() + +if args.conditioning_dropout_prob is not None: + random_p = torch.rand(bsz, device=latents.device, generator=generator) + prompt_mask = random_p < 2 * args.conditioning_dropout_prob + prompt_mask = prompt_mask.reshape(bsz, 1, 1) + null_conditioning = text_encoder(tokenize_captions([""]).to(accelerator.device))[0] + encoder_hidden_states = torch.where(prompt_mask, null_conditioning, encoder_hidden_states) + + image_mask_dtype = original_image_embeds.dtype + image_mask = 1 - ( + (random_p >= args.conditioning_dropout_prob).to(image_mask_dtype) + * (random_p < 3 * args.conditioning_dropout_prob).to(image_mask_dtype) + ) + image_mask = image_mask.reshape(bsz, 1, 1, 1) + original_image_embeds = image_mask * original_image_embeds +``` + +That's pretty much it! Aside from the differences described here, the rest of the script is very similar to the [Text-to-image](text2image#training-script) training script, so feel free to check it out for more details. If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process. + +## Launch the script + +Once you're happy with the changes to your script or if you're okay with the default configuration, you're ready to launch the training script! πŸš€ + +This guide uses the [fusing/instructpix2pix-1000-samples](https://huggingface.co/datasets/fusing/instructpix2pix-1000-samples) dataset, which is a smaller version of the [original dataset](https://huggingface.co/datasets/timbrooks/instructpix2pix-clip-filtered). You can also create and use your own dataset if you'd like (see the [Create a dataset for training](create_dataset) guide). + +Set the `MODEL_NAME` environment variable to the name of the model (can be a model id on the Hub or a path to a local model), and the `DATASET_ID` to the name of the dataset on the Hub. The script creates and saves all the components (feature extractor, scheduler, text encoder, UNet, etc.) to a subfolder in your repository. + + + +For better results, try longer training runs with a larger dataset. We've only tested this training script on a smaller-scale dataset. + +
+
+To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command and specify a validation image with `--val_image_url` and a validation prompt with `--validation_prompt`. This can be really useful for debugging the model.
+
+
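+
+For example, the environment variables can point at the Stable Diffusion checkpoint and the dataset used throughout this guide:
+
+```bash
+export MODEL_NAME="runwayml/stable-diffusion-v1-5"
+export DATASET_ID="fusing/instructpix2pix-1000-samples"
+```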
+ +If you’re training on more than one GPU, add the `--multi_gpu` parameter to the `accelerate launch` command. ```bash accelerate launch --mixed_precision="fp16" train_instruct_pix2pix.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --dataset_name=$DATASET_ID \ --enable_xformers_memory_efficient_attention \ - --resolution=256 --random_flip \ - --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \ + --resolution=256 \ + --random_flip \ + --train_batch_size=4 \ + --gradient_accumulation_steps=4 \ + --gradient_checkpointing \ --max_train_steps=15000 \ - --checkpointing_steps=5000 --checkpoints_total_limit=1 \ - --learning_rate=5e-05 --max_grad_norm=1 --lr_warmup_steps=0 \ + --checkpointing_steps=5000 \ + --checkpoints_total_limit=1 \ + --learning_rate=5e-05 \ + --max_grad_norm=1 \ + --lr_warmup_steps=0 \ --conditioning_dropout_prob=0.05 \ --mixed_precision=fp16 \ - --val_image_url="https://hf.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png" \ - --validation_prompt="make the mountains snowy" \ --seed=42 \ - --report_to=wandb \ --push_to_hub - ``` - - We recommend this type of validation as it can be useful for model debugging. Note that you need `wandb` installed to use this. You can install `wandb` by running `pip install wandb`. - - [Here](https://wandb.ai/sayakpaul/instruct-pix2pix/runs/ctr3kovq), you can find an example training run that includes some validation samples and the training hyperparameters. - - ***Note: In the original paper, the authors observed that even when the model is trained with an image resolution of 256x256, it generalizes well to bigger resolutions such as 512x512. This is likely because of the larger dataset they used during training.*** - - ## Training with multiple GPUs - -`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch) -for running distributed training with `accelerate`. 
Here is an example command: - -```bash -accelerate launch --mixed_precision="fp16" --multi_gpu train_instruct_pix2pix.py \ - --pretrained_model_name_or_path=runwayml/stable-diffusion-v1-5 \ - --dataset_name=sayakpaul/instructpix2pix-1000-samples \ - --use_ema \ - --enable_xformers_memory_efficient_attention \ - --resolution=512 --random_flip \ - --train_batch_size=4 --gradient_accumulation_steps=4 --gradient_checkpointing \ - --max_train_steps=15000 \ - --checkpointing_steps=5000 --checkpoints_total_limit=1 \ - --learning_rate=5e-05 --lr_warmup_steps=0 \ - --conditioning_dropout_prob=0.05 \ - --mixed_precision=fp16 \ - --seed=42 \ - --push_to_hub ``` - ## Inference - - Once training is complete, we can perform inference: +After training is finished, you can use your new InstructPix2Pix for inference: - ```python +```py import PIL import requests import torch from diffusers import StableDiffusionInstructPix2PixPipeline +from diffusers.utils import load_image -model_id = "your_model_id" # <- replace this -pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained( - model_id, torch_dtype=torch.float16, use_safetensors=True -).to("cuda") +pipeline = StableDiffusionInstructPix2PixPipeline.from_pretrained("your_cool_model", torch_dtype=torch.float16).to("cuda") generator = torch.Generator("cuda").manual_seed(0) -url = "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png" - - -def download_image(url): - image = PIL.Image.open(requests.get(url, stream=True).raw) - image = PIL.ImageOps.exif_transpose(image) - image = image.convert("RGB") - return image - - -image = download_image(url) -prompt = "wipe out the lake" +image = load_image("https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/test_pix2pix_4.png") +prompt = "add some ducks to the lake" num_inference_steps = 20 image_guidance_scale = 1.5 guidance_scale = 10 -edited_image = pipe( - prompt, - image=image, - num_inference_steps=num_inference_steps, - image_guidance_scale=image_guidance_scale, - guidance_scale=guidance_scale, - generator=generator, +edited_image = pipeline( + prompt, + image=image, + num_inference_steps=num_inference_steps, + image_guidance_scale=image_guidance_scale, + guidance_scale=guidance_scale, + generator=generator, ).images[0] edited_image.save("edited_image.png") ``` -An example model repo obtained using this training script can be found -here - [sayakpaul/instruct-pix2pix](https://huggingface.co/sayakpaul/instruct-pix2pix). +You should experiment with different `num_inference_steps`, `image_guidance_scale`, and `guidance_scale` values to see how they affect inference speed and quality. The guidance scale parameters are especially impactful because they control how much the original image and edit instructions affect the edited image. -We encourage you to play with the following three parameters to control -speed and quality during performance: +## Stable Diffusion XL -* `num_inference_steps` -* `image_guidance_scale` -* `guidance_scale` +Stable Diffusion XL (SDXL) is a powerful text-to-image model that generates high-resolution images, and it adds a second text-encoder to its architecture. Use the [`train_instruct_pix2pix_sdxl.py`](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/train_instruct_pix2pix_sdxl.py) script to train a SDXL model to follow image editing instructions. 
-Particularly, `image_guidance_scale` and `guidance_scale` can have a profound impact -on the generated ("edited") image (see [here](https://twitter.com/RisingSayak/status/1628392199196151808?s=20) for an example). +The SDXL training script is discussed in more detail in the [SDXL training](sdxl) guide. -If you're looking for some interesting ways to use the InstructPix2Pix training methodology, we welcome you to check out this blog post: [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd). +## Next steps -## Stable Diffusion XL +Congratulations on training your own InstructPix2Pix model! πŸ₯³ To learn more about the model, it may be helpful to: -Training with [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) is also supported via the `train_instruct_pix2pix_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/instruct_pix2pix/README_sdxl.md). \ No newline at end of file +- Read the [Instruction-tuning Stable Diffusion with InstructPix2Pix](https://huggingface.co/blog/instruction-tuning-sd) blog post to learn more about some experiments we've done with InstructPix2Pix, dataset preparation, and results for different instructions. \ No newline at end of file diff --git a/docs/source/en/training/kandinsky.md b/docs/source/en/training/kandinsky.md new file mode 100644 index 000000000000..b2174996be0d --- /dev/null +++ b/docs/source/en/training/kandinsky.md @@ -0,0 +1,327 @@ + + +# Kandinsky 2.2 + + + +This script is experimental, and it's easy to overfit and run into issues like catastrophic forgetting. Try exploring different hyperparameters to get the best results on your dataset. + + + +Kandinsky 2.2 is a multilingual text-to-image model capable of producing more photorealistic images. The model includes an image prior model for creating image embeddings from text prompts, and a decoder model that generates images based on the prior model's embeddings. That's why you'll find two separate scripts in Diffusers for Kandinsky 2.2, one for training the prior model and one for training the decoder model. You can train both models separately, but to get the best results, you should train both the prior and decoder models. + +Depending on your GPU, you may need to enable `gradient_checkpointing` (⚠️ not supported for the prior model!), `mixed_precision`, and `gradient_accumulation_steps` to help fit the model into memory and to speedup training. You can reduce your memory-usage even more by enabling memory-efficient attention with [xFormers](../optimization/xformers) (version [v0.0.16](https://github.com/huggingface/diffusers/issues/2234#issuecomment-1416931212) fails for training on some GPUs so you may need to install a development version instead). + +This guide explores the [train_text_to_image_prior.py](https://github.com/huggingface/diffusers/blob/main/examples/kandinsky2_2/text_to_image/train_text_to_image_prior.py) and the [train_text_to_image_decoder.py](https://github.com/huggingface/diffusers/blob/main/examples/kandinsky2_2/text_to_image/train_text_to_image_decoder.py) scripts to help you become more familiar with it, and how you can adapt it for your own use-case. + +Before running the scripts, make sure you install the library from source: + +```bash +git clone https://github.com/huggingface/diffusers +cd diffusers +pip install . 
+``` + +Then navigate to the example folder containing the training script and install the required dependencies for the script you're using: + +```bash +cd examples/kandinsky2_2/text_to_image +pip install -r requirements.txt +``` + + + +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. + + + +Initialize an πŸ€— Accelerate environment: + +```bash +accelerate config +``` + +To setup a default πŸ€— Accelerate environment without choosing any configurations: + +```bash +accelerate config default +``` + +Or if your environment doesn't support an interactive shell, like a notebook, you can use: + +```bash +from accelerate.utils import write_basic_config + +write_basic_config() +``` + +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. + + + +The following sections highlight parts of the training scripts that are important for understanding how to modify it, but it doesn't cover every aspect of the scripts in detail. If you're interested in learning more, feel free to read through the scripts and let us know if you have any questions or concerns. + + + +## Script parameters + +The training scripts provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_prior.py#L190) function. The training scripts provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like. + +For example, to speedup training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command: + +```bash +accelerate launch train_text_to_image_prior.py \ + --mixed_precision="fp16" +``` + +Most of the parameters are identical to the parameters in the [Text-to-image](text2image#script-parameters) training guide, so let's get straight to a walkthrough of the Kandinsky training scripts! + +### Min-SNR weighting + +The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting `epsilon` (noise) or `v_prediction`, but Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script. + +Add the `--snr_gamma` parameter and set it to the recommended value of 5.0: + +```bash +accelerate launch train_text_to_image_prior.py \ + --snr_gamma=5.0 +``` + +## Training script + +The training script is also similar to the [Text-to-image](text2image#training-script) training guide, but it's been modified to support training the prior and decoder models. This guide focuses on the code that is unique to the Kandinsky 2.2 training scripts. 
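+
+To make the Min-SNR weighting described above more concrete, here is a small illustrative sketch (not the script's exact implementation) of a per-timestep loss weight of min(SNR, Ξ³)/SNR for epsilon prediction, with the noise schedule values stubbed out:
+
+```py
+import torch
+
+# Stub: per-timestep cumulative alphas from a noise scheduler (illustrative values only).
+alphas_cumprod = torch.linspace(0.9999, 0.0001, 1000)
+
+def min_snr_weights(timesteps: torch.Tensor, gamma: float = 5.0) -> torch.Tensor:
+    # SNR(t) = alpha_bar_t / (1 - alpha_bar_t); clipping it at gamma keeps easy
+    # (high-SNR) timesteps from dominating the loss.
+    alpha_bar = alphas_cumprod[timesteps]
+    snr = alpha_bar / (1 - alpha_bar)
+    return torch.clamp(snr, max=gamma) / snr
+
+timesteps = torch.randint(0, 1000, (4,))
+per_sample_mse = torch.rand(4)  # stand-in for the unreduced training loss
+loss = (min_snr_weights(timesteps) * per_sample_mse).mean()
+```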
+ + + + +The [`main()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_prior.py#L441) function contains the code for preparing the dataset and training the model. + +One of the main differences you'll notice right away is that the training script also loads a [`~transformers.CLIPImageProcessor`] - in addition to a scheduler and tokenizer - for preprocessing images and a [`~transformers.CLIPVisionModelWithProjection`] model for encoding the images: + +```py +noise_scheduler = DDPMScheduler(beta_schedule="squaredcos_cap_v2", prediction_type="sample") +image_processor = CLIPImageProcessor.from_pretrained( + args.pretrained_prior_model_name_or_path, subfolder="image_processor" +) +tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_prior_model_name_or_path, subfolder="tokenizer") + +with ContextManagers(deepspeed_zero_init_disabled_context_manager()): + image_encoder = CLIPVisionModelWithProjection.from_pretrained( + args.pretrained_prior_model_name_or_path, subfolder="image_encoder", torch_dtype=weight_dtype + ).eval() + text_encoder = CLIPTextModelWithProjection.from_pretrained( + args.pretrained_prior_model_name_or_path, subfolder="text_encoder", torch_dtype=weight_dtype + ).eval() +``` + +Kandinsky uses a [`PriorTransformer`] to generate the image embeddings, so you'll want to setup the optimizer to learn the prior mode's parameters. + +```py +prior = PriorTransformer.from_pretrained(args.pretrained_prior_model_name_or_path, subfolder="prior") +prior.train() +optimizer = optimizer_cls( + prior.parameters(), + lr=args.learning_rate, + betas=(args.adam_beta1, args.adam_beta2), + weight_decay=args.adam_weight_decay, + eps=args.adam_epsilon, +) +``` + +Next, the input captions are tokenized, and images are [preprocessed](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_prior.py#L632) by the [`~transformers.CLIPImageProcessor`]: + +```py +def preprocess_train(examples): + images = [image.convert("RGB") for image in examples[image_column]] + examples["clip_pixel_values"] = image_processor(images, return_tensors="pt").pixel_values + examples["text_input_ids"], examples["text_mask"] = tokenize_captions(examples) + return examples +``` + +Finally, the [training loop](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_prior.py#L718) converts the input images into latents, adds noise to the image embeddings, and makes a prediction: + +```py +model_pred = prior( + noisy_latents, + timestep=timesteps, + proj_embedding=prompt_embeds, + encoder_hidden_states=text_encoder_hidden_states, + attention_mask=text_mask, +).predicted_image_embedding +``` + +If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process. + + + + +The [`main()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_decoder.py#L440) function contains the code for preparing the dataset and training the model. 
+ +Unlike the prior model, the decoder initializes a [`VQModel`] to decode the latents into images and it uses a [`UNet2DConditionModel`]: + +```py +with ContextManagers(deepspeed_zero_init_disabled_context_manager()): + vae = VQModel.from_pretrained( + args.pretrained_decoder_model_name_or_path, subfolder="movq", torch_dtype=weight_dtype + ).eval() + image_encoder = CLIPVisionModelWithProjection.from_pretrained( + args.pretrained_prior_model_name_or_path, subfolder="image_encoder", torch_dtype=weight_dtype + ).eval() +unet = UNet2DConditionModel.from_pretrained(args.pretrained_decoder_model_name_or_path, subfolder="unet") +``` + +Next, the script includes several image transforms and a [preprocessing](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_decoder.py#L622) function for applying the transforms to the images and returning the pixel values: + +```py +def preprocess_train(examples): + images = [image.convert("RGB") for image in examples[image_column]] + examples["pixel_values"] = [train_transforms(image) for image in images] + examples["clip_pixel_values"] = image_processor(images, return_tensors="pt").pixel_values + return examples +``` + +Lastly, the [training loop](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/kandinsky2_2/text_to_image/train_text_to_image_decoder.py#L706) handles converting the images to latents, adding noise, and predicting the noise residual. + +If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process. + +```py +model_pred = unet(noisy_latents, timesteps, None, added_cond_kwargs=added_cond_kwargs).sample[:, :4] +``` + + + + +## Launch the script + +Once you’ve made all your changes or you’re okay with the default configuration, you’re ready to launch the training script! πŸš€ + +You'll train on the [PokΓ©mon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset to generate your own PokΓ©mon, but you can also create and train on your own dataset by following the [Create a dataset for training](create_dataset) guide. Set the environment variable `DATASET_NAME` to the name of the dataset on the Hub or if you're training on your own files, set the environment variable `TRAIN_DIR` to a path to your dataset. + +If you’re training on more than one GPU, add the `--multi_gpu` parameter to the `accelerate launch` command. + + + +To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command. You’ll also need to add the `--validation_prompt` to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results. 
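+
+If you're training on your own files instead of a Hub dataset, the setup might look like the following sketch (assuming the script exposes the standard `--train_data_dir` argument used by the Diffusers example scripts):
+
+```bash
+export TRAIN_DIR="path/to/your/images"
+
+# Assumed flags: --train_data_dir replaces --dataset_name; the output directory name is arbitrary.
+accelerate launch --mixed_precision="fp16" train_text_to_image_decoder.py \
+  --train_data_dir=$TRAIN_DIR \
+  --resolution=768 \
+  --output_dir="kandi2-decoder-custom-model"
+```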
+ + + + + + +```bash +export DATASET_NAME="lambdalabs/pokemon-blip-captions" + +accelerate launch --mixed_precision="fp16" train_text_to_image_prior.py \ + --dataset_name=$DATASET_NAME \ + --resolution=768 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=4 \ + --max_train_steps=15000 \ + --learning_rate=1e-05 \ + --max_grad_norm=1 \ + --checkpoints_total_limit=3 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --validation_prompts="A robot pokemon, 4k photo" \ + --report_to="wandb" \ + --push_to_hub \ + --output_dir="kandi2-prior-pokemon-model" +``` + + + + +```bash +export DATASET_NAME="lambdalabs/pokemon-blip-captions" + +accelerate launch --mixed_precision="fp16" train_text_to_image_decoder.py \ + --dataset_name=$DATASET_NAME \ + --resolution=768 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=4 \ + --gradient_checkpointing \ + --max_train_steps=15000 \ + --learning_rate=1e-05 \ + --max_grad_norm=1 \ + --checkpoints_total_limit=3 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --validation_prompts="A robot pokemon, 4k photo" \ + --report_to="wandb" \ + --push_to_hub \ + --output_dir="kandi2-decoder-pokemon-model" +``` + + + + +Once training is finished, you can use your newly trained model for inference! + + + + +```py +from diffusers import AutoPipelineForText2Image, DiffusionPipeline +import torch + +prior_pipeline = DiffusionPipeline.from_pretrained(output_dir, torch_dtype=torch.float16) +prior_components = {"prior_" + k: v for k,v in prior_pipeline.components.items()} +pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", **prior_components, torch_dtype=torch.float16) + +pipe.enable_model_cpu_offload() +prompt="A robot pokemon, 4k photo" +image = pipeline(prompt=prompt, negative_prompt=negative_prompt).images[0] +``` + + + +Feel free to replace `kandinsky-community/kandinsky-2-2-decoder` with your own trained decoder checkpoint! + + + + + + +```py +from diffusers import AutoPipelineForText2Image +import torch + +pipeline = AutoPipelineForText2Image.from_pretrained("path/to/saved/model", torch_dtype=torch.float16) +pipeline.enable_model_cpu_offload() + +prompt="A robot pokemon, 4k photo" +image = pipeline(prompt=prompt).images[0] +``` + +For the decoder model, you can also perform inference from a saved checkpoint which can be useful for viewing intermediate results. In this case, load the checkpoint into the UNet: + +```py +from diffusers import AutoPipelineForText2Image, UNet2DConditionModel + +unet = UNet2DConditionModel.from_pretrained("path/to/saved/model" + "/checkpoint-/unet") + +pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", unet=unet, torch_dtype=torch.float16) +pipeline.enable_model_cpu_offload() + +image = pipeline(prompt="A robot pokemon, 4k photo").images[0] +``` + + + + +## Next steps + +Congratulations on training a Kandinsky 2.2 model! To learn more about how to use your new model, the following guides may be helpful: + +- Read the [Kandinsky](../using-diffusers/kandinsky) guide to learn how to use it for a variety of different tasks (text-to-image, image-to-image, inpainting, interpolation), and how it can be combined with a ControlNet. +- Check out the [DreamBooth](dreambooth) and [LoRA](lora) training guides to learn how to train a personalized Kandinsky model with just a few example images. These two training techniques can even be combined! 
diff --git a/docs/source/en/training/lora.md b/docs/source/en/training/lora.md index 7c13b7af9d7d..9ad088917dbc 100644 --- a/docs/source/en/training/lora.md +++ b/docs/source/en/training/lora.md @@ -10,571 +10,208 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# Low-Rank Adaptation of Large Language Models (LoRA) +# LoRA -This is an experimental feature. Its APIs can change in future. +This is experimental and the API may change in the future. -[Low-Rank Adaptation of Large Language Models (LoRA)](https://arxiv.org/abs/2106.09685) is a training method that accelerates the training of large models while consuming less memory. It adds pairs of rank-decomposition weight matrices (called **update matrices**) to existing weights, and **only** trains those newly added weights. This has a couple of advantages: - -- Previous pretrained weights are kept frozen so the model is not as prone to [catastrophic forgetting](https://www.pnas.org/doi/10.1073/pnas.1611835114). -- Rank-decomposition matrices have significantly fewer parameters than the original model, which means that trained LoRA weights are easily portable. -- LoRA matrices are generally added to the attention layers of the original model. 🧨 Diffusers provides the [`~diffusers.loaders.UNet2DConditionLoadersMixin.load_attn_procs`] method to load the LoRA weights into a model's attention layers. You can control the extent to which the model is adapted toward new training images via a `scale` parameter. -- The greater memory-efficiency allows you to run fine-tuning on consumer GPUs like the Tesla T4, RTX 3080 or even the RTX 2080 Ti! GPUs like the T4 are free and readily accessible in Kaggle or Google Colab notebooks. +[LoRA (Low-Rank Adaptation of Large Language Models)](https://hf.co/papers/2106.09685) is a popular and lightweight training technique that significantly reduces the number of trainable parameters. It works by inserting a smaller number of new weights into the model and only these are trained. This makes training with LoRA much faster, memory-efficient, and produces smaller model weights (a few hundred MBs), which are easier to store and share. LoRA can also be combined with other training techniques like DreamBooth to speedup training. -πŸ’‘ LoRA is not only limited to attention layers. The authors found that amending -the attention layers of a language model is sufficient to obtain good downstream performance with great efficiency. This is why it's common to just add the LoRA weights to the attention layers of a model. Check out the [Using LoRA for efficient Stable Diffusion fine-tuning](https://huggingface.co/blog/lora) blog for more information about how LoRA works! +LoRA is very versatile and supported for [DreamBooth](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora.py), [Kandinsky 2.2](https://github.com/huggingface/diffusers/blob/main/examples/kandinsky2_2/text_to_image/train_text_to_image_lora_decoder.py), [Stable Diffusion XL](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora_sdxl.py), [text-to-image](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py), and [Wuerstchen](https://github.com/huggingface/diffusers/blob/main/examples/wuerstchen/text_to_image/train_text_to_image_lora_prior.py). 
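+
+To make the parameter savings concrete, here is a minimal, self-contained sketch of the idea behind LoRA (not the training script's actual implementation): a frozen layer is augmented with a pair of small low-rank matrices, and only those new weights receive gradients:
+
+```py
+import torch
+import torch.nn as nn
+
+class LoRALinear(nn.Module):
+    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
+        super().__init__()
+        self.base = base
+        self.base.requires_grad_(False)  # the pretrained weight stays frozen
+        # Low-rank update W + B @ A, where A is (rank, in) and B is (out, rank).
+        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
+        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
+        self.scale = scale
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
+
+layer = LoRALinear(nn.Linear(768, 768), rank=4)
+trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
+total = sum(p.numel() for p in layer.parameters())
+print(f"trainable: {trainable} / {total}")  # only the two small LoRA matrices are trainable
+```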
-[cloneofsimo](https://github.com/cloneofsimo) was the first to try out LoRA training for Stable Diffusion in the popular [lora](https://github.com/cloneofsimo/lora) GitHub repository. 🧨 Diffusers now supports finetuning with LoRA for [text-to-image generation](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image#training-with-lora) and [DreamBooth](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth#training-with-low-rank-adaptation-of-large-language-models-lora). This guide will show you how to do both. +This guide will explore the [train_text_to_image_lora.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py) script to help you become more familiar with it, and how you can adapt it for your own use-case. -If you'd like to store or share your model with the community, login to your Hugging Face account (create [one](https://hf.co/join) if you don't have one already): +Before running the script, make sure you install the library from source: ```bash -huggingface-cli login +git clone https://github.com/huggingface/diffusers +cd diffusers +pip install . ``` -## Text-to-image - -Finetuning a model like Stable Diffusion, which has billions of parameters, can be slow and difficult. With LoRA, it is much easier and faster to finetune a diffusion model. It can run on hardware with as little as 11GB of GPU RAM without resorting to tricks such as 8-bit optimizers. - -### Training[[text-to-image-training]] - -Let's finetune [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) on the [PokΓ©mon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset to generate your own PokΓ©mon. +Navigate to the example folder with the training script and install the required dependencies for the script you're using: -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. You'll also need to set the `DATASET_NAME` environment variable to the name of the dataset you want to train on. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. - -The `OUTPUT_DIR` and `HUB_MODEL_ID` variables are optional and specify where to save the model to on the Hub: + + ```bash -export MODEL_NAME="runwayml/stable-diffusion-v1-5" -export OUTPUT_DIR="/sddata/finetune/lora/pokemon" -export HUB_MODEL_ID="pokemon-lora" -export DATASET_NAME="lambdalabs/pokemon-blip-captions" +cd examples/text_to_image +pip install -r requirements.txt ``` -There are some flags to be aware of before you start training: - -* `--push_to_hub` stores the trained LoRA embeddings on the Hub. -* `--report_to=wandb` reports and logs the training results to your Weights & Biases dashboard (as an example, take a look at this [report](https://wandb.ai/pcuenq/text2image-fine-tune/runs/b4k1w0tn?workspace=user-pcuenq)). -* `--learning_rate=1e-04`, you can afford to use a higher learning rate than you normally would with LoRA. - -Now you're ready to launch the training (you can find the full training script [here](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_lora.py)). 
Training takes about 5 hours on a 2080 Ti GPU with 11GB of RAM, and it'll create and save model checkpoints and the `pytorch_lora_weights` in your repository. + + ```bash -accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --dataset_name=$DATASET_NAME \ - --dataloader_num_workers=8 \ - --resolution=512 --center_crop --random_flip \ - --train_batch_size=1 \ - --gradient_accumulation_steps=4 \ - --max_train_steps=15000 \ - --learning_rate=1e-04 \ - --max_grad_norm=1 \ - --lr_scheduler="cosine" --lr_warmup_steps=0 \ - --output_dir=${OUTPUT_DIR} \ - --push_to_hub \ - --hub_model_id=${HUB_MODEL_ID} \ - --report_to=wandb \ - --checkpointing_steps=500 \ - --validation_prompt="A pokemon with blue eyes." \ - --seed=1337 -``` - -### Inference[[text-to-image-inference]] - -Now you can use the model for inference by loading the base model in the [`StableDiffusionPipeline`] and then the [`DPMSolverMultistepScheduler`]: - -```py ->>> import torch ->>> from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler - ->>> model_base = "runwayml/stable-diffusion-v1-5" - ->>> pipe = StableDiffusionPipeline.from_pretrained(model_base, torch_dtype=torch.float16, use_safetensors=True) ->>> pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config) -``` - -Load the LoRA weights from your finetuned model *on top of the base model weights*, and then move the pipeline to a GPU for faster inference. When you merge the LoRA weights with the frozen pretrained model weights, you can optionally adjust how much of the weights to merge with the `scale` parameter: - - - -πŸ’‘ A `scale` value of `0` is the same as not using your LoRA weights and you're only using the base model weights, and a `scale` value of `1` means you're only using the fully finetuned LoRA weights. Values between `0` and `1` interpolates between the two weights. - - - -```py ->>> pipe.unet.load_attn_procs(lora_model_path) ->>> pipe.to("cuda") - -# use half the weights from the LoRA finetuned model and half the weights from the base model ->>> image = pipe( -... "A pokemon with blue eyes.", num_inference_steps=25, guidance_scale=7.5, cross_attention_kwargs={"scale": 0.5} -... ).images[0] - -# OR, use the weights from the fully finetuned LoRA model -# >>> image = pipe("A pokemon with blue eyes.", num_inference_steps=25, guidance_scale=7.5).images[0] - ->>> image.save("blue_pokemon.png") -``` - - - -If you are loading the LoRA parameters from the Hub and if the Hub repository has -a `base_model` tag (such as [this](https://huggingface.co/sayakpaul/sd-model-finetuned-lora-t4/blob/main/README.md?code=true#L4)), then -you can do: - -```py -from huggingface_hub.repocard import RepoCard - -lora_model_id = "sayakpaul/sd-model-finetuned-lora-t4" -card = RepoCard.load(lora_model_id) -base_model_id = card.data.to_dict()["base_model"] - -pipe = StableDiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, use_safetensors=True) -... +cd examples/text_to_image +pip install -r requirements_flax.txt ``` - - - -## DreamBooth - -[DreamBooth](https://arxiv.org/abs/2208.12242) is a finetuning technique for personalizing a text-to-image model like Stable Diffusion to generate photorealistic images of a subject in different contexts, given a few images of the subject. However, DreamBooth is very sensitive to hyperparameters and it is easy to overfit. 
Some important hyperparameters to consider include those that affect the training time (learning rate, number of training steps), and inference time (number of steps, scheduler type). + + -πŸ’‘ Take a look at the [Training Stable Diffusion with DreamBooth using 🧨 Diffusers](https://huggingface.co/blog/dreambooth) blog for an in-depth analysis of DreamBooth experiments and recommended settings. +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. -### Training[[dreambooth-training]] - -Let's finetune [`stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) with DreamBooth and LoRA with some 🐢 [dog images](https://drive.google.com/drive/folders/1BO_dyz-p65qhBRRMRA4TbZ8qW4rB99JZ). Download and save these images to a directory. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. - -To start, specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. You'll also need to set `INSTANCE_DIR` to the path of the directory containing the images. - -The `OUTPUT_DIR` variables is optional and specifies where to save the model to on the Hub: +Initialize an πŸ€— Accelerate environment: ```bash -export MODEL_NAME="runwayml/stable-diffusion-v1-5" -export INSTANCE_DIR="path-to-instance-images" -export OUTPUT_DIR="path-to-save-model" +accelerate config ``` -There are some flags to be aware of before you start training: - -* `--push_to_hub` stores the trained LoRA embeddings on the Hub. -* `--report_to=wandb` reports and logs the training results to your Weights & Biases dashboard (as an example, take a look at this [report](https://wandb.ai/pcuenq/text2image-fine-tune/runs/b4k1w0tn?workspace=user-pcuenq)). -* `--learning_rate=1e-04`, you can afford to use a higher learning rate than you normally would with LoRA. - -Now you're ready to launch the training (you can find the full training script [here](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora.py)). The script creates and saves model checkpoints and the `pytorch_lora_weights.bin` file in your repository. - -It's also possible to additionally fine-tune the text encoder with LoRA. This, in most cases, leads -to better results with a slight increase in the compute. To allow fine-tuning the text encoder with LoRA, -specify the `--train_text_encoder` while launching the `train_dreambooth_lora.py` script. 
+To setup a default πŸ€— Accelerate environment without choosing any configurations: ```bash -accelerate launch train_dreambooth_lora.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --instance_data_dir=$INSTANCE_DIR \ - --output_dir=$OUTPUT_DIR \ - --instance_prompt="a photo of sks dog" \ - --resolution=512 \ - --train_batch_size=1 \ - --gradient_accumulation_steps=1 \ - --checkpointing_steps=100 \ - --learning_rate=1e-4 \ - --report_to="wandb" \ - --lr_scheduler="constant" \ - --lr_warmup_steps=0 \ - --max_train_steps=500 \ - --validation_prompt="A photo of sks dog in a bucket" \ - --validation_epochs=50 \ - --seed="0" \ - --push_to_hub -``` - -### Inference[[dreambooth-inference]] - -Now you can use the model for inference by loading the base model in the [`StableDiffusionPipeline`]: - -```py ->>> import torch ->>> from diffusers import StableDiffusionPipeline - ->>> model_base = "runwayml/stable-diffusion-v1-5" - ->>> pipe = StableDiffusionPipeline.from_pretrained(model_base, torch_dtype=torch.float16, use_safetensors=True) +accelerate config default ``` -Load the LoRA weights from your finetuned DreamBooth model *on top of the base model weights*, and then move the pipeline to a GPU for faster inference. When you merge the LoRA weights with the frozen pretrained model weights, you can optionally adjust how much of the weights to merge with the `scale` parameter: - - - -πŸ’‘ A `scale` value of `0` is the same as not using your LoRA weights and you're only using the base model weights, and a `scale` value of `1` means you're only using the fully finetuned LoRA weights. Values between `0` and `1` interpolates between the two weights. - - +Or if your environment doesn't support an interactive shell, like a notebook, you can use: -```py ->>> pipe.unet.load_attn_procs(lora_model_path) ->>> pipe.to("cuda") - -# use half the weights from the LoRA finetuned model and half the weights from the base model ->>> image = pipe( -... "A picture of a sks dog in a bucket.", -... num_inference_steps=25, -... guidance_scale=7.5, -... cross_attention_kwargs={"scale": 0.5}, -... ).images[0] - -# OR, use the weights from the fully finetuned LoRA model -# >>> image = pipe("A picture of a sks dog in a bucket.", num_inference_steps=25, guidance_scale=7.5).images[0] +```bash +from accelerate.utils import write_basic_config ->>> image.save("bucket-dog.png") +write_basic_config() ``` -If you used `--train_text_encoder` during training, then use `pipe.load_lora_weights()` to load the LoRA -weights. For example: - -```python -from huggingface_hub.repocard import RepoCard -from diffusers import StableDiffusionPipeline -import torch - -lora_model_id = "sayakpaul/dreambooth-text-encoder-test" -card = RepoCard.load(lora_model_id) -base_model_id = card.data.to_dict()["base_model"] - -pipe = StableDiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16, use_safetensors=True) -pipe = pipe.to("cuda") -pipe.load_lora_weights(lora_model_id) -image = pipe("A picture of a sks dog in a bucket", num_inference_steps=25).images[0] -``` +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. -If your LoRA parameters involve the UNet as well as the Text Encoder, then passing -`cross_attention_kwargs={"scale": 0.5}` will apply the `scale` value to both the UNet -and the Text Encoder. 
+The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/text_to_image_lora.py) and let us know if you have any questions or concerns. -Note that the use of [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] is preferred to [`~diffusers.loaders.UNet2DConditionLoadersMixin.load_attn_procs`] for loading LoRA parameters. This is because -[`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] can handle the following situations: - -* LoRA parameters that don't have separate identifiers for the UNet and the text encoder (such as [`"patrickvonplaten/lora_dreambooth_dog_example"`](https://huggingface.co/patrickvonplaten/lora_dreambooth_dog_example)). So, you can just do: - - ```py - pipe.load_lora_weights(lora_model_path) - ``` - -* LoRA parameters that have separate identifiers for the UNet and the text encoder such as: [`"sayakpaul/dreambooth"`](https://huggingface.co/sayakpaul/dreambooth). - - - -You can also provide a local directory path to [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] as well as [`~diffusers.loaders.UNet2DConditionLoadersMixin.load_attn_procs`]. - - - -## Stable Diffusion XL - -We support fine-tuning with [Stable Diffusion XL](https://huggingface.co/papers/2307.01952). Please refer to the following docs: - -* [text_to_image/README_sdxl.md](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md) -* [dreambooth/README_sdxl.md](https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_sdxl.md) - -## Unloading LoRA parameters - -You can call [`~diffusers.loaders.LoraLoaderMixin.unload_lora_weights`] on a pipeline to unload the LoRA parameters. - -## Fusing LoRA parameters - -You can call [`~diffusers.loaders.LoraLoaderMixin.fuse_lora`] on a pipeline to merge the LoRA parameters with the original parameters of the underlying model(s). This can lead to a potential speedup in the inference latency. - -## Unfusing LoRA parameters +## Script parameters -To undo `fuse_lora`, call [`~diffusers.loaders.LoraLoaderMixin.unfuse_lora`] on a pipeline. +The training script has many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L85) function. Default values are provided for most parameters that work pretty well, but you can also set your own values in the training command if you'd like. -## Working with different LoRA scales when using LoRA fusion +For example, to increase the number of epochs to train: -If you need to use `scale` when working with `fuse_lora()` to control the influence of the LoRA parameters on the outputs, you should specify `lora_scale` within `fuse_lora()`. Passing the `scale` parameter to `cross_attention_kwargs` when you call the pipeline won't work. - -To use a different `lora_scale` with `fuse_lora()`, you should first call `unfuse_lora()` on the corresponding pipeline and call `fuse_lora()` again with the expected `lora_scale`. 
- -```python -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -lora_model_id = "hf-internal-testing/sdxl-1.0-lora" -lora_filename = "sd_xl_offset_example-lora_1.0.safetensors" -pipe.load_lora_weights(lora_model_id, weight_name=lora_filename) - -# This uses a default `lora_scale` of 1.0. -pipe.fuse_lora() - -generator = torch.manual_seed(0) -images_fusion = pipe( - "masterpiece, best quality, mountain", generator=generator, num_inference_steps=2 -).images - -# To work with a different `lora_scale`, first reverse the effects of `fuse_lora()`. -pipe.unfuse_lora() - -# Then proceed as follows. -pipe.load_lora_weights(lora_model_id, weight_name=lora_filename) -pipe.fuse_lora(lora_scale=0.5) - -generator = torch.manual_seed(0) -images_fusion = pipe( - "masterpiece, best quality, mountain", generator=generator, num_inference_steps=2 -).images -``` - -## Serializing pipelines with fused LoRA parameters - -Let's say you want to load the pipeline above that has its UNet fused with the LoRA parameters. You can easily do so by simply calling the `save_pretrained()` method on `pipe`. - -After loading the LoRA parameters into a pipeline, if you want to serialize the pipeline such that the affected model components are already fused with the LoRA parameters, you should: - -* call `fuse_lora()` on the pipeline with the desired `lora_scale`, given you've already loaded the LoRA parameters into it. -* call `save_pretrained()` on the pipeline. - -Here is a complete example: - -```python -from diffusers import DiffusionPipeline -import torch - -pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16).to("cuda") -lora_model_id = "hf-internal-testing/sdxl-1.0-lora" -lora_filename = "sd_xl_offset_example-lora_1.0.safetensors" -pipe.load_lora_weights(lora_model_id, weight_name=lora_filename) - -# First, fuse the LoRA parameters. -pipe.fuse_lora() - -# Then save. -pipe.save_pretrained("my-pipeline-with-fused-lora") +```bash +accelerate launch train_text_to_image_lora.py \ + --num_train_epochs=150 \ ``` -Now, you can load the pipeline and directly perform inference without having to load the LoRA parameters again: +Many of the basic and important parameters are described in the [Text-to-image](text2image#script-parameters) training guide, so this guide just focuses on the LoRA relevant parameters: -```python -from diffusers import DiffusionPipeline -import torch +- `--rank`: the number of low-rank matrices to train +- `--learning_rate`: the default learning rate is 1e-4, but with LoRA, you can use a higher learning rate -pipe = DiffusionPipeline.from_pretrained("my-pipeline-with-fused-lora", torch_dtype=torch.float16).to("cuda") +## Training script -generator = torch.manual_seed(0) -images_fusion = pipe( - "masterpiece, best quality, mountain", generator=generator, num_inference_steps=2 -).images -``` +The dataset preprocessing code and training loop are found in the [`main()`](https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L371) function, and if you need to adapt the training script, this is where you'll make your changes. -## Working with multiple LoRA checkpoints +As with the script parameters, a walkthrough of the training script is provided in the [Text-to-image](text2image#training-script) training guide. 
Instead, this guide takes a look at the LoRA relevant parts of the script. -With the `fuse_lora()` method as described above, it's possible to load multiple LoRA checkpoints. Let's work through a complete example. First we load the base pipeline: +The script begins by adding the [new LoRA weights](https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L447) to the attention layers. This involves correctly configuring the weight size for each block in the UNet. You'll see the `rank` parameter is used to create the [`~models.attention_processor.LoRAAttnProcessor`]: -```python -from diffusers import StableDiffusionXLPipeline, AutoencoderKL -import torch +```py +lora_attn_procs = {} +for name in unet.attn_processors.keys(): + cross_attention_dim = None if name.endswith("attn1.processor") else unet.config.cross_attention_dim + if name.startswith("mid_block"): + hidden_size = unet.config.block_out_channels[-1] + elif name.startswith("up_blocks"): + block_id = int(name[len("up_blocks.")]) + hidden_size = list(reversed(unet.config.block_out_channels))[block_id] + elif name.startswith("down_blocks"): + block_id = int(name[len("down_blocks.")]) + hidden_size = unet.config.block_out_channels[block_id] + + lora_attn_procs[name] = LoRAAttnProcessor( + hidden_size=hidden_size, + cross_attention_dim=cross_attention_dim, + rank=args.rank, + ) + +unet.set_attn_processor(lora_attn_procs) +lora_layers = AttnProcsLayers(unet.attn_processors) +``` + +The [optimizer](https://github.com/huggingface/diffusers/blob/dd9a5caf61f04d11c0fa9f3947b69ab0010c9a0f/examples/text_to_image/train_text_to_image_lora.py#L519) is initialized with the `lora_layers` because these are the only weights that'll be optimized: -vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16) -pipe = StableDiffusionXLPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - vae=vae, - torch_dtype=torch.float16, +```py +optimizer = optimizer_cls( + lora_layers.parameters(), + lr=args.learning_rate, + betas=(args.adam_beta1, args.adam_beta2), + weight_decay=args.adam_weight_decay, + eps=args.adam_epsilon, ) -pipe.to("cuda") ``` -Then let's two LoRA checkpoints and fuse them with specific `lora_scale` values: +Aside from setting up the LoRA layers, the training script is more or less the same as train_text_to_image.py! -```python -# LoRA one. -pipe.load_lora_weights("goofyai/cyborg_style_xl") -pipe.fuse_lora(lora_scale=0.7) +## Launch the script -# LoRA two. -pipe.load_lora_weights("TheLastBen/Pikachu_SDXL") -pipe.fuse_lora(lora_scale=0.7) -``` +Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! πŸš€ - +Let's train on the [PokΓ©mon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset to generate our yown PokΓ©mon. Set the environment variables `MODEL_NAME` and `DATASET_NAME` to the model and dataset respectively. You should also specify where to save the model in `OUTPUT_DIR`, and the name of the model to save to on the Hub with `HUB_MODEL_ID`. The script creates and saves the following files to your repository: -Play with the `lora_scale` parameter when working with multiple LoRAs to control the amount of their influence on the final outputs. 
+- saved model checkpoints +- `pytorch_lora_weights.safetensors` (the trained LoRA weights) - - -Let's see them in action: - -```python -prompt = "cyborg style pikachu" -image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0] -``` - -![cyborg_pikachu](https://huggingface.co/datasets/diffusers/docs-images/resolve/main/cyborg_pikachu.png) +If you're training on more than one GPU, add the `--multi_gpu` parameter to the `accelerate launch` command. -Currently, unfusing multiple LoRA checkpoints is not possible. +A full training run takes ~5 hours on a 2080 Ti GPU with 11GB of VRAM. -## Supporting different LoRA checkpoints from Diffusers - -πŸ€— Diffusers supports loading checkpoints from popular LoRA trainers such as [Kohya](https://github.com/kohya-ss/sd-scripts/) and [TheLastBen](https://github.com/TheLastBen/fast-stable-diffusion). In this section, we outline the current API's details and limitations. - -### Kohya - -This support was made possible because of the amazing contributors: [@takuma104](https://github.com/takuma104) and [@isidentical](https://github.com/isidentical). - -We support loading Kohya LoRA checkpoints using [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`]. In this section, we explain how to load such a checkpoint from [CivitAI](https://civitai.com/) -in Diffusers and perform inference with it. - -First, download a checkpoint. We'll use -[this one](https://civitai.com/models/13239/light-and-shadow) for demonstration purposes. - ```bash -wget https://civitai.com/api/download/models/15603 -O light_and_shadow.safetensors -``` - -Next, we initialize a [`~DiffusionPipeline`]: - -```python -import torch - -from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler - -pipeline = StableDiffusionPipeline.from_pretrained( - "gsdf/Counterfeit-V2.5", torch_dtype=torch.float16, safety_checker=None, use_safetensors=True -).to("cuda") -pipeline.scheduler = DPMSolverMultistepScheduler.from_config( - pipeline.scheduler.config, use_karras_sigmas=True -) -``` - -We then load the checkpoint downloaded from CivitAI: - -```python -pipeline.load_lora_weights(".", weight_name="light_and_shadow.safetensors") -``` - - - -If you're loading a checkpoint in the `safetensors` format, please ensure you have `safetensors` installed. 
- - - -And then it's time for running inference: - -```python -prompt = "masterpiece, best quality, 1girl, at dusk" -negative_prompt = ("(low quality, worst quality:1.4), (bad anatomy), (inaccurate limb:1.2), " - "bad composition, inaccurate eyes, extra digit, fewer digits, (extra arms:1.2), large breasts") - -images = pipeline(prompt=prompt, - negative_prompt=negative_prompt, - width=512, - height=768, - num_inference_steps=15, - num_images_per_prompt=4, - generator=torch.manual_seed(0) -).images -``` - -Below is a comparison between the LoRA and the non-LoRA results: - -![lora_non_lora](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lora_non_lora_comparison.png) - -You have a similar checkpoint stored on the Hugging Face Hub, you can load it -directly with [`~diffusers.loaders.LoraLoaderMixin.load_lora_weights`] like so: - -```python -lora_model_id = "sayakpaul/civitai-light-shadow-lora" -lora_filename = "light_and_shadow.safetensors" -pipeline.load_lora_weights(lora_model_id, weight_name=lora_filename) -``` - -### Kohya + Stable Diffusion XL - -After the release of [Stable Diffusion XL](https://huggingface.co/papers/2307.01952), the community contributed some amazing LoRA checkpoints trained on top of it with the Kohya trainer. - -Here are some example checkpoints we tried out: - -* SDXL 0.9: - * https://civitai.com/models/22279?modelVersionId=118556 - * https://civitai.com/models/104515/sdxlor30costumesrevue-starlight-saijoclaudine-lora - * https://civitai.com/models/108448/daiton-sdxl-test - * https://filebin.net/2ntfqqnapiu9q3zx/pixelbuildings128-v1.safetensors -* SDXL 1.0: - * https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/blob/main/sd_xl_offset_example-lora_1.0.safetensors - -Here is an example of how to perform inference with these checkpoints in `diffusers`: - -```python -from diffusers import DiffusionPipeline -import torch - -base_model_id = "stabilityai/stable-diffusion-xl-base-0.9" -pipeline = DiffusionPipeline.from_pretrained(base_model_id, torch_dtype=torch.float16).to("cuda") -pipeline.load_lora_weights(".", weight_name="Kamepan.safetensors") - -prompt = "anime screencap, glint, drawing, best quality, light smile, shy, a full body of a girl wearing wedding dress in the middle of the forest beneath the trees, fireflies, big eyes, 2d, cute, anime girl, waifu, cel shading, magical girl, vivid colors, (outline:1.1), manga anime artstyle, masterpiece, official wallpaper, glint " -negative_prompt = "(deformed, bad quality, sketch, depth of field, blurry:1.1), grainy, bad anatomy, bad perspective, old, ugly, realistic, cartoon, disney, bad proportions" -generator = torch.manual_seed(2947883060) -num_inference_steps = 30 -guidance_scale = 7 +export MODEL_NAME="runwayml/stable-diffusion-v1-5" +export OUTPUT_DIR="/sddata/finetune/lora/pokemon" +export HUB_MODEL_ID="pokemon-lora" +export DATASET_NAME="lambdalabs/pokemon-blip-captions" -image = pipeline( - prompt=prompt, negative_prompt=negative_prompt, num_inference_steps=num_inference_steps, - generator=generator, guidance_scale=guidance_scale -).images[0] -image.save("Kamepan.png") +accelerate launch --mixed_precision="fp16" train_text_to_image_lora.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --dataset_name=$DATASET_NAME \ + --dataloader_num_workers=8 \ + --resolution=512 + --center_crop \ + --random_flip \ + --train_batch_size=1 \ + --gradient_accumulation_steps=4 \ + --max_train_steps=15000 \ + --learning_rate=1e-04 \ + --max_grad_norm=1 \ + 
--lr_scheduler="cosine" \ + --lr_warmup_steps=0 \ + --output_dir=${OUTPUT_DIR} \ + --push_to_hub \ + --hub_model_id=${HUB_MODEL_ID} \ + --report_to=wandb \ + --checkpointing_steps=500 \ + --validation_prompt="A pokemon with blue eyes." \ + --seed=1337 ``` -`Kamepan.safetensors` comes from https://civitai.com/models/22279?modelVersionId=118556 . - -If you notice carefully, the inference UX is exactly identical to what we presented in the sections above. +Once training has been completed, you can use your model for inference: -Thanks to [@isidentical](https://github.com/isidentical) for helping us on integrating this feature. - - - -**Known limitations specific to the Kohya LoRAs**: - -* When images don't looks similar to other UIs, such as ComfyUI, it can be because of multiple reasons, as explained [here](https://github.com/huggingface/diffusers/pull/4287/#issuecomment-1655110736). -* We don't fully support [LyCORIS checkpoints](https://github.com/KohakuBlueleaf/LyCORIS). To the best of our knowledge, our current `load_lora_weights()` should support LyCORIS checkpoints that have LoRA and LoCon modules but not the other ones, such as Hada, LoKR, etc. - - - -### TheLastBen - -Here is an example: - -```python -from diffusers import DiffusionPipeline +```py +from diffusers import AutoPipelineForText2Image import torch -pipeline_id = "Lykon/dreamshaper-xl-1-0" +pipeline = AutoPipelineForText2Image.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda") +pipeline.load_lora_weights("path/to/lora/model", weight_name="pytorch_lora_weights.safetensors") +image = pipeline("A pokemon with blue eyes").images[0] +``` -pipe = DiffusionPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16) -pipe.enable_model_cpu_offload() +## Next steps -lora_model_id = "TheLastBen/Papercut_SDXL" -lora_filename = "papercut.safetensors" -pipe.load_lora_weights(lora_model_id, weight_name=lora_filename) +Congratulations on training a new model with LoRA! To learn more about how to use your new model, the following guides may be helpful: -prompt = "papercut sonic" -image = pipe(prompt=prompt, num_inference_steps=20, generator=torch.manual_seed(0)).images[0] -image -``` +- Learn how to [load different LoRA formats](../using-diffusers/loading_adapters#LoRA) trained using community trainers like Kohya and TheLastBen. +- Learn how to use and [combine multiple LoRA's](../tutorials/using_peft_for_inference) with PEFT for inference. \ No newline at end of file diff --git a/docs/source/en/training/overview.md b/docs/source/en/training/overview.md index c6fe339eda73..50a9417972a0 100644 --- a/docs/source/en/training/overview.md +++ b/docs/source/en/training/overview.md @@ -10,66 +10,37 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# 🧨 Diffusers Training Examples +# Overview -Diffusers training examples are a collection of scripts to demonstrate how to effectively use the `diffusers` library -for a variety of use cases. +πŸ€— Diffusers provides a collection of training scripts for you to train your own diffusion models. You can find all of our training scripts in [diffusers/examples](https://github.com/huggingface/diffusers/tree/main/examples). 
-**Note**: If you are looking for **official** examples on how to use `diffusers` for inference, -please have a look at [src/diffusers/pipelines](https://github.com/huggingface/diffusers/tree/main/src/diffusers/pipelines) +Each training script is: -Our examples aspire to be **self-contained**, **easy-to-tweak**, **beginner-friendly** and for **one-purpose-only**. -More specifically, this means: +- **Self-contained**: the training script does not depend on any local files, and all packages required to run the script are installed from the `requirements.txt` file. +- **Easy-to-tweak**: the training scripts are an example of how to train a diffusion model for a specific task and won't work out-of-the-box for every training scenario. You'll likely need to adapt the training script for your specific use-case. To help you with that, we've fully exposed the data preprocessing code and the training loop so you can modify it for your own use. +- **Beginner-friendly**: the training scripts are designed to be beginner-friendly and easy to understand, rather than including the latest state-of-the-art methods to get the best and most competitive results. Any training methods we consider too complex are purposefully left out. +- **Single-purpose**: each training script is expressly designed for only one task to keep it readable and understandable. -- **Self-contained**: An example script shall only depend on "pip-install-able" Python packages that can be found in a `requirements.txt` file. Example scripts shall **not** depend on any local files. This means that one can simply download an example script, *e.g.* [train_unconditional.py](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py), install the required dependencies, *e.g.* [requirements.txt](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/requirements.txt) and execute the example script. -- **Easy-to-tweak**: While we strive to present as many use cases as possible, the example scripts are just that - examples. It is expected that they won't work out-of-the box on your specific problem and that you will be required to change a few lines of code to adapt them to your needs. To help you with that, most of the examples fully expose the preprocessing of the data and the training loop to allow you to tweak and edit them as required. -- **Beginner-friendly**: We do not aim for providing state-of-the-art training scripts for the newest models, but rather examples that can be used as a way to better understand diffusion models and how to use them with the `diffusers` library. We often purposefully leave out certain state-of-the-art methods if we consider them too complex for beginners. -- **One-purpose-only**: Examples should show one task and one task only. Even if a task is from a modeling -point of view very similar, *e.g.* image super-resolution and image modification tend to use the same model and training method, we want examples to showcase only one task to keep them as readable and easy-to-understand as possible. +Our current collection of training scripts include: -We provide **official** examples that cover the most popular tasks of diffusion models. -*Official* examples are **actively** maintained by the `diffusers` maintainers and we try to rigorously follow our example philosophy as defined above. 
-If you feel like another important example should exist, we are more than happy to welcome a [Feature Request](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feature_request.md&title=) or directly a [Pull Request](https://github.com/huggingface/diffusers/compare) from you! +| Training | SDXL-support | LoRA-support | Flax-support | +|---|---|---|---| +| [unconditional image generation](https://github.com/huggingface/diffusers/tree/main/examples/unconditional_image_generation) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) | | | | +| [text-to-image](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image) | πŸ‘ | πŸ‘ | πŸ‘ | +| [textual inversion](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb) | | | πŸ‘ | +| [DreamBooth](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_dreambooth_training.ipynb) | πŸ‘ | πŸ‘ | πŸ‘ | +| [ControlNet](https://github.com/huggingface/diffusers/tree/main/examples/controlnet) | πŸ‘ | | πŸ‘ | +| [InstructPix2Pix](https://github.com/huggingface/diffusers/tree/main/examples/instruct_pix2pix) | πŸ‘ | | | +| [Custom Diffusion](https://github.com/huggingface/diffusers/tree/main/examples/custom_diffusion) | | | | +| [T2I-Adapters](https://github.com/huggingface/diffusers/tree/main/examples/t2i_adapter) | πŸ‘ | | | +| [Kandinsky 2.2](https://github.com/huggingface/diffusers/tree/main/examples/kandinsky2_2/text_to_image) | | πŸ‘ | | +| [Wuerstchen](https://github.com/huggingface/diffusers/tree/main/examples/wuerstchen/text_to_image) | | πŸ‘ | | -Training examples show how to pretrain or fine-tune diffusion models for a variety of tasks. Currently we support: +These examples are **actively** maintained, so please feel free to open an issue if they aren't working as expected. If you feel like another training example should be included, you're more than welcome to start a [Feature Request](https://github.com/huggingface/diffusers/issues/new?assignees=&labels=&template=feature_request.md&title=) to discuss your feature idea with us and whether it meets our criteria of being self-contained, easy-to-tweak, beginner-friendly, and single-purpose. -- [Unconditional Training](./unconditional_training) -- [Text-to-Image Training](./text2image)* -- [Text Inversion](./text_inversion) -- [Dreambooth](./dreambooth)* -- [LoRA Support](./lora)* -- [ControlNet](./controlnet)* -- [InstructPix2Pix](./instructpix2pix)* -- [Custom Diffusion](./custom_diffusion) -- [T2I-Adapters](./t2i_adapters)* +## Install -*: Supports [Stable Diffusion XL](../api/pipelines/stable_diffusion/stable_diffusion_xl). - -If possible, please [install xFormers](../optimization/xformers) for memory efficient attention. This could help make your training faster and less memory intensive. 
- -| Task | πŸ€— Accelerate | πŸ€— Datasets | Colab -|---|---|:---:|:---:| -| [**Unconditional Image Generation**](./unconditional_training) | βœ… | βœ… | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/training_example.ipynb) -| [**Text-to-Image fine-tuning**](./text2image) | βœ… | βœ… | -| [**Textual Inversion**](./text_inversion) | βœ… | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_textual_inversion_training.ipynb) -| [**Dreambooth**](./dreambooth) | βœ… | - | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/huggingface/notebooks/blob/main/diffusers/sd_dreambooth_training.ipynb) -| [**Training with LoRA**](./lora) | βœ… | - | - | -| [**ControlNet**](./controlnet) | βœ… | βœ… | - | -| [**InstructPix2Pix**](./instructpix2pix) | βœ… | βœ… | - | -| [**Custom Diffusion**](./custom_diffusion) | βœ… | βœ… | - | -| [**T2I Adapters**](./t2i_adapters) | βœ… | βœ… | - | - -## Community - -In addition, we provide **community** examples, which are examples added and maintained by our community. -Community examples can consist of both *training* examples or *inference* pipelines. -For such examples, we are more lenient regarding the philosophy defined above and also cannot guarantee to provide maintenance for every issue. -Examples that are useful for the community, but are either not yet deemed popular or not yet following our above philosophy should go into the [community examples](https://github.com/huggingface/diffusers/tree/main/examples/community) folder. The community folder therefore includes training examples and inference pipelines. -**Note**: Community examples can be a [great first contribution](https://github.com/huggingface/diffusers/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) to show to the community how you like to use `diffusers` πŸͺ„. - -## Important note - -To make sure you can successfully run the latest versions of the example scripts, you have to **install the library from source** and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: +Make sure you can successfully run the latest versions of the example scripts by installing the library from source in a new virtual environment: ```bash git clone https://github.com/huggingface/diffusers @@ -77,8 +48,16 @@ cd diffusers pip install . ``` -Then cd in the example folder of your choice and run +Then navigate to the folder of the training script (for example, [DreamBooth](https://github.com/huggingface/diffusers/tree/main/examples/dreambooth)) and install the `requirements.txt` file. Some training scripts have a specific requirement file for SDXL, LoRA or Flax. If you're using one of these scripts, make sure you install its corresponding requirements file. 
```bash +cd examples/dreambooth pip install -r requirements.txt +# to train SDXL with DreamBooth +pip install -r requirements_sdxl.txt ``` + +To speedup training and reduce memory-usage, we recommend: + +- using PyTorch 2.0 or higher to automatically use [scaled dot product attention](../optimization/torch2.0#scaled-dot-product-attention) during training (you don't need to make any changes to the training code) +- installing [xFormers](../optimization/xformers) to enable memory-efficient attention \ No newline at end of file diff --git a/docs/source/en/training/sdxl.md b/docs/source/en/training/sdxl.md new file mode 100644 index 000000000000..eebb614e907b --- /dev/null +++ b/docs/source/en/training/sdxl.md @@ -0,0 +1,266 @@ + + +# Stable Diffusion XL + + + +This script is experimental, and it's easy to overfit and run into issues like catastrophic forgetting. Try exploring different hyperparameters to get the best results on your dataset. + + + +[Stable Diffusion XL (SDXL)](https://hf.co/papers/2307.01952) is a larger and more powerful iteration of the Stable Diffusion model, capable of producing higher resolution images. + +SDXL's UNet is 3x larger and the model adds a second text encoder to the architecture. Depending on the hardware available to you, this can be very computationally intensive and it may not run on a consumer GPU like a Tesla T4. To help fit this larger model into memory and to speedup training, try enabling `gradient_checkpointing`, `mixed_precision`, and `gradient_accumulation_steps`. You can reduce your memory-usage even more by enabling memory-efficient attention with [xFormers](../optimization/xformers) and using [bitsandbytes'](https://github.com/TimDettmers/bitsandbytes) 8-bit optimizer. + +This guide will explore the [train_text_to_image_sdxl.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_sdxl.py) training script to help you become more familiar with it, and how you can adapt it for your own use-case. + +Before running the script, make sure you install the library from source: + +```bash +git clone https://github.com/huggingface/diffusers +cd diffusers +pip install . +``` + +Then navigate to the example folder containing the training script and install the required dependencies for the script you're using: + +```bash +cd examples/text_to_image +pip install -r requirements_sdxl.txt +``` + + + +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. + + + +Initialize an πŸ€— Accelerate environment: + +```bash +accelerate config +``` + +To setup a default πŸ€— Accelerate environment without choosing any configurations: + +```bash +accelerate config default +``` + +Or if your environment doesn't support an interactive shell, like a notebook, you can use: + +```bash +from accelerate.utils import write_basic_config + +write_basic_config() +``` + +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. + +## Script parameters + + + +The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. 
If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_sdxl.py) and let us know if you have any questions or concerns.
+
+
+
+The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L129) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.
+
+For example, to speed up training with mixed precision using the bf16 format, add the `--mixed_precision` parameter to the training command:
+
+```bash
+accelerate launch train_text_to_image_sdxl.py \
+  --mixed_precision="bf16"
+```
+
+Most of the parameters are identical to the parameters in the [Text-to-image](text2image#script-parameters) training guide, so this guide focuses only on the parameters that are relevant to training SDXL:
+
+- `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify a better [VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)
+- `--proportion_empty_prompts`: the proportion of image prompts to replace with empty strings
+- `--timestep_bias_strategy`: where (earlier vs. later) in the timestep to apply a bias, which can encourage the model to either learn low or high frequency details
+- `--timestep_bias_multiplier`: the weight of the bias to apply to the timestep
+- `--timestep_bias_begin`: the timestep to begin applying the bias
+- `--timestep_bias_end`: the timestep to end applying the bias
+- `--timestep_bias_portion`: the proportion of timesteps to apply the bias to
+
+### Min-SNR weighting
+
+The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting either `epsilon` (noise) or `v_prediction`, but Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script.
+
+Add the `--snr_gamma` parameter and set it to the recommended value of 5.0:
+
+```bash
+accelerate launch train_text_to_image_sdxl.py \
+  --snr_gamma=5.0
+```
+
+## Training script
+
+The training script is also similar to the [Text-to-image](text2image#training-script) training guide, but it's been modified to support SDXL training. This guide will focus on the code that is unique to the SDXL training script.
+
+It starts by creating functions to [tokenize the prompts](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L478) to calculate the prompt embeddings, and to compute the image embeddings with the [VAE](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L519). Next, you'll find a function to [generate the timestep weights](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L531) depending on the number of timesteps and the timestep bias strategy to apply. 
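+
+To make the timestep bias idea concrete, here is a minimal sketch (this is not the script's actual `generate_timestep_weights` implementation; the helper name and default values below are made up for illustration). Biasing "later" timesteps simply turns uniform timestep sampling into a weighted draw:
+
+```py
+import torch
+
+def biased_timestep_weights(num_timesteps, bias_portion=0.25, bias_multiplier=2.0):
+    # Start from uniform sampling weights over all timesteps.
+    weights = torch.ones(num_timesteps)
+    # Upweight the last `bias_portion` of the schedule (a "later" bias strategy).
+    num_biased = int(num_timesteps * bias_portion)
+    weights[num_timesteps - num_biased :] *= bias_multiplier
+    # Normalize the weights into a probability distribution.
+    return weights / weights.sum()
+
+# Sample a batch of timesteps from the biased distribution instead of uniformly.
+weights = biased_timestep_weights(1000)
+timesteps = torch.multinomial(weights, num_samples=4, replacement=True).long()
+```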
+ +Within the [`main()`](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L572) function, in addition to loading a tokenizer, the script loads a second tokenizer and text encoder because the SDXL architecture uses two of each: + +```py +tokenizer_one = AutoTokenizer.from_pretrained( + args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision, use_fast=False +) +tokenizer_two = AutoTokenizer.from_pretrained( + args.pretrained_model_name_or_path, subfolder="tokenizer_2", revision=args.revision, use_fast=False +) + +text_encoder_cls_one = import_model_class_from_model_name_or_path( + args.pretrained_model_name_or_path, args.revision +) +text_encoder_cls_two = import_model_class_from_model_name_or_path( + args.pretrained_model_name_or_path, args.revision, subfolder="text_encoder_2" +) +``` + +The [prompt and image embeddings](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L857) are computed first and kept in memory, which isn't typically an issue for a smaller dataset, but for larger datasets it can lead to memory problems. If this is the case, you should save the pre-computed embeddings to disk separately and load them into memory during the training process (see this [PR](https://github.com/huggingface/diffusers/pull/4505) for more discussion about this topic). + +```py +text_encoders = [text_encoder_one, text_encoder_two] +tokenizers = [tokenizer_one, tokenizer_two] +compute_embeddings_fn = functools.partial( + encode_prompt, + text_encoders=text_encoders, + tokenizers=tokenizers, + proportion_empty_prompts=args.proportion_empty_prompts, + caption_column=args.caption_column, +) + +train_dataset = train_dataset.map(compute_embeddings_fn, batched=True, new_fingerprint=new_fingerprint) +train_dataset = train_dataset.map( + compute_vae_encodings_fn, + batched=True, + batch_size=args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps, + new_fingerprint=new_fingerprint_for_vae, +) +``` + +After calculating the embeddings, the text encoder, VAE, and tokenizer are deleted to free up some memory: + +```py +del text_encoders, tokenizers, vae +gc.collect() +torch.cuda.empty_cache() +``` + +Finally, the [training loop](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/text_to_image/train_text_to_image_sdxl.py#L943) takes care of the rest. If you chose to apply a timestep bias strategy, you'll see the timestep weights are calculated and added as noise: + +```py +weights = generate_timestep_weights(args, noise_scheduler.config.num_train_timesteps).to( + model_input.device + ) + timesteps = torch.multinomial(weights, bsz, replacement=True).long() + +noisy_model_input = noise_scheduler.add_noise(model_input, noise, timesteps) +``` + +If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process. + +## Launch the script + +Once you’ve made all your changes or you’re okay with the default configuration, you’re ready to launch the training script! πŸš€ + +Let’s train on the [PokΓ©mon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset to generate your own PokΓ©mon. 
Set the environment variables `MODEL_NAME` and `DATASET_NAME` to the model and the dataset (either from the Hub or a local path). You should also specify a VAE other than the SDXL VAE (either from the Hub or a local path) with `VAE_NAME` to avoid numerical instabilities. + + + +To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command. You’ll also need to add the `--validation_prompt` and `--validation_epochs` to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results. + + + +```bash +export MODEL_NAME="stabilityai/stable-diffusion-xl-base-1.0" +export VAE_NAME="madebyollin/sdxl-vae-fp16-fix" +export DATASET_NAME="lambdalabs/pokemon-blip-captions" + +accelerate launch train_text_to_image_sdxl.py \ + --pretrained_model_name_or_path=$MODEL_NAME \ + --pretrained_vae_model_name_or_path=$VAE_NAME \ + --dataset_name=$DATASET_NAME \ + --enable_xformers_memory_efficient_attention \ + --resolution=512 \ + --center_crop \ + --random_flip \ + --proportion_empty_prompts=0.2 \ + --train_batch_size=1 \ + --gradient_accumulation_steps=4 \ + --gradient_checkpointing \ + --max_train_steps=10000 \ + --use_8bit_adam \ + --learning_rate=1e-06 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --mixed_precision="fp16" \ + --report_to="wandb" \ + --validation_prompt="a cute Sundar Pichai creature" \ + --validation_epochs 5 \ + --checkpointing_steps=5000 \ + --output_dir="sdxl-pokemon-model" \ + --push_to_hub +``` + +After you've finished training, you can use your newly trained SDXL model for inference! + + + + +```py +from diffusers import DiffusionPipeline +import torch + +pipeline = DiffusionPipeline.from_pretrained("path/to/your/model", torch_dtype=torch.float16).to("cuda") + +prompt = "A pokemon with green eyes and red legs." +image = pipeline(prompt, num_inference_steps=30, guidance_scale=7.5).images[0] +image.save("pokemon.png") +``` + + + + +[PyTorch XLA](https://pytorch.org/xla) allows you to run PyTorch on XLA devices such as TPUs, which can be faster. The initial warmup step takes longer because the model needs to be compiled and optimized. However, subsequent calls to the pipeline on an input **with the same length** as the original prompt are much faster because it can reuse the optimized graph. + +```py +from diffusers import DiffusionPipeline +import torch +import torch_xla.core.xla_model as xm + +device = xm.xla_device() +pipeline = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to(device) + +prompt = "A pokemon with green eyes and red legs." +start = time() +image = pipeline(prompt, num_inference_steps=inference_steps).images[0] +print(f'Compilation time is {time()-start} sec') +image.save("pokemon.png") + +start = time() +image = pipeline(prompt, num_inference_steps=inference_steps).images[0] +print(f'Inference time is {time()-start} sec after compilation') +``` + + + + +## Next steps + +Congratulations on training a SDXL model! To learn more about how to use your new model, the following guides may be helpful: + +- Read the [Stable Diffusion XL](../using-diffusers/sdxl) guide to learn how to use it for a variety of different tasks (text-to-image, image-to-image, inpainting), how to use it's refiner model, and the different types of micro-conditionings. +- Check out the [DreamBooth](dreambooth) and [LoRA](lora) training guides to learn how to train a personalized SDXL model with just a few example images. 
These two training techniques can even be combined! \ No newline at end of file diff --git a/docs/source/en/training/t2i_adapters.md b/docs/source/en/training/t2i_adapters.md index 08a4dfaf4599..9d4f292b1d3f 100644 --- a/docs/source/en/training/t2i_adapters.md +++ b/docs/source/en/training/t2i_adapters.md @@ -10,67 +10,167 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> -# T2I-Adapters for Stable Diffusion XL (SDXL) +# T2I-Adapter -The `train_t2i_adapter_sdxl.py` script (as shown below) shows how to implement the [T2I-Adapter training procedure](https://hf.co/papers/2302.08453) for [Stable Diffusion XL](https://huggingface.co/papers/2307.01952). +[T2I-Adapter]((https://hf.co/papers/2302.08453)) is a lightweight adapter model that provides an additional conditioning input image (line art, canny, sketch, depth, pose) to better control image generation. It is similar to a ControlNet, but it is a lot smaller (~77M parameters and ~300MB file size) because its only inserts weights into the UNet instead of copying and training it. -## Running locally with PyTorch +The T2I-Adapter is only available for training with the Stable Diffusion XL (SDXL) model. -### Installing the dependencies +This guide will explore the [train_t2i_adapter_sdxl.py](https://github.com/huggingface/diffusers/blob/main/examples/t2i_adapter/train_t2i_adapter_sdxl.py) training script to help you become familiar with it, and how you can adapt it for your own use-case. -Before running the scripts, make sure to install the library's training dependencies: - -**Important** - -To make sure you can successfully run the latest versions of the example scripts, we highly recommend **installing from source** and keeping the install up to date as we update the example scripts frequently and install some example-specific requirements. To do this, execute the following steps in a new virtual environment: +Before running the script, make sure you install the library from source: ```bash git clone https://github.com/huggingface/diffusers cd diffusers -pip install -e . +pip install . ``` -Then cd in the `examples/t2i_adapter` folder and run +Then navigate to the example folder containing the training script and install the required dependencies for the script you're using: + ```bash -pip install -r requirements_sdxl.txt +cd examples/t2i_adapter +pip install -r requirements.txt ``` -And initialize an [πŸ€—Accelerate](https://github.com/huggingface/accelerate/) environment with: + + +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. 
+
+
+
+Initialize an πŸ€— Accelerate environment:
 
 ```bash
 accelerate config
 ```
 
-Or for a default accelerate configuration without answering questions about your environment
+To set up a default πŸ€— Accelerate environment without choosing any configurations:
 
 ```bash
 accelerate config default
 ```
 
-Or if your environment doesn't support an interactive shell (e.g., a notebook)
+Or if your environment doesn't support an interactive shell, like a notebook, you can use:
 
-```python
+```py
 from accelerate.utils import write_basic_config
+
 write_basic_config()
 ```
 
-When running `accelerate config`, if we specify torch compile mode to True there can be dramatic speedups.
+Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script.
 
-## Circle filling dataset
+
 
-The original dataset is hosted in the [ControlNet repo](https://huggingface.co/lllyasviel/ControlNet/blob/main/training/fill50k.zip). We re-uploaded it to be compatible with `datasets` [here](https://huggingface.co/datasets/fusing/fill50k). Note that `datasets` handles dataloading within the training script.
+The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/t2i_adapter/train_t2i_adapter_sdxl.py) and let us know if you have any questions or concerns.
 
-## Training
+
 
-Our training examples use two test conditioning images. They can be downloaded by running
+## Script parameters
 
-```sh
-wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png
+The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L233) function. It provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like.
+For example, to activate gradient accumulation, add the `--gradient_accumulation_steps` parameter to the training command:
+
+```bash
+accelerate launch train_t2i_adapter_sdxl.py \
+  --gradient_accumulation_steps=4
+```
+
+Many of the basic and important parameters are described in the [Text-to-image](text2image#script-parameters) training guide, so this guide just focuses on the relevant T2I-Adapter parameters:
+
+- `--pretrained_vae_model_name_or_path`: path to a pretrained VAE; the SDXL VAE is known to suffer from numerical instability, so this parameter allows you to specify a better [VAE](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)
+- `--crops_coords_top_left_h` and `--crops_coords_top_left_w`: height and width coordinates to include in SDXL's crop coordinate embeddings
+- `--conditioning_image_column`: the column of the conditioning images in the dataset
+- `--proportion_empty_prompts`: the proportion of image prompts to replace with empty strings
+
+## Training script
+
+As with the script parameters, a walkthrough of the training script is provided in the [Text-to-image](text2image#training-script) training guide. 
Instead, this guide takes a look at the T2I-Adapter relevant parts of the script.
+
+The training script begins by preparing the dataset. This includes [tokenizing](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L674) the prompt and [applying transforms](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L714) to the images and conditioning images.
+
+```py
+conditioning_image_transforms = transforms.Compose(
+    [
+        transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR),
+        transforms.CenterCrop(args.resolution),
+        transforms.ToTensor(),
+    ]
+)
+```
+
+Within the [`main()`](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L770) function, the T2I-Adapter is either loaded from a pretrained adapter or it is randomly initialized:
+
+```py
+if args.adapter_model_name_or_path:
+    logger.info("Loading existing adapter weights.")
+    t2iadapter = T2IAdapter.from_pretrained(args.adapter_model_name_or_path)
+else:
+    logger.info("Initializing t2iadapter weights.")
+    t2iadapter = T2IAdapter(
+        in_channels=3,
+        channels=(320, 640, 1280, 1280),
+        num_res_blocks=2,
+        downscale_factor=16,
+        adapter_type="full_adapter_xl",
+    )
+```
+
+The [optimizer](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L952) is initialized for the T2I-Adapter parameters:
+
+```py
+params_to_optimize = t2iadapter.parameters()
+optimizer = optimizer_class(
+    params_to_optimize,
+    lr=args.learning_rate,
+    betas=(args.adam_beta1, args.adam_beta2),
+    weight_decay=args.adam_weight_decay,
+    eps=args.adam_epsilon,
+)
+```
+
+Lastly, in the [training loop](https://github.com/huggingface/diffusers/blob/aab6de22c33cc01fb7bc81c0807d6109e2c998c9/examples/t2i_adapter/train_t2i_adapter_sdxl.py#L1086), the adapter conditioning image and the text embeddings are passed to the UNet to predict the noise residual:
+
+```py
+t2iadapter_image = batch["conditioning_pixel_values"].to(dtype=weight_dtype)
+down_block_additional_residuals = t2iadapter(t2iadapter_image)
+down_block_additional_residuals = [
+    sample.to(dtype=weight_dtype) for sample in down_block_additional_residuals
+]
+
+model_pred = unet(
+    inp_noisy_latents,
+    timesteps,
+    encoder_hidden_states=batch["prompt_ids"],
+    added_cond_kwargs=batch["unet_added_conditions"],
+    down_block_additional_residuals=down_block_additional_residuals,
+).sample
+```
+
+If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
+
+## Launch the script
+
+Now you’re ready to launch the training script! πŸš€
+
+For this example training, you'll use the [fusing/fill50k](https://huggingface.co/datasets/fusing/fill50k) dataset. You can also create and use your own dataset if you want (see the [Create a dataset for training](create_dataset) guide).
+
+Set the environment variable `MODEL_DIR` to a model id on the Hub or a path to a local model and `OUTPUT_DIR` to where you want to save the model. 
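+
+For example (the model below is the SDXL base checkpoint used throughout this guide, and the output path is only a placeholder you should replace with your own):
+
+```bash
+export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
+export OUTPUT_DIR="path/to/save/model"
+```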
+
+Download the following images to condition your training with:
+
+```bash
+wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_1.png
 wget https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet_training/conditioning_image_2.png
 ```
 
-Then run `huggingface-cli login` to log into your Hugging Face account. This is needed to be able to push the trained T2IAdapter parameters to Hugging Face Hub.
+
+
+To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command. You'll also need to add the `--validation_image`, `--validation_prompt`, and `--validation_steps` to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results.
+
+
 
 ```bash
 export MODEL_DIR="stabilityai/stable-diffusion-xl-base-1.0"
@@ -94,50 +194,34 @@ accelerate launch train_t2i_adapter_sdxl.py \
   --push_to_hub
 ```
 
-To better track our training experiments, we're using the following flags in the command above:
+Once training is complete, you can use your T2I-Adapter for inference:
 
-* `report_to="wandb` will ensure the training runs are tracked on Weights and Biases. To use it, be sure to install `wandb` with `pip install wandb`.
-* `validation_image`, `validation_prompt`, and `validation_steps` to allow the script to do a few validation inference runs. This allows us to qualitatively check if the training is progressing as expected.
-
-Our experiments were conducted on a single 40GB A100 GPU.
-
-### Inference
-
-Once training is done, we can perform inference like so:
-
-```python
+```py
-from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter, EulerAncestralDiscreteSchedulerTest
+from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter, EulerAncestralDiscreteScheduler
 from diffusers.utils import load_image
 import torch
 
-base_model_path = "stabilityai/stable-diffusion-xl-base-1.0"
-adapter_path = "path to adapter"
-
-adapter = T2IAdapter.from_pretrained(adapter_path, torch_dtype=torch.float16)
-pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
-    base_model_path, adapter=adapter, torch_dtype=torch.float16
+adapter = T2IAdapter.from_pretrained("path/to/adapter", torch_dtype=torch.float16)
+pipeline = StableDiffusionXLAdapterPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", adapter=adapter, torch_dtype=torch.float16
 )
 
-# speed up diffusion process with faster scheduler and memory optimization
-pipe.scheduler = EulerAncestralDiscreteSchedulerTest.from_config(pipe.scheduler.config)
-# remove following line if xformers is not installed or when using Torch 2.0.
-pipe.enable_xformers_memory_efficient_attention()
-# memory optimization.
-pipe.enable_model_cpu_offload()
+pipeline.scheduler = EulerAncestralDiscreteScheduler.from_config(pipeline.scheduler.config)
+pipeline.enable_xformers_memory_efficient_attention()
+pipeline.enable_model_cpu_offload()
 
 control_image = load_image("./conditioning_image_1.png")
 prompt = "pale golden rod circle with old lace background"
 
-# generate image
 generator = torch.manual_seed(0)
-image = pipe(
-    prompt, num_inference_steps=20, generator=generator, image=control_image
+image = pipeline(
+    prompt, image=control_image, generator=generator
 ).images[0]
 image.save("./output.png")
 ```
 
-## Notes
+## Next steps
 
-### Specifying a better VAE
+Congratulations on training a T2I-Adapter model! πŸŽ‰ To learn more:
 
-SDXL's VAE is known to suffer from numerical instability issues. 
This is why we also expose a CLI argument namely `--pretrained_vae_model_name_or_path` that lets you specify the location of a better VAE (such as [this one](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix)). +- Read the [Efficient Controllable Generation for SDXL with T2I-Adapters](https://www.cs.cmu.edu/~custom-diffusion/) blog post to learn more details about the experimental results from the T2I-Adapter team. diff --git a/docs/source/en/training/text2image.md b/docs/source/en/training/text2image.md index 6aa39572ab34..9fa353ae3122 100644 --- a/docs/source/en/training/text2image.md +++ b/docs/source/en/training/text2image.md @@ -10,129 +10,167 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. --> - # Text-to-image -The text-to-image fine-tuning script is experimental. It's easy to overfit and run into issues like catastrophic forgetting. We recommend you explore different hyperparameters to get the best results on your dataset. +The text-to-image script is experimental, and it's easy to overfit and run into issues like catastrophic forgetting. Try exploring different hyperparameters to get the best results on your dataset. -Text-to-image models like Stable Diffusion generate an image from a text prompt. This guide will show you how to finetune the [`CompVis/stable-diffusion-v1-4`](https://huggingface.co/CompVis/stable-diffusion-v1-4) model on your own dataset with PyTorch and Flax. All the training scripts for text-to-image finetuning used in this guide can be found in this [repository](https://github.com/huggingface/diffusers/tree/main/examples/text_to_image) if you're interested in taking a closer look. +Text-to-image models like Stable Diffusion are conditioned to generate images given a text prompt. + +Training a model can be taxing on your hardware, but if you enable `gradient_checkpointing` and `mixed_precision`, it is possible to train a model on a single 24GB GPU. If you're training with larger batch sizes or want to train faster, it's better to use GPUs with more than 30GB of memory. You can reduce your memory footprint by enabling memory-efficient attention with [xFormers](../optimization/xformers). JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn't support gradient checkpointing, gradient accumulation or xFormers. A GPU with at least 30GB of memory or a TPU v3 is recommended for training with Flax. -Before running the scripts, make sure to install the library's training dependencies: +This guide will explore the [train_text_to_image.py](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) training script to help you become familiar with it, and how you can adapt it for your own use-case. + +Before running the script, make sure you install the library from source: ```bash -pip install git+https://github.com/huggingface/diffusers.git -pip install -U -r requirements.txt +git clone https://github.com/huggingface/diffusers +cd diffusers +pip install . 
``` -And initialize an [πŸ€— Accelerate](https://github.com/huggingface/accelerate/) environment with: +Then navigate to the example folder containing the training script and install the required dependencies for the script you're using: + + ```bash -accelerate config +cd examples/text_to_image +pip install -r requirements.txt ``` + + +```bash +cd examples/text_to_image +pip install -r requirements_flax.txt +``` + + -If you have already cloned the repo, then you won't need to go through these steps. Instead, you can pass the path to your local checkout to the training script and it will be loaded from there. + -## Hardware requirements +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. -Using `gradient_checkpointing` and `mixed_precision`, it should be possible to finetune the model on a single 24GB GPU. For higher `batch_size`'s and faster training, it's better to use GPUs with more than 30GB of GPU memory. You can also use JAX/Flax for fine-tuning on TPUs or GPUs, which will be covered [below](#flax-jax-finetuning). + -You can reduce your memory footprint even more by enabling memory efficient attention with xFormers. Make sure you have [xFormers installed](./optimization/xformers) and pass the `--enable_xformers_memory_efficient_attention` flag to the training script. +Initialize an πŸ€— Accelerate environment: -xFormers is not available for Flax. +```bash +accelerate config +``` -## Upload model to Hub +To setup a default πŸ€— Accelerate environment without choosing any configurations: + +```bash +accelerate config default +``` -Store your model on the Hub by adding the following argument to the training script: +Or if your environment doesn't support an interactive shell, like a notebook, you can use: ```bash - --push_to_hub +from accelerate.utils import write_basic_config + +write_basic_config() ``` -## Save and load checkpoints +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. + +## Script parameters + + + +The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) and let us know if you have any questions or concerns. + + -It is a good idea to regularly save checkpoints in case anything happens during training. To save a checkpoint, pass the following argument to the training script: +The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L193) function. This function provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like. 
+ +For example, to speedup training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command: ```bash - --checkpointing_steps=500 +accelerate launch train_text_to_image.py \ + --mixed_precision="fp16" ``` -Every 500 steps, the full training state is saved in a subfolder in the `output_dir`. The checkpoint has the format `checkpoint-` followed by the number of steps trained so far. For example, `checkpoint-1500` is a checkpoint saved after 1500 training steps. +Some basic and important parameters include: + +- `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model +- `--dataset_name`: the name of the dataset on the Hub or a local path to the dataset to train on +- `--image_column`: the name of the image column in the dataset to train on +- `--caption_column`: the name of the text column in the dataset to train on +- `--output_dir`: where to save the trained model +- `--push_to_hub`: whether to push the trained model to the Hub +- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful if for some reason training is interrupted, you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command -To load a checkpoint to resume training, pass the argument `--resume_from_checkpoint` to the training script and specify the checkpoint you want to resume from. For example, the following argument resumes training from the checkpoint saved after 1500 training steps: +### Min-SNR weighting + +The [Min-SNR](https://huggingface.co/papers/2303.09556) weighting strategy can help with training by rebalancing the loss to achieve faster convergence. The training script supports predicting `epsilon` (noise) or `v_prediction`, but Min-SNR is compatible with both prediction types. This weighting strategy is only supported by PyTorch and is unavailable in the Flax training script. + +Add the `--snr_gamma` parameter and set it to the recommended value of 5.0: ```bash - --resume_from_checkpoint="checkpoint-1500" +accelerate launch train_text_to_image.py \ + --snr_gamma=5.0 ``` -## Fine-tuning +You can compare the loss surfaces for different `snr_gamma` values in this [Weights and Biases](https://wandb.ai/sayakpaul/text2image-finetune-minsnr) report. For smaller datasets, the effects of Min-SNR may not be as obvious compared to larger datasets. - - -Launch the [PyTorch training script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image.py) for a fine-tuning run on the [PokΓ©mon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset like this. +## Training script -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. +The dataset preprocessing code and training loop are found in the [`main()`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L490) function. If you need to adapt the training script, this is where you'll need to make your changes. 
-```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export dataset_name="lambdalabs/pokemon-blip-captions" +The `train_text_to_image` script starts by [loading a scheduler](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L543) and tokenizer. You can choose to use a different scheduler here if you want: -accelerate launch --mixed_precision="fp16" train_text_to_image.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --dataset_name=$dataset_name \ - --use_ema \ - --resolution=512 --center_crop --random_flip \ - --train_batch_size=1 \ - --gradient_accumulation_steps=4 \ - --gradient_checkpointing \ - --max_train_steps=15000 \ - --learning_rate=1e-05 \ - --max_grad_norm=1 \ - --lr_scheduler="constant" --lr_warmup_steps=0 \ - --output_dir="sd-pokemon-model" \ - --push_to_hub +```py +noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler") +tokenizer = CLIPTokenizer.from_pretrained( + args.pretrained_model_name_or_path, subfolder="tokenizer", revision=args.revision +) ``` -To finetune on your own dataset, prepare the dataset according to the format required by πŸ€— [Datasets](https://huggingface.co/docs/datasets/index). You can [upload your dataset to the Hub](https://huggingface.co/docs/datasets/image_dataset#upload-dataset-to-the-hub), or you can [prepare a local folder with your files](https://huggingface.co/docs/datasets/image_dataset#imagefolder). +Then the script [loads the UNet](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L619) model: -Modify the script if you want to use custom loading logic. We left pointers in the code in the appropriate places to help you. πŸ€— The example script below shows how to finetune on a local dataset in `TRAIN_DIR` and where to save the model to in `OUTPUT_DIR`: +```py +load_model = UNet2DConditionModel.from_pretrained(input_dir, subfolder="unet") +model.register_to_config(**load_model.config) -```bash -export MODEL_NAME="CompVis/stable-diffusion-v1-4" -export TRAIN_DIR="path_to_your_dataset" -export OUTPUT_DIR="path_to_save_model" +model.load_state_dict(load_model.state_dict()) +``` -accelerate launch train_text_to_image.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --train_data_dir=$TRAIN_DIR \ - --use_ema \ - --resolution=512 --center_crop --random_flip \ - --train_batch_size=1 \ - --gradient_accumulation_steps=4 \ - --gradient_checkpointing \ - --mixed_precision="fp16" \ - --max_train_steps=15000 \ - --learning_rate=1e-05 \ - --max_grad_norm=1 \ - --lr_scheduler="constant" - --lr_warmup_steps=0 \ - --output_dir=${OUTPUT_DIR} \ - --push_to_hub +Next, the text and image columns of the dataset need to be preprocessed. The [`tokenize_captions`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L724) function handles tokenizing the inputs, and the [`train_transforms`](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L742) function specifies the type of transforms to apply to the image. 
Both of these functions are bundled into `preprocess_train`:
+
+```py
+def preprocess_train(examples):
+    images = [image.convert("RGB") for image in examples[image_column]]
+    examples["pixel_values"] = [train_transforms(image) for image in images]
+    examples["input_ids"] = tokenize_captions(examples)
+    return examples
 ```
-#### Training with multiple GPUs
+Lastly, the [training loop](https://github.com/huggingface/diffusers/blob/8959c5b9dec1c94d6ba482c94a58d2215c5fd026/examples/text_to_image/train_text_to_image.py#L878) handles everything else. It encodes images into latent space, adds noise to the latents, computes the text embeddings to condition on, updates the model parameters, and saves and pushes the model to the Hub. If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process.
+
+## Launch the script
+
+Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! πŸš€
-`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
-for running distributed training with `accelerate`. Here is an example command:
+
+
+
+Let's train on the [PokΓ©mon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset to generate your own PokΓ©mon. Set the environment variables `MODEL_NAME` and `dataset_name` to the model and the dataset (either from the Hub or a local path). If you're training on more than one GPU, add the `--multi_gpu` parameter to the `accelerate launch` command.
+
+
+
+To train on a local dataset, set the `TRAIN_DIR` and `OUTPUT_DIR` environment variables to the path of the dataset and where to save the model to.
+
+
 ```bash
-export MODEL_NAME="CompVis/stable-diffusion-v1-4"
+export MODEL_NAME="runwayml/stable-diffusion-v1-5"
 export dataset_name="lambdalabs/pokemon-blip-captions"
-accelerate launch --mixed_precision="fp16" --multi_gpu train_text_to_image.py \
+accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path=$MODEL_NAME \
  --dataset_name=$dataset_name \
  --use_ema \
@@ -140,28 +178,27 @@ accelerate launch --mixed_precision="fp16" --multi_gpu train_text_to_image.py \
  --train_batch_size=1 \
  --gradient_accumulation_steps=4 \
  --gradient_checkpointing \
-  --max_train_steps=15000 \
+  --max_train_steps=15000 \
  --learning_rate=1e-05 \
  --max_grad_norm=1 \
-  --lr_scheduler="constant" \
-  --lr_warmup_steps=0 \
+  --enable_xformers_memory_efficient_attention \
+  --lr_scheduler="constant" --lr_warmup_steps=0 \
  --output_dir="sd-pokemon-model" \
  --push_to_hub
 ```
-
-
-With Flax, it's possible to train a Stable Diffusion model faster on TPUs and GPUs thanks to [@duongna211](https://github.com/duongna21). This is very efficient on TPU hardware but works great on GPUs too. The Flax training script doesn't support features like gradient checkpointing or gradient accumulation yet, so you'll need a GPU with at least 30GB of memory or a TPU v3.
+
+
-Before running the script, make sure you have the requirements installed:
+Training with Flax can be faster on TPUs and GPUs thanks to [@duongna211](https://github.com/duongna21). Flax is more efficient on a TPU, but GPU performance is also great.
-```bash -pip install -U -r requirements_flax.txt -``` +Set the environment variables `MODEL_NAME` and `dataset_name` to the model and the dataset (either from the Hub or a local path). + + -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. +To train on a local dataset, set the `TRAIN_DIR` and `OUTPUT_DIR` environment variables to the path of the dataset and where to save the model to. -Now you can launch the [Flax training script](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/train_text_to_image_flax.py) like this: + ```bash export MODEL_NAME="runwayml/stable-diffusion-v1-5" @@ -179,82 +216,35 @@ python train_text_to_image_flax.py \ --push_to_hub ``` -To finetune on your own dataset, prepare the dataset according to the format required by πŸ€— [Datasets](https://huggingface.co/docs/datasets/index). You can [upload your dataset to the Hub](https://huggingface.co/docs/datasets/image_dataset#upload-dataset-to-the-hub), or you can [prepare a local folder with your files](https://huggingface.co/docs/datasets/image_dataset#imagefolder). + + -Modify the script if you want to use custom loading logic. We left pointers in the code in the appropriate places to help you. πŸ€— The example script below shows how to finetune on a local dataset in `TRAIN_DIR`: +Once training is complete, you can use your newly trained model for inference: -```bash -export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" -export TRAIN_DIR="path_to_your_dataset" + + -python train_text_to_image_flax.py \ - --pretrained_model_name_or_path=$MODEL_NAME \ - --train_data_dir=$TRAIN_DIR \ - --resolution=512 --center_crop --random_flip \ - --train_batch_size=1 \ - --mixed_precision="fp16" \ - --max_train_steps=15000 \ - --learning_rate=1e-05 \ - --max_grad_norm=1 \ - --output_dir="sd-pokemon-model" \ - --push_to_hub -``` - - - -## Training with Min-SNR weighting - -We support training with the Min-SNR weighting strategy proposed in [Efficient Diffusion Training via Min-SNR Weighting Strategy](https://arxiv.org/abs/2303.09556) which helps to achieve faster convergence -by rebalancing the loss. In order to use it, one needs to set the `--snr_gamma` argument. The recommended -value when using it is 5.0. - -You can find [this project on Weights and Biases](https://wandb.ai/sayakpaul/text2image-finetune-minsnr) that compares the loss surfaces of the following setups: - -* Training without the Min-SNR weighting strategy -* Training with the Min-SNR weighting strategy (`snr_gamma` set to 5.0) -* Training with the Min-SNR weighting strategy (`snr_gamma` set to 1.0) - -For our small Pokemons dataset, the effects of Min-SNR weighting strategy might not appear to be pronounced, but for larger datasets, we believe the effects will be more pronounced. - -Also, note that in this example, we either predict `epsilon` (i.e., the noise) or the `v_prediction`. For both of these cases, the formulation of the Min-SNR weighting strategy that we have used holds. - - - -Training with Min-SNR weighting strategy is only supported in PyTorch. - - - -## LoRA - -You can also use Low-Rank Adaptation of Large Language Models (LoRA), a fine-tuning technique for accelerating training large models, for fine-tuning text-to-image models. 
For more details, take a look at the [LoRA training](lora#text-to-image) guide. - -## Inference - -Now you can load the fine-tuned model for inference by passing the model path or model name on the Hub to the [`StableDiffusionPipeline`]: - - - -```python +```py from diffusers import StableDiffusionPipeline +import torch -model_path = "path_to_saved_model" -pipe = StableDiffusionPipeline.from_pretrained(model_path, torch_dtype=torch.float16, use_safetensors=True) -pipe.to("cuda") +pipeline = StableDiffusionPipeline.from_pretrained("path/to/saved_model", torch_dtype=torch.float16, use_safetensors=True).to("cuda") -image = pipe(prompt="yoda").images[0] +image = pipeline(prompt="yoda").images[0] image.save("yoda-pokemon.png") ``` - - -```python + + + + +```py import jax import numpy as np from flax.jax_utils import replicate from flax.training.common_utils import shard from diffusers import FlaxStableDiffusionPipeline -model_path = "path_to_saved_model" -pipe, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16) +pipeline, params = FlaxStableDiffusionPipeline.from_pretrained("path/to/saved_model", dtype=jax.numpy.bfloat16) prompt = "yoda pokemon" prng_seed = jax.random.PRNGKey(0) @@ -273,16 +263,13 @@ images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True). images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:]))) image.save("yoda-pokemon.png") ``` - - - - -## Stable Diffusion XL -* We support fine-tuning the UNet shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) via the `train_text_to_image_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md). -* We also support fine-tuning of the UNet and Text Encoder shipped in [Stable Diffusion XL](https://huggingface.co/papers/2307.01952) with LoRA via the `train_text_to_image_lora_sdxl.py` script. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/text_to_image/README_sdxl.md). + + +## Next steps -## Kandinsky 2.2 +Congratulations on training your own text-to-image model! To learn more about how to use your new model, the following guides may be helpful: -* We support fine-tuning both the decoder and prior in Kandinsky2.2 with the `train_text_to_image_prior.py` and `train_text_to_image_decoder.py` scripts. LoRA support is also included. Please refer to the docs [here](https://github.com/huggingface/diffusers/blob/main/examples/kandinsky2_2/text_to_image/README_sdxl.md). \ No newline at end of file +- Learn how to [load LoRA weights](../using-diffusers/loading_adapters#LoRA) for inference if you trained your model with LoRA. +- Learn more about how certain parameters like guidance scale or techniques such as prompt weighting can help you control inference in the [Text-to-image](../using-diffusers/conditional_image_generation) task guide. diff --git a/docs/source/en/training/text_inversion.md b/docs/source/en/training/text_inversion.md index 7cc7d57e7c6c..025dd457c55a 100644 --- a/docs/source/en/training/text_inversion.md +++ b/docs/source/en/training/text_inversion.md @@ -10,30 +10,50 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o specific language governing permissions and limitations under the License. 
--> +# Textual Inversion +[Textual Inversion](https://hf.co/papers/2208.01618) is a training technique for personalizing image generation models with just a few example images of what you want it to learn. This technique works by learning and updating the text embeddings (the new embeddings are tied to a special word you must use in the prompt) to match the example images you provide. -# Textual Inversion +If you're training on a GPU with limited vRAM, you should try enabling the `gradient_checkpointing` and `mixed_precision` parameters in the training command. You can also reduce your memory footprint by using memory-efficient attention with [xFormers](../optimization/xformers). JAX/Flax training is also supported for efficient training on TPUs and GPUs, but it doesn't support gradient checkpointing or xFormers. With the same configuration and setup as PyTorch, the Flax training script should be at least ~70% faster! -[Textual Inversion](https://arxiv.org/abs/2208.01618) is a technique for capturing novel concepts from a small number of example images. While the technique was originally demonstrated with a [latent diffusion model](https://github.com/CompVis/latent-diffusion), it has since been applied to other model variants like [Stable Diffusion](https://huggingface.co/docs/diffusers/main/en/conceptual/stable_diffusion). The learned concepts can be used to better control the images generated from text-to-image pipelines. It learns new "words" in the text encoder's embedding space, which are used within text prompts for personalized image generation. +This guide will explore the [textual_inversion.py](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py) script to help you become more familiar with it, and how you can adapt it for your own use-case. -![Textual Inversion example](https://textual-inversion.github.io/static/images/editing/colorful_teapot.JPG) -By using just 3-5 images you can teach new concepts to a model such as Stable Diffusion for personalized image generation (image source). +Before running the script, make sure you install the library from source: -This guide will show you how to train a [`runwayml/stable-diffusion-v1-5`](https://huggingface.co/runwayml/stable-diffusion-v1-5) model with Textual Inversion. All the training scripts for Textual Inversion used in this guide can be found [here](https://github.com/huggingface/diffusers/tree/main/examples/textual_inversion) if you're interested in taking a closer look at how things work under the hood. +```bash +git clone https://github.com/huggingface/diffusers +cd diffusers +pip install . +``` - +Navigate to the example folder with the training script and install the required dependencies for the script you're using: -There is a community-created collection of trained Textual Inversion models in the [Stable Diffusion Textual Inversion Concepts Library](https://huggingface.co/sd-concepts-library) which are readily available for inference. Over time, this'll hopefully grow into a useful resource as more concepts are added! 
+ + - +```bash +cd examples/textual_inversion +pip install -r requirements.txt +``` -Before you begin, make sure you install the library's training dependencies: + + ```bash -pip install diffusers accelerate transformers +cd examples/textual_inversion +pip install -r requirements_flax.txt ``` -After all the dependencies have been set up, initialize a [πŸ€—Accelerate](https://github.com/huggingface/accelerate/) environment with: + + + + + +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. + + + +Initialize an πŸ€— Accelerate environment: ```bash accelerate config @@ -45,7 +65,7 @@ To setup a default πŸ€— Accelerate environment without choosing any configuratio accelerate config default ``` -Or if your environment doesn't support an interactive shell like a notebook, you can use: +Or if your environment doesn't support an interactive shell, like a notebook, you can use: ```bash from accelerate.utils import write_basic_config @@ -53,33 +73,92 @@ from accelerate.utils import write_basic_config write_basic_config() ``` -Finally, you try and [install xFormers](https://huggingface.co/docs/diffusers/main/en/training/optimization/xformers) to reduce your memory footprint with xFormers memory-efficient attention. Once you have xFormers installed, add the `--enable_xformers_memory_efficient_attention` argument to the training script. xFormers is not supported for Flax. +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. + + + +The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py) and let us know if you have any questions or concerns. + + + +## Script parameters -## Upload model to Hub +The training script has many parameters to help you tailor the training run to your needs. All of the parameters and their descriptions are listed in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/839c2a5ece0af4e75530cb520d77bc7ed8acf474/examples/textual_inversion/textual_inversion.py#L176) function. Where applicable, Diffusers provides default values for each parameter such as the training batch size and learning rate, but feel free to change these values in the training command if you'd like. -If you want to store your model on the Hub, add the following argument to the training script: +For example, to increase the number of gradient accumulation steps above the default value of 1: ```bash ---push_to_hub +accelerate launch textual_inversion.py \ + --gradient_accumulation_steps=4 ``` -## Save and load checkpoints +Some other basic and important parameters to specify include: -It is often a good idea to regularly save checkpoints of your model during training. This way, you can resume training from a saved checkpoint if your training is interrupted for any reason. 
To save a checkpoint, pass the following argument to the training script to save the full training state in a subfolder in `output_dir` every 500 steps: +- `--pretrained_model_name_or_path`: the name of the model on the Hub or a local path to the pretrained model +- `--train_data_dir`: path to a folder containing the training dataset (example images) +- `--output_dir`: where to save the trained model +- `--push_to_hub`: whether to push the trained model to the Hub +- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful if for some reason training is interrupted, you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command +- `--num_vectors`: the number of vectors to learn the embeddings with; increasing this parameter helps the model learn better but it comes with increased training costs +- `--placeholder_token`: the special word to tie the learned embeddings to (you must use the word in your prompt for inference) +- `--initializer_token`: a single-word that roughly describes the object or style you're trying to train on +- `--learnable_property`: whether you're training the model to learn a new "style" (for example, Van Gogh's painting style) or "object" (for example, your dog) -```bash ---checkpointing_steps=500 +## Training script + +Unlike some of the other training scripts, textual_inversion.py has a custom dataset class, [`TextualInversionDataset`](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L487) for creating a dataset. You can customize the image size, placeholder token, interpolation method, whether to crop the image, and more. If you need to change how the dataset is created, you can modify `TextualInversionDataset`. + +Next, you'll find the dataset preprocessing code and training loop in the [`main()`](https://github.com/huggingface/diffusers/blob/839c2a5ece0af4e75530cb520d77bc7ed8acf474/examples/textual_inversion/textual_inversion.py#L573) function. 
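+
+As a point of reference, a heavily simplified dataset in the same spirit as `TextualInversionDataset` might look like the sketch below (illustrative only; the folder layout and prompt template are assumptions, and the real class also tokenizes the prompts):
+
+```py
+import os
+from PIL import Image
+from torch.utils.data import Dataset
+from torchvision import transforms
+
+class SimpleConceptDataset(Dataset):
+    """Illustrative stand-in for TextualInversionDataset: example images plus templated prompts."""
+
+    def __init__(self, data_root, placeholder_token, size=512):
+        # assumes data_root contains only image files
+        self.paths = [os.path.join(data_root, name) for name in os.listdir(data_root)]
+        self.placeholder_token = placeholder_token
+        self.transform = transforms.Compose(
+            [
+                transforms.Resize(size, interpolation=transforms.InterpolationMode.BICUBIC),
+                transforms.CenterCrop(size),
+                transforms.ToTensor(),
+                transforms.Normalize([0.5], [0.5]),
+            ]
+        )
+
+    def __len__(self):
+        return len(self.paths)
+
+    def __getitem__(self, index):
+        image = Image.open(self.paths[index]).convert("RGB")
+        return {
+            "pixel_values": self.transform(image),
+            "text": f"a photo of a {self.placeholder_token}",
+        }
+```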
+ +The script starts by loading the [tokenizer](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L616), [scheduler and model](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L622): + +```py +# Load tokenizer +if args.tokenizer_name: + tokenizer = CLIPTokenizer.from_pretrained(args.tokenizer_name) +elif args.pretrained_model_name_or_path: + tokenizer = CLIPTokenizer.from_pretrained(args.pretrained_model_name_or_path, subfolder="tokenizer") + +# Load scheduler and models +noise_scheduler = DDPMScheduler.from_pretrained(args.pretrained_model_name_or_path, subfolder="scheduler") +text_encoder = CLIPTextModel.from_pretrained( + args.pretrained_model_name_or_path, subfolder="text_encoder", revision=args.revision +) +vae = AutoencoderKL.from_pretrained(args.pretrained_model_name_or_path, subfolder="vae", revision=args.revision) +unet = UNet2DConditionModel.from_pretrained( + args.pretrained_model_name_or_path, subfolder="unet", revision=args.revision +) ``` -To resume training from a saved checkpoint, pass the following argument to the training script and the specific checkpoint you'd like to resume from: +The special [placeholder token](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L632) is added next to the tokenizer, and the embedding is readjusted to account for the new token. -```bash ---resume_from_checkpoint="checkpoint-1500" +Then, the script [creates a dataset](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L716) from the `TextualInversionDataset`: + +```py +train_dataset = TextualInversionDataset( + data_root=args.train_data_dir, + tokenizer=tokenizer, + size=args.resolution, + placeholder_token=(" ".join(tokenizer.convert_ids_to_tokens(placeholder_token_ids))), + repeats=args.repeats, + learnable_property=args.learnable_property, + center_crop=args.center_crop, + set="train", +) +train_dataloader = torch.utils.data.DataLoader( + train_dataset, batch_size=args.train_batch_size, shuffle=True, num_workers=args.dataloader_num_workers +) ``` -## Finetuning +Finally, the [training loop](https://github.com/huggingface/diffusers/blob/b81c69e489aad3a0ba73798c459a33990dc4379c/examples/textual_inversion/textual_inversion.py#L784) handles everything else from predicting the noisy residual to updating the embedding weights of the special placeholder token. + +If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process. -For your training dataset, download these [images of a cat toy](https://huggingface.co/datasets/diffusers/cat_toy_example) and store them in a directory. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. +## Launch the script + +Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! πŸš€ + +For this guide, you'll download some images of a [cat toy](https://huggingface.co/datasets/diffusers/cat_toy_example) and store them in a directory. 
But remember, you can create and use your own dataset if you want (see the [Create a dataset for training](create_dataset) guide). ```py from huggingface_hub import snapshot_download @@ -90,18 +169,29 @@ snapshot_download( ) ``` -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument, and the `DATA_DIR` environment variable to the path of the directory containing the images. +Set the environment variable `MODEL_NAME` to a model id on the Hub or a path to a local model, and `DATA_DIR` to the path where you just downloaded the cat images to. The script creates and saves the following files to your repository: -Now you can launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion.py). The script creates and saves the following files to your repository: `learned_embeds.bin`, `token_identifier.txt`, and `type_of_concept.txt`. +- `learned_embeds.bin`: the learned embedding vectors corresponding to your example images +- `token_identifier.txt`: the special placeholder token +- `type_of_concept.txt`: the type of concept you're training on (either "object" or "style") - + -πŸ’‘ A full training run takes ~1 hour on one V100 GPU. While you're waiting for the training to complete, feel free to check out [how Textual Inversion works](#how-it-works) in the section below if you're curious! +A full training run takes ~1 hour on a single V100 GPU. - - +One more thing before you launch the script. If you're interested in following along with the training process, you can periodically save generated images as training progresses. Add the following parameters to the training command: + +```bash +--validation_prompt="A train" +--num_validation_images=4 +--validation_steps=100 +``` + + + + ```bash export MODEL_NAME="runwayml/stable-diffusion-v1-5" export DATA_DIR="./cat" @@ -110,42 +200,22 @@ accelerate launch textual_inversion.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --train_data_dir=$DATA_DIR \ --learnable_property="object" \ - --placeholder_token="" --initializer_token="toy" \ + --placeholder_token="" \ + --initializer_token="toy" \ --resolution=512 \ --train_batch_size=1 \ --gradient_accumulation_steps=4 \ --max_train_steps=3000 \ - --learning_rate=5.0e-04 --scale_lr \ + --learning_rate=5.0e-04 \ + --scale_lr \ --lr_scheduler="constant" \ --lr_warmup_steps=0 \ --output_dir="textual_inversion_cat" \ --push_to_hub ``` - - -πŸ’‘ If you want to increase the trainable capacity, you can associate your placeholder token, *e.g.* `` to -multiple embedding vectors. This can help the model to better capture the style of more (complex) images. -To enable training multiple embedding vectors, simply pass: - -```bash ---num_vectors=5 -``` - - - - -If you have access to TPUs, try out the [Flax training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_flax.py) to train even faster (this'll also work for GPUs). With the same configuration settings, the Flax training script should be at least 70% faster than the PyTorch training script! 
⚑️ - -Before you begin, make sure you install the Flax specific dependencies: - -```bash -pip install -U -r requirements_flax.txt -``` - -Specify the `MODEL_NAME` environment variable (either a Hub model repository id or a path to the directory containing the model weights) and pass it to the [`pretrained_model_name_or_path`](https://huggingface.co/docs/diffusers/en/api/diffusion_pipeline#diffusers.DiffusionPipeline.from_pretrained.pretrained_model_name_or_path) argument. - -Then you can launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/textual_inversion/textual_inversion_flax.py): + + ```bash export MODEL_NAME="duongna/stable-diffusion-v1-4-flax" @@ -155,89 +225,41 @@ python textual_inversion_flax.py \ --pretrained_model_name_or_path=$MODEL_NAME \ --train_data_dir=$DATA_DIR \ --learnable_property="object" \ - --placeholder_token="" --initializer_token="toy" \ + --placeholder_token="" \ + --initializer_token="toy" \ --resolution=512 \ --train_batch_size=1 \ --max_train_steps=3000 \ - --learning_rate=5.0e-04 --scale_lr \ + --learning_rate=5.0e-04 \ + --scale_lr \ --output_dir="textual_inversion_cat" \ --push_to_hub ``` - - - -### Intermediate logging - -If you're interested in following along with your model training progress, you can save the generated images from the training process. Add the following arguments to the training script to enable intermediate logging: - -- `validation_prompt`, the prompt used to generate samples (this is set to `None` by default and intermediate logging is disabled) -- `num_validation_images`, the number of sample images to generate -- `validation_steps`, the number of steps before generating `num_validation_images` from the `validation_prompt` - -```bash ---validation_prompt="A backpack" ---num_validation_images=4 ---validation_steps=100 -``` - -## Inference - -Once you have trained a model, you can use it for inference with the [`StableDiffusionPipeline`]. -The textual inversion script will by default only save the textual inversion embedding vector(s) that have -been added to the text encoder embedding matrix and consequently been trained. - - - - + + -πŸ’‘ The community has created a large library of different textual inversion embedding vectors, called [sd-concepts-library](https://huggingface.co/sd-concepts-library). -Instead of training textual inversion embeddings from scratch you can also see whether a fitting textual inversion embedding has already been added to the library. +After training is complete, you can use your newly trained model for inference like: - + + -To load the textual inversion embeddings you first need to load the base model that was used when training -your textual inversion embedding vectors. Here we assume that [`runwayml/stable-diffusion-v1-5`](runwayml/stable-diffusion-v1-5) -was used as a base model so we load it first: -```python +```py from diffusers import StableDiffusionPipeline import torch -model_id = "runwayml/stable-diffusion-v1-5" -pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16, use_safetensors=True).to("cuda") -``` - -Next, we need to load the textual inversion embedding vector which can be done via the [`TextualInversionLoaderMixin.load_textual_inversion`] -function. Here we'll load the embeddings of the "" example from before. -```python -pipe.load_textual_inversion("sd-concepts-library/cat-toy") -``` - -Now we can run the pipeline making sure that the placeholder token `` is used in our prompt. 
- -```python -prompt = "A backpack" - -image = pipe(prompt, num_inference_steps=50).images[0] -image.save("cat-backpack.png") +pipeline = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda") +pipeline.load_textual_inversion("sd-concepts-library/cat-toy") +image = pipeline("A train", num_inference_steps=50).images[0] +image.save("cat-train.png") ``` -The function [`TextualInversionLoaderMixin.load_textual_inversion`] can not only -load textual embedding vectors saved in Diffusers' format, but also embedding vectors -saved in [Automatic1111](https://github.com/AUTOMATIC1111/stable-diffusion-webui) format. -To do so, you can first download an embedding vector from [civitAI](https://civitai.com/models/3036?modelVersionId=8387) -and then load it locally: -```python -pipe.load_textual_inversion("./charturnerv2.pt") -``` - - -Currently there is no `load_textual_inversion` function for Flax so one has to make sure the textual inversion -embedding vector is saved as part of the model after training. + + -The model can then be run just like any other Flax model: +Flax doesn't support the [`~loaders.TextualInversionLoaderMixin.load_textual_inversion`] method, but the textual_inversion_flax.py script [saves](https://github.com/huggingface/diffusers/blob/c0f058265161178f2a88849e92b37ffdc81f1dcc/examples/textual_inversion/textual_inversion_flax.py#L636C2-L636C2) the learned embeddings as a part of the model after training. This means you can use the model for inference like any other Flax model: -```python +```py import jax import numpy as np from flax.jax_utils import replicate @@ -247,7 +269,7 @@ from diffusers import FlaxStableDiffusionPipeline model_path = "path-to-your-trained-model" pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(model_path, dtype=jax.numpy.bfloat16) -prompt = "A backpack" +prompt = "A train" prng_seed = jax.random.PRNGKey(0) num_inference_steps = 50 @@ -262,16 +284,15 @@ prompt_ids = shard(prompt_ids) images = pipeline(prompt_ids, params, prng_seed, num_inference_steps, jit=True).images images = pipeline.numpy_to_pil(np.asarray(images.reshape((num_samples,) + images.shape[-3:]))) -image.save("cat-backpack.png") +image.save("cat-train.png") ``` - - -## How it works + + -![Diagram from the paper showing overview](https://textual-inversion.github.io/static/images/training/training.JPG) -Architecture overview from the Textual Inversion blog post. +## Next steps -Usually, text prompts are tokenized into an embedding before being passed to a model, which is often a transformer. Textual Inversion does something similar, but it learns a new token embedding, `v*`, from a special token `S*` in the diagram above. The model output is used to condition the diffusion model, which helps the diffusion model understand the prompt and new concepts from just a few example images. +Congratulations on training your own Textual Inversion model! πŸŽ‰ To learn more about how to use your new model, the following guides may be helpful: -To do this, Textual Inversion uses a generator model and noisy versions of the training images. The generator tries to predict less noisy versions of the images, and the token embedding `v*` is optimized based on how well the generator does. If the token embedding successfully captures the new concept, it gives more useful information to the diffusion model and helps create clearer images with less noise. 
This optimization process typically occurs after several thousand steps of exposure to a variety of prompt and image variants. +- Learn how to [load Textual Inversion embeddings](../using-diffusers/loading_adapters) and also use them as negative embeddings. +- Learn how to use [Textual Inversion](textual_inversion_inference) for inference with Stable Diffusion 1/2 and Stable Diffusion XL. \ No newline at end of file diff --git a/docs/source/en/training/unconditional_training.md b/docs/source/en/training/unconditional_training.md index 7a588cc4cc63..97b644883cae 100644 --- a/docs/source/en/training/unconditional_training.md +++ b/docs/source/en/training/unconditional_training.md @@ -12,25 +12,32 @@ specific language governing permissions and limitations under the License. # Unconditional image generation -Unconditional image generation is not conditioned on any text or images, unlike text- or image-to-image models. It only generates images that resemble its training data distribution. +Unconditional image generation models are not conditioned on text or images during training. It only generates images that resemble its training data distribution. - +This guide will explore the [train_unconditional.py](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py) training script to help you become familiar with it, and how you can adapt it for your own use-case. +Before running the script, make sure you install the library from source: -This guide will show you how to train an unconditional image generation model on existing datasets as well as your own custom dataset. All the training scripts for unconditional image generation can be found [here](https://github.com/huggingface/diffusers/tree/main/examples/unconditional_image_generation) if you're interested in learning more about the training details. +```bash +git clone https://github.com/huggingface/diffusers +cd diffusers +pip install . +``` -Before running the script, make sure you install the library's training dependencies: +Then navigate to the example folder containing the training script and install the required dependencies: ```bash -pip install diffusers[training] accelerate datasets +cd examples/unconditional_image_generation +pip install -r requirements.txt ``` -Next, initialize an πŸ€— [Accelerate](https://github.com/huggingface/accelerate/) environment with: + + +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. + + + +Initialize an πŸ€— Accelerate environment: ```bash accelerate config @@ -50,97 +57,151 @@ from accelerate.utils import write_basic_config write_basic_config() ``` -## Upload model to Hub +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. -You can upload your model on the Hub by adding the following argument to the training script: +## Script parameters -```bash ---push_to_hub -``` + + +The following sections highlight parts of the training script that are important for understanding how to modify it, but it doesn't cover every aspect of the script in detail. 
If you're interested in learning more, feel free to read through the [script](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py) and let us know if you have any questions or concerns. + + -## Save and load checkpoints +The training script provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L55) function. It provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like. -It is a good idea to regularly save checkpoints in case anything happens during training. To save a checkpoint, pass the following argument to the training script: +For example, to speedup training with mixed precision using the bf16 format, add the `--mixed_precision` parameter to the training command: ```bash ---checkpointing_steps=500 +accelerate launch train_unconditional.py \ + --mixed_precision="bf16" ``` -The full training state is saved in a subfolder in the `output_dir` every 500 steps, which allows you to load a checkpoint and resume training if you pass the `--resume_from_checkpoint` argument to the training script: +Some basic and important parameters to specify include: + +- `--dataset_name`: the name of the dataset on the Hub or a local path to the dataset to train on +- `--output_dir`: where to save the trained model +- `--push_to_hub`: whether to push the trained model to the Hub +- `--checkpointing_steps`: frequency of saving a checkpoint as the model trains; this is useful if training is interrupted, you can continue training from that checkpoint by adding `--resume_from_checkpoint` to your training command + +Bring your dataset, and let the training script handle everything else! + +## Training script + +The code for preprocessing the dataset and the training loop is found in the [`main()`](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L275) function. If you need to adapt the training script, this is where you'll need to make your changes. + +The `train_unconditional` script [initializes a `UNet2DModel`](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L356) if you don't provide a model configuration. 
You can configure the UNet here if you'd like: + +```py +model = UNet2DModel( + sample_size=args.resolution, + in_channels=3, + out_channels=3, + layers_per_block=2, + block_out_channels=(128, 128, 256, 256, 512, 512), + down_block_types=( + "DownBlock2D", + "DownBlock2D", + "DownBlock2D", + "DownBlock2D", + "AttnDownBlock2D", + "DownBlock2D", + ), + up_block_types=( + "UpBlock2D", + "AttnUpBlock2D", + "UpBlock2D", + "UpBlock2D", + "UpBlock2D", + "UpBlock2D", + ), +) +``` -```bash ---resume_from_checkpoint="checkpoint-1500" +Next, the script initializes a [scheduler](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L418) and [optimizer](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L429): + +```py +# Initialize the scheduler +accepts_prediction_type = "prediction_type" in set(inspect.signature(DDPMScheduler.__init__).parameters.keys()) +if accepts_prediction_type: + noise_scheduler = DDPMScheduler( + num_train_timesteps=args.ddpm_num_steps, + beta_schedule=args.ddpm_beta_schedule, + prediction_type=args.prediction_type, + ) +else: + noise_scheduler = DDPMScheduler(num_train_timesteps=args.ddpm_num_steps, beta_schedule=args.ddpm_beta_schedule) + +# Initialize the optimizer +optimizer = torch.optim.AdamW( + model.parameters(), + lr=args.learning_rate, + betas=(args.adam_beta1, args.adam_beta2), + weight_decay=args.adam_weight_decay, + eps=args.adam_epsilon, +) ``` -## Finetuning +Then it [loads a dataset](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L451) and you can specify how to [preprocess](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L455) it: -You're ready to launch the [training script](https://github.com/huggingface/diffusers/blob/main/examples/unconditional_image_generation/train_unconditional.py) now! Specify the dataset name to finetune on with the `--dataset_name` argument and then save it to the path in `--output_dir`. To use your own dataset, take a look at the [Create a dataset for training](create_dataset) guide. +```py +dataset = load_dataset("imagefolder", data_dir=args.train_data_dir, cache_dir=args.cache_dir, split="train") -The training script creates and saves a `diffusion_pytorch_model.bin` file in your repository. +augmentations = transforms.Compose( + [ + transforms.Resize(args.resolution, interpolation=transforms.InterpolationMode.BILINEAR), + transforms.CenterCrop(args.resolution) if args.center_crop else transforms.RandomCrop(args.resolution), + transforms.RandomHorizontalFlip() if args.random_flip else transforms.Lambda(lambda x: x), + transforms.ToTensor(), + transforms.Normalize([0.5], [0.5]), + ] +) +``` - +Finally, the [training loop](https://github.com/huggingface/diffusers/blob/096f84b05f9514fae9f185cbec0a4d38fbad9919/examples/unconditional_image_generation/train_unconditional.py#L540) handles everything else such as adding noise to the images, predicting the noise residual, calculating the loss, saving checkpoints at specified steps, and saving and pushing the model to the Hub. 
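+
+At its core, that loop follows the standard DDPM training pattern. A condensed, illustrative sketch, not the script's exact code, looks like this (it assumes `model`, `noise_scheduler`, `optimizer`, and `train_dataloader` are the objects created above, and that each batch is a dict with an `"input"` tensor of images):
+
+```py
+import torch
+import torch.nn.functional as F
+
+for batch in train_dataloader:
+    clean_images = batch["input"]
+    noise = torch.randn_like(clean_images)
+    timesteps = torch.randint(
+        0, noise_scheduler.config.num_train_timesteps, (clean_images.shape[0],), device=clean_images.device
+    ).long()
+    # forward diffusion: add noise to the clean images at the sampled timesteps
+    noisy_images = noise_scheduler.add_noise(clean_images, noise, timesteps)
+    # predict the noise residual and regress it against the true noise
+    noise_pred = model(noisy_images, timesteps).sample
+    loss = F.mse_loss(noise_pred, noise)
+    loss.backward()
+    optimizer.step()
+    optimizer.zero_grad()
+```
+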
If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process. + +## Launch the script + +Once you've made all your changes or you're okay with the default configuration, you're ready to launch the training script! πŸš€ -πŸ’‘ A full training run takes 2 hours on 4xV100 GPUs. + + +A full training run takes 2 hours on 4xV100 GPUs. -For example, to finetune on the [Oxford Flowers](https://huggingface.co/datasets/huggan/flowers-102-categories) dataset: + + ```bash accelerate launch train_unconditional.py \ --dataset_name="huggan/flowers-102-categories" \ - --resolution=64 \ --output_dir="ddpm-ema-flowers-64" \ - --train_batch_size=16 \ - --num_epochs=100 \ - --gradient_accumulation_steps=1 \ - --learning_rate=1e-4 \ - --lr_warmup_steps=500 \ - --mixed_precision=no \ + --mixed_precision="fp16" \ --push_to_hub ``` -
- -
+
+ -Or if you want to train your model on the [Pokemon](https://huggingface.co/datasets/huggan/pokemon) dataset: +If you're training with more than one GPU, add the `--multi_gpu` parameter to the training command: ```bash -accelerate launch train_unconditional.py \ - --dataset_name="huggan/pokemon" \ - --resolution=64 \ - --output_dir="ddpm-ema-pokemon-64" \ - --train_batch_size=16 \ - --num_epochs=100 \ - --gradient_accumulation_steps=1 \ - --learning_rate=1e-4 \ - --lr_warmup_steps=500 \ - --mixed_precision=no \ +accelerate launch --mixed_precision="fp16" --multi_gpu train_unconditional.py \ + --dataset_name="huggan/flowers-102-categories" \ + --output_dir="ddpm-ema-flowers-64" \ + --mixed_precision="fp16" \ --push_to_hub ``` -
- -
+
+
-### Training with multiple GPUs +The training script creates and saves a checkpoint file in your repository. Now you can load and use your trained model for inference: -`accelerate` allows for seamless multi-GPU training. Follow the instructions [here](https://huggingface.co/docs/accelerate/basic_tutorials/launch) -for running distributed training with `accelerate`. Here is an example command: +```py +from diffusers import DiffusionPipeline +import torch -```bash -accelerate launch --mixed_precision="fp16" --multi_gpu train_unconditional.py \ - --dataset_name="huggan/pokemon" \ - --resolution=64 --center_crop --random_flip \ - --output_dir="ddpm-ema-pokemon-64" \ - --train_batch_size=16 \ - --num_epochs=100 \ - --gradient_accumulation_steps=1 \ - --use_ema \ - --learning_rate=1e-4 \ - --lr_warmup_steps=500 \ - --mixed_precision="fp16" \ - --logger="wandb" \ - --push_to_hub -``` \ No newline at end of file +pipeline = DiffusionPipeline.from_pretrained("anton-l/ddpm-butterflies-128").to("cuda") +image = pipeline().images[0] +``` diff --git a/docs/source/en/training/wuerstchen.md b/docs/source/en/training/wuerstchen.md new file mode 100644 index 000000000000..9f04c8556a75 --- /dev/null +++ b/docs/source/en/training/wuerstchen.md @@ -0,0 +1,189 @@ + + +# Wuerstchen + +The [Wuerstchen](https://hf.co/papers/2306.00637) model drastically reduces computational costs by compressing the latent space by 42x, without compromising image quality and accelerating inference. During training, Wuerstchen uses two models (VQGAN + autoencoder) to compress the latents, and then a third model (text-conditioned latent diffusion model) is conditioned on this highly compressed space to generate an image. + +To fit the prior model into GPU memory and to speedup training, try enabling `gradient_accumulation_steps`, `gradient_checkpointing`, and `mixed_precision` respectively. + +This guide explores the [train_text_to_image_prior.py](https://github.com/huggingface/diffusers/blob/main/examples/wuerstchen/text_to_image/train_text_to_image_prior.py) script to help you become more familiar with it, and how you can adapt it for your own use-case. + +Before running the script, make sure you install the library from source: + +```bash +git clone https://github.com/huggingface/diffusers +cd diffusers +pip install . +``` + +Then navigate to the example folder containing the training script and install the required dependencies for the script you're using: + +```bash +cd examples/wuerstchen/text_to_image +pip install -r requirements.txt +``` + + + +πŸ€— Accelerate is a library for helping you train on multiple GPUs/TPUs or with mixed-precision. It'll automatically configure your training setup based on your hardware and environment. Take a look at the πŸ€— Accelerate [Quick tour](https://huggingface.co/docs/accelerate/quicktour) to learn more. + + + +Initialize an πŸ€— Accelerate environment: + +```bash +accelerate config +``` + +To setup a default πŸ€— Accelerate environment without choosing any configurations: + +```bash +accelerate config default +``` + +Or if your environment doesn't support an interactive shell, like a notebook, you can use: + +```bash +from accelerate.utils import write_basic_config + +write_basic_config() +``` + +Lastly, if you want to train a model on your own dataset, take a look at the [Create a dataset for training](create_dataset) guide to learn how to create a dataset that works with the training script. 
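+
+For example, πŸ€— Datasets can turn a local folder of images into a dataset with the `imagefolder` builder (a minimal sketch; the folder path below is hypothetical, and captions can optionally be supplied through a `metadata.jsonl` file):
+
+```py
+from datasets import load_dataset
+
+# "./my_images" is a hypothetical local folder containing the training images
+dataset = load_dataset("imagefolder", data_dir="./my_images", split="train")
+print(dataset[0]["image"])
+```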
+ + + +The following sections highlight parts of the training scripts that are important for understanding how to modify it, but it doesn't cover every aspect of the [script](https://github.com/huggingface/diffusers/blob/main/examples/wuerstchen/text_to_image/train_text_to_image_prior.py) in detail. If you're interested in learning more, feel free to read through the scripts and let us know if you have any questions or concerns. + + + +## Script parameters + +The training scripts provides many parameters to help you customize your training run. All of the parameters and their descriptions are found in the [`parse_args()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L192) function. It provides default values for each parameter, such as the training batch size and learning rate, but you can also set your own values in the training command if you'd like. + +For example, to speedup training with mixed precision using the fp16 format, add the `--mixed_precision` parameter to the training command: + +```bash +accelerate launch train_text_to_image_prior.py \ + --mixed_precision="fp16" +``` + +Most of the parameters are identical to the parameters in the [Text-to-image](text2image#script-parameters) training guide, so let's dive right into the Wuerstchen training script! + +## Training script + +The training script is also similar to the [Text-to-image](text2image#training-script) training guide, but it's been modified to support Wuerstchen. This guide focuses on the code that is unique to the Wuerstchen training script. + +The [`main()`](https://github.com/huggingface/diffusers/blob/6e68c71503682c8693cb5b06a4da4911dfd655ee/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L441) function starts by initializing the image encoder - an [EfficientNet](https://github.com/huggingface/diffusers/blob/main/examples/wuerstchen/text_to_image/modeling_efficient_net_encoder.py) - in addition to the usual scheduler and tokenizer. + +```py +with ContextManagers(deepspeed_zero_init_disabled_context_manager()): + pretrained_checkpoint_file = hf_hub_download("dome272/wuerstchen", filename="model_v2_stage_b.pt") + state_dict = torch.load(pretrained_checkpoint_file, map_location="cpu") + image_encoder = EfficientNetEncoder() + image_encoder.load_state_dict(state_dict["effnet_state_dict"]) + image_encoder.eval() +``` + +You'll also load the [`WuerstchenPrior`] model for optimization. 
+ +```py +prior = WuerstchenPrior.from_pretrained(args.pretrained_prior_model_name_or_path, subfolder="prior") + +optimizer = optimizer_cls( + prior.parameters(), + lr=args.learning_rate, + betas=(args.adam_beta1, args.adam_beta2), + weight_decay=args.adam_weight_decay, + eps=args.adam_epsilon, +) +``` + +Next, you'll apply some [transforms](https://github.com/huggingface/diffusers/blob/65ef7a0c5c594b4f84092e328fbdd73183613b30/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L656) to the images and [tokenize](https://github.com/huggingface/diffusers/blob/65ef7a0c5c594b4f84092e328fbdd73183613b30/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L637) the captions: + +```py +def preprocess_train(examples): + images = [image.convert("RGB") for image in examples[image_column]] + examples["effnet_pixel_values"] = [effnet_transforms(image) for image in images] + examples["text_input_ids"], examples["text_mask"] = tokenize_captions(examples) + return examples +``` + +Finally, the [training loop](https://github.com/huggingface/diffusers/blob/65ef7a0c5c594b4f84092e328fbdd73183613b30/examples/wuerstchen/text_to_image/train_text_to_image_prior.py#L656) handles compressing the images to latent space with the `EfficientNetEncoder`, adding noise to the latents, and predicting the noise residual with the [`WuerstchenPrior`] model. + +```py +pred_noise = prior(noisy_latents, timesteps, prompt_embeds) +``` + +If you want to learn more about how the training loop works, check out the [Understanding pipelines, models and schedulers](../using-diffusers/write_own_pipeline) tutorial which breaks down the basic pattern of the denoising process. + +## Launch the script + +Once you’ve made all your changes or you’re okay with the default configuration, you’re ready to launch the training script! πŸš€ + +Set the `DATASET_NAME` environment variable to the dataset name from the Hub. This guide uses the [PokΓ©mon BLIP captions](https://huggingface.co/datasets/lambdalabs/pokemon-blip-captions) dataset, but you can create and train on your own datasets as well (see the [Create a dataset for training](create_dataset) guide). + + + +To monitor training progress with Weights & Biases, add the `--report_to=wandb` parameter to the training command. You’ll also need to add the `--validation_prompt` to the training command to keep track of results. This can be really useful for debugging the model and viewing intermediate results. + + + +```bash +export DATASET_NAME="lambdalabs/pokemon-blip-captions" + +accelerate launch train_text_to_image_prior.py \ + --mixed_precision="fp16" \ + --dataset_name=$DATASET_NAME \ + --resolution=768 \ + --train_batch_size=4 \ + --gradient_accumulation_steps=4 \ + --gradient_checkpointing \ + --dataloader_num_workers=4 \ + --max_train_steps=15000 \ + --learning_rate=1e-05 \ + --max_grad_norm=1 \ + --checkpoints_total_limit=3 \ + --lr_scheduler="constant" \ + --lr_warmup_steps=0 \ + --validation_prompts="A robot pokemon, 4k photo" \ + --report_to="wandb" \ + --push_to_hub \ + --output_dir="wuerstchen-prior-pokemon-model" +``` + +Once training is complete, you can use your newly trained model for inference! 
+ +```py +import torch +from diffusers import AutoPipelineForText2Image +from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS + +pipeline = AutoPipelineForText2Image.from_pretrained("path/to/saved/model", torch_dtype=torch.float16).to("cuda") + +caption = "A cute bird pokemon holding a shield" +images = pipeline( + caption, + width=1024, + height=1536, + prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS, + prior_guidance_scale=4.0, + num_images_per_prompt=2, +).images +``` + +## Next steps + +Congratulations on training a Wuerstchen model! To learn more about how to use your new model, the following may be helpful: + +- Take a look at the [Wuerstchen](../api/pipelines/wuerstchen#text-to-image-generation) API documentation to learn more about how to use the pipeline for text-to-image generation and its limitations. diff --git a/docs/source/en/tutorials/autopipeline.md b/docs/source/en/tutorials/autopipeline.md index fcc6f5300eab..4f17760e8bc0 100644 --- a/docs/source/en/tutorials/autopipeline.md +++ b/docs/source/en/tutorials/autopipeline.md @@ -50,7 +50,7 @@ Under the hood, [`AutoPipelineForText2Image`]: 1. automatically detects a `"stable-diffusion"` class from the [`model_index.json`](https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/model_index.json) file 2. loads the corresponding text-to-image [`StableDiffusionPipeline`] based on the `"stable-diffusion"` class name -Likewise, for image-to-image, [`AutoPipelineForImage2Image`] detects a `"stable-diffusion"` checkpoint from the `model_index.json` file and it'll load the corresponding [`StableDiffusionImg2ImgPipeline`] behind the scenes. You can also pass any additional arguments specific to the pipeline class such as `strength`, which determines the amount of noise or variation added to an input image: +Likewise, for image-to-image, [`AutoPipelineForImage2Image`] detects a `"stable-diffusion"` checkpoint from the `model_index.json` file and it'll load the corresponding [`StableDiffusionImg2ImgPipeline`] behind the scenes. You can also pass any additional arguments specific to the pipeline class such as `strength`, which determines the amount of noise or variation added to an input image: ```py from diffusers import AutoPipelineForImage2Image diff --git a/docs/source/en/tutorials/basic_training.md b/docs/source/en/tutorials/basic_training.md index 3b545cdf572e..c9ce315af41f 100644 --- a/docs/source/en/tutorials/basic_training.md +++ b/docs/source/en/tutorials/basic_training.md @@ -321,13 +321,13 @@ Now you can wrap all these components together in a training loop with πŸ€— Acce ... for step, batch in enumerate(train_dataloader): ... clean_images = batch["images"] ... # Sample noise to add to the images -... noise = torch.randn(clean_images.shape).to(clean_images.device) +... noise = torch.randn(clean_images.shape, device=clean_images.device) ... bs = clean_images.shape[0] ... # Sample a random timestep for each image ... timesteps = torch.randint( ... 0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device -... ).long() +... ) ... # Add noise to the clean images according to the noise magnitude at each timestep ... 
# (this is the forward diffusion process) diff --git a/docs/source/en/tutorials/tutorial_overview.md b/docs/source/en/tutorials/tutorial_overview.md index 85c30256ec89..ee7a49e43851 100644 --- a/docs/source/en/tutorials/tutorial_overview.md +++ b/docs/source/en/tutorials/tutorial_overview.md @@ -12,7 +12,7 @@ specific language governing permissions and limitations under the License. # Overview -Welcome to 🧨 Diffusers! If you're new to diffusion models and generative AI, and want to learn more, then you've come to the right place. These beginner-friendly tutorials are designed to provide a gentle introduction to diffusion models and help you understand the library fundamentals - the core components and how 🧨 Diffusers is meant to be used. +Welcome to 🧨 Diffusers! If you're new to diffusion models and generative AI, and want to learn more, then you've come to the right place. These beginner-friendly tutorials are designed to provide a gentle introduction to diffusion models and help you understand the library fundamentals - the core components and how 🧨 Diffusers is meant to be used. You'll learn how to use a pipeline for inference to rapidly generate things, and then deconstruct that pipeline to really understand how to use the library as a modular toolbox for building your own diffusion systems. In the next lesson, you'll learn how to train your own diffusion model to generate what you want. diff --git a/docs/source/en/tutorials/using_peft_for_inference.md b/docs/source/en/tutorials/using_peft_for_inference.md index da69b712a989..6f317a7610b2 100644 --- a/docs/source/en/tutorials/using_peft_for_inference.md +++ b/docs/source/en/tutorials/using_peft_for_inference.md @@ -58,7 +58,7 @@ image ``` ![toy-face](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_8_1.png) - + With the `adapter_name` parameter, it is really easy to use another adapter for inference! Load the [nerijs/pixel-art-xl](https://huggingface.co/nerijs/pixel-art-xl) adapter that has been fine-tuned to generate pixel art images, and let's call it `"pixel"`. @@ -80,7 +80,7 @@ image ``` ![pixel-art](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_12_1.png) - + ## Combine multiple adapters You can also perform multi-adapter inference where you combine different adapter checkpoints for inference. @@ -112,7 +112,7 @@ image ``` ![toy-face-pixel-art](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/peft_integration/diffusers_peft_lora_inference_16_1.png) - + Impressive! As you can see, the model was able to generate an image that mixes the characteristics of both adapters. 
If you want to go back to using only one adapter, use the [`~diffusers.loaders.UNet2DConditionLoadersMixin.set_adapters`] method to activate the `"toy"` adapter: diff --git a/docs/source/en/using-diffusers/conditional_image_generation.md b/docs/source/en/using-diffusers/conditional_image_generation.md index 9832f53cffe6..eaca038d59fe 100644 --- a/docs/source/en/using-diffusers/conditional_image_generation.md +++ b/docs/source/en/using-diffusers/conditional_image_generation.md @@ -226,7 +226,7 @@ pipeline = AutoPipelineForText2Image.from_pretrained( "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16 ).to("cuda") image = pipeline( - prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", + prompt="Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", negative_prompt="ugly, deformed, disfigured, poor details, bad anatomy", ).images[0] image @@ -258,7 +258,7 @@ pipeline = AutoPipelineForText2Image.from_pretrained( ).to("cuda") generator = torch.Generator(device="cuda").manual_seed(30) image = pipeline( - "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", + "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k", generator=generator, ).images[0] image diff --git a/docs/source/en/using-diffusers/contribute_pipeline.md b/docs/source/en/using-diffusers/contribute_pipeline.md index 15b4b20ab34a..ea0ec51721f2 100644 --- a/docs/source/en/using-diffusers/contribute_pipeline.md +++ b/docs/source/en/using-diffusers/contribute_pipeline.md @@ -30,7 +30,6 @@ You should start by creating a `one_step_unet.py` file for your community pipeli from diffusers import DiffusionPipeline import torch - class UnetSchedulerOneForwardPipeline(DiffusionPipeline): def __init__(self, unet, scheduler): super().__init__() @@ -49,7 +48,7 @@ To ensure your pipeline and its components (`unet` and `scheduler`) can be saved + self.register_modules(unet=unet, scheduler=scheduler) ``` -Cool, the `__init__` step is done and you can move to the forward pass now! πŸ”₯ +Cool, the `__init__` step is done and you can move to the forward pass now! πŸ”₯ ## Define the forward pass @@ -59,7 +58,6 @@ In the forward pass, which we recommend defining as `__call__`, you have complet from diffusers import DiffusionPipeline import torch - class UnetSchedulerOneForwardPipeline(DiffusionPipeline): def __init__(self, unet, scheduler): super().__init__() @@ -150,12 +148,12 @@ Sometimes you can't load all the pipeline components weights from an official re ```python from diffusers import DiffusionPipeline -from transformers import CLIPFeatureExtractor, CLIPModel +from transformers import CLIPImageProcessor, CLIPModel model_id = "CompVis/stable-diffusion-v1-4" clip_model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K" -feature_extractor = CLIPFeatureExtractor.from_pretrained(clip_model_id) +feature_extractor = CLIPImageProcessor.from_pretrained(clip_model_id) clip_model = CLIPModel.from_pretrained(clip_model_id, torch_dtype=torch.float16) pipeline = DiffusionPipeline.from_pretrained( @@ -172,7 +170,7 @@ pipeline = DiffusionPipeline.from_pretrained( The magic behind community pipelines is contained in the following code. It allows the community pipeline to be loaded from GitHub or the Hub, and it'll be available to all 🧨 Diffusers packages. ```python -# 2. Load the pipeline class, if using custom module then load it from the hub +# 2. 
Load the pipeline class, if using custom module then load it from the Hub # if we load from explicit class, let's use it if custom_pipeline is not None: pipeline_class = get_class_from_dynamic_module( diff --git a/docs/source/en/using-diffusers/controlnet.md b/docs/source/en/using-diffusers/controlnet.md index 71fd3c7a307e..c50d2e96e8ed 100644 --- a/docs/source/en/using-diffusers/controlnet.md +++ b/docs/source/en/using-diffusers/controlnet.md @@ -16,7 +16,7 @@ ControlNet is a type of model for controlling image diffusion models by conditio -Check out Section 3.5 of the [ControlNet](https://huggingface.co/papers/2302.05543) paper for a list of ControlNet implementations on various conditioning inputs. You can find the official Stable Diffusion ControlNet conditioned models on [lllyasviel](https://huggingface.co/lllyasviel)'s Hub profile, and more [community-trained](https://huggingface.co/models?other=stable-diffusion&other=controlnet) ones on the Hub. +Check out Section 3.5 of the [ControlNet](https://huggingface.co/papers/2302.05543) paper v1 for a list of ControlNet implementations on various conditioning inputs. You can find the official Stable Diffusion ControlNet conditioned models on [lllyasviel](https://huggingface.co/lllyasviel)'s Hub profile, and more [community-trained](https://huggingface.co/models?other=stable-diffusion&other=controlnet) ones on the Hub. For Stable Diffusion XL (SDXL) ControlNet models, you can find them on the πŸ€— [Diffusers](https://huggingface.co/diffusers) Hub organization, or you can browse [community-trained](https://huggingface.co/models?other=stable-diffusion-xl&other=controlnet) ones on the Hub. @@ -35,7 +35,7 @@ Before you begin, make sure you have the following libraries installed: ```py # uncomment to install the necessary libraries in Colab -#!pip install diffusers transformers accelerate safetensors opencv-python +#!pip install -q diffusers transformers accelerate opencv-python ``` ## Text-to-image @@ -45,17 +45,16 @@ For text-to-image, you normally pass a text prompt to the model. But with Contro Load an image and use the [opencv-python](https://github.com/opencv/opencv-python) library to extract the canny image: ```py -from diffusers import StableDiffusionControlNetPipeline -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid from PIL import Image import cv2 import numpy as np -image = load_image( +original_image = load_image( "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" ) -image = np.array(image) +image = np.array(original_image) low_threshold = 100 high_threshold = 200 @@ -86,7 +85,7 @@ import torch controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16, use_safetensors=True) pipe = StableDiffusionControlNetPipeline.from_pretrained( "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True -).to("cuda") +) pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) pipe.enable_model_cpu_offload() @@ -98,6 +97,7 @@ Now pass your prompt and canny image to the pipeline: output = pipe( "the mona lisa", image=canny_image ).images[0] +make_image_grid([original_image, canny_image, output], rows=1, cols=3) ```
@@ -117,12 +117,11 @@ import torch import numpy as np from transformers import pipeline -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-img2img.jpg" -).resize((768, 768)) - +) def get_depth_map(image, depth_estimator): image = depth_estimator(image)["depth"] @@ -146,7 +145,7 @@ import torch controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16, use_safetensors=True) pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained( "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True -).to("cuda") +) pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) pipe.enable_model_cpu_offload() @@ -158,6 +157,7 @@ Now pass your prompt, initial image, and depth map to the pipeline: output = pipe( "lego batman and robin", image=image, control_image=depth_map, ).images[0] +make_image_grid([image, output], rows=1, cols=2) ```
@@ -171,18 +171,14 @@ output = pipe(
- ## Inpainting -For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with. ControlNet models allow you to add another control image to condition a model with. Let’s condition the model with a canny image, a white outline of an image on a black background. This way, the ControlNet can use the canny image as a control to guide the model to generate an image with the same outline. +For inpainting, you need an initial image, a mask image, and a prompt describing what to replace the mask with. ControlNet models allow you to add another control image to condition a model with. Let’s condition the model with an inpainting mask. This way, the ControlNet can use the inpainting mask as a control to guide the model to generate an image within the mask area. Load an initial image and a mask image: ```py -from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler -from diffusers.utils import load_image -import numpy as np -import torch +from diffusers.utils import load_image, make_image_grid init_image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint.jpg" @@ -193,11 +189,15 @@ mask_image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/controlnet-inpaint-mask.jpg" ) mask_image = mask_image.resize((512, 512)) +make_image_grid([init_image, mask_image], rows=1, cols=2) ``` Create a function to prepare the control image from the initial and mask images. This'll create a tensor to mark the pixels in `init_image` as masked if the corresponding pixel in `mask_image` is over a certain threshold. ```py +import numpy as np +import torch + def make_inpaint_condition(image, image_mask): image = np.array(image.convert("RGB")).astype(np.float32) / 255.0 image_mask = np.array(image_mask.convert("L")).astype(np.float32) / 255.0 @@ -226,12 +226,11 @@ Load a ControlNet model conditioned on inpainting and pass it to the [`StableDif ```py from diffusers import StableDiffusionControlNetInpaintPipeline, ControlNetModel, UniPCMultistepScheduler -import torch controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpaint", torch_dtype=torch.float16, use_safetensors=True) pipe = StableDiffusionControlNetInpaintPipeline.from_pretrained( "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, use_safetensors=True -).to("cuda") +) pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config) pipe.enable_model_cpu_offload() @@ -248,6 +247,7 @@ output = pipe( mask_image=mask_image, control_image=control_image, ).images[0] +make_image_grid([init_image, mask_image, output], rows=1, cols=3) ```
@@ -270,14 +270,29 @@ Set `guess_mode=True` in the pipeline, and it is [recommended](https://github.co ```py from diffusers import StableDiffusionControlNetPipeline, ControlNetModel +from diffusers.utils import load_image, make_image_grid +import numpy as np import torch +from PIL import Image +import cv2 controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", use_safetensors=True) -pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True).to( - "cuda" -) +pipe = StableDiffusionControlNetPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", controlnet=controlnet, use_safetensors=True).to("cuda") + +original_image = load_image("https://huggingface.co/takuma104/controlnet_dev/resolve/main/bird_512x512.png") + +image = np.array(original_image) + +low_threshold = 100 +high_threshold = 200 + +image = cv2.Canny(image, low_threshold, high_threshold) +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image) + image = pipe("", image=canny_image, guess_mode=True, guidance_scale=3.0).images[0] -image +make_image_grid([original_image, canny_image, image], rows=1, cols=3) ```
@@ -293,22 +308,23 @@ image ## ControlNet with Stable Diffusion XL -There aren't too many ControlNet models compatible with Stable Diffusion XL (SDXL) at the moment, but we've trained two full-sized ControlNet models for SDXL conditioned on canny edge detection and depth maps. We're also experimenting with creating smaller versions of these SDXL-compatible ControlNet models so it is easier to run on resource-constrained hardware. You can find these checkpoints on the πŸ€— [Diffusers](https://huggingface.co/diffusers) Hub organization! +There aren't too many ControlNet models compatible with Stable Diffusion XL (SDXL) at the moment, but we've trained two full-sized ControlNet models for SDXL conditioned on canny edge detection and depth maps. We're also experimenting with creating smaller versions of these SDXL-compatible ControlNet models so it is easier to run on resource-constrained hardware. You can find these checkpoints on the [πŸ€— Diffusers Hub organization](https://huggingface.co/diffusers)! Let's use a SDXL ControlNet conditioned on canny images to generate an image. Start by loading an image and prepare the canny image: ```py from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid from PIL import Image import cv2 import numpy as np +import torch -image = load_image( +original_image = load_image( "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png" ) -image = np.array(image) +image = np.array(original_image) low_threshold = 100 high_threshold = 200 @@ -317,7 +333,7 @@ image = cv2.Canny(image, low_threshold, high_threshold) image = image[:, :, None] image = np.concatenate([image, image, image], axis=2) canny_image = Image.fromarray(image) -canny_image +make_image_grid([original_image, canny_image], rows=1, cols=2) ```
@@ -362,13 +378,13 @@ The [`controlnet_conditioning_scale`](https://huggingface.co/docs/diffusers/main prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting" negative_prompt = 'low quality, bad quality, sketches' -images = pipe( +image = pipe( prompt, negative_prompt=negative_prompt, image=canny_image, controlnet_conditioning_scale=0.5, ).images[0] -images +make_image_grid([original_image, canny_image, image], rows=1, cols=3) ```
@@ -379,17 +395,16 @@ You can use [`StableDiffusionXLControlNetPipeline`] in guess mode as well by set ```py from diffusers import StableDiffusionXLControlNetPipeline, ControlNetModel, AutoencoderKL -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid import numpy as np import torch - import cv2 from PIL import Image prompt = "aerial view, a futuristic research complex in a bright foggy jungle, hard lighting" negative_prompt = "low quality, bad quality, sketches" -image = load_image( +original_image = load_image( "https://hf.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd_controlnet/hf-logo.png" ) @@ -402,15 +417,16 @@ pipe = StableDiffusionXLControlNetPipeline.from_pretrained( ) pipe.enable_model_cpu_offload() -image = np.array(image) +image = np.array(original_image) image = cv2.Canny(image, 100, 200) image = image[:, :, None] image = np.concatenate([image, image, image], axis=2) canny_image = Image.fromarray(image) image = pipe( - prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True, + prompt, negative_prompt=negative_prompt, controlnet_conditioning_scale=0.5, image=canny_image, guess_mode=True, ).images[0] +make_image_grid([original_image, canny_image, image], rows=1, cols=3) ``` ### MultiControlNet @@ -431,29 +447,30 @@ In this example, you'll combine a canny image and a human pose estimation image Prepare the canny image conditioning: ```py -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid from PIL import Image import numpy as np import cv2 -canny_image = load_image( +original_image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/landscape.png" ) -canny_image = np.array(canny_image) +image = np.array(original_image) low_threshold = 100 high_threshold = 200 -canny_image = cv2.Canny(canny_image, low_threshold, high_threshold) +image = cv2.Canny(image, low_threshold, high_threshold) # zero out middle columns of image where pose will be overlaid -zero_start = canny_image.shape[1] // 4 -zero_end = zero_start + canny_image.shape[1] // 2 -canny_image[:, zero_start:zero_end] = 0 +zero_start = image.shape[1] // 4 +zero_end = zero_start + image.shape[1] // 2 +image[:, zero_start:zero_end] = 0 -canny_image = canny_image[:, :, None] -canny_image = np.concatenate([canny_image, canny_image, canny_image], axis=2) -canny_image = Image.fromarray(canny_image).resize((1024, 1024)) +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image) +make_image_grid([original_image, canny_image], rows=1, cols=2) ```
@@ -467,18 +484,24 @@ canny_image = Image.fromarray(canny_image).resize((1024, 1024))
+For human pose estimation, install [controlnet_aux](https://github.com/patrickvonplaten/controlnet_aux): + +```py +# uncomment to install the necessary library in Colab +#!pip install -q controlnet-aux +``` + Prepare the human pose estimation conditioning: ```py from controlnet_aux import OpenposeDetector -from diffusers.utils import load_image openpose = OpenposeDetector.from_pretrained("lllyasviel/ControlNet") - -openpose_image = load_image( +original_image = load_image( "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/person.png" ) -openpose_image = openpose(openpose_image).resize((1024, 1024)) +openpose_image = openpose(original_image) +make_image_grid([original_image, openpose_image], rows=1, cols=2) ```
@@ -500,7 +523,7 @@ import torch controlnets = [ ControlNetModel.from_pretrained( - "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True + "thibaud/controlnet-openpose-sdxl-1.0", torch_dtype=torch.float16 ), ControlNetModel.from_pretrained( "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16, use_safetensors=True @@ -523,7 +546,7 @@ negative_prompt = "monochrome, lowres, bad anatomy, worst quality, low quality" generator = torch.manual_seed(1) -images = [openpose_image, canny_image] +images = [openpose_image.resize((1024, 1024)), canny_image.resize((1024, 1024))] images = pipe( prompt, @@ -533,9 +556,11 @@ images = pipe( negative_prompt=negative_prompt, num_images_per_prompt=3, controlnet_conditioning_scale=[1.0, 0.8], -).images[0] +).images +make_image_grid([original_image, canny_image, openpose_image, + images[0].resize((512, 512)), images[1].resize((512, 512)), images[2].resize((512, 512))], rows=2, cols=3) ```
-
\ No newline at end of file +
diff --git a/docs/source/en/using-diffusers/custom_pipeline_examples.md b/docs/source/en/using-diffusers/custom_pipeline_examples.md index 555292568349..e0d3182f3e8a 100644 --- a/docs/source/en/using-diffusers/custom_pipeline_examples.md +++ b/docs/source/en/using-diffusers/custom_pipeline_examples.md @@ -25,6 +25,8 @@ Community pipelines allow you to get creative and build your own unique pipeline To load a community pipeline, use the `custom_pipeline` argument in [`DiffusionPipeline`] to specify one of the files in [diffusers/examples/community](https://github.com/huggingface/diffusers/tree/main/examples/community): ```py +from diffusers import DiffusionPipeline + pipe = DiffusionPipeline.from_pretrained( "CompVis/stable-diffusion-v1-4", custom_pipeline="filename_in_the_community_folder", use_safetensors=True ) @@ -39,7 +41,6 @@ You can learn more about community pipelines in the how to [load community pipel The multilingual Stable Diffusion pipeline uses a pretrained [XLM-RoBERTa](https://huggingface.co/papluca/xlm-roberta-base-language-detection) to identify a language and the [mBART-large-50](https://huggingface.co/facebook/mbart-large-50-many-to-one-mmt) model to handle the translation. This allows you to generate images from text in 20 languages. ```py -from PIL import Image import torch from diffusers import DiffusionPipeline from diffusers.utils import make_image_grid @@ -59,29 +60,28 @@ language_detection_pipeline = pipeline("text-classification", device=device_dict[device]) # add model for language translation -trans_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt") -trans_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt").to(device) +translation_tokenizer = MBart50TokenizerFast.from_pretrained("facebook/mbart-large-50-many-to-one-mmt") +translation_model = MBartForConditionalGeneration.from_pretrained("facebook/mbart-large-50-many-to-one-mmt").to(device) diffuser_pipeline = DiffusionPipeline.from_pretrained( "CompVis/stable-diffusion-v1-4", custom_pipeline="multilingual_stable_diffusion", detection_pipeline=language_detection_pipeline, - translation_model=trans_model, - translation_tokenizer=trans_tokenizer, + translation_model=translation_model, + translation_tokenizer=translation_tokenizer, torch_dtype=torch.float16, ) diffuser_pipeline.enable_attention_slicing() diffuser_pipeline = diffuser_pipeline.to(device) -prompt = ["a photograph of an astronaut riding a horse", +prompt = ["a photograph of an astronaut riding a horse", "Una casa en la playa", "Ein Hund, der Orange isst", "Un restaurant parisien"] images = diffuser_pipeline(prompt).images -grid = make_image_grid(images, rows=2, cols=2) -grid +make_image_grid(images, rows=2, cols=2) ```
@@ -94,26 +94,26 @@ grid ```py from diffusers import DiffusionPipeline, DDIMScheduler -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid pipeline = DiffusionPipeline.from_pretrained( "CompVis/stable-diffusion-v1-4", custom_pipeline="magic_mix", - scheduler = DDIMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler"), + scheduler=DDIMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler"), ).to('cuda') img = load_image("https://user-images.githubusercontent.com/59410571/209578593-141467c7-d831-4792-8b9a-b17dc5e47816.jpg") -mix_img = pipeline(img, prompt="bed", kmin = 0.3, kmax = 0.5, mix_factor = 0.5) -mix_img +mix_img = pipeline(img, prompt="bed", kmin=0.3, kmax=0.5, mix_factor=0.5) +make_image_grid([img, mix_img], rows=1, cols=2) ```
-
image prompt
+
original image
image and text prompt mix
-
\ No newline at end of file +
diff --git a/docs/source/en/using-diffusers/custom_pipeline_overview.md b/docs/source/en/using-diffusers/custom_pipeline_overview.md index 10627d3163d8..f898bd0dc205 100644 --- a/docs/source/en/using-diffusers/custom_pipeline_overview.md +++ b/docs/source/en/using-diffusers/custom_pipeline_overview.md @@ -117,10 +117,10 @@ from pipeline_t2v_base_pixel import TextToVideoIFPipeline import torch pipeline = TextToVideoIFPipeline( - unet=unet, - text_encoder=text_encoder, - tokenizer=tokenizer, - scheduler=scheduler, + unet=unet, + text_encoder=text_encoder, + tokenizer=tokenizer, + scheduler=scheduler, feature_extractor=feature_extractor ) pipeline = pipeline.to(device="cuda") diff --git a/docs/source/en/using-diffusers/diffedit.md b/docs/source/en/using-diffusers/diffedit.md index 1c4a347e7396..1c3793177ce1 100644 --- a/docs/source/en/using-diffusers/diffedit.md +++ b/docs/source/en/using-diffusers/diffedit.md @@ -26,7 +26,7 @@ Before you begin, make sure you have the following libraries installed: ```py # uncomment to install the necessary libraries in Colab -#!pip install diffusers transformers accelerate safetensors +#!pip install -q diffusers transformers accelerate ``` The [`StableDiffusionDiffEditPipeline`] requires an image mask and a set of partially inverted latents. The image mask is generated from the [`~StableDiffusionDiffEditPipeline.generate_mask`] function, and includes two parameters, `source_prompt` and `target_prompt`. These parameters determine what to edit in the image. For example, if you want to change a bowl of *fruits* to a bowl of *pears*, then: @@ -59,15 +59,18 @@ pipeline.enable_vae_slicing() Load the image to edit: ```py -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png" -raw_image = load_image(img_url).convert("RGB").resize((768, 768)) +raw_image = load_image(img_url).resize((768, 768)) +raw_image ``` Use the [`~StableDiffusionDiffEditPipeline.generate_mask`] function to generate the image mask. You'll need to pass it the `source_prompt` and `target_prompt` to specify what to edit in the image: ```py +from PIL import Image + source_prompt = "a bowl of fruits" target_prompt = "a basket of pears" mask_image = pipeline.generate_mask( @@ -75,6 +78,7 @@ mask_image = pipeline.generate_mask( source_prompt=source_prompt, target_prompt=target_prompt, ) +Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768)) ``` Next, create the inverted latents and pass it a caption describing the image: @@ -86,13 +90,14 @@ inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents Finally, pass the image mask and inverted latents to the pipeline. The `target_prompt` becomes the `prompt` now, and the `source_prompt` is used as the `negative_prompt`: ```py -image = pipeline( +output_image = pipeline( prompt=target_prompt, mask_image=mask_image, image_latents=inv_latents, negative_prompt=source_prompt, ).images[0] -image.save("edited_image.png") +mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768)) +make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3) ```
@@ -116,8 +121,8 @@ Load the Flan-T5 model and tokenizer from the πŸ€— Transformers library: import torch from transformers import AutoTokenizer, T5ForConditionalGeneration -tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl") -model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-xl", device_map="auto", torch_dtype=torch.float16) +tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large") +model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16) ``` Provide some initial text to prompt the model to generate the source and target prompts. @@ -136,7 +141,7 @@ target_text = f"Provide a caption for images containing a {target_concept}. " Next, create a utility function to generate the prompts: ```py -@torch.no_grad +@torch.no_grad() def generate_prompts(input_prompt): input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda") @@ -160,12 +165,12 @@ Check out the [generation strategy](https://huggingface.co/docs/transformers/mai Load the text encoder model used by the [`StableDiffusionDiffEditPipeline`] to encode the text. You'll use the text encoder to compute the text embeddings: ```py -import torch -from diffusers import StableDiffusionDiffEditPipeline +import torch +from diffusers import StableDiffusionDiffEditPipeline pipeline = StableDiffusionDiffEditPipeline.from_pretrained( "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() pipeline.enable_vae_slicing() @@ -193,33 +198,39 @@ Finally, pass the embeddings to the [`~StableDiffusionDiffEditPipeline.generate_ ```diff from diffusers import DDIMInverseScheduler, DDIMScheduler - from diffusers.utils import load_image + from diffusers.utils import load_image, make_image_grid + from PIL import Image pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config) pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config) img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png" - raw_image = load_image(img_url).convert("RGB").resize((768, 768)) - + raw_image = load_image(img_url).resize((768, 768)) mask_image = pipeline.generate_mask( image=raw_image, +- source_prompt=source_prompt, +- target_prompt=target_prompt, + source_prompt_embeds=source_embeds, + target_prompt_embeds=target_embeds, ) inv_latents = pipeline.invert( +- prompt=source_prompt, + prompt_embeds=source_embeds, image=raw_image, ).latents - images = pipeline( + output_image = pipeline( mask_image=mask_image, image_latents=inv_latents, +- prompt=target_prompt, +- negative_prompt=source_prompt, + prompt_embeds=target_embeds, + negative_prompt_embeds=source_embeds, - ).images - images[0].save("edited_image.png") + ).images[0] + mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L") + make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3) ``` ## Generate a caption for inversion @@ -260,7 +271,7 @@ Load an input image and generate a caption for it using the `generate_caption` f from diffusers.utils import load_image img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png" -raw_image = load_image(img_url).convert("RGB").resize((768, 768)) +raw_image = load_image(img_url).resize((768, 768)) caption = generate_caption(raw_image, model, processor) ``` @@ -271,4 +282,4 @@ caption = generate_caption(raw_image, model, processor)
-Now you can drop the caption into the [`~StableDiffusionDiffEditPipeline.invert`] function to generate the partially inverted latents! \ No newline at end of file +Now you can drop the caption into the [`~StableDiffusionDiffEditPipeline.invert`] function to generate the partially inverted latents! diff --git a/docs/source/en/using-diffusers/freeu.md b/docs/source/en/using-diffusers/freeu.md index c5f3577ae3aa..6e8f5773cd75 100644 --- a/docs/source/en/using-diffusers/freeu.md +++ b/docs/source/en/using-diffusers/freeu.md @@ -14,12 +14,12 @@ specific language governing permissions and limitations under the License. [[open-in-colab]] -The UNet is responsible for denoising during the reverse diffusion process, and there are two distinct features in its architecture: +The UNet is responsible for denoising during the reverse diffusion process, and there are two distinct features in its architecture: 1. Backbone features primarily contribute to the denoising process 2. Skip features mainly introduce high-frequency features into the decoder module and can make the network overlook the semantics in the backbone features -However, the skip connection can sometimes introduce unnatural image details. [FreeU](https://hf.co/papers/2309.11497) is a technique for improving image quality by rebalancing the contributions from the UNet’s skip connections and backbone feature maps. +However, the skip connection can sometimes introduce unnatural image details. [FreeU](https://hf.co/papers/2309.11497) is a technique for improving image quality by rebalancing the contributions from the UNet’s skip connections and backbone feature maps. FreeU is applied during inference and it does not require any additional training. The technique works for different tasks such as text-to-image, image-to-image, and text-to-video. 
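To make this concrete, here is a minimal sketch of enabling FreeU on a Stable Diffusion v1-5 pipeline. The prompt and the scale values passed to `enable_freeu` are illustrative starting points rather than tuned settings; each checkpoint benefits from its own values, as the sections below show.

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, safety_checker=None
).to("cuda")

# Rebalance the backbone (b1, b2) and skip-connection (s1, s2) contributions.
# These values are illustrative and should be tuned per checkpoint.
pipeline.enable_freeu(s1=0.9, s2=0.2, b1=1.2, b2=1.4)

prompt = "A squirrel eating a burger"
image = pipeline(prompt, generator=torch.manual_seed(0)).images[0]

# FreeU can be turned off again with pipeline.disable_freeu()
```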
@@ -27,11 +27,11 @@ In this guide, you will apply FreeU to the [`StableDiffusionPipeline`], [`Stable ## StableDiffusionPipeline -Load the pipeline: +Load the pipeline: ```py from diffusers import DiffusionPipeline -import torch +import torch pipeline = DiffusionPipeline.from_pretrained( "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, safety_checker=None @@ -70,7 +70,7 @@ Let's see how Stable Diffusion 2 results are impacted: ```py from diffusers import DiffusionPipeline -import torch +import torch pipeline = DiffusionPipeline.from_pretrained( "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, safety_checker=None @@ -92,7 +92,7 @@ Finally, let's take a look at how FreeU affects Stable Diffusion XL results: ```py from diffusers import DiffusionPipeline -import torch +import torch pipeline = DiffusionPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, diff --git a/docs/source/en/using-diffusers/img2img.md b/docs/source/en/using-diffusers/img2img.md index 5caba021f39e..6014d87b7906 100644 --- a/docs/source/en/using-diffusers/img2img.md +++ b/docs/source/en/using-diffusers/img2img.md @@ -27,7 +27,7 @@ from diffusers.utils import load_image, make_image_grid pipeline = AutoPipelineForImage2Image.from_pretrained( "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -79,7 +79,7 @@ from diffusers.utils import make_image_grid, load_image pipeline = AutoPipelineForImage2Image.from_pretrained( "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -117,7 +117,7 @@ from diffusers.utils import make_image_grid, load_image pipeline = AutoPipelineForImage2Image.from_pretrained( "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -157,7 +157,7 @@ from diffusers.utils import make_image_grid, load_image pipeline = AutoPipelineForImage2Image.from_pretrained( "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -204,7 +204,7 @@ from diffusers.utils import make_image_grid, load_image pipeline = AutoPipelineForImage2Image.from_pretrained( "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -248,7 +248,7 @@ from diffusers.utils import make_image_grid, load_image pipeline = AutoPipelineForImage2Image.from_pretrained( "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", 
use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -290,7 +290,7 @@ from diffusers.utils import make_image_grid, load_image pipeline = AutoPipelineForImage2Image.from_pretrained( "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -335,7 +335,7 @@ from diffusers.utils import make_image_grid pipeline = AutoPipelineForText2Image.from_pretrained( "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -349,7 +349,7 @@ Now you can pass this generated image to the image-to-image pipeline: ```py pipeline = AutoPipelineForImage2Image.from_pretrained( "kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16, use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -371,7 +371,7 @@ from diffusers.utils import make_image_grid, load_image pipeline = AutoPipelineForImage2Image.from_pretrained( "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -397,7 +397,7 @@ Pass the latent output from this pipeline to the next pipeline to generate an im ```py pipeline = AutoPipelineForImage2Image.from_pretrained( "ogkalu/Comic-Diffusion", torch_dtype=torch.float16 -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -411,7 +411,7 @@ Repeat one more time to generate the final image in a [pixel art style](https:// ```py pipeline = AutoPipelineForImage2Image.from_pretrained( "kohbanye/pixel-art-style", torch_dtype=torch.float16 -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -434,7 +434,7 @@ from diffusers.utils import make_image_grid, load_image pipeline = AutoPipelineForImage2Image.from_pretrained( "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -462,7 +462,7 @@ from diffusers import StableDiffusionLatentUpscalePipeline upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained( "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) upscaler.enable_model_cpu_offload() 
upscaler.enable_xformers_memory_efficient_attention() @@ -476,7 +476,7 @@ from diffusers import StableDiffusionUpscalePipeline super_res = StableDiffusionUpscalePipeline.from_pretrained( "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) super_res.enable_model_cpu_offload() super_res.enable_xformers_memory_efficient_attention() @@ -500,7 +500,7 @@ import torch pipeline = AutoPipelineForImage2Image.from_pretrained( "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -537,7 +537,7 @@ import torch controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11f1p_sd15_depth", torch_dtype=torch.float16, variant="fp16", use_safetensors=True) pipeline = AutoPipelineForImage2Image.from_pretrained( "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -571,7 +571,7 @@ Let's apply a new [style](https://huggingface.co/nitrosocke/elden-ring-diffusion ```py pipeline = AutoPipelineForImage2Image.from_pretrained( "nitrosocke/elden-ring-diffusion", torch_dtype=torch.float16, -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() diff --git a/docs/source/en/using-diffusers/inference_with_lcm.md b/docs/source/en/using-diffusers/inference_with_lcm.md new file mode 100644 index 000000000000..36b3c6c810fc --- /dev/null +++ b/docs/source/en/using-diffusers/inference_with_lcm.md @@ -0,0 +1,274 @@ + + +[[open-in-colab]] + +# Latent Consistency Model + +Latent Consistency Models (LCM) enable quality image generation in typically 2-4 steps making it possible to use diffusion models in almost real-time settings. + +From the [official website](https://latent-consistency-models.github.io/): + +> LCMs can be distilled from any pre-trained Stable Diffusion (SD) in only 4,000 training steps (~32 A100 GPU Hours) for generating high quality 768 x 768 resolution images in 2~4 steps or even one step, significantly accelerating text-to-image generation. We employ LCM to distill the Dreamshaper-V7 version of SD in just 4,000 training iterations. + +For a more technical overview of LCMs, refer to [the paper](https://huggingface.co/papers/2310.04378). + +LCM distilled models are available for [stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), and the [SSD-1B](https://huggingface.co/segmind/SSD-1B) model. All the checkpoints can be found in this [collection](https://huggingface.co/collections/latent-consistency/latent-consistency-models-weights-654ce61a95edd6dffccef6a8). + +This guide shows how to perform inference with LCMs for +- text-to-image +- image-to-image +- combined with style LoRAs +- ControlNet/T2I-Adapter + +## Text-to-image + +You'll use the [`StableDiffusionXLPipeline`] pipeline with the [`LCMScheduler`] and then load the LCM-LoRA. 
Together with the LCM-LoRA and the scheduler, the pipeline enables a fast inference workflow, overcoming the slow iterative nature of diffusion models. + +```python +from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler +import torch + +unet = UNet2DConditionModel.from_pretrained( + "latent-consistency/lcm-sdxl", + torch_dtype=torch.float16, + variant="fp16", +) +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16", +).to("cuda") +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k" + +generator = torch.manual_seed(0) +image = pipe( + prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0 +).images[0] +``` + +![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_full_sdxl_t2i.png) + +Notice that we use only 4 steps for generation which is way less than what's typically used for standard SDXL. + +Some details to keep in mind: + +* To perform classifier-free guidance, batch size is usually doubled inside the pipeline. LCM, however, applies guidance using guidance embeddings, so the batch size does not have to be doubled in this case. This leads to a faster inference time, with the drawback that negative prompts don't have any effect on the denoising process. +* The UNet was trained using the [3., 13.] guidance scale range. So, that is the ideal range for `guidance_scale`. However, disabling `guidance_scale` using a value of 1.0 is also effective in most cases. + + +## Image-to-image + +LCMs can be applied to image-to-image tasks too. For this example, we'll use the [LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) model, but the same steps can be applied to other LCM models as well. + +```python +import torch +from diffusers import AutoPipelineForImage2Image, UNet2DConditionModel, LCMScheduler +from diffusers.utils import make_image_grid, load_image + +unet = UNet2DConditionModel.from_pretrained( + "SimianLuo/LCM_Dreamshaper_v7", + subfolder="unet", + torch_dtype=torch.float16, +) + +pipe = AutoPipelineForImage2Image.from_pretrained( + "Lykon/dreamshaper-7", + unet=unet, + torch_dtype=torch.float16, + variant="fp16", +).to("cuda") +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +# prepare image +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png" +init_image = load_image(url) +prompt = "Astronauts in a jungle, cold color palette, muted colors, detailed, 8k" + +# pass prompt and image to pipeline +generator = torch.manual_seed(0) +image = pipe( + prompt, + image=init_image, + num_inference_steps=4, + guidance_scale=7.5, + strength=0.5, + generator=generator +).images[0] +make_image_grid([init_image, image], rows=1, cols=2) +``` + +![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_full_sdv1-5_i2i.png) + + + + +You can get different results based on your prompt and the image you provide. To get the best results, we recommend trying different values for `num_inference_steps`, `strength`, and `guidance_scale` parameters and choose the best one. + + + + +## Combine with style LoRAs + +LCMs can be used with other styled LoRAs to generate styled-images in very few steps (4-8). In the following example, we'll use the [papercut LoRA](TheLastBen/Papercut_SDXL). 
+ +```python +from diffusers import StableDiffusionXLPipeline, UNet2DConditionModel, LCMScheduler +import torch + +unet = UNet2DConditionModel.from_pretrained( + "latent-consistency/lcm-sdxl", + torch_dtype=torch.float16, + variant="fp16", +) +pipe = StableDiffusionXLPipeline.from_pretrained( + "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16, variant="fp16", +).to("cuda") +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut") + +prompt = "papercut, a cute fox" + +generator = torch.manual_seed(0) +image = pipe( + prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0 +).images[0] +image +``` + +![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_full_sdx_lora_mix.png) + + +## ControlNet/T2I-Adapter + +Let's look at how we can perform inference with ControlNet/T2I-Adapter and a LCM. + +### ControlNet +For this example, we'll use the [LCM_Dreamshaper_v7](https://huggingface.co/SimianLuo/LCM_Dreamshaper_v7) model with canny ControlNet, but the same steps can be applied to other LCM models as well. + +```python +import torch +import cv2 +import numpy as np +from PIL import Image + +from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, LCMScheduler +from diffusers.utils import load_image, make_image_grid + +image = load_image( + "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png" +).resize((512, 512)) + +image = np.array(image) + +low_threshold = 100 +high_threshold = 200 + +image = cv2.Canny(image, low_threshold, high_threshold) +image = image[:, :, None] +image = np.concatenate([image, image, image], axis=2) +canny_image = Image.fromarray(image) + +controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16) +pipe = StableDiffusionControlNetPipeline.from_pretrained( + "SimianLuo/LCM_Dreamshaper_v7", + controlnet=controlnet, + torch_dtype=torch.float16, + safety_checker=None, +).to("cuda") + +# set scheduler +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +generator = torch.manual_seed(0) +image = pipe( + "the mona lisa", + image=canny_image, + num_inference_steps=4, + generator=generator, +).images[0] +make_image_grid([canny_image, image], rows=1, cols=2) +``` + +![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_full_sdv1-5_controlnet.png) + + + +The inference parameters in this example might not work for all examples, so we recommend trying different values for the `num_inference_steps`, `guidance_scale`, `controlnet_conditioning_scale`, and `cross_attention_kwargs` parameters and choosing the best one. + + +### T2I-Adapter + +This example shows how to use the `lcm-sdxl` with the [Canny T2I-Adapter](TencentARC/t2i-adapter-canny-sdxl-1.0). 
+
+```python
+import torch
+import cv2
+import numpy as np
+from PIL import Image
+
+from diffusers import StableDiffusionXLAdapterPipeline, UNet2DConditionModel, T2IAdapter, LCMScheduler
+from diffusers.utils import load_image, make_image_grid
+
+# Prepare image
+# Detect the canny map in low resolution to avoid high-frequency details
+image = load_image(
+    "https://huggingface.co/Adapter/t2iadapter/resolve/main/figs_SDXLV1.0/org_canny.jpg"
+).resize((384, 384))
+
+image = np.array(image)
+
+low_threshold = 100
+high_threshold = 200
+
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image).resize((1024, 1216))
+
+# load adapter
+adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16, variant="fp16").to("cuda")
+
+unet = UNet2DConditionModel.from_pretrained(
+    "latent-consistency/lcm-sdxl",
+    torch_dtype=torch.float16,
+    variant="fp16",
+)
+pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    unet=unet,
+    adapter=adapter,
+    torch_dtype=torch.float16,
+    variant="fp16",
+).to("cuda")
+
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+prompt = "Mystical fairy in real, magic, 4k picture, high quality"
+negative_prompt = "extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured"
+
+generator = torch.manual_seed(0)
+image = pipe(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    image=canny_image,
+    num_inference_steps=4,
+    guidance_scale=5,
+    adapter_conditioning_scale=0.8,
+    adapter_conditioning_factor=1,
+    generator=generator,
+).images[0]
+grid = make_image_grid([canny_image, image], rows=1, cols=2)
+```
+
+![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_full_sdxl_t2iadapter.png)
diff --git a/docs/source/en/using-diffusers/inference_with_lcm_lora.md b/docs/source/en/using-diffusers/inference_with_lcm_lora.md
new file mode 100644
index 000000000000..554e5fda2c2a
--- /dev/null
+++ b/docs/source/en/using-diffusers/inference_with_lcm_lora.md
@@ -0,0 +1,422 @@
+
+
+[[open-in-colab]]
+
+# Performing inference with LCM-LoRA
+
+Latent Consistency Models (LCM) enable quality image generation in typically 2-4 steps, making it possible to use diffusion models in almost real-time settings.
+
+From the [official website](https://latent-consistency-models.github.io/):
+
+> LCMs can be distilled from any pre-trained Stable Diffusion (SD) in only 4,000 training steps (~32 A100 GPU Hours) for generating high quality 768 x 768 resolution images in 2~4 steps or even one step, significantly accelerating text-to-image generation. We employ LCM to distill the Dreamshaper-V7 version of SD in just 4,000 training iterations.
+
+For a more technical overview of LCMs, refer to [the paper](https://huggingface.co/papers/2310.04378).
+
+However, each model needs to be distilled separately for latent consistency distillation. The core idea with LCM-LoRA is to train just a few adapter layers, the adapter being LoRA in this case.
+This way, we don't have to train the full model and can keep the number of trainable parameters manageable. The resulting LoRAs can then be applied to any fine-tuned version of the model without distilling them separately.
+Additionally, the LoRAs can be applied to image-to-image, ControlNet/T2I-Adapter, inpainting, AnimateDiff, etc.
+The LCM-LoRA can also be combined with other LoRAs to generate styled images in very few steps (4-8).
+
+LCM-LoRAs are available for [stable-diffusion-v1-5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0), and the [SSD-1B](https://huggingface.co/segmind/SSD-1B) model. All the checkpoints can be found in this [collection](https://huggingface.co/collections/latent-consistency/latent-consistency-models-loras-654cdd24e111e16f0865fba6).
+
+For more details about LCM-LoRA, refer to [the technical report](https://huggingface.co/papers/2311.05556).
+
+This guide shows how to perform inference with LCM-LoRAs for
+- text-to-image
+- image-to-image
+- combined with styled LoRAs
+- ControlNet/T2I-Adapter
+- inpainting
+- AnimateDiff
+
+Before going through this guide, we'll take a look at the general workflow for performing inference with LCM-LoRAs.
+LCM-LoRAs are similar to other Stable Diffusion LoRAs, so they can be used with any [`DiffusionPipeline`] that supports LoRAs.
+
+- Load the task-specific pipeline and model.
+- Set the scheduler to [`LCMScheduler`].
+- Load the LCM-LoRA weights for the model.
+- Reduce the `guidance_scale` to a value between [1.0, 2.0] and set the `num_inference_steps` between [4, 8].
+- Perform inference with the pipeline using the usual parameters.
+
+Let's look at how we can perform inference with LCM-LoRAs for different tasks.
+
+First, make sure you have [peft](https://github.com/huggingface/peft) installed for better LoRA support.
+
+```bash
+pip install -U peft
+```
+
+## Text-to-image
+
+You'll use the [`StableDiffusionXLPipeline`] with the [`LCMScheduler`] and then load the LCM-LoRA. Together with the LCM-LoRA and the scheduler, the pipeline enables a fast inference workflow, overcoming the slow iterative nature of diffusion models.
+
+```python
+import torch
+from diffusers import DiffusionPipeline, LCMScheduler
+
+pipe = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    variant="fp16",
+    torch_dtype=torch.float16
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
+
+prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"
+
+generator = torch.manual_seed(42)
+image = pipe(
+    prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=1.0
+).images[0]
+```
+
+![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_sdxl_t2i.png)
+
+Notice that we use only 4 steps for generation, which is far fewer than what's typically used for standard SDXL.
+
+
+
+You may have noticed that we set `guidance_scale=1.0`, which disables classifier-free guidance. This is because the LCM-LoRA is trained with guidance, so the batch size does not have to be doubled in this case. This leads to a faster inference time, with the drawback that negative prompts don't have any effect on the denoising process.
+
+You can also use guidance with LCM-LoRA, but due to the nature of the training, the model is very sensitive to `guidance_scale` values; high values can lead to artifacts in the generated images. In our experiments, we found that the best values are in the range of [1.0, 2.0].
+
+
+
+### Inference with a fine-tuned model
+
+As mentioned above, the LCM-LoRA can be applied to any fine-tuned version of the model without having to distill them separately. Let's look at how we can perform inference with a fine-tuned model. In this example, we'll use the [animagine-xl](https://huggingface.co/Linaqruf/animagine-xl) model, which is a fine-tuned version of the SDXL model for generating anime.
+
+```python
+import torch
+from diffusers import DiffusionPipeline, LCMScheduler
+
+pipe = DiffusionPipeline.from_pretrained(
+    "Linaqruf/animagine-xl",
+    variant="fp16",
+    torch_dtype=torch.float16
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
+
+prompt = "face focus, cute, masterpiece, best quality, 1girl, green hair, sweater, looking at viewer, upper body, beanie, outdoors, night, turtleneck"
+
+generator = torch.manual_seed(0)
+image = pipe(
+    prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=1.0
+).images[0]
+```
+
+![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_sdxl_t2i_finetuned.png)
+
+
+## Image-to-image
+
+LCM-LoRA can be applied to image-to-image tasks too. Let's look at how we can perform image-to-image generation with LCM-LoRA. For this example, we'll use the [dreamshaper-7](https://huggingface.co/Lykon/dreamshaper-7) model and the LCM-LoRA for `stable-diffusion-v1-5`.
+
+```python
+import torch
+from diffusers import AutoPipelineForImage2Image, LCMScheduler
+from diffusers.utils import make_image_grid, load_image
+
+pipe = AutoPipelineForImage2Image.from_pretrained(
+    "Lykon/dreamshaper-7",
+    torch_dtype=torch.float16,
+    variant="fp16",
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
+
+# prepare image
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/img2img-init.png"
+init_image = load_image(url)
+prompt = "Astronauts in a jungle, cold color palette, muted colors, detailed, 8k"
+
+# pass prompt and image to pipeline
+generator = torch.manual_seed(0)
+image = pipe(
+    prompt,
+    image=init_image,
+    num_inference_steps=4,
+    guidance_scale=1,
+    strength=0.6,
+    generator=generator
+).images[0]
+make_image_grid([init_image, image], rows=1, cols=2)
+```
+
+![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_sdv1-5_i2i.png)
+
+
+
+
+You can get different results based on your prompt and the image you provide. To get the best results, we recommend trying different values for the `num_inference_steps`, `strength`, and `guidance_scale` parameters and choosing the one that works best.
+
+
+
+
+## Combine with styled LoRAs
+
+LCM-LoRA can be combined with other LoRAs to generate styled images in very few steps (4-8). In the following example, we'll use the LCM-LoRA with the [papercut LoRA](https://huggingface.co/TheLastBen/Papercut_SDXL).
+To learn more about how to combine LoRAs, refer to [this guide](https://huggingface.co/docs/diffusers/tutorials/using_peft_for_inference#combine-multiple-adapters).
+
+```python
+import torch
+from diffusers import DiffusionPipeline, LCMScheduler
+
+pipe = DiffusionPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    variant="fp16",
+    torch_dtype=torch.float16
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LoRAs
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl", adapter_name="lcm")
+pipe.load_lora_weights("TheLastBen/Papercut_SDXL", weight_name="papercut.safetensors", adapter_name="papercut")
+
+# Combine LoRAs
+pipe.set_adapters(["lcm", "papercut"], adapter_weights=[1.0, 0.8])
+
+prompt = "papercut, a cute fox"
+generator = torch.manual_seed(0)
+image = pipe(prompt, num_inference_steps=4, guidance_scale=1, generator=generator).images[0]
+image
+```
+
+![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_sdx_lora_mix.png)
+
+
+## ControlNet/T2I-Adapter
+
+Let's look at how we can perform inference with ControlNet/T2I-Adapter and LCM-LoRA.
+
+### ControlNet
+For this example, we'll use the SD-v1-5 model and the LCM-LoRA for SD-v1-5 with the canny ControlNet.
+
+```python
+import torch
+import cv2
+import numpy as np
+from PIL import Image
+
+from diffusers import StableDiffusionControlNetPipeline, ControlNetModel, LCMScheduler
+from diffusers.utils import load_image, make_image_grid
+
+image = load_image(
+    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
+).resize((512, 512))
+
+image = np.array(image)
+
+low_threshold = 100
+high_threshold = 200
+
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image)
+
+controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
+pipe = StableDiffusionControlNetPipeline.from_pretrained(
+    "runwayml/stable-diffusion-v1-5",
+    controlnet=controlnet,
+    torch_dtype=torch.float16,
+    safety_checker=None,
+    variant="fp16"
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
+
+generator = torch.manual_seed(0)
+image = pipe(
+    "the mona lisa",
+    image=canny_image,
+    num_inference_steps=4,
+    guidance_scale=1.5,
+    controlnet_conditioning_scale=0.8,
+    cross_attention_kwargs={"scale": 1},
+    generator=generator,
+).images[0]
+make_image_grid([canny_image, image], rows=1, cols=2)
+```
+
+![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_sdv1-5_controlnet.png)
+
+
+
+The inference parameters in this example might not work for all examples, so we recommend trying different values for the `num_inference_steps`, `guidance_scale`, `controlnet_conditioning_scale`, and `cross_attention_kwargs` parameters and choosing the one that works best.
+
+
+### T2I-Adapter
+
+This example shows how to use the LCM-LoRA with the [Canny T2I-Adapter](https://huggingface.co/TencentARC/t2i-adapter-canny-sdxl-1.0) and SDXL.
+
+```python
+import torch
+import cv2
+import numpy as np
+from PIL import Image
+
+from diffusers import StableDiffusionXLAdapterPipeline, T2IAdapter, LCMScheduler
+from diffusers.utils import load_image, make_image_grid
+
+# Prepare image
+# Detect the canny map in low resolution to avoid high-frequency details
+image = load_image(
+    "https://huggingface.co/Adapter/t2iadapter/resolve/main/figs_SDXLV1.0/org_canny.jpg"
+).resize((384, 384))
+
+image = np.array(image)
+
+low_threshold = 100
+high_threshold = 200
+
+image = cv2.Canny(image, low_threshold, high_threshold)
+image = image[:, :, None]
+image = np.concatenate([image, image, image], axis=2)
+canny_image = Image.fromarray(image).resize((1024, 1024))
+
+# load adapter
+adapter = T2IAdapter.from_pretrained("TencentARC/t2i-adapter-canny-sdxl-1.0", torch_dtype=torch.float16, variant="fp16").to("cuda")
+
+pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    adapter=adapter,
+    torch_dtype=torch.float16,
+    variant="fp16",
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdxl")
+
+prompt = "Mystical fairy in real, magic, 4k picture, high quality"
+negative_prompt = "extra digit, fewer digits, cropped, worst quality, low quality, glitch, deformed, mutated, ugly, disfigured"
+
+generator = torch.manual_seed(0)
+image = pipe(
+    prompt=prompt,
+    negative_prompt=negative_prompt,
+    image=canny_image,
+    num_inference_steps=4,
+    guidance_scale=1.5,
+    adapter_conditioning_scale=0.8,
+    adapter_conditioning_factor=1,
+    generator=generator,
+).images[0]
+make_image_grid([canny_image, image], rows=1, cols=2)
+```
+
+![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_sdxl_t2iadapter.png)
+
+
+## Inpainting
+
+LCM-LoRA can be used for inpainting as well.
+
+```python
+import torch
+from diffusers import AutoPipelineForInpainting, LCMScheduler
+from diffusers.utils import load_image, make_image_grid
+
+pipe = AutoPipelineForInpainting.from_pretrained(
+    "runwayml/stable-diffusion-inpainting",
+    torch_dtype=torch.float16,
+    variant="fp16",
+).to("cuda")
+
+# set scheduler
+pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
+
+# load LCM-LoRA
+pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5")
+
+# load base and mask image
+init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint.png")
+mask_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/inpaint_mask.png")
+
+prompt = "concept art digital painting of an elven castle, inspired by lord of the rings, highly detailed, 8k"
+generator = torch.manual_seed(0)
+image = pipe(
+    prompt=prompt,
+    image=init_image,
+    mask_image=mask_image,
+    generator=generator,
+    num_inference_steps=4,
+    guidance_scale=4,
+).images[0]
+make_image_grid([init_image, mask_image, image], rows=1, cols=3)
+```
+
+![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_sdv1-5_inpainting.png)
+
+
+## AnimateDiff
+
+[`AnimateDiff`] allows you to animate images using Stable Diffusion models. To get good results, we need to generate multiple frames (16-24), and doing this with standard SD models can be very slow.
+LCM-LoRA can be used to speed up the process significantly, as you just need to do 4-8 steps for each frame. Let's look at how we can perform animation with LCM-LoRA and AnimateDiff. + +```python +import torch +from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler, LCMScheduler +from diffusers.utils import export_to_gif + +adapter = MotionAdapter.from_pretrained("diffusers/animatediff-motion-adapter-v1-5") +pipe = AnimateDiffPipeline.from_pretrained( + "frankjoshua/toonyou_beta6", + motion_adapter=adapter, +).to("cuda") + +# set scheduler +pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) + +# load LCM-LoRA +pipe.load_lora_weights("latent-consistency/lcm-lora-sdv1-5", adapter_name="lcm") +pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-in", weight_name="diffusion_pytorch_model.safetensors", adapter_name="motion-lora") + +pipe.set_adapters(["lcm", "motion-lora"], adapter_weights=[0.55, 1.2]) + +prompt = "best quality, masterpiece, 1girl, looking at viewer, blurry background, upper body, contemporary, dress" +generator = torch.manual_seed(0) +frames = pipe( + prompt=prompt, + num_inference_steps=5, + guidance_scale=1.25, + cross_attention_kwargs={"scale": 1}, + num_frames=24, + generator=generator +).frames[0] +export_to_gif(frames, "animation.gif") +``` + +![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_sdv1-5_animatediff.gif) \ No newline at end of file diff --git a/docs/source/en/using-diffusers/inpaint.md b/docs/source/en/using-diffusers/inpaint.md index abdfbffb908b..e6b1010f13b0 100644 --- a/docs/source/en/using-diffusers/inpaint.md +++ b/docs/source/en/using-diffusers/inpaint.md @@ -27,7 +27,7 @@ from diffusers.utils import load_image, make_image_grid pipeline = AutoPipelineForInpainting.from_pretrained( "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16 -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -98,7 +98,7 @@ from diffusers.utils import load_image, make_image_grid pipeline = AutoPipelineForInpainting.from_pretrained( "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16" -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -124,7 +124,7 @@ from diffusers.utils import load_image, make_image_grid pipeline = AutoPipelineForInpainting.from_pretrained( "diffusers/stable-diffusion-xl-1.0-inpainting-0.1", torch_dtype=torch.float16, variant="fp16" -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -150,7 +150,7 @@ from diffusers.utils import load_image, make_image_grid pipeline = AutoPipelineForInpainting.from_pretrained( "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16 -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -379,7 +379,7 @@ from diffusers.utils import load_image, make_image_grid pipeline = AutoPipelineForInpainting.from_pretrained( "runwayml/stable-diffusion-inpainting", 
torch_dtype=torch.float16, variant="fp16" -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -424,7 +424,7 @@ from diffusers.utils import load_image, make_image_grid pipeline = AutoPipelineForInpainting.from_pretrained( "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16" -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -464,7 +464,7 @@ from diffusers.utils import load_image, make_image_grid pipeline = AutoPipelineForInpainting.from_pretrained( "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16" -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -503,7 +503,7 @@ from diffusers.utils import load_image, make_image_grid pipeline = AutoPipelineForText2Image.from_pretrained( "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16, variant="fp16", use_safetensors=True -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -522,7 +522,7 @@ And let's inpaint the masked area with a waterfall: ```py pipeline = AutoPipelineForInpainting.from_pretrained( "kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16 -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -556,7 +556,7 @@ from diffusers.utils import load_image, make_image_grid pipeline = AutoPipelineForInpainting.from_pretrained( "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, variant="fp16" -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -577,7 +577,7 @@ Now let's pass the image to another inpainting pipeline with SDXL's refiner mode ```py pipeline = AutoPipelineForInpainting.from_pretrained( "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16, variant="fp16" -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -636,7 +636,7 @@ from diffusers.utils import make_image_grid pipeline = AutoPipelineForInpainting.from_pretrained( "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16, -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -667,7 +667,7 @@ controlnet = ControlNetModel.from_pretrained("lllyasviel/control_v11p_sd15_inpai # pass ControlNet to the pipeline pipeline = StableDiffusionControlNetInpaintPipeline.from_pretrained( "runwayml/stable-diffusion-inpainting", controlnet=controlnet, torch_dtype=torch.float16, variant="fp16" -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove 
following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() @@ -705,7 +705,7 @@ from diffusers import AutoPipelineForImage2Image pipeline = AutoPipelineForImage2Image.from_pretrained( "nitrosocke/elden-ring-diffusion", torch_dtype=torch.float16, -).to("cuda") +) pipeline.enable_model_cpu_offload() # remove following line if xFormers is not installed or you have PyTorch 2.0 or higher installed pipeline.enable_xformers_memory_efficient_attention() diff --git a/docs/source/en/using-diffusers/kandinsky.md b/docs/source/en/using-diffusers/kandinsky.md index 4ca544270766..05be2e1ee289 100644 --- a/docs/source/en/using-diffusers/kandinsky.md +++ b/docs/source/en/using-diffusers/kandinsky.md @@ -1,3 +1,15 @@ + + # Kandinsky [[open-in-colab]] @@ -14,7 +26,7 @@ Before you begin, make sure you have the following libraries installed: ```py # uncomment to install the necessary libraries in Colab -#!pip install transformers accelerate safetensors +#!pip install -q diffusers transformers accelerate ``` @@ -46,6 +58,7 @@ Now pass all the prompts and embeddings to the [`KandinskyPipeline`] to generate ```py image = pipeline(prompt, image_embeds=image_embeds, negative_prompt=negative_prompt, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0] +image ```
@@ -71,6 +84,7 @@ Pass the `image_embeds` and `negative_image_embeds` to the [`KandinskyV22Pipelin ```py image = pipeline(image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768).images[0] +image ```
@@ -91,13 +105,14 @@ Use the [`AutoPipelineForText2Image`] to automatically call the combined pipelin from diffusers import AutoPipelineForText2Image import torch -pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16).to("cuda") +pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16) pipeline.enable_model_cpu_offload() prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" negative_prompt = "low quality, bad quality" -image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale = 4.0, height=768, width=768).images[0] +image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0] +image ``` @@ -107,13 +122,14 @@ image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_ from diffusers import AutoPipelineForText2Image import torch -pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda") +pipeline = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16) pipeline.enable_model_cpu_offload() prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting" negative_prompt = "low quality, bad quality" -image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale = 4.0, height=768, width=768).images[0] +image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_scale=1.0, guidance_scale=4.0, height=768, width=768).images[0] +image ``` @@ -121,7 +137,7 @@ image = pipeline(prompt=prompt, negative_prompt=negative_prompt, prior_guidance_ ## Image-to-image -For image-to-image, pass the initial image and text prompt to condition the image with to the pipeline. Start by loading the prior pipeline: +For image-to-image, pass the initial image and text prompt to condition the image to the pipeline. Start by loading the prior pipeline: @@ -151,14 +167,11 @@ pipeline = KandinskyV22Img2ImgPipeline.from_pretrained("kandinsky-community/kand Download an image to condition on: ```py -from PIL import Image -import requests -from io import BytesIO +from diffusers.utils import load_image # download image url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" -response = requests.get(url) -original_image = Image.open(BytesIO(response.content)).convert("RGB") +original_image = load_image(url) original_image = original_image.resize((768, 512)) ``` @@ -181,7 +194,10 @@ Now pass the original image, and all the prompts and embeddings to the pipeline ```py -image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_emebds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0] +from diffusers.utils import make_image_grid + +image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0] +make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) ```
@@ -192,7 +208,10 @@ image = pipeline(prompt, negative_prompt=negative_prompt, image=original_image, ```py -image = pipeline(image=original_image, image_embeds=image_emebds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0] +from diffusers.utils import make_image_grid + +image = pipeline(image=original_image, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds, height=768, width=768, strength=0.3).images[0] +make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) ```
@@ -211,25 +230,22 @@ Use the [`AutoPipelineForImage2Image`] to automatically call the combined pipeli ```py from diffusers import AutoPipelineForImage2Image +from diffusers.utils import make_image_grid, load_image import torch -import requests -from io import BytesIO -from PIL import Image -import os -pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True).to("cuda") +pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16, use_safetensors=True) pipeline.enable_model_cpu_offload() prompt = "A fantasy landscape, Cinematic lighting" negative_prompt = "low quality, bad quality" url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" - -response = requests.get(url) -original_image = Image.open(BytesIO(response.content)).convert("RGB") +original_image = load_image(url) + original_image.thumbnail((768, 768)) -image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0] +image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0] +make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) ``` @@ -237,25 +253,22 @@ image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0] ```py from diffusers import AutoPipelineForImage2Image +from diffusers.utils import make_image_grid, load_image import torch -import requests -from io import BytesIO -from PIL import Image -import os -pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16).to("cuda") +pipeline = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16) pipeline.enable_model_cpu_offload() prompt = "A fantasy landscape, Cinematic lighting" negative_prompt = "low quality, bad quality" url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg" - -response = requests.get(url) -original_image = Image.open(BytesIO(response.content)).convert("RGB") +original_image = load_image(url) + original_image.thumbnail((768, 768)) -image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0] +image = pipeline(prompt=prompt, negative_prompt=negative_prompt, image=original_image, strength=0.3).images[0] +make_image_grid([original_image.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) ``` @@ -265,7 +278,7 @@ image = pipeline(prompt=prompt, image=original_image, strength=0.3).images[0] -⚠️ The Kandinsky models uses ⬜️ **white pixels** to represent the masked area now instead of black pixels. If you are using [`KandinskyInpaintPipeline`] in production, you need to change the mask to use white pixels: +⚠️ The Kandinsky models use ⬜️ **white pixels** to represent the masked area now instead of black pixels. 
If you are using [`KandinskyInpaintPipeline`] in production, you need to change the mask to use white pixels: ```py # For PIL input @@ -285,9 +298,10 @@ For inpainting, you'll need the original image, a mask of the area to replace in ```py from diffusers import KandinskyInpaintPipeline, KandinskyPriorPipeline -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid import torch import numpy as np +from PIL import Image prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda") @@ -298,9 +312,10 @@ pipeline = KandinskyInpaintPipeline.from_pretrained("kandinsky-community/kandins ```py from diffusers import KandinskyV22InpaintPipeline, KandinskyV22PriorPipeline -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid import torch import numpy as np +from PIL import Image prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") pipeline = KandinskyV22InpaintPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16, use_safetensors=True).to("cuda") @@ -331,7 +346,9 @@ Now pass the initial image, mask, and prompt and embeddings to the pipeline to g ```py -image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0] +output_image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0] +mask = Image.fromarray((mask*255).astype('uint8'), 'L') +make_image_grid([init_image, mask, output_image], rows=1, cols=3) ```
@@ -342,7 +359,9 @@ image = pipeline(prompt, image=init_image, mask_image=mask, **prior_output, heig ```py -image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0] +output_image = pipeline(image=init_image, mask_image=mask, **prior_output, height=768, width=768, num_inference_steps=150).images[0] +mask = Image.fromarray((mask*255).astype('uint8'), 'L') +make_image_grid([init_image, mask, output_image], rows=1, cols=3) ```
@@ -359,14 +378,23 @@ You can also use the end-to-end [`KandinskyInpaintCombinedPipeline`] and [`Kandi ```py import torch +import numpy as np +from PIL import Image from diffusers import AutoPipelineForInpainting +from diffusers.utils import load_image, make_image_grid pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-1-inpaint", torch_dtype=torch.float16) pipe.enable_model_cpu_offload() +init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") +mask = np.zeros((768, 768), dtype=np.float32) +# mask area above cat's head +mask[:250, 250:-250] = 1 prompt = "a hat" -image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0] +output_image = pipe(prompt=prompt, image=init_image, mask_image=mask).images[0] +mask = Image.fromarray((mask*255).astype('uint8'), 'L') +make_image_grid([init_image, mask, output_image], rows=1, cols=3) ``` @@ -374,14 +402,23 @@ image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0] ```py import torch +import numpy as np +from PIL import Image from diffusers import AutoPipelineForInpainting +from diffusers.utils import load_image, make_image_grid pipe = AutoPipelineForInpainting.from_pretrained("kandinsky-community/kandinsky-2-2-decoder-inpaint", torch_dtype=torch.float16) pipe.enable_model_cpu_offload() +init_image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") +mask = np.zeros((768, 768), dtype=np.float32) +# mask area above cat's head +mask[:250, 250:-250] = 1 prompt = "a hat" -image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0] +output_image = pipe(prompt=prompt, image=original_image, mask_image=mask).images[0] +mask = Image.fromarray((mask*255).astype('uint8'), 'L') +make_image_grid([init_image, mask, output_image], rows=1, cols=3) ``` @@ -396,13 +433,13 @@ Interpolation allows you to explore the latent space between the image and text ```py from diffusers import KandinskyPriorPipeline, KandinskyPipeline -from diffusers.utils import load_image -import PIL +from diffusers.utils import load_image, make_image_grid import torch prior_pipeline = KandinskyPriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg") +make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2) ``` @@ -410,13 +447,13 @@ img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffuser ```py from diffusers import KandinskyV22PriorPipeline, KandinskyV22Pipeline -from diffusers.utils import load_image -import PIL +from diffusers.utils import load_image, make_image_grid import torch prior_pipeline = KandinskyV22PriorPipeline.from_pretrained("kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True).to("cuda") img_1 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/cat.png") img_2 = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky/starry_night.jpeg") +make_image_grid([img_1.resize((512,512)), img_2.resize((512,512))], rows=1, cols=2) ``` @@ -436,7 +473,7 @@ img_2 = 
load_image("https://huggingface.co/datasets/hf-internal-testing/diffuser Specify the text or images to interpolate, and set the weights for each text or image. Experiment with the weights to see how they affect the interpolation! ```py -images_texts = ["a cat", img1, img2] +images_texts = ["a cat", img_1, img_2] weights = [0.3, 0.3, 0.4] ``` @@ -499,6 +536,7 @@ from diffusers.utils import load_image img = load_image( "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png" ).resize((768, 768)) +img ```
@@ -512,8 +550,6 @@ import torch import numpy as np from transformers import pipeline -from diffusers.utils import load_image - def make_hint(image, depth_estimator): image = depth_estimator(image)["depth"] @@ -524,7 +560,6 @@ def make_hint(image, depth_estimator): hint = detected_map.permute(2, 0, 1) return hint - depth_estimator = pipeline("depth-estimation") hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda") ``` @@ -538,10 +573,10 @@ from diffusers import KandinskyV22PriorPipeline, KandinskyV22ControlnetPipeline prior_pipeline = KandinskyV22PriorPipeline.from_pretrained( "kandinsky-community/kandinsky-2-2-prior", torch_dtype=torch.float16, use_safetensors=True -)to("cuda") +).to("cuda") pipeline = KandinskyV22ControlnetPipeline.from_pretrained( - "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16, use_safetensors=True + "kandinsky-community/kandinsky-2-2-controlnet-depth", torch_dtype=torch.float16 ).to("cuda") ``` @@ -549,11 +584,11 @@ Generate the image embeddings from a prompt and negative prompt: ```py prompt = "A robot, 4k photo" - negative_prior_prompt = "lowres, text, error, cropped, worst quality, low quality, jpeg artifacts, ugly, duplicate, morbid, mutilated, out of frame, extra fingers, mutated hands, poorly drawn hands, poorly drawn face, mutation, deformed, blurry, dehydrated, bad anatomy, bad proportions, extra limbs, cloned face, disfigured, gross proportions, malformed limbs, missing arms, missing legs, extra arms, extra legs, fused fingers, too many fingers, long neck, username, watermark, signature" generator = torch.Generator(device="cuda").manual_seed(43) -image_emb, zero_image_emb = pipe_prior( + +image_emb, zero_image_emb = prior_pipeline( prompt=prompt, negative_prompt=negative_prior_prompt, generator=generator ).to_tuple() ``` @@ -587,10 +622,9 @@ from diffusers.utils import load_image from transformers import pipeline img = load_image( - "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main" "/kandinskyv22/cat.png" + "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinskyv22/cat.png" ).resize((768, 768)) - def make_hint(image, depth_estimator): image = depth_estimator(image)["depth"] image = np.array(image) @@ -600,7 +634,6 @@ def make_hint(image, depth_estimator): hint = detected_map.permute(2, 0, 1) return hint - depth_estimator = pipeline("depth-estimation") hint = make_hint(img, depth_estimator).unsqueeze(0).half().to("cuda") ``` @@ -625,15 +658,15 @@ negative_prior_prompt = "lowres, text, error, cropped, worst quality, low qualit generator = torch.Generator(device="cuda").manual_seed(43) -img_emb = pipe_prior(prompt=prompt, image=img, strength=0.85, generator=generator) -negative_emb = pipe_prior(prompt=negative_prior_prompt, image=img, strength=1, generator=generator) +img_emb = prior_pipeline(prompt=prompt, image=img, strength=0.85, generator=generator) +negative_emb = prior_pipeline(prompt=negative_prior_prompt, image=img, strength=1, generator=generator) ``` Now you can run the [`KandinskyV22ControlnetImg2ImgPipeline`] to generate an image from the initial image and the image embeddings: ```py image = pipeline(image=img, strength=0.5, image_embeds=img_emb.image_embeds, negative_image_embeds=negative_emb.image_embeds, hint=hint, num_inference_steps=50, generator=generator, height=768, width=768).images[0] -image +make_image_grid([img.resize((512, 512)), image.resize((512, 512))], rows=1, cols=2) ```
@@ -644,7 +677,7 @@ image Kandinsky is unique because it requires a prior pipeline to generate the mappings, and a second pipeline to decode the latents into an image. Optimization efforts should be focused on the second pipeline because that is where the bulk of the computation is done. Here are some tips to improve Kandinsky during inference. -1. Enable [xFormers](https://moon-ci-docs.huggingface.co/optimization/xformers) if you're using PyTorch < 2.0: +1. Enable [xFormers](../optimization/xformers) if you're using PyTorch < 2.0: ```diff from diffusers import DiffusionPipeline @@ -654,14 +687,11 @@ Kandinsky is unique because it requires a prior pipeline to generate the mapping + pipe.enable_xformers_memory_efficient_attention() ``` -2. Enable `torch.compile` if you're using PyTorch 2.0 to automatically use scaled dot-product attention (SDPA): +2. Enable `torch.compile` if you're using PyTorch >= 2.0 to automatically use scaled dot-product attention (SDPA): ```diff pipe.unet.to(memory_format=torch.channels_last) -+ pipe.unet = torch.compile(pipe.unet, mode="reduced-overhead", fullgraph=True) - - pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16) -+ pipe.enable_xformers_memory_efficient_attention() ++ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True) ``` This is the same as explicitly setting the attention processor to use [`~models.attention_processor.AttnAddedKVProcessor2_0`]: @@ -685,8 +715,9 @@ pipe.unet.set_attn_processor(AttnAddedKVProcessor2_0()) 4. By default, the text-to-image pipeline uses the [`DDIMScheduler`] but you can replace it with another scheduler like [`DDPMScheduler`] to see how that affects the tradeoff between inference speed and image quality: ```py -from diffusers import DDPMSCheduler +from diffusers import DDPMScheduler +from diffusers import DiffusionPipeline scheduler = DDPMScheduler.from_pretrained("kandinsky-community/kandinsky-2-1", subfolder="ddpm_scheduler") pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", scheduler=scheduler, torch_dtype=torch.float16, use_safetensors=True).to("cuda") -``` \ No newline at end of file +``` diff --git a/docs/source/en/using-diffusers/lcm.md b/docs/source/en/using-diffusers/lcm.md deleted file mode 100644 index 39bc2426a92b..000000000000 --- a/docs/source/en/using-diffusers/lcm.md +++ /dev/null @@ -1,154 +0,0 @@ - - -# Performing inference with LCM - -Latent Consistency Models (LCM) enable quality image generation in typically 2-4 steps making it possible to use diffusion models in almost real-time settings. - -From the [official website](https://latent-consistency-models.github.io/): - -> LCMs can be distilled from any pre-trained Stable Diffusion (SD) in only 4,000 training steps (~32 A100 GPU Hours) for generating high quality 768 x 768 resolution images in 2~4 steps or even one step, significantly accelerating text-to-image generation. We employ LCM to distill the Dreamshaper-V7 version of SD in just 4,000 training iterations. - -For a more technical overview of LCMs, refer to [the paper](https://huggingface.co/papers/2310.04378). - -This guide shows how to perform inference with LCMs for text-to-image and image-to-image generation tasks. It will also cover performing inference with LoRA checkpoints. - -## Text-to-image - -You'll use the [`StableDiffusionXLPipeline`] here changing the `unet`. The UNet was distilled from the SDXL UNet using the framework introduced in LCM. 
Another important component is the scheduler: [`LCMScheduler`]. Together with the distilled UNet and the scheduler, LCM enables a fast inference workflow overcoming the slow iterative nature of diffusion models. - -```python -from diffusers import DiffusionPipeline, UNet2DConditionModel, LCMScheduler -import torch - -unet = UNet2DConditionModel.from_pretrained( - "latent-consistency/lcm-sdxl", - torch_dtype=torch.float16, - variant="fp16", -) -pipe = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16 -).to("cuda") -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k" - -generator = torch.manual_seed(0) -image = pipe( - prompt=prompt, num_inference_steps=4, generator=generator, guidance_scale=8.0 -).images[0] -``` - -![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_intro.png) - -Notice that we use only 4 steps for generation which is way less than what's typically used for standard SDXL. - -Some details to keep in mind: - -* To perform classifier-free guidance, batch size is usually doubled inside the pipeline. LCM, however, applies guidance using guidance embeddings, so the batch size does not have to be doubled in this case. This leads to a faster inference time, with the drawback that negative prompts don't have any effect on the denoising process. -* The UNet was trained using the [3., 13.] guidance scale range. So, that is the ideal range for `guidance_scale`. However, disabling `guidance_scale` using a value of 1.0 is also effective in most cases. - -## Image-to-image - -The findings above apply to image-to-image tasks too. Let's look at how we can perform image-to-image generation with LCMs: - -```python -from diffusers import AutoPipelineForImage2Image, UNet2DConditionModel, LCMScheduler -from diffusers.utils import load_image -import torch - -unet = UNet2DConditionModel.from_pretrained( - "latent-consistency/lcm-sdxl", - torch_dtype=torch.float16, - variant="fp16", -) -pipe = AutoPipelineForImage2Image.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", unet=unet, torch_dtype=torch.float16 -).to("cuda") -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -prompt = "High altitude snowy mountains" -image = load_image( - "https://huggingface.co/datasets/sayakpaul/sample-datasets/resolve/main/snowy_mountains.jpeg" -) - -generator = torch.manual_seed(0) -image = pipe( - prompt=prompt, - image=image, - num_inference_steps=4, - generator=generator, - guidance_scale=8.0, -).images[0] -``` -![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_i2i.png) - -## LoRA - -It is possible to generalize the LCM framework to use with [LoRA](../training/lora.md). It effectively eliminates the need to conduct expensive fine-tuning runs as LoRA training concerns just a few number of parameters compared to full fine-tuning. During inference, the [`LCMScheduler`] comes to the advantage as it enables very few-steps inference without compromising the quality. - -We recommend to disable `guidance_scale` by setting it 0. The model is trained to follow prompts accurately -even without using guidance scale. You can however, still use guidance scale in which case we recommend -using values between 1.0 and 2.0. 
- -### Text-to-image - -```python -from diffusers import DiffusionPipeline, LCMScheduler -import torch - -model_id = "stabilityai/stable-diffusion-xl-base-1.0" -lcm_lora_id = "latent-consistency/lcm-lora-sdxl" - -pipe = DiffusionPipeline.from_pretrained(model_id, variant="fp16", torch_dtype=torch.float16).to("cuda") - -pipe.load_lora_weights(lcm_lora_id) -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -prompt = "close-up photography of old man standing in the rain at night, in a street lit by lamps, leica 35mm summilux" -image = pipe( - prompt=prompt, - num_inference_steps=4, - guidance_scale=0, # set guidance scale to 0 to disable it -).images[0] -``` -![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lora_lcm.png) - -### Image-to-image - -Extending LCM LoRA to image-to-image is possible: - -```python -from diffusers import StableDiffusionXLImg2ImgPipeline, LCMScheduler -from diffusers.utils import load_image -import torch - -model_id = "stabilityai/stable-diffusion-xl-base-1.0" -lcm_lora_id = "latent-consistency/lcm-lora-sdxl" - -pipe = StableDiffusionXLImg2ImgPipeline.from_pretrained(model_id, variant="fp16", torch_dtype=torch.float16).to("cuda") - -pipe.load_lora_weights(lcm_lora_id) -pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config) - -prompt = "close-up photography of old man standing in the rain at night, in a street lit by lamps, leica 35mm summilux" - -image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lora_lcm.png") - -image = pipe( - prompt=prompt, - image=image, - num_inference_steps=4, - guidance_scale=0, # set guidance scale to 0 to disable it -).images[0] -``` -![](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/lcm/lcm_lora_i2i.png) diff --git a/docs/source/en/using-diffusers/loading.md b/docs/source/en/using-diffusers/loading.md index 57348e849e6b..d9e19a5bdd2a 100644 --- a/docs/source/en/using-diffusers/loading.md +++ b/docs/source/en/using-diffusers/loading.md @@ -232,7 +232,7 @@ TODO(Patrick) - Make sure to uncomment this part as soon as things are deprecate #### Using `revision` to load pipeline variants is deprecated -Previously the `revision` argument of [`DiffusionPipeline.from_pretrained`] was heavily used to +Previously the `revision` argument of [`DiffusionPipeline.from_pretrained`] was heavily used to load model variants, e.g.: ```python @@ -247,7 +247,7 @@ The above example is therefore deprecated and won't be supported anymore for `di -If you load diffusers pipelines or models with `revision="fp16"` or `revision="non_ema"`, +If you load diffusers pipelines or models with `revision="fp16"` or `revision="non_ema"`, please make sure to update the code and use `variant="fp16"` or `variation="non_ema"` respectively instead. diff --git a/docs/source/en/using-diffusers/loading_adapters.md b/docs/source/en/using-diffusers/loading_adapters.md index 8f6bf85da318..e73e042bd4d5 100644 --- a/docs/source/en/using-diffusers/loading_adapters.md +++ b/docs/source/en/using-diffusers/loading_adapters.md @@ -189,7 +189,7 @@ pipeline = StableDiffusionXLPipeline.from_pretrained( ).to("cuda") ``` -Next, load the LoRA checkpoint and fuse it with the original weights. The `lora_scale` parameter controls how much to scale the output by with the LoRA weights. 
It is important to make the `lora_scale` adjustments in the [`~loaders.LoraLoaderMixin.fuse_lora`] method because it won't work if you try to pass `scale` to the `cross_attention_kwargs` in the pipeline. +Next, load the LoRA checkpoint and fuse it with the original weights. The `lora_scale` parameter controls how much to scale the output by with the LoRA weights. It is important to make the `lora_scale` adjustments in the [`~loaders.LoraLoaderMixin.fuse_lora`] method because it won't work if you try to pass `scale` to the `cross_attention_kwargs` in the pipeline. If you need to reset the original model weights for any reason (use a different `lora_scale`), you should use the [`~loaders.LoraLoaderMixin.unfuse_lora`] method. diff --git a/docs/source/en/using-diffusers/other-formats.md b/docs/source/en/using-diffusers/other-formats.md index 84945a6da87a..6f8e00d1e396 100644 --- a/docs/source/en/using-diffusers/other-formats.md +++ b/docs/source/en/using-diffusers/other-formats.md @@ -34,11 +34,11 @@ There are two options for converting a `.ckpt` file: use a Space to convert the The easiest and most convenient way to convert a `.ckpt` file is to use the [SD to Diffusers](https://huggingface.co/spaces/diffusers/sd-to-diffusers) Space. You can follow the instructions on the Space to convert the `.ckpt` file. -This approach works well for basic models, but it may struggle with more customized models. You'll know the Space failed if it returns an empty pull request or error. In this case, you can try converting the `.ckpt` file with a script. +This approach works well for basic models, but it may struggle with more customized models. You'll know the Space failed if it returns an empty pull request or error. In this case, you can try converting the `.ckpt` file with a script. ### Convert with a script -πŸ€— Diffusers provides a [conversion script](https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py) for converting `.ckpt` files. This approach is more reliable than the Space above. +πŸ€— Diffusers provides a [conversion script](https://github.com/huggingface/diffusers/blob/main/scripts/convert_original_stable_diffusion_to_diffusers.py) for converting `.ckpt` files. This approach is more reliable than the Space above. Before you start, make sure you have a local clone of πŸ€— Diffusers to run the script and log in to your Hugging Face account so you can open pull requests and push your converted model to the Hub. @@ -86,11 +86,11 @@ git push origin pr/13:refs/pr/13 -πŸ§ͺ This is an experimental feature. Only Stable Diffusion v1 checkpoints are supported by the Convert KerasCV Space at the moment. +πŸ§ͺ This is an experimental feature. Only Stable Diffusion v1 checkpoints are supported by the Convert KerasCV Space at the moment. -[KerasCV](https://keras.io/keras_cv/) supports training for [Stable Diffusion](https://github.com/keras-team/keras-cv/blob/master/keras_cv/models/stable_diffusion) v1 and v2. 
However, it offers limited support for experimenting with Stable Diffusion models for inference and deployment whereas πŸ€— Diffusers has a more complete set of features for this purpose, such as different [noise schedulers](https://huggingface.co/docs/diffusers/using-diffusers/schedulers), [flash attention](https://huggingface.co/docs/diffusers/optimization/xformers), and [other +[KerasCV](https://keras.io/keras_cv/) supports training for [Stable Diffusion](https://github.com/keras-team/keras-cv/blob/master/keras_cv/models/stable_diffusion) v1 and v2. However, it offers limited support for experimenting with Stable Diffusion models for inference and deployment whereas πŸ€— Diffusers has a more complete set of features for this purpose, such as different [noise schedulers](https://huggingface.co/docs/diffusers/using-diffusers/schedulers), [flash attention](https://huggingface.co/docs/diffusers/optimization/xformers), and [other optimization techniques](https://huggingface.co/docs/diffusers/optimization/fp16). The [Convert KerasCV](https://huggingface.co/spaces/sayakpaul/convert-kerascv-sd-diffusers) Space converts `.pb` or `.h5` files to PyTorch, and then wraps them in a [`StableDiffusionPipeline`] so it is ready for inference. The converted checkpoint is stored in a repository on the Hugging Face Hub. diff --git a/docs/source/en/using-diffusers/reproducibility.md b/docs/source/en/using-diffusers/reproducibility.md index cc9dcf62666d..5bc1d02b14d4 100644 --- a/docs/source/en/using-diffusers/reproducibility.md +++ b/docs/source/en/using-diffusers/reproducibility.md @@ -55,7 +55,7 @@ But if you need to reliably generate the same image, that'll depend on whether y ### CPU -To generate reproducible results on a CPU, you'll need to use a PyTorch [`Generator`](https://pytorch.org/docs/stable/generated/torch.randn.html) and set a seed: +To generate reproducible results on a CPU, you'll need to use a PyTorch [`Generator`](https://pytorch.org/docs/stable/generated/torch.Generator.html) and set a seed: ```python import torch @@ -83,7 +83,7 @@ If you run this code example on your specific hardware and PyTorch version, you πŸ’‘ It might be a bit unintuitive at first to pass `Generator` objects to the pipeline instead of just integer values representing the seed, but this is the recommended design when dealing with -probabilistic models in PyTorch as `Generator`'s are *random states* that can be +probabilistic models in PyTorch, as `Generator`s are *random states* that can be passed to multiple pipelines in a sequence. @@ -159,6 +159,7 @@ PyTorch typically benchmarks multiple algorithms to select the fastest one, but ```py import os +import torch os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":16:8" @@ -171,7 +172,6 @@ Now when you run the same pipeline twice, you'll get identical results. 
```py import torch from diffusers import DDIMScheduler, StableDiffusionPipeline -import numpy as np model_id = "runwayml/stable-diffusion-v1-5" pipe = StableDiffusionPipeline.from_pretrained(model_id, use_safetensors=True).to("cuda") @@ -186,6 +186,6 @@ result1 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type=" g.manual_seed(0) result2 = pipe(prompt=prompt, num_inference_steps=50, generator=g, output_type="latent").images -print("L_inf dist = ", abs(result1 - result2).max()) -"L_inf dist = tensor(0., device='cuda:0')" -``` \ No newline at end of file +print("L_inf dist =", abs(result1 - result2).max()) +"L_inf dist = tensor(0., device='cuda:0')" +``` diff --git a/docs/source/en/using-diffusers/schedulers.md b/docs/source/en/using-diffusers/schedulers.md index 9a8dd29ec2ea..6b5d8da465d8 100644 --- a/docs/source/en/using-diffusers/schedulers.md +++ b/docs/source/en/using-diffusers/schedulers.md @@ -14,10 +14,10 @@ specific language governing permissions and limitations under the License. [[open-in-colab]] -Diffusion pipelines are inherently a collection of diffusion models and schedulers that are partly independent from each other. This means that one is able to switch out parts of the pipeline to better customize +Diffusion pipelines are inherently a collection of diffusion models and schedulers that are partly independent from each other. This means that one is able to switch out parts of the pipeline to better customize a pipeline to one's use case. The best example of this is the [Schedulers](../api/schedulers/overview). -Whereas diffusion models usually simply define the forward pass from noise to a less noisy sample, +Whereas diffusion models usually simply define the forward pass from noise to a less noisy sample, schedulers define the whole denoising process, *i.e.*: - How many denoising steps? - Stochastic or deterministic? @@ -77,7 +77,7 @@ PNDMScheduler { } ``` -We can see that the scheduler is of type [`PNDMScheduler`]. +We can see that the scheduler is of type [`PNDMScheduler`]. Cool, now let's compare the scheduler in its performance to other schedulers. First we define a prompt on which we will test all the different schedulers: @@ -102,7 +102,7 @@ image ## Changing the scheduler -Now we show how easy it is to change the scheduler of a pipeline. Every scheduler has a property [`~SchedulerMixin.compatibles`] +Now we show how easy it is to change the scheduler of a pipeline. Every scheduler has a property [`~SchedulerMixin.compatibles`] which defines all compatible schedulers. You can take a look at all available, compatible schedulers for the Stable Diffusion pipeline as follows. ```python @@ -127,7 +127,7 @@ pipeline.scheduler.compatibles diffusers.schedulers.scheduling_k_dpm_2_ancestral_discrete.KDPM2AncestralDiscreteScheduler] ``` -Cool, lots of schedulers to look at. Feel free to have a look at their respective class definitions: +Cool, lots of schedulers to look at. Feel free to have a look at their respective class definitions: - [`EulerDiscreteScheduler`], - [`LMSDiscreteScheduler`], @@ -143,7 +143,7 @@ Cool, lots of schedulers to look at. Feel free to have a look at their respectiv - [`DPMSolverSinglestepScheduler`], - [`KDPM2AncestralDiscreteScheduler`]. -We will now compare the input prompt with all other schedulers. To change the scheduler of the pipeline you can make use of the +We will now compare the input prompt with all other schedulers. 
To change the scheduler of the pipeline you can make use of the convenient [`~ConfigMixin.config`] property in combination with the [`~ConfigMixin.from_config`] function. ```python @@ -171,7 +171,7 @@ FrozenDict([('num_train_timesteps', 1000), ``` This configuration can then be used to instantiate a scheduler -of a different class that is compatible with the pipeline. Here, +of a different class that is compatible with the pipeline. Here, we change the scheduler to the [`DDIMScheduler`]. ```python @@ -198,7 +198,7 @@ If you are a JAX/Flax user, please check [this section](#changing-the-scheduler- ## Compare schedulers -So far we have tried running the stable diffusion pipeline with two schedulers: [`PNDMScheduler`] and [`DDIMScheduler`]. +So far we have tried running the stable diffusion pipeline with two schedulers: [`PNDMScheduler`] and [`DDIMScheduler`]. A number of better schedulers have been released that can be run with much fewer steps; let's compare them here: [`LMSDiscreteScheduler`] usually leads to better results: diff --git a/docs/source/en/using-diffusers/sdxl.md b/docs/source/en/using-diffusers/sdxl.md index 1016c57ca0ec..25b581fc6f6f 100644 --- a/docs/source/en/using-diffusers/sdxl.md +++ b/docs/source/en/using-diffusers/sdxl.md @@ -26,7 +26,7 @@ Before you begin, make sure you have the following libraries installed: ```py # uncomment to install the necessary libraries in Colab -#!pip install diffusers transformers accelerate safetensors omegaconf invisible-watermark>=0.2.0 +#!pip install -q diffusers transformers accelerate omegaconf invisible-watermark>=0.2.0 ``` @@ -84,7 +84,8 @@ pipeline_text2image = AutoPipelineForText2Image.from_pretrained( ).to("cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipeline(prompt=prompt).images[0] +image = pipeline_text2image(prompt=prompt).images[0] +image ```
@@ -96,16 +97,17 @@ image = pipeline(prompt=prompt).images[0] For image-to-image, SDXL works especially well with image sizes between 768x768 and 1024x1024. Pass an initial image, and a text prompt to condition the image with: ```py -from diffusers import AutoPipelineForImg2Img -from diffusers.utils import load_image +from diffusers import AutoPipelineForImage2Image +from diffusers.utils import load_image, make_image_grid # use from_pipe to avoid consuming additional memory when loading a checkpoint pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda") -url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-img2img.png" -init_image = load_image(url).convert("RGB") +url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" +init_image = load_image(url) prompt = "a dog catching a frisbee in the jungle" image = pipeline(prompt, image=init_image, strength=0.8, guidance_scale=10.5).images[0] +make_image_grid([init_image, image], rows=1, cols=2) ```
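Since `strength` controls how much of the initial image is preserved, it can be instructive to sweep a few values. This is only a sketch that reuses `pipeline`, `init_image`, and `prompt` from the snippet above; the particular strength values are arbitrary:

```py
from diffusers.utils import make_image_grid

# assumes `pipeline`, `init_image`, and `prompt` from the image-to-image snippet above
images = []
for strength in (0.3, 0.5, 0.8):
    # lower strength stays closer to init_image; higher strength follows the prompt more freely
    images.append(pipeline(prompt, image=init_image, strength=strength, guidance_scale=10.5).images[0])

make_image_grid([init_image, *images], rows=1, cols=4)
```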
@@ -118,7 +120,7 @@ For inpainting, you'll need the original image and a mask of what you want to re ```py from diffusers import AutoPipelineForInpainting -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid # use from_pipe to avoid consuming additional memory when loading a checkpoint pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") @@ -126,11 +128,12 @@ pipeline = AutoPipelineForInpainting.from_pipe(pipeline_text2image).to("cuda") img_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-text2img.png" mask_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/sdxl-inpaint-mask.png" -init_image = load_image(img_url).convert("RGB") -mask_image = load_image(mask_url).convert("RGB") +init_image = load_image(img_url) +mask_image = load_image(mask_url) prompt = "A deep sea diver floating" image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strength=0.85, guidance_scale=12.5).images[0] +make_image_grid([init_image, mask_image, image], rows=1, cols=3) ```
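If you want several candidate fills for the same masked region, the call above should also accept the standard `num_images_per_prompt` argument; a sketch reusing `pipeline`, `init_image`, `mask_image`, and `prompt` from the snippet above:

```py
from diffusers.utils import make_image_grid

# assumes `pipeline`, `init_image`, `mask_image`, and `prompt` from the inpainting snippet above
images = pipeline(
    prompt=prompt,
    image=init_image,
    mask_image=mask_image,
    strength=0.85,
    guidance_scale=12.5,
    num_images_per_prompt=3,  # generate three candidates for the masked area
).images
make_image_grid([init_image, mask_image, *images], rows=1, cols=5)
```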
@@ -141,12 +144,12 @@ image = pipeline(prompt=prompt, image=init_image, mask_image=mask_image, strengt SDXL includes a [refiner model](https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0) specialized in denoising low-noise stage images to generate higher-quality images from the base model. There are two ways to use the refiner: -1. use the base and refiner model together to produce a refined image -2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL is originally trained) +1. use the base and refiner models together to produce a refined image +2. use the base model to produce an image, and subsequently use the refiner model to add more details to the image (this is how SDXL was originally trained) ### Base + refiner model -When you use the base and refiner model together to generate an image, this is known as an ([*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/)). The ensemble of expert denoisers approach requires less overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise. +When you use the base and refiner model together to generate an image, this is known as an [*ensemble of expert denoisers*](https://research.nvidia.com/labs/dir/eDiff-I/). The ensemble of expert denoisers approach requires fewer overall denoising steps versus passing the base model's output to the refiner model, so it should be significantly faster to run. However, you won't be able to inspect the base model's output because it still contains a large amount of noise. As an ensemble of expert denoisers, the base model serves as the expert during the high-noise diffusion stage and the refiner model serves as the expert during the low-noise diffusion stage. Load the base and refiner model: @@ -193,12 +196,13 @@ image = refiner( denoising_start=0.8, image=image, ).images[0] +image ```
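The key invariant in the ensemble-of-expert-denoisers setup above is that the base model's `denoising_end` equals the refiner's `denoising_start`. A small sketch that keeps the split fraction in one variable (it assumes `base` and `refiner` are loaded as shown earlier; the prompt is just an example):

```py
# assumes `base` and `refiner` pipelines loaded as shown above
high_noise_frac = 0.8  # base handles the first 80% of the noise schedule, refiner the remaining 20%
prompt = "A majestic lion jumping from a big stone at night"

latents = base(
    prompt=prompt,
    num_inference_steps=40,
    denoising_end=high_noise_frac,
    output_type="latent",  # hand latents, not decoded images, to the refiner
).images

image = refiner(
    prompt=prompt,
    num_inference_steps=40,
    denoising_start=high_noise_frac,
    image=latents,
).images[0]
image
```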
[figure: generated image of a lion on a rock at night]
-base model
+default base model
generated image of a lion on a rock at night in higher quality @@ -210,7 +214,8 @@ The refiner model can also be used for inpainting in the [`StableDiffusionXLInpa ```py from diffusers import StableDiffusionXLInpaintPipeline -from diffusers.utils import load_image +from diffusers.utils import load_image, make_image_grid +import torch base = StableDiffusionXLInpaintPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True @@ -218,8 +223,8 @@ base = StableDiffusionXLInpaintPipeline.from_pretrained( refiner = StableDiffusionXLInpaintPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=pipe.text_encoder_2, - vae=pipe.vae, + text_encoder_2=base.text_encoder_2, + vae=base.vae, torch_dtype=torch.float16, use_safetensors=True, variant="fp16", @@ -228,8 +233,8 @@ refiner = StableDiffusionXLInpaintPipeline.from_pretrained( img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" -init_image = load_image(img_url).convert("RGB") -mask_image = load_image(mask_url).convert("RGB") +init_image = load_image(img_url) +mask_image = load_image(mask_url) prompt = "A majestic tiger sitting on a bench" num_inference_steps = 75 @@ -250,6 +255,7 @@ image = refiner( num_inference_steps=num_inference_steps, denoising_start=high_noise_frac, ).images[0] +make_image_grid([init_image, mask_image, image.resize((512, 512))], rows=1, cols=3) ``` This ensemble of expert denoisers method works well for all available schedulers! @@ -270,8 +276,8 @@ base = DiffusionPipeline.from_pretrained( refiner = DiffusionPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-refiner-1.0", - text_encoder_2=pipe.text_encoder_2, - vae=pipe.vae, + text_encoder_2=base.text_encoder_2, + vae=base.vae, torch_dtype=torch.float16, use_safetensors=True, variant="fp16", @@ -303,7 +309,7 @@ image = refiner(prompt=prompt, image=image[None, :]).images[0]
-For inpainting, load the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. +For inpainting, load the base and the refiner model in the [`StableDiffusionXLInpaintPipeline`], remove the `denoising_end` and `denoising_start` parameters, and choose a smaller number of inference steps for the refiner. ## Micro-conditioning @@ -343,7 +349,7 @@ image = pipe(
-Images negative conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).
+Images negatively conditioned on image resolutions of (128, 128), (256, 256), and (512, 512).
### Crop conditioning @@ -354,13 +360,13 @@ Images generated by previous Stable Diffusion models may sometimes appear to be from diffusers import StableDiffusionXLPipeline import torch - pipeline = StableDiffusionXLPipeline.from_pretrained( "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True ).to("cuda") prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" -image = pipeline(prompt=prompt, crops_coords_top_left=(256,0)).images[0] +image = pipeline(prompt=prompt, crops_coords_top_left=(256, 0)).images[0] +image ```
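To see the effect of `crops_coords_top_left`, it can help to place the default, centered generation next to the shifted one; a sketch reusing `pipeline` and `prompt` from the snippet above:

```py
from diffusers.utils import make_image_grid

# assumes `pipeline` and `prompt` from the crop-conditioning snippet above
centered = pipeline(prompt=prompt, crops_coords_top_left=(0, 0)).images[0]   # default: well-centered subject
shifted = pipeline(prompt=prompt, crops_coords_top_left=(256, 0)).images[0]  # appears cropped compared to the default
make_image_grid([centered, shifted], rows=1, cols=2)
```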
@@ -384,11 +390,12 @@ image = pipe( negative_crops_coords_top_left=(0, 0), negative_target_size=(1024, 1024), ).images[0] +image ``` ## Use a different prompt for each text-encoder -SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using a negative prompts): +SDXL uses two text-encoders, so it is possible to pass a different prompt to each text-encoder, which can [improve quality](https://github.com/huggingface/diffusers/issues/4004#issuecomment-1627764201). Pass your original prompt to `prompt` and the second prompt to `prompt_2` (use `negative_prompt` and `negative_prompt_2` if you're using negative prompts): ```py from diffusers import StableDiffusionXLPipeline @@ -403,13 +410,14 @@ prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k" # prompt_2 is passed to OpenCLIP-ViT/bigG-14 prompt_2 = "Van Gogh painting" image = pipeline(prompt=prompt, prompt_2=prompt_2).images[0] +image ```
[figure: generated image of an astronaut in a jungle in the style of a van gogh painting]
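Negative prompts can be routed the same way; a sketch reusing `pipeline`, `prompt`, and `prompt_2` from the snippet above, where the negative prompt strings are arbitrary examples:

```py
# assumes `pipeline`, `prompt`, and `prompt_2` from the dual-prompt snippet above
negative_prompt = "blurry, low quality"          # goes to the first text encoder, like `prompt`
negative_prompt_2 = "photorealistic, cluttered"  # goes to the second text encoder, like `prompt_2`

image = pipeline(
    prompt=prompt,
    prompt_2=prompt_2,
    negative_prompt=negative_prompt,
    negative_prompt_2=negative_prompt_2,
).images[0]
image
```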
-The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl] section. +The dual text-encoders also support textual inversion embeddings that need to be loaded separately as explained in the [SDXL textual inversion](textual_inversion_inference#stable-diffusion-xl) section. ## Optimizations @@ -420,18 +428,18 @@ SDXL is a large model, and you may need to optimize memory to get it to run on y ```diff - base.to("cuda") - refiner.to("cuda") -+ base.enable_model_cpu_offload -+ refiner.enable_model_cpu_offload ++ base.enable_model_cpu_offload() ++ refiner.enable_model_cpu_offload() ``` -2. Use `torch.compile` for ~20% speed-up (you need `torch>2.0`): +2. Use `torch.compile` for ~20% speed-up (you need `torch>=2.0`): ```diff + base.unet = torch.compile(base.unet, mode="reduce-overhead", fullgraph=True) + refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True) ``` -3. Enable [xFormers](/optimization/xformers) to run SDXL if `torch<2.0`: +3. Enable [xFormers](../optimization/xformers) to run SDXL if `torch<2.0`: ```diff + base.enable_xformers_memory_efficient_attention() diff --git a/docs/source/en/using-diffusers/shap-e.md b/docs/source/en/using-diffusers/shap-e.md index b5ba7923049d..f0ce977584a5 100644 --- a/docs/source/en/using-diffusers/shap-e.md +++ b/docs/source/en/using-diffusers/shap-e.md @@ -16,7 +16,7 @@ specific language governing permissions and limitations under the License. Shap-E is a conditional model for generating 3D assets which could be used for video game development, interior design, and architecture. It is trained on a large dataset of 3D assets, and post-processed to render more views of each object and produce 16K instead of 4K point clouds. The Shap-E model is trained in two steps: -1. a encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset +1. an encoder accepts the point clouds and rendered views of a 3D asset and outputs the parameters of implicit functions that represent the asset 2. a diffusion model is trained on the latents produced by the encoder to generate either neural radiance fields (NeRFs) or a textured 3D mesh, making it easier to render and use the 3D asset in downstream applications This guide will show you how to use Shap-E to start generating your own 3D assets! @@ -25,7 +25,7 @@ Before you begin, make sure you have the following libraries installed: ```py # uncomment to install the necessary libraries in Colab -#!pip install diffusers transformers accelerate safetensors trimesh +#!pip install -q diffusers transformers accelerate trimesh ``` ## Text-to-3D @@ -38,7 +38,7 @@ from diffusers import ShapEPipeline device = torch.device("cuda" if torch.cuda.is_available() else "cpu") -pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True) +pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16") pipe = pipe.to(device) guidance_scale = 15.0 @@ -64,11 +64,11 @@ export_to_gif(images[1], "cake_3d.gif")
-firecracker
+prompt = "A firecracker"
-cupcake
+prompt = "A birthday cupcake"
@@ -99,6 +99,7 @@ Pass the cheeseburger to the [`ShapEImg2ImgPipeline`] to generate a 3D represent ```py from PIL import Image +from diffusers import ShapEImg2ImgPipeline from diffusers.utils import export_to_gif pipe = ShapEImg2ImgPipeline.from_pretrained("openai/shap-e-img2img", torch_dtype=torch.float16, variant="fp16").to("cuda") @@ -139,7 +140,7 @@ from diffusers import ShapEPipeline device = torch.device("cuda" if torch.cuda.is_available() else "cpu") -pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16", use_safetensors=True) +pipe = ShapEPipeline.from_pretrained("openai/shap-e", torch_dtype=torch.float16, variant="fp16") pipe = pipe.to(device) guidance_scale = 15.0 @@ -160,7 +161,7 @@ You can optionally save the mesh output as an `obj` file with the [`~utils.expor from diffusers.utils import export_to_ply ply_path = export_to_ply(images[0], "3d_cake.ply") -print(f"saved to folder: {ply_path}") +print(f"Saved to folder: {ply_path}") ``` Then you can convert the `ply` file to a `glb` file with the trimesh library: @@ -169,7 +170,7 @@ Then you can convert the `ply` file to a `glb` file with the trimesh library: import trimesh mesh = trimesh.load("3d_cake.ply") -mesh.export("3d_cake.glb", file_type="glb") +mesh_export = mesh.export("3d_cake.glb", file_type="glb") ``` By default, the mesh output is focused from the bottom viewpoint but you can change the default viewpoint by applying a rotation transform: @@ -181,11 +182,11 @@ import numpy as np mesh = trimesh.load("3d_cake.ply") rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0]) mesh = mesh.apply_transform(rot) -mesh.export("3d_cake.glb", file_type="glb") +mesh_export = mesh.export("3d_cake.glb", file_type="glb") ``` Upload the mesh file to your dataset repository to visualize it with the Dataset viewer!
-
\ No newline at end of file +
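Before uploading the converted mesh, it can be worth sanity-checking the rotation; a sketch that assumes `3d_cake.ply` was exported as in the snippet above and only inspects the mesh extents before and after the transform:

```py
import numpy as np
import trimesh

# assumes 3d_cake.ply was exported with export_to_ply as shown above
mesh = trimesh.load("3d_cake.ply")
print("extents before rotation:", mesh.extents)

# rotate -90 degrees around the x-axis so the default viewpoint is no longer from the bottom
rot = trimesh.transformations.rotation_matrix(-np.pi / 2, [1, 0, 0])
mesh = mesh.apply_transform(rot)
print("extents after rotation:", mesh.extents)

mesh_export = mesh.export("3d_cake.glb", file_type="glb")
```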
diff --git a/docs/source/en/using-diffusers/textual_inversion_inference.md b/docs/source/en/using-diffusers/textual_inversion_inference.md index 7583dee63e3b..084101c06ba3 100644 --- a/docs/source/en/using-diffusers/textual_inversion_inference.md +++ b/docs/source/en/using-diffusers/textual_inversion_inference.md @@ -95,7 +95,7 @@ state_dict ``` There are two tensors, `"clip_g"` and `"clip_l"`. -`"clip_g"` corresponds to the bigger text encoder in SDXL and refers to +`"clip_g"` corresponds to the bigger text encoder in SDXL and refers to `pipe.text_encoder_2` and `"clip_l"` refers to `pipe.text_encoder`. Now you can load each tensor separately by passing them along with the correct text encoder and tokenizer diff --git a/docs/source/en/using-diffusers/unconditional_image_generation.md b/docs/source/en/using-diffusers/unconditional_image_generation.md index c055bc75c5a4..1983f6981e8f 100644 --- a/docs/source/en/using-diffusers/unconditional_image_generation.md +++ b/docs/source/en/using-diffusers/unconditional_image_generation.md @@ -35,7 +35,7 @@ from diffusers import DiffusionPipeline generator = DiffusionPipeline.from_pretrained("anton-l/ddpm-butterflies-128", use_safetensors=True) ``` -The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components. +The [`DiffusionPipeline`] downloads and caches all modeling, tokenization, and scheduling components. Because the model consists of roughly 1.4 billion parameters, we strongly recommend running it on a GPU. You can move the generator object to a GPU, just like you would in PyTorch: diff --git a/docs/source/en/using-diffusers/weighted_prompts.md b/docs/source/en/using-diffusers/weighted_prompts.md index 5007d235ae99..947d18b86ec8 100644 --- a/docs/source/en/using-diffusers/weighted_prompts.md +++ b/docs/source/en/using-diffusers/weighted_prompts.md @@ -142,7 +142,7 @@ image ## Conjunction A conjunction diffuses each prompt independently and concatenates their results by their weighted sum. Add `.and()` to the end of a list of prompts to create a conjunction: - + ```py prompt_embeds = compel_proc('["a red cat", "playing with a", "ball"].and()') generator = torch.Generator(device="cuda").manual_seed(55) diff --git a/docs/source/en/using-diffusers/write_own_pipeline.md b/docs/source/en/using-diffusers/write_own_pipeline.md index 38fc9e6457dd..4ca3fe33223b 100644 --- a/docs/source/en/using-diffusers/write_own_pipeline.md +++ b/docs/source/en/using-diffusers/write_own_pipeline.md @@ -14,7 +14,7 @@ specific language governing permissions and limitations under the License. [[open-in-colab]] -🧨 Diffusers is designed to be a user-friendly and flexible toolbox for building diffusion systems tailored to your use-case. At the core of the toolbox are models and schedulers. While the [`DiffusionPipeline`] bundles these components together for convenience, you can also unbundle the pipeline and use the models and schedulers separately to create new diffusion systems. +🧨 Diffusers is designed to be a user-friendly and flexible toolbox for building diffusion systems tailored to your use-case. At the core of the toolbox are models and schedulers. While the [`DiffusionPipeline`] bundles these components together for convenience, you can also unbundle the pipeline and use the models and schedulers separately to create new diffusion systems. 
In this tutorial, you'll learn how to use models and schedulers to assemble a diffusion system for inference, starting with a basic pipeline and then progressing to the Stable Diffusion pipeline. @@ -36,7 +36,7 @@ A pipeline is a quick and easy way to run a model for inference, requiring no mo That was super easy, but how did the pipeline do that? Let's breakdown the pipeline and take a look at what's happening under the hood. -In the example above, the pipeline contains a [`UNet2DModel`] model and a [`DDPMScheduler`]. The pipeline denoises an image by taking random noise the size of the desired output and passing it through the model several times. At each timestep, the model predicts the *noise residual* and the scheduler uses it to predict a less noisy image. The pipeline repeats this process until it reaches the end of the specified number of inference steps. +In the example above, the pipeline contains a [`UNet2DModel`] model and a [`DDPMScheduler`]. The pipeline denoises an image by taking random noise the size of the desired output and passing it through the model several times. At each timestep, the model predicts the *noise residual* and the scheduler uses it to predict a less noisy image. The pipeline repeats this process until it reaches the end of the specified number of inference steps. To recreate the pipeline with the model and scheduler separately, let's write our own denoising process. @@ -71,7 +71,7 @@ tensor([980, 960, 940, 920, 900, 880, 860, 840, 820, 800, 780, 760, 740, 720, >>> import torch >>> sample_size = model.config.sample_size ->>> noise = torch.randn((1, 3, sample_size, sample_size)).to("cuda") +>>> noise = torch.randn((1, 3, sample_size, sample_size), device="cuda") ``` 5. Now write a loop to iterate over the timesteps. At each timestep, the model does a [`UNet2DModel.forward`] pass and returns the noisy residual. The scheduler's [`~DDPMScheduler.step`] method takes the noisy residual, timestep, and input and it predicts the image at the previous timestep. This output becomes the next input to the model in the denoising loop, and it'll repeat until it reaches the end of the `timesteps` array. @@ -153,7 +153,7 @@ To speed up inference, move the models to a GPU since, unlike the scheduler, the ### Create text embeddings -The next step is to tokenize the text to generate embeddings. The text is used to condition the UNet model and steer the diffusion process towards something that resembles the input prompt. +The next step is to tokenize the text to generate embeddings. The text is used to condition the UNet model and steer the diffusion process towards something that resembles the input prompt. @@ -216,8 +216,8 @@ Next, generate some initial random noise as a starting point for the diffusion p >>> latents = torch.randn( ... (batch_size, unet.config.in_channels, height // 8, width // 8), ... generator=generator, +... device=torch_device, ... ) ->>> latents = latents.to(torch_device) ``` ### Denoise the image @@ -284,7 +284,7 @@ Lastly, convert the image to a `PIL.Image` to see your generated image! ## Next steps -From basic to complex pipelines, you've seen that all you really need to write your own diffusion system is a denoising loop. The loop should set the scheduler's timesteps, iterate over them, and alternate between calling the UNet model to predict the noise residual and passing it to the scheduler to compute the previous noisy sample. 
+From basic to complex pipelines, you've seen that all you really need to write your own diffusion system is a denoising loop. The loop should set the scheduler's timesteps, iterate over them, and alternate between calling the UNet model to predict the noise residual and passing it to the scheduler to compute the previous noisy sample. This is really what 🧨 Diffusers is designed for: to make it intuitive and easy to write your own diffusion system using models and schedulers. diff --git a/docs/source/ko/optimization/fp16.md b/docs/source/ko/optimization/fp16.md index 30197305540c..0f2c487a75ce 100644 --- a/docs/source/ko/optimization/fp16.md +++ b/docs/source/ko/optimization/fp16.md @@ -273,9 +273,9 @@ unet_runs_per_experiment = 50 # μž…λ ₯ 뢈러였기 def generate_inputs(): - sample = torch.randn(2, 4, 64, 64).half().cuda() - timestep = torch.rand(1).half().cuda() * 999 - encoder_hidden_states = torch.randn(2, 77, 768).half().cuda() + sample = torch.randn((2, 4, 64, 64), device="cuda", dtype=torch.float16) + timestep = torch.rand(1, device="cuda", dtype=torch.float16) * 999 + encoder_hidden_states = torch.randn((2, 77, 768), device="cuda", dtype=torch.float16) return sample, timestep, encoder_hidden_states diff --git a/docs/source/ko/tutorials/basic_training.md b/docs/source/ko/tutorials/basic_training.md index a4e5e2a0c8bb..df5e74c22ca8 100644 --- a/docs/source/ko/tutorials/basic_training.md +++ b/docs/source/ko/tutorials/basic_training.md @@ -322,13 +322,13 @@ TensorBoard에 λ‘œκΉ…, κ·Έλž˜λ””μ–ΈνŠΈ λˆ„μ  및 ν˜Όν•© 정밀도 ν•™μŠ΅μ„ 쉽 ... for step, batch in enumerate(train_dataloader): ... clean_images = batch["images"] ... # 이미지에 더할 λ…Έμ΄μ¦ˆλ₯Ό μƒ˜ν”Œλ§ν•©λ‹ˆλ‹€. -... noise = torch.randn(clean_images.shape).to(clean_images.device) +... noise = torch.randn(clean_images.shape, device=clean_images.device) ... bs = clean_images.shape[0] ... # 각 이미지λ₯Ό μœ„ν•œ λžœλ€ν•œ νƒ€μž„μŠ€ν…(timestep)을 μƒ˜ν”Œλ§ν•©λ‹ˆλ‹€. ... timesteps = torch.randint( ... 0, noise_scheduler.config.num_train_timesteps, (bs,), device=clean_images.device -... ).long() +... ) ... # 각 νƒ€μž„μŠ€ν…μ˜ λ…Έμ΄μ¦ˆ 크기에 따라 κΉ¨λ—ν•œ 이미지에 λ…Έμ΄μ¦ˆλ₯Ό μΆ”κ°€ν•©λ‹ˆλ‹€. ... # (μ΄λŠ” foward diffusion κ³Όμ •μž…λ‹ˆλ‹€.) diff --git a/docs/source/ko/using-diffusers/write_own_pipeline.md b/docs/source/ko/using-diffusers/write_own_pipeline.md index a6469644566c..787c8113bf0d 100644 --- a/docs/source/ko/using-diffusers/write_own_pipeline.md +++ b/docs/source/ko/using-diffusers/write_own_pipeline.md @@ -71,7 +71,7 @@ specific language governing permissions and limitations under the License. >>> import torch >>> sample_size = model.config.sample_size - >>> noise = torch.randn((1, 3, sample_size, sample_size)).to("cuda") + >>> noise = torch.randn((1, 3, sample_size, sample_size), device="cuda") ``` 5. 이제 timestep을 λ°˜λ³΅ν•˜λŠ” 루프λ₯Ό μž‘μ„±ν•©λ‹ˆλ‹€. 각 timestepμ—μ„œ λͺ¨λΈμ€ [`UNet2DModel.forward`]λ₯Ό 톡해 noisy residual을 λ°˜ν™˜ν•©λ‹ˆλ‹€. μŠ€μΌ€μ€„λŸ¬μ˜ [`~DDPMScheduler.step`] λ©”μ„œλ“œλŠ” noisy residual, timestep, 그리고 μž…λ ₯을 λ°›μ•„ 이전 timestepμ—μ„œ 이미지λ₯Ό μ˜ˆμΈ‘ν•©λ‹ˆλ‹€. 이 좜λ ₯은 λ…Έμ΄μ¦ˆ 제거 λ£¨ν”„μ˜ λͺ¨λΈμ— λŒ€ν•œ λ‹€μŒ μž…λ ₯이 되며, `timesteps` λ°°μ—΄μ˜ 끝에 도달할 λ•ŒκΉŒμ§€ λ°˜λ³΅λ©λ‹ˆλ‹€. @@ -212,8 +212,8 @@ Stable Diffusion 은 text-to-image *latent diffusion* λͺ¨λΈμž…λ‹ˆλ‹€. latent di >>> latents = torch.randn( ... (batch_size, unet.in_channels, height // 8, width // 8), ... generator=generator, +... device=torch_device, ... 
) ->>> latents = latents.to(torch_device) ``` ### 이미지 λ…Έμ΄μ¦ˆ 제거 diff --git a/examples/community/checkpoint_merger.py b/examples/community/checkpoint_merger.py index 02e8684e6ade..10381020bf63 100644 --- a/examples/community/checkpoint_merger.py +++ b/examples/community/checkpoint_merger.py @@ -13,7 +13,7 @@ class CheckpointMergerPipeline(DiffusionPipeline): """ - A class that that supports merging diffusion models based on the discussion here: + A class that supports merging diffusion models based on the discussion here: https://github.com/huggingface/diffusers/issues/877 Example usage:- diff --git a/examples/community/ddim_noise_comparative_analysis.py b/examples/community/ddim_noise_comparative_analysis.py index e1633ce4636b..482c0a5826d2 100644 --- a/examples/community/ddim_noise_comparative_analysis.py +++ b/examples/community/ddim_noise_comparative_analysis.py @@ -18,7 +18,7 @@ import torch from torchvision import transforms -from diffusers.pipeline_utils import DiffusionPipeline, ImagePipelineOutput +from diffusers.pipelines.pipeline_utils import DiffusionPipeline, ImagePipelineOutput from diffusers.schedulers import DDIMScheduler from diffusers.utils.torch_utils import randn_tensor diff --git a/examples/community/iadb.py b/examples/community/iadb.py index 1f421ee0ea4c..6089e49fc621 100644 --- a/examples/community/iadb.py +++ b/examples/community/iadb.py @@ -4,7 +4,7 @@ from diffusers import DiffusionPipeline from diffusers.configuration_utils import ConfigMixin -from diffusers.pipeline_utils import ImagePipelineOutput +from diffusers.pipelines.pipeline_utils import ImagePipelineOutput from diffusers.schedulers.scheduling_utils import SchedulerMixin diff --git a/examples/community/lpw_stable_diffusion_xl.py b/examples/community/lpw_stable_diffusion_xl.py index 66e2ffb159a1..abf066f1b3f4 100644 --- a/examples/community/lpw_stable_diffusion_xl.py +++ b/examples/community/lpw_stable_diffusion_xl.py @@ -249,6 +249,7 @@ def get_weighted_text_embeddings_sdxl( prompt_2: str = None, neg_prompt: str = "", neg_prompt_2: str = None, + num_images_per_prompt: int = 1, ): """ This function can process long prompt with weights, no length limitation @@ -260,6 +261,7 @@ def get_weighted_text_embeddings_sdxl( prompt_2 (str) neg_prompt (str) neg_prompt_2 (str) + num_images_per_prompt (int) Returns: prompt_embeds (torch.Tensor) neg_prompt_embeds (torch.Tensor) @@ -383,6 +385,22 @@ def get_weighted_text_embeddings_sdxl( prompt_embeds = torch.cat(embeds, dim=1) negative_prompt_embeds = torch.cat(neg_embeds, dim=1) + bs_embed, seq_len, _ = prompt_embeds.shape + # duplicate text embeddings for each generation per prompt, using mps friendly method + prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1) + prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1) + + seq_len = negative_prompt_embeds.shape[1] + negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_images_per_prompt, 1) + negative_prompt_embeds = negative_prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1) + + pooled_prompt_embeds = pooled_prompt_embeds.repeat(1, num_images_per_prompt, 1).view( + bs_embed * num_images_per_prompt, -1 + ) + negative_pooled_prompt_embeds = negative_pooled_prompt_embeds.repeat(1, num_images_per_prompt, 1).view( + bs_embed * num_images_per_prompt, -1 + ) + return prompt_embeds, negative_prompt_embeds, pooled_prompt_embeds, negative_pooled_prompt_embeds @@ -1096,7 +1114,9 @@ def __call__( negative_prompt_embeds, pooled_prompt_embeds, 
negative_pooled_prompt_embeds, - ) = get_weighted_text_embeddings_sdxl(pipe=self, prompt=prompt, neg_prompt=negative_prompt) + ) = get_weighted_text_embeddings_sdxl( + pipe=self, prompt=prompt, neg_prompt=negative_prompt, num_images_per_prompt=num_images_per_prompt + ) # 4. Prepare timesteps self.scheduler.set_timesteps(num_inference_steps, device=device) diff --git a/examples/community/mixture_canvas.py b/examples/community/mixture_canvas.py index 40139d1139ad..46daa920ba97 100644 --- a/examples/community/mixture_canvas.py +++ b/examples/community/mixture_canvas.py @@ -12,7 +12,7 @@ from transformers import CLIPFeatureExtractor, CLIPTextModel, CLIPTokenizer from diffusers.models import AutoencoderKL, UNet2DConditionModel -from diffusers.pipeline_utils import DiffusionPipeline +from diffusers.pipelines.pipeline_utils import DiffusionPipeline from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker from diffusers.schedulers import DDIMScheduler, LMSDiscreteScheduler, PNDMScheduler diff --git a/examples/community/mixture_tiling.py b/examples/community/mixture_tiling.py index 3e701cf607f5..f92ae0e1d359 100644 --- a/examples/community/mixture_tiling.py +++ b/examples/community/mixture_tiling.py @@ -7,7 +7,7 @@ from tqdm.auto import tqdm from diffusers.models import AutoencoderKL, UNet2DConditionModel -from diffusers.pipeline_utils import DiffusionPipeline +from diffusers.pipelines.pipeline_utils import DiffusionPipeline from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker from diffusers.schedulers import DDIMScheduler, LMSDiscreteScheduler, PNDMScheduler from diffusers.utils import logging diff --git a/examples/community/pipeline_fabric.py b/examples/community/pipeline_fabric.py index c5783402b36c..080d0c221727 100644 --- a/examples/community/pipeline_fabric.py +++ b/examples/community/pipeline_fabric.py @@ -14,7 +14,6 @@ from typing import List, Optional, Union import torch -from diffuser.utils.torch_utils import randn_tensor from packaging import version from PIL import Image from transformers import CLIPTextModel, CLIPTokenizer @@ -33,6 +32,7 @@ logging, replace_example_docstring, ) +from diffusers.utils.torch_utils import randn_tensor logger = logging.get_logger(__name__) # pylint: disable=invalid-name diff --git a/examples/community/stable_diffusion_controlnet_img2img.py b/examples/community/stable_diffusion_controlnet_img2img.py index 550aa8ba61a3..a2b92fff0fb5 100644 --- a/examples/community/stable_diffusion_controlnet_img2img.py +++ b/examples/community/stable_diffusion_controlnet_img2img.py @@ -9,8 +9,8 @@ from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer from diffusers import AutoencoderKL, ControlNetModel, DiffusionPipeline, UNet2DConditionModel, logging +from diffusers.pipelines.controlnet.multicontrolnet import MultiControlNetModel from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput, StableDiffusionSafetyChecker -from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_controlnet import MultiControlNetModel from diffusers.schedulers import KarrasDiffusionSchedulers from diffusers.utils import ( PIL_INTERPOLATION, diff --git a/examples/community/stable_diffusion_controlnet_inpaint.py b/examples/community/stable_diffusion_controlnet_inpaint.py index 30903bbf66bf..b87973366418 100644 --- a/examples/community/stable_diffusion_controlnet_inpaint.py +++ b/examples/community/stable_diffusion_controlnet_inpaint.py @@ -10,8 +10,8 @@ from transformers import CLIPImageProcessor, 
CLIPTextModel, CLIPTokenizer from diffusers import AutoencoderKL, ControlNetModel, DiffusionPipeline, UNet2DConditionModel, logging +from diffusers.pipelines.controlnet.multicontrolnet import MultiControlNetModel from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput, StableDiffusionSafetyChecker -from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion_controlnet import MultiControlNetModel from diffusers.schedulers import KarrasDiffusionSchedulers from diffusers.utils import ( PIL_INTERPOLATION, diff --git a/examples/community/stable_diffusion_ipex.py b/examples/community/stable_diffusion_ipex.py index 58fe362f4a2f..fef075a84b05 100644 --- a/examples/community/stable_diffusion_ipex.py +++ b/examples/community/stable_diffusion_ipex.py @@ -23,7 +23,7 @@ from diffusers.configuration_utils import FrozenDict from diffusers.loaders import TextualInversionLoaderMixin from diffusers.models import AutoencoderKL, UNet2DConditionModel -from diffusers.pipeline_utils import DiffusionPipeline +from diffusers.pipelines.pipeline_utils import DiffusionPipeline from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker from diffusers.schedulers import KarrasDiffusionSchedulers diff --git a/examples/dreambooth/train_dreambooth_lora.py b/examples/dreambooth/train_dreambooth_lora.py index a82d880ff5b1..d10e62ac8def 100644 --- a/examples/dreambooth/train_dreambooth_lora.py +++ b/examples/dreambooth/train_dreambooth_lora.py @@ -51,16 +51,13 @@ StableDiffusionPipeline, UNet2DConditionModel, ) -from diffusers.loaders import ( - LoraLoaderMixin, - text_encoder_lora_state_dict, -) +from diffusers.loaders import LoraLoaderMixin from diffusers.models.attention_processor import ( AttnAddedKVProcessor, AttnAddedKVProcessor2_0, SlicedAttnAddedKVProcessor, ) -from diffusers.models.lora import LoRALinearLayer +from diffusers.models.lora import LoRALinearLayer, text_encoder_lora_state_dict from diffusers.optimization import get_scheduler from diffusers.training_utils import unet_lora_state_dict from diffusers.utils import check_min_version, is_wandb_available diff --git a/examples/dreambooth/train_dreambooth_lora_sdxl.py b/examples/dreambooth/train_dreambooth_lora_sdxl.py index 002e01b28405..ef2020398b2d 100644 --- a/examples/dreambooth/train_dreambooth_lora_sdxl.py +++ b/examples/dreambooth/train_dreambooth_lora_sdxl.py @@ -49,8 +49,8 @@ StableDiffusionXLPipeline, UNet2DConditionModel, ) -from diffusers.loaders import LoraLoaderMixin, text_encoder_lora_state_dict -from diffusers.models.lora import LoRALinearLayer +from diffusers.loaders import LoraLoaderMixin +from diffusers.models.lora import LoRALinearLayer, text_encoder_lora_state_dict from diffusers.optimization import get_scheduler from diffusers.training_utils import unet_lora_state_dict from diffusers.utils import check_min_version, is_wandb_available diff --git a/examples/text_to_image/train_text_to_image_lora_sdxl.py b/examples/text_to_image/train_text_to_image_lora_sdxl.py index b69940603128..bff928541f57 100644 --- a/examples/text_to_image/train_text_to_image_lora_sdxl.py +++ b/examples/text_to_image/train_text_to_image_lora_sdxl.py @@ -49,8 +49,8 @@ StableDiffusionXLPipeline, UNet2DConditionModel, ) -from diffusers.loaders import LoraLoaderMixin, text_encoder_lora_state_dict -from diffusers.models.lora import LoRALinearLayer +from diffusers.loaders import LoraLoaderMixin +from diffusers.models.lora import 
LoRALinearLayer, text_encoder_lora_state_dict from diffusers.optimization import get_scheduler from diffusers.training_utils import compute_snr from diffusers.utils import check_min_version, is_wandb_available diff --git a/setup.py b/setup.py index f4b14aee49e5..9bed326b441d 100644 --- a/setup.py +++ b/setup.py @@ -15,12 +15,12 @@ """ Simple check list from AllenNLP repo: https://github.com/allenai/allennlp/blob/main/setup.py -To create the package for pypi. +To create the package for PyPI. 1. Run `make pre-release` (or `make pre-patch` for a patch release) then run `make fix-copies` to fix the index of the documentation. - If releasing on a special branch, copy the updated README.md on the main branch for your the commit you will make + If releasing on a special branch, copy the updated README.md on the main branch for the commit you will make for the post-release and run `make fix-copies` on the main branch as well. 2. Run Tests for Amazon Sagemaker. The documentation is located in `./tests/sagemaker/README.md`, otherwise @philschmid. @@ -30,29 +30,29 @@ 4. Checkout the release branch (v-release, for example v4.19-release), and commit these changes with the message: "Release: " and push. -5. Wait for the tests on main to be completed and be green (otherwise revert and fix bugs) +5. Wait for the tests on main to be completed and be green (otherwise revert and fix bugs). -6. Add a tag in git to mark the release: "git tag v -m 'Adds tag v for pypi' " +6. Add a tag in git to mark the release: "git tag v -m 'Adds tag v for PyPI'" Push the tag to git: git push --tags origin v-release 7. Build both the sources and the wheel. Do not change anything in setup.py between creating the wheel and the source distribution (obviously). - For the wheel, run: "python setup.py bdist_wheel" in the top level directory. - (this will build a wheel for the python version you use to build it). + For the wheel, run: "python setup.py bdist_wheel" in the top level directory + (This will build a wheel for the Python version you use to build it). For the sources, run: "python setup.py sdist" You should now have a /dist directory with both .whl and .tar.gz source versions. Long story cut short, you need to run both before you can upload the distribution to the - test pypi and the actual pypi servers: + test PyPI and the actual PyPI servers: python setup.py bdist_wheel && python setup.py sdist -8. Check that everything looks correct by uploading the package to the pypi test server: +8. Check that everything looks correct by uploading the package to the PyPI test server: twine upload dist/* -r pypitest - (pypi suggest using twine as other methods upload files via plaintext.) + (pypi suggests using twine as other methods upload files via plaintext.) You may have to specify the repository url, use the following command then: twine upload dist/* -r pypitest --repository-url=https://test.pypi.org/legacy/ @@ -64,20 +64,21 @@ pip install -i https://testpypi.python.org/pypi diffusers Check you can run the following commands: - python -c "python -c "from diffusers import __version__; print(__version__)" + python -c "from diffusers import __version__; print(__version__)" python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('fusing/unet-ldm-dummy-update'); pipe()" python -c "from diffusers import DiffusionPipeline; pipe = DiffusionPipeline.from_pretrained('hf-internal-testing/tiny-stable-diffusion-pipe', safety_checker=None); pipe('ah suh du')" python -c "from diffusers import *" -9. 
Upload the final version to actual pypi: +9. Upload the final version to the actual PyPI: twine upload dist/* -r pypi -10. Prepare the release notes and publish them on github once everything is looking hunky-dory. +10. Prepare the release notes and publish them on GitHub once everything is looking hunky-dory. 11. Run `make post-release` (or, for a patch release, `make post-patch`). If you were on a branch for the release, you need to go back to main before executing this. """ +import sys import os import re from distutils.core import Command @@ -142,7 +143,7 @@ # anywhere. If you need to quickly access the data from this table in a shell, you can do so easily with: # # python -c 'import sys; from diffusers.dependency_versions_table import deps; \ -# print(" ".join([ deps[x] for x in sys.argv[1:]]))' tokenizers datasets +# print(" ".join([deps[x] for x in sys.argv[1:]]))' tokenizers datasets # # Just pass the desired package names to that script as it's shown with 2 packages above. # @@ -151,7 +152,7 @@ # You can then feed this for example to `pip`: # # pip install -U $(python -c 'import sys; from diffusers.dependency_versions_table import deps; \ -# print(" ".join([ deps[x] for x in sys.argv[1:]]))' tokenizers datasets) +# print(" ".join([deps[x] for x in sys.argv[1:]]))' tokenizers datasets) # @@ -182,7 +183,7 @@ def run(self): content = [ "# THIS FILE HAS BEEN AUTOGENERATED. To update:", "# 1. modify the `_deps` dict in setup.py", - "# 2. run `make deps_table_update``", + "# 2. run `make deps_table_update`", "deps = {", entries, "}", @@ -194,7 +195,6 @@ def run(self): f.write("\n".join(content)) -extras = {} extras = {} @@ -242,6 +242,8 @@ def run(self): deps["Pillow"], ] +version_range_max = max(sys.version_info[1], 10) + 1 + setup( name="diffusers", version="0.24.0.dev0", # expected format is one of x.y.z.dev0, or x.y.z.rc1 or x.y.z (no to dashes, yes to dots) @@ -268,30 +270,33 @@ def run(self): "Intended Audience :: Science/Research", "License :: OSI Approved :: Apache Software License", "Operating System :: OS Independent", - "Programming Language :: Python :: 3", - "Programming Language :: Python :: 3.8", - "Programming Language :: Python :: 3.9", "Topic :: Scientific/Engineering :: Artificial Intelligence", + "Programming Language :: Python :: 3", + ] + + [ + f"Programming Language :: Python :: 3.{i}" + for i in range(8, version_range_max) ], cmdclass={"deps_table_update": DepsTableUpdateCommand}, ) + # Release checklist # 1. Change the version in __init__.py and setup.py. # 2. Commit these changes with the message: "Release: Release" -# 3. Add a tag in git to mark the release: "git tag RELEASE -m 'Adds tag RELEASE for pypi' " +# 3. Add a tag in git to mark the release: "git tag RELEASE -m 'Adds tag RELEASE for PyPI'" # Push the tag to git: git push --tags origin main # 4. Run the following commands in the top-level directory: # python setup.py bdist_wheel # python setup.py sdist -# 5. Upload the package to the pypi test server first: +# 5. Upload the package to the PyPI test server first: # twine upload dist/* -r pypitest # twine upload dist/* -r pypitest --repository-url=https://test.pypi.org/legacy/ # 6. Check that you can install it in a virtualenv by running: # pip install -i https://testpypi.python.org/pypi diffusers # diffusers env # diffusers test -# 7. Upload the final version to actual pypi: +# 7. Upload the final version to the actual PyPI: # twine upload dist/* -r pypi -# 8. Add release notes to the tag in github once everything is looking hunky-dory. -# 9. 
Update the version in __init__.py, setup.py to the new version "-dev" and push to master +# 8. Add release notes to the tag in GitHub once everything is looking hunky-dory. +# 9. Update the version in __init__.py, setup.py to the new version "-dev" and push to main. diff --git a/src/diffusers/__init__.py b/src/diffusers/__init__.py index 787e3b1c29e7..21e7fbd59f24 100644 --- a/src/diffusers/__init__.py +++ b/src/diffusers/__init__.py @@ -94,6 +94,7 @@ "VQModel", ] ) + _import_structure["optimization"] = [ "get_constant_schedule", "get_constant_schedule_with_warmup", @@ -103,7 +104,6 @@ "get_polynomial_decay_schedule_with_warmup", "get_scheduler", ] - _import_structure["pipelines"].extend( [ "AudioPipelineOutput", diff --git a/src/diffusers/dependency_versions_table.py b/src/diffusers/dependency_versions_table.py index 970013c31a20..b04706476037 100644 --- a/src/diffusers/dependency_versions_table.py +++ b/src/diffusers/dependency_versions_table.py @@ -1,6 +1,6 @@ # THIS FILE HAS BEEN AUTOGENERATED. To update: # 1. modify the `_deps` dict in setup.py -# 2. run `make deps_table_update`` +# 2. run `make deps_table_update` deps = { "Pillow": "Pillow", "accelerate": "accelerate>=0.11.0", diff --git a/src/diffusers/image_processor.py b/src/diffusers/image_processor.py index 28a12f2d1364..de60c46eb239 100644 --- a/src/diffusers/image_processor.py +++ b/src/diffusers/image_processor.py @@ -13,7 +13,7 @@ # limitations under the License. import warnings -from typing import List, Optional, Union +from typing import List, Optional, Tuple, Union import numpy as np import PIL.Image @@ -126,14 +126,14 @@ def pt_to_numpy(images: torch.FloatTensor) -> np.ndarray: return images @staticmethod - def normalize(images): + def normalize(images: Union[np.ndarray, torch.Tensor]) -> Union[np.ndarray, torch.Tensor]: """ Normalize an image array to [-1,1]. """ return 2.0 * images - 1.0 @staticmethod - def denormalize(images): + def denormalize(images: Union[np.ndarray, torch.Tensor]) -> Union[np.ndarray, torch.Tensor]: """ Denormalize an image array to [0,1]. """ @@ -159,10 +159,10 @@ def convert_to_grayscale(image: PIL.Image.Image) -> PIL.Image.Image: def get_default_height_width( self, - image: [PIL.Image.Image, np.ndarray, torch.Tensor], + image: Union[PIL.Image.Image, np.ndarray, torch.Tensor], height: Optional[int] = None, width: Optional[int] = None, - ): + ) -> Tuple[int, int]: """ This function return the height and width that are downscaled to the next integer multiple of `vae_scale_factor`. @@ -202,12 +202,24 @@ def get_default_height_width( def resize( self, - image: [PIL.Image.Image, np.ndarray, torch.Tensor], + image: Union[PIL.Image.Image, np.ndarray, torch.Tensor], height: Optional[int] = None, width: Optional[int] = None, - ) -> [PIL.Image.Image, np.ndarray, torch.Tensor]: + ) -> Union[PIL.Image.Image, np.ndarray, torch.Tensor]: """ Resize image. + + Args: + image (`PIL.Image.Image`, `np.ndarray` or `torch.Tensor`): + The image input, can be a PIL image, numpy array or pytorch tensor. + height (`int`, *optional*, defaults to `None`): + The height to resize to. + width (`int`, *optional*`, defaults to `None`): + The width to resize to. + + Returns: + `PIL.Image.Image`, `np.ndarray` or `torch.Tensor`: + The resized image. """ if isinstance(image, PIL.Image.Image): image = image.resize((width, height), resample=PIL_INTERPOLATION[self.config.resample]) @@ -227,7 +239,15 @@ def resize( def binarize(self, image: PIL.Image.Image) -> PIL.Image.Image: """ - create a mask + Create a mask. 
+ + Args: + image (`PIL.Image.Image`): + The image input, should be a PIL image. + + Returns: + `PIL.Image.Image`: + The binarized image. Values less than 0.5 are set to 0, values greater than 0.5 are set to 1. """ image[image < 0.5] = 0 image[image >= 0.5] = 1 @@ -327,7 +347,23 @@ def postprocess( image: torch.FloatTensor, output_type: str = "pil", do_denormalize: Optional[List[bool]] = None, - ): + ) -> Union[PIL.Image.Image, np.ndarray, torch.FloatTensor]: + """ + Postprocess the image output from tensor to `output_type`. + + Args: + image (`torch.FloatTensor`): + The image input, should be a pytorch tensor with shape `B x C x H x W`. + output_type (`str`, *optional*, defaults to `pil`): + The output type of the image, can be one of `pil`, `np`, `pt`, `latent`. + do_denormalize (`List[bool]`, *optional*, defaults to `None`): + Whether to denormalize the image to [0,1]. If `None`, will use the value of `do_normalize` in the + `VaeImageProcessor` config. + + Returns: + `PIL.Image.Image`, `np.ndarray` or `torch.FloatTensor`: + The postprocessed image. + """ if not isinstance(image, torch.Tensor): raise ValueError( f"Input for postprocessing is in incorrect format: {type(image)}. We only support pytorch tensor" @@ -390,7 +426,7 @@ def __init__( super().__init__() @staticmethod - def numpy_to_pil(images): + def numpy_to_pil(images: np.ndarray) -> List[PIL.Image.Image]: """ Convert a NumPy image or a batch of images to a PIL image. """ @@ -406,7 +442,7 @@ def numpy_to_pil(images): return pil_images @staticmethod - def rgblike_to_depthmap(image): + def rgblike_to_depthmap(image: Union[np.ndarray, torch.Tensor]) -> Union[np.ndarray, torch.Tensor]: """ Args: image: RGB-like depth image @@ -416,7 +452,7 @@ def rgblike_to_depthmap(image): """ return image[:, :, 1] * 2**8 + image[:, :, 2] - def numpy_to_depth(self, images): + def numpy_to_depth(self, images: np.ndarray) -> List[PIL.Image.Image]: """ Convert a NumPy depth image or a batch of images to a PIL image. """ @@ -441,7 +477,23 @@ def postprocess( image: torch.FloatTensor, output_type: str = "pil", do_denormalize: Optional[List[bool]] = None, - ): + ) -> Union[PIL.Image.Image, np.ndarray, torch.FloatTensor]: + """ + Postprocess the image output from tensor to `output_type`. + + Args: + image (`torch.FloatTensor`): + The image input, should be a pytorch tensor with shape `B x C x H x W`. + output_type (`str`, *optional*, defaults to `pil`): + The output type of the image, can be one of `pil`, `np`, `pt`, `latent`. + do_denormalize (`List[bool]`, *optional*, defaults to `None`): + Whether to denormalize the image to [0,1]. If `None`, will use the value of `do_normalize` in the + `VaeImageProcessor` config. + + Returns: + `PIL.Image.Image`, `np.ndarray` or `torch.FloatTensor`: + The postprocessed image. + """ if not isinstance(image, torch.Tensor): raise ValueError( f"Input for postprocessing is in incorrect format: {type(image)}. We only support pytorch tensor" diff --git a/src/diffusers/loaders.py b/src/diffusers/loaders.py deleted file mode 100644 index 4590c2452b88..000000000000 --- a/src/diffusers/loaders.py +++ /dev/null @@ -1,3336 +0,0 @@ -# Copyright 2023 The HuggingFace Team. All rights reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. 
-# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import os -import re -from collections import defaultdict -from contextlib import nullcontext -from io import BytesIO -from pathlib import Path -from typing import Callable, Dict, List, Optional, Union - -import requests -import safetensors -import torch -from huggingface_hub import hf_hub_download, model_info -from packaging import version -from torch import nn - -from . import __version__ -from .models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta -from .utils import ( - DIFFUSERS_CACHE, - HF_HUB_OFFLINE, - USE_PEFT_BACKEND, - _get_model_file, - convert_state_dict_to_diffusers, - convert_state_dict_to_peft, - convert_unet_state_dict_to_peft, - deprecate, - get_adapter_name, - get_peft_kwargs, - is_accelerate_available, - is_omegaconf_available, - is_transformers_available, - logging, - recurse_remove_peft_layers, - scale_lora_layers, - set_adapter_layers, - set_weights_and_activate_adapters, -) -from .utils.import_utils import BACKENDS_MAPPING - - -if is_transformers_available(): - from transformers import CLIPTextModel, CLIPTextModelWithProjection, PreTrainedModel - -if is_accelerate_available(): - from accelerate import init_empty_weights - from accelerate.hooks import AlignDevicesHook, CpuOffload, remove_hook_from_module - -logger = logging.get_logger(__name__) - -TEXT_ENCODER_NAME = "text_encoder" -UNET_NAME = "unet" - -LORA_WEIGHT_NAME = "pytorch_lora_weights.bin" -LORA_WEIGHT_NAME_SAFE = "pytorch_lora_weights.safetensors" - -TEXT_INVERSION_NAME = "learned_embeds.bin" -TEXT_INVERSION_NAME_SAFE = "learned_embeds.safetensors" - -CUSTOM_DIFFUSION_WEIGHT_NAME = "pytorch_custom_diffusion_weights.bin" -CUSTOM_DIFFUSION_WEIGHT_NAME_SAFE = "pytorch_custom_diffusion_weights.safetensors" - -LORA_DEPRECATION_MESSAGE = "You are using an old version of LoRA backend. This will be deprecated in the next releases in favor of PEFT make sure to install the latest PEFT and transformers packages in the future." 
- - -class PatchedLoraProjection(nn.Module): - def __init__(self, regular_linear_layer, lora_scale=1, network_alpha=None, rank=4, dtype=None): - super().__init__() - from .models.lora import LoRALinearLayer - - self.regular_linear_layer = regular_linear_layer - - device = self.regular_linear_layer.weight.device - - if dtype is None: - dtype = self.regular_linear_layer.weight.dtype - - self.lora_linear_layer = LoRALinearLayer( - self.regular_linear_layer.in_features, - self.regular_linear_layer.out_features, - network_alpha=network_alpha, - device=device, - dtype=dtype, - rank=rank, - ) - - self.lora_scale = lora_scale - - # overwrite PyTorch's `state_dict` to be sure that only the 'regular_linear_layer' weights are saved - # when saving the whole text encoder model and when LoRA is unloaded or fused - def state_dict(self, *args, destination=None, prefix="", keep_vars=False): - if self.lora_linear_layer is None: - return self.regular_linear_layer.state_dict( - *args, destination=destination, prefix=prefix, keep_vars=keep_vars - ) - - return super().state_dict(*args, destination=destination, prefix=prefix, keep_vars=keep_vars) - - def _fuse_lora(self, lora_scale=1.0, safe_fusing=False): - if self.lora_linear_layer is None: - return - - dtype, device = self.regular_linear_layer.weight.data.dtype, self.regular_linear_layer.weight.data.device - - w_orig = self.regular_linear_layer.weight.data.float() - w_up = self.lora_linear_layer.up.weight.data.float() - w_down = self.lora_linear_layer.down.weight.data.float() - - if self.lora_linear_layer.network_alpha is not None: - w_up = w_up * self.lora_linear_layer.network_alpha / self.lora_linear_layer.rank - - fused_weight = w_orig + (lora_scale * torch.bmm(w_up[None, :], w_down[None, :])[0]) - - if safe_fusing and torch.isnan(fused_weight).any().item(): - raise ValueError( - "This LoRA weight seems to be broken. " - f"Encountered NaN values when trying to fuse LoRA weights for {self}." - "LoRA weights will not be fused." 
- ) - - self.regular_linear_layer.weight.data = fused_weight.to(device=device, dtype=dtype) - - # we can drop the lora layer now - self.lora_linear_layer = None - - # offload the up and down matrices to CPU to not blow the memory - self.w_up = w_up.cpu() - self.w_down = w_down.cpu() - self.lora_scale = lora_scale - - def _unfuse_lora(self): - if not (getattr(self, "w_up", None) is not None and getattr(self, "w_down", None) is not None): - return - - fused_weight = self.regular_linear_layer.weight.data - dtype, device = fused_weight.dtype, fused_weight.device - - w_up = self.w_up.to(device=device).float() - w_down = self.w_down.to(device).float() - - unfused_weight = fused_weight.float() - (self.lora_scale * torch.bmm(w_up[None, :], w_down[None, :])[0]) - self.regular_linear_layer.weight.data = unfused_weight.to(device=device, dtype=dtype) - - self.w_up = None - self.w_down = None - - def forward(self, input): - if self.lora_scale is None: - self.lora_scale = 1.0 - if self.lora_linear_layer is None: - return self.regular_linear_layer(input) - return self.regular_linear_layer(input) + (self.lora_scale * self.lora_linear_layer(input)) - - -def text_encoder_attn_modules(text_encoder): - attn_modules = [] - - if isinstance(text_encoder, (CLIPTextModel, CLIPTextModelWithProjection)): - for i, layer in enumerate(text_encoder.text_model.encoder.layers): - name = f"text_model.encoder.layers.{i}.self_attn" - mod = layer.self_attn - attn_modules.append((name, mod)) - else: - raise ValueError(f"do not know how to get attention modules for: {text_encoder.__class__.__name__}") - - return attn_modules - - -def text_encoder_mlp_modules(text_encoder): - mlp_modules = [] - - if isinstance(text_encoder, (CLIPTextModel, CLIPTextModelWithProjection)): - for i, layer in enumerate(text_encoder.text_model.encoder.layers): - mlp_mod = layer.mlp - name = f"text_model.encoder.layers.{i}.mlp" - mlp_modules.append((name, mlp_mod)) - else: - raise ValueError(f"do not know how to get mlp modules for: {text_encoder.__class__.__name__}") - - return mlp_modules - - -def text_encoder_lora_state_dict(text_encoder): - state_dict = {} - - for name, module in text_encoder_attn_modules(text_encoder): - for k, v in module.q_proj.lora_linear_layer.state_dict().items(): - state_dict[f"{name}.q_proj.lora_linear_layer.{k}"] = v - - for k, v in module.k_proj.lora_linear_layer.state_dict().items(): - state_dict[f"{name}.k_proj.lora_linear_layer.{k}"] = v - - for k, v in module.v_proj.lora_linear_layer.state_dict().items(): - state_dict[f"{name}.v_proj.lora_linear_layer.{k}"] = v - - for k, v in module.out_proj.lora_linear_layer.state_dict().items(): - state_dict[f"{name}.out_proj.lora_linear_layer.{k}"] = v - - return state_dict - - -class AttnProcsLayers(torch.nn.Module): - def __init__(self, state_dict: Dict[str, torch.Tensor]): - super().__init__() - self.layers = torch.nn.ModuleList(state_dict.values()) - self.mapping = dict(enumerate(state_dict.keys())) - self.rev_mapping = {v: k for k, v in enumerate(state_dict.keys())} - - # .processor for unet, .self_attn for text encoder - self.split_keys = [".processor", ".self_attn"] - - # we add a hook to state_dict() and load_state_dict() so that the - # naming fits with `unet.attn_processors` - def map_to(module, state_dict, *args, **kwargs): - new_state_dict = {} - for key, value in state_dict.items(): - num = int(key.split(".")[1]) # 0 is always "layers" - new_key = key.replace(f"layers.{num}", module.mapping[num]) - new_state_dict[new_key] = value - - return new_state_dict - - def 
remap_key(key, state_dict): - for k in self.split_keys: - if k in key: - return key.split(k)[0] + k - - raise ValueError( - f"There seems to be a problem with the state_dict: {set(state_dict.keys())}. {key} has to have one of {self.split_keys}." - ) - - def map_from(module, state_dict, *args, **kwargs): - all_keys = list(state_dict.keys()) - for key in all_keys: - replace_key = remap_key(key, state_dict) - new_key = key.replace(replace_key, f"layers.{module.rev_mapping[replace_key]}") - state_dict[new_key] = state_dict[key] - del state_dict[key] - - self._register_state_dict_hook(map_to) - self._register_load_state_dict_pre_hook(map_from, with_module=True) - - -class UNet2DConditionLoadersMixin: - text_encoder_name = TEXT_ENCODER_NAME - unet_name = UNET_NAME - - def load_attn_procs(self, pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], **kwargs): - r""" - Load pretrained attention processor layers into [`UNet2DConditionModel`]. Attention processor layers have to be - defined in - [`attention_processor.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py) - and be a `torch.nn.Module` class. - - Parameters: - pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`): - Can be either: - - - A string, the model id (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on - the Hub. - - A path to a directory (for example `./my_model_directory`) containing the model weights saved - with [`ModelMixin.save_pretrained`]. - - A [torch state - dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict). - - cache_dir (`Union[str, os.PathLike]`, *optional*): - Path to a directory where a downloaded pretrained model configuration is cached if the standard cache - is not used. - force_download (`bool`, *optional*, defaults to `False`): - Whether or not to force the (re-)download of the model weights and configuration files, overriding the - cached versions if they exist. - resume_download (`bool`, *optional*, defaults to `False`): - Whether or not to resume downloading the model weights and configuration files. If set to `False`, any - incompletely downloaded files are deleted. - proxies (`Dict[str, str]`, *optional*): - A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', - 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. - local_files_only (`bool`, *optional*, defaults to `False`): - Whether to only load local model weights and configuration files or not. If set to `True`, the model - won't be downloaded from the Hub. - use_auth_token (`str` or *bool*, *optional*): - The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from - `diffusers-cli login` (stored in `~/.huggingface`) is used. - low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`): - Speed up model loading only loading the pretrained weights and not initializing the weights. This also - tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model. - Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this - argument to `True` will raise an error. - revision (`str`, *optional*, defaults to `"main"`): - The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier - allowed by Git. 
- subfolder (`str`, *optional*, defaults to `""`): - The subfolder location of a model file within a larger model repository on the Hub or locally. - mirror (`str`, *optional*): - Mirror source to resolve accessibility issues if you’re downloading a model in China. We do not - guarantee the timeliness or safety of the source, and you should refer to the mirror site for more - information. - - """ - from .models.attention_processor import ( - CustomDiffusionAttnProcessor, - ) - from .models.lora import LoRACompatibleConv, LoRACompatibleLinear, LoRAConv2dLayer, LoRALinearLayer - - cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE) - force_download = kwargs.pop("force_download", False) - resume_download = kwargs.pop("resume_download", False) - proxies = kwargs.pop("proxies", None) - local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE) - use_auth_token = kwargs.pop("use_auth_token", None) - revision = kwargs.pop("revision", None) - subfolder = kwargs.pop("subfolder", None) - weight_name = kwargs.pop("weight_name", None) - use_safetensors = kwargs.pop("use_safetensors", None) - low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT) - # This value has the same meaning as the `--network_alpha` option in the kohya-ss trainer script. - # See https://github.com/darkstorm2150/sd-scripts/blob/main/docs/train_network_README-en.md#execute-learning - network_alphas = kwargs.pop("network_alphas", None) - - _pipeline = kwargs.pop("_pipeline", None) - - is_network_alphas_none = network_alphas is None - - allow_pickle = False - - if use_safetensors is None: - use_safetensors = True - allow_pickle = True - - user_agent = { - "file_type": "attn_procs_weights", - "framework": "pytorch", - } - - if low_cpu_mem_usage and not is_accelerate_available(): - low_cpu_mem_usage = False - logger.warning( - "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the" - " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install" - " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip" - " install accelerate\n```\n." 
- ) - - model_file = None - if not isinstance(pretrained_model_name_or_path_or_dict, dict): - # Let's first try to load .safetensors weights - if (use_safetensors and weight_name is None) or ( - weight_name is not None and weight_name.endswith(".safetensors") - ): - try: - model_file = _get_model_file( - pretrained_model_name_or_path_or_dict, - weights_name=weight_name or LORA_WEIGHT_NAME_SAFE, - cache_dir=cache_dir, - force_download=force_download, - resume_download=resume_download, - proxies=proxies, - local_files_only=local_files_only, - use_auth_token=use_auth_token, - revision=revision, - subfolder=subfolder, - user_agent=user_agent, - ) - state_dict = safetensors.torch.load_file(model_file, device="cpu") - except IOError as e: - if not allow_pickle: - raise e - # try loading non-safetensors weights - pass - if model_file is None: - model_file = _get_model_file( - pretrained_model_name_or_path_or_dict, - weights_name=weight_name or LORA_WEIGHT_NAME, - cache_dir=cache_dir, - force_download=force_download, - resume_download=resume_download, - proxies=proxies, - local_files_only=local_files_only, - use_auth_token=use_auth_token, - revision=revision, - subfolder=subfolder, - user_agent=user_agent, - ) - state_dict = torch.load(model_file, map_location="cpu") - else: - state_dict = pretrained_model_name_or_path_or_dict - - # fill attn processors - lora_layers_list = [] - - is_lora = all(("lora" in k or k.endswith(".alpha")) for k in state_dict.keys()) and not USE_PEFT_BACKEND - is_custom_diffusion = any("custom_diffusion" in k for k in state_dict.keys()) - - if is_lora: - # correct keys - state_dict, network_alphas = self.convert_state_dict_legacy_attn_format(state_dict, network_alphas) - - if network_alphas is not None: - network_alphas_keys = list(network_alphas.keys()) - used_network_alphas_keys = set() - - lora_grouped_dict = defaultdict(dict) - mapped_network_alphas = {} - - all_keys = list(state_dict.keys()) - for key in all_keys: - value = state_dict.pop(key) - attn_processor_key, sub_key = ".".join(key.split(".")[:-3]), ".".join(key.split(".")[-3:]) - lora_grouped_dict[attn_processor_key][sub_key] = value - - # Create another `mapped_network_alphas` dictionary so that we can properly map them. - if network_alphas is not None: - for k in network_alphas_keys: - if k.replace(".alpha", "") in key: - mapped_network_alphas.update({attn_processor_key: network_alphas.get(k)}) - used_network_alphas_keys.add(k) - - if not is_network_alphas_none: - if len(set(network_alphas_keys) - used_network_alphas_keys) > 0: - raise ValueError( - f"The `network_alphas` has to be empty at this point but has the following keys \n\n {', '.join(network_alphas.keys())}" - ) - - if len(state_dict) > 0: - raise ValueError( - f"The `state_dict` has to be empty at this point but has the following keys \n\n {', '.join(state_dict.keys())}" - ) - - for key, value_dict in lora_grouped_dict.items(): - attn_processor = self - for sub_key in key.split("."): - attn_processor = getattr(attn_processor, sub_key) - - # Process non-attention layers, which don't have to_{k,v,q,out_proj}_lora layers - # or add_{k,v,q,out_proj}_proj_lora layers. 
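As an aside before the rank handling below: the LoRA rank is simply the first dimension of the grouped "down" weight. A tiny illustration with hypothetical placeholder tensors (not real checkpoint weights):

```py
import torch

# A LoRA pair for a hypothetical 320-dim linear layer with rank 4:
# "down" projects 320 -> 4 and "up" projects 4 -> 320.
value_dict = {
    "lora.down.weight": torch.zeros(4, 320),
    "lora.up.weight": torch.zeros(320, 4),
}
rank = value_dict["lora.down.weight"].shape[0]
print(rank)  # 4
```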
- rank = value_dict["lora.down.weight"].shape[0] - - if isinstance(attn_processor, LoRACompatibleConv): - in_features = attn_processor.in_channels - out_features = attn_processor.out_channels - kernel_size = attn_processor.kernel_size - - ctx = init_empty_weights if low_cpu_mem_usage else nullcontext - with ctx(): - lora = LoRAConv2dLayer( - in_features=in_features, - out_features=out_features, - rank=rank, - kernel_size=kernel_size, - stride=attn_processor.stride, - padding=attn_processor.padding, - network_alpha=mapped_network_alphas.get(key), - ) - elif isinstance(attn_processor, LoRACompatibleLinear): - ctx = init_empty_weights if low_cpu_mem_usage else nullcontext - with ctx(): - lora = LoRALinearLayer( - attn_processor.in_features, - attn_processor.out_features, - rank, - mapped_network_alphas.get(key), - ) - else: - raise ValueError(f"Module {key} is not a LoRACompatibleConv or LoRACompatibleLinear module.") - - value_dict = {k.replace("lora.", ""): v for k, v in value_dict.items()} - lora_layers_list.append((attn_processor, lora)) - - if low_cpu_mem_usage: - device = next(iter(value_dict.values())).device - dtype = next(iter(value_dict.values())).dtype - load_model_dict_into_meta(lora, value_dict, device=device, dtype=dtype) - else: - lora.load_state_dict(value_dict) - - elif is_custom_diffusion: - attn_processors = {} - custom_diffusion_grouped_dict = defaultdict(dict) - for key, value in state_dict.items(): - if len(value) == 0: - custom_diffusion_grouped_dict[key] = {} - else: - if "to_out" in key: - attn_processor_key, sub_key = ".".join(key.split(".")[:-3]), ".".join(key.split(".")[-3:]) - else: - attn_processor_key, sub_key = ".".join(key.split(".")[:-2]), ".".join(key.split(".")[-2:]) - custom_diffusion_grouped_dict[attn_processor_key][sub_key] = value - - for key, value_dict in custom_diffusion_grouped_dict.items(): - if len(value_dict) == 0: - attn_processors[key] = CustomDiffusionAttnProcessor( - train_kv=False, train_q_out=False, hidden_size=None, cross_attention_dim=None - ) - else: - cross_attention_dim = value_dict["to_k_custom_diffusion.weight"].shape[1] - hidden_size = value_dict["to_k_custom_diffusion.weight"].shape[0] - train_q_out = True if "to_q_custom_diffusion.weight" in value_dict else False - attn_processors[key] = CustomDiffusionAttnProcessor( - train_kv=True, - train_q_out=train_q_out, - hidden_size=hidden_size, - cross_attention_dim=cross_attention_dim, - ) - attn_processors[key].load_state_dict(value_dict) - elif USE_PEFT_BACKEND: - # In that case we have nothing to do as loading the adapter weights is already handled above by `set_peft_model_state_dict` - # on the Unet - pass - else: - raise ValueError( - f"{model_file} does not seem to be in the correct format expected by LoRA or Custom Diffusion training." - ) - - # - - def convert_state_dict_legacy_attn_format(self, state_dict, network_alphas): - is_new_lora_format = all( - key.startswith(self.unet_name) or key.startswith(self.text_encoder_name) for key in state_dict.keys() - ) - if is_new_lora_format: - # Strip the `"unet"` prefix. - is_text_encoder_present = any(key.startswith(self.text_encoder_name) for key in state_dict.keys()) - if is_text_encoder_present: - warn_message = "The state_dict contains LoRA params corresponding to the text encoder which are not being used here. To use both UNet and text encoder related LoRA params, use [`pipe.load_lora_weights()`](https://huggingface.co/docs/diffusers/main/en/api/loaders#diffusers.loaders.LoraLoaderMixin.load_lora_weights)." 
-                logger.warning(warn_message)
-            unet_keys = [k for k in state_dict.keys() if k.startswith(self.unet_name)]
-            state_dict = {k.replace(f"{self.unet_name}.", ""): v for k, v in state_dict.items() if k in unet_keys}
-
-        # change processor format to 'pure' LoRACompatibleLinear format
-        if any("processor" in k.split(".") for k in state_dict.keys()):
-
-            def format_to_lora_compatible(key):
-                if "processor" not in key.split("."):
-                    return key
-                return key.replace(".processor", "").replace("to_out_lora", "to_out.0.lora").replace("_lora", ".lora")
-
-            state_dict = {format_to_lora_compatible(k): v for k, v in state_dict.items()}
-
-            if network_alphas is not None:
-                network_alphas = {format_to_lora_compatible(k): v for k, v in network_alphas.items()}
-        return state_dict, network_alphas
-
-    def save_attn_procs(
-        self,
-        save_directory: Union[str, os.PathLike],
-        is_main_process: bool = True,
-        weight_name: str = None,
-        save_function: Callable = None,
-        safe_serialization: bool = True,
-        **kwargs,
-    ):
-        r"""
-        Save an attention processor to a directory so that it can be reloaded using the
-        [`~loaders.UNet2DConditionLoadersMixin.load_attn_procs`] method.
-
-        Arguments:
-            save_directory (`str` or `os.PathLike`):
-                Directory to save an attention processor to. Will be created if it doesn't exist.
-            is_main_process (`bool`, *optional*, defaults to `True`):
-                Whether the process calling this is the main process or not. Useful during distributed training when
-                you need to call this function on all processes. In this case, set `is_main_process=True` only on the
-                main process to avoid race conditions.
-            weight_name (`str`, *optional*):
-                Name of the file to save the attention processor weights to. If not provided, a default LoRA or
-                Custom Diffusion weight name is used.
-            save_function (`Callable`):
-                The function to use to save the state dictionary. Useful during distributed training when you need to
-                replace `torch.save` with another method. Can be configured with the environment variable
-                `DIFFUSERS_SAVE_MODE`.
-            safe_serialization (`bool`, *optional*, defaults to `True`):
-                Whether to save the model using `safetensors` or the traditional PyTorch way with `pickle`.
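A minimal usage sketch, assuming a UNet that already has trained LoRA or Custom Diffusion attention processors attached (the local directory below is a placeholder):

```py
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)
# ... attach or train LoRA / Custom Diffusion attention processors here ...

# Save the processors, then restore them later with the counterpart method.
unet.save_attn_procs("./my_attn_procs", safe_serialization=True)
unet.load_attn_procs("./my_attn_procs")
```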
- """ - from .models.attention_processor import ( - CustomDiffusionAttnProcessor, - CustomDiffusionAttnProcessor2_0, - CustomDiffusionXFormersAttnProcessor, - ) - - if os.path.isfile(save_directory): - logger.error(f"Provided path ({save_directory}) should be a directory, not a file") - return - - if save_function is None: - if safe_serialization: - - def save_function(weights, filename): - return safetensors.torch.save_file(weights, filename, metadata={"format": "pt"}) - - else: - save_function = torch.save - - os.makedirs(save_directory, exist_ok=True) - - is_custom_diffusion = any( - isinstance( - x, - (CustomDiffusionAttnProcessor, CustomDiffusionAttnProcessor2_0, CustomDiffusionXFormersAttnProcessor), - ) - for (_, x) in self.attn_processors.items() - ) - if is_custom_diffusion: - model_to_save = AttnProcsLayers( - { - y: x - for (y, x) in self.attn_processors.items() - if isinstance( - x, - ( - CustomDiffusionAttnProcessor, - CustomDiffusionAttnProcessor2_0, - CustomDiffusionXFormersAttnProcessor, - ), - ) - } - ) - state_dict = model_to_save.state_dict() - for name, attn in self.attn_processors.items(): - if len(attn.state_dict()) == 0: - state_dict[name] = {} - else: - model_to_save = AttnProcsLayers(self.attn_processors) - state_dict = model_to_save.state_dict() - - if weight_name is None: - if safe_serialization: - weight_name = CUSTOM_DIFFUSION_WEIGHT_NAME_SAFE if is_custom_diffusion else LORA_WEIGHT_NAME_SAFE - else: - weight_name = CUSTOM_DIFFUSION_WEIGHT_NAME if is_custom_diffusion else LORA_WEIGHT_NAME - - # Save the model - save_function(state_dict, os.path.join(save_directory, weight_name)) - logger.info(f"Model weights saved in {os.path.join(save_directory, weight_name)}") - - def fuse_lora(self, lora_scale=1.0, safe_fusing=False): - self.lora_scale = lora_scale - self._safe_fusing = safe_fusing - self.apply(self._fuse_lora_apply) - - def _fuse_lora_apply(self, module): - if not USE_PEFT_BACKEND: - if hasattr(module, "_fuse_lora"): - module._fuse_lora(self.lora_scale, self._safe_fusing) - else: - from peft.tuners.tuners_utils import BaseTunerLayer - - if isinstance(module, BaseTunerLayer): - if self.lora_scale != 1.0: - module.scale_layer(self.lora_scale) - module.merge(safe_merge=self._safe_fusing) - - def unfuse_lora(self): - self.apply(self._unfuse_lora_apply) - - def _unfuse_lora_apply(self, module): - if not USE_PEFT_BACKEND: - if hasattr(module, "_unfuse_lora"): - module._unfuse_lora() - else: - from peft.tuners.tuners_utils import BaseTunerLayer - - if isinstance(module, BaseTunerLayer): - module.unmerge() - - def set_adapters( - self, - adapter_names: Union[List[str], str], - weights: Optional[Union[List[float], float]] = None, - ): - """ - Sets the adapter layers for the unet. - - Args: - adapter_names (`List[str]` or `str`): - The names of the adapters to use. - weights (`Union[List[float], float]`, *optional*): - The adapter(s) weights to use with the UNet. If `None`, the weights are set to `1.0` for all the - adapters. - """ - if not USE_PEFT_BACKEND: - raise ValueError("PEFT backend is required for `set_adapters()`.") - - adapter_names = [adapter_names] if isinstance(adapter_names, str) else adapter_names - - if weights is None: - weights = [1.0] * len(adapter_names) - elif isinstance(weights, float): - weights = [weights] * len(adapter_names) - - if len(adapter_names) != len(weights): - raise ValueError( - f"Length of adapter names {len(adapter_names)} is not equal to the length of their weights {len(weights)}." 
- ) - - set_weights_and_activate_adapters(self, adapter_names, weights) - - def disable_lora(self): - """ - Disables the active LoRA layers for the unet. - """ - if not USE_PEFT_BACKEND: - raise ValueError("PEFT backend is required for this method.") - set_adapter_layers(self, enabled=False) - - def enable_lora(self): - """ - Enables the active LoRA layers for the unet. - """ - if not USE_PEFT_BACKEND: - raise ValueError("PEFT backend is required for this method.") - set_adapter_layers(self, enabled=True) - - -def load_textual_inversion_state_dicts(pretrained_model_name_or_paths, **kwargs): - cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE) - force_download = kwargs.pop("force_download", False) - resume_download = kwargs.pop("resume_download", False) - proxies = kwargs.pop("proxies", None) - local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE) - use_auth_token = kwargs.pop("use_auth_token", None) - revision = kwargs.pop("revision", None) - subfolder = kwargs.pop("subfolder", None) - weight_name = kwargs.pop("weight_name", None) - use_safetensors = kwargs.pop("use_safetensors", None) - - allow_pickle = False - if use_safetensors is None: - use_safetensors = True - allow_pickle = True - - user_agent = { - "file_type": "text_inversion", - "framework": "pytorch", - } - state_dicts = [] - for pretrained_model_name_or_path in pretrained_model_name_or_paths: - if not isinstance(pretrained_model_name_or_path, (dict, torch.Tensor)): - # 3.1. Load textual inversion file - model_file = None - - # Let's first try to load .safetensors weights - if (use_safetensors and weight_name is None) or ( - weight_name is not None and weight_name.endswith(".safetensors") - ): - try: - model_file = _get_model_file( - pretrained_model_name_or_path, - weights_name=weight_name or TEXT_INVERSION_NAME_SAFE, - cache_dir=cache_dir, - force_download=force_download, - resume_download=resume_download, - proxies=proxies, - local_files_only=local_files_only, - use_auth_token=use_auth_token, - revision=revision, - subfolder=subfolder, - user_agent=user_agent, - ) - state_dict = safetensors.torch.load_file(model_file, device="cpu") - except Exception as e: - if not allow_pickle: - raise e - - model_file = None - - if model_file is None: - model_file = _get_model_file( - pretrained_model_name_or_path, - weights_name=weight_name or TEXT_INVERSION_NAME, - cache_dir=cache_dir, - force_download=force_download, - resume_download=resume_download, - proxies=proxies, - local_files_only=local_files_only, - use_auth_token=use_auth_token, - revision=revision, - subfolder=subfolder, - user_agent=user_agent, - ) - state_dict = torch.load(model_file, map_location="cpu") - else: - state_dict = pretrained_model_name_or_path - - state_dicts.append(state_dict) - - return state_dicts - - -class TextualInversionLoaderMixin: - r""" - Load textual inversion tokens and embeddings to the tokenizer and text encoder. - """ - - def maybe_convert_prompt(self, prompt: Union[str, List[str]], tokenizer: "PreTrainedTokenizer"): # noqa: F821 - r""" - Processes prompts that include a special token corresponding to a multi-vector textual inversion embedding to - be replaced with multiple special tokens each corresponding to one of the vectors. If the prompt has no textual - inversion token or if the textual inversion token is a single vector, the input prompt is returned. - - Parameters: - prompt (`str` or list of `str`): - The prompt or prompts to guide the image generation. 
- tokenizer (`PreTrainedTokenizer`): - The tokenizer responsible for encoding the prompt into input tokens. - - Returns: - `str` or list of `str`: The converted prompt - """ - if not isinstance(prompt, List): - prompts = [prompt] - else: - prompts = prompt - - prompts = [self._maybe_convert_prompt(p, tokenizer) for p in prompts] - - if not isinstance(prompt, List): - return prompts[0] - - return prompts - - def _maybe_convert_prompt(self, prompt: str, tokenizer: "PreTrainedTokenizer"): # noqa: F821 - r""" - Maybe convert a prompt into a "multi vector"-compatible prompt. If the prompt includes a token that corresponds - to a multi-vector textual inversion embedding, this function will process the prompt so that the special token - is replaced with multiple special tokens each corresponding to one of the vectors. If the prompt has no textual - inversion token or a textual inversion token that is a single vector, the input prompt is simply returned. - - Parameters: - prompt (`str`): - The prompt to guide the image generation. - tokenizer (`PreTrainedTokenizer`): - The tokenizer responsible for encoding the prompt into input tokens. - - Returns: - `str`: The converted prompt - """ - tokens = tokenizer.tokenize(prompt) - unique_tokens = set(tokens) - for token in unique_tokens: - if token in tokenizer.added_tokens_encoder: - replacement = token - i = 1 - while f"{token}_{i}" in tokenizer.added_tokens_encoder: - replacement += f" {token}_{i}" - i += 1 - - prompt = prompt.replace(token, replacement) - - return prompt - - def _check_text_inv_inputs(self, tokenizer, text_encoder, pretrained_model_name_or_paths, tokens): - if tokenizer is None: - raise ValueError( - f"{self.__class__.__name__} requires `self.tokenizer` or passing a `tokenizer` of type `PreTrainedTokenizer` for calling" - f" `{self.load_textual_inversion.__name__}`" - ) - - if text_encoder is None: - raise ValueError( - f"{self.__class__.__name__} requires `self.text_encoder` or passing a `text_encoder` of type `PreTrainedModel` for calling" - f" `{self.load_textual_inversion.__name__}`" - ) - - if len(pretrained_model_name_or_paths) != len(tokens): - raise ValueError( - f"You have passed a list of models of length {len(pretrained_model_name_or_paths)}, and list of tokens of length {len(tokens)} " - f"Make sure both lists have the same length." - ) - - valid_tokens = [t for t in tokens if t is not None] - if len(set(valid_tokens)) < len(valid_tokens): - raise ValueError(f"You have passed a list of tokens that contains duplicates: {tokens}") - - @staticmethod - def _retrieve_tokens_and_embeddings(tokens, state_dicts, tokenizer): - all_tokens = [] - all_embeddings = [] - for state_dict, token in zip(state_dicts, tokens): - if isinstance(state_dict, torch.Tensor): - if token is None: - raise ValueError( - "You are trying to load a textual inversion embedding that has been saved as a PyTorch tensor. Make sure to pass the name of the corresponding token in this case: `token=...`." - ) - loaded_token = token - embedding = state_dict - elif len(state_dict) == 1: - # diffusers - loaded_token, embedding = next(iter(state_dict.items())) - elif "string_to_param" in state_dict: - # A1111 - loaded_token = state_dict["name"] - embedding = state_dict["string_to_param"]["*"] - else: - raise ValueError( - f"Loaded state dictonary is incorrect: {state_dict}. \n\n" - "Please verify that the loaded state dictionary of the textual embedding either only has a single key or includes the `string_to_param`" - " input key." 
- ) - - if token is not None and loaded_token != token: - logger.info(f"The loaded token: {loaded_token} is overwritten by the passed token {token}.") - else: - token = loaded_token - - if token in tokenizer.get_vocab(): - raise ValueError( - f"Token {token} already in tokenizer vocabulary. Please choose a different token name or remove {token} and embedding from the tokenizer and text encoder." - ) - - all_tokens.append(token) - all_embeddings.append(embedding) - - return all_tokens, all_embeddings - - @staticmethod - def _extend_tokens_and_embeddings(tokens, embeddings, tokenizer): - all_tokens = [] - all_embeddings = [] - - for embedding, token in zip(embeddings, tokens): - if f"{token}_1" in tokenizer.get_vocab(): - multi_vector_tokens = [token] - i = 1 - while f"{token}_{i}" in tokenizer.added_tokens_encoder: - multi_vector_tokens.append(f"{token}_{i}") - i += 1 - - raise ValueError( - f"Multi-vector Token {multi_vector_tokens} already in tokenizer vocabulary. Please choose a different token name or remove the {multi_vector_tokens} and embedding from the tokenizer and text encoder." - ) - - is_multi_vector = len(embedding.shape) > 1 and embedding.shape[0] > 1 - if is_multi_vector: - all_tokens += [token] + [f"{token}_{i}" for i in range(1, embedding.shape[0])] - all_embeddings += [e for e in embedding] # noqa: C416 - else: - all_tokens += [token] - all_embeddings += [embedding[0]] if len(embedding.shape) > 1 else [embedding] - - return all_tokens, all_embeddings - - def load_textual_inversion( - self, - pretrained_model_name_or_path: Union[str, List[str], Dict[str, torch.Tensor], List[Dict[str, torch.Tensor]]], - token: Optional[Union[str, List[str]]] = None, - tokenizer: Optional["PreTrainedTokenizer"] = None, # noqa: F821 - text_encoder: Optional["PreTrainedModel"] = None, # noqa: F821 - **kwargs, - ): - r""" - Load textual inversion embeddings into the text encoder of [`StableDiffusionPipeline`] (both πŸ€— Diffusers and - Automatic1111 formats are supported). - - Parameters: - pretrained_model_name_or_path (`str` or `os.PathLike` or `List[str or os.PathLike]` or `Dict` or `List[Dict]`): - Can be either one of the following or a list of them: - - - A string, the *model id* (for example `sd-concepts-library/low-poly-hd-logos-icons`) of a - pretrained model hosted on the Hub. - - A path to a *directory* (for example `./my_text_inversion_directory/`) containing the textual - inversion weights. - - A path to a *file* (for example `./my_text_inversions.pt`) containing textual inversion weights. - - A [torch state - dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict). - - token (`str` or `List[str]`, *optional*): - Override the token to use for the textual inversion weights. If `pretrained_model_name_or_path` is a - list, then `token` must also be a list of equal length. - text_encoder ([`~transformers.CLIPTextModel`], *optional*): - Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). - If not specified, function will take self.tokenizer. - tokenizer ([`~transformers.CLIPTokenizer`], *optional*): - A `CLIPTokenizer` to tokenize text. If not specified, function will take self.tokenizer. - weight_name (`str`, *optional*): - Name of a custom weight file. This should be used when: - - - The saved textual inversion file is in πŸ€— Diffusers format, but was saved under a specific weight - name such as `text_inv.bin`. - - The saved textual inversion file is in the Automatic1111 format. 
- cache_dir (`Union[str, os.PathLike]`, *optional*): - Path to a directory where a downloaded pretrained model configuration is cached if the standard cache - is not used. - force_download (`bool`, *optional*, defaults to `False`): - Whether or not to force the (re-)download of the model weights and configuration files, overriding the - cached versions if they exist. - resume_download (`bool`, *optional*, defaults to `False`): - Whether or not to resume downloading the model weights and configuration files. If set to `False`, any - incompletely downloaded files are deleted. - proxies (`Dict[str, str]`, *optional*): - A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', - 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. - local_files_only (`bool`, *optional*, defaults to `False`): - Whether to only load local model weights and configuration files or not. If set to `True`, the model - won't be downloaded from the Hub. - use_auth_token (`str` or *bool*, *optional*): - The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from - `diffusers-cli login` (stored in `~/.huggingface`) is used. - revision (`str`, *optional*, defaults to `"main"`): - The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier - allowed by Git. - subfolder (`str`, *optional*, defaults to `""`): - The subfolder location of a model file within a larger model repository on the Hub or locally. - mirror (`str`, *optional*): - Mirror source to resolve accessibility issues if you're downloading a model in China. We do not - guarantee the timeliness or safety of the source, and you should refer to the mirror site for more - information. - - Example: - - To load a textual inversion embedding vector in πŸ€— Diffusers format: - - ```py - from diffusers import StableDiffusionPipeline - import torch - - model_id = "runwayml/stable-diffusion-v1-5" - pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") - - pipe.load_textual_inversion("sd-concepts-library/cat-toy") - - prompt = "A backpack" - - image = pipe(prompt, num_inference_steps=50).images[0] - image.save("cat-backpack.png") - ``` - - To load a textual inversion embedding vector in Automatic1111 format, make sure to download the vector first - (for example from [civitAI](https://civitai.com/models/3036?modelVersionId=9857)) and then load the vector - locally: - - ```py - from diffusers import StableDiffusionPipeline - import torch - - model_id = "runwayml/stable-diffusion-v1-5" - pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") - - pipe.load_textual_inversion("./charturnerv2.pt", token="charturnerv2") - - prompt = "charturnerv2, multiple views of the same character in the same outfit, a character turnaround of a woman wearing a black jacket and red shirt, best quality, intricate details." - - image = pipe(prompt, num_inference_steps=50).images[0] - image.save("character.png") - ``` - - """ - # 1. Set correct tokenizer and text encoder - tokenizer = tokenizer or getattr(self, "tokenizer", None) - text_encoder = text_encoder or getattr(self, "text_encoder", None) - - # 2. 
Normalize inputs
-        pretrained_model_name_or_paths = (
-            [pretrained_model_name_or_path]
-            if not isinstance(pretrained_model_name_or_path, list)
-            else pretrained_model_name_or_path
-        )
-        tokens = len(pretrained_model_name_or_paths) * [token] if (isinstance(token, str) or token is None) else token
-
-        # 3. Check inputs
-        self._check_text_inv_inputs(tokenizer, text_encoder, pretrained_model_name_or_paths, tokens)
-
-        # 4. Load state dicts of textual embeddings
-        state_dicts = load_textual_inversion_state_dicts(pretrained_model_name_or_paths, **kwargs)
-
-        # 5. Retrieve tokens and embeddings
-        tokens, embeddings = self._retrieve_tokens_and_embeddings(tokens, state_dicts, tokenizer)
-
-        # 6. Extend tokens and embeddings for multi-vector
-        tokens, embeddings = self._extend_tokens_and_embeddings(tokens, embeddings, tokenizer)
-
-        # 7. Make sure all embeddings have the correct size
-        expected_emb_dim = text_encoder.get_input_embeddings().weight.shape[-1]
-        if any(expected_emb_dim != emb.shape[-1] for emb in embeddings):
-            raise ValueError(
-                "Loaded embeddings are of incorrect shape. Expected each textual inversion embedding "
-                f"to have dimension {expected_emb_dim}, but got embeddings of dimension "
-                f"{[emb.shape[-1] for emb in embeddings]}."
-            )
-
-        # 8. Now we can be sure that loading the embedding matrix works
-        # < Unsafe code:
-
-        # 8.1 Offload all hooks in case the pipeline was CPU-offloaded before; make sure we offload and onload again
-        is_model_cpu_offload = False
-        is_sequential_cpu_offload = False
-        for _, component in self.components.items():
-            if isinstance(component, nn.Module):
-                if hasattr(component, "_hf_hook"):
-                    is_model_cpu_offload = isinstance(getattr(component, "_hf_hook"), CpuOffload)
-                    is_sequential_cpu_offload = isinstance(getattr(component, "_hf_hook"), AlignDevicesHook)
-                    logger.info(
-                        "Accelerate hooks detected. Since you have called `load_textual_inversion()`, the previous hooks will be first removed. Then the textual inversion parameters will be loaded and the hooks will be applied again."
-                    )
-                    remove_hook_from_module(component, recurse=is_sequential_cpu_offload)
-
-        # 8.2 Save expected device and dtype
-        device = text_encoder.device
-        dtype = text_encoder.dtype
-
-        # 8.3 Increase token embedding matrix
-        text_encoder.resize_token_embeddings(len(tokenizer) + len(tokens))
-        input_embeddings = text_encoder.get_input_embeddings().weight
-
-        # 8.4 Load token and embedding
-        for token, embedding in zip(tokens, embeddings):
-            # add tokens and get ids
-            tokenizer.add_tokens(token)
-            token_id = tokenizer.convert_tokens_to_ids(token)
-            input_embeddings.data[token_id] = embedding
-            logger.info(f"Loaded textual inversion embedding for {token}.")
-
-        input_embeddings.to(dtype=dtype, device=device)
-
-        # 8.5 Offload the model again
-        if is_model_cpu_offload:
-            self.enable_model_cpu_offload()
-        elif is_sequential_cpu_offload:
-            self.enable_sequential_cpu_offload()
-
-        # / Unsafe code >
-
-
-class LoraLoaderMixin:
-    r"""
-    Load LoRA layers into [`UNet2DConditionModel`] and
-    [`CLIPTextModel`](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel).
-    """
-    text_encoder_name = TEXT_ENCODER_NAME
-    unet_name = UNET_NAME
-    num_fused_loras = 0
-
-    def load_lora_weights(
-        self, pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], adapter_name=None, **kwargs
-    ):
-        """
-        Load LoRA weights specified in `pretrained_model_name_or_path_or_dict` into `self.unet` and
-        `self.text_encoder`.
-
-        All kwargs are forwarded to `self.lora_state_dict`.
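For orientation, a typical call looks roughly like the sketch below; the repository name and weight file are placeholders, and `adapter_name` only takes effect when the PEFT backend is available:

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Loads LoRA layers into both the UNet and the text encoder.
pipe.load_lora_weights(
    "some-user/some-lora-repo",  # placeholder Hub repo or local directory
    weight_name="pytorch_lora_weights.safetensors",
    adapter_name="style",
)
image = pipe("a photo of an astronaut", cross_attention_kwargs={"scale": 0.8}).images[0]
```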
- - See [`~loaders.LoraLoaderMixin.lora_state_dict`] for more details on how the state dict is loaded. - - See [`~loaders.LoraLoaderMixin.load_lora_into_unet`] for more details on how the state dict is loaded into - `self.unet`. - - See [`~loaders.LoraLoaderMixin.load_lora_into_text_encoder`] for more details on how the state dict is loaded - into `self.text_encoder`. - - Parameters: - pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`): - See [`~loaders.LoraLoaderMixin.lora_state_dict`]. - kwargs (`dict`, *optional*): - See [`~loaders.LoraLoaderMixin.lora_state_dict`]. - adapter_name (`str`, *optional*): - Adapter name to be used for referencing the loaded adapter model. If not specified, it will use - `default_{i}` where i is the total number of adapters being loaded. - """ - # First, ensure that the checkpoint is a compatible one and can be successfully loaded. - state_dict, network_alphas = self.lora_state_dict(pretrained_model_name_or_path_or_dict, **kwargs) - - is_correct_format = all("lora" in key for key in state_dict.keys()) - if not is_correct_format: - raise ValueError("Invalid LoRA checkpoint.") - - low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT) - - self.load_lora_into_unet( - state_dict, - network_alphas=network_alphas, - unet=getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet, - low_cpu_mem_usage=low_cpu_mem_usage, - adapter_name=adapter_name, - _pipeline=self, - ) - self.load_lora_into_text_encoder( - state_dict, - network_alphas=network_alphas, - text_encoder=getattr(self, self.text_encoder_name) - if not hasattr(self, "text_encoder") - else self.text_encoder, - lora_scale=self.lora_scale, - low_cpu_mem_usage=low_cpu_mem_usage, - adapter_name=adapter_name, - _pipeline=self, - ) - - @classmethod - def lora_state_dict( - cls, - pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], - **kwargs, - ): - r""" - Return state dict for lora weights and the network alphas. - - - - We support loading A1111 formatted LoRA checkpoints in a limited capacity. - - This function is experimental and might change in the future. - - - - Parameters: - pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`): - Can be either: - - - A string, the *model id* (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on - the Hub. - - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved - with [`ModelMixin.save_pretrained`]. - - A [torch state - dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict). - - cache_dir (`Union[str, os.PathLike]`, *optional*): - Path to a directory where a downloaded pretrained model configuration is cached if the standard cache - is not used. - force_download (`bool`, *optional*, defaults to `False`): - Whether or not to force the (re-)download of the model weights and configuration files, overriding the - cached versions if they exist. - resume_download (`bool`, *optional*, defaults to `False`): - Whether or not to resume downloading the model weights and configuration files. If set to `False`, any - incompletely downloaded files are deleted. - proxies (`Dict[str, str]`, *optional*): - A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', - 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. 
- local_files_only (`bool`, *optional*, defaults to `False`): - Whether to only load local model weights and configuration files or not. If set to `True`, the model - won't be downloaded from the Hub. - use_auth_token (`str` or *bool*, *optional*): - The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from - `diffusers-cli login` (stored in `~/.huggingface`) is used. - revision (`str`, *optional*, defaults to `"main"`): - The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier - allowed by Git. - subfolder (`str`, *optional*, defaults to `""`): - The subfolder location of a model file within a larger model repository on the Hub or locally. - low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`): - Speed up model loading only loading the pretrained weights and not initializing the weights. This also - tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model. - Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this - argument to `True` will raise an error. - mirror (`str`, *optional*): - Mirror source to resolve accessibility issues if you're downloading a model in China. We do not - guarantee the timeliness or safety of the source, and you should refer to the mirror site for more - information. - - """ - # Load the main state dict first which has the LoRA layers for either of - # UNet and text encoder or both. - cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE) - force_download = kwargs.pop("force_download", False) - resume_download = kwargs.pop("resume_download", False) - proxies = kwargs.pop("proxies", None) - local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE) - use_auth_token = kwargs.pop("use_auth_token", None) - revision = kwargs.pop("revision", None) - subfolder = kwargs.pop("subfolder", None) - weight_name = kwargs.pop("weight_name", None) - unet_config = kwargs.pop("unet_config", None) - use_safetensors = kwargs.pop("use_safetensors", None) - - allow_pickle = False - if use_safetensors is None: - use_safetensors = True - allow_pickle = True - - user_agent = { - "file_type": "attn_procs_weights", - "framework": "pytorch", - } - - model_file = None - if not isinstance(pretrained_model_name_or_path_or_dict, dict): - # Let's first try to load .safetensors weights - if (use_safetensors and weight_name is None) or ( - weight_name is not None and weight_name.endswith(".safetensors") - ): - try: - # Here we're relaxing the loading check to enable more Inference API - # friendliness where sometimes, it's not at all possible to automatically - # determine `weight_name`. 
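Put differently, when a repository contains several weight files and no automatic guess is possible, the caller can always pass `weight_name` explicitly. A minimal sketch (the repository name is a placeholder):

```py
from diffusers import StableDiffusionPipeline

state_dict, network_alphas = StableDiffusionPipeline.lora_state_dict(
    "some-user/some-lora-repo",  # placeholder
    weight_name="pytorch_lora_weights.safetensors",
)
print(len(state_dict), network_alphas is None)
```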
- if weight_name is None: - weight_name = cls._best_guess_weight_name( - pretrained_model_name_or_path_or_dict, file_extension=".safetensors" - ) - model_file = _get_model_file( - pretrained_model_name_or_path_or_dict, - weights_name=weight_name or LORA_WEIGHT_NAME_SAFE, - cache_dir=cache_dir, - force_download=force_download, - resume_download=resume_download, - proxies=proxies, - local_files_only=local_files_only, - use_auth_token=use_auth_token, - revision=revision, - subfolder=subfolder, - user_agent=user_agent, - ) - state_dict = safetensors.torch.load_file(model_file, device="cpu") - except (IOError, safetensors.SafetensorError) as e: - if not allow_pickle: - raise e - # try loading non-safetensors weights - model_file = None - pass - - if model_file is None: - if weight_name is None: - weight_name = cls._best_guess_weight_name( - pretrained_model_name_or_path_or_dict, file_extension=".bin" - ) - model_file = _get_model_file( - pretrained_model_name_or_path_or_dict, - weights_name=weight_name or LORA_WEIGHT_NAME, - cache_dir=cache_dir, - force_download=force_download, - resume_download=resume_download, - proxies=proxies, - local_files_only=local_files_only, - use_auth_token=use_auth_token, - revision=revision, - subfolder=subfolder, - user_agent=user_agent, - ) - state_dict = torch.load(model_file, map_location="cpu") - else: - state_dict = pretrained_model_name_or_path_or_dict - - network_alphas = None - # TODO: replace it with a method from `state_dict_utils` - if all( - ( - k.startswith("lora_te_") - or k.startswith("lora_unet_") - or k.startswith("lora_te1_") - or k.startswith("lora_te2_") - ) - for k in state_dict.keys() - ): - # Map SDXL blocks correctly. - if unet_config is not None: - # use unet config to remap block numbers - state_dict = cls._maybe_map_sgm_blocks_to_diffusers(state_dict, unet_config) - state_dict, network_alphas = cls._convert_kohya_lora_to_diffusers(state_dict) - - return state_dict, network_alphas - - @classmethod - def _best_guess_weight_name(cls, pretrained_model_name_or_path_or_dict, file_extension=".safetensors"): - targeted_files = [] - - if os.path.isfile(pretrained_model_name_or_path_or_dict): - return - elif os.path.isdir(pretrained_model_name_or_path_or_dict): - targeted_files = [ - f for f in os.listdir(pretrained_model_name_or_path_or_dict) if f.endswith(file_extension) - ] - else: - files_in_repo = model_info(pretrained_model_name_or_path_or_dict).siblings - targeted_files = [f.rfilename for f in files_in_repo if f.rfilename.endswith(file_extension)] - if len(targeted_files) == 0: - return - - # "scheduler" does not correspond to a LoRA checkpoint. - # "optimizer" does not correspond to a LoRA checkpoint - # only top-level checkpoints are considered and not the other ones, hence "checkpoint". - unallowed_substrings = {"scheduler", "optimizer", "checkpoint"} - targeted_files = list( - filter(lambda x: all(substring not in x for substring in unallowed_substrings), targeted_files) - ) - - if any(f.endswith(LORA_WEIGHT_NAME) for f in targeted_files): - targeted_files = list(filter(lambda x: x.endswith(LORA_WEIGHT_NAME), targeted_files)) - elif any(f.endswith(LORA_WEIGHT_NAME_SAFE) for f in targeted_files): - targeted_files = list(filter(lambda x: x.endswith(LORA_WEIGHT_NAME_SAFE), targeted_files)) - - if len(targeted_files) > 1: - raise ValueError( - f"Provided path contains more than one weights file in the {file_extension} format. 
Either specify `weight_name` in `load_lora_weights` or make sure there's only one `.safetensors` or `.bin` file in {pretrained_model_name_or_path_or_dict}." - ) - weight_name = targeted_files[0] - return weight_name - - @classmethod - def _maybe_map_sgm_blocks_to_diffusers(cls, state_dict, unet_config, delimiter="_", block_slice_pos=5): - # 1. get all state_dict_keys - all_keys = list(state_dict.keys()) - sgm_patterns = ["input_blocks", "middle_block", "output_blocks"] - - # 2. check if needs remapping, if not return original dict - is_in_sgm_format = False - for key in all_keys: - if any(p in key for p in sgm_patterns): - is_in_sgm_format = True - break - - if not is_in_sgm_format: - return state_dict - - # 3. Else remap from SGM patterns - new_state_dict = {} - inner_block_map = ["resnets", "attentions", "upsamplers"] - - # Retrieves # of down, mid and up blocks - input_block_ids, middle_block_ids, output_block_ids = set(), set(), set() - - for layer in all_keys: - if "text" in layer: - new_state_dict[layer] = state_dict.pop(layer) - else: - layer_id = int(layer.split(delimiter)[:block_slice_pos][-1]) - if sgm_patterns[0] in layer: - input_block_ids.add(layer_id) - elif sgm_patterns[1] in layer: - middle_block_ids.add(layer_id) - elif sgm_patterns[2] in layer: - output_block_ids.add(layer_id) - else: - raise ValueError(f"Checkpoint not supported because layer {layer} not supported.") - - input_blocks = { - layer_id: [key for key in state_dict if f"input_blocks{delimiter}{layer_id}" in key] - for layer_id in input_block_ids - } - middle_blocks = { - layer_id: [key for key in state_dict if f"middle_block{delimiter}{layer_id}" in key] - for layer_id in middle_block_ids - } - output_blocks = { - layer_id: [key for key in state_dict if f"output_blocks{delimiter}{layer_id}" in key] - for layer_id in output_block_ids - } - - # Rename keys accordingly - for i in input_block_ids: - block_id = (i - 1) // (unet_config.layers_per_block + 1) - layer_in_block_id = (i - 1) % (unet_config.layers_per_block + 1) - - for key in input_blocks[i]: - inner_block_id = int(key.split(delimiter)[block_slice_pos]) - inner_block_key = inner_block_map[inner_block_id] if "op" not in key else "downsamplers" - inner_layers_in_block = str(layer_in_block_id) if "op" not in key else "0" - new_key = delimiter.join( - key.split(delimiter)[: block_slice_pos - 1] - + [str(block_id), inner_block_key, inner_layers_in_block] - + key.split(delimiter)[block_slice_pos + 1 :] - ) - new_state_dict[new_key] = state_dict.pop(key) - - for i in middle_block_ids: - key_part = None - if i == 0: - key_part = [inner_block_map[0], "0"] - elif i == 1: - key_part = [inner_block_map[1], "0"] - elif i == 2: - key_part = [inner_block_map[0], "1"] - else: - raise ValueError(f"Invalid middle block id {i}.") - - for key in middle_blocks[i]: - new_key = delimiter.join( - key.split(delimiter)[: block_slice_pos - 1] + key_part + key.split(delimiter)[block_slice_pos:] - ) - new_state_dict[new_key] = state_dict.pop(key) - - for i in output_block_ids: - block_id = i // (unet_config.layers_per_block + 1) - layer_in_block_id = i % (unet_config.layers_per_block + 1) - - for key in output_blocks[i]: - inner_block_id = int(key.split(delimiter)[block_slice_pos]) - inner_block_key = inner_block_map[inner_block_id] - inner_layers_in_block = str(layer_in_block_id) if inner_block_id < 2 else "0" - new_key = delimiter.join( - key.split(delimiter)[: block_slice_pos - 1] - + [str(block_id), inner_block_key, inner_layers_in_block] - + key.split(delimiter)[block_slice_pos 
+ 1 :] - ) - new_state_dict[new_key] = state_dict.pop(key) - - if len(state_dict) > 0: - raise ValueError("At this point all state dict entries have to be converted.") - - return new_state_dict - - @classmethod - def _optionally_disable_offloading(cls, _pipeline): - """ - Optionally removes offloading in case the pipeline has been already sequentially offloaded to CPU. - - Args: - _pipeline (`DiffusionPipeline`): - The pipeline to disable offloading for. - - Returns: - tuple: - A tuple indicating if `is_model_cpu_offload` or `is_sequential_cpu_offload` is True. - """ - is_model_cpu_offload = False - is_sequential_cpu_offload = False - - if _pipeline is not None: - for _, component in _pipeline.components.items(): - if isinstance(component, nn.Module) and hasattr(component, "_hf_hook"): - if not is_model_cpu_offload: - is_model_cpu_offload = isinstance(component._hf_hook, CpuOffload) - if not is_sequential_cpu_offload: - is_sequential_cpu_offload = isinstance(component._hf_hook, AlignDevicesHook) - - logger.info( - "Accelerate hooks detected. Since you have called `load_lora_weights()`, the previous hooks will be first removed. Then the LoRA parameters will be loaded and the hooks will be applied again." - ) - remove_hook_from_module(component, recurse=is_sequential_cpu_offload) - - return (is_model_cpu_offload, is_sequential_cpu_offload) - - @classmethod - def load_lora_into_unet( - cls, state_dict, network_alphas, unet, low_cpu_mem_usage=None, adapter_name=None, _pipeline=None - ): - """ - This will load the LoRA layers specified in `state_dict` into `unet`. - - Parameters: - state_dict (`dict`): - A standard state dict containing the lora layer parameters. The keys can either be indexed directly - into the unet or prefixed with an additional `unet` which can be used to distinguish between text - encoder lora layers. - network_alphas (`Dict[str, float]`): - See `LoRALinearLayer` for more details. - unet (`UNet2DConditionModel`): - The UNet model to load the LoRA layers into. - low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`): - Speed up model loading only loading the pretrained weights and not initializing the weights. This also - tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model. - Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this - argument to `True` will raise an error. - adapter_name (`str`, *optional*): - Adapter name to be used for referencing the loaded adapter model. If not specified, it will use - `default_{i}` where i is the total number of adapters being loaded. - """ - low_cpu_mem_usage = low_cpu_mem_usage if low_cpu_mem_usage is not None else _LOW_CPU_MEM_USAGE_DEFAULT - # If the serialization format is new (introduced in https://github.com/huggingface/diffusers/pull/2918), - # then the `state_dict` keys should have `cls.unet_name` and/or `cls.text_encoder_name` as - # their prefixes. - keys = list(state_dict.keys()) - - if all(key.startswith(cls.unet_name) or key.startswith(cls.text_encoder_name) for key in keys): - # Load the layers corresponding to UNet. 
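The key handling below boils down to a prefix filter followed by stripping that prefix; a toy illustration with hypothetical keys:

```py
state_dict = {
    "unet.down_blocks.0.attentions.0.to_q.lora.down.weight": 0,
    "text_encoder.text_model.encoder.layers.0.self_attn.q_proj.lora.down.weight": 1,
}
unet_keys = [k for k in state_dict if k.startswith("unet")]
unet_state_dict = {k.replace("unet.", ""): v for k, v in state_dict.items() if k in unet_keys}
print(list(unet_state_dict))  # ['down_blocks.0.attentions.0.to_q.lora.down.weight']
```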
- logger.info(f"Loading {cls.unet_name}.") - - unet_keys = [k for k in keys if k.startswith(cls.unet_name)] - state_dict = {k.replace(f"{cls.unet_name}.", ""): v for k, v in state_dict.items() if k in unet_keys} - - if network_alphas is not None: - alpha_keys = [k for k in network_alphas.keys() if k.startswith(cls.unet_name)] - network_alphas = { - k.replace(f"{cls.unet_name}.", ""): v for k, v in network_alphas.items() if k in alpha_keys - } - - else: - # Otherwise, we're dealing with the old format. This means the `state_dict` should only - # contain the module names of the `unet` as its keys WITHOUT any prefix. - warn_message = "You have saved the LoRA weights using the old format. To convert the old LoRA weights to the new format, you can first load them in a dictionary and then create a new dictionary like the following: `new_state_dict = {f'unet.{module_name}': params for module_name, params in old_state_dict.items()}`." - logger.warn(warn_message) - - if USE_PEFT_BACKEND and len(state_dict.keys()) > 0: - from peft import LoraConfig, inject_adapter_in_model, set_peft_model_state_dict - - if adapter_name in getattr(unet, "peft_config", {}): - raise ValueError( - f"Adapter name {adapter_name} already in use in the Unet - please select a new adapter name." - ) - - state_dict = convert_unet_state_dict_to_peft(state_dict) - - if network_alphas is not None: - # The alphas state dict have the same structure as Unet, thus we convert it to peft format using - # `convert_unet_state_dict_to_peft` method. - network_alphas = convert_unet_state_dict_to_peft(network_alphas) - - rank = {} - for key, val in state_dict.items(): - if "lora_B" in key: - rank[key] = val.shape[1] - - lora_config_kwargs = get_peft_kwargs(rank, network_alphas, state_dict, is_unet=True) - lora_config = LoraConfig(**lora_config_kwargs) - - # adapter_name - if adapter_name is None: - adapter_name = get_adapter_name(unet) - - # In case the pipeline has been already offloaded to CPU - temporarily remove the hooks - # otherwise loading LoRA weights will lead to an error - is_model_cpu_offload, is_sequential_cpu_offload = cls._optionally_disable_offloading(_pipeline) - - inject_adapter_in_model(lora_config, unet, adapter_name=adapter_name) - incompatible_keys = set_peft_model_state_dict(unet, state_dict, adapter_name) - - if incompatible_keys is not None: - # check only for unexpected keys - unexpected_keys = getattr(incompatible_keys, "unexpected_keys", None) - if unexpected_keys: - logger.warning( - f"Loading adapter weights from state_dict led to unexpected keys not found in the model: " - f" {unexpected_keys}. " - ) - - # Offload back. - if is_model_cpu_offload: - _pipeline.enable_model_cpu_offload() - elif is_sequential_cpu_offload: - _pipeline.enable_sequential_cpu_offload() - # Unsafe code /> - - unet.load_attn_procs( - state_dict, network_alphas=network_alphas, low_cpu_mem_usage=low_cpu_mem_usage, _pipeline=_pipeline - ) - - @classmethod - def load_lora_into_text_encoder( - cls, - state_dict, - network_alphas, - text_encoder, - prefix=None, - lora_scale=1.0, - low_cpu_mem_usage=None, - adapter_name=None, - _pipeline=None, - ): - """ - This will load the LoRA layers specified in `state_dict` into `text_encoder` - - Parameters: - state_dict (`dict`): - A standard state dict containing the lora layer parameters. The key should be prefixed with an - additional `text_encoder` to distinguish between unet lora layers. - network_alphas (`Dict[str, float]`): - See `LoRALinearLayer` for more details. 
- text_encoder (`CLIPTextModel`): - The text encoder model to load the LoRA layers into. - prefix (`str`): - Expected prefix of the `text_encoder` in the `state_dict`. - lora_scale (`float`): - How much to scale the output of the lora linear layer before it is added with the output of the regular - lora layer. - low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`): - Speed up model loading only loading the pretrained weights and not initializing the weights. This also - tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model. - Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this - argument to `True` will raise an error. - adapter_name (`str`, *optional*): - Adapter name to be used for referencing the loaded adapter model. If not specified, it will use - `default_{i}` where i is the total number of adapters being loaded. - """ - low_cpu_mem_usage = low_cpu_mem_usage if low_cpu_mem_usage is not None else _LOW_CPU_MEM_USAGE_DEFAULT - - # If the serialization format is new (introduced in https://github.com/huggingface/diffusers/pull/2918), - # then the `state_dict` keys should have `self.unet_name` and/or `self.text_encoder_name` as - # their prefixes. - keys = list(state_dict.keys()) - prefix = cls.text_encoder_name if prefix is None else prefix - - # Safe prefix to check with. - if any(cls.text_encoder_name in key for key in keys): - # Load the layers corresponding to text encoder and make necessary adjustments. - text_encoder_keys = [k for k in keys if k.startswith(prefix) and k.split(".")[0] == prefix] - text_encoder_lora_state_dict = { - k.replace(f"{prefix}.", ""): v for k, v in state_dict.items() if k in text_encoder_keys - } - - if len(text_encoder_lora_state_dict) > 0: - logger.info(f"Loading {prefix}.") - rank = {} - text_encoder_lora_state_dict = convert_state_dict_to_diffusers(text_encoder_lora_state_dict) - - if USE_PEFT_BACKEND: - # convert state dict - text_encoder_lora_state_dict = convert_state_dict_to_peft(text_encoder_lora_state_dict) - - for name, _ in text_encoder_attn_modules(text_encoder): - rank_key = f"{name}.out_proj.lora_B.weight" - rank[rank_key] = text_encoder_lora_state_dict[rank_key].shape[1] - - patch_mlp = any(".mlp." in key for key in text_encoder_lora_state_dict.keys()) - if patch_mlp: - for name, _ in text_encoder_mlp_modules(text_encoder): - rank_key_fc1 = f"{name}.fc1.lora_B.weight" - rank_key_fc2 = f"{name}.fc2.lora_B.weight" - - rank[rank_key_fc1] = text_encoder_lora_state_dict[rank_key_fc1].shape[1] - rank[rank_key_fc2] = text_encoder_lora_state_dict[rank_key_fc2].shape[1] - else: - for name, _ in text_encoder_attn_modules(text_encoder): - rank_key = f"{name}.out_proj.lora_linear_layer.up.weight" - rank.update({rank_key: text_encoder_lora_state_dict[rank_key].shape[1]}) - - patch_mlp = any(".mlp." 
in key for key in text_encoder_lora_state_dict.keys()) - if patch_mlp: - for name, _ in text_encoder_mlp_modules(text_encoder): - rank_key_fc1 = f"{name}.fc1.lora_linear_layer.up.weight" - rank_key_fc2 = f"{name}.fc2.lora_linear_layer.up.weight" - rank[rank_key_fc1] = text_encoder_lora_state_dict[rank_key_fc1].shape[1] - rank[rank_key_fc2] = text_encoder_lora_state_dict[rank_key_fc2].shape[1] - - if network_alphas is not None: - alpha_keys = [ - k for k in network_alphas.keys() if k.startswith(prefix) and k.split(".")[0] == prefix - ] - network_alphas = { - k.replace(f"{prefix}.", ""): v for k, v in network_alphas.items() if k in alpha_keys - } - - if USE_PEFT_BACKEND: - from peft import LoraConfig - - lora_config_kwargs = get_peft_kwargs( - rank, network_alphas, text_encoder_lora_state_dict, is_unet=False - ) - - lora_config = LoraConfig(**lora_config_kwargs) - - # adapter_name - if adapter_name is None: - adapter_name = get_adapter_name(text_encoder) - - is_model_cpu_offload, is_sequential_cpu_offload = cls._optionally_disable_offloading(_pipeline) - - # inject LoRA layers and load the state dict - # in transformers we automatically check whether the adapter name is already in use or not - text_encoder.load_adapter( - adapter_name=adapter_name, - adapter_state_dict=text_encoder_lora_state_dict, - peft_config=lora_config, - ) - - # scale LoRA layers with `lora_scale` - scale_lora_layers(text_encoder, weight=lora_scale) - else: - cls._modify_text_encoder( - text_encoder, - lora_scale, - network_alphas, - rank=rank, - patch_mlp=patch_mlp, - low_cpu_mem_usage=low_cpu_mem_usage, - ) - - is_pipeline_offloaded = _pipeline is not None and any( - isinstance(c, torch.nn.Module) and hasattr(c, "_hf_hook") - for c in _pipeline.components.values() - ) - if is_pipeline_offloaded and low_cpu_mem_usage: - low_cpu_mem_usage = True - logger.info( - f"Pipeline {_pipeline.__class__} is offloaded. Therefore low cpu mem usage loading is forced." - ) - - if low_cpu_mem_usage: - device = next(iter(text_encoder_lora_state_dict.values())).device - dtype = next(iter(text_encoder_lora_state_dict.values())).dtype - unexpected_keys = load_model_dict_into_meta( - text_encoder, text_encoder_lora_state_dict, device=device, dtype=dtype - ) - else: - load_state_dict_results = text_encoder.load_state_dict( - text_encoder_lora_state_dict, strict=False - ) - unexpected_keys = load_state_dict_results.unexpected_keys - - if len(unexpected_keys) != 0: - raise ValueError( - f"failed to load text encoder state dict, unexpected keys: {load_state_dict_results.unexpected_keys}" - ) - - # - - @property - def lora_scale(self) -> float: - # property function that returns the lora scale which can be set at run time by the pipeline. 
- # if _lora_scale has not been set, return 1 - return self._lora_scale if hasattr(self, "_lora_scale") else 1.0 - - def _remove_text_encoder_monkey_patch(self): - if USE_PEFT_BACKEND: - remove_method = recurse_remove_peft_layers - else: - remove_method = self._remove_text_encoder_monkey_patch_classmethod - - if hasattr(self, "text_encoder"): - remove_method(self.text_encoder) - - # In case text encoder have no Lora attached - if USE_PEFT_BACKEND and getattr(self.text_encoder, "peft_config", None) is not None: - del self.text_encoder.peft_config - self.text_encoder._hf_peft_config_loaded = None - if hasattr(self, "text_encoder_2"): - remove_method(self.text_encoder_2) - if USE_PEFT_BACKEND: - del self.text_encoder_2.peft_config - self.text_encoder_2._hf_peft_config_loaded = None - - @classmethod - def _remove_text_encoder_monkey_patch_classmethod(cls, text_encoder): - if version.parse(__version__) > version.parse("0.23"): - deprecate("_remove_text_encoder_monkey_patch_classmethod", "0.25", LORA_DEPRECATION_MESSAGE) - - for _, attn_module in text_encoder_attn_modules(text_encoder): - if isinstance(attn_module.q_proj, PatchedLoraProjection): - attn_module.q_proj.lora_linear_layer = None - attn_module.k_proj.lora_linear_layer = None - attn_module.v_proj.lora_linear_layer = None - attn_module.out_proj.lora_linear_layer = None - - for _, mlp_module in text_encoder_mlp_modules(text_encoder): - if isinstance(mlp_module.fc1, PatchedLoraProjection): - mlp_module.fc1.lora_linear_layer = None - mlp_module.fc2.lora_linear_layer = None - - @classmethod - def _modify_text_encoder( - cls, - text_encoder, - lora_scale=1, - network_alphas=None, - rank: Union[Dict[str, int], int] = 4, - dtype=None, - patch_mlp=False, - low_cpu_mem_usage=False, - ): - r""" - Monkey-patches the forward passes of attention modules of the text encoder. 
- """ - if version.parse(__version__) > version.parse("0.23"): - deprecate("_modify_text_encoder", "0.25", LORA_DEPRECATION_MESSAGE) - - def create_patched_linear_lora(model, network_alpha, rank, dtype, lora_parameters): - linear_layer = model.regular_linear_layer if isinstance(model, PatchedLoraProjection) else model - ctx = init_empty_weights if low_cpu_mem_usage else nullcontext - with ctx(): - model = PatchedLoraProjection(linear_layer, lora_scale, network_alpha, rank, dtype=dtype) - - lora_parameters.extend(model.lora_linear_layer.parameters()) - return model - - # First, remove any monkey-patch that might have been applied before - cls._remove_text_encoder_monkey_patch_classmethod(text_encoder) - - lora_parameters = [] - network_alphas = {} if network_alphas is None else network_alphas - is_network_alphas_populated = len(network_alphas) > 0 - - for name, attn_module in text_encoder_attn_modules(text_encoder): - query_alpha = network_alphas.pop(name + ".to_q_lora.down.weight.alpha", None) - key_alpha = network_alphas.pop(name + ".to_k_lora.down.weight.alpha", None) - value_alpha = network_alphas.pop(name + ".to_v_lora.down.weight.alpha", None) - out_alpha = network_alphas.pop(name + ".to_out_lora.down.weight.alpha", None) - - if isinstance(rank, dict): - current_rank = rank.pop(f"{name}.out_proj.lora_linear_layer.up.weight") - else: - current_rank = rank - - attn_module.q_proj = create_patched_linear_lora( - attn_module.q_proj, query_alpha, current_rank, dtype, lora_parameters - ) - attn_module.k_proj = create_patched_linear_lora( - attn_module.k_proj, key_alpha, current_rank, dtype, lora_parameters - ) - attn_module.v_proj = create_patched_linear_lora( - attn_module.v_proj, value_alpha, current_rank, dtype, lora_parameters - ) - attn_module.out_proj = create_patched_linear_lora( - attn_module.out_proj, out_alpha, current_rank, dtype, lora_parameters - ) - - if patch_mlp: - for name, mlp_module in text_encoder_mlp_modules(text_encoder): - fc1_alpha = network_alphas.pop(name + ".fc1.lora_linear_layer.down.weight.alpha", None) - fc2_alpha = network_alphas.pop(name + ".fc2.lora_linear_layer.down.weight.alpha", None) - - current_rank_fc1 = rank.pop(f"{name}.fc1.lora_linear_layer.up.weight") - current_rank_fc2 = rank.pop(f"{name}.fc2.lora_linear_layer.up.weight") - - mlp_module.fc1 = create_patched_linear_lora( - mlp_module.fc1, fc1_alpha, current_rank_fc1, dtype, lora_parameters - ) - mlp_module.fc2 = create_patched_linear_lora( - mlp_module.fc2, fc2_alpha, current_rank_fc2, dtype, lora_parameters - ) - - if is_network_alphas_populated and len(network_alphas) > 0: - raise ValueError( - f"The `network_alphas` has to be empty at this point but has the following keys \n\n {', '.join(network_alphas.keys())}" - ) - - return lora_parameters - - @classmethod - def save_lora_weights( - cls, - save_directory: Union[str, os.PathLike], - unet_lora_layers: Dict[str, Union[torch.nn.Module, torch.Tensor]] = None, - text_encoder_lora_layers: Dict[str, torch.nn.Module] = None, - is_main_process: bool = True, - weight_name: str = None, - save_function: Callable = None, - safe_serialization: bool = True, - ): - r""" - Save the LoRA parameters corresponding to the UNet and text encoder. - - Arguments: - save_directory (`str` or `os.PathLike`): - Directory to save LoRA parameters to. Will be created if it doesn't exist. - unet_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`): - State dict of the LoRA layers corresponding to the `unet`. 
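To make the monkey patch applied by `_modify_text_encoder` above more tangible, here is a simplified stand-in for the patched-projection idea (an illustration, not the actual `PatchedLoraProjection` class): the wrapped linear layer keeps its original weights, and a scaled low-rank `up(down(x))` term is added to its output.

```py
import torch


class ToyPatchedLinear(torch.nn.Module):
    """Wraps a regular linear layer and adds a scaled low-rank (LoRA) update to its output."""

    def __init__(self, regular_linear_layer, rank=4, lora_scale=1.0):
        super().__init__()
        self.regular_linear_layer = regular_linear_layer
        self.lora_scale = lora_scale
        in_features = regular_linear_layer.in_features
        out_features = regular_linear_layer.out_features
        self.down = torch.nn.Linear(in_features, rank, bias=False)
        self.up = torch.nn.Linear(rank, out_features, bias=False)
        torch.nn.init.zeros_(self.up.weight)  # the LoRA branch starts as a no-op

    def forward(self, x):
        return self.regular_linear_layer(x) + self.lora_scale * self.up(self.down(x))


layer = ToyPatchedLinear(torch.nn.Linear(320, 320), rank=4, lora_scale=0.5)
print(layer(torch.randn(1, 320)).shape)  # torch.Size([1, 320])
# The rank can be read back from the up-projection, which is what the rank bookkeeping above relies on.
print(layer.up.weight.shape[1])  # 4
```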
- text_encoder_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`): - State dict of the LoRA layers corresponding to the `text_encoder`. Must explicitly pass the text - encoder LoRA state dict because it comes from πŸ€— Transformers. - is_main_process (`bool`, *optional*, defaults to `True`): - Whether the process calling this is the main process or not. Useful during distributed training and you - need to call this function on all processes. In this case, set `is_main_process=True` only on the main - process to avoid race conditions. - save_function (`Callable`): - The function to use to save the state dictionary. Useful during distributed training when you need to - replace `torch.save` with another method. Can be configured with the environment variable - `DIFFUSERS_SAVE_MODE`. - safe_serialization (`bool`, *optional*, defaults to `True`): - Whether to save the model using `safetensors` or the traditional PyTorch way with `pickle`. - """ - # Create a flat dictionary. - state_dict = {} - - # Populate the dictionary. - if unet_lora_layers is not None: - weights = ( - unet_lora_layers.state_dict() if isinstance(unet_lora_layers, torch.nn.Module) else unet_lora_layers - ) - - unet_lora_state_dict = {f"{cls.unet_name}.{module_name}": param for module_name, param in weights.items()} - state_dict.update(unet_lora_state_dict) - - if text_encoder_lora_layers is not None: - weights = ( - text_encoder_lora_layers.state_dict() - if isinstance(text_encoder_lora_layers, torch.nn.Module) - else text_encoder_lora_layers - ) - - text_encoder_lora_state_dict = { - f"{cls.text_encoder_name}.{module_name}": param for module_name, param in weights.items() - } - state_dict.update(text_encoder_lora_state_dict) - - # Save the model - cls.write_lora_layers( - state_dict=state_dict, - save_directory=save_directory, - is_main_process=is_main_process, - weight_name=weight_name, - save_function=save_function, - safe_serialization=safe_serialization, - ) - - @staticmethod - def write_lora_layers( - state_dict: Dict[str, torch.Tensor], - save_directory: str, - is_main_process: bool, - weight_name: str, - save_function: Callable, - safe_serialization: bool, - ): - if os.path.isfile(save_directory): - logger.error(f"Provided path ({save_directory}) should be a directory, not a file") - return - - if save_function is None: - if safe_serialization: - - def save_function(weights, filename): - return safetensors.torch.save_file(weights, filename, metadata={"format": "pt"}) - - else: - save_function = torch.save - - os.makedirs(save_directory, exist_ok=True) - - if weight_name is None: - if safe_serialization: - weight_name = LORA_WEIGHT_NAME_SAFE - else: - weight_name = LORA_WEIGHT_NAME - - save_function(state_dict, os.path.join(save_directory, weight_name)) - logger.info(f"Model weights saved in {os.path.join(save_directory, weight_name)}") - - @classmethod - def _convert_kohya_lora_to_diffusers(cls, state_dict): - unet_state_dict = {} - te_state_dict = {} - te2_state_dict = {} - network_alphas = {} - - # every down weight has a corresponding up weight and potentially an alpha weight - lora_keys = [k for k in state_dict.keys() if k.endswith("lora_down.weight")] - for key in lora_keys: - lora_name = key.split(".")[0] - lora_name_up = lora_name + ".lora_up.weight" - lora_name_alpha = lora_name + ".alpha" - - if lora_name.startswith("lora_unet_"): - diffusers_name = key.replace("lora_unet_", "").replace("_", ".") - - if "input.blocks" in diffusers_name: - diffusers_name = 
diffusers_name.replace("input.blocks", "down_blocks") - else: - diffusers_name = diffusers_name.replace("down.blocks", "down_blocks") - - if "middle.block" in diffusers_name: - diffusers_name = diffusers_name.replace("middle.block", "mid_block") - else: - diffusers_name = diffusers_name.replace("mid.block", "mid_block") - if "output.blocks" in diffusers_name: - diffusers_name = diffusers_name.replace("output.blocks", "up_blocks") - else: - diffusers_name = diffusers_name.replace("up.blocks", "up_blocks") - - diffusers_name = diffusers_name.replace("transformer.blocks", "transformer_blocks") - diffusers_name = diffusers_name.replace("to.q.lora", "to_q_lora") - diffusers_name = diffusers_name.replace("to.k.lora", "to_k_lora") - diffusers_name = diffusers_name.replace("to.v.lora", "to_v_lora") - diffusers_name = diffusers_name.replace("to.out.0.lora", "to_out_lora") - diffusers_name = diffusers_name.replace("proj.in", "proj_in") - diffusers_name = diffusers_name.replace("proj.out", "proj_out") - diffusers_name = diffusers_name.replace("emb.layers", "time_emb_proj") - - # SDXL specificity. - if "emb" in diffusers_name and "time.emb.proj" not in diffusers_name: - pattern = r"\.\d+(?=\D*$)" - diffusers_name = re.sub(pattern, "", diffusers_name, count=1) - if ".in." in diffusers_name: - diffusers_name = diffusers_name.replace("in.layers.2", "conv1") - if ".out." in diffusers_name: - diffusers_name = diffusers_name.replace("out.layers.3", "conv2") - if "downsamplers" in diffusers_name or "upsamplers" in diffusers_name: - diffusers_name = diffusers_name.replace("op", "conv") - if "skip" in diffusers_name: - diffusers_name = diffusers_name.replace("skip.connection", "conv_shortcut") - - # LyCORIS specificity. - if "time.emb.proj" in diffusers_name: - diffusers_name = diffusers_name.replace("time.emb.proj", "time_emb_proj") - if "conv.shortcut" in diffusers_name: - diffusers_name = diffusers_name.replace("conv.shortcut", "conv_shortcut") - - # General coverage. 
- if "transformer_blocks" in diffusers_name: - if "attn1" in diffusers_name or "attn2" in diffusers_name: - diffusers_name = diffusers_name.replace("attn1", "attn1.processor") - diffusers_name = diffusers_name.replace("attn2", "attn2.processor") - unet_state_dict[diffusers_name] = state_dict.pop(key) - unet_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) - elif "ff" in diffusers_name: - unet_state_dict[diffusers_name] = state_dict.pop(key) - unet_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) - elif any(key in diffusers_name for key in ("proj_in", "proj_out")): - unet_state_dict[diffusers_name] = state_dict.pop(key) - unet_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) - else: - unet_state_dict[diffusers_name] = state_dict.pop(key) - unet_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) - - elif lora_name.startswith("lora_te_"): - diffusers_name = key.replace("lora_te_", "").replace("_", ".") - diffusers_name = diffusers_name.replace("text.model", "text_model") - diffusers_name = diffusers_name.replace("self.attn", "self_attn") - diffusers_name = diffusers_name.replace("q.proj.lora", "to_q_lora") - diffusers_name = diffusers_name.replace("k.proj.lora", "to_k_lora") - diffusers_name = diffusers_name.replace("v.proj.lora", "to_v_lora") - diffusers_name = diffusers_name.replace("out.proj.lora", "to_out_lora") - if "self_attn" in diffusers_name: - te_state_dict[diffusers_name] = state_dict.pop(key) - te_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) - elif "mlp" in diffusers_name: - # Be aware that this is the new diffusers convention and the rest of the code might - # not utilize it yet. - diffusers_name = diffusers_name.replace(".lora.", ".lora_linear_layer.") - te_state_dict[diffusers_name] = state_dict.pop(key) - te_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) - - # (sayakpaul): Duplicate code. Needs to be cleaned. - elif lora_name.startswith("lora_te1_"): - diffusers_name = key.replace("lora_te1_", "").replace("_", ".") - diffusers_name = diffusers_name.replace("text.model", "text_model") - diffusers_name = diffusers_name.replace("self.attn", "self_attn") - diffusers_name = diffusers_name.replace("q.proj.lora", "to_q_lora") - diffusers_name = diffusers_name.replace("k.proj.lora", "to_k_lora") - diffusers_name = diffusers_name.replace("v.proj.lora", "to_v_lora") - diffusers_name = diffusers_name.replace("out.proj.lora", "to_out_lora") - if "self_attn" in diffusers_name: - te_state_dict[diffusers_name] = state_dict.pop(key) - te_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) - elif "mlp" in diffusers_name: - # Be aware that this is the new diffusers convention and the rest of the code might - # not utilize it yet. - diffusers_name = diffusers_name.replace(".lora.", ".lora_linear_layer.") - te_state_dict[diffusers_name] = state_dict.pop(key) - te_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) - - # (sayakpaul): Duplicate code. Needs to be cleaned. 
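As a compressed illustration of the key renaming above, consider one hypothetical Kohya-style UNet attention key; only the `replace` calls that actually fire for this key are shown:

```py
key = "lora_unet_down_blocks_0_attentions_0_transformer_blocks_0_attn1_to_q.lora_down.weight"

name = key.replace("lora_unet_", "").replace("_", ".")
name = name.replace("down.blocks", "down_blocks")
name = name.replace("transformer.blocks", "transformer_blocks")
name = name.replace("to.q.lora", "to_q_lora")
# Attention layers inside transformer blocks additionally gain the ".processor" segment.
name = name.replace("attn1", "attn1.processor")

print(name)
# down_blocks.0.attentions.0.transformer_blocks.0.attn1.processor.to_q_lora.down.weight
```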
- elif lora_name.startswith("lora_te2_"): - diffusers_name = key.replace("lora_te2_", "").replace("_", ".") - diffusers_name = diffusers_name.replace("text.model", "text_model") - diffusers_name = diffusers_name.replace("self.attn", "self_attn") - diffusers_name = diffusers_name.replace("q.proj.lora", "to_q_lora") - diffusers_name = diffusers_name.replace("k.proj.lora", "to_k_lora") - diffusers_name = diffusers_name.replace("v.proj.lora", "to_v_lora") - diffusers_name = diffusers_name.replace("out.proj.lora", "to_out_lora") - if "self_attn" in diffusers_name: - te2_state_dict[diffusers_name] = state_dict.pop(key) - te2_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) - elif "mlp" in diffusers_name: - # Be aware that this is the new diffusers convention and the rest of the code might - # not utilize it yet. - diffusers_name = diffusers_name.replace(".lora.", ".lora_linear_layer.") - te2_state_dict[diffusers_name] = state_dict.pop(key) - te2_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) - - # Rename the alphas so that they can be mapped appropriately. - if lora_name_alpha in state_dict: - alpha = state_dict.pop(lora_name_alpha).item() - if lora_name_alpha.startswith("lora_unet_"): - prefix = "unet." - elif lora_name_alpha.startswith(("lora_te_", "lora_te1_")): - prefix = "text_encoder." - else: - prefix = "text_encoder_2." - new_name = prefix + diffusers_name.split(".lora.")[0] + ".alpha" - network_alphas.update({new_name: alpha}) - - if len(state_dict) > 0: - raise ValueError( - f"The following keys have not been correctly be renamed: \n\n {', '.join(state_dict.keys())}" - ) - - logger.info("Kohya-style checkpoint detected.") - unet_state_dict = {f"{cls.unet_name}.{module_name}": params for module_name, params in unet_state_dict.items()} - te_state_dict = { - f"{cls.text_encoder_name}.{module_name}": params for module_name, params in te_state_dict.items() - } - te2_state_dict = ( - {f"text_encoder_2.{module_name}": params for module_name, params in te2_state_dict.items()} - if len(te2_state_dict) > 0 - else None - ) - if te2_state_dict is not None: - te_state_dict.update(te2_state_dict) - - new_state_dict = {**unet_state_dict, **te_state_dict} - return new_state_dict, network_alphas - - def unload_lora_weights(self): - """ - Unloads the LoRA parameters. - - Examples: - - ```python - >>> # Assuming `pipeline` is already loaded with the LoRA parameters. - >>> pipeline.unload_lora_weights() - >>> ... - ``` - """ - if not USE_PEFT_BACKEND: - if version.parse(__version__) > version.parse("0.23"): - logger.warn( - "You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights," - "you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT." - ) - - for _, module in self.unet.named_modules(): - if hasattr(module, "set_lora_layer"): - module.set_lora_layer(None) - else: - recurse_remove_peft_layers(self.unet) - if hasattr(self.unet, "peft_config"): - del self.unet.peft_config - - # Safe to call the following regardless of LoRA. - self._remove_text_encoder_monkey_patch() - - def fuse_lora( - self, - fuse_unet: bool = True, - fuse_text_encoder: bool = True, - lora_scale: float = 1.0, - safe_fusing: bool = False, - ): - r""" - Fuses the LoRA parameters into the original parameters of the corresponding blocks. - - - - This is an experimental API. 
- - - - Args: - fuse_unet (`bool`, defaults to `True`): Whether to fuse the UNet LoRA parameters. - fuse_text_encoder (`bool`, defaults to `True`): - Whether to fuse the text encoder LoRA parameters. If the text encoder wasn't monkey-patched with the - LoRA parameters then it won't have any effect. - lora_scale (`float`, defaults to 1.0): - Controls how much to influence the outputs with the LoRA parameters. - safe_fusing (`bool`, defaults to `False`): - Whether to check fused weights for NaN values before fusing and if values are NaN not fusing them. - """ - if fuse_unet or fuse_text_encoder: - self.num_fused_loras += 1 - if self.num_fused_loras > 1: - logger.warn( - "The current API is supported for operating with a single LoRA file. You are trying to load and fuse more than one LoRA which is not well-supported.", - ) - - if fuse_unet: - self.unet.fuse_lora(lora_scale, safe_fusing=safe_fusing) - - if USE_PEFT_BACKEND: - from peft.tuners.tuners_utils import BaseTunerLayer - - def fuse_text_encoder_lora(text_encoder, lora_scale=1.0, safe_fusing=False): - # TODO(Patrick, Younes): enable "safe" fusing - for module in text_encoder.modules(): - if isinstance(module, BaseTunerLayer): - if lora_scale != 1.0: - module.scale_layer(lora_scale) - - module.merge() - - else: - if version.parse(__version__) > version.parse("0.23"): - deprecate("fuse_text_encoder_lora", "0.25", LORA_DEPRECATION_MESSAGE) - - def fuse_text_encoder_lora(text_encoder, lora_scale=1.0, safe_fusing=False): - for _, attn_module in text_encoder_attn_modules(text_encoder): - if isinstance(attn_module.q_proj, PatchedLoraProjection): - attn_module.q_proj._fuse_lora(lora_scale, safe_fusing) - attn_module.k_proj._fuse_lora(lora_scale, safe_fusing) - attn_module.v_proj._fuse_lora(lora_scale, safe_fusing) - attn_module.out_proj._fuse_lora(lora_scale, safe_fusing) - - for _, mlp_module in text_encoder_mlp_modules(text_encoder): - if isinstance(mlp_module.fc1, PatchedLoraProjection): - mlp_module.fc1._fuse_lora(lora_scale, safe_fusing) - mlp_module.fc2._fuse_lora(lora_scale, safe_fusing) - - if fuse_text_encoder: - if hasattr(self, "text_encoder"): - fuse_text_encoder_lora(self.text_encoder, lora_scale, safe_fusing) - if hasattr(self, "text_encoder_2"): - fuse_text_encoder_lora(self.text_encoder_2, lora_scale, safe_fusing) - - def unfuse_lora(self, unfuse_unet: bool = True, unfuse_text_encoder: bool = True): - r""" - Reverses the effect of - [`pipe.fuse_lora()`](https://huggingface.co/docs/diffusers/main/en/api/loaders#diffusers.loaders.LoraLoaderMixin.fuse_lora). - - - - This is an experimental API. - - - - Args: - unfuse_unet (`bool`, defaults to `True`): Whether to unfuse the UNet LoRA parameters. - unfuse_text_encoder (`bool`, defaults to `True`): - Whether to unfuse the text encoder LoRA parameters. If the text encoder wasn't monkey-patched with the - LoRA parameters then it won't have any effect. 
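A minimal usage sketch for the `fuse_lora`/`unfuse_lora` pair documented here; the checkpoint names mirror the `get_active_adapters` example further below and are placeholders for whichever LoRA you actually use:

```py
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors")

pipe.fuse_lora(lora_scale=0.7)  # bake the scaled LoRA deltas into the base weights
image = pipe("toy face of a corgi").images[0]
pipe.unfuse_lora()  # restore the original, unfused weights
```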
- """ - if unfuse_unet: - if not USE_PEFT_BACKEND: - self.unet.unfuse_lora() - else: - from peft.tuners.tuners_utils import BaseTunerLayer - - for module in self.unet.modules(): - if isinstance(module, BaseTunerLayer): - module.unmerge() - - if USE_PEFT_BACKEND: - from peft.tuners.tuners_utils import BaseTunerLayer - - def unfuse_text_encoder_lora(text_encoder): - for module in text_encoder.modules(): - if isinstance(module, BaseTunerLayer): - module.unmerge() - - else: - if version.parse(__version__) > version.parse("0.23"): - deprecate("unfuse_text_encoder_lora", "0.25", LORA_DEPRECATION_MESSAGE) - - def unfuse_text_encoder_lora(text_encoder): - for _, attn_module in text_encoder_attn_modules(text_encoder): - if isinstance(attn_module.q_proj, PatchedLoraProjection): - attn_module.q_proj._unfuse_lora() - attn_module.k_proj._unfuse_lora() - attn_module.v_proj._unfuse_lora() - attn_module.out_proj._unfuse_lora() - - for _, mlp_module in text_encoder_mlp_modules(text_encoder): - if isinstance(mlp_module.fc1, PatchedLoraProjection): - mlp_module.fc1._unfuse_lora() - mlp_module.fc2._unfuse_lora() - - if unfuse_text_encoder: - if hasattr(self, "text_encoder"): - unfuse_text_encoder_lora(self.text_encoder) - if hasattr(self, "text_encoder_2"): - unfuse_text_encoder_lora(self.text_encoder_2) - - self.num_fused_loras -= 1 - - def set_adapters_for_text_encoder( - self, - adapter_names: Union[List[str], str], - text_encoder: Optional["PreTrainedModel"] = None, # noqa: F821 - text_encoder_weights: List[float] = None, - ): - """ - Sets the adapter layers for the text encoder. - - Args: - adapter_names (`List[str]` or `str`): - The names of the adapters to use. - text_encoder (`torch.nn.Module`, *optional*): - The text encoder module to set the adapter layers for. If `None`, it will try to get the `text_encoder` - attribute. - text_encoder_weights (`List[float]`, *optional*): - The weights to use for the text encoder. If `None`, the weights are set to `1.0` for all the adapters. - """ - if not USE_PEFT_BACKEND: - raise ValueError("PEFT backend is required for this method.") - - def process_weights(adapter_names, weights): - if weights is None: - weights = [1.0] * len(adapter_names) - elif isinstance(weights, float): - weights = [weights] - - if len(adapter_names) != len(weights): - raise ValueError( - f"Length of adapter names {len(adapter_names)} is not equal to the length of the weights {len(weights)}" - ) - return weights - - adapter_names = [adapter_names] if isinstance(adapter_names, str) else adapter_names - text_encoder_weights = process_weights(adapter_names, text_encoder_weights) - text_encoder = text_encoder or getattr(self, "text_encoder", None) - if text_encoder is None: - raise ValueError( - "The pipeline does not have a default `pipe.text_encoder` class. Please make sure to pass a `text_encoder` instead." - ) - set_weights_and_activate_adapters(text_encoder, adapter_names, text_encoder_weights) - - def disable_lora_for_text_encoder(self, text_encoder: Optional["PreTrainedModel"] = None): - """ - Disables the LoRA layers for the text encoder. - - Args: - text_encoder (`torch.nn.Module`, *optional*): - The text encoder module to disable the LoRA layers for. If `None`, it will try to get the - `text_encoder` attribute. 
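The weighting logic in `set_adapters_for_text_encoder` above (and in `set_adapters` just below) is easiest to see with two named adapters. A sketch that continues the pipeline from the previous example; the second checkpoint path is a stand-in for any additional LoRA you have locally:

```py
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
pipe.load_lora_weights("./my_second_lora", adapter_name="extra")  # placeholder path

# Weights are matched positionally to the adapter names; omitting them defaults every weight to 1.0.
pipe.set_adapters(["toy", "extra"], adapter_weights=[0.7, 0.3])
```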
- """ - if not USE_PEFT_BACKEND: - raise ValueError("PEFT backend is required for this method.") - - text_encoder = text_encoder or getattr(self, "text_encoder", None) - if text_encoder is None: - raise ValueError("Text Encoder not found.") - set_adapter_layers(text_encoder, enabled=False) - - def enable_lora_for_text_encoder(self, text_encoder: Optional["PreTrainedModel"] = None): - """ - Enables the LoRA layers for the text encoder. - - Args: - text_encoder (`torch.nn.Module`, *optional*): - The text encoder module to enable the LoRA layers for. If `None`, it will try to get the `text_encoder` - attribute. - """ - if not USE_PEFT_BACKEND: - raise ValueError("PEFT backend is required for this method.") - text_encoder = text_encoder or getattr(self, "text_encoder", None) - if text_encoder is None: - raise ValueError("Text Encoder not found.") - set_adapter_layers(self.text_encoder, enabled=True) - - def set_adapters( - self, - adapter_names: Union[List[str], str], - adapter_weights: Optional[List[float]] = None, - ): - # Handle the UNET - self.unet.set_adapters(adapter_names, adapter_weights) - - # Handle the Text Encoder - if hasattr(self, "text_encoder"): - self.set_adapters_for_text_encoder(adapter_names, self.text_encoder, adapter_weights) - if hasattr(self, "text_encoder_2"): - self.set_adapters_for_text_encoder(adapter_names, self.text_encoder_2, adapter_weights) - - def disable_lora(self): - if not USE_PEFT_BACKEND: - raise ValueError("PEFT backend is required for this method.") - - # Disable unet adapters - self.unet.disable_lora() - - # Disable text encoder adapters - if hasattr(self, "text_encoder"): - self.disable_lora_for_text_encoder(self.text_encoder) - if hasattr(self, "text_encoder_2"): - self.disable_lora_for_text_encoder(self.text_encoder_2) - - def enable_lora(self): - if not USE_PEFT_BACKEND: - raise ValueError("PEFT backend is required for this method.") - - # Enable unet adapters - self.unet.enable_lora() - - # Enable text encoder adapters - if hasattr(self, "text_encoder"): - self.enable_lora_for_text_encoder(self.text_encoder) - if hasattr(self, "text_encoder_2"): - self.enable_lora_for_text_encoder(self.text_encoder_2) - - def get_active_adapters(self) -> List[str]: - """ - Gets the list of the current active adapters. - - Example: - - ```python - from diffusers import DiffusionPipeline - - pipeline = DiffusionPipeline.from_pretrained( - "stabilityai/stable-diffusion-xl-base-1.0", - ).to("cuda") - pipeline.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy") - pipeline.get_active_adapters() - ``` - """ - if not USE_PEFT_BACKEND: - raise ValueError( - "PEFT backend is required for this method. Please install the latest version of PEFT `pip install -U peft`" - ) - - from peft.tuners.tuners_utils import BaseTunerLayer - - active_adapters = [] - - for module in self.unet.modules(): - if isinstance(module, BaseTunerLayer): - active_adapters = module.active_adapters - break - - return active_adapters - - def get_list_adapters(self) -> Dict[str, List[str]]: - """ - Gets the current list of all available adapters in the pipeline. - """ - if not USE_PEFT_BACKEND: - raise ValueError( - "PEFT backend is required for this method. 
Please install the latest version of PEFT `pip install -U peft`" - ) - - set_adapters = {} - - if hasattr(self, "text_encoder") and hasattr(self.text_encoder, "peft_config"): - set_adapters["text_encoder"] = list(self.text_encoder.peft_config.keys()) - - if hasattr(self, "text_encoder_2") and hasattr(self.text_encoder_2, "peft_config"): - set_adapters["text_encoder_2"] = list(self.text_encoder_2.peft_config.keys()) - - if hasattr(self, "unet") and hasattr(self.unet, "peft_config"): - set_adapters["unet"] = list(self.unet.peft_config.keys()) - - return set_adapters - - def set_lora_device(self, adapter_names: List[str], device: Union[torch.device, str, int]) -> None: - """ - Moves the LoRAs listed in `adapter_names` to a target device. Useful for offloading the LoRA to the CPU in case - you want to load multiple adapters and free some GPU memory. - - Args: - adapter_names (`List[str]`): - List of adapters to send device to. - device (`Union[torch.device, str, int]`): - Device to send the adapters to. Can be either a torch device, a str or an integer. - """ - if not USE_PEFT_BACKEND: - raise ValueError("PEFT backend is required for this method.") - - from peft.tuners.tuners_utils import BaseTunerLayer - - # Handle the UNET - for unet_module in self.unet.modules(): - if isinstance(unet_module, BaseTunerLayer): - for adapter_name in adapter_names: - unet_module.lora_A[adapter_name].to(device) - unet_module.lora_B[adapter_name].to(device) - - # Handle the text encoder - modules_to_process = [] - if hasattr(self, "text_encoder"): - modules_to_process.append(self.text_encoder) - - if hasattr(self, "text_encoder_2"): - modules_to_process.append(self.text_encoder_2) - - for text_encoder in modules_to_process: - # loop over submodules - for text_encoder_module in text_encoder.modules(): - if isinstance(text_encoder_module, BaseTunerLayer): - for adapter_name in adapter_names: - text_encoder_module.lora_A[adapter_name].to(device) - text_encoder_module.lora_B[adapter_name].to(device) - - -class FromSingleFileMixin: - """ - Load model weights saved in the `.ckpt` format into a [`DiffusionPipeline`]. - """ - - @classmethod - def from_ckpt(cls, *args, **kwargs): - deprecation_message = "The function `from_ckpt` is deprecated in favor of `from_single_file` and will be removed in diffusers v.0.21. Please make sure to use `StableDiffusionPipeline.from_single_file(...)` instead." - deprecate("from_ckpt", "0.21.0", deprecation_message, standard_warn=False) - return cls.from_single_file(*args, **kwargs) - - @classmethod - def from_single_file(cls, pretrained_model_link_or_path, **kwargs): - r""" - Instantiate a [`DiffusionPipeline`] from pretrained pipeline weights saved in the `.ckpt` or `.safetensors` - format. The pipeline is set in evaluation mode (`model.eval()`) by default. - - Parameters: - pretrained_model_link_or_path (`str` or `os.PathLike`, *optional*): - Can be either: - - A link to the `.ckpt` file (for example - `"https://huggingface.co//blob/main/.ckpt"`) on the Hub. - - A path to a *file* containing all pipeline weights. - torch_dtype (`str` or `torch.dtype`, *optional*): - Override the default `torch.dtype` and load the model with another dtype. If `"auto"` is passed, the - dtype is automatically derived from the model's weights. - force_download (`bool`, *optional*, defaults to `False`): - Whether or not to force the (re-)download of the model weights and configuration files, overriding the - cached versions if they exist. 
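Continuing that sketch, the two inspection and device-placement helpers defined above can be used roughly like this (the printed dictionary is illustrative):

```py
print(pipe.get_list_adapters())
# e.g. {'text_encoder': ['toy', 'extra'], 'text_encoder_2': ['toy', 'extra'], 'unet': ['toy', 'extra']}

# Park an unused adapter on the CPU to free GPU memory, then bring it back before inference.
pipe.set_lora_device(["extra"], device="cpu")
pipe.set_lora_device(["extra"], device="cuda")
```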
- cache_dir (`Union[str, os.PathLike]`, *optional*): - Path to a directory where a downloaded pretrained model configuration is cached if the standard cache - is not used. - resume_download (`bool`, *optional*, defaults to `False`): - Whether or not to resume downloading the model weights and configuration files. If set to `False`, any - incompletely downloaded files are deleted. - proxies (`Dict[str, str]`, *optional*): - A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', - 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. - local_files_only (`bool`, *optional*, defaults to `False`): - Whether to only load local model weights and configuration files or not. If set to `True`, the model - won't be downloaded from the Hub. - use_auth_token (`str` or *bool*, *optional*): - The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from - `diffusers-cli login` (stored in `~/.huggingface`) is used. - revision (`str`, *optional*, defaults to `"main"`): - The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier - allowed by Git. - use_safetensors (`bool`, *optional*, defaults to `None`): - If set to `None`, the safetensors weights are downloaded if they're available **and** if the - safetensors library is installed. If set to `True`, the model is forcibly loaded from safetensors - weights. If set to `False`, safetensors weights are not loaded. - extract_ema (`bool`, *optional*, defaults to `False`): - Whether to extract the EMA weights or not. Pass `True` to extract the EMA weights which usually yield - higher quality images for inference. Non-EMA weights are usually better for continuing finetuning. - upcast_attention (`bool`, *optional*, defaults to `None`): - Whether the attention computation should always be upcasted. - image_size (`int`, *optional*, defaults to 512): - The image size the model was trained on. Use 512 for all Stable Diffusion v1 models and the Stable - Diffusion v2 base model. Use 768 for Stable Diffusion v2. - prediction_type (`str`, *optional*): - The prediction type the model was trained on. Use `'epsilon'` for all Stable Diffusion v1 models and - the Stable Diffusion v2 base model. Use `'v_prediction'` for Stable Diffusion v2. - num_in_channels (`int`, *optional*, defaults to `None`): - The number of input channels. If `None`, it is automatically inferred. - scheduler_type (`str`, *optional*, defaults to `"pndm"`): - Type of scheduler to use. Should be one of `["pndm", "lms", "heun", "euler", "euler-ancestral", "dpm", - "ddim"]`. - load_safety_checker (`bool`, *optional*, defaults to `True`): - Whether to load the safety checker or not. - text_encoder ([`~transformers.CLIPTextModel`], *optional*, defaults to `None`): - An instance of `CLIPTextModel` to use, specifically the - [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. If this - parameter is `None`, the function loads a new instance of `CLIPTextModel` by itself if needed. - vae (`AutoencoderKL`, *optional*, defaults to `None`): - Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. If - this parameter is `None`, the function will load a new instance of [CLIP] by itself, if needed. - tokenizer ([`~transformers.CLIPTokenizer`], *optional*, defaults to `None`): - An instance of `CLIPTokenizer` to use. 
If this parameter is `None`, the function loads a new instance - of `CLIPTokenizer` by itself if needed. - original_config_file (`str`): - Path to `.yaml` config file corresponding to the original architecture. If `None`, will be - automatically inferred by looking for a key that only exists in SD2.0 models. - kwargs (remaining dictionary of keyword arguments, *optional*): - Can be used to overwrite load and saveable variables (for example the pipeline components of the - specific pipeline class). The overwritten components are directly passed to the pipelines `__init__` - method. See example below for more information. - - Examples: - - ```py - >>> from diffusers import StableDiffusionPipeline - - >>> # Download pipeline from huggingface.co and cache. - >>> pipeline = StableDiffusionPipeline.from_single_file( - ... "https://huggingface.co/WarriorMama777/OrangeMixs/blob/main/Models/AbyssOrangeMix/AbyssOrangeMix.safetensors" - ... ) - - >>> # Download pipeline from local file - >>> # file is downloaded under ./v1-5-pruned-emaonly.ckpt - >>> pipeline = StableDiffusionPipeline.from_single_file("./v1-5-pruned-emaonly") - - >>> # Enable float16 and move to GPU - >>> pipeline = StableDiffusionPipeline.from_single_file( - ... "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.ckpt", - ... torch_dtype=torch.float16, - ... ) - >>> pipeline.to("cuda") - ``` - """ - # import here to avoid circular dependency - from .pipelines.stable_diffusion.convert_from_ckpt import download_from_original_stable_diffusion_ckpt - - original_config_file = kwargs.pop("original_config_file", None) - config_files = kwargs.pop("config_files", None) - cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE) - resume_download = kwargs.pop("resume_download", False) - force_download = kwargs.pop("force_download", False) - proxies = kwargs.pop("proxies", None) - local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE) - use_auth_token = kwargs.pop("use_auth_token", None) - revision = kwargs.pop("revision", None) - extract_ema = kwargs.pop("extract_ema", False) - image_size = kwargs.pop("image_size", None) - scheduler_type = kwargs.pop("scheduler_type", "pndm") - num_in_channels = kwargs.pop("num_in_channels", None) - upcast_attention = kwargs.pop("upcast_attention", None) - load_safety_checker = kwargs.pop("load_safety_checker", True) - prediction_type = kwargs.pop("prediction_type", None) - text_encoder = kwargs.pop("text_encoder", None) - vae = kwargs.pop("vae", None) - controlnet = kwargs.pop("controlnet", None) - adapter = kwargs.pop("adapter", None) - tokenizer = kwargs.pop("tokenizer", None) - - torch_dtype = kwargs.pop("torch_dtype", None) - - use_safetensors = kwargs.pop("use_safetensors", None) - - pipeline_name = cls.__name__ - file_extension = pretrained_model_link_or_path.rsplit(".", 1)[-1] - from_safetensors = file_extension == "safetensors" - - if from_safetensors and use_safetensors is False: - raise ValueError("Make sure to install `safetensors` with `pip install safetensors`.") - - # TODO: For now we only support stable diffusion - stable_unclip = None - model_type = None - - if pipeline_name in [ - "StableDiffusionControlNetPipeline", - "StableDiffusionControlNetImg2ImgPipeline", - "StableDiffusionControlNetInpaintPipeline", - ]: - from .models.controlnet import ControlNetModel - from .pipelines.controlnet.multicontrolnet import MultiControlNetModel - - # list/tuple or a single instance of ControlNetModel or MultiControlNetModel - if not ( - isinstance(controlnet, 
(ControlNetModel, MultiControlNetModel))
-                or isinstance(controlnet, (list, tuple))
-                and isinstance(controlnet[0], ControlNetModel)
-            ):
-                raise ValueError("ControlNet needs to be passed if loading from ControlNet pipeline.")
-        elif "StableDiffusion" in pipeline_name:
-            # Model type will be inferred from the checkpoint.
-            pass
-        elif pipeline_name == "StableUnCLIPPipeline":
-            model_type = "FrozenOpenCLIPEmbedder"
-            stable_unclip = "txt2img"
-        elif pipeline_name == "StableUnCLIPImg2ImgPipeline":
-            model_type = "FrozenOpenCLIPEmbedder"
-            stable_unclip = "img2img"
-        elif pipeline_name == "PaintByExamplePipeline":
-            model_type = "PaintByExample"
-        elif pipeline_name == "LDMTextToImagePipeline":
-            model_type = "LDMTextToImage"
-        else:
-            raise ValueError(f"Unhandled pipeline class: {pipeline_name}")
-
-        # remove huggingface url
-        has_valid_url_prefix = False
-        valid_url_prefixes = ["https://huggingface.co/", "huggingface.co/", "hf.co/", "https://hf.co/"]
-        for prefix in valid_url_prefixes:
-            if pretrained_model_link_or_path.startswith(prefix):
-                pretrained_model_link_or_path = pretrained_model_link_or_path[len(prefix) :]
-                has_valid_url_prefix = True
-
-        # Code based on diffusers.pipelines.pipeline_utils.DiffusionPipeline.from_pretrained
-        ckpt_path = Path(pretrained_model_link_or_path)
-        if not ckpt_path.is_file():
-            if not has_valid_url_prefix:
-                raise ValueError(
-                    f"The provided path is neither a local file nor a valid Hugging Face URL. Valid URLs begin with {', '.join(valid_url_prefixes)}"
-                )
-
-            # get repo_id and (potentially nested) file path of ckpt in repo
-            repo_id = "/".join(ckpt_path.parts[:2])
-            file_path = "/".join(ckpt_path.parts[2:])
-
-            if file_path.startswith("blob/"):
-                file_path = file_path[len("blob/") :]
-
-            if file_path.startswith("main/"):
-                file_path = file_path[len("main/") :]
-
-            pretrained_model_link_or_path = hf_hub_download(
-                repo_id,
-                filename=file_path,
-                cache_dir=cache_dir,
-                resume_download=resume_download,
-                proxies=proxies,
-                local_files_only=local_files_only,
-                use_auth_token=use_auth_token,
-                revision=revision,
-                force_download=force_download,
-            )
-
-        pipe = download_from_original_stable_diffusion_ckpt(
-            pretrained_model_link_or_path,
-            pipeline_class=cls,
-            model_type=model_type,
-            stable_unclip=stable_unclip,
-            controlnet=controlnet,
-            adapter=adapter,
-            from_safetensors=from_safetensors,
-            extract_ema=extract_ema,
-            image_size=image_size,
-            scheduler_type=scheduler_type,
-            num_in_channels=num_in_channels,
-            upcast_attention=upcast_attention,
-            load_safety_checker=load_safety_checker,
-            prediction_type=prediction_type,
-            text_encoder=text_encoder,
-            vae=vae,
-            tokenizer=tokenizer,
-            original_config_file=original_config_file,
-            config_files=config_files,
-            local_files_only=local_files_only,
-        )
-
-        if torch_dtype is not None:
-            pipe.to(torch_dtype=torch_dtype)
-
-        return pipe
-
-
-class FromOriginalVAEMixin:
-    @classmethod
-    def from_single_file(cls, pretrained_model_link_or_path, **kwargs):
-        r"""
-        Instantiate an [`AutoencoderKL`] from pretrained VAE weights saved in the original `.ckpt` or
-        `.safetensors` format. The model is set in evaluation mode (`model.eval()`) by default.
-
-        Parameters:
-            pretrained_model_link_or_path (`str` or `os.PathLike`, *optional*):
-                Can be either:
-                    - A link to the `.ckpt` file (for example
-                      `"https://huggingface.co//blob/main/.ckpt"`) on the Hub.
-                    - A path to a *file* containing all pipeline weights.
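The repo-id and file-path handling in `from_single_file` above reduces to a few string operations. A standalone sketch using the checkpoint URL from the docstring example earlier (the two-marker loop is a compression of the separate `blob/` and `main/` checks):

```py
from pathlib import Path

link = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.ckpt"
for prefix in ["https://huggingface.co/", "huggingface.co/", "hf.co/", "https://hf.co/"]:
    if link.startswith(prefix):
        link = link[len(prefix):]

parts = Path(link).parts
repo_id = "/".join(parts[:2])    # 'runwayml/stable-diffusion-v1-5'
file_path = "/".join(parts[2:])  # 'blob/main/v1-5-pruned-emaonly.ckpt'
for marker in ("blob/", "main/"):
    if file_path.startswith(marker):
        file_path = file_path[len(marker):]

print(repo_id, file_path)  # runwayml/stable-diffusion-v1-5 v1-5-pruned-emaonly.ckpt
```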
- torch_dtype (`str` or `torch.dtype`, *optional*): - Override the default `torch.dtype` and load the model with another dtype. If `"auto"` is passed, the - dtype is automatically derived from the model's weights. - force_download (`bool`, *optional*, defaults to `False`): - Whether or not to force the (re-)download of the model weights and configuration files, overriding the - cached versions if they exist. - cache_dir (`Union[str, os.PathLike]`, *optional*): - Path to a directory where a downloaded pretrained model configuration is cached if the standard cache - is not used. - resume_download (`bool`, *optional*, defaults to `False`): - Whether or not to resume downloading the model weights and configuration files. If set to `False`, any - incompletely downloaded files are deleted. - proxies (`Dict[str, str]`, *optional*): - A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', - 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. - local_files_only (`bool`, *optional*, defaults to `False`): - Whether to only load local model weights and configuration files or not. If set to True, the model - won't be downloaded from the Hub. - use_auth_token (`str` or *bool*, *optional*): - The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from - `diffusers-cli login` (stored in `~/.huggingface`) is used. - revision (`str`, *optional*, defaults to `"main"`): - The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier - allowed by Git. - image_size (`int`, *optional*, defaults to 512): - The image size the model was trained on. Use 512 for all Stable Diffusion v1 models and the Stable - Diffusion v2 base model. Use 768 for Stable Diffusion v2. - use_safetensors (`bool`, *optional*, defaults to `None`): - If set to `None`, the safetensors weights are downloaded if they're available **and** if the - safetensors library is installed. If set to `True`, the model is forcibly loaded from safetensors - weights. If set to `False`, safetensors weights are not loaded. - upcast_attention (`bool`, *optional*, defaults to `None`): - Whether the attention computation should always be upcasted. - scaling_factor (`float`, *optional*, defaults to 0.18215): - The component-wise standard deviation of the trained latent space computed using the first batch of the - training set. This is used to scale the latent space to have unit variance when training the diffusion - model. The latents are scaled with the formula `z = z * scaling_factor` before being passed to the - diffusion model. When decoding, the latents are scaled back to the original scale with the formula: `z - = 1 / scaling_factor * z`. For more details, refer to sections 4.3.2 and D.1 of the [High-Resolution - Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) paper. - kwargs (remaining dictionary of keyword arguments, *optional*): - Can be used to overwrite load and saveable variables (for example the pipeline components of the - specific pipeline class). The overwritten components are directly passed to the pipelines `__init__` - method. See example below for more information. - - - - Make sure to pass both `image_size` and `scaling_factor` to `from_single_file()` if you want to load - a VAE that does accompany a stable diffusion model of v2 or higher or SDXL. 
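Read the Tip above as a call shape: pass both arguments explicitly whenever the VAE does not follow the SD v1 defaults. The sketch below reuses the URL from the docstring example that follows and the SD v1 default values given in the parameter descriptions; substitute the values that match your VAE:

```py
from diffusers import AutoencoderKL

url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors"
# For a Stable Diffusion v2 or SDXL style VAE, replace 512 and 0.18215 with that model's values.
vae = AutoencoderKL.from_single_file(url, image_size=512, scaling_factor=0.18215)
```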
- - - - Examples: - - ```py - from diffusers import AutoencoderKL - - url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors" # can also be local file - model = AutoencoderKL.from_single_file(url) - ``` - """ - if not is_omegaconf_available(): - raise ValueError(BACKENDS_MAPPING["omegaconf"][1]) - - from omegaconf import OmegaConf - - from .models import AutoencoderKL - - # import here to avoid circular dependency - from .pipelines.stable_diffusion.convert_from_ckpt import ( - convert_ldm_vae_checkpoint, - create_vae_diffusers_config, - ) - - config_file = kwargs.pop("config_file", None) - cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE) - resume_download = kwargs.pop("resume_download", False) - force_download = kwargs.pop("force_download", False) - proxies = kwargs.pop("proxies", None) - local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE) - use_auth_token = kwargs.pop("use_auth_token", None) - revision = kwargs.pop("revision", None) - image_size = kwargs.pop("image_size", None) - scaling_factor = kwargs.pop("scaling_factor", None) - kwargs.pop("upcast_attention", None) - - torch_dtype = kwargs.pop("torch_dtype", None) - - use_safetensors = kwargs.pop("use_safetensors", None) - - file_extension = pretrained_model_link_or_path.rsplit(".", 1)[-1] - from_safetensors = file_extension == "safetensors" - - if from_safetensors and use_safetensors is False: - raise ValueError("Make sure to install `safetensors` with `pip install safetensors`.") - - # remove huggingface url - for prefix in ["https://huggingface.co/", "huggingface.co/", "hf.co/", "https://hf.co/"]: - if pretrained_model_link_or_path.startswith(prefix): - pretrained_model_link_or_path = pretrained_model_link_or_path[len(prefix) :] - - # Code based on diffusers.pipelines.pipeline_utils.DiffusionPipeline.from_pretrained - ckpt_path = Path(pretrained_model_link_or_path) - if not ckpt_path.is_file(): - # get repo_id and (potentially nested) file path of ckpt in repo - repo_id = "/".join(ckpt_path.parts[:2]) - file_path = "/".join(ckpt_path.parts[2:]) - - if file_path.startswith("blob/"): - file_path = file_path[len("blob/") :] - - if file_path.startswith("main/"): - file_path = file_path[len("main/") :] - - pretrained_model_link_or_path = hf_hub_download( - repo_id, - filename=file_path, - cache_dir=cache_dir, - resume_download=resume_download, - proxies=proxies, - local_files_only=local_files_only, - use_auth_token=use_auth_token, - revision=revision, - force_download=force_download, - ) - - if from_safetensors: - from safetensors import safe_open - - checkpoint = {} - with safe_open(pretrained_model_link_or_path, framework="pt", device="cpu") as f: - for key in f.keys(): - checkpoint[key] = f.get_tensor(key) - else: - checkpoint = torch.load(pretrained_model_link_or_path, map_location="cpu") - - if "state_dict" in checkpoint: - checkpoint = checkpoint["state_dict"] - - if config_file is None: - config_url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/configs/stable-diffusion/v1-inference.yaml" - config_file = BytesIO(requests.get(config_url).content) - - original_config = OmegaConf.load(config_file) - - # default to sd-v1-5 - image_size = image_size or 512 - - vae_config = create_vae_diffusers_config(original_config, image_size=image_size) - converted_vae_checkpoint = convert_ldm_vae_checkpoint(checkpoint, vae_config) - - if scaling_factor is None: - if ( - "model" in original_config - and "params" in original_config.model - and 
"scale_factor" in original_config.model.params - ): - vae_scaling_factor = original_config.model.params.scale_factor - else: - vae_scaling_factor = 0.18215 # default SD scaling factor - - vae_config["scaling_factor"] = vae_scaling_factor - - ctx = init_empty_weights if is_accelerate_available() else nullcontext - with ctx(): - vae = AutoencoderKL(**vae_config) - - if is_accelerate_available(): - load_model_dict_into_meta(vae, converted_vae_checkpoint, device="cpu") - else: - vae.load_state_dict(converted_vae_checkpoint) - - if torch_dtype is not None: - vae.to(dtype=torch_dtype) - - return vae - - -class FromOriginalControlnetMixin: - @classmethod - def from_single_file(cls, pretrained_model_link_or_path, **kwargs): - r""" - Instantiate a [`ControlNetModel`] from pretrained controlnet weights saved in the original `.ckpt` or - `.safetensors` format. The pipeline is set in evaluation mode (`model.eval()`) by default. - - Parameters: - pretrained_model_link_or_path (`str` or `os.PathLike`, *optional*): - Can be either: - - A link to the `.ckpt` file (for example - `"https://huggingface.co//blob/main/.ckpt"`) on the Hub. - - A path to a *file* containing all pipeline weights. - torch_dtype (`str` or `torch.dtype`, *optional*): - Override the default `torch.dtype` and load the model with another dtype. If `"auto"` is passed, the - dtype is automatically derived from the model's weights. - force_download (`bool`, *optional*, defaults to `False`): - Whether or not to force the (re-)download of the model weights and configuration files, overriding the - cached versions if they exist. - cache_dir (`Union[str, os.PathLike]`, *optional*): - Path to a directory where a downloaded pretrained model configuration is cached if the standard cache - is not used. - resume_download (`bool`, *optional*, defaults to `False`): - Whether or not to resume downloading the model weights and configuration files. If set to `False`, any - incompletely downloaded files are deleted. - proxies (`Dict[str, str]`, *optional*): - A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', - 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. - local_files_only (`bool`, *optional*, defaults to `False`): - Whether to only load local model weights and configuration files or not. If set to True, the model - won't be downloaded from the Hub. - use_auth_token (`str` or *bool*, *optional*): - The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from - `diffusers-cli login` (stored in `~/.huggingface`) is used. - revision (`str`, *optional*, defaults to `"main"`): - The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier - allowed by Git. - use_safetensors (`bool`, *optional*, defaults to `None`): - If set to `None`, the safetensors weights are downloaded if they're available **and** if the - safetensors library is installed. If set to `True`, the model is forcibly loaded from safetensors - weights. If set to `False`, safetensors weights are not loaded. - image_size (`int`, *optional*, defaults to 512): - The image size the model was trained on. Use 512 for all Stable Diffusion v1 models and the Stable - Diffusion v2 base model. Use 768 for Stable Diffusion v2. - upcast_attention (`bool`, *optional*, defaults to `None`): - Whether the attention computation should always be upcasted. 
-            kwargs (remaining dictionary of keyword arguments, *optional*):
-                Can be used to overwrite load and saveable variables (for example the pipeline components of the
-                specific pipeline class). The overwritten components are directly passed to the pipelines `__init__`
-                method. See example below for more information.
-
-        Examples:
-
-        ```py
-        from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
-
-        url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth"  # can also be a local path
-        controlnet = ControlNetModel.from_single_file(url)
-
-        url = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors"  # can also be a local path
-        pipe = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet)
-        ```
-        """
-        # import here to avoid circular dependency
-        from .pipelines.stable_diffusion.convert_from_ckpt import download_controlnet_from_original_ckpt
-
-        config_file = kwargs.pop("config_file", None)
-        cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE)
-        resume_download = kwargs.pop("resume_download", False)
-        force_download = kwargs.pop("force_download", False)
-        proxies = kwargs.pop("proxies", None)
-        local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE)
-        use_auth_token = kwargs.pop("use_auth_token", None)
-        num_in_channels = kwargs.pop("num_in_channels", None)
-        use_linear_projection = kwargs.pop("use_linear_projection", None)
-        revision = kwargs.pop("revision", None)
-        extract_ema = kwargs.pop("extract_ema", False)
-        image_size = kwargs.pop("image_size", None)
-        upcast_attention = kwargs.pop("upcast_attention", None)
-
-        torch_dtype = kwargs.pop("torch_dtype", None)
-
-        use_safetensors = kwargs.pop("use_safetensors", None)
-
-        file_extension = pretrained_model_link_or_path.rsplit(".", 1)[-1]
-        from_safetensors = file_extension == "safetensors"
-
-        if from_safetensors and use_safetensors is False:
-            raise ValueError("Make sure to install `safetensors` with `pip install safetensors`.")
-
-        # remove huggingface url
-        for prefix in ["https://huggingface.co/", "huggingface.co/", "hf.co/", "https://hf.co/"]:
-            if pretrained_model_link_or_path.startswith(prefix):
-                pretrained_model_link_or_path = pretrained_model_link_or_path[len(prefix) :]
-
-        # Code based on diffusers.pipelines.pipeline_utils.DiffusionPipeline.from_pretrained
-        ckpt_path = Path(pretrained_model_link_or_path)
-        if not ckpt_path.is_file():
-            # get repo_id and (potentially nested) file path of ckpt in repo
-            repo_id = "/".join(ckpt_path.parts[:2])
-            file_path = "/".join(ckpt_path.parts[2:])
-
-            if file_path.startswith("blob/"):
-                file_path = file_path[len("blob/") :]
-
-            if file_path.startswith("main/"):
-                file_path = file_path[len("main/") :]
-
-            pretrained_model_link_or_path = hf_hub_download(
-                repo_id,
-                filename=file_path,
-                cache_dir=cache_dir,
-                resume_download=resume_download,
-                proxies=proxies,
-                local_files_only=local_files_only,
-                use_auth_token=use_auth_token,
-                revision=revision,
-                force_download=force_download,
-            )
-
-        if config_file is None:
-            config_url = "https://raw.githubusercontent.com/lllyasviel/ControlNet/main/models/cldm_v15.yaml"
-            config_file = BytesIO(requests.get(config_url).content)
-
-        image_size = image_size or 512
-
-        controlnet = download_controlnet_from_original_ckpt(
-            pretrained_model_link_or_path,
-            original_config_file=config_file,
-            image_size=image_size,
-            extract_ema=extract_ema,
-            num_in_channels=num_in_channels,
-            upcast_attention=upcast_attention,
from_safetensors=from_safetensors, - use_linear_projection=use_linear_projection, - ) - - if torch_dtype is not None: - controlnet.to(dtype=torch_dtype) - - return controlnet - - -class StableDiffusionXLLoraLoaderMixin(LoraLoaderMixin): - """This class overrides `LoraLoaderMixin` with LoRA loading/saving code that's specific to SDXL""" - - # Overrride to properly handle the loading and unloading of the additional text encoder. - def load_lora_weights( - self, - pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], - adapter_name: Optional[str] = None, - **kwargs, - ): - """ - Load LoRA weights specified in `pretrained_model_name_or_path_or_dict` into `self.unet` and - `self.text_encoder`. - - All kwargs are forwarded to `self.lora_state_dict`. - - See [`~loaders.LoraLoaderMixin.lora_state_dict`] for more details on how the state dict is loaded. - - See [`~loaders.LoraLoaderMixin.load_lora_into_unet`] for more details on how the state dict is loaded into - `self.unet`. - - See [`~loaders.LoraLoaderMixin.load_lora_into_text_encoder`] for more details on how the state dict is loaded - into `self.text_encoder`. - - Parameters: - pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`): - See [`~loaders.LoraLoaderMixin.lora_state_dict`]. - adapter_name (`str`, *optional*): - Adapter name to be used for referencing the loaded adapter model. If not specified, it will use - `default_{i}` where i is the total number of adapters being loaded. - kwargs (`dict`, *optional*): - See [`~loaders.LoraLoaderMixin.lora_state_dict`]. - """ - # We could have accessed the unet config from `lora_state_dict()` too. We pass - # it here explicitly to be able to tell that it's coming from an SDXL - # pipeline. - - # First, ensure that the checkpoint is a compatible one and can be successfully loaded. - state_dict, network_alphas = self.lora_state_dict( - pretrained_model_name_or_path_or_dict, - unet_config=self.unet.config, - **kwargs, - ) - is_correct_format = all("lora" in key for key in state_dict.keys()) - if not is_correct_format: - raise ValueError("Invalid LoRA checkpoint.") - - self.load_lora_into_unet( - state_dict, network_alphas=network_alphas, unet=self.unet, adapter_name=adapter_name, _pipeline=self - ) - text_encoder_state_dict = {k: v for k, v in state_dict.items() if "text_encoder." in k} - if len(text_encoder_state_dict) > 0: - self.load_lora_into_text_encoder( - text_encoder_state_dict, - network_alphas=network_alphas, - text_encoder=self.text_encoder, - prefix="text_encoder", - lora_scale=self.lora_scale, - adapter_name=adapter_name, - _pipeline=self, - ) - - text_encoder_2_state_dict = {k: v for k, v in state_dict.items() if "text_encoder_2." 
in k} - if len(text_encoder_2_state_dict) > 0: - self.load_lora_into_text_encoder( - text_encoder_2_state_dict, - network_alphas=network_alphas, - text_encoder=self.text_encoder_2, - prefix="text_encoder_2", - lora_scale=self.lora_scale, - adapter_name=adapter_name, - _pipeline=self, - ) - - @classmethod - def save_lora_weights( - cls, - save_directory: Union[str, os.PathLike], - unet_lora_layers: Dict[str, Union[torch.nn.Module, torch.Tensor]] = None, - text_encoder_lora_layers: Dict[str, Union[torch.nn.Module, torch.Tensor]] = None, - text_encoder_2_lora_layers: Dict[str, Union[torch.nn.Module, torch.Tensor]] = None, - is_main_process: bool = True, - weight_name: str = None, - save_function: Callable = None, - safe_serialization: bool = True, - ): - r""" - Save the LoRA parameters corresponding to the UNet and text encoder. - - Arguments: - save_directory (`str` or `os.PathLike`): - Directory to save LoRA parameters to. Will be created if it doesn't exist. - unet_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`): - State dict of the LoRA layers corresponding to the `unet`. - text_encoder_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`): - State dict of the LoRA layers corresponding to the `text_encoder`. Must explicitly pass the text - encoder LoRA state dict because it comes from πŸ€— Transformers. - is_main_process (`bool`, *optional*, defaults to `True`): - Whether the process calling this is the main process or not. Useful during distributed training and you - need to call this function on all processes. In this case, set `is_main_process=True` only on the main - process to avoid race conditions. - save_function (`Callable`): - The function to use to save the state dictionary. Useful during distributed training when you need to - replace `torch.save` with another method. Can be configured with the environment variable - `DIFFUSERS_SAVE_MODE`. - safe_serialization (`bool`, *optional*, defaults to `True`): - Whether to save the model using `safetensors` or the traditional PyTorch way with `pickle`. - """ - state_dict = {} - - def pack_weights(layers, prefix): - layers_weights = layers.state_dict() if isinstance(layers, torch.nn.Module) else layers - layers_state_dict = {f"{prefix}.{module_name}": param for module_name, param in layers_weights.items()} - return layers_state_dict - - if not (unet_lora_layers or text_encoder_lora_layers or text_encoder_2_lora_layers): - raise ValueError( - "You must pass at least one of `unet_lora_layers`, `text_encoder_lora_layers` or `text_encoder_2_lora_layers`." 
- ) - - if unet_lora_layers: - state_dict.update(pack_weights(unet_lora_layers, "unet")) - - if text_encoder_lora_layers and text_encoder_2_lora_layers: - state_dict.update(pack_weights(text_encoder_lora_layers, "text_encoder")) - state_dict.update(pack_weights(text_encoder_2_lora_layers, "text_encoder_2")) - - cls.write_lora_layers( - state_dict=state_dict, - save_directory=save_directory, - is_main_process=is_main_process, - weight_name=weight_name, - save_function=save_function, - safe_serialization=safe_serialization, - ) - - def _remove_text_encoder_monkey_patch(self): - if USE_PEFT_BACKEND: - recurse_remove_peft_layers(self.text_encoder) - # TODO: @younesbelkada handle this in transformers side - if getattr(self.text_encoder, "peft_config", None) is not None: - del self.text_encoder.peft_config - self.text_encoder._hf_peft_config_loaded = None - - recurse_remove_peft_layers(self.text_encoder_2) - if getattr(self.text_encoder_2, "peft_config", None) is not None: - del self.text_encoder_2.peft_config - self.text_encoder_2._hf_peft_config_loaded = None - else: - self._remove_text_encoder_monkey_patch_classmethod(self.text_encoder) - self._remove_text_encoder_monkey_patch_classmethod(self.text_encoder_2) diff --git a/src/diffusers/loaders/__init__.py b/src/diffusers/loaders/__init__.py new file mode 100644 index 000000000000..14fd985f69e4 --- /dev/null +++ b/src/diffusers/loaders/__init__.py @@ -0,0 +1,81 @@ +from typing import TYPE_CHECKING + +from ..utils import DIFFUSERS_SLOW_IMPORT, _LazyModule, deprecate +from ..utils.import_utils import is_torch_available, is_transformers_available + + +def text_encoder_lora_state_dict(text_encoder): + deprecate( + "text_encoder_load_state_dict in `models`", + "0.27.0", + "`text_encoder_lora_state_dict` has been moved to `diffusers.models.lora`. Please make sure to import it via `from diffusers.models.lora import text_encoder_lora_state_dict`.", + ) + state_dict = {} + + for name, module in text_encoder_attn_modules(text_encoder): + for k, v in module.q_proj.lora_linear_layer.state_dict().items(): + state_dict[f"{name}.q_proj.lora_linear_layer.{k}"] = v + + for k, v in module.k_proj.lora_linear_layer.state_dict().items(): + state_dict[f"{name}.k_proj.lora_linear_layer.{k}"] = v + + for k, v in module.v_proj.lora_linear_layer.state_dict().items(): + state_dict[f"{name}.v_proj.lora_linear_layer.{k}"] = v + + for k, v in module.out_proj.lora_linear_layer.state_dict().items(): + state_dict[f"{name}.out_proj.lora_linear_layer.{k}"] = v + + return state_dict + + +if is_transformers_available(): + + def text_encoder_attn_modules(text_encoder): + deprecate( + "text_encoder_attn_modules in `models`", + "0.27.0", + "`text_encoder_lora_state_dict` has been moved to `diffusers.models.lora`. 
Please make sure to import it via `from diffusers.models.lora import text_encoder_lora_state_dict`.", + ) + from transformers import CLIPTextModel, CLIPTextModelWithProjection + + attn_modules = [] + + if isinstance(text_encoder, (CLIPTextModel, CLIPTextModelWithProjection)): + for i, layer in enumerate(text_encoder.text_model.encoder.layers): + name = f"text_model.encoder.layers.{i}.self_attn" + mod = layer.self_attn + attn_modules.append((name, mod)) + else: + raise ValueError(f"do not know how to get attention modules for: {text_encoder.__class__.__name__}") + + return attn_modules + + +_import_structure = {} + +if is_torch_available(): + _import_structure["single_file"] = ["FromOriginalControlnetMixin", "FromOriginalVAEMixin"] + _import_structure["unet"] = ["UNet2DConditionLoadersMixin"] + _import_structure["utils"] = ["AttnProcsLayers"] + + if is_transformers_available(): + _import_structure["single_file"].extend(["FromSingleFileMixin"]) + _import_structure["lora"] = ["LoraLoaderMixin", "StableDiffusionXLLoraLoaderMixin"] + _import_structure["textual_inversion"] = ["TextualInversionLoaderMixin"] + + +if TYPE_CHECKING or DIFFUSERS_SLOW_IMPORT: + if is_torch_available(): + from ..models.lora import text_encoder_lora_state_dict + from .single_file import FromOriginalControlnetMixin, FromOriginalVAEMixin + from .unet import UNet2DConditionLoadersMixin + from .utils import AttnProcsLayers + + if is_transformers_available(): + from .lora import LoraLoaderMixin, StableDiffusionXLLoraLoaderMixin + from .single_file import FromSingleFileMixin + from .textual_inversion import TextualInversionLoaderMixin +else: + import sys + + sys.modules[__name__] = _LazyModule(__name__, globals()["__file__"], _import_structure, module_spec=__spec__) diff --git a/src/diffusers/loaders/lora.py b/src/diffusers/loaders/lora.py new file mode 100644 index 000000000000..532a59f3b9bd --- /dev/null +++ b/src/diffusers/loaders/lora.py @@ -0,0 +1,1682 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import re +from contextlib import nullcontext +from typing import Callable, Dict, List, Optional, Union + +import safetensors +import torch +from huggingface_hub import model_info +from packaging import version +from torch import nn + +from .. 
import __version__ +from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta +from ..utils import ( + DIFFUSERS_CACHE, + HF_HUB_OFFLINE, + USE_PEFT_BACKEND, + _get_model_file, + convert_state_dict_to_diffusers, + convert_state_dict_to_peft, + convert_unet_state_dict_to_peft, + delete_adapter_layers, + deprecate, + get_adapter_name, + get_peft_kwargs, + is_accelerate_available, + is_transformers_available, + logging, + recurse_remove_peft_layers, + scale_lora_layers, + set_adapter_layers, + set_weights_and_activate_adapters, +) + + +if is_transformers_available(): + from transformers import PreTrainedModel + + from ..models.lora import PatchedLoraProjection, text_encoder_attn_modules, text_encoder_mlp_modules + +if is_accelerate_available(): + from accelerate import init_empty_weights + from accelerate.hooks import AlignDevicesHook, CpuOffload, remove_hook_from_module + +logger = logging.get_logger(__name__) + +TEXT_ENCODER_NAME = "text_encoder" +UNET_NAME = "unet" + +LORA_WEIGHT_NAME = "pytorch_lora_weights.bin" +LORA_WEIGHT_NAME_SAFE = "pytorch_lora_weights.safetensors" + +LORA_DEPRECATION_MESSAGE = "You are using an old version of LoRA backend. This will be deprecated in the next releases in favor of PEFT make sure to install the latest PEFT and transformers packages in the future." + + +class LoraLoaderMixin: + r""" + Load LoRA layers into [`UNet2DConditionModel`] and + [`CLIPTextModel`](https://huggingface.co/docs/transformers/model_doc/clip#transformers.CLIPTextModel). + """ + text_encoder_name = TEXT_ENCODER_NAME + unet_name = UNET_NAME + num_fused_loras = 0 + + def load_lora_weights( + self, pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], adapter_name=None, **kwargs + ): + """ + Load LoRA weights specified in `pretrained_model_name_or_path_or_dict` into `self.unet` and + `self.text_encoder`. + + All kwargs are forwarded to `self.lora_state_dict`. + + See [`~loaders.LoraLoaderMixin.lora_state_dict`] for more details on how the state dict is loaded. + + See [`~loaders.LoraLoaderMixin.load_lora_into_unet`] for more details on how the state dict is loaded into + `self.unet`. + + See [`~loaders.LoraLoaderMixin.load_lora_into_text_encoder`] for more details on how the state dict is loaded + into `self.text_encoder`. + + Parameters: + pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`): + See [`~loaders.LoraLoaderMixin.lora_state_dict`]. + kwargs (`dict`, *optional*): + See [`~loaders.LoraLoaderMixin.lora_state_dict`]. + adapter_name (`str`, *optional*): + Adapter name to be used for referencing the loaded adapter model. If not specified, it will use + `default_{i}` where i is the total number of adapters being loaded. + """ + # First, ensure that the checkpoint is a compatible one and can be successfully loaded. 
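+        # Illustrative usage from a pipeline that mixes in this class (the directory path below is a
+        # placeholder; the weight name matches `LORA_WEIGHT_NAME_SAFE` defined above):
+        #
+        #     pipe.load_lora_weights("path/to/lora_dir", weight_name="pytorch_lora_weights.safetensors")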
+ state_dict, network_alphas = self.lora_state_dict(pretrained_model_name_or_path_or_dict, **kwargs) + + is_correct_format = all("lora" in key for key in state_dict.keys()) + if not is_correct_format: + raise ValueError("Invalid LoRA checkpoint.") + + low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT) + + self.load_lora_into_unet( + state_dict, + network_alphas=network_alphas, + unet=getattr(self, self.unet_name) if not hasattr(self, "unet") else self.unet, + low_cpu_mem_usage=low_cpu_mem_usage, + adapter_name=adapter_name, + _pipeline=self, + ) + self.load_lora_into_text_encoder( + state_dict, + network_alphas=network_alphas, + text_encoder=getattr(self, self.text_encoder_name) + if not hasattr(self, "text_encoder") + else self.text_encoder, + lora_scale=self.lora_scale, + low_cpu_mem_usage=low_cpu_mem_usage, + adapter_name=adapter_name, + _pipeline=self, + ) + + @classmethod + def lora_state_dict( + cls, + pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], + **kwargs, + ): + r""" + Return state dict for lora weights and the network alphas. + + + + We support loading A1111 formatted LoRA checkpoints in a limited capacity. + + This function is experimental and might change in the future. + + + + Parameters: + pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`): + Can be either: + + - A string, the *model id* (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on + the Hub. + - A path to a *directory* (for example `./my_model_directory`) containing the model weights saved + with [`ModelMixin.save_pretrained`]. + - A [torch state + dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict). + + cache_dir (`Union[str, os.PathLike]`, *optional*): + Path to a directory where a downloaded pretrained model configuration is cached if the standard cache + is not used. + force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force the (re-)download of the model weights and configuration files, overriding the + cached versions if they exist. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to resume downloading the model weights and configuration files. If set to `False`, any + incompletely downloaded files are deleted. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. + local_files_only (`bool`, *optional*, defaults to `False`): + Whether to only load local model weights and configuration files or not. If set to `True`, the model + won't be downloaded from the Hub. + use_auth_token (`str` or *bool*, *optional*): + The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from + `diffusers-cli login` (stored in `~/.huggingface`) is used. + revision (`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier + allowed by Git. + subfolder (`str`, *optional*, defaults to `""`): + The subfolder location of a model file within a larger model repository on the Hub or locally. + low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`): + Speed up model loading only loading the pretrained weights and not initializing the weights. 
This also + tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model. + Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this + argument to `True` will raise an error. + mirror (`str`, *optional*): + Mirror source to resolve accessibility issues if you're downloading a model in China. We do not + guarantee the timeliness or safety of the source, and you should refer to the mirror site for more + information. + + """ + # Load the main state dict first which has the LoRA layers for either of + # UNet and text encoder or both. + cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE) + force_download = kwargs.pop("force_download", False) + resume_download = kwargs.pop("resume_download", False) + proxies = kwargs.pop("proxies", None) + local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE) + use_auth_token = kwargs.pop("use_auth_token", None) + revision = kwargs.pop("revision", None) + subfolder = kwargs.pop("subfolder", None) + weight_name = kwargs.pop("weight_name", None) + unet_config = kwargs.pop("unet_config", None) + use_safetensors = kwargs.pop("use_safetensors", None) + + allow_pickle = False + if use_safetensors is None: + use_safetensors = True + allow_pickle = True + + user_agent = { + "file_type": "attn_procs_weights", + "framework": "pytorch", + } + + model_file = None + if not isinstance(pretrained_model_name_or_path_or_dict, dict): + # Let's first try to load .safetensors weights + if (use_safetensors and weight_name is None) or ( + weight_name is not None and weight_name.endswith(".safetensors") + ): + try: + # Here we're relaxing the loading check to enable more Inference API + # friendliness where sometimes, it's not at all possible to automatically + # determine `weight_name`. + if weight_name is None: + weight_name = cls._best_guess_weight_name( + pretrained_model_name_or_path_or_dict, file_extension=".safetensors" + ) + model_file = _get_model_file( + pretrained_model_name_or_path_or_dict, + weights_name=weight_name or LORA_WEIGHT_NAME_SAFE, + cache_dir=cache_dir, + force_download=force_download, + resume_download=resume_download, + proxies=proxies, + local_files_only=local_files_only, + use_auth_token=use_auth_token, + revision=revision, + subfolder=subfolder, + user_agent=user_agent, + ) + state_dict = safetensors.torch.load_file(model_file, device="cpu") + except (IOError, safetensors.SafetensorError) as e: + if not allow_pickle: + raise e + # try loading non-safetensors weights + model_file = None + pass + + if model_file is None: + if weight_name is None: + weight_name = cls._best_guess_weight_name( + pretrained_model_name_or_path_or_dict, file_extension=".bin" + ) + model_file = _get_model_file( + pretrained_model_name_or_path_or_dict, + weights_name=weight_name or LORA_WEIGHT_NAME, + cache_dir=cache_dir, + force_download=force_download, + resume_download=resume_download, + proxies=proxies, + local_files_only=local_files_only, + use_auth_token=use_auth_token, + revision=revision, + subfolder=subfolder, + user_agent=user_agent, + ) + state_dict = torch.load(model_file, map_location="cpu") + else: + state_dict = pretrained_model_name_or_path_or_dict + + network_alphas = None + # TODO: replace it with a method from `state_dict_utils` + if all( + ( + k.startswith("lora_te_") + or k.startswith("lora_unet_") + or k.startswith("lora_te1_") + or k.startswith("lora_te2_") + ) + for k in state_dict.keys() + ): + # Map SDXL blocks correctly. 
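+            # (Keys prefixed with `lora_unet_` / `lora_te*_` indicate a Kohya/A1111-style checkpoint;
+            # the helpers below rename the SGM block names and LoRA keys to the diffusers layout.)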
+ if unet_config is not None: + # use unet config to remap block numbers + state_dict = cls._maybe_map_sgm_blocks_to_diffusers(state_dict, unet_config) + state_dict, network_alphas = cls._convert_kohya_lora_to_diffusers(state_dict) + + return state_dict, network_alphas + + @classmethod + def _best_guess_weight_name(cls, pretrained_model_name_or_path_or_dict, file_extension=".safetensors"): + targeted_files = [] + + if os.path.isfile(pretrained_model_name_or_path_or_dict): + return + elif os.path.isdir(pretrained_model_name_or_path_or_dict): + targeted_files = [ + f for f in os.listdir(pretrained_model_name_or_path_or_dict) if f.endswith(file_extension) + ] + else: + files_in_repo = model_info(pretrained_model_name_or_path_or_dict).siblings + targeted_files = [f.rfilename for f in files_in_repo if f.rfilename.endswith(file_extension)] + if len(targeted_files) == 0: + return + + # "scheduler" does not correspond to a LoRA checkpoint. + # "optimizer" does not correspond to a LoRA checkpoint + # only top-level checkpoints are considered and not the other ones, hence "checkpoint". + unallowed_substrings = {"scheduler", "optimizer", "checkpoint"} + targeted_files = list( + filter(lambda x: all(substring not in x for substring in unallowed_substrings), targeted_files) + ) + + if any(f.endswith(LORA_WEIGHT_NAME) for f in targeted_files): + targeted_files = list(filter(lambda x: x.endswith(LORA_WEIGHT_NAME), targeted_files)) + elif any(f.endswith(LORA_WEIGHT_NAME_SAFE) for f in targeted_files): + targeted_files = list(filter(lambda x: x.endswith(LORA_WEIGHT_NAME_SAFE), targeted_files)) + + if len(targeted_files) > 1: + raise ValueError( + f"Provided path contains more than one weights file in the {file_extension} format. Either specify `weight_name` in `load_lora_weights` or make sure there's only one `.safetensors` or `.bin` file in {pretrained_model_name_or_path_or_dict}." + ) + weight_name = targeted_files[0] + return weight_name + + @classmethod + def _maybe_map_sgm_blocks_to_diffusers(cls, state_dict, unet_config, delimiter="_", block_slice_pos=5): + # 1. get all state_dict_keys + all_keys = list(state_dict.keys()) + sgm_patterns = ["input_blocks", "middle_block", "output_blocks"] + + # 2. check if needs remapping, if not return original dict + is_in_sgm_format = False + for key in all_keys: + if any(p in key for p in sgm_patterns): + is_in_sgm_format = True + break + + if not is_in_sgm_format: + return state_dict + + # 3. 
Else remap from SGM patterns + new_state_dict = {} + inner_block_map = ["resnets", "attentions", "upsamplers"] + + # Retrieves # of down, mid and up blocks + input_block_ids, middle_block_ids, output_block_ids = set(), set(), set() + + for layer in all_keys: + if "text" in layer: + new_state_dict[layer] = state_dict.pop(layer) + else: + layer_id = int(layer.split(delimiter)[:block_slice_pos][-1]) + if sgm_patterns[0] in layer: + input_block_ids.add(layer_id) + elif sgm_patterns[1] in layer: + middle_block_ids.add(layer_id) + elif sgm_patterns[2] in layer: + output_block_ids.add(layer_id) + else: + raise ValueError(f"Checkpoint not supported because layer {layer} not supported.") + + input_blocks = { + layer_id: [key for key in state_dict if f"input_blocks{delimiter}{layer_id}" in key] + for layer_id in input_block_ids + } + middle_blocks = { + layer_id: [key for key in state_dict if f"middle_block{delimiter}{layer_id}" in key] + for layer_id in middle_block_ids + } + output_blocks = { + layer_id: [key for key in state_dict if f"output_blocks{delimiter}{layer_id}" in key] + for layer_id in output_block_ids + } + + # Rename keys accordingly + for i in input_block_ids: + block_id = (i - 1) // (unet_config.layers_per_block + 1) + layer_in_block_id = (i - 1) % (unet_config.layers_per_block + 1) + + for key in input_blocks[i]: + inner_block_id = int(key.split(delimiter)[block_slice_pos]) + inner_block_key = inner_block_map[inner_block_id] if "op" not in key else "downsamplers" + inner_layers_in_block = str(layer_in_block_id) if "op" not in key else "0" + new_key = delimiter.join( + key.split(delimiter)[: block_slice_pos - 1] + + [str(block_id), inner_block_key, inner_layers_in_block] + + key.split(delimiter)[block_slice_pos + 1 :] + ) + new_state_dict[new_key] = state_dict.pop(key) + + for i in middle_block_ids: + key_part = None + if i == 0: + key_part = [inner_block_map[0], "0"] + elif i == 1: + key_part = [inner_block_map[1], "0"] + elif i == 2: + key_part = [inner_block_map[0], "1"] + else: + raise ValueError(f"Invalid middle block id {i}.") + + for key in middle_blocks[i]: + new_key = delimiter.join( + key.split(delimiter)[: block_slice_pos - 1] + key_part + key.split(delimiter)[block_slice_pos:] + ) + new_state_dict[new_key] = state_dict.pop(key) + + for i in output_block_ids: + block_id = i // (unet_config.layers_per_block + 1) + layer_in_block_id = i % (unet_config.layers_per_block + 1) + + for key in output_blocks[i]: + inner_block_id = int(key.split(delimiter)[block_slice_pos]) + inner_block_key = inner_block_map[inner_block_id] + inner_layers_in_block = str(layer_in_block_id) if inner_block_id < 2 else "0" + new_key = delimiter.join( + key.split(delimiter)[: block_slice_pos - 1] + + [str(block_id), inner_block_key, inner_layers_in_block] + + key.split(delimiter)[block_slice_pos + 1 :] + ) + new_state_dict[new_key] = state_dict.pop(key) + + if len(state_dict) > 0: + raise ValueError("At this point all state dict entries have to be converted.") + + return new_state_dict + + @classmethod + def _optionally_disable_offloading(cls, _pipeline): + """ + Optionally removes offloading in case the pipeline has been already sequentially offloaded to CPU. + + Args: + _pipeline (`DiffusionPipeline`): + The pipeline to disable offloading for. + + Returns: + tuple: + A tuple indicating if `is_model_cpu_offload` or `is_sequential_cpu_offload` is True. 
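+        Example (a minimal sketch, assuming `pipe` is an already-constructed `DiffusionPipeline`; it mirrors
+        how `load_lora_into_unet` and `load_lora_into_text_encoder` restore offloading afterwards):
+
+        ```py
+        is_model_cpu_offload, is_sequential_cpu_offload = LoraLoaderMixin._optionally_disable_offloading(pipe)
+        # ... load the LoRA state dict into the model ...
+        if is_model_cpu_offload:
+            pipe.enable_model_cpu_offload()
+        elif is_sequential_cpu_offload:
+            pipe.enable_sequential_cpu_offload()
+        ```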
+ """ + is_model_cpu_offload = False + is_sequential_cpu_offload = False + + if _pipeline is not None: + for _, component in _pipeline.components.items(): + if isinstance(component, nn.Module) and hasattr(component, "_hf_hook"): + if not is_model_cpu_offload: + is_model_cpu_offload = isinstance(component._hf_hook, CpuOffload) + if not is_sequential_cpu_offload: + is_sequential_cpu_offload = isinstance(component._hf_hook, AlignDevicesHook) + + logger.info( + "Accelerate hooks detected. Since you have called `load_lora_weights()`, the previous hooks will be first removed. Then the LoRA parameters will be loaded and the hooks will be applied again." + ) + remove_hook_from_module(component, recurse=is_sequential_cpu_offload) + + return (is_model_cpu_offload, is_sequential_cpu_offload) + + @classmethod + def load_lora_into_unet( + cls, state_dict, network_alphas, unet, low_cpu_mem_usage=None, adapter_name=None, _pipeline=None + ): + """ + This will load the LoRA layers specified in `state_dict` into `unet`. + + Parameters: + state_dict (`dict`): + A standard state dict containing the lora layer parameters. The keys can either be indexed directly + into the unet or prefixed with an additional `unet` which can be used to distinguish between text + encoder lora layers. + network_alphas (`Dict[str, float]`): + See `LoRALinearLayer` for more details. + unet (`UNet2DConditionModel`): + The UNet model to load the LoRA layers into. + low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`): + Speed up model loading only loading the pretrained weights and not initializing the weights. This also + tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model. + Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this + argument to `True` will raise an error. + adapter_name (`str`, *optional*): + Adapter name to be used for referencing the loaded adapter model. If not specified, it will use + `default_{i}` where i is the total number of adapters being loaded. + """ + low_cpu_mem_usage = low_cpu_mem_usage if low_cpu_mem_usage is not None else _LOW_CPU_MEM_USAGE_DEFAULT + # If the serialization format is new (introduced in https://github.com/huggingface/diffusers/pull/2918), + # then the `state_dict` keys should have `cls.unet_name` and/or `cls.text_encoder_name` as + # their prefixes. + keys = list(state_dict.keys()) + + if all(key.startswith(cls.unet_name) or key.startswith(cls.text_encoder_name) for key in keys): + # Load the layers corresponding to UNet. + logger.info(f"Loading {cls.unet_name}.") + + unet_keys = [k for k in keys if k.startswith(cls.unet_name)] + state_dict = {k.replace(f"{cls.unet_name}.", ""): v for k, v in state_dict.items() if k in unet_keys} + + if network_alphas is not None: + alpha_keys = [k for k in network_alphas.keys() if k.startswith(cls.unet_name)] + network_alphas = { + k.replace(f"{cls.unet_name}.", ""): v for k, v in network_alphas.items() if k in alpha_keys + } + + else: + # Otherwise, we're dealing with the old format. This means the `state_dict` should only + # contain the module names of the `unet` as its keys WITHOUT any prefix. + warn_message = "You have saved the LoRA weights using the old format. To convert the old LoRA weights to the new format, you can first load them in a dictionary and then create a new dictionary like the following: `new_state_dict = {f'unet.{module_name}': params for module_name, params in old_state_dict.items()}`." 
+ logger.warn(warn_message) + + if USE_PEFT_BACKEND and len(state_dict.keys()) > 0: + from peft import LoraConfig, inject_adapter_in_model, set_peft_model_state_dict + + if adapter_name in getattr(unet, "peft_config", {}): + raise ValueError( + f"Adapter name {adapter_name} already in use in the Unet - please select a new adapter name." + ) + + state_dict = convert_unet_state_dict_to_peft(state_dict) + + if network_alphas is not None: + # The alphas state dict have the same structure as Unet, thus we convert it to peft format using + # `convert_unet_state_dict_to_peft` method. + network_alphas = convert_unet_state_dict_to_peft(network_alphas) + + rank = {} + for key, val in state_dict.items(): + if "lora_B" in key: + rank[key] = val.shape[1] + + lora_config_kwargs = get_peft_kwargs(rank, network_alphas, state_dict, is_unet=True) + lora_config = LoraConfig(**lora_config_kwargs) + + # adapter_name + if adapter_name is None: + adapter_name = get_adapter_name(unet) + + # In case the pipeline has been already offloaded to CPU - temporarily remove the hooks + # otherwise loading LoRA weights will lead to an error + is_model_cpu_offload, is_sequential_cpu_offload = cls._optionally_disable_offloading(_pipeline) + + inject_adapter_in_model(lora_config, unet, adapter_name=adapter_name) + incompatible_keys = set_peft_model_state_dict(unet, state_dict, adapter_name) + + if incompatible_keys is not None: + # check only for unexpected keys + unexpected_keys = getattr(incompatible_keys, "unexpected_keys", None) + if unexpected_keys: + logger.warning( + f"Loading adapter weights from state_dict led to unexpected keys not found in the model: " + f" {unexpected_keys}. " + ) + + # Offload back. + if is_model_cpu_offload: + _pipeline.enable_model_cpu_offload() + elif is_sequential_cpu_offload: + _pipeline.enable_sequential_cpu_offload() + # Unsafe code /> + + unet.load_attn_procs( + state_dict, network_alphas=network_alphas, low_cpu_mem_usage=low_cpu_mem_usage, _pipeline=_pipeline + ) + + @classmethod + def load_lora_into_text_encoder( + cls, + state_dict, + network_alphas, + text_encoder, + prefix=None, + lora_scale=1.0, + low_cpu_mem_usage=None, + adapter_name=None, + _pipeline=None, + ): + """ + This will load the LoRA layers specified in `state_dict` into `text_encoder` + + Parameters: + state_dict (`dict`): + A standard state dict containing the lora layer parameters. The key should be prefixed with an + additional `text_encoder` to distinguish between unet lora layers. + network_alphas (`Dict[str, float]`): + See `LoRALinearLayer` for more details. + text_encoder (`CLIPTextModel`): + The text encoder model to load the LoRA layers into. + prefix (`str`): + Expected prefix of the `text_encoder` in the `state_dict`. + lora_scale (`float`): + How much to scale the output of the lora linear layer before it is added with the output of the regular + lora layer. + low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`): + Speed up model loading only loading the pretrained weights and not initializing the weights. This also + tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model. + Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this + argument to `True` will raise an error. + adapter_name (`str`, *optional*): + Adapter name to be used for referencing the loaded adapter model. If not specified, it will use + `default_{i}` where i is the total number of adapters being loaded. 
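+        Example (an illustrative sketch; `"path/to/lora"` is a placeholder and any pipeline inheriting from
+        this mixin can be used in place of `StableDiffusionPipeline`):
+
+        ```py
+        from diffusers import StableDiffusionPipeline
+
+        pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
+        state_dict, network_alphas = pipe.lora_state_dict("path/to/lora")
+        pipe.load_lora_into_text_encoder(
+            state_dict,
+            network_alphas=network_alphas,
+            text_encoder=pipe.text_encoder,
+            prefix="text_encoder",
+            lora_scale=pipe.lora_scale,
+            _pipeline=pipe,
+        )
+        ```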
+ """ + low_cpu_mem_usage = low_cpu_mem_usage if low_cpu_mem_usage is not None else _LOW_CPU_MEM_USAGE_DEFAULT + + # If the serialization format is new (introduced in https://github.com/huggingface/diffusers/pull/2918), + # then the `state_dict` keys should have `self.unet_name` and/or `self.text_encoder_name` as + # their prefixes. + keys = list(state_dict.keys()) + prefix = cls.text_encoder_name if prefix is None else prefix + + # Safe prefix to check with. + if any(cls.text_encoder_name in key for key in keys): + # Load the layers corresponding to text encoder and make necessary adjustments. + text_encoder_keys = [k for k in keys if k.startswith(prefix) and k.split(".")[0] == prefix] + text_encoder_lora_state_dict = { + k.replace(f"{prefix}.", ""): v for k, v in state_dict.items() if k in text_encoder_keys + } + + if len(text_encoder_lora_state_dict) > 0: + logger.info(f"Loading {prefix}.") + rank = {} + text_encoder_lora_state_dict = convert_state_dict_to_diffusers(text_encoder_lora_state_dict) + + if USE_PEFT_BACKEND: + # convert state dict + text_encoder_lora_state_dict = convert_state_dict_to_peft(text_encoder_lora_state_dict) + + for name, _ in text_encoder_attn_modules(text_encoder): + rank_key = f"{name}.out_proj.lora_B.weight" + rank[rank_key] = text_encoder_lora_state_dict[rank_key].shape[1] + + patch_mlp = any(".mlp." in key for key in text_encoder_lora_state_dict.keys()) + if patch_mlp: + for name, _ in text_encoder_mlp_modules(text_encoder): + rank_key_fc1 = f"{name}.fc1.lora_B.weight" + rank_key_fc2 = f"{name}.fc2.lora_B.weight" + + rank[rank_key_fc1] = text_encoder_lora_state_dict[rank_key_fc1].shape[1] + rank[rank_key_fc2] = text_encoder_lora_state_dict[rank_key_fc2].shape[1] + else: + for name, _ in text_encoder_attn_modules(text_encoder): + rank_key = f"{name}.out_proj.lora_linear_layer.up.weight" + rank.update({rank_key: text_encoder_lora_state_dict[rank_key].shape[1]}) + + patch_mlp = any(".mlp." 
in key for key in text_encoder_lora_state_dict.keys()) + if patch_mlp: + for name, _ in text_encoder_mlp_modules(text_encoder): + rank_key_fc1 = f"{name}.fc1.lora_linear_layer.up.weight" + rank_key_fc2 = f"{name}.fc2.lora_linear_layer.up.weight" + rank[rank_key_fc1] = text_encoder_lora_state_dict[rank_key_fc1].shape[1] + rank[rank_key_fc2] = text_encoder_lora_state_dict[rank_key_fc2].shape[1] + + if network_alphas is not None: + alpha_keys = [ + k for k in network_alphas.keys() if k.startswith(prefix) and k.split(".")[0] == prefix + ] + network_alphas = { + k.replace(f"{prefix}.", ""): v for k, v in network_alphas.items() if k in alpha_keys + } + + if USE_PEFT_BACKEND: + from peft import LoraConfig + + lora_config_kwargs = get_peft_kwargs( + rank, network_alphas, text_encoder_lora_state_dict, is_unet=False + ) + + lora_config = LoraConfig(**lora_config_kwargs) + + # adapter_name + if adapter_name is None: + adapter_name = get_adapter_name(text_encoder) + + is_model_cpu_offload, is_sequential_cpu_offload = cls._optionally_disable_offloading(_pipeline) + + # inject LoRA layers and load the state dict + # in transformers we automatically check whether the adapter name is already in use or not + text_encoder.load_adapter( + adapter_name=adapter_name, + adapter_state_dict=text_encoder_lora_state_dict, + peft_config=lora_config, + ) + + # scale LoRA layers with `lora_scale` + scale_lora_layers(text_encoder, weight=lora_scale) + else: + cls._modify_text_encoder( + text_encoder, + lora_scale, + network_alphas, + rank=rank, + patch_mlp=patch_mlp, + low_cpu_mem_usage=low_cpu_mem_usage, + ) + + is_pipeline_offloaded = _pipeline is not None and any( + isinstance(c, torch.nn.Module) and hasattr(c, "_hf_hook") + for c in _pipeline.components.values() + ) + if is_pipeline_offloaded and low_cpu_mem_usage: + low_cpu_mem_usage = True + logger.info( + f"Pipeline {_pipeline.__class__} is offloaded. Therefore low cpu mem usage loading is forced." + ) + + if low_cpu_mem_usage: + device = next(iter(text_encoder_lora_state_dict.values())).device + dtype = next(iter(text_encoder_lora_state_dict.values())).dtype + unexpected_keys = load_model_dict_into_meta( + text_encoder, text_encoder_lora_state_dict, device=device, dtype=dtype + ) + else: + load_state_dict_results = text_encoder.load_state_dict( + text_encoder_lora_state_dict, strict=False + ) + unexpected_keys = load_state_dict_results.unexpected_keys + + if len(unexpected_keys) != 0: + raise ValueError( + f"failed to load text encoder state dict, unexpected keys: {load_state_dict_results.unexpected_keys}" + ) + + # + + @property + def lora_scale(self) -> float: + # property function that returns the lora scale which can be set at run time by the pipeline. 
+ # if _lora_scale has not been set, return 1 + return self._lora_scale if hasattr(self, "_lora_scale") else 1.0 + + def _remove_text_encoder_monkey_patch(self): + if USE_PEFT_BACKEND: + remove_method = recurse_remove_peft_layers + else: + remove_method = self._remove_text_encoder_monkey_patch_classmethod + + if hasattr(self, "text_encoder"): + remove_method(self.text_encoder) + + # In case text encoder have no Lora attached + if USE_PEFT_BACKEND and getattr(self.text_encoder, "peft_config", None) is not None: + del self.text_encoder.peft_config + self.text_encoder._hf_peft_config_loaded = None + if hasattr(self, "text_encoder_2"): + remove_method(self.text_encoder_2) + if USE_PEFT_BACKEND: + del self.text_encoder_2.peft_config + self.text_encoder_2._hf_peft_config_loaded = None + + @classmethod + def _remove_text_encoder_monkey_patch_classmethod(cls, text_encoder): + if version.parse(__version__) > version.parse("0.23"): + deprecate("_remove_text_encoder_monkey_patch_classmethod", "0.25", LORA_DEPRECATION_MESSAGE) + + for _, attn_module in text_encoder_attn_modules(text_encoder): + if isinstance(attn_module.q_proj, PatchedLoraProjection): + attn_module.q_proj.lora_linear_layer = None + attn_module.k_proj.lora_linear_layer = None + attn_module.v_proj.lora_linear_layer = None + attn_module.out_proj.lora_linear_layer = None + + for _, mlp_module in text_encoder_mlp_modules(text_encoder): + if isinstance(mlp_module.fc1, PatchedLoraProjection): + mlp_module.fc1.lora_linear_layer = None + mlp_module.fc2.lora_linear_layer = None + + @classmethod + def _modify_text_encoder( + cls, + text_encoder, + lora_scale=1, + network_alphas=None, + rank: Union[Dict[str, int], int] = 4, + dtype=None, + patch_mlp=False, + low_cpu_mem_usage=False, + ): + r""" + Monkey-patches the forward passes of attention modules of the text encoder. 
+ """ + if version.parse(__version__) > version.parse("0.23"): + deprecate("_modify_text_encoder", "0.25", LORA_DEPRECATION_MESSAGE) + + def create_patched_linear_lora(model, network_alpha, rank, dtype, lora_parameters): + linear_layer = model.regular_linear_layer if isinstance(model, PatchedLoraProjection) else model + ctx = init_empty_weights if low_cpu_mem_usage else nullcontext + with ctx(): + model = PatchedLoraProjection(linear_layer, lora_scale, network_alpha, rank, dtype=dtype) + + lora_parameters.extend(model.lora_linear_layer.parameters()) + return model + + # First, remove any monkey-patch that might have been applied before + cls._remove_text_encoder_monkey_patch_classmethod(text_encoder) + + lora_parameters = [] + network_alphas = {} if network_alphas is None else network_alphas + is_network_alphas_populated = len(network_alphas) > 0 + + for name, attn_module in text_encoder_attn_modules(text_encoder): + query_alpha = network_alphas.pop(name + ".to_q_lora.down.weight.alpha", None) + key_alpha = network_alphas.pop(name + ".to_k_lora.down.weight.alpha", None) + value_alpha = network_alphas.pop(name + ".to_v_lora.down.weight.alpha", None) + out_alpha = network_alphas.pop(name + ".to_out_lora.down.weight.alpha", None) + + if isinstance(rank, dict): + current_rank = rank.pop(f"{name}.out_proj.lora_linear_layer.up.weight") + else: + current_rank = rank + + attn_module.q_proj = create_patched_linear_lora( + attn_module.q_proj, query_alpha, current_rank, dtype, lora_parameters + ) + attn_module.k_proj = create_patched_linear_lora( + attn_module.k_proj, key_alpha, current_rank, dtype, lora_parameters + ) + attn_module.v_proj = create_patched_linear_lora( + attn_module.v_proj, value_alpha, current_rank, dtype, lora_parameters + ) + attn_module.out_proj = create_patched_linear_lora( + attn_module.out_proj, out_alpha, current_rank, dtype, lora_parameters + ) + + if patch_mlp: + for name, mlp_module in text_encoder_mlp_modules(text_encoder): + fc1_alpha = network_alphas.pop(name + ".fc1.lora_linear_layer.down.weight.alpha", None) + fc2_alpha = network_alphas.pop(name + ".fc2.lora_linear_layer.down.weight.alpha", None) + + current_rank_fc1 = rank.pop(f"{name}.fc1.lora_linear_layer.up.weight") + current_rank_fc2 = rank.pop(f"{name}.fc2.lora_linear_layer.up.weight") + + mlp_module.fc1 = create_patched_linear_lora( + mlp_module.fc1, fc1_alpha, current_rank_fc1, dtype, lora_parameters + ) + mlp_module.fc2 = create_patched_linear_lora( + mlp_module.fc2, fc2_alpha, current_rank_fc2, dtype, lora_parameters + ) + + if is_network_alphas_populated and len(network_alphas) > 0: + raise ValueError( + f"The `network_alphas` has to be empty at this point but has the following keys \n\n {', '.join(network_alphas.keys())}" + ) + + return lora_parameters + + @classmethod + def save_lora_weights( + cls, + save_directory: Union[str, os.PathLike], + unet_lora_layers: Dict[str, Union[torch.nn.Module, torch.Tensor]] = None, + text_encoder_lora_layers: Dict[str, torch.nn.Module] = None, + is_main_process: bool = True, + weight_name: str = None, + save_function: Callable = None, + safe_serialization: bool = True, + ): + r""" + Save the LoRA parameters corresponding to the UNet and text encoder. + + Arguments: + save_directory (`str` or `os.PathLike`): + Directory to save LoRA parameters to. Will be created if it doesn't exist. + unet_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`): + State dict of the LoRA layers corresponding to the `unet`. 
+ text_encoder_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`): + State dict of the LoRA layers corresponding to the `text_encoder`. Must explicitly pass the text + encoder LoRA state dict because it comes from πŸ€— Transformers. + is_main_process (`bool`, *optional*, defaults to `True`): + Whether the process calling this is the main process or not. Useful during distributed training and you + need to call this function on all processes. In this case, set `is_main_process=True` only on the main + process to avoid race conditions. + save_function (`Callable`): + The function to use to save the state dictionary. Useful during distributed training when you need to + replace `torch.save` with another method. Can be configured with the environment variable + `DIFFUSERS_SAVE_MODE`. + safe_serialization (`bool`, *optional*, defaults to `True`): + Whether to save the model using `safetensors` or the traditional PyTorch way with `pickle`. + """ + # Create a flat dictionary. + state_dict = {} + + # Populate the dictionary. + if unet_lora_layers is not None: + weights = ( + unet_lora_layers.state_dict() if isinstance(unet_lora_layers, torch.nn.Module) else unet_lora_layers + ) + + unet_lora_state_dict = {f"{cls.unet_name}.{module_name}": param for module_name, param in weights.items()} + state_dict.update(unet_lora_state_dict) + + if text_encoder_lora_layers is not None: + weights = ( + text_encoder_lora_layers.state_dict() + if isinstance(text_encoder_lora_layers, torch.nn.Module) + else text_encoder_lora_layers + ) + + text_encoder_lora_state_dict = { + f"{cls.text_encoder_name}.{module_name}": param for module_name, param in weights.items() + } + state_dict.update(text_encoder_lora_state_dict) + + # Save the model + cls.write_lora_layers( + state_dict=state_dict, + save_directory=save_directory, + is_main_process=is_main_process, + weight_name=weight_name, + save_function=save_function, + safe_serialization=safe_serialization, + ) + + @staticmethod + def write_lora_layers( + state_dict: Dict[str, torch.Tensor], + save_directory: str, + is_main_process: bool, + weight_name: str, + save_function: Callable, + safe_serialization: bool, + ): + if os.path.isfile(save_directory): + logger.error(f"Provided path ({save_directory}) should be a directory, not a file") + return + + if save_function is None: + if safe_serialization: + + def save_function(weights, filename): + return safetensors.torch.save_file(weights, filename, metadata={"format": "pt"}) + + else: + save_function = torch.save + + os.makedirs(save_directory, exist_ok=True) + + if weight_name is None: + if safe_serialization: + weight_name = LORA_WEIGHT_NAME_SAFE + else: + weight_name = LORA_WEIGHT_NAME + + save_function(state_dict, os.path.join(save_directory, weight_name)) + logger.info(f"Model weights saved in {os.path.join(save_directory, weight_name)}") + + @classmethod + def _convert_kohya_lora_to_diffusers(cls, state_dict): + unet_state_dict = {} + te_state_dict = {} + te2_state_dict = {} + network_alphas = {} + + # every down weight has a corresponding up weight and potentially an alpha weight + lora_keys = [k for k in state_dict.keys() if k.endswith("lora_down.weight")] + for key in lora_keys: + lora_name = key.split(".")[0] + lora_name_up = lora_name + ".lora_up.weight" + lora_name_alpha = lora_name + ".alpha" + + if lora_name.startswith("lora_unet_"): + diffusers_name = key.replace("lora_unet_", "").replace("_", ".") + + if "input.blocks" in diffusers_name: + diffusers_name = 
diffusers_name.replace("input.blocks", "down_blocks") + else: + diffusers_name = diffusers_name.replace("down.blocks", "down_blocks") + + if "middle.block" in diffusers_name: + diffusers_name = diffusers_name.replace("middle.block", "mid_block") + else: + diffusers_name = diffusers_name.replace("mid.block", "mid_block") + if "output.blocks" in diffusers_name: + diffusers_name = diffusers_name.replace("output.blocks", "up_blocks") + else: + diffusers_name = diffusers_name.replace("up.blocks", "up_blocks") + + diffusers_name = diffusers_name.replace("transformer.blocks", "transformer_blocks") + diffusers_name = diffusers_name.replace("to.q.lora", "to_q_lora") + diffusers_name = diffusers_name.replace("to.k.lora", "to_k_lora") + diffusers_name = diffusers_name.replace("to.v.lora", "to_v_lora") + diffusers_name = diffusers_name.replace("to.out.0.lora", "to_out_lora") + diffusers_name = diffusers_name.replace("proj.in", "proj_in") + diffusers_name = diffusers_name.replace("proj.out", "proj_out") + diffusers_name = diffusers_name.replace("emb.layers", "time_emb_proj") + + # SDXL specificity. + if "emb" in diffusers_name and "time.emb.proj" not in diffusers_name: + pattern = r"\.\d+(?=\D*$)" + diffusers_name = re.sub(pattern, "", diffusers_name, count=1) + if ".in." in diffusers_name: + diffusers_name = diffusers_name.replace("in.layers.2", "conv1") + if ".out." in diffusers_name: + diffusers_name = diffusers_name.replace("out.layers.3", "conv2") + if "downsamplers" in diffusers_name or "upsamplers" in diffusers_name: + diffusers_name = diffusers_name.replace("op", "conv") + if "skip" in diffusers_name: + diffusers_name = diffusers_name.replace("skip.connection", "conv_shortcut") + + # LyCORIS specificity. + if "time.emb.proj" in diffusers_name: + diffusers_name = diffusers_name.replace("time.emb.proj", "time_emb_proj") + if "conv.shortcut" in diffusers_name: + diffusers_name = diffusers_name.replace("conv.shortcut", "conv_shortcut") + + # General coverage. 
+ if "transformer_blocks" in diffusers_name: + if "attn1" in diffusers_name or "attn2" in diffusers_name: + diffusers_name = diffusers_name.replace("attn1", "attn1.processor") + diffusers_name = diffusers_name.replace("attn2", "attn2.processor") + unet_state_dict[diffusers_name] = state_dict.pop(key) + unet_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) + elif "ff" in diffusers_name: + unet_state_dict[diffusers_name] = state_dict.pop(key) + unet_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) + elif any(key in diffusers_name for key in ("proj_in", "proj_out")): + unet_state_dict[diffusers_name] = state_dict.pop(key) + unet_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) + else: + unet_state_dict[diffusers_name] = state_dict.pop(key) + unet_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) + + elif lora_name.startswith("lora_te_"): + diffusers_name = key.replace("lora_te_", "").replace("_", ".") + diffusers_name = diffusers_name.replace("text.model", "text_model") + diffusers_name = diffusers_name.replace("self.attn", "self_attn") + diffusers_name = diffusers_name.replace("q.proj.lora", "to_q_lora") + diffusers_name = diffusers_name.replace("k.proj.lora", "to_k_lora") + diffusers_name = diffusers_name.replace("v.proj.lora", "to_v_lora") + diffusers_name = diffusers_name.replace("out.proj.lora", "to_out_lora") + if "self_attn" in diffusers_name: + te_state_dict[diffusers_name] = state_dict.pop(key) + te_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) + elif "mlp" in diffusers_name: + # Be aware that this is the new diffusers convention and the rest of the code might + # not utilize it yet. + diffusers_name = diffusers_name.replace(".lora.", ".lora_linear_layer.") + te_state_dict[diffusers_name] = state_dict.pop(key) + te_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) + + # (sayakpaul): Duplicate code. Needs to be cleaned. + elif lora_name.startswith("lora_te1_"): + diffusers_name = key.replace("lora_te1_", "").replace("_", ".") + diffusers_name = diffusers_name.replace("text.model", "text_model") + diffusers_name = diffusers_name.replace("self.attn", "self_attn") + diffusers_name = diffusers_name.replace("q.proj.lora", "to_q_lora") + diffusers_name = diffusers_name.replace("k.proj.lora", "to_k_lora") + diffusers_name = diffusers_name.replace("v.proj.lora", "to_v_lora") + diffusers_name = diffusers_name.replace("out.proj.lora", "to_out_lora") + if "self_attn" in diffusers_name: + te_state_dict[diffusers_name] = state_dict.pop(key) + te_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) + elif "mlp" in diffusers_name: + # Be aware that this is the new diffusers convention and the rest of the code might + # not utilize it yet. + diffusers_name = diffusers_name.replace(".lora.", ".lora_linear_layer.") + te_state_dict[diffusers_name] = state_dict.pop(key) + te_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) + + # (sayakpaul): Duplicate code. Needs to be cleaned. 
+ elif lora_name.startswith("lora_te2_"): + diffusers_name = key.replace("lora_te2_", "").replace("_", ".") + diffusers_name = diffusers_name.replace("text.model", "text_model") + diffusers_name = diffusers_name.replace("self.attn", "self_attn") + diffusers_name = diffusers_name.replace("q.proj.lora", "to_q_lora") + diffusers_name = diffusers_name.replace("k.proj.lora", "to_k_lora") + diffusers_name = diffusers_name.replace("v.proj.lora", "to_v_lora") + diffusers_name = diffusers_name.replace("out.proj.lora", "to_out_lora") + if "self_attn" in diffusers_name: + te2_state_dict[diffusers_name] = state_dict.pop(key) + te2_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) + elif "mlp" in diffusers_name: + # Be aware that this is the new diffusers convention and the rest of the code might + # not utilize it yet. + diffusers_name = diffusers_name.replace(".lora.", ".lora_linear_layer.") + te2_state_dict[diffusers_name] = state_dict.pop(key) + te2_state_dict[diffusers_name.replace(".down.", ".up.")] = state_dict.pop(lora_name_up) + + # Rename the alphas so that they can be mapped appropriately. + if lora_name_alpha in state_dict: + alpha = state_dict.pop(lora_name_alpha).item() + if lora_name_alpha.startswith("lora_unet_"): + prefix = "unet." + elif lora_name_alpha.startswith(("lora_te_", "lora_te1_")): + prefix = "text_encoder." + else: + prefix = "text_encoder_2." + new_name = prefix + diffusers_name.split(".lora.")[0] + ".alpha" + network_alphas.update({new_name: alpha}) + + if len(state_dict) > 0: + raise ValueError( + f"The following keys have not been correctly be renamed: \n\n {', '.join(state_dict.keys())}" + ) + + logger.info("Kohya-style checkpoint detected.") + unet_state_dict = {f"{cls.unet_name}.{module_name}": params for module_name, params in unet_state_dict.items()} + te_state_dict = { + f"{cls.text_encoder_name}.{module_name}": params for module_name, params in te_state_dict.items() + } + te2_state_dict = ( + {f"text_encoder_2.{module_name}": params for module_name, params in te2_state_dict.items()} + if len(te2_state_dict) > 0 + else None + ) + if te2_state_dict is not None: + te_state_dict.update(te2_state_dict) + + new_state_dict = {**unet_state_dict, **te_state_dict} + return new_state_dict, network_alphas + + def unload_lora_weights(self): + """ + Unloads the LoRA parameters. + + Examples: + + ```python + >>> # Assuming `pipeline` is already loaded with the LoRA parameters. + >>> pipeline.unload_lora_weights() + >>> ... + ``` + """ + if not USE_PEFT_BACKEND: + if version.parse(__version__) > version.parse("0.23"): + logger.warn( + "You are using `unload_lora_weights` to disable and unload lora weights. If you want to iteratively enable and disable adapter weights," + "you can use `pipe.enable_lora()` or `pipe.disable_lora()`. After installing the latest version of PEFT." + ) + + for _, module in self.unet.named_modules(): + if hasattr(module, "set_lora_layer"): + module.set_lora_layer(None) + else: + recurse_remove_peft_layers(self.unet) + if hasattr(self.unet, "peft_config"): + del self.unet.peft_config + + # Safe to call the following regardless of LoRA. + self._remove_text_encoder_monkey_patch() + + def fuse_lora( + self, + fuse_unet: bool = True, + fuse_text_encoder: bool = True, + lora_scale: float = 1.0, + safe_fusing: bool = False, + ): + r""" + Fuses the LoRA parameters into the original parameters of the corresponding blocks. + + + + This is an experimental API. 
+ + + + Args: + fuse_unet (`bool`, defaults to `True`): Whether to fuse the UNet LoRA parameters. + fuse_text_encoder (`bool`, defaults to `True`): + Whether to fuse the text encoder LoRA parameters. If the text encoder wasn't monkey-patched with the + LoRA parameters then it won't have any effect. + lora_scale (`float`, defaults to 1.0): + Controls how much to influence the outputs with the LoRA parameters. + safe_fusing (`bool`, defaults to `False`): + Whether to check fused weights for NaN values before fusing and if values are NaN not fusing them. + """ + if fuse_unet or fuse_text_encoder: + self.num_fused_loras += 1 + if self.num_fused_loras > 1: + logger.warn( + "The current API is supported for operating with a single LoRA file. You are trying to load and fuse more than one LoRA which is not well-supported.", + ) + + if fuse_unet: + self.unet.fuse_lora(lora_scale, safe_fusing=safe_fusing) + + if USE_PEFT_BACKEND: + from peft.tuners.tuners_utils import BaseTunerLayer + + def fuse_text_encoder_lora(text_encoder, lora_scale=1.0, safe_fusing=False): + # TODO(Patrick, Younes): enable "safe" fusing + for module in text_encoder.modules(): + if isinstance(module, BaseTunerLayer): + if lora_scale != 1.0: + module.scale_layer(lora_scale) + + module.merge() + + else: + if version.parse(__version__) > version.parse("0.23"): + deprecate("fuse_text_encoder_lora", "0.25", LORA_DEPRECATION_MESSAGE) + + def fuse_text_encoder_lora(text_encoder, lora_scale=1.0, safe_fusing=False): + for _, attn_module in text_encoder_attn_modules(text_encoder): + if isinstance(attn_module.q_proj, PatchedLoraProjection): + attn_module.q_proj._fuse_lora(lora_scale, safe_fusing) + attn_module.k_proj._fuse_lora(lora_scale, safe_fusing) + attn_module.v_proj._fuse_lora(lora_scale, safe_fusing) + attn_module.out_proj._fuse_lora(lora_scale, safe_fusing) + + for _, mlp_module in text_encoder_mlp_modules(text_encoder): + if isinstance(mlp_module.fc1, PatchedLoraProjection): + mlp_module.fc1._fuse_lora(lora_scale, safe_fusing) + mlp_module.fc2._fuse_lora(lora_scale, safe_fusing) + + if fuse_text_encoder: + if hasattr(self, "text_encoder"): + fuse_text_encoder_lora(self.text_encoder, lora_scale, safe_fusing) + if hasattr(self, "text_encoder_2"): + fuse_text_encoder_lora(self.text_encoder_2, lora_scale, safe_fusing) + + def unfuse_lora(self, unfuse_unet: bool = True, unfuse_text_encoder: bool = True): + r""" + Reverses the effect of + [`pipe.fuse_lora()`](https://huggingface.co/docs/diffusers/main/en/api/loaders#diffusers.loaders.LoraLoaderMixin.fuse_lora). + + + + This is an experimental API. + + + + Args: + unfuse_unet (`bool`, defaults to `True`): Whether to unfuse the UNet LoRA parameters. + unfuse_text_encoder (`bool`, defaults to `True`): + Whether to unfuse the text encoder LoRA parameters. If the text encoder wasn't monkey-patched with the + LoRA parameters then it won't have any effect. 
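+        Example (illustrative; the same checkpoint ids as in the `get_active_adapters` example below are
+        used here, but any LoRA checkpoint works):
+
+        ```py
+        from diffusers import DiffusionPipeline
+
+        pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0").to("cuda")
+        pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors")
+        pipe.fuse_lora(lora_scale=0.7)
+        # ... run inference with the fused weights ...
+        pipe.unfuse_lora()
+        ```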
+ """ + if unfuse_unet: + if not USE_PEFT_BACKEND: + self.unet.unfuse_lora() + else: + from peft.tuners.tuners_utils import BaseTunerLayer + + for module in self.unet.modules(): + if isinstance(module, BaseTunerLayer): + module.unmerge() + + if USE_PEFT_BACKEND: + from peft.tuners.tuners_utils import BaseTunerLayer + + def unfuse_text_encoder_lora(text_encoder): + for module in text_encoder.modules(): + if isinstance(module, BaseTunerLayer): + module.unmerge() + + else: + if version.parse(__version__) > version.parse("0.23"): + deprecate("unfuse_text_encoder_lora", "0.25", LORA_DEPRECATION_MESSAGE) + + def unfuse_text_encoder_lora(text_encoder): + for _, attn_module in text_encoder_attn_modules(text_encoder): + if isinstance(attn_module.q_proj, PatchedLoraProjection): + attn_module.q_proj._unfuse_lora() + attn_module.k_proj._unfuse_lora() + attn_module.v_proj._unfuse_lora() + attn_module.out_proj._unfuse_lora() + + for _, mlp_module in text_encoder_mlp_modules(text_encoder): + if isinstance(mlp_module.fc1, PatchedLoraProjection): + mlp_module.fc1._unfuse_lora() + mlp_module.fc2._unfuse_lora() + + if unfuse_text_encoder: + if hasattr(self, "text_encoder"): + unfuse_text_encoder_lora(self.text_encoder) + if hasattr(self, "text_encoder_2"): + unfuse_text_encoder_lora(self.text_encoder_2) + + self.num_fused_loras -= 1 + + def set_adapters_for_text_encoder( + self, + adapter_names: Union[List[str], str], + text_encoder: Optional["PreTrainedModel"] = None, # noqa: F821 + text_encoder_weights: List[float] = None, + ): + """ + Sets the adapter layers for the text encoder. + + Args: + adapter_names (`List[str]` or `str`): + The names of the adapters to use. + text_encoder (`torch.nn.Module`, *optional*): + The text encoder module to set the adapter layers for. If `None`, it will try to get the `text_encoder` + attribute. + text_encoder_weights (`List[float]`, *optional*): + The weights to use for the text encoder. If `None`, the weights are set to `1.0` for all the adapters. + """ + if not USE_PEFT_BACKEND: + raise ValueError("PEFT backend is required for this method.") + + def process_weights(adapter_names, weights): + if weights is None: + weights = [1.0] * len(adapter_names) + elif isinstance(weights, float): + weights = [weights] + + if len(adapter_names) != len(weights): + raise ValueError( + f"Length of adapter names {len(adapter_names)} is not equal to the length of the weights {len(weights)}" + ) + return weights + + adapter_names = [adapter_names] if isinstance(adapter_names, str) else adapter_names + text_encoder_weights = process_weights(adapter_names, text_encoder_weights) + text_encoder = text_encoder or getattr(self, "text_encoder", None) + if text_encoder is None: + raise ValueError( + "The pipeline does not have a default `pipe.text_encoder` class. Please make sure to pass a `text_encoder` instead." + ) + set_weights_and_activate_adapters(text_encoder, adapter_names, text_encoder_weights) + + def disable_lora_for_text_encoder(self, text_encoder: Optional["PreTrainedModel"] = None): + """ + Disables the LoRA layers for the text encoder. + + Args: + text_encoder (`torch.nn.Module`, *optional*): + The text encoder module to disable the LoRA layers for. If `None`, it will try to get the + `text_encoder` attribute. 
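+
+        Example (a minimal sketch; assumes a LoRA adapter was already loaded and the PEFT backend is available):
+
+        ```py
+        pipe.disable_lora_for_text_encoder()  # prompts are encoded without the LoRA layers
+        pipe.enable_lora_for_text_encoder()  # turn them back on afterwards
+        ```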
+        """
+        if not USE_PEFT_BACKEND:
+            raise ValueError("PEFT backend is required for this method.")
+
+        text_encoder = text_encoder or getattr(self, "text_encoder", None)
+        if text_encoder is None:
+            raise ValueError("Text Encoder not found.")
+        set_adapter_layers(text_encoder, enabled=False)
+
+    def enable_lora_for_text_encoder(self, text_encoder: Optional["PreTrainedModel"] = None):
+        """
+        Enables the LoRA layers for the text encoder.
+
+        Args:
+            text_encoder (`torch.nn.Module`, *optional*):
+                The text encoder module to enable the LoRA layers for. If `None`, it will try to get the `text_encoder`
+                attribute.
+        """
+        if not USE_PEFT_BACKEND:
+            raise ValueError("PEFT backend is required for this method.")
+        text_encoder = text_encoder or getattr(self, "text_encoder", None)
+        if text_encoder is None:
+            raise ValueError("Text Encoder not found.")
+        set_adapter_layers(text_encoder, enabled=True)
+
+    def set_adapters(
+        self,
+        adapter_names: Union[List[str], str],
+        adapter_weights: Optional[List[float]] = None,
+    ):
+        # Handle the UNET
+        self.unet.set_adapters(adapter_names, adapter_weights)
+
+        # Handle the Text Encoder
+        if hasattr(self, "text_encoder"):
+            self.set_adapters_for_text_encoder(adapter_names, self.text_encoder, adapter_weights)
+        if hasattr(self, "text_encoder_2"):
+            self.set_adapters_for_text_encoder(adapter_names, self.text_encoder_2, adapter_weights)
+
+    def disable_lora(self):
+        if not USE_PEFT_BACKEND:
+            raise ValueError("PEFT backend is required for this method.")
+
+        # Disable unet adapters
+        self.unet.disable_lora()
+
+        # Disable text encoder adapters
+        if hasattr(self, "text_encoder"):
+            self.disable_lora_for_text_encoder(self.text_encoder)
+        if hasattr(self, "text_encoder_2"):
+            self.disable_lora_for_text_encoder(self.text_encoder_2)
+
+    def enable_lora(self):
+        if not USE_PEFT_BACKEND:
+            raise ValueError("PEFT backend is required for this method.")
+
+        # Enable unet adapters
+        self.unet.enable_lora()
+
+        # Enable text encoder adapters
+        if hasattr(self, "text_encoder"):
+            self.enable_lora_for_text_encoder(self.text_encoder)
+        if hasattr(self, "text_encoder_2"):
+            self.enable_lora_for_text_encoder(self.text_encoder_2)
+
+    def delete_adapters(self, adapter_names: Union[List[str], str]):
+        """
+        Deletes the LoRA layers of `adapter_names` for the unet and text-encoder(s).
+
+        Args:
+            adapter_names (`Union[List[str], str]`):
+                The names of the adapters to delete. Can be a single string or a list of strings.
+        """
+        if not USE_PEFT_BACKEND:
+            raise ValueError("PEFT backend is required for this method.")
+
+        if isinstance(adapter_names, str):
+            adapter_names = [adapter_names]
+
+        # Delete unet adapters
+        self.unet.delete_adapters(adapter_names)
+
+        for adapter_name in adapter_names:
+            # Delete text encoder adapters
+            if hasattr(self, "text_encoder"):
+                delete_adapter_layers(self.text_encoder, adapter_name)
+            if hasattr(self, "text_encoder_2"):
+                delete_adapter_layers(self.text_encoder_2, adapter_name)
+
+    def get_active_adapters(self) -> List[str]:
+        """
+        Gets the list of the current active adapters.
+
+        Example:
+
+        ```python
+        from diffusers import DiffusionPipeline
+
+        pipeline = DiffusionPipeline.from_pretrained(
+            "stabilityai/stable-diffusion-xl-base-1.0",
+        ).to("cuda")
+        pipeline.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
+        pipeline.get_active_adapters()
+        ```
+        """
+        if not USE_PEFT_BACKEND:
+            raise ValueError(
+                "PEFT backend is required for this method.
Please install the latest version of PEFT `pip install -U peft`" + ) + + from peft.tuners.tuners_utils import BaseTunerLayer + + active_adapters = [] + + for module in self.unet.modules(): + if isinstance(module, BaseTunerLayer): + active_adapters = module.active_adapters + break + + return active_adapters + + def get_list_adapters(self) -> Dict[str, List[str]]: + """ + Gets the current list of all available adapters in the pipeline. + """ + if not USE_PEFT_BACKEND: + raise ValueError( + "PEFT backend is required for this method. Please install the latest version of PEFT `pip install -U peft`" + ) + + set_adapters = {} + + if hasattr(self, "text_encoder") and hasattr(self.text_encoder, "peft_config"): + set_adapters["text_encoder"] = list(self.text_encoder.peft_config.keys()) + + if hasattr(self, "text_encoder_2") and hasattr(self.text_encoder_2, "peft_config"): + set_adapters["text_encoder_2"] = list(self.text_encoder_2.peft_config.keys()) + + if hasattr(self, "unet") and hasattr(self.unet, "peft_config"): + set_adapters["unet"] = list(self.unet.peft_config.keys()) + + return set_adapters + + def set_lora_device(self, adapter_names: List[str], device: Union[torch.device, str, int]) -> None: + """ + Moves the LoRAs listed in `adapter_names` to a target device. Useful for offloading the LoRA to the CPU in case + you want to load multiple adapters and free some GPU memory. + + Args: + adapter_names (`List[str]`): + List of adapters to send device to. + device (`Union[torch.device, str, int]`): + Device to send the adapters to. Can be either a torch device, a str or an integer. + """ + if not USE_PEFT_BACKEND: + raise ValueError("PEFT backend is required for this method.") + + from peft.tuners.tuners_utils import BaseTunerLayer + + # Handle the UNET + for unet_module in self.unet.modules(): + if isinstance(unet_module, BaseTunerLayer): + for adapter_name in adapter_names: + unet_module.lora_A[adapter_name].to(device) + unet_module.lora_B[adapter_name].to(device) + + # Handle the text encoder + modules_to_process = [] + if hasattr(self, "text_encoder"): + modules_to_process.append(self.text_encoder) + + if hasattr(self, "text_encoder_2"): + modules_to_process.append(self.text_encoder_2) + + for text_encoder in modules_to_process: + # loop over submodules + for text_encoder_module in text_encoder.modules(): + if isinstance(text_encoder_module, BaseTunerLayer): + for adapter_name in adapter_names: + text_encoder_module.lora_A[adapter_name].to(device) + text_encoder_module.lora_B[adapter_name].to(device) + + +class StableDiffusionXLLoraLoaderMixin(LoraLoaderMixin): + """This class overrides `LoraLoaderMixin` with LoRA loading/saving code that's specific to SDXL""" + + # Overrride to properly handle the loading and unloading of the additional text encoder. + def load_lora_weights( + self, + pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], + adapter_name: Optional[str] = None, + **kwargs, + ): + """ + Load LoRA weights specified in `pretrained_model_name_or_path_or_dict` into `self.unet` and + `self.text_encoder`. + + All kwargs are forwarded to `self.lora_state_dict`. + + See [`~loaders.LoraLoaderMixin.lora_state_dict`] for more details on how the state dict is loaded. + + See [`~loaders.LoraLoaderMixin.load_lora_into_unet`] for more details on how the state dict is loaded into + `self.unet`. + + See [`~loaders.LoraLoaderMixin.load_lora_into_text_encoder`] for more details on how the state dict is loaded + into `self.text_encoder`. 
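+
+        Example (a minimal sketch; the repository and file names are placeholders):
+
+        ```py
+        pipe.load_lora_weights(
+            "some-user/some-sdxl-lora", weight_name="pytorch_lora_weights.safetensors", adapter_name="style"
+        )
+        ```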
+ + Parameters: + pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`): + See [`~loaders.LoraLoaderMixin.lora_state_dict`]. + adapter_name (`str`, *optional*): + Adapter name to be used for referencing the loaded adapter model. If not specified, it will use + `default_{i}` where i is the total number of adapters being loaded. + kwargs (`dict`, *optional*): + See [`~loaders.LoraLoaderMixin.lora_state_dict`]. + """ + # We could have accessed the unet config from `lora_state_dict()` too. We pass + # it here explicitly to be able to tell that it's coming from an SDXL + # pipeline. + + # First, ensure that the checkpoint is a compatible one and can be successfully loaded. + state_dict, network_alphas = self.lora_state_dict( + pretrained_model_name_or_path_or_dict, + unet_config=self.unet.config, + **kwargs, + ) + is_correct_format = all("lora" in key for key in state_dict.keys()) + if not is_correct_format: + raise ValueError("Invalid LoRA checkpoint.") + + self.load_lora_into_unet( + state_dict, network_alphas=network_alphas, unet=self.unet, adapter_name=adapter_name, _pipeline=self + ) + text_encoder_state_dict = {k: v for k, v in state_dict.items() if "text_encoder." in k} + if len(text_encoder_state_dict) > 0: + self.load_lora_into_text_encoder( + text_encoder_state_dict, + network_alphas=network_alphas, + text_encoder=self.text_encoder, + prefix="text_encoder", + lora_scale=self.lora_scale, + adapter_name=adapter_name, + _pipeline=self, + ) + + text_encoder_2_state_dict = {k: v for k, v in state_dict.items() if "text_encoder_2." in k} + if len(text_encoder_2_state_dict) > 0: + self.load_lora_into_text_encoder( + text_encoder_2_state_dict, + network_alphas=network_alphas, + text_encoder=self.text_encoder_2, + prefix="text_encoder_2", + lora_scale=self.lora_scale, + adapter_name=adapter_name, + _pipeline=self, + ) + + @classmethod + def save_lora_weights( + cls, + save_directory: Union[str, os.PathLike], + unet_lora_layers: Dict[str, Union[torch.nn.Module, torch.Tensor]] = None, + text_encoder_lora_layers: Dict[str, Union[torch.nn.Module, torch.Tensor]] = None, + text_encoder_2_lora_layers: Dict[str, Union[torch.nn.Module, torch.Tensor]] = None, + is_main_process: bool = True, + weight_name: str = None, + save_function: Callable = None, + safe_serialization: bool = True, + ): + r""" + Save the LoRA parameters corresponding to the UNet and text encoder. + + Arguments: + save_directory (`str` or `os.PathLike`): + Directory to save LoRA parameters to. Will be created if it doesn't exist. + unet_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`): + State dict of the LoRA layers corresponding to the `unet`. + text_encoder_lora_layers (`Dict[str, torch.nn.Module]` or `Dict[str, torch.Tensor]`): + State dict of the LoRA layers corresponding to the `text_encoder`. Must explicitly pass the text + encoder LoRA state dict because it comes from πŸ€— Transformers. + is_main_process (`bool`, *optional*, defaults to `True`): + Whether the process calling this is the main process or not. Useful during distributed training and you + need to call this function on all processes. In this case, set `is_main_process=True` only on the main + process to avoid race conditions. + save_function (`Callable`): + The function to use to save the state dictionary. Useful during distributed training when you need to + replace `torch.save` with another method. Can be configured with the environment variable + `DIFFUSERS_SAVE_MODE`. 
+ safe_serialization (`bool`, *optional*, defaults to `True`): + Whether to save the model using `safetensors` or the traditional PyTorch way with `pickle`. + """ + state_dict = {} + + def pack_weights(layers, prefix): + layers_weights = layers.state_dict() if isinstance(layers, torch.nn.Module) else layers + layers_state_dict = {f"{prefix}.{module_name}": param for module_name, param in layers_weights.items()} + return layers_state_dict + + if not (unet_lora_layers or text_encoder_lora_layers or text_encoder_2_lora_layers): + raise ValueError( + "You must pass at least one of `unet_lora_layers`, `text_encoder_lora_layers` or `text_encoder_2_lora_layers`." + ) + + if unet_lora_layers: + state_dict.update(pack_weights(unet_lora_layers, "unet")) + + if text_encoder_lora_layers and text_encoder_2_lora_layers: + state_dict.update(pack_weights(text_encoder_lora_layers, "text_encoder")) + state_dict.update(pack_weights(text_encoder_2_lora_layers, "text_encoder_2")) + + cls.write_lora_layers( + state_dict=state_dict, + save_directory=save_directory, + is_main_process=is_main_process, + weight_name=weight_name, + save_function=save_function, + safe_serialization=safe_serialization, + ) + + def _remove_text_encoder_monkey_patch(self): + if USE_PEFT_BACKEND: + recurse_remove_peft_layers(self.text_encoder) + # TODO: @younesbelkada handle this in transformers side + if getattr(self.text_encoder, "peft_config", None) is not None: + del self.text_encoder.peft_config + self.text_encoder._hf_peft_config_loaded = None + + recurse_remove_peft_layers(self.text_encoder_2) + if getattr(self.text_encoder_2, "peft_config", None) is not None: + del self.text_encoder_2.peft_config + self.text_encoder_2._hf_peft_config_loaded = None + else: + self._remove_text_encoder_monkey_patch_classmethod(self.text_encoder) + self._remove_text_encoder_monkey_patch_classmethod(self.text_encoder_2) diff --git a/src/diffusers/loaders/single_file.py b/src/diffusers/loaders/single_file.py new file mode 100644 index 000000000000..8a4f1a0541fd --- /dev/null +++ b/src/diffusers/loaders/single_file.py @@ -0,0 +1,624 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from contextlib import nullcontext +from io import BytesIO +from pathlib import Path + +import requests +import torch +from huggingface_hub import hf_hub_download + +from ..utils import ( + DIFFUSERS_CACHE, + HF_HUB_OFFLINE, + deprecate, + is_accelerate_available, + is_omegaconf_available, + is_transformers_available, + logging, +) +from ..utils.import_utils import BACKENDS_MAPPING + + +if is_transformers_available(): + pass + +if is_accelerate_available(): + from accelerate import init_empty_weights + +logger = logging.get_logger(__name__) + + +class FromSingleFileMixin: + """ + Load model weights saved in the `.ckpt` format into a [`DiffusionPipeline`]. 
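+    Checkpoints saved as a single `.safetensors` file are supported as well.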
+ """ + + @classmethod + def from_ckpt(cls, *args, **kwargs): + deprecation_message = "The function `from_ckpt` is deprecated in favor of `from_single_file` and will be removed in diffusers v.0.21. Please make sure to use `StableDiffusionPipeline.from_single_file(...)` instead." + deprecate("from_ckpt", "0.21.0", deprecation_message, standard_warn=False) + return cls.from_single_file(*args, **kwargs) + + @classmethod + def from_single_file(cls, pretrained_model_link_or_path, **kwargs): + r""" + Instantiate a [`DiffusionPipeline`] from pretrained pipeline weights saved in the `.ckpt` or `.safetensors` + format. The pipeline is set in evaluation mode (`model.eval()`) by default. + + Parameters: + pretrained_model_link_or_path (`str` or `os.PathLike`, *optional*): + Can be either: + - A link to the `.ckpt` file (for example + `"https://huggingface.co//blob/main/.ckpt"`) on the Hub. + - A path to a *file* containing all pipeline weights. + torch_dtype (`str` or `torch.dtype`, *optional*): + Override the default `torch.dtype` and load the model with another dtype. If `"auto"` is passed, the + dtype is automatically derived from the model's weights. + force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force the (re-)download of the model weights and configuration files, overriding the + cached versions if they exist. + cache_dir (`Union[str, os.PathLike]`, *optional*): + Path to a directory where a downloaded pretrained model configuration is cached if the standard cache + is not used. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to resume downloading the model weights and configuration files. If set to `False`, any + incompletely downloaded files are deleted. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. + local_files_only (`bool`, *optional*, defaults to `False`): + Whether to only load local model weights and configuration files or not. If set to `True`, the model + won't be downloaded from the Hub. + use_auth_token (`str` or *bool*, *optional*): + The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from + `diffusers-cli login` (stored in `~/.huggingface`) is used. + revision (`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier + allowed by Git. + use_safetensors (`bool`, *optional*, defaults to `None`): + If set to `None`, the safetensors weights are downloaded if they're available **and** if the + safetensors library is installed. If set to `True`, the model is forcibly loaded from safetensors + weights. If set to `False`, safetensors weights are not loaded. + extract_ema (`bool`, *optional*, defaults to `False`): + Whether to extract the EMA weights or not. Pass `True` to extract the EMA weights which usually yield + higher quality images for inference. Non-EMA weights are usually better for continuing finetuning. + upcast_attention (`bool`, *optional*, defaults to `None`): + Whether the attention computation should always be upcasted. + image_size (`int`, *optional*, defaults to 512): + The image size the model was trained on. Use 512 for all Stable Diffusion v1 models and the Stable + Diffusion v2 base model. Use 768 for Stable Diffusion v2. 
+ prediction_type (`str`, *optional*): + The prediction type the model was trained on. Use `'epsilon'` for all Stable Diffusion v1 models and + the Stable Diffusion v2 base model. Use `'v_prediction'` for Stable Diffusion v2. + num_in_channels (`int`, *optional*, defaults to `None`): + The number of input channels. If `None`, it is automatically inferred. + scheduler_type (`str`, *optional*, defaults to `"pndm"`): + Type of scheduler to use. Should be one of `["pndm", "lms", "heun", "euler", "euler-ancestral", "dpm", + "ddim"]`. + load_safety_checker (`bool`, *optional*, defaults to `True`): + Whether to load the safety checker or not. + text_encoder ([`~transformers.CLIPTextModel`], *optional*, defaults to `None`): + An instance of `CLIPTextModel` to use, specifically the + [clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14) variant. If this + parameter is `None`, the function loads a new instance of `CLIPTextModel` by itself if needed. + vae (`AutoencoderKL`, *optional*, defaults to `None`): + Variational Auto-Encoder (VAE) Model to encode and decode images to and from latent representations. If + this parameter is `None`, the function will load a new instance of [CLIP] by itself, if needed. + tokenizer ([`~transformers.CLIPTokenizer`], *optional*, defaults to `None`): + An instance of `CLIPTokenizer` to use. If this parameter is `None`, the function loads a new instance + of `CLIPTokenizer` by itself if needed. + original_config_file (`str`): + Path to `.yaml` config file corresponding to the original architecture. If `None`, will be + automatically inferred by looking for a key that only exists in SD2.0 models. + kwargs (remaining dictionary of keyword arguments, *optional*): + Can be used to overwrite load and saveable variables (for example the pipeline components of the + specific pipeline class). The overwritten components are directly passed to the pipelines `__init__` + method. See example below for more information. + + Examples: + + ```py + >>> from diffusers import StableDiffusionPipeline + + >>> # Download pipeline from huggingface.co and cache. + >>> pipeline = StableDiffusionPipeline.from_single_file( + ... "https://huggingface.co/WarriorMama777/OrangeMixs/blob/main/Models/AbyssOrangeMix/AbyssOrangeMix.safetensors" + ... ) + + >>> # Download pipeline from local file + >>> # file is downloaded under ./v1-5-pruned-emaonly.ckpt + >>> pipeline = StableDiffusionPipeline.from_single_file("./v1-5-pruned-emaonly") + + >>> # Enable float16 and move to GPU + >>> pipeline = StableDiffusionPipeline.from_single_file( + ... "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned-emaonly.ckpt", + ... torch_dtype=torch.float16, + ... 
)
+        >>> pipeline.to("cuda")
+        ```
+        """
+        # import here to avoid circular dependency
+        from ..pipelines.stable_diffusion.convert_from_ckpt import download_from_original_stable_diffusion_ckpt
+
+        original_config_file = kwargs.pop("original_config_file", None)
+        config_files = kwargs.pop("config_files", None)
+        cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE)
+        resume_download = kwargs.pop("resume_download", False)
+        force_download = kwargs.pop("force_download", False)
+        proxies = kwargs.pop("proxies", None)
+        local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE)
+        use_auth_token = kwargs.pop("use_auth_token", None)
+        revision = kwargs.pop("revision", None)
+        extract_ema = kwargs.pop("extract_ema", False)
+        image_size = kwargs.pop("image_size", None)
+        scheduler_type = kwargs.pop("scheduler_type", "pndm")
+        num_in_channels = kwargs.pop("num_in_channels", None)
+        upcast_attention = kwargs.pop("upcast_attention", None)
+        load_safety_checker = kwargs.pop("load_safety_checker", True)
+        prediction_type = kwargs.pop("prediction_type", None)
+        text_encoder = kwargs.pop("text_encoder", None)
+        vae = kwargs.pop("vae", None)
+        controlnet = kwargs.pop("controlnet", None)
+        adapter = kwargs.pop("adapter", None)
+        tokenizer = kwargs.pop("tokenizer", None)
+
+        torch_dtype = kwargs.pop("torch_dtype", None)
+
+        use_safetensors = kwargs.pop("use_safetensors", None)
+
+        pipeline_name = cls.__name__
+        file_extension = pretrained_model_link_or_path.rsplit(".", 1)[-1]
+        from_safetensors = file_extension == "safetensors"
+
+        if from_safetensors and use_safetensors is False:
+            raise ValueError("Make sure to install `safetensors` with `pip install safetensors`.")
+
+        # TODO: For now we only support stable diffusion
+        stable_unclip = None
+        model_type = None
+
+        if pipeline_name in [
+            "StableDiffusionControlNetPipeline",
+            "StableDiffusionControlNetImg2ImgPipeline",
+            "StableDiffusionControlNetInpaintPipeline",
+        ]:
+            from ..models.controlnet import ControlNetModel
+            from ..pipelines.controlnet.multicontrolnet import MultiControlNetModel
+
+            # list/tuple or a single instance of ControlNetModel or MultiControlNetModel
+            if not (
+                isinstance(controlnet, (ControlNetModel, MultiControlNetModel))
+                or isinstance(controlnet, (list, tuple))
+                and isinstance(controlnet[0], ControlNetModel)
+            ):
+                raise ValueError("ControlNet needs to be passed if loading from ControlNet pipeline.")
+        elif "StableDiffusion" in pipeline_name:
+            # Model type will be inferred from the checkpoint.
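+            # `model_type` stays `None` here so that `download_from_original_stable_diffusion_ckpt`
+            # can determine it from the checkpoint itself.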
+ pass + elif pipeline_name == "StableUnCLIPPipeline": + model_type = "FrozenOpenCLIPEmbedder" + stable_unclip = "txt2img" + elif pipeline_name == "StableUnCLIPImg2ImgPipeline": + model_type = "FrozenOpenCLIPEmbedder" + stable_unclip = "img2img" + elif pipeline_name == "PaintByExamplePipeline": + model_type = "PaintByExample" + elif pipeline_name == "LDMTextToImagePipeline": + model_type = "LDMTextToImage" + else: + raise ValueError(f"Unhandled pipeline class: {pipeline_name}") + + # remove huggingface url + has_valid_url_prefix = False + valid_url_prefixes = ["https://huggingface.co/", "huggingface.co/", "hf.co/", "https://hf.co/"] + for prefix in valid_url_prefixes: + if pretrained_model_link_or_path.startswith(prefix): + pretrained_model_link_or_path = pretrained_model_link_or_path[len(prefix) :] + has_valid_url_prefix = True + + # Code based on diffusers.pipelines.pipeline_utils.DiffusionPipeline.from_pretrained + ckpt_path = Path(pretrained_model_link_or_path) + if not ckpt_path.is_file(): + if not has_valid_url_prefix: + raise ValueError( + f"The provided path is either not a file or a valid huggingface URL was not provided. Valid URLs begin with {', '.join(valid_url_prefixes)}" + ) + + # get repo_id and (potentially nested) file path of ckpt in repo + repo_id = "/".join(ckpt_path.parts[:2]) + file_path = "/".join(ckpt_path.parts[2:]) + + if file_path.startswith("blob/"): + file_path = file_path[len("blob/") :] + + if file_path.startswith("main/"): + file_path = file_path[len("main/") :] + + pretrained_model_link_or_path = hf_hub_download( + repo_id, + filename=file_path, + cache_dir=cache_dir, + resume_download=resume_download, + proxies=proxies, + local_files_only=local_files_only, + use_auth_token=use_auth_token, + revision=revision, + force_download=force_download, + ) + + pipe = download_from_original_stable_diffusion_ckpt( + pretrained_model_link_or_path, + pipeline_class=cls, + model_type=model_type, + stable_unclip=stable_unclip, + controlnet=controlnet, + adapter=adapter, + from_safetensors=from_safetensors, + extract_ema=extract_ema, + image_size=image_size, + scheduler_type=scheduler_type, + num_in_channels=num_in_channels, + upcast_attention=upcast_attention, + load_safety_checker=load_safety_checker, + prediction_type=prediction_type, + text_encoder=text_encoder, + vae=vae, + tokenizer=tokenizer, + original_config_file=original_config_file, + config_files=config_files, + local_files_only=local_files_only, + ) + + if torch_dtype is not None: + pipe.to(torch_dtype=torch_dtype) + + return pipe + + +class FromOriginalVAEMixin: + @classmethod + def from_single_file(cls, pretrained_model_link_or_path, **kwargs): + r""" + Instantiate a [`AutoencoderKL`] from pretrained controlnet weights saved in the original `.ckpt` or + `.safetensors` format. The pipeline is format. The pipeline is set in evaluation mode (`model.eval()`) by + default. + + Parameters: + pretrained_model_link_or_path (`str` or `os.PathLike`, *optional*): + Can be either: + - A link to the `.ckpt` file (for example + `"https://huggingface.co//blob/main/.ckpt"`) on the Hub. + - A path to a *file* containing all pipeline weights. + torch_dtype (`str` or `torch.dtype`, *optional*): + Override the default `torch.dtype` and load the model with another dtype. If `"auto"` is passed, the + dtype is automatically derived from the model's weights. 
+ force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force the (re-)download of the model weights and configuration files, overriding the + cached versions if they exist. + cache_dir (`Union[str, os.PathLike]`, *optional*): + Path to a directory where a downloaded pretrained model configuration is cached if the standard cache + is not used. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to resume downloading the model weights and configuration files. If set to `False`, any + incompletely downloaded files are deleted. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. + local_files_only (`bool`, *optional*, defaults to `False`): + Whether to only load local model weights and configuration files or not. If set to True, the model + won't be downloaded from the Hub. + use_auth_token (`str` or *bool*, *optional*): + The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from + `diffusers-cli login` (stored in `~/.huggingface`) is used. + revision (`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier + allowed by Git. + image_size (`int`, *optional*, defaults to 512): + The image size the model was trained on. Use 512 for all Stable Diffusion v1 models and the Stable + Diffusion v2 base model. Use 768 for Stable Diffusion v2. + use_safetensors (`bool`, *optional*, defaults to `None`): + If set to `None`, the safetensors weights are downloaded if they're available **and** if the + safetensors library is installed. If set to `True`, the model is forcibly loaded from safetensors + weights. If set to `False`, safetensors weights are not loaded. + upcast_attention (`bool`, *optional*, defaults to `None`): + Whether the attention computation should always be upcasted. + scaling_factor (`float`, *optional*, defaults to 0.18215): + The component-wise standard deviation of the trained latent space computed using the first batch of the + training set. This is used to scale the latent space to have unit variance when training the diffusion + model. The latents are scaled with the formula `z = z * scaling_factor` before being passed to the + diffusion model. When decoding, the latents are scaled back to the original scale with the formula: `z + = 1 / scaling_factor * z`. For more details, refer to sections 4.3.2 and D.1 of the [High-Resolution + Image Synthesis with Latent Diffusion Models](https://arxiv.org/abs/2112.10752) paper. + kwargs (remaining dictionary of keyword arguments, *optional*): + Can be used to overwrite load and saveable variables (for example the pipeline components of the + specific pipeline class). The overwritten components are directly passed to the pipelines `__init__` + method. See example below for more information. + + + + Make sure to pass both `image_size` and `scaling_factor` to `from_single_file()` if you want to load + a VAE that does accompany a stable diffusion model of v2 or higher or SDXL. 
+ + + + Examples: + + ```py + from diffusers import AutoencoderKL + + url = "https://huggingface.co/stabilityai/sd-vae-ft-mse-original/blob/main/vae-ft-mse-840000-ema-pruned.safetensors" # can also be local file + model = AutoencoderKL.from_single_file(url) + ``` + """ + if not is_omegaconf_available(): + raise ValueError(BACKENDS_MAPPING["omegaconf"][1]) + + from omegaconf import OmegaConf + + from ..models import AutoencoderKL + + # import here to avoid circular dependency + from ..pipelines.stable_diffusion.convert_from_ckpt import ( + convert_ldm_vae_checkpoint, + create_vae_diffusers_config, + ) + + config_file = kwargs.pop("config_file", None) + cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE) + resume_download = kwargs.pop("resume_download", False) + force_download = kwargs.pop("force_download", False) + proxies = kwargs.pop("proxies", None) + local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE) + use_auth_token = kwargs.pop("use_auth_token", None) + revision = kwargs.pop("revision", None) + image_size = kwargs.pop("image_size", None) + scaling_factor = kwargs.pop("scaling_factor", None) + kwargs.pop("upcast_attention", None) + + torch_dtype = kwargs.pop("torch_dtype", None) + + use_safetensors = kwargs.pop("use_safetensors", None) + + file_extension = pretrained_model_link_or_path.rsplit(".", 1)[-1] + from_safetensors = file_extension == "safetensors" + + if from_safetensors and use_safetensors is False: + raise ValueError("Make sure to install `safetensors` with `pip install safetensors`.") + + # remove huggingface url + for prefix in ["https://huggingface.co/", "huggingface.co/", "hf.co/", "https://hf.co/"]: + if pretrained_model_link_or_path.startswith(prefix): + pretrained_model_link_or_path = pretrained_model_link_or_path[len(prefix) :] + + # Code based on diffusers.pipelines.pipeline_utils.DiffusionPipeline.from_pretrained + ckpt_path = Path(pretrained_model_link_or_path) + if not ckpt_path.is_file(): + # get repo_id and (potentially nested) file path of ckpt in repo + repo_id = "/".join(ckpt_path.parts[:2]) + file_path = "/".join(ckpt_path.parts[2:]) + + if file_path.startswith("blob/"): + file_path = file_path[len("blob/") :] + + if file_path.startswith("main/"): + file_path = file_path[len("main/") :] + + pretrained_model_link_or_path = hf_hub_download( + repo_id, + filename=file_path, + cache_dir=cache_dir, + resume_download=resume_download, + proxies=proxies, + local_files_only=local_files_only, + use_auth_token=use_auth_token, + revision=revision, + force_download=force_download, + ) + + if from_safetensors: + from safetensors import safe_open + + checkpoint = {} + with safe_open(pretrained_model_link_or_path, framework="pt", device="cpu") as f: + for key in f.keys(): + checkpoint[key] = f.get_tensor(key) + else: + checkpoint = torch.load(pretrained_model_link_or_path, map_location="cpu") + + if "state_dict" in checkpoint: + checkpoint = checkpoint["state_dict"] + + if config_file is None: + config_url = "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/configs/stable-diffusion/v1-inference.yaml" + config_file = BytesIO(requests.get(config_url).content) + + original_config = OmegaConf.load(config_file) + + # default to sd-v1-5 + image_size = image_size or 512 + + vae_config = create_vae_diffusers_config(original_config, image_size=image_size) + converted_vae_checkpoint = convert_ldm_vae_checkpoint(checkpoint, vae_config) + + if scaling_factor is None: + if ( + "model" in original_config + and "params" in original_config.model + and 
"scale_factor" in original_config.model.params + ): + vae_scaling_factor = original_config.model.params.scale_factor + else: + vae_scaling_factor = 0.18215 # default SD scaling factor + + vae_config["scaling_factor"] = vae_scaling_factor + + ctx = init_empty_weights if is_accelerate_available() else nullcontext + with ctx(): + vae = AutoencoderKL(**vae_config) + + if is_accelerate_available(): + from ..models.modeling_utils import load_model_dict_into_meta + + load_model_dict_into_meta(vae, converted_vae_checkpoint, device="cpu") + else: + vae.load_state_dict(converted_vae_checkpoint) + + if torch_dtype is not None: + vae.to(dtype=torch_dtype) + + return vae + + +class FromOriginalControlnetMixin: + @classmethod + def from_single_file(cls, pretrained_model_link_or_path, **kwargs): + r""" + Instantiate a [`ControlNetModel`] from pretrained controlnet weights saved in the original `.ckpt` or + `.safetensors` format. The pipeline is set in evaluation mode (`model.eval()`) by default. + + Parameters: + pretrained_model_link_or_path (`str` or `os.PathLike`, *optional*): + Can be either: + - A link to the `.ckpt` file (for example + `"https://huggingface.co//blob/main/.ckpt"`) on the Hub. + - A path to a *file* containing all pipeline weights. + torch_dtype (`str` or `torch.dtype`, *optional*): + Override the default `torch.dtype` and load the model with another dtype. If `"auto"` is passed, the + dtype is automatically derived from the model's weights. + force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force the (re-)download of the model weights and configuration files, overriding the + cached versions if they exist. + cache_dir (`Union[str, os.PathLike]`, *optional*): + Path to a directory where a downloaded pretrained model configuration is cached if the standard cache + is not used. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to resume downloading the model weights and configuration files. If set to `False`, any + incompletely downloaded files are deleted. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. + local_files_only (`bool`, *optional*, defaults to `False`): + Whether to only load local model weights and configuration files or not. If set to True, the model + won't be downloaded from the Hub. + use_auth_token (`str` or *bool*, *optional*): + The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from + `diffusers-cli login` (stored in `~/.huggingface`) is used. + revision (`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier + allowed by Git. + use_safetensors (`bool`, *optional*, defaults to `None`): + If set to `None`, the safetensors weights are downloaded if they're available **and** if the + safetensors library is installed. If set to `True`, the model is forcibly loaded from safetensors + weights. If set to `False`, safetensors weights are not loaded. + image_size (`int`, *optional*, defaults to 512): + The image size the model was trained on. Use 512 for all Stable Diffusion v1 models and the Stable + Diffusion v2 base model. Use 768 for Stable Diffusion v2. + upcast_attention (`bool`, *optional*, defaults to `None`): + Whether the attention computation should always be upcasted. 
+ kwargs (remaining dictionary of keyword arguments, *optional*): + Can be used to overwrite load and saveable variables (for example the pipeline components of the + specific pipeline class). The overwritten components are directly passed to the pipelines `__init__` + method. See example below for more information. + + Examples: + + ```py + from diffusers import StableDiffusionControlNetPipeline, ControlNetModel + + url = "https://huggingface.co/lllyasviel/ControlNet-v1-1/blob/main/control_v11p_sd15_canny.pth" # can also be a local path + model = ControlNetModel.from_single_file(url) + + url = "https://huggingface.co/runwayml/stable-diffusion-v1-5/blob/main/v1-5-pruned.safetensors" # can also be a local path + pipe = StableDiffusionControlNetPipeline.from_single_file(url, controlnet=controlnet) + ``` + """ + # import here to avoid circular dependency + from ..pipelines.stable_diffusion.convert_from_ckpt import download_controlnet_from_original_ckpt + + config_file = kwargs.pop("config_file", None) + cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE) + resume_download = kwargs.pop("resume_download", False) + force_download = kwargs.pop("force_download", False) + proxies = kwargs.pop("proxies", None) + local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE) + use_auth_token = kwargs.pop("use_auth_token", None) + num_in_channels = kwargs.pop("num_in_channels", None) + use_linear_projection = kwargs.pop("use_linear_projection", None) + revision = kwargs.pop("revision", None) + extract_ema = kwargs.pop("extract_ema", False) + image_size = kwargs.pop("image_size", None) + upcast_attention = kwargs.pop("upcast_attention", None) + + torch_dtype = kwargs.pop("torch_dtype", None) + + use_safetensors = kwargs.pop("use_safetensors", None) + + file_extension = pretrained_model_link_or_path.rsplit(".", 1)[-1] + from_safetensors = file_extension == "safetensors" + + if from_safetensors and use_safetensors is False: + raise ValueError("Make sure to install `safetensors` with `pip install safetensors`.") + + # remove huggingface url + for prefix in ["https://huggingface.co/", "huggingface.co/", "hf.co/", "https://hf.co/"]: + if pretrained_model_link_or_path.startswith(prefix): + pretrained_model_link_or_path = pretrained_model_link_or_path[len(prefix) :] + + # Code based on diffusers.pipelines.pipeline_utils.DiffusionPipeline.from_pretrained + ckpt_path = Path(pretrained_model_link_or_path) + if not ckpt_path.is_file(): + # get repo_id and (potentially nested) file path of ckpt in repo + repo_id = "/".join(ckpt_path.parts[:2]) + file_path = "/".join(ckpt_path.parts[2:]) + + if file_path.startswith("blob/"): + file_path = file_path[len("blob/") :] + + if file_path.startswith("main/"): + file_path = file_path[len("main/") :] + + pretrained_model_link_or_path = hf_hub_download( + repo_id, + filename=file_path, + cache_dir=cache_dir, + resume_download=resume_download, + proxies=proxies, + local_files_only=local_files_only, + use_auth_token=use_auth_token, + revision=revision, + force_download=force_download, + ) + + if config_file is None: + config_url = "https://raw.githubusercontent.com/lllyasviel/ControlNet/main/models/cldm_v15.yaml" + config_file = BytesIO(requests.get(config_url).content) + + image_size = image_size or 512 + + controlnet = download_controlnet_from_original_ckpt( + pretrained_model_link_or_path, + original_config_file=config_file, + image_size=image_size, + extract_ema=extract_ema, + num_in_channels=num_in_channels, + upcast_attention=upcast_attention, + 
from_safetensors=from_safetensors, + use_linear_projection=use_linear_projection, + ) + + if torch_dtype is not None: + controlnet.to(dtype=torch_dtype) + + return controlnet diff --git a/src/diffusers/loaders/textual_inversion.py b/src/diffusers/loaders/textual_inversion.py new file mode 100644 index 000000000000..4890810d49a6 --- /dev/null +++ b/src/diffusers/loaders/textual_inversion.py @@ -0,0 +1,447 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import Dict, List, Optional, Union + +import safetensors +import torch +from torch import nn + +from ..utils import ( + DIFFUSERS_CACHE, + HF_HUB_OFFLINE, + _get_model_file, + is_accelerate_available, + is_transformers_available, + logging, +) + + +if is_transformers_available(): + from transformers import PreTrainedModel, PreTrainedTokenizer + +if is_accelerate_available(): + from accelerate.hooks import AlignDevicesHook, CpuOffload, remove_hook_from_module + +logger = logging.get_logger(__name__) + +TEXT_INVERSION_NAME = "learned_embeds.bin" +TEXT_INVERSION_NAME_SAFE = "learned_embeds.safetensors" + + +def load_textual_inversion_state_dicts(pretrained_model_name_or_paths, **kwargs): + cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE) + force_download = kwargs.pop("force_download", False) + resume_download = kwargs.pop("resume_download", False) + proxies = kwargs.pop("proxies", None) + local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE) + use_auth_token = kwargs.pop("use_auth_token", None) + revision = kwargs.pop("revision", None) + subfolder = kwargs.pop("subfolder", None) + weight_name = kwargs.pop("weight_name", None) + use_safetensors = kwargs.pop("use_safetensors", None) + + allow_pickle = False + if use_safetensors is None: + use_safetensors = True + allow_pickle = True + + user_agent = { + "file_type": "text_inversion", + "framework": "pytorch", + } + state_dicts = [] + for pretrained_model_name_or_path in pretrained_model_name_or_paths: + if not isinstance(pretrained_model_name_or_path, (dict, torch.Tensor)): + # 3.1. 
Load textual inversion file + model_file = None + + # Let's first try to load .safetensors weights + if (use_safetensors and weight_name is None) or ( + weight_name is not None and weight_name.endswith(".safetensors") + ): + try: + model_file = _get_model_file( + pretrained_model_name_or_path, + weights_name=weight_name or TEXT_INVERSION_NAME_SAFE, + cache_dir=cache_dir, + force_download=force_download, + resume_download=resume_download, + proxies=proxies, + local_files_only=local_files_only, + use_auth_token=use_auth_token, + revision=revision, + subfolder=subfolder, + user_agent=user_agent, + ) + state_dict = safetensors.torch.load_file(model_file, device="cpu") + except Exception as e: + if not allow_pickle: + raise e + + model_file = None + + if model_file is None: + model_file = _get_model_file( + pretrained_model_name_or_path, + weights_name=weight_name or TEXT_INVERSION_NAME, + cache_dir=cache_dir, + force_download=force_download, + resume_download=resume_download, + proxies=proxies, + local_files_only=local_files_only, + use_auth_token=use_auth_token, + revision=revision, + subfolder=subfolder, + user_agent=user_agent, + ) + state_dict = torch.load(model_file, map_location="cpu") + else: + state_dict = pretrained_model_name_or_path + + state_dicts.append(state_dict) + + return state_dicts + + +class TextualInversionLoaderMixin: + r""" + Load textual inversion tokens and embeddings to the tokenizer and text encoder. + """ + + def maybe_convert_prompt(self, prompt: Union[str, List[str]], tokenizer: "PreTrainedTokenizer"): # noqa: F821 + r""" + Processes prompts that include a special token corresponding to a multi-vector textual inversion embedding to + be replaced with multiple special tokens each corresponding to one of the vectors. If the prompt has no textual + inversion token or if the textual inversion token is a single vector, the input prompt is returned. + + Parameters: + prompt (`str` or list of `str`): + The prompt or prompts to guide the image generation. + tokenizer (`PreTrainedTokenizer`): + The tokenizer responsible for encoding the prompt into input tokens. + + Returns: + `str` or list of `str`: The converted prompt + """ + if not isinstance(prompt, List): + prompts = [prompt] + else: + prompts = prompt + + prompts = [self._maybe_convert_prompt(p, tokenizer) for p in prompts] + + if not isinstance(prompt, List): + return prompts[0] + + return prompts + + def _maybe_convert_prompt(self, prompt: str, tokenizer: "PreTrainedTokenizer"): # noqa: F821 + r""" + Maybe convert a prompt into a "multi vector"-compatible prompt. If the prompt includes a token that corresponds + to a multi-vector textual inversion embedding, this function will process the prompt so that the special token + is replaced with multiple special tokens each corresponding to one of the vectors. If the prompt has no textual + inversion token or a textual inversion token that is a single vector, the input prompt is simply returned. + + Parameters: + prompt (`str`): + The prompt to guide the image generation. + tokenizer (`PreTrainedTokenizer`): + The tokenizer responsible for encoding the prompt into input tokens. 
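+
+        For example (with an illustrative token name), if `"<cat-toy>"` was loaded as a two-vector embedding, the
+        prompt `"a <cat-toy> on a table"` is expanded to `"a <cat-toy> <cat-toy>_1 on a table"`.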
+ + Returns: + `str`: The converted prompt + """ + tokens = tokenizer.tokenize(prompt) + unique_tokens = set(tokens) + for token in unique_tokens: + if token in tokenizer.added_tokens_encoder: + replacement = token + i = 1 + while f"{token}_{i}" in tokenizer.added_tokens_encoder: + replacement += f" {token}_{i}" + i += 1 + + prompt = prompt.replace(token, replacement) + + return prompt + + def _check_text_inv_inputs(self, tokenizer, text_encoder, pretrained_model_name_or_paths, tokens): + if tokenizer is None: + raise ValueError( + f"{self.__class__.__name__} requires `self.tokenizer` or passing a `tokenizer` of type `PreTrainedTokenizer` for calling" + f" `{self.load_textual_inversion.__name__}`" + ) + + if text_encoder is None: + raise ValueError( + f"{self.__class__.__name__} requires `self.text_encoder` or passing a `text_encoder` of type `PreTrainedModel` for calling" + f" `{self.load_textual_inversion.__name__}`" + ) + + if len(pretrained_model_name_or_paths) != len(tokens): + raise ValueError( + f"You have passed a list of models of length {len(pretrained_model_name_or_paths)}, and list of tokens of length {len(tokens)} " + f"Make sure both lists have the same length." + ) + + valid_tokens = [t for t in tokens if t is not None] + if len(set(valid_tokens)) < len(valid_tokens): + raise ValueError(f"You have passed a list of tokens that contains duplicates: {tokens}") + + @staticmethod + def _retrieve_tokens_and_embeddings(tokens, state_dicts, tokenizer): + all_tokens = [] + all_embeddings = [] + for state_dict, token in zip(state_dicts, tokens): + if isinstance(state_dict, torch.Tensor): + if token is None: + raise ValueError( + "You are trying to load a textual inversion embedding that has been saved as a PyTorch tensor. Make sure to pass the name of the corresponding token in this case: `token=...`." + ) + loaded_token = token + embedding = state_dict + elif len(state_dict) == 1: + # diffusers + loaded_token, embedding = next(iter(state_dict.items())) + elif "string_to_param" in state_dict: + # A1111 + loaded_token = state_dict["name"] + embedding = state_dict["string_to_param"]["*"] + else: + raise ValueError( + f"Loaded state dictonary is incorrect: {state_dict}. \n\n" + "Please verify that the loaded state dictionary of the textual embedding either only has a single key or includes the `string_to_param`" + " input key." + ) + + if token is not None and loaded_token != token: + logger.info(f"The loaded token: {loaded_token} is overwritten by the passed token {token}.") + else: + token = loaded_token + + if token in tokenizer.get_vocab(): + raise ValueError( + f"Token {token} already in tokenizer vocabulary. Please choose a different token name or remove {token} and embedding from the tokenizer and text encoder." + ) + + all_tokens.append(token) + all_embeddings.append(embedding) + + return all_tokens, all_embeddings + + @staticmethod + def _extend_tokens_and_embeddings(tokens, embeddings, tokenizer): + all_tokens = [] + all_embeddings = [] + + for embedding, token in zip(embeddings, tokens): + if f"{token}_1" in tokenizer.get_vocab(): + multi_vector_tokens = [token] + i = 1 + while f"{token}_{i}" in tokenizer.added_tokens_encoder: + multi_vector_tokens.append(f"{token}_{i}") + i += 1 + + raise ValueError( + f"Multi-vector Token {multi_vector_tokens} already in tokenizer vocabulary. Please choose a different token name or remove the {multi_vector_tokens} and embedding from the tokenizer and text encoder." 
+ ) + + is_multi_vector = len(embedding.shape) > 1 and embedding.shape[0] > 1 + if is_multi_vector: + all_tokens += [token] + [f"{token}_{i}" for i in range(1, embedding.shape[0])] + all_embeddings += [e for e in embedding] # noqa: C416 + else: + all_tokens += [token] + all_embeddings += [embedding[0]] if len(embedding.shape) > 1 else [embedding] + + return all_tokens, all_embeddings + + def load_textual_inversion( + self, + pretrained_model_name_or_path: Union[str, List[str], Dict[str, torch.Tensor], List[Dict[str, torch.Tensor]]], + token: Optional[Union[str, List[str]]] = None, + tokenizer: Optional["PreTrainedTokenizer"] = None, # noqa: F821 + text_encoder: Optional["PreTrainedModel"] = None, # noqa: F821 + **kwargs, + ): + r""" + Load textual inversion embeddings into the text encoder of [`StableDiffusionPipeline`] (both πŸ€— Diffusers and + Automatic1111 formats are supported). + + Parameters: + pretrained_model_name_or_path (`str` or `os.PathLike` or `List[str or os.PathLike]` or `Dict` or `List[Dict]`): + Can be either one of the following or a list of them: + + - A string, the *model id* (for example `sd-concepts-library/low-poly-hd-logos-icons`) of a + pretrained model hosted on the Hub. + - A path to a *directory* (for example `./my_text_inversion_directory/`) containing the textual + inversion weights. + - A path to a *file* (for example `./my_text_inversions.pt`) containing textual inversion weights. + - A [torch state + dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict). + + token (`str` or `List[str]`, *optional*): + Override the token to use for the textual inversion weights. If `pretrained_model_name_or_path` is a + list, then `token` must also be a list of equal length. + text_encoder ([`~transformers.CLIPTextModel`], *optional*): + Frozen text-encoder ([clip-vit-large-patch14](https://huggingface.co/openai/clip-vit-large-patch14)). + If not specified, function will take self.tokenizer. + tokenizer ([`~transformers.CLIPTokenizer`], *optional*): + A `CLIPTokenizer` to tokenize text. If not specified, function will take self.tokenizer. + weight_name (`str`, *optional*): + Name of a custom weight file. This should be used when: + + - The saved textual inversion file is in πŸ€— Diffusers format, but was saved under a specific weight + name such as `text_inv.bin`. + - The saved textual inversion file is in the Automatic1111 format. + cache_dir (`Union[str, os.PathLike]`, *optional*): + Path to a directory where a downloaded pretrained model configuration is cached if the standard cache + is not used. + force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force the (re-)download of the model weights and configuration files, overriding the + cached versions if they exist. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to resume downloading the model weights and configuration files. If set to `False`, any + incompletely downloaded files are deleted. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. + local_files_only (`bool`, *optional*, defaults to `False`): + Whether to only load local model weights and configuration files or not. If set to `True`, the model + won't be downloaded from the Hub. + use_auth_token (`str` or *bool*, *optional*): + The token to use as HTTP bearer authorization for remote files. 
If `True`, the token generated from + `diffusers-cli login` (stored in `~/.huggingface`) is used. + revision (`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier + allowed by Git. + subfolder (`str`, *optional*, defaults to `""`): + The subfolder location of a model file within a larger model repository on the Hub or locally. + mirror (`str`, *optional*): + Mirror source to resolve accessibility issues if you're downloading a model in China. We do not + guarantee the timeliness or safety of the source, and you should refer to the mirror site for more + information. + + Example: + + To load a textual inversion embedding vector in πŸ€— Diffusers format: + + ```py + from diffusers import StableDiffusionPipeline + import torch + + model_id = "runwayml/stable-diffusion-v1-5" + pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") + + pipe.load_textual_inversion("sd-concepts-library/cat-toy") + + prompt = "A backpack" + + image = pipe(prompt, num_inference_steps=50).images[0] + image.save("cat-backpack.png") + ``` + + To load a textual inversion embedding vector in Automatic1111 format, make sure to download the vector first + (for example from [civitAI](https://civitai.com/models/3036?modelVersionId=9857)) and then load the vector + locally: + + ```py + from diffusers import StableDiffusionPipeline + import torch + + model_id = "runwayml/stable-diffusion-v1-5" + pipe = StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda") + + pipe.load_textual_inversion("./charturnerv2.pt", token="charturnerv2") + + prompt = "charturnerv2, multiple views of the same character in the same outfit, a character turnaround of a woman wearing a black jacket and red shirt, best quality, intricate details." + + image = pipe(prompt, num_inference_steps=50).images[0] + image.save("character.png") + ``` + + """ + # 1. Set correct tokenizer and text encoder + tokenizer = tokenizer or getattr(self, "tokenizer", None) + text_encoder = text_encoder or getattr(self, "text_encoder", None) + + # 2. Normalize inputs + pretrained_model_name_or_paths = ( + [pretrained_model_name_or_path] + if not isinstance(pretrained_model_name_or_path, list) + else pretrained_model_name_or_path + ) + tokens = len(pretrained_model_name_or_paths) * [token] if (isinstance(token, str) or token is None) else token + + # 3. Check inputs + self._check_text_inv_inputs(tokenizer, text_encoder, pretrained_model_name_or_paths, tokens) + + # 4. Load state dicts of textual embeddings + state_dicts = load_textual_inversion_state_dicts(pretrained_model_name_or_paths, **kwargs) + + # 4. Retrieve tokens and embeddings + tokens, embeddings = self._retrieve_tokens_and_embeddings(tokens, state_dicts, tokenizer) + + # 5. Extend tokens and embeddings for multi vector + tokens, embeddings = self._extend_tokens_and_embeddings(tokens, embeddings, tokenizer) + + # 6. Make sure all embeddings have the correct size + expected_emb_dim = text_encoder.get_input_embeddings().weight.shape[-1] + if any(expected_emb_dim != emb.shape[-1] for emb in embeddings): + raise ValueError( + "Loaded embeddings are of incorrect shape. Expected each textual inversion embedding " + "to be of shape {input_embeddings.shape[-1]}, but are {embeddings.shape[-1]} " + ) + + # 7. 
Now we can be sure that loading the embedding matrix works + # < Unsafe code: + + # 7.1 Offload all hooks in case the pipeline was cpu offloaded before make sure, we offload and onload again + is_model_cpu_offload = False + is_sequential_cpu_offload = False + for _, component in self.components.items(): + if isinstance(component, nn.Module): + if hasattr(component, "_hf_hook"): + is_model_cpu_offload = isinstance(getattr(component, "_hf_hook"), CpuOffload) + is_sequential_cpu_offload = isinstance(getattr(component, "_hf_hook"), AlignDevicesHook) + logger.info( + "Accelerate hooks detected. Since you have called `load_textual_inversion()`, the previous hooks will be first removed. Then the textual inversion parameters will be loaded and the hooks will be applied again." + ) + remove_hook_from_module(component, recurse=is_sequential_cpu_offload) + + # 7.2 save expected device and dtype + device = text_encoder.device + dtype = text_encoder.dtype + + # 7.3 Increase token embedding matrix + text_encoder.resize_token_embeddings(len(tokenizer) + len(tokens)) + input_embeddings = text_encoder.get_input_embeddings().weight + + # 7.4 Load token and embedding + for token, embedding in zip(tokens, embeddings): + # add tokens and get ids + tokenizer.add_tokens(token) + token_id = tokenizer.convert_tokens_to_ids(token) + input_embeddings.data[token_id] = embedding + logger.info(f"Loaded textual inversion embedding for {token}.") + + input_embeddings.to(dtype=dtype, device=device) + + # 7.5 Offload the model again + if is_model_cpu_offload: + self.enable_model_cpu_offload() + elif is_sequential_cpu_offload: + self.enable_sequential_cpu_offload() + + # / Unsafe Code > diff --git a/src/diffusers/loaders/unet.py b/src/diffusers/loaders/unet.py new file mode 100644 index 000000000000..3f63e73d9cec --- /dev/null +++ b/src/diffusers/loaders/unet.py @@ -0,0 +1,572 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
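
The input normalization above (`pretrained_model_name_or_paths`, `tokens`) is what lets `load_textual_inversion` accept lists as well as single values, with the accelerate hooks removed and restored around the embedding update. A minimal usage sketch; the `token` overrides below are placeholder names, and the two concept repositories are only examples:

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Both lists must have the same length; each token overrides the name stored in the file.
pipe.load_textual_inversion(
    ["sd-concepts-library/cat-toy", "sd-concepts-library/low-poly-hd-logos-icons"],
    token=["<cat-toy>", "<low-poly>"],
)

image = pipe("A <cat-toy> next to a <low-poly> logo", num_inference_steps=50).images[0]
```
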
+import os +from collections import defaultdict +from contextlib import nullcontext +from typing import Callable, Dict, List, Optional, Union + +import safetensors +import torch +from torch import nn + +from ..models.modeling_utils import _LOW_CPU_MEM_USAGE_DEFAULT, load_model_dict_into_meta +from ..utils import ( + DIFFUSERS_CACHE, + HF_HUB_OFFLINE, + USE_PEFT_BACKEND, + _get_model_file, + delete_adapter_layers, + is_accelerate_available, + logging, + set_adapter_layers, + set_weights_and_activate_adapters, +) +from .utils import AttnProcsLayers + + +if is_accelerate_available(): + from accelerate import init_empty_weights + from accelerate.hooks import AlignDevicesHook, CpuOffload, remove_hook_from_module + +logger = logging.get_logger(__name__) + + +TEXT_ENCODER_NAME = "text_encoder" +UNET_NAME = "unet" + +LORA_WEIGHT_NAME = "pytorch_lora_weights.bin" +LORA_WEIGHT_NAME_SAFE = "pytorch_lora_weights.safetensors" + +CUSTOM_DIFFUSION_WEIGHT_NAME = "pytorch_custom_diffusion_weights.bin" +CUSTOM_DIFFUSION_WEIGHT_NAME_SAFE = "pytorch_custom_diffusion_weights.safetensors" + + +class UNet2DConditionLoadersMixin: + text_encoder_name = TEXT_ENCODER_NAME + unet_name = UNET_NAME + + def load_attn_procs(self, pretrained_model_name_or_path_or_dict: Union[str, Dict[str, torch.Tensor]], **kwargs): + r""" + Load pretrained attention processor layers into [`UNet2DConditionModel`]. Attention processor layers have to be + defined in + [`attention_processor.py`](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/attention_processor.py) + and be a `torch.nn.Module` class. + + Parameters: + pretrained_model_name_or_path_or_dict (`str` or `os.PathLike` or `dict`): + Can be either: + + - A string, the model id (for example `google/ddpm-celebahq-256`) of a pretrained model hosted on + the Hub. + - A path to a directory (for example `./my_model_directory`) containing the model weights saved + with [`ModelMixin.save_pretrained`]. + - A [torch state + dict](https://pytorch.org/tutorials/beginner/saving_loading_models.html#what-is-a-state-dict). + + cache_dir (`Union[str, os.PathLike]`, *optional*): + Path to a directory where a downloaded pretrained model configuration is cached if the standard cache + is not used. + force_download (`bool`, *optional*, defaults to `False`): + Whether or not to force the (re-)download of the model weights and configuration files, overriding the + cached versions if they exist. + resume_download (`bool`, *optional*, defaults to `False`): + Whether or not to resume downloading the model weights and configuration files. If set to `False`, any + incompletely downloaded files are deleted. + proxies (`Dict[str, str]`, *optional*): + A dictionary of proxy servers to use by protocol or endpoint, for example, `{'http': 'foo.bar:3128', + 'http://hostname': 'foo.bar:4012'}`. The proxies are used on each request. + local_files_only (`bool`, *optional*, defaults to `False`): + Whether to only load local model weights and configuration files or not. If set to `True`, the model + won't be downloaded from the Hub. + use_auth_token (`str` or *bool*, *optional*): + The token to use as HTTP bearer authorization for remote files. If `True`, the token generated from + `diffusers-cli login` (stored in `~/.huggingface`) is used. + low_cpu_mem_usage (`bool`, *optional*, defaults to `True` if torch version >= 1.9.0 else `False`): + Speed up model loading only loading the pretrained weights and not initializing the weights. 
This also + tries to not use more than 1x model size in CPU memory (including peak memory) while loading the model. + Only supported for PyTorch >= 1.9.0. If you are using an older version of PyTorch, setting this + argument to `True` will raise an error. + revision (`str`, *optional*, defaults to `"main"`): + The specific model version to use. It can be a branch name, a tag name, a commit id, or any identifier + allowed by Git. + subfolder (`str`, *optional*, defaults to `""`): + The subfolder location of a model file within a larger model repository on the Hub or locally. + mirror (`str`, *optional*): + Mirror source to resolve accessibility issues if you’re downloading a model in China. We do not + guarantee the timeliness or safety of the source, and you should refer to the mirror site for more + information. + + """ + from ..models.attention_processor import CustomDiffusionAttnProcessor + from ..models.lora import LoRACompatibleConv, LoRACompatibleLinear, LoRAConv2dLayer, LoRALinearLayer + + cache_dir = kwargs.pop("cache_dir", DIFFUSERS_CACHE) + force_download = kwargs.pop("force_download", False) + resume_download = kwargs.pop("resume_download", False) + proxies = kwargs.pop("proxies", None) + local_files_only = kwargs.pop("local_files_only", HF_HUB_OFFLINE) + use_auth_token = kwargs.pop("use_auth_token", None) + revision = kwargs.pop("revision", None) + subfolder = kwargs.pop("subfolder", None) + weight_name = kwargs.pop("weight_name", None) + use_safetensors = kwargs.pop("use_safetensors", None) + low_cpu_mem_usage = kwargs.pop("low_cpu_mem_usage", _LOW_CPU_MEM_USAGE_DEFAULT) + # This value has the same meaning as the `--network_alpha` option in the kohya-ss trainer script. + # See https://github.com/darkstorm2150/sd-scripts/blob/main/docs/train_network_README-en.md#execute-learning + network_alphas = kwargs.pop("network_alphas", None) + + _pipeline = kwargs.pop("_pipeline", None) + + is_network_alphas_none = network_alphas is None + + allow_pickle = False + + if use_safetensors is None: + use_safetensors = True + allow_pickle = True + + user_agent = { + "file_type": "attn_procs_weights", + "framework": "pytorch", + } + + if low_cpu_mem_usage and not is_accelerate_available(): + low_cpu_mem_usage = False + logger.warning( + "Cannot initialize model with low cpu memory usage because `accelerate` was not found in the" + " environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install" + " `accelerate` for faster and less memory-intense model loading. You can do so with: \n```\npip" + " install accelerate\n```\n." 
+ ) + + model_file = None + if not isinstance(pretrained_model_name_or_path_or_dict, dict): + # Let's first try to load .safetensors weights + if (use_safetensors and weight_name is None) or ( + weight_name is not None and weight_name.endswith(".safetensors") + ): + try: + model_file = _get_model_file( + pretrained_model_name_or_path_or_dict, + weights_name=weight_name or LORA_WEIGHT_NAME_SAFE, + cache_dir=cache_dir, + force_download=force_download, + resume_download=resume_download, + proxies=proxies, + local_files_only=local_files_only, + use_auth_token=use_auth_token, + revision=revision, + subfolder=subfolder, + user_agent=user_agent, + ) + state_dict = safetensors.torch.load_file(model_file, device="cpu") + except IOError as e: + if not allow_pickle: + raise e + # try loading non-safetensors weights + pass + if model_file is None: + model_file = _get_model_file( + pretrained_model_name_or_path_or_dict, + weights_name=weight_name or LORA_WEIGHT_NAME, + cache_dir=cache_dir, + force_download=force_download, + resume_download=resume_download, + proxies=proxies, + local_files_only=local_files_only, + use_auth_token=use_auth_token, + revision=revision, + subfolder=subfolder, + user_agent=user_agent, + ) + state_dict = torch.load(model_file, map_location="cpu") + else: + state_dict = pretrained_model_name_or_path_or_dict + + # fill attn processors + lora_layers_list = [] + + is_lora = all(("lora" in k or k.endswith(".alpha")) for k in state_dict.keys()) and not USE_PEFT_BACKEND + is_custom_diffusion = any("custom_diffusion" in k for k in state_dict.keys()) + + if is_lora: + # correct keys + state_dict, network_alphas = self.convert_state_dict_legacy_attn_format(state_dict, network_alphas) + + if network_alphas is not None: + network_alphas_keys = list(network_alphas.keys()) + used_network_alphas_keys = set() + + lora_grouped_dict = defaultdict(dict) + mapped_network_alphas = {} + + all_keys = list(state_dict.keys()) + for key in all_keys: + value = state_dict.pop(key) + attn_processor_key, sub_key = ".".join(key.split(".")[:-3]), ".".join(key.split(".")[-3:]) + lora_grouped_dict[attn_processor_key][sub_key] = value + + # Create another `mapped_network_alphas` dictionary so that we can properly map them. + if network_alphas is not None: + for k in network_alphas_keys: + if k.replace(".alpha", "") in key: + mapped_network_alphas.update({attn_processor_key: network_alphas.get(k)}) + used_network_alphas_keys.add(k) + + if not is_network_alphas_none: + if len(set(network_alphas_keys) - used_network_alphas_keys) > 0: + raise ValueError( + f"The `network_alphas` has to be empty at this point but has the following keys \n\n {', '.join(network_alphas.keys())}" + ) + + if len(state_dict) > 0: + raise ValueError( + f"The `state_dict` has to be empty at this point but has the following keys \n\n {', '.join(state_dict.keys())}" + ) + + for key, value_dict in lora_grouped_dict.items(): + attn_processor = self + for sub_key in key.split("."): + attn_processor = getattr(attn_processor, sub_key) + + # Process non-attention layers, which don't have to_{k,v,q,out_proj}_lora layers + # or add_{k,v,q,out_proj}_proj_lora layers. 
+ rank = value_dict["lora.down.weight"].shape[0] + + if isinstance(attn_processor, LoRACompatibleConv): + in_features = attn_processor.in_channels + out_features = attn_processor.out_channels + kernel_size = attn_processor.kernel_size + + ctx = init_empty_weights if low_cpu_mem_usage else nullcontext + with ctx(): + lora = LoRAConv2dLayer( + in_features=in_features, + out_features=out_features, + rank=rank, + kernel_size=kernel_size, + stride=attn_processor.stride, + padding=attn_processor.padding, + network_alpha=mapped_network_alphas.get(key), + ) + elif isinstance(attn_processor, LoRACompatibleLinear): + ctx = init_empty_weights if low_cpu_mem_usage else nullcontext + with ctx(): + lora = LoRALinearLayer( + attn_processor.in_features, + attn_processor.out_features, + rank, + mapped_network_alphas.get(key), + ) + else: + raise ValueError(f"Module {key} is not a LoRACompatibleConv or LoRACompatibleLinear module.") + + value_dict = {k.replace("lora.", ""): v for k, v in value_dict.items()} + lora_layers_list.append((attn_processor, lora)) + + if low_cpu_mem_usage: + device = next(iter(value_dict.values())).device + dtype = next(iter(value_dict.values())).dtype + load_model_dict_into_meta(lora, value_dict, device=device, dtype=dtype) + else: + lora.load_state_dict(value_dict) + + elif is_custom_diffusion: + attn_processors = {} + custom_diffusion_grouped_dict = defaultdict(dict) + for key, value in state_dict.items(): + if len(value) == 0: + custom_diffusion_grouped_dict[key] = {} + else: + if "to_out" in key: + attn_processor_key, sub_key = ".".join(key.split(".")[:-3]), ".".join(key.split(".")[-3:]) + else: + attn_processor_key, sub_key = ".".join(key.split(".")[:-2]), ".".join(key.split(".")[-2:]) + custom_diffusion_grouped_dict[attn_processor_key][sub_key] = value + + for key, value_dict in custom_diffusion_grouped_dict.items(): + if len(value_dict) == 0: + attn_processors[key] = CustomDiffusionAttnProcessor( + train_kv=False, train_q_out=False, hidden_size=None, cross_attention_dim=None + ) + else: + cross_attention_dim = value_dict["to_k_custom_diffusion.weight"].shape[1] + hidden_size = value_dict["to_k_custom_diffusion.weight"].shape[0] + train_q_out = True if "to_q_custom_diffusion.weight" in value_dict else False + attn_processors[key] = CustomDiffusionAttnProcessor( + train_kv=True, + train_q_out=train_q_out, + hidden_size=hidden_size, + cross_attention_dim=cross_attention_dim, + ) + attn_processors[key].load_state_dict(value_dict) + elif USE_PEFT_BACKEND: + # In that case we have nothing to do as loading the adapter weights is already handled above by `set_peft_model_state_dict` + # on the Unet + pass + else: + raise ValueError( + f"{model_file} does not seem to be in the correct format expected by LoRA or Custom Diffusion training." + ) + + # + + def convert_state_dict_legacy_attn_format(self, state_dict, network_alphas): + is_new_lora_format = all( + key.startswith(self.unet_name) or key.startswith(self.text_encoder_name) for key in state_dict.keys() + ) + if is_new_lora_format: + # Strip the `"unet"` prefix. + is_text_encoder_present = any(key.startswith(self.text_encoder_name) for key in state_dict.keys()) + if is_text_encoder_present: + warn_message = "The state_dict contains LoRA params corresponding to the text encoder which are not being used here. To use both UNet and text encoder related LoRA params, use [`pipe.load_lora_weights()`](https://huggingface.co/docs/diffusers/main/en/api/loaders#diffusers.loaders.LoraLoaderMixin.load_lora_weights)." 
+ logger.warn(warn_message) + unet_keys = [k for k in state_dict.keys() if k.startswith(self.unet_name)] + state_dict = {k.replace(f"{self.unet_name}.", ""): v for k, v in state_dict.items() if k in unet_keys} + + # change processor format to 'pure' LoRACompatibleLinear format + if any("processor" in k.split(".") for k in state_dict.keys()): + + def format_to_lora_compatible(key): + if "processor" not in key.split("."): + return key + return key.replace(".processor", "").replace("to_out_lora", "to_out.0.lora").replace("_lora", ".lora") + + state_dict = {format_to_lora_compatible(k): v for k, v in state_dict.items()} + + if network_alphas is not None: + network_alphas = {format_to_lora_compatible(k): v for k, v in network_alphas.items()} + return state_dict, network_alphas + + def save_attn_procs( + self, + save_directory: Union[str, os.PathLike], + is_main_process: bool = True, + weight_name: str = None, + save_function: Callable = None, + safe_serialization: bool = True, + **kwargs, + ): + r""" + Save an attention processor to a directory so that it can be reloaded using the + [`~loaders.UNet2DConditionLoadersMixin.load_attn_procs`] method. + + Arguments: + save_directory (`str` or `os.PathLike`): + Directory to save an attention processor to. Will be created if it doesn't exist. + is_main_process (`bool`, *optional*, defaults to `True`): + Whether the process calling this is the main process or not. Useful during distributed training and you + need to call this function on all processes. In this case, set `is_main_process=True` only on the main + process to avoid race conditions. + save_function (`Callable`): + The function to use to save the state dictionary. Useful during distributed training when you need to + replace `torch.save` with another method. Can be configured with the environment variable + `DIFFUSERS_SAVE_MODE`. + safe_serialization (`bool`, *optional*, defaults to `True`): + Whether to save the model using `safetensors` or the traditional PyTorch way with `pickle`. 
+ """ + from ..models.attention_processor import ( + CustomDiffusionAttnProcessor, + CustomDiffusionAttnProcessor2_0, + CustomDiffusionXFormersAttnProcessor, + ) + + if os.path.isfile(save_directory): + logger.error(f"Provided path ({save_directory}) should be a directory, not a file") + return + + if save_function is None: + if safe_serialization: + + def save_function(weights, filename): + return safetensors.torch.save_file(weights, filename, metadata={"format": "pt"}) + + else: + save_function = torch.save + + os.makedirs(save_directory, exist_ok=True) + + is_custom_diffusion = any( + isinstance( + x, + (CustomDiffusionAttnProcessor, CustomDiffusionAttnProcessor2_0, CustomDiffusionXFormersAttnProcessor), + ) + for (_, x) in self.attn_processors.items() + ) + if is_custom_diffusion: + model_to_save = AttnProcsLayers( + { + y: x + for (y, x) in self.attn_processors.items() + if isinstance( + x, + ( + CustomDiffusionAttnProcessor, + CustomDiffusionAttnProcessor2_0, + CustomDiffusionXFormersAttnProcessor, + ), + ) + } + ) + state_dict = model_to_save.state_dict() + for name, attn in self.attn_processors.items(): + if len(attn.state_dict()) == 0: + state_dict[name] = {} + else: + model_to_save = AttnProcsLayers(self.attn_processors) + state_dict = model_to_save.state_dict() + + if weight_name is None: + if safe_serialization: + weight_name = CUSTOM_DIFFUSION_WEIGHT_NAME_SAFE if is_custom_diffusion else LORA_WEIGHT_NAME_SAFE + else: + weight_name = CUSTOM_DIFFUSION_WEIGHT_NAME if is_custom_diffusion else LORA_WEIGHT_NAME + + # Save the model + save_function(state_dict, os.path.join(save_directory, weight_name)) + logger.info(f"Model weights saved in {os.path.join(save_directory, weight_name)}") + + def fuse_lora(self, lora_scale=1.0, safe_fusing=False): + self.lora_scale = lora_scale + self._safe_fusing = safe_fusing + self.apply(self._fuse_lora_apply) + + def _fuse_lora_apply(self, module): + if not USE_PEFT_BACKEND: + if hasattr(module, "_fuse_lora"): + module._fuse_lora(self.lora_scale, self._safe_fusing) + else: + from peft.tuners.tuners_utils import BaseTunerLayer + + if isinstance(module, BaseTunerLayer): + if self.lora_scale != 1.0: + module.scale_layer(self.lora_scale) + module.merge(safe_merge=self._safe_fusing) + + def unfuse_lora(self): + self.apply(self._unfuse_lora_apply) + + def _unfuse_lora_apply(self, module): + if not USE_PEFT_BACKEND: + if hasattr(module, "_unfuse_lora"): + module._unfuse_lora() + else: + from peft.tuners.tuners_utils import BaseTunerLayer + + if isinstance(module, BaseTunerLayer): + module.unmerge() + + def set_adapters( + self, + adapter_names: Union[List[str], str], + weights: Optional[Union[List[float], float]] = None, + ): + """ + Sets the adapter layers for the unet. + + Args: + adapter_names (`List[str]` or `str`): + The names of the adapters to use. + weights (`Union[List[float], float]`, *optional*): + The adapter(s) weights to use with the UNet. If `None`, the weights are set to `1.0` for all the + adapters. + """ + if not USE_PEFT_BACKEND: + raise ValueError("PEFT backend is required for `set_adapters()`.") + + adapter_names = [adapter_names] if isinstance(adapter_names, str) else adapter_names + + if weights is None: + weights = [1.0] * len(adapter_names) + elif isinstance(weights, float): + weights = [weights] * len(adapter_names) + + if len(adapter_names) != len(weights): + raise ValueError( + f"Length of adapter names {len(adapter_names)} is not equal to the length of their weights {len(weights)}." 
+ ) + + set_weights_and_activate_adapters(self, adapter_names, weights) + + def disable_lora(self): + """ + Disables the active LoRA layers for the unet. + """ + if not USE_PEFT_BACKEND: + raise ValueError("PEFT backend is required for this method.") + set_adapter_layers(self, enabled=False) + + def enable_lora(self): + """ + Enables the active LoRA layers for the unet. + """ + if not USE_PEFT_BACKEND: + raise ValueError("PEFT backend is required for this method.") + set_adapter_layers(self, enabled=True) + + def delete_adapters(self, adapter_names: Union[List[str], str]): + """ + Args: + Deletes the LoRA layers of `adapter_name` for the unet. + adapter_names (`Union[List[str], str]`): + The names of the adapter to delete. Can be a single string or a list of strings + """ + if not USE_PEFT_BACKEND: + raise ValueError("PEFT backend is required for this method.") + + if isinstance(adapter_names, str): + adapter_names = [adapter_names] + + for adapter_name in adapter_names: + delete_adapter_layers(self, adapter_name) + + # Pop also the corresponding adapter from the config + if hasattr(self, "peft_config"): + self.peft_config.pop(adapter_name, None) + + delete_adapter_layers diff --git a/src/diffusers/loaders/utils.py b/src/diffusers/loaders/utils.py new file mode 100644 index 000000000000..f65cd4e65065 --- /dev/null +++ b/src/diffusers/loaders/utils.py @@ -0,0 +1,59 @@ +# Copyright 2023 The HuggingFace Team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +from typing import Dict + +import torch + + +class AttnProcsLayers(torch.nn.Module): + def __init__(self, state_dict: Dict[str, torch.Tensor]): + super().__init__() + self.layers = torch.nn.ModuleList(state_dict.values()) + self.mapping = dict(enumerate(state_dict.keys())) + self.rev_mapping = {v: k for k, v in enumerate(state_dict.keys())} + + # .processor for unet, .self_attn for text encoder + self.split_keys = [".processor", ".self_attn"] + + # we add a hook to state_dict() and load_state_dict() so that the + # naming fits with `unet.attn_processors` + def map_to(module, state_dict, *args, **kwargs): + new_state_dict = {} + for key, value in state_dict.items(): + num = int(key.split(".")[1]) # 0 is always "layers" + new_key = key.replace(f"layers.{num}", module.mapping[num]) + new_state_dict[new_key] = value + + return new_state_dict + + def remap_key(key, state_dict): + for k in self.split_keys: + if k in key: + return key.split(k)[0] + k + + raise ValueError( + f"There seems to be a problem with the state_dict: {set(state_dict.keys())}. {key} has to have one of {self.split_keys}." 
+ ) + + def map_from(module, state_dict, *args, **kwargs): + all_keys = list(state_dict.keys()) + for key in all_keys: + replace_key = remap_key(key, state_dict) + new_key = key.replace(replace_key, f"layers.{module.rev_mapping[replace_key]}") + state_dict[new_key] = state_dict[key] + del state_dict[key] + + self._register_state_dict_hook(map_to) + self._register_load_state_dict_pre_hook(map_from, with_module=True) diff --git a/src/diffusers/models/autoencoder_asym_kl.py b/src/diffusers/models/autoencoder_asym_kl.py index 9f0fa62d34cd..656683b43f60 100644 --- a/src/diffusers/models/autoencoder_asym_kl.py +++ b/src/diffusers/models/autoencoder_asym_kl.py @@ -65,11 +65,11 @@ def __init__( self, in_channels: int = 3, out_channels: int = 3, - down_block_types: Tuple[str] = ("DownEncoderBlock2D",), - down_block_out_channels: Tuple[int] = (64,), + down_block_types: Tuple[str, ...] = ("DownEncoderBlock2D",), + down_block_out_channels: Tuple[int, ...] = (64,), layers_per_down_block: int = 1, - up_block_types: Tuple[str] = ("UpDecoderBlock2D",), - up_block_out_channels: Tuple[int] = (64,), + up_block_types: Tuple[str, ...] = ("UpDecoderBlock2D",), + up_block_out_channels: Tuple[int, ...] = (64,), layers_per_up_block: int = 1, act_fn: str = "silu", latent_channels: int = 4, @@ -109,7 +109,9 @@ def __init__( self.use_tiling = False @apply_forward_hook - def encode(self, x: torch.FloatTensor, return_dict: bool = True) -> AutoencoderKLOutput: + def encode( + self, x: torch.FloatTensor, return_dict: bool = True + ) -> Union[AutoencoderKLOutput, Tuple[torch.FloatTensor]]: h = self.encoder(x) moments = self.quant_conv(h) posterior = DiagonalGaussianDistribution(moments) @@ -125,7 +127,7 @@ def _decode( image: Optional[torch.FloatTensor] = None, mask: Optional[torch.FloatTensor] = None, return_dict: bool = True, - ) -> Union[DecoderOutput, torch.FloatTensor]: + ) -> Union[DecoderOutput, Tuple[torch.FloatTensor]]: z = self.post_quant_conv(z) dec = self.decoder(z, image, mask) @@ -142,7 +144,7 @@ def decode( image: Optional[torch.FloatTensor] = None, mask: Optional[torch.FloatTensor] = None, return_dict: bool = True, - ) -> Union[DecoderOutput, torch.FloatTensor]: + ) -> Union[DecoderOutput, Tuple[torch.FloatTensor]]: decoded = self._decode(z, image, mask).sample if not return_dict: @@ -157,7 +159,7 @@ def forward( sample_posterior: bool = False, return_dict: bool = True, generator: Optional[torch.Generator] = None, - ) -> Union[DecoderOutput, torch.FloatTensor]: + ) -> Union[DecoderOutput, Tuple[torch.FloatTensor]]: r""" Args: sample (`torch.FloatTensor`): Input sample. 
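
The adapter-management helpers added to `UNet2DConditionLoadersMixin` a few hunks above (`set_adapters`, `disable_lora`, `enable_lora`, `delete_adapters`) all require the PEFT backend. A hedged sketch of how they might be exercised; the LoRA repository ids and adapter names are placeholders, not part of this diff:

```py
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Placeholder LoRA checkpoints; `adapter_name` requires `peft` to be installed.
pipe.load_lora_weights("user/lora-style-a", adapter_name="style_a")
pipe.load_lora_weights("user/lora-style-b", adapter_name="style_b")

pipe.unet.set_adapters(["style_a", "style_b"], weights=[0.8, 0.2])  # blend both adapters
pipe.unet.disable_lora()                                            # bypass all LoRA layers
pipe.unet.enable_lora()                                             # re-activate them
pipe.unet.delete_adapters("style_b")                                # remove one adapter entirely
```
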
diff --git a/src/diffusers/models/autoencoder_kl.py b/src/diffusers/models/autoencoder_kl.py index ac616530a66a..9003d982b32f 100644 --- a/src/diffusers/models/autoencoder_kl.py +++ b/src/diffusers/models/autoencoder_kl.py @@ -322,13 +322,13 @@ def decode( return DecoderOutput(sample=decoded) - def blend_v(self, a, b, blend_extent): + def blend_v(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor: blend_extent = min(a.shape[2], b.shape[2], blend_extent) for y in range(blend_extent): b[:, :, y, :] = a[:, :, -blend_extent + y, :] * (1 - y / blend_extent) + b[:, :, y, :] * (y / blend_extent) return b - def blend_h(self, a, b, blend_extent): + def blend_h(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor: blend_extent = min(a.shape[3], b.shape[3], blend_extent) for x in range(blend_extent): b[:, :, :, x] = a[:, :, :, -blend_extent + x] * (1 - x / blend_extent) + b[:, :, :, x] * (x / blend_extent) diff --git a/src/diffusers/models/autoencoder_tiny.py b/src/diffusers/models/autoencoder_tiny.py index 15bd53ff99d6..0df97ed22821 100644 --- a/src/diffusers/models/autoencoder_tiny.py +++ b/src/diffusers/models/autoencoder_tiny.py @@ -96,18 +96,18 @@ class AutoencoderTiny(ModelMixin, ConfigMixin): @register_to_config def __init__( self, - in_channels=3, - out_channels=3, - encoder_block_out_channels: Tuple[int] = (64, 64, 64, 64), - decoder_block_out_channels: Tuple[int] = (64, 64, 64, 64), + in_channels: int = 3, + out_channels: int = 3, + encoder_block_out_channels: Tuple[int, ...] = (64, 64, 64, 64), + decoder_block_out_channels: Tuple[int, ...] = (64, 64, 64, 64), act_fn: str = "relu", latent_channels: int = 4, upsampling_scaling_factor: int = 2, - num_encoder_blocks: Tuple[int] = (1, 3, 3, 3), - num_decoder_blocks: Tuple[int] = (3, 3, 3, 1), + num_encoder_blocks: Tuple[int, ...] = (1, 3, 3, 3), + num_decoder_blocks: Tuple[int, ...] = (3, 3, 3, 1), latent_magnitude: int = 3, latent_shift: float = 0.5, - force_upcast: float = False, + force_upcast: bool = False, scaling_factor: float = 1.0, ): super().__init__() @@ -147,33 +147,33 @@ def __init__( self.tile_sample_min_size = 512 self.tile_latent_min_size = self.tile_sample_min_size // self.spatial_scale_factor - def _set_gradient_checkpointing(self, module, value=False): + def _set_gradient_checkpointing(self, module, value: bool = False) -> None: if isinstance(module, (EncoderTiny, DecoderTiny)): module.gradient_checkpointing = value - def scale_latents(self, x): + def scale_latents(self, x: torch.FloatTensor) -> torch.FloatTensor: """raw latents -> [0, 1]""" return x.div(2 * self.latent_magnitude).add(self.latent_shift).clamp(0, 1) - def unscale_latents(self, x): + def unscale_latents(self, x: torch.FloatTensor) -> torch.FloatTensor: """[0, 1] -> raw latents""" return x.sub(self.latent_shift).mul(2 * self.latent_magnitude) - def enable_slicing(self): + def enable_slicing(self) -> None: r""" Enable sliced VAE decoding. When this option is enabled, the VAE will split the input tensor in slices to compute decoding in several steps. This is useful to save some memory and allow larger batch sizes. """ self.use_slicing = True - def disable_slicing(self): + def disable_slicing(self) -> None: r""" Disable sliced VAE decoding. If `enable_slicing` was previously enabled, this method will go back to computing decoding in one step. """ self.use_slicing = False - def enable_tiling(self, use_tiling: bool = True): + def enable_tiling(self, use_tiling: bool = True) -> None: r""" Enable tiled VAE decoding. 
When this option is enabled, the VAE will split the input tensor into tiles to compute decoding and encoding in several steps. This is useful for saving a large amount of memory and to allow @@ -181,7 +181,7 @@ def enable_tiling(self, use_tiling: bool = True): """ self.use_tiling = use_tiling - def disable_tiling(self): + def disable_tiling(self) -> None: r""" Disable tiled VAE decoding. If `enable_tiling` was previously enabled, this method will go back to computing decoding in one step. @@ -197,13 +197,9 @@ def _tiled_encode(self, x: torch.FloatTensor) -> torch.FloatTensor: Args: x (`torch.FloatTensor`): Input batch of images. - return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~models.autoencoder_tiny.AutoencoderTinyOutput`] instead of a plain tuple. Returns: - [`~models.autoencoder_tiny.AutoencoderTinyOutput`] or `tuple`: - If return_dict is True, a [`~models.autoencoder_tiny.AutoencoderTinyOutput`] is returned, otherwise a - plain `tuple` is returned. + `torch.FloatTensor`: Encoded batch of images. """ # scale of encoder output relative to input sf = self.spatial_scale_factor @@ -249,13 +245,9 @@ def _tiled_decode(self, x: torch.FloatTensor) -> torch.FloatTensor: Args: x (`torch.FloatTensor`): Input batch of images. - return_dict (`bool`, *optional*, defaults to `True`): - Whether or not to return a [`~models.autoencoder_tiny.AutoencoderTinyOutput`] instead of a plain tuple. Returns: - [`~models.vae.DecoderOutput`] or `tuple`: - If return_dict is True, a [`~models.vae.DecoderOutput`] is returned, otherwise a plain `tuple` is - returned. + `torch.FloatTensor`: Encoded batch of images. """ # scale of decoder output relative to input sf = self.spatial_scale_factor diff --git a/src/diffusers/models/consistency_decoder_vae.py b/src/diffusers/models/consistency_decoder_vae.py index 63d8763d14b5..a2d82e2565ed 100644 --- a/src/diffusers/models/consistency_decoder_vae.py +++ b/src/diffusers/models/consistency_decoder_vae.py @@ -56,7 +56,7 @@ class ConsistencyDecoderVAE(ModelMixin, ConfigMixin): Examples: ```py >>> import torch - >>> from diffusers import DiffusionPipeline, ConsistencyDecoderVAE + >>> from diffusers import StableDiffusionPipeline, ConsistencyDecoderVAE >>> vae = ConsistencyDecoderVAE.from_pretrained("openai/consistency-decoder", torch_dtype=torch.float16) >>> pipe = StableDiffusionPipeline.from_pretrained( @@ -70,39 +70,39 @@ class ConsistencyDecoderVAE(ModelMixin, ConfigMixin): @register_to_config def __init__( self, - scaling_factor=0.18215, - latent_channels=4, - encoder_act_fn="silu", - encoder_block_out_channels=(128, 256, 512, 512), - encoder_double_z=True, - encoder_down_block_types=( + scaling_factor: float = 0.18215, + latent_channels: int = 4, + encoder_act_fn: str = "silu", + encoder_block_out_channels: Tuple[int, ...] = (128, 256, 512, 512), + encoder_double_z: bool = True, + encoder_down_block_types: Tuple[str, ...] = ( "DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D", "DownEncoderBlock2D", ), - encoder_in_channels=3, - encoder_layers_per_block=2, - encoder_norm_num_groups=32, - encoder_out_channels=4, - decoder_add_attention=False, - decoder_block_out_channels=(320, 640, 1024, 1024), - decoder_down_block_types=( + encoder_in_channels: int = 3, + encoder_layers_per_block: int = 2, + encoder_norm_num_groups: int = 32, + encoder_out_channels: int = 4, + decoder_add_attention: bool = False, + decoder_block_out_channels: Tuple[int, ...] = (320, 640, 1024, 1024), + decoder_down_block_types: Tuple[str, ...] 
= ( "ResnetDownsampleBlock2D", "ResnetDownsampleBlock2D", "ResnetDownsampleBlock2D", "ResnetDownsampleBlock2D", ), - decoder_downsample_padding=1, - decoder_in_channels=7, - decoder_layers_per_block=3, - decoder_norm_eps=1e-05, - decoder_norm_num_groups=32, - decoder_num_train_timesteps=1024, - decoder_out_channels=6, - decoder_resnet_time_scale_shift="scale_shift", - decoder_time_embedding_type="learned", - decoder_up_block_types=( + decoder_downsample_padding: int = 1, + decoder_in_channels: int = 7, + decoder_layers_per_block: int = 3, + decoder_norm_eps: float = 1e-05, + decoder_norm_num_groups: int = 32, + decoder_num_train_timesteps: int = 1024, + decoder_out_channels: int = 6, + decoder_resnet_time_scale_shift: str = "scale_shift", + decoder_time_embedding_type: str = "learned", + decoder_up_block_types: Tuple[str, ...] = ( "ResnetUpsampleBlock2D", "ResnetUpsampleBlock2D", "ResnetUpsampleBlock2D", @@ -304,8 +304,8 @@ def decode( z: torch.FloatTensor, generator: Optional[torch.Generator] = None, return_dict: bool = True, - num_inference_steps=2, - ) -> Union[DecoderOutput, torch.FloatTensor]: + num_inference_steps: int = 2, + ) -> Union[DecoderOutput, Tuple[torch.FloatTensor]]: z = (z * self.config.scaling_factor - self.means) / self.stds scale_factor = 2 ** (len(self.config.block_out_channels) - 1) @@ -333,14 +333,14 @@ def decode( return DecoderOutput(sample=x_0) # Copied from diffusers.models.autoencoder_kl.AutoencoderKL.blend_v - def blend_v(self, a, b, blend_extent): + def blend_v(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor: blend_extent = min(a.shape[2], b.shape[2], blend_extent) for y in range(blend_extent): b[:, :, y, :] = a[:, :, -blend_extent + y, :] * (1 - y / blend_extent) + b[:, :, y, :] * (y / blend_extent) return b # Copied from diffusers.models.autoencoder_kl.AutoencoderKL.blend_h - def blend_h(self, a, b, blend_extent): + def blend_h(self, a: torch.Tensor, b: torch.Tensor, blend_extent: int) -> torch.Tensor: blend_extent = min(a.shape[3], b.shape[3], blend_extent) for x in range(blend_extent): b[:, :, :, x] = a[:, :, :, -blend_extent + x] * (1 - x / blend_extent) + b[:, :, :, x] * (x / blend_extent) @@ -407,7 +407,7 @@ def forward( sample_posterior: bool = False, return_dict: bool = True, generator: Optional[torch.Generator] = None, - ) -> Union[DecoderOutput, torch.FloatTensor]: + ) -> Union[DecoderOutput, Tuple[torch.FloatTensor]]: r""" Args: sample (`torch.FloatTensor`): Input sample. @@ -415,6 +415,12 @@ def forward( Whether to sample from the posterior. return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`DecoderOutput`] instead of a plain tuple. + generator (`torch.Generator`, *optional*, defaults to `None`): + Generator to use for sampling. + + Returns: + [`DecoderOutput`] or `tuple`: + If return_dict is True, a [`DecoderOutput`] is returned, otherwise a plain `tuple` is returned. """ x = sample posterior = self.encode(x).latent_dist diff --git a/src/diffusers/models/controlnet.py b/src/diffusers/models/controlnet.py index 052335f6c5cd..220e34593c23 100644 --- a/src/diffusers/models/controlnet.py +++ b/src/diffusers/models/controlnet.py @@ -76,7 +76,7 @@ def __init__( self, conditioning_embedding_channels: int, conditioning_channels: int = 3, - block_out_channels: Tuple[int] = (16, 32, 96, 256), + block_out_channels: Tuple[int, ...] = (16, 32, 96, 256), ): super().__init__() @@ -171,6 +171,9 @@ class conditioning with `class_embed_type` equal to `None`. 
conditioning_embedding_out_channels (`tuple[int]`, *optional*, defaults to `(16, 32, 96, 256)`): The tuple of output channel for each block in the `conditioning_embedding` layer. global_pool_conditions (`bool`, defaults to `False`): + TODO(Patrick) - unused parameter. + addition_embed_type_num_heads (`int`, defaults to 64): + The number of heads to use for the `TextTimeEmbedding` layer. """ _supports_gradient_checkpointing = True @@ -182,14 +185,14 @@ def __init__( conditioning_channels: int = 3, flip_sin_to_cos: bool = True, freq_shift: int = 0, - down_block_types: Tuple[str] = ( + down_block_types: Tuple[str, ...] = ( "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D", ), only_cross_attention: Union[bool, Tuple[bool]] = False, - block_out_channels: Tuple[int] = (320, 640, 1280, 1280), + block_out_channels: Tuple[int, ...] = (320, 640, 1280, 1280), layers_per_block: int = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, @@ -197,11 +200,11 @@ def __init__( norm_num_groups: Optional[int] = 32, norm_eps: float = 1e-5, cross_attention_dim: int = 1280, - transformer_layers_per_block: Union[int, Tuple[int]] = 1, + transformer_layers_per_block: Union[int, Tuple[int, ...]] = 1, encoder_hid_dim: Optional[int] = None, encoder_hid_dim_type: Optional[str] = None, - attention_head_dim: Union[int, Tuple[int]] = 8, - num_attention_heads: Optional[Union[int, Tuple[int]]] = None, + attention_head_dim: Union[int, Tuple[int, ...]] = 8, + num_attention_heads: Optional[Union[int, Tuple[int, ...]]] = None, use_linear_projection: bool = False, class_embed_type: Optional[str] = None, addition_embed_type: Optional[str] = None, @@ -211,9 +214,9 @@ def __init__( resnet_time_scale_shift: str = "default", projection_class_embeddings_input_dim: Optional[int] = None, controlnet_conditioning_channel_order: str = "rgb", - conditioning_embedding_out_channels: Optional[Tuple[int]] = (16, 32, 96, 256), + conditioning_embedding_out_channels: Optional[Tuple[int, ...]] = (16, 32, 96, 256), global_pool_conditions: bool = False, - addition_embed_type_num_heads=64, + addition_embed_type_num_heads: int = 64, ): super().__init__() @@ -426,7 +429,7 @@ def from_unet( cls, unet: UNet2DConditionModel, controlnet_conditioning_channel_order: str = "rgb", - conditioning_embedding_out_channels: Optional[Tuple[int]] = (16, 32, 96, 256), + conditioning_embedding_out_channels: Optional[Tuple[int, ...]] = (16, 32, 96, 256), load_weights_from_unet: bool = True, ): r""" @@ -570,7 +573,7 @@ def set_default_attn_processor(self): self.set_attn_processor(processor, _remove_lora=True) # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_attention_slice - def set_attention_slice(self, slice_size): + def set_attention_slice(self, slice_size: Union[str, int, List[int]]) -> None: r""" Enable sliced attention computation. 
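
`from_unet`, whose `conditioning_embedding_out_channels` annotation is tightened above, initializes a ControlNet from an existing UNet. A short usage sketch; the checkpoint id is only an example, and `load_weights_from_unet=True` (the default) copies the matching encoder weights:

```py
import torch
from diffusers import ControlNetModel, UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet", torch_dtype=torch.float16
)

# Build a ControlNet whose conv_in/time/down/mid blocks mirror the UNet configuration.
controlnet = ControlNetModel.from_unet(
    unet,
    controlnet_conditioning_channel_order="rgb",
    conditioning_embedding_out_channels=(16, 32, 96, 256),
)
```
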
@@ -635,7 +638,7 @@ def fn_recursive_set_attention_slice(module: torch.nn.Module, slice_size: List[i for module in self.children(): fn_recursive_set_attention_slice(module, reversed_slice_size) - def _set_gradient_checkpointing(self, module, value=False): + def _set_gradient_checkpointing(self, module, value: bool = False) -> None: if isinstance(module, (CrossAttnDownBlock2D, DownBlock2D)): module.gradient_checkpointing = value @@ -653,7 +656,7 @@ def forward( cross_attention_kwargs: Optional[Dict[str, Any]] = None, guess_mode: bool = False, return_dict: bool = True, - ) -> Union[ControlNetOutput, Tuple]: + ) -> Union[ControlNetOutput, Tuple[Tuple[torch.FloatTensor, ...], torch.FloatTensor]]: """ The [`ControlNetModel`] forward method. diff --git a/src/diffusers/models/controlnet_flax.py b/src/diffusers/models/controlnet_flax.py index 076e6183211b..10059ffd6f6d 100644 --- a/src/diffusers/models/controlnet_flax.py +++ b/src/diffusers/models/controlnet_flax.py @@ -46,10 +46,10 @@ class FlaxControlNetOutput(BaseOutput): class FlaxControlNetConditioningEmbedding(nn.Module): conditioning_embedding_channels: int - block_out_channels: Tuple[int] = (16, 32, 96, 256) + block_out_channels: Tuple[int, ...] = (16, 32, 96, 256) dtype: jnp.dtype = jnp.float32 - def setup(self): + def setup(self) -> None: self.conv_in = nn.Conv( self.block_out_channels[0], kernel_size=(3, 3), @@ -87,7 +87,7 @@ def setup(self): dtype=self.dtype, ) - def __call__(self, conditioning): + def __call__(self, conditioning: jnp.ndarray) -> jnp.ndarray: embedding = self.conv_in(conditioning) embedding = nn.silu(embedding) @@ -148,17 +148,17 @@ class FlaxControlNetModel(nn.Module, FlaxModelMixin, ConfigMixin): """ sample_size: int = 32 in_channels: int = 4 - down_block_types: Tuple[str] = ( + down_block_types: Tuple[str, ...] = ( "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D", ) - only_cross_attention: Union[bool, Tuple[bool]] = False - block_out_channels: Tuple[int] = (320, 640, 1280, 1280) + only_cross_attention: Union[bool, Tuple[bool, ...]] = False + block_out_channels: Tuple[int, ...] = (320, 640, 1280, 1280) layers_per_block: int = 2 - attention_head_dim: Union[int, Tuple[int]] = 8 - num_attention_heads: Optional[Union[int, Tuple[int]]] = None + attention_head_dim: Union[int, Tuple[int, ...]] = 8 + num_attention_heads: Optional[Union[int, Tuple[int, ...]]] = None cross_attention_dim: int = 1280 dropout: float = 0.0 use_linear_projection: bool = False @@ -166,7 +166,7 @@ class FlaxControlNetModel(nn.Module, FlaxModelMixin, ConfigMixin): flip_sin_to_cos: bool = True freq_shift: int = 0 controlnet_conditioning_channel_order: str = "rgb" - conditioning_embedding_out_channels: Tuple[int] = (16, 32, 96, 256) + conditioning_embedding_out_channels: Tuple[int, ...] 
= (16, 32, 96, 256) def init_weights(self, rng: jax.Array) -> FrozenDict: # init input tensors @@ -182,7 +182,7 @@ def init_weights(self, rng: jax.Array) -> FrozenDict: return self.init(rngs, sample, timesteps, encoder_hidden_states, controlnet_cond)["params"] - def setup(self): + def setup(self) -> None: block_out_channels = self.block_out_channels time_embed_dim = block_out_channels[0] * 4 @@ -312,21 +312,21 @@ def setup(self): def __call__( self, - sample, - timesteps, - encoder_hidden_states, - controlnet_cond, + sample: jnp.ndarray, + timesteps: Union[jnp.ndarray, float, int], + encoder_hidden_states: jnp.ndarray, + controlnet_cond: jnp.ndarray, conditioning_scale: float = 1.0, return_dict: bool = True, train: bool = False, - ) -> Union[FlaxControlNetOutput, Tuple]: + ) -> Union[FlaxControlNetOutput, Tuple[Tuple[jnp.ndarray, ...], jnp.ndarray]]: r""" Args: sample (`jnp.ndarray`): (batch, channel, height, width) noisy inputs tensor timestep (`jnp.ndarray` or `float` or `int`): timesteps encoder_hidden_states (`jnp.ndarray`): (batch_size, sequence_length, hidden_size) encoder hidden states controlnet_cond (`jnp.ndarray`): (batch, channel, height, width) the conditional input tensor - conditioning_scale: (`float`) the scale factor for controlnet outputs + conditioning_scale (`float`, *optional*, defaults to `1.0`): the scale factor for controlnet outputs return_dict (`bool`, *optional*, defaults to `True`): Whether or not to return a [`models.unet_2d_condition_flax.FlaxUNet2DConditionOutput`] instead of a plain tuple. @@ -335,8 +335,8 @@ def __call__( Returns: [`~models.unet_2d_condition_flax.FlaxUNet2DConditionOutput`] or `tuple`: - [`~models.unet_2d_condition_flax.FlaxUNet2DConditionOutput`] if `return_dict` is True, otherwise a `tuple`. - When returning a tuple, the first element is the sample tensor. + [`~models.unet_2d_condition_flax.FlaxUNet2DConditionOutput`] if `return_dict` is True, otherwise a + `tuple`. When returning a tuple, the first element is the sample tensor. 
""" channel_order = self.controlnet_conditioning_channel_order if channel_order == "bgr": diff --git a/src/diffusers/models/lora.py b/src/diffusers/models/lora.py index a143c17458ad..9edec19a3a34 100644 --- a/src/diffusers/models/lora.py +++ b/src/diffusers/models/lora.py @@ -18,13 +18,64 @@ import torch.nn.functional as F from torch import nn -from ..loaders import PatchedLoraProjection, text_encoder_attn_modules, text_encoder_mlp_modules from ..utils import logging +from ..utils.import_utils import is_transformers_available + + +if is_transformers_available(): + from transformers import CLIPTextModel, CLIPTextModelWithProjection logger = logging.get_logger(__name__) # pylint: disable=invalid-name +def text_encoder_attn_modules(text_encoder): + attn_modules = [] + + if isinstance(text_encoder, (CLIPTextModel, CLIPTextModelWithProjection)): + for i, layer in enumerate(text_encoder.text_model.encoder.layers): + name = f"text_model.encoder.layers.{i}.self_attn" + mod = layer.self_attn + attn_modules.append((name, mod)) + else: + raise ValueError(f"do not know how to get attention modules for: {text_encoder.__class__.__name__}") + + return attn_modules + + +def text_encoder_mlp_modules(text_encoder): + mlp_modules = [] + + if isinstance(text_encoder, (CLIPTextModel, CLIPTextModelWithProjection)): + for i, layer in enumerate(text_encoder.text_model.encoder.layers): + mlp_mod = layer.mlp + name = f"text_model.encoder.layers.{i}.mlp" + mlp_modules.append((name, mlp_mod)) + else: + raise ValueError(f"do not know how to get mlp modules for: {text_encoder.__class__.__name__}") + + return mlp_modules + + +def text_encoder_lora_state_dict(text_encoder): + state_dict = {} + + for name, module in text_encoder_attn_modules(text_encoder): + for k, v in module.q_proj.lora_linear_layer.state_dict().items(): + state_dict[f"{name}.q_proj.lora_linear_layer.{k}"] = v + + for k, v in module.k_proj.lora_linear_layer.state_dict().items(): + state_dict[f"{name}.k_proj.lora_linear_layer.{k}"] = v + + for k, v in module.v_proj.lora_linear_layer.state_dict().items(): + state_dict[f"{name}.v_proj.lora_linear_layer.{k}"] = v + + for k, v in module.out_proj.lora_linear_layer.state_dict().items(): + state_dict[f"{name}.out_proj.lora_linear_layer.{k}"] = v + + return state_dict + + def adjust_lora_scale_text_encoder(text_encoder, lora_scale: float = 1.0): for _, attn_module in text_encoder_attn_modules(text_encoder): if isinstance(attn_module.q_proj, PatchedLoraProjection): @@ -39,6 +90,95 @@ def adjust_lora_scale_text_encoder(text_encoder, lora_scale: float = 1.0): mlp_module.fc2.lora_scale = lora_scale +class PatchedLoraProjection(torch.nn.Module): + def __init__(self, regular_linear_layer, lora_scale=1, network_alpha=None, rank=4, dtype=None): + super().__init__() + from ..models.lora import LoRALinearLayer + + self.regular_linear_layer = regular_linear_layer + + device = self.regular_linear_layer.weight.device + + if dtype is None: + dtype = self.regular_linear_layer.weight.dtype + + self.lora_linear_layer = LoRALinearLayer( + self.regular_linear_layer.in_features, + self.regular_linear_layer.out_features, + network_alpha=network_alpha, + device=device, + dtype=dtype, + rank=rank, + ) + + self.lora_scale = lora_scale + + # overwrite PyTorch's `state_dict` to be sure that only the 'regular_linear_layer' weights are saved + # when saving the whole text encoder model and when LoRA is unloaded or fused + def state_dict(self, *args, destination=None, prefix="", keep_vars=False): + if self.lora_linear_layer is None: + 
return self.regular_linear_layer.state_dict( + *args, destination=destination, prefix=prefix, keep_vars=keep_vars + ) + + return super().state_dict(*args, destination=destination, prefix=prefix, keep_vars=keep_vars) + + def _fuse_lora(self, lora_scale=1.0, safe_fusing=False): + if self.lora_linear_layer is None: + return + + dtype, device = self.regular_linear_layer.weight.data.dtype, self.regular_linear_layer.weight.data.device + + w_orig = self.regular_linear_layer.weight.data.float() + w_up = self.lora_linear_layer.up.weight.data.float() + w_down = self.lora_linear_layer.down.weight.data.float() + + if self.lora_linear_layer.network_alpha is not None: + w_up = w_up * self.lora_linear_layer.network_alpha / self.lora_linear_layer.rank + + fused_weight = w_orig + (lora_scale * torch.bmm(w_up[None, :], w_down[None, :])[0]) + + if safe_fusing and torch.isnan(fused_weight).any().item(): + raise ValueError( + "This LoRA weight seems to be broken. " + f"Encountered NaN values when trying to fuse LoRA weights for {self}." + "LoRA weights will not be fused." + ) + + self.regular_linear_layer.weight.data = fused_weight.to(device=device, dtype=dtype) + + # we can drop the lora layer now + self.lora_linear_layer = None + + # offload the up and down matrices to CPU to not blow the memory + self.w_up = w_up.cpu() + self.w_down = w_down.cpu() + self.lora_scale = lora_scale + + def _unfuse_lora(self): + if not (getattr(self, "w_up", None) is not None and getattr(self, "w_down", None) is not None): + return + + fused_weight = self.regular_linear_layer.weight.data + dtype, device = fused_weight.dtype, fused_weight.device + + w_up = self.w_up.to(device=device).float() + w_down = self.w_down.to(device).float() + + unfused_weight = fused_weight.float() - (self.lora_scale * torch.bmm(w_up[None, :], w_down[None, :])[0]) + self.regular_linear_layer.weight.data = unfused_weight.to(device=device, dtype=dtype) + + self.w_up = None + self.w_down = None + + def forward(self, input): + if self.lora_scale is None: + self.lora_scale = 1.0 + if self.lora_linear_layer is None: + return self.regular_linear_layer(input) + return self.regular_linear_layer(input) + (self.lora_scale * self.lora_linear_layer(input)) + + class LoRALinearLayer(nn.Module): r""" A linear layer that is used with LoRA. diff --git a/src/diffusers/models/modeling_utils.py b/src/diffusers/models/modeling_utils.py index 5fe3c5602f3a..4a9483feb429 100644 --- a/src/diffusers/models/modeling_utils.py +++ b/src/diffusers/models/modeling_utils.py @@ -18,13 +18,14 @@ import itertools import os import re +from collections import OrderedDict from functools import partial from typing import Any, Callable, List, Optional, Tuple, Union import safetensors import torch from huggingface_hub import create_repo -from torch import Tensor, device, nn +from torch import Tensor, nn from .. 
import __version__ from ..utils import ( @@ -61,7 +62,7 @@ from accelerate.utils.versions import is_torch_version -def get_parameter_device(parameter: torch.nn.Module): +def get_parameter_device(parameter: torch.nn.Module) -> torch.device: try: parameters_and_buffers = itertools.chain(parameter.parameters(), parameter.buffers()) return next(parameters_and_buffers).device @@ -77,7 +78,7 @@ def find_tensor_attributes(module: torch.nn.Module) -> List[Tuple[str, Tensor]]: return first_tuple[1].device -def get_parameter_dtype(parameter: torch.nn.Module): +def get_parameter_dtype(parameter: torch.nn.Module) -> torch.dtype: try: params = tuple(parameter.parameters()) if len(params) > 0: @@ -130,7 +131,13 @@ def load_state_dict(checkpoint_file: Union[str, os.PathLike], variant: Optional[ ) -def load_model_dict_into_meta(model, state_dict, device=None, dtype=None, model_name_or_path=None): +def load_model_dict_into_meta( + model, + state_dict: OrderedDict, + device: Optional[Union[str, torch.device]] = None, + dtype: Optional[Union[str, torch.dtype]] = None, + model_name_or_path: Optional[str] = None, +) -> List[str]: device = device or torch.device("cpu") dtype = dtype or torch.float32 @@ -156,7 +163,7 @@ def load_model_dict_into_meta(model, state_dict, device=None, dtype=None, model_ return unexpected_keys -def _load_state_dict_into_model(model_to_load, state_dict): +def _load_state_dict_into_model(model_to_load, state_dict: OrderedDict) -> List[str]: # Convert old format to new format if needed from a PyTorch state_dict # copy state_dict so _load_from_state_dict can modify it state_dict = state_dict.copy() @@ -164,7 +171,7 @@ def _load_state_dict_into_model(model_to_load, state_dict): # PyTorch's `_load_from_state_dict` does not copy parameters in a module's descendants # so we need to apply the function recursively. - def load(module: torch.nn.Module, prefix=""): + def load(module: torch.nn.Module, prefix: str = ""): args = (state_dict, prefix, {}, True, [], [], error_msgs) module._load_from_state_dict(*args) @@ -220,7 +227,7 @@ def is_gradient_checkpointing(self) -> bool: """ return any(hasattr(m, "gradient_checkpointing") and m.gradient_checkpointing for m in self.modules()) - def enable_gradient_checkpointing(self): + def enable_gradient_checkpointing(self) -> None: """ Activates gradient checkpointing for the current model (may be referred to as *activation checkpointing* or *checkpoint activations* in other frameworks). @@ -229,7 +236,7 @@ def enable_gradient_checkpointing(self): raise ValueError(f"{self.__class__.__name__} does not support gradient checkpointing.") self.apply(partial(self._set_gradient_checkpointing, value=True)) - def disable_gradient_checkpointing(self): + def disable_gradient_checkpointing(self) -> None: """ Deactivates gradient checkpointing for the current model (may be referred to as *activation checkpointing* or *checkpoint activations* in other frameworks). @@ -254,7 +261,7 @@ def fn_recursive_set_mem_eff(module: torch.nn.Module): if isinstance(module, torch.nn.Module): fn_recursive_set_mem_eff(module) - def enable_xformers_memory_efficient_attention(self, attention_op: Optional[Callable] = None): + def enable_xformers_memory_efficient_attention(self, attention_op: Optional[Callable] = None) -> None: r""" Enable memory efficient attention from [xFormers](https://facebookresearch.github.io/xformers/). 
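
The xFormers switches annotated in the surrounding hunks are simple toggles on `ModelMixin`. A hedged sketch; it requires the optional `xformers` package, and the checkpoint id is an example:

```py
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

unet.enable_xformers_memory_efficient_attention()   # swap attention processors to xFormers
unet.disable_xformers_memory_efficient_attention()  # switch back to the default processors
```
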
@@ -290,7 +297,7 @@ def enable_xformers_memory_efficient_attention(self, attention_op: Optional[Call """ self.set_use_memory_efficient_attention_xformers(True, attention_op) - def disable_xformers_memory_efficient_attention(self): + def disable_xformers_memory_efficient_attention(self) -> None: r""" Disable memory efficient attention from [xFormers](https://facebookresearch.github.io/xformers/). """ @@ -447,7 +454,7 @@ def save_pretrained( self, save_directory: Union[str, os.PathLike], is_main_process: bool = True, - save_function: Callable = None, + save_function: Optional[Callable] = None, safe_serialization: bool = True, variant: Optional[str] = None, push_to_hub: bool = False, @@ -910,10 +917,10 @@ def from_pretrained(cls, pretrained_model_name_or_path: Optional[Union[str, os.P def _load_pretrained_model( cls, model, - state_dict, + state_dict: OrderedDict, resolved_archive_file, - pretrained_model_name_or_path, - ignore_mismatched_sizes=False, + pretrained_model_name_or_path: Union[str, os.PathLike], + ignore_mismatched_sizes: bool = False, ): # Retrieve missing & unexpected_keys model_state_dict = model.state_dict() @@ -1011,7 +1018,7 @@ def _find_mismatched_keys( return model, missing_keys, unexpected_keys, mismatched_keys, error_msgs @property - def device(self) -> device: + def device(self) -> torch.device: """ `torch.device`: The device on which the module is (assuming that all the module parameters are on the same device). @@ -1063,7 +1070,7 @@ def num_parameters(self, only_trainable: bool = False, exclude_embeddings: bool else: return sum(p.numel() for p in self.parameters() if p.requires_grad or not only_trainable) - def _convert_deprecated_attention_blocks(self, state_dict): + def _convert_deprecated_attention_blocks(self, state_dict: OrderedDict) -> None: deprecated_attention_block_paths = [] def recursive_find_attn_block(name, module): @@ -1107,7 +1114,7 @@ def recursive_find_attn_block(name, module): if f"{path}.proj_attn.bias" in state_dict: state_dict[f"{path}.to_out.0.bias"] = state_dict.pop(f"{path}.proj_attn.bias") - def _temp_convert_self_to_deprecated_attention_blocks(self): + def _temp_convert_self_to_deprecated_attention_blocks(self) -> None: deprecated_attention_block_modules = [] def recursive_find_attn_block(module): @@ -1134,10 +1141,10 @@ def recursive_find_attn_block(module): del module.to_v del module.to_out - def _undo_temp_convert_self_to_deprecated_attention_blocks(self): + def _undo_temp_convert_self_to_deprecated_attention_blocks(self) -> None: deprecated_attention_block_modules = [] - def recursive_find_attn_block(module): + def recursive_find_attn_block(module) -> None: if hasattr(module, "_from_deprecated_attn_block") and module._from_deprecated_attn_block: deprecated_attention_block_modules.append(module) diff --git a/src/diffusers/models/normalization.py b/src/diffusers/models/normalization.py index cedeff18f351..11d2a344744e 100644 --- a/src/diffusers/models/normalization.py +++ b/src/diffusers/models/normalization.py @@ -101,8 +101,8 @@ def __init__(self, embedding_dim: int, use_additional_conditions: bool = False): def forward( self, timestep: torch.Tensor, - added_cond_kwargs: Dict[str, torch.Tensor] = None, - batch_size: int = None, + added_cond_kwargs: Optional[Dict[str, torch.Tensor]] = None, + batch_size: Optional[int] = None, hidden_dtype: Optional[torch.dtype] = None, ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor, torch.Tensor]: # No modulation happening here. 
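
Before moving on to `resnet.py`, the newly typed `ModelMixin` helpers from the `modeling_utils.py` hunks above can be exercised directly. A minimal sketch; the checkpoint id is an example:

```py
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")

print(unet.device)                                 # torch.device of the parameters
print(unet.dtype)                                  # torch.dtype of the parameters
print(unet.num_parameters(only_trainable=False))   # total parameter count
print(unet.is_gradient_checkpointing)              # False until enabled
unet.enable_gradient_checkpointing()               # recursively sets gradient_checkpointing=True
```
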
diff --git a/src/diffusers/models/resnet.py b/src/diffusers/models/resnet.py index 868e2e5fae2c..7a48d343a531 100644 --- a/src/diffusers/models/resnet.py +++ b/src/diffusers/models/resnet.py @@ -164,7 +164,9 @@ def __init__( else: self.Conv2d_0 = conv - def forward(self, hidden_states: torch.Tensor, output_size: Optional[int] = None, scale: float = 1.0): + def forward( + self, hidden_states: torch.FloatTensor, output_size: Optional[int] = None, scale: float = 1.0 + ) -> torch.FloatTensor: assert hidden_states.shape[1] == self.channels if self.use_conv_transpose: @@ -256,7 +258,7 @@ def __init__( else: self.conv = conv - def forward(self, hidden_states, scale: float = 1.0): + def forward(self, hidden_states: torch.FloatTensor, scale: float = 1.0) -> torch.FloatTensor: assert hidden_states.shape[1] == self.channels if self.use_conv and self.padding == 0: @@ -280,7 +282,7 @@ class FirUpsample2D(nn.Module): """A 2D FIR upsampling layer with an optional convolution. Parameters: - channels (`int`): + channels (`int`, optional): number of channels in the inputs and outputs. use_conv (`bool`, default `False`): option to use a convolution. @@ -292,7 +294,7 @@ class FirUpsample2D(nn.Module): def __init__( self, - channels: int = None, + channels: Optional[int] = None, out_channels: Optional[int] = None, use_conv: bool = False, fir_kernel: Tuple[int, int, int, int] = (1, 3, 3, 1), @@ -307,12 +309,12 @@ def __init__( def _upsample_2d( self, - hidden_states: torch.Tensor, - weight: Optional[torch.Tensor] = None, + hidden_states: torch.FloatTensor, + weight: Optional[torch.FloatTensor] = None, kernel: Optional[torch.FloatTensor] = None, factor: int = 2, gain: float = 1, - ) -> torch.Tensor: + ) -> torch.FloatTensor: """Fused `upsample_2d()` followed by `Conv2d()`. Padding is performed only once at the beginning, not between the operations. The fused op is considerably more @@ -320,17 +322,21 @@ def _upsample_2d( arbitrary order. Args: - hidden_states: Input tensor of the shape `[N, C, H, W]` or `[N, H, W, C]`. - weight: Weight tensor of the shape `[filterH, filterW, inChannels, - outChannels]`. Grouped convolution can be performed by `inChannels = x.shape[0] // numGroups`. - kernel: FIR filter of the shape `[firH, firW]` or `[firN]` - (separable). The default is `[1] * factor`, which corresponds to nearest-neighbor upsampling. - factor: Integer upsampling factor (default: 2). - gain: Scaling factor for signal magnitude (default: 1.0). + hidden_states (`torch.FloatTensor`): + Input tensor of the shape `[N, C, H, W]` or `[N, H, W, C]`. + weight (`torch.FloatTensor`, *optional*): + Weight tensor of the shape `[filterH, filterW, inChannels, outChannels]`. Grouped convolution can be + performed by `inChannels = x.shape[0] // numGroups`. + kernel (`torch.FloatTensor`, *optional*): + FIR filter of the shape `[firH, firW]` or `[firN]` (separable). The default is `[1] * factor`, which + corresponds to nearest-neighbor upsampling. + factor (`int`, *optional*): Integer upsampling factor (default: 2). + gain (`float`, *optional*): Scaling factor for signal magnitude (default: 1.0). Returns: - output: Tensor of the shape `[N, C, H * factor, W * factor]` or `[N, H * factor, W * factor, C]`, and same - datatype as `hidden_states`. + output (`torch.FloatTensor`): + Tensor of the shape `[N, C, H * factor, W * factor]` or `[N, H * factor, W * factor, C]`, and same + datatype as `hidden_states`. 
""" assert isinstance(factor, int) and factor >= 1 @@ -392,7 +398,7 @@ def _upsample_2d( return output - def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor: if self.use_conv: height = self._upsample_2d(hidden_states, self.Conv2d_0.weight, kernel=self.fir_kernel) height = height + self.Conv2d_0.bias.reshape(1, -1, 1, 1) @@ -418,7 +424,7 @@ class FirDownsample2D(nn.Module): def __init__( self, - channels: int = None, + channels: Optional[int] = None, out_channels: Optional[int] = None, use_conv: bool = False, fir_kernel: Tuple[int, int, int, int] = (1, 3, 3, 1), @@ -433,30 +439,35 @@ def __init__( def _downsample_2d( self, - hidden_states: torch.Tensor, - weight: Optional[torch.Tensor] = None, + hidden_states: torch.FloatTensor, + weight: Optional[torch.FloatTensor] = None, kernel: Optional[torch.FloatTensor] = None, factor: int = 2, gain: float = 1, - ) -> torch.Tensor: + ) -> torch.FloatTensor: """Fused `Conv2d()` followed by `downsample_2d()`. Padding is performed only once at the beginning, not between the operations. The fused op is considerably more efficient than performing the same calculation using standard TensorFlow ops. It supports gradients of arbitrary order. Args: - hidden_states: Input tensor of the shape `[N, C, H, W]` or `[N, H, W, C]`. - weight: + hidden_states (`torch.FloatTensor`): + Input tensor of the shape `[N, C, H, W]` or `[N, H, W, C]`. + weight (`torch.FloatTensor`, *optional*): Weight tensor of the shape `[filterH, filterW, inChannels, outChannels]`. Grouped convolution can be performed by `inChannels = x.shape[0] // numGroups`. - kernel: FIR filter of the shape `[firH, firW]` or `[firN]` (separable). The default is `[1] * - factor`, which corresponds to average pooling. - factor: Integer downsampling factor (default: 2). - gain: Scaling factor for signal magnitude (default: 1.0). + kernel (`torch.FloatTensor`, *optional*): + FIR filter of the shape `[firH, firW]` or `[firN]` (separable). The default is `[1] * factor`, which + corresponds to average pooling. + factor (`int`, *optional*, default to `2`): + Integer downsampling factor. + gain (`float`, *optional*, default to `1.0`): + Scaling factor for signal magnitude. Returns: - output: Tensor of the shape `[N, C, H // factor, W // factor]` or `[N, H // factor, W // factor, C]`, and - same datatype as `x`. + output (`torch.FloatTensor`): + Tensor of the shape `[N, C, H // factor, W // factor]` or `[N, H // factor, W // factor, C]`, and same + datatype as `x`. """ assert isinstance(factor, int) and factor >= 1 @@ -492,7 +503,7 @@ def _downsample_2d( return output - def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + def forward(self, hidden_states: torch.FloatTensor) -> torch.FloatTensor: if self.use_conv: downsample_input = self._downsample_2d(hidden_states, weight=self.Conv2d_0.weight, kernel=self.fir_kernel) hidden_states = downsample_input + self.Conv2d_0.bias.reshape(1, -1, 1, 1) @@ -682,7 +693,9 @@ def __init__( in_channels, conv_2d_out_channels, kernel_size=1, stride=1, padding=0, bias=conv_shortcut_bias ) - def forward(self, input_tensor, temb, scale: float = 1.0): + def forward( + self, input_tensor: torch.FloatTensor, temb: torch.FloatTensor, scale: float = 1.0 + ) -> torch.FloatTensor: hidden_states = input_tensor if self.time_embedding_norm == "ada_group" or self.time_embedding_norm == "spatial": @@ -778,7 +791,7 @@ class Conv1dBlock(nn.Module): out_channels (`int`): Number of output channels. 
kernel_size (`int` or `tuple`): Size of the convolving kernel. n_groups (`int`, default `8`): Number of groups to separate the channels into. - activation (`str`, defaults `mish`): Name of the activation function. + activation (`str`, defaults to `mish`): Name of the activation function. """ def __init__( @@ -853,8 +866,8 @@ def forward(self, inputs: torch.Tensor, t: torch.Tensor) -> torch.Tensor: def upsample_2d( - hidden_states: torch.Tensor, kernel: Optional[torch.FloatTensor] = None, factor: int = 2, gain: float = 1 -) -> torch.Tensor: + hidden_states: torch.FloatTensor, kernel: Optional[torch.FloatTensor] = None, factor: int = 2, gain: float = 1 +) -> torch.FloatTensor: r"""Upsample2D a batch of 2D images with the given filter. Accepts a batch of 2D images of the shape `[N, C, H, W]` or `[N, H, W, C]` and upsamples each image with the given filter. The filter is normalized so that if the input pixels are constant, they will be scaled by the specified @@ -862,14 +875,19 @@ def upsample_2d( a: multiple of the upsampling factor. Args: - hidden_states: Input tensor of the shape `[N, C, H, W]` or `[N, H, W, C]`. - kernel: FIR filter of the shape `[firH, firW]` or `[firN]` - (separable). The default is `[1] * factor`, which corresponds to nearest-neighbor upsampling. - factor: Integer upsampling factor (default: 2). - gain: Scaling factor for signal magnitude (default: 1.0). + hidden_states (`torch.FloatTensor`): + Input tensor of the shape `[N, C, H, W]` or `[N, H, W, C]`. + kernel (`torch.FloatTensor`, *optional*): + FIR filter of the shape `[firH, firW]` or `[firN]` (separable). The default is `[1] * factor`, which + corresponds to nearest-neighbor upsampling. + factor (`int`, *optional*, default to `2`): + Integer upsampling factor. + gain (`float`, *optional*, default to `1.0`): + Scaling factor for signal magnitude (default: 1.0). Returns: - output: Tensor of the shape `[N, C, H * factor, W * factor]` + output (`torch.FloatTensor`): + Tensor of the shape `[N, C, H * factor, W * factor]` """ assert isinstance(factor, int) and factor >= 1 if kernel is None: @@ -892,8 +910,8 @@ def upsample_2d( def downsample_2d( - hidden_states: torch.Tensor, kernel: Optional[torch.FloatTensor] = None, factor: int = 2, gain: float = 1 -) -> torch.Tensor: + hidden_states: torch.FloatTensor, kernel: Optional[torch.FloatTensor] = None, factor: int = 2, gain: float = 1 +) -> torch.FloatTensor: r"""Downsample2D a batch of 2D images with the given filter. Accepts a batch of 2D images of the shape `[N, C, H, W]` or `[N, H, W, C]` and downsamples each image with the given filter. The filter is normalized so that if the input pixels are constant, they will be scaled by the @@ -901,14 +919,19 @@ def downsample_2d( shape is a multiple of the downsampling factor. Args: - hidden_states: Input tensor of the shape `[N, C, H, W]` or `[N, H, W, C]`. - kernel: FIR filter of the shape `[firH, firW]` or `[firN]` - (separable). The default is `[1] * factor`, which corresponds to average pooling. - factor: Integer downsampling factor (default: 2). - gain: Scaling factor for signal magnitude (default: 1.0). + hidden_states (`torch.FloatTensor`) + Input tensor of the shape `[N, C, H, W]` or `[N, H, W, C]`. + kernel (`torch.FloatTensor`, *optional*): + FIR filter of the shape `[firH, firW]` or `[firN]` (separable). The default is `[1] * factor`, which + corresponds to average pooling. + factor (`int`, *optional*, default to `2`): + Integer downsampling factor. 
+ gain (`float`, *optional*, default to `1.0`): + Scaling factor for signal magnitude. Returns: - output: Tensor of the shape `[N, C, H // factor, W // factor]` + output (`torch.FloatTensor`): + Tensor of the shape `[N, C, H // factor, W // factor]` """ assert isinstance(factor, int) and factor >= 1 @@ -985,7 +1008,7 @@ class TemporalConvLayer(nn.Module): dropout (`float`, *optional*, defaults to `0.0`): The dropout probability to use. """ - def __init__(self, in_dim: int, out_dim: Optional[int] = None, dropout: float = 0.0): + def __init__(self, in_dim: int, out_dim: Optional[int] = None, dropout: float = 0.0, norm_num_groups: int = 32): super().__init__() out_dim = out_dim or in_dim self.in_dim = in_dim @@ -993,22 +1016,22 @@ def __init__(self, in_dim: int, out_dim: Optional[int] = None, dropout: float = # conv layers self.conv1 = nn.Sequential( - nn.GroupNorm(32, in_dim), nn.SiLU(), nn.Conv3d(in_dim, out_dim, (3, 1, 1), padding=(1, 0, 0)) + nn.GroupNorm(norm_num_groups, in_dim), nn.SiLU(), nn.Conv3d(in_dim, out_dim, (3, 1, 1), padding=(1, 0, 0)) ) self.conv2 = nn.Sequential( - nn.GroupNorm(32, out_dim), + nn.GroupNorm(norm_num_groups, out_dim), nn.SiLU(), nn.Dropout(dropout), nn.Conv3d(out_dim, in_dim, (3, 1, 1), padding=(1, 0, 0)), ) self.conv3 = nn.Sequential( - nn.GroupNorm(32, out_dim), + nn.GroupNorm(norm_num_groups, out_dim), nn.SiLU(), nn.Dropout(dropout), nn.Conv3d(out_dim, in_dim, (3, 1, 1), padding=(1, 0, 0)), ) self.conv4 = nn.Sequential( - nn.GroupNorm(32, out_dim), + nn.GroupNorm(norm_num_groups, out_dim), nn.SiLU(), nn.Dropout(dropout), nn.Conv3d(out_dim, in_dim, (3, 1, 1), padding=(1, 0, 0)), diff --git a/src/diffusers/models/unet_2d_condition_flax.py b/src/diffusers/models/unet_2d_condition_flax.py index 770cbf09ccac..13f53e16e7ac 100644 --- a/src/diffusers/models/unet_2d_condition_flax.py +++ b/src/diffusers/models/unet_2d_condition_flax.py @@ -100,18 +100,18 @@ class FlaxUNet2DConditionModel(nn.Module, FlaxModelMixin, ConfigMixin): sample_size: int = 32 in_channels: int = 4 out_channels: int = 4 - down_block_types: Tuple[str] = ( + down_block_types: Tuple[str, ...] = ( "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D", ) - up_block_types: Tuple[str] = ("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D") + up_block_types: Tuple[str, ...] = ("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D") only_cross_attention: Union[bool, Tuple[bool]] = False - block_out_channels: Tuple[int] = (320, 640, 1280, 1280) + block_out_channels: Tuple[int, ...] 
= (320, 640, 1280, 1280) layers_per_block: int = 2 - attention_head_dim: Union[int, Tuple[int]] = 8 - num_attention_heads: Optional[Union[int, Tuple[int]]] = None + attention_head_dim: Union[int, Tuple[int, ...]] = 8 + num_attention_heads: Optional[Union[int, Tuple[int, ...]]] = None cross_attention_dim: int = 1280 dropout: float = 0.0 use_linear_projection: bool = False @@ -120,7 +120,7 @@ class FlaxUNet2DConditionModel(nn.Module, FlaxModelMixin, ConfigMixin): freq_shift: int = 0 use_memory_efficient_attention: bool = False split_head_dim: bool = False - transformer_layers_per_block: Union[int, Tuple[int]] = 1 + transformer_layers_per_block: Union[int, Tuple[int, ...]] = 1 addition_embed_type: Optional[str] = None addition_time_embed_dim: Optional[int] = None addition_embed_type_num_heads: int = 64 @@ -158,7 +158,7 @@ def init_weights(self, rng: jax.Array) -> FrozenDict: } return self.init(rngs, sample, timesteps, encoder_hidden_states, added_cond_kwargs)["params"] - def setup(self): + def setup(self) -> None: block_out_channels = self.block_out_channels time_embed_dim = block_out_channels[0] * 4 @@ -320,15 +320,15 @@ def setup(self): def __call__( self, - sample, - timesteps, - encoder_hidden_states, + sample: jnp.ndarray, + timesteps: Union[jnp.ndarray, float, int], + encoder_hidden_states: jnp.ndarray, added_cond_kwargs: Optional[Union[Dict, FrozenDict]] = None, - down_block_additional_residuals=None, - mid_block_additional_residual=None, + down_block_additional_residuals: Optional[Tuple[jnp.ndarray, ...]] = None, + mid_block_additional_residual: Optional[jnp.ndarray] = None, return_dict: bool = True, train: bool = False, - ) -> Union[FlaxUNet2DConditionOutput, Tuple]: + ) -> Union[FlaxUNet2DConditionOutput, Tuple[jnp.ndarray]]: r""" Args: sample (`jnp.ndarray`): (batch, channel, height, width) noisy inputs tensor diff --git a/src/diffusers/models/unet_3d_blocks.py b/src/diffusers/models/unet_3d_blocks.py index e8e42cf5615f..767ab846d5dc 100644 --- a/src/diffusers/models/unet_3d_blocks.py +++ b/src/diffusers/models/unet_3d_blocks.py @@ -12,7 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. 
-from typing import Any, Dict, Optional, Tuple +from typing import Any, Dict, Optional, Tuple, Union import torch from torch import nn @@ -26,26 +26,26 @@ def get_down_block( - down_block_type, - num_layers, - in_channels, - out_channels, - temb_channels, - add_downsample, - resnet_eps, - resnet_act_fn, - num_attention_heads, - resnet_groups=None, - cross_attention_dim=None, - downsample_padding=None, - dual_cross_attention=False, - use_linear_projection=True, - only_cross_attention=False, - upcast_attention=False, - resnet_time_scale_shift="default", - temporal_num_attention_heads=8, - temporal_max_seq_length=32, -): + down_block_type: str, + num_layers: int, + in_channels: int, + out_channels: int, + temb_channels: int, + add_downsample: bool, + resnet_eps: float, + resnet_act_fn: str, + num_attention_heads: int, + resnet_groups: Optional[int] = None, + cross_attention_dim: Optional[int] = None, + downsample_padding: Optional[int] = None, + dual_cross_attention: bool = False, + use_linear_projection: bool = True, + only_cross_attention: bool = False, + upcast_attention: bool = False, + resnet_time_scale_shift: str = "default", + temporal_num_attention_heads: int = 8, + temporal_max_seq_length: int = 32, +) -> Union["DownBlock3D", "CrossAttnDownBlock3D", "DownBlockMotion", "CrossAttnDownBlockMotion"]: if down_block_type == "DownBlock3D": return DownBlock3D( num_layers=num_layers, @@ -123,28 +123,28 @@ def get_down_block( def get_up_block( - up_block_type, - num_layers, - in_channels, - out_channels, - prev_output_channel, - temb_channels, - add_upsample, - resnet_eps, - resnet_act_fn, - num_attention_heads, - resolution_idx=None, - resnet_groups=None, - cross_attention_dim=None, - dual_cross_attention=False, - use_linear_projection=True, - only_cross_attention=False, - upcast_attention=False, - resnet_time_scale_shift="default", - temporal_num_attention_heads=8, - temporal_cross_attention_dim=None, - temporal_max_seq_length=32, -): + up_block_type: str, + num_layers: int, + in_channels: int, + out_channels: int, + prev_output_channel: int, + temb_channels: int, + add_upsample: bool, + resnet_eps: float, + resnet_act_fn: str, + num_attention_heads: int, + resolution_idx: Optional[int] = None, + resnet_groups: Optional[int] = None, + cross_attention_dim: Optional[int] = None, + dual_cross_attention: bool = False, + use_linear_projection: bool = True, + only_cross_attention: bool = False, + upcast_attention: bool = False, + resnet_time_scale_shift: str = "default", + temporal_num_attention_heads: int = 8, + temporal_cross_attention_dim: Optional[int] = None, + temporal_max_seq_length: int = 32, +) -> Union["UpBlock3D", "CrossAttnUpBlock3D", "UpBlockMotion", "CrossAttnUpBlockMotion"]: if up_block_type == "UpBlock3D": return UpBlock3D( num_layers=num_layers, @@ -236,12 +236,12 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - output_scale_factor=1.0, - cross_attention_dim=1280, - dual_cross_attention=False, - use_linear_projection=True, - upcast_attention=False, + num_attention_heads: int = 1, + output_scale_factor: float = 1.0, + cross_attention_dim: int = 1280, + dual_cross_attention: bool = False, + use_linear_projection: bool = True, + upcast_attention: bool = False, ): super().__init__() @@ -269,6 +269,7 @@ def __init__( in_channels, in_channels, dropout=0.1, + norm_num_groups=resnet_groups, ) ] attentions = [] @@ -316,6 +317,7 @@ def __init__( in_channels, in_channels, dropout=0.1, + 
norm_num_groups=resnet_groups, ) ) @@ -326,13 +328,13 @@ def __init__( def forward( - hidden_states, - temb=None, - encoder_hidden_states=None, - attention_mask=None, - num_frames=1, - cross_attention_kwargs=None, - ): + hidden_states: torch.FloatTensor, + temb: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + num_frames: int = 1, + cross_attention_kwargs: Optional[Dict[str, Any]] = None, + ) -> torch.FloatTensor: hidden_states = self.resnets[0](hidden_states, temb) hidden_states = self.temp_convs[0](hidden_states, num_frames=num_frames) for attn, temp_attn, resnet, temp_conv in zip( @@ -366,15 +368,15 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - cross_attention_dim=1280, - output_scale_factor=1.0, - downsample_padding=1, - add_downsample=True, - dual_cross_attention=False, - use_linear_projection=False, - only_cross_attention=False, - upcast_attention=False, + num_attention_heads: int = 1, + cross_attention_dim: int = 1280, + output_scale_factor: float = 1.0, + downsample_padding: int = 1, + add_downsample: bool = True, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, + only_cross_attention: bool = False, + upcast_attention: bool = False, ): super().__init__() resnets = [] @@ -406,6 +408,7 @@ def __init__( out_channels, out_channels, dropout=0.1, + norm_num_groups=resnet_groups, ) ) attentions.append( @@ -451,13 +454,13 @@ def __init__( def forward( - hidden_states, - temb=None, - encoder_hidden_states=None, - attention_mask=None, - num_frames=1, - cross_attention_kwargs=None, - ): + hidden_states: torch.FloatTensor, + temb: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + num_frames: int = 1, + cross_attention_kwargs: Optional[Dict[str, Any]] = None, + ) -> Union[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: # TODO(Patrick, William) - attention mask is not used output_states = () @@ -500,9 +503,9 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_downsample=True, - downsample_padding=1, + output_scale_factor: float = 1.0, + add_downsample: bool = True, + downsample_padding: int = 1, ): super().__init__() resnets = [] @@ -529,6 +532,7 @@ def __init__( out_channels, out_channels, dropout=0.1, + norm_num_groups=resnet_groups, ) ) @@ -548,7 +552,9 @@ def __init__( self.gradient_checkpointing = False - def forward(self, hidden_states, temb=None, num_frames=1): + def forward( + self, hidden_states: torch.FloatTensor, temb: Optional[torch.FloatTensor] = None, num_frames: int = 1 + ) -> Union[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: output_states = () for resnet, temp_conv in zip(self.resnets, self.temp_convs): @@ -580,15 +586,15 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - cross_attention_dim=1280, - output_scale_factor=1.0, - add_upsample=True, - dual_cross_attention=False, - use_linear_projection=False, - only_cross_attention=False, - upcast_attention=False, - resolution_idx=None, + num_attention_heads: int = 1, + cross_attention_dim: int = 1280, + output_scale_factor: float = 1.0, + add_upsample: bool = True, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, +
only_cross_attention: bool = False, + upcast_attention: bool = False, + resolution_idx: Optional[int] = None, ): super().__init__() resnets = [] @@ -622,6 +628,7 @@ def __init__( out_channels, out_channels, dropout=0.1, + norm_num_groups=resnet_groups, ) ) attentions.append( @@ -662,15 +669,15 @@ def __init__( def forward( - hidden_states, - res_hidden_states_tuple, - temb=None, - encoder_hidden_states=None, - upsample_size=None, - attention_mask=None, - num_frames=1, - cross_attention_kwargs=None, - ): + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + upsample_size: Optional[int] = None, + attention_mask: Optional[torch.FloatTensor] = None, + num_frames: int = 1, + cross_attention_kwargs: Optional[Dict[str, Any]] = None, + ) -> torch.FloatTensor: is_freeu_enabled = ( getattr(self, "s1", None) and getattr(self, "s2", None) @@ -733,9 +740,9 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_upsample=True, - resolution_idx=None, + output_scale_factor: float = 1.0, + add_upsample: bool = True, + resolution_idx: Optional[int] = None, ): super().__init__() resnets = [] @@ -764,6 +771,7 @@ def __init__( out_channels, out_channels, dropout=0.1, + norm_num_groups=resnet_groups, ) ) @@ -778,7 +786,14 @@ def __init__( self.gradient_checkpointing = False self.resolution_idx = resolution_idx - def forward(self, hidden_states, res_hidden_states_tuple, temb=None, upsample_size=None, num_frames=1): + def forward( + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + upsample_size: Optional[int] = None, + num_frames: int = 1, + ) -> torch.FloatTensor: is_freeu_enabled = ( getattr(self, "s1", None) and getattr(self, "s2", None) @@ -827,12 +842,12 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_downsample=True, - downsample_padding=1, - temporal_num_attention_heads=1, - temporal_cross_attention_dim=None, - temporal_max_seq_length=32, + output_scale_factor: float = 1.0, + add_downsample: bool = True, + downsample_padding: int = 1, + temporal_num_attention_heads: int = 1, + temporal_cross_attention_dim: Optional[int] = None, + temporal_max_seq_length: int = 32, ): super().__init__() resnets = [] @@ -884,7 +899,13 @@ def __init__( self.gradient_checkpointing = False - def forward(self, hidden_states, temb=None, scale: float = 1.0, num_frames=1): + def forward( + self, + hidden_states: torch.FloatTensor, + temb: Optional[torch.FloatTensor] = None, + scale: float = 1.0, + num_frames: int = 1, + ) -> Union[torch.FloatTensor, Tuple[torch.FloatTensor, ...]]: output_states = () blocks = zip(self.resnets, self.motion_modules) @@ -938,19 +959,19 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - cross_attention_dim=1280, - output_scale_factor=1.0, - downsample_padding=1, - add_downsample=True, - dual_cross_attention=False, - use_linear_projection=False, - only_cross_attention=False, - upcast_attention=False, - attention_type="default", - temporal_cross_attention_dim=None, - temporal_num_attention_heads=8, - temporal_max_seq_length=32, + num_attention_heads: int = 1, + cross_attention_dim: int = 1280, + output_scale_factor: float = 1.0, +
downsample_padding: int = 1, + add_downsample: bool = True, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, + only_cross_attention: bool = False, + upcast_attention: bool = False, + attention_type: str = "default", + temporal_cross_attention_dim: Optional[int] = None, + temporal_num_attention_heads: int = 8, + temporal_max_seq_length: int = 32, ): super().__init__() resnets = [] @@ -1037,14 +1058,14 @@ def __init__( def forward( self, - hidden_states, - temb=None, - encoder_hidden_states=None, - attention_mask=None, - num_frames=1, - encoder_attention_mask=None, - cross_attention_kwargs=None, - additional_residuals=None, + hidden_states: torch.FloatTensor, + temb: Optional[torch.FloatTensor] = None, + encoder_hidden_states: Optional[torch.FloatTensor] = None, + attention_mask: Optional[torch.FloatTensor] = None, + num_frames: int = 1, + encoder_attention_mask: Optional[torch.FloatTensor] = None, + cross_attention_kwargs: Optional[Dict[str, Any]] = None, + additional_residuals: Optional[torch.FloatTensor] = None, ): output_states = () @@ -1115,7 +1136,7 @@ def __init__( out_channels: int, prev_output_channel: int, temb_channels: int, - resolution_idx: int = None, + resolution_idx: Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, transformer_layers_per_block: int = 1, @@ -1124,18 +1145,18 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - cross_attention_dim=1280, - output_scale_factor=1.0, - add_upsample=True, - dual_cross_attention=False, - use_linear_projection=False, - only_cross_attention=False, - upcast_attention=False, - attention_type="default", - temporal_cross_attention_dim=None, - temporal_num_attention_heads=8, - temporal_max_seq_length=32, + num_attention_heads: int = 1, + cross_attention_dim: int = 1280, + output_scale_factor: float = 1.0, + add_upsample: bool = True, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, + only_cross_attention: bool = False, + upcast_attention: bool = False, + attention_type: str = "default", + temporal_cross_attention_dim: Optional[int] = None, + temporal_num_attention_heads: int = 8, + temporal_max_seq_length: int = 32, ): super().__init__() resnets = [] @@ -1226,8 +1247,8 @@ def forward( upsample_size: Optional[int] = None, attention_mask: Optional[torch.FloatTensor] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - num_frames=1, - ): + num_frames: int = 1, + ) -> torch.FloatTensor: lora_scale = cross_attention_kwargs.get("scale", 1.0) if cross_attention_kwargs is not None else 1.0 is_freeu_enabled = ( getattr(self, "s1", None) @@ -1311,7 +1332,7 @@ def __init__( prev_output_channel: int, out_channels: int, temb_channels: int, - resolution_idx: int = None, + resolution_idx: Optional[int] = None, dropout: float = 0.0, num_layers: int = 1, resnet_eps: float = 1e-6, @@ -1319,12 +1340,12 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - output_scale_factor=1.0, - add_upsample=True, - temporal_norm_num_groups=32, - temporal_cross_attention_dim=None, - temporal_num_attention_heads=8, - temporal_max_seq_length=32, + output_scale_factor: float = 1.0, + add_upsample: bool = True, + temporal_norm_num_groups: int = 32, + temporal_cross_attention_dim: Optional[int] = None, + temporal_num_attention_heads: int = 8, + temporal_max_seq_length: int = 32, ): super().__init__() resnets = [] @@ -1375,8 +1396,14 @@ def __init__( 
self.resolution_idx = resolution_idx def forward( - self, hidden_states, res_hidden_states_tuple, temb=None, upsample_size=None, scale: float = 1.0, num_frames=1 - ): + self, + hidden_states: torch.FloatTensor, + res_hidden_states_tuple: Tuple[torch.FloatTensor, ...], + temb: Optional[torch.FloatTensor] = None, + upsample_size: Optional[int] = None, + scale: float = 1.0, + num_frames: int = 1, + ) -> torch.FloatTensor: is_freeu_enabled = ( getattr(self, "s1", None) and getattr(self, "s2", None) @@ -1451,16 +1478,16 @@ def __init__( resnet_act_fn: str = "swish", resnet_groups: int = 32, resnet_pre_norm: bool = True, - num_attention_heads=1, - output_scale_factor=1.0, - cross_attention_dim=1280, - dual_cross_attention=False, - use_linear_projection=False, - upcast_attention=False, - attention_type="default", - temporal_num_attention_heads=1, - temporal_cross_attention_dim=None, - temporal_max_seq_length=32, + num_attention_heads: int = 1, + output_scale_factor: float = 1.0, + cross_attention_dim: int = 1280, + dual_cross_attention: bool = False, + use_linear_projection: bool = False, + upcast_attention: bool = False, + attention_type: str = "default", + temporal_num_attention_heads: int = 1, + temporal_cross_attention_dim: Optional[int] = None, + temporal_max_seq_length: int = 32, ): super().__init__() @@ -1554,7 +1581,7 @@ def forward( attention_mask: Optional[torch.FloatTensor] = None, cross_attention_kwargs: Optional[Dict[str, Any]] = None, encoder_attention_mask: Optional[torch.FloatTensor] = None, - num_frames=1, + num_frames: int = 1, ) -> torch.FloatTensor: lora_scale = cross_attention_kwargs.get("scale", 1.0) if cross_attention_kwargs is not None else 1.0 hidden_states = self.resnets[0](hidden_states, temb, scale=lora_scale) diff --git a/src/diffusers/models/unet_3d_condition.py b/src/diffusers/models/unet_3d_condition.py index 7356fb577584..c6710256ef39 100644 --- a/src/diffusers/models/unet_3d_condition.py +++ b/src/diffusers/models/unet_3d_condition.py @@ -98,14 +98,19 @@ def __init__( sample_size: Optional[int] = None, in_channels: int = 4, out_channels: int = 4, - down_block_types: Tuple[str] = ( + down_block_types: Tuple[str, ...] = ( "CrossAttnDownBlock3D", "CrossAttnDownBlock3D", "CrossAttnDownBlock3D", "DownBlock3D", ), - up_block_types: Tuple[str] = ("UpBlock3D", "CrossAttnUpBlock3D", "CrossAttnUpBlock3D", "CrossAttnUpBlock3D"), - block_out_channels: Tuple[int] = (320, 640, 1280, 1280), + up_block_types: Tuple[str, ...] = ( + "UpBlock3D", + "CrossAttnUpBlock3D", + "CrossAttnUpBlock3D", + "CrossAttnUpBlock3D", + ), + block_out_channels: Tuple[int, ...] = (320, 640, 1280, 1280), layers_per_block: int = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, @@ -173,6 +178,7 @@ def __init__( attention_head_dim=attention_head_dim, in_channels=block_out_channels[0], num_layers=1, + norm_num_groups=norm_num_groups, ) # class embedding @@ -301,7 +307,7 @@ def fn_recursive_add_processors(name: str, module: torch.nn.Module, processors: return processors # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_attention_slice - def set_attention_slice(self, slice_size): + def set_attention_slice(self, slice_size: Union[str, int, List[int]]) -> None: r""" Enable sliced attention computation.
@@ -403,7 +409,7 @@ def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor): for name, module in self.named_children(): fn_recursive_attn_processor(name, module, processor) - def enable_forward_chunking(self, chunk_size=None, dim=0): + def enable_forward_chunking(self, chunk_size: Optional[int] = None, dim: int = 0) -> None: """ Sets the attention processor to use [feed forward chunking](https://huggingface.co/blog/reformer#2-chunked-feed-forward-layers). @@ -459,7 +465,7 @@ def set_default_attn_processor(self): self.set_attn_processor(processor, _remove_lora=True) - def _set_gradient_checkpointing(self, module, value=False): + def _set_gradient_checkpointing(self, module, value: bool = False) -> None: if isinstance(module, (CrossAttnDownBlock3D, DownBlock3D, CrossAttnUpBlock3D, UpBlock3D)): module.gradient_checkpointing = value @@ -509,7 +515,7 @@ def forward( down_block_additional_residuals: Optional[Tuple[torch.Tensor]] = None, mid_block_additional_residual: Optional[torch.Tensor] = None, return_dict: bool = True, - ) -> Union[UNet3DConditionOutput, Tuple]: + ) -> Union[UNet3DConditionOutput, Tuple[torch.FloatTensor]]: r""" The [`UNet3DConditionModel`] forward method. diff --git a/src/diffusers/models/unet_motion_model.py b/src/diffusers/models/unet_motion_model.py index 5d528a34ec96..ab84b4de1395 100644 --- a/src/diffusers/models/unet_motion_model.py +++ b/src/diffusers/models/unet_motion_model.py @@ -50,14 +50,14 @@ class MotionModules(nn.Module): def __init__( self, - in_channels, - layers_per_block=2, - num_attention_heads=8, - attention_bias=False, - cross_attention_dim=None, - activation_fn="geglu", - norm_num_groups=32, - max_seq_length=32, + in_channels: int, + layers_per_block: int = 2, + num_attention_heads: int = 8, + attention_bias: bool = False, + cross_attention_dim: Optional[int] = None, + activation_fn: str = "geglu", + norm_num_groups: int = 32, + max_seq_length: int = 32, ): super().__init__() self.motion_modules = nn.ModuleList([]) @@ -82,13 +82,13 @@ class MotionAdapter(ModelMixin, ConfigMixin): @register_to_config def __init__( self, - block_out_channels=(320, 640, 1280, 1280), - motion_layers_per_block=2, - motion_mid_block_layers_per_block=1, - motion_num_attention_heads=8, - motion_norm_num_groups=32, - motion_max_seq_length=32, - use_motion_mid_block=True, + block_out_channels: Tuple[int, ...] = (320, 640, 1280, 1280), + motion_layers_per_block: int = 2, + motion_mid_block_layers_per_block: int = 1, + motion_num_attention_heads: int = 8, + motion_norm_num_groups: int = 32, + motion_max_seq_length: int = 32, + use_motion_mid_block: bool = True, ): """Container to store AnimateDiff Motion Modules @@ -182,29 +182,29 @@ def __init__( sample_size: Optional[int] = None, in_channels: int = 4, out_channels: int = 4, - down_block_types: Tuple[str] = ( + down_block_types: Tuple[str, ...] = ( "CrossAttnDownBlockMotion", "CrossAttnDownBlockMotion", "CrossAttnDownBlockMotion", "DownBlockMotion", ), - up_block_types: Tuple[str] = ( + up_block_types: Tuple[str, ...] = ( "UpBlockMotion", "CrossAttnUpBlockMotion", "CrossAttnUpBlockMotion", "CrossAttnUpBlockMotion", ), - block_out_channels: Tuple[int] = (320, 640, 1280, 1280), + block_out_channels: Tuple[int, ...] 
= (320, 640, 1280, 1280), layers_per_block: int = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, act_fn: str = "silu", - norm_num_groups: Optional[int] = 32, + norm_num_groups: int = 32, norm_eps: float = 1e-5, cross_attention_dim: int = 1280, use_linear_projection: bool = False, - num_attention_heads: Optional[Union[int, Tuple[int]]] = 8, - motion_max_seq_length: Optional[int] = 32, + num_attention_heads: Union[int, Tuple[int, ...]] = 8, + motion_max_seq_length: int = 32, motion_num_attention_heads: int = 8, use_motion_mid_block: int = True, ): @@ -448,7 +448,7 @@ def from_unet2d( return model - def freeze_unet2d_params(self): + def freeze_unet2d_params(self) -> None: """Freeze the weights of just the UNet2DConditionModel, and leave the motion modules unfrozen for fine tuning. """ @@ -472,9 +472,7 @@ def freeze_unet2d_params(self): for param in motion_modules.parameters(): param.requires_grad = True - return - - def load_motion_modules(self, motion_adapter: Optional[MotionAdapter]): + def load_motion_modules(self, motion_adapter: Optional[MotionAdapter]) -> None: for i, down_block in enumerate(motion_adapter.down_blocks): self.down_blocks[i].motion_modules.load_state_dict(down_block.motion_modules.state_dict()) for i, up_block in enumerate(motion_adapter.up_blocks): @@ -492,7 +490,7 @@ def save_motion_modules( variant: Optional[str] = None, push_to_hub: bool = False, **kwargs, - ): + ) -> None: state_dict = self.state_dict() # Extract all motion modules @@ -582,7 +580,7 @@ def fn_recursive_attn_processor(name: str, module: torch.nn.Module, processor): fn_recursive_attn_processor(name, module, processor) # Copied from diffusers.models.unet_3d_condition.UNet3DConditionModel.enable_forward_chunking - def enable_forward_chunking(self, chunk_size=None, dim=0): + def enable_forward_chunking(self, chunk_size: Optional[int] = None, dim: int = 0) -> None: """ Sets the attention processor to use [feed forward chunking](https://huggingface.co/blog/reformer#2-chunked-feed-forward-layers). @@ -612,7 +610,7 @@ def fn_recursive_feed_forward(module: torch.nn.Module, chunk_size: int, dim: int fn_recursive_feed_forward(module, chunk_size, dim) # Copied from diffusers.models.unet_3d_condition.UNet3DConditionModel.disable_forward_chunking - def disable_forward_chunking(self): + def disable_forward_chunking(self) -> None: def fn_recursive_feed_forward(module: torch.nn.Module, chunk_size: int, dim: int): if hasattr(module, "set_chunk_feed_forward"): module.set_chunk_feed_forward(chunk_size=chunk_size, dim=dim) @@ -624,7 +622,7 @@ def fn_recursive_feed_forward(module: torch.nn.Module, chunk_size: int, dim: int fn_recursive_feed_forward(module, None, 0) # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.set_default_attn_processor - def set_default_attn_processor(self): + def set_default_attn_processor(self) -> None: """ Disables custom attention processors and sets the default attention implementation. 
""" @@ -639,12 +637,12 @@ def set_default_attn_processor(self): self.set_attn_processor(processor, _remove_lora=True) - def _set_gradient_checkpointing(self, module, value=False): + def _set_gradient_checkpointing(self, module, value: bool = False) -> None: if isinstance(module, (CrossAttnDownBlockMotion, DownBlockMotion, CrossAttnUpBlockMotion, UpBlockMotion)): module.gradient_checkpointing = value # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.enable_freeu - def enable_freeu(self, s1, s2, b1, b2): + def enable_freeu(self, s1: float, s2: float, b1: float, b2: float) -> None: r"""Enables the FreeU mechanism from https://arxiv.org/abs/2309.11497. The suffixes after the scaling factors represent the stage blocks where they are being applied. @@ -669,7 +667,7 @@ def enable_freeu(self, s1, s2, b1, b2): setattr(upsample_block, "b2", b2) # Copied from diffusers.models.unet_2d_condition.UNet2DConditionModel.disable_freeu - def disable_freeu(self): + def disable_freeu(self) -> None: """Disables the FreeU mechanism.""" freeu_keys = {"s1", "s2", "b1", "b2"} for i, upsample_block in enumerate(self.up_blocks): @@ -688,7 +686,7 @@ def forward( down_block_additional_residuals: Optional[Tuple[torch.Tensor]] = None, mid_block_additional_residual: Optional[torch.Tensor] = None, return_dict: bool = True, - ) -> Union[UNet3DConditionOutput, Tuple]: + ) -> Union[UNet3DConditionOutput, Tuple[torch.Tensor]]: r""" The [`UNetMotionModel`] forward method. diff --git a/src/diffusers/models/vq_model.py b/src/diffusers/models/vq_model.py index 0c93b9142bea..f4a6c8fb227f 100644 --- a/src/diffusers/models/vq_model.py +++ b/src/diffusers/models/vq_model.py @@ -148,7 +148,9 @@ def decode( return DecoderOutput(sample=dec) - def forward(self, sample: torch.FloatTensor, return_dict: bool = True) -> Union[DecoderOutput, torch.FloatTensor]: + def forward( + self, sample: torch.FloatTensor, return_dict: bool = True + ) -> Union[DecoderOutput, Tuple[torch.FloatTensor, ...]]: r""" The [`VQModel`] forward method. diff --git a/src/diffusers/optimization.py b/src/diffusers/optimization.py index 46e6125a0f55..678d2c12cfe1 100644 --- a/src/diffusers/optimization.py +++ b/src/diffusers/optimization.py @@ -37,7 +37,7 @@ class SchedulerType(Enum): PIECEWISE_CONSTANT = "piecewise_constant" -def get_constant_schedule(optimizer: Optimizer, last_epoch: int = -1): +def get_constant_schedule(optimizer: Optimizer, last_epoch: int = -1) -> LambdaLR: """ Create a schedule with a constant learning rate, using the learning rate set in optimizer. @@ -53,7 +53,7 @@ def get_constant_schedule(optimizer: Optimizer, last_epoch: int = -1): return LambdaLR(optimizer, lambda _: 1, last_epoch=last_epoch) -def get_constant_schedule_with_warmup(optimizer: Optimizer, num_warmup_steps: int, last_epoch: int = -1): +def get_constant_schedule_with_warmup(optimizer: Optimizer, num_warmup_steps: int, last_epoch: int = -1) -> LambdaLR: """ Create a schedule with a constant learning rate preceded by a warmup period during which the learning rate increases linearly between 0 and the initial lr set in the optimizer. @@ -78,7 +78,7 @@ def lr_lambda(current_step: int): return LambdaLR(optimizer, lr_lambda, last_epoch=last_epoch) -def get_piecewise_constant_schedule(optimizer: Optimizer, step_rules: str, last_epoch: int = -1): +def get_piecewise_constant_schedule(optimizer: Optimizer, step_rules: str, last_epoch: int = -1) -> LambdaLR: """ Create a schedule with a constant learning rate, using the learning rate set in optimizer. 
@@ -120,7 +120,9 @@ def rule_func(steps: int) -> float: return LambdaLR(optimizer, rules_func, last_epoch=last_epoch) -def get_linear_schedule_with_warmup(optimizer, num_warmup_steps, num_training_steps, last_epoch=-1): +def get_linear_schedule_with_warmup( + optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, last_epoch: int = -1 +) -> LambdaLR: """ Create a schedule with a learning rate that decreases linearly from the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly from 0 to the initial lr set in the optimizer. @@ -151,7 +153,7 @@ def lr_lambda(current_step: int): def get_cosine_schedule_with_warmup( optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: float = 0.5, last_epoch: int = -1 -): +) -> LambdaLR: """ Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to 0, after a warmup period during which it increases linearly between 0 and the @@ -185,7 +187,7 @@ def lr_lambda(current_step): def get_cosine_with_hard_restarts_schedule_with_warmup( optimizer: Optimizer, num_warmup_steps: int, num_training_steps: int, num_cycles: int = 1, last_epoch: int = -1 -): +) -> LambdaLR: """ Create a schedule with a learning rate that decreases following the values of the cosine function between the initial lr set in the optimizer to 0, with several hard restarts, after a warmup period during which it increases @@ -219,8 +221,13 @@ def lr_lambda(current_step): def get_polynomial_decay_schedule_with_warmup( - optimizer, num_warmup_steps, num_training_steps, lr_end=1e-7, power=1.0, last_epoch=-1 -): + optimizer: Optimizer, + num_warmup_steps: int, + num_training_steps: int, + lr_end: float = 1e-7, + power: float = 1.0, + last_epoch: int = -1, +) -> LambdaLR: """ Create a schedule with a learning rate that decreases as a polynomial decay from the initial lr set in the optimizer to end lr defined by *lr_end*, after a warmup period during which it increases linearly from 0 to the @@ -288,7 +295,7 @@ def get_scheduler( num_cycles: int = 1, power: float = 1.0, last_epoch: int = -1, -): +) -> LambdaLR: """ Unified API to get any scheduler from its name. 
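A small sketch of the schedulers touched above, mainly to confirm the new `-> LambdaLR` return annotations match how the helpers are used in training loops; the model, optimizer, and step counts are arbitrary placeholders:

```python
import torch
from diffusers.optimization import get_cosine_schedule_with_warmup, get_scheduler

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Direct helper, now annotated as returning torch.optim.lr_scheduler.LambdaLR
lr_scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=1000
)

# Equivalent route through the unified get_scheduler API
lr_scheduler = get_scheduler(
    "cosine", optimizer=optimizer, num_warmup_steps=100, num_training_steps=1000
)

for _ in range(5):
    optimizer.step()
    lr_scheduler.step()
```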
diff --git a/src/diffusers/pipelines/controlnet/pipeline_controlnet.py b/src/diffusers/pipelines/controlnet/pipeline_controlnet.py index 04ca51b19f05..810a6c8a97de 100644 --- a/src/diffusers/pipelines/controlnet/pipeline_controlnet.py +++ b/src/diffusers/pipelines/controlnet/pipeline_controlnet.py @@ -726,6 +726,46 @@ def disable_freeu(self): """Disables the FreeU mechanism if enabled.""" self.unet.disable_freeu() + # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding + def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32): + """ + See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298 + + Args: + timesteps (`torch.Tensor`): + generate embedding vectors at these timesteps + embedding_dim (`int`, *optional*, defaults to 512): + dimension of the embeddings to generate + dtype: + data type of the generated embeddings + + Returns: + `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)` + """ + assert len(w.shape) == 1 + w = w * 1000.0 + + half_dim = embedding_dim // 2 + emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1) + emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb) + emb = w.to(dtype)[:, None] * emb[None, :] + emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1) + if embedding_dim % 2 == 1: # zero pad + emb = torch.nn.functional.pad(emb, (0, 1)) + assert emb.shape == (w.shape[0], embedding_dim) + return emb + + @property + def guidance_scale(self): + return self._guidance_scale + + # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2) + # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1` + # corresponds to doing no classifier free guidance. + @property + def do_classifier_free_guidance(self): + return self._guidance_scale > 1 and self.unet.config.time_cond_proj_dim is None + @torch.no_grad() @replace_example_docstring(EXAMPLE_DOC_STRING) def __call__( @@ -863,6 +903,8 @@ def __call__( control_guidance_end, ) + self._guidance_scale = guidance_scale + # 2. Define call parameters if prompt is not None and isinstance(prompt, str): batch_size = 1 @@ -872,10 +914,6 @@ def __call__( batch_size = prompt_embeds.shape[0] device = self._execution_device - # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2) - # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1` - # corresponds to doing no classifier free guidance. - do_classifier_free_guidance = guidance_scale > 1.0 if isinstance(controlnet, MultiControlNetModel) and isinstance(controlnet_conditioning_scale, float): controlnet_conditioning_scale = [controlnet_conditioning_scale] * len(controlnet.nets) @@ -895,7 +933,7 @@ def __call__( prompt, device, num_images_per_prompt, - do_classifier_free_guidance, + self.do_classifier_free_guidance, negative_prompt, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds, @@ -905,7 +943,7 @@ def __call__( # For classifier free guidance, we need to do two forward passes. # Here we concatenate the unconditional and text embeddings into a single batch # to avoid doing two forward passes - if do_classifier_free_guidance: + if self.do_classifier_free_guidance: prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds]) # 4. 
Prepare image @@ -918,7 +956,7 @@ def __call__( num_images_per_prompt=num_images_per_prompt, device=device, dtype=controlnet.dtype, - do_classifier_free_guidance=do_classifier_free_guidance, + do_classifier_free_guidance=self.do_classifier_free_guidance, guess_mode=guess_mode, ) height, width = image.shape[-2:] @@ -934,7 +972,7 @@ def __call__( num_images_per_prompt=num_images_per_prompt, device=device, dtype=controlnet.dtype, - do_classifier_free_guidance=do_classifier_free_guidance, + do_classifier_free_guidance=self.do_classifier_free_guidance, guess_mode=guess_mode, ) @@ -962,6 +1000,14 @@ def __call__( latents, ) + # 6.5 Optionally get Guidance Scale Embedding + timestep_cond = None + if self.unet.config.time_cond_proj_dim is not None: + guidance_scale_tensor = torch.tensor(self.guidance_scale - 1).repeat(batch_size * num_images_per_prompt) + timestep_cond = self.get_guidance_scale_embedding( + guidance_scale_tensor, embedding_dim=self.unet.config.time_cond_proj_dim + ).to(device=device, dtype=latents.dtype) + # 7. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta) @@ -986,11 +1032,11 @@ def __call__( if (is_unet_compiled and is_controlnet_compiled) and is_torch_higher_equal_2_1: torch._inductor.cudagraph_mark_step_begin() # expand the latents if we are doing classifier free guidance - latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents + latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents latent_model_input = self.scheduler.scale_model_input(latent_model_input, t) # controlnet(s) inference - if guess_mode and do_classifier_free_guidance: + if guess_mode and self.do_classifier_free_guidance: # Infer ControlNet only for the conditional batch. control_model_input = latents control_model_input = self.scheduler.scale_model_input(control_model_input, t) @@ -1017,7 +1063,7 @@ def __call__( return_dict=False, ) - if guess_mode and do_classifier_free_guidance: + if guess_mode and self.do_classifier_free_guidance: # Infered ControlNet only for the conditional batch. # To apply the output of ControlNet to both the unconditional and conditional batches, # add 0 to the unconditional batch to keep it unchanged. 
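For reference, a standalone sketch that mirrors the `get_guidance_scale_embedding` helper added above (it is only used when `unet.config.time_cond_proj_dim` is set, e.g. for LCM-style distilled UNets); the batch size and embedding dimension below are arbitrary:

```python
import torch


def guidance_scale_embedding(w: torch.Tensor, embedding_dim: int = 512, dtype=torch.float32) -> torch.Tensor:
    # Sinusoidal embedding of the guidance weight w, mirroring the helper copied into the pipeline.
    assert len(w.shape) == 1
    w = w * 1000.0
    half_dim = embedding_dim // 2
    emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1)
    emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb)
    emb = w.to(dtype)[:, None] * emb[None, :]
    emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1)
    if embedding_dim % 2 == 1:  # zero pad
        emb = torch.nn.functional.pad(emb, (0, 1))
    return emb


# Step 6.5 above passes `guidance_scale - 1`, repeated once per generated image.
w = torch.tensor(7.5 - 1.0).repeat(2)
print(guidance_scale_embedding(w, embedding_dim=256).shape)  # torch.Size([2, 256])
```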
@@ -1029,6 +1075,7 @@ def __call__( latent_model_input, t, encoder_hidden_states=prompt_embeds, + timestep_cond=timestep_cond, cross_attention_kwargs=cross_attention_kwargs, down_block_additional_residuals=down_block_res_samples, mid_block_additional_residual=mid_block_res_sample, @@ -1036,7 +1083,7 @@ def __call__( )[0] # perform guidance - if do_classifier_free_guidance: + if self.do_classifier_free_guidance: noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) diff --git a/src/diffusers/pipelines/controlnet/pipeline_controlnet_sd_xl.py b/src/diffusers/pipelines/controlnet/pipeline_controlnet_sd_xl.py index d6278c4f046a..4a54957af5a5 100644 --- a/src/diffusers/pipelines/controlnet/pipeline_controlnet_sd_xl.py +++ b/src/diffusers/pipelines/controlnet/pipeline_controlnet_sd_xl.py @@ -791,6 +791,46 @@ def disable_freeu(self): """Disables the FreeU mechanism if enabled.""" self.unet.disable_freeu() + # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding + def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32): + """ + See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298 + + Args: + timesteps (`torch.Tensor`): + generate embedding vectors at these timesteps + embedding_dim (`int`, *optional*, defaults to 512): + dimension of the embeddings to generate + dtype: + data type of the generated embeddings + + Returns: + `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)` + """ + assert len(w.shape) == 1 + w = w * 1000.0 + + half_dim = embedding_dim // 2 + emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1) + emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb) + emb = w.to(dtype)[:, None] * emb[None, :] + emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1) + if embedding_dim % 2 == 1: # zero pad + emb = torch.nn.functional.pad(emb, (0, 1)) + assert emb.shape == (w.shape[0], embedding_dim) + return emb + + @property + def guidance_scale(self): + return self._guidance_scale + + # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2) + # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1` + # corresponds to doing no classifier free guidance. + @property + def do_classifier_free_guidance(self): + return self._guidance_scale > 1 and self.unet.config.time_cond_proj_dim is None + @torch.no_grad() @replace_example_docstring(EXAMPLE_DOC_STRING) def __call__( @@ -986,6 +1026,8 @@ def __call__( control_guidance_end, ) + self._guidance_scale = guidance_scale + # 2. Define call parameters if prompt is not None and isinstance(prompt, str): batch_size = 1 @@ -995,10 +1037,6 @@ def __call__( batch_size = prompt_embeds.shape[0] device = self._execution_device - # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2) - # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1` - # corresponds to doing no classifier free guidance. 
- do_classifier_free_guidance = guidance_scale > 1.0 if isinstance(controlnet, MultiControlNetModel) and isinstance(controlnet_conditioning_scale, float): controlnet_conditioning_scale = [controlnet_conditioning_scale] * len(controlnet.nets) @@ -1024,7 +1062,7 @@ def __call__( prompt_2, device, num_images_per_prompt, - do_classifier_free_guidance, + self.do_classifier_free_guidance, negative_prompt, negative_prompt_2, prompt_embeds=prompt_embeds, @@ -1045,7 +1083,7 @@ def __call__( num_images_per_prompt=num_images_per_prompt, device=device, dtype=controlnet.dtype, - do_classifier_free_guidance=do_classifier_free_guidance, + do_classifier_free_guidance=self.do_classifier_free_guidance, guess_mode=guess_mode, ) height, width = image.shape[-2:] @@ -1061,7 +1099,7 @@ def __call__( num_images_per_prompt=num_images_per_prompt, device=device, dtype=controlnet.dtype, - do_classifier_free_guidance=do_classifier_free_guidance, + do_classifier_free_guidance=self.do_classifier_free_guidance, guess_mode=guess_mode, ) @@ -1089,6 +1127,14 @@ def __call__( latents, ) + # 6.5 Optionally get Guidance Scale Embedding + timestep_cond = None + if self.unet.config.time_cond_proj_dim is not None: + guidance_scale_tensor = torch.tensor(self.guidance_scale - 1).repeat(batch_size * num_images_per_prompt) + timestep_cond = self.get_guidance_scale_embedding( + guidance_scale_tensor, embedding_dim=self.unet.config.time_cond_proj_dim + ).to(device=device, dtype=latents.dtype) + # 7. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta) @@ -1133,7 +1179,7 @@ def __call__( else: negative_add_time_ids = add_time_ids - if do_classifier_free_guidance: + if self.do_classifier_free_guidance: prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0) add_text_embeds = torch.cat([negative_pooled_prompt_embeds, add_text_embeds], dim=0) add_time_ids = torch.cat([negative_add_time_ids, add_time_ids], dim=0) @@ -1154,13 +1200,13 @@ def __call__( if (is_unet_compiled and is_controlnet_compiled) and is_torch_higher_equal_2_1: torch._inductor.cudagraph_mark_step_begin() # expand the latents if we are doing classifier free guidance - latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents + latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents latent_model_input = self.scheduler.scale_model_input(latent_model_input, t) added_cond_kwargs = {"text_embeds": add_text_embeds, "time_ids": add_time_ids} # controlnet(s) inference - if guess_mode and do_classifier_free_guidance: + if guess_mode and self.do_classifier_free_guidance: # Infer ControlNet only for the conditional batch. control_model_input = latents control_model_input = self.scheduler.scale_model_input(control_model_input, t) @@ -1193,7 +1239,7 @@ def __call__( return_dict=False, ) - if guess_mode and do_classifier_free_guidance: + if guess_mode and self.do_classifier_free_guidance: # Infered ControlNet only for the conditional batch. # To apply the output of ControlNet to both the unconditional and conditional batches, # add 0 to the unconditional batch to keep it unchanged. 
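And, for completeness, the classifier-free guidance step that the new `do_classifier_free_guidance` property gates in both pipelines, shown on dummy tensors (shapes and the guidance scale are illustrative only):

```python
import torch

guidance_scale = 7.5
# In the pipelines above this also checks that unet.config.time_cond_proj_dim is None.
do_classifier_free_guidance = guidance_scale > 1

latents = torch.randn(1, 4, 64, 64)
latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents

# Stand-in for the UNet prediction on the doubled batch.
noise_pred = torch.randn_like(latent_model_input)

if do_classifier_free_guidance:
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

print(noise_pred.shape)  # torch.Size([1, 4, 64, 64])
```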
@@ -1205,6 +1251,7 @@ def __call__( latent_model_input, t, encoder_hidden_states=prompt_embeds, + timestep_cond=timestep_cond, cross_attention_kwargs=cross_attention_kwargs, down_block_additional_residuals=down_block_res_samples, mid_block_additional_residual=mid_block_res_sample, @@ -1213,7 +1260,7 @@ def __call__( )[0] # perform guidance - if do_classifier_free_guidance: + if self.do_classifier_free_guidance: noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) diff --git a/src/diffusers/pipelines/pipeline_utils.py b/src/diffusers/pipelines/pipeline_utils.py index 6437732d0315..674ff4d9e3d3 100644 --- a/src/diffusers/pipelines/pipeline_utils.py +++ b/src/diffusers/pipelines/pipeline_utils.py @@ -49,6 +49,7 @@ get_class_from_dynamic_module, is_accelerate_available, is_accelerate_version, + is_peft_available, is_torch_version, is_transformers_available, logging, @@ -270,6 +271,20 @@ def warn_deprecated_model_variant(pretrained_model_name_or_path, use_auth_token, ) +def _unwrap_model(model): + """Unwraps a model.""" + if is_compiled_module(model): + model = model._orig_mod + + if is_peft_available(): + from peft import PeftModel + + if isinstance(model, PeftModel): + model = model.base_model.model + + return model + + def maybe_raise_or_warn( library_name, library, class_name, importable_classes, passed_class_obj, name, is_pipeline_module ): @@ -287,9 +302,8 @@ def maybe_raise_or_warn( # Dynamo wraps the original model in a private class. # I didn't find a public API to get the original class. sub_model = passed_class_obj[name] - model_cls = sub_model.__class__ - if is_compiled_module(sub_model): - model_cls = sub_model._orig_mod.__class__ + unwrapped_sub_model = _unwrap_model(sub_model) + model_cls = unwrapped_sub_model.__class__ if not issubclass(model_cls, expected_class_obj): raise ValueError( @@ -546,10 +560,7 @@ def register_modules(self, **kwargs): register_dict = {name: (None, None)} else: # register the config from the original module, not the dynamo compiled one - if is_compiled_module(module): - not_compiled_module = module._orig_mod - else: - not_compiled_module = module + not_compiled_module = _unwrap_model(module) library = not_compiled_module.__module__.split(".")[0] @@ -652,7 +663,7 @@ def is_saveable_module(name, value): # Dynamo wraps the original model in a private class. # I didn't find a public API to get the original class. 
if is_compiled_module(sub_model): - sub_model = sub_model._orig_mod + sub_model = _unwrap_model(sub_model) model_cls = sub_model.__class__ save_method_name = None diff --git a/src/diffusers/pipelines/pixart_alpha/pipeline_pixart_alpha.py b/src/diffusers/pipelines/pixart_alpha/pipeline_pixart_alpha.py index c3f667ba16be..f4e61bdc9462 100644 --- a/src/diffusers/pipelines/pixart_alpha/pipeline_pixart_alpha.py +++ b/src/diffusers/pipelines/pixart_alpha/pipeline_pixart_alpha.py @@ -27,6 +27,7 @@ from ...schedulers import DPMSolverMultistepScheduler from ...utils import ( BACKENDS_MAPPING, + deprecate, is_bs4_available, is_ftfy_available, logging, @@ -162,8 +163,10 @@ def encode_prompt( device: Optional[torch.device] = None, prompt_embeds: Optional[torch.FloatTensor] = None, negative_prompt_embeds: Optional[torch.FloatTensor] = None, + prompt_attention_mask: Optional[torch.FloatTensor] = None, + negative_prompt_attention_mask: Optional[torch.FloatTensor] = None, clean_caption: bool = False, - mask_feature: bool = True, + **kwargs, ): r""" Encodes the prompt into text encoder hidden states. @@ -189,10 +192,11 @@ def encode_prompt( string. clean_caption (bool, defaults to `False`): If `True`, the function will preprocess and clean the provided caption before encoding. - mask_feature: (bool, defaults to `True`): - If `True`, the function will mask the text embeddings. """ - embeds_initially_provided = prompt_embeds is not None and negative_prompt_embeds is not None + + if "mask_feature" in kwargs: + deprecation_message = "The use of `mask_feature` is deprecated. It is no longer used in any computation and that doesn't affect the end results. It will be removed in a future version." + deprecate("mask_feature", "1.0.0", deprecation_message, standard_warn=False) if device is None: device = self._execution_device @@ -229,13 +233,11 @@ def encode_prompt( f" {max_length} tokens: {removed_text}" ) - attention_mask = text_inputs.attention_mask.to(device) - prompt_embeds_attention_mask = attention_mask + prompt_attention_mask = text_inputs.attention_mask + prompt_attention_mask = prompt_attention_mask.to(device) - prompt_embeds = self.text_encoder(text_input_ids.to(device), attention_mask=attention_mask) + prompt_embeds = self.text_encoder(text_input_ids.to(device), attention_mask=prompt_attention_mask) prompt_embeds = prompt_embeds[0] - else: - prompt_embeds_attention_mask = torch.ones_like(prompt_embeds) if self.text_encoder is not None: dtype = self.text_encoder.dtype @@ -250,8 +252,8 @@ def encode_prompt( # duplicate text embeddings and attention mask for each generation per prompt, using mps friendly method prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1) prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1) - prompt_embeds_attention_mask = prompt_embeds_attention_mask.view(bs_embed, -1) - prompt_embeds_attention_mask = prompt_embeds_attention_mask.repeat(num_images_per_prompt, 1) + prompt_attention_mask = prompt_attention_mask.view(bs_embed, -1) + prompt_attention_mask = prompt_attention_mask.repeat(num_images_per_prompt, 1) # get unconditional embeddings for classifier free guidance if do_classifier_free_guidance and negative_prompt_embeds is None: @@ -267,11 +269,11 @@ def encode_prompt( add_special_tokens=True, return_tensors="pt", ) - attention_mask = uncond_input.attention_mask.to(device) + negative_prompt_attention_mask = uncond_input.attention_mask + negative_prompt_attention_mask = negative_prompt_attention_mask.to(device) negative_prompt_embeds = 
self.text_encoder( - uncond_input.input_ids.to(device), - attention_mask=attention_mask, + uncond_input.input_ids.to(device), attention_mask=negative_prompt_attention_mask ) negative_prompt_embeds = negative_prompt_embeds[0] @@ -284,23 +286,13 @@ def encode_prompt( negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_images_per_prompt, 1) negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1) - # For classifier free guidance, we need to do two forward passes. - # Here we concatenate the unconditional and text embeddings into a single batch - # to avoid doing two forward passes + negative_prompt_attention_mask = negative_prompt_attention_mask.view(bs_embed, -1) + negative_prompt_attention_mask = negative_prompt_attention_mask.repeat(num_images_per_prompt, 1) else: negative_prompt_embeds = None + negative_prompt_attention_mask = None - # Perform additional masking. - if mask_feature and not embeds_initially_provided: - prompt_embeds = prompt_embeds.unsqueeze(1) - masked_prompt_embeds, keep_indices = self.mask_text_embeddings(prompt_embeds, prompt_embeds_attention_mask) - masked_prompt_embeds = masked_prompt_embeds.squeeze(1) - masked_negative_prompt_embeds = ( - negative_prompt_embeds[:, :keep_indices, :] if negative_prompt_embeds is not None else None - ) - return masked_prompt_embeds, masked_negative_prompt_embeds - - return prompt_embeds, negative_prompt_embeds + return prompt_embeds, prompt_attention_mask, negative_prompt_embeds, negative_prompt_attention_mask # Copied from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion.StableDiffusionPipeline.prepare_extra_step_kwargs def prepare_extra_step_kwargs(self, generator, eta): @@ -329,6 +321,8 @@ def check_inputs( callback_steps, prompt_embeds=None, negative_prompt_embeds=None, + prompt_attention_mask=None, + negative_prompt_attention_mask=None, ): if height % 8 != 0 or width % 8 != 0: raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.") @@ -365,6 +359,12 @@ def check_inputs( f" {negative_prompt_embeds}. Please make sure to only forward one of the two." ) + if prompt_embeds is not None and prompt_attention_mask is None: + raise ValueError("Must provide `prompt_attention_mask` when specifying `prompt_embeds`.") + + if negative_prompt_embeds is not None and negative_prompt_attention_mask is None: + raise ValueError("Must provide `negative_prompt_attention_mask` when specifying `negative_prompt_embeds`.") + if prompt_embeds is not None and negative_prompt_embeds is not None: if prompt_embeds.shape != negative_prompt_embeds.shape: raise ValueError( @@ -372,6 +372,12 @@ def check_inputs( f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`" f" {negative_prompt_embeds.shape}." ) + if prompt_attention_mask.shape != negative_prompt_attention_mask.shape: + raise ValueError( + "`prompt_attention_mask` and `negative_prompt_attention_mask` must have the same shape when passed directly, but" + f" got: `prompt_attention_mask` {prompt_attention_mask.shape} != `negative_prompt_attention_mask`" + f" {negative_prompt_attention_mask.shape}." 
+ ) # Copied from diffusers.pipelines.deepfloyd_if.pipeline_if.IFPipeline._text_preprocessing def _text_preprocessing(self, text, clean_caption=False): @@ -579,14 +585,16 @@ def __call__( generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None, latents: Optional[torch.FloatTensor] = None, prompt_embeds: Optional[torch.FloatTensor] = None, + prompt_attention_mask: Optional[torch.FloatTensor] = None, negative_prompt_embeds: Optional[torch.FloatTensor] = None, + negative_prompt_attention_mask: Optional[torch.FloatTensor] = None, output_type: Optional[str] = "pil", return_dict: bool = True, callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None, callback_steps: int = 1, clean_caption: bool = True, - mask_feature: bool = True, use_resolution_binning: bool = True, + **kwargs, ) -> Union[ImagePipelineOutput, Tuple]: """ Function invoked when calling the pipeline for generation. @@ -630,9 +638,12 @@ def __call__( prompt_embeds (`torch.FloatTensor`, *optional*): Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument. + prompt_attention_mask (`torch.FloatTensor`, *optional*): Pre-generated attention mask for text embeddings. negative_prompt_embeds (`torch.FloatTensor`, *optional*): Pre-generated negative text embeddings. For PixArt-Alpha this negative prompt should be "". If not provided, negative_prompt_embeds will be generated from `negative_prompt` input argument. + negative_prompt_attention_mask (`torch.FloatTensor`, *optional*): + Pre-generated attention mask for negative text embeddings. output_type (`str`, *optional*, defaults to `"pil"`): The output format of the generate image. Choose between [PIL](https://pillow.readthedocs.io/en/stable/): `PIL.Image.Image` or `np.array`. @@ -648,11 +659,10 @@ def __call__( Whether or not to clean the caption before creating embeddings. Requires `beautifulsoup4` and `ftfy` to be installed. If the dependencies are not installed, the embeddings will be created from the raw prompt. - mask_feature (`bool` defaults to `True`): If set to `True`, the text embeddings will be masked. - use_resolution_binning: - (`bool` defaults to `True`): If set to `True`, the requested height and width are first mapped to the - closest resolutions using `ASPECT_RATIO_1024_BIN`. After the produced latents are decoded into images, - they are resized back to the requested resolution. Useful for generating non-square images. + use_resolution_binning (`bool` defaults to `True`): + If set to `True`, the requested height and width are first mapped to the closest resolutions using + `ASPECT_RATIO_1024_BIN`. After the produced latents are decoded into images, they are resized back to + the requested resolution. Useful for generating non-square images. Examples: @@ -661,6 +671,9 @@ def __call__( If `return_dict` is `True`, [`~pipelines.ImagePipelineOutput`] is returned, otherwise a `tuple` is returned where the first element is a list with the generated images """ + if "mask_feature" in kwargs: + deprecation_message = "The use of `mask_feature` is deprecated. It is no longer used in any computation and that doesn't affect the end results. It will be removed in a future version." + deprecate("mask_feature", "1.0.0", deprecation_message, standard_warn=False) # 1. Check inputs. 
Raise error if not correct height = height or self.transformer.config.sample_size * self.vae_scale_factor width = width or self.transformer.config.sample_size * self.vae_scale_factor @@ -669,7 +682,15 @@ def __call__( height, width = self.classify_height_width_bin(height, width, ratios=ASPECT_RATIO_1024_BIN) self.check_inputs( - prompt, height, width, negative_prompt, callback_steps, prompt_embeds, negative_prompt_embeds + prompt, + height, + width, + negative_prompt, + callback_steps, + prompt_embeds, + negative_prompt_embeds, + prompt_attention_mask, + negative_prompt_attention_mask, ) # 2. Default height and width to transformer @@ -688,7 +709,12 @@ def __call__( do_classifier_free_guidance = guidance_scale > 1.0 # 3. Encode input prompt - prompt_embeds, negative_prompt_embeds = self.encode_prompt( + ( + prompt_embeds, + prompt_attention_mask, + negative_prompt_embeds, + negative_prompt_attention_mask, + ) = self.encode_prompt( prompt, do_classifier_free_guidance, negative_prompt=negative_prompt, @@ -696,11 +722,13 @@ def __call__( device=device, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds, + prompt_attention_mask=prompt_attention_mask, + negative_prompt_attention_mask=negative_prompt_attention_mask, clean_caption=clean_caption, - mask_feature=mask_feature, ) if do_classifier_free_guidance: prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0) + prompt_attention_mask = torch.cat([negative_prompt_attention_mask, prompt_attention_mask], dim=0) # 4. Prepare timesteps self.scheduler.set_timesteps(num_inference_steps, device=device) @@ -758,6 +786,7 @@ def __call__( noise_pred = self.transformer( latent_model_input, encoder_hidden_states=prompt_embeds, + encoder_attention_mask=prompt_attention_mask, timestep=current_timestep, added_cond_kwargs=added_cond_kwargs, return_dict=False, diff --git a/src/diffusers/pipelines/stable_diffusion/convert_from_ckpt.py b/src/diffusers/pipelines/stable_diffusion/convert_from_ckpt.py index 8c1d52ca83d8..35466f008f54 100644 --- a/src/diffusers/pipelines/stable_diffusion/convert_from_ckpt.py +++ b/src/diffusers/pipelines/stable_diffusion/convert_from_ckpt.py @@ -1232,13 +1232,11 @@ def download_from_original_stable_diffusion_ckpt( StableDiffusionPipeline, StableDiffusionUpscalePipeline, StableDiffusionXLImg2ImgPipeline, + StableDiffusionXLPipeline, StableUnCLIPImg2ImgPipeline, StableUnCLIPPipeline, ) - if pipeline_class is None: - pipeline_class = StableDiffusionPipeline if not controlnet else StableDiffusionControlNetPipeline - if prediction_type == "v-prediction": prediction_type = "v_prediction" @@ -1333,6 +1331,13 @@ def download_from_original_stable_diffusion_ckpt( if image_size is None: image_size = 1024 + if pipeline_class is None: + # Check if we have a SDXL or SD model and initialize default pipeline + if model_type not in ["SDXL", "SDXL-Refiner"]: + pipeline_class = StableDiffusionPipeline if not controlnet else StableDiffusionControlNetPipeline + else: + pipeline_class = StableDiffusionXLPipeline if model_type == "SDXL" else StableDiffusionXLImg2ImgPipeline + if num_in_channels is None and pipeline_class == StableDiffusionInpaintPipeline: num_in_channels = 9 if num_in_channels is None and pipeline_class == StableDiffusionUpscalePipeline: diff --git a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py index bcf2a6217772..5598477c9238 100644 --- 
a/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py +++ b/src/diffusers/pipelines/stable_diffusion/pipeline_flax_stable_diffusion.py @@ -410,13 +410,13 @@ def __call__( images_uint8_casted = np.asarray(images_uint8_casted).reshape(num_devices * batch_size, height, width, 3) images_uint8_casted, has_nsfw_concept = self._run_safety_checker(images_uint8_casted, safety_params, jit) - images = np.asarray(images) + images = np.asarray(images).copy() # block images if any(has_nsfw_concept): for i, is_nsfw in enumerate(has_nsfw_concept): if is_nsfw: - images[i] = np.asarray(images_uint8_casted[i]) + images[i, 0] = np.asarray(images_uint8_casted[i]) images = images.reshape(num_devices, batch_size, height, width, 3) else: diff --git a/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py b/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py index 103cd7095291..2f65b6cd391b 100644 --- a/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py +++ b/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_adapter.py @@ -610,6 +610,46 @@ def disable_freeu(self): """Disables the FreeU mechanism if enabled.""" self.unet.disable_freeu() + # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding + def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32): + """ + See https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298 + + Args: + timesteps (`torch.Tensor`): + generate embedding vectors at these timesteps + embedding_dim (`int`, *optional*, defaults to 512): + dimension of the embeddings to generate + dtype: + data type of the generated embeddings + + Returns: + `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)` + """ + assert len(w.shape) == 1 + w = w * 1000.0 + + half_dim = embedding_dim // 2 + emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1) + emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb) + emb = w.to(dtype)[:, None] * emb[None, :] + emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1) + if embedding_dim % 2 == 1: # zero pad + emb = torch.nn.functional.pad(emb, (0, 1)) + assert emb.shape == (w.shape[0], embedding_dim) + return emb + + @property + def guidance_scale(self): + return self._guidance_scale + + # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2) + # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1` + # corresponds to doing no classifier free guidance. + @property + def do_classifier_free_guidance(self): + return self._guidance_scale > 1 and self.unet.config.time_cond_proj_dim is None + @torch.no_grad() @replace_example_docstring(EXAMPLE_DOC_STRING) def __call__( @@ -723,6 +763,8 @@ def __call__( prompt, height, width, callback_steps, image, negative_prompt, prompt_embeds, negative_prompt_embeds ) + self._guidance_scale = guidance_scale + if isinstance(self.adapter, MultiAdapter): adapter_input = [] @@ -742,17 +784,12 @@ def __call__( else: batch_size = prompt_embeds.shape[0] - # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2) - # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1` - # corresponds to doing no classifier free guidance. - do_classifier_free_guidance = guidance_scale > 1.0 - # 3. 
Encode input prompt prompt_embeds, negative_prompt_embeds = self.encode_prompt( prompt, device, num_images_per_prompt, - do_classifier_free_guidance, + self.do_classifier_free_guidance, negative_prompt, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_prompt_embeds, @@ -761,7 +798,7 @@ def __call__( # For classifier free guidance, we need to do two forward passes. # Here we concatenate the unconditional and text embeddings into a single batch # to avoid doing two forward passes - if do_classifier_free_guidance: + if self.do_classifier_free_guidance: prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds]) # 4. Prepare timesteps @@ -784,6 +821,14 @@ def __call__( # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta) + # 6.5 Optionally get Guidance Scale Embedding + timestep_cond = None + if self.unet.config.time_cond_proj_dim is not None: + guidance_scale_tensor = torch.tensor(self.guidance_scale - 1).repeat(batch_size * num_images_per_prompt) + timestep_cond = self.get_guidance_scale_embedding( + guidance_scale_tensor, embedding_dim=self.unet.config.time_cond_proj_dim + ).to(device=device, dtype=latents.dtype) + # 7. Denoising loop if isinstance(self.adapter, MultiAdapter): adapter_state = self.adapter(adapter_input, adapter_conditioning_scale) @@ -796,7 +841,7 @@ def __call__( if num_images_per_prompt > 1: for k, v in enumerate(adapter_state): adapter_state[k] = v.repeat(num_images_per_prompt, 1, 1, 1) - if do_classifier_free_guidance: + if self.do_classifier_free_guidance: for k, v in enumerate(adapter_state): adapter_state[k] = torch.cat([v] * 2, dim=0) @@ -804,7 +849,7 @@ def __call__( with self.progress_bar(total=num_inference_steps) as progress_bar: for i, t in enumerate(timesteps): # expand the latents if we are doing classifier free guidance - latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents + latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents latent_model_input = self.scheduler.scale_model_input(latent_model_input, t) # predict the noise residual @@ -812,13 +857,14 @@ def __call__( latent_model_input, t, encoder_hidden_states=prompt_embeds, + timestep_cond=timestep_cond, cross_attention_kwargs=cross_attention_kwargs, down_intrablock_additional_residuals=[state.clone() for state in adapter_state], return_dict=False, )[0] # perform guidance - if do_classifier_free_guidance: + if self.do_classifier_free_guidance: noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) diff --git a/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_xl_adapter.py b/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_xl_adapter.py index a163a27a8bd7..8b676b8ad964 100644 --- a/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_xl_adapter.py +++ b/src/diffusers/pipelines/t2i_adapter/pipeline_stable_diffusion_xl_adapter.py @@ -670,6 +670,46 @@ def disable_freeu(self): """Disables the FreeU mechanism if enabled.""" self.unet.disable_freeu() + # Copied from diffusers.pipelines.latent_consistency_models.pipeline_latent_consistency_text2img.LatentConsistencyModelPipeline.get_guidance_scale_embedding + def get_guidance_scale_embedding(self, w, embedding_dim=512, dtype=torch.float32): + """ + See 
https://github.com/google-research/vdm/blob/dc27b98a554f65cdc654b800da5aa1846545d41b/model_vdm.py#L298 + + Args: + timesteps (`torch.Tensor`): + generate embedding vectors at these timesteps + embedding_dim (`int`, *optional*, defaults to 512): + dimension of the embeddings to generate + dtype: + data type of the generated embeddings + + Returns: + `torch.FloatTensor`: Embedding vectors with shape `(len(timesteps), embedding_dim)` + """ + assert len(w.shape) == 1 + w = w * 1000.0 + + half_dim = embedding_dim // 2 + emb = torch.log(torch.tensor(10000.0)) / (half_dim - 1) + emb = torch.exp(torch.arange(half_dim, dtype=dtype) * -emb) + emb = w.to(dtype)[:, None] * emb[None, :] + emb = torch.cat([torch.sin(emb), torch.cos(emb)], dim=1) + if embedding_dim % 2 == 1: # zero pad + emb = torch.nn.functional.pad(emb, (0, 1)) + assert emb.shape == (w.shape[0], embedding_dim) + return emb + + @property + def guidance_scale(self): + return self._guidance_scale + + # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2) + # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1` + # corresponds to doing no classifier free guidance. + @property + def do_classifier_free_guidance(self): + return self._guidance_scale > 1 and self.unet.config.time_cond_proj_dim is None + @torch.no_grad() @replace_example_docstring(EXAMPLE_DOC_STRING) def __call__( @@ -882,6 +922,8 @@ def __call__( negative_pooled_prompt_embeds, ) + self._guidance_scale = guidance_scale + # 2. Define call parameters if prompt is not None and isinstance(prompt, str): batch_size = 1 @@ -892,11 +934,6 @@ def __call__( device = self._execution_device - # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2) - # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1` - # corresponds to doing no classifier free guidance. - do_classifier_free_guidance = guidance_scale > 1.0 - # 3. Encode input prompt ( prompt_embeds, @@ -908,7 +945,7 @@ def __call__( prompt_2=prompt_2, device=device, num_images_per_prompt=num_images_per_prompt, - do_classifier_free_guidance=do_classifier_free_guidance, + do_classifier_free_guidance=self.do_classifier_free_guidance, negative_prompt=negative_prompt, negative_prompt_2=negative_prompt_2, prompt_embeds=prompt_embeds, @@ -939,6 +976,14 @@ def __call__( # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta) + # 6.5 Optionally get Guidance Scale Embedding + timestep_cond = None + if self.unet.config.time_cond_proj_dim is not None: + guidance_scale_tensor = torch.tensor(self.guidance_scale - 1).repeat(batch_size * num_images_per_prompt) + timestep_cond = self.get_guidance_scale_embedding( + guidance_scale_tensor, embedding_dim=self.unet.config.time_cond_proj_dim + ).to(device=device, dtype=latents.dtype) + # 7. 
Prepare added time ids & embeddings & adapter features if isinstance(self.adapter, MultiAdapter): adapter_state = self.adapter(adapter_input, adapter_conditioning_scale) @@ -951,7 +996,7 @@ def __call__( if num_images_per_prompt > 1: for k, v in enumerate(adapter_state): adapter_state[k] = v.repeat(num_images_per_prompt, 1, 1, 1) - if do_classifier_free_guidance: + if self.do_classifier_free_guidance: for k, v in enumerate(adapter_state): adapter_state[k] = torch.cat([v] * 2, dim=0) @@ -979,7 +1024,7 @@ def __call__( else: negative_add_time_ids = add_time_ids - if do_classifier_free_guidance: + if self.do_classifier_free_guidance: prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds], dim=0) add_text_embeds = torch.cat([negative_pooled_prompt_embeds, add_text_embeds], dim=0) add_time_ids = torch.cat([negative_add_time_ids, add_time_ids], dim=0) @@ -1005,7 +1050,7 @@ def __call__( with self.progress_bar(total=num_inference_steps) as progress_bar: for i, t in enumerate(timesteps): # expand the latents if we are doing classifier free guidance - latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents + latent_model_input = torch.cat([latents] * 2) if self.do_classifier_free_guidance else latents latent_model_input = self.scheduler.scale_model_input(latent_model_input, t) @@ -1021,6 +1066,7 @@ def __call__( latent_model_input, t, encoder_hidden_states=prompt_embeds, + timestep_cond=timestep_cond, cross_attention_kwargs=cross_attention_kwargs, added_cond_kwargs=added_cond_kwargs, return_dict=False, @@ -1028,11 +1074,11 @@ def __call__( )[0] # perform guidance - if do_classifier_free_guidance: + if self.do_classifier_free_guidance: noise_pred_uncond, noise_pred_text = noise_pred.chunk(2) noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond) - if do_classifier_free_guidance and guidance_rescale > 0.0: + if self.do_classifier_free_guidance and guidance_rescale > 0.0: # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf noise_pred = rescale_noise_cfg(noise_pred, noise_pred_text, guidance_rescale=guidance_rescale) diff --git a/src/diffusers/schedulers/scheduling_lcm.py b/src/diffusers/schedulers/scheduling_lcm.py index adcc092a816f..209125f156d1 100644 --- a/src/diffusers/schedulers/scheduling_lcm.py +++ b/src/diffusers/schedulers/scheduling_lcm.py @@ -378,6 +378,12 @@ def set_timesteps( # LCM Training Steps Schedule lcm_origin_timesteps = np.asarray(list(range(1, int(original_steps * strength) + 1))) * c - 1 skipping_step = len(lcm_origin_timesteps) // num_inference_steps + + if skipping_step < 1: + raise ValueError( + f"The combination of `original_steps x strength`: {original_steps} x {strength} is smaller than `num_inference_steps`: {num_inference_steps}. Make sure to either reduce `num_inference_steps` to a value smaller than {int(original_steps * strength)} or increase `strength` to a value higher than {float(num_inference_steps / original_steps)}." 
+ ) + + # LCM Inference Steps Schedule timesteps = lcm_origin_timesteps[::-skipping_step][:num_inference_steps] diff --git a/src/diffusers/utils/__init__.py b/src/diffusers/utils/__init__.py index b4d6bdab33eb..c1385d584724 100644 --- a/src/diffusers/utils/__init__.py +++ b/src/diffusers/utils/__init__.py @@ -89,6 +89,7 @@ from .outputs import BaseOutput from .peft_utils import ( check_peft_version, + delete_adapter_layers, get_adapter_name, get_peft_kwargs, recurse_remove_peft_layers, diff --git a/src/diffusers/utils/constants.py b/src/diffusers/utils/constants.py index a485498eb725..8ae5b0dec4d1 100644 --- a/src/diffusers/utils/constants.py +++ b/src/diffusers/utils/constants.py @@ -22,8 +22,8 @@ default_cache_path = HUGGINGFACE_HUB_CACHE -MIN_PEFT_VERSION = "0.5.0" -MIN_TRANSFORMERS_VERSION = "4.33.3" +MIN_PEFT_VERSION = "0.6.0" +MIN_TRANSFORMERS_VERSION = "4.34.0" CONFIG_NAME = "config.json" @@ -41,12 +41,12 @@ # Below should be `True` if the current version of `peft` and `transformers` are compatible with # PEFT backend. Will automatically fall back to PEFT backend if the correct versions of the libraries are # available. -# For PEFT it is has to be greater than 0.6.0 and for transformers it has to be greater than 4.33.1. +# For PEFT it has to be greater than or equal to 0.6.0 and for transformers it has to be greater than or equal to 4.34.0. _required_peft_version = is_peft_available() and version.parse( version.parse(importlib.metadata.version("peft")).base_version -) > version.parse(MIN_PEFT_VERSION) +) >= version.parse(MIN_PEFT_VERSION) _required_transformers_version = is_transformers_available() and version.parse( version.parse(importlib.metadata.version("transformers")).base_version -) > version.parse(MIN_TRANSFORMERS_VERSION) +) >= version.parse(MIN_TRANSFORMERS_VERSION) USE_PEFT_BACKEND = _required_peft_version and _required_transformers_version diff --git a/src/diffusers/utils/logging.py b/src/diffusers/utils/logging.py index 4ccc57cd69d5..6050f314c008 100644 --- a/src/diffusers/utils/logging.py +++ b/src/diffusers/utils/logging.py @@ -28,7 +28,7 @@ WARN, # NOQA WARNING, # NOQA ) -from typing import Optional +from typing import Dict, Optional from tqdm import auto as tqdm_lib @@ -49,7 +49,7 @@ _tqdm_active = True -def _get_default_logging_level(): +def _get_default_logging_level() -> int: """ If DIFFUSERS_VERBOSITY env var is set to one of the valid choices return that as the new default level.
If it is not - fall back to `_default_log_level` @@ -104,7 +104,7 @@ def _reset_library_root_logger() -> None: _default_handler = None -def get_log_levels_dict(): +def get_log_levels_dict() -> Dict[str, int]: return log_levels @@ -161,22 +161,22 @@ def set_verbosity(verbosity: int) -> None: _get_library_root_logger().setLevel(verbosity) -def set_verbosity_info(): +def set_verbosity_info() -> None: """Set the verbosity to the `INFO` level.""" return set_verbosity(INFO) -def set_verbosity_warning(): +def set_verbosity_warning() -> None: """Set the verbosity to the `WARNING` level.""" return set_verbosity(WARNING) -def set_verbosity_debug(): +def set_verbosity_debug() -> None: """Set the verbosity to the `DEBUG` level.""" return set_verbosity(DEBUG) -def set_verbosity_error(): +def set_verbosity_error() -> None: """Set the verbosity to the `ERROR` level.""" return set_verbosity(ERROR) @@ -263,7 +263,7 @@ def reset_format() -> None: handler.setFormatter(None) -def warning_advice(self, *args, **kwargs): +def warning_advice(self, *args, **kwargs) -> None: """ This method is identical to `logger.warning()`, but if env var DIFFUSERS_NO_ADVISORY_WARNINGS=1 is set, this warning will not be printed @@ -327,13 +327,13 @@ def is_progress_bar_enabled() -> bool: return bool(_tqdm_active) -def enable_progress_bar(): +def enable_progress_bar() -> None: """Enable tqdm progress bar.""" global _tqdm_active _tqdm_active = True -def disable_progress_bar(): +def disable_progress_bar() -> None: """Disable tqdm progress bar.""" global _tqdm_active _tqdm_active = False diff --git a/src/diffusers/utils/outputs.py b/src/diffusers/utils/outputs.py index a057b506aec0..01a297361955 100644 --- a/src/diffusers/utils/outputs.py +++ b/src/diffusers/utils/outputs.py @@ -24,7 +24,7 @@ from .import_utils import is_torch_available -def is_tensor(x): +def is_tensor(x) -> bool: """ Tests if `x` is a `torch.Tensor` or `np.ndarray`. """ @@ -66,7 +66,7 @@ def __init_subclass__(cls) -> None: lambda values, context: cls(**torch.utils._pytree._dict_unflatten(values, context)), ) - def __post_init__(self): + def __post_init__(self) -> None: class_fields = fields(self) # Safety and consistency checks @@ -97,14 +97,14 @@ def pop(self, *args, **kwargs): def update(self, *args, **kwargs): raise Exception(f"You cannot use ``update`` on a {self.__class__.__name__} instance.") - def __getitem__(self, k): + def __getitem__(self, k: Any) -> Any: if isinstance(k, str): inner_dict = dict(self.items()) return inner_dict[k] else: return self.to_tuple()[k] - def __setattr__(self, name, value): + def __setattr__(self, name: Any, value: Any) -> None: if name in self.keys() and value is not None: # Don't call self.__setitem__ to avoid recursion errors super().__setitem__(name, value) @@ -123,7 +123,7 @@ def __reduce__(self): args = tuple(getattr(self, field.name) for field in fields(self)) return callable, args, *remaining - def to_tuple(self) -> Tuple[Any]: + def to_tuple(self) -> Tuple[Any, ...]: """ Convert self to a tuple containing all the attributes/keys that are not `None`. 
""" diff --git a/src/diffusers/utils/peft_utils.py b/src/diffusers/utils/peft_utils.py index 158435a6e812..2bcbeb3b7966 100644 --- a/src/diffusers/utils/peft_utils.py +++ b/src/diffusers/utils/peft_utils.py @@ -180,6 +180,28 @@ def set_adapter_layers(model, enabled=True): module.disable_adapters = not enabled +def delete_adapter_layers(model, adapter_name): + from peft.tuners.tuners_utils import BaseTunerLayer + + for module in model.modules(): + if isinstance(module, BaseTunerLayer): + if hasattr(module, "delete_adapter"): + module.delete_adapter(adapter_name) + else: + raise ValueError( + "The version of PEFT you are using is not compatible, please use a version that is greater than 0.6.1" + ) + + # For transformers integration - we need to pop the adapter from the config + if getattr(model, "_hf_peft_config_loaded", False) and hasattr(model, "peft_config"): + model.peft_config.pop(adapter_name, None) + # In case all adapters are deleted, we need to delete the config + # and make sure to set the flag to False + if len(model.peft_config) == 0: + del model.peft_config + model._hf_peft_config_loaded = None + + def set_weights_and_activate_adapters(model, adapter_names, weights): from peft.tuners.tuners_utils import BaseTunerLayer diff --git a/src/diffusers/utils/torch_utils.py b/src/diffusers/utils/torch_utils.py index 7955ccb01d85..00bc75f41be3 100644 --- a/src/diffusers/utils/torch_utils.py +++ b/src/diffusers/utils/torch_utils.py @@ -82,14 +82,14 @@ def randn_tensor( return latents -def is_compiled_module(module): +def is_compiled_module(module) -> bool: """Check whether the module was compiled with torch.compile()""" if is_torch_version("<", "2.0.0") or not hasattr(torch, "_dynamo"): return False return isinstance(module, torch._dynamo.eval_frame.OptimizedModule) -def fourier_filter(x_in, threshold, scale): +def fourier_filter(x_in: torch.Tensor, threshold: int, scale: int) -> torch.Tensor: """Fourier filter as introduced in FreeU (https://arxiv.org/abs/2309.11497). 
This version of the method comes from here: diff --git a/tests/lora/test_lora_layers_old_backend.py b/tests/lora/test_lora_layers_old_backend.py index 047cdddfa95a..285c5e864a04 100644 --- a/tests/lora/test_lora_layers_old_backend.py +++ b/tests/lora/test_lora_layers_old_backend.py @@ -41,7 +41,7 @@ UNet2DConditionModel, UNet3DConditionModel, ) -from diffusers.loaders import AttnProcsLayers, LoraLoaderMixin, PatchedLoraProjection, text_encoder_attn_modules +from diffusers.loaders import AttnProcsLayers, LoraLoaderMixin from diffusers.models.attention_processor import ( Attention, AttnProcessor, @@ -51,6 +51,7 @@ LoRAXFormersAttnProcessor, XFormersAttnProcessor, ) +from diffusers.models.lora import PatchedLoraProjection, text_encoder_attn_modules from diffusers.utils.import_utils import is_xformers_available from diffusers.utils.testing_utils import ( deprecate_after_peft_backend, diff --git a/tests/lora/test_lora_layers_peft.py b/tests/lora/test_lora_layers_peft.py index 68e986790d76..c290850a10b6 100644 --- a/tests/lora/test_lora_layers_peft.py +++ b/tests/lora/test_lora_layers_peft.py @@ -40,10 +40,7 @@ UNet2DConditionModel, ) from diffusers.loaders import AttnProcsLayers -from diffusers.models.attention_processor import ( - LoRAAttnProcessor, - LoRAAttnProcessor2_0, -) +from diffusers.models.attention_processor import LoRAAttnProcessor, LoRAAttnProcessor2_0 from diffusers.utils.import_utils import is_accelerate_available, is_peft_available from diffusers.utils.testing_utils import ( floats_tensor, @@ -831,6 +828,96 @@ def test_simple_inference_with_text_unet_multi_adapter(self): "output with no lora and output with lora disabled should give same results", ) + def test_simple_inference_with_text_unet_multi_adapter_delete_adapter(self): + """ + Tests a simple inference with lora attached to text encoder and unet, attaches + multiple adapters and set/delete them + """ + for scheduler_cls in [DDIMScheduler, LCMScheduler]: + components, _, text_lora_config, unet_lora_config = self.get_dummy_components(scheduler_cls) + pipe = self.pipeline_class(**components) + pipe = pipe.to(self.torch_device) + pipe.set_progress_bar_config(disable=None) + _, _, inputs = self.get_dummy_inputs(with_generator=False) + + output_no_lora = pipe(**inputs, generator=torch.manual_seed(0)).images + + pipe.text_encoder.add_adapter(text_lora_config, "adapter-1") + pipe.text_encoder.add_adapter(text_lora_config, "adapter-2") + + pipe.unet.add_adapter(unet_lora_config, "adapter-1") + pipe.unet.add_adapter(unet_lora_config, "adapter-2") + + self.assertTrue( + self.check_if_lora_correctly_set(pipe.text_encoder), "Lora not correctly set in text encoder" + ) + self.assertTrue(self.check_if_lora_correctly_set(pipe.unet), "Lora not correctly set in Unet") + + if self.has_two_text_encoders: + pipe.text_encoder_2.add_adapter(text_lora_config, "adapter-1") + pipe.text_encoder_2.add_adapter(text_lora_config, "adapter-2") + self.assertTrue( + self.check_if_lora_correctly_set(pipe.text_encoder_2), "Lora not correctly set in text encoder 2" + ) + + pipe.set_adapters("adapter-1") + + output_adapter_1 = pipe(**inputs, generator=torch.manual_seed(0)).images + + pipe.set_adapters("adapter-2") + output_adapter_2 = pipe(**inputs, generator=torch.manual_seed(0)).images + + pipe.set_adapters(["adapter-1", "adapter-2"]) + + output_adapter_mixed = pipe(**inputs, generator=torch.manual_seed(0)).images + + self.assertFalse( + np.allclose(output_adapter_1, output_adapter_2, atol=1e-3, rtol=1e-3), + "Adapter 1 and 2 should give different results", 
+ ) + + self.assertFalse( + np.allclose(output_adapter_1, output_adapter_mixed, atol=1e-3, rtol=1e-3), + "Adapter 1 and mixed adapters should give different results", + ) + + self.assertFalse( + np.allclose(output_adapter_2, output_adapter_mixed, atol=1e-3, rtol=1e-3), + "Adapter 2 and mixed adapters should give different results", + ) + + pipe.delete_adapters("adapter-1") + output_deleted_adapter_1 = pipe(**inputs, generator=torch.manual_seed(0)).images + + self.assertTrue( + np.allclose(output_deleted_adapter_1, output_adapter_2, atol=1e-3, rtol=1e-3), + "Deleting adapter 1 should give the same results as having only adapter 2 active", + ) + + pipe.delete_adapters("adapter-2") + output_deleted_adapters = pipe(**inputs, generator=torch.manual_seed(0)).images + + self.assertTrue( + np.allclose(output_no_lora, output_deleted_adapters, atol=1e-3, rtol=1e-3), + "output with no lora and output with all adapters deleted should give same results", + ) + + pipe.text_encoder.add_adapter(text_lora_config, "adapter-1") + pipe.text_encoder.add_adapter(text_lora_config, "adapter-2") + + pipe.unet.add_adapter(unet_lora_config, "adapter-1") + pipe.unet.add_adapter(unet_lora_config, "adapter-2") + + pipe.set_adapters(["adapter-1", "adapter-2"]) + pipe.delete_adapters(["adapter-1", "adapter-2"]) + + output_deleted_adapters = pipe(**inputs, generator=torch.manual_seed(0)).images + + self.assertTrue( + np.allclose(output_no_lora, output_deleted_adapters, atol=1e-3, rtol=1e-3), + "output with no lora and output with all adapters deleted should give same results", + ) + def test_simple_inference_with_text_unet_multi_adapter_weighted(self): """ Tests a simple inference with lora attached to text encoder and unet, attaches diff --git a/tests/pipelines/controlnet/test_controlnet.py b/tests/pipelines/controlnet/test_controlnet.py index 64baeea910b8..2d8c8869c23c 100644 --- a/tests/pipelines/controlnet/test_controlnet.py +++ b/tests/pipelines/controlnet/test_controlnet.py @@ -27,6 +27,7 @@ ControlNetModel, DDIMScheduler, EulerDiscreteScheduler, + LCMScheduler, StableDiffusionControlNetPipeline, UNet2DConditionModel, ) @@ -116,7 +117,7 @@ class ControlNetPipelineFastTests( image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS - def get_dummy_components(self): + def get_dummy_components(self, time_cond_proj_dim=None): torch.manual_seed(0) unet = UNet2DConditionModel( block_out_channels=(4, 8), @@ -128,6 +129,7 @@ def get_dummy_components(self): up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"), cross_attention_dim=32, norm_num_groups=1, + time_cond_proj_dim=time_cond_proj_dim, ) torch.manual_seed(0) controlnet = ControlNetModel( @@ -221,6 +223,28 @@ def test_xformers_attention_forwardGenerator_pass(self): def test_inference_batch_single_identical(self): self._test_inference_batch_single_identical(expected_max_diff=2e-3) + def test_controlnet_lcm(self): + device = "cpu" # ensure determinism for the device-dependent torch.Generator + + components = self.get_dummy_components(time_cond_proj_dim=256) + sd_pipe = StableDiffusionControlNetPipeline(**components) + sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config) + sd_pipe = sd_pipe.to(torch_device) + sd_pipe.set_progress_bar_config(disable=None) + + inputs = self.get_dummy_inputs(device) + output = sd_pipe(**inputs) + image = output.images + + image_slice = image[0, -3:, -3:, -1] + + assert image.shape == (1, 64, 64, 3) + expected_slice = np.array( + [0.52700454, 0.3930534, 0.25509018, 0.7132304, 0.53696585, 0.46568912, 0.7095368, 0.7059624, 0.4744786] + ) + +
assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2 + class StableDiffusionMultiControlNetPipelineFastTests( PipelineTesterMixin, PipelineKarrasSchedulerTesterMixin, unittest.TestCase diff --git a/tests/pipelines/controlnet/test_controlnet_sdxl.py b/tests/pipelines/controlnet/test_controlnet_sdxl.py index be786ebe3000..36ddee36eb52 100644 --- a/tests/pipelines/controlnet/test_controlnet_sdxl.py +++ b/tests/pipelines/controlnet/test_controlnet_sdxl.py @@ -24,6 +24,7 @@ AutoencoderKL, ControlNetModel, EulerDiscreteScheduler, + LCMScheduler, StableDiffusionXLControlNetPipeline, UNet2DConditionModel, ) @@ -62,7 +63,7 @@ class StableDiffusionXLControlNetPipelineFastTests( image_params = IMAGE_TO_IMAGE_IMAGE_PARAMS image_latents_params = TEXT_TO_IMAGE_IMAGE_PARAMS - def get_dummy_components(self): + def get_dummy_components(self, time_cond_proj_dim=None): torch.manual_seed(0) unet = UNet2DConditionModel( block_out_channels=(32, 64), @@ -80,6 +81,7 @@ def get_dummy_components(self): transformer_layers_per_block=(1, 2), projection_class_embeddings_input_dim=80, # 6 * 8 + 32 cross_attention_dim=64, + time_cond_proj_dim=time_cond_proj_dim, ) torch.manual_seed(0) controlnet = ControlNetModel( @@ -330,6 +332,26 @@ def test_controlnet_sdxl_guess(self): # make sure that it's equal assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-4 + def test_controlnet_sdxl_lcm(self): + device = "cpu" # ensure determinism for the device-dependent torch.Generator + + components = self.get_dummy_components(time_cond_proj_dim=256) + sd_pipe = StableDiffusionXLControlNetPipeline(**components) + sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config) + sd_pipe = sd_pipe.to(torch_device) + sd_pipe.set_progress_bar_config(disable=None) + + inputs = self.get_dummy_inputs(device) + output = sd_pipe(**inputs) + image = output.images + + image_slice = image[0, -3:, -3:, -1] + + assert image.shape == (1, 64, 64, 3) + expected_slice = np.array([0.7799, 0.614, 0.6162, 0.7082, 0.6662, 0.5833, 0.4148, 0.5182, 0.4866]) + + assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2 + class StableDiffusionXLMultiControlNetPipelineFastTests( PipelineTesterMixin, PipelineKarrasSchedulerTesterMixin, SDXLOptionalComponentsTesterMixin, unittest.TestCase diff --git a/tests/pipelines/pixart/test_pixart.py b/tests/pipelines/pixart/test_pixart.py index 1fb2560b29b6..b2806a5c1c99 100644 --- a/tests/pipelines/pixart/test_pixart.py +++ b/tests/pipelines/pixart/test_pixart.py @@ -111,13 +111,20 @@ def test_save_load_optional_components(self): num_inference_steps = inputs["num_inference_steps"] output_type = inputs["output_type"] - prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(prompt, mask_feature=False) + ( + prompt_embeds, + prompt_attention_mask, + negative_prompt_embeds, + negative_prompt_attention_mask, + ) = pipe.encode_prompt(prompt) # inputs with prompt converted to embeddings inputs = { "prompt_embeds": prompt_embeds, + "prompt_attention_mask": prompt_attention_mask, "negative_prompt": None, "negative_prompt_embeds": negative_prompt_embeds, + "negative_prompt_attention_mask": negative_prompt_attention_mask, "generator": generator, "num_inference_steps": num_inference_steps, "output_type": output_type, @@ -151,8 +158,10 @@ def test_save_load_optional_components(self): # inputs with prompt converted to embeddings inputs = { "prompt_embeds": prompt_embeds, + "prompt_attention_mask": prompt_attention_mask, "negative_prompt": None, "negative_prompt_embeds": negative_prompt_embeds, + 
"negative_prompt_attention_mask": negative_prompt_attention_mask, "generator": generator, "num_inference_steps": num_inference_steps, "output_type": output_type, @@ -211,13 +220,15 @@ def test_inference_with_embeddings_and_multiple_images(self): num_inference_steps = inputs["num_inference_steps"] output_type = inputs["output_type"] - prompt_embeds, negative_prompt_embeds = pipe.encode_prompt(prompt) + prompt_embeds, prompt_attn_mask, negative_prompt_embeds, neg_prompt_attn_mask = pipe.encode_prompt(prompt) # inputs with prompt converted to embeddings inputs = { "prompt_embeds": prompt_embeds, + "prompt_attention_mask": prompt_attn_mask, "negative_prompt": None, "negative_prompt_embeds": negative_prompt_embeds, + "negative_prompt_attention_mask": neg_prompt_attn_mask, "generator": generator, "num_inference_steps": num_inference_steps, "output_type": output_type, @@ -252,8 +263,10 @@ def test_inference_with_embeddings_and_multiple_images(self): # inputs with prompt converted to embeddings inputs = { "prompt_embeds": prompt_embeds, + "prompt_attention_mask": prompt_attn_mask, "negative_prompt": None, "negative_prompt_embeds": negative_prompt_embeds, + "negative_prompt_attention_mask": neg_prompt_attn_mask, "generator": generator, "num_inference_steps": num_inference_steps, "output_type": output_type, @@ -266,6 +279,40 @@ def test_inference_with_embeddings_and_multiple_images(self): max_diff = np.abs(to_np(output) - to_np(output_loaded)).max() self.assertLess(max_diff, 1e-4) + def test_inference_with_multiple_images_per_prompt(self): + device = "cpu" + + components = self.get_dummy_components() + pipe = self.pipeline_class(**components) + pipe.to(device) + pipe.set_progress_bar_config(disable=None) + + inputs = self.get_dummy_inputs(device) + inputs["num_images_per_prompt"] = 2 + image = pipe(**inputs).images + image_slice = image[0, -3:, -3:, -1] + + self.assertEqual(image.shape, (2, 8, 8, 3)) + expected_slice = np.array([0.5303, 0.2658, 0.7979, 0.1182, 0.3304, 0.4608, 0.5195, 0.4261, 0.4675]) + max_diff = np.abs(image_slice.flatten() - expected_slice).max() + self.assertLessEqual(max_diff, 1e-3) + + def test_raises_warning_for_mask_feature(self): + device = "cpu" + + components = self.get_dummy_components() + pipe = self.pipeline_class(**components) + pipe.to(device) + pipe.set_progress_bar_config(disable=None) + + inputs = self.get_dummy_inputs(device) + inputs.update({"mask_feature": True}) + + with self.assertWarns(FutureWarning) as warning_ctx: + _ = pipe(**inputs).images + + assert "mask_feature" in str(warning_ctx.warning) + def test_inference_batch_single_identical(self): self._test_inference_batch_single_identical(expected_max_diff=1e-3) @@ -290,7 +337,7 @@ def test_pixart_1024_fast(self): image_slice = image[0, -3:, -3:, -1] - expected_slice = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1323]) + expected_slice = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) max_diff = np.abs(image_slice.flatten() - expected_slice).max() self.assertLessEqual(max_diff, 1e-3) @@ -307,7 +354,7 @@ def test_pixart_512_fast(self): image_slice = image[0, -3:, -3:, -1] - expected_slice = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0266]) + expected_slice = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]) max_diff = np.abs(image_slice.flatten() - expected_slice).max() self.assertLessEqual(max_diff, 1e-3) @@ -323,7 +370,7 @@ def test_pixart_1024(self): image_slice = image[0, -3:, -3:, -1] - expected_slice = np.array([0.1501, 0.1755, 0.1877, 0.1445, 0.1665, 0.1763, 0.1389, 
0.176, 0.2031]) + expected_slice = np.array([0.1941, 0.2117, 0.2188, 0.1946, 0.218, 0.2124, 0.199, 0.2437, 0.2583]) max_diff = np.abs(image_slice.flatten() - expected_slice).max() self.assertLessEqual(max_diff, 1e-3) @@ -340,7 +387,26 @@ def test_pixart_512(self): image_slice = image[0, -3:, -3:, -1] - expected_slice = np.array([0.2515, 0.2593, 0.2593, 0.2544, 0.2759, 0.2788, 0.2812, 0.3169, 0.332]) + expected_slice = np.array([0.2637, 0.291, 0.2939, 0.207, 0.2512, 0.2783, 0.2168, 0.2324, 0.2817]) max_diff = np.abs(image_slice.flatten() - expected_slice).max() self.assertLessEqual(max_diff, 1e-3) + + def test_pixart_1024_without_resolution_binning(self): + generator = torch.manual_seed(0) + + pipe = PixArtAlphaPipeline.from_pretrained("PixArt-alpha/PixArt-XL-2-1024-MS", torch_dtype=torch.float16) + pipe.enable_model_cpu_offload() + + prompt = "A small cactus with a happy face in the Sahara desert." + + image = pipe(prompt, generator=generator, num_inference_steps=5, output_type="np").images + image_slice = image[0, -3:, -3:, -1] + + generator = torch.manual_seed(0) + no_res_bin_image = pipe( + prompt, generator=generator, num_inference_steps=5, output_type="np", use_resolution_binning=False + ).images + no_res_bin_image_slice = no_res_bin_image[0, -3:, -3:, -1] + + assert not np.allclose(image_slice, no_res_bin_image_slice, atol=1e-4, rtol=1e-4) diff --git a/tests/pipelines/stable_diffusion/test_stable_diffusion_adapter.py b/tests/pipelines/stable_diffusion/test_stable_diffusion_adapter.py index 2dcfb9d3612d..2252c8ef8e99 100644 --- a/tests/pipelines/stable_diffusion/test_stable_diffusion_adapter.py +++ b/tests/pipelines/stable_diffusion/test_stable_diffusion_adapter.py @@ -25,6 +25,7 @@ import diffusers from diffusers import ( AutoencoderKL, + LCMScheduler, MultiAdapter, PNDMScheduler, StableDiffusionAdapterPipeline, @@ -56,7 +57,7 @@ class AdapterTests: params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS - def get_dummy_components(self, adapter_type): + def get_dummy_components(self, adapter_type, time_cond_proj_dim=None): torch.manual_seed(0) unet = UNet2DConditionModel( block_out_channels=(32, 64), @@ -67,6 +68,7 @@ def get_dummy_components(self, adapter_type): down_block_types=("CrossAttnDownBlock2D", "DownBlock2D"), up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"), cross_attention_dim=32, + time_cond_proj_dim=time_cond_proj_dim, ) scheduler = PNDMScheduler(skip_prk_steps=True) torch.manual_seed(0) @@ -264,13 +266,13 @@ def test_inference_batch_single_identical(self): @parameterized.expand( [ # (dim=264) The internal feature map will be 33x33 after initial pixel unshuffling (downscaled x8). - ((4 * 8 + 1) * 8), + (((4 * 8 + 1) * 8),), # (dim=272) The internal feature map will be 17x17 after the first T2I down block (downscaled x16). - ((4 * 4 + 1) * 16), + (((4 * 4 + 1) * 16),), # (dim=288) The internal feature map will be 9x9 after the second T2I down block (downscaled x32). - ((4 * 2 + 1) * 32), + (((4 * 2 + 1) * 32),), # (dim=320) The internal feature map will be 5x5 after the third T2I down block (downscaled x64). 
-            ((4 * 1 + 1) * 64),
+            (((4 * 1 + 1) * 64),),
         ]
     )
     def test_multiple_image_dimensions(self, dim):
@@ -292,10 +294,30 @@ def test_multiple_image_dimensions(self, dim):
 
         assert image.shape == (1, dim, dim, 3)
 
+    def test_adapter_lcm(self):
+        device = "cpu"  # ensure determinism for the device-dependent torch.Generator
+
+        components = self.get_dummy_components(time_cond_proj_dim=256)
+        sd_pipe = StableDiffusionAdapterPipeline(**components)
+        sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+        sd_pipe = sd_pipe.to(torch_device)
+        sd_pipe.set_progress_bar_config(disable=None)
+
+        inputs = self.get_dummy_inputs(device)
+        output = sd_pipe(**inputs)
+        image = output.images
+
+        image_slice = image[0, -3:, -3:, -1]
+
+        assert image.shape == (1, 64, 64, 3)
+        expected_slice = np.array([0.4535, 0.5493, 0.4359, 0.5452, 0.6086, 0.4441, 0.5544, 0.501, 0.4859])
+
+        assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
 
 class StableDiffusionFullAdapterPipelineFastTests(AdapterTests, PipelineTesterMixin, unittest.TestCase):
-    def get_dummy_components(self):
-        return super().get_dummy_components("full_adapter")
+    def get_dummy_components(self, time_cond_proj_dim=None):
+        return super().get_dummy_components("full_adapter", time_cond_proj_dim=time_cond_proj_dim)
 
     def get_dummy_components_with_full_downscaling(self):
         return super().get_dummy_components_with_full_downscaling("full_adapter")
@@ -317,8 +339,8 @@ def test_stable_diffusion_adapter_default_case(self):
 
 
 class StableDiffusionLightAdapterPipelineFastTests(AdapterTests, PipelineTesterMixin, unittest.TestCase):
-    def get_dummy_components(self):
-        return super().get_dummy_components("light_adapter")
+    def get_dummy_components(self, time_cond_proj_dim=None):
+        return super().get_dummy_components("light_adapter", time_cond_proj_dim=time_cond_proj_dim)
 
     def get_dummy_components_with_full_downscaling(self):
         return super().get_dummy_components_with_full_downscaling("light_adapter")
@@ -340,8 +362,8 @@ def test_stable_diffusion_adapter_default_case(self):
 
 
 class StableDiffusionMultiAdapterPipelineFastTests(AdapterTests, PipelineTesterMixin, unittest.TestCase):
-    def get_dummy_components(self):
-        return super().get_dummy_components("multi_adapter")
+    def get_dummy_components(self, time_cond_proj_dim=None):
+        return super().get_dummy_components("multi_adapter", time_cond_proj_dim=time_cond_proj_dim)
 
     def get_dummy_components_with_full_downscaling(self):
         return super().get_dummy_components_with_full_downscaling("multi_adapter")
diff --git a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_adapter.py b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_adapter.py
index 1c83e80f3d6f..daf46000a1e0 100644
--- a/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_adapter.py
+++ b/tests/pipelines/stable_diffusion_xl/test_stable_diffusion_xl_adapter.py
@@ -26,6 +26,7 @@
 from diffusers import (
     AutoencoderKL,
     EulerDiscreteScheduler,
+    LCMScheduler,
     MultiAdapter,
     StableDiffusionXLAdapterPipeline,
     T2IAdapter,
@@ -59,7 +60,7 @@ class StableDiffusionXLAdapterPipelineFastTests(
     params = TEXT_GUIDED_IMAGE_VARIATION_PARAMS
     batch_params = TEXT_GUIDED_IMAGE_VARIATION_BATCH_PARAMS
 
-    def get_dummy_components(self, adapter_type="full_adapter_xl"):
+    def get_dummy_components(self, adapter_type="full_adapter_xl", time_cond_proj_dim=None):
         torch.manual_seed(0)
         unet = UNet2DConditionModel(
             block_out_channels=(32, 64),
@@ -77,6 +78,7 @@ def get_dummy_components(self, adapter_type="full_adapter_xl"):
             transformer_layers_per_block=(1, 2),
             projection_class_embeddings_input_dim=80,  # 6 * 8 + 32
             cross_attention_dim=64,
+            time_cond_proj_dim=time_cond_proj_dim,
         )
         scheduler = EulerDiscreteScheduler(
             beta_start=0.00085,
@@ -309,9 +311,9 @@ def test_stable_diffusion_adapter_default_case(self):
     @parameterized.expand(
         [
             # (dim=144) The internal feature map will be 9x9 after initial pixel unshuffling (downscaled x16).
-            ((4 * 2 + 1) * 16),
+            (((4 * 2 + 1) * 16),),
             # (dim=160) The internal feature map will be 5x5 after the first T2I down block (downscaled x32).
-            ((4 * 1 + 1) * 32),
+            (((4 * 1 + 1) * 32),),
         ]
     )
     def test_multiple_image_dimensions(self, dim):
@@ -367,12 +369,32 @@ def test_total_downscale_factor(self, adapter_type):
     def test_save_load_optional_components(self):
         return self._test_save_load_optional_components()
 
+    def test_adapter_sdxl_lcm(self):
+        device = "cpu"  # ensure determinism for the device-dependent torch.Generator
+
+        components = self.get_dummy_components(time_cond_proj_dim=256)
+        sd_pipe = StableDiffusionXLAdapterPipeline(**components)
+        sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+        sd_pipe = sd_pipe.to(torch_device)
+        sd_pipe.set_progress_bar_config(disable=None)
+
+        inputs = self.get_dummy_inputs(device)
+        output = sd_pipe(**inputs)
+        image = output.images
+
+        image_slice = image[0, -3:, -3:, -1]
+
+        assert image.shape == (1, 64, 64, 3)
+        expected_slice = np.array([0.5425, 0.5385, 0.4964, 0.5045, 0.6149, 0.4974, 0.5469, 0.5332, 0.5426])
+
+        assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
 
 class StableDiffusionXLMultiAdapterPipelineFastTests(
     StableDiffusionXLAdapterPipelineFastTests, PipelineTesterMixin, unittest.TestCase
 ):
-    def get_dummy_components(self):
-        return super().get_dummy_components("multi_adapter")
+    def get_dummy_components(self, time_cond_proj_dim=None):
+        return super().get_dummy_components("multi_adapter", time_cond_proj_dim=time_cond_proj_dim)
 
     def get_dummy_components_with_full_downscaling(self):
         return super().get_dummy_components_with_full_downscaling("multi_adapter")
@@ -569,6 +591,26 @@ def test_inference_batch_single_identical(
         if test_mean_pixel_difference:
             assert_mean_pixel_difference(output_batch[0][0], output[0][0])
 
+    def test_adapter_sdxl_lcm(self):
+        device = "cpu"  # ensure determinism for the device-dependent torch.Generator
+
+        components = self.get_dummy_components(time_cond_proj_dim=256)
+        sd_pipe = StableDiffusionXLAdapterPipeline(**components)
+        sd_pipe.scheduler = LCMScheduler.from_config(sd_pipe.scheduler.config)
+        sd_pipe = sd_pipe.to(torch_device)
+        sd_pipe.set_progress_bar_config(disable=None)
+
+        inputs = self.get_dummy_inputs(device)
+        output = sd_pipe(**inputs)
+        image = output.images
+
+        image_slice = image[0, -3:, -3:, -1]
+
+        assert image.shape == (1, 64, 64, 3)
+        expected_slice = np.array([0.5313, 0.5375, 0.4942, 0.5021, 0.6142, 0.4968, 0.5434, 0.5311, 0.5448])
+
+        assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
+
 
 @slow
 @require_torch_gpu
diff --git a/tests/pipelines/text_to_video_synthesis/test_text_to_video.py b/tests/pipelines/text_to_video_synthesis/test_text_to_video.py
index 933583ce4b70..e9f435239c92 100644
--- a/tests/pipelines/text_to_video_synthesis/test_text_to_video.py
+++ b/tests/pipelines/text_to_video_synthesis/test_text_to_video.py
@@ -62,8 +62,8 @@ class TextToVideoSDPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
     def get_dummy_components(self):
         torch.manual_seed(0)
         unet = UNet3DConditionModel(
-            block_out_channels=(32, 32),
-            layers_per_block=2,
+            block_out_channels=(4, 8),
+            layers_per_block=1,
             sample_size=32,
             in_channels=4,
             out_channels=4,
@@ -71,6 +71,7 @@ def get_dummy_components(self):
             up_block_types=("UpBlock3D", "CrossAttnUpBlock3D"),
             cross_attention_dim=4,
             attention_head_dim=4,
+            norm_num_groups=2,
         )
         scheduler = DDIMScheduler(
             beta_start=0.00085,
@@ -81,13 +82,14 @@ def get_dummy_components(self):
         )
         torch.manual_seed(0)
         vae = AutoencoderKL(
-            block_out_channels=(32,),
+            block_out_channels=(8,),
             in_channels=3,
             out_channels=3,
             down_block_types=["DownEncoderBlock2D"],
             up_block_types=["UpDecoderBlock2D"],
             latent_channels=4,
             sample_size=32,
+            norm_num_groups=2,
         )
         torch.manual_seed(0)
         text_encoder_config = CLIPTextConfig(
@@ -142,10 +144,11 @@ def test_text_to_video_default_case(self):
 
         image_slice = frames[0][-3:, -3:, -1]
 
         assert frames[0].shape == (32, 32, 3)
-        expected_slice = np.array([91.0, 152.0, 66.0, 192.0, 94.0, 126.0, 101.0, 123.0, 152.0])
+        expected_slice = np.array([192.0, 44.0, 157.0, 140.0, 108.0, 104.0, 123.0, 144.0, 129.0])
 
         assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2
 
+    @unittest.skipIf(torch_device != "cuda", reason="Feature isn't heavily used. Test in CUDA environment only.")
     def test_attention_slicing_forward_pass(self):
         self._test_attention_slicing_forward_pass(test_mean_pixel_difference=False, expected_max_diff=3e-3)
diff --git a/tests/pipelines/text_to_video_synthesis/test_video_to_video.py b/tests/pipelines/text_to_video_synthesis/test_video_to_video.py
index b5fe3451774b..1785eb967f16 100644
--- a/tests/pipelines/text_to_video_synthesis/test_video_to_video.py
+++ b/tests/pipelines/text_to_video_synthesis/test_video_to_video.py
@@ -70,15 +70,16 @@ class VideoToVideoSDPipelineFastTests(PipelineTesterMixin, unittest.TestCase):
     def get_dummy_components(self):
         torch.manual_seed(0)
         unet = UNet3DConditionModel(
-            block_out_channels=(32, 64, 64, 64),
-            layers_per_block=2,
+            block_out_channels=(4, 8),
+            layers_per_block=1,
             sample_size=32,
             in_channels=4,
             out_channels=4,
-            down_block_types=("CrossAttnDownBlock3D", "CrossAttnDownBlock3D", "CrossAttnDownBlock3D", "DownBlock3D"),
-            up_block_types=("UpBlock3D", "CrossAttnUpBlock3D", "CrossAttnUpBlock3D", "CrossAttnUpBlock3D"),
+            down_block_types=("CrossAttnDownBlock3D", "DownBlock3D"),
+            up_block_types=("UpBlock3D", "CrossAttnUpBlock3D"),
             cross_attention_dim=32,
             attention_head_dim=4,
+            norm_num_groups=2,
         )
         scheduler = DDIMScheduler(
             beta_start=0.00085,
@@ -89,13 +90,18 @@ def get_dummy_components(self):
         )
         torch.manual_seed(0)
         vae = AutoencoderKL(
-            block_out_channels=[32, 64],
+            block_out_channels=[
+                8,
+            ],
             in_channels=3,
             out_channels=3,
-            down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
-            up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
+            down_block_types=[
+                "DownEncoderBlock2D",
+            ],
+            up_block_types=["UpDecoderBlock2D"],
             latent_channels=4,
-            sample_size=128,
+            sample_size=32,
+            norm_num_groups=2,
         )
         torch.manual_seed(0)
         text_encoder_config = CLIPTextConfig(
@@ -154,7 +160,7 @@ def test_text_to_video_default_case(self):
 
         image_slice = frames[0][-3:, -3:, -1]
 
         assert frames[0].shape == (32, 32, 3)
-        expected_slice = np.array([106, 117, 113, 174, 137, 112, 148, 151, 131])
+        expected_slice = np.array([162.0, 136.0, 132.0, 140.0, 139.0, 137.0, 169.0, 134.0, 132.0])
 
         assert np.abs(image_slice.flatten() - expected_slice).max() < 1e-2