diff --git a/index.html b/index.html
index e83c4fb..3bfed15 100644
--- a/index.html
+++ b/index.html
@@ -34,7 +34,12 @@
First, we measure image fidelity and image-text alignment using the standard metrics FID-30K and CLIP score. We find that MultiFusion prompted with text only performs on par with Stable Diffusion, despite the extension of the encoder to support multiple languages and modalities.
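As a rough illustration of the alignment metric, the following is a minimal sketch of how a CLIP score for a single image-prompt pair might be computed; the checkpoint name and the cosine-similarity convention are assumptions here, not the paper's exact evaluation pipeline.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the paper's evaluation may use a different CLIP variant.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # Embeddings returned by CLIPModel are L2-normalized, so the dot
    # product is the cosine similarity.
    return (outputs.image_embeds @ outputs.text_embeds.T).item()
```

In practice the score is averaged over the generated images for a full prompt set, with FID-30K computed separately over 30K samples.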
Image composition is a known limitation of diffusion models. Through evaluation on our new benchmark MCC-250, we show that multimodal prompting leads to greater compositional robustness, as judged by humans. Each prompt is a complex conjunction of two different objects with different colors; multimodal prompts contain one visual reference for each object, interleaved with the text input.
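To make the prompt structure concrete, below is a hypothetical sketch of one interleaved MCC-250-style prompt. The `MultimodalPrompt` container, the file paths, and the example objects are illustrative assumptions, not the released benchmark format or the MultiFusion API.

```python
from dataclasses import dataclass
from typing import List, Union
from PIL import Image

@dataclass
class MultimodalPrompt:
    # Ordered sequence of text fragments and visual references,
    # consumed left to right by a multimodal encoder.
    parts: List[Union[str, Image.Image]]

# One benchmark-style prompt: a conjunction of two objects with
# distinct colors, each text mention paired with a visual reference.
prompt = MultimodalPrompt(parts=[
    "a photo of a red apple",
    Image.open("refs/red_apple.png"),    # visual reference for object 1
    "next to a blue vase",
    Image.open("refs/blue_vase.png"),    # visual reference for object 2
])
```

The text-only variant of the same prompt would simply drop the two images, which is what the human evaluation compares against.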