From 2cc8c06a80f2050766ea5558d0a9181a066f9959 Mon Sep 17 00:00:00 2001
From: HannahBenita <77296142+HannahBenita@users.noreply.github.com>
Date: Fri, 1 Dec 2023 18:07:14 +0100
Subject: [PATCH] Update index.html

---
 index.html | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/index.html b/index.html
index e83c4fb..3bfed15 100644
--- a/index.html
+++ b/index.html
@@ -34,7 +34,12 @@

-
@@ -252,11 +253,11 @@

Image Fidelity and Text-to-Image Alignment

First, we measure image fidelity and image-text alignment using the standard metrics FID-30K and CLIP score. We find that MultiFusion prompted with text only performs on par with Stable Diffusion, despite the extension of the encoder to support multiple languages and modalities.
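For intuition, CLIP score is commonly computed as a scaled, non-negative cosine similarity between an image embedding and a text embedding. The sketch below shows only that final similarity step on precomputed embedding vectors; it does not reproduce the paper's evaluation pipeline, and the function name and use of raw NumPy vectors are illustrative assumptions.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Illustrative CLIP score: 100 * cosine similarity, clipped at 0.

    Assumes `image_emb` and `text_emb` are already-extracted embedding
    vectors (e.g. from a CLIP image/text encoder) of equal dimension.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    return float(max(100.0 * np.dot(image_emb, text_emb), 0.0))
```

A perfectly aligned pair (identical embeddings) scores 100.0, while orthogonal or opposed embeddings score 0.0; in practice the embeddings would come from a pretrained CLIP model and the score is averaged over an evaluation set.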


Compositional Robustness

-
+
method
-
+

Image composition is a known limitation of diffusion models. Evaluating on our new benchmark MCC-250, we show that multimodal prompting leads to greater compositional robustness, as judged by humans. Each prompt is a complex conjunction of two objects with different colors, with multimodal prompts containing one visual reference for each object, interleaved with the text input.
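The interleaved prompt structure described above can be sketched as a simple data model: a prompt is a sequence of text segments and image references, with one visual reference per object. This is a hypothetical illustration of the MCC-250 prompt shape only; the `ImageRef` type, function name, and wording template are assumptions, not MultiFusion's actual API.

```python
from dataclasses import dataclass
from typing import List, Union

@dataclass
class ImageRef:
    """Placeholder for a visual reference in a multimodal prompt."""
    path: str

# A multimodal prompt is an interleaved sequence of text and image segments.
Segment = Union[str, ImageRef]

def build_conjunction_prompt(color_a: str, obj_a: str, img_a: str,
                             color_b: str, obj_b: str, img_b: str) -> List[Segment]:
    """Build an MCC-250-style prompt: two colored objects, each followed
    by one visual reference, joined as a conjunction."""
    return [
        f"a {color_a} {obj_a}", ImageRef(img_a),
        "and",
        f"a {color_b} {obj_b}", ImageRef(img_b),
    ]
```

For example, `build_conjunction_prompt("red", "backpack", "backpack.png", "blue", "chair", "chair.png")` yields a five-segment sequence alternating text and image references, which a multimodal encoder could consume token-stream-style.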