diff --git a/.gitignore b/.gitignore
index fe89a3a..c6ac584 100644
--- a/.gitignore
+++ b/.gitignore
@@ -175,3 +175,5 @@ pyrightconfig.json
# End of https://www.toptal.com/developers/gitignore/api/python
+*.DS_Store
+
diff --git a/README.md b/README.md
new file mode 100644
index 0000000..69373d4
--- /dev/null
+++ b/README.md
@@ -0,0 +1,44 @@
+# Academic Project Page Template
+This is an academic paper project page template.
+
+
+Example project pages built using this template are:
+- https://www.vision.huji.ac.il/deepsim/
+- https://www.vision.huji.ac.il/3d_ads/
+- https://www.vision.huji.ac.il/ssrl_ad/
+- https://www.vision.huji.ac.il/conffusion/
+
+
+## Start using the template
+To start using the template, click on `Use this Template`.
+
+The template uses HTML for controlling the content and CSS for controlling the style.
+To edit the website's contents, edit the `index.html` file. It contains different HTML "building blocks"; use whichever ones you need and comment out the rest, as sketched below.
+
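+For example, keeping the blocks you need and commenting out the rest could look like the snippet below (the section names, classes, and file paths here are illustrative, not the template's exact markup):
+
+```html
+<!-- Teaser video block: kept, and pointed at your own video file -->
+<section class="hero teaser">
+  <video autoplay muted loop playsinline src="static/videos/teaser.mp4"></video>
+</section>
+
+<!-- Image carousel block: not needed for this paper, so it stays commented out -->
+<!--
+<section class="image-carousel">
+  ...
+</section>
+-->
+```
+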
+**IMPORTANT!** Make sure to replace the `favicon.ico` under `static/images/` with one of your own, otherwise your favicon is going to be a dreambooth image of me.
+
+## Components
+- Teaser video
+- Images Carousel
+- YouTube embedding
+- Video Carousel
+- PDF Poster
+- BibTeX citation
+
+## Tips:
+- The `index.html` file contains comments instructing you what to replace; follow these comments.
+- The `meta` tags in the `index.html` file are used to provide metadata about your paper
+(e.g. helping search engines index the website, showing a preview image when the website is shared, etc.); see the example after this list.
+- The resolution of images and videos can usually be around 1920-2048; there is rarely a need for a higher resolution, which only takes longer to load.
+- All the images and videos you use should be compressed to allow for fast loading of the website (and thus better indexing by search engines). For images, you can use [TinyPNG](https://tinypng.com); for videos, you need to find a tradeoff between size and quality.
+- When using large video files (larger than 10MB), it's better to host the video on YouTube, as serving the video from the website itself can be slow.
+- Using a tracker can help you analyze the traffic and see where users came from. [statcounter](https://statcounter.com) is a free, easy-to-use tracker that takes under 5 minutes to set up.
+- This project page can also be made into a GitHub Pages website.
+- Replace the favicon with one of your choosing (the default one is of the Hebrew University).
+- Suggestions, improvements and comments are welcome; simply open an issue or contact me. You can find my contact information at [https://pages.cs.huji.ac.il/eliahu-horwitz/](https://pages.cs.huji.ac.il/eliahu-horwitz/)
+
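+As an example of the kind of `meta` tags mentioned above (the values are placeholders, and the exact set of tags shipped in the template may differ):
+
+```html
+<!-- Description used by search engines -->
+<meta name="description" content="Project page for YOUR PAPER TITLE">
+<!-- Open Graph tags control the preview shown when the link is shared -->
+<meta property="og:title" content="YOUR PAPER TITLE">
+<meta property="og:description" content="One-sentence summary of the paper">
+<meta property="og:image" content="static/images/preview.png">
+<!-- Twitter card preview -->
+<meta name="twitter:card" content="summary_large_image">
+```
+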
+## Acknowledgments
+Parts of this project page were adapted from the [Nerfies](https://nerfies.github.io/) page.
+
+## Website License
+
+This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
diff --git a/index.html b/index.html
index ba95823..e87dba1 100644
--- a/index.html
+++ b/index.html
@@ -1,2 +1,468 @@
-Hello world!
-
+IDOL
+Significant advances have been made in human-centric video generation, yet the joint video-depth generation problem remains underexplored. Most existing monocular depth estimation methods may not generalize well to synthesized images or videos, and multi-view-based methods have difficulty controlling the human appearance and motion. In this work, we present IDOL (unIfied Dual-mOdal Latent diffusion) for high-quality human-centric joint video-depth generation. Our IDOL consists of two novel designs. First, to enable dual-modal generation and maximize the information exchange between video and depth generation, we propose a unified dual-modal U-Net, a parameter-sharing framework for joint video and depth denoising, wherein a modality label guides the denoising target, and cross-modal attention enables the mutual information flow. Second, to ensure a precise video-depth spatial alignment, we propose a motion consistency loss that enforces consistency between the video and depth feature motion fields, leading to harmonized outputs. Additionally, a cross-attention map consistency loss is applied to align the cross-attention map of the video denoising with that of the depth denoising, further facilitating spatial alignment. Extensive experiments on the TikTok and NTU120 datasets show our superior performance, significantly surpassing existing methods in terms of video FVD and depth accuracy.
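+One way to read the parameter-sharing design, written here purely as an illustration (the notation \(\epsilon_\theta\), \(z^{m}_t\), \(c\) is ours, not the paper's exact objective): both modalities are denoised by the same network, switched by the one-hot modality label,
+\[
+\mathcal{L}_{\text{diff}} = \mathbb{E}_{z^{m}_0,\, \epsilon,\, t}\Big[\big\lVert \epsilon - \epsilon_\theta\big(z^{m}_t,\, t,\, c,\, y_m\big) \big\rVert_2^2\Big], \qquad m \in \{\text{v}, \text{d}\},
+\]
+where \(z^{\text{v}}_t\) and \(z^{\text{d}}_t\) are the noised video and depth latents, \(c\) denotes the conditioning inputs, and \(y_{\text{v}}, y_{\text{d}}\) are the modality labels.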
+Left: Overall model architecture. Our IDOL features a unified dual-modal U-Net (gray boxes), a parameter-sharing design for joint video-depth denoising, wherein the denoising target is controlled by a one-hot modality label (\(y_{\text{v}}\) for video and \(y_{\text{d}}\) for depth).
+Right: U-Net block structure. Cross-modal attention is added to enable mutual information flow between video and depth features, with consistency loss terms \(\mathcal{L}_{\text{mo}}\) and \(\mathcal{L}_{\text{xattn}}\) ensuring the video-depth alignment. Skip connections are omitted for conciseness.
+Visualization of the video and depth feature maps and their motion fields without the consistency losses. We attribute the inconsistent video-depth output (blue circle) to the inconsistent video-depth feature motions (last row). This problem exists in multiple layers within the U-Net; we randomly select layers 4 and 7 in the up block for visualization. For the feature map visualization, we follow Plug-and-Play and apply PCA to the video and depth features at each individual layer, rendering the first three components. The motion field is visualized similarly to optical flow, where different colors indicate different motion directions.
+To promote video-depth consistency, we propose a motion consistency loss \(\mathcal{L}_{\text{mo}}\) to synchronize the video and depth feature motions, and a cross-attention map consistency loss \(\mathcal{L}_{\text{xattn}}\) to align the cross-attention map of the video denoising with that of the depth denoising.
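+One plausible form of these two terms, given only as an illustration (the motion-field operator \(\mathcal{M}(\cdot)\), the features \(f\), the maps \(A\), and the choice of L2 penalties are our assumptions, not the paper's definition):
+\[
+\mathcal{L}_{\text{mo}} = \big\lVert \mathcal{M}(f_{\text{v}}) - \mathcal{M}(f_{\text{d}}) \big\rVert_2^2, \qquad
+\mathcal{L}_{\text{xattn}} = \big\lVert A_{\text{v}} - A_{\text{d}} \big\rVert_2^2,
+\]
+where \(f_{\text{v}}, f_{\text{d}}\) are the video and depth U-Net features, \(\mathcal{M}(\cdot)\) extracts their motion fields, and \(A_{\text{v}}, A_{\text{d}}\) are the corresponding cross-attention maps.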
+Example #1
+Example #2
+Background editing examples
+Foreground editing examples
+Compared with other multi-modal generation methods (MM-Diffusion and LDM3D), our IDOL (1) generates spatially aligned video and depth, (2) produces smoother video, and (3) better preserves the human identity.
+Example on TikTok videos
+Example on NTU120 videos
+Our IDOL achieves the best video and depth generation quality on the TikTok and NTU120 datasets.
+Our IDOL is developed based on DisCo: Disentangled Control for Referring Human Dance Generation in Real World.
+We thank Tan Wang from Nanyang Technological University for valuable feedback and discussion.
+@inproceedings{zhai2024idol,
+ title={IDOL: Unified Dual-Modal Latent Diffusion for Human-Centric Joint Video-Depth Generation},
+ author={Zhai, Yuanhao and Lin, Kevin and Li, Linjie and Lin, Chung-Ching and Wang, Jianfeng and Yang, Zhengyuan and Doermann, David and Yuan, Junsong and Liu, Zicheng and Wang, Lijuan},
+ year={2024},
+ booktitle={Proceedings of the European Conference on Computer Vision},
+}
+