A Collection of Video Generation Studies

This GitHub repository summarizes papers and resources related to the video generation task.

If you have any suggestions about this repository, please feel free to start a new issue or pull requests.

Recent news of this GitHub repo are listed as follows.

🔥 [Nov. 19th] We have released our latest paper titled "StableV2V: Stablizing Shape Consistency in Video-to-Video Editing", with the correponding code, model weights, and a testing benchmark DAVIS-Edit open-sourced. Feel free to check them out from the links!

Click to see more information.

[Jun. 17th] All NeurIPS 2023 papers and references are updated.
[Apr. 26th] Update a new direction: Personalized Video Generation.
[Mar. 28th] The official AAAI 2024 paper list are released! Official version of PDFs and BibTeX references are updated accordingly.

To-Do Lists

Latest Papers
- Update ECCV 2024 Papers
- Update CVPR 2024 Papers
  - Update PDFs and References of ⚠️ Papers
  - Update Published Versions of References
- Update AAAI 2024 Papers
  - Update PDFs and References of ⚠️ Papers
  - Update Published Versions of References
- Update ICLR 2024 Papers
- Update NeurIPS 2023 Papers
Previously Published Papers
- Update Previous CVPR papers
- Update Previous ICCV papers
- Update Previous ECCV papers
- Update Previous NeurIPS papers
- Update Previous ICLR papers
- Update Previous AAAI papers
- Update Previous ACM MM papers
Regular Maintenance of Preprint arXiv Papers and Missed Papers

<🎯Back to Top>

Products

Name	Organization	Year	Research Paper	Website	Specialties
Sora	OpenAI	2024	link	link	-
Lumiere	Google	2024	link	link	-
VideoPoet	Google	2023	-	link	-
W.A.I.T	Google	2023	link	link	-
Gen-2	Runaway	2023	-	link	-
Gen-1	Runaway	2023	-	link	-
Animate Anyone	Alibaba	2023	link	link	-
Outfit Anyone	Alibaba	2023	-	link	-
Stable Video	StabilityAI	2023	link	link	-
Pixeling	HiDream.ai	2023	-	link	-
DomoAI	DomoAI	2023	-	link	-
Emu	Meta	2023	link	link	-
Genmo	Genmo	2023	-	link	-
NeverEnds	NeverEnds	2023	-	link	-
Moonvalley	Moonvalley	2023	-	link	-
Morph Studio	Morph	2023	-	link	-
Pika	Pika	2023	-	link	-
PixelDance	ByteDance	2023	link	link	-

<🎯Back to Top>

Papers

Survey Papers

Year 2024
arXiv
- Video Diffusion Models: A Survey [Paper]
Year 2023
arXiv
- A Survey on Video Diffusion Models [Paper]

Text-to-Video Generation

Year 2024
- CVPR
  - Vlogger: Make Your Dream A Vlog [Paper] [Code]
  - Make Pixels Dance: High-Dynamic Video Generation [Paper] [Project] [Demo]
  - VGen: Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [Paper] [Code] [Project]
  - GenTron: Delving Deep into Diffusion Transformers for Image and Video Generation [Paper] [Project]
  - SimDA: Simple Diffusion Adapter for Efficient Video Generation [Paper] [Code] [Project]
  - MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation [Paper] [Project] [Video]
  - Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models [Paper] [Project]
  - PEEKABOO: Interactive Video Generation via Masked-Diffusion [Paper] [Code] [Project] [Demo]
  - EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [Paper] [Code] [Project]
  - A Recipe for Scaling up Text-to-Video Generation with Text-free Videos [Paper] [Code] [Project]
  - BIVDiff: A Training-free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models [Paper] [Project]
  - Mind the Time: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [Paper] [Project]
  - Animate Anyone: Consistent and Controllable Image-to-video Synthesis for Character Animation [Paper] [Code] [Project]
  - MotionDirector: Motion Customization of Text-to-Video Diffusion Models [Paper] [Code]
  - Hierarchical Patch-wise Diffusion Models for High-Resolution Video Generation [Paper] [Project]
  - DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation [Paper] [Code]
  - Grid Diffusion Models for Text-to-Video Generation [Paper] [Code] [Video]
- ECCV
  - Emu Video: Factorizing Text-to-Video Generation by Explicit Image Conditioning [Paper] [Project]
  - W.A.L.T.: Photorealistic Video Generation with Diffusion Models [Paper] [Project]
  - MoVideo: Motion-Aware Video Generation with Diffusion Models [Paper]
  - DrivingDiffusion: Layout-Guided Multi-View Driving Scenarios Video Generation with Latent Diffusion Model [Paper] [Code] [Project]
  - MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing [Paper]
  - HARIVO: Harnessing Text-to-Image Models for Video Generation [Paper] [Project]
  - MEVG: Multi-event Video Generation with Text-to-Video Models [Paper] [Project]
  - DeCo: Decoupled Human-Centered Diffusion Video Editing with Motion Consistency [Paper]
  - SAVE: Protagonist Diversification with Structure Agnostic Video Editing
- ICLR
  - VDT: General-purpose Video Diffusion Transformers via Mask Modeling [Paper] [Code] [Project]
  - VersVideo: Leveraging Enhanced Temporal Diffusion Models for Versatile Video Generation [Paper]
- AAAI
  - Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos [Paper] [Code] [Project]
  - E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [Paper]
  - ConditionVideo: Training-Free Condition-Guided Text-to-Video Generation [Paper] [Code] [Project]
  - F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text to-Video Synthesis [Paper]
- arXiv
  - Lumiere: A Space-Time Diffusion Model for Video Generation [Paper] [Project]
  - Boximator: Generating Rich and Controllable Motions for Video Synthesis [Paper] [Project] [Video]
  - World Model on Million-Length Video And Language With RingAttention [Paper] [Code] [Project]
  - Direct-a-Video: Customized Video Generation with User-Directed Camera Movement and Object Motion [Paper] [Project]
  - WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens [Paper] [Code] [Project]
  - MagicVideo-V2: Multi-Stage High-Aesthetic Video Generation [Paper] [Project]
  - Latte: Latent Diffusion Transformer for Video Generation [Paper] [Code] [Project]
  - Mora: Enabling Generalist Video Generation via A Multi-Agent Framework [Paper] [Code]
  - StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [Paper] [Code] [Project] [Video]
  - VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [Paper]
  - StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation [Paper] [Code] [Project] [Demo]
  - Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model [Paper] [Code] [Project]
  - ControlNeXt: Powerful and Efficient Control for Image and Video Generation [Paper] [Code] [Project]
  - FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance [Paper] [Project]
  - Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data [Paper] [Code]
  - Fine-gained Zero-shot Video Sampling [Paper] [Project]
  - Training-free Long Video Generation with Chain of Diffusion Model Experts [Paper]
  - ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model [Paper] [Code] [Project] [Video]
  - ConFiner: Training-free Long Video Generation with Chain of Diffusion Model Experts [Paper] [Code]
- Others
  - Sora: Video Generation Models as World Simulators [Paper]
Year 2023
- CVPR
  - Align your Latents: High-resolution Video Synthesis with Latent Diffusion Models [Paper] [Project] [Reproduced code]
  - Text2Video-Zero: Text-to-image Diffusion Models are Zero-shot Video Generators [Paper] [Code] [Demo] [Project]
  - Video Probabilistic Diffusion Models in Projected Latent Space [Paper] [Code]
- ICCV
  - Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models [Paper] [Project]
  - Gen-1: Structure and Content-guided Video Synthesis with Diffusion Models [Paper] [Project]
- NeurIPS
  - Video Diffusion Models [Paper] [Project]
  - Learning Universal Policies via Text-Guided Video Generation [Paper] [Project] [Code]
  - VideoComposer: Compositional Video Synthesis with Motion Controllability [Paper] [Code] [Project]
- ICLR
  - CogVideo: Large-scale Pretraining for Text-to-video Generation via Transformers [Paper] [Code] [Demo]
  - Make-A-Video: Text-to-video Generation without Text-video Data [Paper] [Project] [Reproduced code]
  - Phenaki: Variable Length Video Generation From Open Domain Textual Description [Paper] [Reproduced Code]
- arXiv
  - Control-A-Video: Controllable Text-to-video Generation with Diffusion Models [Paper] [Code] [Demo] [Project]
  - ControlVideo: Training-free Controllable Text-to-video Generation [Paper] [Code]
  - Imagen Video: High Definition Video Generation with Diffusion Models [Paper]
  - Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-video Generation [Paper] [Project]
  - LAVIE: High-quality Video Generation with Cascaded Latent Diffusion Models [Paper] [Code] [Project]
  - Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-video Generation [Paper] [Code] [Project]
  - Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets [Paper] [Code] [Project]
  - VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-video Generation [Paper] [Dataset]
  - VideoGen: A Reference-guided Latent Diffusion Approach for High Definition Text-to-video Generation [Paper] [Code]
  - InstructVideo: Instructing Video Diffusion Models with Human Feedback [Paper] [Code] [Project]
  - SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction [Paper] [Code] [Project]
  - VideoLCM: Video Latent Consistency Model [Paper]
  - ModelScope Text-to-Video Technical Report [Paper] [Code]
  - LAMP: Learn A Motion Pattern for Few-Shot-Based Video Generation [Paper] [Code] [Project]
Year 2022
- CVPR
  - Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [Paper] [Code] [Dataset]
Year 2021
- arXiv
  - VideoGPT: Video Generation using VQ-VAE and Transformers [Paper] [Code] [Project]
  - MagicVideo: Efficient Video Generation With Latent Diffusion Models [Paper]
  - EasyAnimate: A High-Performance Long Video Generation Method based on Transformer Architecture [Paper] [Code] <🎯Back to Top>

Image-to-Video Generation

Year 2024
- CVPR
  - VideoBooth: Diffusion-based Video Generation with Image Prompts [Paper] [Code] [Project] [Video]
- ECCV
  - Rethinking Image-to-Video Adaptation: An Object-centric Perspective [Paper]
  - PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation [Paper] [Code] [Project]
- AAAI
  - Decouple Content and Motion for Conditional Image-to-Video Generation [Paper]
- arXiv
  - ConsistI2V: Enhancing Visual Consistency for Image-to-Video Generation [Paper] [Code] [Project]
  - I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models [Paper] [Code]
  - Follow-Your-Click: Open-domain Regional Image Animation via Short Prompts [Paper] [Code] [Project]
  - AtomoVideo: High Fidelity Image-to-Video Generation [Paper] [Project] [Video]
  - Pix2Gif: Motion-Guided Diffusion for GIF Generation [Paper] [Code] [Project]
  - ID-Animator: Zero-Shot Identity-Preserving Human Video Generation [Paper] [Code] [Project]
  - Tuning-Free Noise Rectification for High Fidelity Image-to-Video Generation [Paper] [Project]
  - MegActor-Σ: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer [Paper] [Code]
Year 2023
- CVPR
  - Conditional Image-to-Video Generation with Latent Flow Diffusion Models [Paper] [Code]
- arXiv
  - I2VGen-XL: High-quality Image-to-video Synthesis via Cascaded Diffusion Models [Paper] [Code] [Project]
  - DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance [Paper] [Code] [Project]
  - DynamiCrafter: Animating Open-domain Images with Video Diffusion Priors [Paper] [Project] [Code] [Video] [Demo]
  - AnimateDiff: Animate Your Personalized Text-to-image Diffusion Models without Specific Tuning [Paper] [Project]
Year 2022
- CVPR
  - Make It Move: Controllable Image-to-Video Generation with Text Descriptions [Paper] [Code]
Year 2021
- ICCV
  - Click to Move: Controlling Video Generation with Sparse Motion [Paper] [Code]

<🎯Back to Top>

Audio-to-Video Generation

Year 2024
- AAAI
  - Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation [Paper] [Code]
Year 2023
- CVPR
  - MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation [Paper] [Code]

<🎯Back to Top>

Personalized Video Generation

Year 2024
- CVPR
  - High-fidelity Person-centric Subject-to-Image Synthesis [Paper] [Code]
- ECCV
  - PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control [Paper] [Project]
- arXiv
  - Magic-Me: Identity-Specific Video Customized Diffusion [Paper] [Code] [Project] [Demo]
  - ReVideo: Remake a Video with Motion and Content Control [Paper] [Code] [Project]
Year 2023
- arXiv
  - FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention [Paper] [Code] [Demo]
  - Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance [Paper] [Project]
  - DreamVideo-2: Zero-Shot Subject-Driven Video Customization with Precise Motion Control [Paper] [Project]

<🎯Back to Top>

Video Editing

Year 2024
- CVPR
  - VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [Paper] [Code] [Project]
  - Fairy: Fast Parallellized Instruction-Guided Video-to-Video Synthesis [Paper] [Project]
  - CCEdit: Creative and Controllable Video Editing via Diffusion Models [Paper] [Code] [Project] [Video]
  - DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing [Paper] [Project] [Video]
  - Video-P2P: Video Editing with Cross-attention Control [Paper] [Code] [Project]
  - A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing [Paper] [Code] [Project]
  - MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers [Paper] [Code] [Project]
  - VidToMe: Video Token Merging for Zero-Shot Video Editing [Paper] [Code] [Project] [Video]
  - Towards Language-Driven Video Inpainting via Multimodal Large Language Models [Paper] [Code] [Project] [Dataset]
  - AVID: Any-Length Video Inpainting with Diffusion Model [Paper] [Code] [Project]
  - CAMEL: CAusal Motion Enhancement tailored for Lifting Text-driven Video Editing [Paper] [Code]
  - Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer [Paper] [Code] [Project]
- ECCV
  - DragVideo: Interactive Drag-style Video Editing [Paper]
  - Video Editing via Factorized Diffusion Distillation [Paper]
  - OCD: Object-Centric Diffusion for Efficient Video Editing [Paper] [Project]
  - DreamMotion: Space-Time Self-Similarity Score Distillation for Zero-Shot Video Editing [Paper] [Project]
  - WAVE: Warping DDIM Inversion Features for Zero-shot Text-to-Video Editing [Paper] [Project]
  - Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion [Paper] [Code] [Project]
- ICLR
  - Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [Paper] [Code] [Project]
  - TokenFlow: Consistent Diffusion Features for Consistent Video Editing [Paper] [Code] [Project]
  - Consistent Video-to-Video Transfer Using Synthetic Dataset [Paper] [Code]
  - FLATTEN: Optical FLow-guided ATTENtion for Consistent Text-to-Video Editing [Paper] [Code] [Project]
- arXiv
  - Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [Paper] [Code] [Project]
  - UniEdit: A Unified Tuning-Free Framework for Video Motion and Appearance Editing [Paper] [Code] [Project]
  - DragAnything: Motion Control for Anything using Entity Representation [Paper] [Code] [Project]
  - AnyV2V: A Plug-and-Play Framework for Any Video-to-Video Editing Tasks [Paper] [Code] [Project]
  - CoCoCo: Improving Text-Guided Video Inpainting for Better Consistency, Controllability and Compatibility [Paper] [Code] [Project]
  - VASE: Object-Centric Appearance and Shape Manipulation of Real Videos [Paper]
  - StableV2V: Stablizing Shape Consistency in Video-to-Video Editing [Paper] [Code] [Project] [Dataset]
Year 2023
- CVPR
  - Shape-aware Text-driven Layered Video Editing [Paper] [Code] [Project]
- ICCV
  - Pix2Video: Video Editing using Image Diffusion [Paper] [Code]
  - Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation [Paper] [Code] [Project]
- NeurIPS
  - Towards Consistent Video Editing with Text-to-Image Diffusion Models [Paper]
- arXiv
  - Style-A-Video: Agile Diffusion for Arbitrary Text-based Video Style Transfer [Paper]
  - SAVE: Spectral-Shift-Aware Adaptation of Image Diffusion Models for Text-guided Video Editing [Paper] [Code] [Project]
Year 2022
- ECCV
  - Text2LIVE: Text-Driven Layered Image and Video Editing [Paper] [Code] [Project]

<🎯Back to Top>

Datasets

[arXiv 2012] UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild [Paper] [Dataset]
[arXiv 2017] DAVIS: The 2017 DAVIS Challenge on Video Object Segmentation [Paper] [Dataset]
[ICCV 2019] FaceForensics++: Learning to Detect Manipulated Facial Images [Paper] [Code]
[NeurIPS 2019] TaiChi-HD: First Order Motion Model for Image Animation [Paper] [Dataset]
[ECCV 2020] SkyTimeLapse: DTVNet: Dynamic Time-lapse Video Generation via Single Still Image [Paper] [Code]
[ICCV 2021] WebVid-10M: Frozen in Time: ️A Joint Video and Image Encoder for End to End Retrieval [Paper] [Dataset] [Code] [Project]
[ICCV 2021] WebVid-10M: Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [Paper] [Dataset] [Project]
[ECCV 2022] ROS: Learning to Drive by Watching YouTube Videos: Action-Conditioned Contrastive Policy Pretraining [Paper] [Code] [Dataset]
[arXiv 2023] HD-VG-130M: VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-video Generation [Paper] [Dataset]
[NeurIPS 2023] FETV: A Benchmark for Fine-Grained Evaluation of Open-Domain Text-to-Video Generation [Paper] [Code]
[ICLR 2024] InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [Paper] [Dataset]
[CVPR 2024] Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers [Paper] [Dataset] [Project]
[arXiv 2024] VidProM: A Million-scale Real Prompt-Gallery Dataset for Text-to-Video Diffusion Models [Paper] [Dataset]

<🎯Back to Top>

Evaluation Metrics

[CVPR 2024] VBench: Comprehensive Benchmark Suite for Video Generative Models [Paper] [Code]
[ICCV 2023] DOVER: Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives [Paper] [Code]
[ICLR 2019] FVD: A New Metric for Video Generation [Paper] [Code]

Q&A

Q: The conference sequence of this paper list?
- This paper list is organized according to the following sequence:
  - CVPR
  - ICCV
  - ECCV
  - NeurIPS
  - ICLR
  - AAAI
  - ACM MM
  - SIGGRAPH
  - arXiv
  - Others
Q: What does Others refers to?
- Some of the following studies (e.g., Sora) does not publish their technical report on arXiv. Instead, they tend to write a blog in their official websites. The Others category refers to such kind of studies.

<🎯Back to Top>

References

The reference.bib file summarizes bibtex references of up-to-date image inpainting papers, widely used datasets, and toolkits. Based on the original references, I have made the following modifications to make their results look nice in the LaTeX manuscripts:

Refereces are normally constructed in the form of author-etal-year-nickname. Particularly, references of datasets and toolkits are directly constructed as nickname, e.g., imagenet.
In each reference, all names of conferences/journals are converted into abbreviations, e.g., Computer Vision and Pattern Recognition -> CVPR.
The url, doi, publisher, organization, editor, series in all references are removed.
The pages of all references are added if they are missing.
All paper names are in title case. Besides, I have added an additional {} to make sure that the title case would also work well in some particular templates.

If you have other demands of reference formats, you may refer to the original references of papers by searching their names in DBLP or Google Scholar.

<🎯Back to Top>

Star History

<🎯Back to Top>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

A Collection of Video Generation Studies

Contents

To-Do Lists

Products

Papers

Survey Papers

Text-to-Video Generation

Image-to-Video Generation

Audio-to-Video Generation

Personalized Video Generation

Video Editing

Datasets

Evaluation Metrics

Q&A

References

Star History

Files

README.md

Latest commit

History

README.md

File metadata and controls

A Collection of Video Generation Studies

Contents

To-Do Lists

Products

Papers

Survey Papers

Text-to-Video Generation

Image-to-Video Generation

Audio-to-Video Generation

Personalized Video Generation

Video Editing

Datasets

Evaluation Metrics

Q&A

References

Star History