# SONIQUE: Video Background Music Generation Using Unpaired Audio-Visual Data

We present SONIQUE, a model for generating background music tailored to video content. Unlike traditional video-to-music generation approaches, which rely heavily on paired audio-visual datasets, SONIQUE leverages unpaired data, combining royalty-free music and independent video sources. By utilizing large language models (LLMs) for video understanding and converting visual descriptions into musical tags, alongside a U-Net-based conditional diffusion model, SONIQUE enables customizable music generation. Users can control specific aspects of the music, such as instruments, genres, tempo, and melodies, ensuring the generated output fits their creative vision. SONIQUE is open-source, with a demo available online.


**Performance:** On an NVIDIA RTX 4090, the entire process completes in under a minute, and the model requires less than 14 GB of GPU memory. On an NVIDIA RTX 3070 Laptop GPU with 8 GB of memory, the process takes 6 minutes.

## Table of contents

- [Install](#install)
- [Model Checkpoint](#model-checkpoint)
- [Data Collection & Preprocessing](#data-collection--preprocessing)
- [Video-to-music-generation](#video-to-music-generation)
- [Output Tuning](#output-tuning)
- [Subjective Evaluation](#subjective-evaluation)
- [Citation](#citation)

## Install

1. Clone this repo.
2. Create a conda environment:

   ```
   conda env create -f environment.yml
   ```

3. Activate the environment, navigate to the project root, and run:

   ```
   pip install .
   ```

4. After installation, you can run the demo with the UI interface:

   ```
   python run_gradio.py --model-config best_model.json --ckpt-path ./ckpts/stable_ep=220.ckpt
   ```

5. To run the demo without the interface:

   ```
   python inference.py --model-config best_model.json --ckpt-path ./ckpts/stable_ep=220.ckpt
   ```

Additional inference flags:

- `--use-video`: use the input video as a condition. Default: `False`
- `--input-video`: path to the input video. Default: `None`
- `--use-init`: use a melody condition. Default: `False`
- `--init-audio`: path to the melody condition. Default: `None`
- `--llms`: name of the large language model used to convert the video description into tags. Default: `Mistral 7B`
- `--low-resource`: if `True`, the models in the video-to-tags stage run in 4-bit. Only set this to `False` if you have enough GPU memory. Default: `True`
- `--instruments`: input instrument condition. Default: `None`
- `--genres`: input genre condition. Default: `None`
- `--tempo-rate`: input tempo-rate condition. Default: `None`
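For example, a hypothetical invocation that conditions generation on a video and adds instrument, genre, and tempo tags (the paths and value formats here are illustrative; run `python inference.py --help` for the exact syntax):

```
python inference.py --model-config best_model.json --ckpt-path ./ckpts/stable_ep=220.ckpt \
    --use-video True --input-video ./examples/demo.mp4 \
    --instruments "piano, strings" --genres "ambient" --tempo-rate "90 bpm"
```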

## Model Checkpoint

The pretrained model can be downloaded here. Please download, unzip, and place it in the root of this project:

```
sonique/
├── ckpts/
│   ├── .../
├── sonique/
├── run_gradio.py
...
```

## Data Collection & Preprocessing


In SONIQUE, tag generation for training starts by feeding raw musical data into LP-MusicCaps to generate initial captions. These captions are then processed by Qwen 14B in two steps: first it converts the captions into tags, then it cleans the data by removing any incorrect or misleading tags (e.g., "Low Quality"). The result is a clean set of tags for training.
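A minimal sketch of this two-step flow is below. `ask_llm` is a hypothetical placeholder for a Qwen 14B chat call, and the prompts are illustrative rather than SONIQUE's actual ones:

```python
# Sketch of the two-step tag pipeline: caption -> tags -> cleaned tags.

def ask_llm(prompt: str) -> str:
    # Hypothetical placeholder: wire this to Qwen 14B (or any chat LLM).
    # Canned replies here so the sketch runs end to end without a model.
    if prompt.startswith("Convert"):
        return "piano, ambient, slow tempo, Low Quality"
    return "piano, ambient, slow tempo"

def caption_to_tags(caption: str) -> list[str]:
    # Step 1: convert an LP-MusicCaps caption into comma-separated tags.
    reply = ask_llm("Convert this music caption into short, comma-separated tags:\n" + caption)
    return [t.strip() for t in reply.split(",") if t.strip()]

def clean_tags(tags: list[str]) -> list[str]:
    # Step 2: drop incorrect or misleading tags such as "Low Quality".
    reply = ask_llm(
        "Remove incorrect or misleading tags (e.g., 'Low Quality') from this "
        "list and return the rest, comma-separated:\n" + ", ".join(tags)
    )
    return [t.strip() for t in reply.split(",") if t.strip()]

caption = "A mellow piano piece with a relaxed, ambient feel."  # e.g., from LP-MusicCaps
print(clean_tags(caption_to_tags(caption)))
```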

## Video-to-music-generation

SONIQUE is a multi-model tool built on stable_audio_tools, Video_LLaMA, and popular LLMs from Hugging Face.

We use Video_LLaMA to extract a description of the input video. The description is then passed to an LLM, which converts it into tags that describe the background music. The currently supported LLMs are:

- Mistral 7B (default)
- Qwen 14B
- Gemma 7B (you will need to authenticate with Google)
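With `--low-resource` (the default), these models run in 4-bit. A minimal sketch of that kind of loading, assuming the Hugging Face transformers + bitsandbytes stack and an assumed Mistral 7B checkpoint name (SONIQUE's actual loading code may differ):

```python
# Sketch: load the tag-generation LLM in 4-bit, as --low-resource implies.
# The checkpoint name is an assumption; swap in the model you actually use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant,  # 4-bit weights; omit if you have enough GPU memory
    device_map="auto",
)
```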

## Output Tuning


Users can then refine the generated music by providing additional prompts or specifying negative prompts. The final output is background music that matches both the video and the user's preferences.

## Subjective Evaluation


We generated a demo with seven examples using SONIQUE. The generated videos were evaluated by a group of 38 individuals, including artists with video-editing backgrounds and music technology students.

Overall, 75% of users rated the generated audio as somewhat, very, or perfectly related to the video, with "perfectly related" being the most common rating. This positive feedback highlights SONIQUE's effectiveness in producing audio that aligns well with video content. However, 25% of users found the audio to have little or no relation to the video, indicating that the model sometimes struggles to capture the mood or to sync the music with specific video events.

## Citation

Please consider citing the project if it helps your research:

```bibtex
@misc{zhang2024sonique,
  title={SONIQUE: Efficient Video Background Music Generation},
  author={Zhang, Liqian},
  year={2024},
  publisher={GitHub},
  journal={GitHub repository},
  howpublished={https://github.com/zxxwxyyy/sonique},
}
```