This is a growing repository of AI-generated caption datasets. A caption is a short descriptive or explanatory text that accompanies content. We intend Gold-Caps to be used for research on topics such as cross-modal modelling.
At this time, Gold-Caps contains captions for the LMD-matched subset of the Lakh MIDI Dataset (~30,000 tracks with accompanying MIDI files). These captions were generated by the gpt-4-1106-preview chat endpoint, which was prompted to describe each track based on its title and artist; the captions have not been filtered or post-processed in any way. The prompt was:
Give a general description of the track <title> by <artist_name> in one sentence.
Don't mention the title or artist.
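For reference, the sketch below shows how a single caption could be generated in this style with the OpenAI chat completions API. The model name matches the one stated above; the function, the example title/artist pair, and the rest of the scaffolding are illustrative and do not reproduce the actual generation script.

```python
# Minimal sketch: generating one caption via the OpenAI chat completions API.
# Only the model name comes from the dataset description; the function name and
# the example track are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def caption_track(title: str, artist_name: str) -> str:
    prompt = (
        f"Give a general description of the track {title} by {artist_name} "
        "in one sentence.\n"
        "Don't mention the title or artist."
    )
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(caption_track("Bohemian Rhapsody", "Queen"))
```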
Check out some example captions on the demo page.
(The demo page also includes example captions from alternative prompts.)
The dataset is hosted on Zenodo:
- Gold-Caps-LMD-Matched-General-v0. For information on the MIDI and audio portions, please visit the Lakh MIDI Dataset website.
[1] Noise2Music: Text-conditioned Music Generation with Diffusion Models, Qingqing Huang et al.
In order to build their text conditioning, the authors “[…] take a pseudo-labeling approach via leveraging MuLan (Huang et al., 2022), a pre-trained text and music audio joint embedding model, together with LaMDA (Thoppilan et al., 2022), a pre-trained large language model, to assign pseudo labels with finegrained semantic to unlabeled music audio clips.” The process involves creating a large number of pseudo-captions using LaMDA and filtering them according to their similarity to the audio, computed through MuLan. Models and datasets are not publicly available.
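Since MuLan and LaMDA are not public, the filtering step can only be illustrated schematically. The sketch below assumes caption and audio embeddings from some joint text-audio model have already been computed, and shows only the cosine-similarity ranking used to keep the best-matching pseudo-captions; the function name and the top_k parameter are hypothetical.

```python
# Schematic sketch of a similarity-based pseudo-caption filter. The embeddings
# are assumed to come from a joint text-audio model (e.g. a MuLan-style model,
# which is not publicly available); only the ranking logic is shown.
import numpy as np


def select_pseudo_captions(
    audio_embedding: np.ndarray,                # embedding of the music clip
    caption_embeddings: dict[str, np.ndarray],  # caption -> text embedding
    top_k: int = 3,
) -> list[str]:
    """Keep the captions whose text embedding is closest to the clip's audio embedding."""
    a = audio_embedding / np.linalg.norm(audio_embedding)
    scored = []
    for caption, vec in caption_embeddings.items():
        t = vec / np.linalg.norm(vec)
        scored.append((float(a @ t), caption))  # cosine similarity in the joint space
    scored.sort(reverse=True)
    return [caption for _, caption in scored[:top_k]]
```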
[2] LP-MusicCaps: LLM-Based Pseudo Music Captioning, SeungHeon Doh et al.
The authors use GPT-3.5-Turbo to turn the sets of tags associated with the songs in three datasets (MusicCaps, MagnaTagATune, Million Song Dataset) into captions. This is achieved using various prompting strategies and evaluated with both objective and subjective metrics. Models and datasets are released in this repository.
[3] LLark: A Multimodal Foundation Model for Music, Josh Gardner et al.
The authors build a model capable of addressing many tasks in music understanding, including captioning. The model consists of a pretrained generative audio encoder, a pretrained language model, and a simple multimodal projection module that maps encoded audio into the LLM embedding space. Variants of ChatGPT were used to merge the heterogeneous information in various datasets into uniform inputs for instruction tuning. The resulting captions are not made available by the authors.
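As an illustration of the projection idea, the sketch below maps audio encoder features into an LLM's embedding dimension so they can be fed to the language model alongside text tokens. The dimensions, layer choices, and names are placeholders, not LLark's actual configuration.

```python
# Schematic sketch of a multimodal projection module: encoded audio frames are
# mapped into the language model's embedding space and consumed as soft tokens.
# All sizes here are illustrative.
import torch
import torch.nn as nn


class AudioToLLMProjector(nn.Module):
    def __init__(self, audio_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, n_frames, audio_dim) from the pretrained audio encoder
        return self.proj(audio_features)  # (batch, n_frames, llm_dim)


projector = AudioToLLMProjector()
dummy_audio = torch.randn(1, 250, 512)  # e.g. 250 encoded frames
print(projector(dummy_audio).shape)     # torch.Size([1, 250, 4096])
```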
To cite this project, use the following entry:
@dataset{jonason_2023_10178563,
  author    = {Jonason, Nicolas and
               Casini, Luca and
               Sturm, Bob},
  title     = {Gold-Caps\_LMD-Matched\_General},
  month     = nov,
  year      = 2023,
  publisher = {Zenodo},
  version   = {0.0.0},
  doi       = {10.5281/zenodo.10178563},
  url       = {https://doi.org/10.5281/zenodo.10178563}
}
This work was supported in part by the grant ERC-2019-COG No. 864189 MUSAiC: Music at the Frontiers of Artificial Creativity and Criticism.