This is a growing repository of AI-generated caption datasets. A caption is a short descriptive or explanatory text that accompanies content. We intend Gold-Caps to be used for research on topics such as cross-modal modelling.
At this time, Gold-Caps contains captions for the LMD-matched subset of the Lakh MIDI Dataset (~30,000 tracks with accompanying MIDI files). These captions were generated by the gpt-4-1106-preview chat endpoint, which was prompted to describe each track based on its title and artist; the captions have not been filtered or post-processed in any way. The prompt was:
Give a general description of the track <title> by <artist_name> in one sentence.
Don't mention the title or artist.
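For reference, the sketch below shows how a single caption could be generated in this style with the OpenAI chat completions API. The model name matches the one stated above; the function, the example title/artist pair, and the rest of the scaffolding are illustrative and do not reproduce the actual generation script.

```python
# Minimal sketch: generating one caption via the OpenAI chat completions API.
# Only the model name comes from the dataset description; the function name and
# the example track are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def caption_track(title: str, artist_name: str) -> str:
    prompt = (
        f"Give a general description of the track {title} by {artist_name} "
        "in one sentence.\n"
        "Don't mention the title or artist."
    )
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


print(caption_track("Bohemian Rhapsody", "Queen"))
```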
Check out some example captions on the demo page.
(The demo page also includes example captions from alternative prompts.)
The dataset is hosted on Zenodo:
- Gold-Caps-LMD-Matched-General-v0. For information on the MIDI and audio portions, please visit the Lakh MIDI Dataset website.
[1] Noise2Music: Text-conditioned Music Generation with Diffusion Models, Qingqing Huang et al.
In order to build their text conditioning, the authors “[…] take a pseudo-labeling approach via leveraging MuLan (Huang et al., 2022), a pre-trained text and music audio joint embedding model, together with LaMDA (Thoppilan et al., 2022), a pre-trained large language model, to assign pseudo labels with finegrained semantic to unlabeled music audio clips.” The process involves creating a large number of pseudo-captions using LaMDA and filtering them according to their similarity to the audio, computed through MuLan. Models and datasets are not publicly available.
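Since MuLan and LaMDA are not public, the filtering step can only be illustrated schematically. The sketch below assumes caption and audio embeddings from some joint text-audio model have already been computed, and shows only the cosine-similarity ranking used to keep the best-matching pseudo-captions; the function name and the top_k parameter are hypothetical.

```python
# Schematic sketch of a similarity-based pseudo-caption filter. The embeddings
# are assumed to come from a joint text-audio model (e.g. a MuLan-style model,
# which is not publicly available); only the ranking logic is shown.
import numpy as np


def select_pseudo_captions(
    audio_embedding: np.ndarray,                # embedding of the music clip
    caption_embeddings: dict[str, np.ndarray],  # caption -> text embedding
    top_k: int = 3,
) -> list[str]:
    """Keep the captions whose text embedding is closest to the clip's audio embedding."""
    a = audio_embedding / np.linalg.norm(audio_embedding)
    scored = []
    for caption, vec in caption_embeddings.items():
        t = vec / np.linalg.norm(vec)
        scored.append((float(a @ t), caption))  # cosine similarity in the joint space
    scored.sort(reverse=True)
    return [caption for _, caption in scored[:top_k]]
```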
[2] LP-MusicCaps: LLM-Based Pseudo Music Captioning, SeungHeon Doh et al.
The authors use GPT-3.5-Turbo to turn the sets of tags associated with the songs in three datasets (MusicCaps, MagnaTagATune, Million Song Dataset) into captions. This is achieved using various prompting strategies and evaluated with both objective and subjective metrics. Models and datasets are released in this repository.
[3] LLark: A Multimodal Foundation Model for Music, Josh Gardner et al.
The authors build a model capable of addressing many tasks in music understanding, including captioning. The model consists of a pretrained generative audio encoder, a pretrained language model, and a simple multimodal projection module that maps encoded audio into the LLM embedding space. Variants of ChatGPT were used to merge the heterogeneous information in various datasets into uniform inputs for instruction tuning. The resulting captions are not made available by the authors.
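As an illustration of the projection idea, the sketch below maps audio encoder features into an LLM's embedding dimension so they can be fed to the language model alongside text tokens. The dimensions, layer choices, and names are placeholders, not LLark's actual configuration.

```python
# Schematic sketch of a multimodal projection module: encoded audio frames are
# mapped into the language model's embedding space and consumed as soft tokens.
# All sizes here are illustrative.
import torch
import torch.nn as nn


class AudioToLLMProjector(nn.Module):
    def __init__(self, audio_dim: int = 512, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_features: torch.Tensor) -> torch.Tensor:
        # audio_features: (batch, n_frames, audio_dim) from the pretrained audio encoder
        return self.proj(audio_features)  # (batch, n_frames, llm_dim)


projector = AudioToLLMProjector()
dummy_audio = torch.randn(1, 250, 512)  # e.g. 250 encoded frames
print(projector(dummy_audio).shape)     # torch.Size([1, 250, 4096])
```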
To cite this project, use the following entry:
@dataset{jonason_2023_10178563,
  author    = {Jonason, Nicolas and
               Casini, Luca and
               Sturm, Bob},
  title     = {Gold-Caps\_LMD-Matched\_General},
  month     = nov,
  year      = 2023,
  publisher = {Zenodo},
  version   = {0.0.0},
  doi       = {10.5281/zenodo.10178563},
  url       = {https://doi.org/10.5281/zenodo.10178563}
}
This work was supported in part by the grant ERC-2019-COG No. 864189 MUSAiC: Music at the Frontiers of Artificial Creativity and Criticism.