Update README.md

Yuan-ManX · Jul 4, 2024 · 2d042b1
1 parent ecd2687
commit 2d042b1
Showing 1 changed file with 19 additions and 6 deletions.
diff --git a/README.md b/README.md
@@ -1,12 +1,17 @@
-# AI Audio Datasets List (AI-ADL) 🎵
+# AI Audio Datasets (AI-ADS) 🎵
 
-This is a list of datasets consisting of speech, music, and sound effects, which can provide training data for Generative AI, AIGC, AI model training, intelligent audio tool development, and audio applications. It is mainly used for speech recognition, speech synthesis, singing voice synthesis, music information retrieval, music generation, audio processing, sound synthesis, etc.
+AI Audio Datasets (AI-ADS) 🎵 is a collection of speech, music, and sound effect datasets that can provide training data for Generative AI, AIGC, AI model training, intelligent audio tool development, and audio applications.
+
+## Table of Contents
 
 * [Speech](#s)
 * [Music](#m)
 * [Sound Effect](#se)
 
-## <span id="s">Speech</span>
+
+## Project List
+
+### <span id="s">Speech</span>
 
 * [AISHELL-1](http://www.openslr.org/33/) - AISHELL-1 is a corpus for speech recognition research and building speech recognition systems for Mandarin.
 * [AISHELL-3](https://openslr.org/93/) - AISHELL-3 is a large-scale and high-fidelity multi-speaker Mandarin speech corpus published by Beijing Shell Shell Technology Co.,Ltd. It can be used to train multi-speaker Text-to-Speech (TTS) systems. The corpus contains roughly 85 hours of emotion-neutral recordings spoken by 218 native Mandarin Chinese speakers, totaling 88,035 utterances.
@@ -72,7 +77,10 @@ This is a list of datasets consisting of speech, music, and sound effects, which
 * [YODAS2](https://huggingface.co/datasets/espnet/yodas2) - YODAS2 is the long-form version of the YODAS dataset. It provides the same data as espnet/yodas, with the following new features: 1. it is formatted in long form (video level), where the audio is not segmented; 2. the audio is encoded at a higher sampling rate (i.e. 24k).
 * [YTTTS](https://github.com/ryanrudes/YTTTS) - The YouTube Text-To-Speech dataset is comprised of waveform audio extracted from YouTube videos alongside their English transcriptions.
 
-## <span id="m">Music</span>
+<p style="text-align: right;"><a href="#table-of-contents">^ Back to Contents ^</a></p>
+
+
+### <span id="m">Music</span>
 
 * [AAM: Artificial Audio Multitracks Dataset](https://zenodo.org/record/5794629) - This dataset contains 3,000 artificial music audio tracks with rich annotations. It is based on real instrument samples and generated by algorithmic composition with respect to music theory. It provides full mixes of the songs as well as single instrument tracks. The midis used for generation are also available. The annotation files include: Onsets, Pitches, Instruments, Keys, Tempos, Segments, Melody instrument, Beats, and Chords.
 * [Acappella](https://ipcv.github.io/Acappella/acappella/) - Acappella comprises around 46 hours of a cappella solo singing videos sourced from YouTube, sampled across different singers and languages. Four languages are considered: English, Spanish, Hindi and others.
@@ -170,7 +178,10 @@ This is a list of datasets consisting of speech, music, and sound effects, which
 * [WikiMuTe](https://zenodo.org/records/10223363) - WikiMuTe: A web-sourced dataset of semantic descriptions for music audio. In this study, we present WikiMuTe, a new and open dataset containing rich semantic descriptions of music. The data is sourced from Wikipedia's rich catalogue of articles covering musical works. Using a dedicated text-mining pipeline, we extract both long and short-form descriptions covering a wide range of topics related to music content such as genre, style, mood, instrumentation, and tempo. 
 * [YM2413-MDB](https://jech2.github.io/YM2413-MDB/) - **YM2413-MDB** is an 80s FM video game music dataset with multi-label emotion annotations. It includes 669 audio and MIDI files of music from Sega and MSX PC games in the 80s using YM2413, a programmable sound generator based on FM. The collected game music is arranged with a subset of 15 monophonic instruments and one drum instrument.
 
-## <span id="se">Sound Effect</span>
+<p style="text-align: right;"><a href="#table-of-contents">^ Back to Contents ^</a></p>
+
+
+### <span id="se">Sound Effect</span>
 
 * [Animal Sound Dataset](https://github.com/YashNita/Animal-Sound-Dataset) - This dataset consists of 875 animal sounds covering 10 types of animals: 200 cat, 200 dog, 200 bird, 75 cow, 45 lion, 40 sheep, 35 frog, 30 chicken, 25 donkey, and 25 monkey sounds.
 * [AudioSet](https://research.google.com/audioset/index.html) - AudioSet is an audio event dataset, which consists of over 2M human-annotated 10-second video clips. These clips are collected from YouTube, so many of them are of poor quality and contain multiple sound sources. A hierarchical ontology of 632 event classes is employed to annotate these data, which means that the same sound could be annotated with different labels. For example, the sound of barking is annotated as Animal, Pets, and Dog. All the videos are split into Evaluation/Balanced-Train/Unbalanced-Train sets.
@@ -200,4 +211,6 @@ This is a list of datasets consisting of speech, music, and sound effects, which
 * [VGG-Sound](https://www.robots.ox.ac.uk/~vgg/data/vggsound/) - A large scale audio-visual dataset. VGG-Sound is an audio-visual correspondent dataset consisting of short clips of audio sounds, extracted from videos uploaded to YouTube.
 * [Visually Indicated Sounds](https://andrewowens.com/vis/) - Materials make distinctive sounds when they are hit or scratched — dirt makes a thud; ceramic makes a clink. These sounds reveal aspects of an object's material properties, as well as the force and motion of the physical interaction.
 
-## And more
+<p style="text-align: right;"><a href="#table-of-contents">^ Back to Contents ^</a></p>
+
+