This project aims to build a lyrics-to-audio alignment system that synchronizes the audio of a polyphonic song with its lyrics and produces time-aligned lyrics, with word-level onsets and offsets, as a `.lrc` file. A deep-learning-based system approaches the problem in three steps: separating the vocals, recognizing the singing vocals, and performing forced alignment. For singing voice recognition, transfer learning is used to apply knowledge obtained from the speech domain to the singing domain.
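As a rough illustration of the second and third stages, the sketch below runs a speech-pretrained Wav2Vec2 model over an already-separated vocal stem and force-aligns an example lyric line against its frame-level predictions with torchaudio. Everything here (the model, the `vocals.wav` file name, and the lyric) is an assumption for illustration; the actual `lsync` pipeline uses its own fine-tuned model and alignment code.

```python
# Illustrative sketch only: recognition + forced alignment with off-the-shelf
# speech tools (torchaudio >= 2.1). "vocals.wav" stands in for the separated
# vocal stem produced by stage one with a source-separation tool.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H   # speech-trained Wav2Vec2 (transfer-learning starting point)
model = bundle.get_model()
labels = bundle.get_labels()                            # ('-', '|', 'E', 'T', ...); '-' is the CTC blank, '|' the word separator

wave, sr = torchaudio.load("vocals.wav")                # assumed: the separated vocal track
wave = torchaudio.functional.resample(wave.mean(0, keepdim=True), sr, bundle.sample_rate)

with torch.inference_mode():
    emission, _ = model(wave)                           # frame-level character logits
    log_probs = torch.log_softmax(emission, dim=-1)

lyric = "TWINKLE TWINKLE LITTLE STAR"                   # example lyric line to align
char_to_id = {c: i for i, c in enumerate(labels)}
targets = torch.tensor([[char_to_id[c] for c in lyric.replace(" ", "|")]], dtype=torch.int32)

# CTC forced alignment: one label per emission frame plus per-frame scores;
# word onsets/offsets follow from grouping the consecutive frames of each word.
frame_labels, frame_scores = torchaudio.functional.forced_align(log_probs, targets, blank=0)
```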
```bash
conda env update -f environment.yml
conda activate lsync
```
```python
from lsync import LyricsSync

lsync = LyricsSync()
# audio_path points to the song audio, lyrics_path to its lyrics
words, lrc = lsync.sync(audio_path, lyrics_path)
```
Please refer to `demo.ipynb`.
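Assuming `lrc` is the LRC-formatted text returned by `sync` (see `demo.ipynb` for the exact return types), it can be written straight to disk:

```python
# Assumption: `lrc` holds the LRC-formatted lyrics as a single string.
with open("song.lrc", "w", encoding="utf-8") as f:
    f.write(lrc)
```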
If you want to visualize the `.lrc` file for evaluation, you can use Lrc Player.
If you want to fine-tune a Wav2Vec2 model for better accuracy in the singing domain, please refer to the experiments section below.
- Make a `dataset` folder in the root folder
- Download the DALI dataset and put it inside `dataset/DALI/v1`
- Similarly, you can download the jamendolyrics dataset for evaluation and put it in `dataset/jamendolyrics`
- Download all DALI songs using `python get_dataset.py` (a scripted alternative using the DALI package is sketched after this list)
- Run `dataset.ipynb` to prepare DALI for the fine-tuning tasks; the procedures include vocal extraction, line-level segmentation, and building the tokenizer
- Run `train.ipynb` to fine-tune `facebook/wav2vec2-base` for singing voice recognition (a minimal sketch of this step also follows the list)
- Run `run.ipynb` to see how to use the `lsync` library for lyrics-to-audio alignment based on the fine-tuned model
- Remember to update the model path to your model's path inside `lsync/phoneme_recognizer.py`
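For reference, downloading and loading DALI can also be scripted directly with the official DALI package; the sketch below follows the package's documented helpers and the folder layout above, and simply duplicates what `get_dataset.py` automates (the paths are assumptions):

```python
# Sketch using the DALI package (pip install DALI-dataset); paths assume the
# dataset/DALI/v1 layout above. get_dataset.py automates the same steps.
import DALI as dali_code

dali_data_path = "dataset/DALI/v1/"
dali_data = dali_code.get_the_DALI_dataset(dali_data_path, skip=[], keep=[])   # annotation entries keyed by song id
dali_info = dali_code.get_info(dali_data_path + "info/DALI_DATA_INFO.gz")      # id / artist / url / working-flag table

# Fetch the audio for every entry; unavailable videos are reported in `errors`
errors = dali_code.get_audio(dali_info, "dataset/DALI/v1/audio/", skip=[], keep=[])
```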
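For orientation, a minimal sketch of what the CTC fine-tuning step boils down to with the Hugging Face `transformers` API is shown below; the vocabulary path, token names, and the dummy batch are placeholders, and the real preprocessing, data collator, and hyperparameters live in `train.ipynb`.

```python
# Minimal CTC fine-tuning sketch for facebook/wav2vec2-base on singing data.
# "dataset/vocab.json" is a placeholder for the character vocabulary that the
# dataset preparation step is assumed to produce; see train.ipynb for the
# actual training loop and hyperparameters.
import numpy as np
import torch
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2Processor,
    Wav2Vec2ForCTC,
)

tokenizer = Wav2Vec2CTCTokenizer(
    "dataset/vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16000, padding_value=0.0,
    do_normalize=True, return_attention_mask=False,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Transfer learning: reuse the speech-pretrained encoder, attach a fresh CTC
# head sized to the singing vocabulary, and keep the conv front end frozen.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-base",
    ctc_loss_reduction="mean",
    pad_token_id=tokenizer.pad_token_id,
    vocab_size=len(tokenizer),
)
model.freeze_feature_encoder()

# One illustrative step on a dummy five-second line-level vocal segment.
audio = np.random.randn(16000 * 5).astype(np.float32)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = tokenizer("twinkle twinkle little star", return_tensors="pt").input_ids
loss = model(input_values=inputs.input_values, labels=labels).loss
loss.backward()
```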