This repository has been archived by the owner on Feb 1, 2024. It is now read-only.

Question on Figure #47

Open
theadamsabra opened this issue Apr 8, 2022 · 7 comments

Comments

@theadamsabra

[image: figure from the blog post]

Quick question on this figure in the blog post: I know Coconet is its own model that generates subsequent melodies given the input MIDI file. However, if I decide to train MIDI-DDSP, will training Coconet also be part of this? Or should I expect a monophonic MIDI melody as input and the generated audio as output?

Thanks for all the help and this awesome project!

@lukewys
Contributor

lukewys commented Apr 8, 2022

Hi! Thanks for your interest! Yes, the latter: MIDI-DDSP takes a monophonic MIDI melody as input and produces the generated audio as output.

@theadamsabra
Author

Thank you so much for your prompt response. For training, should the output be the same melody as the MIDI input? Meaning, if I want to train on a new instrument, do I need the MIDI transcription?

@lukewys
Contributor

lukewys commented Apr 8, 2022

Yes. You need paired MIDI and audio data to train MIDI-DDSP. MIDI-DDSP currently does not support training on datasets other than URMP, so you might need some hacks to do so. Lastly, the audio-MIDI alignment quality will affect the generation quality of MIDI-DDSP, because the extraction of the note expression relies on the note boundaries.
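
To give a rough idea of why the boundaries matter, here is an illustrative sketch (not code from the repo; the frame rate and the function are made up for the example): per-note expression is pooled from frame-level features between a note's start and end, so a shifted boundary pools frames from silence or from a neighboring note and the training targets become noisy.

```python
# Conceptual sketch only, not the actual MIDI-DDSP code.
import numpy as np

FRAME_RATE = 250  # frames per second; illustrative value, DDSP-style models often use 250


def mean_note_loudness(loudness_db, note_start_s, note_end_s):
    """Average frame-level loudness over the frames covered by one note."""
    start = int(note_start_s * FRAME_RATE)
    end = int(note_end_s * FRAME_RATE)
    return float(np.mean(loudness_db[start:end]))


# Dummy data: 4 s of frame-level loudness, one note from 1.0 s to 2.0 s.
loudness_db = np.random.uniform(-60.0, 0.0, size=4 * FRAME_RATE)
print(mean_note_loudness(loudness_db, 1.0, 2.0))
```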

@theadamsabra
Author

I see. Thank you!

How "accurate"/reliable was URMP in alignment quality? Also, do you use certain metrics used to measure and assess alignment quality?

@lukewys
Contributor

lukewys commented Apr 9, 2022

I don't have a metric for the alignment quality, but the MIDI (note boundaries) in the URMP dataset is manually labeled. So I manually checked the MIDI alignment against the audio, and empirically I found the URMP dataset has very good alignment quality.
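
If you want to do a similar spot check on your own data, something like the following is usually enough; this is not part of MIDI-DDSP, just one convenient way to overlay note boundaries on the waveform (the file paths are placeholders):

```python
# Quick visual spot check of MIDI/audio alignment.
import librosa
import librosa.display
import matplotlib.pyplot as plt
import pretty_midi

audio, sr = librosa.load('recording.wav', sr=16000)   # placeholder path
midi = pretty_midi.PrettyMIDI('recording.mid')        # placeholder path

plt.figure(figsize=(12, 3))
librosa.display.waveshow(audio, sr=sr)
for note in midi.instruments[0].notes:
    plt.axvline(note.start, color='g', alpha=0.5)  # note onset
    plt.axvline(note.end, color='r', alpha=0.3)    # note offset
plt.title('MIDI note boundaries over the audio waveform')
plt.show()
```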

@theadamsabra
Author

Thanks for all of your help. I would love to help out and improve the repository in any way. How difficult do you think it would be to allow training on arbitrary datasets?

@lukewys
Contributor

lukewys commented Apr 11, 2022

Well... I have to confess that this codebase is not well written (by myself), so you will need some hacks. Here are the steps you should take:

  1. Write data preprocessing code or a dataloader for the synthesis generator: write code that transforms (MIDI files + audio files) -> tfrecord, with the same key/value format as here. Note that there are two types of dataset: one is "batched", meaning the data is chunked into 4-second samples; the other is "unbatched", meaning there is one sample per audio recording. However, you could also write your own data loader. (There is a rough sketch of this step after the list.)
  2. Once your tfrecord is in the same format as URMP's, the dataset dump code and dataloader for the expression generator should work fine; otherwise you will need to hack those and come up with your own dataloader and dataset dump.
  3. You need to come up with a way to scale the note expression controls so that they are approximately in [0, 1]. Here I came up with my own scaling coefficients. You can simply scale them to have unit variance, but be aware that you should not clip the values. (See the second sketch below.)
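
As a rough picture of what step 1 means in practice, here is an illustrative sketch (not the actual preprocessing code; the feature keys and paths are placeholders and should be matched to the ones used by the dataloader you are targeting):

```python
# Serialize one (MIDI, audio) pair into a tf.train.Example.
import numpy as np
import tensorflow as tf


def float_feature(values):
    values = np.asarray(values, dtype=np.float32).ravel().tolist()
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))


def make_example(audio, f0_hz, loudness_db, note_pitches, note_onsets, note_offsets):
    features = {
        'audio': float_feature(audio),              # placeholder key names
        'f0_hz': float_feature(f0_hz),
        'loudness_db': float_feature(loudness_db),
        'note_pitch': float_feature(note_pitches),
        'note_onset': float_feature(note_onsets),
        'note_offset': float_feature(note_offsets),
    }
    return tf.train.Example(features=tf.train.Features(feature=features))


# Dummy 4-second "batched" chunk; in practice audio comes from the recording,
# f0/loudness from audio analysis, and the note lists from the aligned MIDI.
# For an "unbatched" dataset, write one Example per full recording instead.
audio = np.zeros(16000 * 4)        # 4 s at 16 kHz
f0_hz = np.zeros(250 * 4)          # frame-level f0
loudness_db = np.zeros(250 * 4)    # frame-level loudness
example = make_example(audio, f0_hz, loudness_db, [60.0], [0.0], [4.0])

with tf.io.TFRecordWriter('my_dataset_train.tfrecord') as writer:  # placeholder path
    writer.write(example.SerializeToString())
```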

If all of the above works, then I believe it can run on arbitrary datasets. This is on my to-do list, but I do not have the bandwidth to do so :(. Good luck with that!
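
To make step 3 concrete, the idea is roughly the following; the percentile-based coefficient here is just one illustrative choice, not the coefficients actually used in the repo:

```python
# Rescale each note-expression control by a dataset-level statistic so values
# land roughly in [0, 1], and do NOT clip values that fall outside that range.
import numpy as np


def fit_scale_coefficient(control_values, percentile=99.0):
    """One possible scaling coefficient: a high percentile of the control."""
    return float(np.percentile(control_values, percentile)) + 1e-8


def scale_control(control_values, coefficient):
    # No clipping: values above 1 are allowed, they are just rare.
    return np.asarray(control_values, dtype=np.float32) / coefficient


# Dummy "volume"-like control gathered over the whole training set.
train_volume = np.random.uniform(0.0, 40.0, size=10000)
coeff = fit_scale_coefficient(train_volume)
scaled = scale_control(train_volume, coeff)   # mostly within [0, 1]
```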
