This is a log of different approaches, techniques, and other thoughts relating to the project as it evolves.
Note: For the best viewing experience, we recommend installing the Markdown Diagrams browser extension to render the diagrams and a MathJax browser extension (Chrome) for math rendering.
- Wavenet: A Generative Model for Raw Audio
- Unsupervised speech representation learning using WaveNet autoencoders
- Jukebox: A Generative Model for Music
- OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields
- Dancing to Music
03/11/2021
The goal of the movenet project is to create an instrument from human body movement - to translate that motion into music.
```mermaid
graph LR;
    vid_stream([Video Stream]) --> vid_enc[Video Encoder];
    vid_enc --> vid_repr([Video Representation]);
    vid_repr --> audio_dec[Audio Decoder];
    audio_dec --> audio_wav[Audio Waveform];
```
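A minimal sketch of how these stages might compose in code; the `VideoEncoder`/`AudioDecoder` modules and the `MovenetPipeline` class are hypothetical placeholders, not components that exist yet:

```python
import torch
import torch.nn as nn

class MovenetPipeline(nn.Module):
    """Hypothetical end-to-end pipeline: video frames in, audio waveform out."""

    def __init__(self, video_encoder: nn.Module, audio_decoder: nn.Module):
        super().__init__()
        self.video_encoder = video_encoder  # frames -> video representation
        self.audio_decoder = audio_decoder  # representation (+ past audio) -> next samples

    def forward(self, frames: torch.Tensor, past_audio: torch.Tensor) -> torch.Tensor:
        # frames: [batch, time_video, channels, height, width]
        # past_audio: [batch, time_audio] of quantized sample indices
        video_repr = self.video_encoder(frames)
        return self.audio_decoder(past_audio, video_repr)
```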
The prior research enumerated in the resources section should provide ample inspiration for this project, most of which are concerned with the inverse problem: translating music into some representation of human movement. In principle, we can use similar architectures to map a sequence of images (i.e. video) into a raw audio waveform.
03/20/2021
The objective of this project is to produce high quality music given a video input of a human dancing. In order to achieve this, the audio decoder must have enough capacity to generate qualitatively good samples.
A strong baseline model for this task is WaveNet, which has been demonstrated to generate high-quality raw audio samples. In fact, when trained on music, WaveNet produced samples that the authors report "were often harmonic and aesthetically pleasing, even when produced by unconditional models". What makes WaveNet a good candidate for this project is that we can condition the output on both static and sequential inputs.
Reproducing the WaveNet model in the context of conditioning on video input will be the first phase of this project.
We can anticipate some issues with the baseline model.
A potential issue with the WaveNet model is that the outputs are deterministic with respect to the conditioning inputs. This may be addressed with variational methods as described in Chorowski et al. 2019, which uses variational autoencoders (VAEs) to capture latent representations of speech waveforms in an autoencoder setting. While that paper used VAEs to better disentangle different high-level semantics of the input, e.g. acoustic content and speaker information, we can use the same methods in this project to produce non-deterministic outputs as a function of the video input. The hope is that the same exact movement might produce qualitatively different sounds.
VAE methods are a potential direction to go once the baseline has been established. Other work to consider is Jukebox.
Unlike other music generation systems like WaveNet or Jukebox, we want to condition the generated output on a sequence of vectors representing movement of the human body. This might be a sequence of raw images or a sequence of 2D body keypoints represented via a model like OpenPose, which was the approach taken by Lee et al. 2019.
The benefit of using 2D body keypoints is that they are a lower-dimensional representation of the human body and remove much of the unimportant detail that a CNN would otherwise have to learn to ignore in raw video data.
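To make the dimensionality argument concrete, a quick back-of-the-envelope comparison (the frame resolution and the 25-keypoint OpenPose body model are illustrative choices):

```python
import numpy as np

# A single 256x256 RGB frame vs. 25 OpenPose body keypoints with (x, y) coords.
raw_frame = np.zeros((256, 256, 3), dtype=np.uint8)
keypoints = np.zeros((25, 2), dtype=np.float32)

print(raw_frame.size)  # 196608 values per frame
print(keypoints.size)  # 50 values per frame
```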
To get a proof-of-concept completed as quickly as possible, we'll go for an approach that:
- Remains as faithful to the WaveNet architecture as possible.
- Minimizes the number of components that depend on pre-trained models for pre- or post-processing.
WaveNet is an autoregressive model that operates directly on the raw audio waveform, modeling the joint probability with the following factorization:

$$p(\mathbf{x}) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})$$
This is an important aspect of the architecture to preserve because not only do we want to generate music from human dance, we also want to produce music that maintains thematic consistency and coherence over relatively long time spans.
We also want to condition the sampled audio value on the previous dance pose, so we have a second timeseries of video frames. We can transform this video timeseries with an up/downsampling CNN so that it matches the sampling frequency of the audio signal. We'll use the gated activation units from the WaveNet paper, conditioned on the second time series:

$$z = \tanh(W_{f,k} * x + V_{f,k} * y) \odot \sigma(W_{g,k} * x + V_{g,k} * y)$$
Where:

- $z$ is the activation
- $W$ is the learnable audio convolutional filter
- $x$ is the audio signal
- $V$ is the learnable video convolutional filter
- $y$ is the up/downsampled video signal
- $k$ is the convolutional layer index
- $f$ and $g$ denote filter and gate, respectively.
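A minimal PyTorch sketch of this conditional gated activation, assuming $y$ has already been resampled to the audio rate; the channel sizes and causal-padding scheme below are implementation assumptions, not prescribed by the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedActivationUnit(nn.Module):
    """Sketch of z = tanh(W_f*x + V_f*y) * sigmoid(W_g*x + V_g*y) for one layer k."""

    def __init__(self, channels: int, cond_channels: int, dilation: int = 1):
        super().__init__()
        # Dilated convolutions over the audio feature map (filter and gate).
        self.conv_filter = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        self.conv_gate = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)
        # 1x1 convolutions over the conditioning (video) feature map.
        self.cond_filter = nn.Conv1d(cond_channels, channels, kernel_size=1)
        self.cond_gate = nn.Conv1d(cond_channels, channels, kernel_size=1)
        self.dilation = dilation

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, time], y: [batch, cond_channels, time] (same time axis).
        # Left-pad so the convolution is causal: output at t only sees inputs <= t.
        x = F.pad(x, (self.dilation, 0))
        f = torch.tanh(self.conv_filter(x) + self.cond_filter(y))
        g = torch.sigmoid(self.conv_gate(x) + self.cond_gate(y))
        return f * g
```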
For the baseline prototype we'll use CNNs to (a) represent individual image frames in a video sequence and (b) transform the sequence of frame representations via the up/downsampling CNN described above to produce the sequence $y$, which has the same sampling frequency as the audio signal.
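A sketch of what (a) and (b) could look like; the layer sizes are placeholders, and linear interpolation stands in for whatever learned up/downsampling we end up using:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """(a) Per-frame CNN encoder; the architecture here is a placeholder."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse spatial dims -> one vector per frame
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: [batch, time_video, 3, height, width]
        b, t, c, h, w = frames.shape
        feats = self.net(frames.reshape(b * t, c, h, w)).reshape(b, t, -1)
        return feats.transpose(1, 2)  # [batch, channels, time_video]

def resample_to_audio_rate(video_feats: torch.Tensor, audio_len: int) -> torch.Tensor:
    """(b) Stretch/squeeze video features along time to align 1:1 with audio samples."""
    # video_feats: [batch, channels, time_video] -> [batch, channels, audio_len]
    return F.interpolate(video_feats, size=audio_len, mode="linear", align_corners=False)
```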
To model the audio signal, we'll follow the preprocessing step described in the WaveNet paper (section 2.2) to quantize the 16-bit integer values of raw audio into 256 possible values with the mu-law companding transformation:

$$f(x_t) = \text{sign}(x_t) \frac{\ln(1 + \mu |x_t|)}{\ln(1 + \mu)}$$

where $-1 < x_t < 1$ and $\mu = 255$.
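A sketch of that quantization in numpy, assuming the 16-bit samples have already been scaled to $[-1, 1]$ and using $\mu = 255$ for 256 bins:

```python
import numpy as np

def mu_law_encode(audio: np.ndarray, mu: int = 255) -> np.ndarray:
    """Quantize a waveform in [-1, 1] into mu + 1 = 256 integer bins."""
    audio = np.clip(audio, -1.0, 1.0)
    companded = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((companded + 1.0) / 2.0 * mu + 0.5).astype(np.int64)  # {0, ..., 255}

def mu_law_decode(bins: np.ndarray, mu: int = 255) -> np.ndarray:
    """Map the integer bins back to an approximate waveform in [-1, 1]."""
    companded = 2.0 * (bins.astype(np.float64) / mu) - 1.0
    return np.sign(companded) * ((1.0 + mu) ** np.abs(companded) - 1.0) / mu
```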
At a high level, the types that we can use to compose the dataset for this project will look like the following:
```
AudioSample = Array[256 x 1]
VideoSample = Array[height x width x n_channels]
Sample = Tuple[AudioSample, VideoSample]
TrainingInstance = List[Sample]
Dataset = List[TrainingInstance]
```
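One way these pseudo-types might be realized in Python; the one-hot audio encoding and the `make_sample` helper are assumptions about how we'll represent a quantized sample, not settled decisions:

```python
from typing import List, Tuple
import numpy as np

AudioSample = np.ndarray        # one-hot over the 256 quantization bins, shape (256, 1)
VideoSample = np.ndarray        # raw frame, shape (height, width, n_channels)
Sample = Tuple[AudioSample, VideoSample]
TrainingInstance = List[Sample]
Dataset = List[TrainingInstance]

def make_sample(audio_bin: int, frame: np.ndarray) -> Sample:
    """Pair one quantized audio value with the video frame it co-occurs with."""
    one_hot = np.zeros((256, 1), dtype=np.float32)
    one_hot[audio_bin] = 1.0
    return one_hot, frame
```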
As mentioned above, we'll follow the conditional WaveNet architecture, which is an autoregressive model that outputs a raw audio sample for the next time step $t + 1$ conditioned on the previous audio and video samples $\{t, t - 1, \ldots, 1\}$. Figure 3 in the WaveNet paper visualizes dilated causal convolutions, which increase the receptive field of the model.
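A sketch of such a stack; with kernel size 2 and dilations doubling each layer ($1, 2, 4, \ldots$), the receptive field grows to $2^L$ samples after $L$ layers. The channel count and depth below are placeholders, and gating/residual connections are omitted for brevity:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalConvStack(nn.Module):
    """Stack of dilated causal convolutions (gating/residuals omitted for brevity).

    With kernel_size=2 and dilations 1, 2, 4, ..., 2^(n_layers - 1), the
    receptive field covers 2^n_layers audio samples.
    """

    def __init__(self, channels: int = 32, n_layers: int = 8):
        super().__init__()
        self.dilations = [2 ** i for i in range(n_layers)]
        self.convs = nn.ModuleList(
            [nn.Conv1d(channels, channels, kernel_size=2, dilation=d)
             for d in self.dilations]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, channels, time]; left-pad each layer so the output at t
        # never sees inputs later than t (causality).
        for conv, d in zip(self.convs, self.dilations):
            x = conv(F.pad(x, (d, 0)))
        return x
```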
Because the audio signal is encoded in a categorical space of 256 possible values, the loss for this task will be cross entropy loss.
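For clarity, a minimal example of how that loss would be computed on the model's per-timestep logits (the shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

batch, n_bins, time = 2, 256, 16000
logits = torch.randn(batch, n_bins, time)           # hypothetical model output
targets = torch.randint(0, n_bins, (batch, time))   # quantized next-sample indices

# F.cross_entropy accepts [N, C, T] logits with [N, T] integer targets
# and averages over both batch and time.
loss = F.cross_entropy(logits, targets)
```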
See the paperswithcode entry for reference implementations of the WaveNet model.