Integrate the audio modality in CoCa #94
dbea356 to c342a35
dropout=pre_conformer_dropout,
)

self.conformer = Conformer(
Can we remove the dependency on conformer and build it with components from the vision transformer? Maybe we want to change the conformer architecture in the future.
super().__init__()
self.sample_key = sample_key
self.prediction_key = prediction_key
self.pre_conformer = PreConformer(
Is this a tokenization of the input audio? Maybe choose a better name
This is not tokenization, just a reduction in the frame rate of the input. I will come up with a better name.
self.post_conformer = nn.Sequential(
    nn.Linear(
        input_dims,
        n_embd,
Why do we need to project from input_dims to n_embd? input_dims != n_embd?
yup, precisely -> input_dims!=n_embd
In the Conformer implementation that I have worked on now, this will not be needed. I will project it in the very beginning (before any computation occurs in the conformer blocks).
nn.Conv1d(
    in_channels=n_input_dims,
    out_channels=n_input_dims,
    kernel_size=2,
Two conv1d layers? Is this common? I assumed we apply vit style patching with conv2d of the spectrogram.
Yup, in speech, sub-sampling like the one being performed here is common.
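To make the frame-rate reduction concrete, here is a minimal sketch of how many frames survive two stacked Conv1d layers. It uses the output-length formula documented for torch.nn.Conv1d. The diff only shows kernel_size=2; the stride of 2, the zero padding, and the helper names conv1d_out_len and subsampled_frames are assumptions made for illustration, not taken from the PR.

```python
def conv1d_out_len(length, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution, following the formula in the
    torch.nn.Conv1d documentation."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def subsampled_frames(n_frames, n_layers=2, kernel_size=2, stride=2):
    """Number of frames left after n_layers stacked convolutions.

    ASSUMPTION: stride=2 and no padding; the reviewed diff only shows
    kernel_size=2, so the real layers may be configured differently.
    """
    for _ in range(n_layers):
        n_frames = conv1d_out_len(n_frames, kernel_size, stride)
    return n_frames


if __name__ == "__main__":
    # Under these assumptions, two layers reduce the frame rate by about 4x.
    print(subsampled_frames(1000))
```

Under these assumptions the encoder sees roughly a quarter of the input frames, which is the usual motivation for sub-sampling before the attention blocks.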
text_cls_prediction_key: str
vision_encoder_config: VisionTransformerConfig
modality_encoder_config: AudioTransformerConfig | VisionTransformerConfig | AVConfig
Here we should have a vision config and an audio config, each defaulting to None. If one is set, the corresponding encoder is created. With both None we should end up with a normal language model.
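A rough sketch of what this suggestion could look like, using plain dataclasses. EncoderConfig, CoCaConfigSketch and active_modalities are hypothetical names standing in for the real config classes; none of them come from the PR.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EncoderConfig:
    """Hypothetical stand-in for AudioTransformerConfig / VisionTransformerConfig."""
    n_embd: int


@dataclass
class CoCaConfigSketch:
    """Hypothetical config: each modality is optional and defaults to None."""
    audio_encoder_config: Optional[EncoderConfig] = None
    vision_encoder_config: Optional[EncoderConfig] = None


def active_modalities(config: CoCaConfigSketch) -> List[str]:
    """Return the modalities whose encoder would be created.

    An empty list corresponds to the plain language-model case.
    """
    names = []
    if config.audio_encoder_config is not None:
        names.append("audio")
    if config.vision_encoder_config is not None:
        names.append("vision")
    return names


if __name__ == "__main__":
    print(active_modalities(CoCaConfigSketch()))
    print(active_modalities(CoCaConfigSketch(audio_encoder_config=EncoderConfig(n_embd=768))))
```

Keeping the two configs as separate optional fields avoids a union type and makes the text-only case explicit.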
def _init_modality(self, encoder_class, encoder_config, n_queries):
    encoder = encoder_class(**dict(encoder_config))
    queries = nn.Parameter(torch.randn(n_queries + 1, encoder_config.n_embd))
    attn_pool = AttentionPooling(
The attention pooling layer should attend to the combination of the audio and vision encoder output tokens if both are activated.
Maybe this is something for the future, since we currently don't have parallel data across all modalities.
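For reference, pooling over the combined tokens could be sketched as follows. This is a deliberately simplified, dependency-free illustration: one query, one head, no learned projections, and tokens concatenated along the sequence axis. The actual AttentionPooling module in the PR is not shown here and will differ; attention_pool and pool_modalities are hypothetical names.

```python
import math


def attention_pool(query, tokens):
    """Single-query scaled dot-product attention over a list of token vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    # Numerically stable softmax over the scores.
    max_score = max(scores)
    weights = [math.exp(s - max_score) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # Weighted sum of the token vectors.
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(len(query))]


def pool_modalities(query, audio_tokens=None, vision_tokens=None):
    """Pool over whichever encoder outputs are present.

    ASSUMPTION: both encoders emit tokens of the same width, so they can be
    concatenated along the sequence axis before pooling.
    """
    tokens = (audio_tokens or []) + (vision_tokens or [])
    if not tokens:
        raise ValueError("at least one modality must provide tokens")
    return attention_pool(query, tokens)


if __name__ == "__main__":
    print(pool_modalities([0.0, 0.0], audio_tokens=[[1.0, 0.0]], vision_tokens=[[0.0, 1.0]]))
```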
vision_embd, vision_cls_token = self._forward_encode_vision(inputs)
# TODO: The "modality_key" needs to be implemented.
if inputs[self.modality_key][0] == self.AUDIO:
    modality_embd, modality_cls_token = self._forward_encode_audio(inputs)
Apply this if the audio encoder exists. I'm not sure whether we also want to check that audio data is in the inputs. Checking explicitly might help with training on only two modalities at a time.
Again, for the same reason as mentioned above, currently we can only train on two modalities at a time.
These can help run audio-only, vision-only or audio-vision experiments!
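The check discussed in this thread (encoder exists and data is present in the batch) might be sketched like this. The function name modalities_to_encode and the input keys "audio" and "images" are assumptions chosen for illustration, not names from the PR.

```python
def modalities_to_encode(inputs, has_audio_encoder, has_vision_encoder,
                         audio_key="audio", vision_key="images"):
    """Select the encoders to run for one batch.

    An encoder is used only if it was created AND its data is present in the
    inputs, which covers audio-only, vision-only and audio-vision batches.
    ASSUMPTION: the keys "audio" and "images" are placeholders.
    """
    selected = []
    if has_audio_encoder and audio_key in inputs:
        selected.append("audio")
    if has_vision_encoder and vision_key in inputs:
        selected.append("vision")
    return selected


if __name__ == "__main__":
    batch = {"audio": [0.1, 0.2], "input_ids": [1, 2, 3]}
    print(modalities_to_encode(batch, has_audio_encoder=True, has_vision_encoder=True))
```

Checking both conditions would replace the lookup on inputs[self.modality_key][0], so no separate modality flag has to travel with the batch.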
Given a training.txt file and a number of assimilation operations, a bpecodes file is generated; it is then used to create the bpe_to_ind and ind_to_bpe dictionary pickles required for tokenization and detokenization.
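As an illustration of the last step only, the two lookup dictionaries could be built and pickled roughly as below. How the subword units are read out of the bpecodes file is not described here, so the sketch assumes they are already available as a list of strings; build_bpe_dicts is a hypothetical helper, and the "@@" suffix in the sample tokens is just example data.

```python
import pickle


def build_bpe_dicts(tokens):
    """Build token-to-index and index-to-token maps.

    ASSUMPTION: `tokens` is a list of subword units already extracted from the
    bpecodes file; duplicates keep their first index.
    """
    bpe_to_ind = {}
    for token in tokens:
        if token not in bpe_to_ind:
            bpe_to_ind[token] = len(bpe_to_ind)
    ind_to_bpe = {index: token for token, index in bpe_to_ind.items()}
    return bpe_to_ind, ind_to_bpe


if __name__ == "__main__":
    bpe_to_ind, ind_to_bpe = build_bpe_dicts(["he@@", "llo", "wor@@", "ld"])
    # Serialize in memory; writing to disk would use pickle.dump with a file.
    restored = pickle.loads(pickle.dumps(bpe_to_ind))
    print(restored == bpe_to_ind, ind_to_bpe[1])
```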
9528bb6 to 1806062
…s/modalities into feat/audio_coca
These commits essentially bring in two things:
1. The Conformer audio encoder: the Conformer architecture is readily available via torchaudio, and only a few additional modules were coded.
2. Changes to the CoCa code that allow the Conformer encoder and the audio modality to be used with the CoCa architecture: these changes include renaming and introducing a few variables and defining their usage, as well as slightly modifying the forward pass logic.