Merge pull request #64 from sensein/alistair/audio_tutorial
Audio tutorial
Showing 4 changed files with 455 additions and 5 deletions.
@@ -0,0 +1,319 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Audio tutorial\n", | ||
"\n", | ||
"This tutorial walks through loading in individual audio records as well as the entire dataset." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import matplotlib.pyplot as plt\n", | ||
"import IPython.display as Ipd\n", | ||
"import pandas as pd\n", | ||
"import torch\n", | ||
"from pathlib import Path\n", | ||
"import numpy as np\n", | ||
"\n", | ||
"from b2aiprep.process import SpeechToText\n", | ||
"from b2aiprep.process import Audio, specgram\n", | ||
"\n", | ||
"from b2aiprep.dataset import VBAIDataset" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We have reformatted the data into a [BIDS](https://bids-standard.github.io/bids-starter-kit/folders_and_files/folders.html)-like format. The data is stored in a directory with the following structure:\n", | ||
"\n", | ||
"```\n", | ||
"data/\n", | ||
" sub-01/\n", | ||
" ses-01/\n", | ||
" beh/\n", | ||
" sub-01_ses-01_questionnaire.json\n", | ||
" audio/\n", | ||
" sub-01_ses-01_recording.wav\n", | ||
" sub-02/\n", | ||
" ses-01/\n", | ||
" beh/\n", | ||
" sub-02_ses-01_questionnaire.json\n", | ||
" audio/\n", | ||
" sub-02_ses-01_recording.wav\n", | ||
" ...\n", | ||
"```\n", | ||
"\n", | ||
"i.e. the data reflects a subject-session-datatype hierarchy. The `beh` subfolder, for behavioural data, was chosen to store questionnaire data as it is the closest approximation to the data collected. The `audio` subfolder contains audio data.\n", | ||
"\n", | ||
"We have provided utilities which load in data from the BIDS-like dataset structure. The only input needed is the folder path which stores the data." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"dataset = VBAIDataset('../output')" | ||
] | ||
}, | ||
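{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Before going further, it can help to peek at the folder tree and confirm it matches the layout described above. The cell below is a minimal sketch using only `pathlib`, assuming the dataset lives in `../output` as in the cell above." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Sketch: list the first couple of subjects and their session/datatype folders.\n", | ||
"# Assumes the BIDS-like dataset is stored in '../output' (see the cell above).\n", | ||
"bids_dir = Path('../output')\n", | ||
"for subject_dir in sorted(bids_dir.glob('sub-*'))[:2]:\n", | ||
"    print(subject_dir.name)\n", | ||
"    for session_dir in sorted(subject_dir.glob('ses-*')):\n", | ||
"        print(' ', session_dir.name)\n", | ||
"        for datatype_dir in sorted(p for p in session_dir.iterdir() if p.is_dir()):\n", | ||
"            n_files = len(list(datatype_dir.iterdir()))\n", | ||
"            print('   ', datatype_dir.name, '-', n_files, 'files')" | ||
] | ||
}, | ||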
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Finding data\n", | ||
"\n", | ||
"The dataset functions are convenient, but they require some prior knowledge: we needed to know that `sessionschema` was the name for the questionnaire with session information. We also needed to know that `demographics` was the name for the questionnaire where general demographics were collected. For convenience, the dataset object has another method which tells you all of the questionnaire names available. It accomplishes this by iterating through every file of the BIDS dataset. Note that this can be an expensive operation! Luckily, if there are less than 10,000 files, it goes pretty fast. Let's try it out." | ||
] | ||
}, | ||
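{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"One way to try this out without relying on the helper's exact name is to scan the BIDS-like folder directly: the questionnaire name is the final underscore-separated token of each JSON filename. The sketch below does just that, assuming the `../output` path used earlier; it only approximates what the dataset helper does internally." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Sketch: collect questionnaire names from the JSON filenames.\n", | ||
"# Assumes '../output' and the filename convention described above.\n", | ||
"questionnaire_names = sorted({\n", | ||
"    f.stem.split('_')[-1]\n", | ||
"    for f in Path('../output').rglob('sub-*_ses-*_*.json')\n", | ||
"})\n", | ||
"print(len(questionnaire_names), 'questionnaire names found')\n", | ||
"print(questionnaire_names[:20])" | ||
] | ||
}, | ||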
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"participant_df = dataset.load_and_pivot_questionnaire('participant')\n", | ||
"participant_df.head()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Acoustic tasks\n", | ||
"\n", | ||
"Let's look at the acoustic tasks now. Acoustic task files are organized in the following way:\n", | ||
"\n", | ||
"```\n", | ||
"data/\n", | ||
" sub-01/\n", | ||
" ses-01/\n", | ||
" beh/\n", | ||
" sub-01_ses-01_task-<TaskName>_acoustictaskschema.json\n", | ||
" sub-01_ses-01_task-<TaskName>_rec-<TaskName>-1_recordingschema.json\n", | ||
" ...\n", | ||
"```\n", | ||
"\n", | ||
"where `TaskName` is the name of the acoustic task, including:\n", | ||
"\n", | ||
"* `Audio-Check`\n", | ||
"* `Cinderalla-Story`\n", | ||
"* `Rainbow-Passage`\n", | ||
"\n", | ||
"etc. The audio tasks are listed currently in b2aiprep/prepare.py:_AUDIO_TASKS." | ||
] | ||
}, | ||
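{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As a quick illustration of this naming convention, the sketch below globs for the recording metadata files of a single task. The task name `Rainbow-Passage` and the `../output` path are taken from earlier in this notebook; adjust them as needed." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Sketch: find recording metadata files for one acoustic task by filename.\n", | ||
"task_name = 'Rainbow-Passage'\n", | ||
"task_files = sorted(Path('../output').rglob(f'*_task-{task_name}_*recordingschema.json'))\n", | ||
"print(len(task_files), 'recording metadata files for', task_name)\n", | ||
"for f in task_files[:5]:\n", | ||
"    print(f.name)" | ||
] | ||
}, | ||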
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"acoustic_task_df = dataset.load_and_pivot_questionnaire('acoustictaskschema')\n", | ||
"acoustic_task_df.head()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Each row in the above corresponds to a different acoustic task: an audio check, prolonged vowels, etc. The `value_counts()` method for pandas DataFrames lets us count all the unique values for a column." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"acoustic_task_df['acoustic_task_name'].value_counts()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"The hierarchy for a single session is the following:\n", | ||
"\n", | ||
"```\n", | ||
"subject\n", | ||
"└── session\n", | ||
" └── acoustic_task\n", | ||
" └── recording\n", | ||
"```\n", | ||
"\n", | ||
"That is, a subject has multiple sessions, each session has multiple acoustic tasks, and each acoustic task has multiple recordings. Very often these relationships are 1:1, and so for a given acoustic task there is only one recording, but this is not always the case.\n", | ||
"\n", | ||
"We can load in a dataframe for the recordings using the same load and pivot command." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"recording_df = dataset.load_and_pivot_questionnaire('recordingschema')\n", | ||
"print(f\"{recording_df['recording_id'].nunique()} recordings.\")\n", | ||
"recording_df.head()" | ||
] | ||
}, | ||
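{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"To get a feel for the one-to-many structure described above, the sketch below tallies recording metadata files per subject straight from the filenames (the subject ID is the first underscore-separated token, per the layout shown earlier). It again assumes the `../output` path." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Sketch: count recording metadata files per subject from the filenames.\n", | ||
"from collections import Counter\n", | ||
"\n", | ||
"rec_files = Path('../output').rglob('*_recordingschema.json')\n", | ||
"recordings_per_subject = Counter(f.name.split('_')[0] for f in rec_files)\n", | ||
"print(len(recordings_per_subject), 'subjects')\n", | ||
"print(recordings_per_subject.most_common(5))" | ||
] | ||
}, | ||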
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Audio\n", | ||
"\n", | ||
"We have provided three utilities to support loading in audio and related data:\n", | ||
"\n", | ||
"* `load_recording(recording_id)` - loads in audio data given a `recording_id`\n", | ||
"* `load_recordings()` - loads in *all* of the audio data (takes a while!). the list returned will be in the same order as `dataset.recordings_df`\n", | ||
"* `load_spectrograms()` - similar to the above, but loads in precomputed spectrograms.\n", | ||
"\n", | ||
"We can first take a look at the `load_recording` function with the first `recording_id` from the dataframe. This should be an audio check." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"recording_id = recording_df.loc[0, 'recording_id']\n", | ||
"audio = dataset.load_recording(recording_id)\n", | ||
"\n", | ||
"# convert to uint32 - probably should use the bits_per_sample from the original metadata!\n", | ||
"signal = audio.signal.squeeze()\n", | ||
"signal = (np.iinfo(np.uint32).max * (signal - signal.min())) / (signal.max() - signal.min())\n", | ||
"\n", | ||
"# display a widget to play the audio file\n", | ||
"Ipd.display(Ipd.Audio(data=signal, rate=audio.sample_rate))" | ||
] | ||
}, | ||
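{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"It is often useful to look at the raw waveform as well. The short sketch below plots the signal against time in seconds, assuming `audio.signal` has time as its last dimension and `audio.sample_rate` is in Hz (both are used that way elsewhere in this tutorial)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Plot the waveform of the recording loaded above.\n", | ||
"waveform = audio.signal.squeeze()\n", | ||
"t = np.arange(len(waveform)) / audio.sample_rate\n", | ||
"\n", | ||
"plt.figure(figsize=(10, 3))\n", | ||
"plt.plot(t, waveform)\n", | ||
"plt.xlabel('Time (s)')\n", | ||
"plt.ylabel('Amplitude')\n", | ||
"plt.title(f'Recording {recording_id}')\n", | ||
"plt.show()" | ||
] | ||
}, | ||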
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Spectrograms are a useful transformation of the audio data. We can load in the spectrograms for the first recording using the `load_spectrograms` function." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"spectrograms = dataset.load_spectrograms()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# display a single spectrogram\n", | ||
"i = 5\n", | ||
"\n", | ||
"f = plt.figure(figsize=(10, 5))\n", | ||
"# convert spectrogram to decibel\n", | ||
"log_spec = 10.0 * torch.log10(torch.maximum(spectrograms[i], torch.tensor(1e-10)))\n", | ||
"log_spec = torch.maximum(log_spec, log_spec.max() - 80)\n", | ||
"\n", | ||
"ax = f.gca()\n", | ||
"ax.matshow(log_spec.T, origin=\"lower\", aspect=\"auto\")\n", | ||
"\n", | ||
"# the time axis is dependent on the window hop used by the spectrogram\n", | ||
"# this is usually 25ms for speech\n", | ||
"plt.xlabel('Time (window index)')\n", | ||
"# for this axis, the frequency domain goes from 0 : NFFT/2*fs\n", | ||
"plt.ylabel('Frequency (NFFT index)')\n", | ||
"plt.show()" | ||
] | ||
}, | ||
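{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"If you want the axes in physical units, you can convert the bin indices to seconds and hertz. The sketch below assumes a 25 ms window hop (as noted above), a 16 kHz sample rate for the precomputed spectrograms, and a one-sided frequency axis with `NFFT/2 + 1` bins; check the preprocessing configuration if these assumptions do not hold." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Sketch: the same spectrogram as above, with assumed physical units on the axes.\n", | ||
"hop_s = 0.025        # assumed window hop (seconds)\n", | ||
"sample_rate = 16000  # assumed sample rate of the precomputed spectrograms (Hz)\n", | ||
"\n", | ||
"n_frames, n_freq_bins = spectrograms[i].shape\n", | ||
"nfft = 2 * (n_freq_bins - 1)\n", | ||
"freqs = np.arange(n_freq_bins) * sample_rate / nfft\n", | ||
"times = np.arange(n_frames) * hop_s\n", | ||
"\n", | ||
"plt.figure(figsize=(10, 5))\n", | ||
"plt.pcolormesh(times, freqs, log_spec.T.numpy(), shading='auto')\n", | ||
"plt.xlabel('Time (s)')\n", | ||
"plt.ylabel('Frequency (Hz)')\n", | ||
"plt.show()" | ||
] | ||
}, | ||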
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We can then alternatively load in all the recordings.\n", | ||
"\n", | ||
"**WARNING**: This takes a little under 15GB of memory!" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"audio_data = dataset.load_recordings()" | ||
] | ||
}, | ||
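{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"If 15GB is more memory than you can spare, an alternative is to iterate over recording IDs and load one file at a time with `load_recording`, as sketched below (here just to compute durations for the first 20 recordings; the `recording_id` column comes from the dataframe loaded earlier)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Sketch: load recordings one at a time instead of holding them all in memory.\n", | ||
"durations = []\n", | ||
"for rec_id in recording_df['recording_id'].head(20):\n", | ||
"    rec_audio = dataset.load_recording(rec_id)\n", | ||
"    durations.append(rec_audio.signal.shape[-1] / rec_audio.sample_rate)\n", | ||
"\n", | ||
"print(f'Mean duration over the first 20 recordings: {np.mean(durations):.1f} s')" | ||
] | ||
}, | ||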
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Transcribe a single audio file\n", | ||
"\n", | ||
"Now that we have loaded in the audio recordings, we can try to apply ML models. For example, we can transcribe the audio." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"audio = audio_data[5].to_16khz()\n", | ||
"\n", | ||
"# display a widget to play the audio file\n", | ||
"Ipd.display(Ipd.Audio(data=audio.signal.squeeze(), rate=audio.sample_rate))\n", | ||
"\n", | ||
"# Select the best device available\n", | ||
"device = \"cuda\" if torch.cuda.is_available() else \"cpu\"\n", | ||
"device = \"mps\" if torch.backends.mps.is_available() else device\n", | ||
"\n", | ||
"# default arguments for the transcription model\n", | ||
"stt_kwargs = {\n", | ||
" \"model_id\": \"openai/whisper-base\",\n", | ||
" \"max_new_tokens\": 128,\n", | ||
" \"chunk_length_s\": 30,\n", | ||
" \"batch_size\": 16,\n", | ||
" \"device\": device,\n", | ||
"}\n", | ||
"stt = SpeechToText(**stt_kwargs)\n", | ||
"transcription = stt.transcribe(audio, language=\"en\")\n", | ||
"print(transcription)" | ||
] | ||
}, | ||
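{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"To transcribe more than one recording, the same `SpeechToText` object can be reused in a loop. The sketch below transcribes the first three loaded recordings and collects the results into a DataFrame; the column names are illustrative, and it assumes the value returned by `stt.transcribe` can be stored directly, as in the cell above." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Sketch: batch transcription over the first few loaded recordings.\n", | ||
"rows = []\n", | ||
"for idx in range(3):\n", | ||
"    rec_audio = audio_data[idx].to_16khz()\n", | ||
"    rows.append({\n", | ||
"        'index': idx,\n", | ||
"        'transcription': stt.transcribe(rec_audio, language='en'),\n", | ||
"    })\n", | ||
"\n", | ||
"pd.DataFrame(rows)" | ||
] | ||
} | ||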
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "b2ai", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.8" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |