Increase diarization performance (#18)
- Switched to word-based diarization (instead of segment-based) using wav2vec models. Improves diarization performance.
- Updated docs
hsnfirooz authored Apr 16, 2024
1 parent a5375e1 commit bf6f132
Showing 7 changed files with 397 additions and 227 deletions.
43 changes: 28 additions & 15 deletions README.md
@@ -1,14 +1,27 @@
# speech2text

This repo contains instructions for setting up and applying the speech2text app on Aalto Triton cluster. The app utilizes [WhisperX](https://github.com/m-bain/whisperX) automatic speech recognition tool and [Pyannote](https://huggingface.co/pyannote/speaker-diarization) speaker detection (diarization) pipeline. The speech recognition and diarization steps are run independently and their result segments are combined (aligned) using a simple algorithm which for each transcription segment finds the most overlapping (in time) speaker segment.
>*_NOTE:_* The non-technical user guide for the Open On Demand web interface can be found [here](https://aaltorse.github.io/speech2text/).
This repo contains instructions for setting up and applying the speech2text app on Aalto Triton cluster. The app utilizes

- [WhisperX](https://github.com/m-bain/whisperX) automatic speech recognition tool
- [wav2vec](https://ai.meta.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) to find word start and end timestamps for the WhisperX transcription
- [Pyannote](https://huggingface.co/pyannote/speaker-diarization) speaker detection (diarization) tool

The speech recognition and diarization steps are run independently and their result segments are combined using a simple algorithm which, for each transcribed word segment, finds the most overlapping (in time) speaker segment.
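A minimal sketch of this overlap assignment (illustrative only, with hypothetical data shapes; not the app's actual code):

```python
# Word-to-speaker assignment by maximal time overlap (illustrative sketch).
# Hypothetical shapes: a word is (start, end, text); a speaker turn is
# (start, end, label). Not the actual speech2text implementation.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, speaker_turns):
    """Label each transcribed word with the most-overlapping speaker turn."""
    labeled = []
    for start, end, text in words:
        best = max(speaker_turns, key=lambda t: overlap(start, end, t[0], t[1]))
        labeled.append((start, end, text, best[2]))
    return labeled

words = [(0.0, 0.4, "hello"), (0.5, 0.9, "there")]
turns = [(0.0, 0.45, "SPEAKER_00"), (0.45, 1.0, "SPEAKER_01")]
print(assign_speakers(words, turns))
# [(0.0, 0.4, 'hello', 'SPEAKER_00'), (0.5, 0.9, 'there', 'SPEAKER_01')]
```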

The required models are described [here](#models).

Conda environment and Lmod setup is described [here](#setup).

Usage is described [here](#usage).
Command line (technical) usage on Triton is described [here](#usage).

Open On Demand web interface (non-technical) usage is described [here](https://aaltorse.github.io/speech2text/).

Supported languages are:

arabic (ar), armenian (hy), bulgarian (bg), catalan (ca), chinese (zh), czech (cs), danish (da), dutch (nl), english (en), estonian (et), finnish (fi), french (fr), galician (gl), german (de), greek (el), hebrew (he), hindi (hi), hungarian (hu), icelandic (is), indonesian (id), italian (it), japanese (ja), kazakh (kk), korean (ko), latvian (lv), lithuanian (lt), malay (ms), marathi (mr), nepali (ne), norwegian (no), persian (fa), polish (pl), portuguese (pt), romanian (ro), russian (ru), serbian (sr), slovak (sk), slovenian (sl), spanish (es), swedish (sv), thai (th), turkish (tr), ukrainian (uk), urdu (ur), vietnamese (vi)

The non-technical user guide for the Open On Demand web interface can be found [here](https://aaltorse.github.io/speech2text/).

## Models

@@ -19,22 +32,20 @@ The required models have been downloaded beforehand from Hugging Face and saved

We support `large-v2` and `large-v3` (default) multilingual [Faster Whisper](https://github.com/SYSTRAN/faster-whisper) models. Languages supported by the models are:

afrikaans, arabic, armenian, azerbaijani, belarusian, bosnian, bulgarian, catalan,
chinese, croatian, czech, danish, dutch, english, estonian, finnish, french, galician,
german, greek, hebrew, hindi, hungarian, icelandic, indonesian, italian, japanese,
kannada, kazakh, korean, latvian, lithuanian, macedonian, malay, marathi, maori, nepali,
norwegian, persian, polish, portuguese, romanian, russian, serbian, slovak, slovenian,
spanish, swahili, swedish, tagalog, tamil, thai, turkish, ukrainian, urdu, vietnamese,
welsh

The models are covered by the [MIT licence](https://huggingface.co/models?license=license:mit) and have been pre-downloaded from Hugging Face to

`/scratch/shareddata/dldata/huggingface-hub-cache/hub/models--Systran--faster-whisper-large-v2`

and

`/scratch/shareddata/dldata/huggingface-hub-cache/hub/models--Systran--faster-whisper-large-v3`
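As an illustration, loading one of these pre-downloaded models with the `faster-whisper` library looks roughly like this (a hedged sketch, not the app's actual invocation; only the model name and cache path come from this README):

```python
# Illustrative sketch of transcription with Faster Whisper; not the app's code.
from faster_whisper import WhisperModel

# With HF_HOME pointed at the shared cache (as the Lmod module below sets it),
# "large-v3" resolves offline to the pre-downloaded snapshot.
model = WhisperModel("large-v3", device="cuda")

segments, info = model.transcribe("audiofile.mp3", language="fi")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```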


### wav2vec

We use [wav2vec](https://ai.meta.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) models as part of the diarization pipeline, which refines the timestamps from Whisper transcriptions using forced alignment with a phoneme-based ASR model (wav2vec 2.0). This provides word-level timestamps, as well as improved segment timestamps.

We use a fine-tuned wav2vec model for each of the supported languages. All models are fine-tuned versions of [Meta's XLSR](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) model.
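A sketch of how this forced-alignment step is typically invoked through whisperX (assumed API usage, not necessarily the app's exact code):

```python
# Illustrative sketch of wav2vec forced alignment via whisperX.
import whisperx

device = "cuda"
audio = whisperx.load_audio("audiofile.mp3")

# Transcription segments as produced by Whisper (toy example here).
segments = [{"start": 0.0, "end": 2.0, "text": "hyvää päivää"}]

# A language-specific fine-tuned model can be requested via model_name
# (cf. the wav2vec_models mapping added in src/settings.py below).
align_model, metadata = whisperx.load_align_model(language_code="fi", device=device)
aligned = whisperx.align(segments, align_model, metadata, audio, device)
print(aligned["word_segments"])  # word-level start/end timestamps
```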

### Pyannote

The diarization is performed using the [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) pipeline installed via [`pyannote.audio`](https://github.com/pyannote/pyannote-audio).
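For illustration, running the pipeline directly with `pyannote.audio` looks roughly like this (a sketch; gated models normally also need an auth token or, as here, a pre-populated cache):

```python
# Illustrative sketch of the diarization step; not the app's actual code.
from pyannote.audio import Pipeline

# Offline use assumes the pipeline is already in the PYANNOTE_CACHE/HF cache.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline("audiofile.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s  {turn.end:6.1f}s  {speaker}")
```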
@@ -148,7 +159,9 @@ SPEECH2TEXT_MEM
SPEECH2TEXT_CPUS_PER_TASK
```

Note that you can leave the language variable unspecified, in which case speech2text tries to detect the language automatically. Specifying the language explicitly is, however, recommended.
The language must be provided by the user from the list of supported languages:

arabic (ar), armenian (hy), bulgarian (bg), catalan (ca), chinese (zh), czech (cs), danish (da), dutch (nl), english (en), estonian (et), finnish (fi), french (fr), galician (gl), german (de), greek (el), hebrew (he), hindi (hi), hungarian (hu), icelandic (is), indonesian (id), italian (it), japanese (ja), kazakh (kk), korean (ko), latvian (lv), lithuanian (lt), malay (ms), marathi (mr), nepali (ne), norwegian (no), persian (fa), polish (pl), portuguese (pt), romanian (ro), russian (ru), serbian (sr), slovak (sk), slovenian (sl), spanish (es), swedish (sv), thai (th), turkish (tr), ukrainian (uk), urdu (ur), vietnamese (vi)

Notification emails will be sent to the given email address. If the address is left unspecified,
no notifications are sent.
@@ -267,9 +280,9 @@ The documentation can be found in `docs/build/`. A good place to start is the in

### Audio files with more than one language

If a single audio file contains speech in more than one language, result files will (probably) still be produced but the results will (probably) be nonsensical to some extent. This is because even when using automatic language detection, Whisper appears to [detect the first language it encounters (if not given specifically) and stick to it until the end of the audio file, translating other encountered languages to the first language](https://github.com/openai/whisper/discussions/49).
If a single audio file contains speech in more than one language, result files will (probably) still be produced but the results will (probably) be nonsensical to some extent. This is because WhisperX appears to translate languages to the specified target language (mandatory argument SPEECH2TEXT_LANGUAGE). Related discussion: [https://github.com/openai/whisper/discussions/49](https://github.com/openai/whisper/discussions/49).

In some cases, this problem is easily avoided. For example, if the language changes only once in the middle of the audio, you can just split the file into two and process the parts separately. You can use any audio processing software to do this, e.g. [Audacity](https://www.audacityteam.org/).
In some cases, this problem can be avoided relatively easily. For example, if the language changes only once in the middle of the audio, you can just split the file into two and process the parts separately. You can use any audio processing software to do this, e.g. [Audacity](https://www.audacityteam.org/).
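Such a split can also be scripted, for example with `pydub` (an illustrative sketch; pydub is an assumption here and requires ffmpeg, and any audio tool works just as well):

```python
# Illustrative sketch: split an audio file in two with pydub (needs ffmpeg).
from pydub import AudioSegment

audio = AudioSegment.from_file("audiofile.mp3")
split_at_ms = 30 * 60 * 1000  # assumed split point: 30 minutes, in milliseconds

audio[:split_at_ms].export("audiofile_part1.mp3", format="mp3")
audio[split_at_ms:].export("audiofile_part2.mp3", format="mp3")
```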

## Licensing

23 changes: 10 additions & 13 deletions bin/speech2text
@@ -16,6 +16,16 @@ Example run on a folder containing one or more audio file:
export SPEECH2TEXT_LANGUAGE=finnish
speech2text audiofiles/
Language must be provided from the list of supported languages:
arabic (ar), armenian (hy), bulgarian (bg), catalan (ca), chinese (zh), czech (cs), danish (da),
dutch (nl), english (en), estonian (et), finnish (fi), french (fr), galician (gl), german (de),
greek (el), hebrew (he), hindi (hi), hungarian (hu), icelandic (is), indonesian (id),
italian (it), japanese (ja), kazakh (kk), korean (ko), latvian (lv), lithuanian (lt), malay (ms),
marathi (mr), nepali (ne), norwegian (no), persian (fa), polish (pl), portuguese (pt),
romanian (ro), russian (ru), serbian (sr), slovak (sk), slovenian (sl), spanish (es),
swedish (sv), thai (th), turkish (tr), ukrainian (uk), urdu (ur), vietnamese (vi)
The audio files can be in any common audio (.wav, .mp3, .aiff, etc.) or video (.mp4, .mov, etc.) format.
The speech2text app writes result files to a subfolder results/ next to each audio file.
@@ -26,19 +36,6 @@ Result files in a folder audiofiles/ will be written to folder audiofiles/result
Notification emails will be sent to SPEECH2TEXT_EMAIL. If SPEECH2TEXT_EMAIL is left
unspecified, no notifications are sent.
Supported languages are:
afrikaans, arabic, armenian, azerbaijani, belarusian, bosnian, bulgarian, catalan,
chinese, croatian, czech, danish, dutch, english, estonian, finnish, french, galician,
german, greek, hebrew, hindi, hungarian, icelandic, indonesian, italian, japanese,
kannada, kazakh, korean, latvian, lithuanian, macedonian, malay, marathi, maori, nepali,
norwegian, persian, polish, portuguese, romanian, russian, serbian, slovak, slovenian,
spanish, swahili, swedish, tagalog, tamil, thai, turkish, ukrainian, urdu, vietnamese,
welsh
You can leave the language variable SPEECH2TEXT_LANGUAGE unspecified, in which case
speech2text tries to detect the language automatically. Specifying the language
explicitly is, however, recommended.
EOF
}

77 changes: 77 additions & 0 deletions modules/speech2text/20240408.lua
@@ -0,0 +1,77 @@
help_text = [[
This app does speech2text with diarization.
Example run on a single file:
export [email protected]
export SPEECH2TEXT_LANGUAGE=finnish
speech2text audiofile.mp3
Example run on a folder containing one or more audio files:
export [email protected]
export SPEECH2TEXT_LANGUAGE=finnish
speech2text audiofiles/
Language must be provided from the list of supported languages:
arabic (ar), armenian (hy), bulgarian (bg), catalan (ca), chinese (zh), czech (cs), danish (da),
dutch (nl), english (en), estonian (et), finnish (fi), french (fr), galician (gl), german (de),
greek (el), hebrew (he), hindi (hi), hungarian (hu), icelandic (is), indonesian (id),
italian (it), japanese (ja), kazakh (kk), korean (ko), latvian (lv), lithuanian (lt), malay (ms),
marathi (mr), nepali (ne), norwegian (no), persian (fa), polish (pl), portuguese (pt),
romanian (ro), russian (ru), serbian (sr), slovak (sk), slovenian (sl), spanish (es),
swedish (sv), thai (th), turkish (tr), ukrainian (uk), urdu (ur), vietnamese (vi)
The audio files can be in any common audio (.wav, .mp3, .aiff, etc.) or video (.mp4, .mov, etc.) format.
The speech2text app writes result files to a subfolder results/ next to each audio file.
Result filenames are the audio filename with .txt and .csv extensions. For example, result files
corresponding to audiofile.mp3 are written to results/audiofile.txt and results/audiofile.csv.
Result files in a folder audiofiles/ will be written to folder audiofiles/results/.
Notification emails will be sent to SPEECH2TEXT_EMAIL. If SPEECH2TEXT_EMAIL is left
unspecified, no notifications are sent.
]]

local version = "20240408"
whatis("Name : Aalto speech2text")
whatis("Version :" .. version)
help(help_text)

local speech2text = "/share/apps/manual_installations/speech2text/" .. version .. "/bin/"
local conda_env = "/share/apps/manual_installations/speech2text/" .. version .. "/env/bin/"

prepend_path("PATH", speech2text)
prepend_path("PATH", conda_env)

-- Shared, pre-populated model caches (all models are downloaded beforehand)
local hf_home = "/scratch/shareddata/dldata/huggingface-hub-cache/"
local pyannote_cache = hf_home .. "hub/"
local torch_home = "/scratch/shareddata/speech2text"
local pyannote_config = "/share/apps/manual_installations/speech2text/" .. version .. "/pyannote/config.yml"
local numba_cache = "/tmp"
local mplconfigdir = "/tmp"

pushenv("HF_HOME", hf_home)
pushenv("PYANNOTE_CACHE", pyannote_cache)
pushenv("TORCH_HOME", torch_home)
pushenv("XDG_CACHE_HOME", torch_home)
pushenv("PYANNOTE_CONFIG", pyannote_config)
pushenv("NUMBA_CACHE_DIR", numba_cache)
pushenv("MPLCONFIGDIR", mplconfigdir)

-- Default resource requests for speech2text batch jobs
local speech2text_mem = "8G"
local speech2text_cpus_per_task = "6"
local speech2text_tmp = os.getenv("WRKDIR") .. "/.speech2text"

pushenv("SPEECH2TEXT_MEM", speech2text_mem)
pushenv("SPEECH2TEXT_CPUS_PER_TASK", speech2text_cpus_per_task)
pushenv("SPEECH2TEXT_TMP", speech2text_tmp)

pushenv("HF_HUB_OFFLINE", "1")

if mode() == "load" then
LmodMessage("For more information, run 'module spider speech2text/" .. version .. "'")
end

45 changes: 33 additions & 12 deletions src/settings.py
@@ -1,14 +1,11 @@
# Supported languages

supported_languages = {
"afrikaans": "af",
"arabic": "ar",
"armenian": "hy",
"azerbaijani": "az",
"belarusian": "be",
"bosnian": "bs",
"bulgarian": "bg",
"catalan": "ca",
"chinese": "zh",
"croatian": "hr",
"czech": "cs",
"danish": "da",
"dutch": "nl",
@@ -26,15 +23,12 @@
"indonesian": "id",
"italian": "it",
"japanese": "ja",
"kannada": "kn",
"kazakh": "kk",
"korean": "ko",
"latvian": "lv",
"lithuanian": "lt",
"macedonian": "mk",
"malay": "ms",
"marathi": "mr",
"maori": "mi",
"nepali": "ne",
"norwegian": "no",
"persian": "fa",
@@ -46,19 +40,46 @@
"slovak": "sk",
"slovenian": "sl",
"spanish": "es",
"swahili": "sw",
"swedish": "sv",
"tagalog": "tl",
"tamil": "ta",
"thai": "th",
"turkish": "tr",
"ukrainian": "uk",
"urdu": "ur",
"vietnamese": "vi",
"welsh": "cy",
}

supported_languages_reverse = {value: key for key, value in supported_languages.items()}

supported_languages_pretty = ", ".join(
[f"{lang} ({short})" for lang, short in supported_languages.items()]
)


# Wav2Vec models

wav2vec_models = {
"hy": "infinitejoy/wav2vec2-large-xls-r-300m-armenian",
"bg": "infinitejoy/wav2vec2-large-xls-r-300m-bulgarian",
"et": "anton-l/wav2vec2-large-xlsr-53-estonian",
"gl": "infinitejoy/wav2vec2-large-xls-r-300m-galician",
"is": "language-and-voice-lab/wav2vec2-large-xlsr-53-icelandic-ep30-967h",
"id": "indonesian-nlp/wav2vec2-large-xlsr-indonesian",
"kk": "aismlv/wav2vec2-large-xlsr-kazakh",
"lv": "infinitejoy/wav2vec2-large-xls-r-300m-latvian",
"lt": "DeividasM/wav2vec2-large-xlsr-53-lithuanian",
"ms": "gvs/wav2vec2-large-xlsr-malayalam",
"mr": "infinitejoy/wav2vec2-large-xls-r-300m-marathi-cv8",
"ne": "Harveenchadha/vakyansh-wav2vec2-nepali-nem-130",
"ro": "anton-l/wav2vec2-large-xlsr-53-romanian",
"sr": "dnikolic/wav2vec2-xlsr-530-serbian-colab",
"sk": "infinitejoy/wav2vec2-large-xls-r-300m-slovak",
"sl": "infinitejoy/wav2vec2-large-xls-r-300m-slovenian",
"sv": "KBLab/wav2vec2-large-xlsr-53-swedish",
"th": "sakares/wav2vec2-large-xlsr-thai-demo",
}
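# Illustrative (hypothetical) consumption of the mapping above: languages
# present here would use their fine-tuned wav2vec checkpoint, while other
# languages would fall back to a default alignment model, e.g.:
#
#   model_name = wav2vec_models.get(language_code)  # None -> default aligner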

# Whisper models

available_whisper_models = ["large-v2", "large-v3"]
default_whisper_model = "large-v3"
compute_device = "cuda"
