Increase diarization performance (#18)
- Switched to word-based diarization (instead of segment-based) using wav2vec models. Improves diarization performance.
- Updated docs
hsnfirooz authored Apr 16, 2024
1 parent a5375e1 commit bf6f132
Showing 7 changed files with 397 additions and 227 deletions.
43 changes: 28 additions & 15 deletions README.md
@@ -1,14 +1,27 @@
# speech2text

This repo contains instructions for setting up and applying the speech2text app on Aalto Triton cluster. The app utilizes [WhisperX](https://github.com/m-bain/whisperX) automatic speech recognition tool and [Pyannote](https://huggingface.co/pyannote/speaker-diarization) speaker detection (diarization) pipeline. The speech recognition and diarization steps are run independently and their result segments are combined (aligned) using a simple algorithm which for each transcription segment finds the most overlapping (in time) speaker segment.
>*_NOTE:_* The non-technical user guide for the Open On Demand web interface can be found [here](https://aaltorse.github.io/speech2text/).
This repo contains instructions for setting up and applying the speech2text app on Aalto Triton cluster. The app utilizes

- [WhisperX](https://github.com/m-bain/whisperX) automatic speech recognition tool
- [wav2vec](https://ai.meta.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) to find word start and end timestamps for the WhisperX transcription
- [Pyannote](https://huggingface.co/pyannote/speaker-diarization) speaker detection (diarization) tool

The speech recognition and diarization steps are run independently and their result segments are combined using a simple algorithm which, for each transcribed word segment, finds the most overlapping (in time) speaker segment.
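A minimal sketch of this overlap assignment (illustrative only, with hypothetical data shapes; not the app's actual code):

```python
# Word-to-speaker assignment by maximal time overlap (illustrative sketch).
# Hypothetical shapes: a word is (start, end, text); a speaker turn is
# (start, end, label). Not the actual speech2text implementation.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the overlap between two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(words, speaker_turns):
    """Label each transcribed word with the most-overlapping speaker turn."""
    labeled = []
    for start, end, text in words:
        best = max(speaker_turns, key=lambda t: overlap(start, end, t[0], t[1]))
        labeled.append((start, end, text, best[2]))
    return labeled

words = [(0.0, 0.4, "hello"), (0.5, 0.9, "there")]
turns = [(0.0, 0.45, "SPEAKER_00"), (0.45, 1.0, "SPEAKER_01")]
print(assign_speakers(words, turns))
# [(0.0, 0.4, 'hello', 'SPEAKER_00'), (0.5, 0.9, 'there', 'SPEAKER_01')]
```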

The required models are described [here](#models).

Conda environment and Lmod setup is described [here](#setup).

Usage is described [here](#usage).
Command line (technical) usage on Triton is described [here](#usage).

Open On Demand web interface (non-technical) usage is described [here](https://aaltorse.github.io/speech2text/).

Supported languages are:

arabic (ar), armenian (hy), bulgarian (bg), catalan (ca), chinese (zh), czech (cs), danish (da), dutch (nl), english (en), estonian (et), finnish (fi), french (fr), galician (gl), german (de), greek (el), hebrew (he), hindi (hi), hungarian (hu), icelandic (is), indonesian (id), italian (it), japanese (ja), kazakh (kk), korean (ko), latvian (lv), lithuanian (lt), malay (ms), marathi (mr), nepali (ne), norwegian (no), persian (fa), polish (pl), portuguese (pt), romanian (ro), russian (ru), serbian (sr), slovak (sk), slovenian (sl), spanish (es), swedish (sv), thai (th), turkish (tr), ukrainian (uk), urdu (ur), vietnamese (vi)

The non-technical user guide for the Open On Demand web interface can be found [here](https://aaltorse.github.io/speech2text/).

## Models

@@ -19,22 +32,20 @@ The required models have been downloaded beforehand from Hugging Face and saved

We support `large-v2` and `large-v3` (default) multilingual [Faster Whisper](https://github.com/SYSTRAN/faster-whisper) models. Languages supported by the models are:

afrikaans, arabic, armenian, azerbaijani, belarusian, bosnian, bulgarian, catalan,
chinese, croatian, czech, danish, dutch, english, estonian, finnish, french, galician,
german, greek, hebrew, hindi, hungarian, icelandic, indonesian, italian, japanese,
kannada, kazakh, korean, latvian, lithuanian, macedonian, malay, marathi, maori, nepali,
norwegian, persian, polish, portuguese, romanian, russian, serbian, slovak, slovenian,
spanish, swahili, swedish, tagalog, tamil, thai, turkish, ukrainian, urdu, vietnamese,
welsh

The models are covered by the [MIT licence](https://huggingface.co/models?license=license:mit) and have been pre-downloaded from Hugging Face to

`/scratch/shareddata/dldata/huggingface-hub-cache/hub/models--Systran--faster-whisper-large-v2`

and

`/scratch/shareddata/dldata/huggingface-hub-cache/hub/models--Systran--faster-whisper-large-v3`
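As an illustration, loading one of these pre-downloaded models with the `faster-whisper` library looks roughly like this (a hedged sketch, not the app's actual invocation; only the model name and cache path come from this README):

```python
# Illustrative sketch of transcription with Faster Whisper; not the app's code.
from faster_whisper import WhisperModel

# With HF_HOME pointed at the shared cache (as the Lmod module below sets it),
# "large-v3" resolves offline to the pre-downloaded snapshot.
model = WhisperModel("large-v3", device="cuda")

segments, info = model.transcribe("audiofile.mp3", language="fi")
for segment in segments:
    print(f"[{segment.start:.2f} -> {segment.end:.2f}] {segment.text}")
```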


### wav2vec

We use [wav2vec](https://ai.meta.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/) models as part of the diarization pipeline, which refines the timestamps from Whisper transcriptions using forced alignment with a phoneme-based ASR model (wav2vec 2.0). This provides word-level timestamps, as well as improved segment timestamps.

We use a fine-tuned wav2vec model for each of the supported languages. All models are fine-tuned versions of [Meta's XLSR](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) model.
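A sketch of how this forced-alignment step is typically invoked through whisperX (assumed API usage, not necessarily the app's exact code):

```python
# Illustrative sketch of wav2vec forced alignment via whisperX.
import whisperx

device = "cuda"
audio = whisperx.load_audio("audiofile.mp3")

# Transcription segments as produced by Whisper (toy example here).
segments = [{"start": 0.0, "end": 2.0, "text": "hyvää päivää"}]

# A language-specific fine-tuned model can be requested via model_name
# (cf. the wav2vec_models mapping added in src/settings.py below).
align_model, metadata = whisperx.load_align_model(language_code="fi", device=device)
aligned = whisperx.align(segments, align_model, metadata, audio, device)
print(aligned["word_segments"])  # word-level start/end timestamps
```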

### Pyannote

The diarization is performed using the [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1) pipeline installed via [`pyannote.audio`](https://github.com/pyannote/pyannote-audio).
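For illustration, running the pipeline directly with `pyannote.audio` looks roughly like this (a sketch; gated models normally also need an auth token or, as here, a pre-populated cache):

```python
# Illustrative sketch of the diarization step; not the app's actual code.
from pyannote.audio import Pipeline

# Offline use assumes the pipeline is already in the PYANNOTE_CACHE/HF cache.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline("audiofile.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s  {turn.end:6.1f}s  {speaker}")
```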
@@ -148,7 +159,9 @@ SPEECH2TEXT_MEM
SPEECH2TEXT_CPUS_PER_TASK
```

Note that you can leave the language variable unspecified, in which case speech2text tries to detect the language automatically. Specifying the language explicitly is, however, recommended.
The language must be provided by the user from the list of supported languages:

arabic (ar), armenian (hy), bulgarian (bg), catalan (ca), chinese (zh), czech (cs), danish (da), dutch (nl), english (en), estonian (et), finnish (fi), french (fr), galician (gl), german (de), greek (el), hebrew (he), hindi (hi), hungarian (hu), icelandic (is), indonesian (id), italian (it), japanese (ja), kazakh (kk), korean (ko), latvian (lv), lithuanian (lt), malay (ms), marathi (mr), nepali (ne), norwegian (no), persian (fa), polish (pl), portuguese (pt), romanian (ro), russian (ru), serbian (sr), slovak (sk), slovenian (sl), spanish (es), swedish (sv), thai (th), turkish (tr), ukrainian (uk), urdu (ur), vietnamese (vi)

Notification emails will be sent to the given email address. If the address is left unspecified,
no notifications are sent.
@@ -267,9 +280,9 @@ The documentation can be found in `docs/build/`. A good place to start is the in

### Audio files with more than one language

If a single audio file contains speech in more than one language, result files will (probably) still be produced but the results will (probably) be nonsensical to some extent. This is because even when using automatic language detection, Whisper appears to [detect the first language it encounters (if not given specifically) and stick to it until the end of the audio file, translating other encountered languages to the first language](https://github.com/openai/whisper/discussions/49).
If a single audio file contains speech in more than one language, result files will (probably) still be produced but the results will (probably) be nonsensical to some extent. This is because WhisperX appears to translate languages to the specified target language (mandatory argument SPEECH2TEXT_LANGUAGE). Related discussion: [https://github.com/openai/whisper/discussions/49](https://github.com/openai/whisper/discussions/49).

In some cases, this problem is easily avoided. For example, if the language changes only once in the middle of the audio, you can just split the file into two and process the parts separately. You can use any audio processing software to do this, e.g. [Audacity](https://www.audacityteam.org/).
In some cases, this problem can be avoided relatively easily. For example, if the language changes only once in the middle of the audio, you can just split the file into two and process the parts separately. You can use any audio processing software to do this, e.g. [Audacity](https://www.audacityteam.org/).
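Such a split can also be scripted, for example with `pydub` (an illustrative sketch; pydub is an assumption here and requires ffmpeg, and any audio tool works just as well):

```python
# Illustrative sketch: split an audio file in two with pydub (needs ffmpeg).
from pydub import AudioSegment

audio = AudioSegment.from_file("audiofile.mp3")
split_at_ms = 30 * 60 * 1000  # assumed split point: 30 minutes, in milliseconds

audio[:split_at_ms].export("audiofile_part1.mp3", format="mp3")
audio[split_at_ms:].export("audiofile_part2.mp3", format="mp3")
```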

## Licensing

23 changes: 10 additions & 13 deletions bin/speech2text
@@ -16,6 +16,16 @@ Example run on a folder containing one or more audio file:
export SPEECH2TEXT_LANGUAGE=finnish
speech2text audiofiles/
Language must be provided from the list of supported languages:
arabic (ar), armenian (hy), bulgarian (bg), catalan (ca), chinese (zh), czech (cs), danish (da),
dutch (nl), english (en), estonian (et), finnish (fi), french (fr), galician (gl), german (de),
greek (el), hebrew (he), hindi (hi), hungarian (hu), icelandic (is), indonesian (id),
italian (it), japanese (ja), kazakh (kk), korean (ko), latvian (lv), lithuanian (lt), malay (ms),
marathi (mr), nepali (ne), norwegian (no), persian (fa), polish (pl), portuguese (pt),
romanian (ro), russian (ru), serbian (sr), slovak (sk), slovenian (sl), spanish (es),
swedish (sv), thai (th), turkish (tr), ukrainian (uk), urdu (ur), vietnamese (vi)
The audio files can be in any common audio (.wav, .mp3, .aiff, etc.) or video (.mp4, .mov, etc.) format.
The speech2text app writes result files to a subfolder results/ next to each audio file.
@@ -26,19 +36,6 @@ Result files in a folder audiofiles/ will be written to folder audiofiles/result
Notification emails will be sent to SPEECH2TEXT_EMAIL. If SPEECH2TEXT_EMAIL is left
unspecified, no notifications are sent.
Supported languages are:
afrikaans, arabic, armenian, azerbaijani, belarusian, bosnian, bulgarian, catalan,
chinese, croatian, czech, danish, dutch, english, estonian, finnish, french, galician,
german, greek, hebrew, hindi, hungarian, icelandic, indonesian, italian, japanese,
kannada, kazakh, korean, latvian, lithuanian, macedonian, malay, marathi, maori, nepali,
norwegian, persian, polish, portuguese, romanian, russian, serbian, slovak, slovenian,
spanish, swahili, swedish, tagalog, tamil, thai, turkish, ukrainian, urdu, vietnamese,
welsh
You can leave the language variable SPEECH2TEXT_LANGUAGE unspecified, in which case
speech2text tries to detect the language automatically. Specifying the language
explicitly is, however, recommended.
EOF
}

77 changes: 77 additions & 0 deletions modules/speech2text/20240408.lua
@@ -0,0 +1,77 @@
help_text = [[
This app does speech2text with diarization.
Example run on a single file:
export [email protected]
export SPEECH2TEXT_LANGUAGE=finnish
speech2text audiofile.mp3
Example run on a folder containing one or more audio files:
export [email protected]
export SPEECH2TEXT_LANGUAGE=finnish
speech2text audiofiles/
Language must be provided from the list of supported languages:
arabic (ar), armenian (hy), bulgarian (bg), catalan (ca), chinese (zh), czech (cs), danish (da),
dutch (nl), english (en), estonian (et), finnish (fi), french (fr), galician (gl), german (de),
greek (el), hebrew (he), hindi (hi), hungarian (hu), icelandic (is), indonesian (id),
italian (it), japanese (ja), kazakh (kk), korean (ko), latvian (lv), lithuanian (lt), malay (ms),
marathi (mr), nepali (ne), norwegian (no), persian (fa), polish (pl), portuguese (pt),
romanian (ro), russian (ru), serbian (sr), slovak (sk), slovenian (sl), spanish (es),
swedish (sv), thai (th), turkish (tr), ukrainian (uk), urdu (ur), vietnamese (vi)
The audio files can be in any common audio (.wav, .mp3, .aiff, etc.) or video (.mp4, .mov, etc.) format.
The speech2text app writes result files to a subfolder results/ next to each audio file.
Result filenames are the audio filename with .txt and .csv extensions. For example, result files
corresponding to audiofile.mp3 are written to results/audiofile.txt and results/audiofile.csv.
Result files in a folder audiofiles/ will be written to folder audiofiles/results/.
Notification emails will be sent to SPEECH2TEXT_EMAIL. If SPEECH2TEXT_EMAIL is left
unspecified, no notifications are sent.
]]

local version = "20240408"
whatis("Name : Aalto speech2text")
whatis("Version :" .. version)
help(help_text)

local speech2text = "/share/apps/manual_installations/speech2text/" .. version .. "/bin/"
local conda_env = "/share/apps/manual_installations/speech2text/" .. version .. "/env/bin/"

prepend_path("PATH", speech2text)
prepend_path("PATH", conda_env)

-- Shared, pre-populated model caches (all models are downloaded beforehand)
local hf_home = "/scratch/shareddata/dldata/huggingface-hub-cache/"
local pyannote_cache = hf_home .. "hub/"
local torch_home = "/scratch/shareddata/speech2text"
local pyannote_config = "/share/apps/manual_installations/speech2text/" .. version .. "/pyannote/config.yml"
local numba_cache = "/tmp"
local mplconfigdir = "/tmp"

pushenv("HF_HOME", hf_home)
pushenv("PYANNOTE_CACHE", pyannote_cache)
pushenv("TORCH_HOME", torch_home)
pushenv("XDG_CACHE_HOME", torch_home)
pushenv("PYANNOTE_CONFIG", pyannote_config)
pushenv("NUMBA_CACHE_DIR", numba_cache)
pushenv("MPLCONFIGDIR", mplconfigdir)

-- Default resource requests for speech2text batch jobs
local speech2text_mem = "8G"
local speech2text_cpus_per_task = "6"
local speech2text_tmp = os.getenv("WRKDIR") .. "/.speech2text"

pushenv("SPEECH2TEXT_MEM", speech2text_mem)
pushenv("SPEECH2TEXT_CPUS_PER_TASK", speech2text_cpus_per_task)
pushenv("SPEECH2TEXT_TMP", speech2text_tmp)

pushenv("HF_HUB_OFFLINE", "1")

if mode() == "load" then
LmodMessage("For more information, run 'module spider speech2text/" .. version .. "'")
end

45 changes: 33 additions & 12 deletions src/settings.py
@@ -1,14 +1,11 @@
# Supported languages

supported_languages = {
"afrikaans": "af",
"arabic": "ar",
"armenian": "hy",
"azerbaijani": "az",
"belarusian": "be",
"bosnian": "bs",
"bulgarian": "bg",
"catalan": "ca",
"chinese": "zh",
"croatian": "hr",
"czech": "cs",
"danish": "da",
"dutch": "nl",
@@ -26,15 +23,12 @@
"indonesian": "id",
"italian": "it",
"japanese": "ja",
"kannada": "kn",
"kazakh": "kk",
"korean": "ko",
"latvian": "lv",
"lithuanian": "lt",
"macedonian": "mk",
"malay": "ms",
"marathi": "mr",
"maori": "mi",
"nepali": "ne",
"norwegian": "no",
"persian": "fa",
@@ -46,19 +40,46 @@
"slovak": "sk",
"slovenian": "sl",
"spanish": "es",
"swahili": "sw",
"swedish": "sv",
"tagalog": "tl",
"tamil": "ta",
"thai": "th",
"turkish": "tr",
"ukrainian": "uk",
"urdu": "ur",
"vietnamese": "vi",
"welsh": "cy",
}

supported_languages_reverse = {value: key for key, value in supported_languages.items()}

supported_languages_pretty = ", ".join(
[f"{lang} ({short})" for lang, short in supported_languages.items()]
)


# Wav2Vec models

wav2vec_models = {
"hy": "infinitejoy/wav2vec2-large-xls-r-300m-armenian",
"bg": "infinitejoy/wav2vec2-large-xls-r-300m-bulgarian",
"et": "anton-l/wav2vec2-large-xlsr-53-estonian",
"gl": "infinitejoy/wav2vec2-large-xls-r-300m-galician",
"is": "language-and-voice-lab/wav2vec2-large-xlsr-53-icelandic-ep30-967h",
"id": "indonesian-nlp/wav2vec2-large-xlsr-indonesian",
"kk": "aismlv/wav2vec2-large-xlsr-kazakh",
"lv": "infinitejoy/wav2vec2-large-xls-r-300m-latvian",
"lt": "DeividasM/wav2vec2-large-xlsr-53-lithuanian",
"ms": "gvs/wav2vec2-large-xlsr-malayalam",
"mr": "infinitejoy/wav2vec2-large-xls-r-300m-marathi-cv8",
"ne": "Harveenchadha/vakyansh-wav2vec2-nepali-nem-130",
"ro": "anton-l/wav2vec2-large-xlsr-53-romanian",
"sr": "dnikolic/wav2vec2-xlsr-530-serbian-colab",
"sk": "infinitejoy/wav2vec2-large-xls-r-300m-slovak",
"sl": "infinitejoy/wav2vec2-large-xls-r-300m-slovenian",
"sv": "KBLab/wav2vec2-large-xlsr-53-swedish",
"th": "sakares/wav2vec2-large-xlsr-thai-demo",
}
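# Illustrative (hypothetical) consumption of the mapping above: languages
# present here would use their fine-tuned wav2vec checkpoint, while other
# languages would fall back to a default alignment model, e.g.:
#
#   model_name = wav2vec_models.get(language_code)  # None -> default aligner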

# Whisper models

available_whisper_models = ["large-v2", "large-v3"]
default_whisper_model = "large-v3"
compute_device = "cuda"
