How should I make it work on streaming audio? #1801

ywangwxd · 2024-11-29T07:37:28Z

Hi, I want to make this diarizing pipeline work on streaming audio.

Here is what I have done so far. I hope someone can suggest ways to improve it.

1. Collect the streaming audio continuously as short clips, e.g., one clip per minute.
2. Process each clip with the pipeline in this repo, which outputs speaker labels and speaker embeddings.
3. Process the clips incrementally. When processing the current clip, we assume we already have a history embedding for each known speaker, so we need to merge the current clip's speaker labels into the history and update the history embeddings.
4. The critical issue is how to merge the diarization results of the current clip into the history embeddings. There is a hidden requirement: speaker labels already in the history cannot change. So I cannot run a classical clustering algorithm on the merged embeddings, since it cannot guarantee that the old labels stay consistent.
5. What I do instead is compute a distance matrix between the current speaker embeddings and the history embeddings. Each speaker in the current clip is assigned to a speaker in the history if the distance is below a threshold. Otherwise, I add a new embedding to the history and create a new speaker label (e.g., SPEAKER_N+1).
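
The matching step described above could be sketched roughly as follows. This is only an illustration under several assumptions that are not stated in the post: cosine distance as the measure, a running mean as the history update, a greedy one-to-one assignment within each clip, and a threshold of 0.3 chosen arbitrarily. The names `SpeakerHistory`, `assign`, and `cosine_distance` are hypothetical, and the embeddings are plain lists of floats rather than the pipeline's actual output.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two non-zero vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class SpeakerHistory:
    """Hypothetical store of one running-mean embedding per global speaker."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed value; needs tuning on real data
        self.centroids = []         # history embedding per global speaker
        self.counts = []            # number of clips merged into each centroid

    def assign(self, clip_embeddings):
        """Map each local label of the current clip to a stable global label.

        clip_embeddings: dict of local label -> embedding (list of floats).
        Existing global labels are never renamed; unmatched speakers are appended.
        """
        mapping = {}
        used = set()  # a history speaker can match at most one speaker per clip
        for local_label, emb in clip_embeddings.items():
            best, best_dist = None, self.threshold
            for idx, centroid in enumerate(self.centroids):
                if idx in used:
                    continue
                dist = cosine_distance(emb, centroid)
                if dist < best_dist:
                    best, best_dist = idx, dist
            if best is None:
                # No history speaker is close enough: create a new one.
                self.centroids.append(list(emb))
                self.counts.append(1)
                best = len(self.centroids) - 1
            else:
                # Update the history embedding as a running mean.
                n = self.counts[best]
                self.centroids[best] = [
                    (c * n + e) / (n + 1) for c, e in zip(self.centroids[best], emb)
                ]
                self.counts[best] = n + 1
            used.add(best)
            mapping[local_label] = f"SPEAKER_{best:02d}"
        return mapping


history = SpeakerHistory(threshold=0.3)
first = history.assign({"A": [1.0, 0.0], "B": [0.0, 1.0]})
second = history.assign({"X": [0.1, 0.9], "Y": [-1.0, 0.0]})
print(first)
print(second)
```

The greedy loop depends on the order in which the clip's speakers are visited. An optimal one-to-one assignment over the full distance matrix (for example the Hungarian algorithm) would be a natural alternative if that ordering turns out to matter.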

This pipeline works well. The only open issue is choosing a proper distance measure and threshold.

hbredin · 2024-11-29T09:12:46Z

You might be interested in https://github.com/juanmc2005/diart, which already does streaming on top of pyannote.
