
Study diarization: separating the participants in the conversation #1

Open
MatMercer opened this issue Jul 6, 2024 · 0 comments

Comments

Collaborator

MatMercer commented Jul 6, 2024

Research

We have a few possibilities:

Plans

Use 2 models

https://github.com/MahmoudAshraf97/whisper-diarization

This one in particular uses 2 NeMo models, plus Whisper.

https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/speaker_diarization/intro.html

There is one limitation: "currently it cannot handle 2 people speaking at the same time; one way to improve this would be to create 2 audio tracks to isolate the participants using another model, but that greatly increases processing cost".
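The core of the 2-model approach is aligning the two outputs: the diarization model produces speaker turns, Whisper produces timed transcript segments, and each segment gets the speaker whose turn overlaps it the most. A minimal sketch of that alignment step (the data layout and speaker names here are made up for illustration; real pipelines like whisper-diarization do this with word-level timestamps):

```python
# Hypothetical alignment step: label Whisper segments with diarization turns.
# Times are in seconds; 'UNKNOWN' is used when no turn overlaps a segment.

def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """segments: [(start, end, text)]; turns: [(start, end, speaker)].
    Returns [(start, end, speaker, text)], picking the best-overlap speaker."""
    labeled = []
    for s_start, s_end, text in segments:
        best, best_ov = "UNKNOWN", 0.0
        for t_start, t_end, spk in turns:
            ov = overlap(s_start, s_end, t_start, t_end)
            if ov > best_ov:
                best, best_ov = spk, ov
        labeled.append((s_start, s_end, best, text))
    return labeled

segments = [(0.0, 2.0, "hi there"), (2.1, 4.0, "hello!")]
turns = [(0.0, 2.05, "SPEAKER_00"), (2.05, 4.5, "SPEAKER_01")]
print(assign_speakers(segments, turns))
```

Note that this best-overlap rule is exactly where the limitation quoted above bites: when two people talk over each other, a segment overlaps both turns and only one speaker can win.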

Use "Insanely fast whisper", which apparently supports diarization

https://github.com/Vaibhavs10/insanely-fast-whisper

```
--diarization_model DIARIZATION_MODEL
                      Name of the pretrained model/ checkpoint to perform diarization. (default: pyannote/speaker-diarization)
--num-speakers NUM_SPEAKERS
                      Specifies the exact number of speakers present in the audio file. Useful when the exact number of participants in the conversation is known. Must be at least 1. Cannot be used together with --min-speakers or --max-speakers. (default: None)
--min-speakers MIN_SPEAKERS
                      Sets the minimum number of speakers that the system should consider during diarization. Must be at least 1. Cannot be used together with --num-speakers. Must be less than or equal to --max-speakers if both are specified. (default: None)
--max-speakers MAX_SPEAKERS
                      Defines the maximum number of speakers that the system should consider in diarization. Must be at least 1. Cannot be used together with --num-speakers. Must be greater than or equal to --min-speakers if both are specified. (default: None)
```

Separate the 2 audio tracks, then merge the subtitles

There may be algorithms that allow separating the voices; it is a very common problem for people working with audio, even pre-AI.
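Once the voices are isolated onto two tracks and each track is transcribed separately, the merge step itself is simple: tag each cue with its track's speaker and interleave by start time. A toy sketch (cue layout and speaker names are made up for illustration):

```python
# Toy sketch of the "separate tracks, then merge subtitles" idea.
# Each track is a list of cues (start_seconds, end_seconds, text).

def merge_tracks(track_a, track_b, name_a="SPEAKER_A", name_b="SPEAKER_B"):
    """Merge two per-speaker cue lists into one chronological, labeled list."""
    tagged = [(s, e, name_a, t) for s, e, t in track_a]
    tagged += [(s, e, name_b, t) for s, e, t in track_b]
    return sorted(tagged, key=lambda cue: cue[0])

a = [(0.0, 1.5, "how are you?"), (3.0, 4.0, "great")]
b = [(1.6, 2.9, "fine, and you?")]
print(merge_tracks(a, b))
```

Unlike the single-track approach, overlapping speech survives this merge naturally: two cues can simply coexist with overlapping time ranges, one per track.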

WhisperX

https://github.com/m-bain/whisperX

pyannote

https://github.com/pyannote/pyannote-audio

High error rate, around 25%.
