
Update extract_embedding.py #519

Open
wants to merge 1 commit into main
Conversation

MADHUMITHASIVAKUMARR

extract_embedding.py

Description:
This script extracts audio embeddings from a set of audio files using a pre-trained ONNX model. Each audio file is converted into a feature representation, which is then fed into the model to obtain an embedding. The script supports multi-threading for efficient processing of many audio files.

Key Features:

  • Loads audio files specified in a wav.scp file and maps them to speaker identifiers from a corresponding utt2spk file.
  • Resamples audio to 16 kHz if it is not already at that sample rate.
  • Computes Mel-frequency filterbank features using torchaudio's Kaldi-compatible implementation (torchaudio.compliance.kaldi).
  • Uses ONNX Runtime to run inference on the audio features, generating embeddings.
  • Saves the resulting embeddings to specified files: utt2embedding.pt for individual utterance embeddings and spk2embedding.pt for averaged speaker embeddings.

Usage:

python extract_embedding.py --dir <directory_path> --onnx_path <onnx_model_path> [--num_thread <num_threads>]

Arguments:

  • --dir: The directory containing the input files (wav.scp and utt2spk).
  • --onnx_path: The path to the ONNX model file used for generating embeddings.
  • --num_thread: (Optional) The number of threads to use for parallel processing. Defaults to 8.
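A minimal argument parser matching the flags above might look like this (flag names come from the usage line; only the `--num_thread` default of 8 is stated in the description, the rest is an assumption):

```python
import argparse

def get_args(argv=None):
    """Build a parser for the three flags described above."""
    parser = argparse.ArgumentParser(
        description="Extract audio embeddings with a pre-trained ONNX model.")
    parser.add_argument("--dir", required=True,
                        help="directory containing wav.scp and utt2spk")
    parser.add_argument("--onnx_path", required=True,
                        help="path to the ONNX embedding model")
    parser.add_argument("--num_thread", type=int, default=8,
                        help="number of threads for parallel processing")
    return parser.parse_args(argv)
```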

Dependencies:

  • torch: For handling tensors and saving embeddings.
  • torchaudio: For loading audio files and processing them.
  • onnxruntime: For running the ONNX model.
  • torchaudio.compliance.kaldi: For extracting Mel-frequency features.
  • tqdm: For displaying progress during processing.

Output:
The script generates two files in the specified directory:

  • utt2embedding.pt: a torch-saved mapping from each utterance ID to its embedding.
  • spk2embedding.pt: a torch-saved mapping from each speaker ID to the average of that speaker's utterance embeddings.
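The speaker-level file is the per-speaker mean of the utterance embeddings. A dependency-free sketch of that aggregation step (plain lists stand in for tensors; in the script itself the values would be torch tensors):

```python
from collections import defaultdict

def average_by_speaker(utt2embedding, utt2spk):
    """Group utterance embeddings by speaker and average them element-wise."""
    grouped = defaultdict(list)
    for utt, emb in utt2embedding.items():
        grouped[utt2spk[utt]].append(emb)
    spk2embedding = {}
    for spk, embs in grouped.items():
        n = len(embs)
        # element-wise mean across this speaker's utterance embeddings
        spk2embedding[spk] = [sum(vals) / n for vals in zip(*embs)]
    return spk2embedding
```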
