README_Phase_1.md

README: Speech Detection


Overview

This notebook implements a pipeline that prepares speech datasets for speech detection and machine translation work. It converts, denoises, and segments audio data, then transcribes it with the OpenAI Whisper model for speech-to-text processing.


Objectives

  1. Audio Preprocessing: Convert formats, denoise, and segment audio files for analysis.
  2. Speech-to-Text Transcription: Transcribe audio using the Whisper model.
  3. Data Cleaning: Remove irrelevant or placeholder elements in transcript data.
  4. Segmentation: Divide audio files into manageable chunks to meet model requirements.
  5. Organization: Create structured outputs for efficient storage and retrieval.

Workflow

1. Audio Preprocessing

  • Format Conversion: Audio files in FLAC format are converted to WAV using the pydub library.
  • Denoising: Silent intervals and non-human content are identified and removed to improve audio quality.
  • Storage: Processed audio files are saved in a designated folder for further analysis.
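
The conversion and silence-removal steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the function names, the output folder, and the silence thresholds (`min_silence_ms`, `silence_thresh_db`) are assumptions, and only silent intervals are handled here, not other non-human content. pydub needs ffmpeg available on the system to decode FLAC.

```python
from pathlib import Path


def wav_path_for(flac_path, out_dir):
    """Return the output path: same file stem, .wav extension, inside out_dir."""
    return Path(out_dir) / (Path(flac_path).stem + ".wav")


def convert_and_trim(flac_path, out_dir, min_silence_ms=700, silence_thresh_db=-40):
    """Convert one FLAC file to WAV and drop its silent intervals.

    The threshold defaults are illustrative assumptions and should be tuned
    to the recordings at hand.
    """
    # Imported lazily so the path helper above works without pydub installed.
    from pydub import AudioSegment
    from pydub.silence import detect_nonsilent

    audio = AudioSegment.from_file(str(flac_path), format="flac")

    # detect_nonsilent returns [start_ms, end_ms] pairs for the audible regions.
    ranges = detect_nonsilent(
        audio, min_silence_len=min_silence_ms, silence_thresh=silence_thresh_db
    )

    # Concatenate the audible regions; keep the original audio if none are found.
    trimmed = AudioSegment.empty()
    for start_ms, end_ms in ranges:
        trimmed += audio[start_ms:end_ms]
    if not ranges:
        trimmed = audio

    out_path = wav_path_for(flac_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    trimmed.export(str(out_path), format="wav")
    return out_path
```

Calling `convert_and_trim("raw/sample.flac", "processed")` would write `processed/sample.wav`; both paths are hypothetical.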

2. Speech-to-Text Transcription

  • The OpenAI Whisper model (whisper-small) is used for transcription.
  • Audio files are segmented into chunks of at most 30 seconds, because Whisper processes audio in 30-second windows.
  • Transcriptions are generated and saved in text format for downstream applications.
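
One way to write the transcription step is sketched below. It assumes the model is loaded through the Hugging Face transformers pipeline with the openai/whisper-small checkpoint; the notebook may load Whisper differently. The function names and the one-text-file-per-chunk layout are also assumptions. The transcriber is passed in as a plain callable so the file handling can be exercised without downloading the model.

```python
from pathlib import Path


def load_whisper_small():
    """Build a transcriber backed by the openai/whisper-small checkpoint."""
    # Imported lazily: transformers (and a backend such as PyTorch) are only
    # needed when real transcription is requested.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
    return lambda wav_path: asr(str(wav_path))["text"]


def transcribe_folder(chunk_dir, out_dir, transcribe_fn):
    """Transcribe every .wav file in chunk_dir and write one .txt file per chunk."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Sorted so that transcripts are produced in chunk order.
    for wav_path in sorted(Path(chunk_dir).glob("*.wav")):
        text = transcribe_fn(wav_path).strip()
        txt_path = out_dir / (wav_path.stem + ".txt")
        txt_path.write_text(text + "\n", encoding="utf-8")
        written.append(txt_path)
    return written
```

A typical call would be `transcribe_folder("Chunks_audio/sample", "transcripts/sample", load_whisper_small())`, where the transcripts folder is a hypothetical name.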

3. Data Cleaning

  • Placeholder elements (e.g., <noise>, <cough>) in transcripts are identified using regular expressions.
  • The placeholders are then removed, leaving cleaned transcripts that contain only the spoken content.
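
A small sketch of the cleaning step, assuming every placeholder is a tag wrapped in angle brackets like the <noise> and <cough> examples above. The exact pattern used in the notebook is not shown, so the regular expression and the function name here are illustrative.

```python
import re

# Matches angle-bracket placeholders such as <noise> or <cough>.
PLACEHOLDER = re.compile(r"<[^<>]+>")


def clean_transcript(text):
    """Remove placeholder tags and collapse the whitespace they leave behind."""
    without_tags = PLACEHOLDER.sub(" ", text)
    return re.sub(r"\s+", " ", without_tags).strip()


print(clean_transcript("hello <noise> world <cough>"))  # prints: hello world
```

Replacing each tag with a space before collapsing whitespace keeps the neighbouring words from being joined when a tag sits directly between them.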

4. Audio Segmentation

  • Timestamp-Based Segmentation: Long audio files are divided into segments based on predefined timestamps.
  • Dynamic Adjustments: Segment boundaries are adjusted so that no segment exceeds Whisper's 30-second limit.
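
The two segmentation rules above can be combined in one pass. The sketch below assumes the predefined timestamps are available as (start, end) pairs in seconds; that representation and the function name are assumptions, since the notebook's timestamp format is not described here.

```python
MAX_CHUNK_S = 30.0  # upper bound on the length of a single chunk, in seconds


def split_segments(timestamps, max_len_s=MAX_CHUNK_S):
    """Split (start_s, end_s) pairs so that no piece is longer than max_len_s."""
    pieces = []
    for start_s, end_s in timestamps:
        if end_s <= start_s:
            raise ValueError(f"invalid segment: ({start_s}, {end_s})")
        cursor = start_s
        # Cut full-length pieces until the remainder fits within the limit.
        while end_s - cursor > max_len_s:
            pieces.append((cursor, cursor + max_len_s))
            cursor += max_len_s
        pieces.append((cursor, end_s))
    return pieces


print(split_segments([(0.0, 12.5), (20.0, 95.0)]))
# prints: [(0.0, 12.5), (20.0, 50.0), (50.0, 80.0), (80.0, 95.0)]
```

With pydub, each resulting pair can be cut from a loaded file by slicing in milliseconds, for example `audio[int(start_s * 1000):int(end_s * 1000)]`.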

5. Organization and Storage

  • A separate folder is created for each audio file inside the Chunks_audio directory and holds that file's segments.
  • Segmented audio files are organized systematically, enabling efficient retrieval and processing.
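
The folder layout described above might be produced like this. Only the Chunks_audio directory name comes from this README; the function name and the `<stem>_chunk_NNN.wav` file naming are hypothetical choices made for the example.

```python
from pathlib import Path


def chunk_paths(audio_path, n_chunks, root="Chunks_audio"):
    """Create <root>/<audio stem>/ and return one target path per chunk."""
    folder = Path(root) / Path(audio_path).stem
    folder.mkdir(parents=True, exist_ok=True)
    # Zero-padded indices keep the chunks in order when listed alphabetically.
    return [folder / f"{folder.name}_chunk_{i:03d}.wav" for i in range(n_chunks)]
```

For a file named `interview_01.wav` split into three segments, this yields `Chunks_audio/interview_01/interview_01_chunk_000.wav` through `..._chunk_002.wav`.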

Outputs

  1. Denoised Audio Files: Processed audio files with irrelevant sections removed.
  2. Transcriptions: High-quality text outputs generated by the Whisper model, saved for further use.
  3. Organized Folders: Separate directories for each audio file with its respective segments.
  4. Processed Audio Chunks: Timestamp-based audio segments prepared for transcription.
  5. Cleaned Transcripts: Refined text data free of placeholders and irrelevant markers.

Key Features

  • Preprocessing: Comprehensive steps for audio cleaning, denoising, and format conversion.
  • Whisper Integration: Uses the OpenAI Whisper model (whisper-small) for transcription.
  • Dynamic Segmentation: Ensures compliance with model constraints through timestamp-driven chunking.
  • Systematic Organization: Outputs are organized into structured directories for seamless workflow integration.

Limitations

  1. Static Timestamps: Current segmentation relies on predefined timestamps, so segment boundaries do not adapt to the content of each recording.
  2. Processing Constraints: Whisper's 30-second limit necessitates segmentation, which can split words or sentences at chunk boundaries and cause discontinuities in the transcript.
  3. Computational Requirements: Audio processing and transcription require significant computational resources.

Future Directions

  1. Dynamic Segmentation: Introduce silence detection or audio content analysis for adaptive segmentation.
  2. Parallel Processing: Implement parallelized workflows to handle large datasets efficiently.
  3. End-to-End Integration: Develop a seamless pipeline for transcription, translation, and evaluation.
  4. Advanced Noise Filtering: Improve denoising techniques to enhance audio quality further.
  5. Custom Model Training: Fine-tune Whisper or other models for domain-specific transcription tasks.

Summary

This pipeline provides a foundation for speech data analysis and transcription, and prepares data for later translation work. It combines systematic preprocessing, Whisper-based transcription, and organized data handling. The modular structure allows individual steps to be customized and scaled for applications in machine translation and speech detection.