TTSAudioNormalizer is a TTS audio preprocessing tool that analyzes audio and standardizes it for training. It aims to improve the quality of TTS training data and keep audio features consistent across samples.
- Unified volume levels help models focus on learning speech features rather than being distracted by volume differences
- Standardized data helps models converge faster, reducing training time
- Reduces the risk of models learning incorrect features
- Avoid exploding or vanishing gradients caused by large volume differences
- Reduce the possibility of model overfitting to volume features
- Improve training process stability
- Help models focus on learning essential speech features
- Improve model adaptability in different scenarios
- Reduce dependency on non-critical features
- Optimize frequency response, emphasize key speech frequency bands
- Enhance consonant clarity, improve speech intelligibility
- Maintain vowel naturalness, preserve voice characteristics
- Remove background noise, improve speech purity
- Compress dynamic range, balance volume levels
- Filter useless frequency bands, reduce interference factors
- Unify sampling rate, ensure data quality
- Standardize channel settings, simplify processing flow
- Standardize audio format, improve compatibility
- Improve feature extraction accuracy and reliability
- Enhance comparability between different samples
- Ensure training data quality consistency
- Format Unification
- Convert different audio formats (e.g., to WAV)
- Ensure format compatibility
- Sample Rate Unification
- Standardize sampling rate (e.g., 22050Hz)
- Maintain data consistency
- Mono Channel Conversion
- Convert multi-channel audio to mono
- Simplify subsequent processing
- DC Offset Removal
- Eliminate fixed offset in audio signals
- Improve audio quality
- Volume Normalization
- Unify audio volume levels
- Ensure loudness consistency
- Frequency Response Optimization
- Adjust frequency characteristics
- Optimize audio performance
- Silence Removal
- Clean up invalid audio segments
- Enhance data quality
- Noise Reduction
- Eliminate background noise
- Improve audio clarity
- Dynamic Range Compression
- Balance audio dynamic range
- Enhance overall performance
- Quality Validation
- Check processed audio quality
- Ensure training requirements are met
- Feature Validation
- Verify audio feature parameters
- Guarantee effective feature extraction
Processing Flow Diagram:
Input Audio
➡️ Basic Preprocessing
➡️ Quality Optimization
➡️ Noise Processing
➡️ Quality Check
➡️ Output Audio
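As an illustration of three of the basic steps above (mono conversion, DC offset removal, volume normalization), here is a minimal sketch in plain Python. The function names are hypothetical and are not part of TTSAudioNormalizer, which is described as relying on sox; the sketch only shows what each step does to the samples, assuming non-empty input with float samples in the range -1.0 to 1.0.

```python
import math


def to_mono(frames):
    """Average the channels of each frame (frames: list of per-frame channel tuples)."""
    return [sum(frame) / len(frame) for frame in frames]


def remove_dc_offset(samples):
    """Subtract the mean so the waveform is centered on zero."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def peak_normalize(samples, target_db=-3.0):
    """Scale the signal so its largest absolute sample sits at target_db dBFS."""
    peak = max(abs(s) for s in samples)
    if peak == 0.0:
        return list(samples)  # silent input: nothing to scale
    gain = 10.0 ** (target_db / 20.0) / peak
    return [s * gain for s in samples]


# Synthetic stereo test tone: 220 Hz sine with a constant 0.1 DC offset
sample_rate = 22050
tone = [
    0.2 * math.sin(2.0 * math.pi * 220.0 * n / sample_rate) + 0.1
    for n in range(sample_rate)
]
stereo = [(s, s) for s in tone]

processed = peak_normalize(remove_dc_offset(to_mono(stereo)))
```

After these three steps the signal is single-channel, centered on zero, and peaks at the linear equivalent of -3 dB.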
Important Notes:
- Maintain processing logs for each step
- Perform quality checks at key points
- Keep original audio backups
- Adjust parameters based on specific application scenarios
- Generate detailed loudness statistics report
- Provide volume distribution visualization
- Output parameter optimization suggestions
from audio_analyzer import AudioAnalyzer

analyzer = AudioAnalyzer()
results = analyzer.analyze_speaker_directory(
    base_dir="raw_voices",  # Nested layout: a main folder containing several subfolders of audio files
    output_dir="analysis_report",
    max_workers=16,
)
Found 49 speaker directories
Processing speakers: 0%| | 0/49 [00:00<?, ?it/s]
Analyzing speaker: 廉颇
Analyzing audio: 0%| | 0/118 [00:00<?, ?it/s]
Analyzing audio: 25%|██▌ | 30/118 [00:00<00:00, 289.97it/s]
Analyzing audio: 53%|█████▎ | 62/118 [00:00<00:00, 299.46it/s]
Analyzing audio: 78%|███████▊ | 92/118 [00:00<00:00, 298.95it/s]
Audio analysis report, speaker: 廉颇:
--------------------------------------------------
Total audio files analyzed: 118
Volume statistics:
Mean Norm:
mean: 0.053
std: 0.010
min: 0.032
max: 0.082
RMS Amplitude:
mean: 0.089
std: 0.015
min: 0.057
max: 0.131
Max Amplitude:
mean: 0.546
std: 0.123
min: 0.293
max: 0.882
Processing speakers: 2%|▏ | 1/49 [00:01<01:03, 1.31s/it]
Recommended target_db values:
1. Conservative setting (preserve dynamic range): target_db = 0.053
2. Balanced setting (ensure clarity): target_db = 0.063
3. Safe setting: target_db = -3.000
Analysis results saved to: raw_voices/音频分析报告/廉颇
Analyzing speaker: 小乔
Analyzing audio: 0%| | 0/201 [00:00<?, ?it/s]
Analyzing audio: 14%|█▍ | 28/201 [00:00<00:00, 268.48it/s]
Analyzing audio: 29%|██▉ | 58/201 [00:00<00:00, 283.83it/s]
Analyzing audio: 43%|████▎ | 87/201 [00:00<00:00, 281.59it/s]
Analyzing audio: 60%|█████▉ | 120/201 [00:00<00:00, 297.76it/s]
Analyzing audio: 75%|███████▍ | 150/201 [00:00<00:00, 294.95it/s]
Analyzing audio: 90%|████████▉ | 180/201 [00:00<00:00, 289.50it/s]
Audio analysis report, speaker: 小乔:
--------------------------------------------------
Total audio files analyzed: 201
Volume statistics:
Mean Norm:
mean: 0.052
std: 0.019
min: 0.012
max: 0.135
RMS Amplitude:
mean: 0.086
std: 0.030
min: 0.024
max: 0.209
Max Amplitude:
mean: 0.495
std: 0.143
min: 0.163
max: 0.943
Processing speakers: 4%|▍ | 2/49 [00:02<01:09, 1.49s/it]
Recommended target_db values:
1. Conservative setting (preserve dynamic range): target_db = 0.052
2. Balanced setting (ensure clarity): target_db = 0.071
3. Safe setting: target_db = -3.000
Analysis results saved to: raw_voices/音频分析报告/小乔
Analyzing speaker: 赵云
Analyzing audio: 0%| | 0/142 [00:00<?, ?it/s]
Analyzing audio: 20%|█▉ | 28/142 [00:00<00:00, 270.67it/s]
Analyzing audio: 42%|████▏ | 60/142 [00:00<00:00, 294.19it/s]
Analyzing audio: 63%|██████▎ | 90/142 [00:00<00:00, 291.33it/s]
Analyzing audio: 85%|████████▍ | 120/142 [00:00<00:00, 283.42it/s]
Audio analysis report, speaker: 赵云:
--------------------------------------------------
Total audio files analyzed: 142
Volume statistics:
Mean Norm:
mean: 0.050
std: 0.019
min: 0.018
max: 0.124
RMS Amplitude:
mean: 0.089
std: 0.031
min: 0.039
max: 0.193
Max Amplitude:
mean: 0.603
std: 0.182
min: 0.339
max: 1.000
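Processing speakers: 6%|▌ | 3/49 [00:04<01:06, 1.45s/it]
Recommended target_db values:
1. Conservative setting (preserve dynamic range): target_db = 0.050
2. Balanced setting (ensure clarity): target_db = 0.070
3. Safe setting: target_db = -3.000
Analysis results saved to: raw_voices/音频分析报告/赵云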
...
- Practical Significance:
- Reflects overall loudness level of audio
- Represents average absolute amplitude of audio signal
- Value range typically between 0-1
- Value Meaning:
- Higher value = Louder overall perception
- Lower value = Softer overall perception
- Ideal range typically between 0.1-0.3
- Application Scenarios:
- Used to evaluate if overall loudness is appropriate
- Helps determine if volume gain is needed
- Practical Significance:
- Reflects effective energy level of audio
- Closer to human ear's perception of loudness
- Considers energy distribution over time
- Value Meaning:
- Higher value = Stronger audio energy
- Lower value = Weaker audio energy
- Professional audio typically recommended between 0.1-0.4
- Application Scenarios:
- Evaluate audio dynamic range
- Determine if audio needs compression or expansion
- Commonly used in audio normalization
- Practical Significance:
- Reflects peak levels in audio
- Represents maximum instantaneous value of signal
- Used to determine if clipping exists
- Value Meaning:
- 1.0 = Maximum possible value for digital audio (potential clipping)
- Recommended peak control below 0.9
- Too low (e.g., <0.5) indicates audio might be too soft
- Application Scenarios:
- Detect audio distortion
- Evaluate audio headroom
- Guide limiter settings
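- Hierarchical Relationship:
- Max Amplitude ≥ RMS Amplitude ≥ Mean Norm
- This ordering follows from how each value is calculated: the peak can never be smaller than the root mean square, which in turn can never be smaller than the mean absolute value
- Practical Application: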
- Mean Norm: Used for overall volume assessment
- RMS: Used for energy level control
- Max Amplitude: Used for peak control
- Professional Audio Production Reference Values:
- Mean Norm: 0.1-0.3
- RMS: 0.1-0.4
- Max Amplitude: 0.8-0.9
- First check Max Amplitude to avoid clipping
- Use RMS to ensure overall energy is appropriate
- Reference Mean Norm to adjust overall volume
- Consider all three indicators in context of specific application
These indicators work together to help us:
- Ensure audio quality
- Maintain volume consistency
- Avoid distortion and noise
- Optimize listening experience
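To make the three indicators concrete, the sketch below computes them directly from a list of samples, assuming float samples in the range -1.0 to 1.0 and taking Mean Norm as the average absolute amplitude described above. The function name `volume_stats` is hypothetical; the analyzer's own implementation is not shown in this document and may differ.

```python
import math


def volume_stats(samples):
    """Return Mean Norm, RMS Amplitude and Max Amplitude for a non-empty sample list."""
    n = len(samples)
    mean_norm = sum(abs(s) for s in samples) / n       # average absolute amplitude
    rms = math.sqrt(sum(s * s for s in samples) / n)   # root mean square (energy level)
    max_amp = max(abs(s) for s in samples)             # peak level
    return {
        "mean_norm": mean_norm,
        "rms_amplitude": rms,
        "max_amplitude": max_amp,
    }


# One second of a 100 Hz sine wave with peak 0.5, sampled at 8000 Hz
sine = [0.5 * math.sin(2.0 * math.pi * 100.0 * n / 8000) for n in range(8000)]
stats = volume_stats(sine)
```

For this sine wave the three values come out near 0.32, 0.35 and 0.5, which matches the ordering noted above: the peak is the largest value and Mean Norm the smallest.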
Key features of this solution:
- Uses sox's norm effect for audio normalization
- Can process single files or batch process entire directories
- Defaults to normalizing volume to -3dB, adjustable as needed
- Maintains original audio quality, only adjusts volume
Usage is simple:
- For single file: directly call normalize_audio() function
- For entire directory: use batch_normalize_directory() function
The processed audio files should have more uniform volume levels, solving the issue of inconsistent loudness. If the overall volume still feels too low or too high, adjust the target_db parameter.
from tts_audio_normalizer import AudioProcessingParams, TTSAudioNormalizer

# Create the normalizer (assumes the constructor needs no arguments)
normalizer = TTSAudioNormalizer()

# Create parameter object and customize parameters
params = AudioProcessingParams()
params.noise_reduction_strength = 0.8  # Increase noise reduction intensity
params.target_db = -3  # Set target volume

# Process single file
# normalizer.normalize_audio("input.wav", "output.wav", params)

# Batch process directory
normalizer.batch_normalize_directory(
    input_dir="./audio_segments",
    output_dir="./audio_segments_normalized",
    params=params,
    max_workers=4,
)
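Since the solution is described as using sox's norm effect, the volume step on its own can be approximated by calling sox directly. The sketch below is an assumption about how such a call might look, not the tool's internal code: the helper names are hypothetical, and it relies only on the standard `sox INPUT OUTPUT norm LEVEL` command form.

```python
import shutil
import subprocess


def build_sox_norm_command(input_path, output_path, target_db=-3.0):
    """Build the argument list for: sox INPUT OUTPUT norm LEVEL."""
    return ["sox", input_path, output_path, "norm", str(target_db)]


def normalize_with_sox(input_path, output_path, target_db=-3.0):
    """Run sox to normalize one file; raises if sox is missing or exits with an error."""
    if shutil.which("sox") is None:
        raise RuntimeError("sox executable not found on PATH")
    command = build_sox_norm_command(input_path, output_path, target_db)
    subprocess.run(command, check=True)


# Building the command does not require sox to be installed
command = build_sox_norm_command("input.wav", "output.wav", -3.0)
```

Keeping the command construction separate from execution makes it easy to log or inspect exactly what will be run before processing a whole directory.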
# Basic format settings
rate: int = 44100 # Sample rate
channels: int = 1 # Number of channels
output_format: str = 'wav' # Output format
target_db: float = -3.0 # Target volume
# Equalizer settings
equalizer_enabled: bool = True # Enable equalizer
treble_frequency: float = 3000.0 # Treble center (2-8kHz)
mid_frequency: float = 1000.0 # Mid center (250Hz-2kHz)
bass_frequency: float = 100.0 # Bass center (80-250Hz)
# Noise processing
subsonic_filter_enabled: bool = True # Subsonic filtering
compression_ratio: float = 2.5 # Compression ratio
threshold_db: float = -15.0 # Noise threshold
Voice Type | Recommended Parameters |
---|---|
Male | bass_gain=2.0, mid_frequency=1200Hz |
Female | treble_gain=1.5, bass_gain=1.5 |
Child | mid_gain=1.5, bass_gain=1.0 |
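One way to keep the voice-type recommendations above in code is a small preset lookup. This is only a sketch: `EqSettings` is a hypothetical stand-in for the equalizer fields of the real parameter object, and the default gain of 0.0 for unlisted values is an assumption, not something the table states.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class EqSettings:
    """Hypothetical stand-in for the EQ-related parameters."""
    bass_gain: float = 0.0        # assumed default, not taken from the tool
    mid_gain: float = 0.0         # assumed default, not taken from the tool
    treble_gain: float = 0.0      # assumed default, not taken from the tool
    mid_frequency: float = 1000.0


# Values copied from the voice-type table above
VOICE_PRESETS = {
    "male": {"bass_gain": 2.0, "mid_frequency": 1200.0},
    "female": {"treble_gain": 1.5, "bass_gain": 1.5},
    "child": {"mid_gain": 1.5, "bass_gain": 1.0},
}


def settings_for_voice(voice_type, base=EqSettings()):
    """Return a copy of `base` with the preset for `voice_type` applied."""
    try:
        overrides = VOICE_PRESETS[voice_type.lower()]
    except KeyError:
        raise ValueError(f"unknown voice type: {voice_type!r}") from None
    return replace(base, **overrides)


male = settings_for_voice("Male")
```

Because the presets only override the fields they list, any other values set on the base object are left unchanged.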
Compression Level | Parameter Combination |
---|---|
Mild Compression | threshold_db=-20, ratio=2, attack=0.3s |
Medium Compression | threshold_db=-25, ratio=3, attack=0.2s |
Heavy Compression | threshold_db=-30, ratio=4, attack=0.1s |
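To show what the threshold and ratio values above mean, the sketch below evaluates a textbook hard-knee compression curve: levels below the threshold pass through unchanged, and each dB above the threshold is reduced by the ratio. It deliberately ignores attack time, knee shape and make-up gain, and it is not claimed to match how the tool or sox applies compression.

```python
def compress_level_db(input_db, threshold_db, ratio):
    """Static hard-knee compression curve, with all levels in dB."""
    if input_db <= threshold_db:
        return input_db  # below the threshold: unchanged
    return threshold_db + (input_db - threshold_db) / ratio


# (threshold_db, ratio) pairs from the table above
presets = {
    "mild": (-20.0, 2.0),
    "medium": (-25.0, 3.0),
    "heavy": (-30.0, 4.0),
}

# Output level for a -10 dB input under each preset
levels = {
    name: compress_level_db(-10.0, threshold, ratio)
    for name, (threshold, ratio) in presets.items()
}
```

Under this simplified model, the same input level is pushed further down as the preset moves from mild to heavy, because both the lower threshold and the higher ratio increase the amount of gain reduction.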
Sound Quality Goal | Parameter Combination |
---|---|
Voice Enhancement | treble=2.0, bass=1.0 |
Clarity Boost | treble=3.0, bass=-1.0 |
Warm Tone | treble=-1.0, bass=2.0 |
- Audio Feature Protection
- Avoid over-processing leading to distortion
- Maintain phoneme boundary clarity
- Preserve natural speech prosody
- Dataset Adaptation
- Adjust parameters based on speaker characteristics
- Consider recording environment factors
- Maintain processing consistency
- Quality Control
- Regularly check processing effects
- Monitor abnormal samples
- Adjust parameters promptly when problems appear
- Perform audio analysis first
- Select parameters based on analysis report
- Test the processing on a small batch
- Adjust and optimize parameter configuration
- Execute batch normalization processing
- Verify processing result quality
With proper configuration, this tool can significantly improve the quality of TTS training data and give model training a more consistent foundation.