ClearerVoice-Studio: A One-Stop Solution for Speech Enhancement, Speech Denoising, Speech Separation and Speaker Extraction
Introduction
ClearerVoice-Studio is a unified inference platform focusing on Speech Enhancement, Speech Separation, and Audio-Visual Target Speaker Extraction. This tutorial will guide you through using this powerful tool for various audio processing tasks.
Supported Pre-trained Models
The platform currently offers the following pre-trained models:
Speech Enhancement (16kHz & 48kHz)
- MossFormer2_SE_48K
- FRCRN_SE_16K
- MossFormerGAN_SE_16K
Speech Separation (16kHz)
- MossFormer2_SS_16K
Audio-Visual Target Speaker Extraction (16kHz)
- AV_MossFormer2_TSE_16K
All models are hosted on HuggingFace and will be automatically downloaded when needed.
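You normally don't need to download anything by hand: the first time a model is used, its checkpoint is fetched automatically. If you prefer to pre-fetch a checkpoint (for an offline machine, for example), the sketch below uses huggingface_hub once the environment in the next section is installed; the repo ID and local checkpoint directory shown here are assumptions, so check the model card on HuggingFace before relying on them.
# Optional pre-download sketch (assumed repo ID and local checkpoint layout)
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='alibabasglab/MossFormer2_SE_48K',   # assumed HuggingFace repo ID
    local_dir='checkpoints/MossFormer2_SE_48K'   # assumed local checkpoint directory
)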
Environment Setup
1. Clone the Repository
git clone https://github.com/modelscope/ClearerVoice-Studio.git
2. Create Conda Environment
cd ClearerVoice-Studio
conda create -n ClearerVoice-Studio python=3.8
conda activate ClearerVoice-Studio
pip install -r requirements.txt
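You can quickly verify the install with a one-line import check (a minimal sanity check; it assumes the install succeeded and that you run it from the directory containing the clearvoice package):
python -c "from clearvoice import ClearVoice; print('ClearVoice import OK')"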
Usage Tutorial
1. Speech Enhancement Example
from clearvoice import ClearVoice
import os
# Initialize speech enhancement model
cv_se = ClearVoice(
    task='speech_enhancement',
    model_names=['MossFormer2_SE_48K']
)
# Process single audio file
input_path = 'samples/noisy.wav'
output_wav = cv_se(
    input_path=input_path,
    online_write=False
)
# Save enhanced audio
output_dir = 'samples/enhanced'
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, 'enhanced.wav')
cv_se.write(output_wav, output_path=output_path)
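Because model_names is a list (see the parameter description below), you can also load more than one model and compare their outputs on the same file. A minimal sketch, assuming both 16kHz checkpoints are available; the input path and output directory are placeholders:
# Sketch: compare two 16kHz enhancement models on the same recording
cv_compare = ClearVoice(
    task='speech_enhancement',
    model_names=['FRCRN_SE_16K', 'MossFormerGAN_SE_16K']
)
cv_compare(
    input_path='samples/noisy_16k.wav',      # placeholder 16kHz input
    online_write=True,                       # write each model's output to disk
    output_path='samples/enhanced_compare'   # placeholder output directory
)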
2. Speech Separation Example
# Initialize speech separation model
cv_ss = ClearVoice(
    task='speech_separation',
    model_names=['MossFormer2_SS_16K']
)
# Process mixed speech file
input_path = 'samples/mixed.wav'
output_dir = 'samples/separated'
os.makedirs(output_dir, exist_ok=True)
# Separate speech and save automatically
cv_ss(
    input_path=input_path,
    online_write=True,
    output_path=output_dir
)
# Generated files:
# - output_MossFormer2_SS_16K_spk1.wav
# - output_MossFormer2_SS_16K_spk2.wav
3. Target Speaker Extraction Example
# Initialize target speaker extraction model
cv_tse = ClearVoice(
    task='target_speaker_extraction',
    model_names=['AV_MossFormer2_TSE_16K']
)
# Process video file
input_path = 'samples/video.mp4'
output_dir = 'samples/extracted'
os.makedirs(output_dir, exist_ok=True)
# Extract target speaker's voice
cv_tse(
    input_path=input_path,
    online_write=True,
    output_path=output_dir
)
# Generated files:
# - extracted_speech.wav (target speaker's voice)
# - background.wav (background audio)
Batch Processing Example
def process_directory(input_dir, output_dir, task='speech_enhancement'):
    # Initialize model
    cv = ClearVoice(
        task=task,
        model_names=['MossFormer2_SE_48K'] if task == 'speech_enhancement' else
                    ['MossFormer2_SS_16K'] if task == 'speech_separation' else
                    ['AV_MossFormer2_TSE_16K']
    )
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)
    # Get all audio files
    audio_files = [f for f in os.listdir(input_dir) if f.endswith(('.wav', '.mp4', '.avi'))]
    # Batch processing
    for audio_file in audio_files:
        input_path = os.path.join(input_dir, audio_file)
        cv(
            input_path=input_path,
            online_write=True,
            output_path=output_dir
        )
        print(f"Processed: {audio_file}")
# Usage example
process_directory(
    input_dir='samples/input',
    output_dir='samples/output',
    task='speech_enhancement'
)
Advanced Usage: Progress Monitoring
import tqdm
def process_with_progress(input_files, task='speech_enhancement',
                          model_names=['MossFormer2_SE_48K']):
    cv = ClearVoice(task=task, model_names=model_names)
    for file in tqdm.tqdm(input_files, desc=f"Processing {task}"):
        try:
            cv(
                input_path=file,
                online_write=True,
                output_path='samples/output'
            )
        except Exception as e:
            print(f"Error processing {file}: {str(e)}")
            continue
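To drive it, build the file list however you like, for instance with glob (the input directory is a placeholder):
import glob
input_files = sorted(glob.glob('samples/input/*.wav'))
process_with_progress(input_files, task='speech_enhancement')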
Parameter Description
- task: Selects the processing task
  - speech_enhancement: Speech Enhancement
  - speech_separation: Speech Separation
  - target_speaker_extraction: Target Speaker Extraction
- model_names: List of model names; one or more models can be selected
- input_path: Input path; supports a single file, a directory, or a list file (.scp), as shown in the sketch after this list
- online_write: Whether to save results to disk during processing
- output_path: Output path; can be a file or a directory
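An .scp list file is just a plain-text manifest. The sketch below assumes the expected format is one audio path per line (check the repository's own examples if your version differs); the file names and output directory are placeholders:
# Sketch: build a simple .scp manifest (assumed format: one path per line)
wav_files = ['samples/noisy_01.wav', 'samples/noisy_02.wav']  # placeholder paths
scp_path = 'samples/noisy_list.scp'
with open(scp_path, 'w') as f:
    f.write('\n'.join(wav_files) + '\n')
# Feed the manifest to the model and write one result per listed file
cv_se = ClearVoice(task='speech_enhancement', model_names=['MossFormer2_SE_48K'])
cv_se(
    input_path=scp_path,
    online_write=True,
    output_path='samples/enhanced_batch'
)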
Performance Evaluation
VoiceBank+DEMAND Test Set (16kHz) Performance Comparison
| Model | PESQ | STOI | SSNR | P808_MOS |
|---|---|---|---|---|
| Noisy Audio | 1.97 | 0.92 | 6.13 | 3.05 |
| FRCRN_SE_16K | 3.23 | 0.95 | 7.60 | 3.59 |
| MossFormerGAN_SE_16K | 3.47 | 0.96 | 9.09 | 3.57 |
| MossFormer2_SE_48K | 3.16 | 0.95 | 6.86 | 3.53 |
DNS-Challenge-2020 Test Set Performance Comparison
| Model | PESQ | STOI | SSNR | P808_MOS |
|---|---|---|---|---|
| Noisy Audio | 1.58 | 0.91 | 9.35 | 3.15 |
| FRCRN_SE_16K | 3.24 | 0.98 | 7.60 | 4.03 |
| MossFormerGAN_SE_16K | 3.57 | 0.98 | 14.03 | 4.05 |
| MossFormer2_SE_48K | 2.94 | 0.97 | 11.86 | 3.92 |
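To reproduce PESQ and STOI on your own recordings, the third-party pesq and pystoi packages can be used (they are not part of ClearerVoice-Studio; pip install pesq pystoi soundfile is assumed). The sketch below assumes an aligned 16kHz mono clean/enhanced pair with placeholder paths; SSNR and the DNSMOS P.808 MOS require additional tooling and are not shown here:
# Sketch: PESQ and STOI for one clean/enhanced pair (16kHz mono, equal length)
import soundfile as sf
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi
clean, sr = sf.read('samples/clean.wav')                # placeholder reference
enhanced, _ = sf.read('samples/enhanced/enhanced.wav')  # placeholder output
print('PESQ (wideband):', pesq(sr, clean, enhanced, 'wb'))
print('STOI:', stoi(clean, enhanced, sr, extended=False))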
Best Practices
- Model Selection:
  - For 48kHz high-fidelity audio, prefer MossFormer2_SE_48K
  - For 16kHz audio, choose based on the scenario; for general use, MossFormerGAN_SE_16K or FRCRN_SE_16K works well
- Batch Processing Optimization:
  - Use online_write=True when processing large numbers of audio files
  - Use .scp list files to manage batch processing jobs
- Performance Considerations:
  - Balance audio quality and processing speed based on actual needs
  - Choose an appropriate batch size based on your hardware resources
Conclusion
ClearerVoice-Studio provides a powerful and user-friendly solution for audio processing. After working through this tutorial, you should be able to handle its basic usage and choose the right model for your needs. As the project continues to evolve, we look forward to seeing more pre-trained models and features added.