ClearerVoice-Studio: A One-Stop Solution for Speech Enhancement, Speech Denoising, Speech Separation and Speaker Extraction
Introduction
ClearerVoice-Studio is a unified inference platform focusing on Speech Enhancement, Speech Separation, and Audio-Visual Target Speaker Extraction. This tutorial will guide you through using this powerful tool for various audio processing tasks.
Supported Pre-trained Models
The platform currently offers the following pre-trained models:
Speech Enhancement (16kHz & 48kHz)
- MossFormer2_SE_48K
- FRCRN_SE_16K
- MossFormerGAN_SE_16K
Speech Separation (16kHz)
- MossFormer2_SS_16K
Audio-Visual Target Speaker Extraction (16kHz)
- AV_MossFormer2_TSE_16K
All models are hosted on HuggingFace and will be automatically downloaded when needed.
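You normally don't need to download anything by hand: the first time a model is used, its checkpoint is fetched automatically. If you prefer to pre-fetch a checkpoint (for an offline machine, for example), the sketch below uses huggingface_hub once the environment in the next section is installed; the repo ID and local checkpoint directory shown here are assumptions, so check the model card on HuggingFace before relying on them.
# Optional pre-download sketch (assumed repo ID and local checkpoint layout)
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id='alibabasglab/MossFormer2_SE_48K',   # assumed HuggingFace repo ID
    local_dir='checkpoints/MossFormer2_SE_48K'   # assumed local checkpoint directory
)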
Environment Setup
1. Clone the Repository
git clone https://github.com/modelscope/ClearerVoice-Studio.git
2. Create Conda Environment
cd ClearerVoice-Studio
conda create -n ClearerVoice-Studio python=3.8
conda activate ClearerVoice-Studio
pip install -r requirements.txt
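You can quickly verify the install with a one-line import check (a minimal sanity check; it assumes the install succeeded and that you run it from the directory containing the clearvoice package):
python -c "from clearvoice import ClearVoice; print('ClearVoice import OK')"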
Usage Tutorial
1. Speech Enhancement Example
from clearvoice import ClearVoice
import os
# Initialize speech enhancement model
cv_se = ClearVoice(
    task='speech_enhancement',
    model_names=['MossFormer2_SE_48K']
)
# Process single audio file
input_path = 'samples/noisy.wav'
output_wav = cv_se(
    input_path=input_path,
    online_write=False
)
# Save enhanced audio
output_dir = 'samples/enhanced'
os.makedirs(output_dir, exist_ok=True)
output_path = os.path.join(output_dir, 'enhanced.wav')
cv_se.write(output_wav, output_path=output_path)
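Because model_names is a list (see the parameter description below), you can also load more than one model and compare their outputs on the same file. A minimal sketch, assuming both 16kHz checkpoints are available; the input path and output directory are placeholders:
# Sketch: compare two 16kHz enhancement models on the same recording
cv_compare = ClearVoice(
    task='speech_enhancement',
    model_names=['FRCRN_SE_16K', 'MossFormerGAN_SE_16K']
)
cv_compare(
    input_path='samples/noisy_16k.wav',      # placeholder 16kHz input
    online_write=True,                       # write each model's output to disk
    output_path='samples/enhanced_compare'   # placeholder output directory
)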
2. Speech Separation Example
# Initialize speech separation model
cv_ss = ClearVoice(
    task='speech_separation',
    model_names=['MossFormer2_SS_16K']
)
# Process mixed speech file
input_path = 'samples/mixed.wav'
output_dir = 'samples/separated'
os.makedirs(output_dir, exist_ok=True)
# Separate speech and save automatically
cv_ss(
    input_path=input_path,
    online_write=True,
    output_path=output_dir
)
# Generated files:
# - output_MossFormer2_SS_16K_spk1.wav
# - output_MossFormer2_SS_16K_spk2.wav
3. Target Speaker Extraction Example
# Initialize target speaker extraction model
cv_tse = ClearVoice(
    task='target_speaker_extraction',
    model_names=['AV_MossFormer2_TSE_16K']
)
# Process video file
input_path = 'samples/video.mp4'
output_dir = 'samples/extracted'
os.makedirs(output_dir, exist_ok=True)
# Extract target speaker's voice
cv_tse(
    input_path=input_path,
    online_write=True,
    output_path=output_dir
)
# Generated files:
# - extracted_speech.wav (target speaker's voice)
# - background.wav (background audio)
Batch Processing Example
def process_directory(input_dir, output_dir, task='speech_enhancement'):
    # Initialize model
    cv = ClearVoice(
        task=task,
        model_names=['MossFormer2_SE_48K'] if task == 'speech_enhancement' else
                    ['MossFormer2_SS_16K'] if task == 'speech_separation' else
                    ['AV_MossFormer2_TSE_16K']
    )
    # Ensure output directory exists
    os.makedirs(output_dir, exist_ok=True)
    # Get all audio files
    audio_files = [f for f in os.listdir(input_dir) if f.endswith(('.wav', '.mp4', '.avi'))]
    # Batch processing
    for audio_file in audio_files:
        input_path = os.path.join(input_dir, audio_file)
        cv(
            input_path=input_path,
            online_write=True,
            output_path=output_dir
        )
        print(f"Processed: {audio_file}")
# Usage example
process_directory(
    input_dir='samples/input',
    output_dir='samples/output',
    task='speech_enhancement'
)
Advanced Usage: Progress Monitoring
import tqdm
def process_with_progress(input_files, task='speech_enhancement',
                          model_names=['MossFormer2_SE_48K']):
    cv = ClearVoice(task=task, model_names=model_names)
    for file in tqdm.tqdm(input_files, desc=f"Processing {task}"):
        try:
            cv(
                input_path=file,
                online_write=True,
                output_path='samples/output'
            )
        except Exception as e:
            print(f"Error processing {file}: {str(e)}")
            continue
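To drive it, build the file list however you like, for instance with glob (the input directory is a placeholder):
import glob
input_files = sorted(glob.glob('samples/input/*.wav'))
process_with_progress(input_files, task='speech_enhancement')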
Parameter Description
- task: Selects the processing task
  - speech_enhancement: Speech Enhancement
  - speech_separation: Speech Separation
  - target_speaker_extraction: Target Speaker Extraction
- model_names: List of model names; one or more models can be selected
- input_path: Input path; supports a single file, a directory, or a list file (.scp), as shown in the sketch after this list
- online_write: Whether to save results to disk during processing
- output_path: Output path; can be a file or a directory
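An .scp list file is just a plain-text manifest. The sketch below assumes the expected format is one audio path per line (check the repository's own examples if your version differs); the file names and output directory are placeholders:
# Sketch: build a simple .scp manifest (assumed format: one path per line)
wav_files = ['samples/noisy_01.wav', 'samples/noisy_02.wav']  # placeholder paths
scp_path = 'samples/noisy_list.scp'
with open(scp_path, 'w') as f:
    f.write('\n'.join(wav_files) + '\n')
# Feed the manifest to the model and write one result per listed file
cv_se = ClearVoice(task='speech_enhancement', model_names=['MossFormer2_SE_48K'])
cv_se(
    input_path=scp_path,
    online_write=True,
    output_path='samples/enhanced_batch'
)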
Performance Evaluation
VoiceBank+DEMAND Test Set (16kHz) Performance Comparison
| Model | PESQ | STOI | SSNR | P808_MOS |
|---|---|---|---|---|
| Noisy Audio | 1.97 | 0.92 | 6.13 | 3.05 |
| FRCRN_SE_16K | 3.23 | 0.95 | 7.60 | 3.59 |
| MossFormerGAN_SE_16K | 3.47 | 0.96 | 9.09 | 3.57 |
| MossFormer2_SE_48K | 3.16 | 0.95 | 6.86 | 3.53 |
DNS-Challenge-2020 Test Set Performance Comparison
| Model | PESQ | STOI | SSNR | P808_MOS |
|---|---|---|---|---|
| Noisy Audio | 1.58 | 0.91 | 9.35 | 3.15 |
| FRCRN_SE_16K | 3.24 | 0.98 | 7.60 | 4.03 |
| MossFormerGAN_SE_16K | 3.57 | 0.98 | 14.03 | 4.05 |
| MossFormer2_SE_48K | 2.94 | 0.97 | 11.86 | 3.92 |
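To reproduce PESQ and STOI on your own recordings, the third-party pesq and pystoi packages can be used (they are not part of ClearerVoice-Studio; pip install pesq pystoi soundfile is assumed). The sketch below assumes an aligned 16kHz mono clean/enhanced pair with placeholder paths; SSNR and the DNSMOS P.808 MOS require additional tooling and are not shown here:
# Sketch: PESQ and STOI for one clean/enhanced pair (16kHz mono, equal length)
import soundfile as sf
from pesq import pesq    # pip install pesq
from pystoi import stoi  # pip install pystoi
clean, sr = sf.read('samples/clean.wav')                # placeholder reference
enhanced, _ = sf.read('samples/enhanced/enhanced.wav')  # placeholder output
print('PESQ (wideband):', pesq(sr, clean, enhanced, 'wb'))
print('STOI:', stoi(clean, enhanced, sr, extended=False))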
Best Practices
- Model Selection:
  - For 48kHz high-fidelity audio, prefer MossFormer2_SE_48K
  - For 16kHz audio, choose based on the scenario; for general use, MossFormerGAN_SE_16K or FRCRN_SE_16K works well
- Batch Processing Optimization:
  - Use online_write=True when processing large numbers of audio files
  - Use .scp list files to manage batch processing jobs
- Performance Considerations:
  - Balance audio quality and processing speed based on actual needs
  - Choose an appropriate batch size based on your hardware resources
Conclusion
ClearerVoice-Studio provides a powerful and user-friendly solution for audio processing. After working through this tutorial, you should be able to handle its basic usage and choose the right model for your needs. As the project continues to evolve, we look forward to seeing more pre-trained models and features added.