Projects
All projects involve deep learning components. Their purpose is to place classical signal processing methods (e.g., filtering, spectral analysis, linear prediction, parametric synthesis) in perspective by directly comparing them with modern learning-based approaches addressing the same tasks.
Project 1 - SEGAN: Speech Enhancement Generative Adversarial Network
Paper: Pascual et al., SEGAN: Speech Enhancement Generative Adversarial Network
https://arxiv.org/abs/1703.09452
Objective
The goal of this project is to study speech denoising directly in the time domain using a convolutional neural network.
The work focuses on understanding what a learned filter can achieve compared to a classical Wiener filter under controlled assumptions.
Expected work
- Generate noisy speech signals by adding stationary noise to clean speech.
- Implement the SEGAN generator only (1D U-Net), without the adversarial discriminator.
- Train the generator using an L1 or L2 loss on short patches.
- Implement a Wiener filter (with a sliding window), assuming noise estimation from an initial silence segment (see the sketch after this list).
- Compare Wiener and SEGAN outputs using spectrograms and listening tests.
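As a starting point, here is a minimal sketch of the sliding-window Wiener filter in the STFT domain. It assumes the first `silence_s` seconds of the noisy signal contain noise only (the stationary-noise assumption above); the function name and default parameters are illustrative choices, not a prescribed implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_denoise(noisy, fs, silence_s=0.5, nperseg=512):
    # STFT of the noisy speech (sliding Hann windows, 50% overlap by default)
    f, t, X = stft(noisy, fs=fs, nperseg=nperseg)
    # Noise PSD estimated from the frames covering the initial silence segment
    n_sil = int(silence_s * fs / (nperseg // 2))
    noise_psd = np.mean(np.abs(X[:, :n_sil]) ** 2, axis=1, keepdims=True)
    # Maximum-likelihood estimate of the a-priori SNR, floored at zero
    snr = np.maximum(np.abs(X) ** 2 / noise_psd - 1.0, 0.0)
    gain = snr / (snr + 1.0)                # Wiener gain H = SNR / (SNR + 1)
    _, clean_est = istft(gain * X, fs=fs, nperseg=nperseg)
    return clean_est
```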
Project 2 - DCCRN: Deep Complex Convolution Recurrent Network
Paper: Hu et al., DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
https://arxiv.org/abs/2008.00264
Objective
This project investigates the role of phase information in speech denoising.
The goal is to compare magnitude-only approaches (Wiener filtering or magnitude-domain deep learning) with a complex-valued deep learning approach.
Expected work
- Generate noisy speech signals by adding stationary noise to clean speech.
- Implement a Wiener filter (with a sliding window) assuming noise estimation from an initial silence segment.
- Implement an STFT processing pipeline.
- Train a magnitude-only CNN for speech denoising.
- Train a simplified complex-valued CNN using both the real and imaginary parts of the STFT (see the sketch after this list).
- Compare all methods in terms of spectrograms, phase behavior, and perceptual quality.
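The sketch below shows one way to wire the STFT pipeline to a complex ratio mask; the tiny convolutional stack stands in for the simplified DCCRN-style network, and all layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ComplexMaskCNN(nn.Module):
    def __init__(self, n_fft=512):
        super().__init__()
        # Input: real/imag of the noisy STFT as 2 channels; output: a complex mask (2 channels).
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )
        self.n_fft, self.hop = n_fft, n_fft // 4
        self.register_buffer("win", torch.hann_window(n_fft))

    def forward(self, noisy):                            # noisy: (batch, time)
        X = torch.stft(noisy, self.n_fft, self.hop, window=self.win,
                       return_complex=True)              # (batch, freq, frames)
        feats = torch.stack([X.real, X.imag], dim=1)     # (batch, 2, freq, frames)
        m = self.net(feats)
        M = torch.complex(m[:, 0], m[:, 1])              # complex ratio mask
        Y = M * X                                        # complex product: modifies magnitude AND phase
        return torch.istft(Y, self.n_fft, self.hop, window=self.win,
                           length=noisy.shape[-1])
```

Note the design choice this exposes: the magnitude-only baseline keeps the noisy phase, while the complex mask can rotate the phase, which is exactly the behavior to examine in the comparison.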
Project 3 - MetricGAN: Optimizing Perceptual Metrics
Paper: Fu et al., MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement
https://arxiv.org/abs/1905.04874
Objective
This project studies the impact of the loss function in deep learning-based speech denoising.
The architecture is fixed and simple; only the training objective is modified.
Expected work
- Generate noisy speech signals by adding stationary noise to clean speech.
- Implement a simple STFT-based CNN denoiser using only the magnitude of the STFT.
- Train the denoiser using a classical MSE or L1 loss.
- Pre-train a lightweight neural network to approximate a perceptual speech quality metric.
- Train the CNN denoiser using this perceptual metric-based loss (see the training-step sketch after this list).
- Analyze differences between energy-based and perceptual optimization.
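A minimal sketch of swapping the MSE loss for a learned perceptual loss, in the spirit of MetricGAN: `metric_net` is assumed to be the pre-trained network from the previous item, regressing a quality score (rescaled to [0, 1]) from spectrograms. All names are illustrative.

```python
import torch

def perceptual_loss(metric_net, enhanced_mag, clean_mag, target_score=1.0):
    # Freeze the metric predictor: only the denoiser receives gradients.
    for p in metric_net.parameters():
        p.requires_grad_(False)
    score = metric_net(enhanced_mag, clean_mag)      # predicted quality in [0, 1]
    return ((score - target_score) ** 2).mean()      # push the score toward the maximum

# Hypothetical training step (denoiser and optimizer defined elsewhere):
# mag_hat = denoiser(noisy_mag)
# loss = perceptual_loss(metric_net, mag_hat, clean_mag)
# loss.backward(); optimizer.step()
```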
Project 4 - DDSP: Differentiable Digital Signal Processing
Paper: Engel et al., DDSP: Differentiable Digital Signal Processing
https://arxiv.org/abs/2001.04643
Objective
The goal of this project is to compare a standard analytical DSP synthesis pipeline with learned parameter estimation.
Students study how neural networks can be used to predict the parameters of a classical harmonic-plus-noise synthesizer.
Expected work
- Implement a complete DSP analysis-synthesis pipeline, including:
- Implement analysis/estimation on sliding windows:
- Pitch estimation and voiced/unvoiced decision using autocorrelation, cepstrum, or YIN methods
(you may copy/paste existing code, but if you do so you are expected to master it).
- Harmonic analysis based on the estimated pitch (harmonic frequencies and amplitudes).
- Noise analysis by separating harmonic and residual components in the STFT domain
using a simple harmonic mask, and modeling the residual noise through its spectral envelope.
- Estimation of the overall smooth spectral envelope (e.g., LPC or cepstral smoothing).
- Then implement signal reconstruction: (harmonic part + noise part) filtered by the overall spectral-envelope filter (a minimal synthesizer sketch is given after this list).
- Implement a DDSP version of the same synthesizer: keep the DSP synthesis model fixed, but replace the parameter estimation
with a small frame-wise feed-forward NN (no recurrent or attention-based architecture is required) that predicts the synthesizer parameters end-to-end.
- Train the network by minimizing an audio reconstruction loss between the synthesized output and the original signal on a small dataset (single instrument or voice).
- Compare the analytical and learned parameter-estimation pipelines (e.g., spectrograms, resynthesis quality, listening tests).
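A minimal sketch of the harmonic-plus-noise synthesis step, assuming frame-wise parameters (pitch, harmonic amplitudes, noise gain) are already available, either from the analysis stage or from the NN. It is written with torch so the same synthesizer is differentiable and reusable for the DDSP training; the overall spectral-envelope filter is omitted for brevity, and the linear upsampling and all names are illustrative simplifications.

```python
import torch

def hpn_synth(f0, harm_amps, noise_gain, fs=16000, hop=256):
    # f0: (frames,), harm_amps: (frames, K), noise_gain: (frames,)
    n = f0.shape[0] * hop
    # Upsample frame-wise parameters to sample rate by linear interpolation
    up = lambda x: torch.nn.functional.interpolate(
        x.t().unsqueeze(0), size=n, mode="linear", align_corners=True)[0].t()
    f0_s = up(f0.unsqueeze(1))[:, 0]                   # sample-rate f0 trajectory
    phase = 2 * torch.pi * torch.cumsum(f0_s / fs, dim=0)
    k = torch.arange(1, harm_amps.shape[1] + 1)
    harmonics = torch.sin(phase.unsqueeze(1) * k)      # (n, K) harmonic bank
    amps = up(harm_amps)                               # (n, K) amplitude envelopes
    aliased = (k.unsqueeze(0) * f0_s.unsqueeze(1)) > fs / 2
    amps = amps * (~aliased)                           # zero harmonics above Nyquist
    harmonic_part = (amps * harmonics).sum(dim=1)
    noise_part = up(noise_gain.unsqueeze(1))[:, 0] * torch.randn(n)
    return harmonic_part + noise_part
```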
Project 5 - LPCNet: Neural Excitation for LPC Vocoding
Paper: Valin & Skoglund, LPCNet: Improving Neural Speech Synthesis through Linear Prediction
https://arxiv.org/abs/1810.11846
Objective
This project analyzes what neural networks learn when combined with a classical LPC vocoder.
The focus is exclusively on excitation modeling rather than spectral envelope estimation.
Expected work
- Implement a classical LPC analysis/synthesis vocoder with heuristic excitation (Dirac impulse trains or white noise).
- Implement an LPC analysis/synthesis vocoder using the true residual (obtained by subtracting the linear prediction); see the sketch after this list.
- Fine-tune a pre-trained LPCNet model on a small, new speaker dataset (short patches).
- Compare the three approaches above.
- Discuss the limits of LPC modeling and the contribution of learning.
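A minimal sketch of LPC analysis/synthesis on one frame, showing the true residual obtained by inverse filtering and the exact resynthesis through the all-pole filter. It uses librosa for the LPC coefficients; the function name and order are illustrative.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def lpc_frame(frame, order=16):
    a = librosa.lpc(frame, order=order)     # a[0] == 1; A(z) = sum_k a[k] z^{-k}
    residual = lfilter(a, [1.0], frame)     # e[n] = A(z) x[n]  (inverse filtering)
    resynth = lfilter([1.0], a, residual)   # x[n] = e[n] / A(z): exact reconstruction
    return a, residual, resynth

# Replacing `residual` with a Dirac train (voiced) or white noise (unvoiced)
# of matched energy gives the heuristic-excitation vocoder of the first item.
```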
Project 6 - CREPE: Deep Learning for Pitch Estimation
Paper: Kim et al., CREPE: A Convolutional Representation for Pitch Estimation
https://arxiv.org/abs/1802.06182
Objective
This project compares classical DSP pitch estimation methods with a deep learning-based approach.
The emphasis is on experimental evaluation, robustness, and voiced/unvoiced decision making.
Expected work
- Reimplement two DSP pitch estimation methods (e.g., autocorrelation and YIN); see the autocorrelation sketch after this list.
- Use the pre-trained CREPE model for pitch estimation.
- Design an ML (or, if you prefer, DL) algorithm implementing the voiced/unvoiced decision based on CREPE outputs.
- Run a fixed experimental protocol:
- Voiced clean speech.
- Unvoiced speech segments.
- Voiced speech with stationary noise at SNR = 20 dB, 10 dB, and 0 dB.
- Synthetic tonal signals with known ground-truth pitch.
- ...
- Compare DSP and deep learning methods using the following mandatory criteria:
- Pitch estimation error on synthetic signals.
- Pitch stability on sustained voiced segments.
- Voiced / unvoiced classification errors (false positives and false negatives).
- Qualitative analysis of typical failure cases.
- ...
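As a starting point for the DSP baselines, here is a minimal autocorrelation pitch estimator for a single analysis frame; the `fmin`/`fmax` search bounds and the voicing threshold are illustrative choices to be tuned in the experimental protocol.

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0, vthresh=0.3):
    # The frame must be longer than fs / fmin samples to cover the largest lag.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac /= ac[0] + 1e-12                          # normalize so ac[0] == 1
    lo, hi = int(fs / fmax), int(fs / fmin)      # admissible lag range
    lag = lo + int(np.argmax(ac[lo:hi]))
    voiced = ac[lag] > vthresh                   # simple voicing decision
    return (fs / lag if voiced else 0.0), ac[lag]
```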