Projects
All projects involve deep learning components. Their purpose is to place classical signal processing methods (e.g., filtering, spectral analysis, linear prediction, parametric synthesis) in perspective by directly comparing them with modern learning-based approaches addressing the same tasks.
Project 1 - SEGAN: Speech Enhancement Generative Adversarial Network
Paper: Pascual et al., SEGAN: Speech Enhancement Generative Adversarial Network
https://arxiv.org/abs/1703.09452
Objective
The goal of this project is to study speech denoising directly in the time domain using a convolutional neural network.
The work focuses on understanding what a learned filter can achieve compared to a classical Wiener filter under controlled assumptions.
Expected work
- Generate noisy speech signals by adding stationary noise to clean speech.
- Implement the SEGAN generator only (1D U-Net), without the adversarial discriminator.
- Train the generator using an L1 or L2 loss on short patches.
- Implement a Wiener filter (with a sliding window), assuming noise estimation from an initial silence segment (see the sketch after this list).
- Compare Wiener and SEGAN outputs using spectrograms and listening tests.
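As a starting point, here is a minimal sketch of the sliding-window Wiener filter in the STFT domain. It assumes the first `silence_s` seconds of the noisy signal contain noise only (the stationary-noise assumption above); the function name and default parameters are illustrative choices, not a prescribed implementation.

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_denoise(noisy, fs, silence_s=0.5, nperseg=512):
    # STFT of the noisy speech (sliding Hann windows, 50% overlap by default)
    f, t, X = stft(noisy, fs=fs, nperseg=nperseg)
    # Noise PSD estimated from the frames covering the initial silence segment
    n_sil = int(silence_s * fs / (nperseg // 2))
    noise_psd = np.mean(np.abs(X[:, :n_sil]) ** 2, axis=1, keepdims=True)
    # Maximum-likelihood estimate of the a-priori SNR, floored at zero
    snr = np.maximum(np.abs(X) ** 2 / noise_psd - 1.0, 0.0)
    gain = snr / (snr + 1.0)                # Wiener gain H = SNR / (SNR + 1)
    _, clean_est = istft(gain * X, fs=fs, nperseg=nperseg)
    return clean_est
```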
Project 2 - DCCRN: Deep Complex Convolution Recurrent Network
Paper: Hu et al., DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement
https://arxiv.org/abs/2008.00264
Objective
This project investigates the role of phase information in speech denoising.
The goal is to compare magnitude-only approaches (Wiener filtering or magnitude-domain deep learning) with a complex-valued deep learning approach.
Expected work
- Generate noisy speech signals by adding stationary noise to clean speech.
- Implement a Wiener filter (with a sliding window) assuming noise estimation from an initial silence segment.
- Implement an STFT processing pipeline.
- Train a magnitude-only CNN for speech denoising.
- Train a simplified complex-valued CNN using both the real and imaginary parts of the STFT (see the sketch after this list).
- Compare all methods in terms of spectrograms, phase behavior, and perceptual quality.
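The sketch below shows one way to wire the STFT pipeline to a complex ratio mask; the tiny convolutional stack stands in for the simplified DCCRN-style network, and all layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ComplexMaskCNN(nn.Module):
    def __init__(self, n_fft=512):
        super().__init__()
        # Input: real/imag of the noisy STFT as 2 channels; output: a complex mask (2 channels).
        self.net = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )
        self.n_fft, self.hop = n_fft, n_fft // 4
        self.register_buffer("win", torch.hann_window(n_fft))

    def forward(self, noisy):                            # noisy: (batch, time)
        X = torch.stft(noisy, self.n_fft, self.hop, window=self.win,
                       return_complex=True)              # (batch, freq, frames)
        feats = torch.stack([X.real, X.imag], dim=1)     # (batch, 2, freq, frames)
        m = self.net(feats)
        M = torch.complex(m[:, 0], m[:, 1])              # complex ratio mask
        Y = M * X                                        # complex product: modifies magnitude AND phase
        return torch.istft(Y, self.n_fft, self.hop, window=self.win,
                           length=noisy.shape[-1])
```

Note the design choice this exposes: the magnitude-only baseline keeps the noisy phase, while the complex mask can rotate the phase, which is exactly the behavior to examine in the comparison.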
Project 3 - MetricGAN: Optimizing Perceptual Metrics
Paper: Fu et al., MetricGAN: Generative Adversarial Networks based Black-box Metric Scores Optimization for Speech Enhancement
https://arxiv.org/abs/1905.04874
Objective
This project studies the impact of the loss function in deep learning-based speech denoising.
The architecture is fixed and simple; only the training objective is modified.
Expected work
- Generate noisy speech signals by adding stationary noise to clean speech.
- Implement a simple STFT-based CNN denoiser using only the magnitude of the STFT.
- Train the denoiser using a classical MSE or L1 loss.
- Pre-train a lightweight neural network to approximate a perceptual speech quality metric.
- Train the CNN denoiser using this perceptual metric-based loss (see the training-step sketch after this list).
- Analyze differences between energy-based and perceptual optimization.
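A minimal sketch of swapping the MSE loss for a learned perceptual loss, in the spirit of MetricGAN: `metric_net` is assumed to be the pre-trained network from the previous item, regressing a quality score (rescaled to [0, 1]) from spectrograms. All names are illustrative.

```python
import torch

def perceptual_loss(metric_net, enhanced_mag, clean_mag, target_score=1.0):
    # Freeze the metric predictor: only the denoiser receives gradients.
    for p in metric_net.parameters():
        p.requires_grad_(False)
    score = metric_net(enhanced_mag, clean_mag)      # predicted quality in [0, 1]
    return ((score - target_score) ** 2).mean()      # push the score toward the maximum

# Hypothetical training step (denoiser and optimizer defined elsewhere):
# mag_hat = denoiser(noisy_mag)
# loss = perceptual_loss(metric_net, mag_hat, clean_mag)
# loss.backward(); optimizer.step()
```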
Project 4 - DDSP: Differentiable Digital Signal Processing
Paper: Engel et al., DDSP: Differentiable Digital Signal Processing
https://arxiv.org/abs/2001.04643
Objective
The goal of this project is to compare a standard analytical DSP synthesis pipeline with learned parameter estimation.
Students study how neural networks can be used to predict the parameters of a classical harmonic-plus-noise synthesizer.
Expected work
- Implement a complete DSP analysis-synthesis pipeline, including:
- Implement analysis/estimation on sliding windows:
- Pitch estimation and voiced/unvoiced decision using autocorrelation, cepstrum, or YIN methods
(you may copy/paste existing code, but if you do so you are expected to master it).
- Harmonic analysis based on the estimated pitch (harmonic frequencies and amplitudes).
- Noise analysis by separating harmonic and residual components in the STFT domain
using a simple harmonic mask, and modeling the residual noise through its spectral envelope.
- Estimation of the overall smooth spectral envelope (e.g., LPC or cepstral smoothing).
- Then implement signal reconstruction: (harmonic part + noise part) filtered by the overall spectral-envelope filter (a minimal synthesizer sketch is given after this list).
- Implement a DDSP version of the same synthesizer: keep the DSP synthesis model fixed, but replace the parameter estimation
with a small frame-wise feed-forward NN (no recurrent or attention-based architecture is required) that predicts the synthesizer parameters end-to-end.
- Train the network by minimizing an audio reconstruction loss between the synthesized output and the original signal on a small dataset (single instrument or voice).
- Compare the analytical and learned parameter-estimation pipelines (e.g., spectrograms, resynthesis quality, listening tests).
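A minimal sketch of the harmonic-plus-noise synthesis step, assuming frame-wise parameters (pitch, harmonic amplitudes, noise gain) are already available, either from the analysis stage or from the NN. It is written with torch so the same synthesizer is differentiable and reusable for the DDSP training; the overall spectral-envelope filter is omitted for brevity, and the linear upsampling and all names are illustrative simplifications.

```python
import torch

def hpn_synth(f0, harm_amps, noise_gain, fs=16000, hop=256):
    # f0: (frames,), harm_amps: (frames, K), noise_gain: (frames,)
    n = f0.shape[0] * hop
    # Upsample frame-wise parameters to sample rate by linear interpolation
    up = lambda x: torch.nn.functional.interpolate(
        x.t().unsqueeze(0), size=n, mode="linear", align_corners=True)[0].t()
    f0_s = up(f0.unsqueeze(1))[:, 0]                   # sample-rate f0 trajectory
    phase = 2 * torch.pi * torch.cumsum(f0_s / fs, dim=0)
    k = torch.arange(1, harm_amps.shape[1] + 1)
    harmonics = torch.sin(phase.unsqueeze(1) * k)      # (n, K) harmonic bank
    amps = up(harm_amps)                               # (n, K) amplitude envelopes
    aliased = (k.unsqueeze(0) * f0_s.unsqueeze(1)) > fs / 2
    amps = amps * (~aliased)                           # zero harmonics above Nyquist
    harmonic_part = (amps * harmonics).sum(dim=1)
    noise_part = up(noise_gain.unsqueeze(1))[:, 0] * torch.randn(n)
    return harmonic_part + noise_part
```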
Project 5 - LPCNet: Neural Excitation for LPC Vocoding
Paper: Valin & Skoglund, LPCNet: Improving Neural Speech Synthesis through Linear Prediction
https://arxiv.org/abs/1810.11846
Objective
This project analyzes what neural networks learn when combined with a classical LPC vocoder.
The focus is exclusively on excitation modeling rather than spectral envelope estimation.
Expected work
- Implement a classical LPC analysis/synthesis vocoder with heuristic excitation (Dirac impulse trains or white noise).
- Implement an LPC analysis/synthesis vocoder using the true residual (obtained by subtracting the linear prediction); see the sketch after this list.
- Fine-tune a pre-trained LPCNet model on a small, new speaker dataset (short patches).
- Compare the three approaches above.
- Discuss the limits of LPC modeling and the contribution of learning.
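A minimal sketch of LPC analysis/synthesis on one frame, showing the true residual obtained by inverse filtering and the exact resynthesis through the all-pole filter. It uses librosa for the LPC coefficients; the function name and order are illustrative.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def lpc_frame(frame, order=16):
    a = librosa.lpc(frame, order=order)     # a[0] == 1; A(z) = sum_k a[k] z^{-k}
    residual = lfilter(a, [1.0], frame)     # e[n] = A(z) x[n]  (inverse filtering)
    resynth = lfilter([1.0], a, residual)   # x[n] = e[n] / A(z): exact reconstruction
    return a, residual, resynth

# Replacing `residual` with a Dirac train (voiced) or white noise (unvoiced)
# of matched energy gives the heuristic-excitation vocoder of the first item.
```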
Project 6 - CREPE: Deep Learning for Pitch Estimation
Paper: Kim et al., CREPE: A Convolutional Representation for Pitch Estimation
https://arxiv.org/abs/1802.06182
Objective
This project compares classical DSP pitch estimation methods with a deep learning-based approach.
The emphasis is on experimental evaluation, robustness, and voiced/unvoiced decision making.
Expected work
- Reimplement two DSP pitch estimation methods (e.g., autocorrelation and YIN); see the autocorrelation sketch after this list.
- Use the pre-trained CREPE model for pitch estimation.
- Design an ML (or, if you prefer, DL) algorithm implementing the voiced/unvoiced decision based on CREPE outputs.
- Run a fixed experimental protocol:
- Voiced clean speech.
- Unvoiced speech segments.
- Voiced speech with stationary noise at SNR = 20 dB, 10 dB, and 0 dB.
- Synthetic tonal signals with known ground-truth pitch.
- ...
- Compare DSP and deep learning methods using the following mandatory criteria:
- Pitch estimation error on synthetic signals.
- Pitch stability on sustained voiced segments.
- Voiced / unvoiced classification errors (false positives and false negatives).
- Qualitative analysis of typical failure cases.
- ...
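As a starting point for the DSP baselines, here is a minimal autocorrelation pitch estimator for a single analysis frame; the `fmin`/`fmax` search bounds and the voicing threshold are illustrative choices to be tuned in the experimental protocol.

```python
import numpy as np

def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0, vthresh=0.3):
    # The frame must be longer than fs / fmin samples to cover the largest lag.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac /= ac[0] + 1e-12                          # normalize so ac[0] == 1
    lo, hi = int(fs / fmax), int(fs / fmin)      # admissible lag range
    lag = lo + int(np.argmax(ac[lo:hi]))
    voiced = ac[lag] > vthresh                   # simple voicing decision
    return (fs / lag if voiced else 0.0), ac[lag]
```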