Noise Modeling to Build Training Sets for Robust Speech Enhancement

DOI: https://doi.org/10.21203/rs.3.rs-105360/v1

Abstract

The performance of Deep Neural Network (DNN)-based speech enhancement models degrades significantly on real recordings because synthetic training sets are mismatched with real test sets. To solve this problem, we propose a new Generative Adversarial Network framework for Noise Modeling (NM-GAN) that builds training sets by imitating the real noise distribution. The framework combines a novel U-Net with two bidirectional Long Short-Term Memory (LSTM) layers, which together act as a generator that constructs complex noise. A Gaussian distribution is adapted and used as conditional information to direct the noise generation. A discriminator then learns to determine whether a noise sample comes from the model distribution or from the real noise distribution. Through adversarial and alternating training, NM-GAN generates samples with sufficient recall (diversity) and precision (quality) to resemble real noise. Realistic paired training sets are then composed from these samples. Extensive experiments were carried out, and qualitative and quantitative evaluations of the generated noise samples and training sets demonstrate the potential of the framework. A speech enhancement model trained on our synthetic training sets and on real training sets was found to achieve good noise suppression on real speech-related noise.
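The abstract describes a generator built from a U-Net with two bidirectional LSTM layers, conditioned on Gaussian noise, paired with a discriminator that scores samples as real or generated. The sketch below illustrates that overall shape in PyTorch; all layer sizes are illustrative assumptions, skip connections are omitted for brevity, and none of the names or hyperparameters come from the paper itself.

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Illustrative NM-GAN-style generator: a small encoder/decoder
    with two bidirectional LSTM layers at the bottleneck.
    Channel counts and kernel sizes are assumptions, not the paper's."""
    def __init__(self, hidden=128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv1d(1, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Two bidirectional LSTM layers, as described in the abstract.
        self.lstm = nn.LSTM(32, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.dec = nn.Sequential(
            nn.ConvTranspose1d(2 * hidden, 16, 4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, z):
        # z: Gaussian conditioning noise, shape (batch, 1, time).
        h = self.enc(z)                       # (batch, 32, time/4)
        h, _ = self.lstm(h.transpose(1, 2))   # (batch, time/4, 2*hidden)
        return self.dec(h.transpose(1, 2))    # (batch, 1, time)

class Discriminator(nn.Module):
    """Scores a noise waveform as coming from the real or model distribution."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.net(x)

# Generate a batch of synthetic noise from Gaussian conditioning input.
g, d = Generator(), Discriminator()
z = torch.randn(2, 1, 64)
fake_noise = g(z)       # (2, 1, 64) waveform-shaped noise
scores = d(fake_noise)  # (2, 1) real/fake logits
```

In the actual framework the two networks would be trained adversarially and alternately, with the generated noise then mixed with clean speech to compose paired training sets.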

Full Text

This preprint is available for download as a PDF.