Tinkering with speech enhancement models.
Borrowed code, models and techniques from:
- Improved Speech Enhancement with the Wave-U-Net (arXiv)
- Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation (arXiv)
- Speech Denoising with Deep Feature Losses (arXiv, sound examples, GitHub)
- MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis (arXiv, sound examples, GitHub)
The following datasets are used:
- The University of Edinburgh noisy speech database for the speech enhancement problem
- The TUT Acoustic Scenes 2016 dataset is used to train the scene classifier network, which provides the feature-based loss function (see the sketch after this list). (dataset paper)
- The CHiME-Home (Computational Hearing in Multisource Environments) dataset (2015) is also used for the scene classifier in some experiments.
- The "train-clean-100" dataset from Librispeech, mixed with the TUT acoustic scenes dataset.
At the moment, the algorithm requires 32-bit floating-point audio files at a 16 kHz sampling rate to work correctly. You can use sox to convert your files. To convert audiofile.wav to 32-bit floating-point audio at a 16 kHz sampling rate, run:
sox audiofile.wav -r 16000 -b 32 -e float audiofile.float.wav
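If you prefer to do the conversion from Python (for example as part of a preprocessing script), a rough equivalent using the `soundfile` and `scipy` packages could look like the sketch below. The mono mixdown is an assumption; adjust it if your pipeline handles multi-channel audio.

```python
# Minimal sketch (not part of this repo): convert a WAV file to 32-bit float
# at 16 kHz, assuming `soundfile` and `scipy` are installed.
import math

import numpy as np
import scipy.signal
import soundfile as sf

TARGET_RATE = 16000

def convert(in_path: str, out_path: str) -> None:
    """Read a WAV file and write it as 32-bit float at 16 kHz."""
    audio, rate = sf.read(in_path, dtype="float32")
    if audio.ndim > 1:                       # assumption: mix down to mono
        audio = audio.mean(axis=1)
    if rate != TARGET_RATE:                  # polyphase resampling to 16 kHz
        g = math.gcd(TARGET_RATE, rate)
        audio = scipy.signal.resample_poly(audio, TARGET_RATE // g, rate // g)
    sf.write(out_path, audio.astype(np.float32), TARGET_RATE, subtype="FLOAT")

if __name__ == "__main__":
    convert("audiofile.wav", "audiofile.float.wav")
```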