Research-and-Analysis-of-Speech-Enhancement-or-Dereverberation (RA-SED)

This repository contains material on speech enhancement and dereverberation. On the one hand, I summarize this work to deepen my own understanding; on the other hand, I hope that everyone interested in speech enhancement, beginner or expert, can ask me questions so that we can make progress together.
Some of my summaries are far from perfect, and I welcome your corrections!

Announcement:
I would like to open-source a speech enhancement toolkit in the near future, but I currently have no good way to do frame-level feature extraction. I would like to put the features in one file, but on a machine with little memory, reading and writing such a file may run out of memory.
If you have a better way, please contact me!
Thank you!
My email: [email protected], [email protected] (I will not be able to use this email after Jan. 2021!)

0. Outlines

|| ------ 1. Overviews
|| ------------ 1.1 What is speech enhancement
|| ------------ 1.2 Classification of speech enhancement
|| ------ 2. Traditional speech enhancement methods (I will show this part in the future.)
|| ------ 3. Deep learning-based speech enhancement methods
|| ------------ 3.1 Basic framework
|| ------------ 3.2 Frequency domain speech enhancement (Monaural)
|| ------------------ 3.2.1 Feature extraction module
|| ------------------ 3.2.2 Inputs module
|| ------------------ 3.2.3 Phase module
|| ------------------ 3.2.4 Enhancement module
|| ------------------ 3.2.5 Post-processing module
|| ------------------ 3.2.6 Combined with traditional speech enhancement
|| ------------ 3.3 Frequency domain speech enhancement (Multi-channel)
|| ------------ 3.4 Time domain speech enhancement
|| ------------------ 3.4.1 Why use FCN (CNN) in time-domain
|| ------------------ 3.4.2 Neural network structure
|| ------------------ 3.4.3 Loss function
|| ------ 4. Public datasets
|| ------ 5. Performance comparison
|| ------ 6. Future trends
|| ------ 7. Tools
|| ------ 8. Acknowledgements

1. Overviews

We give some basic background in this part. We first introduce what speech enhancement (dereverberation) is and its mathematical expression. Then we give the classification of speech enhancement (dereverberation) that we have summarized.

1.1 What is speech enhancement (dereverberation)

In real life, a microphone picks up not only the desired voice but also some noise and reverberation. Speech enhancement starts from this noisy speech and tries to recover clean speech. In practice, however, speech enhancement (dereverberation) introduces some distortion into the signal and cannot restore the clean speech perfectly. Speech enhancement (dereverberation) is essentially speech noise reduction (denoising).

The mathematical expression is as follows:
x(t) = r(t) * s(t) + n(t)
where s is the (desired) speech signal, r is the room impulse response (RIR)[1], * denotes convolution, n is the additive noise signal, and x is the signal picked up by the microphone, i.e., the noisy signal. A speech enhancement system tries to remove the influence of n (additive noise), while a speech dereverberation system tries to remove the influence of r (reverberation).
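
As a concrete illustration (a minimal sketch of my own, not code from this repository), the snippet below synthesizes a noisy reverberant mixture x = r * s + n from a clean utterance, an RIR, and a noise recording; the placeholder arrays and the 5 dB SNR are assumptions.

```python
# Minimal sketch of x = r * s + n with placeholder data (not from this repository).
import numpy as np
from scipy.signal import fftconvolve

def make_noisy_reverberant(speech, rir, noise, snr_db=5.0):
    """Convolve speech with an RIR and add noise at a target SNR in dB."""
    reverberant = fftconvolve(speech, rir)[: len(speech)]           # r * s
    noise = noise[: len(reverberant)]
    p_sig = np.mean(reverberant ** 2)                               # signal power
    p_noise = np.mean(noise ** 2) + 1e-12                           # noise power
    noise = noise * np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return reverberant + noise                                      # x = r * s + n

# Placeholder data: 1 s of "speech", a decaying random "RIR", and white noise.
speech = np.random.randn(16000)
rir = np.random.randn(4000) * np.exp(-np.linspace(0, 8, 4000))
x = make_noisy_reverberant(speech, rir, np.random.randn(16000), snr_db=5.0)
```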

Besides, some researchers think that additive noise and reverberation should be removed at the same time[2], while others think they should be removed separately[3]. There is therefore no definite conclusion at present, but I prefer to remove additive noise and reverberation separately (personal opinion, for reference only).

References:
[1] E.A.P. Habets. Room impulse response generator[J]. Technische Universiteit Eindhoven, Tech. Rep, 2006, 2(2.4): 1.
[2] K. Han, Y. Wang, D. Wang, et al. Learning spectral mapping for speech dereverberation and denoising[J]. IEEE/ACM TASLP, 2015, 23(6): 982-992.
[3] Y. Zhao, Z. Wang, D. Wang. A two-stage algorithm for noisy and reverberant speech enhancement[C]//2017 IEEE ICASSP. IEEE, 2017: 5580-5584.

1.2 Classification of speech enhancement (dereverberation)

In fact, I prefer to call models that need to restore the speech signal for human auditory perception speech enhancement. Models serving other tasks, such as automatic speech recognition (ASR) or speaker recognition, are called feature enhancement. Feature enhancement designs the input and output of the model for a specific task and does not need to reconstruct a speech signal. So in this repository, our speech enhancement (dereverberation) models are all aimed at human auditory perception.

According to the method used, speech enhancement can be divided into traditional speech enhancement and deep learning (machine learning) based speech enhancement. Traditional speech enhancement methods rely on certain assumptions; when dealing with non-stationary noise, their performance degrades greatly, and new distortions and noise, such as musical noise[4], may be introduced. With the continuous improvement of methods and the increase of computing resources, deep learning (machine learning) based speech enhancement shows stronger performance. Based on the above, I roughly classify speech enhancement as follows:

  • Traditional speech enhancement
  • Deep learning (machine learning) based speech enhancement
    • Frequency domain based speech enhancement
    • Time domain based speech enhancement

In frequency domain speech enhancement, a Fourier transform (e.g., the short-time Fourier transform, STFT) is used to transform the time domain signal into a frequency domain representation, while time domain speech enhancement uses the waveform signal directly. Which features are used and how they are handled will be explained in the following sections.
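
For illustration only (a minimal sketch with placeholder parameters of my own), the snippet below contrasts the two representations for a synthetic 16 kHz signal: the STFT gives the time-frequency bins that frequency domain methods work on, while time domain methods consume the waveform x directly.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs) + 0.1 * np.random.randn(fs)

# Frequency domain representation: complex time-frequency bins.
f, t, X = stft(x, fs=fs, nperseg=512, noverlap=384)
magnitude, phase = np.abs(X), np.angle(X)   # typical features / reconstruction terms

# Frequency domain methods enhance |X| (and possibly the phase), then return to a
# waveform with the inverse STFT; time domain methods operate on x itself.
_, x_rec = istft(magnitude * np.exp(1j * phase), fs=fs, nperseg=512, noverlap=384)
```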

References:
[4] A. Hussain, M. Chetouani, S. Squartini, et al. Nonlinear speech enhancement: An overview[M]//Progress in nonlinear speech processing. Springer, Berlin, Heidelberg, 2007: 217-248.

2. Traditional speech enhancement or dereverberation methods (I will show this part in the future.)

3. Deep learning-based speech enhancement (dereverberation) methods

3.1 Basic framework

In Section 1.2, I gave the classification of speech enhancement; the basic framework of each class is shown in the framework diagram of the repository.

In fact, the modules in the diagram have no standard names; I name each part here simply to make the work easier to introduce (please bear with me if you disagree). In my experience, frequency domain speech enhancement can be divided into five modules and time domain speech enhancement into three, because these are the parts from which current speech enhancement systems are generally improved. Everyone's classification is different; here I organize things according to my own ideas. In many papers, the improved model touches more than one module; I classify each paper only by its most important improvement. I have been focusing on speech enhancement for about a year and a half, so there are certainly models I have not covered. Please send corrections by email.

3.2 Frequency domain speech enhancement (Monaural)

Frequency domain speech enhancement algorithms are more mature than time domain ones. I divide them into five modules and will show how frequency domain speech enhancement has been improved from these five aspects.

3.2.1 Feature extraction module

Input Feature
The input feature is very important for the learning of the neural network.

  • The Mel frequency power spectrum (MFP) was used for speech enhancement at INTERSPEECH 2013 [5].
  • At present, the most common feature is the magnitude spectrogram.
  • Log compression of the magnitude spectrogram[6] better matches human hearing.
  • Moreover, with multi-target learning (MTL), combining various features, e.g., mel-frequency cepstral coefficients (MFCC), as input and output, the network can also achieve good results[7]. (A minimal feature-extraction sketch follows this list.)
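
A minimal, self-contained sketch (the parameter values are my own assumptions) of the log-magnitude input feature mentioned above, with a common per-utterance normalization step:

```python
import numpy as np
from scipy.signal import stft

x = np.random.randn(16000)                        # placeholder 1 s waveform at 16 kHz
_, _, X = stft(x, fs=16000, nperseg=512, noverlap=384)

log_mag = np.log(np.abs(X) + 1e-8)                # log-compressed magnitude spectrogram
# Optional per-utterance mean/variance normalization of the feature matrix.
log_mag = (log_mag - log_mag.mean()) / (log_mag.std() + 1e-8)
```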

Learning Targets

  • Using the nonlinear mapping ability of neural networks, the spectrum can be mapped directly; this is called the mapping approach.

  • The masking approach is another common family of learning targets for speech enhancement.

  • The ideal binary mask (IBM)[9] and the ideal ratio mask (IRM)[9] are common masking targets based on the theory of computational auditory scene analysis (CASA)[8].

  • Besides, considering the influence of phase on speech enhancement, the complex ideal ratio mask (cIRM)[10] and the phase-sensitive spectrum approximation (PSA)[11] have also been proposed and used. (A small sketch of these targets follows this list.)
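
A minimal sketch of these learning targets (my own formulation from the standard definitions, not code from the cited papers), given clean, noise, and noisy STFT matrices S, N, and X of equal shape:

```python
import numpy as np

def ideal_binary_mask(S, N, lc_db=0.0):
    """IBM: 1 where the local SNR exceeds the criterion lc_db, else 0."""
    snr_db = 10 * np.log10((np.abs(S) ** 2 + 1e-12) / (np.abs(N) ** 2 + 1e-12))
    return (snr_db > lc_db).astype(np.float32)

def ideal_ratio_mask(S, N):
    """IRM: square root of the speech-to-(speech+noise) energy ratio."""
    return np.sqrt(np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12))

def phase_sensitive_target(S, X):
    """PSA target: |S|/|X| * cos(theta_S - theta_X), using the noisy STFT X."""
    return (np.abs(S) / (np.abs(X) + 1e-12)) * np.cos(np.angle(S) - np.angle(X))
```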

Other Feature Processing

  • In addition to the Fourier transform (FT), other transforms, such as the Z-transform, are also used for feature extraction in speech enhancement.
  • Different filter banks have a great influence on feature extraction.
  • Whether to normalize the features, and how to normalize them, especially for MTL models, is something I will sort out and expand on in future work!

References:
[5] X. Lu, Y. Tsao, S. Matsuda, et al. Speech enhancement based on deep denoising autoencoder[C]//Interspeech. 2013: 436-440.
[6] Y. Xu, J. Du, L. Dai, et al. An experimental study on speech enhancement based on deep neural networks[J]. IEEE Signal processing letters, 2013, 21(1): 65-68.
[7] Y. Xu, J. Du, Z. Huang, et al. Multi-objective learning and mask-based post-processing for deep neural network based speech enhancement[J]. arXiv preprint arXiv:1703.07172, 2017.
[8] D. Wang, G. J. Brown. Computational auditory scene analysis: Principles, algorithms, and applications[M]. Wiley-IEEE press, 2006.
[9] Y. Wang, A. Narayanan, D. Wang. On training targets for supervised speech separation[J]. IEEE/ACM TASLP, 2014, 22(12): 1849-1858.
[10] D. S. Williamson, Y. Wang, D. Wang. Complex ratio masking for joint enhancement of magnitude and phase[C]//2016 IEEE ICASSP. IEEE, 2016: 5220-5224.
[11] H. Erdogan, J. R. Hershey, S. Watanabe, et al. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks[C]//2015 IEEE ICASSP. IEEE, 2015: 708-712.

3.2.2 Inputs module

In fact, this part should be read together with 3.2.1: it concerns which additional inputs can improve the performance of a speech enhancement system.

  • [7] shows that some complementary features can improve enhancement performance.
  • In recent years, adding symbolic information[12] or text information[13] to the network has been shown to improve speech enhancement performance.
  • Visual information can also be added to improve enhancement performance[37].
  • Predicted noise information can be added as well[38].
  • Besides, one must choose whether frame-level[6] or utterance-level[18] features are needed, and whether frame expansion[6] is needed and how many frames to splice (a small splicing sketch follows this list).
  • Moreover, there is other information that can help improve speech enhancement performance, and I will continue to add it.
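
A minimal sketch of frame expansion (context splicing); the 3+1+3 context and the 257-dimensional feature are illustrative assumptions:

```python
import numpy as np

def splice_frames(feats, left=3, right=3):
    """Concatenate each frame with its neighbours, padding the edges by repetition."""
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.concatenate(
        [padded[i : i + len(feats)] for i in range(left + right + 1)], axis=1
    )

# 100 frames of 257-dim log-magnitude become 100 x (257 * 7) spliced network inputs.
spliced = splice_frames(np.random.randn(100, 257), left=3, right=3)
```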

References:
[12] C. Liao, Y. Tsao, X. Lu, et al. Incorporating symbolic sequential modeling for speech enhancement[J]. arXiv preprint arXiv:1904.13142, 2019.
[13] K. Kinoshita, M. Delcroix, A. Ogawa, et al. Text-informed speech enhancement with deep neural networks[C]//Sixteenth Annual Conference of the International Speech Communication Association. 2015.
[18] K. Tan, D. Wang. A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement[C]//Interspeech. 2018: 3229-3233.
[38] Y. Xu, J. Du, L. Dai, et al. Dynamic noise aware training for speech enhancement based on deep neural networks[C]//Fifteenth Annual Conference of the International Speech Communication Association. 2014.

3.2.3 Phase module

After a long period of research, researchers found that phase information is also important for speech enhancement[15]. However, because phase has no obvious regular structure and suffers from problems such as phase wrapping, it is difficult to process directly. Therefore, early deep learning based speech enhancement processed only the magnitude information and then reconstructed the waveform with the noisy phase.

Inconsistency of the ISTFT

  • [14] considers the inconsistency between the enhanced magnitude spectrogram and the noisy phase when the inverse STFT (ISTFT) is applied, and addresses it with differentiable consistency constraints.

Iterative Signal Reconstruction

  • The simplest way to deal with phase is by iteration[16]. First, the enhanced magnitude and the noisy phase are used to reconstruct a waveform; then the phase of the reconstructed waveform is extracted and used to reconstruct the waveform again. Repeating this iteration improves the result. (A sketch follows below.)
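
A minimal Griffin-Lim-style sketch of this iteration (the helper name and parameters are my own assumptions); mag is the enhanced magnitude and noisy_phase the phase of the noisy STFT:

```python
import numpy as np
from scipy.signal import stft, istft

def iterative_reconstruction(mag, noisy_phase, fs=16000, nperseg=512, noverlap=384, n_iter=10):
    phase = noisy_phase
    for _ in range(n_iter):
        # 1) rebuild a waveform from the enhanced magnitude and the current phase
        _, x = istft(mag * np.exp(1j * phase), fs=fs, nperseg=nperseg, noverlap=noverlap)
        # 2) re-extract the phase of that waveform, keeping the enhanced magnitude
        _, _, X = stft(x, fs=fs, nperseg=nperseg, noverlap=noverlap)
        n = min(mag.shape[1], X.shape[1])          # guard against off-by-one frame counts
        mag, phase = mag[:, :n], np.angle(X[:, :n])
    _, x = istft(mag * np.exp(1j * phase), fs=fs, nperseg=nperseg, noverlap=noverlap)
    return x
```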

A Way of Bypassing or Utilizing Phase Information

  • PSA[11] can make use of the phase information and improves performance to a certain extent.
  • The cIRM[10] and time domain speech enhancement can bypass the phase problem.

Phase reconstruction

  • Moreover, phase information can also be used, or even predicted by the network, in combination with the magnitude information[17].

References:
[14] S. Wisdom, J. R. Hershey, K. Wilson, et al. Differentiable consistency constraints for improved deep speech enhancement[C]//2019 IEEE ICASSP. IEEE, 2019: 900-904.
[15] K. Paliwal, K. Wójcicki, B. Shannon. The importance of phase in speech enhancement[J]. speech communication, 2011, 53(4): 465-494.
[16] K. Han, Y. Wang, D. Wang, et al. Learning spectral mapping for speech dereverberation and denoising[J]. IEEE/ACM TASLP, 2015, 23(6): 982-992.
[17] D. Yin, C. Luo, Z. Xiong, et al. PHASEN: A Phase-and-Harmonics-Aware Speech Enhancement Network[C]//AAAI. 2020: 9458-9465.

3.2.4 Enhancement module

Improvement of the Neural Network
The choice of model is very important.

  • From the early DNN[6] to the CNN[19] and the RNN[20], models have gradually become more powerful, and speech enhancement performance has improved accordingly.
  • Combinations of different networks also work well[21]. (A minimal mask-estimation sketch follows this list.)
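
As an illustration (a minimal PyTorch sketch of my own, not the model of any cited paper), a small feed-forward enhancement module that maps a spliced log-magnitude frame to an IRM estimate might look like this:

```python
import torch
import torch.nn as nn

class MaskDNN(nn.Module):
    """Feed-forward mask estimator: spliced log-magnitude in, IRM estimate out."""
    def __init__(self, in_dim=257 * 7, hidden=1024, out_dim=257):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim), nn.Sigmoid(),   # mask values in [0, 1]
        )

    def forward(self, x):                               # x: (batch, in_dim)
        return self.net(x)

model = MaskDNN()
mask = model(torch.randn(8, 257 * 7))                   # estimated IRM for 8 frames
enhanced_mag = mask * torch.rand(8, 257)                # apply the mask to noisy magnitudes
```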

Design of Network Structure

  • The U-Net structure[22] has proved effective in many tasks.
  • Using the idea of adversarial training, the generative adversarial network (GAN)[23][40] lets a discriminator judge the output of the generator; by feeding back some information from the discriminator, it can to a certain extent alleviate the problem that the mean squared error (MSE) loss may not match human hearing.

Loss Function

  • To match human hearing better, some speech enhancement systems also use evaluation metrics as the loss function, e.g., short-time objective intelligibility (STOI)[24], [25].
  • Noise (domain) information can be incorporated into the model in the form of a loss[36].

Attention Mechanism

  • The attention mechanism[26] is becoming more and more common, and attention-based approaches generalize better to unseen noise conditions[27].
  • Besides, different researchers use the attention mechanism to model different units; e.g., [28] uses attention to model different signal channels.

Training Strategy
Considering some characteristics of the network, certain training schemes or structural designs can improve enhancement performance.

  • Transfer learning[29] can improve performance on small databases by training on large databases and fine-tuning on the small ones.
  • With multi-target learning (MTL), combining various features, e.g., mel-frequency cepstral coefficients (MFCC), as input and output, the network can also achieve good results[7].
  • Learning jointly with ASR in an MTL manner can also improve performance[41].
  • Different networks can be trained for different SNR conditions[30].
  • [31] uses progressive learning to enhance the signal step by step; stacking networks[32] is also effective.
  • An additional inverse mapping network can be introduced to reconstruct the noisy features from the enhanced ones (although that paper only reports ASR performance)[44].

Fusion Strategy

  • Moreover, masking and mapping approaches behave differently in different scenarios, which shows a certain complementarity. [33] proposes minimum difference masks to exploit this complementarity and fuses the resulting spectrograms.
  • [34] fuses the enhanced signal in the time domain with the enhanced signal in the frequency domain, which is also effective.

Mutual Enhancement
The combination of the enhancement task with other tasks can make them enhance each other.

  • Phonetic posteriorgrams (PPGs) show a certain correlation with enhancement[35].

References:
[19] S. R. Park, J. Lee. A fully convolutional neural network for speech enhancement[J]. arXiv preprint arXiv:1609.07132, 2016.
[20] L. Sun, J. Du, L. Dai, et al. Multiple-target deep learning for LSTM-RNN based speech enhancement[C]//2017 HSCMA. IEEE, 2017: 136-140.
[21] M. Ge, L. Wang, N. Li, et al. Environment-Dependent Attention-Driven Recurrent Convolutional Neural Network for Robust Speech Enhancement[C]//INTERSPEECH. 2019: 3153-3157.
[22] H. Choi, J. Kim, J. Huh, et al. Phase-aware speech enhancement with deep complex u-net[C]//International Conference on Learning Representations. 2018.
[23] M. H. Soni, N. Shah, H. A. Patil. Time-frequency masking-based speech enhancement using generative adversarial network[C]//2018 IEEE ICASSP. IEEE, 2018: 5039-5043.
[24] M. Kolbæk, Z. Tan, J. Jensen. Monaural speech enhancement using deep neural networks by maximizing a short-time objective intelligibility measure[C]//2018 IEEE ICASSP. IEEE, 2018: 5059-5063.
[25] Y. Zhao, B. Xu, R. Giri, et al. Perceptually guided speech enhancement using deep neural networks[C]//2018 IEEE ICASSP. IEEE, 2018: 5074-5078.
[26] A. Vaswani, N. Shazeer, N. Parmar, et al. Attention is all you need[C]//Advances in neural information processing systems. 2017: 5998-6008.
[27] X. Hao, C. Shan, Y. Xu, et al. An attention-based neural network approach for single channel speech enhancement[C]// 2019 IEEE ICASSP. IEEE, 2019: 6895-6899.
[28] B. Tolooshams, R. Giri, A. H. Song, et al. Channel-Attention Dense U-Net for Multichannel Speech Enhancement[C]//ICASSP 2020. IEEE, 2020: 836-840.
[29] Y. Xu, J. Du, L. Dai, et al. Cross-language transfer learning for deep neural network based speech enhancement[C]//The 9th International Symposium on Chinese Spoken Language Processing. IEEE, 2014: 336-340.
[30] S. Fu, Y. Tsao, X. Lu. SNR-Aware Convolutional Neural Network Modeling for Speech Enhancement[C]//Interspeech. 2016: 3768-3772.
[31] T. Gao, J. Du, L. Dai, et al. Densely connected progressive learning for lstm-based speech enhancement[C]//2018 IEEE ICASSP. IEEE, 2018: 5054-5058.
[32] Z. Wang, D. Wang. Recurrent deep stacking networks for supervised speech separation[C]//2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017: 71-75.
[33] H. Shi, L. Wang, M. Ge, et al. Spectrograms Fusion with Minimum Difference Masks Estimation for Monaural Speech Dereverberation[C]//ICASSP 2020. IEEE, 2020: 7544-7548.
[34] J. Kim, J. Yoo, S. Chun, et al. Multi-domain processing via hybrid denoising networks for speech enhancement[J]. arXiv preprint arXiv:1812.08914, 2018.
[35] Z. Du, M. Lei, J. Han, et al. Pan: Phoneme-Aware Network for Monaural Speech Enhancement[C]//ICASSP 2020. IEEE, 2020: 6634-6638.
[36] C. Liao, Y. Tsao, H. Lee, et al. Noise adaptive speech enhancement using domain adversarial training[J]. arXiv preprint arXiv:1807.07501, 2018.
[37] T. Afouras, J. Chung, A. Zisserman. The conversation: Deep audio-visual speech enhancement[J]. arXiv preprint arXiv:1804.04121, 2018.
[40] D. Michelsanti, Z. Tan. Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification[J]. arXiv preprint arXiv:1709.01703, 2017.
[41] Z. Chen, S. Watanabe, H. Erdogan, et al. Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks[C]//Sixteenth Annual Conference of the International Speech Communication Association. 2015.
[44] Z. Meng, J. Li, Y. Gong. Cycle-consistent speech enhancement[J]. arXiv preprint arXiv:1809.02253, 2018.

3.2.5 Post-processing module

By summing the signals of multiple systems in certain proportions, a gain over the individual systems can be obtained[46], as sketched below.
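
A minimal sketch of this weighted-sum post-processing; the two random "system outputs" and the 0.6/0.4 weights are placeholders, not values from [46]:

```python
import numpy as np

def fuse_outputs(signals, weights):
    """Weighted sum of several enhanced waveforms of equal length."""
    signals = np.stack(signals)                  # (num_systems, num_samples)
    weights = np.asarray(weights)[:, None]
    return np.sum(weights * signals, axis=0)

fused = fuse_outputs([np.random.randn(16000), np.random.randn(16000)], [0.6, 0.4])
```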

References:
[46] Y. Tu, I. Tashev, S. Zarar, et al. A hybrid approach to combining conventional and deep learning techniques for single-channel speech enhancement and recognition[C]//2018 IEEE ICASSP. IEEE, 2018: 2531-2535.

3.2.6 Combined with traditional speech enhancement

  • Combination with the minimum mean-square error (MMSE) approach[47].
  • Combination with weighted prediction error (WPE) dereverberation[49].

References:
[47] A. Nicolson, K. K. Paliwal. Deep learning for minimum mean-square error approaches to speech enhancement[J]. Speech Communication, 2019, 111: 44-55.
[49] K. Kinoshita, M. Delcroix, H. Kwon, et al. Neural Network-Based Spectrum Estimation for Online WPE Dereverberation[C]//Interspeech. 2017: 384-388.

3.3 Frequency domain speech enhancement (Multi-channel)

3.4 Time domain speech enhancement (dereverberation)

Time domain speech enhancement enhances the waveform signal in an end-to-end manner: it does not need hand-crafted features and can bypass the phase problem. Moreover, in much of the literature, time domain speech enhancement is basically built on the convolutional neural network (CNN) or the fully convolutional network (FCN).

3.4.1 Why use FCN (CNN) in time-domain

[39] shows that the FCN has advantages over the fully connected network when processing time domain signals (a minimal sketch follows below).
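
A minimal PyTorch sketch (my own, not the FCN of [39]) of a 1-D fully convolutional network that maps a noisy waveform to an enhanced waveform of the same length:

```python
import torch
import torch.nn as nn

class WaveFCN(nn.Module):
    """Small 1-D FCN: no fully connected layers, so any input length is accepted."""
    def __init__(self, channels=64, kernel=11):
        super().__init__()
        pad = kernel // 2
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel, padding=pad), nn.ReLU(),
            nn.Conv1d(channels, 1, kernel, padding=pad),
        )

    def forward(self, x):                 # x: (batch, 1, num_samples)
        return self.net(x)

enhanced = WaveFCN()(torch.randn(4, 1, 16000))   # same length in, same length out
```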

References:
[39] S. Fu, Y. Tsao, X. Lu, et al. Raw waveform-based speech enhancement by fully convolutional networks[C]//2017 APSIPA ASC. IEEE, 2017: 006-012.

3.4.2 Neural network structure

  • SEGAN[43] applies a generative adversarial network directly to the waveform.
  • TCNN[45] uses a temporal convolutional neural network for real-time enhancement in the time domain.

References:
[43] S. Pascual, A. Bonafonte, J. Serra. SEGAN: Speech enhancement generative adversarial network[J]. arXiv preprint arXiv:1703.09452, 2017.
[45] A. Pandey, D. Wang. TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain[C]//2019 IEEE ICASSP. IEEE, 2019: 6875-6879.

3.4.3 Loss function

  • The loss can also be computed in the frequency domain[42] (a small sketch follows this list).
  • [48] compares several loss functions for time domain models.
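
A minimal PyTorch sketch of the frequency domain loss idea of [42] (the window parameters are my own assumptions): a time domain output is compared with the target on STFT magnitudes.

```python
import torch

def stft_magnitude_loss(estimate, target, n_fft=512, hop=128):
    """Mean absolute error between STFT magnitudes of two waveform batches."""
    window = torch.hann_window(n_fft)
    E = torch.stft(estimate, n_fft, hop_length=hop, window=window, return_complex=True)
    T = torch.stft(target, n_fft, hop_length=hop, window=window, return_complex=True)
    return torch.mean(torch.abs(E.abs() - T.abs()))

loss = stft_magnitude_loss(torch.randn(2, 16000), torch.randn(2, 16000))
```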

References:
[42] A. Pandey, D. Wang. A New Framework for Supervised Speech Enhancement in the Time Domain[C]//Interspeech. 2018: 1136-1140.
[48] M. Kolbæk, Z. Tan, S. Jensen, et al. On loss functions for supervised monaural time-domain speech enhancement[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 825-838.

4. Public datasets

5. Performance comparison

References:

6. Future trends

7. Tools

8. Acknowledgements

During my one and a half years of study so far, I would like to thank my supervisors, Prof. Longbiao Wang (Tianjin University, China), Sheng Li (National Institute of Information and Communications Technology (NICT), Japan), and Prof. Jianwu Dang (Tianjin University, China), as well as my senior labmate, doctoral student Meng Ge (Tianjin University, China), for their guidance and care. I hope I can successfully apply for a doctoral degree and have the opportunity to discuss speech enhancement or other speech topics with you!
