Multiresolution Cochleagram Speech Enhancement Algorithm Using Improved Deep Neural Networks with Skip Connections

Deep learning based methods have recently become benchmark methods for speech enhancement. However, these approaches perform poorly in low signal-to-noise ratio (SNR) conditions, suffering from speech loss and low intelligibility. To address this problem, we improve the Multi-Resolution Cochleagram (MRCG): a gammachirp filter bank is used to decompose the speech signal in time and frequency, and the low-resolution signal is denoised by the minimum mean-square error short-time spectral amplitude estimator (MMSE-STSA). The Improved Multi-Resolution Cochleagram (I-MRCG) is adopted as the input feature of a DNN with skip connections (Skip-DNN). In this paper, the source-to-distortion ratio (SDR) is used as the loss function in the training process, and its logarithm is introduced to observe the iterative process more clearly. Experiments were performed on the TIMIT database with four noise types at four SNR levels. With I-MRCG as the input feature of the Skip-DNN model, the average PESQ is 2.6783 and the average STOI is 0.8752, increases of 1.4% and 1.5%, respectively, over MRCG. This shows that when I-MRCG is the input feature of the Skip-DNN model, the speech enhancement effect after training is better than with other features. It not only alleviates speech loss in low-SNR environments but also yields more robust speech enhancement. The loss function experiment shows that, compared with MSE and SDR, the improved SDR as the loss function of the speech enhancement model yields the best enhancement effect.


Introduction
Speech enhancement is an extensively studied research area with many applications, such as telecommunication and human-computer interaction.
In telecommunication, received speech signals are often corrupted by background noise; in a military setting, for example, communication can be difficult due to complex battlefield conditions and the demand for high-quality speech. In human-computer interaction, some intelligent devices have begun to use speech instead of the keyboard as input; however, this requires high-performance speech recognition in real-world noisy environments. In these applications, noisy speech needs to be enhanced to improve its intelligibility and quality for more effective communication or interaction.
Speech enhancement can be performed in a supervised or unsupervised manner. Spectral subtraction [1] and Wiener filtering [2] are commonly used unsupervised methods. However, these methods assume stationary noise, whereas most noise signals in real life are non-stationary. Martin [4] proposed a speech enhancement algorithm based on minimum statistics for non-stationary noise signals. Previous studies have shown that unsupervised speech enhancement performs well in environments with high SNR and stationary noise, but poorly in environments with low SNR and non-stationary noise [5][6].
Supervised algorithms, including shallow neural networks and deep neural networks, have been developed to enhance speech corrupted by non-stationary noise [7][8]. In the non-negative matrix factorization speech enhancement algorithm, a shallow approach, clean speech and noise are trained separately to obtain good speech enhancement performance [9][10].
Because a shallow neural network has few layers, it fits the test data poorly and can extract only simple features, which leads to a poor speech enhancement effect. Recently, DNNs have been applied to speech enhancement [11][12][13]. A DNN can be trained to estimate the time-frequency mask between clean speech and noise, which greatly improves the intelligibility of speech. The mapping relationship between noisy and clean speech spectra is learned by the DNN model; in addition, "dropout" is used to prevent overfitting, and the mini-batch stochastic gradient descent algorithm is used to accelerate training [14][15][16]. A DNN model has also been utilized to predict the complex ratio mask and then the amplitude and phase of the speech, which compensates for the phase offset caused by noise [17]. Chen proposed using the multi-resolution cochleagram to obtain global and local features of speech and improve speech enhancement performance at low SNR [18]. To improve the generalization ability of the DNN model, Chen et al. added noise to the data during training to diversify the noise and improve speech enhancement performance [19]. Tu et al. used DNNs to predict the target and interference separately, and the accuracy of the estimated target speech was significantly improved in speech recognition [20]. Tu and Zhang proposed a Skip-DNN model that maps the Mel-frequency power spectra of noisy speech to those of clean speech, which helps resolve the problem of vanishing gradients and improves speech enhancement performance [21]. Tseng et al. utilized sparse non-negative matrix factorization to extract speech features and a DNN model to estimate the ideal binary mask (IBM), which improves the intelligibility of enhanced speech in low-SNR environments [22].
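To make the skip-connection idea concrete, the following is a minimal sketch of a Skip-DNN in PyTorch. The layer widths, activation function, dropout rate, and output dimension are illustrative assumptions, not the exact configuration of [21]; the point is that each hidden layer adds an identity path, which mitigates vanishing gradients.

```python
import torch
import torch.nn as nn

class SkipDNN(nn.Module):
    """Fully connected DNN with skip (identity) connections between
    hidden layers. Widths, activations, and dropout rate here are
    illustrative assumptions rather than the configuration of [21]."""

    def __init__(self, in_dim, hidden_dim=1024, out_dim=161, num_hidden=3):
        super().__init__()
        self.input_proj = nn.Linear(in_dim, hidden_dim)
        self.hidden = nn.ModuleList(
            [nn.Linear(hidden_dim, hidden_dim) for _ in range(num_hidden)]
        )
        self.act = nn.ReLU()
        self.dropout = nn.Dropout(p=0.2)  # "dropout" against overfitting
        self.out = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        h = self.act(self.input_proj(x))
        for layer in self.hidden:
            # identity path h + f(h): gradients can bypass each layer
            h = h + self.dropout(self.act(layer(h)))
        # sigmoid keeps the predicted mask in [0, 1], matching the IRM range
        return torch.sigmoid(self.out(h))
```

Such a model would then be trained with mini-batch stochastic gradient descent, as described above.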

Speech enhancement based on deep learning
There are often three main components in a deep learning based speech enhancement algorithm: learning machines, training targets, and acoustic features [23][24]. In general, we assume the noise is additive in speech enhancement. The noisy signal is defined as

$$Y(t) = S(t) + N(t), \quad (1)$$

where $Y(t)$ denotes the noisy signal, $S(t)$ denotes the clean speech signal, and $N(t)$ denotes the noise signal. In the time-frequency domain, the noisy signal can be expressed as

$$Y(t,f) = S(t,f) + N(t,f). \quad (2)$$

The ideal ratio mask (IRM) is then defined as

$$\mathrm{IRM}(t,f) = \left(\frac{|S(t,f)|^{2}}{|S(t,f)|^{2} + |N(t,f)|^{2}}\right)^{\beta}, \quad (3)$$

where $|S(t,f)|^{2}$ and $|N(t,f)|^{2}$ denote the speech energy and the noise energy within a time-frequency unit, respectively. The tunable parameter $\beta$ scales the mask and is commonly chosen to be $1/2$.
In this study, we use the IRM as our training target.
According to (3), we can estimate the clean speech as

$$|\hat{S}(t,f)| = \mathrm{IRM}(t,f) \otimes |Y(t,f)|, \quad (4)$$

where the symbol $\otimes$ denotes element-wise multiplication and $|\hat{S}(t,f)|$ denotes the magnitude of the recovered signal.
Then, the enhanced speech, which combines the estimated magnitude spectrum with the phase of the noisy speech, is obtained as

$$\hat{S}(t) = \mathrm{ISTFT}\big(|\hat{S}(t,f)|\, e^{j\varphi_{Y}(t,f)}\big), \quad (5)$$

where $\varphi_{Y}(t,f)$ denotes the phase of the noisy speech and $\hat{S}(t)$ denotes the reconstructed clean speech signal.
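As a concrete illustration of (1)-(5), the sketch below computes the oracle IRM with SciPy's STFT and reconstructs the signal with the noisy phase. The 16 kHz sampling rate and 32 ms frame length are assumptions for the example; in the actual system the mask is predicted by the Skip-DNN rather than computed from the known noise.

```python
import numpy as np
from scipy.signal import stft, istft

def irm_enhance(noisy, clean, noise, fs=16000, beta=0.5):
    """Oracle IRM masking per Eqs. (1)-(5): mask the noisy magnitude
    and reconstruct with the noisy phase. The 16 kHz rate and 32 ms
    frame length are assumptions for this sketch."""
    nperseg = 512  # 32 ms frames at 16 kHz (within short-time stationarity)
    _, _, Y = stft(noisy, fs=fs, nperseg=nperseg)
    _, _, S = stft(clean, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)

    # Eq. (3): IRM from speech/noise energies per T-F unit, beta = 1/2
    irm = (np.abs(S) ** 2 / (np.abs(S) ** 2 + np.abs(N) ** 2 + 1e-12)) ** beta

    # Eq. (4): element-wise masking of the noisy magnitude
    mag_hat = irm * np.abs(Y)

    # Eq. (5): recombine with the noisy phase and invert to the time domain
    _, s_hat = istft(mag_hat * np.exp(1j * np.angle(Y)), fs=fs, nperseg=nperseg)
    return s_hat
```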
In this paper, STOI and PESQ are utilized to evaluate speech quality. STOI is an objective evaluation method for speech intelligibility.
It operates on short-time units and excludes silent units.
The STOI value is in the range of 0 to 1; the larger the STOI value, the higher the intelligibility. PESQ is an objective evaluation method for speech quality. The evaluation range of PESQ is between 1.0 and 4.5; the higher the score, the better the speech quality.
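For reference, both metrics are available in third-party Python packages (pystoi and pesq); these packages are an assumption of this sketch, not tools named in the paper.

```python
from pystoi import stoi  # pip install pystoi
from pesq import pesq    # pip install pesq

def evaluate(clean, enhanced, fs=16000):
    """Score an enhanced utterance against its clean reference.
    STOI is in [0, 1] (higher = more intelligible); PESQ is higher
    for better perceived quality."""
    stoi_score = stoi(clean, enhanced, fs, extended=False)
    pesq_score = pesq(fs, clean, enhanced, 'wb')  # wideband mode at 16 kHz
    return stoi_score, pesq_score
```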

Speech Feature Analysis
The speech signal is short-time stationary over segments of approximately 10-30 ms. To mitigate non-stationary and time-varying effects, the speech signal is therefore processed frame by frame in short segments. Because the mean square error only computes a simple similarity between a given target and an estimated target, Vincent et al. [25] proposed in 2010 a more detailed quantification method that evaluates speech quality in the way the SNR reflects perceived quality; it was later adopted as a loss function to measure the prediction performance of deep learning based speech enhancement models. The SDR can be expressed as

$$\mathrm{SDR} = 10\log_{10}\frac{\|y\|^{2}}{\|y-\hat{y}\|^{2}}, \quad (6)$$

where $y$ represents the actual value and $\hat{y}$ represents the predicted value.
During the experiment, maximizing (6) can be simplified to maximizing

$$L = \frac{\langle y, \hat{y}\rangle^{2}}{\|y\|^{2}\,\|\hat{y}\|^{2}}, \quad (7)$$

which lies between 0 and 1. In this paper, the logarithm of this simplified result is taken, so the improved loss function takes values in $(-\infty, 0]$, which makes it more convenient to observe the change of the loss during iteration.
At the same time, the logarithmic curve shows that the closer the logarithmic loss value is to 0, the more stable it becomes. The input layer speech feature is set to 3 frames, and the output layer speech feature is set to 1 frame.
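A sketch of the improved loss, under the reconstruction in (7) above: the simplified SDR lies in $[0, 1]$, its logarithm in $(-\infty, 0]$, and negating it gives a non-negative quantity to minimize. It is written in PyTorch so it can drive the Skip-DNN training directly; the epsilon terms are an assumption added for numerical safety.

```python
import torch

def improved_sdr_loss(y_true, y_pred, eps=1e-8):
    """Log of the simplified SDR in Eq. (7), negated for minimization.
    <y, y_hat>^2 / (||y||^2 ||y_hat||^2) is in [0, 1]; its log
    approaches 0 as the prediction improves."""
    inner = torch.sum(y_true * y_pred, dim=-1) ** 2
    norms = torch.sum(y_true ** 2, dim=-1) * torch.sum(y_pred ** 2, dim=-1)
    simplified = inner / (norms + eps)                    # Eq. (7), in [0, 1]
    return -torch.mean(torch.log(simplified + eps))       # >= 0, minimize
```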
During the processing of the experimental data below, the speech and noise signals are selected according to the above criteria.
In this paper, f16, babble, factory, and white are used as the four noise types.

Conclusion
In this paper, I-MRCG is used as the input feature of a Skip-DNN speech enhancement model trained with the improved SDR as the loss function. I-MRCG carries more speech information than MRCG, improving the average PESQ and STOI by 1.4% and 1.5%, respectively. The proposed method alleviates speech loss in low-SNR environments and yields more robust speech enhancement, and the loss function experiments show that the improved SDR outperforms both MSE and the original SDR as the training loss.