Ideal Ratio Mask Estimation using Supervised DNN Approach for Target Speech Signal Enhancement

The most challenging task in recent Speech Enhancement (SE) systems is removing non-stationary noises and additive white Gaussian noise in real-time applications. Several suggested SE techniques were not successful at eliminating noise from speech signals in real-time scenarios because of their high resource utilization. So, a Sliding Window Empirical Mode Decomposition including a Variant of Variational Mode Decomposition and Hurst (SWEMD-VVMDH) technique was developed to minimize this difficulty in real-time applications. However, this is a statistical framework that takes a long time for computations. Hence, in this article, the SWEMD-VVMDH technique is extended using a Deep Neural Network (DNN) that efficiently learns the speech signals decomposed via SWEMD-VVMDH to achieve SE. At first, the noisy speech signals are decomposed into Intrinsic Mode Functions (IMFs) by the SWEMD Hurst (SWEMDH) technique. Then, Time-Delay Estimation (TDE)-based VVMD is performed on the IMFs to select the most relevant IMFs according to the Hurst exponent and to lessen the low- as well as high-frequency noise elements in the speech signal. For each signal frame, the target features are chosen and fed to the DNN, which learns these features to estimate the Ideal Ratio Mask (IRM) in a supervised manner. The abilities of the DNN are enhanced with respect to the categories of background noise and the Signal-to-Noise Ratio (SNR) of the speech signals. Also, the noise-category dimension and the SNR dimension are chosen for training and testing manifold DNNs, since these dimensions are often taken into account in SE systems. Further, the IRM in each frequency channel for all noisy signal samples is concatenated to reconstruct the noiseless speech signal. At last, the experimental outcomes exhibit considerable improvement in SE under different categories of noise.


I. INTRODUCTION
In the globalized era, the most essential requirement for SE is rejecting microphone interference caused by noise in speech utterances. Many researchers have designed SE techniques for diminishing different categories of noise in speech signals [1]. Accurate assessment of the noise information is the key challenge in SE systems. Standard assessments are primarily focused on Voice Activity Detectors (VAD) [2][3]. Later, the power distribution of noise characteristics was modeled as a smooth modification of its prior ranges over the speech duration. This approach offers fair reliability for stationary noises, but the time-varying spectrum was not precisely assessed. For long speech segments and poor SNR, recognizing non-stationary noises becomes even more difficult [4]. These criteria were tackled by different power-spectrum-based strategies [5][6].
SE systems based on Time-Frequency (TF) analysis [7] were suggested in preceding decades using the EMD method [8]. The speech signals were decomposed into a set of oscillatory IMFs and a residual component [9]. This method does not require a predefined set of basis components to analyze the target signal effectively, nor is it limited to stationary signals. So, a new EMD-based SE technique was suggested to resolve issues in non-stationary noisy scenarios [10]. Each IMF's noise characteristics were defined and selected via its Hurst exponent. Also, IMF selection and speech recovery were conducted frame by frame, accounting for target prediction characteristics and reliability. Regardless, this technique has high time and energy consumption. Therefore, a significant improvement may not be achieved under Babble noise scenarios.
So, the SWEMDH technique [11] was recommended, performing the EMD estimation in a fairly limited window that slides along the time axis. The window size depends on the frequency spectrum of the speech signal. The specific IMF distortions between windows were eliminated through the number of modes and the filtering iterations. For every element, the number of filtering iterations can be adjusted based on the sampling rate, the assessed signal, its consistency, and the frequency band. This was done by decomposing the signal windows according to the usual method and calculating the actual number of filtering iterations. Also, the window sliding process was achieved by extracting the features related to the window after obtaining the mode in the current iteration. These features were accumulated in the IMF array with a suitable time index. The window dimension may be too small for spline interpolation, so different window dimensions were needed for every mode. In this system, the default window dimension was used. The time indexes of the initial windows in the current iteration were stored. If the mode was extracted efficiently, then this index was incremented and the window for this mode slid forward in the subsequent iteration. If there were not adequate extrema, then the sliding task was terminated and the index of the initial window for the current mode was not shifted; in the subsequent iteration, the window size was doubled. On the contrary, the SWEMDH technique was not successful in backgrounds with white noise. As a result, a SWEMD-VVMDH technique was developed for minimizing this difficulty in real-time applications [12]. First, the noisy speech signals were decomposed into IMFs using the SWEMDH technique. Then, the TDE-based VVMD was performed on the IMFs to select the most relevant IMFs according to the Hurst exponent and to lessen the low- as well as high-frequency noise elements in the speech signal.
Accordingly, SE was achieved under various noise scenarios including additive white Gaussian noise, street noise, Babble noise, airport noise, and so on. Nonetheless, these are statistical frameworks that take a long time for computations. Hence, advanced deep learning techniques are needed to reduce the computational difficulty and improve the speech signal quality efficiently. Therefore, in this paper, an SE technique is developed using a DNN that learns the speech signals decomposed via SWEMD-VVMDH for enhancing the Speech Quality (SQ) efficiently. The main goal of SWEMD-VVMDH-DNN is to execute a systematic analysis of the generalization abilities of the SE technique based on the estimated SQ and Speech Intelligibility (SI). For each signal sequence/frame, the target features are chosen and fed to the DNN, which learns these features to estimate the IRM in a supervised manner. The abilities of the DNN are enhanced with respect to the categories of background noise and the SNR of the target speech signals. Also, the DNN framework is varied and measured as a dimension. Besides, the noise-category dimension and the SNR dimension are chosen for training and testing manifold DNNs, since these dimensions are often taken into account in SE systems.
Further, the IRM in each frequency channel for all noisy signal samples is concatenated to reconstruct the noiseless speech signal.
The remainder of this article is organized as follows: Section II presents previous studies related to SE systems using deep learning methods. Section III explains the SWEMD-VVMDH-DNN technique and Section IV displays its performance. Section V summarizes the work.

II. LITERATURE SURVEY
Xu et al. [13] designed a huge training set and fed it to a DNN for classifying speech signals. Also, global variance equalization was used to avoid the over-smoothing issue of the regression framework. Further, dropout and noise-aware training methods were employed for enhancing the generalization ability of the DNN to unseen noise conditions. However, only a small amount of data was used, which cannot achieve good coverage of various acoustic situations such as speaker and morphological inconsistencies.
Williamson et al. [14] presented a scheme for enhancing the perceptual quality of speech segregated from background noise at low SNRs. First, an IRM was determined to segregate the speech from noise with reasonable sound quality. After that, a DNN was used for approximating the clean speech by determining the activation weights from the ratio-masked speech, where the weights linearly pool components from a Nonnegative Matrix Factorization (NMF) speech framework. But, the performance was not effective.
Lee et al. [15] defined a new Spectro-Temporal Detection-Reconstruction (STDR) scheme in which speech was extracted from background noise by continuously learning a spectro-temporal feature space. Also, a static nonlinearity was applied for projecting the noisy speech. Then, time-frequency gains were decoded together to adjust the noisy speech and create a clean speech estimate. But, the performance was not effective when considering a small amount of data.
Huang et al. [16] developed a common monaural source segregation scheme that mutually models all sources within a mixture as targets of a Deep Recurrent Neural Network (DRNN). In this scheme, the differences between the actual mixture and the output predictions were exploited via time-frequency mask functions, and the time-frequency functions were jointly optimized through the DRNN. In contrast, the vanishing-gradient properties of the DRNN were not addressed, which may affect the performance.
Chen et al. [17][18] suggested a DNN-based supervised speech segregation model with large-scale training for enhancing speech intelligibility. The trained DNN was employed to segregate speech from noises such as multi-talker babble and cafeteria noise. But, it was not suitable for real-time use since it considers only magnitude spectra. Also, the generalizability dimensions were predetermined, i.e., the dimensions of the feed-forward DNN were fixed.
Amodei et al. [19] developed an RNN with one or more convolutional layers followed by many recurrent layers, one fully connected layer, and a softmax layer. The entire network was trained end-to-end via the Connectionist Temporal Classification (CTC) loss function, which facilitates direct recognition of sequences of characters from the input speech signal. Conversely, training became difficult when increasing the network's size and depth, i.e., when using more recurrent layers.
Zhang et al. [20] proposed multi-context networks for monaural speech segregation. The initial multi-context network averaged the outputs of many DNNs whose inputs use various window lengths. The second multi-context network stacked multiple DNNs, where every DNN takes the concatenation of the actual acoustic features and the soft output of the lower unit as its input to identify the IRM of the target speaker. But, gradient-descent-based training may cause vanishing-gradient problems and degrade the efficiency.
Vidya et al. [21] preferred the Hilbert-Huang transform, which consists of EMD for producing the IMFs and Hilbert spectral analysis for detecting the regional properties of the speech signal. Also, the Hilbert amplitude spectrum and phase spectrum were investigated to present the instantaneous amplitude and phase of a speech signal. Moreover, the marginal and normalized Hilbert spectra were determined. But, the analysis needs further development for subsequent processing.
Mukherjee et al. [22] suggested a VAD method that utilizes Line Spectral Frequency-based statistical attributes called LSF-S merged with an extreme-learning-based classifier. First, the audio signals were pre-processed and attributes of multifarious sizes were extracted. These attributes were given to the extreme-learning-based classifier to detect voice and non-voice signals. But, it does not handle live audio signals, and its robustness under different noise scenarios needs further analysis.
Karjol et al. [23] presented a variant of a multiple-DNN-based SE scheme that directly estimates the clean speech spectrum as a weighted mean of the outputs from manifold DNNs. Initially, the weights were obtained via a gating network. After that, the manifold DNNs and the gating network were trained together. Moreover, the objective function was defined as the mean square logarithmic error between the target clean spectrum and the estimated spectrum. In contrast, it requires optimization of the DNN parameters for further performance improvement.
Zhao et al. [24] suggested a short-time objective intelligibility factor in the loss along with the Mean Squared Error (MSE) for improving the SE system. The perceptually guided loss was optimized using a DNN via reinforcement learning with previously defined time-varying rewards and a group of mask templates for further improving the speech intelligibility. However, the efficiency was degraded due to the fixed hyper-parameters during training.

Saleem et al. [25] suggested a Relative Spectral Transform-Perceptual Linear Prediction (RASTA-PLP) for extracting the acoustic features at the frame level. Also, an Auto-Regressive Moving Average (ARMA) filter was used for smoothing the temporal curves of the extracted features. Then, a less aggressive Wiener filter was employed as an extra layer on top of a DNN for generating an improved magnitude spectrum. At last, the noisy speech phase was used for restoring the enhanced speech. But, the parameters used in the DNN were fixed, i.e., a fixed-size DNN cannot effectively improve the performance.
Wang et al. [26] developed a DNN framework for every microphone to improve the recorded noisy speech signal, and all the noisy recordings were fused into a huge feature structure. A channel-dependent DNN was applied for enhancing the respective noisy input, and all the channel-wise improved outcomes were fed to the DNN fusion framework for creating an almost clean signal. On the other hand, the recording channels were not selectively chosen; all channels were used as input.
Khaleelur Rahiman et al. [27] decomposed the noisy speech signal into frames, which were fed to a deep Convolutional Neural Network (CNN) for estimating the frequency channel. Then, the speech-dominated cochlear implant channels were considered for generating the electrical stimulation according to the estimated frequency channels. But, parameters such as the window function and stride used in the CNNs were fixed.
Llombart et al. [28] developed progressive SE using convolutional and residual neural network structures. In this system, two different conditions were used for optimizing the loss factor: weighted and homogeneous progressive. But, it needs more signals to provide better performance.

III. PROPOSED METHODOLOGY
In this section, the proposed SWEMD-VVMDH-DNN technique for achieving SE is explained briefly. The major tasks in the SWEMD-VVMDH-DNN technique are depicted in Figure 1.

Speech Signals Decomposition using SWEMD-VVMDH Technique
First, the noisy speech signals are decomposed into a group of IMFs using the SWEMDH scheme [11]. Let N be the number of IMFs extracted. For every IMF, the Hurst exponent H is calculated, and the IMFs satisfying the Hurst-exponent threshold criterion are selected as the most relevant. Then, the sum of the selected IMFs, together with the input parameters, is fed to the VVMD method for decomposition into a group of Narrow-Band Components (NBCs) [12].
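The Hurst-based selection step can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the rescaled-range (R/S) estimator, the threshold value, and the keep/discard direction are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def hurst_rs(x):
    """Estimate the Hurst exponent of a 1-D signal via rescaled-range (R/S) analysis."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Average R/S over segments at several window sizes
    sizes = [n // k for k in (2, 4, 8, 16) if n // k >= 8]
    log_rs, log_n = [], []
    for size in sizes:
        rs_vals = []
        for start in range(0, n - size + 1, size):
            seg = x[start:start + size]
            z = np.cumsum(seg - seg.mean())
            r = z.max() - z.min()   # range of cumulative deviations
            s = seg.std()           # standard deviation of the segment
            if s > 0:
                rs_vals.append(r / s)
        if rs_vals:
            log_rs.append(np.log(np.mean(rs_vals)))
            log_n.append(np.log(size))
    # The Hurst exponent is the slope of log(R/S) against log(window size)
    slope, _ = np.polyfit(log_n, log_rs, 1)
    return slope

def select_imfs(imfs, h_threshold=0.8):
    """Illustrative selection rule: keep IMFs whose Hurst exponent exceeds a threshold
    (white-noise-like components tend to have lower H than persistent speech components)."""
    return [imf for imf in imfs if hurst_rs(imf) > h_threshold]
```

White noise yields an estimate near 0.5, while a strongly persistent signal (e.g., a random walk) scores close to 1, so a threshold between the two separates the components.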

Feature Extraction and Labeling
The selection of training targets such as IRMs is desirable. So, the DNN is trained in a supervised manner to estimate the IRM from a feature representation of a noisy speech signal. The TF representation used for making the IRM is based on a gammatone filter bank with 64 filters spaced on a Mel-frequency scale from 50 Hz to 8 kHz, each with a bandwidth equal to one Equivalent Rectangular Bandwidth (ERB). The output of the filter bank is split into 20 ms frames with 10 ms overlap at a sampling frequency of 16 kHz, so every TF unit denotes a vector of 320 samples. This framing and sampling frequency are highly satisfactory, i.e., near-optimal for achieving a good spectral resolution of speech signals, and are commonly preferred for reconstructing a speech signal from its short-time magnitude spectrum. Let S(t, f) indicate the TF unit of the noiseless speech signal at frame t and frequency channel f. Also, let N(t, f) be the corresponding TF unit of the noise signal.
After that, the IRM is determined as:

IRM(t, f) = S^2(t, f) / (S^2(t, f) + N^2(t, f))
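As a sketch, the per-TF-unit IRM computation can be written as follows; the compression exponent `beta` and the small constant guarding division by zero are illustrative additions, not from the original formulation.

```python
import numpy as np

def ideal_ratio_mask(clean_tf, noise_tf, beta=0.5):
    """IRM per TF unit: (S^2 / (S^2 + N^2))^beta, bounded in [0, 1]."""
    s2 = np.abs(clean_tf) ** 2
    n2 = np.abs(noise_tf) ** 2
    # Tiny epsilon avoids 0/0 in silent TF units
    return (s2 / (s2 + n2 + 1e-12)) ** beta
```

Values near 1 indicate speech-dominated TF units and values near 0 noise-dominated ones; with `beta = 1` the expression reduces to the plain energy ratio.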

DNN Architecture and Training
The DNNs follow a feed-forward framework with an 1845-dimensional input layer, three hidden layers, and 64 output units. Each hidden layer has 1024 hidden units. The activation functions for the hidden units are Rectified Linear Units (ReLUs), and the sigmoid function is used for the output units. The DNN structure for SE is portrayed in Figure 2, wherein x1, …, xn are the input units, i.e., the speech signal features, and the output is the estimated mask IRM(t, f).
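The described topology (1845 inputs, three ReLU hidden layers of 1024 units each, 64 sigmoid outputs) can be sketched in NumPy as below. This is an illustrative forward pass only, not the authors' implementation; the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    """Glorot (Xavier) uniform initialization for one weight matrix."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 1845-dim input, three hidden layers of 1024 ReLU units, 64 sigmoid outputs
dims = [1845, 1024, 1024, 1024, 64]
weights = [glorot_uniform(a, b) for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]

def forward(x):
    """Forward pass: ReLU in the hidden layers, sigmoid at the output (mask in [0, 1])."""
    h = x
    for w, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ w + b)
    return sigmoid(h @ weights[-1] + biases[-1])
```

Counting the weights and biases of this topology gives roughly 4.06M parameters, consistent with the "nearly 4M tunable parameters" figure below.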

Figure 2. Architecture of DNN for SE
The hidden layers are initialized via the Glorot uniform scheme [29]. Moreover, the DNN has nearly 4M tunable parameters, i.e., weights and biases. The values of these parameters are determined by Stochastic Gradient Descent (SGD) following the AdaGrad method [30]. The gradients are calculated by backpropagation of the MSE loss using a batch size of 1024. Also, 20% dropout is applied to each hidden layer during training to decrease overfitting. To further diminish overfitting, an early-stopping mechanism is used which terminates training when the MSE on the validation set has not been reduced by more than 1% for more than 20 epochs.
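The early-stopping rule described above can be sketched as follows. The exact bookkeeping (comparing the recent best against the running best) is an assumption, since the text only specifies the 1% improvement threshold and the 20-epoch patience.

```python
def should_stop(val_mse_history, patience=20, min_rel_improvement=0.01):
    """Stop when the best validation MSE within the last `patience` epochs is not
    at least `min_rel_improvement` (1%) better than the best seen before them."""
    if len(val_mse_history) <= patience:
        return False
    best_before = min(val_mse_history[:-patience])
    recent_best = min(val_mse_history[-patience:])
    return recent_best > best_before * (1.0 - min_rel_improvement)
```

In a training loop, this check would be evaluated after each epoch on the validation-set MSE.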

Speech Signal Restoration
Once DNN training is completed, the IRM is estimated for a given test speech signal and applied to each TF unit of its noisy representation; the masked frequency channels are then concatenated to reconstruct the noiseless speech signal.
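A minimal sketch of this restoration step is shown below, assuming the estimated mask simply scales each noisy TF unit and the 20 ms / 10 ms frames are recombined by plain overlap-add (a full implementation would also apply synthesis windowing, which is omitted here).

```python
import numpy as np

def apply_mask(noisy_tf, mask):
    """Scale every TF unit of the noisy representation by the estimated ratio mask."""
    return mask * noisy_tf

def overlap_add(frames, hop=160):
    """Recombine 50%-overlapping frames (320 samples, 160-sample hop at 16 kHz)."""
    frame_len = frames.shape[1]
    out = np.zeros(hop * (frames.shape[0] - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out
```

With a mask of all ones the noisy frames pass through unchanged, while a mask near zero suppresses noise-dominated TF units before resynthesis.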

IV. EXPERIMENTAL RESULTS
In this section, the SWEMD-VVMDH-DNN technique is implemented to evaluate its efficiency compared to SWEMD-VVMDH using MATLAB 2014a, which has many audio tools and deep learning functions for execution. In this experiment, speech utterances from the TIMIT database are corrupted by nine categories of noise at four SNR levels up to 15dB. These categories of noise are collected from the NOISEX-92 database [32]. Therefore, a total of 36×24 = 864 tests are carried out (9 categories of noise × 4 categories of SNR = 36 conditions, each with 24 utterances). Figure 3 represents the magnitude spectrogram of a sample speech signal.

Figure 3. Magnitude Spectrogram of Sample Speech Signal
The performance metrics used for evaluation are:  SNR: The ratio of the mean power of the speech signal to the mean power of the noise is called the SNR.

 Weighted Spectral Slope Measure (WSSM): It is the interval-averaged WSSM where only the best frames are averaged. It computes the weighted variances of the spectral slope over 25 critical frequency bands between any two related signal frames. In Eq. (8), ∆S(k) and ∆S̄(k) denote the spectral slopes of the two signals at frequency band k, W(k) refers to the weight in each band, and M stands for the number of best frames.
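For instance, the SNR metric above can be computed directly from the speech and noise components; this is a simple sketch with illustrative variable names.

```python
import numpy as np

def snr_db(speech, noise):
    """SNR in dB: ratio of the mean speech power to the mean noise power."""
    p_speech = np.mean(np.asarray(speech, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_speech / p_noise)
```

For example, a speech component with twice the amplitude of the noise gives a power ratio of 4, i.e., about 6.02 dB.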

Figure 5. MAE Comparison
The graphical representation of MAE values for EMDH, SWEMDH, VVMDH, and DNN under different acoustic noises is portrayed in Figure 5. The x-axis denotes the different techniques and the y-axis denotes the MAE ranges. This analysis indicates that the DNN reduces the MAE compared to the other methods. If the airport-noise circumstance is taken into account with an SNR of 10dB, the MAE of DNN is reduced by 62.88%, 31.98%, and 19.04% compared to EMDH, SWEMDH, and VVMDH, respectively. This is because of an accurate tradeoff for mistakes between the significant bands and their respective modes, which are concurrently determined via the DNN. Table 3 lists the SNR outcomes for EMDH, SWEMDH, VVMDH, and DNN under different acoustic noises that corrupt the speech signal during transmission. The gain of the DNN is attained due to the narrow-band properties relating the current measure of each mode's center frequency to the signal-estimation residual of all other modes. Table 4 gives the PSNR results for EMDH, SWEMDH, VVMDH, and DNN under different acoustic noises that corrupt the speech signal during transmission.

Figure 7. PSNR Comparison
The graphical representation of PSNR values for EMDH, SWEMDH, VVMDH, and DNN under different acoustic noises is depicted in Figure 7. The x-axis denotes the different techniques and the y-axis denotes the PSNR ranges (in dB). This analysis shows that the DNN attains the maximum PSNR compared to the other methods. E.g., if airport noise with an SNR of 10dB is considered, the PSNR of DNN is 41.92%, 7.81%, and 4.11% higher than that of EMDH, SWEMDH, and VVMDH, respectively.
Table 5 gives the PESQ outcomes for EMDH, SWEMDH, VVMDH, and DNN under different acoustic noises that corrupt the speech signal during transmission. Similarly, Figure 8 displays its graphical representation. Here, the x-axis denotes the different techniques and the y-axis denotes the PESQ ranges. This scrutiny identifies that the DNN improves the PESQ compared to the other methods. E.g., considering airport noise with an SNR of 10dB, the PESQ of DNN is 15.24%, 10.3%, and 5.47% higher than that of EMDH, SWEMDH, and VVMDH, respectively. This is realized due to the proper adjustment of the center frequency of the low- and high-frequency harmonics, which are identified at a tolerable quality and restored without faults.
Table 6 gives the WSSM results for EMDH, SWEMDH, VVMDH, and DNN under different acoustic noises that corrupt the speech signal during transmission. Additionally, the computational complexity of the DNN for N features is O(N) complex multiplications and additions, whereas the other techniques need a computational complexity of O(N^2 log2 N) + O(N log2 N) complex multiplications and additions.

V. CONCLUSION
In this article, the SWEMD-VVMDH-DNN technique is proposed for learning the decomposed speech signals to achieve SE. The capabilities of the state-of-the-art DNN are explored with respect to the types of background noise, the gender of the target speaker, and the SNR. Also, the architecture of the DNN may be varied and taken as a dimension. For the proposed approach, the noise-type dimension and the SNR dimension are selected to train and test all DNN-based speech enhancement systems.

Availability of data and material: Not applicable
Code availability: Not applicable

Authors' contributions
A DNN-based approach has the potential to improve Speech Enhancement in a broader range of usage situations, and the results support that matching the noise type is critical to achieving good performance for DNN-based SE algorithms.