Spectral-Warping Based Noise-Robust Enhanced Children ASR System

In real-life applications, noise originating from different sound sources modifies the characteristics of an input signal and hinders the development of an enhanced ASR system. This contamination degrades the quality and intelligibility of speech and impairs the performance of human-machine communication systems. This paper aims to minimise noise challenges through a robust feature extraction methodology combined with optimised filtering techniques. Initially, the input signals are enhanced using a state transition matrix and minimisation of the mean square error via the linear time-variant techniques of Kalman and adaptive Wiener filtering. Subsequently, Mel-Frequency Cepstral Coefficient (MFCC), Linear Predictive Cepstral Coefficient (LPCC), RelAtive SpecTrAl-Perceptual Linear Prediction (RASTA-PLP) and Gammatone Frequency Cepstral Coefficient (GFCC) feature extraction methods are compared in order to derive adequate characteristics of the signal and to handle the large-scale mismatch between the training and testing datasets. The acoustic mismatch and linguistic complexity arising from large variations within a small set of speakers are handled by Vocal Tract Length Normalization (VTLN) based warping of the test utterances. Furthermore, a spectral warping approach is applied by time-reversing the samples inside a frame and passing them through a filter network corresponding to each frame. Finally, an overall Relative Improvement (RI) of 16.13% is obtained on the 5-way perturbed spectral-warped noise-augmented dataset through Wiener filtering in comparison to the other systems.


Introduction
Speech signals in real-time systems are usually affected by stress, tiredness, environmental aggravations, and the resulting fluctuations in the speaker's pronunciation. However, ASR frameworks are generally designed under ideal conditions, where only a clean speech corpus is considered for training the speech-based models (Barker et al. 2001). Consequently, such systems act as simple recognisers of spoken words and streamline the conversion of speech into a decoded string, which is much needed for building enhanced human-machine interface systems. In general, real-time audio signals are affected by noise, which introduces disruptive and undesirable information (Kim et al. 1999). The major challenge is therefore the optimisation of ASR technology through reduction of the disparities between real-time and ideal environmental conditions. Accordingly, different strategies have been adopted by various researchers: system induction with background noise (noise augmentation) (Ko et al. 2017; Pervaiz et al. 2020); tuning based on real-time applications (Gopalakrishna et al. 2012; Lang et al. 1996); and enhancement of the noisy signal before its conversion to text (Boashash and Mesbah 2004; Sivasankaran et al. 2015; Zhang et al. 2017). Additionally, other experiments have employed numerous noise-cancellation methods, most commonly the adaptive filtering methodology (Espy-Wilson et al. 1996), enhancement techniques based on neuro-fuzzy filters (Esposito et al. 2000), and Kalman filtering (Goh et al. 1999). These filtering techniques can be viewed as one aspect of quality enhancement. In a similar manner, a noise-corrupted vocal singing clip has been cleaned with a tuned Kalman filter (Das et al. 2016).
Kalman filtering is highly preferred in the presence of non-linear noise, where the instantaneous state in a linear dynamic system is injected with noise at lower SNRs (Sorqvist et al. 1997). On the other hand, adaptive filters are generally derived from Wiener filtering (Abd El-Fattah et al. 2014), where a least mean square error based algorithm helps reduce the impact of linear noise. This technique is able to smooth the step factor in the time domain and also employs a sigmoid function for controlling the adaptation direction.
The information present in a real-time signal is too cumbersome to deal with directly when developing acceptable classification, recognition and verification frameworks (Kim et al. 1999). This can be addressed by removing undesirable information before extracting the significant features in speech recognition and identification systems. This front-end process of feature extraction transforms the processed speech signal into a compact yet meaningful representation that is more discriminative and reliable than the raw signal. In present ASR frameworks, various feature extraction procedures yield multidimensional feature vectors that are utilised to portray the dependable information of an input speech signal (Mesgarani et al. 2017). Options for such parametric representation of the signal include LPC (Gupta and Gupta 2016) and MFCC for recognition measures. MFCC has been the most broadly utilised and mainstream front-end method for ASR frameworks; it tries to capture the most relevant portion of a signal, even though in many cases these signals propagate in noisy or mismatched conditions (Zhao and Wang 2013). PLP (Hermansky 1990) was introduced as a way of warping spectra with the goal of minimising the inter-speaker variations that arise due to acoustically mismatched conditions. The RASTA-PLP (Hermansky et al. 1991) strategy applies a band-pass filter to the energy component present in each frequency sub-band; it smooths the short-term noise-based variations while also handling inter-speaker variations. Subsequently, many advanced noise-robust feature extraction methodologies have been experimented with: zero crossing peak amplitude (Scarr 1968), average localized synchrony detection (Ali et al. 2004), BFCC, GPLP (Gulzar et al. 2014) and gammatone frequency cepstral coefficients (GFCC) (Zhao and Wang 2013).
Recently, Gammatone filters have been utilised for accurate modelling of the critical bands; by replacing the conventional triangular filters, they have out-performed conventional feature extraction strategies in the field of recognition.
Nowadays, individuals generally feel more comfortable recognising words articulated in their respective native languages as compared to obscure foreign languages. However, the development of an ASR system in a local language is totally reliant upon adequate availability of labelled data and phonetic transcriptions. For this reason, spoken dialogue frameworks for resource-rich languages are commercially accessible, whereas very little attention has been paid to the usage of native languages such as Punjabi, Mizo and Bodo (Singh et al. 2019; Kaur et al. 2020). These under-resourced languages lack a web presence, availability of linguistic expertise and, above all, the resources required for text corpora and a pronunciation-rich lexicon (Besacier et al. 2014). Therefore, to overcome the challenge of data scarcity, various approaches have been experimented with: limited data for both acoustic and language models (Novotney et al. 2009), multi-lingual knowledge transfer (Ma et al. 2017) and construction of an adequate pronunciation lexicon (Robinson et al. 1995). Another challenge is the advancement of children ASR systems, where intelligent speech technologies such as YouTube Kids, Amazon Alexa, and computer-aided language learning have become crucial in classroom learning (Valente et al. 2012). The acoustic and linguistic patterns of children speech signals are very distinctive, involving differences in speaking rate and vocal tract length when contrasted with adult speech (Subramanian et al. 2019). Additionally, the limited accessibility of children speech datasets, even in the native-language context, obstructs the development of efficient children speech recognition systems. Moreover, different procedures of data augmentation (Ko et al. 2015) have been utilised by researchers with the goal of inducing artificial data. Such augmentation has become essential for improving the performance of data-hungry deep learning approaches.
In this paper, two filtering techniques, Kalman and Wiener filtering, have been employed in an effort to reduce unwanted information in the input children speech signal. Since the limited-resource Punjabi children ASR system is constructed on our own developed speech corpus, robustness is provided by combining the original speech corpus with a synthetic dataset containing more ideal SNR ratios (Koopmans et al. 2018). Motivated by this, the following efforts have been made to improve the performance of the children speech recognition framework:
- The classic approach of in-domain noise augmentation has been applied by inducing four distinct types of noise (self-recorded classroom, cafeteria, white and pink noise) at varying SNRs while keeping the class labels fixed.
- A comparative analysis between the Kalman and Wiener filtering procedures has been made utilising the feature extraction techniques of MFCC, LPCC, RASTA-PLP and GFCC. The enhanced signals generated by both filters are mixed with the original dataset such that de-noising is operable for audio recorded in both clean and noisy conditions.
- Tonal characteristics are incorporated using the normalised VTLN methodology for the filtered signal with the goal of eliminating existing inter-speaker variations.
The rest of the paper is structured as follows: Section 2 includes a literature analysis for building noise-robust ASR systems. The theoretical context for the filtering and feature extraction techniques is portrayed in Section 3. Sections 4 and 5 give descriptions of the experimental configurations and the proposed system architecture. In Section 6, the efficiency of different systems under varying environmental conditions is discussed, with conclusions made in Section 7.

Gong et al. (1995) analysed the effects of noise in automatic speech recognition systems, revealing the integral role of time and frequency associations in recognition. They further presented exploitation of task-specific a priori knowledge of speech and noise, which showed the significance of high SNR values. Earlier, Lim and Oppenheim (1979) explored speech degradation by additive background noise and analysed various techniques proposed for speech enhancement and bandwidth compression; their experiments achieved adequate compression while retaining the required information of the original audio signal. Boll (1979) estimated the spectral noise bias during non-speech activity and suppressed the stationary noise by subtracting the estimated spectral noise from the signal, applying secondary procedures to attenuate the residual noise left after subtraction. A perception test with the DRT database of 192 words injected with noise found comparable results on intelligibility and quality of the signal. With advancement in time, Ephraim and Malah (1984) capitalised on the importance of the short-time spectral amplitude (STSA). For construction of an enhanced signal, the minimum mean square error (MMSE) STSA estimator was combined with the complex exponential of the noisy phase. They performed a comparative investigation of the MMSE STSA and Wiener STSA estimators and found that MMSE STSA resulted in substantially less error and bias when the SNR is low.
Goncharoff et al. (1988) also tried to improve speech signals for correcting vocal tract resonance disorders; the parameters were chosen so as to adapt the F1 and F2 formant frequencies of the input speech signal. Etter and Moschytz (1994) underlined the idea of noise-adaptive spectral magnitude expansion in an effort to adapt the crossover point of the spectral magnitude; the expansion is performed in each frequency channel based on the noise level. Nowadays, researchers have gone beyond merely enhancing the audio signal, and various studies have analysed the impact of enhancement on speech recognition systems. A frequency warping function derived from scale-transform based acoustic features has been proposed to effectively separate vowels; the results showed a clear distinction between the formant frequency scales that differ among speakers (Umesh et al. 1996). Frequency warping has also been explored in automatic speech recognition by sampling the signal with a warping function: high-energy regions are sampled more densely than low-energy regions, since high-energy regions are believed to carry more linguistic information (Paliwal et al. 2009). Likewise, Sameti et al. (1998) intentionally corrupted the signal with white, simulated helicopter, and multi-talker (cocktail party) noise; their HMM-based MMSE speech enhancement system was consistently superior to the spectral subtraction-based system in handling non-stationary noise. In Saldanha (2016), Harmonic Regeneration Noise Reduction (HRNR) and an adaptive Wiener filter with Two-Step Noise Reduction (TSNR) were used to enhance the noisy speech signal; the authors also augmented the speech signal with fan noise, processed it with adaptive Wiener filtering, and plotted the output in MATLAB, which showed an improvement in the SNR of the audio signal. Lee et al. (2014) proposed a phase-dependent a priori signal-to-noise ratio (SNR) estimator in the log-mel spectral domain; it utilised both magnitude and phase information, with the decision-directed (DD) approach used to determine the a priori SNR from noisy speech. Lately, Haque and Bhattacharyya (2019) investigated a range of filtering procedures based on linear and non-linear methodologies, incorporating diverse adaptive filtering algorithms such as LMS, NLMS and RLS.

Related Work
Gurugubelli and Vuppala (2019) proposed another feature motivated by human auditory perception and high time-frequency resolution. As a part of the single frequency filtering (SFF) procedure, audio signals are passed through a single-pole complex bandpass filter bank to obtain a high-resolution time-frequency distribution, which is then enhanced by a set of auditory perceptual operators. Similarly, Narayana and Kopparapu (2009) studied the effect of additive Gaussian noise on the performance of the commonly used MFCC feature extraction technique, estimating the error while tuning the MFCC parameters under Gaussian noise. The vast majority of work on building noise-robust ASR frameworks utilised linear predictive coding (LPC) for speech signal modelling. However, Nair et al. (2016) examined the shortcomings of Kalman filters with LPC and concluded the superiority of MFCC over LPC, outlining the dependence of parameter refinement on the choice of the R and Q parameters; this resulted in easier modulation of the MFCC parameters for smaller amounts of noise. Further, Zhao and Wang (2013) analysed the boosted performance of a novel speaker feature extraction technique, indicating that non-linear corrections account mainly for variations in noise and can be adequately handled by a different time-frequency representation; they also analysed the robustness of MFCC features in the presence of noise. Zhao et al. (2011) experimented with a Gammatone filter based feature extraction method that can be extended to audio security systems; the experiment concluded with efficient extraction of feature vectors and satisfactory classification performance using SVM. Furthermore, Sárosi et al. (2011) carried out a comparative study of novel front-end techniques in six languages: English, Italian, German, Spanish, French and Hungarian. They found a substantial difference between implementations of MFCC, with significant improvements obtained by PNCC variants across separate bandwidths and differentiated SNR levels. Kadyan et al. (2021) also investigated the adult-child mismatch using a Punjabi corpus while formulating an ASR system and used vocal tract length normalisation to obtain better output.

Filtering Techniques
Adequate noise reduction in the input speech relies on the output of a linear time-varying filter excited by intermittent pulses or noise. Closer observation of noise-reduction methodologies reveals that speech can be modelled as a p-th order auto-regressive process. Thus, the present sample s(n) is explicitly dependent on a linear combination of the previous p samples plus random noise at varying SNRs. In other words, the representation is an all-pole filter driven by additive white Gaussian noise, given by equation (1):

s(n) = Σ_{k=1}^{p} a_k s(n − k) + w(n)  (1)

where w(n) corresponds to a zero-mean Gaussian noise (process noise) and a_k refers to the linear prediction coefficients (LPCs), evaluated from the auto-correlation function (ACF) r(·) through the normal equations as in equation (2):

Σ_{k=1}^{p} a_k r(|i − k|) = r(i),  i = 1, …, p  (2)
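As a rough illustration (not the paper's implementation), the autocorrelation-based LPC estimation of equation (2) can be sketched in NumPy; the function name `lpc_autocorrelation` and the AR(2) demo coefficients are purely illustrative:

```python
import numpy as np

def lpc_autocorrelation(signal, order):
    """Estimate LPCs a_k of an AR(p) model by solving the normal
    (Yule-Walker) equations built from the autocorrelation function."""
    n = len(signal)
    # Biased autocorrelation estimates r(0) ... r(order)
    r = np.array([np.dot(signal[:n - i], signal[i:]) for i in range(order + 1)])
    # Toeplitz system: sum_k a_k r(|i-k|) = r(i)
    R = np.array([[r[abs(i - k)] for k in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:])

# Synthesise an AR(2) process and recover its coefficients
rng = np.random.default_rng(0)
a_true = [1.5, -0.75]
w = rng.standard_normal(4000)
s = np.zeros(4000)
for t in range(2, 4000):
    s[t] = a_true[0] * s[t - 1] + a_true[1] * s[t - 2] + w[t]

a_est = lpc_autocorrelation(s, 2)
```

With enough samples, the estimated coefficients converge to the true AR parameters, which is the premise both filtering techniques below rely on.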

Kalman Filter
Kalman filtering operates on a series of measurements observed over time, containing noise (random variations) and other inaccuracies, and produces estimates of unknown variables that are more precise than those based on a single measurement alone. The application of the Kalman filter to autoregressive models is detailed in equation (3) and was first performed by Paliwal and Basu (1987). They represented the problem through a state vector in a state-space model that depends on the state transition matrix, whose coefficients are calculated from the additive noisy signal. In addition, the internal use of the state-space model makes Kalman filtering able to handle dynamic models with varying parameters. The Kalman filter is employed with a (p × 1) state vector input along with the noise-corrupted observation v(k) at the given k-th instant by:

x(k) = Φ x(k − 1) + g w(k),  y(k) = h^T x(k) + v(k)  (3)

where x(k) corresponds to the (p × 1) state vector and Φ is the (p × p) state transition matrix. The LPCs for the noisy signal are computed through equation (2).
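The predict/correct cycle can be illustrated with a minimal one-dimensional sketch (a simplification of the state-space form above, not the paper's implementation; the AR coefficient `a`, process variance `q` and observation variance `r` are illustrative choices):

```python
import numpy as np

def kalman_denoise(y, a, q, r):
    """Scalar Kalman filter: state x(k) = a*x(k-1) + w(k),
    observation y(k) = x(k) + v(k), with Var(w) = q, Var(v) = r."""
    x_est, p_est = 0.0, 1.0
    out = []
    for z in y:
        # Prediction (propagation) step
        x_pred = a * x_est
        p_pred = a * a * p_est + q
        # Correction (measurement update) step
        k_gain = p_pred / (p_pred + r)
        x_est = x_pred + k_gain * (z - x_pred)
        p_est = (1.0 - k_gain) * p_pred
        out.append(x_est)
    return np.array(out)

rng = np.random.default_rng(1)
clean = np.sin(np.linspace(0, 8 * np.pi, 500))
noisy = clean + 0.5 * rng.standard_normal(500)
denoised = kalman_denoise(noisy, a=0.99, q=0.01, r=0.25)
```

On this toy signal the filtered output tracks the clean sinusoid with a markedly lower mean square error than the noisy observation.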

Wiener filter
The Wiener filter is a common filtering technique employed for noisy signals and used in many signal enhancement procedures. It measures an approximation of the desired signal by performing linear time-invariant filtering of an observed noisy signal, minimising the mean square error between the estimated random process and the target process. The Wiener filter thus filters out the noise from the corrupted signal and provides an estimate of the underlying signal of interest. The frequency-domain solution to this optimisation problem gives the filter function illustrated in equation (4):

H(ω) = P_s(ω) / (P_s(ω) + P_n(ω))  (4)

where P_s(ω) and P_n(ω) are the power spectral densities of the clean and noise signals respectively, under the assumption that both signals are uncorrelated. Thus, the signal-to-noise ratio (SNR) can be computed as in equation (5):

SNR(ω) = P_s(ω) / P_n(ω)  (5)

Finally, the Wiener filter equation can be interpreted as in equation (6):

H(ω) = SNR(ω) / (SNR(ω) + 1)  (6)
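A minimal frequency-domain sketch of equations (5)-(6) follows (illustrative only; the clean PSD is approximated by spectral subtraction and the noise PSD is assumed known, which is a simplification of practical noise tracking):

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd):
    """Wiener gain H = SNR/(SNR+1), with the clean PSD approximated
    by (noisy PSD - noise PSD), floored to stay non-negative."""
    clean_psd = np.maximum(noisy_psd - noise_psd, 1e-10)
    snr = clean_psd / noise_psd
    return snr / (snr + 1.0)

rng = np.random.default_rng(2)
n = 1024
t = np.arange(n)
clean = np.sin(2 * np.pi * 50 * t / n)
noisy = clean + 0.5 * rng.standard_normal(n)

spec = np.fft.rfft(noisy)
# White noise of variance 0.25 has expected periodogram n * 0.25 per bin
noise_psd = np.full(len(spec), 0.25 * n)
gain = wiener_gain(np.abs(spec) ** 2, noise_psd)
enhanced = np.fft.irfft(gain * spec, n)
```

The real-valued gain attenuates noise-dominated bins while passing the signal-dominated bin almost unchanged, lowering the mean square error relative to the noisy input.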

Feature Extraction
The feature vectors corresponding to an input speech signal play a vital role in the extraction of unique information, making it possible to segregate one speaker from others by reducing the magnitude of the signal without damaging its power. As a result, the processing of features under degraded environmental conditions largely influences the performance of an ASR framework. The most widely used methods have been evaluated for their effect in both noisy and clean environments. Thousands of coefficients are extracted for specific signals, although only hundreds of randomly selected ones are used as input features for further study (Kadyan et al. 2017). The techniques differ in their pre-processing phases: in pre-emphasis, a signal is usually passed through a first-order finite impulse response (FIR) filter, followed by partitioning of the speech signal into frames, which is beneficial in removing the acoustic interference existing at both the starting and ending parts of an input speech signal.

Mel-Frequency Cepstral Coefficients (MFCC)
MFCC is a representation of the short-term power spectrum, defined as the real cepstrum of a windowed short-time signal derived from the fast Fourier transform of the speech signal. MFCC makes use of a non-linear frequency scale that approximates the behaviour of the auditory system (Davis and Mermelstein 1980). Further, the magnitude spectrum is mapped to the Mel spectrum, covering the broad variety of frequencies in the FFT spectrum. Thus, the pitch value corresponding to a tone with frequency f, measured in Hz, is represented on the Mel scale as in equation (7):

Mel(f) = 2595 * log10(1 + f / 700)  (7)
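Equation (7) and its inverse can be sketched directly (a minimal illustration; the function names are not from the paper):

```python
import math

def hz_to_mel(f_hz):
    """Mel mapping of equation (7)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of equation (7), used to place filter-bank centres."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

In MFCC extraction, triangular filters are spaced uniformly on the mel axis and mapped back to Hz with the inverse, which yields the characteristic dense low-frequency / sparse high-frequency spacing.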

Gammatone frequency cepstral coefficients (GFCC)
GFCC is based on a model of the auditory periphery. A gammatone filter bank decomposes the input speech into a temporal-frequency representation according to this auditory model; the GFCC is calculated using this bank of gammatone filters (Zhao and Wang 2013). The filter-bank responses are then down-sampled along the time dimension, decomposing the input speech signal into the T-F (time-frequency) domain. In addition, the bandwidth of each resulting filter is represented by its equivalent rectangular bandwidth (ERB) as in equation (8):

ERB(f_c) = 24.7 (4.37 f_c / 1000 + 1)  (8)

where f_c is the central frequency of the i-th gammatone filter. The magnitudes of the down-sampled responses are loudness-compressed using a cubic root operation as in equation (9):

G[m, t] = |g[m, t]|^(1/3),  m = 1, …, M;  t = 1, …, T  (9)

where M refers to the number of filters and T represents the number of time frames obtained after the down-sampling operation.
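Both formulas can be sketched as follows (illustrative; equation (8) is assumed here to be the common Glasberg-Moore ERB form, and the function names are not from the paper):

```python
def erb(fc_hz):
    """Equivalent rectangular bandwidth of a gammatone filter
    centred at fc_hz, as in equation (8)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def cubic_root_compress(responses):
    """Loudness compression of equation (9) applied to a matrix of
    down-sampled filter-bank magnitudes (filters x frames)."""
    return [[abs(v) ** (1.0 / 3.0) for v in row] for row in responses]

# A 1 kHz-centred filter has an ERB of roughly 133 Hz
bw = erb(1000.0)
compressed = cubic_root_compress([[8.0, 27.0]])
```

The cubic-root compression mimics the non-linear loudness perception of the ear and makes the features less sensitive to energy outliers than a linear magnitude scale.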

Linear Predictive Cepstral Coefficient (LPCC)
Since cepstral characteristics are generated directly from spectra, they essentially represent phonetic information. The contribution of all frequency components of a voice signal is equally emphasised by the LPCC features, which are generated from the spectra using the energy values of linearly organised filter banks. The cepstrum can be extracted from a voice stream using linear prediction analysis. The essential premise of linear predictive analysis is that the n-th speech sample may be predicted using a linear combination of the p preceding samples, as detailed in equation (10):

ŝ(n) = Σ_{k=1}^{p} a_k s(n − k)  (10)

Over a given speech analysis frame, the coefficients a_1, a_2, a_3, …, a_p are presumed to be constant. The speech samples are predicted using these coefficients, and the error is the discrepancy between the real and the predicted speech samples, evaluated by equation (11):

e(n) = s(n) − ŝ(n) = s(n) − Σ_{k=1}^{p} a_k s(n − k)  (11)
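Once the LPCs are known, the cepstral coefficients follow from a standard recursion (a sketch of the widely used LPC-to-cepstrum conversion; this recursion is assumed here and is not reproduced from the paper):

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPCs a = [a_1, ..., a_p] of the predictor in
    equation (10) into cepstral coefficients via the recursion
    c_n = a_n + (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k}."""
    p = len(a)
    c = []
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k - 1] * a[n - k - 1]
        c.append(acc)
    return c

# For an AR(1) model the cepstrum is known in closed form: c_n = a_1^n / n
c = lpc_to_cepstrum([0.5], 4)
```

This avoids any FFT and is why LPCC extraction is computationally cheap compared with spectrum-based cepstra.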

Relative spectral-perceptual linear prediction (RASTA-PLP)
Unlike the original PLP technique, RASTA-PLP applies a specific band-pass filter to each frequency sub-band to smooth out short-term noise fluctuations and eliminate any constant offset in the speech channel. The most important steps in RASTA-PLP are: calculating the critical-band power spectrum as in PLP; transforming the spectral amplitude through a compressing static non-linear transformation; and filtering the time trajectory of each transformed spectral component with the band-pass filter of equation (12):

H(z) = 0.1 * (2 + z^(−1) − z^(−3) − 2 z^(−4)) / (z^(−4) (1 − 0.98 z^(−1)))  (12)
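One standard form of this band-pass filtering can be sketched as a simple causal implementation over the trajectory of a single (log) spectral component (illustrative; the paper's exact filter coefficients are assumed to be the commonly used RASTA values):

```python
def rasta_filter(traj):
    """Apply the RASTA band-pass filter
    y[t] = 0.98*y[t-1] + 0.2*x[t] + 0.1*x[t-1] - 0.1*x[t-3] - 0.2*x[t-4]
    to the time trajectory of one transformed spectral component."""
    num = [0.2, 0.1, 0.0, -0.1, -0.2]
    out, prev = [], 0.0
    for t in range(len(traj)):
        acc = sum(num[i] * traj[t - i] for i in range(5) if t - i >= 0)
        y = acc + 0.98 * prev
        prev = y
        out.append(y)
    return out

# A constant (DC) trajectory is suppressed towards zero
flat = rasta_filter([1.0] * 500)
```

Because the numerator coefficients sum to zero, any constant channel offset decays away, which is exactly the convolutive-noise robustness RASTA is designed for.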

Spectral Warping
Spectral warping is a transformation of the time-domain signal that effectively distorts the frequency content of the original signal. The transformation matrix for this kind of augmentation is broken down into three stages. The first is a DFT, which converts the time signal into the frequency domain. The second is an interpolation matrix, which yields the desired new frequency samples; the frequency warping thereby redistributes the spectral content of the signal. The spectral warping procedure corresponds to evaluating the z-transform of the input signal at sample points ω_k on the unit circle, with an inverse DFT of the output as the final step. The spectral warp is achieved by treating the non-uniform z-transform samples as if they were evenly spaced and applying the inverse Fourier transformation, represented by the matrix [W(n, k)] as in equation (13).
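The three-stage pipeline (DFT, interpolation of the spectrum, inverse DFT) can be sketched as follows (a simplified magnitude-interpolation illustration, not the paper's exact matrix formulation; `alpha` > 1 compresses the spectrum towards lower frequencies):

```python
import numpy as np

def spectral_warp(x, alpha):
    """Warp the spectrum of x by resampling its DFT magnitude at
    warped frequency positions, then inverse-transforming."""
    n = len(x)
    spec = np.fft.rfft(x)                      # stage 1: DFT
    bins = np.arange(len(spec))
    warped = np.interp(alpha * bins, bins,     # stage 2: interpolation
                       np.abs(spec), right=0.0)
    phase = np.angle(spec)
    return np.fft.irfft(warped * np.exp(1j * phase), n)  # stage 3: inverse DFT

# A pure tone at bin 50 moves to bin 25 when the spectrum is compressed by 2
n = 512
t = np.arange(n)
x = np.sin(2 * np.pi * 50 * t / n)
y = spectral_warp(x, 2.0)
```

Reusing the original phase keeps the output real-valued and time-aligned while only the spectral envelope is shifted, which is the property the augmentation exploits.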

Original Dataset
For training and testing of the proposed system, the speech data was recorded over a mono channel in clean acoustic conditions at a sampling frequency of 16 kHz. The corpus incorporates 2159 utterances from 20 male and 19 female speakers, with a total speech duration of 4.15 hours. The data was organised into training and testing sets as shown in Table 1, such that a convincing basis for developing the noise-robust Punjabi children ASR system is obtained.

Noisy Dataset
A scalable and adaptable noisy database is synthetically created from the clean dataset while keeping the ideal length and sampling frequency, thereby retaining the significant information of the input speech signal. Noise clips including self-recorded classroom, self-recorded cafeteria, white Gaussian and pink noise were chosen and injected into the clean speech corpus. These clips were cautiously hand-picked to ensure the quality of the recordings, and the procedure can generally be scaled to accommodate new noise types and desired SNR levels ranging from 15 dB down to −5 dB with a step size of −5 dB. For the test dataset, the noise clips were embedded at random SNR values similar to the training set; the same categories of noise as in the training set were employed for additional experimentation. The noise-augmented dataset is demonstrated in the block diagram detailed in Figure 1. Initially, the system is presented with (a) a clean input speech signal and (b) a noisy signal, which confers mismatched conditions between the training and testing sets. The two datasets are trained and tested by extracting viable features through four feature extraction techniques: MFCC, LPCC, RASTA-PLP and GFCC. For the MFCC depiction, 13 coefficients (12 cepstral coefficients plus an energy parameter for each frame) are extracted with a frame length of 25 ms and a frame shift of 10 ms based on equation (7). LPCCs are utilised in this context to collect emotion-specific information expressed via vocal tract characteristics: the voice signal is subjected to a 10th order LP analysis to obtain 13 LPCCs for each 25 ms speech frame with a 10 ms frame shift. Furthermore, 12 lower-order coefficients for the noise-robust GFCC approach are extracted over a Hamming window with 25 ms frame length and 10 ms frame shift, based on the filter bank as in equation (8).
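Injecting a noise clip at a target SNR amounts to scaling the noise relative to the speech power before mixing; a minimal sketch (illustrative, not the paper's tooling):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the mixture speech + noise has the
    requested SNR (in dB) relative to `speech`, then add it."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(3)
speech = np.sin(np.linspace(0, 20 * np.pi, 8000))
noise = rng.standard_normal(8000)
mixed = add_noise_at_snr(speech, noise, 10.0)
```

Sweeping `snr_db` over 15, 10, 5, 0 and −5 dB reproduces the augmentation grid described above while keeping the class labels fixed.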

Figure 2(b). Plots illustrating the clean and noisy signals along with the Wiener filter based enhanced speech signal.
Further, the developed baseline system is decoded against the noisy test set for real-time evaluation, and the accuracy of the system is evaluated at varying SNRs. The two filtering techniques, Kalman filtering (Algorithm 1) and Wiener filtering, are applied on the testing dataset as described in Figure 2(a) and Figure 2(b) respectively. For real-time performance evaluation of the ASR system, both MFCC and GFCC front-end feature extraction techniques are employed. Furthermore, CMVN normalisation is applied on each extracted feature vector by subtracting the mean and then dividing by the standard deviation.
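The CMVN step can be sketched on a feature matrix of shape (frames, coefficients) (a minimal per-utterance illustration; `eps` is an illustrative numerical guard):

```python
import numpy as np

def cmvn(features, eps=1e-8):
    """Cepstral mean and variance normalisation: per-dimension
    zero mean and unit variance over the utterance."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / (sigma + eps)

rng = np.random.default_rng(4)
feats = rng.standard_normal((300, 13)) * 5.0 + 2.0
normed = cmvn(feats)
```

After normalisation, every cepstral dimension has zero mean and unit variance, which removes constant channel bias before acoustic modelling.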

Algorithm 1: Speech enhancement using Kalman Filtering Technique
Step 1: Initialise the frame size as 30 ms and the window length as 10 ms.
Step 2: Propagation (Prediction)
Step 2.1: Predict the next state
x̂(k|k−1) = A x̂(k−1) + B u(k)
where x̂(k|k−1) represents the predicted desired state at time k, A is the state-transition model, B is the control-input model, and u(k) is the original signal input at time k.
Step 2.2: Predict the error covariance ahead
P(k|k−1) = A P(k−1) A^T + Q
where P is the error covariance matrix and Q is the covariance of the original-signal (process) noise.

Step 3: Measurement Update (Correction)
Step 3.1: Compute the Kalman gain
K(k) = P(k|k−1) H^T (H P(k|k−1) H^T + R)^(−1)
where K(k) is the Kalman gain, H is the observation model and R is the covariance of the noisy-signal (observation) noise.
Step 3.2: Update the projected state via
x̂(k) = x̂(k|k−1) + K(k) (z(k) − H x̂(k|k−1))
where x̂(k|k−1) is the prior state estimate and z(k) is the measurement.
Step 3.3: Update the predicted error covariance
P(k) = (I − K(k) H) P(k|k−1)
Step 4: Reiterate the process, using the outputs of iteration k as inputs for iteration k + 1, holding the predicted value as the desired result of the Kalman filter.

In comparison, the acoustic models utilise the linguistic knowledge of both the original and the augmented datasets, which are further trained on the context modelling techniques of monophone training (mono), delta (tri1) and delta-delta (tri2) triphone training. The existing speaker variations are then reduced by embedding the VTLN warping function after the evaluation of delta-delta (tri2). This normalisation process uses a piecewise linear function (Zhang et al. 2004), which helps map the corresponding frequencies on a larger scale after computation of the central segment. In addition, test-normalised VTLN is employed such that only the test-set based enhanced Kalman dataset and enhanced Wiener dataset are normalised; it relies on the best warping factor, evaluated in the form of a transformation matrix. The proposed system combining the aforementioned methodologies is detailed in Algorithm 2, such that the features are re-computed for the test datasets after triphone (tri2) modelling. Furthermore, the number of parameters needs to be reduced with the objective of boosting the feasibility of the system and enabling efficient predictive analysis of the word sequence. Similar to tri2, LDA based triphone modelling (tri3) helps reduce the triphones into a smaller number of acoustically distinct units. This process employs LDA on the original feature space, reducing the dimensionality from 117 to 40, followed by application of a diagonal MLLT on the lower-dimensional feature vectors. Finally, systems for direct processing of (a) the raw input speech signal, (b) the noisy signal and (c) the enhanced filtered signal are trained on DNN-HMM based hybrid acoustic modelling.
The DNN utilised the hyperbolic tangent activation function of the Kaldi toolkit (Povey et al. 2011). Finally, the efficiency of the enhanced ASR system utilising the filtering techniques is determined using two main parameters: WER and RI.

Algorithm 2:
Step-by-step process for data augmentation through spectral warping using the Wiener filter on the GFCC technique.
Step 11.2: Go back to Step 7 and recompute the features for the normalised test data.
Step 12: Train the model using the DNN based hybrid architecture.
Step 13: Obtain the best result on the DNN; otherwise, go to Step 7 to find the best warp factor.

Performance evaluation of ASR system on both clean and noisy test conditions
For the first set of experiments, the baseline system is evaluated using the four front-end feature extraction techniques (MFCC, LPCC, RASTA-PLP and GFCC) with clean child audio signals in both the training and testing datasets. The MFCC feature extraction technique achieved 15.43% WER, performing better than LPCC with 16.02%, RASTA-PLP with 15.46% and GFCC with 15.61% under clean conditions, as depicted in Table 2. However, in real-world conditions the test data is a mixture of clean and distorted audio signals with distinct SNRs and different forms of noise. The test dataset has therefore been expanded with four types of noise (self-recorded classroom, self-recorded cafeteria, white and pink) at varying SNRs ranging from 0 to 15 dB. These sets are evaluated against the clean training set in the baseline system using all of the above-mentioned feature extraction techniques, in an effort to analyse the performance of the system under degraded environmental conditions. In contrast, RASTA-PLP is almost comparable to MFCC and GFCC in clean and noisy environments respectively, taking into account both settings of the test datasets. Likewise, GFCC led to greater performance than MFCC in non-ideal (noisy) conditions with an RI of 7.11%, as shown in Table 2.
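The RI figures quoted throughout are assumed here to be the standard relative WER reduction with respect to a baseline, which can be sketched as (illustrative function name and demo values, not results from the paper):

```python
def relative_improvement(wer_baseline, wer_system):
    """Relative Improvement (RI, %) of a system's WER over a baseline:
    RI = 100 * (WER_baseline - WER_system) / WER_baseline."""
    return 100.0 * (wer_baseline - wer_system) / wer_baseline

# e.g. a drop from 20% to 15% WER is a 25% relative improvement
ri = relative_improvement(20.0, 15.0)
```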

Performance evaluation of the system at varying SNR values
Although quantitative findings are helpful in analysing speech enhancement algorithms, diverse training conditions play a key role. Moreover, test audio signals evaluated on trained models are highly sensitive to inter-speaker variations, which affects the efficiency of the ASR system under varying environmental conditions. Therefore, the following set of experiments attempts to reproduce real-life recording conditions captured with a microphone by applying noise-reducing filters: the noisy test dataset is enhanced using the Kalman and Wiener filtering techniques. Instead of using random noise levels, the effectiveness of the filtering techniques is tested with noise injected at particular SNR values of -5 dB, 0 dB, 5 dB, 10 dB and 15 dB.
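For illustration, the predict/update recursion behind Kalman enhancement can be reduced to a scalar (random-walk) state model; practical speech-enhancement Kalman filters instead track an AR model of the speech via a state transition matrix, so this is only a sketch of the mean-square-error-minimising recursion, with hypothetical process and measurement noise variances `q` and `r`:

```python
import numpy as np

def kalman_denoise(y, q=1e-4, r=1e-1):
    """Scalar Kalman filter over samples y: predict, then correct with the
    Kalman gain k = p/(p+r), which minimises the posterior MSE."""
    x_est = 0.0                       # state estimate
    p = 1.0                           # estimate covariance
    out = np.empty(len(y), dtype=float)
    for t, obs in enumerate(y):
        p = p + q                     # predict: covariance grows by process noise
        k = p / (p + r)               # Kalman gain
        x_est = x_est + k * (obs - x_est)   # correct with the innovation
        p = (1.0 - k) * p             # updated covariance
        out[t] = x_est
    return out
```

On a noisy constant signal the recursion converges to the underlying value, which is the behaviour the filtering stage relies on.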

Performance evaluation of filtering techniques on clean training set
In this set of experiments, the effectiveness of the system under clean training conditions is assessed to further analyse the filtering techniques on the noisy dataset. Noise is injected at specific SNR values, and the filters attempt to reproduce an enhanced signal, which is then evaluated using MFCC, LPCC, RASTA-PLP and GFCC feature vectors against the clean training set. LPCC is a lossy compression method, meaning information is lost over long ranges; consequently, neither LPCC nor GFCC is able to benefit fully from the filtered data. Likewise, RASTA-PLP could not surpass MFCC in this scenario, which degraded the performance of the system at higher SNR values. Somewhat degraded performance is visible even at the higher SNR values of 10 dB and 15 dB, with the worst results at the lower SNRs of -5 dB and 0 dB, for both the Kalman and Wiener filtering techniques, as shown in Figures 4(a), 4(b), 4(c) and 4(d). Moreover, the Wiener filtering approach outperformed the Kalman filter with average Relative Improvements of 3.08%, 2.95%, 1.49% and 2.51% for the MFCC, LPCC, RASTA-PLP and GFCC feature extraction techniques respectively.
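A minimal frequency-domain Wiener gain sketch, assuming a noise power spectrum estimated from non-speech frames; the spectral-subtraction estimate of the clean spectrum and the gain floor are illustrative choices, not the exact filter configuration used in these experiments:

```python
import numpy as np

def wiener_gain(noisy_psd, noise_psd, floor=1e-3):
    """Per-frequency Wiener gain H = S/(S+N), with the clean PSD S estimated
    by spectral subtraction and the gain floored to limit musical noise."""
    signal_psd = np.maximum(noisy_psd - noise_psd, 0.0)
    return np.maximum(signal_psd / (signal_psd + noise_psd + 1e-12), floor)

def wiener_denoise_frame(frame, noise_psd):
    """Apply the Wiener gain to one windowed analysis frame (FFT -> gain -> IFFT)."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)))
    gain = wiener_gain(np.abs(spec) ** 2, noise_psd)
    return np.fft.irfft(gain * spec, n=len(frame))
```

The gain approaches 1 where the signal dominates the noise and the floor where noise dominates, which is why performance drops at the -5 dB and 0 dB SNRs noted above.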

Performance evaluation of feature extraction techniques with external filtering techniques on noise augmented training dataset
In this set of experiments, noise clips are randomly injected into the clean dataset at SNR values ranging from -5 dB to 15 dB and pooled with the clean data. This addition of noise during evaluation on both the enhanced Kalman speech set and the enhanced Wiener speech set creates a regularisation effect that increases the robustness of the model. The experiments show large average Relative Improvements of 7.67% and 19.53% for the enhanced Kalman child and enhanced Wiener child datasets respectively, across the MFCC, LPCC, RASTA-PLP and GFCC feature extraction techniques, as detailed in Figures 5(a) and 5(b). However, MFCC feature vectors on the noise-augmented training set lead to degraded performance, with increased WER at every SNR value for both the enhanced Kalman child and enhanced Wiener child datasets. In addition, the noise reduction applied by both filtering techniques fails to preserve intelligibility at the lower SNR values of -5 dB and 0 dB, similar to the clean-training condition, resulting in decreased system performance.

Performance evaluation on perturbed noise augmented dataset using spectral warping
A noticeable improvement in system performance is obtained by employing GFCC+VTLN together with the enhanced Wiener child dataset. In like manner, the frequency-domain features, or transfer function, of a test device are often of relevance in both analogue and mixed-signal devices; within the region of interest, emphasis is typically laid on a specific area of the frequency spectrum rather than the entire spectrum. The spectral warping technique is therefore applied, frame by frame, by time-reversing the samples inside a frame and feeding them into a filter network; the outputs of the first-order filter stages provide the samples of each warped signal. Warp factors ranging from -0.1 to 0.1 with a step size of 0.0025 are detailed in Figure 6. The best warp factor is found to be -0.075, for which the most effective improvement is obtained with the real-time system by employing GFCC+VTLN along with the external enhanced Wiener filtering technique, yielding an RI of 2.53% over the GFCC+VTLN based system. Finally, an experiment is conducted with the goal of reducing data scarcity using a DNN-HMM based hybrid technique, which employs the artificially generated noise-enhanced training dataset pooled with 3-, 4- and 5-way speed perturbation. The alignments for the speed-perturbed data are rebuilt using the DNN-HMM system owing to the change in signal duration. However, for this low-resource language only a very small improvement is obtained, as illustrated in Table 4, probably because the data had previously been supplemented with simulated reverberation. The overall RI of 16.13% on the 5-way perturbed, spectrally warped, noise-augmented dataset is achieved with the Wiener filtering technique and the GFCC+VTLN approach in comparison to the baseline system.
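The frame-wise spectral warping described above (time-reverse the frame, pass it through a chain of first-order all-pass sections, and read off the final output of each stage as one warped sample) can be sketched as follows; the all-pass form A(z) = (z^-1 - a)/(1 - a z^-1) and the direct O(N^2) implementation are our assumptions for illustration:

```python
import numpy as np

def allpass_warp(frame, a):
    """Frequency-warp one frame with warp factor a (|a| < 1).

    The time-reversed frame is fed through a cascade of first-order
    all-pass sections y[t] = -a*x[t] + x[t-1] + a*y[t-1]; the last
    output sample of each section becomes one sample of the warped frame.
    With a = 0 each section is a pure delay and the warp is the identity."""
    x = frame[::-1].astype(float)     # time-reversed samples, as in the paper
    n = len(x)
    out = np.empty(n)
    stage = x.copy()
    out[0] = stage[-1]                # zeroth stage: the signal itself
    for k in range(1, n):
        new = np.empty(n)
        prev_in, prev_out = 0.0, 0.0
        for t in range(n):
            new[t] = -a * stage[t] + prev_in + a * prev_out
            prev_in, prev_out = stage[t], new[t]
        stage = new                   # feed this stage into the next one
        out[k] = stage[-1]
    return out
```

Positive a compresses the low-frequency region of the spectrum and negative a expands it, so sweeping a over the grid above and keeping the value with the lowest WER (here -0.075) selects the augmentation that best matches the test conditions.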

Comparative analysis of noise-robust system with earlier proposed approaches
The rapidly expanding area of automatic speech recognition is confronted with a number of challenges, including vocabulary size, speaking style, speaker mode and, most of all, environmental resilience. Deep learning, built on sophisticated network architectures and large model parameters, has become a strong foundation for speech recognition with broad and far-reaching implications. Most researchers have focused on fully resourced datasets, leaving low-resource speech recognition systems comparatively underexplored, although past research has addressed these challenges to improve ASR performance. Within this application domain, considerable effort has been devoted to developing noise-robust ASR systems, as shown in Table 5. The issues of data scarcity and noisy conditions are overcome in the proposed research using an efficient spectral warping method combined with noise filtering in the proposed hybrid approach, in comparison to other state-of-the-art work. In contrast to the baseline system, the Wiener filtering technique with the GFCC+VTLN approach yielded an overall RI of 16.13% on the 5-way perturbed, spectrally warped, noise-augmented dataset.

Conclusion
This study addresses acoustic and linguistic constructs in children's speech via a comparative analysis of filtering strategies, demonstrated with four front-end feature extraction methods: MFCC, LPCC, RASTA-PLP and GFCC. These methods are further evaluated with a test-side normalisation technique, VTLN, to reduce inter-speaker differences and excitation-source variation on scarce resources under deteriorated environmental conditions. Two types of corpus, the clean children's dataset and the noise-enhanced children's dataset, evaluated with a DNN-HMM classifier, are shown to increase the robustness of the ASR system under non-ideal environments. The findings obtained using synthetic training data prove beneficial on the children's speech corpus, with overall improvements of 24.55% (Kalman filtering) and 30.65% (Wiener filtering) under noisy conditions. The overall experimental analysis demonstrates the effectiveness of the proposed spectrally warped, noise-augmented system utilising the Wiener filter alongside GFCC feature extraction, achieving an overall relative improvement of 16.13% over the baseline system. In future, this comprehensive study can be expanded by applying these filtering methods to other speech-based systems, including speaker verification and authentication and gender and emotion classification, using various augmentation methodologies and, more broadly, an out-of-domain speech augmentation approach.

Conflict of Interest:
The authors declare that they have no conflict of interest.