3.2 PROPOSED FEATURES
Wavelets possess the multi-resolution property, which makes wavelet transforms well suited to processing both non-stationary and quasi-stationary signals. A vast number of feature-design algorithms have been presented in the literature for ASR systems operating under external disturbances from the natural environment, but most of these algorithms use the Fourier transform to estimate the spectrum. Speech signals contain both periodic and aperiodic regions, whereas the STFT uses a window of fixed duration in the time-frequency plane; STFT-based methods therefore cannot handle such variations in speech signals. This challenge is addressed by the use of wavelets [35, 36, 37, 43–47], which offer flexible frequency resolution in the time-frequency plane.
3.2.1 Theory of Wavelet Transforms
A comparative description of WT and STFT is presented in Fig. 1.
3.2.2 Continuous Wavelet Transforms (CWTs)
The CWT of a speech segment \(x(t)\) is defined as
$${CWT}_{x}^{{\Psi }}\left(\tau ,s\right)=\frac{1}{\sqrt{\left|s\right|}}{\int }_{-\infty }^{\infty }x\left(t\right)\,{{\Psi }}^{*}\left(\frac{t-\tau }{s}\right)dt \left(1\right)$$
In Eq. (1), \(\tau\) and \(s\) denote translation and dilation, respectively, and \({\Psi }\left(\text{t}\right)\) is the mother wavelet.
The mother wavelet plays the role of a prototype or basis from which the other analysis functions are constructed. The CWT is a computationally expensive transform.
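Eq. (1) can be approximated numerically by a Riemann sum over the sampled signal. The sketch below is illustrative only: the Mexican-hat (Ricker) mother wavelet and the function names are assumptions, not the wavelet used later in this work, and normalization constants are omitted.

```python
import math

def mexican_hat(t):
    # Ricker/Mexican-hat mother wavelet (real-valued, unnormalized);
    # chosen here only for illustration.
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def cwt_coefficient(x, dt, tau, s, psi=mexican_hat):
    """Riemann-sum approximation of Eq. (1) for samples x[n] = x(n*dt).

    tau is the translation, s the dilation (scale); for a real wavelet
    the complex conjugate in Eq. (1) is trivial.
    """
    total = 0.0
    for n, xn in enumerate(x):
        t = n * dt
        total += xn * psi((t - tau) / s) * dt
    return total / math.sqrt(abs(s))
```

Evaluating this coefficient over a grid of \((\tau, s)\) values yields the full time-scale plane; the cost of that grid is what makes the CWT expensive in practice.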
3.2.3 Discrete Wavelet Transforms (DWTs)
The DWT is computationally less complex [49]. The DWT of a speech signal x(t) is defined as:
$$DWT\left(j,k\right)=\frac{1}{\sqrt{{2}^{j}}}{\int }_{-\infty }^{\infty }x\left(t\right)\,{\psi }^{*}\left(\frac{t-{2}^{j}k}{{2}^{j}}\right)dt \left(2\right)$$
Mallat effectively summarized the wavelet decomposition process: it is accomplished by passing the speech segments through a wavelet packet tree. Sample analysis trees are shown in Fig. 2. Here, \({h}_{0}\left(n\right)\) and \({h}_{1}\left(n\right)\) are the low-pass and high-pass analysis filters, respectively; similarly, \({g}_{0}\left(n\right)\) and \({g}_{1}\left(n\right)\) form the synthesis filter pair.
The filter pairs \({h}_{0}\left(n\right),{h}_{1}\left(n\right)\) and \({g}_{0}\left(n\right),{g}_{1}\left(n\right)\) are related by Eq. (3):
$${h}_{1}\left(n\right)={\left(-1\right)}^{n}{g}_{0}\left(1-n\right), {g}_{1}\left(n\right)={\left(-1\right)}^{n}{h}_{0}\left(1-n\right) \left(3\right)$$
The decimation and interpolation operations by a factor of 2 are denoted by ↓2 and ↑2, respectively. Figure 3 shows the analysis and synthesis trees; the sequence \({\left\{{c}_{0}\left(n\right)\right\}}_{n\in Z}\) is the input to the tree [23].
$${c}_{1}\left(k\right)=\sum _{n}{h}_{0}\left(n-2k\right){c}_{0}\left(n\right) \left(4\right)$$
$${d}_{1}\left(k\right)=\sum _{n}{h}_{1}\left(n-2k\right){c}_{0}\left(n\right) \left(5\right)$$
where \({c}_{1}\left(k\right)\) and \({d}_{1}\left(k\right)\) represent the low-frequency (approximation) and high-frequency (detail) subspaces, respectively. The synthesis tree shown in Fig. 3 is described by Eq. (6):
$${c}_{0}\left(m\right)=\sum _{k}\left[{g}_{0}\left(m-2k\right){c}_{1}\left(k\right) +{g}_{1}\left(m-2k\right){d}_{1}\left(k\right)\right] \left(6\right)$$
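One level of the analysis/synthesis pair of Eqs. (4)–(6) can be sketched in a few lines. This is a minimal illustration under two assumptions: the orthonormal Haar filters are used (chosen for brevity; the features later in this section use db4), and the synthesis filters are indexed as \(g(m-2k)\), the standard Mallat convention. The derived filters \(h_1, g_1\) follow the quadrature-mirror relations of Eq. (3).

```python
import math

# Orthonormal Haar filter pair (illustrative choice, not the db4 used later).
# h1 and g1 are derived from Eq. (3): h1(n) = (-1)^n g0(1-n), g1(n) = (-1)^n h0(1-n).
INV_SQRT2 = 1.0 / math.sqrt(2.0)
h0 = {0: INV_SQRT2, 1: INV_SQRT2}
g0 = dict(h0)
h1 = {n: ((-1) ** n) * g0.get(1 - n, 0.0) for n in (0, 1)}
g1 = {n: ((-1) ** n) * h0.get(1 - n, 0.0) for n in (0, 1)}

def analyze(c0):
    """Eqs. (4)-(5): correlate with h0/h1 and downsample by 2 (even-length input)."""
    half = len(c0) // 2
    c1 = [sum(h0.get(n - 2 * k, 0.0) * c0[n] for n in range(len(c0))) for k in range(half)]
    d1 = [sum(h1.get(n - 2 * k, 0.0) * c0[n] for n in range(len(c0))) for k in range(half)]
    return c1, d1

def synthesize(c1, d1):
    """Eq. (6), synthesis filters indexed g(m - 2k): upsample, filter, and add."""
    m_len = 2 * len(c1)
    return [sum(g0.get(m - 2 * k, 0.0) * c1[k] + g1.get(m - 2 * k, 0.0) * d1[k]
                for k in range(len(c1))) for m in range(m_len)]
```

With these filters the analysis/synthesis pair achieves perfect reconstruction: decomposing a segment and resynthesizing it returns the original samples, which is the property the wavelet packet tree relies on at every level.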
3.2.4 Wavelets for parameterization
By applying the iterative decomposition operation repeatedly, a wavelet tree of the desired shape is designed. Wavelet-based feature vectors are derived using the Daubechies wavelet (db4) [57]. The fourth order is used here; higher orders improve performance but at increased computational cost.
3.2.4.1 Mel-Filter-Like Wavelet Packet Analysis
A 24-band mel-scale-like wavelet feature set (WMFCC) is proposed [20]. The linear frequency \({f}_{c}\) is related to the mel-scale frequency \({f}_{mel}\) by Eq. (7):
$${f}_{mel}=2595{log}_{10}\left(1+\frac{{f}_{c}}{700}\right) \left(7\right)$$
The signal analysis is initialized with a balanced 3-level tree, dividing the frequency range into eight bands of 1 kHz each. The approximation space of 0–1 kHz is further decomposed into eight subbands of 125 Hz bandwidth each, close to the roughly 100 Hz bandwidth of the lowest mel filters. In this way a 24-band mel-scale-like WP filter bank is designed [20] (see Table 1).
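The mel mapping of Eq. (7) is straightforward to compute; a minimal sketch (the function name is an assumption) is:

```python
import math

def hz_to_mel(f_c):
    """Eq. (7): map linear frequency f_c (Hz) to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_c / 700.0)
```

A useful sanity check on this formula is that 1000 Hz maps to approximately 1000 mel, the anchor point of the mel scale.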
Table 1
Comparative description of Wavelet frequency bands and MFCC frequency bands.
Filter | Mel Scale (Hz) | Wavelet Subband (Hz) | Filter | Mel Scale (Hz) | Wavelet Subband (Hz) | Filter | Mel Scale (Hz) | Wavelet Subband (Hz)
1 | 100 | 125 | 9 | 900 | 1125 | 17 | 2639 | 2750
2 | 200 | 250 | 10 | 1000 | 1250 | 18 | 3031 | 3000
3 | 300 | 375 | 11 | 1149 | 1375 | 19 | 3482 | 3500
4 | 400 | 500 | 12 | 1320 | 1500 | 20 | 4000 | 4000
5 | 500 | 625 | 13 | 1516 | 1750 | 21 | 4595 | 5000
6 | 600 | 750 | 14 | 1741 | 2000 | 22 | 5278 | 6000
7 | 700 | 875 | 15 | 2000 | 2250 | 23 | 6063 | 7000
8 | 800 | 1000 | 16 | 2297 | 2500 | 24 | 6954 | 8000
The 24-band mel-filter-like wavelet packet sub-bands are shown in Fig. 5 [20].
The energy in each subband is determined by
$${\left\langle {S}_{i}\right\rangle }_{k}=\sum \frac{{\left|{\left({\omega }_{{\Psi }}\left(x,k\right)\right)}_{i}\right|}^{2}}{{N}_{i}} \left(8\right)$$
where \({{\omega }_{{\Psi }}(x,k)}_{i}\) are the coefficients of the speech segment \(x\), \(i\) indicates the subband number (\(1\le i\le M\)), \(k\) is the frame number, and \({N}_{i}\) is the total count of samples in the \({i}^{th}\) subband. As in MFCC extraction, the 24 subband energies are logarithmically compressed, and the compressed coefficients are then processed by the DCT to achieve energy compaction. The first 13 coefficients are chosen as the WMFCC features. The steps of parameterization are shown in Fig. 6.
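The energy/log/DCT steps above can be sketched as follows. This is a schematic, not the exact implementation: the function names are assumptions, the input is taken as a list of 24 per-subband coefficient lists for one frame, and an unnormalized DCT-II is used.

```python
import math

def subband_log_energies(coeffs_per_band):
    """Eq. (8) plus log compression: mean squared coefficient energy per subband."""
    return [math.log(sum(c * c for c in band) / len(band)) for band in coeffs_per_band]

def dct2(v, n_keep=13):
    """Unnormalized DCT-II of the log energies; the first n_keep
    coefficients form the feature vector (13 here, as for MFCC)."""
    N = len(v)
    return [sum(v[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N)) for n in range(N))
            for k in range(n_keep)]
```

Applying `dct2(subband_log_energies(bands))` to each frame yields one 13-dimensional WMFCC vector per frame.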
3.2.4.2 Proposed Hybrid PWP tree for parameterization
A 24-band wavelet tree is designed for parameterization. Repeated experimental investigation and analysis revealed the 24-band Wavelet Packet (WP) tree shown in Fig. 7 to be optimal for this task.
The energies of the decomposed coefficients are calculated, compressed logarithmically, and processed by the DCT to choose 13 optimal coefficients.
3.2.5 ACOUSTIC MODELS
Acoustic modeling plays a vital role in any ASR system: it is the task of mapping the matrix of features to the phoneme sequences of the hypothesized sentence. This is accomplished through the use of a Hidden Markov Model (HMM) classifier.
3.2.6 LANGUAGE MODELS
The most popular language models used in ASR systems are n-gram language models. These models predict the \({n}^{th}\) word from the \(\left(n-1\right)\) preceding words. Bigram \((n=2)\) and trigram \((n=3)\) models are commonly used in language modelling.
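A maximum-likelihood bigram model reduces to relative counts, \(P(w_n\mid w_{n-1})=\text{count}(w_{n-1},w_n)/\text{count}(w_{n-1})\). A minimal sketch (function name and sentence markers `<s>`, `</s>` are assumptions; real systems add smoothing):

```python
from collections import Counter

def bigram_probs(corpus_sentences):
    """Maximum-likelihood bigram model trained on whitespace-tokenized sentences.

    Returns a function p(prev, w) = count(prev, w) / count(prev).
    No smoothing is applied, so unseen bigrams get probability 0.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])            # history counts
        bigrams.update(zip(tokens, tokens[1:]))  # pair counts
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
```

A trigram model follows the same pattern with two-word histories.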
3.2.7 RECOGNITION
Baseline classifiers such as GMM-HMMs, extended versions of monophone models, and Deep Neural Networks (DNNs) are used in this work for achieving speech recognition.
3.2.8 HIDDEN MARKOV MODELS
To find the probabilities \(P\left(W|X\right)\), the 3-state Markov chain shown in Fig. 8 is used. During the training phase, the initial state probabilities (\(\pi\)), the state-transition probabilities (A), and the symbol-emission probabilities (B) are determined using the Baum-Welch algorithm.
$$\lambda =\left(A, B, \pi \right) \left(9\right)$$
The log-likelihood of each candidate word is found using the Viterbi decoding method, described by
$${v}^{*}=\underset{1\le v\le V}{\text{argmax}}\left[P \left(O|{\lambda }_{v}\right)\right] \left(10\right)$$
where \(V\) is the vocabulary size.
3.2.9 ASR PERFORMANCE ANALYSIS
The performance of the proposed system is evaluated through the word error rate (WER) metric [24], given by Eq. (11):
$$WER\left(\%\right)=\frac{(D+S+I)}{N}\times 100\left(\%\right) \left(11\right)$$
Here N represents the total number of units in the reference transcriptions of the testing dataset, and \(D\), \(S\), and \(I\) are the errors due to deletion, substitution, and insertion, respectively.
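Eq. (11) requires aligning the hypothesis against the reference to count D, S, and I; the minimal total is obtained by Levenshtein alignment. A word-level sketch (function name is an assumption):

```python
def word_error_rate(ref, hyp):
    """Eq. (11): WER (%) via Levenshtein alignment of reference and hypothesis."""
    ref, hyp = ref.split(), hyp.split()
    # dp[i][j] = minimal D + S + I aligning ref[:i] with hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when insertions are numerous, since I is counted against the reference length N.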