Robust Perceptual Wavelet Packet Features for Recognition of Continuous Kannada Speech

An ASR system is built for continuous Kannada speech recognition. The acoustic and language models are created with the help of the Kaldi toolkit. The speech database is created with native male and female Kannada speakers; 80% of the collected speech data is used for training the acoustic models and 20% is used for system testing. The performance of the system is presented in terms of Word Error Rate (WER). Wavelet packet decomposition, along with a Mel filter bank, is used for feature extraction. The proposed features perform slightly better than conventional features such as MFCC and PLP in terms of WRA and WER under uncontrolled conditions. For the speech corpus collected in the Kannada language, the proposed features show an improvement in Word Recognition Accuracy (WRA) of 1.79% over the baseline features.


Introduction
The frequent pauses between the speech sounds of a speech signal are a characteristic that distinguishes it from all other signals. A speech database created under uncontrolled environmental conditions must be processed carefully to implement a robust automatic speech recognition system. Speech is an important and efficient tool of communication. Speech research has drawn the attention of numerous researchers and has emerged as one of the important multidisciplinary research areas in recent decades. Speaker-independent speech recognition is the task of identifying the spoken word or sentence irrespective of the speaker. Speech recognition has been performed on several languages. The UNESCO Atlas of the World's Languages in Danger (2009) reports that about 197 Indian languages are in critical danger of extinction.
The remainder of this paper is organized as follows: Section 2 reviews some of the important works on automatic speech recognition presented in the literature, and Section 3 describes the feature extraction methods.

Related Works
An automatic speech recognition (ASR) system can achieve very high accuracy in a clean environment. However, its performance degrades significantly when the spoken utterances are contaminated by background noise, when there is a mismatch between acoustic features extracted under noisy and clean conditions [13][14][15], or when there is a mismatch in the labelled speech data used to train the classifier [16]. Hence, the performance of an ASR system is constrained by two choices, namely the correct labelling of speech data and the selection of acoustic features. The best-known acoustic features for speech recognition are Mel-frequency cepstral coefficients (MFCCs). MFCCs are extracted from Mel filter banks [17] and are obtained using the short-time Fourier transform (STFT). The mel cepstral coefficients are computed by passing the speech signal through a bank of triangular filters, whose passbands slightly overlap with adjacent passbands, to obtain a smooth spectrum [18,19]. The spectrum is subject to variations as the impact of background noise increases [18,20]. The STFT requires that the signal being processed be stationary over a short interval of time, i.e., semi-periodic [21]. Due to the trade-off between time and frequency resolution, it is not easy to detect phones that occur as a rapid burst in a slowly changing signal [18,20,22]. This problem of time-frequency resolution is alleviated by using the wavelet transform (WT) [23][24][25]. The major benefit of the wavelet transform is that, unlike the single fixed-size analysis window of the STFT, it uses windows of variable duration. The high-frequency portion of the speech signal is processed by a short-duration window, whereas the low-frequency part is processed by a long-duration window [24,26,27].
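As an illustration of the bank of overlapping triangular Mel filters described above, the following NumPy sketch builds such a filter bank. The filter count, FFT size, and sampling rate are illustrative assumptions, not the exact configuration used in this work.

```python
import numpy as np

def hz_to_mel(f):
    # Standard mel mapping: f_mel = 2595 * log10(1 + f / 700)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, sample_rate=16000):
    """Build triangular filters with passbands that slightly overlap,
    spaced uniformly on the mel scale."""
    low_mel, high_mel = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 equally spaced mel points give filter edges and centres
    mel_points = np.linspace(low_mel, high_mel, n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, centre, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, centre):        # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):       # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

fb = mel_filterbank()  # 24 filters over 257 FFT bins
```

Each row of `fb` is one triangular filter; multiplying it with a power spectrum and summing gives one smoothed filter-bank energy.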
Thus, by applying the wavelet transform to a speech signal, it can be inspected for the presence or absence of sudden bursts (stop phonemes) in a slowly changing signal [20,22]. The conventional wavelet filter bank performed well for phoneme recognition tasks [20]. Because of its fixed frequency resolution in the time-frequency plane, the STFT was not able to detect voiced stops, owing to their characteristic rapid burst at higher frequencies [20,22]. The multiresolution potential of wavelets has been utilized extensively by many researchers for feature extraction, and its benefit has been demonstrated for several applications such as biomedical signal processing (ECG [28,29], EEG [32,33]), speech enhancement [30,31], and phoneme recognition [20,22,34].

Preprocessing
The preprocessing functions of framing, windowing and pre-emphasis are applied to all the wave files in the speech database. The frame duration and frame overlap are chosen as 20 ms and 10 ms respectively for performing framing and windowing.
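A minimal sketch of these preprocessing steps (pre-emphasis, framing with the 20 ms / 10 ms choice above, and Hamming windowing) might look as follows; the pre-emphasis coefficient of 0.97 is a common default, not a value stated in the text.

```python
import numpy as np

def preprocess(signal, sample_rate=16000, frame_ms=20, overlap_ms=10, alpha=0.97):
    """Pre-emphasis, then split into overlapping Hamming-windowed frames.
    Assumes len(signal) >= one frame length."""
    # Pre-emphasis boosts high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(sample_rate * frame_ms / 1000)           # 320 samples at 16 kHz
    step = frame_len - int(sample_rate * overlap_ms / 1000)  # 160-sample hop
    n_frames = 1 + (len(emphasized) - frame_len) // step
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * step: i * step + frame_len] * window
                       for i in range(n_frames)])
    return frames

frames = preprocess(np.random.randn(16000))  # one second of test noise
```

For a one-second 16 kHz signal this produces 99 frames of 320 samples each.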

Proposed Features
The multi-resolution property of the wavelet makes it an appropriate tool for handling non-stationary and semi-stationary signals. The wavelet transform can detect unvoiced sounds in the speech signal and provides good denoising characteristics. In recent years, several feature extraction approaches have been devised for speech recognition in uncontrolled environments, but the majority of these schemes use the Fourier transform to compute the spectrum. The speech signal consists of voiced (periodic) and unvoiced (aperiodic) portions throughout its duration. It is well known that the STFT, or windowed Fourier transform, has a fixed and uniform frequency resolution in the time-frequency plane. Therefore, it is difficult for methods that rely on the STFT to recognize sudden bursts in slowly time-varying speech signals. This problem is alleviated by the application of wavelet transforms in speech recognition research [35][36][37][43][44][45][46][47]. The wavelet transform offers good frequency resolution [49-53, 55, 56].

Theoretical Background of Wavelet Transforms
Multi-resolution analysis is an alternative to the STFT for analyzing a signal. A mathematical scaling function is utilized to obtain a series of approximations to the signal. This principle underlies the wavelet transform (WT). A comparison of the time-frequency resolution of the STFT and the WT is shown in Fig. 1.

Continuous Wavelet Transform (CWT)
The CWT of a signal x(t) is given by

$$\mathrm{CWT}(\tau, s) = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t-\tau}{s}\right) dt \qquad (1)$$

From Eq. (1), the result of the transformation is a function of two variables, τ and s, which denote the translation and scaling factors respectively, and ψ(t) is the mother wavelet.
The term wavelet [38] is a concatenation of the words 'wave' and 'let': a wavelet is a short wave. The mother wavelet acts as a model or prototype from which the other wavelets are derived.

Discrete Wavelet Transform (DWT)
The CWT is complicated to use for signal analysis because it demands significant computational resources, whereas the DWT is less complex and captures the signal information effectively [49]. The DWT of a signal x(t) is defined as

$$\mathrm{DWT}(j, k) = \frac{1}{\sqrt{2^{j}}} \int_{-\infty}^{\infty} x(t)\, \psi\!\left(\frac{t - 2^{j}k}{2^{j}}\right) dt \qquad (2)$$

Mallat demonstrated a method of wavelet decomposition in which a signal passes through a series of low-pass and high-pass filter pairs. The multi-resolution analysis of a signal is shown in Fig. 2a, b. Here, h0(n) and h1(n) in the decomposition tree are the low-pass and high-pass filter pair respectively. Similarly, g0(n) and g1(n) form the low-pass and high-pass filter pair in the reconstruction tree.
h0(n) and h1(n) are the pair of filters used for analysis, whereas g0(n) and g1(n) form the corresponding pair of low-pass and high-pass synthesis filters. These four filters are related through the perfect-reconstruction (quadrature mirror filter) conditions. The symbols ↓2 and ↑2 in Fig. 2a, b denote decimation and interpolation by a factor of 2 respectively. A pair of one-level analysis and synthesis trees is shown in Fig. 3, where {c0(n)}, n ∈ Z, is the input applied to the one-level analysis and synthesis trees respectively [23].
The one-level analysis tree computes

$$c_{1}(k) = \sum_{n} h_{0}(n - 2k)\, c_{0}(n), \qquad d_{1}(k) = \sum_{n} h_{1}(n - 2k)\, c_{0}(n)$$

where c1(k) and d1(k) are known as the approximation and detail coefficients respectively. These are created by the one-level wavelet analysis of c0(n). The corresponding synthesis tree shown in Fig. 3 reconstructs the signal as

$$c_{0}(n) = \sum_{k} \left[ g_{0}(n - 2k)\, c_{1}(k) + g_{1}(n - 2k)\, d_{1}(k) \right]$$
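The one-level analysis/synthesis tree of Fig. 3 can be sketched as follows. The Haar filter pair is used here purely for brevity (this work uses db4); with any orthogonal pair the synthesis stage reconstructs the input exactly.

```python
import numpy as np

def haar_analysis(c0):
    """One level of analysis with the Haar pair: low-pass/high-pass
    filtering combined with downsampling by 2 (the ↓2 operation)."""
    even, odd = c0[0::2], c0[1::2]
    c1 = (even + odd) / np.sqrt(2)   # approximation coefficients
    d1 = (even - odd) / np.sqrt(2)   # detail coefficients
    return c1, d1

def haar_synthesis(c1, d1):
    """One level of synthesis: upsampling by 2 (the ↑2 operation)
    combined with the reconstruction filter pair."""
    c0 = np.empty(2 * len(c1))
    c0[0::2] = (c1 + d1) / np.sqrt(2)
    c0[1::2] = (c1 - d1) / np.sqrt(2)
    return c0

x = np.array([4.0, 2.0, 6.0, 8.0])
c1, d1 = haar_analysis(x)
x_hat = haar_synthesis(c1, d1)   # perfect reconstruction of x
```

Running the analysis and then the synthesis returns the original samples, which is exactly the perfect-reconstruction property the filter relations guarantee.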

Wavelet Based Acoustic Feature Extraction
By repeating the iterative decomposition, a desired binary wavelet packet (WP) tree is obtained. Various WP filter-bank tree structures can be derived depending on the application of interest. The wavelet features are extracted using the Daubechies wavelet of order 4 (db4) [57]. Increasing the order of the mother wavelet may provide better results at the expense of increased computational complexity. The sound frequency f_c is mapped to the mel frequency f_mel according to the following equation:

$$f_{mel} = 2595 \log_{10}\!\left(1 + \frac{f_{c}}{700}\right)$$

A frame size of 25 ms with a frame overlap of 15 ms was used to derive the WMFCC. Initially, the speech frames are subjected to pre-emphasis followed by a windowing operation using the Hamming window. First, a balanced three-level wavelet packet tree structure is derived; here the frequency axis is subdivided into eight subbands, each of 1 kHz. The low-frequency subband in the range 0-1 kHz is again subjected to a three-level balanced decomposition to get eight subbands, each having a bandwidth of 125 Hz, which is close to the 100 Hz Mel filter. The next subband is decomposed into two-level balanced WP coefficients, giving four subbands each having a bandwidth of 250 Hz. The subbands in the ranges 1-1.25 kHz and 1.25-1.5 kHz are decomposed again, resulting in four subbands of the same bandwidth, i.e., 250 Hz. The subband of the 3-4 kHz frequency range is processed by a one-level decomposition, resulting in two subbands of 3-3.5 kHz and 3.5-4 kHz respectively. The frequency bands of 4-5 kHz, 5-6 kHz, 6-7 kHz, and 7-8 kHz are retained as they are. This results in a 24-band Mel-scale-like WP filter bank. The bandwidths of the 24 frequency bands resulting after the WP decomposition do not exactly follow the Mel scale [20] (see Table 1).
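The balanced three-level decomposition that splits the frequency axis into eight subbands can be sketched with a recursive filter-bank routine. The Haar pair again stands in for db4 to keep the sketch short, and the frame length is an assumed value; note also that the natural (tree) order of WP leaves differs from strict frequency order, because each high-pass branch folds the spectrum.

```python
import numpy as np

def wp_decompose(x, levels):
    """Balanced wavelet packet tree: recursively split every node with the
    Haar pair (db4 in this work) until `levels` is reached; returns the
    leaf subbands in natural (tree) order."""
    if levels == 0:
        return [x]
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2)    # low-pass branch
    high = (even - odd) / np.sqrt(2)   # high-pass branch
    return wp_decompose(low, levels - 1) + wp_decompose(high, levels - 1)

x = np.random.randn(512)          # one analysis frame (assumed length)
subbands = wp_decompose(x, 3)     # 8 leaf subbands, e.g. 1 kHz each at 16 kHz
```

Because the Haar step is orthonormal, the total energy of the eight leaf subbands equals the energy of the input frame, so no information is lost by the decomposition.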
The energy in each subband is calculated by

$$E_{j} = \frac{1}{N_{j}} \sum_{k=1}^{N_{j}} \left| w_{j}(k) \right|^{2}$$

where w_j(k) denotes the WP coefficients of the j-th subband and N_j is the number of coefficients in that subband.

Proposed PWP Tree Structure for Feature Extraction
In this work, a 24-band wavelet packet tree is proposed to obtain the cepstral features. The feature extraction is carried out using a 24-band Wavelet Packet (WP) tree structure arrived at after repeated experiments. The WP tree structure shown in Fig. 6 is the proposed structure for obtaining the features. The energies of the 24 wavelet subbands are calculated. These coefficients are then logarithmically compressed and subjected to the Discrete Cosine Transform (DCT), which achieves energy compaction. The DCT output gives 24 coefficients, of which only the first 13 are used as cepstral coefficients. The Kaldi toolkit is used to compute the delta and delta-delta coefficients to form features of 39 dimensions.
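The log-compression and DCT steps above can be sketched as follows; the DCT-II is written out directly with NumPy so the sketch has no SciPy dependency, and the input energies are random stand-ins for real subband energies.

```python
import numpy as np

def cepstral_features(subband_energies, n_ceps=13):
    """Log-compress 24 subband energies and apply an orthonormal DCT-II;
    keep the first 13 coefficients, as described in the text."""
    log_e = np.log(np.asarray(subband_energies) + 1e-10)  # avoid log(0)
    n = len(log_e)
    k = np.arange(n)
    # DCT-II: X[m] = sqrt(2/n) * c_m * sum_k x[k] cos(pi*m*(2k+1)/(2n))
    basis = np.cos(np.pi * np.outer(np.arange(n), 2 * k + 1) / (2 * n))
    ceps = basis @ log_e * np.sqrt(2.0 / n)
    ceps[0] /= np.sqrt(2.0)      # orthonormal scaling of the DC term
    return ceps[:n_ceps]

feats = cepstral_features(np.random.rand(24) + 0.1)  # 13 static coefficients
```

Appending delta and delta-delta coefficients to these 13 static values (as Kaldi does) yields the 39-dimensional feature vector.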

Acoustic Models
The acoustic models map the observed feature matrix to the desired phoneme sequences of the hypothesized sentence. The creation of acoustic models is usually accomplished using Hidden Markov Models (HMMs).

Language Models
ASR systems utilize n-gram language models to facilitate the detection of the exact word sequence by predicting the n-th word from the (n − 1) previous words. The most popular n-gram language models are the trigram (n = 3) and bigram (n = 2) models.
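As a toy illustration of how a bigram model predicts the n-th word from its predecessor, consider the maximum-likelihood estimate below. The two-sentence corpus is hypothetical, and real toolkits estimate such models from much larger corpora with smoothing.

```python
from collections import defaultdict

def train_bigram(sentences):
    """Maximum-likelihood bigram model: P(w_n | w_{n-1}) from raw counts."""
    unigram = defaultdict(int)   # count of each history word
    bigram = defaultdict(int)    # count of each (history, word) pair
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    return lambda prev, cur: (bigram[(prev, cur)] / unigram[prev]
                              if unigram[prev] else 0.0)

p = train_bigram(["the cat sat", "the cat ran"])
# P(cat | the) = 2/2 = 1.0; P(sat | cat) = 1/2 = 0.5
```

A trigram model works the same way with two-word histories; unseen histories then require smoothing to avoid zero probabilities.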

Hidden Markov Model
To determine the probability P(X|W), a 3-state Markov chain is used here; it is displayed in Fig. 7 [54]. In the training phase, the initial state probabilities (π), the state transition probabilities (A), and the output probabilities (B) are determined using the Baum-Welch algorithm. The HMM acoustic model for each word sequence is defined by λ = (A, B, π). The log-likelihood of every word sequence is estimated using the Viterbi decoding technique.
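A minimal log-domain Viterbi sketch for a 3-state left-to-right HMM with discrete observations is given below; the π, A and B values are hypothetical illustrations, not trained parameters from this work.

```python
import numpy as np

def viterbi_log(log_pi, log_A, log_B, obs):
    """Viterbi decoding in the log domain: returns the log-likelihood of
    the single best state path for an observation sequence."""
    delta = log_pi + log_B[:, obs[0]]          # initialisation
    for t in range(1, len(obs)):
        # best predecessor for each state, then emit the observation
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, obs[t]]
    return np.max(delta)                       # score of the best path

# Hypothetical 3-state left-to-right HMM with 2 discrete output symbols.
pi = np.array([1.0, 0.0, 0.0])
A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.9, 0.1]])
with np.errstate(divide="ignore"):             # log(0) -> -inf is intended
    score = viterbi_log(np.log(pi), np.log(A), np.log(B), [0, 1, 0])
# best path 1->2->3 with probability 0.9*0.5*0.8*0.5*0.9 = 0.162
```

Working in the log domain avoids numerical underflow, which is why decoders report log-likelihoods rather than raw probabilities.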

Performance Analysis
The recognition accuracy of any ASR system is determined using the popular metrics of word error rate (WER) and word recognition accuracy (WRA) [24], given by Eqs. (11) and (12) respectively:

$$WER = \frac{D + S + I}{N} \times 100\% \qquad (11)$$

$$WRA = 100\% - WER \qquad (12)$$

where N is the total number of words present in the test set, and D, S and I are the errors due to deletion, substitution and insertion respectively.
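The WER metric can be sketched via a Levenshtein alignment between the reference and hypothesis word sequences, which jointly counts the deletions, substitutions and insertions:

```python
def word_error_rate(reference, hypothesis):
    """WER = (D + S + I) / N * 100, via dynamic-programming alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                       # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j                       # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[-1][-1] / len(ref)

wer = word_error_rate("the cat sat down", "the cat sit")
# 2 errors (1 substitution + 1 deletion) over 4 reference words -> 50.0
```

WRA then follows directly as 100 minus the WER value.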

Database
The Kannada speech database consists of isolated digits from 0-9, 20 isolated words, and 500 continuous speech sentences comprising a combination of read speech and spontaneous speech with native-language accents, which were used as the text script for creating the database. The database contains 100 speakers, and the continuous speech amounts to 15 h. The database was recorded in a natural environment in the presence of room noise and the noise of vehicles on a nearby road at a distance of 5 m from the recording room. The tool used for recording the database is Matlab R2019b running on a Dell laptop. The speech data were collected at a sampling rate of 16 kHz with 16-bit resolution in mono. Table 2 lists 50 sample sentences from the speech database along with their Google transcriptions. The lexicon is created using Indian Language Symbol Labels version 3 (ILSLv3) [39, 40], prepared by the ASR/TTS consortia sponsored by the Government of India, which has become the de facto standard. The continuous speech database was created according to the guidelines suggested by speech research expert Dr. Samudravijaya K of the Indian Institute of Technology, Guwahati (IITG). The database is partitioned into a training set and a testing set: 80% of the database is used for training the acoustic models and 20% for testing.
The database consists of 3 sets for the Kannada language, namely isolated digits (0-9), isolated words, and continuous Kannada speech that also includes spontaneously spoken Kannada sentences. It likewise consists of 3 sets for the English language, namely isolated digits (0-9) from TIMIT, isolated words from TIMIT, and continuous English speech from Librispeech. The Kannada speech sentences along with their transcriptions are provided in Table 2.

Kaldi Toolkit
Kaldi is an open-source toolkit designed exclusively for building Acoustic Models (AMs) and Language Models (LMs) [41]. Kaldi is written in the C++ programming language and can be run on Windows as well as Linux-based operating systems, although support for Kaldi tasks on Linux is much better than on Windows. Table 3 presents the labels for Kannada phones using syllable transliteration. There are four Dravidian languages: Kannada, Telugu, Malayalam and Tamil. Kannada is the most widely used Dravidian language in the state of Karnataka. The language consists of 14 swara (vowels), 32 vyanjana (consonants), and 2 yogavaahaka (part vowel, part consonant). Therefore, the Kannada-language ASR system is developed by modeling the 46 phonemes. The labels used from the Indian Speech Sound Label Set (ILSL12) [42] are shown in Table 4. The lexicon for the Kannada language is written using the ILSL12 label set, as shown in Table 5.
The dictionary is created by using ILSL12. Figure 8 represents the block diagram of the proposed features.
The general architecture of ASR system is shown in Fig. 9.
Table 6 provides the details of the parameters used for acoustic modelling. The acoustic models are generated at the Monophone, Triphones1 and Triphones3 levels with the number of parallel jobs set to 3. The parameters used to develop the acoustic models are as follows:

Results
The results of the developed ASR system are presented in this section for the Monophone, Triphones1, Triphones2, Triphones3 and DNN-HMM phoneme models. Table 7 gives the WER details for the Kannada isolated-digit recognition task. A pictorial representation of Table 7 is presented in Fig. 10.
The Kannada isolated word recognition results and the corresponding graph are presented in Table 8 and Fig. 11 respectively.
The WER details for continuous Kannada speech recognition on the speech data collected under uncontrolled conditions are presented in Table 9 and Fig. 12 respectively. In all three sets of the Kannada language, a slight improvement in performance can be observed with the proposed features over the MFCC and PLP features for the DNN-HMM classifier.
The proposed features are also evaluated on the isolated digits and isolated words extracted from the TIMIT database. Table 10 and Fig. 13 describe the performance of the proposed features against the MFCC and PLP features for the isolated digits, while Table 11 and Fig. 14 describe the performance for the isolated words. A slight improvement in performance can be observed in Table 11.
The proposed features are also evaluated on the standard Librispeech corpus of 8 h. A small improvement in performance can be observed for the proposed features over the baseline features, MFCC and PLP. The WER details are recorded in Table 12 and Fig. 15. The proposed ASR system is also tested with unseen data consisting of 512 sentences of different combinations of words. A sample of 20 sentences is shown in Table 13, and the results of the experiment are included in Table 14.

Table 3: The labels for Kannada phones using the syllable transliteration tool (IT3 to UTF-8).

Table 4: The labels used from the Indian Speech Sound Label Set (ILSL12) for Kannada phonemes.

Table 5 (excerpt): Dictionary for the Kannada language created using the ILSL12 label set:

koneyalli           k o n e y a llx i
mattomme            m a t t o mm e
vaartegala          v aa r t e g a lx
mukhyamshagalu      m u k h y aa nx s h a g a lx u
samsattina          s a nx s a tt i n a
ubhaya              u b h a y a
sadanagalalli       s a d a n a g a lx llx i
raastrapatigala     r aa s t r a p a t i g a lx
bhasanada           b h aa s a nx d a
meelina             m ee l i n a
vandana             v a n d a n a
nirnayada           n i r nx y a d a
carce               c a r c e
aarambha            aa r a nx b h a
vivara              v i v a r a
sadanagalallindu    s a d a n a g a lx llx i nx d u
bhaasanakke         b h aa s a nx kk e
vandane             v a n d a n e
sallisuva           s a llx i s u v a
aarambhavaagide     aa r a nx b h a v aa g i d e

Conclusion
The ASR work carried out in this paper is summarized as follows.
• We have experimented with the conventional as well as the proposed feature extraction technique over the Monophone, Triphones1, Triphones2, Triphones3 and DNN-HMM models.
• The database consists of 3 sets for the Kannada language, namely isolated digits (0-9), isolated words, and continuous Kannada speech that also includes spontaneously spoken Kannada sentences.
• The database consists of 3 sets for the English language, namely isolated digits (0-9) from TIMIT, isolated words from TIMIT, and continuous English speech from Librispeech.
• In the experiments conducted over isolated digits and words taken from the collected Kannada data and from the TIMIT data, the proposed features achieved a significant improvement in performance over baseline features such as MFCC and PLP.
• For the experiments on the collected continuous Kannada speech and on Librispeech, the proposed features are shown to perform better than the conventional MFCC and PLP features.
• The proposed ASR system is tested with unseen data of 512 sentences, and the performance on this test set reveals that the proposed system performs better than with the conventional MFCC and PLP features.