DOI: https://doi.org/10.21203/rs.3.rs-1542540/v1
An Automatic Speech Recognition (ASR) system is developed for recognizing continuous and spontaneous Kannada speech sentences. The language models and acoustic models are constructed using the Kaldi toolkit. The speech corpus is developed with native female and male Kannada speakers and is partitioned into a training set and a testing set. The performance of the proposed system is analysed and evaluated using the Word Error Rate (WER) metric. Wavelet Packets amalgamated with Mel filter banks are utilized for feature vector generation. The proposed hand-crafted features outperform baseline features such as Perceptual Linear Prediction (PLP) and Mel Frequency Cepstral Coefficients (MFCC) in terms of WER under both controlled and uncontrolled conditions.
The natural pauses present between sequences of spoken words make speech a very different and unique class of signals. The speech corpus developed in any environment must be subjected to pre-processing to build an effective ASR system. Speech is the easiest and most effective means of communication. The integration of deep learning with multidisciplinary speech signal research has drawn much attention from the speech community, and the application of deep neural networks to speech, audio, video and text has evolved into a fascinating domain of speech research. Spoken sentence recognition is the task of detecting and recognizing spoken words or sentences with the help of computer algorithms. ASR applications have been developed for several Indian and foreign languages. The world languages danger report prepared by UNESCO in 2009 indicates that approximately 197 languages of India are at risk of becoming obsolete, and the report also highlights that the percentage of native speakers of these languages is becoming lower day by day [1]. An ASR system was developed using MFCC, HMM, vector quantization and i-vectors for the Assamese language; the proposed fusion technique performed better than Hidden Markov Models, vector quantization and i-vectors alone, with an accuracy of 100% [1]. An ASR system was built and evaluated on a Bengali speech corpus: 39-dimensional MFCC features were extracted and triphone-based HMM models were trained, achieving an accuracy of 87.30% [2]. An ASR system designed for Bangla speech using Mel-scale-based LPC features and HMMs provided 98.11% recognition accuracy [3]. A word recognition system designed with LPC features and HMMs for Hindi words accomplished an accuracy of 97.14% [4]. A word recognition system designed with MFCCs and HMM models for a Hindi speech corpus accomplished an accuracy of 94.63% [5]. A recognition system for a Hindi connected-speech corpus using MFCC features and HMMs recorded an accuracy of 87.01% [6]. A digit recognition system using MFCC features and HMMs for a Malayalam corpus reached 98.5% [7]. A speaker identification system formed from LPCC and MFCC features via vector quantization yielded an accuracy of 96.59% [8]. A language identification (LI) task was accomplished for five Indian languages: cepstral features and a vector quantization technique achieved the classification, and a word recognition accuracy of 88.10% was obtained for Kannada sentences [9]. A Punjabi word recognition ASR system using LPC features and the dynamic time warping technique provided an accuracy of 94% [10]. A speaker recognition system was built using MFCC features, linear discriminant analysis, support vector machines and cosine distance scoring [11]. GMMs utilized LP-residual-processed features to perform the classification task [12]. In this work, an attempt has been made to utilize these features with the Kaldi toolkit in degraded environments with the intention of improving system performance. Content organization: Sections 1 and 2 convey a brief introduction to the ASR system, and Section 3 depicts the details of the feature extraction techniques.
ASR applications are impressive in noise-free environments but are not promising solutions for experiments and tests in noisy areas [13, 14, 15, 16]. Hence, ASR system performance is governed by two factors: relevant labelling of the corpus and the choice of feature vectors. MFCCs are derived from the Mel filter banks [17, 18, 19, 20] and depend on the STFT, which demands that the signal being processed is stationary over short intervals of time [21, 22, 18, 20]. The challenge of time-frequency resolution is addressed by using wavelets [23, 24, 25]. The major freedom here is the use of windows of variable duration: the detail portion of the speech is processed with short-duration windows, while the approximation coefficients of the speech are processed with long-duration windows [24, 26-27]. Therefore wavelets adapt to changing signals [20, 22]. Wavelet-based filter banks have worked well for phoneme recognition tasks [20, 22]. The multi-resolution ability of wavelet transforms has been exhaustively utilized by several researchers for applications such as ECG [28, 29], EEG [32, 33, 34] and signal quality improvement [30, 31].
The speech signal must pass through preprocessing operations before being used by any ASR system. The popular preprocessing operations are a framing operation using windows and a pre-emphasis operation. The frame length and frame shift are selected as 20 ms and 10 ms respectively.
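As an illustration, the following minimal NumPy sketch implements these two operations; the pre-emphasis coefficient of 0.97 and the Hamming window are typical assumptions, not values stated in this paper.

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    """First-order pre-emphasis filter: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, fs=16000, frame_ms=20, shift_ms=10):
    """Split a signal into 20 ms frames with a 10 ms shift (50% overlap)."""
    frame_len = int(fs * frame_ms / 1000)    # 320 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)        # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    frames = np.stack([x[i * shift: i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)    # window each frame

# Example: 1 s of 16 kHz audio -> 99 windowed frames of 320 samples
x = preemphasize(np.random.randn(16000))
print(frame_signal(x).shape)                 # (99, 320)
```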
Wavelets have the multi-resolution property, and because of this the transform is used to process both non-stationary and semi-stationary signals. In the literature, many feature design algorithms have been presented for ASR systems operating in the presence of external disturbances of the natural environment. However, most of these algorithms use the Fourier transform to find the spectrum. Speech signals have both periodic and aperiodic regions, but the STFT uses a window of fixed duration in the time-frequency plane, so STFT-based methods cannot handle such variations in speech signals. This challenge is addressed by the use of wavelets [35, 36, 37, 43-47], which offer flexible frequency resolution in the time-frequency plane.
A comparative description of WT and STFT is presented in Fig. 1.
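The contrast can be reproduced with standard tools. The sketch below, using SciPy's STFT and PyWavelets' CWT with a Morlet wavelet (both illustrative choices, not the paper's implementation), computes the two time-frequency representations of a toy signal.

```python
import numpy as np
import pywt
from scipy.signal import stft

fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t) * np.exp(-3 * t)   # toy decaying tone

# STFT: fixed 20 ms window -> uniform time-frequency tiling
f, tt, Zxx = stft(x, fs=fs, nperseg=320, noverlap=160)

# CWT: scale-dependent windows -> fine time resolution at high frequencies
scales = np.arange(1, 128)
coef, freqs = pywt.cwt(x, scales, 'morl', sampling_period=1 / fs)
print(Zxx.shape, coef.shape)
```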
The CWT of a speech segment \(x(t)\) is described by

$$CWT_{x}(\tau, s) = \frac{1}{\sqrt{|s|}} \int_{-\infty}^{\infty} x(t)\, \Psi^{*}\!\left(\frac{t - \tau}{s}\right) dt \qquad (1)$$

In Equation (1), \(\tau\) and \(s\) indicate translation and dilation respectively, and \(\Psi(t)\) is the mother wavelet.
The mother wavelet plays the role of a prototype or basis function from which other functions are constructed. The CWT is, however, computationally expensive.
The DWT is somewhat less complex [49]. The DWT of a speech signal \(x(t)\) is defined as

$$DWT_{x}(j, k) = \frac{1}{\sqrt{2^{j}}} \int_{-\infty}^{\infty} x(t)\, \Psi^{*}\!\left(\frac{t - 2^{j}k}{2^{j}}\right) dt \qquad (2)$$

where the scale and translation take the dyadic values \(s = 2^{j}\) and \(\tau = 2^{j}k\).
Mallat effectively summarized the wavelet decomposition process: it is accomplished by passing the speech segments through a wavelet packet tree. Sample analysis trees are shown in Fig. 2. Here, \(h_0(n)\) and \(h_1(n)\) are the low-pass and high-pass analysis filters respectively. Similarly, \(g_0(n)\) and \(g_1(n)\) form the corresponding synthesis filter pair.
The filters \(h_0(n)\), \(h_1(n)\), \(g_0(n)\) and \(g_1(n)\) are related by Eq. (3).
The decimation and interpolation operations by a factor of 2 are denoted by ↓2 and ↑2 respectively. Figure 3 shows the analysis and synthesis trees; in Fig. 3, \(\{c_0(n)\}_{n \in \mathbb{Z}}\) is the input to the tree [23].
Here, \(c_1(k)\) and \(d_1(k)\) represent the low-frequency (approximation) and high-frequency (detail) spaces respectively. The synthesis tree shown in Fig. 3 is described by Eq. (6).
By applying the iterative decomposition operation repeatedly, a desired wavelet tree is designed. Wavelet-based feature vectors are derived using the Daubechies wavelet of order 4 (db4) [57]. The 4th order is used here; as the order increases, performance improves at the cost of increased computational requirements.
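A minimal sketch of a db4 wavelet packet decomposition of one speech frame using the PyWavelets library is given below; the 'symmetric' padding mode is an assumption, not a detail stated in the paper.

```python
import numpy as np
import pywt

# One 20 ms frame (320 samples at 16 kHz)
frame = np.random.randn(320)

# Three-level wavelet packet decomposition with Daubechies-4 ('db4')
wp = pywt.WaveletPacket(data=frame, wavelet='db4', mode='symmetric',
                        maxlevel=3)

# The eight level-3 nodes partition 0-8 kHz into 1 kHz sub-bands
for node in wp.get_level(3, order='freq'):
    print(node.path, node.data.shape)
```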
The 24-band Mel-resembling wavelet feature set (WMFCC) is proposed [20]. The frequency \(f_c\) is related to the mel-scale frequency \(f_{mel}\) by Eq. (7)

$$f_{mel} = 2595 \log_{10}\!\left(1 + \frac{f_c}{700}\right) \qquad (7)$$
The signal analysis is initialized with a balanced 3-level tree. Here, the frequency range is divided into eight bands of 1 kHz each. The approximation space of 0-1 kHz is then decomposed into 8 sub-bands with a bandwidth of 125 Hz each; each such sub-band is close to the roughly 100 Hz bandwidth of a Mel filter. In this way a 24-band Mel-scale-like WP filter bank is designed [20] (see Table 1).
Filter | Mel scale (Hz) | Wavelet sub-band (Hz) | Filter | Mel scale (Hz) | Wavelet sub-band (Hz) | Filter | Mel scale (Hz) | Wavelet sub-band (Hz) |
---|---|---|---|---|---|---|---|---|
1 | 100 | 125 | 9 | 900 | 1125 | 17 | 2639 | 2750 |
2 | 200 | 250 | 10 | 1000 | 1250 | 18 | 3031 | 3000 |
3 | 300 | 375 | 11 | 1149 | 1375 | 19 | 3482 | 3500 |
4 | 400 | 500 | 12 | 1320 | 1500 | 20 | 4000 | 4000 |
5 | 500 | 625 | 13 | 1516 | 1750 | 21 | 4595 | 5000 |
6 | 600 | 750 | 14 | 1741 | 2000 | 22 | 5278 | 6000 |
7 | 700 | 875 | 15 | 2000 | 2250 | 23 | 6063 | 7000 |
8 | 800 | 1000 | 16 | 2297 | 2500 | 24 | 6954 | 8000 |
The 24-band Mel-filter-like wavelet packet sub-bands are shown in Fig. 5 [20].
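To see how the sub-band edges of Table 1 track the mel scale, the short sketch below evaluates the standard mel mapping of Eq. (7) at the first few tabulated mel-scale values; it is purely illustrative.

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel mapping: f_mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

# Mel-scale values and WP sub-band edges from Table 1 (first four filters)
mel_points = [100, 200, 300, 400]
wp_edges = [125, 250, 375, 500]
for m, w in zip(mel_points, wp_edges):
    print(f"mel point {m} Hz (~{hz_to_mel(m):.0f} mel)  ->  WP edge {w} Hz")
```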
The energy of each sub-band is determined by

$$E_i(k) = \frac{1}{N_i} \sum_{j=1}^{N_i} \left[\left(\omega_{\Psi}(x,k)\right)_i(j)\right]^{2} \qquad (8)$$

where \((\omega_{\Psi}(x,k))_i\) are the wavelet packet coefficients of the speech segment \(x\), \(i\) indicates the sub-band number (\(1 \le i \le M\)), \(k\) represents the frame number and \(N_i\) is the total count of coefficients in the \(i^{th}\) sub-band. Just as with MFCC, the 24 sub-band energies are logarithmically compressed. The compressed coefficients are further analysed by the DCT to achieve energy compaction, and the first 13 coefficients are chosen as the WMFCC features. The steps of parameterization are shown in Fig. 6.
The 24-band wavelet tree is designed for parameterization. Repeated experimental investigation and analysis revealed that the 24-band Wavelet Packet (WP) tree shown in Fig. 7 is the optimal tree for this task. The energies of the decomposed coefficients are calculated, compressed logarithmically and processed by the DCT, from which 13 optimal coefficients are chosen.
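A hedged Python sketch of this parameterization is given below. It reproduces the 24-band structure of Table 1 (12 bands of 125 Hz, 6 of 250 Hz, 2 of 500 Hz and 4 of 1000 Hz at a 16 kHz sampling rate) by selecting frequency-ordered PyWavelets nodes at different depths; the exact node selection of the tree in Fig. 7 may differ, and a 32 ms frame (512 samples) is used so that a depth-6 decomposition is well-defined.

```python
import numpy as np
import pywt
from scipy.fftpack import dct

def wp_24band_energies(frame, wavelet='db4'):
    """Energies of the 24 Mel-like WP sub-bands of Table 1 (fs = 16 kHz)."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, mode='symmetric',
                            maxlevel=6)
    nodes = (wp.get_level(6, order='freq')[:12]      # 0-1.5 kHz: 12 x 125 Hz
             + wp.get_level(5, order='freq')[6:12]   # 1.5-3 kHz:  6 x 250 Hz
             + wp.get_level(4, order='freq')[6:8]    # 3-4 kHz:    2 x 500 Hz
             + wp.get_level(3, order='freq')[4:8])   # 4-8 kHz:    4 x 1 kHz
    return np.array([np.sum(n.data ** 2) / len(n.data) for n in nodes])

def wmfcc(frame, n_ceps=13):
    """24 log sub-band energies -> DCT -> first 13 cepstral coefficients."""
    log_e = np.log(wp_24band_energies(frame) + 1e-10)  # log compression
    return dct(log_e, type=2, norm='ortho')[:n_ceps]   # energy compaction

print(wmfcc(np.random.randn(512)).shape)               # (13,)
```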
Acoustic modelling plays a vital role in any ASR system. It is the task of mapping the matrix of features to the desired phoneme sequences of the hypothesized sentence. This is accomplished through the use of a Hidden Markov Model (HMM) classifier.
The most popular language models used in ASR systems are n-gram language models. These models predict the \(n^{th}\) word from the \((n-1)\) previously occurring words. Trigram \((n=3)\) and bigram \((n=2)\) models are commonly used in language modelling.
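As a toy illustration of the idea, the following sketch estimates maximum-likelihood bigram probabilities from sentence counts; production systems add smoothing (e.g. Kneser-Ney), which is omitted here.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """ML bigram model: P(w_n | w_{n-1}) = C(w_{n-1}, w_n) / C(w_{n-1})."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        words = ['<s>'] + s.split() + ['</s>']   # sentence boundary markers
        unigrams.update(words)
        bigrams.update(zip(words[:-1], words[1:]))
    return {bg: c / unigrams[bg[0]] for bg, c in bigrams.items()}

lm = train_bigram_lm(["vaartegala vivara aarambha",
                      "mattomme vaartegala vivara"])
print(lm[('vaartegala', 'vivara')])   # 1.0 in this toy corpus
```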
Baseline classifiers such as GMM-HMMs, the triphone extensions of the monophone models, and Deep Neural Networks (DNNs) are used in this work to achieve speech recognition.
To find the probabilities \(P(W|X)\), the 3-state Markov chain shown in Fig. 8 is used. During the training phase, the probability of the system starting in a given state, called the initial state probability (\(\pi\)), the probability of transiting among states (\(A\)), and the probability of the emitted symbols (\(B\)) are determined using the Baum-Welch algorithm.
The log-likelihood of each word sequence is found using the Viterbi decoding method, where \(V\) is the word length.
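Kaldi implements these steps internally; purely as an illustration, the sketch below uses the hmmlearn library (an assumption, not the paper's tooling), whose fit() runs Baum-Welch (EM) estimation and whose decode() performs Viterbi decoding on a toy feature matrix.

```python
import numpy as np
from hmmlearn import hmm

# Toy feature matrix: 200 frames of 13-dimensional cepstral features
X = np.random.randn(200, 13)

# 3-state GMM-HMM; fit() runs Baum-Welch (EM) to estimate the initial
# probabilities (pi), the transitions (A) and the emissions (B)
model = hmm.GMMHMM(n_components=3, n_mix=2, covariance_type='diag',
                   n_iter=20)
model.fit(X)

# Viterbi decoding returns the log-likelihood and the best state sequence
log_likelihood, states = model.decode(X, algorithm='viterbi')
print(log_likelihood, states[:10])
```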
The performance of the proposed system is evaluated through the Word Error Rate metric [24], given by Eq. (11)

$$WER = \frac{D + S + I}{N} \times 100\% \qquad (11)$$

Here \(N\) represents the total number of words in the reference transcriptions of the testing dataset, and \(D\), \(S\) and \(I\) are the errors due to deletion, substitution and insertion of words respectively.
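A straightforward implementation of this metric aligns the hypothesis to the reference with a word-level Levenshtein distance, as in the sketch below.

```python
def wer(reference, hypothesis):
    """WER = (D + S + I) / N via Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j]: minimum edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

print(wer("vaartegala vivara aarambha", "vaartegala vivara"))  # 33.33
```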
The speech data developed for this work includes isolated digits, 25 isolated words, and 600 spontaneous and continuously read sentences. The dataset is the contribution of 100 speakers and is 20 hours in size. Recordings were made in a normal room environment in the presence of various noises from the nearest road, located at a distance of 5 meters. Recording was done using MATLAB R2019b software at 16 kHz with 16-bit resolution. Table 2 presents 50 sample sentences from the speech dataset. The lexicon is designed according to the standard Indian Language Symbol Labels version 3 (ILSLv3) designed by the ASR Consortium for Indian languages. The speech corpus is designed by following the procedure suggested in the Summer School on Automatic Speech Recognition held at IIT Guwahati (IITG). The dataset consists of 3 sets: isolated digits, isolated words, and continuous and spontaneously spoken Kannada sentences. Isolated digits and words from TIMIT and sentences from LibriSpeech are also used in this work. A set of sample lexicons with transcriptions is given in Table 2.
Lexicon | Transliteration |
---|---|
Tumba | t u m b a |
Doora | d o o r a |
Prayana | p r a y a n a |
Maadida | m a a d i d a |
Nantara | n a n t a r a |
Kaldi is a free toolkit written in the C++ programming language [41]. Kannada has 46 phonemes, and the ASR system models all 46 of them. The Kannada alphabet labelled using ILSL12 is shown in Table 3, and a corresponding sample lexicon using the same labels is shown in Table 4.
KANNADA PHONEMES | | | | LABEL SET USING IT3:UTF-8 | | | |
---|---|---|---|---|---|---|---|
ಅ | ಓ | ಠ | ಫ | a | oo | txh | ph |
ಆ | ಔ | ಡ | ಬ | aa | au | dx | b |
ಇ | ಕ | ಢ | ಭ | i | k | dxh | bh |
ಈ | ಖ | ಣ | ಮ | ii | kh | nx | m |
ಉ | ಗ | ತ | ಯ | u | g | t | y |
ಊ | ಘ | ಥ | ರ | uu | gh | th | r |
ಎ | ಚ | ದ | ಲ | e | c | d | l |
ಏ | ಛ | ಧ | ವ | ee | ch | dh | w |
ಐ | ಜ | ನ | ಶ | ai | j | n | sh |
ಒ | ಟ | ಪ | ಸ | o | tx | p | s |
TEXT TRANSCRIPTION | LABEL SET USING ILSL12 |
---|---|
Tumba | t u m b a |
Doora | d o o r a |
Prayana | p r a y a n a |
Maadida | m a a d i d a |
Nantara | n a n t a r a |
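For data preparation, the entries of Table 4 can be written out in Kaldi's lexicon.txt format (one word followed by its phone labels per line, conventionally placed under data/local/dict/); the sketch below is a minimal illustration, not the project's actual script.

```python
# Sketch: writing a Kaldi-style lexicon.txt from word -> phone-label pairs
# (entries taken from Table 4; lowercase keys are an illustrative choice)
lexicon = {
    "tumba":   "t u m b a",
    "doora":   "d o o r a",
    "prayana": "p r a y a n a",
    "maadida": "m a a d i d a",
    "nantara": "n a n t a r a",
}
with open("lexicon.txt", "w", encoding="utf-8") as f:
    for word, phones in sorted(lexicon.items()):
        f.write(f"{word} {phones}\n")
```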
The lexicon is written keeping ILSL12 as the reference. Figure 10 shows the block diagram for deriving the features, and Fig. 11 shows the architecture of the proposed system. Table 5 describes the values of the parameters employed for constructing the acoustic models; three parallel jobs are used.
Parameters | Triphones 1 | Triphones 2 | Triphones 3 |
---|---|---|---|
Leaves | 2500 | 2500 | 2500 |
Gaussians | 20000 | 20000 | 20000 |
Table 6 and Fig. 12 present the WER results and their graphical counterpart for the continuous speech sentences of the Kannada language.
SET3_SENTENCES | Features | PLP | MFCC | Proposed |
---|---|---|---|---|
WER | Mono | 06.39 | 07.37 | 16.20 |
 | Tri 1 | 07.35 | 08.07 | 08.96 |
 | Tri 2 | 11.49 | 12.23 | 14.09 |
 | Tri 3 | 07.17 | 08.01 | 07.01 |
 | DNN-HMM | 06.33 | 06.06 | 04.27 |
Table 7 and Fig. 13 present the WER results and their graphical counterpart for the continuous speech sentences of the English language.
SET6_SENTENCES | Features | PLP | MFCC | Proposed |
---|---|---|---|---|
WER | Mono | 61.17 | 62.04 | 77.21 |
 | Tri 1 | 37.72 | 38.75 | 69.95 |
 | Tri 2 | 35.40 | 35.49 | 40.33 |
 | Tri 3 | 30.29 | 30.86 | 30.26 |
 | DNN-HMM | 35.20 | 30.68 | 30.42 |
The performance of the work is evaluated on 600 unseen test sentences. The results of the experiment are included in Table 8.
TASKS (53–55) | Features | PLP | MFCC | Proposed |
---|---|---|---|---|
WER | Mono | 07.40 | 07.80 | 10.40 |
 | Tri 1 | 04.00 | 04.40 | 05.60 |
 | Tri 2 | 03.80 | 03.20 | 02.80 |
 | Tri 3 | 03.20 | 02.80 | 02.40 |
 | DNN-HMM | 03.00 | 02.60 | 02.20 |
The proposed work is evaluated on continuous English sentences chosen from the TIMIT dataset corrupted by additive white noise; the results are tabulated in Table 9, and a sketch of the noise-addition procedure follows the table.
Tasks (19–27) | Features | PLP | MFCC | Proposed |
---|---|---|---|---|
WER − 15dB | Mono | 89.80 | 88.32 | 92.76 |
 | Tri 1 | 89.63 | 87.53 | 95.12 |
 | Tri 2 | 90.24 | 88.49 | 85.70 |
 | Tri 3 | 92.85 | 93.90 | 88.14 |
 | DNN-HMM | 83.17 | 80.30 | 79.33 |
WER − 20dB | Mono | 91.19 | 86.84 | 88.93 |
 | Tri 1 | 88.58 | 89.36 | 97.12 |
 | Tri 2 | 89.97 | 87.71 | 83.96 |
 | Tri 3 | 92.33 | 96.34 | 87.45 |
 | DNN-HMM | 83.61 | 82.21 | 81.08 |
WER − 25dB | Mono | 90.15 | 88.84 | 93.55 |
 | Tri 1 | 88.58 | 91.11 | 98.87 |
 | Tri 2 | 90.67 | 88.67 | 85.27 |
 | Tri 3 | 96.60 | 94.68 | 91.89 |
 | DNN-HMM | 84.22 | 82.91 | 80.64 |
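The noisy test sets of Tables 9 and 10 can be simulated by scaling white Gaussian noise to a target SNR before adding it to the clean waveform, as in this hedged sketch (the paper's exact noise-mixing procedure is not specified).

```python
import numpy as np

def add_awgn(x, snr_db):
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    p_signal = np.mean(x ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    return x + np.sqrt(p_noise) * np.random.randn(len(x))

clean = np.random.randn(16000)            # stand-in for a 1 s utterance
for snr in (15, 20, 25):                  # conditions used in Tables 9-10
    noisy = add_awgn(clean, snr)
    measured = 10 * np.log10(np.mean(clean ** 2)
                             / np.mean((noisy - clean) ** 2))
    print(f"target {snr} dB, measured {measured:.1f} dB")
```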
The proposed work is evaluated on continuous Kannada sentences chosen from the Kannada dataset corrupted by additive white noise; the results are presented in Table 10.
Tasks (28–36) | Features | PLP | MFCC | Proposed |
---|---|---|---|---|
WER − 15dB | Mono | 44.00 | 63.20 | 75.73 |
 | Tri 1 | 54.27 | 55.47 | 70.27 |
 | Tri 2 | 73.87 | 64.93 | 62.67 |
 | Tri 3 | 56.00 | 47.73 | 46.66 |
 | DNN-HMM | 36.80 | 32.00 | 31.33 |
WER − 20dB | Mono | 37.60 | 40.67 | 71.73 |
 | Tri 1 | 45.07 | 48.40 | 61.33 |
 | Tri 2 | 65.07 | 68.80 | 55.87 |
 | Tri 3 | 38.93 | 44.80 | 37.33 |
 | DNN-HMM | 26.80 | 24.00 | 23.20 |
WER − 25dB | Mono | 33.47 | 45.20 | 65.73 |
 | Tri 1 | 36.53 | 55.20 | 57.20 |
 | Tri 2 | 73.73 | 70.53 | 56.40 |
 | Tri 3 | 48.93 | 59.20 | 47.33 |
 | DNN-HMM | 22.40 | 27.33 | 21.33 |
The proposed work is also evaluated on continuous sentences chosen from the TIMIT and Kannada datasets in the presence of car noise and train noise; the results are presented in Table 11.
Tasks (37–52) | Features | PLP | MFCC | Proposed |
---|---|---|---|---|
WER − 15dB Car Noise [Kannada] | Mono | 51.33 | 39.33 | 71.33 |
 | Tri 1 | 44.13 | 56.93 | 61.73 |
 | Tri 2 | 78.93 | 79.07 | 62.27 |
 | Tri 3 | 60.67 | 56.67 | 56.27 |
 | DNN-HMM | 32.40 | 24.93 | 23.33 |
WER − 15dB Train Noise [Kannada] | Mono | 51.87 | 49.47 | 71.07 |
 | Tri 1 | 55.73 | 58.27 | 65.73 |
 | Tri 2 | 79.47 | 76.80 | 63.07 |
 | Tri 3 | 59.07 | 55.33 | 52.13 |
 | DNN-HMM | 34.53 | 35.60 | 33.86 |
WER [Real World Noise] | Mono | 1.67 | 1.27 | 0.53 |
 | Tri 1 | 1.67 | 1.07 | 0.40 |
 | Tri 2 | 0.93 | 1.07 | 0.40 |
 | Tri 3 | 0.80 | 0.80 | 0.40 |
 | DNN-HMM | 0.67 | 0.53 | 0.40 |
WER − 15dB Car Noise [English] | Mono | 87.79 | 84.83 | 95.29 |
 | Tri 1 | 86.49 | 88.49 | 92.24 |
 | Tri 2 | 85.96 | 89.10 | 87.45 |
 | Tri 3 | 87.45 | 90.24 | 86.31 |
 | DNN-HMM | 83.96 | 83.96 | 82.38 |
WER − 15dB Train Noise [English] | Mono | 88.58 | 88.67 | 97.38 |
 | Tri 1 | 87.58 | 88.49 | 94.94 |
 | Tri 2 | 88.23 | 88.40 | 89.01 |
 | Tri 3 | 92.50 | 89.89 | 87.18 |
 | DNN-HMM | 83.96 | 86.14 | 82.82 |
The results of the system are compared with those of systems previously reported in the literature. This comparison is reported in Table 12.
REFERENCE PAPERS | FEATURES | CLASSIFIERS USED | DATABASE | WER |
---|---|---|---|---|
Vishal et al. [82] | MFCC | Deep Neural Network | Hindi | 17.70% |
Mohit et al. [83] | MFCC, PLP, GFCC | GMM-HMM | Hindi | 13.22% |
Mohit et al. [84] | MFCC, GFCC | GMM-HMM | Hindi TIFR | 22.34% |
Thimmaraja Y et al. [85] | MFCC | SGMM and DNN | Mandi | 09.60% |
Virender et al. [86] | MFCC | GMM-HMM and DNN-HMM | Punjabi | 05.22% |
Mohit et al. [87] | MFCC, GFCC | GMM-HMM | Hindi | 13.10% |
Mohit et al. [88] | MFCC | GMM-HMM | Hindi TIFR | 15.60% |
Vishal et al. [89] | MFCC | CSVM | WSJ | 16.90% |
Thimmaraja Y et al. [90] | MFCC | Monophone, Triphones1 and Triphones2 | Kannada | 09.95% |
A. G. Ramakrishnan et al. [91] | MFCC | Monophone, Triphones1, Triphones2, Triphones3 and MT-DNN | Gujarati | 18.10% |
 | | | Tamil | 31.30% |
 | | | Telugu | 29.30% |
A. G. Ramakrishnan et al. [102] | MFCC | Monophones and Triphones | Sanskrit | 10.36% |
Proposed | MFCC | Monophone, Triphones1, Triphones2, Triphones3 and DNN-HMM | TIMIT | 33.04% |
 | PLP | | TIMIT | 32.78% |
 | Perceptual Wavelet Packet Features | | TIMIT | 31.38% |
Proposed | MFCC | Monophone, Triphones1, Triphones2, Triphones3 and DNN-HMM | Kannada | 06.06% |
 | PLP | | Kannada | 06.33% |
 | Perceptual Wavelet Packet Features | | Kannada | 04.27% |
Sl. No. | Kannada Sentence | Kannada Transcription |
---|---|---|
1. | ಕೊನೆಯಲ್ಲಿ ಮತ್ತೊಮ್ಮೆ ಸುದ್ದಿಗಳ ವಿವರವಿದೆ | koneyalli mattomme vaartegala vivara |
2. | ಮತ್ತೊಮ್ಮೆ ಸಂಸತ್ತಿನ ಸದನಗಳಲ್ಲಿ | mattomme samsattina sadanagalalli |
3. | ವಾರ್ತೆಗಳ ವಿವರ ಆರಂಭ | vaartegala vivara aarambha |
4. | ಮಾತಿನ ಮುಖ್ಯ ಅಂಶಗಳು | bhaasanada mukhyaamshagalu |
5. | ಸಂಸತ್ತಿನ ಮೇಲಿನ ನಿರ್ಣಯದ ಚರ್ಚೆ | samsattina meelina nirnayada carce |
6. | ವಾರ್ತೆಗಳ ಚರ್ಚೆ ವಿವರ ಆರಂಭ | varthegala carce vivara aarambha |
7. | ಸದನಗಳಲ್ಲಿಂದು ಉಭಯ ಸಂಸತ್ತಿನ | sadanagalallindu ubhaya samsattina |
8. | ರಾಷ್ಟ್ರಪತಿಗಳ ಮೇಲಿನ ವಂದನೆ | raastrapatigala meelina vandane |
9. | ಭಾಷಣಕ್ಕೆ ಸಲ್ಲಿಸಬೇಕಾದ ವಿವರಗಳು | bhaasanakke sallisuva vivara |
10. | ನಿರ್ಣಯದ ಚರ್ಚೆ ಆರಂಭವಾಗಿದೆ | nirnayada carce aarambhavaagide |
This concludes the ASR work carried out in this paper.
The authors comply with the ethical standards of the journal and also declare that they have no conflict of interest. The work is not sponsored by any funding agency.
DATA AVAILABILITY STATEMENT
The datasets used in this work can be obtained upon request to the corresponding author.