Robust Automatic Speech Recognition System for Kannada Speech Sentences in the Presence of Noise

DOI: https://doi.org/10.21203/rs.3.rs-1542540/v1

Abstract

An Automatic Speech Recognition (ASR) system is developed for recognizing continuous and spontaneous Kannada speech sentences. The language models and acoustic models are constructed using the Kaldi toolkit. The speech corpus is developed with native female and male Kannada speakers and is partitioned into a training set and a testing set. The performance of the proposed system is analysed and evaluated using the Word Error Rate (WER) metric. Wavelet packets amalgamated with Mel filter banks are utilized to perform feature vector generation. The proposed hand-crafted features perform better than baseline features such as Perceptual Linear Prediction (PLP) and Mel Frequency Cepstral Coefficients (MFCC) in terms of WER under both controlled and uncontrolled conditions.

1. Introduction

The natural pauses present between sequences of spoken words make speech a distinct and unique class of signals. The speech corpus developed in any environment must be subjected to pre-processing in order to build an effective ASR system. Speech is the easiest and most effective means of communication. The integration of deep learning with multidisciplinary speech signal research has drawn considerable attention from the speech research community, and the application of deep neural networks to speech, audio, video and text has evolved into a fascinating domain for speech research. Spoken sentence recognition is the task of detecting and recognizing spoken words or sentences with the help of computer algorithms. ASR applications have been developed for several Indian and foreign languages. The world languages danger report prepared by UNESCO during the year 2009 indicates that approximately 197 languages of India are in danger of becoming obsolete, and the report also highlights that the percentage of native speakers of these languages is becoming much lower day by day [1]. An ASR system has been developed using MFCC, HMM, vector quantization and i-vectors for the Assamese language; the proposed fusion technique performs better than Hidden Markov Models, vector quantization and i-vectors, with an accuracy of 100% [1]. An automatic speech recognition system was built and evaluated on a Bengali speech corpus: 39-dimensional MFCC features were obtained and triphone-based HMM models were trained, and the system achieved an accuracy of 87.30% [2]. An ASR system designed for Bangla speech using Mel-scale-based LPC features and HMMs provided 98.11% recognition accuracy [3]. A word recognition system designed with LPC features and HMMs for Hindi words accomplished an accuracy of 97.14% [4]. A word recognition system designed with MFCCs and HMM models for a Hindi speech corpus accomplished an accuracy of 94.63% [5]. A recognition system for a Hindi connected speech corpus using MFCC features and HMMs was designed and implemented, and an accuracy of 87.01% was recorded [6]. A digit recognition system using MFCC features and HMMs for a Malayalam corpus yielded 98.5% accuracy [7]. A speaker identification system formed from LPCC and MFCC features via vector quantization yielded an accuracy of 96.59% [8]. A language identification (LI) task was accomplished for five Indian languages, where cepstral features and a vector quantization technique achieved the classification; a word recognition accuracy (WRA) of 88.10% was obtained for Kannada sentences [9]. A Punjabi word recognition ASR system using LPC features and the dynamic time warping technique provided an accuracy of 94% [10]. A speaker recognition system was built using MFCC features, linear discriminant analysis, support vector machines and cosine distance scoring [11]. GMMs utilized LP-residual-processed features to perform the classification task [12]. In this work, an attempt has been made to utilize the proposed features with the Kaldi toolkit in degraded environments with the intention of improving system performance. The content is organized as follows: Sections 1 and 2 convey a brief introduction to ASR systems and related works, and Section 3 depicts the details of the feature extraction techniques and models.

2. Related Works

ASR applications are impressive in noise-free environments but are not promising solutions for experiments and tests in noisy areas [13, 14, 15, 16]. Hence, the throughput of an ASR system is limited by two factors: the relevant labelling of the corpus and the choice of feature vectors. MFCCs are derived from Mel filter banks [17, 18, 19, 20] and depend on the STFT, which requires that the signal being processed is stationary over short intervals of time [18, 20, 21, 22]. The challenge of time-frequency resolution is addressed by using wavelets [23, 24, 25]. The major freedom here is the use of windows with variable durations: the detail portion of the speech is subjected to short-duration windows, while the approximation coefficients of the speech are subjected to long-duration windows for processing [24, 26, 27]. Therefore, wavelets adapt to changing signals [20, 22]. Wavelet-based filter banks have worked well for phoneme recognition tasks [20, 22]. The multi-resolution ability of wavelet transforms has been exhaustively utilized by several researchers for applications such as ECG [28, 29] and EEG [32, 33, 34] analysis and signal quality improvement [30, 31].

3. Methodology

3.1 PREPROCESSING

The speech signal must pass through preprocessing operations before it is used by any ASR system. The popular preprocessing operations are framing using windows and pre-emphasis. The frame duration and frame shift are selected as 20 ms and 10 ms, respectively.
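For concreteness, a minimal sketch of these two preprocessing steps is given below, assuming NumPy, a 16 kHz sampling rate, a pre-emphasis coefficient of 0.97 and a Hamming window; the coefficient and window type are illustrative assumptions and are not stated above.

```python
import numpy as np

def preemphasize(x, alpha=0.97):
    # First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    return np.append(x[0], x[1:] - alpha * x[:-1])

def frame_signal(x, fs=16000, frame_ms=20, shift_ms=10):
    # Split the signal into 20 ms frames with a 10 ms shift and window each frame
    frame_len = int(fs * frame_ms / 1000)     # 320 samples at 16 kHz
    shift = int(fs * shift_ms / 1000)         # 160 samples at 16 kHz
    n_frames = 1 + max(0, (len(x) - frame_len) // shift)
    window = np.hamming(frame_len)
    return np.stack([x[i * shift:i * shift + frame_len] * window
                     for i in range(n_frames)])

x = np.random.randn(16000)                    # stand-in for 1 s of speech
frames = frame_signal(preemphasize(x))
print(frames.shape)                           # (99, 320)
```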

3.2 PROPOSED FEATURES

Wavelets have the multi-resolution property, and because of this the wavelet transform can be used to process both non-stationary and semi-stationary signals. In the literature, a large number of feature design algorithms have been presented for ASR systems operating in the presence of external disturbances from the natural environment. However, most of these algorithms use the Fourier transform to find the spectrum. Speech signals have both periodic and aperiodic regions, but the STFT uses a window of fixed duration in the time-frequency plane, so methods that use the STFT cannot handle such variations in speech signals. This challenge is addressed by the use of wavelets [35, 36, 37, 43–47], which offer flexible frequency resolution in the time-frequency plane.

3.2.1 Theory of Wavelet Transforms

A comparative description of WT and STFT is presented in Fig. 1.

3.2.2 Continuous Wavelet Transforms (CWTs)

CWT for a speech segment x(t) is described by

$${CWT}_{x}^{{\Psi }}\left(\tau ,s\right)=\frac{1}{\sqrt{\left|s\right|}}{\int }_{-\infty }^{\infty }x\left(t\right){{\Psi }}^{*}\left(\frac{t-\tau }{s}\right)dt \left(1\right)$$

In Eq. (1), \(\tau\) and \(s\) indicate translation and dilation respectively, and \({\Psi }\left(\text{t}\right)\) is the mother wavelet.

The mother wavelet plays the role of a prototype or basis from which the other analysis functions are constructed. The CWT is a computationally expensive transform.

3.2.3 Discrete Wavelet Transforms (DWTs)

The DWT is computationally less complex [49]. The DWT of a speech signal x(t) is defined as:

$$DWT\left(j,k\right)=\frac{1}{\sqrt{\left|{2}^{j}\right|}}{\int }_{-\infty }^{\infty }x\left(t\right)\psi \left(\frac{t-{2}^{j}k}{{2}^{j}}\right)dt \left(2\right)$$

Mallat effectively summarized the wavelet decomposition process. It is accomplished by passing the speech segments through a wavelet packet tree. Sample analysis trees are shown in Fig. 2. Here, \({h}_{0}\left(n\right)\) and \({h}_{1}\left(n\right)\) are the low-pass and high-pass analysis filters respectively, and \({g}_{0}\left(n\right)\), \({g}_{1}\left(n\right)\) form the corresponding synthesis filter pair.

\({h}_{0}\left(n\right)\) and \({h}_{1}\left(n\right)\), \({g}_{0}\left(n\right)\) and \({g}_{1}\left(n\right)\) are related by Eq. (3):

$${h}_{1}\left(n\right)= {\left(-1\right)}^{n}{g}_{0}\left(1 - n\right), \quad {g}_{1}\left(n\right) = {\left(-1\right)}^{n}{h}_{0}\left(1 - n\right) \left(3\right)$$

The decimation and interpolation operations by a factor of 2 are denoted as ↓2 and ↑2 respectively. Figure 3 shows the analysis and synthesis trees. In Fig. 3, \({\left\{{c}_{0}\left(n\right)\right\}}_{n\in Z}\) is the input to the tree [23].

$${c}_{1}\left(k\right)=\sum _{n}{h}_{0}\left(n-2k\right){c}_{0}\left(n\right) \left(4\right)$$
$${d}_{1}\left(k\right)=\sum _{n}{h}_{1}\left(n-2k\right){c}_{0}\left(n\right) \left(5\right)$$

where \({c}_{1}\left(k\right)\) and \({d}_{1}\left(k\right)\) represent the low-frequency space and the high-frequency space respectively. The synthesis tree shown in Fig. 3 is described by Eq. (6):

$${c}_{0}\left(m\right)=\sum _{k}\left[{g}_{0}\left(2k-m\right){c}_{1}\left(k\right) +{g}_{1}\left(2k-m\right){d}_{1}\left(k\right)\right] \left(6\right)$$
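A minimal sketch of one analysis level of Eq. (4)–(5) is given below, assuming NumPy and PyWavelets; the db4 decomposition filters stand in for \(h_0(n)\) and \(h_1(n)\), boundary handling is ignored for brevity, and the filters are used in the convolution convention adopted by PyWavelets, so the output differs slightly from a padded library transform.

```python
import numpy as np
import pywt

# db4 analysis filters used as stand-ins for h0(n) (low-pass) and h1(n) (high-pass)
w = pywt.Wavelet('db4')
h0, h1 = np.array(w.dec_lo), np.array(w.dec_hi)

def analysis_step(c0):
    # One decomposition level of Eq. (4)-(5): filter, then keep every second sample
    c1 = np.convolve(c0, h0)[1::2]   # approximation (low-frequency) coefficients
    d1 = np.convolve(c0, h1)[1::2]   # detail (high-frequency) coefficients
    return c1, d1

c0 = np.random.randn(256)            # stand-in input sequence {c0(n)}
c1, d1 = analysis_step(c0)
print(len(c1), len(d1))              # both roughly half the input length
```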

3.2.4 Wavelets for parameterization

By applying the iterative decomposition operation repeatedly, a desired wavelet tree is designed. Wavelet-based feature vectors are derived using the Daubechies wavelet (db4) [57]. A fourth-order wavelet is used here; as the order increases the performance improves, but with increased computational requirements.
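As an illustration, the sketch below builds a balanced 3-level wavelet packet tree over one frame with db4, assuming PyWavelets; the 24-band trees used later are obtained by decomposing selected nodes of such a tree further.

```python
import numpy as np
import pywt

frame = np.random.randn(320)   # one 20 ms frame at 16 kHz (stand-in data)

# Balanced 3-level wavelet packet decomposition with db4 -> 2**3 = 8 sub-bands,
# each covering 1 kHz when the sampling rate is 16 kHz
wp = pywt.WaveletPacket(data=frame, wavelet='db4', mode='symmetric', maxlevel=3)
subbands = [node.data for node in wp.get_level(3, order='freq')]
print(len(subbands))           # 8
```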

3.2.4.1 Mel Filter-like Wavelet Packet Analysis

A 24-band Mel-like wavelet feature set (WMFCC) is adopted [20]. The linear frequency \({f}_{c}\) is related to the mel-scale frequency \({f}_{mel}\) by Eq. (7):

$${f}_{mel}=2595{log}_{10}\left(1+\frac{{f}_{c}}{700}\right) \left(7\right)$$

The signal analysis is initialized with a balanced 3-level tree, which divides the frequency range into eight bands of 1 kHz each. The approximation space of 0–1 kHz is further decomposed into eight sub-bands of 125 Hz bandwidth each, which is close to the 100 Hz bandwidth of a Mel filter in this range. In this way a 24-band Mel-scale-like WP filter bank is designed [20] (see Table 1).
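The mel warping of Eq. (7) can be checked directly against the wavelet packet band edges listed in Table 1; the short sketch below (NumPy assumed) converts the 24 upper band edges to mel values.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Eq. (7): warp a linear frequency (Hz) onto the mel scale
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

# Upper band edges (Hz) of the 24 wavelet packet sub-bands from Table 1
wp_edges = [125, 250, 375, 500, 625, 750, 875, 1000, 1125, 1250, 1375, 1500,
            1750, 2000, 2250, 2500, 2750, 3000, 3500, 4000, 5000, 6000, 7000, 8000]
print(np.round(hz_to_mel(wp_edges)))   # the band edges expressed on the mel scale
```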

Table 1

Comparative description of Wavelet frequency bands and MFCC frequency bands (all frequencies in Hz).

Filter | Mel Scale | Wavelet Sub-band | Filter | Mel Scale | Wavelet Sub-band | Filter | Mel Scale | Wavelet Sub-band
1 | 100 | 125 | 9 | 900 | 1125 | 17 | 2639 | 2750
2 | 200 | 250 | 10 | 1000 | 1250 | 18 | 3031 | 3000
3 | 300 | 375 | 11 | 1149 | 1375 | 19 | 3482 | 3500
4 | 400 | 500 | 12 | 1320 | 1500 | 20 | 4000 | 4000
5 | 500 | 625 | 13 | 1516 | 1750 | 21 | 4595 | 5000
6 | 600 | 750 | 14 | 1741 | 2000 | 22 | 5278 | 6000
7 | 700 | 875 | 15 | 2000 | 2250 | 23 | 6063 | 7000
8 | 800 | 1000 | 16 | 2297 | 2500 | 24 | 6954 | 8000

The 24-band Mel-filter-like wavelet packet sub-bands are shown in Fig. 5 [20].

The energy is determined by

$${\left\langle {S}_{i}\right\rangle }_{k}=\frac{1}{{N}_{i}}\sum _{n}{\left|{\left({\omega }_{{\Psi }}\left(x,k\right)\right)}_{i}\right|}^{2} \left(8\right)$$

where \({\left({\omega }_{{\Psi }}\left(x,k\right)\right)}_{i}\) are the wavelet packet coefficients of the speech segment \(x\), \(i\) indicates the sub-band number (\(1\le i\le M\)), \(k\) represents the frame number and \({N}_{i}\) is the total number of samples in the \({i}^{th}\) sub-band. As with MFCC, the 24 sub-band energies are logarithmically compressed. The compressed coefficients are further processed by a DCT to achieve energy compaction, and the first 13 coefficients are chosen as the WMFCC features. The steps of parameterization are shown in Fig. 6.
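A minimal sketch of this parameterization pipeline (sub-band energy of Eq. (8), logarithmic compression, DCT, first 13 coefficients) is given below, assuming NumPy, SciPy and PyWavelets. For brevity it uses a balanced 5-level tree (32 bands) rather than the exact 24-band tree, so it illustrates the processing flow rather than the precise filter bank used here.

```python
import numpy as np
import pywt
from scipy.fftpack import dct

def wp_cepstral_features(frame, wavelet='db4', level=5, n_ceps=13):
    # Decompose one frame into wavelet packet sub-bands
    wp = pywt.WaveletPacket(frame, wavelet=wavelet, mode='symmetric', maxlevel=level)
    nodes = wp.get_level(level, order='freq')
    # Eq. (8): mean squared coefficient magnitude per sub-band
    energies = np.array([np.sum(n.data ** 2) / len(n.data) for n in nodes])
    log_energies = np.log(energies + 1e-12)                    # logarithmic compression
    return dct(log_energies, type=2, norm='ortho')[:n_ceps]    # DCT for energy compaction

frame = np.random.randn(320)                 # one 20 ms frame at 16 kHz
print(wp_cepstral_features(frame).shape)     # (13,)
```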

3.2.4.2 Proposed Hybrid PWP tree for parameterization

A 24-band wavelet tree is designed for parameterization. Repeated experimental investigation and analysis revealed that the 24-band Wavelet Packet (WP) tree shown in Fig. 7 is the optimal tree for this task.

The energies of the decomposed coefficients are calculated, compressed logarithmically and processed by a DCT to choose 13 optimal coefficients.

3.2.5 ACOUSTIC MODELS

Acoustic modelling plays a vital role in any ASR system. It is the task of mapping the matrix of feature vectors to the desired phoneme sequences of the hypothesized sentence. This is accomplished through the use of the Hidden Markov Model (HMM) classifier.

3.2.6 LANGUAGE MODELS

The most popular language models used in ASR systems are n-gram language models. These models predict the \({n}^{th}\) word utilizing the \(\left(n-1\right)\) previously occurring words. Trigram \((n=3)\) and bigram \((n=2)\) models are commonly used in language modelling.
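As a toy illustration of maximum-likelihood n-gram estimation (here a bigram model, with sentences borrowed from the transliterations of Table 13), consider the sketch below; a real system would additionally apply smoothing and typically trigram context.

```python
from collections import Counter

# Toy training corpus (Kannada transliterations drawn from Table 13)
corpus = [["vaartegala", "vivara", "aarambha"],
          ["nirnayada", "carce", "aarambhavaagide"],
          ["vaartegala", "vivara"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens[:-1])              # history counts
    bigrams.update(zip(tokens, tokens[1:]))   # bigram counts

def p_bigram(w_prev, w):
    # Maximum-likelihood estimate P(w | w_prev) = C(w_prev, w) / C(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0

print(p_bigram("vaartegala", "vivara"))       # 1.0 in this toy corpus
```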

3.2.7 RECOGNITION

The baseline classifiers such as GMM-HMMs (monophone models and their extended triphone versions) and Deep Neural Networks (DNNs) are used in this work for achieving speech recognition.
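For the DNN-HMM branch, a minimal sketch of a feed-forward acoustic model is shown below, assuming PyTorch, 13-dimensional features spliced over ±5 frames, and 2500 tied-state targets (borrowing the leaf count of Table 5); the layer sizes and splicing width are illustrative assumptions, not the configuration used in the experiments.

```python
import torch
import torch.nn as nn

class DnnAcousticModel(nn.Module):
    # Feed-forward DNN mapping spliced feature frames to tied-state (senone) scores
    def __init__(self, feat_dim=13, context=5, n_targets=2500, hidden=512):
        super().__init__()
        in_dim = feat_dim * (2 * context + 1)       # 13 x 11 = 143 inputs
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_targets),
        )

    def forward(self, x):
        return self.net(x)                          # logits; log-softmax gives posteriors

model = DnnAcousticModel()
batch = torch.randn(8, 13 * 11)                     # 8 spliced frames
print(model(batch).shape)                           # torch.Size([8, 2500])
```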

3.2.8 HIDDEN MARKOV MODELS

To find the probabilities \(P\left(W|X\right)\), the Markov chain with 3 states shown in Fig. 8 is used. During the training phase, the initial state probabilities (\(\pi\)), the state transition probabilities (A), and the symbol emission probabilities (B) are determined using the Baum-Welch algorithm.

$$\lambda =\left(A, B, \pi \right) \left(9\right)$$

The log-likelihood of each word sequence is found using the Viterbi decoding method described by Eq. (10):

$${v}^{*}=\underset{1\le v\le V}{\text{arg max}}\left[P\left(O|{\lambda }_{v}\right)\right] \left(10\right)$$

where \(V\) is the number of word models in the vocabulary.
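A minimal NumPy sketch of this maximum-likelihood decoding rule over isolated word models is given below; the toy HMM parameters are illustrative, and in the Kaldi-based sentence recognizer the search is carried out over a composed decoding graph rather than per-word models.

```python
import numpy as np

def viterbi_log_likelihood(log_pi, log_A, log_B):
    # Best-path log-likelihood of one observation sequence under one HMM (Eq. (10)).
    # log_pi: (S,) initial, log_A: (S, S) transitions, log_B: (S, T) per-frame emission log-probs
    delta = log_pi + log_B[:, 0]
    for t in range(1, log_B.shape[1]):
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[:, t]
    return np.max(delta)

# Toy 3-state left-to-right HMM scored on 4 observation frames
log_pi = np.log([1.0, 1e-10, 1e-10])
log_A = np.log([[0.6, 0.4, 1e-10],
                [1e-10, 0.6, 0.4],
                [1e-10, 1e-10, 1.0]])
log_B = np.log(np.random.rand(3, 4))     # stand-in emission log-probabilities
print(viterbi_log_likelihood(log_pi, log_A, log_B))
```

The recognized word is the model \({\lambda }_{v}\) whose score maximizes Eq. (10).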

3.2.9 ASR PERFORMANCE ANALYSIS

The recognition performance of the proposed system is evaluated through the word error rate (WER) metric [24], given by Eq. (11):

$$WER\left(\%\right)=\frac{(D+S+I)}{N}\times 100\left(\%\right) \left(11\right)$$

Here, \(N\) represents the total number of words in the reference transcriptions of the test set, and \(D\), \(S\) and \(I\) denote the numbers of deleted, substituted and inserted words, respectively.
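For reference, a small sketch of the WER computation of Eq. (11) via word-level Levenshtein alignment is shown below (plain Python, illustrative only).

```python
def wer(reference, hypothesis):
    # Word error rate: (S + D + I) / N, with N the number of reference words (Eq. (11))
    r, h = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution or match
    return 100.0 * dp[len(r)][len(h)] / max(len(r), 1)

print(round(wer("vaartegala vivara aarambha", "vaartegala vivara"), 2))   # 33.33 (one deletion)
```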

4. Speech Database

The speech data developed for this work includes isolated digits, 25 isolated words, and 600 spontaneous and continuously read sentences. The dataset, of about 20 hours, was contributed by 100 speakers. Recordings were made in a normal room environment in the presence of various noises originating from the nearest road, located at a distance of 5 meters. Recording was done using Matlab R2019b software at a 16 kHz sampling rate with 16-bit resolution. Table 2 presents 50 sample sentences from the speech dataset. The lexicon is designed according to the standard Indian Language Symbol Labels version 3 (ILSLv3) defined by the ASR Consortium for Indian languages. The speech corpus was designed by following the procedure suggested in the Summer School on Automatic Speech Recognition held at IIT Guwahati (IITG). The dataset consists of three sets: isolated digits, isolated words, and continuous and spontaneously spoken Kannada sentences. The TIMIT (isolated digits and words) and Librispeech (sentences) corpora are also used in this work. A set of sample lexicon entries with transcriptions is given in Table 2.

Table 2

Kannada sentences with transliteration.

Lexicon | Transliteration
Tumna | t u m b a
Doora | d o o r a
Prayana | p r a y a n a
Maadida | m a a d i d a
Nantara | n a n t a r a

4.1 KALDI TOOLKIT

Kaldi is a free toolkit written in the C++ programming language [41]. The Kannada language has 46 phonemes, and the ASR system models all 46 of them. The labelling of the Kannada alphabet using ILSL12 is shown in Table 3, and a sample lexicon using the same labels is shown in Table 4.

Table 3

The labelling of Kannada alphabets using ILSL12 (label set using IT3:UTF-8)

a | oo | txh | ph
aa | au | dx | b
i | k | dxh | bh
ii | kh | nx | m
u | g | t | y
uu | gh | th | r
e | c | d | l
ee | ch | dh | w
ai | j | n | sh
o | tx | p | s

Table 4

Sample lexicon of the speech dataset

Text Transcription | Label Set Using ILSL12
Tumna | t u m b a
Doora | d o o r a
Prayana | p r a y a n a
Maadida | m a a d i d a
Nantara | n a n t a r a

The lexicon is written with ILSL12 as the reference. Figure 10 shows the block diagram for deriving the features, and Fig. 11 shows the architecture of the proposed system. Table 5 lists the parameter values employed for constructing the acoustic models. The number of parallel jobs is set to 3.

Table 5

Parameters of the Acoustic Models

Parameter | Triphones 1 | Triphones 2 | Triphones 3
Leaves | 2500 | 2500 | 2500
Gaussians | 20000 | 20000 | 20000

5. Results

Table 6 and Fig. 12 present the WER results and their graphical counterpart for the continuous speech sentences of the Kannada language.

Table 6

WER performance using PLP, MFCC, Proposed features on Kannada data

SET3_SENTENCES (WER in %)

Model | PLP | MFCC | Proposed
Mono | 06.39 | 07.37 | 16.20
Tri 1 | 07.35 | 08.07 | 08.96
Tri 2 | 11.49 | 12.23 | 14.09
Tri 3 | 07.17 | 08.01 | 07.01
DNN-HMM | 06.33 | 06.06 | 04.27

Table 7 and Fig. 13 present the WER results and their graphical counterpart for the continuous speech sentences of the English language.

Table 7

WER performance on Librispeech dataset

SET6_SENTENCES (WER in %)

Model | PLP | MFCC | Proposed
Mono | 61.17 | 62.04 | 77.21
Tri 1 | 37.72 | 38.75 | 69.95
Tri 2 | 35.40 | 35.49 | 40.33
Tri 3 | 30.29 | 30.86 | 30.26
DNN-HMM | 35.20 | 30.68 | 30.42

The performance of the work is evaluated on 600 unseen test sentences. The results of the experiment are included in Table 8.

Table 8

WER performance on Kannada words

TASKS (53–55) (WER in %)

Model | PLP | MFCC | Proposed
Mono | 07.40 | 07.80 | 10.40
Tri 1 | 04.00 | 04.40 | 05.60
Tri 2 | 03.80 | 03.20 | 02.80
Tri 3 | 03.20 | 02.80 | 02.40
DNN-HMM | 03.00 | 02.60 | 02.20

The proposed work is evaluated on continuous English sentences chosen from the TIMIT dataset corrupted by additive white noise, and the results are tabulated in Table 9.

Table 9

WER performance on English sentences from the TIMIT database in the presence of additive noise.

Tasks (19–27) (WER in %)

Noise level | Model | PLP | MFCC | Proposed
15 dB | Mono | 89.80 | 88.32 | 92.76
15 dB | Tri 1 | 89.63 | 87.53 | 95.12
15 dB | Tri 2 | 90.24 | 88.49 | 85.70
15 dB | Tri 3 | 92.85 | 93.90 | 88.14
15 dB | DNN-HMM | 83.17 | 80.30 | 79.33
20 dB | Mono | 91.19 | 86.84 | 88.93
20 dB | Tri 1 | 88.58 | 89.36 | 97.12
20 dB | Tri 2 | 89.97 | 87.71 | 83.96
20 dB | Tri 3 | 92.33 | 96.34 | 87.45
20 dB | DNN-HMM | 83.61 | 82.21 | 81.08
25 dB | Mono | 90.15 | 88.84 | 93.55
25 dB | Tri 1 | 88.58 | 91.11 | 98.87
25 dB | Tri 2 | 90.67 | 88.67 | 85.27
25 dB | Tri 3 | 96.60 | 94.68 | 91.89
25 dB | DNN-HMM | 84.22 | 82.91 | 80.64

The proposed work is evaluated on continuous Kannada sentences chosen from the Kannada dataset corrupted by additive white noise, and the results are presented in Table 10.

Table 10

WER performance on Kannada sentences in the presence of additive noise.

Tasks (28–36) (WER in %)

Noise level | Model | PLP | MFCC | Proposed
15 dB | Mono | 44.00 | 63.20 | 75.73
15 dB | Tri 1 | 54.27 | 55.47 | 70.27
15 dB | Tri 2 | 73.87 | 64.93 | 62.67
15 dB | Tri 3 | 56.00 | 47.73 | 46.66
15 dB | DNN-HMM | 36.80 | 32.00 | 31.33
20 dB | Mono | 37.60 | 40.67 | 71.73
20 dB | Tri 1 | 45.07 | 48.40 | 61.33
20 dB | Tri 2 | 65.07 | 68.80 | 55.87
20 dB | Tri 3 | 38.93 | 44.80 | 37.33
20 dB | DNN-HMM | 26.80 | 24.00 | 23.20
25 dB | Mono | 33.47 | 45.20 | 65.73
25 dB | Tri 1 | 36.53 | 55.20 | 57.20
25 dB | Tri 2 | 73.73 | 70.53 | 56.40
25 dB | Tri 3 | 48.93 | 59.20 | 47.33
25 dB | DNN-HMM | 22.40 | 27.33 | 21.33

The proposed work is also evaluated on continuous sentences chosen from the TIMIT and Kannada datasets in the presence of car noise and train noise, and the results are presented in Table 11.

Table 11

WER performance on the Kannada database and the TIMIT database [99] in the presence of car noise and train noise extracted from the NOISEX-92 database [97].

Tasks (37–52) (WER in %)

Condition | Model | PLP | MFCC | Proposed
Car noise, 15 dB [Kannada] | Mono | 51.33 | 39.33 | 71.33
Car noise, 15 dB [Kannada] | Tri 1 | 44.13 | 56.93 | 61.73
Car noise, 15 dB [Kannada] | Tri 2 | 78.93 | 79.07 | 62.27
Car noise, 15 dB [Kannada] | Tri 3 | 60.67 | 56.67 | 56.27
Car noise, 15 dB [Kannada] | DNN-HMM | 32.40 | 24.93 | 23.33
Train noise, 15 dB [Kannada] | Mono | 51.87 | 49.47 | 71.07
Train noise, 15 dB [Kannada] | Tri 1 | 55.73 | 58.27 | 65.73
Train noise, 15 dB [Kannada] | Tri 2 | 79.47 | 76.80 | 63.07
Train noise, 15 dB [Kannada] | Tri 3 | 59.07 | 55.33 | 52.13
Train noise, 15 dB [Kannada] | DNN-HMM | 34.53 | 35.60 | 33.86
Real-world noise | Mono | 1.67 | 1.27 | 0.53
Real-world noise | Tri 1 | 1.67 | 1.07 | 0.40
Real-world noise | Tri 2 | 0.93 | 1.07 | 0.40
Real-world noise | Tri 3 | 0.80 | 0.80 | 0.40
Real-world noise | DNN-HMM | 0.67 | 0.53 | 0.40
Car noise, 15 dB [English] | Mono | 87.79 | 84.83 | 95.29
Car noise, 15 dB [English] | Tri 1 | 86.49 | 88.49 | 92.24
Car noise, 15 dB [English] | Tri 2 | 85.96 | 89.10 | 87.45
Car noise, 15 dB [English] | Tri 3 | 87.45 | 90.24 | 86.31
Car noise, 15 dB [English] | DNN-HMM | 83.96 | 83.96 | 82.38
Train noise, 15 dB [English] | Mono | 88.58 | 88.67 | 97.38
Train noise, 15 dB [English] | Tri 1 | 87.58 | 88.49 | 94.94
Train noise, 15 dB [English] | Tri 2 | 88.23 | 88.40 | 89.01
Train noise, 15 dB [English] | Tri 3 | 92.50 | 89.89 | 87.18
Train noise, 15 dB [English] | DNN-HMM | 83.96 | 86.14 | 82.82

The results of the proposed system are compared with those of systems previously reported in the literature. This comparison is reported in Table 12.


Table 12

WER performance of proposed method compared to those of literature.

Reference | Features | Classifiers Used | Database | WER
Vishal et al. [82] | MFCC | Deep Neural Network | Hindi | 17.70%
Mohit et al. [83] | MFCC, PLP, GFCC | GMM-HMM | Hindi | 13.22%
Mohit et al. [84] | MFCC, GFCC | GMM-HMM | Hindi TIFR | 22.34%
Thimmaraja Y et al. [85] | MFCC | SGMM and DNN | Mandi | 09.60%
Virender et al. [86] | MFCC | GMM-HMM and DNN-HMM | Punjabi | 05.22%
Mohit et al. [87] | MFCC, GFCC | GMM-HMM | Hindi | 13.10%
Mohit et al. [88] | MFCC | GMM-HMM | Hindi TIFR | 15.60%
Vishal et al. [89] | MFCC | CSVM | WSJ | 16.90%
Thimmaraja Y et al. [90] | MFCC | Monophone, Triphones 1 and Triphones 2 | Kannada | 09.95%
A. G. Ramakrishnan et al. [91] | MFCC | Monophone, Triphones 1–3 and MT-DNN | Gujarati | 18.10%
A. G. Ramakrishnan et al. [91] | MFCC | Monophone, Triphones 1–3 and MT-DNN | Tamil | 31.30%
A. G. Ramakrishnan et al. [91] | MFCC | Monophone, Triphones 1–3 and MT-DNN | Telugu | 29.30%
A. G. Ramakrishnan et al. [102] | MFCC | Monophones and Triphones | Sanskrit | 10.36%
Proposed | MFCC | Monophone, Triphones 1–3 and DNN-HMM | TIMIT | 33.04%
Proposed | PLP | Monophone, Triphones 1–3 and DNN-HMM | TIMIT | 32.78%
Proposed | Perceptual Wavelet Packet Features | Monophone, Triphones 1–3 and DNN-HMM | TIMIT | 31.38%
Proposed | MFCC | Monophone, Triphones 1–3 and DNN-HMM | Kannada | 06.06%
Proposed | PLP | Monophone, Triphones 1–3 and DNN-HMM | Kannada | 06.33%
Proposed | Perceptual Wavelet Packet Features | Monophone, Triphones 1–3 and DNN-HMM | Kannada | 04.27%

Table 13

Kannada unseen sentences.

Sl. No. | Kannada Sentence | Kannada Transcription
1 | ಕೊನೆಯಲ್ಲಿ ಮತ್ತೊಮ್ಮೆ ಸುದ್ದಿಗಳ ವಿವರವಿದೆ | koneyalli mattomme vaartegala vivara
2 | ಮತ್ತೊಮ್ಮೆ ಸಂಸತ್ತಿನ ಸದನಗಳಲ್ಲಿ | mattomme samsattina sadanagalalli
3 | ವಾರ್ತೆಗಳ ವಿವರ ಆರಂಭ | vaartegala vivara aarambha
4 | ಮಾತಿನ ಮುಖ್ಯ ಅಂಶಗಳು | bhaasanada mukhyaamshagalu
5 | ಸಂಸತ್ತಿನ ಮೇಲಿನ ನಿರ್ಣಯದ ಚರ್ಚೆ | samsattina meelina nirnayada carce
6 | ವಾರ್ತೆಗಳ ಚರ್ಚೆ ವಿವರ ಆರಂಭ | varthegala carce vivara aarambha
7 | ಸದನಗಳಲ್ಲಿಂದು ಉಭಯ ಸಂಸತ್ತಿನ | sadanagalallindu ubhaya samsattina
8 | ರಾಷ್ಟ್ರಪತಿಗಳ ಮೇಲಿನ ವಂದನೆ | raastrapatigala meelina vandane
9 | ಭಾಷಣಕ್ಕೆ ಸಲ್ಲಿಸಬೇಕಾದ ವಿವರಗಳು | bhaasanakke sallisuva vivara
10 | ನಿರ್ಣಯದ ಚರ್ಚೆ ಆರಂಭವಾಗಿದೆ | nirnayada carce aarambhavaagide

6. Conclusion

The ASR work carried out in this paper is summarized as follows.

Declarations

The authors comply with the ethical standards of the journal and also declare that they have no conflict of interest. The work is not sponsored by any funding agency.

DATA AVAILABILITY STATEMENT

The datasets used in this work can be obtained upon request to the corresponding author.

References

  1. Bharali, S. S., & Kalita, S. K. (2018). Speech recognition with reference to Assamese language using novel fusion technique. Int J Speech Technol, 21, 251. https://doi.org/10.1007/s10772-018-9501-1
  2. Hassan, F., Khan, M. S. A., Kotwal, M. R. A., & Huda, M. N. Gender independent Bangla automatic speech recognition. In International Conference on Informatics, Electronics & Vision (ICIEV-2012)
  3. Muslima, U., & Islam, M. B. (2014). Experimental framework for melscaled LP based Bangla speech recognition. In 2013 IEEE 16th international conference on computer and information technology (ICCIT), Khulna (pp. 56–59)
  4. Pruthi, T., Saksena, S., & Das, P. K. (2000). Swaranjali: Isolated word recognition for Hindi language using VQ and HMM. In International Conference on Multimedia Processing and Systems (ICMPS), Chennai
  5. Kumar, K., & Aggarwal, R. K. (2011). Hindi speech recognition system using HTK. International Journal of Computing and Business Research, 2(2), 2229–6166
  6. Kumar, K., Aggarwal, R. K., & Jain, A. (2012). A Hindi speech recognition system for connected words using HTK. International Journal of Computational Systems Engineering, 1(1), 25–32
  7. Kurian, C., & Balakrishnan, K. (2009). Speech recognition of Malayalam numbers. In IEEE World Congress on Nature & Biologically Inspired Computing, 2009. NaBIC Coimbatore (pp. 1475–1479)
  8. Bansal, P., Dev, A., & Jain, S. B. (2007). Automatic speaker identification using vector quantization. Asian Journal of Information Technology, 6(9), 938–942
  9. Balleda, J., Murthy, H. A., & Nagarajan, T. (2000). Language identification from short segments of speech. In Interspeech Beijing
  10. Kumar, R., & Singh, M. (2011). Spoken isolated word recognition of Punjabi language using dynamic time warp technique. Information systems for Indian languages (pp. 301–301). Berlin: Springer
  11. Senoussaoui, M., Kenny, P., Dehak, N., & Dumouchel, P. (2010). An I-vector extractor suitable for speaker recognition with both microphone and telephone speech. Brno: In Odyssey
  12. Nandi, D., Pati, D., & Sreenivasa Rao, K. (2017). Implicit processing of LP residual for language identification. Computer Speech & Language, Volume 41, Pages 68–87, ISSN 0885-2308. https://doi.org/10.1016/j.csl.2016.06.002
  13. Kim, C., & Stern, R. M. (2012). Power-normalized cepstral coefficients (PNCC) for robust speech recognition. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4101–4104). IEEE. https://doi.org/10.1109/ICASSP.2012.6288820
  14. Li, J., Deng, L., Gong, Y., & Haeb-Umbach, R. (2014). An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(4), 745–777. https://doi.org/10.1109/TASLP.2014.2304637
  15. Mukherjee, H., Obaidullah, S. M., Santosh, K. C., Phadikar, S., & Roy, K. (2018). Line spectral frequency-based features and extreme learning machine for voice activity detection from audio signal. International Journal of Speech Technology. https://doi.org/10.1007/s10772-018-9525-6
  16. Bouguelia, M. R., Nowaczyk, S., Santosh, K. C., & Verikas, A. (2017). Agreeing to disagree: Active learning with noisy labels without crowdsourcing. International Journal of Machine Learning and Cybernetics, 9(8), 1307–1319. https://doi.org/10.1007/s13042-017-0645-0
  17. Davis, S., & Mermelstein, P. (1980). Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4), 357–366. https://doi.org/10.1109/TASSP.1980.1163420
  18. Farooq, O., Datta, S., & Shrotriya, M. C. (2010). Wavelet sub-band based temporal features for robust Hindi phoneme recognition. International Journal of Wavelets, Multiresolution and Information Processing, 08(06), 847–859. https://doi.org/10.1142/S0219691310003845
  19. Ganchev, T., Fakotakis, N., & Kokkinakis, G. (2005). Comparative evaluation of various MFCC implementations on the speaker verification task. In Proceedings of the SPECOM (pp. 191–194)
  20. Farooq, O., & Datta, S. (2001). Mel filter-like admissible wavelet packet structure for speech recognition. IEEE Signal Processing Letters, 8(7), 196–198. https://doi.org/10.1109/97.928676
  21. Grigoryan, A. M. (2005). Fourier transform representation by frequency-time wavelets. IEEE Transactions on Signal Processing, 53(7), 2489–2497. https://doi.org/10.1109/TSP.2005.849180
  22. Biswas, A., Sahu, P. K., Bhowmick, A., & Chandra, M. (2014a). Feature extraction technique using ERB like wavelet sub-band periodic and aperiodic decomposition for TIMIT phoneme recognition. International Journal of Speech Technology, 17(4), 389–399. https://doi.org/10.1007/s10772-014-9236-6
  23. Biswas, A., Sahu, P. K., & Chandra, M. (2016). Admissible wavelet packet sub-band based harmonic energy features using ANOVA fusion techniques for Hindi phoneme recognition. IET Signal Processing, 10(8), 902–911. https://doi.org/10.1049/iet-spr.2015.0488
  24. Steffen, P., Heller, P. N., Gopinath, R. A., & Burrus, C. S. (1993). Theory of regular M-band wavelet bases. IEEE Transactions on Signal Processing, 41(12), 3497–3511. https://doi.org/10.1109/78.258088
  25. Vetterli, M., & Herley, C. (1992). Wavelets and filter banks: Theory and design. IEEE Transactions on Signal Processing, 40(9), 2207–2232. https://doi.org/10.1109/78.157221
  26. Lin, T., Xu, S., Shi, Q., & Hao, P. (2006b). An algebraic construction of orthonormal M-band wavelets with perfect reconstruction. Applied Mathematics and Computation, 172(2), 717–730. https://doi.org/10.1016/j.amc.2004.11.025
  27. Pollock, S., & Cascio, I. L. (2007). Non-dyadic wavelet analysis. Optimisation, econometric and financial analysis (pp. 167–203). Berlin: Springer. https://doi.org/10.1007/3-540-36626-1_9
  28. Chiu, C. C., Chuang, C. M., & Hsu, C. Y. (2009). Discrete wavelet transform applied on personal identity verification with ECG signal. International Journal of Wavelets, Multiresolution and Information Processing, 07(03), 341–355. https://doi.org/10.1142/S0219691309002957
  29. Rajoub, B., Alshamali, A., & Al-Fahoum, A. S. (2002). An efficient coding algorithm for the compression of ECG signals using the wavelet transform. IEEE Transactions on Biomedical Engineering, 49(4), 355–362. https://doi.org/10.1109/10.991163
  30. Tabibian, S., Akbari, A., & Nasersharif, B. (2015). Speech enhancement using a wavelet thresholding method based on symmetric Kullback–Leibler divergence. Signal Processing, 106, 184–197. https://doi.org/10.1016/J.SIGPRO.2014.06.027
  31. Zao, L., Coelho, R., & Flandrin, P. (2014). Speech enhancement with EMD and hurst-based mode selection. IEEE Transactions on Audio, Speech and Language Processing, 22(5), 899–911. https://doi.org/10.1109/TASLP.2014.2312541
  32. Adeli, H., Zhou, Z., & Dadmehr, N. (2003). Analysis of EEG records in an epileptic patient using wavelet transform. Journal of Neuroscience Methods, 123(1), 69–87. https://doi.org/10.1016/S0165-0270(02)00340-0
  33. Ocak, H. (2009). Automatic detection of epileptic seizures in EEG using discrete wavelet transform and approximate entropy. Expert Systems with Applications, 36(2), 2027–2036. https://doi.org/10.1016/J.ESWA.2007.12.065
  34. Biswas, A., Sahu, P. K., & Chandra, M. (2014b). Admissible wavelet packet features based on human inner ear frequency response for Hindi consonant recognition. Computers & Electrical Engineering, 40(4), 1111–1122. https://doi.org/10.1016/J.COMPELECENG.2014.01.008
  35. Leggetter, C. J., & Woodland, P. C. (1995). Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech & Language, 9(2), 171–185
  36. Gales, M. (2000). Cluster adaptive training of hidden Markov models. IEEE transactions on speech and audio processing, 8(4), 417–428
  37. Karpov, A., et al. (2014). Large vocabulary Russian speech recognition using syntactico-statistical language modeling. Speech Communication, 56, 213–228
  38. Daubechies, I. (1992). Ten lectures on wavelets. Society for industrial and applied mathematics
  39. http://(IT3%20to%20UTF-8)_11.html
  40. http://www.iitg.ac.in/samudravijaya/tutorials/ILSL_V3.2.pdf
  41. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., et al. (2011). The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society
  42. Thimmaraja Yadava, G., & Jayanna, H. S. (2017). A spoken query system for the agricultural commodity prices and weather information access in Kannada language. International Journal of Speech Technology
  43. Performance of Isolated and Continuous Digit Recognition System using Kaldi Toolkit (2019). International Journal of Recent Technology and Engineering
  44. Thimmaraja Yadava, G., & Jayanna, H. S. (2018). "Creation and Comparison of Language and Acoustic Models Using Kaldi for Noisy and Enhanced Speech Data",International Journal of Intelligent Systems and Applications,
  45. Praveen Kumar, P. S., Thimmaraja Yadava, G., & Jayanna, H. S. (2019). "Continuous Kannada Speech Recognition System Under Degraded Condition", Circuits, Systems, and Signal Processing,
  46. Biswas, A., Sahu, P. K., Bhowmick, A., & Chandra, M. (2015). Hindi phoneme classification using Wiener filtered wavelet packet decomposed periodic and aperiodic acoustic feature. Computers & Electrical Engineering
  47. Mahadevaswamy, & Ravi, D. J. (2019). Performance of Isolated and Continuous Digit Recognition System using Kaldi Toolkit. International Journal of Recent Technology and Engineering
  48. Mahadevaswamy, & Ravi, D. J. (2016). "Performance analysis of adaptive wavelet denosing by speech discrimination and thresholding," International Conference on Electrical, Electronics, Communication, Computer and Optimization Techniques (ICEECCOT), Mysuru, 2016, pp. 173–178, doi: 10.1109/ICEECCOT.2016.7955209
  49. Mahadevaswamy, & Ravi, D. J. (2016). "Performance analysis of adaptive wavelet denosing by speech discrimination and thresholding," 2016 International Conference on Electrical, Electronics, Communication, Computer and Optimization Techniques (ICEECCOT), Mysuru, pp. 173–178, doi: 10.1109/ICEECCOT.2016.7955209
  50. Mahadevaswamy, & Ravi, D. J., "Performance Analysis of LP Residual and Correlation Coefficients based Speech Seperation Front End," 2017 International Conference on Current Trends in Computer, Electrical, Electronics and Communication (CTCEEC), Mysore, 2017, pp. 328–332, doi: 10.1109/CTCEEC.2017.8455039
  51. Geethashree, A., & Ravi, D. J., “Automatic Segmentation of Kannada Speech for Emotion Conversion”,Journal of Advanced Research in Dynamical and Control Systems
  52. Geethashree, A., & Ravi, D. J. (2019). Modification of Prosody for Emotion Conversion using Gaussian Regression Model. International Journal of Recent Technology and Engineering
  53. Geethashree, A., & Ravi, D. J. (2018). Kannada Emotional Speech Database: Design, Development and Evaluation. In: Guru D., Vasudev T., Chethan H., Kumar Y. (eds) Proceedings of International Conference on Cognition and Recognition. Lecture Notes in Networks and Systems, vol 14. Springer, Singapore
  54. Basavaiah, J., & Patil, C. M. (2020). Human activity detection and action recognition in videos using convolutional neural networks. Journal of Information and Communication Technology, 19(2), 157–183
  55. Basavaiah, J., & Anthony, A. A. (2020). Tomato Leaf Disease Classification using Multiple Feature Extraction Techniques. Wireless Personal Communications. doi:10.1007/s11277-020-07590-x
  56. Mahadevaswamy, Ravi, D. J. (2021). Robust Perceptual Wavelet Packet Features for Recognition of Continuous Kannada Speech. Wireless Pers Commun, 121, 1781–1804. https://doi.org/10.1007/s11277-021-08736-1