3.2 PROPOSED FEATURES
Wavelets possess the multi-resolution property, which makes wavelet transforms well suited to processing both non-stationary and quasi-stationary signals. A vast number of feature-design algorithms have been presented in the literature for ASR systems operating under external disturbances from the natural environment, but most of these algorithms use the Fourier transform to estimate the spectrum. Speech signals contain both periodic and aperiodic regions, whereas the STFT uses a window of fixed duration in the time-frequency plane; STFT-based methods therefore cannot handle such variations in speech signals. This challenge is addressed by the use of wavelets [35, 36, 37, 43–47], which offer flexible frequency resolution in the time-frequency plane.
3.2.1 Theory of Wavelet Transforms
A comparative description of WT and STFT is presented in Fig. 1.
3.2.2 Continuous Wavelet Transforms (CWTs)
The CWT of a speech segment \(x(t)\) is defined as
$${CWT}_{x}^{{\Psi }}\left(\tau ,s\right)=\frac{1}{\sqrt{\left|s\right|}}{\int }_{-\infty }^{\infty }x\left(t\right)\,{{\Psi }}^{*}\left(\frac{t-\tau }{s}\right)dt \left(1\right)$$
In Eq. (1), \(\tau\) and \(s\) denote translation and dilation, respectively, and \({\Psi }\left(\text{t}\right)\) is the mother wavelet.
The mother wavelet plays the role of a prototype or basis from which the other analysis functions are constructed. The CWT is a computationally expensive transform.
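Eq. (1) can be approximated numerically by a Riemann sum over the sampled signal. The sketch below is illustrative only: the Mexican-hat (Ricker) mother wavelet and the function names are assumptions, not the wavelet used later in this work, and normalization constants are omitted.

```python
import math

def mexican_hat(t):
    # Ricker/Mexican-hat mother wavelet (real-valued, unnormalized);
    # chosen here only for illustration.
    return (1.0 - t * t) * math.exp(-t * t / 2.0)

def cwt_coefficient(x, dt, tau, s, psi=mexican_hat):
    """Riemann-sum approximation of Eq. (1) for samples x[n] = x(n*dt).

    tau is the translation, s the dilation (scale); for a real wavelet
    the complex conjugate in Eq. (1) is trivial.
    """
    total = 0.0
    for n, xn in enumerate(x):
        t = n * dt
        total += xn * psi((t - tau) / s) * dt
    return total / math.sqrt(abs(s))
```

Evaluating this coefficient over a grid of \((\tau, s)\) values yields the full time-scale plane; the cost of that grid is what makes the CWT expensive in practice.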
3.2.3 Discrete Wavelet Transforms (DWTs)
The DWT is computationally less complex [49]. The DWT of a speech signal x(t) is defined as:
$$DWT\left(j,k\right)=\frac{1}{\sqrt{{2}^{j}}}{\int }_{-\infty }^{\infty }x\left(t\right)\,{\psi }^{*}\left(\frac{t-{2}^{j}k}{{2}^{j}}\right)dt \left(2\right)$$
Mallat effectively summarized the wavelet decomposition process: it is accomplished by passing the speech segments through a wavelet packet tree. Sample analysis trees are shown in Fig. 2. Here, \({h}_{0}\left(n\right)\) and \({h}_{1}\left(n\right)\) are the low-pass and high-pass analysis filters, respectively; similarly, \({g}_{0}\left(n\right)\) and \({g}_{1}\left(n\right)\) form the synthesis filter pair.
The filter pairs \({h}_{0}\left(n\right),{h}_{1}\left(n\right)\) and \({g}_{0}\left(n\right),{g}_{1}\left(n\right)\) are related by Eq. (3):
$${h}_{1}\left(n\right)={\left(-1\right)}^{n}{g}_{0}\left(1-n\right), {g}_{1}\left(n\right)={\left(-1\right)}^{n}{h}_{0}\left(1-n\right) \left(3\right)$$
The decimation and interpolation operations by a factor of 2 are denoted by ↓2 and ↑2, respectively. Figure 3 shows the analysis and synthesis trees; the sequence \({\left\{{c}_{0}\left(n\right)\right\}}_{n\in Z}\) is the input to the tree [23].
$${c}_{1}\left(k\right)=\sum _{n}{h}_{0}\left(n-2k\right){c}_{0}\left(n\right) \left(4\right)$$
$${d}_{1}\left(k\right)=\sum _{n}{h}_{1}\left(n-2k\right){c}_{0}\left(n\right) \left(5\right)$$
where \({c}_{1}\left(k\right)\) and \({d}_{1}\left(k\right)\) represent the low-frequency (approximation) and high-frequency (detail) subspaces, respectively. The synthesis tree shown in Fig. 3 is described by Eq. (6):
$${c}_{0}\left(m\right)=\sum _{k}\left[{g}_{0}\left(m-2k\right){c}_{1}\left(k\right) +{g}_{1}\left(m-2k\right){d}_{1}\left(k\right)\right] \left(6\right)$$
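One level of the analysis/synthesis pair of Eqs. (4)–(6) can be sketched in a few lines. This is a minimal illustration under two assumptions: the orthonormal Haar filters are used (chosen for brevity; the features later in this section use db4), and the synthesis filters are indexed as \(g(m-2k)\), the standard Mallat convention. The derived filters \(h_1, g_1\) follow the quadrature-mirror relations of Eq. (3).

```python
import math

# Orthonormal Haar filter pair (illustrative choice, not the db4 used later).
# h1 and g1 are derived from Eq. (3): h1(n) = (-1)^n g0(1-n), g1(n) = (-1)^n h0(1-n).
INV_SQRT2 = 1.0 / math.sqrt(2.0)
h0 = {0: INV_SQRT2, 1: INV_SQRT2}
g0 = dict(h0)
h1 = {n: ((-1) ** n) * g0.get(1 - n, 0.0) for n in (0, 1)}
g1 = {n: ((-1) ** n) * h0.get(1 - n, 0.0) for n in (0, 1)}

def analyze(c0):
    """Eqs. (4)-(5): correlate with h0/h1 and downsample by 2 (even-length input)."""
    half = len(c0) // 2
    c1 = [sum(h0.get(n - 2 * k, 0.0) * c0[n] for n in range(len(c0))) for k in range(half)]
    d1 = [sum(h1.get(n - 2 * k, 0.0) * c0[n] for n in range(len(c0))) for k in range(half)]
    return c1, d1

def synthesize(c1, d1):
    """Eq. (6), synthesis filters indexed g(m - 2k): upsample, filter, and add."""
    m_len = 2 * len(c1)
    return [sum(g0.get(m - 2 * k, 0.0) * c1[k] + g1.get(m - 2 * k, 0.0) * d1[k]
                for k in range(len(c1))) for m in range(m_len)]
```

With these filters the analysis/synthesis pair achieves perfect reconstruction: decomposing a segment and resynthesizing it returns the original samples, which is the property the wavelet packet tree relies on at every level.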
3.2.4 Wavelets for parameterization
By applying the iterative decomposition operation repeatedly, a wavelet tree of the desired shape is designed. Wavelet-based feature vectors are derived using the Daubechies wavelet (db4) [57]. The fourth order is used here; higher orders improve performance but at increased computational cost.
3.2.4.1 Mel-Filter-Like Wavelet Packet Analysis
A 24-band mel-scale-like wavelet feature set (WMFCC) is proposed [20]. The linear frequency \({f}_{c}\) is related to the mel-scale frequency \({f}_{mel}\) by Eq. (7):
$${f}_{mel}=2595{log}_{10}\left(1+\frac{{f}_{c}}{700}\right) \left(7\right)$$
The signal analysis is initialized with a balanced 3-level tree, dividing the frequency range into eight bands of 1 kHz each. The approximation space of 0–1 kHz is further decomposed into eight subbands of 125 Hz bandwidth each, close to the roughly 100 Hz bandwidth of the lowest mel filters. In this way a 24-band mel-scale-like WP filter bank is designed [20] (see Table 1).
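The mel mapping of Eq. (7) is straightforward to compute; a minimal sketch (the function name is an assumption) is:

```python
import math

def hz_to_mel(f_c):
    """Eq. (7): map linear frequency f_c (Hz) to the mel scale."""
    return 2595.0 * math.log10(1.0 + f_c / 700.0)
```

A useful sanity check on this formula is that 1000 Hz maps to approximately 1000 mel, the anchor point of the mel scale.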
Table 1
Comparative description of Wavelet frequency bands and MFCC frequency bands.
Filter | Mel Scale (Hz) | Wavelet Subband (Hz) | Filter | Mel Scale (Hz) | Wavelet Subband (Hz) | Filter | Mel Scale (Hz) | Wavelet Subband (Hz)
1 | 100 | 125 | 9 | 900 | 1125 | 17 | 2639 | 2750
2 | 200 | 250 | 10 | 1000 | 1250 | 18 | 3031 | 3000
3 | 300 | 375 | 11 | 1149 | 1375 | 19 | 3482 | 3500
4 | 400 | 500 | 12 | 1320 | 1500 | 20 | 4000 | 4000
5 | 500 | 625 | 13 | 1516 | 1750 | 21 | 4595 | 5000
6 | 600 | 750 | 14 | 1741 | 2000 | 22 | 5278 | 6000
7 | 700 | 875 | 15 | 2000 | 2250 | 23 | 6063 | 7000
8 | 800 | 1000 | 16 | 2297 | 2500 | 24 | 6954 | 8000
The 24-band mel-filter-like wavelet packet sub-bands are shown in Fig. 5 [20].
The energy in each subband is determined by
$${\left\langle {S}_{i}\right\rangle }_{k}=\sum \frac{{\left|{\left({\omega }_{{\Psi }}\left(x,k\right)\right)}_{i}\right|}^{2}}{{N}_{i}} \left(8\right)$$
where \({{\omega }_{{\Psi }}(x,k)}_{i}\) are the coefficients of the speech segment \(x\), \(i\) indicates the subband number (\(1\le i\le M\)), \(k\) is the frame number, and \({N}_{i}\) is the total count of samples in the \({i}^{th}\) subband. As in MFCC extraction, the 24 subband energies are logarithmically compressed, and the compressed coefficients are then processed by the DCT to achieve energy compaction. The first 13 coefficients are chosen as the WMFCC features. The steps of parameterization are shown in Fig. 6.
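The energy/log/DCT steps above can be sketched as follows. This is a schematic, not the exact implementation: the function names are assumptions, the input is taken as a list of 24 per-subband coefficient lists for one frame, and an unnormalized DCT-II is used.

```python
import math

def subband_log_energies(coeffs_per_band):
    """Eq. (8) plus log compression: mean squared coefficient energy per subband."""
    return [math.log(sum(c * c for c in band) / len(band)) for band in coeffs_per_band]

def dct2(v, n_keep=13):
    """Unnormalized DCT-II of the log energies; the first n_keep
    coefficients form the feature vector (13 here, as for MFCC)."""
    N = len(v)
    return [sum(v[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N)) for n in range(N))
            for k in range(n_keep)]
```

Applying `dct2(subband_log_energies(bands))` to each frame yields one 13-dimensional WMFCC vector per frame.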
3.2.4.2 Proposed Hybrid PWP tree for parameterization
A 24-band wavelet tree is designed for parameterization. Repeated experimental investigation and analysis revealed the 24-band Wavelet Packet (WP) tree shown in Fig. 7 to be optimal for this task.
The energies of the decomposed coefficients are calculated, compressed logarithmically, and processed by the DCT to choose 13 optimal coefficients.
3.2.5 ACOUSTIC MODELS
Acoustic modeling plays a vital role in any ASR system: it is the task of mapping the matrix of features to the phoneme sequences of the hypothesized sentence. This is accomplished through the use of a Hidden Markov Model (HMM) classifier.
3.2.6 LANGUAGE MODELS
The most popular language models used in ASR systems are n-gram language models. These models predict the \({n}^{th}\) word from the \(\left(n-1\right)\) preceding words. Bigram \((n=2)\) and trigram \((n=3)\) models are commonly used in language modelling.
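A maximum-likelihood bigram model reduces to relative counts, \(P(w_n\mid w_{n-1})=\text{count}(w_{n-1},w_n)/\text{count}(w_{n-1})\). A minimal sketch (function name and sentence markers `<s>`, `</s>` are assumptions; real systems add smoothing):

```python
from collections import Counter

def bigram_probs(corpus_sentences):
    """Maximum-likelihood bigram model trained on whitespace-tokenized sentences.

    Returns a function p(prev, w) = count(prev, w) / count(prev).
    No smoothing is applied, so unseen bigrams get probability 0.
    """
    unigrams, bigrams = Counter(), Counter()
    for sent in corpus_sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])            # history counts
        bigrams.update(zip(tokens, tokens[1:]))  # pair counts
    return lambda prev, w: bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
```

A trigram model follows the same pattern with two-word histories.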
3.2.7 RECOGNITION
Baseline classifiers such as GMM-HMMs, extended versions of monophone models, and Deep Neural Networks (DNNs) are used in this work for achieving speech recognition.
3.2.8 HIDDEN MARKOV MODELS
To find the probabilities \(P\left(W|X\right)\), the 3-state Markov chain shown in Fig. 8 is used. During the training phase, the initial state probabilities (\(\pi\)), the state-transition probabilities (A), and the symbol-emission probabilities (B) are determined using the Baum-Welch algorithm.
$$\lambda =\left(A, B, \pi \right) \left(9\right)$$
The log-likelihood of each candidate word is found using the Viterbi decoding method, described by
$${v}^{*}=\underset{1\le v\le V}{\text{argmax}}\left[P \left(O|{\lambda }_{v}\right)\right] \left(10\right)$$
where \(V\) is the vocabulary size.
3.2.9 ASR PERFORMANCE ANALYSIS
The performance of the proposed system is evaluated through the word error rate (WER) metric [24], given by Eq. (11):
$$WER\left(\%\right)=\frac{(D+S+I)}{N}\times 100\left(\%\right) \left(11\right)$$
Here N represents the total number of units in the reference transcriptions of the testing dataset, and \(D\), \(S\), and \(I\) are the errors due to deletion, substitution, and insertion, respectively.
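Eq. (11) requires aligning the hypothesis against the reference to count D, S, and I; the minimal total is obtained by Levenshtein alignment. A word-level sketch (function name is an assumption):

```python
def word_error_rate(ref, hyp):
    """Eq. (11): WER (%) via Levenshtein alignment of reference and hypothesis."""
    ref, hyp = ref.split(), hyp.split()
    # dp[i][j] = minimal D + S + I aligning ref[:i] with hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])  # substitution (or match)
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 100% when insertions are numerous, since I is counted against the reference length N.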