G-Cocktail: An Algorithm to Address the Cocktail Party Problem of the Gujarati Language Using CatBoost

This paper addresses the problem of recognizing a native language in a mixed-voice environment. G-Cocktail would aid voice-driven applications in identifying commands given in Gujarati, even from a mixed voice stream. G-Cocktail has two phases: in the first, it creates features after filtering the voices; in the second, it trains and classifies the dataset. The trained model helps in recognizing new voice signals. The challenge in training for a native language is that only a small dataset is available. Single-word and phrase benchmark datasets from Microsoft and the Linguistic Data Consortium for Indian Languages (LDC-IL) are used as input to the model. To overcome the overfitting problem caused by the smaller dataset, the CatBoost algorithm is used and the classification model is fine-tuned. Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC; they are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). MFCC works well for human voices, but noise in the sound makes it less productive. To avoid this shortcoming, the voices are first filtered and the MFCCs are then calculated, and only the most relevant features are retained to make the representation more robust. Along with the MFCC features, the pitch of the voices is also added, as pitch can vary with region, mood, age, and the speaker's knowledge of the language. A voice print of each whole sound file is constructed and fed as features to the classification model. A 70%/30% training/testing split is used, and the model is compared against algorithms such as K-means, Naïve Bayes, and LightGBM. On the given dataset, the results prove that G-Cocktail using CatBoost performed better than the others under the given scenario on all parameters.


Introduction
The pandemic in 2019 boosted not just internet usage but also the use of native languages. A user of a voice-activated gadget would like to communicate in a native language. Native language expressions form a continuously varying acoustic signal. These 'acoustic-phonetic parts' (APS) are demarcated based on distinct variations in the time and frequency domains [1]. Because not every change in these domains should be interpreted as a segment border, phonetic criteria are used to determine segment demarcation [2, 3], linking the same signal, delineated into APSs, to language abstractions such as allophones, phonemes, and morphophonemes. In each language, a phoneme is a sound feature that distinguishes one word from another. Allophones are variations of the same phoneme that do not result in a substantial change in expression. On the other hand, strings with one or more APSs and allophones may never be identical, because the speech signal generator, i.e., the human speaker, does not create invariable and exact duplicates of the signal for several occurrences of the 'same' allophone; this compounds the intrinsic lack of similarity [4]. The same phonemes do not exist in all languages.
India is home to over 19,565 mother tongues/dialects, according to the 2011 Census. Every language has its own phonotactic, prosodic, and acoustic characteristics [5]. As a result, identifying these languages, each with its own vernacular, cadence, semantics, and ambiance, becomes exceedingly difficult [6][7][8]. The focus of this paper is on the Gujarati language.
Gujarati has many dialects; the main ones are spoken in Mumbai and Ahmadabad. Others are Surati, Kathiyawadi, Kharua, Khakari, Tarimukhi, and East African Gujarati. Since many dialects exist, there are also many loan words from other languages; the dialects of southern Gujarati have borrowed words from Hindi, English, and Portuguese. Gujarati has ten vowels. Except for [e] and [o], vowels occur nasalized and in murmured and non-murmured forms. Gujarati has both short and long vowels, but the length distinction is not contrastive. Gujarati has 34 consonants, including 20 stops, 3 fricatives, 3 nasals, and 5 glides and liquids. The stops and nasals are articulated at five distinct positions: labial, dental, retroflex, palatal, and velar [9]. In fact, the palatal stops are affricated. The stops show the four-way distinction of Indo-Aryan within Indo-European (Proto-Indo-European had only a three-way distinction): voiceless and voiced consonants, each unaspirated and aspirated, in every series of stops. Despite Gujarati's large speaker base, not much work has been done on its speech enhancement, text-to-speech conversion, and separation. Gujarati, like many Indian languages, has many dialects, and analyzing it requires accounting for a different phonetic distribution. The current paper focuses on separating Gujarati voices from a mixed signal, a task more commonly called the "cocktail party" problem.
Simultaneous and sequential organization are the two forms of auditory scene analysis (ASA). Simultaneous organization (or grouping) incorporates sounds that overlap, while sequential organization combines sounds that occur at different times [10]. When audio is expressed on a time-frequency image such as a spectrogram, the key organizational principles responsible for ASA are proximity in frequency and time, harmonicity, common amplitude and frequency modulation, onset and offset synchrony, common spatial location, and prior knowledge. For separating the voices, deep learning models such as multilayer perceptrons (MLPs), convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative adversarial networks (GANs) are commonly used [10]. These grouping rules often regulate speech separation [11][12][13][14]. The key point in speech recognition is that the sounds made by a human being are filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines the sound that comes out; an accurate assessment of the shape would give an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and MFCC can precisely represent this envelope. Because of this ability and its high accuracy, MFCC is a widely used feature for voice signals [15]. The MFCC computation is a simulation of the human auditory system that seeks to mechanically enforce the ear's operating principle, on the premise that the human ear is a reliable speaker recognizer [16]. Other features include Linear Prediction Coefficients (LPC), the Discrete Wavelet Transform (DWT), Linear Predictive Cepstral Coefficients (LPCC), and deep learning-based features [17].
LPC is a type of speech feature that imitates the human vocal tract. It estimates the concentration and frequency of the left-over residue by approximating the formants, removing their effects from the speech signal, and evaluating the residual signal [18, 19]. Each sample of the signal is expressed as a linear combination of previous samples. The formants are defined by the coefficients of the difference equation, so LPC must approximate these coefficients. LPC is a popular formant estimation method as well as a powerful speech analysis method. It provides very precise estimates of speech parameters and is computationally efficient. However, autocorrelation coefficients are aliased in conventional linear prediction, and the susceptibility of LPC estimates to quantization noise is high, so they are not well suited for generalization [19].
DWT is an extension of the Wavelet Transform (WT). It can derive information from latent signals in both the time and frequency domains at the same time. Many wavelets are orthogonal, which is an outstanding property for compact signal representation [20]. The wavelet transform breaks a signal down into a set of simple functions known as wavelets, built through dilation and shifting from a single template called the mother wavelet. The WT's key feature is that it searches the frequency spectrum with a variable window, improving the temporal resolution [21]. Its parameters cover different frequency scales, which enhances the speech information received in the related frequency bands. It provides enough frequency bands for accurate speech processing, but since the input signals are of finite duration, the wavelet coefficients may vary excessively at the boundaries because of discontinuities [21].
LPCC are cepstral coefficients derived from LPC: they are the coefficients of the Fourier transform of the logarithmic magnitude spectrum of LPC. The susceptibility of LPCC calculations to quantization noise is well documented. In the frequency domain, cepstral analysis of a high-pitch speech signal yields poor source-filter separability. Lower-order cepstral coefficients are sensitive to spectral slope, whereas higher-order cepstral coefficients are sensitive to noise [22, 23].
From the above study, it is observed that choosing the right features is important for separating voices. Deep learning-based i-vector [24] and x-vector [25] features, fusions of MFCC with DWT and of MFCC with GFCC [26], respectively, are good for language recognition. They work well in noisy environments but suffer from overfitting with a small dataset. Gujarati, like many Indian languages, does not have a large data corpus. MFCC best suits such data, and the results show that it gives higher accuracy.
There is no prior work addressing the cocktail-party scene for the Gujarati language [47][48][49][50][51][52][53][54][55][56][57]. A tabulation of work done for the Gujarati language is given in Table 1. Variation in the training dataset was not attempted in that work; also, since the language was known during decoding, a language-specific model was used. Here, CatBoost [58] is considered to solve the cocktail-party problem with Gujarati voices. For feature extraction, MFCC and pitch are used after filtering the sample voices [59]. The reasons for choosing CatBoost are as follows (a configuration sketch follows this list):
• It has a novel algorithm for processing categorical features. There is no need to manually pre-process them because it is done automatically. In contrast to other algorithms, its performance on data with categorical features is higher.
• It uses ordered boosting, a permutation-driven alternative to the classic boosting algorithm.
• Gradient boosting easily overfits on tiny datasets, and CatBoost has a special modification to handle this issue. On datasets where other algorithms struggled with overfitting, CatBoost does not have the same problem; the overfitting issue is addressed by adjusting the parameters, as detailed in the model.
• It is fast and runs on GPU.
• It handles missing values as well.
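As a brief illustration, a minimal sketch of how these overfitting-related options are exposed in the CatBoost Python API is given below; the parameter values are illustrative assumptions, not the tuned configuration used in the paper (GPU execution can additionally be enabled with task_type="GPU").

```python
from catboost import CatBoostClassifier

# Illustrative values only; the paper's tuned parameters are not reproduced here.
model = CatBoostClassifier(
    iterations=500,           # a small ensemble suits a small corpus
    learning_rate=0.03,
    depth=6,
    l2_leaf_reg=5,            # extra regularization against overfitting
    boosting_type="Ordered",  # the permutation-driven ordered boosting scheme
    od_type="Iter",           # built-in overfitting detector...
    od_wait=50,               # ...stops after 50 iterations without improvement
    verbose=False,
)
```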

Methodology
The objective of the paper is to separate voices from mixed speech in Gujarati and predict the voice.

Experimental Set-up
The setup includes the hardware configuration and the dataset used.

Hardware
Intel Core i5 (fifth generation) with 16 GB RAM and an NVIDIA graphics card, running Windows 10.

Dataset
Microsoft Indian language corpus and the Linguistic Data Consortium for Indian Languages (LDC-IL) [60] (https://www.ldcil.org/publications.aspx). The dataset contains voices of adults only. A few voice samples of children were recorded as well, keeping the same set of dialogues and parameters (number of channels and sampling rate).

Model
The proposed model has two parts: feature extraction and classification. G-cocktail first reads every voice file and filters it by removing the leading and trailing silences. It then extracts the MFCC features and calculates the pitch. A method is designed to retain only the relevant features. The pitch and the retained MFCC features of every voice file are stored as a vector to create the dataset. Figures 1 and 2 illustrate the working graphically: the feature extraction process is represented in Fig. 1 and classification in Fig. 2.

The classifier (CatBoost in our case) trains on the training set generated in the first phase. In Fig. 2, the features from Fig. 1 are carried over for training and classification; the mixed voice signal used for prediction is also shown in Fig. 2. The overall working of G-cocktail is summarized in the following steps (a condensed code sketch follows the list):
Step 1: Filter the voice by removing trailing and leading silences.
Step 2: Calculate the pitch of every voice file.
Step 3: Extract the MFCC features.
Step 4: Retain only the relevant features.
Step 5: Store the features and pitch as one vector per voice file.
Step 6: Create labels for each record.
Step 7: Create a CatBoost model.
Step 8: Train the model with the features and labels.
Step 9: Create a mixed voice by appending different voices.
Step 10: Window the mixed voice at 15-second intervals.
Step 11: Calculate the pitch and MFCC features of each window.
Step 12: Using the trained model, predict the voice from the new features and pitch.
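These steps condense into a short pipeline; the sketch below assumes librosa for audio I/O and feature extraction, and the helper names and values (SR, N_KEEP, the file path pattern) are illustrative assumptions rather than the paper's exact implementation.

```python
import glob

import librosa
import numpy as np
from catboost import CatBoostClassifier

SR, N_KEEP = 16000, 15        # assumed sampling rate and number of retained MFCCs

def estimate_pitch(y, sr, fmin=50, fmax=400):
    # Steps 2/11: crude autocorrelation pitch (detailed under 'Calculate Pitch')
    ac = np.correlate(y, y, mode="full")[len(y) - 1:]
    lo, hi = sr // fmax, sr // fmin
    return sr / (lo + np.argmax(ac[lo:hi]))

def feature_vector(y, sr=SR):
    y, _ = librosa.effects.trim(y)                    # Step 1: trim silence
    mfcc = librosa.feature.mfcc(y=y, sr=sr)[:N_KEEP]  # Steps 3-4: extract and retain
    return np.append(mfcc.mean(axis=1), estimate_pitch(y, sr))  # Step 5

files = sorted(glob.glob("gujarati_voices/*.wav"))    # hypothetical corpus path
X = np.vstack([feature_vector(librosa.load(f, sr=SR)[0]) for f in files])
labels = np.arange(1, len(files) + 1)                 # Step 6: one label per file
model = CatBoostClassifier(verbose=False).fit(X, labels)  # Steps 7-8
```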

Objective Function
The objective of the paper is to design a predictive function P that can extract an individual voice from a mixed voice signal.
The mixed signal G_mixed is created by appending male, female, or assorted voices; the combinations may be male-only, female-only, or a mix of both. The assorted case is described in fn (2.1).
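As a minimal sketch, and assuming that "appending" means end-to-end concatenation of the waveforms, G_mixed could be constructed as below; the file names are hypothetical.

```python
import librosa
import numpy as np

def make_mixed(paths, sr=16000):
    # G_mixed: individual voices appended end to end
    return np.concatenate([librosa.load(p, sr=sr)[0] for p in paths])

# male-only, female-only, or assorted combinations, as described above
g_mixed = make_mixed(["male_1.wav", "female_1.wav", "male_2.wav"])
```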

Filter the Voices
The dataset considered is a benchmark and is free of noise. Trailing and leading silences are removed from the speech samples using Eq. (1): for leading silence, the equation runs from the start until voice is detected; for trailing silence, it runs from the end until voice is detected. V_G is the voice sample in Gujarati, and the sliced speech is stored in S_G.
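Since Eq. (1) itself is not reproduced here, the sketch below implements the described two-ended scan with a simple amplitude threshold; the threshold value is an assumption.

```python
import numpy as np

def trim_silence(v_g, threshold=0.01):
    # Scan from the start and from the end until voice is detected
    # (a stand-in for Eq. (1)); the sliced speech S_G is returned.
    voiced = np.where(np.abs(v_g) > threshold)[0]
    if voiced.size == 0:               # no voice detected at all
        return v_g
    return v_g[voiced[0]:voiced[-1] + 1]
```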

Calculate Pitch
Once the signal is trimmed, the pitch is calculated using fn (3). Autocorrelation is used for the pitch calculation; here, ml is the maximum lag and index is the index of the maximum autocorrelation peak.
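The equation itself is not reproduced here; a standard formulation consistent with the symbols ml and index is $R(\tau)=\sum_{n} S_G[n]\,S_G[n+\tau]$ for $0 \le \tau < ml$, with the pitch taken as the sampling rate divided by the lag of the maximum peak. A sketch under those assumptions:

```python
import numpy as np

def pitch_autocorr(s_g, sr, fmin=50, fmax=400):
    ml = sr // fmin                    # ml: maximum lag (lowest expected pitch)
    ac = np.correlate(s_g, s_g, mode="full")[len(s_g) - 1:len(s_g) - 1 + ml]
    lo = sr // fmax                    # skip implausibly short lags
    index = lo + np.argmax(ac[lo:])    # index of the maximum peak
    return sr / index                  # estimated pitch in Hz
```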

Extract MFCC Features
The function computes the first and second derivatives of the cepstral coefficients to capture the temporal dynamics of the signal. Algorithm 2 explains the implementation.
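Algorithm 2 is not reproduced in the text; the sketch below shows an equivalent computation with librosa, including the first and second derivatives mentioned above (n_mfcc = 15 follows the retention criterion given later in "Result Analysis").

```python
import librosa
import numpy as np

def mfcc_with_deltas(s_g, sr, n_mfcc=15):
    mfcc = librosa.feature.mfcc(y=s_g, sr=sr, n_mfcc=n_mfcc)
    d1 = librosa.feature.delta(mfcc)           # first temporal derivative
    d2 = librosa.feature.delta(mfcc, order=2)  # second temporal derivative
    return np.vstack([mfcc, d1, d2])           # shape: (3 * n_mfcc, frames)
```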

Prepare Data
For every sound file, the pitch (P_G) and MFCC features (G_F) are calculated using Algorithms 1 and 2. Every record in the dataset is assigned a numeric label (l_i) ranging from 1 to the number of files (n).
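Putting the two algorithms together, one record per file might be assembled as below; trim_silence, pitch_autocorr, and mfcc_with_deltas are the sketches from the preceding subsections, and averaging the MFCC matrix over frames to obtain a fixed-length vector is an assumption.

```python
import librosa
import numpy as np

SR = 16000
sound_files = ["voice_001.wav", "voice_002.wav"]    # hypothetical file list

records, labels = [], []
for l_i, path in enumerate(sound_files, start=1):   # labels l_i run from 1 to n
    s_g = trim_silence(librosa.load(path, sr=SR)[0])
    p_g = pitch_autocorr(s_g, SR)                   # P_G (Algorithm 1)
    g_f = mfcc_with_deltas(s_g, SR).mean(axis=1)    # G_F (Algorithm 2)
    records.append(np.append(g_f, p_g))
    labels.append(l_i)
X, y = np.vstack(records), np.array(labels)
```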

Design Deep Learning Model
Here, Cat_m is the CatBoost regression model and p its parameters. The model is created as a structure of parameters and values, with the parameters selected by trial and error.
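The parameter values themselves are not listed in the text; a placeholder structure p, with values of the kind one might reach by trial and error, is sketched below (CatBoostRegressor is used to match the regression model named above).

```python
from catboost import CatBoostRegressor

# Placeholder values only; the paper's tuned parameters are not reproduced here.
p = {
    "iterations": 1000,
    "learning_rate": 0.05,
    "depth": 8,
    "l2_leaf_reg": 3,
    "loss_function": "RMSE",
}
cat_m = CatBoostRegressor(**p)   # Cat_m as defined above
```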

Fit and Predict
The model in Eq. (3) is trained using Eq. (4). The dataset G and its labels are created using fn (5.1) and fn (5.2).
To get the pitch and features of the mixed voice data, Algorithms 1 and 2 are again applied to G_mixed, and the trained model, using Eq. (6), predicts the voice from G_eval. The predicted voice is stored in P_v. The input to the CatBoost model is the MFCC features extracted from the standalone voices and the labels assigned to them. Once the model is trained, the mixed data is evaluated by windowing it at 15-second intervals.
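A sketch of the evaluation loop under the same assumptions as the earlier snippets: G_mixed is windowed at 15-second intervals, the features are recomputed per window with the helpers sketched above, and the trained model predicts the voice P_v.

```python
import numpy as np

SR = 16000
win = 15 * SR                              # 15-second window
p_v = []                                   # predicted voice per window
for start in range(0, max(len(g_mixed) - win, 0) + 1, win):
    seg = trim_silence(g_mixed[start:start + win])
    feats = np.append(mfcc_with_deltas(seg, SR).mean(axis=1),
                      pitch_autocorr(seg, SR))
    p_v.append(cat_m.predict(feats.reshape(1, -1))[0])
```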

Results
To carry out the experiment with the proposed model, we first iterated through each sound file. The code then read each filtered file and calculated the pitch and MFCC features. These features were combined to create an acoustic filter bank, which was then split into training and testing sets to carry out classification and prediction. The detailed working is explained through the objective functions and algorithms. The observed results are reproduced here along with the generated outputs. Figure 3 is a frequency plot of an adult male voice in Gujarati, before applying any filters. The result of Algorithm 1, which calculates the pitch, is plotted in Fig. 4, again for a male Gujarati voice. All the plots are of an adult male Gujarati voice.
The steps in Algorithm 2 to calculate the MFCC are plotted in Figs. 5, 6, 7, and 8. For calculating the MFCC, the data is windowed as shown in Fig. 5; a Hanning window is used. The plot shows the effect of windowing on the voice data, as in step 2 of Algorithm 2. Steps 3-5 convert the signal into the frequency domain and window it. Figure 6 shows the effect of windowing on the complete waveform, displaying the original frames and the frames after windowing.
Steps 6-17 of Algorithm 2 compute the cepstral coefficients. First, filter points are created and then the mel bands are computed; Figs. 7 and 8 show the plots of the filter points and filter bands.
Step 17 of Algorithm 2 computes the cepstral features, which are plotted in Fig. 9.

Mathematical Evaluation of Results
Qualitative analysis of the retrieved signal uses Perceptual Evaluation of Speech Quality (PESQ) scores and the source-to-distortion ratio (SDR). Other measures include the scale-invariant signal-to-noise ratio (SI-SNR) and the scale-invariant signal-to-distortion ratio (SI-SDR).

Here, S and ŝ represent the original and estimated clean source, respectively, and L represents the length of the signal. e_inter, e_noise, and e_artif represent the interference, noise, and artifact error terms, respectively, and P represents the power of the signal. TP is the number of correctly detected voices (predicted), FN is the number of voices that were missed, FP is the number of signals labelled as voice signals that are not, and TN is the number of non-speech signals correctly identified (Table 3). Experimental results are presented for SVM, KNN, XGBoost, LightGBM, and CatBoost on the dataset. The obtained results prove that G-cocktail using CatBoost performed better than the others under the given scenario.
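The metric equations themselves are not reproduced in the text; the standard BSS-Eval-style definitions consistent with the symbols above, together with the accuracy measure implied by the TP/TN/FP/FN counts, are assumed to be:

```latex
% Assumed standard definitions, not copied from the paper:
\hat{s} = s_{\mathrm{target}} + e_{\mathrm{inter}} + e_{\mathrm{noise}} + e_{\mathrm{artif}},
\qquad
\mathrm{SDR} = 10\,\log_{10}
  \frac{\lVert s_{\mathrm{target}} \rVert^{2}}
       {\lVert e_{\mathrm{inter}} + e_{\mathrm{noise}} + e_{\mathrm{artif}} \rVert^{2}},
\qquad
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.
```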
After enhancement and classification of the Gujarati language signals, the accuracy of G-cocktail is compared with published papers on Gujarati signals. For the experiment, the assorted data contained Hindi, English, Tamil, and Telugu voices; the proposed model was able to recognize the Gujarati voice, and the assorted voices were filtered out. Table 4 gives a comparison of the different techniques used thus far for Gujarati language detection. The datasets are different, yet the proposed work outperforms most of the techniques used in detecting the Gujarati language. To justify using MFCC with pitch, we carried out the experiment with different features.
The experiment was done with features like LPC, LPCC, i-vector, and x-vector. The results in Table 5 clearly show that MFCC with pitch performed better than the rest of the features. The experiment covered Gujarati voices only; the results for other languages with different dialects may differ.

Result Analysis
G-cocktail retains the most relevant MFCC features based on the criterion that a feature f_i in a set of n features (f_1, f_2, …, f_n) is relevant if i ≤ 15; anything above 15 is ignored, so the first 15 MFCC features are treated as the most relevant. Pitch is calculated to overcome the issues of age, mood, knowledge, and region. People in India speak in different pitches, and the pitch varies from region to region, even within the same state. Gujarati is a softly spoken language, but the pitch varies even for the same sentence from person to person and region to region. In a cocktail party situation, the pitch is expected to be higher than under normal conditions. G-cocktail is not limited to words or sentences; it can work effectively on continuous sentences. Whether there is a single speaker or multiple speakers, the model is able to detect the words and sentences (Table 5).

Table 5  Comparison with published results (accuracy %):
[--]  HMM, HTK                   95.1
[52]  HMM, ANN                   79.14
[53]  HMM                        87.23
[54]  Single end-to-end model    82.7
[55]  RNN-CTC                    78.93
[56]  TDNN, RNNLM                85.9
[57]  LSTM-CTC                   80.89
      G-cocktail: MFCC + Pitch + CatBoost   98.33

Conclusion
G-cocktail is a step forward in addressing the native languages of India. Because there are fewer speakers, the available datasets are small, and it is not easy to train and classify on a small dataset and get good results. To overcome this problem for a native language like Gujarati, we filtered the voice samples, extracted only the most relevant MFCC features, and combined them with the pitch to create the voice prints for training and testing. A fine-tuned CatBoost model is used to overcome the overfitting problem. G-cocktail can extract and identify voices in a cocktail-party-like situation. The results show an accuracy of 96.2%, which is very encouraging. G-cocktail can be used to develop a voice bot in Gujarati: even at a party, one will not have to silence everyone or speak close to the device to give a command; the model will pick up the command.