VoSE: An algorithm to Separate and Enhance Voices from Mixed Signals using Gradient Boosting

The Voice Separation and Enhancement (VoSE) algorithm designs a predictive model to solve the problem of speech enhancement and separation from a mixed signal. VoSE can be used for any language, with or without a large dataset. It can be utilized by any voice response system, such as Siri, Alexa, or Google Assistant, which currently work on single voice commands. Pre-processing of the voice is done using a Trimming Negative and Nonzero Voice Filter (TNNVF) designed by the authors; TNNVF is independent of language and works on any voice signal. Segmentation of a voice is generally carried out in either the frequency domain or the time domain, and each on its own is known to produce a ripple or rising effect. To rule out the ripple effect, the data is filtered in the time-frequency domain. Voice prints of all sound files are created for training and testing: 80% of the voice prints are used to train the model and 20% are kept for testing. The training set contains over 48,000 voice prints. LightGBM with TensorFlow helps in generating unique voice prints in a short time. To enhance the retrieved voice signals, an Enhance Predictive Voice (EPV) function is designed. The tests are conducted on English and Indian languages. The proposed work is compared with K-means, Decision Stump, Naïve Bayes, and LSTM.


I. Introduction
Much research in source separation is centred on the famous cocktail party problem [1], where a listener has to attend to speech selectively amid competing speech noise. The human auditory system is capable of selectively recognising voices: the brain is able to separate spectral-temporal representations of concurrent speech [2]. In simpler terms, consider a party: despite a variety of distracting voices and other sounds, one can isolate a friend's speech and remember the words with little effort [3]. This is a strong indication that the human auditory system has a mechanism to distinguish incoming signals. The underlying principle is known as psychoacoustics, and formal descriptions of it exist, such as [4].
Audio-command-enabled devices such as Google Assistant, Siri, and Alexa [5] have taken the technology to the next level. These systems accept audio inputs and execute the process after the voice information is decoded. Speech inputs and commands are given in a natural environment, which has high acoustic background noise; the noise can be of different levels and may come from one or more interfering sources [6]. The cocktail party problem, introduced by Cherry in 1953, is one such problem where recording is done in a natural environment. Cherry introduced automatic voiceprint and speech recognition [7,8].
The algorithms developed thus far are no substitute for what humans can do. To solve these problems and pay attention to the speaker of interest, people exploit many patterns in a group. When loud music is present, the differences are larger: one has to filter the music out and strain the ears to understand. Patterns in such circumstances play an essential role; they include accuracy, continuity of tone, language, and the position of the speaker. To resolve the pattern issue, Permutation Invariant Training (PIT) and utterance-level Permutation Invariant Training (uPIT) were proposed [43,44] to separate the signals. PIT and uPIT, however, only use the amplitude spectrum of the mixed signal as input features and fail to accurately discriminate between speakers; uPIT suffers from the permutation problem. To overcome the permutation issue, authors proposed Deep Clustering (DC) with uPIT [45,46]. DC, the Deep Attractor Network [47], and uPIT can predict the assignments of all time-frequency (TF) bins at the utterance level at once, without the need for frame-based assignment, which is the main cause of the permutation problem. Nevertheless, when the vocal features of the speakers are similar, these methods also suffer from the permutation issue.
To exploit recently developed techniques in Artificial Intelligence (AI), deep learning based on audio-visual data has been introduced in recent years [48,49]. It is widely known that humans not only listen to the sound but also note the speaker's emotions and read lips, eyes, and body gestures. The work proposed in [48,49], however, is speaker-dependent, and the quality of the separation is not satisfactory either.
The above findings suggest the need for source separation, especially in cases where an unidentified mixed signal is transmitted and registered at a sensor array. Speech signals also contain silent spans and meaningless noise. To overcome these issues, the authors developed the Trimming Negative and Nonzero Voice Filter (TNNVF). It is also observed that the models mentioned above exhibit a ripple effect because speech segmentation is conducted either in the frequency domain or in the time domain; VoSE instead translates the speech data into the time-frequency domain. To isolate the voices, the proposed model uses LightGBM [51][52] with TensorFlow running in the background. TensorFlow [53] helps in producing the individual voice prints in the shortest possible time.

Why LightGBM?
Decision tree learning algorithms [50][51][52][53][54] construct trees level-wise (depth-wise). LightGBM, a gradient boosting algorithm, builds trees leaf-wise, which results in lower loss. LightGBM uses an optimized histogram algorithm: it splits continuous feature values into n bins and selects the split points among those n values. The histogram algorithm has a regularization effect and can effectively avoid overfitting. After the first split, LightGBM performs the second split only on a leaf node, and this leaf-wise growth allows it to operate on large datasets as well. LightGBM also has a maximum-depth parameter, so it grows leaf-wise while still preventing overfitting. Gradient boosting, due to its tree structure, is known to be good for tabular data, but researchers have recently found it useful in various other applications [55][56][57][58][59][60][61][62][63][64][65][66][67].
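As a rough illustration of how these settings appear in practice, the sketch below trains a LightGBM multiclass model with leaf-wise growth, a depth cap, and histogram binning. The feature shape, class count, and parameter values are assumptions for illustration, not the tuned values used in the paper.

import lightgbm as lgb
import numpy as np

# Toy stand-in for voice-print feature vectors (shape is an assumption).
X = np.random.rand(1000, 64)
y = np.random.randint(0, 3, size=1000)    # e.g. male / female / assorted

params = {
    "objective": "multiclass",
    "num_class": 3,
    "boosting_type": "gbdt",
    "num_leaves": 31,       # leaf-wise growth is bounded by leaves, not levels
    "max_depth": 8,         # depth cap guards against overfitting
    "max_bin": 255,         # histogram algorithm: continuous values -> 255 bins
    "learning_rate": 0.1,
}

model = lgb.train(params, lgb.Dataset(X, label=y), num_boost_round=100)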
The models discussed above are either specific to an application or address a single language, and none of them addresses the issue of speech translation: once the speech is separated, the voice is not converted into text. Accurate phonetic transcriptions are important for building robust acoustic models for speech recognition [68,69]. After enhancing the predicted voice, VoSE converts the speech to text to verify that the converted text matches the text of the original speech.
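The paper does not name the speech-to-text engine used for this check; a minimal sketch of such a verification step, assuming the open-source SpeechRecognition package and hypothetical file names, could look as follows.

import speech_recognition as sr

def transcribe(path: str) -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:
        audio = recognizer.record(source)         # read the whole file
    return recognizer.recognize_google(audio)     # online recognizer

# Compare the transcript of the enhanced voice against the original speech.
original_text = transcribe("original.wav")        # hypothetical file names
enhanced_text = transcribe("enhanced.wav")
print("match" if original_text.lower() == enhanced_text.lower() else "mismatch")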

II. Methodology
The methodology involves two processes: the experimental setup and the implementation. The implementation is explained through objective functions and the related algorithms. The objective functions are explained in the following section, followed by the steps and the algorithms designed.

Objective functions
There are three main objectives of the proposed work. The objective functions are described below:

a. Detect a voice
The voice sample is iterated to check for zero values; any zero-valued sample found is removed from the data. This removes the leading and trailing spaces as well as the in-between silence. The trimmed signal is further tested to retain only the speech using eq. (2).

b. Detect Speech
The threshold value is arrived at by iterating through the dataset. After eq. (1) the signal no longer contains any silence, so it now holds either voice or noise. For this purpose, the average of each signal is calculated and added up; the sum is then divided by the number of samples to arrive at a threshold value Qs. The voice is iterated and, if the variance (σ²) of the data is more than Qs, it is considered voice; otherwise it is taken as silence and removed from the voice sample.
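A minimal NumPy sketch of the two objectives as described above is given below. The frame length used for the variance test and the interpretation of "number of samples" are assumptions; this illustrates the idea rather than the authors' exact formulation.

import numpy as np

def trim_zeros(signal: np.ndarray) -> np.ndarray:
    # Eq. (1) as described: drop zero-valued samples, i.e. leading,
    # trailing and in-between silence.
    return signal[signal != 0]

def dataset_threshold(signals) -> float:
    # Qs: the per-signal averages are summed and divided by the number of samples.
    return sum(np.mean(s) for s in signals) / len(signals)

def detect_speech(signal: np.ndarray, qs: float, frame_len: int = 400) -> np.ndarray:
    # Eq. (2) as described: keep the parts whose variance exceeds Qs.
    frames = [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
    kept = [f for f in frames if np.var(f) > qs]
    return np.concatenate(kept) if kept else np.array([])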

Working
The following steps briefly explain the working of VoSE:
a. Separate Voices
1. Voice files of different languages are stored in related folders. (The languages are not mixed; they are tested individually.)
2. Each folder is read, and the data is filtered using TNNVF.
3. A script automatically reads the files from the different folders and fuses the voices. The fused data is stored in a mixed-voice folder (a sketch of this step is given below).
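A possible sketch of step 3 follows; the soundfile library, the folder layout, and equal-weight mixing are assumptions, since the paper does not specify them.

import numpy as np
import soundfile as sf    # assumed I/O library

def fuse(voice_a: np.ndarray, voice_b: np.ndarray) -> np.ndarray:
    # Mix two single-channel voices over the length of the shorter one.
    n = min(len(voice_a), len(voice_b))
    return 0.5 * (voice_a[:n] + voice_b[:n])

a, rate = sf.read("male/clean_0001.wav")          # hypothetical file names
b, _ = sf.read("female/clean_0001.wav")
sf.write("mixed/fused_0001.wav", fuse(a, b), rate)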

[Figure 1: VoSE model — datasets, filter, fused voices, training set, testing set.]
The model in Figure 1 graphically represents the working of VoSE. The algorithms below explain the working in detail; they are based on the code written for the purpose.

Algorithm 1: Prepare sound files
Setup: Initialize the required variables; read the folder containing the sound files
Start
Step 1  While not end of folder do
Step 2      vi ← read sound file
Step 3      Detect voice using eq. (1)
Step 4      Remove unwanted data
Step 5      Detect speech using eq. (2)
Step 6      clean_speech ← retain only speech
Step 7      clean_voice ← save clean file to folder
Step 8  end while
Step 9  Repeat Steps 1 to 6 for the female and assorted folders

Algorithm 2 prepares the training set. The dataset contains male, female, and assorted voices. A single channel is read, digitized, and stored in a CSV file with the appropriate label. Numeric labels are assigned for uniformity in the data types.
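A sketch of the Algorithm 2 description (single channel read, digitized, and stored in a CSV file with a numeric label) is given below; the label scheme, fixed voice-print length, and column layout are assumptions.

import csv
import numpy as np
import soundfile as sf

LABELS = {"male": 0, "female": 1, "assorted": 2}    # assumed numeric label scheme

def append_voice_print(csv_path: str, wav_path: str, category: str, length: int = 16000):
    data, _ = sf.read(wav_path)
    if data.ndim > 1:
        data = data[:, 0]                  # keep a single channel
    data = np.resize(data, length)         # fixed-length voice print (assumption)
    with open(csv_path, "a", newline="") as f:
        csv.writer(f).writerow(list(data) + [LABELS[category]])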

Algorithm 3: Prepare Testing set
Step 1  index ← random 1 to length of voice folder
Step 2  for i in index:
Step 3      Vf

In Algorithm 4, the voice samples of male, female, and mixed voices are split into training and testing sets, with the labels assigned to each voice print in Algorithm 3. The parameters are selected according to Laurance [70]; several test runs were carried out before arriving at the optimum parameters and their values. At the end of the algorithm the model is ready for prediction. For prediction, the normalized fused voice sample N is used. The maximum of the predicted output is matched with the label stored in Ytrain; if a match is found, the voice signal is retained from the dataset using that label. The voice retrieved is a processed voice; to get the original voice print, EPV is used. EPV simply takes a multiclass parameter for multiclass classification. The original voices are taken as the input dataset, and labels are assigned from 1 to the length of the samples. Once the model is trained, prediction analysis is done on the predicted output received after running Algorithm 5. The predicted voice print Ev is the enhanced voice print, which is very close to the original voice. The retrieved voice is converted into text to ascertain this claim.
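The following sketch condenses the prediction stage described above: a LightGBM multiclass model is trained on the labelled voice prints, the normalized fused sample is scored, and the maximum of the predicted output is matched against the stored labels. The array shapes, normalization scheme, and parameter values are assumptions.

import lightgbm as lgb
import numpy as np

def train_and_separate(Xtrain: np.ndarray, Ytrain: np.ndarray, fused_print: np.ndarray):
    params = {
        "objective": "multiclass",
        "num_class": int(np.max(Ytrain)) + 1,
        "num_leaves": 31,
        "learning_rate": 0.1,
    }
    model = lgb.train(params, lgb.Dataset(Xtrain, label=Ytrain), num_boost_round=200)

    # Normalized fused voice sample N (normalization scheme is an assumption).
    n = (fused_print - fused_print.mean()) / (fused_print.std() + 1e-8)
    probs = model.predict(n.reshape(1, -1))[0]

    # The maximum of the predicted output is matched with the label stored in Ytrain.
    label = int(np.argmax(probs))
    return Xtrain[Ytrain == label]         # voice prints retained for that label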

III. Accuracy and Comparison
For accuracy, the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) counts are measured. These are further used to calculate Precision, Recall, Accuracy, and F1 score. TP is the number of correctly detected (predicted) voices. FN is the number of voices that have not been correctly identified, or one may say were wrongly identified. FP is the number of signals labelled as voice signals that are not. TN is the number of non-speech signals correctly identified as such.
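The standard definitions of these measures, consistent with the description above, are:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)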
SDR is represented in terms of the following quantities. Here, S and es represent the original and the estimated clean source respectively, and L represents the length of the signal. einterf, enoise, and eartif represent the interference, noise, and artifact error terms respectively. P represents the power of the signal (S,S). Ts and en represent the target and estimated noise respectively. dS and dA represent the symmetric and asymmetric disturbances. S and es are both normalized to have zero mean to ensure scale invariance.
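For reference, the standard BSS-eval and scale-invariant forms consistent with these symbols are given below; which exact variant the paper uses is an assumption.

SDR = 10 · log10( ||S_target||² / ||e_interf + e_noise + e_artif||² )
SI-SDR = 10 · log10( ||α·S||² / ||α·S − es||² ),  where  α = ⟨es, S⟩ / ||S||²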

Results
This section discusses the results obtained. Spectrograms of the original voices together with the fused voice are produced first, followed by the quality and time checks. Higher values of SDR, SI-SDR, PESQ, and SI-SNR represent better signal quality. Table 3 shows the values of these parameters tested on the WSJ dataset; the values are calculated on the detected voices. To test the robustness further, Precision, Recall, Accuracy, and F1 score are calculated. Table 6 shows the number of samples tested on the different algorithms, with FP, TP, FN, and TN recorded for each algorithm. The values in Table 7 are based on the Table 6 values; the mathematical formulations are explained in equations 1-4.

Conclusion
This paper presents a model based on a gradient boosting algorithm. The objective of VoSE is to separate the voices from a mixed signal and enhance them. The model is able to successfully separate male, female, assorted, and other voices from a mixed signal. The algorithm is compared with benchmark algorithms such as K-means, Decision Stump, Naïve Bayes, and LSTM; the comparison is drawn by running the algorithms on the dataset created for the proposed work. The two main objectives of VoSE, to separate the voices from a mixed signal and to enhance the separated voices, are achieved in good time. The results show that VoSE consumes less time than K-means, Decision Stump, Naïve Bayes, and LSTM, and an accuracy of 99.99% shows that it performs better than the considered algorithms. The quality of the recovered voices is measured using SDR, SI-SDR, PESQ, and SI-SNR; higher values indicate that the quality of the recovered voice is good.
VoSE can be used to design hearing aids that deliver crystal-clear sound to the hearing impaired. The scope of the model is not limited to one application: VoSE can be utilized by any voice response system, such as Siri, Alexa, or Google Assistant, which currently work on single voice commands, and it can also be used for audio bots. In future, the authors plan to develop a self-learning algorithm that can decode the voices from any source and silence the noise completely. The current research is limited to the separation and enhancement of known mixed voices. VoSE is the first step towards the final goal of designing a robust system that would be able to identify the voices of unknown speakers and sources.