A focus module-based lightweight end-to-end CNN framework for voiceprint recognition

The process of identifying a speaker from a collection of sequential time series data is referred to as speaker identification. Convolutional neural networks (CNNs) and deep neural networks are the two types of neural networks used in the majority of modern experimental approaches. This work presents a CNN model for speaker identification using a jump-connected one-dimensional convolutional neural network (1-D CNN) with a focus module (FM). The 1-D convolutional layer integrated with FM is employed in the presented model for speaker characteristic extraction; it lessens heterogeneity in the temporal and spatial domains, allowing for quicker layer processing. Furthermore, stacked CNN jump connections are employed to overcome connectivity glitches, and a solution based on the joint supervision of softmax loss and a smooth L1-norm is presented to increase efficiency. The recommended network model was evaluated on the ELSDSR, TIMIT, NIST, 16,000 PCM, and experimental audio datasets. According to the experimental data, the equal error rate (EER) of the end-to-end CNN for voiceprint identification improves on baseline approaches by 9.02%. In experiments, our proposed speaker recognition (SR) model, which we refer to as the deep FM-1D CNN, achieved a high recognition accuracy of 99.21%. Moreover, the observations demonstrate that the proposed network model is more robust than other models.


Introduction
Speaker recognition is a useful bio-attribute identification/verification technique: the process of establishing a person's identity from their voice signal. Speaker recognition (SR) can be classified into two forms based on its use: speaker verification (SV) and speaker identification. The former investigates whether the claimed speaker is the genuine speaker, whereas the latter seeks to identify the speaker.
As the total number of speakers in the database increases, intra-speaker variances come to outweigh inter-speaker differences, and this is the primary source of error in large-scale speaker identification [1]. These considerations motivate the proposed work on large-scale speaker identification approaches that achieve higher identification rates.
Extracting individual attributes from voice signals is a crucial challenge in the speaker identification process [2,3]. During the detection phase, the speaker recognition system compares the retrieved speaker attributes, including gender, emotional qualities, dialect, and so on, against those of speakers in the pattern database. Because of these distinguishing characteristics, researchers are able to identify speakers by using voiceprint features [4,5,6]. The target speaker is then determined as the speaker with the highest likelihood for the utterance. However, according to Wan et al. [7], state-of-the-art SR is still insufficient.
Convolutional neural network (CNN) and deep neural network (DNN) approaches have shown promising results in addressing the over-fitting problem, speaker recognition [5,8], and related tasks. However, because of the heterogeneity of the convolutional layers of a CNN, the best fitness response cannot be attained during training, resulting in a drop in accuracy [9].
The main contributions of this research are as follows: (1) the network framework incorporates the speaker-specific characteristic extraction benefits of a 1-D CNN with a focus module (FM); (2) a focus improvement filter module is created to improve the learning of low-amplitude hidden information and to concentrate on locations that are prone to be strained by embedding; (3) jump connectivity is used to optimise the FM-1D CNN network model, and the gradient vanishing issue produced by the deep CNN framework is avoided to some degree; (4) in the network training phase, joint supervision of softmax with a smooth L1-norm is used as the error function to increase the model's recognition performance.

Related works
Computer vision [10] emerged in the 1980s as a prominent branch of data science, using algorithms to assess data, extract relevant attribute information, and then make decisions. Voiceprint preparation, feature extraction, speaker modelling, and normalisation are generally the phases of speaker detection in this period. Mel-frequency cepstral coefficients [11], linear prediction coefficients [12], and linear prediction cepstral coefficients [13], among others, are examples of common features. Speaker models are then constructed from the retrieved attributes. The Gaussian Mixture Model (GMM) [8], the Universal Background Model (UBM), the Joint Factor Analysis model (JFA) [14], and the Support Vector Machine (SVM) [15] are some of the conventional speaker detection systems. Because it requires fewer training samples and offers higher stability and identification efficiency, the i-vector model [16,17] is now regarded as the standard approach in classical speaker identification.
A neural network can be used to create novel vector patterns that characterise a particular speaker, or as a supplement to standard SR [18]. For instance, a d-vector can be created by taking activations from the final hidden layer of a trained DNN, then aggregating and averaging them after L2 normalisation [19]; an x-vector is produced by taking the linear component directly from the softmax layer of the DNN structure, allowing for large-scale speaker recognition [20]. CNN-based feature extraction can also be built on top of the GMM model; such a technique enhances the effectiveness of sentence SR by combining MFCC and prosodic data [13].
Because voice data are continuous 1-D time series, numerous recent research approaches are centred on CNNs or DNNs [21]. A speech waveform can be represented as a spectrogram, so the spatial characteristics of a voiceprint, together with a 1-D CNN, are suitable for extracting frequency-domain attributes and for modelling spectral relationships in speech signals [22]. We therefore use an FM combined with a 1-D CNN to automatically extract attributes from the spectra. In addition, stacked jump-connected CNNs are used in the framework for speaker-specific feature extraction, since they have been demonstrated to be effective for speaker recognition [23,24].

End-to-end speaker recognition structure
Early neural network research on speaker identification focused on optimising and improving local parameters by treating the network as an extension of the classical model [25,26]. While the standard CNN has addressed the over-fitting issue, the gradient vanishes as the network topology grows [26,27]. The suggested CNN leverages jump connections during learning to resolve the hidden-node redundancy issue and to accelerate the model's training. In addition, the joint guidance of the softmax loss and the smooth L1-norm increases the accuracy of the SR model. In the enrolment phase, a focus feature vector is constructed by collecting filter bank characteristics from the user's voice. Figure 1 depicts the suggested scheme's structure.
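As an illustration of this enrolment step, the following is a minimal sketch of extracting log filter-bank features from an utterance. The library choice (librosa) and the 40-filter mel bank are assumptions, not the paper's published configuration; the 20-ms window and 10-ms shift follow the experimental setup described later.

# Hypothetical sketch of log filter-bank extraction for enrolment.
# librosa and n_mels=40 are assumptions; 20-ms window / 10-ms shift follow the text.
import numpy as np
import librosa

def filterbank_features(wav_path, sr=16000, n_mels=40, frame_ms=20, hop_ms=10):
    """Return a (frames x n_mels) log mel filter-bank matrix."""
    y, sr = librosa.load(wav_path, sr=sr)
    n_fft = int(sr * frame_ms / 1000)      # 20-ms analysis window
    hop = int(sr * hop_ms / 1000)          # 10-ms window shift
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return np.log(mel + 1e-6).T            # log compression, frames first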

Focus module (FM)
In this module, a soft-attention methodology [28] is used to boost the learning rate of the speaker attributes. The focus module is made up of three parts: a feature descriptor (G), a Laplace augmentation factor (L), and spatial and channel attention. We calculate the correlation coefficient λ_j of each feature map in the descriptor G and derive the joint distribution and correlation coefficients of adjoining feature maps. In the augmentation factor L, the gradient of the feature map is generated using Eq. (1) as the associated statistical characteristic, where j is the channel number, and I and X are the feature vector's dimensions, respectively. This step extracts the speaker-specific attributes according to these characteristics, using the gradient intensity correlation of the feature maps and the correlation coefficients of neighbouring frames. The channel and spatial attention parts then receive the filtered and enhanced feature maps for further learning.
The feature extraction and classification systems' unique layout and parameterisation are depicted in Fig. 2.
For down-sampling the feature map, there are three special 1-D convolutional layers, each with its own stride and filter count. This operation is repeated several times until every feature vector is half of its previous size. The final 1-D S-feature vector is obtained after the final convolutional layer. The classification structure converts the 1-D S-feature vector into a probability value, giving the speaker label of the raw input speech signal. The fully connected network is made up of two layers, with 256 and 128 neurons, respectively.
To improve the learning weight of speaker data, the module includes a soft learning algorithm. The input feature map G is segmented to obtain G_j, j = 1, 2, 3, …, with j being the channel number and I and X being the dimensions of the feature vector, respectively. Feature maps G_j that meet the demands are filtered by comparing the correlation coefficients among adjacent points in the same feature map. For each feature map G_j, the correlation coefficient λ_j of speaker elements is defined in Eq. (2), where G_j(p, q) represents the average value and p, q indicate the feature vector size. Among the G_j, those whose correlation coefficients take the 32 largest values are retained, resulting in a screened, thinner set of feature maps G_j, j = 1, 2, ….
Figure 3 depicts the prototype CNN structure. The input feature data dimension is 16,000 × 1, while the output feature data are 1-D. ReLU and batch normalisation are the activation and normalisation layers, respectively [9]. The jump connection operation follows Fig. 3.
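As a concrete illustration of this screening step, the following is a minimal PyTorch sketch that scores each channel by the correlation of adjacent points and keeps the 32 highest-scoring channels. Since Eq. (2) is not reproduced here, a lag-1 autocorrelation is used as a stand-in; this is an assumption, not the paper's exact formula.

# Sketch of the focus-module screening step: keep the 32 channels whose
# adjacent-sample correlation is largest. The correlation formula is an
# assumed stand-in for Eq. (2).
import torch

def focus_filter(G: torch.Tensor, keep: int = 32) -> torch.Tensor:
    """G: (channels, length) 1-D feature maps. Returns the `keep` channels
    whose adjacent-sample correlation coefficient is largest."""
    x = G - G.mean(dim=1, keepdim=True)              # centre each channel
    num = (x[:, :-1] * x[:, 1:]).sum(dim=1)          # covariance of neighbouring points
    den = (x[:, :-1].norm(dim=1) * x[:, 1:].norm(dim=1)).clamp_min(1e-8)
    lam = num / den                                  # per-channel coefficient λ_j
    idx = torch.topk(lam, k=min(keep, G.shape[0])).indices
    return G[idx]                                    # screened, thinner feature map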

FM with jump-connected CNN
The suggested 1-D FM-based skip-connected CNN architecture is based on the standard CNN architecture, with a few changes to the connections between the convolutional layers and the learning factors [29]. As shown in Fig. 3, six convolutional layers with filter and stride sizes of 2 × 1 and 1 × 1, respectively, are each followed by a pooling layer with a maximisation function and filter and stride sizes of 2 × 1. Following those six stacked layers, one fully connected layer outputs 500 filters to the output layer, which then classifies the speaker-specific features into speaker labels, i.e. speaker IDs.
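A hedged PyTorch sketch of such a jump-connected 1-D CNN body is given below. The kernel size of 2, stride of 1, 2 × 1 max pooling, and 500-unit fully connected layer follow the description above; the channel width and the exact placement of the skip connections are illustrative assumptions.

# Sketch of a jump-connected 1-D CNN: Conv1d -> BatchNorm -> ReLU blocks with
# identity (skip) connections, each followed by 2x1 max pooling. Channel widths
# and skip placement are assumptions.
import torch
import torch.nn as nn

class SkipBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, stride=1, padding=1)
        self.bn = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, x):
        y = self.act(self.bn(self.conv(x)))
        y = y[..., :x.shape[-1]]          # trim padding so the skip shapes match
        return self.pool(x + y)           # jump connection, then down-sample

class FM1DCNN(nn.Module):
    def __init__(self, n_speakers, channels=64, n_blocks=6):
        super().__init__()
        self.stem = nn.Conv1d(1, channels, kernel_size=2, stride=1)
        self.blocks = nn.Sequential(*[SkipBlock(channels) for _ in range(n_blocks)])
        self.head = nn.Sequential(nn.AdaptiveAvgPool1d(1), nn.Flatten(),
                                  nn.Linear(channels, 500), nn.ReLU(),
                                  nn.Linear(500, n_speakers))

    def forward(self, x):                 # x: (batch, 1, 16000)
        return self.head(self.blocks(self.stem(x)))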

Objective L1-loss function
In the learning phase of classification networks, the cross-entropy loss function is commonly utilised; it is illustrated in Eq. (3).
Here, z_k is the speaker label, q_k is the output of the network through the softmax layer, and L is the number of dimensions of both the label and the output. When z_k = 0, which corresponds to a misclassified dimension, the term z_k log(q_k) = 0 regardless of the value of q_k. That is, the computed cross-entropy loss depends only on the dimension of the output vector corresponding to the correct speaker label.
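For reference, a standard cross-entropy form consistent with these definitions (a reconstruction, not necessarily the paper's exact Eq. (3)) is:

% cross-entropy over the L-dimensional softmax output q and one-hot label z
\mathcal{L}_{CE} = -\sum_{k=1}^{L} z_k \log(q_k)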
During training, the smooth L1-loss combined with cross-entropy is used as the loss function, and the label index of the input speaker is used as the regression target value.
Equation (4) shows the smooth L1-loss function in combination with the MSE loss, where s denotes the difference between the labelled value f(s_k) and the predicted value z_k. Compared to the softmax loss and the MSE loss individually, the combined smooth L1-loss function allows the training to converge faster and be more resilient.
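For context, the standard smooth L1 (Huber-style) penalty on the residual s takes the following form; this is an assumed reconstruction of the term referenced in Eq. (4), and the weighting against the softmax/MSE terms is not specified here:

% standard smooth L1 penalty on the residual s; combination weights assumed
\mathrm{smooth}_{L1}(s) =
\begin{cases}
0.5\, s^{2}, & |s| < 1 \\
|s| - 0.5,   & \text{otherwise}
\end{cases}
\qquad s = f(s_k) - z_k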
The FM 1-D CNN proposed in this work has 21,258,326 parameters, occupies 356 MB of memory, and requires 3.74 GFLOs of floating-point operations.

Experimental framework
The audio samples in the experimental datasets were processed with a 20-ms short-term window, a 10-ms window shift, and 1-s chunks. Because speaker samples are time-varying, using larger frames also means losing temporal resolution [6].
Each audio file was divided into one-second chunks, so the network's inputs are (16,000 × 1)-dimensional data, with 16 bits (2 bytes) per sample. The Adam optimiser with a learning rate of 0.001 was used to optimise the model. The model was trained for 100 epochs, and a combination of softmax and smooth L1-loss functions was used to train the system in order to obtain the best loss behaviour. The recommended FM 1-D CNN model was trained and tested on a workstation equipped with a 3.2 GHz Intel(R) Core(TM) i7-8700 CPU, 32 GB of RAM, and Windows 10 as the operating system. PyCharm with Python 3.7 and PyTorch 1.1.0 were the integrated development environment and deep learning library, respectively.
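The following is an illustrative PyTorch training loop reflecting the settings quoted above (Adam, learning rate 0.001, 100 epochs, 16,000-sample inputs). The FM1DCNN class refers to the earlier sketch, and the weighting between the softmax and smooth L1 terms is an assumption; neither is taken from the paper.

# Illustrative training configuration; the loss combination weight `alpha`
# and the use of FM1DCNN from the earlier sketch are assumptions.
import torch
import torch.nn as nn

model = FM1DCNN(n_speakers=123)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
ce_loss = nn.CrossEntropyLoss()
sl1_loss = nn.SmoothL1Loss()

def train_epoch(loader, alpha=1.0):
    model.train()
    for chunks, labels in loader:                   # chunks: (batch, 1, 16000)
        logits = model(chunks)
        loss = ce_loss(logits, labels) + alpha * sl1_loss(
            logits.softmax(dim=1),
            nn.functional.one_hot(labels, logits.shape[1]).float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()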
Deep learning models typically use dataset partitioning ratios of 80/20 or 70/30 [6]. Regressions on non-stationary time series may produce erroneous results [30]; high residual correlation coefficients and R² can be indicators of spurious regression. To verify stationarity, the Dickey-Fuller test is used. Because the resulting p value is less than 0.05, we reject the null hypothesis and conclude that the audio data are stationary.
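A minimal sketch of such a stationarity check, using the augmented Dickey-Fuller test from statsmodels, is shown below; the helper function name and the use of librosa for loading are assumptions, while the 0.05 threshold follows the text.

# Hypothetical stationarity check with the augmented Dickey-Fuller test.
from statsmodels.tsa.stattools import adfuller
import librosa

def is_stationary(wav_path, alpha=0.05):
    y, _ = librosa.load(wav_path, sr=16000)
    p_value = adfuller(y)[1]          # adfuller returns (statistic, p-value, ...)
    return p_value < alpha            # True: reject the unit-root null hypothesis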

Experimental result analysis
The probability that the voice to be recognised is correctly matched to the associated speaker from the target set is referred to as the recognition rate (accuracy) [6]. It is formulated as the total number of correctly identified speakers (TNIS) divided by the total number of speech samples (TNSS). The framework is also evaluated with the equal error rate (EER), the challenge's major metric [29]. The EER is defined at the threshold (ρ|EER) at which the two detection error rates are equal.
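As a reconstruction of the omitted formula, the recognition rate and the EER operating point can be written as:

% recognition rate from the definitions in the text; the EER operating point
% equates the false acceptance and false rejection rates
\mathrm{Accuracy} = \frac{TNIS}{TNSS} \times 100\%,
\qquad
\mathrm{FAR}(\rho_{EER}) = \mathrm{FRR}(\rho_{EER}) = \mathrm{EER}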
The F1-score in SR refers to the precision- and recall-based validity of the test voice samples [31]. Using Eq. (7), the F1-score can be numerically calculated.
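The usual definition, consistent with the description of Eq. (7), is:

% standard F1-score; a reconstruction from the text, not copied from the paper
F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}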

Results on ELSDSR corpus
The system's prediction accuracy is assessed using standardised data from the English Language Speech Database for Speaker Recognition (ELSDSR) [32]. The collection includes audio clips from 22 speakers from Canada, Denmark, and Iceland, ranging in age from 24 to 63. In all, 154 (7 × 22) utterances were acquired for training and 44 (2 × 22) utterances for testing. According to the findings, the final accuracy rates of the features recovered from the focus modules combined with spectrogram-aided recognition and from standalone MFCC were 98% and 92%, respectively. The provided model also achieves a lower equal error rate.

Results on TIMIT corpus
The TIMIT database provides professional-quality speech utterances of 630 speakers (192 females and 438 males) recorded at 16 kHz, representing the eight major dialects of American English [33]. The suggested FM 1-D CNN performed well on the TIMIT dataset (the best of all methods), with a 1% gain in classification accuracy over the hybrid wavelet scattering transform CNN. As shown in Table 2, one of the most significant benefits of the suggested model is its reduced error rate, which is approximately five times lower than that of SincNet.

Results on NIST 2008 corpus
The NIST 2008 database, which comprises 395 speakers, was used as the source dataset for voiceprints. Eight two-channel phone calls, each lasting approximately five minutes, are available for each speaker. Each full 5-min segment is split into a multitude of 2-s segments, and each individual has one 5-min audio recording [34].
The suggested framework had an equal error rate of 0.2 on the NIST 2008 speaker recognition evaluation set, compared to 0.25 for the CLNET technique. From the outcomes in Table 3, the proposed methodology shows the largest identification rate improvement, with more than 2% absolute improvement over the baseline CNN and roughly 5% relative improvement over the GMM model.

Results on prominent leaders' speeches (16,000 PCM) corpus
This dataset includes speeches of prominent leaders such as Benjamin Netanyahu, Jens Stoltenberg, Julia Gillard, Margaret Thatcher, and Nelson Mandela. Every audio file in this directory is one second long, has a sample rate of 16,000 Hz, and is PCM-encoded. Compared with the other network models on the 16,000 PCM voice dataset in Table 4, the proposed jump-connected CNN network model with the FM spectrogram had the best recognition accuracy (98.88%) and the lowest EER (0.3).

Results on the real-time recorded experimental corpus
The experimental speaker data for female and male speakers were captured at 16,000 Hz with a 16-bit depth. Each of the 123 speakers (69 men and 54 women) contributed approximately 25 samples to the dataset. The proposed FM 1-D CNN, with its improved layered FM and spectrogram-based approach, performs better than earlier DL models in terms of accuracy and loss-function behaviour (see Table 5). Figure 4 depicts the duration of one training epoch for several classification algorithms. The CNN + FM + 1-D CNN model has the fewest trainable parameters among these models, while the typical CNN and DNN models have 50% more parameters than the proposed model. Figure 5 demonstrates that the F1-score outcomes of all cutting-edge classification techniques were at least 85.0%. For the experimental dataset, the proposed FM hybrid CNN model had the highest evaluation performance, at 99.13%. Even though the conventional CNN (94.26% average score) is robust and performs well on time series, it did not outperform the recommended skip-connected CNN model (97.37% average score).

Robustness verification of the proposed scheme
In this section, the performance of the proposed network model is validated under a variety of 5 dB noise types, such as Gaussian white noise, auditorium noise, train noise, television noise, and classroom noise. Table 6 presents the results of the experiment carried out on the 16,000 PCM dataset with noise added. Even so, the recommended deep network model achieved an average speaker identification accuracy of 90.15%.
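For reproducibility, the following is a hedged sketch of how clean test audio can be mixed with a noise recording at a fixed 5 dB signal-to-noise ratio; the mixing procedure and function name are assumptions rather than the paper's published tooling.

# Hypothetical SNR mixing helper used to corrupt test audio at 5 dB.
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float = 5.0) -> np.ndarray:
    noise = np.resize(noise, clean.shape)                 # loop/trim noise to length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise                          # clean speech at the target SNR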

Conclusion
This work provided an FM 1-D CNN model for speaker identification, based on the benefits of the 1-D jump-connected CNN for speaker identification and on the fact that focus modules, in conjunction with the speech spectrogram, carry considerable voiceprint information. The proposed model incorporates both the advantages of FM feature extraction and the time-dependent nature of the convolutional layer unit. The speaker identification results were produced at the final layer by the softmax layer of the network. On the ELSDSR, TIMIT, NIST, 16,000 PCM, and experimental speech datasets, the proposed deep network model surpassed the rest in terms of model training complexity, equal error rate (less than 2%), F1-score (higher than 99%), recognition accuracy (better than 99%), and stability.