Speaker Recognition System based on Age-related Features using Convolutional and Deep Neural Networks


With the growth of conversational voice assistants such as Alexa, Siri and OK Google, natural language conversational systems, including chatbots and voice recognition systems, are at a new high, and determining the age of a speaker is critical for setting the pertinent context. Age can be inferred from the speech signal through various factors such as the physical attributes of the voice, linguistic attributes, frequency and speech rate. The proposed research article discusses extracting spectral features of speech such as Cepstral Coefficients, Spectral Decrease, Spectral Centroid, Spectral Flatness, Spectral Entropy, F0DIFF, Jitter and Shimmer as inputs, which help in classifying speaker age through deep learning techniques. A novel approach and an implementation model are presented using a Deep Neural Network and a Convolutional Neural Network, with the extracted features classified by three different classifiers: the Gaussian Mixture Model (GMM), the Support Vector Machine (SVM) and GMM-SVM. The results obtained from the proposed system outline its performance in speaker age recognition.


Introduction
Speech recognition is an expanding area in computational linguistics that encompasses technologies related to recognizing speech signals and interpreting them in a meaningful way (Karpagavalli, Chandra 2016). Speech carries a great deal of information within it, as shown in Figure 1. Speech processing systems are also becoming more complex in countries such as India, where there are at least 22 recognized languages, each with several dialects and hundreds of accents (Ravindra Parshuram Bachate 2019).

Automatic Speaker Recognition
The main components involved in an Automatic Speaker Recognition system include a) a signal processing front end, b) an Acoustic Model built from the extracted features, and c) a Language Model. Automatic Speaker Identification (ASI) (Ertaş 2011) must identify the speaker without any prior knowledge (Campbell 1997). For Automatic Speaker Verification (ASV), there can be a manual identity verification step or an approach using a supervised learning methodology.

Human speech and acoustic features of age
Speech is delivered through the sound units of a language, sequenced according to specific language dialects. The speech signal carries unique physiological characteristics and traits that are specific to the speaker (Rubin P 1998). In general, the acoustic features of speech vary with age due to anatomical and physiological changes, since the human speech production system comprises several components.
It can be difficult to ascertain whether recognized changes in speech are caused by disease or by age. Changes in the respiratory system affect the voice along with breathing over time. Factors such as lung capacity, thickening of the thorax and weakening of the respiratory muscles can also affect the quality of speech. Fundamental frequency and voice quality degrade due to age-related changes, which occur in the larynx after it reaches its full size at puberty.
The craniofacial skeleton, which continues to grow by 3-5%, can also lower the quality of speech. The key factors that help regulate the fundamental frequency (F0) (Schotz 2007) include speech rate, coordination of the articulators and breath support. These factors are, however, affected by neuromuscular ageing, which also causes distress to the motor system. Variability and instability in F0 and amplitude increase with age. In summary, the respiratory system, larynx, supralaryngeal system, neuromuscular control and female/male ageing are the key components of the human system that impact the age-related features of the speech signal (Schotz 2007).

Factors affecting speaker age recognition
Speaker age is a noncognitive indication in the speech signal that humans can identify while listening, using their intelligence. It can be determined acoustically and adapted for estimating age. Some of the issues that affect the recognition of speaker age include:
• The speaker's natural speech rate (P 2015), represented by the total time and syllables per second (ThomasShipp 2005)
• Breath management, i.e., the number of breaths and breath pause duration (Schötz)

Applications of Speaker Age Recognition
Speaker age is a significant paralinguistic feature that exhibits phonetic variation across speakers (Schötz 2006). Speaker age recognition helps investigative agencies and forensic departments with speaker profiling for criminal case examination (Poorjam 2013). It is also a good use case for service customization: customizing advertisements in the waiting queue of an IVRS system based on the determined age is one good real-time use case for speaker age recognition.
Emotion Recognition (ER) is one of the important aspects of dialogue analysis for which determining the speaker's age is used (Huang 2014). Speaker age recognition is also used in paralinguistic analysis (Zhang 2017). It can help safeguard children while they use social platforms and consume internet content, and it finds application in interactive learning sessions over the internet. It can also help in determining the physiological changes that boys and girls undergo at the onset of puberty.
Typically, HMI (Human Machine Interface) systems that are part of today's automotive systems include chatbots. These chatbots need to provide an intuitive and interactive response to the interactor (Patil 2013). Determining the speaker's age helps the bot customize its response: it can respond differently to younger people and adults, providing a user-adaptive HMI experience.

Recognition
A Convolutional Neural Network is a class of deep neural network. It is also called a ConvNet and contains convolution and pooling layers.
Conventional algorithms are typically based on correlation techniques; they eventually incur long computation times and are deficient in the speaker recognition process.
A Deep Neural Network (Sainath, et al. 2013) is typically a Multi-Layered Perceptron (MLP) consisting of many hidden layers. It is a feed-forward artificial neural network in which the layers exist as hidden units.
These hidden units sit between the inputs and the output across the layers. Pre-trained DNNs have proven to perform better on ASR than conventional MLPs without pre-training. The following figure depicts the acoustic model with a DNN.
The paper is organized into seven sections. Section 2 discusses the related literature on speaker recognition with respect to age and gender. Section 3 elaborates the problem statement in detail and the rationale behind the methodology choices. Section 4 presents the proposed architecture and methodology as a high-level overview. Section 5 outlines the data and the experiments conducted. Section 6 presents a detailed discussion of the results obtained. Finally, the paper concludes with observations on possible directions for future work along with limitations.

Literature review
• An optimal way of training x-vectors for the age estimation task is proposed. Training is performed on the NIST SRE08 dataset and testing against SRE10 (Ghahremani, et al. 2018). The implementation is based on a series of time-delay layers forming part of the DNN, followed by a temporal pooling layer that summarizes the feature sequence into a single fixed-dimension embedding, which is then fed into feed-forward layers. The Mean Absolute Error achieved is 12%, which is better than the i-vector baseline.
• Challenges in detecting speaker age, arising from intrinsic differences in the speaker's voice and fuzzy subjective classification, were addressed using MFCC and SVM, specifically on isolated words spoken by the speakers. Speaker age recognition of 72.93% was achieved through an SVM classifier based on a voicebox with 4507 isolated-word speech samples along with Mel Frequency Cepstrum Coefficients (MFCC), which help differentiate speaker age; efficiency is improved without the need for MFCC normalization (Yue, et al. 2014).
• Age determination within a multilingual context is evaluated with the South African dialects available in the Lwazi Corpus. The feature set is optimized using multilingual classifiers in regression experiments. Tests were performed to determine age based on feature selection across languages, with varying Mean Absolute Error rates.
• Two separate feed-forward DNNs, for long-term and short-term features, were built on voice inputs for analysing the speaker's age. A Gaussian Mixture Model is trained using MFCC features, which are subsequently fed to the DNN as a supervector, yielding very good recognition accuracy for age identification. This is done with 384 speakers, of which 104 were young, 216 were adults and 64 were seniors. This work also concludes that the short-term feature-based DNN performs better than the long-term one (Osman Büyük 2018).
• Using Long Short-Term Memory and Recurrent Neural Networks, an age estimation system is built with short utterances of 3 to 10 seconds that can be handled straightforwardly.
• Voice utterances were encoded into a static vector by using pooled activations from the last hidden layer of a DNN, processed over time as mini-batches. For better classification, a kernel-based Extreme Learning Machine is used to train the encoded vectors instead of a SoftMax classifier, due to the limited availability of samples.
• A Mandarin dataset with 17,408 utterances is used for the research; the data are divided into 70%, 15% and 30% portions for training, validation and the rest, respectively. The paper reports improvements of 3.8% in weighted accuracy and 2.94% in unweighted accuracy over the baseline implementation (Wang and Tashev 2017).
• Speakers of European Portuguese aged 60 and above are used for age estimation using i-vectors and Support Vector Regression, yielding mean error values of 5.4 for males and 5.7 for females. The selection of these European Portuguese speakers (Pellegrini, et al. 2014) is made by automatically assigning estimated ages to the test speakers. The adapted acoustic model's experimental research resulted in a WER (Word Error Rate) of 9.3%, compared to 13.9% obtained using a baseline ASR system without adapted acoustic models.

Problem Statement
Generally, speech signals show large variability across languages, dialects and accents, and it is therefore important to perform feature extraction to reduce such variability. Such conventional feature extraction processes also degrade dramatically with increasing noise levels and channel degradation. The motivating principles behind PLP are similar to those of MFCC; its analysis is computationally efficient, but it yields a low-dimensional representation of speech.
As per the recent literature on various feature extraction techniques, there is a need for speech-based age determination built on CNN-DNN with EBNF, as these are noise-robust. Localized convolution filters are used to normalize the spectral variations in speech that arise from differences in vocal tract length. CNNs have also been shown to manage robust feature extraction under noise and channel-degradation conditions. DNNs have shown significant improvements in speaker recognition, and speaker normalization techniques have been found to contribute significantly to speaker recognition accuracy through multiple hidden layers with speaker-invariant data. Current DNN trends deviate significantly from MFCCs, which are increasingly replaced by Mel-Filter Energy Bank (MFB) based feature extraction. Therefore, the proposed research work focuses on determining Enhanced Bottleneck Features of speaker age by fusing the CNN-DNN outputs, with the age-related features extracted using weighted methods.

Proposed Methodology
Classification of speakers by age, based on the extracted features, is performed by combining the two subsystems at the score level.

Feature extraction modelling using CNN/DNN
Automatic age determination from the speech signal is complex from different viewpoints. For example, the perceived age of a speaker can differ from the actual age. Robust acoustic features are also essential for improving acoustic modelling with noisy and channel-degraded data. For speech signal processing, the raw signal can be used directly, and an LSTM (Long Short-Term Memory) network can be employed to accomplish the processing within the neural network. An acoustic model is used to extract features, and its parameters are jointly estimated with the model parameters. With this method, challenges arise because training data is restricted, which affects the model's behaviour. For this reason, DNN- and CNN-based feature extraction models are proposed and implemented.
In this study, the main focus is to extract the age-relevant features of the speaker, which are shown in Table 1. As audio signals are constantly changing, the above features are considered for recognizing the speaker's age. For example, Cepstral Coefficients help separate the excitation from the vocal-tract shape; they outline the short-term power spectrum of the human voice. Jitter and shimmer are acoustic characteristics of the input speech signal that are helpful in detecting voice pathologies. Data show a systematic decrease in the mean and variance of the acoustic correlates towards adult ranges around 13 or 14 years of age. These correlates, formants, pitch and duration with age, are essential for determining age-related aspects from voice signals. The steps for CNN- and DNN-based feature extraction are described in the following subsections.
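Before detailing those steps, the sketch below shows one plausible way to compute a subset of the Table 1 features (cepstral coefficients, spectral centroid, spectral flatness, spectral entropy, and a simple jitter estimate from the F0 track). It is a minimal illustration assuming the librosa library and a hypothetical input file speech.wav; it is not the exact extraction pipeline used in this work.

```python
import numpy as np
import librosa

def extract_age_features(wav_path, sr=16000, n_mfcc=13):
    """Illustrative extraction of a few spectral features related to speaker age."""
    y, sr = librosa.load(wav_path, sr=sr)

    # Cepstral coefficients (MFCCs): short-term power spectrum of the voice
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

    # Spectral centroid and flatness per frame
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    flatness = librosa.feature.spectral_flatness(y=y)

    # Spectral entropy computed from the normalized power spectrogram
    S = np.abs(librosa.stft(y)) ** 2
    P = S / (S.sum(axis=0, keepdims=True) + 1e-12)
    entropy = -(P * np.log2(P + 1e-12)).sum(axis=0)

    # F0 track (pYIN) and a simple jitter estimate:
    # mean absolute difference of consecutive periods over the mean period
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    periods = 1.0 / f0
    jitter = np.mean(np.abs(np.diff(periods))) / np.mean(periods)

    return {
        "mfcc_mean": mfcc.mean(axis=1),
        "centroid_mean": centroid.mean(),
        "flatness_mean": flatness.mean(),
        "entropy_mean": entropy.mean(),
        "jitter": jitter,
    }

# Example usage (hypothetical file):
# features = extract_age_features("speech.wav")
```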

CNN based feature extraction
• Here, features are extracted by convolutional and max-pooling layers. In the convolutional layer, acoustic feature frames are extracted, where every frame $f_i$ is a 1D (one-dimensional) feature map.
• The output of the convolutional layer consists of $j$ vectors ($[h_1, h_2, \ldots, h_j]$). Each 1D filter $f_{ij}$ connects the input feature map $x_i$ to the output feature map $h_j$, with bias $b_j$.
• The convolutional output is computed as $h_j = \sigma\left(\sum_i f_{ij} \ast x_i + b_j\right)$, where $\ast$ denotes the convolution operation and $\sigma$ the activation function.
• A max-pooling operation is then applied to the output of the convolutional layer to regulate spectral variation for the speaker recognition task. Invariant features are extracted with the help of the convolution and pooling layers.
• Subsequently, the pooling layer helps retain the essential information of the speech signal by discarding the insignificant information.
• Any frequency shift occurring in the speech signal is well managed by the max-pooling process; similarly, it aids in decreasing the spectral variance present in the given input signal.
• Each pooling layer down-samples the feature maps to a condensed resolution and decreases the spatial dimension of the input signal, reducing the large number of parameters, dropping the computational cost and controlling overfitting.
• A filter stage is a convolutional layer followed by a temporal max-pooling layer. The features are extracted in the convolutional layer and modelled between the layers.

• In the classifier stage, the features are classified with the help of fully connected layers and a SoftMax layer. Since speech signals are non-stationary in nature, they are typically processed in sliding windows of 20-40 ms. A minimal sketch of these filter and classifier stages is given below.
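The following is a minimal, hedged PyTorch sketch of the filter stage (1D convolution followed by temporal max pooling) and the classifier stage (fully connected layers with a softmax output) described above. The layer sizes (200 filters of width 8, pooling size 3, 15-frame context) follow the settings quoted later in the classification approach; everything else, including the number of age classes and the feature dimensionality, is an illustrative assumption.

```python
import torch
import torch.nn as nn

class CNNFeatureExtractor(nn.Module):
    """Illustrative filter + classifier stages for age-group classification."""
    def __init__(self, n_feats=40, context=15, n_classes=3):
        super().__init__()
        # Filter stage: 1D convolution over the time axis, then temporal max pooling
        self.conv = nn.Conv1d(in_channels=n_feats, out_channels=200, kernel_size=8)
        self.pool = nn.MaxPool1d(kernel_size=3)           # non-overlapping pooling
        self.relu = nn.ReLU()
        conv_out = (context - 8 + 1) // 3                 # frames left after conv + pool
        # Classifier stage: fully connected layers and a softmax output
        self.fc1 = nn.Linear(200 * conv_out, 1024)
        self.fc2 = nn.Linear(1024, n_classes)

    def forward(self, x):
        # x: (batch, n_feats, context) -- a 15-frame context window of acoustic features
        h = self.pool(self.relu(self.conv(x)))
        h = h.flatten(start_dim=1)
        feats = self.relu(self.fc1(h))                    # penultimate activations used as features
        return feats, torch.softmax(self.fc2(feats), dim=1)

# Example: a batch of 4 context windows with 40 filterbank features per frame
# model = CNNFeatureExtractor()
# feats, probs = model(torch.randn(4, 40, 15))
```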

DNN based feature extraction
The age-based features extracted from a narrow hidden layer of a trained DNN model, given input speech frames passed through filter banks, are termed Bottleneck Features.
Steps:
• Here, the internal layers have fewer hidden units than the other layers, and the model is used to extract features as a deep hierarchical representation.
• In this process, Enhanced Bottleneck Features are mined through a Deep Neural Network in which one of the deepest layers has a smaller number of hidden units.
• The network is pre-trained layer by layer using RBMs, which provide the initial weights of the network.
• These weights are then given as input to the DBN, which is a stack of RBMs whose hidden nodes are organized into several hidden layers.
• The visible layer of the network is associated with the first hidden layer, and every hidden layer is linked to the hidden layer of the previous level.
• During training, the network weights are initialized by the DBN. The DBN is a probabilistic generative model consisting of many stacked layers of Restricted Boltzmann Machines (RBMs).
• Each RBM contains a layer of visible units and a layer of hidden units, used to build highly improved characteristics of the input speech features in the form of consecutive frames.
• Each visible node is linked to every hidden node, and each link has a weight indicating the strength of the interaction between the nodes.
• Finally, some of the RBM layers are removed to obtain the Enhanced Bottleneck Features (EBNF), which form the DNN-based deep bottleneck feature extractor.
• The layers following the bottleneck target producing speaker-explicit features, while the upper layers concentrate on the discriminative learning of the speaker classes by age group.

The forward computation and training of the DNN proceed as follows:
4: $X$ represents the concatenation of the mean- and variance-normalized raw MFCC features of several adjacent frames, where each $X$ denotes one sample.
5: The output of the input layer is equal to the input itself, i.e., $\sigma^{(0)} = X$.
7: The output of the $n$-th node of the output layer is computed by the softmax function, $\sigma_n^{(P)} = \exp(u_n^{(P)}) / \sum_m \exp(u_m^{(P)})$, where $\sigma_n^{(P)}$ is interpreted as the posterior probability of the corresponding speaker label $n$.
8: For the classification task, the cross-entropy criterion $J_{XH} = -\sum_n t_n \log \sigma_n^{(P)}$ is used, where $t_n$ stands for the 0/1-valued target output at the $n$-th node.
9: During training, the weights $W^{(n)}$ are modified in proportion to the derivative of $J_{XH}$ with respect to $W^{(n)}$.
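As a hedged illustration of the bottleneck idea, the sketch below defines a feed-forward DNN in PyTorch with one narrow hidden layer; after training with the cross-entropy criterion described above, the activations of that narrow layer are taken as bottleneck features. The layer sizes, the input dimension and the use of plain backpropagation instead of RBM/DBN pre-training are simplifying assumptions for illustration.

```python
import torch
import torch.nn as nn

class BottleneckDNN(nn.Module):
    """Feed-forward DNN with a narrow (bottleneck) hidden layer."""
    def __init__(self, n_in=195, n_hidden=2048, n_bottleneck=64, n_classes=3):
        super().__init__()
        # n_in = e.g. 13 MFCCs x 15-frame context window (illustrative)
        self.front = nn.Sequential(
            nn.Linear(n_in, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_bottleneck), nn.Sigmoid(),   # narrow bottleneck layer
        )
        self.back = nn.Sequential(
            nn.Linear(n_bottleneck, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_classes),                    # logits; softmax applied in the loss
        )

    def forward(self, x):
        bnf = self.front(x)          # bottleneck activations used as EBNF-style features
        logits = self.back(bnf)
        return bnf, logits

# Training with the cross-entropy criterion (step 8) and gradient updates (step 9):
# model = BottleneckDNN()
# loss_fn = nn.CrossEntropyLoss()   # combines softmax (step 7) with cross-entropy
# optim = torch.optim.SGD(model.parameters(), lr=0.01)
# bnf, logits = model(torch.randn(8, 195))
# loss = loss_fn(logits, torch.randint(0, 3, (8,)))
# optim.zero_grad(); loss.backward(); optim.step()
```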

Fusion of Features
After the CNN and DNN-EBNF features are extracted individually, the extracted features from the CNN and DNN networks are fused by a sum operation, as sketched below. The differential acoustic variability between the different age-group classes is used, which helps achieve a higher probability of predicting the correct age group. Prosodic features such as pitch, energy, formants, vocal tract length warping factor and speaking rate can also help enhance performance. In this work, two different kinds of fusion are performed. Table 2 (Classification of Gender and Age Group) outlines the age ranges of the different groups used in the classification.
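A minimal sketch of the feature fusion step is shown below, assuming the CNN and DNN-EBNF feature vectors have already been projected to a common dimensionality; the weights are illustrative placeholders for the weighted method mentioned earlier.

```python
import numpy as np

def fuse_features(cnn_feats, dnn_ebnf_feats, w_cnn=0.5, w_dnn=0.5):
    """Fuse CNN and DNN-EBNF feature vectors by a (weighted) sum."""
    cnn_feats = np.asarray(cnn_feats, dtype=float)
    dnn_ebnf_feats = np.asarray(dnn_ebnf_feats, dtype=float)
    assert cnn_feats.shape == dnn_ebnf_feats.shape, "feature vectors must be the same size"
    return w_cnn * cnn_feats + w_dnn * dnn_ebnf_feats

# Example with equal weights (a plain sum corresponds to w_cnn = w_dnn = 1.0)
# fused = fuse_features(cnn_vec, dnn_vec, w_cnn=1.0, w_dnn=1.0)
```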

Classification approach
The following tasks are performed for classification based on the adapted posteriors obtained from the CNN-DNN based Enhanced Bottleneck Features. Features that are invariant to speaker age, used to improve the performance of the age-independent system, are fed as input to the BNF after removing the fully connected layers.
• The input layer of the CNN and DNN networks is designed with a 15-frame context window, comprising 7 frames on either side of the current frame.
• The CNN-based acoustic models use 200 convolutional filters of size 8 and a pooling size of 3 without overlap.
• The fully connected network has 5 hidden layers of 2048 nodes each, and the output layer has as many nodes as the number of CD states.
• The convolution layers then accept Type-2 features as input, and those features are extracted from the CNN layer.
• The CNN output from the feed-forward layer that accepts Type-2 features is connected to the DNN-EBNF for classification.
• The Type-2 features and the features from the CNN, along with the DNN Enhanced Bottleneck Features, are fused together.
• The fusion result is provided to the classifiers, namely SVM, GMM and GMM-SVM, to evaluate the accuracy of the feature extraction process; a minimal sketch of these classifier back ends follows this list.
• Finally, the classification accuracy is evaluated for further optimization.
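The following is a hedged scikit-learn sketch of the three classifier back ends: a GMM classifier, an SVM classifier, and one plausible GMM-SVM combination in which per-class GMM log-likelihoods are used as input to an SVM. The number of age classes, mixture components and kernel settings are illustrative assumptions, not the exact configuration used in this work.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_gmm_per_class(X, y, n_components=8):
    """One GMM per age class; prediction picks the class with the highest log-likelihood."""
    return {c: GaussianMixture(n_components=n_components, covariance_type="diag",
                               random_state=0).fit(X[y == c]) for c in np.unique(y)}

def gmm_scores(gmms, X):
    """Per-class log-likelihoods; these also serve as inputs for the GMM-SVM back end."""
    return np.column_stack([gmms[c].score_samples(X) for c in sorted(gmms)])

# X_train: fused feature vectors; y_train: age-group labels assumed to be 0..K-1
# gmms = train_gmm_per_class(X_train, y_train)
# gmm_pred = np.argmax(gmm_scores(gmms, X_test), axis=1)           # GMM classifier
# svm = SVC(kernel="rbf", probability=True).fit(X_train, y_train)  # SVM classifier
# gmm_svm = SVC(kernel="rbf", probability=True).fit(gmm_scores(gmms, X_train), y_train)  # GMM-SVM
```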

Score level fusion
Fusion at the score level gives slightly higher performance (Metze 2007).
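As an illustration of score-level fusion, the sketch below combines the class-posterior scores of two subsystems with a weighted sum before taking the final decision; the weights are hypothetical and would normally be tuned on a development set.

```python
import numpy as np

def fuse_scores(scores_a, scores_b, w_a=0.6, w_b=0.4):
    """Weighted score-level fusion of two subsystems' class-posterior matrices."""
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    fused = w_a * scores_a + w_b * scores_b
    return fused.argmax(axis=1)          # predicted age group per utterance

# Example: posteriors from two subsystems for 2 utterances and 3 age groups
# pred = fuse_scores([[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]],
#                    [[0.1, 0.7, 0.2], [0.5, 0.4, 0.1]])
```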

c. CMU Kids corpus (Linguistic Data Consortium)
It consists of sentences read by children, with 24 male and 52 female speakers, totalling 5,180 utterances recorded in a controlled atmosphere. This speech data is used as training data for the age identification system, with the speakers categorized into two separate groups, "good" readers and "errorful" readers.
Accuracy is the ratio of correctly classified speech signals to the total number of speech signals. Precision determines the proportion of speech signals assigned to an age group that were correctly identified as belonging to that group. Recall determines the proportion of speech signals of an actual age group that are identified correctly. In standard form: Accuracy = (TP + TN) / (TP + TN + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN).
The GMM-based, SVM-based and GMM-SVM subsystem results are tabulated above (Table 3) for comparison of Accuracy, Recall and Precision across the different feature extraction techniques. From the table, the GMM-SVM accuracy is found to be the highest at 0.83, while the accuracy of GMM and SVM stands at 0.73 and 0.79 respectively, which can also be inferred from Figure 8.
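For reference, a minimal scikit-learn snippet computing these metrics from predicted and true age-group labels might look like the following; the label values are purely illustrative.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 1, 2, 1, 0, 2]      # actual age groups (illustrative labels)
y_pred = [0, 1, 1, 1, 0, 2]      # age groups predicted by the fused system

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred, average="macro"))
print("Recall   :", recall_score(y_true, y_pred, average="macro"))
```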

Conclusion
In this research work, the extraction of age-based features is accomplished separately for the CNN- and DNN-based systems, and the features are fused using a weighted approach. The combination of the GMM-SVM classifier with fusion provides the best results in the proposed approach, at 82.5% accuracy with a recall of 63%. This performs significantly better than previous work, as outlined in Table 3. The proposed work is compared with the GMM, SVM and GMM+SVM classifier combinations, and the fused results of all these systems show that the proposed result improves significantly over the existing results. Convolutional Neural Networks (CNNs) and the enhanced bottleneck features with DNN have the potential for effective feature extraction; they are applied here for feature extraction and show reasonable results.
By leveraging a combination of the CNN and DNN, the computational complexity is decreased, the difficulty of loss convergence is managed, and speech signal noise is handled while the high-dimensional features are extracted. The problem of extracting features for speaker age recognition is also addressed with the help of different datasets.

Consent for Publication:
Not applicable

Availability of Data and Material:
TIMIT, Switchboard and CMU Kids corpora