Detection of Speaker De-identification Disguise Based on Dense Convolutional Network

Nowadays, speaker disguise is a common operation that poses a great challenge to social security. It is therefore important to verify the authenticity of speech. Most current research focuses on speech spoofing, which imitates a target speaker to break through state-of-the-art ASV systems by increasing the false acceptance rate. Meanwhile, there is another type of disguise, i.e. de-identification, which transforms a speech signal without a target in order not to be recognized, increasing the false rejection rate. It has received far less attention. Therefore, in this paper, we investigate the de-identification model and propose a method to distinguish de-identified speech from genuine speech using a very deep dense convolutional network with 135 layers. The experimental results show that the average accuracy of the proposed method outperforms the reported state-of-the-art methods.


Introduction
Speaker disguise can be divided into two categories [1]: 1) speech spoofing, including voice conversion (VC), speech synthesis (SS) and replay, which changes or captures a person's speech, or creates an artificial speech, in order to be recognized as a target person; 2) de-identification disguise, which changes a person's speech signal without a target in order not to be recognized. Early studies [2,3,4,5,6,7,8,9,10] have revealed that either kind threatens security by breaking through state-of-the-art automatic speaker verification (ASV) systems. Recent research focuses on speech spoofing detection. For VC and SS detection, Hanilci et al. proposed a method that extracts phase features of speech signals from linear prediction residuals [11]. Janicki proposed an algorithm that extracts audio quality features based on the linear prediction residual signal [12]. Kamble et al. proposed a detection algorithm using instantaneous frequency cosine coefficients [13]. Muckenhirn et al. employed first- and second-order spectral statistics [14]. Alam et al. proposed an algorithm using the feature representation of the infinite impulse response constant-Q transform [15]. In addition, other hand-crafted features such as long-term spectral statistics, mel-frequency cepstral coefficients (MFCC) and modified group delay were used in [16,17,18,19,20,21,22,23]. The Support Vector Machine (SVM), Hidden Markov Model (HMM) and Gaussian Mixture Model (GMM) are the most commonly used classifiers in the papers above. In some other efforts, DNN frameworks are used [24,25,26,27,28]. For replay detection, algorithms were proposed using the Electric Network Frequency (ENF), MFCC and fundamental frequency, the linear prediction residual signal, the time envelope, stratified scattering decomposition coefficients and the Inverse MFCC (IMFCC), respectively [29,30,31,32,33,34,35].
Research on de-identification detection is comparatively scarce. Papers [36,37,38] proposed detection algorithms using MFCC features and SVM; their cross-database recognition rates were lower than 90%, and the computational load was heavy. In addition, Liang et al. proposed an approach based on a convolutional neural network (CNN) [39]. The accuracies of the above methods are all below 95%, indicating that improvement is needed for practical applications.
It should be noted that speech spoofing is, to some extent, difficult and costly to implement, as VC and SS usually require a large amount of the target person's data, and the conditions for replay are uncertain. By contrast, de-identification disguise requires no additional information. It has been integrated into many popular audio/voice editing tools and has been used in many criminal cases. However, compared with spoofing detection, research on de-identification detection remains relatively sparse and insufficient. Therefore, in this paper, we examine the model and detection of de-identification disguise. Considering that a DNN can automatically extract deep features that are not artificially designed, we propose a de-identification detection method that employs a very deep dense convolutional network with 135 layers. Experimental results show that it outperforms the state-of-the-art methods.

Features for De-identification Detection
The objective of de-identification disguise is to alter the frequency content of a signal without affecting its time evolution [40]. A rigorous definition is therefore not easy to formulate, because the time and frequency characteristics of a signal, being related by the Fourier transform, are not independent. However, we can refer to a parametric model of audio signals to facilitate the task. The most suitable model in our context is the quasi-stationary sinusoidal model, in which the signal is represented as a sum of sinusoids whose instantaneous frequency ω_i(t) and amplitude A_i(t) vary slowly with time. This can be written as

x(t) = Σ_i A_i(t) cos(φ_i(t))  (1)

and

φ_i(t) = φ_i(0) + ∫_0^t ω_i(τ) dτ  (2)

where A_i(t), ω_i(t) and φ_i(t) are the instantaneous amplitude, instantaneous frequency and instantaneous phase of the i-th sinusoid, respectively. Defining an arbitrary disguise modification amounts to specifying a scaling factor α(t) > 0, which is implicitly assumed to be a regular and slowly varying function of time. The ideal disguise corresponding to the signal described by Eq.1 and Eq.2 would then be

x'(t) = Σ_i A_i(t) cos(φ'_i(t))  (3)

and

φ'_i(t) = φ_i(0) + ∫_0^t α(τ) ω_i(τ) dτ  (4)

Eq.3 indicates that the sinusoids in the modified signal at time t have the same amplitudes as in the original signal at the same time, but their instantaneous frequencies are multiplied by the factor α(t), as can be seen by differentiating φ'_i(t) with respect to t. As a result, the time evolution of the original signal is not modified, but its frequency content is scaled by the factor. In practice, α(t) is usually a constant; in the following, we use α to denote this constant factor.
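A minimal numpy sketch of the sinusoidal model and the ideal disguise (the component frequencies, amplitudes and 8 kHz sampling rate are illustrative assumptions): the disguised signal keeps the original amplitudes but scales every instantaneous frequency by α.

```python
import numpy as np

fs = 8000                          # sampling rate (assumed, for illustration)
t = np.arange(fs) / fs             # 1 s of samples
freqs = np.array([220.0, 440.0])   # component frequencies (illustrative)
amps = np.array([1.0, 0.5])        # constant instantaneous amplitudes

def sinusoidal(freqs, amps):
    # x(t) = sum_i A_i cos(2*pi*f_i*t) -- the quasi-stationary model
    return sum(a * np.cos(2 * np.pi * f * t) for f, a in zip(freqs, amps))

alpha = 2 ** (4 / 12)              # disguising factor for s = +4 semitones

x = sinusoidal(freqs, amps)        # original signal
y = sinusoidal(alpha * freqs, amps)  # ideal disguise: frequencies scaled, amplitudes kept

def peak_hz(sig):
    # dominant frequency via the magnitude spectrum
    spec = np.abs(np.fft.rfft(sig))
    return np.argmax(spec) * fs / len(sig)

print(peak_hz(x), peak_hz(y))      # dominant frequency is scaled by alpha
```

The time envelope of y matches that of x; only the frequency content moves.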
In implementation, the STFT can be used to estimate the instantaneous frequency of each sinusoid over a short time period, which is then multiplied by the scaling factor to modify the frequency content. The steps are given below; the details can be found in reference [40]. Suppose x_t(n) is a frame of length N taken from the input speech signal at time t. First, it is windowed by w(n) and an FFT is performed on the windowed signal:

F(k) = Σ_{n=0}^{N-1} x_t(n) w(n) e^{-j2πkn/N}  (5)

where w(n) is a Hamming or Hanning window and k is the frequency bin index. Then the instantaneous magnitude |F(k)| and instantaneous frequency ω(k) are calculated by

|F(k)| = sqrt(Re{F(k)}^2 + Im{F(k)}^2)  (6)

ω(k) = 2π (k + Δ) F_s / N  (7)

where F_s is the sampling frequency and Δ is the deviation of the k-th bin frequency; the computation of Δ can be found in [40]. For de-identification, the instantaneous frequency ω(k) is modified by

ω'(k) = α ω(k)  (8)

where α is the scale factor, i.e. the disguising factor.
The instantaneous phase φ'(k) is then calculated from the modified instantaneous frequency ω'(k), and the transformed FFT coefficients F'(k) = |F(k)| e^{jφ'(k)} are obtained by Eq.11.
An inverse FFT is then performed on F'(k), yielding the disguised signal.
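The analysis-modify-synthesis idea can be illustrated with a deliberately simplified sketch that remaps FFT bins by the factor α. It omits the instantaneous-frequency and phase bookkeeping (Eq.6-Eq.11) of the full algorithm in [40], so it is a toy illustration rather than the actual method:

```python
import numpy as np

def crude_disguise_frame(frame, alpha):
    """Crudely scale the frequency content of one frame by alpha.

    Toy version of analysis-modify-synthesis: remap FFT bins k -> round(alpha*k),
    ignoring the instantaneous-frequency/phase handling of the real phase vocoder.
    """
    N = len(frame)
    w = np.hanning(N)                  # analysis window (Eq.5 uses Hamming/Hanning)
    F = np.fft.rfft(frame * w)
    G = np.zeros_like(F)
    for k in range(len(F)):
        j = int(round(alpha * k))      # scaled bin index
        if j < len(G):
            G[j] += F[k]
    return np.fft.irfft(G, n=N)        # back to the time domain

fs, N = 8000, 1024
n = np.arange(N)
x = np.cos(2 * np.pi * 400 * n / fs)   # 400 Hz test tone
alpha = 2 ** (6 / 12)                  # s = +6 semitones
y = crude_disguise_frame(x, alpha)

peak = np.argmax(np.abs(np.fft.rfft(y))) * fs / N
print(peak)                            # close to alpha * 400 Hz
```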
According to Eq.8 and Eq.9, the spectral magnitude of the speech is modified by the de-identification disguise, so implicit artifacts are introduced into the disguised signal. Therefore, in our proposed algorithm, the speech spectrum is used as the input to a deep neural network, which extracts deep features for classification. Using the STFT, we obtain the spectrogram of the input speech signal, with a window size of 175 and 50% overlap.
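This spectrogram computation can be sketched in plain numpy; the 8 kHz sampling rate and 1 s clip length are assumptions chosen so that the output matches the 90×88 network input size described later:

```python
import numpy as np

def spectrogram(x, win_len=175, overlap=0.5):
    """Magnitude spectrogram via a plain numpy STFT (Hamming window)."""
    hop = int(win_len * (1 - overlap))          # 87 samples at 50% overlap
    w = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[i * hop : i * hop + win_len] * w
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (frames, bins)

fs = 8000                                        # assumed sampling rate
x = np.random.default_rng(0).standard_normal(fs)  # one 1 s clip (stand-in)
S = spectrogram(x)
print(S.shape)                                   # (90, 88): 90 frames x 88 bins
```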
With respect to phonetics, de-identification disguise is measured on a 12-semitone division [41], leading to the disguising factor α of the form in Eq.12.
α(s) = 2^(s/12)  (12)

where s can take any integer value in the range [-12, +12]. However, a modification that is too weak or too strong leads to deception failure or auditory unnaturalness, respectively. In the following experiments, we therefore consider the intermediate intervals [-8, -4] and [+4, +8], which have the strongest deception ability.
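A quick check of the disguising factors that Eq.12 yields for the semitone offsets used in the experiments:

```python
# Disguising factors alpha(s) = 2**(s/12) for s in [-8, -4] and [+4, +8].
semitones = list(range(-8, -3)) + list(range(4, 9))
alphas = {s: 2 ** (s / 12) for s in semitones}
for s, a in sorted(alphas.items()):
    print(f"s = {s:+d}  alpha = {a:.4f}")
```

Negative s compresses the spectrum (α < 1) and positive s expands it (α > 1); opposite offsets are exact inverses.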

The Dense Convolutional Network
In a traditional CNN, the output of the previous layer, X_{l-1}, is passed to the next layer as input and transformed by a non-linear operation H_l to produce the output X_l = H_l(X_{l-1}). The non-linear operation consists of convolution, ReLU and pooling.
It is difficult to train a traditional CNN because degradation occurs as the number of layers increases. To effectively suppress degradation, Residual Networks (ResNets) [42], FractalNets [43] and Highway Networks [44] create short paths from early layers to later layers, as shown in Eq.14:

X_l = H_l(X_{l-1}) + X_{l-n}  (14)
A DenseNet instead connects each layer to all preceding layers: the l-th layer takes the concatenation of all earlier feature maps as input, X_l = H_l([X_0, X_1, ..., X_{l-1}]). Compared with the aforementioned networks, this dense connection mode has obvious advantages: 1) it guarantees maximum information flow between layers and enhances feature propagation; 2) dense connection has a regularization effect, which can reduce overfitting on tasks with small training sets; 3) it allows the DenseNet layers to be narrow, e.g. a growth rate of k = 12, thus significantly reducing the number of parameters, alleviating degradation problems, and supporting feature reuse; 4) there is no need to relearn redundant feature maps, which eases training.
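The dense connection mode and the growth-rate arithmetic can be sketched in numpy; here H_l is stubbed as a random linear map standing in for the real BN-ReLU-Conv operation, and the initial channel count is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 12            # growth rate: channels each layer adds
c0 = 24           # channels entering the block (illustrative)

def H(x, out_ch=k):
    """Stub for BN-ReLU-Conv: map any channel count to out_ch channels."""
    w = rng.standard_normal((out_ch, x.shape[0]))
    return np.maximum(w @ x.reshape(x.shape[0], -1), 0).reshape(out_ch, *x.shape[1:])

x = rng.standard_normal((c0, 8, 8))          # initial feature maps (C, H, W)
features = [x]
for l in range(6):                           # a 6-layer dense block
    inp = np.concatenate(features, axis=0)   # [X_0, X_1, ..., X_{l-1}]
    features.append(H(inp))                  # X_l = H_l(concatenation)

out = np.concatenate(features, axis=0)
print(out.shape[0])                          # c0 + 6*k = 96 output channels
```

Each layer only produces k new feature maps, yet sees every earlier map, which is why the layers can stay narrow.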

Structure of the Proposed DenseNet
The proposed DenseNet structure is shown in Fig.1. The inputs are single-channel spectrograms obtained by the STFT, with size 90×88. The network consists of an initial layer, two transition layers, three dense blocks, a global pooling layer and a linear layer. The three dense blocks are composed of 6, 12 and 48 bottleneck layers, respectively. The linear layer is a fully connected layer followed by softmax, with two outputs representing the probabilities of "genuine" and "disguised", respectively. The internal structure of each kind of layer is shown in Fig.2. Each bottleneck layer contains 2 convolutional layers, so the DenseNet as a whole contains (6 + 12 + 48) × 2 + 1 + 1 + 1 = 135 convolutional layers. To reduce computation, a bottleneck layer contains a 1×1 convolutional layer followed by a 3×3 convolutional layer, rather than two 3×3 convolutional layers, as shown in Fig.2. A transition layer connects two adjacent dense blocks and further reduces the size of the feature maps.
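The layer count above can be verified with a line of arithmetic:

```python
# Convolutional-layer count of the proposed DenseNet:
# 3 dense blocks of 6, 12 and 48 bottleneck layers (2 convs each: 1x1 + 3x3),
# plus the initial conv layer and one conv in each of the 2 transition layers.
blocks = (6, 12, 48)
convs_per_bottleneck = 2
total = sum(blocks) * convs_per_bottleneck + 1 + 2
print(total)   # 135
```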
Experimental Setup
The three corpora are split as follows. Training set: UME-1 (2040 clips), NIST-1 (2000 clips), TIMIT-1 (3000 clips); testing set: UME-2 (2000 clips), NIST-2 (1560 clips), TIMIT-2 (3300 clips). Each clip is further cut into several 1 s clips, and the number of 1 s clips in each set is shown in Table 1. Disguising factors with s in [-8, -4] and [+4, +8] are taken into consideration. As a result, there are 40 times as many disguised (negative) clips as genuine (positive) clips. To balance the positive and negative data, we expand the number of positive clips by shifting every 200 samples until it equals the number of negative clips.
We use the ADAM optimizer [52] to train the proposed DenseNet with an L2 loss function. β_1 and β_2, the exponential decay rates for the first- and second-order moment estimates, are 0.9 and 0.999, respectively. We set ε to 10^-8, the learning rate to 10^-4, and the dropout rate to 0.3. Training runs for 100,000 batches with a batch size of 64.
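The ADAM update with these hyper-parameters can be sketched in pure numpy on a toy quadratic loss (the parameter and loss are illustrative stand-ins, not the network's):

```python
import numpy as np

# Hyper-parameters from the text: beta1=0.9, beta2=0.999, eps=1e-8, lr=1e-4.
beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 1e-4

theta = np.array([2.0])      # toy parameter, minimum of the loss is at 0
m = np.zeros_like(theta)     # first-moment estimate
v = np.zeros_like(theta)     # second-moment estimate

def grad(th):
    return 2 * th            # gradient of loss(th) = th**2

for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)     # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)     # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)

print(float(theta[0]))       # moved toward the minimum at 0
```

With this learning rate each step moves the parameter by roughly 10^-4, which is why ADAM training needs many iterations.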
The detection accuracy defined in Eq.16 is used to measure performance:

Accuracy = (G_d + D_d) / (G + D)  (16)

where G and D are the numbers of genuine and disguised clips in the testing sets, respectively, and G_d and D_d are the numbers of genuine clips correctly detected from G and disguised clips correctly detected from D, respectively.
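The metric can be expressed directly (the counts in the example are illustrative):

```python
# Detection accuracy as in Eq.16: fraction of genuine and disguised clips
# that are correctly classified.
def accuracy(G_d, D_d, G, D):
    return (G_d + D_d) / (G + D)

print(accuracy(G_d=950, D_d=980, G=1000, D=1000))  # 0.965
```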
Our proposed method is superior to the other two methods. The reason is that a DenseNet has more layers than a traditional CNN, so it can extract deeper features that facilitate classification. In addition, a traditional CNN makes its classification decision using deep features only, whereas in a DenseNet, owing to the dense connection mode, both the deep features and the shallow edge features are utilized in decision-making, so the accuracy is further improved.

Cross-database evaluation
In real scenarios, however, testing data and training data may come from different sources with different intrinsic characteristics. Therefore, cross-database evaluation is conducted to test the generalization ability of the proposed method. Here, one of the three corpora is selected as the testing dataset and the other two as the training datasets. The experimental results are shown in Table 3. The results of the first two cases are quite good, but the third is not ideal. One possible reason is that NIST contains more data than the other two datasets, TIMIT and UME (see Table 1), so a model trained with NIST included generalizes better. In [39], the accuracy of case 1 is 94.37%, while ours is 96.45%, indicating that our method is superior to the method in [39]. The results of case 2 and case 3 are not given in [39].

Robustness to noise
In practical applications, noise may be introduced during recording and transmission, which may affect detection accuracy. Here, we add Gaussian white noise to the datasets to evaluate robustness to noise. Specifically, we add Gaussian noise at 10 dB, 20 dB and 30 dB SNR, respectively, to each clean training set, and train the network on the noisy sets together with the clean sets. In testing, we add Gaussian noise at 10 dB, 15 dB, 20 dB and 30 dB SNR, respectively, to each clean testing set, and report the accuracy on each noisy testing set and each clean testing set, as shown in Table 4. From the 4th column, we can see that even when 10 dB SNR noise is added, the accuracy remains around 90%, indicating that our method withstands noise attacks well. It should be noted that 15 dB noise is not added to the training sets but only to the testing sets, in order to test the model's generalization (5th column). The accuracies there are reasonably good compared with the other conditions in columns 4, 6, 7 and 8.

Robustness to compression
Since compression is commonplace in the audio/speech community, we test robustness to MP3 compression in this section. We apply MP3 compression (16:1) to the training and testing sets, respectively. The detection accuracy is shown in Table 5. Under compression, the average accuracy is 92%, only slightly lower than in the non-compression cases, indicating that the proposed method is robust to compression.

Conclusion
This paper presents a disguised-speech detection method based on a dense convolutional network. Deep features are extracted automatically by a DenseNet with 135 layers, which achieves computational efficiency through careful kernel reduction and the use of bottleneck layers. The experimental results show that the method is superior to the state-of-the-art methods. Future work will focus on extracting deeper features to further improve accuracy.