Unsupervised learning method for events identification in φ-OTDR

In this paper, an unsupervised learning method for event identification in φ-OTDR fiber-optic distributed vibration sensors is proposed. Different vibration events, including blowing, raining, direct and indirect hitting, and noise-induced false vibration, are clustered by the k-means algorithm. An equivalent classification accuracy of 99.4% is obtained by comparing the clusters with the actual classes of the vibration events in the experiment. With a cluster number of 3, the maximal Calinski-Harabasz index and Silhouette coefficient are 2653 and 0.7206, respectively. The clustering method is found to be effective for event identification in φ-OTDR without any prior labels, which demonstrates an application of unsupervised learning to the self-classification of vibration events for φ-OTDR.


Introduction
The fiber-optic distributed vibration sensor based on phase-sensitive optical time-domain reflectometry (φ-OTDR) can detect and locate multi-point vibrations along a long sensing fiber. Existing research has focused on improving multiple performance indicators of multi-point vibration detection, including dynamic and spatial resolution, detection distance and accuracy (Liu et al. 2021). As a result, φ-OTDR has been widely used in vibration and acoustic wave detection (Wang et al. 2020a), intrusion and pipeline monitoring (Marcon et al. 2019), perimeter security (Zhu et al. 2013; Cao et al. 2015; Ma et al. 2018), structural health monitoring (Lu et al. 2010; Yue et al. 2015), railway monitoring (Peng et al. 2014a; Kowarik et al. 2020; Wang et al. 2017) and pipeline leak detection (Zuo et al. 2020). In practical applications, φ-OTDR monitors many different vibration events; it is therefore necessary to classify the detected vibration events in order to reduce the false alarm rate and to obtain richer vibration information in different applications.
In addition to traditional signal-processing methods, deep learning has been introduced to vibration-event identification, and a considerable amount of work has been reported. By using multi-branch long short-term memory modeling combined with a convolutional neural network (MLSTM-CNN), φ-OTDR can identify four kinds of real disturbance events (watering, climbing, hitting, and pressing) and a false disturbance event. The identification rates of the five events are 94.38%, 94.40%, 94.76%, 95.25% and 99.71%, respectively. The average identification rate reaches 95.7%, and the nuisance alarm rate (NAR) is 4.3% (Wang et al. 2020b). Wang et al. (2019) used random forests (RF) to identify watering, tapping, squeezing, and non-disturbing events. The experimental identification rates reached 93.79%, 97.36%, 97.06% and 98.12%, respectively, with an average identification rate of 96.58%. A CNN has also been used to identify real disturbance events, including walking, digging, large vehicle driving, a two-point leakage of a pipeline, and a three-point leakage (Sun et al. 2020). In addition, the Generative Adversarial Net (GAN) has been used to identify noise, walking and vehicle events (Shiloh et al. 2019). The current event-identification methods for φ-OTDR are summarized in Table 1.
However, these works are mainly based on supervised learning, which requires collecting a large number of manually labelled samples and thus limits the research efficiency and the improvement of an intelligent φ-OTDR. For special applications in complex environments, field observation of vibration events may be difficult, and it may be impossible to obtain enough accurate labels. This is the bottleneck of supervised event identification for φ-OTDR.
Unsupervised learning opens a new way around this problem. Shiloh et al. (2019) proposed an unsupervised learning method based on the Generative Adversarial Net (GAN) methodology. The GAN-trained network is designed to generate training sets for DAS ANNs, and in experimental testing it significantly improved performance compared with classifiers trained on purely simulated or purely experimental data (Shiloh et al. 2019). The results of supervised learning by CNN have also been compared with unsupervised learning by K-means and an auto-encoder. The auto-encoder and CNN have an identification rate of 95% for pipeline defects, while K-means and CNN have an identification rate of 84.62% for walking or running. Some researchers have also used neural-network-based machine learning algorithms, supervised and unsupervised respectively, to analyze the external intrusion and internal corrosion data of pipes collected by a φ-OTDR system.
Whether it is combined with supervised learning or applied on its own, unsupervised learning plays an increasingly important role in vibration identification. Unsupervised learning finds the relationships within the dataset itself, that is, underlying features of the data that can be associated with specific classes of events. Although supervised learning can identify events well, it is not a very practical or unified method for analyzing various events: the samples must be labelled, which becomes very laborious when the amount of data is large or when events must be recognized on the spot. Applying unsupervised learning saves much of the time spent on manual labelling.
In contrast with those studies, our research applies an unsupervised clustering algorithm to the vibration events in the φ-OTDR detection signal, including blowing, raining, direct and indirect hitting. The scenes are different, and we recognize more types of vibration events. We only use the waveform information in the time domain, without transforming the data to the frequency domain. We also use Principal Component Analysis (PCA) in the data processing to project the data into a low-dimensional subspace, so that the data can be clustered better. In addition to comparing the clustering results with the original labels, we use the Calinski-Harabasz (CH) index and the Silhouette coefficient to evaluate the clustering performance. After this data processing optimization, the model can realize the self-recognition of vibration events. In terms of application scope, our method is more convenient and more widely applicable, and it lays a foundation for later work on pattern identification via unsupervised learning.
Through unsupervised learning, vibration events can be clustered without any prior labels, which can be considered a form of self-identification. Once the clustering task has produced the different clusters, only a very limited field observation or calibration of those clusters is needed to realize pattern identification in practical applications.

Principle
φ-OTDR is based on the measurement of Rayleigh scattering, which occurs when light travelling in a fiber-optic cable is back-scattered by imperfections along the fiber. Mechanical vibrations caused by physical activities or events around the cable produce fluctuations in the back-scattered light, and interrogating these fluctuations makes it possible to detect and classify such activities. It is well known that the back-scattered signal sensed in a φ-OTDR generally has a quite low signal-to-noise ratio (SNR), which strongly affects the vibration detection capability of the system. Thus, several methods, such as multi-dimension comprehensive analysis (Lu et al. 2010), adaptive temporal/match filtering (Jiang et al. 2019), wavelet denoising (Liehr et al. 2019; Peng et al. 2014a), polarization diversity detection (Kowarik et al. 2020), two-dimensional edge detection (Wang et al. 2017), difference in the time-frequency domain, Fourier transforming (Zuo et al. 2020), heterodyning (Peng et al. 2014b), Raman amplification, signal-noise separation, multi-scale permutation entropy and the zero-crossing rate (Jia et al. 2019), have been used to improve the threat/event detection performance. The fiber-optic distributed vibration sensor used in this paper is a φ-OTDR, and the structure of the system is shown in Fig. 1.

Experimental setup
A narrow-linewidth laser with a 5 kHz linewidth and a central wavelength of 1550.92 nm is used as the light source. An acousto-optic modulator (AOM) chops the continuous-wave (CW) light into probe pulses with a pulse width of 100 ns, which gives a spatial resolution of 10 m, and a pulse interval of 0.02 ms. The sensing fiber is a single-mode fiber about 2 km long. The acquisition rate of the DAQ electronics is 90 MS/s. Vibration events can be located and classified by splicing a conventional communication optical cable to the host.
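As a quick consistency check of these parameters, the standard OTDR relations between pulse width and spatial resolution, and between pulse interval and maximum unambiguous fiber length, can be worked out. The group index n ≈ 1.468 is an assumed typical value for standard single-mode fiber at 1550 nm and is not stated in the text.

```latex
\Delta z = \frac{c\,\tau}{2n}
         \approx \frac{(3\times 10^{8}\,\mathrm{m/s})(100\times 10^{-9}\,\mathrm{s})}{2\times 1.468}
         \approx 10.2\,\mathrm{m},
\qquad
L_{\max} = \frac{c\,T}{2n}
         \approx \frac{(3\times 10^{8}\,\mathrm{m/s})(0.02\times 10^{-3}\,\mathrm{s})}{2\times 1.468}
         \approx 2.04\,\mathrm{km}
```

Under this assumption, the 100 ns pulse corresponds to the quoted 10 m spatial resolution, and the 0.02 ms pulse interval (a 50 kHz repetition rate) allows one pulse at a time in a fiber of about 2 km.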

Training and test
The process of the principal component analysis (PCA) and clustering is shown in Fig. 2.

Principal component analysis (PCA)
The sample data are segmented according to the starting point to find the approximate location of the vibration, and the vibration data at that location are selected and saved. As shown in Fig. 3, the time-domain signals of the different events differ from one another. The record lengths also differ: the blowing and raining records contain 90 M sampling points in total, while the direct-hitting and indirect-hitting records contain 180 M.
Fig. 1 The schematic diagram of φ-OTDR. AOM: acousto-optic modulator
Feature extraction is performed on the vibration data of the intercepted vibration events. The time-domain features of a signal can be divided into dimensional and dimensionless ones, as shown in Table 2 (Jia et al. 2020). In total, 14 features are extracted as the input data of the clustering algorithm. The rectified average is the mean of the absolute value of the signal. The kurtosis factor and the skewness describe the shape of the distribution of the signal values. The wave factor is the ratio of the effective value (RMS) to the rectified average. The crest factor is the ratio of the peak value of the signal to the effective value, and it represents how extreme the peak in the waveform is. The impulse factor is the ratio of the signal peak to the rectified average (the mean of the absolute value). The margin factor is the ratio of the signal peak to the root amplitude.
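The feature definitions above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `time_domain_features` is hypothetical, only nine of the 14 features are covered, and the exact definitions in Table 2 may differ in detail (for example in how the peak or the kurtosis is normalized).

```python
import math

def time_domain_features(x):
    """Return common time-domain features of a 1-D signal (assumed definitions)."""
    n = len(x)
    mean = sum(x) / n
    abs_x = [abs(v) for v in x]
    rectified_avg = sum(abs_x) / n                 # mean of |x|
    rms = math.sqrt(sum(v * v for v in x) / n)     # effective value
    peak = max(abs_x)                              # peak taken as max |x|
    root_amp = (sum(math.sqrt(v) for v in abs_x) / n) ** 2  # root amplitude
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    skewness = sum((v - mean) ** 3 for v in x) / (n * std ** 3)
    kurtosis = sum((v - mean) ** 4 for v in x) / (n * std ** 4)
    return {
        "rectified_average": rectified_avg,
        "rms": rms,
        "peak": peak,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "wave_factor": rms / rectified_avg,
        "crest_factor": peak / rms,
        "impulse_factor": peak / rectified_avg,
        "margin_factor": peak / root_amp,
    }

# Toy input: ten periods of a unit sine wave (not experimental data).
signal = [math.sin(2 * math.pi * k / 100) for k in range(1000)]
features = time_domain_features(signal)
```

For a pure sine wave the crest factor is √2 and the wave factor is close to 1.11, which gives a simple sanity check before applying the function to real vibration segments.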
Fig. 2 The process of PCA and clustering
PCA is a technique for analyzing and simplifying data sets. It is often used to reduce the dimensionality of a data set while retaining the features that contribute the most to its variance. This is done by keeping the low-order principal components and ignoring the high-order ones; the low-order components often retain the most important aspects of the data. The purpose of the K-means clustering algorithm is to divide n points into K clusters so that each point belongs to the cluster with the nearest mean (the cluster center), which serves as the clustering criterion. When all the extracted features are taken as the input, the clustering effect is relatively poor, so PCA is performed on the features to reduce the dimensionality of the data. The corresponding eigenvalues, arranged from large to small, are 3.5795 > 0.5354 > 0.2512 > 0.1979 > 0.0034 (the remaining 9 eigenvalues are too small and have been omitted). The projections onto the leading principal components are the actual input of the clustering algorithm.
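To see how quickly the variance concentrates in the leading components, the cumulative share of the eigenvalues listed above can be computed. This is a rough sketch: the nine omitted eigenvalues are not available, so the shares are taken relative to the five listed values only and slightly overestimate the true explained-variance ratios.

```python
# Eigenvalues quoted in the text, in descending order (remaining 9 omitted there).
eigenvalues = [3.5795, 0.5354, 0.2512, 0.1979, 0.0034]

total = sum(eigenvalues)
cumulative = []
running = 0.0
for value in eigenvalues:
    running += value
    cumulative.append(running / total)  # share carried by the first n components

for n, share in enumerate(cumulative, start=1):
    print(f"first {n} component(s): {share:.4f}")
```

Relative to the listed values, the first four components carry more than 99.9% of the variance, which may help explain why adding further PCA features does not improve the clustering accuracy.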

Clustering
After the data preprocessing stage, we obtain 1160 samples of blowing vibration events, 550 samples of raining vibration events, 300 samples of indirect hitting vibration events (produced by knocking on the iron net) and 270 samples of direct hitting vibration events, for a total of 2280 samples. The locations of the K initial centroids have a great influence on the final clustering result and on the running time, so appropriate initial centroids must be selected. With a completely random selection, the algorithm may converge very slowly. The K-means++ algorithm (Arthur and Vassilvitskii 2006) improves on the random centroid initialization of K-means.
The optimization strategy for initializing the centroids of K-means is detailed below.
(a) Select a point randomly from the set of input data points as the first cluster center $\mu_1$.
(b) For each point $x_i$ in the data set, calculate its distance to the nearest of the cluster centers selected so far, $D(x_i) = \min_{r} \lVert x_i - \mu_r \rVert_2^2$, $r = 1, 2, \cdots, k_{\mathrm{selected}}$.
(c) Select a new data point as the new cluster center. The principle of selection is as follows: a point with a larger D(x) has a higher probability of being selected as the cluster center.
(d) Repeat (b) and (c) until K cluster centroids are selected.
(e) Use the K centroids as the initial centroids to run the standard K-means algorithm.
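Steps (a) to (d) can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiment; the function names and the toy points are hypothetical, and the selection probability is taken to be proportional to D(x) as defined in step (b), that is, to the squared distance to the nearest selected center.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_pp_init(points, k, rng):
    """Pick k initial centers; needs at least k distinct points."""
    centers = [rng.choice(points)]                       # step (a)
    while len(centers) < k:                              # step (d)
        # step (b): D(x) to the nearest already-selected center
        d = [min(sq_dist(p, c) for c in centers) for p in points]
        # step (c): sample the next center with probability proportional to D(x)
        centers.append(rng.choices(points, weights=d, k=1)[0])
    return centers

# Toy data: three loose groups of 2-D points (not experimental data).
rng = random.Random(0)
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.2, 4.9), (10.0, 0.0), (9.8, 0.3)]
centers = kmeans_pp_init(points, 3, rng)
```

Because an already-selected point has D(x) = 0, it cannot be drawn again, so the returned centers are distinct data points. Step (e) then runs the standard K-means iterations from these centers.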
Four types of events are to be identified, so the number of clusters K is set to 4. We varied the number of features retained after PCA from 2 to 6 and performed K-means clustering for each case. The results are listed in Table 3. When the number of clusters is 4 and the number of PCA features is 5, the accuracy is the highest, at 87.1%. Increasing the number of features further does not improve the accuracy.
In the clustering process, we find that the waveforms of indirect hitting events are similar to those of direct hitting events. Therefore, we merge indirect hitting events and direct hitting events into a single class of hitting events. The resulting three classes of vibration events are clustered, as shown in Table 4. When the number of clusters is 3 and the number of PCA features is 4, the accuracy is the highest, reaching 89.4%. Increasing the number of features further does not improve the accuracy.
In the process of clustering, we find that the incorrectly clustered data points are concentrated in particular regions; that is, some of the samples selected during data preparation do not actually contain vibration data. We therefore manually eliminate these non-vibration samples and obtain a total of 2044 optimized samples. PCA dimension reduction is performed on the optimized data, followed by K-means clustering, and the results are shown in Table 5. When the number of clusters is 3 and the number of features is 4, the accuracy reaches 99.4%, and increasing the number of features does not improve it further. Fig. 4 also shows that the clustering effect is good; there, Idx is the vector of predicted cluster indices, plotted over two dimensions (two columns) of our samples. The confusion matrix used to evaluate the performance of the algorithm is shown in Fig. 5.
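The text does not state how cluster indices are matched to the actual event classes when the equivalent classification accuracy is computed. One common choice, assumed here purely for illustration, is to assign each cluster the majority class among its members and then count the agreements. The function names and the toy labels below are hypothetical.

```python
from collections import Counter

def cluster_accuracy(true_labels, cluster_ids):
    """Map each cluster to its majority class; return (accuracy, mapping)."""
    votes = {}
    for t, c in zip(true_labels, cluster_ids):
        votes.setdefault(c, Counter())[t] += 1
    mapping = {c: cnt.most_common(1)[0][0] for c, cnt in votes.items()}
    correct = sum(mapping[c] == t for t, c in zip(true_labels, cluster_ids))
    return correct / len(true_labels), mapping

def confusion_matrix(true_labels, predicted, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted):
        matrix[index[t]][index[p]] += 1
    return matrix

# Toy labels (not experimental data): one blowing sample lands in cluster 1.
true = ["blowing"] * 4 + ["raining"] * 3 + ["hitting"] * 3
idx = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
accuracy, mapping = cluster_accuracy(true, idx)
predicted = [mapping[c] for c in idx]
cm = confusion_matrix(true, predicted, ["blowing", "raining", "hitting"])
```

A majority-vote mapping can assign two clusters to the same class; when a one-to-one matching is required, an assignment method such as the Hungarian algorithm is a common alternative.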

Performance measurement
In order to evaluate the performance of the clustering algorithm, the Silhouette coefficient $S_i$ and the CH index are used as evaluation indexes. The Silhouette coefficient is calculated as follows:
$$ S_i = \frac{b - a}{\max(a, b)} \tag{1} $$
In this formula, a represents the average distance between the sample point and the other points in the same cluster, that is, the dissimilarity between the sample point and its own cluster; b represents the average distance between the sample point and the points in the next closest cluster, namely the dissimilarity between the sample point and that cluster. K-means aims to make the differences within each cluster small and the differences between clusters large, and the silhouette coefficient S is the key index describing this contrast. According to the formula, the value range of S is [−1, 1]: the closer the value is to 1, the better the clustering performance, and the closer it is to −1, the worse the clustering performance. A silhouette coefficient equal to 0 means that the clusters overlap. As shown in Table 6, when the number of clusters is 3 and the number of features is 4, the silhouette coefficient is the highest, at 0.7206. The Silhouette coefficients of the different clusters show that the samples can be split into 3 clusters. Most of the points in the three clusters have a large Silhouette coefficient (0.8 or greater), indicating that the clusters are well separated (Fig. 6).
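The definitions of a and b above translate directly into code. The following is a small pure-Python sketch with a hypothetical function name and toy points; it takes Euclidean distance as the metric, which is an assumption, and it is quadratic in the number of samples, so it is meant only to illustrate the calculation.

```python
import math

def silhouette_values(points, labels):
    """Per-sample silhouette coefficients; requires at least two clusters."""
    values = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        dists = {}  # cluster label -> distances from p to that cluster's points
        for j, (q, other) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(other, []).append(math.dist(p, q))
        if lab not in dists:          # singleton cluster: coefficient set to 0
            values.append(0.0)
            continue
        a = sum(dists[lab]) / len(dists[lab])            # own cluster
        b = min(sum(d) / len(d)                          # next closest cluster
                for name, d in dists.items() if name != lab)
        denom = max(a, b)
        values.append((b - a) / denom if denom > 0 else 0.0)
    return values

# Toy data: two tight, well-separated pairs (not experimental data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
s = silhouette_values(points, labels)
mean_s = sum(s) / len(s)
```

The mean of the per-sample values corresponds to the average silhouette coefficient reported for a clustering, while the individual values are what a silhouette plot displays.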
The CH index is calculated as shown in (2):
$$ s(k) = \frac{\operatorname{tr}(B_k)}{\operatorname{tr}(W_k)} \cdot \frac{m - k}{k - 1} \tag{2} $$

Fig. 5 The confusion matrix for classifications of three vibration events
In this formula, m is the number of samples in the training set, and k is the number of clusters. $B_k$ is the covariance matrix between clusters, and $W_k$ is the covariance matrix of the data within clusters. The smaller the covariance of the data within each cluster and the larger the covariance between clusters, the higher the CH index and the clearer the boundary between clusters. In other words, the higher the CH index, the better the clustering effect. As shown in Fig. 7, the maximum CH index is 2653 when the number of clusters is 3.
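The CH index can be evaluated without forming the full matrices, since tr(B_k) is the size-weighted sum of squared distances from the cluster centroids to the overall centroid and tr(W_k) is the sum of squared distances from each sample to its own cluster centroid. The sketch below uses a hypothetical function name and toy points.

```python
def ch_index(points, labels):
    """Calinski-Harabasz index: tr(B_k)/tr(W_k) * (m - k)/(k - 1)."""
    m = len(points)
    dim = len(points[0])

    def centroid(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in range(dim)]

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)                      # requires 2 <= k < m

    overall = centroid(points)
    tr_b = 0.0                             # between-cluster dispersion
    tr_w = 0.0                             # within-cluster dispersion
    for pts in clusters.values():
        c = centroid(pts)
        tr_b += len(pts) * sq_dist(c, overall)
        tr_w += sum(sq_dist(p, c) for p in pts)
    return (tr_b / tr_w) * (m - k) / (k - 1)

# Toy data: two tight, well-separated pairs (not experimental data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
score = ch_index(points, labels)
```

For these four points tr(B_k) = 100 and tr(W_k) = 1, so with m = 4 and k = 2 the index is 200. Evaluating the function for several values of k and taking the maximum is one way to select the number of clusters.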

Result and discussion
A total of 14 input features are obtained by feature extraction from the samples, and their dimensionality is reduced by PCA. The first 6 features after dimension reduction are selected for study. When the number of clusters is 4 and the number of features is 5, the equivalent classification accuracy is 87.1%; when the number of clusters is 3 and the number of features is 4, the accuracy is 89.4%. After the non-vibration data are manually removed from the samples, the accuracy for 3 clusters and 4 features reaches its highest value, 99.4%. The average silhouette coefficient and the CH index reach 0.7206 and 2653, respectively, which represent the best performance of the clustering task. The clustering method is therefore found to be effective for practical applications.
Fig. 6 The Silhouette plot where K = 3 and the feature is 4
The system is made up of Z-score normalization, feature extraction, PCA dimensionality reduction, and calculation of the distance between the sample data and the cluster centers. Through this functional block, the real-time data collected by the distributed optical fiber sensor can be used to determine which vibration event a sample belongs to. In subsequent research, we will continue to improve the clustering method to achieve real-time monitoring and alarm systems for vibration events.
The unsupervised learning algorithm proposed in this paper for the automatic classification of vibration events can significantly improve the efficiency of identifying vibration events, but it also has limitations. For events with similar vibration waveforms, such as direct hitting and indirect hitting, the identification effect is relatively poor.
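The functional block described above (normalization, PCA projection and nearest-center assignment) might look like the following once the clustering has been fitted. Everything here is a hypothetical sketch: the function names, the toy parameter values and the event names attached to the centers are assumptions for illustration, and in practice the names would come from a limited calibration of the clusters.

```python
import math

def zscore(features, mean, std):
    """Z-score normalization with statistics from the fitted data set."""
    return [(x - m) / s for x, m, s in zip(features, mean, std)]

def project(z, components):
    """Project a normalized feature vector onto the kept principal components."""
    return [sum(w * v for w, v in zip(comp, z)) for comp in components]

def classify_sample(features, mean, std, components, centers):
    """Return the name of the cluster center nearest to the projected sample."""
    reduced = project(zscore(features, mean, std), components)
    return min(centers, key=lambda name: math.dist(reduced, centers[name]))

# Hypothetical toy parameters: 3 features reduced to 2 components.
mean = [1.0, 2.0, 3.0]
std = [1.0, 2.0, 0.5]
components = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
centers = {"blowing": [0.0, 0.0], "raining": [3.0, 3.0]}

label = classify_sample([4.0, 8.0, 3.0], mean, std, components, centers)
```

In a real-time setting, the mean, standard deviation, principal components and cluster centers would be stored after the offline clustering, so that each new segment requires only feature extraction and a few dot products.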

Conclusion
In this paper, an unsupervised learning method is proposed to identify vibration events for φ-OTDR. When the number of clusters is 4, the accuracy reaches 87.1%; when the number of clusters is 3, the accuracy reaches 89.4%; and when some non-vibration data are manually eliminated for optimization, the accuracy reaches 99.4%. The CH index and the average Silhouette coefficient are 2653 and 0.7206, respectively, which represent the best performance of the clustering model. Our unsupervised learning algorithm for the automatic identification of vibration events is effective and highly accurate, and we will continue to expand its applicability in future research.