3.1 Experimental setup
A narrow-linewidth laser with a linewidth of 5 kHz and a central wavelength of 1550.92 nm is used as the light source. An acousto-optic modulator (AOM) chops the continuous light into probe pulses with a pulse width of 100 ns and a pulse interval of 0.02 ms. The sensing fiber is a single-mode fiber about 2 km long, and the spatial resolution is 10 m. The vibration sampling rate is 90 MS/s. Vibration events can be located and classified by splicing a conventional communication optical cable to the host.
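As a rough consistency check on the parameters above, the spatial resolution and the longest fiber that can be probed without overlapping pulses both follow from the round-trip relation L = c·t/(2n). The sketch below assumes a group index of 1.468, a typical value for silica fiber near 1550 nm that is not stated in the text; the function and variable names are illustrative.

```python
# Consistency check of the stated pulse parameters (illustrative only).
C_VACUUM = 299_792_458.0   # speed of light in vacuum, m/s
N_GROUP = 1.468            # ASSUMPTION: typical group index of silica fiber near 1550 nm


def round_trip_length(duration_s: float, n_group: float = N_GROUP) -> float:
    """Fiber length whose round-trip travel time equals duration_s."""
    return C_VACUUM * duration_s / (2.0 * n_group)


spatial_resolution_m = round_trip_length(100e-9)   # 100 ns pulse width
max_range_m = round_trip_length(0.02e-3)           # 0.02 ms pulse interval

print(f"spatial resolution ~ {spatial_resolution_m:.1f} m")
print(f"maximum unambiguous range ~ {max_range_m:.0f} m")
```

Under this assumption the two values come out at roughly 10 m and 2 km, which agrees with the stated spatial resolution and fiber length.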
3.2 Training and Test
The process of the principal component analysis (PCA) and clustering is shown in Fig. 2.
3.2.1 Principal Component Analysis (PCA)
The sample data are segmented, and the approximate location of the vibration is found from its starting point. The vibration data at this approximate position are then selected and saved. As shown in Fig. 3, the time-domain signals differ between events.
Feature extraction is then performed on the intercepted vibration data. The time-domain features can be divided into dimensional and dimensionless features, as shown in Table 2 [20]. In total, 14 features are extracted as the input data of the clustering algorithm.
Table 2
The features in the time domain

| No. | Dimensional | Dimensionless |
|---|---|---|
| 1 | maximum | rectified average |
| 2 | minimum | kurtosis |
| 3 | average | skewness |
| 4 | peak-to-peak | form factor |
| 5 | variance | crest factor |
| 6 | standard deviation | impulse factor |
| 7 | root mean square | margin factor |
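The 14 features in Table 2 can be computed from one intercepted signal segment as in the following minimal sketch. The text does not give the formulas, so the definitions used here (for example crest factor = peak / RMS and margin factor = peak / (mean of sqrt|x|)^2) are the commonly used ones and should be read as assumptions; the function name `time_domain_features` is hypothetical.

```python
import numpy as np


def time_domain_features(x):
    """Return the 14 time-domain features of Table 2 for a 1-D signal.

    ASSUMPTION: textbook definitions; the paper does not state its formulas.
    A constant or all-zero signal would cause division by zero.
    """
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    mean = x.mean()
    std = x.std()                        # population standard deviation
    rms = np.sqrt(np.mean(x ** 2))
    rect_avg = abs_x.mean()              # rectified average
    peak = abs_x.max()
    centered = x - mean
    return {
        "maximum": x.max(),
        "minimum": x.min(),
        "average": mean,
        "peak_to_peak": x.max() - x.min(),
        "variance": x.var(),
        "standard_deviation": std,
        "root_mean_square": rms,
        "rectified_average": rect_avg,
        "kurtosis": np.mean(centered ** 4) / std ** 4,
        "skewness": np.mean(centered ** 3) / std ** 3,
        "form_factor": rms / rect_avg,
        "crest_factor": peak / rms,
        "impulse_factor": peak / rect_avg,
        "margin_factor": peak / np.mean(np.sqrt(abs_x)) ** 2,
    }


# Usage on a synthetic one-period sine wave (not measured data).
t = np.arange(1000) / 1000.0
demo = time_domain_features(np.sin(2 * np.pi * t))
```

For a pure sine wave the crest factor should be close to sqrt(2) and the kurtosis close to 1.5, which gives a quick sanity check of the definitions.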

PCA is a technique for analyzing and simplifying data sets. It is often used to reduce the dimensionality of a data set while retaining the features that contribute most to its variance. This is done by keeping the low-order principal components and discarding the high-order ones, since the low-order components usually retain the most important aspects of the data. The purpose of the K-means clustering algorithm is to divide n points into K clusters so that each point belongs to the cluster whose mean (the cluster center) is nearest to it; this nearest-mean rule serves as the clustering criterion. When all 14 features are used as the input, the clustering performance is relatively poor, so PCA is applied to the features to reduce the data dimensionality. The resulting eigenvalues, arranged in descending order, are 3.5795 > 0.5354 > 0.2512 > 0.1979 > 0.0034 (the remaining 9 eigenvalues are too small and are omitted). The corresponding principal components form the actual input of the clustering algorithm.
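The dimensionality reduction described above can be sketched with an eigendecomposition of the feature covariance matrix. This is a minimal illustration, not the authors' implementation: the Z-score standardization before PCA is an assumption based on the normalization step mentioned in Section 3.3, the function name `pca_reduce` is hypothetical, and the data are synthetic.

```python
import numpy as np


def pca_reduce(features, n_components):
    """Project an (n_samples, n_features) matrix onto its leading principal components.

    Returns the projected scores and all eigenvalues in descending order.
    """
    X = np.asarray(features, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0                  # avoid division by zero for constant columns
    Z = (X - mu) / sigma                     # ASSUMPTION: Z-score before PCA
    cov = np.cov(Z, rowvar=False)            # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals = eigvals[order]
    eigvecs = eigvecs[:, order]
    scores = Z @ eigvecs[:, :n_components]   # keep the low-order components only
    return scores, eigvals


# Synthetic stand-in for the 14 extracted features (not measured data).
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(200, 14))
demo_scores, demo_eigvals = pca_reduce(demo_features, n_components=4)
```

The variance of each retained score column equals the corresponding eigenvalue, which is why sorting the eigenvalues in descending order ranks the components by how much of the data variance they carry.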
3.2.2 Clustering
After data preprocessing, we obtain 1160 samples of blowing vibration events, 550 samples of raining vibration events, 300 samples of indirect hitting vibration events (knocking on the iron net), and 270 samples of direct hitting vibration events, for a total of 2280 samples.
The positions of the K initial centroids strongly influence the final clustering result and the running time, so appropriate centroids must be selected. With completely random selection, the algorithm may converge very slowly. The K-means++ algorithm [42] improves on the random centroid initialization of K-means.
The optimization strategy for initializing the K-means centroids is as follows:

a) Randomly select a point from the input data set as the first cluster center µ1.

b) For each point xi in the data set, calculate its squared distance to the nearest cluster center selected so far: \(D\left({x}_{i}\right)=\underset{r}{\min}{\left\Vert {x}_{i}-{\mu }_{r}\right\Vert }_{2}^{2},\quad r=1,2,\cdots ,{k}_{selected}\)

c) Select a new data point as the next cluster center, such that a point with a larger D(x) has a higher probability of being selected.

d) Repeat b and c until K cluster centroids are selected.

e) Use the K centroids as the initial centroids to run the standard K-means algorithm.
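The initialization steps above can be sketched as follows. The text only says that points with larger D(x) are more likely to be chosen; the sketch uses a selection probability proportional to D(x), which is the usual K-means++ choice and is an assumption here. The function name `kmeans_pp_init` and the synthetic data are illustrative.

```python
import numpy as np


def kmeans_pp_init(X, k, seed=None):
    """Pick k initial centroids from the rows of X with K-means++ seeding."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # First center: chosen uniformly at random.
    centers = [X[rng.integers(n)]]
    for _ in range(1, k):
        # Squared distance of every point to its nearest selected center.
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # ASSUMPTION: probability proportional to D(x).
        # Fails if all points coincide with the selected centers (d2 sums to 0).
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.asarray(centers)


# Synthetic data: four well-separated groups (not measured data).
rng = np.random.default_rng(1)
offsets = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
demo_points = np.vstack([rng.normal(loc=o, scale=0.5, size=(50, 2)) for o in offsets])
demo_centers = kmeans_pp_init(demo_points, k=4, seed=0)
```

Because an already selected point has D(x) = 0, it cannot be drawn again, so the returned centroids are distinct data points as long as the data set contains no duplicates.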
Four types of events are to be identified, so the number of clusters K is set to 4. We varied the number of PCA-reduced features from 2 to 6 and performed K-means clustering; the results are listed in Table 3. With 4 clusters and 5 PCA features, the accuracy reaches its highest value of 87.1%, and increasing the number of features further does not improve it.
Table 3
The clustering results where K = 4

| Number of features | Number of samples | Number of correct samples | Accuracy |
|---|---|---|---|
| 6 | 2280 | 1986 | 87.1% |
| 5 | 2280 | 1986 | 87.1% |
| 4 | 2280 | 1780 | 78.1% |
| 3 | 2280 | 1718 | 75.4% |
| 2 | 2280 | 1736 | 76.1% |

During clustering, we find that the waveforms of indirect hitting events are similar to those of direct hitting events. We therefore merge indirect and direct hitting events into a single hitting class and cluster the resulting three types of vibration events, as shown in Table 4. With 3 clusters and 4 PCA features, the accuracy reaches its highest value of 89.4%, and increasing the number of features further does not improve it.
Table 4
The clustering results where K = 3

| Number of features | Number of samples | Number of correct samples | Accuracy |
|---|---|---|---|
| 6 | 2280 | 2039 | 89.4% |
| 5 | 2280 | 2039 | 89.4% |
| 4 | 2280 | 2039 | 89.4% |
| 3 | 2280 | 2023 | 88.7% |
| 2 | 2280 | 1998 | 87.6% |

During clustering, we also find that the incorrectly clustered vibration data points are concentrated in particular regions; that is, some of the samples chosen during sample selection are not vibration data. We therefore manually remove these non-vibration samples, which leaves a total of 2044 optimized samples. PCA dimensionality reduction and then K-means clustering are performed on the optimized data, and the results are shown in Table 5. With 3 clusters and 4 features, the accuracy reaches 99.4%, and increasing the number of features does not improve it further. Fig. 4 also shows that the clustering effect is good. In Fig. 4, Idx (representing the dimension of the data, equivalent to the two columns of our samples) is the vector of predicted cluster indices.
Table 5
The clustering results where K = 3 (optimized samples)

| Number of features | Number of samples | Number of correct samples | Accuracy |
|---|---|---|---|
| 6 | 2044 | 2031 | 99.4% |
| 5 | 2044 | 2031 | 99.4% |
| 4 | 2044 | 2031 | 99.4% |
| 3 | 2044 | 2025 | 99.1% |
| 2 | 2044 | 2012 | 98.4% |

The confusion matrix is then analyzed to evaluate the performance of the algorithm, as shown in Fig. 5.
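Since K-means returns arbitrary cluster indices, an accuracy figure requires some mapping from clusters to event types. The text does not state how this mapping was made; the sketch below assumes a simple majority vote, in which each cluster is assigned the event type that contributes most of its members. The function names and the toy labels are hypothetical.

```python
import numpy as np


def cluster_confusion_matrix(true_labels, cluster_ids, n_classes, n_clusters):
    """Count samples per (true class, cluster) pair."""
    cm = np.zeros((n_classes, n_clusters), dtype=int)
    for t, c in zip(true_labels, cluster_ids):
        cm[t, c] += 1
    return cm


def majority_vote_accuracy(cm):
    """ASSUMPTION: each cluster is labeled with its most frequent true class."""
    return cm.max(axis=0).sum() / cm.sum()


# Toy example with 3 event types and 3 clusters (not measured data).
demo_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
demo_clusters = [1, 1, 0, 0, 0, 0, 2, 2, 2, 1]
demo_cm = cluster_confusion_matrix(demo_true, demo_clusters, 3, 3)
demo_accuracy = majority_vote_accuracy(demo_cm)
```

The same ratio of correct samples to total samples underlies the accuracy columns of Tables 3 to 5, for example 2031 / 2044 for the optimized data.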
3.2.3 Performance measurement
To evaluate the performance of the clustering algorithm, the silhouette coefficient Si and the CH index are used as evaluation indexes. The silhouette coefficient is calculated as follows:
\(S_i=\frac{b_i-a_i}{\max(a_i,b_i)}\) (1)
In this formula, \(a_i\) is the average distance between sample point i and the other points in the same cluster, that is, the similarity between the sample point and the other points in its own cluster; \(b_i\) is the average distance between the sample point and the points in the next closest cluster, namely its similarity to that cluster. K-means aims for small differences within each cluster and large differences between clusters, and the silhouette coefficient S is the key index describing this difference inside and outside a cluster. According to the formula, the value range of S is [−1, 1]: the closer the value is to 1, the better the clustering performance, and the closer it is to −1, the worse. A silhouette coefficient of 0 means that the clusters overlap. As shown in Table 6, the average silhouette coefficient is highest, 0.7206, when the number of clusters is 3 and the number of features is 4. The silhouette coefficients for different numbers of clusters indicate that the samples are best split into 3 clusters. Most of the points in the three clusters have a large silhouette coefficient (0.8 or greater), indicating that the clusters are well separated.
Table 6
The average Silhouette coefficient.

| K | Number of features | Number of samples | Average Silhouette coefficient |
|---|---|---|---|
| 2 | 4 | 2044 | 0.6193 |
| 3 | 4 | 2044 | 0.7206 |
| 4 | 4 | 2044 | 0.5960 |
| 5 | 4 | 2044 | 0.6090 |
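The per-sample silhouette coefficient, S = (b − a) / max(a, b), and its average over all samples can be computed as in the following sketch. It assumes Euclidean distance and at least two clusters, assigns 0 to a sample that is alone in its cluster (a common convention, not stated in the text), and builds a full pairwise distance matrix, so it is only suited to small data sets. The function name is hypothetical.

```python
import numpy as np


def silhouette_values(X, labels):
    """Silhouette coefficient of every sample (Euclidean distance, >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full pairwise distance matrix; memory grows quadratically with sample count.
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # ASSUMPTION: a singleton cluster gets S = 0
        # a: mean distance to the other members of the same cluster.
        a = dist[i, own].sum() / (n_own - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s


# Toy example: two tight, well-separated clusters (not measured data).
demo_X = [[0, 0], [0, 1], [10, 0], [10, 1]]
demo_labels = [0, 0, 1, 1]
demo_s = silhouette_values(demo_X, demo_labels)
demo_mean_s = demo_s.mean()
```

Repeating the calculation for several values of K and comparing the averages is the procedure summarized in Table 6.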

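The CH index is calculated as shown in (2), where tr denotes the trace of a matrix:
\(s(k)=\frac{\mathrm{tr}({B}_{k})}{\mathrm{tr}({W}_{k})}\cdot \frac{m-k}{k-1}\) (2)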
In this formula, m is the number of samples in the training set and k is the number of clusters. Bk is the between-cluster covariance matrix, and Wk is the within-cluster covariance matrix. The smaller the covariance within clusters and the larger the covariance between clusters, the higher the CH index and the clearer the boundaries between clusters. According to [40], a higher CH index indicates a better clustering effect. As shown in Fig. 6, the CH index reaches its maximum of 2653 when the number of clusters is 3.
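A minimal sketch of the CH (Calinski-Harabasz) index follows, using its standard definition: the between-cluster dispersion tr(Bk) divided by the within-cluster dispersion tr(Wk), scaled by (m − k) / (k − 1). The function name and the toy data are illustrative and are not taken from the paper.

```python
import numpy as np


def calinski_harabasz(X, labels):
    """CH index: tr(B_k) / tr(W_k) * (m - k) / (k - 1). Requires k >= 2."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    m = X.shape[0]
    clusters = np.unique(labels)
    k = len(clusters)
    overall_mean = X.mean(axis=0)
    between = 0.0   # tr(B_k): size-weighted spread of cluster centers
    within = 0.0    # tr(W_k): spread of points around their own center
    for c in clusters:
        members = X[labels == c]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - overall_mean) ** 2)
        within += np.sum((members - center) ** 2)
    return (between / within) * (m - k) / (k - 1)


# Toy example: two tight, well-separated clusters (not measured data).
demo_ch = calinski_harabasz([[0, 0], [0, 1], [10, 0], [10, 1]], [0, 0, 1, 1])
```

In the toy example tr(Bk) = 100 and tr(Wk) = 1, so the index is 100 × (4 − 2) / (2 − 1) = 200; tighter clusters or more distant centers raise the value.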
3.3 Results and Discussion
A total of 14 input features are obtained from the samples by feature extraction, and their dimensionality is reduced by PCA. The first 6 principal components obtained after dimensionality reduction are selected for study. With 4 clusters and 5 features, the classification accuracy reaches 87.1%; with 3 clusters and 4 features, it reaches 89.4%. After the non-vibration data are manually removed from the samples, the accuracy with 3 clusters and 4 features reaches its highest value of 99.4%. The average silhouette coefficient and the CH index then reach 0.7206 and 2653, respectively, the best values observed for this clustering task. These results indicate that the clustering method is effective for practical applications. Z-score normalization, feature extraction, PCA dimensionality reduction, and calculation of the distance between the sample data and the cluster centers are integrated into one functional block. Through this functional block, the real-time data collected by the distributed optical fiber sensor can be used to determine which vibration event a sample belongs to. In subsequent research, we will continue to improve the clustering method to realize real-time monitoring and alarm systems for vibration events.
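The functional block described above could be organized as in the following sketch, which chains Z-score normalization, projection onto stored principal components, and assignment to the nearest cluster center. The class name, its parameters, and the toy values are hypothetical; feature extraction is assumed to have been done already, and the stored statistics would come from the training stage.

```python
import numpy as np


class VibrationEventBlock:
    """Hypothetical functional block: Z-score -> PCA projection -> nearest center."""

    def __init__(self, mean, std, components, centers):
        self.mean = np.asarray(mean, dtype=float)              # per-feature mean
        self.std = np.asarray(std, dtype=float)                # per-feature std
        self.components = np.asarray(components, dtype=float)  # (n_features, n_kept)
        self.centers = np.asarray(centers, dtype=float)        # (K, n_kept)

    def predict(self, features):
        """Return the index of the nearest cluster center for each feature vector."""
        z = (np.atleast_2d(features) - self.mean) / self.std
        scores = z @ self.components
        d2 = ((scores[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)


# Toy configuration: 3 features, 2 kept components, 2 clusters (not measured data).
demo_block = VibrationEventBlock(
    mean=np.zeros(3),
    std=np.ones(3),
    components=np.eye(3)[:, :2],
    centers=[[0.0, 0.0], [5.0, 5.0]],
)
demo_pred = demo_block.predict([[0.1, 0.2, 9.0], [4.8, 5.1, -3.0]])
```

Keeping the training statistics inside one object means a new sample only needs a single `predict` call, which suits the real-time use mentioned in the text.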
The unsupervised learning algorithm proposed in this paper for the automatic classification of vibration events can significantly improve the efficiency of identifying vibration events, but it also has limitations. For vibration events with similar waveforms, such as direct hitting and indirect hitting events, its identification performance is relatively poor.