This section focuses on optimizing the strontium-isotope-based identification model for water sources in adjacent aquifers.
6.1 Principal Component Analysis
6.1.1 Determination of WGS identification index
Aquifer water contains many chemical components, so it is impractical to treat every one of them as an index for water source identification. Considering how each component differs between water sources and the validity of the data, and referring to the relevant literature¹¹, ⁸⁷Sr/⁸⁶Sr (X1), γ(K⁺+Na⁺) (X2), γCa²⁺ (X3), γMg²⁺ (X4), γSO₄²⁻ (X5), γCl⁻ (X6) and γHCO₃⁻ (X7) were selected as the identification indicators of WGS.
6.1.2 Principal Component Analysis
The water sample data in Supplementary material Section 3 were normalized and then subjected to principal component analysis. The correlation coefficient matrix of the indices is listed in Table 1. Table 1 shows that several of the input factors are strongly correlated with each other (for example, X1 with X5 and X7, and X3 with X4), which can reduce the accuracy of the WGS prediction model and lead to water source misjudgment. It is therefore necessary to perform principal component analysis on the input data. Table 2 lists the characteristic value (eigenvalue), variance contribution rate and cumulative contribution rate of each component. Principal components are selected in descending order of characteristic value, together with their corresponding feature vectors (eigenvectors); the larger the characteristic value, the more important the corresponding principal component. The selected principal components should generally account for more than 80% of the total variance. In this paper, the first four principal components (Y1, Y2, Y3 and Y4) were extracted; together they contain 95.271% of the information in the raw data, so they can summarize the raw variables. This choice is also consistent with the scree plot in Fig. 4. The relationship between the factors Y1, Y2, Y3 and Y4 and the raw variables is given by the principal component loading matrix in Table 3, and the factors can be expressed as:
Y1=0.952X1-0.347X2-0.412X3-0.524X4+0.911X5-0.072X6+0.927X7
Y2=-0.292X1-0.204X2+0.887X3+0.730X4+0.318X5+0.304X6+0.268X7
Y3=-0.041X1+0.184X2-0.103X3+0.122X4-0.178X5+0.690X6+0.027X7
Y4=-0.012X1+0.601X2+0.037X3+0.341X4+0.031X5-0.671X6-0.041X7
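The four expressions for Y1 to Y4 amount to a single matrix product. The following is a minimal sketch, assuming the coefficients of Table 3 and an input vector (X1 to X7) that has already been normalized; the names `LOADINGS` and `principal_scores` are illustrative and do not come from the paper.

```python
import numpy as np

# Rows are Y1..Y4, columns are X1..X7 (coefficients as listed in Table 3).
LOADINGS = np.array([
    [0.952, -0.347, -0.412, -0.524, 0.911, -0.072, 0.927],
    [-0.292, -0.204, 0.887, 0.730, 0.318, 0.304, 0.268],
    [-0.041, 0.184, -0.103, 0.122, -0.178, 0.690, 0.027],
    [-0.012, 0.601, 0.037, 0.341, 0.031, -0.671, -0.041],
])


def principal_scores(x):
    """Map normalized indices X1..X7 to the factors Y1..Y4.

    Accepts one sample of shape (7,) or a batch of shape (n, 7).
    """
    x = np.asarray(x, dtype=float)
    return x @ LOADINGS.T


# A unit input on X1 simply reads off the X1 coefficient of every factor.
unit_x1 = principal_scores([1, 0, 0, 0, 0, 0, 0])
batch = principal_scores(np.zeros((3, 7)))
```

Applied to a 60×7 matrix of normalized samples, the same call would return the 60×4 matrix of factor values used by the models in the later sections.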
Table 1 Pearson correlation coefficient matrix of each hydrochemical principal component index
Index | X1 | X2 | X3 | X4 | X5 | X6 | X7
X1 | 1.000
X2 | 0.211 | 1.000
X3 | -0.104 | -0.317 | 1.000
X4 | -0.284 | -0.022 | 0.850 | 1.000
X5 | 0.917 | 0.140 | -0.060 | -0.217 | 1.000
X6 | 0.011 | 0.033 | 0.241 | 0.084 | -0.114 | 1.000
X7 | 0.987 | 0.251 | -0.172 | 0.810 | 0.090 | 0.054 | 1.000
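To make the correlation argument concrete, the sketch below rebuilds the symmetric matrix from the lower triangle of Table 1 and lists the index pairs whose absolute correlation exceeds a cut-off. The cut-off of 0.8 and the helper names are assumptions chosen for illustration; the paper does not state a numerical criterion.

```python
import numpy as np

LABELS = ["X1", "X2", "X3", "X4", "X5", "X6", "X7"]

# Lower triangle of Table 1, row by row (diagonal included).
LOWER = [
    [1.000],
    [0.211, 1.000],
    [-0.104, -0.317, 1.000],
    [-0.284, -0.022, 0.850, 1.000],
    [0.917, 0.140, -0.060, -0.217, 1.000],
    [0.011, 0.033, 0.241, 0.084, -0.114, 1.000],
    [0.987, 0.251, -0.172, 0.810, 0.090, 0.054, 1.000],
]


def full_matrix(lower):
    """Expand a lower-triangular listing into a symmetric matrix."""
    n = len(lower)
    r = np.zeros((n, n))
    for i, row in enumerate(lower):
        r[i, : len(row)] = row
    return r + np.tril(r, -1).T  # mirror the strict lower triangle


def strong_pairs(r, labels, threshold=0.8):
    """Return (index_a, index_b, r) for pairs with |r| above the threshold."""
    return [
        (labels[j], labels[i], float(r[i, j]))
        for i in range(len(labels))
        for j in range(i)
        if abs(r[i, j]) > threshold
    ]


R = full_matrix(LOWER)
pairs = strong_pairs(R, LABELS)
for a, b, value in pairs:
    print(f"{a} and {b}: r = {value:.3f}")
```

With this cut-off the listed pairs are X3 and X4, X1 and X5, X1 and X7, and X4 and X7, which is the kind of redundancy that principal component analysis is meant to remove.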
Table 2 Characteristic value, variance contribution rate and cumulative contribution rate of Y1-Y7 components
Extraction sums of squared loadings
Factor | Characteristic value | Contribution rate /% | Cumulative contribution rate /%
Y1 | 2.473 | 42.683 | 42.683
Y2 | 2.145 | 32.354 | 75.037
Y3 | 0.551 | 11.655 | 86.692
Y4 | 0.465 | 8.579 | 95.271
Y5 | 0.274 | 3.457 | 98.728
Y6 | 0.174 | 1.184 | 99.912
Y7 | 0.012 | 0.088 | 100.000
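The selection rule based on the cumulative contribution rate can be checked directly against the contribution rates in Table 2. This is a small sketch; the function name `components_needed` is illustrative.

```python
import numpy as np

# Contribution rates (%) of Y1..Y7 as listed in Table 2.
CONTRIBUTION = np.array([42.683, 32.354, 11.655, 8.579, 3.457, 1.184, 0.088])


def components_needed(contribution, threshold):
    """Smallest number of leading components whose cumulative rate reaches the threshold."""
    cumulative = np.cumsum(contribution)
    return int(np.searchsorted(cumulative, threshold) + 1)


cumulative = np.cumsum(CONTRIBUTION)
k80 = components_needed(CONTRIBUTION, 80.0)
k95 = components_needed(CONTRIBUTION, 95.0)
print(k80, k95)
```

Under the 80% guideline three components would already suffice (86.692%); the four components retained in this paper raise the cumulative contribution rate to 95.271%.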
Table 3 Load matrix of the four principal component factors
Index | X1 | X2 | X3 | X4 | X5 | X6 | X7
Y1 | 0.952 | -0.347 | -0.412 | -0.524 | 0.911 | -0.072 | 0.927
Y2 | -0.292 | -0.204 | 0.887 | 0.730 | 0.318 | 0.304 | 0.268
Y3 | -0.041 | 0.184 | -0.103 | 0.122 | -0.178 | 0.690 | 0.027
Y4 | -0.012 | 0.601 | 0.037 | 0.341 | 0.031 | -0.671 | -0.041
6.2 Sr-F model based on strontium isotope
Fisher discriminant analysis is a linear discriminant method, based on analysis of variance, that is used to distinguish between groups. Its basic idea is to use dimension reduction to make the differences within the same class of samples as small as possible and the differences between classes as large as possible. Since the γSr²⁺ vs. ⁸⁷Sr/⁸⁶Sr relationships of TLW and OLW have different distributions (see Fig. 2), this paper introduces ⁸⁷Sr/⁸⁶Sr as a discriminant factor of the Fisher method to construct the Sr-F model based on the strontium isotope, so as to increase the difference between classes of samples and improve the accuracy of WGS identification by the Fisher method.
The model was trained on the 40 samples from Supplementary material Section 3. The two types of water sources (Ⅰ and Ⅱ) were treated as two groups, and their covariance matrices were assumed to be equal. The four principal components (Y1, Y2, Y3 and Y4) obtained by principal component analysis were used as the four identification indicators of the Fisher discriminant analysis model. Running the Analyze > Classify > Discriminant procedure in SPSS 19.0 gave the following Fisher identification functions:
(1) First identification function: Z1=-0.658Y1+1.137Y2+0.412Y3+0.524Y4
(2) Second identification function: Z2=0.824Y1+0.384Y2-0.213Y3+0.397Y4
The central scores of the identification functions for the Type I and Type II water sources are listed in Table 4. For the first identification function, the central score for the Type I water source (TLW) is -0.627, and the central score for the Type II water source (OLW) is 0.325. A new sample is assigned to a group by comparing the distances between its function values and the central values of the two water source groups in Table 4; it belongs to the group whose central value is nearest.
Table 4 The central value of identification function in each classification
Identification function | Type Ⅰ | Type Ⅱ
1 | -0.627 | -1.254
2 | 0.325 | -0.327
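The decision rule described for the Sr-F model (evaluate the identification functions, then pick the group whose central value is nearest) can be sketched as follows. The coefficients are those of Z1 and Z2 given in the text. The central values in `EXAMPLE_CENTROIDS` are hypothetical placeholders chosen only to make the sketch runnable; they are not the values of Table 4, and the use of the Euclidean distance in the (Z1, Z2) plane is likewise an assumption.

```python
import numpy as np

# Coefficients of Z1 and Z2 on (Y1, Y2, Y3, Y4), as given in the text.
FISHER_COEFFS = np.array([
    [-0.658, 1.137, 0.412, 0.524],
    [0.824, 0.384, -0.213, 0.397],
])

# Hypothetical group centres in the (Z1, Z2) plane, for illustration only.
EXAMPLE_CENTROIDS = {"I": (-1.0, 0.5), "II": (1.0, -0.5)}


def discriminant_scores(y):
    """Evaluate (Z1, Z2) for one sample of principal components (Y1..Y4)."""
    return FISHER_COEFFS @ np.asarray(y, dtype=float)


def nearest_group(scores, centroids):
    """Return the label of the group whose centre is closest to the scores."""
    labels = list(centroids)
    distances = [np.linalg.norm(scores - np.asarray(centroids[g])) for g in labels]
    return labels[int(np.argmin(distances))]


scores = discriminant_scores([1.0, 0.0, 0.0, 0.0])
group = nearest_group(scores, EXAMPLE_CENTROIDS)
print(scores, group)
```

Replacing the placeholder centres with the central values obtained from the fitted model would reproduce the comparison step described in the text.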
To verify whether the water source identification model based on principal component analysis and Fisher discriminant analysis meets practical requirements, the 40 training samples in Supplementary material Section 3 were substituted one by one into the established identification functions. The function values of each sample were calculated and compared with the central values of each group to obtain the water-gushing category of each sample. The identification results are shown in Table 5. Nine samples (Nos. 6, 11, 12, 15, 23, 24, 29, 32 and 39, marked with * in Table 5) were misjudged, so the correct identification rate is 77.5% (31 of 40). The misjudgments may arise because water samples from different water sources are similar in water quality and strongly correlated.
Table 5 Training sample identification result
Parameter | The model of this paper | Actual classification | Parameter | The model of this paper | Actual classification
Z1 | Ⅰ | Ⅰ | Z21 | Ⅱ | Ⅱ
Z2 | Ⅰ | Ⅰ | Z22 | Ⅱ | Ⅱ
Z3 | Ⅰ | Ⅰ | Z23 | Ⅰ* | Ⅱ
Z4 | Ⅰ | Ⅰ | Z24 | Ⅰ* | Ⅱ
Z5 | Ⅰ | Ⅰ | Z25 | Ⅱ | Ⅱ
Z6 | Ⅱ* | Ⅰ | Z26 | Ⅱ | Ⅱ
Z7 | Ⅰ | Ⅰ | Z27 | Ⅱ | Ⅱ
Z8 | Ⅰ | Ⅰ | Z28 | Ⅱ | Ⅱ
Z9 | Ⅰ | Ⅰ | Z29 | Ⅰ* | Ⅱ
Z10 | Ⅰ | Ⅰ | Z30 | Ⅱ | Ⅱ
Z11 | Ⅱ* | Ⅰ | Z31 | Ⅱ | Ⅱ
Z12 | Ⅱ* | Ⅰ | Z32 | Ⅰ* | Ⅱ
Z13 | Ⅰ | Ⅰ | Z33 | Ⅱ | Ⅱ
Z14 | Ⅰ | Ⅰ | Z34 | Ⅱ | Ⅱ
Z15 | Ⅱ* | Ⅰ | Z35 | Ⅱ | Ⅱ
Z16 | Ⅰ | Ⅰ | Z36 | Ⅱ | Ⅱ
Z17 | Ⅰ | Ⅰ | Z37 | Ⅱ | Ⅱ
Z18 | Ⅰ | Ⅰ | Z38 | Ⅱ | Ⅱ
Z19 | Ⅰ | Ⅰ | Z39 | Ⅰ* | Ⅱ
Z20 | Ⅰ | Ⅰ | Z40 | Ⅱ | Ⅱ
6.3 Sr-D model based on strontium isotope
Distance discriminant analysis identifies newly acquired samples according to quantitative characteristics of the existing observed samples. The basic idea is to assign a sample to the group to which it is closest. In this paper, the strontium isotope is introduced as a discrimination factor of the distance discrimination method to construct the Sr-D model based on the strontium isotope. By increasing the distance between a sample and the group to which it does not belong, the accuracy of water source identification by the distance discrimination method can be improved.
Twenty water samples from each of the two types of aquifers (40 in total) in Supplementary material Section 3 were selected as learning samples. Type I and Type II water sources are treated as two groups, and their covariance matrices are assumed to be equal. According to the results of principal component analysis, Y1, Y2, Y3 and Y4 are used as the discriminant factors of the distance discriminant analysis model, and the Sr-D model is constructed with the distance discriminant analysis method used in this paper.
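A minimal sketch of two-group distance discrimination under the equal-covariance assumption is given below. The paper does not state which distance it uses; the Mahalanobis distance with a pooled covariance matrix is assumed here because it is the usual choice for this method. The training data are synthetic stand-ins for the (Y1, Y2, Y3, Y4) values, and all names are illustrative.

```python
import numpy as np


def fit_groups(group1, group2):
    """Return the two group means and the inverse pooled covariance matrix."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    pooled = ((n1 - 1) * np.cov(g1, rowvar=False)
              + (n2 - 1) * np.cov(g2, rowvar=False)) / (n1 + n2 - 2)
    return g1.mean(axis=0), g2.mean(axis=0), np.linalg.inv(pooled)


def squared_distance(x, mean, cov_inv):
    """Squared Mahalanobis distance from a sample to a group mean."""
    d = np.asarray(x, dtype=float) - mean
    return float(d @ cov_inv @ d)


def classify(x, mean1, mean2, cov_inv):
    """Assign the sample to the nearer group ("I" or "II")."""
    d1 = squared_distance(x, mean1, cov_inv)
    d2 = squared_distance(x, mean2, cov_inv)
    return "I" if d1 <= d2 else "II"


# Synthetic learning samples: 20 per group, 4 discriminant factors each.
rng = np.random.default_rng(1)
type1 = rng.normal(loc=-1.0, scale=0.5, size=(20, 4))
type2 = rng.normal(loc=1.0, scale=0.5, size=(20, 4))
mean1, mean2, cov_inv = fit_groups(type1, type2)

print(classify(type1[0], mean1, mean2, cov_inv))
```

Pooling the two covariance matrices is what the equal-covariance assumption amounts to in practice; with unequal covariances each group would need its own inverse matrix.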
When a model learns from sample data with a fixed number of input variables, the number of samples affects the prediction accuracy of the identification model. So far, few papers have specifically discussed this influence. Therefore, based on the principal component analysis of the 40 sets of sample data, this paper briefly examines how the number of samples affects the accuracy of the identification model. Of the 40 samples, 20 come from the Type I water source and 20 from the Type II water source. To keep the extraction balanced and random, three groups of data are removed in turn from the end of the sample data of each water source. The remaining samples are used as learning samples to build a model and to test its prediction accuracy, so that the effect of the number of samples on the prediction accuracy can be investigated. Following this sampling principle, four rounds of extraction were carried out, giving data sets of 40, 27, 21, 17 and 12 samples. These five cases were studied, and the error rate of each corresponding model was estimated with the resubstitution (back-substitution) method. Fig. 5 compares the misjudgment rates obtained.
With 12 sets of sample data, the error rate of the identification model is 0.333, so the accuracy is clearly very low. As the number of learning samples increases to 17, 21, 27 and 40, the misjudgment rates fall to 0.0588, 0.0476, 0.0370 and 0.0250, respectively. As can be seen from Fig. 5, the misjudgment rate is already small when the number of learning samples is 27, and with further increases in the number of learning samples the misjudgment rates decrease in sequence and gradually converge. These results indicate that the number of learning samples has a great influence on the prediction accuracy of the identification model, and that once the number of learning samples reaches a certain size the prediction accuracy can meet the requirements. They also show that, when conditions permit, as many learning samples as possible should be used to build the identification model, because the prediction accuracy obtained will be higher.
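The resubstitution estimate is simply the number of misjudged learning samples divided by the number of learning samples. The short check below works backwards from the reported rates to the number of misjudged samples they imply; these counts are an inference from the reported figures, not values stated in the paper.

```python
# Reported misjudgment rates, keyed by the number of learning samples.
reported = {12: 0.333, 17: 0.0588, 21: 0.0476, 27: 0.0370, 40: 0.0250}


def implied_misjudged(n_samples, rate):
    """Whole number of misjudged samples closest to n_samples * rate."""
    return round(n_samples * rate)


implied = {n: implied_misjudged(n, rate) for n, rate in reported.items()}
recomputed = {n: implied[n] / n for n in reported}

for n in reported:
    print(f"n = {n}: {implied[n]} misjudged, rate = {recomputed[n]:.4f}")
```

Read this way, the reported rates correspond to four misjudged samples for the 12-sample set and one misjudged sample for each of the larger sets, so the falling rate in the larger sets mainly reflects the growing denominator.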
Table 6 Identification result of training sample
Number | The model of this paper | Actual classification | Number | The model of this paper | Actual classification | Number | The model of this paper | Actual classification
1 | Ⅰ | Ⅰ | 15 | Ⅱ* | Ⅰ | 39 | Ⅱ | Ⅱ
2 | Ⅰ | Ⅰ | 16 | Ⅰ | Ⅰ | 40 | Ⅱ | Ⅱ
3 | Ⅰ | Ⅰ | 17 | Ⅰ | Ⅰ | 41 | Ⅱ | Ⅱ
4 | Ⅰ | Ⅰ | 18 | Ⅰ | Ⅰ | 42 | Ⅰ* | Ⅱ
5 | Ⅰ | Ⅰ | 19 | Ⅰ | Ⅰ | 43 | Ⅱ | Ⅱ
6 | Ⅰ | Ⅰ | 20 | Ⅰ | Ⅰ | 44 | Ⅱ | Ⅱ
7 | Ⅰ | Ⅰ | 31 | Ⅱ | Ⅱ | 45 | Ⅱ | Ⅱ
8 | Ⅰ | Ⅰ | 32 | Ⅱ | Ⅱ | 46 | Ⅱ | Ⅱ
9 | Ⅰ | Ⅰ | 33 | Ⅱ | Ⅱ | 47 | Ⅱ | Ⅱ
10 | Ⅰ | Ⅰ | 34 | Ⅱ | Ⅱ | 48 | Ⅱ | Ⅱ
11 | Ⅰ | Ⅰ | 35 | Ⅱ | Ⅱ | 49 | Ⅱ | Ⅱ
12 | Ⅰ | Ⅰ | 36 | Ⅱ | Ⅱ | 50 | Ⅱ | Ⅱ
13 | Ⅰ | Ⅰ | 37 | Ⅱ | Ⅱ
14 | Ⅰ | Ⅰ | 38 | Ⅱ | Ⅱ
As mentioned above, 20 water samples from each of the two aquifers (40 in total) in Supplementary material Section 3 are used as training samples and substituted one by one into the established identification standard. The results are shown in Table 6. Water sample No. 15 is misjudged as OLW and water sample No. 42 is misjudged as TLW, and the reported judgment accuracy rate is 96.7%. These results suggest that the model based on principal component analysis and distance identification constructed in this paper is stable and reliable, and can be used to classify and identify the WGS in the adjacent aquifers of coal mines more accurately.
6.4 Sr-B model based on strontium isotope
The back-propagation (BP) network is a multi-layer feedforward network trained with the error back-propagation algorithm, and it can realize a highly nonlinear mapping between input and output variables. The conventional learning rule is the steepest descent method: the weights and biases of the network are adjusted through back propagation of the error so as to minimize the sum of squared errors, so that the network output approaches the expected value, and training ends only when the learning requirements are met. Because the strontium content differs greatly between adjacent aquifers, the strontium isotope is introduced as an identification factor of the BP neural network model to build the Sr-B model based on the strontium isotope. This increases the number of input nodes and the difference between the inputs of the BP neural network model, and improves its water source identification accuracy.
According to the results of principal component analysis, the first four principal components can be selected to represent the main information of the aquifer, which reduces the data from 60×7 to 60×4 and greatly reduces the amount of calculation. A corresponding neural network model is established for training and prediction of WGS identification.
The data samples were partitioned into a training set and a testing set. Fifteen samples from each of the two aquifers (30 in total) were randomly selected to establish the model, and 10 water samples were used for prediction. Programs were written in MATLAB 12.0 to normalize the input variables and to build the BP neural network model. After repeated experiments and comparative analysis with 4 input-layer nodes, 2 output-layer nodes, an allowable system error of 0.001, a learning rate of 0.04, an inertia (momentum) coefficient of 0.7 and a maximum of 2000 iterations, the network with 9 hidden-layer neurons was found to have a simple structure, fast convergence and high precision. The 4-9-2 network structure was therefore selected for training and identification verification, with the nonlinear Tansig function as the hidden-layer transfer function and the linear Purelin function as the output-layer transfer function.
The dimension-reduced principal component data of the prediction samples were fed into the trained model, and the identification results were compared with the actual water sources. The neural network regression plot generated by the model (Fig. 6) shows that the normalized coefficients of the models are all greater than 99.5%, which indicates that the trained network identifies the WGS successfully.
Of note, the performance of a neural network is mainly evaluated by its generalization ability. The verification analysis of the test samples shows that the network model has strong generalization ability and can truly reflect the relationship between input and output. The training convergence curve of the neural network model meets the requirement at iteration 117, that is, the mean square error is less than 0.0001. The test results (Table 7) show that the water source is identified correctly for 9 of the 10 water-gushing samples, so the water source identification accuracy reaches 90%. The results show that the identification model based on principal component analysis and the BP neural network can classify and identify the WGS in the adjacent aquifers of coal mines more accurately.
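The 4-9-2 structure and the training settings quoted above can be sketched in plain NumPy as follows. This is an illustration under stated assumptions rather than the MATLAB program used in the paper: the data are synthetic stand-ins for the four principal components, the training rule is ordinary batch gradient descent with momentum, and the weight initialization and all function names are hypothetical.

```python
import numpy as np


def init_params(rng, n_in=4, n_hidden=9, n_out=2):
    """Small random weights and zero biases for a 4-9-2 network."""
    return {
        "W1": rng.normal(0.0, 0.5, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.5, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(params, X):
    hidden = np.tanh(X @ params["W1"] + params["b1"])  # tansig-type hidden layer
    output = hidden @ params["W2"] + params["b2"]      # purelin-type output layer
    return hidden, output


def train(params, X, T, lr=0.04, momentum=0.7, max_iter=2000, goal=1e-3):
    """Batch gradient descent with momentum; stops once the MSE drops below goal."""
    velocity = {k: np.zeros_like(v) for k, v in params.items()}
    history = []
    for _ in range(max_iter):
        hidden, output = forward(params, X)
        err = output - T
        history.append(float(np.mean(err ** 2)))
        if history[-1] < goal:
            break
        d_out = 2.0 * err / err.size                              # dMSE/d(output)
        d_hidden = (d_out @ params["W2"].T) * (1.0 - hidden ** 2)  # tanh derivative
        grads = {"W1": X.T @ d_hidden, "b1": d_hidden.sum(axis=0),
                 "W2": hidden.T @ d_out, "b2": d_out.sum(axis=0)}
        for k in params:
            velocity[k] = momentum * velocity[k] - lr * grads[k]
            params[k] = params[k] + velocity[k]
    return params, history


# Synthetic data: 15 training samples per class and 10 test samples.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(-1.0, 0.5, (15, 4)), rng.normal(1.0, 0.5, (15, 4))])
T_train = np.vstack([np.tile([1.0, 0.0], (15, 1)), np.tile([0.0, 1.0], (15, 1))])
X_test = rng.normal(0.0, 1.0, (10, 4))

params, history = train(init_params(rng), X_train, T_train)
_, outputs = forward(params, X_test)
print(len(history), history[0], history[-1])
```

Each row of `outputs` holds the two output-node values for one test sample, in the same layout as the predicted values reported for the Sr-B model.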
Table 7 Detection results of neural network model
Number | Expected probability value Ⅰ | Expected probability value Ⅱ | Predictive probability value Ⅰ | Predictive probability value Ⅱ | Result
1 | 1 | 0 | 1.0444 | 0.0748 | Right
2 | 1 | 0 | 0.9154 | -0.0731 | Right
3 | 1 | 0 | 0.9547 | 0.0079 | Right
4 | 1 | 0 | 1.0143 | -0.0691 | Right
5 | 1 | 0 | 1.0129 | 0.1134 | Right
6 | 0 | 1 | 0.4354 | 0.5724 | False
7 | 1 | 0 | 0.9953 | 0.0383 | Right
8 | 0 | 1 | -0.0170 | 0.9235 | Right
9 | 0 | 1 | 0.0828 | 0.9625 | Right
10 | 1 | 0 | 1.0297 | 0.0998 | Right
6.5 Sr-G model based on strontium isotope
Grey relational analysis (GRA) is a basic component of grey system theory. It calculates the degree of correlation between each influencing factor of the object to be identified and the corresponding factor of each reference object, compares these degrees of correlation, and assigns the object to the category of the reference object with which it is most strongly correlated. In this study, the Sr-G model based on the strontium isotope is established to increase the difference between the reference objects and the object to be identified in grey relational analysis, which improves the reliability of the correlation degree and the accuracy of water source identification.
According to the results of principal component analysis, Y1, Y2, Y3 and Y4 were used as evaluation factors to identify WGS. The samples in Supplementary material Section 3 were first subjected to initial value processing according to the formula. Fifteen samples from each of the two types of water (30 in total) were selected as reference samples (the female parent), and 10 further samples were randomly selected as the samples to be tested. The correlation degrees of the 40 water samples were then calculated with the correlation degree formula. For each of the 10 samples to be tested, the reference sample with the maximum correlation was found, and the type of that reference sample was taken as the type of the tested sample; the results are shown in Table 8. As Table 8 shows, 9 of the 10 samples to be tested were identified correctly; only sample No. 48 was mistakenly identified as TLW, giving an accuracy rate of 90%. The identification model based on principal component analysis and grey relational analysis can therefore classify and identify WGS in the adjacent aquifers of coal mines.
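The matching step can be sketched as below. The correlation degree formula is not reproduced in this section, so the sketch assumes the standard Deng grey relational coefficient with a distinguishing coefficient of 0.5, averaged over the four factors; the initial value processing is omitted, and the reference values and names are made up for illustration.

```python
import numpy as np


def grey_relational_degrees(x0, refs, rho=0.5):
    """Grey relational degree between one tested sample and each reference sample."""
    x0 = np.asarray(x0, dtype=float)
    refs = np.asarray(refs, dtype=float)
    delta = np.abs(refs - x0)              # shape (n_refs, n_factors)
    d_min, d_max = delta.min(), delta.max()
    if d_max == 0.0:
        return np.ones(len(refs))          # tested sample coincides with every reference
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)
    return coeff.mean(axis=1)              # average the coefficients over the factors


def identify(x0, refs, ref_types, rho=0.5):
    """Return (type, index, degree) of the most strongly related reference sample."""
    degrees = grey_relational_degrees(x0, refs, rho)
    best = int(np.argmax(degrees))
    return ref_types[best], best, float(degrees[best])


# Made-up reference samples (Y1..Y4) with their known water source types.
refs = [[0.2, 0.1, 0.0, 0.3],
        [1.0, 0.9, 1.1, 0.8],
        [0.1, 0.2, 0.1, 0.2]]
ref_types = ["I", "II", "I"]

water_type, ref_index, degree = identify([1.0, 0.9, 1.1, 0.8], refs, ref_types)
print(water_type, ref_index, degree)
```

A tested sample that coincides with a reference sample obtains a relational degree of 1 with it, which is why the maximum degree is used to pick the water source type.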
Table 8 Identification results of water samples to be measured
Number | Actual type | Number of the most related reference (female parent) sample | Maximum relevance | The model of this paper
16 | Ⅰ | 2 | 0.9437 | Ⅰ
17 | Ⅰ | 10 | 0.8423 | Ⅰ
18 | Ⅰ | 8 | 0.9487 | Ⅰ
19 | Ⅰ | 13 | 0.9624 | Ⅰ
20 | Ⅰ | 2 | 0.9743 | Ⅰ
46 | Ⅱ | 33 | 0.9107 | Ⅱ
47 | Ⅱ | 37 | 0.9456 | Ⅱ
48 | Ⅱ | 7 | 0.9484 | Ⅰ*
49 | Ⅱ | 44 | 0.9893 | Ⅱ
50 | Ⅱ | 32 | 0.9757 | Ⅱ
6.6 Comparison and Selection of Identification Effects of Different Identification Models
To compare the identification performance of the four models (Sr-F, Sr-D, Sr-B and Sr-G), the remaining 20 water samples (10 samples of TLW and 10 samples of OLW) in Supplementary material Section 3 were substituted into each model, and Table 9 lists the WGS identification results obtained. The Sr-B model achieves the highest recognition rate, 95%, compared with 80%, 85% and 90% for the Sr-F, Sr-D and Sr-G models, respectively. Therefore, according to the results of this study, we propose that the Sr-B model can be used to identify WGS in adjacent limestone aquifers.
Table 9 Identification result of identification model of unknown water source based on strontium isotope
Number | F model | D model | B model | G model | Actual type
21 | Ⅰ | Ⅰ | Ⅰ | Ⅰ | Ⅰ
22 | Ⅰ | Ⅰ | Ⅰ | Ⅰ | Ⅰ
23 | Ⅰ | Ⅰ | Ⅰ | Ⅱ* | Ⅰ
24 | Ⅱ* | Ⅱ* | Ⅰ | Ⅰ | Ⅰ
25 | Ⅰ | Ⅰ | Ⅰ | Ⅰ | Ⅰ
26 | Ⅰ | Ⅰ | Ⅱ* | Ⅰ | Ⅰ
27 | Ⅰ | Ⅰ | Ⅰ | Ⅰ | Ⅰ
28 | Ⅱ* | Ⅰ | Ⅰ | Ⅰ | Ⅰ
29 | Ⅰ | Ⅰ | Ⅰ | Ⅰ | Ⅰ
30 | Ⅰ | Ⅰ | Ⅰ | Ⅰ | Ⅰ
51 | Ⅱ | Ⅱ | Ⅱ | Ⅱ | Ⅱ
52 | Ⅰ* | Ⅱ | Ⅱ | Ⅱ | Ⅱ
53 | Ⅱ | Ⅰ* | Ⅱ | Ⅱ | Ⅱ
54 | Ⅱ | Ⅱ | Ⅱ | Ⅱ | Ⅱ
55 | Ⅱ | Ⅱ | Ⅱ | Ⅱ | Ⅱ
56 | Ⅰ* | Ⅱ | Ⅱ | Ⅱ | Ⅱ
57 | Ⅱ | Ⅱ | Ⅱ | Ⅰ* | Ⅱ
58 | Ⅱ | Ⅱ | Ⅱ | Ⅱ | Ⅱ
59 | Ⅱ | Ⅰ* | Ⅱ | Ⅱ | Ⅱ
60 | Ⅱ | Ⅱ | Ⅱ | Ⅱ | Ⅱ
Number of misjudgments | 4 | 3 | 1 | 2
Accuracy rate | 80% | 85% | 95% | 90%
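The accuracy rates in the last row of Table 9 follow directly from the misjudgment counts over the 20 test samples, as the short check below shows (the variable names are illustrative).

```python
N_TEST = 20

# Number of misjudged test samples per model, as counted in Table 9.
misjudged = {"Sr-F": 4, "Sr-D": 3, "Sr-B": 1, "Sr-G": 2}

accuracy = {model: 100.0 * (N_TEST - wrong) / N_TEST for model, wrong in misjudged.items()}
best_model = max(accuracy, key=accuracy.get)

for model, rate in accuracy.items():
    print(f"{model}: {rate:.0f}%")
print("highest accuracy:", best_model)
```

With only 20 test samples, each misjudgment changes the accuracy by 5 percentage points, so the gap between the Sr-B and Sr-G models corresponds to a single sample.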