3.1 Statistical description
First, outliers were identified with the boxplot technique and replaced by the average value of the remaining samples to avoid loss of information. In each box, the central mark is the median, the edges of the box are the 25th and 75th percentiles, and the whiskers extend to the most extreme data points that the algorithm does not consider outliers. In this work, the whisker parameter was set to 3, so only very extreme data points, which might be caused by sample contamination, were treated as outliers. In total, 14, 12, 3, and 11 data points were identified as outliers in the multi-elemental measurements of samples from Guangxi, Jiangxi, Hunan, and Guizhou provinces, respectively.
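The outlier rule described above can be sketched as follows. This is a minimal illustration, not the software used in the study: the function name `replace_outliers`, the linear-interpolation quartiles, and the example readings are all assumptions made for the sketch.

```python
from statistics import mean, quantiles

def replace_outliers(values, whisker=3.0):
    """Replace boxplot outliers with the mean of the remaining values."""
    # Quartiles by linear interpolation (an assumption; other percentile
    # definitions shift the fences slightly).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    # Mean of the points that lie inside the whisker fences.
    replacement = mean(v for v in values if lower <= v <= upper)
    return [v if lower <= v <= upper else replacement for v in values]

# Hypothetical readings for one element in one region (not study data).
readings = [1.0, 2.0, 3.0, 4.0, 5.0, 100.0]
cleaned = replace_outliers(readings)
```

With a whisker parameter of 3, a point is flagged only if it lies more than three interquartile ranges beyond the box, which is why only very extreme values are replaced.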
After outlier processing, the element contents of each region were approximately normally distributed. The average and relative standard deviation (RSD) of the element contents by geographical origin are shown in Table 2. Table 2 shows that K was the most abundant element in all LHG samples, followed by P, Mg, Ca, Fe, Na, and other trace elements. This result agrees with previous reports 35, which show that LHG is a good source of these elements, especially K.
Table 2 Mean and RSD of the trace element contents according to their origin
| Element | Guangxi (N=30) Mean (μg g⁻¹) | Guangxi (N=30) RSD (%) | Jiangxi (N=24) Mean (μg g⁻¹) | Jiangxi (N=24) RSD (%) | Hunan (N=30) Mean (μg g⁻¹) | Hunan (N=30) RSD (%) | Guizhou (N=30) Mean (μg g⁻¹) | Guizhou (N=30) RSD (%) | Total (N=114) Mean (μg g⁻¹) | Total (N=114) RSD (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| K | 12497.6 | 10.6 | 13122.4 | 12.5 | 15947.3 | 9.4 | 14281.2 | 12.3 | 14006.3 | 14.5 |
| Na | 6.0 | 28.7 | 4.5 | 34.4 | 7.3 | 33.2 | 64.6 | 35.1 | 21.5 | 132.3 |
| Ca | 294.1 | 24.9 | 401.1 | 23.8 | 551.3 | 23.0 | 813.6 | 17.4 | 521.0 | 43.8 |
| P | 1303.9 | 13.2 | 1622.5 | 14.0 | 2645.3 | 19.5 | 2235.0 | 15.4 | 1969.0 | 32.1 |
| Mg | 661.1 | 13.4 | 719.7 | 13.0 | 909.8 | 19.6 | 1152.8 | 15.5 | 868.3 | 27.8 |
| Al | 7.1 | 30.7 | 6.8 | 33.1 | 19.1 | 51.5 | 28.0 | 28.9 | 15.7 | 70.8 |
| B | 6.7 | 15.1 | 8.0 | 24.5 | 9.7 | 20.3 | 13.1 | 17.0 | 9.4 | 32.4 |
| Ba | 1.9 | 31.4 | 3.1 | 54.6 | 1.4 | 46.5 | 0.8 | 34.1 | 1.7 | 69.9 |
| Cu | 7.3 | 21.2 | 5.9 | 20.5 | 5.9 | 20.9 | 8.0 | 15.3 | 6.8 | 23.8 |
| Fe | 31.5 | 11.5 | 28.3 | 21.7 | 35.4 | 33.8 | 63.9 | 22.0 | 40.4 | 43.2 |
| Mn | 9.2 | 27.6 | 8.7 | 27.0 | 6.5 | 22.1 | 6.7 | 37.7 | 7.7 | 32.5 |
| Ni | 1.3 | 24.3 | 1.9 | 38.7 | 1.3 | 36.7 | 2.0 | 52.8 | 1.6 | 47.8 |
| Zn | 11.6 | 14.6 | 12.3 | 14.0 | 12.9 | 17.1 | 22.2 | 22.5 | 14.9 | 36.0 |
| Sr | 0.5 | 23.8 | 2.1 | 74.1 | 0.9 | 53.5 | 2.7 | 24.9 | 1.5 | 78.8 |
In addition, the variation in element content across all samples was much higher than that within an individual origin. For example, the RSD of Mg across all samples was 27.8%, much larger than the RSD of Mg within any single origin (13.0% to 19.6%). However, two elements, Mn and Ni, showed their largest variation in Guizhou (RSD = 37.7% and 52.8%, respectively), exceeding even the values for all samples combined (32.5% and 47.8%). We consider that this was caused by the particular soil characteristics of Guizhou. In general, large between-group variability and small within-group variability are necessary for good discrimination.
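The comparison between pooled and within-origin variation can be illustrated with a short calculation. The sketch below assumes that RSD is the sample standard deviation divided by the mean, expressed in percent; the two origins and their values are synthetic and are not taken from Table 2.

```python
from statistics import mean, stdev

def rsd(values):
    """Relative standard deviation in percent (sample standard deviation)."""
    return 100.0 * stdev(values) / mean(values)

# Synthetic contents for two hypothetical origins (not study data).
origin_a = [9.0, 10.0, 11.0]
origin_b = [19.0, 20.0, 21.0]
pooled = origin_a + origin_b

within = [rsd(origin_a), rsd(origin_b)]  # 10 % and 5 %
overall = rsd(pooled)                    # about 37 %
```

Because the two origin means differ, the pooled RSD is far larger than either within-origin RSD, which is the pattern that favours discrimination.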
A radar plot allows simple, rapid, and intuitive discrimination of different patterns and was therefore applied to the LHG samples to distinguish their geographical origins by element content. Six elements with high variation (Na, Al, Sr, Ba, Ca, and Fe) were selected, and their average contents were used for the radar plot analysis. Fig. 2 shows that the six element contents of LHG samples from different regions have clearly different characteristic patterns. For instance, all of these elements except Ba had higher contents in Guizhou than in the other three regions. It should also be noted that the LHG samples from Guangxi had the lowest contents for five elements (Na, Al, Sr, Ca, and Fe). The radar plot shows that multi-element content has the potential to discriminate the geographical origin of LHG.
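As a rough sketch of how radar-plot axes could be prepared from the Table 2 means, each element can be scaled by its largest regional mean so that all six axes share a 0 to 1 range. The scaling rule and the names `MEANS` and `radar_axes` are assumptions for illustration; the text does not state how the axes of Fig. 2 were scaled.

```python
# Mean contents (μg g⁻¹) copied from Table 2 for the six radar-plot elements.
MEANS = {
    "Na": {"Guangxi": 6.0, "Jiangxi": 4.5, "Hunan": 7.3, "Guizhou": 64.6},
    "Al": {"Guangxi": 7.1, "Jiangxi": 6.8, "Hunan": 19.1, "Guizhou": 28.0},
    "Sr": {"Guangxi": 0.5, "Jiangxi": 2.1, "Hunan": 0.9, "Guizhou": 2.7},
    "Ba": {"Guangxi": 1.9, "Jiangxi": 3.1, "Hunan": 1.4, "Guizhou": 0.8},
    "Ca": {"Guangxi": 294.1, "Jiangxi": 401.1, "Hunan": 551.3, "Guizhou": 813.6},
    "Fe": {"Guangxi": 31.5, "Jiangxi": 28.3, "Hunan": 35.4, "Guizhou": 63.9},
}

def radar_axes(means):
    """Scale each element by its largest regional mean (assumed axis scaling)."""
    scaled = {}
    for element, by_region in means.items():
        top = max(by_region.values())
        scaled[element] = {region: value / top for region, value in by_region.items()}
    return scaled

scaled = radar_axes(MEANS)
# Region with the largest mean content for each element.
highest = {element: max(regions, key=regions.get) for element, regions in scaled.items()}
```

The scaled values are what a plotting library would receive, one polygon per region. In `highest`, Guizhou is the top region for every element except Ba.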
3.2 Principal component analysis
At the preliminary stage, PCA was used for exploratory data analysis before classification modeling. PCA is a commonly used dimension reduction technique that shows the distribution of samples by projecting them onto a set of orthogonal basis vectors. In this study, the 74 LHG samples of the training set and the 14 element contents formed the input data matrix (74 rows and 14 columns), which was then analysed by PCA based on the singular value decomposition algorithm.
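A minimal sketch of PCA by singular value decomposition is given below, using NumPy and a random 74 × 14 matrix in place of the real element contents. Column centering is assumed, and the text does not say whether the data were also scaled before PCA; the function name `pca_svd` is hypothetical.

```python
import numpy as np

def pca_svd(X, n_components=2):
    """PCA of a samples-by-variables matrix via singular value decomposition."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # center each column (assumed)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # sample coordinates
    loadings = Vt[:n_components].T                    # variable contributions
    explained = s**2 / np.sum(s**2)                   # fraction of total variance
    return scores, loadings, explained[:n_components]

# Random stand-in for the 74 x 14 training matrix (not study data).
rng = np.random.default_rng(0)
X = rng.normal(size=(74, 14))
scores, loadings, ratio = pca_svd(X)
```

The score plot is a scatter of the two columns of `scores`, the loading plot uses the rows of `loadings`, and `ratio` holds the fraction of total variance explained by PC1 and PC2.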
As shown in Fig. 3, the first principal component (PC1) and the second principal component (PC2) explain 53.09% and 13.09% of the total variance, respectively. The score plot shows a clear separation between the samples from Guizhou and those from the other three regions; the samples from Hunan are also well separated from those from Guizhou, Jiangxi, and Guangxi. This distribution can be interpreted from the loading plot, which indicates that the contents of Fe (63.9 μg g⁻¹), B (13.1 μg g⁻¹), Ca (813.6 μg g⁻¹), Na (64.6 μg g⁻¹), and Zn (22.2 μg g⁻¹) are higher in samples from Guizhou, so these samples are clearly separated from the others along PC1. The samples from Hunan are separated from the other samples along PC2 because of their high contents of K (15947.3 μg g⁻¹) and P (2645.3 μg g⁻¹) and low contents of Ni (1.3 μg g⁻¹) and Cu (5.9 μg g⁻¹). The higher K and P contents may be caused by excessive fertilization in Hunan. However, the samples from Guangxi and Jiangxi overlapped severely. Fig. 1 and Fig. 2 show that, although Jiangxi and Guangxi lie farther apart, their element profiles are more similar, which leads to the overlap in PCA. To obtain reliable classification models for the different LHG samples, supervised pattern recognition techniques were applied.
3.3 Supervised classification models
In this work, models for classifying LHG samples according to their geographical origin were developed using three supervised pattern recognition techniques with different mechanisms: LDA, k-NN, and SVM. LDA is a linear classification technique that maximizes the variance between classes while minimizing the variance within each class. Discriminant functions are constructed as linear combinations of the original variables and used to differentiate groups of samples. A test sample is predicted by projecting it onto the discriminant functions and assigning it to the class with the nearest centroid. Unlike PCA, LDA is a supervised method that uses the labels of the training samples to develop the model. Thus, LDA can give better pattern recognition results than PCA.
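The LDA procedure described above (between-class versus within-class scatter, projection, nearest-centroid assignment) can be sketched with NumPy as follows. This is an illustrative implementation under stated assumptions, not the software used in the study: `fit_lda` and `predict_lda` are hypothetical names, and the four well-separated clusters are synthetic.

```python
import numpy as np

def fit_lda(X, y):
    """Fisher LDA: directions maximizing between- over within-class scatter."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    classes = np.unique(y)
    overall_mean = X.mean(axis=0)
    n_features = X.shape[1]
    Sw = np.zeros((n_features, n_features))   # within-class scatter
    Sb = np.zeros((n_features, n_features))   # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        Sw += centered.T @ centered
        diff = (mean_c - overall_mean)[:, None]
        Sb += Xc.shape[0] * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    # At most (number of classes - 1) discriminant functions carry information.
    order = np.argsort(eigvals.real)[::-1][: len(classes) - 1]
    W = eigvecs.real[:, order]
    explained = eigvals.real[order] / eigvals.real[order].sum()
    centroids = np.array([(X[y == c] @ W).mean(axis=0) for c in classes])
    return classes, W, centroids, explained

def predict_lda(model, X_new):
    """Project new samples and assign each to the nearest class centroid."""
    classes, W, centroids, _ = model
    Z = np.asarray(X_new, dtype=float) @ W
    distances = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return classes[distances.argmin(axis=1)]

# Synthetic data: four well-separated classes in three variables (not study data).
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0], [10, 0, 0], [0, 10, 0], [0, 0, 10]], dtype=float)
labels = np.array(["A", "B", "C", "D"])
X_train = np.vstack([c + rng.normal(scale=0.5, size=(20, 3)) for c in centers])
y_train = np.repeat(labels, 20)
X_test = np.vstack([c + rng.normal(scale=0.5, size=(10, 3)) for c in centers])
y_test = np.repeat(labels, 10)

model = fit_lda(X_train, y_train)
accuracy = np.mean(predict_lda(model, X_test) == y_test)
```

With four classes the sketch yields three discriminant functions, and `model[3]` gives the share of discriminating variance carried by each.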
First, the original element content values were treated with three data pre-processing methods: auto-scaling, scale standard, and logarithm processing. Put simply, auto-scaling centers each column of the original data set X to mean 0 and scales it to standard deviation 1; scale standard normalizes the original data set X using the minimum and maximum values of each row; and logarithm processing takes the logarithm of the original data. Data pre-processing eliminates the effect of differing orders of magnitude while retaining all the information in the original data. After pre-processing, the 14 elemental content values were used in the LDA analysis, and the distribution of the 74 training set samples is shown in Fig. 4. All samples were clearly separated based on function 1 and function 2, and logarithm processing gave the best classification result. Therefore, logarithm processing was used in this work, and three discriminant functions explained 100% of the variance (function 1 explained 77.95% of the total variance, function 2 19.42%, and function 3 2.63%). Function 1 and function 2 are as follows:
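$$F_1 = 0.2291\,\mathrm{K} - 0.2412\,\mathrm{Na} - 0.4323\,\mathrm{Ca} - 0.5638\,\mathrm{P} - 0.2127\,\mathrm{Mg} - 0.1095\,\mathrm{Al} + 0.0941\,\mathrm{B} + 0.2666\,\mathrm{Ba} + 0.1668\,\mathrm{Cu} + 0.4321\,\mathrm{Fe} - 0.0922\,\mathrm{Mn} - 0.0572\,\mathrm{Ni} \tag{1}$$
$$F_2 = -0.4254\,\mathrm{K} + 0.1505\,\mathrm{Na} - 0.0419\,\mathrm{Ca} - 0.6032\,\mathrm{P} + 0.1015\,\mathrm{Mg} - 0.2247\,\mathrm{Al} - 0.0473\,\mathrm{B} - 0.0518\,\mathrm{Ba} + 0.2672\,\mathrm{Cu} + 0.4900\,\mathrm{Fe} + 0.1391\,\mathrm{Mn} - 0.1052\,\mathrm{Ni} + 0.1520\,\mathrm{Zn} + 0.0076\,\mathrm{Sr} \tag{2}$$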
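The three pre-processing options described before the discriminant functions can be sketched as follows. Several details are assumptions made for the sketch: the sample standard deviation (`ddof=1`) in auto-scaling, a 0 to 1 target range for the row-wise min-max scaling, and a base-10 logarithm. The function names and the small matrix are hypothetical.

```python
import numpy as np

def autoscale(X):
    """Center each column to mean 0 and scale it to standard deviation 1."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def minmax_rows(X):
    """Rescale each row so its minimum maps to 0 and its maximum to 1 (assumed range)."""
    X = np.asarray(X, dtype=float)
    lo = X.min(axis=1, keepdims=True)
    hi = X.max(axis=1, keepdims=True)
    return (X - lo) / (hi - lo)

def log_transform(X):
    """Base-10 logarithm (assumed base) of strictly positive content values."""
    return np.log10(np.asarray(X, dtype=float))

# Hypothetical contents (μg g⁻¹) for three samples and three elements.
X = np.array([[10000.0, 10.0, 100.0],
              [20000.0, 20.0, 400.0],
              [30000.0, 60.0, 900.0]])
X_auto, X_minmax, X_log = autoscale(X), minmax_rows(X), log_transform(X)
```

All three transforms bring elements that differ by orders of magnitude, such as K and Sr, onto comparable scales; the logarithm does so while preserving the ordering of values within each element.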
It should be noted that the most effective elements for discriminating the geographical origin of LHG samples are those with the largest absolute coefficients in the discriminant functions. For function 1, these were P, Ca, and Fe, with absolute coefficients of 0.5638, 0.4323, and 0.4321, respectively. Similarly, for function 2, these were P, Fe, and K, with absolute coefficients of 0.6032, 0.4900, and 0.4254, respectively. Because function 1 and function 2 together accounted for 97.37% of the total variance, the most effective elements for discriminating the geographical origin of LHG were P, Fe, Ca, and K.
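The ranking of elements can be reproduced directly from the printed coefficients. In the sketch below, only the terms that appear in Eqs. (1) and (2) are used, the Zn and Sr terms are taken to belong to Eq. (2) (an assumption based on their position), and `top_elements` is a hypothetical helper.

```python
# Coefficients as printed in Eqs. (1) and (2).
F1 = {"K": 0.2291, "Na": -0.2412, "Ca": -0.4323, "P": -0.5638, "Mg": -0.2127,
      "Al": -0.1095, "B": 0.0941, "Ba": 0.2666, "Cu": 0.1668, "Fe": 0.4321,
      "Mn": -0.0922, "Ni": -0.0572}
# Zn and Sr are assumed to belong to Eq. (2).
F2 = {"K": -0.4254, "Na": 0.1505, "Ca": -0.0419, "P": -0.6032, "Mg": 0.1015,
      "Al": -0.2247, "B": -0.0473, "Ba": -0.0518, "Cu": 0.2672, "Fe": 0.4900,
      "Mn": 0.1391, "Ni": -0.1052, "Zn": 0.1520, "Sr": 0.0076}

def top_elements(coefficients, n=3):
    """Return the n elements with the largest absolute coefficients."""
    ranked = sorted(coefficients, key=lambda el: abs(coefficients[el]), reverse=True)
    return ranked[:n]

top_f1 = top_elements(F1)  # ['P', 'Ca', 'Fe']
top_f2 = top_elements(F2)  # ['P', 'Fe', 'K']
```

The union of the two lists gives P, Fe, Ca, and K.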
Five-fold cross validation was used to evaluate the classification ability of LDA for the 74 samples in the training set. As shown in Table 3, the accuracy for the training set was 100%, and all samples were correctly grouped according to their geographical origin. For the test set, 40 unknown samples were used to evaluate the predictive ability of the LDA classifier. The samples were projected onto function 1 and function 2. As shown in Fig. 5, almost all of the test set samples were properly classified on these two functions. Although there is a slight overlap between the samples from Guangxi and Jiangxi, function 3 further improves the classification. In the end, all test set samples were correctly classified by LDA; the results are shown in Table 3. In addition, LDA based on the other data pre-processing methods also provided acceptable classification ability, with accuracies ranging from 95% to 97.5%. The original data without any pre-processing gave the worst result (90%). Details can be found in Table S2 in the supplementary material.
Table 3 The discrimination results of different models for the training set and the test set
| Geographical origins | Number of samples, training set | Number of samples, test set | LDA accuracy (%), training set | LDA accuracy (%), test set | k-NN accuracy (%), training set | k-NN accuracy (%), test set | SVM accuracy (%), training set | SVM accuracy (%), test set |
|---|---|---|---|---|---|---|---|---|
| Guangxi | 20 | 10 | 100 | 100 | 95 | 100 | 100 | 100 |
| Jiangxi | 14 | 10 | 100 | 100 | 93 | 100 | 100 | 100 |
| Hunan | 20 | 10 | 100 | 100 | 100 | 100 | 95 | 100 |
| Guizhou | 20 | 10 | 100 | 100 | 100 | 90 | 100 | 100 |
| Total accuracy (%) | | | 100 | 100 | 97 | 97.5 | 98.7 | 100 |
Besides LDA, k-NN and SVM were used to develop classification models, and the classification results are shown in Table 3 (the accuracy for the training set was calculated using five-fold cross validation). k-NN is a distance-based pattern recognition technique that is easy to understand and implement. In short, it assigns an unknown sample to the class most common among its k nearest neighbours. In this work, Euclidean distance was used, and the number of neighbours k was optimized by five-fold cross validation with maximum accuracy as the criterion. SVM uses a nonlinear mapping to transform the original training data into a higher-dimensional space and then finds a separating hyperplane, defined by support vectors and margins, to classify the data. The three models showed different degrees of success, with LDA and SVM performing better than k-NN. Across the three methods, the classification accuracy was 97.3% to 100% for the training set and 97.5% to 100% for the test set (Table 3). These results demonstrate that element content is an effective basis for classifying LHG samples according to geographical origin.
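The k-NN procedure, with k chosen by five-fold cross validation, can be sketched using only the Python standard library. The candidate values of k, the interleaved fold assignment, the tie-breaking toward the first candidate, and the two synthetic clusters are all assumptions made for illustration.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k training samples closest to x (Euclidean)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cv_accuracy(X, y, k, n_folds=5, seed=0):
    """Fraction of samples classified correctly across the held-out folds."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    correct = 0
    for f in range(n_folds):
        val = order[f::n_folds]                 # every n_folds-th shuffled index
        held_out = set(val)
        train = [i for i in order if i not in held_out]
        tX, ty = [X[i] for i in train], [y[i] for i in train]
        correct += sum(knn_predict(tX, ty, X[i], k) == y[i] for i in val)
    return correct / len(X)

def select_k(X, y, candidates=(1, 3, 5, 7, 9)):
    """Return the candidate k with the highest cross-validated accuracy."""
    return max(candidates, key=lambda k: cv_accuracy(X, y, k))

# Two synthetic, well-separated classes (not study data).
rng = random.Random(1)
X = [[rng.gauss(c, 0.3), rng.gauss(c, 0.3)] for c in (0.0, 5.0) for _ in range(20)]
y = [label for label in ("A", "B") for _ in range(20)]
best_k = select_k(X, y)
```

With real element contents the features would normally be pre-processed first, since Euclidean distance is otherwise dominated by the most abundant elements.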