3.1. Pretreatment of chromatograms and spectra
FTIR spectra of M. cocos (Fig. 2) provided structural information on the mixture, including bands of C=O, C=CH2, C-O, C-OH, O-H, C-C and C-H. The variables in the 2670–1750 cm⁻¹ and 4000–3700 cm⁻¹ regions were excluded after spectral pretreatment, for two reasons. Firstly, there was no absorption in these regions. Secondly, a wavenumber with a VIP score greater than one is customarily considered helpful for assigning each class correctly [29]. In the VIP plot of the PLS-DA model of the FTIR data for wild samples (Fig. 3), the VIP values in the 2670–1750 and 4000–3700 cm⁻¹ regions (rectangular areas in Fig. 3) were irregular and mostly greater than one, which indicated the presence of chemical interference. Horn et al. [35] reported that the signal at 2670–1750 cm⁻¹ was caused by the crystal material of the ATR accessory.
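The exclusion of the two spectral regions can be sketched as a simple mask over the wavenumber axis. This is a minimal illustration rather than the procedure used in this work: the helper names, the toy axis and the inclusive treatment of the region boundaries are assumptions.

```python
# Regions excluded in the text, in cm^-1 (boundaries treated as inclusive: an assumption).
EXCLUDED_REGIONS = [(1750.0, 2670.0), (3700.0, 4000.0)]


def keep_mask(wavenumbers, excluded=EXCLUDED_REGIONS):
    """Return True for every wavenumber that lies outside all excluded regions."""
    return [
        not any(low <= w <= high for low, high in excluded)
        for w in wavenumbers
    ]


def drop_regions(wavenumbers, spectrum, excluded=EXCLUDED_REGIONS):
    """Remove the excluded variables from one spectrum (same length as wavenumbers)."""
    mask = keep_mask(wavenumbers, excluded)
    kept_w = [w for w, keep in zip(wavenumbers, mask) if keep]
    kept_y = [y for y, keep in zip(spectrum, mask) if keep]
    return kept_w, kept_y


# Toy axis from 4000 down to 400 cm^-1 in steps of 400 (illustrative values only).
axis = [4000.0 - 400.0 * i for i in range(10)]
absorbance = [0.01 * i for i in range(10)]
kept_axis, kept_abs = drop_regions(axis, absorbance)
```

Applying the same mask to every spectrum keeps the columns of the data matrix aligned across samples.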
By comparison with the retention times of reference substances, the elution order of the M. cocos constituents was found to be dehydrotumulosic acid, poricoic acid A, 3-epidehydrotumulosic acid, dehydropachymic acid, pachymic acid and, finally, dehydrotrametenolic acid. Pachymic acid appeared clearly in the chromatogram at 210 nm, whereas the others appeared in that at 242 nm (Fig. S1). Based on previous work, the precision, stability, repeatability and recovery of the chromatographic method were evaluated [14] using dehydrotumulosic acid, poricoic acid A, dehydropachymic acid, pachymic acid and dehydrotrametenolic acid, which were well separated. All relative standard deviation values were lower than 5.95% and the recovery rates ranged from 96.32% to 106.4%, indicating that the method was reliable. The correlation coefficients of the calibration curves of the five reference compounds were higher than 0.99, indicating good linearity, and the method was therefore deemed accurate. The limit of quantification (LOQ) and limit of detection (LOD), determined by serially diluting the standard solution until the signal-to-noise ratio reached 10 and 3, respectively, together with the regression equations, correlation coefficients and linear ranges of the five reference compounds, are shown in Table S2. The fingerprints at 242 nm (Fig. 4), which presented a relatively smooth baseline, were chosen for further analysis.
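For reference, the two validation figures quoted above are conventionally computed as follows. This is a sketch using the usual textbook definitions; the replicate peak areas and spiking amounts are hypothetical and are not data from this study.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation (%) = sample standard deviation / mean * 100."""
    return stdev(values) / mean(values) * 100.0


def recovery_percent(found, original, spiked):
    """Recovery (%) = (amount found - amount originally present) / amount added * 100."""
    return (found - original) / spiked * 100.0


# Hypothetical peak areas from six replicate injections.
peak_areas = [1520.0, 1498.0, 1533.0, 1507.0, 1515.0, 1526.0]
rsd = rsd_percent(peak_areas)

# Hypothetical spiking experiment, amounts in mg.
rec = recovery_percent(found=1.98, original=1.00, spiked=1.00)
```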
Both the FTIR and LC data were pretreated with the Savitzky-Golay (SG) polynomial second-derivative method to highlight fingerprint differences and to eliminate baseline drift. Compared with the raw data, the PLS-DA models built on SG second-derivative data gave higher accuracy and efficiency values (Table S3), indicating that the pretreatment was effective.
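The effect of the SG second derivative on baseline drift can be illustrated with the standard 5-point quadratic convolution weights (2, -1, -2, -1, 2)/7. The window width and polynomial order are assumptions made for this sketch, since they are not stated here; the test signal is synthetic.

```python
# Savitzky-Golay second derivative, 5-point window, quadratic polynomial.
# Window width and polynomial order are assumptions for illustration.
SG_D2_COEFFS = (2.0, -1.0, -2.0, -1.0, 2.0)
SG_D2_NORM = 7.0


def sg_second_derivative(signal, spacing=1.0):
    """Return the SG second derivative at interior points.

    The two points at each end are dropped because the 5-point window
    does not fit there, so the output is 4 samples shorter than the input.
    """
    half = len(SG_D2_COEFFS) // 2
    out = []
    for i in range(half, len(signal) - half):
        window = signal[i - half:i + half + 1]
        acc = sum(c * y for c, y in zip(SG_D2_COEFFS, window))
        out.append(acc / (SG_D2_NORM * spacing ** 2))
    return out


# Synthetic signal: a parabola on a sloping, offset baseline, y = 0.5*x^2 + 3*x + 10.
xs = [0.5 * k for k in range(9)]
ys = [0.5 * x * x + 3.0 * x + 10.0 for x in xs]
d2 = sg_second_derivative(ys, spacing=0.5)
```

Because the weights sum to zero and are symmetric, the constant offset and the linear slope of the baseline vanish, and only the curvature of the signal remains.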
3.2. Comparison of cultivated and wild samples
PLS-DA was performed with wild and cultivated as the two class labels. In the two-dimensional score scatter plots of all samples (Fig. 5A-B), the wild samples were located in the bottom left and the cultivated ones in the top right corner, indicating a large difference between the two groups. Moreover, wild samples differed significantly from cultivated ones in the contents of the five key chemical components (Fig. 5C) (P < 0.05). Accordingly, origin identification should be performed separately for the cultivated and the wild samples.
3.3. Quantitative analysis of samples from different origins
These triterpenes show a wide range of bioactivities, and their presence and quantity have a vital influence on the health effects of M. cocos. The contents of the five compounds were presented as box plots with medians (lines in the boxes). For wild samples (Fig. S2), Dali (DL) showed a smaller amount of poricoic acid A than the other four places (P < 0.05). Baoshan (BS) had a higher content of dehydropachymic acid than the remaining collection locations and a greater pachymic acid content than Chuxiong (CX). Compared with DL and BS, Yuxi (YX) had a higher concentration of dehydrotrametenolic acid. Pu’er (PE) was significantly different from DL in the concentration of dehydrotrametenolic acid. The cultivated samples from BS were significantly different from those from the other geographical origins in the contents of dehydrotumulosic acid, poricoic acid A and dehydropachymic acid. Furthermore, for cultivated samples, both CX and YX were significantly different from DL and PE in dehydrotumulosic acid, from DL in dehydropachymic acid, and from BS, DL and PE in pachymic acid. These quantitative results for the five bioactive analytes provide a valuable reference for differentiating samples from different geographical regions and for evaluating the quality of M. cocos.
3.4. PLS-DA and RF classification models of single sets
Parameter selection is an important step in machine learning methods. The number of latent variables in PLS-DA was determined by the default 7-fold cross-validation. For the random forest (RF) models, the two essential parameters, ntree and mtry, were selected on the basis of low out-of-bag (OOB) error values. For cultivated samples, the optimal ntree and mtry were 118 and 33 for FTIR data and 178 and 48 for LC data, respectively. For wild samples, the final ntree and mtry were 316 and 37 in the FTIR model and 82 and 48 in the LC model, respectively (Fig. 6).
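The parameter search can be pictured as choosing the (ntree, mtry) pair with the lowest OOB error. The sketch below is only a schematic of that idea: `oob_error` is a hypothetical callable standing in for training a forest and reading its OOB error, the exhaustive grid is an assumption about the search order, and the toy error surface is invented.

```python
def select_rf_parameters(oob_error, ntree_grid, mtry_grid):
    """Return (ntree, mtry, error) with the lowest OOB error over the grid.

    `oob_error` is a hypothetical callable that trains a forest with the
    given settings and returns its out-of-bag error rate.
    """
    best = None
    for ntree in ntree_grid:
        for mtry in mtry_grid:
            err = oob_error(ntree, mtry)
            if best is None or err < best[2]:
                best = (ntree, mtry, err)
    return best


def toy_oob_error(ntree, mtry):
    """Invented error surface with its minimum at ntree=300, mtry=40."""
    return 0.10 + abs(ntree - 300) / 10000.0 + abs(mtry - 40) / 1000.0


best_ntree, best_mtry, best_err = select_rf_parameters(
    toy_oob_error, ntree_grid=range(100, 501, 100), mtry_grid=range(10, 61, 10)
)
```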
The results of independent decision making are shown in Table 1. Both the PLS-DA and RF models showed that cultivated samples from different geographical origins could be discriminated easily, with total accuracy rates of 95.24% or 100% in the validation set. The wild samples had lower classification accuracy than the cultivated ones. In particular, Class 1 and Class 2 were difficult to distinguish, as both showed relatively low efficiency values in the calibration and validation sets for FTIR and LC data. Thus, to obtain better results for wild samples, the feasibility of combining the information from FTIR and LC was investigated using low-, mid- and high-level data fusion strategies.
Table 1
The classification efficiency and total accuracy rate of independent decision making.
Cal. = calibration set; Val. = validation set. Class columns give efficiency values.

| Data source | Model | Cal. Class 1 | Cal. Class 2 | Cal. Class 3 | Cal. Class 4 | Cal. Class 5 | Cal. total accuracy (%) | Val. Class 1 | Val. Class 2 | Val. Class 3 | Val. Class 4 | Val. Class 5 | Val. total accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LC-wild | PLS-DA | 1 | 0.91 | 0.98 | 1 | 1 | 97.50% | 0.71 | 0.87 | 0.94 | 0.97 | 0.88 | 80.95% |
| | RF | 0.50 | 0.40 | 0.97 | 0.91 | 0.82 | 70% | 0.71 | 0.50 | 1 | 0.97 | 0.86 | 76.19% |
| FTIR-wild | PLS-DA | 1 | 1 | 1 | 1 | 1 | 100% | 0.66 | 0.66 | 1 | 1 | 1 | 80.95% |
| | RF | 0.81 | 0.40 | 0.98 | 1 | 0.98 | 82.50% | 0.69 | 0.84 | 1 | 0.97 | 1 | 85.71% |
| LC-cultivated | PLS-DA | 0.95 | 1 | 0.99 | 1 | 1 | 97.56% | 1 | 1 | 1 | 1 | 1 | 100% |
| | RF | 0.87 | 0.93 | 0.65 | 1 | 1 | 85.37% | 0.97 | 1 | 0.82 | 1 | 1 | 95.24% |
| FTIR-cultivated | PLS-DA | 1 | 1 | 1 | 1 | 1 | 100% | 1 | 1 | 1 | 1 | 1 | 100% |
| | RF | 0.93 | 1 | 0.85 | 0.70 | 1 | 87.80% | 1 | 1 | 1 | 1 | 1 | 100% |
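The per-class efficiency and the total accuracy reported in Table 1 can be computed from true and predicted labels as sketched below. The definition of efficiency as the geometric mean of sensitivity and specificity is an assumption (it is the usual chemometric definition but is not restated in this section), and the labels are hypothetical.

```python
from math import sqrt


def class_metrics(true_labels, predicted_labels, target):
    """Sensitivity, specificity and efficiency for one class (one-vs-rest)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    tn = sum(1 for t, p in pairs if t != target and p != target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # Assumed definition: efficiency = sqrt(sensitivity * specificity).
    return sensitivity, specificity, sqrt(sensitivity * specificity)


def total_accuracy(true_labels, predicted_labels):
    """Percentage of samples assigned to their true class."""
    hits = sum(1 for t, p in zip(true_labels, predicted_labels) if t == p)
    return 100.0 * hits / len(true_labels)


# Hypothetical validation set of 10 samples in three classes.
y_true = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 2, 2, 3, 3, 3]
sens1, spec1, eff1 = class_metrics(y_true, y_pred, target=1)
acc = total_accuracy(y_true, y_pred)
```

Under this definition a class can reach a specificity of 1 and still have a low efficiency when many of its own samples are assigned elsewhere, which is the pattern seen for Class 1 and Class 2 of the wild samples.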
3.5. PLS-DA and RF classification models of low-, mid- and high-level data fusion
In the low-level strategy, the preprocessed chromatographic and spectral data were directly concatenated into a new matrix. In this work, the size of the low-level fusion matrix was 61 × 3484. As in independent decision making, the optimal PLS-DA and RF models were built on the low-level fused data with suitable parameters (Fig. S3). Table 2 shows that the total accuracy rates of the validation set for the PLS-DA and RF models (both 76.19%) were no higher than those of the single-set analyses; the low-level data fusion strategy was therefore unsatisfactory. The main drawback of low-level fusion is that adding raw, noisy and correlated data can worsen the classification results [36]. Hence, one possible reason why low-level fusion performed worse than single-block analysis is that both the LC and FTIR data blocks contained correlated variables (such as the information on triterpenes) or noise.
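Low-level fusion amounts to joining the two blocks column-wise for each sample, so the fused matrix has one row per sample and as many columns as both blocks together. A minimal sketch follows; the function name and the toy block sizes are illustrative assumptions and do not reflect the real 61 × 3484 matrix.

```python
def low_level_fusion(block_a, block_b):
    """Concatenate two data blocks sample by sample (column-wise join)."""
    if len(block_a) != len(block_b):
        raise ValueError("both blocks must contain the same samples in the same order")
    return [list(row_a) + list(row_b) for row_a, row_b in zip(block_a, block_b)]


# Toy blocks: 3 samples, 4 chromatographic and 2 spectral variables.
lc_block = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8], [0.9, 1.0, 1.1, 1.2]]
ftir_block = [[10.0, 11.0], [12.0, 13.0], [14.0, 15.0]]
fused = low_level_fusion(lc_block, ftir_block)
shape = (len(fused), len(fused[0]))
```

The same join applies in mid-level fusion, with the difference that only the selected variables or the principal component scores of each block are passed in.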
Table 2
The classification efficiency and total accuracy rates of different data fusion strategies in wild samples.
Cal. = calibration set; Val. = validation set. Class columns give efficiency values.

| Data source | Model | Cal. Class 1 | Cal. Class 2 | Cal. Class 3 | Cal. Class 4 | Cal. Class 5 | Cal. total accuracy (%) | Val. Class 1 | Val. Class 2 | Val. Class 3 | Val. Class 4 | Val. Class 5 | Val. total accuracy (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Low-level | PLS-DA | 1 | 1 | 1 | 1 | 1 | 100% | 0.47 | 0.71 | 0.97 | 0.97 | 0.97 | 76.19% |
| | RF | 0.82 | 0.70 | 0.98 | 0.91 | 0.98 | 85.00% | 0.64 | 0.50 | 1 | 0.97 | 0.97 | 76.19% |
| Mid-level-PCA | PLS-DA | 0.85 | 0.91 | 0.98 | 1 | 0.98 | 92.50% | 0.50 | 0.84 | 1 | 0.97 | 0.93 | 80.95% |
| | RF | 0.60 | 0.71 | 0.98 | 0.91 | 0.86 | 77.50% | 0.50 | 0.49 | 1 | 0.97 | 0.86 | 71.43% |
| Mid-level-Boruta | PLS-DA | 0.94 | 0.99 | 1 | 1 | 1 | 97.50% | 0.97 | 0.87 | 1 | 1 | 1 | 95.24% |
| | RF | 0.75 | 0.68 | 1 | 1 | 1 | 85% | 0.84 | 0.84 | 1 | 1 | 1 | 90.48% |
| High-level-PCA | PLS-DA | 0.98 | 0.91 | 1 | 1 | 1 | 97.50% | 0.49 | 0.69 | 0.97 | 0.97 | 0.97 | 76.19% |
| | RF | 0.70 | 0.80 | 0.98 | 0.91 | 0.92 | 82.50% | 0.50 | 0.87 | 1 | 1 | 0.86 | 80.95% |
| High-level-Boruta | PLS-DA | 1 | 1 | 1 | 1 | 1 | 100% | 0.50 | 0.69 | 1 | 0.91 | 0.97 | 76.19% |
| | RF | 0.92 | 0.90 | 1 | 1 | 1 | 95% | 0.71 | 0.84 | 1 | 0.97 | 0.97 | 85.71% |
In mid-level data fusion, the variables selected by Boruta from the LC and FTIR data (green lines in Fig. S4) were concatenated into one dataset, named Mid-level-Boruta. Likewise, the principal components (PCs) from the LC and FTIR data were combined into a dataset named Mid-level-PCA. The first ten PCs, which described 64.09% of the variance of the LC variables, and the first nine PCs, which accounted for 79.12% of the variance of the FTIR variables, were extracted. The ntree and mtry screening for the random forest models of Mid-level-PCA and Mid-level-Boruta is displayed in Fig. S3. Boruta was more effective than PCA for feature extraction, because the Mid-level-Boruta dataset gave higher validation-set efficiency and accuracy than Mid-level-PCA in both the PLS-DA and RF models. Moreover, the Mid-level-Boruta models were superior to the low- and high-level data fusion models and to the individual techniques, with the highest validation-set accuracies (95.24% for PLS-DA and 90.48% for RF). Its PLS-DA model, which also had a satisfactory calibration-set accuracy (97.50%), was deemed the most suitable for geographical identification of wild samples. Variables with VIP scores greater than one (red dashed line) were present in each data block (Fig. 7), indicating that FTIR and LC were complementary for identifying the origin of the samples.
In high-level fusion, the class votes output by each individual model for the calibration and validation sets were combined with four fuzzy aggregation connective operators, followed by a majority vote. As an example (Table S4), the true class of sample No. 10 was Class 1; it was identified as Class 1 by the FTIR random forest model and as Class 2 by the LC random forest model, while the voting result based on fuzzy set theory was Class 1. Two types of high-level data fusion were performed, i.e. High-level-PCA and High-level-Boruta. The parameter screening of their random forest models is shown in Fig. S5. The random forest models had higher validation-set efficiency for Class 1 and Class 2 than the PLS-DA models. Nevertheless, Class 1 and Class 2 remained difficult to distinguish.
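The decision-level fusion described above can be sketched as follows. The four operators used here (minimum, maximum, product and average) are an assumption, since they are not named in this section, and the vote fractions are hypothetical values chosen to mirror the situation of a sample on which the two models disagree.

```python
from collections import Counter
from math import prod
from statistics import mean

# Assumed set of four fuzzy aggregation operators.
OPERATORS = {
    "minimum": min,
    "maximum": max,
    "product": prod,
    "average": mean,
}


def fuse(model_outputs, operator):
    """Aggregate per-class scores from several models and return the winning class.

    `model_outputs` is a list of dicts mapping class label -> score in [0, 1].
    """
    classes = model_outputs[0].keys()
    fused = {c: operator([m[c] for m in model_outputs]) for c in classes}
    return max(fused, key=fused.get), fused


# Hypothetical vote fractions for one sample from two random forest models.
ftir_votes = {"Class 1": 0.70, "Class 2": 0.20, "Class 3": 0.10}
lc_votes = {"Class 1": 0.40, "Class 2": 0.45, "Class 3": 0.15}

# One decision per operator, then a majority vote over the four decisions.
decisions = {name: fuse([ftir_votes, lc_votes], op)[0] for name, op in OPERATORS.items()}
final_class = Counter(decisions.values()).most_common(1)[0][0]
```

In this toy case the LC model alone would assign Class 2, but every operator favours Class 1 because the FTIR model is more confident, so the fused decision is Class 1.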
Because the calibration-set accuracy of the PLS-DA models was usually much higher than the validation-set accuracy, all PLS-DA models were validated by a permutation test with 30 iterations to assess the risk that a model was spurious. As shown in Fig. S6, the regression line of Q² (the predictive squared correlation coefficient) intersected the vertical axis at or below zero, which suggested that the PLS-DA models were not overfitted.
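The intercept criterion can be checked by fitting a straight line through the Q² values of the permuted models. The sketch below assumes the usual layout of such a plot, with the correlation between permuted and original class labels on the horizontal axis; the numbers are hypothetical and only six points are shown instead of the 30 permutations.

```python
def fit_line(x, y):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical results: x = correlation between permuted and original class
# labels, y = Q2 of the model refitted on the permuted labels. The last point
# (x = 1.0) is the unpermuted model.
correlation = [0.0, 0.1, 0.2, 0.3, 0.4, 1.0]
q2_values = [-0.30, -0.22, -0.10, -0.02, 0.05, 0.60]

intercept, slope = fit_line(correlation, q2_values)
passes = intercept <= 0.0  # criterion from the text: intercept at or below zero
```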