2.1. Chemicals and sample preparation
HPLC grade acetonitrile was purchased from Thermo Fisher Scientific (Fair Lawn, NJ, USA). Formic acid was purchased from Dikma Technologies (Lake Forest, CA, USA). Purified water was purchased from Guangzhou Watsons Food & Beverage Co., Ltd (Guangzhou, China). All other chemicals and reagents were of analytical grade. The standard compounds (pachymic acid, dehydropachymic acid, poricoic acid A, dehydrotrametenolic acid and 3-epidehydrotumulosic acid) (purity ≥ 98%) were supplied by Beijing Keliang Technology Co., Ltd (Beijing, China). Dehydrotumulosic acid (purity ≥ 96%) was purchased from ANPEL Laboratory Technologies Inc. (Shanghai, China). The concentration ranges of the standard solutions prepared for each analyte were as follows (mg·L-1): dehydrotumulosic acid: 5.00–999; poricoic acid A: 0.22–6730; 3-epidehydrotumulosic acid: 1–100; dehydropachymic acid: 2.4–480; pachymic acid: 10.3–1240; dehydrotrametenolic acid: 0.49–2450.
Both wild and cultivated samples (123 in total) were collected from Yuxi, Pu'er, Dali, Chuxiong and Baoshan in Yunnan Province, China; detailed information is given in Table S1. All samples were identified as Macrohyporia cocos (Schwein.) I. Johans. & Ryvarden by Professor Yuanzhong Wang (Institute of Medicinal Plant, Yunnan Academy of Agricultural Sciences, Kunming, China). For fresh sclerotia, the attached soil was brushed away and washed off with tap water. The samples were then air-dried in the shade with good ventilation. The dark epidermis was removed, and the white inner part was subsequently powdered for analysis. The powder was passed through a 60-mesh sieve, and all samples were stored in resealable polyethylene bags until analysis.
For LC analysis, accurately weighed powder (0.5000 ± 0.0001 g) was ultrasonically extracted in 2.0 mL of methanol for 40 min. The extract was filtered through a 0.22 μm membrane filter, and the filtrate was placed in the autosampler and injected into the LC system. For attenuated total reflectance FTIR spectral acquisition, the sample powder was used directly.
2.2. Chromatographic analysis
LC analyses were performed with an ultra-fast liquid chromatography system (Shimadzu, Japan) equipped with a UV detector, a thermostatic column compartment, an autosampler, a degasser and binary gradient pumps. The separation was carried out on an Inertsil ODS-HL HP column (3.0 × 150 mm, 3 μm) operated at 40 °C. The mobile phase consisted of acetonitrile (A) and 0.05% formic acid (B). The flow rate was kept at 0.4 mL·min-1 and the injection volume was set at 7 μL. The signals were acquired at 242 and 210 nm. Before use, the mobile phase constituents were degassed by ultrasonication and filtered through a 0.2 μm filter. The samples were eluted with the following gradient: 40% A (0.00 min→25.00 min), 40%→69% A (25.00 min→52.00 min), 69%→72% A (52.00 min→56.00 min), 72%→78% A (56.00 min→58.00 min), 78%→90% A (58.00 min→58.01 min), and 90% A (58.01 min→60 min). Each run was followed by an equilibration period of three minutes with initial conditions (40% A and 60% B).
2.3. Spectral acquisition
An FTIR spectrometer (Perkin Elmer, USA) equipped with a deuterated triglycine sulfate (DTGS) detector and an attenuated total reflectance (ATR) sampling accessory was used to record the FTIR spectra. The resolution and scan range were set to 4 cm-1 and 4000–650 cm-1, respectively. Each sample was scanned 16 times in succession, and an air spectrum was recorded for background correction. The experiments were performed at constant temperature (25 °C) and relative humidity (30%).
2.4. Data processing and analysis
2.4.1. Pretreatment of chromatograms and spectra
Retention times in chromatograms drift with time and other factors. For this reason, the correlation optimized warping algorithm [23] was used to correct retention time shifts among samples. To save computation time, the corrected chromatographic data were reduced by retaining one point in every three, without affecting the chromatographic features. In addition, all original FTIR spectra were subjected to advanced ATR correction in OMNIC 9.7.7 (Thermo Fisher Scientific, USA). The spectral bands at 4000–3700 cm-1 and 2670–1750 cm-1 were dominated by noise, so the variables in both ranges were discarded. Because both chromatograms and spectra contained overlapping peaks and baseline shifts, a Savitzky-Golay (SG) second-derivative filter (second-order polynomial, 15-point window) was applied to highlight subtle differences and eliminate the interference of baseline drift. Note that the deletion of spectral variables was performed after the SG second-derivative preprocessing.
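The SG second-derivative step can be sketched as follows; SciPy's `savgol_filter` is used here as a hypothetical stand-in for the SIMCA implementation, with the same second-order polynomial and 15-point window as above:

```python
# Savitzky-Golay second-derivative preprocessing (minimal sketch; SciPy is
# a stand-in for the SIMCA software actually used in the study).
import numpy as np
from scipy.signal import savgol_filter

def sg_second_derivative(spectra, window=15, polyorder=2, delta=1.0):
    """Apply a 15-point, second-order-polynomial SG second-derivative
    filter row-wise to a (samples x variables) data matrix."""
    return savgol_filter(spectra, window_length=window, polyorder=polyorder,
                         deriv=2, delta=delta, axis=1)

# Demo: the second derivative of y = a*x^2 is the constant 2a everywhere,
# so an exact filter should return flat rows.
x = np.arange(0, 10, 0.1)
spectra = np.vstack([x**2, 3 * x**2])   # two synthetic "spectra"
d2 = sg_second_derivative(spectra, delta=0.1)
```

Because a quadratic is fitted exactly by a second-order polynomial, the filtered rows are constant at 2 and 6, which is a convenient sanity check before applying the filter to real chromatograms or spectra.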
Data matrices are written as (m × n), where m is the number of samples and n the number of variables (retention-time points of a chromatogram or wavenumbers of a spectrum). Taking the wild samples as an example, the initial chromatographic matrix (61 × 7201) was reduced to (61 × 2387) after pretreatment, and the raw spectral matrix (61 × 1737) to (61 × 1097). The processed data matrices were then used for PLS-DA, random forest or data fusion.
2.4.2. Data fusion
Data fusion strategies, which integrate complementary information from multiple sources, are expected to characterize samples more accurately than any single technique. Here, the LC and FTIR data of the same sample were combined, and three levels of data fusion were studied: low, mid and high. Low-level fusion is conceptually simple and easy to implement: the preprocessed datasets are directly concatenated into one matrix, whose number of variables equals the sum of the variable numbers of the individual datasets. The key step of mid-level fusion is to extract relevant features from each dataset independently and then concatenate them into a new matrix used for multivariate analysis. In high-level fusion, a separate model is built on each dataset, and the outputs of the individual models are integrated into a final judgement using fuzzy set theory [24]. In brief, the final decision depends on a majority vote of four fuzzy aggregation connective operators (maximum, minimum, average and product). The specific schemes of the data fusion process in this study are presented in Fig. 1.
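Low-level fusion amounts to column-wise concatenation of sample-aligned blocks. A minimal sketch, using the wild-sample matrix sizes reported above and random placeholder data:

```python
# Low-level data fusion: column-wise concatenation of the preprocessed LC
# and FTIR blocks. The matrices here are random placeholders with the wild-
# sample dimensions from this study (61 x 2387 and 61 x 1097).
import numpy as np

rng = np.random.default_rng(0)
lc = rng.normal(size=(61, 2387))    # preprocessed chromatographic block
ftir = rng.normal(size=(61, 1097))  # preprocessed spectral block

# Row i of both blocks must describe the same sample before concatenation.
fused = np.hstack([lc, ftir])       # -> (61, 2387 + 1097) = (61, 3484)
```

In practice the blocks are often scaled before concatenation so that the larger block does not dominate the fused model; the paper's preprocessing (second derivatives) serves a similar purpose.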
Feature extraction can reduce computation time and improve accuracy in practical model building [25]. Two feature extraction methods were used: (1) PC extraction, using principal component analysis for dimension reduction; the resulting PCs are a small number of new variables that nevertheless describe a large proportion of the original information [26], and their number was determined by the 7-fold cross-validation procedure of the SIMCA software. (2) Variable selection, using the Boruta algorithm; Boruta is an RF-based feature selection method that unbiasedly and stably separates important from unimportant variables in an information system [25]. Variables marked as "confirmed" or "tentative" were regarded as important features and retained.
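The PC-extraction route can be sketched as below; scikit-learn's PCA is a hypothetical stand-in for SIMCA, and the number of components is fixed here rather than chosen by 7-fold cross-validation as in the study:

```python
# Mid-level feature extraction via PCA (sketch only; scikit-learn replaces
# SIMCA, and n_components is fixed instead of cross-validated).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(61, 1097))   # one preprocessed data block (placeholder)

pca = PCA(n_components=10).fit(X)
scores = pca.transform(X)         # (61, 10) PC scores = extracted features
```

The score matrices from the LC and FTIR blocks would then be concatenated to form the mid-level fused matrix.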
2.4.3. Chemometrics
Chemometric approaches play an essential role in the food and pharmaceutical sciences. Supervised pattern recognition techniques, namely PLS-DA and RF, were employed in this study; once a classification model is built, the membership of a sample of unknown class in the predefined classes can be recognized. Partial least squares discriminant analysis (PLS-DA) is a widely used linear classification method that combines partial least squares regression with a classification rule [27-28]. Its primary parameter, the number of latent variables, was determined by a 7-fold cross-validation procedure. Variables important for correct class recognition were identified by the variable importance in projection (VIP) scores [29].
Random forest (RF) is an ensemble learning method based on classification or regression trees [30-31]. When each individual tree is built, approximately two-thirds of the samples in the calibration set form a training set, and the remaining one-third, the so-called out-of-bag (OOB) samples, are used to obtain an internal, unbiased estimate of the classification error. The two crucial tuning parameters of a random forest model, the number of trees (ntree) and mtry, were chosen according to the OOB classification error. The procedure comprised four steps. First, a dataset split by the Kennard-Stone algorithm [32] was imported. Second, the optimal ntree was selected as the value giving the lowest OOB classification error over all classes simultaneously, starting from an initial value of 2000. Third, the optimal mtry was searched step by step within ±10 of its default value (the square root of the number of variables) [33]; if several mtry values gave the same lowest OOB classification error, the one closest to the default was preferred. Finally, the RF model was built with the selected ntree and mtry.
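The OOB-based search over mtry can be sketched with scikit-learn's `RandomForestClassifier` (a stand-in for the R implementation used in the study); for brevity, ntree is fixed and only a few mtry candidates are scanned, ordered by proximity to the default so ties resolve toward it:

```python
# OOB-based tuning of mtry (max_features); sketch only, with synthetic data
# and a fixed ntree instead of the study's full +/-10 search from 2000 trees.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=120, n_features=64, n_informative=8,
                           random_state=0)
default_mtry = int(np.sqrt(X.shape[1]))   # default mtry = sqrt(n_variables)

# Candidates sorted by distance to the default, so that with a strict ">"
# comparison, ties are broken in favour of values closer to the default.
candidates = sorted(range(max(1, default_mtry - 3), default_mtry + 4),
                    key=lambda m: abs(m - default_mtry))

best_mtry, best_oob = None, -np.inf
for mtry in candidates:
    rf = RandomForestClassifier(n_estimators=500, max_features=mtry,
                                oob_score=True, random_state=0).fit(X, y)
    if rf.oob_score_ > best_oob:          # oob_score_ = OOB accuracy
        best_mtry, best_oob = mtry, rf.oob_score_
```

Note that scikit-learn reports OOB accuracy rather than OOB error, so the search maximizes rather than minimizes; the selected parameter is the same.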
2.4.4. Evaluation of model performance
To assess model performance, the samples were divided into calibration and validation sets at a ratio of 2:1 using the Kennard-Stone algorithm. The calibration set was used to build the models, and the validation set provided an external estimate of their practical performance. In general, if the performance on the calibration set is far better than that on the validation set, overfitting is likely, i.e. the generalization ability of the model is diminished, which should be avoided.
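A minimal Kennard-Stone split, assuming the standard formulation of the algorithm: start from the two mutually most distant samples, then repeatedly add the remaining sample that is farthest from the already-selected set. The 123-sample size matches this study; the data themselves are random placeholders:

```python
# Kennard-Stone calibration/validation split (minimal sketch of the
# standard algorithm; data are random placeholders).
import numpy as np
from scipy.spatial.distance import cdist

def kennard_stone(X, n_cal):
    """Return (calibration indices, validation indices) for X."""
    D = cdist(X, X)
    # Seed with the two mutually most distant samples.
    cal = list(np.unravel_index(np.argmax(D), D.shape))
    rest = [i for i in range(len(X)) if i not in cal]
    while len(cal) < n_cal:
        # For each remaining sample: distance to its nearest selected sample;
        # pick the sample for which this distance is largest (max-min rule).
        d_min = D[np.ix_(rest, cal)].min(axis=1)
        pick = rest[int(np.argmax(d_min))]
        cal.append(pick)
        rest.remove(pick)
    return cal, rest

rng = np.random.default_rng(3)
X = rng.normal(size=(123, 20))                 # 123 samples, as in this study
cal_idx, val_idx = kennard_stone(X, n_cal=82)  # roughly a 2:1 split
```

The max-min rule spreads the calibration set evenly over the data space, which is why the validation set gives a fair external estimate.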
Additionally, efficiency and the total accuracy rate were used as synthetic parameters to evaluate classification performance; the higher these values, the better the model. Efficiency was calculated as described in [34], where TP (true positive) is the number of correctly identified samples in the target positive class and TN (true negative) is the number of correctly identified samples in the target negative class; analogously, FP (false positive) and FN (false negative) are the numbers of incorrectly identified samples assigned to the positive and negative classes, respectively. The total accuracy rate is the percentage of correctly identified samples across all classes.
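The efficiency formula itself is not reproduced in the text. In the chemometrics literature it is commonly defined as the geometric mean of sensitivity and specificity, and that assumption (to be checked against [34]) is used in this sketch:

```python
# Per-class figures of merit. ASSUMPTION: efficiency is taken as the
# geometric mean of sensitivity and specificity, a common definition in
# the chemometrics literature; verify against reference [34].
import math

def class_metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)   # fraction of positives correctly found
    specificity = tn / (tn + fp)   # fraction of negatives correctly found
    efficiency = math.sqrt(sensitivity * specificity)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, efficiency, accuracy

# Worked example: 8 of 10 positives and 9 of 10 negatives identified.
sens, spec, eff, acc = class_metrics(tp=8, tn=9, fp=1, fn=2)
```

For this worked example, sensitivity is 0.8, specificity 0.9, efficiency sqrt(0.72) ≈ 0.849, and the accuracy rate 0.85.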
2.4.5. Software
SIMCA-P+ (version 13.0, Umetrics, Sweden) was used for PCA, PLS-DA and SG second-derivative preprocessing. Random forest and Boruta analyses were carried out with R packages (R version 3.4.3). The correlation optimized warping and Kennard-Stone algorithms were implemented in MATLAB (version R2017a, MathWorks, USA). Contents of five target compounds were statistically analyzed by one-way analysis of variance at P < 0.05 using SPSS software (version 21.0, IBM Corporation, USA).