Predicting the dietary fiber content of fresh-cut bamboo shoots using a visible and near-infrared hyperspectral technique

The dietary fiber content of fresh-cut bamboo shoots is crucial to the quality of processed bamboo shoot products. This study explored the potential of two hyperspectral techniques, visible near-infrared (Vis-NIR) and near-infrared (NIR) hyperspectral imaging, for the quick and non-destructive prediction of the dietary fiber content of fresh-cut bamboo shoots. Vis-NIR and NIR hyperspectral data were collected to establish partial least squares regression (PLSR) and principal component regression (PCR) calibration models relating the average spectrum of fresh-cut bamboo shoots to their dietary fiber content. Subsequently, data fusion analysis, various pre-processing methods, and principal component analysis (PCA) were used to optimize the models. The results indicated that models based on low-level fusion data were superior to the corresponding models based on single spectral data. The optimal SNV-PCA-PLSR model achieved good performance, with a coefficient of determination of prediction (R2p) of 0.902 and a root mean square error of prediction (RMSEP) of 0.135. Therefore, the hyperspectral technique combined with data fusion analysis is a promising approach for non-invasive quality supervision of bamboo shoot products in varied processing states.


Introduction
Bamboo shoots, a natural, pollution-free, high-quality food, are rich in dietary fiber. Dietary fiber promotes digestion, acts as an antioxidant, and lowers blood lipids [1], but too much insoluble dietary fiber in bamboo shoots may lead to high lignification and hardness, a rough taste, and excess residue after chewing, which greatly affects product quality and reduces consumers' desire to buy [2][3][4]. Fresh-cut bamboo shoots are the raw material for processing ready-to-eat bamboo shoot foods. The quality of processed bamboo shoots is highly correlated with the dietary fiber content of fresh-cut bamboo shoots. Therefore, it is necessary to establish a fast and spectra for a more comprehensive analysis of material information, including physical and geometric characteristics (shape, color, size, etc.) and chemical composition [7]. Unlike conventional spectroscopy, which only obtains spectral data at one or more points on the sample, hyperspectral imaging (HSI) obtains the spectral data of every pixel in the sample image [8]. Combined with chemometrics, HSI has been extensively applied in detecting the content of different substances, including soluble solid content in Feicheng peach [9], amylose and amylopectin content in sorghum, soluble solid content in Malus micromalus Makino [10,11], total sugar content in apples, and soluble solids content in jujube [12,13]. Kheiralipour et al. used HSI to detect the infection of pistachio kernels by two different isolates of Aspergillus flavus, KK11 and R5 [14]. These studies have achieved satisfactory results.
HSI data are high-dimensional, containing information relevant to the target along with a mass of irrelevant noise. It is therefore important to find an efficient approach that exploits the informative spectral data for better calibration. In recent years, multivariate and chemometric approaches such as partial least squares regression (PLSR) and principal component regression (PCR), combined with HSI, have been widely and successfully adopted for regression prediction. For example, PLSR combined with HSI has been adopted as a quick and non-destructive approach to predict the micronutrients of wheat [15]. PCR has been successfully applied to predicting the seed germination of sweet corn [16]. Nevertheless, the comparison of various pre-processing approaches combined with different regression methods, including PLSR and PCR, for predicting dietary fiber content in bamboo shoots has not been well researched.
In addition, data fusion, which combines multiple data sources, can provide more complete information and a more comprehensive understanding of the results than analysis of a single block of data [17]. Data fusion can be divided into three levels: low-level, mid-level, and high-level. Low-level data fusion simply concatenates the data from all sources into a new matrix in sample order and then uses chemometrics to build classification or regression models. Low-level data fusion has been applied to food origin and quality prediction [18][19][20]. Mid-level data fusion extracts key features from each data source separately, then combines them and processes them through classification or calibration methods. Feature extraction is the key step in mid-level data fusion. Mid-level data fusion has been applied in food authenticity, food quality, and safety inspection [21]. High-level data fusion establishes a separate model for each data block and then fuses the prediction results obtained from the different models to get the final result. However, the use of data fusion based on two single data blocks produced by two different hyperspectral systems to predict dietary fiber content in bamboo shoots has not been reported.
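As an illustration of the low-level fusion described above, the following minimal NumPy sketch concatenates two data blocks column-wise so that each row still belongs to one sample. The array names, sizes, and random values are hypothetical and are not taken from the study.

```python
import numpy as np

def low_level_fusion(*blocks):
    """Concatenate data blocks column-wise.

    Every block must hold the same samples, in the same row order.
    """
    n_samples = {block.shape[0] for block in blocks}
    if len(n_samples) != 1:
        raise ValueError("all blocks must have the same number of samples")
    return np.hstack(blocks)

# Hypothetical mean spectra: 4 samples, 5 bands in block A and 3 bands in block B.
rng = np.random.default_rng(0)
block_a = rng.random((4, 5))
block_b = rng.random((4, 3))

fused = low_level_fusion(block_a, block_b)
print(fused.shape)  # (4, 8)
```

Mid-level fusion would differ only in what is concatenated: extracted features (for example, principal component scores of each block) instead of the raw spectra.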
This research explored the feasibility of two different HSI systems (Vis-NIR and NIR) combined with data fusion analysis for measuring dietary fiber content in various parts of bamboo shoots. The contents of the present study were as follows: (1) The trends in dietary fiber content across different parts of fresh-cut bamboo shoots were analyzed. (2) Hyperspectral images of fresh-cut bamboo shoots were acquired in the Vis-NIR (400-1000 nm) and NIR (900-1700 nm) spectral regions, and the region of interest (ROI) of each sample was selected by the threshold mask method. (3) Standard normal variate (SNV), multiplicative scatter correction (MSC), the Savitzky-Golay (SG) smoothing filter, and normalization were adopted to preprocess the raw spectral data and the fused spectral data of the bamboo shoots. (4) PCA was applied to extract the characteristic wavelengths of every sample set, and PLSR and PCR models based on the full and characteristic wavelengths were set up to predict the dietary fiber content of the bamboo shoots.

Sample preparation
The research was carried out in October 2020 with commercially mature, fresh bamboo shoots harvested from the Yingde Yuanfeng agricultural and by-products processing factory in Qingyuan city, China. Raw shoots were washed and peeled. Then, the bamboo shoots were cut along the knots from top to bottom into 4 approximately equal pieces (top, middle-upper, middle-lower, and bottom), giving a total of 108 samples (27 × 4 = 108). Hyperspectral data collection was completed within 20 min, after which the physical and chemical determination of the quality evaluation indexes of the fresh-cut bamboo shoots was started.

Image acquisition system
The hyperspectral data of fresh-cut bamboo shoots were obtained with a hyperspectral imaging system consisting of two hyperspectral cameras, models FX10 and FX17 (SPECIM, SPECTRAL IMAGING LTD, Oulu, Finland), eight 20 W halogen light sources, a mobile platform driven by a stepping motor (300 mm × 200 mm), a computer, and a dark indoor environment (Fig. 1). The FX10 hyperspectral camera covers the visible/near-infrared band, with a wavelength range of 400-1000 nm, a resolution of 5.5 nm, and 224 bands in the spectral range. The FX17 hyperspectral camera covers the near-infrared band, with a wavelength range of 900-1700 nm, a resolution of 8 nm, and 224 bands in the spectral range.
Before collecting hyperspectral information, it is necessary to adjust the distance between the hyperspectral camera and the sample (object distance), the moving speed of the automatic object carrier, and the exposure time of the hyperspectral camera (built-in CCD camera) to ensure image sharpness and prevent distortion. After repeated adjustment, the scanning parameters of the FX10 hyperspectral camera were set as follows: the moving speed of the carrier platform was 9.8 mm/s, the object distance was 30 cm, and the exposure time was 2 ms. The scanning parameters of the FX17 hyperspectral camera were set as follows: the moving speed of the carrier platform was 7.5 mm/s, the object distance was 30 cm, and the exposure time was 1.5 ms. In addition, the room was kept completely dark during the test, and the humidity and temperature were kept essentially constant. Lumo-Scanner software was used to collect the hyperspectral data, and the analysis software ENVI 4.8 was used to process the collected original spectral image data.

Acquisition and calibration
The acquired hyperspectral images can be influenced by several environmental factors, including illumination and differences in the camera and the physical configuration of the imaging system. To eliminate or minimize these effects, the raw hyperspectral images were calibrated based on Eq. (1):
$$I_C = \frac{I - I_B}{I_W - I_B} \tag{1}$$
where $I_C$ is the calibrated hyperspectral image, $I$ is the raw spectral image of the sample, and $I_W$ and $I_B$ are the white and dark reference images, respectively.
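A minimal NumPy sketch of this white/dark reference correction is given below, assuming the usual per-pixel, per-band form in which the dark reference is subtracted from both the raw image and the white reference. The function name, the toy intensity values, and the small `eps` guard are illustrative assumptions rather than details from the study.

```python
import numpy as np

def calibrate(raw, white, dark, eps=1e-12):
    """Relative reflectance (raw - dark) / (white - dark), per pixel and band."""
    raw, white, dark = (np.asarray(a, dtype=float) for a in (raw, white, dark))
    denom = white - dark
    # Assumption: replace near-zero denominators to avoid division by zero
    # where the white and dark references coincide (e.g. dead pixels).
    denom = np.where(np.abs(denom) < eps, eps, denom)
    return (raw - dark) / denom

# Toy cube of 2 x 2 pixels and 3 bands with made-up detector counts.
dark = np.full((2, 2, 3), 100.0)
white = np.full((2, 2, 3), 4100.0)
raw = np.full((2, 2, 3), 2100.0)

reflectance = calibrate(raw, white, dark)
print(reflectance[0, 0])  # [0.5 0.5 0.5]
```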

Measurement of dietary fiber contents
Fresh-cut bamboo shoots were weighed, cut into 2 mm slices, dried in an oven until the moisture content was less than 5%, and ground into powder. The enzymatic-gravimetric method was then used to determine dietary fiber content [22]. Duplicate portions of bamboo shoot powder (approximately 1 g each) were weighed and hydrolyzed with heat-stable α-amylase, alkaline protease, and amyloglucosidase to remove most of the starch and protein in the sample and obtain the enzymolysis solution. Four times the volume of 95% ethanol solution was added to the enzymolysis solution, and the mixture was left to precipitate for one hour. Then, organic solvents (75% ethanol solution, 95% ethanol solution, and acetone solution) were used to wash the precipitate, and the residue was collected by filtration. After the residue was dried at 105 °C overnight, residual ash and protein were determined to correct the mass of the dietary fiber.

Data fusion
Data fusion integrates data from multiple information sources to create a more comprehensive and detailed data set. It is classified into three types according to the level at which the different data are fused: low-level, mid-level, and high-level fusion [23]. Low-level data fusion is realized by merging raw spectral data. After fusing the spectral data, a new global matrix containing the Vis-NIR and NIR spectral data was generated. Each row of the matrix corresponds to the average Vis-NIR and NIR spectra of the same bamboo shoot sample, and each column corresponds to the spectral values of all samples at a particular wavelength. Then, the entire dataset was used as an input for multivariate data analysis. In this kind of fusion, pre-processing or variable selection methods may be applied to the data after concatenation. When the fused data have the same or commensurate form, low-level data fusion can often achieve more satisfactory results [24]. Therefore, this study focuses on low-level data fusion based on the combination of the spectral information of the two bands.

Model establishment and evaluation
PLSR and PCR were adopted to set up calibration models between the spectral data and the dietary fiber content of the samples. PLSR is a common multivariate statistical analysis approach. By decomposing the spectral matrix, it extracts the components with the strongest explanatory ability for the dependent variables while taking both signal and noise in the spectral data into account, which overcomes the problem of multicollinearity among the independent variables and allows the regression of multiple dependent variables on multiple independent variables [30]. PCR uses principal component analysis to solve the multicollinearity issue in the regression model: it takes the principal component variables extracted from the original data as independent variables for regression analysis and then expresses the result in terms of the original variables through the score coefficient matrix.
To evaluate the performance of the models, several common statistical coefficients were calculated, including the coefficient of determination (R2) and the root mean square error (RMSE). Generally speaking, RMSE is considered a good indicator because taking the square root returns the squared prediction error to the same units as the measured value. When the errors of the data set are normally distributed, RMSE is more advantageous than other indicators such as the mean absolute error (MAE) in describing the error distribution. R2 combined with RMSE adequately represents the accuracy of the results. R2 comprises the calibration-set value R2c and the prediction-set value R2p. RMSE comprises the root mean square error of the calibration set (RMSEC) and the root mean square error of the prediction set (RMSEP). In general, better prediction models have larger R2 and smaller RMSE [31,32].
Table 1 shows the statistics of the dietary fiber contents in the bamboo shoot samples. The dietary fiber contents ranged over 0.817-2.748 g/100 g for the calibration set and 0.832-2.274 g/100 g for the prediction set. The range of the prediction set lay almost entirely within that of the calibration set, which helps in setting up a robust prediction model.
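The calibration and evaluation procedure described in this section (PLSR and PCR fitted on a calibration set, then scored by R2 and RMSE on a prediction set) can be sketched in plain NumPy as follows. This is only an illustration under stated assumptions: the source does not say which PLSR algorithm or software was used, so a single-response NIPALS variant (PLS1) is swapped in; the data are synthetic; and the 81/27 split is a simple slice standing in for the Kennard-Stone split. All function names are hypothetical.

```python
import numpy as np

def r2_score(y, y_hat):
    """Coefficient of determination."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def rmse(y, y_hat):
    """Root mean square error, in the units of y."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

def fit_pcr(X, y, k):
    """PCR: regress y on the first k principal component scores of centered X."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:k].T                                   # loadings, shape (n_bands, k)
    scores = Xc @ V                                # components ignore y entirely
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    beta = V @ gamma                               # back to the original variables
    return beta, y_mean - x_mean @ beta

def fit_pls1(X, y, k):
    """PLS1 via NIPALS: components are chosen for their covariance with y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(k):
        w = E.T @ f
        w /= np.linalg.norm(w)                     # weight vector
        t = E @ w                                  # score vector
        tt = t @ t
        p = E.T @ t / tt                           # X loading
        c = f @ t / tt                             # y loading
        E = E - np.outer(t, p)                     # deflate X
        f = f - c * t                              # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)
    return beta, y_mean - x_mean @ beta

# Synthetic "spectra": 108 samples, 50 bands, driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(108, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.05 * rng.normal(size=(108, 50))
y = latent @ np.array([0.5, -0.3, 0.2]) + 0.02 * rng.normal(size=108)
X_cal, y_cal, X_pred, y_pred = X[:81], y[:81], X[81:], y[81:]

results = {}
for name, fit in (("PLSR", fit_pls1), ("PCR", fit_pcr)):
    beta, intercept = fit(X_cal, y_cal, 3)
    y_hat = X_pred @ beta + intercept
    results[name] = (r2_score(y_pred, y_hat), rmse(y_pred, y_hat))
    print(f"{name}: R2p = {results[name][0]:.3f}, RMSEP = {results[name][1]:.3f}")
```

The two fitting functions differ only in how the components are extracted: `fit_pcr` never looks at `y` when choosing components, whereas `fit_pls1` builds each weight vector from the covariance between the spectra and the response.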

Spectral information extraction and pre-processing
To reduce the computational complexity, the corrected hyperspectral images were cropped with ENVI 4.8 software. Each cropped image was 250 pixels × 250 pixels × 224 bands, and only the complete sample area was retained. Based on inspection with the band animation function, there was considerable contrast between the bamboo shoots and the background in the 670.08 and 1102 nm wavelength images. A mask was applied to the corresponding hyperspectral image to obtain the bamboo shoot image without background. The same operation was performed on all hyperspectral images of bamboo shoots. Then, the average spectrum of every bamboo shoot sample was acquired for follow-up analysis. There were 216 spectra in the Vis-NIR and NIR data combined. After removing noisy wavelengths, the spectral wavelengths were in the ranges of 300.28-998.17 nm and 1001.37-1598.79 nm for the Vis-NIR and NIR spectra, with 221 and 171 bands, respectively.
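The masking and averaging steps can be sketched as below. The study performed them in ENVI; this NumPy version is only an illustration, and it assumes that the sample is brighter than the background at the chosen band. The function name, band index, and threshold value are hypothetical.

```python
import numpy as np

def mean_roi_spectrum(cube, band_index, threshold):
    """Average the spectra of pixels whose value at one band exceeds a threshold.

    cube has shape (rows, cols, bands). Returns (mean spectrum, boolean mask).
    """
    # Assumption: sample pixels are brighter than the background at this band.
    mask = cube[:, :, band_index] > threshold
    if not mask.any():
        raise ValueError("threshold removed every pixel")
    # Boolean indexing keeps only the masked pixels, shape (n_pixels, bands).
    return cube[mask].mean(axis=0), mask

# Toy cube: 4 x 4 pixels, 3 bands, dark background with a bright 2 x 2 sample.
cube = np.full((4, 4, 3), 0.05)
cube[1:3, 1:3] = [0.6, 0.7, 0.8]

spectrum, mask = mean_roi_spectrum(cube, band_index=1, threshold=0.3)
print(int(mask.sum()), spectrum)
```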
To set up a precise prediction model, the spectral data of the bamboo shoot samples need to be preprocessed [25][26][27]. The pre-processing approaches employed in this study were SNV, MSC, SG, and normalization. In addition, the 108 bamboo shoot samples were divided into a calibration set (81) and a prediction set (27) at a ratio of 3:1 using the Kennard-Stone (KS) algorithm [28].
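Two of the pre-processing methods and the sample split can be sketched in NumPy as follows. The exact settings used in the study are not given, so this is a sketch under assumptions: SNV uses the sample standard deviation, MSC uses the mean spectrum as its reference, and Kennard-Stone uses Euclidean distance on the spectra. SG smoothing and normalization are omitted (SG is available, for example, as `scipy.signal.savgol_filter`). The toy spectra are synthetic.

```python
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum (row) on its own."""
    mean = X.mean(axis=1, keepdims=True)
    std = X.std(axis=1, ddof=1, keepdims=True)
    return (X - mean) / std

def msc(X, reference=None):
    """Multiplicative scatter correction against a reference spectrum."""
    ref = X.mean(axis=0) if reference is None else reference
    corrected = np.empty_like(X, dtype=float)
    for i, x in enumerate(X):
        slope, intercept = np.polyfit(ref, x, 1)   # fit x = intercept + slope * ref
        corrected[i] = (x - intercept) / slope
    return corrected

def kennard_stone(X, n_select):
    """Indices of n_select (>= 2) samples chosen by the Kennard-Stone algorithm."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start from the two samples that are farthest apart.
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance from each remaining sample to its nearest selected sample.
        d_min = dist[np.ix_(remaining, selected)].min(axis=1)
        # Pick the sample that is farthest from everything selected so far.
        selected.append(remaining.pop(int(np.argmax(d_min))))
    return selected

# Toy spectra: one underlying shape with random multiplicative and additive scatter.
rng = np.random.default_rng(1)
base = np.sin(np.linspace(0, 3, 30))
X = np.array([a * base + b for a, b in rng.uniform(0.5, 1.5, size=(12, 2))])

X_snv = snv(X)
X_msc = msc(X)
cal_idx = kennard_stone(X_snv, n_select=9)          # 3:1 split of 12 samples
pred_idx = [k for k in range(len(X)) if k not in cal_idx]
print(len(cal_idx), len(pred_idx))  # 9 3
```

Because the toy spectra differ only by scatter, both corrections collapse them onto a single curve; real spectra keep the chemical differences that scatter correction is meant to expose. Kennard-Stone is deterministic, so repeated runs on the same data give the same split.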

Extraction of characteristic wavelengths
The acquired Vis-NIR and NIR spectral data, with 221 and 171 spectral bands, can be seen as high-dimensional data. This kind of data usually contains many redundant and noisy variables with strong collinearity, which reduces the performance of the developed model. To solve these problems, wavelength selection was applied to retain informative variables while eliminating uninformative and noisy ones and shortening the computation time of modeling. PCA is one of the most common methods for reducing the dimension of spectral data. As a frequently adopted multivariate analysis approach, PCA of spectral data produces a loading vector that shows the significance of each band [29].

Spectral characteristic analysis
Figure 2 (a) displays the Vis-NIR spectral data with a wavelength range of 300.28-998.17 nm; Fig. 2 (b) displays the NIR spectral data with a wavelength range of 1001.37-1598.79 nm. Overall, the mean reflectance curves of the four parts of bamboo shoots in the Vis-NIR and NIR spectral regions showed similar trends, but the reflectance values in different bands varied because of the changes in chemical composition (pectin, lignin, cellulose, etc.) caused by the different degrees of differentiation of fiber cells during the growth of bamboo shoots [33].
In the Vis-NIR wavelength range, differences in the visible zone of the reflectance curves were found between 524 and 780 nm, where the color information had higher intensity. The differences near 550 nm, 650 nm, and 678 nm were mostly caused by the varied contents of chlorophyll [34,35]: the photosystem of the middle and lower internodes of the culm was well developed and their chlorophyll content was high, whereas the photosystem of the upper internodes was not mature and their chlorophyll content was relatively low [36]. A similar trend was observed from 810 to 890 nm, a spectral region that also contains key information about the properties of bamboo shoots. For instance, dietary fiber is mostly made up of cellulose, pectins, and lignin, which, along with moisture, can be determined from absorption bands in the NIR region. From 810 to 890 nm, the bamboo shoots clearly displayed differences in reflectance values due to the combined effect of changes in moisture and dietary fiber content, which has been attributed to combinations of symmetric and anti-symmetric O-H and C-H stretching and bending modes [37,38]. In the NIR spectral range, the reflectance differences among the different parts of the bamboo shoots were more obvious. In Fig. 2 (b), the two strong broad peaks at 1100-1150 nm and 1380-1480 nm are related to the second overtone of the C-H bond and the first overtone of the O-H bond. This may be related to dietary fiber in the form of polysaccharides [39]. Because of the complex association between different wavelengths and various functional groups throughout the whole spectral range, no significant absorption peaks directly related to dietary fiber were found in the reflectance curves. Hence, the association between dietary fiber content and every wavelength needs to be established through multivariate analysis.

Comparison of different spectral pre-processing methods
The original and preprocessed spectra were used to set up PLSR models to compare the effect of different pre-processing methods on the prediction of dietary fiber content. Table 2 shows the values of RMSEC, RMSEP, R2c, and R2p for comparison among different pre-processing methods based on the same data or different types of data with the same method. For the two individual data blocks, after SNV pre-processing, the performance of the PLSR prediction model based on the Vis-NIR wavelength range was better than that based on the NIR wavelength range.
In the Vis-NIR wavelength range, the prediction accuracy evaluation parameters after SNV processing were R2c = 0.869, RMSEC = 0.149, R2p = 0.852, and RMSEP = 0.156. In the NIR wavelength range, they were R2c = 0.868, RMSEC = 0.194, R2p = 0.867, and RMSEP = 0.161. In addition, compared with the corresponding pre-processing methods applied to single data blocks, the performance of all models based on fused data was significantly improved. When SNV was used to preprocess the fused data, the R2p of the fusion data model was 0.891, greater than that of the single data block models (R2p of 0.852 in the Vis-NIR data model and 0.867 in the NIR data model), and the RMSEP of the low-level fusion data model was 0.150, less than that of the single data block models (RMSEP of 0.156 in the Vis-NIR data model and 0.161 in the NIR data model). The low-level fusion model was able to predict the dietary fiber in bamboo shoots more effectively than the model that included the full spectrum and characteristic wavelengths. Consequently, SNV was selected to preprocess the original spectral data for more accurate prediction of dietary fiber content.

Selection of principal component number
Although the full-wavelength spectral information could be used to predict the dietary fiber content of bamboo shoots, the large amount of redundant data greatly slowed the model calculations. On the premise of preserving most of the useful information, the redundant variables were removed and the principal components explaining the largest variance in the original data were selected [40]. Principal component numbers of 221, 60, 40, and 10 were adopted as inputs to the dietary fiber content regression models to set up fusion prediction models based on SNV-PCA-PLSR. Table 3 shows the resulting model performance. After extracting characteristic variables for modeling, the obtained model was more stable than the full-wavelength model. When the principal component number was 40, the established SNV-PCA-PLSR model had the best prediction performance together with a large dimension reduction. In the 400-1000 nm wavelength range, the prediction accuracy evaluation parameters were R2c = 0.871, RMSEC = 0.181, R2p = 0.886, and RMSEP = 0.144. In the 1000-1600 nm wavelength range, they were R2c = 0.877, RMSEC = 0.135, R2p = 0.882, and RMSEP = 0.139. In the two wavelength ranges, the first three spectral principal components explained 91% and 83% of the variance in the spectral data, respectively, which can basically represent the sample information. Hence, the optimal principal component number of 40 was selected to extract characteristic variables from the spectral data.

Prediction of dietary fiber content in bamboo shoots
PLSR and PCR models were built with 40 characteristic wavelengths to predict the dietary fiber content of the bamboo shoot samples. Table 4 displays the prediction performance of the two models in both wavelength ranges. In the 400-1000 nm wavelength range, the R2p values of the PLSR and PCR dietary fiber prediction models were 0.886 and 0.861, and the RMSEP values were 0.144 and 0.165, respectively. In the 1000-1600 nm wavelength range, the R2p values were 0.882 and 0.869, and the RMSEP values were 0.139 and 0.178. After data fusion, the R2p values were 0.902 and 0.872, and the RMSEP values were 0.135 and 0.175. These results showed that both models have good prediction precision and high reliability and can be adopted for further analysis. The SNV-PCA-PLSR models performed better than the SNV-PCA-PCR models in predicting the dietary fiber content of bamboo shoots. Figure 3 shows that SNV-PCA-PLSR produced good predictions, with the data located around the 1:1 line, indicating a close relationship between the predicted and measured values. Although the ideas behind PCR and PLSR are similar, they differ essentially in how components are extracted. Principal component extraction derives a few principal components from the independent variables so that they retain as much of the information in the original variables as possible and are uncorrelated with each other. The extraction is completely independent of the dependent variable, so the process is simple.
However, component extraction in PLSR derives a few components from the independent variables that not only summarize the information in the original independent variables well but also have a strong ability to explain the dependent variables, while remaining uncorrelated with each other [41]. It adopts an iterative data decomposition and extraction approach, and the process is much more complex than principal component extraction.
[38]. The spectral data were analyzed with genetic algorithm synergy interval partial least squares (GA-Si-PLS), and the predictive models for dietary fiber content gave R2p values of 0.9638 and 0.9756. In another study, Ferreira et al. adopted the HSI technique to predict dietary fiber content in Brazilian soybeans [37]. The developed variable importance in projection-partial least squares regression (VIP-PLSR) model led to an R2p of 0.80 with an RMSEP of 0.86. By comparison, the SNV-PCA-PLSR model established in this study based on the 400-1000 nm wavelength range also shows good prediction performance.
Overall, both models showed that the integration of HSI with chemometric methods can be adopted for detecting and precisely predicting the dietary fiber content in bamboo shoots. To validate and further improve the performance of these predictive models for practical use, the calibration set should include more samples to cover more chemical and biological variability.

Conclusion
The present study demonstrated the potential of HSI combined with data fusion analysis as a rapid and non-destructive process analytical technology for monitoring dietary fiber content in different parts of fresh-cut bamboo shoots. The SG, MSC, SNV, and normalization spectral pre-processing methods were compared, and SNV, which gave the best modeling effect, was selected for spectral pre-processing. Two prediction models were established between the predicted and measured values, and the SNV-PCA-PLSR model was superior to the SNV-PCA-PCR model. For the single data blocks, the R2p values of dietary fiber content in the two wavelength ranges (Vis-NIR and NIR) were 0.886 and 0.882, and the RMSEP values were 0.144 and 0.139, respectively. After data fusion, the R2p and RMSEP of dietary fiber content improved to 0.902 and 0.135. The results showed that the HSI technique could be used to control the quality of processed bamboo shoot products, providing an alternative to the traditional method. Furthermore, the method proposed in this study can be used to select bamboo shoots with a dietary fiber content suited to the needs of consumers. In future studies, more sample data sources should be added to the models, and more effort is needed to apply the proposed method in an industrial environment and to improve the accuracy and stability of the models.

Data Availability All data generated or analysed during this study are included in this published article.

Declarations
Ethics Approval This article does not contain any studies with human participants or animals performed by any of the authors.
Informed consent Informed consent not applicable.

Conflict of interest
The authors declare no competing interests.