2.1. Sample preparation
The five food crops selected for this study were all sourced from Sichuan Province: red glutinous sorghum from Luzhou, long-grain sorghum rice from Nanchong, long-round glutinous rice from Zigong, Xikemai No. 3 wheat from Mianyang, and home-grown corn from Guang'an. For each variety, 0.5 kg of intact grains free of insect damage was placed in a blender and crushed once for 5 s. Each variety was weighed and crushed separately, and the crushed grain particles were then passed over a 20-mesh sieve; the acceptance criterion was that no more than 20% of the material passed through the 20-mesh sieve as fine powder. The sieved fine powder was set aside and the remaining coarse particles were reserved for sample preparation. For each variety, 23 samples were prepared, each containing 100 broken grains of different sizes and shapes, for a total of 11,500 grain samples. Three mixed samples were prepared in the proportions of 36% sorghum, 22% rice, 18% glutinous rice, 16% wheat and 8% corn. In addition, a total of 1,500 samples (300 grains of each of the five varieties) were selected as an external validation set to test the generalization ability of the model.
2.2 Hyperspectral imaging system and data acquisition
All sample data in this study were collected with a visible/near-infrared hyperspectral imaging system (FX10E, Specim, Finland), which comprises an electronically controlled translation stage, two sets of 150 W halogen lamps (OSRAM, Germany), a computer running the dedicated LUMO Scanner software (DELL, USA) and an auxiliary stand. The FX10E hyperspectral camera has a 38° field of view, a 12-bit output, a spatial resolution of 1024 × 628 pixels and a spectral range of 397–1004.5 nm divided into 448 bands. Before data collection, the system parameters were set as follows: peak illumination 3616, frame rate 50 Hz, exposure time 8 ms and platform movement speed 10.84 mm/s. After the parameters were set, data acquisition began. The samples were spread, in numbered order, in a Petri dish (90 mm diameter, 10 mm height) so that the broken particles did not stick to or overlap one another, and the dish was then placed on the moving platform for data collection.
During sample data collection, the intensity distribution of the light source is not uniform across wavelengths, which affects the acquired image signal. In addition, the dark current inside the camera introduces noise during acquisition and reduces data accuracy. Therefore, black-and-white correction was applied to the collected data according to the following equation:
$$D=\frac{D_0 - D_d}{D_w - D_d} \tag{1}$$
where \(D\) is the corrected spectral image, \(D_0\) is the original hyperspectral image, \(D_d\) is the dark reference image collected with the lens covered, and \(D_w\) is the standard whiteboard (white reference) image.
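As a minimal illustration of Eq. (1) (not the authors' implementation), the correction can be applied band-wise with NumPy; the array names and shapes below are assumptions.

```python
import numpy as np

def black_white_correction(raw, dark, white, eps=1e-10):
    """Apply black-and-white correction (Eq. 1) to a hyperspectral cube.

    raw   : ndarray, shape (rows, cols, bands) - original hyperspectral image D0
    dark  : ndarray, broadcastable to raw      - dark reference Dd (lens covered)
    white : ndarray, broadcastable to raw      - white reference Dw (standard whiteboard)
    """
    raw = raw.astype(np.float64)
    dark = dark.astype(np.float64)
    white = white.astype(np.float64)
    # Relative reflectance; eps avoids division by zero at dead pixels
    return (raw - dark) / (white - dark + eps)
```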
2.3 Image segmentation and spectral data extraction
The spectral data collected by the hyperspectral imaging system contain both the information of the broken grain particles and background information. To remove the background effectively while retaining the particle information, the Otsu method combined with the watershed algorithm was used to segment the images of broken grain particles of different sizes and shapes (Xi et al., 2021). The specific steps are: (1) after black-and-white correction, remove image noise; (2) convert the image to grayscale and use the Otsu method to find the optimal threshold for each variety and determine the connected regions (Nandhini & Porkodi, 2021); (3) adjust the filter size to perform edge erosion and the watershed transformation on the binarized sample image; (4) extract the spectral values of the effective regions of interest (ROIs). A simplified code sketch of steps (2)–(4) is given after Eq. (2). The success rate of spectral data extraction is calculated as:
$$ACU_{ex}(\%)=\frac{E}{T_r} \times 100 \tag{2}$$
where \(ACU_{ex}\) is the success rate of data extraction, \(E\) is the number of samples successfully extracted, and \(T_r\) is the total number of samples.
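The following sketch illustrates steps (2)–(4) with scikit-image; it is a minimal illustration under assumed variable names (e.g. `corrected_cube`) and filter sizes, not the exact pipeline used in this study.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.filters import threshold_otsu
from skimage.morphology import binary_erosion, disk
from skimage.segmentation import watershed

def segment_particles(corrected_cube):
    """Segment broken grain particles and return a label image plus mean ROI spectra.

    corrected_cube : ndarray, shape (rows, cols, bands) - black/white-corrected image
    """
    gray = corrected_cube.mean(axis=2)            # grayscale projection of the cube
    mask = gray > threshold_otsu(gray)            # Otsu thresholding -> foreground mask
    mask = binary_erosion(mask, disk(2))          # edge erosion to separate touching particles
    distance = ndi.distance_transform_edt(mask)   # distance map used to seed the watershed
    markers, _ = ndi.label(distance > 0.5 * distance.max())
    labels = watershed(-distance, markers, mask=mask)  # watershed transformation

    # Mean spectrum of each labeled particle (ROI)
    spectra = [corrected_cube[labels == i].mean(axis=0) for i in range(1, labels.max() + 1)]
    return labels, np.array(spectra)
```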
2.4 Eliminate outliers
After the valid spectral data were extracted, the data were cleaned to remove outliers that would affect the modeling results. A combination of density-based clustering (DBSCAN) and the Mahalanobis distance (MD) was used for this purpose. Density-based clustering groups regions of sufficiently high density into clusters and can find clusters of arbitrary shape in noisy spatial data, so noise points that would affect modeling can be removed (Alireza & Negin, 2021). The Mahalanobis distance treats the data as anisotropic, with the anisotropy described by a covariance matrix. When this is the covariance matrix of a multivariate normal distribution, the contours of the density function are ellipses, and all points on a given ellipse have the same Mahalanobis distance from its center. Points whose Mahalanobis distance exceeds the threshold are removed as outliers. This method can effectively remove points with an abnormal distribution (Jiayou et al., 2021).
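A minimal sketch of this two-step outlier removal, assuming the spectra are stored row-wise in a NumPy array and using illustrative parameter values (`eps`, `min_samples`, and a chi-square cutoff for the Mahalanobis threshold):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.stats import chi2

def remove_outliers(spectra, eps=0.5, min_samples=5, alpha=0.975):
    """Remove outliers with DBSCAN noise labels, then a Mahalanobis-distance cutoff.

    spectra : ndarray, shape (n_samples, n_bands)
    Returns the cleaned spectra.
    """
    # Step 1: DBSCAN marks low-density points as noise (label -1)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(spectra)
    kept = spectra[labels != -1]

    # Step 2: squared Mahalanobis distance to the data centre, chi-square cutoff
    mean = kept.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(kept, rowvar=False))
    diff = kept - mean
    md2 = np.einsum('ij,jk,ik->i', diff, cov_inv, diff)
    threshold = chi2.ppf(alpha, df=kept.shape[1])
    return kept[md2 <= threshold]
```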
2.5 Preprocessing and feature wavelength screening
In order to effectively eliminate spectral differences caused by different levels of scattering, multiplicative scatter correction (MSC) was used to preprocess the spectral data and enhance the correlation between the spectra and the data (Haoping, Xinjun, & Jianping, 2021). The data were randomly divided into a training set and a test set at a ratio of 4:1. The first and last 10 bands were then removed to improve the signal-to-noise ratio.
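A minimal NumPy sketch of MSC, assuming the mean spectrum of the data set serves as the reference spectrum (a common choice, stated here as an assumption):

```python
import numpy as np

def msc(spectra, reference=None):
    """Multiplicative scatter correction.

    spectra   : ndarray, shape (n_samples, n_bands)
    reference : optional reference spectrum; defaults to the mean spectrum
    """
    if reference is None:
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra, dtype=np.float64)
    for i, s in enumerate(spectra):
        # Fit s = a + b * reference by least squares, then remove the scatter effect
        b, a = np.polyfit(reference, s, deg=1)
        corrected[i] = (s - a) / b
    return corrected, reference
```

In practice the reference would be estimated on the training set only and reused for the test and external validation sets.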
The data collected by the hyperspectral imaging system contain a large amount of complex and redundant information, and the images of adjacent bands are highly similar. To select representative bands, the dimensionality of the spectral data must be reduced. Therefore, this study combines iteratively variable subset optimization (IVSO) with competitive adaptive reweighted sampling (CARS) to select characteristic wavelengths. IVSO is a highly stable variable selection method that can eliminate uninformative variables and extract effective variable information (Wang et al., 2015). CARS removes bands with smaller regression weights and extracts the important variables related to the detection target (Bonah et al., 2020). IVSO-CARS extracted a total of 41 characteristic wavelengths, effectively removing redundant information, saving modeling time and improving modeling efficiency.
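As a greatly simplified illustration of the CARS idea only (not the full IVSO-CARS procedure used here), wavelengths can be retained iteratively according to the absolute PLS regression coefficients, with an exponentially decreasing retention ratio and cross-validation to pick the best subset; `n_iter`, `n_components` and the response `y` (a numeric encoding of the class labels) are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

def cars_like_selection(X, y, n_iter=30, n_components=5):
    """Simplified CARS-style wavelength selection (illustration only).

    X : ndarray, shape (n_samples, n_bands); y : numeric response, shape (n_samples,)
    Returns indices of the best wavelength subset found.
    """
    n_bands = X.shape[1]
    kept = np.arange(n_bands)
    best_subset, best_score = kept, -np.inf
    # Exponentially decreasing fraction of retained wavelengths per iteration
    ratios = np.exp(np.linspace(0.0, np.log(10.0 / n_bands), n_iter))
    for r in ratios:
        pls = PLSRegression(n_components=min(n_components, len(kept)))
        pls.fit(X[:, kept], y)
        weights = np.abs(pls.coef_).ravel()       # |regression coefficients| as band weights
        n_keep = max(int(round(r * n_bands)), n_components)
        kept = kept[np.argsort(weights)[::-1][:n_keep]]
        score = cross_val_score(PLSRegression(n_components=min(n_components, len(kept))),
                                X[:, kept], y, cv=5).mean()
        if score > best_score:
            best_score, best_subset = score, kept.copy()
    return np.sort(best_subset)
```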
2.6 Building a classification model
In order to select a model suitable for classifying mixed broken particles of multiple grains, four classification models were compared and analyzed: the probabilistic neural network (PNN), the generalized regression neural network (GRNN), the radial basis function neural network (RBFNN) and the back-propagation neural network (BPNN). PNN is a forward-propagating network that does not require back-propagation to optimize its parameters, because it determines the sample category according to the Bayesian minimum-risk criterion; it therefore trains quickly and consumes little time (Yin et al., 2021). GRNN is an artificial neural network based on nonlinear regression theory, with strong nonlinear mapping ability and a high learning speed (Hou et al., 2020). RBFNN is a feedforward neural network with excellent performance, global approximation ability, fast convergence and good classification ability (Shi et al., 2018). BPNN is a supervised feedforward neural network with a simple structure that can quickly search for the optimal solution (Qiao et al., 2021). The models were comprehensively evaluated on the training set, the test set, the external validation set and their running time. The classification accuracy is calculated as follows:
$$precision(\%)=\frac{T_p}{T_p+F_p} \times 100 \tag{3}$$
where precision is the classification accuracy, \(T_p\) is the number of correct predictions, and \(F_p\) is the number of incorrect predictions.
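As an illustration of the PNN decision rule described above and of Eq. (3), a minimal sketch under an assumed Gaussian kernel and smoothing parameter `sigma` (not the authors' implementation) is:

```python
import numpy as np

class SimplePNN:
    """Minimal probabilistic neural network: Gaussian Parzen-window class densities
    combined with a winner-takes-all decision (Bayes minimum risk, equal priors)."""

    def __init__(self, sigma=0.1):
        self.sigma = sigma

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.X_, self.y_ = np.asarray(X, float), np.asarray(y)
        return self

    def predict(self, X):
        X = np.asarray(X, float)
        scores = []
        for c in self.classes_:
            Xc = self.X_[self.y_ == c]
            # Squared Euclidean distances from each query spectrum to the class prototypes
            d2 = ((X[:, None, :] - Xc[None, :, :]) ** 2).sum(axis=2)
            scores.append(np.exp(-d2 / (2 * self.sigma ** 2)).mean(axis=1))
        return self.classes_[np.argmax(np.stack(scores, axis=1), axis=1)]

def precision_percent(y_true, y_pred):
    """Overall classification accuracy as defined in Eq. (3)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return 100.0 * (y_true == y_pred).sum() / len(y_true)
```

The `precision_percent` helper simply computes the proportion of correctly classified samples; the GRNN, RBFNN and BPNN comparisons would use standard toolbox implementations.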