Determining farming methods and geographical origin of Chinese rice using NIR combined with chemometrics methods

This study was conducted to develop fast and nondestructive models for discriminating different farming methods and determining the geographical origin of rice samples from different administrative regions in China using near-infrared (NIR) spectroscopy. Principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA) were applied to build the NIR spectral models. Norris smoothing derivative (NSD) and multiplicative scatter correction (MSC) were used as preprocessing methods to reduce the spectral noise and enhance effective information. The results show that it was difficult to distinguish the farming methods with the original spectra plots and PCA score plots, except for the rice samples from Heilongjiang Province. In addition, a PLS-DA model combined with NSD preprocessing provided the optimal predictive accuracy of 89.7% for the identification of different farming methods. NSD or MSC preprocessing combined with PLS-DA models provided the best discrimination for origin traceability. The total accuracy for Northeast China rice samples was 100%, and that for the South, East, Central and Southwest China rice samples was 98.2%. The total accuracies for Heilongjiang, Anhui, Jiangsu, Hubei, and Sichuan Provinces were 100%, 98.8%, 95.3%, 95.3%, and 93.6%, respectively. These results indicate that NIR combined with PLS-DA and NSD or MSC preprocessing can provide a powerful method to distinguish the different farming methods and geographical origin of Chinese rice.


Introduction
Rice (Oryza sativa L.) serves as an important cereal crop and as the staple food of more than half the world's population (Janssen et al. 2014). In recent years, the way of life of a typical consumer has changed from considering food supply to food safety with the globalization of food markets and the improvement of living standards (Li et al. 2016). Organic agriculture refers to a mode of agricultural production that follows natural laws and ecological principles, while coordinating the balance between planting and aquaculture. Instead of using genetically engineered organisms and related products along with chemically synthesized pesticides, chemical fertilizers, growth regulators, feed additives and so on, a series of agricultural technologies are employed in organic farming that are designed to maintain a sustainable and stable agricultural production system (Seufert et al. 2012). Therefore, compared with conventional rice (CON), organic rice (ORG) is considered a safer, healthier, and ecofriendlier food choice, and typically requires consumers to pay a premium price (Reganold et al. 2016;Yi et al. 2013).
Dan Wu, Xing Liu and Bin Bai contributed equally to this work. Huang Huang (hh863@126.com); Jun Wu (wujun1984@126.com)
The geographical distribution of rice production in China covers a wide range, extending southeast to Taiwan, south to tropical Hainan, north to Heilongjiang and west to Xinjiang. Rice originating from different geographical areas gradually forms cultivars with local characteristics based on the different farming methods and the geographic environment. For example, the Northeast is one of the largest and most prominent rice producing areas in China, and has formed the cultivar of "Wuchang rice", a geographical indication protection product, which is ascribed to the unique regional environmental conditions such as fertile black land with adequate nitrogen, phosphorus, potassium and other mineral elements, and sufficient sunshine and rain with pure and pollution-free water sources. Generally speaking, high quality rice, including ORG rice and local specialty rice, can provide farmers with a relatively high market price (Huang et al. 2002), but inappropriate use of the local product name can also result in frequent market fraud. Dishonest traders attempt to pass off inexpensive rice as the more expensive varieties to make money, which not only seriously harms fair market competition, but also infringes on the rights of consumers (Matsumura et al. 2009). In this case, the use of origin traceability systems and authenticity discrimination methods is becoming more and more important (Janssen et al. 2014; Lima et al. 2011).
Methods used to trace the geographic origin and authenticity of agricultural products mainly include stable isotope analysis (Liu et al. 2020a, b; Luo et al. 2015), mineral element analysis (Liu et al. 2020, 2020c), volatile compound analysis (Li et al. 2019; Liu et al. 2020a, b), DNA technology (Zhao et al. 2018), metabonomic analysis (Mihailova et al. 2021), and near-infrared (NIR) spectroscopy (Wang et al. 2020; Ayvaz et al. 2015). Among these analytical methods, NIR takes only a minute or so to complete a single scan of a sample without pre-treatment, chemical reagents, or high-temperature and high-pressure conditions, while other methods require pretreatment of samples, take more than 10 min, and are expensive. Thus, NIR spectroscopy as a green analysis tool has the advantages of speed, low cost and good reproducibility without damaging a product or creating pollution (Biancolillo et al. 2020), and this method has been used to provide rapid qualitative and quantitative analysis of oil, rice, fruits, salmon, and honey (Kramer et al. 2017). Due to the influence of climate, environment, geology and agricultural inputs (pesticides and fertilizers), the nutritional composition and structure of rice from different geographic origins and produced using different farming methods are variable. As a result, different characteristic infrared absorption peaks can be formed in rice from different sites and farming methods, so that the geographic origin and authenticity of rice products can be identified based on NIR (Ni et al. 2010).
The NIR spectrum is part of the molecular absorption spectrum of plants, which mainly reflects the vibration spectrum of hydrogen-containing groups such as C-H, N-H, O-H and S-H in organic molecules in the wavenumber range of 12,800 to 4,000 cm⁻¹. Due to its relatively wide spectral absorption bands and possible spectral band overlap, it is difficult to identify farming methods and geographic origin information based on intuitive spectral differences (Plans et al. 2013; Yi et al. 2008). Chemometrics methods, such as Norris smoothing derivative (NSD) and/or multiplicative scatter correction (MSC), can reduce the spectral noise and enhance effective information, while principal component analysis (PCA) and partial least squares discriminant analysis (PLS-DA) can better identify the farming method and geographical origin information of agricultural products that is contained in the spectra. That is, NIR spectroscopy combined with chemometric methods can effectively distinguish agricultural products produced by different farming methods and from different geographical areas. Xiao et al. developed nondestructive calibration models with partial least squares (PLS) regression to discriminate between organically and conventionally produced rice of Heilongjiang Province using NIR spectroscopy, and the results supported the ability of NIR spectroscopy to differentiate between ORG and CON rice (Xiao et al. 2019). In addition, NIR spectroscopy has been reported to be useful in discriminating high-quality Wuchang rice that had been mixed with CON rice: based on near-infrared spectroscopy (NIRS) analysis technology, PLS-DA and support vector machine (SVM) were used to identify the authenticity of high-quality rice, with 100% discriminant accuracy for the optimal model. These results show that NIRS analysis technology can provide a reliable tool that could be used to rapidly determine whether high-quality rice is adulterated.
Existing studies have mainly focused on several famous rice producing areas or several individual provinces and/or cities, such as Heilongjiang Province. However, rice growing areas are widely distributed in China, and different rice growing areas have different farming methods. Therefore, this study aimed to develop a technique that could be used to distinguish the authenticity of differently-claimed farming methods, such as organic or non-ORG, trace the origin of rice samples from seven large regions, including North, South, East, Central, Northeast, Northwest and Southwest China, and identify the geographic origin of rice among the representative provinces in these seven large regions by NIR spectroscopy combined with chemometrics methods, so as to lay a foundation for the establishment of a rice traceability system and provide powerful law enforcement tools for market supervision departments.

Sample preparation
A total of 231 rice samples (including 148 CON and 83 ORG rice samples) were collected from seven geographical areas, specifically Central, East, South, Southwest, Northeast, North and Northwest China. Samples included 91 from Central China (Hunan, Hubei, Jiangxi and Henan provinces), 68 from East China (Jiangsu, Zhejiang, Anhui, and Fujian provinces along with the city of Shanghai), 17 from South China (Guangdong and Guangxi provinces), 22 from Southwest China (Sichuan, Yunnan, and Guizhou provinces), 24 from Northeast China (Heilongjiang province), 5 from North China (Inner Mongolia Autonomous Region and Shanxi) and 4 from Northwest China (Shaanxi and Xinjiang Uygur Autonomous Region); see Table 1. To obtain a representative sampling, the main rice producing regions were chosen in each administrative region, where ripening rice samples were collected from various rice farms. All ORG samples were certified according to the Chinese standard GB/T 19630. Rice in the husk was dried in a thermotank at 60 °C for 48 h, shelled and then milled for 15 s with a small electric shredding machine. Rice powders were filtered with a 100-mesh sieve and stored in sealed bags prior to analysis.

Collection of spectra
The NIR spectra of rice samples were scanned using a Thermo Scientific Antaris II FT-NIR Analyzer (Thermo-Fisher Co., USA). The spectrometer was preheated for 2 h before collecting spectral data. The standard sample cup was employed to collect the spectra of rice samples and the background data. The samples were mixed before each scan; about 2.0 g of rice powder was placed in a rotating sample cup and scanned three times. The average of the three spectral measurements was taken as the final sample spectrum and used for the calibration and validation sets. The spectral range was 10,000 to 4,000 cm⁻¹, the number of scans was 64 and the spectral resolution was 8 cm⁻¹, with the built-in background as the reference. The sampling gain was 1.0, the moving mirror speed was 1.2659 and the data point interval was 3.857 cm⁻¹, which resulted in 1557 variables. The scanning data were stored in the form of absorbance using the OMNIC software provided with the instrument. The lab temperature was kept at approximately 25 °C and the humidity at about 60% during the collection of spectral data.
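The acquisition settings above can be checked with a few lines of arithmetic. The sketch below is only an illustration in plain Python: the helper names and the absorbance values are made up, and it assumes the 1557 variables are evenly spaced over 10,000 to 4,000 cm⁻¹.

```python
def wavenumber_axis(start=10000.0, stop=4000.0, n_points=1557):
    """Evenly spaced wavenumber axis (cm^-1) running from start down to stop."""
    step = (stop - start) / (n_points - 1)
    return [start + i * step for i in range(n_points)]


def average_scans(scans):
    """Point-by-point mean of the replicate scans of one sample."""
    n = len(scans)
    return [sum(values) / n for values in zip(*scans)]


axis = wavenumber_axis()
spacing = abs(axis[1] - axis[0])  # close to the reported 3.857 cm^-1 interval

# Three made-up replicate scans of one sample (three points each, for brevity).
replicates = [[0.50, 0.52, 0.55], [0.52, 0.54, 0.57], [0.51, 0.53, 0.56]]
mean_spectrum = average_scans(replicates)
```

Under this assumption, 1557 points over a 6,000 cm⁻¹ span give a spacing of about 3.856 cm⁻¹, which agrees with the reported interval after rounding.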

Spectral preprocessing and modeling methods
Spectral preprocessing can eliminate irrelevant information in the process of spectral acquisition, such as electrical noise and stray light, to improve the accuracy of the model (Arndt et al. 2020). Thus, spectral data were subjected to Norris smoothing (including s and g parameters, where s is the number of data points in one segment and g is the number of data points in one gap) and derivative (first or second derivative) (NSD) and MSC preprocessing before modeling (Dhanoa et al. 1994;Ns et al. 2002;Savitzky et al. 1964).
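As a rough illustration of the two preprocessing steps, the following plain-Python sketch implements one simple gap-segment formulation of the Norris derivative and the usual regression form of MSC. The function names, the difference normalization and the treatment of spectrum edges are assumptions made for this example, and they may differ from the Matlab routines used in the study.

```python
def moving_average(y, s):
    """Mean of each run of s consecutive points (output is s - 1 points shorter)."""
    return [sum(y[i:i + s]) / s for i in range(len(y) - s + 1)]


def norris_derivative(y, s=3, g=3, n=1):
    """Segment smoothing (s points) followed by an n-th order difference over a gap g."""
    sm = moving_average(y, s)
    if n == 1:
        return [(sm[i + 2 * g] - sm[i]) / (2 * g) for i in range(len(sm) - 2 * g)]
    if n == 2:
        return [(sm[i + 2 * g] - 2 * sm[i + g] + sm[i]) / g ** 2
                for i in range(len(sm) - 2 * g)]
    raise ValueError("n must be 1 or 2")


def msc(spectra):
    """Regress each spectrum on the mean spectrum, then remove offset and slope."""
    p = len(spectra[0])
    ref = [sum(col) / len(spectra) for col in zip(*spectra)]
    ref_mean = sum(ref) / p
    sxx = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for y in spectra:
        y_mean = sum(y) / p
        slope = sum((r - ref_mean) * (v - y_mean) for r, v in zip(ref, y)) / sxx
        intercept = y_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in y])
    return corrected
```

In this sketch, `norris_derivative(spectrum, 3, 3, 2)` corresponds to the NSD (3,3,2) setting in the (s,g,n) notation. Spectra that differ only by an additive offset and a multiplicative factor collapse onto the same curve after `msc`.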
PCA and PLS-DA were applied to identify the rice samples from different geographical areas and grown using different farming methods. Prior to evaluating the PLS-DA methods, a Kennard-Stone (KS) algorithm was used to partition the samples into calibration (about 75%) and validation sets (about 25%). Samples were selected one by one according to the Euclidean distance, each time choosing the sample furthest from those already selected, so that the selected samples were spread throughout the multivariate space. The optimal number of latent variables (LVs) used to build the PLS-DA model was obtained by K-fold cross validation. Spectral data pre-processing and PLS-DA were carried out using Matlab R2021a software (The MathWorks Inc., Natick, MA, USA), and PCA was performed using SIMCA software (Sartorius Data Analytics, Malmö, Sweden).
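The KS partition described above can be sketched as follows. This is a minimal plain-Python version of the classic algorithm, not the code used in the study; the function name `kennard_stone` and the choice to seed the selection with the two most distant samples are assumptions.

```python
import math


def kennard_stone(X, n_select):
    """Return (selected, remaining) sample indices; needs at least two samples."""
    n = len(X)
    dist = [[math.dist(a, b) for b in X] for a in X]
    # Seed with the two samples that are furthest apart.
    first, second = max(((i, j) for i in range(n) for j in range(i + 1, n)),
                        key=lambda pair: dist[pair[0]][pair[1]])
    selected = [first, second]
    remaining = [k for k in range(n) if k not in selected]
    while len(selected) < n_select:
        # Pick the candidate whose nearest selected sample is furthest away.
        pick = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining
```

For a split of roughly 75%/25%, `n_select` would be set to about three quarters of the number of spectra. The selected indices form the calibration set and the remaining indices form the validation set.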

Spectral pretreatment analysis
The main spectral information in the NIR region comes from the frequency doubling and combination band vibration of hydrogen-containing groups.
The PLS-DA models were evaluated by K-fold cross validation and assessed through sensitivity (SE, Eq. 1), specificity (SP, Eq. 2), classification accuracy (Eq. 3) and area under the curve (AUC) (Krstajic et al. 2014; Liu et al. 2015). The closer the AUC is to 1.0, the higher the reliability of the method. The predictive ability of the PLS-DA model was evaluated by classification accuracy. The schematic indicating the principle of the measurement method for rice samples is shown in Fig. 1.
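Equations 1 to 3 cited above do not survive in this text. Assuming the standard confusion-matrix definitions were used, they can be written as follows, with FN (false negative) denoting positive samples incorrectly classified as negative:

```latex
\begin{align}
\mathrm{SE} &= \frac{TP}{TP + FN} \tag{1}\\
\mathrm{SP} &= \frac{TN}{TN + FP} \tag{2}\\
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \tag{3}
\end{align}
```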
In Eqs. 1-3, TP is true positive (positive samples correctly classified), TN is true negative (negative samples correctly classified), FP is false positive (negative samples incorrectly classified as positive) and FN is false negative (positive samples incorrectly classified as negative).
Figure 2 shows the spectra of the 231 rice samples before and after preprocessing in the wavenumber range of 10,000 to 4,000 cm⁻¹. The original NIR spectra displayed a declining trend with increasing wavenumber (Fig. 2a), with six obvious peaks that appeared around 4,314 cm⁻¹, 4,386 cm⁻¹, 4,762 cm⁻¹, 5,181 cm⁻¹, 6,852 cm⁻¹ and 8,368 cm⁻¹. Two small peaks appeared in the band 4,314-4,386 cm⁻¹ and were the combined frequency region of C-H stretching and CH2 deformation in polysaccharides, proteins and starches. The third peak appeared at 4,762 cm⁻¹ and represented the combined frequency region of O-H and C=O vibration in carbohydrates. The fourth peak appeared at 5,173 cm⁻¹ and represented the combined frequency of O-H stretching and HOH deformation of polysaccharides. The fifth, small peak appeared in the band 6,967-6,695 cm⁻¹ and was mainly attributed to a combination frequency of N-H stretching and N-H in-plane bending due to the N-H group in aromatic amines. The sixth, small peak appeared in the band 8,230-8,375 cm⁻¹ and was mainly attributed to a combination frequency of the vibration of C-H in CH3 and CH2 due to methyl (-CH3) groups in aliphatic hydrocarbons (Ayvaz et al. 2015; Ni et al. 2010; Xiao et al. 2019). The differences in the absorption peaks are related to the contents and variety of starches, proteins, polysaccharides and fats in rice; these differences could be influenced by the geographical origin and cultivation methods of rice.
After NSD preprocessing (Fig. 2b), the background noise was reduced, the overlapping spectra were corrected, and some weak peaks were strengthened, so the spectra could better represent the information on the composition of the rice samples (Liu et al. 2015; Wang et al. 2006). In addition, MSC preprocessing (Fig. 2c) could remove some noise scattering and interference, and significantly improved the spectral quality. Thus, some optimal discriminant models were obtained by using NSD preprocessing, while others were obtained by MSC pretreatment.
The NIR spectra of all samples exhibited a similar trend in absorbance and partially overlapped, making it difficult to distinguish rice produced by different farming methods and from different geographic origins by visual inspection of the spectra. Thus, pretreatment and multivariate statistical techniques were required to extract information about quality attributes embedded in the NIR spectra (Hao et al. 2019).
Hydrogen-containing groups are present in most organic molecules. Therefore, the vast majority of chemical and biochemical samples have a corresponding absorption band in the NIR region, and the absorption information of the samples can be collected for qualitative or quantitative analysis (Buslig 1991). Full-band spectra contained all information of the rice samples, and models based on full-band spectra could better determine the farming methods and geographical origin of rice.
Fig. 2 The (a) raw spectra of samples, (b) spectra after Norris smoothing derivative analysis (3,3,2) preprocessing, (c) spectra after multiplicative scatter correction preprocessing
For the Heilongjiang samples, where the environmental impact on the results of farming method discrimination is reduced, Figure 5a shows that the spectral values of most ORG rice samples were lower than those of CON rice samples, and the difference was also shown by the clustering of ORG and CON rice in the PCA score plot (Fig. 5b). ORG and CON rice from Heilongjiang could basically be distinguished. Organic with duck farming (ORG + D), ORG without duck farming (ORG + N), conventional planting with duck farming (CON + D) and CON planting without duck farming (CON + N) could be separated, but the cluster trends of ORG + D and CON + D were not very pronounced. This was ascribed to the fact that similar methods were used for rice-duck farming, in which free-ranging ducks provided manure and helped reduce insects. Therefore, the rice-duck farming methods can reduce the need to apply chemical fertilizers and pesticides. Thus, NIR combined with PCA score plot analysis could distinguish the ORG and CON rice samples from Heilongjiang Province.
The results indicated that different climates and environments along with different inputs had an impact on the discrimination of rice farming methods; the larger the regional scope, the greater the impact on the authenticity/traceability effect. It was difficult to distinguish the farming methods of rice with the original spectral plot and PCA score plot, except for rice samples from Heilongjiang Province, for which farming methods could be accurately identified. Therefore, the original spectra and PCA score plots were not the most suitable methods to discriminate between ORG and CON rice over a large geographic span, and it was necessary to use the supervised algorithm PLS-DA to improve the reliability of the discrimination results for different rice farming methods.

Identification of farming methods with PLS-DA
The PLS-DA technique has been proven to be a powerful and accurate classification method. Diverse pretreatment methods were applied to optimize the PLS-DA model for distinguishing CON and ORG rice. Before establishing the models, a KS algorithm was used to divide the spectra into calibration (75%) and validation sets (25%), and the CON and ORG rice groups were assigned class values of 1 and −1, respectively. If an estimated class value was close to 1 or −1, the sample was considered CON or ORG rice, respectively. Performance parameters of the PLS-DA models are shown in Table 2. For the original spectra, the cross-validation accuracy rate was 89.0% and the optimal number of LVs was 19. When these LVs were used to build the PLS-DA model, the accuracy rate, SE, SP and AUC of the calibration set were 97.7%, 98.2%, 96.9% and 97.5%, respectively, while the validation set accuracy rate was 81.0% (47/58) (Table 2). Eleven samples of the validation set were misidentified.
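The class coding and the performance measures can be illustrated with a short plain-Python sketch. The decision threshold of 0 (the midpoint between the codes 1 and −1), the function names and the example values are assumptions made for this illustration; the text only states that estimated values close to 1 or −1 were assigned to CON or ORG.

```python
def assign_class(y_pred, threshold=0.0):
    """Map estimated class values to 1 (CON) or -1 (ORG)."""
    return [1 if value >= threshold else -1 for value in y_pred]


def binary_metrics(y_true, y_label, positive=1):
    """Sensitivity, specificity and accuracy from true and assigned labels."""
    pairs = list(zip(y_true, y_label))
    tp = sum(1 for t, l in pairs if t == positive and l == positive)
    tn = sum(1 for t, l in pairs if t != positive and l != positive)
    fp = sum(1 for t, l in pairs if t != positive and l == positive)
    fn = sum(1 for t, l in pairs if t == positive and l != positive)
    return {"SE": tp / (tp + fn),
            "SP": tn / (tn + fp),
            "accuracy": (tp + tn) / len(pairs)}


# Made-up estimated class values for five validation samples.
y_true = [1, 1, 1, -1, -1]
labels = assign_class([0.9, 0.2, -0.3, -1.1, 0.1])
metrics = binary_metrics(y_true, labels)
```

With the counts reported above, 47 correct assignments out of 58 validation samples correspond to 47/58, or about 81.0%.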

Identification of farming methods based on PCA
PCA can provide an effective data mining technique and allows for an analysis of the factors responsible for the largest part of the variation of data (Dale et al. 2013). In addition, PCA is also a common data dimensionality reduction method, which is mainly used to linearly combine the original variables in the data and to obtain several orthogonal components, namely principal components (PCs), which explain the covariance matrix of the original data. In the projection of PCA, similar samples will converge together, while dissimilar samples will be relatively far away in space (Brereton et al. 2009). Figure 3a shows the original spectra of rice produced with different farming methods; the spectra of ORG and CON rice overlapped severely, making it difficult to distinguish ORG from CON rice. In the PCA score plot of the raw spectra (Fig. 3b), principal components 1 and 2 (PC1 and PC2) represented 98.5% of the variation in the data, and cluster trends of the ORG and CON rice were not very pronounced, which could be ascribed to the different ORG/CON farming methods (mainly inputs, such as fertilizers and pesticides from different sources) and the climatic environments of different regions. After the spectra were processed by NSD (3,3,2), PC1 and PC2 in the PCA score plot (Fig. 3c) represented only 46.2% of the variation in the data, and cluster trends of the ORG and CON rice still were not very pronounced. This suggested that the spectral plot and PCA score plot could not distinguish different farming methods well, because differences in farming methods and environmental differences in different regions could affect the nutritional components of rice and thus the discrimination results.
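To make the idea of a principal component and its explained variation concrete, the sketch below extracts the first component by power iteration on the covariance matrix. It is a plain-Python illustration only: the study used SIMCA for PCA, and the function name, the iteration count and the starting vector are assumptions. It also assumes the data have non-zero variance.

```python
import math


def first_principal_component(X, n_iter=500):
    """Return (loading vector, scores, explained variance ratio) for PC1."""
    n, p = len(X), len(X[0])
    means = [sum(col) / n for col in zip(*X)]
    Xc = [[v - m for v, m in zip(row, means)] for row in X]  # mean-centre
    cov = [[sum(r[a] * r[b] for r in Xc) / (n - 1) for b in range(p)]
           for a in range(p)]
    # Start from the centred sample with the largest norm.
    v = max(Xc, key=lambda r: sum(x * x for x in r))
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(p))
                     for a in range(p))
    explained = eigenvalue / sum(cov[a][a] for a in range(p))
    scores = [sum(x * l for x, l in zip(row, v)) for row in Xc]
    return v, scores, explained
```

The returned ratio is the share of total variance carried by PC1. The 98.5% and 46.2% figures above are the corresponding sums for PC1 and PC2; obtaining PC2 would need a second pass after removing PC1 from the data.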
To reduce the influence of the environment, 91 samples from Central China were chosen for further analysis (Fig. 4). Figure 4a shows the original spectral plot of the CON and ORG samples from Central China. The peak values of some ORG rice samples were lower than those of CON rice in the wavenumber range of 7,000 to 4,000 cm⁻¹. The cluster trends of rice samples in the PCA score plots (Fig. 4b and c) were relatively obvious. In particular, after NSD pretreatment, the aggregation of ORG and CON rice samples was improved. As shown in the orange circle in Fig. 4c, some CON rice samples could be fully distinguished from ORG rice, but other CON rice samples still overlapped with ORG rice samples. These results further demonstrated that the discrimination of rice farming methods was influenced by the climate and environment of the different areas of rice origin.
A total of 12 ORG and 12 CON rice samples from Heilongjiang Province were also selected to reduce the environmental impact.
In the PLS-DA model built on the original spectra, the eleven misidentified validation samples included seven ORG and four CON rice samples that were misclassified as CON and ORG rice, respectively. This misidentification may be attributed to the different kinds and contents of protein, starch and fat in rice caused by differences in climatic environments in different regions. After NSD and MSC preprocessing, the discrimination accuracy rate, SE, SP and AUC of the calibration set were all higher than those of the original spectral data. The classification accuracy rates of the validation set varied from 79.3 to 89.7%, and NSD (3,3,2) was determined to be the optimal preprocessing method. The best robustness and predictive ability of the PLS-DA model were obtained with a predictive accuracy rate of 89.7% (52/58); five ORG and one CON rice samples were misjudged as CON and ORG rice, respectively. The five misjudged ORG rice samples may be attributed to the effects of different cultivation methods and climatic environments in different producing areas on the nutritional components of rice; however, although these five samples had organic certification, the possibility of fraud in the production process could not be ruled out. The one CON rice sample misclassified as ORG rice was from Northeast China, where the soil is fertile and the fertilizers used included organic duck manure fertilizer with almost no use of chemical fertilizer.
Fig. 3 (a) Raw spectra of organic (ORG) and conventional (CON) samples; (b) Principal component analysis (PCA) score plots of ORG and CON rice samples for principal component (PC) 1 vs. PC 2; (c) PCA score plots of spectra after Norris smoothing derivative (3,3,2) preprocessing of ORG and CON rice samples for PC 1 vs. PC 2
Fig. 4 (a) Raw spectra of organic (ORG) and conventional (CON) samples from Central China; (b) Principal Component Analysis (PCA) score plots of ORG and CON rice samples from Central China; (c) PCA score plots of spectra after NSD (3,3,2) preprocessing of ORG and CON rice samples from Central China
Note: PLS-DA, partial least squares discriminant analysis. The parameters (s,g,n) in NSD are defined as follows: s, the number of data points in one segment; g, the number of data points in one gap; n, 1 or 2 for the first or second derivative. optLVs means the optimal number of latent variables (principal components).
Fig. 5 (a) Raw spectra and (b) principal component analysis score plots of organic (ORG) and conventional (CON) rice samples from Heilongjiang province. Note: plots with (D) and without (N) duck farming
The first two principal components contained 98.5% of the variability in the data, indicating that these two components could explain almost all the variation in the spectral data. However, rice samples from different geographic regions exhibited some overlap in the score plots (Fig. 6a). It was difficult to identify the locations of rice grown in different regions of China directly from the spatial distribution of rice samples, which is consistent with the results of Hao et al. (2019). After NSD (3,3,2) pretreatment, the PCA score plot, with a 46.2% cumulative contribution rate (Fig. 6b), could distinguish only 13 of the Central China rice samples from rice originating from most other geographic regions, with the exception of 2 rice samples from East China.
A significant difference existed in the number of rice samples from each major rice-producing region in the seven large regions (Table 1). Therefore, six representative provinces, Hunan (HN), Hubei (HB), Jiangsu (JS), Anhui (AH), Sichuan (SC) and Heilongjiang (HLJ), were selected to better illustrate the variation in rice from different geographic regions. However, the PCA score plots of both the original and the NSD-pretreated spectra still could not distinguish the rice samples from the six provinces accurately (Fig. 6c and d).
Where organic duck manure fertilizer is applied with almost no use of chemical fertilizer, the planting method is very similar to the organic planting method, which may have led to the misidentification of CON rice as ORG rice. The results show that NIR combined with NSD and PLS-DA is a potential method that can be used to distinguish different farming methods, but it is still necessary to further increase the number of rice samples with different inputs and climatic environments to improve the robustness and predictive ability of the PLS-DA discriminant model.

PCA models
In this study, PCA was also used to analyze the clustering of rice samples originating from different geographic areas. Figure 6 shows the score plots of the first two principal components for the original spectra (Fig. 6a) and the spectra after NSD (3,3,2) pretreatment (Fig. 6b) for rice from the seven regions.
For SC rice, the accuracy of the calibration set after MSC pretreatment increased from 97.7% (169/173) to 98.3% (170/173), with SE, SP and AUC values of 81.2%, 100% and 99.0%, respectively, while the accuracy of the validation set increased to 98% (48/49) and only one rice sample was incorrectly identified. The predictive accuracy of the optimal PLS-DA model for CC rice was 95.9% (47/49) with the spectra preprocessed by NSD (5,5,1); the accuracy of the calibration set was 98.8% and the SE, SP and AUC values of the calibration set were all 100%. The optimal PLS-DA model of SW rice was obtained after MSC pretreatment, with 99.4% calibration set accuracy and 93.9% (46/49) validation set accuracy. The optimal PLS-DA models of the five regions were all obtained from NIR spectra after NSD or MSC pretreatment. The numbers of misjudged validation samples for NE, EC, SC, CC and SW were 0, 2, 1, 2 and 3, respectively, which could be attributed to different planting methods, rice varieties or growing conditions within the same region, such as drought, temperature, salinity and nitrogen deficiency, which could influence the types and contents of starch, protein, fat and other nutrients in rice (Huang et al. 2016; Thitisaksakul et al. 2012) and lead to similar spectral information in different regions. However, these results show that NIR combined with PLS-DA discrimination models can better distinguish the geographic origin of the five regions, and that the models have good robustness and prediction ability.
The geographical origin traceability models of six representative provinces (Hunan (HN), Hubei (HB), Jiangsu (JS), Anhui (AH), Sichuan (SC) and Heilongjiang (HLJ)) were also established by PLS-DA with NSD and MSC preprocessing methods. Before establishing the model, spectra were divided into calibration (75%) and validation (25%) sets, and the geographic origin traceability results of six representative provinces are shown in Table 4.
The accuracies of the calibration and validation sets for the original spectra of the six provinces were higher than 80%. In particular, the accuracies of the origin traceability model of Heilongjiang Province rice reached 100%.
In the provincial PCA score plots (Fig. 6c and d), only 13 Hunan rice samples could be completely distinguished from Jiangsu, Sichuan and Heilongjiang rice. All these results indicated that PCA could extract useful spectral information and reduce spectral dimensions, but it was not the best method to effectively distinguish the origin of rice samples. A supervised PLS-DA algorithm was still needed to improve the discriminant ability of geographic origin traceability models.

PLS-DA models
PLS-DA models were built to enhance the traceability of rice from different regions. To improve the accuracy of the models, the five North China samples and four Northwest China samples were removed before building the traceability discriminant models, as these sample sets were small and of little representativeness (Table 1). The remaining samples were divided by the KS method into a calibration set of 173 samples and a validation set of 49 samples. The PLS-DA geographic origin traceability discrimination performances for the five regions are shown in Table 3.
The model accuracies of the calibration and validation sets based on the original spectra of rice in NE, SC, CC, and SW were higher than 80%; the exception was the prediction accuracy in EC (77.6%). After the NSD and MSC preprocessing, the predictive ability and robustness of all rice geographic origin traceability models were improved. The optimal discrimination model for NE rice was obtained by NSD preprocessing combined with PLS-DA, and the accuracies of the calibration and validation sets were all 100%, with 100% SE, SP and AUC values of the calibration set. For EC rice, the accuracies of the calibration and validation sets varied; the optimal PLS-DA model was chosen with NSD (7,7,2) preprocessing, at 98.8% (171/173) calibration set accuracy and 95.9% (47/49) validation set accuracy, and the SE, SP and AUC values of the calibration set were 100%, 98.3% and 98.3%, respectively. The optimal PLS-DA model of SC rice was obtained after MSC pretreatment.
Reducing the regional scope from the planting regions to provinces did not greatly improve the accuracies of geographic origin traceability models. This could be attributed to the fact that some provinces are contiguous, such as the province pairs of Hubei/Hunan and Hunan/Sichuan, whose climatic environments and agricultural inputs were similar, so it was difficult to completely distinguish these rice samples based on NIR spectral traceability models. The existing results further indicated that NIR combined with PLS-DA is a feasible method to quickly identify rice origin.

The 100% accuracy obtained for Heilongjiang Province rice indicated that the differences in cultivation methods and climatic conditions caused variations in the nutritional components of rice in different provinces, which were reflected in the NIR spectral information. After the NSD and MSC preprocessing, the spectral interference information was reduced or eliminated, and the accuracies of the geographic origin traceability models were improved to varying degrees, while the SE, SP and AUC values of the calibration set were all 100%. The discrimination accuracies of the calibration and validation sets were all 100% except for the predictive accuracies of the NSD (3,3,2)-PLS-DA and MSC-PLS-DA validation set models; the optimal model was NSD (3,3,1)-PLS-DA, with a cross-validation accuracy of 98.4% and 18 optimal LVs. The optimal PLS-DA model of Hubei rice was obtained by NSD (5,5,1) pretreatment with relatively better robustness and prediction ability, where the accuracy of the calibration set increased from 90.6 to 95.3%, although the accuracy of the validation set was still 95.4% (41/43), and two samples were misjudged. Similarly, the optimal preprocessing methods of Hunan, Jiangsu, Anhui and Sichuan rice were NSD (7,7,2), NSD (5,5,2), NSD (3,3,1) or NSD (3,3,2) and MSC, respectively. The calibration set accuracies of the Hunan, Jiangsu, Anhui and Sichuan rice models increased from 86.7 to 90.6%, from 94.5 to 95.3%, from 98.4 to 100% and from 93.8 to 95.3%, respectively. The predictive accuracies of the PLS-DA models for Hunan, Anhui and Sichuan rice increased from 90.7% (39/43) to 95.4% (41/43), from 81.4% (35/43) to 95.4% (41/43), and from 86.1% (37/43) to 88.4% (38/43), respectively, while the PLS-DA predictive ability for Jiangsu rice was maintained at 95.4% (41/43).
The numbers of rice samples misjudged by the optimal prediction models for Hunan, Jiangsu, Anhui and Sichuan provinces were 2, 2, 2 and 5, respectively, which could be attributed to the different planting methods or rice varieties that affected the types and contents of protein and other nutrients in rice.

Conclusion
This study investigated the farming methods and geographical origin of rice in the main rice-producing areas of China using NIR combined with PCA and PLS-DA. The NIR spectra plots and the PCA score plots failed to effectively differentiate ORG and CON rice from different geographic origins, with the exception of most Heilongjiang Province rice, because the environmental differences in different regions may also affect the nutritional composition of rice. ORG and CON rice from Heilongjiang Province could be distinguished, and both ORG and CON rice raised with and without duck farming could also be differentiated. The NIR data combined with the PLS-DA model optimized by NSD could differentiate ORG from CON rice with an 89.7% prediction accuracy rate, due to the differences in nutrient components created by different climate and environmental conditions.
The PCA score plots also could not accurately identify the origin of rice, owing to the combined influence of different regional environments and agricultural inputs. Meanwhile, PLS-DA combined with NSD and/or MSC could discriminate the geographic origin of rice from the NE, EC, SC, CC and SW regions with 100%, 95.9%, 98.0%, 95.9% and 93.9% predictive accuracy rates, respectively, and could also identify the origin of rice from the provinces of Heilongjiang, Hubei, Hunan, Jiangsu, Anhui and Sichuan with 100%, 95.4%, 95.4%, 95.4%, 95.4% and 88.4% predictive ability, respectively. Accordingly, NIR can guarantee the geographic origin authenticity of rice grown in Heilongjiang Province and provides an effective means to quickly identify rice farming methods and geographic origins.
This study confirms that NIR is a rapid detection method for the geographical origin of rice and related farming methods, and can be used by market regulators to improve rice authenticity in the marketplace. Furthermore, to improve the predictive ability and robustness of the models, more rice samples grown with different farming methods in different regions and in different years should be added to the analysis.

Funding information
The authors acknowledge the financial support by Hunan seed industry innovation project (No. 2021NK1001), State Key Laboratory of Hybrid Rice, Hunan Hybrid Rice Research Center opening project (No. 2021KF004).

Data availability statement
The data that support the findings of this study are available in the supplementary material of this article.

Compliance with Ethical Standards
Conflict of interest The authors declare that they have no conflict of interest.
Ethical approval This article does not contain any studies with human or animal subjects.
Informed consent Not applicable.