Feasibility study on use of near infrared spectroscopy for rapid and non-destructive determination of gossypol content in intact cottonseeds

Background: Gossypol found in cottonseeds is toxic to human beings and monogastric animals and is a primary parameter for the integrated utilization of cottonseed products. It is usually determined by techniques that rely on complex pretreatment procedures, and the samples cannot be used in breeding programs after determination, so it is of great importance to predict the gossypol content in cottonseeds rapidly and non-destructively as a substitute for the traditional analytical methods. Results: Gossypol content in cottonseeds was investigated by near-infrared spectroscopy (NIRS) and high-performance liquid chromatography (HPLC). Partial least squares regression, combined with spectral pretreatment methods including Savitzky-Golay smoothing, standard normal variate, multiplicative scatter correction, and first derivative, was tested for optimizing the calibration models. The NIRS technique was efficient in predicting gossypol content in intact cottonseeds, as revealed by the root-mean-square error of cross-validation (RMSECV), root-mean-square error of prediction (RMSEP), coefficient of determination of prediction (Rp²), and residual predictive deviation (RPD) values for all models, which were 0.05-0.07, 0.04-0.06, 0.82-0.92, and 2.3-3.4, respectively. The optimized model, pretreated by Savitzky-Golay smoothing + standard normal variate + first derivative, gave a good determination of gossypol content in intact cottonseeds. Conclusions: Near-infrared spectroscopy coupled with different spectral pretreatments and PLS regression proved feasible for predicting gossypol content in intact cottonseeds rapidly and non-destructively. It could be used as an alternative to the traditional methods for determining the gossypol content in intact cottonseeds.


Introduction
Cotton (Gossypium spp.) is one of the important industrial and economic crops (Sunilkumar et al. 2006). Cottonseed, the main by-product of cotton production, can be used to produce food, animal feed, and other products. Cottonseed contains many kinds of nutrients, including proteins, oils, fatty acids, and amino acids, making it a potential food resource for human beings given the rapid growth of the global population (Sawan et al. 2006). However, Gossypium species are characterized by the presence of gossypol, which is toxic to human beings and monogastric animals (Lordelo et al. 2005), so the utilization of cottonseed products is limited. Gossypol, 1,1',6,6',7,7'-hexahydroxy-5,5'-diisopropyl-3,3'-dimethyl-(2,2'-binaphthalene)-8,8'-dicarbaldehyde, is a terpenoid compound that helps cotton defend against biotic stress (Kong et al. 2010; Lin et al. 1993; Blanco et al. 1983). Due to the toxicity of gossypol, breeding for both lower gossypol content in cottonseeds and higher gossypol content in cotton plants has been practiced in many cotton-planting countries. Cottonseed breeding work often requires analyzing a large number of cottonseed samples to measure gossypol content. Conventionally, gossypol content is assayed by UV spectrophotometry, which not only involves highly toxic reagents but is also neither accurate nor reliable. Despite offering a high level of accuracy and sensitivity, HPLC is usually costly and time-consuming. In addition, both classical analytical methods destroy the test samples, which frequently need to be planted in cotton breeding programs. A rapid and non-destructive method for gossypol determination is therefore required.
Near infrared (NIR) spectroscopy combined with chemometrics is a rapid, convenient, and environmentally friendly analytical technique for the quality analysis of crops (Sohn et al. 2008; Huang et al. 2013; Weinstock et al. 2006; Rosales et al. 2011; Bellato et al. 2011; Bala et al. 2013; Hacisalihoglu et al. 2010; Mendoza et al. 2018; Lee et al. 2017; Tierno et al. 2016; Yang et al. 2008; Kovalenko et al. 2006; Fassio et al. 2004). Although an NIR calibration model for determining gossypol content in cotton powder has been developed (Li et al. 2017), it cannot be used to analyze gossypol content in intact cottonseeds non-destructively, especially in breeding programs where the genetic materials from genetic modification or cross breeding have limited availability. Determining gossypol content in intact cottonseeds by NIR is a challenge because (i) cottonseeds are bigger than other crop seeds, so large voids are left between packed samples in sample cells; (ii) some immature and wizened cottonseeds can be mixed into the samples, which can introduce irrelevant information into the spectral data; and (iii) the tough and thick shell of cottonseed can impede the penetration of NIR light and result in a lower S/N ratio and poor information. Because of these factors, the spectral data of intact cottonseeds are far more complex than those of other crop seeds and may contain a large amount of useless and uncorrelated information such as noise and background. To overcome these difficulties, sophisticated chemometric methods are applied to extract useful information from NIR spectra and calibrate robust models for gossypol content in intact cottonseeds. Essentially, these include regression methods such as principal component regression (PCR) (Xie et al. 1997), partial least squares (PLS) (Haaland et al. 1988), support vector machines (SVM) (Nie et al. 2008), least squares support vector machines (LS-SVM) (Shao et al. 2012), and artificial neural networks (ANN) (Makinoa et al. 2010), coupled with spectral pretreatments such as standard normal variate (SNV) (Barnes et al. 1989), Savitzky-Golay (SG) smoothing (Savitzky et al. 1964), multiplicative scatter correction (MSC) (Hopke et al. 2003), and first derivative (Rinnan et al. 2009).
Because the test sample is destroyed, previous NIR models for detecting gossypol in cottonseed meal can barely be applied in breeding trials (Li et al. 2017). In the present study, the feasibility of analyzing gossypol in intact cottonseeds with an NIR spectrometer was investigated. The main aim of this study was to establish an optimal model that could provide powerful technical support for cotton breeders and others who work on cottonseeds.

Samples and preparation
A total of 268 samples of cottonseeds were collected from different growing areas, including Hangzhou (Zhejiang, China), Xiaoshan (Zhejiang, China), Sanmen (Zhejiang, China), Sanya (Hainan, China), Wuhu (Anhui, China), and Yancheng (Jiangsu, China), in 2012 and 2014. The cottonseed samples were delinted and dried at 30℃ to constant weight. After spectral acquisition by NIR spectroscopy, the intact cottonseed samples were dehulled and then ground to cottonseed kernel powder for HPLC analysis. All preparations were carried out under the same experimental conditions in order to reduce the influence of other physical factors.
Gossypol extraction
A 0.1 g portion of cottonseed kernel powder was suspended in 5 mL acetone and sonicated in an ultrasonic bath for 45 minutes. Then, the suspension was filtered through quantitative filter paper, followed by filtration with a 0.45 μm syringe filter (Agela, Newark, USA). The sediment was washed three times with acetone. After this procedure, the extract was adjusted to 25 mL with acetone.
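As a worked example of the arithmetic implied by this procedure, the following minimal sketch converts an extract concentration measured by HPLC into gossypol content per unit kernel mass. Only the 0.1 g sample mass and the 25 mL final volume come from the text; the function name and the example concentration of 20 μg/mL are hypothetical and serve purely as illustration.

```python
def gossypol_percent(conc_ug_per_ml, final_volume_ml=25.0, sample_mass_g=0.1):
    """Convert extract concentration (ug/mL) to gossypol content (% w/w).

    Defaults follow the extraction described in the text; the function
    itself is an illustrative assumption, not the authors' code.
    """
    total_ug = conc_ug_per_ml * final_volume_ml   # gossypol in the whole extract
    sample_ug = sample_mass_g * 1e6               # kernel powder mass in micrograms
    return 100.0 * total_ug / sample_ug


# Hypothetical extract concentration of 20 ug/mL.
example_content = gossypol_percent(20.0)
print(f"{example_content:.3f} % (w/w)")
```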

HPLC analysis
HPLC analysis was performed on an Agilent 1100 HPLC system (Agilent, Santa Clara, USA) equipped with an auto-sampler and UV detection. A C18 column (250 mm × 4.6 mm, 5 μm, Dikma, Richmond Hill, USA) was employed as the stationary phase. The mobile phase consisted of methanol/0.2% H3PO4 (80/20, v/v). The injection volume was 10 μL and the flow rate was 1.0 mL·min⁻¹. The UV detector was set at 238 nm and the column temperature was 25℃. Each sample was measured three times. The limit of detection (LOD) was obtained at a signal-to-noise (S/N) ratio of three and the limit of quantification (LOQ) at an S/N ratio. To assess the stability of gossypol at room temperature, three samples were randomly selected to determine the changes of peak area within 36 hours. HPLC-grade gossypol was purchased from Sigma (Sigma-Aldrich, St. Louis, USA). Methanol (HPLC grade) was procured from Tianjin Chemical Reagent Company (Tianjin, China). Double deionized water was prepared using a Milli-Q water purification system (Millipore, Molsheim, France).

NIR spectra acquisition
The NIR spectra of intact cottonseed samples were scanned with a Büchi Flex-N500 NIR spectrometer (Büchi, Flawil, Switzerland) equipped with a solid sample module. The NIR spectra were collected across the range 4 000-10 000 cm⁻¹ and were recorded with a spectral resolution of 4 cm⁻¹.
Samples were measured three times on a rotating cylinder device at 25 ± 0.5℃ and 60% relative air humidity. All the spectra were transformed into absorbance (log (1/R)).
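The transformation to absorbance can be sketched as below. This is a minimal illustration with NumPy, assuming reflectance values between 0 and 1; the function name and the example reflectance values are hypothetical, and averaging the three replicate scans is an assumption, since the text does not state how replicates were combined.

```python
import numpy as np


def reflectance_to_absorbance(reflectance):
    """Apparent absorbance A = log10(1/R) for reflectance values in (0, 1]."""
    r = np.asarray(reflectance, dtype=float)
    if np.any(r <= 0):
        raise ValueError("reflectance must be strictly positive")
    return np.log10(1.0 / r)


# Three hypothetical replicate scans of one sample (rows) at four wavenumbers.
scans = np.array([[0.50, 0.40, 0.25, 0.10],
                  [0.52, 0.41, 0.24, 0.10],
                  [0.48, 0.39, 0.26, 0.10]])

# Assumption: replicates are averaged after conversion to absorbance.
mean_absorbance = reflectance_to_absorbance(scans).mean(axis=0)
```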

Spectral pretreatment
Before calibration, the spectral data were pretreated for optimal performance. Eight pretreatment strategies, each consisting of one or a combination of Savitzky-Golay smoothing, SNV, MSC, and first derivative (Norris gap), were compared with the raw spectra.
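The scatter-correction and derivative steps named above can be sketched as follows. This is an illustrative NumPy version, not the MATLAB routines used in the study: the function names are hypothetical, MSC is assumed to use the mean spectrum as reference, and the derivative is a plain gap difference that omits the segment averaging of a full Norris gap-segment derivative.

```python
import numpy as np


def snv(spectra):
    """Standard normal variate: centre and scale each spectrum (row) by its own mean and SD."""
    x = np.asarray(spectra, dtype=float)
    return (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, ddof=1, keepdims=True)


def msc(spectra, reference=None):
    """Multiplicative scatter correction against a reference (default: mean spectrum)."""
    x = np.asarray(spectra, dtype=float)
    ref = x.mean(axis=0) if reference is None else np.asarray(reference, dtype=float)
    corrected = np.empty_like(x)
    for i, row in enumerate(x):
        # Fit row ~ intercept + slope * ref, then remove both effects.
        slope, intercept = np.polyfit(ref, row, 1)
        corrected[i] = (row - intercept) / slope
    return corrected


def gap_derivative(spectra, gap=1):
    """First derivative as a central gap difference (x[j+gap] - x[j-gap]) / (2*gap)."""
    x = np.asarray(spectra, dtype=float)
    return (x[:, 2 * gap:] - x[:, :-2 * gap]) / (2.0 * gap)


# Synthetic spectra: one base curve with additive and multiplicative scatter.
base = np.sin(np.linspace(0.0, 3.0, 50)) + 2.0
offsets = np.array([0.0, 0.3, -0.2])
gains = np.array([1.0, 1.2, 0.8])
demo = offsets[:, None] + gains[:, None] * base

demo_msc = msc(demo)                        # rows collapse onto the mean spectrum
demo_snv_d1 = gap_derivative(snv(demo))     # SNV followed by first derivative
```

Each gap difference shortens the spectrum by `2 * gap` variables, so the order of operations and the gap size both affect the number of variables passed on to calibration.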

Sampling design
Samples were assigned to calibration and prediction sets using Kennard-Stone (KS) selection (Kennard et al. 1969). The calibration models were established with the calibration set, and the prediction set was used to validate the predictive capabilities and analytical features of the calibration models.
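The Kennard-Stone selection can be illustrated with the following minimal sketch. It assumes Euclidean distances computed on the spectra (the text does not state the distance metric), and the function names and random demonstration data are hypothetical.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distance matrix computed from the Gram matrix (memory friendly)."""
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    return np.sqrt(np.clip(d2, 0.0, None))


def kennard_stone(x, n_select):
    """Return indices of n_select samples chosen by the Kennard-Stone max-min rule."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of samples")
    dist = pairwise_distances(x)
    # Start with the two most distant samples.
    first, second = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(first), int(second)]
    remaining = [i for i in range(n) if i not in selected]
    while len(selected) < n_select:
        # Distance from each candidate to its nearest already-selected sample.
        nearest = dist[np.ix_(remaining, selected)].min(axis=1)
        pick = remaining[int(np.argmax(nearest))]
        selected.append(pick)
        remaining.remove(pick)
    return selected


# Hypothetical data: 20 samples with 5 variables, 15 assigned to calibration.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(20, 5))
calibration_idx = kennard_stone(x_demo, 15)
prediction_idx = sorted(set(range(20)) - set(calibration_idx))
```

With the study's data, the same call would take the 268 spectra and request 218 calibration samples, leaving 50 for prediction.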

PLS regression
PLS regression has been widely used as a calibration method to investigate the relationship between the spectral data and the corresponding reference data. Before calibration of the PLS models, the data sets (spectral and reference data) were analyzed using 4-fold cross-validation to develop a full-spectrum calibration model. The aim of the cross-validation was to find the optimum number of latent variables (LV) for PLS. The root-mean-square error of cross-validation (RMSECV) served as the measure for adjusting the parameters, and the number of LV that provided the lowest RMSECV was selected as the best.
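The selection of the number of latent variables by cross-validation can be sketched as follows. This is a minimal NumPy implementation of single-response PLS (PLS1) by the NIPALS algorithm, offered as an illustration rather than the MATLAB code used in the study; the function names, the interleaved assignment of samples to the four folds, and the synthetic data are all assumptions.

```python
import numpy as np


def pls1_fit(x, y, n_lv):
    """Fit a PLS1 model by NIPALS; return (coefficients, x_mean, y_mean)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(axis=0), y.mean()
    xr, yr = x - x_mean, y - y_mean            # centred working copies
    weights, loadings, inner = [], [], []
    for _ in range(n_lv):
        w = xr.T @ yr
        w = w / np.linalg.norm(w)              # weight vector
        t = xr @ w                             # scores
        tt = t @ t
        p = xr.T @ t / tt                      # X loadings
        q = (yr @ t) / tt                      # y loading
        xr = xr - np.outer(t, p)               # deflate X
        yr = yr - q * t                        # deflate y
        weights.append(w)
        loadings.append(p)
        inner.append(q)
    w_mat, p_mat = np.array(weights).T, np.array(loadings).T
    # Regression vector B = W (P^T W)^-1 q
    coef = w_mat @ np.linalg.solve(p_mat.T @ w_mat, np.array(inner))
    return coef, x_mean, y_mean


def pls1_predict(model, x):
    """Predict the response for new spectra from a fitted model tuple."""
    coef, x_mean, y_mean = model
    return (np.asarray(x, dtype=float) - x_mean) @ coef + y_mean


def rmsecv(x, y, n_lv, n_folds=4):
    """Root-mean-square error of k-fold cross-validation for a given number of LVs."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    fold_id = np.arange(len(y)) % n_folds      # assumed interleaved fold assignment
    squared_error = 0.0
    for k in range(n_folds):
        test = fold_id == k
        model = pls1_fit(x[~test], y[~test], n_lv)
        squared_error += np.sum((pls1_predict(model, x[test]) - y[test]) ** 2)
    return float(np.sqrt(squared_error / len(y)))


# Synthetic stand-in data (not the study's spectra): 60 samples, 40 variables.
rng = np.random.default_rng(0)
scores = rng.normal(size=(60, 3))
x_demo = scores @ rng.normal(size=(3, 40)) + 0.05 * rng.normal(size=(60, 40))
y_demo = scores @ np.array([1.0, -0.5, 0.25]) + 0.05 * rng.normal(size=60)

cv_errors = {n_lv: rmsecv(x_demo, y_demo, n_lv) for n_lv in range(1, 9)}
best_lv = min(cv_errors, key=cv_errors.get)    # LV count with the lowest RMSECV
```

Taking the global minimum of RMSECV follows the rule stated above; a more conservative variant would stop at the first local minimum to limit overfitting.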

Model evaluation
The calibration models were evaluated using the following quality parameters, in which N is the total number of samples, Y_nirs is the value predicted by the calibration models, Y_ref is the reference value determined by HPLC, and SD is the standard deviation.
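For reference, the conventional definitions of these parameters are sketched below. They are the forms commonly used in NIR calibration work and are assumed here, since the equations themselves are not reproduced in this text; the mean reference value, written with an overbar, is a symbol introduced for this sketch.

```latex
\begin{align}
\mathrm{RMSECV}\ \text{or}\ \mathrm{RMSEP} &= \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Y_{\mathrm{nirs},i}-Y_{\mathrm{ref},i}\right)^{2}} \\
R^{2} &= 1-\frac{\sum_{i=1}^{N}\left(Y_{\mathrm{nirs},i}-Y_{\mathrm{ref},i}\right)^{2}}{\sum_{i=1}^{N}\left(Y_{\mathrm{ref},i}-\overline{Y}_{\mathrm{ref}}\right)^{2}} \\
\mathrm{RPD} &= \frac{\mathrm{SD}}{\mathrm{RMSEP}}
\end{align}
```

RMSECV is computed from the cross-validation predictions of the calibration set and RMSEP from the predictions for the prediction set; SD is taken over the reference values of the prediction set.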
The coefficient of determination of prediction (Rp²), the root-mean-square error of prediction (RMSEP), the coefficient of determination of calibration (Rc²), the root-mean-square error of cross-validation (RMSECV), and the residual predictive deviation (RPD) were used as criteria to evaluate model performance. An acceptable model should have high Rc² and Rp² values and low RMSECV and RMSEP values. In addition, the model is considered robust if the RPD is higher than 2.5.
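These criteria can be computed as in the following sketch, which assumes the conventional definitions (R² as one minus the ratio of residual to total sum of squares, and RPD as the sample standard deviation of the reference values divided by the RMSE). The function names and the example values are hypothetical.

```python
import numpy as np


def rmse(y_ref, y_pred):
    """Root-mean-square error between reference and predicted values."""
    y_ref, y_pred = np.asarray(y_ref, dtype=float), np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_pred - y_ref) ** 2)))


def r_squared(y_ref, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    y_ref, y_pred = np.asarray(y_ref, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_pred - y_ref) ** 2)
    ss_tot = np.sum((y_ref - y_ref.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


def rpd(y_ref, y_pred):
    """Residual predictive deviation: SD of the reference values over the RMSE."""
    return float(np.std(np.asarray(y_ref, dtype=float), ddof=1) / rmse(y_ref, y_pred))


def is_robust(rpd_value, threshold=2.5):
    """Apply the RPD > 2.5 robustness rule stated in the text."""
    return rpd_value > threshold


# Hypothetical reference and predicted values for five samples.
y_ref_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred_demo = y_ref_demo + np.array([0.1, -0.1, 0.1, -0.1, 0.1])
demo_rpd = rpd(y_ref_demo, y_pred_demo)
```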

Software
NIR spectroscopic data (268 samples × 1501 variables) were exported in text format, organized in Excel spreadsheets, and then transferred into MATLAB R2011a (MathWorks, Natick, USA) for chemometric analysis. All the algorithms for spectral pretreatment, sampling design, and regression were implemented in MATLAB R2012a.
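The stated dimensions are consistent with the acquisition settings, as the short check below illustrates. It assumes that the 1501 variables are evenly spaced and include both ends of the 4 000-10 000 cm⁻¹ range, which the text does not state explicitly; the array names are hypothetical and the zero-filled matrix is only a placeholder for the exported spectra.

```python
import numpy as np

n_samples, n_variables = 268, 1501

# Assumed evenly spaced wavenumber axis covering both endpoints (cm^-1).
wavenumbers = np.linspace(4000.0, 10000.0, n_variables)
step = wavenumbers[1] - wavenumbers[0]      # 6000 / 1500 intervals

# Placeholder for the exported data matrix (rows: samples, columns: variables).
spectra = np.zeros((n_samples, n_variables))
print(f"data point spacing: {step:.1f} cm^-1, matrix shape: {spectra.shape}")
```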

HPLC analysis
The regression equation, correlation coefficient (r²), limit of detection (LOD), limit of quantification (LOQ), and average recovery of gossypol are listed in Table 1. The retention times of the gossypol standard and the gossypol extract were 9.91 and 9.60 minutes, respectively (Fig. 1). Table 2 shows the stability of the gossypol peak area determined by HPLC over 24 hours. All the results indicated that the improved HPLC method could be used to detect gossypol content, and that the cottonseed extract should be analyzed within 24 hours.

NIR spectra analysis
Across the spectral range of 4 000-10 000 cm⁻¹, absorbance values are mainly associated with the combination and overtone bands of the C-H, N-H, O-H, and S-H bonds (Macho et al. 2002), which are quite sensitive to compositional variations in complex samples. Fig. 2A shows the raw intact cottonseed spectra in the NIR spectral region. The spectra showed six broad absorption peaks around 4 200, 4 700, 5 150, 5 580, 6 900, and 8 400 cm⁻¹. The small peak observed at 4 200 cm⁻¹ fell within the region associated with the combination bands of C-H. The peaks at 5 150 and 6 900 cm⁻¹ could be attributed to the combination and the first overtone bands of O-H, respectively, which were identified as water absorption. The gentle peaks at 5 580 and 8 400 cm⁻¹ overlapped with the first and second C-H overtone regions, respectively. It is worth mentioning that the peak at 4 700 cm⁻¹ was attributed to the first C-H combination bands of alkenes and aromatic hydrocarbons, which could be identified as the absorption of polyphenolic terpenes, including gossypol and its derivatives.
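Because NIR band assignments are often tabulated by wavelength, the peak positions above can be converted from wavenumber to wavelength with the relation λ (nm) = 10⁷ / ν (cm⁻¹). The short sketch below applies it to the six peaks; the function and variable names are hypothetical.

```python
def wavenumber_to_nm(wavenumber_cm1):
    """Convert a wavenumber in cm^-1 to a wavelength in nm."""
    return 1e7 / wavenumber_cm1


# Peak positions reported in the text (cm^-1).
peaks_cm1 = [4200, 4700, 5150, 5580, 6900, 8400]
peaks_nm = {nu: round(wavenumber_to_nm(nu)) for nu in peaks_cm1}

for nu, nm in peaks_nm.items():
    print(f"{nu} cm^-1 is about {nm} nm")
```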
The raw spectra were homogeneous, so the presence of noise could not be directly identified. Consistent baseline offsets and bias were present in the spectra, which are common features in NIR spectra.
Hence, eight pretreatment strategies were applied to optimize the raw spectra before the calibration models were established. The pretreated spectra for several representative strategies are shown in Fig. 2B, Fig. 2C, and Fig. 2D. To different degrees, all these pretreatments could reduce the physical variation among samples due to scattering and remove both additive and multiplicative effects in the spectra. It was noted that ten variables were lost after SG smoothing. Hence, 1491 variables were used for calibration in the models that included SG smoothing in the spectral pretreatment.
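The loss of ten variables is what an 11-point smoothing window produces when the five points at each end of the spectrum are discarded instead of padded. The sketch below illustrates this with a NumPy implementation of Savitzky-Golay filtering; the 11-point window, the second-order polynomial, and the function names are assumptions, since the text does not report the filter settings.

```python
from math import factorial

import numpy as np


def savgol_coefficients(window, polyorder, deriv=0):
    """Savitzky-Golay convolution weights for unit point spacing."""
    half = window // 2
    offsets = np.arange(-half, half + 1, dtype=float)
    vander = np.vander(offsets, polyorder + 1, increasing=True)  # columns 1, k, k^2, ...
    # Row `deriv` of the pseudo-inverse yields the fitted polynomial coefficient
    # a_deriv; multiplying by deriv! gives the derivative at the window centre.
    return np.linalg.pinv(vander)[deriv] * factorial(deriv)


def savgol_valid(spectra, window=11, polyorder=2, deriv=0):
    """Filter each row without edge padding, so window - 1 variables are lost."""
    x = np.atleast_2d(np.asarray(spectra, dtype=float))
    coeffs = savgol_coefficients(window, polyorder, deriv)
    # Reversing the weights turns np.convolve into a sliding dot product.
    return np.array([np.convolve(row, coeffs[::-1], mode="valid") for row in x])


# Placeholder spectra with the study's 1501 variables.
spectra_demo = np.zeros((3, 1501))
smoothed = savgol_valid(spectra_demo)
print(smoothed.shape)
```

Passing `deriv=1` to the same function returns a Savitzky-Golay first derivative, which is an alternative to the Norris gap derivative named in the text.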

Kennard-Stone sampling design
The Kennard-Stone algorithm is an effective method for extracting a sample subset in multidimensional space; it includes the most diverse samples and enables the selection of a subset of representative samples. It has therefore been confirmed that a calibration set extracted using KS selection has a better predictive capability than a set built randomly or constructed by other data selection methods such as Kohonen self-organized mapping (Kohonen et al. 1982) and D-optimal designs (de Aguiar et al. 1995). In this study, the 268 intact cottonseed samples were divided into calibration and prediction sets using the KS algorithm, with the former consisting of 218 samples and the latter of 50 samples. The statistical values of gossypol content in the cottonseed samples of the calibration and prediction sets are presented in Table 3, which indicates that the range of variation in gossypol content was broad enough to develop NIR calibration models.

PLS regression
The calibration models of gossypol content in intact cottonseeds based on PLS regression were established in the NIR spectral range of 4 000-10 000 cm⁻¹, and the results are summarized in Table 4.
The number of LV was selected with the aid of cross-validation, using the first minimum of RMSECV for all models. The RMSECV and RMSEP values for all the calibration models were 0.05-0.07 and 0.04-0.06 for the calibration and prediction sets, respectively. The values of Rp² and Rc² ranged from 0.82 to 0.93 and from 0.87 to 0.97, respectively. The RPD values ranged from 2.3 to 3.4. The eight pretreatment strategies comprised single-pretreatment strategies (SG smoothing, SNV, MSC, and first derivative), two-pretreatment strategies (SNV + first derivative and MSC + first derivative), and three-pretreatment strategies (SG smoothing + SNV + first derivative and SG smoothing + MSC + first derivative). Among the single-pretreatment strategies, the PLS model using eight latent variables based on MSC produced better results, with low values of RMSECV and RMSEP (0.06 and 0.05, respectively), and the RPD value was increased by 20.36% compared with that of the direct regression model based on raw spectra (Fig. 4). Fig. 3B shows the correlation for the model using MSC, presented by plotting predicted against reference values for gossypol content in intact cottonseeds. Samples near the diagonal line had predicted values closer to the reference ones, and vice versa. Among the two-pretreatment strategies, the calibration model based on SNV + first derivative presented a better predictive ability than that based on MSC + first derivative, with Rc² and Rp² values of 0.962 and 0.887, respectively. The RPD value of that model was 3.0, an increase of 28.14% compared with the model using raw spectra. Of all the calibration models established, the best was pretreated using SG + SNV + first derivative; it had the highest Rc² (0.97) and Rp² (0.93), and its RPD (3.4) was 46.28% higher than that of the raw spectral model. Furthermore, its RMSECV (0.05) and RMSEP (0.04) were the lowest among all the models.
In the correlation plot, the predicted and reference values clustered along the diagonal line (Fig. 3D). This indicated that the model using SG + SNV + first derivative and PLS was accurate and robust enough to substitute for the conventional gossypol analysis method (HPLC) in measuring gossypol in intact cottonseeds.
The NIR spectra of these intact seeds generally contained many undesirable features, including noise, overlapping peaks, baseline effects, and some systematic behaviors, caused by the seed size, the shell, and other physical factors. Hence, a suitable pretreatment strategy is required for widespread application of NIR technology in crop seed analysis. This work indicated that an appropriate pretreatment strategy before regression is important for refining the effective information in the spectral data and eliminating spectral deviation, so that an accurate and robust NIR model can be calibrated.
The calibration models reported here confirmed, for the first time, the feasibility of using NIR technology for rapid and non-destructive determination of gossypol, an important parameter for cottonseed products, in intact cottonseeds. The high RPD value (3.4) suggested that this technology could be an effective method for the measurement of gossypol in intact cottonseeds. The optimal model could substitute for conventional analysis methods for gossypol, including UV spectrophotometry and HPLC. Because of its potential for high sample throughput and low cost, as well as a significant reduction in the use of toxic chemicals, the NIR method could be promoted and extended to other similar agricultural products.

Conclusions
The calibration and validation statistics obtained in the current work showed the potential of NIRS to predict gossypol content in intact cottonseeds. The optimized model was the one pretreated by Savitzky-Golay smoothing + standard normal variate + first derivative, with RMSECV, RMSEP, Rp², and RPD of 0.05, 0.04, 0.92, and 3.4, respectively, which provides a feasible method for determining gossypol content in intact cottonseeds.

Declarations
Ethics approval and consent to participate
Not applicable.

Consent for publication
All co-authors have consented to the submission of the manuscript.

Availability of data and materials
All relevant data are within this article.
Competing interests

Figure 1 Chromatograms of (A) gossypol standard and (B) gossypol extract in intact cottonseeds.