Predicting cotton fiber properties from fiber length parameters measured by dual-beard fibrograph

Cotton fiber properties, although strongly influenced by plant growth conditions, are largely dictated by the cotton variety; therefore, certain inherent associations exist among these properties. Previous studies examined the mutual influences of cotton properties (e.g., fiber maturity on strength), but latent associations between fiber length and other important properties (e.g., fineness, maturity and strength) have not been explored. This paper investigates these relationships and creates regression models to predict the fiber properties from the length parameters, so that an overview of cotton quality can be provided when only length measurements are available. We collected 100 cotton samples as a training set and 17 extra samples as a testing set, and measured the fiber length parameters using the dual-beard fibrograph and seven other fiber properties (strength, elongation, micronaire, nep, fineness, immature fiber content, and maturity ratio) using the High Volume Instrument and the Advanced Fiber Information System. We then performed correlation, multicollinearity, regression and clustering analyses on the fiber properties. The fiber length parameters had moderate associations (0.3 < |r| < 0.7) with the seven properties, and the prediction errors for the training set varied from 2.25% (maturity ratio) to 14.36% (nep). The Bland-Altman analysis showed that, for all seven properties, more than 94.9% of the predicted and actual points were within the 95% agreement limits, with no systematic bias. The regression models based on the five cotton clusters consistently lowered the prediction errors through the optimally aggregated fiber properties. Comparable results were obtained from the testing set, which demonstrated the good generalization power of the prediction models.


Introduction
The dual-beard fibrograph (DBF) was developed to measure the fiber length distribution (FLD) of cotton (Jin et al. 2018) and to calculate a set of industry-defined length parameters derived from the FLD, such as the upper half mean length (UHML), short fiber content (SFC), and uniformity index (UI) (Zhou and Xu 2021). These length parameters are evaluated separately in the current classification system for cotton quality assessment (Cotton Incorporated 2018). For instance, UHML is reported in both 100ths and 32nds of an inch (1 in = 25.4 mm), while length uniformity is classified into five levels based on different thresholds of UI (Cotton Incorporated 2018). Multiple length parameters can also be used together to perform a multivariate classification for evaluating length uniformity.
Physical properties of cotton fibers, such as fiber length, fineness, maturity and strength, are largely dictated by the cotton variety or genetic makeup, but are susceptible to environmental conditions (e.g., temperature, water and nutrient) experienced by the plant (Basra and Saha 2020;Cotton Incorporated 2018;Seagull 2001). From an epidermal cell to a mature fiber, the fiber development takes four distinctive phases, including initiation (3 days before anthesis to 3 days post anthesis−DPA), elongation (3 to 25 DPA), secondary wall thickening (15 to 45 DPA), and maturation (45 to > 50 DPA) (Seagull 2001;Qin et al. 2011;Basra and Saha 2020;Wilkins and Jernstedt 2020). At the end of the development, a fiber can gain its length by 4000 − 5000 times, diameter by 2 − 3 times and wall volume by more than 10,000 times (Seagull 2001). Each phase involves specific biological processes, and directly impacts the final physical properties of fibers. However, considerable overlapping occurs between the 2nd (elongation) and 3rd (secondary wall thickening) phases, allowing the length, diameter and wall thickness of a fiber to grow simultaneously in a period varying from 5 to 10 DPA (or 10-20% of the entire DPA) (Seagull 2001;Ryser 2020) recorded the changes in the fiber length and diameter of a cotton variety (G. hirsutum, variety MD51 Ne) from 5 DPA to 50 DPA. Hernandez-Gomez et al. (2015) measured the fiber cell wall thickness of four cotton species (PimaS7, Fm966, Krasnyj and JFW15) on the 10th, 17th and 25th DPA. Both studies revealed that the fiber length, diameter and wall thickness had upward trends with the increase of DPA even though their changes were in different rates. Thus, some inherent associations may exist among these fundamental properties, which in turn can influence other properties, such as maturity, fineness and strength.
To help cotton breeders select cotton varieties having the desired properties, Mangialardi et al. (1990) assessed the effect of the fiber fineness, maturity, and strength on the number of neps and concluded that neps were most highly correlated with maturity and fineness, with correlation coefficients of − 0.30 and − 0.40, respectively. A low but significantly positive correlation was also found between the nep count and fiber strength (van der Sluijs et al. 2016). Montalvo et al. (2004) used a near-infrared (NIR) instrument to analyze fineness, maturity and micronaire of cotton, and found a significant relationship between fiber strength and reflectance (Rd), micronaire, and moisture. In a subsequent study, Montalvo et al. (2005) further reported that cotton micronaire had linear relationships with the cotton fineness and maturity ratio, with R-squared values of 0.88 and 0.87, respectively. Kim et al. (2019) established relationships between the maturity and strength-related properties. These studies demonstrated that cotton maturity is positively correlated with the bundle fiber strength and elongation; moreover, maturity is a major factor that dictates fiber strength.
Among the fiber properties, cotton length is considered the most crucial attribute affecting yarn spinning efficiency and quality (Thibodeaux et al. 2008). Krifa (2016) studied the influences of cotton maturity, fineness, and strength on the fiber length distribution and observed that immature and weak cottons exhibited a unimodal length distribution, whereas mature and strong cottons tended to show a bimodal length distribution. However, few studies have explored the quantitative relationships between cotton fiber length and other properties.
In a previous study (Zhou et al. 2021), we demonstrated a new imaging approach, DBF, which measures the FLD of a cotton sliver clamped randomly and combed to form two tapered beards. DBF can mitigate the fiber entanglement and alignment problems encountered in single-beard measurements (Jin et al. 2018; Breuer and Farber 2008; Pabich et al. 2010) and can output reliable and comprehensive length parameters derived from FLDs (Zhou and Xu 2021). Based on the correlation analysis of the length parameters (Zhou and Xu 2020), we found that three major parameters, UHML, SFC and UI, were most effective for characterizing cotton length attributes when used collectively. Here, UHML represents the average length of the longer half of the fibers, SFC refers to the number of fibers shorter than 12.7 mm (or 0.5 in), and UI indicates the overall length uniformity (Cotton Incorporated 2018). Together, these parameters provide a holistic view of the cotton length quality. Figure 1 shows the FLDs of two distinct cottons and their properties, including the UHML obtained from DBF, the strength obtained from the High Volume Instrument (HVI), and the maturity ratio obtained from the Advanced Fiber Information System (AFIS). The two FLDs follow approximately normal distributions, but Cotton 1 has a high long-fiber content and Cotton 2 has a high short-fiber content. Thus, their UHMLs are markedly different (30.41 mm vs. 25.58 mm). Because of the overlap between the 2nd (elongation) and 3rd (secondary wall thickening) phases, the co-development of fiber length, diameter and wall thickness can cause inherent associations among these properties. It is likely that in the overlap range (5 to 10 DPA), a longer fiber has more DPA to thicken its secondary wall (daily layering of cellulose) than a shorter fiber, which leads to a more mature and stronger fiber. This association is evidenced by Cotton 1 and Cotton 2 in this case: overall, Cotton 1 has a higher strength and maturity ratio than Cotton 2. Therefore, cotton fiber length has the potential to be a useful factor for estimating other physical properties of cotton.
In a preliminary study, we examined the associations between the length parameters and other properties of 100 cotton samples through a multivariate regression analysis. The length parameters, UHML, SFC and UI, were measured separately by DBF, HVI and AFIS, and used as the input variables. The other properties, including strength and maturity ratio, were taken out of the HVI and AFIS measurements, and used individually as the dependent variable. As shown in Table 1, the correlation coefficients (|r|) between the fiber properties and the (UHML, SFC, UI) of the three methods were in a moderate range (0.3<|r|< 0.7) (Ratner 2009). There was no particular advantage or disadvantage for any of these methods when their length measurements were used to estimate the associations with the other fiber properties. In this study, we will focus on the use of the length measurements from DBF for predicting other important fiber  properties to provide an overview on cotton quality without the HVI and AFIS measurements. This study attempts to expand the utilization of these three length parameters to evaluate other important properties of cotton, including its strength, maturity, fineness and nep, and to add a useful function to DBF for the quick assessment of cotton fiber properties as soon as a dual-beard sample is scanned on DBF. By incorporating measurements from different fiber testing methods, we first explore the associations between the cotton fiber length parameters measured by DBF and the other physical properties obtained from HVI and AFIS, and then establish models for estimating these properties based on the length parameters. The outcomes of this study can enrich the understanding of the relationship among various cotton properties and provide a fast and efficient means for the comprehensive evaluation of cotton quality when only fiber length measurements are available. This is particularly useful when the HVI or AFIS testing is not attainable.

Materials and methods
In this study, we used two batches of U.S. upland cotton samples provided by the Fiber and Biopolymer Research Institute, Texas Tech University (FBRI-TTU). The first batch contained 100 samples to be used as the training set, and the second batch contained 17 samples to be used as the testing set. Each cotton sample was divided into three specimens that were tested separately by AFIS (Uster AFIS PRO 2, Knoxville, TN) and HVI (Uster HVI 1000, Knoxville, TN) at FBRI-TTU, and by DBF in our research lab at the University of North Texas (UNT). For each specimen, one replica was used for the AFIS and HVI testing, and three replicas were used for the DBF testing. The three testing methods yielded a large set of the measurements of cotton properties related to the fiber length, strength, elongation, maturity, fineness, and nep. In the subsequent analyses, the three major length parameters obtained from DBF, i.e., UHML, SFC, and UI, were used as the input (or independent) variables, and the seven other fiber properties, i.e., the strength, elongation, and micronaire (MIC) from HVI and nep, fineness, immature fiber content (IFC), and maturity ratio (MR) from AFIS, were used individually as the dependent variable. Table 2 lists the basic statistics of the properties of the 117 cotton samples measured using DBF, HVI, and AFIS. It can be seen that the selected cotton samples cover a wide range for each of these properties, which is important for examining associations among the properties.
Correlation, agreement, and hypothesis tests were performed on the test data for the 117 cotton samples to examine the associations between the cotton properties measured by HVI and AFIS and the length parameters measured by DBF, and to build prediction models for the fiber properties. To reduce the model estimation errors caused by the variability of individual cottons, a clustering analysis was conducted, in which cotton samples with high similarity in length parameters were grouped into the same cluster, and a few distinct clusters were created to represent collective cotton features based on the cluster centroids. Regression models based on the cluster centroids may better reflect the inherent associations of fiber properties. In the multivariate regression analysis, we used R-squared (R²) or the correlation coefficient (r = ±√R²) to measure the correlation or association between two variables, the F-test to report how well a regression model fits the data, and the t-test (or coefficient analysis) to analyze the significance of each independent variable in the model. When |r| is between 0.3 and 0.7, a moderate (either positive or negative) correlation exists between the variables (Ratner 2009). The significance level was set at α = 0.05. If the p-value in the F-test is below α = 0.05, the regression model is considered to be statistically significant for predicting the dependent variable, i.e., a good fit with the input data. If the p-value in the t-test is below α = 0.05, the coefficient of a term (an input variable) is deemed to be significant, i.e., the contribution of the variable to the model is important. The variance inflation factor (VIF) was calculated to measure the multicollinearity among the input variables in a multivariate regression. In general, a VIF greater than 10 indicates a high correlation or low independence between variables (Wu 2020). A more conservative level of 2.5 is often used for more constrained applications (Mueller et al. 2016). In this study, we set the VIF threshold to 2.5. In addition, Bland-Altman analysis (Bland and Altman 1986) was performed to assess the agreement between the predictions and the actual measurements of fiber properties. SPSS and Excel were used to perform the abovementioned statistical analyses.
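The VIF screening described above can be illustrated with a minimal sketch. It assumes the usual definition VIF_j = 1/(1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. The data are synthetic stand-ins, not the measurements of this study, and the function names `r_squared` and `vif` are hypothetical helpers introduced only for illustration (the study itself used SPSS and Excel).

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        out.append(1.0 / (1.0 - r_squared(others, X[:, j])))
    return np.array(out)

# Synthetic stand-in data (NOT the paper's measurements):
# UHML in mm and an SFC in % that is negatively related to it.
rng = np.random.default_rng(0)
uhml = rng.normal(28.0, 1.5, 100)
sfc = 40.0 - 1.2 * uhml + rng.normal(0.0, 1.8, 100)

vifs = vif(np.column_stack([uhml, sfc]))
print("VIFs:", vifs)
print("below the 2.5 threshold:", vifs < 2.5)
```

With only two predictors, both VIFs are identical and equal 1/(1 − r²), where r is the correlation between the two predictors, so the 2.5 threshold corresponds to |r| ≈ 0.775 between them.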

Results and discussion
Associations between fiber length parameters and other fiber properties of cotton
When UHML, SFC, or UI is used alone as the input variable in a linear regression analysis, its correlation coefficient (r) with each of the seven properties can be calculated individually; the results are listed in Table 3. The obtained r values range from low to moderate, and some of the p-values are above 0.05 (the numbers in italics). Thus, these three parameters are not ideal for creating single-variate regression models to predict these seven properties. Before using combinations of UHML, SFC, and UI together as the input variables in a linear regression model, we must verify the correlation, significance tests, and multicollinearity associated with the models. Table 4 lists the linear regression statistics between the length parameters (input/independent variables) and the HVI-measured strength, elongation, and MIC (dependent variables). Here, R² indicates the correlation between the actual and predicted values of one of the dependent variables, the F-test shows the overall effectiveness of the regression model, the t-test is the coefficient analysis to verify the contribution of each input variable to the model, and VIF measures the multicollinearity of the input variables in the model.
With regard to strength, the R² value of Model 1 is 0.295 (equivalently, |r| = 0.543), showing that 29.5% of the overall variance of the strength can be explained by UHML and SFC, and the small p-values (< 0.05) in both the F-test and t-test verify that UHML and SFC contribute significantly to the model. A small VIF of 1.964 (< 2.5) indicates no collinearity between UHML and SFC. Therefore, there is sufficient evidence that UHML and SFC can be used to predict strength. The same result can be obtained for Model 2, in which UHML and UI are the input variables. However, in Model 3, R² is 0.278, and SFC and UI have high collinearity because the VIF (3.904) is above the threshold (2.5). In Model 4, the three input variables (UHML, SFC, and UI) generate the same correlation as in Model 1 (R² = 0.295, |r| = 0.543), but they do not pass the significance tests (p-values > 0.05), and multicollinearity exists among the three variables (VIFs > 2.5). Thus, a model whose input variables include the pair (SFC, UI) should be excluded from strength prediction. Because UHML and SFC in Model 1 exhibit a slightly stronger association with strength than UHML and UI in Model 2, UHML and SFC were selected as predictors for strength. With regard to MIC, almost the same results as those in the above analysis for strength can be derived. Model 1 shows the best correlation (R² = 0.495, or |r| = 0.704), i.e., the strongest association between the pair (UHML, SFC) and MIC among the four models, and it passes the significance tests (p-values < 0.05) and the VIF check (VIF < 2.5). Although Model 4 has a correlation equivalent to that of Model 1, it has multicollinearity problems, as indicated by the high VIF values.
In terms of elongation, Model 4 yields the highest R² (R² = 0.180, or |r| = 0.424), but its VIFs are all above the threshold (multicollinearity) and the p-values (0.149 and 0.682) of two independent variables (SFC and UI) are greater than 0.05 (insignificant contributions). Therefore, Model 1 (i.e., UHML and SFC) was chosen for the elongation prediction. Table 5 summarizes the linear regression statistics of the fiber properties for both the training and testing sets (a total of 117 samples). Combining the criteria for the correlation, p-value, and VIF for selecting a reliable regression model, we can see that Model 1 is an optimal regression model for predicting all four dependent variables. In short, Model 1, consisting of UHML and SFC, generates moderate associations with the seven properties (0.3 < |r| < 0.7), and both parameters contribute significantly to the model without collinearity. Although Model 4 takes advantage of the three parameters and provides equivalent correlations, it has excessive multicollinearity (VIFs > 2.5).

Predictions of the fiber properties of cotton samples
According to the R² values in Tables 4 and 5, the two predictors (UHML and SFC) show moderate associations with strength, elongation, MIC, nep, fineness, IFC, and MR. This is because there is a considerable overlap between the elongation and thickening phases, which allows a cotton fiber to grow its length, diameter and wall thickness simultaneously and thus produces certain associations among these properties. In turn, fiber diameter and wall thickness can impact fineness, maturity, strength, and other properties. However, environmental factors, such as plant nutrients, climate, insects, and boll populations, can also greatly influence the growth of a cotton plant and alter these inherent relationships. Thus, only moderate associations can be expected among the fiber properties. Nevertheless, the length parameters still provide certain clues to other fiber properties, and thus can be used to estimate these properties when their testing equipment (e.g., HVI or AFIS) is not available. The equations below show the prediction models for the fiber properties based on the above multivariate linear regression analysis of the training set (100 cotton samples).
(1)
Elongation = −0.368 × UHML − 0.083 × SFC + 18.141    (2)
While R² indicates how well a linear model fits the dependent variable, the mean absolute error (MAE) or root mean square error (RMSE) also provides a measure of the goodness-of-fit between the predicted and actual values. Table 6 lists the mean values and MAEs of the seven dependent variables for the training set (100 cotton samples). Relative errors (%MAE), calculated as MAE/mean × 100%, are also included in parentheses in the table. The %MAEs of the training set vary from 2.25% (for MR) to 14.36% (for nep). The %MAEs in the predictions of MR, strength, MIC, and fineness (note that MIC is a measure combining both maturity and fineness) are less than 4%. The high error of nep prediction is related to the high variability of nep measured by AFIS (see Table 1).
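As a rough illustration of how Eq. (2) and the %MAE metric can be applied, the sketch below evaluates the elongation model for the two UHML values quoted for Cotton 1 and Cotton 2. The SFC values and the "actual" elongations are hypothetical placeholders chosen for the example, not measurements from this study, and the function names are illustrative only.

```python
import numpy as np

def predict_elongation(uhml_mm, sfc_pct):
    """Eq. (2): elongation predicted from UHML (mm) and SFC (%)."""
    return -0.368 * np.asarray(uhml_mm) - 0.083 * np.asarray(sfc_pct) + 18.141

def mae_and_percent(actual, predicted):
    """Return (MAE, %MAE), with %MAE = MAE / mean(actual) * 100."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mae = float(np.mean(np.abs(actual - predicted)))
    return mae, 100.0 * mae / float(np.mean(actual))

uhml = [30.41, 25.58]          # UHML values quoted in the text (mm)
sfc = [5.0, 10.0]              # hypothetical SFC values (%)
actual = [6.9, 7.4]            # hypothetical measured elongations

pred = predict_elongation(uhml, sfc)
mae, pct = mae_and_percent(actual, pred)
print("predicted elongation:", pred)
print("MAE = %.3f, %%MAE = %.2f%%" % (mae, pct))
```

Because both coefficients in Eq. (2) are negative, the model predicts lower elongation for longer cottons and for cottons with more short fibers.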
The performance of the prediction models shown in Eqs. (1)-(7) was verified using the testing set (Table 7). The errors of all the prediction models increased when they were applied to the 17 new samples. Note that these 17 samples were tested using HVI, AFIS, and DBF at different times from the 100 samples in the training set, and thus their measurements could potentially deviate from those of the training set. Nevertheless, the models kept the relative errors for the testing set under 20%, demonstrating a certain level of generalization.
In addition to the correlation analysis for independent and dependent variables, Bland-Altman analysis was used to assess the agreement between the predicted and actual measurements of the dependent variable. Table 8 summarizes the Bland-Altman analysis results for the seven fiber properties of the 117 samples. The 95% limits of agreement are bias ± 1.96 SD, where the bias is the difference between the means of the predicted and actual measurements and SD is the standard deviation of their differences. Figure 2 shows the Bland-Altman plots of the seven fiber properties to visualize the mapping of individual cotton samples in comparison with the 95% limits of agreement. First, all seven agreement tests are significant (or strong) because the p-values are ≪ 0.05. Second, the biases of the seven predicted properties are close to zero, and thus there is no systematic bias between the two sets of data. Third, as shown in Fig. 2, the number of outliers, i.e., the points outside the limit lines, in each of the seven plots is ≤ 6, meaning that more than 94.9% (= 1 − 6/117) of points are in agreement at α = 0.05.
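The limits of agreement described above can be computed in a few lines. The following is a minimal sketch under stated assumptions: the predicted and actual values are synthetic, the SD is taken as the sample standard deviation of the paired differences, and `bland_altman` is a hypothetical helper name rather than the SPSS procedure used in the study.

```python
import numpy as np

def bland_altman(predicted, actual):
    """Return (bias, lower limit, upper limit, fraction of points inside)."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(actual, dtype=float)
    bias = float(diff.mean())             # mean of the paired differences
    sd = float(diff.std(ddof=1))          # sample SD of the differences
    lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
    inside = (diff >= lower) & (diff <= upper)
    return bias, lower, upper, float(inside.mean())

# Synthetic example with 117 samples (NOT the paper's data)
rng = np.random.default_rng(0)
actual = rng.normal(30.0, 2.0, 117)
predicted = actual + rng.normal(0.0, 1.0, 117)

bias, lower, upper, frac = bland_altman(predicted, actual)
print("bias = %.3f, limits = [%.3f, %.3f]" % (bias, lower, upper))
print("fraction inside limits = %.3f" % frac)
```

For reference, six outliers among 117 samples leave 111/117 ≈ 0.949 of the points inside the limits, which is the figure quoted in the text.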

Predictions of fiber properties based on cotton clusters
The fiber properties of cotton inherently possess a great amount of variability, resulting in only moderate correlations and certain estimation errors in their regression models, as shown above. To circumvent the model estimation errors caused by cotton variability, we can group cotton samples that have high similarity in the length parameters into the same cluster, thus forming a limited number of distinct cotton clusters. Then, the centroids of the clusters can be used to represent categorical features of cotton that may more reliably reflect the inherent associations of cotton properties. In this section, we apply clustering analysis to the data points in the space of (UHML, SFC), and then create prediction models based on the clusters.
K-means clustering is the most commonly used method for partitioning data points into K clusters (Li et al. 2012). The within-cluster sum-of-squares (WCSS) is a metric used to evaluate the variability of the data within each cluster and can be used to determine the optimal number of clusters. In this study, we found that K = 5 was the optimal number for partitioning the 100 samples in the training set. Figure 3 shows the classified cotton samples in the five clusters in the (UHML, SFC) space. The five clusters are sharply separated in this space, with Cluster 1 being far from the other clusters. Table 9 lists the sizes and mean properties of the five clusters. From Clusters 1 to 4, the mean UHML shows an increasing trend, as do the mean strength, MIC, fineness, and MR. This means that longer fibers are more likely to have higher strength, fineness, and maturity. The mean SFC and IFC values change in the same direction. Compared to Cluster 4, Cluster 5 has a similar UHML but significantly higher SFC, representing a group with both long and short fiber contents (UHML = 29.46 mm, SFC = 6.02%).
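The WCSS-based choice of K can be sketched as follows. This is an illustrative implementation of plain Lloyd's K-means in NumPy on synthetic (UHML, SFC) points, not the procedure or data used in the study. Z-scoring the two axes before clustering is an assumption made here so that neither unit dominates the distance, and `kmeans` is a hypothetical helper name.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels, wcss)."""
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()

    def assign(c):
        # Squared Euclidean distance from every point to every centroid
        d2 = ((points[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    for _ in range(n_iter):
        labels = assign(centroids)
        new = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members):              # keep old centroid if a cluster empties
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    labels = assign(centroids)
    wcss = float(((points - centroids[labels]) ** 2).sum())
    return centroids, labels, wcss

# Synthetic (UHML mm, SFC %) points, NOT the paper's data
rng = np.random.default_rng(1)
uhml = rng.normal(28.0, 1.5, 100)
sfc = 40.0 - 1.2 * uhml + rng.normal(0.0, 1.5, 100)
pts = np.column_stack([uhml, sfc])
pts_z = (pts - pts.mean(axis=0)) / pts.std(axis=0)   # assumption: z-scored axes

# WCSS for a range of K; the "elbow" of this curve suggests a suitable K
wcss_by_k = {k: kmeans(pts_z, k)[2] for k in range(1, 9)}
for k, w in wcss_by_k.items():
    print(k, round(w, 2))

centroids, labels, _ = kmeans(pts_z, 5)
```

WCSS always tends to fall as K grows, so the usual heuristic is to pick the K beyond which the decrease flattens out rather than the K with the smallest WCSS.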
The mean UHML and SFC represent the centroids of the clusters. We can reconstruct the linear regression models using the cluster centroids to avoid variations in individual points. Equations (8)-(14) are the linear prediction models for the seven properties (strength, elongation, …, MR) based on the centroids. The R² values of the linear regressions are all significantly improved compared to those of the models without clustering. The correlations between the length parameters and properties increase to a high level (0.707 < |r| < 0.990). It can be seen that the cluster-based models reduce the prediction errors for the first six properties and yield the same error for MR as the model without clustering. The two sets of errors are highly correlated but not significantly different (R² = 0.997, p-value = 0.718 > 0.05). However, across the seven fiber properties, the %MAEs with clustering are consistently lower than those without clustering, as shown in Fig. 4. Thus, the centroid-based models can improve the accuracy of the fiber-property predictions. Because cotton samples possess high between- and within-bale variability (Gourlot et al. 2012) and the test results are highly susceptible to the makes and models of the instruments (Hunter 2003), most fiber measurements suffer from poor repeatability and reproducibility. It was reported that the differences in HVI measurements between two separate labs could be as high as 21.7% for MIC, 18.0% for strength and 35.5% for elongation (Hunter 2003). This high variability certainly impacts the accuracy of the above prediction models, causing some fiber properties, e.g., elongation, MIC and nep, to suffer relatively high %MAEs, as shown in Fig. 4.
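The idea of refitting the linear models on cluster centroids can be illustrated with a small least-squares sketch. All five centroids and the coefficients used to generate the mean property values below are hypothetical and chosen only to show the mechanics. They are not the values in Table 9 or the models in Eqs. (8)-(14).

```python
import numpy as np

def fit_linear(centroids, y):
    """Least-squares fit of y = b1*UHML + b2*SFC + b0 on cluster centroids."""
    centroids = np.asarray(centroids, dtype=float)
    A = np.column_stack([centroids, np.ones(len(centroids))])
    coef, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return coef                           # [b1 (UHML), b2 (SFC), b0]

# Hypothetical centroids in (UHML mm, SFC %) space, one row per cluster
centroids = np.array([
    [25.5, 10.5],
    [27.0, 8.5],
    [28.3, 7.4],
    [29.5, 4.0],
    [29.4, 6.0],
])

# Hypothetical cluster-mean property generated from known coefficients
true_coef = np.array([0.9, -0.2, 5.0])
y = centroids @ true_coef[:2] + true_coef[2]

coef = fit_linear(centroids, y)
print("recovered coefficients:", coef)
```

With only five centroids and three fitted coefficients, such a model has few residual degrees of freedom, so a high R² on the centroids should be read together with the %MAE on individual samples.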

Conclusion
This study examined the associations of fiber length parameters measured by DBF with other fiber properties measured by HVI and AFIS for cotton fibers, and created prediction models to estimate these fiber properties based on the fiber length distributions generated by DBF. We collected 117 cotton samples, which were separated into a training set (100 samples) and a testing set (17 samples), and used the three testing methods (DBF, HVI, and AFIS) to generate comprehensive fiber property measurements for the two sets of samples. We then conducted regression analysis, hypothesis testing, Bland-Altman analysis, and clustering on the two sets of data to assess the correlations, multicollinearity, agreement, and clusters of the fiber properties. The fiber length parameters had moderate associations (0.3 < |r| < 0.7) with the seven other properties (strength, elongation, micronaire, nep, fineness, immature fiber content, and maturity ratio), and the prediction errors of the training set varied from 2.25% (for MR) to 14.36% (for nep). The Bland-Altman analysis confirmed that, for each of the seven properties, there was no systematic bias between the actual and predicted values, and more than 94.9% of the predicted values were in agreement with the actual values (α = 0.05). The regression models based on cotton cluster centroids consistently lowered the prediction errors for all of the properties. The analyses using the testing set showed that the prediction models generated results comparable to those for the training set, demonstrating a certain level of generalization for new samples. The prediction models established in this study make it possible to assess the fiber strength, maturity, fineness, and other important properties when their testing equipment is not available, and provide an inexpensive method for a quick overview of cotton quality using only two fiber length parameters (UHML and SFC).