3.1. Prediction of grain yield by different approaches
The estimates of the coefficient of determination (\({R}^{2}\)) obtained by each methodology, using the ten agronomic characteristics as predictors of grain yield (GY) in white oat, are shown in Table 1.
Table 1
Mean of the maximum estimate of the coefficient of determination for the training set, in four environments corresponding to the data sets of experiments without and with fungicide in two agricultural years, to predict the grain yield in white oat (Avena sativa L.).
Approach | Technique | E1 | E2 | E3 | E4 |
AM | BO | 92.29 | 86.69 | 81.23 | 79.23 |
AM | DT | 85.37 | 76.39 | 61.78 | 64.65 |
AM | BA | 94.61 | 93.89 | 92.70 | 92.98 |
AM | RF | 64.91 | 55.09 | 10.57 | 24.48 |
IA | PMC-1 | 73.25 | 71.42 | 30.14 | 59.84 |
IA | PMC-2 | 96.45 | 90.12 | 56.72 | 57.94 |
IA | PMC-3 | 86.13 | 88.58 | 61.45 | 68.62 |
IA | PMC-4 | 75.16 | 85.32 | 87.34 | 58.77 |
IA | RBF | 90.12 | 73.76 | 80.72 | 76.44 |
Conventional | RM | 61.02 | 46.07 | 20.67 | 32.72 |
IA: Artificial Intelligence; AM: Machine Learning; RM: Multiple Regression; PMC: Multilayer Perceptron; PMC-1: Multilayer Perceptron (10-11-1); PMC-2: Multilayer Perceptron (10-11-11-1); PMC-3: Multilayer Perceptron (10-11-11-11-1); PMC-4: Multilayer Perceptron (10-3-4-11-1); RBF: Radial Basis Function network; DT: Decision Tree; RF: Random Forest; BA: Bagging; BO: Boosting. E: environments. E1 and E3: without fungicide; E2 and E4: with fungicide. |
Based on Table 1, it is possible to compare which approach is more efficient for the prediction of GY. Higher values of \({R}^{2}\) indicate that the prediction target variable has a better fit considering the ten explanatory variables used as predictors in this analysis [3, 4]. Among the methodologies used in this study, multiple regression presented the lowest estimate of \({R}^{2}\), indicating the existence of non-linear associations between the explanatory variables that are not captured by the model. Artificial intelligence and machine learning methodologies, in turn, stood out for their ability to extract non-linear information from the model inputs [5, 8], as seen in Table 1. Other authors have already highlighted the ability of neural networks [12, 13] and machine learning [3, 4, 14] to capture non-linear relationships better than conventional methodologies.
The results obtained by the different approaches show that there was a discrepancy between the maximum estimates of \({R}^{2}\) for the predictive variable within the same environments (Table 1). This discrepancy in the estimate of \({R}^{2}\) was also reported by [3, 4]. It is noteworthy that the differences among these analyses indicate that the environment influences the estimate of \({R}^{2}\) and, consequently, the choice of the best prediction model for the response variable.
The machine learning approach proved to be more efficient than the other approaches (Table 1). The random forest procedure presented low maximum estimates of \({R}^{2}\) in all environments. Even so, it was superior to the multiple regression approach in environments E1 and E2, whereas in environments E3 and E4 its estimates (10.57% and 24.48%, respectively) fell below those of multiple regression. Low maximum estimates of \({R}^{2}\) for the random forest procedure were also reported in flood-irrigated rice [4] and on simulated data with different heritabilities [3]. This procedure randomly resamples the data set, builds several decision trees that together constitute a random forest, and repeats the process several times, producing predictions and importance scores that allow the contribution of the predictors to be evaluated.
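As an illustration of this procedure, the sketch below fits a random forest to one environment and extracts the test-set fit and the predictor importance ranking; the file name, column names and hyperparameters are assumptions made for the example, not the data or settings used in this study.

```python
# Minimal sketch of the random forest procedure: bootstrap resampling of the
# data, an ensemble of decision trees, and importance scores for the predictors.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

traits = ["MTG", "HW", "DEF", "DFM", "DEM", "PH", "LP", "LRS", "SRS", "LS"]
data = pd.read_csv("white_oat_E1.csv")              # hypothetical file name

X_train, X_test, y_train, y_test = train_test_split(
    data[traits], data["GY"], test_size=0.2, random_state=42)

# Each tree is grown on a bootstrap resample; the forest averages the trees.
rf = RandomForestRegressor(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)

print("R2 on the test set:", r2_score(y_test, rf.predict(X_test)))
print(pd.Series(rf.feature_importances_, index=traits).sort_values(ascending=False))
```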
Regarding the environments and the bagging procedure, the estimates of \({R}^{2}\) were higher than 92.70%, making this approach the best option for the analyzed data sets. High estimates of \({R}^{2}\) (values around 80%) were also obtained with the boosting procedure, in addition to bagging, for all prediction data sets (Table 1). [3, 4] showed that the machine learning approaches based on the bagging and boosting procedures were more consistent in obtaining higher overall mean estimates of \({R}^{2}\) for the predictive variables. The decision tree (DT) and random forest methodologies did not stand out from the other machine learning procedures (Table 1).
Artificial intelligence approaches based on RBF provided adjustments whose \({R}^{2}\) values were greater than 70% in all environments (Table 1). In this procedure, the highest maximum estimate of \({R}^{2}\) was 90.12% (± 5.79) and the lowest 73.76% (± 1.67), corresponding to environments E1 and E2, respectively. [4] found maximum estimates of \({R}^{2}\) ranging from 48–99% in different environments for the flood-irrigated rice crop. For simulated data with different genetic structures, the maximum estimate of \({R}^{2}\) ranged from 44–54% [3], and [15] obtained consistent \({R}^{2}\) results for different genetic structures. [16] evaluated bean cultivars and obtained estimates of \({R}^{2}\) of 94.10% and 94.40% for the characteristics days to first flower and flowering days, respectively. This procedure has a good ability to handle complex interactions compared to semiparametric and linear regressions [15, 17]. Generally, RBF learns quickly from the data used as training information and provides a unique solution compared to perceptron ANNs [9, 15, 17].
[15] applied the RBF in studies using simulated traits with 30% and 60% heritability for variable selection. The authors identified greater efficiency of selection by the RBF when the scenario involved epistatic interactions in the gene control of the studied characters. [9] observed that it is possible to improve prediction in nonparametric models when the selection includes markers that are not directly related to the characteristics of interest. [4] applied RBF to predict grain yield, grain length-width ratio, and panicle length in flood-irrigated rice; these authors argue that RBF performs well in quantifying the importance of variables. [3] evaluated the importance of auxiliary traits for the main trait based on phenotypic information and a previously known genetic structure using RBF and demonstrated the efficiency of this network in quantifying the importance of variables.
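To make the RBF procedure concrete, the sketch below implements a simple RBF network regressor (Gaussian basis functions centred by k-means and a linear output layer); the number of centres, the bandwidth and the ridge penalty are illustrative assumptions, not the configuration used in the cited studies.

```python
# Minimal RBF network sketch: k-means centres, Gaussian activations,
# and a ridge-regularized linear readout as the output layer.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge

def rbf_design(X, centers, sigma):
    # Gaussian activation of each sample with respect to each centre.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_rbf(X, y, n_centers=11, sigma=1.0):
    centers = KMeans(n_clusters=n_centers, n_init=10,
                     random_state=0).fit(X).cluster_centers_
    readout = Ridge(alpha=1e-3).fit(rbf_design(X, centers, sigma), y)
    return centers, readout

def predict_rbf(X, centers, readout, sigma=1.0):
    return readout.predict(rbf_design(X, centers, sigma))
```

With standardized inputs, `fit_rbf` returns the fitted centres and output weights, and `predict_rbf` produces the yield predictions from which an \({R}^{2}\) estimate can be computed.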
Regarding procedure PMC-1 (10-11-1), the highest maximum estimate of \({R}^{2}\) was observed in E1 (73.25%) and the lowest in E3 (30.14%), both environments without fungicide. In the procedures PMC-2 (10-11-11-1) and PMC-3 (10-11-11-11-1), the highest estimates were observed in E1 and E2 and the lowest in E3 and E4, respectively. For the same number of hidden layers, corresponding to PMC-3 (10-11-11-11-1) and PMC-4 (10-3-4-11-1), we observed lower maximum estimates of \({R}^{2}\) for the PMC-4 procedure, except in the E3 environment. This shows that the number of neurons per layer influences the maximum estimate of \({R}^{2}\). [3] likewise argued that the number of neurons influences the estimation of the coefficient of determination.
The PMC network is widely used in predictive processes [3, 4, 18]; its success has been demonstrated by several research groups, which have shown mathematically that, even with only a single hidden layer, this network performs very well with different numbers of neurons in that layer [18].
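A minimal sketch of a single-hidden-layer PMC such as PMC-1 (10-11-1), written with scikit-learn, is shown below; the activation function, iteration limit and scaling step are assumptions for illustration only.

```python
# PMC-1 topology: 10 inputs, one hidden layer with 11 neurons, 1 output.
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pmc1 = make_pipeline(
    StandardScaler(),                               # scale the ten trait inputs
    MLPRegressor(hidden_layer_sizes=(11,),          # (10-11-1)
                 activation="tanh", max_iter=5000, random_state=42),
)
# pmc1.fit(X_train, y_train); pmc1.score(X_test, y_test)  # R2 on test data
```

Deeper topologies such as PMC-3 (10-11-11-11-1) correspond to `hidden_layer_sizes=(11, 11, 11)` in this sketch.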
Thus, machine learning is indeed more efficient for selecting phenotypic traits because it can handle reduced or redundant information about those traits [3]. [19] evaluated the importance of variables by bagging, random forest, boosting, decision tree, PMC and RBF and reported that PMC and RBF achieved better results. [3, 4] verified that computational intelligence and machine learning methodologies used in prediction allowed the identification of the explanatory phenotypic characteristics that should be prioritized and established as auxiliary characteristics for indirect selection.
The efficiency of ANNs in prediction problems, given their ability to extract relevant information from large data sets [20] and to generalize from relatively inaccurate information [21], was very well expressed by the results obtained (Table 1). The same can be said of the methodologies based on machine learning, which are capable of dealing with reduced or redundant information in the input variables [3, 4]. However, another analysis as important as prediction, and one that is often not carried out, is the identification of the most important predictive variables, a key factor in the decision-making process [22]. Thus, after the prediction analyses, further analyses were carried out to quantify the importance of variables through the artificial intelligence and machine learning methods, in order to identify, among the set of explanatory variables, those that should be prioritized and used as auxiliary characteristics in indirect responses to selection.
3.2. Linear relationship between predictor variables and grain yield in white oat
The strongest linear associations with GY may be a preliminary indication that the variables, individually, are important in the prediction of GY. In multivariate prediction models, however, a predictor variable with high correlation with the response variable may lose its importance due to redundancy, since it may be represented in the model by another variable associated with it. Thus, in addition to quantifying the linear relationships between predictors and response, it is important to quantify and examine the linear relationships, expressed by linear correlation coefficients, among all predictors in the search for redundancies. In this work, these associations were represented in a correlation network in which red and green lines represent negative and positive correlations, respectively, and line width is proportional to the magnitude of the correlation (Fig. 1). In this network, the structure of correlated groups relevant to predicting GY can be observed, highlighting the similarity between the phenotypic characteristics and the phenotypic correlation patterns.
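A sketch of how such a correlation network can be assembled is given below; the 0.3 display threshold, the file name and the plotting layout are assumptions chosen for illustration and are not taken from Fig. 1 itself.

```python
# Correlation network: green edges for positive and red edges for negative
# correlations, with edge width proportional to the correlation magnitude.
import matplotlib.pyplot as plt
import networkx as nx
import pandas as pd

traits = ["MTG", "HW", "DEF", "DFM", "DEM", "PH", "LP", "LRS", "SRS", "LS"]
data = pd.read_csv("white_oat_E1.csv")              # hypothetical file name
corr = data[traits + ["GY"]].corr()                 # Pearson correlations

G = nx.Graph()
G.add_nodes_from(corr.columns)
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        r = corr.loc[a, b]
        if abs(r) >= 0.3:                           # hypothetical display threshold
            G.add_edge(a, b, color="green" if r > 0 else "red", weight=abs(r))

pos = nx.spring_layout(G, seed=1)
edges = list(G.edges(data=True))
nx.draw(G, pos, with_labels=True,
        edge_color=[d["color"] for _, _, d in edges],
        width=[4 * d["weight"] for _, _, d in edges])
plt.show()
```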
The characteristics that formed groups with GY in E1 were MTG, HW and PH, which correlated positively although with varying magnitudes, while LRS was negatively correlated. In E2, the positively correlated characteristics were PH and MTG, and the negatively correlated ones were LS and DFM. In E3, an environment without fungicide, the negatively correlated characteristic was SRS. In environment E4, the positively correlated group consisted of HW and DEF, and the negatively correlated characteristic was DEM (Fig. 1).
3.3. Importance of variables in prediction by the Artificial Intelligence approach
3.3.1. Multilayer Perceptron (PMC)
Estimates of the coefficient of determination for grain yield prediction by PMC, attributing perturbation to the genotypic information, are shown in Table 2. These results show large discrepancies in \({\text{R}}^{2\text{*}}\) when the environments are compared with each other, which makes interpretation difficult. In environments E1 and E4, the characteristics LP, PH and LRS were effective in quantifying the response variable GY, given the reduction in the estimate of \({\text{R}}^{2\text{*}}\) caused by the strategy of attributing disturbance to the phenotypic information.
Table 2
Estimates of the coefficient of determination for grain yield prediction in white oat (Avena sativa L.) using the PMC, attributing perturbation to genotypic information.
Input | E1: TOP1 | E1: TOP2 | E1: TOP3 | E1: TOP4 | E2: TOP1 | E2: TOP2 | E2: TOP3 | E2: TOP4 |
MTG | 70.02 | 87.37 | 64.93 | 74.93 | 31.92 | 23.19 | 9.73 | 26.47 |
HW | 71.78 | 78.42 | 70.44 | 72.44 | 54.25 | 87.37 | 86.02 | 84.30 |
DEF | 76.51 | 76.36 | 74.89 | 64.89 | 54.68 | 65.92 | 48.33 | 75.16 |
DFM | 75.18 | 86.87 | 68.59 | 78.59 | 43.67 | 36.65 | 70.15 | 50.05 |
DEM | 76.54 | 77.17 | 83.87 | 73.87 | 56.49 | 74.88 | 75.94 | 77.60 |
PH | 61.01 | 80.26 | 49.89 | 59.89 | 53.23 | 63.91 | 33.01 | 55.37 |
LP | 75.26 | 66.07 | 62.90 | 67.90 | 46.46 | 71.41 | 76.46 | 68.43 |
LRS | 52.80 | 33.62 | 10.33 | 8.33 | 52.72 | 73.18 | 85.34 | 67.20 |
SRS | 76.59 | 78.03 | 71.10 | 71.10 | 57.33 | 80.89 | 58.86 | 60.60 |
LS | 75.19 | 80.32 | 81.71 | 71.81 | 56.85 | 76.40 | 74.77 | 72.44 |
Input | E3: TOP1 | E3: TOP2 | E3: TOP3 | E3: TOP4 | E4: TOP1 | E4: TOP2 | E4: TOP3 | E4: TOP4 |
MTG | 32.34 | 33.29 | 52.69 | 65.34 | 51.85 | 38.73 | 50.93 | 58.67 |
HW | 21.20 | 12.64 | 26.31 | 42.81 | 47.09 | 52.58 | 26.67 | 37.82 |
DEF | 30.70 | 54.58 | 73.53 | 57.07 | 37.84 | 45.99 | 42.69 | 34.25 |
DFM | 30.93 | 33.23 | 36.08 | 50.12 | 55.95 | 52.81 | 56.06 | 53.65 |
DEM | 32.58 | 48.50 | 79.57 | 68.96 | 40.57 | 46.46 | 31.72 | 50.97 |
PH | 29.51 | 35.36 | 57.87 | 44.98 | 50.74 | 53.91 | 55.85 | 56.06 |
LP | 18.57 | 39.66 | 29.95 | 51.51 | 59.52 | 48.74 | 57.27 | 59.37 |
LRS | 4.48 | 11.46 | 24.87 | 21.62 | 44.69 | 29.64 | 45.38 | 39.15 |
SRS | 24.65 | 19.57 | 38.52 | 9.99 | 39.55 | 42.86 | 31.70 | 40.07 |
LS | 26.36 | 12.91 | 49.10 | 45.01 | 56.54 | 37.20 | 45.79 | 59.26 |
MTG = Thousand Grain Mass in grams; HW = Hectoliter Weight; DEM = Days between Emergence and Maturation; LP = Lodging Percentage; GY = Grain Yield in kilograms per hectare; DEF = Days from Emergence to Flowering; DFM = Days from Flowering to Maturation; PH = Plant Height; LRS = Leaf Rust Severity; SRS = Stem Rust Severity; LS = Leaf Spots. Topology - TOP1: Multilayer Perceptron (10-11-1); TOP2: Multilayer Perceptron (10-11-11-1); TOP3: Multilayer Perceptron (10-11-11-11-1); TOP4: Multilayer Perceptron (10-3-4-11-1). E: environments. E1 and E3: without fungicide; E2 and E4: with fungicide. |
Regardless of the number of neurons and hidden layers, the topologies agreed in pinpointing the most important variables for predicting GY in these environments. This result shows that these variables are indeed important in predicting GY, since disturbing their values led to a considerable reduction in the quality of the fit. In the E2 environment, the characteristic MTG was the most important in predicting GY.
In E4, which corresponds to an environment with fungicide, the topologies diverged in pointing out the most important variables. With a single hidden layer and one neuron in the output layer (TOP1), DEF and SRS were the most important, given the reduction in the estimate of \({\text{R}}^{2\text{*}}\). With two hidden layers (TOP2), LRS and LS stood out for the target prediction variable. With three hidden layers of 11 neurons each and one neuron in the output layer (TOP3), the characteristics that proved most important were HW and SRS. With three hidden layers of three, four and 11 neurons (TOP4), the important characteristics in predicting GY were LRS, DEF and HW. [4] reported that, with only one hidden layer and a single neuron in the output layer, the most important variables for irrigated rice were grain width and length, given the significant drops in the estimated values of \({\text{R}}^{2\text{*}}\) observed when those variables were disturbed.
The importance of the variables was quantified by attributing destructuring to the genotypic information of each variable, in order to observe the resulting changes in the values of \({R}^{2}\). It is important to point out that, in this Table, reductions in the values of \({R}^{2}\) after attributing disruption to the genotypic information of a given variable indicate that this variable is important relative to the others for prediction with the already established network.
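A minimal sketch of this perturbation strategy is given below, assuming a network already fitted with scikit-learn and a DataFrame of inputs; the function name, the number of repetitions and the shuffling approach are assumptions used for illustration, not the authors' implementation.

```python
# Perturbation importance: shuffle one input at a time and record how much
# the coefficient of determination of the already fitted network drops.
import numpy as np
from sklearn.metrics import r2_score

def perturbation_importance(model, X, y, n_repeats=10, seed=0):
    rng = np.random.default_rng(seed)
    baseline = r2_score(y, model.predict(X))
    drops = {}
    for name in X.columns:
        scores = []
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[name] = rng.permutation(Xp[name].to_numpy())   # disturb one input
            scores.append(r2_score(y, model.predict(Xp)))
        drops[name] = baseline - float(np.mean(scores))       # reduction in R2*
    return drops
```

Large values of the returned reduction flag the variables whose disturbance most degrades the fit, mirroring the reductions reported in Table 2.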
3.3.2. Radial Basis Function Network (RBF)
The estimation of the importance of characters in white oat, attributing disturbance to the information of one input variable at a time after the RBF had been established, is presented in Table 3. This Table shows the relative importance of each input, estimated by the technique of destructuring the information of each explanatory variable. When using this strategy, drastic reductions in the values of \({\text{R}}^{2\text{*}}\) were observed for the most important variables, with LRS standing out for the predictive variable GY in the E1 and E4 environments. In practice, greater severity of this trait reduces the genetic progress attainable for grain yield. In the E2 environment, the variable that suffered the greatest reduction in \({\text{R}}^{2\text{*}}\) was DFM, with an estimate of 44.37%. A longer period from flowering to maturation favors grain yield, as more photoassimilates are produced and translocated to the grains. Accordingly, late-cycle cultivars tend to be more productive than early-cycle ones, as they accumulate a greater amount of photoassimilates that are translocated to the grains [4].
Table 3
Estimates of the coefficient of determination for grain yield prediction in white oat (Avena sativa L.) using the RBF, attributing perturbation to genotypic information.
Input | E1 | E2 | E3 | E4 |
MTG | 81.04 | 58.97 | 47.98 | 40.77 |
HW | 76.70 | 60.73 | 65.01 | 53.99 |
DEF | 85.43 | 68.16 | 72.11 | 47.52 |
DFM | 84.30 | 44.37 | 52.47 | 65.02 |
DEM | 80.99 | 68.16 | 74.24 | 54.61 |
PH | 73.97 | 59.36 | 62.75 | 72.19 |
LP | 81.96 | 68.07 | 64.71 | 64.71 |
LRS | 60.13 | 63.30 | 71.59 | 45.04 |
SRS | 84.38 | 70.50 | 69.10 | 63.74 |
LS | 88.37 | 61.23 | 54.09 | 52.25 |
MTG = Thousand Grain Mass in grams; HW = Hectoliter Weight; DEM = Days between Emergence and Maturation; LP = Lodging Percentage; GY = Grain Yield in kilograms per hectare; DEF = Days from Emergence to Flowering; DFM = Days from Flowering to Maturation; PH = Plant Height; LRS = Leaf Rust Severity; SRS = Stem Rust Severity; LS = Leaf Spots. E: environments. E1 and E3: without fungicide; E2 and E4: with fungicide. |
The results show that the most important variable identified by the RBF was MTG in the E2, E3 and E4 environments, with estimates of 58.97%, 47.98% and 40.77%, respectively. In practice, MTG influences grain yield in white oat, since the higher the MTG, the higher the GY, which is consistent with the results obtained in this study for the prediction of GY.
The results obtained meet the expectation that the RBF can quantify and reveal the importance of the characteristics using the strategy of causing disturbances through permutation or fixation of the phenotypic values of the input variables. Our study demonstrates the ability of ANNs to quantify the importance of phenotypic characteristics in white oat. Techniques that measure the impact of interrupting or disturbing the information of a given input on the estimate of the coefficient of determination, and that partition the connection weights of the ANN, were presented and proved effective in estimating the true importance of the phenotypic traits. Therefore, there is appreciable agreement between the results found by the two computational intelligence methodologies, the PMC and RBF networks.
3.4. Importance of variables in prediction by the Machine Learning approach
Table 4 shows the means of the relative contributions of the explanatory variables to grain yield prediction, estimated as the percentage increase in mean squared error (IMSE), which is obtained by permuting the values of each variable in the data set and comparing the result with the prediction on the original, non-permuted data. In this case, unlike the strategy used for the computational intelligence methodologies (PMC and RBF networks), for which lower values of \({\text{R}}^{2}\) indicated greater importance of a variable for the model, in the machine learning approach the importance of an explanatory variable is related to the average decrease in model precision measured by the IMSE, so that the higher this estimate, the greater the importance of the variable.
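A hedged sketch of how an IMSE-type importance can be computed with scikit-learn's permutation importance is shown below; the fitted forest `rf`, the test split and the trait list follow the earlier random forest sketch and are assumptions, not the software actually used in this study.

```python
# Percentage increase in mean squared error (IMSE) when each predictor is
# permuted, relative to the MSE obtained on the unpermuted test data.
from sklearn.inspection import permutation_importance
from sklearn.metrics import mean_squared_error

base_mse = mean_squared_error(y_test, rf.predict(X_test))
perm = permutation_importance(rf, X_test, y_test,
                              scoring="neg_mean_squared_error",
                              n_repeats=30, random_state=0)
imse_pct = 100.0 * perm.importances_mean / base_mse    # % increase in MSE
for name, value in sorted(zip(traits, imse_pct), key=lambda t: -t[1]):
    print(f"{name}: {value:.2f}%")
```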
Table 4
Average estimate of the relative contributions of the explanatory variables for grain yield prediction in white oat using a machine learning approach, in four environments corresponding to without and with fungicide application.
VA | E1: BA | E1: RF | E1: BO | E2: BA | E2: RF | E2: BO | E3: BA | E3: RF | E3: BO | E4: BA | E4: RF | E4: BO |
MTG | 7.58 | 7.94 | 12.37 | 10.47 | 9.84 | 10.89 | 1.04 | 1.53 | 4.49 | 3.55 | 3.21 | 3.75 |
HW | 10.11 | 10.68 | 15.29 | 2.19 | 2.23 | 6.57 | 2.2 | 1.75 | 3.51 | 3.83 | 4.44 | 3.93 |
DEF | 3.29 | 2.42 | 7.55 | 6.85 | 5.79 | 6.73 | 3.58 | 3.5 | 4.60 | 11.46 | 11.49 | 9.18 |
DFM | 1.59 | 2.21 | 2.97 | 16.94 | 16.84 | 12.25 | 0.8 | -0.4 | 4.22 | 5.57 | 4.68 | 4.82 |
DEM | 3.46 | 3.1 | 6.35 | 6.44 | 6.06 | 6.28 | 2.14 | 1.86 | 3.43 | 6.12 | 5.95 | 5.17 |
PH | 10.74 | 10.3 | 9.65 | 10.01 | 8.29 | 9.32 | -0.93 | -0.24 | 2.72 | 0.8 | -0.45 | 1.02 |
LP | 2.83 | 2.49 | 5.94 | 1.36 | 1.08 | 2.79 | 3.04 | 3.04 | 2.96 | 0.36 | -0.66 | 0.89 |
LRS | 20.87 | 20.1 | 29.59 | 9.27 | 9.29 | 16.05 | 9.91 | 10.91 | 12.29 | 4.19 | 4.58 | 7.02 |
SRS | 7.32 | 7.76 | 5.60 | 3.09 | 2.25 | 3.65 | 3.52 | 3.97 | 3.71 | 0.8 | 1.62 | 2.04 |
LS | 3.11 | 3.67 | 4.69 | 3.62 | 2.91 | 3.74 | 3.22 | 2.95 | 3.30 | 3.99 | 3.49 | 3.59 |
MTG = Thousand Grain Mass in grams; HW = Hectoliter Weight; DEM = Days between Emergence and Maturation; LP = Lodging Percentage; GY = Grain Yield in kilograms per hectare; DEF = Days from Emergence to Flowering; DFM = Days from Flowering to Maturation; PH = Plant Height; LRS = Leaf Rust Severity; SRS = Stem Rust Severity; LS = Leaf Spots. RF: Random Forest; BA: Bagging; BO: Boosting; VA: auxiliary variable; E: environments. E1 and E3: without fungicide; E2 and E4: with fungicide. |
Based on Table 4, the variables with the highest IMSE estimates across all machine learning methodologies in the environments without fungicide were LRS, HW, PH and MTG in E1, and DEF, SRS and LRS in E3. The variable that proved most informative in these environments was LRS, which supports its use in indirect selection when the target prediction variable is GY. For the environments with fungicide, the most important variables were MTG, DFM, PH and LRS in E2, and DEF, DFM, DEM and LRS in E4. For these environments, the variables DFM and LRS proved efficient for the prediction of grain yield in white oat.
The random forest and bagging methodologies coincided in ranking the same explanatory variables; similar results were reported by [3, 4]. The boosting procedure showed some discrepancies relative to these two but was, on the other hand, more consistent in the prediction of the variables. In this procedure, with GY as the prediction target, the variables MTG, HW, PH and LRS stood out in E1, and MTG, DEF and LRS in E3, the environments without fungicide. In the environments with fungicide, the important variables were MTG, DFM, PH and LRS in E2, and DEF, DFM, DEM and LRS in E4. With the boosting procedure, the variable that stood out in all environments was LRS, which supports its use for predicting GY in white oat.
The bagging technique involves generating several distinct training sets from the original data set by bootstrap resampling; the final prediction is obtained by averaging the predictions produced over all of these sets. This is useful for techniques such as decision trees and artificial neural networks, which are sensitive to small changes in the training data [23].
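A minimal sketch of this bagging scheme, using scikit-learn's BaggingRegressor (which bags decision trees by default), is shown below; the ensemble size and the reuse of the earlier train/test split are assumptions.

```python
# Bagging: draw bootstrap training sets, fit a decision tree to each,
# and average the individual predictions to obtain the final prediction.
from sklearn.ensemble import BaggingRegressor

bagging = BaggingRegressor(n_estimators=500, bootstrap=True, random_state=42)
bagging.fit(X_train, y_train)
print("R2 on the test set:", bagging.score(X_test, y_test))
```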