Data utilized in this research can be broadly categorized into two distinct sets: 1) a multiyear contemporary cultivar development set of F7 lines from the SunGrains cooperative program and, 2) an historic data set based on the Southern Uniform Winter Wheat Scab Nursery (SUWWSN) from the years 2011-2019 (Murphy et al., 2018, 2019; Murphy et al., 2017; Murphy et al., 2015; Murphy et al., 2016; Murphy & Navarro, 2010, 2011, 2012, 2013, 2014).
All F7 generation SunGrains lines from the 2019-2020 and 2020-2021 seasons were simultaneously genotyped via GBS for genome wide SNP markers and KASP assays diagnostic for Fhb1, Qfhb.nc-1A, Qfhb.nc-4A, and the Qfhb.vt-1B Jamestown haplotype. The 2019-2020 SunGrains panel (SG20) contained 2,376 lines and the 2020-2021 SunGrains panel (SG21) included 3,423lines. Each panel was representative of the SunGrains program composition (North Carolina State University, Clemson University, The University of Georgia, Louisiana State University, The University of Arkansas, and Texas A&M University), and each panel was reflective of the total southeastern United States soft red winter wheat germplasm. The genome wide GBS SNP data and QTL haplotype call data from the SG20 and SG21 panels was used for training of prediction models as well as to identify the effect of training size on prediction accuracy.
The historical data set, based on the SUWWSN was used to compare observed QTL haplotype call effects to predicted QTL haplotype call effects. Models trained on the SG20 and SG21 data sets were used to predict QTL haplotype calls in the SUWWSN. This dataset represented 95 total environments across the southeastern United States over eight years for 418 distinct lines from 16 variety development programs and is comprised of elite soft red winter wheat lines adapted to the southeastern United States.
Genotyping methods used in this study were similar to those used in Sarinelli et al (2019). Leaf tissue at the four-leaf stage was sampled for each line in the SG20, SG21, and SUWWSN panels and DNA was extracted using sbeadex plant maxi kits (LGC Genomics, Middlesex, UK) as directed by the manufacturers protocol. Genotyping-by-sequencing was performed as described in Poland et al (2012). Libraries were constructed at 96 plex densities and each library was processed on an Illumina HiSeq 2500. SNP discovery using raw data was done via the Tassel-5GBSv2 pipeline version 5.2.35 (Glaubitz et al., 2014).
Reads were aligned to the RefSeq 1.0 wheat genome assembly (Appels et al., 2018) using the Burrows-Wheeler aligner (BWA) version 0.7.12. Data was filtered by removing any: taxa with 85% or more missing data, SNPs at a minor allele frequency of 5% or lower, SNPs that had any heterozygous call frequency of 10% or higher, SNPs with 20% or more missing data, SNPs with a read-depth of less than 1 or more than 100, and SNPs that did not align with the reference sequence. Imputation via Beagle 5.2 was conducted post filtering (Browning et al., 2018; Browning & Browning, 2007).
All lines in the SG20 and SG21 panels were genotyped using KASP markers diagnostic for Fhb1, Qfhb.nc-1A, Qfhb.nc-4A, and the Qfhb.vt-1B Jamestown haplotype. Marker sequences and genomic locations of the KASP assays used to genotype the SG20 and SG21 population are provided (Table 1). Composite calls of either resistant (R) or susceptible (S) were recorded based on the results of the marker assays for each region. Historic data for the four FHB resistance QTL were compiled from the Uniform Winter Wheat Scab Nursery Marker Reports for the SUWWSN from 2011-2020 (Brown-Guedira, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019). For comparability and simplicity of predictions, only resistant (R) or susceptible (S) were recorded for the QTL in the SUWWSN panels. All lines which received a heterozygous haplotype call, an ambiguous haplotype call, or did not receive a haplotype call for a QTL were removed from the dataset prior to use in predictions. Linkage disequilibrium was calculated in the regions for all QTL evaluated in the SG20, SG21, and SUWWSN data sets using the function “LD()” in the package “gaston” in R (Perdry & Dandine-Roulland, 2018). Boundaries of QTL regions were delimited by using the most proximal and distal megabase pair position of markers used to haplotype regions.
Phenotypic data for the SUWWSN were compiled from published scab reports sourced from the USA Wheat and Barley Scab Initiative repositories. Data collected included adjusted means for heading date, plant height, severity (SEV), percent fusarium damaged kernels (FDK), and concentration of DON content as measured in parts per million. Heading data was recorded in days after January 1st when heading was evident in half the plot. Plant height was measured in centimeters from the base of the plants in the center of a plot to the tip of spikes, excluding awns. Severity was taken as a percent of the number of spikelet symptomatic for FHB over the total number of spikelets in a subsample of spikes within a plot. FDK was measured by comparing seed samples to standards of known scabby seed percentages to assign ratings. Concentrations of DON were recorded via mass spectrometry and gas chromatography. For all data preparation and phenotyping protocols, refer to the USA Wheat and Barley Scab Initiative web portal [scabusa.org].
Phenotypic Data Analysis - Software and Models
All data analysis was performed in R statistical software version 4.1.1 (R Core Team, 2013). Adjusted means from SUWWSN reports were checked for assumptions of normality by visual comparison of distributions. To detect population structure in the SUWWSN, a principal component analysis (PCA) was conducted using the genome wide GBS derived SNP data via the “prcomp()” function from the “stats” package. Principal components (PCs) which accounted for three percent or more of the total variation were used as fixed effects in estimation of marker effects. All mixed linear models were run using the “asreml” package version 188.8.131.52 (Butler et al., 2009). Adjusted means for recorded SUWWSN traits were used with observed and predicted QTL haplotype calls in mixed linear models to estimate marker effects:
yijkl = μ + Mi + Pj + gk + el + εijkl
Where y is the response, μ is the population mean, M is the fixed marker call effect, P is the fixed PC effect, g is the random genotype effect, e is the random environment effect, and ε is the residual error where .
QTL Prediction - Models
Four separate models were entertained as possible methods of predicting QTL haplotypes: naïve classification via the most correlated marker in the training set (NCOR), K-nearest neighbors (KNN), random forest (RF), and gradient boosting machines (GBM).
For NCOR, the topmost correlated GBS SNP to the QTL haplotype calls in the SunGrains training population were used to predict the QTL haplotype call in the SunGrains testing and SUWWSN populations. Parameter tuning via K-fold cross validation within the training population was not conducted for NCOR due to the arbitrary nature with which the classifying marker was selected. All models following were implemented using the “caret” package in R statistical software and optimal model parameters were selected by five-fold cross-validation over 1000 iterations (Kuhn, 2008).
KNN functioned by finding K individuals in a training set that were most similar to an unclassified individual in a testing set; the most frequent class among those K neighbors in the training set was then used as the prediction for the class of the individual in the testing set (Belkasim et al., 1992). Here, we use the top 100 most correlated SNPs to the QTL haplotype calls in the training population as the variables which define neighbors. To avoid ties in decision making, up to 25 possible neighbors were considered for classifying, starting from one individual, and proceeding in odd numbered intervals (e.g., 1, 3, 5 … 25).
RF is a machine learning model that classifies through a randomly generated decision trees. In RF models used for classification, random vectors of observations are drawn out of the training population with replacement, and N number of randomly selected classifying variables are used to split at nodes within trees. A multitude of trees were drawn using N number of random predictor variables to create splits at nodes in each tree. Once the random forest was generated from the training data, classifications were made in the testing population by assigning the most frequently predicted category observed among all trees in the forest for an individual (Breiman, 2001). We used the top 100 most correlated markers to the QTL haplotyping calls in the training population as possible classifying variables and N number of random predictors used to split nodes within trees were tested from one to 100 markers in groups of five (e.g., 1, 5, 10 … 100). The number of trees generated in the random forest was optimized by the “caret” package.
GBM, also known as stochastic gradient boosting (Friedman, 2001), was similar to RF in that it draws a random forest comprised of decision trees made from random selected classifying variables; however, unlike RF, GBM took a logistic regression like approach to classification. The GBM algorithm first derived the log of odds from the observed classifications in the training population and calculated a probability via the logistic function. Then, the GBM algorithm calculated pseudo-residuals from the observed class probability of individuals versus the predicted probability derived from the most frequent class. An initial decision tree was then drawn using randomly selected classifier variables to a limited number of leaves. Unlike RF, GBM had multiple observations in a single leaf. Each leaf’s residuals were totaled, converted to a log of odds, scaled by a learning rate, and added back to the original log of odds calculated from the frequencies of the observed classifications. A probability was then derived from the newly calculated log of odds, and this process was repeated over N number of trees (Friedman, 2001). For QTL haplotype classification, we used the top 100 most correlated markers to the QTL haplotype calls in the training population as random classifying variables. We tested three learning rates (0.001, 0.01, and 0.1) and only one-way interactions among variables were considered. Generated random forests contained 100-1000 decision trees proceeding in groups of 100 (e.g., 100, 200, ect.). The number of leaves per tree were scaled by training population size so that trees contained only 10, 20 or 30 leaves.
QTL Prediction - Procedure
A general visual diagram of the described procedure is provided (Figure 1). For each QTL assessed, only GBS derived SNP markers located on the QTL’s chromosome of origin were considered (e.g., only SNPs on chromosome 3B for Fhb1). All observed QTL haplotype calls in the contemporary SunGrains and historic SUWWSN data were bound with imputed marker matrices from the QTL’s respective chromosome. Within the SunGrains panels, five training sizes were tested: 10%, 25%, 50%, 75%, and 90% of the total available data. Data were randomly subset, without replacement, into training-test splits, and the training data observed QTL haplotype calls were used in a correlational study of all GBS SNP markers on the QTL’s chromosome of origin. Only the top 100 most correlated markers were used as predictors to increase computational efficiency. Importance of predictor variables was calculated via the “varImp()” function in “caret” for KNN, RF, and GBM models, and all importance values were scaled between 0-100 for ease of comparison.
Each trained model was used to predict the QTL haplotype calls of the held out test portion of the SunGrains panel and confusion matrices were calculated using the QTL haplotype call predictions and the observed QTL haplotype calls. The same models trained using the SunGrains data were used to predict the SUWWSN QTL haplotype calls. Predicted and observed QTL haplotype calls were used in calculation of confusion matrices. Predicted and observed QTL haplotype calls in the SUWWSN were used in previously mentioned mixed linear models to estimate QTL haplotype call effects and means. Due to the random subsetting without replacement portion at the beginning of this procedure, the experiment was repeated 30 times to obtain distributions and averages of calculated confusion matrix coefficients and estimated predicted QTL haplotype call effects and means.