Characterizing genetic diversity, population structure, and genetic relationships among elite parental germplasm with molecular markers can accelerate genetic gains in breeding programs (Romay et al., 2013; Adu et al., 2019). Such characterization helps clarify how the germplasm is organized, which supports the selection of parents that contribute effectively and the designation of heterotic groups (Wu et al., 2016). Genomic data thus not only allow the estimation of genetic diversity but can also be combined with phenotypic information to find new functional genes and build prediction models (Milner et al., 2019). In this section, however, the focus is on whether the simulated reference genome leads to the discovery of the same polymorphisms and how this is reflected in the population structure of the lines.
The WSS method indicated the optimal number of clusters through the bend (elbow) in the curve of the plot, which is generally considered an indicator of the optimal number of groups (Kassambara, 2017). With this information and the results of the K-means clustering, the parental lines were partitioned into subpopulations, and the SNP datasets showed similar behavior (Fig. S1, Fig. 1, Table S1). This agrees with the spatial distributions obtained in the biplot graphs (Fig. 3), in which all SNP datasets showed the same dispersion pattern among lines. This suggests that the SNP datasets capture similar patterns of variance despite the difference in the number of markers among them (GBS-Mock has a lower number) and the difference in the genotyping platform itself (array versus GBS). Thus, SNP-array, GBS-B73, and GBS-Mock performed similarly with respect to the genetic diversity and population structure of the parental lines. Darrier et al. (2019) compared the performance of SNP-array and GBS to investigate the extent and pattern of genetic variance in barley and observed that the two methodologies selectively access the informative polymorphism in different portions of the genome. Despite this, their results showed a strong positive correlation between the matrices of both genotyping approaches, supporting their similarity and validity.
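The elbow criterion described above can be illustrated with a minimal sketch. It assumes WSS denotes the within-cluster sum of squares, uses a basic Lloyd K-means with a deterministic farthest-point initialization, and locates the elbow as the largest second difference of the WSS curve. The toy data, the function name `kmeans_wss`, and the elbow rule are illustrative assumptions, not the procedure used in the study.

```python
import numpy as np

def kmeans_wss(X, k, n_iter=50):
    """Run a basic Lloyd K-means and return the within-cluster sum of squares."""
    # Deterministic farthest-point initialization (an illustrative choice).
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Assign each point to its nearest center, then update the centers.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())

# Hypothetical toy data: three well-separated groups of 20 "lines" each.
rng = np.random.default_rng(0)
centers_true = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(0.0, 0.5, size=(20, 2)) for c in centers_true])

ks = list(range(1, 7))
wss = [kmeans_wss(X, k) for k in ks]

# Elbow as the k with the largest second difference of the WSS curve.
elbow_k = ks[int(np.argmax(np.diff(wss, 2))) + 1]
print(elbow_k)
```

On real marker data the bend is usually less sharp than in this toy case, so the plot is normally inspected visually rather than reduced to a single rule.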
PCA shows that the variance patterns captured by the SNP datasets are more similar in the first eigenvectors (Fig. 2). However, the captured variance is more consistent between GBS-B73 and GBS-Mock (Fig. 2c). This can be explained by the ascertainment bias of the SNP-array. This bias arises when the markers are not obtained from a random sample of the polymorphisms in the population of interest: the array was constructed using temperate maize lines (Frascaroli et al., 2013; Heslot et al., 2013; Unterseer et al., 2014), whereas the lines in this study are from tropical germplasm.
The matrices of genetic distances among the parental lines revealed similar patterns, showing the formation of subpopulations among the lines (Fig. S5). Using wheat as a model species to test for ascertainment bias and investigate its impact on genetic diversity estimates, Chu et al. (2020) observed that the SNP array tended to underestimate molecular diversity within the population. These results agree with previous studies on wheat lines (Elbasyoni et al., 2018) and maize lines (Frascaroli et al., 2013). Despite the ascertainment bias mentioned above and the difference in the reference genome used (the temperate B73 genome or the Mock genome), the population structure of the lines did not differ significantly, as the correlations between the genetic distance matrices were of high magnitude. Even though GBS-Mock uses a different reference genome from SNP-array and GBS-B73, the correlation between them was high (Table 1). Elbasyoni et al. (2018), investigating the influence of SNPs from different genotyping platforms on genomic prediction, observed a high correlation (r = 0.77) between SNP-array and GBS genetic distance matrices. These high-magnitude correlations suggest that the approaches used in this study represent the broad sampling of diversity well. This is supported by the GWAS study of Darrier et al. (2019), who indicated that methods using SNP-array and GBS could detect markers closely associated with genes that control key phenotypic traits.
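A correlation between two genetic distance matrices, as discussed above, can be sketched as the Pearson correlation of their off-diagonal (upper-triangle) entries. The example below assumes Euclidean distances on 0/2-coded inbred genotypes and two simulated marker panels of different density. The distance measure, the simulated panels, and all names are illustrative assumptions, since the text does not state which distance was used.

```python
import numpy as np

def euclidean_distance_matrix(M):
    """Pairwise Euclidean distances between rows (lines) of a marker matrix."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))

def matrix_correlation(D1, D2):
    """Pearson correlation between the upper-triangle entries of two matrices."""
    iu = np.triu_indices_from(D1, k=1)
    return float(np.corrcoef(D1[iu], D2[iu])[0, 1])

rng = np.random.default_rng(1)
n_lines = 12
# Two hypothetical subpopulations with contrasting allele frequencies.
group = np.repeat([0, 1], n_lines // 2)

def simulate_panel(n_markers):
    """Simulate 0/2-coded inbred genotypes for a panel with n_markers SNPs."""
    freq = np.where(group[:, None] == 0, 0.2, 0.8)
    return 2.0 * (rng.random((n_lines, n_markers)) < freq)

panel_a = simulate_panel(500)  # stands in for a denser SNP dataset
panel_b = simulate_panel(150)  # stands in for a sparser SNP dataset

dist_a = euclidean_distance_matrix(panel_a)
dist_b = euclidean_distance_matrix(panel_b)
r = matrix_correlation(dist_a, dist_b)
print(round(r, 2))
```

Because the two panels share the same underlying structure but no markers, a high correlation here reflects agreement in the recovered relationships rather than marker overlap, which is the kind of agreement the paragraph describes.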
Influence of genotyping methods on the determination of heterotic groups and the choice of testers
Heterosis is a fundamental phenomenon in obtaining superior single-crosses. To exploit it effectively throughout the breeding cycles, heterotic groups must be established. These groups, in turn, are made up of genetically related parental lines, which generate little or no heterosis when crossed with each other, whereas crossing with lines from another heterotic group tends to result in vigorous single-crosses (Lee, 1995). Therefore, genetic diversity among heterotic groups tends to increase the level of heterosis detected in hybrid combinations (Falconer and Mackay, 1996; Fu et al., 2014). Badu-Apraku et al. (2011) reported in their diallel study of maize lines that genetic diversity was small and, because of this, distinct heterotic groups could not be identified. In a similar study with other maize lines, significant genetic diversity was found, and two clear heterotic groups were identified. The type of predominant gene action in the parents under investigation is another factor that affects heterotic clustering. When additive and non-additive effects are significant, and additive gene action predominates over non-additive gene action, heterotic groups are easily identified (Badu-Apraku et al., 2015; Badu-Apraku, Fakorede, Talabi, et al., 2016; Badu-Apraku, Fakorede, Gedil, et al., 2016).
The PH and EH traits showed higher proportions of additive variance captured by the Ga matrices than GY (Table S2). Although all these traits have polygenic inheritance, GY is the most complex trait and the one most influenced by dominance deviations (Fischer et al., 2008; Hallauer, 2010). According to Hallauer (2010), most of the loci involved with GY in maize show dominance. This is reflected in a greater difference between H² and h² for GY than for the other traits, confirming the greater influence of dominance deviations on this trait. The additive genomic relationship matrices of the single-crosses (Ga) showed high correlations among SNP-array, GBS-B73, and GBS-Mock, indicating that these approaches capture similar additive variance patterns. GBS-Mock thus captures additive relationships among single-crosses similarly to the standard procedures, SNP-array and GBS-B73 (Table 1, Fig. S6a, b, c). On the other hand, the correlations between the dominance relationship matrices (Gd) were lower, but still medium to high. In both Ga and Gd, the correlations between SNP-array and GBS-Mock were the lowest, which can be explained by the fact that these SNP datasets use different reference genomes to perform SNP calling.
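To make the Ga and Gd matrices concrete, the sketch below builds both for single-crosses whose genotypes are derived from their inbred parents. It assumes one common parameterization: an additive matrix following VanRaden (2008) and a dominance matrix following Vitezica et al. (2013). The text does not state which construction was used in the study, and the toy parental genotypes and function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def additive_G(M):
    """Additive genomic relationship matrix from a 0/1/2 genotype matrix."""
    p = M.mean(axis=0) / 2.0                  # allele frequency per marker
    Z = M - 2.0 * p                           # centered genotypes
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def dominance_G(M):
    """Dominance genomic relationship matrix from a 0/1/2 genotype matrix."""
    p = M.mean(axis=0) / 2.0
    q = 1.0 - p
    # Genotype-specific dominance coding: -2q^2, 2pq, -2p^2 for 2, 1, 0 copies.
    W = np.where(M == 2, -2.0 * q**2, np.where(M == 1, 2.0 * p * q, -2.0 * p**2))
    return W @ W.T / np.sum((2.0 * p * q) ** 2)

# Hypothetical inbred parents (rows) coded 0/2 at five markers (columns).
parents = np.array([
    [0, 2, 2, 0, 2],
    [2, 2, 0, 0, 2],
    [0, 0, 2, 2, 0],
    [2, 0, 0, 2, 2],
], dtype=float)

# A single-cross genotype is the average of its two inbred parents.
pairs = list(combinations(range(len(parents)), 2))
hybrids = np.array([(parents[i] + parents[j]) / 2.0 for i, j in pairs])

Ga = additive_G(hybrids)
Gd = dominance_G(hybrids)
print(Ga.shape, Gd.shape)
```

In practice, monomorphic markers would be removed before this step, since they contribute nothing to either matrix and can make the scaling terms zero.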
SCA reflects the action of non-additive gene effects, indicating interactions between alleles at the same locus. It is one of the most important parameters in identifying superior single-crosses and is an indicator of genetic distance between parents (Sprague and Tatum, 1942; Carvalho, 1993). Thus, using the SCA estimates as the genetic distance between the lines to identify the panel structure, two heterotic groups were formed such that the distance between them is maximized. The correlations between the SCA estimates were almost perfect (Table S4). In other words, SNP-array, GBS-B73, and GBS-Mock presented equivalent SCA estimates, and the composition of heterotic groups practically did not change from one SNP dataset to another. Therefore, the determination of heterotic groups was similar regardless of the platform used (Fig. 4, Table S3).
In addition to having distinct heterotic groups, a well-established breeding program also has good testers. When crossed with parental lines, testers provide information about the genetic value of the lines by evaluating their combining ability, which is associated with the additive effects of alleles and additive-type epistatic actions (Cruz and Vencovsky, 1989; Albrecht et al., 2014). The correct choice of a tester can have great significance for the expectation of a successful selection process (Miranda Filho, 2018). According to Hallauer and Martinson (1975), a good tester is simple to use, provides information that correctly classifies the relative merit of the lines, and maximizes the potential genetic gain. Thus, based on the GCA estimates of the lines, testers were elected for each heterotic group according to the evaluated traits and the SNP datasets. As expected, there were no differences in tester choice between SNP datasets, as the correlations between GCA estimates across rows were perfect (Table 2, Table S5).
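The GCA and SCA concepts used above can be illustrated with the classical fixed-effect estimators for a half diallel without selfs or reciprocals (Griffing's Method 4). This is a textbook sketch and not necessarily the estimation procedure used in this study. The toy table of single-cross means, the function name `griffing_method4`, and the rule of taking the line with the largest GCA as a candidate tester are illustrative assumptions.

```python
import numpy as np

def griffing_method4(Y):
    """GCA and SCA from a symmetric n x n table of single-cross means.

    The diagonal (selfs) is ignored; each cross appears at [i, j] and [j, i].
    """
    Y = np.array(Y, dtype=float)
    n = Y.shape[0]
    np.fill_diagonal(Y, 0.0)
    row_tot = Y.sum(axis=1)          # sum of all crosses involving line i
    grand = Y.sum() / 2.0            # sum over crosses, each counted once
    gca = (n * row_tot - 2.0 * grand) / (n * (n - 2))
    sca = (Y - (row_tot[:, None] + row_tot[None, :]) / (n - 2)
           + 2.0 * grand / ((n - 1) * (n - 2)))
    np.fill_diagonal(sca, np.nan)
    return gca, sca

# Hypothetical true effects for four lines (GCA sums to zero, SCA rows sum to zero).
mu = 10.0
g_true = np.array([1.0, -0.5, 0.5, -1.0])
s_true = np.zeros((4, 4))
s_values = {(0, 1): -1.0, (2, 3): -1.0, (0, 2): 0.5,
            (1, 3): 0.5, (0, 3): 0.5, (1, 2): 0.5}
for (i, j), s in s_values.items():
    s_true[i, j] = s_true[j, i] = s

Y = mu + g_true[:, None] + g_true[None, :] + s_true
gca, sca = griffing_method4(Y)

# Illustrative rule only: the line with the largest GCA as a candidate tester.
candidate_tester = int(np.argmax(gca))
print(gca, candidate_tester)
```

In this toy table the pairs (0, 1) and (2, 3) have the lowest SCA, so a grouping that treats SCA as a distance would place lines 0 and 1 in one group and lines 2 and 3 in the other.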
Because the genotyping approaches produced very similar but not identical results in the population structure analyses of the parental lines, some influence on the formation of heterotic groups and the choice of testers was expected. However, the results show that the genotyping platform and, more specifically, the approach that uses the simulated genome (GBS-Mock) produces results similar to those of the standard procedures.
Influence of genotyping methods on genomic prediction of single-crosses
Assessing the performance of all single-cross combinations of the parental lines that excel in a breeding program is impractical in most cases, given that the number of combinations grows quadratically as the number of elite parents increases. Obtaining estimates of the genetic values of unevaluated single-crosses became viable with the increased availability of molecular markers and genomic prediction models (Hallauer, 2010). Therefore, to accelerate genetic gain with limited resources, the prediction of single-cross performance is highly important in modern breeding programs (Basnet et al., 2019).
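As a quick illustration of how fast the evaluation burden increases, the number of possible single-crosses among n inbred parents, ignoring reciprocal crosses, is n(n - 1)/2. The short sketch below tabulates this count for a few panel sizes; the panel sizes are arbitrary examples and not values from this study.

```python
from math import comb

def n_single_crosses(n_parents):
    """Number of distinct single-crosses among n parents, ignoring reciprocals."""
    return comb(n_parents, 2)

# Arbitrary example panel sizes.
for n in (10, 50, 100, 500):
    print(n, n_single_crosses(n))
```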
However, few studies have so far addressed how genotyping platforms influence the prediction of single-crosses and, more specifically, the use of the mock genome as a tool for more sophisticated analyses such as genomic prediction. Only one recent study shows the mock genome's efficiency in predicting maize single-crosses, which may be an alternative for crops that do not yet have a reference genome (Sabadin et al., 2022). Our study, however, is more complete and more representative because it follows the approaches from the population structure phase onward, which is crucial for the intended use of germplasm through the division of heterotic groups, the definition of testers, and, finally, the genomic prediction of single-crosses.
GY showed the lowest predictive abilities in all SNP datasets, and EH the highest, under the additive-dominance GBLUP prediction model (Fig. 5, Table S2). Combs and Bernardo (2013) suggested that genomic predictions are more accurate for traits with higher heritability. In the results of Hayes et al. (2010), complex traits controlled by many small-effect loci, such as GY, showed lower predictive abilities than less complex traits. Although GBS-Mock has a lower number of markers, this approach performed similarly to the other SNP datasets for all traits, corroborating the hypothesis that it is possible to substantially reduce the number of markers and maintain a high predictive ability (Tayeh et al., 2015; Ma et al., 2016; Sousa et al., 2019). Exceptions are long-term breeding cycles without updating of the training population, which would demand high marker densities (DoVale et al., 2022). In addition, the genetic distance estimates obtained from the SNP datasets were very similar (Fig. S5).
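Predictive ability in this context is typically the correlation between predicted and observed values in a validation set. The sketch below shows one way an additive-dominance GBLUP prediction could be computed for simulated single-crosses. It treats the variance components as known (in practice they would be estimated, for example by REML), uses the training mean as the intercept, and relies on simulated genotypes and phenotypes. All values and names are hypothetical and do not reproduce the model fitting of the study.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(42)

# Hypothetical toy panel: 10 inbred parents (coded 0/2) and their 45 single-crosses.
parents = 2.0 * rng.integers(0, 2, size=(10, 200))
pairs = list(combinations(range(10), 2))
M = np.array([(parents[i] + parents[j]) / 2.0 for i, j in pairs])
M = M[:, M.std(axis=0) > 0]               # drop monomorphic markers

# Additive (Ga) and dominance (Gd) relationship matrices (assumed parameterization).
p = M.mean(axis=0) / 2.0
q = 1.0 - p
Z = M - 2.0 * p
Ga = Z @ Z.T / (2.0 * np.sum(p * q))
W = np.where(M == 2, -2.0 * q**2, np.where(M == 1, 2.0 * p * q, -2.0 * p**2))
Gd = W @ W.T / np.sum((2.0 * p * q) ** 2)

# Simulated phenotype: additive and dominance marker effects plus noise.
a_eff = rng.normal(0.0, 1.0, M.shape[1])
d_eff = rng.normal(0.0, 0.5, M.shape[1])
g = Z @ a_eff + (M == 1) @ d_eff
y = g + rng.normal(0.0, 0.5 * g.std(), len(g))

def gblup_predict(y, Ga, Gd, train, test, var_a, var_d, var_e):
    """Predict test-set values from training data with assumed variance components."""
    mu = y[train].mean()                   # simple intercept estimate
    K = var_a * Ga + var_d * Gd            # total genetic covariance
    V = K[np.ix_(train, train)] + var_e * np.eye(len(train))
    return mu + K[np.ix_(test, train)] @ np.linalg.solve(V, y[train] - mu)

# Hold out 15 single-crosses as a validation set.
idx = rng.permutation(len(y))
test, train = idx[:15], idx[15:]
y_hat = gblup_predict(y, Ga, Gd, train, test, var_a=1.0, var_d=0.3, var_e=0.5)

# Predictive ability: correlation between predicted and observed values.
predictive_ability = float(np.corrcoef(y_hat, y[test])[0, 1])
print(round(predictive_ability, 2))
```

A single random split such as this one gives a noisy estimate, so repeated cross-validation is the usual way to report predictive ability.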
Selection intensity must be chosen thoughtfully, as genetic variability can be drastically reduced under high selection pressure. The choice of appropriate selection intensities depends on the size of the population and the duration of the breeding program, whether short-term or long-term. In general, selection intensities ranging from 10 to 40% are used in plant breeding, the highest being applied at the beginning of a breeding program (Hallauer, 2010). For the coincidence between individuals selected by phenotypic selection and by genomic selection, the SNP datasets showed similar behavior as the selection intensity was increased, with the increase being more pronounced from 1 to 10% of selection intensity. From then on, the coincidence of selection showed smaller increments (Fig. 6). Our results for predictive ability and coincidence of selection agree with those of Sabadin et al. (2022). It should be noted that different selection intensities modify the response rates. Thus, the coincidence between phenotypic and genomic selection is expected to reach a plateau and subsequently decrease.
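The coincidence of selection discussed above can be sketched as the share of individuals selected on observed phenotype that are also selected on genomic prediction at a given selection intensity. The definition below, the rounding of the number selected, and the toy values are illustrative assumptions rather than the exact computation used in the study.

```python
def coincidence_of_selection(observed, predicted, intensity):
    """Fraction of top-ranked individuals shared by two rankings.

    intensity is the proportion selected (e.g. 0.10 for 10%).
    """
    n_sel = max(1, round(intensity * len(observed)))

    def top(values):
        # Indices of the n_sel largest values.
        order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
        return set(order[:n_sel])

    return len(top(observed) & top(predicted)) / n_sel

# Hypothetical observed and predicted values for ten single-crosses.
observed = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
predicted = [9.5, 7.0, 9.0, 8.0, 1.0, 6.0, 5.0, 4.0, 3.0, 2.0]

print(coincidence_of_selection(observed, predicted, 0.2))  # 0.5
print(coincidence_of_selection(observed, predicted, 0.4))  # 1.0
```

In this toy case the two rankings disagree on one of the top two individuals but agree on the top four, which mirrors how coincidence can rise as a larger proportion is selected.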
Despite the apparent differences between SNP datasets, the general message is that these approaches perform comparably in the analyses carried out in this study, even though they access different types of genomic sequences. While SNP-array is derived from exome capture and is therefore focused on coding sequence variation, the GBS data represent a wider diversity survey in genomic regions associated with low levels of DNA methylation, which may also include many genes and gene regulatory regions (Darrier et al., 2019; Negro et al., 2019). On the other hand, the physical distribution of markers reveals higher frequencies of SNPs at the gene-rich telomeric ends of each chromosome for both approaches, with this frequency being more pronounced in SNP-array (Bayer et al., 2017). The platforms probably capture nearby markers in linkage disequilibrium with QTLs (Quantitative Trait Loci). In this sense, using different platforms can be advantageous, as it allows the identification of additional QTLs.
Possible applications of the Mock genome in plant breeding
Until recently, only the main commercial crops benefited from state-of-the-art technologies. However, the development of the GBS platform made the use of such technologies viable for orphan crops. Approaches like this can convert orphan crops into crops rich in genomic resources and substantially shorten the breeding process (Varshney et al., 2009, 2012; Varshney and May, 2012).
Previously, this process was much slower than it is today. Rice, for example, took almost 20 years to stop being an orphan crop and become a basic model for cereals (Varshney et al., 2009). Introducing these crops into the genomic era also accelerates the identification of genes underlying important agronomic traits and improves our understanding of the evolution of these species (Ye and Fan, 2021). Many minor crops are now becoming rich in genetic resources as a result of investments from various public and private initiatives, such as the African Orphan Crops Consortium (AOCC) (Hendre et al., 2019), a global partnership that is generating genomic resources for 101 African orphan crops. One of the objectives of this Consortium is to create reference genomes for these crops. Although some efforts are being made to pay greater attention to these crops (Chiurugwi et al., 2019; Gregory et al., 2019; Jamnadass et al., 2020), the ideal is still far from being achieved, given that several species relevant to local diets around the world remain understudied.
Despite these initiatives and investments, not all crops will benefit, and those left out cannot take advantage of modern breeding tools. While these advances are being consolidated, mock genomes can be an alternative where the absence of a reference genome is a barrier to the efficient use of GBS data (Melo et al., 2017; Hale et al., 2018). The present study has shown that using a population-tailored mock reference to perform SNP discovery is a valid alternative. With this approach, it was possible to carry out the investigations needed to outline a breeding program, from studies of diversity and population structure to genomic prediction. However, it is important to emphasize that a population with maximum representativeness must be considered when building a mock reference in order to capture all the population polymorphisms (Sabadin et al., 2022).
Despite these advantages, the use of a mock genome in genomic studies must consider some caveats. For example, diploid crops with relatively smaller genomes are preferred over cross-pollinated or polyploid orphan crops, as these have genomes that are too complex to be sequenced. However, genome size will become less of a barrier with advances in sequencing technologies and bioinformatics tools (Armstead et al., 2009). Another challenge lies in SNP calling, owing to the limitations of GBS: the low coverage of NGS reads can lead to incorrect identification of homozygotes and heterozygotes, in addition to a large amount of missing and low-quality data (Heslot et al., 2013). According to Sabadin et al. (2022), mock genomes do not provide the physical position of the markers in a constant reference, which hinders studies such as GWAS and candidate gene discovery. Negro et al. (2019) state that SNP-array and GBS are complementary for detecting QTLs tagging different haplotypes in association studies. In this sense, using other platforms can be advantageous, as it allows the identification of additional QTLs. However, no studies have yet demonstrated the performance of mock genomes for these purposes. When looking for these larger-effect markers, the results will probably differ from those obtained with the SNP array because of differences in coverage between platforms.
Given what has been shown, it is possible to infer and recommend that a mock genome constructed from the population's polymorphisms to perform SNP calling is an excellent strategy to support plant breeders in studies of diversity, population structure, the definition of heterotic groups, the choice of testers, and genomic prediction in species that still do not have a reference genome available. It is therefore an alternative for the rapid advancement of orphan crop improvement. This approach will play a key role in improving the genetic potential of orphan crops and in helping develop sustainable food systems.