Phenotypic variation of carotenoids in the SAP/CAP collection
To characterize phenotypic variation of carotenoids in sorghum grain, and to confirm previously published phenotype data on one year of samples, we quantified lutein, zeaxanthin, β-carotene, and α-carotene for 447 accessions in the SAP/CAP global collection using HPLC (Table S1). Lutein was the most abundant carotenoid, followed by zeaxanthin, β-carotene, and then α-carotene. Raw concentrations for lutein ranged from 0.02–4.61 µg/g, for zeaxanthin from 0.01–2.40 µg/g, and for β-carotene from 0.03–1.19 µg/g, with means of 0.58 µg/g, 0.18 µg/g, and 0.17 µg/g, respectively. α-carotene was detected in only 31 accessions, with values ranging from 0.02–0.11 µg/g. Due to the limited number of accessions with detectable concentrations, α-carotene was omitted from subsequent genetic analysis. High correlations were found between β-carotene and zeaxanthin (r = 0.74; p < 10⁻¹⁶), β-carotene and lutein (r = 0.78; p < 10⁻¹⁶), and lutein and zeaxanthin (r = 0.75; p < 10⁻¹⁶). Four accessions, two of which had not previously been phenotyped, had higher concentrations of β-carotene than any accessions previously phenotyped in the SAP/CAP collection.
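The pairwise correlations reported above are Pearson coefficients computed across accessions. As a minimal sketch of that calculation, assuming two equal-length lists of concentrations (the values and the function name `pearson_r` are invented for illustration and are not data from this study):

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical concentrations (µg/g) for four accessions.
beta_carotene = [0.05, 0.10, 0.20, 0.40]
zeaxanthin = [0.10, 0.15, 0.45, 0.80]

r = pearson_r(beta_carotene, zeaxanthin)
```

With real data, accessions missing either trait would need to be dropped pairwise before calling the function.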
To account for unbalanced data and accurately predict the genetic merit for carotenoids of the SAP/CAP accessions, BLUPs and heritabilities were calculated for each of the carotenoid traits (Table 1, Table S2). Due to the expected shrinkage effect, lower ranges were obtained for the BLUPs than for the raw concentrations. However, entry-mean basis heritability estimates (Table 1) were high, ranging from 0.78 for β-carotene to 0.92 for zeaxanthin.
Table 1
Range, mean, and entry-mean basis heritability (H²) for the BLUPs of lutein, zeaxanthin, and β-carotene for the SAP/CAP collection.
| Lutein (µg/g) | Zeaxanthin (µg/g) | β-carotene (µg/g) |
Minimum | 0.15 | 0.04 | 0.07 |
Maximum | 3.09 | 1.83 | 0.80 |
Mean | 0.58 | 0.18 | 0.17 |
H² | 0.80 | 0.92 | 0.78 |
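The shrinkage of BLUPs relative to raw concentrations, and the entry-mean heritability reported in Table 1, can be illustrated with a small sketch. It assumes the simplest one-way random model with known variance components and a single environment; the variance values, the replicate counts, and the function names are hypothetical and are not taken from this study.

```python
def blup_shrinkage(entry_mean, grand_mean, var_g, var_e, n_obs):
    """BLUP of a genotype effect under a one-way random model.

    The raw deviation from the grand mean is multiplied by a weight in (0, 1)
    that shrinks harder when an accession has fewer observations.
    """
    weight = var_g / (var_g + var_e / n_obs)
    return weight * (entry_mean - grand_mean)


def entry_mean_heritability(var_g, var_e, n_reps):
    """Entry-mean basis heritability: H2 = var_g / (var_g + var_e / n_reps)."""
    return var_g / (var_g + var_e / n_reps)


# Hypothetical variance components, chosen only for illustration.
VAR_G, VAR_E = 0.04, 0.02
GRAND_MEAN = 0.17  # µg/g

raw_deviation = 1.19 - GRAND_MEAN
well_replicated = blup_shrinkage(1.19, GRAND_MEAN, VAR_G, VAR_E, n_obs=4)
single_obs = blup_shrinkage(1.19, GRAND_MEAN, VAR_G, VAR_E, n_obs=1)
h2 = entry_mean_heritability(VAR_G, VAR_E, n_reps=2)
```

In this sketch an accession observed once is pulled further toward the mean than one observed four times, which is why unbalanced data yield a narrower BLUP range than the raw values.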
Genome-wide association study of carotenoids in SAP/CAP collection
Next, we sought to characterize the genetic architecture of sorghum carotenoids. A previous study [15] suggested that global sorghum grain carotenoid variation is oligogenic, so to further test this hypothesis, we conducted a GWAS using more accessions and replicates. To maximize the number of accessions included, we used BLUPs rather than raw data in order to account for unbalanced data. Marker-trait associations were identified for the BLUPs of the three carotenoid traits evaluated (Fig. 2, Table S3).
For lutein, only one SNP, on chromosome 4 (S04_275231), was above the Bonferroni threshold of significance (Fig. 2A, Table S3). To identify candidates that may not be found using the stricter Bonferroni multiple comparison correction, we also considered a more liberal False Discovery Rate (FDR) criterion. Under the FDR < 0.05 threshold, seven significant SNPs were identified, corresponding to four regions of association on chromosomes 3, 4, 6 and 9. Three of these SNPs were located in a region around 2.17 Mb on chromosome 9, which is not near any a priori candidate genes. The only association in proximity to an a priori candidate gene was at SNP S6_47123508, 401 kb from Sobic.006G097500, which is annotated as a putative ortholog of the maize ZEP gene.
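The difference between the two significance criteria can be sketched as follows. The sketch assumes that the FDR was controlled with the Benjamini-Hochberg step-up procedure, which this section does not state; the p-values and the number of tests are invented for illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test p-value cutoff that controls the family-wise error rate."""
    return alpha / n_tests


def benjamini_hochberg(p_values, fdr=0.05):
    """Indices of tests declared significant by the BH step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its rank-scaled cutoff.
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Hypothetical GWAS p-values: four small ones among 1,000 tests.
p_values = [1e-9, 8e-5, 1.2e-4, 1.9e-4] + [0.5] * 996

cutoff = bonferroni_threshold(0.05, len(p_values))
bonferroni_hits = [i for i, p in enumerate(p_values) if p <= cutoff]
bh_hits = benjamini_hochberg(p_values, fdr=0.05)
```

Here only the first marker passes the Bonferroni cutoff, while all four small p-values pass the FDR criterion, mirroring how the more liberal criterion recovers additional candidate SNPs.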
Zeaxanthin had the highest number of marker-trait associations above the Bonferroni significance threshold, with 39 significant SNPs in 17 regions across all chromosomes except chromosome 3 (Fig. 2B, Table S3). The most prominent association was on chromosome 6 between 45.9–48.6 Mb, with six significant SNPs, three of which were among the top ten most significant associations. The most significant association for zeaxanthin was the SNP near the ZEP gene (S6_47123508; 401 kb away) that was also associated with lutein. There was also an association on chromosome 2 (S2_61694864), which is in proximity to Sobic.002G225400 (42 kb away), an a priori candidate gene annotated as an abscisic acid 8'-hydroxylase 3 (CYP707A).
β-carotene had ten significant marker-trait associations for a total of six regions of association across chromosomes 2, 6 and 10 (Fig. 2C, Table S3). Chromosome 10 had the highest number of marker-trait associations, particularly within a region around 7.48 Mb. There was also a SNP on chromosome 10 (S10_14377366) that was significantly associated with both β-carotene and zeaxanthin, which is not in proximity to any a priori candidate genes. Among the ten markers associated with β-carotene, only SNP S06_47123508, 401 kb from the ZEP gene, was in proximity to an a priori candidate gene.
Population structure of sorghum carotenoids in SAP/CAP global collection
Next, we tested whether carotenoid variation is structured by population and geographic origin. Since provitamin A carotenoids are our primary target, we focused on β-carotene concentrations for this analysis. Country of origin was obtained from the USDA NPGS GRIN database for the accessions in the SAP/CAP collection. Countries that had fewer than eight accessions were discarded from the analysis. In the SAP/CAP collection there were nine countries represented by more than eight accessions: Botswana, Ethiopia, India, Lebanon, Nigeria, South Africa, Sudan, Uganda, and the United States. β-carotene BLUP estimates among this subset of 309 accessions had the same range and average as the full SAP/CAP global collection (Table 1, Fig. 3). Interestingly, most of the countries had average values below the global average for β-carotene of 0.17 µg/g (Fig. 3). Furthermore, the β-carotene distributions for the accessions of Sudan, South Africa, India, and Botswana were almost completely below the global average. In contrast, Lebanon had the highest β-carotene BLUP estimates, with the majority of its accessions above the global average. Notably, Nigeria had the widest range of variation as well as the highest β-carotene concentrations among the countries.
Based on the limited geographic distribution of high carotenoid sorghums, we hypothesized that the high carotenoid germplasm originates from a narrow genetic pool. To test this hypothesis, we conducted a PCA to evaluate the genetic diversity of the high carotenoid accessions identified in the SAP/CAP collection. The high carotenoid accessions were defined as those within the top 5% for β-carotene BLUP estimates, which consisted of 17 accessions ranging from 0.40 to 0.80 µg/g β-carotene. The majority of the high carotenoid accessions originated in the United States (6 accessions), followed by Nigeria (3 accessions) and Lebanon (3 accessions) (Fig. 4A). Interestingly, the three high carotenoid accessions from Nigeria grouped together and were clustered separately from most of the other high carotenoid accessions, and another high accession of unknown origin did not group with any other high carotenoid accessions, suggesting three genetically distinct high carotenoid groups (Fig. 4A).
To further test our hypothesis on a narrow genetic pool for high carotenoid lines, we evaluated the genetic diversity surrounding the most prominent SNP identified by GWAS for all three carotenoids (S06_47123508). We analyzed a window of 1 Mb upstream and downstream of S06_47123508, which encompassed 1,665 SNPs. Nucleotide diversity was decreased in the high carotenoid accessions, but not in the low carotenoid accessions (defined as the lowest 5% for β-carotene BLUP estimates) or in the complete set of SAP/CAP collection accessions. The most prominent region of low nucleotide diversity was surrounding SNP S06_47123508, a region which includes the a priori candidate gene encoding ZEP (Fig. 4B).
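The comparison of nucleotide diversity between groups can be illustrated with a minimal sketch. It assumes biallelic SNPs coded as alternate-allele dosages (0, 1, 2) per diploid accession and uses the standard unbiased per-site estimator averaged over the SNPs in a window; the estimator and software actually used are not stated in this section, and the genotypes below are invented.

```python
def site_diversity(genotypes):
    """Unbiased nucleotide diversity at one biallelic SNP.

    genotypes: alternate-allele dosages (0, 1 or 2) per diploid accession,
    with None marking a missing call.
    """
    calls = [g for g in genotypes if g is not None]
    n_chrom = 2 * len(calls)
    if n_chrom < 2:
        return 0.0
    p = sum(calls) / n_chrom
    return n_chrom / (n_chrom - 1) * 2 * p * (1 - p)


def window_diversity(snp_matrix):
    """Mean per-SNP diversity over all SNPs in a window (rows are SNPs)."""
    return sum(site_diversity(snp) for snp in snp_matrix) / len(snp_matrix)


# Hypothetical genotypes: two SNPs scored in four accessions per group.
high_group = [[2, 2, 2, 2], [2, 2, 2, 0]]
full_panel = [[2, 0, 2, 0], [0, 2, 2, 0]]

pi_high = window_diversity(high_group)
pi_full = window_diversity(full_panel)
```

A group that is nearly fixed for one haplotype around the focal SNP, like `high_group` here, yields a lower window value than a panel segregating at both sites.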
Prediction of carotenoid breeding values in unexplored germplasm collection
Next, we sought to explore whether unidentified high carotenoid germplasm exists in additional germplasm collections. Publicly available genotype data was obtained for germplasm collections from six countries: Ethiopia, Haiti, Niger, Nigeria, Senegal, and Sudan. Together with the SAP/CAP collection, the dataset consisted of 60,129 common SNPs with less than 20% of data missing for 2,495 accessions. There were 367 accessions from Ethiopia, 296 from Haiti, 516 from Niger, 180 from Nigeria, 421 from Senegal, 319 from Sudan, and 396 from the SAP/CAP collection. Most of this germplasm is photoperiod sensitive, making it difficult to phenotype in temperate regions such as the United States.
Genomic prediction has the potential to guide resource allocations by identifying the most promising germplasm to test in future work. We first explored the feasibility of the SAP/CAP collection as a training population for the unexplored germplasm collections. For this, the genetic relationship among the unexplored germplasm collections and the SAP/CAP collection was tested with a PCA, highlighting the country of origin for each accession (Fig. 5A). Germplasm from Haiti, Ethiopia, Niger, Nigeria, and Senegal formed independent clusters, indicating genetic similarities within but not between countries. Haiti segregated the most distantly, followed by more sparsely grouped germplasm from Senegal and Nigeria. The distant genetic relationship of Haitian germplasm with the other countries was expected, as these materials are from a breeding program that went through a recent bottleneck after a sugarcane aphid infestation [36]. Germplasm from Niger and Ethiopia clustered very close together, but separate from the other countries. As expected based on previous studies [37, 38], accessions from the Sudan collection and SAP/CAP collection were scattered across all clusters, rather than clustering together, indicating high genetic diversity. The scattered distribution of the SAP/CAP collection confirms that it is an appropriate training population for genomic predictions in the unexplored germplasm.
Next, we estimated GEBV from the BLUPs of β-carotene, lutein, and zeaxanthin in the unexplored germplasm collections and the SAP/CAP collection (Table S4). Lutein GEBV ranged from −0.37 to 2.16 µg/g, with an average prediction accuracy of 0.62 and a genomic heritability of 0.96 (Table 2). For zeaxanthin, GEBV ranged from −0.20 to 1.44 µg/g, with a prediction accuracy of 0.69 and a genomic heritability of 1.00 (Table 2). Lastly, for β-carotene, GEBV ranged from −0.08 to 0.46 µg/g, with a prediction accuracy of 0.67 and a genomic heritability of 0.75 (Table 2). Interestingly, no accessions in the unexplored germplasm had GEBV for β-carotene higher than the highest values in the SAP/CAP collection; however, some accessions had values among the highest in all the collections (Fig. S4 and Table S4). Finally, as seen in the SAP/CAP accessions, high correlations were identified for GEBV between β-carotene and zeaxanthin (r = 0.89; p < 10⁻¹⁶), β-carotene and lutein (r = 0.87; p < 10⁻¹⁶), and lutein and zeaxanthin (r = 0.85; p < 10⁻¹⁶).
Table 2
Range of GEBV, average prediction accuracy, and genomic heritability (H²) for lutein, zeaxanthin, and β-carotene for the unexplored germplasm collections and the SAP/CAP collection.
| GEBV Lutein (µg/g) | GEBV Zeaxanthin (µg/g) | GEBV β-carotene (µg/g) |
Minimum | -0.37 | -0.20 | -0.08 |
Maximum | 2.16 | 1.44 | 0.46 |
Average Prediction Accuracy | 0.62 | 0.69 | 0.67 |
H² | 0.96 | 1.00 | 0.75 |
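How breeding values can be predicted for unphenotyped accessions from a phenotyped training set can be sketched with ridge regression on marker dosages. The prediction model used in this study is not specified in this section, so ridge regression is a stand-in; the shrinkage parameter `lam`, the marker matrix, the phenotypes, and all function names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def ridge_marker_effects(markers, phenotypes, lam):
    """Marker effects u solving (Z'Z + lam*I) u = Z'(y - mean(y)).

    Z is the column-centered marker matrix of the training population.
    """
    n, p = len(markers), len(markers[0])
    y_mean = sum(phenotypes) / n
    col_means = [sum(row[j] for row in markers) / n for j in range(p)]
    z = [[row[j] - col_means[j] for j in range(p)] for row in markers]
    ztz = [[sum(z[i][a] * z[i][b] for i in range(n)) + (lam if a == b else 0.0)
            for b in range(p)] for a in range(p)]
    zty = [sum(z[i][j] * (phenotypes[i] - y_mean) for i in range(n))
           for j in range(p)]
    return solve_linear_system(ztz, zty), col_means


def predict_gebv(markers, effects, col_means):
    """GEBV as deviations from the training mean, so values can be negative."""
    return [sum((row[j] - col_means[j]) * effects[j]
                for j in range(len(effects))) for row in markers]


# Hypothetical training set: six inbred lines, three SNPs coded 0/2.
train_markers = [[2, 0, 0], [2, 2, 0], [0, 0, 2],
                 [0, 2, 2], [2, 0, 2], [0, 2, 0]]
train_blups = [0.50, 0.55, 0.10, 0.15, 0.52, 0.12]  # µg/g
candidates = [[2, 2, 2], [0, 0, 0]]  # unphenotyped accessions

effects, col_means = ridge_marker_effects(train_markers, train_blups, lam=1.0)
gebv = predict_gebv(candidates, effects, col_means)
```

In this toy example the first SNP carries most of the signal, so the candidate carrying its alternate allele receives a positive GEBV and the other a negative one. Prediction accuracy would then be estimated by cross-validation within the training set, correlating predictions with held-out BLUPs.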
To further explore geographic patterns of sorghum carotenoid distribution beyond the SAP/CAP collection, we aggregated GEBV by country using the unexplored germplasm collections (Fig. 5B-D). Nigeria had the highest GEBV and range of values for all three carotenoids, followed by Niger. In contrast, Haiti had some of the smallest carotenoid GEBV values, as well as the smallest range of values. Interestingly, Ethiopia had several accessions with high GEBV for lutein, but only three high accessions for β-carotene, and no high accessions for zeaxanthin. Similarly, Senegal had one accession with a high GEBV for β-carotene and zeaxanthin, but not for lutein. These differences suggest that although the three carotenoids are highly correlated, consistent with common genetic controls, there are also independent genetic controls.
Next, we looked at the genetic relationships among the predicted top 5% for each carotenoid using a PCA for the GEBV (Fig. 5E-G, Table S5). The pattern of distribution differed by carotenoids, but the majority of the accessions were clustered by country. Lutein (Fig. 5E) had two major clusters corresponding to Ethiopian accessions and a combination of accessions mostly from Nigeria and Niger. For zeaxanthin (Fig. 5F) and β-carotene (Fig. 5G), the clustering was similar, with the accessions of Nigeria and Niger forming the tightest cluster. Accessions from Sudan and the SAP/CAP germplasm were scattered for the three carotenoids, suggesting they are genetically distinct. Taken together, a proportion of accessions predominantly from Nigeria and Niger formed the most distinct cluster in the PCA for the three carotenoids, indicating they are genetically similar. The accessions with the highest GEBV for β-carotene were also part of this cluster.
Allelic diversity and geographic distribution of ZEP allele
To further test the hypothesis that high carotenoid lines originate from a narrow genetic pool, we analyzed the SNPs inside the ZEP gene in the SAP/CAP collection and unexplored germplasm collections. In the SAP/CAP collection, we identified 14 SNPs in the ZEP gene with MAF > 0.05. Due to low marker density, the majority of these SNPs were absent in the unexplored germplasm collections. However, SNP S06_46717975 was present in the SAP/CAP collection and the SNP data set for Haiti, Niger, and Nigeria germplasm (Table S6). This SNP is found within the ZEP gene and was previously identified by our group as associated with zeaxanthin variation [15]. S06_46717975 was found to be bi-allelic, with A/G variants present among the germplasm. The minor allele ‘A’ is moderately common globally, with a frequency of 10% in the SAP/CAP collection. However, there are striking differences in allele frequency among countries: for instance, 24% in Nigeria versus 2% in Niger and 0% in Haiti germplasm.
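Per-country allele frequencies of the kind reported above can be computed with a short tally. The sketch below assumes genotypes stored as two-letter strings with None for missing calls; the group labels, genotypes, and function name are hypothetical and do not reproduce the study's data.

```python
from collections import defaultdict


def allele_frequency_by_group(records, allele="A"):
    """Frequency of one allele within each group.

    records: iterable of (group, genotype) pairs, where genotype is a
    two-letter string such as 'AA', 'AG' or 'GG', or None if missing.
    """
    counts = defaultdict(lambda: [0, 0])  # [target allele count, total alleles]
    for group, genotype in records:
        if genotype is None:
            continue  # skip missing calls instead of counting them as 'G'
        counts[group][0] += genotype.count(allele)
        counts[group][1] += len(genotype)
    return {group: hits / total for group, (hits, total) in counts.items()}


# Hypothetical genotype calls at a single SNP for two groups.
records = [
    ("pop1", "AA"), ("pop1", "GG"), ("pop1", "GG"), ("pop1", "AG"),
    ("pop2", "GG"), ("pop2", "GG"), ("pop2", None),
]

freqs = allele_frequency_by_group(records, allele="A")
```

Excluding missing calls from the denominator matters when marker density differs between collections, as it does here.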
We next explored whether there were any patterns among allelic variant, geographic distribution, and carotenoid content (Fig. 6). In the SAP/CAP collection, allelic type was associated with country of origin: the United States, Lebanon, and Nigeria had the highest prevalence of the ‘A’ allele (Fig. 6A). Among the high carotenoid accessions in the SAP/CAP collection (defined as the top 5% for β-carotene concentration), the ‘A’ allele was present in 85% of them (Fig. 6B). We then analyzed the alleles in the unexplored germplasm collections and the SAP/CAP accessions that were not phenotyped. Similar patterns were observed for the geographic distribution of the ‘A’ allele, with the highest prevalence in the United States and Nigeria (Fig. 6C). Surprisingly, the difference in the distribution of the ‘A’ and ‘G’ alleles was not nearly as pronounced in the predicted high carotenoid lines based on β-carotene GEBV (Fig. 6D).