Nucleotide diversity of core C4 gene families in sorghum
Based on 9 genes corresponding to 8 core C4 enzymes in sorghum, 18 homologous genes were identified across the sorghum genome. In total, 5 CA genes, 2 NADP-MDH genes, 5 NADP-ME genes, 6 PEPC genes, 3 PPCK genes, 2 PPDK genes, 3 PPDK-RP genes and 1 rbcS gene were identified (Table 1). Nucleotide diversity (θπ) of these 27 genes was investigated using sequence data from 48 genotypes covering wild and weedy, and cultivated sorghum [25]. A total of 4,183 single nucleotide polymorphisms (SNPs) were identified in these 27 genes, with 521 SNPs located in coding sequence (CDS) regions (Table 1). These C4 gene families displayed an average overall nucleotide diversity of θπ = 2.09×10⁻³, which is comparable to that of 130 housekeeping genes (θπ = 1.97×10⁻³) [25]. Nucleotide diversity varied dramatically among the C4 gene families, with the NADP-MDH genes displaying the lowest levels of diversity across all genotypes (average θπ = 0.25×10⁻³), followed by NADP-ME genes (θπ = 0.93×10⁻³), PPCK genes (θπ = 1.20×10⁻³), PEPC genes (θπ = 2.11×10⁻³), CA genes (θπ = 2.26×10⁻³) and PPDK-RP genes (θπ = 2.96×10⁻³), while PPDK genes showed the highest level of diversity (θπ = 5.21×10⁻³) (Table 2, Figure 2A). The only gene encoding rbcS, Sobic.005G042000, had high genetic diversity, with θπ = 4.32×10⁻³ across all 48 genotypes, 5.72×10⁻³ in the wild and weedy group and 3.03×10⁻³ in the cultivated group.
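For readers unfamiliar with the statistic, θπ can be illustrated as the mean number of pairwise differences per site among aligned sequences. The sketch below is a minimal illustration of that standard definition, not the pipeline used for the analysis; the function name `nucleotide_diversity` and the toy sequences are hypothetical, and it assumes gap-free, equal-length haploid sequences with no missing data.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Mean pairwise differences per site among aligned sequences.

    Assumes equal-length, gap-free sequences (an illustrative simplification).
    """
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(seqs, 2))
    # Count mismatching sites for every pair of sequences.
    total_diffs = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    # Average over pairs, then scale by alignment length.
    return total_diffs / len(pairs) / length


# Hypothetical toy alignment: pairwise differences are 1, 2 and 1 over 4 sites.
toy_alignment = ["AAAA", "AAAT", "AATT"]
toy_pi = nucleotide_diversity(toy_alignment)
```

Computing the statistic separately for the wild and weedy accessions and for the cultivated accessions gives the per-group values compared in this section.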
Mixed trends were found when comparing C4 genes with their non-C4 isoforms within each gene family, although the average overall genetic diversity of C4 genes was comparable to that of their non-C4 counterparts (Table 2). The C4 PPDK-RP gene (Sobic.007G166300) and C4 NADP-MDH gene (Sobic.002G324400) had an overall θπ that was 161.76% and 79.85% higher than their non-C4 isoforms, respectively, whereas the θπ of the C4 PPDK gene (Sobic.009G132900) was 75.16% lower than that of the non-C4 PPDK isoform. Nucleotide diversity of C4 genes in the other gene families was within the range of variation of their non-C4 isoforms.
Genetic diversity across C4 gene families was reduced during sorghum domestication. Averaged across all C4 gene families, genetic diversity was reduced by 22.44% in the cultivated group compared with the wild and weedy group; when only the 9 core C4 genes were considered, the reduction was 22.98%. However, the reduction of genetic diversity during domestication in C4 genes was not significantly different from that in housekeeping genes (Table S2). Among the 27 genes, Sobic.003G292400, a non-C4 NADP-ME isoform, exhibited the most severe reduction in genetic diversity (98.23%). The C4 NADP-ME gene (Sobic.003G036200) showed the greatest loss of genetic diversity (51.89%) among the C4 genes, with an Fst between the cultivated and the wild and weedy groups of 0.06 (Figure 2B). In contrast, another non-C4 isoform of NADP-ME (Sobic.009G069600), a non-C4 isoform of PPCK (Sobic.006G148300) and a non-C4 CA isoform (Sobic.003G234600) showed a more than 2-fold increase in genetic diversity in the cultivated group.
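The percentage reductions quoted above can be illustrated with simple arithmetic. The sketch below assumes that the reduction is expressed relative to the wild and weedy group, which is an assumption about the calculation rather than a documented formula; the function name `percent_reduction` is hypothetical. The example uses the rbcS θπ values given earlier (5.72×10⁻³ and 3.03×10⁻³); the resulting percentage is an illustration and is not a value reported in this section.

```python
def percent_reduction(pi_wild, pi_cultivated):
    """Percent loss of diversity in the cultivated group relative to wild.

    A negative result indicates an increase in diversity after domestication.
    """
    if pi_wild <= 0:
        raise ValueError("wild diversity must be positive")
    return (pi_wild - pi_cultivated) / pi_wild * 100.0


# rbcS (Sobic.005G042000): wild and weedy vs cultivated θπ from the text.
rbcs_reduction = percent_reduction(5.72e-3, 3.03e-3)

# A hypothetical gene whose diversity doubles in the cultivated group
# yields a reduction of -100%, i.e. a 2-fold increase.
doubling = percent_reduction(1.0e-3, 2.0e-3)
```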
Identification of selection signals during domestication across the 27 genes
The selection signature of these C4 gene families was first investigated at the gene level. Based on thresholds of genome-wide rankings described in Mace et al. [25], only one gene (Sobic.001G326900, a non-C4 PPDK isoform) was identified as being under balancing selection during sorghum domestication, while no gene was identified as being under purifying selection (Table 1). Subsequently, a higher-resolution detection of selection signatures was conducted at the SNP level using the CDS of the 27 genes. Among the 521 SNPs across the 27 CDS, 176 were non-synonymous. The number of non-synonymous SNPs within genes varied from 19 in the non-C4 PPDK-RP isoform (Sobic.002G324700) to 0 in the C4 PPDK (Sobic.009G132900). The C4 PEPC gene (Sobic.010G160700) had the highest number of non-synonymous SNPs (9) among the 9 C4 genes (Table 1).
Based on the SNP-level analysis, 24 SNPs across 8 genes were identified as being under purifying selection, including 7 non-synonymous SNPs in 6 genes (Table S3). Genes with SNPs under purifying selection included two C4 isoforms, PPDK (Sobic.009G132900) and CA (Sobic.003G234200), three of the 4 non-C4 NADP-ME genes (Sobic.003G280900, Sobic.003G292400, Sobic.009G069600), both non-C4 PPDK-RP genes (Sobic.002G324500, Sobic.002G324700) and a non-C4 PEPC gene (Sobic.007G106500). Of the 2 C4 genes with SNPs under selection, Sobic.009G132900 had 3 synonymous SNPs under purifying selection, while Sobic.003G234200 had one non-synonymous SNP under purifying selection.
A total of 60 SNPs across 8 genes were identified as being under balancing selection, 7 of which were non-synonymous SNPs distributed across 2 genes (Table S4). The non-C4 PPDK (Sobic.001G326900) had 24 SNPs under balancing selection, including 5 non-synonymous SNPs, and additionally had an overall gene-level signature of balancing selection based on the previous analysis. Two C4 isoforms, PPDK-RP (Sobic.002G324400) and PEPC (Sobic.010G160700), were identified with 3 and 2 SNPs under balancing selection, respectively, although none of these SNPs were non-synonymous. Two non-C4 PEPC genes (Sobic.003G100600, Sobic.004G106900) were identified with SNPs under balancing selection, with Sobic.003G100600 having 21 SNPs, including 2 non-synonymous SNPs, exhibiting signatures of balancing selection. The other 2 genes with SNPs under balancing selection were a non-C4 CA isoform, Sobic.002G230100, and a non-C4 PPCK isoform, Sobic.004G219900.
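The distinction between synonymous and non-synonymous CDS SNPs used above can be sketched as follows. This is a minimal illustration under the standard genetic code, not the annotation tool used in the study; the function names and the 6-base toy CDS are hypothetical, and positions are assumed to be 0-based offsets into an in-frame coding sequence.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def is_nonsynonymous(cds, pos, alt):
    """Return True if substituting `alt` at 0-based `pos` changes the amino acid."""
    start = pos - pos % 3  # first base of the codon containing the SNP
    ref_codon = cds[start:start + 3].upper()
    if len(ref_codon) != 3:
        raise ValueError("SNP falls in an incomplete codon")
    offset = pos - start
    alt_codon = ref_codon[:offset] + alt.upper() + ref_codon[offset + 1:]
    return CODON_TABLE[ref_codon] != CODON_TABLE[alt_codon]


# Hypothetical toy CDS: ATG (Met) followed by GCT (Ala).
toy_cds = "ATGGCT"
third_position_change = is_nonsynonymous(toy_cds, 5, "C")  # GCT -> GCC, Ala -> Ala
first_position_change = is_nonsynonymous(toy_cds, 3, "C")  # GCT -> CCT, Ala -> Pro
```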
Allelic variation of core C4 genes under selection in sorghum
A phylogenetic tree was constructed using the CDS of the 27 genes to investigate the genetic relationships between the 48 accessions (Figure S1). The inter- and intra-species distribution of private haplotypes of each gene is detailed in Table S5, with the majority (~90%) of the genes having private inter-species haplotypes from S. propinquum; for example, 4 unique haplotypes were observed for the C4 isoform of PEPC, with the 2 S. propinquum accessions sharing a single private haplotype. To further investigate allelic variation of the 4 core C4 genes with SNPs under selection in sorghum, haplotype networks were constructed using CDS SNPs. Based on 16 SNPs within the CDS of the PPDK gene (Sobic.009G132900), 8 haplotypes were identified. Five haplotypes were identified in the wild and weedy group, with 3 being private haplotypes and two being maintained in cultivated sorghum; two new haplotypes arose in cultivated sorghum after domestication (Figure 3A). Ten haplotypes of the C4 CA gene (Sobic.003G234200) were revealed using 33 SNPs, with 4 distinct haplotypes being characterised by the wild and weedy genotypes, two of which were private to this group. The remaining two haplotypes were maintained in cultivated sorghum during domestication, with three new haplotypes arising after domestication (Figure 3B). The loss of wild and weedy haplotypes in cultivated sorghum in these two genes was consistent with the finding that they were under purifying selection.
The PPDK-RP gene (Sobic.002G324400) had 22 SNPs in the CDS, based on which 5 haplotypes were identified. Two haplotypes were characterised by the wild and weedy genotypes, with the main wild haplotype maintained and further diversifying into two new haplotypes in the cultivated group (Figure 3C). Based on 28 SNPs in the CDS of the C4 PEPC gene (Sobic.010G160700), 4 haplotypes were identified. Wild and weedy genotypes encompassed 3 haplotypes and all of them were maintained in cultivated sorghum (Figure 3D). S. propinquum had unique haplotypes across all 4 genes, while the Sorghum bicolor race guinea margaritiferum shared haplotypes with the wild and weedy genotypes in most cases, indicating a closer relationship with the wild and weedy group.
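The notion of shared and private haplotypes used in this section can be illustrated with a short sketch. It assumes each accession is represented by a single string of its alleles at the CDS SNP positions and is assigned to one group; the function name `summarise_haplotypes`, the accession labels and the alleles are hypothetical, and this is not the haplotype network method used for Figure 3.

```python
from collections import defaultdict


def summarise_haplotypes(samples):
    """Collapse identical SNP strings into haplotypes and find private ones.

    samples: dict mapping accession -> (group, string of alleles at CDS SNPs).
    Returns (number of distinct haplotypes, {group: set of private haplotypes}).
    """
    groups_by_haplotype = defaultdict(set)
    for _accession, (group, haplotype) in samples.items():
        groups_by_haplotype[haplotype].add(group)

    private = defaultdict(set)
    for haplotype, groups in groups_by_haplotype.items():
        # A haplotype is private when it occurs in exactly one group.
        if len(groups) == 1:
            private[next(iter(groups))].add(haplotype)
    return len(groups_by_haplotype), dict(private)


# Hypothetical toy data: one wild-only, one shared, one cultivated-only haplotype.
toy_samples = {
    "wild_1": ("wild_weedy", "ACG"),
    "wild_2": ("wild_weedy", "ACT"),
    "cult_1": ("cultivated", "ACT"),
    "cult_2": ("cultivated", "GCT"),
}
n_haplotypes, private_by_group = summarise_haplotypes(toy_samples)
```

In this toy case the shared haplotype corresponds to one maintained through domestication, the wild-only haplotype to one lost in cultivated material, and the cultivated-only haplotype to one that arose after domestication.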