Development of SNP markers
A total of ~750 million total unfiltered reads were generated from 309 inbred lines arranged in four GBS plates run twice independently on an Ion Proton Next-Generation Sequencer (ThermoFisher Scientific, Waltham, MA, US). All the raw sequence reads for all the accessions have been submitted to the National Center for Biotechnology Information (NCBI) sequence read archive and deposited under the accession ID “BioProject ID”: PRJNA598172. Mapping the GBS reads to the pearl millet reference genome sequence initially detected 150, 977 unfiltered and non-imputed SNPs for the panel. Among these, 123,995 SNPs were distributed on chromosomes 1 to 7. A total of 26,982 SNPs were mostly from the unanchored genome sequences or from the part of the genome not covered in the reference genome sequence. The raw number of SNPs mapped on each chromosome ranged from 13,432 (chr 7) to 21,665 (chr 1).
A high-density of high quality SNPs across the seven pearl millet chromosomes with most of the SNPs distributed in the telomeric regions than the pericentromeric regions of the chromosomes (Figure 1) were visualized. Markers density varied across the genome, ranging from 0 to 192 SNPs per Mb with an average of 29 Kb per SNP. Chromosome-wise markers density varied from 9,558 (chr. 1) to 6,120 (chr. 7). In chr. 5, fewer markers were mapped in one arm than the other.
Analysis of SNP types showed that transition mutations (33,817, 62%) were much higher than transversion mutations (20,953, 38%) and the transition/transversion ratio was 1.63 (Table 1). In overall, C/T transitions occurred at the highest frequency, while A/T mutation occurred at the lowest frequency among all types of mutations detected. The frequencies were similar between A/G and C/T transitions and among the four transversions, except for A/T. Minor allele frequency was ranging from 0.05 to 0.50 with a mean of 0.20 (Figure 2).
Genetic diversity
A total of 54,770 SNPs on the seven chromosomes were used to evaluate genetic diversity in 309 genotypes (Additional file S2). The kinship coefficients between pairs of the 309 inbred lines ranged from 0.00 to 1.21 with a mean value of 0.02 (Figure 3A). Nearly 91% of the pairwise relative kinship values were close to zero (< 0.05) and the remaining were between 0.05 and 1.21. The highest relative kinship value was observed between ICML197354 (IP-9407) and ICML197438 (IP-17690) derived from landraces collected in Ghana and Togo, respectively (Additional file S3).
The genetic distances (GD) of pairwise comparisons of the inbred lines varied from 0.09 to 0.33 with the average distance of 0.28 (Figure 3B). The majority of the genetic distances fell between 0.25 and 0.30. The lowest genetic distance (0.09) was observed between ICML197458 (IP-21020) and ICML197279 (IP-4020) developed from landraces collected in Nigeria and India, respectively (Additional file S4). The highest genetic distance (0.33) was observed between inbred ICML197314 (IP-6882) and ICML197390 (IP-11677) derived from landraces
Population structure and principal component analysis
Analysis of the pearl millet inbred lines population using 30,893 LD pruned SNPs and the genetic distance matrix of the lines in ADMIXTURE (Alexander et al., 2009) identified five subpopulations (Figure 4A). The error rate from a cross-validation method used by Admixture to determine the appropriate number of sub-populations rapidly declined from K=1 to K=5, indicating that the 309 lines fell into five distinct groups with different levels of admixtures (Figure 4B). Each line was assigned to a group when the proportion of the membership probability was higher than 0.6. Thus, subgroups 1 and 2 consisted of 33 and 134 lines, whereas subgroups 3, 4 and 5 contained 25, 14 and 10 inbred lines, respectively. The remaining 93 lines were classified as admixtures. No cluster made exclusively of inbred lines from the same country was found. In general, most of the lines closely related in pedigree or derived from landraces/improved varieties with high Fe and Zn content were grouped together, except for subgroup 2 which was formed mainly of inbred lines from the accessions collected in WCA.
To confirm the ADMIXTURE results, principal component analysis (PCA) using the snpgdsPCA function of the R package SNPRelate (Zheng et al., 2012) and genetic relatedness among accessions was visualized using a neighbor joining (NJ) tree which revealed a clear separation between groups. Both the PCA and the NJ tree confirmed the five subgroups found in the population structure analysis (Figure 6A and 6B). PCA based on the results of ADMIXTURE revealed that the total genetic variation explained by the first ten PCs was 11.3%. The first and second PCs explained a very low proportion of genetic variation, 1.82 and 1.74%, respectively. The results showed a clear subgroup separation in the panel and agree with the conclusion of five subgroups from ADMIXTURE. Group 3 showed more dispersion than the other groups. Based on genetic distance, the neighbor joining phylogenetic tree also displayed similar subgroups as shown by the ADMIXTURE and PCA analyses.
Among the five subgroups, subgroup 1 contained accessions such as ICTP 8203, GB8735, MC94, ICMV96490 and ICMR312, mainly known for their high levels of Fe and Zn content. Subgroup 2 was composed of landraces/improved cultivars mainly from WCA countries. Subgroup 3 was formed with inbred lines derived from landraces/improved cultivars collected from various sources, and showed more dispersion than the other subgroups. Results showed that the subgroup 4 contained inbred lines derived from the source populations IP-3110, IP-6745, IP-21142, PBPMPOP-1 and PBPMPOP-2, whereas group 5 comprised about 50% of inbred lines extracted from PBPMPOP-3.
Allelic diversity and purity
In accordance with the neutral allele distribution expected for inbred populations, the nucleotide diversity (π) and Tajima’s D statistics among the five subpopulations were presented in Table 2. Degrees of SNP heterozygosity varied from 0 to 100% with an average of 15% (Figure 5A). Most of the SNPs had a heterozygosity range of 0-25%. Similarly, genetic purity of the lines in the population varied from 70 to 93%, with a mean of 85% (Figure 5B). Out of 309 inbred lines, 105 (34%) showed less than 10% heterozygosity, 145 (47%) had heterozygosity ranging from 11 to 20%, while 59 (about 20%) had heterozygosity from 21 to 30%.
Genetic differentiation
Genetic differentiation of subgroups was calculated using Fst-based analysis of the SNP data (Table 3). The Fst coefficients showed that subgroup 2 could be considered as a subset of the whole panel used in this study with a very low Fst value as compared to the whole population (0.002). The Fst coefficients among the groups varied from 0.044 to 0.110, indicating moderate differentiation. The highest level of differentiation was observed between subgroups 4 and 5. In contrast, the lowest Fst coefficient was observed between subgroups 2 and 3.
The mean nucleotide diversity observed in subgroup 2 (9.77E-06) was similar to the whole panel (9.92E-06) (Table 3). The nucleotide diversity among accessions within the subgroups ranged from 7.32E-06 to 9.77E-06. Accessions from Group 5 showed lowest diversity in the population. The mean Tajima’s D values were positive for all the subgroups, which is an indication for balancing selection.
Linkage disequilibrium
LD among the SNPs was investigated between pairs of SNP markers from the seven chromosomes and then between pairs of SNP markers from the same chromosome. The average pairwise LD (r2) across the genome declined rapidly with increasing physical distance and most of the r2 values were below 0.05 (Figure 7). On average, LD declined from its initial value of 0.46 to 0.1 within approximately 3.5 kb. LD decays rapidly in chromosomes 1 and 4 (~1.5 kb) as compared to 3 and 7 (~7.2 kb), suggesting that a larger number of markers are required from chromosomes 1 and 4 than from chromosomes 3 and 7 for genome-wide association studies in pearl millet.