Sequence mapping
Through sequencing of six distantly related peach accessions, we generated a total of 107.35 Gb base pairs of sequences. Then, the data were filtered by the following 2 steps: first, the adapter contaminants in the reads were deleted, and then, the reads that contained more than 50% low-quality bases (quality value <= 12) were removed. A total of 105.85 G high quality clean reads were obtained for the following analysis. The reads were then aligned to the peach reference genome (224.61 Mb) using BWA software. The mapping rate in different accessions varied from 93.16% to 96.69%, and the final effective sequencing depth varied from 63.81 to 89.38 × (Table 1).
Variation detection and annotation
Pair-end reads were mapped against the peach reference genome [13] (release version 1.0), and a final set of 1,166,551 SNPs, 44,245 SVs, 12,302 CNVs, and 141,895 SSRs were identified, resulting in an average of 5351.1 SNPs per Mb, 202.71 SVs per Mb, 56.4 CNVs per Mb, and 634.43 SSRs per Mb, respectively. Based on the consensus sequence, the polymorphic loci between the identified genotype and the reference were filtered. The polymorphic DNA markers were classified into five groups: SNPs, InDels, SVs, CNVs and SSRs.
Next, we analysed the number of genetic variation across each chromosome. We found that the number of polymorphic DNA markers varied across each chromosome (Fig. 1). Most of them were observed in Chr. 1, 2, and 4. For example, the number of SNPs in ‘Jin Mi Xia Ye Tao’ (179,310) and ‘Sa Hua Hong Pan Tao’ (198,140) of Chr. 4 were 6.43- and 2.76- fold respectively higher than the number of SNPs in ‘Jin Mi Xia Ye Tao’ of Chr. 7 (27,867) and in ‘Sa Hua Hong Pan Tao’ of Chr. 5 (71,790). The uneven marker distribution of each chromosome can be attributed to the variations in chromosome size in the peach genome. Chr. 4 was found to be 30.19 Mb in size, which was 1.33-fold the size of Chr. 7 (22.72 Mb) and was 1.64-fold that of Chr. 5 (18.45 Mb). Finally, all the polymorphic DNA markers were compared at population level to detect SNPs, InDels, SVs, and CNVs. We showed the distribution of these variatons at population level in reference genome by figures (Fig. 2).
The detailed genome-wide characterization of SNPs
The SNP annotation showed that approximately 4.49% to 7.28% of the total SNPs were located in the coding DNA sequence (CDS) of these six genomes (Table 2). These variations were minimal but had a substantial impact on the variation in genomes and biological traits. Therefore, it is helpful to examine the detailed SNP annotations. In this study, 27,623 synonymous and 40,245 nonsynonymous SNPs were annotated in ‘Sa Hua Hong Pan Tao’, and only 17,902 synonymous and 26,076 nonsynonymous SNPs were annotated in ‘Wu Yue Xian Bian Gan’ (S1a Fig). These nonsynonymous SNPs have been suggested as good candidate mutations to explain the different phenotypes among different samples. The ratio of nonsynonymous to synonymous substitutions was 1.46~1.52, which is higher than that of Arabidopsis thaliana (0.83) [14] and rice (1.29) [15] and similar to our previous report in peach (1.63) [16]. In addition, we detected 114,227~178,968 InDels, 8,386~12,298 SVs, and 2,111~2,581 CNVs among the six peach genomes (Table 3). Similar to the annotation of SNPs, only minimal distributions of InDels, SVs, and CNVs were located in the CDS.
We further analyzed the annotation of the so-called large-effect SNPs (S1b Fig), which are predicted to have a potentially disabling effect on gene function. We identified a total of 1,943 SNPs that were expected to induce premature stop codons (designated as stop codon gain), 789 to disrupt splicing donor or acceptor sites, 170 to alter initiation methionine residues (start codon loss), and 217 SNPs to remove the annotated stop codons (stop codon loss), resulting in longer open reading frames. Based on the GO term annotation, 89.66% (203 genes) of these genes containing large-effect SNPs were assigned to one or more functional annotations. There were 144 GO terms associated with biological process, 68 with cellular component, and 157 with molecular function. Compared with the total annotated genes in the peach genome, the genes that grouped into localization and metabolic processes were enriched (S2 Fig).
We further investigated the tissue-specific expression of the genes contained large-effect SNPs and in root, fruit, phloem, leaf and seed in a representative cultivar, ‘Chinese Cling’ (S3 Fig). We found that gene expression pattern were different among various tissues and could be classified then into four groups. Among 158 differential expression genes, a total of 9, 13, 22, 17, and 8 genes that showed higher expression in leaves, fruits, seeds, root, and phloem than the other tissues, respectively.
The detailed genome-wide characterization of SSRs
Totally, 141,895 SSR loci were firstly identified in the six peach accessions, and 8.93% (12,672) of them were found to be detected in all six accessions (Fig. 3).
Among all SSRs, mono-nucleotide and di-nucleotide repeats were abundant, accounting for 47.36 % and 39.87%, respectively (Table 4). Tri-, tetra-, penta-, and hexa-nucleotide repeats only accounted for 7.86, 1.48, 2.17, and 1.26%, respectively, for all SSRs. And the similar distribution of different repeats type was repoted in peach [17, 18] and mei [12].
Of the di-nucleotide repeats, AT/TA repeats were the most abundant, accounting for 12.79%. And AAT/ATA/TAA/TTA/TAT/ATT was also abundant of all tri-nucleotide repeats, accounting for 2.21% of all SSRs. Among di-nucleotide repeats, the second largest group was CT/TC, slight higher than AG/GA repeats. Meanwhile, we found that TTC/TCT/CTT/CCT/CTC/TCC repeats accounted for 1.27% of all SSRs, also slight lower than AAG/AGA/GAA/GGA/GAG/AGG (1.30%) of all tri-nucleotide repeats. CG/GC and CGC/GCG/CGG/CCG/CGC/GCC were the least repeats of di-nucleotide and tri-nucleotide repeats, respectively. The result indicated A/T nucleotide exhibited a strong bias among SSRs, similar with the study reported before [8, 19].
Of all SSRs, 25,037 (17.64%) were located in the 7773 genes. Among them, the number of SSRs in CDS was 8,412 (33.60%), and there were 800 in 5’untranslated region (UTR), 630 in 3’UTR (S2 Table). We also analysed the SSR motifs in different regions. Of all SSRs located in UTR regions, di-nucleotide repeats were the most abundant, accounting for 46.36 % of all motifs (S2 Table). AG/GA (290) and CT/TC (266) were the most and the next was AT/TA (54), AC/CA (30), and GT/TG (23). No CG/GC repeats were found in these regions. However, among SSRs located in CDS regions (S2 Table), the di-nucleotide repeats account for 28.27%, was lower than that of mono-nucleotide repeats (51.70%). AG/GA and CT/TC dimers prevailed in CDS sequences, reached 776 and 720, respectively.
The SSR-containing genes should be further investigated to study the variation of agronomic character in peaches. Therefore, we first counted the SSR number in each gene and found that 4247 of 7773 genes contained only one SSR and 331 genes contained more than ten loci, and this may be related to the size of each genic region. A sharply decreasing trend of gene number was observed as the contained SSR number increased (Fig. 4a). The detailed GO annotation of above 331genes identified in this study is shown in Fig. 4b. The expression of all genes which containing more than 10 SSRs was further quantified using FPKM values, and 235 (71.0 %) genes had an FPKM value > 1 in at least one tissue. The heat map results showed that most genes presented tissue-specific expression patterns (Fig. 4c). In general, the fruit has the greater number of SSRs than the leaf, flower, and root tissues. And in phloem and seed, the tissue-specific genes were fewest. The result indicated that the phenotype variations in fruit are mostly result from SSR polymorphism than the other tissues.
Association study of fruit maturity date using SNPs
Based on previous studies in peach [20], three quantitative trait loci for maturity date were detected and all located on Chr. 4 between two SNPs, SNP_IGA_405773 (Chr. 4: 9,658,797 bp) and SNP_IGA_437516 (Chr. 4:17,094,116 bp). Filtering with missing rate, MAF and sequencing depth parameters, we totally obtained 25,299 SNPs from 9.6 Mb to 17.1 Mb on Chr. 4. Furthermore, considering that linkage disequilibrium decay was about 20-50 kb for the different subgroups of cultivated peach [16], a total of 375 loci were thought enough for obtaining reliable association result. To increase the accurary of result, 944 SNPs (about 7-kb intervel) were filtered for association studies of fruit maturity date in the following analysis.
We identified a clear year-stable signal in 2011 and 2012 associated with fruit maturity date on Chr. 4, although the association signal in 2012 was lower than the adjusted p value (-log10P >4.97, Fig. 5). The leading SNP (Chr.4: 10,184,313 bp, -log10P = 5.32) of this association was found in the exon of Prupe.4G171900 which encoded a glutamate receptor protein. And the location has a close distance (< 1 Mb) with the candidate gene Prupe.4G186800 of fruit maturity by another study [21], proving the association result is reliable.
Genetic diversity assessment using SSRs
To detect SSRs with high polymorphism, 2,034 SSRs of all 141,895 loci showing polymorphism among the 6 resequenced peach accessions were retained. Next, 1,334 (65.6%) mononucleotide repeats were excluded because the loci may have resulted in weak PCR products. Finally, the loci that were detected in more than five accessions were identified for subsequent analysis.
The remaining 194 SSRs were used to design primers to test their polymorphism in 21 peach accessions. However, the primer design of 7 SSRs failed due to the high sequence similarity of the flanking sequence of these SSRs with multiple regions in the genome. In addition, a total of 23 primer pairs generated unfavorable amplification products. Thus, 164 SSR primer pairs (S3 Table) with clear amplification banding patterns were retained, and their genotype is shown in S4 Table.
To assess the application of these SSRs, 15 polymorphic markers with more than 7 alleles when amplified from 21 peaches were randomly selected to analyze the genetic relationship among 221 accessions (S5 Table). These SSR markers detected a total of 210 alleles falling within a range of 8-26 alleles per locus (S6 Table). Among all alleles, 36 were specific to 28 peach accessions. Shannon’s information index (I) ranged from 0.0161 to 0.6896, with 101 alleles producing I values > 0.1. The I values of seven SSR markers (SSR073, 093, 120, 169, 179, 183, and 184) were higher than 0.2, indicating high efficiency.
We constructed a neighbor-joining phylogenetic tree based on the genetic distances calculated from the genotypes at all 15 SSR positions of the 221 peach accession, and three clusters were delineated (Fig. 6). Group 1 contained 65 accessions, among which 75% (57 of 56) were landraces, mostly from northwest China (21 of 22), northern China (13 of 21), YunGui Plateau(10 of 10) and southern China (4 0f 5). Group 2 contained 54 accessions, including mostly improved varieties derived from America and Europe (37 of 44), 13 improved varieties derived from China, 2 improved varieties derived from Japan and Korea and 2 landraces. The improved varieties from China (59 of 78) and Japan and Korea (19 of 23) were mainly clustered into group 3, probably because they belong to Asian peach genotypes. We found that some landraces originating from the middle and low reaches of the Changjiang River (8) and northern China (7) were closely related with the above improved varieties.
To compare the levels of polymorphism between the SSR markers developed in this study and those previously reported, we selected 15 previously reported SSRs with high polymorphism from 36 accessions. Comparing the two studies using the same accessions (S7 Table), these two groups of SSR markers were polymorphic and produced a total of 171 and 153 alleles. The average number of alleles per locus for the 15 SSR markers developed in this study was 11.4, ranging from 6 (SSR73) to 22 (SSR152), which was slightly higher than that of the previously developed SSR markers (10.2). Importantly, there was no obvious difference between the two groups in molecular diversity, such as the I and GD values (S4 Fig).