Genome sequencing data statistics and genome estimation
An Illumina paired-end library yielded a total of 108.97 Gb of raw data for T. rupestris in this study (Table 1). After filtering and correction, 107.83 Gb of clean bases remained. The base quality value, which is assigned to every sequenced base, is an important indicator in second-generation sequencing: the higher the value, the more accurate the base call. In this experiment, 95.98% of bases reached Q20 and 90.39% reached Q30, which indicates that the sequencing was of high quality. Twenty thousand clean reads were randomly selected and compared against the NCBI NT (Nucleotide Sequence Database) database using BLAST. The top five matched species were Geum rupestre (15.14%), Rosa chinensis (1.51%), Fragaria vesca (0.74%), Rosa praelucens (0.64%) and Prunus pedunculata (0.44%). The top hit was far more frequent than the others because Geum rupestre, like T. rupestris, belongs to the Rosaceae. The results showed no contamination from other species.
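For reference, Q20 and Q30 are conventionally defined on the Phred scale, Q = -10 log10(P), where P is the probability of a wrong base call, so Q20 and Q30 correspond to error probabilities of 1% and 0.1%. The following is a minimal sketch of how such percentages can be computed, not the pipeline used in this study. It assumes FASTQ quality strings in Phred+33 encoding; the function names and the toy input are illustrative.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q into a base-call error probability."""
    return 10 ** (-q / 10)


def fraction_at_least(quality_strings, threshold, offset=33):
    """Fraction of bases whose Phred score is >= threshold.

    quality_strings: iterable of FASTQ quality lines (assumed Phred+33).
    """
    total = 0
    passed = 0
    for line in quality_strings:
        for ch in line:
            total += 1
            if ord(ch) - offset >= threshold:
                passed += 1
    return passed / total if total else 0.0


# Toy quality strings for illustration only, not data from this study.
toy_qualities = ["IIII5+", "??I#II"]
q20_fraction = fraction_at_least(toy_qualities, 20)
q30_fraction = fraction_at_least(toy_qualities, 30)
```

With the toy input, 10 of 12 bases reach Q20 and 9 of 12 reach Q30.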
All clean data were subjected to k-mer analysis with k = 17 (Fig. 2). The x-axis and y-axis in Fig. 2 represent the k-mer depth and the number of k-mers observed at that depth, respectively. The expected genome size was then obtained with the following formula: Genome size = K-mer num/K-mer depth [15], where K-mer num is the total number of k-mers and K-mer depth is the depth at the main peak. The estimated genome size was 976.97 Mb. The heterozygosity rate of T. rupestris was 0.726% and the repeat rate was 56.93%, so the genome can be classified as having micro-heterozygosity and high repetition. These results suggest that assembling the T. rupestris genome will be difficult.
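The formula above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the software used in this study: the k-mer histogram is assumed to be a dictionary mapping depth to the number of distinct k-mers at that depth, and `min_depth` is a hypothetical cutoff for discarding low-depth k-mers that mostly arise from sequencing errors.

```python
def estimate_genome_size(histogram, min_depth=3):
    """Estimate genome size as (total k-mers) / (main peak depth).

    histogram: dict mapping k-mer depth -> number of distinct k-mers
               observed at that depth.
    min_depth: depths below this are treated as sequencing-error k-mers
               and ignored (an illustrative cutoff).
    Returns (estimated_size, peak_depth).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mer depths at or above min_depth")
    # Main peak: the depth at which the most distinct k-mers are observed.
    peak_depth = max(usable, key=usable.get)
    # Total number of k-mer occurrences after removing error k-mers.
    kmer_num = sum(d * n for d, n in usable.items())
    return kmer_num / peak_depth, peak_depth


# Toy histogram for illustration only, not data from this study.
toy_histogram = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 600, 21: 300, 22: 100}
toy_size, toy_peak = estimate_genome_size(toy_histogram)
```

For the toy histogram the peak lies at depth 20 and the 28000 retained k-mer occurrences give an estimate of 1400 bases.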
Genome GC analysis and genome assembly
The assembled sequences of the T. rupestris genome were then subjected to GC content analysis (Fig. 3). Most density points fell in regions with a GC content of 20–50%, with an average of 33.75%. This value is similar to that of Apocynum venetum (32.91%) [16] and Macaranga indica (33.83%) [17], but lower than that of Cydonia oblonga (38.66%) [18], Rosa multiflora (38.9%) [19], Fragaria nilgerrensis (39.22%) [20] and Rosa rugosa (39.30%) [21]. Overall, the genome of T. rupestris has a moderate GC content.
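GC content as used above is the share of G and C among the called bases of a sequence. A minimal sketch follows, assuming plain nucleotide strings as input; the window size and function names are illustrative and do not reflect the settings used for Fig. 3.

```python
def gc_content(seq):
    """Return the GC percentage of a sequence, ignoring N and other
    ambiguous bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    return 100.0 * gc / called if called else 0.0


def windowed_gc(seq, window=1000):
    """GC percentage in consecutive non-overlapping windows."""
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]
```

Computing the value per window rather than per genome is what allows a distribution such as the 20–50% range to be plotted.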
As shown in Table 1, the assembly produced 100973 contigs with a total length of 91276077 bp. The largest contig was 26073 bp long, the contig N50 was 2607 bp and the contig N90 was 1189 bp.
Table 1
Sequencing and assembly statistics for T. rupestris
Features | Number or ratio |
Total number of bases (Gb) | 108.97 |
Clean bases (Gb) | 107.83 |
Clean base proportion (%) | 98.95 |
Q20 | 95.98% |
Q30 | 90.39% |
GC content | 33.75% |
Heterozygosity rate | 0.726% |
Repeat rate | 56.93% |
Total number of contigs | 100973 |
Total length of contigs (bp) | 91276077 |
Max length (bp) | 26073 |
N50 (bp) | 2607 |
N90 (bp) | 1189 |
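The N50 and N90 values in Table 1 follow the usual definition: contigs are sorted from longest to shortest, and Nx is the length of the contig at which the running total first reaches x% of the assembly length. The sketch below shows that calculation, assuming a plain list of contig lengths as input; the function name is illustrative.

```python
def nx(lengths, x=50):
    """Return Nx: the length L such that contigs of length >= L
    together contain at least x% of the total assembled bases."""
    if not lengths:
        raise ValueError("no contig lengths given")
    target = sum(lengths) * x / 100
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length


# Toy contig lengths for illustration only, not data from this study.
toy_lengths = [100, 80, 60, 40, 20]
toy_n50 = nx(toy_lengths, 50)
toy_n90 = nx(toy_lengths, 90)
```

For the toy lengths (300 bp in total), N50 is 80 and N90 is 40.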
SSR identification and validation
Compared with traditional design methods, developing SSR molecular markers from genomic data is simple and effective [12]. A total of 805600 SSRs were identified in this study, and the frequencies of the different repeat types varied considerably (Fig. 4). As shown in Fig. 5, the most common repeat type was mononucleotide (63.70%), followed by dinucleotide (18.77%), tetranucleotide (11.63%), trinucleotide (3.25%), hexanucleotide (2.43%) and pentanucleotide (0.22%). Mononucleotide and dinucleotide repeats together accounted for 82.47% of the total. Among the 513158 mononucleotide repeats, A/T was far more common than G/C, accounting for 97.01%. The most abundant dinucleotide repeat was AG/CT (83998, 55.56%), followed by AT/AT (59954, 39.66%), AC/GT (7134, 4.72%) and CG/CG (94, 0.06%). Among the trinucleotides, AAG/CTT repeats were the most numerous (7871, 30.03%), followed by ACC/GGT (4948, 18.88%), AAT/ATT (3679, 14.04%) and AAC/GTT (2889, 11.02%). Of the 36 tetranucleotide types, AGAG/CTCT was the most dominant, accounting for 58.22%, followed by ATAT/ATAT (33277, 35.51%), ACAC/GTGT (3096, 3.30%) and AAAT/ATTT (1210, 1.29%).
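To illustrate how perfect SSRs of the classes listed above can be located and tallied by motif length, a minimal regular-expression sketch is given here. It is not the tool used in this study. The minimum repeat counts in `MIN_REPEATS` are assumed values chosen for illustration, the function names are hypothetical, and the merging of reverse-complement motifs (for example AG with CT) is omitted.

```python
import re
from collections import Counter

# Illustrative minimum repeat counts per motif length (an assumption,
# not the thresholds used in this study).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit
    (e.g. 'AT' is primitive, 'ATAT' and 'AA' are not)."""
    n = len(motif)
    for size in range(1, n):
        if n % size == 0 and motif[:size] * (n // size) == motif:
            return False
    return True


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return a sorted list of (start, motif, repeat_count) for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in sorted(min_repeats.items()):
        # A motif of `size` bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs such as 'AA' or 'AGAG', which are already
            # reported under their shorter repeat unit.
            if is_primitive(motif):
                hits.append((m.start(), motif, len(m.group(0)) // size))
    return sorted(hits)


def count_by_motif_length(hits):
    """Tally SSRs by motif length (1 = mononucleotide, 2 = dinucleotide, ...)."""
    return Counter(len(motif) for _, motif, _ in hits)


# Toy sequence for illustration only, not data from this study.
toy_seq = "GG" + "A" * 12 + "CC" + "AG" * 7 + "TT" + "AAG" * 5 + "C"
toy_hits = find_ssrs(toy_seq)
```

On the toy sequence this finds one mononucleotide, one dinucleotide and one trinucleotide repeat. Dividing each class count by the total then gives proportions of the kind reported in Fig. 5.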
Validation of SSR markers
In the present study, 562246 SSR-containing sequences were used to design primers, resulting in a total of 72769 usable primer pairs. We synthesized 100 primer pairs to test these SSR markers, and 82 of them successfully amplified products of the expected size (Fig. 6). We then selected 15 primer pairs with good polymorphism to perform cluster analysis on 45 individuals from two populations; the results are shown in Fig. 7.
Information on these 15 primer pairs is provided in Table 2. The average number of alleles per locus (Na; k in Table 2) across the 15 SSR loci was 3.67. The minimum observed heterozygosity (HO) and expected heterozygosity (HE) were 0.089 and 0.167, and the maximum HO and HE were 1.000 and 0.788, respectively. The average PIC (polymorphism information content) value was 0.44, with four highly polymorphic loci (TH17, TH25, TH31 and TH58) having PIC values above 0.5.
Table 2
Genetic characteristics of the 15 SSR markers in 45 T. rupestris individuals
Locus | Motif | Primer sequence | k | HO | HE | PIC | HW | F(Null) |
TH4 | (ATTT)7 | TGGGCTACGACGATTGAACA GCATGTACTAGCAAACTCGCA | 3 | 0.289 | 0.461 | 0.378 | ND | 0.2225 |
TH17 | (ATGAG)5 | GTGTGCAAGTGGTTGGTTGG CTGCACCGTACCATCATGGA | 5 | 0.444 | 0.672 | 0.623 | NS | 0.183 |
TH22 | (GAT)9 | TGTCTGATTCGGGCCCTAGA TGGCATGTGATTCGCCTTCT | 7 | 0.289 | 0.508 | 0.481 | ND | 0.3385 |
TH25 | (GAA)8 | TGGCATAAAGAGTGGTCTGAGG AACGGTCTCTCCTCCTCCTC | 6 | 0.409 | 0.788 | 0.746 | ND | 0.3146 |
TH28 | (GCCA)5 | TCTGTTTTGTTTAAGGCGTGCA ACACGTGTCATCTCGTCATTGA | 3 | 0.182 | 0.581 | 0.484 | *** | 0.5428 |
TH31 | (AGAAGA)5(AGA)10 | CGTGCGATTGGTGTACCTCT AGAGGTCATTACGATTTACAACCA | 4 | 0.682 | 0.714 | 0.653 | NS | 0.0189 |
TH41 | (T)13(TT)6 | GCTAGCCAACACACCACTCT ACCCTAGGTGGCTACGAGTT | 2 | 0.295 | 0.255 | 0.22 | ND | -0.0764 |
TH44 | (TTC)7 | ACTCGATCCTCTCCCTAAAGGA CATAGGAGACGAGCAGAGGC | 3 | 0.095 | 0.304 | 0.268 | ND | 0.5373 |
TH49 | (AT)13(ATAT)6 | GTTGTACTAGGTGGCTGCCC GCTGATGGCTAGGATTCACT | 2 | 1.000 | 0.506 | 0.375 | *** | -0.333 |
TH50 | (ACC)7 | ACGACGTCACCTCCGTAAAC AGATTGAAGAGGCGGTGGTG | 2 | 0.605 | 0.499 | 0.372 | NS | -0.1015 |
TH53 | (CTT)9 | CCTACTGGCATCGAGACACC GGGATCTCCACTCCAACAGC | 3 | 0.128 | 0.167 | 0.155 | ND | 0.1229 |
TH58 | (CAAAA)5 | TCATTCTCTGCACCAACCCC GGACGTGGAGGCATTCTTGA | 6 | 0.622 | 0.773 | 0.729 | NS | 0.1122 |
TH67 | (ATGGTG)5 | CACAATCTTCCCTAAAAAGGCACA GAACCAAACCGCCCGAATTC | 3 | 0.200 | 0.388 | 0.349 | ND | 0.3055 |
TH80 | (GAAGAT)5 | TTGTCATCTTCCGCGGTGAA TCCACACCCTCATGATCGGA | 3 | 0.089 | 0.401 | 0.36 | ND | 0.6677 |
TH92 | (TTTGT)5 | TGAATGGGCAAAGGTCAACT ACCATACAAAGTTTTTGCATCCT | 3 | 1.000 | 0.564 | 0.46 | *** | -0.3011 |
Mean | | | 3.67 | 0.42 | 0.51 | 0.44 | | 0.17 |
k: number of alleles; HO: observed heterozygosity; HE: expected heterozygosity; PIC: polymorphism information content; HW: significance of deviation from Hardy-Weinberg equilibrium (NS = not significant, *** = significant at the 0.1% level, ND = not done); F(Null): null allele frequency. These significance levels include a Bonferroni correction if the Bonferroni correction option was selected.
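The per-locus statistics in Table 2 can be illustrated with the standard definitions: HO is the proportion of heterozygous individuals, HE is computed from allele frequencies, and PIC follows the formula of Botstein et al. (1980). The sketch below is an assumption-laden illustration, not the software used in this study: it assumes diploid genotypes given as allele pairs, applies the sample-size correction of Nei (1978) to HE, and uses a hypothetical function name.

```python
from collections import Counter


def locus_statistics(genotypes):
    """Summarise one SSR locus.

    genotypes: list of (allele_a, allele_b) tuples, one per diploid
               individual; missing genotypes should be removed beforehand.
    Returns (k, HO, HE, PIC).
    """
    n = len(genotypes)
    if n == 0:
        raise ValueError("no genotypes given")
    counts = Counter(allele for pair in genotypes for allele in pair)
    freqs = [c / (2 * n) for c in counts.values()]

    k = len(freqs)
    # Observed heterozygosity: share of individuals with two different alleles.
    ho = sum(a != b for a, b in genotypes) / n
    sum_sq = sum(p * p for p in freqs)
    # Expected heterozygosity with the Nei (1978) sample-size correction
    # (an assumption about which estimator is wanted).
    he = (2 * n / (2 * n - 1)) * (1 - sum_sq)
    # PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(k)
        for j in range(i + 1, k)
    )
    pic = 1 - sum_sq - cross
    return k, ho, he, pic


# Toy data: 45 individuals, all heterozygous for the same two alleles.
toy_k, toy_ho, toy_he, toy_pic = locus_statistics([("A", "B")] * 45)
```

For this toy input the function returns k = 2, HO = 1.000, HE of about 0.506 and PIC = 0.375, which coincides with the row for TH49 in Table 2.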
The results in Fig. 7 showed that the SSR markers could distinguish genetic differences between populations from different geographical locations. Because the two sampled populations were only one kilometer apart, some gradual genetic exchange between them was observed. The two coordinates explained 15.73% and 13.14% of the overall genetic variation, respectively.
The numerous genome-wide SSR loci identified here pave the way for constructing high-density genetic maps and for research into the regulatory mechanisms underlying the therapeutic components of T. rupestris. They also serve as a useful reference for future genomic investigations and molecular marker development.