Illumina sequencing and de novo assembly
Illumina sequencing of the cDNA library generated a total of 21,490,725 paired-end raw reads, which included null reads, low-quality sequences, and adapter-primer sequences. After a stringent quality check and data filtering, 21,277,286 high-quality clean reads were obtained, with Q20 and Q30 base percentages of 98.02% and 93.98%, respectively. The GC content of the clean reads was 44.71%, and the total clean nucleotide number was 133,174,507 (66,597,750 in transcripts + 66,576,757 in unigenes) (Table 2). A total of 65,010 unigenes were assembled from the paired-end reads, with a total length of 66,576,757 bp and an average length of 1,024.10 bp. The N50 and N90 values of the assembly were 1,349 bp and 494 bp, respectively. Of the 65,010 unigenes, 17,527 (26.96%) had a length of 200-500 bp; 21,591 (33.21%) ranged from 500 bp to 1 kbp; 19,547 (30.06%) ranged from 1 kbp to 2 kbp; and 6,345 (9.76%) were longer than 2 kbp (Fig. 1).
Table 2
Statistics of the transcriptome data generated after Illumina sequencing in B. ciliata
Category | Items | Number |
Raw reads | Total raw reads | 21,490,725 |
Clean reads | Total clean reads | 21,277,286 |
| Total clean nucleotides (nt) | 133,174,507 |
| Q20 percentage | 98.02% |
| Q30 percentage | 93.98% |
| GC percentage | 44.71% |
Unigenes | Total sequence number | 65,010 |
| Total sequence bases (bp) | 66,576,757 |
| Largest (bp) | 6,531 |
| Smallest (bp) | 301 |
| Average (bp) | 1,024 |
| N50 (bp) | 1,349 |
| N90 (bp) | 494 |
EST-SSR | Total number of examined sequences | 65,010 |
| Total size of examined sequences (bp) | 66,576,757 |
| Total number of identified SSRs | 18,226 |
| Number of SSR-containing sequences | 14,497 |
| Number of sequences containing more than one SSR | 2,913 |
| Number of SSRs present in compound formation | 1,468 |
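The N50 and N90 values reported above summarize how the assembled bases are distributed over unigenes of different lengths. As a minimal sketch (not the pipeline used in this study), they could be computed from a list of unigene lengths as follows. The function name `nx` and the toy length list are illustrative assumptions; the last statement only re-derives the reported average length from the reported total unigene length (66,576,757 bp) and unigene count (65,010).

```python
def nx(lengths, fraction):
    """Return the Nx length: the length L such that sequences of
    length >= L together cover at least `fraction` of all bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]


# Hypothetical toy assembly (lengths in bp), not data from the study.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = nx(toy_lengths, 0.5)
toy_n90 = nx(toy_lengths, 0.9)

# Average unigene length from the reported totals.
mean_length = round(66_576_757 / 65_010, 2)
```

With real data, `lengths` would be the per-sequence lengths read from the unigene FASTA file, and `nx(lengths, 0.5)` and `nx(lengths, 0.9)` would correspond to the N50 and N90 entries of Table 2.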
Frequency and distribution of EST-SSRs in the unigenes
Among the 65,010 unigenes, 18,226 potential EST-SSRs were identified, of which 1,468 were present in compound formation (Table 3). The SSR frequency of unigenes in B. ciliata was 28.03%. The most prominent repeat type was di-nucleotide (8,728; 47.89%), followed by mono-nucleotide (6,387; 35.04%) and tri-nucleotide repeats (2,899; 15.91%) (Table 4; Fig. 2).
Table 3
EST-SSR markers identified from de novo transcriptome sequencing in B. ciliata
Searching Items | Numbers |
Total number of sequences examined | 65,010 |
Total size of examined sequences (bp) | 66,576,757 |
Total number of identified SSRs | 18,226 |
Number of SSR containing sequences | 14,497 |
Number of sequences containing more than 1 SSR | 2,913 |
Number of SSRs present in compound formation | 1,468 |
Mono-nucleotide | 6,387 |
Di-nucleotide | 8,728 |
Tri-nucleotide | 2,899 |
Tetra-nucleotide | 101 |
Penta-nucleotide | 41 |
Hexa-nucleotide | 70 |
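The SSR search criteria are not restated in this section, but the repeat numbers listed in Table 4 suggest minimum thresholds of 10 repeats for mono-, 6 for di-, and 5 for tri- to hexa-nucleotide motifs. Under that assumption, perfect SSR detection can be sketched with regular expressions as below. The names `MIN_REPEATS` and `find_ssrs` and the test sequence are hypothetical, this is not the tool used in the study, and compound SSRs are not handled.

```python
import re

# Assumed minimum repeat numbers per motif length (inferred from Table 4).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return (start, motif, repeat_count) for perfect SSRs in `seq`."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(MIN_REPEATS.items()):
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


# Hypothetical test sequence: (A)12, (AG)7 and (GAA)5 separated by spacers.
example = "GGC" + "A" * 12 + "CGT" + "AG" * 7 + "TTC" + "GAA" * 5 + "C"
ssrs = find_ssrs(example)
```

Because `finditer` reports non-overlapping matches, this sketch can miss an SSR that directly overlaps a longer run of another motif length; a production tool would treat such cases explicitly.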
Table 4
Length distributions of microsatellites in B. ciliata based on the number of nucleotide repeat units
Number of repeats | Mono- | Di- | Tri- | Tetra- | Penta- | Hexa- | Total | Percentage (%) |
5 | | | 1,737 | 72 | 30 | 54 | 1,893 | 10.38 |
6 | | 2,468 | 593 | 24 | 8 | 9 | 3,102 | 17.01 |
7 | | 1,709 | 297 | 2 | | 7 | 2,015 | 11.05 |
8 | | 1,292 | 147 | 1 | 1 | | 1,441 | 7.90 |
9 | | 901 | 79 | 2 | 1 | | 983 | 5.39 |
10 | 3,194 | 741 | 29 | | | | 3,964 | 21.74 |
11 | 1,127 | 448 | 4 | | | | 1,579 | 8.66 |
12 | 683 | 340 | 8 | | 1 | | 1,032 | 5.66 |
13 | 364 | 217 | 5 | | | | 586 | 3.21 |
14 | 275 | 234 | | | | | 509 | 2.79 |
15 | 198 | 190 | | | | | 388 | 2.12 |
16 | 123 | 24 | | | | | 147 | 0.80 |
17 | 108 | 50 | | | | | 158 | 0.86 |
18 | 56 | 25 | | | | | 81 | 0.44 |
19 | 50 | 19 | | | | | 69 | 0.37 |
20 | 43 | 17 | | | | | 60 | 0.32 |
21 | 13 | 23 | | | | | 36 | 0.19 |
22 | 20 | 10 | | | | | 30 | 0.16 |
23 | 18 | 5 | | | | | 23 | 0.12 |
24 | 20 | 4 | | | | | 24 | 0.13 |
25 | 12 | 5 | | | | | 17 | 0.09 |
26 | 15 | 4 | | | | | 19 | 0.10 |
27 | 8 | 1 | | | | | 9 | 0.04 |
28 | 6 | 1 | | | | | 7 | 0.03 |
29 | 25 | | | | | | 25 | 0.13 |
30 | | | | | | | | 0 |
> 30 | 29 | | | | | | 29 | 0.15 |
Total | 6,387 | 8,728 | 2,899 | 101 | 41 | 70 | 18,226 | |
Percentage (%) | 35.04 | 47.89 | 15.91 | 0.55 | 0.22 | 0.38 | | |
EST-SSRs with ten tandem repeats (3,964; 21.74%) were the most common (Tables 4 and 5), followed by those with six (3,102; 17.01%) and seven tandem repeats (2,015; 11.05%). Five, eleven, and eight tandem repeats accounted for 1,893 (10.38%), 1,579 (8.66%), and 1,441 (7.90%) EST-SSRs, respectively, while each of the remaining repeat numbers contributed less than 10% of the EST-SSRs. Among the di-nucleotide repeats, AG/CT was the dominant motif (7,351; 40.33%), followed by AT/AT (1,050; 5.76%), AC/GT (301; 1.65%), and CG/CG (26; 0.14%). Among the tri-nucleotide repeats, the most abundant motif was AAG/CTT (706; 3.87%), followed by ACC/GGT (519; 2.84%), ATC/ATG (387; 2.12%), AGC/CTG (309; 1.69%), and AGG/CCT (281; 1.54%) (Table 5; Fig. 2).
Table 5
Frequency and distribution of microsatellites in B. ciliata based on SSRs repeat motifs
Repeat motifs | Number of repeats |
| 5 | 6 | 7 | 8 | 9 | 10 | > 10 | Total | Frequency (%) |
Mono-nucleotide |
A/T | — | — | — | — | — | 3164 | 3081 | 6245 | 34.26 |
C/G | — | — | — | — | — | 30 | 112 | 142 | 0.77 |
| | | | | | | | 6387 | 35.04 |
Di-nucleotide |
AG/CT | — | 2033 | 1471 | 1088 | 749 | 634 | 1376 | 7351 | 40.33 |
AT/AT | — | 293 | 165 | 163 | 125 | 89 | 215 | 1050 | 5.76 |
AC/GT | — | 121 | 70 | 41 | 25 | 18 | 26 | 301 | 1.65 |
CG/CG | — | 21 | 3 | — | 2 | — | — | 26 | 0.14 |
| | | | | | | | 8,728 | 47.89 |
Tri-nucleotide |
AAG/CTT | 377 | 159 | 73 | 47 | 31 | 11 | 8 | 706 | 3.87 |
ACC/GGT | 308 | 108 | 62 | 32 | 6 | 3 | — | 519 | 2.84 |
ATC/ATG | 254 | 67 | 32 | 12 | 13 | 6 | 3 | 387 | 2.12 |
AGC/CTG | 202 | 49 | 33 | 20 | 1 | 4 | — | 309 | 1.69 |
AGG/CCT | 189 | 52 | 26 | 5 | 7 | 2 | — | 281 | 1.54 |
Others | 407 | 158 | 71 | 31 | 21 | 3 | 6 | 697 | 3.82 |
| | | | | | | | 2,899 | 15.91 |
Tetra-nucleotide |
AAAT/ATTT | 24 | — | — | — | — | — | — | 24 | 0.13 |
AAAC/GTTT | 11 | 2 | — | — | — | — | — | 13 | 0.07 |
AATC/ATTG | 11 | 2 | — | — | — | — | — | 13 | 0.07 |
AGAT/ATCT | 7 | 3 | 1 | — | 2 | — | — | 13 | 0.07 |
Others | 19 | 17 | 1 | 1 | — | — | — | 38 | 0.21 |
| | | | | | | | 101 | 0.55 |
Penta-nucleotide | 30 | 8 | — | 1 | 1 | — | 1 | 41 | 0.22 |
Hexa-nucleotide | 54 | 9 | 7 | — | — | — | — | 70 | 0.38 |
| | | | | | | | 111 | 0.60 |
Total | 1,893 | 3,102 | 2,015 | 1,441 | 983 | 3,964 | 4,828 | 18,226 | 100 |
Frequency (%) | 10.38 | 17.01 | 11.05 | 7.90 | 5.39 | 21.74 | 26.49 | 100 | |
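Motif classes such as AG/CT in Table 5 pool a motif with its cyclic rotations and with the rotations of its reverse complement, so that AG, GA, CT, and TC are counted together. The following minimal sketch shows one way to derive such class labels; the function names and the short list of observed motifs are illustrative assumptions, not the procedure or data of this study.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rotations(motif):
    """All cyclic rotations of a motif, e.g. AAG -> {AAG, AGA, GAA}."""
    return {motif[i:] + motif[:i] for i in range(len(motif))}


def motif_class(motif):
    """Label a motif by the smallest rotation on each strand, e.g. GA -> AG/CT."""
    motif = motif.upper()
    reverse_complement = motif.translate(_COMPLEMENT)[::-1]
    forward = min(rotations(motif))
    reverse = min(rotations(reverse_complement))
    return "/".join(sorted((forward, reverse)))


# Hypothetical list of detected motifs, tallied by class.
observed = ["AG", "GA", "CT", "TC", "TA", "TTC", "GAA"]
class_counts = Counter(motif_class(m) for m in observed)
```

Applied to the motifs of all detected SSRs, a tally like `class_counts` would give per-class totals of the kind listed in the Total column of Table 5.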
Functional annotation of Bergenia transcriptome
The de novo assembled unigenes of B. ciliata were annotated against seven public functional databases: Nt (NCBI nucleotide sequences), Nr (NCBI non-redundant protein sequences), KO (KEGG Orthology), Swiss-Prot, Pfam, GO (Gene Ontology), and KOG (Eukaryotic Orthologous Groups) (Table 6; Fig. 3). Sequence comparison with BLAST was used to retrieve related sequences and their associated annotations.
Table 6
Summary of functional annotation of unigenes of B. ciliata with seven databases
| Number of Unigenes | Percentage (%) |
Annotated in NR | 53577 | 82.41 |
Annotated in NT | 44297 | 68.14 |
Annotated in KO | 22540 | 34.67 |
Annotated in Swiss-Prot | 42287 | 65.05 |
Annotated in PFAM | 20609 | 31.70 |
Annotated in GO | 29477 | 45.34 |
Annotated in KOG | 15027 | 23.11 |
Annotated in all Databases | 4954 | 7.62 |
Annotated in at least one Database (overall*) | 54732 | 84.19 |
Total Unigenes | 65010 | 100 |
* the number of unigenes which can be annotated with at least one functional database |
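The "at least one database" and "all databases" rows of Table 6 correspond to the union and the intersection of the per-database sets of annotated unigenes. A minimal sketch of that bookkeeping is given below; the function name `annotation_summary`, the unigene identifiers, and the toy hit sets are hypothetical, and only the final two lines use counts reported in Table 6.

```python
def annotation_summary(hits_by_db, total_unigenes):
    """Return {label: (count, percentage)} for each database, for unigenes
    annotated in at least one database, and for those annotated in all."""
    if not hits_by_db:
        raise ValueError("hits_by_db must not be empty")

    def pct(count):
        return round(100 * count / total_unigenes, 2)

    hit_sets = [set(ids) for ids in hits_by_db.values()]
    in_any = set().union(*hit_sets)       # annotated in at least one database
    in_all = set.intersection(*hit_sets)  # annotated in every database

    summary = {db: (len(set(ids)), pct(len(set(ids))))
               for db, ids in hits_by_db.items()}
    summary["at least one"] = (len(in_any), pct(len(in_any)))
    summary["all"] = (len(in_all), pct(len(in_all)))
    return summary


# Hypothetical toy example with six unigenes and three databases.
toy_hits = {
    "Nr": {"u1", "u2", "u3"},
    "GO": {"u2", "u3"},
    "KOG": {"u3", "u5"},
}
toy_summary = annotation_summary(toy_hits, total_unigenes=6)

# Percentages re-derived from the counts reported in Table 6.
overall_pct = round(100 * 54_732 / 65_010, 2)
nr_pct = round(100 * 53_577 / 65_010, 2)
```

In practice each set would hold the identifiers of unigenes with a hit in the corresponding database, and `total_unigenes` would be 65,010.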
Annotation of non-redundant and nucleotide database
Of the 65,010 unigenes, 54,732 (84.19%) were successfully annotated in at least one database. In total, 53,577 (82.41%) unigenes showed significant homology with proteins in the Nr database, while 44,297 (68.14%) unigenes matched entries in the Nt database (Fig. 3). In the species distribution of the B. ciliata annotations, Vitis vinifera (Vitaceae) had the highest similarity score (25.7%), followed by Quercus suber (Fagaceae; 7.6%), Juglans regia (Juglandaceae; 5.1%), Nelumbo nucifera (Nelumbonaceae; 3.3%), and Hevea brasiliensis (Euphorbiaceae; 3.2%) (Fig. 4).
KOG classification, GO annotation, KEGG pathway and Swiss-Prot annotation
Functional annotation in the KOG database assigned unigenes to 25 functional groups, including metabolic functions, cellular structure, and signal transduction. Post-translational modification, protein turnover, and chaperones (2,266 genes; 15.07%) was the largest group, followed by general function prediction (1,861 genes; 12.38%) and translation, ribosomal structure, and biogenesis (1,638 genes; 10.90%), while nuclear structure and cell motility were the smallest groups (Fig. 5).
Annotation in the GO database grouped 29,477 (45.34%) unigenes into three major categories, namely cellular component (51,232; 37.8%), biological process (50,744; 37.5%), and molecular function (33,274; 24.6%), with 51 subcategories (Fig. 6). Within molecular function, most unigenes were assigned to binding (15,346) and catalytic activity (14,136), while metabolic process (14,927) and cellular process (14,264) were the major subcategories within biological process. In total, 22,540 (34.67%) unigenes were annotated in the KEGG database and assigned to 125 metabolic pathways.
In the KEGG database, the annotated pathways were categorized into 11 main divisions. Metabolism (21,717 genes) was the largest division, followed by genetic information processing (5,167), organismal systems (4,282), cellular processes (2,482), and environmental information processing (2,286). Further, 42,287 (65.05%) unigenes had matches in the Swiss-Prot database (Table 6; Fig. 7).
Development and validation of novel EST-SSR markers
A total of 96 primer pairs were synthesized and tested for amplification and polymorphism. Of these, 37 amplified successfully, while the remaining 59 showed no amplification even at different annealing temperatures. Among the 37 amplifying primer pairs, 32 produced products of the expected size, while the remaining 5 yielded PCR products that were larger or smaller than expected (ESM Fig. 1). Eight individuals from eight different populations of B. ciliata were used as PCR templates; of the 32 primer pairs that gave expected-size products, 18 were polymorphic (ESM Fig. 2; ESM Fig. 3), whereas 14 were monomorphic (Table 7).
Table 7
Characterization of the 18 novel EST-SSR polymorphic primer pairs synthesized from the transcriptome of B. ciliata
SSR | Repeat type | Repeat motif | Forward primer | Tm (°C) | Reverse primer | Tm (°C) | Product size (bp) |
BC2 | Di | (AG)10 | CTGAGGCCAAAGAAAGTGCG | 59.7 | ACAAAGTCACACGGGCATCT | 59.8 | 190–250 |
BC7 | Di | (AG)6 | ACAATCAACAAGGCATCATGC | 57.7 | TCCAACTTACTGGGCAGGAA | 58.5 | 180–250 |
BC8 | Di | (AG)7 | TGGTCTGACAGTGAGTTCGC | 59.9 | TCGCCATCACAGAAGCCTTT | 59.9 | 140–160 |
BC17 | Di | (AT)6 | TACAAATACACCGGTGCAGG | 57.8 | AAATCTGGAGGGTTGCCAGG | 59.9 | 125–150 |
BC23 | Di | (CT)10 | TCACTCGTAAAGTCGACCCT | 58.0 | GGACGTCGAGCGAACAAATG | 59.9 | 140–180 |
BC26 | Di | (CT)11 | CAGCCAGTACTCTGCCCAAA | 59.9 | ACTCTCCACCTCCTGACCTC | 59.9 | 130–150 |
BC29 | Di | (CT)6 | ACGCCATTCTCACTGTACCT | 58.7 | TCAGCGGAGAAACAACCTCC | 59.9 | 180–210 |
BC33 | Di | (CT)6 | CATTGTTTCCTCCGTTGCCC | 59.7 | CTCCGTTTGGTTCTCGGGAA | 59.9 | 200–250 |
BC38 | Di | (CT)8 | TCGCAAACTCTCTCACTCTCC | 59.4 | AAACTTCAACCGCGGGATCT | 59.9 | 140–160 |
BC50 | Di | (GA)8 | TCCTCGAGTATTTGTCGCAG | 57.4 | GCGTTGAGAATCATTCGCCC | 59.9 | 140–160 |
BC53 | Di | (GA)9 | ACCGCCAAGAGCTTGATGTA | 59.3 | TGTTGAGTCGTTCGTCTTCC | 57.8 | 110–150 |
BC58 | Di | (TA)8 | ACACATGTTTACACGCGCAT | 59.1 | GAAGTGCACCCAAAGCATGA | 59.0 | 175–210 |
BC67 | Tri | (AGA)5 | ACCAATGTGAGGGTTCCTTCT | 58.9 | CCAACACACAGCAAGACAGC | 59.9 | 160–180 |
BC71 | Tri | (CAC)6 | AGAGGCACAATGTGGAAGAGA | 59.0 | TTCATGTAGTCCGGCAGCTC | 59.8 | 190–210 |
BC73 | Tri | (GAA)5 | AGTGTGGTACTCCTCGCTCT | 59.9 | ATCACGTCGTCGGAGAATCG | 59.9 | 220–250 |
BC74 | Tri | (GAC)5 | GGCAAACCTCCTCCCAAGAA | 59.8 | TTCCCTTGCCAGTTCCTCAC | 59.8 | 230–250 |
BC84 | Tri | (TTC)5 | GCTTGCAGTTTACACCCACA | 58.9 | CGCCTCCACGTCTATGTCTC | 59.9 | 160–190 |
BC87 | Tetra | (TTTA)5 | GGAAAGGTTGGATTGCTCCC | 58.8 | GATCTGCTGCAGAACTGGGT | 60.0 | 100–130 |