Phenotyping diversity and RNA-seq Data of diploid D genome species
To observe an evolutionary divergence and conservation at the transcriptomic level of wild diploid D-genome cotton, four cotton species including G. thurberi (D1), G. klotzschianum (D3-k), G. raimondii (D5) and G. trilobum (D8) with a rich diversity of morphological characteristics were selected for further analysis. These species show tremendous phenotypic variations in flower color and leaf shape, although these are phylogenetically most closely related to each other (Figure 1A). Additionally, four species presented variant cold and salt resistance: G. klotzschianum and G. thurberi showed excellent resistance to cold and salt stress, followed by G. raimondii (Figure 1B). Strangely, G. trilobum indicated the lowest resistance, in spite of it is most closely related to G. thurberi. For comprehensive evaluation, we carried out RNA-sequencing of D-genome diploid cotton. Four species, G. thurberi, G. klotzschianum, G. raimondii and G. trilobum, were abbreviated as GD1, GD3, GD5 and GD8 for convenience. Ten leaf samples were collected from each species in different time intervals (0h, 6h and 12h after three-leaf stage of seedlings: C0, C6 and C12) after two stress treatment samples (Cold T12 and salt stress S12), two repeats for each sample. A total of 40 data sets with 273.01 GB raw data were obtained. The resulting GC content rates of 44.41%-46.64% are similar in different dataset. 273.01 Gb clean data without an adapter, ploy-N and lower quality reads were obtained after quality control. Every dataset is at least 17.1 million clean reads and more than 89.03% of base Q30. After that, RNA-seq clean reads were mapped to reference genome of G. raimondii and ranged from 81.37% to 91.37% clean reads were uniquely mapped to reference genome (Table S1). Based on mapping result more than, 7,374 SNPs (Single Nucleotide Polymorphisms) were identified in each library (Table S2). expression level of 41,053 genes, including 3,548 new transcriptional genes, is quantitated. Except samples of GD1S12, the R1 and R2 libraries in the same sample are with high value of correlation (R2> 0.7); suggested datasets of other samples are reliable (Figure 1C). The reason of the lower value of correlation between GD1S12R1 and GD1S12R2 may be effects of developmental and environmental variation on gene expression, although we try to minimize it.
Genetic divergence of four diploid D genome species
The transcript sequence polymorphism analysis contributed to understanding the evolutionary diversity and conservation at the transcriptomic level in four D-genome diploid cotton. The RNA-Seq data were used for sequence polymorphism discovery and 7,374-404,737 SNPs were identified in different libraries (Table S2). More than 59% SNPs have transition type. On account of the datasets was obtained by transcriptomic sequencing, the chromosomal distribution of more than SNPs in introgenic region. Of note is the observation that about 9% SNPs were identified in the intergenic regions of GD1, GD3 and GD8, representing new genic region in other D-genome species. Libraries of GD5 (G. raimondii) have less number of SNPs as compared to other three species. In general, transcriptomes of GD5 samples should have the same genome with reference genome. Although transcriptome of GD5 samples and reference genome was very similar to each other (<1 SNP per Kb of most genes), some SNPs were observed between GD5 and reference genome, suggesting that there existed in differences in the leave′s transcriptome among plants of the same donor material. A total 3, 2651 genes were expected to have high functional effects by SNPs in four diploid D-genome cottons. Go ontology (GO) enrichment analysis revealed that a large number of genes were enriched in oxidation-reduction process, protein modification process, stress response and protein phosphorylation, it is suggested that abiotic stresses have played an important role in driving transcriptional diversity among four species (Table S3). On the contrary, it is also noticeable that there are 8,402 genes have no SNPs function-effect, including many housekeeping genes such as (Tubulin, ribosomal protein and glycolytic enzyme-coding genes). A total of 8,402 genes were involved in several biological pathways including photosynthetic electron transport chain. We used Neighbor-joining methods to construct a phylogeny tree of four species base on transcriptomic data (Figure 2A). And, the principal component cluster was performed to observe the relationship among four species (Figure 2B). A modest number of SNPs separate them among four species (< 5 SNPs per kb of most of genes). The datasets of same species almost overlap with each other. Consistent with previous study, G. thurberi showed closer relationships with G. trilobum.
Transcriptome De novo assembly
Transcriptome De novo assemblies were performed based on RNA-seq data using Trinity software. Ranged from 47,180 to 55,548 unique genes were identified in four diploid species. Most of unigenes (> 40%) length was ranged from 200 - 500 bp. To initiate our evolutionary analysis, we identified the strictly orthologous unigenes among 7 species, including three cultivated (A2-genome: G. arboreum; AD1-genome: G. hirsutum; AD2-genome: G. barbadense) and four wild species. For rigorous analysis, the genomes of allotetraploid species were separate A and D subgenome to identify orthologous transcripts. A total of 47,119 orthologous gene pairs were characterized, and 5,312 single-copy transcript pairs of those were used to construct phylogenic tree of the nine genomes by maximum likelihood methods. Phylogenetic analysis using these unigenes revealed a quite similar topology (Figure 2C) of the tree based on transcriptomic SNP data (Figure 2A), indicating a solid phylogeny for four D-genome diploid species. Same with previous study, A subgenome of G. hirsutum and G. barbadense originate from G. arboreum, which formed a monophyly in our phylogenetic analysis. G. hirsutum-D subgenome, G. barbadense-D subgenome and GD5 formed a monophyly. These results proved D subgenome of allotetraploid AD1 and AD2 genomes have same donors, originated from G. raimondii.
Estimate evolutionary rates and identify positively selected genes (PSG)
Non-synonymous (dN) and synonymous (dS) are used to estimate evolutionary rate and positively selected genes (PSGs) in each species. If dN/dS value is >1, this is indicative of positive selection, whereas equal to or significantly greater than 1 represent either purifying or neutral selection. Therefore, we evaluated the dN, dS and dN/dS of the identified orthologous single-copy genes among nine genomes. Based on the above-constructed phylogenetic tree, the dN/dS for each orthologous unigene pair was evaluated in the different branches using a free ratio model (model=1), which allows for a separate dN/dS ratio for each branch. We found that wild diploid cotton (GD1, GD3, GD5, GD5_ref, and GD8) had higher dN/dS ratios than the branch of cultivated cotton (GAD1, GAD2, and GA2). Under natural selected pressure, wild species are showed a fast evolutionary ratio based on adaptive choices (Figure 2D). A total of 163, 344, 330, and 161 PSGs in GD1, GD3, GD5 and GD8 were identified by PAML software, respectively (Table S4). PSGs of diploid cottons were enriched in GO terms related to protein modification, protein ubiquitination, RNA processing and ncRNA processing, involved in environmental adaptation (Table S5). What is more, we also found 180, 77, 51, 103 and 70 PSGs in GA2, GAD1_A, GAD1_D, GAD2_A and GAD2_D, and these PSGs were enriched mRNA metabolic process (Table S6).
Gene Expression Divergence and conservation
We detected expression levels of a total of 41,053 transcripts in at least a single sample. Global expression level distributions of 40 data sets were similar to each other. Twenty-four datasets of leaves without stress treatment were used to detect gene expression divergence and conservation of four species. Samples of GD5C0R1 and GD5C0R2 were as a control group to identify different expression transcripts. Patterns of differential gene expression were characterized of four species (Figure 3A). A total of 29,512 DEGs was clustered eight profiles based on gene expression pattern use k-means clustering method. Profiles with a large number of DEGs, display a similar expression pattern (Profile 1, 2 and 3) of four species. The profiles 1 and 2 showed considerable difference in expression over time in leaves of four species. In the Profile 1, the DEGs were down-regulated expression level over time, and enriched in carbon metabolism and carbon fixation in photosynthetic organism. On the contrary, DEGs of the profile 2 were up-regulated and enriched in plant circadian rhythm and ribosome biogenesis in eukaryotes. In the profile 3, DEGs were significantly enriched biosynthesis of unsaturated fatty acids, DNA replication and photosynthesis-antenna proteins process (Figure 3B). Those pathways are indispensable during leaves growth and development, while transcripts expression of those pathways is conservation during divergence of D-genome species of cotton. Profile 4-9 was observed a considerable species-specific expression pattern. In the profile 4 and 5, 3662 and 1048 DEGs displayed GD5-specific expression pattern. What′s more, in the profile 6, 1516 DEGs displayed GD1-specific expression pattern. And, in profile 7 and 8, showed GD3-GD5 and GD1-GD8 special expression pattern. In view of GD1 and GD8 with close relationship, it made perfect sense that GD1 and GD8 contain more conservative biological pathways.
Evolutionary conservation and divergence of the Gene Co-expression Networks
Notwithstanding the analysis of gene expression pattern provided insights of gene expression divergence and conservation, co-expression gene networks, constructed by highly connected genes, is more relevant to important biological processes of growth and development and complex regulation of abiotic stress response. To grasp gene co-expression network conservation and divergence, all of 40 datasets of leaves were used to construct co-expression network by weighted gene co-expression network analysis (WGCNA). In order to improve the accuracy of WGCNA, we removed genes with FPKM< 1. In conjunction with connection strengths (soft-threshold power: 5, R2> 0.90) among 12110 genes (Figure S1), a global view of co-expression network topology among four species was constructed. Genes in the same showed module showed a higher topological overlap (Figure S2). Finally, a total of 33 modules were used for investigating evolutionary conservation and divergence of gene co-expression networks, which are defined as clusters of highly interconnected genes (Figure 4A). In these modules, the genes number ranged from 31 (darkorange) to 3128 (turquoise) with high correlation coefficients with one another in corresponding modules. Every gene connectivity was evaluated on each module. Highly connected genes (kME> 0.95) were identified as hub genes. Ultimately, a total of 425 hub genes (ranging from 1 to 293 within the modules) were detected.
In each module, expression levels of all genes were displayed by a heatmap and were summarized by the eigengene values (the first principle component of module expression profiles). Seven sample conditions were defined for identified significant modules (Figure 4B). Share.T.S group was used for identifying conservative shared networks of cold and salt stress response. Additionally, the module, correlated with cold and salt stress response, was identified by Cold and Salt Groups, respectively. GD1, GD3, GD5 and GD8 Groups were used for identifying genome-specific modules. By association analysis between eigengenes and sample conditions (7 groups: Share.T.S, Cold, Salt, and 4 genomes) via Pearson correlation coefficient analysis, 29 modules were identified with significant genome-specific and/or abiotic stress regulated co-expression patterns (ANOVA, P<0.05). Four major modules of highly co-expression genes were most strongly correlated (Pearson’s correlation r>0.9) with four genomes, respectively. The largest module (turquoise), containing 3128 highly connected genes in GD5, enriched in response to stimulus and immune system process (Table S7). A second module (Blue, 1,193 genes), showed GD3 specific. Other two modules (yellow with 420 genes and purple with 442 genes), which included few highly connected genes, respectively related to GD1 and GD8. Interestingly, all four major modules displayed similar result of GO enrichment analysis. Nearly half genes enriched for GO terms related to response to stimulus and immune system process in each module, revealing that abiotic and biotic stresses have played a major role driving transcriptional variation among these four species. We noticed that characteristics of the seven modules were unique related to corresponding group, and among these modules, five modules were genome-specific. What′s more, most of the modules most strongly correlated with four genome group. These results suggested transcriptional variation is mostly correlated with genome divergence. We observed some modules overlapped among different groups. Overlap modules are helpful to understand the similar biological characteristics among two or more genomes. For example, the darkgreen module was significantly showed a positive correlation with GD1 and GD3, but a negative correlation with GD5 and GD8, and enriched sesquiterpenoid and triterpenoid biosynthesis and flavonoid biosynthesis which were known related with abiotic and biotic stress resistance.
Of particular concern is that co-expression networks related to cold and salt stress, due to four diploid species presented different cold and salt stress resistance. Association analyses between co-expression modules and abiotic stress (Cold and salt stress) revealed that 17 modules correlated with abiotic stress (Shar.T.S: 7 modules; Cold: 8 modules; Salt: 8 modules), thus representing suites of interconnected genes underlying the biological process of abiotic stress response (Table S7). Noticeably, among four species, GD3 and Salt/Cold group had more overlapped modules. Considering GD3 showed better cold and salt tolerance than other three species, suggested GD3 evolved more complete mechanisms in abiotic stress adaptation. On the contrary, almost no one module showed overlap between GD8 and abiotic stress groups, and it is foreseeable that GD8 manifest as significant sensitivity under cold and salt.
Interestingly, we found skyblue3 module displayed significant negative correlation with Share.T.S, Cold and Salt group. One hub gene, RALF (rapid alkalization factor)-like, was identified in this module. The homologs of this gene in Arabidopsis may regulate plant stress, growth, and development. Fifteen genes in this module that is interconnected with RALF-like protein coding gene. Except five uncharacterized genes, ten out of fifteen interconnected genes are related to GO term of response to stimulus, including Gorai.005G234900, Gorai.007G094200, Gorai.N023400, and Gorai.011G238800. Homologs of them in Arabidopsis are annotated four important transcription factors: MYB44, TCP9, TCP12 and GATA8, and involved in abiotic stress response. We also observed two xyloglucan endotransglucosylase/hydrolase protein 22 (XTH22)-encoding genes, Gorai.003G052400 and Gorai.009G006400, which involved in carbohydrate transport and metabolism. Homologs of Gorai.005G094300 in Arabidopsis, EXORDIUM protein-encoding gene, required for cell expansion in leaves, and may be involved in signaling processes that coordinate brassinosteroid (BR) responses to environmental or developmental signals (Figure 5). Most of fifteen genes were decreased the expression level after cold and salt stress treatment.
Characterization of DEGs under cold and salt stress
The false discovery rate (FDR) ≤ 0.001 and log2 rates ≥ 2 (8 groups: treatment/control) were used to identify DEGs of in four species. The number of DEGs varied from 459 to 3372 among treatments of 8 groups (Figure S3). Totally, 7515 DEGs were found, occupying 20.04% of total detected genes. Interestingly, the number of DEGs among salt stress groups was always smaller than those among cold stress group in four species examined (Figure 6A), indicated that the cold stress response was divergent than salt stress response, and more unique DEGs (6405, 85.2% of all DEGs) in one type of abiotic stress again proved that. But, 1,109 genes (16.1% of cold DEGs; 63.3% of Salt DEGs) were found in two different abiotic stresses, indicating potential share regulated pathways in cold and salt stress response. As expected, 1109 genes were enriched in the oxidation reduction pathway, which were involved in multiple abiotic stress response.
We thus focused on DEGs in GD3, on account of the it showed higher tolerance under cold and salt stress, and ultimately identified 2759 DEGs in salt and cold groups of GD3, including 2,300 genes cold DEGs, 630 salt DEGs. Among those genes, we found171 share DEGs under salt and cold stress, containing 80 down-regulated and 75 up-regulated DEGs (Figure 6C). Interestingly, 102 genes of share DEGs involved response to stimulate, and 32 genes of those genes were related to response to cold or salt stress, including 13 transcriptional factors encoding genes (NAC, ERF, MYB, G2, HD-ZIP) are putatively related to response to abiotic stress, since some homologs of those genes in Arabidopsis related to abiotic stress response. For example, Gorai.002G073700 encodes the homolog of Arabidopsis NAC72, which binds to a drought-responsive cis-element in the early responsive to dehydration stress 1 promoter [46]. Also, both Gorai.001G239000 and Gorai.006G017400 encode homologs of MYB-like protein in Arabidopsis that involve in plant defense response [47]. Notable a total of thirty cold DEGs and three salt DEGs were PSGs. Moreover, three cold DEGs and 12 salt DEGs were in skyblue3 module. Although fewer DEGs (Cold DEGs: 6.1%; Salt DEGs: 1.7%) were identified above all salt DEGs, some DEGs were overlapping with the PSGs or skyblue3 module (Figure 6B). Those results indicate again adaptive evolution drives transcriptional diversity, and a share regulated network involved in cold and salt stress response. We also found Gorai.006G147500 of skyblue3 module genes is positively selected, and is interconnected with the hub gene, RALF-like protein coding gene. It further confirmed the potential regulated network that was identified in previous analysis.