Phenotyping diversity and RNA-seq Data of diploid D genome species
To observe an evolutionary divergence and conservation at the transcriptomic level of wild diploid D-genome cotton, four cotton species including G. thurberi (D1), G. klotzschianum (D3-k), G. raimondii (D5) and G. trilobum (D8) with a rich diversity of morphological characteristics were selected for further analysis. These species show tremendous phenotypic variations in flower color and leaf shape, although these are phylogenetically most closely related to each other (Figure 1A). Additionally, four species presented variant cold and salt resistance: G. klotzschianum and G. thurberi showed excellent resistance to cold and salt stress, followed by G. raimondii (Figure 1B). Strangely, G. trilobum indicated the lowest resistance, despite it is most closely related to G. thurberi. For a comprehensive evaluation, we carried out the RNA-sequencing of D-genome diploid cotton. Four species, G. thurberi, G. klotzschianum, G. raimondii, and G. trilobum, were abbreviated as GD1, GD3, GD5 and GD8 for convenience. Ten leaf samples were collected from each species in different time intervals (0h, 6h, and 12h after a three-leaf stage of seedlings: C0, C6, and C12) after two stress treatment samples (Cold T12 and salt stress S12), two repeats for each sample. A total of 40 data sets with 273.01 GB raw data were obtained. The resulting GC content rates of 44.41%-46.64% are similar in a different dataset. 273.01 Gb clean data without an adapter, ploy-N, and lower quality reads were obtained after quality control. Every dataset is at least 17.1 million clean reads and more than 89.03% of base Q30. After that, RNA-seq clean reads were mapped to the reference genome of G. raimondii and ranged from 81.37% to 91.37% clean reads were uniquely mapped to the reference genome (Table S1). Based on mapping results more than, 7,374 SNPs (Single Nucleotide Polymorphisms) were identified in each library (Table S2). the expression level of 41,053 genes, including 3,548 new transcriptional genes, is quantitated. Except for samples of GD1S12, the R1 and R2 libraries in the same sample area with a high value of correlation (R2> 0.7); suggested datasets of other samples are reliable (Figure 1C). The reason for the lower value of the correlation between GD1S12R1 and GD1S12R2 may be the effects of developmental and environmental variation on gene expression, although we try to minimize it.
Genetic divergence of four diploid D genome species
The transcript sequence polymorphism analysis contributed to understanding the evolutionary diversity and conservation at the transcriptomic level in four D-genome diploid cotton. The RNA-Seq data were used for sequence polymorphism discovery and 7,374-404,737 SNPs were identified in different libraries (Table S2). More than 59% of SNPs have a transition type. On account of the datasets was obtained by transcriptomic sequencing, the chromosomal distribution of more than SNPs in introgenic region. Of note is the observation that about 9% SNPs were identified in the intergenic regions of GD1, GD3, and GD8, representing a new genic region in other D-genome species. Libraries of GD5 (G. raimondii) have a fewer number of SNPs as compared to the other three species. In general, transcriptomes of GD5 samples should have the same genome with a reference genome. Although the transcriptome of GD5 samples and reference genome was very similar to each other (<1 SNP per Kb of most genes), some SNPs were observed between GD5 and the reference genome, suggesting that there existed in differences in the leave′s transcriptome among plants of the same donor material. A total, 3, 2651 genes were expected to have high functional effects by SNPs in four diploid D-genome kinds of cotton. Go ontology (GO) enrichment analysis revealed that a large number of genes were enriched in oxidation-reduction process, protein modification process, stress response, and protein phosphorylation, it is suggested that abiotic stresses have played an important role in driving transcriptional diversity among four species (Table S3). On the contrary, it is also noticeable that 8,402 genes have no SNPs function-effect, including many housekeeping genes such as (Tubulin, ribosomal protein, and glycolytic enzyme-coding genes). A total of 8,402 genes were involved in several biological pathways including photosynthetic electron transport chain. We used Neighbor-joining methods to construct a phylogeny tree of four species base on transcriptomic data (Figure 2A). And, the principal component cluster was performed to observe the relationship between four species (Figure 2B). A modest number of SNPs separate them among four species (< 5 SNPs per kb of most of the genes). The datasets of the same species almost overlap with each other. Consistent with the previous study, G. thurberi showed closer relationships with G. trilobum.
Transcriptome De novo assembly
Transcriptome De novo assemblies were performed based on RNA-seq data using Trinity software. Ranged from 47,180 to 55,548 unique genes were identified in four diploid species. Most of the unigenes (> 40%) length were ranged from 200 - 500 bp. To initiate our evolutionary analysis, we identified the strictly orthologous unigenes among 7 species, including three cultivated (A2-genome: G. arboreum; AD1-genome: G. hirsutum; AD2-genome: G. barbadense) and four wild species. For rigorous analysis, the genomes of allotetraploid species were separate A and D subgenome to identify orthologous transcripts. A total of 47,119 orthologous gene pairs were characterized, and 5,312 single-copy transcript pairs of those were used to construct the phylogenic tree of the nine genomes by maximum likelihood methods. Phylogenetic analysis using these unigenes revealed a quite similar topology (Figure 2C) of the tree based on transcriptomic SNP data (Figure 2A), indicating a solid phylogeny for four D-genome diploid species. Same with the previous study, A subgenome of G. hirsutum and G. barbadense originate from G. arboreum, which formed monophyly in our phylogenetic analysis. G. hirsutum-D subgenome, G. barbadense-D subgenome, and GD5 formed monophyly. These results proved D subgenome of allotetraploid AD1 and AD2 genomes have the same donors, originated from G. raimondii.
Estimate evolutionary rates and identify positively selected genes (PSG)
Non-synonymous (dN) and synonymous (dS) are used to estimate the evolutionary rate and positively selected genes (PSGs) in each species. If dN/dS value is >1, this is indicative of positive selection, whereas equal to or significantly greater than 1 represent either purifying or neutral selection. Therefore, we evaluated the dN, dS, and dN/dS of the identified orthologous single-copy genes among nine genomes. Based on the above-constructed phylogenetic tree, the dN/dS for each orthologous unigene pair was evaluated in the different branches using a free ratio model (model=1), which allows for a separate dN/dS ratio for each branch. We found that wild diploid cotton (GD1, GD3, GD5, GD5_ref, and GD8) had higher dN/dS ratios than the branch of cultivated cotton (GAD1, GAD2, and GA2). Under natural selected pressure, wild species are showed a fast evolutionary ratio based on adaptive choices (Figure 2D). A total of 163, 344, 330, and 161 PSGs in GD1, GD3, GD5, and GD8 were identified by PAML software, respectively (Table S4). PSGs of diploid kinds of cotton were enriched in GO terms related to protein modification, protein ubiquitination, RNA processing, and ncRNA processing, involved in environmental adaptation (Table S5). What is more, we also found 180, 77, 51, 103, and 70 PSGs in GA2, GAD1_A, GAD1_D, GAD2_A, and GAD2_D, and these PSGs were enriched mRNA metabolic process (Table S6).
Gene Expression Divergence and conservation
We detected expression levels of a total of 41,053 transcripts in at least a single sample. Global expression level distributions of 40 data sets were similar to each other. Twenty-four datasets of leaves without stress treatment were used to detect gene expression divergence and conservation of four species. Samples of GD5C0R1 and GD5C0R2 were as a control group to identify different expression transcripts. Patterns of differential gene expression were characterized by four species (Figure 3A). A total of 29,512 DEGs was clustered eight profiles based on gene expression pattern use k-means clustering method. Profiles with a large number of DEGs, display a similar expression pattern (Profile 1, 2 and 3) of four species. The profiles 1 and 2 showed considerable differences in expression over time in the leaves of four species. In the Profile 1, the DEGs were down-regulated expression level over time and enriched in carbon metabolism and carbon fixation in the photosynthetic organism. On the contrary, DEGs of the profile 2 were up-regulated and enriched in plant circadian rhythm and ribosome biogenesis in eukaryotes. In the profile 3, DEGs were significantly enriched biosynthesis of unsaturated fatty acids, DNA replication, and photosynthesis-antenna proteins process (Figure 3B). Those pathways are indispensable during leaf's growth and development, while transcripts expression of those pathways is conservation during the divergence of D-genome species of cotton. Profile 4-9 was observed a considerable species-specific expression pattern. In the profile 4 and 5, 3662 and 1048 DEGs displayed GD5-specific expression pattern. What′s more, in the profile 6, 1516 DEGs displayed GD1-specific expression pattern. And, in profile 7 and 8, showed GD3-GD5 and GD1-GD8 special expression pattern. Given GD1 and GD8 with a close relationship, it made perfect sense that GD1 and GD8 contain more conservative biological pathways.
Evolutionary conservation and divergence of the Gene Co-expression Networks
Notwithstanding the analysis of gene expression pattern provided insights of gene expression divergence and conservation, co-expression gene networks, constructed by highly connected genes, is more relevant to important biological processes of growth and development and complex regulation of abiotic stress response. To grasp gene co-expression network conservation and divergence, all 40 datasets of leaves were used to construct a co-expression network by weighted gene co-expression network analysis (WGCNA). To improve the accuracy of WGCNA, we removed genes with FPKM< 1. In conjunction with connection strengths (soft-threshold power: 5, R2> 0.90) among 12110 genes (Figure S1), a global view of co-expression network topology among four species was constructed. Genes in the same showed module showed a higher topological overlap (Figure S2). Finally, a total of 33 modules were used for investigating evolutionary conservation and divergence of gene co-expression networks, which are defined as clusters of highly interconnected genes (Figure 4A). In these modules, the number of the genes ranged from 31 (dark orange) to 3128 (turquoise) with high correlation coefficients with one another in corresponding modules. Every gene connectivity was evaluated on each module. Highly connected genes (kME> 0.95) were identified as hub genes. Ultimately, a total of 425 hub genes (ranging from 1 to 293 within the modules) were detected.
In each module, expression levels of all genes were displayed by a heatmap and were summarized by the eigengene values (the first principle component of module expression profiles). Seven sample conditions were defined for identified significant modules (Figure 4B). Share.T.S group was used for identifying conservative shared networks of the cold and salt stress response. Additionally, the module, correlated with cold and salt stress response, was identified by Cold and Salt Groups, respectively. GD1, GD3, GD5, and GD8 Groups were used for identifying genome-specific modules. By association analysis between eigengenes and sample conditions (7 groups: Share.T.S, Cold, Salt, and 4 genomes) via Pearson correlation coefficient analysis, 29 modules were identified with significant genome-specific and/or abiotic stress-regulated co-expression patterns (ANOVA, P<0.05). Four major modules of highly co-expression genes were most strongly correlated (Pearson’s correlation r>0.9) with four genomes, respectively. The largest module (turquoise), containing 3128 highly connected genes in GD5, enriched in response to stimulus and immune system process (Table S7). A second module (Blue, 1,193 genes), showed GD3 specific. The other two modules (yellow with 420 genes and purple with 442 genes), which included few highly connected genes, respectively related to GD1 and GD8. Interestingly, all four major modules displayed similar results of GO enrichment analysis. Nearly half genes enriched for GO terms related to response to stimulus and immune system process in each module, revealing that abiotic and biotic stresses have played a major role driving transcriptional variation among these four species. We noticed that characteristics of the seven modules were unique related to the corresponding group, and among these modules, five modules were genome-specific. What′s more, most of the modules most strongly correlated with four genome group. These results suggested transcriptional variation is mostly correlated with genome divergence. We observed some modules overlapped among different groups. Overlap modules are helpful to understand the similar biological characteristics among two or more genomes. For example, the dark green module was significantly showed a positive correlation with GD1 and GD3, but a negative correlation with GD5 and GD8, and enriched sesquiterpenoid and triterpenoid biosynthesis and flavonoid biosynthesis which were known related with abiotic and biotic stress resistance.
Of particular concern is that co-expression networks related to cold and salt stress, due to four diploid species presented different cold and salt stress resistance. Association analyses between co-expression modules and abiotic stress (Cold and salt stress) revealed that 17 modules correlated with abiotic stress (Shar.T.S: 7 modules; Cold: 8 modules; Salt: 8 modules), thus representing suites of interconnected genes underlying the biological process of the abiotic stress response (Table S7). Noticeably, among four species, GD3 and Salt/Cold group had more overlapped modules. Considering GD3 showed better cold and salt tolerance than the other three species, suggested GD3 evolved more complete mechanisms in abiotic stress adaptation. On the contrary, almost no one module showed overlap between GD8 and abiotic stress groups, and it is foreseeable that GD8 manifests as significantly sensitive under cold and salt.
Interestingly, we found the skyblue3 module displayed a significant negative correlation with Share.T.S, Cold, and Salt group. One hub gene, RALF (rapid alkalization factor)-like, was identified in this module. The homologs of this gene in Arabidopsis may regulate plant stress, growth, and development. Fifteen genes in this module that is interconnected with the RALF-like protein coding gene. Except for five uncharacterized genes, ten out of fifteen interconnected genes are related to GO term of response to the stimulus, including Gorai.005G234900, Gorai.007G094200, Gorai.N023400, and Gorai.011G238800. Homologs of them in Arabidopsis have annotated four important transcription factors: MYB44, TCP9, TCP12, and GATA8, and involved in abiotic stress response. We also observed two xyloglucan endotransglucosylase/hydrolase protein 22 (XTH22)-encoding genes, Gorai.003G052400 and Gorai.009G006400, which involved in carbohydrate transport and metabolism. Homologs of Gorai.005G094300 in Arabidopsis, EXORDIUM protein-encoding gene, required for cell expansion in leaves, and may be involved in signaling processes that coordinate brassinosteroid (BR) responses to environmental or developmental signals (Figure 5). Most of the fifteen genes were decreased the expression level after cold and salt stress treatment.
Characterization of DEGs under cold and salt stress and validation of key genes under salt-alkali stress conditions
The false discovery rate (FDR) ≤ 0.001 and log2 rates ≥ 2 (8 groups: treatment/control) were used to identify DEGs in four species. The number of DEGs varied from 459 to 3372 among treatments of 8 groups (Figure S3). A total of 7515 DEGs were found, occupying 20.04% of total detected genes. Interestingly, the number of DEGs among salt stress groups was always smaller than those among cold stress group in four species examined (Figure 6A), indicated that the cold stress response was divergent than salt stress response, and more unique DEGs (6405, 85.2% of all DEGs) in one type of abiotic stress again proved that. But, 1,109 genes (16.1% of cold DEGs; 63.3% of Salt DEGs) were found in two different abiotic stresses, indicating potential share regulated pathways in cold and salt stress response. As expected, 1109 genes were enriched in the oxidation-reduction pathway, which was involved in multiple abiotic stress responses.
We thus focused on DEGs in GD3, on the account showed higher tolerance under cold and salt stress, and ultimately identified 2759 DEGs in salt and cold groups of GD3, including 2,300 genes cold DEGs, 630 salt DEGs. Among those genes, we found171 share DEGs under salt and cold stress, containing 80 down-regulated and 75 up-regulated DEGs (Figure 6C). Interestingly, 102 genes of share DEGs involved response to stimulate, and 32 genes of those genes were related to response to cold or salt stress, including 13 transcriptional factors encoding genes (NAC, ERF, MYB, G2, HD-ZIP) are putatively related to response to abiotic stress, since some homologs of those genes in Arabidopsis related to the abiotic stress response. For example, Gorai.002G073700 encodes the homolog of Arabidopsis NAC72, which binds to a drought-responsive cis-element in the early responsive to dehydration stress 1 promoter [21]. Also, both Gorai.001G239000 and Gorai.006G017400 encode homologs of MYB-like protein in Arabidopsis that involve in plant defense response [22]. Notable a total of thirty cold DEGs and three salt DEGs were PSGs. Moreover, three cold DEGs and 12 salt DEGs were in the skyblue3 module. Although fewer DEGs (Cold DEGs: 6.1%; Salt DEGs: 1.7%) were identified above all salt DEGs, some DEGs were overlapping with the PSGs or skyblue3 module (Figure 6B). Those results indicate again adaptive evolution drives transcriptional diversity, and a share regulated network involved in cold and salt stress response. We also found Gorai.006G147500 of skyblue3 module genes is positively selected and is interconnected with the hub gene, RALF-like protein coding gene. It further confirmed the potential regulated network that was identified in the previous analysis. Validation of the key genes through reverse genetics revealed that the VIGS-plant's ability to adapt to the salt-alkali was highly compromised. The VIGS-plants exhibited wilting and gradual drying of the leaf margin upon exposure to salt stress (Figure 7A), the results were in agreement with previous findings which demonstrated that salt stress does negatively affects the leaves and in turn, leads to a significant reduction in above-ground biomass. [23]. Furthermore, all exhibited significantly low biomass (Figure 7B), moreover, determination of ion content of the various tissues, the VIGS-plants registered a significantly lower concentration of Na+/K+ ions, an indication that the balance between the two ions is affected due to salt-alkali toxicity (Figure 7C).