RNA presequencing and data quality
The data retention rate is high and the data quality is good. The total RNA-seq data is 86,700, and the average length is 95.79 bp. The samples are all in accordance with the 4G requirements, which meets the customer and analysis needs.
Table 1
The statistical results of the data before and after the quality pretreatment are as follows
Sample | Raw Data | Valid Data | Valid Ratio (reads) |
Read | Base | Read | Base | Average length |
Treat | 51104338 | 5110433800 | 44318328 | 4240686660 | 95.69 | 86.72% |
Control | 48321894 | 4832189400 | 42394518 | 4065816222 | 95.90 | 87.73% |
All | 99426232 | 9942623200 | 86712846 | 8306502882 | 95.79 | 87.21% |
3.2 De novo Assembly
The valid reads of 2 samples of cotton were combined for de novo splicing, using the software Trinity, version trinityrnaseq_r2011-11-26, using the paired-end splicing method. The splicing sequence was reduplicated, and finally 400,708 transcripts larger than 100 bp in length were obtained, and the size was 125 Mb. The longest transcript under each locus was obtained as Unigene, and 207,241 Unigenes were obtained, and the size was 64 Mb.
Table 2
The statistical table of results of de novo splicing
| All ( > = 100bp) | >=500 bp | >=1000 bp | N50 | N90 | Total Length | Max Length | Min Length | Average Length |
Transcript | 400708 | 58960 | 15884 | 397 | 145 | 125386375 | 11381 | 101 | 312.91 |
Unigene | 207241 | 30858 | 9802 | 429 | 133 | 64511685 | 11381 | 101 | 311.29 |
The transcriptome sequencing, stitching and the GC contentdetermination of the analysed transcription data
After pretreatment, 9.94 million raw reads were obtained, and after filtering 8.67 million valid reads with average length of 95.79bp were produced by Solexa sequence. Moreover, 400,708 transcripts with lengths greater than or equal to 100bp were achieved by denovo assembly after filtering out the redundant sequence (Fig. 1A). Only one longest transcript of every locus was reserved with 207,241 unigenes. Furthermore, only transcripts with length greater than or equal to 500bp were further analyzed. Moreover, the GC analysis of all the transcripts (≥ 100bp) revealed the feature of the ratio of canonical bases that shows the average content ratio of GC of all these transcripts that is 39.7%, and AT is 60.3%. The least GC content was 8.49% while the highest was 81.67%; and 90% of all the sequences contained GC of more than 90% (Fig. 1B).
Gene annotation and function classification
We annotated the transcripts (≥ 500bp) based on 5 different protein database, namely, conserved domain database (cdd), non-redundant (nr), protein families (pfam), sprot and TrEMBL, in which 21,609; 38,175; 20,993; 38,322 and 32,370 transcript were annotated, however, non-redundant and pfam databases provided the highest proportion of annotated transcript with 100% similarity (Table xxx). Further analysis on the annotated transcripts were analysed by blasting in other plant genomes, in which most of the sequences were matched by cotton transcript belong to Vitisvinifera L. (Vitis), RicinuscommunisL. (Ricinus) and Populus, with the number 10,360, 9,856, 8,192, and the ratio in all transcripts were 27.1%, 25.8%, 21.2%, respectively (Figure xxxx). Moreover, only 1,381 transcripts matched with the known sequences from cotton with the ratio of 3.6% (Fig. 2A, Table 1)
Table 3
Sequencing and identification of the transcription based on various databases
Percentage identity similarities |
Protein databases | 100% | 90-99.9% | 80-89.9% | 70-79.9% | 60-69.9% | 50-59.9% | 40-49.9% | 30-39.9% | 20-29.9% | 10-19.9% | 0-9.9% | Totals |
cdd | 6 | 435 | 1,351 | 2029 | 2896 | 4245 | 5447 | 4755 | 443 | 2 | 0 | 21609 |
nr | 353 | 3787 | 8,812 | 10388 | 7756 | 4601 | 1946 | 531 | 1 | 0 | 0 | 38175 |
sprot | 50 | 761 | 2,627 | 3968 | 4318 | 4121 | 3339 | 1793 | 16 | 0 | 0 | 20993 |
trembl | 188 | 2770 | 7,562 | 10268 | 8468 | 5635 | 2845 | 582 | 4 | 0 | 0 | 38322 |
pfam | 248 | 5010 | 9,183 | 7351 | 4876 | 3204 | 1751 | 733 | 14 | 0 | 0 | 32370 |
Totals | 845 | 12763 | 29535 | 34004 | 28314 | 21806 | 15328 | 8394 | 478 | 2 | 0 | 151,469 |
Clusters of Orthologous Groups of proteins (COGs)of the annotated transcripts.
The rational of classifying the proteins encoded in sequenced genomes is important for making the genome sequences maximally useful for functional and evolutionary studies [24]. Out of the five databases, nr annotated transcripts were further analyzed and out of 38,175 Nr hits, 3,586 sequences were assigned to the COG classifications (Fig. 2B). Among the 23 COG categories, the cluster for General function prediction only (584, 14%) represented the largest group, followed by Transcription (163, 5%), Replication, metabolism and carbohydrate transport (1,200, 4%), recombination and repair (217, 7%), protein turnover, Post-translational modification and chaperones (369, 9%), and Translation, Signal transduction mechanisms (1,487, 3%), biogenesis and ribosomal structure (11,615, 75%), only 5 unigenes were assigned to Cell motility and secretion (Fig. 2B)
The result of KEGG pathway analysis showed that 13,088 (22.2% of all the annotation genes) transcripts (several transcripts hit multiple pathways) mapped to 284 pathways belong to all the five categories of KEGG, including metabolism, genetic information processing, environmental information processing, cellular processes, and organismal systems. Out of the ten pathways, the plant hormone signal transduction pathway was regulated by the highest number of transcripts, 446 transcripts, which were mainly responsible for the signal transport, a critical process being the very first step initiated by various molecules in the plants when plant are exposed to drought stress. Secondly, the metabolism transport, which includesthe nucleotide and sucrose metabolism, this metabolism progress, mainly induces more genes, which are involved in the oxidative phosphorylation pathway; moreover, many genes were found to be involved in plant-pathogen interaction.
Gene Ontology (GO) analysis
Gene ontology provides the very fundamental role or functions of the various genes. The genes function are classified into three GO functional annotation, the biological process (BP), cellular component (CC) and molecular function (MF) [25]. We carried out GO functional analysis using Blast2GO program [26].A higher proportion of the transcripts were found to be associated with various GO functional terms, 164, 660transcripts harbored GO functions. All the GO functions were detected, in relation to cellular component, 342 functions were obtained and found to be associated with 83,572 transcripts, with the cell part (GO:0044464) and cell (GO:0005623) had the highest number of transcripts 11,332 in each, accounting for 13.6% of all the transcripts. In molecular functions, 818 GO functions were obtained, which associated with 26,490 transcripts, with binding (GO:0005488), and catalytic activity (GO:0003824) found to be associated with the highest number of transcripts, 18,778 and 16,604, respectively. Finally, in relation to the biological process, 998 GO terms were detected, with were linked to 54,598 transcripts. The metabolic process (GO:0008152), cellular process (GO:0009987), primary metabolic process (GO:0044238), and cellular metabolic process (GO:0044237), were the dominant biological processes associated with the highest number of transcripts, 16,448 (30.1%), 14,386 (26.3%), 12,061 (22.1%), and 11,717 (21.5%) transcripts, respectively. The various GO functions detected, have been found to associated with a number of stress responsive genes, such as the late embryogenesis abundant (LEA)[27], the multidrug and toxic compound extrusion (MATE) gene family [28], cyclin dependent kinase (CDK) genes [10], which showed that the transcripts could be playing a significant role in enhancing drought stress tolerance in Gossypium darwinii.
The analysis of differential expression genes (DEGs) and enrichment analysis
We evaluated all the transcripts detected in this, a total of 58, 639 were found to be differentially expressed genes (DEGs), in which 32,693 (55.75%) were upregulated genes, 127 (0.2%) not expressed and 25,919 (44.2%) were downregulated. The ratio of the differential expression genes indicated that the expression change of cotton genes to drought stress mainly is upregulated. We performed GO analysis of the upregulated genes, which was aimed to validate their biological function within the plants under drought stress conditions, 17,028 of the upregulated genes were associated by various GO terms, majority of which were highly associated with various forms of abiotic stress factors. The results showed that Gossypium darwiniiperhaps has a higher level of drought stress tolerance and could be used in breeding for more drought stress tolerant cotton germplasms. The KEGG was performed for the signal pathway analysis of high abundant differential expression genes. Upregulated genes mainly involved in the progress of glycerophospholipid metabolism, Glycolysis/Gluconeogenesis, amino sugar and nucleotide sugar metabolism, Lysosome, Alanine, aspartate and glutamate Fatty acid metabolism, metabolism, Pyruvate metabolism), Galactose metabolism), Cysteine and methionine metabolism (Fig. 4A). Among the various KEGG pathways, of significance was the ko04120, Ubiquitin mediated proteolysis. When plants are exposed to drought stress, the equilibrium and delicate balance between reactive oxygen species (ROS) release and elimination, becomes altered, leading to excessive accumulation of the ROS within the cell [29]. Excess accumulation of ROS triggers oxidative stress, which eventually lead to plant death [30]. The ubiquitin mediated proteolysis, aids in the elimination of various reactive oxygen species, thus reducing the lethal effects, and maintains the plants survival under stress [31].
Table 4
Pathway | Count | Percentage | Pathway ID |
1 | Plant hormone signal transduction | 446 | 3.41% | ko04075 |
2 | RNA transport | 361 | 2.76% | ko03013 |
3 | Spliceosome | 339 | 2.59% | ko03040 |
4 | Purine metabolism | 283 | 2.16% | ko00230 |
5 | Oxidative phosphorylation | 272 | 2.08% | ko00190 |
6 | Ribosome | 256 | 1.96% | ko03010 |
7 | Glycolysis/ Gluconeogenesis | 234 | 1.79% | ko00010 |
8 | Starch and sucrose metabolism | 220 | 1.68% | ko00500 |
9 | Plant-pathogen interaction | 207 | 1.58% | ko04626 |
10 | Protein processing in endoplasmic reticulum | 198 | 1.51% | ko04141 |