Contrasting genotypes of cotton, an overview of full-length sequencing, and a bioinformatics pipeline for FL transcript analysis
Three yield-contrasting cotton genotypes, including SJ48, Z98, and DT, were selected from prevoius field studies of various genotypes conducted at three main cotton belts in China for two years. The genotype SJ48 is from the Yellow River cotton belt of China, while both other genotypes are from the Yangze River. The genotype SJ48 displayed a higher cotton yield and stable performance across all environments. In contrast, both other genotypes had diverse performance across all environments. Specifically, Z98 showed the lowest yield than SJ48. Their analysis of the yield trajectory showed significant variation in yield and fiber quality traits among these genotypes (Shahzad, et al. 2019). Meanwhile, these geneotypes showed distinct root morpohlogical and biomass mass trait variability at the early seedling stage (Fig. 1). The root phenotype, length, surface area, diameter, and volume were significantly different among these genotypes (Fig. 1a-f). The variability in fresh and dry biomass showed significant differences (Fig. 1g-h). However, above ground traits such as shoot length and leaf size had significantly less variance among these genotypes (Fig. 1i). Variability in seedling growth stages will probably disrupt the substantial energy biomass resources for yield enhancement as well as to mitigate various stresses.Thus, their transcript profiling can enable us to understand the phenomenon of important economic traits. To construct SMRT bell libraries, the high-quality total mRNAs from the root and first true leaf were pooled equally for every cotton genotype. A total of 14.48G raw bases were generated for ARL (SJ48), which yielded 12751616 filtered subreads with a mean length of 1136 bp and an N50 of 2118 bp (Table S2). The 14.54G raw bases in BRL (Z98) yielded 15577536 filtered subreads with a mean length of 934 bp and an N50 of 2194 bp. Whereas 13.3G raw bases in DRL (DT8) yielded 6222044 filtered subreads with a mean length of 2137 bp and an N50 of 2591 bp (Table S2). The length distribution of subreads revealed DRL had a large difference in read length compared to ARL and BRL inbred parents (Fig. S1a–c). After subreads integration and error correction, a total of 623730 circular consensus sequences (CCSs) reads of ARL were divided into 452708 full-length (FL) and 373468 full-length non-chimeric (FLNC) reads (Table S3). In BRL, 615287 CCS reads were divided into 444750 FL reads and 354876 FLNC reads. The 560143 CCS reads of DRL were divided into 413090 FL and 397153 FLNC reads. The mean length of FLNC reads was almost similar in ARL and BRL but higher in DRL (Fig. S2a-c). Further clustering and polishing yielded 226272 high-quality FL consensus reads with a mean length of 2112 bp in AR, whereas high quality FL consensus reads were 211904 in BRL with a mean length of 2273 bp. In DRL, the number of high-quality FL consensus reads was 212885, with a mean length of 2696 bp (Table S4). The high-quality FL consensus reads had significantly improved length distribution, and all sequenced libraries showed similar trends as compared with filtered subreads (Fig. S1d–f). In addition, comparison with high-quality Illumina HiSeq transcripts further confirmed the accuracy of constructed FL transcripts (Table S5). The total reads mapped to the TM-1 reference genome were almost 82%, while the mean percentage of unmapped, multi-mapped, and uniquely mapped reads was approximately 17, 50, and 32, respectively, in this sequencing (Table S6).
Diversity of long-length isoforms in contrasting genotypes of cotton
We developed a brief bioinforamtic pipeline to identify full-length isoforms, and the FL transcript was compared with the cotton reference genome to identify novel genes and novel isoforms. In addition, the functional annotations of FL transcripts were retrieved from various databases to further strengthen the reference transcriptome. It was found that most of the mapped FL transcripts were aligned to nucleotide sequences (NT), followed by non-redundant protein sequences (NR), and the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Fig. S3a). In contrast, the majority of unmapped reads had annotations retrieved for the Protein Family (Pfam), Swiss-Prot, and NR (Fig. S3b). Our comparative analysis detected a diversity of isoforms in each cotton geneotype. The 13515 transcripts were from novel genes. In total, 88423 isoforms were identified in ARL, which were classified into three different groups. 15.7% were full-splice matches (FSM) matching completely to known genes. 72.3% were novel in the catalog due to non full coverage (NIC), which corresponds to novel isoforms from known genes, and 11.9% were novel not in the catalog (NNC), which corresponds to novel isoforms from novel genes (Fig. 2a). Whereas 13514 tanscripts of BRL were identified as novel genes, the total number of isoforms was 83927, the percentage of FSM isoforms was 15.1, the percentage of NIC isoforms was 73.4, and the percentage of NNC isoforms was 11.5 (Fig. 2b). In DRL, the number of identified novel genes was 13514. The total number of detected isoforms was 8,1513. Of which, 16%, 73.4%, and 10.7%, respectively, belong to the FSM, NIC, and NNC isoform groups (Fig. 2c). These higher landscapes of novel isoforms indicate the diversity of structural transcripts in each cotton geneotype. Our comparative analysis further determined 9.3% (n = 17736) overlapped isoforms among ARL, BRL, and DRL. However, 27.5% (n = 52307) isoforms were specifically detected in ARL, whereas 25.1% (n = 44666) isoforms were specific to BRL, and 23.4% (n = 47910) isoforms were specific to DRL (Fig. 2d). The identification of large number of specific isoforms reveals variability in these genotypes. These results not only enrich the knowledge about full-length transcriptomes but also suggest that exploring novel isoforms remains an unexploited biological research area in cotton, especially in the complex allotetraploid genome of upland cotton. Thus, we further characterized novel isoforms (NIC and NNC) in this study. Their subgenomic distribution has shown that novel isoforms were equally distributed across the At and Dt subgenomes of cotton (Fig. S4). Among different groups of novel isoforms, each of the At and Dt subgenomes of cotton generated approximately 45% NIC type isoforms. However, both subgenomes combined generated 9% of NNC type isoforms, and only 1% of novel isoforms belonged to different scaffolds (Fig. 2e). In addition, novel isoforms of these cotton genotypes were widely distributed across the 26 chromosomes of cotton. The chromosomes A05 and D05, followed by chromosome D11, had a significantly higher number of novel isoforms in ARL and BRL, while chromosome A04 had the lowest proportion of novel isoforms among the others (Fig. S5). Similar statistics were observed on these chromosomes for both NIC and NNC types of isoforms in ARL, BRL, and DRL (Fig. S6).
It was detected that almost the same proportion of NIC type isoforms was associated with the antisense and sense strands in each cotton genotype. In contrast, above 80% of NNC type isoforms were associated with the sense strand (Fig. 2f). The comparative analyses of the number of exons showed NIC contained significantly higher number of exons than NNC (median 7 vs 1) in each genotype. It was further observed that both types of novel isoforms possessed significantly different median exon structures in the root and leaf. Both BRL and DRL possessed a higher median exon structure than ARL (Fig. 2g). The meadian length per isoform is significantly different for three geneotypes and novel isoforms. The meadian length of NIC was 2228, 2464, and 2313, respectively, for ARL, BRL, and DRL, wherein the median length of NCC in ARL was 2200, 2474 in BRL, and 2289 in DRL (Fig. 2h). Our computational analyses found enrichment of novel isoforms with different exon structures and lengths for each cotton genotype. For instance, Gh_A12G0099 encodes phototropin-2 and is predicted to be involved in stomata opening and photomorphogenesis stimuli in leaves. It yields a different number of novel isoforms in all genotypes. Of which, seven were overlapped, 18 specific to ARL, 20 specific to BRL, and 10 specific to DRL (Fig. S7a). These variabilities in isoforms may have a relationship with the functional diversity of genes, which ultimately results in disruption of the growth pattern of cotton. Among transcript structural variables, alternative splicing is a key post-transcriptional regulatory mechanism, yields more than one splicing isoform through various splicing models (Fig. S7b), and ultimately increases protein diversity and functional domains. Within a geneotype, a total of 21348 AS events corresponding to 33966 novel isoforms were detected in ARL, which contained 7918 intron retention (IR), 5567 alternative 3 splice sites (A3), 3234 alternative 5 splice sites (A5), 2141 skipped exons (SE), 134 mutually exclusive exons (MX), and 1354 others (Fig. S7c). Whereas, a total of 21186 AS events were escalated from 35071 novel isoforms in BRL and consisted of 7851 IR, 5510 A3, 4273 A5, 2122 SE, 128 MX, and 1302 others (Fig. S7c). A total of 37176 AS events were associated with 37176 novel isoforms in BRL, consisting of 8185 IR, 5932 A3, 4573 A5, 2382 SE, 139 MX, and 1438 others (Fig. S7c). These higher alternations of splicing events probably act as an adoptable mechanism against changing evernionement in cotton.
Expression analysis based profiling of novel long-read isoforms
To quantify the expression of long-read isoforms as fragments per kilobase of exon model per million reads mapped (FPKM), we utilized previously pulished high coverage short-read RNA sequenced datasets of the same root and leaf samples, which enabled us to determine more accurate expression profiles in each cotton genotype. The analysis results showed long-read isoform derived expressions were quite different in root compared with leaf, and root had a different FPKM distribution than leaf in each cotton geneotype (Fig. 3a). Almost 86% of novel isoforms in the root and leaf had expression levels in the range of 0–15 FPKM (Fig. 3b). The comparative analysis showed nearly 10K novel isoforms had significant differential expression among ARvsBR and ARvsDR, while the comparison of ALvsBL and ALvsDL showed almost 15K novel isoforms with differntial expression (Fig. 3c). Among differential regulation, most of the isoforms exhibited higher regulation in the A root and leaf compared with the B and D roots and leaves. Within each sample, we identified that 85% of isoforms with differential regulation were NIC type and only 15% were NNCs (Fig. 3d). This indicates that differential regulation of NIC isoforms may contribute to differential growth and performance during adverse conditions, which results in different levels of seedling growth. It was detected that most NIC isoforms exhibited differential expression specific to root or leaf, whereas a smaller number of isoforms exhibited differential regulation in pairwise comparisons of both root and leaf (Fig. 3e). Only 593 shared isoforms with differetial expression were identified among the root and leaf of these contrasting genotypes. Further plotting indicated a significant fold change difference among the NIC and NNC types of isoforms. For NIC type, the median fold change differences were 0.7340 in ARvsBR, 0.7438 in ARvsDR, -0.5403 in ALvsBL, and 0.3467 in ALvsBL. In contrast, the median fold change differences for NCC type were 0.3075, -0.4084, -0.7214, and 0.3508, respectively, in ARvsBR, ARvsDR, ALvsBL, and ALvsBL (Fig. 3f). These findings further confirmed novel isoforms relatively had higher expression level differences in root, which indicates novel isoforms in root have more diversified expression profiles in cotton.
Comprehensive characterization of the novel isoforms among contrasting genotypes of cotton
Based on the above findings, we only selected core isoforms from NIC group for further study. Analyzed results showed a dynamic expression changes specific to the root or leaf. In root, expression based cluster analysis confirmed core isoforms (n = 2000) had distinct clusters in A, while both B and D were clustered together (Fig. 4a). The majority of these isoforms had higher expression levels in A, which may contribute to better root development and may drive various mechanisms related to growth and development. Our functional analysis of root core isoform target genes revealed their association with biological processes including protein phosphorylation, oxidation-reduction processes, and various metabolic processes (Fig. 4b). Some core isoform target genes had functions in cellular processes and were enriched for membrane-related processes. In the molecular functional class, the majority of core isoform target genes showed enrichment to ATP binding and varoius activities such as catalytic and transpoter. Moreover, pathway enrichment analysis identified that root core isoform annotated genes were enriched in metabolic, biosynthesis of secondary metabolites, biosynthesis of antibiotics, and carbon metabolism related pathways (Fig. 4c). The core isoforms from root had lower mean than leaf for total exon numbers and total sequanced length (Fig. 4d-e). The functioanl regulation patterns of these core isoforms may disrupt the performance of these pathways in the root, thereby, leading to modifications of biological and biochemical processes during the reproductive growth of cotton. Among core isoforms of leaf, expression based cluster analysis identified that the expression level of core isoforms is different in A than others, and most of these isofrms expression is relatively low in leaves of B and D (Fig. 4f). The core isoforms from leaf had higher mean than root for total exon numbers as well as total sequanced length (Fig. 4d-e). The major portion of genes associated with core isoforms of leaf were enriched for protein phosphorylation, along with protein phosphorylation related biological processes. In the cellular component group, these were enriched for nucleus and cytoplasm annotated functions. At molecular functions, the mojar group of isoforms encoding genes was enriched for ATP binding, catalytic activity, and protein kinese activity (Fig. 4b). In addition, it was identified that metabolic, biosynthesis of secondary metabolites, biosynthesis of antibiotics, and carbon metabolism related pathways can be altered with differential expressions of isoforms (Fig. 4c). Together, these findings show that the dynamic regulation of isoforms can imbalance several biological and molecular processes of target proteins the root and leaf, which are indespensible during early growth and developmental stage of cotton.
Identification of novel isoforms target protein networks in contrasting genotypes of cotton
The long-read isoforms enable us to understand the total diversity of transcript structural variables and, therefore, can provide novel insight into the physiological, biochemical, and molecular processes of target traits. In cotton, the identification of potentially important novel isoforms can address the gap in the genetic mechanism of yield traits. Our analysis identified 647 key novel isofrms that had differential regulation both in the root and leaf of yield cotrasting genotypes of cotton (Fig. 5a). Of which, we considered potential isoforms those that have a higher length (~ 1500 bp), have at least two exon structures, and must be annotated with one target gene. With this advancement, we identified 26 potentially important novel isoforms that had shown significant fold change differences in the range of -4 to 3 among the comaprsions of ARvsBR, ARvsDR, ALvsBL, and ALvsDL (Fig. S7). Novel isoforms Gh_A01G1792_N11 and Gh_A01G1792_N14 annoated with target protein enolase 2 had the highest expression level in B compared to A and D cotton geneotypes (Fig. 5b). Circadian pathway and light depend pathway enriched GIGANTEA (GI) encoding Gh_A02G0645_N17 and Gh_A02G0645_N39 isoforms were identified with lower expression in the roots and leaves of A, B, and D cotton geneotypes (Fig. 5b). In contrast, carbon metabolism pathway enriched D-glycerate 3-kinase annoated isoforms Gh_D05G1215_N12 and Gh_D05G1215_N25 showed higher regulation in A compared with B and D cotton geneotypes. Furthermore, Gh_A05G0536_N03 encodes myosin heavy chain-related, organic cation/carnitine transporter 7 and annotates Gh_D08G0875_N23, Gh_D08G0875_N36, selenium-binding protein 2 related Gh_A10G1144_N02, Gh_A10G1144_N03, and bifunctional nuclease 1 Gh_D02G0956_N01 and Gh_D02G0956_N06 were identified with higher expression levels both in the root and leaves of A as compared with B and D cotton geneotypes. Among others potentially important novel isoforms, calcium-transporting ATPase 12 is related. Gh_A08G0581_N06, Gh_A08G0581_N10, pirin-like protein related Gh_D10G1286_N01, Gh_D10G1286_N02, and CTP synthase related Gh_D12G0472_N08, Gh_D12G0472_N13 had significant differential variation in root and leaf of all these yield contrasting genotypes of cotton (Fig. 5b). The isoforms are compulsory for functional activity of traget proteins, and spliced isoforms cause proteome diversity, therefore, we assumed a protin-protein interaction (PPI) network of traget genes accompanying potentially important novel isoforms might help to understand the complexity of transcriptional events that can influence the molecular mechanisms underlying cotton seedling growth contributing traits. In particular, it was observed that novel isoform target genes displayed protein interaction with 691 other genes (Table S7), some of which had novel isoforms with differntial expression in the root or leaf of these contrasting genotypes of cotton. In this network, most genes showed function enrichment in the circadian rhythm plant, plant pathogen interaction, and signaling pathways. In particular, the predicted PPI network of Gh_A02G0645 was relatively higher (Fig. 5c), and it was proposed to have PPI with many circadian rhythm enriched genes, including the same GIGANTEA encoding related Gh_A09G0775, Gh_D02G0690, CCT motif containing response regulator protein annotated Gh_A03G0597, Gh_D11G2916, protein REVEILLE8s annotated Gh_A09G1504, Gh_D09G1515, and protein LHY annotated Gh_A11G0926, Gh_A12G1061, Gh_D11G1068, Gh_D12G1184. From these gene networks, our analysis identified novel isoforms Gh_D02G0690_N50, Gh_A11G0926_N23, together with Gh_D11G1068 related novel isoforms Gh_D11G1068_N03, Gh_D11G1068_N7, Gh_D11G1068_N20, and Gh_D11G1068_N54, which showed dynamic expression regulation either in roots or leaves of yield contrasting cotton genotypes (Fig. 5b). In agronomic crops, most recent studies confirm that circadian rhythm is an endogenous time keeping mechanism in crops that contributes to growth, photosynthetic, leaf senescence, photoperiodic flowering, biomass, heterosis, biotic and abiotic stress resistance, and other agronomic traits by regulating target gene transcription, protein activity, and biochemical metabolism (Steed, et al. 2021). Thus, the differential regulation of novel isoforms, especially those annotated with circadian oscillator target genes, may disrupt the PPI network, which in turn may modulate transcriptional regulation crucial for growth changes in cotton.
Functional analysis of Gh_A02G0645_N17 in contrasting genotypes of cotton
Three novel isoforms of Gh_A02G0645 were selected for validation with real-time qRT-PCR analysis. Their relative expression showed significant differences among roots and leaves of the A, B, and D geneotypes of cotton (Fig. 6a-c). Their expression trends were similar to those determined with long-read single-molecule RNA sequencing (Fig. 5b). The RT-PCR analysis quantified that three novel isoforms related to the GI had higher expression levels in leaves than roots (Fig. S9). Hence, we selected leaves and further validated the diurnal expression of three novel isoforms related to the GI (Fig. 6d-e). It was observed that all these isoforms showed significantly higher expression peaks around Zeitlupe (ZT) 12 hours in each genotype. Noteably, these isoforms respond significantly differently to day light in each cotton genotype (Fig. 6d-e). This result suggests rhythmic expression of GI related isoforms probely maintian light depend endogenous biological rhythms important for development of cotton. To determine the role of Gh_A02G0645 isoforms for cotton seedling, we conducted a VIGS experiment of Gh_A02G0645_N17 in SJ48 (A) and Z98 (B) genotypes of cotton. The relative expression levels of CLCrV:00 and CLCrV:Gh_A02G0645_N17 confirmed the silenced expression of Gh_A02G0645_N17 in both cotton geneotypes (Fig. 7a). Phenotypically, CLCrV:Gh_A02G0645_N17 plants showed improved growth as compared with the control plant in both cotton genotypes. In addition, the silenced expression of CLCrV:Gh_A02G0645_N17 significantly improved plant height, fresh and dry biomass in each cotton genotype (Fig. 7d-f). These results revealed that Gh_A02G0645 mediates seedling growth in cotton (Fig. 7b-d). Functions of isoforms depend on amino acid sequence structure, which explains how isoforms with a certain degree of sequence homology perform different functions. The varoius protein deep learning based approaches now can predict accurate 3D structures of proteins, thereby, we predicted the 3D structures of protein GI (Fig. S10a) and its three novel isoforms, Gh_A02G0645_N17 (Fig. S10b), Gh_A02G0645_N39 (Fig. S10c), and Gh_A02G0645_N42 (Fig. S10d), and compared the structural differences between novel isoforms and reference protein GI. These novel isoforms showed random or less structural similarities with the reference protein GI (Fig. S11). Gh_A02G0645_N17 amino acid showed very less consistancy score (Fig. S11a) than other isoforms (Fig. S11b-c) Hypothetically, structural differences are expected to generate functional diversity. However, these structure based predictions need experimental validation.