Assembly of the G. rotundifolium, G. arboreum and G. raimondii genomes
In this study, we applied Oxford Nanopore sequencing technology to assemble G. rotundifolium (K2), G. arboreum (A2) and G. raimondii (D5) genomes. G. arboreum and G. raimondii genomes have been de novo assembled previously using Illumina and PacBio reads20,23, but both genomes have a number of sequence gaps and require an improvement in assembly contiguity. We generated a total of 304 Gb, 212 Gb, 125 Gb Nanopore sequencing data with a genome coverage 124×, 131×, 167 × for K2, A2 and D5, respectively (Supplementary Table 1). We assembled 3,593, 1,173 and 366 contigs for G. rotundifolium, G. arboreum and G. raimondii with a contig length of 2.44 Gb, 1.62 Gb and 0.75 Gb, respectively (Table 1). These initial contigs were polished using Illumina paired-end reads with a genome coverage of 108×, 118×, 132 × for K2, A2 and D5. The contig N50 is 5.33 Mb, 11.69 Mb and 17.04 Mb for K2, A2 and D5, respectively. The maximum contig has a length of 32.72 Mb, 58.57 Mb and 43.74 Mb. After polishing contig using Illumina reads, we used high-through chromosome conformation capture (Hi-C) data to order and orient contigs, aimed at constructing pseudo-chromosomes of each species (Fig. 1a and Supplementary Figs. 1–4). In the Hi-C assisted assembly, 2,559, 485 and 201 contigs were placed on the 13 chromosomes of K2, A2 and D5 genomes, occupying over 99% of genome length (Fig. 1b).
Table 1
Summary of genome assemblies and annotations of G. rotundifolium, G. arboreum and G. raimondii.
Genomic feature | G. rotundifolium | G. arboreum | G. raimodii |
Total length of contigs | 2,444,364,209 | 1,621,008,062 | 750,197,587 |
Total length of scaffolds | 2,444,484,509 | 1,621,030,562 | 750,205,487 |
Total length of gaps | 120,300 | 22,500 | 7,900 |
Percentage of anchoring, bp | 99.28% | 99.47% | 99.57% |
Percentage of anchoring and ordering, bp | 93.16% | 98.84% | 99.01% |
Number of contigs | 3,593 | 1,173 | 366 |
Number of scaffolds | 2,390 | 948 | 287 |
Contig N50, bp | 5,326,689 | 11,691,474 | 17,043,680 |
Contig N90, bp | 621,066 | 2,910,421 | 3,537,560 |
Scaffold N50, bp | 177,839,665 | 129,592,444 | 57,716,579 |
Scaffold N90, bp | 115,394,628 | 93,157,762 | 49,929,625 |
Maximun contig length, bp | 32,728,186 | 58,575,076 | 43,739,617 |
Maximum scaffold length, bp | 205,722,655 | 143,367,608 | 63,188,200 |
GC content | 36.38% | 35.16% | 33.23% |
Percentage of repeat sequences | 80.92% | 68.05% | 57.04% |
Number of genes | 41,590 | 41,778 | 40,820 |
To verify the genome assembly completeness, we mapped the clean Illumina reads against each genome, and found that more than 97% of reads were aligned (Supplementary Table 2). More than 90% sequencing reads were perfectly mapped, suggesting the high sequence accuracy after base correction of Nanopore reads. We also performed Benchmarking Universal Single-Copy Orthologs (BUSCO) analysis to estimate the assembly completeness in genic regions, which showed that 92.5%, 93.89% and 95.42% of BUSCO hits for K2, A2 and D5 were found (Supplementary Table 3). Compared with recently published PacBio assemblies of A2 and D5 genomes18,20,23, our assemblies have an improvement of 6.3 fold and 2.7 fold in contiguity, reducing gaps from 1.16 Mb to 22.5 Kb for A2 and from 17.4 Kb to 7.9 Kb for D5 (Supplementary Tables 4–5 and Figs. 5–6). These genomes should represent reference-grade genomes for the three diploid cotton species.
Genome annotation
We applied three approaches including ab initio prediction, homology searches and transcriptome-based analysis, to predict genes in the three genomes (Table 1). Totally, we predicted 41,590, 41,778 and 40,820 genes for K2, A2 and D5 genomes, respectively. The genic regions have a similar length among the three genomes (Fig. 1c). About 27,014, 27,381 and 28,759 genes were transcribed in leaf tissue of K2, A2 and D5 (Fragments Per Kilobase of transcript per Million fragments mapped > 0.1). A total of 41,184 (99.02%), 41,624 (99.63%) and 40,653 (99.59%) protein-coding genes of K2, A2 and D5 genomes were functionally annotated based on InterPro, NR, Swiss-Prot and TAIR10 databases (Supplementary Table 6). In the three genomes, 20,782, 11,033 and 6,535 non-coding RNAs were predicted, including 132, 133 and 122 miRNAs (Supplementary Table 7). We also predict repeat sequences in the three genomes. The result shows that repeat related sequences have a length of 1,978 Mb, 1,103 Mb and 428 Mb for K2, A2 and D5, occupying 80.92%, 68.05% and 57.04% of genomes, respectively (Fig. 1c). The long terminal repeat (LTR) retrotransposons exhibit higher proportions in the centromeric regions that were predicted using previous centromere-related long LTR regions (Supplementary Table 8 and Fig. 7). These data showed that these genomes have a little different length of genic regions, and the large different content of repeat-related sequences in three cotton genomes contributes to the three-fold genome size change.
TE amplification and genome expansion
To explore whether lineage-differential transposon amplification contributes to the genome size variation in these three species, we classified TEs into different categories. We found that 1,761 Mb (72.07%), 1,029 Mb (63.52%) and 366 Mb (48.85%) of transposable elements in K2, A2 and D5 were class I LTRs, of which the vast majority were Gypsy (Fig. 2a). In different species, Gypsy exhibited a large different copy numbers among three species. Class II DNA transposon occupies 3.8%, 1.8%, and 3.1% of each genome. Compared with the TE number in D5, three types of class I (Gypsy, DIRS, LARD) and one type of class II (TIR) have the most obvious change of copy number in these categories, occupying 80.04%, 62.71%, and 47.22% of all TEs in K2, A2 and D5 genomes (Supplementary Table 9). It is observed that Copia did not show significant differences between K2 and A2 genomes. The amplification of TEs between K2-A2 and A2-D5 groups led to the larger size of intergenic sequences between genes (Fig. 2b; Mann-Whitney U test, P < 2.2 × 10− 16), indicating the role of TEs in enlarging the proportion of genomic non-coding regions. The lineage-differential LTR retrotransposon amplification was responsible for the genome size variation of the three genomes.
To document the details of transposable element amplification, we analyzed full-length LTRs in the three genomes. We identified a total of 26,852, 21,590 and 3,911 full-length LTRs in K2, A2 and D5 genomes respectively (Supplementary Fig. 8). These LTRs were clustered into families which could unravel the sequence similarity between different copies. It is found that 30% of LTRs in A2 were clustered with a family size of > 20. As a comparison, only 12% of LTRs were clustered in K2 and D5 (Fig. 2c). This result indicates that LTRs in A2 have higher sequence similarity than those in K2 and D5. To explore whether the sequence similarity is related to the time of LTR burst, we estimated the average age of LTR amplification in each species using the determined mutation rate per year. In K2, the insertion time peak of LTR retrotransposons was found at 4.5-5 MYA, while A2 had a more recent amplification peak at 0.6-1 MYA (Supplementary Fig. 9). Furthermore, we found that Gypsy, DIRS, LARD and TIR have the largest insertion time in K2 with a peak of 4.5-5 MYA (Fig. 2d). In agreement with previous estimates, these LTRs in D5 had a peak of 3–4 MYA and 0.6-1 MYA in A224. Of note is the observation that LARD have two amplification peaks (~ 1 MYA and ~ 4 MYA) in A2, of which the older peak was similar to the amplification time in K2 and D5 (Fig. 2d). The recent amplification of LTRs in A2 may explain why LTRs have higher sequence similarity. In addition, these results suggest that K2 genome has gained a large number of LTRs around 5 MYA compared with the A2 genome. The phylogenetic tree also supports a huge Gorge315,16 expansion of Gypsy-like retrotransposon in the clade III by comparing K2 and A2 with D5 (Fig. 2e).
Comparative genomics and evolution
TE amplification has contributed to a three-fold genome size variation, but we don’t know whether it has disrupted genomic synteny in gene regions. Here, we identified syntenic blocks between the K2 genome and the A2 or D5 genome (Fig. 3a). We found that 84.11% and 88.78% of the K2 genome shared genomic synteny in the K2-A2 and K2-D5 comparisons, respectively (Supplementary Fig. 10). In these regions, 26,579, 28,372 and 28,485 orthologous genes were included in K2, A2 and D5 genomes. Analysis of the syntenic block size showed that the K2 genome has the largest length and D5 has the smallest length, consistent with the three-fold change of genome size (Fig. 3b). Using the syntenic blocks, we identified a large rearrangement occurring in the Chr01-Chr02 chromosome between K2 and A2 genomes. This rearrangement did not occur in the K2-D5 comparison. Since the rearrangement was also found between the G. herbaceum (A1) and G. arboreum (A2) genomes20, it may suggest that this is a A2-specific event. Furthermore, comparison of the A2 genome with other published diploid cotton genomes, including G. thurberi (D1), G. turneri (D10) and G. longicalyx (F1), also revealed that this rearrangement only occurred in A2 (Supplementary Fig. 11). We identified another large rearrangement in Chr13-Chr05 in the comparison of K2 with A2 and D5 genomes (Fig. 3a). This rearrangement was supposed to be K2-specific by comparing with other diploid genomes (Supplementary Fig. 12).
We calculated synonymous substitution values (Ks) for syntenic gene pairs in each genome and between the genomes. We found that all the three species shared a common whole genome duplication (WGD) event occurring approximately 57–71 MYA (Fig. 3c). Analysis of orthologous genes showed that the three species might have undergone lineage divergence at the same period approximately 5.1–5.4 MYA (Fig. 3d), which was also shown between A2 and D5 previously24. Further, we found that K2, A2 and D5 genomes were divergent from their relative species of Gossypioides kikii approximately 8.5–10 MYA25 (Supplementary Fig. 13). This suggests a speciation event between Gossypium and Gossypioide genus. It is estimated that 68%, 86%, 79% of LTRs (corresponding to ~ 5 MYA) emerged during the divergence of three species (Supplementary Fig. 9a).
Comparison of gene content in syntenic blocks among different species can reveal evolutionary genome organization. Since the three genomes are divergent from a common ancestor, we explored the extent of gene fractionation after speciation. In the syntenic blocks, 21,173 genes have consistent collinearity in the three genomes; 5,868 genes exhibit collinearity between D5 and A2 that are not found in K2, 2,736 genes exhibit collinearity between D5 and K2 that are not found in A2, and 3,972 genes exhibit collinearity between A2 and K2 that are not found in D5 (Fig. 3e). To further analyze gene content at gene family level, we used OrthoMCL to identify gene clusters. The result shows that ~ 15% (unclustered genes and unique paralogs) of genes in each species were unique, these genes might have been under fast evolutionary process or represent lineage-specific genes (Fig. 3f). These results indicate that the three diploid genomes have a change of gene content during the lineage divergence.
Evolution of A/B compartment switching
TE amplification has been recognized as a driver for shaping higher-order chromatin structures in mammals, but we do not know whether it has a similar role in the organization of plant 3D genome. We have known that K2 and A2 genomes have gained widespread additional TE insertion relative to the D5 genome and they shared conserved gene synteny occupying the vast majority of genomes. This provides an opportunity for exploring the effect of TE amplification on the organization of higher-order chromatin structures. We first analyzed the A and B compartments in the three genomes, which represent active and inactive chromatin status. We found that 44.1%, 47.3% and 46.6% of genomes could be categorized as A compartment in K2, A2 and D5, respectively. The 55.1%, 52.1% and 53.0% of genomes were regarded as B compartments (Fig. 4a). A chromosome-level visualization showed that each D5 chromosome tends to have two large A compartment on chromosome arms and one B compartment on the middle of each chromosome (Supplementary Fig. 14). However, K2 and A2 genomes have more status switching of A/B compartment in TE enriched regions. At the gene level, we noticed that 31,307 (75.2%) genes in K2 and 31,331 (75.0%) genes in A2 were located in the A compartment, and 4,431 (10.7%) and 4,518 (10.8%) genes were located in B compartment. In D5, 24,267 (59.5%) genes were in A compartment and 11,729 (28.7%) genes were in B compartment (Fig. 4b). These data show that more genes were found in A compartment and less genes in B compartment in the three genomes (Chi-square test, P < 0.01). Genes located in A compartment were enriched in some basic biological processes, such as auxin signaling, whereas genes in B compartment were involved in defense response, nucleotide integration and fatty acid metabolic processes (Supplementary Fig. 15). At the TE level, 59.2%, 59.1%, 61.9% of TEs in K2, A2 and D5 were located in B compartment (Fig. 4c). This data indicates that more genes and TEs in D5 were located in B compartment than those in K2 and A2.
To investigate the change of A/B compartment status among three species, we analyzed the chromatin status in syntenic gene regions. A comparison of homologous syntenic genes shows that 468 genes exhibited A-to-B transition in the comparison of K2 with A2, 3,770 genes exhibited A-to-B transition in the comparison of K2 with D5 and 3,765 genes exhibited A-to-B transition in the comparison of A2 with D5. Only 296, 73 and 67 genes exhibited B-to-A compartment switching in the three comparisons (Fig. 4d and Supplementary Tables 10–12). About 17.4% (3,693/21,173) of syntenic genes in three genomes exhibited A/B compartment switching, 41 and 182 syntenic genes exhibited A-B-A and B-A-B switching in the comparison of K2-A2-D5 (Fig. 4e and Table 13). To support this finding, genes in chromosome Chr06 were shown. In this chromosome, 32, 33 and 254 genes were located in B compartment in K2, A2 and D5, respectively (Fig. 4f and Supplementary Fig. 16). To further characterize the biological role of homologous genes with chromatin status switching, we performed a functional enrichment analysis of the A-to-B and B-to-A orthologous genes in K2-A2 and A2-D5 comparisons (Figs. 4g, h and Supplementary Fig. 17). These results showed that A-to-B switching genes were enriched in the pathways of ion binding and transcription factor activity, while the B-to-A genes were intriguingly involved in fundamental activity such as ubiquitin transferase activity, pectate lyase activity, ATP binding (adjusted P < 0.01).
We then investigate the status of A/B compartment and their transcriptional activity. As expected, we found that genes and TEs in the A compartment display significantly higher expression levels than those in the B compartment (Fig. 4i). Further, in the comparison of K2 with A2 or D5, we found that the expression patterns of genes in K2 with A-to-B and B-to-A status switching exhibited significantly higher and lower expression levels (Fig. 4j). This points to a relationship between the switching of chromatin status and the change of transcriptional activity. We also investigated TE activity in 5 Kb flanking regions for these switching genes. It is found that genes showing A-to-B compartment switching have more active TEs in K2 genome and less in A2 or D5 genomes, and vice versa (Fig. 4k and Supplementary Fig. 18). This result suggested that active TEs might be involved in the switching of A to B compartment, which is linked to gene transcription.
Evolution of TAD organization
Topologically associating domains (TADs) represent megabase-sized local chromatin interaction domains of physical higher-order chromatin structures, which were separated by boundaries with an enrichment of specific DNA motifs26. Our previous study has shown that cotton genome can be packaged into thousands of TADs in each genome, and polyploidization reshaped the organization of TADs in the comparison of diploids with tetraploid subgenomes11. Using the new genome sequences, we identified TADs for three cotton species. The result showed that there are 2,541, 1,773 and 1,063 TADs ranging from 300 Kb to 3 Mb in K2, A2 and D5 genomes, occupying 2,187 Mb (96.7%), 1,524 Mb (95.6%) and 686 Mb (92.7%) of genomic length, respectively (Supplementary Table 14). The numbers of TADs in A2 and D5 were larger than our previous studys11, in which the tetraploid subgenomes were used as a reference to identify TADs since no high quality reference genome sequences were available at that time. The average sizes of TADs in K2 and A2 genomes are 860 Kb and 861 Kb, while D5 genome has smaller TADs with an average size of 645 Kb (Fig. 5a). We characterized the gene composition of TAD boundaries that are responsible for TAD organization in K2, A2 and D5 genomes. The K2 genome had the smallest gene number in TAD boundaries, while D5 had the largest gene number (Fig. 5b). As expected, we found that genes in TAD boundaries tend to have significantly higher expression levels than those in TAD interior (Fig. 5c and Supplementary Fig. 19), consistent with our previous result11.
The turnover of TAD boundaries indicates the reorganization of TADs in genomes. We compared TAD boundaries in syntenic blocks to explore TAD conservation and renewal in three genomes. We found that 406 TAD boundaries in K2 were conserved in the comparison of three genomes, and K2, A2 and D5 genomes have 1,393, 580 and 131 lineage-specific boundaries, respectively (Fig. 5d). To support this finding, a syntenic region between K2 (Chr08: 81.4–91.7 Mb) and D5 (Chr08: 29.3–32.4 Mb) was shown. In this block, 5 boundaries were identified in D5 and 11 boundaries were found in K2, of which 6 were specific in K2 (Fig. 5e). In the comparison of A2 and K2, a syntenic block was shown (Chr07: 70-79.5 Mb for K2 and Chr07: 62.75–68.45 Mb for A2). In this block, 7 boundaries were identified in A2 and 10 boundaries were identified in K2, of which 3 were specific in K2 (Fig. 5f). These showed that the comparison of syntenic blocks could help to identify lineage-specific TAD organization. A further analysis of the specific TAD boundaries showed that there were 69 sequence motifs in K2, 8 motifs in A2, 4 motifs in D5. We identified 13 motifs in conserved boundaries in three genomes (Fig. 5g and Supplementary Table 15). For example, K2 genome has a PABPC3 (poly(A) binding protein cytoplasmic 3) binding motif in lineage-specific boundaries, A2 genome has an AP2 (activating enhancer-binding protein 2) binding motif, and D5 genome has a CDF3 (cyclic dof factor 3) binding motif. The conserved boundaries in the three genomes are enriched in a bZIP (basic domain-leucine zipper) binding motif (Fig. 5h). Since gene transcription was found to have a role in the organization of TADs in mammals27,28, we supposed that these transcriptional factor binding motifs might participate in the formation of TADs in each genome, similar to the finding that TCP transcriptional factor binding motif was enriched in TAD boundaries in rice10.
Effect of transposon amplification on TAD organization
To explore whether TE amplification led to changes of TAD organization in K2 and A2 genomes, we investigated TE content in TAD boundaries. It is found that the Gypsy LTR retrotransposons that were involved in 99% boundaries of K2, A2 and D5, occupy the highest proportion of all TEs in boundaries (Fig. 6a). Of note is the finding that active TEs were enriched in TAD boundaries compared with the whole-genome level, and specific boundaries had a higher proportion of active TEs relative to conserved boundaries in K2 and A2 genomes (Figs. 6b, c). This result was coupled with the finding that more specific boundaries were located in A compartment than in B compartment (Fig. 6d). Over time, we classified the intact LTRs retrotransposons in each genome into ancient TEs and young TEs based on the median age of TE insertion within a species. We found that more young LTR retrotransposons were identified in lineage-specific TAD boundaries, and ancient TEs were more likely to exist in conserved boundaries (Fig. 6e, two-sided t-test, P < 0.001). In addition, young LTR retrotransposons were found to have higher expression levels than ancient LTR retrotransposons in the three genomes (Fig. 6f). In summary, these results indicated that the amplification of active young TEs in K2 and A2 genomes might contribute to the formation of lineage-specific TAD boundaries after divergence of the three species (Fig. 6g).