The PSC355 genome represents a reference-quality assembly with near telomere to telomere completeness (41/52). A total of 1,120 contigs were manually scaffolded into 26 molecules resulting in a 2.29 GB genome assembly (Table 1). Twenty-five of 26 chromosomes assembled into a single contig, 15 of which were contigs with telomere sequences on each end (Supplemental Table S1, Supplemental Figure S1). Only one chromosome was scaffolded by concatenating two contigs of 15.2 and 83.1 MB in the proper orientation. Approximately 77% of the genome was identified as repetitive with the largest component consisting of LTR elements (48%, Supplemental Table S2). Of the LTR elements, gypsy elements were the most abundant (38%). Centromeres were identified for each chromosome by mapping chip-seq reads from immuno-precipitated CENH3 (Centromere associated histone 3) protein (Fig. 1). Whole genome synteny with G. hirsutum TM-1 suggests near 1:1 alignment with no major structural differences between PSC355 and Bar 32 − 30, or between PSC355 and TM-1 (Supplemental Fig S1).
Table 1
The assembled genome of PSC355
# PacBio Hifi Cells | 5 |
Contigs* | 1120 |
Max Contig (MB) | 128.1 |
Mean Contig (MB) | 2.12 |
Contig N50 (MB) | 90.02 |
Contig N90 (MB) | 59.65 |
Total Contig Length | 2369.57 |
Assembly GC % | 34.92 |
Scaffolds♱ | 26 |
Max Scaffold (MB) | 128.17 |
Mean Scaffold (MB) | 88.24 |
Scaffold N50 | 108.24 |
Scaffold N90 | 61.04 |
Number scaffolded contigs | 27 |
Total Scaffold Length (MB) | 2294.33 |
Number of genes | 75,746 |
Repeat sequences (%) | 77.33 |
*Contig metrics reports are from the raw output file of hifiasm |
♱Only 3 cells were used for the assembly |
Due to the close pedigree relationship between PCS355 and BAR 32 − 30, we used liftoff to create orthologous annotations for the PSC355 genome. There were 75,746 gene annotations identified for PSC355, which was similar to the number observed for BAR 32 − 30 as well as those previously reported for other G. hirsutum genomes (Yang et al. 2019, Chen et al. 2020, Huang et al. 2020, Perkin et al. 2021, Hu et al. 2019). Analysis of gene content revealed 98.5% of conserved orthologs (BUSCO) were complete transcripts and the majority (86.0%) were duplicated in the tetraploid genome (Supplemental Table S3).
Within putative resistance marker regions, minigraph identified thousands of polymorphic differences shared by the nematode tolerant BARBREN-713 and BAR 32 − 30 when compared against the nematode susceptible PSC355. This comparison enabled us to identify the putatively resistant genes inherited from the nematode-resistant Pima cotton parental line within the hybrids (BARBREN-713 and BAR 32 − 30). The first set of polymorphisms were small (SNPs and Indels < 50 bp). We identified 114 genes with small polymorphisms in the marker regions (Fig. 1), of which 77 have functional annotations (Table S4). GO enrichment for biological process, molecular function, and cellular components yielded five, two, and three terms, respectively (Table 2) (p < 0.01). There is a suite of polymorphic genes (SNPs and Indels) with transcription and translation gene ontology annotations. Eleven reside within Mi-1 root-knot nematode resistant marker on A11, two in Mi-2 marker on D02, and one in Ren2 reniform nematode resistant marker on D11. These include a variety of transcription factors known to be involved in regulation of plant disease resistance and stress response: TFIIS, bHLH, CCHC zinc finger, squamosa promoter-binding-like, MADS-box, nuclear transcription factor Y, and NAC (see bolded terms in Supplementary Table S4; Cheng et al. 2018; Li et al. 2018; Mou et al. 2022; Nuruzzaman et al. 2013; Parenicova et al. 2003; Petroni et al. 2012; Sun et al. 2022; Szadeczky-Kardoss et al. 2022). Additional loci shared by BARBREN-713 and BAR 32 − 30, compared to PSC355, include genes that constitute E3 ubiquitin ligase complexes such as Bar32.A11G290000, Bar32.A11G291200, Bar32.A11G299800, Bar32.A11G300200 in Mi-1 region, Bar32.D02G005800 in Mi-2 region, Bar32.D11G234000, Bar32.D11G236500 in Ren2 region. Importantly, six additional genes with these marker polymorphisms also included the NB-ARC domain containing plant disease resistance proteins (R) and the protein kinases involved in the signal transduction pathways (see asterisked terms in Supplemental Table S4). These loci also contained 22 genes associated with disease resistance and stress responses (see underlined terms in Supplemental Table S4) such as viral inhibitor of apoptosis (IAP)-associated factor, arabinogalactan peptide 23-like protein, dirigent proteins in Mi-1, ULP_protease domain-containing proteins in Mi-2, importin beta-like SAD2 protein in Mi-2, aspartic protease in Mi-2, heat shock protein 70 co-chaperone, J domain-containing proteins in Mi-2 and Ren3, purple acid phosphatase in Mi-2, Verticillium resistant protein in Ren2, BED-type domain-containing protein in Ren2, bidirectional sugar transporter SWEET in Ren2, branched-chain amino acid aminotransferase in Ren3, RNase H type-1 domain-containing protein in Ren3 and COBRA-like protein 10 associated with the regulation of cell wall expansion and cellulose deposition in Ren3 (Antonyuk et al. 2014; Corredor-Moreno et al. 2021; Cox et al. 2017; Figueiredo et al. 2021; Kabbage et al. 2017; Knoch et al. 2014; Moelling et al. 2017; Morrell and Sadanandom 2019; Verma et al. 2019; Zhao et al. 2021; Zhu et al. 2007; Zuluaga et al. 2020). Two genes (Bar32.A11G293500 and Bar32.A11G295600), which had the same annotation as BAR32.D11G23380 (NBS-LRR), were also located within Mi-1 region.
Table 2
Enrichment of gene ontology terms from genes with SNPs and Indels
GO.ID | Term | Annotated Genes from Entire Genome | Significant Genes Determined within Marker Regions with SNPs and Indels | Expected Number of Genes within Marker Regions | ClassicFisher p-value |
Biological process |
GO:0016070 | RNA metabolic process | 3397 | 11 | 5.88 | 0.0018 |
GO:0019219 | Regulation of nucleobase-containing compounds | 2164 | 8 | 3.75 | 0.0068 |
GO:0050794 | Regulation of cellular process | 3138 | 11 | 5.43 | 0.0090 |
GO:0019222 | Regulation of metabolic process | 2463 | 10 | 4.27 | 0.0009 |
GO:1901362 | Organic cyclic compound biosynthetic processes | 2932 | 9 | 5.08 | 0.0097 |
Molecular function |
GO:0043531 | ADP Binding | 274 | 7 | 0.57 | 5e-6 |
GO:0005515 | Protein binding | 5674 | 21 | 11.90 | 0.0023 |
GO:0043168 | Anion binding | 5144 | 15 | 10.78 | 0.0041 |
Cellular component |
GO:0005622 | Intracellular anatomical structure | 3855 | 12 | 8.01 | 0.0063 |
GO:0043226 | organelle | 3159 | 11 | 6.57 | 0.0066 |
The second set of polymorphic genes relate to structural differences or large polymorphisms (50 bp − 250 Kbp), which affected 159 genes associated with enrichment of GO terms in biological processes (15) and cellular components (7) (Table 3). These large polymorphisms capture copy number variants (CNV) of constitutively upregulated genes shared by the reniform nematode-resistant individuals (BARBREN-713 and BAR 32 − 30) such as the previously identified root-knot resistant gene: PSC355.D02G019300-1, which is paralogous to Gh_D02G0276 (Fig. 2; Wubben et al. 2019). Gh_D02G0276 persists on an 80 KB structural variant shared by both BAR 32 − 30 and BARBREN-713, compared to Phytogen PSC355 (Fig. 2). Sequence analysis of this duplicated copy shows more similarity between BAR 32 − 30 and BARBREN-713 than PSC355, yet sequence analysis of the Gh_D02G0276 orthologue shared by all three genomes revealed the copies of BARBREN-713 and PSC355 were more closely related to each other than to the copy of BAR 32 − 30 (Fig. 2). Additionally, only BAR 32 − 30 had the premature stop codon possessed by the root-knot resistant lines derived from M-240 RNR.
Table 3
Enrichment of gene ontology terms from genes within structural differences at resistance marker regions.
GO.ID | Term | Annotated Genes from Entire Genome | Significant Genes Determined within Marker Regions with SNPs and Indels | Expected Number of Genes within Marker Regions | ClassicFisher p-value |
Biological process |
GO:0051235 | Maintenance of location | 26 | 2 | 0.05 | 0.000065 |
GO:0006259 | DNA metabolic process | 9654 | 39 | 18.68 | 0.00027 |
GO:0044260 | Cellular macromolecule metabolic process | 11818 | 40 | 22.87 | 0.00041 |
GO:0006139 | Nucleobase-containing compound metabolic process | 14236 | 44 | 27.55 | 0.00057 |
GO:0006807 | Nitrogen compound metabolic process | 20436 | 52 | 39.54 | 0.00064 |
GO:0051651 | Maintenance of location in cell | 19 | 2 | 0.04 | 0.00079 |
GO:0045185 | Maintenance of protein location | 19 | 2 | 0.04 | 0.00096 |
GO:1901360 | Organic cyclic compound metabolic process | 14654 | 44 | 28.36 | 0.00098 |
GO:0090304 | Nucleic acid metabolic process | 13677 | 44 | 26.46 | 0.00268 |
GO:0043170 | Macromolecule metabolic process | 19501 | 51 | 37.73 | 0.00275 |
GO:0044237 | Cellular metabolic process | 20735 | 51 | 40.12 | 0.00326 |
GO:0034641 | Cellular nitrogen compound metabolic process. | 15376 | 45 | 29.75 | 0.00429 |
GO:0046483 | Heterocycle metabolic process | 14523 | 44 | 28.1 | 0.00589 |
GO:0006725 | Cellular aromatic compound metabolic | 14624 | 44 | 28.3 | 0.0072 |
GO:0009987 | Cellular process | 25449 | 56 | 49.24 | 0.00773 |
Cellular component |
GO:0005615 | extracellular space | 12 | 1 | 0.01 | 0.011 |
GO:0005875 | microtubule associated complex | 42 | 1 | 0.04 | 0.017 |
GO:0005576 | extracellular region | 279 | 1 | 0.24 | 0.229 |
GO:0005856 | cytoskeleton | 275 | 1 | 0.24 | 0.312 |
GO:0005737 | cytoplasm | 1469 | 2 | 1.26 | 0.401 |
GO:0043232 | intracellular non-membrane-bounded organelle | 1611 | 2 | 1.38 | 0.438 |
GO:0043228 | non-membrane-bounded organelle | 1611 | 2 | 1.38 | 0.438 |
Ren 2 GB713 on D11 also has a structural difference (> 250 KB) shared by both BARBREN-713 and BAR 32 − 30 that contains CNV of constitutively upregulated genes PSC355.A11G349900 (Gorai.007G333200.1) and PSC355.A11G50000 (Gorai.N015800.1-JGI_221_v2.1), which are annotated as NB-ARC domain containing disease resistance protein (Figs. 3 & 4). RepeatModeler identified several classes of transposable elements (TEs) flanking these NB-ARC domain-containing proteins on Ren2GB713. Notably the tandem helitron repeat reveals a possible mechanism by which these duplications have arisen in resistant Pima-derived lines (Fig. 4). Helitrons are a Class 2 TE named after their “rolling circle” mechanism of duplication (Kapitonov and Jurka 2001). This operation can be associated with tandem repeats as the single-stranded intermediate can be ligated back into the genome. Furthermore, there are differences in paralogy of these genes between BAR 32 − 30 and BARBREN-713. For Gorai.007G333200.1 we identified two paralogs in BARBREN-713 and one in BAR 32 − 30, which is shared and identical to one copy in BARBREN-713 (Supplemental Figure S1). HiFi read coverage corroborates the observed copy number differences between the BARBREN-713 and BAR 32 − 30. There are three and two discrete peaks in these lines, respectively, with ~ 30% more read coverage overall compared to this syntenic region in PSC355 (Supplemental Figure S1).