Chromosome scale genome assembly of C. australasica
Hifi reads with Q33 median read quality were generated from two PacBio SMRT cells yielding 39 Gb (115X) and 37 Gb (108X) of sequence respectively. Hi-C paired end Illumina reads generated from two lanes yielded a total of 768 M reads with a total of 116 Gb data. Hifiasm was run with two options; Hifi reads only and the Hi-C integrated (Hifi reads + Hi-C reads) option (Table S4). The Hifi-reads-only assembly yielded a slightly better contig assembly with a total size of 407 Mb (1,569 contigs), having a 99.3% complete BUSCO and an N50 of 31.7 Mb. The Hi-C integrated Hifi assembly yielded a collapsed assembly with 2,224 contigs (with a total size of 436 Mb), a 99.1% complete BUSCO and an N50 of 31.4 Mb. The two assemblies were independently subjected to scaffolding with Hi-C data.
Scaffolding was performed with Hi-C data using three scaffolding tools using two different pipelines (Figure S1). The results were compared by checking the telomeres at the ends of the scaffolds, the N50 for the whole assembly and by aligning the scaffold assemblies with the previously published genome of C. australis [30] (TableS2a-S2g). The Hi-C integrated assembly, assembled by BWA + Arima mapping and the SALSA pipeline was selected as the final best assembly based on the high contiguity, completeness, and the presence of telomeric repeats at the ends of the scaffolds (Method S1, Table S2e). The scaffolds that could be assigned to chromosome level based on the alignments with C. australis are shown in Table S2e. The scaffold assembly generated a total of 2,208 scaffolds (with a total length of 436 Mb) with a 99.1% complete BUSCO and an N50 of 33.5 Mb (Table S5).
The top eleven longest scaffolds in the collapsed genome were selected to represent the nine pseudochromosomes as they had the same total BUSCO score (99.1%) as the whole assembly and were anchored to pseudochromosomes by aligning with C. australis genome. (Table S5). The pseudochromosomes were labeled as Chr1-Chr9 based on the order of C. australis genome. The orientations of the C. australasica chromosomes were determined based on those of C. australis chromosomes (Figure S4a). Seven pseudochromosomes were composed of one contig. Chr8 was composed of two contigs which were joined by Hi-C. Chr4 was generated by manually joining three scaffolds (manual adjustments) (Table S3). Four pseudochromosomes had telomeres at both terminals, three pseudochromosomes had telomeres at one terminal, one pseudochromosome had one telomere at one end and the other at the peri-terminal region, and one pseudochromosome with no telomeric sequences at either end (Figure S4c). The N50 of the chromosome scale assembly was 35.2 Mb (Table S5). There were some scaffolds (SC9, SC10, SC8, SC11 and SC20) with 5.8S and 28S rRNA gene repeats at the terminal regions and some scaffolds (SC5, SC9) with high copy number tandem arrays of satellite DNA repeats at their terminal regions (Table S3).
Dotplots of scaffolds vs contig assemblies revealed that some medium sized scaffolds (5 Mb – 1.4 Mb) might belong to the nuclear genome (Figure S4b). Those scaffolds and two other small scaffolds (0.85 Mb and 0.7 Mb) with telomeres at the terminal regions were included in the final assembly as unplaced scaffolds as they might be parts of the nuclear genome The assembly containing nine pseudochromosomes and the unplaced scaffolds, totaling 344.2 Mb, is henceforth referred to as the “nuclear genome” of C. australasica. The alignment of scaffolds with the C. australasica chloroplast genome revealed the sites of insertion of chloroplast fragments in the nuclear genome (Figure S5a and S5b). Some scaffolds smaller than 1.4 Mb might belong to the chloroplast genome (Figure S5c) and were excluded from the nuclear genome assembly. The heterozygosity of the genome was estimated as 1.28% by K-mer analysis.
The two haplotypes were also assembled using the same pipeline and some manual adjustments were done based on their homology with the collapsed genome. The orientation and chromosome numbers of the chromosome scale pseudomolecules were determined with respect to the collapsed assembly (Figure S6, Table S3). The two haplotypes had 98.9% and 99% BUSCO and an N50 of 32.7 Mb and 34.4 Mb for hap1 and hap2 respectively.
Structural annotation of C. australasica genome
The genome was annotated for repeat elements, and protein coding genes. A large portion of the collapsed genome was covered by interspersed repeats (54.6%) with 34.5% unclassified repeats, 2.64% DNA transposons and 17.5% retro transposons. LTR elements were the dominant type of retroelements where Copia and Gypsy elements were present in equal proportions (6.78%) and Pao elements were found in a very small percentage (0.03%). The other types of repeat elements such as rolling circles (0.42%), small RNA repeats (0.67%), satellite repeats (0.39%), simple repeats (0.97%) and low complexity repeats (0.2%) were found in small proportions in the whole nuclear genome (Table S6). RNA-seq trimmed reads (320 million reads, 44 Gb representing 129X coverage of the genome) were used for gene prediction. A total of 36,597 genes were predicted in nine pseudochromosomes of the collapsed genome while 30,050 and 34,139 genes were found for the hap1 and hap2 nine pseudochromosomes respectively (Table 1). The annotation statistics of the nuclear genomes (including the unplaced scaffolds) are given in the Table S7.
Table 1
The size of the pseudochromosomes and the number of genes in collapsed, hap1 and hap2 genomes
Chromosome | | collapsed | | hap1 | | hap2 |
| Size (Mb) | Genes | | Size (Mb) | Genes | | Size (Mb) | Genes |
1 | | 31.4 | 3,514 | | 31 | 3,299 | | 30.6 | 3,489 |
2 | | 35 | 4,272 | | 32 | 3,334 | | 34.4 | 4,274 |
3 | | 36.9 | 3,306 | | 36.6 | 3,222 | | 35.1 | 3,095 |
4 | | 43.9 | 4,144 | | 39.8 | 4,366 | | 34.4 | 3,415 |
5 | | 49.5 | 6,541 | | 47.4 | 4,586 | | 48.9 | 5,653 |
6 | | 26.1 | 2,501 | | 26 | 2,475 | | 26 | 2,432 |
7 | | 33.5 | 3,830 | | 32.7 | 3,556 | | 36.5 | 5,018 |
8 | | 35.2 | 4,866 | | 29.1 | 2,602 | | 34.2 | 4,219 |
9 | | 34.8 | 3,623 | | 29.6 | 2,610 | | 30 | 2,544 |
Total size (Mb) | | 326.3 | | 304.1 | | 310.1 |
Total genes | | 36,597 | | 30,050 | | 34,139 |
Annotation completeness (BUSCO) | | 98.9% | | 98.8% | | 99.2% |
Functional annotation of C. australasica genome
The CDS sequences of the collapsed genome (45,935) and the two haplotypes (39,651 of hap1 and 41,516 of hap2) were independently annotated obtaining BLAST hits and GO terms associated with the CDS sequences. BLAST hits were obtained for 40,105 CDS sequences in the collapsed genome and for 35,286 and 39,395 sequences in hap1 and hap2 genomes respectively. Of the sequences with BLAST hits, 21,104 CDS in the collapsed genome, 20,899 CDS in the hap1 genome and 20,966 CDS in the hap2 genome underwent GO mapping and annotation. The majority of the CDS sequences of the collapsed and the two haplotypes received BLAST hits from other citrus species. The highest number of sequences received BLAST hits from C. sinensis (179,114 - collapsed genome, 166,574 - hap1 genome, and 175,964 - hap2 genome), followed by C. clementina (39,949 - collapsed genome, 39,304 - hap1 genome, 39,605 - hap2 genome) and C. unshiu (27,052 - collapsed genome, 26,723 - hap1 genome, 26,784 - hap2 genome) and a small percentage of sequences from other species (Figure S7a, S7b, S7c). Sequences annotated with IPS, their families distribution, GO-levels and enzyme codes of the collapsed genome are shown in Figure S7d, S7e, S7f. The coding potential assessment of the CDS sequences in the collapsed genome with no BLAST hits (5,830) revealed 99.8% and 99.1% CDS with coding potential based on the models from coding and non-coding sequences of Arabidopsis thaliana and citrus respectively. The coding potential of the sequences with no BLAST hits for the two haplotypes also indicated a high number with coding potential (Figure S8a, S8b, S8c, S8d).
Structural and functional comparison between C. australasica and C. australis genomes and among C. australasica assemblies
The inference of orthologs from orthovenn3 revealed 19,980 shared orthologous clusters between C. australasica and C. australis corresponding to 48,185 shared genes. The number of unique orthologous clusters (gene families) in C. australasica (870) were higher than C. australis (666) (Fig. 1a). The 870 unique orthologous clusters of C. australasica included 12,748 unique protein coding genes and the 666 unique orthologous clusters of C. australis contained 4191 unique protein coding genes. In addition to the unique orthologous clusters, there were 4,487 singletons in C. australasica and 5,566 singletons in C. australis which had no orthologous genes identified in the other species and they could not be assigned to any cluster within the species. Therefore, the genes in unique orthologous clusters and singletons are henceforth referred to as unique genes in each species. Of them, 13,661 unique genes of C. australasica and 8,121 unique genes in C. australis were within their nine chromosomes.
The functional analysis of the unique genes of C. australasica revealed that they were enriched in biological processes including stress response, protein modification, cellular component organization, response to organic substance, transport and regulation of gene expression with molecular functions such as nucleic acid binding, hydrolase activity, protein binding, catalytic activities, ATP binding, oxidoreductase activity and transition metol ion binding. KEGG pathway analysis showed that the unique gene sequences of C. australasica were primarily involved in purine metabolism (155), thiamine metabolism (145), plant-pathogen interactions (92) (out of the total 345 genes related to plant-pathogen interactions in C. australsica, 92 genes were unique in it), phenylpropanoid biosynthesis (55), diterpenoid biosynthesis (48), and Tryptophan metabolism (44) (Figure S9). The 92 unique genes associated with plant-pathogen interactions encode disease resistance proteins [NB-ARC domain containing R proteins, nucleotide binding and leucine rich repeat proteins (NLRs), leucine-rich repeat receptor-like protein kinases (LRR-RLKs)], calcium sensor proteins [calcium-dependent protein kinases (CPDKs), calmodulin (CaM) and calmodulin-like proteins (CMLs)], retrovirus-related pol polyproteins, glycerol kinases, pathogenesis-related protein 1, cyclic nucleotide-gated ion channels, and many other hypothetical proteins. The functional analysis of the C. australis unique protein clusters indicated that they were associated with many biological processes including protein modification, cellular component organization, transport, defense response, phosphate containing compound metabolic process. The KEGG analysis showed that the unique genes in C. australis were mainly involved in purine metabolism (80), plant pathogen interactions (76) and Thiamine metabolism (75) (Figure S10).
The orthologs comparison among the three assemblies of C. australasica revealed 3,307 genes unique to collapsed genome, 1,676 unique to hap1 genome and 1,696 genes unique to hap2 genome which are in clusters as depicted in Fig. 1b. In addition, there were 2,155, 1,610, 1,471 singletons identified in the collapsed, hap1 and hap2 assemblies respectively. There were 21,061 shared gene families containing 76,167 shared genes among the three assemblies. There were 13,801 genes (4,078 gene families) shared between collapsed and hap2 assemblies. 8,667 genes (2,745 gene families) were shared between the collapsed and hap1 assembly and 2,830 genes (841 gene families) were shared between the two haplotypes which were not present in the collapsed assembly. Functional characterization of the collapsed and haplotype assemblies specific genes revealed that they were associated with many different cellular, metabolic, biosynthetic processes, and stress responses (Figure S11).
Structural and local sequence variations between C. australasica and C. australis genomes revealed the conserved and rearranged regions of the two genomes (Fig. 2). Large inversions were found in Chr4 and Chr5 and small inversions were found in all chromosomes. Translocations and duplications were found in all chromosomes and large-scale translocations and duplications were found in Chr3, Chr4, Chr5, Chr7 and Chr8. A relatively smaller number of rearranged regions were found in Chr6, which was the smallest chromosome in both the species. Local sequence variations such as SNPs (5,437,677), insertions (542,557), deletions (445,565), highly diverged regions (4,915) and tandem repeats (137) were annotated in both syntenic and rearranged regions of the two genomes with the help of whole genome alignments (Fig. 2).
Genetic diversity in different C. australasica cultivars
The genetic diversity within C. australasica was determined using five cultivars, including Rainbow for which the genome was assembled in the present study. Illumina reads for the five cultivars (between 9 to 13 Gb data with coverage ranging from 28X – 39X genome size of C. australasica) were used in the fix ploidy variant analysis using Rainbow as the reference (Table S8). Different variant types (insertions, deletions, single and multi-nucleotide variants, and replacements) ranging between 1.9 M to 3.5 M were found for the five cultivars where many of them were SNVs. At the whole genome level, the total number of SNV positions which include heterozygous SNVs and homozygous SNVs (at 100% variant frequency) were in the range between 1.6 M to 2.8 M (Figure S12). The total number of SNV variant positions in CDS regions were in the range between 0.1 M to 0.18 M for the five cultivars. Based on the total SNVs, C. australasica cv 3 (Red champagne) and C. australasica cv 1 were the most and less divergent cultivars respectively with respect to Rainbow (Figure S12). The heterozygosity estimated based on the Rainbow genome size was the highest for C. australasica cv 3 (Red champagne) (0.56) and the lowest for C. australasica cv 1 (0.35) (Table 2).
Table 2
SNVs and heterozygosity values in five C. australasica cultivars with respect to C. australasica cv 2 (Rainbow) as the reference
Cultivar | | Whole genome | | CDS | | Heterozygosity |
| Heterozygous SNVs | 100% homozygous SNVs | Total SNVs | | Heterozygous SNVs | 100% homozygous SNVs | Total SNVs | |
C. australasica cv 2 (Rainbow) | | 1,601,626 | 37 | 1,601,663 | | 104,730 | 1 | 104,731 | | 0.47 |
C. australasica cv 1 | | 1,192,418 | 1,037,382 | 2,229,800 | | 81,382 | 68,639 | 150,021 | | 0.35 |
C. australasica cv 4 (Red Finger lime) | | 1,407,305 | 1,053,367 | 2,460,672 | | 95,624 | 66,596 | 162,220 | | 0.41 |
C. australasica cv 3 (Red champagne) | | 1,927,999 | 842,733 | 2,770,732 | | 130,940 | 50,892 | 181,832 | | 0.56 |
C. australasica cv 5 (Ricks red) | | 1,860,975 | 794,928 | 2,655,903 | | 127,859 | 49,745 | 177,604 | | 0.54 |
Reference genome size 344.2 Mb including nine pseudochromosomes and unplaced scaffolds.
A selected set of important genes in C. australasica
Disease resistant genes
A wide array of antimicrobial proteins/peptides (AMPs) were identified through functional annotation in C. australasica collapsed genome. Three antimicrobial genes [g22065 (Chr6), g32276 (Chr8), g34118 (Chr9)] coding for peptides containing stress-responsive A/B barrel domain were identified in the collapsed genome (Fig. 3). Of these three genes, g34118 was identified as a homolog to the previously detected short version of stable AMP (SAMP) in HLB resistant citrus species (204 bp) [26]. The gene g34118 was transcribed into three transcripts of 549 bp, 462 bp, and 330 bp. The third transcript having 330 bp (encoding 110 aa) showed the highest homology with a major portion of the 204 bp SAMP sequence (encode 67 aa) of the previously reported SAMP of C. australasica with twelve single nucleotide polymorphisms. Similarly, the SAMP homologs, identified in Chr9 of the two haplotypes (hap1; g27559 of 330 bp and, hap2; g31668 of 330 bp) showed 100% identity with the corresponding collapsed SAMP gene (Figure S13a, S13b). The SAMP identified in the previous study had two cysteine residues, whereas the Rainbow SAMP homologs had no cysteine residues. In addition to the SAMP sequences, two other types of antimicrobial peptides having the stress-responsive A/B barrel domain were identified in the two haplotypes similar to the collapsed genome (Table S9).
The re-annotation of the C. australis genome [30] using Braker3 identified one SAMP homologous gene encoding two transcripts which are longer than those of C. australasica. The alignment of these two transcripts with C. australasica SAMP sequences showed a homology with them, however it was not as high as the sequence homology of C. australasica SAMP sequences (Figure S14). Furthermore, the alignments of the 110 aa antimicrobial peptides found in Rainbow and other HLB resistant and susceptible citrus species with 67 SAMP of C. australasica showed a high aa similarity among all of them (Figure S15).
In addition to stress-response A/B barrel domain-containing protein, other types of AMPs such as defensins, thionins, non-specific lipid transfer proteins, snakins, hevein-like proteins, knottin-type peptides were identified in the genome. Other defense-related genes including cysteine-rich receptor-like protein kinases (CRKs), of which 14 genes encoding CRK 10, one encoding CRK 25 and others encoding other types of CRKs were identified in the genome (Table S10). There were eleven genes for pathogenesis related (PR) proteins of which three were PR-1 proteins (Fig. 3, Table S11). In addition, the annotation identified 61 leucine rich repeat proteins (LRR) genes, 13 guanine nucleotide-binding proteins (Fig. 3, Table S12), 34 glutathione-S-transferase genes, 28 oxoglutarate (2OG) and Fe(II)-dependent oxygenases, 22 cellulose synthase genes, 26 β-1,3-Glucanase genes, and many other genes related to anthocyanins, terpenoids, amino acids (phenylalanine, tyrosine, and tryptophan) and antioxidants (flavonoids, carotenoids, tocopherols) in the Rainbow genome.
Volatile compounds encoding genes
KEGG pathway analysis identified 112 genes involved in terpenoid backbone biosynthesis, 82 genes for monoterpenoid biosynthesis (including geranyl phosphate, geraniol, linalool, myrcene, limonene, α-terpineol, camphene, neomenthol), 89 genes for diterpenoid biosynthesis and 48 genes for sesquiterpenoid and triterpenoid biosynthesis (α-farnesene, β-farnesene, germacrene, valencene, β-carophyllene, β-amyrin, α-amyrin, α-humulene) in the Rainbow genome.
Curvature thylakoid protein genes
A total of 8,314 genes encoding curvature thylakoid proteins (CURT1) which belong to four isoforms were identified in the genome. There was one gene encoding CURT1A on Chr5 with 495 bp, one gene encoding CURT1B in Chr5 having 510 bp and one gene for CURT1C in Chr1 transcribed into two CDS sequences (465 bp and 444 bp). There were 8,311 genes encoding CURT1D proteins, with majority of them being identified within the 9 chromosomes. The lowest number of genes (572) were identified in Chr9, and the largest number of genes were identified in Chr8 (1,619) with open reading frames in the CDS sequences. The CURT1D proteins were present as large tandem arrays of gene clusters within the chromosomes with the smallest gene having 252 bp (encoding 84 aa) and the largest gene having 30,726 bp (encoding 10,242 aa). No CURT1 genes were identified in Chr3 and Chr6. In contrast, only 357 CURT1 genes were found in the C. australis genome.
Red/orange coloration related genes
C. australasica cv Rainbow has a red/yellow warty skin and pink colored clear vesicles inside the fruit. Anthocyanins and β-citraurin are the two major pigments involved in the orange-reddish color of citrus fruits. The structural and regulatory genes involved in anthocyanin production were identified in C. australasica genome. Structural genes involved in the production of enzymes needed for the biosynthesis of anthocyanins are depicted in Figure S16. A group of upstream genes encoding CHS, CHI, F3H and downstream genes encoding F3’M, F3’5’H, DFR, ANS, UFGT were known to play major roles in pigmentation [51]. The annotation identified 13 CHS genes, 3 CHI genes, 5 F3H genes, 2 F3’M genes, 1 F3’5’H gene, 2 DFR genes, 1 ANS gene, 2 genes of UFGT (Figure S16). A total of 11 genes involved in the biosynthesis of major anthocyanins including pelargonidin, pelargonidin-3-sambubioside, cyanidin, cyanidin 3-glucoside, cyanidin 5-glucoside, cyanidin 3,5-diglucoside, cyanidin-3-sambubioside, delphinidin, delphidin-3-sambubioside, delphinidin 3-glucoside were identified in the genome (Fig. 4).
Four types of regulatory genes are involved in anthocyanin gene expression in plants. There were two Ruby homologs in Chr6, one bHLH (Noemi) homolog on Chr5, five WD-40 protein encoding genes on Chr1, Chr2, Chr4 and Chr5, and 58 WRKY TFs scattered on all chromosomes in the Rainbow genome.
β-citraurin is encoded by carotenoid cleavage dioxygenase 4 (CCD4) gene. There were two CCD4 genes identified in collapsed, hap1 and hap2 genomes on Chr7 and Chr8 and one additional gene was identified on Chr6 of the hap1 genome (Table S13).