Biased distribution of NUPTs across moringa nuclear genomic features
In order to examine the distribution of NUPTs across the different sequence features found in the moringa nuclear genome, we employed the NUPTs detected in [24] and the structural annotation of a chromosome-scale assembly, i.e., AOCCv2 [25] completed by further classifying tRNA genes as plastid, mitochondrial or nuclear. 91,06% of the genome could be categorized into 12 genomic features, including structural genes, Transposable Elements (TEs), other repeats, NUPTs plus nine different categories of RNA genes, while the rest was deemed as “other DNA” (Table S1). A total of 4,885 genomic features were hit by 4,754 NUPTs (Table 1), which we will refer to as NUPT genomic features. The majority of the NUPT genomic features corresponded to TEs (2,317) and plastid tRNA genes (912) (Table 1 and Table S1). 3,716 NUPTs were found in two or more specific genomic features (Table 1). The percentage of genomic features affected by NUPTs varied widely, ranging between 0%, for spliceosomal and other RNA genes and more than 99% for plastid tRNA genes (Table 1). Apart from nuclear plastid tDNA, two additional categories of RNA genes for which the majority of members were affected by NUPTs were self-splicing intron RNA genes (96.06%) and prokaryotic rRNA genes (93.21%) (Table 1). Within every category of NUPT genomic features, the percentage of them fully aligning with plastid DNA was also highly variable, ranging between 5.61% for structural genes to more than 99%, as was the case for mitochondrial and plastid tRNA, as well as self-splicing intron RNA genes (Table 1). Some specific classes of NUPT genomic features seemed to be preferentially affected by NUPTs from one episode or another (Table S2).
Table 1: Summary of NUPT genomic features.
NUPT Genomic feature
|
Total number
|
Total percentage (%) in the genome
|
Percentage (%) fully covered by plastid DNA
|
Number of NUPTs2
|
Structural gene1
|
428
|
1.88
|
5.61
|
657
|
TE
|
2,317
|
1.81
|
40.40
|
3,858
|
Other repeats
|
80
|
0.10
|
90
|
105
|
Eukaryotic rRNA gene
|
132
|
5.19
|
27.27
|
264
|
Prokaryotic rRNA gene
|
302
|
93.21
|
55.96
|
832
|
Nuclear tRNA gene
|
112
|
18.33
|
95.54
|
119
|
Mitochondrial tRNA gene
|
48
|
81.36
|
100
|
48
|
Plastid tRNA gene
|
912
|
99.89
|
99.56
|
885
|
Self-splicing intron RNA gene3
|
512
|
96.06
|
98.63
|
633
|
Regulatory RNA genes4
|
42
|
11.2
|
69.05
|
62
|
Spliceosomal RNA gene
|
0
|
0
|
0
|
0
|
Other RNA genes3
|
0
|
0
|
0
|
0
|
1Including the 1 Kb regions upstream and downstream of the ATG and stop codons, respectively.
2Note that there are 3,716 NUPTs overlapping two or more genomic features and so the total number is higher than the actual number of NUPTs.
3Group I and group II introns.
4Only iron stress repressed RNA (isrR) genes.
We next examined any potential spatial biases in the distribution of NUPTs with respect to specific genomic features, i.e., whether specific genomic features were more or less tolerant to host NUPTs. First, visually, by graphically representing the arrangement of NUPTs from every episode plus every other feature found in the moringa genome along the 14 chromosomes using Circos plots (Fig. 1A). The frequency of genomic features was represented in each case as density plots in windows of 500 kb in length. We then searched for any putative overlaps in density peaks between genomic regions corresponding to NUPTs from one episode or another and the rest of genomic features. As expected, older NUPTs from episode I, which were found to be apparently distributed homogenously across chromosomes [24], do not show any apparent peaks in the distribution (Fig. 1A). In turn, younger NUPTs from episode II, which were found to be concentrated in hotspots [24], collocated with peaks in the density distribution of different classes of RNA genes, including self-splicing intron, plastid tRNA and prokaryotic rRNA, which in turn corresponded with chromosome regions with low density of TEs and other repeats (Fig. 1A).
Next, we tried to statistically substantiate putative biases in the distribution of NUPTs from one episode or another against every specific genomic feature by comparing observed versus expected base pair overlap counts (Fig. 1B and Table S3). Based on the results from performing Pearson's Chi-squared independence tests with Yates' continuity correction, prokaryotic rRNA genes, tRNA genes, self-splicing intron RNA genes, regulatory RNA genes and other DNA were highly enriched for NUPTs from one and another episode whereas structural genes, other repeats, spliceosomal RNA genes and other RNA genes were highly impoverished (Fig. 1B and Table S3). Features such as eukaryotic rRNA genes, TEs, NUPTs from another episode and unclassified NUPTs, showed opposite trends of enrichment depending on the NUPTs formation event (Fig. 1B and Table S3).
NUPTs show differential patterns of retention between structural and RNA genes
As stated above, structural genes were found to be affected by NUPTs from one and another episode less than expected by chance (Fig. 1B and Table S2), likely reflecting the potential deleterious effects resulting from the insertion of exogenous DNA, especially when affecting coding sequences. Nevertheless, impoverishment in NUPTs across structural genes could be observed regardless of the region of the gene affected (Fig. S1 and Table S4).
A total of 428 structural genes were hit by 657 NUPTs (Table 1 and S5), including 249 affected in intron regions, 71 in the 1 Kb region upstream of the ATG start codon, considered as the promoter region, 76 inserted in the 1 Kb region downstream of the stop codon, deemed as the terminator region, and only 30 affected in exons, including 12 single-exon genes originating from six individual NUPTs from episode II (Table S5). Although many NUPT structural genes showed one single NUPT (222), the remaining displayed a variable number ranging from two to up to 10 NUPTs (one single gene, Morol14g00250) (Table S6). Additionally, two NUPT structural genes (Morol07g01780 and Morol05g07670) were affected by more than one NUPT found in both promoter and terminator regions.
We explored the spatial arrangement of NUPT structural genes across the moringa nuclear genome. For this purpose, NUPT structural genes from one episode or another, categorized according to the gene regions affected, were graphically represented across the moringa chromosomes in the form of Circos plots (Fig. 1C). 151 structural genes were exclusively affected by NUPTs-I, while 153 were exclusively hit by NUPTs-II; 134 structural genes either contain NUPTs from one and another episode and / or unclassified NUPTs (Fig. 1C). In general, NUPT structural genes appear to be scattered across all 14 chromosomes and did not show any apparent arrangement in clusters, the only exception being NUPT structural genes affected in exon sequences (Fig. 1C).
In contrast to the underrepresentation of NUPTs among structural genes, the opposite trend could be observed for most categories of RNA genes. This observation was especially significant for organellar tRNA, prokaryotic rRNA and self-splicing intron RNA genes present in the nuclear genome, the majority of which were of plastid origin (Table 1). We further mapped NUPT RNA genes onto the region of origin in the plastid genome by using a newly obtained annotation of RNA genes in the moringa plastid genome. Our new annotation of the plastid genome detected additional RNA genes, including five eukaryotic rRNA, one regulatory RNA (corresponding to the iron stress repressed RNA genes, isrR) plus 22 self-splicing introns (Fig. 1D). Furthermore, on top of the 36 tRNA genes found in the original annotation, an additional selenocysteine tRNA gene was detected. tRNA genes found in the plastid genome were further classified as plastid, mitochondrial and nuclear (Fig. 1D). In general, NUPT RNA genes matched RNA genes in the plastid genome belonging to the same category. For example, out of the 913 and 59 genes in the moringa nuclear genome annotated as plastid and mitochondrial tRNA, 912 and 48, respectively, corresponded to tRNA genes identically annotated in the plastid genome (Fig. 1D). Out of the 611 nuclear tRNA genes found in the nuclear genome, 112 were of plastid origin, 85 out of which were similarly annotated in the plastid genome, while the rest were annotated as plastid tRNA genes (Fig. 1D). A similar situation applied to the 324 genes annotated in the nuclear genome as encoding for prokaryotic rRNA, 302 out of which originated from prokaryotic rRNA genes found in the plastid genome. With respect to the 2,541 eukaryotic rRNA genes found in the nuclear genome, only 132 derived from the plastid genome, where they were identically annotated, except 36 genes encoding for 5S rRNA, which corresponded to two 5S rRNA genes annotated as prokaryotic in the plastid. Another category of RNA genes enriched for NUPTs was that of regulatory RNA genes, 42 out of 375 arising from a single gene found in the plastid annotated as isrR. Furthermore, every single of the 512 out of 533 genes annotated as NUPT-self-splicing intron in the nucleus proceeded from a gene region identically annotated in the plastid. Finally, the representation of NUPTs RNA genes across the 14 chromosomes of the moringa nuclear genome revealed their preferential arrangement in clusters, in contrast to that observed for structural genes (Fig. 1D).
Functional and expression characterization of NUPT structural genes
We examined whether the presence of NUPTs in structural genes could determine differences in their expression, either qualitatively or quantitatively, with respect to the rest of genes in the genome, using RNA-seq data from five tissues, i.e., flower, leaf, root, seed, and stem [26]. Out of the 428 NUPT structural genes, 380 showed significant expression in at least one of the five tissues, a fraction not significantly different to that found among all structural genes according to a Fisher’s exact test (Fig. 2A). When partitioned by formation episode, only NUPT-II structural genes featured a fraction of expressed genes significantly smaller than expected by chance (Fig. 2A). The fraction of expressed versus unexpressed NUPT structural genes showed deviations from non-NUPT ones depending on the region affected by NUPTs (Fig. 2A). While this fraction was significantly greater in the case of NUPT structural genes affected in introns, the opposite situation was observed for structural genes with NUPTs located in promoter or terminator regions, with no significant differences among those genes affected by NUPTs in exons (Fig. 2A).
Moreover, the overall expression of NUPT structural genes was not different than that of non-NUPT ones according to a Wilcoxon’s rank test, either when considered together, partitioned by formation episode or by the affected gene region, except among those affected in exons, whose overall expression was significantly lower (Fig. 2B).We also checked for differences in expression broadness between NUPT structural genes and non-NUPT ones by using the Tau index, calculated using the RNA-seq data from each of the five tissues sampled. Values of the Tau index range from 0, indicating broader unspecific expression, to 1, reflecting narrower specific expression [27]. NUPT structural genes showed a significantly broader expression across the five tissues with respect to the rest of structural genes in the genome according to a Wilcoxon’s rank test (Fig. 2C), an effect specifically related to structural genes affected by NUPTs-I in intronic regions (Fig. 2C). In contrast, NUPT structural genes affected in exonic regions showed a significantly narrower expression across the five tissues with respect to the rest of structural genes in the genome (Fig. 2C).
Next, we attempted to get insights about the potential involvement of the 428 NUPT structural genes in specific biological functions. For this purpose, we first used GO terms. Seven GO functional terms related to chloroplast and photosynthesis molecular functions, biological processes or subcellular locations were found to be significantly overrepresented (Table S7). Similar enrichment tests were also performed using EC numbers, representing a hierarchical classification of chemical reactions catalyzed by enzymes, and KEGG KO terms, describing molecular functions represented in terms of functional orthologs. One EC term (EC:1, grouping oxidoreductases), was found to be significantly overrepresented, whereas two other EC terms (EC:3.5.1.98, histone deacetylase; EC:1.10.3.9, photosystem II oxidoreductase) were found to be marginally overrepresented (Table S8). One KO term (K02704), describing a chlorophyll apoprotein, was found to be enriched among NUPT structural genes, whereas two other KO terms also describing chlorophyll apoproteins (K02690 and K02705) were found to be marginally enriched (Table S9). The enrichment found among NUPT-structural genes for biological functions and enzymatic activities related to chloroplast functions is not surprising considering that most of them corresponded to genes entirely of plastid origin (Tables S7-10). Indeed, DeepLoc 2.0 predictions of subcellular localization [28], did not find a significantly different number of NUPT structural genes to be imported to the chloroplast (45) than for non-NUPT structural genes (1,982), according to a Fisher’s exact test (P = 0.23), as could be expected if NUPT-structural genes had preferentially evolved chloroplast-related functions.
Out of the six NUPT-structural genes not fully covered by plastid DNA, Morol06g13970, affected by a single NUPT-II spanning the end of intron fifteen and the beginning of exon sixteen, showed the highest expression across all five tissues (Fig. 3). Morol06g13970 was annotated as encoding for the nuclear TPR3-like protein (Table S10), featured by a number of WD40 repeated motifs rich in Asp and Trp residues. WD40 motifs are found in a diverse range of proteins covering a wide variety of plant developmental related functions ranging from signal transduction and transcription regulation to cell cycle control and apoptosis [29]. A second gene, Morol10g09890, annotated as encoding for the cytochrome P450 CYP72A219-like protein, was affected by a NUPT-II spanning the end of its last exon and the beginning of its terminator region and was only found to be marginally expressed in roots (Fig. 3 and Table S10). Morol11g02440, highly expressed in all five tissues and affected by a single NUPT-I spanning the end of exon two and the beginning of intron two (Fig. 3), and Morol02g15190, only found to be marginally expressed in seeds and affected by an unclassified NUPT spanning its last exon (Fig. 3), had no annotated function (Table S10). BLASTP searches for putative homologous proteins in the Uniprot database v2023_04 [30] yielded translation initiation factor IF-1, an essential component for the initiation of protein synthesis, as best hit for Morol11g02440. In turn, the best retrieved BLAST hit for Morol02g15190 was annotated as a polyprenol reductase, a key player in the early steps of protein N-linked glycosylation.
NUPT RNA genes are functionally expressed
In a first attempt to assess whether nuclear RNA genes of plastid origin were functional, we examined their expression using our dataset of RNA-seq from five tissues. As for structural genes above, the number and percentage of expressed genes with respect to the total plus the distribution of their expression levels were represented as stacked bar plots and violin plots, respectively, for every class of RNA genes found as enriched for NUPTs (Fig. 4). The expression of homologous RNA genes present in the plastid genome was also shown for comparison. The fraction of expressed NUPT eukaryotic rRNA genes was found to be significantly higher than expected according to a Fisher’s exact test, an effect specifically related to NUPTs-I, while no significant differences were found among NUPT prokaryotic rRNA genes (Fig. 4A, C). The overall expression of NUPT rRNA genes was in general greater than that of non-NUPT ones, with differences being significant for eukaryotic rRNA genes in the case of NUPTs-I and for prokaryotic ones in the case of NUPT-IIs (Fig. 4B, D). For the rest of RNA genes in the nuclear genome, the fraction of expressed genes among those of plastid origin was generally lower than expected for most categories (Fig. 4E, G, I), and in contrast to what was observed among rRNA genes, their overall expression was in all cases smaller among those of plastid origin, with differences significant in all cases where a Wilcoxon’s rank tests could be implemented (Fig. 4F, H, J).