Chromosome-length scaffold assembly of Schlechtendalia chinensis
Our strategy of genome assembly employed Illumina (San Diego, CA) short read and PacBio (Menlo Park, CA) long-read sequencing data with scaffolding informed by high-throughput chromosome conformation capture (Hi-C)[64]. Aphid samples were collected from the gall induced by S. chinensis on Rhus chinensis and DNA was extracted for sequencing from 40 autumn migrant individuals. We constructed two DNA libraries of 350 bp, followed by sequencing on the Illumina HiSeq 2000 paired-end technology resulting in the generation of 85 Gb and 236,191,478 raw reads. Finally, 35 Gb of clean reads were produced after removing low-quality reads, corresponding to ~ 86-fold coverage of the haploid genome. The genome size was estimated to be 409.55 Mb by K-mer analysis (k = 19) and estimated heterozygosity of 1.31%, which indicated a highly heterogeneous complex genome (Additional file 1: Figure S1). In addition, the sequencing from PacBio libraries on the PacBio PS II platform yielded 510 Gb of raw data. After quality checking and removing the reads of endosymbionts, a total of 36.04 Gb clean data was produced corresponding to ~ 90-fold coverage of the genome from 2,257,809 reads with N50 of 16,203 bp, which indicated the high quality of this sequenced genome (Additional file 1: Table S1 & S2).
Hi-C generated 53 Gb of data consisting of 180,735,576 reads, combined with the PacBio assembly, further improved data quality for assembling 189 scaffolds with N50 (210,091,861 bp). All 189 scaffolds were anchored and oriented onto 13 chromosomes, in which more than 91.71% of assembled sequences were located (Fig. 2, Table 1, Additional file 1: Table S3). We did not encounter chromosomal super- scaffold gaps after Hi-C scaffolding, indicating the high genome assembly quality (Additional file 1: Table S4). To evaluate the completeness of the genome assembly, BUSCO analysis was performed against arthropod datasets of 939 metazoan species, and 97% of S. chinensis genes were mapped to the reference genomes, which validated the completeness and high quality of our genome (Table 1, Additional file 1: Table S5). Our assembly of the of the S. chinensis genome has almost the same size as those of other aphid genomes publicly available.
Table 1
Detailed statistics of the Schlechtendalia chinensis genome assembly, showing the assembly features and gene models.
Assembly features
|
Hi-C Scaffolds
|
Number of chromosomes (n)
|
13
|
Number of Scaffolds
|
189
|
Total size of Scaffolds
|
344.56 Mb
|
Longest scaffold size (Mbp)
|
122.78 Mb
|
Shortest Scaffold size (Mbp)
|
15.65 Kb
|
Mean scaffold length (Mbp)
|
1.82 Mb
|
Median scaffold size
|
77.06 Kb
|
N50 scaffold length
|
21.09 Mb
|
L50 scaffold count
|
10
|
Scaffolds GC content
|
33.74%
|
Scaffolds Gaps (N) content
|
0
|
Percentage of assembled contigs in scaffolds
|
96.92%
|
Average number of contigs per scaffold
|
1.03
|
BUSCO (complete)
|
94.34%
|
Gene models
|
|
Number of genes models
|
15,289
|
Mean coding sequence length CDS
|
1520.26 bp
|
Mean number of exons per gene
|
6.61
|
Mean exon length
|
1822.17 bp
|
Mean intron length
|
6663.72 bp
|
Non-protein-coding genes
|
426
|
Number of miRNA gene
|
23
|
Number of tRNA gene
|
136
|
Number of rRNA gene
|
33
|
Number of snRNA gene
|
71
|
Number of soRNA
|
13
|
Pseudogenes
|
|
Number of pseudogenes
|
192
|
Total length
|
523,288
|
Average length
|
2725.46
|
Gene prediction and genome annotation
In total, 129,303,997 bp repeat sequences were identified, spanned ~ 130 Mb, and constituted 37.52% of the S. chinensis genome (Additional file 1: Table S6). We predicted 15,013 genes in the genome using the program EVidenceModeler (EVM), which used the combination of Ab initio gene prediction, protein, and transcripts alignment into predicted gene structure (Additional file 1: Figure S2, Table S7). The number of predicted genes in the S. chinensis genome is within the range of other aphid genomes (11,980 − 32,005) but is on the lower end (Table 2). However, because the number of predicted genes might vary depending on the quality of the genome and prediction pipeline [65, 66], we further used a collection of eight annotation databases, i.e., KEGG, KOG, Pfam, GO, TrEMBL, eggNOG, Nr, and Swissprot, to annotate protein-coding genes. In the end, we obtained annotation for 14,582 (97.13%) coding genes, 99% of which were anchored to the 13 chromosomes. The average sequence lengths for the entire gene regions, exons, coding regions (CDS), and introns are shown in Table 1.
Table 2
Comparative whole genome phylogenomic analysis of Schlechtendalia chinensis with the genomes of 10 aphid’s species included in the study.
|
S. chinenisis
|
A. craccivora
|
A. glycine
|
A. gossypi
|
A. pisum
|
C. cedri
|
D. noxia
|
M. persicae
|
M. sacchari
|
R. madis
|
S. flava
|
No. of genes
|
15,013
|
32,005
|
18,358
|
12,343
|
18,214
|
16,684
|
12,290
|
14,825
|
12,269
|
11980
|
13,504
|
Genes in orthogroup
|
13,596
|
23,856
|
13,092
|
11,496
|
15,815
|
13,021
|
11,157
|
14,001
|
11,795
|
11,530
|
12,364
|
Percentage of genes in orthogroup
|
90.6
|
74.5
|
71.3
|
93.1
|
86.8
|
76.7
|
90.8
|
98.4
|
96.1
|
96.2
|
91.6
|
Unassigned genes
|
1,417
|
8,149
|
5,266
|
847
|
2,399
|
3,963
|
1,133
|
824
|
474
|
450
|
1,140
|
Percentage of unassigned genes
|
8.4
|
25.5
|
28.3
|
6.9
|
13.2
|
23.3
|
8.2
|
1.6
|
3.9
|
3.8
|
8.4
|
Number of orthogroups containing genes
|
9,865
|
13,049
|
9,561
|
9,909
|
11,324
|
10,080
|
9,788
|
10,811
|
10,161
|
9,844
|
9.817
|
Species-specific orthogroup
|
193
|
832
|
221
|
15
|
211
|
361
|
14
|
56
|
28
|
44
|
103
|
Genes in Species-specific orthogroup
|
893
|
4,105
|
993
|
31
|
601
|
1,069
|
31
|
138
|
80
|
123
|
354
|
Percentage of genes in Species-specific orthogroup
|
2.6
|
12.8
|
5.4
|
0.3
|
3.3
|
6.3
|
0.3
|
0.9
|
0.7
|
1.0
|
5.9
|
The number of the predicted protein-coding genes in the S. chinensis genome (15,013) is closest in range to the number of genes annotated for Cinara cedri (16,684) and Myzus persicae (14,825), but less than half that of A. craccivora (32,005) (Table 2). Among the 15,013 predicted genes, 14,582 (97.13%) could be annotated by one of the eight protein-coding databases, i.e., GO 10,037 (66.86%), KEGG 10,997 (73.25%), KOG 7,762 (51.7%), Pfam 11,447 (76.25%), TrEMBL 14,546 (96.89%), eggNOG 10,795 (71.9%), Swissprot 9,847 (65.59%) and the NCBI non-redundant protein database nr 14,238 (97.13%) (Additional file 1: Table S8).
We predicted the functional classification of the annotated genes using the eggNOG database, which returned 3,845 (33.5%) of the genes as having unknown functions (Additional file 1: Figure S3). Subsequently, we predicted different types of non-coding RNAs (ncRNAs), including small nuclear RNAs, tRNAs, rRNAs, micro RNAs (miRNA), and small nucleolar RNAs (snoRNAs), from a small RNA library (Table 1). Furthermore, a total of 192 pseudogenes were identified in the assembled genome of S. chinensis (Table 1).
Comparative genomics and phylogeny
We downloaded whole genomes of 10 species of aphids and one whitefly species (Bemesia tabaci) as an outgroup for comparative genomic analysis (Additional file 1: Table S9). We clustered the annotated genes to identify gene families (orthogroups) and genes common to all species. We analysed 19,874 orthogroups and found that a total of 13,596 (90.6%) genes in S. chinensis clustered into 9,865 orthogroups, among which 893 genes (5.9%) belonged to 193 S. chinensis-specific orthogroups (Table 2). The number of orthogroups unique to S. chinensis vs. shared by all 12 species and the copy number of genes for each species was also analysed (Fig. 3A & 3B). The genes specific to S. chinensis were also analysed by GO and KEGG enrichment analyses for their functions using cluster Profiler v3.14.0 [67]. They were characterized as involved in molecular functions, various biological processes, and related to structural cellular components (Additional file 1: Figure S4).
The prediction of contraction and expansion of the gene families relative to the other 11 species showed a higher frequency of gene expansion in the S. chinensis genome. Compared with other aphid species in the analysis, 97 gene families in S. chinensis underwent expansion, while only 27 gene families underwent contraction (Fig. 3C). Using GO pathway enrichment analysis, the 97 expanded gene families were found to have a role in DNA repair, telomerase maintenance, DNA helicase activity, protein dimerization, RNA-directed polymerase activity, and aspartic-type endopeptidase activity (Additional file 1: Figure S5). In contrast, the roles of 27 contracted gene families were related to the oxidation-reduction process, nucleosomes assembly, and methylation (Additional file 1: Figure S6).
Collinear analysis of the S. chinensis genome with a reference genome (Acyrthosiphon pisum; GenBank accession (PRJNA547584) was performed to evaluate genome structure variation by using DIAMOND v0.9.29.130[60], and to determine similar gene pairs. Collinear analysis was performed to analyze variation between S. chinensis and the reference genome from Aphidoidea, which could help to verify the accuracy of genome assembly. In addition, collinearity analysis is helpful to obtain gene pairs of homologous origin, that simplifies the calculation of Ka/ Ks for genome duplication analysis, as collinear genes tend to have the same biological function. Based on all predicted and annotated genes in the S. chinensis genome, collinear genes between chromosomes were determined using MCScanX [61] (parameter-m 15). All the genes were arranged in the collinear block, and the linear pattern of all genes of S. chinensis against the reference genome was predicted (Fig. 4A & 4B). The collinearity analysis compares all the orthologs shared by S. chinensis with the reference genome, with random distribution on chromosomes because of different size and numbers of genomes and chromosomes, respectively.
We reconstructed a Maximum Likelihood (ML) phylogenetic tree by selecting 1,902 single-copy orthologous genes from the 11 aphid species with B. tabaci as an outgroup to root the tree (Fig. 3D) using IQ-TREE v1.6.11 software [51]. We calculated the aphid divergence time using TimeTree [68] (http://www.timetree.org/) (see Methods for details). The phylogenetic analysis and estimated divergence time indicate that, among this limited sample of aphid diversity, S. chinensis is most closely related to C. cedri and diverged from the latter around 19–126 million years ago (MYA); Eriosomatinae + Lachninae diverged from Aphididae around 58–155 MYA (Fig. 3D).
Whole Genome Duplication and positive selection
Whole Genome Duplication (WGD) is the process of doubling the overall content of the genome and has a significant impact in shaping the evolution of species. Ancient WGDs have been associated with major eukaryote lineages, and many events of WGD have been detected in insects[69]. To evaluate the possibility of WGD in S. chinensis, we analyzed the distribution of synonymous substitution rates per gene (Ks) and four-fold synonymous (degenerative) third codon transversion (4DTv) between collinear paralogous genes. We mapped the Ks and 4DTv curves within and between the species (S. chinensis, A. pisum, C. cedri, and S. flava) to determine WGD occurrence. One prominent peak was observed in the S. chinensis genome based on the abundance of Ks sites values (Ks value of 0.25) and 4DTv value (4DTv value of 0.05), indicating that S. chinensis had experienced a WGD event during its evolution. The genomes of A. pisum, C. cedri, and S. flava were used to identify the Ks and 4DTv values collinear blocks between S. chinensis, which suggested that S. chinensis experienced large-scale duplication more recently than these three aphid species (Fig. 5).
Based on Ka/Ks value, we performed an analysis for positive selection and identified nine genes from different gene families containing significantly positively selected sites (Additional file 1: Table S10). When genes are strongly positively selected, they may result in new functions for species [70]. Positively selected genes are the foundation for new functions in species and have a prominent role in evolution. GO enrichment analysis revealed that positively selected genes in S. chinensis were related to the "transcription regulation" category of biological processes, "transport vesicles and plasma membrane" in the cellular component category, and "sequence-specific DNA binding and transcription regulator activity" in the molecular function category (Additional file 1: Figure S7). While the KEGG pathway enrichment analysis showed that these positively selected genes were mainly related to neuroactive ligand-related reception, apoptosis, and the phosphatidylinositol signalling system (Additional file 1: Figure S8).
Endogenization of Parvovirus like DNA sequences
Parvoviruses (Parvoviridae) are single-stranded DNA viruses that infect a wide variety of arthropods and insects, including aphids[71]. Densoviruses (Parvoviridae: Densovirinae) have been reported to infect aphid species, e.g., Myzus persicae and Dysaphis plantaginea [72, 73]. Integration of densovirus genes, including structural and non-structural types, has also been reported in non-galling aphid species[71, 74]. Using EVM gene prediction followed by multiple annotation software (see methods), we found many DNA sequences related to parvoviruses in the genome of S. chinensis. We analyzed all the predicted parvovirus-related DNA sequences (PRDs) against the Pfam database, which classified all the sequences into their respective groups and families. A significant number of PRDs were classified as "parvovirus coat protein VP1" genes. We identified a total of 115 PRDs integrated into 13 chromosomes of the S. chinensis genome (Fig. 6), while few PRDs were found in unanchored scaffolds. All the analyzed PRDs have variable lengths (519–3204 bp) with 2–3 exons and 1–2 introns in each sequence.
We performed functional annotation of the PRD peptide-coding sequences (CDS) using GO functional annotation and KEGG functional annotation databases. GO annotation showed that some of the PRDs have phospholipase A2 activity (GO:0004623) in the molecular function category, and phospholipid metabolic process (GO:0006644) and arachidonic acid secretion (GO:0050482) in the category of biological process (Additional file 2). At the same time, KEGG annotation showed that these sequences have a role in amino acid metabolism, i.e., lysine degradation and histone (H3)-lysine N-methyl transferase SETMAR activity (K11433). The functional activity of these sequences might also reflect stable domestication of these sequences in the host genome, like SETMAR protein which is the product of domesticated transposase fused with methylase[75].
Based on the above analysis, further investigations will be required to find the specific function of these sequences. Other databases, i.e., TrEMBL and nr, annotated the PRDs in S. chinensis as uncharacterized and hypothetical protein in Aphis glycines, Acyrthosiphon pisum, and a protein of Tribolium castaneum (Additional file 2). The presence of PRDs in other aphids reflects the old and stable integration of these sequences in aphids, possibly by horizontal gene transfer. We did not find any non-structural genes related to parvoviruses in the genome of S. chinensis.
PRDs are also present in other, non-galling aphids. We performed a detailed phylogenetic analysis of S. chinensis PRDs with other aphids. We did NCBI BLAST searches using S. chinensis PRDs as queries against genomes of other aphids and downloaded the genomes with the best hits. We also downloaded parvovirus coat protein VP1 genes and performed phylogenetic analyses to investigate the origin of S. chinensis PRDs. The results indicated that one clade of total 13 PRDs from the S. chinensis genome grouped with sequences extracted from other aphids i.e., Myzus persicae, Acyrthosiphon pisum, and Sipha flava, suggesting common ancestry (Fig. 7a). While one PRD (SchiG000450.1) nested with the single cell protist Abeoforma parvovirus coat protein VP1 (Accession BK010894.1) in a clade containing parvovirus VP1 sequences of Semian parvovirus, Myzus percicae, African termite densovirus, and Solenopsis invicta. The remaining of the PRDs clustered separately into five different clades, and a single branch of one PRD, which indicated multiple and independent integrations of these sequences in the S. chinensis genome (Fig. 7A & 7B). Overall, all the PRDs identified in S. chinensis genomes belongs to seven lineages (Fig. 7B). The exact role and function of these endogenized genes are unknown in these aphids. However, a recent study reported horizontally transferred parvovirus non-structural genes in pea aphid genomes to have roles in wing plasticity in response to crowding [76]. Here, for the first time, we present the integration and endogenization of parvovirus-related DNA sequences (PRDs) in a galling aphid genome. Further studies and research will be required to investigate the role and functions of these endogenized sequences in this and other aphids.
Cytochrome P450 genes and gall-inducing genes of S. chinensis
Like all other aphids, S. chinensis feeds on the sap of its host plant, and in so doing it must detoxify any toxic plant metabolites that are present. The cytochrome P450 family of proteins plays a major role in metabolizing plant metabolites [77], and the expression of these proteins in insects is generally induced by toxic plant metabolites. We identified 36 cytochrome P450 genes in the genome of S. chinensis (ScP450s) comprising 0.21% of the total genes. Of these, 35 genes appeared to be functional, while one was a probable pseudogene. We followed the standard nomenclature at the Cytochrome P450 homepage [78] to name and classify the 36 genes. We assigned the 36 ScP450s to four clans, 16 families, and 18 subfamilies (Additional file 1: Table S 11). The most represented families were CYP6 with ten genes and CYP4 with eight genes. Comparatively, the number of P450 genes was fewer than in all the ten aphid genomes downloaded from GenBank (Additional file 1: Table S9). Compared to other insects, especially the ten aphids included in this study, the number of P450s in S. chinensis is fewer, i.e., 48 in D. noxia, 58 in A. glycine, 75 in R. maidis, 83 in A. pisum, 66 in A. gossypii, and 115 in M. persicae. The fewer P450s genes in S. chinensis compared to other aphids might be related to S. chinensis being an oligophagous insect with only two definitive hosts, i.e., the sole primary host, Rhus chinensis, and secondary hosts comprising only three moss species from the same family. In contrast, most of the other aphids are polyphagous, with many host plants requiring more P450 genes to detoxify the metabolites of more diverse host plants [79]. All the P450 genes identified in S. chinensis were extracted from genome (Additional file 3), and their position on chromosomes was also located (Fig. 6, Additional file 1: Table S11)
The average length of ScP450s with complete open reading frames was 498 amino acids (aa), which is consistent with the average P450 gene length in other insects [80]. The 36 ScP450s were located on ten chromosomes, in which chromosome 1 contained the highest number of nine, and chromosome 6 had 8 ScP450s. Almost half of the genes were present as a cluster of two or more on each of six chromosomes, while the remaining were present individually. We did not find any duplicated ScP450s, suggesting the absence of any duplication event during evolution.
A maximum likelihood (ML) phylogenetic tree of the P450 superfamily was constructed using three other insect species (Anopheles gambiae, Drosophila melanogaster, and Bombyx mori) with already known P450 classifications to identify gene orthologs (Fig. 8). The phylogenetic tree resolved the four expected insect P450 groups of the CYP2, CYP3, CYP4, and mitochondrial clans. The CYP3 clan contained a single family CPY6 with less evolutionary divergence as compared to the other clan. The mitochondrial clan contained the fewest ScP450 genes, but all six belonged to different families. Although the BLASTn searches showed the closest relationship of all ScP450s with Acyrthosiphon pisum P450 sequences, they were not included in the analysis due to the absence of a complete dataset. The CYP4 clan contained 11 ScP450 genes, followed by CYP3 and CYP2, which contained 10 and 9 ScP450 genes, respectively (Fig. 8, Additional Table 11).
Gall formation is a defensive response of some plants to galling microorganisms and insects. Insects in several different orders can induce galls on their hosts, but the exact mechanisms by which galls are induced and develop are still uncertain. Some studies have shown that gall induction is species-specific, and galling insects induce galls on their host by delivering effector proteins into plant tissues through their saliva during feeding [81–83]. Saliva injection by S. chinensis in the leaf cells of the host plant induces formation of the gall. Recently, LC-MS/MS analysis of S. chinensis saliva identified 31 S. chinensis proteins, some of which may conceivably play a role in gall induction [84]. The genes coding for these specific salivary proteins were also annotated and present in our S. chinensis complete data set (15,013 genes). The proteins coded by these genes are mostly related to binding activities, including DNA-, ATP-, protein-, and iron-binding.