Assembly of the genome of IndInt strain
Fourteen generations of Iso-female lines from a parent T6 lab-colony of the intermediate-form of An. stephensi (IndInt) were grown, both to achieve homogenity required for high-quality assembly from sequencing of pooled DNA from multiple insects and to enrich for Plasmodium resistant phenotype. Multiple tools were used to obtain contig-level assembly of IndInt from 60X coverage of reads sequenced using PacBio Sequel II technologies. The assembly statistics obtained from the various tools are shown in Table 1. In the first round, the best assembly statistics was obtained from WTDBG2 tool with a L50 value of 6 and N50 9.1 Mbp. The next best statistics was shown by the tool Flye assembler with a L50 of 15 and N50 value of 3.9 Mbp.
Table 1
Statistics of contig-level assemblies by CANU, FALCON, FLYE, and WTDBG2 tools.
| Assembly size (Mbp) | Total contigs | Longest contig (Mbp) | Shortest contig (bp) | Median contig (Kbp) | N50 | L50 |
CANU | 405 | 3464 | 16.6 | 1055 | 44.2 | 48.7 Kbp | 115 |
WTDBG | 205.4 | 789 | 33.08 | 1126 | 15.4 | 9.1 Mbp | 6 |
FALCON | 293 | 851 | 11.4 | 5734 | 87.4 | 1.4 Mbp | 46 |
Flye | 211.7 | 689 | 27.15 | 1007 | 14.6 | 3.9 Mbp | 15 |
Flye + WTDBG | 213.3 | 521 | 45.56 | 1018 | 8.3 | 17.04 Mbp | 4 |
(Flye + WTDBG) + SSPACE | 213.8 | 464 | 55.4 | 1018 | 5.9 | 36.9 Mbp | 3 |
To improve the contiguity of the assembly, the four different contig-level assemblies from CANU, FALCON, FLYE, and WTDBG2[19][20][21][22] were merged in different combinations using the tool Quickmerge[23]. Assembly statistics were computed for each of the merged combinations to determine the best merged assembly (Table 1). During the second round, the assembly with the best statistics was obtained by merging assemblies from FLYE and WTDBG2, with a L50 of 4 and a high N50 of 17 Mbp. The computed genome size of this merged assembly was comparable to the actual size of An. stephensi genome, and was used for scaffolding. The scaffolds of the error-corrected contigs were built using the SSPACE assembly tool with simulated mate-pairs of varying sizes from the assembly of IndCh strain as reference. The L50 value for the scaffold level assembly was 3 and the N50 value significantly increased from 17 Mbp to 37 Mbp (Table 1, last row).
In order to check any heterozygosity from potential inversions, haplotype phasing of the assembly of the IndInt strain was done using a set of primary and associated contigs from FALCON and FALCON-unzip tools, respectively[20]. In the presence of heterozygosity, the mapped reads will be split between two contigs that represent the two forms of the heterozygous allele, resulting in reduced coverage. This creates two peaks in the histogram of read counts, one for the haploid level coverage from heterozygous regions and the other for the diploid level coverage from homozygous regions. Figure 1a with a hump indicates the presence of potential inversion heterozygosity in the genome of IndInt.
Synteny between the assemblies of IndInt and IndCh strains is shown in Fig. 1b. The block view clearly suggests that IndInt is homozygous for the standard form of 2Rb (2R+ b/2R+ b) inversion similar to the IndCh strain. However, in the 3L arm of chromosome 3 IndInt shows a clear presence of an inversion not seen in the IndCh assembly. The locus of the physical markers revealed that the markers from 42A to 44C are inverted in the assembly, which matched with a reported breakpoint for the 3Li inversion in An. stephensi[8]. Based on the photomaps from several individuals of IndInt strain (see Fig. 1e), it is clear that 48% of the individuals are heterozygous for the 3Li inversion (3L+ i/3Li), with the rest (52%) being homozygous for the standard form of 3Li (3L+ i/3L+ i) (Supplementary Fig. 1). It should be mentioned that the percentage heterozygosity reported here is from an outgrown generation maintained in the lab after the generation used for assembly. Markers 41A, 41B, 41C, and 46D shown in the standard form reported for 3Li (top Fig. 1c) are not detectable in the IndInt assembly using the BLASTN tool, perhaps because of limitations of the assembly tools in handling noise generated by reads representing the two arms near the inversion breakpoints from the heterozygous arms.
Validation of the 2R b breakpoint region in IndInt
The IndInt strain is homozygous for the standard form of the 2Rb inversion, making it the second assembled genome for one of the three possible 2Rb genotypes, such as 2R+ b/2R+ b. This provides an opportunity to further refine and validate the breakpoint regions described previously by our group[5]. A BLASTN of the sequences from breakpoint regions from IndCh including 1000 bases to the left and right of both breakpoints, against the IndInt assembly, shows 100% identity for all the 2079 bases from the centromere-proximal breakpoint and 99.8% identity for all the 2009 bases from the centromere-distal breakpoint. We also provide the actual alignment with 50+/- nucleotides around the breakpoints in Supplementary Fig. 2. This confirms the 2Rb genotype to be 2R+ b/2R+ b for IndInt.
Interrogation of 3L i breakpoints in IndInt
In order to define the breakpoints for the 3Li inversion, a contact map was generated by mapping the HiC reads from the UCI strain onto the final IndInt assembly. Since UCI is homozygous for the standard (3L+ i/3L+ i) configuration, two expected butterfly-like structures in the chromosome 3L were observed in the contact map (Fig. 2a) defining the location of both the proximal and distal breakpoints of the 3Li inversion in the IndInt assembly. By magnifying each of the butterfly structures to a resolution of 5 Kbp per dot, the proximal and distal breakpoints of the 3Li inversion were estimated to be between bp 22,545,000 to 22,550,000, and between bp 31,555,000 to 31,560,000 respectively.
Synteny of 3L i inversion with 2La of An. gambiae
Synteny of the final IndInt assembly was also plotted against the An. gambiae PEST strain[24]. Figure 2b shows that chromosome X, 2R, and 3R of An. gambiae is homologous to that of IndInt, respectively. However, the synteny between chromosomal arms 2L and 3L of An. gambiae are switched to chromosomal arms 3L and 2L of IndInt, respectively, as reported elsewhere[5]. Figure 2c shows the synteny between the 2La inversion of An. gambiae with the 3Li region of IndInt. The 2La region of An. gambiae was taken from a previous study[25]. The circle view (Fig. 2c) of this synteny reveals that the entire 3Li inversion region of IndInt consisting of 8 Mbp is syntenic to a big part of the 2La inversion region of An. gambiae comprising 21 Mbp. Since the 2La inversion of An. gambiae is associated with Plasmodium sensitivity[11][12], by analogy, one can infer that the 3Li of IndInt may also be associated with the Plasmodium resistance observed for IndInt strain. Furthermore, it is reported that the strains of An. gambiae that are homozygous for the standard form of 2La (2L+ a/2L+ a), show higher vectorial capacity than the corresponding strains that are heterozygous for the inverted form. Similarly, the strain IndCh with higher vectorial capacity is homozygous for the standard form of 3Li (3L+ i/3L+ i) and IndInt is heterozygous for the 3Li inversion (3L+ i/3Li), suggesting the potential role of the 3Li inversion in Plasmodium sensitivity.
Functional characterization of genes within the 3L i region
Within the 2La region of the An. gambiae PEST strain spanning 21 Mbp, there are 1243 genes of which there are 494 genes that are homologous/syntenic to the 3Li region of IndInt. Of these, 464 are functionally annotated in IndInt based on sequence level annotation tools, and the remaining 30 genes are uncharacterized due to a lack of sequence homology to known functional genes. Among functionally-characterized genes, we found 36 isoforms of cuticle-forming genes in IndInt. Interestingly, cuticle thickening is associated with pyrethroid & insecticide resistance in other Anopheles species[26], while also resisting Plasmodium infection[27]. We found 99 missense mutations in the IndInt strain within 23 copies of the cuticle genes, as compared to the IndCh strain of the type form, which has a homozygous configuration for the standard form of 3Li, like the UCI strain.
In order to find genes implicated in immune response within the 3Li region of IndInt, we intersected the 494 genes within this region with a list of 589 genes reported to be implicated in immunity in insects [15]. Out of these 589 genes, we found only 17 genes within the 3Li region (Table 2) including NFkB, a transcription factor implicated in cytokine production in humans.
Table 2
Genes within the 3Li region of IndInt that are reported to be implicated in immune response by Waterhouse et al. [15]. Highlighted in cyan is the only gene implicated in TNF-TNFR signaling pathway.
IndInt Augustus ID | IndInt Chr 3L Start | IndInt Chr 3L End | Waterhouse et. al [15] Immune DB ID | Function |
g22033.t1 | 22194979 | 22196095 | AGAP005552 | Peptidoglycan recognition protein |
g22086.t1 | 22559888 | 22560559 | AGAP007020 | Thioredoxin peroxidase 5 |
g22107.t1 | 22657471 | 22661294 | AGAP006939 | Armitage protein |
g22121.t1 | 22702737 | 22707661 | AAEL001675 | Clip domain, serine protease family A |
g22273.t1 | 24373199 | 24379261 | AGAP006631 | Class A Scavenger Receptor (SRCR domain) with Serine Protease domain |
g22562.t1 | 26855939 | 26862215 | AGAP006914 | Fibrinogen-related protein 1 |
g22567.t1 | 26922106 | 26923713 | AGAP006911 | Serine protease inhibitor - Serpin 2 |
g22568.t1 | 26925252 | 26932268 | AGAP006909 | Serine protease inhibitor - Serpin 1 |
g22799.t1 | 28434986 | 28435854 | AGAP006790 | Fibrinogen C-terminal domain-containing protein |
g22840.t1 | 28809510 | 28812635 | AGAP006761 | 3-glucan binding protein |
g22865.t1 | 28987304 | 28992526 | AGAP006747 | IMD pathway signaling NF-kappaB Relish-like transcription factor |
g22870.t1 | 29022137 | 29026993 | AGAP006743 | Unspecified product BLASTP : Similar to ficolin 1 |
g22921.t1 | 29370252 | 29370571 | AGAP006722 | Unspecified product BLASTP : Similar to Cecropin B1 |
g23080.t1 | 30464940 | 30470336 | AGAP005672 | Staphylococcal nuclease domain-containing protein 1 |
g23109.t1 | 30680813 | 30689983 | AGAP005652 | ATP-dependent RNA helicase DDX5/DBP2 |
g23151.t1 | 30998165 | 31003426 | AGAP005625 | Class A Scavenger Receptor (SRCR domain) with Serine Protease domain |
g23220.t1 | 31575970 | 31576777 | CPIJ004568 | Similar to Cu-chaperone |
The genes implicated in Plasmodium resistance in An. gambiae[28] LRIM1, APL1C, and TEP1 are not within the 3Li region in IndInt. The TEP1 protein in IndInt is located on chromosomal arm 2L. The LRIM1 gene is outside the 3Li region, while the gene APL1C is located immediately outside the distal breakpoint of the 3Li region. The LRIM1 gene has leucine-rich repeats and forms a disulfide-bonded complex with APL1C (PDB ID: 3OJA [28]), which stabilizes the mature TEP1 and promotes its binding to parasites to trigger pathogen destruction. Also, more recently, a knockout of LRIM1 in An. stephensi was shown to play a role in vector competence[29]. Furthermore, LRIM1 and APL1C are within the 2La region in An. gambiae, which is three times larger than the 3Li region in IndInt. In IndInt, the LRIM1 has four and six missense mutations compared to IndCh and UCI strains of type-form, respectively. However, since the gene APL1C shows no missense mutation in IndInt compared to IndCh, we expected other unannotated paralogs of LRIM1 within the 3Li region of IndInt. Deep annotation of the 30 functionally uncharacterized genes within the 3Li region shown in Supplementary Fig. 3 with fold-prediction algorithms, revealed two novel leucine-rich proteins, g22432 and g22212, similar in structure to LRIM1 (Supplementary Fig. 3a and 3b). These two LRIM1-like genes could potentially work with the intact APL1C gene in IndInt to launch a host defense system to antagonize the malaria parasite. The high conservation of these, otherwise uncharacterized genes across Anopheles species, shown in the multiple sequence alignments of g22432 and g22212, suggests functional conservation across species (Supplementary Fig. 4a and 4b).
Recently, members of LPS-induced TNF-alpha factors, LITAFs, have been shown to regulate anti-Plasmodium immunity in An. gambiae[30][31]. A search in the assembled genome of An. stephensi identified four members of LITAF-like transmembrane genes (Supplementary Text 1). LITAFs are known to send signals to the nucleus to transcribe TNF-alpha molecules in humans [32]. In fact, LPS-induced animal models were used in developing drugs against TNF-alpha to treat auto-immune disorders. We were also able to find homologs of all the genes involved in this signaling pathway in IndInt (Supplementary Text 2).
We found an eiger homolog, g22826, which is within the 3Li region of IndInt and is missing in the reported list of 361 genes with immune repertoire in An. stephensi [4]. A crystal structure of the complex of eiger and its receptor from Drosophila has revealed that eiger belongs to the TNF superfamily and its two receptors, wengen and grnd, are members of the TNFR superfamily(6ZT0 [33]). A search for an active TNFR-like gene in An. stephensi, using the HMM model (PF00020) of the cysteine-rich domain (CRD) from PFAM database, identified two potential hits g1129 and g18030 in IndInt. It should be mentioned that while wengen displays detectable homology with TNFRSF, predicted grnd homologs from many sequenced species, including malaria vectors, remain uncharacterized by sequence-based annotation tools. The gene, g1129 from chromosome X of the IndInt strain encodes a wengen homolog of Drosophila, with a well-defined extracellular CRD, and a transmembrane region. The grnd homolog (g18030) in the IndInt strain consists of an intact CRD region sandwiched between a predicted signal sequence and transmembrane region (Supplementary Fig. 5c and 5d). We found that the eiger gene has a frame-shift mutation at the tail-end of the very conserved folding domain in the IndCh assembly. Also, in the UCI and STE2 strains, the predicted eiger homologs are missing the critical N-terminal transmembrane region (Supplementary Fig. 6). In Fig. 3a, a 3D model of the eiger gene from IndInt, built using RoseTTaFold tool[16], is shown superimposed on the crystal structure of human TNF-TNFR complex (PDB: 1TNR,[34]) on the left, along with a wengen model built using homology to the second domain of TNFR-1A. Similarly, the eiger model is shown superimposed on the crystal structure of the eiger-grnd complex (PDB: 6ZT0 [33]), along with a model of grnd built using homology to grnd of Drosophila in Fig. 3b. The choice of templates for comparative modeling of the CRDs of the two receptors is based on the conserved cysteine patterns of the ligand binding regions, which are different for the two receptors wengen and grnd. Interestingly, the wengen binding domain has a cysteine pattern similar to TNFR-1A of humans and that of grnd is similar to that from TNFR-1B, suggesting an evolutionary-conserved link between eiger and TNF-alpha mediated signaling in arthropods and vertebrates respectively.
A functional TNF-alpha/TNFR pathway in malaria vectors with lower vectorial capacity
It is now well established that the Drosophila genome harbors two functional TNFR receptors and a single ligand, eiger, belonging to the TNF-TNFR system[33]. In order to predict whether the homologs of the eiger and its cognate receptor genes in An. stephensi (IndInt) are potentially functional, we used signal peptide and transmembrane predictions on the respective protein sequences derived from IndInt (Supplementary Fig. 5). The eiger homolog shows a transmembrane region at its N-terminus (Supplementary Fig. 5a) consistent with type II configuration common to all members of the TNF superfamily across organisms. The two receptor homologs show clear transmembrane regions (Supplementary Fig. 5b and 5c) following the cysteine-rich folding domain. However, only grnd, but not wengen, has a predicted signal peptide signature (Supplementary Fig. 5d) N-terminal to the CRD, consistent with the localization of these two genes in Drosophila, where wengen is almost always localized in intracellular vesicles[33]. We also found the homolog of the TACE gene in IndInt, which is a TNF-alpha processing metalloprotease that converts the type-II membrane-anchored ligand into its soluble form in humans [35]. In the eiger gene from IndInt, we also found a cleavage site (AQ) N-terminal to the folding domain that is known to be specific to TACE [36].
In order to show that all the genes in the cytoplasm necessary for eiger-wgn-grnd mediated signaling are encoded by the genome of IndInt, we identified homologs of all the pathway genes from TNF-TNFR signaling pathway from KEGG (Supplementary Text 2). Figure 4 shows genes implicated in wengen-based (left) and grnd-based (right, hypothesis) based on the homology of the receptors to TNFR-1A and − 1B, respectively. Several genes are disrupted in the other strains of type-form viz IndCh, UCI and STE2, as shown in Fig. 4, Supplementary table 1 and Supplementray Table 2. The homologs for genes marked by red and purple stars are disrupted and absent, respectively, in the IndCh genome, potentially disrupting eiger-mediated signaling pathway. Genes with blue stars are missing in all 4 assembled genomes, suggesting that these pathways may have evolved in vertebrates after the invertebrate-vertebrate split.
It should be mentioned that a majority of these genes including eiger, grnd, and wengen are missing in the collated list of 589 immune related genes in insects[15] and, more specifically, missing from a list of 361 immune related genes in An. stephensi reported recently in the UCI strain[4].