Assembly and annotation raw sequences from RNA sequencing
In this work, we carried a transcriptome analysis of the H. saulcyi venom gland for extracting and characterizing the sequences of the phospholipase D. In addition, we used the previously reconstructed venom gland transcriptome from H. lepturus and A. crassicauda 19. After sequencing and filtering out of low-quality reads, the cDNA library was constructed from approximately 48 million paired-end of clean reads of >200 bp by the Trinity program 20. Among the 191150 assembled transcripts, 110126 unigenes were obtained and 98365 potential coding sequences were predicted using TransDecoder software. We found that a high percentage of those protein coding sequences (95167) had a match in non-redundant protein (Nr), Swissprot, and Pfam databases.
In our previous study, following removal of the adaptor, short reads and quality filtering, approximately 51 and 48 million paired-end of clean reads were obtained in A. crassicauda and H. lepturus, respectively. In total, 260663 and 222989 transcript sequences were reconstructed by de novo transcriptome assembler of Trinity software, in which 152704 and 122618 unigenes (>200 bp) were clustered from deep RNA-Seq data of A. crassicauda and H. lepturus, respectively. After comparing the protein coding sequences against NCBI, Swissprot, and Pfam databases, 11415 and 11869 transcripts were identified as known mRNAs in A. crassicauda and H. lepturus transcriptomes, respectively.
Identification of the dermonecrotic toxins
The first step is building a local database of dermonecrotic toxin to target BLAST searches for phospholipase sequences. This database was created by collecting the known sequences of phospholipase D1, D2 or phospholipase alpha and beta from different scorpion species and closely related species such as spiders, ticks, termite, ants, flies and wasp species. Using our local dermonecrotic toxin database, we conducted exhaustive BLAST searches of the H. saulcyi, A. crassicauda and H. lepturus venom glands transcriptome and found multiple transcripts with high sequence similarity to PLDs. Generating a target database of PLD sequences from a few widely used databases facilitated the use of comprehensive search strategies to extract subsets of PLD sequences from our dataset. To increase the sequence identification confidence and to classify the sequences, the obtained sequences were directly searched against NCBI and Uniprot databases. Sequences sharing high sequence identity with previously classified PLD sequences belonging to species mentioned above were considered members of this group. Accordingly, we found that scorpions encode two genes of phospholipase D: PLD1 and PLD2.
Domain and motif identification
For definitive categorization, newly obtained PLD sequences were subjected to protein motif and domain identification. For domain identification, we used BLASTP searches of the local copy of the NCBI CDD. Furthermore, PROSITE collection of motifs online tool and RCSB PDB server were utilized for protein motif exploration. By searching PROSITE database, we found that some PLD sequences are sharing a motif of phosphatidylinositol-specific phospholipase C (PIPLC_X_DOMAIN) and glycerophosphodiester phosphodiesterases domain (GP-PDE), which were classified as PLD1 family. The rest, which have share similar motifs including two copies of HKD motif (PLD) and a phox homology domain (PX), were considered as PLD2 family. By analyzing the results of searches for protein motifs in the NCBI CDD database, two copies of catalytic domains of phospholipases were found for sequences classified into PLD2 family, while a Glycerophosphodiester phosphodiesterase-like domain (GDPD_like_SMaseD_PLD) of spider venom sphingomyelinases D was found for sequences classified into PLD1 family. Using RCSB PDB resource, we found that PLD1 show structure similarity with 6U8Z (Crystal Structure of Catalytic Domain of Human Phospholipase D1), and PLD2 ScoTox alpha shows structure similarity with 3K72 (Structure of integrin alphaX beta2). In general, based on sequence similarity and conserved segments of protein or nucleic acid sequences analyses of PLDs from scorpions, three types of phospholipase D (PLD) discovered in H. saulcyi, A. crassicauda and H. lepturus scorpion species were named PLD alpha, beta I and beta II. Results of the BLAST searches against NCBI-CDD and PROSITE databases are presented in Figure 2.
The discovery of a specific diagnostic peptide pattern from the conserved sequence or known motifs of PLDs family assist to recognize members of this highly diverged but homologous family in insects. To achieve this goal, the alignments of conserved segment of PLD sequences from scorpions, spiders, ticks and mites were performed with Clustal Omega program. Based on sequence analysis of PLDs beta from insects, we found that those family members had highly conserved segments (Figure 3). All members of PLDs beta are characterized by the sequence HMX13GANX4DX13HX2PCDC, for which two domains termed GAN (HMX13GAN) and HPCDC (HX2PCDC) can be defined for them.
All members of beta PLDs are characterized by the sequence HMX13GANX4DX13HX2PCDC, for which two domains can be defined, namely GAN (HMX13GAN) and HPCDC (HX2PCDC).
Therefore, enzymes with the characteristic HMGAN and HPCDC domains are categorized as part of a PLD1 family and include PLD beta enzymes. This large family was divided into two subgroups (beta I and beta II) based on sequence similarity that we named Phospholipase D1 Scorpion Toxin-beta I (PLD1 ScoTox-beta I) or Dermonecrotic Scorpion Toxin-beta I (DScorTox-betaI), and Phospholipase D1 Scorpion Toxin-beta II (PLD1 ScoTox-beta II) or Dermonecrotic Scorpion Toxin-beta II (DScorTox-beta II).
Various numbers of PLD1 betaI were identified from venom glands transcriptome of H. saulcyi and A. crassicauda and H. lepturus, which the highest diversity was observed in H. lepturus scorpion. In general, three PLD1 were discovered from venom gland of A. crassicauda, called PLD1 ScoTox-betaI PLD1, ScoTox-betaII isoform x1, and PLD1 ScoTox-betaII isoform x2; eight from H. lepturus, named PLD1 ScoTox-betaI isoform x1 to PLD1 ScoTox betaI isoform x8; and one from H. saulcyi, called PLD1 ScoTox-betaI isoform x1. The nucleic acid and protein sequence data displayed in this study are available in the NCBI under the accession number OP867052, OP867053, OP867054, OP867055, OP867056, OP867057, OP867058, OP867059, OP867060, OP867061, OP180075.
Furthermore, we aligned the PLD2 or PLD alpha sequences from scorpion species and similar sequences from other insects (spiders, ticks, termite, ants, flies and wasp species) using Clustal Omega program and found that PLD alpha from insects is highly conserved and contains duplicate phosphodiesterase active sites. The profiles of phospholipase D2 phosphodiesterase active sites are LWAHHEKX4DQX2AFXGG and ELXYVHSKX2IXDDX3IXGSANINDRS for first and second catalytic sequences respectively (Figure 4), in which characterized by the sequence HxKx4Dx6G (HKDG domain). Based on these analyses of PLDs, sequences sharing two copies of catalytic HKDG domain were categorized as part of the PLD2 family. We named these PLDs, Phospholipase D2 Scorpion Toxin-alpha I (PLD2 ScoTox-alpha I) or Dermonecrotic Scorpion Toxin-alpha I (DScorTox-alpha I). The nucleic acid and protein sequence data displayed in this study are available in the NCBI under the accession number OP867062 and OP820507.
Exon-intron organization of scorpion PLDs
The exon-intron patterns of PLD beta (Fig. 5, Fig. 6) and alpha (Fig. 5, Fig. 7) genes were predicted by aligning the mRNA sequences of PLD beta or alpha discovered in this study with the genomic sequence and mRNA of dermoncrotic toxins from Centruroides sculpturatus using MAFT program. The total sequence of PLD beta gene from START-codon (ATG) to STOP-codon (TAA) is around 5400bp and full-length cDNAs spans from ~1100 to 1600bp. As show in figure 6, PLD beta gene was found to consist of 6 coding exons and 5 introns. Exons 1–6 vary in size from ~45bp (exon 1) to ~600bp (exon 6), whereas the introns vary in length between ~70bp (intron 4) to ~1630bp (intron 3). Exons 1 to 6 containing ~45, 250, 132, 125, 228, and 601bp, respectively. The introns 1 to 5 consist of ~1012, 77, 1630, 71, and 1217bp, respectively. We found that exon 6 with the size of ~600bp and intron 3 with the size of ~1630bp are the longest regions in the PLD1 genes. All introns of PLD1 gene follow the standard form of GT-AG.
As show in figure 7, PLD alpha gene is quite large and predicted to consist of 21 coding exons ranging from 68bp (exon 5) to 245bp (exon 13). The total sequence of PLD alpha gene and full-length cDNAs from START-codon (ATG) to STOP-codon (TAA) are ~45596bp and ~2800bp respectively, in which exons 1 to 21 containing ~120, 154, 125, 110, 68, 160, 153, 150, 166, 110, 139, 149, 245, 80, 150, 120, 123, 113, 90, 160, 115bp respectively. We found that PLD2 gene contains 20 GT-AG type introns. The introns 1 to 20 consist of 10700, 4102, 109, 3384, 3332, 5000, 80, 8066, 74, 168, 110, 4446, 550, 89, 2242, 70, 60 84, 70, 92bp. Exon 13 with the size of 245bp and intron 1 with the size of 10700bp are the longest regions in the PLD2 genes.
Protein structures of PLDs
Computational models of protein structures for dermonecrotic toxins were determined with I-TASSER, Phyre2, and SWISS-MODEL web servers. The quality of each model was assessed by overall quality factor in SAVES server and Z-score in ProSA-web server (Table 2). All Phyre2 generated models had less quality according to SAVES server assessment. Models predicted by Swiss-model with high overall quality factor and a Z-score in the range of native conformations have the best quality. The Ramachandran map analysis of three-dimensional structures predicted by Swiss-model showed that the percentage distribution of dihedral angles of the PLD protein residues at the allowable zone is over 95% which indicating high stability of structures; therefore, they were utilized for structural homology comparison.
Structural homology analysis of the dermonecrotic toxin sequences
The best computational models of PLDs were used for comparison of the structural homology between A. crassicauda–PLDs, H. saulcyi-PLDs and the structure of dermonecrotic toxins derived from H. lepturus. For structural alignment and visualization, we used molecular graphics program UCSF Chimera (ver. 1.11.2, University of California, San Francisco, CA, USA). The analysis indicated that the predicted PLD structures contain both α-helices and β-pleated sheets which had significant structural homology when compered together (Fig. 8, 9). Comparative modeling analysis of PLD alpha showed that the number of α-helix and β-sheet in A. crassicauda and H. saulcyi were similar. Structures of PLD beta from A. crassicauda, H. saulcyi and H. lepturus were also generally similar and minor differences observed between PLD beta structures are related to the amino acid composition of the proteins.
Characterization of the gene and protein of PLD1 and PLD2
SignalP (version 6.0) was implemented to propose the presence, cleavage site positions and general structure of the signal peptides. Signal peptide score, as the confidence in the signal peptide prediction, was predicted over than 0.97 for all PLD1 members (Table 3). General structure of the signal peptides has suggested and it consists of three main parts, including N-terminal region (N-region), intermediate hydrophobic region (H-region) and C-terminal region (C-region). This study revealed differences in the amino acid composition and length of these three regions between PLD1 ScoTox-betaI proteins of the Buthidae and Hemiscorpiidae families, such that the longer amino acid signal peptides (32-amino acids signal peptide) were predicted for isoforms X3-X6 of H. lepturus.
In addition, such differences were also observed between PLD1-betaI and PLD1-betaII group of proteins. The N-regions of PLD1 ScoTox-betaI and betaII from Buthidae family (A. crassicauda and H. saulcyi) were predicted to compose of 2–3 amino acid residues, and they have a cleavage site at position 21 and 19, respectively. We found that the cleavage site is absent from PLD2 ScoTox-alpha proteins.
The physiochemical properties were analyzed for all mature PLDs using Expasy’s ProtParam software (Tables 4, 5). We found that PLD1-beta gene from scorpions encodes a mature protein with molecular weight ranging from ~ 31 to 44 kDa. Based on IP results, scorpion PLD1-beta is an acidic peptide with isoelectric points ranging from 5.5 to 6.9. The PLD1-betaI sequences in this study indicated high stability, high aliphatic indices and negative values of GRAVY. The aliphatic indices for those protein sequences ranged from 74.74 to 83.59. Mature PLD2-alpha with molecular weight of approximately 128 kDa and polar nature (negative GRAVY value), has a basic isoelectric point (~9) and high aliphatic index (Table 5).
Phylogenetic relationships of the arthropods
Phylogenetic analyses of the PLD proteins were conducted using the PhyML 3.0 program 10 to evaluate the evolutionary relationship of PLD proteins in insects. To build the phylogenetic tree, we included dermoncrotic toxins sequences from this study, alongside available PLD homologous peptides in GenBank from A. australis and H. trilineatus scorpions and six out-groups including spiders, ticks, termite, ants, flies and wasp species (Table 1). Our phylogenetic analysis classified the PLD family of proteins into two main subfamilies, PLD1-beta and PLD2-alpha (Fig. 10). Clade 1 consist of PLD1-beta and clade 2 includes PLD2-alpha, which clade 1 further divided into the subclades betaI and betaII. The phylogenetic tree represents that PLD alpha members share close evolutionary relationship, as well as PLD beta members of different insect species. However, there are long evolutionary distances between PLD alpha and PLD beta.
The analysis of PLDs showed that A. crassicauda and H. saulcyi PLD members have a close phylogenetic relationship with their other Buthidae family orthologs, but they have a long phylogenetic distance to their H. lepturus orthologs. Furthermore, among all insects, the closest phylogenetic relationship of PLD members was observed between scorpions and spiders.