Humanity faces many challenges in the search for novel active molecules that help solve animal, human health as well as environmental issues. Recombinant DNA technology and optimization of the solid phase chemical synthesis have allowed peptides and proteins to become a feasible and scalable solution to these problems. Peptides entering clinical trials are now common (26) and venom-derived new products are increasingly available (3,27). Spiders, being the venomous animal with the greatest number of species described to date, are becoming protagonists for the prospection of bioactive molecules (1,17). In the absence of annotated genomes of arachnids (28), the de novo assembly of transcriptomes has become the main tool for the molecular characterization of non-model organisms (29,30). Still, indels of nucleotides, assembly of chimeras, under representation of isoforms, presence of partial transcripts, biases in highly and lowly expressed genes and elimination of contigs that contain highly repeated sequences, are some of the current challenges that assemblers’ algorithms face (30,31). Moreover, the bias of some algorithms towards the assembly of certain types of proteins and performance variation in different tissues and/or organisms (29,31–33), may be overcome with the use of tools such as CEGMA or BUSCO, as the lack of consensus on the accuracy of the average length or the N50 value as estimators of the assembler’s performance also rises, and the use of multiple assemblies with different software (29,31,34).
The quality of the assemblies
All four assemblies obtained with Trinity showed and average of 97,000 contigs, N50 value and average lengths of 750 and 570 bp respectively. These results are in agreement with reported spider transcriptomes, which are highly variable and contain between 5000 up to 201,000 unigenes, and average lengths around 600 bp (35). Surprisingly, comparison between the assembly presets showed that the slight increase in the k-mer number, produced the increase of the N50 value, the median and average length, while it reduced the total number of contigs and assembled nucleotides, evidencing a potential increase in the assembly quality. However, BUSCO results show a drop in the number of retrieved conserved genes and an increase in the fragmented, missing and duplicated genes. Therefore, for our dataset, higher k-mer lengths assemblies done with Trinity have a lower possibility of correctly retrieving conserved arthropod genes (Table 1 a, b). Alignment of the assembled transcriptomes to the genome of P. tepitadorium showed an average length of the aligned sequences above the reported average and maximum length values of the assembled contigs. This implies the inclusion of gaps, mainly due to the phylogenetic distance between Theraphosidae (Mygalomorphae) and Theriidae (Araneae) which are estimated to have diverged between 240-300 million years ago (35,36). Around 70% of the aligned sequences have lengths smaller than 500 bp (Figure 2); therefore, the possibility of multiple alignments to different sections of the P. tepitadorium genome helped reducing the alignment quality as observed (37). Still, while the comparison between P. tepitadorium and P. verdolaga is not ideal, the coverage analysis results behave as the BUSCO results, and this indicates that for our dataset, smaller k-mer numbers give a better assembly for the P. verdolaga transcriptome with Trinity.
To date, Trinity is regarded as the best de novo assembler for various species (29,38); however, biases in the prediction of certain types of proteins suggest the use of other assemblers (24,33). Assembly of the transcriptome of P. verdolaga with the software SPAdes and Soapdenovo showed a reduction in the overall number of assembled contigs as well as the N50 statistic in sequences over 200 bp, as expected. SPAdes’ assemblies show apparent worse performance as k-mer number increases, which is evidenced in worse assembly statistics and in BUSCO results, showing correlation between both metrics and indicating that a smaller k-mer produces a higher quality assembly (Table 2 a, b). Still, while not as superior in number as Trinity’s results, the quality of SPAdes’ results are comparable as it has been previously reported (33). However, the smaller number of duplicated genes present in SPAdes’ BUSCO terms indicate again that assembly stats fail to describe the whole picture. On the other hand, Soapdenovo’s assemblies, while showing the worst performance of all the three software, show a better performance at higher k-mer numbers (Table 2 a, b). Assembly of venom gland transcriptomes for snakes of the Crotalus genus as well as scorpions of the Centruroides and Hadrurus genera with SPAdes and Soapdenovo, show an inverse correlation between assembler performance and increased k-mer number, as percentages of missing and fragmented BUSCO terms increase from 30 to 60% for snakes and from 3 to 10% on scorpions with k-mer numbers from 31 to 137 respectively, due to the possible loss of transcripts with low abundance (24). While this behavior applies to both Trinity and SPAdes, Soapdenovo’s results seem to contradict this tendency as it assembles over 1.3 million transcripts with lengths smaller than 200 bp at k-mer length 31, evidence of a high fragmentation of P. verdolagas’ transcriptome that clearly punishes both assembly stats and BUSCO metrics. Soapdenovo is known to assemble a high proportion of small contigs in other datasets (32). For the prospection of novel bioactive peptides and proteins, the relation between k-mer length and read length in our dataset and for Soapdenovo, should be higher as observed. This again suggests that, ideally for each organism and/or tissue, optimization of the assembly conditions is highly recommended. Still, the presence of over 160 fragmented core arthropod genes in sequences smaller than 200 bp, suggests the potential presence of genomic information that might be of interest to certain biological questions, since it has been reported that genes that are restricted to one genus or species tend to be shorter than sequences that are shared between different taxa (39).
We observed up to 20% of ORFs (complete genes) in relationship with the initial amount of contigs (Table 3), a number somewhat lower than the 36% reported for transcriptomes of other spiders (35). When we also included partial transcripts in Trinity analyses, we increased to 33% the percentage of complete genes in relation to the number of contigs (not shown); however, the inclusion of incomplete transcripts might negatively affect subsequent studies as annotation of incomplete genes leads to errors in transcript quantification as well as errors in gene expression profiles (40,41). A previous P. verdolaga assembly and annotation made by Estrada and collaborators with Trinity and TransDecoder, reported 78,000 contigs from which 20% (16,030) were identified as ORFs and only 45% were identified as non-redundant genes, showing a final 9% (7173 sequences) of the total predicted contigs as genes (25). Our methods classified only 0.07 to 25.2% of the ORFs as redundant sequences and functional annotation of these sequences represented an increase of up to 20% more complete genes in the non-redundant merge, when compared to the previous work.
The low abundance of spider-specific genes in databases contributes to the low level of annotation found. (28,39). Therefore, in some instances, alignments with e-values higher than of 1x10-3 might be considered as positive hits (42). Approximately 45.5% of the not annotated sequences have IPS hits that provide insight into the possible function of these genes (Table S3) and might classify them as probable “orphan genes”. As it has been observed in other spider transcriptomes, up to 20% of genes are not present within other taxa due to lineage-specific environmental adaptations, while showing the presence of signal peptide, IPS or Pfam hits (39,43). These findings become especially relevant as phylogenomic analysis of new world theraphosids place them as a separate clade group from other arachnids (44) and some reports show that there is higher genetic diversification found in neo-tropical spiders as well as in those spiders that lack orb webs (36). A low percentage of annotated sequences is also observed in spiders of clinical importance, such as those from the Loxoceles and Latrodectus genera, as only 39 to 54% of the described proteins from transcriptomes in these species are correctly identified (45,46).
Potential new toxins in the venom gland of P. verdolaga
The non-redundant merged transcriptome showed an increase of approximately 1200 more annotated genes when compared to the Trinity annotated sequences (highest in total numbers among the three) as well as the overall identification of a higher number of GO terms as observed in Table 4. 1152 of the extra annotated proteins were related to the SPAdes assembly, while 56 were traced to the Soapdenovo assembly, i.e. 27% of the unique sequences belonging to both SPAdes and Soapdenovo were correctly annotated. This increase in the total numbers of ORFs also carried the inclusion of approximately 3200 sequences which were not annotated, decreasing the percentage of correctly annotated sequences when compared to Trinity by almost 10%, as expected (29,31). However, this reduction in annotation statistics is compensated by the extra sequences that were identified, as 42% of the sequences missing annotation show IPS matches that classify them as possible orphan genes, leaving a total potential of 12,000 identified genes (Table S3). The annotation of the non-redundant merged transcriptome identifies 35% of the annotated sequences as related to cellular processes. There is abundance of GO terms related to nucleotide binding and H3-K4/H3-K36 histones, which reflects the high level of transcriptomic activity that is expected from the venom gland (47). BLAST, IPS hits as well as GO terms, allowed the identification of 113 proteins as proteases or protease inhibitors, 62 toxins with either cytolytic, necrotic, neurotoxic or hemolytic activity and one protein with hyaluronidase activity. Venoms like those from spiders of the genera Latrodectus and Loxoceles are primarily composed of phospholipases and proteases, at 10 to 20% of the identified sequences. Other enzymes, like hyaluronidases, are found with low abundance (45,48). Theraphosidae venoms contain approximately 40% of proteins related to cellular processes and about 5 to 30% of proteins that evidence the ICK motif (possible toxins), while showing abundance in proteins related to redox activity, histones and proteases that also contribute to toxin diversity (49,50). Alignment of the non-redundant merged transcriptome to the Arachnoserver and EROP databases allowed the identification of 171 sequences with biological activities of interest; 36 of these were previously reported (25). All 135 remaining sequences matched to 427 different proteins within the aforementioned databases; however, the best hit was restricted to only 32 proteins at an average e-value of 1x10-5 and identities smaller than 85% in 98% of the sequences. The CD-HIT analysis showed a high level of homology within P. verdolagas’ 135 predicted proteins, as 87 clusters were found with global identities higher than 30% (51). All identified proteins were clustered and named as per the nomenclature proposed by King and collaborators, were U stands for the unknown function and Pv for Pamphobeteus verdolaga; theraphotoxins-Pv1 to Pv3 were described elsewhere (25,52).
Various patterns could be observed between P. verdolagas’ predicted sequences and their best score homologs. First, all best score homologs were proteins that within the arachnoserver database are among the longest (over 100 kDa in weight). Second, an inverse relationship was observed between the predicted protein length and its global identity to the homolog protein, as those with the shortest lengths had global identities over 20% and ORFs with similar lengths to the homolog had on average global identities below 20%. Third, on average all proteins showed hits to proteins that were either isoforms to the best score homolog or to homolog proteins described in other species (data not shown). Fourth and last, all 32 proteins were described in 11 species, of which 30% were spiders from the genera Latrodectus or Loxoceles, evidencing again the bias in arachnid research. All the smaller P. verdolagas’ ORFs are hypothesized to be then smaller isoforms of the best score homolog proteins, as the Augustus prediction algorithm identified them as complete transcripts and the global identity was above 20% (51). The varying lengths of these ORFs, which represent 7 to 27% of the best score homolog proteins primary structure, could be indicators of splicing events within the venom gland of spiders, as it has been described before in other venomous species (53,54), and therefore could also imply similar biological function. On the other hand, ORFs with the smallest global identity show between 58 and 164% of the best homolog protein’s length, implying enough variability within the primary structure as to assume probable different biological activities to those of their homologs (51).
Forty P. verdolagas’ toxins, distributed in 29 clusters, matched ⍺-Latrocrustotoxin. This toxin, from Latrodectus tredecimguttatus, affects crustaceans by promoting the release of neurotransmitters by various mechanisms, including pore formation (55). ⍺ and 𝛿-Latroinsectotoxin also induce neurotransmitter release by pore formation; however, these are found to be only active in arthropods making them potential insecticidal proteins (55,56). 8 and 9 of the described P. verdolaga toxins are related to ⍺ and 𝛿-Latroinsectotoxin respectively. ⍺-Latrotoxin-Lt1a is a 150kDa toxin that shares 93% of its amino acid sequence to the toxin ⍺-Latrotoxin-Lh1a from Latrodectus hasselti. These ⍺-Latrotoxins show toxicity to vertebrates and in some situations can represent a fatal danger to mammals due to the formation of membrane pores that lead to neurotransmitter release (57). 12 and 7 proteins were found to match to ⍺-Latrotoxin-Lt1a and Lha1a respectively. Presence of 19 sequences within the transcriptome of P. verdolaga that share identity to these toxins might indicate probable noxious effects of this theraphosid venom to mammals; however, the pore formation activity could be exploited in the development of various cytolytic drugs that may prove to be either antimicrobial or anti-cancerous. P. verdolagas’ ORFs are also related to toxins described from the Loxoceles genera. Such is the case of the toxin Hayluronidase-1 from the spider Loxoceles intermedia, which is related to skin necrosis (58) and therefore indicative of possible necrotic activity in the bite of P. verdolaga.
Seven of the described toxins are related to the peptide 𝜅-theraphotoxin-Gr1d from the spider Grammostola rosea, which is also homolog to the Hanatoxin (HaTx) from the same species. These toxins are known to modulate voltage-gated potassium channels, specifically Kv2.1 and Kv4.2 (59), which could pose them as valuable tools for ion channel research. However, unlike previously described toxins (and due to their short length), these toxins are also probable homologs to other Grammostola rosea toxins such as the antimicrobial peptide M-theraphotoxin-Gr1a (60), the analgesic and antiarrhythmic toxin 𝜅-theraphotoxin-Gr2c (as stated by US patents US5756663 and US5877026), and the lectin-like toxin U5-theraphotoxin-Hs1a from the spider Haplopelma schmidti (61), making them some of the most interesting peptides found in the venom gland transcriptome of P. verdolaga. Other ORFs related to theraphosid venoms are two toxins related to H. schmidti’s 𝜅-theraphotoxin-Hsb1, a potassium channel and protease inhibitor (62,63), and three ORFs related to U15-theraphotoxin-Hhn1a/e/h, three protease inhibitors from the venom of the spider Haplopelma hainanum (63,64).
Other ORFs related to proteins from other species were the Acetylcholinesterase-1, Cysteine rich secretory protein-1 (CRISP-1) and Neprilysin-1 from the spider Trittame loki, homologs to 23 of the P. verdolagas’ predicted proteins and distributed in 1, 3 and 10 clusters respectively. These proteins are known to be related to the envenomation process, as acetylcholinesterase cleaves acetylcholine in neuron synapses and Neprilysin-1 is a metalloprotease that is involved in the degradation of proteins from the extracellular matrix; CRISP-1 is to be related to vespid allergens (65). And the toxin U1-filistatoxin-Kh1b from the venom of the spider Kukulcania hibernalis, an insecticidal toxin as described by patent number US5457178, which is homolog to two of the P. verdolagas’ ORFs.
Some of the hits can also be traced to organisms outside the Araneae order. Such examples are the homology of one of the predicted sequences to the Kazal-type serine protease inhibitor from the lobster Sagmariasus verreauxi (66), four ORFs homolog to species of the genera Scolopendra: Scolopendin 1 and 2, both antimicrobial and antifungal peptides (67,68), the toxin SLPTX10-4 of unknown function (69) and lastly Scolopendrasin VII, an anticancer peptide (70). Finally, 14 ORFs also showed homology to the protein Techylectin-1 from the Phoneutria nigriventer, which is also homolog to the toxin Techylectin-5B, an antimicrobial lectin from the horseshoe crab Tachypleus tridentatus (71).