Optimization of the de novo assembly of the transcriptome of the venom gland of Pamphobeteus verdolaga, prospecting novel bioactive peptides

Background: Spiders are among the most venomous animals in nature. Their venom constitutes a source of novel and innovative peptides and proteins with medicinal and biotechnological interest. However, their potential as antimicrobial, anti-cancerous, anti-hypertensive and even in the modulation of nociception is under-studied, mainly because handling the venom is technically challenging and there is paucity of next-generation-sequencing (NGS) data. Due to the increasing evidence of underestimation of the number of genes by the use of a single transcriptome assembler, we re-assembled and optimized the de novo transcriptome of the venom gland of the recently described Colombian spider P. verdolaga, by using three free access algorithms: Trinity, Soapdenovo and SPAdes. All the assemblies were evaluated by statistical parameters (e.g. contigs, GC%, max and min length and N50), by applying BUSCO´s terms retrieval against the arthropod data set to determine the best assembly for each software. Results: Our analyses show that while approximately 54% of all the assembled and structurally annotated sequences could be found in all three algorithms, around 23% of these were unique for Trinity and 21% were unique for SPAdes. The non-redundant merge of all three assemblies’ output permitted the annotation of 8640 sequences; at least 15% more when compared to each software separately, and an increase of 20% when compared to a previous P. verdolaga assembly. Analysis of the annotated genes allowed the identication of unreported lectins, kinins and over 200 peptides and proteins with potential antimicrobial and protease inhibition activities. Furthermore, homology search against the Arachnoserver database and the EROP knowledgebase allowed the identication of 135 novel theraphotoxins of biotechnological interest. Conclusion: for the functional annotation. Briey, the structural annotation with Augustus used the genome of P. tepitadorium as a template, retrieving information such as intron length and possible splicing signals in order to predict the possible ORFs from each of the assemblies. Only complete ORFs were reported. The functional annotation was carried out with the software OmicsBox using the UniProtKB database for homology search as well as the InterPro database for GO terms retrieval. Only those sequences with hits from both databases were considered as annotated. For the redundancy analysis the software CD-HIT was employed with a cutoff of 85% homology.

through the identi cation of known motifs that are key to the activities of interest. This can be done through a bioinformatics approach, applying tools like BLAST and HMMER (19,(21)(22)(23). However, in nonmodel organisms, such as the members from the Araneae order, there is a lack of reported nucleotide information at the genome and transcriptome level. Thereby, the assembly and annotation of the transcriptome becomes an important endeavor to guarantee that most of the nucleotide information is retrieved. Rarely, a single assembly method permits the retrieval of all genes and sequences within a transcriptome (24). In venoms, their inherit complexity and the presence of highly homologous and paralogous sequences might distort the quantity and quality of the information recovered. Therefore, the usage of various assemblers may be recommended as taxon speci c as well as tissue speci c biases and issues may arise, especially in these non-model organisms (24).
In this study, we optimized the assembly of the transcriptome of the venom gland of the Colombian spider Pamphobeteus verdolaga (Araneae:Theraphosidae). In order to maximize the amount of information obtained from the transcriptome, we assembled the reads with three different de novo assembly algorithms and assessed their performance through "classical assembly statistics" as well as BUSCO terms retrieval, and compared the number of genes annotated when only one assembly output was used and when a merge of the three assembler results were used as input for the annotation with OmicsBox. The le with the merged predicted genes from the three assemblers showed an increase of at least 15% in the number of annotated genes than in each of the individual assemblies. The improvement in annotation allowed the identi cation of 135 novel toxins that may hold biotechnological potential, unveiling several of new bioactive peptides with potential interest in biomedicine.

Venom gland transcriptome data
The transcriptome's raw reads from the venom gland of Pamphobeteus verdolaga were obtained from the European Nucleotide Archive (ENA) under accessions numbers: PRJEB21288/ERS1788422/ERX2067777-ERR2008012. As explained by Estrada and co-workers (25), two female specimens were collected from the province of Antioquia, Colombia under the contract 155 signed by the University of Antioquia and the Environmental Ministry of Colombia, and the venom glands were extirpated. The total RNA was obtained through TRIzol ® reagent (ThermoFisher Scienti c, MA, USA) while the puri cation process of mRNA and the library creation was carried out with the Illumina mRNA TruSeq kit v2, as indicated by the manufacturer. The library 100 bp pair-ended reads was sequenced in an Illumina Hiseq 2500 instrument (illumina Inc, San Diego USA).

Transcriptome assembly
The raw reads obtained from the Illumina platform were subjected to a process of assembly optimization. Firstly, low quality reads and possible adaptor sequences were eliminated by using the programs TrimGalore v0.6.3 as well Trimmomatic v0. 38, with default options. The cleaned reads were subject to an assembly with Trinity v2.1.1 using two k-mer lengths: 25 and 32, thus, four different transcriptome assemblies were obtained. Each assemblies' statistics were obtained by using the TrinityStats.pl script. The reads edited with TrimGalore were also assembled with SOAPdenovo v2.04 and with SPAdes v3.13.1, with k-mer lengths 31 and 63 for a total of eight different transcriptome assemblies. The statistics from these assemblies were obtained with the QUAST v5.0.2 software. For Trinity, SOAPdenovo and SPAdes, only the k-mer length was varied.
Assessing transcriptome completion with BUSCO A rst measure of the completeness of the transcriptomes was evaluated using the Benchmarking Universal Single-Copy Orthologs (BUSCO) v4.1.2 software. For this, each of the assemblies was analyzed against BUSCO's own arthropoda dataset. The data from the number of complete, duplicated, fragmented and missing BUSCO terms was extracted. In the case of the assemblies obtained from SOAPdenovo and SPAdes, a separation of transcripts with lengths higher and lower than 200 bp was done with the software SeqKit v0.12.0 before BUSCO.
Sequence coverage relative to the genome of Parasteatoda tepitadorium The assembled contigs were aligned to the genome of the common house spider: Parasteatoda tepitadorium in order to assess the level of coverage. The alignment was done with the software minimap2 v2.17, using the splice setting due to the large size of the contigs. The resulting alignment coverage and its abundance was graphed with the software GraphPad Prism v7.00 for MacOS (GraphPad Software, La Jolla California, USA).

Structural and functional annotation of the transcriptome assemblies
The transcriptome assemblies from Trinity, SPAdes and Soapdenovo that showed the best performance evaluated through classical statistics, BUSCO terms retrieval and sequence coverage, were annotated with the software Augustus v 3.3.3 for the prediction of ORFs and with the software OmicsBox v 1.2.4 (BioBam Bioinformatics S.L., Valencia, Spain) (previously known as Blast2GO) for the functional annotation. Brie y, the structural annotation with Augustus used the genome of P. tepitadorium as a template, retrieving information such as intron length and possible splicing signals in order to predict the possible ORFs from each of the assemblies. Only complete ORFs were reported. The functional annotation was carried out with the software OmicsBox using the UniProtKB database for homology search as well as the InterPro database for GO terms retrieval. Only those sequences with hits from both databases were considered as annotated. For the redundancy analysis the software CD-HIT v4.8.1 was employed with a cutoff of 85% homology.

Prediction of proteins and enzymes
All annotated sequences' GO terms were scanned for the keywords: toxin, antimicrobial, nociception, antiparasitic, antiarrhythmic, cytolytic, hemolytic, hyaluronidase, in ammatory, kinin, lectin, necrosis, neurotoxin, neurotransmitter hydrolysis, presynaptic neurotoxin, protease and protease inhibitor. All resulting sequences were extracted and manually curated using PubMed to check whether there was experimental evidence for their activity or were just related to an associated process. A further prospection was carried using the over 1800 peptides from the Arachnoserver database (http://www.arachnoserver.org) and the over 8000 peptides from the Erop knowledge base (http://erop.inbi.ras.ru). Homology to the assembled transcriptome was checked using local BLASTP v2.10.0. All of the retrieved sequences with a e-value higher than 1x10 -3 were eliminated. If a gene had multiple hits, only the one with the smallest e-value and highest identity percentage was taken as the best and probable result. The associated mature toxin was identi ed by elimination of the possible signal peptide and pro-peptide using the SpiderP software (http://www.arachnoserver.org/spiderP.html). A global alignment of the identi ed mature toxin and the identi ed homologue from BLAST was carried through ClustalW (https://www.genome.jp/tools-bin/clustalw). Those sequence pairs with global identity smaller than 20% were discarded as possible homologues.

Results
Changing the assembly software brings more variability than changes in the pre-processing of reads.
The 24 million, 101 pb pair-ended raw sequences of the venom gland of P. verdolaga were rst assembled using the most popular de novo assembly software: Trinity; using two pre-processing software: Trimmomatic and TrimGalore, and two different k-mer lengths: k-mer 25 and k-mer 32 (Fig. 1a). This allowed to choose the best pre-processing method and the best Trinity assembly. Next, the reads were assembled with the SPAdes and Soapdenovo software using the chosen pre-processing method and two k-mer lengths: k-mer 31 for comparison with Trinity and k-mer 63, due to the ability to assemble at higher lengths of both software (Fig. 1b).
In the rst part of the optimization, neither changing the pre-processing method nor the k-mer number, changed the number of transcripts, %GC, N50 or the median and average length of the assembled transcripts by more than 3% (Table 1 A). BUSCOs, which are a means to assess the completeness of an assembly in terms of the gene content, provide a complementary view to statistics such as the N50. Each of the four assemblies was evaluated against the arthropod data set from BUSCO in order to determine their completeness. Again, a variation of less than 2% was observed among all of the presets. On average, 78% of the arthropod dataset genes were identi ed completely; however, on average 23% of the retrieved terms were duplicated (Table 1 B). Finally, all of the mappings showed poor quality but, unlike the previous statistics, the assembly with TrimGalore and k-mer 25 showed higher average length of the mapped reads as well as higher coverage (Table 1 C). A histogram of the alignment lengths shows that the highest variability is found in contigs with length between 100,000 to 200,000 bp ( Figure 2). Whilst the differences between the presets cannot be taken as signi cant, the reads pre-processed with TrimGalore and assembled with a k-mer of 25, have a better balance between the number of transcripts, the N50 value, the number of completed and missing BUSCOs, the average length and total coverage of the mapped reads, when compared to the other evaluated presets.
In the second part of the optimization, all of the stats were analyzed with Quast instead of TrinityStats.pl due to compatibility issues with the outputs from SPAdes and Soapdenovo. Unlike the rst part, changes in the assembly algorithm added more variance than changes in the k-mer length or the pre-processing method. Around 29-42% and 74%-95% of the assembled contigs from SPAdes and Soapdenovo respectively, have less than 200 bp. Removal of these sequences shows that SPAdes has a higher average N50 than Trinity, with a smaller number of total contigs. Soapdenovo has the smallest N50 of the three and the number of total contigs assembled is almost 30% smaller than that of Trinity (Table 2 A). The BUSCO analysis was carried out for sequences higher and smaller than 200 bp. No complete BUSCOs can be found in sequences smaller than 200 bp; however 3% and 6-16% of fragmented genes can be found in SPAdes and Soapdenovo k-mer 31/k-mer 63, respectively .The number of complete genes identi ed in sequences larger than 200 bp are close to that of Trinity for SPAdes (76% on average), and around half of the BUSCO dataset can be identi ed on the Soapdenovo assembly. Surprinsignly, only 8% and 2% of the retrieved BUSCO terms for SPAdes and Soapdenovo, respectively, are duplicated, which is four to ten times smaller than the duplicated average number found in Trinity (Table 2 B). The alignment of the reads to the genome of P. tepitadorium shows that, in general, the longest average length and coverage is presented by Trinity, followed by SPAdes and Soapdenovo. The total coverage is about 40% smaller on SPAdes and 82% smaller on Soapdenovo when compared to Trinity. Even with such small coverage, Soapdenovo k-mer 31 has the best alignment quality in all of the eight coverage analysis, followed by SPAdes k-mer 31 ( Table 2 C). The distribution of the aligned sequences is similar between SPAdes and Trinity, while Soapdenovo greatly differs in the 50,000 to 100,000 bp range. Noticeably, both k-mer 63 assemblies show more sequences with length smaller than 10,000 bp when compared to Trinity; which is approximately 75% of all the contigs ( Figure 3). Again, although not drastic, the differences between the presets show a tendency for the reads assembled with SPAdes k-mer 31, and Soapdenovo kmer of 63 to have a better balance between the numbers of transcripts, the N50, the number of completed and missing BUSCOs and the coverage analysis. Even when the Soapdenovo k-mer 31 assembly to the P. tepitadorium genome has a higher average length and quality, the total coverage favors the Soapdenovo k-mer 63 alignment. . From the above results, it is concluded that the reads cleaned with TrimGalore and assembled with Trinity with k-mer 25, SPAdes with k-mer 32 and Soapdenovo with k-mer 63, showed the best performance.

The merging of the assembled transcriptomes increases the annotation performance
The assembled transcriptomes that showed the best overall performance were annotated. The output from Augustus shows that the biggest number of predicted ORFs comes from Trinity, followed by Spades and Soapdenovo. The average and maximum ORF length followed the same pattern. Due to the high number of duplicated genes observed in the previous BUSCO analysis, all of the genes with an identity higher than 85% within the Augustus output were eliminated with CD-HIT. Around 25% of all of the ORFs in the Trinity assembly were duplicated, while this number was only 12% for the SPAdes assembly and, surprisingly, 0.1% for Soapdenovo ( Table 3).
The described results leave SPAdes with the highest number of non-redundant ORFs, followed by Trinity and Soapdenovo. The distribution of the non-redundant sequences shows that SPAdes has the highest percentage (76%) of sequences smaller than 200 amino acids, followed by Soapdenovo (71%) and Trinity (69%). Trinity and Soapdenovo have the same number of sequences, between 200 and 1000 amino acids (28%), followed by SPAdes (22%) (Figure 4). In order to increase the total number of sequences annotated, a merge of all of the predicted ORFs was created. The non-redundant merge le is composed of approximately 23% sequences unique to Trinity, 21% sequences unique to SPAdes and 1% sequences unique to Soapdenovo; the other 54% of the sequences are shared between all of the three assemblies. This implies an increase in the number of ORFs by 31% when compared to Trinity, 32% when compared to SPAdes and 96% when compared to Soapdenovo.
The predicted ORFs were annotated with help of the UniProtKB database (Swiss-prot) and the InterproScan database in the software OmicsBox, with default settings. Just like in all of the previous analysis, the biggest number of annotated ORFs came from the Trinity assembly, with a total of 7478 sequences (43% without annotation) ( Figure 5 (Table 4 A and 4 B).
The P. verdolaga transcriptome is enriched in proteins with potential biotechnological application The non-redundant merged transcriptome of P. verdolaga was screened for potentially bioactive peptides using the set of keywords of interest described in Methods. This allowed the identi cation of 431 genes within the venom gland of P. verdolaga that were related to the aforementioned keywords. After manual curation, around 53% of the sequences were found to be related to biological pathways and not to the molecule itself, as identi ed through the PubMed database (data not shown). Within the 202 identi ed annotated genes, 52% were related to a protease or protease inhibitor (Kunitz domains), 25% were related to reported arachnid toxins and 8% related to vasodilator or smooth muscle contraction (kinin) activity. The average identity to their respective homologue was 60% (Table 5, Table S1).
The prospection of peptides and proteins with potential biotechnological interest was also carried through homology using peptides from the Arachnoserver and EROP databases as templates. A total of 171 sequences with an e-value higher than 1x10 -3 were identi ed. After curation, 79% of the sequences were kept. Out of the 122 sequences left from the Arachnoserver database, 40% corresponded to presynaptic neurotoxins, 20% to peptides and proteins with related protease or protease inhibition activities, and 15% to neurotoxins. For the EROP knowledge base 100% of the sequences were kept and 42% of them were related to arachnid toxins, 36% to neurotoxins and 16% to antimicrobial peptides. The average identity of the alignments was smaller for the Arachnoserver database, as 99% of the identi ed sequences had an identity to their hits smaller than 85%. For EROP, the average identity was of 67% ( Table 5). Out of the total sequences identi ed from the annotation and Arachnoserver/EROP databases, only 36 were previously reported for P. verdolaga.
Processing of the peptides and proteins with SpiderP (http://www.arachnoserver.org/spiderP.html) allowed the identi cation of the mature toxins from both the annotation le and those coming from the Arachnoserver/EROP prediction (202 and 135 unique sequences respectively) (Table S1-S2). Clustering of the sequences at 30% identity revealed the presence of 87 clusters within the Arachnoserver and EROP predictions, allowing the identi cation of 87 new theraphotoxins: U-Theraphotoxin-Pv4/91. The possible activities reported, are those of all toxins that had a match at a signi cant e-value and identity percentage (Table S2).

Discussion
Humanity faces many challenges in the search for novel active molecules that help solve animal, human health as well as environmental issues. Recombinant DNA technology and optimization of the solid phase chemical synthesis have allowed peptides and proteins to become a feasible and scalable solution to these problems. Peptides entering clinical trials are now common (26) and venom-derived new products are increasingly available (3,27). Spiders, being the venomous animal with the greatest number of species described to date, are becoming protagonists for the prospection of bioactive molecules (1,17). In the absence of annotated genomes of arachnids (28), the de novo assembly of transcriptomes has become the main tool for the molecular characterization of non-model organisms (29,30). Still, indels of nucleotides, assembly of chimeras, under representation of isoforms, presence of partial transcripts, biases in highly and lowly expressed genes and elimination of contigs that contain highly repeated sequences, are some of the current challenges that assemblers' algorithms face (30,31). Moreover, the bias of some algorithms towards the assembly of certain types of proteins and performance variation in different tissues and/or organisms (29,(31)(32)(33), may be overcome with the use of tools such as CEGMA or BUSCO, as the lack of consensus on the accuracy of the average length or the N50 value as estimators of the assembler's performance also rises, and the use of multiple assemblies with different software (29,31,34).
The quality of the assemblies All four assemblies obtained with Trinity showed and average of 97,000 contigs, N50 value and average lengths of 750 and 570 bp respectively. These results are in agreement with reported spider transcriptomes, which are highly variable and contain between 5000 up to 201,000 unigenes, and average lengths around 600 bp (35). Surprisingly, comparison between the assembly presets showed that the slight increase in the k-mer number, produced the increase of the N50 value, the median and average length, while it reduced the total number of contigs and assembled nucleotides, evidencing a potential increase in the assembly quality. However, BUSCO results show a drop in the number of retrieved conserved genes and an increase in the fragmented, missing and duplicated genes. Therefore, for our dataset, higher k-mer lengths assemblies done with Trinity have a lower possibility of correctly retrieving conserved arthropod genes (Table 1 a, b). Alignment of the assembled transcriptomes to the genome of P. tepitadorium showed an average length of the aligned sequences above the reported average and maximum length values of the assembled contigs. This implies the inclusion of gaps, mainly due to the phylogenetic distance between Theraphosidae (Mygalomorphae) and Theriidae (Araneae) which are estimated to have diverged between 240-300 million years ago (35,36). Around 70% of the aligned sequences have lengths smaller than 500 bp ( Figure 2); therefore, the possibility of multiple alignments to different sections of the P. tepitadorium genome helped reducing the alignment quality as observed (37). Still, while the comparison between P. tepitadorium and P. verdolaga is not ideal, the coverage analysis results behave as the BUSCO results, and this indicates that for our dataset, smaller k-mer numbers give a better assembly for the P. verdolaga transcriptome with Trinity.
To date, Trinity is regarded as the best de novo assembler for various species (29,38); however, biases in the prediction of certain types of proteins suggest the use of other assemblers (24,33). Assembly of the transcriptome of P. verdolaga with the software SPAdes and Soapdenovo showed a reduction in the overall number of assembled contigs as well as the N50 statistic in sequences over 200 bp, as expected. SPAdes' assemblies show apparent worse performance as k-mer number increases, which is evidenced in worse assembly statistics and in BUSCO results, showing correlation between both metrics and indicating that a smaller k-mer produces a higher quality assembly (Table 2 a, b). Still, while not as superior in number as Trinity's results, the quality of SPAdes' results are comparable as it has been previously reported (33). However, the smaller number of duplicated genes present in SPAdes' BUSCO terms indicate again that assembly stats fail to describe the whole picture. On the other hand, Soapdenovo's assemblies, while showing the worst performance of all the three software, show a better performance at higher k-mer numbers (Table 2 a, b). Assembly of venom gland transcriptomes for snakes of the Crotalus genus as well as scorpions of the Centruroides and Hadrurus genera with SPAdes and Soapdenovo, show an inverse correlation between assembler performance and increased k-mer number, as percentages of missing and fragmented BUSCO terms increase from 30 to 60% for snakes and from 3 to 10% on scorpions with k-mer numbers from 31 to 137 respectively, due to the possible loss of transcripts with low abundance (24). While this behavior applies to both Trinity and SPAdes, Soapdenovo's results seem to contradict this tendency as it assembles over 1.3 million transcripts with lengths smaller than 200 bp at k-mer length 31, evidence of a high fragmentation of P. verdolagas' transcriptome that clearly punishes both assembly stats and BUSCO metrics. Soapdenovo is known to assemble a high proportion of small contigs in other datasets (32). For the prospection of novel bioactive peptides and proteins, the relation between k-mer length and read length in our dataset and for Soapdenovo, should be higher as observed.
This again suggests that, ideally for each organism and/or tissue, optimization of the assembly conditions is highly recommended. Still, the presence of over 160 fragmented core arthropod genes in sequences smaller than 200 bp, suggests the potential presence of genomic information that might be of interest to certain biological questions, since it has been reported that genes that are restricted to one genus or species tend to be shorter than sequences that are shared between different taxa (39).
We observed up to 20% of ORFs (complete genes) in relationship with the initial amount of contigs (Table  3), a number somewhat lower than the 36% reported for transcriptomes of other spiders (35). When we also included partial transcripts in Trinity analyses, we increased to 33% the percentage of complete genes in relation to the number of contigs (not shown); however, the inclusion of incomplete transcripts might negatively affect subsequent studies as annotation of incomplete genes leads to errors in transcript quanti cation as well as errors in gene expression pro les (40,41). A previous P. verdolaga assembly and annotation made by Estrada and collaborators with Trinity and TransDecoder, reported 78,000 contigs from which 20% (16,030) were identi ed as ORFs and only 45% were identi ed as nonredundant genes, showing a nal 9% (7173 sequences) of the total predicted contigs as genes (25). Our methods classi ed only 0.07 to 25.2% of the ORFs as redundant sequences and functional annotation of these sequences represented an increase of up to 20% more complete genes in the non-redundant merge, when compared to the previous work.
The low abundance of spider-speci c genes in databases contributes to the low level of annotation found. (28,39). Therefore, in some instances, alignments with e-values higher than of 1x10 -3 might be considered as positive hits (42). Approximately 45.5% of the not annotated sequences have IPS hits that provide insight into the possible function of these genes (Table S3) and might classify them as probable "orphan genes". As it has been observed in other spider transcriptomes, up to 20% of genes are not present within other taxa due to lineage-speci c environmental adaptations, while showing the presence of signal peptide, IPS or Pfam hits (39,43). These ndings become especially relevant as phylogenomic analysis of new world theraphosids place them as a separate clade group from other arachnids (44) and some reports show that there is higher genetic diversi cation found in neo-tropical spiders as well as in those spiders that lack orb webs (36). A low percentage of annotated sequences is also observed in spiders of clinical importance, such as those from the Loxoceles and Latrodectus genera, as only 39 to 54% of the described proteins from transcriptomes in these species are correctly identi ed (45,46).

Potential new toxins in the venom gland of P. verdolaga
The non-redundant merged transcriptome showed an increase of approximately 1200 more annotated genes when compared to the Trinity annotated sequences (highest in total numbers among the three) as well as the overall identi cation of a higher number of GO terms as observed in Table 4. 1152 of the extra annotated proteins were related to the SPAdes assembly, while 56 were traced to the Soapdenovo assembly, i.e. 27% of the unique sequences belonging to both SPAdes and Soapdenovo were correctly annotated. This increase in the total numbers of ORFs also carried the inclusion of approximately 3200 sequences which were not annotated, decreasing the percentage of correctly annotated sequences when compared to Trinity by almost 10%, as expected (29,31). However, this reduction in annotation statistics is compensated by the extra sequences that were identi ed, as 42% of the sequences missing annotation show IPS matches that classify them as possible orphan genes, leaving a total potential of 12,000 identi ed genes (Table S3). The annotation of the non-redundant merged transcriptome identi es 35% of the annotated sequences as related to cellular processes. There is abundance of GO terms related to nucleotide binding and H3-K4/H3-K36 histones, which re ects the high level of transcriptomic activity that is expected from the venom gland (47). BLAST, IPS hits as well as GO terms, allowed the identi cation of 113 proteins as proteases or protease inhibitors, 62 toxins with either cytolytic, necrotic, neurotoxic or hemolytic activity and one protein with hyaluronidase activity. Venoms like those from spiders of the genera Latrodectus and Loxoceles are primarily composed of phospholipases and proteases, at 10 to 20% of the identi ed sequences. Other enzymes, like hyaluronidases, are found with low abundance (45,48). Theraphosidae venoms contain approximately 40% of proteins related to cellular processes and about 5 to 30% of proteins that evidence the ICK motif (possible toxins), while showing abundance in proteins related to redox activity, histones and proteases that also contribute to toxin diversity (49,50). Alignment of the non-redundant merged transcriptome to the Arachnoserver and EROP databases allowed the identi cation of 171 sequences with biological activities of interest; 36 of these were previously reported (25). All 135 remaining sequences matched to 427 different proteins within the aforementioned databases; however, the best hit was restricted to only 32 proteins at an average e-value of 1x10 -5 and identities smaller than 85% in 98% of the sequences. The CD-HIT analysis showed a high level of homology within P. verdolagas' 135 predicted proteins, as 87 clusters were found with global identities higher than 30% (51). All identi ed proteins were clustered and named as per the nomenclature proposed by King and collaborators, were U stands for the unknown function and Pv for Pamphobeteus verdolaga; theraphotoxins-Pv1 to Pv3 were described elsewhere (25,52).
Various patterns could be observed between P. verdolagas' predicted sequences and their best score homologs. First, all best score homologs were proteins that within the arachnoserver database are among the longest (over 100 kDa in weight). Second, an inverse relationship was observed between the predicted protein length and its global identity to the homolog protein, as those with the shortest lengths had global identities over 20% and ORFs with similar lengths to the homolog had on average global identities below 20%. Third, on average all proteins showed hits to proteins that were either isoforms to the best score homolog or to homolog proteins described in other species (data not shown). Fourth and last, all 32 proteins were described in 11 species, of which 30% were spiders from the genera Latrodectus or Loxoceles, evidencing again the bias in arachnid research. All the smaller P. verdolagas' ORFs are hypothesized to be then smaller isoforms of the best score homolog proteins, as the Augustus prediction algorithm identi ed them as complete transcripts and the global identity was above 20% (51). The varying lengths of these ORFs, which represent 7 to 27% of the best score homolog proteins primary structure, could be indicators of splicing events within the venom gland of spiders, as it has been described before in other venomous species (53,54), and therefore could also imply similar biological function. On the other hand, ORFs with the smallest global identity show between 58 and 164% of the best homolog protein's length, implying enough variability within the primary structure as to assume probable different biological activities to those of their homologs (51).
Forty P. verdolagas' toxins, distributed in 29 clusters, matched -Latrocrustotoxin. This toxin, from Latrodectus tredecimguttatus, affects crustaceans by promoting the release of neurotransmitters by various mechanisms, including pore formation (55). and -Latroinsectotoxin also induce neurotransmitter release by pore formation; however, these are found to be only active in arthropods making them potential insecticidal proteins (55,56). 8 and 9 of the described P. verdolaga toxins are related to and -Latroinsectotoxin respectively. -Latrotoxin-Lt1a is a 150kDa toxin that shares 93% of its amino acid sequence to the toxin -Latrotoxin-Lh1a from Latrodectus hasselti. These -Latrotoxins show toxicity to vertebrates and in some situations can represent a fatal danger to mammals due to the formation of membrane pores that lead to neurotransmitter release (57). 12 and 7 proteins were found to match to -Latrotoxin-Lt1a and Lha1a respectively. Presence of 19 sequences within the transcriptome of P. verdolaga that share identity to these toxins might indicate probable noxious effects of this theraphosid venom to mammals; however, the pore formation activity could be exploited in the development of various cytolytic drugs that may prove to be either antimicrobial or anti-cancerous. P. verdolagas' ORFs are also related to toxins described from the Loxoceles genera. Such is the case of the toxin Hayluronidase-1 from the spider Loxoceles intermedia, which is related to skin necrosis (58) and therefore indicative of possible necrotic activity in the bite of P. verdolaga.
Seven of the described toxins are related to the peptide -theraphotoxin-Gr1d from the spider Grammostola rosea, which is also homolog to the Hanatoxin (HaTx) from the same species. These toxins are known to modulate voltage-gated potassium channels, speci cally Kv2.1 and Kv4.2 (59), which could pose them as valuable tools for ion channel research. However, unlike previously described toxins (and due to their short length), these toxins are also probable homologs to other Grammostola rosea toxins such as the antimicrobial peptide M-theraphotoxin-Gr1a (60), the analgesic and antiarrhythmic toxintheraphotoxin-Gr2c (as stated by US patents US5756663 and US5877026), and the lectin-like toxin U5theraphotoxin-Hs1a from the spider Haplopelma schmidti (61), making them some of the most interesting peptides found in the venom gland transcriptome of P. verdolaga. Other ORFs related to theraphosid venoms are two toxins related to H. schmidti's -theraphotoxin-Hsb1, a potassium channel and protease inhibitor (62,63), and three ORFs related to U15-theraphotoxin-Hhn1a/e/h, three protease inhibitors from the venom of the spider Haplopelma hainanum (63,64).
Other ORFs related to proteins from other species were the Acetylcholinesterase-1, Cysteine rich secretory protein-1 (CRISP-1) and Neprilysin-1 from the spider Trittame loki, homologs to 23 of the P. verdolagas' predicted proteins and distributed in 1, 3 and 10 clusters respectively. These proteins are known to be related to the envenomation process, as acetylcholinesterase cleaves acetylcholine in neuron synapses and Neprilysin-1 is a metalloprotease that is involved in the degradation of proteins from the extracellular matrix; CRISP-1 is to be related to vespid allergens (65). And the toxin U1-listatoxin-Kh1b from the venom of the spider Kukulcania hibernalis, an insecticidal toxin as described by patent number US5457178, which is homolog to two of the P. verdolagas' ORFs.
Some of the hits can also be traced to organisms outside the Araneae order. Such examples are the homology of one of the predicted sequences to the Kazal-type serine protease inhibitor from the lobster Sagmariasus verreauxi (66), four ORFs homolog to species of the genera Scolopendra: Scolopendin 1 and 2, both antimicrobial and antifungal peptides (67,68), the toxin SLPTX10-4 of unknown function (69) and lastly Scolopendrasin VII, an anticancer peptide (70). Finally, 14 ORFs also showed homology to the protein Techylectin-1 from the Phoneutria nigriventer, which is also homolog to the toxin Techylectin-5B, an antimicrobial lectin from the horseshoe crab Tachypleus tridentatus (71).

Conclusions
The de novo transcriptome assembly has become the main method of obtaining genomic information in non-model organisms, especially in spiders, were there is an evident lack of nucleotide information in contrast to the number of reported species to date. As these arthropods become coveted for the richness of bioactive peptides in their venoms, the completeness in the resulting assembled transcriptomes also gains importance. Here, the utilization of three distinct assembly algorithms allowed the identi cation and annotation of over 1200 more proteins (Over 4000 considering the probable orphan genes) when compared to the v1.0 of P. verdolagas' transcriptome assembly previously done in our group. This supports the need for the merging of the output of various assemblers as a means to obtain more complete transcriptomes. Furthermore, the 8460 annotated proteins will serve as guidance for the description and annotation of posterior spider proteins, especially in the absence of nucleotide sequences of new world theraphosids. On the other hand, the prospection of bioactive peptides allowed the identi cation 135 new theraphotoxins with exciting hypothetic activities that range from probable new antimicrobials to novel insecticidal proteins that may have a place in the development of new products in the medium term. These activities are then left to be tested biologically. Segura was also the team coordinator and resource manager.