Data set: bacterial toxins and controls
The dataset was constructed by extracting representatives of Type I, Type II, and Type III toxins from UniProtKB/Swiss-Prot (RRID:SCR_021164) [15] and PubMed (RRID:SCR_004846) [16] after selecting them from literature reviews for each type of toxin. For Type I, we included representatives of the superantigen family and enterotoxins [33]. Type II included representatives of pore-forming toxins [34] and phospholipases [35]. Type III toxins included AB toxins and effector proteins. For AB toxins encoded in separate open reading frames, the binding subunit (B) was removed: although it is essential for binding, localization, and delivery of the active toxin into the cytosol [36], it has no known toxicity-related enzymatic activity inside the target cell. When both subunits are encoded in one gene and secreted as one protein (e.g. tetanus toxin), the full protein was included. Effector proteins from bacterial secretion systems were extracted from existing specialized datasets for effector proteins [37–40]. A curation step removed structural proteins and chaperones, as they have no known toxic activity. We have listed example proteins that were removed (see Additional file 1, Table S1) and kept (see Additional file 1, Table S2) from the available resources, as well as all sources used for the revision and extraction of proteins (see Additional file 2, Table S3). We assumed that all effector proteins have a function in the manipulation of the cell metabolism. Therefore, Type III toxins include all known and predicted effector proteins from the secretion systems known to translocate proteins directly into the target cell, namely T3SS, T4SS, and T6SS. However, as before, we excluded structural proteins (see Additional file 1, Table S1) and proteins involved in the translocation.
Based on our operational definition of toxins, we expanded the classification of bacterial toxins to include a Type IV, comprising bacterial proteins that degrade the extracellular matrix (ECM). By degrading the ECM, they have a potential effect on cell behavior. Representatives of Type IV toxins included enzymes with a direct effect on ECM proteins, such as collagenases, siacylases, and metalloproteases.
For the control set of bacterial secreted proteins with no known toxin activity, we selected all secreted proteins from monoderms and diderms that were neither membrane-associated nor outer membrane vesicle (OMV)-associated, as available in the PSORTb 3.0 database (accessed June 2021) [41]. This database was chosen because its classification considers the potential membrane association of secreted proteins, which other databases do not. This allowed us to select the proteins most similar to bacterial toxins. We removed all proteins that contained “Fragment” within the identification.
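The fragment filter described above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: it assumes the sequences are stored in FASTA format, that the word "Fragment" appears in the header line, and that matching may be case-insensitive. The accessions and sequences in the example are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def drop_fragments(records):
    """Keep only records whose header does not mention 'Fragment'.

    Assumption: the match is case-insensitive and done on the full header.
    """
    return [(h, s) for h, s in records if "fragment" not in h.lower()]


# Hypothetical records, for illustration only.
example = """>sp|P00001|TOXA_EXAMPLE Hypothetical toxin A
MKKLLA
>sp|P00002|TOXB_EXAMPLE Hypothetical toxin B (Fragment)
MKT
""".splitlines()

kept = drop_fragments(read_fasta(example))
print([header.split("|")[1] for header, _ in kept])
```

The same filter would apply unchanged to any other set of secreted control proteins stored in the same format.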
Data set: animal toxins and controls
We based the dataset of animal venom toxins on the expert-curated Tox-Prot [42], itself a subset of UniProt (accessed Dec 2022). We included venom proteins from jellyfish, snails, centipedes, insects, arachnids, fish, mammals, and snakes. We added proteins retrieved from our own studies on three-finger toxins (3FTs) in snakes [43] and from our own recent annotation of hymenopteran venoms [44].
For the control set of secreted animal proteins, we selected all secreted [KW-0964], non-toxic [NOT KW-0800] proteins from the same organisms as in the animal venom set, based on their Organism ID as defined in UniProt [14] (accessed July 2023). For animals, we considered UniProt better suited because of its better prediction of subcellular localization for eukaryotic proteins. We removed all proteins that contained “Fragment” within the identification.
Data set comparison
To address a possible bias from sequence redundancy, we excluded identical sequences within each of the sets using the clustering tool MMseqs2 [45]. We applied the default parameter settings of the easy-cluster option with a sequence similarity threshold of 1.00 and alignment coverage mode 0 [45]. In mode 0, the alignment must cover both the query and the target sequence at the selected coverage threshold. We kept only the representative sequence of each cluster.
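As a rough sketch of how such a deduplication run could be invoked, the snippet below assembles an MMseqs2 `easy-cluster` command line in Python. It assumes that the similarity threshold maps to the `--min-seq-id` option and the coverage mode to `--cov-mode`. The file names are hypothetical placeholders, and all other options are left at their defaults.

```python
import shlex


def easy_cluster_cmd(fasta, out_prefix, tmp_dir, min_seq_id=1.0, cov_mode=0):
    """Build the argument list for an `mmseqs easy-cluster` run.

    Assumption: the thresholds in the text correspond to --min-seq-id
    and --cov-mode; everything else keeps the MMseqs2 defaults.
    """
    return [
        "mmseqs", "easy-cluster", fasta, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),
        "--cov-mode", str(cov_mode),
    ]


# Hypothetical input and output names.
cmd = easy_cluster_cmd("bacterial_toxins.fasta", "toxins_id100", "tmp")
print(shlex.join(cmd))

# To actually run the clustering (requires MMseqs2 on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

With these settings, MMseqs2 writes the cluster representatives to `<out_prefix>_rep_seq.fasta` and the cluster assignments to `<out_prefix>_cluster.tsv`.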
To investigate the remaining sequence diversity within the sets, we clustered them with MMseqs2 at three different thresholds of 0.75, 0.50, and 0.25. Subsequently, we calculated the number of clusters (n) that were similar given a particular threshold (thresh) as follows:
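As an illustration of the counting step only, the following sketch tallies clusters from a two-column, tab-separated cluster table (representative, member) of the kind written by MMseqs2 `easy-cluster`. It does not attempt to reproduce the formula referred to above. The ratio of clusters to sequences is shown merely as one possible summary and is an assumption, as are the identifiers in the example.

```python
from collections import Counter


def cluster_sizes(tsv_lines):
    """Count members per cluster from 'representative<TAB>member' lines."""
    sizes = Counter()
    for line in tsv_lines:
        line = line.strip()
        if not line:
            continue
        representative, _member = line.split("\t")[:2]
        sizes[representative] += 1
    return sizes


# Hypothetical cluster table for one threshold (e.g. 0.50).
example_tsv = ["A\tA", "A\tB", "C\tC", "D\tD", "D\tE", "D\tF"]

sizes = cluster_sizes(example_tsv)
n_clusters = len(sizes)               # number of clusters (n)
n_sequences = sum(sizes.values())     # number of clustered sequences
print(n_clusters, n_sequences, n_clusters / n_sequences)
```

Running the same function on the cluster tables for 0.75, 0.50, and 0.25 would give one cluster count per threshold.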
For each data set, we calculated the amino acid composition (percentage of each of the 20 standard amino acids), the average sequence length (number of residues per protein), the aromaticity (percentage of phenylalanine, tryptophan, and tyrosine), and the isoelectric point (pI) after Bjellqvist [46]. The pI, amino acid composition, average sequence length, and aromaticity were calculated after removal of duplicates from the raw data using Biopython (RRID:SCR_007173).
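To make these descriptors concrete, the sketch below computes composition, aromaticity, and mean length in plain Python. It is not the Biopython code used in the study. It assumes that non-standard residues are ignored when computing percentages, and the example sequences are hypothetical. The isoelectric point is left out; in Biopython it is available via `Bio.SeqUtils.ProtParam.ProteinAnalysis(seq).isoelectric_point()`.

```python
STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"
AROMATIC = "FWY"  # phenylalanine, tryptophan, tyrosine


def count_standard(seq):
    """Number of residues in seq that are one of the 20 standard amino acids."""
    return sum(seq.count(aa) for aa in STANDARD_AA)


def composition_percent(seq):
    """Percentage of each standard amino acid (non-standard residues ignored)."""
    seq = seq.upper()
    n = count_standard(seq)
    if n == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * seq.count(aa) / n for aa in STANDARD_AA}


def aromaticity_percent(seq):
    """Percentage of F, W, and Y among the standard residues."""
    seq = seq.upper()
    n = count_standard(seq)
    if n == 0:
        raise ValueError("sequence contains no standard amino acids")
    return 100.0 * sum(seq.count(aa) for aa in AROMATIC) / n


def mean_length(seqs):
    """Average number of residues per protein."""
    return sum(len(s) for s in seqs) / len(seqs)


# Hypothetical sequences, for illustration only.
seqs = ["MKFWYA", "AAAA"]
print(aromaticity_percent(seqs[0]), mean_length(seqs))
print(composition_percent(seqs[1])["A"])
```

Note that these functions return percentages, whereas Biopython's `aromaticity()` reports a fraction between 0 and 1.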
To visualize the differences in amino acid composition between datasets, we calculated the Surprise (Eqn. 1) per amino acid, defined as follows:
where AA is one of the 20 amino acids, µdata1 is the average percentage of AA in data set 1, µmergeddatasets is the average of the background (both sets combined), which is subtracted from µdata1, and σmergeddatasets is the standard deviation of the background.
We first computed the amino acid composition in the single and merged datasets. Then, we generated a normal distribution by bootstrapping with 10,000 samples to calculate the mean and standard deviation. Finally, the Surprise values determined the height of each amino acid letter (one-letter code) using the Python package Logomaker [47].
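The procedure can be sketched as follows, under several assumptions that are not confirmed by the text: Surprise is taken to be a z-score, (µdata1 − µmergeddatasets) / σmergeddatasets; the bootstrap resamples proteins of the merged set with replacement, using resamples of the same size as the merged set; and σ is the standard deviation of the bootstrap means. The sequences are hypothetical, and the number of bootstrap samples is reduced in the example to keep it fast.

```python
import random
import statistics

STANDARD_AA = "ACDEFGHIKLMNPQRSTVWY"


def aa_percent(seq, aa):
    """Percentage of amino acid `aa` in one sequence."""
    return 100.0 * seq.count(aa) / len(seq)


def bootstrap_mean_sd(values, n_boot, rng):
    """Mean and standard deviation of bootstrap means (resampling with replacement)."""
    n = len(values)
    means = [statistics.fmean(rng.choices(values, k=n)) for _ in range(n_boot)]
    return statistics.fmean(means), statistics.stdev(means)


def surprise(data1, data2, n_boot=10000, seed=0):
    """Assumed z-score form: (mu_data1 - mu_merged) / sigma_merged per amino acid."""
    rng = random.Random(seed)
    merged = data1 + data2  # background: both sets combined
    result = {}
    for aa in STANDARD_AA:
        mu_data1 = statistics.fmean(aa_percent(s, aa) for s in data1)
        merged_values = [aa_percent(s, aa) for s in merged]
        mu_merged, sd_merged = bootstrap_mean_sd(merged_values, n_boot, rng)
        # Amino acids absent from both sets have zero spread; report 0.0.
        result[aa] = (mu_data1 - mu_merged) / sd_merged if sd_merged > 0 else 0.0
    return result


# Hypothetical toy data: lysine-rich "toxins" vs. glycine/alanine-rich "controls".
toxins = ["KKKKAC", "KKKCCA"]
controls = ["AAAAGG", "AAGGGL"]

scores = surprise(toxins, controls, n_boot=200)
for aa in sorted(scores, key=scores.get, reverse=True)[:3]:
    print(aa, round(scores[aa], 2))
```

Positive values mark amino acids enriched in data set 1 relative to the background, negative values mark depleted ones. To draw the logo, the scores could be placed in a one-row table with one column per amino acid and passed to `logomaker.Logo`.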
Limitations
The following limitations need to be considered. Our bacterial data sets are biased because the underlying databases are enriched in bacteria that can be cultured in the laboratory and that are relevant as human pathogens. Future analyses that address these biases, for example by using metagenomic data or computational predictions to identify toxins from unculturable organisms, will be valuable. Our data cannot tell us whether these differences also hold for toxins from other bacteria, fungi, or plants.