Complex patterns of repeats exist in repeat proteins and are fairly common: The dot plots produced by DOTTER reveal complex patterns that can be used to compare repeat proteins much like traditional sequence alignment methods while also reducing the effect of sequence repetition [20]. Analysis of repeat protein amino acid sequences [Fig. 1] using DOTTER [31] readily revealed visually identifiable patterns for the proteins [Fig. 2 & SI Fig. 1, SI Table 1]. We compared pairs of dot plots using a Jaccard similarity score (JX), the ratio of the number of matching black pixels in both dot plots to the total number of black pixels. Human observers typically found pairs of dot plots with JX ≥ 0.5 quite difficult to distinguish, and pattern similarities were usually detectable by eye at JX ≥ 0.1. Furthermore, known repeat-containing proteins had more information-rich dot plots on average than randomly selected proteins. Proteins within the RepeatsDB set had a mean of 272 pixels per protein chain (median 119 pixels/chain, mean length = 345 residues/chain, mean 0.66 ± 1.0 pixels/residue), where pixels were simply the black points within the dot plots, each corresponding to the comparison of a specific pair of amino acids related by the dot plot indices. This set of known repeat proteins had significantly more signal information in their dot plots than two control sets (“bacillus” and “mouse”, generated by searching the PDB for each of these keyword terms) (Table 1). Within the RepeatsDB set, 71.8% of proteins had more than 0.14 pixels/residue, with artificially designed repeat proteins (identified by searching within RepeatsDB for the term “design”) tending to have more pixel information than natural ones on average (SI Fig. 2D). Because the DOTTER-produced dot plots lack the explicit degeneracy that confounds traditional sequence comparisons, and because pairwise comparison of the plots was rapid and efficient, we were able to analyze the entirety of the UniRef90 database [32].
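As an illustration of the score described above, the following minimal Python sketch computes JX and the pixels/residue information content for dot plots represented as sets of black-pixel coordinates. It assumes that the "total number of black pixels" is the union of black pixels across both plots (the standard Jaccard definition); the function names and the toy plots are hypothetical and are not taken from the DOTTER output format.

```python
def jaccard_score(pixels_a, pixels_b):
    """JX for two dot plots given as sets of (row, col) black pixels.

    Assumption: the denominator is the union of black pixels in both plots.
    """
    a, b = set(pixels_a), set(pixels_b)
    union = a | b
    if not union:
        return 0.0  # two empty plots: treated here as no detectable similarity
    return len(a & b) / len(union)


def pixels_per_residue(pixels, chain_length):
    """Information content of a dot plot, in black pixels per residue."""
    return len(pixels) / chain_length


# Toy plots (hypothetical): three shared pixels, five distinct pixels overall.
plot_a = {(0, 3), (1, 4), (2, 5), (3, 6)}
plot_b = {(0, 3), (1, 4), (2, 5), (7, 9)}
print(jaccard_score(plot_a, plot_b))
print(pixels_per_residue(plot_a, 10))
```

Under this definition, the toy pair shares three of five distinct pixels, so it would fall above the JX ≥ 0.5 level at which plots were reported to be difficult to distinguish by eye.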
To compensate for differences in protein sizes, we introduced a “sliding” method in which the start of the smaller protein was positioned at every possible point that gave any overlap along the self-identity diagonal of the larger protein (Fig. 1). The position giving the highest JX score was considered optimal. We identified 13.3 million (16.9 %) protein chains (out of 78.9 million) with an information content of at least 0.42 pixels/residue. The 0.42 pixels/residue cutoff was chosen based on a comparison of the RepeatsDB set and the control “mouse” and “bacillus” sets (see Table 1, SI Fig. 2). This proportion is within the range of previous estimates of the prevalence of repeat-containing proteins [16, 17]. Likewise, we find that 5.5 % of proteins in the set contain one or more LCRs when a minimum length = 20 filter is applied (and 23.3 % for a minimum length = 6 filter) [4].
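The sliding comparison can be sketched as follows. This is a minimal illustration and not the implementation used in this work: it assumes dot plots are stored as sets of (row, col) pixels, that shifting the smaller plot by an offset k along the diagonal maps pixel (i, j) to (i + k, j + k), and that JX is computed over all pixels of both plots rather than only over the overlapping window. The function names and the example data are hypothetical.

```python
def jaccard_score(a, b):
    """JX for two sets of (row, col) pixels (union used as the denominator)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_sliding_score(small_pixels, small_len, large_pixels, large_len):
    """Return (best JX, offset) over every diagonal offset that gives any overlap.

    An offset k places residue 0 of the smaller chain at residue k of the
    larger chain, so k runs from -(small_len - 1) to large_len - 1.
    """
    best_score, best_offset = 0.0, None
    for k in range(-(small_len - 1), large_len):
        shifted = {(i + k, j + k) for (i, j) in small_pixels}
        score = jaccard_score(shifted, large_pixels)
        if score > best_score:
            best_score, best_offset = score, k
    return best_score, best_offset


# Toy example (hypothetical): the small pattern reappears at offset 5.
small = {(0, 2), (1, 3)}
large = {(5, 7), (6, 8), (0, 9)}
score, offset = best_sliding_score(small, 4, large, 10)
print(score, offset)
```

Because every offset requires only set operations on the black pixels, the cost of one comparison scales with the number of offsets times the number of pixels, which is consistent with the pairwise comparison being fast for sparse plots.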
Conservation of dot plot patterns in related proteins: The patterns present in the dot plots of repeat proteins were maintained longer than would be expected for randomly changing sequences, suggesting that there is some pressure to maintain these patterns. To investigate how the dot plots were affected by changes in sequence, we estimated the rate of information decay by subjecting a set of 79 chains (the standard set, see implementation, SI Table 2) to random in silico mutations (Fig. 3, SI Fig. 3) using BLOSUM62 [33]. These 79 proteins (at least two from each of the RepeatsDB subcategories) were used as a standard test set throughout this work. Here, the standard set proteins were mutated in silico and the dot plots of the mutants were calculated and compared to that of the original protein, producing a decay curve for JX values. The resulting curves were fit to a simple exponential decay equation ($\mathrm{JX} = e^{-bz}$), where z indicates the percent identity difference between the mutant and initial proteins. Random mutation usually resulted in a 50% reduction in JX after an 8.2 ± 1.1 % loss of sequence identity, demonstrating that the patterns decay rapidly in the absence of selective pressures (8.5 ± 0.5 % when calculated only from the chains (N = 64) with good R² values for the decay experiment, SI Table 1). These two values lie within each other’s standard error ranges. In most (19 of 22, 86%) of the subgroups taken directly from RepeatsDB, at least 2 out of 3 proteins tested exhibited single exponential decay as judged by an R² ≥ 0.98 for the fit, and in 12 of the 22 subgroups (55%) all of the protein chains did so (SI Table 2). Since JX values seemed to be conserved better than sequence identity (decay half-life < 10% sequence identity), we hypothesized that JX might be employed as a more robust method to detect evolutionary relationships than approaches that rely solely on sequence alignments.
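For the single exponential model above, JX falls to one half when z = ln(2)/b. The sketch below shows one way such a decay constant and half-life could be estimated; it assumes a least-squares fit of ln(JX) against z constrained through the origin, which is an illustrative choice and not necessarily the fitting procedure used here. The data are synthetic, generated from an 8.2 % half-life purely to check the arithmetic, and the function names are hypothetical.

```python
import math


def fit_decay_constant(z_values, jx_values):
    """Estimate b in JX = exp(-b * z) by least squares on ln(JX) = -b * z.

    Assumption: a log-linear fit through the origin; points with JX <= 0
    are skipped because their logarithm is undefined.
    """
    pairs = [(z, jx) for z, jx in zip(z_values, jx_values) if jx > 0]
    den = sum(z * z for z, _ in pairs)
    if den == 0:
        raise ValueError("need at least one point with z != 0 and JX > 0")
    num = sum(z * math.log(jx) for z, jx in pairs)
    return -num / den


def half_life(b):
    """Percent identity loss at which JX has dropped to 0.5."""
    return math.log(2) / b


# Synthetic decay curve (not measured data) with a half-life of 8.2 %.
b_true = math.log(2) / 8.2
z = [0.0, 2.0, 4.0, 8.0, 16.0]
jx = [math.exp(-b_true * zi) for zi in z]
b_fit = fit_decay_constant(z, jx)
print(round(half_life(b_fit), 2))
```

On noise-free synthetic data the fit recovers the half-life it was generated with; on real decay curves the quality of the single exponential description would still need to be judged separately, for example by R² as in the text.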
Because the decay had a “half-life” of less than 10% sequence identity, we examined how well this method could detect commonalities in related proteins and compared it to standard phylogeny using MrBayes [34]. We chose 12 proteins from the standard set and attempted to identify consensus dot plot patterns that might be conserved within each set of related proteins. Illustrative examples for 4 sets of closely related proteins are given in Fig. 2 (comparisons of phylogenetic and dot plot analysis for all 12 sets are given in SI Fig. 1). Consensus dot plot patterns were identified for 10 of these 12 (83% success rate). We also used the standard set of 79 proteins to examine the effects of insertions on decay of the Jaccard score by randomly inserting amino acids into a protein sequence. Random insertions had a more debilitating effect on dot plot conservation, with half of JX being lost on average at an insertion rate of 0.96 ± 0.37 %.
Relationship between sequence and dot plot conservation: We sought to investigate whether the relationships between different dot plots were entirely due to sequence similarity. To do so, the pairwise sequence identities for all the members of the full RepeatsDB [9] set were calculated and compared with their Jaccard distances (JD) (SI Fig. 4). This comparison showed two features: a main peak around 10 to 20% sequence identity comprising most of the pairwise comparisons between the proteins, and a smaller peak above 90% sequence identity. The smaller peak was highly enriched in streptavidin chains (N = 387), which have low information content plots (almost no positive pixels) but make up a sizable portion (6%) of the total number of chains in the dataset. Additionally, the set of 79 standard proteins, when mutated using a replacement matrix (SI Table 2), showed remarkable maintenance of the dot plot structures and JX values (Fig. 4, SI Figs. 5, 6, SI Table 1). Despite essentially no sequence identity between each protein and its mutated variant, the dot plot patterns were often quite similar (as high as JX = 0.88 for GalNAc/Gal-specific lectin (PDB ID 5f8w chain A)). In fact, 71 of the 79 (89.8%) test proteins had JX ≥ 0.1 (our estimate of the minimum JX that could be recognized by human observers) and 20 of the 79 (25.3 %) had JX ≥ 0.5, the point at which it is typically difficult for human observers to distinguish two proteins, despite the two proteins having essentially no sequence identity in all cases.
Analysis of large data sets with DOTTER: We sought to determine how efficiently we could analyze large protein data sets with our method. First, we utilized the RepeatsDB database [9] to produce a general analysis of known repeat proteins (SI Dendrogram). Generating the DOTTER dot plots in batch mode for the set of ~6000 protein chains obtained from RepeatsDB required only a few minutes on a modern Linux desktop computer. Pairwise distances between the protein chains from RepeatsDB were calculated as Jaccard distances (JD = 1 − JX) and hierarchically clustered, and the resulting clusters were scored based on how well they replicated the known sequence identity and structural subgroups defined in RepeatsDB. The clusters from the dendrogram were examined manually, with special attention paid to clusters with a high average number of pixels per member [SI Table 3]. We chose to examine the clustering generated by the McQuitty method in R because it gave the largest number of total clusters at a reasonable cut-off level and because its clusters were the most homogeneous with respect to the sequence identity groupings and structural classifications used by RepeatsDB itself (SI Table 3, SI Dendrogram). We were unable to identify any correlation between these clusters and the structural groups defined by RepeatsDB beyond what would be expected from sequence conservation. However, while most of the resulting groups were immediately obvious upon inspection, manual examination did find an intriguing clustering of the highly immunogenic OspA protein from the spiroform bacterium B. burgdorferi, the causative agent of Lyme disease [35], with the LIC proteins of unknown function from the pathogenic spiroform Leptospira bacteria [36]; these cluster together despite not having significant group median sequence identity (42 %). This relationship was also robust, occurring with several clustering methods other than the reported McQuitty method [SI Table 3].
To our knowledge, this relationship has not been noted elsewhere despite the not insignificant sequence identity these families share, although sequence similarity does not correlate well with the distance of evolutionary relationships in repeat proteins.
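The clustering step described above can be illustrated with a small pure-Python sketch. The analysis reported here was performed with the McQuitty method in R; the code below is only a re-expression of that linkage rule (also known as WPGMA, in which the distance from a merged cluster to any other cluster is the mean of the two pre-merge distances) applied to Jaccard distances. The function names, the toy dot plots, and the cut-off value are hypothetical.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """JD = 1 - JX for two sets of (row, col) pixels."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 0.0)


def mcquitty_clusters(plots, cutoff):
    """Agglomerative clustering with the McQuitty (WPGMA) update.

    Merging stops once the closest pair of clusters is farther apart than
    `cutoff`. Returns a sorted list of sorted member-name lists.
    """
    names = sorted(plots)
    clusters = {i: [name] for i, name in enumerate(names)}
    dist = {
        frozenset((i, j)): jaccard_distance(plots[names[i]], plots[names[j]])
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)
        if dist[pair] > cutoff:
            break
        i, j = tuple(pair)
        # McQuitty update: average the two old distances, ignoring cluster sizes.
        for k in clusters:
            if k not in pair:
                d_ki = dist[frozenset((k, i))]
                d_kj = dist[frozenset((k, j))]
                dist[frozenset((k, next_id))] = (d_ki + d_kj) / 2
        dist = {p: d for p, d in dist.items() if i not in p and j not in p}
        clusters[next_id] = clusters.pop(i) + clusters.pop(j)
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


# Toy dot plots (hypothetical): A/B share a pattern, as do C/D.
toy_plots = {
    "A": {(0, 3), (1, 4), (2, 5)},
    "B": {(0, 3), (1, 4), (2, 6)},
    "C": {(5, 9), (6, 10)},
    "D": {(5, 9), (6, 10), (7, 11)},
}
print(mcquitty_clusters(toy_plots, cutoff=0.6))
```

This naive version recomputes the closest pair on every pass and is intended only to make the linkage rule explicit; for thousands of chains an optimized library routine would be the practical choice.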
Second, we applied the method to a large data set, namely the UniRef90 database, which contains all known protein sequences clustered at 90% sequence identity. This set was analyzed with DOTTER, and HipMCL [37] was used to cluster all sequences that had corresponding dot plots with at least 0.42 pixels/residue of information. This gave 23050 clusters, of which 10205 had at least 5 members. We arbitrarily classed clusters with 4 or fewer members as singletons. Manual examination of the clusters that had between 5 and 200 members (n = 8569) found that only 538 of the clusters did not consist of a single functional type as judged by UniProt protein names, while 925 clusters were made up entirely or essentially entirely of “uncharacterized” or “hypothetical” proteins. 7104 clusters (82.9 %) were easily identifiable by a human observer as a single functional type (or 8031 (93.7%) if “uncharacterized” proteins are included as a functional group) (SI Fig. 7). The number of multi-function clusters increases sharply among the 5% of clusters with the lowest median sequence similarity (SI Fig. 8). Analysis of 8559 of these clusters from UniRef90 revealed that they had between 31.8 and 99.9% median pairwise sequence similarity within a cluster, as calculated by global alignment in BioPython (BLOSUM62, gap opening = -11, gap extension = -1) [38] (SI Fig. 8). Calculation of the pairwise sequence similarity for the other 10 clusters failed due to either long sequence length or a high number of non-standard amino acids. The distance relationships for the set of clusters with 5 or more members were visualized by CLANS [39] (Fig. 5). Attempts at finding superclusters of related proteins from this CLANS representation were not particularly successful; however, the clusters in which the greatest proportion of members contained LCRs did seem to group in one small region of the plot. A list of the proteins contained in the clusters is included in the supplemental material.