DegSampler: Gibbs sampling strategy for predicting E3-binding sites with position-speciﬁc prior information

Background: The ubiquitin-proteasome system is a pathway in eukaryotic cells for degrading polyubiquitin-tagged proteins through the proteasomal machinery to control various cellular processes and maintain intracellular homeostasis. In this system, the E3 ubiquitin ligase (hereinafter E3) plays an important role in selectively recognizing and binding to speciﬁc regions of its substrate proteins. The relationship between a substrate protein and its sites bound by E3s is not well understood. Thus, it is challenging to computationally identify such sites in substrate proteins. Results: In this study, we proposed a collapsed Gibbs sampling algorithm called DegSampler (Degron Sampler) to identify the binding sites of E3s. DegSampler employs a position-speciﬁc prior probability distribution, based on the estimated information of the disorder-to-order region bound by any protein. Conclusions: Our computational experiments show that DegSampler achieved 5 and 3.5 times higher the F-measure values of MEME and GLAM2, respectively. Thus DegSampler is the ﬁrst model demonstrating an eﬀective way of using estimated information on disorder-to-order binding regions in motif discovery. We expect our results to improve further as higher quality proteome-wide disorder-to-order binding region data become available.


Background
Eukaryotic cells have two major pathways for degrading proteins in order to control cellular processes and maintain intracellular homeostasis. One of the pathways is autophagy, a catabolic process that delivers intracellular components to lysosomes or vacuoles [1]. The other pathway is the ubiquitin-proteasome system, which degrades polyubiquitin-tagged proteins through the proteasomal machinery [2]. In the ubiquitin-proteasome system, an E3 ubiquitin ligase (hereinafter E3) selectively recognizes and binds to specific regions of the target substrate proteins. These binding sites are called degrons [3].
It is important to identify the E3s and substrate proteins that interact with one another to characterize various cellular mechanisms. Information on known degrons has been accumulated in the eukaryotic linear motif (ELM) resource for functional sites of proteins [4]. In the database, degron motifs are grouped into the "DEG" class. In addition, in the E3Net database [5], the relationship between E3s and their associated substrate proteins have been compiled. To date, only 55 of the 837 identified substrate proteins have been annotated with DEG motifs. In other words, 93% of the 837 known substrates are not annotated with any degrons. Thus, it is necessary to identify degrons of substrate proteins bound by specific E3s.
In literature, many sequence motif finders have been presented, such as MEME [6], Gibbs Motif Sampler [7], and GLAM2 [8]. MEME is a popular position-specific scoring matrix (PSSM) finding method based on expectation-maximization (EM) (see http://meme-suite.org/index.html). Gibbs Motif Sampler and GLAM2 are Gibbs sampling-based methods. One way of further improving motif identification is to exploit prior information on sequence positions. Narlikar and colleagues proposed a Gibbs samplingbased method called PRIORITY, which showed the effectiveness of the position-specific prior information based on nucleosome occupancy for motif discovery in deoxyribonucleic acid (DNA) [9]. In general, prior information can be easily integrated into the posterior probability distribution of motifs by representing it as a prior probability distribution. Inspired by their success, Bailey and colleagues extended the EM-based MEME algorithm to use position-specific priors and reported their effectiveness in identifying transcription factor-binding sites in yeast and mouse DNA sequences [10].
However, to the best of our knowledge, there is no model in which position-specific prior information is integrated in Gibbs sampling algorithms to identify motifs in protein sequences. Thus, we propose a Gibbs sampling-based method called DegSampler (Degron Sampler) to identify the locations of degrons in a given set of substrate protein sequences, which are bound by E3s.
One of the candidates of information sources for position-specific prior distributions is the disorderness of each residue in a protein sequence. This inference is based on the observation that the E3-binding sites of substrate proteins are often located within the disordered regions of the protein structure [11,12]. The disorderness for a protein sequence can be calculated by IUPred1 [13] and IUPred2 [14].
Furthermore, we also examined the information on disorder-to-order binding regions in disordered proteins as a source of position-specific prior distribution. The estimated information can be obtained using prediction tools, including ANCHOR [15,16], ANCHOR2 [14], MoRFpred [17], MFSPSSMpred [18], and DISO-PRED3 [19]. Typically, the output of the tools is the probability of each residue being in a disorder-to-order region bound by any protein. These probabilities are expected to work better than disorderness as positionspecific prior information in identifying the locations of degrons.
We have evaluated the efficiency of position-specific prior information related to the disorderness and disorder-to-order binding affinity of amino acid residues in identifying degrons. We have also compared the performance of DegSampler with those of other popular tools, viz. MEME and GLAM2, on real data sets. We found that DegSampler significantly outperforms the others.

Materials
Sequence data sets and known motifs The E3Net database [5] identified 491 E3 ubiquitin ligases (E3s) associated with some known substrate proteins. In this research, we selected all E3s with at least three or more known substrate proteins. Consequently, we have extracted 123 E3s that are associated with 837 different substrate proteins from six species: Homo sapiens (human), S. cerevisiae (yeast), Arabidopsis thaliana (mouse-ear cress), Mus musculus (mouse), Rattus norvegicus (rat), and Drosophila melanogaster (fruit fly), and three viruses, human herpesvirus 1 (HHV11), human papillomavirus type 16 (HPV16), and human herpesvirus 8 type P (HHV8P). The average number of substrates assigned to an E3 is 8.5.
ELM is a database of short eukaryotic linear motifs that are manually curated from literature; this includes 3,026 experimentally validated motif occurrences on 1,886 proteins. These linear motifs are grouped into six functional classes: cleavage (CLV) sites, degradation (DEG) sites, docking (DOC) sites, ligand (LIG)binding sites, post-translational modification (MOD) sites, and targeting (TRG) sites. The DEG class motifs are equivalent to degron motifs. Thus, the output of motif finders was evaluated using these DEG motifs. This class has 25 motifs, as shown in Table 1. Some of the motifs are too short to be predicted. Furthermore, only 36 of the 123 E3s in the E3Net database include at least one substrate that is annotated with a DEG motif in ELM (human 29; yeast 3; mouse-ear cress 1; mouse 1; rat 0; fruit fly 0; HHV11 1; HPV16 1; and HHV8P 0). The remaining 87 E3s have no DEG motif annotation on the substrate proteins. Thus, we used these 36 E3-specific sets of substrate proteins as input for motif finders to evaluate the performance of DegSampler.
Although E3Net has 837 E3-specific substrates, only 199 substrates are annotated with some ELM motifs. Furthermore, only 55 substrates are annotated with the DEG motifs. In other words, 93% of the 837 substrates were not annotated with any DEG motif. Therefore, the motif occurrences predicted by reliable motif finders are good candidates for degrons.

Source of position-specific prior information
Here, we describe the source of the position-specific prior information used in this study. We prepared position-specific priors derived from IUPred1 [13] and IUPred2 [14], which provides the probability of each residue belonging to a disordered region. IUPreds have three prediction types, "short disorder", "long disorder", and "structured domain." We selected the short and long disorder types. The outputs are denoted by IUPred1-short, IUPred1-long, IUPred2-short, and IUPred2-long.
Concerning the prior information on disorder-toorder binding regions of protein sequences, we used three prediction tools, ANCHOR [15], which is designated as ANCHOR1 to distinguish from its successor ANCHOR2, ANCHOR2 [14], fMoRFpred [20], and DISOPRED3 [19]. The output of these prediction tools (a) (b) Figure 1 Example of ANCHOR1 outputs. The x-axis shows the position of a protein sequence. The y-axis represents the probability assigned to a sequence position by ANCHOR1. The known occurrence of a DEG motif is indicated by a tall and rectangular broken red line. The graph title has the UniProt accession identifier of the substrate protein and the ID of a DEG motif in the protein.  is the probability of each residue belonging to disorderto-order binding regions. ANCHOR1 is a biophysicsbased energy scoring method, and its successor, AN-CHOR2, is an updated version. The fMoRFpred is a faster version of MoRFpred [17] with slightly compromised prediction accuracy, but has the ability to accept up to 2,000 protein sequences at a time, unlike MoRFpred which takes up to five protein sequences. Both methods are available only in their web servers. Since we have as many as 837 sequences, we selected fMoRFpred instead of MoRFpred. DISOPRED3 is a support vector machine (SVM)-based method that uses profile data and several sequence-derived features. We assigned a probability of 0.5 to all the residues of an input sequence whenever a tool failed to make a prediction.
Here, we show good and bad examples of the AN-CHOR1 output in Fig. 1. The substrate protein is shown in Fig. 1 (a) is DCC HUMAN (P43146). There is a DEG SIAH 1 occurrence from the position 1331 to 1339. As shown in the graph, the probabilities of ANCHOR1 around that region are approximately 0.9. However, the substrate protein shown in Fig. 1 (b) is MYC HUMAN (P01106). Although there is a DEG SCF FBW7 1 occurrence from the position 55 to 62, the ANCHOR1 probabilities in the region are less than half.

Preliminaries
Let Σ be an alphabet of 20 distinct letters representing the 20 amino acids. A dataset of N sequences over Σ is referred to as X = {X 1 , . . . , X N }. The jth letter of the ith sequence of X i is denoted by X i,j . |X i | represents the length of X i . The collection of all the indexes of the sequence letters is denoted by U = {(i, j)|i = 1, . . . , N, j = 1, . . . , |X i |}. The complement of a subset V of U , that is, U \ V , is denoted by V c . We introduce a letter-counting function c X for V that returns c X (V ) = (freq 1 , . . . , freq |Σ| ) T where freq is the number of the th type letter of Σ in the letter set {X i,j |(i, j) ∈ V }. For simplicity, we use c(V ) instead of c X (V ) because once a set of sequences are given, the set remains fixed.
We used the following notations of basic arithmetic operations as used in [21]. For the two vectors u and v of the same dimension n, we define u

Motif model
We adopted two models for the distribution of the motif sites: one occurrence per sequence (OOPS) and zero or one occurrence per sequence (ZOOPS). The OOPS model [22] assumes that each sequence has exactly one occurrence of a motif. Let z i be a hidden random variable that represents the starting position of the occurrence of a motif in X i ; z i ∈ {1, . . . , |X i |−W +1} for the motif width W where i = 1, . . . , N . We denote all z i together as the vector Z = (z 1 , . . . , z N ). The ZOOPS model [23] is a generalization of OOPS in which each sequence is permitted to have either zero or one occurrence of a motif. We denote the state where X i does not have any motif occurrence by z i = 0.
The site of a motif is often modeled as a single contiguous block of residues. However a more discriminative motif can be obtained by excluding some positions within the contiguous block of a motif. Liu et al. proposed a way of selecting a constant number of R(≤ W ) columns in an aligned block of width W to form a motif [21]. This model is called the fragmentation model of a motif. In this model, R selected positions of a short aligned block of width W are evaluated as parts of the motif (see Fig. 2). DegSampler optimizes R although it is kept constant in [21]. Figure 2 Example of a fragmentation model of a motif. The six rectangles represent the columns selected as parts of the motif. In this example, λ is (1, 0, 1, 1, 0, 1, 0, 1, 0, 1), which gives R = 6 and W = 10.
Such R columns selected as parts of the motif are specified by λ = (λ 1 , . . . , λ W ), where λ 1 = λ W = 1, λ w ∈ {0, 1} for w = 2, . . . , W − 1, and W w=1 λ w = R. λ w = 1 (resp. 0) means that the wth column is evaluated as part of a motif (resp. background) model. We refer to W as the motif width. The region of length W covered by a motif is referred to as the occurrence of the motif. Our Gibbs sampling algorithm optimizes W and R by updating λ. After each update, λ is freshly indexed so that λ 1 = λ W = 1 for a new motif width W . Thus, both the columns represent the two ends of the new motif. For w ∈ {w|λ w = 1}, λ(w) denotes the order of the wth element among the elements having the value 1. For example, for λ = (1, 0, 0, 1, 1), we have λ(1) = 1, λ(4) = 2, and λ(5) = 3.
We also show that the posterior probability distribution is tempered in the fragmentation model of ZOOPS and OOPS. In other words, the posterior probability distribution is parameterized with a temperature parameter of T . We first describe the untempered version.

Likelihood
The local alignment of the given sequences X occupied by a motif is represented as a matrix where the ith row corresponds to the ith sequence and the wth column indicates the wth relative position within a motif occurrence. Our Gibbs sampling algorithm applies several operations to the various subsets of letter indexes of the given sequences. Then, we introduce matrix-like notations representing the subsets of indexes, as shown in Table 2. I[Z, λ] is the collection of the indexes with z i > 0 and λ w = 1. I[Z, w] is a subset of I[Z, λ], in which the elements are restricted to those in the wth column.
in which the elements are limited to the indexes of the kth sequence. I[z k , w] is the index corresponding to the wth position of a motif occurrence in the kth sequence, i.e., (k, z k + w − 1); if the sequence does not have any motif occurrence, the set is empty. Note that all of these notations can be used in both the OOPS and the ZOOPS models.
A position-specific scoring matrix (PSSM) is a de facto standard model of a likelihood function of a posterior probability distribution for the sequence motif identification. This model evaluates the conservation of each residue within the selected positions of a motif.
The letters outside the R selected positions in all the given sequences are assumed to be independently drawn from the same categorical distribution with the parameter θ 0 . This is called the background model.
Thus the probability of X with Z, λ, and θ 0:R can be given as

Prior distributions
Here, we formulate the prior distributions for the parameters θ 1:R , Z, and λ.
Prior distribution of θ 1:R Let us recall that c(I[Z, w]) with λ w = 1 is a sample from the categorical distributions with the parameters θ λ(w) . A Dirichlet distribution is known to be a conjugate prior to categorical distribution and is defined as where the parameter α is of dimension L; 1 is the L-dimensional vector with all the elements having a value of 1. and the multivariate beta function B (α) is defined as follows: In this study, we used the same Dirichlet distribution parameter α for the R + 1 random variables θ 0 , θ 1 , . . . , θ R . For r = 0, . . . , R, we define the following: p (θ r |α ) = Dir (θ r |α ).
Prior distribution of Z We assume that for each i = 1, . . . , N , and j = 1, . . . , |X i |, a prior probability d i,j is given; for example, it will be the outcome of ANCHOR1 for X i . The probability is used as the likelihood of the position being part of a motif occurrence. We introduce the notation m(z i , λ) = {z i + w − 1|λ w = 1, w = 1, . . . , W, z i = 0}, which represent the positions included in a current motif occurrence specified by z i and λ. Then, we suppose that p(z i |λ), the probability of z i given λ, is proportional to if z i = 0 and is proportional to if z i = 0. This formulation is almost the same as in [9]. Furthermore, we have applied two modifications to the formulation. Note that the last term |Xi| j=1 (1 − d i,j ) is constant once X is given. Then, we use the terms that are obtained by dividing both cases by the constant.
Another modification is the exploitation of the coefficient g of this prior term to adjust the effectiveness of the prior information quantity, for which the term is raised to the power of g. The resulting prior probability distribution of z i , given λ is proportional to where δ i,j is the Kronecker delta, equal to 1 if i = j and 0 otherwise. Thus, the right-hand side is 1 when z i = 0. The resulting prior distribution of Z is formulated as p(Z|λ) = N i=1 p(z i |λ) with different normalizing constants in the ZOOPS and the OOPS models.
Hyperprior distribution of λ We have discussed the hyperprior distribution of λ. There are various fragmentation patterns in the ELM resource; therefore, it is reasonable to assume no bias for λ. Then, we use the uniform distribution of λ specified by using the two parameters R min and W max such that R min ≤ R and W ≤ W max , respectively.

Collapsed posterior distribution
The joint posterior probability distribution of the hidden random variables Z, λ, and θ 0:R , in the ZOOPS model is formulated as By integrating out θ 0:R , as shown in [21], we obtain the collapsed posterior distribution of Z and λ as (1) From this collapsed posterior distribution, we derive conditional distributions of z k and λ v to update their values. This collapsed posterior distribution works in both the OOPS and the ZOOPS models.
Update of a motif starting position z k To update z k , we derive the conditional probability p (z k |Z −k , λ, X ) from Eq. (1), which is proportional to Note that the approximation scheme given in [21] has been used for the derivation. In particular, when z k = 0, p (z k |Z −k , λ, X ) is proportional to 1 with the same normalizing constant, all c (I[z k , w]) and δ z k ,0 become 0.
Update of the motif column selection λ We have described how to update λ. First, the current λ is extended by attaching 2·∆ new columns to the left and right sides of λ. The columns added to the left are indexed as −∆+1, . . . , −1, 0, and the columns added to the right are indexed as W + 1, . . . , W + ∆. Thus, the columns of the resulting λ are indexed continuously from −∆ + 1 to W + ∆.
In the second step, the index v ∈ {−∆ + 1, . . . , W + ∆} was either selected randomly or in a fixed order. In the third step, we performed random sampling according to the conditional probability distribution of λ v ∈ {0, 1} with the current values of λ −v , Z, and X. The probabilities of the distribution are calculated as where λ is the same as λ but λ v = 1, and Both the right-hand side equations share the same normalizing constant.
If λ v = 1 is selected in the sampling, and v is an index of the additional columns indexed as −∆+1, . . . , −1, 0, then the value of z i is replaced with z i + v − 1 for z i = 0. Furthermore, the leftmost and the rightmost contiguous blocks of columns indexed as w with λ w = 0 are removed. For example, if the content of λ after sampling with ∆ = 3 and v = −1 is (010110111000), then the updated λ is (10110111).
An advantage of this update is that a motif can easily shift to the left or the right on the substrate sequence. In addition, the R and W values are also varied. The default value of ∆ is set to 3 in this analysis.

Simulated annealing
We also derived a tempered version of the collapsed posterior probability distribution, which was parameterized with temperature T as follows: ( The corresponding conditional probability distribution of z k is as follows: The conditional probabilities of λ v are given as follows: where λ is the same as λ, but λ v = 1 and

Algorithm of DegSampler
Here, we provide an overview of the algorithm of DegSampler in Algorithm 1. At the top level of DegSampler, it repeats the main procedure of the function called DegSampler2Core for three times to avoid falling into a local optimum. DegSampler2Core repeats the procedure M = 300 times. At M sa = 240, the simulated annealing is started. Thus, the last 20% of all iterations are annealed. After each repetition, the value of the temperature parameter T is reduced by sa ratio = 0.999.
logPost ← the R.H.S. of Eq. (2) for Z and λ if optLogPost < logPost then optLogPost ← logPost Z best ← Z λ best ← λ Algorithm 1: DegSampler algorithm. The three variables optLogPost, Z best , and λ best , are supposed to be global, that is, they can be accessed from any place.

Performance measure
We measure the quality of predicted motif occurrences with respect to the DEG motif sites by precision, recall, and F-measure. In general, the precision score is the ratio of the number of true positives to the predicted positives, and the recall score is the ratio of the number of true positives to the number of real positives. The F-measure is the harmonic mean of precision and recall. In this work we define the positions of the DEG motif sites that are shared by the predicted motif occurrences as true positives. Similarly, for substrate protein sequences annotated by some DEG motifs, the positions of the predicted motif occurrences that are not covered by any of the DEG motif sites are false positives. Finally, the sequence positions of the DEG motif sites that are not covered by the predicted motif occurrences are false negatives. These quantities are calculated only on the protein sequences annotated by some DEG motif occurrences because the other sequences always have no true positives. Let TP , FP , and FN be the number of true positives, that of false positives, and that of false negatives, respectively. We represent the precision, recall, and F-measure as

Default parameter set
We set the default value of a parameter as follows: The maximum motif width and the minimum number of selected columns are W max = 9 and R min = 8, respectively. The weight of the position-specific prior distribution, g, is 1. The pseudo count of residues at each position of a motif is set to be α = 1, and the site distribution model is ZOOPS. In this work, with a particular set of parameter values, DegSampler is executed five times.

Performance comparison
We have shown the performance comparison of DegSampler with MEME and GLAM2. DegSampler used default parameter values: W max = 9, R min = 8, g = 1, α = 1, d i,j is the ANCHOR1 probability, and the motif site distribution model is ZOOPS. We performed GLAM2 using the default values. MEME can optionally accept a prior distribution over the site locations of width w(≥ 2). Thus we set the geometric Figure 3 Performance comparison of DegSampler with MEME and GLAM2. DegSampler used default parameter values: Wmax = 9, R min = 8, g = 1, α = 1, d i,j is the ANCHOR1 probability, and the motif site distribution model is ZOOPS. We performed GLAM2 using the default values: MEME was executed using the default values, except for the specified options: MEME-OOPS (resp. MEME-ZOOPS) represents MEME, where the site distribution model was set as OOPS (resp. ZOOPS); Furthermore, MEME-OOPS-Prior (resp. MEME-ZOOPS-Prior) represents MEME with position-specific prior information, using the specified site distribution model. mean d i,j ·d i,j+1 as the prior probability of the location of width 2 starting at the position j whered i,j = 0.9 × d i,j / j d i,j , This transformation in the ZOOPS model implies that the prior probability of X i having no motif site is 0.1 For a motif-width w different from w, MEME internally transforms a given prior distribution of site width w to the corresponding prior distribution of site width w , using the geometric mean. Thus, the w value used does not seem to be a critical factor. We have set w = 2 in this study. MEME-OOPS-Prior (resp. MEME-ZOOPS-Prior) represents MEME with a position-specific prior using the specified site distribution model of OOPS (resp. ZOOPS). In addition, MEME-OOPS (resp. MEME-ZOOPS) represents MEME without any position-specific prior using the specified site distribution model of OOPS (resp. ZOOPS).
The graph of Fig. 3 shows that DegSampler outperformed the others. The best F-measure among the four methods based on MEME is 0.043 is attained by MEME-OOPS-Prior. The second best, 0.037, is slightly lower than the best and is marked by MEME-OOPS. This result implies that the given positionspecific prior information of ANCHOR1 in MEME cannot be sufficiently utilized. MEME shows the same trend in the ZOOPS model.
GLAM2 is the second best in F-measure. Note that GLAM2 is also a Gibbs sampling-based method as DegSampler. The disadvantage of GLAM2 is that there is no mechanism to utilize position-specific prior information. On the other hand, it is an advantage that the motif can be gapped. Actually, some of the DEG motifs in ELM are regular expressions whose matching sequences are variable in length like DegSampler has 16 wins, 7 draws, and 13 losses against GLAM2 using 36 E3 input instances (see Supplement Table S1). Note that there are 27 instances of the 36 input sets, including substrate proteins annotated by DEG motifs with variable-length patterns. Even after being restricted during the 27 instances, DegSampler has 14 wins, 8 draws, and 5 losses. Under this restriction advantage to GLAM2, DegSampler outperformed GLAM2.
Comparing sources of position-specific prior information Figure 4 Comparison of various sources of position-specific prior information for DegSampler. Height of the bar represents the mean of precision, recall, and F-measure scores. The error bars represent standard error. IUPred1 and 2 have an option for specifying the range of disordered regions, either short or long (default). The selected option is represented by attaching either "-short" or "-long" to "IUPred1" and "IUPred2." We examined the predictability of DegSampler using IUpred1, IUpred2, ANCHOR1, ANCHOR2, fMoRFpred, and DISOPRED3, and then identifying the best source of position-specific prior information. These prediction methods are modeled in their own ways, thus the predictability of DegSampler would vary with each method. Fig. 4 shows the precision, recall, and F-measure of DegSampler for each prior source as well as those with the case where the position-specific prior distribution is uniform, which is used as the baseline for this comparison. The other parameters are set to default.
We could see that ANCHOR1 outperforms the others. Although the successor, ANCHOR2, is reported to be superior to ANCHOR1 in the identification of disorder-to-order binding regions [14], the predecessor works better. The remaining, except the uniform prior cases are better than it. The scores of ANCHOR2 and fMoRFpred are slightly higher than those of IUPreds. This might be caused by the fact that ANCHOR2 and fMoRFpred are predictors for disorder-to-order regions, which are bound by any protein, and IUPreds are only predictors for disorder-to-order regions.

Effectiveness of position-specific prior
In this case, we examine how the position-specific prior of DegSampler affects predictability by controlling the weight of the prior, g. We have executed DegSampler with g = 0, 0.1, 0.2, . . . , 2, and 4, 6, . . . , 24. Figure 5 shows the F-measure values of DegSampler over g = 0, 0.1, 0.2, . . . , 2. The other parameters have the default values. In accordance with the increase of g from 0, the F-measure drastically increases and converges around g = 1.0. The case where g = 0 is equivalent to the case in which the position-specific prior probability distribution is uniform. The mean of the F-measure value of this case is 0.017. This value is quite low; thus, the result indicates the usefulness of the position-specific prior information. The F-measure appears stable over the range from g = 0.8 to 2. This plateau continues in the case where g = 18 (data not shown), and then the F-measure value falls down to 0.182728, 0.168434, and 0.133904 at g = 20, 22, 24, respectively. This decrease is likely due to the biasness of the position-specific prior information.

Example of found motif occurrences
Here, we provide an example of the outputs of DegSampler from the 36 input sets. The E3 ubiquitin ligase of SYVN1 HUMAN is associated with 12 substrate proteins in the E3Net database.  The ANCHOR1 probability for the 5th substrate protein P04637 (P53 HUMAN) is given in Fig. 6. There are many peaks, specifically five peaks whose probability is higher than 0.9. The site of motif DEG MDM2 SWIB 1, located in the region from position 19 to 26, is overlapped with one of the five peaks, and its ANCHOR1 probability is approximately 1. DegSampler was able to successfully identify the site.
We found that the regular expression of motif DEG MDM2 SWIB 1 also matches the sequence, FEETWATLL, of the region from 2464 to 2472 of P42858. However, this occurrence is not recorded as an instance of the DEG motif in the ELM database. In ad-dition, this site is not predicted by DegSampler. The reason could be that the average of the ANCHOR1 probabilities of within that region is very low (0.249).
We further analyze this local alignment by classifying the 20 amino acid residues into the positively charged residues (RHK), the negatively charged residues (DE), the polar residues (NQSTY), and the non-polar residues (ACGILMFPWV) [24]. The singleletter amino acid symbols of the categories are replaced with '+', '-', 'x', and 'o' in Table 3 We focus on the indexed symbols of P04637, annotated by the DEG motif DEG MDM2 SWIB 1. It has the indexed symbol of 'o' in the 1st, 4th, 5th, 7th, and 8th columns, and 'x' in the 2nd, '-' in the 3rd, and '+' in the 6th column, respectively. For each position of the local alignment, a binomial test is applied to evaluate whether the number of the indexed symbol of P04637 in the column is unexpectedly rare under the frequency of each amino acid over all the substrate protein sequences used in this model, as shown in Supplement Table S2. This amino acid frequency distribution   HUMAN). Among them, only the 5th substrate, P04637 (P53 HUMAN), is annotated with a DEG motif, DEG MDM2 SWIB 1. Column "start" shows the starting position of the motif occurrence. An asterisk on the top of a column indicates that the column was selected as part of a motif. The residues covered by a motif occurrence and the known DEG motifs simultaneously are underlined. (b) The 20 amino acid residues can be classified into the positively charged residues (RHK), the negatively charged residues (DE), the polar residues (NQSTY), and the non-polar residues (ACGILMFPWV), which are denoted as '+', '-', 'x', and 'o', respectively. Each residue is replaced with a corresponding indexed symbol. is almost similar to that of the UniProtKB/Swiss-Prot UniProt release 2020 04 [25]. For example, in the 1st column, the number of observed symbol 'o' is 10 and the frequency of non-polar residues under the null hypothesis is 0.4809. The resulting p-value is 0.0021.
The conservation of these symbols is statistically significant in the 1st and 5th to 8th columns, at 0.05 level of significance.

Discussion
Narlikar et al. discussed how to formulate a positionspecific prior distribution derived from nucleosome occupancy data to identify transcription factor-binding sites, and proposed two types of position-specific prior distributions: simple and discriminative. These are used in their Gibbs sampling-based motif finder, PRI-ORITY [9], as the prior distribution for the latent variable representing start positions of the motif occurrences. The simple prior distribution of a DNA sequence is calculated from the sequence like our position-specific prior distribution. On the other hand, the discriminative one is derived from positive sequences (bound by a profiled transcription factor) as well as negative sequences (not bound by any profiled transcription factors). Inspired by their work, in addition to our position-specific prior probability, p(z i |λ), we have also evaluated an alternative position-specific prior distribution, derived from the positive sequences (E3-specific substrate sequences) and the negative sequences (the other sequences of the E3Net database).
We formulated the distribution as follows: Let X and Y be the E3-specific set of substrate protein sequences and the remaining substrate protein sequences in the E3Net database, respectively. We denote the subsequence of X i (Y i , resp.) of length W , starting at j, i.e., X i,j · · · X i,j+W −1 by X i,j , resp.). Furthermore, average of the ANCHOR1 probabilities in the subsequence is A X . Then, we define a discriminative position-specific prior probability dis- Figure 6 The output of ANCHOR1 for the substrate protein P53 HUMAN (P04637). The x-axis shows the position of a protein sequence. The y-axis represents the probability assigned to a sequence position by ANCHOR1. A DEG MDM2 SWIB 1 site is located in the region from position 19 to 26. The location is indicated by a vertically long rectangle. The entry name of the substrate protein and the DEG motif ID associated with the protein are given above the graph. tribution: Note that in this formulation, the fragmentation model is not used for the sake of simplicity of the formulation.
We used this discriminative prior in DegSampler with the default parameters where W max = 9, R min = 8, g = 1, α = 1, d i,j is the ANCHOR1 probability, and ZOOPS is the motif site distribution model. Unlike the case of PRIORITY, the performance of DegSampler with the discriminative prior is worse than the original one. The precision, recall, and F-measure with their standard errors are 0.026 (0.009), 0.026 (0.009), and 0.026 (0.009), respectively.
The reason of the unsuccessful result could be as follows. As mentioned in the Materials section, the average number of substrates assigned to an E3 is 8.5.
In addition, the length of those protein sequences are typically several hundreds. Under this situation, many extracted amino acid sequences of length 8 or 9 have one or fewer occurrences over the positive and negative sequences. As a result, the discriminative prior does not work well; probably, because an input protein sequence consists of 20 types of letters, which is significantly higher than the 4 types in a DNA sequence. A larger input set of protein sequences are needed to use the discriminative position-specific prior distribution.

Conclusions
We have proposed a Gibbs sampling-based motif finder, DegSampler, specialized for E3 binding sites. In this model, we confirmed that the probability of each residue being in a disorder-to-order region bound by any protein is a useful source to form a positionspecific prior distribution for the degrons. DegSampler finds a good combination of input protein sequence sites of with relatively higher position-specific prior probabilities. Simultaneously, we also found that such probabilities of prediction tools are quite different (We used ANCHOR1, ANCHOR2, fMoRFpred, and DISOPRED3 in this study). We evaluated simple and discriminative position-specific prior formulations. DegSampler outperformed MEME, which uses a different formulation of the position-specific prior distribution. However, there is much room for improvement in identifying degrons because the F-measure of DegSampler is still low.

Declarations
Ethics approval and consent to participate Not applicable.
Consent for publication Not applicable.