PSL-Recommender: Protein Subcellular Localization Prediction using Recommender System

Motivation: Identifying a protein’s subcellular location is of great interest for understanding its function and behavior within the cell. In the last decade, many computational approaches have been proposed as a surrogate for the expensive and inefficient wet-lab methods used for protein subcellular localization. Yet, there is still much room for improving the prediction accuracy of these methods. Results: In this paper we present PSL-Recommender, a method that employs neighborhood-regularized logistic matrix factorization to build a recommender system for protein subcellular localization. By evaluating on four human and animal benchmark datasets, we show that PSL-Recommender significantly outperforms state-of-the-art methods, improving on the previous best method by 13%, 8%, 31%, and 2% in F1-mean, and by 12%, 8%, 28%, and 2% in ACC. Availability: PSL-Recommender is freely available online at https://github.com/RJamali/PSL-Recommender Contact: ch-eslahchi@sbu.ac.ir


Introduction
Proteins are responsible for a wide range of functions within cells. The functionality of a protein is entangled with its subcellular location. Therefore, identifying Protein Subcellular Localization (PSL) is of great importance for both biologists and pharmacists, helping them infer a protein's function and identify drug-target interactions (Nair and Rost [2003]). Recent advances in genomics and proteomics provide a massive amount of protein sequence data, which widens the gap between sequence and annotation data. Although PSLs can be identified by experimental methods, these methods are laborious and time-consuming, which explains why only a narrow range of PSL information in the Swiss-Prot database has been verified in this manner (Zhou et al. [2016]). This problem increases the demand for accurate computational prediction methods. Developments in computational and machine learning techniques have provided fast and effective methods for PSL prediction (Höglund et al. [2006], Horton et al. [2007], Briesemeister et al. [2010], Zhou et al. [2016], Cheng et al. [2017], Mehrabad et al. [2018]). PSL prediction typically relies on sequence-derived features, although using annotation-derived features can lead to better performance. Different types of sequence-derived features have been used for PSL prediction. For example, PSORT (Nakai and Horton [1999]), WoLF PSORT (Horton et al. [2007]) and TargetP (Emanuelsson et al. [2000]) employ sequence sorting signals (Bannai et al. [2002]), while Cell-PLoc (Chou and Shen [2008]) and LOCSVMPSI (Xie et al. [2005]) use position-specific scoring matrices (Sinha [2006]). Additionally, amino/pseudo-amino acid composition information (Nakashima and Nishikawa [1994], Chou and Cai [2003]) is utilized by ngLOC (King and Guda [2007]). There are also some methods that employ combinations of sequence-based features (Höglund et al. [2006], Horton et al. [2007]).
Alongside these, different types of annotation-derived features, such as protein-protein interactions, Gene Ontology (GO) terms, and functional domains and motifs, are used by various methods (Höglund et al. [2006], Chou and Shen [2007], Lee et al. [2008], Huang et al. [2008], Shin et al. [2009], Zhou et al. [2016], Cheng et al. [2017], Mehrabad et al. [2018]). Parallel to the importance of features, selecting a suitable algorithm leads to higher prediction accuracy. Many machine learning and statistical inference methods have been applied to the protein subcellular localization problem, such as support vector machines (Höglund et al. [2006], Zhou et al. [2016]), K-nearest neighbors (Xiao et al. [2011], He et al. [2012]), and Bayesian methods (Briesemeister et al. [2010], Simha and Shatkay [2014]). In this paper, we model the problem of protein subcellular localization as a recommendation task that aims to suggest a list of subcellular locations to a new protein. In general, recommender systems are methods and techniques that suggest to users a list of preferred items (e.g. suggesting a movie to watch or an item to purchase) based on previous knowledge about relations within and between items and users (Francesco et al. [2011]). Our method, "PSL-Recommender", employs a probabilistic recommender system to predict the presence probability of a protein in a subcellular location. PSL-Recommender uses a logistic matrix factorization technique integrated with a neighborhood regularization method to capture the information in a set of previously known protein-subcellular location relations. Then, it utilizes this information to predict the presence probability of a new protein in a subcellular location using a logistic function. Logistic matrix factorization was first introduced by Johnson [2014] for collaborative filtering. This technique has shown promising results for problems such as drug-target interaction prediction (Liu et al.
[2016], Hao et al. [2017]) and lncRNA-protein interaction prediction (Liu et al. [2017]). However, to the best of our knowledge, it has not been used for the PSL prediction problem. By evaluating on different benchmark datasets, we show that PSL-Recommender significantly outperforms the current state-of-the-art methods.

Method
To recommend a subcellular location to a protein, PSL-Recommender employs two matrices: a matrix of currently known protein-subcellular location assignments (PSL interactions) and a similarity matrix between proteins. The protein similarity matrix is the weighted average of similarity measures such as GO-term (Ashburner et al. [2000]) similarity, PSSM (Stormo et al. [1982]) similarity and STRING (Szklarczyk et al. [2014]) similarity. The main idea is to model the localization probability of a protein in a location as a logistic function of two latent matrices. The latent matrices are acquired by matrix factorization of the protein-subcellular location matrix according to the similarity matrices. The construction pipeline of the PSL-Recommender predictor is shown in Fig 2.1. The details of the similarity measures and the recommender system are as follows.

PSSM similarity
The PSSM similarity matrix, $S^{PSSM} = [s^{PSSM}_{i,j}]_{n \times n}$, contains the pairwise global alignment scores of proteins, calculated using position-specific scoring matrices (PSSMs). To compute $s^{PSSM}_{i,j}$ for proteins $i$ and $j$, PSI-BLAST (Altschul et al. [1997]) with e-value 0.001 is first used to search the Swiss-Prot database and obtain each protein's PSSM. Then $i$ and $j$ are globally aligned twice, once using the PSSM of $i$ and once using the PSSM of $j$. Finally, $s^{PSSM}_{i,j}$ is the mean of the two reciprocal alignment scores. The PSSM similarity matrix is normalized using unity-based normalization.
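As an illustration, the reciprocal-score averaging and unity-based normalization described above can be sketched as follows; the raw alignment scores are hypothetical placeholders, since obtaining real values requires running PSI-BLAST and the global alignments.

```python
# Sketch: combine reciprocal PSSM-based alignment scores into a similarity
# matrix and apply unity-based (min-max) normalization.

def unity_normalize(m):
    """Scale all entries of a matrix to [0, 1] (min-max normalization)."""
    flat = [v for row in m for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo) for v in row] for row in m]

# score[i][j]: hypothetical global alignment score of protein j
# against the PSSM of protein i (not symmetric in general)
score = [[52.0, 17.0, 4.0],
         [15.0, 61.0, 9.0],
         [6.0, 11.0, 48.0]]

# s^PSSM_{i,j} is the mean of the two reciprocal alignment scores
n = len(score)
s_pssm = [[(score[i][j] + score[j][i]) / 2.0 for j in range(n)]
          for i in range(n)]
s_pssm = unity_normalize(s_pssm)
```

Averaging the reciprocal scores makes the matrix symmetric before normalization.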

STRING similarity
It has been shown that two interacting proteins have a higher chance of being in the same subcellular location (Scott et al. [2005], Lee et al. [2008], Mehrabad et al. [2018]). Accordingly, we extracted the interaction scores of all pairs of proteins from STRING (Ver. 10.5) to construct the protein interaction scoring matrix. If no interaction was available for a pair of proteins, we set their interaction score to zero. Since the STRING protein-protein interaction scores are in the range of [0, 999], we normalized the scores with unity-based normalization.
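A minimal sketch of this step, with illustrative pair scores standing in for real STRING entries; since the scores lie in [0, 999] with missing pairs set to zero, normalization here amounts to dividing by the maximum score.

```python
# Sketch: build a STRING interaction matrix, zero where no interaction is
# reported, then scale the known [0, 999] range into [0, 1].
# The pair scores below are illustrative, not real STRING entries.
string_score = {("P1", "P2"): 850, ("P2", "P3"): 400}
proteins = ["P1", "P2", "P3"]

n = len(proteins)
ppi = [[0.0] * n for _ in range(n)]      # zero where no interaction is known
for (a, b), s in string_score.items():
    i, j = proteins.index(a), proteins.index(b)
    ppi[i][j] = ppi[j][i] = s / 999.0    # normalize into [0, 1]
```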

Semantic similarity of GO terms
Gene Ontology terms are valuable sources of information for predicting subcellular localization (Sayers et al. [2009], Zhou et al. [2016]). The pairwise semantic similarities of the proteins' GO terms were computed following Mazandu et al. [2015]. Similarities were normalized using unity-based normalization.

PSL-Recommender
Let the sets of proteins and subcellular locations be denoted by $X$ and $Y$, respectively, with $|X| = m$ and $|Y| = n$. Moreover, let $S^p = [s^p_{i,k}]_{m \times m}$ represent the similarity of proteins. The presence of proteins in subcellular locations is denoted by a binary matrix $L = [l_{ij}]_{m \times n}$, where $l_{ij} = 1$ if protein $i$ has been experimentally observed in subcellular location $j$ and $l_{ij} = 0$ otherwise. The localization probability of protein $i$ in subcellular location $j$ can be modeled as a logistic function as follows:

$$p_{ij} = \frac{\exp(u_i v_j^T + \beta^p_i + \beta^l_j)}{1 + \exp(u_i v_j^T + \beta^p_i + \beta^l_j)} \qquad (1)$$

In Eq. (1), $u_i \in \mathbb{R}^{1 \times d}$ and $v_j \in \mathbb{R}^{1 \times d}$ are two latent vectors that reflect the properties of protein $i$ and subcellular location $j$ in a shared latent space of size $d < \min(m, n)$. However, in our case matrix $L$ is biased toward some proteins and subcellular locations: some proteins tend to localize in many locations and some subcellular locations include many proteins. Accordingly, for each protein and subcellular location we introduce a latent term to capture this bias. In Eq. (1), $\beta^p_i$ represents the bias factor for protein $i$ and $\beta^l_j$ represents the bias factor for subcellular location $j$. Now the goal is to acquire the latent factors for a given $L$. Suppose $U \in \mathbb{R}^{m \times d}$, $V \in \mathbb{R}^{n \times d}$, $\beta^p \in \mathbb{R}^{m \times 1}$ and $\beta^l \in \mathbb{R}^{n \times 1}$ denote the latent matrices and bias vectors for proteins and subcellular locations. According to Bayes' theorem and the independence of $U$ and $V$ we have:

$$p(U, V, \beta^p, \beta^l \mid L) \propto p(L \mid U, V, \beta^p, \beta^l)\, p(U)\, p(V) \qquad (2)$$

On the other hand, by assuming that all entries of $L$ are independent, we have:

$$p(L \mid U, V, \beta^p, \beta^l) = \prod_{i=1}^{m} \prod_{j=1}^{n} p_{ij}^{\,c\, l_{ij}} \,(1 - p_{ij})^{1 - l_{ij}} \qquad (3)$$

where $c$ is a weighting factor on positive observations, since we have more confidence in positive observations than negative ones. Also, by placing a zero-mean spherical Gaussian prior on the latent vectors of proteins and subcellular locations we have:

$$p(U \mid \sigma_p^2) = \prod_{i=1}^{m} \mathcal{N}(u_i \mid 0, \sigma_p^2 I), \qquad p(V \mid \sigma_l^2) = \prod_{j=1}^{n} \mathcal{N}(v_j \mid 0, \sigma_l^2 I) \qquad (4)$$

where $\sigma_p^2$ and $\sigma_l^2$ are parameters controlling the variances of the prior distributions and $I$ denotes the identity matrix.

According to the above equations, the log of the posterior is yielded as follows:

$$\log p(U, V, \beta^p, \beta^l \mid L) = \sum_{i=1}^{m} \sum_{j=1}^{n} \Big[ c\, l_{ij} (u_i v_j^T + \beta^p_i + \beta^l_j) - (1 + c\, l_{ij} - l_{ij}) \log\big(1 + \exp(u_i v_j^T + \beta^p_i + \beta^l_j)\big) \Big] - \frac{\lambda_p}{2} \|U\|_F^2 - \frac{\lambda_l}{2} \|V\|_F^2 + C \qquad (5)$$

where $\lambda_p = 1/\sigma_p^2$, $\lambda_l = 1/\sigma_l^2$ and $C$ is a constant term independent of the model parameters. Our goal is to learn $U$, $V$, $\beta^p$ and $\beta^l$ that maximize the log posterior above, which is equal to minimizing the following objective function:

$$\min_{U, V, \beta^p, \beta^l} \sum_{i=1}^{m} \sum_{j=1}^{n} \Big[ (1 + c\, l_{ij} - l_{ij}) \log\big(1 + \exp(u_i v_j^T + \beta^p_i + \beta^l_j)\big) - c\, l_{ij} (u_i v_j^T + \beta^p_i + \beta^l_j) \Big] + \frac{\lambda_p}{2} \|U\|_F^2 + \frac{\lambda_l}{2} \|V\|_F^2 \qquad (6)$$

where $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. By minimizing the above function, $U$, $V$ and the bias vectors can effectively capture the information of protein localizations. However, we can further improve the model by incorporating protein similarities, as suggested by Liu et al. [2016]. This process is known as neighborhood regularization: the latent vectors of proteins are regularized such that the distance between a protein and its similar proteins is minimized in the latent space. Accordingly, let $N_{k_1}(x_i)$ denote the set of the $k_1$ most similar neighbors of protein $x_i$, and define the adjacency matrix $A = [a_{ik}]_{m \times m}$ that represents the proteins' neighborhood information as follows:

$$a_{ik} = \begin{cases} s^p_{i,k} & \text{if } x_k \in N_{k_1}(x_i) \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

To minimize the distance between proteins and their $k_1$ most similar proteins we minimize the following objective function:

$$\frac{\alpha}{2} \sum_{i=1}^{m} \sum_{k=1}^{m} a_{ik} \|u_i - u_k\|^2 = \frac{\alpha}{2} \mathrm{tr}(U^T H^p U) \qquad (8)$$

where $H^p = B^p + \tilde{B}^p - (A + A^T)$ and $\mathrm{tr}(\cdot)$ is the trace of a matrix. In this equation, $B^p$ and $\tilde{B}^p$ are two diagonal matrices whose diagonal entries are $B^p_{ii} = \sum_{k} a_{ik}$ and $\tilde{B}^p_{ii} = \sum_{k} a_{ki}$, respectively.

Finally, by plugging Eq. (8) into Eq. (6) we obtain:

$$\min_{U, V, \beta^p, \beta^l} \sum_{i=1}^{m} \sum_{j=1}^{n} \Big[ (1 + c\, l_{ij} - l_{ij}) \log\big(1 + \exp(u_i v_j^T + \beta^p_i + \beta^l_j)\big) - c\, l_{ij} (u_i v_j^T + \beta^p_i + \beta^l_j) \Big] + \frac{\alpha}{2} \mathrm{tr}(U^T H^p U) + \frac{\lambda_p}{2} \|U\|_F^2 + \frac{\lambda_l}{2} \|V\|_F^2 \qquad (9)$$

A local minimum of the above function can be found by employing the alternating gradient descent method. In each iteration of the gradient descent, first $U$ and $\beta^p$ are fixed to compute $V$ and $\beta^l$, and then $V$ and $\beta^l$ are fixed to compute $U$ and $\beta^p$. To accelerate convergence, we employ the AdaGrad algorithm (Duchi et al. [2011]) to choose the gradient step size in each iteration adaptively. The partial gradients of the latent vectors and biases are given by:

$$\frac{\partial}{\partial u_i} = \sum_{j=1}^{n} \big[(1 + c\, l_{ij} - l_{ij})\, p_{ij} - c\, l_{ij}\big]\, v_j + \alpha \sum_{k=1}^{m} H^p_{ik}\, u_k + \lambda_p u_i, \qquad \frac{\partial}{\partial \beta^p_i} = \sum_{j=1}^{n} \big[(1 + c\, l_{ij} - l_{ij})\, p_{ij} - c\, l_{ij}\big]$$

$$\frac{\partial}{\partial v_j} = \sum_{i=1}^{m} \big[(1 + c\, l_{ij} - l_{ij})\, p_{ij} - c\, l_{ij}\big]\, u_i + \lambda_l v_j, \qquad \frac{\partial}{\partial \beta^l_j} = \sum_{i=1}^{m} \big[(1 + c\, l_{ij} - l_{ij})\, p_{ij} - c\, l_{ij}\big] \qquad (10)$$

Once the latent matrices $U$, $V$, $\beta^p$ and $\beta^l$ are calculated, the presence probability of a training protein $i$ in a subcellular location can be estimated by the logistic function in formula 1. However, for a new protein the latent factors $u_i$ and $\beta^p_i$ are not available. Hence, for a new protein $i$ the presence probability in subcellular location $j$ is estimated as follows:

$$\hat{p}_{ij} = \frac{\exp(\tilde{u}_i v_j^T + \beta^l_j)}{1 + \exp(\tilde{u}_i v_j^T + \beta^l_j)} \qquad (11)$$

where $\tilde{u}_i$ is the similarity-weighted average of the latent vectors of the $k_2$ nearest neighbors of $i$, as follows:

$$\tilde{u}_i = \frac{\sum_{x_k \in N_{k_2}(x_i)} s^p_{i,k}\, u_k}{\sum_{x_k \in N_{k_2}(x_i)} s^p_{i,k}} \qquad (12)$$
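A minimal numpy sketch of the prediction step, with small random matrices standing in for the learned latent factors: the localization probabilities for training proteins come from the logistic function, and a new protein's latent vector is approximated by the similarity-weighted average of its nearest neighbors.

```python
# Sketch of the prediction step; U, V and the biases below are random
# stand-ins for learned factors, and `sim` is a hypothetical similarity row.
import numpy as np

rng = np.random.default_rng(0)
m, n, d = 5, 3, 2                      # proteins, locations, latent dim
U = rng.normal(size=(m, d))            # protein latent vectors
V = rng.normal(size=(n, d))            # location latent vectors
b_p = rng.normal(size=(m, 1))          # protein biases
b_l = rng.normal(size=(1, n))          # location biases

# logistic model: P[i, j] = sigmoid(u_i v_j^T + beta^p_i + beta^l_j)
P = 1.0 / (1.0 + np.exp(-(U @ V.T + b_p + b_l)))

# new protein: similarity-weighted average of its k2 nearest neighbors
sim = np.array([0.9, 0.7, 0.0, 0.0, 0.0])   # similarities to training proteins
nn = np.argsort(sim)[-2:]                   # k2 = 2 nearest neighbors
u_new = sim[nn] @ U[nn] / sim[nn].sum()
p_new = 1.0 / (1.0 + np.exp(-(u_new @ V.T + b_l)))  # no protein bias available
```

The new protein's bias term is unavailable, so only the location bias enters its logistic score, mirroring the formulation above.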
Eventually, a threshold can be applied to these probabilities to assign subcellular locations to proteins.

Datasets and evaluation criteria
Evaluating protein subcellular localization prediction methods is a challenging task. On the one hand, standalone versions of state-of-the-art methods are not available; on the other hand, the protein databases are updated quickly. Hence, to achieve a fair evaluation and comparison we have employed the same datasets and evaluation criteria as used in previous studies (Zhou et al. [2016], Briesemeister et al. [2010]). These datasets are summarized in Table 1. The Hum-mPLoc 3.0, BaCelLo animals, and Höglund datasets each consist of two non-overlapping subsets for training and testing, while for DBMLoc we performed 5-fold cross-validation. The training set of Hum-mPLoc 3.0, HumB, is constructed from the Swiss-Prot database release 2012_01 (January 2012) and consists of 3122 proteins, of which 1023 proteins are labeled with more than one subcellular location and the rest are single-location proteins. Alongside HumB, HumT is used as the testing set to evaluate the method's performance. HumT is constructed from the Swiss-Prot database release 2015_05 (May 2015) and consists of 379 proteins, of which 120 proteins are labeled with more than one subcellular location and the rest are single-location proteins. Each protein in Hum-mPLoc 3.0 is assigned to at least one of 12 subcellular locations. To measure performance we used the ACC and F1-mean criteria introduced by Tsoumakas and Katakis [2007] and used by other state-of-the-art methods for this problem. ACC is the average of $ACC_{x_i}$ over all proteins in the test set, calculated for each protein $x_i$ as follows:

$$ACC_{x_i} = \frac{TP}{TP + FP + FN}$$

where $TP$, $FP$ and $FN$ stand for true positives, false positives and false negatives, respectively. The F1-mean is the average of $F1_{y_j}$ over all subcellular locations, where the F1 of subcellular location $y_j$ is the harmonic mean of $Precision_{y_j}$ and $Recall_{y_j}$, defined as follows:

$$Precision_{y_j} = \frac{|R_j \cap T_j|}{|R_j|}, \qquad Recall_{y_j} = \frac{|R_j \cap T_j|}{|T_j|}, \qquad F1_{y_j} = \frac{2 \cdot Precision_{y_j} \cdot Recall_{y_j}}{Precision_{y_j} + Recall_{y_j}}$$

where $R_j$ and $T_j$ are the sets of predicted proteins and true proteins for location $y_j$, respectively.
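The two evaluation measures can be sketched as follows, assuming predictions and gold annotations are given as per-protein sets of location labels; the toy data is purely illustrative.

```python
# Sketch of the two evaluation measures used in this paper.

def acc(pred, true):
    """Average per-protein overlap: TP / (TP + FP + FN)."""
    total = 0.0
    for p, t in zip(pred, true):
        tp = len(p & t)
        total += tp / (tp + len(p - t) + len(t - p))
    return total / len(pred)

def f1_mean(pred, true, locations):
    """Average over locations of the harmonic mean of precision and recall."""
    f1s = []
    for y in locations:
        r = {i for i, p in enumerate(pred) if y in p}   # predicted for y (R_j)
        t = {i for i, g in enumerate(true) if y in g}   # truly in y (T_j)
        inter = len(r & t)
        prec = inter / len(r) if r else 0.0
        rec = inter / len(t) if t else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

# toy predictions and gold labels for three proteins
pred = [{"Nucleus"}, {"Nucleus", "Cytoplasm"}, {"ER"}]
true = [{"Nucleus"}, {"Cytoplasm"}, {"ER", "Nucleus"}]
```

Note that F1-mean averages over locations while ACC averages over proteins, so the two can disagree when locations are of very different sizes.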

Learning parameters
Minimizing formula 6 in section 2.1 depends on 13 parameters. The parameters were picked empirically for each dataset by maximizing the F1-mean. Due to the large search space, we used a greedy grid search for selecting the parameters. The weights of the similarity measures used to build the protein similarity matrix were picked from 1 to 10 in steps of 1. The dimension of the latent space, $d$, was selected between 1 and the number of subcellular locations in steps of 1. The weighting factor for positive observations, $c$, was chosen between 5 and 80 in steps of 1. The number of nearest neighbors for constructing $N_{k_1}(x_i)$ in equation 7, $k_1$, was selected from 1 to 60 in steps of 1. Similarly, the number of nearest neighbors for constructing $N_{k_2}(x_i)$ in equation 12, $k_2$, was selected from 1 to 60 in steps of 1. The variance-controlling parameters, $\lambda_p$ and $\lambda_l$, were chosen between 0 and 1 in steps of 0.1. The impact factor of nearest neighbors in equation 8, $\alpha$, was picked from 0.1 to 1 in steps of 0.1. Finally, the learning rate of the gradient descent, $\theta$, was selected from 0.1 to 1 in steps of 0.1. All learned parameters for each dataset are available in the code.
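A minimal sketch of such a greedy (coordinate-wise) grid search: each parameter is tuned in turn over its grid while the others are held fixed. Here `evaluate` is a hypothetical placeholder for training the model and measuring F1-mean, and only a subset of the grids above is shown.

```python
# Greedy grid search sketch: one pass per parameter, keeping the best value.
grids = {
    "c": list(range(5, 81)),              # positive-observation weight
    "k1": list(range(1, 61)),             # neighbors for regularization
    "k2": list(range(1, 61)),             # neighbors for new proteins
    "alpha": [i / 10 for i in range(1, 11)],
}

def evaluate(params):
    # placeholder objective; in practice: train the model, return F1-mean
    return -abs(params["c"] - 40) - abs(params["alpha"] - 0.5)

params = {name: grid[0] for name, grid in grids.items()}
for name, grid in grids.items():          # tune one parameter at a time
    best_val, best_score = params[name], evaluate(params)
    for v in grid:
        params[name] = v
        s = evaluate(params)
        if s > best_score:
            best_val, best_score = v, s
    params[name] = best_val
```

Unlike a full grid search, this visits only the sum (not the product) of the grid sizes, at the cost of possibly missing interactions between parameters.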

Results and discussion
PSL-Recommender can be employed to predict protein subcellular localization in different species. Accordingly, we evaluated the performance of PSL-Recommender on different datasets and compared it to other state-of-the-art methods. We further investigated the role of each of the protein similarity measures employed by PSL-Recommender.

Comparison with state-of-the-art methods
We first employed the Hum-mPLoc 3.0 (Zhou et al. [2016]) human protein dataset to compare the performance of PSL-Recommender to five methods that were introduced for protein localization in human: YLoc+ (Briesemeister et al. [2010]), iLoc-Hum (Chou et al. [2012]), WegoLoc (Chi and Nam [2012]), mLASSO-Hum (Wan et al. [2015]) and Hum-mPLoc 3.0. The F1-score for each location and the ACC and F1-mean of all methods on the Hum-mPLoc 3.0 dataset are reported in Table 2. As seen in Table 2, PSL-Recommender significantly outperforms all other methods in both F1-mean and ACC, improving on the best method by 13% in F1-mean and 12% in ACC. Also, in 10 out of 12 subcellular locations PSL-Recommender has the best performance among all methods, while in the other two locations it has the second-best performance. The most significant improvements are observed in Centrosome, ER (Endoplasmic Reticulum) and Plasma Membrane, showing 20%, 20% and 19% improvement, respectively, over the second-best method. It is only in Endosome that PSL-Recommender shows unsatisfactory results (45% F1-score). However, the other methods also fail to provide good results for this location: the best of them (Hum-mPLoc 3.0) only achieves a 52% F1-score. Moreover, for Extracellular, WegoLoc slightly (by 3%) outperforms PSL-Recommender.
To show the performance of PSL-Recommender on other species, we employed previously introduced datasets that include proteins from animals, plants and eukaryotes. We then compared the results to five state-of-the-art methods (Briesemeister et al. [2010], Blum et al. [2009], Pierleoni et al. [2006], Mehrabad et al. [2018], Zhou et al. [2016]). The results are reported in Table 3. As seen in Table 3, PSL-Recommender outperforms all methods on all datasets in both F1-mean and ACC. On the Höglund dataset, PSL-Recommender significantly outperforms the second-best method by 31% and 28% in F1-mean and ACC, respectively. On the BaCelLo dataset, the improvement over the second-best method is 8% in both ACC and F1-mean, while on the DBMLoc dataset, PSL-Recommender slightly improves on the second-best method by 2% in both F1-mean and ACC. It is also worth mentioning that, to the best of our knowledge, PMLPR (Mehrabad et al. [2018]) is the only other recommender-system-based method for the PSL prediction problem, employing the well-known network-based recommendation approach of Zhou et al. [2007]. As seen in Table 3, PSL-Recommender outperforms PMLPR by 52% and 19% in F1-mean, and by 28% and 17% in ACC, on the Höglund and DBMLoc datasets, respectively.

Impact of each similarity matrix
The protein similarity matrix is used for neighborhood regularization and in the prediction step. To acquire this matrix, PSL-Recommender combines three sources of protein similarity (PSSM similarity, STRING interaction similarity and GO-term semantic similarity) using weighted averaging. The weights are acquired through the learning process.
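A sketch of the weighted averaging, with illustrative weights and toy similarity matrices standing in for the learned weights and the real normalized similarity sources:

```python
# Sketch: combine normalized similarity matrices with (illustrative) weights.

def weighted_average(mats, weights):
    """Entry-wise weighted average of a list of equally sized matrices."""
    n = len(mats[0])
    total = sum(weights)
    return [[sum(w * m[i][j] for m, w in zip(mats, weights)) / total
             for j in range(n)] for i in range(n)]

go   = [[1.0, 0.8], [0.8, 1.0]]   # GO-term semantic similarity
pssm = [[1.0, 0.4], [0.4, 1.0]]   # PSSM alignment similarity
ppi  = [[1.0, 0.6], [0.6, 1.0]]   # STRING interaction similarity

# weights [3, 1, 2] are placeholders for those found during training
s_p = weighted_average([go, pssm, ppi], weights=[3, 1, 2])
```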
To investigate the impact of the different similarity measures, we repeated the previous experiments using different combinations of similarity measures. Table 4 shows the result of each combination on all datasets. As can be seen in Table 4, the combinations lacking the GO-term semantic similarities do not provide reliable predictions, showing that GO-term semantic similarities play an important role in protein subcellular localization.
It should be noted that GO terms are not available for all proteins. In the absence of GO-term semantic similarities, PSL-Recommender is still able to provide acceptable results for the DBMLoc and BaCelLo datasets, but its performance drops significantly for the Höglund and Hum-mPLoc 3.0 datasets. Moreover, the use of STRING protein-protein interaction scores is limited to datasets that contain proteins from a single species. Since the DBMLoc, BaCelLo and Höglund datasets contain proteins from multiple species, we were unable to use STRING interaction scores for these datasets.

Conclusion
In the absence of efficient experimental methods, computational tools play an important role in predicting protein subcellular localizations. Yet, there is still much room for improving the prediction accuracy of these methods. In this paper, we introduced PSL-Recommender, a recommender system that employs logistic matrix factorization for efficient prediction of protein subcellular localization. By evaluating on human and animal datasets, we showed that PSL-Recommender significantly outperforms other state-of-the-art methods. However, we believe that the performance of PSL-Recommender can be improved further by employing a better approach for searching the parameter space. The standalone version of PSL-Recommender and all the datasets are available online at: https://github.com/RJamali/PSL-Recommender