Improving Drug Response Prediction Using Dual Similarity Regularization

Selecting the right anti-cancer drug for a particular patient is a central goal of personalized medicine. Many computational models have been proposed to predict drug response, but predictive accuracy remains a challenge. Based on the idea that "similar cells have similar responses to drugs", we extended the basic matrix factorization method by adding similarity-based penalties, so that the distance between the latent factors of two cell lines (or two drugs) is inversely related to their similarity. That is, two similar drugs or cell lines should have latent factors that are close together, whereas two dissimilar drugs or cell lines should have latent factors that are far apart. We propose a dual similarity-regularized matrix factorization (DSRMF) model and generate new drug similarity data from two-dimensional and three-dimensional chemical structures, using data obtained from the CCLE and GDSC databases. With the proposed model and the new drug similarity data, we achieved an average Pearson correlation coefficient (PCC) of about 0.96 and an average root mean square error (RMSE) of about 0.30 between the observed and predicted responses of cell lines to drugs. Our analysis suggests that using heterogeneous data gives better results, and that the proposed model could use other cancer cell line panels to calculate the similarity between cells. In addition, by imposing stronger constraints on the similarity between cells, we achieved more accurate predictions of cell line response to anticancer drugs.


Background
Personalized medicine is growing as a medical and therapeutic strategy, and it has achieved significant results as a new approach to treating patients. It uses the patient's own genomic and molecular information to tailor drug treatment to the individual. Because patients respond differently to the same treatment, researchers are developing personalized solutions for particular diseases so that each patient can receive the most effective drug with the fewest side effects. We focus on cancer, a major cause of death worldwide whose treatment is highly complex, and for which researchers are increasingly turning to precision medicine. Computational methods that combine genomic profiles with cancer cell line data can be used to improve the precision with which the response of cancer to anticancer drugs is predicted. Studies of the sensitivity of cells to anticancer drugs fall into two categories. The first category focuses on identifying biomarkers that play an important role in drug sensitivity. In [1], molecular-level variations were used to predict the sensitivity of cells to anticancer drugs. In [2][3][4][5][6], elastic net regularization and random forests were used to identify genomic biomarkers and to predict the drug sensitivity of cancer cells. In contrast, the second category, which covers a large number of studies, predicts drug sensitivity from gene expression levels. The kernelized Bayesian matrix factorization (KBMF) method was used in [7]. In [8], the weighted graph regularized matrix factorization (WGRMF) algorithm was proposed to predict the sensitivity of cell lines to cancer drugs. This method relies on the similarity between drugs and between cell lines within their nearest neighborhoods, and it uses the GDSC database.
In [9], three different deep learning algorithms were compared with random forests (RF) and k-nearest neighbors (KNN). The results show that the combination of RF GE and KNN Residual works best. This approach was suggested for outlier removal, data size reduction, and settings with limited data. In [10], a private linear regression model was proposed to predict the sensitivity of cell lines to cancer drugs. In [11], a Bayesian algorithm that combines kernelized multitask learning with dimensionality reduction, called kernelized Bayesian multitasking (KBMT), maps the data into a shared subspace; the commonalities captured in this subspace are used to learn and to improve prediction performance. Multitask learning was applied to the CCLE, CTD4 and NCI60 databases in [12] to improve prediction. In [13], a network-based approach called GloNetDRP was used. This method builds a heterogeneous network from drug similarities and cell line similarities, and it predicts the response not only from the immediate neighborhood but also from the similarity with other drugs across the heterogeneous network. In [14], the CCLE and GDSC databases were used, and a model based on drug similarity was proposed in which a new drug is assigned the sensitivity profile of drugs that are structurally similar to it. In [15], a network-based classifier (NBC) was used to measure the sensitivity of different types of tumors to different types of drugs; that study also used a list of apoptotic genes and clinical dose-related predictions. The CCLE and GDSC databases were used in [16], where a domain adaptation method called PRECISE was used to capture the information shared between human tumors and preclinical models and to use it for prediction.
In [17], the support vector machine (SVM) method was combined with recursive feature selection on the CCLE database. Dong divided the cell lines into sensitive and resistant subsets according to the drug response, and the drug responses were then used to select features. The SVM model was also used in [18], where a consensus p-median clustering method was used to infer the drug response of cell lines for a tumor. In [19], a set of heterogeneous genes that are important for drug response was selected, and a Bayesian network model together with genomic profiles was then used to predict the response of cell lines to drugs. That study used the NCI60 database.
Similarity-regularized matrix factorization (SRMF) has been used to predict the response of cell lines to anticancer drugs. That method uses the structural similarity of drugs and the gene expression levels of cell lines. It has several limitations: the data used were not validated, not all properties of the two-dimensional chemical structure of the drugs were used to build the drug similarity matrix, and the model suffered from overfitting and limited prediction accuracy. In addition to the CCLE database, we used GDSC and the NCBI PubChem repository. In the related work, the available drugs were in SDF format, and the number of anticancer drugs with a specified chemical structure was limited. In this work, new data were generated and the two-dimensional chemical structure was used to produce the similarity data between drugs. Dual similarity-regularized matrix factorization (DSRMF) is then used to improve prediction accuracy. We used the Pearson correlation coefficient (PCC) and the root mean square error (RMSE) as evaluation metrics. We also inferred the response values that are missing in the GDSC data. We found that the proposed method has lower RMSE and higher PCC than previous methods.

Methods
In this research, we used the Genomics of Drug Sensitivity in Cancer (GDSC) project, release 5.0 of which contains 790 cell lines and 135 drugs. During initial preprocessing, we found that some drugs were not chemically specified, and that their PubChem CIDs and SDF files were not available in the Cancer Cell Line Encyclopedia (CCLE). The number of drugs was therefore reduced to 97 for subsequent calculations, and the number of cell lines was reduced to 604 by removing duplicates. Drug response is measured in vitro as an IC50 value: the lower the IC50 value, the greater the sensitivity of the cell line to the drug. The cell lines are described by genomic features, and the drugs are encoded by their chemical structure in SDF format, available in the NCBI PubChem repository. We used the PaDEL software [20] to compute the PubChem fingerprint descriptors of the drugs, and we incorporated both two-dimensional and three-dimensional chemical structures into the computation. Finally, using the new SDF representation of the drugs, a drug similarity matrix was obtained. The primary drug response matrix contains 58,588 entries (97 drugs by 604 cell lines), some of which are missing. To obtain the cell line similarity matrix from the gene expression profiles, the Pearson correlation coefficient was computed between each pair of cell lines. To obtain the drug similarity matrix from the fingerprints and chemical structures, the Jaccard coefficient was computed between each pair of drugs.
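The two similarity computations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the function names `pearson_similarity` and `jaccard_similarity`, the array layouts (one row per cell line or drug), and the convention that two all-zero fingerprints have similarity 1 are all hypothetical choices.

```python
import numpy as np

def pearson_similarity(expr):
    """Pearson correlation between every pair of cell lines.

    expr: array of shape (n_cell_lines, n_genes), one expression profile per row.
    """
    return np.corrcoef(expr)

def jaccard_similarity(fingerprints):
    """Jaccard coefficient between every pair of binary drug fingerprints.

    fingerprints: array of shape (n_drugs, n_bits) with 0/1 entries.
    """
    fp = np.asarray(fingerprints, dtype=bool).astype(int)
    inter = fp @ fp.T                                # |A and B| for each pair
    counts = fp.sum(axis=1)                          # |A| for each drug
    union = counts[:, None] + counts[None, :] - inter
    # Assumption: two empty fingerprints are treated as identical (similarity 1).
    return np.where(union > 0, inter / np.maximum(union, 1), 1.0)
```

Both functions return symmetric matrices with ones on the diagonal, which is the form a similarity-regularized factorization expects as input.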

Problem formulation
In this paper, we use the matrix factorization algorithm to predict the response to anticancer drugs. A similar framework has been adopted to predict drug targets [21]. First, we map the $m$ drugs and $n$ cell lines into a shared latent space of low dimensionality $K$, in which the properties of each drug and each cell line are represented by the latent coordinates $u_i$ and $v_j$, respectively. We denote the drug response matrix by $Y$. The purpose is to approximate the response value $y_{ij}$ of drug $d_i$ to cell line $c_j$ by determining their corresponding latent coordinates. Our objective function is
$$\min_{U,V} \; \| W \circ (Y - U V^{T}) \|_F^2 + \lambda_l \left( \|U\|_F^2 + \|V\|_F^2 \right)$$
where $U$ and $V$ are the matrices whose rows are the latent coordinates $u_i$ and $v_j$, $W$ is the weight matrix with $W_{ij} = 1$ if $y_{ij}$ is a known drug response and $W_{ij} = 0$ otherwise, $\circ$ is the element-wise product, and $\|\cdot\|_F$ is the Frobenius norm. To avoid overfitting $U$ and $V$ to the training data, L2 (Tikhonov) regularization is added to the latent variables $U$ and $V$ as a penalty.
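A weighted matrix factorization of this kind can be sketched in a few lines. The sketch below is illustrative only: it assumes a response matrix with drugs as rows and cell lines as columns, a binary weight matrix, and plain gradient descent, since the optimizer is not specified here. The names `objective`, `weighted_mf`, `lam`, and `lr` are hypothetical.

```python
import numpy as np

def objective(Y, W, U, V, lam):
    """Squared error on observed entries plus an L2 penalty on U and V."""
    Y0 = np.where(W > 0, Y, 0.0)           # missing entries (possibly NaN) are ignored
    resid = W * (Y0 - U @ V.T)
    return float(np.sum(resid ** 2) + lam * (np.sum(U ** 2) + np.sum(V ** 2)))

def weighted_mf(Y, W, k=2, lam=0.01, lr=0.01, n_iter=500, seed=0):
    """Fit Y ~ U V^T on observed entries by gradient descent (assumed optimizer)."""
    rng = np.random.default_rng(seed)
    m, n = Y.shape
    U = 0.1 * rng.standard_normal((m, k))  # latent coordinates of the m drugs
    V = 0.1 * rng.standard_normal((n, k))  # latent coordinates of the n cell lines
    Y0 = np.where(W > 0, Y, 0.0)
    for _ in range(n_iter):
        R = W * (U @ V.T - Y0)             # residual on observed entries only
        grad_U = 2.0 * (R @ V + lam * U)
        grad_V = 2.0 * (R.T @ U + lam * V)
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V
```

After fitting, `U @ V.T` gives a predicted response for every drug and cell line pair, including those whose weight was zero.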
According to the work in [16], similar drugs and similar cell lines have similar drug responses. The work of Lin Wang et al. enhanced the prediction accuracy of drug response by also minimizing the terms $\| S_d - U U^{T} \|_F^2$ and $\| S_c - V V^{T} \|_F^2$, where $S_d$ and $S_c$ are the drug and cell line similarity matrices and $U$ and $V$ are the latent factor matrices of drugs and cell lines. The major contribution of this research is a stronger similarity regularization constraint: the distance between the latent factors of two cell lines (or two drugs) should be inversely related to their similarity. This means that two similar drugs or cell lines should have latent factors a small distance apart, whereas two dissimilar drugs or cell lines should have latent factors a large distance apart. By applying this idea to the objective function, we were able to improve the prediction of drug response compared with previous methods.
We used the following formula to obtain the similarity regularization for cell lines: [...] and, analogously, to obtain the similarity regularization for drugs: [...] Here a parameter controls the radius of the Gaussian function; in [19], [...] were used. The final drug response predictions of DSRMF were compared with those of previous works, and the results showed an improvement in prediction accuracy. We implemented the method in 64-bit Python 3.7.
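To illustrate the idea that latent distance should shrink as similarity grows, the sketch below shows one possible penalty of this type. It is not necessarily the DSRMF regularizer: the Gaussian kernel of the latent distance, the squared mismatch with the similarity matrix, and the names `pairwise_sq_dists`, `gaussian_similarity_penalty`, and `gamma` (the kernel radius) are all assumptions made for illustration.

```python
import numpy as np

def pairwise_sq_dists(F):
    """Squared Euclidean distance between every pair of rows of F."""
    sq = np.sum(F ** 2, axis=1)
    D = sq[:, None] + sq[None, :] - 2.0 * (F @ F.T)
    return np.maximum(D, 0.0)              # clip tiny negatives from round-off

def gaussian_similarity_penalty(F, S, gamma=1.0):
    """Mismatch between similarities S and a Gaussian kernel of latent distances.

    F: (n_items, K) latent factors; S: (n_items, n_items) similarities in [0, 1].
    The kernel is near 1 for close latent factors and near 0 for distant ones,
    so similar pairs are pulled together and dissimilar pairs are pushed apart.
    """
    kernel = np.exp(-pairwise_sq_dists(F) / gamma ** 2)
    return float(np.sum((S - kernel) ** 2))
```

Under this assumed form, the same function can be applied once to the drug factors with the drug similarity matrix and once to the cell line factors with the cell line similarity matrix, which gives a dual penalty.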

Measurements of prediction performance
To measure the performance of the method, the root mean square error (RMSE) and the Pearson correlation coefficient (PCC) are computed for each drug, as in [19]. The RMSE is computed as follows:
$$\mathrm{RMSE}(D) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( R(D, C_i) - \hat{R}(D, C_i) \right)^2}$$
where $n$ is the number of cell lines with a response to drug $D$, and $R(D, C_i)$ and $\hat{R}(D, C_i)$ are the observed and predicted response values of cell line $C_i$ to drug $D$.
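The per-drug metrics can be computed as in the following sketch. It assumes that missing observed responses are stored as NaN and are skipped; the function names `rmse` and `pcc` are hypothetical.

```python
import numpy as np

def rmse(observed, predicted):
    """Root mean square error over the cell lines with an observed response."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = ~np.isnan(observed)             # assumption: NaN marks a missing response
    diff = observed[mask] - predicted[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def pcc(observed, predicted):
    """Pearson correlation over the cell lines with an observed response."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    mask = ~np.isnan(observed)
    return float(np.corrcoef(observed[mask], predicted[mask])[0, 1])
```

Applying both functions to each row of the response matrix (one row per drug) and averaging over drugs gives averaged metrics of the kind reported in this section.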
The hyperparameter settings of the matrix factorization algorithm used in this paper are as follows. First, the response matrix of the cell lines to the anticancer drugs was collected from the CCLE and GDSC databases. The values in the matrix were divided by the maximum absolute value, which maps them into the range [-1, 1]. The three regularization parameters were then each tested over the values $2^{-6}, \ldots, 2^{2}$. Note that [19] achieved its best result with the drug-similarity regularization parameter set to 0, which means that the chemical structure of the drugs was effectively ignored. In our work, the DSRMF method was more accurate with this parameter set to 0.0000001, so that the chemical structure contributes to the loss function. Another difference is that we produced new drug similarity data for cell line drug response prediction by incorporating two-dimensional and three-dimensional chemical structures, in addition to the drug similarity used in previous work. The results of applying the SRMF model to the newly generated data are also reported in this study. The latent dimensionality K for the GDSC database was set to 44, and the number of training iterations was set to 20. We compared the performance of the SRMF model with that of the proposed DSRMF model, using the average PCC and average RMSE over 100 replicates.
With the newly generated data from the CCLE and GDSC databases, the average PCC_S/R (Pearson correlation coefficient) between the observed and predicted sensitivity of cell lines to anticancer drugs was approximately 0.96 for the DSRMF model. Compared with the SRMF model in [19], whose best value was 0.71 on its own data with the drug-similarity parameter set to 0, this is an improvement of about 0.25. The SRMF model applied to our data reached about 0.95.
Figure 1. Comparison of the mean PCC_S/R over a sample of drugs between the proposed method (DSRMF) and the previous method.
The average RMSE_S/R (root mean square error) between the observed and predicted responses of cell lines to drugs was 0.30 for the DSRMF model and about 0.31 for the SRMF model on the data used in this research. Compared with the previous work of [16], which had an average RMSE_S/R of 0.78 on its own data, the proposed method was about 0.47 better, and it also improved on the earlier RF [22] and KBMF [23] models. In Figure 2, we compare the proposed method and the previous method for randomly selected anticancer drugs. In Figure 3, we compare the previous method and the proposed method as the number of algorithm iterations increases. Table 1 shows the prediction details and the comparison of results for the different methods.
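The scaling step and the parameter grid described in this section can be sketched as follows. This is an illustration under assumptions: missing responses are taken to be NaN, the grid is taken to contain every integer power of two from $2^{-6}$ to $2^{2}$, and the names `scale_by_max_abs`, `grid`, and `param_grid` are hypothetical.

```python
import itertools
import numpy as np

def scale_by_max_abs(Y):
    """Divide by the largest absolute observed value so entries lie in [-1, 1]."""
    return Y / np.nanmax(np.abs(Y))        # NaN entries stay NaN

# Candidate values 2^-6, 2^-5, ..., 2^2 for each regularization parameter.
grid = [2.0 ** p for p in range(-6, 3)]

# Every combination of the three regularization parameters.
param_grid = list(itertools.product(grid, repeat=3))
```

Each triple in `param_grid` would then be evaluated, for example by cross-validated RMSE, and the best-performing combination kept.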

Discussion
In this study, we used the CCLE and GDSC databases to compute the similarity between cell lines and between drugs. To obtain the similarity between cells, we used gene expression profiles. Other cell line panel data, such as DNA methylation, reverse-phase protein array, and microRNA expression, could also be used to compute this similarity. Likewise, using further genomic properties of the cell lines, such as copy number variation, pathways, and somatic mutations, the proposed DSRMF method could achieve better results in predicting the response to anticancer drugs. It could also be applied to other prediction and modeling problems.

Conclusions
In this research, we developed a dual similarity-regularized matrix factorization (DSRMF) model to predict the response to anticancer drugs, measured by IC50 values as an indicator of cell line sensitivity or resistance. We used the CCLE and GDSC databases and produced new drug similarity data that incorporate the two-dimensional and three-dimensional chemical structures of drugs, along with other properties used in previous articles, to improve performance.

-Ethics approval and consent to participate
This article does not contain any studies with human participants or animals performed by any of the authors.

-Consent to publish
The authors agree to publication.

-Availability of data and materials
The cell line data used in this article were obtained from https://portals.broadinstitute.org/ccle. The drug data used in this article were obtained from https://www.cancerrxgene.org/downloads/anova and https://pubchem.ncbi.nlm.nih.gov/. The data can also be sent by email on request.

-Competing interests
All authors declare that they have no conflict of interest.

-Funding
No funding was used.

-Authors' Contributions
Ali Reza Ebadi is the first author, Ali Soleimani is the corresponding author, and Abdulbaghi Ghaderzadeh is the third author.