CSDVP: Compressed Sensing for Drug-Virus Prediction

The coronavirus disease 2019 (COVID-19) epidemic has hit most countries hard, and many researchers around the world are looking for ways to control the virus. Because designing and discovering a new drug is very time-consuming, drug repositioning, that is, examining existing medications for use against this disease, can be an effective alternative. Although no drug has been definitively approved for the treatment of this disease, a few drugs have been observed to be effective. In this study, the associations between drugs and viruses are predicted with computational matrix factorization methods. We combine the similarities between drugs and the similarities between viruses and apply the compressed sensing technique to predict drug-virus associations. We compared the proposed Compressed Sensing approach to Drug-Virus Prediction (CSDVP) with other methods in this field and found it to be more accurate. On the HDVD dataset, evaluated with 5-fold cross-validation (CV), CSDVP reaches AUC = 0.96 and AUPR = 0.85. We also investigated the effect of drug features on model performance using an autoencoder: we reduced the drug feature vectors to several different sizes and examined the performance of CSDVP in predicting the drug-virus relationship at each size. The relationship between drugs and coronavirus infection is also analyzed, and the results are presented.


Introduction
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused widespread disruption in most economic and social fields, and its rapid spread has affected many countries [8]. From the first case in Wuhan, China, in December 2019 until today, despite the vaccination of many people, there are still deaths from coronavirus disease 2019 (COVID-19) [19]. SARS-CoV-2 is different from SARS-CoV and MERS-CoV. It is also the most pathogenic human coronavirus detected so far [18].
Meanwhile, much research has focused on finding a solution to treat people with COVID-19.
Various laboratory and computational studies are underway in multiple fields, and to date, several vaccines have been approved to control the virus. On October 22, 2020, Remdesivir was approved by the US Food and Drug Administration (FDA) as the first official treatment for COVID-19 [26].
The purpose of drug repositioning is to find new therapeutic uses for existing drugs. With the spread of the coronavirus, this approach has become considerably more important as a way to find effective drugs against this new and dangerous virus. For example, a 2020 study by Lim et al. found that ribavirin, previously used to treat infectious diseases such as hepatitis, would also be effective in treating COVID-19 [15,26]. Compared with traditional drug discovery, drug repositioning reduces the time and cost of drug development and the potential risks associated with drug toxicity.
In recent years, many studies have used drug repositioning to find effective drugs for the treatment of COVID-19. In 2020, Peng et al. clinically reviewed about 20 drugs and identified which of them could effectively treat COVID-19 [22]. Che et al. predicted useful drugs for the treatment of COVID-19 by embedding a knowledge graph [5]. Their knowledge graph linked drugs, genes, diseases, side effects, and pathways, and they used a graph convolutional network with attention to identify potential relationships between drugs and diseases. Another model was proposed in 2021 by Meng et al., who predicted the drug-virus relationship with a matrix factorization model [18]. Their method uses the chemical structures of drugs and the genomic sequences of viruses to calculate the similarities between drugs and between viruses, respectively, and then predicts the relationship of each drug-virus pair by matrix factorization. Tang et al. also identified drug-virus relationships in 2021 using matrix factorization [26]. They predicted the drug-virus relationship from structure-based similarity matrices of drugs and sequence-based similarity matrices of viruses.
In this study, we used a compressed sensing (CS) technique [7], which is based on reducing the dimension of the matrices. Using this method, which has been applied to various bioinformatics problems [14,21,23], we predicted the relationship between drugs and viruses.
Because studies based on compressed sensing have been successful, we applied it to predict unknown drug-virus relationships. This method, which has been very effective in recovering signals, also appears to be effective in finding drug-virus associations that are not yet known [23]. In this paper, we used the human drug virus database (HDVD), following [26].
In the drug-virus prediction problem, the 'signal' is the set of drug-virus associations, some of which are known and some unknown, and the 'samples' are the drug-virus pairs whose relationship is known. Compressed sensing seems well suited to predicting drug-virus associations, as it is to the drug-ADR prediction problem [23], because in this problem the data are noisy and the number of positive examples (known drug-virus associations) is low. Based on this framework, we used the CS method to find the relationship between drugs and viruses. First, to calculate the similarity between each pair of drugs, we extracted drug features from several databases: structural properties (fingerprint), phenotype, genes, side effects, and indications, taken from the PubChem [10], CTD [17], and SIDER [12] databases. Then, with different measures, we calculated the similarity between the drugs based on the extracted features. For viruses, in addition to their sequence information, we used the pre-trained model BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) [13], which encodes the name of each virus as a vector of size 768. Similarities between viruses are obtained from these two features. We then recover unknown drug-virus relationships from the known drug-virus relationships, the drug similarities, and the virus similarities. Finally, we identify the drugs that may be beneficial against the coronavirus.
In addition, we explored another approach that examines the importance of the drug features. We concatenated the features of each drug and reduced their dimension with an autoencoder [28]. In other words, we first predicted the drug-virus relationship based on all the drug features extracted in this study, and then predicted it again after reducing the dimension of the drug feature vectors.

Methods
Problem Description. The drug-virus association prediction problem can be modeled as a bipartite network with $n$ drugs and $m$ viruses, whose adjacency matrix $Y$ has dimension $n \times m$. We denote the set of drugs by $D = \{d_1, d_2, \dots, d_n\}$ and the set of viruses by $V = \{v_1, v_2, \dots, v_m\}$. In the adjacency matrix, $Y_{ij} = 1$ means that drug $d_i$ is associated with virus $v_j$; otherwise, $Y_{ij} = 0$. Our goal is to find unknown drug-virus relationships using the compressed sensing technique. Figure 1 provides an overview of the proposed model (CSDVP) and the problem-solving process.
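As a small illustration of the problem setup, the sketch below builds the 0/1 adjacency matrix of the bipartite network from a list of known drug-virus pairs. The function name, the variable names, and the toy drug and virus names are hypothetical and are not taken from HDVD.

```python
def build_association_matrix(drugs, viruses, known_pairs):
    """Return an n x m 0/1 matrix with one row per drug and one column per virus."""
    drug_index = {name: i for i, name in enumerate(drugs)}
    virus_index = {name: j for j, name in enumerate(viruses)}
    matrix = [[0] * len(viruses) for _ in drugs]
    for drug, virus in known_pairs:
        # A known association sets the corresponding entry to 1.
        matrix[drug_index[drug]][virus_index[virus]] = 1
    return matrix


# Hypothetical toy data (placeholders, not real HDVD entries).
drugs = ["drug_a", "drug_b", "drug_c"]
viruses = ["virus_x", "virus_y"]
known_pairs = [("drug_a", "virus_x"), ("drug_c", "virus_y")]

Y = build_association_matrix(drugs, viruses, known_pairs)
```

Entries left at 0 are not confirmed negatives; they are the unknown pairs whose association the model is asked to predict.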
As shown in Fig. 1, the compressed sensing drug-virus prediction (CSDVP) pipeline is divided into three parts. In part (I), the model inputs are built: the drug-virus adjacency matrix, the drug feature matrix, and the virus feature matrix. In part (II), we calculate the similarity between drugs and the similarity between viruses using different computational measures. Then, using Kernel Target Alignment-based Multiple Kernel Learning (KTA-MKL) [9], we combine the drug similarity matrices into one, and we combine the virus similarity matrices in the same way. Finally, in part (III), we predict the drug-virus relationship from these input matrices with the compressed sensing technique.
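The kernel-combination step of part (II) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common KTA heuristic in which each similarity matrix is weighted by its normalized alignment (matrix cosine similarity) with a target kernel, and it assumes the target kernel for drugs is the product of the association matrix with its transpose. All function names and toy values are hypothetical.

```python
import math


def frobenius_inner(a, b):
    """Frobenius inner product: sum of element-wise products of two matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def alignment(a, b):
    """Cosine similarity between two matrices."""
    return frobenius_inner(a, b) / math.sqrt(frobenius_inner(a, a) * frobenius_inner(b, b))


def ideal_kernel(y):
    """Assumed target kernel Y Y^T built from the association matrix."""
    return [[sum(p * q for p, q in zip(row_i, row_j)) for row_j in y] for row_i in y]


def kta_weights(kernels, target):
    """Assumed heuristic: weight each kernel by its alignment, normalized to sum to 1."""
    scores = [alignment(k, target) for k in kernels]
    total = sum(scores)
    return [s / total for s in scores]


def combine_kernels(kernels, weights):
    """Weighted sum of the similarity matrices."""
    n = len(kernels[0])
    return [[sum(w * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
            for i in range(n)]


# Hypothetical toy example: 3 drugs, 2 viruses, two drug similarity matrices.
Y = [[1, 0], [1, 0], [0, 1]]
K1 = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]  # agrees with Y
K2 = [[1.0, 0.1, 0.9], [0.1, 1.0, 0.2], [0.9, 0.2, 1.0]]  # disagrees with Y

target = ideal_kernel(Y)
weights = kta_weights([K1, K2], target)
combined = combine_kernels([K1, K2], weights)
```

In this toy case the first matrix groups the two drugs that share a virus, so it receives the larger weight. The same routine would apply to the virus similarity matrices with a target built from the transposed association matrix.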

Human drug virus database (HDVD).
In this study, we used the dataset used in [18]. The details of this dataset are described in Table 1. We obtained the similarities between drugs following [9,23]. For each drug, we create fingerprint, genotype, phenotype, side effect, and indication profiles, defined as follows:
• Fingerprint: We encode each drug as an 881-dimensional binary row vector using the chemistry development kit (CDK) service [9,27]. In this binary vector, each bit represents the presence of a predefined chemical substructure. If the substructure is present, the bit is set to 1, and otherwise to 0.
• Genotype: We consider the set of all genes that are changed by drug use in their function or regulation to be drug-dependent genes. This set of genes can be extracted from the CTD database [17].
• Phenotype: A phenotype is a non-disease biological event. For example, cell cycle reduction is a phenotype. Drug use gives rise to cellular, molecular, and physiological phenotypes. All chemical-phenotype interactions are available in the CTD database [17].
• Side effect: A side effect is an effect of a drug that is separate from its main therapeutic effect. Side effects are available from the SIDER database [12].
• Indication: The set of disorders for which a drug is prescribed or used for treatment is called the "indication" of that drug. The indications of a drug can be extracted from the SIDER database [12].
For viruses, in addition to the similarities obtained from their genome sequences [18], we also obtained a feature vector for each virus using the BioBERT model. BioBERT is a model pre-trained on a variety of biomedical texts (such as PubMed publications) that can give a good representation of the viruses in the dataset. The input to BioBERT is the name of each virus, and its output is a vector of size 768.
In the following, we review the measures we used to calculate similarity in this study [9,23]:
• Gaussian Interaction Profile (GIP): The Gaussian similarity measure, based on the exponential function, is defined as follows:
$$K_{GIP}(x, y) = \exp\left(-\gamma \|x - y\|^2\right)$$
In the Gaussian kernel, $\gamma$ is the bandwidth-controlling parameter, and $x$ and $y$ are the input vectors (such as drug feature vectors) whose similarity is calculated [9].
• Correlation: We also used the correlation measure to calculate similarity, which is defined as follows:
$$K_{Corr}(x, y) = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}$$
In this relation, $\mathrm{Cov}$ means covariance and $\mathrm{Var}$ means variance [9].
• Tanimoto: Another similarity measure is based on the Tanimoto coefficient, which is expressed as follows:
$$K_{Tan}(x, y) = \frac{|x \wedge y|}{|x \vee y|}$$
The notation $|x \wedge y|$ is the number of components in which the feature vectors $x$ and $y$ both have the value 1, and $|x \vee y|$ is the number of components in which at least one of $x$ and $y$ is equal to 1 [11,23].
• Mutual Information (MI): We also used mutual information to calculate similarity. It is defined as follows:
$$K_{MI}(x, y) = \sum_{u \in x} \sum_{v \in y} f(u, v) \log \frac{f(u, v)}{f(u)\, f(v)}$$
In this relation, $f(u)$ ($f(v)$) refers to the frequency of the value $u$ ($v$) in the vector $x$ ($y$), and $f(u, v)$ is the frequency with which $u$ and $v$ occur at the same position.
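The four measures above can be sketched in plain Python as follows. This is an illustrative sketch using the standard textbook definitions; the function names, the default bandwidth value, and the toy vectors are assumptions and are not taken from the study.

```python
import math
from collections import Counter


def gaussian_similarity(x, y, gamma=1.0):
    """Gaussian kernel exp(-gamma * ||x - y||^2); gamma=1.0 is an assumed default."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def correlation_similarity(x, y):
    """Pearson correlation Cov(x, y) / sqrt(Var(x) Var(y)); undefined for constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def tanimoto_similarity(x, y):
    """Tanimoto coefficient for binary vectors: |x AND y| / |x OR y|."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return both / either if either else 0.0


def mutual_information(x, y):
    """Mutual information of two discrete vectors from their empirical frequencies."""
    n = len(x)
    joint = Counter(zip(x, y))
    freq_x, freq_y = Counter(x), Counter(y)
    return sum((c / n) * math.log((c / n) / ((freq_x[u] / n) * (freq_y[v] / n)))
               for (u, v), c in joint.items())


# Hypothetical binary feature vectors for two drugs.
a = [1, 1, 0, 0, 1]
b = [1, 0, 0, 1, 1]

tanimoto_ab = tanimoto_similarity(a, b)   # 2 shared bits out of 4 set bits
gaussian_ab = gaussian_similarity(a, b)   # the vectors differ in 2 positions
```

Applying one of these functions to every pair of drugs (or viruses) yields one similarity matrix per feature and measure, which is the input to the combination step.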
After computing the similarity matrices based on the different features and measures, we integrate them with the kernel target alignment (KTA) method [9], which assigns each similarity matrix a weight based on its cosine similarity to the target matrix. The cosine similarity between two matrices $K$ and $K'$ is defined as
$$A(K, K') = \frac{\langle K, K' \rangle_F}{\|K\|_F \, \|K'\|_F},$$
where $\langle K, K' \rangle_F = \mathrm{tr}(K^T K')$ and $\|K\|_F = \sqrt{\langle K, K \rangle_F}$.
Finally, by minimizing the loss function (3), we obtain the latent factors of the drug space ($F$) and of the virus space ($G$), and from their combination we obtain the probability of each drug-virus association (see Figure 1, part (III)). In the loss function, which is defined on the drug-virus association matrix, $F^T$ is the transpose of $F$, $\|\cdot\|_F$ denotes the Frobenius norm, $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, and the 'degree matrix' of the drug similarity matrix SD is used [23]. After finding the matrices $F$ and $G$, the predicted value for each drug-virus relationship is calculated as
$$\frac{\exp(F G^T)}{1 + \exp(F G^T)},$$
where the exponential is applied element-wise.
Figure 2: Overview of the approach using an autoencoder. I. In this part, we form a vector of length 19821 for each drug by combining all of its features. For each virus, in addition to its genomic sequence, we extract a feature vector from the virus name using the BioBERT model. II. We reduce the dimension of the drug features with the autoencoder (the features obtained from the hidden layer of the autoencoder are marked in the figure) and calculate the similarity of drugs from the reduced features with the stated computational measures (MI, GIP, Cosine, Correlation). For viruses, in addition to genomic sequences, we calculate the similarity between each pair of viruses from the BioBERT-derived vectors with the various computational measures. Finally, by minimizing the loss function of the CS technique, we find the latent factors in the space of drugs (F) and viruses (G) and use them to calculate the drug-virus relationship.
According to Figure 2, in this approach we concatenated all the features of each drug (part (I)), which gives a feature vector of dimension 19821 per drug. Then, using an autoencoder, we reduced the feature dimension from 19821 to several smaller sizes, and for each size we calculated the similarity of the drugs from the reduced features (part (II)) and evaluated the proposed approach (CSDVP) in predicting the relationship between drugs and viruses. Table 2 shows the size of each of the drug and virus features we used in this study.
In addition to comparing CSDVP with other approaches, we also examined the importance of the drug features by reducing their dimension with an autoencoder.
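The dimension-reduction step can be illustrated with a very small autoencoder. The sketch below is not the authors' MATLAB code: it assumes one sigmoid hidden layer (as stated in the parameter settings), a linear output layer, plain stochastic gradient descent on the squared reconstruction error, and hypothetical values for the learning rate, number of epochs, and toy data (8 features reduced to 3 instead of 19821 reduced to a chosen size).

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_autoencoder(data, hidden_size, epochs=200, lr=0.05, seed=0):
    """Train a one-hidden-layer autoencoder; return the encoder and the MSE per epoch."""
    rng = random.Random(seed)
    n_in = len(data[0])
    w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden_size)]
    b_enc = [0.0] * hidden_size
    w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)] for _ in range(n_in)]
    b_dec = [0.0] * n_in

    def encode(x):
        # Sigmoid hidden layer: this is the reduced feature vector.
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
                for row, b in zip(w_enc, b_enc)]

    def decode(h):
        # Linear output layer (an assumption of this sketch).
        return [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(w_dec, b_dec)]

    def mean_squared_error():
        total = 0.0
        for x in data:
            x_hat = decode(encode(x))
            total += sum((p - q) ** 2 for p, q in zip(x_hat, x)) / n_in
        return total / len(data)

    history = [mean_squared_error()]
    for _ in range(epochs):
        for x in data:
            h = encode(x)
            x_hat = decode(h)
            # Gradient of 0.5 * squared error with respect to the output.
            err = [p - q for p, q in zip(x_hat, x)]
            # Back-propagate through the decoder and the sigmoid before updating weights.
            grad_h = [sum(err[i] * w_dec[i][j] for i in range(n_in)) * h[j] * (1.0 - h[j])
                      for j in range(hidden_size)]
            for i in range(n_in):
                for j in range(hidden_size):
                    w_dec[i][j] -= lr * err[i] * h[j]
                b_dec[i] -= lr * err[i]
            for j in range(hidden_size):
                for k in range(n_in):
                    w_enc[j][k] -= lr * grad_h[j] * x[k]
                b_enc[j] -= lr * grad_h[j]
        history.append(mean_squared_error())
    return encode, history


# Hypothetical toy data: 6 "drugs" with 8 concatenated binary features each.
data = [
    [1, 0, 1, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 1],
]

encode, history = train_autoencoder(data, hidden_size=3)
reduced = [encode(x) for x in data]
```

The reduced vectors would then replace the raw drug features when computing the drug similarity matrices, and the final reconstruction error plays the role of the MSE reported for each reduced size.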

All code and tests were run in MATLAB 2018b on Windows with an Intel Core i5-2430M processor and 4 GB of memory. In the following, we first state the values we used for the parameters of the proposed approach (CSDVP) and the other methods, and then describe the criteria we used to evaluate our model.

Parameters setting
In the CS approach, we set the values of the three parameters to 0.5, 0.01, and 10.
The reduced dimension is set to 18, and the loss function is minimized for 100 iterations. The autoencoder has one hidden layer with a sigmoid activation function.

Model evaluation
We evaluated our model based on its performance in predicting drug-virus associations. According to Tables 3 and 4, the CSDVP model performs better than the other models. Sequence features of viruses are more effective in predicting the drug-virus relationship. After running the models with 5-fold CV, the results in Table 5 were obtained. These results, the mean values of AUC and AUPR after five runs, indicate that the proposed CSDVP model performs better in predicting the drug-virus relationship. Another point is that using only similarities based on the genome sequences of viruses can yield a higher AUPR (see Tables 3 and 5).
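For readers who want to reproduce this kind of evaluation, the sketch below shows one way to compute AUC and AUPR and to split items into five folds. It is an illustration under stated assumptions, not the evaluation code of this study: AUC is computed with the rank-based (Mann-Whitney) formulation, AUPR is approximated by average precision, and the function names and toy labels and scores are hypothetical.

```python
import random


def auc_score(labels, scores):
    """Area under the ROC curve: fraction of positive-negative pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    # A tie between a positive and a negative counts as half a correct pair.
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def aupr_score(labels, scores):
    """Area under the precision-recall curve, approximated by average precision."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at the rank of each positive
    return total / hits


def k_fold_splits(n_items, k=5, seed=0):
    """Shuffle item indices and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


# Hypothetical labels (1 = known association) and predicted scores.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2]

auc = auc_score(labels, scores)
aupr = aupr_score(labels, scores)
folds = k_fold_splits(10, k=5)
```

In a 5-fold setting, the entries of one fold are hidden in turn, the model is fitted on the rest, and the two metrics are computed on the hidden entries and then averaged over the folds.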
We also measured the performance of the proposed model in predicting drug-virus associations after concatenating the drug features and reducing their dimension with an autoencoder. The results can be seen in Table 6, which also reports the mean squared error (MSE) of the autoencoder for each reduced dimension. According to Table 7, we have identified the drugs most likely to be associated with the coronavirus.
We also examined these relationships with SCPMF and report the scores that this model predicts. Identifying these associations with computational methods and examining them more closely can help in finding more effective drugs for the coronavirus.
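The ranking of candidate drugs for one virus can be sketched as follows, assuming that scores are obtained by applying the logistic function exp(x) / (1 + exp(x)) to the product of the drug and virus latent factors F and G. The latent factor values, the drug names, and the function names below are hypothetical; the sketch does not reproduce the loss function that produces F and G.

```python
import math


def predict_scores(f, g):
    """Logistic function applied element-wise to F G^T (rows: drugs, columns: viruses)."""
    return [[1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(f_i, g_j)))) for g_j in g]
            for f_i in f]


def rank_candidates(scores, known, drug_names, virus_col):
    """Rank drugs with no known association to the given virus by predicted score."""
    candidates = [(scores[i][virus_col], drug_names[i])
                  for i in range(len(drug_names)) if known[i][virus_col] == 0]
    return [name for _, name in sorted(candidates, reverse=True)]


# Hypothetical latent factors for 3 drugs and 2 viruses (rank 2).
F = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
G = [[2.0, -1.0], [-1.0, 2.0]]
known = [[1, 0], [0, 0], [0, 1]]
drug_names = ["drug_a", "drug_b", "drug_c"]

scores = predict_scores(F, G)
ranking = rank_candidates(scores, known, drug_names, virus_col=0)
```

Here `drug_a` is excluded because its association with the first virus is already known, and the remaining drugs are listed from the highest to the lowest predicted score.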

Discussion
The coronavirus, which has spread uncontrollably in many countries, has caused many problems. In addition to vaccine production, using available drugs that are effective in reducing mortality from COVID-19 can also be a promising path to safety and health. Machine learning and data mining models have greatly helped laboratory methods in finding drug-virus relationships, and they can predict these relationships at lower cost and in less time. In addition to the biological characteristics of drugs and viruses, clinical data can also be useful. For example, extracting relevant information from social networks such as Twitter and from electronic medical record databases, and feeding this information to learning models together with the biological data on drugs and viruses, can improve prediction performance.

Conclusion
In this paper, we use the compressed sensing technique, one of the matrix factorization methods, to predict the relationship between drugs and viruses. We also use an autoencoder to find the best number of drug features: we reduce the dimension of the drug features and use the reduced features to predict whether a drug is associated with a virus. After comparing our proposed approach with other matrix factorization methods, we identified drugs with a high potential for association with the virus that causes COVID-19.
Because the proposed CSDVP model predicted the drug-virus relationship more accurately than the other models, the identification of these drugs could help find drugs that are effective in treating COVID-19.