Link prediction of viral spike proteins and cell receptors using structural perturbation method

Many protein receptors for animal and human viruses have been discovered in decades of studies. The main determinant of virus entry is the binding of the viral spike protein to host cell receptors, which mediates membrane fusion. In this work, a bilayer network is constructed by integrating the similarity network of the viral spike proteins, the similarity network of host receptors, and the association network between viruses and receptors. The structural perturbation method (SPM) is used to predict possible emerging infection of a virus in potential new host organisms. The reliability of this method is based on the hypothesis that the major barrier to virus infection is the diﬀerences in the compatibility of spike proteins and cell receptors, which is determined by the amino acid sequences among species.


Introduction
With the rising threat of viral pandemics in the past 50 years -as illustrated by SARS-COV [14], COVID-19 [9,18], Ebola [1], Zika [15] and HIV [4] -questions have arisen about how we can predict similar large-scale outbreaks to have time to avoid them or reduce their impact on daily life.
The majority of these viral infections emerge first in another species (for instance, SARS-CoV-2 most likely originated in bats [19,18]) before evolving to cross species barriers and expanding their life cycles in other hosts, including humans [6]. Currently, genome sequencing and computational analysis are common techniques used to identify and predict which pathogens are most likely to evolve to infect humans [3]. Specifically, the gene sequences coding the amino acid chains of viral spike proteins of different viruses and the genes of corresponding host cell receptors can be compared to identify their similarities and compatibility. Spike proteins are transmembrane glycoproteins (composed of an amino acid sequence with a carbohydrate side chain) folded into a spike shape to surround viruses [12,14]. This "spike" shape is especially prominent in coronaviruses, hence their name. The spike proteins of viruses attach to human cell receptors and if compatible, will trigger the fusing of the viral and host cell membranes, allowing the viral genome to sneak into the cell uninhibited [12,14]. Clearly, the spike proteins and the receptors interact to play a key role in the success of a virus in infecting a host. The spike protein is also a main element of diversity between different viral strains [12]. As new viruses or viral variants develop, their spike proteins evolve to changing environmental conditions, such as increased immune response (natural or vaccine-induced) or the use of medication inside a host's body.
Here I postulate that a virus from a reservoir has the ability to cross species barriers and adapt to a new host if the similarity between receptor proteins of the potential host and its original host is high enough. Based on this assumption, I present an approach to apply the structural perturbation method (SPM) [13] to predict the propensities of viral cross-species infections.

Computational Problem Formulation
In this work, I model the prediction of potential viral infection through prediction of the links of a graph, as described below. The similarities of the sequences of both the spike proteins and the receptors of different species are graphed onto an adjacency matrix. In addition to metrics for both virus-virus and receptor-receptor similarities, virus-receptor compatibility characterized in known viral infections is included in the adjacency matrix as well. Thus, since both animal and human cell receptors will be included in the adjacency matrix, it will allow for the identification of viral infections in animals that have yet to infect humans, but may do so in the future. This matrix can be analyzed using a linear algebra technique called structural perturbation method (SPM) [13].
Because this matrix will inevitably include unknown values -for virus-receptor interactions that haven't been studied yet -it can be diagonalized with the use of eigenvalues, after a set of data is removed, to predict the unknown values. These values can be used to predict viruses that are most likely to cross species and infect humans in the future.

Forming an Adjacency Matrix from a Graph
An adjacency matrix of a graph is formed by labeling each vertex of the graph with a number, which will then become the rows and columns of the matrix. In a simple unweighted graph, each (v i , v j ) is marked with either 0 or 1, depending on whether the two vertices are linked or not. In weighted graphs, each (v i , v j ) is marked with the strength of the relationship between i and j, usually on a scale from 0 to 1. For graphs with no self-loops, the main diagonal of the matrix contains all zeroes.  By assigning weights based on the proximity of the dashed lines, the following matrix A can be produced:  This can also be represented visually with Figure

Eigenvalues, Diagonalization and Spectral Decomposition
If A is an nxn matrix in a field F , then the eigenvalue λ in F and the (nonzero) eigenvector v in F n are defined such that Manipulating equation (1), we get: where I is the identity matrix.
Thus, if v is nonzero, this equation can only hold if: All λ (both real and complex) that satisfy equation (3) are employed, a more compact form of the equations, equations (5) and (6), can be produced: This is called the spectral decomposition of the matrix, where a real symmetric matrix can be written in terms of its eigenvalues and eigenvectors. Since the matrix of eigenvectors V is orthogonal (by definition, they form an orthonormal basis in R N ), then V −1 = V T . Thus, equation (6) can be rewritten as equation (7): In addition, for any m: Hence:

Structural Perturbation Method
After generating an adjacency matrix, a random set of vertices must be perturbed (removed) and the removed values will be replaced with zeroes in the matrix. If the set of perturbed values is denoted by ∆E and the remaining set by E R , and their weighted adjacency matrices are ∆A and A R respectively, then A = A R + ∆A. Note that both the perturbed and remaining matrix contain the same dimensions as the original matrix; the missing values are simply set to zero.
Since A R is a real symmetric matrix (based on how the adjacency matrix is formed), using equation (9) it can be diagonalized as follows: where λ k and v k are respectively the eigenvalues and eigenvectors of each perturbation k of A R , with N total perturbations.
After perturbation, the eigenvalue is adjusted to λ k + ∆λ k and the eigenvector to v k + ∆v k .
Thus, from equation (1): Next, this equation is left multiplied by v T k . To make the problem easier, a linear approximation is utilized by neglecting the second order terms v T k ∆A∆v k and ∆λ k v T k ∆v k . Thus: This linear approximation allows the eigenvalues to change but keeps the eigenvectors constant (since all terms containing ∆v T k are cancelled). Thus, the perturbed matrix can be rewritten as as: However, this process neglects the degenerate case in which some eigenvalues are repeated. The Again using a linear approximation (neglecting second order terms), the following is obtained: For all n = 1...M , we can multiply equation (13) by v T kn to obtain: This condenses to the matrix form: where W is an M xM matrix defined by W nj = v T kn ∆Av kj and B k is the column vector of β kj (essentially the adjusted eigenvector). Noticing the parallels between equation (17) and equation (1), one realizes that the perturbed matrix becomes: This perturbed matrix A ′ can be computed for as large N as necessary, dividing by N to obtain the average (in order to maintain the same scaling and structural consistency). For each perturbation and subesequent diagonalization, a set number or percentage of perturbed values must be used. This perturbed matrix can be used to make a prediction about the value of certain unknown links. The method described in this section is known as the structural perturbation method (SPM) [13].

Data set
As mentioned in the introduction, the techniques explained in section 4 will be applied to determine the compatibility of a set of viruses and protein receptors. For this specific application, the amino acid sequences of 22 viral spike proteins, along with all of the known sequences of the 18 receptors (40 from animal hosts and 20 from human hosts) were collected [5]. These are listed below in Table 1.
All of the spike proteins, and all of the the receptor proteins, were then compared in pairs based on sequence alignment using MUSCLE [7]. The alignment results were used to calculate the sequence similarity, using absolute and relative distance. The absolute distance is an estimate of the evolutionary divergence between sequences and is defined as the ratio of the number of aligned amino acid residues to the alignment length. The relative distance is the pairwise distance divided by the maximum distance value calculated from the distance analysis results. In this application, the relative distances were used as the weights of the links between viral spike proteins and the links between receptor proteins. The data behind the links between viral spike proteins and receptors are taken from 60 known pairs of virus-receptor interactions [5]. Thus, after generating a graph, a matrix is created using the method demonstrated in Section 4.

Evaluation of Structural Perturbation Method
The accuracy of the structural perturbation method outlined in section 4.3 must be verified in the context of the specific application, before it can be used on the data set for prediction. In order to do so, we perform a 5-fold cross-validation [10]. In this method, all of the 60 known virus-receptor interactions are randomly divided into 5 equally sized subsets. Four of these groups are used training data; the other is the test data. To determine its accuracy, SPM will be performed and its results will be compared with the already known information regarding the virus-receptor interactions using the following metrics: where T P , F P and F N are respectively the number of true positive, false positive and false negative samples, with respect to a virus-receptor interactions.
The receiver operating characteristic curve (ROC) curve is obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The true positive rate (TPR) and the false positive rate (FPR) are calculated by the following formula [16]: The area under the ROC curve (AUC) -calculated via integration -represents the overall success of the structural perturbation method in predicting the correct outcomes [2].
Using the previously mentioned data set of 60 known virus-receptor interactions split into 5 groups, the plot showing AUCs of different numbers of perturbations is shown in Figure 3. As the plot illustrates, the accuracy plateaus between 85% and 90% once 8 perturbations have been performed. Since 50% represents a worthless test and 100% represents a perfect test, this statistic demonstrates an accurate prediction rate for the application.

Results
Finally, the structural perturbation method outlined in section 4.3 was applied to the full data set described in section 5.1. This was calculated using the Matlab software of SPM [13]. Table 2

Discussion
At this moment, the COVID-19 pandemic is still ongoing all over the world. It is believed that there is a reservoir of viruses in wild animals, since increasing numbers of viruses have been discovered in wild animals. With more sequences of viruses and their receptors available, computational prediction models such as the structural perturbation method used in this work can be applied as an early warning system to prevent infections and propagation in humans.
Although many other proteins also contribute to the process of virus production and host invasion, the spike protein located on the surface of the virus plays the most important role in the binding of the cell receptor and membrane fusion [8,11,17]. Therefore, it is the most important factor to determine host range [14,8,11,17]. In this work, with the sequences of SARS-COV spike protein and its receptor included, SPM model successfully predicted that SARS-CoV-2 has the potential to cross species barriers to infect humans.
Cross-validations showed that the accuracy of SPM only reaches close to 90%. Clearly, there is still room to improve. In addition to sequence similarity, other features of spike proteins and their receptors can be considered. Physicochemical properties, secondary structural motifs and compositional information of animo acids or dipeptides of viral spike proteins and cell receptors are all candidates to be considered in the future.

Conclusion
In this paper, a bilayer network, which integrated the similarity network of the viral spike proteins, the similarity network of host receptors, and the association network between viruses and receptors, was analyzed using the structural perturbation method. Through SPM, I managed to compare the links between several viruses, animal receptors and human receptors to predict the compatibility between certain viral spike proteins and human receptors, which can determine which viruses are most likely to cause the next major outbreak. Computing techniques like SPM should be harnessed to act as an early detection system for future propensity of viral infections in humans.