The world is facing a catastrophic situation today, caused by the emergence of a novel viral infection that has become one of the biggest threats to humanity. The spread of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2), commonly known as the coronavirus, is a global public health crisis. The novel coronavirus is suspected to have originated in bats and to have been transmitted to humans through an intermediate host, Malayan pangolins [1]. The outbreak began in Wuhan, China, at the end of December 2019 and quickly escalated into a global pandemic. As of May 2023, 764,474,387 confirmed cases of COVID-19 and 6,915,286 deaths had been reported worldwide [2]. Compared with the Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) and the Middle East Respiratory Syndrome Coronavirus (MERS-CoV), the novel virus has a higher transmission rate but a lower fatality rate. Pathogens now spread faster and more widely than ever before because of globalization, deforestation, climate change, and unhealthy eating habits, which favors the emergence of pandemics. Early detection and prompt action are crucial in reducing the impact of threats caused by pathogens, and epidemic intelligence plays a significant role in this process [3]. It involves several stages, of which detection and verification are particularly important for forecasting potential pathogenic attacks. Understanding the gateway through which pathogens enter the human body and spread throughout the system is fundamental to epidemic intelligence and essential for preventing future pandemics.
The COVID-19 pandemic has highlighted the significance of studying Protein-Protein Interactions (PPIs) to comprehend host-pathogen relationships [4]. Understanding how virus proteins interact with human cells can provide insight into the mechanism of viral invasion and infection [5]. Traditional methods for studying PPIs, such as yeast two-hybrid screens, mass spectrometry, immunoprecipitation, and protein chips, are expensive and time-consuming, and they have high rates of false positives and false negatives, which limits their coverage. High-throughput experimental methods are likewise labor-intensive. Consequently, obtaining a global, large-scale understanding of protein-protein interactions from experiments alone has been challenging. Computational approaches aided by machine learning have therefore emerged to bridge this gap and to help develop better models for predicting interactions between human and virus proteins.
Today, enormous advancements in the field of artificial intelligence have led to the development of computational aids for predicting protein-protein interactions, which have become a promising complement to wet-lab experiments in understanding host-pathogen associations [6]. Moreover, the emergence of new virus variants has increased the importance of analyzing human-virus PPIs with machine learning models, as they provide valuable insights into these interactions. Accordingly, several machine learning models and web-based tools have been developed for predicting human-virus PPIs. Because cost and time constraints limit the number of experimentally validated pairs of interacting human-virus proteins, computational prediction is a practical alternative for identifying protein-protein interactions.
The global pandemic has fostered the extensive application of machine learning and deep learning techniques, owing to the availability of large amounts of human-virus PPI data. High-throughput experimental methods generate these data at scale, which helps in developing accurate machine learning models. The interdisciplinary [7] nature of the work leads to collaboration between researchers from various fields. A lack of relevant information from any of these fields can introduce biases and implicit assumptions, because contradictions with knowledge from other fields may go unnoticed. Hence, to develop an accurate model, it is necessary to analyze all aspects of the specific data; otherwise, the model may suffer from underfitting or overfitting.
Most recently developed prediction models focus on interactions within a species or between different species. However, they are often limited to specific virus species and are not easily applicable to other human host-virus systems. Some machine learning models rely solely on primary-level information from protein sequences to classify protein-protein interactions, using features such as Conjoint Triad (CT), Auto Covariance (AC), Amino Acid Composition (AAC), Sequence-Order (SO), and Dipeptide Composition (DPC). For instance, Shen et al. developed a Support Vector Machine (SVM) model based on CT features, while Guo et al. used the AC protein sequence coding method with an SVM to classify interacting and non-interacting protein pairs [8, 9]. Valente et al. employed a random forest classifier with AAC as the feature descriptor, and You et al. used an SVM with two feature descriptors, SO and DPC [10, 11]. Finally, Sun et al. used a Stacked Autoencoder (SAE) classifier with AC and CT as input features [12].
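To make two of the sequence descriptors named above concrete, the following is a minimal sketch of how AAC and DPC vectors might be computed from a raw sequence. The function names, the handling of non-standard residues (they are simply skipped), and the concatenation of human and virus vectors into one pair vector are illustrative assumptions, not the procedure of any of the cited studies.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def _clean(seq):
    # Assumption: residues outside the 20 standard letters are dropped.
    return "".join(r for r in seq.upper() if r in AMINO_ACIDS)


def aac(seq):
    """Amino Acid Composition: fraction of each residue (20 values)."""
    seq = _clean(seq)
    n = len(seq)
    return [seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide Composition: fraction of each adjacent pair (400 values)."""
    seq = _clean(seq)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        counts[seq[i:i + 2]] += 1
    total = max(len(seq) - 1, 1)
    return [counts[d] / total for d in DIPEPTIDES]


def pair_features(human_seq, virus_seq):
    """Hypothetical pair encoding: concatenate both proteins' descriptors."""
    return aac(human_seq) + dpc(human_seq) + aac(virus_seq) + dpc(virus_seq)
```

Each protein yields a fixed-length vector regardless of its sequence length, which is what allows a classifier such as an SVM or a random forest to take protein pairs of arbitrary length as input.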
On the other hand, existing machine learning models that use embedding techniques to transform protein sequences into vectors have shown promising results while utilizing only primary sequence information [13]. It is difficult to establish that these models are reliable, however, because they consider only the arrangement of amino acids in the sequence rather than their properties. To predict protein-protein interactions, data containing protein sequences from both humans and viruses are fed into these models [14]. Processing protein sequences resembles processing text in Natural Language Processing (NLP), a broad field of computer science. Since proteins are composed of 20 amino acids, each denoted by a one-letter code, they can be represented as strings and are a natural fit for many NLP methods. As a result, embedding methods such as word2vec and doc2vec, which produce distributed representations of words and documents respectively, have been applied to protein sequence-based prediction tasks [15]. These embedding techniques encode the amino acid sequences of human and virus proteins as matrices by identifying k-mers (k consecutive amino acids) in the sequences and treating each k-mer as a single word. However, these models do not consider the relationships between amino acid segments in the context of the whole protein sequence.
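The k-mer "word" view of a protein sequence described above can be sketched as follows. The choice of k = 3, overlapping windows with a stride of 1, the helper names, and the toy sequences are all assumptions made for illustration. The resulting token lists are the kind of input a word2vec implementation (for example, gensim's Word2Vec) would accept as sentences, but no embedding is trained here.

```python
def kmers(sequence, k=3, stride=1):
    """Split a protein sequence into k-mers, each treated as one 'word'."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


def build_vocab(token_lists):
    """Assign an integer id to every distinct k-mer, in order of appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Map k-mers to ids, skipping k-mers not present in the vocabulary."""
    return [vocab[t] for t in tokens if t in vocab]


# Toy sequences (hypothetical, not real proteins).
human = "MKTAYIAK"
virus = "MKTLLV"
corpus = [kmers(human), kmers(virus)]
vocab = build_vocab(corpus)
ids = encode(kmers(virus), vocab)
```

Because each k-mer is embedded from its local neighbourhood only, this representation illustrates the limitation noted above: it carries no explicit information about how distant segments of the same protein relate to one another.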
UniProt [16], Swiss-Prot [17], DIP [18], IntAct [19], BioGRID [20], and BIND [21] are some of the publicly available databases for retrieving PPI data. With thousands of experimentally determined human-virus PPIs, there is an abundance of data available for developing machine learning methods to predict human-virus protein interactions. From our analysis of previous studies in this field, we hypothesized that the properties of amino acids are the stepping stone to understanding the interactions between two protein sequences. Therefore, to create an accurate prediction model at the primary protein sequence level, we recommend extracting all possible amino acid properties from each protein sequence, so that every sequence taking part in an interaction is defined uniquely.
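As one illustration of a property-derived descriptor, the sketch below maps a sequence to a hydrophobicity profile and summarizes it with a lagged auto covariance. The choice of the Kyte-Doolittle hydropathy scale, the maximum lag of 3, and the function names are assumptions made for this example; the text above does not specify which amino acid properties are extracted.

```python
# Kyte-Doolittle hydropathy index per residue (assumed example property).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_profile(seq, scale=KYTE_DOOLITTLE):
    """Replace each standard residue by its property value."""
    return [scale[r] for r in seq.upper() if r in scale]


def auto_covariance(values, max_lag=3):
    """Covariance between positions `lag` apart, for lag = 1..max_lag."""
    n = len(values)
    if n <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


# Toy sequence (hypothetical): one fixed-length descriptor per property.
features = auto_covariance(property_profile("MKTAYIAKQR"), max_lag=3)
```

Repeating this for several property scales and concatenating the results gives a fixed-length vector that reflects residue properties and their ordering along the chain, rather than residue identity alone.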
The proposed model works at the primary level, taking only protein sequence information, which must be converted into a machine-readable form before being passed to a machine learning model. Each human and virus protein sequence that participates in a protein-protein interaction must be identified uniquely, so extracting all possible features from the protein sequence is necessary. The extracted features typically lie in a high-dimensional space, and handling such high-dimensional features is a challenging task when processing biological data. Linear dimensionality reduction methods such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are therefore commonly used to map these high-dimensional data points into a lower-dimensional space. To validate the importance of the features in protein-protein interaction prediction, a dataset is constructed without applying dimensionality reduction techniques. The dataset is split in an 80:20 ratio for training and testing, respectively, and a random forest classifier is trained to classify interacting and non-interacting protein pairs. The experimental results suggest that, to develop an accurate protein-protein interaction prediction model, all the features of amino acids that can be extracted from the protein sequence without losing their uniqueness must be incorporated.
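The comparison described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions: the data are synthetic placeholders generated by `make_classification` rather than real human-virus pair features, and the number of trees, the number of PCA components, and the stratified split are arbitrary choices not taken from the text.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for protein-pair feature vectors and binary labels
# (1 = interacting, 0 = non-interacting). Not real PPI data.
X, y = make_classification(
    n_samples=500, n_features=200, n_informative=20, random_state=0
)

# 80:20 split for training and testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def evaluate(train, test):
    """Train a random forest and return its accuracy on the test split."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(train, y_train)
    return clf.score(test, y_test)


# Baseline: all features, no dimensionality reduction.
acc_full = evaluate(X_train, X_test)

# PCA: unsupervised projection onto 10 components, fitted on training data only.
pca = PCA(n_components=10).fit(X_train)
acc_pca = evaluate(pca.transform(X_train), pca.transform(X_test))

# LDA: supervised projection; a two-class problem allows at most 1 component.
lda = LinearDiscriminantAnalysis(n_components=1).fit(X_train, y_train)
acc_lda = evaluate(lda.transform(X_train), lda.transform(X_test))

print(f"full={acc_full:.3f}  pca={acc_pca:.3f}  lda={acc_lda:.3f}")
```

Fitting PCA and LDA on the training split only, and then applying the fitted transform to the test split, keeps test information from leaking into the projection. The accuracies printed for synthetic data say nothing about real PPI data; the sketch shows only the shape of the comparison.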