A Comprehensive Machine Learning Approach for COVID-19 Target Discovery in the Small-Molecule Metabolome

doi:10.21203/rs.3.rs-3981522/v1

Research Article

This work is licensed under a CC BY 4.0 License

Version 1

Respiratory viruses, including influenza, RSV, and COVID-19, cause various respiratory infections. Distinguishing these viruses relies on diagnostic methods such as PCR testing. Challenges stem from overlapping symptoms and the emergence of new strains. Advanced diagnostics are crucial for accurate detection and effective management. This study leveraged nasopharyngeal metabolome data to predict respiratory virus scenarios including control vs RSV, control vs influenza A, control vs COVID-19, control vs all respiratory viruses, and COVID-19 vs influenza A/RSV. Our advanced machine learning models, including linear discriminant analysis, support vector machine, random forest, and logistic regression, exhibited superior accuracy, sensitivity, and specificity to previous supervised machine learning approaches. Key techniques such as feature ranking, standard scaling, and SMOTE were used to address class imbalances, thus enhancing model robustness. SHAP analysis identified crucial metabolites influencing positive predictions, thereby providing valuable insights into diagnostic markers. Our approach not only outperformed existing methods but also revealed top dominant features for predicting COVID-19, including lysophosphatidylcholine acyl C18:2, kynurenine, phenylalanine, valine, tyrosine, and aspartic acid (Asp). These compounds play critical roles in metabolic pathways and have been identified as top contributors to predictive models in COVID-19 respiratory virus scenarios.

Metabolomics

Respiratory viruses

Machine learning

Diagnostic markers

COVID-19

Globally, the coronavirus disease 2019 (COVID-19) pandemic caused widespread disruptions and a substantial loss of human lives. SARS-CoV-2, the virus that causes COVID-19, enters the human body primarily via nasal epithelial cells [1]. The primary immune reaction to the virus occurs within a distinct immune microenvironment known as the nasopharynx-associated lymphoid tissue system, which is located in the nasal cavity [1]. Influenza, a widespread illness affecting both humans and animals, is caused by viruses that have animal reservoirs and exhibit continual antigenic change [2]. Both COVID-19 and influenza are contagious respiratory illnesses [3]. COVID-19 spreads through respiratory droplets, aerosols, and contaminated surfaces [3]. Influenza, caused by influenza A or B viruses, spreads primarily through respiratory droplets released during coughing or sneezing [4]. Respiratory syncytial virus (RSV) is also a highly contagious virus causing acute respiratory infections, with a global incidence of approximately 33 million cases in children under 5 years; RSV infection often leads to severe bronchiolitis [5]. Molecular testing, specifically polymerase chain reaction (PCR), has revolutionized the surveillance and diagnosis of infectious diseases in clinical microbiology and virology laboratories over the past decade [6, 7]. Although these techniques are rapid and accurate, they continue to have notable limitations, including cost, complicated procedures, inability to differentiate active infection from latency or colonization, and diminished sensitivity when applied to direct patient specimens [6–8].

With varying degrees of success, the application of 'omics' methods, comprising genomics, proteomics, and metabolomics, has been investigated for diagnosing COVID-19 and influenza [9–14]. In contrast to conventional clinical virology diagnostics, metabolomics, which examines small molecules on a large scale, identifies the metabolic response of the host rather than explicitly identifying the pathogen [15]. Alterations in the nasal metabolome that are specific to a particular virus have been observed to correlate with viral load and disease severity [16].

Machine learning (ML) has become a powerful tool for navigating the intricacies of metabolomics data, thus facilitating efficient analysis, interpretation, and extraction of valuable insights [17–20]. Recently, Kantz et al. [21] created, fine-tuned, and evaluated an ML pipeline that effectively classifies spectral features in non-targeted liquid chromatography–mass spectrometry (LC/MS) metabolomics data by using both deep neural networks and a simpler multiple logistic regression model. Jeany et al. [22] introduced a novel approach that integrates mass spectrometry and machine learning using paired m/z analysis for direct COVID-19 diagnosis from raw data. This method presents a flexible tool for population screening and risk assessment in public health initiatives; it addresses ion competition effects and is compatible with a range of mass spectrometry setups, such as flow-injection mass spectrometry. This technique offers molecular insights into the pathogenesis of COVID-19, with potential uses for managing patients during pandemics and other related disorders.

Metabolomics and ML strategies have the potential to revolutionize the diagnosis of infectious diseases, specifically respiratory viruses. Hogan et al. [23] applied liquid chromatography quadrupole time-of-flight (LC/Q-TOF) mass spectrometry and ML for influenza diagnosis based on nasopharyngeal swab samples. After an initial analysis of 236 samples, the researchers extended their approach to a clinically applicable LC/MS analysis in a cohort of 96 symptomatic individuals. Hasan et al. [24] applied metabolomics strategies to analyze volatile organic compounds in exhaled breath and used mass spectrometry for COVID-19 detection in nasopharyngeal swabs. The study highlights the differentiation between targeted and untargeted approaches, thus stressing the need for standardization and extensive clinical validation before integration of volatile organic compound-based tests into clinical practice. Recently, Bennet et al. [16] systematically examined the nasopharyngeal metabolome in patients with COVID-19 using a liquid chromatography tandem mass spectrometry (LC–MS/MS) kit, quantifying 141 analytes. Using qRT-PCR and ML models, the study [16] achieved remarkable accuracy in discerning viral infections and specifically distinguishing COVID-19 from other respiratory viruses, identifying critical differentiating metabolites in the process.

During the flu season, the ability to distinguish between COVID-19, RSV and influenza with greater precision and earlier detection could be facilitated by the identification of distinct metabolic signatures. We used a publicly available metabolomics dataset in the course of our research, focusing on the nasal metabolomic profile associated with COVID-19 and other respiratory viruses. The results of our study indicated distinct metabolomic profiles between COVID-19 and other respiratory viruses. By using a comprehensive ML framework, we successfully predicted and analyzed patterns in the metabolome associated with COVID-19. The contributions of this study are as follows:

Through analysis of the nasopharyngeal metabolome in patients with COVID-19, RSV, and influenza, our study identified potential therapeutic targets through a comprehensive ML approach on the LC–MS/MS dataset reported by Bennet et al. [16].
Multivariate statistical analysis and SHapley Additive exPlanations (SHAP) analysis were implemented across individual cases.
Subsequently, through experimental validation, we identified the distinct metabolites associated with COVID-19, RSV, and influenza, thus shedding light on their individual molecular signatures.

This section contains a detailed explanation of the methods used to identify respiratory viruses in small-molecule metabolomes including the dataset, preprocessing methods, and model implementation.

Figure 1 provides an outline of the workflow process. The investigation began with the analysis of clinical nasopharyngeal swabs using a viral transport medium (VTM) and a TMIC Prime kit. This procedure involved chemical derivatization and LC–MS/MS. Statistical analyses were conducted, incorporating p-values, chi-square tests, and t-distributed stochastic neighbor embedding (t-SNE) plots. Subsequently, a feature extraction process was executed, wherein the top ten ranked features were identified. A 5-fold dataset was generated to facilitate robust model training. Various ML models were used, including tree-based models, instance-based models, and neural networks. The stacking technique was applied to create an optimal model for predicting the final output. To further elucidate the influential metabolites associated with specific respiratory viruses, we conducted SHAP analysis. This analytical approach was aimed at identifying individual metabolites and quantifying their effects on the predictive models, thereby contributing to a comprehensive understanding of the metabolomic landscape in relation to respiratory virus presence.

2.1 Dataset Description

The dataset was reported by Bennet et al. [16], who conducted a study using nasopharyngeal specimens from individuals infected with COVID-19, influenza A, and RSV, along with unaffected controls. The nasopharyngeal metabolome was characterized using an LC–MS/MS-based screening system that quantifies 141 analytes. The 210 individuals in the dataset were classified into unaffected controls and three distinct patient groups: SARS-CoV-2 positive, influenza A positive, and RSV positive. A thorough examination of the metabolomic distinctions between various respiratory viruses and control subjects was achieved by analysis of the small-molecule profiles in viral transport medium extracted from nasal samples from each group. The demographic characteristics of all patients are presented in Supplementary Material Table 1S, including essential information such as the number of individuals, collection year (including monthly variation), age range, sex distribution expressed as a percentage, and median computed tomography attenuation (CTa) with the corresponding range. A comprehensive list of all metabolites assessed in the study, along with their detailed information, is documented in Supplementary Material Table 2S.

Figure 2 illustrates the comprehensive analysis of the dataset, including both the total sample distribution and patient classes. The t-distributed stochastic neighbor embedding (t-SNE) [25] plot visually depicts the distinct class separations and provides insights into the clustering patterns for both the control group and individual respiratory virus categories. Additionally, a parallel coordinates plot is presented, highlighting the class separability across the top ten features for the four identified classes. This integrated approach provides a thorough examination of the dataset, combining descriptive statistics, dimensionality reduction, and feature visualization to enhance understanding of the underlying patterns and relationships within the data.

2.2 Statistical Analysis

A statistical analysis in Python 3.9 was performed to evaluate the central values of the features and the distribution of the data. The significance of individual features in relation to the target variable was determined with p-values calculated with a variety of statistical tests, such as the chi-square test, Wilcoxon rank-sum test, and t-test [26, 27].
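As an illustration of one of the tests named above, the following is a minimal sketch of a two-sided Wilcoxon rank-sum test using the normal approximation. It is not the authors' code: the function names and the toy concentrations are hypothetical, and the sketch applies no tie or continuity correction.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def rank_sum_test(group_a, group_b):
    """Two-sided rank-sum test (normal approximation); returns (z, p_value)."""
    n_a, n_b = len(group_a), len(group_b)
    ranks = average_ranks(list(group_a) + list(group_b))
    rank_sum_a = sum(ranks[:n_a])
    expected = n_a * (n_a + n_b + 1) / 2
    std = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (rank_sum_a - expected) / std
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical concentrations, not values from the dataset.
control = [1.0, 2.0, 3.0, 4.0, 5.0]
infected = [6.0, 7.0, 8.0, 9.0, 10.0]
z, p = rank_sum_test(control, infected)
print(f"z = {z:.3f}, p = {p:.4f}")
```

A rank-based test of this kind is a common choice when a feature is not normally distributed, which may be why Table 1 reports it for most metabolites and a t-test for only a few.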

In the initial state, the dataset comprised 48 metabolite features. Through implementation of a stringent feature selection method, the ten most promising features were identified. The following section provides an in-depth analysis of their specific implications. The notable characteristics are outlined in Table 1, which presents a comparative statistical analysis between the control group and the group of all respiratory viruses for the top 10 features. These features include lysophosphatidylcholine 18:2 (LysoPC 18:2), kynurenine (Kyn), phenylalanine (Phe), isoleucine (Ile), aspartic acid (Asp), tyrosine (Tyr), methionine sulfoxide (Met.SO), proline (Pro), valine (Val), and arginine (Arg).

Table 1

Statistical analysis of the characteristics of the metabolite features (control vs all respiratory viruses)
Feature name	Statistic	Control	Respiratory virus	Total	Technique	P-value
Sex	Male (%)	25%	42.77%	53.33%	Chi-square test	< 0.05
Sex	Female (%)	75%	47.59%	39.04%
Sex	Null (%)	0%	9.63%	7.62%
LYSOC18.2	Mean ± SD	0.86 ± 1.05	1.57 ± 0.97	1.42 ± 1.03	Rank-sum test	< 0.0001
LYSOC18.2	Median	0.8725	1.4427	1.2314
Ile	Mean ± SD	19.57 ± 15.78	69.76 ± 42.48	59.24 ± 43.54	Rank-sum test	< 0.0001
Ile	Median	15.50	66.90	53.30
Met.SO	Mean ± SD	1.27 ± 1.97	6.74 ± 6.29	5.59 ± 6.08	Rank-sum test	< 0.0001
Met.SO	Median	0.5445	5.90	5.02
Asp	Mean ± SD	54.54 ± 25.06	139.60 ± 58.74	121.78 ± 63.70	T-test	< 0.0001
Asp	Median	49.350	132.50	116.00
Phe	Mean ± SD	24.54 ± 16.97	85.80 ± 44.40	72.97 ± 47.33	Rank-sum test	< 0.0001
Phe	Median	21.40	84.05	70.40
Tyr	Mean ± SD	23.24 ± 12.52	72.33 ± 43.25	62.04 ± 43.70	T-test	< 0.0001
Tyr	Median	22.60	62.95	54.90
Kynurenine	Mean ± SD	3.88 ± 2.72	6.85 ± 7.05	6.22 ± 6.50	Rank-sum test	0.0067
Kynurenine	Median	6.224	5.190	5.3550
Val	Mean ± SD	32.43 ± 29.98	122.04 ± 89.86	103.26 ± 88.86	Rank-sum test	< 0.0001
Val	Median	26.250	111.00	85.85
Citric.acid	Mean ± SD	3.26 ± 1.68	1.76 ± 4.21	2.08 ± 3.86	T-test	0.02169
Citric.acid	Median	3.840	1.070	1.28
Arg	Mean ± SD	42.75 ± 24.27	134.68 ± 73.22	115.42 ± 75.90	Rank-sum test	< 0.0001
Arg	Median	36.150	132.00	92.75

2.3 Dataset Preprocessing

The dataset used in this study was originally reported by Bennet et al. [16]. To enhance the efficacy of ML models during training, the input data were normalized so that each feature contributed proportionately, thereby improving overall model performance. The Standard Scaler method was used for normalization [28, 29]. To promote robust training and facilitate generalization, the dataset was subjected to 5-fold cross-validation, in which the data were partitioned into training and testing sets (80% and 20%, respectively) in each fold. This data splitting method aided in assessing model performance across different subsets of the dataset and contributed to a more reliable evaluation of the model's ability to generalize to unseen data.
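The scaling and splitting steps can be sketched as follows. This is an illustrative standard-library version rather than the pipeline used in the study: the helper names and the toy data are hypothetical, and fitting the scaler on the training fold only is an assumption, since the text does not state where the scaler was fitted.

```python
import random
import statistics


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and yield (train, test) index lists for each fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def fit_scaler(rows):
    """Compute the per-feature mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # A constant feature has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply z = (x - mean) / std feature by feature."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical data: 10 samples with 2 features on very different scales.
rng = random.Random(42)
data = [[rng.uniform(0, 100), rng.uniform(0, 5)] for _ in range(10)]

for fold, (train_idx, test_idx) in enumerate(k_fold_indices(len(data)), start=1):
    means, stds = fit_scaler([data[i] for i in train_idx])
    train_scaled = transform([data[i] for i in train_idx], means, stds)
    test_scaled = transform([data[i] for i in test_idx], means, stds)
    print(f"fold {fold}: {len(train_scaled)} train / {len(test_scaled)} test samples")
```

With five folds, each sample is used for testing exactly once and for training four times, which corresponds to the 80%/20% split described above.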

To address the class imbalance within the dataset, wherein the counts for RSV, COVID-19, influenza, and control classes were 58, 55, 53, and 44, respectively, the pipeline used Synthetic Minority Over-sampling Technique (SMOTE) augmentation [30]. This technique helps mitigate the effects of imbalanced class distribution during training by generating synthetic samples for the minority classes. By oversampling the minority classes, SMOTE contributes to a more balanced representation across all classes, enhancing the model's ability to effectively learn from and generalize to each class during the training process.
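The interpolation idea behind SMOTE can be sketched in a few lines. The version below is a simplified illustration, not the implementation used in the study: the function names, the neighbour count `k=5`, and the randomly generated two-feature data are assumptions, and only the class sizes are taken from the text.

```python
import math
import random
from collections import Counter


def smote(samples, n_new, k=5, seed=0):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest neighbours from the same class."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def balance(features, labels, k=5, seed=0):
    """Oversample every class up to the size of the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    new_x, new_y = list(features), list(labels)
    for label, count in counts.items():
        members = [x for x, y in zip(features, labels) if y == label]
        extra = smote(members, target - count, k=k, seed=seed)
        new_x.extend(extra)
        new_y.extend([label] * len(extra))
    return new_x, new_y


# Hypothetical two-feature data using the class sizes quoted above.
sizes = {"RSV": 58, "COVID-19": 55, "influenza": 53, "control": 44}
rng = random.Random(1)
X, y = [], []
for offset, (label, size) in enumerate(sizes.items()):
    X += [[rng.gauss(offset, 1.0), rng.gauss(-offset, 1.0)] for _ in range(size)]
    y += [label] * size

X_bal, y_bal = balance(X, y)
print(Counter(y_bal))
```

With these class sizes, balancing to the largest class (58) adds 3, 5, and 14 synthetic samples to the COVID-19, influenza, and control classes, respectively. A common precaution, assumed here rather than stated in the text, is to oversample only the training portion of each fold so that synthetic points never appear in the test set.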

Feature ranking is an essential preliminary step in ML [31], particularly when datasets comprise a large number of features. This step helps prevent overfitting, which occurs when a model overly adjusts to the complexities of the training data, thereby impairing its performance when applied to novel datasets. For each of the five investigations, the XGBoost, random forest, and extra trees algorithms were used to rank the 48 features. The random forest algorithm performed best, surpassing the other two approaches.

2.3 Classification Model Development

We systematically explored various ML classifiers, comprising linear discriminant analysis [32], random forest [33], support vector machine (SVM) [34], k-nearest neighbor (KNN) [35], XGBoost [36], extra trees [37], and logistic regression [38]. Notably, these classifiers emerged as top-performing models in the initial analysis of 20 ML models. After an optimal model analysis, we selected the top ten performing models for each case, as described in Section 3. This selection process focused the examination on the most promising models in the subsequent sections of the study.

Stacking is an ensemble learning technique that combines the predictions of numerous base models to enhance predictive performance [39, 40]. The initial phase involved training individual ML models; subsequently, the top three performing models were selected according to their predictive capabilities. Notably, random forest was chosen as the meta-model. The core of this technique involves using the meta-model to acquire and combine information from many base models, thus enhancing prediction ability. The use of stacking, as exemplified by Rahman et al. [41], has produced noteworthy results in evaluation metrics, consistently surpassing 90% in all assessment criteria.

A comprehensive probability distribution is constructed by combining predictions from the base-level classifier set N with the input variable x.

$${\text{P}}^{\text{N}}\left(\text{x}\right)=\left({\text{P}}^{\text{N}}\left({\text{c}}_{1}|\text{x}\right),{\text{P}}^{\text{N}}\left({\text{c}}_{2}|\text{x}\right),\dots ,{\text{P}}^{\text{N}}\left({\text{c}}_{\text{m}}|\text{x}\right)\right)$$	(1)

The set of potential class values is represented as (${\text{c}}_{1}$, ${\text{c}}_{2}$, ..., ${\text{c}}_{\text{m}}$), and the probability that example x belongs to class ${\text{c}}_{\text{i}}$, as estimated by base-level classifier N, is given as ${\text{P}}^{\text{N}}\left({\text{c}}_{\text{i}}|\text{x}\right)$. The architecture of the stacking ensemble approach is depicted in Fig. 3.
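To make the role of these probability vectors concrete, the sketch below shows one way the input to the meta-model could be assembled: the class-probability vector of each base classifier is concatenated into a single feature vector. The three toy logistic models and their weights are hypothetical stand-ins for trained base classifiers, and the helper names are not from the study.

```python
import math
from typing import Callable, List, Sequence

Classifier = Callable[[Sequence[float]], List[float]]


def make_logistic(weights: Sequence[float], bias: float) -> Classifier:
    """Build a toy two-class model returning [P(c_1 | x), P(c_2 | x)]."""
    def predict_proba(x: Sequence[float]) -> List[float]:
        score = bias + sum(w * v for w, v in zip(weights, x))
        p_positive = 1.0 / (1.0 + math.exp(-score))
        return [1.0 - p_positive, p_positive]
    return predict_proba


def meta_features(x: Sequence[float], base_models: Sequence[Classifier]) -> List[float]:
    """Concatenate the probability vectors of all base models for example x."""
    stacked: List[float] = []
    for model in base_models:
        stacked.extend(model(x))
    return stacked


# Three hypothetical base classifiers over m = 2 classes.
base_models = [
    make_logistic([1.0, 0.0], 0.0),
    make_logistic([0.0, 1.0], -0.5),
    make_logistic([0.5, 0.5], 0.2),
]

z = meta_features([0.0, 0.5], base_models)
print(z)  # 3 models x 2 classes = 6 meta-features
```

In a full stacking setup, the meta-model (random forest in this study) would then be trained on such vectors, typically built from out-of-fold predictions so that the meta-model does not see probabilities produced on the base models' own training data.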

2.3.1 Evaluation Metrics

The performance of the classifiers was assessed with receiver operating characteristic (ROC) curves and the area under the curve (AUC), as well as precision, sensitivity, specificity, accuracy, and F1-score. Furthermore, we used a five-fold cross-validation technique, which involved splitting the dataset into 80% for training and 20% for testing. This process was repeated five times, once per fold, so that the complete dataset was validated. We used per-class weighted metrics and overall precision because of the varying number of instances across classes. Furthermore, the AUC value was used as an assessment criterion. The mathematical representation of five evaluation measures (weighted sensitivity or recall, specificity, precision, total accuracy, and F1 score) can be found in Equations 2 through 6.

$\text{Accuracy}_{class\_x}=\frac{TP_{class\_x}+TN_{class\_x}}{TP_{class\_x}+TN_{class\_x}+FP_{class\_x}+FN_{class\_x}}$	(2)
$\text{Precision}_{class\_x}=\frac{TP_{class\_x}}{TP_{class\_x}+FP_{class\_x}}$	(3)
$\text{Recall/Sensitivity}_{class\_x}=\frac{TP_{class\_x}}{TP_{class\_x}+FN_{class\_x}}$	(4)
$\text{F1\_score}_{class\_x}=2\,\frac{\text{Precision}_{class\_x}\times \text{Sensitivity}_{class\_x}}{\text{Precision}_{class\_x}+\text{Sensitivity}_{class\_x}}$	(5)
$\text{Specificity}_{class\_x}=\frac{TN_{class\_x}}{TN_{class\_x}+FP_{class\_x}}$	(6)

Here, the terms "true positive," "true negative," "false positive," and "false negative" are abbreviated as TP, TN, FP, and FN, respectively.
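Equations 2 through 6 can be evaluated directly from predicted and true labels. The sketch below computes the one-vs-rest counts for each class and then averages the per-class metrics, weighting each class by its number of true samples. The function names and the ten toy labels are hypothetical, and the support-weighted averaging is an assumption about how the "weighted" metrics were formed.

```python
def per_class_counts(y_true, y_pred, label):
    """One-vs-rest TP, TN, FP, FN for a single class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, tn, fp, fn


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a denominator is empty."""
    return numerator / denominator if denominator else 0.0


def class_metrics(y_true, y_pred, label):
    """Equations 2 to 6 for one class."""
    tp, tn, fp, fn = per_class_counts(y_true, y_pred, label)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "specificity": safe_div(tn, tn + fp),
    }


def weighted_metrics(y_true, y_pred):
    """Average the per-class metrics, weighting each class by its support."""
    total = len(y_true)
    result = {}
    for label in sorted(set(y_true)):
        weight = sum(t == label for t in y_true) / total
        for name, value in class_metrics(y_true, y_pred, label).items():
            result[name] = result.get(name, 0.0) + weight * value
    return result


# Hypothetical labels: 6 virus-positive and 4 control samples, 2 misclassified.
y_true = ["virus"] * 6 + ["control"] * 4
y_pred = ["virus"] * 5 + ["control"] + ["control"] * 3 + ["virus"]

for name, value in weighted_metrics(y_true, y_pred).items():
    print(f"{name}: {value:.4f}")
```

One property worth noting is that support-weighted recall always equals overall accuracy, because the class weights cancel the per-class denominators. This would be consistent with the identical accuracy and recall values reported for each scenario in Section 3.2.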

2.3.2 Model Explainability

The ability to comprehend and interpret the decisions or predictions generated by an ML model, referred to as "explainability", encompasses a range of methods and strategies that reveal the process through which a model derives its outcomes, thereby enhancing the model's transparency. SHAP [42], a method for explaining models that measures the individual effect of each attribute on the model's prediction, offers valuable information regarding how specific characteristics affect the output of the model, thus improving the comprehensibility and clarity of intricate ML models.
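SHAP builds on Shapley values from cooperative game theory. As a minimal illustration of the underlying quantity, the sketch below computes exact Shapley values by enumerating all feature coalitions for a tiny model. It is a brute-force demonstration, not the SHAP library and not the procedure used in the study: the toy model, the input, and the all-zero baseline are hypothetical, and replacing absent features with baseline values is a simplifying assumption.

```python
import itertools
import math


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition take their baseline value. The cost grows
    as 2^n, so this is only feasible for a handful of features.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in features]
        return model(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            weight = (
                math.factorial(size) * math.factorial(n - size - 1) / math.factorial(n)
            )
            for subset in itertools.combinations(others, size):
                coalition = set(subset)
                # Marginal contribution of feature i when added to this coalition.
                phi[i] += weight * (value(coalition | {i}) - value(coalition))
    return phi


def toy_model(v):
    """Hypothetical score with two linear terms and one interaction term."""
    return 2.0 * v[0] - 1.0 * v[1] + 0.5 * v[0] * v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
print(sum(phi), toy_model(x) - toy_model(baseline))
```

The attributions sum to the difference between the prediction and the baseline prediction, and the interaction term is shared equally between the two features involved. A positive value means the feature pushed this prediction above the baseline, and a negative value means it pushed it below.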

This section includes the following: (i) feature ranking, (ii) detailed outcomes of the top-performing model, (iii) results pertaining to model explainability, and (iv) a comprehensive discussion and comparative analysis. This structured presentation is aimed at providing a nuanced understanding of the study's outcomes and their implications.

3.1 Feature Ranking

In this investigation, three ML feature selection models (XGBoost, random forest, and extra trees) were used. After a thorough preliminary exploration, the random forest model was found to exhibit superior performance, achieving the highest rankings. From the initial set of 48 features, the top ten features emerged as particularly impactful, delivering optimal results with a minimal subset of features. Figure 4 indicates the top features, ranked through the random forest feature selection algorithm, across distinct comparisons: (A) control vs RSV, (B) control vs influenza, (C) control vs COVID-19, (D) control vs all respiratory viruses, and (E) COVID-19 vs influenza/RSV. These visual representations concisely convey the discriminative power of the selected features in differentiating among the specified conditions.

3.2 Classification Model Results

The comprehensive evaluation process comprised five distinct scenarios: control vs RSV, control vs influenza A, control vs COVID-19, control vs all respiratory viruses, and COVID-19 vs influenza A/RSV. In the initial phase, 20 ML models were trained with a 5-fold dataset. The top ten performing models were carefully selected for each case, and a stacking-based ensemble technique was used to enhance predictive accuracy.
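Each scenario can be read as a binary relabeling of the four classes in the dataset. The sketch below shows one way to express that mapping. The dictionary layout, the function name, and the six toy labels are hypothetical, and treating COVID-19 as the positive class in the last scenario is an assumption.

```python
from collections import Counter

# Scenario name -> (classes labelled 0, classes labelled 1).
# Treating COVID-19 as the positive class in the last scenario is an assumption.
SCENARIOS = {
    "control vs RSV": (["control"], ["RSV"]),
    "control vs influenza A": (["control"], ["influenza A"]),
    "control vs COVID-19": (["control"], ["COVID-19"]),
    "control vs all respiratory viruses": (
        ["control"],
        ["RSV", "influenza A", "COVID-19"],
    ),
    "COVID-19 vs influenza A/RSV": (["influenza A", "RSV"], ["COVID-19"]),
}


def build_scenario(labels, scenario):
    """Return (kept_indices, binary_labels) for one scenario.

    Samples whose class is not part of the scenario are dropped.
    """
    negative, positive = SCENARIOS[scenario]
    kept, binary = [], []
    for index, label in enumerate(labels):
        if label in positive:
            kept.append(index)
            binary.append(1)
        elif label in negative:
            kept.append(index)
            binary.append(0)
    return kept, binary


# Hypothetical label vector, far smaller than the real dataset.
labels = ["control", "RSV", "COVID-19", "influenza A", "control", "COVID-19"]

for name in SCENARIOS:
    kept, binary = build_scenario(labels, name)
    counts = Counter(binary)
    print(f"{name}: {counts[0]} negative / {counts[1]} positive")
```

Because the pooled scenarios combine several classes on one side, their class ratios differ from those of the single-virus scenarios, which is relevant when interpreting the lower scores of the COVID-19 vs influenza A/RSV comparison.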

The application of stacking achieved notable improvements in the evaluation metrics, particularly for scenarios involving control vs all respiratory viruses and COVID-19 vs influenza A/RSV. However, for the remaining scenarios, no improvement in metrics was observed. Figure 5 visually depicts the top ten performing models across the five scenarios, thus providing a concise overview of the model performances in each distinct case.

Figure 5(A) indicates the outcomes for the control vs RSV scenario, with linear discriminant analysis emerging as the top-performing model. Demonstrating superior performance across various evaluation metrics, this model achieved an accuracy of 96.08%, precision of 96.13%, recall of 96.08%, specificity of 95.38%, F1-score of 96.07%, and an AUC of 95.92%. Figure 5(B) reveals the exceptional performance of SVM as the leading model in the control vs influenza A scenario. SVM outperformed other models, with an accuracy of 97.94%, precision of 98.01%, recall of 97.94%, specificity of 97.51%, F1-score of 97.93%, and an impressive AUC of 99.69%. In Fig. 5(C), the control vs COVID-19 scenario highlights SVM as the preeminent model, exhibiting an accuracy of 95.96%, precision of 96.02%, recall of 95.96%, specificity of 95.4%, F1-score of 95.95%, and AUC of 97.23%. Figure 5(D) reveals random forest as the top performer in the control vs all respiratory viruses scenario, achieving an exceptional 98.1% accuracy, 98.09% precision, 98.1% recall, 94.48% specificity, F1-score of 98.08%, and an AUC of 97.78%. In Fig. 5(E), logistic regression emerges as the superior performer in the COVID-19 vs influenza A/RSV scenario, with commendable metrics, including an accuracy of 86.14%, precision of 85.97%, recall of 86.14%, specificity of 80.3%, F1-score of 85.97%, and an AUC of 87.68%. Notably, the lower accuracy in this case was attributed to the class imbalance issue for COVID-19, with 55 samples, compared with 110 samples for influenza A/RSV.

Further detailed results for each case can be found in the Supplementary Material, including Table 3S to Table 7S. The supplementary tables offer comprehensive insights into the confusion matrices and AUC curves for the best-performing models in each scenario, as visually depicted in Fig. 3S and 4S.

3.3 Model Explainability According to SHAP Values

SHAP [43] helps understand the impact of each feature on the model's output for a particular prediction, offering valuable insights into the model's decision-making process. This method uniquely highlights the individual contribution of each feature towards a specific prediction, thereby providing a nuanced understanding of the global and local behaviors inherent in the model. By emphasizing transparency and elucidating the decision-making process, SHAP is aimed at instilling trust in the ML approach among end-users. SHAP not only enhances interpretability but also promotes a more informed and confident engagement with the model's predictions.

We used the random forest model to perform SHAP analysis in three unique scenarios for our research, considering all pertinent attributes. Figure 6 illustrates the effect of SHAP values on the model output across various scenarios. The horizontal axis shows the direction and magnitude of each feature's effect on the prediction (positive or negative SHAP value), and the color encodes the feature value: red indicates higher feature values, whereas blue indicates lower values.

In Fig. 6(A), for the control vs RSV scenario, the SHAP analysis highlights distinct feature effects on model predictions. Specifically, Met.SO (methionine sulfoxide) had a substantial positive effect on RSV predictions, indicative of higher concentrations in RSV cases than in controls. Notably, Ile, Val, Asp, and Phe also showed considerable positive effects, thus emphasizing their influential roles in predicting RSV cases. In Fig. 6(B), focusing on the control vs influenza A scenario, the SHAP analysis revealed LYSOC18:2 as the predominant metabolite feature with the greatest effect on predicting influenza A cases. In Fig. 6(C), for control vs COVID-19, LYSOC18:2 again emerged as the dominant feature, in agreement with previous findings by Bennet et al. [16], thereby establishing its value in distinguishing COVID-19 cases. Other notable metabolite features, including kynurenine, Phe, Val, Tyr, and Asp, contributed significantly to the predictive model. For the control vs all respiratory viruses scenario, as depicted in Fig. 6(D), LYSOC18:2 was the most dominant feature, thus indicating its crucial role in discriminating cases involving respiratory viruses collectively.

Finally, in the COVID-19 vs influenza A/RSV scenario represented in Fig. 6(E), carnosine emerged as the most dominant feature for predicting COVID-19 cases. This detailed analysis provided valuable insights into the specific metabolite features driving the predictive capability of the model across various respiratory virus classification scenarios.

3.4 Discussion

Respiratory viruses, including influenza A, RSV, and COVID-19, pose major health challenges [44–46]. Our work focused on leveraging LC–MS/MS metabolomics data to predict the presence of respiratory viruses in individuals, by discerning dominant metabolites contributing to accurate classification. Applying a similar method to various diseases allowed us to explore distinct metabolite profiles and gain insights into the underlying biochemical dynamics across different pathological conditions. ML models can discern complex patterns within the data [47] and identify subtle metabolic changes associated with specific viral infections. This approach enables a more nuanced understanding of disease dynamics.

A comprehensive statistical analysis was conducted for the control vs all respiratory viruses comparison, by using chi-square tests, rank-sum tests, and t-tests. Twenty ML models were trained for five distinct scenarios: control vs RSV, control vs influenza A, control vs COVID-19, control vs all respiratory viruses, and COVID-19 vs influenza A/RSV. Feature ranking techniques were applied to select the top ten features. Standard scaling was used to normalize the data, and a 5-fold dataset was created. Before model fitting, the SMOTE technique was used to address class imbalance.

Among the 20 ML models, the top ten performers were selected, and a stacking ML model was trained by using the three most successful models. The outcomes of each model are illustrated in Fig. 5. Notably, linear discriminant analysis excelled in the control vs RSV scenario, whereas SVM stood out in the control vs influenza A scenario. The control vs COVID-19 and control vs all respiratory virus scenarios indicated SVM and random forest as the leading models, respectively. Logistic regression emerged as the superior performer in the COVID-19 vs influenza A/RSV scenario.

Furthermore, SHAP values were used to evaluate the influence of features on model output, thus revealing the dominant features contributing to positive predictions. Specifically, for COVID-19, features such as LYSOC18:2, Kynurenine, Phe, Val, Tyr, and Asp significantly influenced the predictive model. These findings provide valuable insights into the metabolic signatures associated with respiratory virus infections.

Table 2

Comparison of evaluation metrics with other work.
Study	Model	Cases	Accuracy	Sensitivity	Specificity
Bennet et al. [16]	Supervised machine learning	Control vs all respiratory viruses	96%	98%	86%
Bennet et al. [16]	Supervised machine learning	COVID-19 vs influenza A/RSV	85%	74%	90%
Ours	Random forest	Control vs all respiratory viruses	98.10%	98.10%	94.48%
Ours	Logistic regression	COVID-19 vs influenza A/RSV	86.14%	86.14%	80.3%

Table 2 compares our results with those obtained by Bennet et al. [16]. For the scenario of control vs all respiratory viruses, Bennet et al. achieved an accuracy of 96%, sensitivity of 98%, and specificity of 86%. In contrast, our random forest model had higher accuracy (98.10%) and specificity (94.48%), with comparable sensitivity (98.10%). In the case of COVID-19 vs influenza A/RSV, Bennet et al. reported an accuracy of 85%, sensitivity of 74%, and specificity of 90%. Our logistic regression model had higher accuracy (86.14%) and sensitivity (86.14%) but lower specificity (80.3%). These results indicate that our proposed models achieved higher accuracy than the referenced supervised ML approach in both scenarios, with sensitivity and specificity that were higher in most, but not all, comparisons.

Our study leveraged advanced ML techniques to predict respiratory virus scenarios, specifically focusing on control vs all respiratory viruses and COVID-19 vs influenza A/RSV. With random forest and logistic regression models, we achieved higher accuracy than the supervised ML approach reported by Bennet et al. [16]. Our models also showed higher specificity in the control vs all respiratory viruses scenario and higher sensitivity in the COVID-19 vs influenza A/RSV scenario, although specificity in the latter scenario was lower (Table 2). The inclusion of feature ranking techniques, standard scaling, and the application of SMOTE to address class imbalances further enhanced the robustness of our models. SHAP analysis provided a deeper understanding of the key metabolites influencing positive predictions, thereby emphasizing the relevance of specific features such as LYSOC18:2, kynurenine, Phe, Val, Tyr, and Asp in contributing to accurate predictions. Overall, our approach improves on the accuracy of existing methods and provides valuable insights into the metabolic markers associated with respiratory virus classifications. This method may pave the way to enhanced diagnostic and predictive capabilities in the field of infectious diseases.

Conflicts of interest:

The authors declare that they have no conflict of interest.

Ethical approval:

This study utilizes the dataset shared by Bennet et al. [16]. Hence, the authors of this article were not involved in the data collection process.

Funding:

This study was supported by the collaborative funds from Qatar University. The statements made herein are solely the responsibility of the authors.

Author Contribution

Md. Shaheenur Islam Sumon: Contributed to the conceptualization of the study, data curation, methodology development, formal analysis, and writing of the original draft.
Md. Sakib Abrar Hossain: Participated in the data acquisition, software implementation, formal analysis, and validation of the study's results.
Haya Al-Sulaiti: Provided expertise in metabolomics, contributed to the interpretation of metabolome data, and critically reviewed and revised the manuscript for intellectual content.
Hadi M. Yassine: Provided expertise in respiratory viruses, contributed to the interpretation of virological data, and critically reviewed and revised the manuscript for intellectual content.
Muhammad E. H. Chowdhury: Supervised the study, provided funding acquisition, contributed to the study's design and conceptualization, and critically reviewed and edited the manuscript for intellectual content.

Gallo, O., et al. (2021). The central role of the nasal microenvironment in the transmission, modulation, and clinical progression of SARS-CoV-2 infection. Mucosal immunology, 14(2), 305–316.
Palese, P. (2004). Influenza: old and new threats. Nature medicine, 10(Suppl 12), S82–S87.
Centers for Disease Control and Prevention. (2022). Symptoms of COVID-19. Available from: https://www.cdc.gov/coronavirus/2019-ncov/index.html.
World Health Organization. (2009). Influenza. Available from: https://www.who.int/teams/health-product-policy-and-standards/standards-and-specifications/vaccines-quality/influenza.
Jha, A., et al. (2016). Respiratory syncytial virus. SARS, MERS and other viral lung infections.
Schreckenberger, P. C., & McAdam, A. J. (2015). Point-counterpoint: large multiplex PCR panels should be first-line tests for detection of respiratory and intestinal pathogens. Journal of clinical microbiology, 53(10), 3110–3115.
Somerville, L. K., et al. (2015). Molecular diagnosis of respiratory viruses. Pathology, 47(3), 243–249.
Tan, S. K. (2015). Molecular and culture-based bronchoalveolar lavage fluid testing for the diagnosis of cytomegalovirus pneumonitis. Open Forum Infectious Diseases. Oxford University Press.
Phan, T. (2020). Genetic diversity and evolution of SARS-CoV-2. Infection, genetics and evolution, 81, 104260.
Haljasmägi, L., et al. (2020). Longitudinal proteomic profiling reveals increased early inflammation and sustained apoptosis proteins in severe COVID-19. Scientific reports, 10(1), 20533.
Valdés, A., et al. (2022). Metabolomics study of COVID-19 patients in four different clinical stages. Scientific reports, 12(1), 1650.
Antonelli, G. (2013). Emerging new technologies in clinical virology. Clinical Microbiology and Infection, 19(1), 8–9.
Mancone, C., et al. (2013). Applying proteomic technology to clinical virology. Clinical microbiology and infection, 19(1), 23–28.
Burke, T. W., et al. (2017). Nasopharyngeal protein biomarkers of acute respiratory virus infection. EBioMedicine, 17, 172–181.
Nalbantoglu, S. (2019). Metabolomics: basic principles and strategies. Molecular Medicine, 10.
Bennet, S., et al. (2022). Small-molecule metabolome identifies potential therapeutic targets against COVID-19. Scientific Reports, 12(1), 10029.
Liebal, U. W., et al. (2020). Machine learning applications for mass spectrometry-based metabolomics. Metabolites, 10(6), 243.
Galal, A., Talal, M., & Moustafa, A. (2022). Applications of machine learning in metabolomics: Disease modeling and classification. Frontiers in genetics, 13, 1017340.
Beirnaert, C., et al. (2019). Using expert driven machine learning to enhance dynamic metabolomics data analysis. Metabolites, 9(3), 54.
Mendez, K. M., Reinke, S. N., & Broadhurst, D. I. (2019). A comparative evaluation of the generalised predictive ability of eight machine learning algorithms across ten clinical metabolomics data sets for binary classification. Metabolomics, 15, 1–15.
Kantz, E. D., et al. (2019). Deep neural networks for classification of LC-MS spectral peaks. Analytical chemistry, 91(19), 12407–12413.
Delafiori, J., et al. (2021). Covid-19 automated diagnosis and risk assessment through metabolomics and machine learning. Analytical Chemistry, 93(4), 2471–2479.
Hogan, C. A. (2021). Nasopharyngeal metabolomics and machine learning approach for the diagnosis of influenza. EBioMedicine, 71.
Hasan, M. R., Suleiman, M., & Perez-Lopez, A. (2021). Metabolomics in the Diagnosis and Prognosis of COVID-19. Frontiers in Genetics, 12, 721556.
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of machine learning research, 9(11).
Rahman, T., et al. (2021). Mortality prediction utilizing blood biomarkers to predict the severity of COVID-19 using machine learning technique. Diagnostics, 11(9), 1582.
Bridge, P. D., & Sawilowsky, S. S. (1999). Increasing physicians’ awareness of the impact of statistics on research outcomes: comparative power of the t-test and Wilcoxon rank-sum test in small samples applied research. Journal of clinical epidemiology, 52(3), 229–235.
Chowdhury, M. E. (2021). An early warning tool for predicting mortality risk of COVID-19 patients using machine learning. Cognitive Computation, 1–16.
Singh, D., & Singh, B. (2020). Investigating the impact of data normalization on classification performance. Applied Soft Computing, 97, 105524.
Chawla, N. V., et al. (2002). SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research, 16, 321–357.
Ferreira, P., Le, D. C., & Zincir-Heywood, N. (2019). Exploring feature normalization and temporal information for machine learning based insider threat detection. In 15th International Conference on Network and Service Management (CNSM). IEEE.
Tharwat, A., et al. (2017). Linear discriminant analysis: A detailed tutorial. AI communications, 30(2), 169–190.
Pal, M. (2005). Random forest classifier for remote sensing classification. International journal of remote sensing, 26(1), 217–222.
Keerthi, S. S., et al. (2001). Improvements to Platt's SMO algorithm for SVM classifier design. Neural computation, 13(3), 637–649.
Guo, G. (2003). KNN model-based approach in classification. In On The Move to Meaningful Internet Systems 2003: CoopIS, DOA, and ODBASE: OTM Confederated International Conferences, CoopIS, DOA, and ODBASE 2003, Catania, Sicily, Italy, November 3–7, 2003. Proceedings. Springer.
Chen, T. (2015). Xgboost: extreme gradient boosting. R package version 0.4-2, 1(4), 1–4.
Sharaff, A., & Gupta, H. (2019). Extra-tree classifier with metaheuristics approach for email classification. In Advances in Computer Communication and Computational Sciences: Proceedings of IC4S 2018. Springer.
Nusinovici, S., et al. (2020). Logistic regression was as good as machine learning for predicting major chronic diseases. Journal of clinical epidemiology, 122, 56–69.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In International workshop on multiple classifier systems. Springer.
Hossain, R., & Timmer, D. (2021). Machine learning model optimization with hyper parameter tuning approach. Glob J Comput Sci Technol D Neural Artif Intell, 21(2).
Tawsifur, R. (2022). QCovSML: A reliable COVID-19 detection system using CBC biomarkers by a stacking machine learning model.
Kim, Y., & Kim, Y. (2022). Explainable heat-related mortality with random forest and SHapley Additive exPlanations (SHAP) models. Sustainable Cities and Society, 79, 103677.
Lundberg, S. M., & Lee, S. I. (2017). A unified approach to interpreting model predictions. Advances in neural information processing systems, 30.
Ogra, P. L. (2004). Respiratory syncytial virus: the virus, the disease and the immune response. Paediatric respiratory reviews, 5, S119–S126.
Suarez, D. L. (2016). Influenza A virus. Animal influenza, 1–30.
Abu-Farha, M., et al. (2020). The role of lipid metabolism in COVID-19 virus infection and as a drug target. International journal of molecular sciences, 21(10), 3544.
Frank, M., Drikakis, D., & Charissis, V. (2020). Machine-learning methods for computational science and engineering. Computation, 8(1), 15.

No competing interests reported.


\(\mathrm{Accuracy}_{class\_x}=\frac{TP_{class\_x}+TN_{class\_x}}{TP_{class\_x}+TN_{class\_x}+FP_{class\_x}+FN_{class\_x}}\)	(2)
\(\mathrm{Precision}_{class\_x}=\frac{TP_{class\_x}}{TP_{class\_x}+FP_{class\_x}}\)	(3)
\(\mathrm{Recall/Sensitivity}_{class\_x}=\frac{TP_{class\_x}}{TP_{class\_x}+FN_{class\_x}}\)	(4)
\(\mathrm{F1\_score}_{class\_x}=2\times\frac{\mathrm{Precision}_{class\_x}\times \mathrm{Sensitivity}_{class\_x}}{\mathrm{Precision}_{class\_x}+\mathrm{Sensitivity}_{class\_x}}\)	(5)
\(\mathrm{Specificity}_{class\_x}=\frac{TN_{class\_x}}{TN_{class\_x}+FP_{class\_x}}\)	(6)
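
As an illustration of Equations (2) to (6), the following minimal Python sketch computes the per-class metrics by treating one class as positive and all others as negative (one-vs-rest). It is not the authors' implementation: the function name `per_class_metrics`, the `ClassMetrics` container, the toy labels, and the convention of returning 0.0 when a denominator is zero are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ClassMetrics:
    """Per-class metrics following Equations (2) to (6)."""
    accuracy: float
    precision: float
    sensitivity: float
    f1_score: float
    specificity: float


def _safe_div(numerator, denominator):
    # Assumed convention: an undefined ratio (zero denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def per_class_metrics(y_true, y_pred, class_x):
    """Compute one-vs-rest metrics for class_x from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == class_x and p == class_x)
    tn = sum(1 for t, p in pairs if t != class_x and p != class_x)
    fp = sum(1 for t, p in pairs if t != class_x and p == class_x)
    fn = sum(1 for t, p in pairs if t == class_x and p != class_x)

    accuracy = _safe_div(tp + tn, tp + tn + fp + fn)          # Eq. (2)
    precision = _safe_div(tp, tp + fp)                        # Eq. (3)
    sensitivity = _safe_div(tp, tp + fn)                      # Eq. (4)
    f1_score = _safe_div(2 * precision * sensitivity,
                         precision + sensitivity)             # Eq. (5)
    specificity = _safe_div(tn, tn + fp)                      # Eq. (6)
    return ClassMetrics(accuracy, precision, sensitivity, f1_score, specificity)


if __name__ == "__main__":
    # Toy labels for illustration only; these are not data from the study.
    y_true = ["covid", "covid", "covid", "control", "control"]
    y_pred = ["covid", "control", "covid", "covid", "control"]
    print(per_class_metrics(y_true, y_pred, class_x="covid"))
```

For a binary scenario such as control vs COVID-19, calling the function once per class gives the class-wise values; how those values are then averaged across classes or folds is a separate choice that this sketch does not make.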