Deciphering Novel SARS CoV-2 specic Disease Pathways from RNA Sequencing Data of COVID-19 Infected A549 Cells and Potential Therapeutics using Next Generation Knowledge Discovery Platforms

Background The coronavirus (CoV) disease identied in Wuhan, China in 2019 (COVID-19) was chiey characterized by atypical pneumonia and severe acute respiratory syndrome (SARS) and caused by SARS CoV-2 that belongs to the family Coronaviridae. COVID-19 symptoms vary from a mild cold to more severe illnesses such as SARS, thrombosis, stroke, organ failure, and in some patients even cause mortality. Deciphering the underlying disease mechanisms is pivotal for the identication and development of COVID-19 specic drugs for effective treatment and prevent human-to-human transmission, disease complications, and mortality. Methodology: Here, the Next Generation RNA Sequencing (RNA Seq) data using Illumina Next Seq 500 from SARS CoV-infected A549 cells and mock-treated A549 cells, were obtained from the gene expression omnibus (GEO) (GSE147507) and the Quality Control (QC) were evaluated using the CLC Genomics Workbench 20.0 (Qiagen, USA) before the RNA Seq analysis. The DEGs were imported into BioJupies to analyze to decipher COVID-19 induced biological, molecular, and cellular processes, pathways, and small molecules derived from chemical synthesis or natural sources to mimic or reverse COVID-19-specic gene signatures. Besides, we have used the iPathwayGuide (Advaita Bioinformatics USA) to identify COVID-19 specic pathways, biological, molecular, and cellular processes, and “druggable” candidates for future therapy. Results: 141 DEGs were identied out of a total of 9665 DEGs obtained from BioJupies analysis of the RNASeq reads of the SARS CoV infected A549 cells and mock-treated A549 cells based on a p-value cut off (0.05) and a fold change cut off 1.5. Conclusion: In conclusion, the present study unravels a novel approach of using next-generation knowledge discovery platforms to discover specic drugs for the amelioration of COVID-19 related disease pathologies.


Introduction
Coronavirus Disease 2019  is caused by a type of coronavirus (CoV), severe acute respiratory syndrome (SARS) virus 2 (SARS CoV-2). COVID-19 is characterized by symptoms ranging from a mild cold to more severe ailments such as SARS, sudden stroke, gastrointestinal complications, multiple organ failure, and even cause mortality in some patients (Lamers et al., 2020;Mulay et al., 2021). Coronaviruses belong to the family of Coronaviridae and the presence of viral spike proteins in the virus gives a halo or corona-like appearance under electron microscopy (Fig. 1A). A novel coronavirus virus (nCoV) was discovered in Wuhan, China in 2019 as the cause of an outbreak of respiratory illness in humans leading to severe atypical pneumonia  and it is the source of the current global pandemic that hampered all walks of life (Agarwal et al., 2021) The World Health Organization (WHO) renamed this nCoV as SARS-CoV 2 which is the causative agent for COVID-19 (Dong et al., 2020;Wang et al., 2020;Wu et al., 2020). The COVID-19 is highly transmissible and pathogenic compared to other viral infections and the exact mortality rate is yet to be determined since the pandemic is still not under control in several countries leading to unprecedented protective measures, partial or total lockdowns, travel restrictions, etc., (Mandolesi et al., 2021;Zhonghua et al., 2019). As of 4th March 2021, based on the Johns Hopkins Coronavirus Dashboard metrics, COVID-19 had already infected more than 154 million in 192 countries and territories around the world and killed about 3.2 million people. However, about 90 million people recovered from COVID-19 to date. Nevertheless, the precise mortality rate will be calculated or identi ed once the COVID-19 epidemic reaches a plateau. The United States of America (USA)and the WHO has declared the SARS CoV-2 outbreak a public health emergency since it is more contagious than Severe Acute Respiratory Syndrome Coronavirus (SARS-CoV) (Wu et al., 2020), and the Middle East Respiratory Syndrome Coronavirus (MERS-CoV) (Liu et al., 2021). SARS-CoV-2 has a nucleocapsid with a positive-sense RNA genome. The host cells express SARS-CoV-2 nucleoproteins and the nucleocapsid protein (N-protein) that is the most abundant, highly immunogenic, and required for CoV RNA synthesis. The N-protein is a structural protein that attaches to the CoV RNA genome and forms a capsid around the viral RNA. Though, the spike protein (S-protein) is crucial for the attachment between the SARS CoV-2 and the Angiotensin-Converting Enzyme 2 (ACE2) surface receptors on the host cells (Fig. 1B) and thus facilitates the coronavirus to penetrate the host cells (Fig. 1C). (Ols et al., 2020).
Even though the COVID-19 vaccines are currently available as a preventive measure and many are under the research and development phase (Bezbaruah et al., 2021;Mandolesi et al., 2021), decoding the underlying pathological mechanisms is pivotal for the identi cation and development of COVID-19 speci c drugs for the effective treatment and prevention of human-to-human transmission, COVID-19 complications, and mortality.
Henceforth, in the present study, the raw RNA Seq reads (Single-End) (FASTQ les) in quadruplicates derived using Illumina Next Seq 500 from SARS CoV-infected A549 cells and mock-treated A549 cells, were obtained from the gene expression omnibus (GEO) (Accession Number: GSE147507) and the Quality Control (QC) were evaluated using the CLC Genomics Workbench 20.0 (Qiagen, USA) before the RNA Seq analysis. After the initial QC, the RNA Seq reads were imported into CLC Genomics Workbench 20.0 (Qiagen, USA) before the RNA Seq analysis and applied next-generation knowledge discovery platforms (NGKD) such as BioJupies (Torre et al., 2018) and iPathwayGuide (Advaita Bioinformatics, USA) to decipher the disease-speci c molecular signatures and an array of small molecules derived from either synthetic or natural sources to mimic or reverse COVID-19 gene signatures. The present study outlines an innovative method of applying NGKD platforms to discover precise drugs and natural products for the potential treatment of COVID-19 related disease pathologies.

Ethical Statement
This study did not use any animal models or human subjects and was performed using the RNASeq datasets from Next Generation Sequencing experiments using A549 cells. The raw data were retrieved from the Gene Expression Omnibus (GEO) as stated in the Data Source section below. Hence, it was exempted from the Institution Review Board (IRB) approval.

NGS Data Source
The raw RNA Seq reads (Single-End) (FASTQ format) in quadruplicates derived using Illumina Next Seq 500 from A549 cells infected with SARS CoV-2, and mock-treated A549 cells, were obtained from the gene expression omnibus (GEO) (Accession No: GSE147507) were used for analysis using high-throughput knowledge discovery platforms.

COVID-19 RNA Seq Data from A549 Cells -Quality Control (QC)
The raw RNA Seq reads (Single-End) in quadruplicates (FASTQ les) derived using Illumina Next Seq 500 from SARS CoV-infected A549 cells and mock-treated A549 cells were derived from the GEO and the Quality Control (QC) was evaluated using the CLC Genomics Workbench 20.0 (Qiagen, USA) before the RNA Seq analysis to obtain the differentially expressed genes (DEGs).
COVID-19 RNA Seq Data from A549 Cells-Differential Gene and Transcript Expression Analysis The RNA Seq reads were imported into CLC Genomics Workbench 20.0 (Qiagen, USA) after the QC step. The RNA Seq Analysis tool in the Biomedical Genomics Analysis plugin of the CLC Genomics Workbench was used to extract all annotated transcripts using both Homo sapiens (hg38) _Gene (Gene track) and Homo sapiens (hg38) _mRNA (mRNA track) and mapped to the human reference genome (GRCh38). The Gene Expression Track (GE) was generated for SARS CoV-2 infectedA549 cells, and corresponding mock reads (Test vs Control), respectively. Furthermore, the Differential Expression in Two Groups tool in CLC Genomics Workbench was used to perform a statistical differential expression test for a set of Expression Tracks (Test vs Control). Here, a multi-factorial statistic based on a negative binomial Generalized Linear Model (GLM) is used, and the Differential Expression in Two Groups tool handles one factor and two groups. In this analysis, for GE, the values used are "Total Exon Reads". Differentially expressed genes (DEGs) were then generated for Test compared to the corresponding Control and used for further downstream analysis using next-generation knowledge discovery platforms.

BioJupies Analysis of the RNASeq Data
BioJupies with 14 different plug-ins were used to analyze DEGs generated using CLC Genomics Workbench to identify novel pathways, disease-speci c gene networks, and an array of drugs and small molecules obtained from natural sources to mimic or reverse disease-speci c gene signatures (Torre et al., 2018). In Biojupies, the RNASeq Datasets were user-submitted, compressed in an HDF5 data package, and uploaded to Google Cloud. Raw counts were normalized to log10-Counts Per Million (log CPM) and the differentially expressed genes were derived between the control group and the experimental group using the limma R package (Ritchie et al., 2015)). The Principal Component Analysis Function in the sklearn Python module was used to transform log CPM based on the Z-score method to generate the PCA plot and Clustergrammer (Fernandez et al., 2017)), was used to generate interactive heatmaps and the DEGs were then sent to Enrichr (Kuleshov et al., 2016). In the Volcano Plot, the DEGs were shown on the x-axis and p-values were corrected using the , and presented on the y axis (Benjamini and Yekutieli, 2001)).
However, the average gene expression was shown on the x-axis in the MA plot, and the P-values were corrected using the Benjamini-Hochberg method (Benjamini and Hochberg, 1995;Benjamini and Yekutieli, 2001), transformed (-log10), and presented on the y axis. Gene ontology (GO) and Pathway enrichment analyses were performed using both the up-regulated and down-regulated genes in Enrichr. Signi cant GO terms and Pathways (KEGG, WiKiPathways, and Reactome) are computed by using a cut-off of p-value<0.1 after applying Benjamini-Hochberg correction (Benjamini and Hochberg, 1995;Benjamini and Yekutieli, 2001).

L1000CDS 2 and L1000FWD Queries
The L1000CDS2 analysis was performed by submitting the top 2000 DEGs to the L1000CDS2 signature search API (Duan et al., 2016). Similarly, the L1000FWD analysis was performed by submitting the top 2000 DEGs to the L1000FWD signature search API (Wang et al., 2018).
In Silico Analysis of the RNASeq Expression Data using iPathwayGuide The Impact Analysis Method (IAM) (Draghici et al., 2007;Tarca et al.) in the iPathwayGuide was used to determine the signi cantly impacted gene signatures, pathways, miRNAs, and metabolites in the cell lines treated with anti-cancer drugs compared with the corresponding untreated control. The pathway score is calculated based on a p-value computed using Fisher's method. The p-value is corrected based on multiple testing corrections for false discovery rate (FDR) and Bonferroni corrections (Bonferroni, 1935;Bonferroni, 1936). The FDR has signi cant power but it controls only the family-wise false positives rate (Benjamini and Hochberg, 1995;Benjamini and Yekutieli, 2001). The pathways and the gene interactions using the DEGs were generated using the KEGG database (Kanehisa and Goto, 2000;Kanehisa et al., 2012;Kanehisa et al., 2014). For each Gene Ontology (GO) term (Ashburner et al., 2000), the number of DEGs annotated to the term when compared to the DEGs expected just by chance. iPathwayGuide uses an over-representation approach to compute the statistical signi cance of observing at least the given number of DEGs (Draghici, 2016;Draǧhici et al., 2003). The hypergeometric distribution was used to compute the p values in iPathwayGuide analysis and corrected using FDR and Bonferroni for multiple comparisons (Draghici, 2016;Draǧhici et al., 2003).

Prediction of Upstream Drugs or Natural Products using iPathwayGuide
The prediction of upstream Chemicals, Drugs, Toxicants (CDTs) is based on two types of information: i) the enrichment of DEGs from the experiment and ii) a network of interactions from the Advaita Knowledge Base (AKB v2012). The edges represent known effects that these CDTs have on various genes. A signed edge in this graph consists of a source CDT, a target gene, and a sign to indicate the type of effect: activation (+) or inhibition (-) (Draǧhici et al., 2003;Draghici et al., 2007).

Upstream CDTs predicted as present (or overly abundant)
Here, the research hypothesis considers the presence of the CDT. This hypothesis is useful when investigating whether the given phenotype has been impacted by the presence of a given chemical, drug, or toxicant (Draǧhici et al., 2003;Draghici et al., 2007). For each CDT u, the number of consistent DE genes downstream of u, DTA(u) is compared to the number of measured target genes expected to be both consistent and DE just by chance. iPathwayGuide uses an over-representation approach to compute the statistical signi cance of observing at least the given number of consistent DE genes. The p-value Ppres is computed using the hypergeometric distribution (Draǧhici et al., 2003;Draghici et al., 2007). The analysis uses the standard Fisher's method to combine p-values into one test statistic (Fisher, 1925).
Upstream CDTs predicted as absent (or insu cient) In parallel with upstream CDTs predicted as present, Pabs and Pz are used to predict upstream CDTs that are absent. This hypothesis is relevant when investigating whether the given phenotype has been impacted by the lack of a given chemical that is necessary for the well-functioning of the organism or cell. Here, the research hypothesis states that the upstream CDT is insu cient in the condition studied. For each upstream CDT u, the number of consistent DE genes downstream of u, DTI(u) is compared to the number of measured target genes expected to be both consistent and DE just by chance. Using Fisher's method as above, the analysis combines Pabs and Pz, where Pz is considered only for signi cant negative z-scores ( z ≤ -2) (Draǧhici et al., 2003;Draghici et al., 2007).

Swiss Target Prediction of Potential Anti-COVID-19 Compounds
Isomeric Simpli ed Molecular Input Line Entry System (SMILES) codes of Prednisolone and Withaferin-A was used in the SwissTargetPrediction tool to nd the protein targets (Bahlas et al., 2020). The ligand-based target prediction for both prednisolone and Withaferin-A was performed as described before (Bahlas et al., 2020;Kalamegam et al., 2020).

The Open Targets Platform Analysis of Anti-COVID-19 Compounds
The Open Targets Platform web tool was used to uncover the Prednisolone and Withaferin-A molecular targets associated with COVID-19 disease pathology (Bahlas et al., 2020;Carvalho-Silva et al., 2019). The Open Targets Platform uses scienti c evidence to score and rank target-disease associations and aid target prioritization (Carvalho-Silva et al., 2019). The query list with about 100 eligible molecular targets of Prednisolone and Withaferin-A was utilized to discover the protein targets associated signi cantly (P < 0.05) with COVID-19.

Results
Here, the raw RNA Seq reads (Single-End) (FASTQ les) in quadruplicates derived using Illumina Next Seq 500 from SARS CoV-infectedA549 cells, and mock-treated A549 cells were obtained from the gene expression omnibus (GEO) (GSE147507) and the Quality Control (QC) were evaluated before the RNA Seq analysis using the CLC Genomics Workbench 20.0 (Qiagen, USA). The DEGs were further analyzed using BioJupies (Torre et al., 2018), and iPathwayGuide (Advaita Bioinformatics, USA) to decipher the disease-speci c signatures and an array of drugs and small molecules derived from natural sources to mimic or reverse disease-speci c gene signatures.
The global patterns in high-dimensional RNA seq datasets were uncovered by interactive PCA analysis ( Figure  2A). Clustergrammer web tool was used to generate interactive heatmaps for visualization and in-depth analysis of DEGs derived from the high-dimensional RNASeq data of SARS CoV-infected A549 cells, and mock-treated A549 cells (Figure 2A -C). The volcano plot was generated using transformed gene fold changes using log2 and shown on the x-axis ( Figure 2D). MA plot was based on average gene expression which was calculated using the mean of the normalized gene expression values and shown on the x-axis ( Figure 2E).
The interactive bar chart ( Figure 3A) shows the top small molecules identi ed by the L1000CDS2 query. The left panel displays the small molecules such as Calyculin A, Emetine Hydrochloride, Narciclasine, NVP-TAE684, wiskostatin, NCGC00185684-02, and Amsacrine which mimic the observed gene expression signature, while the right panel displays the small molecules such as Trichostatin A, Vorinostat, afatinib, DL-PDMP, Withaferin-A, IMD 0354, and 2-[(chloroacetyl) (4-uorophenyl] amino-N-cyclohexyl-2 pyridine 3 which reverse it. Besides, the natural products and drugs with opposite and similar molecular signatures (Table 1) based on the L1000FWD tool that contains gene signatures from an array of human cell lines administered with more than 20,000 drugs and natural products. Withaferin-A, an active ingredient of the medicinal plant ( Figure 3A), Withania somnifera was found to reverse the COVID-19 induced molecular signatures in both L1000CDS2 and L1000FWD analyses along with other small molecule drugs.
The GO enrichment analysis for the Biological Processes, Molecular Function, and Cellular Component were generated using Enrichr ( Figure 4A). The x-axis indicates the -log10(P-value) for each term and the signi cant terms enriched in each GO category were highlighted in bold. Similarly, Figure 4B shows and diseases from the KEGG database (Release 96.0+/11-21, Nov 20 (Kanehisa and Goto, 2000).
iPathwayGuide analysis further showed that 34 pathways were found to be signi cantly impacted in the SARS CoV2 infected A549 cells compared to the mock-treated A549 cells. Also, 557 Gene Ontology (GO) terms, 224 gene upstream regulators, 451 chemical upstream regulators, and 31 diseases were found to be signi cantly (P<0.05) enriched before the correction for multiple comparisons.
The top ve upstream regulators identi ed after Bonferroni Correction STAT2, IRF9, IFNB1, IL1B, and IRF3 were predicted as activated ( Functions identi ed include chemokine receptor binding, chemokine activity, CXCR chemokine receptor binding, receptor-ligand activity, signaling receptor activator activity, and the top Cellular Components identi ed include blood microparticle, brinogen complex, nuclear outer membrane, extracellular space, and extracellular region for each pruning type were provided in Table 3. The upstream regulator drugs obtained either based on chemical synthesis or natural sources with opposite molecular signatures were also identi ed based on iPathwayGuide Analysis The drugs that can signi cantly reverse the molecular impact of COVID-19 infection are Methyl Prednisolone, Prednisolone, Gold Sodium Thiomalate, Tofacitinib, Diclofenac, JQ1 Compound, Azathioprine, etc., ( Figure 3B). The upstream regulator drugs and natural products with opposite molecular signatures identi ed using iPathwayGuide sorted based on Z score were listed in Supplementary Table 1.

In silico Prediction of the Molecular Targets of Prednisolone and Withaferin-A Using SwissTarget Prediction
In the present study, the SwissTargetPrediction was performed for Prednisolone and Withaferin A using the

Discussion
The COVID-19 is highly infectious and pathogenic compared to other viral infections and the exact mortality rate is yet to be determined since the pandemic is still not under control in several countries (Zhonghua et al., 2019). Hence, deciphering the underlying pathological mechanisms is pivotal for the identi cation and development of COVID-19 speci c drugs for the effective treatment and prevention of human-to-human transmission, COVID-19 complications, and reduce mortality. COVID-19 is usually characterized by cough, breathing problems, high body temperature, diarrhea, abdominal discomfort, and in severe cases causes' atypical pneumonia, SARS, stroke, thrombosis, multiple organ failure, and in some cases cause mortality. It was found that about 80% of the COVID-19 cases were having mild symptoms or asymptomatic, with the elderly, and individuals with other comorbid conditions were more prone to develop severe symptoms and succumb to the disease Zhonghua et al., 2019).
To differentiate the COVID-19 from other in uenza viruses, SARS, and MERS coronaviruses, etc. is essential in a clinical setting to design effective or e cient treatment strategies for the patients (Huang et al., 2020). Also, non-infectious diseases like idiopathic interstitial pneumonia, or cryptogenic organizing pneumonia, dermatomyositis, vasculitis, etc., must be differentially diagnosed from COVID-19 (Huang et al., 2020;Zhonghua et al., 2019).
The COVID-19 infection of the A549 cells activated the upstream genes such as STAT2, IRF9, IFNB, IL1B, IRF3, etc. The biological processes such as type I interferon signaling pathway, defense response to the virus, negative regulation of viral genome replication, and the interferon-gamma-mediated signaling pathway were differentially regulated. The molecular functions such as chemokine activity, CXCR chemokine receptor binding, 2'-5'-oligoadenylate synthetase activity, double-stranded RNA binding, and protein ADP-ribosylase activity were enriched in the COVID infected cells. Cytokines are the hormones of the immune system and are important for both innate and adaptive host responses, cell growth and differentiation, repair and development, cellular homeostasis, and cell death (Harakeh et al., 2020). Cytokines are glycoproteins released upon any external stimuli and bind with speci c cell surface receptors on the plasma membrane of target cells to elicit their responses (Pushparaj, 2019) The cytokine/chemokine storm observed in the moderate to severe cases of COVID-19 is caused by a signi cant increase in the levels of an array of circulating cytokines and chemokines such as IL-6, IL-8, TNFα, CXCL-10, etc., and contribute to the poor prognosis (Vaninov, 2020) In general, the viruses develop mechanisms to avoid detection and the subsequent destruction inside the host by repurposing and copying cytokine and cytokine receptor genes. (Pushparaj, 2019;Pushparaj et al., 2007;Sowmya et al., 2011). Similarly, the COVID-19 induced cytokines and cytokine receptors, chemokines, and other speci c cytokine receptors and binding proteins to destabilize and modify the host cytokine responses and immune networks (Pushparaj et al., 2007). Here, the COVID-19 induced chemokines and cytokines may either augment or prevent cytokine signaling and may signi cantly change or attenuate various arms of host immunity. Besides, cellular processes such as blood microparticle brinogen complex were activated in the COVID infected A549 cells. The increase in the cellular processes such as blood microparticle as observed in the present study was further con rmed by a recent study showing the increased circulating blood microparticles and activated platelets in COVID-19 patients (Zahran et al., 2021).
COVID-19 pandemic is currently managed by vaccines, convalescent plasma, monoclonal antibodies, antiviral medications such as remdesivir, and taking preventive procedures such as wearing masks, hand hygiene measures, social distancing, etc. (Sewell et al., 2020). In our present study, Withaferin-A was predicted to oppose the molecular signatures triggered by . The open targets platform analysis predicted that 36 targets were found to play a role in the COVID-19 disease pathology. Withaferin-A is one of the constituents of the medicinal plant, Withania somnifera (Indian Ginseng or Ashwagandha). The active ingredients mainly include withanolides, saponins, alkaloids, and steroidal lactones. W. somnifera is used in herbal formulations in traditional medicine possess antioxidant, anti-anxiety, anti-in ammatory, anti-bacterial, and aphrodisiac properties, etc (Sood et al., 2018). Ashwagandha has neuroprotective, cardioprotective, immunomodulating, and anti-cancer properties (Singh et al., 2021). Also, a recent in silico screening study recognized that ashwagandha has natural compounds against COVID-19 (Srivastava et al., 2020).
Similarly, the traditional Chinese herbal formulation, JinFuKang is composed of 12 medicinal plants, and each dosage included 10 mL (Cassileth et al., 2009;Jiao et al., 2015). JinFuKang has anti-cancer properties and an array of medicinal bene ts (Que et al., 2021). Since the antiviral remdesivir is very minimal in reducing mortality (Wilt et al., 2021), the use of corticosteroids increases the possibility of secondary infections (Gopalaswamy and Subbian, 2021), and monoclonal antibody therapies are either expensive or di cult to procure, it is better to investigate the potential compounds identi ed in our study for the potential treatment and amelioration of COVID-19 and related pathologies.

Conclusion
In conclusion, the present study unravels an innovative approach of using next-generation high throughput RNA sequencing technologies coupled with NGKD platforms to decipher speci c drugs obtained either synthetically or from natural products for the betterment of COVID-19. Further studies are required to validate the drugs such as Prednisolone, Methylprednisolone, Diclofenac, JQ1 compound, etc. and the natural products such as Withaferin-A and JinFuKang in COVID-19 infection model systems such as primary human alveolar epithelial cells and human small intestinal organoids (hSIOs) (Mulay et al., 2021)to infer the mechanism(s) of action before pre-clinical trials and clinical trials for the potential treatment of COVID-19 related disease pathologies.  Wang, Z., Lachmann, A., Keenan, A.B., Ma'ayan, A., 2018  6.389e-5 3.194e-4 * the p-value corresponding to the pathway was computed using only over-representation analysis.