Host transcriptome alterations in vitro and in vivo following severe acute respiratory syndrome coronavirus 2 infection

Background. Exploring alterations in the host transcriptome following SARS-CoV-2 infection is not only highly warranted to help us understand molecular mechanisms of the disease, but also provide new prospective for screening effective antiviral drugs, finding new therapeutic targets, and evaluating the risk of systemic inflammatory response syndrome (SIRS) early. Methods. We downloaded three gene expression matrix files from the Gene Expression Omnibus (GEO) database, and extracted the gene expression data of the SARS-CoV-2 infection and non-infection in human samples and different cell line samples, and then performed gene set enrichment analysis (GSEA), respectively. Thereafter, we integrated the results of GSEA and obtained co-enriched gene sets and co-core genes in three various microarray data. Finally, we also constructed a protein-protein interaction (PPI) network and molecular modules for co-core genes and performed Kyoto encyclopedia of genes and genomes (KEGG) pathway analysis for the genes from modules to clarify their possible biological processes and underlying signaling pathway. Results. A total of 11 co- enriched gene sets were identiﬁed from the three various microarray data. Among them, 10 gene sets were activated, and involved in immune response and inflammatory reaction. 1 gene set was suppressed, and participated in cell cycle. The analysis of molecular modules showed that 2 modules might play a vital role in the pathogenic process of SARS-CoV-2 infection. The KEGG enrichment analysis showed that genes from module one enriched in signaling pathways related to inflammation, but genes from module two enriched in signaling of cell cycle and DNA replication. Particularly, necroptosis signaling, a newly identified type of programmed cell death that differed from apoptosis, was also determined in our findings. Additionally, for patients with SARS-CoV-2 infection, genes from module one showed a relatively high-level expression while genes from module two showed low-level. Conclusions. We identified two molecular modules were used to assess severity and predict the prognosis of the patients with SARS-CoV-2 infection. In addition, these results provide a unique opportunity to explore more molecular pathways new targets in COVID


Introduction
The coronavirus disease 2019 (COVID-19) epidemic caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is one of the major public health emergencies worldwide. As of June 16, 2020, more than 8,300,000 cases were confirmed, with 446,000 deaths and about 5.3% estimated mortality rate. The rapid increase in the number of patients, especially critically ill patients, has brought a big challenge to the public health. Generally, the infection with SARS-CoV-2 spreads from person to person via respiratory droplets; however, other special transmission vectors, such as aerosol, fecal-oral route, and intermediate fomites from both symptomatic and asymptomatic patients during the incubation period, are also our focus 1 . Although, nonspecific gastrointestinal symptoms such as diarrhea is observed in several patients with infection 2 ,majority of the patients have flu-like initial symptoms. Among the patient population, smokers, and male above 65 years of age are generally at a greater risk of disease progression that involves critical or lethal conditions such as severe pneumonia, acute respiratory distress syndrome (ARDS), multiple organ failure, and even death 3 . Based on the data published from the Chinese Center for Disease Control and Prevention, more than 80% of patients have mild conditions, and less than 20% have severe conditions. Alarmingly, the mortality rate was nearly 50% in patients with severe conditions 4 . According to the current evidence, the pathogenesis of the SARS-CoV-2 infection is associated with dysregulated inflammatory response, tissue injury, and multiple organ dysfunction; since the details of infection remain obscure, it is necessary to explore at the gene level to know how the disease exactly progresses from mild to severe conditions. This is not only warrant to help understand disease progression, but also provides new perspective for screening effective antiviral drugs, finding new therapeutic targets, and evaluating the risk of systemic inflammatory response syndrome. Hence, we designed this study of integrated bioinformatics analysis based on the Gene Expression Omnibus (GEO) (https://www.ncbi.nlm.nih.gov/geo/) database 5,6 ， using gene expression dada of lung tissues from healthy individuals and patients with infection, with a view to discover valuable mRNA biomarkers, potential pathogenic mechanisms, and finally to guide clinical diagnosis and treatment.

Microarray data
We downloaded GSE148815 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE148815), GSE150316 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi), and GSE147507 7 (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE147507) data sets from GEO database. In case of in vivo level exploration, we extracted the gene expression data of adults from GSE148815 and GSE150316 data sets labeling as healthy samples and SARS-CoV-2 infected samples, respectively. Then, we integrated two sets of raw-counts data to compose the patientmicroarray data used in the study. And in case of in vitro exploration, we extracted the gene expression data of normal human bronchial epithelial (NHBE) cell line and A549 (lung adenocarcinoma) cell line to form two cell line microarray data (NHBE data and A549 data). In order to reduce the effect of errors caused by gene chips on the pattern of results and our conclusions, data preprocessing was performed according to the following procedures using R software (R 3.6.1, https://cran.r-project.org/): 1) standardizing microarray data utilizing DESeq2 package 8 , 2) log2 transformation, 3) handling missing or unknown values, 4) checking and adjusting batch effects, 5) correction of value of log fold change (logFC) to reduce the error caused by a gene with a constant expression, such as zero. The detailed methods of data processing are described in Text S1. For the microarray data were downloaded from a public database, patient consent or ethics committee approval was not necessary.

Differential analysis
Differential analysis was performed on the three microarray data using the DESeq2 package to identify the differentially expressed genes (DEGs) between the samples with SARS-CoV-2 infection and without infection. As a result, logFC and statistical p values were obtained. When logFC>0, that the gene is highly expressed in the SARS-CoV-2 infected group, and when logFC<0 the gene is expressed in a low level. The result was visualized as a heatmap with hierarchical clustering using the pheatmap package (https://CRAN.Rproject.org/package=pheatmap).

Gene set enrichment analysis (GSEA) and identification of core genes
GSEA, based on functional class scoring methods, was performed with the results of DEGs using phenotype labels "SARS-CoV-2 infected" versus "non-infected" by the clusterProfiler package 9 to elucidate relevant biological significance. First, we converted the gene symbols to Entrez IDs, and then analyzed by comparing it with the data of hallmark gene set (Version 7.1) 10 downloaded from the Molecular Signatures Database (MSigDB, https://www.gsea-msigdb.org/gsea/msigdb/) 11 . The results with p value<0.05 are plotted as a bubble chart. After that, we obtained co-enriched hallmark gene sets and their co-core genes from the GSEA results of the three microarray datasets.

Protein-protein interaction (PPI) network and molecular modules construction
The PPI network of co-core genes was constructed in the STRING database (https://string-db.org/) through the following setting: meaning of network edges was set as "confidence", all active interaction sources were used, minimum required interaction score was selected as "highest confidence (0.4500)", and finally display simplification was to hide disconnected nodes in the network 12,13 . The PPI data were downloaded and key molecular modules were identified using MCODE plug-in Cytoscape software (version 3.7.1, https://cytoscape.org/). In MCODE, filters were based on the parameters as "Network Scoring ticked Include Loops and Degree Cutoff =2," "Cluster Finding ticked Haircut, Node Score Cutoff =0.2, K-Core =2, and Max. Depth =100" 14 .

Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment analysis
KEGG pathway enrichment analysis was conducted with genes in each molecular module to identify the biological functions of modules using clusterProfiler 9 package. An adjusted value of p<0.05 was considered to be statistically significant, and the results were visualized by GOplot package. In addition, we used data of patients to evaluate the expression levels of genes from the modules in patients with SARS-CoV-2 infection compared to healthy individuals, and the results are shown as a box plot.

Microarray data
The data processing workflow is illustrated in Figure 1. The results corresponding to data processing are explained in Text S1, which indicated that the gene expression used in our study was uniformly distributed.

Differential analysis
The heatmaps (Figure 2A-C) reflected the variance in gene expression relative to the mean across all samples with high-level expressed genes represented in red and low-level expressed genes represented in green. For this outcome, we found that there was a significant difference in gene expression levels between samples with SARS-CoV-2 infection and without infection, which also provided a foundation for us to further explore the genetic changes.

GSEA and identification of core genes
The GSEA results between samples with SARS-CoV-2 infection and without infection are shown in Figure Figure   3C). We merged the three GSEA results and obtained 11 hallmark gene sets co-enriched, including 10 activated and 1 suppressed ( Figure 3D), and the details are listed in Table 1. Then, we extracted all core genes in the co-enriched gene sets from the three datasets (499 core enriched genes in data of patients, 293 in NHBE data, and 438 in A549 data), and obtained 56 co-core genes ( Figure 3D), which are described in Table S1. From these findings, we found that in samples with SARS-CoV-2 infection, inflammation-related hallmark gene sets such as IL6-Jak-Stat3-signaling, IL2-Stat5-signaling, complement, coagulation, KRAS-signaling-up, and TNFαsignaling were activated while cell cycle related hallmark gene set such as E2F-targets was suppressed. Additionally, we also found that inflammation-related genes such as IL6, ICAM1, IL1B, and TNF were involved in the co-enriched gene sets. These findings indicated that, to some extent, inflammatory genes contribute greatly to the biological process of disease progression and the interaction of molecular function in samples with SARS-CoV-2 infection.

PPI network and molecular models
The PPI network for 56 co-core genes was built using the STRING database and the result contained 54 nodes and 234 edges ( Figure 4A). Through MCODE, we identified two key molecular modules, and their gene make-ups were shown in Figure 4B-C ( Figure 4B as module 1, Figure 4C as module 2). To our surprise, nearly one third of the co-core genes were members of module 1, suggesting that the module plays a crucial role in the PPI network, which requires special attention.

KEGG pathway enrichment analysis
50 of the KEGG pathways were enriched by the genes from module 1 (Table S2), including multiple specific infection-related signaling pathways caused by malaria, legionellosis, measles, tuberculosis, influenza A, and hepatitis B. In addition, it also included multiple inflammationrelated signaling pathways such as TNF signaling pathway, IL-17 signaling pathway, PI3K-Akt signaling pathway, and Jak-Stat signaling pathway ( Figure 5A). For genes from module 2, four pathways ( Figure 5B) related to cell cycle and DNA replication were identified. Finally, with the results of gene expression level ( Figure 5C), we found that, for patients with SARS-CoV-2 infection, genes from module 1 showed a relatively high level of expression in contrast to genes from module 2. It should be noted specifically that our data source was from patients who died due to SARS-CoV-2 infection. These results signified that high and low expression of genes in module 1 and 2, respectively, might signal severe negative condition in patients with SARS-CoV-2 infection.

Discussion
As obligate intracellular parasites, viruses must rely on host cells for replication, and hence their infection causes changes in the morphology and function of host cells. SARS-CoV-2 is a monopartite, single-stranded, and positive-sense RNA virus with a genome size of 29,903 nucleotides, making it the second largest known RNA genome 15 . Severe COVID-19 is characterized by respiratory failure caused by hyper-inflammation, which causes great difficulties in clinical treatment leading to increased mortality 16 . It is now well established that a cytokine storm syndrome, involving molecules such as interleukin 6 (IL6), interleukin 8 (IL8), E-cadherin, MCP-1, and VEGF, may be the reason for hyper-inflammation in severe cases 2,17 . Investigating differences at a genetic level during disease occurrence and prognosis helps us understand the disease more clearly thereby guiding the development of countermeasures. In the present study, candidate datasets consisted of two parts, in vivo level (data of patients) and in vitro level (NHBE data and A549 data), and they were annotated using the same platform as GPL18573 [Illumina NextSeq 500 (Homo sapiens)], which eliminated the adverse effect of the bias of gene coverage and algorithm usage from different sequencing platforms on the results and conclusion. After completing the differential analysis and GSEA, we found that in samples with SARS-CoV-2 infection, inflammation-related hallmark gene sets such as IL6-Jak-Stat3-signaling, IL2-Stat5signaling, complement, coagulation, KRAS-signaling-up, and TNFα-signaling were activated while the cell cycle related hallmark gene set such as E2F-targets was suppressed in both in vitro and in vivo levels. Subsequently, inflammation-related genes were found in the core genes of those co-enriched gene sets such as IL6, ICAM1, IL1B, CSF1, TLR2, VEGFA, TNF, and so on.
We constructed a PPI network for the co-core genes, and analyzed molecular modules through Cascade reactions of inflammation and immunity are the two key aspects in the pathogenesis of virus infecting the host, which often requires the participation of multiple cells and cellular components, and lead to changes in the level of gene expression. This is no exception for SARS-CoV-2. GSEA is capable of solving biological problems by focusing on gene sets rather than individual genes 11 . In our study, we use this method to analyze genetic changes related to inflammation and immune response after SARS-CoV-2 infection, which preserved gene-gene correlations, and provided an improved understanding of the biological functional enrichment in the groups of high-level and low-level expressed genes 18 . For hallmark gene sets, we found gene sets involved in significant molecules in the interleukin superfamily, gene set of IL6-Jak-Stat3signaling (genes were up-regulated by IL6 via Stat3 during acute phase response) and IL2-Stat5signaling (genes up-regulated by Stat5 in response to IL2 stimulation), were activated in samples of SARS-CoV-2 infection. Several studies have reported that IL6, as one of the genes involved in cytokine release syndrome (CRS), was used to assess the severity of COVID 19 like respiratory failure, ARDS, and adverse clinical outcomes [19][20][21][22][23][24] . We also identified from cell line data and cases died due to SARS-CoV-2 infection that IL6 gene was not only highly expressed but also interacted the most with other genes in the PPI network, which confirmed its importance as an active cytokine with a wide range of biological functions and as an effective biomarker for the prognosis of COVID 19. IL2 is a secreted cytokine that is important for the proliferation of T and B lymphocytes 25,26 . Here, even though IL2 was not observed in the core gene, but we found activated IL2-Stat5-signaling in the infected samples, and IL2-related genes such as CSF1 and LIF were among the co-core genes with a high level of expression, which revealed that the action of IL2 might be as a "homeostatic" cytokine, and is mainly involved in the immune response but not in CRS. Moreover, the gene sets of TNFA-signaling (genes regulated by NF-κB in response to PIF was also an important feature of COVID 19 cases with poor prognosis, as previously reported 27,28 . Meanwhile, we should be alert to the possibility of sequela caused by EMTactivating. Apart from this, gene set of complement (genes encoding components of the complement system, which is part of the innate immune system) and gene set of coagulation (genes up-regulated during formation of blood vessels) were also found to be activated in our study, which indicated that the disease evolution of SARS-CoV-2 infection was a multiple pathway, complicated process, comprising various dynamic changes in the genome.
E2F transcription factors play critical roles in the control of transcription, cell cycle and apoptosis [29][30][31] . For samples with SARS-CoV-2 infection, we found that the gene set of E2F targets was suppressed, which indicated that cell cycle disorders might occur in COVID 19. From a clinical perspective, this might be related to hypoxia caused by SARS-CoV-2 infection in humans 32 , because hypoxia could directly affect cell differentiation and energy metabolism 33 .
From the gene level, we identified that genes such as PCNA, CCNB2, and PTTG1, were at a low level in the samples of SARS-CoV-2 infection. According to the literature, PCNA and PTTG1 are involved in regulating the biological function of p21, which is a CDK inhibitor that triggers cellcycle arrest in the G1 and G2 phases 34,35 . CCNB2 is involved in the formation of the CCNB2/CDK1 complex that controls separase activity through inhibition of phosphorylation and regulates the biological process of the G2 phase of the cell cycle 36 . Incorporating previous studies, we proposed that in the process of SARS-CoV-2 infecting the host, genetic changes caused cell cycle disorders, which might manifest as hyper-active S-phase and hypo-active G2 phase, facilitating viral DNA replication. Other co-core genes, such as ICAM1, a leukocyte adhesion molecule, mainly encode a cell surface glycoprotein typically expressed on endothelial cells and immune cells. As reported, ICAM1 plays an important role in enhancing CD16+ monocyte adhesion to the endothelium 37 , and regulating IL6/Akt/Stat3/NF-κB-dependent pathway 38 . For roles in viral infection, ICAM1 had been determined to involve in DC-mediated HIV-1 transmission to CD4(+) T cells 39 and regulate interferon-gamma and IL17 in hepatitis B virus infection 40 . Our results showed that ICAM1 participated in the signaling pathways of NF-κB, IL17, and TNF, which was in line with similar previous studies. TNFAIP3, induced by TNF, plays a role in inhibiting the activation of NF-κB and TNF-mediated apoptosis. Literature data reported that TNFAIP3 was closely associated with the replication of viruses such as influenza A 41 , and hepatitis B virus 42 . In our study, we identified that TNFAIP3, TNF, and IL1B were involved in the signaling pathway of necroptosis. Similar to apoptosis, necroptosis was executed via a distinctive signaling mechanism comprising a cascade of specified proteins, resulting in regulated necrotic cell death 43 . Physiologically, necroptosis induces an innate immune response as well as premature assembly of viral particles in cells infected with virus that abrogates host apoptotic machinery, which is advantageous for the host. On the other hand, necroptosis is also deleterious because it can cause various diseases such as sepsis, neurodegenerative diseases and ischemic reperfusion injury 43 . Qin et al. observed that necroptosis of the pulmonary epithelium was associated with severe H7N9 infection leading to ARDS, and the final conclusion indicated that necroptosis inhibition might be a novel therapy for H7N9 influenza virus 44 . However, the benefits of necroptosis to the host may sometimes be outweighed by the potentially deleterious hyperinflammatory consequences of activating this death modality in pulmonary and other tissues 45,46 .
All these evidences suggest that necroptosis signaling is a double-edged sword providing defense against microbial infection, or might even be the likely culprit that increases mortality of the SARS-CoV-2 infected cases. This also provided a novel clue for necroptosis signaling in the treatment of COVID 19 and serves as a potential therapeutic target. BCL2A1, a pro-inflammatory cytokine, encode a member of the BCL-2 protein family. In our study, we found that BCL2A1 participated in the signaling of NF-κB and apoptosis, which was similar to previous reports in the literature 47,48 . Nevertheless, more details of how BCL2A1 affects the pathogenesis of SARS-CoV-2 infection still needs to be explored.
For signaling pathway, besides mentioned above, signaling pathway of AGE-RAGE, MAPK, cytokine-cytokine receptor interaction, EGFR tyrosine kinase inhibitor resistance, viral protein interaction, were enriched in our results. From these findings, we believe that in the course of SARS-CoV-2 infection, the dysregulation was gene-specific rather than pathway-specific. Hence, we tried to construct molecular modules not only for accurate prediction but also for the evaluation of the effects of genes on COVID 19 patients' prognosis. Our data indicated that in severe and/or fatal SARS-CoV-2 infection cases, immune responses tended to be gentle while inflammatory cascades were hyper-activated, which might be attributable to aberrant gene expression. However, in terms of how genes found in our results participated in the pathogenesis of SARS-CoV-2, it is still an unresolved problem, and further needs to be verified through in vivo data.

Conclusions
In conclusion, using integrated bioinformatics analysis for microarray datasets, we identified two molecular modules to assess severity and predict the prognosis of patients with SARS-CoV-2 infection. In addition, these results provide a unique opportunity to explore more molecular pathways as new potential targets for therapy in COVID 19.

Figure 2
The cluster heat map of the three microarray data. (A) is for patient-data, (B) is for NHBE-data, (C) is for A549-data. Red represents gene expression with a high-level, green represents gene expression with a low-level, and black represents gene expression with an intermediate-level.

Figure 3
The GSEA results (A-C) and distributions of co-enriched gene sets and co-core genes in the three microarray data. (A) is the result from patient-data, (B) is from NHBE-data, (C) is from A549-data. Upper half of (D) is Venn plot for distribution of co-enriched gene sets, and lower half of (D) is Venn plot for distribution of co-core genes. Patients, patient-data. NHBE, normal human bronchial epithelial cell line. A549, lung adenocarcinoma cell line A549.

Figure 4
The PPI network and the molecular modules for co-core genes. (A) PPI network of cocore genes constructed in STRING database. (B) molecular module one for co-core genes. (C) molecular module two for co-core genes. Circles represent genes, lines represent interactions between gene-encoded proteins and line width represents evidence of interactions between proteins.  Table 1 Summary of co-GSEA results in the three microarray data. NHBE, normal human bronchial epithelial cell line. A549, lung adenocarcinoma cell line A549.Size, the number of the core enrichment gene. NES, normalized enrichment score. ALLOGRAFT_REJECTION, genes up-regulated during transplant rejection. COAGULATION, genes encoding components of blood coagulation system; also up-regulated in platelets. COMPLEMENT, genes encoding components of the complement system, which is part of the innate immune system. E2F_TARGETS, genes encoding cell cycle related targets of E2F transcription factors. EMT, genes defining epithelial-mesenchymal transition (EMT), as in wound healing, fibrosis and metastasis. IL2_STAT5_SIGNALING, genes up-regulated by STAT5 in response to IL2 stimulation. IL6_JAK_STAT3_SIGNALING, genes up-regulated by IL6 via STAT3, during acute phase response. INFLAMMATORY_RESPONSE, genes defining inflammatory response. INTERFERON_GAMMA_RESPONSE, genes up-regulated in response to IFNG. KRAS_SIGNALING_UP, Genes up-regulated by KRAS activation. TNFA_SIGNALING_VIA_NFKB, genes regulated by NF-kB in response to TNF.