DEGs identification, Gene Ontology (GO) and pathway enrichment analysis
The GEO database was utilized to obtain gene expression profile datasets in peripheral blood of septic patients. Four datasets (GSE28750, GSE57065, GSE65682 and GSE69528) representing different populations from France, United States, Netherlands and Australia respectively were first obtained from the GEO database. The limma R package was used to screened out the DEGs. As a result, 1662, 1340, 2603 and 1359 DEGs were identified from each dataset respectively. After integrated bioinformatical analysis, a total of 444 common DEGs were identified (Fig. 1A, Supplementary Table 1), including 246 up-regulated and 198 down-regulated genes (Fig. 1B).
To better understand biological meanings of the common DEGs, GO analysis was conducted with DAVID. The immune response, T cell receptor complex and MHC class II protein receptor activity were the most significant terms for the category of biological process, cellular component and molecular function respectively. (Fig. 1C-1E). GO analysis of up-regulated and down-regulated genes was also performed separately (Supplementary Table 2). Moreover, the GOCircle plots showed the top 10 most significant GO terms (Fig. 1F), and genes involved in the the top 5 terms were exhibited using a chord plot (Fig. 1G).
Pathway enrichment analysis for the common DEGs was conducted using “KEGG PATHWAY”, “Reactome”, “Biocyc” and “Panther” databases with KOBAS 3.0 tool. Results showed that they were most significantly enriched in neutrophil deregulation and immune system (Fig. 1H). Besides, pathway analysis of up-regulated and down-regulated genes was also performed separately (Supplementary Table 3).
Weighted Gene Co-Expression Network Analysis (WGCNA) and module detection
The gene expression matrixes of GSE28750, GSE57065, GSE65682, and GSE69528 were respectively clustered using Pearson’s correlation coefficient according to the expression profiles of the 444 common DEGs in these datasets. Clustering trees for each dataset were established and no outliers were found (Fig. 2A-D). Next, the gene modules, which represented groups of genes with similar patterns of expression, were calculated. Four gene modules were finally identified by the hierarchical clustering dendrogram. And the gray module represented genes that cannot be clustered into any other modules (Fig. 2E). Among the modules, the turquoise one was the largest, which contained many genes related to hemopoietic stem cell differentiation, such as CD4, ITGAM and IL1R1. And the blue module contained many genes such as TRAT1, ZAP70, CD8A, and CD3E, which were related to T cell activation, differentiation, receptor binding and costimulation. Therefore, this module was likely T-cell specific. Heatmap was constructed to visualize the gene co-expression network (Fig. 2F).
Screening for clinically related modules and genes
Module eigengene is the first principal component of a given module, which can be considered a representative of the gene expression profiles in a module. The correlation between each module eigengene and clinical phenotypes was calculated [26]. The results showed that the turquoise module had the strongest association with sepsis (Fig. 2G). So, for each gene in this module, gene significance (GS) was calculated to evaluate the correlation between gene expression level and sepsis. 15 genes were identified according to GS value (Supplementary Table 4). And many of these genes (such as CD177 [27, 28], S100A12 [29, 30], and CLEC4D [31]) played a critical role in sepsis pathology.
Identification of hub genes and blocks using protein-protein interaction (PPI) network
The activity of protein-protein interactions is considered to be the prime target of cellular biology study and works as a precondition for system biology. Proteins perform their operation inside a cell with the interaction of another protein, and information that is produced from a PPI network raises perception about the function of the proteins [23]. For the reasons above, the proteins corresponding to the common DEGs were constructed into a PPI network using the STRING database (Fig. 3A). The network was composed of 369 nodes (proteins) and 2032 edges (interactions), and 75 of the 444 genes were filtered out.
Nodes that have the most interactions were considered as hub genes [23]. Among the 369 nodes, 24 were identified as the hub genes with the criteria of node degree >35 (Supplementary Table 5), meaning that each protein expressed from these genes has more than 35 interactions. It is worth noting that many of these proteins, such as MPO [32] and CD28 [33], have been reported to play a role in sepsis. Other proteins like TLR8 could act as a potential therapeutic target [34].
Then the Molecular complex detection (MCODE) plug-in was subsequently applied to select the significant blocks in the PPI network. Two significant blocks with the highest scores were screened out. Block 1 consisted of 21 nodes and 208 edges, while block 2 was composed of 44 nodes and 371 edges (Fig. 3B-C). Notably, ARG1 was located in the central position of block 1 (Fig. 3B).
Identification of ARG1 as a key gene in sepsis
Then genes most relevant to sepsis screened by WGCNA (Supplementary Table 4) was compared with hub genes with more than 35 interactions identified by the PPI network (Supplementary Table 5). ARG1 was found to be the only one overlapped gene in both results (Fig. 4), indicating that this gene was not only highly correlated with the clinical phenotype of sepsis, but also played a hub role in protein-protein interactions. At the same time, ARG1 was also located in the central position in block 1 of the PPI network(Fig. 2B). These results showed that ARG1 was a key gene in sepsis.
ARG1 is sharply up-regulated in the whole blood cells of septic patients
In order to validate the role of ARG1 in sepsis, more datasets were brought into our analysis and validation system. Of these datasets: (i) Six were from studies conducted in adults and four in pediatric subjects; (ii) Five were from studies that took place in North America, four in Europe and one in Asia; (iii) Eight were performed using microarray and 2 using RNA-seq. Across 10 datasets, a significant increase in transcript abundance of ARG1 was observed in the peripheral blood of septic patients compared with that in the control groups (Fig. 5), regardless of ethnicity, age, or experimental settings. Besides, ROC curves generated from these datasets further confirmed the role of ARG1 in sepsis (Fig. 5). A good biomarker should exhibit high sensitivity (the fraction of correctly identified true positives) and specificity (the fraction of correctly identified true negatives), while the sensitivity and specificity are reflected by the area under the curves (AUC) value in the ROC curve. The AUC value for ARG1 in all plots were equal or close to 1, indicating the diagnostic character of this gene (Fig. 5).
ARG1 helps to make an accurate diagnosis, discriminate the severity and predict the treatment response of sepsis
Considering the high expression of ARG1 in sepsis, we next investigated whether ARG1 played a role in distinguishing sepsis from diseases with similar symptoms. We only found two datasets (from Spain and Italy) that contained peripherial blood samples from both septic and non-septic shock cases, and the expression levels of ARG1 were significantly higher in septic shock (Fig. 6A-B). Since septic shock is a severe form of sepsis, and shares similar signs and symptoms with non-septic shock, it is of great value to utilize ARG1 as a potential biomarker to distinguish the two conditions in clinical practice.
Furthermore, revealed by GSE63042 dataset which contains severe or lethal sepsis and uncomplicated sepsis from America, we further discovered the role of ARG1 in discriminating the severity of this disease. In this set of data, the expression level of ARG1 in severe sepsis and lethal sepsis was significantly higher than that in uncomplicated sepsis (Fig. 6C). Moreover, ARG1 was also found up-regulated in patients with septic shock compared with general sepsis based on the dataset from Germany (Fig. 6D). These findings indicated that quantification of the expression level of ARG1 may help to identify those at the greatest risk of progression and mortality.
Besides, our following investigations found that ARG1 could also act as an indicator for judging whether it is responsive to early supportive therapy. In the dataset from Italy, patients received a blood check at Intensive Care Unit (ICU) admission at first, and then their responses to the early symptomatic treatment were recorded in the next few days. No significant difference was found between the responders and non-responders regarding the source of infection, circulating markers of inflammation, or leukocyte and lymphocyte counts [35]. Interestingly, ARG1 was high expressed in non-responders compared with responders of septic patients (P =0.0017) (Fig. 6E). This finding indicated that ARG1 may play a role in establishing the treatment response, and be helpful to predict whether early treatment for sepsis is effective.
Validation of ARG1 as a key biomarker using quantitative real-time PCR
To validate the high expression of ARG1 in sepsis, cecal ligation and puncture (CLP) was performed on mice to induce experimental sepsis. The quantitative real-time PCR showed that the transcription abundance of AGR1 increased dramatically in the peripheral blood of septic mice (Fig. 7), demonstrating that ARG1 is highly correlated with sepsis and have potential to act as a key biomarker.