Identification of immune prognostic signature of lung carcinoma based on Genome-wide immune expression profile

Background : Lung cancer is one of the most common types of cancer with low early diagnosis rate and poor prognosis. The integration of immune checkpoint gene expression data and patient prognosis information can help identify the immune subtypes of lung cancer and provide reference for individualized gene immunotherapy in patients with lung cancer. Methods : The data of immune gene expression for lung cancer patients were obtained from TCGA and GEO databases. The relationship between the expressions of 45 immune checkpoint genes (ICGs) and prognosis were analysed. In the other hand, the correlation between the expressions of 45 biomarkers , tumor mutation load (TMB), MMRs, neoantigens and other immunotherapy biomarkers were been identified. Ultimately, prognosis-related ICGs were combined with IDO1, CD274, and CTLA4 to divide lung cancer immune subgroups and the prognostic differences between lung cancer immune subgroups were identified. Results: Based on TCGA database and GEO database, 9 and 11 ICGs were obtained respectively, which were closely related to prognosis. There was a certain synergistic relationship between them. The expression of CD200R1 had a significant negative correlation with TMB and neoantigens. CD200R1 showed a significant positive correlation with CD8A, CD68 and GZMB genes, indicating that it may cause the expression disorder of adaptive immune resistance pathway genes. Based on CD200R1 and combination with IDO1, CD274 and CTLA4, the group with high expression of CD200R1 and low expression of IDO1, CD274 and CTLA4 had the best prognosis among the immune subtypes. Conclusion : Our research provides a method of integrating immune checkpoint gene expression profile and clinical prognosis information to identify immune subtypes of lung cancer, which can provide a unique reference for gene immunotherapy of lung cancer patients.

3 Background Lung cancer accounts for more than one-tenth of all cancer cases worldwide and is one of the most common types of cancer [1], smoking is a major risk factor for lung cancer.
Exposure to high carcinogens in tobacco smoke can induce the accumulation of mutation load in lung cells [2][3]. The prognosis of lung cancer is rather poor, with a 5-year survival rate of 18% [4]. The prognosis of lung cancer is related to the clinical pathological characteristics of the patients, and the early diagnosis rate is low. Therefore, the formation and pathological grade of lung cancer could not be accurately judged, thus limiting the operation [5]. At the time of diagnosis of lung cancer, most patients were already in the stage of extensive metastasis. The main therapeutic strategy is to reduce the malignant phenotype by suppressing the oncogenic gene pathway. Although response to chemotherapy and radiotherapy in the early stage is relatively good, this treatment often leads to drug resistance in the treatment of lung cancer. In view of this situation, the treatment is relatively limited [6][7].
The progress of biomedical big data, including the mature technology of new generation sequencing and multi-generation sequencing, has deepened our understanding of lung cancer development, cellular heterogeneity and interaction at the precise molecular level [8], on which the immunotherapy development has shown great potential in treating various malignant tumors [9]. At present, monoclonal antibody targeted drugs are innovating the treatment of lung cancer, especially the immunotherapy developed with PD-1 and CD274 as targets has achieved some great results in clinical trials of lung cancer treatment [10][11]. Currently, these targeted drugs can be used as first-line treatment or second-line drugs for lung cancer since there are over 50% of tumor cells in first-line patients express PD-L1 [12]. However, this therapy does not work for all patients due to the dynamic and isomeric expressions, which suggests other therapeutic targets will be needed for specific patients [13][14]. On the whole, the effect of most targeted monoclonal antibodies is still limited among all lung cancer patients for the reason that beneficial responses are observed from some patients, while many others are not [15]. This indicates that not all patients show the same sensitivity to therapeutic targets because of tumor heterogeneity. Immunotherapy is not a panacea for all patients. So it is highly necessary to further differentiate tumor immune subtypes for improving clinical trial design and identifying patients who may benefit from immunotherapy.
Traditional cancer research usually focuses on the role of a particular target gene in cancer. Because the occurrence and development of cancer is rather complex, it involves the abnormality of multiple genes and multiple signaling pathways. Therefore, studying the role of a single gene in cancer has its limitations [16]. At present, biological studies have entered the multi-omics data age, which provides possibility for understanding and studying tumors from genome level. The establishment of several large cancer databases of TCGA and GEO enables researchers to obtain the gene expression data of multiple immune checkpoints and the corresponding clinical data of patients. On this basis, the integrated analysis can divide different immune subtypes in lung cancer samples in more detail and guide the "individualized" treatment of patients [17][18]. In the existing reports related to immunotherapy for lung cancer, most of them are drug design studies based on a single immune checkpoint gene as a target. There is still a lack of research on gene expression profile of multiple immune check points [19]. Therefore, based on the gene expression data of 47 immune checkpoints such as PD1 (PDCD1), PD-L1 (CD274) and IDO1, the immune subgroup and prognosis of 761 lung cancer samples from TCGA and GEO database were identified.
In our research, the expression of 45 immune checkpoint genes (ICGs) and its relationship with prognosis were analyzed. Then the relationship between immunotherapy biomarkers studied by integrated cell mutation data to determine the relationship between the expression of ICGs (PD1/PD-L1 and CTLA4) in biomarkers, which has been widely used in immunotherapy, and other biomarkers, which are not used in immunotherapy. Ultimately, the relationship between ICGs expression and immune activation related signature genes was studied in order to determine the relationship between immune activation and immunosuppression in tumor microenvironment, and to provide reference for individualized gene immunotherapy in patients with lung cancer.

Methods
Data downloading 47 genes of immune checkpoints were downloaded (S_ Table 1  Data preprocessing The following pre-steps were conducted in the processing of 515 RNA-seq data samples:1) Samples without clinical information or OS < 30 days were removed;2) The data of normal tissue samples was removed;3) Remove genes whose FPKM value is equal to 0; thus more than half of the samples were removed. Furthermore, GSE31210 data were pre-processed as follows:1) The data of normal tissue samples was removed and we retained the only primary tumor data; 2) Year/month-based OS data were converted into days-based; 3) The microarray probe was mapped to human gene SYMBOL by using bioconductor package; 4) Only the expression profile of immune-related genes was retained. Statistical information for the three data sets was shown in Table 1.

Relationship between ICGs expression and prognosis
The ICGs with expression in TCGA dataset were screened. The patients were divided into groups based on different levels of expression. The relationship between these ICGs with expression and patient prognosis was analyzed using univariate cox regression analysis (log rank p < 0.05). The expression and prognostic patterns of the ICGs were analyzed by the same method in GEO database.

Relationship between ICGs and other immune checkpoint biomarkers
Spearman method was used to evaluate the correlation between TMB and ICGs. The relationship between neoantigens and ICGs expression was further analyzed based on the somatic cell mutation data of TCGA.

Relationship between prognosis and subtypes defined by ICGs
Therefore, the high expression group (H) and low expression group (L) were divided according to the horizontal density distribution of these genes. Taking the deviation from the first main distribution interval (density peak) as the threshold, the genes were divided into high expression group and low expression group. The High/Low expression groups of IDO1, CD274, CTLA4 and CD8A were integrated, respectively. Then the HCC samples were divided into four categories to analyze the survival of the three gene classification samples.

Statistical analysis
For comparison between the two groups, the statistical significance of the normal distribution variables was estimated by the nonpaired student's t test, and the non-normal distribution variables were analyzed using the Mann-Whitney U test. For comparisons between more than two groups, Kruskal-Wallis test and one-way ANOVA were used as nonparametric and parametric methods, respectively [20]. The correlation coefficients were calculated by Spearman correlation analysis. Kaplan-Meier method was used to generate a survival curve for the subgroups in each data set, and logrank test was used to determine the statistical significance at the level of P < 0.05. All these analyses were performed using R 3.4.3, based on defaulted parameters unless otherwise specified.

Results
The relationship between ICGs and prognosis The expression of ICGs

Prognostic analysis of ICGs
Using univariate COX regression analysis, we calculated the relationship between the 45 ICGs with expression quantity in the TCGA data set and prognosis, and obtained altogether 9 genes that were significantly correlated with prognosis: TNFSF14, CD70, TNFRSF14, CD27, LAIR1, CTLA4, CD28, CD40LG and CD200R1 ( Fig. 2A, log rank p < 0.05). It could be found that these nine genes were mainly positively correlated and showed obvious aggregation effect (Fig. 2B) through correlation analysis of ICGs expression levels, indicating a coordinated expression relationship between ICGs.
ICGs analysis on the GEO data set Among the 47 selected ICGs on the GSE31210 data set, there were 44 ICGs with expression, and their expression patterns were also significantly clustered into three categories: high expression group, medium expression group, low expression group (Fig. 3A). Moreover, we found that the genes in the three expression groups of GSE31210 were highly consistent with those in the three expression groups of TCGA. For instance, LGALS9, TNFRSF14, CD40 and CD44 showed high expression levels in the two data sets.
The expression levels of HAVCR2, LAIR1, TNFSF9 and other genes were medium in both data sets. Such low expression genes in GEO as HHLA2, CD200R1, CD244, CD28 and BTNL2 showed the same low expression level in TCGA data set.
Univariate survival analysis of the relationship between 44 ICGs and overall survival (OS) revealed that 11 ICGs are auspiciously correlated with the prognosis of OS (Fig. 3B), and there was a significant correlation of CD40LG gene with better prognosis in both TCGA and GSE31210 data sets (HR < 1, log rank p < 0.05). On the GSE31210 data set, ICGs expression levels were also largely in a positive correlation and had an obvious aggregation effect (Fig. 3C), showing no difference with TCGA results.

Relationship between ICGs and biomarker for other immune checkpoints treatment
The relationship between ICGs and TMB Our study calculated the TMB of LAUD according to the somatic mutation data of TCGA, and removed the intron interval and the mutation annotated as silent. First, we selected 9 ICGs with expression quantity that were strongly correlated with OS prognosis, and used Spearman method to evaluate the correlation between TMB and the 9 ICGs (Shapiro.test p < 2.2e-16). We observed highly negative correlations between TMB and CD40LG, CD200R1, TNFRSF14 and TNFSF14 expression (Fig. 4, R < 0 and FDR < 0.005). The TNFSF14 was the potential factor, that was causing unfavourable prognosis. Thus, we speculated that the high expression TNFSF14 corresponds to the lower tumor mutation burden. And low TMB generally did not apply to immune checkpoint inhibitor treatment.

The relationship between ICGs and MMRs
MMRs were intracellular mismatch repair mechanisms, and the function loss of key genes in this mechanism will lead to irreparable DNA replication errors, producing much more The relationship between ICGs and of adaptive immune-resistance pathway genes CD8 T cells can produce IFNγ and lead to up-regulated expression (activation of immune pathways) of adaptive immune-resistance pathway genes (including PD-1/PD-l1 axis and IDO1, etc.). Therefore, we had analyzed the correlation between CD8A, GZMB, CD68, NOS2 and ICGs. In the genes of adaptive immune-resistance pathways, there were four genes except NOS2 with very and mostly positively correlated with ICGs expressions (Fig. 7A).
The significance test of correlation coefficients showed that there were at least half of significant correlations between these genes (p < 1e-3, Fig. 7B).
We had also analyzed the expression relationship between adaptive immune-resistance pathway's gene and ICGs by using GSE31210 data. This relationship was positively correlated with (72.9%), as well a significant positive correlation was shown in CD8A and GZMB (S1_Figure), which coincided with the result of TCGA data set.

The relationship between ICGs and clinical features
Univariate survival analysis had identified 9 ICGs that were highly related to OS prognosis   Fig. 3).

The relationship between ICGs subtypes and prognosis
Although PD1 (PDCD1), PD-L1 (CD274) and IDO1 are important genes in immune checkpoints and important biomarkers. Univariate analysis showed that there was no significant prognostic relationship between them. Based on the comprehensive analysis above, we found that CD200R1 was not only strongly associated with the prognosis of the TCGA samples, but also has highly negative correlations with TMB and neoantigens, and positive correlations with CD8A, CD68 and GZMB. This indicates that CD200R1 can cause expression dysregulation of adaptive immune-resistance pathway genes. Therefore we analyzed the relationship between such gene combinations as CD200R1, PD-L1 (CD274),

IDO1 and prognosis.
Integrating the H/L expression groups of IDO1, CD274, CTLA4 and CD200R1, respectively, we sorted the LUAD samples into four cluster according to the median of gene expression.
The survival analysis of the three gene classification samples showed that there exist significant differences in the prognosis of OS among the four cluster of samples. They all revealed that the group with the high expression of CD200R1 and the low expressions including IDO1, CD274 and CTLA4 has the best prognosis. The group with low expression of CD200R1 and low expression of IDO1, CD274 and CTLA4 had the worst prognosis ( Fig. 9A-C). Significant differences can be seen through prognostic analysis between the two cluster with the best and the worst prognosis (Fig. 9D-F).
On the GSE31210 data set, we also observed that the three genes and several expression patterns of CD200R1 were significantly correlated with different prognosis, and that CD200R1 was highly expressed and IDO1, CD274, and CTLA4 low expression groups had the best prognosis (S4_Figure).

Discussion
This study aims to identify the relationship between immune subgroup and prognosis of lung cancer samples based on gene expression data from multiple immune checkpoints.
The gene expression data from TCGA and GEO databases were divided into high -medium Then by using univariate COX regression analysis to screen for genes associated with patients' overall survival, we obtained nine genes significantly related to prognosis from TCGA database and 11 prognosis-related genes from GEO database. Among them, CD40LG, a leukocytic coding gene of the cluster of differentiation 40 ligand -CD40L, is significantly correlated with better prognosis in both databases (HR < 1, log rank p < 0.05) [21]. CD40L, which is a type II membrane-related glycoprotein and a member of the tumor necrosis factor (TNF) superfamily [22], is conducive to regulate the immune response and inhibit tumor growth. It produces direct growth-inhibiting effect through apoptosis in CD40L over- To make clear the relationship between ICGs and other biomarkers for immune checkpoint treatment, we analyzed the relationship between ICGs and tumor mutational burden (TMB), MMRs, and neoantigens, respectively. TMB is a reliable indicator for predicting clinical efficacy of PD-1 inhibitor [24]. The more mutant genes in the tumor tissue, the more likely abnormal proteins will be largely produced. These abnormal proteins are able to activate the body's anti-tumor immune response, improving the sensitivity of immunotherapy. So it is more suitable for patients with high TMB level to continue immunotherapy with PD1 inhibitors [25]. Through analyzing the relationship between ICGs and TMB, we found that the expression of TMB is strongly inversely correlated with that of CD40LG, CD200R1, TNFRSF14 and TNFSF14 (R < 0 and FDR < 0.005). That is, the high expression of CD40LG, CD200R1, TNFRSF14 and TNFSF14 corresponds to the low expression of TMB. Among them, TNFSF14 is a potential cause of unfavourable prognosis.
It is inferred that the high expression of TNFSF14 corresponds to the low tumor mutational burden, so the high expression of TNFSF14 may not be suitable for the treatment of immune checkpoint inhibitors. MMRs is an intracellular mismatch repair mechanism. Its loss of function will lead to irreparable DNA replication errors, thus, leading to the high generation of gene mutations [26]. The 5 MMRs gene mutations evaluated in this study are positively correlated with ICGs expression. Among them, CD70, which is also significantly correlated with prognosis which might be a potential predictor of MMR gene mutations. New proteins produced by gene mutation that may activate the immune system and further trigger the immune system to attack cancer cells are called neoantigens [27].
Our study found that CD40LG, CD200R1, TNFSF14 and TNFRSF14 were also significantly negatively correlated with the expressions of neoantigens (R < 0 and FDR < 0.05), which was in line with the extremely correlation of negative expression between TMB and the four genes mentioned above. Our experiment showed that TNFSF14, the unfavourable prognostic gene, was of high expression in patients. While the high expression of TNFSF14 corresponds to low level neoantigens. So in this case, it was not appropriate for patients to be treated with personalized neoantigens [28]. In general, the greater total number of gene mutations (the higher the TMB) in the tumor tissues of patients, the more neoantigens they carry [29]. According to our results, the expressions of CD40LG, CD200R1, TNFSF14 and TNFRSF14 were negatively correlated with TMB and neoantigens.
This indicated TMB and neoantigens share the same variation trend, which coincides with what was reported in literature.
In order to reveal the relationship between ICGs and adaptive immune-resistance pathway genes, we analyzed the correlation between CD8A, GZMB, CD68, NOS2 and ICGs. Most of them were found out to be positively and significantly correlated. We obtained the same results in TCGA and GEO databases. This indicated that adaptive immune pathway genes might have a certain regulatory effect on ICGs expressions.
In the relationship between ICGs and clinical features, we found that the expression levels of ICGs in early tumor samples were higher than that in advanced tumor samples, CD200R1 in particular. In the above analysis, we found that CD200R1 were highly correlated with the overall survive of TCGA, and negatively correlated with TMB and neoantigens. In addition, in correlation with the adaptive immune-resistance pathway genes, we found CD200R1 were positively correlated with CD8A, CD68 and GZMB, suggesting that CD200R1 might be the possible cause of expression dysregulation of adaptive immune-resistance pathway genes. Then we analyzed the relationship between IDO1, CD274, CTLA4 and the H/L expression groups and prognosis of CD200R1. based on TCGA database our analysis results showed that prognosis will be the best when it is the group with high expression of CD200R1 and low expressions of IDO1, CD274 and CTLA4, while prognosis will be the worst when it is the group with CD200R1 low expression and low expressions of IDO1, CD274 and CTLA4. Moreover, there was a significant difference between of two groups. It was also confirmed in GEO database that the group with high expression of CD200R1 and low expressions of IDO1, CD274 and CTLA4 had the best prognosis. However, in the previous univariate COX analysis, we did not find a direct correlation between such ICGs as IDO1, CD274 and CTLA4 and patient prognosis.

Consent for publication
Not applicable.

Availability of data and material
The data used to support the findings of this study are available from the corresponding author on reasonable request

Competing interests
The authors declare that there are no conflicts of interest  Figure 1 The heatmap of ICGs on TCGA dataset. red: high expression group, green: medium expression group; blue: low expression group.   Scatter diagram of IGCs expression levels and TMB. R2 is the correlation coefficient, and FDR is false discovery rate.  scatter diagram of IGCs expression levels and neoantigens. R2 is the correlation coefficient, and FDR is false discovery rate.

Figure 7
A: heatmap of the correlation coefficients between ICGs and adaptive immuneresistance pathway genes; B: -log10 (P-value) is tested for the correlation coefficients between ICGs and adaptive immune-resistance pathway genes.  Supplementary Files