Validation of an Immune-Related Gene Pair Index as a Prognostic Marker of Lung Squamous Cell Carcinoma

Background: Lung squamous cell carcinoma (LUSC) is one of the subtypes of non-small-cell lung cancer (NSCLC) and accounts for approximately 20 to 30% of all lung cancers. Methods: In this study, we developed an immune-related gene pair index (IRGPI) for LUSC from 8 public LUSC data sets, including The Cancer Genome Atlas LUSC cohort and Gene Expression Omnibus data sets, and explored the prognostic value of IRGPI for patients with LUSC. Results: IRGPI was constructed by 13 gene pairs consisting of 25 unique immune-related genes from the training cohort. Multivariate cox regression analysis showed that high risk based on IRGPI was an independent risk factor for poor prognosis of patients with LUSC in the training cohort (230 patients; HR= 3.40; 95%CI [2.34-4.94]; p<0.001), the testing cohort (228 patients; HR=2.11; 95%CI [1.48-3.01]; p<0.001) and the validation cohort (472 patients; HR=1.99; 95%CI [1.5-2.63]; p<0.001). The inltrations of naïve B cells, plasma cells, CD8 + T cells, activated memory CD4 + T cells, gamma delta (γδ) T cells, M1 macrophages, and activated dendritic cells were lower in the high-risk group, as compared with the low-risk group in the TCGA cohort. The inltrations of neutrophils, activated mast cells, and monocytes were higher in the high-risk group. Conclusions: IRGPI is a signicant prognostic biomarker for predicting overall survival in LUSC patients. Combining clinical features with IRGPI will improve prognostic accuracy.


Background
With the development of sequencing technology, more and more biomarkers were constructed for the diagnosis and prognosis of patients with lung cancer, including autoantibodies, complement fragments, miRNAs, circulating tumor DNA, DNA methylation, blood protein pro ling, and other signatures. For example, Elena et al. reported a novel protein-based prognostic signature for risk strati cation in patients with early lung adenocarcinoma [1], Ana et al. constructed a prognostic classi er based on mRNA, microRNA, and DNA methylation biomarkers to identify stage I lung adenocarcinoma patients at high risk of recurrence [2], and Daniel et al. described the diagnostic value of complement fragment C4d for patients with lung cancer [3].
As Guo et al. had reported, because of the instability of quantitative information of gene expression measurements under current technical conditions, it was more reasonable to construct quantitative transcriptional diagnostic signatures based on within-sample relative expression orderings of gene pairs [4]. Gene pairs signatures also demonstrated unique prognostic value in many cancers, such as nonsquamous non-small cell lung cancer (NSCLC) [5], gastric cancer [6], and colorectal cancer [7].
In the present study, we constructed an immune-related gene pair index (IRGPI) for LUSC and explored the prognostic value of IRGPI in multiple data sets. By combining the tumor stage with IRGPI, the prediction accuracy of IRGPI was improved.

Materials And Methods
Study design and data process This retrospective study was conducted by using gene expression data from 8 public LUSC data sets, including 1 RNA-Seq data set and 7 microarray data sets [17][18][19]. Only patients without neoadjuvant therapy, adjuvant chemotherapy, or other immune-modulating therapy were included. LUSC RNA-Seq FPKM data and clinical information were downloaded from Genomic Data Commons Data Portal (GDC Data Portal) (https://portal.gdc.cancer.gov/). 7 GEO data sets (GSE37745, GSE50081, GSE30219, GSE19188, GSE14814, GSE41271, and GSE42127) with clinical information were download from GEO (https://www.ncbi.nlm.nih.gov/geo/). All the microarray data sets were merged into 1 data set and then randomly divided into a training cohort and a testing cohort. 472 LUSC patients with complete clinical information from TCGA were included as the independent validation cohort. The information about platforms was shown in Supplementary Table 1. For Affymetrix microarrays, gene expression pro les were normalized with the MAS5.0 method (affy, version 1.50.0) [20]. Model-Based Background Correction (MBCB) processed data were downloaded for Illumina datasets.
Construction of IRGPI as a speci c prognostic biomarker for LUSC The IRGPI was constructed according to a published article [5]. An immune-related genes (IRGs) list consisting of cytokines, cytokine receptors, and genes related to variable immune cells function was constructed by using data downloaded from the ImmPort database (https://immport.niaid.nih.gov) [21].
Only IRGs which were detected by all platforms were included. To form immune-related gene pairs (IRGPs), the included genes were pairwise compared. For one gene pair, the IRGP score was 0 if the gene expression level of IRG 1 was more than that of IRG 2; otherwise, the IRGP score was 1. Immune-related gene pairs (IRGPs) with constant values (0 or 1) in each data set were excluded considering the probable biases. The log-rank test was used to select the IRGPs which were associated with overall survival in all the cohorts. IRGP with a p-value of less than 0.05 was selected to construct the immune-related gene pair index (IRGPI) in the training cohort. Cox proportional hazards regression model with the least absolute shrinkage and selection operator (LASSO) was used to select the appropriate IRGPs. The optimal model parameter λ was evaluated by tenfold cross-validation at 1 SE (standard error) as previously recommended. The time-dependent receiver operating characteristic (ROC) curve was used to determine the optimal cutoff of IRGPI. The nearest neighbor estimation method was used to estimate the ROC curve. Patients were divided into low-risk and high-risk groups based on the cutoff of IRGPI.
Validation of the prognostic value of IRGPI To assess the prognostic value of IRGPI, univariate Cox regression analysis was conducted in the testing and validation cohorts. Age and tumor stage were coded as continuous variables. The tumor stage was coded as I=1, II=2, III=3, IV=4. The risk factors of gender and immune risk are male and high risk based on IRGPI. The prognostic accuracy of IRGPI was assessed by the time-dependent receiver operating characteristic (ROC) curve.
Immune cells in ltration of low-and high-risk groups based on IRGPI Tumor-in ltrating immune cells (TIICs) are important components of the tumor microenvironment (TME). The in ltrations of immune cells were assessed by CIBERSORT [22]. CIBERSORT is a computational method that can quantify immune cell fractions by using gene expression pro les (GEPs) from bulk tumor tissue. LUSC RNA-Seq gene expression pro les were downloaded from TCGA and transformed into the appropriate form according to the requirement. A total of 22 immune cell fractions were evaluated via the CIBERSORT website (https://cibersort.stanford.edu/). Wilcoxon rank-sum test was used to assess the difference of immune in ltrations between low-and high-risk groups.

Statistical analysis
All statistical analyses were performed using R version 3.6.1 (https://www.r-project.org/). The log-rank test was used to select Immune-related Gene Pairs (IRGPs) associated with overall survival. LASSO Cox proportional hazards regression model was used to select ideal prognostic IRGPs for the construction of IRGPI by glmnet [23]. The time-dependent receiver operating characteristic (ROC) curve was applied to determine the optimal cutoff of IRGPI by survivalROC [24]. All the cohorts were separated into high-risk and low-risk groups by the same cutoff. The discrimination of the model was evaluated by Area Under the Curve (AUC) of the ROC curve. The Kaplan-Meier method was used to estimate survival curves.
Multivariate Cox proportional hazards regression model was performed with variates signi cantly associated with overall survival in the univariate analysis by survival packages. Unless speci ed otherwise, p-value < 0.05 was considered statistically signi cant.

Characteristics of patients
The analysis process was presented in the ow chart ( Fig. 1). A total of 930 lung squamous cell cancer (LUSC) patients from TCGA and GEO data sets were enrolled in this study. 458 LUSC patients from GEO data sets were randomly divided into the training cohort and testing cohort. The demographic and clinical features of patients in the training, testing, and validation cohorts were listed in Table 1. There was no signi cant difference in the clinical features of the training and testing cohorts.
Construction of the IRGPI 1065 immune-related genes (IRGs) were detected by the platforms enrolled in this study. 566580 immunerelated gene pairs (IRGPs) were constructed and IRGPs with constant values (0 or 1) in each data set were excluded. The log-rank test was used to assess the association between the remaining IRGPs and the overall survival (OS) of patients. To avoid over-tting of the prediction model, the LASSO regression was performed to screen the top 14 OS-related IRGPs (Figs. 2a, b). The nal 13 IRGPs and LASSO coe cients were shown in Table 2. The 13 IRGPs which contained 25 unique immune-related genes were used to construct IRGPI via L1-penalized Cox proportional hazards regression in the training cohort. Based on the time-dependent ROC curve analysis, the optional cutoff for IRGPI was 1.31 ( Supplementary   Fig. 1). The risk curve and scatterplot were used to demonstrate the IRGPI and vital status of each patient with LUSC. Patients in the high-risk group had higher mortality than patients in the low-risk group (Figs. 2c, d). Survival curves of low-and high-risk groups were estimated by using the Kaplan-Meier method and were compared by using the log-rank test (Fig. 2e). In the training cohort, the AUC of time-dependent ROC curves at 1, 3, and 5 years was 0.744, 0.775, and 0.803, respectively (Fig. 2f).

Validation of IRGPI as an independent prognostic factor for LUSC
The same formula was used to calculate the IRPGI of patients in the testing and validation cohorts. Risk curves and scatterplots were applied to show the IRGPI and vital status of each patient with LUSC in the testing and validation cohorts. Patients with high risk based on IRGPI had higher mortality than patients with low risk in the testing and validation cohorts (Figs. 3a-d). The IRGPI strati ed patients with LUSC into different prognostic groups in the testing and training cohorts by using the Kaplan-Meier curves (Figs. 3e, g). In the testing cohort, the AUC of time-dependent ROC curves at 1, 3, and 5 years was 0.601, 0.648, and 0.691, respectively (Fig. 3f). In the validation cohort, the AUC of time-dependent ROC curves at 1, 3, and 5 years was 0.635, 0.698, and 0.677, respectively (Fig. 3h). Multivariate Cox proportional hazards regression model was performed with variates that were signi cantly associated with overall survival in univariate analysis ( Immune in ltration related to IRGPI CIBERSORT was used to assess the in ltrations of TIICs in the low-and high-risk groups based on IRGPI in the TCGA validation cohort (Fig. 4). The in ltrations of naïve B cells, plasma cells, CD8 + T cells, activated memory CD4 + T cells, gamma delta (γδ) T cells, M1 macrophages and activated dendritic cells were lower in the high-risk group, as compared with the low-risk group (p=0.002, p=0.004, p=0.027, p=0.024, p=0.004, p=0.008, p=0.001, respectively). Patients in the high-risk group had higher proportions of neutrophils, activated mast cells, and monocytes, as compared with patients in the low-risk group (p<0.001; p=0.002; p=0.043, respectively).

Comparison of biomarkers for LUSC
We summarized current biomarkers for LUSC and compared the biomarkers in Table 4 Table 5).

Discussion
Gene expression pro les had shown enormous potentials in predicting the survival of patients. Many biomarkers from gene signatures were constructed for the evaluation of prognosis in patients with nonsquamous NSCLC, such as CCP score [28] and quantitative-PCR-based assay [29]. The cell-cycle progression genes (CCP score) were calculated from the normalized signatures of 31 cell-cycle genes and could identify patient groups of reduced or increased risk of death after surgical resection. The quantitative-PCR-based assay could identify patients with early-stage non-squamous NSCLC at high risk for mortality after surgical resection. For LUSC, gene signatures from tumor microenvironment could also be used to develop clinically useful tests to differentiate patients with different prognoses [30]. Current prognostic biomarkers for LUSC included prognostic alternative mRNA splicing signature, lncRNA signatures, and microRNA signatures [9,[11][12][13]. However, these studies [9,[11][12][13] only included LUSC data from the TCGA database. In this study, we built IRGPI based on 13 immune-related gene pairs from the training cohort and explored the prognostic value of IRGPI in the testing and validation cohorts. By combining multi-gene expression, it was possible to construct robust gene expression-based biomarkers to stratify patients with LUSC into low-risk and high-risk groups, even if gene expression pro les were from different sequencing platforms. Results showed that IRGPI could stratify LUSC patients into subgroups with different survival outcomes. Besides, we combined the IRGPI with the tumor stage to build a new score ICPI. Compared with IRGPI, ICPI showed higher prognostic accuracy in the testing and validation cohorts.
More and more evidence had shown the potential role of TIILs in the development, progression, and prognosis of cancers and the response of cancers to therapy [31]. According to the previous study, patients with lower levels of tumor-in ltrating effector T cells demonstrated a relatively worse prognosis [32]. Several studies had reported that tumors could escape immune system surveillance and effective antitumor immunity partly depended on cytotoxic T lymphocytes [33]. In NSCLC, many studies had demonstrated the association between the immune cell populations and prognosis of patients [34][35][36][37]. The immunotherapy with immune checkpoint inhibitors against NSCLC had shown remarkable clinical e cacy and the e cacy of immune checkpoint blockade depended on the in ltration of TIICs in the tumor tissues [38][39][40].
Many agents for NSCLC are not bene cial for the treatment of LUSC because LUSC lacked the speci c genetic alterations that agents targeted [41]. In contrast, immune checkpoint inhibitor therapy had shown clinical e cacy in LUSC with prolonged survival [42,43]. The immunotherapy for LUSC could be improved by prognostic biomarkers related to the tumor immune microenvironment. In this study, we found higher in ltrations of neutrophils and monocytes and lower in ltrations of CD8 + T cells and CD4 + T cells in the high-risk group based on IRGPI in the TCGA LUSC cohort. Patients in the high-risk group had lower levels of tumor-in ltrating CD8 + T cells and displayed a relatively poor prognosis. Both the neutrophil and monocyte lineages had been demonstrated to possess lymphocyte suppressive capabilities [44]. The dysregulation of TIICs might explain the differences of prognosis between low-and high-risk groups based on IRGPI.
The method used to construct the IRGPI was based on the previous study [5]. Li et al. constructed the IRGPI for early-stage non-squamous NSCLC. However, the immune microenvironment differences between squamous and non-squamous NSCLS were signi cant [45][46][47]. To construct a robust signature for LUSC, several public LUSC data sets were included in the study. The expression levels of genes in each sample were compared in pairs to generate IRGPI. This gene pair-based approach was based on the relative gene expression of each sample and did not need normalization. There were also some limitations in our study. Although batch effects were reduced by the pairwise comparison of immunerelated genes, some of them were unavoidable due to the combination of different platforms [48].
Meanwhile, sampling bias caused by intratumor genetic heterogeneity was inevitable for gene expression-based IRGPI.

Conclusions
The immune-related gene pair index (IRGPI) is a prospective prognostic biomarker for LUSC, especially for early-stage LUSC. Further studies are still needed to validate its accuracy for assessing prognosis and to explore its clinical utility in the individualized management of LUSC.

Declarations
Ethics approval and consent to participate This study was approved by Ethics Committees of Tongji Medical College, Huazhong University of Science and Technology (2020-0120).

Consent for publication
Not applicable.

Availability of data and materials
All datasets enrolled in this study are available in the GEO (https://www.ncbi.nlm.nih.gov/geo/) and TCGA (https://portal.gdc.cancer.gov/) databases.

Competing interests
The authors declare that they have no competing interests. Authors' contributions QZ conceived the idea, designed and supervised the study, had full access to all data and took responsibility for the integrity of the data. YL, LLY, WBP, SYZ and YRN recorded and sorted the data. XSW and XW analyzed the data. ZHW and XX interpreted the data and wrote the manuscript. All authors contributed signi cantly to this work and agreed to be accountable for the work. All authors read and approved the nal manuscript.     Figure 1 Flow chart of the research.  Patients in the testing and validation cohorts were divided into low-and high-risk groups based on IRGPI.  The in ltrations of 22 tumor-in ltrating immune cells (TIICs) between low-and high-risk groups based on IRGPI were compared by Wilcoxon rank-sum test.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. Supplement.pdf