Validation of an Immune-related Gene Pair Index as a Prognostic Marker of Early-Stage Lung Squamous Cell Carcinoma

Background. Lung squamous cell carcinoma (LUSC) is one of the subtypes of non-small-cell lung cancer (NSCLC) and accounts for approximately 20 to 30% of all lung cancers. Methods. In this study, we developed an immune-related gene pair index (IRGPI) for early-stage LUSC from 3 public LUSC data sets, including The Cancer Genome Atlas LUSC cohort and Gene Expression Omnibus data sets, and explored whether IRGPI could act as a prognostic marker to identify patients with early-stage LUSC at high risk. Results. IRGPI was constructed by 68 gene pairs consisting of 123 unique immune-related genes from TCGA LUSC cohort. In the derivation cohort, the hazard of death among high-risk group was 10.51 times that of the low-risk group (HR, 10.51; 95%CI, 6.96-15.86; p<0.001). The hazard of death among the high-risk group was 2.26 times that of the low-risk group (HR, 2.26; 95%CI, 1.2-4.25; p=0.009) in the GSE37745 validation cohort and was 3.2 times that of low-risk group (HR, 3.2; 95%CI, 0.98-10.4; p=0.042) in the GSE41271 validation cohort. The inltrations of CD8 + T cells and T follicular helper cells were lower in the high-risk group, as compared with the low-risk group in the TCGA cohort (6.94% vs 9.63%, p=0.004; 2.15% vs 3%, p=0.002, respectively). The inltrations of neutrophils, activated mast cells and monocytes were higher in the high-risk group, as compared with the low-risk group in the TCGA cohort (1.63% vs 0.72%, p=0.001; 1.64% vs 1.02%, p=0.007; 0.57% vs 0.35%, p=0.041, respectively).


Introduction
With the development of sequencing technology, more and more biomarkers were constructed for the diagnosis and prognosis of patients with lung cancer, including autoantibodies, complement fragments, miRNAs, circulating tumor DNA, DNA methylation, blood protein pro ling, and other signatures. For example, Elena et al. reported a novel protein-based prognostic signature for risk strati cation in patients with early lung adenocarcinoma [1], Ana et al. constructed an prognostic classi er based on mRNA, microRNA, and DNA methylation biomarkers to identify stage I lung adenocarcinoma patients at high risk of recurrence [2], and Daniel et al. described the diagnostic value of complement fragment C4d for patients with lung cancer [3].
As Guo et al. had reported, because of the instability of quantitative information of gene expression measurements under current technical conditions, it was more reasonable to construct quantitative transcriptional diagnostic signatures based on within-sample relative expression orderings of gene pairs [4]. Gene pairs signatures also demonstrated unique prognostic value in many cancers, such as nonsquamous non-small cell lung cancer [5], gastric cancer [6], and colorectal cancer [7].
In the present study, we rstly constructed an immune-related gene pair index (IRGPI) for early-stage LUSC and explored the prognostic value of IRGPI in the derivation and validation cohorts.

Study Design and Data Process
This retrospective study was conducted by using gene expression data from 3 public LUSC data sets, including The Cancer Genome Atlas (TCGA) LUSC cohort and Gene Expression Omnibus (GEO) data sets [17][18][19]. Only patients without neoadjuvant therapy, adjuvant chemotherapy or other immunemodulating therapy were included. LUSC RNA-Seq FPKM data and clinical information were downloaded from Genomic Data Commons Data Portal (GDC Data Portal) (https://portal.gdc.cancer.gov/). 385 earlystage LUSC patients with complete clinical information from TCGA were included as the derivation cohort. Two independent GEO data sets (GSE37745 and GSE41271) with clinical information were download from GEO (https://www.ncbi.nlm.nih.gov/geo/). The information of platforms was shown in sTable1 in the Supplement. For Affymetrix microarrays, gene expression pro les were normalized with the MAS5.0 method (affy, version 1.50.0) [20]. Model-Based Background Correction (MBCB) processed data were downloaded for Illumina datasets. A total of 54 and 47 early-stage LUSC patients from GSE37745 and GSE41271 datasets were enrolled as independent validation cohorts.

Construction of IRGPI as a LUSC Speci c Prognostic Biomarker
The IRGPI was constructed according to a published article [5]. An immune-related genes (IRGs) list consisting of cytokines, cytokine receptors, and genes which related to variable immune cells function was constructed by using data downloaded from the ImmPort database (https://immport.niaid.nih.gov) [21]. Only IRGs which were detected by all platforms were included. To form immune-related gene pairs (IRGPs), the included genes were pairwise compared. For one gene pair, the IRGP score was 0 if the gene expression level of IRG 1 was more than that of IRG 2; otherwise, the IRGP score was 1. Immune-related gene pairs (IRGPs) with constant values (0 or 1) in each data set were excluded considering the probable biases. The log-rank test was used to select the IRGPs which were associated with overall survival in the derivation cohort. IRGP with a p-value of less than 0.01 was selected to construct the immune-related gene pair index (IRGPI). Cox proportional hazards regression model with the least absolute shrinkage and selection operator (LASSO) was used to select the appropriate IRGPs. The optimal model parameter λ was evaluated by tenfold cross-validation at 1 SE (standard error) as previously recommended. The timedependent receiver operating characteristic (ROC) curve was used to determine the optimal cutoff of IRGPI. The nearest neighbor estimation method was used to estimate the ROC curve. Patients were divided into low-and high-risk groups based on the cutoff of IRGPI.

Validation of the Prognostic Value of IRGPI
To assess the prognostic value of IRGPI, univariate Cox regression analysis was conducted in the derivation and validation cohorts. Age and tumor stage were coded as continuous variables. The tumor stage was coded as I=1, II=2. The risk factors of gender and immune risk are male and high risk based on IRGPI. The prognostic accuracy of IRGPI was assessed by the time-dependent receiver operating characteristic (ROC) curve.

Immune Cells In ltration of Low-and High-Risk Groups
Tumor-in ltrating leukocytes (TILs) are important components of the tumor microenvironment (TME). The in ltrations of immune cells were assessed by CIBERSORT [22]. CIBERSORT is a computational method that can quantify immune cell fractions by using gene expression pro les (GEPs) from bulk tumor tissue. LUSC RNA-Seq gene expression pro les were downloaded from TCGA and transformed into the appropriate form according to the requirement. A total of 22 immune cell fractions were evaluated via the CIBERSORT website (https://cibersort.stanford.edu/). Wilcoxon rank-sum test was used to assess the difference of immune in ltrations between low-and high-risk groups.
Functional Annotation of Immune-related Genes IRGPI was constructed by 68 IRGPs (123 unique genes). The functional annotation of these genes was conducted using clusterPro ler (version 3.0.4) [23]. The cellular component, molecular function and biological process of gene ontology were shown in sFig1 in the Supplement.

Statistical analysis
All statistical analyses were performed using R version 3.6.1(https://www.r-project.org/). The log-rank test was used to select Immune-related Gene Pairs (IRGPs) associated with overall survival in the derivation cohort. LASSO Cox proportional hazards regression model was used to select ideal prognostic IRGPs for the construction of IRGPI by glmnet (version 2.0-18) [24]. The time-dependent receiver operating characteristic (ROC) curve was applied to determine the optimal cutoff of IRGPI by survivalROC (version 1.0.3) [25]. Derivation cohort and validation cohort were separated into high-risk and low-risk groups by the same cutoff. The discrimination of the model was evaluated by Area Under the Curve (AUC) of the ROC curve. The Kaplan-Meier method was used to estimate survival curves. Multivariate Cox proportional hazards regression model was performed with variates signi cantly associated with overall survival in the univariate analysis by survival (version 2.44-1.1) packages. Unless speci ed otherwise, pvalue < 0.05 was considered statistically signi cant.

Results
Construction and Validation of the IRGPI Table 1 summarized the clinical characteristics of patients enrolled in this study. 1179 IRGs were detected by the platforms mentioned in the sTable1 in the Supplement. 694431 IRGPs were constructed and 97.4% of them were excluded. The log-rank test was used to assess the association between the remaining 18396 IRGPs and the overall survival of patients in the derivation cohort. 401 IRGPs with a p-value of less than 0.01 were selected to t a Cox proportional hazards regression model with the least absolute shrinkage and selection operator (LASSO) (Fig 1.a, b). The nal 68 IRGPs and LASSO coe cients were shown in sTable2 in the Supplement. 68 IRGPs (123 unique immune-related genes) were used to construct IRGPI via L1-penalized Cox proportional hazards regression in the derivation data set. Based on the time-dependent ROC curve analysis, the optional cutoff for IRGPI was 1.32 (sFig2 in the Supplement). Survival curves of low-and high-risk groups were estimated by using the Kaplan-Meier method and were compared by using the log-rank test in the derivation and validation cohorts (Fig 1.c and e). In the TCGA derivation cohort, the AUC of time-dependent ROC curves at 1, 3 and 5 years was 0.854, 0.953 and 0.944, respectively (Fig 1.d). In the GSE37745 validation cohort, the AUC of time-dependent ROC curves at 1, 3 and 5 years was 0.764, 0.728 and 0.710, respectively (Fig 1.f).

Validation of IRGP as a prognostic factor of Early-stage LUSC
Hazard ratios between high-risk and low-risk groups based on IRGPI were shown in the forest plot (Fig 2). The IRGPI strati ed patients with early-stage (stage I and II) LUSC into different prognostic groups. In early-stage (stage I and II) LUSC, the hazard of death among high-risk group was 10 Immune In ltration related to IRGPI Gene ontology (GO) of the unique 123 LUSC IRGs was shown in the sFig1 in the Supplement. Most of the biological processes were cytokine biosynthetic, secreted and metabolic processes. The molecular function of GO concentrated on cytokine receptor binding, receptor-ligand activity, receptor regulator activity, etc. In the early-stage LUSC TCGA cohort, the percentages of 22 immune cells in ltration were shown in Fig 3.a and Fig 3.b. Patients in high-risk group had higher proportions of neutrophils, monocytes and activated mast cells in ltrations in their tumors (1.63% vs 0.72%, p=0.001; 0.57% vs 0.35%, p=0.041; 1.64% vs 1.03%, p=0.007, respectively). But the in ltrations of CD8+ T cells and T follicular helper cells were lower in the high-risk group, as compared with the low-risk group (6.94% vs 9.63%, p=0.004; 2.15% vs 3%, p=0.002, respectively) (Fig 3.c).

Comparison of biomarkers for LUSC
We summarized current biomarkers for LUSC and compared the biomarkers in Table 3 prognostic predictors based on alternative splicing events for NSCLC patients and the AUC for prognostic predictor was over 0.8 in TCGA LUSC cohort [9]. Choi et al. found that MLL2 mutations predicted poor prognosis in both TP53 mutant and wild-type LUSC [8]. Gao et al. identi ed a prognostic model contained 5 genes and the AUC of the model for predicting the survival at 1, 3, and 5 years was 0.692, 0.722, and 0.651, respectively [28]. For TCGA derivation cohort, the AUC of IRGPI in at 1, 3 and 5 years was 0.854, 0.953 and 0.944, respectively. For GSE37745 validation cohort, the AUC of IRGPI at 1, 3 and 5 years was 0.764, 0.728 and 0.710, respectively. Discussion Gene expression pro les had shown enormous potentials in predicting the survival of patients. Many biomarkers from gene signatures were constructed for the evaluation of prognosis in patients with nonsquamous NSCLC, such as CCP score [29] and quantitative-PCR-based assay [30]. The cell-cycle progression genes (CCP score) were calculated from the normalized signatures of 31 cell-cycle genes and could identify patient groups of reduced or increased risk of death after surgical resection. The quantitative-PCR-based assay could identify patients with early-stage non-squamous NSCLC at high risk for mortality after surgical resection. Current prognostic biomarkers for lung squamous cell carcinoma (LUSC) included prognostic alternative mRNA splicing signature, lncRNA signatures, and microRNA signatures [9,[11][12][13]. However, these studies [9,[11][12][13] only included LUSC data from TCGA database. In this study, we built IRGPI based on 68 immune-related gene pairs from TCGA LUSC database and explored the prognostic value of IRGPI in two independent GEO data sets. By combining multi-gene expression, it was possible to construct robust gene expression-based biomarkers to stratify patients with LUSC into low-risk and high-risk groups, even if gene expression pro les were from different sequencing platforms. Results showed that IRGPI could stratify early-stage LUSC patients into subgroups with different survival outcomes. Besides, we found that IRGPI and old age were independent risk factors for poor prognosis in the GSE37745 validation cohort.
The immunotherapy for NSCLC could be improved by prognostic or predictive biomarkers related to the tumor immune microenvironment. According to previous study, patients with lower levels of tumorin ltrating effector T cells demonstrated a relatively worse prognosis [31]. In this study, we found a higher in ltration of neutrophils and a lower in ltration of CD8 + T cells in the high-risk group based on IRGPI in the TCGA cohort. Patients in the high-risk group had lower levels of tumor-in ltrating CD8 + T cells and displayed relatively poor prognosis. Dysregulation of tumor-in ltrating lymphocytes might explain the differences in the overall survival between low-and high-risk groups based on IRGPI.
To construct a robust signature, several LUSC data sets were included in the study. The expression levels of genes in each sample were compared in pairs to generate IRGPI. This gene pair-based approach was based on the relative gene expression of each sample and did not need normalization. There were also some limitations in our study. Although batch effects were reduced by the pairwise comparison of immune-related genes, some of them were unavoidable due to the combination of different platforms [32]. Meanwhile, sampling bias caused by intratumor genetic heterogeneity was inevitable for gene expression-based IRGPI.

Conclusions
The immune-related gene pair index (IRGPI) is a prospective prognostic biomarker for LUSC, especially for early-stage LUSC. Further studies are still needed to validate its accuracy for assessing prognosis and to explore its clinical utility in the individualized management of LUSC.

Declarations
All datasets were adopted in this study are available in the GEO (https://www.ncbi.nlm.nih.gov/geo/) and TCGA (https://portal.gdc.cancer.gov/) database.   Patients were divided into low-and high-risk groups by IRGPI. a The immune in ltrations of the high-risk group based on IRGPI. b The immune in ltrations of the low-risk group based on IRGPI. c The immune in ltrations of groups were compared by Wilcoxon rank-sum test.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. Supplement.pdf