Immune Gene Data-based Molecular Subtyping of Rectal Adenocarcinoma Related to Prognosis

Background: The morbidity and mortality of rectal adenocarcinoma (READ) is increasing, which is considered as an aggressive type of colorectal malignancy. A great deal of evidence has suggested the significant association between the progression of READ and the immunophenotype of tumor cells (i.e. the expression of intracellular immune-related genes). Methods: Samples retrieved from the TCGA and ImmPort database were investigated to identify immune-related genes specifically impacting the prognosis of READ patients. Several typical ones were then selected to construct the prognostic prediction model of READ patients through Lasso algorithm. The training and test cohorts were incorporated into the model, respectively. We stratified READ patients to evaluate the accuracy, efficiency and stability of the model in the prediction and classification of patient prognosis according to the value of median RiskScore (Risk-H and Risk-L). GO and KEGG signaling pathway enrichment analysis were conducted among the nine selected immune-related genes. Results: A total of 57 immune-related displaying marked correlated with patient prognosis were identified and nine most typical genes could be majorly enriched into several pathways with close correlation with READ and the corresponding immune response. the distribution of nine immune-related genes were examined in the samples from both Risk-H and –L groups. Finally, the connection of the RiskScore value with the clinical characteristics of sample and the related signaling pathways were investigated. Conclusions: The prognostic prediction model of RiskScore constructed based on the expression profiling of the nine immune-associated genes exhibited high prediction accuracy and stability to identify the relevant immune features. This model could contribute to the guidance for clinicians in diagnosing and predicting the prognosis for various immunophenotypes. Meanwhile, it also can offer various therapeutic targets for precise treatment of READ in clinical practice according to the identified immune molecules specific to different subtypes.


Background
Rectal adenocarcinoma (READ) is defined as a type of rectal cancer located above the dentate line 4 and between the sigmoid colon and the transitional part of the rectum, accounting for 75% ~ 85% of all types of colorectal cancer [1]. At present, the pathogenesis of READ remains unclear, which might be associated with dietary factors, rectal adenoma, hereditary precancerous lesions, and inflammatory diseases of the large intestine [2]. Moreover, READ is a highly complex and heterogeneous malignancy. The heterogeneity of READ results in numerous clinical subtypes, with diverse degrees of sensitivity to chemotherapy and targeted therapy, thus leading to various prognoses [3,4]. Approximately 2/3 READ patients are diagnosed at advanced stage in clinical practice [5].
The current therapeutic strategy of rectal cancer commonly includes a combination of surgery, chemotherapy, radiotherapy or targeted therapy [6]. Under normal conditions, a specific therapy is established according to the general assessment of the patient status, as well as cancer location and stage. However, only limited success rate could be achieved from the conventional approaches for READ, such as chemotherapy and radiotherapy, which might be due to the toxicity and non-specificity of these traditional therapy [7]. Notably, certain large-scale clinical studies do no recommend systemic adjuvant therapy as a postoperative treatment in the majority of READ patients at early stage, largely due to the fact that the therapy-triggered toxicity on human body is far greater than the survival benefit of patients [8]. In high-risk patients, however, the absence of systemic adjuvant therapy would lead to rapid recurrence of the disease, which may even progress into infiltration and distant metastasis in the para-carcinoma tissues. To this end, it is of crucial necessity to identify the survival risks in patients during the early diagnosis. Moreover, patients should be stratified into different subgroups, and additional adjuvant systemic therapy should be prescribed for high-risk patients [9]. READ could be currently categorized into three types according to the differentiation degree of cancer cells: highly differentiated READ, moderately differentiated READ, and poorly differentiated READ [10].
Of the three different pathological types, poorly differentiated READ the most aggressive. However, this classification method certainly cannot be applied in the prognosis of patients at early stage.
Biomarkers allow for reliable estimation of disease prognosis and patient survival, which are, therefore, valuable in decision-making of the clinical treatment of READ. [11] Recently, accumulative 5 studies have indicated the application of gene expression in the prediction and stratification of the survival prognosis for READ patients [12]. Unfortunately, this proposal has not been adopted as the routine clinical practice yet, due to the small sample size, excessive data fitting or inadequate evidence in most studies. On this account, the open and accessible large-scale databases containing data of gene expression, such as the TCGA [13] and ImmPort databases [14], allow for mining the potentially more reliable biomarkers to predict and classify the prognosis of READ patients.
Immune cell infiltration, postulated to manifest active tumor responses, has been detected in most human solid tumors, and lymphocytic infiltration of READ has been shown to confer a survival advantage [15]. In addition, immune escape has also been validated as a novel cancer marker.
Recently, immunotherapy-targeted specific immune checkpoints (including PD-1/PD-L1) therapy has given rise to surprising therapeutic effects in READ patients [16,17]. However, the prominent molecular events concerning the tumor cell-immunocyte interaction within the READ microenvironment should be further exploited and summarized, which can help to determine their potential roles in predicting the prognosis of READ patients.
In this study, we mainly constructed and validated a prognostic prediction model for READ on the basis of the immune-related genes according to the clinical characteristics of patients recruited from the TCGA and ImmPort databases. Our present outcomes might potentially help the clinicians to evaluate the therapeutic effect, to predict the prognosis, and to select proper therapy for READ.

Pre-processing of preliminary sample data and initial screening of immune-related genes in READ
The latest clinical data on follow-up were extracted by using TCGA GDC API. Altogether 171 RNA-Seq data samples were retrieved (shown in Table S1), 161 of them were tumor tissues. Moreover, the immune-related gene set was also derived from the ImmPort database, involving a total of 1811 genes, as presented in Table S2.
First of all, 156 of the retrieved RNA-seq data (Table S3) were subject to pro-processing.
Consequently, 156 samples involving 876 genes were adopted into further modeling analysis. Due to the small sample size of TCGA-READ and the limited number of death samples (26), events of disease progression were combined with death cases, which yielded to 46 after combination, aiming to improve the accuracy of the model. The data of processed samples were shown in Table 1.
Secondly, in consideration of the relatively small sample size of TCGA-READ, 80% samples of TCGA-READ datasets were assigned to the training sets, and all samples were considered as test sets, instead of establishing training sets and test sets with an average distribution of 0.5:0.5 among all samples. According to the above-described method, 156 samples were classified as the training set and the test set, respectively. Data in the final training set and test set were shown in Table S4 and Table S5, respectively. Meanwhile, the clinical information of samples from both training and test sets were shown in Table 1. As a result, there was no significant difference the training set and the test set, as indicated by P-value, revealing reasonable sample grouping.

Univariate survival analysis of the immune-related genes in the training set
Univariate Cox proportional hazards regression model was used to analyze all immune-related genes and the survival data by utilizing the survival coxph function of R package [18]. A p value <0.05 was considered as statistical significance.

Screening of immune-related genes specific to the prognosis of READ, and the establishment of the prognostic prediction model
At first, the R software package glmnet [19] was adopted for lasso regression analysis, which finally gave rise to the risk model based on specific immune-related genes. The formula were shown as follows: RiskScore=FYN*0.0667786+FGF18*0.075671487+TPT1*8.70E-05+ERAP2*-0.006313662+NFKB1*-0.03711375+TAP1*-0.002024+RARG*0.032456265+ADIPOR2*-0.00954341 +HSPA1A*0.001132712 Afterwards, related gene expression profiles were selected from both the training and test sets, which were subsequently substituted into the as-proposed model to compute the RiskScore value of each sample. Of note, the median RiskScore value was employed as the cutoff value to classify the samples from the high risk group (Risk-H) or low risk group (Risk-L), respectively. Finally, to comprehensively assess the efficiency, accuracy and stability of the model to in the prediction and classification of the prognosis of READ patients, ROC analysis, KM analysis and gene clustering analysis were performed.

Functional annotations and signaling pathway enrichment of the immune-related genes specific to prognosis
The gene families of the nine eventually-selected genes were annotated based on the human gene classification in the HGNC database [20]. Notably, KEGG and GO enrichment analyses were performed by using the clusterProfile [21] of the R software package for the nine identified immune-related, prognosis-specific genes in this study.

Correlation of the RiskScore value with the signaling pathways and the clinical characteristics of samples
To begin with, the score of KEGG functional enrichment analysis was analyzed using the ssGSEA function of the R software package GSVA [22]. Meanwhile, the correlation with the RiskScore value was also computed, followed by clustering analysis in accordance with the enrichment scores of each pathway in all samples. Moreover, we analyzed the correlations of related factors (including T, N, M stages, age and gender) with the RiskScore value. Finally, the nomogram model and forest plot were performed based on the related clinical characteristics along with the RiskScore value, followed by assessment of the correlations between the RiskScore value as well as clinical characteristics and patient survival.

Identification of specific immune-related genes according to survival and prognosis outcomes in READ patients
At first, relevant data were collected from the TCGA and ImmPort databases, which were subsequently subject to pre-processing. Afterwards, univariate Cox proportional hazards regression model was used to analyze all immune-related genes and survival data by using the R package survival coxph function, with the significance level set at P<0.05 (shown in Table S6). Finally, a total of 57 prognosis-significant immune-related genes were identified. The associations between the p-8 values of these 57 genes and the HRs as well as the expression quantities were shown in Fig.1.

Screening of prognosis-specific immune-related genes and construction of the prognostic prediction model for READ
Although 57 immune-related genes were identified, most of these genes were not good enough for clinical detection. Thus, we aimed to narrow the immune-related gene scope while maintain the high accuracy. To this end, these 57 genes were further compressed using lasso regression for reduction of the number of genes in the risk model. The Lasso algorithm, a type of shrinkage estimate, could be used to construct a penalty function to acquire a relatively refined model, so that it could compress some coefficients, which would be set as 0. Consequently, the advantages of subset shrinkage were retained. Moreover, as a biased estimate for the multicollinearity data processing, it could estimate parameters while be proper for realizing variable selection, and could solve the problem of multicollinearity in regression analysis to a better extent. In the present study, the glmnet of R software package was used for lasso regression analysis. Briefly, 57 genes were compressed into nine genes (shown in Figure S1 and Table S7), and the formula were summarized in Materials and methods

Section.
Thereafter, all samples from the training set were substituted into the formula to compute the RiskScore values. Typically, the median RiskScore value was employed as the cutoff value to classify samples into high risk (Risk-H) or low risk (Risk-L) groups. Moreover, ROC analysis concerning prognostic classification for RiskScore was also conducted. The OS duration of all samples was approximately two years ( Figure S2). As a result (Figure 2A), we evaluated the 1-3 year survival prediction efficiency of the model in this study, with the average AUC reaching 0.823. Additionally, the sample distribution of both Risk-H and Risk-L groups under various OS were displayed in Figure   2B, revealing no statistical significance in the sample size between the 0-and 1-year Risk-H and Risk-L group. Moreover, the sample size in Risk-H group after the 3 rd year was markedly decreased compared with that in Risk-L group. Of note, such variations became more obvious along with the extension of OS ( Figure 2C). The clustering outcomes of the samples from the training set were shown in Figure 2D. It was clear that the above-mentioned nine genes could be markedly clustered into the 9 high and low expression groups, respectively, while samples in the training set could also be assigned into two groups. Moreover, we also compared the RiskScore values of the two subclasses ( Figure 2E).
To validate the reliability of the prognostic prediction model, the expression profiles of the abovementioned nine genes were retrieved from the test set as well, which were be substituted into the validation model. Meanwhile, we computed the RiskScore values of all samples. Similarly, data extracted from the test set were adopted to evaluate the 1~3-year survival prediction efficiencies, as shown in Figure 3A. Additionally, sample distribution in both Risk-H and Risk-L group under diverse OS was shown in Figure 3B, which suggested not obvious difference in the sample size between the 0~1- year Risk-H and Risk-L groups. Moreover, the sample size in Risk-H group after the 2 nd year was obviously decreased in comparison with that in Risk-L group, which became more obvious with the extension of OS ( Figure 3C). The clustering outcomes for samples in the test set, and the difference in the RiskScore value between two subgroups, were shown in Figure 3D and E, respectively.
The KM survival curves of the risk model constructed based on the nine genes to classify the Risk-H and Risk-L groups in both training and test sets were shown in Figure 4. To be specific, the KM survival curve of the training set (p<0.0001) and test set (p<0.001) was shown in Figure 5A and In our present study, we found that the prognosis model established based on the expression profile data of the above nine immune genes showed great prediction accuracy and stability in identifying immune features.

Functional annotations of the immune-related genes and signaling pathway enrichment specific to patient prognosis
At first, the gene families of the nine obtained genes were annotated according to the human gene classification in HGNC database ( Table 2). As a result, these nine immune-related genes were significantly enriched in six gene families (p < 0.05), and the detailed information of gene function annotation was displayed in Table S8. In addition, the expression levels of these nine genes of the samples were significantly different between the Risk-H and Risk-L group ( Figure 5). Meanwhile, the clusterProfile of the R software package was used to enrichment analyses of the above-mentioned 10 nine prognosis specific immune related genes. Results of GO enrichment analysis and KEGG pathway enrichment analysis were shown in Figure 6A and Figure 6B, respectively, and relevant data were shown in Table S9 and Table S10. As was shown, most genes were enriched into multiple immunerelated, cancer-associated biological processes and signaling pathways.  Table S11. Afterwards, all the 34 pathways were chosen to conduct clustering analysis in accordance with their enrichment scores of all samples ( Figure 7). Besides, the correlation between the enrichment score and the RiskScore value was investigated by selecting the top five pathways with the highest GSEA enrichment score (including the prion diseases, ABC transporters, primary immunodeficiency, spliceosome and pendocytosis) to analyze the distribution between Risk-H and Risk-L groups. As shown in Figure 8, the scores acquired from these pathways were significantly different between Risk-H and Risk-L group.

Correlation of the
Afterwards, the correlations between various factors (such as T, N, M stages, age and gender) and the RiskScore value were analyzed as well, as shown in Figure 9. Clearly, there was no obvious association of other features with the RiskScore value (p>0.05), with the exception of N (p=0.0035), suggesting that the dependence of RiskScore model on regional lymph node invasion to certain extent.
Eventually, the RiskScore value was combined with the clinical characteristics for the establishment of the nomogram model. Nomogram, an approach to intuitively and effectively present the outcomes of a certain risk model, could be conveniently applied in outcome prediction. In the nomogram, the length of a straight line represented the effects of different variables and their values on the outcome. In this study, due to the inconsistency between TNM and Stage, nomograms for the combination of TNM+RiskScore or Stage+RiskScore were constructed, respectively, as shown in Figure 10. The RiskScore features were obviously associated with the highest impact on the prediction of survival rate, suggesting the good performance of the established risk model based on the nine genes on prognostic prediction.
Finally, the forest plot was established by utilizing both RiskScore and clinical characteristics. As shown in Figure 11, in the forest plot established by TNM+RiskScore and Stage+RiskScore, the HRs of RiskScore were around 3, with a p-value <0.05. The multivariate cox-regression analyses of diverse clinical characteristics and RiskScore were shown in Table S12.

Discussion
The current standard therapeutic options for READ include surgical resection alone for earlystage READ, and surgical resection followed by adjuvant radiochemotherapy for the advanced stage READ.
However, the effect of surgical resection is frequently restricted due to the local invasion of adjacent tissues by cancer cells or distant metastasis [23,24]. Meanwhile, radiochemotherapy is also limited by its toxicity on normal tissues in the body [25]. Conventional anti-cancer therapies exert great burden on the body in while obtain therapeutic benefits [26]. Thus, no consistent therapeutic benefit can be achieved among these patients through the clinical medication, which might be partially due to the potential toxic, side effects and tumor heterogeneity [27]. Nevertheless, the application of postoperative systemic adjuvant therapy is still controversial in clinical practice. As a result, it is crucial to retrieve the potential biomarkers for READ for prognosis and recurrence prediction, to implement and to benefit from the early adjuvant therapy for high-risk patients.
The relationship between the immune system and the pathogenesis and progression of malignant tumors has attracted great attention in recent years, which has shed novel light on READ treatment, thus promoting the continuous development of anti-cancer treatment [16]. Starting from the tumor origin, namely, the immune system of the human body, to control and kill tumor cells via the regulation of the immune system of the body and enhancement of the anti-tumor immunity in the tumor microenvironment, has ushered in a new way for anti-tumor therapy. Therefore, screening the novel and meaningful READ prognosis-specific immune-related genes is of great value in the prediction of patient prognosis and mining novel therapeutic targets [28,29]. In addition, the classification of READ based on the prognosis-specific immune-related genes would definitely contribute to the accurate prediction of patient outcome as well as identification of READ patients with high or low risk of postoperative recurrence.
In the present research, a total of nine prognosis-specific, immune-related genes were identified by means of big data mining, statistics and sorting including the TCGA and ImmPort databases.
Afterwards, we also constructed the prognostic prediction model according to the expression of these nine immune-related genes, followed by calculation of the RiskScore values of patients. Moreover, both prediction and verification of the model were performed. Typically, the presently-proposed prognosis model in this study, which was established based on the expression profiles of specific immune genes, rendered further classification of patients with clear clinical stage into various subgroups according to the predictive survival outcomes.
During the implementation of this project, samples extracted from the TCGA and ImmPort databases were investigated to identify immune-related genes specifically impacting the prognosis of READ patients. As a result, 57 immune-related genes with remarkable association with patient prognosis were selected, which were subsequently subject to shrinkage estimate. Among them, nine most typical ones showing evident association with patient prognosis were further chosen for the establishment of prognostic prediction model for READ patients through Lasso algorithm.
Subsequently, samples in both training and test sets were incorporated into the constructed model, respectively. Meanwhile, to evaluate the efficiency, accuracy and stability of the model in the prediction and classification of patient prognosis, READ patients were stratified according to the value of median RiskScore (Risk-H and Risk-L) to assess the. Thereafter, functional annotations, GO and KEGG signaling pathway enrichment analyses were conducted among the nine selected immunerelated genes. Our findings indicated that, all of the above-mentioned nine genes could be majorly enriched into several pathways with close association with READ and the corresponding immune response. In addition, the distributions of nine immune-related genes were examined in the samples in Risk-H and -L groups. Finally, the connection between the RiskScore value and the sample clinical characteristics as well as the related signaling pathways was investigated.

13
Among the nine genes, seven (NFKB1, FYN, ADIPOR2, TAP1, HSPA1A, ERAP2 and FGF18) out of them have been previously reported to be involved in the pathogenesis, progression, malignant transformation, and pathological process of immune microenvironment of READ, which are significantly correlated with patient survival and prognosis [30][31][32][33][34][35]. These previous findings have been verified the high reliability and accuracy of the bioinformatic mining results in this study. However, the correlations of the remaining two genes (RARG and TPT1) with READ have not been validated in neither basic nor clinical studies, which we are the most interested in. RARG has been confirmed to participate in the regulation of the proliferation and invasion of multiple malignant cancer cells [36], affecting production and release of cytokines and growth factors [37], modulating the innate immune response to viruses and pathogens [38]. Meanwhile, TPT1 has been validated to be involved in lymphocyte proliferation and activation, which can regulate the sensitivity of tumor cells to immunotherapy [39].

Conclusion
To sum up, our present findings could contribute to the identification of the novel markers for READ in clinic. Additionally, our proposed risk model based on the nine prognosis-specific immune-related genes would thereby provide multiple targets for the precise treatment of READ and facilitate the classification of READ patients at molecular subtype level. Furthermore, this established model is a promising tool to offer guidance for clinicians in prognostic prediction, clinical diagnosis and medication for READ patients with distinct immunophenotypes to some extent.  Figure 1 The       The GO (A) and KEGG pathway (B) enrichment analysis of the 9 specific immune-related genes.

Abbreviations
30 Figure 6 The GO (A) and KEGG pathway (B) enrichment analysis of the 9 specific immune-related genes.

Figure 7
Correlation of RiskScore with signaling pathways. KEGG functional enrichment score of each sample was analyzed, the correlation with RiskScore was calculated, respectively, based on the enrichment score of each pathway in each sample, and all the 34 pathways related KEGG pathways were shown. Clustering analysis had to be carried out according to the enrichment score. Correlation of RiskScore with signaling pathways. KEGG functional enrichment score of each sample was analyzed, the correlation with RiskScore was calculated, respectively, based on the enrichment score of each pathway in each sample, and all the 34 pathways related KEGG pathways were shown. Clustering analysis had to be carried out according to the enrichment score.
32 Figure 8 The relationships of enrichment score of prion diseases (A), ABC transporters (B), primary immunodeficiency (C), spliceosome (D) and pendocytosis (E) with RiskScore for each sample.
33 Figure 8 The relationships of enrichment score of prion diseases (A), ABC transporters (B), primary immunodeficiency (C), spliceosome (D) and pendocytosis (E) with RiskScore for each sample.   The forest plot constructed by combining the TNM+age (A) or stage+age (B) with RiskScore for READ patients.
39 Figure 11 The forest plot constructed by combining the TNM+age (A) or stage+age (B) with RiskScore for READ patients.