HHLA2 and NEBL as novel prognostic biomarkers based on Rayleigh survival model in colorectal cancer

The primary focus for this study was investigation of prognostic biomarkers in colorectal cancer (CRC), since biomarkers are instrumental in clinical decision making and patient management as well as playing pivotal roles in precision medicine. We analyzed indigenous dataset from colorectal cancer patients. Exon microarray dataset was used and 135 genes were identiﬁed as novel candidate biomarkers. 135 genes were further split into two groups: low and high gene expression values via ’maxstat’ algorithm, they were analyzed using Kaplan-Meier (K-M) analysis and univariate Cox model, and a set of 33 genes were identiﬁed as statistically signiﬁcant (p¡0.05). Furthermore, using the ’VARCLUS’ algorithms (a SAS software procedure) which is a useful tool for variable reduction, based on the divisive clustering tech-niques, a small subset of 5 genes were selected out of 33 signiﬁcant genes as potential candidates to build survival models. Both parametric and semi-parametric survival models were utilized to assess whether these 5 genes could be used as prognostic biomarkers. via index, comparing comparison Akaike and Bayesian esti-mates. The Tukey-Anscome plot and Quantile-Quantile plot as diagnostic tools applied to the parametric survival models.


Abstract Background
The primary focus for this study was investigation of prognostic biomarkers in colorectal cancer (CRC), since biomarkers are instrumental in clinical decision making and patient management as well as playing pivotal roles in precision medicine.

Methods
We analyzed indigenous dataset from colorectal cancer patients. Exon microarray dataset was used and 135 genes were identified as novel candidate biomarkers. 135 genes were further split into two groups: low and high gene expression values via 'maxstat' algorithm, they were analyzed using Kaplan-Meier (K-M) analysis and univariate Cox model, and a set of 33 genes were identified as statistically significant (p¡0.05). Furthermore, using the 'VARCLUS' algorithms (a SAS software procedure) which is a useful tool for variable reduction, based on the divisive clustering techniques, a small subset of 5 genes were selected out of 33 significant genes as potential candidates to build survival models. Both parametric and semi-parametric survival models were utilized to assess whether these 5 genes could be used as prognostic biomarkers. Results HHLA2 (p < 0.01) and NEBL (p < 0.01) genes emerged as potential biomarkers, based on the para-and semi-parametric models such as: Rayleigh, Exponential probability distributions, and Cox models and were also validated on independent datasets, apart from validation with qPCR test as well as with the cell lines patients data in the laboratory. A rigorous statistical evaluation for model's performance were done via Harrell's index, Brier score, Shoenfeld residual plots as well as with comparing several predictive model plots. We also made the comparison of Akaike and Bayesian Information(AIC and BIC) criteria as well as log-likelihood estimates. The Tukey-Anscome plot and Quantile-Quantile plot as diagnostic tools were applied to validate the parametric survival models. Conclusion Based on predictive models HHLA2 and NEBL novel biomarkers were found as statically significant.

Introduction
Precision and personalized medicine is coming of age with conspicuous results in cancer. Colorectal cancer owing to its heterogeneous nature could be a model disease for assessing the pros and cons of precision medicine [1,2,3,4,5]. Biomarkers are proposed to be important tools in practicing precision and personalized medicine [6,7,8]. Prognostic biomarkers are defined as measurable factors that can be used to assess the disease progression. Predictive biomarkers are used primarily for knowing the probable response of drug treatment [3,9]. Same biomarkers can serve as prognostic and predictive as well as can also be used as potential drug target. The progression of colorectal cancer is already well defined in the form of clinical models and pathways. There are several putative biomarkers which are classified according to type, pathway, clinical utility, and method of analysis. Mutation in KRAS gene is well known and clinically being used as predictive biomarker(s) for adjuvant chemotherapy in colorectal cancer [10]. Other biomarkers like gene mutations, loss of 18q, micro-satellite instability are being studied for their clinical utility and relevance. Several other biomarkers like miRNA, circulating tumor cells and DNA had been extensively studied though yet to reach their potential application. Gene expression based biomarkers are easy to assay as well as routinely being used in clinics. There is a need to discover better biomarkers that could be biologically and clinically relevant. Gene expression databases can be explored for novel biomarker discovery. However, the importance of indigenous data from local population is underscored alongside the idea of personalized medicine. In this report we used dataset of differentially expressed genes in local colorectal cancer patient samples [11]. Statistically significant differentially expressed genes were analyzed to find putative novel biomarker candidates. These genes were tested for their potential as prognostic biomarker using parametric and semi-parametric survival models. These models identified Human Endogenous Retrovirus-H Long Terminal Repeat-Associating Protein 2 (HHLA2) and Nebulette (NEBL) genes as potential independent prognostic biomarkers in colorectal cancer. Data sets from the Broad Institute https://www.broadinstitute.org and 'cBioPortal' https://www.cbioportal.org/study/summary?id=coadread_tcga_pan_ can_atlas_2018 were used to validate our results. These datasets will be referred as 'external' throughout the manuscript. Experimental validation of HHLA2 gene expression was carried out using quantitative real time PCR on matched tumor-normal samples from patients. Expression of HHLA2 gene was also analyzed in colorectal cancer cell lines. A schematic flow diagram for the work done in this paper is given below in Figure 1.

Methods
Ethics approval and consent to participate: The study was approved by the King Abdullah International Medical Research Center's bioethics committee. The matched tumor-normal samples from 41 patients will be referred as internal data were used to identify differentially expressed genes using Exon microarray (Affymetrix, Thermo Fisher Scientific, USA) and is described in [11]. All the patients were consented and the study (RC10/083) was approved by institutional review board before sample collection. Raw data was analyzed using Transcriptomics Analysis Console (Affymetrix, Thermo Fisher Scientific, USA) to obtain list of differentially expressed genes. Differentially expressed genes were analyzed using Ingenuity Pathway Analysis (IPA), Qiagen Biosciences, USA to identify novel candidate genes that qualify for biomarker identification. IPA was used with filter considering molecules associated with cancer and previously unknown as biomarker. Exon microarray dataset was used and 135 genes were identified as novel candidate biomarkers, split into two groups: low and high gene expression values via 'maxstat' algorithm from an R package, see: https://cran.r-project.org/ web/packages/maxstat/index.html. After split into two two groups, they were analyzed using Kaplan-Meier (K-M) survival analysis with log-rank test, and univariate Cox model. A set of 33 genes were found to be statistically significant from these analyses, as p values were less than the assumed significance level as: α = 0.05 considered as statistically significant in this work for all statistical analysis.
Further, using 'VARCLUS' algorithms, a SAS software procedure [13], a small subset of 5 genes genes were selected from the set of 33 genes. And these 5 genes were utilized as prognostics biomarkers to build survival models based on parametric and semi-parametric methods. To assess and evaluate predictive models, a rigorous diagnostic and performance tools were applied, namely: Harrell's index, Brier score, Shoenfeld residual plots as well as with comparison of several predictive model plots. We also made the comparison of Akaike and Bayesian Information(AIC and BIC) crieteria as well as the log-likelihood estimates. The diagnostic assessment for parametric survival models were based on Tukey-Anscombe (TA) and Quantile-Quantile plot.
A detailed outline for splitting continuous variables the 'maxstat' tool, feature selection the 'VARCLUS' algorithm, parametric, semi and non-parametric survival models, and model's performance evaluation are described in the supplementary file. All the statistical analysis were done using SAS software [14] and R statistical packages

Demographics distributions for internal and external datasets
The demographics and clinical characteristics for internal (N=41) and external (N=102) datasets are given in supplementary Table S1.

Filtering of genes, Kaplan-Meier and univariate Cox model
Using IPA tool with filtering algorithms a set of 135 genes were identified as potential biomarkers, and were further split as low and high expression values groups via 'maxstat' tool (an R-package). The z-statistic, with their p-values based on log-rank test and K-M survival analysis as well as univariate Cox model were shown in Table S2 (see: supplementary file); and a subset of 33 genes were found to be statistically significant (p < 0.05), and K-M plots are presented in the supplementary materials. These subset of genes include: B3GNT7, BTNL3, BTNL8, C2orf88, CA1, CCDC68, CEMIP, CLDN8, COL1A2, CPM, CWH43, HHLA2, IFITM1, MT1G, NEBL, NR3C2, PDK4, PLCE1, REG1B, SELENBP1, SELENOP, SI, SLC26A3, SLC51B, SMPDL3A, TGFBI, THBS2, THRB, TMIGD1, TRPM6, TSPAN7, UGT2A3, and UGT2B17. Out of these 33 genes, 16 of them were associated with CRC disease based on the IPA analysis, and were graphically shown in Figure 2, and their functional relevance were found to be significant for survival, and makes our results more reliable and biologically explainable. Selecting CLDN8, HHLA2, NEBL, SELENOP and THBS2 genes as predictors based on feature-selection algorithm In general, building predicive models we encounter with multicollinearity problem due to two or more explanatory variables in regression modeling get linearly related, which is always a challenge with omics data since genes are generally highly correlated. To overcome this multicollinearity in modeling, we utilized feature selection 'VARCLUS' algorithm (discussed in supplementary file). So, 5 clusters were created in such a way that genes from the same cluster are highly correlated with each other but have a low correlation with any other cluster. The selection process consisted of selecting one gene from each cluster. We selected the genes with minimum R-square ratio within its own cluster. A Figure 2: Based on univariate Cox-model and Kaplan-Meier survival plot, a subset of 33 genes were statistically significant and out of this subset of genes 16 genes are known to be associated with coloerctal cancer. IPA tool was utilized to make this graph.
gene selected this way is the best representative of the cluster. Based on this algorithm a set of five gene covariates namely: CLDN8, HHLA2, NEBL, SE-LENOP and THBS2 were chosen from each cluster respectively, and their K-M survival plots were given in supplementary file. The output results from this algorithm such as cluster number, name of the variable, R 2 with own cluster and next closest cluster, and the 1 − R 2 Ratio are given in Table 1.

HHLA2 and NEBL as biomarkers based on parametric models
We utilized parametric survival models based on: Weibull, exponential, lognormal, log-logistic, and Rayleigh probability distributions to assess whether these 5 selected genes can be potentially prognostic biomarkers or not. The estimated parameters from these models namely: slopes, standard error, z-statistic and p values are given in Table 2. The CLDN8 was statistically significant with p-value less than 0.05 for log-normal and log-logistic models. The exponential, log-normal-log-logistics models showed statistical significant for the gene HHLA2 as the p-values is less than 0.05, whereas the Rayleigh model shows high significant statistically, since the p-value was less than 0.01. The NEBL gene covariate was statistically significant for Weibull and Exponential models, were highly significant based on the Rayleigh model. The SELENOP and THBS2 were not statistically significance since p-values were greater than 0.05. The output results are given in Table 2, estimated β coefficients (slopes), standard error and p-values are marked as '*', '**', and '***' which indicates that the p-values are less than: 0.1, 0.05, and 0.01 respectively.
Model's valida- Validating HHLA2 and NEBL as biomarkers on external dataset The external dataset 1 was used to validate the four gene covariates as potential biomarker predictors namely: HHLA2, NEBL, SELENOP and THBS2. For these genes, K-M survival analysis were done and based on the log-rank test, they showed statistical significance, except THBS2 gene, their survival plots are shown in supplementary file.
The parametric survival models: Weibull, exponential, log-normal, log-logistic, and Rayleigh were utilized for building the predictive models. The estimated parameters, standard errors, and p-values are presented in Table 2. HHLA2 gene showed statistical significance for Weibull, exponential and log-normal models. The HHLA2 gene covariate was highly statistically significant with p-value less than 0.01 for the Rayleigh model. The NEBL gene covariate was statistically significant for: Weibull, exponential, log-normal, and log-logistic models with pvalue less than 0.05 and was highly statistically significance for p-value less than 0.01, based on the Rayleigh model. Hence, the validation for the gene covariate HHLA2 and NEBL as prognostic biomarkers were tested for internal as well as external datasets (an independent dataset for a set of different population).

Comparison of normal versus tumor tissues for expression values of HHLA2 and NEBL
The Wilcoxon test fort two independent samples was applied to compare the expression values of HHLA2 and NEBL genes between normal and tumor tissues, the comparison test have shown statistical significance's for both the genes HHLA2 and NEBL with p-value less than 0.00001, and they are shown in the Violin plots of Figure 3.

Quantitative real-time polymerase chain reaction
We utilized SYBR green based quantitative PCR to analyze the expression levels of HHLA2 gene in matched tumor and normal samples as well as colorectal cancer cell lines. The sequence for HHLA2 primers used is as follows : Forward primer 5 -TAC AAA GGC AGT GAC CAT TTG G -3 Reverse primer 5 -AAG TGT AAA TTC CTT CGT CCA GA -3 . HPRT1 was used as housekeeping gene. All experiments were done in triplicate using Applied Biosystems QuantStudio 6 Flex Real-Time PCR System (Thermo Fisher Scientific, USA). RNA was extracted from frozen patient tissue samples as well as cell lines usingThe PureLink RNA Mini Kit (Thermo Fisher Scientific,USA). Expression Suite software was used for relative quantification analysis. Normal sample from the same patient was used as calibrator. For cell lines, CCD841 was used as calibrator.

Cell Culture
Colorectal cancer cell lines -CaCO 2 , HCT116, HCT8, HT29, RKO, SW480, SW62 and DLD1 were grown in advanced DMEM media supplemented with 10% fetal bovine serum and supplements. CCD841 cell line was used for nor-1 CLDN8 gene was not in the external dataset tumor tissues gene expression values, based on the Wilcoxon test p-value shows high statistical significance (p < 0.00001), moreover shape (thicker or thinner) of the violin plots tell that the thicker part implies the values in that section of the violin has higher frequency, and the thinner part implies lower frequency. Whereas, on the right panel the violin plot for NEBL gene: normal vs. tumor tissues shows the expression values, applying the Wilcoxon p-value shows high statistical significance (p < 0.00001), moreover shape (thicker or thinner) of the violin plots tell that the thicker part implies the values in that section of the violin has higher frequency, and the thinner part implies lower frequency, mal colon cells. All cells were cultured in 5% CO 2 and 37 o C in a humidified incubator.1.
The expression values for HHLA2 gene in 8 CRC cell lines were analyzed.

Discussion
Biomarkers are proposed to be critical tools for practicing precision and personalized medicine. In this study we attempted to propose two new prognostic biomarkers for colorectal cancer. We utilized an indigenous dataset from local colorectal cancer patients and selected candidate novel biomarker genes. By using well curated databases, rigorous statistical analyses and prediction models we propose HHLA2 and NEBL genes as the novel prognostic biomarkers. Decreased expression of HHLA2 gene was found to be associated with poor survival in colorectal cancer patients. The quest for discovering biomarker has supported the growth of precision medicine. In colorectal cancer early detection is the key for better prognosis. However, the current clinical practice of using colonoscopy, gFOBT, FIT or other tests have been less adequate for technical and economic reasons. Patient compliance for invasive procedures has been poor. The clinical applications of biomarkers in CRC is pertinent for prognostic stratification, surveillance, and therapy selection [15]. Carcinoembryonic antigen (CEA) is a well-known prognostic biomarker used in the clinic for CRC management. High levels are associated with cancer progression and can indicate recurrence after surgery. However, high CEA levels are not specific to CRC and can also be found in other malignancies and inflammatory conditions, such as inflammatory bowel disease, liver disease, and pancreatitis [16] and [17]. Microsatellite instability and BRAF mutation status are also now considered for prognostication of CRC [18]. Current and emerging biomarkers also reflect the genomic complexity of CRC, and include a wide range of aberrations such as point mutations, amplifications, fusions and hypermutator phenotypes, in addition to global gene expression subtypes [19]. Gene expression based biomarkers are relatively easier to quantify and are more accurate and sensitive than the traditional IHC based assays. To search for novel biomarkers, we utilized the 'biomarker discovery' tool in IPA tool which resulted in a list of 135 genes with no previously established evidence for their role in CRC prognosis. IPA based biomarker discovery has been successfully reported earlier [20] and provided a list of putative candidate novel biomarkers for CRC. There are several reasons not to include all 33 genes while building predictive model, one of the important issue is the multicollinearity problem which is due to high correlatedness among predictors (it's well known that gene expression values are highly correlated). Another problematic factor in model building is associated with over-or under-fitting because of high dimensional gene expression data, which have too many genes as predictors with limited numbers of samples or observations. To overcome these issues we utilized the VARCLUS algorithm [13]. Among five genes that were selected for model SELENOP and THBS2 genes are known to be involved in colorectal cancer progression. Downregulation of human SEPP1 [SELENOP] mRNA in colorectal carcinoma cells is associated with colorectal cancer in human [21] [?]. Mutant human THBS2 gene (c.1903G¿A translating to p.G635S [somatic missense]) is observed with adenocarcinoma in human colon (COSMIC: observed in 2 of 24 samples) [?]. Rest of the three genes were completely novel and made good candidates for Biomarker. We attempted five different parametric models and compared them to arrive at most statistically significant tool. This approach has potential benefit of proposing and testing a biomarker with high confidence [?] and prevents high rate of false positives. Rayleigh model showed least minimum standard error. We also used the semi-parametric cox model which is widely used in biomarker discovery [33]. HHLA2 and NEBL genes were found to be significant whereas CLDN8, SELENOP and THBS2 were having higher p-value. HHLA2 gene has been found to be downregulated in CRC in a pan cancer study and that is in conformity with our results. However, results from various databases cannot conclude the favorable or unfavorable prognosis associated with HHLA2 gene expression [25] However, in renal cell carcinoma higher expression of HHLA2 have been associated with poor outcome [26]. Higher expression of HHLA2 have been reported in NSCLC based on IHC studies [?]. Two studies used IHC technique to report higher expression levels of HHLA2 in CRC [28] and [29]. These studies are limited both in technical sensitivity and the sample size as compared to other studies which concluded lower expression levels of HHLA2 in CRC and its prognostic value based on transcriptomic data from multiple databases, including cBioPortal, TCGA, Cistrome, TIMER, On-comine, Kaplan-Meier, GeneXplain. Since CRC has not benefitted much from immunotherapy especially in non-MSI tumors (which comprise ¿85% of all CRC tumors), a new immune checkpoint molecule, HHLA2, which is widely expressed in PD-1 negative human tumors, may be a promising target. HHLA2 is being suggested as prognostic as well as immunotherapeutic target in other cancer types e.g. Intrahepatic cholangiocarcinomai [30]. Our study provide evidence to support the use of HHLA2 as prognostic biomarker. Further studies are needed to establish its role in targeted immunotherapy. Nebulette gene expression has been reported earlier to be a potential prognostic biomarker based on quantitative real time PCR experiment [31] and support our results based on more robust technique of microarray. Our results provide further evidence for use of NEBL gene expression as an independent predictor for CRC prognosis. Higher expression of NEBL was reported to be predictor for better outcome and is in conformity with our results [32]. Our independent analysis of indigenous data based on local population samples from matched tumor normal samples provide confidence in use of NEBL gene as a putative prognostic biomarker.
With the current scope and importance of personalized medicine, 2 biomarkers namely: HHLA2 and NEBL were identified as potential prognostic biomarkers. These two genes could be utilized in clinical decision making for patients with colorectal cancer. Since their significance were based on the probabilistic theory with the utilization of parametric survival models, whereas the parametric based models have better predictive power to make prediction in comparison with the semi-or non-paramettric medicals. Particularly using Rayleigh survival model had shown that it has a better model fit, as it's evident from the model's diagnostic assessments via Tukey-Anscombe and Quantile-Quantile plot and had shown high statistical significance with 95% confidence level. The model's assumptions were validated as well as graphically presented for the semiparametric Cox model as well. Other evaluation for the model were based on the AIC, BIC, and log-likelihood as well as with the Harrel's index which gives the predictive probability of 81%, were also applied to assess the performance. The statistical significance for HHLA2 and NEBL genes were further validated on the external dataset available in public domain apart from the validity were tested based on the PCR's result; apart from comparison of normal versus tumor tissue based upon the Wilocoxon two samples test for HHLA2 and NEBL also exhibited a high statistical significance.

Conclusion
Higher expression of HHLA2 and NEBL genes could predict better overall survival in colorectal cancer. This study provides substantial evidence based on indigenous patient data and validated in other datasets. Rigorous experimental validation is presented to support this evidence, apart from predictive statistical survival models.

Scopes and limitation of this study
The HHLA2 and NEBL genes can be utilized as potential prognostic biomarkers in the clinical decision making for patients with CRC. The Rayleigh survival model had shown as the most suitable predictive model from the family of parametric survival model and could be used as a suitable and reliable predictive model since parametric models have better statistical power. This is based upon the diagnostics and the validation tools.
The comparison between the normal and tumor tissues had shown statistically significant for the HHLA2 and NEBL genes.
The statistical significance for both the genes HHLA2 and NEBL based on parametric, semi-and non-parametric statistical survival models for the internal as well as an external data as an independent dataset on different set of population. The gene HHLA2 was also validated with the help of PCR generated data for a subset of subjects from the internal dataset.
This study report is based on the retrospective data which may have some bias due to statistical sampling procedure, as the subjects were of mostly older age population.
The total subjects for the internal dataset had only 41 subjects with event of died subjects were 13, and 28 subjects censored (alive or status unknown). The external dataset was having only 100 subjects and events (dead) were 8 and remaining subjects were either alive or their status were unknown, since very few events were observed in this external dataset though it had suffice one of the requirement for a good model fit to have minimum 5 events per coariate [38]. Another independent dataset TCGA dataset with 469 observations with events (93 dead) and 376 subject were censored at the time of last follow-up, and had shown statistical significance with parametric as well as semi-and non-parametric survival model for the gene covariate HHLA2 but not for the gene NEBL.
The PCR laboratory test for the gene NEBL had not been done due to logistic as well as limited resources.
We had also explored to validate on on few other 'GEO-series' independent datasets, but the genes HHLA2 and NEBL, did not show statistical significance. This shows the complexity and heterogeneity of subject population as well as poses challenges to build predictive models based on the probabilistic based approach.
Since the genes HHLA2 and NEBL had shown statistical significant for HHLA2 and NEBL for survival models for internal as well as the external dataset, apart from the laboratory validation for the gene HHLA2 based on the PCR laboratory analysis.
We are also intending to further explore both these genes in vitro as well as animal based models to confirm whether both these genes could be utilized as prognostic biomarkers, though it's solely based upon the pending grant from the research center.
AlGhamdi had contributed as laboratory related work. Other authors had reviewed the manuscript. Table 2: The parametric models: Weibull, exponential, log-normal, log-logistic, and Rayleigh based probability distribution are given. The estimated parameters, standard errors and statistically significant parameters are marked with '*', '**', and '***' for α significance values as: 0.1, 0.05 and 0.01.