A modified risk detection approach of biomarkers by frailty effect on multiple time to event data

Multiple indications of disease progression are found in a cancer patient, namely loco-regional relapse, distant metastasis and death. Early identification of these indications is necessary to change the treatment strategy, and biomarkers play an essential role in this aspect. The survival chance of a patient depends on the biomarker, and the treatment strategy also differs accordingly; e.g., the survival prediction of breast cancer patients diagnosed with HER2-positive status is different from that of patients with HER2-negative status, which results in a different treatment strategy. So, the heterogeneity of the biomarker statuses or levels should be taken into consideration while modelling the survival outcome. This heterogeneity factor, which is often unobserved, is called frailty. When multiple indications are present simultaneously, the scenario becomes more complex, as only one of them can occur, which censors the occurrence of the other events. Incorporating independent frailties of each biomarker status for every cause of indication will not depict the complete picture of heterogeneity. The events indicating cancer progression are likely to be inter-related, so the correlation should be incorporated through the frailties of the different events. In our study, we considered a multiple events, or multiple risks, model with a heterogeneity component. Based on the estimated variance of the frailty, the threshold levels of a biomarker are utilised as an early detection tool for disease progression or death. An additive gamma frailty model is considered to account for the correlation between the different frailty components, and parameter estimation is performed using the Expectation-Maximization algorithm. With an extensive algorithm in R, we have obtained the threshold levels of activity of a biomarker in a multiple events scenario.


Introduction
Identification of disease progression is an essential aspect of the treatment of cancer. It is observed that specific biological characteristics of the body of a cancer patient deviate from those of healthy people. Thus, an accurate indicator which can capture this deviation at an early stage is required. This necessitates the search for measurable biological entities which can serve as indicators of cancer onset. Such biological entities are called biomarkers [1]. The identification of these occurrences beforehand is necessary for treatment intervention [2]. In survival analysis, the interest lies in the waiting time until the occurrence of a well-defined event. Based on the objective of the study, the event may be defined differently, such as distant metastasis (DM), loco-regional relapse (LR) and death. In some situations, more than one event may be observed, and the evolution of the patient's state over time is studied. Multi-state models play an essential role in analysing such complex biological processes [3,4]. Due to the inter-related nature of the states, individual analysis of each event will not reveal the dependency structure among them, and the contributions of the prognostic factors towards disease progression are misunderstood [5].
In cancer treatment, biomarkers can influence how a particular cancer behaves and how it may respond to a specific treatment. The survival probability of a patient depends on the status of the biomarker, and the treatment plan is also changed accordingly; e.g., the survival prediction of breast cancer patients diagnosed with HER2-positive status is different from that of patients with HER2-negative status. This results in a completely different treatment strategy, as the treatment for HER2-positive disease has little chance of acting as effectively as that for HER2-negative disease. These different statuses or levels create a disparate effect on each of the events. Suppose a biomarker is measured with continuous values u_1, u_2, ..., u_n. The range of the u_i's can be split into two parts, [min u_i, a] and (a, max u_i], and thus the expression values can be divided into two levels. These two levels may have different effects on disease progression as well as on the treatment regime. The key lies in finding the optimal threshold level to visualise this heterogeneous effect more prominently. This heterogeneity between clusters of observations, which is often unobserved, can be modelled using a random frailty term. The variance of the frailty term can be interpreted as a measure of heterogeneity between different groups of patients. Possible correlation between patients diagnosed with a specific biomarker level can be modelled using a shared frailty component. However, when multiple causes of event occurrence, or competing risks, are present simultaneously, using only one frailty term per biomarker level for every competing risk will not properly explain the overall heterogeneity. Similarly, individual frailties for each indication under each biomarker status fail to capture the dependency among the competing events. The frailty components among the different causes of failure under a specific biomarker level are likely to be correlated.
Thus, the overall heterogeneity can be modelled using a combined contribution of both level-specific and event-specific variabilities [6]. The construction of a prognostic tool for a patient's treatment strategy incorporating both of these variabilities is a challenging problem in oncology research.
Due to the varying effect of biomarker statuses on disease progression, proper detection of biomarker statuses carries enormous importance. The treatment procedures for different biomarker statuses are likely to differ. The gene expression collected for a particular biomarker is continuous in nature, and it is usually assumed that all genes follow the same underlying distribution. However, researchers have found that gene expression analyses of several cancer biomarkers show multimodal expression patterns [7,8].
Genetic mutations and amplifications are often inconsistent across tumours, and their effects can be visible in smaller subsets of patients. This provides the idea of bimodally and multimodally expressed genes that have distinct modes corresponding to statistically significant differences in survival times [9]. Thus, a proper way of classifying statuses based on a statistical measure is required, and the frailty component can be used to serve this purpose. Patients with the same biomarker status are likely to be clustered, and hence the frailty variance might be lower for them. Thus, with combinations of different threshold values, we can classify the biomarker statuses accordingly and investigate the frailty variance for each of these combinations to obtain the most useful classification of statuses. In the context of medical studies, the event of interest is defined in different forms, such as disease recurrence, progression, etc. The widely used Cox proportional hazards model [10] describes the relationship between a set of explanatory variables and a time-to-event outcome. This model also takes into consideration the possible censoring mechanism when an event has not occurred during the study period. The observed event times are assumed to be independent for the application of this model. However, this assumption is not appropriate when the survival durations of patients with a common characteristic may be correlated [11]. This type of dependence in survival data can be modelled using a random frailty term, which represents the unobserved heterogeneity between observations or between clusters of observations. The variance of the frailty component can be interpreted as a measure of heterogeneity between groups or individuals.
Analogous to the single end-point analysis, the Cox proportional hazards model can be extended to competing events of interest, with cause-specific hazards for each cause of failure. The focus of multi-state models is to study the process of moving from one state to another. The study of multiple risks with clustered data complicates the possible dependency structure. In the context of competing events, a single frailty term for all the causes of failure is not suitable to explain the overall randomness, while independent frailty terms for each cause of failure ignore the dependency structure due to a particular risk. The random effect in the survival model was introduced by Beard [12] as a longevity factor for modelling human mortality. Vaupel et al. [13] first introduced the term frailty to indicate the unobservable random effect shared by subjects in a standard risk group. Keyfitz & Littman [14] developed a procedure to estimate life expectancy taking into consideration the heterogeneity present in the population. Yashin et al. [15] introduced the shared frailty model using gamma frailty to analyse twin survival data. Frailty models for bivariate survival data were proposed by Oakes [16], who also introduced several possible frailty models for the analysis of survival data. Extensive work on frailty models has been done for survival data in different areas of research [17,18,19,20].
Fine and Gray's subdistribution hazard model [21] is useful for the analysis of competing events. This model has also been extended to incorporate a frailty component to account for cluster dependencies in the cumulative incidence function of the event of interest in the presence of competing risks. Correlated frailty models in the presence of competing risks were analysed by Wienke et al. [22] under the assumption of independence between the risks. Liquet et al. [23] investigated multi-state frailty models for multi-centre data using both independent and joint frailties for modelling the dependency structure among transition intensities. Rutten-Budde et al. [24] analysed multi-centre heterogeneity with a competing risk frailty model for two diseases by considering both centre-specific and disease-specific random components. Abbring and van den Berg [25] discussed the identifiability of cause-specific hazard frailty models. Ha et al. presented a detailed h-likelihood-based inference procedure for a multivariate frailty model under a competing risk setting [26].
In this paper, we consider a competing events model and incorporate biomarker level-dependent heterogeneity to observe its effect on survival prediction. An additive gamma frailty model is used multiplicatively on the cause-specific hazard to model the dependency structure within biomarker levels and between two competing events. The expectation-maximisation algorithm is utilised for the estimation procedure, and a strategy is outlined to compute the standard errors of the estimates. Based on the frailty variances of the competing events at different biomarker levels, an ideal threshold level of the biomarker is defined, and risk prediction is performed in a more accurate manner. The obtained threshold levels of the biomarker determine the most efficient risk prediction strategy for the different competing events. In Section 2, we discuss the proposed risk detection procedure. In Section 3, a simulation study is described to show the efficacy of our procedure and its effectiveness in identifying the heterogeneity in risk. In Section 4, we apply the procedure to a real dataset and present the corresponding results.

Methodology
Suppose there are n independent subjects (i.e., patients). Let T_i be the time to failure for the i-th patient, i = 1, 2, ..., n, and let X_i be an m × 1 vector of covariates. Let C_i denote the non-informative censoring time, independent of T_i. The event indicator δ_i = 1 indicates that the failure time is observed, and δ_i = 0 that it is censored. The hazard function for the i-th patient is defined as

λ(t) = lim_{Δt→0} Pr(t ≤ T_i < t + Δt | T_i ≥ t) / Δt.

The survival function S(t) = Pr(T > t) denotes the probability of the survival time being greater than t. For a patient who died at t_i, the contribution to the likelihood function is the density at that time, written as the product of the survival function and the hazard function, i.e., L_i = f(t_i) = S(t_i)λ(t_i). If the patient is still alive at t_i, then the likelihood contribution is L_i = S(t_i). So, the survival likelihood for all n patients is

L = ∏_{i=1}^{n} λ(t_i)^{δ_i} exp{−Λ(t_i)},

where Λ(t) = ∫_0^t λ(s) ds. In the multi-state set-up, there is correlation present between the diseases, and the evolution of the patient's state from one to another is studied. Very often, the Markov assumption is adopted in the multi-state set-up for mathematical convenience [23]. In our study, we ignore the transitions and only consider whichever event appears earlier. So, we obtain a competing risk framework with multiple risks that are present for a patient at the beginning of the study.
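As an illustration, the likelihood above can be evaluated numerically. The following is a minimal Python sketch (the paper's own implementation is in R); the constant exponential hazard is chosen purely for the example and is not part of the paper's model.

```python
import numpy as np

def survival_loglik(times, status, hazard, cum_hazard):
    """Right-censored survival log-likelihood.

    Each subject contributes L_i = lambda(t_i)^{delta_i} * S(t_i) with
    S(t) = exp(-Lambda(t)), so the log-likelihood is
    sum_i [ delta_i * log lambda(t_i) - Lambda(t_i) ].
    """
    t = np.asarray(times, dtype=float)
    d = np.asarray(status, dtype=float)
    return float(np.sum(d * np.log(hazard(t)) - cum_hazard(t)))

# Toy exponential example: lambda(t) = 0.5, Lambda(t) = 0.5 t;
# subject 1 fails at t = 1, subject 2 is censored at t = 2.
ll = survival_loglik([1.0, 2.0], [1, 0],
                     hazard=lambda t: np.full_like(t, 0.5),
                     cum_hazard=lambda t: 0.5 * t)
```

The same function applies to any parametric hazard by swapping the `hazard` and `cum_hazard` callables.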
In multiple risk analysis, the interest is to identify the failure rate for a particular cause. The cause-specific hazard rate for cause j and patient i is defined as

λ_j(t) = lim_{Δt→0} Pr(t ≤ T_i < t + Δt, δ_i = j | T_i ≥ t) / Δt,

where δ_i = 1, 2, ..., J indicates the cause of failure and δ_i = 0 implies censored data. The cumulative cause-specific hazard and the probability of no failure from any of the causes are defined as

Λ_j(t) = ∫_0^t λ_j(s) ds  and  S(t) = exp{−∑_{j=1}^{J} Λ_j(t)}.

So, under competing risks, the overall likelihood for all the patients with different causes of failure is defined as [27]

L = ∏_{i=1}^{n} ∏_{j=1}^{J} λ_j(t_i)^{I(δ_i = j)} exp{−Λ_j(t_i)}.

When the proportionality assumption holds, the widely used Cox proportional hazards (PH) model is useful to observe the effect of covariates on the different causes of failure. Using the proportional hazards model, the cause-specific hazard function of cause j for the i-th patient is defined as

λ_j(t | X_i) = λ_{j0}(t) exp(X_i′ β_j),

where λ_{j0} is the cause-specific baseline hazard function, X_i is the covariate vector for the i-th patient and β_j is the corresponding coefficient vector for cause j. The likelihood function for the PH model follows by substituting these cause-specific hazards into the expression above. The presence of a frailty effect cannot be ignored in survival data analysis. The effect of unobserved heterogeneity on the hazard function can be measured using a random component in survival models, with the variance of the frailty term used as a measure of the heterogeneity in the data. The univariate frailty model introduced by Vaupel et al. [13] retained the proportionality assumption and thus introduced the idea of a multiplicative effect of the frailty component on the hazard function. In the individual frailty model, it is assumed that each individual has its own frailty component and thus an individual-level impact on the hazard rate. In the shared frailty model, the frailty component is shared among all the individuals belonging to a particular cluster or group.
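The competing-risks likelihood contribution of a single subject can be sketched directly from the formulas above. This Python snippet is illustrative only (the paper's code is in R); the two constant-hazard causes are an assumed toy example.

```python
import numpy as np

def competing_loglik(t, delta, hazards, cum_hazards):
    """Per-subject competing-risks log-likelihood.

    With cause-specific hazards lambda_j and cumulative hazards Lambda_j,
    a subject failing from cause delta at time t contributes
    log lambda_delta(t) - sum_j Lambda_j(t); a censored subject
    (delta = 0) contributes only -sum_j Lambda_j(t).
    """
    ll = -sum(L(t) for L in cum_hazards)
    if delta > 0:
        ll += np.log(hazards[delta - 1](t))
    return ll

# Two toy exponential causes with rates 0.2 and 0.3;
# the subject fails from cause 1 at t = 2.
haz = [lambda t: 0.2, lambda t: 0.3]
cum = [lambda t: 0.2 * t, lambda t: 0.3 * t]
ll = competing_loglik(2.0, 1, haz, cum)
```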
Different distributional assumptions are made for the frailty component, namely the Gamma, Gaussian, Exponential and t distributions, etc. In this study, we assume that the expression of a biomarker can be split into different levels based on threshold values and that those levels have heterogeneous effects on the survival outcome. The proportional hazards frailty model is defined as

λ(t | X_i, W_k) = W_k λ_0(t) exp(X_i′ β),

where W_k is the level-specific, or shared, frailty effect of the k-th level of the biomarker. In shared frailty, the frailty component is shared by all the individuals in a cluster, so the frailty variance can be considered a measure of heterogeneity among the patients in that cluster. Let W = (W_1, W_2, ..., W_K) be the frailty effects associated with the K levels of the biomarker, and let d_k and n_k be the number of events and the number of patients having the k-th level of the biomarker, respectively. δ_{ki} = 0, 1, 2, ..., J is the indicator of the event/cause of failure for patient i in level k.
Here, the frailty component is assumed to follow a Gamma distribution. The conjugacy of the gamma distribution given the survival data makes it a popular choice for the frailty component [29]: it allows the explicit form of the posterior distribution to be computed and simplifies the formulation of programming code. Other distributional choices for the frailty term can be found in the literature [11].
So, the likelihood of the survival data with independent frailties for the J competing events is

L = ∏_k ∏_{i=1}^{n_k} ∏_{j=1}^{J} [W_{kj} λ_{j0}(t_{ki}) exp(X_{ki}′ β_j)]^{I(δ_{ki}=j)} exp{−W_{kj} Λ_{j0}(t_{ki}) exp(X_{ki}′ β_j)}.

By integrating out the frailty terms, we obtain the observed data likelihood [28]. Using the same shared frailty irrespective of the events is not enough to completely explain the randomness in the survival data; the heterogeneity due to the different events also needs to be taken into account. Conversely, if we assign independent frailty terms to each cause of failure within a biomarker level, then the dependency due to the effect of that level is ignored. This correlation can be translated through the frailty components. Yashin et al. [15] first proposed the decomposition of the frailty term into two independent frailties, one being the shared frailty and the other the cause-specific independent frailty. Petersen et al. [20] extended the Cox model to incorporate additive gamma frailties and estimated the parameters using the EM algorithm.
W_{k1}, W_{k2}, ..., W_{kJ} are the frailty variables assigned to each cause of failure for biomarker level k. So, for example, with 4 levels of a biomarker and 3 different causes of failure, the full set of frailty effects will be {W_kj}, k = 1, ..., 4, j = 1, 2, 3. Each frailty component for level k is constructed by adding two parts: a common component which has the same effect on all the causes, and an independent component which has an unobserved cause-specific effect.
The frailties are defined as

W_kj = Z_k0 + Z_kj,  j = 1, 2, ..., J,

where Z_k0 is the shared frailty component due to the level of the biomarker, irrespective of the cause of failure, and Z_k1, Z_k2, ..., Z_kJ are the risk-specific frailty components for biomarker level k. Z_k0 ∼ Γ(ν_0, 1), Z_k1 ∼ Γ(ν_1, 1), ..., Z_kJ ∼ Γ(ν_J, 1), and Z_k0, Z_k1, ..., Z_kJ are mutually independent. Thus, the frailties are distributed as W_kj ∼ Γ(ν_0 + ν_j, 1). The variances and correlations are

Var(W_kj) = ν_0 + ν_j  and  Corr(W_kj1, W_kj2) = ν_0 / √{(ν_0 + ν_j1)(ν_0 + ν_j2)}

for two causes of failure j_1 and j_2 within a biomarker level k. The model parameters are estimated by maximising the logarithm of the likelihood function. The frailties within a biomarker level are associated with each other, while patients across different clusters are independent. So, the likelihood function is the product of the likelihoods of all patients across the different levels.
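The additive construction and the resulting variances and correlations can be checked empirically. The following Python sketch (illustrative only; the paper's code is in R) simulates the additive gamma frailties for two causes with assumed shape parameters and compares the sample moments with the formulas above.

```python
import numpy as np

rng = np.random.default_rng(0)
nu0, nu1, nu2 = 1.5, 2.0, 2.5        # illustrative shape parameters, scale 1
n = 200_000

# Additive construction: W_kj = Z_k0 + Z_kj for two causes j = 1, 2
z0 = rng.gamma(nu0, 1.0, n)
w1 = z0 + rng.gamma(nu1, 1.0, n)
w2 = z0 + rng.gamma(nu2, 1.0, n)

# Theory: W_kj ~ Gamma(nu0 + nu_j, 1), so Var(W_kj) = nu0 + nu_j and
# Corr(W_kj1, W_kj2) = nu0 / sqrt((nu0 + nu_j1)(nu0 + nu_j2))
emp_var = w1.var()
emp_corr = np.corrcoef(w1, w2)[0, 1]
th_corr = nu0 / np.sqrt((nu0 + nu1) * (nu0 + nu2))
```

The shared component Z_k0 is what induces positive correlation between the cause-specific frailties within a level; setting ν_0 = 0 would recover the independent frailty model.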
The complete data likelihood for level k, given the data and the frailty components, is the product of the cause-specific frailty likelihood contributions over the patients in that level and the gamma densities of the frailty components. Here, n_k denotes the number of patients in biomarker level k, and d_kj denotes the total number of patients with cause of failure j, j = 1, 2, ..., J. The baseline hazard function remains unspecified and is estimated during the Expectation-Maximization (EM) optimisation. As the frailty components are not observed, the estimation procedure becomes more complicated than the traditional maximum likelihood method. Considering the unobserved frailties as missing information, we can estimate the model parameters using the EM algorithm.
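For the simpler special case of a single shared Gamma(ν, ν) (mean-one) frailty per cluster, gamma conjugacy gives the E-step expectation in closed form; the additive model requires the joint posterior of all frailty components, but the same conjugacy argument drives the computation. The sketch below illustrates the closed form (Python for illustration; the paper's code is in R).

```python
def e_step_frailty(nu, d_k, risk_sum):
    """E-step expectation for a Gamma(nu, nu) (mean-one) shared frailty.

    By gamma conjugacy, W_k | data ~ Gamma(nu + d_k, nu + risk_sum), where
    d_k is the number of observed events in cluster k and risk_sum is
    sum_i Lambda_0(t_i) * exp(x_i' beta) over the cluster's patients, so
    E[W_k | data] = (nu + d_k) / (nu + risk_sum).
    """
    return (nu + d_k) / (nu + risk_sum)
```

A cluster with many events relative to its accumulated risk (d_k large, risk_sum small) thus gets a posterior frailty mean above 1, i.e., it is judged frailer than average.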
In genomic studies with a large number of gene expressions, it becomes necessary to combine the results of multiple tests and draw inference based on a meta-analysis. The p-value combination approach is particularly advantageous when the data underlying the tests are difficult to merge. Results of multiple experiments are generally ordered based on a measure of association, such as the p-value. A single p-value may not reveal the proper association structure, especially when its value is close to the boundary level. Thus, statistical methods for combining multiple p-values into a consolidated conclusion hold significant importance.
A number of p-value combination methods with diverse statistical properties are available in the literature. Suppose p_1, p_2, ..., p_n are the p-values obtained from independent tests whose underlying test statistics t_1, t_2, ..., t_n have absolutely continuous distributions under the corresponding null hypotheses. Then the joint null hypothesis for the p-values is

H_0: p_i ∼ Uniform(0, 1) independently, i = 1, 2, ..., n.

Several commonly used statistics are available for combining the p-values, e.g., P_F = −2 ∑_{i=1}^{n} log p_i (Fisher [33]) and P_T = min(p_1, p_2, ..., p_n) (Tippett [34]). The combined p-value is the tail probability of T(p_1, p_2, ..., p_n) under the joint null hypothesis, where T is the transformation performed on the p-values. (This T is different from the T mentioned at the beginning of Section 2; here T is a function of the p-values, whereas earlier it referred to the time to failure.) All of these combiners are monotonic in the p-values and therefore optimal in some settings.
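The two classical combiners can be written directly from their null distributions: Fisher's statistic is chi-square with 2n degrees of freedom under H_0, and Tippett's minimum has P(min p_i ≤ m) = 1 − (1 − m)^n. A Python sketch (illustrative; the paper's code is in R):

```python
import numpy as np
from scipy.stats import chi2

def fisher_combine(pvals):
    """Fisher: -2 * sum(log p_i) ~ chi-square with 2n df under the joint null."""
    stat = -2.0 * np.sum(np.log(pvals))
    return chi2.sf(stat, df=2 * len(pvals))

def tippett_combine(pvals):
    """Tippett: P(min p_i <= m) = 1 - (1 - m)^n under the joint null."""
    return 1.0 - (1.0 - np.min(pvals)) ** len(pvals)

p = [0.01, 0.04, 0.50]   # assumed example p-values
```

Fisher rewards consistent moderate evidence across tests, while Tippett is driven entirely by the single strongest test.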
Prior to the modern computing era, the form of T was chosen according to the available computing power, so that the joint distribution of the statistic could be evaluated easily. However, with the increase in computational strength, different forms of transformations can be handled using Monte Carlo simulation [35]: null p-values are simulated under the joint null hypothesis, the combiner T is evaluated on them, and the observed value of T is compared with this simulated null distribution. Extensions have been made to incorporate dependency among the p-values [36,37].
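The Monte Carlo recipe above can be sketched generically for any combiner T. This Python sketch is illustrative (the paper's code is in R) and assumes that smaller values of T indicate stronger evidence, as with Tippett's min-p statistic.

```python
import numpy as np

def mc_combined_pvalue(pvals, T, n_sim=100_000, seed=0):
    """Monte Carlo combined p-value for an arbitrary combiner T.

    Simulates independent Uniform(0,1) p-values under the joint null and
    estimates P(T(null p-values) <= T(observed p-values)).
    """
    rng = np.random.default_rng(seed)
    obs = T(np.asarray(pvals))
    null = T(rng.uniform(size=(n_sim, len(pvals))))
    return (1 + np.sum(null <= obs)) / (n_sim + 1)

# Tippett combiner: the exact combined p-value is 1 - (1 - min p)^n,
# so the Monte Carlo estimate should land close to it.
p_mc = mc_combined_pvalue([0.01, 0.02, 0.03], lambda p: p.min(axis=-1))
```

Because only uniform draws and T evaluations are needed, the same function works unchanged for combiners whose null distribution has no closed form.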
In our study, to obtain ideal cutoffs for each of the biomarkers based on independent p-values, we employed different combiner methods on the p-values obtained from independent tests. For each gene, the gene expression values are divided into percentiles, and taking each percentile value as a threshold level, the gene expression is transformed into a binary variable. So, for a particular r-th percentile (r = 1, 2, ..., 99) of gene g, the expression U_g is divided into two parts, [min U_g, U_gr] and (U_gr, max U_g], where U_g denotes the expression of the g-th gene and U_gr denotes the r-th percentile of gene g.
Thus, we obtain a binary variable indicating whether a patient's expression of gene g exceeds the cutoff U_gr. For each of these cutoff values, the Cox proportional hazards model is fitted, taking into account this binary variable along with the other prognostic factors. For each of these models, we obtain a corresponding p-value p_gr. Thus, for genes g = 1, 2, ..., G, the p-values are p_1, p_2, ..., p_G, where p_g = {p_g1, p_g2, ..., p_g99} are the p-values for the different threshold levels.
Here, the model with the minimum p-value places the highest significance on the binary variable, i.e., it produces the most significant split of the gene expression. This depicts the best cut-point, for which the heterogeneity between the two groups is maximal, and thus provides a more stringent threshold for better classification of the gene expression.
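The percentile scan and minimum-p selection can be sketched as follows. The `pvalue_fn` argument is a hypothetical stand-in for fitting the Cox PH model with the dichotomised expression plus the other prognostic factors and reading off the coefficient's p-value (the paper does this with R); here a toy criterion is substituted so the sketch is self-contained.

```python
import numpy as np

def best_threshold(expr, pvalue_fn, percentiles=range(1, 100)):
    """Scan percentile cutoffs of one gene's expression and keep the
    cutoff whose binary split yields the smallest p-value."""
    best_p, best_cut = np.inf, None
    for r in percentiles:
        cut = np.percentile(expr, r)
        binary = (expr > cut).astype(int)     # I(U_g > U_gr)
        p = pvalue_fn(binary)
        if p < best_p:
            best_p, best_cut = p, cut
    return best_cut, best_p

# Toy criterion (NOT a real Cox fit): smallest "p" at a balanced split,
# so the selected cutoff should sit at the median.
cut, p = best_threshold(np.arange(100.0), lambda b: abs(b.mean() - 0.5))
```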

Simulation
We performed a simulation study, generating and analysing datasets to assess the performance of the correlated frailty model. The datasets were generated using different parameter values in order to observe the effect of correlation under varying combinations of the frailty components.
Our objective was to replicate the real-life scenario of cancer data obtained in a clinical study. We considered three different stages of HER2 for the frailty component and three different events (Progression, Relapse and Death) as competing events, so the value of J is 3. For each event, we considered five different patients, so in total we obtained simulated data for 60 patients. The starting state is regarded as the state of a patient entering surgery and being alive with no evidence of disease post-surgery. The classification of events is described in figure 1. So, the frailty components are W_k1, W_k2, W_k3 in this analysis set-up. The shared frailty component is denoted Z_k0, which is the biomarker level-specific frailty shared by all the risks, and Z_k1, Z_k2, Z_k3 are the frailty components assigned to each of the risks.
The frailty parameters are assumed to have a common scale of 1. The shared frailty components Z_k0 are taken to have shape 1.5, and the individual frailties Z_k1, Z_k2, Z_k3 have shape parameters 2, 2.5 and 3, respectively. To simulate the survival times and event statuses, the event times for the three competing events are generated from Weibull distributions with scale parameters 4.8, 5.2 and 5.5 and shape parameters 1.01, 1.02 and 1.04, respectively. Censoring times are generated from an Exponential distribution with rate parameter 0.5.
Keeping in mind the real data scenario, we considered three different prognostic factors as covariates, namely Age, t-stage and n-stage. The regression coefficients for the 3 covariates are (−0.06, 0.5, 0.1), (−0.05, 0.2, 0.2) and (−0.03, 0.3, 0.3), respectively, for the 3 diseases. Age is simulated from a Uniform distribution with lower bound 10 and upper bound 70. The t-stage and n-stage are binary variables randomly generated as 0 or 1, where 0 denotes the lower levels and 1 the higher levels. The event statuses for the three competing events are denoted 1, 2 and 3, whereas the event status for censored survival times is denoted 0.
The data generation algorithm is as follows:
Step 1: The frailty components Z_k0, Z_k1, ..., Z_k3 are generated from gamma distributions.
Step 2: Covariates are simulated from their respective distributions.
Step 3: Random numbers are generated from the uniform distribution, and the survival time for each event is simulated by inverting the proportional hazards model with a Weibull baseline hazard function.
Step 4: Censoring times are generated from an Exponential distribution.
Step 5: The observed survival time is identified as the minimum of the survival times generated for the 3 events and the censoring time, and the corresponding event status is recorded.
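The five steps above can be sketched in code. This Python version is illustrative (the paper's implementation is in R) and assumes a Weibull cumulative baseline hazard Λ_j0(t) = (t/b_j)^{a_j} for the inverse-transform step, using the shape, scale and coefficient values stated in the text.

```python
import numpy as np

rng = np.random.default_rng(1)

nu0, nu = 1.5, [2.0, 2.5, 3.0]                       # frailty shapes, scale 1
wb_scale, wb_shape = [4.8, 5.2, 5.5], [1.01, 1.02, 1.04]
beta = np.array([[-0.06, 0.5, 0.1],                  # cause 1: age, t-stage, n-stage
                 [-0.05, 0.2, 0.2],                  # cause 2
                 [-0.03, 0.3, 0.3]])                 # cause 3

def simulate_patient():
    # Step 1: additive gamma frailties W_j = Z_0 + Z_j
    z0 = rng.gamma(nu0, 1.0)
    w = z0 + rng.gamma(nu, 1.0)
    # Step 2: covariates (Age ~ U(10, 70); binary t-stage, n-stage)
    x = np.array([rng.uniform(10, 70), rng.integers(0, 2), rng.integers(0, 2)])
    # Step 3: invert Lambda_j(t) = w_j * (t/b_j)^{a_j} * exp(x'beta_j) = -log U
    u = rng.uniform(size=3)
    t_event = np.array([
        b * (-np.log(u[j]) / (w[j] * np.exp(x @ beta[j]))) ** (1 / a)
        for j, (b, a) in enumerate(zip(wb_scale, wb_shape))
    ])
    # Step 4: exponential censoring with rate 0.5
    c = rng.exponential(1 / 0.5)
    # Step 5: observed time = minimum; status 0 if censored, else cause index
    t_all = np.append(t_event, c)
    j = int(np.argmin(t_all))
    return float(t_all[j]), (0 if j == 3 else j + 1)

obs_time, obs_status = simulate_patient()
```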
First, we fit an independent frailty model, taking into consideration the level effect of the biomarker separately for each competing event and ignoring the shared level-specific impact across events. The results of the independent frailty model are shown in table 1. It is evident that the parameter estimates are heavily biased when the shared frailty is ignored and only cause-specific randomness is considered; the model fails to capture the correlation between the effects of the covariates. Also, the frailty variances are very low with high p-values, which signifies no HER2 level effect on the risks. This independent frailty model is insufficient to analyse the simulated datasets. So, we need to opt for a model which takes into consideration both the individual and shared frailty effects.
The results of the analysis are displayed in table 2. The estimated correlations are also close to their actual values, which validates the correlated frailty model; the results for the correlations are shown in table 3. The optimal regression estimates were obtained using the Expectation-Maximisation (EM) algorithm, implemented specifically to capture the correlated frailty effect. The posterior distributions of the frailty parameters are derived using the observed data likelihood.
From the table, we observe that the estimates of the fixed effect parameters for Age, n-stage and t-stage are close to the true values, and the estimates fall within the confidence intervals. The true values of the frailty variances also fall within the 95% confidence intervals of the estimates. The estimates are closer to the true values for the correlated frailty model than for the independent frailty model.
The standard errors of the covariate coefficients are generally smaller in the correlated frailty model than in the independent model. Moreover, the correlated frailty model can capture the shared frailty effects and the pairwise correlations among them. The estimated values are close to the true parameter values, and the efficiency of the correlated frailty model is assessed using the Empirical Standard Error (EmpSE) and the Root Mean Squared Error (RMSE). Both the EmpSE and the RMSE are much lower for the proposed correlated frailty model, and the RMSE and EmpSE values of the frailty variances and correlations are close to each other.

Application on breast cancer data
The analysis was performed on a breast cancer dataset obtained from fresh frozen breast cancer tissue of every third patient diagnosed and treated between 1991 and 2004 at the Koo Foundation Sun Yat-Sen Cancer Center. Patients with a follow-up period of less than three years for reasons other than death were excluded from the study. In case of non-availability of a designated sample, the following one was selected. A total of 447 samples were obtained, of which 135 were excluded due to insufficient RNA (n=1), poor RNA quality (n=116) or unacceptable micro-array quality (n=18). So, a total of 312 samples were eligible for the study. In addition, gene expression of 14 additional breast carcinoma samples collected between 1999 and 2004 was also included. As a result, the dataset consists of disease information on 326 female patients. All the patients were treated according to the guidelines of the National Comprehensive Cancer Network. Following a modified mastectomy or breast-conserving surgery plus dissection of axillary nodes, the patients were given radiotherapy, adjuvant chemotherapy and/or hormonal therapy. Neoadjuvant chemotherapy was provided to patients with locally advanced disease. This is open-source data readily available in the GEO database (GSE20685), and the study was performed with all necessary consents and ethics approvals; hence, we did not require any separate ethical approval or consent for our study.
Breast cancer is one of the most common types of cancer diagnosed in females. The standard treatment protocol for breast cancer is surgery, followed by chemotherapy. A patient might experience progression of the disease after surgery; in our study, patients can develop distant metastasis (Eventmet), i.e., tumour growth at a distant site, or experience one of the other competing events. The classification of events is shown in figure 2.

Firstly, we performed a simple cause-specific regression analysis on the survival data using the Cox proportional hazards model. The effects of the corresponding covariates on the hazard function are shown in table 4. From the table, we can see that for t-stage levels 2 and 3, the hazard ratio with respect to level 1 is less than 1, indicating a decrease in event occurrence, whereas for t-stage level 4, the hazard rate increases. For m-stage levels 1, 2 and 3, the hazard ratios are very high, whereas for the n-stage levels, the p-values are quite high. For death, the hazard ratios for t-stage levels 3 and 4 are greater than 1 with respect to level 1, and for the m-stage levels, the p-values are close to 1. However, for the sake of biological relevance, we have not discarded any of these covariates.

We have taken one gene at a time and transformed the gene expression values into a binary variable. The percentiles of the gene expression are considered as different threshold values, and the analysis is performed for each of these partitions to obtain the frailty variance using the correlated frailty model. Our interest is to obtain the optimal threshold level for which the biomarker levels have the most distinct effect on the hazard ratio. A higher frailty variance portrays more prominent heterogeneity between the two groups. Also, it is well known that the p-value gives evidence against the null hypothesis: the lower the p-value, the higher the significance of a variable in a model.
So, if we transform the gene expressions into binary variables by partitioning at different threshold values and use the Cox proportional hazards model, then the model with the minimum p-value places maximum importance on the binary variable, i.e., it yields the most significant split of the gene expression. As a result, the p-values and frailty variances computed over the different threshold values will be negatively correlated. The correlations between the p-values and frailty variances over the different threshold values are -0.33, -0.44, -0.56 and -0.32 for HER2, PR, ER and ERBB2, respectively.
Different studies have shown significant associations among the biomarkers related to breast cancer [38,39]. The dependency among biomarkers influences the selection of the optimal threshold value of gene expressions. So, it would be erroneous to identify the threshold values independently for each gene based on the p-values of a single binary variable. To overcome this issue, one can incorporate all the binary variables generated from every biomarker and perform the analysis based on the Cox proportional hazards model. However, this approach suffers from the problem of multiple testing, which increases the chance of false positives. For example, if the range of a particular gene expression is divided into 100 equal parts and each of them is considered as a threshold value, then for k biomarkers there will be 100^k combinations of threshold values. Performing this analysis thus becomes difficult without substantial computational power.
However, this problem can be overcome with a small compromise in the computation technique. In this study, we performed a stepwise computation to carry out the analysis and obtain the threshold values. The gene expression values are standardized so that they are treated uniformly for threshold selection and the differences in the ranges of expression are nullified. For the four biomarkers considered, we first set the threshold values of the first three genes at a particular starting level. The range of the 4th gene is divided into 100 equal parts, each of which is considered as a candidate threshold. The optimal threshold value for the 4th gene is chosen from these 100 equidistant candidates on the basis of the minimum p-value; to perform this analysis, we include all four binary variables generated from the biomarkers in the Cox PH model. Once the optimal threshold value is obtained for the 4th gene, it is kept fixed at that value. The same process is then performed for the 3rd gene, keeping the starting threshold values for the 1st and 2nd genes and the optimal threshold value for the 4th gene; and likewise for the 2nd gene and then the 1st gene. Thus, we obtain the optimal threshold value for each gene using the starting threshold values of the preceding genes and the optimal threshold values of the following genes. As explained earlier, the biomarkers are dependent among themselves, so the threshold values obtained by this procedure depend on the order in which the genes are selected and on their starting threshold values.
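The stepwise procedure amounts to a coordinate-descent search over the threshold grid. The sketch below is illustrative and is not the authors' R implementation: `score` is a toy separable surrogate standing in for the p-value returned by the Cox PH fit, and the "true" thresholds are hypothetical values on the standardized scale:

```python
import numpy as np

# Hypothetical optimal thresholds on the standardized expression scale;
# in the real analysis these are unknown and `score` would come from a
# Cox PH fit with all four binary variables.
TRUE_THRESHOLDS = np.array([0.3, -0.5, 0.1, 0.8])

def score(thresholds):
    # Toy surrogate for the minimum p-value criterion (smaller is better),
    # minimised when each threshold matches its hypothetical optimum.
    return float(np.sum((np.asarray(thresholds) - TRUE_THRESHOLDS) ** 2))

def stepwise_search(start, grid):
    """Optimise one gene's threshold at a time (4th gene first, then 3rd,
    2nd, 1st), keeping the other thresholds fixed at their current values."""
    thr = list(start)
    for g in reversed(range(len(thr))):
        thr[g] = min(grid, key=lambda t: score(thr[:g] + [t] + thr[g + 1:]))
    return thr

# 100 equidistant candidate thresholds for each standardized gene.
grid = np.linspace(-2, 2, 100)
optimal = stepwise_search([0.0, 0.0, 0.0, 0.0], grid)
```

Because the surrogate is separable, one pass recovers each hypothetical threshold up to the grid spacing; with the real, correlated criterion the result depends on the gene order and starting values, as noted above.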
To overcome this order dependency, we used all possible orderings of the genes and performed a similar analysis, taking the 1st, 2nd and 3rd quartiles of the standardized expressions as the starting values.

To verify the efficacy of the proposed procedure, it is vital to observe the effect of thresholding on the survival predictions of patients. A shared frailty model has been utilised to observe the patient-specific randomness in the groups consisting of gene expression values below and above the thresholds. We implemented a proportional hazards model with a shared frailty component, and the two groups are compared on the basis of their frailty variances. The frailty variances based on the threshold values obtained with the 50th percentile cutoff as starting value are shown in tables 6 & 7. The differences between the lower and upper frailty variances are evident. For ER and PR, the shared Gaussian frailty variance for patients with gene expression below the threshold is higher than that for patients with expression above it, whereas HER2 shows the opposite behaviour; for ERBB2, the two frailty variances are quite close. For the gamma shared frailty, only PR shows an output similar to the Gaussian frailty. For ERBB2, the frailty variance increases at the higher expression level, although both the lower and upper frailty variances are very close to 0. For HER2, the frailty variance decreases for the upper group, opposite to the behaviour shown under the Gaussian frailty. For ER, the frailty variances are quite similar, and both are close to 0.
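The scale of this order-robust search can be sketched with a simple count. The pairing of every gene ordering with each quartile starting value, and the per-run cost of one stepwise pass, are assumptions made for illustration:

```python
from itertools import permutations

# Every ordering of the four genes is paired with each starting quantile
# (assumed here to be the three quartiles), and each run is assumed to
# perform one stepwise pass of 4 genes x 100 candidate thresholds.
genes = ["HER2", "PR", "ER", "ERBB2"]
orders = list(permutations(genes))          # 4! = 24 gene orderings
starting_quantiles = [0.25, 0.50, 0.75]     # assumed quartile starting values
fits_per_run = len(genes) * 100             # one stepwise pass

total_fits = len(orders) * len(starting_quantiles) * fits_per_run
print(len(orders), total_fits)  # 24 28800
```

Even with all 24 orderings and three starting values, the total of 28,800 model fits remains far below the 100 million required by an exhaustive grid.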

Discussion
Biomarkers play a crucial role in the identification of disease progression, and the treatment of cancer is administered according to the response of the corresponding biomarkers. In the case of breast cancer, several competing events may be present in a patient at the same time, and the occurrence of any one of them precludes the observation of the others. The different biomarker statuses often portray contrasting effects on the different events. However, splitting continuous-valued gene expressions into distinct levels demands statistical and medical justification.
In this study, we have elaborated an efficient statistical procedure for dividing the gene expressions into categorical variables and validated the results by exploring the variability attached to the survival prediction at each of the biomarker levels. The survival predictions of different events at a particular biomarker level are likely to be correlated. This necessitates the use of correlated frailty, where the correlation is translated through an additive frailty component consisting of a biomarker level-specific frailty and an event-specific frailty term. The variance of the frailty component is used as a criterion for thresholding the gene expression. In our procedure, we used an additive frailty model under a competing-risks set-up and identified the event-specific frailty variance due to the heterogeneity of biomarker levels. The event of interest is the occurrence of distant metastasis, and the competing event is death from other causes.
In the proposed procedure, the gene expressions of 4 different biomarkers associated with breast cancer are studied. For each biomarker, the range of the expressions is divided into 100 equal parts and, taking each one of them as a threshold level, we have utilised a competing-risks frailty model and obtained the frailty variance associated with distant metastasis. The threshold value with the maximum frailty variance provides the biomarker levels with the most distinct effect on survival prediction. However, these threshold levels are obtained under the assumption that the biomarkers are independent, which is contrary to their actual behaviour. An association exists among the biomarkers responsible for breast cancer, and the individual analysis of each of these biomarkers has lower power. Thus, a combined threshold level taking all the biomarkers into consideration is necessary. Obtaining a consolidated frailty variance over all the threshold levels of every biomarker is a difficult task. This problem can be overcome if we consider the corresponding p-values: the threshold levels with minimum p-values give maximum importance to each partition and correspond to the maximum frailty variances. In our study, we have incorporated the biomarker levels as indicator variables in the model and obtained the threshold levels based on the minimum p-value criterion using p-value combiner methods. In the proposed procedure, we choose one indicator variable at a time, keeping the others fixed at certain threshold levels. Once an optimal threshold level is obtained, the corresponding biomarker is set at that level, and the procedure is continued until we obtain threshold levels for all the biomarkers. Also, to counter the ordering problem, we repeat the same procedure for all possible orders of the biomarkers, taking the quantiles as starting values.
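As an illustration of a p-value combiner, the sketch below implements Fisher's classical method. This is an assumption for exposition: the study does not specify which combiner was used, the chi-square reference distribution assumes independent p-values (only approximate for correlated biomarkers), and the input p-values are hypothetical:

```python
import math

def fisher_combine(p_values):
    """Fisher's method: X^2 = -2 * sum(log p_i) is referred to a
    chi-square distribution with 2k degrees of freedom under
    independence. Returns (statistic, combined p-value)."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    k = len(p_values)  # df = 2k is always even, so the survival
    # function of the chi-square has the closed form
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!.
    term, total = 1.0, 0.0
    for i in range(k):
        total += term
        term *= (stat / 2.0) / (i + 1)
    return stat, math.exp(-stat / 2.0) * total

# Hypothetical per-biomarker p-values at one set of candidate thresholds.
stat, p_combined = fisher_combine([0.04, 0.20, 0.01, 0.35])
```

Smaller individual p-values yield a larger statistic and a smaller combined p-value, which is the property the minimum p-value criterion exploits.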
For validation, we applied the threshold levels in survival modelling and examined the distinction in survival prediction for each partition created by the threshold levels. Moreover, this method effectively determines the optimal threshold levels in a sequential procedure instead of computing the minimum p-value over all possible combinations of thresholds. This dramatically reduces the computational time and mitigates the multiple testing problem. Thus, using our proposed methodology, we obtain a biomarker thresholding that clearly distinguishes the survival effects of each partition and facilitates the treatment procedure.
The proposed procedure bears the potential for further improvements. In our study, we have created a criterion for choosing optimal threshold levels using p-value combiner methods to amalgamate the p-values corresponding to different correlated biomarkers. It would be beneficial to obtain a composite criterion based directly on the frailty variances of the correlated biomarkers. Although we have utilised a competing-risks model to obtain the frailty variances and incorporated the dependency structure among the events and biomarker levels, the prime interest is to examine the heterogeneity in survival prediction at different levels for only one event (i.e., distant metastasis in this study). It would be beneficial to extend the proposed procedure to examine the heterogeneity in survival prediction for all the associated events. In this study, we partitioned the gene expressions on a dichotomous scale to identify the level separations. This can be extended to a polychotomous classification where the survival outcome differs at several levels of expression. More precise results may be obtained using auxiliary information on the distributions of the parameters, which can be incorporated using Bayesian techniques.

Conclusion
The next generation of cancer treatment relies heavily on the proper identification of disease progression in terms of gene expressions. It is crucial to identify the distinct behaviour of survival outcomes and treatment responses at different biomarker statuses to provide the most effective treatment to patients. We hope that the proposed procedure will motivate studies related to biomarker level classification and improve patient treatment strategies. We are also optimistic that carefully designed prospective treatment regimes will be developed to take advantage of such clinical utility.