The Construction of Polygenic Risk Scores for Breast Cancer Based on LightGBM and Multiple Omics Data

Background: Breast cancer accounts for a large proportion of cancer-related deaths in women. Polygenic risk scores (PRS) derived from single nucleotide polymorphism (SNP) data can evaluate the individual-level genetic risk of breast cancer and have been widely applied for risk stratification. However, standalone SNP data used for PRS may not provide satisfactory prediction accuracy. Additionally, current PRS models based on linear regression have insufficient power to leverage non-linear effects from thousands of associated SNPs.
Methods: In this study, multiple omics data (DNA methylation, miRNA, mRNA and lncRNA data) and clinical data of breast invasive carcinoma (BRCA) were collected from The Cancer Genome Atlas (TCGA). First, we developed a novel PRS model utilizing single omic data and a machine learning algorithm (LightGBM). Subsequently, we built a combination model of the PRS derived from each omic dataset to explore whether multiple omics data can further improve the prediction accuracy of PRS. Finally, we performed association analysis and prognosis prediction of breast cancer to evaluate the utility of the PRS generated by our method.
Results: Our PRS model based on single omic data and the LightGBM algorithm achieved better predictive performance than the linear models and other machine learning models. Moreover, combining the PRS derived from each omic dataset further strengthened prediction accuracy. The analysis of prevalence and of the associations of the PRS with phenotypes, including case-control and cancer stage status, indicated that the risk of breast cancer increases as the PRS increases. The survival analysis also suggested that the PRS for cancer stage is an effective prognostic metric for breast cancer patients.
Conclusion: Our proposed model expanded the current definition of PRS from standalone SNP data to multiple omics data and outperformed the state-of-the-art PRS models, which may provide a powerful tool for diagnostic and prognostic prediction of breast cancer.


Breast cancer is the most frequently diagnosed cancer in women worldwide [1]. In 2020, over 2 million new cases were reported [2]. The establishment of effective prevention and treatment measures is essential to prevent breast cancer occurrence and reduce breast cancer mortality. Although BRCA1 and BRCA2 gene mutations confer a high risk of breast cancer, these mutations are found in only a small proportion of breast cancer patients [3]. In recent years, genome-wide association studies (GWAS) have identified multiple high-frequency, low-penetrance susceptibility variants of breast cancer [4]. The cumulative effect of these susceptibility variants can be summarized as a polygenic risk score (PRS). Researchers have developed several PRS models for breast cancer by using large amounts of single nucleotide polymorphism (SNP) data. These studies maintained that the PRS is an effective and reliable predictor of breast cancer risk that may inform screening and prevention strategies [5][6][7][8][9].

[...] analyses of multiple omics data may lead to new insights into the diagnosis and prognosis of breast cancer [15]. In addition, in the standard approach to PRS, the effect sizes of the genetic variants are usually estimated with linear statistical models [16][17][18][19]. However, linear statistical models have some limitations and can only be applied when specific requirements are satisfied [20]. Advanced machine learning (ML) models [21,22] [...]

Overview of PRS model
We used multiple omics data and breast cancer status to construct two kinds of PRS models, one for each phenotype. The first phenotype contains only normal samples (control) and tumor samples (case), which were labelled 0 and 1, respectively. The second phenotype contains normal samples, early-stage tumor samples and late-stage tumor samples, which were labelled 0, 1 and 2, respectively. We refer to these two models as the PRS for case-control status and the PRS for cancer stage status. The PRS can evaluate the individual risk of breast cancer and may improve the diagnosis of breast cancer. Moreover, since recent studies have found that the stage of cancer is highly associated with prognosis [28], accurate construction of the PRS for cancer stage status may facilitate the prediction of breast cancer prognosis. The framework of this study is shown in Figure 1.

PRS based on LightGBM
LightGBM is an ensemble model of classification and regression trees (CART) [...] $l(y_i, \hat{y}_i)$ is a differentiable convex loss function that measures the difference between the prediction $\hat{y}_i$ and the true phenotype $y_i$. $K$ and $w_k$ represent the number and the values of the leaf nodes in each CART model, respectively, and $\gamma$ and $\lambda$ are constant coefficients.
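For reference, the symbols defined above match the regularized objective commonly used in gradient tree boosting. The following is a sketch under that assumption and is not necessarily the exact form used in this study; $n$ (number of samples), $T$ (number of trees) and $f_t$ (the $t$-th CART model) are notation introduced here for illustration.

```latex
\mathrm{Obj} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{t=1}^{T} \Omega\left(f_t\right),
\qquad
\Omega\left(f\right) = \gamma K + \frac{1}{2} \lambda \sum_{k=1}^{K} w_k^{2}
```

Under this form, $\gamma$ penalizes the number of leaves and $\lambda$ penalizes large leaf values, so both coefficients act against overly complex trees.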

In the general setting, a second-order approximation can be used to optimize the objective function quickly.

[...] we used each omic dataset as the input of these models, and the corresponding phenotypes as the output.

Model training and evaluation
To ensure the robustness and stability of the model, we trained and evaluated the proposed PRS model by 5-fold cross validation. This procedure divided each omic dataset into five subsets. In each fold, one of the five subsets was used as the testing dataset and the other four subsets were combined to form the training dataset. We applied Bayesian optimization [37] and 3-fold inner cross validation to optimize the hyper-parameters of the PRS model on each training dataset. Specifically, for LASSO, we optimized the parameter "alpha". For MCP, we adjusted the regularization parameter "lambda". For elastic net, the parameters "alpha" and "l1_ratio" were optimized. For SVR, we chose the "rbf" kernel and optimized the regularization parameter "C". For LightGBM, the optimized parameters were "num_leaves", "n_estimators", "learning_rate", "max_depth", "max_bin", "min_split_gain", "subsample", "subsample_freq", "colsample_bytree", "min_child_samples", "min_child_weight", "reg_alpha" and "reg_lambda". Finally, we obtained the PRS of each testing dataset, predicted by the model with the optimized parameters. Each PRS was standardized based on its mean and standard deviation. The predictive performance of the PRS model was evaluated by the square of the Pearson correlation coefficient (R²).

[...]

We first compared our prediction model to existing PRS methods and other ML methods for case-control status. Figure 2a shows the results of these PRS methods on the four kinds of omics datasets. We observed that elastic net achieves the best performance among the traditional linear models. The R² of SVR is 3.3%, 7.7% and 0.5% higher than that of elastic net on the DNA methylation, miRNA and lncRNA datasets, and 3.1% lower than that of elastic net on the mRNA dataset. The R² of [...]

Across the common samples, we observed that the prevalence is about 10% in the first stratum, then rises to 100% in the second stratum and remains steady afterwards.
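The evaluation loop described under Model training and evaluation (5-fold split, standardization of the predicted PRS in each testing fold, squared Pearson correlation) can be sketched as follows. This is a minimal illustration rather than the pipeline used in this study: `fit_one_feature` is a hypothetical stand-in for the PRS model, the inner 3-fold cross validation and Bayesian optimization are omitted, the samples are assumed to be in random order already, and the data are toy values.

```python
import math


def kfold_indices(n_samples, n_splits=5):
    """Assign sample i to fold i % n_splits (assumes samples are already shuffled)."""
    return [list(range(k, n_samples, n_splits)) for k in range(n_splits)]


def standardize(scores):
    """Z-score the scores using their own mean and (population) standard deviation."""
    mean = sum(scores) / len(scores)
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
    return [(s - mean) / sd for s in scores]


def pearson_r2(x, y):
    """Square of the Pearson correlation coefficient between x and y."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov * cov / (var_x * var_y)


def fit_one_feature(X_train, y_train):
    """Hypothetical placeholder model: least-squares line on feature 0."""
    xs = [row[0] for row in X_train]
    mx, my = sum(xs) / len(xs), sum(y_train) / len(y_train)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, y_train))
             / sum((a - mx) ** 2 for a in xs))
    return lambda row: my + slope * (row[0] - mx)


def cross_validated_r2(X, y, fit, n_splits=5):
    """Train on four folds, predict the held-out fold, and return R^2 per fold."""
    folds = kfold_indices(len(y), n_splits)
    r2_per_fold = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        prs = standardize([predict(X[i]) for i in test_idx])
        r2_per_fold.append(pearson_r2(prs, [y[i] for i in test_idx]))
    return r2_per_fold


# Toy data: one feature and a binary phenotype (0 = control, 1 = case).
X = [[float(i)] for i in range(50)]
y = [0 if i < 25 else 1 for i in range(50)]
print(cross_validated_r2(X, y, fit_one_feature))
```

Any model exposing the same `fit(X_train, y_train) -> predict` interface could be swapped in for the placeholder, including a tuned gradient boosting model.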

Predictive performance of PRS based on multiple omics data
The prevalence changes sharply at a single stratum because our proposed method achieved relatively accurate prediction of breast cancer risk for case-control status. The trend plot of the prevalence also indicated that individuals in high-PRS strata have greater breast cancer risk than individuals in lower-PRS strata.

We investigated the relationship of PRS with different phenotypes of breast cancer in this section ([...]

[...] and regulation mechanism of large-scale genes, which play an important role in determining the mechanism and treatment of cancer [45,46]. Compared to individual-level genotype data, using multiple omics data to construct the breast cancer PRS takes the interaction of genetic and environmental factors into account, and thus can provide higher prediction accuracy (PA).

Prognosis prediction of breast cancer
Although our PRS methods provide powerful predictive performance, they have some limitations. First, the LightGBM model has more hyper-parameters than traditional linear models such as MCP, LASSO and elastic net, so more time is needed to train the proposed model. We applied multithreading to utilize computing resources effectively and reduce the running time. Second, the sample size of breast cancer from TCGA is relatively small compared to large-scale genome-wide association study data. In addition, there are significantly more tumor samples than normal samples in our study. Imbalanced datasets significantly compromise the performance of most standard learning algorithms, because these models assume balanced class distributions. Third, this study lacks independent validation datasets, because it is very difficult to collect multiple omics data including DNA methylation, miRNA, mRNA and lncRNA for case-control and cancer stage status. Thus, we employed 5-fold cross validation to strengthen the robustness and stability of our proposed models. In the future, we will consider applying our PRS model to analyze breast cancer with other phenotypes by using larger and balanced multiple omics datasets.

Figure 1
Schematic overview of the framework for constructing the PRS model based on multiple omics data. The BRCA dataset was split into a training dataset and a testing dataset based on 5-fold cross validation. We constructed PRS models using MCP, LASSO, elastic net, SVM and LightGBM on the training dataset. The hyper-parameters of the five models were optimized using Bayesian optimization and 3-fold cross validation. The PRS of the testing dataset was predicted by the optimized model. The predictive performance of the final models was evaluated with R².

Prevalence strata plot of increasing PRS for case-control status. The sample size of 10 strata was equal and the prevalence of BRCA increased with the increase of PRS. The 1st stratum can be regarded as a low-risk PRS stratum and the 2nd to 10th stratum as a high-risk stratum.
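The stratification behind such a prevalence plot can be sketched as below: samples are sorted by PRS, cut into ten equal-sized strata, and the fraction of cases is computed in each stratum. The function name and the toy data are assumptions made for illustration and do not reproduce the study data.

```python
def prevalence_by_stratum(prs, is_case, n_strata=10):
    """Sort samples by PRS, cut them into n_strata equal-sized groups,
    and return the fraction of cases in each group (lowest PRS first).
    Assumes there are at least n_strata samples."""
    order = sorted(range(len(prs)), key=lambda i: prs[i])
    n = len(order)
    prevalence = []
    for s in range(n_strata):
        members = order[s * n // n_strata:(s + 1) * n // n_strata]
        prevalence.append(sum(is_case[i] for i in members) / len(members))
    return prevalence


# Toy data: 100 samples with increasing PRS; the 9 lowest-scoring are controls.
toy_prs = [float(i) for i in range(100)]
toy_case = [0 if i < 9 else 1 for i in range(100)]
print(prevalence_by_stratum(toy_prs, toy_case))
```

With strongly imbalanced data, where cases far outnumber controls, nearly all controls fall into the lowest stratum, which is one reason a prevalence curve can jump within a single stratum.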

Figure 4
The Kaplan-Meier (KM) survival curves of BRCA patients in the high-risk and low-risk groups. We divided patients into high-risk and low-risk groups based on the 50th percentile of the PRS. Patients in the low-risk group have a better prognosis than those in the high-risk group.
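A minimal sketch of the grouping and of the Kaplan-Meier estimator behind such curves is given below. It assumes that the split is made at the median PRS and that each patient has a follow-up time and an event indicator (1 = death, 0 = censored); the function names and the toy values are illustrative and are not taken from the study.

```python
import statistics


def split_by_median(prs):
    """Label a patient 'high' if the PRS is above the median, otherwise 'low'."""
    cutoff = statistics.median(prs)
    return ["high" if s > cutoff else "low" for s in prs]


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.
    events[i] is 1 if the event was observed at times[i], 0 if censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect every patient whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve


# Toy example: four patients, the second one censored at time 2.
print(split_by_median([0.1, 0.5, 0.9, 1.3]))
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
```

Applying `kaplan_meier` separately to the patients labelled "high" and "low" gives the two curves that are compared in a plot of this kind.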