Background: Breast cancer accounts for a large proportion of cancer-related deaths in women. Polygenic risk scores (PRS) derived from single nucleotide polymorphism (SNP) data can estimate an individual's genetic risk of breast cancer and have been widely applied to risk stratification. However, PRS built from SNP data alone may not achieve satisfactory prediction accuracy. Additionally, current PRS models based on linear regression are poorly suited to capturing non-linear effects among thousands of associated SNPs.
Methods: In this study, multi-omics data (DNA methylation, miRNA, mRNA and lncRNA) and clinical data for breast invasive carcinoma (BRCA) were collected from The Cancer Genome Atlas (TCGA). First, we developed a novel PRS model that applies a machine learning algorithm (LightGBM) to a single omics data type. Subsequently, we built a model combining the PRS derived from each omics data type to explore whether multi-omics data can further improve the prediction accuracy of PRS. Finally, we performed association analysis and prognosis prediction for breast cancer to evaluate the utility of the PRS generated by our method.
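The combination step can be illustrated with a minimal sketch. The abstract does not state how the per-omics PRS are combined, so this example assumes a simple logistic-regression combiner fitted by gradient descent; it is not the LightGBM model itself. The data are synthetic, and all function names (`fit_combiner`, `combined_prs`, `make_toy_data`, `accuracy`) are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_combiner(scores, labels, lr=0.1, epochs=500):
    """Fit logistic-regression weights over per-omics PRS columns.

    Assumption: a linear-logistic combiner stands in for whatever
    combination model the study actually used.
    """
    n, k = len(scores), len(scores[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(scores, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(k):
                grad_w[j] += err * x[j]
            grad_b += err
        # Batch gradient-descent update on the mean log-loss gradient.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def combined_prs(x, w, b):
    # Combined score in [0, 1] for one sample's per-omics PRS vector.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def make_toy_data(n=200, seed=0):
    """Synthetic stand-in for four per-omics PRS columns
    (methylation, miRNA, mRNA, lncRNA); not TCGA data."""
    rng = random.Random(seed)
    scores, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        scores.append([rng.gauss(0.8 * y, 1.0) for _ in range(4)])
        labels.append(y)
    return scores, labels


def accuracy(scores, labels, w, b):
    # Fraction of samples classified correctly at a 0.5 threshold.
    hits = sum((combined_prs(x, w, b) >= 0.5) == bool(y)
               for x, y in zip(scores, labels))
    return hits / len(labels)


if __name__ == "__main__":
    scores, labels = make_toy_data()
    w, b = fit_combiner(scores, labels)
    print("weights:", [round(wi, 3) for wi in w])
    print("training accuracy:", round(accuracy(scores, labels, w, b), 3))
```

In this sketch each per-omics score would come from a separately trained single-omics model, and the combiner should be evaluated on held-out samples rather than on the training set as done here for brevity.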
Results: Our PRS model based on single-omics data and the LightGBM algorithm achieved better predictive performance than linear models and other machine learning models. Moreover, combining the PRS derived from each omics data type further improved prediction accuracy. Analysis of prevalence and of the associations between the PRS and phenotypes, including case-control status and cancer stage, indicated that the risk of breast cancer increases as the PRS increases. Survival analysis also suggested that the PRS for cancer stage is an effective prognostic metric for breast cancer patients.
Conclusion: Our proposed model extends the current definition of PRS from standalone SNP data to multiple omics data and outperformed state-of-the-art PRS models. It may provide a powerful tool for diagnostic and prognostic prediction of breast cancer.