Complex diseases, which have a multifactorial etiology, are frequently encountered in health care. They are generally caused by multiple genes and environmental factors, involving gene–gene and/or gene–environment interactions. Genome-wide association studies (GWAS) and follow-up meta-analyses have identified a plethora of susceptibility loci for various common complex diseases [1]. These findings should lead to a better understanding of the pathogenesis of complex diseases and facilitate the development of novel therapeutic options. Furthermore, leveraging these discoveries to better predict complex disease status could greatly improve prevention and treatment, and has recently attracted great interest [2]. It is well known that a monogenic disease can be accurately predicted from the corresponding disease-causing mutation. In contrast, although several approaches have been proposed for complex disease prediction (such as penalized regression methods [3, 4] and random-effects models [5]), prediction remains a great challenge because of the complex genetic architecture.
Currently, one of the most practical ways to leverage recent GWAS findings for disease prediction is polygenic risk score (PRS) analysis [6]. Typically, a PRS is calculated as the weighted sum of the risk alleles carried at a number of high-risk loci [7]. It combines the modest effects of multiple disease-associated SNPs into a single variable and therefore has higher power than a single SNP. PRS have been created for many complex diseases, such as cardiovascular disease [8], multiple sclerosis [9] and schizophrenia [10]. As more genetic data have become readily available in recent years, PRS have been used in different ways, for example in Mendelian randomization studies [11] and in disease prediction [12]. Among these applications, disease prediction may be particularly useful for addressing the current challenge of translating the vast genetic knowledge of complex diseases into clinically usable information. For instance, several PRS have been proposed to optimize the use of genetic information on type 1 diabetes and ultimately improve its prediction and diagnosis [13]. Additionally, a previous study using logistic regression demonstrated that a PRS was a powerful predictor for patients with first-episode psychosis [14].
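To make the weighted-sum definition concrete, the following minimal sketch computes a PRS for two individuals. The effect sizes, the genotype dosages and the function name `polygenic_risk_score` are hypothetical values chosen for illustration only; in practice the weights would typically be effect estimates (e.g. log odds ratios) taken from published GWAS.

```python
def polygenic_risk_score(dosages, weights):
    """Return the weighted sum of risk-allele dosages (0, 1 or 2) for one individual."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(w * g for w, g in zip(weights, dosages))


# Hypothetical per-SNP effect sizes for three risk loci (illustrative values).
weights = [0.5, 0.25, 1.0]

# Hypothetical risk-allele counts for two individuals at the same three loci.
individuals = {
    "A": [0, 1, 2],
    "B": [2, 0, 0],
}

# One score per individual: each locus contributes (effect size x dosage).
scores = {name: polygenic_risk_score(g, weights) for name, g in individuals.items()}
print(scores)
```

Each locus contributes only modestly, but the summed score aggregates their effects into a single variable per individual.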
On the other hand, PRS-based disease prediction currently suffers from several limitations. First, genetic correlations often exist among related diseases [15], but they are rarely taken into consideration. Second, some environmental risk factors are themselves heritable (e.g. lipid fractions) and can mediate part of the genetic risk of the target disease. This is also mostly ignored in current PRS-based disease prediction. Taking into account diseases related to the target complex disease, as well as environmental exposure factors, may provide additional useful information for disease prediction.
As a subfield of computer science, machine learning plays an essential role in knowledge extraction [16, 17], and its algorithms have been successfully applied in the clinical field. For instance, Capper et al. recently used a machine learning approach to classify brain tumors on the basis of DNA methylation. Compared with standard methods, this approach resulted in a change of diagnosis in up to 12% of prospective cases [18]. Random forest is a tree-based machine learning algorithm that consists of a collection of randomized decision trees [19]. Preliminary experiments showed that random forest achieved the highest accuracy compared with several other popular machine learning algorithms (e.g. support vector machines) [20].
Generally, a machine learning algorithm is used to train a classification model that separates samples of different classes (e.g. healthy or ill) based on variables (e.g. SNPs in a GWAS). In this setting, the whole original genomic data set is generally used. However, instead of using the whole original genomic data, utilizing the combined genetic information of loci associated with the target disease has the potential to enhance the performance of machine learning for disease prediction. To the best of our knowledge, the performance of combining PRS and machine learning for disease prediction remains largely unknown.
As machine learning and PRS are becoming increasingly popular in genetic studies of complex diseases, we systematically assessed the performance of disease prediction with a combination of random forest and PRS. We first used random forest to train a classification model for separating samples of different health status (healthy or ill) based on a PRS matrix and phenotypic data. Notably, PRS were constructed using the identified loci associated with the target complex disease (e.g. type 2 diabetes), with complex diseases genetically related to the target disease (e.g. hypertension, obesity) and with environmental exposures (e.g. smoking, lack of exercise). We illustrated the feasibility and performance of this disease prediction approach through extensive genetic simulation. Our results may provide valuable information for applying random forest to a PRS matrix for complex disease prediction.
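The general idea of training a random forest on a PRS matrix can be sketched as follows. This is an illustrative toy example under stated assumptions, not the simulation design used in this study: the sample size, allele frequency, effect sizes, liability model and trait names are all hypothetical, and the same simulated SNP panel is reused for every trait purely for brevity. It relies on NumPy and scikit-learn.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_samples, n_snps = 400, 30

# Simulated risk-allele dosages (0, 1 or 2); allele frequency 0.3 is an assumption.
genotypes = rng.binomial(2, 0.3, size=(n_samples, n_snps))

# Hypothetical effect sizes: one weight vector per trait (target disease,
# a genetically related disease, and a heritable environmental exposure).
trait_names = ["target_disease", "related_disease", "exposure"]
weights = rng.normal(0.0, 0.2, size=(n_snps, len(trait_names)))

# PRS matrix: one row per individual, one column (PRS) per trait.
prs_matrix = genotypes @ weights

# Simulated disease status from an assumed liability model with random noise;
# individuals above the median liability are labelled as cases (1).
liability = prs_matrix @ np.array([1.0, 0.5, 0.5]) + rng.normal(0.0, 1.0, n_samples)
status = (liability > np.median(liability)).astype(int)

# Hold out a quarter of the samples to estimate prediction accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    prs_matrix, status, test_size=0.25, random_state=0, stratify=status
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
print(dict(zip(trait_names, model.feature_importances_.round(3))))
```

Phenotypic covariates could be appended to `prs_matrix` as additional columns before training; the feature importances then give a rough indication of how much each PRS or covariate contributes to the classification.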