Simulation is an effective approach for evaluating new genetic models [22]. We simulated a case-control genotype dataset consisting of 1000 SNP loci and 2000 individuals (see Methods). The 2000 individuals were divided into case (i.e., disease) and control groups of equal size. Genotypes were generated according to the same guiding minor allele frequencies (MAFs) between the case and control groups. In the case group, the 1000 loci composed 500 pairs of interacting loci containing pure epistatic interactions. Therefore, all the 1000 loci should be regarded as association with the disease. The alleles of a pair of interacting loci are denoted by A/a and B/b (a and b are minor alleles) respectively. We assume that only the aabb genotype (i.e., both loci are homozygous for their minor alleles) has interacting effects. Under conventional genetic models, the case group has higher disease allele frequencies than the case group, whereas in this simulated dataset, no locus displays significantly different allele frequencies between the case and control groups, because the minimum p-value (χ2 test) is 7.03×10−5 which is higher than the standard GWAS cutoff of 5×10−8 as well as the lenient cut-off of 10−5 [23] and no locus leads to a significant p-value after FDR adjustment. The mean and median relative risks of the aabb genotype among all pairs of interacting loci are 1.63 (>1, p<2.2×10−16, t-test) and 1.60 respectively, confirming that pure epistatic interactions were successfully generated.
The GRS of an individual is defined as the number of occurrences of the aabb genotype among the 500 pairs of interacting loci. We computed the GRS for each case and control individual. Distributions of GRS (Figure 1) clearly show that the case group tends to have higher GRS than the control group (p<2.2×10−16, t-test). Although the above comparison is informative, clinical values of GRS depend more on whether the GRS can distinguish case (i.e., disease) and control (i.e. healthy) statuses. To this end, we computed the area under the curve (AUC), a measure of the discriminative ability ranging from 0 to 1. An AUC of 0.900 was obtained, indicating that the GRS can well depict disease risks.
The exact hereditary mechanisms of complex diseases such as cancer and heart diseases are still little understood. Population and clinical studies suggest that the offspring of healthy parents tend to have less risks of developing a complex disease than the offspring of one parent or both parents with histories of that complex disease [24–26]. We asked whether offspring’s risks of developing a complex disease can be depicted by the GRS under the additive epistatic interaction model. Based on the simulated genotype dataset, we derived the genotypes of offspring belonging to three parent groups: 1) case×case, 2) case×control, and 3) control×control (see Methods). We added the GRS distributions of the three offspring groups to Figure 1. The case × case offspring tend to have higher GRS than the other two offspring groups (p=2.47×10−7 and p<2.2×10−16 respectively, t-test), and the case × control offspring tend to have higher GRS than the control × control offspring (p=1.21×10−4, t-test). This analysis suggests that stronger parental histories of a complex disease do increase offspring’s genetic risks of developing that disease. However, the GRS of case × case offspring are greatly reduced relative to the case group (p<2.2×10−16, t-test), suggesting that even with strong parental histories of a complex disease, the genetic risks of developing that disease in offspring are much lower than the same genomes as real patients, because recombination significantly reduces the occurrences of the aabb genotype. This observation further indicates that for people with strong family histories of a complex disease, the actual acquirement of that disease may still largely depend on non-parental factors such as accumulation of de novo/somatic mutations and epigenetic modifications. The above analyses collectively suggest that the additive epistatic interaction model and its GRS can well depict the genetic risks and hereditary patterns of complex diseases.
Next, we asked whether the additive epistatic interaction model is applicable to real genotype data. First, we obtained the genotype data of chromosome 20 from the 1000 Genomes Project [27]. The genotype data consisted of 2504 individuals from 5 super populations including African, Ad Mixed American, East Asian, European and South Asian. Here, we treat these super populations as the phenotype, and any two super populations comprise a pair of case-control groups. We inferred that if pure epistatic interactions are widespread in the human genome, the additive epistatic interaction model can be applied to classify super populations. We employed a customized procedure (see Methods) to identify 500 pairs of SNP loci containing pure epistatic interactions for each pair of super populations. Next, we computed AUCs for pairs of super populations based on the GRS of 500 pairs of interacting loci. Note that for each pair of super populations, we exchanged case and control assignments, resulting in two AUC values. These AUCs range from 0.866 to 0.991 (Figure2), suggesting that human super populations might be accurately classified based on the additive epistatic interaction model. Second, we obtained the genotype data of chromosome 20 from the UKBioBank database [28]. We classified 567 individuals with diabetes and 573 individuals without diabetes from the Irish population. We identified 500 pairs of SNP loci containing pure epistatic interactions and computed a GRS for each individual. An AUC of 0.992 was obtained, suggesting that the additive epistatic interaction model could accurately distinguish disease statuses of genetic disorders.
An important consideration for applying the additive epistatic interaction model is its robustness. In real data applications, especially those utilizing whole-genome sequencing technologies, the numbers of markers are much larger than sample sizes. Thus, most of the identified pairs of interacting loci may not be real. Therefore, we asked whether the additive epistatic interaction model can tolerate some extent of noises and whether advanced machine learning technologies such as deep learning can be integrated to the additive epistatic interaction model. While deep learning has different models, we chose deep neural network (DNN) because different pure epistatic interactions do not contain sequence or interdependency information. We here note that the features of DNN models should be occurrences of the aabb genotype, rather than original genotypes because effective feature extraction is still required for deep learning technologies.
We added different ratios of noises (i.e., loci assumed to be interacting but do not contain pure epistatic interactions) to the simulated genotype dataset, and computed AUCs of distinguishing between case and control statuses based on GRS and DNN respectively. Figure 3 shows comparisons of AUCs between GRS and DNN at different noise ratios. In general, the AUCs of both approaches gradually decrease along with increase of noise ratios. When noise ratios are relatively low (e.g., <1), GRS outperforms DNN. When the noise ratio goes up from 1, the AUCs of GRS decrease rapidly, while the AUCs of DNN show resistance to noises, leading to DNN outperforming GRS. This analysis suggests that both GRS and DNN can tolerate a substantial proportion of noises, and DNN can be integrated to the additive epistatic interaction model for dealing with complex datasets.