Prediction of Inflammatory Bowel Diseases using Genetic Risk Score in Asian populations

The incidence of inflammatory bowel disease (IBD), including Crohn’s disease (CD) and ulcerative colitis (UC), is rising in Asian populations. We undertook a cross-population study to explore whether genetic risk scores (GRS) of IBD, CD and UC could explain their occurrence, and whether they can be used to predict disease occurrence in general populations from East Asia (EA) and Central Asia (CA). Methods: We studied 9,698 subjects – 4,733 IBD patients (2,003 CD; 2,730 UC) and 4,965 matched controls – who had been genotyped using Immunochip. The subjects were from three East Asian (Japan, South Korea and China) and two Central Asian populations (India and Iran). We generated GRS for each population by combining information from up to 201 genome-wide significant IBD-associated variants to summarize the total load of genetic risk for each phenotype. We then estimated the explained variance and predictability of IBD using the GRS.


Introduction
Inflammatory bowel disease (IBD) is a chronic, debilitating disease that now affects 2.5 million people of European descent. However, its incidence and prevalence are rising in populations in the newly industrialized countries of Central and East Asia [1]. Non-European populations have distinct phenotypic characteristics for IBD. For instance, Asians with Crohn's disease (CD) and ulcerative colitis (UC) have a lower proportion of family history and of extraintestinal manifestations compared with European populations. In CD, Asians show a male predominance, more stricturing disease and more perianal involvement than Europeans, whereas in UC, they report lower rates of extensive colitis and colectomy [2]. IBD occurs when the immune system responds inappropriately to gut microbiota in a genetically susceptible host [3]. Genome-wide association studies (GWAS) and deep-sequencing studies have identified over 240 genetic variants associated with IBD. These studies were largely conducted in individuals of European descent, with only a few in Asian and African-American populations [4,5], but they provide biological insights into the disease mechanisms [6]. Apart from susceptibility to the disease, several studies have also investigated genetic loci affecting IBD sub-phenotypes, such as disease location and prognosis [7,8]. In particular, a study of 29,838 patients identified genetic loci associated with disease location, but not with how the disease evolved over time [7]. Our earlier trans-ethnic association study, involving 9,846 subjects of non-European descent, identified 38 novel genetic loci and found genetic heterogeneity between populations [4]. It also demonstrated the importance of trans-ancestry genetic studies.
Recently, genetic risk scores (GRS) have been used to aggregate the contribution of multiple single nucleotide polymorphisms (SNPs) by combining genetic information and testing for improved performance in predicting disease incidence [9]. Using GRS, the genetic overlap and pleiotropic character of IBD sub-phenotypes were qualified [10], supporting a continuum of disease that was better explained by three disease groups (ileal Crohn's disease, colonic Crohn's disease, and ulcerative colitis) than by the current bipartite classification [7]. Furthermore, several studies have shown that information on multiple SNPs combined into a GRS was associated with complex diseases such as obesity, type 2 diabetes, and coronary heart disease [11][12][13]. From another viewpoint, a better understanding of how GRS can be used to predict IBD could improve identification of high-risk individuals for whom preventive interventions could help avoid development of disease. However, studies so far have not shown that GRS improves risk prediction of IBD in the general population or in patients [14][15][16][17], although in a European population GRS did yield more information on the genetic background of IBD than candidate SNP associations alone [7]. The validity of GRS in predicting IBD, CD and UC occurrence in Central and East Asian populations has not yet been reported.
In our current study, we investigated how GRS might explain susceptibility to IBD, as measured by the amount of variance explained by GRS for IBD. We also explored the predictive value of GRS for IBD across ancestrally diverse, non-European, general populations. We compared genetic association with IBD disease phenotypes (IBD, CD, or UC), and their predictability in general Asian populations.
Methods
Study Design
Our current study built on an earlier trans-ethnic GWAS of 9,846 subjects of non-European descent, in which we identified 38 novel genetic loci for IBD and verified another 173 [3]. We then went on to generate GRS for IBD, CD and UC phenotypes, and to estimate explained variance and predictability of IBD, CD and UC incidence in Asian populations of Japanese, Korean, Chinese, Indian and Iranian descent. IBD diagnosis was determined by IBD specialists from reviews of case notes and clinical, radiological, pathological and endoscopic reports [3]. After quality control, we extracted detailed information on the disease phenotypes for the five studied populations. The original study included patients from population-based registries, and from secondary and tertiary medical referral centers at multiple locations [3] (Supplementary Text S1).
We had data on 9,698 participants, including their gender, ethnicity, smoking status, family medical history, and clinical and genetic data. We retrieved Immunochip array genotypes for 6,395 East Asian and 3,303 Central Asian patients and country-, age- and gender-matched controls. In our current analysis we used three East Asian populations, including 5,317 Japanese (1,312 CD, 723 UC, and 3,282 controls), 547 South Koreans (201 CD, 230 UC, and 114 controls) and 533 Chinese (155 CD, 143 UC, and 235 controls), and two Central Asian populations, including 2,413 Indians (184 CD, 1,237 UC, and 992 controls) and 890 Iranians (151 CD, 397 UC, and 342 controls) (Supplementary Figure S1). Detailed phenotype data were available for at least 74.5% of CD patients and 88.9% of UC patients in the three East Asian populations, and 71% of CD patients and 76% of UC patients in the two Central Asian populations. Data on age at diagnosis, family history, and smoking were available for 82.6%, 82.7% and 61.0% of patients in the East Asian populations, and 79%, 77% and 76% in the Central Asian populations. Data on age of disease onset, extraintestinal manifestations, and surgical history were also collected on 2,557 East Asian, 1,421 Indian, and 548 Iranian IBD patients. All 9,698 participants in the current study had been genotyped on the Immunochip array as part of the trans-ethnic IBD genetic consortium (IBDGC) initiative [3]. During quality control we removed individuals with more than 10% of their genotypes missing.
The genotyping methods and quality control have been explained elsewhere [3]. We selected genotype data on the 201 common IBD-associated SNPs discovered in our earlier study [3]. Genetic data were harmonized by filtering out the genetic variants that were missing in any of the five populations. For the missing SNPs, we first identified the proxy SNPs that were in the highest linkage disequilibrium (LD) of r² > 0.99 and determined those closest to the GWAS SNPs by using reference panels from the 1000 Genomes project. Of the 201 IBD-associated SNPs, 19 had missing information in at least one of the three East Asian populations that could not be replaced by a proxy, yielding 182 common SNPs for final analysis. Of the 201 SNPs, we could not retrieve a proxy for seven in the Indian and Iranian populations, leaving 194 SNPs for these two populations.
Data Analysis
Our final analysis was performed on the matched genotype and phenotype data of 4,733 IBD patients (2,003 CD; 2,730 UC), and 4,965 country-, age- and gender-matched controls. Per SNP, the risk variant was used to build GRS, firstly to test the explained variance of IBD, CD and UC, and secondly to examine the disease predictability in general populations by implementing a previously applied systematic framework [6].
We fitted a number of genotype-phenotype models using mixed models for the East Asian populations to estimate independent risk and cross-checked these models across each of the three separate East Asian populations. The dataset of the East Asian populations was split into (1) a training set (including two out of the three East Asian populations) to build the model to calculate the odds ratio (OR), and (2) a test set (the third population) for evaluating and validating the predictive model built for the training set. The target population was excluded from the East Asian population and the association of each allele with the risk of the phenotype of interest was studied using the other two remaining populations. To calculate the independent risk for IBD, CD and UC per IBD SNP, we first combined the Korean and Chinese populations (OR_KC). Next, we combined the Japanese and Chinese populations (OR_JC) and finally, we combined the Japanese and Korean populations (OR_JK). We used an additive linear mixed model as implemented in the software package MMM (C-program for analyzing a linear mixed model) [18] to calculate the risk (OR) for each of the 182 common IBD SNPs (Figure 1).
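The rotation of training and target populations described above can be sketched as follows. This is only an illustration of the leave-one-population-out bookkeeping; the function name and the label format are assumptions, and the actual OR estimation was done with MMM, which is not reproduced here.

```python
# Illustrative sketch: enumerate the three train/target settings.
# Names (leave_one_population_out, the "OR_XY" labels) are hypothetical helpers.
POPULATIONS = ["Japanese", "Korean", "Chinese"]


def leave_one_population_out(populations):
    """Return a list of (training_populations, target_population) pairs."""
    splits = []
    for target in populations:
        # The target population is excluded; the other two form the training set.
        training = [p for p in populations if p != target]
        splits.append((training, target))
    return splits


def or_label(training):
    """Build a label such as 'OR_KC' from the initials of the training populations."""
    return "OR_" + "".join(name[0] for name in training)


if __name__ == "__main__":
    for training, target in leave_one_population_out(POPULATIONS):
        print(f"{or_label(training)}: estimate ORs in {' + '.join(training)}, "
              f"then evaluate the GRS in {target}")
```

Each target population therefore never contributes to the ORs used to score it, which is the point of the split.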
The original risk alleles were defined as alleles associated with an increased risk of IBD, CD or UC in our original trans-ethnic meta-GWAS of IBD [3]; these have been replicated in a follow-up study in Caucasians [6]. In brief, we included the 201 top SNPs that were associated with IBD to form a genetic dataset and to build a genetic relatedness defined in an R matrix. The R matrix was calculated with the number of variants per phenotype and included as a random-effects component in the model to account for population stratification. The results of case-control association tests were presented as OR with associated p-values for the phenotype of interest. We evaluated the ORs per phenotype and excluded any SNPs with an extreme OR. Finally, we included 176 IBD risk variants that had OR estimates across the three East Asian populations to build the GRS.
For the Indians and Iranians, we included data for each of the 194 IBD-associated variants. We defined the risk allele as the one obtained for the Caucasian population in the trans-ethnic meta-GWAS of IBD [3]. Likewise, we used the same SNPs and specific ORs estimated for the Caucasian population [3] to build the GRS and to test predictive models in the two CA populations.
Genetic Risk Score (GRS)
We built a multi-locus GRS for each patient in the studied population by taking the frequency of a given risk allele per SNP from the controls for our target populations and multiplying it by the natural logarithm of its OR, as estimated in the above procedures. Unweighted GRS were built for each disease phenotype in the three East Asian populations (GRS IBD vs. controls, GRS CD vs. controls, and GRS UC vs. controls) and we thus arrived at nine GRS in 2,763 patients and 3,631 controls that utilized the allelic OR from the MMM model analyses. The models used two populations and the allele frequencies taken from controls of the third target population to account for the strength of the genetic association of each allele in the target population. We shuffled the three East Asian populations into three settings. We calculated the combined independent risk for IBD, CD and UC in the Korean and Chinese populations (OR_KC), and then used these estimates to build GRS for the 176 SNPs associated with the phenotype of interest for the Japanese population. Next, we applied the combined independent risk estimate for the Japanese and Chinese populations (OR_JC) to calculate the GRS per SNP for IBD, UC and CD in the Koreans. Finally, we combined the Japanese and Korean populations to calculate the independent risk estimate (OR_JK) per SNP and applied it to the Chinese population (Supplementary Figure S2).
For the Indians and Iranians, we implemented the same procedures using ORs of the 194 associated IBD SNPs, as defined above. Genetic risk scores were calculated for each population using the R package "Mangrove" (see Web link).
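As a rough illustration of the weighting step, the sketch below computes a log-OR-weighted score for one individual. It assumes the common convention of summing risk-allele counts (0, 1 or 2 per SNP) multiplied by ln(OR); the function name, the dosage encoding and the example values are hypothetical, and the Mangrove package may organize the calculation differently.

```python
import math


def genetic_risk_score(risk_allele_counts, odds_ratios):
    """Sum of risk-allele counts weighted by ln(OR), one term per SNP.

    risk_allele_counts: per-SNP counts of the risk allele (0, 1 or 2).
    odds_ratios: per-SNP allelic odds ratios, in the same SNP order.
    """
    if len(risk_allele_counts) != len(odds_ratios):
        raise ValueError("one odds ratio is required per SNP")
    return sum(count * math.log(odds_ratio)
               for count, odds_ratio in zip(risk_allele_counts, odds_ratios))


if __name__ == "__main__":
    # Hypothetical ORs for three SNPs (not taken from the study).
    example_ors = [1.20, 1.10, 1.35]
    carrier = [2, 1, 1]      # carries four risk alleles in total
    non_carrier = [0, 0, 0]  # carries none, so the score is zero
    print(genetic_risk_score(carrier, example_ors))
    print(genetic_risk_score(non_carrier, example_ors))
```

Using ln(OR) as the weight makes the score additive on the log-odds scale, so SNPs with larger effects contribute proportionally more.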
Explained Variance and Predictive Analyses
We estimated the explained variance (disease susceptibility) for the risk alleles of IBD and its phenotypes in the five populations. Mangrove holds the risk alleles, effect sizes (β values) and frequencies (f) for a set of genetic variants (i.e. 176 for East Asians or 194 for Central Asians) relevant to predicting a phenotype. It calculates the variance explained analytically, by converting the ORs of the genetic risk variants included in the model to liability-scale units and adding them together (Figure 2). It gives the variance explained by the variants included in the model and plots the cumulative variance explained as the variants are added one at a time (in order of most to least variance explained). The distribution of predicted risks in patients was then compared to controls using the Wilcoxon rank-sum test. Given the prevalence of IBD, CD and UC for each target population, we calculated the posterior probability of disease incidence for each phenotype in the target Asian populations.
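The ordering and accumulation of per-variant contributions can be sketched as below. Note the assumptions: the sketch uses the textbook additive-model variance 2f(1−f)β² with β = ln(OR) on the log-odds scale under Hardy-Weinberg equilibrium, not the liability-scale conversion that Mangrove performs, and the function names and example values are hypothetical.

```python
import math


def per_variant_variance(risk_allele_freq, odds_ratio):
    """Additive variance 2f(1-f)*beta^2 with beta = ln(OR), on the log-odds scale."""
    beta = math.log(odds_ratio)
    return 2.0 * risk_allele_freq * (1.0 - risk_allele_freq) * beta ** 2


def cumulative_variance(variants):
    """Running total of contributions, added from largest to smallest.

    variants: iterable of (risk_allele_freq, odds_ratio) pairs.
    """
    contributions = sorted(
        (per_variant_variance(freq, odds_ratio) for freq, odds_ratio in variants),
        reverse=True,
    )
    running_totals = []
    total = 0.0
    for contribution in contributions:
        total += contribution
        running_totals.append(total)
    return running_totals


if __name__ == "__main__":
    # Hypothetical (frequency, OR) pairs, not values from the study.
    example = [(0.30, 1.15), (0.10, 1.40), (0.45, 1.05)]
    for rank, value in enumerate(cumulative_variance(example), start=1):
        print(rank, round(value, 5))
```

Because the contributions are sorted in decreasing order, the resulting curve rises steeply at first and then flattens, which is the shape a cumulative variance plot typically shows.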

Results
Detailed clinical characteristics of the participants are shown in Supplementary Table S1. Characteristics of the genetic variants included in the GRS, including mean ORs (±SE) and allele frequencies (AF), are presented in Table 1 (see also Supplementary Figures S2-S6). As shown in Figure 2, the cumulative variance explained by 176 IBD SNPs was 4.4% for Japanese, 1.5% for Koreans, and 1.34% for Chinese, while the cumulative variance explained by 194 IBD SNPs was 3.81% for Indians and 4.14% for Iranians.
Phenotype Prediction
Given a prevalence (prior probability) of 0.08% for IBD in Japanese [1], we estimated a posterior probability of 8.8×10⁻⁴ after including the GRS in our model, which was not significantly different from the prior probability. This observation also held for the Koreans [19] and Chinese [20], with prior probabilities of 0.04% and 0.009%, respectively, for IBD, and posterior probabilities of 4.7×10⁻⁴ and 1.05×10⁻⁴. The calculated GRS for CD and UC explained CD and UC to a less significant extent than for IBD in the Japanese, Koreans and Chinese, with prior probabilities of 0.02%, 0.01% and 0.001%, respectively, for CD and of 0.06%, 0.03% and 0.008%, respectively, for UC. The predictive probabilities were negligible: 2.12×10⁻⁴, 1.06×10⁻⁴ and 2.98×10⁻⁵ for CD and 6.18×10⁻⁴, 3.73×10⁻⁴ and 7.13×10⁻⁵ for UC in the Japanese, Koreans and Chinese, respectively (Table 1).
Given a prevalence (prior probability) of 0.044% for IBD in Indians [1,21], we estimated a posterior probability of 5.52×10⁻⁴ after including the GRS in our model, which was not significantly different from the prior probability. This observation also held for the Iranians [22], with a prior probability of 0.04% for IBD and a posterior probability of 5.4×10⁻⁵. The calculated GRS for CD and UC explained CD and UC to a less significant extent than for IBD and gave prior probabilities of 0.002% and 0.005% for CD, and 0.044% and 0.035% for UC, in Indians and Iranians, respectively. The predictive probabilities were negligible: 2.11×10⁻⁵ and 5.8×10⁻⁵ for CD, and 5.59×10⁻⁴ and 4.56×10⁻⁴ for UC, in the Indo-Iranian populations (Table 1 and Supplementary Figures S2-S6).
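The relation between a prevalence used as a prior and the resulting posterior can be illustrated with the odds form of Bayes' rule. This is a generic sketch, not the procedure implemented in Mangrove: the function name is hypothetical, and the likelihood ratio passed in the usage example is an arbitrary illustrative value rather than an estimate from this study.

```python
def posterior_probability(prior, likelihood_ratio):
    """Update a prior disease probability with a likelihood ratio (odds form of Bayes' rule)."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Prior of 0.08% (the IBD prevalence quoted for Japanese in the text);
    # the likelihood ratio of 1.1 is purely illustrative.
    print(posterior_probability(0.0008, 1.1))
```

When the prior is very small, even a likelihood ratio noticeably above 1 leaves the posterior close to the prior in absolute terms, which helps explain why a modest genetic signal adds little to population-level prediction.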

Discussion
IBD are chronic inflammatory diseases caused by an abnormal immune response towards microorganisms in genetically susceptible individuals. We aimed to understand the variance of IBD as explained by GRS for IBD, CD and UC in Asians. Across five populations, we showed that GRS could significantly explain up to 4.40% of IBD disease variance. The GRS, representing the cumulative effect of 176 risk alleles for IBD in East Asian populations and 194 risk alleles in Central Asians, could significantly explain 1.19 to 4.40% of the variance of IBD, CD and UC in East Asians, and 3.49 to 4.26% in Central Asians, but this yielded a negligible additive predictive probability for IBD, CD and UC in our populations.
The past few decades have witnessed a rapid rise in the incidence of IBD in newly industrialized countries, including those in East and Central Asia [1,23]. Such increases have highlighted the importance of studying the disease in these geographical areas, particularly for genetic factors that may reveal additional population-specific risk loci. Although environmental triggers are important in the development of the disease, an underlying genetic susceptibility is also required [24]. It is known that the coefficient of heritability for siblings of IBD probands is 25 to 42 for CD, and 4 to 15 for UC, and that heritability estimates from pooled twin studies are 0.75 and 0.67, respectively [25][26][27]. Furthermore, our earlier trans-ethnic GWAS reported ~220 variants for IBD, but we also highlighted significant genetic heterogeneity between European and non-European populations for the majority of IBD risk loci [3]. However, there are indeed population-specific loci for the disease. For example, a meta-analysis of Asian studies revealed that NOD2 and ATG16L1 were not associated with IBD in many Asian populations [28], whereas a GWAS in Ashkenazi Jewish CD patients identified five novel genetic loci that had not been found in non-Jewish Caucasian populations [29]. Since many genetic variants have small effect sizes and each variant accounts for only a small part of the disease heritability [4], it is now common to use GRS to overcome population-specific effects. GRS summarizes the overall genetic risk across the genome by aggregating information from multiple risk alleles, and this approach is robust to skewed effect sizes due to imperfect linkage or low allele frequencies [30]. Previously we demonstrated the role of GRS in representing the strong association of all known risk alleles for IBD with sub-phenotypes [6]. GRS has been shown to explain disease heritability and it helps to dissect genetic overlap between sub-phenotypes [31,32].
We found that GRS could explain 1.19 to 4.40% of IBD disease variance in Asian populations, compared with 10% in European populations [6], where many more IBD-associated SNPs have been identified. Although the strong associations show an unequivocal genetic component in disease susceptibility in these populations, they only explain a small proportion of disease variance. These percentages are similar to those reported for other common diseases, such as diabetes mellitus (0.4%) [14,15], coronary heart disease (2.2%) [33,34], and breast cancer (0.6%) [35]. The disease variance percentages are comparable between CD and UC in our Asian populations, suggesting a similar contribution of the risk alleles to both diseases [36]. Likewise, despite much-anticipated interest, predicting the outcomes may not be achievable with the current data. The small proportion of variance explained reveals the presence of significant missing heritability, which may be due to genetic epistasis, gene-environment interaction, or the presence of unmapped genetic loci and/or rare variants [37][38][39]. Importantly, the rapid rise of IBD incidence in Asia points to the importance of a changing environment, highlighting the possibility of gene-environment interaction in its pathogenesis [40,41]. In support of this hypothesis, several studies have shown that environmental factors, like smoking or gut microbes, can modify the risks conferred by the major genetic loci of STAT3 [42] and TNFSF15 [43]. The importance of gene-environment interactions in shaping disease susceptibility can best be studied in populations where the epidemiology is changing rapidly [44].
We found that IBD GRS showed little additive value in predicting IBD, CD or UC in the general Asian population. The low predictive proportions may also be attributable to our relatively small sample size (the predictive ability of a polygenic score can be affected by the sample size [45]). This limitation reflects the need for much larger studies in non-European populations, which will likely yield new genetic loci for IBD. However, even with a relatively small sample size, we were able to replicate several European risk alleles in our Asian populations and showed that a composite weighted score from European risk alleles is still relevant. Another possible explanation for the ethnic differences is a different environmental burden of the disease. This may be relevant given the fast-evolving environment in Asian countries, which could itself be responsible for the rapidly increasing disease incidence in this region. We conclude there may be a lower genetic risk threshold for IBD in Asians.
The low predictability of GRS we found in Asian populations is not unique in IBD [46]. CD and UC GRS did not predict IBD phenotype or complications in Hispanics or in non-Hispanic Whites, except for indicating a younger age of onset in Hispanics and abdominal surgeries in CD, both with only weak significance. This study reported no relationship between GRS and colectomies for UC or the number of IBD-related hospitalizations [46]. In a study on ischemic stroke, the combined GRS (cGRS) of 113 SNPs led to an increase of only 0.5% in predictive power when added to all co-variables [46]. This suggests there is no clinical advantage in constructing a multi-locus SNP panel for predicting stroke risk, even when extended to include variants acting on intermediate phenotypes such as hypertension or atrial fibrillation [46]. This study suggested that the gain in predictive power from adding GRS to gender alone is limited in stroke [46]. Furthermore, our current findings agree with studies on other conditions, including breast cancer [35], diabetes mellitus [14,15], coronary heart disease [33,34], and multiple sclerosis [47], that found limited improvement in risk prediction when using GRS. Two larger studies concluded that, at present, the discriminative power of polygenic risk score (PRS) for schizophrenia is not sufficient to use in population screening to identify individuals at high risk and that PRS may never prove powerful enough for screening [48,49]. However, PRS explains a substantial amount of the variance of schizophrenia in Europeans, probably more than any other traditional risk factor. Finally, the genetic risk variants so far known to play a role in migraine are not able to explain a comprehensive set of clinical characteristics of migraine severity [8].
Altogether, as we stated during the early development of genetic studies [50], these observations across different domains question the potential clinical utility of PRS in predicting complex diseases in general populations. As noted by us and others [50][51][52], the predictive utility of GRS for common diseases is likely to be very limited, especially considering the myriad factors of the exposome that also influence individual susceptibility [51].
Our current study has several methodological strengths, including replication of the GRS in independent case-control samples and validation in general populations. We used one of the largest and most accurate collections of population-based samples of IBD patients with genome-wide data available to date. We based this work on an earlier GWAS with enriched genetic data to capture IBD probability compared to Caucasian populations. The current study investigated the association between IBD and GRS in several Asian populations. We demonstrated the feasibility of applying GRS based on Caucasian risk alleles to several Asian populations. The method we used provided efficient, genome-wide coverage for our Asian sample. However, in our total sample, GRS explained only 4.40% of the variance in predicting case-control status, most probably capturing the strong genetic signal from Europeans.
The results of our study must be interpreted in light of its limitations. Our study was not sufficiently powered to investigate the effects of GRS of IBD, CD and UC that might play a role, especially in disease prediction. Moreover, our study included fewer participants with IBD than Caucasian studies, which might have limited our statistical power to identify any association between GRS and the predictability of IBD phenotypes in Asian populations. Future studies with a balanced number of participants with IBD are needed to confirm our findings. Our estimates of the GRS' predictive value and explained variance were based on common SNPs that had been selected based on predefined criteria from recently published loci associated with IBD. Thus, we may have excluded many variants with an effect on IBD. We did not analyze GRS for IBD subtypes because this information was not available in all cohorts used for validation. Our participants were specifically of Asian descent, so our results may not be generalizable to other ethnicities. To explore the clinical importance of the association of GRS with IBD using common variants, further research in the field should go beyond the association with case-control status. Alternative strategies for constructing GRS for IBD and for combining GRS with risk factor profiles and clinical information might eventually lead to better risk prediction. Future risk assessment for complex diseases such as IBD should include a much more careful consideration of gene-gene and/or gene-environment interactions.

Conclusion
In conclusion, we found a multi-locus GRS derived from GWAS for established risk factors for IBD to be significantly associated with IBD risk in Asians. However, the power of the GRS in predicting IBD risk and hence its clinical usefulness was limited. Our current study shows that genetic findings based on trans-ethnic analyses are indeed applicable across Central and East Asian populations, but the association of GRS that was built upon combining the effect of genome-wide associated risk alleles for IBD is unlikely to provide a strong predictive probability of IBD, CD and UC in Central and East Asians. Taking these results into consideration means that any strategies to test common genetic variants for informing clinical decisions would need to be rigorously tested beforehand. Greater efforts will be required to use the available genetic information in an appropriate clinical context to optimize disease prevention and management.