Study Design
The current study builds on an earlier trans-ethnic GWAS of 9,846 subjects of non-European descent, in which we identified 38 novel genetic loci for IBD and verified another 173 [3]. Here, we generated GRS for the IBD, CD and UC phenotypes and estimated the explained variance and predictability of IBD, CD and UC incidence in Asian populations of Japanese, Korean, Chinese, Indian and Iranian descent.
IBD diagnosis was established by IBD specialists who reviewed case notes and clinical, radiological, pathological and endoscopic reports [3]. After quality control, we extracted detailed information on the disease phenotypes for the five studied populations. The original study included patients from population-based registries and from secondary and tertiary medical referral centers at multiple locations [3] (Supplementary Text S1).
We had data on 9,698 participants, including their gender, ethnicity, smoking status, family medical history, and clinical and genetic data. We retrieved Immunochip array genotypes for 6,395 East Asian and 3,303 Central Asian patients and country-, age- and gender-matched controls. The current analysis used three East Asian populations, comprising 5,317 Japanese (1,312 CD, 723 UC, and 3,282 controls), 545 South Koreans (201 CD, 230 UC, and 114 controls) and 533 Chinese (155 CD, 143 UC, and 235 controls), and two Central Asian populations, comprising 2,413 Indians (184 CD, 1,237 UC, and 992 controls) and 890 Iranians (151 CD, 397 UC, and 342 controls) (Supplementary Figure S1).
Detailed phenotype data were available for at least 74.5% of CD patients and 88.9% of UC patients in the three East Asian populations, and for 71% of CD patients and 76% of UC patients in the two Central Asian populations. Data on age at diagnosis, family history, and smoking were available for 82.6%, 82.7% and 61.0% of patients in the East Asian populations, and for 79%, 77% and 76% in the Central Asian populations. Data on age of disease onset, extra-intestinal manifestations, and surgical history were also collected for 2,557 East Asian, 1,421 Indian, and 548 Iranian IBD patients.
All 9,698 participants in the current study had been genotyped on the Immunochip array as part of the trans-ethnic IBD genetic consortium (IBDGC) initiative [3]. During quality control we removed individuals with more than 10% of their genotypes missing. The genotyping methods and quality control have been described elsewhere [3]. We selected genotype data on the 201 common IBD-associated SNPs discovered in our earlier study [3]. Genetic data were harmonized by filtering out the genetic variants that were missing in any of the five populations.
For missing SNPs, we first identified proxy SNPs in very high linkage disequilibrium (LD; r2 > 0.99) and then selected those closest to the GWAS SNPs, using reference panels from the 1000 Genomes Project. Of the 201 IBD-associated SNPs, 19 had missing information in at least one of the three East Asian populations and could not be replaced by a proxy, yielding 182 common SNPs for the final analysis. In the Indian and Iranian populations, we could not retrieve a proxy for seven of the 201 SNPs, yielding 194 common SNPs.
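The proxy selection described above can be illustrated with a minimal sketch. It assumes each candidate proxy is given as a tuple of identifier, base-pair position and r2 with the GWAS SNP; the function name, the tuple layout and the toy identifiers are hypothetical, and the sketch is not the pipeline used in the study.

```python
def pick_proxy(gwas_position, candidates, min_r2=0.99):
    """Pick the proxy closest to the GWAS SNP among candidates with r2 > min_r2.

    candidates: list of (snp_id, position, r2) tuples (hypothetical layout).
    Returns the chosen snp_id, or None when no candidate passes the LD threshold.
    """
    eligible = [c for c in candidates if c[2] > min_r2]
    if not eligible:
        return None  # the SNP stays missing and is dropped from the common set
    # Closest by physical distance; ties are broken in favor of higher r2.
    best = min(eligible, key=lambda c: (abs(c[1] - gwas_position), -c[2]))
    return best[0]


# Toy example with made-up identifiers and positions.
candidates = [
    ("proxy_a", 1_000_500, 0.995),
    ("proxy_b", 1_000_100, 0.992),
    ("proxy_c", 1_000_010, 0.950),  # closest, but below the LD threshold
]
print(pick_proxy(1_000_000, candidates))
```

SNPs for which the function returns None correspond to the variants that could not be replaced, such as the 19 removed for the East Asian populations.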
Ethical Considerations
The protocol of the described study is in line with the ethical guidelines of the 1975 Declaration of Helsinki, as reflected in its approval by the medical ethical review boards of all cohorts involved in the International Inflammatory Bowel Disease Genetics Consortium (IIBDGC). All methods were performed in accordance with the relevant guidelines and regulations. The recruitment of study subjects and all methods and protocols were approved by the ethics committees or institutional review boards of all participating centers in the countries involved in the IIBDGC and in this study, including: Institute of Medical Science, University of Tokyo, and RIKEN Yokohama Institute, Japan; Yonsei University College of Medicine and Asan Medical Centre, Seoul, Korea; Chinese University of Hong Kong, China; Digestive Disease Research Institute, Tehran University of Medical Science, Iran; and Department of Medicine, Dayanand Medical College and Hospital, Ludhiana, India. Informed consent was obtained from all participants in this study.
Data Analysis
Our final analysis was performed on the matched genotype and phenotype data of 4,733 IBD patients (2,003 CD; 2,730 UC) and 4,965 country-, age- and gender-matched controls. For each SNP, the risk variant was used to build the GRS, first to test the explained variance of IBD, CD and UC, and second to examine disease predictability in the general populations by implementing a previously applied systematic framework [6].
We fitted genotype-phenotype mixed models for the East Asian populations to estimate independent risk, and cross-checked these models across the three separate East Asian populations. The East Asian dataset was split into (1) a training set, comprising two of the three East Asian populations, used to build the model and calculate the odds ratio (OR), and (2) a test set, comprising the third population, used to evaluate and validate the predictive model built on the training set. In other words, the target population was excluded and the association of each allele with the risk of the phenotype of interest was estimated in the two remaining populations. To calculate the independent risk for IBD, CD and UC per IBD SNP, we first combined the Korean and Chinese populations (ORKC), then the Japanese and Chinese populations (ORJC), and finally the Japanese and Korean populations (ORJK). We used an additive linear mixed model as implemented in the software package MMM (a C program for analyzing linear mixed models) [18] to calculate the risk (OR) for each of the 182 common IBD SNPs (Figure 1).
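The rotation of training and test populations can be written out as a short sketch. The function name is hypothetical and the code only enumerates the three splits; the OR estimation itself was done with MMM and is not reproduced here.

```python
populations = ["Japanese", "Korean", "Chinese"]


def leave_one_population_out(pops):
    """Return (training_populations, target_population) pairs.

    Each population serves once as the held-out target, while the
    remaining populations form the training set used to estimate ORs.
    """
    splits = []
    for target in pops:
        training = tuple(p for p in pops if p != target)
        splits.append((training, target))
    return splits


for training, target in leave_one_population_out(populations):
    print(f"estimate ORs in {' + '.join(training)}; evaluate in {target}")
```

With the three East Asian populations this yields the Korean and Chinese training set for the Japanese target (ORKC), the Japanese and Chinese training set for the Korean target (ORJC), and the Japanese and Korean training set for the Chinese target (ORJK).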
The original risk alleles were defined as the alleles associated with an increased risk of IBD, CD or UC in our original trans-ethnic meta-GWAS of IBD [3]; these have been replicated in a follow-up study in Caucasians [6]. In brief, we used the 201 top IBD-associated SNPs to form a genetic dataset and to build a genetic relatedness matrix (the R matrix). The R matrix was calculated from the variants available per phenotype and was included as a random-effects component in the model to account for population stratification. The results of the case-control association tests are presented as ORs with associated p-values for the phenotype of interest. We evaluated the ORs per phenotype and excluded any SNPs with an extreme OR. Finally, we included the 176 IBD risk variants that had OR estimates across the three East Asian populations to build the GRS.
For the Indians and Iranians, we included data for each of the 194 IBD-associated variants. We defined the risk allele as the one obtained for the Caucasian population in the trans-ethnic meta-GWAS of IBD [3]. Likewise, we used the same SNPs and the specific ORs estimated for the Caucasian population [3] to build the GRS and to test the predictive models in the two Central Asian populations.
Genetic Risk Score (GRS)
We built a multi-locus GRS for each individual in the studied populations by taking, per SNP, the frequency of the risk allele in the controls of the target population and multiplying it by the natural logarithm of its OR, as estimated in the procedures described above. Weighted GRS were built for each disease phenotype in each of the three East Asian populations (GRS for IBD vs. controls, CD vs. controls, and UC vs. controls), giving nine GRS in 2,763 patients and 3,631 controls that used the allelic ORs from the MMM analyses. In each setting, the ORs were estimated in two populations and the allele frequencies were taken from the controls of the third (target) population, to account for the strength of the genetic association of each allele in the target population. We rotated the three East Asian populations through three such settings. First, we calculated the combined independent risk for IBD, CD and UC in the Korean and Chinese populations (ORKC) and used these estimates to build the GRS from the 176 associated SNPs for the Japanese population. Next, we applied the combined independent risk estimates for the Japanese and Chinese populations (ORJC) to build the GRS for IBD, UC and CD in the Koreans. Finally, we applied the combined independent risk estimates for the Japanese and Korean populations (ORJK) to build the GRS in the Chinese (Supplementary Figure S2).
For the Indians and Iranians, we followed the same procedure using the ORs of the 194 IBD-associated SNPs, as defined above. Genetic risk scores were calculated for each population using the R package “Mangrove” (see Web link).
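As an illustration of how a log-OR-weighted score is assembled, the sketch below uses the commonly applied additive form, in which an individual's risk-allele counts (0, 1 or 2 per SNP) are weighted by ln(OR), together with the population-level expectation based on control allele frequencies. This form, the function names and the toy values are assumptions made for the example; the scores in the study were computed with Mangrove.

```python
import math


def weighted_grs(risk_allele_counts, odds_ratios):
    """Sum of risk-allele counts (0, 1 or 2 per SNP) weighted by ln(OR)."""
    if len(risk_allele_counts) != len(odds_ratios):
        raise ValueError("one OR is required per SNP")
    return sum(c * math.log(o) for c, o in zip(risk_allele_counts, odds_ratios))


def expected_grs(control_frequencies, odds_ratios):
    """Expected score in the target population: sum of 2 * f * ln(OR)."""
    if len(control_frequencies) != len(odds_ratios):
        raise ValueError("one OR is required per SNP")
    return sum(2 * f * math.log(o) for f, o in zip(control_frequencies, odds_ratios))


# Toy values for three SNPs (made up for illustration).
odds_ratios = [1.20, 1.10, 1.35]          # estimated in the two training populations
control_frequencies = [0.30, 0.55, 0.10]  # risk-allele frequencies in target controls
individual = [2, 1, 0]                    # risk-allele counts for one person

score = weighted_grs(individual, odds_ratios)
baseline = expected_grs(control_frequencies, odds_ratios)
print(f"GRS = {score:.4f}, population expectation = {baseline:.4f}")
```

Taking the ORs from the training populations and the frequencies from the controls of the held-out population mirrors the separation between risk estimation and target population described above.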
Explained Variance and Predictive Analyses
We estimated the variance in disease susceptibility explained by the risk alleles for IBD and its phenotypes in the five populations. Mangrove holds the risk alleles, effect sizes (β values) and frequencies (f) for the set of genetic variants relevant to predicting a phenotype (i.e. 176 for East Asians or 194 for Central Asians). It calculates the variance explained analytically, by converting the OR of each genetic risk variant included in the model to liability-scale units and adding these together (Figure 2). It reports the variance explained by the variants included in the model and plots the cumulative variance explained as the variants are added one at a time, in order from most to least variance explained. The distribution of predicted risks in patients was then compared to that in controls using the Wilcoxon rank-sum test. Given the prevalence of IBD, CD and UC in each target population, we calculated the posterior probability of disease incidence for each phenotype in the target Asian populations.
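The comparison of predicted risks between patients and controls can be sketched with a self-contained rank-sum test. The version below uses midranks for ties and a normal approximation without continuity correction; these choices, the function name and the toy values are assumptions for illustration, so results may differ slightly from those of standard statistical software.

```python
import math


def rank_sum_test(cases, controls):
    """Wilcoxon rank-sum (Mann-Whitney) test via normal approximation.

    Returns (U statistic for cases, two-sided p-value). Ties receive
    midranks and the variance is corrected for ties.
    """
    pooled = sorted([(v, 0) for v in cases] + [(v, 1) for v in controls])
    n1, n2 = len(cases), len(controls)
    n = n1 + n2
    rank_sum_cases = 0.0
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j
        t = j - i                    # size of this group of tied values
        tie_term += t ** 3 - t
        rank_sum_cases += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_cases - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical: no evidence of a difference
    z = (u - mean_u) / math.sqrt(var_u)
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, p_value


# Toy predicted risks (made up for illustration).
u, p = rank_sum_test([0.12, 0.20, 0.31, 0.18], [0.05, 0.09, 0.11, 0.15])
print(f"U = {u}, p = {p:.4f}")
```

A U statistic close to the product of the two group sizes indicates that predicted risks in patients are almost uniformly higher than in controls.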
Risk Prediction in Unrelated Individuals
To predict IBD phenotypes from GRS, we considered a matrix of GRS with elements $\mathrm{GRS}_j$ for individual $j$, and a vector of standardized effect sizes $\beta = \{\beta_{\mathrm{GRS}}\}$. Next, the IBD phenotype predictions and the probabilities of disease status as a function of GRS were calculated via a logistic link function as

\[ \text{disease status} = \left(1 + e^{-\left(\mu_0 + (\overline{\mathrm{GRS}})^{T} \times \mathrm{GRS}_j - \overline{\mathrm{GRS}}\right)}\right)^{-1} \]

where $\overline{\mathrm{GRS}}$ (mean GRS) is a vector of the log odds ratios for GRS, and $\mu_0$ (the baseline risk) is a function of $K$, the prevalence of the IBD phenotype in the target population. Mean GRS was defined as

\[ \overline{\mathrm{GRS}} = \sum_{i} \left(2\beta_{i} \times f_{i}\right) \]

which is a normalizing constant accounting for $f_i$ (i.e. the allele frequency) and the effect size $\beta_i$ for the individual $j$ at a given genetic variant $i$ (i.e. each of 176 variants for East Asians or 194 variants for Indo-Iranians). More details of the data modeling and the related mathematical equations are explained elsewhere (see Web link).
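The logistic mapping from GRS to a disease probability can be illustrated with a simplified scalar sketch. Two assumptions are made that the text does not specify: the baseline μ0 is taken as the log-odds of the prevalence K, and the individual's score enters as a plain deviation from the mean GRS. The function names and toy values are hypothetical, and the exact model is the one documented for Mangrove.

```python
import math


def baseline_log_odds(prevalence):
    """Assumed form of mu_0: the log-odds of the population prevalence K."""
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    return math.log(prevalence / (1.0 - prevalence))


def disease_probability(grs, mean_grs, prevalence):
    """Logistic link applied to the baseline plus the mean-centered GRS."""
    mu0 = baseline_log_odds(prevalence)
    return 1.0 / (1.0 + math.exp(-(mu0 + grs - mean_grs)))


# Toy values (made up): a prevalence of 0.1% and a mean GRS of 1.50.
for grs in (1.00, 1.50, 2.50):
    p = disease_probability(grs, mean_grs=1.50, prevalence=0.001)
    print(f"GRS = {grs:.2f} -> predicted probability = {p:.5f}")
```

Under these assumptions, an individual whose GRS equals the population mean is assigned the prevalence itself, and each unit of GRS above the mean multiplies the odds of disease by e.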