LDB and genes in association with IBD
A total of 546 IBD-associated single-nucleotide polymorphisms (SNPs) were identified from 35 association studies (Table S2), corresponding to 718 GWAS genes. This set includes 413 mappedGenes and 448 eGenes, with an overlap of 143 genes, as depicted in Fig. 1B and Table S2. Notably, 13 of the 104 monogenic IBD genes (monoGenes) are GWAS genes, i.e. ’Mendelian-complex genes’, exhibiting significant intersection (Fisher’s exact test; protein-coding genes only, p = 6.72x10− 6). Functional gene set enrichment analysis revealed similar enrichment of both GWAS genes and monoGenes in immune-related pathway (Table S3; Figure S1), aligning with the anticipated convergence of molecular pathogenic pathways in monogenic IBD and complex IBD.
Another feature of GWAS genes is the enrichment of non-protein coding pseudogenes (n = 157), which make up 26.39% of mappedGenes and 12.95% of eGenes. This aligns with overrepresentation of pseudogenes in the applied reference GENCODE V43 [33, 34] and our impartial SNP-gene mapping approach with no preferable selection for protein-coding genes or known IBD genes. Whilst there is growing knowledge of their association with disease and immune regulation [35], the majority of the pseudogenes are not covered by the Exome capture kit (n = 116; Fig. 1b).
Utilizing the European-based LDU map[14], 546 GWAS SNPs are categorized into 260 LDBs, with 150 consisting of a single association SNP (IBD-association p < 5x10− 8), and the remaining defined by ≥ 2 GWAS association SNPs. The LDBs span from 1.00 to 3.20 LDU, or 3,630bp to 3,246,717 bp according to the physical position in size. The largest LDB, LDB78b, is located at 5q31.1 (Table S2), and encompasses 6 GWAS association SNPs, which consist of eQTLs of MEIKIN. LDB78a, despite being > 1LDU far away from LDB78b, encompasses another IBD-associated eQTL of MEIKIN. Such LDBs, by sharing a common gene with the association SNP that they encompass, are defined as clusters of LDBs (n = 21). As might be expected, the most significant cluster is the HLA region at 6p21.32-33, comprising 7 LDBs (Table S2). One hundred and ninety-four LDBs are captured by the Exome sequencing assay. These LDBs encompass the complete sequence of 313 GWAS genes, partially overlap with 201 GWAS genes, and have no intersection with the other 204 GWAS genes. LDBs can also extend beyond mappedGenes and eGenes. For instance, LDB187 at 16q12.1, delineated by 5 GWAS SNPs covers CYLD, a monoGene but not a GWAS gene, besides NOD2 and CYLD-AS1 (Figure S2).
The GWAS genes, LDBs and monoGenes together account for 885 target regions to be tested, as component LDBs within a LDB cluster is tested separately.
The UK Biobank IBD cohort
Following QC, ethnicity- and phenotype-based filtration retained 891 CD, 1,409 UC cases, and 60,118 controls. Most of the IBD diagnoses were made in patients’ adulthood, whilst 37 CD and 33 UC were diagnosed on or before the patients reached 18 years old. Further demographic and sub-phenotypic features of the UC and CD patients are derived based on the ICD-10 code of diagnosis as shown in Table 1.
Table 1
Demographic characteristics of the European UK BioBank cohort for the analysis
| CD (n = 891) | UC (n = 1409) | Controls (n = 60,118) |
Demographics | | | |
Male | 387 | 730 | 32,925 |
Female | 504 | 679 | 27,193 |
Age at latest assessment: median (IQR) (Year) | 59 (51–64) | 61 (54–65) | 58 (50–63) |
Age at Diagnosis; median (IQR) (Year) | 57 (49–65) | 51 (49–67) | NA |
Disease Subtypes |
| Small intestine | Large intestine | Both small & lareg intestine | Unspecified | ileocolitis | proctitis | rectosigmoiditis | proctitis & rectosigmoiditis | unspecified | NA |
| 174 | 197 | 47 | 473 | 6 | 192 | 81 | 25 | 1105 | NA |
GI complications |
Fistula disease | 57 | 30 | NA |
Stricturing disease | 138 | 57 | NA |
Colon cancer | 26 | 43 | NA |
Megacolon disease | 4 | 6 | NA |
Comorbidities with other autoimmune diseases |
n = 1 | 111 | 151 | NA |
n = 2 | 15 | 15 | NA |
n > = 3 | 1 | 5 | NA |
Pathogenic mutations of GWAS association loci and monogenic IBD genes
All but 10 of the GWAS-derived set of 794 targets host ≥ 1 variants with CADDphred_score≥15 in the cohort, and all the monoGenes were mutated in at least 1 patient. Despite this, pathogenic variants were very sparsely identified in the patients. Approximately half of the testing loci had a non-zero GenePy score in fewer than 5% of patients, as observed on 416 (52.39%) of the GWAS loci and 46 (44.23%) of the monoGenes in CD patients, and similarly on 425 (53.53%) GWAS loci and 48 (46.15%) monoGenes in UC. With more than half of the values being zeros, the GenePy score matrix per locus/individual is a sparse matrix for downstream analysis.
The most mutated genes are the 13 known ‘mendelian-complex’ IBD genes, as 8 (61.53%) are mutated in > 5% of both UC and CD, except for CD40, IL2RA, IL10, STAT3 and LACC1 that are rarely mutated either UC or CD. Such sparsity of non-zero GenePy scores of the patients corresponds to the genetic heterogeneity of IBD and is the rational for the following GenePy-based association tests on subpopulations with highest scores.
Association of the candidate regions with disease
Under the monogenic IBD model, two significant associations are observed with CD which exert opposing effects on disease: NOD2 being risk under the recessive model and IL23R, under the dominant and additive inheritance models, both with protective effects. Both genes are known IBD genes with NOD2 also being a ‘mendelian-complex’ IBD gene. No significant associations were detected with UC from this test (Figure S3).
Burden-based SKAT-O test highlighted the most significantly associated gene of UC, RIPK2-DT, a noncoding eGene associated with the IBD-association SNP rs7015630. RIPK2-DT plays a role in mitigating inflammation induced by free fatty acids but is less known in IBD compared to its downstream gene RIPK2 [36, 37]. The RIPK2-DT association was not detected in the GenePy-based rank sum tests.
GenePy-based Mann-Whitney U test uncovered 35 loci in significant association with CD and 25 with UC (Fig. 2; Figure S4). HLA-DQA1 and HLA-DQB2 are the most significantly associated genes with UC and controls of the top 7.5% or GenePy scores, albeit of modest effect sizes (ϕHLA−DQA1 = 0.63, CI [0.59,0.67]; ϕHLA−DQB2 = 0.66, CI [0.63,0.70]), compared to other associated genes, e.g. ϕSLC17A1 = 0.81, CI [0.73,0.88], or the monoGene LIG4, where ϕLIG4 = 0.82, CI [0.74,0.89]; NOD2, together with the co-located LDB187 and CYLA-AS1 gene (Figure S2), but not CYLD the monoGene, are the most significantly associated with CD (Fig. 2). Such associations propped up by the rare pathogenic variants (CADDphred_score≥15) exert larger effect sizes to disease compared to that identified from the original GWAS, of both protective and risk effects observed, and such effects tend to be bigger when the affected sub-population is smaller (Fig. 2). Notably, although the smallest p value of NOD2 was observed in individuals of the top 7.5% highest scores (p = 1.41x10− 17, ϕ = 0.80, CI [0.77,0.83], the maximum effect size was observed in those with the top 2.5% highest score p = 5.13x10− 7, ϕ = 0.81, CI [0.76,0.86])
The eGene NOTCH1 and a mappedGene CARD9 at locus 9q34.3 are tagged by the association SNPs encompassed by LDB131a/b, and both exhibited significant association with CD, evincing pleiotropic effects at the gene level of a GWAS association locus. In another case, LDB189 which constitutes a proportion of the PLCG2 gene encompassing the phospholipases domain, is significantly associated with CD with protective effects but the entire PLCG2 gene is not (Fig. 3), in line the GWAS findings[38].
Set of highly mutated genes in IBD and controls
We tested rare variant-based associations in both UC and CD, appeared to exert both protective and risk effects, with the potential for some cases of the disease to constitute oligogenic pathogenesis given the large effect sizes. We tested this using an itemset association analysis by the APRIORI algorithm, with patients carrying higher GenePy2 score than the cut-off applied in association tests considered being GenePy positive for a mutant gene or LDB (Table S5-8). The test was conducted on 398 CD patients and 28,017 controls with any positive GenePy status of the 34 CD-associated genes/LDBs, and similarly on 613 UC with 25,748 controls with positive GenePy of the 25 association loci.
GenePy status of LDBs/genes within the same GWAS association region tend to be associated because of the existing intra-region overlaps (Table S2 and Table S5-8). Between GWAS regions, considerable coexistence of ‘positive’ GenePy status of LDB187/NOD2 and IR23R/LDB6 were observed in controls (Fig. 4A). This coexistence was completely absent in itemset observation in CD cases, with GenePy(+) status of both the NOD2/IL23R regions being mutually exclusive in CD patients (Fig. 4A; Table S5-S6). IL23R and the genomic region also showed strongest associations with other regions in controls of the UC-associated genes/LDBs (Fig. 4B; Table S7-8), indicating its counter-risk effects in both IBD subtypes, albeit the observation in UC can be biased as the sub-population with IL23R positive GenePy status constituted 14.08% of the UC cohort (Fig. 3C).