Pathogenic repeat expansions are overrepresented in exons
Of the 224 pathogenic repeats in HGMD-RPE, most were located in introns (31.3%, n = 70), followed by exons (26.8%, n = 60), promoters (14.3%, n = 32) and 5'UTR regions (11.6%, n = 26). Fewer were located in the 5'gene (7.59%, n = 17), 3'UTR (4.91%, n = 11), 3'gene (3.13%, n = 7) regions and an exon-intron junction site (0.45%, n = 1) (Table 1), respectively. Additionally, we found 26 repeat units expanded at two or more loci causing human inherited disease and 10 genes harbouring more than one pathogenic repeat expansion. The annotation methodology employed for introns, exons, 5'UTR, 5'gene, 3'UTR and 3'gene is to be found in Methods and the Supplementary Material.
We ensured that motif length and the overall length of the repeat regions in the control sets (TRF-repeats, gnomAD-repeats and NR-region) and the HGMD-RPE were distributed similarly in each genomic functional region (GFR) (Fig. S1). However, the proportion of pathogenic repeats in specific GFRs may be overrepresented compared to the control sets. To determine whether the numbers of pathogenic repeat expansions in each GFR were higher than expectation, we compared the proportion of HGMD-RPE in a given GFR to that in the control dataset of 3,269 repeats (gnomAD-repeats and TRF-repeats). We observed that the pathogenic repeat expansions in HGMD-RPE were disproportionately enriched in exon (P = 1.38\(\times\)10−11), and 5'UTR (P = 6.22\(\times 10\)−5) regions as compared to the repeat expansions in gnomAD-repeats and TRF-repeats (Table 1). A similar comparison was also performed on HGMD-RPE-DM (a restricted dataset of HGMD-RPE encompassing pathogenic repeat expansions annotated as DM or DM? by HGMD) and the corresponding smaller gnomAD-repeats-DM and TRF-repeats-DM control datasets. We found that the repeat expansions in HGMD-RPE-DM were significantly enriched in the 5'UTR (P = 1.96 × 10− 3) and exon (P = 1.03×10− 10) regions (Table 1).
Table 1
Enrichment of pathogenic repeat expansions in different genomic function regions (GFR) comparing to three control sets.
GFR
|
HGMD-RPE
|
gnomAD-repeats
|
TRF-repeats
|
NR-region
|
P-valuea
|
3’-UTR
|
11
|
49
|
105
|
33
|
4.87\(\times 10\)-1
|
3’-gene
|
7
|
242
|
327
|
21
|
1.00
|
5’-UTR
|
26
|
13
|
150
|
78
|
6.22\(\times 10\)-5
|
5’-gene
|
17
|
221
|
267
|
51
|
9.90\(\times\)10-1
|
Exon
|
60
|
19
|
330
|
180
|
1.38\(\times\)10-11
|
Intron
|
70
|
1258
|
740
|
210
|
1.00
|
Promoter
|
32
|
417
|
368
|
96
|
9.90\(\times\)10-1
|
Total
|
224
|
2149
|
1120
|
672
|
-
|
a Enrichment of pathogenic repeat expansions in GFR comparing to the loci in control sets was evaluated by Chi-square test. |
Pathogenic repeat expansions tend to encode disordered surface-located protein domains
We next turned our attention to the protein structures encoded by pathogenic repeat expansions, and compared them with the non-repeat portions in the same gene-coding regions. A total of 60 pathogenic repeat expansions were found in the coding regions of 49 genes. Most of the secondary structures of the protein regions encoded by repeats associated with exonic pathogenic expansions were alpha helix (H), coil (C) or H and C (Fig. 2A). We found that coil (C) was significantly enriched in the repeat regions associated with pathogenic repeat expansions (P = 1.18 \(\times\) 10− 118 by one-sided Fisher’s Exact test) compared to the non-repeat regions. By contrast, beta sheet (E) and alpha-helix (H) were significantly under-represented in the pathogenic repeat regions compared to the non-repeat regions (P = 1.67\(\times\) 10− 130 and 7.32 \(\times\)10− 77, respectively).
We also compared the secondary structures encoded by polyglutamine (CAG) tracts associated with pathogenic repeat expansions with those in non-repeat regions, and found that the former was disproportionately associated with alpha-helical (H) folds (P = 2.30 \(\times\) 10− 3, one-sided Fisher’s Exact test) compared to polyglutamine in the non-repeat regions at the expense of beta sheets (E) (P = 9.60 \(\times\) 10− 3, one-sided Fisher’s Exact test). Thus, beta-sheet secondary structures were found to be under-represented within the protein regions around the CAG repeat-encoded polyglutamine tracts. Because the X-ray structure of the huntingtin protein shows that the expanded CAG-repeat gives rises to an alpha-helical secondary structure(Kim et al. 2009), we specifically analyzed the repeat expansions in five additional genes (ATXN8OS, CNR1, HTT, JPH3 and TBP) known to be associated with huntingtin(Hire et al. 2011; Holmes et al. 2001; Kloster et al. 2013; Koutsis et al. 2012; Melamed et al. 2015; Quarrell et al. 2007; Xu et al. 2009). Whereas the pathogenic repeat expansions in the HTT and TBP genes are located in exonic regions, those in the three other huntingtin-associated genes are not located within the coding region but rather in the 3'UTR (ATXN8OS), 3'gene region (CNR1) or an exonic region encoding RNA rather than protein (JPH3 gene; recorded as a JPH3-202 transcript in the Ensembl database, https://asia.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000154118;r=16:87601835-87698156;t=ENST00000301008). We then examined the enrichment of protein secondary structures in the repeat expansion regions of HTT and TBP. We found that helix (H) was significantly enriched in the repeat regions associated with pathogenic repeat expansions in HTT and TBP (P = 1.3\(\times\)10−3 by a one-sided Fisher’s Exact test) than in the non-repeat regions of these two genes.
The function of a protein is closely related to its three-dimensional (3-D) structure. Although disordered regions of proteins cannot form stable 3-D structures, they may nevertheless play important roles in biological processes(Oldfield and Dunker 2014; Uversky 2020), particularly in the context of protein-protein interactions. Therefore, we utilized SPOT-Disorder-Single(Hanson et al. 2018) to predict whether protein encoded by the exonic regions in the vicinity of known pathogenic repeat expansions had a preference to form stable 3-D structures or disordered structures. More details of the methodology used to annotate the 3-D structure of the repeat expansions within coding regions are given in the Supplementary Material. We found that 53 (88.2%, n = 60) of the repeat regions in the vicinity of pathogenic repeat expansions resided in regions with disordered protein structures. Only the repeat regions within the proteins CLEC4M, CNDP1, JPH3, LRP5, MICA, NEB and TENT5A were located within stable protein 3-D domains, if we used a commonly used cutoff 0.4 (score provided by SPOT-Disorder-Single) to define stable protein 3-D domains. Thus, the proportion of amino acid residues with disordered protein structures in repeat regions encompassing the pathogenic repeat expansions is much higher (P = 1.2 × 10− 11, by a single-tailed t-test) than that in non-repeat regions (Fig. 2B).
In order to further evaluate the influence of pathogenic repeat expansions in protein structure, we then calculated the accessible surface area (ASA) scores of the regions encompassing the pathogenic repeat expansions within the encoded proteins. The average ASA score of the pathogenic repeat regions was significantly (P = 8.9 × 10− 37) higher than for the non-repetitive regions (Fig. 2C). Thus, exonic repeat expansions encode amino acid residues that are preferentially located on the surface of proteins, where they may presumably exert their deleterious effects by mediating inappropriate interactions with other molecules, which can be experimentally validated by methods such as X-ray scattering(Hallinan et al. 2021; Jorda et al. 2010).
Pathogenic repeat expansions are predicted to facilitate long-range chromatin interactions
To assess the impact of pathogenic repeat expansions on long range chromatin interactions, we first compared the CTCF peak-RP of the repeats in the HGMD-RPE dataset with the three control datasets (gnomAD-repeats, TRF-repeats and NR-regions).
The CTCF peak-RP for the HGMD-RPE dataset was significantly higher than for all three control groups in the intron, exon and 5'gene GFRs (Fig. 3A and Table S4). For the promoter and 5'UTR regions, the HGMD-RPE also displayed a slightly higher CTCF peak-RP than the gnomAD-repeats dataset (P = 2.3 × 10− 2) but not the TRF-repeat and NR-regions datasets (P = 0.674 and 0.718). Likewise, in the restricted HGMD-RPE-DM, the repeats in the exon, promoter and 5'gene GFRs exhibited a higher CTCF peak-RP than all three control datasets (Fig. S4A and Table S5).
Taken together, these data show that pathogenic repeat expansion loci residing within the exon and 5’gene GFRs tend to be disproportionately associated with CTCF binding sites in comparison with repeat regions with no known pathogenicity. Such an observation is suggestive of a potential relationship between pathogenic repeat expansions and chromatin folding dynamics.
Enhanced chromatin accessibility is a distinctive characteristic of pathogenic repeats
Histone marks are critical determinants of chromatin compactness and transcriptional control. H3K4me3 occupancy was higher in HGMD-RPE than in gnomAD-repeats, the TRF-repeats and the NR-regions, particularly in exon [P = 1.55×10− 5 (gnomAD-repeats); P = 1.76×10− 2 (TRF-repeats); P = 1.07×10− 4 (NR-regions) ], 5’gene [P = 1.87×10− 4 (gnomAD-repeats); P = 4.59×10− 3 (TRF-repeats); P = 1.55×10− 2 (NR-regions) ] and promoter [P = 1.29×10− 7 (gnomAD-repeats); P = 5.45×10− 3 (TRF-repeats); P = 2.16×10− 2 (NR-regions) ] GFRs (Fig. 3B and Table S4 ). Likewise, the ChIP signals of H3K4me1, which are usually associated with transcription (Kang et al. 2020), were stronger for HGMD-RPE than for gnomAD-repeats within the promoter (P = 1.01 × 10− 2) and 3'UTR (P = 2.63 × 10− 2) GFRs (Fig. 3C). H3K9me3 signal in both the HGMD-RPE and the controls were generally weaker than the other histone marks (Fig. 3D). H3K9me3 signal was stronger in the HGMD-RPE dataset than in the gnomAD-repeats for intron (P = 2.92\(\times\)10−2) and promoter (P = 2.46E\(\times\)10−2), although lower than that in the NR-regions for the exons (P = 3.73E\(\times\)10−3) and 3'gene (P = 2.46\(\times\)10−2). H3K27me3 signals were also stronger in HGMD-RPE than in all three control datasets in the 5'gene GFR (P = 1.83\(\times\)10−13, 2.95\(\times\)10−4 and 2.16\(\times\)10−4, respectively; Fig. 3E and Table S4), and stronger than that of gnomAD-repeats in the exon, promoter, 5'gene and 5'-UTR (Fig. 3E and Table S4) GFRs. Finally, H3K36me3 signals were weaker in HGMD-RPE pathogenic repeats, with P-values of 4.20 \(\times\) 10− 3 and 1.37 \(\times\) 10− 3 in exons, with respect to TRF-repeats and NR-regions, and in promoters with respect to all three control datasets (Fig. 3F and Table S4). A similar analysis performed on the smaller HGMD-RPE-DM led to similar results (Fig. S4). In conclusion, H3K4me3, H3K9me3 and H3K27me3 displayed stronger occupancy in pathogenic repeat expanded regions than in the controls for most GFRs, whereas H3K36me3 occupancy was generally reduced, suggesting that one of the pathological consequences of the repeat expansions may be a shift to a more accessible chromatin structure leading to enhanced transcription.
In addition to the histone modification marks, we analysed the chromatin accessibility of our groups based on DNase-seq data. The results of this analysis (Fig. 3G) suggested that the regions associated with the pathogenic repeat expansions in HGMD-RPE tend to have greater chromatin accessibility in the 5'gene (P = 2.39\(\times\)10−3), 5'UTR (P = 3.50\(\times\)10−4) and exon (P = 1.39\(\times\)10−4) GFRs than the repeats in the gnomAD-repeats dataset (Table S4 and Fig. 3G). Chromatin accessibility was significantly greater for the pathogenic repeat expansions in HGMD-RPE in exon, 5'gene, 3'UTR and promoter GFRs as compared to the TRF-repeats. Chromatin accessibility was also greater for the pathogenic repeat expansions in HGMD-RPE in exon, 5'gene and promoter GFRs than for those in NR-regions (Fig. 3G).
The HGMD-RPE-DM dataset exhibited significantly higher chromatin accessibility in the promoter, 5'gene and 3'gene GFRs than all three control datasets. The HGMD-RPE-DM also exhibited significantly higher chromatin accessibility in the exons than the gnomAD-repeats-DM (P = 4.41\(\times\)10−4) and TRF-repeats-DM (P = 1.79\(\times\)10−3) datasets (Fig. S4G and Table S5).
Potential TFBSs in pathogenic repeat expansion loci are embedded within open chromatin regions
We next sought potential transcription factor binding sites (TFBSs) within the pathogenic repeat loci located in 5'gene, promoter and 5'UTR GFRs because these regions are the most frequently associated with the initiation of transcription (Schoenfelder and Fraser 2019; Whitfield et al. 2012). In total, 75 of 224 pathogenic HGMD-RPE repeat regions were found to reside in the 5'gene, promoter or 5'UTR regions, and of these, 54 repeat regions were found to contain 107 TF binding site motifs (Table S6), with 50 repeat regions containing more than one TF binding site motif. Additionally, 54 TF motifs appear frequently in repeat regions associated with pathogenic repeat expansions located in 5'gene, promoter or 5'UTR (Table S6) GFRs. For example, the TF motif with JASPAR dataset ID, MA0162.4 (EGR1, motif length = 14) displayed the most frequent occurrence within the pathogenic repeats, being associated with 33 different genes (Table S7).
Although the motifs identified within the pathogenic repeat loci match TFBS and appeared to have the potential to bind TFs, it is by no means known that they actually do bind TFs in vivo. We therefore repeated our analysis by including the chromatin accessibility signals within putative TFBSs. We specifically compared the chromatin accessibility signals within putative TFBSs between HGMD-RPE and the control groups (Fig. 3H and Fig. S4H). FIMO-predicted TFBSs in HGMD-RPE showed significantly higher chromatin accessibility peaks than in all three control datasets in the promoter and 5'gene GFRs, higher chromatin accessibility than the gnomAD-repeats (P = 1.26\(\times\)10−3) and the NR-regions (P = 9.93\(\times\)10−3) in the exons, and higher chromatin accessibility than the gnomAD-repeats in the 5'UTR region (P = 2.04\(\times\)10−3) (Table S4). Likewise, in the more restricted HGMD-RPE-DM dataset, the FIMO-predicted TFBSs in the 5'gene region resided in higher chromatin accessibility areas than in all controls, whereas those in exons resided in higher chromatin accessibility peaks than gnomAD-repeats-DM (P = 1.73\(\times\)10−2) (Table S5).
Taken together, we conclude that pathogenic repeat expansions have a higher probability of being associated with TFBSs, and hence dysregulating gene expression, than either repeats with no known pathogenicity or the non-repeat regions. This finding therefore provides indirect support for the view that pathogenic repeat expansions frequently perturb the regulatory functions of TFBS.
Pathogenic repeat expansions are located so as to be capable of interfering with RNA splicing
We next investigated the potential impact of pathogenic repeat expansions on mRNA splicing. We focused our analysis on repeats in exons, introns and UTRs because only in these regions do the expanded repeats have the potential to be transcribed into precursor mRNAs. For HGMD-RPE, the distances from the sites of repeat expansions to the nearest splice sites were significantly lower than for all three control groups in terms of intron, 5'gene, 3'gene and 3'UTR GFRs, lower than gnomAD-repeats and NR regions in exons (P = 1.42\(\times\)10−10 and 1.27\(\times\)10−4), and lower than gnomAD-repeats in the 5'UTRs (P = 4.80\(\times\)10−6) (Fig. 4A and Table S4). This implies that at least some pathogenic repeat expansions may exert their deleterious effects by disrupting mRNA splicing. We further leveraged MMSplice to assess the potential impact of pathogenic repeat expansions on mRNA splicing by separated computation within and around (distance to the repeat < 3000bp and > 25bp) each repeat region. MMSplice scores within and around repeats are shown in Fig. 4B and Fig. 4C, respectively. MMSplice scores were higher for HGMD-RPE than for all three control groups in the exon, 5'gene and 3'UTR GFRs (Fig. 4B and 4C), and, indeed, in all seven GFRs, MMSplice scores in HGMD-RPE were higher than in the gnomAD-repeats (Table S4). Within HGMD-RPE-DM, the repeat expansions were characterized by a significantly shorter distance to neighboring splice sites than with gnomAD-repeats and NR-regions for six GFRs (exons, promoter, intron, 3'gene, 5'gene and 5'UTR) as shown in Fig.S5 and Table S5. Thus, at least a proportion of pathogenic repeat expansions are positioned such that they may interfere with the mRNA splicing process.
Stable RNA secondary structures are associated with repeat expansions
Next, we assessed the stability of RNA secondary structure by calculating the free energy of hairpin/loop folding in 100bp fragments upstream and immediately flanking the repeat regions associated with pathogenic repeat expansions. Lower energy implies a more stable RNA secondary structure. Fragments upstream of intron, promoter, and 5'gene GFRs of HGMD-RPE exhibited higher stability (lower RNA free energy) in HGMD-RPE than in all three controls; in addition, fragments upstream of 5' UTRs displayed higher stability in HGMD-RPE than in NR-regions (Fig. 4D). Similar findings were observed in HGMD-RPE-DM (Fig. S6). Likewise, fragments downstream of GFRs exhibited greater stability in pathogenic repeat expansions than in the three controls (Fig. 4E).
We computed the mean of the pairing probabilities (mPAP) of each RNA nucleotide for all 100 bp fragments (Ke et al. 2020) for the HGMD-RPE and controls. A higher mPAP score indicates a higher probability that a given base will pair with other bases in the RNA sequence, which in turns yields a higher stability of the composite RNA secondary structures. The 100bp region 5' to the repeat regions associated with the pathogenic repeat expansions exhibited significantly higher mPAP scores than gnomAD-repeats in the exon, promoter and 5'UTR GFRs (Fig. 4F, P = 7.90\(\times\)10−3, 3.24\(\times\)10−2 and 8.14\(\times\)10−3), and higher than all three control datasets in the 5'gene region. The region 100bp to the expanded repeats in HGMD-RPE in the exon, 5'gene and 3'gene GFRs exhibited significantly higher mPAP scores than the corresponding repeat expansions in the gnomAD-repeats and TRF-repeats datasets (Fig. 4G, Table S4). Similar findings were obtained with the HGMD-RPE-DM (Fig. S6, Table S5), clearly showing that the mRNA domains flanking pathogenic repeat expansions are prone to folding into stable hairpin/loop secondary structures.
Pathogenic CGG and CAG expansions promote R-loops
R-loops are RNA–DNA hybrids that can trigger replication stress genomic instability and strand breaks from chromatin condensation, may media te transcriptional activation by altering chromatin structure at promoter regions and recruit transcription and chromatin-remodeling factors. Therefore, using RloopDB (Wongsurawat et al. 2012) (http://rloop.bii.a-star.edu.sg/), we calculated the fraction of R-loops within the repeat regions that have undergone pathogenic repeat expansions. The fraction of the repeat regions in HGMD-RPE that were covered by more than 10% R-loops was significantly (Chi-square test P < 3.8\(\times\)10-3) higher than for the repeat regions in the gnomAD-repeats or TRF-repeats datasets (Fig. 5A). We extended the R-loops analysis to the repeat regions encompassing HGMD-RPE which were associated with different disease categories. The pathologies most frequently associated with > 10% R-loops affected the nervous system, the musculoskeletal system and the brain (Fig. 5B). We additionally examined the R-loop enrichment in HGMD-RPE located in different GFRs by comparing to gnomAD-repeats and TFR-repeats. We found that HGMD-RPE is enriched in R-loops in 5’gene, exon, promoter and intron comparing to gnomAD-repeats and it is enriched in R-loops in 5’gene, promoter and intron comparing to TRF-repeats (Table S8).
Finally, we examined the percentages of R-loops spanning pathogenic repeat expansions of different repeat units, i.e. CGG, CAG, GCN, GGC and GCG, most frequently encountered in HGMD-RPE (Fig. 5C). Of the repeat expansion regions that were covered by at least 10% R-loops, 13.9% and 11.7% of them were within repeat motifs CGG and CAG, respectively (Fig. 5C) compared to 2.8% and 1.4% with TRF-repeats (Fig. 5C). With the gnomAD-repeat dataset, no repeat expansion regions with CGG and CAG motifs were found. These findings suggest that CGG and CAG are important for R-loop formation in the context of pathogenic repeat expansions, thereby supporting and extending a previous report that R-loops often occur in association with CAG/CTG repeat expansions(Malla et al. 2021).
Non-B DNA formation and expanded DNA repeats
Non-B DNA conformations are formed by long repeat tracts. Here, we investigated whether certain types of non-B DNA are enriched in pathogenic repeat expansions compared to controls. The non-B DNA structures associated with each region harbouring pathogenic repeat expansions were annotated by reference to non-B DB 2.0 (Cer et al. 2013a) and bedtools (Quinlan and Hall 2010a). In total, 216 out of the 224 HGMD-derived pathogenic repeat regions were found to be capable of adopting at least one type of non-B DNA structure. We calculated the coverage of each type of non-B structure [Z-DNA, G-quadruplex (GQ), A-phased repeats, inverted repeats (IR), mirror repeats (MR), direct repeats (DR) and short tandem repeats (STR)] in each repeat region by computing the ratio of bases covered by the non-B structure. We found that HGMD-RPE in intron was significantly enriched in GQ comparing to the gnomAD-repeats and TRF-repeats (P = 4.43 \(\times\)10− 2 for gnomAD-repeats and P = 3.40 \(\times\)10− 3 for TRF-repeats, Table S9). Moreover, HGMD-RPE repeats in 3’gene were significantly enriched in Z-DNA and IR comparing to gnomAD-repeats with P = 1.57\(\times\)10−7 and 3.52 \(\times\)10− 2, respectively. Whereas compared to the TRF-repeats, HGMD-RPE in exon are significantly enriched in A-phased repeats (P = 1.16\(\times\)10−2) and Z-DNA (P = 3.91 \(\times\)10− 5). The similar findings have been obtained in analyzing HGMD-RPE-DM (Table S9). These data support the view that non-B DNA-foming motifs may be incorporated along with other matrices to predict the formation of pathologic repeat expansions genome-wide.
Pathogenic expansion loci are evolutionarily conserved
Evolutionary conservation is a key feature to infer the function of specific DNA sequences and hence the likelihood that mutations within them will be pathogenic. We observed that loci that have undergone repeat expansions in HGMD-RPE are significantly more conserved than genomically matched gnomAD-repeats in exon, promoter and 5'UTR GFRs (P = 1.36\(\times 10\)−6, 1.74\(\times\)10−2 and 2.14\(\times\)10−2, respectively) (Fig. 6A and Table S4). Within HGMD-RPE-DM, the repeat expansion loci were significantly more conserved than those from the gnomAD-repeats and NR-regions in both the intronic and exonic GFRs (P = 4.88\(\times 10\)−3 and 5.41\(\times\)10−6 for gnomAD-repeats-DM, P = 3.47\(\times\)10−2 and 5.61\(\times\)10−3 for NR-regions-DM) (Fig. S7A and Table S5). Since the gnomAD-repeats dataset represents repeat expansions with no known pathogenicity, our results enable the prediction that repeat expansions in highly conserved GFRs are more likely to lead to disease than those in less conserved GFRs.
To ascertain whether the repeat expansions disrupt conserved functional genomic elements, we investigated the regions neighboring the repeats at the DNA sequence level. Specifically, we employed a 100-bp sliding window to scan the neighboring regions and calculated the average phastCons score within each window. We then gradually expanded the distance from the repeat regions or non-repeat regions to the neighboring sequences from 25 bp to 3000 bp and calculated the maximum value of the average phastCons score within the sliding window. In HGMD-RPE, the phastCons conservation scores ranged between 50 bp and 1000bp, and were significantly higher than in all three control datasets for both the aggregate (Fig. 6B and Table S4) and the individual (exons, introns and 5' UTRs) GFRs (Fig. S8). The same trend was observed for the HGMD-RPE-DM dataset (Fig. S9 and Table S5). These rather striking results indicate that pathogenic repeat expansions are characterized by their proximity to conserved functional genomic elements in comparison to non-repeat regions or repeats without known pathogenicity.
Base excision repair (BER) factors in the vicinity of pathogenic repeat expansions
Human genome stability requires efficient base excision repair by several DNA repair pathways, including BER(Bacolla et al. 2021). Therefore, we wondered if pathogenic repeat expansions were enriched in BER factors. To this end, we assessed the location of the binding sites of four BER factors, acetylated NEIL1 (AcNEIL1), POLB, OGG1 and XRCC1. In HGMD-PRE, the pathogenic repeat expansions in the 5’gene GFR were enriched in AcNEIL1 and XRCC occupancy compared to gnomAD-repeats whereas no significant differences were observed between HGMD-PRE and the other control sets (Fig. 7, Table S4). By contrast, AcNEIL1 occupancy at 3'UTR was lower for the HGMD-PRE than for the NR-region; likewise, XRCC1 binding was lower around the pathogenic repeat expansions in HGMD-PRE in 3'gene, 5'UTR and exon compared to the control sets (Fig. 7). For POLB and OGG1, their occupancies in 5'UTR were stronger in the HGMD-PRE than in NR-regions (Fig. 7). Moreover, binding of OGG1 to exon GFRs was higher around the pathogenic repeat expansions in the HGMD-PRE than in gnomAD-repeats. However, POLB binding in intron and promoter GFRs was reduced in the HGMD-PRE. Results of parallel analysis on HGMD-RPE-DM and corresponding controls were exhibited in Fig S10.
Characteristics of repeat expansions differ between different categories of disease
To assess whether specific repeat expansions are associated with particular types of disease, repeat expansion-related diseases were grouped into 31 distinct categories according to Medical Subject Headings (MeSH) terms (Table S10 and Table S11), collapsing into one representative category those cases in which a disease was classified into multiple categories. For each category, we counted the number of disease genes impacted by repeat expansion (Fig. 8A), which revealed that ‘Nervous system diseases’ were the most frequently associated with genes impacted by repeat expansions (68 genes), followed by ‘Inherited susceptibility to neoplasms’ (34 genes), ‘Musculoskeletal diseases’ (33 genes), and ‘Mental disorders’ (22 genes). From each disease category, we further identified one gene that was most frequently associated with diseases in the same disease category (Table S12).
Furthermore, 40 genes in all were found to be associated with diseases from different categories, and Table S13 shows the top 10 genes associated with the largest number of disease categories. The different MeSH categories appeared to be preferentially associated with the expansion of specific types of repeat motif (Fig. 8B and Fig. S11); 14 disease categories were associated with expansion of the DNA repeat motif ‘GT’. Three types of repeat motif, namely CGG, CA and CAG, were found to be associated with 11 disease categories. Other repeat motifs, specifically TA, CTG and AAT, were associated with more than five disease categories.
Many studies have shown that the extent of repeat amplification correlates directly with clinical severity and the age of onset(Gijselinck et al. 2016; Swami et al. 2009). Two important questions arise: (1) Does the number of repeat units vary between different disease categories? (2) Is the average repeat unit length of the pathogenic repeat expansions different between different disease categories? To address the first question, we examined the top nine disease categories associated with the largest number of pathogenic repeat expansions, which showed that the repeat expansions associated with the MeSH disease category ‘Nutritional and metabolic diseases’ exhibit the highest average number of repeat units pre-expansion (Fig. 8C and Supplementary Material). Next, we explored the variation in total repeat length between disease categories. The average length of the repeat motifs associated with repeat expansion disorders in the categories ‘Nervous system diseases’, ‘Mental disorders’, and ‘Neoplasms’ appeared to be longer than for the repeat motifs associated with other categories (Fig. 8D). Repeat expansions within coding regions tend to be located within certain regions of the proteins characterized by specific secondary structures. The pre-expansions encompassing pathogenic repeat expansions in HGMD-RPE associated with ‘Nervous system diseases’, ‘Neoplasms’, ‘Mental disorders’, ‘Nutritional and metabolic disease’, ‘Male urogenital diseases’, and ‘Digestive system diseases’ were more likely to be located within coils (‘C’) than in ‘H’ or ‘E’ (Fig. 8E).
Additional analyses were performed to ascertain the influence of repeat expansions on gene expression. In total, 127 repeat expansions in HGMD-RPE were likely to play a role in modulating gene expression. Of these, repeat expansions causing ‘Nervous system diseases’, ‘Neoplasms’, ‘Cardiovascular diseases’, ‘Nutritional and metabolic diseases’, ‘Male urogenital diseases’ and ‘Susceptibility to infections’ were predicted to exert their effects by increasing gene expression (Fig. 8F). Among them, six repeat expansions causing or associated with musculoskeletal diseases, mental disorders and digestive system diseases tended to up-regulate gene expression by involving multiple genomic architectural features including histone marks, transcription factor binding sites, CTCF binding sites and chromatin accessibility. By contrast, 14 pathogenic repeat expansions causing or associated with musculoskeletal diseases, mental disorders and digestive system diseases tended to be associated with the down-regulation of gene expression. In summary, about half of all pathogenic repeat expansions are predicted to exert their pathological role at least in part by impacting gene expression.
Constructing DPREx for the prediction of pathogenic repeat expansions
Having compiled comprehensive bioinformatics metrics on repeat expansions, we constructed a predictive model (DPREx) of pathogenic repeat expansions. DPREx (http://biomed.nscc-gz.cn/zhaolab/geneprediction/#/) utilized a total of 185 distinct features to characterize the repeat regions from the positive and the negative training datasets, respectively, including chromatin accessibility, histone modification marks from multiple tissues, the binding strength of transcription factors, distance to alternative splice sites, evolutionary conservation, non-B DNA structures, BER factors, R-loops and RNA secondary structure. The DPREx model was trained and evaluated by means of a 5-fold cross-validation (CV) in the training set. When 42 features were used, DPREx performance attained an AUC of 0.875 (Table S14). The contribution of each feature was evaluated by ablation steps (Fig. 9A, Table S14), and was further described by the weight of each feature in the X-Gboost model (Fig. 9B). The results show that the distance of the repeat regions to a splice site is the most important predictor in determining whether a repeat undergoes pathogenic expansion. Additionally, the types of non-B DNA-forming structure, such as minor repeats (MR) and short tandem repeats (STR), are key predictive features for detecting regions of pathogenic potential.
In the training set, DPREx yielded the highest Matthews correlation coefficient (MCC, 0.39) when we coupled with a 0.04 cutoff to determine whether or not a given repeat region was pathogenic (Fig. 9C). This cutoff was used to compare the MCC performance and F1-score performance of DPREx to other two methods (CADD and ReMM) in independent tests. Using independent test set, test set1 (Method), DPREx was compared with two methods, CADD and ReMM, which are commonly used to predict the pathogenicity of single-locus variants, and which we exploited to generate pathogenic scores for the breakpoints of the repeat regions plus 10 bases upstream and downstream. From these data, we selected the maximum score of all bases within window as the representative score for the repeat region. Our method achieved an AUC of 0.87, which outperformed both CADD (AUC 0.66) and ReMM (AUC 0.72) (Fig. 9D). When we used DPREx with a score of 0.04 as a cutoff to determine whether or not a given repeat region was pathogenic, it achieved an F1-score of 0.43 and an MCC of 0.35. For CADD, a score of 10 means that a given variant is among the top 1% most deleterious variants in the human genome (Rentzsch et al. 2019). Using this score 10 as a cutoff, CADD achieved an F1-score of 0.23 and an MCC of 0.17. ReMM outputs mutation scores ranging from 0 (non-deleterious) to 1 (deleterious). By using a score of 0.5 as a threshold for determining whether or not a mutation was deleterious, ReMM achieved an F1-Measure 0.22, and an MCC of 0.20 (Table 2).
Table 2
Comparing DPREx with CADD and ReMM using two independent datasets.
|
Test set1
|
Test set2
|
Methods
|
DPRExa
|
CADDb
|
ReMMc
|
DPREx
|
CADD
|
ReMM
|
AUC
|
0.89
|
0.69
|
0.70
|
0.82
|
0.72
|
0.75
|
MCC
|
0.35
|
0.17
|
0.20
|
0.45
|
0.25
|
0.21
|
F1-score
|
0.33
|
0.23
|
0.22
|
0.51
|
0.39
|
0.37
|
a Performance of DPREx with respect to MCC, and F1-score measured by means of a DPREx score threshold 0.0705 which yielded the highest MCC in the training set; b Performance of CADD, with respect to MCC and F1-score using a CADD score threshold of 10 which means a given variant is amongst the top 1% of deleterious variants in the human genome; c Performance of ReMM, with respect to MCC and F1-score using an ReMM score 0.5 as a threshold. |
DPREx was further compared with CADD and ReMM by employing a second independent dataset, test set2 (Method). In this case, DPREx achieved an AUC of 0.78, which was higher than those achieved by CADD (0.61) and ReMM (0.62) (Fig. 9E). When we used the cutoffs of 10 for CADD, 0.5 for ReMM, and 0.04 for DPREx to predict pathogenic repeat regions, CADD, ReMM and DPREx achieved F1-scores of 0.36, 0.39 and 0.53, and MCC values of 0.17, 0.21 and 0.48, respectively. Thus, DPREx provides a robust and competitive method for predicting pathogenic repeat expansions based upon the integrated data and model from the human genome.