Profiling human pathogenic repeat expansion regions by synergistic and multi-level impacts on molecular connections

doi:10.21203/rs.3.rs-1922350/v1

Research Article

This work is licensed under a CC BY 4.0 License

Background and Motivation: Whilst DNA repeat expansions cause numerous heritable human disorders, their origins and underlying pathological mechanisms are often unclear.

Method: We collated a dataset comprising 224 human repeat expansions encompassing 203 different genes, and performed a systematic analysis with respect to key features at the DNA, RNA and protein levels. Comparison with controls without known pathogenicity and with genomic regions lacking repeats allowed the construction of the first model to discriminate repeat regions harboring pathogenic repeat expansions (DPREx).

Results: At the DNA level, pathogenic repeat expansions exhibited stronger signals for DNA regulatory factors (e.g. H3K4me3, transcription factor-binding sites) than controls in exons, promoters, 5’UTRs and 5’genes, but did not differ significantly from controls in introns, 3’UTRs and 3’genes. At the RNA level, pathogenic repeat expansions showed lower free energy for forming RNA secondary structure and were closer to splice sites in introns, exons, promoters and 5’genes than controls. At the protein level, pathogenic repeat expansions preferentially formed coil rather than other types of secondary structure, and tended to encode surface-located protein domains. Additionally, pathogenic repeat expansions were enriched in non-B DNA structures. Guided by these features, DPREx (http://biomed.nscc-gz.cn/zhaolab/geneprediction/#/) achieved an Area Under the Curve (AUC) value of 0.88 in an independent dataset test.

Conclusion: Pathogenic repeat expansions are located so as to exert a synergistic, multi-level influence on stress responses and inter-molecular connections involving DNA, RNA and proteins, thereby impacting the relationship between genotype and clinical phenotype.

repeat expansions

DNA structure

inherited disorders

genome instability

DNA repair

genome evolution

RNA splicing

Repeat expansions were first described as a cause of human genetic disease more than 30 years ago (Depienne and Mandel 2021). In 1991, two X-linked genes, responsible respectively for fragile X syndrome and X-linked spinal and bulbar muscular atrophy, were cloned. Further characterization of the loci revealed that in both cases the underlying pathological mutation was the expansion of a pre-existing triplet repeat. Thus, fragile X syndrome results from the expansion of CGG repeats in the 5'-untranslated region (UTR) of the FMR1 gene, whereas X-linked spinal and bulbar muscular atrophy is caused by the expansion of a glutamine-encoding CAG repeat within the coding region of the androgen receptor (AR) gene(Madeira et al. 2018). These initial findings have since been followed by the discovery of many other examples of repeat expansion causing different types of inherited disease. Although these expanded repeats vary in terms of the sequences involved, the genes which harbour them, their intragenic locations and the extent of the expansions, in general the greater the size of the expansion, the more severe are the clinical symptoms and the earlier the age of onset of disease(Figueroa et al. 2009; Paulson 2018; Sun et al. 2011). Further, whilst the copy numbers of the expanded repeat loci are invariably polymorphic in the clinically asymptomatic general population, the repeat copy number must expand beyond a certain threshold for it to exert its pathological effects.

In addition to triplet repeats, numerous repeats of larger unit size have been found to contribute to disease. One of the largest pathogenic repeat expansions, with a repeat motif of length 86 base-pairs, has been noted in an intron of the IL1RN gene and is reported to be a cause of lichen sclerosus(Clay et al. 1994). At the other end of the scale, dinucleotide repeat expansions, such as that of (GT)_n (n ranges from 1 to 6) in the 5'UTR of the DBH gene, or (GT)_n (n ranges from 14 to 18) in the promoter of the FOXP3 gene, are associated with attention deficit disorder(Roman et al. 2002) and type 1 diabetes (Bassuny et al. 2003), respectively. Since the repeats are not only subject to expansion as they are transmitted to the next generation but can also be somatically unstable, they may give rise to markedly different clinical phenotypes among affected individuals from the same family(Paulson 2018). Not surprisingly, repeat expansions involving different repeat lengths and/or located within different gene regions can result in quite different consequences for gene function.

Thirty years on, there are many additional examples of pathogenic expanded repeats causing or associated with different human genetic diseases including nervous system disease, neoplasms, musculoskeletal diseases, mental disorders, cardiovascular disease, nutritional and metabolic disease, digestive system disease, infection, among others(Den Dunnen 2017; Mirkin 2007; Rodriguez and Todd 2019; Schmidt and Pearson 2016). A CAG repeat in the coding region of the Huntington disease gene, HTT, is expanded from 36 to more than 120 copies thereby greatly lengthening a polyglutamine tract within the Huntingtin protein(Gatto et al. 2020; Tabrizi et al. 2020). The expansion of a hexanucleotide (GGGGCC) repeat in an intronic region of C9orf72 is responsible for the most common cause of amyotrophic lateral sclerosis (ALS) and frontotemporal dementia (FTD)(Balendra and Isaacs 2018; Freibaum and Taylor 2017; Wen et al. 2014). Amplification of a CTG triplet repeat in the 3'UTR region of the DMPK gene causes myotonic dystrophy type 1 (DM1), a multi-system neuromuscular disease that affects both skeletal and smooth muscle, the central nervous system, heart, endocrine system and the eye(Bird 1993; Lanni and Pearson 2019; Mahadevan et al. 1992; Santoro et al. 2017). On the other hand, myotonic dystrophy type 2 (DM2), which commonly presents as proximal muscular weakness, is caused by the dramatic expansion of a CCTG repeat in the first intron of the ZNF9 gene, from ~ 75 to as many as 11,000 repeat copies(Liquori et al. 2001; Paulson 2018). Repeat expansions have also been found to be associated with cancer predisposition; thus, for example, a variable number of tandem repeat (VNTR) expansion in the 5'flanking region of the SMYD3 gene has been reported to increase susceptibility to multiple cancers including colorectal cancer, hepatocellular carcinoma and breast cancer(Tsuge et al. 2005).

The precise mechanisms by which certain repeats become expanded so as to cross a specific threshold into the pathological range are still unclear(Flower and Tabrizi 2020). Previous studies have shown that repeat expansions can arise through DNA slippage, and there is a broad consensus that DNA secondary structures may contribute to repeat expansion(Schmidt and Pearson 2016; Xu et al. 2020). Expanded repeats increase the flexibility of double helices and can give rise to the formation of alternative structures such as hairpin-loops, triplexes, left-handed Z-DNA and G-quadruplexes, collectively known as non-canonical (non-B) DNA(Grishchenko et al. 2020). One example is the CGG repeat which can form quadruplexes mediated by Hoogsteen hydrogen bond formation between guanines(Eddy and Maizels 2008).

With regard to how repeat expansions are triggered, both transgenic mouse models and genome-wide association studies point to a critical requirement for components of DNA repair pathways, including the mismatch repair system (MSH2, MSH3, MLH1 and PMS2) binding to hairpin-loops and possibly to other non-canonical DNA conformations (Khristich and Mirkin 2020; Schmidt and Pearson 2016) and base excision repair (Lai et al. 2020; Liu and Wilson 2012). Components of the Fanconi anemia pathway, such as FAN1 (FANCD2/FANCI-associated nuclease 1), have also emerged as modifiers of repeat expansion by protecting against instability, possibly by acting upon replication forks stalled at DNA damage intermediates, such as N6-furfuryladenine (N6FFA) (Maiuri et al. 2019). R-loops have also been proposed to play important roles in both the expansion and contraction of pathogenic repeats (Groh et al. 2014; Laverde et al. 2020; Neil et al. 2018). R-loops are structures which comprise an RNA–DNA hybrid and a displaced single-stranded DNA molecule. Sequences that are capable of forming R-loops can be readily predicted (Dettori et al. 2021; Freudenreich 2018; Mackay et al. 2020; Santos-Pereira and Aguilera 2015), and they appear to be quite abundant in the human genome (Kuznetsov et al. 2018). R-loops have the potential to compromise genome stability (Niehrs and Luke 2020; Santos-Pereira and Aguilera 2015), are present within many protein-coding genes (Jenjaroenpun et al. 2017) but are also preferentially associated with promoters, enhancers and non-B DNA structures (Kuznetsov et al. 2018). Many disease-causing repeat expansions have been found to be capable of forming stable R-loops which can contribute to repeat instability (Freudenreich 2018; Su and Freudenreich 2017). In some cases, repeat expansions have been found to impact chromatin remodelling (Hannan 2018) or perturb transcription factor regulation. Unsurprisingly in this context, expanded repeats can harbour transcription factor binding motifs (Avsec et al. 2021; Polak and Domany 2006). Additionally, a number of unstable repeat loci have been found to be associated with CTCF-binding sites (Libby et al. 2008; Phillips and Corces 2009; Schmidt et al. 2012).

Until comparatively recently, it was generally assumed that any given repeat expansion exerts its pathogenic effects via a single underlying mechanism, but this is likely to be a gross over-simplification. For example, an expanded polyglutamine tract in the Huntingtin protein alters the function of the protein, but the underlying repeat expansion may also alter mRNA structure or function (Hefferon et al. 2004; Hui et al. 2005; Malik et al. 2021). There is also evidence that HTT mRNA harbouring an expanded CAG repeat is toxic and that it alters the transcription of other genes (Malla et al. 2021; Tabrizi et al. 2020). Pathogenic repeat expansions may promote hairpin formation at the RNA level (Krzyzosiak et al. 2012; Mooers et al. 2005; Sobczak et al. 2003), or can give rise to extremely stable uni- and multi-molecular parallel G-quadruplex structures (Conlon et al. 2016; Reddy et al. 2013), thereby potentially impacting RNA splicing and/or gene expression on multiple levels. This notwithstanding, no study has hitherto attempted to bring together the large number of examples of pathogenic repeat expansion that have been reported to date with a view to investigating the mechanisms of expansion and modes of action at different stages of the gene expression pathway, and exploring the commonalities and differences between them.

Owing to the limited number of known examples of pathogenic repeat expansion, there are few methods with which to predict the pathogenicity of expanded repeats. Usually, researchers have used a repeat expansion cutoff to define pathogenic repeats, and have considered the disease risk cutoffs to be close to or exceeding 100 to 150 bases (Mitsuhashi and Matsumoto 2020; Tang et al. 2017). By detecting repeat expansions with greater length variation than other repeats (Mitsuhashi et al. 2021), tandem-genotypes (Mitsuhashi et al. 2019) identified certain types of disease-causing tandem repeats (TRs). Recently, STRetch was designed to detect pathogenic TR expansions using short-read whole-genome sequencing data at known pathogenic loci (Dashnow et al. 2018). However, most approaches have used a length cutoff to determine pathogenic TR expansions, and have tended to overlook TR expansions located in different genomic regions with different functions (Dolzhenko et al. 2020; Dolzhenko et al. 2017; Tankard et al. 2018). Although many methods, such as Combined Annotation Dependent Depletion (CADD) (Rentzsch et al.) and Regulatory Mendelian Mutation (ReMM) (Smedley et al.), have been developed to predict the pathogenicity of mutations, there are no tools available to specifically predict the functional impact of repeat expansions. Such a tool is needed to recognize disease-associated TR expansions according not only to the length but also the location and sequence content of the TR expansions.

Here, we have used HGMD (Stenson et al. 2020) to collect known pathogenic repeat expansions with a view to performing a systematic analysis (Fig. 1). Our study comprised 224 loci (excluding 37 repeat expansions that influence transcription but which are not known to cause disease) in 203 different human genes associated with 280 inherited conditions. The set of 224 repeat expansions was termed the ‘reference sequence of loci with pathogenic repeat expansions’ (HGMD-RPE). To explore the mechanisms underlying pathogenic repeat expansions, we methodically analyzed the repetitive genomic regions harbouring pathogenic repeat expansions at the DNA, RNA and protein levels with respect to multiple features, including CTCF binding sites, histone modification marks, chromatin accessibility, transcription factor binding sites, evolutionary conservation, non-B DNA structures, R-loops, RNA secondary structure, encoded protein structure, and genome-wide coverage of DNA base excision repair factors. The characteristics of these pathogenic repeat expansions were then compared with those of 2,149 repeat expansions with no known pathogenicity, 1,120 repeat loci without expansions and 672 genomic regions lacking repeats. We then performed an identical analysis on the subset of HGMD-listed repeat expansions that were deemed to be either disease-causing (DM) or likely disease-causing (DM?) (HGMD-RPE-DM). Finally, we established a model to discriminate pathogenic repeat expansions from repeat expansions with no known pathogenicity (DPREx). This model was compared with existing methods designed for predicting the pathogenicity of point mutations, and was further tested on independent datasets. A workflow for characterizing the mechanisms by which pathogenic repeat expansions form, and for discriminating them from non-pathogenic expansions, is shown in Fig. 1.

Collation of known DNA repeat expansions from the published literature

Known examples of pathogenic repeat expansion have been collated by HGMD (Stenson et al. 2020). HGMD specifies the gene involved and the gene functional region (i.e. promoter, exon, intron, 5'UTR etc.) where the repeat expansions are located, albeit without providing exact genomic coordinates. In order to identify the precise locations of these mutations unambiguously, we manually located the repeat expansions within the hg38 reference sequence, employing where necessary additional information provided in the literature. Since most of the genomic and epigenomic data used in this study were only available in hg19, we employed CrossMap.py and the chain file (https://hgdownload.cse.ucsc.edu/goldenpath/hg38/liftOver/hg38ToHg19.over.chain.gz) provided by UCSC to convert GRCh38 coordinates to hg19. In total, 224 disease-causing or disease-associated repeat expansions were identified in HGMD; these were located in a total of 203 genes. An additional 37 repeat expansions impacted transcription but had no known associated pathology and were not investigated further in this study (Table S1). From HGMD-RPE, we extracted a subset of 73 repeat expansions that were annotated as disease-causing (DM) or likely disease-causing (DM?) by HGMD. This set of repeat expansions was termed HGMD-RPE-DM.

All duplicate repeats on the reverse strand were removed from the dataset prior to analysis; the precise details of how they were removed are given in the Supplementary material.

Annotating the pathogenic repeat regions in the human reference genome by means of the Tandem Repeats Finder (TRF) program

We employed the Tandem Repeats Finder (TRF) program (v4.09.1) (Benson 1999) to annotate the pathogenic repeat regions with respect to the human reference genome hg19. TRF locates and displays tandem repeats in DNA sequences and reports the location of each repeat, the size and copy number of the repeat motif, and its nucleotide content. The GRCh37 reference genome was downloaded from GENCODE (Frankish et al. 2019). We adopted most of the recommended TRF parameters (https://tandem.bu.edu/trf/trf.unix.help.html), but reduced the maximum motif length searched from 500 bp to 100 bp, since the motif lengths of all the expanded repeat units in HGMD were below this threshold. The same threshold was applied to the control datasets.

Constructing control datasets of human genomic repeats and repeat expansions with no known pathogenicity

The first control dataset, TRF-repeats, was constructed so as to include repeat loci from the human genome that are neither expanded nor known to be associated with pathogenicity. A second dataset, gnomAD-repeats, was constructed so as to include repeat expansions that occur in the healthy population but which are not known to be associated with any pathogenicity. The statistical significance of the differences observed between HGMD-RPE and the controls was evaluated by a one-sided Fisher’s Exact test.
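As an illustration of the kind of test described above, the following is a minimal sketch of a one-sided Fisher's exact test on a 2x2 table, written with the standard library only. The function name, the choice of the "greater" (enrichment) tail and the counts in the usage note are assumptions made for illustration; they are not taken from the study.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins,
    i.e. the probability of seeing at least `a` loci with the feature in
    the first group if the feature were unrelated to group membership.
    """
    row1 = a + b          # size of the first group (e.g. pathogenic loci)
    col1 = a + c          # loci carrying the feature, in both groups
    total = a + b + c + d
    upper = min(row1, col1)
    # Count the tables at least as extreme as the observed one.
    favourable = sum(
        comb(col1, x) * comb(total - col1, row1 - x)
        for x in range(a, upper + 1)
    )
    return favourable / comb(total, row1)
```

For example, with hypothetical counts of 3 of 4 case loci and 1 of 4 control loci carrying a feature, `fisher_exact_greater(3, 1, 1, 3)` evaluates to 17/70, roughly 0.24.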

TRF-annotated repeats (TRF-repeats) dataset: We constructed a control dataset of repeats, termed ‘TRF-repeats’, which comprised randomly sampled repeat loci from the hg19 reference genome, annotated by means of the TRF program. This dataset was characterized by reference to four specific features of the repeats, namely (a) the genomic functional region (GFR) involved; (b) the sequence similarity of the repeat motif (Table S1); (c) the length of each repeat motif; and (d) the length of the unexpanded repetitive region in the human genome assembly GRCh37 (hg19) from the Genome Reference Consortium. First, we ensured that comparable proportions of the TRF-repeats and HGMD-RPE were located within the different GFRs, specifically exons, 3'UTRs, 5'UTRs, 3'genes, 5'genes, introns and promoters. The GFRs harbouring these expansions were annotated by ANNOVAR using hg19_ensGene.txt as the reference gene annotation (Wang et al. 2010). Briefly, the term ‘promoter’, as used here, refers to the region between the transcription start site (TSS) of a gene and 2000 bp upstream. The ‘5'gene’ region was defined as the region immediately upstream of the gene promoter. The ‘5'UTR’ refers to the 5'-untranslated region, which is the exonic region lying between the TSS and the translational start codon. Although the UTRs are usually considered to be parts of the exonic region, the term ‘exons’ is used here to refer exclusively to the coding sequences (CDS) within the exons. By contrast, the introns constitute the non-coding regions located between adjacent exons (including those specifying the UTR regions). The ‘3'UTR’ refers to the 3'-untranslated region, which extends from immediately downstream of the stop codon to the end of the last exon. Finally, the ‘3'gene’ constitutes the region downstream of the last exon (i.e. beyond the polyadenylation site). The details of how the 7 GFRs were annotated are described in the Supplementary material.
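The promoter definition given above depends on strand, which a short sketch can make concrete. The function names, the 1-based inclusive coordinate convention and the exclusion of the TSS base itself are assumptions made for illustration; the annotation in this study was performed with ANNOVAR, not with this code.

```python
def promoter_interval(tss, strand, upstream=2000):
    """Return the (start, end) promoter interval, 1-based and inclusive.

    Assumption: the promoter is the `upstream` bases immediately 5' of the
    TSS, excluding the TSS base itself.
    """
    if strand == "+":
        # Upstream means smaller coordinates on the forward strand.
        return max(1, tss - upstream), tss - 1
    if strand == "-":
        # Upstream means larger coordinates on the reverse strand.
        return tss + 1, tss + upstream
    raise ValueError("strand must be '+' or '-'")


def in_promoter(position, tss, strand, upstream=2000):
    """True if `position` falls inside the promoter of a gene with this TSS."""
    start, end = promoter_interval(tss, strand, upstream)
    return start <= position <= end
```

Under these assumptions, a forward-strand gene with its TSS at position 10,000 has a promoter spanning 8,000 to 9,999, whereas the same TSS on the reverse strand gives 10,001 to 12,000.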

Then, we sampled from the TRF-annotated repeat loci such that their proportions matched those of the HGMD-RPE dataset in terms of repeat motif length and overall size. The resulting TRF-repeats control dataset comprised 1,120 repeat loci, five times the size of the HGMD-RPE dataset. Comparisons of the repeat lengths and repeat motif lengths of TRF-repeats and HGMD-RPE are shown in Fig. S1. For HGMD-RPE-DM, a corresponding control dataset of TRF-repeats (termed ‘TRF-repeats-DM’) was constructed, which included 317 repeat regions. For the purposes of analysis, the 3'gene and 3'UTR GFRs were combined as “3'gene_3'UTR”. The sample sizes of the datasets pertaining to each GFR are given in Table S2.

gnomAD-derived repeats (gnomAD-repeats) dataset

The second control dataset was derived from the whole genome sequencing (WGS) data of gnomAD v2(Karczewski et al. 2020). A gross insertion in gnomAD within a repetitive sequence locus was considered to represent a repeat expansion if the inserted sequence (a) exhibited at least 85% sequence homology with the repeat locus and (b) was at least twice as long as the original repeat motif length. The repeat loci were identified by TRF. Repeat loci already present in the HGMD-RPE were removed from the gnomAD-repeats control dataset. Any expansions occurring within intergenic regions or non-coding RNA (ncRNA) were removed from the gnomAD-repeats dataset since all known pathogenic HGMD-RPE reside within or very close to protein-coding genes. In total, 17,871 repeat loci were extracted from gnomAD and were considered to represent repeat expansions. These repeat loci were then grouped into 7 distinct GFRs, i.e. exon, 3'UTR, 5'UTR, 3'gene, 5'gene, intron and promoter.
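The two criteria for calling a gnomAD insertion a repeat expansion can be sketched as a simple filter. This is only an illustration: the way sequence homology was actually measured is not specified here, so the sketch substitutes a position-wise identity against a perfect tandem array of the motif, tried in every phase. The function names and this identity measure are assumptions.

```python
def best_motif_identity(inserted, motif):
    """Highest fraction of positions at which `inserted` matches a perfect
    tandem array of `motif`, tried in every phase (rotation) of the motif."""
    inserted, motif = inserted.upper(), motif.upper()
    if not inserted or not motif:
        return 0.0
    best = 0.0
    for shift in range(len(motif)):
        rotated = motif[shift:] + motif[:shift]
        matches = sum(
            base == rotated[i % len(rotated)]
            for i, base in enumerate(inserted)
        )
        best = max(best, matches / len(inserted))
    return best


def is_repeat_expansion(inserted, motif, min_identity=0.85, min_copies=2):
    """Apply both criteria: (a) identity of at least 85% (assumed measure)
    and (b) an insertion at least twice as long as the repeat motif."""
    long_enough = len(inserted) >= min_copies * len(motif)
    return long_enough and best_motif_identity(inserted, motif) >= min_identity
```

With these definitions, an insertion of `CAGCAGCAG` at a `CAG` locus passes both criteria, whereas a single inserted `CAG` fails the length criterion.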

From each GFR, we randomly extracted repeat loci so as to ensure that, within that GFR, the distributions of both the entire length of the repeat region and the repeat motif length in the gnomAD-repeats dataset were similar to those of the repeat loci in the HGMD-RPE dataset. In total, 2,149 repeat loci from gnomAD were included in the gnomAD-repeats dataset. The number of repeat loci from the gnomAD-repeats dataset in relation to each GFR is shown in Table S2. The detailed approaches employed to generate the gnomAD-repeats dataset are described in the Supplementary Material. The size distributions of the repeat loci in each GFR are given in Fig. S1. Since the individual participants in gnomAD were all apparently healthy, lacking any overt disease, the gnomAD-repeats represent repeat expansions with no known pathogenicity and may therefore be used as a control dataset. For exploration of the HGMD-RPE-DM dataset, a corresponding control dataset from gnomAD (‘gnomAD-repeats-DM’) was constructed, which included 635 repeat expansions that mirrored the repeat length distribution of HGMD-RPE-DM in each GFR. In the gnomAD-repeats-DM dataset, the repeat expansions in the 3'gene and 3'UTR regions were combined into one group, termed 3'gene, to render the number of repeats meaningful for the purposes of statistical testing. Details of the sample sizes of the gnomAD-repeats and gnomAD-repeats-DM datasets in relation to each GFR are given in Table S2.

The number and proportion of loci with each type of repeat motif in the HGMD-RPE and corresponding control datasets (TRF-repeats and gnomAD-repeats) are shown in Table S3. The G/C content (fraction of G or C in the repeat motif) is given in Fig. S2.

Constructing a control dataset of non-repeat loci

An ‘NR-regions control dataset’ comprising non-repeat loci was also constructed.

Genomic regions that were not marked as repeat loci by TRF were defined as non-repeat loci; that is, the non-repeat loci corresponded to those regions of the hg19 reference genome remaining after exclusion of the TRF-annotated repeat loci. In total, 7,794,452 non-repeat loci were identified within those genes associated with repeat regions annotated by TRF. From these non-repeat loci, we randomly sampled, for each HGMD-RPE sequence, three sequences located within the same functional region and of the same size. The NR-regions dataset therefore comprised 672 non-repeat sequences that matched the HGMD-RPE in terms of their location (genomic functional region, GFR) and size but did not involve the same genes as the HGMD-RPE (Fig. S1). We did not sample the non-repeat loci from within the same genes as the HGMD-RPE because this would have considerably reduced the number of non-repeat loci and made it much harder to obtain non-repetitive loci within identical functional regions and with the same sizes as the HGMD-RPE sequences. The size distribution of the 672 non-repeat sequences is given in Fig. S1 and Table S2. From the NR-regions dataset, we further extracted 266 non-repeat samples (NR-regions-DM) matching the HGMD-RPE-DM in terms of their location (genomic functional region, GFR) and size (Table S2).
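The matched sampling described above (three non-repeat sequences per HGMD-RPE sequence, giving 3 × 224 = 672) can be sketched as follows. The record layout (dictionaries with `gene`, `gfr` and `size` keys), the function name, exact size matching and the fixed random seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sample_matched_controls(cases, candidates, per_case=3, seed=0):
    """For each case locus, draw `per_case` candidate loci with the same GFR
    and size, excluding candidates located in any gene of the case set."""
    rng = random.Random(seed)
    case_genes = {case["gene"] for case in cases}

    # Index eligible candidates by the properties that must match.
    pool = defaultdict(list)
    for cand in candidates:
        if cand["gene"] not in case_genes:
            pool[(cand["gfr"], cand["size"])].append(cand)

    selected = []
    for case in cases:
        key = (case["gfr"], case["size"])
        if len(pool[key]) < per_case:
            raise ValueError(f"not enough matched candidates for {key}")
        picks = rng.sample(pool[key], per_case)
        # Sample without replacement across cases as well as within them.
        pool[key] = [c for c in pool[key] if c not in picks]
        selected.extend(picks)
    return selected
```

In practice a size tolerance rather than exact equality may be needed when few candidates have exactly the required length; that choice is left open in this sketch.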

CTCF binding sites

CCCTC-binding factor (CTCF) is a chromosomal architectural protein that plays a key role in mediating long-range chromatin interactions (Ong and Corces 2014; Rao et al. 2014; Tang et al. 2015), and CTCF binding sites are indicative of the genomic loci most likely to be involved in the anchorage of chromatin loops. We obtained CTCF binding sites from chromatin immunoprecipitation (ChIP) assays in the H1 human embryonic stem cell line (Consortium 2011). The CTCF binding sites identified by ChIP assays were downloaded in the narrowPeak format (https://genome.ucsc.edu/FAQ/FAQformat.html#format12) from ENCODE (https://www.encodeproject.org/, accession ID: ENCFF618DDO). We adopted the concept of “peak regulatory-potential (peak-RP)” from Qin et al. (2020) to represent the weighted CTCF binding counts around a locus of interest (LOI) (Fig. S3). Peak-RP was defined as \(\sum _{i}{w}_{i}\), where \({w}_{i}\) is the weight of each peak around the LOI:

\({w}_{i}=\frac{2\cdot \text{exp}\left(-\mu \times d\right)}{1+\text{exp}\left(-\mu \times d\right)}\), \(\mu =\frac{\text{ln}\left(3\right)}{L}\)

Here, \(d\) is the distance from the peak to the LOI (Fig. S3) and \(L\) is a parameter controlling the decay rate, which was set at 1,000 bp. Peaks distant from the LOI contribute less to its peak-RP score than peaks close to it. The CTCF peak-RP score represents the strength of the CTCF binding signals within and around the LOI, and it was employed here as a proxy for the likelihood that a given locus may be involved in long-range chromatin interactions.
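The peak-RP score can be sketched in a few lines. This sketch takes the decay constant to be μ = ln(3)/L, the reading under which weights fall with distance and a peak at distance L from the LOI receives half the weight of a peak at the LOI. The function names and the treatment of each peak and of the LOI as single positions are assumptions made for illustration.

```python
import math


def peak_weight(distance, decay_length=1000.0):
    """Weight of one peak at `distance` bp from the locus of interest.

    Assumes mu = ln(3) / L, so that the weight is 1 at distance 0 and
    0.5 at distance L.
    """
    mu = math.log(3.0) / decay_length
    e = math.exp(-mu * abs(distance))
    return 2.0 * e / (1.0 + e)


def peak_rp(peak_positions, loi_position, decay_length=1000.0):
    """Peak-RP: sum of the distance-decayed weights of all nearby peaks."""
    return sum(
        peak_weight(pos - loi_position, decay_length)
        for pos in peak_positions
    )
```

For instance, with L = 1,000 bp, two peaks located exactly at the LOI give a score of 2.0, while a single peak 1,000 bp away contributes 0.5.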

Epigenetic signal analysis of chromatin accessibility and histone modification marks

Chromatin accessibility reflects the degree to which a given genomic locus contains accessible regulatory elements(Minnoye et al. 2021). The ENCODE project provides chromatin accessibility data as measured by DNaseI hypersensitive site sequencing (DNase-seq). We used the DNaseI data, derived from the brain tissue of a human female embryo (105 days) in bigWig format (Accession ID: ENCFF585XKE).

Histone modifications constitute one of the mechanisms by which gene expression is regulated. We analysed histone modification marks from brain tissue, as measured by the histone ChIP-seq technique, from the ENCODE project (Davis et al. 2018). The P-value (\(-{\text{log}}_{10}P\)) tracks of the ChIP-seq signals in bigWig format were used since they usually have a higher signal-to-noise ratio than other types of tracks, such as fold-change over control (Schreiber et al. 2020). In all, we compared the average signals of five histone modification marks (H3K4me1, H3K4me3, H3K9me3, H3K27me3, and H3K36me3) in loci that harbour pathogenic repeats versus those in loci that harbour non-pathogenic repeats. H3K4me1 is an enhancer mark (Heintzman et al. 2007; Kundaje et al. 2015) whilst H3K4me3 is a histone mark usually associated with activity at transcription start sites (TSS) and active promoters (Kundaje et al. 2015). The H3K9me3 histone modification is usually associated with heterochromatin (Becker et al. 2016; Kundaje et al. 2015; Peters et al. 2003) whereas the H3K27me3 mark is associated with polycomb-mediated repression of expression (Bonasio et al. 2010; Kundaje et al. 2015). Recently, H3K27me3-enriched regions have been found to act as silencers of gene expression via chromatin folding (Cai et al. 2021). Finally, the H3K36me3 mark has been found to be enriched in exonic regions as compared to intronic regions, and is closely associated with active transcription (Sims and Reinberg 2009).

Annotating potential transcription factor binding sites within the expanded repeats

In addition to chromatin accessibility analysis, we further annotated potential transcription factor (TF) binding sites within TRs. The FIMO program (Grant et al. 2011) was used to scan the DNA sequences of the repeats against TF motifs from the JASPAR database (JASPAR2020_CORE_vertebrates_non-redundant_pfms_meme.txt) (Fornes et al. 2019). Since FIMO requires a minimum length of input DNA sequence of 24 bp, we extended each input DNA sequence by including 12 bp of flanking sequence at both ends to avoid sequences of less than 24 bp being discarded by FIMO. Subsequently, we excluded TF binding sites (TFBSs) that overlapped with less than half the length of the repeat regions. To further analyse the FIMO-predicted TFBSs in different cell types, we combined the FIMO predictions with the chromatin accessibility data. Specifically, we averaged the DNaseI signals of all the potential TFBSs in each region to determine to what extent each region was likely to be bound by TFs in a specific cell type.

Evolutionary conservation analysis

The evolutionary conservation of repeat regions was annotated by means of phastCons scores, which were obtained from the UCSC genome browser [https://genome.ucsc.edu] and calculated by multiple alignment from 99 vertebrate species. We calculated the average phastCons scores for the repeat or non-repeat loci in both the HGMD-RPE and control datasets.

Evaluating the potential impact of repeats on RNA splicing

Human splice sites were extracted from GENCODE (GRCh37, version 37) and were defined as actual exon-intron junctions. We first calculated the distance between repeats and their nearest potential splice sites. In addition, we used MMSplice scores obtained from CADD to evaluate the potential impact of repeat regions on RNA splicing(Cheng et al. 2019). For the sake of simplicity, we took the sum of the absolute values of the predicted MMSplice score for each site to measure the potential of the region to affect RNA splicing. We used a sliding window of 20 bp to identify the region with the largest average MMSplice score within and flanking the repeats, respectively. When a repeat region was less than 20 bp in length, the sliding window was adjusted to be the same length as the repeats. More detailed information on the evaluation of RNA splicing potential is to be found in the Supplementary Material.
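The sliding-window step described above can be sketched as follows, taking as input one score per site. The function name and the input format (a plain list of per-site scores) are assumptions for illustration; the sketch takes absolute values itself, and shrinks the window to the region length when the region is shorter than 20 bp.

```python
def best_window(scores, window=20):
    """Return (start_index, mean_abs_score) of the window with the largest
    mean absolute score. If the region is shorter than `window`, the window
    is shrunk to the length of the region."""
    if not scores:
        raise ValueError("scores must not be empty")
    abs_scores = [abs(s) for s in scores]
    width = min(window, len(abs_scores))

    # Running sum: add the base entering the window, drop the one leaving.
    current = sum(abs_scores[:width])
    best_sum, best_start = current, 0
    for start in range(1, len(abs_scores) - width + 1):
        current += abs_scores[start + width - 1] - abs_scores[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start, best_sum / width
```

For example, `best_window([0, 0, 5, -5, 0, 0], window=2)` returns `(2, 5.0)`, and a two-site region such as `[1, -3]` is scored with a window of width 2.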

Annotation of non-B structures

Non-B DNA structures have been previously shown to play important roles in repeat expansion(Guo et al. 2017; Orr and Zoghbi 2007). The predicted non-B structure of DNA repeats was retrieved from the non-B DNA database (version 2.0, https://ncifrederick.cancer.gov/bids/ftp/?nonb)(Cer et al. 2013b). This database contains seven types of non-B structure i.e. Z-DNA, G-quadruplex (GQ), A-phased repeats, inverted repeats (IR), mirror repeats (MR), direct repeats (DR) and short tandem repeats (STR). The overlaps between repeat loci and non-B DNA-forming regions were determined using bedtools intersect(Quinlan and Hall 2010b).

Annotation of protein secondary structure

To evaluate the influence of the repeat expansions on protein secondary structure, we applied SPIDER3(Heffernan et al. 2017) to predict the secondary structure of the protein residues encoded by the repeat regions associated with pathogenic repeat expansions as well as the secondary structure of the protein residues encoded by the non-repeat regions. We then calculated the proportions of each type of secondary structure [i.e. alpha-helix (H), coil (C), and beta-strand (E)] in both the repeat regions associated with the pathogenic repeat expansions and the non-repeat regions. Further information on how the repeat expansions in the coding regions were characterized at the protein level is to be found in the Supplementary Material.
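Computing the proportions of the three secondary-structure states is straightforward once a per-residue prediction is available. The sketch below assumes the prediction has been reduced to a string with one letter per residue (H, C or E); this input format and the function name are illustrative assumptions, not a description of SPIDER3 output files.

```python
from collections import Counter


def secondary_structure_fractions(ss_string):
    """Fraction of residues predicted as helix (H), coil (C) and strand (E).

    Characters other than H, C and E are ignored.
    """
    states = "HCE"
    counts = Counter(ss_string.upper())
    total = sum(counts[state] for state in states)
    if total == 0:
        raise ValueError("no H, C or E states found")
    return {state: counts[state] / total for state in states}
```

For a hypothetical eight-residue prediction `CCCCHHEE`, the function returns 0.5 coil, 0.25 helix and 0.25 strand.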

Annotation of RNA secondary structure

For every gene harbouring a pathogenic repeat, we explored the RNA secondary structures flanking the pathogenic repeat (100 bp upstream and downstream of the repeats). RNA secondary structures were characterized with regard to two features, viz. minimum free energy (FE) and reactivity (RCT)(Choudhary et al. 2019; Smola et al. 2015; Wan et al. 2014). First, we extracted pre-mRNA sequences from the HGMD-RPE and control datasets using bedtools(Quinlan and Hall 2010b), and then used the RNAfold tool in ViennaRNA(Lorenz et al. 2011) to compute their free energy (FE). The RCT of the RNA secondary structure was determined for each nucleotide by HF-GRASS(Ke et al. 2020).

Annotation of R-loops

We evaluated the spatial coincidence of R-loops and repeat expansions. The locations of the R-loops were downloaded from R-loopDB, a database that reports all R-loop forming sequences (RLFS) identified within over half of all human genes(Wongsurawat et al. 2012).

Annotation of DNA base excision repair (BER) factor binding sites

ChIP-seq signals of binding sites for four DNA base excision repair (BER) factors (ACNEIL_SSINP, GSM2137770_POLB, OGG1 and XRCC1) were downloaded from a public CyVerse platform (https://de.cyverse.org/data/ds/iplant/home/abacolla/bigwig) in bigWig format, aligned to the hg19 reference genome. The Python package pyBigWig was used to extract the per-base signals from the bigWig files. To evaluate the binding signals of BER factors for the HGMD pathogenic repeat expansions and corresponding controls, we selected two windows for each repeat: 1 kb upstream and downstream of the repeat start site, and 1 kb upstream and downstream of the repeat end site. Inside each window, we treated each base-pair position as a peak and calculated the peakRP as illustrated in Fig. S3. We then summed the peakRPs of the two windows to obtain the overall binding signal of the repeat expansion. Binding enrichment was assessed with a one-tailed Mann-Whitney test comparing HGMD pathogenic repeats with the gnomAD-repeats, TRF-repeats and NR-region control sets separately.

Assessing the statistical significance between groups

Significance assessment between repeat groups was performed using the Mann-Whitney U test provided by the SciPy module in Python 3.8 unless otherwise explicitly specified.

Constructing a model to predict pathogenic repeat expansions

We combined the TRF-repeats and gnomAD-repeats to form a negative control dataset which included 3348 unique repeat regions not associated with pathogenic repeat expansions, and used 224 repeat regions harboring pathogenic repeat expansions from HGMD-RPE as a positive dataset. We randomly picked 37 pathogenic repeat regions and 481 control repeat regions in chromosomes 4, 7 or 19 as a test set (termed test set1). The pathogenic repeat regions and the controls on the other chromosomes were then used as a training set. Another independent test set (termed test set2) was constructed that included newly (after March 2020) reported repeat regions harbouring 38 pathogenic repeat expansions, and 38 control repeat regions generated using the approach for establishing gnomAD-repeats and TRF-repeats. A five-fold cross-validation was performed on the training set to train the model. The model was further independently tested using the two independent test sets, test set1 and test set2.
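A chromosome-based hold-out of this kind can be sketched as below. The record layout (a dict with a `chrom` key) and the function name are assumptions made for illustration. The sketch assigns every region on chromosomes 4, 7 and 19 to the test set, which may differ from the random selection described above.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

Region = Dict[str, str]  # hypothetical record, e.g. {"chrom": "chr4", "id": "..."}

HELD_OUT: FrozenSet[str] = frozenset({"chr4", "chr7", "chr19"})


def split_by_chromosome(
    regions: Iterable[Region],
    held_out: FrozenSet[str] = HELD_OUT,
) -> Tuple[List[Region], List[Region]]:
    """Split regions into (train, test) according to their chromosome."""
    train: List[Region] = []
    test: List[Region] = []
    for region in regions:
        # Regions on a held-out chromosome never enter the training set.
        (test if region["chrom"] in held_out else train).append(region)
    return train, test
```

Holding out whole chromosomes keeps neighbouring, potentially correlated loci from being split between training and test data.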

An XGBoost model was constructed to discriminate repeat regions harbouring pathogenic repeat expansions from repeat regions not harbouring pathogenic repeat expansions (DPREx). We trained the classifier using the Gradient Boosting Decision Tree (GBDT) implementation in the XGBoost package (ref. 25). Feature selection was performed with a greedy algorithm. First, we used each feature individually to predict pathogenic repeat expansions and ranked the features by the area under the receiver operating characteristic curve (AUC) in a five-fold cross-validation. The best feature was then combined with the second-best feature to construct a combined model. If the combined model yielded an increased AUC, the procedure was continued with the next-ranked feature until no further AUC improvement could be achieved by adding features.
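The greedy procedure can be sketched as follows. This is one possible reading of the description, not the authors' code. The `score` callable is a hypothetical stand-in for the cross-validated AUC of a model trained on a given feature subset, and the sketch assumes that selection stops at the first feature that fails to improve the score.

```python
from typing import Callable, List, Sequence


def greedy_forward_selection(
    features: Sequence[str],
    score: Callable[[List[str]], float],
) -> List[str]:
    """Add features in order of single-feature score while the score improves."""
    # Rank features by how well each one performs on its own.
    ranked = sorted(features, key=lambda f: score([f]), reverse=True)
    if not ranked:
        return []
    selected = [ranked[0]]
    best = score(selected)
    for feature in ranked[1:]:
        candidate = score(selected + [feature])
        if candidate <= best:
            break  # assumed stopping rule: first feature with no improvement
        selected.append(feature)
        best = candidate
    return selected
```

In practice `score` would wrap model training and five-fold cross-validation; any function that maps a feature list to a number can be used to try the selection logic.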

Pathogenic repeat expansions are overrepresented in exons

Of the 224 pathogenic repeats in HGMD-RPE, the largest group was located in introns (31.3%, n = 70), followed by exons (26.8%, n = 60), promoters (14.3%, n = 32) and 5'UTR regions (11.6%, n = 26). Fewer were located in the 5'gene (7.59%, n = 17), 3'UTR (4.91%, n = 11) and 3'gene (3.13%, n = 7) regions, and one was located at an exon-intron junction site (0.45%, n = 1) (Table 1). Additionally, we found 26 repeat units expanded at two or more loci causing human inherited disease and 10 genes harbouring more than one pathogenic repeat expansion. The annotation methodology employed for introns, exons, 5'UTR, 5'gene, 3'UTR and 3'gene is described in Methods and the Supplementary Material.

We ensured that motif length and the overall length of the repeat regions in the control sets (TRF-repeats, gnomAD-repeats and NR-region) and the HGMD-RPE were distributed similarly in each genomic functional region (GFR) (Fig. S1). However, pathogenic repeats may still be overrepresented in specific GFRs compared to the control sets. To determine whether the numbers of pathogenic repeat expansions in each GFR were higher than expected, we compared the proportion of HGMD-RPE in a given GFR to that in the control dataset of 3,269 repeats (gnomAD-repeats and TRF-repeats). We observed that the pathogenic repeat expansions in HGMD-RPE were disproportionately enriched in exon (P = 1.38 × 10⁻¹¹) and 5'UTR (P = 6.22 × 10⁻⁵) regions as compared to the repeat expansions in gnomAD-repeats and TRF-repeats (Table 1). A similar comparison was also performed on HGMD-RPE-DM (a restricted dataset of HGMD-RPE encompassing pathogenic repeat expansions annotated as DM or DM? by HGMD) and the corresponding smaller gnomAD-repeats-DM and TRF-repeats-DM control datasets. We found that the repeat expansions in HGMD-RPE-DM were significantly enriched in the 5'UTR (P = 1.96 × 10⁻³) and exon (P = 1.03 × 10⁻¹⁰) regions (Table 1).

Table 1

Enrichment of pathogenic repeat expansions in different genomic functional regions (GFRs) compared with three control sets.
GFR	HGMD-RPE	gnomAD-repeats	TRF-repeats	NR-region	P-value^a
3’-UTR	11	49	105	33	4.87 × 10⁻¹
3’-gene	7	242	327	21	1.00
5’-UTR	26	13	150	78	6.22 × 10⁻⁵
5’-gene	17	221	267	51	9.90 × 10⁻¹
Exon	60	19	330	180	1.38 × 10⁻¹¹
Intron	70	1258	740	210	1.00
Promoter	32	417	368	96	9.90 × 10⁻¹
Total	224	2149	1120	672	-
^a Enrichment of pathogenic repeat expansions in each GFR relative to the loci in the control sets was evaluated by Chi-square test.
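For readers who wish to try a comparison of this kind, the sketch below computes a Pearson chi-square statistic for a 2 × 2 table using only the standard library. It rests on assumptions: a two-sided test without continuity correction, and a table built from "in GFR" versus "not in GFR" counts for cases and controls. The exact test configuration behind Table 1 is not stated here, so the sketch is not expected to reproduce the reported P-values.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, two-sided p-value) with one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # With 1 degree of freedom the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For the exon row, one possible call is `chi_square_2x2(60, 224 - 60, 19 + 330, 3269 - (19 + 330))`, taking 3,269 as the combined size of the gnomAD-repeats and TRF-repeats controls as stated in the text.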

Pathogenic repeat expansions tend to encode disordered surface-located protein domains

We next turned our attention to the protein structures encoded by pathogenic repeat expansions, and compared them with the non-repeat portions in the same gene-coding regions. A total of 60 pathogenic repeat expansions were found in the coding regions of 49 genes. Most of the secondary structures of the protein regions encoded by repeats associated with exonic pathogenic expansions were alpha helix (H), coil (C) or H and C (Fig. 2A). We found that coil (C) was significantly enriched in the repeat regions associated with pathogenic repeat expansions (P = 1.18 × 10⁻¹¹⁸ by one-sided Fisher’s Exact test) compared to the non-repeat regions. By contrast, beta sheet (E) and alpha-helix (H) were significantly under-represented in the pathogenic repeat regions compared to the non-repeat regions (P = 1.67 × 10⁻¹³⁰ and 7.32 × 10⁻⁷⁷, respectively).
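A one-sided Fisher's exact test of the kind used above can be written directly from the hypergeometric distribution. The sketch below is a standard-library illustration, and the function name is hypothetical. It assumes a 2 × 2 table in which `a` counts residues of the state of interest inside repeat regions, `b` the other residues inside repeat regions, and `c`, `d` the corresponding counts in non-repeat regions.

```python
from math import comb


def fisher_exact_greater(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact test for the table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null with fixed margins,
    i.e. the probability of at least `a` counts in the top-left cell.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)  # largest value the top-left cell can take
    denom = comb(n, col1)
    tail = sum(
        comb(row1, k) * comb(n - row1, col1 - k) for k in range(a, upper + 1)
    )
    return tail / denom
```

Because Python integers are arbitrary precision, the tail sum stays exact even for large residue counts; only the final division is rounded to a float.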

We also compared the secondary structures encoded by polyglutamine (CAG) tracts associated with pathogenic repeat expansions with those in non-repeat regions, and found that the former were disproportionately associated with alpha-helical (H) folds (P = 2.30 × 10⁻³, one-sided Fisher’s Exact test) compared to polyglutamine in the non-repeat regions, at the expense of beta sheets (E) (P = 9.60 × 10⁻³, one-sided Fisher’s Exact test). Thus, beta-sheet secondary structures were found to be under-represented within the protein regions around the CAG repeat-encoded polyglutamine tracts. Because the X-ray structure of the huntingtin protein shows that the expanded CAG-repeat gives rise to an alpha-helical secondary structure(Kim et al. 2009), we specifically analyzed the repeat expansions in five additional genes (ATXN8OS, CNR1, HTT, JPH3 and TBP) known to be associated with huntingtin(Hire et al. 2011; Holmes et al. 2001; Kloster et al. 2013; Koutsis et al. 2012; Melamed et al. 2015; Quarrell et al. 2007; Xu et al. 2009). Whereas the pathogenic repeat expansions in the HTT and TBP genes are located in exonic regions, those in the three other huntingtin-associated genes are not located within the coding region but rather in the 3'UTR (ATXN8OS), 3'gene region (CNR1) or an exonic region encoding RNA rather than protein (JPH3 gene; recorded as a JPH3-202 transcript in the Ensembl database, https://asia.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000154118;r=16:87601835-87698156;t=ENST00000301008). We then examined the enrichment of protein secondary structures in the repeat expansion regions of HTT and TBP. We found that helix (H) was significantly more enriched in the repeat regions associated with pathogenic repeat expansions in HTT and TBP (P = 1.3 × 10⁻³ by a one-sided Fisher’s Exact test) than in the non-repeat regions of these two genes.

The function of a protein is closely related to its three-dimensional (3-D) structure. Although disordered regions of proteins cannot form stable 3-D structures, they may nevertheless play important roles in biological processes(Oldfield and Dunker 2014; Uversky 2020), particularly in the context of protein-protein interactions. Therefore, we utilized SPOT-Disorder-Single(Hanson et al. 2018) to predict whether the protein regions encoded by the exonic sequences in the vicinity of known pathogenic repeat expansions had a preference to form stable 3-D structures or disordered structures. More details of the methodology used to annotate the 3-D structure of the repeat expansions within coding regions are given in the Supplementary Material. We found that 53 of the 60 repeat regions (88.3%) in the vicinity of pathogenic repeat expansions resided in regions with disordered protein structures. Only the repeat regions within the proteins CLEC4M, CNDP1, JPH3, LRP5, MICA, NEB and TENT5A were located within stable protein 3-D domains when a commonly used cutoff of 0.4 (score provided by SPOT-Disorder-Single) was applied to define stable protein 3-D domains. Thus, the proportion of amino acid residues with disordered protein structures in repeat regions encompassing the pathogenic repeat expansions is much higher (P = 1.2 × 10⁻¹¹, by a one-tailed t-test) than that in non-repeat regions (Fig. 2B).

To further evaluate the influence of pathogenic repeat expansions on protein structure, we then calculated the accessible surface area (ASA) scores of the regions encompassing the pathogenic repeat expansions within the encoded proteins. The average ASA score of the pathogenic repeat regions was significantly (P = 8.9 × 10⁻³⁷) higher than for the non-repetitive regions (Fig. 2C). Thus, exonic repeat expansions encode amino acid residues that are preferentially located on the surface of proteins, where they may exert their deleterious effects by mediating inappropriate interactions with other molecules, a hypothesis that could be tested experimentally by methods such as X-ray scattering(Hallinan et al. 2021; Jorda et al. 2010).

Pathogenic repeat expansions are predicted to facilitate long-range chromatin interactions

To assess the impact of pathogenic repeat expansions on long range chromatin interactions, we first compared the CTCF peak-RP of the repeats in the HGMD-RPE dataset with the three control datasets (gnomAD-repeats, TRF-repeats and NR-regions).

The CTCF peak-RP for the HGMD-RPE dataset was significantly higher than for all three control groups in the intron, exon and 5'gene GFRs (Fig. 3A and Table S4). For the promoter and 5'UTR regions, the HGMD-RPE also displayed a slightly higher CTCF peak-RP than the gnomAD-repeats dataset (P = 2.3 × 10⁻²) but not the TRF-repeats and NR-regions datasets (P = 0.674 and 0.718). Likewise, in the restricted HGMD-RPE-DM, the repeats in the exon, promoter and 5'gene GFRs exhibited a higher CTCF peak-RP than all three control datasets (Fig. S4A and Table S5).

Taken together, these data show that pathogenic repeat expansion loci residing within the exon and 5’gene GFRs tend to be disproportionately associated with CTCF binding sites in comparison with repeat regions with no known pathogenicity. Such an observation is suggestive of a potential relationship between pathogenic repeat expansions and chromatin folding dynamics.

Enhanced chromatin accessibility is a distinctive characteristic of pathogenic repeats

Histone marks are critical determinants of chromatin compactness and transcriptional control. H3K4me3 occupancy was higher in HGMD-RPE than in the gnomAD-repeats, the TRF-repeats and the NR-regions, particularly in exon [P = 1.55 × 10⁻⁵ (gnomAD-repeats); P = 1.76 × 10⁻² (TRF-repeats); P = 1.07 × 10⁻⁴ (NR-regions)], 5’gene [P = 1.87 × 10⁻⁴ (gnomAD-repeats); P = 4.59 × 10⁻³ (TRF-repeats); P = 1.55 × 10⁻² (NR-regions)] and promoter [P = 1.29 × 10⁻⁷ (gnomAD-repeats); P = 5.45 × 10⁻³ (TRF-repeats); P = 2.16 × 10⁻² (NR-regions)] GFRs (Fig. 3B and Table S4). Likewise, the ChIP signals of H3K4me1, which is usually associated with transcription (Kang et al. 2020), were stronger for HGMD-RPE than for gnomAD-repeats within the promoter (P = 1.01 × 10⁻²) and 3'UTR (P = 2.63 × 10⁻²) GFRs (Fig. 3C). H3K9me3 signals in both the HGMD-RPE and the controls were generally weaker than those of the other histone marks (Fig. 3D). The H3K9me3 signal was stronger in the HGMD-RPE dataset than in the gnomAD-repeats for intron (P = 2.92 × 10⁻²) and promoter (P = 2.46 × 10⁻²), although lower than that in the NR-regions for the exons (P = 3.73 × 10⁻³) and 3'gene (P = 2.46 × 10⁻²). H3K27me3 signals were also stronger in HGMD-RPE than in all three control datasets in the 5'gene GFR (P = 1.83 × 10⁻¹³, 2.95 × 10⁻⁴ and 2.16 × 10⁻⁴, respectively; Fig. 3E and Table S4), and stronger than those of gnomAD-repeats in the exon, promoter, 5'gene and 5'-UTR GFRs (Fig. 3E and Table S4). Finally, H3K36me3 signals were weaker in HGMD-RPE pathogenic repeats in exons with respect to TRF-repeats and NR-regions (P = 4.20 × 10⁻³ and 1.37 × 10⁻³, respectively), and in promoters with respect to all three control datasets (Fig. 3F and Table S4). A similar analysis performed on the smaller HGMD-RPE-DM led to similar results (Fig. S4).
In conclusion, H3K4me3, H3K9me3 and H3K27me3 displayed stronger occupancy in pathogenic repeat expanded regions than in the controls for most GFRs, whereas H3K36me3 occupancy was generally reduced, suggesting that one of the pathological consequences of the repeat expansions may be a shift to a more accessible chromatin structure leading to enhanced transcription.

In addition to the histone modification marks, we analysed the chromatin accessibility of our groups based on DNase-seq data. The results of this analysis suggested that the regions associated with the pathogenic repeat expansions in HGMD-RPE tend to have greater chromatin accessibility in the 5'gene (P = 2.39 × 10⁻³), 5'UTR (P = 3.50 × 10⁻⁴) and exon (P = 1.39 × 10⁻⁴) GFRs than the repeats in the gnomAD-repeats dataset (Table S4 and Fig. 3G). Chromatin accessibility was significantly greater for the pathogenic repeat expansions in HGMD-RPE in exon, 5'gene, 3'UTR and promoter GFRs as compared to the TRF-repeats. Chromatin accessibility was also greater for the pathogenic repeat expansions in HGMD-RPE in exon, 5'gene and promoter GFRs than for those in NR-regions (Fig. 3G).

The HGMD-RPE-DM dataset exhibited significantly higher chromatin accessibility in the promoter, 5'gene and 3'gene GFRs than all three control datasets. The HGMD-RPE-DM also exhibited significantly higher chromatin accessibility in the exons than the gnomAD-repeats-DM (P = 4.41 × 10⁻⁴) and TRF-repeats-DM (P = 1.79 × 10⁻³) datasets (Fig. S4G and Table S5).

Potential TFBSs in pathogenic repeat expansion loci are embedded within open chromatin regions

We next sought potential transcription factor binding sites (TFBSs) within the pathogenic repeat loci located in 5'gene, promoter and 5'UTR GFRs because these regions are the most frequently associated with the initiation of transcription (Schoenfelder and Fraser 2019; Whitfield et al. 2012). In total, 75 of 224 pathogenic HGMD-RPE repeat regions were found to reside in the 5'gene, promoter or 5'UTR regions, and of these, 54 repeat regions were found to contain 107 TF binding site motifs (Table S6), with 50 repeat regions containing more than one TF binding site motif. Additionally, 54 TF motifs appear frequently in repeat regions associated with pathogenic repeat expansions located in 5'gene, promoter or 5'UTR (Table S6) GFRs. For example, the TF motif with JASPAR dataset ID, MA0162.4 (EGR1, motif length = 14) displayed the most frequent occurrence within the pathogenic repeats, being associated with 33 different genes (Table S7).

Although the motifs identified within the pathogenic repeat loci match TFBSs and appear to have the potential to bind TFs, it is not known whether they actually bind TFs in vivo. We therefore repeated our analysis, this time including the chromatin accessibility signals within putative TFBSs. We specifically compared the chromatin accessibility signals within putative TFBSs between HGMD-RPE and the control groups (Fig. 3H and Fig. S4H). FIMO-predicted TFBSs in HGMD-RPE showed significantly higher chromatin accessibility peaks than in all three control datasets in the promoter and 5'gene GFRs, higher chromatin accessibility than the gnomAD-repeats (P = 1.26 × 10⁻³) and the NR-regions (P = 9.93 × 10⁻³) in the exons, and higher chromatin accessibility than the gnomAD-repeats in the 5'UTR region (P = 2.04 × 10⁻³) (Table S4). Likewise, in the more restricted HGMD-RPE-DM dataset, the FIMO-predicted TFBSs in the 5'gene region resided in areas of higher chromatin accessibility than in all controls, whereas those in exons resided in higher chromatin accessibility peaks than gnomAD-repeats-DM (P = 1.73 × 10⁻²) (Table S5).

Taken together, we conclude that pathogenic repeat expansions have a higher probability of being associated with TFBSs, and hence dysregulating gene expression, than either repeats with no known pathogenicity or the non-repeat regions. This finding therefore provides indirect support for the view that pathogenic repeat expansions frequently perturb the regulatory functions of TFBS.

Pathogenic repeat expansions are located so as to be capable of interfering with RNA splicing

We next investigated the potential impact of pathogenic repeat expansions on mRNA splicing. We focused our analysis on repeats in exons, introns and UTRs because only in these regions do the expanded repeats have the potential to be transcribed into precursor mRNAs. For HGMD-RPE, the distances from the sites of repeat expansions to the nearest splice sites were significantly shorter than for all three control groups in the intron, 5'gene, 3'gene and 3'UTR GFRs, shorter than for gnomAD-repeats and NR-regions in exons (P = 1.42 × 10⁻¹⁰ and 1.27 × 10⁻⁴), and shorter than for gnomAD-repeats in the 5'UTRs (P = 4.80 × 10⁻⁶) (Fig. 4A and Table S4). This implies that at least some pathogenic repeat expansions may exert their deleterious effects by disrupting mRNA splicing. We further used MMSplice to assess the potential impact of pathogenic repeat expansions on mRNA splicing, computing scores separately within and around (distance to the repeat < 3000bp and > 25bp) each repeat region. MMSplice scores within and around repeats are shown in Fig. 4B and Fig. 4C, respectively. MMSplice scores were higher for HGMD-RPE than for all three control groups in the exon, 5'gene and 3'UTR GFRs (Fig. 4B and 4C), and, indeed, in all seven GFRs, MMSplice scores in HGMD-RPE were higher than in the gnomAD-repeats (Table S4). Within HGMD-RPE-DM, the repeat expansions were characterized by a significantly shorter distance to neighboring splice sites than in gnomAD-repeats and NR-regions for six GFRs (exons, promoter, intron, 3'gene, 5'gene and 5'UTR) as shown in Fig. S5 and Table S5. Thus, at least a proportion of pathogenic repeat expansions are positioned such that they may interfere with the mRNA splicing process.

Stable RNA secondary structures are associated with repeat expansions

Next, we assessed the stability of RNA secondary structure by calculating the free energy of hairpin/loop folding in the 100bp fragments immediately flanking (upstream and downstream of) the repeat regions associated with pathogenic repeat expansions. Lower energy implies a more stable RNA secondary structure. Fragments upstream of repeats in the intron, promoter and 5'gene GFRs exhibited higher stability (lower RNA free energy) in HGMD-RPE than in all three controls; in addition, fragments upstream of repeats in 5' UTRs displayed higher stability in HGMD-RPE than in NR-regions (Fig. 4D). Similar findings were observed in HGMD-RPE-DM (Fig. S6). Likewise, fragments downstream of the repeats exhibited greater stability in pathogenic repeat expansions than in the three controls (Fig. 4E).

We computed the mean of the pairing probabilities (mPAP) of each RNA nucleotide for all 100 bp fragments (Ke et al. 2020) for the HGMD-RPE and controls. A higher mPAP score indicates a higher probability that a given base will pair with other bases in the RNA sequence, which in turn yields a higher stability of the composite RNA secondary structures. The 100bp region 5' to the repeat regions associated with the pathogenic repeat expansions exhibited significantly higher mPAP scores than gnomAD-repeats in the exon, promoter and 5'UTR GFRs (Fig. 4F, P = 7.90 × 10⁻³, 3.24 × 10⁻² and 8.14 × 10⁻³), and higher than all three control datasets in the 5'gene region. The 100bp region 3' to the expanded repeats in HGMD-RPE in the exon, 5'gene and 3'gene GFRs exhibited significantly higher mPAP scores than the corresponding regions in the gnomAD-repeats and TRF-repeats datasets (Fig. 4G, Table S4). Similar findings were obtained with the HGMD-RPE-DM (Fig. S6, Table S5), indicating that the mRNA domains flanking pathogenic repeat expansions are prone to folding into stable hairpin/loop secondary structures.

Pathogenic CGG and CAG expansions promote R-loops

R-loops are RNA–DNA hybrids that can trigger replication stress, genomic instability and strand breaks from chromatin condensation, and may mediate transcriptional activation by altering chromatin structure at promoter regions and recruiting transcription and chromatin-remodeling factors. Therefore, using R-loopDB (Wongsurawat et al. 2012) (http://rloop.bii.a-star.edu.sg/), we calculated the fraction of R-loops within the repeat regions that have undergone pathogenic repeat expansions. The fraction of the repeat regions in HGMD-RPE that were covered by more than 10% R-loops was significantly (Chi-square test P < 3.8 × 10⁻³) higher than for the repeat regions in the gnomAD-repeats or TRF-repeats datasets (Fig. 5A). We extended the R-loop analysis to the repeat regions in HGMD-RPE associated with different disease categories. The pathologies most frequently associated with > 10% R-loops affected the nervous system, the musculoskeletal system and the brain (Fig. 5B). We additionally examined the R-loop enrichment in HGMD-RPE located in different GFRs by comparison with gnomAD-repeats and TRF-repeats. We found that HGMD-RPE is enriched in R-loops in the 5’gene, exon, promoter and intron GFRs compared with gnomAD-repeats, and in the 5’gene, promoter and intron GFRs compared with TRF-repeats (Table S8).

Finally, we examined the percentages of R-loops spanning pathogenic repeat expansions of different repeat units, i.e. CGG, CAG, GCN, GGC and GCG, most frequently encountered in HGMD-RPE (Fig. 5C). Of the repeat expansion regions that were covered by at least 10% R-loops, 13.9% and 11.7% had CGG and CAG repeat motifs, respectively, compared to 2.8% and 1.4% in TRF-repeats (Fig. 5C). In the gnomAD-repeats dataset, no such regions with CGG or CAG motifs were found. These findings suggest that CGG and CAG are important for R-loop formation in the context of pathogenic repeat expansions, thereby supporting and extending a previous report that R-loops often occur in association with CAG/CTG repeat expansions(Malla et al. 2021).

Non-B DNA formation and expanded DNA repeats

Non-B DNA conformations can be formed by long repeat tracts. Here, we investigated whether certain types of non-B DNA are enriched in pathogenic repeat expansions compared to controls. The non-B DNA structures associated with each region harbouring pathogenic repeat expansions were annotated by reference to non-B DB 2.0 (Cer et al. 2013a) and bedtools (Quinlan and Hall 2010a). In total, 216 out of the 224 HGMD-derived pathogenic repeat regions were found to be capable of adopting at least one type of non-B DNA structure. We calculated the coverage of each type of non-B structure [Z-DNA, G-quadruplex (GQ), A-phased repeats, inverted repeats (IR), mirror repeats (MR), direct repeats (DR) and short tandem repeats (STR)] in each repeat region by computing the ratio of bases covered by the non-B structure. We found that HGMD-RPE in introns was significantly enriched in GQ compared with the gnomAD-repeats and TRF-repeats (P = 4.43 × 10⁻² for gnomAD-repeats and P = 3.40 × 10⁻³ for TRF-repeats, Table S9). Moreover, HGMD-RPE repeats in the 3’gene region were significantly enriched in Z-DNA and IR compared with gnomAD-repeats (P = 1.57 × 10⁻⁷ and 3.52 × 10⁻², respectively). Compared with the TRF-repeats, HGMD-RPE in exons was significantly enriched in A-phased repeats (P = 1.16 × 10⁻²) and Z-DNA (P = 3.91 × 10⁻⁵). Similar findings were obtained when analyzing HGMD-RPE-DM (Table S9). These data support the view that non-B DNA-forming motifs may be incorporated along with other metrics to predict the formation of pathogenic repeat expansions genome-wide.
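The coverage ratio described above can be illustrated with a short standard-library sketch. It assumes half-open BED-style intervals on a single chromosome, and the function name `covered_fraction` is hypothetical; it is not the bedtools-based pipeline used in this study.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # half-open [start, end), as in BED files


def covered_fraction(region: Interval, features: Iterable[Interval]) -> float:
    """Fraction of bases in `region` covered by at least one feature interval."""
    r_start, r_end = region
    if r_end <= r_start:
        raise ValueError("region must have positive length")
    # Clip features to the region and drop those that do not overlap it.
    clipped = sorted(
        (max(s, r_start), min(e, r_end))
        for s, e in features
        if s < r_end and e > r_start
    )
    covered, cursor = 0, r_start
    for s, e in clipped:
        s = max(s, cursor)  # skip bases already counted by an earlier feature
        if e > s:
            covered += e - s
            cursor = e
    return covered / (r_end - r_start)
```

Overlapping annotations are counted once, so the result always lies between 0 and 1. The same function could be used to apply a coverage threshold to a repeat region.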

Pathogenic expansion loci are evolutionarily conserved

Evolutionary conservation is a key feature to infer the function of specific DNA sequences and hence the likelihood that mutations within them will be pathogenic. We observed that loci that have undergone repeat expansions in HGMD-RPE are significantly more conserved than genomically matched gnomAD-repeats in exon, promoter and 5'UTR GFRs (P = 1.36 × 10⁻⁶, 1.74 × 10⁻² and 2.14 × 10⁻², respectively) (Fig. 6A and Table S4). Within HGMD-RPE-DM, the repeat expansion loci were significantly more conserved than those from the gnomAD-repeats and NR-regions in both the intronic and exonic GFRs (P = 4.88 × 10⁻³ and 5.41 × 10⁻⁶ for gnomAD-repeats-DM, P = 3.47 × 10⁻² and 5.61 × 10⁻³ for NR-regions-DM) (Fig. S7A and Table S5). Since the gnomAD-repeats dataset represents repeat expansions with no known pathogenicity, our results enable the prediction that repeat expansions in highly conserved GFRs are more likely to lead to disease than those in less conserved GFRs.

To ascertain whether the repeat expansions disrupt conserved functional genomic elements, we investigated the regions neighboring the repeats at the DNA sequence level. Specifically, we employed a 100-bp sliding window to scan the neighboring regions and calculated the average phastCons score within each window. We then gradually expanded the distance from the repeat regions or non-repeat regions to the neighboring sequences from 25 bp to 3000 bp and calculated the maximum value of the average phastCons score within the sliding window. In HGMD-RPE, the phastCons conservation scores at distances between 50 bp and 1000 bp were significantly higher than in all three control datasets for both the aggregate (Fig. 6B and Table S4) and the individual (exons, introns and 5' UTRs) GFRs (Fig. S8). The same trend was observed for the HGMD-RPE-DM dataset (Fig. S9 and Table S5). These results indicate that pathogenic repeat expansions are characterized by their proximity to conserved functional genomic elements in comparison to non-repeat regions or repeats without known pathogenicity.

Base excision repair (BER) factors in the vicinity of pathogenic repeat expansions

Human genome stability requires efficient DNA repair by several pathways, including BER(Bacolla et al. 2021). Therefore, we asked whether pathogenic repeat expansions were enriched in BER factor binding. To this end, we assessed the location of the binding sites of four BER factors: acetylated NEIL1 (AcNEIL1), POLB, OGG1 and XRCC1. In HGMD-RPE, the pathogenic repeat expansions in the 5’gene GFR were enriched in AcNEIL1 and XRCC1 occupancy compared to gnomAD-repeats, whereas no significant differences were observed between HGMD-RPE and the other control sets (Fig. 7, Table S4). By contrast, AcNEIL1 occupancy at the 3'UTR was lower for HGMD-RPE than for the NR-region; likewise, XRCC1 binding was lower around the pathogenic repeat expansions in HGMD-RPE in the 3'gene, 5'UTR and exon GFRs compared to the control sets (Fig. 7). POLB and OGG1 occupancies in the 5'UTR were stronger in HGMD-RPE than in NR-regions (Fig. 7). Moreover, binding of OGG1 to exon GFRs was higher around the pathogenic repeat expansions in HGMD-RPE than in gnomAD-repeats. However, POLB binding in intron and promoter GFRs was reduced in HGMD-RPE. The results of a parallel analysis on HGMD-RPE-DM and the corresponding controls are shown in Fig. S10.

Characteristics of repeat expansions differ between different categories of disease

To assess whether specific repeat expansions are associated with particular types of disease, repeat expansion-related diseases were grouped into 31 distinct categories according to Medical Subject Headings (MeSH) terms (Table S10 and Table S11), collapsing into one representative category those cases in which a disease was classified into multiple categories. For each category, we counted the number of disease genes impacted by repeat expansion (Fig. 8A), which revealed that ‘Nervous system diseases’ were the most frequently associated with genes impacted by repeat expansions (68 genes), followed by ‘Inherited susceptibility to neoplasms’ (34 genes), ‘Musculoskeletal diseases’ (33 genes), and ‘Mental disorders’ (22 genes). From each disease category, we further identified one gene that was most frequently associated with diseases in the same disease category (Table S12).

Furthermore, 40 genes in all were found to be associated with diseases from different categories, and Table S13 shows the top 10 genes associated with the largest number of disease categories. The different MeSH categories appeared to be preferentially associated with the expansion of specific types of repeat motif (Fig. 8B and Fig. S11); 14 disease categories were associated with expansion of the DNA repeat motif ‘GT’. Three types of repeat motif, namely CGG, CA and CAG, were found to be associated with 11 disease categories. Other repeat motifs, specifically TA, CTG and AAT, were associated with more than five disease categories.

Many studies have shown that the extent of repeat amplification correlates directly with clinical severity and the age of onset(Gijselinck et al. 2016; Swami et al. 2009). Two important questions arise: (1) Does the number of repeat units vary between different disease categories? (2) Is the average repeat unit length of the pathogenic repeat expansions different between different disease categories? To address the first question, we examined the top nine disease categories associated with the largest number of pathogenic repeat expansions, which showed that the repeat expansions associated with the MeSH disease category ‘Nutritional and metabolic diseases’ exhibit the highest average number of repeat units pre-expansion (Fig. 8C and Supplementary Material). Next, we explored the variation in total repeat length between disease categories. The average length of the repeat motifs associated with repeat expansion disorders in the categories ‘Nervous system diseases’, ‘Mental disorders’, and ‘Neoplasms’ appeared to be longer than for the repeat motifs associated with other categories (Fig. 8D). Repeat expansions within coding regions tend to be located within certain regions of the proteins characterized by specific secondary structures. The pre-expansions encompassing pathogenic repeat expansions in HGMD-RPE associated with ‘Nervous system diseases’, ‘Neoplasms’, ‘Mental disorders’, ‘Nutritional and metabolic disease’, ‘Male urogenital diseases’, and ‘Digestive system diseases’ were more likely to be located within coils (‘C’) than in ‘H’ or ‘E’ (Fig. 8E).

Additional analyses were performed to ascertain the influence of repeat expansions on gene expression. In total, 127 repeat expansions in HGMD-RPE were likely to play a role in modulating gene expression. Of these, repeat expansions causing ‘Nervous system diseases’, ‘Neoplasms’, ‘Cardiovascular diseases’, ‘Nutritional and metabolic diseases’, ‘Male urogenital diseases’ and ‘Susceptibility to infections’ were predicted to exert their effects by increasing gene expression (Fig. 8F). Among them, six repeat expansions causing or associated with musculoskeletal diseases, mental disorders and digestive system diseases tended to up-regulate gene expression by involving multiple genomic architectural features including histone marks, transcription factor binding sites, CTCF binding sites and chromatin accessibility. By contrast, 14 pathogenic repeat expansions causing or associated with musculoskeletal diseases, mental disorders and digestive system diseases tended to be associated with the down-regulation of gene expression. In summary, about half of all pathogenic repeat expansions are predicted to exert their pathological role at least in part by impacting gene expression.

Constructing DPREx for the prediction of pathogenic repeat expansions

Having compiled comprehensive bioinformatics metrics on repeat expansions, we constructed a predictive model (DPREx) of pathogenic repeat expansions. DPREx (http://biomed.nscc-gz.cn/zhaolab/geneprediction/#/) utilized a total of 185 distinct features to characterize the repeat regions from the positive and negative training datasets, including chromatin accessibility, histone modification marks from multiple tissues, the binding strength of transcription factors, distance to alternative splice sites, evolutionary conservation, non-B DNA structures, BER factors, R-loops and RNA secondary structure. The DPREx model was trained and evaluated by means of 5-fold cross-validation (CV) on the training set. When 42 features were used, DPREx attained an AUC of 0.875 (Table S14). The contribution of each feature was evaluated by ablation steps (Fig. 9A, Table S14), and was further described by the weight of each feature in the XGBoost model (Fig. 9B). The results show that the distance of the repeat region to a splice site is the most important predictor in determining whether a repeat undergoes pathogenic expansion. Additionally, the types of non-B DNA-forming structure, such as mirror repeats (MR) and short tandem repeats (STR), are key predictive features for detecting regions of pathogenic potential.
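The evaluation scheme can be sketched in plain Python as k-fold cross-validated AUC followed by a leave-one-feature-out ablation. This is an illustration under stated assumptions, not the DPREx code. The `fit` callable stands in for model training (the actual model is gradient-boosted trees), the ablation variant shown is one common choice because the exact ablation steps are not specified here, and all data are toy values.

```python
import random


def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k=5, seed=0):
    """Split indices into k folds, keeping the class balance similar per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def cross_validated_auc(features, labels, fit, k=5, seed=0):
    """Mean held-out AUC; `fit(X, y)` must return a scoring function."""
    aucs = []
    for held_out in stratified_folds(labels, k, seed):
        held = set(held_out)
        train = [i for i in range(len(labels)) if i not in held]
        predict = fit([features[i] for i in train], [labels[i] for i in train])
        scores = [predict(features[i]) for i in held_out]
        aucs.append(roc_auc([labels[i] for i in held_out], scores))
    return sum(aucs) / len(aucs)


def ablation_drops(features, labels, fit, k=5, seed=0):
    """AUC lost when each feature column is removed in turn (assumed variant)."""
    base = cross_validated_auc(features, labels, fit, k, seed)
    drops = {}
    for j in range(len(features[0])):
        reduced = [row[:j] + row[j + 1:] for row in features]
        drops[j] = base - cross_validated_auc(reduced, labels, fit, k, seed)
    return drops


def toy_fit(X, y):
    """Placeholder 'model': ignores training data and scores by feature sum."""
    return lambda row: sum(row)


toy_labels = [1] * 10 + [0] * 10
# Column 0 is informative (equals the label); column 1 is a constant.
toy_features = [[float(y), 0.5] for y in toy_labels]
toy_auc = cross_validated_auc(toy_features, toy_labels, toy_fit)
toy_drops = ablation_drops(toy_features, toy_labels, toy_fit)
```

In the toy data, removing the informative column lowers the AUC from 1.0 to 0.5, whereas removing the constant column changes nothing, which is the kind of contrast an ablation analysis is meant to expose.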

In the training set, DPREx yielded the highest Matthews correlation coefficient (MCC, 0.39) when a cutoff of 0.04 was used to determine whether or not a given repeat region was pathogenic (Fig. 9C). This cutoff was used to compare the MCC and F1-score of DPREx with those of two other methods (CADD and ReMM) in independent tests. Using the first independent test set, test set1 (Method), DPREx was compared with CADD and ReMM, which are commonly used to predict the pathogenicity of single-locus variants, and which we exploited to generate pathogenicity scores for the breakpoints of the repeat regions plus 10 bases upstream and downstream. From these data, we selected the maximum score of all bases within the window as the representative score for the repeat region. Our method achieved an AUC of 0.87, which outperformed both CADD (AUC 0.66) and ReMM (AUC 0.72) (Fig. 9D). When we used DPREx with a score of 0.04 as a cutoff to determine whether or not a given repeat region was pathogenic, it achieved an F1-score of 0.43 and an MCC of 0.35. For CADD, a scaled score of 10 means that a given variant is among the top 10% most deleterious variants in the human genome (Rentzsch et al. 2019). Using a score of 10 as a cutoff, CADD achieved an F1-score of 0.23 and an MCC of 0.17. ReMM outputs mutation scores ranging from 0 (non-deleterious) to 1 (deleterious). Using a score of 0.5 as the threshold for determining whether or not a mutation was deleterious, ReMM achieved an F1-score of 0.22 and an MCC of 0.20 (Table 2).
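The two computations in this comparison, a representative score per repeat region and thresholded MCC and F1-score, can be sketched as follows. Two assumptions are made for illustration: the window runs from 10 bases upstream of the region start to 10 bases downstream of its end, and a region is called pathogenic when its score is greater than or equal to the cutoff. All names and numbers are toy values and do not reproduce Table 2.

```python
import math


def window_max_score(per_base_scores, start, end, flank=10):
    """Maximum per-base score in [start - flank, end + flank] (inclusive).

    `per_base_scores` maps genomic position -> score (e.g. from a
    single-nucleotide predictor); positions without a score are skipped.
    """
    values = [
        per_base_scores[p]
        for p in range(start - flank, end + flank + 1)
        if p in per_base_scores
    ]
    if not values:
        raise ValueError("no scored positions inside the window")
    return max(values)


def confusion_counts(labels, scores, cutoff):
    """Return (tp, fp, tn, fn), calling a region positive if score >= cutoff."""
    tp = fp = tn = fn = 0
    for y, s in zip(labels, scores):
        called = s >= cutoff
        if called and y == 1:
            tp += 1
        elif called and y == 0:
            fp += 1
        elif not called and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Toy example: three pathogenic and five control regions.
toy_labels = [1, 1, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.5, 0.01, 0.2, 0.03, 0.02, 0.01, 0.0]
tp, fp, tn, fn = confusion_counts(toy_labels, toy_scores, cutoff=0.04)
toy_mcc = mcc(tp, fp, tn, fn)
toy_f1 = f1_score(tp, fp, fn)

toy_per_base = {90: 1.0, 95: 3.0, 100: 2.0, 120: 9.0, 131: 50.0}
toy_region_score = window_max_score(toy_per_base, start=100, end=120)
```

Because MCC uses all four cells of the confusion matrix, it is less flattering than the F1-score when negatives greatly outnumber positives, which is why both are worth reporting alongside the threshold-free AUC.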

Table 2

Comparing DPREx with CADD and ReMM using two independent datasets.
	Test set1			Test set2
Methods	DPREx^a	CADD^b	ReMM^c	DPREx	CADD	ReMM
AUC	0.89	0.69	0.70	0.82	0.72	0.75
MCC	0.35	0.17	0.20	0.45	0.25	0.21
F1-score	0.33	0.23	0.22	0.51	0.39	0.37
^a Performance of DPREx with respect to MCC and F1-score, measured using a DPREx score threshold of 0.0705, which yielded the highest MCC in the training set; ^b Performance of CADD with respect to MCC and F1-score, using a CADD score threshold of 10, which means that a given variant is amongst the top 10% of deleterious variants in the human genome; ^c Performance of ReMM with respect to MCC and F1-score, using a ReMM score of 0.5 as the threshold.

DPREx was further compared with CADD and ReMM by employing a second independent dataset, test set2 (Method). In this case, DPREx achieved an AUC of 0.78, which was higher than those achieved by CADD (0.61) and ReMM (0.62) (Fig. 9E). When we used the cutoffs of 10 for CADD, 0.5 for ReMM, and 0.04 for DPREx to predict pathogenic repeat regions, CADD, ReMM and DPREx achieved F1-scores of 0.36, 0.39 and 0.53, and MCC values of 0.17, 0.21 and 0.48, respectively. Thus, DPREx provides a robust and competitive method for predicting pathogenic repeat expansions based upon the integrated data and model from the human genome.

About 50% of the human genome is made up of repetitive sequences (Lander et al. 2001); yet only a small percentage of these repetitive sequences are pathogenic, underscoring the critical need for advancing predictive knowledge of pathogenic repeats. Here, we performed a multi-pronged analysis to explore 224 genomic regions harbouring pathogenic repeat expansions involving 33 distinct disease categories, termed HGMD-RPE, which represents the largest meta-analysis of pathogenic repeat expansions reported to date. We investigated the mechanisms of formation of the pathogenic repeat expansions by comparing them with three control datasets: TR-repeats, gnomAD-repeats and NR-regions. The TR-repeats and gnomAD-repeats datasets were established so as to allow comparison of the pathogenic repeat expansions with repeat expansions with no known pathogenicity, as well as with repeat loci which are not known to be expanded or associated with disease. The NR-regions represent genomic regions without repeat sequences. By comparing HGMD-RPE with these three datasets, we demonstrate that pathogenic repeat expansions may dysregulate gene expression through the formation of CTCF-mediated loops, by enhancing chromatin accessibility, and by providing multiple TFBS. Additionally, we found that R-loops and certain types of sequence with non-B DNA-forming potential are associated with pathogenic repeat expansions; these features may mediate transcriptional activation by impacting chromatin structure. At the protein level, our analyses support the view that pathogenic repeat expansions may exert their deleterious function by encoding structurally disordered domains located on the protein surface. Taken together, our data show that pathogenic repeat expansions typically impact gene function at multiple levels with potential for amplification, cross-talk and synergy to drive expansion and pathogenesis. As repeat variation plays a crucial role in human disease, especially in neurodegeneration and cancer (Li et al. 2020; Loureiro et al. 2016), the results and methodology employed here provide the means to efficiently test and extend current knowledge.

Specifically, by exploring chromatin interactions through 3D chromatin structures derived from Hi-C and other 3C-based technologies, our analysis reveals the potential impact of the expanded pathogenic repeats on gene regulation. We also complemented the relatively low-resolution Hi-C data (Zhang et al. 2018) with high-resolution CTCF ChIP-seq data (Belokopytova et al. 2020), as CTCF binding sites are closely associated with the formation of chromatin loops (de Wit et al. 2015; Pugacheva et al. 2020; Xi and Beer 2021). Indeed, CTCF plays crucial roles in the formation and maintenance of topologically associating domains (TADs) that modulate connections between regulatory elements, such as promoters and enhancers (Kentepozidou et al. 2020). We show that the 5' regions of genes harbouring pathogenic repeat expansions tend to be disproportionately bound by CTCF as compared to repeat regions with no known pathogenicity. Thus, regions harbouring pathogenic repeat expansions appear to have increased potential to bind CTCF, suggesting an alteration in chromatin conformation to increase transcription. The transcriptional repressor CTCF contains a zinc finger involved in transcriptional regulation and insulation, imprinting, V(D)J recombination and the regulation of chromatin architecture (Peters 2021; Phillips and Corces 2009; van Ruiten and Rowland 2021; Wu et al. 2020). Given that the regions around pathogenic repeat expansions also contain more TFBS than non-repetitive regions, our results support the conclusion that pathogenicity arises at least in part from altered, mostly increased, gene expression. In future studies it will be interesting to compare DNA methylation patterns between pathogenic and benign repeat expansions (Fotsing et al. 2019; Quilez et al. 2016).

Mechanistically, repeats can emerge and expand by means of DNA repair and replication processes. This implicates processes and proteins involved in replication slippage and reinitiation, template switching, inaccurate translesion synthesis polymerases, and homologous recombination or alternative end-joining repair of DNA double-strand breaks (Charlesworth et al. 1994; Eckelmann et al. 2020; Hambarde et al. 2021; Levinson and Gutman 1987; Tsutakawa et al. 2017; Ye et al. 2021). Furthermore, DNA repeats can form non-B DNA structures, which can then play a role in eliciting replication and transcription stress resulting in DNA breaks, as well as in modulating gene regulation, protein function and RNA translation (Du et al. 2013). Our analyses showed that pathogenic repeat expansions are enriched in GQ and Z-DNA, supporting the notion that perturbing DNA structure can cause neurodegenerative disorders, such as Alzheimer’s disease, prion disease, Parkinson’s disease, amyotrophic lateral sclerosis/frontotemporal dementia and fragile X syndrome, as well as contribute to genome instability and cancer mutational burden (Wang et al. 2021). Therefore, our comprehensive analyses support and extend results showing that pathogenic repeat expansions have high potential to form specific non-B DNA structures that may interfere with gene and protein function.

Formation of stable R-loops with a strong positive G/C skew, representing G-clusters in the non-template DNA strand next to transcriptionally active promoters, is one of the mechanisms underlying the pathogenicity of repeat expansions (Abu Diab et al. 2018; Ginno et al. 2012; Malik et al. 2021). We therefore analyzed the spatial coincidence of R-loops with pathogenic repeat expansion loci. One model that implicates R-loops in the enhancement of repeat instability has suggested that they act by promoting noncanonical structures, such as hairpins and G-quadruplexes (G4), involving the unpaired DNA and RNA (Abu Diab et al. 2018; Gray et al. 2014). Indeed, these studies have shown that when R-loops are formed in GC-rich repeats in a transcriptionally competent state, DNA unpairing is more extensive and associated with the interruption of double-stranded structures by the nontranscribed (G-rich) DNA strand. Consequently, unusual DNA structures are formed, which may trigger repeat instability (Abu Diab et al. 2018; Bacolla et al. 2016). As mentioned, we observed a significantly higher proportion of R-loops in HGMD-RPE than in repeat expansions with no known pathogenicity, highlighting the potential contribution of R-loops, particularly those formed by CGG repeats, to human pathology. Our results therefore reinforce and extend the concept that sequences with strand asymmetry in the distribution of guanines and cytosines, a property known as GC skew, may form stable R-loops upon transcription. In this mechanism, a newly transcribed G-rich RNA strand reanneals to the template C-rich DNA strand, forcing the non-template G-rich DNA strand into a single-stranded conformation (Ginno et al. 2012; Ratmeyer et al. 1994). Indeed, the formation of R-loops within CGG repeat regions has been associated with DNA damage (Abu Diab et al. 2018; Loomis et al. 2014; Robin et al. 2017).
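GC skew, as used in this paragraph, is conventionally computed per strand as (G − C) / (G + C). The short Python sketch below illustrates that quantity on a toy CGG tract; the function names and window parameters are hypothetical, and the sketch is not the R-loop annotation procedure used in this study.

```python
def gc_skew(seq):
    """GC skew of one strand: (G - C) / (G + C); 0.0 if the strand has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    return 0.0 if g + c == 0 else (g - c) / (g + c)


def windowed_gc_skew(seq, window, step):
    """GC skew in sliding windows, returned as (window_start, skew) pairs."""
    return [
        (i, gc_skew(seq[i:i + window]))
        for i in range(0, len(seq) - window + 1, step)
    ]


# A (CGG)n tract read on the G-rich strand has a positive skew of 1/3;
# the same tract read on the complementary strand, (CCG)n, has skew -1/3.
cgg_tract = "CGG" * 10
ccg_tract = "CCG" * 10
skew_g_rich = gc_skew(cgg_tract)
skew_c_rich = gc_skew(ccg_tract)
```

A positive value on the non-template strand corresponds to the G-rich configuration discussed above, in which the displaced strand is G-rich and the nascent RNA pairs with the C-rich template.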

Pathogenic repeat expansions displayed stronger signals for H3K4me3, H3K9me3 and H3K27me3 than the controls for most functional regions. According to Roh et al. (2006), the activation of H3K4me3 methylation and the repression of H3K27me3 methylation are associated with high levels of gene expression. In HGMD-RPE, we found stronger H3K4me3 signals in the exon, promoter and 5'gene GFRs than in all three control datasets, which may indicate that the pathogenic repeat expansions in these GFRs are involved in modulating the chromatin modification pattern. Notably, there are 23 pathogenic repeat expansions in 5'UTR, intron, exon, promoter and 5'gene GFRs with strong H3K4me3 signals (score > 5), and they were reported to regulate the expression of 23 genes (Table S15). These 23 genes were found to be significantly (P_adjust < 0.05) enriched in the KEGG pathways ‘Dopaminergic synapse’, ‘Arginine and proline metabolism’, ‘Arachidonic acid metabolism’ and ‘Metabolic pathway’ according to clusterProfiler (Wu et al. 2021).

A key finding of our study is that any one pathogenic repeat expansion has the potential to impact gene expression at multiple levels, involving several genomic architectural features, including histone marks, transcription factor binding sites, CTCF binding sites and chromatin accessibility. Fig. S12 displays the pathogenic repeat expansions in HGMD-RPE that have stronger histone marks, larger numbers of TFBS or CTCF sites, or higher chromatin accessibility than the average values of at least one of the three control sets. We found that these pathogenic repeat expansions are frequently located in the 5'gene, 5'UTR, exon and promoter GFRs. Among these pathogenic repeat expansions, one (with HGMD ID CE1313062; chromosome 2: 73613031–73613067) exhibited stronger H3K27me3, H3K4me1, H3K4me3 and H3K9me3 signals, and an increased number of TFBS, CTCF and DNaseI sites compared to all control datasets. This repeat expansion is located in an exonic region of the ALMS1 gene and confers susceptibility to cardiovascular disease. Another repeat expansion, CE102901 on chromosome X: 136648984 (in an exonic region of the ZIC3 gene), exhibited stronger H3K27me3, H3K4me1, H3K4me3 and H3K9me3 signals, and a larger number of TFBS, CTCF-binding sites and DNaseI sites than one or more control datasets. Among these pathogenic repeat expansions, 53 were experimentally shown to dysregulate gene expression. Results of a parallel analysis in HGMD-RPE-DM are shown in Fig. S13. Since genes harbouring pathogenic repeat expansions are differentially regulated, it is not altogether surprising that the effects of any given repeat expansion are pleiotropic (Rice and Rebeiz 2019). For example, the transcriptional activity of the CYP19A1 gene, associated with breast cancer (Kristensen et al. 1998; Ma et al. 2010; Xita et al. 2010) and increased cortical bone mass density (Lorentzon et al. 2006), may be altered by an intronic repeat expansion (CE068496) by several means, including enhanced H3K27me3 occupancy, chromatin accessibility, and its closeness to transcription factor and CTCF binding sites. Such open chromatin regions associated with increased transcription, replication origins, and increased DNA repair will be partly tissue-dependent (Bacolla et al. 2021). It will be interesting to identify the pathways through which pathogenic expansions exert their tissue-specific effects.

At the protein level, exonic regions encompassing repeat expansions tend to form sequence-local coils, but rarely sequence-distant beta strands or alpha helices. Additionally, amino acids encoded by regions harbouring pathogenic repeat expansions are often located in disordered regions and on the protein surface, implying that these regions may function by interacting with cognate proteins or small molecules. Such surface-exposed and disordered regions act in DNA replication and repair processes, especially DNA double-strand break repair (Hammel and Tainer 2021). Multivalent interactions of disordered regions and RNA may also be responsible for phase separation that can impact molecular interactions and responses to DNA damage (Thapar et al. 2021). Additional analyses should focus on determining whether pathogenic expansions tend to reside within the binding domains of proteins.

One of the goals of this work was to build a robust and sensitive model for predicting pathogenic repeat expansions. DPREx is the first model of its kind, tailored both to recognizing repeat regions that may harbour pathogenic repeat expansions and to employing non-B DNA-forming sequences to predict mutational pathogenicity. Although DPREx was constructed from a limited number of pathogenic repeat expansions, it outperformed existing algorithms that are designed to predict the pathogenicity of point mutations. The underlying reason may be that DPREx employed the specific set of features that we found to be strongly associated with repeat expansion, for example, non-B DNA structure. Feature analysis indicated that types of non-B DNA, such as short tandem repeats (STR) and mirror repeats (MR), are second in importance only to the distance to a splice site. This finding is consistent with our conclusion that STRs are over-represented in the pathogenic repeat expansions whereas MRs are under-represented compared to controls.

Three limitations of our study point toward potential future improvements. First, we used chromatin modification data from brain tissue to identify the distinctive characteristics of pathogenic repeats required for chromatin accessibility, on the basis that 44.3% of pathogenic repeat expansions are associated with Nervous system diseases or Mental disorders. Further analysis may consider chromatin modification data from other tissues. Second, we established the control sets by matching HGMD-RPE and the controls in terms of the length of the repeat motifs and functional regions. However, we did not consider repeat motif similarity between HGMD-RPE and the controls, as this would have dramatically reduced the sample size of the control sets. Third, we did not analyse the impact of repeat expansions on other regulatory mechanisms (e.g. DNA methylation and RNA methylation), which would also be warranted.

In conclusion, we have performed a systematic analysis of the largest available collection of pathogenic DNA repeat expansions in the human genome. We characterized distinct and synergistic features of the pathogenic repeat expansion loci, including chromatin folding, non-B DNA structure, RNA secondary structure, R-loops, TF binding, mRNA splicing, evolutionary conservation, protein structure and BER factors. Our analysis revealed that, far from having a single mechanism of action, repeat expansions may typically exert their pathological effects at multiple levels, potentially involving cross-talk and synergies between different stages of the gene expression pathway. For example, genomic instability may both result from and drive repeat expansions engendered by aberrant DNA, RNA, and protein structures. Finally, we tested and validated the practical implications of our findings by constructing DPREx. Importantly, this predictive model is capable of distinguishing pathogenic repeat expansions from non-pathogenic ones, highlights critical points of synergy contributing to predicted pathogenic repeats, and unveils multifactorial mechanisms that may underlie repeat pathogenicity as a rule. Thus, our integrated model provides a unified hypothesis for pathogenic repeats that can be further tested and improved. As more pathogenic repeat expansions are described and characterized, we anticipate that the features of added loci should both test our model and advance understanding of how DNA repeat expansions exert their pathological influence, with relevance to strategies for precision medicine therapeutics.

Data and Code availability

The source code and data for characterizing genomic regions with repeat expansion are available at https://github.com/wykswr/ItvAnt.

The source code and data for DPREx are available at https://github.com/wykswr/DPREx.

Reference genome and annotations were obtained from GENCODE (https://www.gencodegenes.org/). The release version of annotation is v37: http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_37/GRCh37_mapping/gencode.v37lift37.annotation.gtf.gz .

The epigenetics data, including CTCF binding sites, DNase-seq and histone modification data, were obtained from the ENCODE project (https://www.encodeproject.org/). Accession IDs are: ENCFF618DDO (CTCF ChIP-seq, narrowPeak); ENCFF021YPR (H3K27me3 ChIP-seq, bigWig); ENCFF388WCD (H3K36me3 ChIP-seq, bigWig); ENCFF481BLF (H3K4me1 ChIP-seq, bigWig); ENCFF780JKM (H3K4me3 ChIP-seq, bigWig); ENCFF411VJD (H3K9me3 ChIP-seq, bigWig).
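As a hedged illustration of how a narrowPeak file such as the CTCF track might be intersected with a repeat region, the following Python sketch parses tab-separated narrowPeak rows (BED6+4, 0-based half-open coordinates) and returns the peaks overlapping a query interval. It assumes the query is given in 1-based inclusive coordinates; the function names and the rows are invented for the example and are not taken from the ENCODE files or from the annotation pipeline in the repositories above.

```python
def parse_narrowpeak(lines):
    """Parse narrowPeak rows into dicts with chrom, start, end and signalValue."""
    peaks = []
    for line in lines:
        # Skip blank lines and optional track/browser/comment headers.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        fields = line.rstrip("\n").split("\t")
        peaks.append({
            "chrom": fields[0],
            "start": int(fields[1]),     # 0-based, inclusive
            "end": int(fields[2]),       # 0-based, exclusive
            "signal": float(fields[6]),  # signalValue column
        })
    return peaks


def peaks_overlapping(peaks, chrom, start_1based, end_1based):
    """Peaks sharing at least one base with a 1-based inclusive interval."""
    start0 = start_1based - 1  # convert to 0-based half-open
    end0 = end_1based
    return [
        p for p in peaks
        if p["chrom"] == chrom and p["start"] < end0 and start0 < p["end"]
    ]


# Invented rows for illustration only.
toy_lines = [
    "chr1\t1000\t1040\tpeak1\t500\t.\t12.5\t8.1\t6.0\t20\n",
    "chr1\t1067\t1200\tpeak2\t300\t.\t4.0\t3.2\t2.1\t50\n",
    "chr2\t1000\t1100\tpeak3\t700\t.\t9.9\t7.0\t5.5\t40\n",
]
hits = peaks_overlapping(parse_narrowpeak(toy_lines), "chr1", 1031, 1067)
```

The second toy peak starts one base after the query ends once both are expressed in the same coordinate system, so it is correctly excluded; off-by-one errors of this kind are the usual pitfall when mixing BED-style and 1-based coordinates.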

phastCons conservation scores were downloaded from UCSC: https://hgdownload.cse.ucsc.edu/goldenpath/hg19/phastCons100way/hg19.100way.phastCons.bw .

The pre-computed MMSplice scores were obtained from the annotation of CADD (offline version): https://cadd.gs.washington.edu/download .

Non-B DNA structure annotation (hg19): https://ncifrederick.cancer.gov/bids/ftp/?nonb# , https://ncifrederick.cancer.gov/bids/ftp/actions/download/?resource=/bioinfo/static/nonb_dwnld/human_hg19/human_hg19.gff.tar.gz

Acknowledgements

This work was funded by the National Key Research and Development Program of China (2020YFB0204803), the Natural Science Foundation of China (81801132, 81971190, 61772566), and the Natural Science Foundation of Guangdong (2021A1515010256). J.A.T. was supported in part by National Institutes of Health (NIH) grants P01 CA092584 and R35 CA220430, by the Cancer Prevention Research Institute of Texas (CPRIT) grant (RP180813), and a Robert A. Welch Chemistry Chair. P.D.S., E.V.B., M.M. and D.N.C. acknowledge financial support from Qiagen Inc through a License Agreement with Cardiff University.

Author information

Cong Fan and Ken Chen have contributed equally to this work.

Authors and Affiliations

Department of Medical Research Center, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, 500001, China

Cong Fan

School of Computer Science and Engineering, Sun Yat-Sen University, Guangzhou, 500001, China

Ken Chen

School of LifeScience, Sun Yat-sen University, Guangzhou, 500001, China

YuKai Wang

Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK

Edward V. Ball

Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK

Peter D. Stenson

Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK

Matthew Mort

The University of Texas MD Anderson Cancer Center, Department of Molecular and Cellular Oncology, 6767 Bertner Avenue, Houston, TX 77030, USA

Albino Bacolla

Institute of Human Genetics, University of Ulm, Albert-Einstein-Allee 11, 89081 Ulm, Germany

Hildegard Kehrer-Sawatzki

The University of Texas MD Anderson Cancer Center, Department of Molecular and Cellular Oncology, 6767 Bertner Avenue, Houston, TX 77030, USA

John A. Tainer

Institute of Medical Genetics, School of Medicine, Cardiff University, Heath Park, Cardiff CF14 4XN, UK

David N. Cooper

Department of Medical Research Center, Sun Yat-Sen Memorial Hospital, Sun Yat-Sen University, Guangzhou, 500001, China

Huiying Zhao

Contributions

Cong Fan and Ken Chen performed the analysis and wrote the manuscript, contributing equally. Yukai Wang developed the annotation pipeline as well as the DPREx model. Edward V. Ball, Peter D. Stenson, Matthew Mort, Albino Bacolla, Hildegard Kehrer-Sawatzki, John A. Tainer and David N. Cooper made suggestions on the design of the work, supplied data for analysis, and carefully checked the manuscript several times. Huiying Zhao designed the strategy of this work and collected resources for the project.

Corresponding author

Correspondence to Huiying Zhao

Ethics declarations

Conflict of interests

The authors declare no conflicts of interest or competing interests.

Ethics approval and consent to participate

This research involved no participants, data or biological material from humans or animals. This research involved no human subjects.

Abu Diab M, Mor-Shaked H, Cohen E, Cohen-Hadad Y, Ram O, Epsztejn-Litman S, Eiges R (2018) The G-rich Repeats in FMR1 and C9orf72 Loci Are Hotspots for Local Unpairing of DNA. Genetics 210: 1239-1252. doi: 10.1534/genetics.118.301672
Avsec Z, Weilert M, Shrikumar A, Krueger S, Alexandari A, Dalal K, Fropf R, McAnany C, Gagneur J, Kundaje A, Zeitlinger J (2021) Base-resolution models of transcription-factor binding reveal soft motif syntax. Nat Genet 53: 354-366. doi: 10.1038/s41588-021-00782-6
Bacolla A, Sengupta S, Ye Z, Yang C, Mitra J, De-Paula RB, Hegde ML, Ahmed Z, Mort M, Cooper DN, Mitra S, Tainer JA (2021) Heritable pattern of oxidized DNA base repair coincides with pre-targeting of repair complexes to open chromatin. Nucleic Acids Res 49: 221-243. doi: 10.1093/nar/gkaa1120
Bacolla A, Tainer JA, Vasquez KM, Cooper DN (2016) Translocation and deletion breakpoints in cancer genomes are associated with potential non-B DNA-forming sequences. Nucleic Acids Res 44: 5673-88. doi: 10.1093/nar/gkw261
Balendra R, Isaacs AM (2018) C9orf72-mediated ALS and FTD: multiple pathways to disease. Nat Rev Neurol 14: 544-558. doi: 10.1038/s41582-018-0047-2
Bassuny WM, Ihara K, Sasaki Y, Kuromaru R, Kohno H, Matsuura N, Hara T (2003) A functional polymorphism in the promoter/enhancer region of the FOXP3/Scurfin gene associated with type 1 diabetes. Immunogenetics 55: 149-156. doi: 10.1007/s00251-003-0559-8
Becker JS, Nicetto D, Zaret KS (2016) H3K9me3-Dependent Heterochromatin: Barrier to Cell Fate Changes. Trends Genet 32: 29-41. doi: 10.1016/j.tig.2015.11.001
Belokopytova PS, Nuriddinov MA, Mozheiko EA, Fishman D, Fishman V (2020) Quantitative prediction of enhancer-promoter interactions. Genome Res 30: 72-84. doi: 10.1101/gr.249367.119
Benson G (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res 27: 573-580. doi: 10.1093/nar/27.2.573
Bird TD (1993) Myotonic Dystrophy Type 1. In: Adam MP, Ardinger HH, Pagon RA, Wallace SE, Bean LJH, Stephens K, Amemiya A (eds) GeneReviews(®). University of Washington, Seattle
Bonasio R, Tu S, Reinberg D (2010) Molecular Signals of Epigenetic States. Science 330: 612. doi: 10.1126/science.1191078
Cai Y, Zhang Y, Loh YP, Tng JQ, Lim MC, Cao Z, Raju A, Lieberman Aiden E, Li S, Manikandan L, Tergaonkar V, Tucker-Kellogg G, Fullwood MJ (2021) H3K27me3-rich genomic regions can function as silencers to repress gene expression via chromatin interactions. Nat Commun 12: 719. doi: 10.1038/s41467-021-20940-y
Cer RZ, Donohue DE, Mudunuri US, Temiz NA, Loss MA, Starner NJ, Halusa GN, Volfovsky N, Yi M, Luke BT, Bacolla A, Collins JR, Stephens RM (2013a) Non-B DB v2.0: a database of predicted non-B DNA-forming motifs and its associated tools. Nucleic Acids Res 41: D94-D100. doi: 10.1093/nar/gks955
Cer RZ, Donohue DE, Mudunuri US, Temiz NA, Loss MA, Starner NJ, Halusa GN, Volfovsky N, Yi M, Luke BT, Bacolla A, Collins JR, Stephens RM (2013b) Non-B DB v2.0: a database of predicted non-B DNA-forming motifs and its associated tools. Nucleic Acids Res 41: D94-D100. doi: 10.1093/nar/gks955
Charlesworth B, Sniegowski P, Stephan W (1994) The evolutionary dynamics of repetitive DNA in eukaryotes. Nature 371: 215-20. doi: 10.1038/371215a0
Cheng J, Nguyen TYD, Cygan KJ, Çelik MH, Fairbrother WG, Avsec Ž, Gagneur J (2019) MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biol 20: 48. doi: 10.1186/s13059-019-1653-z
Choudhary K, Lai YH, Tran EJ, Aviran S (2019) dStruct: identifying differentially reactive regions from RNA structurome profiling data. Genome Biol 20: 40. doi: 10.1186/s13059-019-1641-3
Clay FE, Cork MJ, Tarlow JK, Blakemore AI, Harrington CI, Lewis F, Duff GW (1994) Interleukin 1 receptor antagonist gene polymorphism association with lichen sclerosus. Hum Genet 94: 407-10. doi: 10.1007/BF00201602
Conlon EG, Lu L, Sharma A, Yamazaki T, Tang T, Shneider NA, Manley JL (2016) The C9ORF72 GGGGCC expansion forms RNA G-quadruplex inclusions and sequesters hnRNP H to disrupt splicing in ALS brains. Elife 5. doi: 10.7554/eLife.17820
Consortium EP (2011) A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biol 9: e1001046. doi: 10.1371/journal.pbio.1001046
Dashnow H, Lek M, Phipson B, Halman A, Sadedin S, Lonsdale A, Davis M, Lamont P, Clayton JS, Laing NG, MacArthur DG, Oshlack A (2018) STRetch: detecting and discovering pathogenic short tandem repeat expansions. Genome Biol 19: 121. doi: 10.1186/s13059-018-1505-2
Davis CA, Hitz BC, Sloan CA, Chan ET, Davidson JM, Gabdank I, Hilton JA, Jain K, Baymuradov UK, Narayanan AK, Onate KC, Graham K, Miyasato SR, Dreszer TR, Strattan JS, Jolanki O, Tanaka FY, Cherry JM (2018) The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res 46: D794-D801. doi: 10.1093/nar/gkx1081
de Wit E, Vos ES, Holwerda SJ, Valdes-Quezada C, Verstegen MJ, Teunissen H, Splinter E, Wijchers PJ, Krijger PH, de Laat W (2015) CTCF Binding Polarity Determines Chromatin Looping. Mol Cell 60: 676-84. doi: 10.1016/j.molcel.2015.09.023
Den Dunnen WFA (2017) Trinucleotide repeat disorders. Handb Clin Neurol 145: 383-391. doi: 10.1016/b978-0-12-802395-2.00027-4
Depienne C, Mandel JL (2021) 30 years of repeat expansion disorders: What have we learned and what are the remaining challenges? Am J Hum Genet. doi: 10.1016/j.ajhg.2021.03.011
Dettori LG, Torrejon D, Chakraborty A, Dutta A, Mohamed M, Papp C, Kuznetsov VA, Sung P, Feng W, Bah A (2021) A Tale of Loops and Tails: The Role of Intrinsically Disordered Protein Regions in R-Loop Recognition and Phase Separation. Front Mol Biosci 8: 691694. doi: 10.3389/fmolb.2021.691694
Dolzhenko E, Bennett MF, Richmond PA, Trost B, Chen S, van Vugt J, Nguyen C, Narzisi G, Gainullin VG, Gross AM, Lajoie BR, Taft RJ, Wasserman WW, Scherer SW, Veldink JH, Bentley DR, Yuen RKC, Bahlo M, Eberle MA (2020) ExpansionHunter Denovo: a computational method for locating known and novel repeat expansions in short-read sequencing data. Genome Biol 21: 102. doi: 10.1186/s13059-020-02017-z
Dolzhenko E, van Vugt J, Shaw RJ, Bekritsky MA, van Blitterswijk M, Narzisi G, Ajay SS, Rajan V, Lajoie BR, Johnson NH, Kingsbury Z, Humphray SJ, Schellevis RD, Brands WJ, Baker M, Rademakers R, Kooyman M, Tazelaar GHP, van Es MA, McLaughlin R, Sproviero W, Shatunov A, Jones A, Al Khleifat A, Pittman A, Morgan S, Hardiman O, Al-Chalabi A, Shaw C, Smith B, Neo EJ, Morrison K, Shaw PJ, Reeves C, Winterkorn L, Wexler NS, Group US-VCR, Housman DE, Ng CW, Li AL, Taft RJ, van den Berg LH, Bentley DR, Veldink JH, Eberle MA (2017) Detection of long repeat expansions from PCR-free whole-genome sequence data. Genome Res 27: 1895-1903. doi: 10.1101/gr.225672.117
Du X, Wojtowicz D, Bowers AA, Levens D, Benham CJ, Przytycka TM (2013) The genome-wide distribution of non-B DNA motifs is shaped by operon structure and suggests the transcriptional importance of non-B DNA structures in Escherichia coli. Nucleic Acids Res 41: 5965-77. doi: 10.1093/nar/gkt308
Eckelmann BJ, Bacolla A, Wang H, Ye Z, Guerrero EN, Jiang W, El-Zein R, Hegde ML, Tomkinson AE, Tainer JA, Mitra S (2020) XRCC1 promotes replication restart, nascent fork degradation and mutagenic DNA repair in BRCA2-deficient cells. NAR Cancer 2: zcaa013. doi: 10.1093/narcan/zcaa013
Eddy J, Maizels N (2008) Conserved elements with potential to form polymorphic G-quadruplex structures in the first intron of human genes. Nucleic Acids Res 36: 1321-33. doi: 10.1093/nar/gkm1138
Figueroa KP, Farooqi S, Harrup K, Frank J, O'Rahilly S, Pulst SM (2009) Genetic variance in the spinocerebellar ataxia type 2 (ATXN2) gene in children with severe early onset obesity. PLoS One 4: e8280. doi: 10.1371/journal.pone.0008280
Flower MD, Tabrizi SJ (2020) A small molecule kicks repeat expansion into reverse. Nat Genet 52: 136-137. doi: 10.1038/s41588-020-0577-6
Fornes O, Castro-Mondragon JA, Khan A, van der Lee R, Zhang X, Richmond PA, Modi BP, Correard S, Gheorghe M, Baranašić D, Santana-Garcia W, Tan G, Chèneby J, Ballester B, Parcy F, Sandelin A, Lenhard B, Wasserman WW, Mathelier A (2019) JASPAR 2020: update of the open-access database of transcription factor binding profiles. Nucleic Acids Research 48: D87-D92. doi: 10.1093/nar/gkz1001
Fotsing SF, Margoliash J, Wang C, Saini S, Yanicky R, Shleizer-Burko S, Goren A, Gymrek M (2019) The impact of short tandem repeat variation on gene expression. Nat Genet 51: 1652-1659. doi: 10.1038/s41588-019-0521-9
Frankish A, Diekhans M, Ferreira A-M, Johnson R, Jungreis I, Loveland J, Mudge JM, Sisu C, Wright J, Armstrong J, Barnes I, Berry A, Bignell A, Carbonell Sala S, Chrast J, Cunningham F, Di Domenico T, Donaldson S, Fiddes IT, García Girón C, Gonzalez JM, Grego T, Hardy M, Hourlier T, Hunt T, Izuogu OG, Lagarde J, Martin FJ, Martínez L, Mohanan S, Muir P, Navarro FCP, Parker A, Pei B, Pozo F, Ruffier M, Schmitt BM, Stapleton E, Suner M-M, Sycheva I, Uszczynska-Ratajczak B, Xu J, Yates A, Zerbino D, Zhang Y, Aken B, Choudhary JS, Gerstein M, Guigó R, Hubbard TJP, Kellis M, Paten B, Reymond A, Tress ML, Flicek P (2019) GENCODE reference annotation for the human and mouse genomes. Nucleic Acids Research 47: D766-D773. doi: 10.1093/nar/gky955
Freibaum BD, Taylor JP (2017) The Role of Dipeptide Repeats in C9ORF72-Related ALS-FTD. Front Mol Neurosci 10: 35. doi: 10.3389/fnmol.2017.00035
Freudenreich CH (2018) R-loops: targets for nuclease cleavage and repeat instability. Curr Genet 64: 789-794. doi: 10.1007/s00294-018-0806-z
Gatto EM, Rojas NG, Persi G, Etcheverry JL, Cesarini ME, Perandones C (2020) Huntington disease: Advances in the understanding of its mechanisms. Clin Park Relat Disord 3: 100056. doi: 10.1016/j.prdoa.2020.100056
Gijselinck I, Van Mossevelde S, van der Zee J, Sieben A, Engelborghs S, De Bleecker J, Ivanoiu A, Deryck O, Edbauer D, Zhang M, Heeman B, Baumer V, Van den Broeck M, Mattheijssens M, Peeters K, Rogaeva E, De Jonghe P, Cras P, Martin JJ, de Deyn PP, Cruts M, Van Broeckhoven C (2016) The C9orf72 repeat size correlates with onset age of disease, DNA methylation and transcriptional downregulation of the promoter. Mol Psychiatry 21: 1112-24. doi: 10.1038/mp.2015.159
Ginno PA, Lott PL, Christensen HC, Korf I, Chedin F (2012) R-loop formation is a distinctive characteristic of unmethylated human CpG island promoters. Mol Cell 45: 814-25. doi: 10.1016/j.molcel.2012.01.017
Grant CE, Bailey TL, Noble WS (2011) FIMO: scanning for occurrences of a given motif. Bioinformatics (Oxford, England) 27: 1017-1018. doi: 10.1093/bioinformatics/btr064
Gray LT, Vallur AC, Eddy J, Maizels N (2014) G quadruplexes are genomewide targets of transcriptional helicases XPB and XPD. Nat Chem Biol 10: 313-8. doi: 10.1038/nchembio.1475
Grishchenko IV, Purvinsh YV, Yudkin DV (2020) Mystery of Expansion: DNA Metabolism and Unstable Repeats. Adv Exp Med Biol 1241: 101-124. doi: 10.1007/978-3-030-41283-8_7
Groh M, Lufino MM, Wade-Martins R, Gromak N (2014) R-loops associated with triplet repeat expansions promote gene silencing in Friedreich ataxia and fragile X syndrome. PLoS Genet 10: e1004318. doi: 10.1371/journal.pgen.1004318
Guo J, Chen L, Li GM (2017) DNA mismatch repair in trinucleotide repeat instability. Sci China Life Sci 60: 1087-1092. doi: 10.1007/s11427-017-9186-7
Hallinan JP, Doyle LA, Shen BW, Gewe MM, Takushi B, Kennedy MA, Friend D, Roberts JM, Bradley P, Stoddard BL (2021) Design of functionalised circular tandem repeat proteins with longer repeat topologies and enhanced subunit contact surfaces. Commun Biol 4: 1240. doi: 10.1038/s42003-021-02766-y
Hambarde S, Tsai CL, Pandita RK, Bacolla A, Maitra A, Charaka V, Hunt CR, Kumar R, Limbo O, Le Meur R, Chazin WJ, Tsutakawa SE, Russell P, Schlacher K, Pandita TK, Tainer JA (2021) EXO5-DNA structure and BLM interactions direct DNA resection critical for ATR-dependent replication restart. Mol Cell 81: 2989-3006.e9. doi: 10.1016/j.molcel.2021.05.027
Hammel M, Tainer JA (2021) X-ray scattering reveals disordered linkers and dynamic interfaces in complexes and mechanisms for DNA double-strand break repair impacting cell and cancer biology. Protein Sci 30: 1735-1756. doi: 10.1002/pro.4133
Hannan AJ (2018) Tandem repeats mediating genetic plasticity in health and disease. Nat Rev Genet 19: 286-298. doi: 10.1038/nrg.2017.115
Hanson J, Paliwal K, Zhou Y (2018) Accurate Single-Sequence Prediction of Protein Intrinsic Disorder by an Ensemble of Deep Recurrent and Convolutional Architectures. J Chem Inf Model 58: 2369-2376. doi: 10.1021/acs.jcim.8b00636
Heffernan R, Yang Y, Paliwal K, Zhou Y (2017) Capturing non-local interactions by long short-term memory bidirectional recurrent neural networks for improving prediction of protein secondary structure, backbone angles, contact numbers and solvent accessibility. Bioinformatics 33: 2842-2849. doi: 10.1093/bioinformatics/btx218
Hefferon TW, Groman JD, Yurk CE, Cutting GR (2004) A variable dinucleotide repeat in the CFTR gene contributes to phenotype diversity by forming RNA secondary structures that alter splicing. Proc Natl Acad Sci U S A 101: 3504-9. doi: 10.1073/pnas.0400182101
Heintzman ND, Stuart RK, Hon G, Fu Y, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu C, Ching KA, Wang W, Weng Z, Green RD, Crawford GE, Ren B (2007) Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature Genetics 39: 311-318. doi: 10.1038/ng1966
Hire RR, Katrak SM, Vaidya S, Radhakrishnan K, Seshadri M (2011) Spinocerebellar ataxia type 17 in Indian patients: two rare cases of homozygous expansions. Clin Genet 80: 472-7. doi: 10.1111/j.1399-0004.2010.01589.x
Holmes SE, O'Hearn E, Rosenblatt A, Callahan C, Hwang HS, Ingersoll-Ashworth RG, Fleisher A, Stevanin G, Brice A, Potter NT, Ross CA, Margolis RL (2001) A repeat expansion in the gene encoding junctophilin-3 is associated with Huntington disease-like 2. Nat Genet 29: 377-8. doi: 10.1038/ng760
Hui J, Hung LH, Heiner M, Schreiner S, Neumuller N, Reither G, Haas SA, Bindereif A (2005) Intronic CA-repeat and CA-rich elements: a new class of regulators of mammalian alternative splicing. EMBO J 24: 1988-98. doi: 10.1038/sj.emboj.7600677
Jenjaroenpun P, Wongsurawat T, Sutheeworapong S, Kuznetsov VA (2017) R-loopDB: a database for R-loop forming sequences (RLFS) and R-loops. Nucleic Acids Res 45: D119-D127. doi: 10.1093/nar/gkw1054
Jorda J, Xue B, Uversky VN, Kajava AV (2010) Protein tandem repeats - the more perfect, the less structured. FEBS J 277: 2673-82. doi: 10.1111/j.1742-464X.2010.07684.x
Kang H, Shokhirev MN, Xu Z, Chandran S, Dixon JR, Hetzer MW (2020) Dynamic regulation of histone modifications and long-range chromosomal interactions during postmitotic transcriptional reactivation. Genes Dev 34: 913-930. doi: 10.1101/gad.335794.119
Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alfoldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, Gauthier LD, Brand H, Solomonson M, Watts NA, Rhodes D, Singer-Berk M, England EM, Seaby EG, Kosmicki JA, Walters RK, Tashman K, Farjoun Y, Banks E, Poterba T, Wang A, Seed C, Whiffin N, Chong JX, Samocha KE, Pierce-Hoffman E, Zappala Z, O'Donnell-Luria AH, Minikel EV, Weisburd B, Lek M, Ware JS, Vittal C, Armean IM, Bergelson L, Cibulskis K, Connolly KM, Covarrubias M, Donnelly S, Ferriera S, Gabriel S, Gentry J, Gupta N, Jeandet T, Kaplan D, Llanwarne C, Munshi R, Novod S, Petrillo N, Roazen D, Ruano-Rubio V, Saltzman A, Schleicher M, Soto J, Tibbetts K, Tolonen C, Wade G, Talkowski ME, Genome Aggregation Database C, Neale BM, Daly MJ, MacArthur DG (2020) The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581: 434-443. doi: 10.1038/s41586-020-2308-7
Ke Y, Rao J, Zhao H, Lu Y, Xiao N, Yang Y (2020) Accurate prediction of genome-wide RNA secondary structure profile based on extreme gradient boosting. Bioinformatics 36: 4576-4582. doi: 10.1093/bioinformatics/btaa534
Kentepozidou E, Aitken SJ, Feig C, Stefflova K, Ibarra-Soria X, Odom DT, Roller M, Flicek P (2020) Clustered CTCF binding is an evolutionary mechanism to maintain topologically associating domains. Genome Biol 21: 5. doi: 10.1186/s13059-019-1894-x
Khristich AN, Mirkin SM (2020) On the wrong DNA track: Molecular mechanisms of repeat-mediated genome instability. J Biol Chem 295: 4134-4170. doi: 10.1074/jbc.REV119.007678
Kim MW, Chelliah Y, Kim SW, Otwinowski Z, Bezprozvanny I (2009) Secondary structure of Huntingtin amino-terminal region. Structure 17: 1205-12. doi: 10.1016/j.str.2009.08.002
Kloster E, Saft C, Epplen JT, Arning L (2013) CNR1 variation is associated with the age at onset in Huntington disease. Eur J Med Genet 56: 416-9. doi: 10.1016/j.ejmg.2013.05.007
Koutsis G, Karadima G, Pandraud A, Sweeney MG, Paudel R, Houlden H, Wood NW, Panas M (2012) Genetic screening of Greek patients with Huntington's disease phenocopies identifies an SCA8 expansion. J Neurol 259: 1874-8. doi: 10.1007/s00415-012-6430-9
Kristensen VN, Andersen TI, Lindblom A, Erikstein B, Magnus P, Borresen-Dale AL (1998) A rare CYP19 (aromatase) variant may increase the risk of breast cancer. Pharmacogenetics 8: 43-8. doi: 10.1097/00008571-199802000-00006
Krzyzosiak WJ, Sobczak K, Wojciechowska M, Fiszer A, Mykowska A, Kozlowski P (2012) Triplet repeat RNA structure and its role as pathogenic agent and therapeutic target. Nucleic Acids Res 40: 11-26. doi: 10.1093/nar/gkr729
Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, Ziller MJ, Amin V, Whitaker JW, Schultz MD, Ward LD, Sarkar A, Quon G, Sandstrom RS, Eaton ML, Wu Y-C, Pfenning AR, Wang X, Claussnitzer M, Liu Y, Coarfa C, Harris RA, Shoresh N, Epstein CB, Gjoneska E, Leung D, Xie W, Hawkins RD, Lister R, Hong C, Gascard P, Mungall AJ, Moore R, Chuah E, Tam A, Canfield TK, Hansen RS, Kaul R, Sabo PJ, Bansal MS, Carles A, Dixon JR, Farh K-H, Feizi S, Karlic R, Kim A-R, Kulkarni A, Li D, Lowdon R, Elliott G, Mercer TR, Neph SJ, Onuchic V, Polak P, Rajagopal N, Ray P, Sallari RC, Siebenthall KT, Sinnott-Armstrong NA, Stevens M, Thurman RE, Wu J, Zhang B, Zhou X, Beaudet AE, Boyer LA, Jager PLD, Farnham PJ, Fisher SJ, Haussler D, Jones SJM, Li W, Marra MA, McManus MT, Sunyaev S, Thomson JA, Tlsty TD, Tsai L-H, Wang W, Waterland RA, Zhang MQ, Chadwick LH, Bernstein BE, Costello JF, Ecker JR, Hirst M, Meissner A, Milosavljevic A, Ren B, Stamatoyannopoulos JA, Wang T, Kellis M (2015) Integrative analysis of 111 reference human epigenomes. Nature 518: 317-330. doi: 10.1038/nature14248
Kuznetsov VA, Bondarenko V, Wongsurawat T, Yenamandra SP, Jenjaroenpun P (2018) Toward predictive R-loop computational biology: genome-scale prediction of R-loops reveals their association with complex promoter structures, G-quadruplexes and transcriptionally active enhancers. Nucleic Acids Res 46: 7566-7585. doi: 10.1093/nar/gky554
Lai Y, Beaver JM, Laverde E, Liu Y (2020) Trinucleotide repeat instability via DNA base excision repair. DNA Repair (Amst) 93: 102912. doi: 10.1016/j.dnarep.2020.102912
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, Funke R, Gage D, Harris K, Heaford A, Howland J, Kann L, Lehoczky J, LeVine R, McEwan P, McKernan K, Meldrim J, Mesirov JP, Miranda C, Morris W, Naylor J, Raymond C, Rosetti M, Santos R, Sheridan A, Sougnez C, Stange-Thomann Y, Stojanovic N, Subramanian A, Wyman D, Rogers J, Sulston J, Ainscough R, Beck S, Bentley D, Burton J, Clee C, Carter N, Coulson A, Deadman R, Deloukas P, Dunham A, Dunham I, Durbin R, French L, Grafham D, Gregory S, Hubbard T, Humphray S, Hunt A, Jones M, Lloyd C, McMurray A, Matthews L, Mercer S, Milne S, Mullikin JC, Mungall A, Plumb R, Ross M, Shownkeen R, Sims S, Waterston RH, Wilson RK, Hillier LW, McPherson JD, Marra MA, Mardis ER, Fulton LA, Chinwalla AT, Pepin KH, Gish WR, Chissoe SL, Wendl MC, Delehaunty KD, Miner TL, Delehaunty A, Kramer JB, Cook LL, Fulton RS, Johnson DL, Minx PJ, Clifton SW, Hawkins T, Branscomb E, Predki P, Richardson P, Wenning S, Slezak T, Doggett N, Cheng JF, Olsen A, Lucas S, Elkin C, Uberbacher E, Frazier M, et al. (2001) Initial sequencing and analysis of the human genome. Nature 409: 860-921. doi: 10.1038/35057062
Lanni S, Pearson CE (2019) Molecular genetics of congenital myotonic dystrophy. Neurobiol Dis 132: 104533. doi: 10.1016/j.nbd.2019.104533
Laverde EE, Lai Y, Leng F, Balakrishnan L, Freudenreich CH, Liu Y (2020) R-loops promote trinucleotide repeat deletion through DNA base excision repair enzymatic activities. J Biol Chem 295: 13902-13913. doi: 10.1074/jbc.RA120.014161
Levinson G, Gutman GA (1987) Slipped-strand mispairing: a major mechanism for DNA sequence evolution. Mol Biol Evol 4: 203-21. doi: 10.1093/oxfordjournals.molbev.a040442
Li Y, Roberts ND, Wala JA, Shapira O, Schumacher SE, Kumar K, Khurana E, Waszak S, Korbel JO, Haber JE, Imielinski M, Group PSVW, Weischenfeldt J, Beroukhim R, Campbell PJ, Consortium P (2020) Patterns of somatic structural variation in human cancer genomes. Nature 578: 112-121. doi: 10.1038/s41586-019-1913-9
Libby RT, Hagerman KA, Pineda VV, Lau R, Cho DH, Baccam SL, Axford MM, Cleary JD, Moore JM, Sopher BL, Tapscott SJ, Filippova GN, Pearson CE, La Spada AR (2008) CTCF cis-regulates trinucleotide repeat instability in an epigenetic manner: a novel basis for mutational hot spot determination. PLoS Genet 4: e1000257. doi: 10.1371/journal.pgen.1000257
Liquori CL, Ricker K, Moseley ML, Jacobsen JF, Kress W, Naylor SL, Day JW, Ranum LP (2001) Myotonic dystrophy type 2 caused by a CCTG expansion in intron 1 of ZNF9. Science 293: 864-7. doi: 10.1126/science.1062125
Liu Y, Wilson SH (2012) DNA base excision repair: a mechanism of trinucleotide repeat expansion. Trends Biochem Sci 37: 162-72. doi: 10.1016/j.tibs.2011.12.002
Loomis EW, Sanz LA, Chedin F, Hagerman PJ (2014) Transcription-associated R-loop formation across the human FMR1 CGG-repeat region. PLoS Genet 10: e1004294. doi: 10.1371/journal.pgen.1004294
Lorentzon M, Swanson C, Eriksson AL, Mellstrom D, Ohlsson C (2006) Polymorphisms in the aromatase gene predict areal BMD as a result of affected cortical bone size: the GOOD study. J Bone Miner Res 21: 332-9. doi: 10.1359/JBMR.051026
Lorenz R, Bernhart SH, Honer Zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL (2011) ViennaRNA Package 2.0. Algorithms Mol Biol 6: 26. doi: 10.1186/1748-7188-6-26
Loureiro JR, Oliveira CL, Silveira I (2016) Unstable repeat expansions in neurodegenerative diseases: nucleocytoplasmic transport emerges on the scene. Neurobiol Aging 39: 174-83. doi: 10.1016/j.neurobiolaging.2015.12.007
Ma X, Qi X, Chen C, Lin H, Xiong H, Li Y, Jiang J (2010) Association between CYP19 polymorphisms and breast cancer risk: results from 10,592 cases and 11,720 controls. Breast Cancer Res Treat 122: 495-501. doi: 10.1007/s10549-009-0693-6
Mackay RP, Xu Q, Weinberger PM (2020) R-Loop Physiology and Pathology: A Brief Review. DNA Cell Biol 39: 1914-1925. doi: 10.1089/dna.2020.5906
Madeira JLO, Souza ABC, Cunha FS, Batista RL, Gomes NL, Rodrigues AS, Mennucci de Haidar Jorge F, Chadi G, Callegaro D, Mendonca BB, Costa EMF, Domenice S (2018) A severe phenotype of Kennedy disease associated with a very large CAG repeat expansion. Muscle Nerve 57: E95-E97. doi: 10.1002/mus.25952
Mahadevan M, Tsilfidis C, Sabourin L, Shutler G, Amemiya C, Jansen G, Neville C, Narang M, Barceló J, O'Hoy K, et al. (1992) Myotonic dystrophy mutation: an unstable CTG repeat in the 3' untranslated region of the gene. Science 255: 1253-5. doi: 10.1126/science.1546325
Maiuri T, Suart CE, Hung CLK, Graham KJ, Barba Bazan CA, Truant R (2019) DNA Damage Repair in Huntington's Disease and Other Neurodegenerative Diseases. Neurotherapeutics 16: 948-956. doi: 10.1007/s13311-019-00768-7
Malik I, Kelley CP, Wang ET, Todd PK (2021) Molecular mechanisms underlying nucleotide repeat expansion disorders. Nat Rev Mol Cell Biol 22: 589-607. doi: 10.1038/s41580-021-00382-6
Malla B, Guo X, Senger G, Chasapopoulou Z, Yildirim F (2021) A Systematic Review of Transcriptional Dysregulation in Huntington's Disease Studied by RNA Sequencing. Front Genet 12: 751033. doi: 10.3389/fgene.2021.751033
Melamed O, Behar DM, Bram C, Magal N, Pras E, Reznik-Wolf H, Borochowitz ZU, Davidov B, Mor-Cohen R, Baris HN (2015) Founder mutation for Huntington disease in Caucasus Jews. Clin Genet 87: 167-72. doi: 10.1111/cge.12344
Minnoye L, Marinov GK, Krausgruber T, Pan L, Marand AP, Secchia S, Greenleaf WJ, Furlong EEM, Zhao K, Schmitz RJ, Bock C, Aerts S (2021) Chromatin accessibility profiling methods. Nature Reviews Methods Primers 1: 1-24. doi: 10.1038/s43586-020-00008-9
Mirkin SM (2007) Expandable DNA repeats and human disease. Nature 447: 932-40. doi: 10.1038/nature05977
Mitsuhashi S, Frith MC, Matsumoto N (2021) Genome-wide survey of tandem repeats by nanopore sequencing shows that disease-associated repeats are more polymorphic in the general population. BMC Med Genomics 14: 17. doi: 10.1186/s12920-020-00853-3
Mitsuhashi S, Frith MC, Mizuguchi T, Miyatake S, Toyota T, Adachi H, Oma Y, Kino Y, Mitsuhashi H, Matsumoto N (2019) Tandem-genotypes: robust detection of tandem repeat expansions from long DNA reads. Genome Biol 20: 58. doi: 10.1186/s13059-019-1667-6
Mitsuhashi S, Matsumoto N (2020) Long-read sequencing for rare human genetic diseases. J Hum Genet 65: 11-19. doi: 10.1038/s10038-019-0671-8
Mooers BH, Logue JS, Berglund JA (2005) The structural basis of myotonic dystrophy from the crystal structure of CUG repeats. Proc Natl Acad Sci U S A 102: 16626-31. doi: 10.1073/pnas.0505873102
Neil AJ, Liang MU, Khristich AN, Shah KA, Mirkin SM (2018) RNA-DNA hybrids promote the expansion of Friedreich's ataxia (GAA)n repeats via break-induced replication. Nucleic Acids Res 46: 3487-3497. doi: 10.1093/nar/gky099
Niehrs C, Luke B (2020) Regulatory R-loops as facilitators of gene expression and genome stability. Nat Rev Mol Cell Biol 21: 167-178. doi: 10.1038/s41580-019-0206-3
Oldfield CJ, Dunker AK (2014) Intrinsically disordered proteins and intrinsically disordered protein regions. Annu Rev Biochem 83: 553-84. doi: 10.1146/annurev-biochem-072711-164947
Ong C-T, Corces VG (2014) CTCF: an architectural protein bridging genome topology and function. Nature Reviews Genetics 15: 234-246. doi: 10.1038/nrg3663
Orr HT, Zoghbi HY (2007) Trinucleotide repeat disorders. Annu Rev Neurosci 30: 575-621. doi: 10.1146/annurev.neuro.29.051605.113042
Paulson H (2018) Repeat expansion diseases. Handb Clin Neurol 147: 105-123. doi: 10.1016/B978-0-444-63233-3.00009-9
Peters AHFM, Kubicek S, Mechtler K, O'Sullivan RJ, Derijck AAHA, Perez-Burgos L, Kohlmaier A, Opravil S, Tachibana M, Shinkai Y, Martens JHA, Jenuwein T (2003) Partitioning and Plasticity of Repressive Histone Methylation States in Mammalian Chromatin. Molecular Cell 12: 1577-1589. doi: 10.1016/S1097-2765(03)00477-5
Peters JM (2021) How DNA loop extrusion mediated by cohesin enables V(D)J recombination. Curr Opin Cell Biol 70: 75-83. doi: 10.1016/j.ceb.2020.11.007
Phillips JE, Corces VG (2009) CTCF: master weaver of the genome. Cell 137: 1194-211. doi: 10.1016/j.cell.2009.06.001
Polak P, Domany E (2006) Alu elements contain many binding sites for transcription factors and may play a role in regulation of developmental processes. BMC Genomics 7: 133. doi: 10.1186/1471-2164-7-133
Pugacheva EM, Kubo N, Loukinov D, Tajmul M, Kang S, Kovalchuk AL, Strunnikov AV, Zentner GE, Ren B, Lobanenkov VV (2020) CTCF mediates chromatin looping via N-terminal domain-dependent cohesin retention. Proceedings of the National Academy of Sciences 117.
Qin Q, Fan J, Zheng R, Wan C, Mei S, Wu Q, Sun H, Brown M, Zhang J, Meyer CA, Liu XS (2020) Lisa: inferring transcriptional regulators through integrative modeling of public chromatin accessibility and ChIP-seq data. Genome Biol 21: 32. doi: 10.1186/s13059-020-1934-6
Quarrell OW, Rigby AS, Barron L, Crow Y, Dalton A, Dennis N, Fryer AE, Heydon F, Kinning E, Lashwood A, Losekoot M, Margerison L, McDonnell S, Morrison PJ, Norman A, Peterson M, Raymond FL, Simpson S, Thompson E, Warner J (2007) Reduced penetrance alleles for Huntington's disease: a multi-centre direct observational study. J Med Genet 44: e68. doi: 10.1136/jmg.2006.045120
Quilez J, Guilmatre A, Garg P, Highnam G, Gymrek M, Erlich Y, Joshi RS, Mittelman D, Sharp AJ (2016) Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans. Nucleic Acids Res 44: 3750-62. doi: 10.1093/nar/gkw219
Quinlan AR, Hall IM (2010a) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841-842. doi: 10.1093/bioinformatics/btq033
Quinlan AR, Hall IM (2010b) BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26: 841-2. doi: 10.1093/bioinformatics/btq033
Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL (2014) A 3D Map of the Human Genome at Kilobase Resolution Reveals Principles of Chromatin Looping. Cell 159: 1665-1680. doi: 10.1016/j.cell.2014.11.021
Ratmeyer L, Vinayak R, Zhong YY, Zon G, Wilson WD (1994) Sequence specific thermodynamic and structural properties for DNA.RNA duplexes. Biochemistry 33: 5298-304. doi: 10.1021/bi00183a037
Reddy K, Zamiri B, Stanley SYR, Macgregor RB, Jr., Pearson CE (2013) The disease-associated r(GGGGCC)n repeat from the C9orf72 gene forms tract length-dependent uni- and multimolecular RNA G-quadruplex structures. J Biol Chem 288: 9860-9866. doi: 10.1074/jbc.C113.452532
Rentzsch P, Schubach M, Shendure J, Kircher M (2021) CADD-Splice: improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Medicine 13: 31. doi: 10.1186/s13073-021-00835-9
Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M (2019) CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 47: D886-D894. doi: 10.1093/nar/gky1016
Rice G, Rebeiz M (2019) Evolution: How Many Phenotypes Do Regulatory Mutations Affect? Curr Biol 29: R21-R23. doi: 10.1016/j.cub.2018.11.027
Robin G, Lopez JR, Espinal GM, Hulsizer S, Hagerman PJ, Pessah IN (2017) Calcium dysregulation and Cdk5-ATM pathway involved in a mouse model of fragile X-associated tremor/ataxia syndrome. Hum Mol Genet 26: 2649-2666. doi: 10.1093/hmg/ddx148
Rodriguez CM, Todd PK (2019) New pathologic mechanisms in nucleotide repeat expansion disorders. Neurobiol Dis 130: 104515. doi: 10.1016/j.nbd.2019.104515
Roh TY, Cuddapah S, Cui K, Zhao K (2006) The genomic landscape of histone modifications in human T cells. Proc Natl Acad Sci U S A 103: 15782-7. doi: 10.1073/pnas.0607617103
Roman T, Schmitz M, Polanczyk GV, Eizirik M, Rohde LA, Hutz MH (2002) Further evidence for the association between attention-deficit/hyperactivity disorder and the dopamine-beta-hydroxylase gene. Am J Med Genet 114: 154-8. doi: 10.1002/ajmg.10194
Santoro M, Masciullo M, Silvestri G, Novelli G, Botta A (2017) Myotonic dystrophy type 1: role of CCG, CTC and CGG interruptions within DMPK alleles in the pathogenesis and molecular diagnosis. Clin Genet 92: 355-364. doi: 10.1111/cge.12954
Santos-Pereira JM, Aguilera A (2015) R loops: new modulators of genome dynamics and function. Nat Rev Genet 16: 583-97. doi: 10.1038/nrg3961
Schmidt D, Schwalie PC, Wilson MD, Ballester B, Goncalves A, Kutter C, Brown GD, Marshall A, Flicek P, Odom DT (2012) Waves of retrotransposon expansion remodel genome organization and CTCF binding in multiple mammalian lineages. Cell 148: 335-48. doi: 10.1016/j.cell.2011.11.058
Schmidt MHM, Pearson CE (2016) Disease-associated repeat instability and mismatch repair. DNA Repair (Amst) 38: 117-126. doi: 10.1016/j.dnarep.2015.11.008
Schoenfelder S, Fraser P (2019) Long-range enhancer–promoter contacts in gene expression control. Nature Reviews Genetics 20: 437-455. doi: 10.1038/s41576-019-0128-0
Schreiber J, Durham T, Bilmes J, Noble WS (2020) Avocado: a multi-scale deep tensor factorization method learns a latent representation of the human epigenome. Genome Biology 21: 81. doi: 10.1186/s13059-020-01977-6
Sims RJ, 3rd, Reinberg D (2009) Processing the H3K36me3 signature. Nat Genet 41: 270-1. doi: 10.1038/ng0309-270
Smedley D, Schubach M, Jacobsen JOB, Köhler S, Zemojtel T, Spielmann M, Jäger M, Hochheiser H, Washington NL, McMurry JA, et al. (2016) A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease. The American Journal of Human Genetics 99: 595-606.
Smola MJ, Calabrese JM, Weeks KM (2015) Detection of RNA-Protein Interactions in Living Cells with SHAPE. Biochemistry 54: 6867-75. doi: 10.1021/acs.biochem.5b00977
Sobczak K, de Mezer M, Michlewski G, Krol J, Krzyzosiak WJ (2003) RNA structure of trinucleotide repeats associated with human neurological diseases. Nucleic Acids Res 31: 5469-82. doi: 10.1093/nar/gkg766
Stenson PD, Mort M, Ball EV, Chapman M, Evans K, Azevedo L, Hayden M, Heywood S, Millar DS, Phillips AD, Cooper DN (2020) The Human Gene Mutation Database (HGMD®): optimizing its use in a clinical diagnostic or research setting. Hum Genet 139: 1197-1207. doi: 10.1007/s00439-020-02199-3
Su XA, Freudenreich CH (2017) Cytosine deamination and base excision repair cause R-loop-induced CAG repeat fragility and instability in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A 114: E8392-E8401. doi: 10.1073/pnas.1711283114
Sun H, Satake W, Zhang C, Nagai Y, Tian Y, Fu S, Yu J, Qian Y, Qian Y, Chu J, Toda T (2011) Genetic and clinical analysis in a Chinese parkinsonism-predominant spinocerebellar ataxia type 2 family. J Hum Genet 56: 330-4. doi: 10.1038/jhg.2011.14
Swami M, Hendricks AE, Gillis T, Massood T, Mysore J, Myers RH, Wheeler VC (2009) Somatic expansion of the Huntington's disease CAG repeat in the brain is associated with an earlier age of disease onset. Hum Mol Genet 18: 3039-47. doi: 10.1093/hmg/ddp242
Tabrizi SJ, Flower MD, Ross CA, Wild EJ (2020) Huntington disease: new insights into molecular pathogenesis and therapeutic opportunities. Nat Rev Neurol 16: 529-546. doi: 10.1038/s41582-020-0389-4
Tang H, Kirkness EF, Lippert C, Biggs WH, Fabani M, Guzman E, Ramakrishnan S, Lavrenko V, Kakaradov B, Hou C, Hicks B, Heckerman D, Och FJ, Caskey CT, Venter JC, Telenti A (2017) Profiling of Short-Tandem-Repeat Disease Alleles in 12,632 Human Whole Genomes. Am J Hum Genet 101: 700-715. doi: 10.1016/j.ajhg.2017.09.013
Tang Z, Luo OJ, Li X, Zheng M, Zhu JJ, Szalaj P, Trzaskoma P, Magalska A, Wlodarczyk J, Ruszczycki B, Michalski P, Piecuch E, Wang P, Wang D, Tian SZ, Penrad-Mobayed M, Sachs LM, Ruan X, Wei CL, Liu ET, Wilczynski GM, Plewczynski D, Li G, Ruan Y (2015) CTCF-Mediated Human 3D Genome Architecture Reveals Chromatin Topology for Transcription. Cell 163: 1611-27. doi: 10.1016/j.cell.2015.11.024
Tankard RM, Bennett MF, Degorski P, Delatycki MB, Lockhart PJ, Bahlo M (2018) Detecting Expansions of Tandem Repeats in Cohorts Sequenced with Short-Read Sequencing Data. Am J Hum Genet 103: 858-873. doi: 10.1016/j.ajhg.2018.10.015
Thapar R, Wang JL, Hammel M, Ye R, Liang K, Sun C, Hnizda A, Liang S, Maw SS, Lee L, Villarreal H, Forrester I, Fang S, Tsai MS, Blundell TL, Davis AJ, Lin C, Lees-Miller SP, Strick TR, Tainer JA (2021) Mechanism of efficient double-strand break repair by a long non-coding RNA. Nucleic Acids Res 49: 1199-1200. doi: 10.1093/nar/gkaa1233
Tsuge M, Hamamoto R, Silva FP, Ohnishi Y, Chayama K, Kamatani N, Furukawa Y, Nakamura Y (2005) A variable number of tandem repeats polymorphism in an E2F-1 binding element in the 5' flanking region of SMYD3 is a risk factor for human cancers. Nat Genet 37: 1104-7. doi: 10.1038/ng1638
Tsutakawa SE, Thompson MJ, Arvai AS, Neil AJ, Shaw SJ, Algasaier SI, Kim JC, Finger LD, Jardine E, Gotham VJB, Sarker AH, Her MZ, Rashid F, Hamdan SM, Mirkin SM, Grasby JA, Tainer JA (2017) Phosphate steering by Flap Endonuclease 1 promotes 5'-flap specificity and incision to prevent genome instability. Nat Commun 8: 15855. doi: 10.1038/ncomms15855
Uversky VN (2020) Functions of short lifetime biological structures at large: the case of intrinsically disordered proteins. Brief Funct Genomics 19: 60-68. doi: 10.1093/bfgp/ely023
van Ruiten MS, Rowland BD (2021) On the choreography of genome folding: A grand pas de deux of cohesin and CTCF. Curr Opin Cell Biol 70: 84-90. doi: 10.1016/j.ceb.2020.12.001
Wan Y, Qu K, Zhang QC, Flynn RA, Manor O, Ouyang Z, Zhang J, Spitale RC, Snyder MP, Segal E, Chang HY (2014) Landscape and variation of RNA secondary structure across the human transcriptome. Nature 505: 706-9. doi: 10.1038/nature12946
Wang E, Thombre R, Shah Y, Latanich R, Wang J (2021) G-Quadruplexes as pathogenic drivers in neurodegenerative disorders. Nucleic Acids Research. doi: 10.1093/nar/gkab164
Wang K, Li M, Hakonarson H (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Research 38: e164. doi: 10.1093/nar/gkq603
Wen X, Tan W, Westergard T, Krishnamurthy K, Markandaiah SS, Shi Y, Lin S, Shneider NA, Monaghan J, Pandey UB, Pasinelli P, Ichida JK, Trotti D (2014) Antisense proline-arginine RAN dipeptides linked to C9ORF72-ALS/FTD form toxic nuclear aggregates that initiate in vitro and in vivo neuronal death. Neuron 84: 1213-25. doi: 10.1016/j.neuron.2014.12.010
Whitfield TW, Wang J, Collins PJ, Partridge EC, Aldred SF, Trinklein ND, Myers RM, Weng Z (2012) Functional analysis of transcription factor binding sites in human promoters. Genome Biol 13: R50. doi: 10.1186/gb-2012-13-9-r50
Wongsurawat T, Jenjaroenpun P, Kwoh CK, Kuznetsov V (2012) Quantitative model of R-loop forming structures reveals a novel level of RNA-DNA interactome complexity. Nucleic Acids Res 40: e16. doi: 10.1093/nar/gkr1075
Wu Q, Liu P, Wang L (2020) Many facades of CTCF unified by its coding for three-dimensional genome architecture. J Genet Genomics 47: 407-424. doi: 10.1016/j.jgg.2020.06.008
Wu T, Hu E, Xu S, Chen M, Guo P, Dai Z, Feng T, Zhou L, Tang W, Zhan L, Fu X, Liu S, Bo X, Yu G (2021) clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. Innovation (N Y) 2: 100141. doi: 10.1016/j.xinn.2021.100141
Xi W, Beer MA (2021) Loop competition and extrusion model predicts CTCF interaction specificity. Nat Commun 12: 1-15.
Xita N, Chatzikyriakidou A, Stavrou I, Zois C, Georgiou I, Tsatsoulis A (2010) The (TTTA)n polymorphism of aromatase (CYP19) gene is associated with age at menarche. Hum Reprod 25: 3129-33. doi: 10.1093/humrep/deq276
Xu EH, Tang Y, Li D, Jia JP (2009) Polymorphism of HD and UCHL-1 genes in Huntington's disease. J Clin Neurosci 16: 1473-7. doi: 10.1016/j.jocn.2009.03.027
Xu P, Pan F, Roland C, Sagui C, Weninger K (2020) Dynamics of strand slippage in DNA hairpins formed by CAG repeats: roles of sequence parity and trinucleotide interrupts. Nucleic Acids Res 48: 2232-2245. doi: 10.1093/nar/gkaa036
Ye Z, Xu S, Shi Y, Bacolla A, Syed A, Moiani D, Tsai CL, Shen Q, Peng G, Leonard PG, Jones DE, Wang B, Tainer JA, Ahmed Z (2021) GRB2 enforces homology-directed repair initiation by MRE11. Sci Adv 7. doi: 10.1126/sciadv.abe9254
Zhang Y, An L, Xu J, Zhang B, Zheng WJ, Hu M, Tang J, Yue F (2018) Enhancing Hi-C data resolution with deep convolutional neural network HiCPlus. Nat Commun 9: 1-9.

suppltable220721.xlsx
Supplementary Tables S1-S15

Editorial decision: Major revisions (13 Sep, 2022)
Reviewers agreed at journal: 14 Aug, 2022
Reviewers invited by journal: 14 Aug, 2022
Editor assigned by journal: 02 Aug, 2022
First submitted to journal: 02 Aug, 2022
