Hi-C-LSTM: Learning representations of chromatin contacts using a recurrent neural network identifies genomic drivers of 3D genome conformation


Despite the availability of chromatin conformation capture experiments, discerning the relationship between the 1D genome and 3D conformation remains a challenge, which limits our understanding of their effect on gene expression and disease. We propose Hi-C-LSTM, a method that produces low-dimensional latent representations that summarize intra-chromosomal Hi-C contacts via a recurrent long short-term memory (LSTM) neural network model. We find that these representations contain all the information needed to recreate the original Hi-C matrix with high accuracy, outperforming existing methods. These representations enable the identification of a variety of conformation-defining genomic elements, including nuclear compartments and conformation-related transcription factors. They furthermore enable in-silico perturbation experiments that measure the influence of cis-regulatory elements on conformation.


The organisation of the genome in 3D space inside the nucleus is important to its function. Chromosome conformation capture (3C) techniques, developed in the last couple of decades, have enabled researchers to quantify the strength of interactions between loci that are nearby in space. Hi-C [1] uses a combination of chromatin conformation capture and high-throughput sequencing to assay pairwise chromatin interactions genome-wide. This rich source of data promises to help elucidate the influence of 3D structure on gene expression and thereby on development, evolution and disease. However, we lack a complete understanding of how the 1D genome influences 3D conformation.

The machine learning technique of representation learning [2] provides a way to link the 1D genome to 3D conformation. Representation learning aims to summarize high-dimensional datasets into a low-dimensional representation. It has become a valuable tool for finding compact and informative representations [8]. Representation learning has recently been applied to genomic sequences [9, 10] and Hi-C data [11][12][13][14].

In order to understand the 1D-3D relationship and thereby link 3D conformation to genetic variation and disease, we need representations for Hi-C data that can summarize the contact map into a locus-level summary. Such a representation would encapsulate all the contacts from each genomic position to the others into a small number of features per locus, such that the contacts can be reproduced using just the features. Reducing the Hi-C map to locus-level representations in this way would allow us to study the effect of sequence elements on chromatin conformation, identify genomic drivers of 3D conformation and predict the effect of genetic variants.

Two methods for representation learning of Hi-C data have previously been developed, SNIPER [11] and SCI [12].
SNIPER uses a fully-connected autoencoder [15] to transform the sparse Hi-C inter-chromosomal matrix into a dense one row-wise, the bottleneck of which is assigned as the representation.

These existing methods for Hi-C representations have two weaknesses that limit their applicability. First, SNIPER takes only inter-chromosomal contacts as input and therefore its representations cannot incorporate intra-chromosomal contact patterns that are most important for the regulation of gene expression, such as topological domains and promoter-enhancer looping. Second, the Hi-C representations produced by both SNIPER and SCI do not account for the inherent sequential nature of the genome. As we demonstrate below, these two weaknesses limit existing methods' informativeness and make them unable to accurately identify conformation-defining elements or predict how those elements influence structure.

In this work, we propose a method called Hi-C-LSTM that produces low-dimensional representations of the Hi-C intra-chromosomal contacts, assigning a vector of features to each genomic position that represents that position's contact activity with all other positions in the given chromosome. Hi-C-LSTM defines these representations using a sequential long short-term memory (LSTM) neural network model which, in contrast to existing methods like SNIPER and SCI, accounts for the sequential nature of the genome. A second methodological innovation of Hi-C-LSTM is that, instead of learning an encoder to create representations, we learn our representations directly through iterative optimization. We find that this approach provides a large improvement in information content relative to existing non-sequential methods, enables the use of intra-chromosomal interactions, and enables the model to accurately predict the effects of genomic perturbations (Fig. 1, Results).
We demonstrate the utility of Hi-C-LSTM's representations through several analyses. First, we show that our representations have the information needed to recreate the Hi-C matrix and that this recreation is more accurate using an LSTM than alternatives. Second, we show that our representation captures cell type-specific functional activity and genomic elements, and identifies genomic regions that drive conformation. Third, we show that feature attribution of Hi-C-LSTM can identify sequence elements driving 3D conformation, such as binding sites of CTCF and cohesin subunits [17,18]. Fourth, we show Hi-C-LSTM can perform in-silico perturbation of CTCF and cohesin binding sites. Fifth, we simulated a previously-assayed 2.1 Mbp structural variant at the SOX9 locus and found that Hi-C-LSTM correctly reproduces experimentally-derived contacts.

Figure 1: Overview of approach. Hi-C-LSTM learns a K-length vector representation of each genomic position that summarizes its chromatin contacts, using an LSTM embedding neural network. The representations and LSTM decoder are jointly optimized to maximize the accuracy with which the decoder can reproduce the original Hi-C matrix given just the representations. The resulting representations identify sequence elements driving 3D conformation through Integrated Gradients (IG) analysis, and they enable a researcher to perform in-silico perturbation experiments by editing the representations and observing the effect on predicted contacts.

Hi-C-LSTM performs two main tasks: it forms Hi-C representations, and it predicts Hi-C contacts.

Learning methods have been proposed that perform either of these tasks.

Hi-C-LSTM assigns a representation to each genomic position in the Hi-C contact map, such that an LSTM [31] that takes these representations as input can predict the original contact map (Fig. 2).

The representation and the LSTM are jointly trained to optimize the reconstruction of the Hi-C map. This process gives us position-specific representations genome-wide (see Methods for more details).

We find that Hi-C-LSTM achieves higher accuracy when reconstructing the Hi-C matrix compared to existing methods (Fig. 3a). The inferred Hi-C map matches the original Hi-C map closely (Fig. 3c), and differs from it by about 0.25 R-squared points on average. We adapt SNIPER to our task by replacing the feed-forward decoder that converts low-resolution Hi-C to high-resolution Hi-C with a decoder that reproduces the original input Hi-C. We call this SNIPER-FC. Hi-C-LSTM outperforms SNIPER (SNIPER-FC) convincingly, by 10% higher R-squared on average (Fig. 3a).
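Reconstruction accuracy of this kind can be scored with the coefficient of determination. A minimal sketch (illustrative only, not the authors' evaluation code):

```python
import numpy as np

def r_squared(observed, predicted):
    """Coefficient of determination between an observed Hi-C contact
    map and its reconstruction (both 2-D arrays of contact values)."""
    observed = np.asarray(observed, dtype=float).ravel()
    predicted = np.asarray(predicted, dtype=float).ravel()
    ss_res = np.sum((observed - predicted) ** 2)
    ss_tot = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```

A perfect reconstruction scores 1.0, while predicting the mean contact value everywhere scores 0.0, so "0.25 R-squared points" of difference corresponds to a quarter of that range.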

Two hypotheses could explain Hi-C-LSTM's improved reconstructions: (1) that Hi-C-LSTM's representation captures more information, or (2) that an LSTM is a more powerful decoder. We found that both are true. To distinguish these hypotheses, we split each model into its two components (representation and decoder) and evaluated each possible pair of components. We found that the choice of decoder has the largest influence on reconstruction performance.

Figure 2 (caption fragment): For each (i, j) pair, the LSTM takes as input the concatenated representation vector (R_i, R_j) and outputs the predicted Hi-C contact probability H_i,j. The LSTM hidden state h is carried over from (i, j) to (i, j + 1). This process is repeated for all N rows of the contact map by reinitializing the LSTM states. The LSTM and the representation matrix are jointly trained to minimize the reconstruction error.

Using an LSTM decoder performs best, even when using representations derived from SNIPER or SCI (improvement of 0.14 and 0.12 R-squared points on average over fully-connected decoders respectively, Fig. 3a). Furthermore, we found that Hi-C-LSTM's representations are most informative, even when using decoder architectures derived from SNIPER or SCI (Fig. 3a).

Though the Hi-C-LSTM representations capture important information from a particular sample, we wanted to verify whether they capture real biological processes or irreplicable experimental noise. To do so, we evaluated their value for creating the Hi-C contact map of replicate 2 (Fig. 3b). The average R-squared reduces slightly for inference of replicate 2 due to experimental variability; however, the performance trend of the representation-decoder combinations is largely preserved (Fig. 3b). These results show that Hi-C-LSTM's improved performance is not merely driven by memorizing irreplicable noise.

We find that the models built using the intra-chromosomal representations achieve higher predictive accuracy overall relative to ones trained on inter-chromosomal representations when predicting gene expression, enhancers and TSSs (Fig. 4a). This trend is likely due to the relatively close range of the elements involved in prediction. In contrast, SNIPER is slightly better at predicting [...]; SCI and SNIPER are comparable to Hi-C-LSTM and all three models perform significantly better than the baselines on average (Fig. 4a).

Feature attribution reveals association with genomic elements driving 3D conformation

Given that our representations capture elements driving 3D conformation, we should be able to identify these elements using feature attribution.

TAD boundaries enriched with CTCF show a 20% higher mean importance score compared to TAD boundaries not associated with CTCF, pointing to the importance of CTCF sites at domain boundaries in conformation (Fig. 5c). Moreover, loop domains show a 20% higher mean importance score compared to non-loop domains, which is expected because of the increased contact strength on average and the presence of CTCF sites (Fig. 5c).

The variation of the aggregated feature importance across interesting genomic regions helps us distinguish boundaries of domains and genomic regulatory elements (Fig. 5a,b). We observe the average feature importance across all TADs (Fig. 5a). The feature importance has largely similar values in the interior of the TAD, noticeably peaks at the TAD boundaries, and slopes downward in the immediate exterior vicinity of the TAD (Fig. 5a). This trend validates the importance of TADs and TAD boundaries in chromatin conformation, which we saw in Fig. 5c. We also consider a candidate region in chromosome 21 that is referred to in [45] to observe the variation of feature importance across active genomic elements (Fig. 5b). For this selected region in chromosome 21, as we do not have to deal with domains of varying sizes, we simply average the feature importance signal within a specified number of bins and plot this in the UCSC Genome Browser (https://genome.ucsc.edu) along with genes and regulatory elements. The feature importance peaks around genes, regulatory elements and domain boundaries (Fig. 5b), showing that they play a more important role in conformation than other functional elements.

Figure 5 (caption fragment): [...] locus (chr21 28-29.2 Mbp) along with genes, regulatory elements and Hi-C. We see that the feature importance scores peak at known regulatory elements. c) Violin plots of aggregated feature attribution scores for selected elements. The x-axis shows the labels/elements and the y-axis displays the log plus z normalized feature importance scores from Integrated Gradients.

After in-silico knockout of binding sites, a reduction in the average inferred contact strength as a function of genomic distance is observed (Fig. 6c). After CTCF and cohesin knockouts, the average contact strength reduces by >15% when compared to the no-knockout case (Fig. 6c). CTCF knockout is seen to affect insulation at about 100 Kbp and reflect possible loss of loops at 200 Kbp (Fig. 6c).
The knockout of cohesin subunits SMC3 and RAD21 binding sites is observed to be independent of CTCF knockout, with 5% higher average inferred strength over distance, hinting at their relative importance (Fig. 6c).

The CTCF sites at loop anchors occur mainly in a convergent orientation, with the forward [...]. After orientation replacement, a change in inferred contact strength as a function of genomic distance is observed (Fig. 6c). The replacement of the convergent with the divergent orientation around loops is seen to behave similarly to the case of CTCF knockout, thereby validating observations made in [56] (Fig. 6c). On the other hand, replacement of the divergent with the convergent orientation is seen to preserve loops at 200 Kbp and behave similarly to the control, although with reduced inferred contact strength (5% on average) (Fig. 6c).

The difference in inferred Hi-C between the CTCF (Fig. 6a) and cohesin (Fig. 6b) knockouts [...]. Cohesin knockout also weakens contacts within the loop (0.12 lower on average), but not as strongly as the CTCF knockout (Fig. 6b). Around the loop, CTCF and cohesin knockout results in patches of decreased (0.05 lower on average) as well as increased contacts (0.05 higher on average) (Fig. 6a, b). The predicted Hi-C after CTCF and cohesin knockout (Fig. 6a,b) [...].

To simulate the previously-assayed 2.1 Mbp duplication at the SOX9 locus, we applied Hi-C-LSTM to produce a simulated Hi-C matrix on a post-duplication genome (Fig. 7b). Because Hi-C reads cannot be disambiguated between the two duplicated loci, we simulated mapping reads to the original hg19 reference by summing reads originating from the two copies (see Methods). We [...] (Fig. 7c).

We found that Hi-C-LSTM accurately predicted the effect of the duplication. The domains that existed pre-duplication (D1, D2, D3, Fig. 7a) are correctly captured post-duplication. In addition, a new chromatin domain (DNew) that was introduced by the duplication is correctly predicted by Hi-C-LSTM (Fig. 7b). To quantitatively evaluate our predictions, we compared them to a baseline that predicts the original pre-duplication Hi-C for the interactions between the upstream, downstream and duplicated regions, and the genomic average for the interactions of the duplicated region with itself (see Methods).
We found that Hi-C-LSTM's predictions significantly outperform this baseline overall (Fig. 7d). Note that the baseline is a slightly better predictor of contacts between the upstream and downstream regions.

The code for this project, including training, evaluation, and data handling, along with generated data, can be found in our GitHub repository (https://github.com/smaslova/HiCLSTM).

Data sets
The [...], where chr refers to the chromosome being extracted.

Following SCI [12], to mitigate the extreme range of magnitudes present in Hi-C read counts, we transformed Hi-C values into contact probabilities between 0 and 1. We calculated contact probabilities according to the exponential transformation (Eq. 1)

CP = exp(-a * cf), with cf = 1 / (v + δ),    (1)

where v is the raw input contact strength, δ is a very small positive real number (we set δ to be 10^-10), cf is the coefficient obtained, a is the coefficient multiplier, and CP is the resulting contact probability. We chose a = 8 because it appeared to provide a good separation of low and high contact strengths.

Segway and Segway-GBR labels were obtained from Hoffmanlab (https://segway.hoffmanlab.org) and Noblelab (https://noble.gs.washington.edu/proj/gbr) respectively.

CTCF, RAD21 and SMC3 peak calls were downloaded from ENCODE (https://www.encodeproject.org). The CTCF orientations were obtained by using the CTCF motif from the MEME suite [...].

The LSTM cell follows the standard formulation:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i)
f_t = σ(W_f x_t + U_f h_{t-1} + b_f)
o_t = σ(W_o x_t + U_o h_{t-1} + b_o)
c~_t = tanh(W_c x_t + U_c h_{t-1} + b_c)
c_t = f_t • c_{t-1} + i_t • c~_t
h_t = o_t • tanh(c_t)

where • is the Hadamard product and σ refers to the sigmoid activation function.
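Assuming the transformation takes the form CP = exp(-a / (v + δ)), which is consistent with the inverse transformation used later in the duplication analysis, the forward and inverse transforms can be sketched as:

```python
import numpy as np

# delta guards against division by zero for empty bins; a controls the
# separation of low and high contact strengths (a = 8 in the paper).
def contact_probability(v, a=8.0, delta=1e-10):
    """Map raw contact strengths v >= 0 into contact probabilities in (0, 1)."""
    v = np.asarray(v, dtype=float)
    return np.exp(-a / (v + delta))

def contact_strength(cp, a=8.0, delta=1e-10):
    """Inverse transformation: recover read counts from probabilities."""
    cp = np.asarray(cp, dtype=float)
    return -a / np.log(cp) - delta
```

The two functions are exact inverses of each other, which is what allows predicted probabilities to be converted back into read counts and re-transformed, as done for the duplication experiment.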

Hi-C-LSTM creates a representation given a pair of genomic positions in the Hi-C contact matrix using an embedding neural network layer and predicts the contact strength at that particular pair via a deep LSTM [31] that takes these representations as input (Fig. 2). Hi-C-LSTM takes as input the concatenated representation pair (R_i, R_j) at each position pair (i, j) and predicts the contact probability H_i,j, where h_j is the same as h_{j-1} within the frame and is reinitialized at the beginning of each new frame.
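The pairwise decoding scheme described above can be sketched as follows. This is a simplified, illustrative PyTorch implementation; the class name, layer sizes and single-row interface are assumptions, not the authors' exact configuration:

```python
import torch
import torch.nn as nn

class HiCLSTMSketch(nn.Module):
    """Sketch: a trainable embedding per genomic bin plus an LSTM decoder
    that predicts contact probabilities for position pairs."""

    def __init__(self, num_bins, rep_size=16, hidden_size=32):
        super().__init__()
        # One rep_size-dimensional representation per genomic position,
        # learned directly by gradient descent (no encoder).
        self.reps = nn.Embedding(num_bins, rep_size)
        # The LSTM consumes the concatenated pair (R_i, R_j).
        self.lstm = nn.LSTM(2 * rep_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, i, js):
        # i: row index; js: 1-D tensor of column indices within the row.
        ri = self.reps(torch.full_like(js, i))        # (L, rep_size)
        rj = self.reps(js)                            # (L, rep_size)
        x = torch.cat([ri, rj], dim=-1).unsqueeze(0)  # (1, L, 2*rep_size)
        # The hidden state is carried across columns within the row and
        # reinitialized (zeros by default) at the start of each new row.
        h, _ = self.lstm(x)
        return torch.sigmoid(self.out(h)).squeeze(0).squeeze(-1)  # (L,)

# Joint training of representations and decoder against one observed row:
# model = HiCLSTMSketch(num_bins=1000)
# loss = nn.functional.mse_loss(model(0, torch.arange(1000)), target_row)
```

Because the representations live in an `nn.Embedding` table that receives gradients alongside the decoder weights, optimizing the reconstruction loss trains both jointly, matching the description above.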

The LSTM and the embedding neural network layer are jointly trained using the mean squared error loss. [...] While the Hi-C prediction accuracy increased substantially with hidden size, we noticed an elbow at a representation size of 16 for average mAP and therefore set our representation size to that value as a trade-off.

Hi-C reproduction evaluation

We investigated three hypotheses with the following analysis. First, we asked whether the Hi-C-LSTM [...].

Element identification evaluation

We used the following analysis to evaluate the ability of a representation to identify genomic phenomena and chromatin regions. We used XGBoost with a maximum depth of 6 and a maximum of 5000 estimators; these parameters were chosen following ablation experiments with odd chromosomes as the training set and even chromosomes as the test set (Supplementary: Fig. 3). N-fold cross-validation, with n = 20, was used to validate our training, with an early stopping criterion of 20 epochs. The rest of the XGBoost parameters were set to their default values.
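The element identification setup can be sketched as below. This uses scikit-learn's gradient boosting as a stand-in for XGBoost, random data in place of real representations and labels, and far fewer estimators than the paper's 5000, purely to illustrate the train/test split by chromosome parity and the precision-based scoring:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)

# Stand-in data: 16-dimensional representations per locus with binary
# element labels (e.g. enhancer / non-enhancer), split by chromosome
# parity as in the ablation experiments.
X_odd, y_odd = rng.normal(size=(200, 16)), rng.integers(0, 2, 200)    # train (odd chromosomes)
X_even, y_even = rng.normal(size=(100, 16)), rng.integers(0, 2, 100)  # test (even chromosomes)

# Max depth 6 as in the paper; n_estimators reduced to keep the sketch fast.
clf = GradientBoostingClassifier(max_depth=6, n_estimators=50)
clf.fit(X_odd, y_odd)

# Average precision summarizes precision across recall levels, analogous
# to the per-task scores averaged into mAP.
ap = average_precision_score(y_even, clf.predict_proba(X_even)[:, 1])
print(f"average precision: {ap:.3f}")
```

With random features and labels the score hovers around the positive-class prevalence; with informative representations it rises toward 1.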

For each task, the genomic loci under consideration were assigned labels. All tasks were treated as binary classification tasks, except the subcompartments task, which was treated as a multi-class classification task. For tasks without preassigned negative labels, negative labels were created by randomly sampling genome-wide, excluding the regions with positive labels.

The XGBoost classifier was given the representations at these genomic loci as input and the assigned labels as targets. The classifier was evaluated using the metric of mean average precision (mAP), which is a standard metric for classification tasks and is defined as the average of the maximum precision scores achieved at varying recall levels.

[...] The importance scores were then subjected to log normalization followed by min-max normalization (Eq. 5). Specifically, let IG be the integrated gradients (IG) score, and IG_min and IG_max be the minimum and maximum IG scores. The normalized IG score IG_norm is defined as

IG_norm = (log IG - log IG_min) / (log IG_max - log IG_min).    (5)
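The normalization in Eq. 5 can be sketched as a short numpy function (a minimal illustration; it assumes all IG scores are strictly positive so the logarithm is defined):

```python
import numpy as np

def normalize_ig(ig):
    """Log-transform then min-max normalize integrated gradients scores,
    mapping them onto [0, 1] while preserving rank order."""
    log_ig = np.log(np.asarray(ig, dtype=float))
    return (log_ig - log_ig.min()) / (log_ig.max() - log_ig.min())

# Scores spanning several orders of magnitude become evenly spread:
scores = normalize_ig([0.01, 0.1, 1.0, 10.0])
```

The log transform compresses the heavy-tailed raw IG scores before min-max scaling, so a handful of extreme attributions does not flatten the rest of the signal.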
In-silico perturbation

Hi-C-LSTM enables us to perform in-silico deletion, orientation replacement and reversal of genomic loci and predict changes in the resulting Hi-C contact map. We performed three types of experiments: knockout, CTCF orientation replacement, and duplication. In a knockout experiment, we chose certain genomic sites (such as CTCF and cohesin binding sites) and replaced their representations with a null representation. As a null representation, we used the average representation [...].
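A knockout amounts to overwriting rows of the learned representation matrix. A minimal numpy sketch; averaging over all positions as the null representation is an assumption based on the truncated description above:

```python
import numpy as np

def knockout(reps, sites):
    """Replace the representations of the given bin indices with a null
    (average) representation, leaving other positions untouched.

    reps: (num_bins, rep_size) learned representation matrix
    sites: indices of bins to knock out (e.g. CTCF binding bins)
    """
    edited = reps.copy()
    null_rep = reps.mean(axis=0)  # assumed null: genome-wide average
    edited[sites] = null_rep
    return edited
```

Feeding the edited matrix through the trained decoder then yields the predicted post-knockout contact map, which can be compared against the unperturbed prediction.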

To enable comparison to Hi-C data mapped to the original pre-duplication reference genome, we combined inferred contacts from both copies. This combination is required because Hi-C reads cannot be disambiguated between the two duplicated copies when they are mapped to the reference genome. Specifically, we passed the predicted contact probability cp through the inverse exponential transformation to define predicted read counts CS = -a / log(cp) - δ (see Eq. 1). We summed predicted read counts from the two duplicated copies to simulate mapping reads from both copies to the same reference genome, giving CS', then re-applied the exponential transform to obtain the predicted contact probability cp'.
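Assuming the forward transform CP = exp(-a / (v + δ)) and its inverse CS = -a / log(CP) - δ, the combination step can be sketched as:

```python
import numpy as np

A, DELTA = 8.0, 1e-10  # coefficient multiplier and small constant from Methods

def to_counts(cp):
    """Inverse exponential transform: contact probabilities -> read counts."""
    return -A / np.log(cp) - DELTA

def to_probability(cs):
    """Forward exponential transform: read counts -> contact probabilities."""
    return np.exp(-A / (cs + DELTA))

def combine_copies(cp_copy1, cp_copy2):
    """Sum predicted read counts from the two duplicated copies, then
    re-apply the forward transform, mimicking reads from both copies
    mapping to the same pre-duplication reference."""
    cs_total = to_counts(cp_copy1) + to_counts(cp_copy2)
    return to_probability(cs_total)
```

Summing must happen in read-count space rather than probability space, because the exponential transform is nonlinear and probabilities from the two copies cannot simply be added.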

Our baseline for the quantitative evaluation was the original pre-duplication Hi-C for the interactions [...].

The data that support the findings of this study are publicly available to download and are referenced in the bibliography. Refer to Methods for more details. The data and representations generated from the project can be found in the GitHub repository (https://github.com/smaslova/HiCLSTM).