A Deep Learning Model for Predicting NGS Sequencing Depth from DNA Sequence

Targeted high-throughput DNA sequencing is a primary approach for genomics and molecular diagnostics, and more recently as a readout for DNA information storage. Oligonucleotide probes used to enriching gene loci of interest have different hybridization kinetics, resulting in non-uniform coverage that increases sequencing costs and decreases sequencing sensitivities. Here, we present a deep learning model (DLM) for predicting NGS sequencing depth from DNA probe sequence. Our DLM includes a bidirectional recurrent neural network that takes as input both DNA nucleotide identities as well as the calculated probability of the nucleotide being unpaired. We applied our DLM to two different NGS panels: a designed 7,373-plex panel for DNA information storage, and a 39,145-plex panel for human single nucleotide polymorphisms. In cross-validation, our DLM predicts sequencing depth to within a factor of 3 with 99% accuracy for the designed panel, and 93% accuracy for the human panel. The same model is also effective at predicting the measured single-plex kinetic rate constants of DNA hybridization and strand displacement.


Introduction
With more than 3 billion DNA nucleotides in the hap-loid human genome, deep sequencing of the entire human genome for clinical applications is not economically feasible. Instead, researchers and diagnostic laboratories typically use targeted sequencing, in which a set of DNA hybridization probes is designed to bind and enrich the DNA regions of interest [1,2]. However, the DNA oligonucleotide probes in a targeted sequencing panel typically all have different kinetics and thermodynamics of binding to their respective targets. Consequently, a naively designed and synthesized panel of DNA probes will result in grossly different enrichment e ciencies for different genetic loci.
The sensitivity of NGS to a locus is directly proportional to the number of NGS reads that contain the locus (the locus's sequencing depth).. Nonuniformity of sequencing depth either reduces the sensitivity at low-depth loci, or necessitates ad-ditional sequencing to guarantee that all loci are sequenced to a minimum desired depth. Empirical optimization of an NGS panel's probe sequences and concentrations is time-and labor-consuming, but currently cannot be avoided. A com-putational method to predict the sequencing depth based on probe sequence could inform the selection of probe sets with higher uniformity and modulation of probe concentrations to achieve higher uniformity.
Here, we constructed a deep learning model (DLM) for predicting NGS sequencing depth for a given oligonucleotide probe, and characterized its performance on predicting the sequencing depths of two NGS panels, one with 39,145 probes against human single nucleotide polymorphisms, and one with 7,373 probes bearing arti cially designed sequences for information storage [6]. Our DLM is based on a recurrent neural network (RNN) architecture to better captures both short-range and long-range interactions within the DNA probe sequence that can impact capture e ciency/speed.

Results
Design of the Deep Learning Model. In the genomics eld, DNA probe oligonucleotide lengths range between 50 nt and 150 nt. Thus, in designing our DLM, we considered that the model should be generalizable to DNA sequences of dif-ferent lengths. To this end, NNs with a xed number of input nodes, including conventional feed-forward NNs and convo-lutional NNs for image recognition [20], are not well-suited for DNA sequence inputs. Furthermore, from DNA thermo-dynamics and structure studies [14][15][16][17], we know that distal DNA nucleotides can hybridize to each other in secondary structures. These long-range interactions in DNA molecules are better captured by recurrent neural networks (RNNs), which have been applied commercially in speech recognition and natural language processing [18].
In brief, RNNs contain a number of internal hidden nodes, which are updated serially based on the ordered inputs and its current state values. RNNs have two primary imple-mentations: long short-term memories (LSTMs) and gated recurrent units (GRUs). We chose to implement our DLM using GRUs because they have been reported to achieve similar performance using fewer computational resources [22]. Our DLM includes a total of four GRUs grouped into in two sets: two GRUs for target sequence T , and two GRUs for probe sequence P . Although the target sequence is always the reverse complement of the probe sequence in our DLM model, we included separate GRUs for T and P both to ease the training of the model and to enable the DLM to be more generalizable to problems with asymmetric information on T and P , such as in the strand displacement kinetics that we discuss later.
Each of the two GRUs for each oligonucleotide (T or P ) takes the sequence either in the direction from 5 I to 3 I , or from 3 I to 5 I . Unlike biological polymerization reactions which have a clear 5 I to 3 I directionality, the hybridization process is equally likely to initiate on either end. For RNNs and GRUs, the last inputs tend to have a larger in uence on the nal state values of the hidden nodes, so the design decision to include sequences in both directions is aimed at reducing input direction bias.
For each GRU, at every single nucleotide there are three input variables: (1) a binary bit indicating whether the nu-cleotide is a purine (A or G), (2) a binary bit indicating whether the nucleotide is "strong" (G or C), and (3) an analog Nupack-computed probability p unpaired that the nu-cleotide is unpaired at the reaction conditions [14]. We chose to encode the identity of each nucleotide in two dimensions rather than a single dimension (e.g. A = 1, T = 2, C = 3, G = 4), in order to re ect the pairwise "distances" between any two nucleotides, based on DNA biochemistry knowledge. The unpaired probability of each nucleotide re ects our bio-physical understanding that only unpaired nucleotides can participate in hybridization reactions; a paired nucleotide must rst dissociate in order to allow new Watson-Crick base-pairing. p unpaired is calculated using Nupack and con-siders the ensemble of all possible secondary structures that can be adopted by each DNA molecule, rather than just the minimum free energy structure.
Each GRU was designed to have 128 hidden nodes (h t). . All node values are initialized to 0, and updated based on each nucleotide's information. The hidden nodes of the RNNs represented potential patterns in the DNA sequence that the GRU could identify, and the nal values of states after updating all nucleotides in the T and P sequences correspond to the presence or absence of those patterns. Thus, the number of hidden nodes in the RNN (currently 128) limited the maximum number of patterns that could be observed by the RNN. Preliminary studies showed similar prediction performance for GRUs with 128 internal states as for 256 internal states (data not shown), suggesting that 128 states were su cient to capture the bulk of the patterns. Through the course of DLM training and weight updating through back-propagation, the GRU parameter weights were modi ed until they represented frequently observed patterns in the training data.
Downstream of the GRU, we used a conventional feed-forward neural network (FFNN) that takes as input the nal state values of the hidden nodes of the GRUs (128 from H 5 −>3 and 128 from H 3 −>5 ).. In addition to the hidden node values, the FFNN also takes as input 4 global features: the reaction temperature, the predicted standard free energy of folding of probe(∆G • (P ))) and Target(∆G • (T )),), and the predicted standard free energy of formation of the TP double-stranded DNA molecule (∆G • (TP )).). These global features were intended to capture properties of the T + P reactions that were not easily revealed by the base pair probabilities. Thus, a total of 260 nodes were used as FFNN input. The FFNN network contained 2 hidden layers with 256 and 128 nodes, respectively; these values were picked arbitrarily based on our experience, and overall prediction performance did not appear to be sensitive to the dimensionality of the FFNN hidden layers.
Training and Validation the DLM on NGS read depth. Each of the two NGS datasets were used to independently train the DLM, and sequence depths were predicted in cross-validation for each NGS dataset individually. The reason for this is because each NGS experiment had a large number of different experimental variables (e.g. total number of reads, hybridization temperatures, experimental operator, sequencing instrument, etc.) that we felt were beyond the scope of the DLM. From a practical point of view, we ex-pect that most users would aim to optimize probe sequence and concentration selective to improve uniformity within an NGS panel, rather than across different panels on different instruments.
For each NGS dataset, we randomly split the data into 20 classes, and predictions of each class (5% of total dataset) were obtained by a DLM trained on the remaining 19 classes (95% of total dataset), as shown in Fig. 2a. Thus, a total of 20 DLMs were used in the 20-fold cross validation predictions for evaluating prediction accuracy. There are roughly 300,000 weight parameters in the DLM (illustrated in Fig. 1c); these were preset via Xavier initialization (uniformly distributed weights with standard deviation dependent on the number of parameters in a layer) in order to alleviate the vanishing gradient problem for deep NNs [36].
During training, we iteratively minimized the square er-ror between the predicted and experimental log sequencing depth, using gradient descent with an Adam optimizer [37] to update the network weights. To minimize over tting, we implemented an additional dropout layer after each hidden layer of the FFNN, in which 20% of parameters are randomly selected and prevented from updating in each training iteration. The DLM was implemented using Tensor ow [38], and DLM hyper-parameters include GRU hidden nodes (128), FFNN hidden nodes (256 and 128), batch size (999), learning rate (0.0001), and node dropout fraction (20%). We tried roughly 50 sets of different hyper-parameter values, and the values listed appear to yield the shortest training time and best predictive performance. Our previous studies on Twist oligonucleotide pools suggest that the lack of sequencing reads for these may indi-cate di culties with probe synthesis [43]. Consequently, we chose to exclude these probes with 0 observed NGS reads in order to eliminate the possibility of training against noise.
The DLM yields an average RMSE of roughly 0.30 on both the training classes and the test classes. For comparison, a naive model of predicting sequencing depth only based on the mean log 10 (Depth) of all observed sequences produces an RMSE of 0.41. Fig. 2d plots the comparison of predicted and measured sequencing depths for each sequence. From this g-ure, we see that a signi cant contributor to our DLM's RMSE is a subset of DNA sequences that are observed to have very low log 10 (Depth) (e.g. 0.3, corresponding to a depth of 2), but predicted to have log 10 (Depth) between 1 and 3.3. Our interpretation of this phenomenon is that there are a myriad possible reasons why probes may yield extremely low depth, e.g. poor probe synthesis yield, poor hybridization yield, se-quence biases in nonspeci c DNA binding to plasticware, and unintended crosstalk binding between different subsequences of the human genome, and poor bridge PCR e ciency during Illumina NGS that result in under-representation of reads. The DLM does not have su cient training instances for each reason to be able to make con dent predictions of very low sequencing depth. Fig. 2e shows the DLM results applied to a second NGS panel comprising 7,373 DNA probes against nonbiological DNA sequences intended for DNA information storage ap-plications. As these sequences were procedurally generated to avoid known problematic DNA sequences, such as those with high or low G/C content (Fig. 2b), homopolymers, etc., there is much less variation in sequencing depth to begin with. Nonetheless, the DLM is effective at predicting se-quencing depths (Fig. 2e) beyond a naive model (see also Supplementary Section S3).
Our DLM in total contains over 300,000 parameters (e.g. node biases and node-node weights). The large number of pa-rameters leads to potential concern regarding over tting and model reproducibility. To address this, we next performed 15 independent DLM parameter initiation and training on the dataset, in order to characterize the reproducibility of the model (Fig. 3). All 15 DLMs consistently reached early-stop at roughly epoch 250 (Fig. 3a), and the predicted sequencing depths showed high pairwise concordance (Fig. 3bc). Across the 105 pairwise comparisons of the 15 DLMs, we observe a Pearson's r value of no less than 0.975. Consequently, we believe that our approach produces DLMs with fairly consis-tent predictions despite variations in parameter initialization and training.
DLM prediction of single-plex DNA hybridization and strand displacement rate constants. Based on our understanding of DNA and NGS, we believe that NGS read depth is primarily dependent on the yield and speed of DNA probe hybridization, and secondarily by instrument-and chemistry-speci c biases. Consequently, our DLM should also be effective at predicting the rate constants of hybridiza-tion of DNA (Fig. 4a). To further challenge our DLM and to highlight the effectiveness of our DLM approach, we further applied the DLM to the prediction of a related DNA mechanism, strand displacement [24,25] (Fig.   4b). Unlike NGS experiments in which thousands of DNA probes and targets are simultaneously hybridizing, for DNA hybridiza-tion and strand displacement rate constant prediction, we use time-based uorescence data in which a single target and probe species are observed with high time and yield resolution (Fig. 4c, Supplementary Section S2. See also ref. [31] for additional explanation on hybridization reaction kinetics experimental details. Experimental hybridization rate constants are taken from ref. [31], and experimental strand displacement rate constant data are collected for this work. Supplementary Section S2 describes details of how best-t rate constants are tted to uorescence-based kinetics data. Fig. 4d shows the DLM prediction vs. experimental best-t rate constants for 210 hybridization reactions and 211 strand displacement reactions. Here, we performed 100-fold leave-one-class-out (LOCO) pre-diction rather than 20-fold cross validation, because the smaller number of data points would lead to signi cant biases due to small sample sizes of the test set; see Supplementary Section S3 for details. The variation in rate constants observed is similar to the variation in NGS sequencing depths (4 logs), though the latter is possibly somewhat smaller due to the saturation of of hybridization for the timescales of NGS hybrid-capture reactions.
The purpose of this sub-study was to see if the DLM could be used for non-NGS applications of nucleic acid molecular diagnostics, such as those based on qPCR [41] and electro-chemistry [27]. Importantly, our DLM was trained simultaneously on both the hybridization and the strand displacement rate constant datasets. Because the target T and probe P sequences for the hybridization reactions are identical to that of strand displacement reactions, the difference between the two is manifested only in the predicted probability of each nucleotide being unpaired. This information alone was enough to communicate to the DLM the distinction between hybridization and strand displacement, and no special case handling or neural network architecture modi cation was needed to accommodate strand displacement.
Contribution of Different Features to DLM Perfor-mance. The architecture of the DLM and the local and global features were initially decided based on our understanding of the behavior of DNA, rather than through knowledge-free exploration. Consequently, it is possible that some of the features are not directly relevant to NGS depth or hy-bridization/strand displacement rate constants. To test this hypothesis, we next constructed a series of DLMs in which different features were removed from the model (Fig. 5).
We found that 3 of the 4 global features, individually, had essentially no impact on any of the predictions. Both sets of local features (sequences and base probabilities) were important for some aspect of the DLM prediction, but it appears that the two are interchangeable for predicting the NGS depths of the human SNP panel. Examining the global features more closely, we note that temperature T is the same for all sequences within a panel, so it is tautological that the DLM cannot learn any impact from changing T.
The standard free energy of formation of the target-probe duplex E T P likely did not matter because the lengths of all probes/targets were long enough that the probe binding was no longer limited by its thermodynamics. Finally, the stan-dard free energy of folding of the target by itself E T likely did not matter because it was not and could not be accurately calculated: whereas the probe P has a homogenous molecular population with a well-de ned sequence, the target T is a heterogeneous mixture constructed through randomized phys-ical fragmentation of human genomic DNA. Consequently, the 5 I and 3 I overhang sequences of the target are highly variable, and cannot be re ected as a single sequence. Prediction accuracies of all feature-reduced DLMs are summarized in Supplementary Section S4. The performance of even re-duced models are in general much better than naive models (Supplementary Section S5).

Discussion
Targeted high-throughput sequencing of DNA has become a dominant method for biological and biomedical research, and furthermore is becoming standard of practice for cancer treatment [48]. More recently, targeted sequencing has been explored as a method for random-access readout of informa-tion stored densely and for long term in DNA [6]. Although DNA sequencing costs are exponentially decreasing over the years [49], poor sequencing uniformity would waste a large majority of reads sequencing highdepth targets redundantly and providing insu cient information on low-depth targets. Consequently, a strong need exists to rationally design NGS panels with high uniformity.
However, predicting DNA hybridization kinetics and e -ciency is extremely di cult even for experts in single-plex settings [31]. In a complex multi-component system that is hybrid-capture target enrichment, prediction of sequenc-ing depth becomes intractable for rst-principles biophysical models. At the same time, the large size of NGS datasets renders the problem well-suited for machine learning.
Deep learning leverages large datasets to autonomously discover weak correlative features between inputs and outputs. This has led to deep learning becoming dominant in computer vision and other areas with large available datasets. On the other hand, simpler statistical models (e.g. multivariate linear regression) remain dominant in the natural and biomedical sciences [50,51] where well-curated data for speci c problems are scarce. Expert system machine learning approaches based on extensive manual feature construction and curation takes an in-between approach, using expert knowledge to guide the construction of narrowly optimized prediction software, but are generally poor at generalization to similar problems.
In our DLM, we restricted our inputs to a limited set of global and local features that can be automatically computed based on DNA sequence, in order to avoid the trap of labor-intensive and problem-speci c model construc-tion. Given both DNA thermodynamics model inaccu-racy/incompleteness [8] and the fact that we could not feasibly consider the intermolecular interactions from all 3+ billion nucleotides of the human genome, we believe that the Nupack-predicted base pair probability values likely have signi cant error. Future models for more accurately predicting base pair accessibility in a highly complex and heterogeneous solu-tion could prove crucial to further improving the prediction accuracy of the DLM.
DNA sequence is rather unlike most other inputs for prob-lems solved by deep learning networks. DNA molecules are known to have long-range interactions where distal DNA nucleotides bind to each other.
Simultaneously, there are not orderly "grammar" rules such as in natural language process-ing [19] that can be readily discovered by neural networks. Distal and intermolecular DNA binding are essentially the effects of chemistry and does not necessarily need to conform to human common sense. These all contribute to the dif-culty of building neural networks that accurately predict DNA behavior based on sequence.
Conversely, once a neural network architecture is estab-lished to "understand" DNA sequence, it could hold the potential to a large range of other nucleic acid-based prob-lems. As a research example, a large range of non-coding RNA [52] have been discovered, and being able to predict their structures can provide insights to their function. As a biomedical example, codon optimization problems [53] for syn-thetic biology, including de novo construction of RNA-based drugs [54]. The incorporation of generalizable domain knowl-edge within deep learning architectures will be a key enabler for predicting behaviors for nucleic acids, given the impact of their sequences on form and function and the exponential number of possible sequences of given lengths.

Declarations
Additional information. Correspondence may be ad-dressed to AP (Andrew.Phillips@microsoft.com) or DYZ (dyz1@rice.edu). There is a patent pending on X-probes used in this work. There is a patent pending on the WNV model of rate constant prediction. MXW declares a com-peting interest in the form of consulting for Nuprobe. DYZ declares a competing interest in the form of consulting for Nuprobe, Torus Biosystems, and Avenge Bio.
Data Availability. The sequences of the DNA oligos used for the manuscript, the oligo concentrations used for uores-cence experiments, the best-t rate constants, and the values of the manuallyconstructed features for the WNV model are included as an Excel le accompanying this manuscript. The original raw uorescence data from our single-plex kinetics experiments are also included as a Zip le.
Code Availability. To allow readers to use the DLM for predicting rate constants, we provide our DLM software code and installation/usage instructions. These are attached with the submission. We also intend to make an online software tool available by the time of publication.  Training and cross-validation results of the DLM on predicting NGS read depth. (a) Data pre-processing and training pipeline. Raw FASTQ NGS reads were aligned to the target sequences [28,29], and the read depth of each template sequence is counted. Separately, local and global features for each target/probe are computed using Nupack [14]. The complete dataset was split into 20 classes, and each class was (d) Predicted vs. observed sequencing depth for the human SNP panel. Plotted are the aggregated results for all 20 validation classes. Shaded in dark gray are the zones where the predicted and the observed read depth agreed to within a factor of 2; light gray shows agreement to within factor of 3. F2err and F3err denotes the fraction of sequences with depth predicted to beyond a factor of 2 and 3, respectively. (e) Predicted vs. observed sequencing depth for the synthetic DNA panel.  Applying our DLM to prediction of single-plex hybridization and strand displacement rate constants.
Schematic of (a) hybridization and (b) strand displacement reactions. (c) Sample kinetics traces of a hybridization and a strand displacement reaction. Reaction yields inferred through uorescence; see Supplementary Section S2 for experimental and data processing details. (d) Accuracy of DLM predictions of hybridization (blue) and strand displacement (orange) rate constants. Due to the small number of data points here (421), we implemented prediction based on leave-one-out rather than 20-fold cross validation, due to the large expected variation in small validation classes. Note that strand displacement and hybridization reaction parameters were co-trained using the same DLM, and predictions were likewise using a single DLM. (e) Comparing DLM prediction performance to a previous expert system machine learning approach based on weighted neighbor voting [31]. Prediction results are similar.