Design of the Deep Learning Model. In the genomics field, DNA probe oligonucleotide lengths range between 50 nt and 150 nt. Thus, in designing our DLM, we considered that the model should be generalizable to DNA sequences of different lengths. To this end, NNs with a fixed number of input nodes, including conventional feed-forward NNs and convolutional NNs for image recognition [20], are not well-suited for DNA sequence inputs. Furthermore, from DNA thermodynamics and structure studies [14–17], we know that distal DNA nucleotides can hybridize to each other in secondary structures. These long-range interactions in DNA molecules are better captured by recurrent neural networks (RNNs), which have been applied commercially in speech recognition and natural language processing [18].
In brief, RNNs contain a number of internal hidden nodes, which are updated serially based on the ordered inputs and their current state values. RNNs have two primary implementations: long short-term memories (LSTMs) and gated recurrent units (GRUs). We chose to implement our DLM using GRUs because they have been reported to achieve similar performance using fewer computational resources [22]. Our DLM includes a total of four GRUs grouped into two sets: two GRUs for the target sequence T, and two GRUs for the probe sequence P. Although the target sequence is always the reverse complement of the probe sequence in our DLM model, we included separate GRUs for T and P both to ease the training of the model and to enable the DLM to be more generalizable to problems with asymmetric information on T and P, such as in the strand displacement kinetics that we discuss later.
Each of the two GRUs for each oligonucleotide (T or P) takes the sequence either in the direction from 5′ to 3′, or from 3′ to 5′. Unlike biological polymerization reactions, which have a clear 5′ to 3′ directionality, the hybridization process is equally likely to initiate on either end. For RNNs and GRUs, the last inputs tend to have a larger influence on the final state values of the hidden nodes, so the design decision to include sequences in both directions is aimed at reducing input direction bias.
For each GRU, at every single nucleotide there are three input variables: (1) a binary bit indicating whether the nucleotide is a purine (A or G), (2) a binary bit indicating whether the nucleotide is “strong” (G or C), and (3) an analog Nupack-computed probability p_unpaired that the nucleotide is unpaired at the reaction conditions [14]. We chose to encode the identity of each nucleotide in two dimensions rather than a single dimension (e.g. A = 1, T = 2, C = 3, G = 4), in order to reflect the pairwise “distances” between any two nucleotides, based on DNA biochemistry knowledge. The unpaired probability of each nucleotide reflects our biophysical understanding that only unpaired nucleotides can participate in hybridization reactions; a paired nucleotide must first dissociate in order to allow new Watson-Crick base-pairing. p_unpaired is calculated using Nupack and considers the ensemble of all possible secondary structures that can be adopted by each DNA molecule, rather than just the minimum free energy structure.
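The per-nucleotide featurization described above can be sketched as follows. The p_unpaired values would come from Nupack in practice; the values used here are placeholders for illustration only.

```python
def encode_nucleotide(base, p_unpaired):
    """Encode one nucleotide as the three GRU input features:
    (purine bit, strong bit, Nupack unpaired probability)."""
    purine = 1 if base in "AG" else 0   # A/G -> 1, C/T -> 0
    strong = 1 if base in "GC" else 0   # G/C -> 1, A/T -> 0
    return (purine, strong, p_unpaired)

# Placeholder p_unpaired values (Nupack ensemble output in practice).
seq = "GATC"
probs = [0.9, 0.8, 0.7, 0.95]
features = [encode_nucleotide(b, p) for b, p in zip(seq, probs)]
# G -> (1, 1, 0.9); A -> (1, 0, 0.8); T -> (0, 0, 0.7); C -> (0, 1, 0.95)
```

Note how the two-bit encoding places complementary bases (e.g. A and T) one bit apart, reflecting pairwise biochemical "distances" that a single integer label would not.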
Each GRU was designed to have 128 hidden nodes (h_t). All node values are initialized to 0, and updated based on each nucleotide’s information. The hidden nodes of the RNN represent potential patterns in the DNA sequence that the GRU can identify, and the final values of the states after updating all nucleotides in the T and P sequences correspond to the presence or absence of those patterns. Thus, the number of hidden nodes in the RNN (currently 128) limits the maximum number of patterns that can be observed by the RNN. Preliminary studies showed similar prediction performance for GRUs with 128 internal states as for 256 internal states (data not shown), suggesting that 128 states were sufficient to capture the bulk of the patterns. Through the course of DLM training and weight updating through back-propagation, the GRU parameter weights were modified until they represented frequently observed patterns in the training data.
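The serial state update can be sketched with the textbook GRU equations; this is a minimal numpy illustration with randomly initialized weights, not the trained DLM parameters. The hidden state starts at 0 and is updated once per encoded nucleotide.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, W, U, b):
    """One GRU update: x is the 3-feature nucleotide input, h is the
    128-dim hidden state. W/U/b hold update (z), reset (r), and
    candidate (c) parameters."""
    z = sigmoid(W["z"] @ x + U["z"] @ h + b["z"])        # update gate
    r = sigmoid(W["r"] @ x + U["r"] @ h + b["r"])        # reset gate
    c = np.tanh(W["c"] @ x + U["c"] @ (r * h) + b["c"])  # candidate state
    return (1 - z) * h + z * c

rng = np.random.default_rng(0)
n_hidden, n_in = 128, 3
W = {k: rng.normal(0, 0.1, (n_hidden, n_in)) for k in "zrc"}
U = {k: rng.normal(0, 0.1, (n_hidden, n_hidden)) for k in "zrc"}
b = {k: np.zeros(n_hidden) for k in "zrc"}

h = np.zeros(n_hidden)  # all node values initialized to 0
for x in [(1, 1, 0.9), (1, 0, 0.8), (0, 1, 0.95)]:  # encoded nucleotides
    h = gru_step(np.array(x, dtype=float), h, W, U, b)
# final h feeds the downstream feed-forward network
```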
Downstream of the GRUs, we used a conventional feed-forward neural network (FFNN) that takes as input the final state values of the hidden nodes of the GRUs (128 from H_5′→3′ and 128 from H_3′→5′). In addition to the hidden node values, the FFNN also takes as input 4 global features: the reaction temperature, the predicted standard free energies of folding of the probe (∆G°(P)) and the target (∆G°(T)), and the predicted standard free energy of formation of the TP double-stranded DNA molecule (∆G°(TP)). These global features were intended to capture properties of the T + P reactions that were not easily revealed by the base pair probabilities. Thus, a total of 260 nodes were used as FFNN input. The FFNN contained 2 hidden layers with 256 and 128 nodes, respectively; these values were picked based on our experience, and overall prediction performance did not appear to be sensitive to the dimensionality of the FFNN hidden layers.
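The FFNN forward pass can be sketched as follows. The layer sizes follow the text (260 inputs, hidden layers of 256 and 128 nodes); the ReLU activation, the scalar output, and all numeric values are illustrative assumptions rather than the trained model.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0)

# Inputs: 128 final states from each directional GRU, plus 4 global
# features (temperature, dG(P), dG(T), dG(TP)) -> 260 nodes total.
h_fwd = rng.normal(size=128)   # stand-in for H_5'->3' final states
h_rev = rng.normal(size=128)   # stand-in for H_3'->5' final states
globals_ = np.array([52.0, -1.2, -0.8, -35.4])  # illustrative values
x = np.concatenate([h_fwd, h_rev, globals_])    # shape (260,)

# Two hidden layers (256 and 128 nodes), scalar prediction output.
W1, b1 = rng.normal(0, 0.05, (256, 260)), np.zeros(256)
W2, b2 = rng.normal(0, 0.05, (128, 256)), np.zeros(128)
W3, b3 = rng.normal(0, 0.05, (1, 128)), np.zeros(1)

y = W3 @ relu(W2 @ relu(W1 @ x + b1) + b2) + b3  # e.g. log10(depth)
```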
Training and Validating the DLM on NGS read depth. Each of the two NGS datasets was used to independently train the DLM, and sequence depths were predicted in cross-validation for each NGS dataset individually. The reason is that each NGS experiment had a large number of different experimental variables (e.g. total number of reads, hybridization temperatures, experimental operator, sequencing instrument, etc.) that we felt were beyond the scope of the DLM. From a practical point of view, we expect that most users would aim to optimize probe sequence and concentration selection to improve uniformity within an NGS panel, rather than across different panels on different instruments.
For each NGS dataset, we randomly split the data into 20 classes, and predictions of each class (5% of total dataset) were obtained by a DLM trained on the remaining 19 classes (95% of total dataset), as shown in Fig. 2a. Thus, a total of 20 DLMs were used in the 20-fold cross validation predictions for evaluating prediction accuracy. There are roughly 300,000 weight parameters in the DLM (illustrated in Fig. 1c); these were preset via Xavier initialization (uniformly distributed weights with standard deviation dependent on the number of parameters in a layer) in order to alleviate the vanishing gradient problem for deep NNs [36].
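The 20-fold partitioning can be sketched as below; the random seed and the helper name are our own choices for illustration.

```python
import numpy as np

def twenty_fold_splits(n_probes, n_folds=20, seed=0):
    """Randomly partition probe indices into 20 classes; each DLM is
    trained on 19 classes (95%) and predicts the held-out one (5%)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_probes), n_folds)
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        yield train, test

# 39,145 SNP-panel probes minus the 1,105 zero-read probes excluded later.
splits = list(twenty_fold_splits(38040))
```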
During training, we iteratively minimized the squared error between the predicted and experimental log sequencing depth, using gradient descent with an Adam optimizer [37] to update the network weights. To minimize overfitting, we implemented an additional dropout layer after each hidden layer of the FFNN, in which 20% of parameters are randomly selected and prevented from updating in each training iteration. The DLM was implemented using Tensorflow [38], and DLM hyper-parameters include GRU hidden nodes (128), FFNN hidden nodes (256 and 128), batch size (999), learning rate (0.0001), and node dropout fraction (20%). We tried roughly 50 sets of different hyper-parameter values, and the values listed appeared to yield the shortest training time and best predictive performance.
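The Adam update used in training can be sketched as follows, here minimizing a squared error toward a single illustrative target value rather than the full DLM loss; the learning rate matches the 0.0001 listed above, and all other specifics are stand-ins.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (Kingma & Ba): momentum and RMS accumulators
    with bias correction, then a scaled gradient step."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)            # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)            # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy objective: squared error between a scalar "prediction" w and an
# illustrative measured log10(depth) of 2.5.
w = np.zeros(1)
m = np.zeros(1)
v = np.zeros(1)
target = 2.5
for t in range(1, 2001):
    grad = 2 * (w - target)              # d/dw of (w - target)^2
    w, m, v = adam_step(w, grad, m, v, t)
# w has moved steadily toward the target at roughly lr per step
```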
Fig. 2c summarizes the root-mean-square error (RMSE) of the DLM-predicted values of log10(Depth) based on different sequences vs. the actual observed NGS read depth for a human single nucleotide polymorphism (SNP) panel comprising 39,145 probes synthesized as a pool by Twist Biosciences. Of the 39,145 probes, NGS results showed 0 reads on 1,105 probes. Our previous studies on Twist oligonucleotide pools suggest that the lack of sequencing reads for these may indicate difficulties with probe synthesis [43]. Consequently, we chose to exclude these probes with 0 observed NGS reads in order to eliminate the possibility of training against noise.
The DLM yields an average RMSE of roughly 0.30 on both the training classes and the test classes. For comparison, a naive model that predicts sequencing depth based only on the mean log10(Depth) of all observed sequences produces an RMSE of 0.41. Fig. 2d plots the comparison of predicted and measured sequencing depths for each sequence. From this figure, we see that a significant contributor to our DLM’s RMSE is a subset of DNA sequences that are observed to have very low log10(Depth) (e.g. 0.3, corresponding to a depth of 2), but predicted to have log10(Depth) between 1 and 3.3. Our interpretation of this phenomenon is that there are myriad possible reasons why probes may yield extremely low depth, e.g. poor probe synthesis yield, poor hybridization yield, sequence biases in nonspecific DNA binding to plasticware, unintended crosstalk binding between different subsequences of the human genome, and poor bridge PCR efficiency during Illumina NGS that results in under-representation of reads. The DLM does not have sufficient training instances for each reason to be able to make confident predictions of very low sequencing depth.
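The naive baseline referenced above is simply the RMSE of a constant mean predictor; it can be sketched as below on synthetic data (the real comparison uses the panel's measured depths, so the numbers here only illustrate the computation).

```python
import numpy as np

def rmse(pred, actual):
    """Root-mean-square error between predicted and measured values."""
    return float(np.sqrt(np.mean((pred - actual) ** 2)))

# Synthetic stand-in for measured log10(depth) values: the spread of
# 0.41 mirrors the naive-model RMSE reported for the SNP panel.
rng = np.random.default_rng(0)
depths = rng.normal(loc=2.0, scale=0.41, size=1000)

# Naive model: predict the panel-wide mean log10(depth) for every probe.
naive_pred = np.full_like(depths, depths.mean())
baseline = rmse(naive_pred, depths)  # equals the sample standard deviation
```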
Fig. 2e shows the DLM results applied to a second NGS panel comprising 7,373 DNA probes against non-biological DNA sequences intended for DNA information storage applications. As these sequences were procedurally generated to avoid known problematic DNA sequences, such as those with high or low G/C content (Fig. 2b), homopolymers, etc., there is much less variation in sequencing depth to begin with. Nonetheless, the DLM is effective at predicting sequencing depths (Fig. 2e) beyond a naive model (see also Supplementary Section S3).
Our DLM in total contains over 300,000 parameters (e.g. node biases and node-node weights). The large number of parameters leads to potential concerns regarding overfitting and model reproducibility. To address this, we next performed 15 independent rounds of DLM parameter initialization and training on the dataset, in order to characterize the reproducibility of the model (Fig. 3). All 15 DLMs consistently reached early-stop at roughly epoch 250 (Fig. 3a), and the predicted sequencing depths showed high pairwise concordance (Fig. 3b,c). Across the 105 pairwise comparisons of the 15 DLMs, we observe a Pearson’s r value of no less than 0.975. Consequently, we believe that our approach produces DLMs with fairly consistent predictions despite variations in parameter initialization and training.
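The pairwise concordance check can be sketched as follows; the 15 prediction vectors here are synthetic stand-ins (shared signal plus small initialization-dependent noise), so only the bookkeeping (C(15,2) = 105 comparisons, minimum Pearson's r) mirrors the analysis above.

```python
import numpy as np
from itertools import combinations

def min_pairwise_pearson(preds):
    """Minimum Pearson's r over all pairs of prediction vectors."""
    rs = [np.corrcoef(preds[i], preds[j])[0, 1]
          for i, j in combinations(range(len(preds)), 2)]
    return min(rs), len(rs)

# Synthetic stand-ins for 15 independently initialized/trained DLMs.
rng = np.random.default_rng(0)
signal = rng.normal(size=500)                       # shared "true" pattern
preds = [signal + rng.normal(scale=0.1, size=500)   # per-model variation
         for _ in range(15)]

r_min, n_pairs = min_pairwise_pearson(preds)        # n_pairs == 105
```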
DLM prediction of single-plex DNA hybridization and strand displacement rate constants. Based on our understanding of DNA and NGS, we believe that NGS read depth is primarily dependent on the yield and speed of DNA probe hybridization, and secondarily on instrument- and chemistry-specific biases. Consequently, our DLM should also be effective at predicting the rate constants of hybridization of DNA (Fig. 4a). To further challenge our DLM and to highlight the effectiveness of our DLM approach, we further applied the DLM to the prediction of a related DNA mechanism, strand displacement [24, 25] (Fig. 4b). Unlike NGS experiments in which thousands of DNA probes and targets are simultaneously hybridizing, for DNA hybridization and strand displacement rate constant prediction, we use time-based fluorescence data in which a single target and probe species are observed with high time and yield resolution (Fig. 4c, Supplementary Section S2); see also ref. [31] for additional explanation of hybridization reaction kinetics experimental details.
Experimental hybridization rate constants are taken from ref. [31], and experimental strand displacement rate constant data were collected for this work. Supplementary Section S2 describes details of how best-fit rate constants are fitted to fluorescence-based kinetics data. Fig. 4d shows the DLM prediction vs. experimental best-fit rate constants for 210 hybridization reactions and 211 strand displacement reactions. Here, we performed 100-fold leave-one-class-out (LOCO) prediction rather than 20-fold cross validation, because the smaller number of data points would lead to significant biases due to small sample sizes of the test set; see Supplementary Section S3 for details. The variation in rate constants observed is similar to the variation in NGS sequencing depths (4 logs), though the latter is possibly somewhat smaller due to the saturation of hybridization for the timescales of NGS hybrid-capture reactions.
The purpose of this sub-study was to see if the DLM could be used for non-NGS applications of nucleic acid molecular diagnostics, such as those based on qPCR [41] and electrochemistry [27]. Importantly, our DLM was trained simultaneously on both the hybridization and the strand displacement rate constant datasets. Because the target T and probe P sequences for the hybridization reactions are identical to those of the strand displacement reactions, the difference between the two is manifested only in the predicted probability of each nucleotide being unpaired. This information alone was enough to communicate to the DLM the distinction between hybridization and strand displacement, and no special case handling or neural network architecture modification was needed to accommodate strand displacement.
Contribution of Different Features to DLM Performance. The architecture of the DLM and the local and global features were initially decided based on our understanding of the behavior of DNA, rather than through knowledge-free exploration. Consequently, it is possible that some of the features are not directly relevant to NGS depth or hybridization/strand displacement rate constants. To test this hypothesis, we next constructed a series of DLMs in which different features were removed from the model (Fig. 5).
We found that removing any of 3 of the 4 global features individually had essentially no impact on any of the predictions. Both sets of local features (sequences and base probabilities) were important for some aspect of the DLM prediction, but it appears that the two are interchangeable for predicting the NGS depths of the human SNP panel. Examining the global features more closely, we note that the temperature T is the same for all sequences within a panel, so it is tautological that the DLM cannot learn any impact from changing T.
The standard free energy of formation of the target-probe duplex, ∆G°(TP), likely did not matter because the lengths of all probes/targets were long enough that probe binding was no longer limited by its thermodynamics. Finally, the standard free energy of folding of the target by itself, ∆G°(T), likely did not matter because it was not, and could not be, accurately calculated: whereas the probe P has a homogeneous molecular population with a well-defined sequence, the target T is a heterogeneous mixture constructed through randomized physical fragmentation of human genomic DNA. Consequently, the 5′ and 3′ overhang sequences of the target are highly variable, and cannot be reflected as a single sequence. Prediction accuracies of all feature-reduced DLMs are summarized in Supplementary Section S4. The performance of even the reduced models is in general much better than that of naive models (Supplementary Section S5).