Real-time selective sequencing using nanopores and deep learning

Nanopore sequencing is an emerging technology that uses a unique method of reading nucleic acid sequences and, at the same time, detects various chemical modifications. Deep learning has increased in popularity as a useful technique for solving many complex computational tasks. Selective sequencing has been widely used in genomic research; although it introduces several caveats to the sequencing process, its advantages outweigh them. In this study we demonstrate an alternative method of software-based selective sequencing that is performed in real time by combining nanopore sequencing and deep learning. Our results show the feasibility of using deep learning to classify reads from only the first 200 nucleotides of a raw nanopore sequencing signal. Using custom deep learning models and a script built on the "Read Until" framework to target mitochondrial molecules in real time in a human cell line sample, we achieved a significant separation and an enrichment of more than 2-fold. In a series of very short sequencing runs (10, 30, and 120 minutes), we identified genomic and mitochondrial reads with accuracy above 90%, although mitochondrial DNA comprises only 0.1% of the total input material. We believe that our results will lay the foundation for rapid and selective sequencing using nanopore technology and will pave the way for future clinical applications using nanopore sequencing data.


Next generation sequencing
Next generation sequencing (NGS) has revolutionized DNA sequencing and laid the foundation for a plethora of scientific and clinical opportunities. One recently emerging technology is nanopore sequencing (for example, the devices developed by Oxford Nanopore Technologies, ONT) 1 . In this study we used ONT's portable MinION sequencer, which was released in 2014. Sequencing is performed by measuring the changes in ionic current produced by individual nucleic acids as single DNA strands pass through an array of protein nanopores. These changes are detected by a sensor and saved on a computer for later analysis 2 . The recorded ionic current, known as the "raw signal" or "squiggle," is mainly used for basecalling, that is, translating the raw signal into nucleotides. To date, the vast majority of studies that use nanopore sequencers ignore the raw signal after using it to generate a nucleotide sequence. A few researchers, however, have used this signal for other tasks, such as improving the accuracy of a consensus sequence or investigating chemical modifications on the DNA 3,4,5 .

Deep learning
Deep learning is a subset of machine learning methods that has gained popularity in recent years after overtaking other methods in the field of image classification 6 . Deep learning has been applied to fields such as image, video, audio, and natural language processing, where it has been used to perform tasks such as classification, generation, prediction, and detection [6][7][8] . Therefore, it is plausible that similar deep learning approaches could be applied to nanopore sequencing data analysis. These approaches include convolutional neural network (CNN) and recurrent neural network (RNN) architectures that have been used for audio signal analysis 6, [9][10][11] . Initially, the raw nanopore signal was translated to nucleotides using a Hidden Markov Model (HMM) 12,13 , but recently deep learning was found to perform the task better, and it is now used to translate a raw nanopore signal into a nucleotide sequence 14,15 . Deep learning is also used for tasks such as predicting DNA methylation 16 and simulating a raw signal based on a reference genome 17 . These findings reinforce our suggestion to use deep learning to classify reads based on their raw signal. Hence, we tested several commonly used deep learning architectures that were previously applied to similar data in order to select the one best suited to our analysis.

Selective sequencing
Selective sequencing (or sequencing of targeted genomic regions) is a widespread technique used in many applications where the goal is to sequence specific portions of a DNA molecule from a larger pool of genetic material. Targeting only part of the DNA can save resources, time, and money.
Selective sequencing is traditionally based on physically isolating parts of the DNA during the library preparation steps, prior to sequencing 18-20 . Recently, it was also performed during standard nanopore library preparation 21,22 . Traditional selective methods, however, have been found to introduce bias into the output, such as uneven coverage and divergent results from different library preparation kits 23 ; therefore, an alternative method could benefit researchers.

Nanopore selective sequencing
With the introduction of nanopore sequencing, an exciting new feature, "Read Until", makes it possible to selectively "reject" DNA molecules before the entire molecule has been sequenced 24 . The decision to reject a molecule is based on its initial portion, potentially saving time and reagents by not sequencing the entire molecule. Several studies have demonstrated real-time selective sequencing using the nanopore "Read Until" feature. In the first published study, Loose et al. demonstrated selective sequencing with the genome of Lambda phage 24 ; dynamic time warping (DTW) was used to determine whether a DNA molecule should be sequenced or not. This approach imposed restrictions on the length of the possible target and reference sequences. In another study, Edwards et al. performed real-time selective sequencing by basecalling the start of each molecule online and then deciding which molecules to sequence by mapping them to a reference library using the LAST aligner 25 . This is similar to the method used by Payne et al., who mapped the basecalled nucleotides to a reference genome using minimap2 26 . This approach removed the constraints caused by the DTW algorithm; however, it introduced two separate steps (basecalling and mapping) into the decision process. Another study used the same concept of basecalling and mapping the reads to a reference genome, but for a different purpose, namely, to achieve more uniform coverage 27 . Finally, in a more recent work, Kovaka et al. probabilistically decoded the raw signal into k-mers using an HMM-based technique, achieving an enrichment factor of 4.46 28 .

Our contribution
Here we apply selective sequencing to nanopore sequencing via a unique deep learning approach. We begin by developing a deep learning model that accepts as input only the first 2,000 values of a raw signal, which equates to roughly 200 base pairs. We decided to focus on a biologically significant region of human DNA that is likely to provide enough data in whole genome sequencing experiments; this is a prerequisite for training the deep learning model. We chose to perform selective sequencing on mitochondrial DNA. The mitochondrion is a cellular organelle within eukaryotic cells whose genome contains about 16K base pairs and encodes 13 proteins. Mitochondrial DNA has been sequenced many times, has high coverage in publicly available nanopore datasets, and is of biological and medical significance when analyzing human sequencing data 29 . We trained the model to classify sequencing reads as 'mitochondrial' or 'genomic' based on the signal. Analyzing the raw signal directly bypasses the error-prone basecalling step and allows the deep learning model to incorporate additional information present in the raw signal, such as DNA modifications 4,5 . This could potentially increase the accuracy of DNA classification by eliminating data analysis steps and increasing the volume of information available to the deep learning model. Unlike previous attempts at real-time selective sequencing, our method requires neither a nucleotide reference nor a generated signal reference. Bypassing a reference avoids the growth in run time and complexity that occurs as a reference database expands. We also tested several deep learning architectures for sequence analysis on several datasets of nanopore signal data and applied them to classifying reads of different DNA origins.
Finally, we selected the model with the highest classification accuracy and combined it with the "Read Until" API to perform a sequencing experiment in which our model was used to selectively sequence mitochondrial DNA.
Overall, a new real-time selective sequencing method would not only alleviate the challenges caused by additional steps during library preparation; it would also allow the targeted regions to be changed during the experiment simply by modifying a parameter in the software. Our method has the potential to increase accuracy and speed up the sequencing process, and it could eventually be applied in any clinical setting where time-sensitive DNA sequencing is of the essence.

Data organization, preprocessing, and augmentation
For the purpose of training and testing our deep learning models, we used two publicly available nanopore sequencing datasets: (i) Jain et al. produced a human genome assembly using long reads from nanopore sequencing 30 . About 14 million reads were sequenced and aligned to the 1000 Genomes GRCh38 reference genome 31 . From this dataset, we used 60,000 reads that aligned to the mitochondria and 200,000 random reads that aligned to the rest of the human genome. (ii) The "Cliveome" dataset, which was sequenced by ONT and released to the public in 2016 32 . From this dataset, we used 8,000 reads that mapped to the mitochondria as well as 200,000 random reads that mapped to the rest of the human genome. In each dataset we separated the sequenced reads randomly into training, validation, and test sets containing 80%, 10%, and 10% of the total reads, respectively. Only the first portion of each raw signal was used, to simulate reading the beginning of the molecule with the Read Until feature.
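The random 80%/10%/10% partition described above can be illustrated with a minimal sketch. The function name `split_reads`, the fixed seed, and the use of plain read identifiers are assumptions made for illustration; they are not taken from the authors' code.

```python
import random


def split_reads(read_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle read IDs and split them into train/validation/test lists.

    Hypothetical helper: the fractions follow the text, the seed is an
    illustrative choice to make the split reproducible.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    ids = list(read_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

In such a setup the split would be applied separately to the mitochondrial and the genomic reads, so that each subset keeps both classes.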
Deep learning requires iterating through the training dataset in mini-batches, which allows large datasets to be handled and improves the training results 33 . In this research we used the PyTorch 34 deep learning framework, which contains a DataLoader class; we customized this class to allow parallel data loading with custom data transformations. Our custom dataloader applies four transformations to the signal. The first transformation randomly selects a region of 2,000 values from the total 5,000 values. The second transformation changes the signal from raw values, which represent the electric current level, to differential values in order to eliminate possible bias between the voltages of different devices and flow cells. The third transformation cuts the signal into a sliding window array, transforming the long 1D linear signal into a 2D array of stacked sliding windows. The final transformation adds Gaussian noise to the sample to mimic the background noise in nanopore sequencing. All of the transformations improved the training process and the final accuracy; further details are in the Supplementary Methods.
Further model details and the justification for their selection are presented in the Supplementary Methods. All models were tested in three sizes corresponding to the number of hidden parameters: large, medium, and small. All models were tested extensively with different configurations, as explained in the Supplementary Methods section.
We also attempted to combine a CNN model with an RNN model; a schematic overview can be seen in Supplementary Figure 1. In theory, a CNN is good at representing proximal features and an RNN can find long-distance dependencies; by combining these techniques, our model could utilize both short- and long-distance information hidden in the raw signal 40 . We combined the VDCNN with a regular GRU, as well as the VDCNN with an LSTM with recurrent batch normalization, and tested multiple configurations of these models, as described in the Supplementary Methods section.
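The four signal transformations (random crop, differencing, sliding windows, Gaussian noise) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the window size, stride, and noise standard deviation are placeholder values not given in the text, and the function names are hypothetical rather than the authors' dataloader code.

```python
import random


def random_crop(signal, length=2000, rng=random):
    """Pick a random contiguous region of `length` values."""
    if len(signal) < length:
        raise ValueError("signal shorter than crop length")
    start = rng.randrange(len(signal) - length + 1)
    return signal[start:start + length]


def to_differences(signal):
    """Replace raw current levels with first-order differences."""
    return [b - a for a, b in zip(signal, signal[1:])]


def sliding_windows(signal, window=200, stride=100):
    """Stack overlapping windows so the 1D signal becomes a 2D array.

    window and stride are assumed values for illustration.
    """
    return [signal[i:i + window]
            for i in range(0, len(signal) - window + 1, stride)]


def add_gaussian_noise(rows, sigma=1.0, rng=random):
    """Add zero-mean Gaussian noise (sigma is an assumed value)."""
    return [[x + rng.gauss(0.0, sigma) for x in row] for row in rows]


def transform(signal, rng=random):
    """Apply the four transformations in the order given in the text."""
    cropped = random_crop(signal, 2000, rng)
    diffs = to_differences(cropped)
    windows = sliding_windows(diffs)
    return add_gaussian_noise(windows, rng=rng)
```

With these placeholder settings, a 5,000-value signal yields 18 windows of 200 values each; in practice the noise would be applied only during training.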
To eliminate any differences in model accuracy due to differences in the training process, the same Python script was used to train all models. We used the training dataset during training and the validation dataset for hyperparameter tuning; the test dataset was used exclusively at the final stage to measure the accuracy of each model. Accuracy was measured separately for genomic reads and for mitochondrial reads; total accuracy was calculated by averaging the accuracy on the mitochondrial reads and the accuracy on the genomic reads. The Adam optimizer 41 was used; the learning rate and other optimizer parameters were determined by a manual search. All models were trained for 300 epochs, and the learning curve of each model was assessed to determine whether the loss curve plateaued and whether overfitting became an issue. Supplementary Figure 2 illustrates the learning curve of the model with LSTM and recurrent batch normalization as an example of a successfully trained model.
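Because the two classes are highly imbalanced, the total accuracy defined above is the mean of the per-class accuracies rather than the fraction of all reads classified correctly. A minimal sketch of that calculation, with hypothetical function names and label strings, is:

```python
def per_class_accuracy(labels, predictions, cls):
    """Fraction of reads whose true label is `cls` that were predicted as `cls`."""
    total = sum(1 for y in labels if y == cls)
    if total == 0:
        raise ValueError(f"no reads with label {cls!r}")
    correct = sum(1 for y, p in zip(labels, predictions)
                  if y == cls and p == cls)
    return correct / total


def total_accuracy(labels, predictions,
                   classes=("mitochondrial", "genomic")):
    """Average of the per-class accuracies, as described in the text."""
    scores = [per_class_accuracy(labels, predictions, c) for c in classes]
    return sum(scores) / len(scores)
```

For example, a model that labels every read as genomic scores 100% on genomic reads and 0% on mitochondrial reads, giving a total accuracy of 50% regardless of how rare mitochondrial reads are.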
After training all models on the primary dataset, a second dataset (Cliveome) was used to test the models for generalization. At first, the accuracy of the models was tested on the test set of the second dataset without any additional training. Later, all models were trained for 30 epochs on the training data of the second dataset in order to improve the accuracy specifically for that dataset (fine-tuning).
After the additional training, all models were tested again with the second dataset and their accuracy was recorded.
Approximately 400 ng of purified DNA in a total volume of 7.5 μl in a 0.2 ml PCR tube was used as input for sequencing library preparation using Oxford Nanopore Technologies' Rapid Sequencing kit (SQK-RAD004, version RSE_9046_v1_revB_17Nov2017) according to the manufacturer's instructions. For fragmentation and transposase adapter attachment, 2.5 μl FRA was added to the DNA and mixed by inversion. The sample was then incubated at 30°C for 1 minute, followed by 80°C for 1 minute, and finally cooled on ice. Sequencing adapters were then attached by adding 1 μl RAP to the mixture and mixing by inversion. The sample with sequencing adapters was incubated at room temperature for 5 minutes and then stored on ice until it was ready for sequencing.
MinION sequencing was conducted according to the manufacturer's instructions using R9.4 and R9.4.1 rev. D flow cells (FLO-MIN106, ONT). After flow cell priming, 4.5 μl nuclease-free water, 34 μl sequencing buffer (SQB), and 25.5 μl mixed loading beads (LB) were added to the library and mixed by gently flicking the tube immediately before loading into the SpotON port.
Three sequencing experiments were performed under these conditions; they are referred to as the "HEK1", "HEK2", and "HEK3" runs. The results of the first two experiments were used for testing and training the model, whereas the third experiment was run in conjunction with the Read Until script to perform real-time selective sequencing.

Sequencing data analysis
After data were acquired from the first two sequencing experiments, HEK1 and HEK2, the reads were translated to nucleotides using ONT Albacore version 2.2.5. Although Albacore is no longer supported, a recent comparison of basecalling software indicated that the differences between Albacore and more modern basecallers 42 are minuscule for the purposes of our experiments. The reads were mapped to the GRCh38 human reference genome using minimap2 43 version 2.11. Reads were separated into mitochondrial reads and genomic reads based on their mapping, and each group was separated into training/validation/testing groups with proportions of 80%/10%/10% of the total reads, respectively. Initially, the accuracy of the models trained on the first dataset was tested with the HEK1 data. Later, the models were trained for 30 epochs on the HEK1 data and accuracy was tested again (fine-tuned). The best performing model was determined by the highest accuracy value on the HEK2 data and saved for later use with Read Until in the HEK3 sequencing experiment.
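Separating reads by their mapping can be sketched as below, assuming the alignments are available in minimap2's tab-separated PAF output (column 1 is the read name, column 6 the target sequence name, column 12 the mapping quality). The contig name `chrM`, the function name, and the rule of keeping the highest-quality alignment per read are assumptions for illustration; the text does not specify how multiple alignments were resolved.

```python
def label_reads_from_paf(paf_lines, mito_contig="chrM", min_mapq=0):
    """Return {read_id: 'mitochondrial' | 'genomic'} from PAF lines.

    For each read, the alignment with the highest mapping quality is kept.
    mito_contig is an assumed name; it depends on the reference FASTA used.
    """
    best = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or truncated lines
        read_id, target, mapq = fields[0], fields[5], int(fields[11])
        if mapq < min_mapq:
            continue
        if read_id not in best or mapq > best[read_id][1]:
            best[read_id] = (target, mapq)
    return {
        read_id: "mitochondrial" if target == mito_contig else "genomic"
        for read_id, (target, _) in best.items()
    }
```

Reads without any alignment do not appear in the PAF file and therefore receive no label in this sketch.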
To test the performance of Read Until, we utilized the developmental API provided by ONT and wrote a custom script to perform selective sequencing based on the "simple.py" file from the Read Until GitHub repository. This script receives the raw signal at the beginning of every DNA molecule; the raw signal is analyzed by the deep learning model, and finally the script sends a signal to the MinION device either to keep sequencing the DNA molecule or to stop and remove the unwanted DNA molecule from the pore. Reads that were classified by the model as mitochondrial were allowed to be fully sequenced, whereas the rest of the reads received a signal to terminate their sequencing. In order to validate the results, we performed the experiments with 3 technical repeats for 3 different time spans: 10 minutes, 30 minutes, and 120 minutes. For each time span we performed 3 regular sequencing experiments without Read Until and 3 sequencing experiments with Read Until. To account for the deterioration of the flow cell over time and to reduce technical bias, we alternated between experiments with and without Read Until. The reads were basecalled and mapped to a human reference genome, and the alignment statistics were then collected for each sequencing experiment. Logistic regression with proportions and a random effects variable 44 was performed to test for differences in the proportion of sequenced mitochondrial nucleotides to total sequenced nucleotides. A comparison was made between pairs of technical repeats as follows: 10min_with_read-until_run1 VS 10min_without_read-until_run1, 10min_with_read-until_run2 VS 10min_without_read-until_run2, etc. Additionally, read lengths were collected for each of the experiments and analyzed using the Fisher-Pitman permutation test 45 to check for statistical differences between the read lengths of the different groups.
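The decision logic of such a script can be illustrated with a minimal sketch. Everything here is hypothetical: `RecordingClient`, `unblock`, `keep`, `decide`, and `process_batch` are stand-in names invented for this example and are not the names used by the ONT Read Until API or by the authors' script; `classify` stands for any callable wrapping the trained model.

```python
def decide(signal_chunk, classify, n_values=2000):
    """Return 'sequence', 'unblock', or 'wait' for one read.

    'wait' means fewer than n_values samples have arrived so far.
    """
    if len(signal_chunk) < n_values:
        return "wait"
    label = classify(signal_chunk[:n_values])
    return "sequence" if label == "mitochondrial" else "unblock"


class RecordingClient:
    """Hypothetical stand-in for the sequencer connection; it only records actions."""

    def __init__(self):
        self.unblocked = []
        self.kept = []

    def unblock(self, channel, read_id):
        # In a real run this would ask the device to eject the molecule.
        self.unblocked.append((channel, read_id))

    def keep(self, channel, read_id):
        # In a real run this would let the read continue to completion.
        self.kept.append((channel, read_id))


def process_batch(client, batch, classify):
    """Apply the decision to each (channel, read_id, signal) tuple."""
    for channel, read_id, signal in batch:
        action = decide(signal, classify)
        if action == "unblock":
            client.unblock(channel, read_id)
        elif action == "sequence":
            client.keep(channel, read_id)
```

A real implementation would also need to track reads that are still waiting for more signal and to keep the per-read latency low, since the decision is only useful while the molecule is still in the pore.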

Deep Learning model selection
We trained 90 models in total while saving the accuracy statistics (see Supplementary Data File 1). More than half of the models exhibited total accuracy above 70% for all datasets after training. A summary of the results can be seen in Table 1. Larger models generally achieved higher accuracy than the smaller versions, as can be seen in Supplementary Data File 1. In addition, the models performed better after fine-tuning on a particular dataset (Table 1 as well as the rest of the results in Supplementary Data File 1). Furthermore, the addition of a dropout or a batch normalization layer generally improved the performance of all models. When comparing different architecture types, the RNN-type models (regular LSTM, LSTM with BN, and GRU) achieved higher accuracy than the CNN-type networks (regular CNN and VDCNN), as seen in the total accuracy scores in Table 1.
Table 1 abbreviations: Small/Medium/Large, size of the model; +D, dropout; +S, shortcut; +MP, max pooling; +BN, regular batch normalization; +LS, last step taken from the RNN; +HO, hidden output taken from the RNN.

Real-time selective sequencing with Read Until
Based on the previous results, we selected the LSTM with recurrent batch normalization model, which achieved the highest accuracy of 95.81% on the HEK2 data and the highest overall accuracy of 92.45%. The selected model was used in conjunction with Read Until to perform real-time selective sequencing. The Read Until script was configured to sequence only molecules that were classified as mitochondrial reads by the model. During sequencing of HEK3, the accuracy of the model was above 90%, which corresponds to the accuracy measured on HEK1 and HEK2 without fine-tuning.
The enrichment factor of our method was measured by calculating the difference in the percentage of mitochondrial nucleotides between experiments with and without selective sequencing. This was carried out in order to normalize the samples for total sequencing output and to eliminate any variance due to its variability over the course of the experiment. We achieved a normalized enrichment factor of 2.3X (p.val < 0.05, Figure 1). When we compared the averages of the mitochondrial nucleotides in all the experiments with and without selective sequencing (without normalization), we achieved an enrichment factor of 1.34X. When we compared the means of the mitochondrial coverages (as shown in Table 2), we achieved an enrichment factor of 1.32X. Even though most of the molecules were classified as genomic by the deep learning model and should not have been sequenced, most of the molecules classified as genomic were sequenced and saved to the hard drive (see the Discussion). In addition to differences in the percentage of nucleotides, we examined alterations in read lengths: there was a clear distinction in read length between the groups. There was no significant difference between the read lengths of the mitochondrial and genomic reads in experiments without selective sequencing (p.val>0.8, difference between means=~75). However, there was a trend towards mitochondrial reads being longer than genomic reads in experiments with selective sequencing (p.val<0.1, diff=~1400). When comparing the read lengths of the mitochondrial reads in experiments with selective sequencing to those without it, the mitochondrial reads with selective sequencing were shorter than mitochondrial reads from experiments without selective sequencing (p.val<0.005, diff=~2100).
A larger and more significant difference was observed between the genomic read lengths in experiments with selective sequencing and those without it; the genomic read lengths were significantly longer in experiments without selective sequencing (p.val<0.0005, diff=~3600).
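The read-length comparisons above rely on a permutation test of the difference between group means. A minimal Monte Carlo sketch of such a two-sample test is given below; the function name, the number of permutations, and the two-sided test statistic are assumptions for illustration and may differ from the implementation the authors used.

```python
import random


def permutation_test_mean_diff(a, b, n_permutations=10000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means.

    Returns an estimated p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    n_b = len(pooled) - n_a
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / n_b)
        if diff >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)
```

Here `a` and `b` would be the read lengths of two groups, for example genomic reads with and without selective sequencing.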

The process and the results of the deep learning model training
When we examined the results of the deep learning model training, the overall high accuracy (>70%) of most of the models indicated that deep learning in general might be an appropriate solution for read classification based on raw signals. The higher accuracy of larger models is an expected outcome because larger networks have more weights that can be adjusted during the training process and can possibly capture more of the variability in the data 46 . However, it is surprising that the smaller regular CNN model performed better than the medium and large CNN models. This could be explained by comparing all of the CNN networks: the larger networks performed well (>90% accuracy) on some datasets, but on other datasets they would either overfit or not train at all. The smaller CNN network, in contrast, had much lower accuracy but performed similarly across all datasets; therefore, it had a higher total accuracy. This is possibly due to the relative simplicity of a CNN model and the fact that smaller CNN models have fewer weights to train; thus, smaller CNN models have to generalize better than the larger models 47 .
As expected, VDCNN outperformed a regular CNN, as was shown in the original paper 36 . Another expected result is that the RNN-type architectures (regular LSTM, LSTM with recurrent batch normalization, and GRU) outperformed the CNN-type architectures. The data in our study can be described as sequential input, which is the type of data that the RNN architecture was designed to analyze 6 . However, we observed that the average accuracies of regular LSTM and VDCNN were similar, which can be explained by the relative simplicity of the one-layered LSTM model compared with the more complex VDCNN with 17 layers. We expected the combined CNN + RNN model to outperform each type individually, because convolutional networks are useful for feature extraction 48 and can produce better results when used in conjunction with an RNN 49 . In our case, however, the combination of CNN + RNN produced results similar in accuracy to those of LSTM with recurrent batch normalization. These results could be explained either by near-optimal training of the LSTM with recurrent batch normalization model or by sub-optimal training of the CNN + RNN models.
Fine-tuning the models on a small portion of the dataset before analyzing the rest of the dataset improved the results for some models, as seen in Table 1. Each dataset was acquired from a different sequencing experiment, and several variables could affect the raw signal, such as different chemistry kits, different MinION devices, different library preparation protocols, and different sample qualities. Therefore, by fine-tuning the model to each experiment, we increased the model's accuracy for those specific conditions. Dropout and batch normalization improved the performance of most models, as was expected based on their contribution to the training process of deep learning models 6,50 . In addition, models trained before the difference transformation was added to the raw input performed very poorly, and overfitting was a major problem before artificial noise was added, which is known to help with the training of deep learning models 51 ; therefore, those two transformations were applied to the training of all models.
The mechanism by which the deep learning models perform the classification remains unknown. The models could either "simply remember" the relatively short sequence of the mitochondria (only 16.5K nucleotides) and determine which reads originate from this sequence, or they could extract specific features from the reads, such as GC content or k-mer content, and more complex features, such as protein sequence and structure or DNA methylation, and learn during training which features are present in genomic sequences and which are present in mitochondrial sequences. We also postulate that the models could have learned more sophisticated features of the mitochondrial DNA, such as a different codon encoding or the density of the genetic information 52 . Furthermore, deep learning models have been shown to successfully detect circular plasmids based on their sequencing data by examining larger chunks of the plasmid sequences, obtained from longer reads, as well as additional genomic features 53 ; such information could contribute to a successful classification. We think that a thorough analysis of a trained deep learning model from this work, as was done for visual analysis models 48 , could provide useful insights for further research in this field and might reveal new biological features that were not considered important before.

Real-time selective sequencing
Our experiments, which utilized the ability of the MinION to perform selective sequencing combined with a deep learning model, demonstrated the validity of this method from several different angles. In terms of classification accuracy, our deep learning model achieved >90% accuracy in real time, which is similar to the results obtained during training and testing on the previous data. Even though there was some variance in the mitochondrial proportions between the experiments, caused by the relatively small amount of mitochondrial DNA present in the sample, a statistical examination of the mitochondrial nucleotide proportions showed significant differences between experiments with and without selective sequencing, indicating that our method worked successfully. The differences in the read lengths of the mitochondrial and genomic reads between the different experiments also support the conclusion that executing the selective sequencing script prioritized the mitochondrial reads over the genomic reads during sequencing.
To assess the results of our selective sequencing script, we can calculate the expected results in a theoretically perfect hardware-software configuration (where each read that was marked for rejection would not have been sequenced), with a model of 90% accuracy and samples in which 0.1% of reads are mitochondrial (similar to our samples). We can calculate this theoretical expected mitochondrial percentage using the following formula (the expected "true" mitochondrial reads divided by the genomic reads falsely classified as mitochondrial reads):
$$\frac{0.001 \times 0.9}{0.999 \times (1 - 0.9)} \approx 0.009$$
From this calculation, we can infer that with selective sequencing in theoretically perfect conditions, we should achieve 0.9% mitochondrial reads; therefore, we would achieve an enrichment of 9X when using selective sequencing with perfect software-hardware performance. In our experiments we achieved an enrichment of 2.3X, demonstrating that the hardware is working but also that there is still much room for optimization in terms of software-hardware interaction with the Read Until feature of nanopore sequencing.
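The theoretical estimate can be checked with a short calculation. The sketch below assumes, as the text does, a single accuracy value that applies to both classes; the function names are hypothetical. It computes both the ratio described in the text (true mitochondrial reads over falsely accepted genomic reads) and the closely related fraction of all accepted reads that are mitochondrial, which differ only slightly when mitochondrial reads are rare.

```python
def accepted_reads(mito_fraction=0.001, accuracy=0.9):
    """Return (true positives, false positives) as fractions of all reads."""
    true_pos = mito_fraction * accuracy
    false_pos = (1.0 - mito_fraction) * (1.0 - accuracy)
    return true_pos, false_pos


def expected_mito_ratio(mito_fraction=0.001, accuracy=0.9):
    """True mitochondrial reads divided by falsely accepted genomic reads."""
    true_pos, false_pos = accepted_reads(mito_fraction, accuracy)
    return true_pos / false_pos


def expected_mito_fraction(mito_fraction=0.001, accuracy=0.9):
    """Mitochondrial share of all reads that pass the classifier."""
    true_pos, false_pos = accepted_reads(mito_fraction, accuracy)
    return true_pos / (true_pos + false_pos)


def expected_enrichment(mito_fraction=0.001, accuracy=0.9):
    """Fold change of the mitochondrial share relative to the input sample."""
    return expected_mito_fraction(mito_fraction, accuracy) / mito_fraction
```

Under these assumptions both quantities come out at roughly 0.9%, which corresponds to an enrichment of about 9X over the 0.1% starting proportion.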
Statistical analysis of the difference between the percentages of mitochondrial nucleotides sequenced with selective sequencing and those without it revealed a significant difference. The fact that there was a smaller difference between the percentages of mitochondrial reads in experiments with and without selective sequencing than between the genomic reads with and without selective sequencing also supports the idea that selective sequencing prioritizes the mitochondrial reads. Our claim that our selective sequencing method allowed us to sequence more mitochondrial sequences is further supported by the combined differences in the raw mitochondrial nucleotide counts without normalization, which showed that more mitochondrial sequences were sequenced in experiments with selective sequencing. Therefore, even in its current state, our approach could assist researchers in achieving better coverage of a certain region and could theoretically save time, resources, and budget by requiring less sequencing to achieve a similar goal. Furthermore, it is theoretically possible to change the classification model to target a different region of the genome, thus increasing the utility of this selective sequencing method. We can conclude that the genomic reads were shorter in experiments with selective sequencing probably because our script sent signals to the MinION device to stop sequencing the reads that were classified as genomic; thus, non-mitochondrial reads would be shorter than in experiments without the stopping signal. We currently do not know why the signals sent to the hardware to stop sequencing did not entirely prevent the sequencing of reads classified as genomic. However, other studies have reported delayed ejection of unwanted DNA molecules 28 , which shows that the software-hardware interaction is not perfect and could cause "rejected" DNA to be sequenced and saved.

Conclusion
From the results of the deep learning model training, we can conclude that the deep learning approach is a valid choice for classifying sequenced reads based on the first 2,000 values of the raw signal of the read.
There might be better models than those we tested here; however, even with our relatively simple and straightforward approach, we achieved good results in terms of accuracy and generalization when we tested different datasets. Furthermore, for the first time, we showed the ability of deep learning models to classify whole reads based on raw nanopore signals.
The selective sequencing experiments we performed with our script, using the best deep learning model from the previous steps, produced enough evidence to conclude that our script prioritized mitochondrial reads over genomic reads. The deep learning model classified the reads correctly while they were being sequenced; an analysis of the proportion of mitochondrial DNA and the differences in the read lengths revealed that the mitochondrial reads were being prioritized during sequencing and were enriched by a factor of 2.3X.
When combining the results from both parts of the experiment, we concluded that real-time selective sequencing is possible by using deep learning models to analyze the raw signal at the beginning of each

Supplementary Files
This is a list of supplementary files associated with this preprint: SupplementaryMaterialSupplementaryMethods.docx