MaxDEL: accurate and efficient calling of genomic deletions from single-molecule real-time sequencing using an integrated method

Background: Single-molecule real-time (SMRT) sequencing data are characterized by long reads and high read depth. Compared with next-generation sequencing (NGS), SMRT sequencing can reveal more structural variations (SVs) and has clear advantages for variant calling. However, SMRT sequencing data also contain high levels of sequencing error and noise, which make SV calling inaccurate, and most existing tools cannot overcome these errors when detecting genomic deletions. Methods and results: In this study, we propose MaxDEL, a new method for calling deletions from SMRT sequencing data. MaxDEL effectively overcomes the noise of SMRT sequencing data by integrating machine learning and deep learning techniques. First, it uses a machine learning method to calibrate the deletion regions reported in a variant call format (VCF) file. Second, MaxDEL introduces a novel feature visualization method that converts variant features into images and uses these images to call deletions accurately with a convolutional neural network (CNN). The results show that MaxDEL outperforms existing methods in terms of precision and recall on both real and simulated data. Conclusions: We propose MaxDEL, a method for calling deletion variants that effectively combines machine learning and deep learning. We tested it on different SMRT datasets and evaluated its effectiveness. The results show that machine learning and deep learning methods have great potential for calling deletion variants.

With advances in SMRT sequencing, most calling methods have been developed to detect single nucleotide polymorphisms (SNPs) and indels. Few methods call structural variations (SVs) as comprehensively and as accurately as possible from SMRT sequencing data. Therefore, how to identify SVs accurately and efficiently has become a focus of SMRT sequencing analysis.
When calling genomic deletions from SMRT sequencing data, reads mapped onto the reference genome near a deletion site present three main signatures. (ⅰ) Split reads (SR). When a read overlaps a breakpoint of a deletion, it is aligned as two non-contiguous parts; such a read is called a split read. Several calling methods are based on this signature, such as PbHoney [9], pbsv, Sniffles [10], Picky [11] and SVIM [12]. (ⅱ) Read depth (RD). Read depth within a deletion is lower than in non-deleted regions. If the deletion is homozygous, the read depth is close to zero; if the deletion is heterozygous, the read depth is still lower than expected. Read depth is the most commonly used signature for calling deletions; Sniffles also uses this strategy. (ⅲ) Local assembly (LA). The local assembly signature differs from the other two: LA detects variant regions by assembling reads, as in SMRT-SV [13]. In addition, there are other methods; for example, Next-SV [14] preprocesses the genome files and uses Sniffles to call variants.
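As a concrete illustration, the RD signature can be sketched in a few lines of Python. This is a minimal stdlib sketch, not MaxDEL's implementation; the window size, depth threshold and the `low_depth_windows` helper are illustrative assumptions.

```python
from collections import defaultdict

def window_depths(read_intervals, window=100):
    """Count per-window read depth from (start, end) alignment intervals.
    Windows are indexed by position // window."""
    depth = defaultdict(int)
    for start, end in read_intervals:
        for w in range(start // window, (end - 1) // window + 1):
            depth[w] += 1
    return depth

def low_depth_windows(read_intervals, window=100, min_depth=2):
    """Flag windows whose depth falls below min_depth -- a crude RD signal
    for a candidate homozygous (depth ~ 0) or heterozygous deletion."""
    depth = window_depths(read_intervals, window)
    covered = range(min(depth), max(depth) + 1)
    return [w for w in covered if depth.get(w, 0) < min_depth]

# Reads cover 0-500 and 800-1200, leaving a depth gap in between.
reads = [(0, 500), (0, 500), (800, 1200), (800, 1200)]
print(low_depth_windows(reads))  # → [5, 6, 7]
```

A real caller would of course read intervals from a BAM file and combine this signal with the SR signature rather than thresholding depth alone.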
Recently, deep learning methods have been used successfully to call genomic variations from NGS data. Google's DeepVariant [15] uses deep learning to call SNPs and indels: it takes the base sequence as the main information, generates an image of the bases, and treats variant calling as a special kind of image classification. Similarly, Cai et al.'s DeepSV [16], a tool for calling deletions from NGS data, also uses the base sequence as the main information; it generates RGB stack images from NGS data and uses a convolutional neural network (CNN) for image classification. DeepSV is a novel method for SV calling. These successful applications of deep learning to NGS data suggest that applying deep learning to SMRT sequencing data may also perform well in calling deletions.
In this study, we propose MaxDEL, an accurate and efficient integrated method for calling genomic deletions. MaxDEL addresses two problems in calling SVs from SMRT sequencing data. First, the deletion regions reported in the VCF file are inaccurate. Second, most existing tools use an empirical model to call SVs; an empirical model relies on manually extracted features, whereas a deep learning model can learn features on its own. We propose a distinct detection method based on a CNN [17] model that converts variant features into images and calls genomic deletions from these images.
To investigate how well MaxDEL calls variants, we compared it with five state-of-the-art methods: NextSV, SVIM, Sniffles, SMRT-SV and Picky. Performance is measured by three metrics: precision, recall and F1-score. We also counted false positives (FP), true positives (TP) and false negatives (FN). The results show that MaxDEL achieves better performance than these methods on both simulated and real data.
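For reference, all three metrics follow directly from the TP/FP/FN counts. The helper below is the standard computation, not part of MaxDEL itself; the example counts are made up.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1-score from raw counts,
    guarding against division by zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: 90 true calls, 10 false calls, 5 missed deletions.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=5)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.9 0.947 0.923
```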

The high-level approach
MaxDEL is a deep-learning-based method for calling SVs. Because SMRT sequencing data have a higher sequencing error rate (~15%) [18] than NGS data, MaxDEL is designed to work with noisy data. It uses a novel method to convert sequence characteristics into images and uses the images to identify variants. MaxDEL has two main advantages. First, it can correct the errors in the VCF and find the accurate deletion region, which helps generate precise variant images. Second, it can exploit the strong learning ability and robustness of a CNN to learn from, and overcome the noise in, the sequencing data. A further advantage is that MaxDEL solves the problem of data imbalance. The CNN model is sensitive to unbalanced data, yet the number of non-SV regions is about 100 times greater than that of SV regions; generating images directly from such data would reduce accuracy. MaxDEL therefore uses a GAN [19] to augment the data: the GAN generates synthetic images that follow the distribution of the original images, giving the training set greater randomness and authenticity. After obtaining enough genomic images, MaxDEL uses them to train the CNN. To implement an end-to-end model, MaxDEL also provides a complete procedure from BAM file to calling results. First, it collects candidate variants from other tools and uses a random forest to find accurate deletion breakpoints. Second, it extracts variant features to generate images and uses the GAN to augment them. Finally, it uses the CNN to call deletions and filter false positives. The details are shown in Figure 1. Images are generated from both deletion and non-deletion regions; because deletions are few, enough deletion images are obtained with the GAN, and both kinds of genomic images are used to train the CNN.
(c): Existing tools are used to call variants and obtain a candidate deletion set; the trained CNN model then discriminates the candidate set to produce the final deletion result.

Obtain accurate deletion regions
The accuracy of the deletion regions determines the accuracy of variant calling. After analyzing the BAM file of real data, we found that the VCF contains offsets of the breakpoint positions, so the reported regions include part of a non-deletion region. Accurately determining the deletion region is therefore important. In this paper, we use a random forest to filter errors in the VCF file and correct the breakpoint offsets of variants.
(ⅰ). Genome feature extraction. By inspecting the BAM file, we found that deletion regions have characteristics that distinguish them from non-deletion regions. We use a sliding window to scan the reads aligned to the reference genome and adopt the RD strategy to extract useful features. The features include window information, mapping type of reads, mapping quality and alignment type; they are described in Table 1. Among them, window information and mapping quality provide the basic information about the reads in the target window, which is the basis for distinguishing deletion from non-deletion regions. The mapping type of reads records the main matching pattern of the target window, and the alignment type describes the number of primary alignments in it. The average depth is obtained by formula (1), and the window complexity by formulas (2) and (3).
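A minimal sketch of what sliding-window feature extraction might look like follows. The feature definitions here are hypothetical simplifications (the paper's formulas (1)-(3) define the actual average depth and window complexity); `window_features` and its inputs are illustrative only.

```python
def window_features(reads, win_start, win_end):
    """Illustrative per-window features: average depth (covered bases over
    window length), mean mapping quality, and the count of primary
    alignments overlapping the window."""
    overlapping = [r for r in reads
                   if r["start"] < win_end and r["end"] > win_start]
    covered = sum(min(r["end"], win_end) - max(r["start"], win_start)
                  for r in overlapping)
    avg_depth = covered / (win_end - win_start)
    mean_mapq = (sum(r["mapq"] for r in overlapping) / len(overlapping)
                 if overlapping else 0.0)
    n_primary = sum(1 for r in overlapping if r["primary"])
    return {"avg_depth": avg_depth, "mean_mapq": mean_mapq,
            "n_primary": n_primary}

# Two toy reads: one primary high-quality, one secondary low-quality.
reads = [{"start": 0, "end": 150, "mapq": 60, "primary": True},
         {"start": 50, "end": 200, "mapq": 30, "primary": False}]
print(window_features(reads, 100, 200))
```

A real implementation would pull these fields from BAM records (e.g. via pysam) rather than from dictionaries.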
(ⅱ). Calibration of deletion regions. The deletion information in the VCF file may contain errors. By examining chromosomes 1 to 6 and selecting 125 deletion regions per chromosome, we measured the accuracy of the VCF deletion regions (Table 2). To obtain accurate deletion regions, we use the random forest method to identify the real ones. Random forest is an ensemble algorithm with high accuracy; because of its built-in randomness, it does not overfit easily and is robust to noise. To train the model, we first shrink the deletion regions recorded in the VCF to obtain smaller core regions, which largely removes the false positive portion while keeping the true positive portion. We then extract variant features from these core regions to construct multi-feature positive samples, and non-variant features from wild-type regions to construct multi-feature negative samples. Finally, the positive and negative samples form the training set for the random forest model. The trained model is used to classify new multi-feature samples, which are formed by expanding both sides of the deletion regions recorded in the VCF; an example is shown in Figure 2. The results in Table 2 show that the accuracy of the deletion regions improves after random forest processing, reaching 88.16% overall, which demonstrates that the random forest can correct the breakpoint offset. In Figure 2, the black part is the deletion region recorded in the VCF, and the red part is the deletion region corrected by the random forest method.
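The shrink-then-expand region logic of the calibration step can be sketched as follows. The shrink fraction and flank size are illustrative assumptions, not values taken from MaxDEL, and the random forest scoring of each window is omitted.

```python
def shrink_region(start, end, frac=0.25):
    """Shrink a VCF-reported deletion region symmetrically, so that the
    remaining core is very likely true deletion and can serve as a
    positive training sample. frac is an illustrative choice."""
    pad = int((end - start) * frac)
    return start + pad, end - pad

def expand_region(start, end, flank=200):
    """Expand a VCF region on both sides; the expanded span would then be
    scanned window by window, each window scored by the trained random
    forest to relocate the true breakpoints. flank is illustrative."""
    return max(0, start - flank), end + flank

vcf_del = (10_000, 11_000)
print(shrink_region(*vcf_del))  # core kept as a positive sample
print(expand_region(*vcf_del))  # span re-scanned for true breakpoints
```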

Image Generation
As is well known, BAM files contain a great deal of sequence information. Converting sequence characteristics into images can capture more spatial information between sequences: an image shows the positional relationship between reads in the same region and displays regional characteristics across reads. From the real data, we found that most deletions are hundreds to thousands of bp long, so it is difficult to express an entire deletion region in a single RGB image. The genomic image generation method therefore addresses two aspects: interval division and image design.
(ⅰ). Interval division. Small regions are obtained by dividing the deletion and non-deletion regions. We divide a long genomic region into small regions and generate an image for each, instead of compressing the entire region into one image. This not only maintains the integrity of the read information in each small region but also improves the sensitivity of MaxDEL. The deletion and non-deletion regions are divided one by one according to a fixed-length, fixed-step principle; the interval length is 100 bp and the step size is 75 bp.
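The fixed-length, fixed-step division can be sketched directly from the stated parameters (100 bp windows, 75 bp step). Anchoring the final window at the region end, so no bases are dropped, is an assumption of this sketch rather than a documented detail.

```python
def divide_region(start, end, length=100, step=75):
    """Split a genomic region into fixed-length (100 bp), fixed-step
    (75 bp) overlapping windows. A region shorter than one window yields
    a single window; the last window is anchored at the region end."""
    if end - start <= length:
        return [(start, start + length)]
    windows = [(s, s + length)
               for s in range(start, end - length + 1, step)]
    if windows[-1][1] < end:
        windows.append((end - length, end))
    return windows

print(divide_region(0, 300))
# → [(0, 100), (75, 175), (150, 250), (200, 300)]
```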
(ⅱ). Image design. How to fill the content and arrange the colors of the image is significant. When generating an image, we use the CIGAR string of the reads as the main information. Compared with the base sequence of the reads, the CIGAR string not only reduces the redundancy of the image content but also preserves the differences between reads. Through interval division and image design, the deletion and non-deletion regions are turned into images (Figure 3b and Figure 3c). There are obvious depth differences between them, which indicates that the image generation strategy is effective and displays their differences well.
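Since the CIGAR string is the main information source for the images, a small helper for tallying CIGAR operations may clarify the idea. This is a generic SAM-style CIGAR parser, not MaxDEL's image encoder.

```python
import re

# SAM specification CIGAR operations: M/I/D/N/S/H/P/=/X
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def cigar_summary(cigar):
    """Tally the bases consumed by each CIGAR operation; a large 'D'
    (deletion) run inside an alignment is a direct visual cue in the
    generated images."""
    totals = {}
    for length, op in CIGAR_RE.findall(cigar):
        totals[op] = totals.get(op, 0) + int(length)
    return totals

# A read spanning a 300 bp deletion relative to the reference:
print(cigar_summary("50M300D50M10S"))  # {'M': 100, 'D': 300, 'S': 10}
```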

Image augmentation
Deep learning methods are widely used for processing image data, and one application is image augmentation. When the training data are unbalanced, the CNN model learns more from the over-represented class and may fail to converge. In genomic data, the number of deletion regions is extremely small compared with non-deletion regions, so the two classes are highly unbalanced. Thus the small amount of data and the sample imbalance are further obstacles to calling deletions. With the development of deep learning, Ian J. Goodfellow et al. proposed the GAN in 2014 (Figure 4a). It is a powerful generative model that can use a small amount of data to generate a large amount of synthetic data. MaxDEL uses a GAN to augment the genomic images and solve both problems.

(ⅰ). Generation and comparison of fake images.
MaxDEL uses a GAN to augment the deletion images and overcome the imbalance between positive and negative samples. In a genomic image, the orientation is deterministic and the content is regular, which greatly reduces the complexity of the image while maintaining sufficient information. A GAN can therefore produce augmented images of high quality, with a distribution close to that of images generated from real data, after relatively little training. We use a GAN to augment the genomic images, setting the number of epochs to 100,000 to obtain a sufficient quality and number of images. In the experiment, two GAN variants were compared to find the one best suited to MaxDEL. Figures 4b and 4c show the images generated by the two methods: Figure 4b is the result of the traditional GAN, and Figure 4c is the result of DCGAN [20]. Compared with Figure 4b, Figure 4c is closer to the original deletion images in color distribution, and the images generated by DCGAN contain less noise. Therefore, DCGAN was selected to augment the genomic data.

(ⅱ). Selection and comparison of activation functions.
In the generator of the GAN, the activation function used in the hidden layers also affects the final results. ReLU is a commonly used activation function, while SELU is a newer one that automatically normalizes the input data toward zero mean and unit variance. As shown in Figure 5, we compared their impact on the accuracy of the generated data, sampling the accuracy of the fake images every 25 rounds over the first 5,000 rounds and averaging the results. Figure 5 shows that with ReLU the accuracy varied greatly, whereas with SELU the accuracy approached 100% when the epoch was close to 3,000. Compared with ReLU, SELU therefore yields higher-quality generated images, and we use SELU as the activation function of DCGAN.
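Both activation functions have simple closed forms. The sketch below uses the standard SELU constants from Klambauer et al.'s definition and is independent of MaxDEL's DCGAN code.

```python
import math

# Standard SELU constants (Klambauer et al.)
ALPHA = 1.6732632423543772
SCALE = 1.0507009873554805

def relu(x):
    """ReLU: zero for negative inputs, identity for positive inputs."""
    return max(0.0, x)

def selu(x):
    """SELU: scaled exponential linear unit. With properly normalized
    inputs it pushes activations toward zero mean and unit variance."""
    return SCALE * x if x > 0 else SCALE * ALPHA * (math.exp(x) - 1.0)

print(relu(-2.0), relu(1.5))             # 0.0 1.5
print(round(selu(-2.0), 4), selu(1.5))   # saturating negative branch vs. scaled linear branch
```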

Dataset
This paper uses GRCh37 as the reference genome. The real data come from Genome in a Bottle (GIAB), a benchmark by Zook et al. (ftp://ftptrace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/) [21]. This dataset is of high quality and was validated both computationally and experimentally [22]. The BAM file coverages are HG002 (69X), HG003 (32X) and HG004 (30X). The first 6 chromosomes of HG002 were used as the training set; the remaining chromosomes and samples are the test sets. In addition, SURVIVOR [23], PaSS [24] and NGMLR [9] were used to generate simulated BAM data, which increases the amount of data and verifies the validity of MaxDEL. The VCF for HG002 comes from GIAB (ftp://ftptrace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/), while HG003 and HG004 use the output of existing tools as the benchmark. The simulated data use the VCF produced by the simulation tool as the benchmark.

Training and validating the CNN model
After image augmentation, there are enough image samples to train a CNN model. CNNs have achieved great success in classification and computer vision. First, when a CNN learns from image samples, it transforms the original data into abstract representations layer by layer through representation learning. Second, through convolution it can discover inherent features of the images that we have not noticed. Finally, SMRT sequencing data contain a large number of sequencing errors, but a CNN is robust to such noise: while overcoming the high error rate of SMRT sequencing data, it automatically learns the features of the genomic images. In the following discussion, we explore the composition of the training data used to obtain the trained CNN model.
(ⅰ). The impact of sample balancing on the CNN model. An imbalance between positive and negative samples causes many problems and ultimately reduces the credibility of the trained model. Figures 6a and 6b show the classification accuracy of the CNN before and after balancing the samples. The number of negative samples is about 80,000; before balancing there are about 16,000 positive samples, a positive-to-negative ratio of 1:5. After augmenting the positive samples with DCGAN, the ratio is about 1:1. With unbalanced data, although accuracy and loss are best on the training set, the results on the validation set are the opposite. On the test set, the accuracy of the CNN trained on unbalanced samples is only 93.39% with a loss of 0.2089; after balancing, the accuracy is 97.225% with a loss of 0.1097. This shows that after balancing the positive and negative samples, the CNN model discriminates unknown samples better.
(ⅱ). The influence of data preprocessing on the accuracy of the CNN model. The composition of the genomic image data was compared in several experiments. To explore its influence on CNN accuracy, we ran three experiments: (1) image samples generated without any processing; (2) image samples generated after random forest processing only; (3) image samples generated after random forest processing and GAN data augmentation. Each experiment used a batch size of 128 and 50 epochs, with roughly equal proportions of positive and negative samples in the training set. Figures 6c and 6d show the loss and accuracy on the training and validation sets. In experiment (1), the training-set accuracy was 89.538% on average and 95.85% at best, while the validation-set accuracy averaged 87.617% with a maximum of 88%; the validation loss increased significantly, indicating that the network had overfitted. This demonstrates the necessity of machine learning processing of the deletion regions. The accuracy and loss of images generated after random forest processing only were intermediate. Finally, after random forest processing and GAN augmentation, the training and validation results of the CNN were the best. We therefore use the trained CNN model from experiment (3) to discriminate candidate images.

Performance on Simulated Data
Testing on simulated data is one way to verify the effectiveness of a calling tool. First, we use SURVIVOR to simulate a reference genome containing variations, then generate simulated FASTQ files with PaSS, and finally use NGMLR to align the simulated FASTQ files to the simulated reference genome to obtain the BAM file.
According to variant length, the simulated data are divided into two categories: shorter variations (50 bp~1000 bp, category A) and longer variations (1000 bp~5000 bp, category B). In each category, we validated the results for homozygous and heterozygous variants separately.
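For clarity, assigning a simulated deletion to a category by its length can be written as a one-line check. Placing the 1000 bp boundary in category B is an assumption of this sketch, since the text leaves the boundary ambiguous.

```python
def length_category(del_len):
    """Assign a simulated deletion to category A (50-1000 bp) or
    category B (1000-5000 bp) by its length; the 1000 bp boundary
    is assigned to B here by convention."""
    if 50 <= del_len < 1000:
        return "A"
    if 1000 <= del_len <= 5000:
        return "B"
    return None  # outside the simulated size range

print(length_category(300), length_category(2500))  # A B
```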
(ⅰ). Calling results for category A. MaxDEL uses the trained CNN described above to discriminate the candidate sets. Tables 3 and 4 show that MaxDEL performs well in calling both homozygous and heterozygous deletions: its recall exceeds 0.9527, better than existing tools, and its F1-score is also the highest among all tested tools, especially for heterozygous variants. From Tables 5 and 6, it can be concluded that MaxDEL still performs well: on the high-coverage simulated data the results were somewhat lower, but on the low-coverage data they were the best, which demonstrates the stability of MaxDEL at different coverage depths. Figure 7 displays the recall and F1-score of each tool alongside MaxDEL. On the low-coverage data, MaxDEL performed best, with an overall recall of 99.13% for homozygous calls and 98.91% for heterozygous calls. On the high-coverage data, MaxDEL also achieved the best results for both homozygous and heterozygous variants. Furthermore, we compared the F1-score of MaxDEL with the other four tools: their F1-scores fluctuate strongly, while MaxDEL's remains high.

Performance on Real Data
Here we use the HG002 real data to compare the performance of MaxDEL with NextSV, SMRT-SV, SVIM and Sniffles. From Table 7, MaxDEL obtains more true positives and fewer false negatives, as well as the highest recall and F1-score. The results show that MaxDEL has the best overall performance in calling deletions from real data.

Discussion and conclusion
In this paper, we define and elaborate MaxDEL, a novel integrated framework for calling genomic deletions from SMRT sequencing data. MaxDEL captures the variant features of deletions and establishes a learning model between images and genomic data. In our experiments, MaxDEL is superior to NextSV, SVIM, Sniffles, Picky and SMRT-SV, especially in recall and F1-score.
MaxDEL has three strengths in calling deletions. The first is that it solves the breakpoint offset problem: in the VCF file, most deletion records contain errors, which makes it difficult to generate accurate images. MaxDEL corrects the VCF file and gives precise breakpoints and deletion regions.
The second strength is that MaxDEL provides a new genomic image augmentation method, which addresses the sample imbalance problem. MaxDEL uses a GAN to generate genomic images that expand the deletion samples, and it automatically analyzes the image features to adjust the model parameters.
The third strength is that MaxDEL provides an effective way to use images to call SVs from SMRT sequencing data at different coverages, which shows that a deep learning model is capable of accurate and efficient variant calling.
Exploring the limits of MaxDEL, we find that it cannot infer the genotype. MaxDEL can recognize the deletion region, but the genotype of a deletion depends on the coverage; at present, an extra step is needed to calculate the genotype once the deletion region is given. Future investigations will focus on how to use MaxDEL to infer genotype.

Ethics approval and consent to participate
Not applicable.