2.1 Preparation of samples for machine learning
A standard Ion Torrent sequenced BAM alignment file for sample accession number NA12878 was downloaded from the Genome in a Bottle (GIAB: https://github.com/genome-in-a-bottle/giab_data_indexes) consortium database, uploaded to a local Ion Reporter server running UI version 5.10.38, and subjected to variant calling with all default parameters. A short Python 3 script was used to extract the 35 variant metrics from the resulting file, and the metrics were tabulated in a Microsoft Excel spreadsheet. A total of 1,336 confirmed negative calls listed on the website were selected from the total variants as the negative subset and coded as 0. A total of 36,198 variants falling within the high-confidence regions, confirmed as good calls across multiple sequencing platforms by the consortium, were selected as confirmed positive calls and coded as 1. Since this yielded roughly 27 times more positive variants than negatives, an equal number of positive variants (1,336) had to be selected. To do so, any variant with a read depth below 30, the cut-off for germline variants, was first removed, leaving 33,757 positive variants. The final subset of 1,336 positive variants was then drawn at random from these 33,757 variants using the Python module Pandas. The resulting file was exported to another Excel sheet and used for training of the model (Supplementary Table 2). The training program takes both files and merges them. While preparing the training file, the program also encodes the "type" field of each variant (SNP, MNP, INS, DEL, Complex) as 0 to 4, providing the model with an additional input that captures the type of variant. The final set comprised 2,672 variants, of which 5% were randomly selected for validation per epoch and the rest used for training.
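As an illustration, the selection and encoding described above can be sketched with Pandas as follows; the file names and column names (e.g. "read_depth", "type", "label") are assumptions made for this example and are not the exact code used.

import pandas as pd

# Assumed file and column names; 0 = confirmed negative, 1 = confirmed positive.
negatives = pd.read_excel("giab_negative_calls.xlsx")   # 1,336 confirmed negative calls
positives = pd.read_excel("giab_positive_calls.xlsx")   # 36,198 confirmed positive calls
negatives["label"] = 0
positives["label"] = 1

# Remove positives below the 30x germline read-depth cut-off (leaving 33,757 calls),
# then randomly down-sample to match the 1,336 negatives.
positives = positives[positives["read_depth"] >= 30]
positives_subset = positives.sample(n=len(negatives), random_state=42)

# Merge both subsets and encode the "type" field as an additional numeric input:
# SNP = 0, MNP = 1, INS = 2, DEL = 3, Complex = 4.
training = pd.concat([positives_subset, negatives], ignore_index=True)
training["type"] = training["type"].map({"SNP": 0, "MNP": 1, "INS": 2, "DEL": 3, "Complex": 4})

training.to_excel("training_set.xlsx", index=False)      # 2,672 variants in total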
2.2 Development of the model
Python 3, along with the TensorFlow, Pandas, scikit-learn, NumPy and Matplotlib modules, was used to code the model. The model is based on an artificial neural network (ANN) architecture composed of fully connected dense layers; the architecture is shown in Table 1. The model was compiled with the Adamax optimizer and sparse categorical cross-entropy as the loss function, using the default learning rate.
Table 1. The architecture of the machine learning model

S.No. | Type of layer        | Units | Activation
1     | Dense (input: (35,)) | 35    | Sigmoid
2     | Dropout              | 0.4   | -
3     | Dense                | 150   | Sigmoid
4     | Dropout              | 0.4   | -
5     | Dense                | 35    | Sigmoid
6     | Dropout              | 0.4   | -
7     | Dense                | 35    | Sigmoid
8     | Dropout              | 0.4   | -
9     | Dense (output)       | 2     | SoftMax
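For reference, the architecture in Table 1 and the compilation settings described above can be expressed in Keras roughly as below; this is a sketch reconstructed from the table, not the published training code.

import tensorflow as tf
from tensorflow.keras import layers, models

# Fully connected ANN mirroring Table 1: alternating sigmoid Dense and Dropout(0.4)
# layers over the 35 input metrics, ending in a 2-unit softmax output.
model = models.Sequential([
    layers.Dense(35, activation="sigmoid", input_shape=(35,)),
    layers.Dropout(0.4),
    layers.Dense(150, activation="sigmoid"),
    layers.Dropout(0.4),
    layers.Dense(35, activation="sigmoid"),
    layers.Dropout(0.4),
    layers.Dense(35, activation="sigmoid"),
    layers.Dropout(0.4),
    layers.Dense(2, activation="softmax"),
])

# Adamax optimizer with its default learning rate; sparse categorical cross-entropy loss.
model.compile(optimizer="adamax",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Training would hold out 5% of the 2,672 variants for validation at each epoch, e.g.:
# model.fit(X, y, epochs=..., validation_split=0.05)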
Development of the wrapper code
The wrapper code for analysing actual VCF files with the developed model was also written in Python 3, using the sys, os, TensorFlow, Pandas, scikit-allel and MyVariant modules. Additionally, bcftools [4] was used to split sites carrying multiple variants into separate records during the pre-processing stage of the VCF files. The wrapper code accepts practically any number of VCF files at once and processes them one by one. For each VCF file, the code creates a Microsoft Excel sheet containing HGVS codes for all variants, the prediction verdict of the machine learning model along with the probabilities of both outcomes, and annotation from available online databases via the Python MyVariant module.
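A simplified sketch of the wrapper's per-file workflow is given below; the helper functions extract_metrics and make_hgvs, the saved-model path and the output layout are hypothetical place-holders, and the MyVariant annotation step is omitted for brevity.

import subprocess
import sys

import allel                      # scikit-allel
import pandas as pd
import tensorflow as tf

model = tf.keras.models.load_model("variant_classifier")    # assumed saved-model path

for vcf_path in sys.argv[1:]:                                # any number of VCF files
    # Pre-processing: split multi-allelic records into one variant per line with bcftools.
    norm_path = vcf_path.replace(".vcf", ".norm.vcf")
    subprocess.run(["bcftools", "norm", "-m-any", "-o", norm_path, vcf_path], check=True)

    # Load the normalised VCF and assemble the 35 metrics expected by the model.
    callset = allel.read_vcf(norm_path, fields="*")
    features = extract_metrics(callset)                      # hypothetical helper, shape (n, 35)

    # Predict class probabilities; column 1 corresponds to a true (positive) call.
    probs = model.predict(features)

    # One Excel sheet per VCF with HGVS codes, the verdict and both probabilities.
    pd.DataFrame({
        "HGVS": make_hgvs(callset),                          # hypothetical helper
        "verdict": probs.argmax(axis=1),
        "prob_negative": probs[:, 0],
        "prob_positive": probs[:, 1],
    }).to_excel(vcf_path.replace(".vcf", "_predictions.xlsx"), index=False)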
2.3 Evaluation of the model
A Python 3 script was written to evaluate the efficacy of the model using GIAB data. The evaluation code additionally used the TensorFlow, Pandas, Matplotlib, Seaborn, scikit-plot, NumPy and scikit-learn modules. An equal number of positive and negative variant calls was randomly selected from the whole data set and fed to the model for evaluation. The confusion matrix, cumulative gain curve, Kolmogorov-Smirnov (KS) statistic, Matthews correlation coefficient, lift curve and receiver operating characteristic (ROC) curve were calculated to evaluate the model.
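The listed metrics can be produced with scikit-learn and scikit-plot roughly as follows; the function below is an illustrative sketch assuming the ground-truth labels and predicted class probabilities of the balanced evaluation subset are already available.

import matplotlib.pyplot as plt
import numpy as np
import scikitplot as skplt
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def evaluate(y_true: np.ndarray, probs: np.ndarray) -> None:
    # y_true: 0/1 labels of the balanced GIAB subset; probs: model.predict() output, shape (n, 2).
    y_pred = probs.argmax(axis=1)

    print(confusion_matrix(y_true, y_pred))
    print("Matthews correlation:", matthews_corrcoef(y_true, y_pred))
    print("ROC AUC:", roc_auc_score(y_true, probs[:, 1]))

    # Cumulative gain, lift, KS statistic and ROC curves via scikit-plot.
    skplt.metrics.plot_cumulative_gain(y_true, probs)
    skplt.metrics.plot_lift_curve(y_true, probs)
    skplt.metrics.plot_ks_statistic(y_true, probs)
    skplt.metrics.plot_roc(y_true, probs)
    plt.show()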