Paper [21] showed qualitatively that the deviation of the sum of the nucleotides in DIPs was generally higher than that of the SNPs across the whole genome. In other words, deviations in DIPs were more representative of individual differences and could thus be attributed to differences in endo-phenotype or end-phenotype. As an example, paper [21] took gender as the end phenotype and showed that the variance (and standard deviation) across the set of structural variations (DIPs) was much higher than that of the sum of nucleotides of SNPs (Figure 1), suggesting that structural variations are a stronger means of determining the phenotype, in this case gender. Inspired by that article, this paper focuses on fine-tuning and quantifying individual DIPs: we introduce a deviation-from-consensus score to quantify the differences among these multi-letter variants, while using one-hot encoding for the single-nucleotide bases.
We developed the DMWAS suite to simulate genomic data in which each genomic co-ordinate holds either a single nucleotide (A, T, G, or C) or a longer combination of these four letters. The suite includes a Python script, genSampleData.py, which generates genomic variation data given the number of genomic loci, the number of patients, the frequency of occurrence of DIPs, and the maximum size of a DIP. Details of script usage are given in the ReadMe.md file downloadable from GitHub. For illustration purposes we use 400 genomic co-ordinates.
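The simulation step can be sketched as follows. This is a hypothetical, simplified re-implementation of what genSampleData.py does (the actual script is on GitHub); the function name, parameters, and the uniform-random sampling scheme are illustrative assumptions, not the script's exact behavior.

```python
import random

# Hypothetical sketch of the simulation step: each cell is a single base
# (A/T/G/C) or, with probability dip_freq, a DIP of length 2..max_dip_len
# built from the same four letters. Sampling scheme is an assumption.
BASES = "ATGC"

def simulate_genotypes(n_loci, n_patients, dip_freq=0.1, max_dip_len=4, seed=0):
    rng = random.Random(seed)
    rows = []
    for _ in range(n_patients):
        row = []
        for _ in range(n_loci):
            if rng.random() < dip_freq:
                # emit a multi-letter DIP at this locus
                length = rng.randint(2, max_dip_len)
                row.append("".join(rng.choice(BASES) for _ in range(length)))
            else:
                # emit a single-nucleotide variant
                row.append(rng.choice(BASES))
        rows.append(row)
    return rows
```

Writing the returned rows out with a header line of locus names would yield a file in the spirit of multiColumnSample.csv.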
Figure 2 shows the simulated data for 40 columns and 8 patients. The simulated genotype data for 200 loci is provided in the file multiColumnSample.csv. Once the simulated data is generated, we use the script splitMultiColDIPs.py to split each feature column into two columns: one for the 1-letter variants and another for the DIP variants. The split file is available as multiColumnSplitSample.csv. Figure 3 shows an example of the data with each column doubled as per the described method.
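The column-doubling step can be illustrated with a short sketch. This is not the code of splitMultiColDIPs.py itself; it assumes the simple rule that 1-letter cells go to the SNP column and multi-letter cells to the DIP column, with the unused slot left blank.

```python
# Hypothetical sketch of the column-splitting step: each original column
# becomes two columns, one holding the 1-letter variants and one holding
# the multi-letter DIPs, with the other slot left blank.
def split_column(values):
    snp_col, dip_col = [], []
    for v in values:
        if len(v) == 1:
            snp_col.append(v)
            dip_col.append("")
        else:
            snp_col.append("")
            dip_col.append(v)
    return snp_col, dip_col

def split_all(rows):
    # rows: list of patient rows; returns rows with each column doubled
    n_cols = len(rows[0])
    cols = [[row[i] for row in rows] for i in range(n_cols)]
    new_cols = []
    for col in cols:
        snp_col, dip_col = split_column(col)
        new_cols.extend([snp_col, dip_col])
    return [[c[j] for c in new_cols] for j in range(len(rows))]
```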
With the DIP information in the second column of each pair, we can extract the DIPs separately, cluster them, and obtain a divergence score for each DIP relative to the consensus. This method of encoding the letters based on divergence from a mean, median, or consensus score is called ‘DivScoreEncoding’. DivScoreEncoding differs from one-hot, word embedding, index-based encoding, and other label encoding methods as described in the article ‘Text Encoding: A Review’ [26]. We recognize that InDels or DIPs can differ from one another, and that differences in biological relevance, such as a codon frameshift or a point mutation, should be given a score in a biological context. Traditional text encoding methods, such as those discussed in [26], do not take biological evolutionary distance into account when encoding DIPs or InDels. DivScoreEncoding is applicable to larger insertions and deletions as well as to other SVs in the genome, such as CNVs and translocations. Cross-species multiple sequence alignment using phylogenetic tree construction has been attempted in article [30]. Papers [28,29] generate pathogenicity scores for InDels throughout the non-coding genome to classify them as pathogenic or not; these clearly differ in method and application, although they share the concept of assigning InDels a score based on a biological role. Figure 4 shows a sample clustering by multiple alignment of words of varying length, with a consensus and a divergence score for each sequence.
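The core idea of DivScoreEncoding can be illustrated with a deliberately simplified example. The sketch below assumes already-aligned, equal-length sequences and scores each one by its fraction of positions diverging from a column-wise majority consensus; T_coffee's alignment-based divergence score is more sophisticated, so this is a conceptual stand-in only.

```python
from collections import Counter

# Simplified illustration of the DivScoreEncoding idea: build a column-wise
# majority consensus over aligned, equal-length sequences, then score each
# sequence by its fraction of positions that differ from the consensus.
def consensus(seqs):
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def divergence_scores(seqs):
    cons = consensus(seqs)
    return [sum(a != b for a, b in zip(s, cons)) / len(cons) for s in seqs]
```

A sequence identical to the consensus scores 0.0, and scores grow toward 1.0 as the sequence diverges, giving each DIP a numeric value that reflects how far it sits from the cluster consensus.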
To implement the DivScoreEncoding method by clustering, we chose the T_coffee [1] application to obtain the divergence score. This third-party software is available online. A wrapper Python script, multiColDIPsDiv.py, is provided, which automates extraction of the DIPs from the multiColumnSplitSample.csv file and passes them to the T_coffee software for multiple sequence alignment and divergence score determination. The script reverseReadMulti.py is provided to reverse the scores obtained, and the script ReplaceMultiColDIPsNew.py can be used to replace the DIPs with the appropriate scores. This yields a file with content such as that shown in Figure 5; the resulting file is also provided as MultiColDIPsScored.txt. For encoding the single-nucleotide columns, the Python script encodeSNPs.py is provided; the resulting final scored and encoded file, MultiColDIPsScoredEncoded.txt, is also provided. Figure 6 shows a snippet of a sample scored and encoded file.
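The score-reversal, replacement, and one-hot encoding steps can be sketched as below. These are hypothetical illustrations of the roles of reverseReadMulti.py, ReplaceMultiColDIPsNew.py, and encodeSNPs.py, not their actual code; in particular, the 0-100 score scale and the fixed A/T/G/C encoding order are assumptions.

```python
# Hypothetical sketch of the score-reversal step (reverseReadMulti.py's role).
# Assumption: divergence scores run 0..100, and reversing flips the scale.
def reverse_scores(scores, max_score=100.0):
    return [max_score - s for s in scores]

# Hypothetical sketch of the replacement step (ReplaceMultiColDIPsNew.py's
# role): every multi-letter DIP cell is replaced by its numeric score.
def replace_dips(rows, score_map):
    return [[score_map.get(cell, cell) if len(str(cell)) > 1 else cell
             for cell in row] for row in rows]

# One-hot encoding of single-nucleotide bases (encodeSNPs.py's role);
# the A/T/G/C column order is an illustrative assumption.
def one_hot(base):
    return [1 if base == b else 0 for b in "ATGC"]
```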
Since we had 40 individuals or rows, we generated 40 y-values (0-39), with the first row left as the feature column variable names; this file is named Phenotype.txt. We used several machine learning methods, such as logistic regression, naïve Bayes, gradient boosting, bagging, and AdaBoost, and deployed an enhanced form of exhaustive multi-layer perceptron (MLP) in the form of a DNN, incorporating early stopping criteria to avoid overfitting and using the rectified linear unit (ReLU) activation function to reduce weight adjustment time and address the vanishing gradient problem. We also introduced an exhaustive search for the right number of hidden layers and hidden units by varying both in a loop over the DNN; each time, the best score was recorded along with its number of hidden layers and units. This exhaustive form of DNN, when the search range was given realistic bounds, proved more useful than simply adding hidden layers as in a typical DNN and gave strong results, so we call this approach ‘ExhaustiveDNN’. The scripts ExhaustiveDNN.ipynb and ExhaustiveDNN.py are provided in DMWAS and take MultiColDIPsScoredEncoded.txt as the input file. The script internally identifies all columns with any null values and removes them before modeling. The data file was also separately checked for null values and for a minor allele frequency (MAF) of at least 5%; the resulting encoded file is available in DMWAS as NullMafMultiColDIPsScoredEncoded.txt and can be used as an alternative. From this file, applying an F-test criterion to each feature column, we chose the top 1% of the feature set as the final data on which the deep and machine learning scripts operate. Feature set optimization has been an active area of recent research, as seen in article [31].
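The exhaustive architecture search at the heart of ExhaustiveDNN can be sketched generically. The sketch below is an assumption about the loop's structure, not the code of ExhaustiveDNN.py: `train_and_score` stands in for fitting the actual DNN (ReLU activations, early stopping) and returning its validation score, and uniform layer widths are an illustrative simplification.

```python
from itertools import product

# Hedged sketch of the 'ExhaustiveDNN' search loop: rather than fixing one
# architecture, try every (n_layers, n_units) pair within realistic bounds
# and keep the configuration with the best validation score.
def exhaustive_dnn_search(train_and_score, layer_range, unit_range):
    best_hidden, best_score = None, float("-inf")
    for n_layers, n_units in product(layer_range, unit_range):
        hidden = (n_units,) * n_layers  # e.g. (16, 16) = two layers of 16
        score = train_and_score(hidden)
        if score > best_score:
            best_hidden, best_score = hidden, score
    return best_hidden, best_score
```

With a real model, `train_and_score` would fit the network on the top-1% feature set with early stopping and return held-out accuracy, so the loop directly realizes the "best score per hidden-layer/unit configuration" selection described above.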
Article [27] discusses various applications of deep learning across different spheres of biology; ExhaustiveDNN, as part of DMWAS together with the DivScoreEncoding methodology, can play a vital role here, as exhibited in the Results section later.