Using k-mer Embeddings Learned from a Skip-Gram Based Neural Network as Effective Feature Representation for Building a Cross-Species Prediction Model to Identify DNA N6-Methyladenine Sites in Plant Genomes

Identification of DNA N6-methyladenine sites has been a very active topic in computational biology due to the unavailability of suitable methods to identify them accurately, especially in plants. Substantial results have been obtained only with great effort spent on extracting, heuristically searching, or fusing diverse types of features, often followed by a feature selection step. We considered DNA, the book of life, as a text corpus for training DNA language models. K-mer embeddings were then generated from Skip-gram neural networks and fed into several ensemble tree-based algorithms. We trained the prediction model on a Rosaceae genome dataset and performed a comprehensive test on 3 plant genome datasets. Our proposed method shows promising performance, with AUC approaching the ideal value on the Rosaceae dataset (0.99), a high score on the Rice dataset (0.95), and improved statistical rates on the Rice dataset, while enjoying an elegant, yet efficient feature extraction process.


Introduction
DNA N6-methyladenine (6mA) is a non-canonical DNA modification at the 6th nitrogen position of the adenine ring, catalyzed by DNA methyltransferases (DNMTs) [1]. Since it was not detectable in earlier studies, DNA 6mA modification was long considered absent in eukaryotes. With the development of deep sequencing techniques, 6mA was found to be present in the genomes of diverse species [2][3][4][5]. Recently, it has resurfaced as a possible reversible epigenetic mark that correlates with a series of biological processes, such as DNA replication, DNA repair, and transcription [2,[6][7][8][9][10]. Although once assumed to be absent because of its undetectable level, considerable 6mA distribution has been identified across plant genomes, suggesting roles in plant development, tissue differentiation, and regulation of gene expression. Furthermore, a previous rice genome study indicates that 6mA is associated with environmental alteration: the levels of 6mA are positively correlated with the expression of key genes related to the stress response [5].
Despite the potential roles of 6mA suggested by the abovementioned studies, the biological function of 6mA in plants remains elusive. The experimental methods for identifying DNA N6-methyladenine sites, such as next-generation sequencing coupled with immunoprecipitation [11], DNA immunoprecipitation with 6mA antibodies and DpnI-assisted restriction enzyme sequencing [12], mass spectrometry, single-molecule real-time (SMRT) sequencing, and ultrahigh-performance liquid chromatography [13], generally require substantial time and effort, yet they can only cover a partial 6mA epigenetic landscape. Thus, characterizing the distribution of 6mA sites across the genome using computational approaches is essential to understanding its functions.
As the amount of genomic data is soaring, building efficient computational approaches, especially those based on machine learning methods, for identifying 6mA sites in different genomes has become an attractive topic for many research groups. A central question is how to define an effective and distinctive feature set that can help enhance the prediction performance. For a long time, many groups have relied on various encoding schemes for effective data representation and feature learning in DNA methylation pattern detection. In an extensive review published in 2018 [14], Yu et al. collected, analyzed, and summarized the existing encoding schemes for genome sequences. In this review, the authors grouped the encoding schemes into 5 categories, namely biochemical properties, primary-structure properties, cartesian-coordinate properties, binary and information encoding, and graphical representation. We also found that in order to obtain high prediction performance for DNA methylation sites, a great effort must be spent on extracting, heuristically searching, or combining diverse types of features [15][16][17][18]. In several works, a two-step feature selection was also required. The work of Hasan et al. [19] is a good example, in which the authors explored 10 different feature encoding schemes before finding the five best feature encoding schemes based on physicochemical and position-specific information. Although the authors achieved promising results, we believe that an effective prediction model can also be constructed using a delicate, yet efficient feature extraction method.
DNA is the giant book of life in which the arrangement of the 4 nucleotides matters. This book of life is, in several aspects, similar to ordinary books: it contains characters appearing in a certain order that follows grammar rules to convey intended meanings. In this view, we can employ natural language processing techniques to decipher hidden biological meanings in DNA sequences. Thus, to address the above challenges and exploit the interrelations between biological sequences and human language, we employed advanced methods from the natural language processing area to formulate the representation of DNA.
Specifically, we constructed a bag of nucleotide n-grams with a Skip-gram architecture neural network to learn from DNA segments. This neural network generated numerical vectors for DNA k-mers, which we call k-mer embeddings. In the next step, we concatenated the embeddings of various k-mers to form real-valued vectors that were input into machine learning-based prediction models. We offered several fusions of features, all of which are based on k-mer embeddings, when searching for the best k-mer embedding-based feature sets.
Apart from a discriminative feature set, the second important part of building an efficient prediction model is to employ a good machine learning algorithm. In this study, we exploited and compared 3 efficient ensemble tree-based machine learning algorithms, namely Deep Forest, XGBoost, and Random Forest, in predicting DNA N6-methyladenine sites with the proposed feature representation method. Among the 3 algorithms, Random Forest is used as the classifier in 6 out of 12 recent 6mA site predictors [16,[20][21][22][23][24]. It uses an ensemble of trees to gather "the wisdom of crowds" for effective prediction. Regarding the two other ensemble tree-based machine learning algorithms, Deep Forest is a special machine learning algorithm inspired by deep neural networks and ensemble learning, while XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework and has been widely used in the machine learning community due to its efficiency in terms of both performance and processing time.
We applied our approach to DNA N6-methyladenine data from plant genomes. Specifically, we chose the Rosaceae genome to train and evaluate the prediction performance. Moreover, we also used two recent plant genome datasets (Rice and Arabidopsis thaliana) for the independent tests. Compared to the state-of-the-art predictor, on independent tests, our approach obtained promising results with AUC performance approaching the ideal value on the Rosaceae genome dataset at 0.99 and 0.953 on the Rice dataset. Furthermore, the statistical rates were also improved on 4 surveyed metrics on the Rice genome dataset.
Without suffering from the laborious process of extracting, searching, and combining traditional sequence-derived features and other encoding schemes, our experimentally confirmed results demonstrate that our approach is straightforward, yet efficient, and our study could open a new direction in bioinformatics modeling using natural language processing techniques.

Materials & Methods
We present a cross-species prediction model to identify N6-methyladenine sites based on k-mer embedding-based features. Figure 1 displays the flowchart of our study, which can be roughly divided into 4 main sub-processes: data collection, feature generation, classification, and performance evaluation.

Data collection
It is always essential to construct a high-quality and unbiased benchmark dataset to establish strong supervised learning classification. We therefore re-used the benchmark dataset from the latest study [19] on this problem. The data in this dataset were obtained from different sources for different species. First, the Rosaceae data were obtained from the MDR, MethSMRT, and Gene Expression Omnibus (GEO) databases [25][26][27]. Specifically, the Fragaria vesca and Rosa chinensis DNA genomes under the Rosaceae family were retrieved, yielding 40,574 samples containing 41 base pairs centered on the adenine. Then, to avoid overestimating the prediction performance, the authors removed redundant sequences with a cutoff value of 90%, resulting in 36,537 positive samples. For the negative samples, the authors followed the procedure described in [16,17,20,28], with two other plant species datasets (rice and A. thaliana). Table 1 provides the statistics of the surveyed datasets.

Training a bag of nucleotide n-grams with a Skip-gram architecture neural network as a DNA language model

DNA sequences can be viewed as a language comprising 4 letters (A, T, G, and C). Therefore, in this study, we regarded each sequence (a DNA segment with a length of 41, to be more exact) as a "DNA sentence" by splitting it into a series of k-mers (overlapping is allowed) with k in the range from 1 to 5. This means that after the splitting step, we obtained 5 text corpora for building 5 different language models. In each case, a k-mer corresponds to a word in human language, which enables the use of natural language processing techniques to extract the hidden information. The upper part of the feature extraction section of Fig. 1 illustrates how the DNA segment CTTATGG is split into 5 "DNA sentences" comprising 1-mers, 2-mers, 3-mers, 4-mers, and 5-mers, respectively.
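As a concrete illustration of this splitting step, the sketch below (Python; an illustrative helper, not the authors' code) turns a DNA segment into overlapping k-mer "sentences":

```python
def dna_to_sentence(seq, k):
    """Split a DNA segment into overlapping k-mers ("words"), producing one
    space-separated "DNA sentence" suitable for language-model training."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

segment = "CTTATGG"  # toy segment from Fig. 1
for k in range(1, 6):
    print(k, dna_to_sentence(segment, k))
# 1 C T T A T G G
# 2 CT TT TA AT TG GG
# 3 CTT TTA TAT ATG TGG
# 4 CTTA TTAT TATG ATGG
# 5 CTTAT TTATG TATGG
```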
Apart from the text corpora, to construct a language model for word embedding generation, we also need an embedding method. In this study, we used a Skip-gram neural network [29], which has one hidden layer, to learn from the "DNA sentences" described above. The input to this Skip-gram neural network is the target k-mer while the output is the context k-mers, and the learning is performed in an unsupervised manner. Figure S1 in the Supplementary materials gives an example of a Skip-gram neural network learning from a "DNA sentence".
We further applied another technique, called the bag of nucleotide n-grams. In this technique, each k-mer is represented as a bag of nucleotide n-grams. For example, the 4-mer CTAT is a bag of C, T, A, CT, TA, AT, CTA, TAT, CTAT, and the special 6-mer <CTAT>. The vector representation of CTAT is then the sum of the representations associated with each nucleotide n-gram.
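As an illustration, a small helper like the one below enumerates the nucleotide n-grams behind the CTAT example above (a simplification: real fastText also wraps sub-n-grams with the < and > boundary symbols):

```python
def nucleotide_ngrams(kmer, min_n=1, max_n=6):
    """Enumerate the nucleotide n-grams of one k-mer plus the boundary-marked
    whole k-mer, mimicking the bag-of-n-grams idea described in the text."""
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(kmer) - n + 1):
            ngrams.add(kmer[i:i + n])
    ngrams.add(f"<{kmer}>")  # special boundary-marked n-gram, e.g. <CTAT>
    return sorted(ngrams)

print(nucleotide_ngrams("CTAT"))
# ['<CTAT>', 'A', 'AT', 'C', 'CT', 'CTA', 'CTAT', 'T', 'TA', 'TAT']
```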
We used fastText [30][31][32] to create the bag of nucleotide n-grams with Skip-gram (BoN-Skipgram) models described above. fastText is an open-source, free, and lightweight library that enables fast and efficient text representation learning. After training, the embedding of a k-mer can be retrieved from BoN-Skipgram by looking it up in the mapping between k-mers and the weight vectors. In Table 2, we provide the hyperparameters for training a BoN-Skipgram model.
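A minimal training sketch with the fastText Python bindings might look as follows; the corpus file names and hyperparameter values below are illustrative placeholders rather than the exact settings listed in Table 2:

```python
import fasttext

# One BoN-Skipgram "DNA language model" per k; each corpus_{k}mer.txt
# (hypothetical file name) holds one space-separated "DNA sentence" per line.
models = {}
for k in range(1, 6):
    models[k] = fasttext.train_unsupervised(
        f"corpus_{k}mer.txt",
        model="skipgram",   # Skip-gram architecture
        dim=1,              # scalar embedding per k-mer (see the next subsection)
        ws=5,               # context window size
        minn=1, maxn=6,     # nucleotide n-gram range for the bag of n-grams
        minCount=1,
        epoch=20,
    )

# After training, look up the embedding of a k-mer directly:
print(models[4].get_word_vector("CTAT"))
```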

Hybrid features from k-mer embeddings
We obtained k-mer embeddings from the Skip-gram neural network trained on DNA sequences. Each k-mer was represented by a dense continuous numerical vector assumed to distill the most information about that k-mer in relation to its neighboring k-mers. In natural language processing, this kind of word representation has proved more efficient than bag-of-words or one-hot encoding.
Unlike the traditional bag-of-words model, where each word or nucleotide is represented by a high-dimensional sparse vector based on the occurrence counts of words within a document, word embeddings hold the power of generalization, which can capture similarities among contextual features [16].
In many NLP applications, the dimension of the embedding vectors is an important hyperparameter. Applications using bigger datasets tend to use word or sentence embedding vectors of several hundred or even several thousand dimensions. However, the bigger the dimension, the larger the size of the feature vectors. As our samples contain only 41 nucleotides, we assumed that a single scalar can represent the condensed information of a k-mer in such very short DNA segments.
Furthermore, using many real values to represent each k-mer may lead to redundant information, thus causing overfitting problems. Therefore, all our k-mer embedding vectors have a dimension of 1.
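One plausible way to realize this representation, assuming each overlapping k-mer contributes its scalar embedding at its position (the concatenation of different k-mer types into named feature sets is described next and in Table 3), is sketched below; under this assumption, the 1-2-3-4mers vector would hold 41 + 40 + 39 + 38 = 158 values per sample. Here `models` refers to the fastText models trained in the previous sketch, and `training_segments` is a hypothetical list of 41-nt sequences.

```python
import numpy as np

def sequence_features(seq, models, ks=(1, 2, 3, 4)):
    """Concatenate the scalar embeddings of every overlapping k-mer in a 41-nt
    segment into one real-valued feature vector (one plausible reading of the
    concatenation step; not necessarily the authors' exact implementation)."""
    feats = []
    for k in ks:
        for i in range(len(seq) - k + 1):
            feats.append(models[k].get_word_vector(seq[i:i + k])[0])
    return np.asarray(feats)

# X = np.vstack([sequence_features(s, models) for s in training_segments])
```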
We further combined different k-mer embedding-based features to create a diverse collection of feature sets and find the optimal one. Table 3 presents our combinations using concatenation, and Figure S2 in the Supplementary materials provides a visualization of these feature sets.

Deep Forest

Deep Forest is a special machine learning algorithm inspired by deep neural networks and ensemble learning [29]. Deep Forest integrates the best ideas from these two approaches, namely layer-by-layer processing and model diversity. In layer-by-layer processing, Deep Forest exploits a cascade structure where each level of the cascade receives feature information from its preceding layer and provides its processing outcome to the next layer. Furthermore, each layer is an ensemble of decision tree forests, making the whole model an ensemble of ensembles. To encourage diversity, different types of classifiers (LogisticRegression, RandomForest, and ExtraTreesClassifier) were included, as diversity is crucial for ensemble construction.
Equation (1) below illustrates the error-ambiguity decomposition:

E = Ē - Ā (1)

where E denotes the error of the ensemble, and Ē and Ā denote the average error of the individual classifiers in the ensemble and the average ambiguity (or diversity) among the individual classifiers, respectively.
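As a usage sketch, assuming the open-source deep-forest package (the LAMDA implementation of the cascade structure described above; the mixed estimator types used in this study are not reproduced by this default configuration), a Deep Forest classifier can be trained as follows:

```python
import numpy as np
from deepforest import CascadeForestClassifier  # pip install deep-forest

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 158))     # placeholder for the k-mer embedding features
y = rng.integers(0, 2, size=200)    # placeholder 6mA / non-6mA labels

clf = CascadeForestClassifier(random_state=1)
clf.fit(X, y)                       # the cascade grows layer by layer on the data
print(clf.predict_proba(X)[:5, 1])  # probability of the 6mA (positive) class
```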

XGBoost
XGBoost is a generalized boosting technique that enables the optimization of an arbitrarily specified loss function. It is rooted in boosting, an iterative ensemble method that trains models sequentially. These models can be considered "weak learners" (usually decision trees) since they implement basic prediction rules that perform only slightly better than a random guess. The basic principle behind boosting is to concentrate on the "hard" examples, i.e., the examples that the model fails to predict correctly with confidence. These examples are given more emphasis by skewing the distribution of observations so that such examples are more likely to appear in a sample. As a result, the next weak learner focuses more on getting these hard examples right. Since each weak learner is required to do better than random guessing, every sequential training round contributes some additional information. By combining all the simple prediction rules into one overarching model, a powerful predictor is obtained. There are several algorithms based on boosting techniques. Uniquely, in order to avoid overfitting, XGBoost uses a more regularized model formalization, which yields better performance.
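A corresponding XGBoost sketch, with the three hyperparameters tuned in this study set to illustrative values (not the tuned optima), is shown below:

```python
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 158))    # placeholder feature matrix
y = rng.integers(0, 2, size=200)   # placeholder labels

clf = XGBClassifier(
    max_depth=6,          # maximum depth of each tree
    learning_rate=0.1,    # shrinkage applied at each boosting round
    n_estimators=300,     # number of boosted trees
    eval_metric="logloss",
)
clf.fit(X, y)
print(clf.predict_proba(X)[:5, 1])
```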

Random Forest
Random Forest is an ensemble algorithm based on many decision trees. It inherits the benefits of a decision tree model, such as scaling well to larger datasets and being robust to irrelevant features. Furthermore, it also improves performance by reducing variance, which is one of the downsides of decision trees. The "random" part of Random Forest is implemented via the sampling process from the training data: a group of different decision trees is trained on different randomly drawn samples of the training data, and different subsets of features are also sampled among all available ones. This added randomness in the splitting process aims to reduce the variance of the final model.

Model setting and evaluation metrics
We used sensitivity, specificity, accuracy, and the Matthews Correlation Coefficient (MCC) as the evaluation metrics. Given the training data, for a certain model, we performed the 5-fold cross-validation technique to lower the risk of overestimating or underestimating the real performance of our prediction model.
Additionally, 3 independent tests on Rosaceae, Rice, and Arabidopsis thaliana were performed with the optimal model obtained after the hyperparameter tuning process, which employed this cross-validation technique, to evaluate the prediction performance on unseen data. The evaluation measurements are defined as follows:

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
MCC = (TP × TN - FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative predictions, respectively. Moreover, we also plotted the receiver operating characteristic (ROC) curve and calculated the area under the ROC curve (AUC) as a single metric to evaluate the overall performance of this binary classification. The AUC is based on dynamic positive-negative prediction thresholds instead of the static threshold used in accuracy: the ROC curve plots the true positive rate (sensitivity) against the false positive rate over different prediction thresholds. The AUC is a real value between zero and one; the higher the AUC, the better the classifier, and a perfect classifier has an AUC equal to one.
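These scores can be computed directly from the confusion matrix and predicted probabilities; the sketch below (using scikit-learn, with a hypothetical evaluate helper) shows one way to obtain all five metrics at a 0.5 decision threshold:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, matthews_corrcoef, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    """Sensitivity, specificity, accuracy, and MCC at a fixed threshold, plus AUC."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": matthews_corrcoef(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),  # threshold-free summary of the ROC curve
    }

print(evaluate([0, 0, 1, 1, 1], [0.2, 0.6, 0.4, 0.8, 0.9]))
```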

Hyperparameter optimization
Hyperparameter optimization plays a very important role in achieving high-quality models in most machine learning and deep learning tasks (e.g., in [20,21]). We performed a grid search over possible values of the hyperparameters for each classifier. We tuned the XGBoost model on 3 hyperparameters, namely max_depth (maximum depth of a tree), learning_rate, and n_estimators (number of trees). For Random Forest, values of max_features (maximum number of features considered for splitting a node) and n_estimators (number of trees in the forest) were searched. For Deep Forest, we fixed the different types of classifiers (LogisticRegression, RandomForest, and ExtraTreesClassifier) to add diversity to the model and then tuned 3 hyperparameters: n_estimators, max_features, and max_depth (maximum number of levels in each decision tree). The ranges of these hyperparameters are given in Table 4.
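A minimal grid-search sketch for the Random Forest hyperparameters named above is shown below; the grid values are illustrative, and the actual search ranges are those listed in Table 4:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 158))    # placeholder feature matrix
y = rng.integers(0, 2, size=200)   # placeholder labels

param_grid = {
    "n_estimators": [100, 300, 500],         # number of trees in the forest
    "max_features": ["sqrt", "log2", 0.5],   # features considered per split
}
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    scoring="roc_auc",
    cv=5,          # the same 5-fold cross-validation used for model evaluation
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```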

Overall performance
As mentioned above, we concatenated k-mer embeddings with k from 1 to 5 to create several feature sets. These feature sets were input into the Deep Forest, Random Forest, and XGBoost classifiers. The performance scores for the different feature sets from the 5-fold cross-validation process are presented in Table 5. We show both the mean and standard deviation of the scores over 6 random experiments. We observed that although the results are stable (low standard deviations) across the 6 runs, there was not much difference in prediction performance among these feature sets. However, among the 3 classification algorithms, Deep Forest yields the best performance, with the highest accuracy, sensitivity, specificity, and MCC on all feature types. In contrast, the Random Forest classifier obtains the lowest performance. Furthermore, we consider the 1-2-3-4mers feature set, comprising 1-mer, 2-mer, 3-mer, and 4-mer embeddings, the most effective one. Thus, we continued to use the 1-2-3-4mers feature set in the next experiments.

Prediction performance of the surveyed ensemble tree-based algorithms on independent datasets
We report the prediction performance of the Deep Forest, XGBoost, and Random Forest classifiers on the 3 independent datasets with the 1-2-3-4mers feature set in Tables 6a, 6b, and 6c. These tables show the prediction outcomes on the Rosaceae, Rice, and A. thaliana independent datasets, respectively. It is interesting to observe (from Table 6a) that for the Rosaceae data, the independent test performance scores are higher on almost all metrics for all 3 classifiers (marked by arrows) compared to the 5-fold cross-validation scores presented in Table 5. This indicates that our 3 classifiers do not suffer from overfitting. In addition, on the Rice independent dataset, our 3 classifiers consistently obtain AUC scores of around 0.95, which is promising given that the models were trained on Rosaceae genome data. However, the scores on the corresponding metrics for the A. thaliana independent dataset are not as satisfactory, as our models can only achieve an accuracy of around 80%. From this comparison, we considered Deep Forest the best classifier and used it for further experiments.

As noted earlier, Hasan et al. [19] searched for the optimal encoding schemes for 6mA site prediction problems, and we have reused their datasets for this study. Therefore, it is interesting to combine the optimal features they found with our proposed features. Thus, we created a new hybrid feature set comprising our best feature types (1-2-3-4mers) and their best 5 encoding schemes (MBE, DBE, EIIP, DPP, and NCP). Then, we used the same hyperparameter settings as in the experiments mentioned above. We report the average results of 10 random runs with the Deep Forest classifier in Table 7 and plot the ROC curves in Fig. 2.

Table 7. Prediction performance on hybrid features comprising k-mer embeddings and optimal encoding schemes.

In Table 7, where applicable, we have added an arrow indicating an improvement over the k-mer embedding-based features presented in Table 6a for Deep Forest. It is easy to see that with this hybrid feature set, many performance scores are slightly higher. For example, in the independent test on the Rosaceae data, the scores on all 5 metrics are improved. In addition, for Rice and A. thaliana, we observed improvements in 3 metrics each. Figure 2 shows the ROC curves of Deep Forest on this hybrid feature set. It is clear that our model approaches an ideal AUC score of 0.99 in the independent test on the Rosaceae data and obtains a high AUC value of 0.953 on the Rice data. Furthermore, the performance is stable across the 10 random runs.

Comparison to previous works on DNA N6-methyladenine site prediction

As mentioned earlier, researchers working on DNA N6-methyladenine site prediction have employed many encoding schemes in a laborious manner to obtain promising results. Here, we would like to compare the effectiveness of our feature extraction method with Meta-i6mA, the latest predictor by Hasan et al. Thus, we used the prediction performances obtained with our best k-mer embedding-based feature set (1-2-3-4mers) and Deep Forest for this comparison. (We also recalculated the prediction performance of Meta-i6mA where needed.) Table 8a shows the prediction comparison on cross-validation data, while Tables 8b, 8c, and 8d display the results on the 3 surveyed independent test datasets.
As shown in Table 8a, our predictor yields better performance on all metrics except the specificity score. On the Rosaceae independent test, we achieved better accuracy, sensitivity, and MCC. Furthermore, on the Rice independent test (Table 8b), our accuracy, sensitivity, and MCC are better. It is noteworthy that we improved the MCC from 0.797 to 0.813, indicating that our proposed features are efficient even when used in a cross-species prediction manner.
Overall, from this comparison, and given that we did not need to go through the laborious process of extracting and selecting features but still obtained prediction outcomes at the same level as the state-of-the-art 6mA site predictor, we confidently conclude that our feature extraction method, based on the application of NLP techniques, particularly k-mer embeddings generated from DNA language models, to the N6-methyladenine site prediction problem, is straightforward, yet efficient. To ensure the reproducibility of the results, we provide the feature sets learned from the fastText models (k-mer embeddings) and the source code for constructing the prediction model. Interested users can visit our repository at https://github.com/khucnam/Deep_Emb_6mA to examine our methods.

Conclusion
Predicting DNA methylation patterns has drawn a lot of attention, especially for DNA N6-methyladenine sites in plant genomes, due to their biological functions. Computational methods based on machine learning for such problems need to search for and combine optimal feature sets from a wide range of encoding schemes. Such approaches also neglect the analogies between DNA sequences and human language. Therefore, we eliminated all the old encoding schemes and adopted a new one based on advanced techniques for deciphering textual information in natural language processing. Furthermore, we employed 3 different ensemble tree-based classifiers to extensively survey the efficiency of the proposed feature extraction method. As we have shown, our approach obtains promising results, with AUC performance approaching the ideal value on the Rosaceae genome dataset (0.99) and a high AUC score on the Rice dataset (0.95). Furthermore, an improvement in the MCC score on the Rice genome dataset was observed. This demonstrates that our proposed feature representation is elegant, yet efficient. We believe that our method could open a new way to extract useful features using advanced NLP techniques.
This work was partially supported by the Ministry of Science and Technology, Taiwan.

Figure 1. The flowchart of our study. For illustration purposes, the hybrid features combining 5 types of k-mer embeddings were used.