Machine Learning Algorithms for the Identication of Potential Non-Syndromic Intellectual Disability Associated miRNAs

Background: Intellectual disabilities (IDs) are a group of developmental disorders with high phenotypic and genotypic heterogeneity. Association of genetic elements to IDs has typically been empirically accomplished, however recently, machine learning (ML) has proved to be an excellent instrument to elucidate these associations. miRNAs are short non-coding molecules that participate in spatiotemporal gene regulation, making them relevant for the understanding ID causality. Methods: In this study we used the BrainSpan spatio-temporal expression database to develop a series of machine learning predictors: SVM, RF, FF-ANN, and Stochastic Gradient Descent Classiﬁer. These models were capable of recognizing gene expression proﬁles. The best classiﬁer was used to label miRNAs associated with NS-IDs using the BrainSpan expression proﬁles. Results: The model with the best performance was a FF-ANN with 0.78 of F1-score, 0.78 of weighted recall and 0.78 of weighted precision. We used this model to identify miRNAs with high probability to be associated with NS-IDs using the spatio-temporal gene expression proﬁle in the human brain. Labeled miRNAs that were annotated were associated with processes related to either IDs and-or neurodevelopmental processes. Conclusions: The development of a machine learning framework that identiﬁed potential NS-ID miRNAs represents an interesting approach for the identiﬁcation of a potential list of on genes that could be subject for further experimental validation. This study also reinforces the potential of machine learning frameworks in their discovery of potential biomarkers that could improve disease detection and management.


Background
Intellectual disabilities (ID) are a group of neurodevelopmental disorders with high phenotypic and genotypic heterogeneity [1,2] and an estimate prevalence of 1-3% worldwide [3]. The knowledge of genetic causes for syndromic (S-IDs) and nonsyndromic IDs (NS-IDs) is the result of exhaustive research in gene variants and their potential association. Such work has been mainly done using next generation sequencing [4,1] and has provided valuable information on ID causing genetic variants. The comprehension of biological basis of IDs could help improve quality of life in individuals with S-IDs/NS-IDs, as well as for the understanding of human cognition [4,1,5].
Due to their capacity to modulate expression of multiple genes, miRNAs, a class of small non-coding RNAs, play a crucial role in all tissues' development and maturation [6]. Compelling evidence exists supporting the involvement of miRNAs as important post-transcriptional regulators of encephalon development and plasticity of synaptic connections [6,7]. Overall miRNAs' expression patterns are widely variable among brain regions and throughout the lives of individuals, time during which each miRNA can regulate the expression of many different genes, other miR-NAs, and other genetic elements [8].
In the last decade miRNAs have been associated with S-IDs and with NS-IDs through the correlation of expression anomalies (overexpression and/or underexpression) [7,2]. Similarly, variants that alter the normal pairing of specific miRNAs and their mRNA targets, have been associated with the emergence of such disorders [7,2].
In recent years, the diagnosis of human disorders and conditions has benefited from the increased ability to convert data into knowledge. This process, defined as knowledge discovery, is primarily done using machine learning algorithms. Machine learning (ML) is a technique in the field of computation that allows a computer to learn how to recognize patterns hidden in data of any nature [9], a useful feature to work with data from complex and hard to model phenomena. Supervised learning allows us to teach the computers how to recognize classes from known data as for the computer to classify new data for us [9]. ML models are systems that apply one or several learning algorithms with the objective of analysing new data. These models have reached high popularity in recent years to study complex problems in genetics [10,11].
Cogill and Wang [12] created a ML model that provided a list of candidates miR-NAs for bio-markers in autism, utilizing a database of gene expression of the brain development. Using this same database, Gök [13] demonstrates the utility of considering different types of ML models in the search for better models with higher performance in classification problems in genetics, but focused on the association of lncRNAs with the autism spectrum.
In this paper, through the use of a computational framework that includes machine learning algorithms and functional annotations, we have presented a list of miRNAs candidates associated with NS-IDs. The list was the result of predictions generated from the highest performing machine learning model, trained with spatialtemporal gene expression data of the developing human brain. Our working hypothesis was that NS-IDs associated genes, and miRNAs that regulate them share spatial-temporal expression patterns. Based on that notion, we developed a classification system, focusing on the highest performance metrics, capable of classifying spatial-temporal expression profiles in the human brain.

Data Acquisition
For the training as well as for the miRNAs classification we utilized the gene expression profiles of the developing human brain repository BrainSpan [14]. BrainSpan contains the expression profiles from RNA-seq in RPKM of genes and non-gene genetic elements in 16 brain areas from prenatal states to 40 y.o. with an average depth of 2 patients per area-developmental state combination. In this database each patient-area-developmental state combination is a separated column. This information was consolidated as a data matrix by the association of expression profiles data and information about spatio-temporal variables for each patient from different files available in BrainSpan.

Data Preprocessing
We selected the expression profiles of 1823 genes with no known variants associated or causing ID phenotypes. The set of 1823 genes were assigned the -1 class label to indicate their negative association with ID. In contrast, a set of 707 genes reported in the literature [1] as associated with NS-IDs were selected. These genes were labeled as NS-ID, or given the 1 class label to indicate their positive association with ID.
The amount of expression profiles for negative and positive genes were different to the number of selected genes due to BrainSpan having the mRNA profiles instead of gene profiles. We decided not to eliminate those profiles with duplicated gene names because they correspond to different isoforms with different biological activities. The pairing of the gene names and their corresponding ENCODE identification was accomplished utilizing Bioconductor's biomaRt [15] library from the programming language R version 3.5.2. We built a database (named initial DB) with gene expression profiles of positive and negative genes only.
For the initial data set the RPKM values were normalized in three ways: 1) applying base 2 logarithm, 2) applying min-max normalization, and 3) applying base 2 logarithm (log2) followed by application of min-max normalization. Using log2 and min-max is based on the notion that the logarithmic transformation can be implemented for reducing the inter and intra-sample differences or noise, whereas the min-max normalization can be used to allocate all instances in the same scale or plane, thus improving algorithm convergence. The use of the three normalization methods aimed to find the best possible performance of the ML models. All experiments were executed with the three versions of the normalized initial DB (Fig.  1).

Machine Learning model building and analysis
We used the following ML algorithms: Support Vector Machine (SVM), Random Forest (RF), and Feed Forward Artificial Neural Network (FF-ANN). SVM models were implemented in the programming language R with the library e1071 [16], and in the programming language Python we implemented the SGDClassifier model, from scikit learn (sklearn) [17], which as a default uses the linear SVM. RF models were implemented in R with the library randomForest [18]. FF-ANN models were implemented as in R with the library neuralnet [19], and in Python with TensorFlow [20] and Keras [21]. For each ML model ten iterations were executed. Each iteration consisted of selecting a random seed and the creation of at least ten models of the corresponding algorithm with different combinations of its internal parameters (Fig. 1). For the creation of each model we used 70% of data for training and 30% for testing and validation. Training data were balanced utilizing the library caret of R [22], and the library sklearn of Python. Random seeds and confusion matrix were recorded for each iteration. SVM models were trained using "radial", "lineal", and "sigmoidal" kernels, cost and gamma (in case of sigmoidal kernel) were adjusted with the tune function of e1071 library. RF models were trained with 1500 decision trees per forest, with no replace, all other parameters were used with default values, and training data set was balanced by class using SMOTE function from DMwR [23] library of R with k equals 3 and 190% of oversampling of minority class. FF-ANN models were trained with all available solvers in Keras and MLPClassifier, with an alpha between 0.0001 and 1 in sweeps by one order of magnitude, with different combinations of activation functions for layer, and in uniform and autoencoder architectures.

Feature Selection of Important Variables
The RF model with the higher precision and F1 score was utilized for feature selection of the normalized initial DB based on the score of importance of each variable over the RF classification. Based on the feature selection we built two new databases, one with the 80% of variables with higher importance score, and another with the 60% of variables with the higher importance score. Three more iterations were executed for all algorithms using the new databases normalized with the third normalization algorithm described above.

miRNAs Classification
From the total of models created we selected the one with the higher F1 score and AUC for the classification of miRNAs. The F1 scores were calculated using the function "classification report " from sklearn.metrics library of Python. The AUC and ROC curve were obtained using ROSE library of R [24]. The best model was retrained with the totality of the expression profiles of positive and negative genes. The miRNAs' expression profiles were obtained from the same database as the gene expression profiles, available in BrainSpan project. Once the classification of miRNAs was obtained we searched for literature that associated the positive tags resulting from the miRNAs classification with NS-IDs or with neurodevelopment. We filtered miRNAs classified as positive according to their probability to be associated with NS-IDs in order to obtain the best candidates only (those with probability greater than 0.99). The mirbase-names of the best candidates were obtained through the library biomaRt of R. The list of target genes of the best candidates was obtained with the tool STARBASE with searches of high stringency of CLIP data (greater or equals to 3 according to the tool) over the human genome.

Results
The training database consisted in the expression profiles of 27657 gene transcripts of the negative class (unassociated with NS-IDs), 11640 gene transcripts of the positive class (already reported as associated with NS-IDs), and 524 patient-brain area-developmental state variables.
The normalization algorithm that produced the best results was the third algorithm explained above (data not presented). The performance of those models trained with the database with feature selection applied was higher than those trained with the totality of original variables (TABLE 1). However, the general performance of the models trained with the 60% of the variables was the lowest (data not presented). The best model obtained (TABLE 2) was a FF-ANN of Multilayer Perceptron Classifier type (MLPClassifier) integrated by 5 hidden layers of 400 neurons each, an alpha of 0.01, solver SGD, and random state 121. The AUC value 0.799 of the best model was the higher AUC between obtained (Fig. 2).
Once the best model was retrained the classification of 544 miRNAs was performed, and 36 of them were classified as associated with NS-IDs (TABLE 3), using their class probabilities. In this group of 36 miRNAs the probability average was 0.933 (±0.119 sd). For the miRNAs classified as negative the average probability of being positive was 0.067 (±0.119 sd). A total of 23 miRNAs were selected as best candidates due to their probability being greater or equal to 0.99 (TABLE 3). For 11 of the 23 best candidates, mirbase-names were not found in biomaRt, probably due to the original transcript names in BrainSpan being in the ENCODE 10 version. We found 7 of the best candidates already being reported as associated with brain development, cognitive development, or NS-IDs (TABLE 3). A total of 7 of the selected candidates have target genes reported in literature as associated with intellectual disorders and ND-IDs (TABLE 3) according to the results obtained from STARBASE. A total of 5 more were not found in the STARBASE system, probably due to the Gencode V10 used by BrainSpan in the identification of transcripts.

Discussion
The issue of predicting the association between genes and neurodevelopmental disorders has been successfully broached by several authors, including Cogill and Wang [12], Gök [13], and Wang and Wang [25]. However, only the best ML model obtained by Wang and Wang was an ANN type, with an Autoencoder architecture [25]. We also obtained a best ML model of ANN type but with a feed forward (FF) architecture instead of the best architecture Wang and Wang reported. The differences between the study's FF-ANN models and the remaining models can be clearly observed after the application of feature selection, suggesting that expression data patterns are better learned by FF-ANN model types, as long as the noise is filtered. We used the best model obtained with all the data (an RF type model) to perform the feature selection. This technique, which has already been reported as an embedded technique of feature selection [26] and used by Wang and Wang [25], allowed us to obtain a significant improvement in the best model, as well as all other developed models. The remaining types of model were not benefited by the feature selection as much as the best model.
In problems that requires classification in two categories random guess would obtain a performance of 0.5 correct classifications. Any higher performance indicates that information about existing patterns in the data was obtained, which is the case of this study's best model, since it obtained a considerably higher value of performance. This gained information has been of great value in other studies that use ML algorithms to associate genes with conditions, disorders and disabilities, with an effectiveness of close to 0.8, which is a similar value as the performance obtained by the best model of this study [12,13,25]. In biological terms, the pattern information that was gained means that the hypothesis the study was based on, that the genes associated with NS-IDS and the miRNAs that regulate them have similar spatio-temporal expression patterns in the human brain, is correct.
For the extent of this work, it was impossible to obtain the most significant spatiotemporal variables of the brain for the performed classification due to the best model being a FF-ANN type. FF-ANN are often considered black box systems, meaning that interpretation of the importance of used entrance variables for the final prediction is either highly complex or impossible. In cases of ANNs with many layers of many neurons per layer, this is particularly true, like is the case of our best model [27].

Conclusions
It is important to note that among the miRNAs recognized as positive by the best model are miRNAs previously reported by literature as associated to NS-IDs and other neurodevelopmental disorders. According to the best ML model, all miRNAs selected as best candidates have a high probability of being associated with NS-IDs, based on their spatio-temporal expression patterns. Finding that all candidate miRNAs have target genes previously reported as associated with NS-IDs or with other neurodevelopmental disorders supports the future study of the candidates proposed by the best model, as they have a high probability of being involved in important processes related to the emergence of NS-IDs. Ethics approval and consent to participate In this study we worked with human data, available in BrainSpan project's web site. No any consent to participate was required due to the absence of traceable, personal, or sensitive information.

Consent for publication Not applicable
Availability of data and materials The original data to replicate or findings is available in the download section of BrainSpan project's web site https://www.brainspan.org/static/download.html     Table 3 miRNAs tagged as associated with NS-IDs by the best ML model.