Divide and Conquer: Clustering Patients With Autism by Gene Expression Profiles Using Machine Learning Algorithms

doi:10.21203/rs.3.rs-87427/v1

Research

This work is licensed under a CC BY 4.0 License

Version 1

Background

Clinical heterogeneity in autism spectrum disorder (ASD) can complicate diagnostics and treatments. The identification of biomarkers may hold the key to the classification of ASD subgroups. Accumulating evidence suggests that genetic or genomic markers may facilitate the clustering of patients with ASD. The goal of the current study is to use machine learning algorithms to analyze microarray data to identify clusters with relatively homogeneous clinical features, such as language function.

Methods

Whole-genome gene expression microarray data were screened, probe by probe, for association with Social Communication Questionnaire (SCQ) scores to select differential expression regions (DERs). Gene set enrichment analysis was performed to identify hub pathways that play a role in the severity of social communication deficits inherent to ASD. We then used two machine learning methods, random forest classification (RF) combined with partitioning around medoids (PAM) and support vector machine (SVM), to identify two clusters using the DERs. Finally, we evaluated how accurately the clusters predicted language impairment.

Results

A total of 191 DERs were identified. Cholesterol biosynthesis and metabolism pathways appear to act as hubs that connect other trait-associated pathways to influence the severity of social communication deficits inherent to ASD. Both the RF and SVM algorithms yielded a classification accuracy greater than 90% when all 191 DERs were analyzed.

Limitations

The primary limitation of the current study is the small sample size. Nevertheless, some machine learning algorithms, such as SVM, can handle a small sample with a large number of features. Additionally, model overfitting may arise due to the lack of an independent sample for validation. Furthermore, unknown confounders may cause spurious associations between the phenotype and genomic markers.

Conclusions

The ASD subtypes defined by the presence of language impairment, a strong indicator for prognosis, can be predicted by transcriptomic profiles associated with social communication deficits and cholesterol biosynthesis and metabolism. Our proof-of-concept study suggests that both RF and SVM are acceptable options for machine learning algorithms to identify ASD subgroups characterized by clinical homogeneity related to prognosis.

Cellular & Molecular Neuroscience

autism spectrum disorders

genomics

social cognition

language

machine learning

Clinical heterogeneity is the norm rather than the exception in autism spectrum disorder (ASD), a complex neurodevelopmental disorder characterized by social communication deficits and stereotyped behaviors. Heterogeneous clinical features pose great challenges for the diagnosis of ASD: children who receive a diagnosis of ASD have vastly different presentations, trajectories, and outcomes. Further, the diagnostic criteria for ASD have been continuously revised through different editions of the Diagnostic and Statistical Manual of Mental Disorders (DSM), with particularly substantial changes in the 5th edition (DSM-5), where the wide range of clinical presentations has been brought together under a single ASD diagnostic entity [1]. The current diagnostic system lacks an evidence-based approach, and we urgently require a scientific approach to understanding which interventions are likely to be the most effective for which child with ASD [2]. To date, no pharmaceutical treatment has been conclusively shown to substantially reduce core symptoms of ASD [3]. This may be partially attributable to the fact that most clinical trials did not take clinical heterogeneity into account, so that treatment effects remain equivocal. Variable clinical presentations may reflect different biological pathways. The identification of biomarkers for etiological pathways may hence hold the key to unraveling mechanisms underlying the variation in clinical presentations [4], which in turn may pave the way for personalized medicine in ASD.

The goal of identifying biomarkers for clinical homogeneity is to tackle challenges arising from clinical heterogeneity for research on either etiologies or treatments of ASD. Genetic factors are among the most extensively studied biomarkers for ASD. There are two different strategies to evaluate genetic markers for clinical heterogeneity: bottom-up and top-down approaches. The bottom-up approach is to define a priori subgroups using phenotypic information under the premise that some genetic loci are more likely to contribute to susceptibility to disease in certain subgroups. Therefore, stratifying the population by a clinical marker (e.g., age of onset) will allow investigators to detect genetic association effects that are larger in certain subgroups. The top-down approach, on the other hand, is based on the premise that certain genetic markers can be used to distinguish subgroups, each of which is characterized by relatively homogeneous phenotypic profiles underpinned by similar biological pathways, which imply similar therapeutic targets. Many of the earlier genome-wide linkage or association studies that aimed to unravel genetic underpinnings of clinical heterogeneity chose the first approach, identifying genetic markers associated with a phenotype defined by strict diagnostic criteria of ASD [5–7]. Using data from the Autism Diagnostic Interview-Revised (ADI-R) [8], Autism Diagnostic Observation Schedule (ADOS) [9], Vineland Adaptive Behavior Scales (VABS) [10], head circumferences, and ages at assessment as classifying variables, Veatch and colleagues identified clinically similar subgroups of individuals with ASD and found that the genotypes were more similar within subgroups than in the whole population: the proportion of the total genetic variance contained in a subpopulation was 0.17 [11]. However, this approach has not yet yielded highly replicable and clinically meaningful findings that can lead to conclusively validated etiological factors [12]. Furthermore, another genome-wide association study of 2,576 families with ASD probands did not discover any genetic loci that exert a larger effect on the disease risk in subpopulations defined by diagnosis, IQ, and symptom profiles; heritability estimates in the subpopulations were also found to be similar to those in the whole population [13].

The top-down approach often starts with a few selected genetic loci associated with the disease. Despite fruitful findings from genome-wide and candidate gene-based association studies, few genetic loci can be used to improve diagnostic accuracy or optimize the treatment effects of therapeutics for ASD. Nevertheless, several genetic markers have been found to be useful for classifying patients with ASD into relatively homogeneous subgroups. For example, Bruining and colleagues reported markedly higher symptom homogeneity in both the ASD group with 22q11 deletions and the ASD group with Klinefelter Syndrome (KS), compared to the heterogeneous ASD sample [14]. Transcriptomic profiles have also been used to identify genetic markers to classify individuals with ASD. Hu and Lai used gene expression data to identify a subset of “classifier” genes, which resulted in an overall class prediction accuracy of nearly 82%, with approximately 90% sensitivity and 75% specificity [15]. These results seem to demonstrate the value of the top-down approach.

Determining subgroups of ASD is challenging mainly because of the complexity of biological factors and the clinical heterogeneity inherent to ASD. One solution is to implement state-of-the-art statistical methods that can efficiently parse high-dimensional data, such as machine learning (ML) algorithms, to differentiate subgroups with meaningful etiological, diagnostic, or therapeutic implications [16]. Previous evidence suggests that ML algorithms can be used to reduce the number of items in standardized ASD assessment tools to make the assessment more efficient [17] and to predict clinical outcomes with ASD phenotypic clusters and genetic data on copy number variations [18]. ML algorithms also appear to be useful for identifying phenotypic clusters as ASD subgroups that can predict clinical outcomes [19]. In the current study, we attempted to implement ML algorithms in the context of the top-down approach, which is to identify clusters using genomic information and then explore the relationship between the genomic clusters and clinical features of ASD.

Data collection:

The goal of the current study is to evaluate whether transcriptomic profiles correlated with clinical severity levels of ASD, measured with the Social Communication Questionnaire (SCQ) [20], can classify patients into two subgroups defined on the basis of language (i.e., the subgroup with language impairment versus the subgroup without language impairment). Language function is considered a strong predictor of cognitive ability and adaptive skills in children with ASD [21], and its variation among patients with ASD is influenced by genetic factors [22–24]. The presence of language impairment was defined as a total (verbal) score greater than 10 in the Qualitative Abnormalities in Communication section of the Autism Diagnostic Interview-Revised (ADI-R) [8]. A total of 31 children diagnosed with ASD were recruited in the current study. The clinical diagnoses were made by Gau, a board-certified child psychiatrist, and confirmed by the ADI-R interview with the parents. The Chinese version of the ADI-R was approved by Western Psychological Services in May 2007 [25]. mRNA was extracted from lymphoblastoid cell lines (LCL) of all participants. The microarray experiment was performed at the Core Laboratory of National Taiwan University Hospital in Taiwan, using the Affymetrix Human Genome U133 Plus 2.0 Array (Affymetrix Inc., Santa Clara, CA, USA). The experimental procedures followed the protocols provided by the manufacturer. The study was conducted with the ethical approval of the Institutional Research Board at National Taiwan University Hospital in Taiwan.

Statistical methods:

The intensity files of all the subjects were input into the computer program GAP: Generalized Association Plots [26,27] for quality control using visualization and descriptive statistics. We used the Robust Multi-array Analysis (RMA) method to normalize the data [28]. All original intensity ratio data were log2-transformed after being normalized. To prioritize the gene expression levels associated with the clinical severity indicated by SCQ scores, we used the generalized linear model to screen probes across the whole genome for mRNA levels associated with the SCQ scores at an unadjusted p-value < 0.00001. We controlled for the batch effect by adjusting for the batch as a binary covariate since there were two batches. These probes constitute the primary source of predictors to determine ASD subgroups. We then evaluated the genes that harbor loci with differential expression using two pathway analyses. We first used several pathway databases (KEGG, REACTOME, Biocarta, Panther, and Wiki Pathways) to identify the pathways enriched with the candidate genes that showed differential gene expression associated with the SCQ scores. Then we used the web tool at ConsensusPathDB (http://cpdb.molgen.mpg.de/) to perform over-representation analysis (CPDB analysis) [29]. The analysis criteria included: (1) one-next neighbors for the radius with p-value < 0.01, (2) pathway-based sets with at least two overlapping genes and p-value < 0.01, and (3) gene ontology level 2 categories with p-value < 0.01. The results from the second approach helped visualize the possible “hub” pathway from the top 10 networks associated with the candidate genes.
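
The probe-level screen described above can be illustrated with a minimal sketch. It assumes that the generalized linear model is an ordinary linear model with Gaussian errors, that expression values are already normalized and log2-transformed, and that the batch is coded 0/1 with both batches present. The function names and the toy data are hypothetical and are not taken from the study. Because the batch covariate is binary, adjusting for it is equivalent to centering both variables within each batch before fitting the slope.

```python
from math import sqrt

def center_within_batch(values, batch):
    """Subtract the batch-specific mean from every value."""
    means = {}
    for b in set(batch):
        group = [v for v, g in zip(values, batch) if g == b]
        means[b] = sum(group) / len(group)
    return [v - means[g] for v, g in zip(values, batch)]

def batch_adjusted_t(expr, scq, batch):
    """Slope and t-statistic for SCQ in the model expr ~ scq + batch."""
    y = center_within_batch(expr, batch)
    x = center_within_batch(scq, batch)
    sxx = sum(v * v for v in x)
    slope = sum(a * b for a, b in zip(x, y)) / sxx
    rss = sum((b - slope * a) ** 2 for a, b in zip(x, y))
    df = len(y) - 3  # intercept, SCQ slope, batch indicator
    se = sqrt(rss / df / sxx)
    return slope, slope / se

# Hypothetical toy data for one probe: 8 subjects in two batches.
scq = [5, 10, 15, 20, 8, 12, 18, 22]
batch = [0, 0, 0, 0, 1, 1, 1, 1]
noise = [0.01, -0.01, -0.01, 0.01, -0.01, 0.01, 0.01, -0.01]
expr = [7 + 0.1 * s + 0.5 * b + e for s, b, e in zip(scq, batch, noise)]

slope, t_stat = batch_adjusted_t(expr, scq, batch)
print(f"slope = {slope:.3f}, t = {t_stat:.1f}")
```

Converting the t-statistic to a p-value requires the Student's t distribution with n - 3 degrees of freedom, which is not part of the Python standard library; a statistics package would be needed for that step and for applying the p-value < 0.00001 threshold across all probes.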

We chose two machine learning (ML) algorithms to evaluate the clustering results: random forest classification and support vector machine algorithms. The presence of language impairment was considered as a dichotomous clinical outcome to determine classification errors. The first ML algorithm was the one proposed by Shi and Horvath [30]: we used the Random Forest classification (RF) algorithm in an unsupervised mode to generate a proximity matrix. This matrix gives a rough estimate of the distance between samples based on the proportion of times the samples end up in the same leaf node. The gene expression data were analyzed with RF in two different ways. The first approach reduces data dimensionality: we implemented principal component analysis to identify principal component (PC) scores for each subject, and the top 10 PCs were selected to calculate the proximity matrix. The second approach uses the information of all 191 probes with gene expression levels significantly associated with SCQ scores to generate the RF proximity matrix. The RF proximity matrix was then converted to a dissimilarity matrix, which was used as input to partitioning around medoids (PAM) clustering [31] to classify the patients into two clusters and determine the final cluster assignment. The RF-PAM clustering analysis allowed us to evaluate the classification error by calculating the frequency of patients with language impairment in the cluster in which the majority of patients had no language impairment, and vice versa.
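
The last two steps of this pipeline (proximity to dissimilarity, then two-cluster medoid assignment and error counting) can be sketched as follows. Several points are assumptions rather than details given in the text: the conversion sqrt(1 - proximity) is one common convention for random forest proximities; the medoids are found by brute-force search over all pairs, which minimizes the same objective as PAM but is not the BUILD/SWAP procedure of the original algorithm; and the proximity values and impairment labels are hypothetical toy data.

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def proximity_to_dissimilarity(prox):
    """Assumed convention: dissimilarity = sqrt(1 - proximity)."""
    return [[sqrt(max(0.0, 1.0 - p)) for p in row] for row in prox]

def two_medoid_clustering(dissim):
    """Pick the medoid pair with the smallest total within-cluster distance."""
    n = len(dissim)
    best_cost, best_pair = float("inf"), None
    for a, b in combinations(range(n), 2):
        cost = sum(min(dissim[i][a], dissim[i][b]) for i in range(n))
        if cost < best_cost:
            best_cost, best_pair = cost, (a, b)
    a, b = best_pair
    # Assign each sample to the nearer of the two medoids.
    labels = [0 if dissim[i][a] <= dissim[i][b] else 1 for i in range(n)]
    return best_pair, labels

def majority_agreement(labels, status):
    """Fraction of samples whose status matches the majority of their cluster."""
    correct = 0
    for c in set(labels):
        members = [s for l, s in zip(labels, status) if l == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(labels)

# Hypothetical proximity matrix for four subjects.
prox = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
impaired = [True, True, False, True]  # hypothetical language-impairment flags

dissim = proximity_to_dissimilarity(prox)
medoids, labels = two_medoid_clustering(dissim)
agreement = majority_agreement(labels, impaired)
print(labels, agreement)
```

In this toy example the fourth subject is impaired but falls in the cluster whose other member is not, so it counts as one classification error out of four.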

We further chose Support Vector Machine (SVM) as the second ML algorithm to classify the patients into two subgroups [32]. To reduce data dimensionality, we implemented principal component analysis to identify the principal component (PC) scores for each subject. The PC score data were split in a 7:3 ratio; in other words, 70% of the data were used for training the model and the remaining 30% for testing the model. The C (cost) parameter was estimated using SVM with the linear kernel function. The prediction accuracy and Kappa value were estimated with the C value held constant at 1. The Kappa value was calculated using the formula (p_o - p_e)/(1 - p_e), where p_o and p_e denote the observed agreement and expected agreement for classification, respectively. We further used the confusion matrix, which contains the number of correct and incorrect predictions summarized with count values and broken down by each class, to assess the accuracy of the SVM model. The accuracy is calculated as (TP + TN)/(TP + TN + FP + FN), where TP and TN refer to true positives and true negatives, respectively, and FP and FN refer to false positives and false negatives, respectively. The SVM analysis was performed using the R package “caret” [33].
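
The accuracy and Kappa formulas above can be worked through on a small confusion matrix. The counts below are hypothetical values chosen for illustration, and the function name is not from the study; the expected agreement p_e is computed in the usual way, from the products of the marginal proportions of predicted and observed classes.

```python
def accuracy_and_kappa(tp, fn, fp, tn):
    """Return (accuracy, Cohen's kappa) for a 2x2 confusion matrix."""
    n = tp + fn + fp + tn
    p_o = (tp + tn) / n  # observed agreement, identical to accuracy
    # Expected agreement: chance that prediction and truth coincide
    # if they were independent with the same marginal proportions.
    p_pos = ((tp + fp) / n) * ((tp + fn) / n)
    p_neg = ((fn + tn) / n) * ((fp + tn) / n)
    p_e = p_pos + p_neg
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, kappa

# Hypothetical test set of 15 subjects.
acc, kappa = accuracy_and_kappa(tp=6, fn=1, fp=0, tn=8)
print(round(acc, 3), round(kappa, 3))  # prints 0.933 0.865
```

With these counts the sensitivity is 6/7 and the specificity is 8/8, so a single false negative lowers Kappa noticeably more than it lowers accuracy, which is why both are reported.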

Gene ontology and pathway analysis

We analyzed signaling pathways and Gene Ontology pathways to evaluate the biological relevance and functional pathways of the significant genes. We incorporated the KEGG [34], WikiPathways [35], BioCarta [36], and Reactome [37] pathway databases for the cell signaling pathways. We also used the GO Biological Process (2018) database for gene ontological analysis [38]. For this work, an adjusted P-value ≤ 0.05 was considered statistically significant.

For the identified significant gene set from the selected signaling pathways and gene ontology pathways, we calculated the frequency (f) of genes in the experiment set (s) that interact with a signaling or functional pathway, and the frequency (F) of genes in the population set (S) that interact with the same pathway. We then executed a test to identify how likely it would be to select at least f genes interacting with a pathway if s genes were randomly drawn from the population, given the frequency F and size S of the population. This can be represented mathematically as follows:
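
The equation itself is not reproduced in this version of the text. Assuming the test is the standard hypergeometric over-representation test, which matches the sampling scheme described in the preceding paragraph (s genes drawn without replacement from a population of S genes, F of which interact with the pathway), the probability of observing at least f pathway genes would be:

```latex
P(X \ge f) \;=\; \sum_{i=f}^{\min(s,\,F)}
  \frac{\binom{F}{i}\,\binom{S-F}{\,s-i\,}}{\binom{S}{s}}
```

Here X denotes the number of pathway genes in the random draw; a small value of P(X ≥ f) indicates that the pathway is over-represented in the experiment set.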

The clinical features of the 31 subjects are summarized in Table 1. The group with language impairment and the group without language impairment differ significantly in clinical features associated with both social communication function and verbal IQ scores.

The transcriptomic association study revealed 191 probes that were statistically significantly associated with SCQ scores at a p-value < 0.00001. The top 10 pathways, ranked by p-value in the gene set enrichment analysis, suggest that genomic functions involved in social communication span several different physiological processes that may imply novel molecular mechanisms. For example, the candidate genes are over-represented in the pathways of cholesterol biosynthesis (p = 2.29 x 10^-4) and cholesterol metabolism (p = 3.77 x 10^-4). The CPDB analysis shows that the sterol regulatory element-binding protein (SREBP) signaling pathway is connected with 9 of the top 10 pathways, so it can be regarded as the “hub” pathway. This pathway concerns the regulation of lipid metabolism by SREBP.

Cell signaling and Gene ontology pathway analyses

Using the 54 differentially expressed genes, we performed cell signaling and gene ontology pathway analyses. For the cell signaling pathway analysis, we considered six pathway databases (KEGG, WikiPathways, BioCarta, BioPlanet, Panther, and Reactome) to identify the significant pathways associated with the significant biomarker genes. We found 176 GO biological process pathways with p-values < 0.05 (Supplementary Table S1). The top 50 GO biological pathways among these 232 pathways are shown in Figure 2.

The RF-PAM analysis identified two clusters (Figure 4). The classification accuracy was 67.7% when the top 10 PCs were used to generate the proximity matrix and 96.9% when all 191 probes were used to generate the proximity matrix.

The SVM analysis based on the top 10 PC scores reached a classification accuracy of 93.3% (95% CI 68.1% - 99.8%) against a no-information rate (i.e., the largest proportion of the observed classes) of 53.3% (p-value = 0.0011). Other parameters relevant to prediction performance include Kappa value = 0.86, sensitivity = 0.86, specificity = 1.00, and balanced accuracy = 0.93. The SVM analysis using all probes with differential gene expression associated with SCQ scores yielded a slightly higher classification accuracy than the SVM analysis based on the top 10 PC scores: the classification accuracy was 99.9% (95% CI 78.2% - 100%) against a no-information rate of 53.3% (p-value = 8.035 x 10^-5). This classification accuracy is reflected in the gene expression level distributions stratified by language impairment (Supplementary Figure S2). The SVM clustering results are shown in Figure 5. The results suggest that the first two principal components could identify support vectors that fell in the area with better prediction confidence (panel A), compared with the results predicted by individual probes (panel B).
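
The p-values quoted with the no-information rate can be illustrated with a one-sided exact binomial test of the accuracy against that rate. This is a sketch under assumptions that are not stated in the text: that the test set contained 15 subjects, that 8 of them belonged to the larger class (giving the 53.3% rate), that 14 and 15 subjects were classified correctly in the two analyses, and that the reported p-values come from this kind of test. The function name is hypothetical.

```python
from math import comb

def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_test = 15   # assumed test-set size
nir = 8 / 15  # assumed no-information rate (largest class proportion)

# Probability of at least 14 (or all 15) correct calls if the classifier
# were only as good as always predicting the larger class.
p_14 = binomial_upper_tail(14, n_test, nir)
p_15 = binomial_upper_tail(15, n_test, nir)
print(f"{p_14:.4f}  {p_15:.3e}")
```

Under these assumptions the two tail probabilities round to 0.0011 and 8.035 x 10^-5, which agree with the p-values reported above; the agreement is suggestive only and does not confirm the actual test-set composition.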

We conducted a proof-of-concept study to demonstrate how transcriptomic data from a small sample could provide useful biomarkers to classify ASD subgroups. The selection of the predictors was based on differential expression regions (DERs) associated with SCQ scores, which indicate the variation in severity levels of social communication deficits, a hallmark clinical feature of ASD. The candidate genes that harbor these DERs suggest several genetic pathways that modulate the variation in social communication functions. Among these pathways, the cholesterol biosynthesis/metabolism pathway and the sterol regulatory element-binding protein (SREBP) pathway appear to act as hubs that connect other top SCQ-associated pathways. In particular, the SREBP pathway shares the most genes with other SCQ-associated pathways. These two pathways are related to lipid metabolism. Cholesterol synthesis and uptake are tightly modulated at the transcriptional level through negative feedback control, which is regulated by SREBPs [39]. The relationship between lipid metabolism and brain functions has been well documented. A growing body of evidence has indicated that cholesterol metabolism plays a key role in synaptic functions [40–42]. Dysregulated cholesterol metabolism has been extensively documented in ASD [43–50]. A recent study implemented a personalized medicine approach combining healthcare claims, electronic health records, familial whole-exome sequences, and neurodevelopmental gene expression patterns, and identified an ASD subtype characterized by dyslipidemia [51]. Several other genetic pathways are certainly involved in the molecular mechanisms underlying social communication deficits. Nevertheless, our results indicate that cholesterol synthesis/metabolism pathways act as hubs that connect most other biological pathways, which suggests that the genomic functional changes associated with lipid metabolism may moderate other genomic changes, such as those in the p53 signaling pathway, that regulate social communication functions.

Using the DERs as biomarkers, we clustered the sample into two subgroups using two different ML algorithms. Both the RF-PAM and SVM analyses yielded similar levels of classification accuracy when all 191 markers were utilized. However, compared to the analysis using the RF-PAM algorithm, the analysis using the SVM algorithm seemed to be more robust when we performed dimension reduction for all the 191 markers with the PCA method. The RF algorithm is applicable when there are more predictors than observations, is relatively insensitive to noise (e.g., a large number of irrelevant genes), and does not rely on excessive fine-tuning of parameters [52]. The RF algorithm is more robust to small sample sizes than the SVM algorithm [53,54]. However, Brown and colleagues found that SVM outperforms other techniques, including Fisher’s linear discriminant, Parzen windows, and two decision tree learners, when using gene expression data to predict clinical outcomes [55]. Additionally, Statnikov and colleagues conducted a comprehensive comparison of RF and SVM using microarray data for 22 diagnostic and prognostic datasets and concluded that SVM is superior to RF in terms of classification accuracy [56]. Although the purpose of this study is not to comprehensively evaluate which ML algorithm outperforms the other, our results seem to lend some support to the robustness of the SVM algorithm. Nevertheless, the RF algorithm is at least as robust as the SVM algorithm when the dimension of input variables is not substantially reduced.

One of the major limitations of the current study is the small sample size. Nevertheless, some machine learning algorithms, such as SVM, can handle a small sample with a large number of features. Additionally, model overfitting may arise due to the lack of an independent sample for validation. Furthermore, unknown confounders may cause spurious associations between the phenotype and genomic markers. However, the goal of this proof-of-concept study is the prediction of subtypes rather than the identification of etiologies. Therefore, confounders would not be expected to have a substantial impact on the prediction results [57].

The clinical and etiological heterogeneity in ASD has meant that there is considerable variability in treatment outcomes across different interventions and between individuals receiving the same intervention. Hence the traditional diagnostic and “one size fits all” approach to ASD intervention needs improvement. Further, we currently do not have a sufficient understanding of “what would work for whom”, which limits opportunities for maximizing outcomes for children and their families and has economic ramifications for broader society. In this context, ML algorithms have been found to be useful in predicting diagnostic accuracy in ASD with neuroimaging data [58]. Further, one recent study used Gaussian Mixture Models and Hierarchical Agglomerative Clustering, which provide a statistical framework for learning latent cluster memberships, to determine ASD subgroups with differentiated treatment responses [59]. Our finding that children could be classified with ML algorithms into two groups based on the presence of language impairment offers promise for unraveling clinically meaningful subgroups in ASD. This, in turn, can be used for predicting likely responsiveness (and non-responsiveness) to specific treatment pathways. This ‘precision’ approach to assessment and intervention can help ensure that resources for appropriate intervention and supports are allocated in an evidence-based manner. This is critical because, without timely recognition of the variability in clinical presentation, neurocognitive level of functioning, and psychosocial circumstances, coupled with individualized intervention, children and their families may miss key opportunities offered by brain plasticity in the critical early years. ML techniques as utilized in this study offer a viable way to address this by matching interventions and supports to the individual profile and needs.

Ethics approval and consent to participate

The protocol entitled “Clinical and molecular genetic studies of autism spectrum disorder”, submitted by Principal Investigator SSG, Department of Psychiatry, National Taiwan University Hospital, Taiwan, was approved by the 119th meeting of the Research Ethics Committee of the National Taiwan University Hospital on September 26, 2006 (NTUH-REC ID: 9561709027) and by the other two collaborating sites (Chang-Gung Memorial Hospital in Taoyuan, CGMH ID: 93-6244 and Taoyuan Mental Hospital in Taoyuan, TYMH ID: C20060905). The committees of the three research sites were organized and operated according to GCP and the applicable laws and regulations. The Research Ethics Committees of the three research sites approved this study [ClinicalTrials.gov number, NCT00494754]. Written informed consent was obtained from the majority of the probands, if they were able to give their signature after reading the informed consent form, and from all their parents after the purposes and procedures of the study were fully explained and confidentiality was ensured.

Consent for publication

Not applicable.

Availability of data and materials

The data-sharing plan has been approved by all key investigators across collaborating sites and by the Research Ethics Committees of the collaborating sites. SSG, the principal investigator of this project, coordinated the research and managed all the clinical and genetic data. We agreed that the de-identified data and key clinical variables will be released to investigators upon request with relevant institutional approval documents.

Funding

The genetic data collection and analysis were supported by grants from the Ministry of Science and Technology (NSC 99-3112-B-002-036), Taiwan, and National Taiwan University Hospital (NCTRC201114), Taiwan, awarded to SSG.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

PL and MAM carried out the statistical analysis. PL and VE conceived of the study and drafted the manuscript. SSG participated in the study design and coordination. All authors read and approved the final manuscript.

Acknowledgments

The authors would like to thank the subjects who participated in this study and facility support at National Taiwan University Hospital.

American Psychiatric Association. DSM-5 Diagnostic Classification. Diagnostic Stat Man Ment Disord. 2013.
Eapen V, Crncec R. There are Gains, But can we Tell for Whom and Why? Predictors of Treatment Response Following Group Early Start Denver Model Intervention in Preschool-Aged Children with Autism Spectrum Disorder. Autism Open Access. 2016;
Bowers K, Lin P-I, Erickson C. Pharmacogenomic Medicine in Autism: Challenges and Opportunities. Pediatr Drugs. 2015;17.
McPartland JC, Bernier RA, Jeste SS, Dawson G, Nelson CA, Chawarska K, et al. The Autism Biomarkers Consortium for Clinical Trials (ABC-CT): Scientific Context, Study Design, and Progress Toward Biomarker Qualification. Front Integr Neurosci. 2020;
Anney R, Klei L, Pinto D, Regan R, Conroy J, Magalhaes TR, et al. A genome-wide scan for common alleles affecting risk for autism. Hum Mol Genet. 2010;
Yonan AL, Alarcón M, Cheng R, Magnusson PKE, Spence SJ, Palmer AA, et al. A Genomewide Screen of 345 Families for Autism-Susceptibility Loci. Am J Hum Genet. 2003;
Liu J, Nyholt DR, Magnussen P, Parano E, Pavone P, Geschwind D, et al. A genomewide screen for autism susceptibility loci. Am J Hum Genet. 2001;
Lord C, Rutter M, Le Couteur A. Autism Diagnostic Interview-Revised: A revised version of a diagnostic interview for caregivers of individuals with possible pervasive developmental disorders. J Autism Dev Disord. 1994;
Rutter M, DiLavore P, Risi S, Gotham K, Bishop SL. Autism diagnostic observation schedule: ADOS: Manual. Los Angeles, CA West Psychol Serv. 2002;
Icabone DG. Vineland Adaptive Behavior Scales. Diagnostique. 1999;
Veatch OJ, Veenstra-Vanderweele J, Potter M, Pericak-Vance MA, Haines JL. Genetically meaningful phenotypic subgroups in autism spectrum disorders. Genes, Brain Behav. 2014;
Anney R, Klei L, Pinto D, Almeida J, Bacchelli E, Baird G, et al. Individual common variants exert weak effects on the risk for autism spectrum disorders. Hum Mol Genet. 2012;
Chaste P, Klei L, Sanders SJ, Hus V, Murtha MT, Lowe JK, et al. A genome-wide association study of autism using the Simons simplex collection: Does reducing phenotypic heterogeneity in autism increase genetic homogeneity? Biol Psychiatry. 2015;
Bruining H, de Sonneville L, Swaab H, de Jonge M, Kas M, van Engeland H, et al. Dissecting the clinical heterogeneity of autism spectrum disorders through defined genotypes. PLoS One. 2010;
Hu VW, Lai Y. Developing a Predictive Gene Classifier for Autism Spectrum Disorders Based upon Differential Gene Expression Profiles of Phenotypic Subgroups. N Am J Med Sci (Boston). 2013;
Mottron L, Bzdok D. Autism spectrum heterogeneity: fact or artifact? Mol. Psychiatry. 2020.
Küpper C, Stroth S, Wolff N, Hauck F, Kliewer N, Schad-Hansjosten T, et al. Identifying predictive features of autism spectrum disorders in a clinical sample of adolescents and adults using machine learning. Sci Rep. 2020;
Asif M, Martiniano HFMC, Marques AR, Santos JX, Vilela J, Rasga C, et al. Identification of biological mechanisms underlying a multidimensional ASD phenotype using machine learning. Transl Psychiatry. 2020;
Akter T, Shahriare Satu M, Khan MI, Ali MH, Uddin S, Lio P, et al. Machine Learning-Based Models for Early Stage Detection of Autism Spectrum Disorders. IEEE Access. 2019;
Schanding Jr. GT, Nowell KP, Goin-Kochel RP. Utility of the social communication questionnaire-current and social responsiveness scale as teacher-report screening tools for autism spectrum disorders. J Autism Dev Disord [Internet]. 2012;42:1705–16. Available from: http://www.ncbi.nlm.nih.gov/pubmed/22143742
Mayo J, Chlebowski C, Fein DA, Eigsti IM. Age of first words predicts cognitive ability and adaptive skills in children with ASD. J Autism Dev Disord. 2013.
Lin PI, Kuo PH, Chen CH, Wu JY, Gau SSF, Wu YY, et al. Runs of Homozygosity Associated with Speech Delay in Autism in a Taiwanese Han Population: Evidence for the Recessive Model. PLoS One. 2013.
Lin PI, Chien YL, Wu YY, Chen CH, Gau SSF, Huang YS, et al. The WNT2 gene polymorphism associated with speech delay inherent to autism. Res Dev Disabil. 2012.
Eicher JD, Gruen JR. Language impairment and dyslexia genes influence language skills in children with autism spectrum disorders. Autism Res. 2015.
Gau SSF, Lee CM, Lai MC, Chiu YN, Huang YF, Kao JD, et al. Psychometric properties of the Chinese version of the Social Communication Questionnaire. Res Autism Spectr Disord. 2011.
Chen CH. Generalized association plots: Information visualization via iteratively generated correlation matrices. Stat Sin. 2002.
Wu HM, Tien YJ, Chen CH. GAP: A graphical environment for matrix visualization and cluster analysis. Comput Stat Data Anal. 2010.
Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003.
Kamburov A, Stelzl U, Lehrach H, Herwig R. The ConsensusPathDB interaction database: 2013 Update. Nucleic Acids Res. 2013;41.
Shi T, Horvath S. Unsupervised learning with random forest predictors. J Comput Graph Stat. 2006.
Kaufman L, Rousseeuw PJ. Partitioning Around Medoids (Program PAM). In: Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken: Wiley; 2008.
Cortes C, Vapnik V. Support-Vector Networks. Mach Learn. 1995.
Kuhn M. caret Package. J Stat Softw. 2008.
Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000.
Slenter DN, Kutmon M, Hanspers K, Riutta A, Windsor J, Nunes N, et al. WikiPathways: A multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res. 2018.
Nishimura D. BioCarta. Biotech Softw Internet Rep. 2001.
Fabregat A, Jupe S, Matthews L, Sidiropoulos K, Gillespie M, Garapati P, et al. The Reactome Pathway Knowledgebase. Nucleic Acids Res. 2018.
Carbon S, Douglass E, Dunn N, Good B, Harris NL, Lewis SE, et al. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res. 2019.
Sato R. Sterol metabolism and SREBP activation. Arch Biochem Biophys. 2010.
Paul SM, Doherty JJ, Robichaud AJ, Belfort GM, Chow BY, Hammond RS, et al. The major brain cholesterol metabolite 24(S)-hydroxycholesterol is a potent allosteric modulator of N-Methyl-D-Aspartate receptors. J Neurosci. 2013;33.
Wang H. Lipid rafts: A signaling platform linking cholesterol metabolism to synaptic deficits in autism spectrum disorders. Front Behav Neurosci. 2014;8.
Petrov AM, Kasimov MR, Zefirov AL. Cholesterol in the pathogenesis of Alzheimer’s, Parkinson’s diseases and autism: Link to synaptic dysfunction. Acta Naturae. 2017.
Tamiji J, Crawford DA. The neurobiology of lipid metabolism in autism spectrum disorders. NeuroSignals. 2011.
Gillberg C, Fernell E, Kočovská E, Minnis H, Bourgeron T, Thompson L, et al. The role of cholesterol metabolism and various steroid abnormalities in autism spectrum disorders: A hypothesis paper. Autism Res. 2017.
Richardson AJ, Ross MA. Fatty acid metabolism in neurodevelopmental disorder: A new perspective on associations between attention-deficit/hyperactivity disorder, dyslexia, dyspraxia and the autistic spectrum. Prostaglandins Leukot Essent Fat Acids. 2000.
Aneja A, Tierney E. Autism: The role of cholesterol in treatment. Int Rev Psychiatry. 2008.
Cartocci V, Catallo M, Tempestilli M, Segatto M, Pfrieger FW, Bronzuoli MR, et al. Altered Brain Cholesterol/Isoprenoid Metabolism in a Rat Model of Autism Spectrum Disorders. Neuroscience. 2018.
Esparham AE, Smith T, Belmont JM, Haden M, Wagner LE, Evans RG, et al. Nutritional and metabolic biomarkers in autism spectrum disorders: An exploratory study. Integr Med. 2015.
Tierney E, Bukelis I, Thompson RE, Ahmed K, Aneja A, Kratz L, et al. Abnormalities of cholesterol metabolism in autism spectrum disorders. Am J Med Genet Part B Neuropsychiatr Genet. 2006.
Petrov AM, Kasimov MR, Zefirov AL. Cholesterol in the pathogenesis of Alzheimer’s, Parkinson’s diseases and autism: Link to synaptic dysfunction. Acta Naturae. 2017.
Luo Y, Eran A, Palmer N, Avillach P, Levy-Moonshine A, Szolovits P, et al. A multidimensional precision medicine approach identifies an autism subtype characterized by dyslipidemia. Nat Med. 2020.
Breiman L. Random forests. Mach Learn. 2001.
Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006.
Kim SY. Effects of sample size on robustness and prediction accuracy of a prognostic gene signature. BMC Bioinformatics. 2009.
Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, et al. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci U S A. 2000.
Statnikov A, Wang L, Aliferis CF. A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008.
Van Diepen M, Ramspek CL, Jager KJ, Zoccali C, Dekker FW. Prediction versus aetiology: Common pitfalls and how to avoid them. Nephrol Dial Transplant. 2017.
Moon SJ, Hwang J, Kana R, Torous J, Kim JW. Accuracy of machine learning algorithms for the diagnosis of autism spectrum disorder: Systematic review and meta-analysis of brain magnetic resonance imaging studies. J Med Internet Res. 2019.
Stevens E, Dixon DR, Novack MN, Granpeesheh D, Smith T, Linstead E. Identification and analysis of behavioral phenotypes in autism spectrum disorder via unsupervised machine learning. Int J Med Inform. 2019.

Table 1. Clinical features of the patients in the current study.

Measure	Language impairment (51.3%)	No language impairment (48.7%)	Relationship with language impairment*
Age	9.00 (SD: 2.52)	8.91 (SD: 3.99)	P > 0.05
ADIR-BV	17.83 (SD: 3.27)	8.55 (SD: 1.13)	P < 0.0001
ADIR-BN	8.92 (SD: 2.71)	3.64 (SD: 1.43)	P < 0.0001
SCQ	22.19 (SD: 4.84)	11.47 (SD: 4.84)	P < 0.0001
VIQ	82.08 (SD: 20.77)	111.91 (SD: 10.12)	P = 0.0003
PIQ	90.83 (SD: 15.74)	101.36 (SD: 15.34)	P > 0.05
SRS	89.61 (SD: 16.12)	79.55 (SD: 27.99)	P > 0.05

ADIR-BV: Autism Diagnostic Interview – Revised, Qualitative Abnormalities in Communication, Total Verbal score

ADIR-BN: Autism Diagnostic Interview – Revised, Qualitative Abnormalities in Communication, Total Non-Verbal score

SCQ: Social Communication Questionnaire score

VIQ: verbal IQ

PIQ: performance IQ

SRS: Social Responsiveness Scale score
