The goal of the current study is to evaluate whether transcriptomic profiles correlated with clinical severity levels of ASD - which were measured with social communication questionnaire (SCQ) , can classify patients into two subgroups defined on the basis of language (i.e., the subgroup with language impairment versus the subgroup without language impairment). The language function is considered as a strong predictor for cognitive ability and adaptive skills in children with ASD , and its variation within ASD patients is influenced by genetic factors [22–24]. The presence of language impairment was defined as the total score (verbal) greater than 10 in the section of Qualitative Abnormalities in Communication in Autism Diagnostic Interview-Revised (ADI-R) . A total of 31 children diagnosed with ASD were recruited in the current study. The clinical diagnoses were made by Gau, a board-certified child psychiatrist, and confirmed by the ADI-R interview with the parents. The Chinese version of the ADI-R been approved by the Western Psychological Services in May 2007  mRNA was extracted from lymphoblastoid cell lines (LCL) of all participants. The microarray experiment was performed at the Core Laboratory of National Taiwan University Hospital in Taiwan, using the Affymetrix Human Genome U133 Plus 2.0 Array (Affymetrix Inc., Santa Clara, CA, USA). The experimental procedures followed the protocols provided by the manufacturer. The study was conducted with the ethical approval by the Institutional Research Board at National Taiwan University Hospital in Taiwan.
The intensity files of all the subjects were input into the computer program GAP: Generalized Association Plots [26,27] for quality control using visualization and descriptive statistics. We used the Robust Multi-array Analysis (RMA) method to normalize the data . To prioritize the gene expression levels associated with the clinical severity indicated by SCQ scores, we used the generalized linear model to screen for probes across the whole genome with mRNA levels associated with the SCQ scores with unadjusted p-value < 0.00001. All original intensity ratio data were transformed into logarithmic 2 values after being normalized. We controlled for the batch effect by adjusting for the batch as a binary covariate since there were two batches. These probes constitute the primary source of predictors to determine ASD subgroups. We then evaluated the genes that harbor loci with differential expressions using the pathway analyses. We first used several pathway databases (KEGG, REACTOME, Biocarta, Panther, and Wiki Pathways) to identify the pathways enriched with the candidate genes that showed differential gene expressions associated with the SCQ scores. Then we used the webtool at ConsensusPathDB (http://cpdb.molgen.mpg.de/) to perform over-representation analysis (CPDB analysis) . The analysis criteria included: (1) one-next neighbors for the radius with p-value < 0.01, (2) pathway-based sets at least two overlapped genes and p-value < 0.01, and (3) gene ontology level 2 categories with p-value < 0.01. The results from the second approach helped visualize the possible “hub” pathway from the top 10 networks associated with the candidate genes.
We chose two machine learning (ML) algorithms to evaluate the clustering results: random forest classification and support vector machine algorithms. The presence of language impairment was considered as a dichotomous clinical outcome to determine classification errors. We chose the first ML algorithm proposed by Shi and Horvath . We used the Random Forest classification (RF) algorithm in an unsupervised mode to generate a proximity matrix. The gene expression data were analyzed using RF using two different approaches. The first approach is to reduce data dimensionality, in which we implemented principal component analysis to identify principal component (PC) scores for each subject. The top 10 PCs were selected to calculate the proximity matrix. This matrix gives a rough estimate of the distance between samples based on the proportion of times the samples end up in the same leaf node. The second approach is to use the information of all 191 probes with gene expression levels significantly associated with SCQ scores to generate the RF proximity matrix. The RF proximity matrix was then converted to a dissimilarity matrix, which was then used as input in partitioning around medoid (PAM) clustering  to classify the patients into two clusters to determine the final cluster assignment. The RF-PAM clustering analysis could allow us to evaluate the classification error by calculating the frequency of patients with language impairment in the cluster, in which the majority of patients had no language impairment, and vice versa.
We further chose Support Vector Machine (SVM) as the second ML algorithm to classify the patients into two subgroups . To reduce data dimensionality, we implemented principal component analysis to identify the principal component (PC) scores for each subject. The data of PC scores were split in a 7:3 ratio - in other words, 70% of the data was used for training the model and the remaining 30% was for testing the model. Estimating the C (Cost) parameter to classify the data was performed using SVM with the linear kernel function. The prediction accuracy and Kappa value estimated when the C value was held constant at 1. The Kappa value was calculated using the formula (po - pe)/(1-pe), where po and pe denote the observed agreement and expected agreement for classification, respectively. We further used the confusion matrix, which contains the number of correct and incorrect predictions summarized with count values and broken down by each class, to predict the accuracy of the SVM model. The accuracy is calculated as (TP + TN)/(TP+TN+FP+FN), where TP and TN refer to true positives and true negatives, respectively; FP and FN refer to false positives and false negatives, respectively. The SVM analysis was performed using the R package “caret” .
Gene ontology and pathway analysis
We analyzed signaling pathways and Gene Ontology pathways to evaluate the biological relevance and functional pathways of the significant genes. We have incorporated the KEGG , WikiPathways , BioCarta , and Reactome  pathway database for the cell signaling pathways. We have also considered the GO Biological Process (2018) database for gene ontological analysis . For this work, an adjusted P-value ≤0.05 was considered as statistically significant.
For the identified significant gene set from the selected signaling pathways and gene ontology pathways, we calculated the frequency (f) of genes in the experiment set (s) that interact with a signaling or functional pathway, and the frequency (F) of genes in the population set (S) that interact with the same pathway. We then executed a test to identify how likely it would be to select at least f genes interacting with a pathway if s genes would be randomly drawn from the population, given that the frequency F and size S of the population. This can be represented mathematically as follows: