3.1. Methodological overview
The overview of our analytical method can be seen in Fig. 1. Firstly, to screen genes associated with ASD in each cell type, we constructed cell type-specific predictive models, which can predict the diagnosis of a nucleus whose cell type is known, using the algorithm of partial least squares (PLS) (Materials and Methods). Specifically, for each cell type, we extracted the gene expression data of the nuclei from the cell type and split the data into training and test sets. We selected the optimal model based on the training set, and then obtained the predictive probability of each nucleus being a nucleus from ASD patients. Next, ROC analysis was performed to obtain the best threshold on training set and the threshold was used to determine the predictive performance on training and test sets. To prioritize genes, we calculated the importance of each gene in the cell type-specific predictive model. In addition, in order to use less genes to achieve similar performances, we performed recursive feature elimination with cross-validation (RFECV) to reduce the number of genes used to re-construct cell type-specific predictive models. The optimal genes obtained using RFECV were denoted as RFE genes, which were used for the downstream analyses to depict the cell type heterogeneity of ASD.
Secondly, to screen gene sets associated with ASD in each cell type, we constructed cell type-specific gene set-based predictive models using PLS. Specifically, for each cell type and each gene set, we extracted the expression level of the nuclei from the cell type in the genes included in the considered gene set, and constructed a predictive model. To prioritize gene sets, we ranked gene sets using their predictive performance on the test set, and kept the gene sets whose predictive accuracy (ACC), sensitivity (SN), and specificity (SP) are larger than 70% as cell type-specific gene sets associated with ASD. Besides, for the total genes included in these identified gene sets, we calculated their frequency and averaged importance, and used the genes with top averaged importance to re-construct cell type-specific predictive models.
Lastly, we further constructed a multi-label predictive model using PLS, which can predict the cell type and the diagnosis of a given nuclei at the same time. For the labels of cell types, the predictive cell type of each nucleus was set as the cell type whose predictive probability is the largest. For the diagnosis label, we extracted the predictive probability of training set and applied ROC analysis to obtain the optimal cutoff for determining the predictive diagnosis of each nucleus in training and test sets.
3.2. Cell type-specific genes associated with ASD
For each of the 17 cell types, we first constructed a cell type-specific predictive model using all genes (Table 1, Supplementary File 1). To score genes in each cell type, we calculated the importance of genes and ranked the genes (Supplementary File 2). Next, in order to use less genes to achieve similar performances, we used the genes with top 500, 1000, and 1500 importance respectively to re-construct cell type-specific predictive models. We found out that using top 1000 genes made the model performance better than the one using top 500, while approach the one using top 1500 genes (Supplementary File 1). Therefore, for each cell type, we applied RFECV to reduce the number of genes to up to 1000 genes and also to obtain the optimal number of genes used to construct a cell type-specific predictive model. The optimal genes obtained using RFECV were denoted as RFE genes. It is noted that the performances on test sets of the cell type-specific predictive models based on RFE genes approach the ones based on all genes (Fig. 2. A, Supplementary File 1), hence, we used the RFE genes for the subsequently analyses in this section.
Table 1
The classification performances of cell type-specific predictive models built using all genes. The number of nuclei from ASD and controls are listed. ROC analysis was applied to obtain the AUC and the optimal cutoff point on the training set, then the optimal cutoff was used to determine the predictive accuracy (ACC), sensitivity (SN), and specificity (SP) on the training and test sets.
Cell type (ASD/Control) | Training set | Test set |
ACC | SN | SP | AUC | ACC | SN | SP | AUC |
AST-FB (2033/1622) | 0.91 | 0.92 | 0.9 | 0.97 | 0.72 | 0.78 | 0.63 | 0.79 |
AST-PP (4749/2336) | 0.93 | 0.93 | 0.93 | 0.98 | 0.84 | 0.87 | 0.79 | 0.90 |
Endothelial (850/1141) | 0.92 | 0.91 | 0.92 | 0.97 | 0.76 | 0.70 | 0.80 | 0.83 |
IN-PV (1811/1908) | 0.95 | 0.94 | 0.96 | 0.99 | 0.80 | 0.77 | 0.82 | 0.88 |
IN-SST (1945/2245) | 0.94 | 0.92 | 0.95 | 0.98 | 0.76 | 0.70 | 0.81 | 0.83 |
IN-SV2C (990/846) | 0.98 | 0.98 | 0.97 | 1.00 | 0.80 | 0.83 | 0.76 | 0.88 |
IN-VIP (3098/2523) | 0.89 | 0.88 | 0.91 | 0.96 | 0.79 | 0.79 | 0.78 | 0.86 |
L2/3 (6962/5833) | 0.95 | 0.95 | 0.95 | 0.99 | 0.89 | 0.90 | 0.88 | 0.96 |
L4 (3415/3103) | 0.93 | 0.91 | 0.94 | 0.98 | 0.83 | 0.80 | 0.87 | 0.91 |
L5/6 (1710/1692) | 0.93 | 0.93 | 0.93 | 0.98 | 0.78 | 0.77 | 0.80 | 0.86 |
L5/6-CC (2279/2106) | 0.97 | 0.98 | 0.97 | 1.00 | 0.85 | 0.88 | 0.82 | 0.93 |
Microglia (1174/1321) | 0.91 | 0.90 | 0.93 | 0.97 | 0.76 | 0.73 | 0.78 | 0.84 |
Neu-mat (1853/1679) | 0.85 | 0.82 | 0.88 | 0.93 | 0.75 | 0.70 | 0.80 | 0.83 |
Neu-NRGN-I (321/268) | 0.97 | 0.99 | 0.94 | 0.99 | 0.69 | 0.75 | 0.63 | 0.74 |
Neu-NRGN-II (828/631) | 0.82 | 0.86 | 0.78 | 0.89 | 0.63 | 0.70 | 0.53 | 0.68 |
Oligodendrocytes (4587/7619) | 0.83 | 0.86 | 0.81 | 0.91 | 0.77 | 0.79 | 0.75 | 0.85 |
OPC (5085/4562) | 0.83 | 0.82 | 0.84 | 0.91 | 0.75 | 0.74 | 0.76 | 0.82 |
By examining the number of RFE genes in every cell type (Table 2), we found that in several cell types, such as AST-PP, IN-PV, L2/3, L4, and L5/6-CC, there are more RFE genes and the corresponding cell type-specific predictive models have better performances than other cell types (Fig. 2. A). This implies that these cell types may be more vulnerable in ASD and more genes may be dysregulated in these cell types. Then, for each cell type, we also applied edgeR [33] to identify differential expressed genes in ASD compared to controls. It can be seen that in the mentioned cell types above, there are indeed more differential expressed genes, which also indicates that these cell types may be mainly affected by ASD. By performing hypergeometric tests, we found that the RFE genes are significantly overlapped with the differential expressed genes identified by edgeR (Table 2). Then we checked if building cell type-specific predictive models using edgeR genes would be better than the ones using RFE genes, while the model performances using RFE genes is better than the ones using edgeR genes (Supplementary File 1). This shows that genes that are not identified by edgeR may have predictive power for ASD. Next, we found that there are more SFARI ASD genes overlapped with RFE genes in neuron-related cell types. We also performed overrepresentation tests between RFE genes and SFARI ASD genes, and found that RFE genes are significantly overlapped with ASD genes (Table 2).
Table 2
The overrepresentation tests between RFE genes and differential expressed genes identified by edgeR and SFARI ASD genes. The number of overlapping genes between RFE genes and edgeR genes or ASD genes, the number of total edgeR genes or ASD genes, and the FDR-adjusted hypergeometric test P-value are shown. The genes with top five importance are listed, of which edgeR genes are bold, and SFARI ASD genes are underlined.
Cell type | Number of RFE genes | Overlapping genes/edgeR genes (FDR-adjusted P-value) | Overlapping genes/ASD genes (FDR-adjusted P-value) | Top five important genes |
AST-FB | 200 | 120/257 (1.5e-158) | 22/299 (5.8e-09) | DPP10, TMSB4X, SPARCL1, ZFP36L1, PCDH9 |
AST-PP | 1000 | 667/1464 (0.0e + 00) | 98/299 (2.2e-34) | PTGDS, HSPA1A, TRPM3, RP11-179A16.1, CIRBP |
Endothelial | 500 | 115/146 (3.2e-134) | 40/299 (6.3e-11) | HERC2P3, AKAP12, TMSB4X, RP11-649A16.1, RPS28 |
IN-PV | 1000 | 384/695 (4.2e-251) | 103/299 (2.7e-38) | AC105402.4, MTATP6P1, CNTNAP3, CIRBP, ARL17B |
IN-SST | 1000 | 549/1346 (5.9e-291) | 104/299 (5.3e-39) | SST, AC105402.4, VGF, HSPA1A, BCYRN1 |
IN-SV2C | 900 | 345/616 (7.4e-244) | 100/299 (9.7e-40) | CCK, BCYRN1, AC105402.4, MEG3, HSPB1 |
IN-VIP | 1000 | 676/1820 (0.0e + 00) | 104/299 (5.3e-39) | HSPA1A, CCK, RPS15, MEG3, RGS12 |
L2/3 | 1000 | 863/4690 (7.8e-230) | 107/299 (2.9e-41) | BCYRN1, CCK, CNTNAP2, MEG3, CAMK2N1 |
L4 | 1000 | 715/2477 (1.3e-294) | 113/299 (3.7e-46) | BCYRN1, CCK, NCAM2, SLC17A7, MTATP6P1 |
L5/6 | 900 | 467/1069 (4.0e-281) | 98/299 (2.7e-38) | BCYRN1, AC105402.4, MTATP6P1, ATP1B1, SLC17A7 |
L5/6-CC | 1000 | 701/3183 (7.1e-202) | 114/299 (7.5e-47) | BCYRN1, CCK, AC105402.4, RP11-750B16.1, MT-RNR2 |
Microglia | 200 | 74/106 (4.9e-112) | 20/299 (1.4e-07) | FKBP5, TMSB4X, NEAT1, SLC1A3, CHN2 |
Neu-mat | 900 | 351/476 (1.7e-312) | 116/299 (2.9e-53) | AC105402.4, XIST, CAMK2N1, MEG3, ROBO2 |
Neu-NRGN-I | 100 | 2/2 (6.8e-05) | 12/299 (7.0e-06) | RP11-750B16.1, PTMA, NRGN, GNAO1, TSPAN7 |
Neu-NRGN-II | 100 | 7/8 (1.9e-14) | 6/299 (3.8e-02) | PRNP, NRGN, STMN1, RP11-750B16.1 PLP1 |
Oligodendrocytes | 600 | 410/1420 (9.4e-253) | 57/299 (9.5e-19) | PTGDS, NRXN1, CNDP1, ABCA2, CREB5 |
OPC | 900 | 528/1413 (1.9e-285) | 102/299 (2.9e-41) | GPC5, TMSB4X, HSPH1, CNTNAP2, OLIG1 |
For each cell type-specific predictive model built based on RFE genes, we calculated the importance of each RFE gene (Supplementary File 3). Table 2 lists the top RFE genes in each cell type. Figure 2. B also demonstrates the expression of top three RFE genes in ASD and control groups for the representative cell types, including AST-PP, endothelial, IN-PV, L2/3, microglia, oligodendrocytes, and OPC. The top genes among different cell types are distinct, implying the cell type heterogeneity of ASD. However, some top genes appearing in several cell types are of note. For instance, gene BCYRN1 (brain cytoplasmic RNA 1, a long non-coding RNA) has the largest importance in all excitatory neurons, including L2/3, L4, L5/6, and L5/6-CC. Gene BCYRN1 is involved in the regulation of synaptogenesis, and there have been several literatures linking BCYRN1 and Alzheimer’s disease, a neurological disease [34, 35], which implies the possible association between BCYRN1 and ASD. Besides, BCYRN1 has been prioritized in a blood-based gene expression study of ASD [36].
To further characterize the cell type heterogeneity of ASD, we compared the RFE genes across different cell types. We preformed gene ontology analyses using R package of clusterProfiler [37], with background genes set as the genes in the analyzed gene expression matrix. The functions of cell type-specific RFE genes are not consistent among different cell types (Supplementary File 4). For instance, in IN-PV, the enriched GO terms include neuron projection, axon, somatodendritic compartment, and cell part morphogenesis, while in L2/3 the top GO terms are associated with ribosome, cotranslational protein targeting to membrane, and protein localization to endoplasmic reticulum (Fig. 2. B).
3.3. Cell type-specific gene sets associated with ASD
In addition to screening individual genes associated with ASD, we also constructed cell type-specific gene set-based predictive models to screen ASD-related gene sets. For each cell type and each gene set, we extracted the expression level of the nuclei from the cell type in the genes included in the considered gene set, and constructed a predictive model (Materials and Methods). We retained the gene sets whose ACC, SN, and SP on test set are larger than 70%, and there are 5, 1, 88, 15, 137 gene sets identified in cell types of AST-PP, IN-PV, L2/3, L4, and L5/6-CC respectively (Supplementary File 5). Figure 3. A shows the top five gene sets in each of these five cell types and the performances of corresponding cell type-specific gene set-based predictive models. For AST-PP, the top ASD-associated gene sets include REACTOME_DISEASE, GO_REGULATION_ OF_CELL_POPULATION_PROLIFERATION, GO_POSITIVE_REGULATION_OF_CATALYTIC_ ACTIVITY, GO_SIGNALING_RECEPTOR_BINDING and GO_ENZYME_LINKED_RECEPTOR_ PROTEIN_SIGNALING_PATHWAY. For other neuron cell types, the ASD-associated gene sets are mostly related to cell junction, synapse, neuron projection, neurogenesis, neuron differentiation, and cell projection organization. By checking the top important genes in each cell type-specific gene set, we found that several genes appear in the majority of the gene sets, for example, gene HSPA1A [heat shock protein family A (HSP70) member 1A] shows up in all AST-PP specific ASD-associated gene sets (Fig. 3. B). Therefore, for each cell type, we analyzed the frequency of each gene included in the identified gene sets, and calculated the averaged importance of genes (Supplementary File 5). Figure 3. C shows the genes with top five averaged importance in each cell type. Gene HSPA1A is noted in AST-PP. Actually, heat shock proteins play a central role in the development of neurological disorders, of which HSP70 family has been shown its functions [38], and HSPA1A, a member of HSP70 family, has already been associated with ASD [39]. As to gene CCK (cholecystokinin), it is prioritized in excitatory neurons, which a kind of gut peptide hormones. Gut peptide hormones have been found across different brain regions and many of them are involved with ASD-related deficits [40].
Next, based on the genes with averaged importance > 10% in corresponding cell types, we re-constructed a cell type-specific predictive model for each of these five cell types. It is noted that their predictive performances are even better that the ones of the cell type-specific gene set-based predictive models (Fig. 3. D). We checked the functions of these genes (Supplementary File 6), and found their functions are distinct, especially among AST-PP, IN-PV and excitatory neurons (Fig. 3. E). In AST-PP, the top genes are associated with the functions of enzyme linked receptor protein signaling pathway, transmembrane receptor protein tyrosine kinase signaling pathway, positive regulation of phosphorus and phosphate metabolic process, and cellular component morphogenesis. In IN-PV, the top genes are related to synaptic and postsynaptic membrane, cation channel complex, and neuron projection. As to the cell types of excitatory neurons, the top genes are associated with ribosome, SRP-dependent cotranslational protein targeting to membrane, nuclear-transcribed mRNA catabolic process, nonsense-mediated decay, and protein targeting to ER.
3.4. A multi-label classification model predicting cell type and diagnosis
To predict the cell type and diagnosis of a given nucleus at the same time, we applied PLS to construct a multi-label predictive model (Materials and Methods). We split the whole gene expression data to training set and test set. For the diagnosis label, we extracted the predictive probability of training set and applied ROC analysis to obtain the optimal cutoff for determining the predictive diagnosis of each nucleus in training and test sets. For the cell type labels, the predictive cell type of each nucleus was set as the cell type whose predictive probability is the largest. The Hamming loss of the multi-label predictive model is 0.02, and the accuracy achieves 72.8% with 92.7% accuracy for cell type labels and 78.5% accuracy for diagnosis label. Then we examined the predictive performance of the model in each cell type. For each cell type, Fig. 4 illustrates the proportion of the number of nuclei predicted as each cell type to the total number of nuclei, the proportion of correct and incorrect predictions for the label of diagnosis, the proportion of correct predictions for all labels in the test set. It can be seen that for most cell types, the predictive cell types are correct, except for AST-FB and Neu-mat. Because AST-FB and AST-PP are cell clusters of astrocytes and they may have similar gene expression patterns, a part of nuclei from AST-FB is predicted as AST-PP. As to Neu-mat, more than 40% nuclei were predicted as L2/3, which may indicate the gene expression patterns between Neu-mat and L2/3 are similar. For most cell types, the predictive accuracy of diagnosis label is larger than 70%, and the top highest accuracy values appear in L2/3, L5/6-CC, IN-SV2C, L4 and AST-PP, showing these cell types may be more vulnerable in ASD.