Overview of the BCA
We collected single-cell RNA sequencing (scRNA-seq) and single-nucleus RNA sequencing (snRNA-seq) data through a literature search as well as from the single-cell database22, which covers over 1,800 published datasets. In total, 70 human studies (65 published and 5 unpublished, Extended Data Fig. 1) comprising 6,577 samples and 103 mouse studies comprising 25,710 samples were selected as the data source. Most were deposited in the Gene Expression Omnibus (GEO)23, the UCSC browser24 and ArrayExpress25, while others came from consortia (e.g., Allen Brain Map (portal.brain-map.org)26 and Synapse (synapse.org)27). The sample details (metadata) were manually curated and raw count expression data were collected in a consistent manner (see Methods, Supplementary Fig. 1). The resulting resource includes 11.3 million (M) human cells from 14 main regions and 30 subregions of the brain (Fig. 1), while the mouse data include 15 million cells. To achieve a consensus cell type annotation, we used two well-defined datasets as core references28,29, which can be used to infer cell type labels in new datasets using reference-based machine learning algorithms. The adult core28 contains 3.3M nuclei from tissues of 4 healthy post-mortem adults aged 29–60 years, sampled across the whole brain. Its super level has 31 cell type annotations covering excitatory neurons, inhibitory neurons and non-neuronal cells from distinct brain regions that overlap with the brain regions in the rest of the adult brain datasets. The fetal core28,29 contains 1.6M cells sampled from first-trimester developing brain tissues and has 12 major cell types in its annotations, covering immature and mature neuronal and non-neuronal cells as well as stem cells. Considering the hierarchical cell type levels in the brain, we developed a novel machine learning cell label transfer framework, named scAnnot (Fig. 1), which enables the identification of potential neural progenitor cells as well as rare cell populations. Further downstream analysis confirms the biological relevance of these cell populations. Altogether, our resource includes 26.3M cells from human and mouse brains at an unprecedented atlas scale.
The human brain datasets were sorted into four types based on the sample source: adult (8,062,832 cells), fetal (2,203,728 cells), organoid (861,169 cells) and brain tumour (234,295 cells) (Fig. 2a). The majority (~95%) of the nuclei or cells were sequenced with 10x Chromium (Supplementary Fig. 2). The samples were obtained from donors aged from 6 gestational weeks (GW) to over 80 years (Fig. 2b). In the fetal data, ~65% of the samples came from embryonic brain tissue in the first trimester (0–GW12), while samples from donors aged 40 to 80 years constitute ~78% of the postnatal samples. About 9% of the samples were curated without a known age (Fig. 2b). In the adult data, only ~25% of the samples were from female donors, while ~71% were from male donors and <5% were of unknown sex. In contrast, up to 92% of the fetal samples had undetermined sex, and the female-to-male ratio among the rest is 1.3:1 (Fig. 2c). For disease status, ~75% of the samples were healthy and 3% were unspecified, while the disease samples were dominated by AD, followed by epilepsy, gliomas (glioblastoma, oligodendroglioma, astrocytoma, mixed glioma, etc.), amyotrophic lateral sclerosis (ALS), MDD, autism spectrum disorder (ASD), dementia, Parkinson’s disease (PD) and MS. Other neurological diseases accounted for less than 0.02% of the samples (Fig. 2d). The BCA includes single-cell data from major cerebral cortex regions (frontal lobe, parietal lobe, occipital lobe and temporal lobe), the cerebellum, the brain stem (midbrain, pons and medulla oblongata) as well as the limbic system (hippocampus, thalamus, hypothalamus and amygdala) (Fig. 2e,f). Most cells or nuclei were collected from the hippocampus, followed by the prefrontal cortex, occipital lobe and basal ganglia (Fig. 2g).
As an integrative resource, we provide a consensus cell type annotation of all the datasets using 7 well-established reference-based machine learning methods (see Methods) as well as an in-house hierarchical annotation workflow (see section below). The resulting 31 primary putative cell types were further verified with the top 3 differentially expressed genes (DEGs) (Methods and Fig. 2h,i). As a first cross-brain-region comparison, we examined differences in cell type proportions (Fig. 2j) and found that the proportions are similar across the frontal lobe regions, such as between the prefrontal cortex and the motor cortex. A decrease in Upper-layer and Deep-layer intratelencephalic neurons is observed in the hippocampus, where the hippocampal CA neurons are most enriched, as expected. Midbrain-derived inhibitory neurons are enriched in the thalamus and the midbrain.
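The cross-region proportion comparison can be illustrated with a minimal sketch. It assumes each cell is available as a (region, cell type) pair; the function name `cell_type_proportions` and the toy records are hypothetical and are not taken from the atlas or its code.

```python
from collections import Counter, defaultdict


def cell_type_proportions(cells):
    """Return {region: {cell_type: fraction of that region's cells}}."""
    counts = defaultdict(Counter)
    for region, cell_type in cells:
        counts[region][cell_type] += 1
    proportions = {}
    for region, per_type in counts.items():
        total = sum(per_type.values())
        proportions[region] = {ct: n / total for ct, n in per_type.items()}
    return proportions


# Toy (region, cell type) pairs for illustration only; not atlas data.
toy_cells = (
    [("hippocampus", "Hippocampal CA")] * 3
    + [("hippocampus", "Astrocyte")]
    + [("prefrontal cortex", "Upper-layer IT")] * 2
    + [("prefrontal cortex", "Astrocyte")] * 2
)
proportions = cell_type_proportions(toy_cells)
print(proportions["hippocampus"]["Hippocampal CA"])  # 0.75
```

Normalising within each region, rather than over the whole atlas, keeps regions with very different cell counts comparable.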
Atlas-level hierarchical cell type annotation with scAnnot
Of the 70 datasets, 45 have their cell type annotations publicly available (Supplementary Fig. 2), but the lack of consensus cell type annotations hinders the cross-brain-region comparison of the same cell type. Annotating brain cell types at scale is challenging due to the complexity of the data and the large number of cell types involved. The well-established reference-based machine learning methods for cell type annotation consider only a single level of cell type, while requiring substantial computational resources and time for data processing. However, the cell types in the brain are organised hierarchically. For example, clear marker genes such as SLC17A7 and SLC6A1 (top-level markers) can be used to characterize excitatory neurons, inhibitory neurons and non-neuronal cells21. Within each main cell type, more detailed markers can be found to discriminate cell types or subpopulations, while the top-level markers are no longer discriminative. Here, we present scAnnot, a hierarchical cell annotation workflow based on the Variational Autoencoder (VAE) model from scANVI30 (Methods and Fig. 3a). It predicts a harmonised latent space and, from it, the cell type labels as well as a dimension reduction space (Uniform Manifold Approximation and Projection, UMAP). Using the adult core as the reference, the top level includes 31 primary cell types; scAnnot selected 200 DEGs for each cell type as feature genes and trained a machine learning model. According to the confusion matrix between the reported cell type labels and the scAnnot-predicted labels (Fig. 3b,c), most cell types can be predicted with high accuracy (above 90%). The oligodendrocytes show slightly lower accuracy due to their similarity to committed oligodendrocyte precursors. For the second-level cell type annotation, the average accuracy across all cell types was 90% on the training set and 83% on the validation set.
We show that our annotation is largely unaffected by batch effects, thus facilitating the large-scale integration of atlas data. The classification accuracy for the subpopulations of each cell type ranged from 50% to 100%, with the worst performance for the Splatter cluster (Fig. 3c), which is not expected to be discriminative. In addition, we found that the major cell type labels were well assigned (astrocyte, OPC and microglia, Fig. 3d), while cells annotated as IT in their original publications were subdivided into Upper-layer intratelencephalic, Deep-layer intratelencephalic and some Miscellaneous (Fig. 3e,f). Taking advantage of the atlas-level data, scAnnot can provide finer-grained cell type labels, which can be confirmed by the feature gene expression (Fig. 3g). The result of the second-level annotation shows that the cell types of the brain can be further divided in the hierarchical classification (Fig. 3h). For the extended data in the BCA, scAnnot annotated 6,577 samples comprising 11,362,191 cells from the human brain and constructed a UMAP visualisation based on the predicted latent space (Fig. 1). The resulting hierarchical cell type annotations provide a basis for the cross-brain-region analysis in the following sections.
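The two-level idea described in this section can be sketched in a few lines. This is not scAnnot itself: scAnnot relies on a VAE-based model, whereas the sketch below swaps in simple marker-sum rules purely to show how a top-level call routes a cell to a level-specific second classifier. The function names, the marker table and the toy cells are all assumptions made for illustration.

```python
# Second-level marker sets per top-level class (illustrative subset only).
SECOND_LEVEL_MARKERS = {
    "non-neuronal": {
        "astrocyte": ["GFAP", "AQP4"],
        "oligodendrocyte": ["MOG", "MAG"],
        "OPC": ["PDGFRA", "OLIG1"],
    },
}


def top_level(cell):
    """Toy top-level rule using SLC17A7 and SLC6A1 expression only."""
    exc = cell.get("SLC17A7", 0.0)
    inh = cell.get("SLC6A1", 0.0)
    if exc == 0 and inh == 0:
        return "non-neuronal"
    return "excitatory neuron" if exc >= inh else "inhibitory neuron"


def second_level(cell, parent):
    """Score only the subtypes that belong to the parent class."""
    markers = SECOND_LEVEL_MARKERS.get(parent)
    if not markers:
        return parent  # no finer level defined in this sketch
    scores = {
        label: sum(cell.get(gene, 0.0) for gene in genes)
        for label, genes in markers.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else parent


def annotate(cell):
    """Return (top-level label, second-level label) for one cell."""
    parent = top_level(cell)
    return parent, second_level(cell, parent)


print(annotate({"GFAP": 4.0, "AQP4": 2.0}))
print(annotate({"SLC17A7": 3.0}))
```

The point of the design is that the second-level scorer sees only the subtypes of the chosen parent, so top-level markers that stop being discriminative inside a class no longer influence the finer call.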
Potential neural progenitor cells in adult hippocampus
The existence of neurogenesis in adults remains controversial, while resolving this question could significantly benefit stem cell therapy for brain damage and related diseases13,20. Yet, single-cell data generated by individual research groups involve only a few samples from a specific experimental protocol and technology, resulting in disagreement over the definitions of neurogenesis-related cell types15,20,21,31. Atlas-level integrative analysis has particular potential for enriching rare cell populations, which can hardly be captured by single studies, shedding new light on the controversy.
Taking advantage of the large-scale data in the BCA, we integrated hippocampus single-cell data from 6 independent studies, comprising a total of 450,000 cells from healthy human adults15,16,20,32, children15, infants15 and fetuses14, and from mice across all developmental stages33 (Fig. 4a,b). Because the human adult samples are dominated by mature neurons, integration with data covering the whole neurogenesis landscape can alleviate the bias towards mature neurons and help identify the neurogenesis-related cell populations. After data integration using Harmony34 (see Methods), the UMAP visualisation shows well-mixed data from the 6 studies, while the data distribution shows a complete landscape of neurogenesis as well as the enrichment of mature neurons in adult samples (Supplementary Fig. 3a-f). The cell clusters were annotated according to well-established cell type marker genes (Methods and Fig. 4c), including (1) MKI67 and TOP2A for neural progenitor cells (NPCs), (2) DLX2 and SOX11 for neuroblast cells, (3) SOX11 and PROX1 for immature glutamatergic cells, (4) PROX1 and PLEKHA2 for glutamatergic neurons, (5) SLC17A7 and COL5A2 for CA neurons, (6) GAD1 and GAD2 for GABAergic neurons, (7) GFAP and AQP4 for astrocytes, (8) FLT1 and ENG for endothelial cells, (9) PDGFRA and OLIG1 for oligodendrocyte precursor cells (OPCs), (10) OLIG2 and SOX10 for newly formed oligodendrocytes (NFOLs), and (11) MOG and MAG for oligodendrocytes13,21 (Fig. 4d,e). According to marker gene expression (MKI67 and TOP2A), only a small number of cells (33 cells) in the adult hippocampus, compared with many fetal (26,604) and mouse (17,751) cells, can be defined as NPCs (Supplementary Table 2). Nevertheless, the differentially expressed genes (787 DEGs, Methods and Supplementary Table 3) of these marker-gene-defined NPCs highlight well-established NPC markers13, including TOP2A, HMGB2 and UBE2C (Fig. 4f).
Besides marker expression, trajectory analysis can help identify progenitors by ordering cells according to gene expression35 or RNA splicing status36. Since NPCs can bifurcate into astrocytes or neurons, pseudotime analysis (Supplementary Fig. 4a,b) was performed from NPCs to mature astrocytes and from NPCs to mature neurons. Clear cell clusters can be found between the mature astrocytes and mature neuronal cells (Supplementary Fig. 4c), while a cell population expressing both MKI67 and TOP2A (MKI67+ TOP2A+) can be identified as putative NPCs (Supplementary Fig. 4d). Cells from the adult dataset33 were extracted for pseudotime analysis and RNA velocity analysis, indicating a trajectory from NPCs to immature neurons and then to mature neurons (Supplementary Fig. 4e,f). Further differential expression analysis confirms the expression of known NPC markers (including TOP2A and MKI67, Supplementary Fig. 4g,h) in this cell population, validating the above-mentioned marker gene expression analysis (Fig. 4f). The DEGs for putative NPCs from the adult human study20 and from the three fetal human studies3,14,29 showed good consistency (Supplementary Fig. 4i), including cell division related genes (TPX2 and BRCA2), cell proliferation related genes (PRM2, MKI67) and mitosis related genes (SMC4, NCAPG, CENPK). Pathway analysis (see Methods) of these DEGs demonstrates upregulation of cell cycle and proliferation activities, including nuclear division, organelle fission and chromosome segregation, which is consistent with the known proliferating nature of NPCs37. This validates the enhanced expression of TOP2A, CDC25C and MKI67 in cycling cells such as NPCs in fetal mouse and human compared with quiescent cells20,21,33 (Supplementary Fig. 4h). The pathways enriched for the putative NPCs in the adult hippocampus agree with those in the developing hippocampus (Supplementary Fig. 4j).
In addition, we investigated these putative NPCs in adults using the eight NPC-related genes, which are known to be conserved across species13. We found that each NPC-related gene is expressed in approximately half of the NPCs but not in the other cell types (Fig. 4g and Supplementary Fig. 5a,b). Although adult and infant hippocampal putative NPCs also express these marker genes, their expression levels are lower than in fetuses (Fig. 4h). An NPC gene module score was designed based on the co-expression of these marker genes (Methods). The cells with high NPC gene module scores align well with our annotated NPCs (Fig. 4i and Supplementary Fig. 5c). In terms of co-expression, the number of cells co-expressing two or more of these genes decreased more sharply in adults than in fetuses (Fig. 4j and Supplementary Table 4), and cells with an NPC gene module score greater than 2 are considered putative NPCs (Supplementary Table 2).
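One simple way to read a co-expression-based module score is as a count of panel genes detected in a cell. The sketch below follows that reading as an assumption: the exact score definition is given in Methods and may differ, and the gene panel here is a placeholder built from markers named in this section, not the eight NPC-related genes used in the analysis. All function names are hypothetical.

```python
# Placeholder panel (assumption): NOT the eight genes used in the study.
NPC_PANEL = ["MKI67", "TOP2A", "HMGB2", "UBE2C"]


def npc_module_score(cell, panel=NPC_PANEL, min_expr=0.0):
    """Count panel genes whose expression in this cell exceeds min_expr."""
    return sum(1 for gene in panel if cell.get(gene, 0.0) > min_expr)


def putative_npcs(cells, threshold=2):
    """Return ids of cells whose module score is greater than threshold."""
    return [cid for cid, expr in cells.items() if npc_module_score(expr) > threshold]


# Toy expression profiles for illustration only.
toy_cells = {
    "cell_a": {"MKI67": 3.0, "TOP2A": 2.0, "HMGB2": 1.0},
    "cell_b": {"MKI67": 1.0},
    "cell_c": {"GFAP": 5.0},
}
print(putative_npcs(toy_cells))  # ['cell_a']
```

Counting co-detected genes, rather than relying on any single marker, makes the call less sensitive to dropout of one gene.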
Finally, NPCs were analysed through cross-species comparison. Because droplet-based single-cell technologies have a high dropout rate, marker gene expression may not be clear in some cell populations, but their cell type labels can still be inferred by reference-based machine learning methods that consider the whole expression profile. In particular, NPCs are difficult to define by marker gene expression in adult data, but the complete developmental landscape in the mouse data from the BCA provides a good reference for inferring NPCs. Therefore, five reference-based machine learning methods, namely ACTINN38, SCCAF39, SingleR40, scArches41 and singleCellNet42, were used to identify putative NPCs in adults based on the well-annotated mouse data33. The number of detected NPCs varies by method (Fig. 4k and Supplementary Table 2). Most methods predicted a small proportion of cells as NPCs, but the consistency between methods is low, and further validation is needed.
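Agreement between label-transfer methods can be summarised per cell with a simple vote. The sketch below is one possible way to do this and is not the procedure used in the study; the function names, the vote threshold and the toy predictions are assumptions, and no method-specific API is called.

```python
from collections import Counter


def consensus_label(predictions, min_votes=3):
    """Return the most frequent label if it reaches min_votes, else 'unassigned'."""
    label, votes = Counter(predictions.values()).most_common(1)[0]
    return label if votes >= min_votes else "unassigned"


def agreement(predictions, label="NPC"):
    """Fraction of methods that assigned the given label to this cell."""
    calls = list(predictions.values())
    return sum(1 for call in calls if call == label) / len(calls)


# Toy per-cell predictions keyed by method name; values are illustrative only.
cell_1 = {"ACTINN": "NPC", "SCCAF": "NPC", "SingleR": "NPC",
          "scArches": "Astrocyte", "singleCellNet": "NPC"}
cell_2 = {"ACTINN": "NPC", "SCCAF": "Astrocyte", "SingleR": "Astrocyte",
          "scArches": "NPC", "singleCellNet": "Neuroblast"}

print(consensus_label(cell_1), agreement(cell_1))  # NPC 0.8
print(consensus_label(cell_2), agreement(cell_2))  # unassigned 0.4
```

Reporting the agreement fraction alongside the consensus label makes low-consistency calls, such as the second toy cell, easy to flag for follow-up.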
In brief, our results provide potential evidence for the identification of NPCs in the adult hippocampus, based on the expression of known NPC markers, trajectory analysis from putative NPCs to mature neurons, cross-species comparison based on mouse reference data, and the enrichment of cell proliferation and cell division pathways. More solid evidence is required to validate these putative NPCs, and the BCA provides a good starting point as a large-scale data resource for the identification of neural progenitor cells.
The discovery of PCDH9high microglia cells and a comparison across brain regions
The BCA provides an unprecedented single-cell resource of the brain, allowing for the identification of rare or unknown cell types based on the uncertainty of the label transfer results from different machine learning methods. As a showcase, a microglia population with a high level of PCDH9 expression was identified from the integrated data of 44 samples, encompassing a total of 511,872 cells. These samples were obtained from four studies of adult human prefrontal cortex or hippocampal regions19,20,32,43, providing 12 well-annotated primary cell types (Fig. 5a). Focusing on the microglia population, we characterized a novel population of microglia with high PCDH9 expression (Fig. 5b). Interestingly, this population was identified exclusively in the study by Ayhan et al.32 (where it was named Micro2), but not in other studies. In addition to microglia pan-markers such as APBB1IP32, TBXAS114, LPCAT244, P2RY1245,46, and SLCO2B147, the microglia (PCDH9high) population also exhibits high expression of immune-related genes such as LRMDA48, PEAK1, SPTLC2, and CTTNBP249,50 (Fig. 5c), indicating a clear functional distinction from other microglia cells in modulating immune responses. Such an unknown cell population can hardly be noticed in a single study due to the relatively small data size, while the integrated data in the BCA help with its discovery.
The same cell population may show different gene regulatory patterns in different microenvironments, and understanding such niche differences may help in the development of in vitro cell culture protocols or technologies51,52. Thus, it may benefit the development of cell therapy for the brain. Datasets generated by a single research group may be limited in size or may suffer from a confounded experimental design when studying microenvironment-driven differences, in particular cross-brain-region effects. Large-scale atlas data provide a unique data source for such cross-brain-region analysis. Although the sequencing lanes/samples are confounded with the brain regions, a gene that covaries exclusively with the brain regions, but not with the sequencing batches or studies, is more likely to be region-specific than batch-specific. Under this assumption, we performed differential expression analysis for the above-mentioned microglia (PCDH9high) population across two brain regions, the prefrontal cortex and the hippocampus (Fig. 5d). The results show that LRMDA and IL1RAPL1 are highly expressed in most samples of the prefrontal cortex, while PARK2 and GRIK2 are highly expressed in most samples of the hippocampus (Fig. 5e and Supplementary Fig. 6). In total, 1,469 genes are differentially expressed between the two regions (Fig. 5e and Supplementary Table 5), indicating the niche difference. The 315 genes highly expressed in the hippocampus were enriched in biological processes related to synapse organization and cell junction assembly. These processes are associated with microglia in the hippocampus, suggesting that microglia (PCDH9high) may play a role in clearing over-activated synapses and promoting the formation of new synaptic connections. The 1,154 genes highly expressed in the prefrontal cortex were enriched in biological processes related to the transport of secreted proteins (Fig. 5f), and microglia (PCDH9high) in the prefrontal cortex appear to participate in immune responses.
As cell-cell communication is an important part of the microenvironment underlying the niche difference, cell-cell communication was compared across the brain regions (Fig. 6a-d and Supplementary Table 6). Overall, 60 pathways (980 genes) were detected to be involved in building the cell-cell communication network of the neural cell niches: 45 pathways were conserved, 12 pathways were prefrontal cortex specific, and 3 pathways were hippocampus specific (Fig. 6e). As shown in Fig. 6f and Supplementary Fig. 7a-b, the distribution of cells in 2D space showed obvious changes in the interaction strength of outgoing and incoming signalling between the microglia (PCDH9high) cells in the prefrontal cortex and those in the hippocampus. Furthermore, we identified the specific signalling changes of the microglia (PCDH9high) cell population between the prefrontal cortex and the hippocampus: the NRG, CADM, NEGR and LAMININ pathways are specific to the hippocampus region (Fig. 6g).
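The split into conserved and region-specific pathways amounts to set operations on the pathways detected in each region. The sketch below illustrates that bookkeeping only; it does not reproduce the communication inference itself. The function name is hypothetical, and the pathway names other than NRG, CADM, NEGR and LAMININ are placeholders.

```python
def compare_pathways(prefrontal, hippocampus):
    """Split detected pathways into conserved and region-specific sets."""
    pfc, hip = set(prefrontal), set(hippocampus)
    return {
        "conserved": pfc & hip,
        "prefrontal_specific": pfc - hip,
        "hippocampus_specific": hip - pfc,
    }


# Toy pathway lists for illustration only.
toy_prefrontal = ["PATHWAY_A", "PATHWAY_B", "PATHWAY_C"]
toy_hippocampus = ["PATHWAY_A", "PATHWAY_B", "NRG", "CADM", "NEGR", "LAMININ"]

result = compare_pathways(toy_prefrontal, toy_hippocampus)
print(sorted(result["hippocampus_specific"]))  # ['CADM', 'LAMININ', 'NEGR', 'NRG']
```

The three sets are disjoint and together cover every detected pathway, which is why the reported counts (45 conserved, 12 and 3 region-specific) sum to the 60 pathways overall.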
In the hippocampus, 10 cell senders, which secrete ligands, interact with the microglia (PCDH9high) cell population via the NRG, CADM, NEGR and LAMININ pathways, mediated by multiple ligand-receptor pairs (Fig. 6h). In addition, 9 cell receivers interact with microglia (PCDH9high) when it acts as a signal sender (Fig. 6i and Supplementary Fig. 7c). LAMININ signalling, one of the extracellular matrix (ECM)-receptor interactions, contributed the greatest number of ligand-receptor pairs. Astrocytes and their ECM are considered a major portion of the Central Nervous System (CNS) parenchyma53. Importantly, microglia (PCDH9high) cells in the hippocampus interact with various cell types, including oligodendrocyte precursor cells, oligodendrocytes, newly formed oligodendrocytes, astrocytes, vascular leptomeningeal cells, fibroblasts, endothelial cells, GABAergic neurons and glutamatergic neurons, via the LAMININ pathway. The expression trends of ligand-receptor pairs for the NRG, CADM, NEGR and LAMININ pathways in each hippocampal cell type are proportional to the probability of communication (Supplementary Fig. 7d). As expected, the cell-cell communication network indicates that the prefrontal cortex and the hippocampus play different roles in brain information interaction, and they might cooperate to regulate the functional activities of brain connectivity.
We uncovered distinct microenvironmental characteristics in the hippocampus and prefrontal cortex, and discovered that the rare cell population microglia (PCDH9high) engages in interactions with 10 distinct clusters via the LAMININ pathway. These findings may offer insights into the intricate regulation of immune responses and neurological functions, shedding light on the pathogenesis of related diseases and potentially identifying therapeutic targets.
Web Portal of the BCA
The BCA provides a web portal with interactive features for data visualisation, gene/cell search, cross-brain-region comparison and data download (Supplementary Fig. 8a). It offers a CELLxGENE UMAP viewer for each of the adult, fetal, organoid and tumour sub-atlases, with selectable original cell type annotations and reference-based re-annotations, and various curated metadata such as sample region, donor age, sex and disease status (Supplementary Fig. 8b). The Markers page allows users to obtain a list of HVGs in each cell type from a selected brain region (By Region), or to retrieve a list of HVGs in each region for a chosen cell type (By CellType) (Supplementary Fig. 8c). The list is presented after a volcano plot that summarises the differences of the marker genes, and is available for download in CSV or PDF format together with a summary (Supplementary Fig. 8c, d). Further, the Anatomy page provides an interactive structural portrait illustrating the majority of the structures from which samples were taken in the adult brain datasets; each selected anatomical region is highlighted, with a brief text introduction below the portrait (Supplementary Fig. 8e). On the DEG page, users can find violin plots of the top 3 DEGs By Region or By CellType generated from the Markers page (Supplementary Fig. 8f). Moreover, the cell sorting module, adapted from Chen et al.54, allows users to view a slice of the samples from the BCA by filtering on cell type, region or metadata attributes and to view the expression profiles of the filtered samples (Supplementary Fig. 8g, h). Finally, a list of all the individual datasets used is available via the Datasets page, searchable by publication name, authors, sub-atlas name, brain region, disease of interest, sequencing platform, cell or nuclei, and accession code or project code (Supplementary Fig. 8i).