Re-analysis of a single cell RNA-Seq data set reveals four lineages originating from globose basal cells.
We re-analyzed a scRNA-Seq dataset obtained from Fletcher et al31. After quality filtering and pre-processing (Methods, SFigure 2,3), 687 cells were included into further analysis and grouped into 13 clusters using Seurat KNN clustering on the top 15 principal components (Methods). Dimension reduction and visualization was performed using principal components analysis (Figure 1A) and tSNE / UMAP. Using an extensive set of known marker genes, we assigned clusters to cell types of the main olfactory epithelium (MOE) (Figure 1B, Methods SFigure 4A). We detected all cell types described by Fletcher et al31, and additionally we could subdivide the globose basal stem cell cluster into quiescent cells (qGBC) and active cells (GBC). Active GBC were identified by their expression of Ascl1/Mash132–35 and by the presence of cell cycle genes such as Mki67 and Top2a. GBC are known as the adult OSN stem cells responsible for a sustained self-renewal of the OSNs throughout life35.
Further, trajectory inference by Slingshot36 (Methods) revealed a tree with four leaves (Figure 1A). Be aware that Slingshot does not provide information on the direction of development, but merely a tree topology. In the following, we restrict our analysis to the neuronal lineage and therefore choose qGBC as a root node for pseudotime analysis. qGBC has been reported as a general stem cell population in the MOE following injury35. A previous analysis by Fletcher et al31 focused on differentiation processes starting from HBC0. They placed their root node in HBC0, which is a leaf node of our tree, and therefore merely report three lineages. Apart from this, their reconstruction is essentially identical to ours. The four branches of our tree are (Figure 1A):
1) The basal stem cells lineage, which connects the quiescent horizontal basal stem cells (HBC0, represented by 91 cells in the data) via two transient populations of horizontal basal stem cells (HBC1 (29 cells) and HBC2 (118 cells)) with the qGBC (53 cells).
2) The supporting cells lineage ranges from mature sustentacular cells (mSus (70 cells)) to qGBCs, and includes immature sustentacular cells (iSus (49 cells)) and HBC2.
3) The microvillous cells lineage contains merely microvillous cells (MV (36 cells)). No transient cell types have been detected in this trajectory.
4) The neuronal lineage ends with mature olfactory sensory neurons (mOSN (32 cells)). It spans a range of several stages, namely GBC (34 cells), three intermediate neuronal precursors Early.INP (26 cells), Mid.INP (20 cells), Late.INP (34 cells) and immature olfactory sensory neurons (iOSN (95 cells)).
OR gene expression is limited to the last three stages of OSN differentiation: Sudden onset of multigenic OR expression in Late.INP is followed by transition to monogenic expression in immature OSN stage.
Expectedly, OR expression is essentially unique to the neuronal lineage (Figure 2). Next we used Slingshot to assign a pseudotime to each cell, thereby providing a linear order of all 294 cells in the neuronal lineage from qGBC to the terminal cell cluster (Methods). Our analysis could detect 157 cells of neuronal lineage that express at least one OR gene at relevant levels, i.e. ≥50 normalized counts. 132 of them belonged to the last three stages of OSN differentiation. Most of Late.INP cells (28 of 34), iOSN (76 of 95) and mOSN (28 of 32) express at least one OR. Many cells express more than one OR (multigenic expression), in particular in the Late.INP stage (26 of 28 cells expressing OR genes, 92.8%). The frequency of multigenic expression drops sharply in later stages, 42% and 32% for iOSN and mOSN, respectively. We found 212 different OR genes that were expressed at least once in a single cell of the neuronal lineage (Excel file 1). Figure 2A shows the total number of reads that were assigned to OR genes, separately for each cell. While aggregate OR expression levels are almost zero for qGBC/GBC, early and mid INPs, there is a steep onset of OR expression in Late.INP. Then, overall OR expression stays at similarly high levels in iOSN and mOSN (Figure 2A). A few OSN do not appear to express any OR at a relevant level. More precisely,19 out 95 iOSN cells (20%) and 4 out 32 mOSN (12.5%) have less than 50 normalized OR counts. While it cannot be excluded that this is caused by incomplete annotation of the OR repertoire, it is also possible that reads were excluded due to multiple mapping to closely similar OR genes.
It is known that mature OSNs express only a single OR gene3,4,9,37, after a transient period of multigene expression29,30. We therefore decided to rank OR genes by expression level, separately in each cell. We then investigated the temporal behavior of the top four ranked OR genes in each cell. These genes account for 99% of all reads (413,388 out of 416,033)mapped to OR genes in cells from the neuronal lineage. The top-ranked gene of each stage will be referred to as ‘winner’ and the others as the ‘runners-up’. While the abundance of all runners-up drops sharply after Late.INP, the winner does not drop and in fact absolute levels keep increasing several fold until the mOSN stage (Figure 2C,D). As a consequence the distance between winner and runners-up increases considerably in the iOSN stage and even more so in the mature neurons (mOSN). Since we observe each cell only once, we cannot be sure that the winner gene observed in one cell at a certain stage will be the highest expressed OR gene when that cell matures. However, this is by far the most plausible explanation, since a rank switch between the winner and a runner-up before / during iOSN stage would require a coordinated switch of expression between these two specific OR genes, from high to almost zero and vice versa, an unlikely scenario.
Taken together, the pseudotime analysis of neuronal lineage cells suggests three main phases for OR expression (Figure 2C,D):
1) The silent phase exhibits virtually no OR expression, which is the case for the four early stages of neuronal lineage (qGBC, GBC, Early.INP and Mid.INP). In the present data, the silent phase is represented by 133 cells.
2) The onset of OR expression (multigenic phase) is characterized by the simultaneous expression of several OR genes per cell at relatively similar levels; this phase is represented by 34 cells and contains specifically the Late.INP stage and the beginning of the iOSN stage.
3) Finally, the monogenic phase includes the end of the iOSN stage and the mOSN stage, where each cell expresses essentially one functional OR allele (henceforth called the “winner”, Figure 2C) at a very high level while the remaining ORs (the “runners-up”, Figure 2D) show no or very low expression. Our data contains 127 cells in this phase.
Our analysis reveals that Late.INP is a crucial stage in stochastic selection.
OR gene expression during multigenic phase shows no sign of OR cluster-specific activation
Next, in an attempt to infer the mechanism of activation in the multigenic stage and transition to monogenic stage, we analyzed the joint OR gene expression per cell (Figure 2B, Excel file 2). We found that each of the cells expressing more than one OR gene in the Late.INP stage had at least two active OR clusters (i.e., clusters with an OR gene expressed at a level of at least 50 counts). The number of active clusters reached up to 7 for some cells. We performed a permutation test to assess whether OR genes that are jointly active in one cell have the tendency to be located in the same cluster (see SCode 1 and the description therein). The results however show that the average number of clusters with more than 1 active OR gene is consistent with the null model of random cluster allocation of OR genes (SFigure 10, p=0.112). Moreover, we found that the top two active OR clusters (ranked by expression level) always belonged to different chromosomes in each cell of the Late.INP stage. Thus the onset of OR gene expression in the multigenic stage cannot be caused by activation of an individual chromosome or a particular OR cluster. Conversely the transition to monogenic stage could be partially caused by restriction of expression to a single chromosome and cluster. However, this may not be the only selection mechanism involved, as some cells express up to 3-4 different OR genes simultaneously within a single cluster. Specifically, for the 34 Late.INP cells we found 20 cells, which expressed at least 2 OR genes from the same cluster (expression defined as ≥50 normalized counts). Thus, additional steps are required to restrict expression to a single gene within a cluster.
Motif search in Greek island enhancers identifies novel transcription factors
Next we proceeded to identify potential factors involved in both onset of multiple OR gene expression and the transition to monogenic expression. We focused on transcription factors that had a detectable expression in our scRNA-seq dataset. We consider a factor detectable if it has at least 35 counts in at least 15 out of 294 cells in the neuronal lineage, leaving us with 1358 (co)TFs.
Previously, 63 intergenic enhancer regions, termed Greek islands, have been identified inside or near OR clusters using DNase I hypersensitivity-sequencing and chromatin immunoprecipitation sequencing15 and ATAC-seq11. SFigure 1 shows the co-localization of the Greek islands and the OR clusters on a map of the murine genome. Chromatin conformation capture experiments have revealed that Greek islands extensively contact OR clusters, remarkably both in cis and trans38,39.
We performed a de novo motif search on all Greek island enhancer regions as annotated by11 using MEME40,41 (Methods). Ungapped motif analysis of Greek islands identified one known motif, TYCCYWKGGGVCTHATTARM (reported in Monahan et al11), and two novel motifs GVDHCYYCAGRGAV and TBYTCHTCTCYMCAGDGWBNY, with E-value 1.7e−057, 3.7e−027 and 4.1e−008, respectively. Almost all Greek islands contain each of these motifs exactly once, except 8 Greek islands which are missing the third motif. TOMTOM was employed to align these motifs with known transcription factor motifs from the JASPAR database42 (Methods). TOMTOM did not predict any significant TF binding for the two novel motifs, therefore we do not discuss them further. We found 65 significant target binding sites for transcription factors inside the first motif (see Excel file 3). From those, 9 TFs were expressed at a detectable level.
The most significant motif, TYCCYWKGGGVCTHATTARM is composed of two adjacent submotifs, which are overlapped by a third submotif (Figure 3A). The first submotif is targeted by the COE1 DNA-binding domain which is found exclusively in the Ebf transcription factor family (Ebf1-4). The second submotif is bound by homeodomain TFs such as Lhx2, Emx2 and Uncx (Figure 3A). These results are consistent with previously reported Ebf1 and Lhx2 motifs to be positioned next to each other in most Greek islands11. Furthermore, the second submotif is expected to interact with transcription factors from three other families, the homeobox domain TF family, the Pou TF family (Pou6f1, this family has a strong enrichment in OR genes) and the ARID (AT-Rich Interaction Domain) domain TF family (Arid3a).
Noteworthy, our analysis predicted a third submotif, which is a possible binding site for Fos, Fosb and Fosl2. Fos and Fosb are well-known early response transcription factors, which in turn regulate a broad variety of other transcription factors thus regulating many physiological processes43–45. This Fos-binding motif overlaps with the end of the first and the beginning of the second submotif, suggesting a cooperative/inhibitory interaction of the respective binding factors. This possibility will be investigated further in the context of the pseudotime analysis.
Complementary to the de novo motif analysis, we did a forward motif search with AME46 looking for known TF binding sites enriched in the 63 Greek Islands (Methods). This returned a total of 120 TFs (Excel file 4), of which 10 had a detectable expression (at least 35 count per cell in 15/294 cells of neuronal lineage) in our scRNA-seq dataset (SFigure 5). Six of those are also detected in the de novo motif search (see above), the additional TFs are Pax6, Dlx5, Pou2f1 and Arid3b (SFigure 3A). Dlx5 is part of the same TF family as Lhx2, which was found in the de novo motif search. Pax6 and Pou2f1 have a homeobox domain, whereas Arid3b is part of the same TF family as Arid3a.
Taken together we describe 13 TFs that are found at detectable levels and predicted to bind to Greek islands by de novo and/or forward motif search. Next we determined the pseudotime profiles of these TFs and found clear and distinct temporal expression patterns, providing additional evidence for their active involvement in the regulation of OR expression.
Pseudotime analysis suggests transcription factors involved in OR expression
The sorting of cells according to pseudotime (Methods) generates, for each gene, a time course of its expression (see above). Notably, all TFs found by motif search in the previous paragraph show a pronounced temporal expression pattern, which belongs to one of three groups (Figure 3B and SFigure 5): The first group is active early in the silent phase, but strongly downregulated in late silent phase to reach a minimum in the multigenic phase (Fos, Fosb, Fosl2, Pax6 and Pou2f1). Some, but not all, are upregulated again in the monogenic phase (Fos, Fosb).The second group peaks within the multigenic phase (Ebf1, Lhx2, Emx2, Uncx and Dlx5). The third group is specifically upregulated during the monogenic phase (Arid3a and Pou6f1). Hereafter we will refer to these group definitions.
The fact that all TFs with a known Greek island binding site show a clear temporal pattern prompted us to perform a systematic search for TFs that change their expression upon transition between the three phases of OR expression. We also include co-factors in this analysis, because co-factors such as LDB1 have been found to be selectively associated with Greek islands and were suggested to initiate OR expression38. We searched for (co-)TFs with a significant differential gene expression of at least 2-fold between silent vs. multigenic phase or between multigenic vs. monogenic phase (Methods), resulting in 83 (34 going up, 49 down) respectively 39 (8 up, 31 down) relevant hits (Figure 4A, see Excel files 5,6).
Several of these differentially expressed factors were also identified by motif analysis (e.g., Ebf1, Lhx2, Dlx5, Fosb and Fos), but many are not. We manually inspected the pseudotime patterns of all differentially expressed (co-)TFs, and for detailed discussion, we selected those factors whose pseudotime expression pattern falls clearly into one of the three groups described above. We limit ourselves in the following to discussion of novel (co-)factors with additional evidence from motif analysis or a previous link to OR expression11,18,19,38. All other factors with a characteristic expression time course are shown in SFigure 9A.
For group 1 (active early in silent phase, downregulated in late silent phase and minimum in the multigenic phase) the differential expression search newly identified Egr1, whose expression resembles that of Fos and Fosb, which were already identified by motif analysis (see above). Therefore we searched explicitly for the known Egr1 binding motif (Methods) in Greek islands, which could be identified in 16 of 63 Greek islands. Fos, Fosb and Egr1 are immediate early genes, which are rapidly upregulated in response to external stimuli, immune response, and cellular stress45. Egr1, Fos and Fosb are specifically downregulated during the multigenic phase (Figure 4B and SFigure 9A). This suggests combinatorial interactions with the other components that regulate OR expression and will be discussed later.
The second group peaks specifically within the multigenic phase and 6 factors have been identified by motif analysis (see above). Differential expression analysis further obtains the cofactor Ssbp2 (Figure 4B) and three factors, Cebpg, Rcor1 and Ldb1, which have been reported previously to be involved in OR expression11,38,47,48. Ssbp2 binds to Ldb1 and thereby prevents Ldb1 from degradation49,50. While Ldb1 and Lhx2 were shown to bind to Greek island enhancers to regulate OR expression in trans38, Ssbp2 is a novel candidate with such a function.
The third group is specifically upregulated during the monogenic phase and two factors from this group have been identified by motif analysis. Differential analysis identifies additionally the TFs Mef2b, Rfx3 and Sub1, also known as PC4 (Figure 4B and SFigure 9A). Among these factors, only Rfx3 has a known motif, for which we performed a strict motif search in Greek islands (Methods). We report that 56 out of 63 Greek islands contain the binding motif for Rfx3. Moreover, note that Mef2a, which shares a similar SRF binding domain with Mef2b51, is found to be strongly bound to OR promoters19.
Taken together, our pseudotime analysis recovers a large proportion of candidate (co-)TFs identified by motif analysis - both for initiating onset of OR expression, and for the transition to monogenic stage. Moreover it extends the range of candidates whose time course correlates with these two transitions, and consequently the regulatory repertoire for these transitions.
Changes in chromatin remodeler expression accompany both transitions in the OR selection process
It is known that chromatin changes accompany the selection of OR genes20,21,23. We therefore searched our data for chromatin remodelers that show expression changes during OR selection (Methods, Excel files 5,6). We confirmed previous observations that the chromatin remodelers Lbr and Cbx5 (SFigure 9B) are expressed at earlier stages and are downregulated in the course of OSN differentiation15,21. Furthermore, we discovered novel candidates for silencing OR genes, for onset of (multigenic) expression, and for transition to monogenic expression (Figure 4C):
Among the genes whose expression profiles fall into group 1 (minimum in multigenic phase), we found Eed, one of the constitutive subunits of the polycomb repressive complex 2, PRC2 (Figure 4C). Eed is required to maintain repressive H3K27me3 marks52,53 and its downregulation may lead to de-repression of OR expression in Late.INP stage. Note that another PRC2 subunit, Ezh2 is expressed during the silent phase as well, but decreases later, at the transition to monogenic phase (SFigure 6). Nsd1 is a histone methyltransferase that demethylates H3K36me254 (Figure 4C). All remodelers found with a group 1 pseudotime profile (Hells, H2afz and Set) are predicted to play a repressive role in the silent phase of OR selection (we only show H2afz as example SFigure 9B).
For group 2 (peak in multigenic phase), we found prominent chromatin remodelers such as Zmynd8, Ell3, Sertad2, Med12l and Scai (Figure 4C, SFigure 9B). We also investigated the expression profile of Kdm1a which was known before as a regulator of OR expression. Kdm1a alias LSD1 is a Lysine demethylase and functions both as a coactivator by demethylation of mono- or di-methylated H3K9 and as a corepressor through demethylation of mono- or di-methylated H3K455–58. There have been contradictory reports on the function of Kdm1a in OR expression as an activator27 or repressor of transcription48. The present data sheds light on this debate: While Kdm1a expression sharply peaks directly before the multigenic phase (arguing for its role as activator), it can be part of the Co-REST repressor complex59. Two components of the Co-REST repressor complex, Rcor1 and Hdac2, sharply peak during multigenic phase (Figure 5C, SFigure 8A), arguing for a change of function of Kdm1a by recruitment to the Co-REST complex at the transition to monogenic phase55,60,61.
Of the four novel remodelers with a group 2 pseudotime profile, Zmynd8 and Ell3 are highly differentially expressed (Figure 4C). Zmynd8, a chromatin reader, is a particularly appealing candidate, since it is also known to play a role in the selective expression of the immunoglobulin heavy chain (Igh) regions in immune cells (B cells). Its product ZMYND8 binds Igh super-enhancers known as 3’ regulatory region (3’RR). ZMYND8 thereby controls the 3′RR activity by modulating the enhancer transcriptional status62. Consistent with an activating role during the multigenic phase, Ell3 does not only bind to active enhancers, but also marks the enhancers that are in a poised or inactive state in ES cells63.
Remodelers whose pseudotime profiles fall into group 3 (specifically upregulated during the monogenic phase) are Cbx4 (Chromobox 4) and Jarid2 (Figure 4C). Cbx4 is a component of a Polycomb group (PcG) multiprotein PRC1-like complex, which is required to maintain the transcriptionally repressive state of many genes64,65. Jarid2 (Jumonji and AT-Rich Interaction Domain Containing 2) is required to repress expression of cyclin-D1 (CCND1) in cardiac cells by setting H3K9 methylation marks66, and it is upregulated upon transition from Late.INP to iOSN (SFigure 4B), i.e. upon exit from the cell cycle.
So far, we identified several chromatin remodelers that add to the regulatory repertoire for the onset of OR expression, and for the transition to monogenic stage. Another important feature which requires chromatin remodeling is the monoallelic expression of ORs in mature OSNs. This feature appears to be established from the very beginning of OR expression4,23. Factors involved in generating allelic exclusion therefore would be expected to peak at least as early as factors regulating the onset of (multigenic) expression (group 2 factors). We found two remodeling factors with a very early onset within group 2 which could potentially play such a role: Smchd1 (structural maintenance of chromosomes flexible hinge domain-containing protein 1) and Cdyl2 (chromodomain Y-like protein 2) (see SFigure 7).
Based on the interactions and temporal coordination of remodelers and (co-)TFs, we generate and discuss some hypotheses about OR selection below.