Comprehensive single-cell gene and isoform expression analysis reveals signatures of ageing in haematopoietic stem and progenitor cells


 Single-cell approaches have revealed that the haematopoietic hierarchy is a continuum of differentiation, from stem cell to committed progenitor, marked by changes in gene expression. However, many of these approaches neglect isoform level information, and thus do not capture the extent and effect of alternative splicing within the system. Here, we present the first integrated short- and long-read single-cell RNA-seq of haematopoietic stem and progenitor cells. We demonstrate that over half of genes detected in standard short-read single-cell analyses are expressed as multiple, often functionally distinct, isoforms. This includes many transcription factors and key cytokine receptors, and in particular the Thrombopoietin receptor Mpl, which displays complex isoform expression patterns between individual hematopoietic stem cells. The dataset further reveals novel signatures of hematopoietic ageing, including a global increase in lncRNA expression. Strikingly, the long-read sequencing enables us to observe aberrant expression of full-length VJ-rearranged immunoglobulin kappa transcripts in aged haematopoietic stem cells, prior to lymphoid commitment. Integrating single cell and cell-type specific isoform landscape in normal and aged hematopoiesis provides a new reference for accurate molecular profiling of heterogeneous tissues, as well as novel insights into transcriptional complexity, cell-type specific splicing events and effects of ageing.


Introduction
Single-cell RNA-seq (scRNA-seq) technologies are now applied to a broad spectrum of biological systems 1 with particular impact in the study of stem cell and developmental biology [2][3][4] . With the advance of short-read technologies capable of analysing many thousands of cells in a single experiment, it has become possible to identify cell types and state transitions in complex biological systems. In particular, scRNA-seq has highlighted the continuous nature of haematopoietic hierarchy. Cell types, previously thought of as discrete entities, have been shown to exist in a continuum of states from stem cell to mature progenitor 5 . Investigating the regulatory events that occur in these state transitions, and how they change with age, is central to the understanding of the regenerative potential of stem cells in health and disease.
Alternative splicing (AS) of mRNA transcripts is a mechanism by which several isoforms can be generated from individual genomic locus, enabling significant increases in transcriptomic and proteomic complexity. AS can affect many aspects of gene expression, including transcript export from the nucleus, transcript stability, and critically the production of functionally distinct protein isoforms. AS is thought to occur in at least 62% of multi-exonic genes in mouse 6 and up to 95% of multi-exonic genes in human 7 . Increasingly, there is an understanding that isoform (co-)expression in tissues and cells can reveal previously unseen complexities in cell signalling responses, as was demonstrated recently for G-protein coupled receptors 8 . In haematopoiesis, substantial levels of AS have been observed in sorted populations of stem and progenitor cells 9,10 , but in general AS remains overlooked in scRNA-seq studies. Because the vast majority of scRNA-seq approaches target just 3' or 5' transcript ends, AS events are unlikely to be captured and thus an entire class of biologically important information about celland cell-type specific gene expression is lost. Advances in long-read sequencing technologies have enabled unequivocal detection of AS isoforms 11 . Here, we have developed and applied integrated parallel short-(Illumina) and long-read (Pacific Biosciences; PacBio) sequencing of single-cell libraries generated using the 10X Genomics Chromium. This enables parallel profiling cellular diversity, gene expression and AS events in the mouse haematopoietic system.
To demonstrate the utility of this approach, we performed parallel gene and isoform expression analysis on haematopoietic stem and progenitor cells isolated from young and aged C57/BL6 mice. We generated cell-barcoded cDNA using the 10X Genomics platform, and Illumina sequencing of this cDNA to reveal cell states, gene expression, and cell frequency changes associated with ageing. Parallel PacBio sequencing of this full-length cDNA and integration of the cell barcodes enabled assignment of AS isoforms to individual cells and cell-type clusters generated from the short-read data (Fig 1A, Supplementary Fig 1). We demonstrate that AS is readily detectable by long-read sequencing of scRNA-seq libraries and that this integrated approach can identify novel aspects of gene expression in cell differentiation and ageing. We observe that functionally divergent isoforms, undetectable by short-read sequencing alone, are common throughout haematopoiesis, and thus that isoform-level analysis is critical for the understanding of cell states and their response to signals.

Generation of short-and long-read single cell RNA seq libraries
Using Fluorescence Activated Cell Sorting (FACS) we isolated the Lineage negative, cKit (Cd117) positive (LK) cell fraction of mouse bone marrow cells, a population containing stem and progenitor cells 12 from young (8 week) and aged (72 week) mice. We generated standard single-cell sequencing libraries from these populations, which revealed the diversity of cell types present within the 8,000 LK cells passing quality control (Fig 1B).
Analysis of this data identified 15 subclusters within the LK population, including long-term haematopoietic stem cells (HSCs), here associated with Procr (Cd201) expression 13 , as well as intermediate and committed progenitor cells identifiable by distinct marker gene expression ( Fig 1B, Supplementary Table 1). This matches well the diversity expected from FACSbased phenotypic analysis of the same population 12 . Small numbers of mature B-cells, myeloid cells and mast cells were also observed, but were transcriptionally very distinct from the main stem and progenitor cell cluster, and most likely represent low level contamination with mature cells.
In order to further examine transcriptional heterogeneity in the haematopoietic system, we performed long-read PacBio sequencing (IsoSeq) on the cDNA pools generated from the 10X platform, similar to an approach recently applied in cerebellar cells 14 . This protocol, taking advantage of the cell barcoding technology used in 10X Genomics library preparation, enables isoform identification and association with cell populations, and potentially even single-cells, through integration of the long-and short-read data.
A detailed breakdown of the PacBio sequencing statistics is presented in Supplementary Table 2. In brief, PacBio sequencing yielded a total of 17.9 million circular consensus sequencing (CCS) reads with a median read length of 1471 bases (Fig 2A). These reads mapped to an average of 16,427 genes per sample, representing an average of 33,345 transcripts per sample and an average of 31 reads per transcript. Transcript coverage averaged 74% (Fig 2B), and alternative isoforms were detected for 52.3% of genes (Fig 2C).
We took advantage of the long-read pacbio data to annotate and explore alternative splicing events using TALON 15 . We identified a total of 11,013 novel transcripts supported by at least five reads and identified in at least three of the samples (44,993 novel transcripts if at least two reads and two samples). Those annotations also enabled the identification of 910 novel cassette exons, 4,118 and 3,465 novel alternative 5' and 3; splicing sites respectively. We also identified 143 novel junctions between previously annotated splice sites. We investigated the impact of the novel splicing events on the coding potential of the transcripts. Among the 10,576 novel transcripts arising from protein coding genes, a total of 6,747 transcripts were identified as coding, whereas 3,830 were deemed non coding (Supplementary Table 3

)
Demultiplexing of the long-reads using the cell barcodes identified in the short-read analysis enabled 5.8 million (32%) of CCS reads to be assigned to individual cells. Using the multimodal capabilities of Seurat 16 , we integrated short-and long-read datasets allowing side-byside comparison of gene and isoform expression from the respective platforms, and annotation of individual cells with isoform-level transcript information. A median of 411 reads, corresponding to a median of 292 transcripts, could be assigned per cell (Fig 2D and 2E). In general, the numbers of isoforms detected are too low to allow meaningful comparisons at single-cell resolution. However, with 50-500,000 long-reads per cluster, scaling linearly with the number of cells per cluster (Fig 2F-H), it is possible to associate isoform expression with cell-type clusters.

Alternative Splicing in Haematopoietic Transcription Factor Networks
To highlight the impact of additional long-read sequencing in scRNA-seq studies, we screened the dataset to identify AS events in key transcription factors (TFs) from a regulatory network derived from single-cell studies of haematopoiesis 17 . Our long-read sequencing detected 28 (of 31) TFs in that network (Fig 3A), including Lmo2, Gata2, Ldb1 and Tal1, which when translated form protein complexes that regulate cell fate decisions across hematopoiesis.
Three predominant isoforms were detected for Lmo2, Lmo2-202, Lmo2-203 and Lmo2-208, each of which encodes a protein differing in length (228, 220 and 158 amino acids, respectively) with progressive truncation from the N-terminus. Additionally, a novel, in-frame variant of Lmo2 with a 297 bp exon was also detected and supported by 27 long-reads ( Supplementary Fig 2). In our data, while Lmo2-202 and 203 are widely expressed, the transcript encoding shortest protein form, Lmo-208, is additionally expressed in pro-erythroid cells, megakaryocytes and mast cells (Fig 3B). In human cell lines, long-and short-protein isoforms of Lmo2 (equivalent to the proteins encoded by Lmo2-202 and -203) have been shown to have distinct functional roles in the formation of TF complexes and regulation of gene expression 18 . Gata2 expression, which in our short read data is restricted to stem and progenitor cells, mast cell and early megakaryocyte and erythroid progenitor populations, consists of two main isoforms, Gata2-201 and Gata2-202 (Fig 3C). They are translated into the same protein but are differentiated by the usage of distinct distal exons, and have previously been shown to exhibit some lineage specific expression 19 . Here, both isoforms are most abundant in the mast cell population, with Gata2-202 showing more restricted expression to this cell type (Fig 3C). We similarly observe the expression of multiple isoforms of Ldb1 ( Fig 3D) and Tal1 (Fig 3E).

Alternative Splicing and Cytokine receptors
Transmembrane proteins can determine the cellular response to its environment, particularly regulatory cytokines. We screened the long-read data for isoform expression of a panel of cytokine receptors (Fig 4A). Two predominant isoforms of the stem cell factor (SCF) receptor Kit (Kit-201 and Kit-204) were observed to be expressed in virtually all cell types (Fig 4B). Using junction-targeting qPCR, we examined the distribution of Mpl isoforms in single FACS isolated HSCs (LSK Cd150+ Cd48-Cd34-) (Fig 4D). This demonstrated that individual HSCs frequently express more than one isoform, and in some cases three or four isoforms could be detected in the same single cell. Mpl-202 is the most common isoform but is frequently coexpressed with Mpl-204 and also occasionally with Mpl-203. The observation that individual HSCs can express more than one isoform of this key cytokine receptor suggests that ThPO signalling can be distinctively regulated in the functionally heterogeneous or lineage-primed stem cell pool. This highlights the critical need to understand not just gene, but isoform expression in single-cell studies.

Signatures of ageing
We integrated short-and long-read data to reveal age-associated signatures in cell type abundance, gene and isoform expression. While the majority of cell type abundances were unchanged between 8-and 72-week old mice, there was an increased abundance of phenotypic LT-HSCs (Fig 5A-C) in keeping with previous findings based on stem cells defined by protein marker expression (Lineage negative, Sca-1 positive, cKit positive (LSK) CD34-CD48-CD150+ cells) 24 .
To identify transcriptional signatures associated with ageing across the haematopoietic hierarchy, we performed differential gene expression analysis in each population using the short-read sequencing data (Fig 5D). We observed downregulation of genes encoding the cytosolic ribosomal components Rpl35 and Uba52 in stem cells and all progenitors. A decrease in expression of UBA52 and RPL35 genes has also been observed in aged human peripheral blood 25 . The cytosolic ribosomal subunit Rpl35 and the ubiquitin-ribosomal hybrid encoding gene Uba52 ensure protein homeostasis and efficient turnover of defective proteins.
Deficiency of Uba52 results in reduction of protein synthesis and cell cycle arrest 26 .
Interestingly, functional decline of ubiquitin-dependent protein clearance with increasing age reinforces the ageing process through accumulation of damaged and aggregated proteins, ultimately interfering with cell function 27 . Specifically in the HSC population, we observed ageassociated upregulation of HSC-specific genes including Sult1a1 and Nupr1 in addition to panhematopoietic upregulated genes including Igkc. We also observed downregulation of Rpl35, Uba52, Tmsb10, Gpx1, Plac8 and Cd34 (Fig 5E, Supplementary Table 4). Sult1a1 and Nupr1 are highly restricted to the LT-HSC population, and are indeed almost exclusively expressed in aged LT-HSCs (Fig 5F-G) and this increase in expression was confirmed in FACS-purified HSCs (LSK Cd34-Cd48-Cd150+) from young and aged mice by qPCR ( Fig   5H). While Nupr1 was not detected in the long-read data, the Sult1a1-203 isoform was the predominant isoform detected, encoding a 188 aa variant of the Sult1a1 protein.
A global comparison of the long-read data from young and aged mice showed an ageassociated increase in expression of noncoding transcripts, including IgV pseudogenes, lncRNAs and transcripts with retained introns (Fig 6A). This is consistent with observations that an increased frequency of intron retention has been identified as a signature of ageing in fruitfly, mouse and human 28,29 .
Following the observation that Igkc transcripts were upregulated in the aged LT-HSC population, and indeed throughout the myeloid progenitors (Fig 6B), we used the long-read data to determine that the Igkc arising from the LT-HSC population in aged mice consisted not just of "germline" Igkc, as previously reported 30 , but fully VJ-recombined transcripts (Fig 6C,   Supplementary Fig 3), and distinct from the upregulation of IgV pseudogenes (which account for only 1% of immunoglobulin reads sequenced).

Discussion
Accurately defining isoform expression may be key to understanding functional heterogeneity in cells with similar gene expression profiles. Aberrations in AS and its regulation are often associated with ageing and disease, and thus methods to analyse isoform expression at single cell resolution will be essential to understand the full impact this can have. Here, we demonstrate that long-read sequencing can be integrated into conventional single-cell approaches to profile cell-type specific isoform expression, and to predict functional diversity in the cell's proteome. This also enables integration of previously unannotated isoforms into reference transcriptomes to allow a more accurate annotation of cell types specific isoform expression, and a better understanding of contribution of isoform dynamics to cellular function.
We demonstrate that over half the genes detected in a standard single-cell analysis of the haematopoietic hierarchy are actually present as multiple isoforms, including transcription factors and transmembrane receptors with key roles in haematopoietic differentiation.
Complexes such as Lmo2/Gata2/Ldb1/Tal1 may be readily subject to regulation through isoform expression, with distinct protein isoforms having distinct roles within the overall complex. The finding that many of the transcription factors comprising established networks are present as multiple isoforms suggests an additional layer of complexity that should be taken into account, for example when building gene regulatory networks from single-cell expression data.
Similarly, it may be that AS events underlie how phenotypically highly similar cells respond differently to extrinsic signals. We detected multiple isoforms of the transcript encoding the ThPO receptor Mpl, a key regulator of stem cell maintenance and megakaryocyte differentiation and identified that these isoforms, with functionally distinct protein products are heterogeneously expressed in phenotypic HSCs. We demonstrate for the first time that single HSCs often express more than one isoform of Mpl with each isoform potentially having antagonistic responses to ligand binding. Functional heterogeneity in isoform expression/coexpression could potentially have a critical role in how individual HSCs respond to ThPO, and it would be of considerable interest to understand the role this might play in both steady-state and stress haematopoiesis.
The integrated analysis of short-and long-read data further allowed the identification of multiple signatures of hematopoietic ageing. Consistent with previous reports, we observe an expansion of a transcriptionally phenotypic HSC population with ageing, while other cell populations remain largely unchanged. Furthermore, the short-read data identified Sult1a1 and Nupr1 expression to be almost exclusively expressed in aged HSCs. The genes encoding Nupr1 and Sult1a1 are co-localised in a 50 kb region of chromosome 7, and their elevated expression has previously been shown to be associated with an age-related increase in the H3K4me3 chromatin mark in HSCs 31 . Sult1a1 encodes for a sulfotransferase which acts on substrates including hormones and neurotransmitters, and Nupr1 has a regulatory role in cell proliferation and apoptosis, but neither has a described functional role in haematopoiesis 31,32 .
Upon ageing, no specific changes in isoform expression could be observed at this resolution but the long-read data did show increased expression of lncRNAs. Highly tissue-specific changes in lncRNAs expression in aged human tissues has been reported 33 and abnormally elevated expression of some lncRNAs seem to relate to telomere shortening and senescence 34 . Similarly, an overall increase in intron retention was observed. Perhaps the most striking signature of ageing detected through the combined use of short-and long-read sequencing was the upregulation of recombined Igkc throughout hematopoiesis, including in phenotypic HSCs. This was highly unexpected but perhaps not without precedent. Previous work has shown that unrecombined Igkc transcripts are expressed in aged LT-HSCs, possibly as a result of epigenetic dysregulation 30 . Interestingly, Igkc expression was also observed to be upregulated in microarray analysis of gene expression in vWF+ platelet-biased HSCs when compared with vWF-HSCs 35 .
The generation of these recombined transcripts is dependent on genomic rearrangements, typically seen in committed lymphoid progenitors downstream of HSCs. This finding suggests that premature activation of the recombination-activating genes Rag-1 and Rag-2, in addition to epigenetic changes in the Igkc locus, may represent a novel hallmark of ageing stem cells.
There is an increasing body of evidence that the expression of recombined Ig molecules in non-lymphoid cells is possible [36][37][38] , including human CD34+ cord blood stem and progenitor cells 39 and acute myeloid leukemia 40 ), although the precise mechanisms of how the required structural changes in the genome enabling this can occur remain to be determined. Within the aged HSCs, there is some heterogeneity in VJ recombination, with a few dominant clones detected, for example Igkc/Igkj1/Igkv3-2, indicating that this recombination has occurred in multiple distinct HSC, from which some have generated recombined clonal progeny which retain the HSC phenotype.
Overall, our study not only provides an unprecedented picture of cell type-specific gene and isoform expression in a variety of haematopoietic cells but also offers novel insights into transcriptional signatures of ageing in stem and progenitor cells. Our results highlight the need to further characterise isoform diversity at single cell level and to build an isoform atlas for different cell types to reveal the full extent of transcriptional heterogeneity in development, ageing and disease. The observation that full-length, recombined immunoglobulin expression can occur in aged HSCs is a striking finding that requires further mechanistic investigation.
With continued improvements in the throughput of long-read sequencing platforms and development of methods targeting specific cells or transcripts, we envision that long-read sequencing will enable detailed characterisation of the total landscape of isoform diversity at the single-cell level at a scale comparable to current short-read based methods. This will be applicable to large-scale atlassing studies of cellular heterogeneity, and in haematological malignancies where mutations in splicing factors are common 41 .

Competing interests.
No competing interest to declare

Stem and Progenitor cell isolation
All animal work in this study was carried out in accordance with regulations set by the United Kingdom Home Office and the Animal Scientific Procedures Act of 1986. Bone marrow was isolated from spine, femora, tibiae, and ilia of 8 weeks and 72 weeks old C57BL/6 mice. Red  Raw Illumina sequencing data were analysed with the 10X Genomics CellRanger pipeline (version 3.0.2) to obtain a single-cell expression matrix object. Subsequent analysis was performed in R using Seurat version 3 16 . Cells showing gene counts lower than 1,000 and a mitochondrial gene expression percentage higher than 5% were excluded from further analysis. Within Seurat, data were normalised using NormalizeData (normalization.method = "LogNormalize", scale.factor = 10000) and data from multiple samples were merged using the FindIntegrationAnchors and IntegrateData commands.

Pacific Biosciences Sequel Sequencing of single-cell cDNA libraries
Libraries compatible for the Pacific Biosciences Sequel/Sequel II systems were prepared from 800 ng input cDNA, following the "no size selection" Iso-Seq library preparation method according to the manufacturer's instructions (IsoSeq Template Preparation for Sequel System V05), with the following modifications: the elution incubation time during AMPure beads purification was increased to 10 minutes and the second AMPure bead purification step, following the exonuclease reaction, was omitted to optimise library concentration.

Custom transcriptome annotation
A technology-agnostic long-read pipeline TALON v.5.0 [1], was used to construct a custom transcriptome annotation. First, 11 genome-aligned sam files were passed to TranscriptClean v.2.0.2 [2] for correction of read microindels (<5bp) and mismatches, though any noncanonical splice junctions were retained for downstream analyses and clean-up was not variant-aware. Internal priming artifacts are a known issue with oligo-dT selection methods [3] and this was assessed using a T-window size of 20bp (equivalent to the primer T sequence), with reads labelled accordingly using TALON functions. Read annotation was performed using and CDS as training sets.

Data integration in Seurat
In order to reduce the batch effect, count matrices produced by short-read sequencing for individual libraries were combined in Seurat using FindIntegrationAnchors and IntegrateData functions (dims = 1:20). Illumina and PacBio reads were integrated in Seurat using the CreateAssayObject command to add the long-read data to an existing Seurat object already containing the short-read data. This links the demultiplexed long-reads with the short-read data through the cell barcodes present in both.

Single-cell and low-input qPCR
Amplified cDNA was generated from sorted cells using the Smart-seq2 protocol 42 . This material was then used as input for qPCR reactions using assays targeting Sult1a1 and Nupr1.
Low-volume qPCR reactions were set up using the Mosquito HV instrument (STP Labtech) and analysed using LightCycler (Roche). For relative abundance, data are presented as expression relative to a housekeeping gene. For single-cell isoform junction detection PCRs, reactions were performed as above using junction-spanning primers. Data are presented as presence/absence heatmap, where analyses with Ct values < 30 were considered to be expressed.

Data availability
The raw sequencing data can be accessed on the NCBI-SRA archive, accession number PRJNA691569.