Re-processing of published data
The raw sequencing data of 4 previously published studies was re-aligned to GRCh38, ensembl 84 for the HLCA (Krasnow_2020, Barbry_Leroy_2020, Seibold_2020, Meyer_2021). For the Krasnow_2020 and Barbry_Leroy_2020 studies, the CellRanger 3.1.0-based HLCA pipeline was used for re-alignment. Seibold_2020 and Meyer_2021 data were processed as previously described22,43, but with reference genome and genome annotation adapted to the HLCA (GRCh38, ensembl84). All other datasets in the HLCA core were originally already aligned to GRCh38, ensembl84, except data from the Lafyatis_Rojas_2019 study (GRCh38, ensembl93) (Supplementary Data Table 1).
Single-cell sequencing and preprocessing of unpublished data
Unpublished data was generated as follows:
Barbry_unpubl: Barbry_unpubl: Nasal and tracheobronchial samples were collected from IPF patients after obtention of their informed consent, following a protocol approved by CHU Nice, registered at clinicaltrials under reference NCT04529993. The libraries were prepared as described in Deprez et al13, and yielded an average of 61,000 ± 11,000 cells per sample, with a viability above 95%. The single cell suspension was used to generate single cell libraries following the V3.1 protocol for 3’ chemistry from 10x Genomics (CG000204). Sequencing was performed on a NextSeq 500/550 sequencer (Illumina). Raw sequencing data were processed using the cellranger-6.0.0 pipeline, with reference genome and annotation GRCh38 and ensembl98. For each sample, cells with less than 200 transcripts or more than 40 000 transcripts were filtered out, as well as genes expressed in less than 3 cells. Normalization and log-transformation was done using the standard Scanpy76 pipeline. PCA was performed on 1000 HVG to compute 50 PCs and the Louvain algorithm was used for clustering. Those clusters were then annotated by hand for each sample. Raw counts and the thus obtained cell annotations were used as input for the HLCA.
Schiller_2021: Tumor-free, uninvolved lung samples (peri-tumor tissues) were obtained during tumor resections at the lung specialist clinic “Asklepios Fachkliniken Munich-Gauting” and accessed through the bioArchive of the Comprehensive Pneumology Center Munich (CPC-M). The study was approved by the local ethics committee of the Ludwig-Maximilians University of Munich, Germany (EK 333 − 10 and 382 − 10) and written informed consent was obtained from all patients.
Single-cell suspensions for single-cell RNAseq were generated as previously described15. In brief, lung tissue samples were cut into smaller pieces, washed with PBS and enzymatically digested using an enzyme mix composed of dispase, collagenase, elastase and DNAse for 45 min at 37°C while shaking. After inactivating the enzymatic activity with 10% FCS/PBS, dissociated cells were passed through a 70 µm cell strainer, pelleted by centrifugation (300 x g, 5 min), and subjected to red blood cell lysis. After stopping the lysis with 10% FCS/PBS, the cell suspension was passed through a 30 µm strainer and pelleted. Cells were resuspended in 10%FCS/PBS, assessed for viability and counted using a Neubauer hematocytometer. Cell concentration was adjusted to 1,000 cells/µl and around 16,000 cells were loaded on a 10x Genomics Chip G with Chromium Single Cell 3′ v3.1 gel beads and reagents (3′ GEX v3.1, 10x Genomics). Libraries were prepared according to the manufacturer’s protocol (10x Genomics, CG000204_RevD). After quality check, single-cell RNA-seq libraries were pooled and sequenced on a NovaSeq 6000 instrument.
The generation of count matrices was performed by using the Cellranger computational pipeline (v3.1.0, STAR v2.5.3a). The reads were aligned to the hg38 human reference genome (GRCh38, ensembl99). Downstream analysis was performed using the Scanpy76 package (v1.8.0). We assessed the quality of our libraries and excluded barcodes with less than 300 genes detected, while retaining those with a number of transcripts between 500 and 30,000. Further, cells with a high proportion (> 15%) of transcript counts derived from mitochondrial-encoded genes were removed. Genes were considered if they were expressed in at least 5 cells. Raw counts of cells that passed filtering were used as input for the HLCA.
Duong_HuBMAP_unpubl: All post-mortem human donor lung samples were obtained from the Biorepository for Investigation of Neonatal Diseases of the Lung (BRINDL) supported by the NHLBI LungMAP Human Tissue Core housed at the University of Rochester. Consent, tissue acquisition, and storage protocols can be found on the repository’s website (brindl.urmc.rochester.edu/). Data was collected as part of the Human Biomolecular Atlas Program (HuBMAP). For isolation of single nuclei, 10 cryosections (40 um thickness) from OCT embedded tissue blocks stored at -80℃ were shipped on dry ice and processed according to a published protocol77. Single nucleus RNA-sequencing was completed using 10X Chromium Single-Cell 3’ Reagent Kits v3 according to a published protocol77,78. Raw sequencing data was processed using the 10X Cell Ranger v3 pipeline and GRCh38 (hg38) reference genome. For downstream analysis, mitochondrial transcripts and doublets identified by DoubletDetection79 v2.4.0 were removed. Samples were then combined and cell barcodes were filtered based on genes detected (> 200, < 7500) and gene UMI ratio (gene.vs.molecule.cell.filter) using Pagoda2 (github.com/hms-dbmi/pagoda2). Also using Pagoda2 for clustering, counts were normalized to total counts per nucleus. For batch correction, gene expression was scaled to dataset average expression. After variance normalization, all significantly variant genes 4519 were used for PCA. Clustering was done at different k values (50, 100, 200) using the top 50 principal components and the infomap community detection algorithm. Then, principal component and cluster annotations were imported into Seurat28 v4.0.0. Differentially expressed genes for all clusters were generated for each k resolution using Seurat FindAllMarkers (only.pos = TRUE, max.cells.per.ident = 1000, logfc.threshold = 0.25, min.pct = 0.25). Clusters were manually annotated based on distinct differentially expressed marker genes. Raw counts and the thus obtained cell annotations were used as input for the HLCA.
Banovich_Kropski_2020: Data was a combination of published data17 and unpublished data (Vanderbilt IRB nos. 060165 and 171657 and Western IRB no. 20181836). Unpublished data was generated as previously described17, using reference genome GRCh38, ensembl84. Both published and unpublished data were combined and processed together using Seurat v4. Cells containing less than 500 identified genes or more than 25% of reads arising from mitochondrial genes were filtered out. Sctransform80 with default parameters was performed to normalize and scale the data, PCA was used for dimensionality reduction using the top 3000 most variable genes. Cell type annotation was performed using Seurat28,80 FindTransferAnchors and TransferData functions, using the annotated object from the published data17 as reference. Raw counts and the thus obtained cell annotations were used as input for the HLCA.
Nawijn_2021: Data was a combination of published2 and unpublished data. In both cases, healthy volunteers were recruited for bronchoscopy at the University Medical Center Groningen, after giving informed consent and according to the protocol approved by the Institutional Review Board. Inclusion criteria and tissue processing were performed as previously described2. In short, all subjects were 20–65 years old and had a history of smoking < 10 pack-years. To exclude respiratory disease, the following criteria were used: absent history of asthma or COPD, no use of asthma or COPD-related medication, a negative provocation test (PC20 methacholine > 8 mg/ml), no airflow obstruction (FEV1/forced vital capacity (FVC) ≥ 70%) and absence of lung function impairment (that is FEV1 ≥ 80% predicted). All subjects underwent a bronchoscopy under sedation using a standardized protocol81. Nasal brushes were obtained from the lateral inferior turbinate in a subset of the volunteers right before bronchoscopy using Cyto-Pak CytoSoft nasal brush (Medical Packaging Corporation, Camarillo, CA, USA). Six macroscopically adequate endobronchial biopsies were collected for this study, located between the third and sixth generation of the right lower and middle lobe. Bronchial brushes were obtained from a different airway at similar anatomical locations using a CellCellebrity bronchial brush (Boston Scientific, Massachusetts, USA). Extracted biopsies and bronchial and nasal brushes were processed directly, with a maximum of 1 hour delay. Bronchial biopsies were chopped biopsies using a single edge razor blade. A single-cell solution was obtained by tissue digestion using 1mgml − 1 collagenase D and 0.1mgml − 1 DNase I (Roche) in HBSS (Lonza) at 37°C for 1h with gentle agitation for both nasal brushes and bronchial biopsies. Single-cell suspensions were filtered forced using 70µm nylon cell strainer (Falcon), followed by centrifugation at 550g, 4°C for 5min and one wash with a PBS containing 1% BSA (Sigma-Aldrich). The single-cell suspensions used for 10X Genomics scRNA-seq analysis were cleared of red blood cells by using a red blood cell lysis bufer (eBioscience) followed by live cell counting and loading of 10,000 cells per lane. We used 10XGenomics Chromium Single-Cell 3’ Reagent Kits v2 and v3 according to the manufacturers’ instructions. Raw sequencing data were processed using the CellRanger 3.1.0-based HLCA pipeline, with reference genome and annotation GRCh38 and ensembl84. Ambient RNA correction was performed using FastCAR (https://github.com/LungCellAtlas/FastCAR), using an empty library cutoff of 100 UMI and a maximum allowed contamination chance of 0.05, ignoring the mitochondrial RNA. Data were merged and processed using Seurat28, filtering to libraries with > 500 UMIs and > 200 genes and to the libraries containing the lowest 95% of mitochondrial RNA per sample and < 25% mitochondrial RNA, normalized using sctransform80 while regressing out % mitochondrial RNA. In general, 15 principal components were used for the clustering, at a resolution of 0.5 to facilitate manual annotation of the dataset. Clusters in the final object that were driven by single donors were removed. Raw counts and cell annotations were used as input for the HLCA.
Jain_Misharin_2021: Nasal epithelial samples were collected from healthy volunteers who provided informed consent at Northwestern Medicine, Chicago, USA. Protocol was approved by Northwestern University IRB (STU00214826). Briefly, subjects were seated and asked to extend their neck. A nasal curette (Rhino-pro, VWR) was inserted into either nare and gently slid from posterior-to-anterior about one centimeter along the lateral inferior turbinate. Five curettes were obtained per participant. The curette tip was then cut and placed in 2 ml of hypothermosol and stored at 4C until processing. Single cell suspension was generated using cold-active dispase protocol reported by Deprez, Zaragosi and colleagues18,82 with slight modification. Specifically, EDTA was omitted and cells were dispersed by pipetting 20 times every 5 min using a 1ml tip instead of trituration using a 21/23G needle, the final concentration of protease from Bacillus Licheniformis was 10 mg/ml. Total digestion time was 30 min. Following the wash in 4 ml of 0.5% BSA in PBS and centrifugation at 400 rcf for 10 min, cells were resuspended in 0.5% BSA in PBS and counted using Nexcelom K2 Cellometer with AO/PI reagent. This protocol typically yields ~ 300–500,000 cells with viability > 95%. The resulting single cell suspension was then used to generate single cell libraries following protocol for 5’ V1 (10x Genomics, CG000086 Rev M) or V2 chemistry (10x Genomics, CG000331 Rev A). Excess cells from 2 of the samples were pooled together to generate 1 additional single cell library. After quality check, the libraries were pooled and sequenced on a NovaSeq 6000 instrument. Raw sequencing data were processed using the CellRanger 3.1.0 pipeline, with reference genome and annotation GRCh38 and ensembl84. To assign sample information to cells in the single-cell library prepared from 2 samples, we ran souporcell83 v.2.0 for that library and 2 libraries that were prepared from these samples separately. We used common genetic variants prepared by the souporcell authors to separate cells into 2 groups by genotype for each library, and Pearson correlation between the identified genotypes across libraries to establish correspondence between genotype and sample. Cell annotations were assigned to cell clusters based on expert interpretation of marker genes for each cluster. Cell clusters were derived with Seurat28 v3.2 workflow: samples were normalized with sctransform80, 3000 HVGs selected and integrated, and clusters derived from 30 principal components using Louvain algorithm with default parameters. Clusters with low number of UMIs and high expression of ribosomal or mitochondrial genes were excluded as low-quality. Raw counts and the thus obtained cell annotations were used as input for the HLCA.
Misharin_2020: Protocol was approved by Northwestern University IRB (STU00212120). Two biopsies that included the main left bronchus and distal parenchyma from the upper lobe were obtained from another donor lung that was not placed for transplant (donor 1). A single biopsy from a distal lung parenchyma (donor 2) was obtained from wedge resection of the donor lung for size reduction during lung transplantation. Lung and airway tissues were infused with a solution of Collagenase D (2 mg/ml) and deoxyribonuclease I (0.1 mg/ml) in RPMI, cut into ~ 2-mm pieces, and incubated in 10 ml of digestion buffer with mild agitation for 30 min at 37°C. The resulting single-cell suspension was filtered through a 70-m nylon mesh filter, and digestion was stopped by addition of 10 ml of PBS supplemented with 0.5% BSA and 2 nM EDTA (staining buffer). Cells were pelleted by centrifugation at 300 rcf for 10 min, supernatant was removed, and erythrocytes were lysed using 5 ml of 1× Pharm Lyse solution (BD Pharmingen) for 3 min. The single-cell suspension was resuspended in Fc-Block (Human TruStain FcX, BioLegend) and incubated with CD31 microbeads (Miltenyi Biosciences, 130-091-935), and the positive fraction, containing endothelial cells and macrophages, was collected. The negative fraction was then resuspended in staining buffer, the volume was adjusted so the concentration of cells was always less than 5E6 cells/ml, and the fluorophore-conjugated antibody cocktail was added in 1:1 ratio (EpCAM, Clone 9C4, PE-Cy7, BioLegend, catalog number 324222, RRID:AB_2561506, 1:40; CD206, Clone 19.2, PE, Thermo Fisher Scientific, catalog number 12-2069-42, RRID:AB_10804655, 1:40; CD31, Clone WM59, APC, BioLegend, catalog number 303116, RRID: AB_1877151, 1:40; CD45 Clone HI30, APCCy7, BioLegend, catalog number 304014, RRID: AB_314402, 1:40; HLA-DR, Clone LN3, eFluor450, Thermo Fisher Scientific, catalog number 48-9956-42, RRID:AB_10718248, 1:40). After incubation at 4°C for 30 min, cells were washed with 5 ml of MACS buffer, pelleted by centrifugation, and resuspended in 500 ul of MACS buffer + 2 ul of SYTOX Green viability dye (Thermo Fisher Scientific). Cells were sorted on a FACSAria III SORP instrument using a 100-m nozzle and 20 psi pressure. Macrophages were sorted as live/CD45 + HLA-DR + CD206 + cells, epithelial cells were sorted as live/CD45 − CD31 − EpCAM+, and stromal cells were sorted as live/ CD45 − CD31 − EpCAM − cells. Cells were sorted into 2% BSA in Dulbecco's PBS (DPBS, without calcium and magnesium), pelleted by centrifugation at 300 rcf for 5 min at 4°C, and resuspended in 0.1% BSA in DPBS to ~ 1000 cells/l concentration. Concentration was confirmed using K2 Cellometer (Nexcelom) with AO/PI reagent, and ~ 5000 to 10,000 cells were loaded on a 10x Genomics Chip B with Chromium Single Cell 3′ gel beads and reagents (3′ GEX V3, 10x Genomics). Libraries were prepared according to the manufacturer’s protocol (10x Genomics, CG000183_RevB). After quality check, single-cell RNA-seq libraries were pooled and sequenced on a HiSeq 4000 or NovaSeq 6000 instrument. Raw sequencing data were processed using the CellRanger 3.1.0 pipeline, with reference genome and annotation GRCh38 and ensembl84. Each sample was processed with Seurat28 v3.2 workflow to split cells into clusters: data was normalized with sctransform80, a number of principal components was selected based on gene weights within the component, and cells were clustered with Louvain algorithm. Clusters with low-quality cells were excluded, while clusters expressing markers of multiple cell types were further sub-clustered by repeating the workflow only on those cells. Cell annotations were assigned to clusters based on marker genes for each cluster. Raw counts and the thus obtained cell annotations were used as input for the HLCA.
Data and metadata collection
To accommodate data protection legislation, single cell RNA sequencing datasets of healthy lung tissue were shared by dataset generators (Supplementary Data Table 2) as raw count matrices, thereby obviating the need to share genetic information. Unpublished raw count matrices and cell annotations were generated as described above. For the Barbry_2020 study, count matrices provided had ambient RNA removed. Count matrices were generated using varying software (Supplementary Data Table 2). All count matrices of the HLCA core except one study (see above, "re-processing of published data") were based on alignment to reference genome GRCh38, annotation ensembl 84. For all datasets from the HLCA core, a pre-formatted sample metadata form was filled out by the dataset providers for every sample, containing metadata such as subject ID of the donor from which the sample came, the donor’s age, the type of sample, the sequencing platform, etc. (Supplementary Data Table 2). Cell type annotations from dataset providers were included in all datasets.
Cell type reference creation and metadata harmonization for the HLCA core
To harmonize cell type labels from different datasets, a common reference was created to which original cell type labels were mapped (Supplementary Data Table 4). To accommodate labels at different levels of detail, the cell type reference was made hierarchical, with level 1 containing the coarsest possible labels (immune, epithelial, etc.), and level 5 containing the finest possible labels (e.g. naive CD4 T cells). Levels were created in a data-driven fashion, recursively breaking up coarser level labels into finer ones where finer labels were available. Anatomical location of the sample was encoded into a continuous score, with 0 representing the most proximal samples (nose, inferior turbinate) and 1 representing the most distal possible sample (parenchyma, alveolar sacs) (Supplementary Data Table 3). A distinction was made between upper and lower airways. First, an anatomical coordinate score was applied to the upper airways, starting at 0 and increasing linearly (with a value of 0.5) between each of the following anatomical locations: inferior turbinate, nasopharynx, oropharnyx, vesibula, and larynx. The trachea received the next anatomical coordinate score using the same linear increment as in the upper airways (score of 2.5). In the lower airways, the coordinate score within the bronchial tree was based on the generation airway, with trachea being the first generation and the total number of generations assumed to be 2384. The alveolar sac was assigned the coordinate score of the 23rd generation airway. The coordinate score of each generation airway was calculated by taking the log-2 value of the generation, and adding that to the score of the trachea. Using this methodology, the alveolus received the anatomical coordinate score of 7.02. To calculate the finat CCF coordinate, the coordinates scores (ranging from 0 to 7.02) were scaled to a value between 0 (inferior turbinate) and 1 (alveolus). Samples were then mapped to this coordinate system using the best approximation of the sampling location for each of the samples of the Core HLCA.
General data preprocessing for the HLCA core
Patients with lung conditions affecting larger parts of the lung, such as asthma or pulmonary fibrosis, were excluded from the HLCA core, and later added to the extended atlas. For the HLCA core, all matrices were gene-filtered based on cellranger ensembl84 gene type filtering85 (resulting in 33694 gene IDs). Cells with fewer than 200 genes detected were removed (removing 2335 cells, and 21 extra erythrocytes with close to 200 genes expressed, these hampered SCRAN normalization, see below), and genes expressed in fewer than 10 cells were removed (removing 5167 out of 33694 genes).
Total counts normalization with SCRAN
To normalize for differences in total unique molecular identifier (UMI) counts per cell, we performed SCRAN normalization86. Since SCRAN assumes that at least half of the genes in the data to normalize are not differentially expressed between subgroups of cells, we performed SCRAN normalization within clusters. To that end, we first performed a total counts normalization, by dividing each count by its cell’s total counts, and multiplying by 10,000. We then performed a log transformation using natural log and pseudocount 1. A principal component analysis was subsequently performed. Using the first 50 principal components, a neighborhood graph was calculated with the number of neighbors set to k = 15. Data were subsequently clustered with louvain clustering, at resolution r = 0.5. SCRAN normalization was then performed on the raw counts, using the louvain clusters as input clusters, and with the minimum mean (library size-adjusted) average count of genes to be used for normalization set to 0.1. The resulting size factors were used for normalization. For the final HLCA (and not the benchmarking subset), cells with abnormally low size factors (< 0.01) or abnormally high total counts after normalization (> 10e5) were removed from the data (267 cells in total).
Preprocessing of data for integration benchmarking
For computational efficiency, benchmarking was performed on a subset of the total atlas, including data from 10 studies split into 13 datasets (Lafyatis_Rojas_2020 was split into 10Xv1 and 10Xv2 data, Seibold_2020 was split into 10Xv2 and 10Xv3 data, and Banovich_Kropski_2020 was split into two based on processing site). The data came from 72 subjects, 124 samples and 372,111 cells. Preprocessing of the benchmarking data included the filtering of cells (minimum number of total UMI counts: 500) and genes (minimum number of cells expressing the gene: 5).
For integration benchmarking, the scIB benchmarking framework was used16. All benchmarked methods except scGen (i.e. BBKNN, ComBat, Conos, fastMNN, Harmony, Scanorama, scANVI, scVI and Seurat RPCA) were run at least twice: on the 2000 most highly variable genes, and on the 6000 most highly variable genes. Of those methods, all methods that did not require raw counts as input were run twice on each gene set: once with gene counts scaled to have mean 0 and standard deviation 1, one with unscaled gene counts. scVI and scANVI, which require raw counts as input, were not run on scaled gene counts. scGen was only tested on 2000 unscaled highly variable genes. For highly variable gene selection, first, highly variable genes were calculated per dataset, using cellranger-based highly variable gene selection87 (default parameter settings: min_disp = 0.5, min_mean = 0.0125, max_mean = 3, span = 0.3, n_bins = 20). Then, genes that were highly variable in all datasets were considered overall highly variable, followed by genes highly variable in all datasets but one, in all datasets but two etc. until a predetermined number of genes was selected (2000 or 6000 genes).
For scANVI and scVI, genes were subset to the highly variable gene set and the resulting raw count matrix was used as input. For all other methods, SCRAN-normalized (as described above) data were used. Genes were then subset to the pre-calculated highly variable gene sets. For integration of gene-scaled data, all genes were scaled to have mean 0 and standard deviation 1.
Two integration methods allowed for input of cell type labels to guide the integration: scGen and scANVI. As labels, level 3 harmonized cell type labels were used (Supplementary Data Table 4), except for blood vessel endothelial, fibroblast lineage, mesothelial and smooth muscle cells, for which we used level 2 labels. Since scGen does not accept unlabeled cells, cells that did not have annotations available at these levels (i.e. cells annotated as cycling, epithelial, stromal, or lymphoid cells with no further annotations; 4499 cells in total) were left out of the benchmarking data.
Data integration benchmarking
Integration methods were run using default integration parameter settings as previously described42. Maximum memory usage was set to 376Gb, and all methods requiring more memory were excluded from the analysis. The quality of each of the integrations was scored using 12 metrics, with 4 metrics measuring the batch correction quality, and 8 metrics quantifying conservation of biological signal after integration (Extended Data Fig. 1, metrics previously described16). Overall scores were computed by taking a 40:60 weighted mean of batch effect removal to biological variation conservation (bio-conservation), respectively. Methods were ranked based on overall score (Extended Data Fig. 1). The top performing method (scANVI, 2000 hvgs, unscaled) was used for integration of the HLCA core.
Splitting of studies into datasets
For integration of the data into the HLCA core, we first determined in which cases studies had to be split into separate datasets (which were treated as batches during integration). Reasons for possible splitting were 1) different 10X versions used within a study (e.g. 10Xv2 versus 10Xv3), or 2) processing of samples at different institutes within a study. To determine if these covariates caused batch effects within a study, we performed principal component regression88. To that end, we preprocessed single studies separately (total counts normalization to median total counts across cells, and subsequent principal component analysis (PCA) with 50 principal components (PCs). For each study, we then calculated the fraction of the variance in the first 50 PCs that could be explained (“pcexpl”) by the covariate of interest (i.e. 10X version or processing institute):
$$pcexpl =\frac{{\sum }_{i=1}^{50}var\left(p{c}_{i}\right|cov)}{{\sum }_{i=1}^{50}var\left(p{c}_{i}\right)}$$
where var(pcI|cov) is the variance in scores for the ith PC across cells that can be explained by the covariate under consideration, based on a linear regression.
Then, 10X version or processing institute assignments were randomly shuffled between samples, and pcexpl was calculated for the randomized covariate. This was repeated over 10 random shufflings, and the mean and standard deviation of the pcexpl was then calculated for the covariate. If the non-randomized pcexpl was more than 1.5 standard deviations above the randomized pcexpl, we considered the covariate a source of batch effect and split the study into separate datasets. Thus both Jain_Misharin_2021 and Lafyatis_Rojas_2019 were split into 10Xv1 and 10Xv2, Seibold_2020 was split into 10Xv2 and 10Xv3, while Banovich_Kropski_2020 was not split based on 10X version nor based on processing location.
Dataset integration
For integration of the datasets into the HLCA core, coarse cell type labels were used as described for integration benchmarking, except cells with lacking annotations were set to “unlabeled” instead of being removed. scANVI was run on the raw counts of the 2000 most highly variable genes (calculated as described above), using datasets as the batch variable. The following parameter settings were used: number of layers: 2, number of latent dimensions: 30, encode covariates: True, deeply inject covariates: False, use layer norm: both, use batch norm: none, gene likelihood: nb, n epochs unsupervised: 500, n epochs semi-supervised: 200, frequency: 1. For the unsupervised training, the following early-stopping parameters were used: early stopping metric: elbo, save best state metric: elbo, patience: 10, threshold: 0, reduce lr on plateau: True, lr patience: 8, lr_factor: 0.1. For the semi-supervised training, the following early-stopping-parameter settings were used: early stopping metric: accuracy, save best state metric: accuracy, on: full dataset, patience: 10, threshold: 0.001, reduce lr on plateau: True, lr_patience: 8, lr_factor: 0.1. The latent embedding generated by scANVI was used for downstream analysis (clustering and visualization). For gene-level analyses (differential expression, covariate effect modeling) un-corrected counts were used.
Data embedding and clustering
To cluster the cells in the HLCA core, a nearest neighbor graph was calculated, based on the 30 latent dimensions that were obtained from the scANVI output, with the number of neighbors set to k = 30. This choice of k, while improving clustering robustness, could impair the detection of very rare cell types. Coarse leiden clustering was done on the graph with resolution r = 0.01. For each of the resulting level 1 clusters, a new neighbor graph was calculated using scANVIs 30 latent dimensions, with the number of neighbors again set to k = 30. Based on the new neighbor graph, each cluster was clustered into smaller “level 2” clusters with leiden clustering at resolution r = 0.2. The same was done for level 3, 4 and where needed 5, with k set to 15, 10, and 10 respectively, and resolution set to 0.2. Clusters were named based on their parent clusters and sister clusters, e.g. cluster 1.2 is the third biggest subcluster (starting at 0) of cluster 1. For visualization, a two-dimensional Uniform Manifold Approximation and Projection89 (UMAP) of the atlas was generated based on the 30-nearest-neighbor graph.
Calculation of cluster entropy of cell type labels and subjects
To calculate cluster cell type label entropy for a specific level of annotation, Shannon entropy was calculated as \(-{\sum }_{i=1}^{k}p\left({x}_{i}\right)log\left(p\right({x}_{i}\left)\right)\), where \({x}_{1}...{ x}_{k}\)are the set of labels at that annotation level, and \(p\left({x}_{i}\right)\) is the fraction of cells in the cluster that was labeled as \({x}_{i}\). Cells without a label at the level under consideration were not included in the entropy calculation. If fewer than 20% of cells were labeled at the level, entropy was set to NA. Entropy of subjects per cluster was calculated in the same way. To set a threshold for "high label entropy”, we calculated the label entropy of a hypothetical cluster with 75% of cells given one label, and 25% of cells given another label. Clusters with a label entropy above that level (0.56) were considered high label entropy clusters. To set a threshold for “low subject entropy”, we calculated the label entropy for a hypothetical cluster with 95% of cells from one subject, and the remaining 5% of cells distributed over all other subjects. Clusters with a subject entropy below that level (0.43) were considered clusters with low subject entropy.
Rare cell type analysis
To determine how well rare cell types (ionocytes, neuroendocrine cells and tuft cells) were clustered together and separate from other cell types after integration, we calculated recall (% of all cells annotated as a specific rare cell type that were grouped into the cluster) and precision (% of cells from the cluster that were annotated as a specific rare cell type) for all level 3 clusters. Nested clustering on Harmony32,89 and Seurat’s RPCA28 output was done based on PCA of the corrected gene counts, re-calculating the PCs for every parent cluster when performing clustering into smaller children clusters, and proceeding clustering as described above. The level 3 clusters with the highest sensitivity for each cell type are shown in figures.
Manual cell type annotation and marker gene selection
Re-annotation of cells in the HLCA core was done by six investigators with expertise in lung biology (EM, MCN, AVM, LEZ, NEB, JAK), based on original annotations and differentially expressed genes of the HLCA core clusters. Annotation was done per cluster, using finer clusters where these represented specific known cell types or states rather than subject-specific variation. Annotations were hierarchical (as the cell type reference), and each annotated cluster was annotated at all levels up to its highest known level.
Marker genes were calculated by performing a t-test-based differential expression analysis on a given cell type (based on SCRAN-normalized, logarithmized counts), compared to the rest of the cells from the same level 1 annotation (i.e. epithelial, endothelial, stromal or immune cells). Genes were then filtered further according to the following criteria: minimum fraction of cells within the cluster in which the gene was detected: 0.25; maximum fraction of cells from outside the cluster in which gene was detected; 0.5, minimum fold change; 1. The top 10 most significant genes that passed filtering were selected as top marker genes. For highly similar cell types, a second differential expression analysis was done, this time comparing the cell type to its closest sister annotations (e.g. submucosal gland (SMG) serous (nasal) cells compared to the remaining SMG cell types). For those cell types, top 10 marker genes consisted of the top 5 genes from the first and second comparison, respectively. Where fewer than 10 genes passed filtering criteria, only those genes were included as markers.
Variance between individuals explained by technical and biological covariates
To quantify to what extent different technical and biological covariates correlated with inter-individual variation in the atlas, we calculated the “variance explained” by each covariate for each cell type. We first split the data in the HLCA core by cell type annotation, merging substates of a single cell type into one (Supplementary Data Table 10). For every cell type, we excluded samples that had fewer than 10 cells of the sample. We then summarized covariate values per sample for every cell type as follows: mean across cells from sample for scANVI latent components (integration results), UMI count per cell, and fraction of mitochondrial UMIs; for all other covariates (i.e. dataset, 3’ or 5’, BMI, cell ranger version (1, 2, 3 or 4), cell viability %, ethnicity, sampling method (e.g. biopsy, brush, donor lung), sequencing platform, sex, single cell chemistry (e.g. 10X 3’ v2), smoking status (active, former or never), subject status (alive disease, alive healthy or organ donor), age, and anatomical region (CCF score)), each sample had only one value, therefore these values were used.
We then performed “PC regression” on every covariate, as described earlier, now using scANVI latent component scores instead of principal component scores for the regression. Samples that did not have a value for a given covariate (e.g. where BMI was not recorded for the subject) were excluded from the regression. Categorical covariates were converted to dummy variables. Celltype - covariate pairs for which only one value was observed for the covariate were excluded from the analysis.
To check to what extent covariates correlated with each other, thereby possibly acting as confounders in the PC regression scores, we determined dependence between all covariate pairs for every cell type. If at least one covariate was continuous, we calculated the fraction of variance in the continuous covariate that could be explained by the other covariate (dummying categorical covariates), and took the square root (equal to Pearson r for two continuous covariates). For two categorical covariates, if both covariates had more than two unique values we calculated normalized mutual information (NMI) between the covariates using scikit-learn90, since a linear regression between these two covariates is not possible.
To control for spurious correlations between inter-individual cell type variation and covariates due to low sample numbers, we assessed the relationship between sample number and mean variance explained across all covariates for every cell type. We found that for cell types sampled in fewer than 40 samples, mean variance explained across all covariates showed a high negative correlation with the number of samples (Extended Data Fig. 7, a). We reasoned that for these cell types, correlations between interindividual variation and our covariates were inflated due to under-sampling. Moreover, we note that at lower sample numbers, technical and biological covariates often strongly correlate with each other across subjects (Extended Data Fig. 7, c). This might lead to the attribution of true biological variation to technical covariates, and vice versa, complicating interpretation of observed inter-individual cell type variation. Consequently, we consider 40 a recommended minimum number of samples to avoid spurious correlations between observed inter-individual variation and tested covariated, and excluded results from cell types with fewer samples.
Modeling variation between individuals at the gene level and gene set enrichment analysis
To model the effect of demographic and anatomical covariates (sex, age, BMI, ethnicity, smoking status, and anatomical location of sample) on gene expression, we employed a generalized linear mixed model (GLMM). We used sample-level pseudo-bulks (split by cell type), since the covariates modeled also varied at sample- or subject level, and not at cell level. Modeling these covariates at cell level, i.e. treating single cells as independent samples even when coming from the same sample, has been shown to inflate p-values36,37. We encoded smoking status as a continuous covariate, setting never to 0, former to 0.5, and current to 1. Anatomical region was encoded into anatomical region ccf-scores as described earlier. As we noted that changes from nose to the rest of the airways and lungs were often independent from continuous changes observed in the lungs only, we encoded “nasal” as a separate covariate, setting samples from the nose to 1, and all others to 0. BMI and age were re-scaled, such that the 10th percentile of observed values across the atlas was set to 0, and the 90th percentile set to 1 (25 and 64 for age, respectively, and 21.32 and 36,86 for BMI). First, we split the lung cell atlas by cell type annotation, pooling detailed annotations into one subtype (e.g. grouping all lymphatic EC subtypes into one) (Supplementary Data Table 10). We then proceeded the same way for every cell type: we filtered out all genes that were expressed in fewer than 50 cells, and all samples that had fewer than 10 cells of the cell type. We furthermore filtered out datasets with fewer than 2 subjects, and refrained from modeling categories in covariates that had fewer than three subjects in their category, for that cell type. To determine if covariance between covariates was low enough for modeling, we calculated the variance inflation factor (VIF) between covariates at subject level. The VIF quantifies multi-collinearity among covariates of an ordinary least squares regression, and a high VIF indicates strong linear dependence between variables. If the VIF was higher than 5 for any covariate for a specific cell type, we concluded covariance was too high and excluded that cell type from the modeling. As many cell types lacked sufficient representation of other ethnicities than ‘’white”, while including ethnicity in the analysis simultaneously reduced the samples that could be included in the analysis to only those with ethnicity annotations, we excluded ethnicity from the modeling.
Gene counts were summed across cells for every sample, within cell type. Sample-wise sums (i.e. pseudo-bulks) were normalized using edgeR’s calcNormFactors function, using default parameter settings. We then used voom91, a method designed for bulk RNA-seq that estimates observation-specific gene variances and incorporates these into the modeling. Specifically, we used a voom extension (differential expression testing with linear mixed models or “Dream”) that allows for mixed effect modeling, and modeled gene expression as:
$$log\left(normcount\right) \sim 1+age + sex + BMI + smoking +nose + ccfscore + \left(1 \right| dataset)$$
where dataset is treated as a random effect, and all other effects are modeled as fixed effects. Resulting p-values were corrected for multiple testing within every covariate using the Benjamini-Hochberg procedure.
Gene set enrichment analysis was performed in R92 using the cameraPR function93 in the limma package94 with the differential expression test statistic. Gene ontology (GO) biological process terms95,96 were tested separately for each comparison. These sets were obtained from MSigDB (v7.1)97 as provided by the Walter and Eliza Hall Institute (https://bioinf.wehi.edu.au/MSigDB/index.html).
Mapping of GWAS results to the lung cell atlas cell types
To stratify GWAS results from several lung diseases by lung cell type, we applied stratified sc-LDSC, a method that can link GWAS results to cell types based on proximity of disease-associated variants to genes differentially expressed in the cell type51. GWAS summary statistics of COPD49 (GWAS catalog ID: GCST007692, dbGaP accession number: phs000179.v6.p2, n_cases = 35,735, n_controls = 222,076), and of lung cancer48 (GWAS catalog ID: GCST004748, dbGaP accession number: phs001273.v3.p2, n_cases = 29,266, n_controls = 56,450) were made available on dbGap upon request. Summary statistics of lung function50 (GWAS catalog ID: GCST007429, n = 321,047 individuals), of asthma47 (GWAS catalog ID: GCST010043, n_cases = 88,486 n_controls = 447,859), and of depression98 (used as negative control, GWAS catalog ID: GCST005902, n_cases = 113,769, n_controls = 208,811) were publicly available. A differential gene expression test was done for every grouped cell type (Supplementary Data Table 10) in the HLCA using a t-test, testing against the rest of the atlas. The top 1000 most significant genes with positive fold-changes were stored as genes characterizing that cell type (“cell type genes”), and used as input for sc-LDSC. Gene coordinates of cell type genes were obtained based on the GRCh37.13 genome annotation. For single nucleotide-polymorphism (SNP) data (names, locations, and linkage-related information), the 1000 genomes European reference (GRCh37) was used, as previously described51. Only SNPs from the HapMap3 project were included in the analysis. For identification of SNPs in the vicinity of cell type genes, we used a window size of 100,000 base pairs. Genes from X and Y chromosomes, as well as HLA genes, were excluded because of their unusual genetic architecture and linkage patterns. For ld-score calculation, a 1 centiMorgan window was used. P-values yielded by scLDSC were corrected for multiple testing for every disease tested, using the Benjamini-Hochberg correction procedure. As a negative control, the analysis was performed with a GWAS of depression, and no cell types were found to be significant (Extended Data Fig. 13).
Extension of the HLCA core by mapping of scRNA-seq and snRNA-seq lung data
To map unseen scRNA-seq and snRNA-seq data to the HLCA, we used scArches, our transfer-learning based method that enables mapping of new data to an existing reference atlas42. scArches trains an adaptor added to a reference embedding model, thereby enabling it to generate a common embedding of the new data and the reference, allowing re-analysis and de novo clustering of the joint data. The data to map was subsetted to the same 2000 highly variable genes (“hvgs”) that were used for HLCA integration and embedding, and hvgs that were absent in the new data were set to 0 counts for all cells. Raw counts were used as input for scArches. Healthy lung data (Meyer_2021) was split into two datasets: 3’ and 5’-based. Lung cancer data (Thienpont_2018) was also split into two datasets: 10Xv1 and 10Xv2.
The model that was learned previously for HLCA integration using scANVI was used as the basis for the scArches mapping. scArches was then run to train adapter weights that allowed for mapping of new data into the existing HLCA embedding, using the following parameters: freeze-dropout: true, surgery_epochs: 500, train base model: false, metrics to monitor: accuracy and elbo, weight-decay: 0, frequency: 1. The following early-stopping criteria were used: early stopping metric: elbo, save best state metric: elbo, on: full dataset, patience: 10, threshold: 0.001, reduce lr on plateau: True, lr patience: 8, lr_factor: 0.1.
Identification of clusters with spatially annotated cell types
The Meyer_2021 study of healthy lung included cell type annotations based on matched spatial transcriptomic data. Many of these annotations were not present in the HLCA core. To determine if these new “spatial cell types” could still be recovered after mapping to the HLCA core, we looked for clusters specifically grouping these cells. We focused on seven spatial cell types: perichondrial fibroblasts, epineurial nerve-associated fibroblasts, endoneurial nerve-associated fibroblasts, CCL adventitial fibroblasts, chondrocytes, myelinating schwann cells, and non-myelinating schwann cells. As these cell types were often present at very small frequencies, we performed clustering at different resolutions to determine if these cells were clustered separately at any of these resolutions. We clustered at resolutions 0.1, 0.2, 0.5, 1, 2, 3, 5, 10, 15, 20, 25, 30, 50, 80, and 100. Minimum recall (% of cells with the spatial cell type annotation captured in cluster) was set to 20%, and minimum precision (% of cells from Meyer_2021 study in the cluster that had the spatial cell type annotation) was set to 25%. The cluster with the highest recall was selected for every spatial cell type (unless this cluster decreased precision by more than 50% compared to the cluster with the second highest recall). If precision of the next best cluster was doubled compared to the cluster with highest recall, and recall did not decrease with more than 20%, this cluster was selected.
Cell type label transfer from the HLCA core to new datasets
To perform label transfer from the HLCA core to the mapped datasets from the extended HLCA, we used the scArches k-nearest neighbor based label transfer algorithm42. Briefly, a k-nearest neighbor graph was generated from the joint embedding of the HLCA core and the new, mapped dataset, setting the number of neighbors to k = 50. Based on the abundance in a cell’s neighborhood of reference cells of different types, the most likely cell type label for that cell was selected, and a matching uncertainty score was calculated. For label transfer to lung cancer and healthy, spatially annotated projected data (Fig. 5b, e), cells with an uncertainty score above 0.3 were set to “Unknown”. For the extended atlas, we calibrated the uncertainty score cutoff by determining which uncertainty levels indicate possible failure of label transfer. To determine the uncertainty score at which technical variability from residual batch effects impairs correct label transfer, we evaluated how label transfer performed at the level of datasets, as these predominantly differ in experimental design. To determine an uncertainty threshold indicative of possible failure of label transfer, we harmonized original labels for 7 projected datasets17,21,53,57,62,64 (one unpublished: “Duong_lungMAP_unpubl”) and assessed the correspondence between original labels with the transferred annotations. To assess the optimal uncertainty cutoff for labeling a new cell as “unknown”, we used these results to generate an ROC curve. We chose a cutoff around the elbow point, keeping the false positive rate below 0.5 (uncertainty cutoff 0.2, true positive rate 0.88, false positive rate 0.49) to best distinguish correct from incorrect label transfers (Extended Data Fig. 17). False positives are either due to incorrect label transfer, or due to incorrect annotations in the original datasets. Cells with an uncertainty higher than 0.2 were set to “unknown”.
Disease signature score calculation
To learn disease-specific signatures based on label-transfer uncertainty scores, cells from the mapped data with the same transferred label were split into low-uncertainty cells (< 0.2) and high-uncertainty cells (> 0.4). We then performed a differential expression analysis on SCRAN-normalized counts using diffxpy99 with default parameters, comparing high- and low-uncertainty cells, including only genes that were expressed in at least 20 cells. The 20 most up-regulated genes based on log-fold changes were selected, after filtering out genes with an FDR-corrected p-value above 0.05 and genes with a mean expression below 0.1 in the high-uncertainty group. To calculate the “score” of a cell for the given set of genes, the average expression of the set of genes was calculated, after which the average expression of a reference set of genes was subtracted from the original average, as previously described100. The reference set consists of a randomly sampled set of genes for each binned expression value. The resulting score was considered the cell’s “disease signature score”.
Version information:
diffxpy: 0.7.4+18.gb8c6ae0
edgeR: 3.28.1, R 4.1.1 (covariate modeling)
LDSC: 1.0.1
Limma: 3.46.0, R 4.0.3 (GSEA)
Scanpy: 1.7.2
scArches: 0.3.5
scIB: 0.1.1
scikit-learn: 0.24.1
scvi-tools (scANVI): 0.8.1
Code availability
The HLCA pipeline for processing of sequencing data to count matrices, used for a subset of HLCA datasets (Methods): https://github.com/LungCellAtlas/scRNAseq_pipelines.
Code used for HLCA project: https://github.com/LungCellAtlas/HLCA_reproducibility
Code for users to map new data to the HLCA core (for automated mapping, see HLCA platforms below): https://github.com/LungCellAtlas/mapping_data_to_the_HLCA
HLCA data portals and mapping to the HLCA:
Automated mapping to the HLCA and label transfer can be done with scArches42 at FASTGenomics (https://beta.fastgenomics.org/p/hlca)
Automated mapping to the HLCA, and label transfer with Azimuth14,42 (not shown in manuscript) can be done at: azimuth.hubmapconsortium.org.
Label transfer with CellTypist72 (not shown in manuscript): https://www.celltypist.org/models
Data availability
The HLCA (raw and normalized counts, integrated embedding, cell type annotations and clinical and technical metadata) is publicly available and can be downloaded via FASTGenomics:
https://beta.fastgenomics.org/p/hlca or via cellxgene
https://cellxgene.cziscience.com/collections/6f6d381a-7701-4781-935c-db10d30de293
The HLCA core reference model and embedding for local mapping to the HLCA can moreover be found on Zenodo: https://zenodo.org/record/6337966#.Yid5Vi9Q28U.
The original, published datasets that were included in the HLCA can be accessed under GEO accession numbers GSE135893, GSE143868, GSE128033, GSE121611, GSE134174, GSE150674, GSE151928, GSE136831, GSE128169, GSE171668, GSE132771, GSE126030, GSE161382, GSE155249, GSE135851, GSE145926, EGA study IDs EGAS00001004082, EGAS00001004344, EGAD00001005064, EGAD00001005065, and under urls https://www.synapse.org/#!Synapse:syn21041850, https://data.humancellatlas.org/explore/projects/c4077b3c-5c98-4d26-a614-246d12c2e5d7, https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001750.v1.p1, https://www.nupulmonary.org/covid-19-ms2/?ds=full&meta=SampleName, https://figshare.com/articles/dataset/Single-cell_RNA-Seq_of_human_primary_lung_and_bronchial_epithelium_cells/11981034/1, https://covid19.lambrechtslab.org/downloads/Allcells.counts.rds, https://s3.amazonaws.com/dp-lab-data-public/lung-development-cancer-progression/PATIENT_LUNG_ADENOCARCINOMA_ANNOTATED.h5, https://github.com/theislab/2020_Mayr, https://static-content.springer.com/esm/art%3A10.1038%2Fs41586-018-0449-8/MediaObjects/41586_2018_449_MOESM4_ESM.zip, http://blueprint.lambrechtslab.org/#/099de49a-cd68-4db1-82c1-cc7acd3c6d14/*/welcome (see also Supplementary Data Table 1).