Data input
ezSingleCell accepts input from multiple technologies (Smart-Seq2, 10x, CITE-Seq, Multiome, Visium) in a variety of formats: (i) text files (txt, csv, or tsv) or 10x Cell Ranger output for scRNA-seq and data integration analysis, (ii) 10x CITE-Seq count output for CITE-Seq, and 10x Cell Ranger ARC output for multiome, (iii) 10x Cell Ranger ATAC output for scATAC-Seq analysis, and (iv) 10X Space Ranger output for spatial transcriptomics analysis.
Data pre-processing
ezSingleCell offers quality control and data normalization with log-normalization or SCTransform19. The normalized data can then be scaled, and Principal Component Analysis (PCA) is performed. For sc-multi-omics data, ezSingleCell has 3 different methods to perform multimodal analysis, namely Seurat WNN9 and MOFA + 20. For Seurat WNN, both assays (RNA and ADT for CITE-seq data; RNA and ATAC for 10x scMultiOme data) are first pre-processed and dimensional reduction performed independently. The closest neighbours are thereafter calculated for each cell based on a weighted combination of RNA and protein similarities. The resulting WNN is usable for visualization and clustering. After pre-processing, MOFA + infers a low-dimensional representation of the data in terms of a small number of (latent) factors that capture the global sources of variability. MOFA + employs Automatic Relevance Determination (ARD), a hierarchical prior structure that facilitates untangling variation that is shared across multiple modalities from variability that is present in a single modality.
For scATAC-Seq data, ezSingleCell employs the Signac package21 to perform term frequency-inverse document frequency (TF-IDF) normalization. This is a two-step normalization procedure, that both normalizes across cells to correct for differences in cellular sequencing depth, and across peaks to give higher values to more rare peaks. This is followed by feature selection using only the top n% of features (peaks) for dimensional reduction or removing features present in less than a specified number of cells. Singular value decomposition (SVD) is then performed on the TD-IDF matrix of the selected features (peaks) to return a reduced dimension representation. This workflow is known as latent semantic indexing (LSI). Users can visualize the correlation between each LSI component and sequencing depth.
For spatial transcriptomics data, quality control functions are available and data normalization is done using SCTransform22. In spatial data, the variance in molecular counts per spot can be substantial and is also dependent on the tissue anatomy, particularly if there are differences in cell density across the tissue. For example, tissues that are depleted for neurons (such as the cortical white matter), reproducibly exhibit lower molecular counts. As a result, standard approaches such as log normalization which forces all cells or spots to have the same total count, are inappropriate. Therefore, SCTransform is recommended over log-normalization as it builds on regularized negative binomial models of gene expression to account for technical artifacts while preserving biological variance.
Clustering and dimension reduction
Clustering in ezSingleCell employs graph-based community detection where a k nearest neighbour (KNN) graph is constructed based on the Euclidean distance in the PCA space and the edge weights between two cells are refined based on the shared overlap in their local neighbourhoods (Jaccard similarity). Modularity optimization techniques such as the Louvain algorithm23 are then applied to iteratively cluster the cells together to optimize the standard modularity function. The user can adjust the clustering resolution to obtain broad or fine-grained clusters with increased values leading to a greater number of clusters. For dimensionality reduction, UMAP and tSNE are available. The user can specify the number of PCA dimensions to be used.
For scMulti-omics data, closest neighbours of each cell are calculated based on a weighted combination of RNA and protein similarities. The cell-specific modality weights and multimodal neighbours are calculated using a function ‘FindMultiModalNeighbors’. Clustering is then performed using a shared nearest neighbour (SNN) modularity optimization technique such as the Louvain algorithm.
To cluster spatial data, ezSingleCell offers two methods, namely Seurat and GraphST. Seurat use a graph-based community detection algorithm such as Louvain clustering where a KNN graph is constructed based on the Euclidean distance in the PCA space. GraphST combines graph neural networks with self-supervised contrastive learning to learn latent representations of spots from gene expression and spatial information to accomplish spatial clustering.
Cell type identification
To identify cell types of clusters, ezSingleCell implements our in-house CELLiD algorithm. CELLiD employs the average expression of clusters for correlation analysis with the reference dataset from DISCO database (https://www.immunesinglecell.org/) to infer the cell type. It is available in the Single cell RNA-seq, scIntegration, and scMultiomics modules.
Data visualization
ezSingleCell offers UMAP, tSNE, violin, feature, ridge, and volcano plots for data visualization. For spatial datasets, users can visualize their data with interactive plots showing spatial localization. For scATAC-seq datasets, users can visualize their data using CoveragePlot showing the transcription factors and peaks within given regions of the genome.
Batch effect correction and data integration
ezSingleCell contains four popular and best-performing methods namely Seurat, Harmony25 and fastMNN26 identified from previous benchmarking studies18,27. In this data integration step, users can customise tuning parameters such as number of features to integrate. Users can quantitatively benchmark the data integration results with metrics such as ARI, iLISI, and cLISI. For scATAC-seq datasets, batch effect removal in ezSingleCell uses the reciprocal LSI projection to find integration anchors between the two datasets, followed by integration of the datasets in the low-dimensional cell embeddings (the LSI coordinates). For spatial datasets, GraphST enables vertical and horizontal batch integration of multiple tissue sections.
scRNA-seq downstream analyses
Differentially expressed genes analysis is available in ezSingleCell using the ‘FindAllMarkers’ function from the Seurat package with statistical tests like Wilcoxon, bimod, and t-test available. Available downstream analyses include gene set enrichment analysis (GSEA)28 analysis with the msigdbr database, and cell-cell communication with the CellphoneDB29 package. Users can select from multiple reference resources such as ICELLNET, CellPhoneDB, CellChatDB, CellTalkDB and, and OmniPath for cell-cell communication prediction.
Single cell ATAC-seq module
For processing ATAC-Seq data, ezSingleCell uses MACS2 to perform peak calling. Users can visualize the results with Coverage Plot and perform differential accessibility (DA) test to find differentially accessible regions between clusters using logistic regression. ezSingleCell is also equipped with chromVAR for analysing chromatin accessibility variations in scATAC-Seq data. It computes the gain or loss of chromatin accessibility for known transcription factor motifs or de novo sequence motifs. It is computed on a per-cell basis and can be visualized for different motif sequences. Differentially active motifs can be further computed between different cell clusters.
ezSingleCell also offers the ability to merge multiple scATAC-Seq datasets by performing peak calling on each dataset independently and create a common set of peaks across all datasets to be merged. The combined set of peaks in each dataset is then quantified and the shared set of features across datasets are merged.
Integrating scRNA-seq and scATAC-seq data
ezSingleCell allows users to integrate their scATAC-seq data with scRNA-seq data by performing cross-modality integration and label transfer. In the integration process, ezSingleCell estimates the RNA-seq levels from ATAC-seq by quantifying gene expression ‘activity’ from ATAC-seq reads. The internal structure of the ATAC-seq data is then learned using LSI. The ‘anchors’ are then identified between the ATAC-seq and RNA-seq datasets followed by data transfer between datasets (either transfer labels for classification or impute RNA levels in the ATAC-seq data to enable co-embedding). Finally, the scRNA-seq and scATAC-seq cells are co-embed in the same low dimensional space for visualization using the same anchors to transfer cell type labels to impute scRNA-seq values for the scATAC-seq cells. The measured and imputed scRNA-seq data are then merged and can be visualized via a UMAP.
Integration of scRNA-seq and spatial data (deconvolution)
Users can employ two different methods, namely Seurat and GraphST to integrate their spatial data with scRNA-seq reference data to annotate the cell types present. Publicly available scRNA-seq references are available at the DISCO database (https://www.immunesinglecell.org/) or the Human Cell Atlas (https://www.singlecell.broadinstitute.org) for human data, and Allen Brain Atlas data portal (http://www.brain-map.org) for mouse brain data.