Initial Session: https://interactivereport.github.io/cellxgene_VIP/static/MS.vip.session.txt
Colon is used as the delimiter for VIP to parse the initial settings for embedding and coloring. In our demo, the default layout is "umap_harmony" while cells will be colored based on “cell_type” categories. If keyword “scale” is not found in “Data” line, VIP will set the figure option “Scale data to unit variance for plotting” to “Yes”. The value of “Initial Session” will populate the “Load from a URL” input in “Load Session” interface.
Sample data used for demonstration of cellxgene VIP functionality. Human MS brain snRNA-seq dataset raw data (fastq, sample metadata and cell type identification files) from Schirmer et al.23 was downloaded and re-processed with 10x cellranger 3.0.2 (count) by mapping to human pre-mRNA genome reference (GRCh38 reference − 3.1.0, July 24, 2019). Quality control criteria were applied as follows: filtered out cells that have less than 1000 counts, or more than 40000 counts, and kept cells with < 0.2 mitochondria gene counts. Harmony47 and Liger48 were used to correct for batch effects. Louvain clustering method49 was used to cluster cells into different subgroups (clusters) on both batch corrected results. Both tSNE50 and UMAP51 coordinates are provided. Authors’ cell type annotation was in great concordance with the Louvain clustering result, except we decided to merge OL-A, OL-B and OL-C clusters into OL (Oligodendrocyte) and EN-L2-3-A and EN-L2-3-B clusters into EN-L2-3 (Excitatory Neuron Layer).
Integrated gene set query. Cellxgene VIP equips a helper tool, xGenesets that provides an Application Programming Interface (API) for web-based applications to select a list of genes in a gene set. The gene sets are defined based on pathways, gene sets, and categories from various sources, including KEGG Pathways52, WikiPathways53, Small Molecule Pathway Database54, Reactome55, Gene Ontology56, Molecular Signatures Database57 from Gene Set Enrichment Analysis (GSEA58), and LIPID MAPS Proteome Database59. xGenesets currently has 175,537 gene sets stored in a MySQL database, covering gene sets for species including human (115,497 records), mouse (30,554 records) and rat (29,486 records). xGenesets features auto-completion for quick gene set selection, gene set database browsing, and the full member gene list of any gene set. It can be easily embedded into an online application by leveraging jQuery plugin, DataTables, and bootstrap (either v3 or v4). An example of using xGenesets API can be found at https://bxaf.net/genesets. xGenesets API is freely available for all usages.
Generation of combo annotations. Celllxgene VIP also provides a fast annotation combination tool to label cells across selected categories. A new category, “Custom_combine”, will be automatically created to annotate cells across different categories, such as cell type and diagnosis. The user can also re-order the annotations (e.g., EN-L2-3:MS, EN-L2-3:Control) in this category which can be used in VIP plotting functions. Cells not belonging to any combined annotations will not be considered in plotting.
Differential gene expression. Three statistical test methods of differential gene expression analysis are provided including Welch’s t-test (both cellxgene and diffxpy implementations), Wilcoxon rank test, and Wald’s test. The statistical test results are presented in a table format including log2 fold change, p-value, and multi-testing adjusted p-value by Bonferroni in cellxgene or Benjamini-Hochberg in diffxpy. Notably, we provide users with simple test methods owing to quick response time within the interactive framework. However, more advanced statistical methods developed for differential gene expression analysis of scRNA-seq data such as glmmTMB26 or NEBULA27 are recommended even if it takes much longer computational time to get results considering that there would be covariates that need to be modeled in a proper statistical test. To accommodate data exploration of offline results, we developed a loading script for users to load and explore pre-computed results interactively inside the VIP environment in table, interactive volcano, and bubble heatmap forms.
glmmTMB differential analysis. As an example of an advanced testing method, we applied the glmmTMB algorithm26 to the data of Schirmer et al23. The glmmTMB method uses a generalized linear mixed model framework combined with the template model builder to determine differentially expressed genes. This method is very flexible allowing the user to adjust for covariate effects and model subject to subject or sample to sample variance components. The user can choose a variety of response families, dispersion models, zero inflation models, and variance/covariance structures. Unfortunately, this approach is very computationally demanding so its DEG results were pre-computed and then visualized using cellxgene VIP.
NEBULA differential analysis. As a second example of an advanced testing method, we applied the NEBULA algorithm27 (both NEBULA-LN and NEBULA-HL) to the data of Schirmer et al23. The NEBULA method uses a generalized linear mixed model framework to determine differentially expressed genes. This method is flexible allowing the user to adjust for covariate effects, to account for subject-level overdispersions, and to account for cell-level overdispersions. A large sample approximation in NEBULA-LN and the h-likelihood method of NEBULA-HL allows NEBULA to be computationally efficient and to be much faster than other single-cell level DEG tools. After calculating the DEGs for the data of Schirmer et al23, the results were visualized using cellxgene VIP.
Filtering criteria of DE analysis. For the NEBULA and glmmTMB differential analysis, the counts data of Schirmer et al23 was imported into R and two rounds of QC filtering were applied. In the first round, filtering was applied to the entire count matrix. We require: 1) a library size between 200 to 20M, 2) genes must be expressed in at least 3 cells, and 3) cells must have at least 250 genes expressed. Additionally, mitochondrial and ribosomal genes were filtered out at this stage. During the second round, the filters focused on the cell subset corresponding to a cell type of interest. With this cell subset, we required a minimum of 10% of the cells to be expressed in either group for the contrast of interest, a minimum of 3 cells per subject, and a minimum of 2 subjects per group.
Loading of pre-computed DEGs. The pre-computed DEGs can be visualized in cellxgene VIP as shown in Fig. 4b and 4d. The user needs to create a sqlite database file which must be in the same folder as the h5ad file and share the same name but with ‘db’ as the file extension, e.g., “ms_nature_2019_Schirmer.db” and “ms_nature_2019_Schirmer.h5ad” are in the same folder to be loaded. If the database file has changed, the server doesn’t need to restart.
There is a script named “DEG2sqlite3.py” under the bin folder, which saves all pre-computed DEG results in csv format to sqlite3 file. It takes 2 parameters where the first parameter is the full path of all csv files, and the second parameter is the name of the database file that will be saved in the same folder as csv files, e.g.,