Unsupervised K-means clustering identified specific patient clusters per immune cell-type
After profiling 120 SLE patients for cell sorted bulk RNA-seq data (CD4+ T cells, CD14+ monocytes, B cells, and NK cells) from a multi-racial/ethnic cohort, a total of 10% of the samples were removed after QC filtering retaining 415 samples (91 NK, 105 B cells, 108 CD4+ T-cells, and 111 CD14+monocytes) (Figure S1A). Batch effects were taken into account in all the analyses using the limma software. Using the QC’ed data, we wanted to look for specific-transcriptomic effects in each of these cell types using a four-tier approach which included: unsupervised clustering; differential expression analyses, gene co-expression analyses, and random forest (Fig. 1).
Unsupervised K-means clustering on the individual cell types was utilized to identify patient sub clusters. Clusters with a Jaccard mean stability score greater than 0.6 were considered stable and therefore only clusters with a score greater than or equal to 0.6 were retained for further analyses (Table S3). Clustering on CD4+T cells and CD14+ monocytes yielded three distinct clusters, while only 2 clusters were identified for B cells and NK cells (Fig. 2A). Thus, we sought to test for association between the transcriptomic K-means clustering and the clinical and demographic parameters (Fig. 2B).
We observed in CD4+ T cells a positive association with lupus severity index19, SLEDAI activity score10,21, ACR criteria thombocytopenia, and ACR criteria Oral Ulcers. CD4 transcriptomic cluster 1 was comprised of 44 patients and was characterized by a high prevalence of photosensitivity, seizures, younger age at diagnoses, and presence of anti-Smith autoantibody positivity (Table S4). Interestingly, we detected a positive association with CD4+ T cells cluster1 and lupus severity index, which is a validated tool based on weighted scores of the ACR classification criteria19. The ACR criteria positively associated with CD4+ T cells cluster1 were history of seizure and thrombocytopenia. There was also a positive association with this cluster and the SLICC score, which tracks organ damage over time, likely associated with disease severity. The same cluster was negatively associated with age at diagnoses, ACR leukopenia, ACR psychosis (Fig. 2B).
CD4+ T transcriptomic cluster 2 was positively associated with age at diagnoses, SLEDAI score, and ACR leukopenia. The same cluster was negatively associated with ACR seizure. Patients with a high prevalence of malar rash, psychosis, anti-dsDNA autoantibody positivity, and SLEDAI score constitute the CD4 transcriptomic cluster 2 (Table S4). Finally, CD4 + T transcriptomic cluster 3 was negatively associated with ACR malar rash and presence of anti-Smith autoantibody positivity.
In CD14+ monocytes we observe a significant positive association in cluster 1 with White as well as a negative association with Asian (Fig. 2B). Moreover, cluster 2 of the CD14+ monocytes was negatively associated with race White. This was the only association observed for cluster 2. On the clinical level, cluster 1 of CD14+ monocytes consisted of 29 patients and it was defined by lower rates of fare, and anti-dsDNA autoantibody positivity, while cluster 2 encloses 58 patients characterized by high rates of flare severity (measured as mild 1, moderate 2, severe 3) (Table S5).
The two clusters of CD19, B cells, showed for cluster 1 a negative association for SLICC score, lupus severity index, and ACR seizure, while on the other hand for cluster 2 we observe a positive association for SLICC score, lupus severity index, and ACR seizure. (Fig. 2B). Cluster 1 of B cells comprised of 68 patients with a low prevalence of seizure, lymphopenia and by low score for lupus severity index and SLICC score, which suggested low disease activity. On the other hand, cluster 2 of B cells (n = 38) featured patients with high prevalence of seizure, lymphopenia and by a high score for lupus severity index and SLICC score, suggesting high disease activity (Table S6).
In the two clusters of the NK cells we observed a significant association between the presence of antinuclear antibody and cluster 1 as well as a negative association between antinuclear antibody and cluster2. No further associations were seen between the clinical and demographic parameters in the two clusters of the NK cells (Fig. 2B, Table S7).
Interestingly, from this correlation analysis, we noted a similar relationship for the transcriptomic clusters of CD4 + T cells and the clinical K-means clusters previously identified by our group solely using ACR clinical criteria12 (Fig. 2C). Supporting the similarity between clusters identified in the CD4+ T cells and the clustering using uniquely clinical features we identified an important overlap in patient membership (Fig. 2C). A total of 61% of the individuals in the CD4+ T cell transcriptomic cluster 1 were also present in the severe cluster of the clinical data. Finally, as also observed in the clinical clusters, we statistically confirmed association with CD4+ T cells transcriptomic clusters and lupus severity index (p < 0.05) (Fig. 2D). Overlaying this information allowed us to look at a global relationship between clinical data and the transcriptional profiles across the cell types. In addition, as we detected an association between CD14+ monocytes and ethnicity, we used DESeq2 to identify the differentially expressed genes between SLE patients from different ethnic backgrounds.
Differences in gene expression levels within each cell type significantly differs between ethnic populations
Differential expression analysis on demographic and clinical features identified a few numbers of significant (FDR < 0.05 and log2FC 1) differentially expressed genes. Specifically, 29 genes were identified for the DE on lupus severity index, 3 on SLICC score, 5 for DE on SLEDAI score, 9 for DE on ACR criteria for lupus nephritis (Table S8). Pathway analyses derived from differential expression analysis on the three different clinical indexes and the presence of lupus nephritis, a type of kidney complication caused by SLE, across the immune cell types showed enrichment of pathways in CD4 and CD14 for lupus severity index and SLICC score while SLEDAI score is enriched in CD4 only (Figure S6).
Remarkably, a larger number of differentially expressed genes between Asian and White groups were identified for each cell-type (Fig. 3A, Figure S2-5). In the CD4+ T cells for the White cohort the Actin Filament Associated Protein 1 (AFAP1), USP32P1, and, NAP1L3 (Nucleosome Assembly Protein 1 Like 3) were the top three significantly upregulated (padj < 0.05) genes. ARHGEF10, and FMN1 were the top significant genes in CD4+ T cells upregulated in Asians. An example of significantly up-regulated genes in CD14+ monocytes cells for the White cohort were SNORD3B-2 and the lncRNA maternally expression gene 3, MEG3. In Asians, Neurotensin Receptor 1 (NTSR1), and RETN were among the top significantly up-regulated genes in CD14+ monocytes. Top up-regulated genes in B cells for the White cohort were ARHGAP24 and CHL1 (Cell Adhesion Molecule L1 Like). Genes encoding for immunoglobulin heavy- and light-chain genes were statistically significantly up-regulated in Asians in B cells. Regarding NK cells, examples of up-regulated significant genes in Whites were SNORD3B-2, and Cytokine Like 1 (CYTL1), while an example of up-regulated genes in Asians were HPGD, and ADAM19. Pathway analyses in Asians revealed up-regulated pathways involved in metabolism and transcriptional activity in CD14+ monocytes and CD4+T cells whereas in Whites, both B cell and NK cells were enriched for up-regulated pathways for integrin and IL2 signaling indicating a higher activity of specific cell types in Asians and Whites respectively (Fig. 3B). There were no genes shared across all four cell types, (Table S2) and only RPL3P2 is shared across CD19, CD4, and NK cells.
Gene co-expression analyses within ethnic populations and disease activity revealed specific modules of genes.
Classical approaches to analyze the transcriptome data by using differential gene expression analysis based on sample groups defined by a selection of clinical parameters precluded dissection of the heterogeneity of SLE. Co-expression analysis, on the other hand, identifies similarly regulated genes across samples, and then groups these genes into modules, which can then be explored for each patient sample individually or for entire patient groups 22,23. As we found the largest number for DE genes on ethnicity, we decided to use CoCena2 on the patients’ data grouped by ethnic background combined with disease activity (SLEDAI score). Hence, patients were divided into White-high, Asian-high (high disease activity defined by a SLEDAI score > = 6) from White-low, Asian-low (low disease activity defined by a SLEDAI score < 6) for each cell type. The CoCena pipeline was run on each cell line independently. The pipeline identified 8, 5, 10, and 4 co-expression modules for CD4 + T cells, CD14+ monocytes, CD19 and NK cells respectively. Analysis of the modules revealed group-specific enrichment of co-expressed gene modules (Fig. 4A). Enrichment analyses on each of the modules identified associated gene signatures displaying distinct functional characteristics, which distinguish the different sample groups. (Fig. 4B, Fig. 5). We then annotated the modules according to their behavior into seven categories: ethnicity specific independent of disease activity, opposite behaviors according to ethnicity, White specific, Asian specific, low disease activity specific, high disease activity specific, and high disease activity specific for White. For example, the IL6 JAK STAT3 signaling pathway was identified as having opposite behavior in Asian and Caucasian populations and was enriched in the modules dark grey and gold for CD4+ T cell. More precisely the dark grey module was highly expressed in Asians patients with low disease activity and in White patients with high disease activity. The maroon module of CD14+ monocytes was annotated as being specific for low disease activity as it was low expressed in Asians and White individuals with high SLEDAI score. Interestingly, seven of the 10 modules in B cells were highly expressed in Whites compared to Asians regardless of their disease activity, while three of the four modules in the NK cells were exclusively highly expressed in Asians patients with high disease activity (Fig. 4A). An extended analysis of the modules revealed a prominent enrichment of inflammatory, autoimmune, and interferon pathways (Fig. 5, Figures S7).
Machine Learning approaches predicted SLE disease activity in individual ethnic groups.
Based on the results of the previous analyses and on the clustering of the patients according to co-expressed modules found on ethnicity and disease activity, we decided to use random forest (RF) on the transcriptomic data for each cell type within each racial group to discriminate between patients with high disease activity (SLEDAI score > = 6) and low disease activity (SLEDAI score < 6). Receiver operating characteristic (ROC) analysis of each RF model showed an AUC of 0.8 (Fig. 6A), indicating the consistent efficiency of the model in discriminating high disease activity samples from low disease activity samples in CD4 + T cells and CD19 cells when the whole cohort is taken into account. With regards to the White cohort, the RF model trained on CD4 + T cells and CD14+ monocytes showed an AUC of 0.71 and 0.79 respectively. On the other hand, the models trained on the Asian cohort showed an AUC higher than random only in CD4 T cells (AUC 0.62). The models trained with the top feature of the other ethnic background showed an AUC less than 0.6 which indicates inefficiency in the model in discriminating patients based on disease activity and underlaying the ethnicity specific features for each model (Figure S8).
The top contributing feature identified by the random forest algorithm to segregate high and low disease activity patients in CD4+ T cell in the White cohort were EPSTI1 and LAIR1, while in the Asian cohort these were: IRS1, RRM2, ATP11C, ACCS, and CCR6. With regards to the B cells, the top features for the White cohort were SDK2, IGKV2D, LAIR1, DSP, and IGHV3 while top features for the Asians cohort were: CCFC292, SUDS3, DUSP4, and HSPB1 (Fig. 6B). Importantly some of these genes have been previously associated with SLE. We observed that the model was able to select SLE relevant features. In summary, these results suggested the potential interaction between clinical features and potentially disease related genes, which could help explain the multifactorial, heterogeneous and systemic nature of the disease within White and Asians.