Identification of DEGs and diagnostic biomarkers in lung biopsy
The gene expression data of a total of 29 SSc-ILD and 9 HC samples from the test set (GSE81292 and GSE76808) were retrospectively analyzed in this study.
First, the DEGs of the metadata were analyzed using the “limma” package after removing the batch effects. A total of 46 DEGs were obtained: 10 genes were significantly upregulated and 36 genes were significantly downregulated (Fig. 1A). Next, two different algorithms were used to screen potential biomarkers. The DEGs were narrowed down using the LASSO regression algorithm, resulting in the identification of 6 variables as diagnostic biomarkers for SSc-ILD (Fig. 1B), and a subset of 10 features among the DEGs were determined using the SVM-RFE algorithm (Fig. 1B). Then, four overlapping genes (CDH3, TNFAIP3, FOSB, and C10orf10) between these two algorithms were ultimately selected (Fig. 1B). Furthermore, to obtain more accurate and reliable results, the validation dataset (GS48149) was used to verify the expression levels of these four genes in SSc-ILD lung biopsies. Consistent with the test set, CDH3 was upregulated in SSc-ILD (p < 0.001; Fig. 1C), while TNFAIP3 was notably deregulated compared with those in the HC group (p = 0.0006; Fig. 1D). However, there was no significant difference between the two groups of FOSB and C10orf10 expression. Validation of SSc-ILD lung tissues by qPCR and western blotting showed that, along with the changes in these two genes (CDH3 upregulation and TNFAIP3 downregulation), their encoded proteins (Cadherin 3 and TNFAIP3, respectively) also showed the same trend of changes in SSc-ILD (Fig. 1E–H). Furthermore, the expression of TNFAIP3 in alveolar macrophages derived from the alveolar lavage fluid of patients with SSC-ILD was decreased too (Fig. 1I).
On the basis of the above findings, we chose to conduct more detailed analyses of CDH3 and TNFAIP3. These two genes were applied to establish a diagnostic model using logistic regression. In the test set, the diagnostic ability of TNFAIP3 and CDH3 in discriminating SSc-ILD from the control samples displayed a favorable diagnostic value, with an AUC of 0.98 (95% CI 0.931–0.972) for TNFAIP3 and an AUC of 0.98 (95% CI 0.904–0.968) for CDH3 (Fig. 1J). A powerful discriminative ability was confirmed in the validation set, with an AUC of 0.90 (95% CI 0.766–0.972) for TNFAIP3 and an AUC of 0.91 (95% CI 0.745–0.995) for CDH3 (Fig. 1K), which indicated that the featured biomarkers had potential diagnostic ability.
Enrichment Analysis
GO and KEGG enrichment analyses of the DEGs were performed to obtain more information about TNFAIP3 and CDH3. The GO-enriched sets associated with these two genes were drawn in a chord plot (Fig. 2A). These GO sets are mainly involved in the following bioprocesses: response to lipopolysaccharide, response to molecules of bacterial origin, negative regulation of cytokine production, regulation of transforming growth factor beta 1 (TGF-β1) production, regulation of endothelial cell apoptotic processes, tissue homeostasis, negative regulation of immune response, regulation of inflammatory response, regulation of B cell activation, and regulation of epithelial cell proliferation, all of them were associated with pulmonary fibrosis [3–5], in particular, TGF-β1 is the master regulator of fibrosis. Furthermore, GO sets such as regulation of cytokine production, immune response, inflammatory response, and B cell activation indicated humoral immunity was dominant in the occurrence and development of pulmonary fibrosis in SSc patients. In addition, the enriched KEGG pathways that included CDH3 and/or TNFAIP3 were as follows: IL-17 signaling, TNF signaling, cytokine–cytokine receptor interaction, rheumatoid arthritis, viral protein interaction with cytokine and cytokine receptor, NF-κB signaling pathway, chemokine signaling pathway, AGE-RAGE signaling in diabetic complications, and NOD-like receptor signaling (Fig. 2B).
GSEA of all the normalized genes was performed for functional and pathway enrichment analyses in a broader and deeper perspective. All signaling pathways associated with collagen synthesis were activated (Fig. 2D). Puzzlingly, the top five enriched sets, which are known to promote fibrosis, were inhibited in SSc-ILD (Fig. 2C): tyrosine kinases receptor signaling [6], MAPK signaling [7], EGF–EGFR signaling [8], secreted factors signaling, and nuclear receptor meta signaling pathway [9]. This could be the result of negative feedback regulation at the end-stage of pulmonary fibrosis in an attempt to prevent further fibrosis.
Immune Cell Infiltration
We next explored the composition of immune cells in SSc-ILD and HC samples (Fig. 3A). The proportions of resting NK cells (p < 0.001), activated dendritic cells (p = 0.012), and eosinophils (p = 0.001) in SSc-ILD were significantly lower than normal, while the proportions of plasma cells (p < 0.001), M2 macrophages (p < 0.001), and resting mast cells (p < 0.001) in SSc-ILD were significantly higher (Fig. 3A). Subsequently, the correlation of 22 types of immune cells was calculated (Fig. 3B). M2 macrophages were significantly negatively correlated with activated mast cells (r = − 0.48), CD4 naïve T cells (r = − 0.37), activated dendritic cells (r = − 0.46), resting NK cells (r = − 0.37), eosinophils (r = − 0.22), memory B cells (r = − 0.32), gamma delta T cells (r = 0.22), and M1 macrophages (r = 0.17), and the resting mast cells were negatively correlated with activated mast cells (r = − 0.69), eosinophils (r = − 0.47), monocytes (r = − 0.63), resting NK cells (r = − 0.49), activated dendritic cells (r = − 0.43), neutrophils (r = − 0.38), and plasma cells (r = 0.49). Plasma cells were negatively correlated with activated mast cells (r = − 0.53), eosinophils (r = − 0.57), resting NK cells (r = − 0.43), activated dendritic cells (r = − 0.31), activated CD4 memory T cells (r = − 0.31), neutrophils (r = − 0.44), M0 macrophages (r = − 0.28), and M1 macrophages (r = − 0.4), while they were positively correlated with regulatory T cells (r = 0.4). Activated dendritic cells were significantly positively correlated with activated mast cells (r = 0.56), eosinophils (r = 0.56), resting NK cells (r = 0.78), and neutrophils (r = 0.42). Eosinophils were significantly positively correlated with activated mast cells (r = 0.75), activated dendritic cells (r = 0.56), gamma delta T cells (r = − 0.34), and regulatory T cells (r = 0.4).
Correlation Analyses Between The Two Candidate Genes And Infiltrating Immune Cells
As shown in Fig. 4A, CDH3 was positively correlated with plasma cells (r = 0.43, p = 0.007), M0 macrophages (r = 0.40, p = 0.01), and resting mast cells (r = 0.32, p = 0.049), and negatively correlated with M1 macrophages (r = − 0.35, p = 0.03), resting NK cells (r = − 0.40, p = 0.01), activated mast cells (r = − 0.60, p < 0.001), eosinophils (r = − 0.60, p < 0.001), and monocytes (r = − 0.60, p < 0.001).
As shown in Fig. 4B, TNFAIP3 was positively correlated with activated mast cells (r = 0.56, p < 0.001), resting NK cells (r = 0.36, p = 0.03), eosinophils (r = 0.34, p < 0.032), monocytes (r = 0.34, p = 0.34), and activated dendritic cells (r = 0.34, p = 0.039), and negatively correlated with macrophages 2 (r = 0.58, p < 0.001) and resting mast cells(r = − 0.72, p < 0.001).