Analysis of immune infiltration in keloid patients.
Firstly, in order to describe the immune difference between keloid samples and healthy skin samples, we used single sample gene set enrichment analysis (ssGSEA) method to compare the difference of immune cell subsets enrichment scores of each sample in GSE44270. The results showed that T cell (Effector memory CD8+ T cell and T follicular helper cell), macrophage and monocyte infiltration were higher in keloid samples, while Effector memory CD4+ T cell, Type 17 T helper cell, Natural killer cell, Immature dendritic cell, Immature B cell and Mast cell were significantly different, but the infiltration was lower than that in healthy skin samples (Fig. 1A, B).In order to further verify the results of immune infiltration, we used the same method to evaluate the difference of 28 immune cell scores in GSE7890.The results showed that there were significant differences in T cells, including Effector memory CD8+T cell, Central memory CD8+T cell, Effector memory CD4+T cell, Activated CD4+T cell, Type 17 T helper cell, Type 1 T helper cell, Type 2 T helper cell, Natural killer T cell and T follicular helper cell. Other differential immune cells were Activated B cell, Natural killer cell, Immature dendritic cell, Macrophage and Monocyte. Effector memory CD8 T cell, Macrophage, Monocyte and T follicular helper cell immunocytes were also highly infiltrated in keloid samples (Fig. 1C, D). Combining the results of the two analyses, we obtained 9 kinds of differentially expressed immune cells in keloid patients (Effector memory CD8+T cell, Macrophage, Monocyte, T follicular helper cell, Effector memory CD4+T cell, Type 17 T helper cell, Activated B cell, Natural killer cell and Immature dendritic cell). Subsequently, we used Glmnet algorithm and Lars algorithm for LASSO regression analysis to further identify the characteristic immune cells associated with keloid progression. Combining the two algorithms, we found that among the 9 immune cell subsets, except T follicular helper cell, the other eight are determined as the optimal variables with non-zero coefficients(λglmnet = 0.020; λLars = 0.750, R2 = 0.917 and ministep = 8; Fig. 1E-H, Table S3).
Screening of differentially expressed genes (DEG) associated with immune microenvironment.
Through the screening conditions of differentially expressed genes, we identified 333 differentially expressed genes (150 up-regulated and 183 down-regulated) from the GSE92566 dataset (Fig. S2A, Fig. S2B). Then, with the GSE44270 expression profile as the background, we identified the key modules related to 8 characteristic immune cell subtypes in keloid patients by WGCNA. The analysis results showed that the scale-free topological network and connectivity were the most effective when the soft threshold was 3 (Fig. 2A). The hierarchical clustering algorithm combines all genes into six gene modules with different colors (Fig. 2B). The result of correlation analysis shows that turquoise module (1842 genes) has a high correlation with Effector memory CD4 T cell (R = 0.59), Type 17 T helper cell (R=-0.73), Activated B cell (R=-0.44), Natural killer cell (R=-0.68), Imitate dendritic cell (R=-0.37), Macrophage (R=-0.45) and Monocyte (R=-0.55) (Fig. 2C). At the same time, the blue module (74 genes) also has a certain correlation with the Type 17 T helper cell (R=-0.77), Activated B cell (R=-0.69), Natural killer cell (R=-0.37), Automatic differentiation cell (R=-0.42) and Macrophage (R = 0.48) (Fig. 2C). Therefore, we selected the genes in turquoise and blue module for further analysis. Then using Venn diagram to compare key module genes, differentially expressed genes and immune microenvironment related genes (ImmPort dataset and InnateDB dataset), we obtained 19 overlapping immune microenvironment related differentially expressed genes (Fig. 2D).
Correlation and functional enrichment analysis of immune microenvironment-related DEGs.
Firstly, we analyzed the correlation between 19 overlapping differentially expressed genes and 28 immune cell subsets. The results showed that these differential genes were significantly correlated with immune cells, indicating the accuracy of the previous analysis results. At the same time, it also showed that the change of immune microenvironment may be an important physiological mechanism for the occurrence and development of keloid (Fig. 2E). Subsequently, we analyzed the functional enrichment of 19 differential genes. Biological processes (BP) are mainly involved in immune response and glial cell differentiation. Cell composition (CC) is mainly concentrated in the granule cavity, transcription factor complex and receptor complex. In terms of molecular function (MF), it is mainly involved in receptor binding, phosphatase activity and amino acid binding (Fig. 3A-C). KEGG enrichment analysis showed that differentially expressed genes related to these immune microenvironments were closely related to Th17 cell differentiation, Jak-STAT signal pathway, nod-like receptor signal pathway, B/T cell receptor signal pathway, HIF-1 signal pathway, cGMP-PKG signal pathway and a variety of human diseases (cytomegalovirus infection, leukemia virus type 1 infection and immunodeficiency virus type 1 infection, etc.) (Fig. 3D). These results further indicate that these differentially expressed genes play an important role in the immune microenvironment. Finally, in order to evaluate the interaction between these differentially expressed genes, we constructed protein-protein interaction (PPI) network (p = 8.1e-15, Fig. 3E), including 18 nodes and 44 edges. 2 key subnetworks were identified by MCODE plug-in, which included 10 differentially expressed genes (Fig. 3E).
Establishment and evaluation of keloid risk prediction model.
To establish the keloid risk prediction model, we used three methods: univariate analysis, support vector machine recursive feature elimination (SVM-REF) algorithm and LASSO regression (Lars algorithm) for screening feature genes. Merge GSE44270,GSE7890 and GSE92566, and remove the batch effect to create a combined dataset. Univariate analysis results showed that, except for TLR4 and INPP5D, the p values of the other 8 genes were less than 0.05 (Table 1). At the same time, the LASSO regression (Lars algorithm) results showed that 7 genes (IL1R1, TLR4, JAK1, MAPK1, PTPN6, PTPRC and STAT3) were determined as the optimal variables with non-zero coefficients (R2 = 0.952, λ = 0.31, ministep = 9; Fig. 4A, B). In addition, the SVM-REF analysis results show that the highest performance model can be obtained using 4 features (RMSE = 0.086, Fig. 4C), namely MAPK1, PTPRC, STAT3 and IL1R1. Based on the results of the 3 methods, we identified MAPK1, PTPRC, STAT3 and IL1R1 as candidate characteristic genes for risk prediction model. Subsequently, we used the expression profiles of these 4 genes as input variables for multivariate logical regression analysis. A risk prediction model was established based on the logical regression coefficients of 4 genes (R2 = 0.953, Table 2). The formula of the model was: 0.700+( -0.001) ×expression (IL1R1) + 0.005×expression (MAPK1) + (-0.029) ×expression (PTPRC)+(-0.004) ×expression (STAT3). ROC analysis showed that the AUC value of the model was 96.4% (Fig. 4D), indicating that the model has a strong clinical prediction ability. Subsequently, we used ROC to further evaluate the clinical risk prediction ability of the model on GSE44270, GSE145725 and GSE7890, respectively. The results show that the AUC values of the three datasets were 96% (Fig. 4E), 100% (Fig. 4F) and 100% (Fig. 4G), respectively, which further illustrates the reliability of the risk prediction of the model.
Table 1
Univariate analysis of 10 genes.
|
B
|
Standard error
|
Wald
|
Significance
|
Hazard ratio
|
95% confidence interval for the Hazard ratio
|
|
lower
|
upper
|
|
IL1R1
|
-0.042
|
0.013
|
9.962
|
0.002
|
0.958
|
0.934
|
0.984
|
|
IL6ST
|
-0.014
|
0.004
|
10.408
|
0.001
|
0.986
|
.977
|
0.994
|
|
INPP5D
|
-0.243
|
0.201
|
1.452
|
0.228
|
0.785
|
0.529
|
1.164
|
|
JAK1
|
-0.032
|
0.010
|
11.054
|
0.001
|
0.969
|
0.951
|
0.987
|
|
MAPK1
|
0.074
|
0.024
|
9.071
|
0.003
|
1.077
|
1.026
|
1.129
|
|
PPARG
|
0.668
|
0.317
|
4.445
|
0.035
|
1.950
|
1.048
|
3.626
|
|
PTPN6
|
0.168
|
0.062
|
7.336
|
0.007
|
1.183
|
1.048
|
1.336
|
|
PTPRC
|
-1.138
|
0.423
|
7.231
|
0.007
|
0.320
|
0.140
|
0.734
|
|
STAT3
|
-0.110
|
0.032
|
11.628
|
0.001
|
0.896
|
0.841
|
0.954
|
|
TLR4
|
-0.026
|
0.024
|
1.231
|
0.267
|
0.974
|
0.930
|
1.020
|
|
|
Unstandardized Coefficients
|
Standardization coefficient
|
t
|
Significance
|
95% confidence interval for the Hazard ratio
|
Collinear statistics
|
B
|
Standard error
|
β
|
lower
|
upper
|
Allowance
|
VIF
|
(Constant)
|
0.700
|
0.133
|
|
5.279
|
< 0.01
|
0.433
|
0.968
|
|
|
IL1R1
|
-0.001
|
0.000
|
-0.135
|
-2.273
|
0.028
|
-0.002
|
0.000
|
0.281
|
3.557
|
MAPK1
|
0.005
|
0.001
|
0.435
|
7.776
|
< 0.01
|
0.004
|
0.006
|
0.315
|
3.178
|
PTPRC
|
-0.029
|
0.012
|
-0.143
|
-2.368
|
.022
|
-0.054
|
-0.004
|
0.270
|
3.701
|
STAT3
|
-0.004
|
0.000
|
-0.468
|
-8.584
|
< 0.01
|
-0.005
|
-0.003
|
0.332
|
3.009
|
Table 2
Multivariate logistic regression analysis.
|
B
|
Standard error
|
Wald
|
Significance
|
Hazard ratio
|
95% confidence interval for the Hazard ratio
|
lower
|
upper
|
IL1R1
|
-0.042
|
0.013
|
9.962
|
0.002
|
0.958
|
0.934
|
0.984
|
IL6ST
|
-0.014
|
0.004
|
10.408
|
0.001
|
0.986
|
.977
|
0.994
|
INPP5D
|
-0.243
|
0.201
|
1.452
|
0.228
|
0.785
|
0.529
|
1.164
|
JAK1
|
-0.032
|
0.010
|
11.054
|
0.001
|
0.969
|
0.951
|
0.987
|
MAPK1
|
0.074
|
0.024
|
9.071
|
0.003
|
1.077
|
1.026
|
1.129
|
PPARG
|
0.668
|
0.317
|
4.445
|
0.035
|
1.950
|
1.048
|
3.626
|
PTPN6
|
0.168
|
0.062
|
7.336
|
0.007
|
1.183
|
1.048
|
1.336
|
PTPRC
|
-1.138
|
0.423
|
7.231
|
0.007
|
0.320
|
0.140
|
0.734
|
STAT3
|
-0.110
|
0.032
|
11.628
|
0.001
|
0.896
|
0.841
|
0.954
|
TLR4
|
-0.026
|
0.024
|
1.231
|
0.267
|
0.974
|
0.930
|
1.020
|
Evaluation of Characteristic factors in keloid risk prediction model.
Then, we evaluate the characteristic factors in the risk prediction model. Taking the combined dataset as the background, we use five algorithms: logical regression (gradient descent method), naive Bayesian classification, support vector machine (SVM) classification, bp neural network classification and K-Nearest Neighbor (KNN) classification to verify the classification effect of the classifier established by the 4 feature factors. The sample was divided into training set and test set according to the ratio of 7:3. The results show that the 5 classifiers have significant classification effects, and the classification accuracy, accuracy and F1 value of each classifier are 100% (Fig. 5A-E). The AUC value of each classifier evaluated by ROC was also 100% (Fig. 5F-J). Then, based on the expression profile of the combined dataset, we drew the ROC of each characteristic factor. The analysis results showed that the AUC values of MAPK1, PTPRC, STAT3 and IL1R1 were 97.306%, 92.592%, 97.475% and 96.633 respectively (Fig. 6A), which showed that these 4 genes themselves also have the potential of keloid diagnosis biomarkers. In addition, based on the expression of MAPK1, PTPRC, STAT3 and IL1R1 genes in combined dataset, we constructed the corresponding nomograph of the keloid risk prediction model (Fig. 6B). The result shows that the C index was 0.873, indicating that the model has recognition ability. The calibration curve of nomograph shows a good consistency between the predicted value and the actual value (Fig. 6C). The ROC results showed that the AUC of nomograph model was 0.930, indicating that the nomograph model was highly feasible (Fig. 6D). In order to further understand the potential biological role of these 4 genes in keloid, we performed gene set enrichment analysis (GSEA) for the 4 genes using KEGG gene sets. GSEA results show that these 4 genes are mainly involved in immune cell (B/T) receptor signal pathway, Fc epsilon RI signal pathway, ERBB signal pathway, cancer-related pathway, endocytosis, fatty acid metabolism and phosphorylation regulation (Fig. 7). Combined with the above results, we suggest that these immune microenvironment-related characteristic genes have good diagnostic ability in predicting keloid risk.
Identification of immune microenvironment subtypes in keloid patients.
To further explore the keloid immune microenvironment, we merged the combined dataset (GSE44270, GSE7890 and GSE92566), GSE145725 and GSE83286 into a new dataset containing 74 samples (39 keloids and 35 controls) (Fig. S3A). Then, based on 39 keloid samples, we conducted consensus cluster analysis on keloid samples according to the expression of 4 characteristic genes, and identified different immune microenvironment subtypes of keloid patients. According to the cluster heat map (Fig. 8A), CDF curve (Fig. 8B), the relative change of CDF curve area (Fig. 8C) and the consistent cluster score (Fig. 8D), we selected k = 2 as the optimal value, and divided 39 patients into two different subtypes, including 23 subtype Ⅰ and 16 subtype Ⅱ (Fig. 8A). At the same time, the tSNE dimension reduction analysis showed that there was a significant difference between subtype Ⅰ and subtype Ⅱ samples (Fig. 8E). In addition, the differentially expressed genes of subtype Ⅰ and subtype Ⅱ were analyzed by limma method, and a total of 1292 differentially expressed genes were identified (511 up-regulated and 781 down-regulated (Fig. S3B, Fig. S3C), of which 116 were related to immune response (Fig. 8F), and the expression of the 4 characteristic genes was also significantly different (Fig. 8G).
To understand the relationship between different immune microenvironment subtypes and biology, we carried out Gene set variation analysis (GSVA) on two different subtypes of patients base on all GO gene sets and KEGG gene sets. GO results showed that immune microenvironment subtype Ⅰ mainly involved cell division and chromosome structure. The biological functions related to subtype Ⅱ mainly include interferon production, leukocyte regulation, immune cell receptor regulation and other immune responses (Fig. 8H). KEGG results showed that immune microenvironment subtype Ⅱ was also related to immune response, including GNRH signal pathway, immune cell receptor signal pathway, ERBB signal pathway, p53 signal pathway, TGF β signal pathway, phosphorylation and amino acid metabolism. However, subtype Ⅰ is involved in cell division, homologous recombination and splicing (Fig. 8I). Subsequently, ssGSEA results showed that the immune infiltration efficiency of patients with subtype Ⅱ was significantly higher than that of patients with subtype Ⅰ, especially the infiltration rate of T cells, monocytes and macrophages (Fig. 9A). These results suggested that immune microenvironment subtype Ⅱ may be an immune subtype, which further explains the close relationship between keloid and immune response. In addition, the immune score of each immune microenvironment subtype was compared. The immune score of patients with subtype Ⅱ was higher than that of patients with subtype Ⅰ (Fig. 9B).
External validation of differential expression of characteristic genes.
Firstly, we used GSE44270, GSE7890, GSE145725 and GSE83286 to analyze the expression pattern of characteristic genes in keloid patients. Compared with the control group, the expression of PTPRC, STAT3 and IL1R1 in keloid samples decreased significantly, while MAPK1 increased significantly (Fig. 9C-F). Then we explored the expression trend of 4 characteristic genes in different tumors through pan-cancer analysis. The results showed that the 4 characteristic genes were down-regulated in most tumors (Fig. S4A-D). Moreover, the prognosis showed that the 4 characteristic genes were closely related to the overall survival of at least 8 tumor patients (Fig. S4E-H). Finally, we analyzed the expression of 4 characteristic immune microenvironment-related genes in human primary keloid fibroblasts by RT-PCR. The results were consistent with the datasets analysis, the expression levels of PTPRC, STAT3 and IL1R1 were significantly decreased in keloid fibroblasts, while the expression of MAPK1 gene was significantly increased (Fig. 9G).