Immune Cell as a Promising Biomarker in the Diagnosis and Prognosis of Cutaneous Melanoma by Using Machine Learning


 Background: Tumor infiltration, is known to associate with various cancer initiations and progressions, is potential therapeutic target for this aggressive skin cancer.Methods: single sample gene set enrichment analysis (ssGSEA) algorithm was applied to assess the relative expression of 24 types of immune cell from public database. Firstly, the differentially expressed immune cells between melanomas and normal samples were identified. Next, multiple machine learning algorithms were performed to evaluate the efficiency of immune cells in diagnosis of melanoma. In addition, the feature selection in machine learning methods was used to figure out the most important prognostic immune cells for developing biomarker to predict the prognosis of melanoma.Results: In comparison with the expression of immune cells in tumors and normal controls, we built the immune diagnostic models in training dataset, which can accurately classify melanoma patients from normal (LR AUC= 0.965, RF AUC= 0.99, SVM AUC=0.963, LASSO AUC= 0.964 and NNET AUC=0.989). These diagnostic models also validated in three outside datasets and suggested over 90% sensitivity and specificity to distinguish melanomas from normal patients. Moreover, we also developed a robust immune cell biomarker which could estimate the prognosis of melanoma. This biomarker also further validated in internal and external datasets. Next, we constructed nomogram combined risk score of biomarker and clinical characteristics, which showed good accuracies in predicting 3 and 5 years’ survival. The decision curve of nomogram model manifested a higher net benefit than tumor stage. In addition, melanoma patients divided into high and low risk subgroups by applied risk score system. The high risk group have a significantly shorter survival time than the low risk subgroup. Gene Set Enrichment Analysis (GSEA) analysis revealed that complement, epithelial mesenchymal transition and inflammatory response and so on significantly activated in high risk group. Conclusions: We constructed immune cell related diagnostic and prognostic models, which could provide new clinical applications for diagnosing and predicting the survival of melanoma patients.


Introduction
Melanoma is the most aggressive type of skin cancer which derived from melanocyte lineage with the highest metastasis and mortality rate [1]. Despite melanoma consist of only 5% of all skin-related cancers, it's accounting for approximately 80% of deaths related to skin tumor. Like most other cancers, surgical enucleation and drug therapy are di cult to treat once it has metastasized [2][3][4]. In addition, it is di cult to detect at early, the majority of patients with melanoma were diagnosed at an advanced stage [5,6].
Currently,cancer treatment guiding and prognosis predicting are largely determined by TNM staging system. However, clinical experience revealed that many patients even within the same TNM stage have differences in the overall survival [7]. The clinical limitations of TNM stage are increasingly becoming apparent. Thus, it is urgently required to explore the new biomarkers for early diagnosis and prognostic prediction.
Growing studies have recently reported that the tumor microenvironment plays crucial roles in the development and progression of many malignant tumors [8,9]. In tumor microenvironment, the type, function and location of immune cell are intimately associated with clinical outcome [10,11]. Tumor cells regarded as antigens which will attract immune cells and leukocytes by many chemokines to in uence immune response. Moreover, immune escape of tumor cell was considered as a crucial factor in tumorigenesis [12][13][14]. The prognostic value of tumor microenvironment also had been demonstrated in numerous melanoma experiments. High immune in ltration in melanoma also have shown to be associated with favorable prognosis. To date, immunotherapy is pivotal for the treatment of patients with advanced melanoma, such as CTLA-4, PD-1 and PD-L1 inhibitors [15,16]. Therefore, prognostic or predictive biomarkers related to tumor microenvironment may hold great promise for identifying novel molecular targets and guiding patient management. Besides tumor cells, melanoma also commonly includes various types of immune cell, which may regard as a potential diagnostic signature to classify tumors from suspected patients. Thus, systematically evaluated in ltrating immune cells was recognized a signi cant supplemental biomarker to the TNM stage for diagnosis and prognosis prediction. Fortunately, the availability of public, large-scale datasets like the cancer genome atlas (TCGA) and Gene Expression Omnibus (GEO) databases which provided numerous transcriptome pro les to investigate the landscape of immune cells.
Machine learning is a powerful tool to analyze and summarize complex datasets, which can provide various computational approaches to predict clinical diseases [17]. Previously, several algorithms have been successfully applied to diagnose and predict disease, including logistic regression model [18], support vector machine [19], random forests analysis [20] and arti cial neural network [21]. Compared to classical methods, machine learning often reach remarkable high sensitivity and speci city. What's more, machine learning is the best choice to process the increasingly growing genomic data and clinical information in oncology research and predict the susceptibility, survival and recurrence of cancer. Therefore, in this research, we rst estimate the proportions of 24 immune cells in 944 samples (653 tumors and 291 normal controls) according to their gene expression pro ling download from public databases. Next, we used multicategory machine learning to identify several important immune cell signatures as well as construct diagnostic and prognostic models, which show more effective performance for the diagnosis and prognosis of melanoma patients.

Melanoma collection and normal controls
The melanoma patients were collected from public datasets. The eligible datasets were downloaded from the Xena Public Data Hubs (http://xena.ucsc.edu/public-hubs/) and GEO database (https://www.ncbi.nlm.nih.gov/geo). Finally, ve datasets consist of 944 samples including TCGA, GSE3189, GSE15605, GSE46517 and GSE54467 were screened out for this study. The raw transcriptome pro les of RNA sequencing data were processed by applying R software. Firstly, the probe IDs were annotated according to the annotation information of platform. For the same gene corresponding to multiple IDs, the median expression value will be calculated to represent the gene expression level. Next, genes with a variance of 0 will be excluded for its tiny expression level. Finally, the raw matrix data were normalized by log2(x + 1) conversion.

Estimation of immune cell types
To estimate the proportions of immune cells, single sample gene set enrichment analysis (ssGSEA) algorithm were applied to transform the normalized gene expression data into 24

Diagnostic analysis
Firstly, these samples were classi ed into tumor tissue groups and normal tissue groups. Next, the difference analysis of immune cells in tumor tissues and normal tissues was performed and p values < 0.05 considered as the differentially expressed immune cells (DEICs). Then, the Upset plot analysis was used to explore the overlap of DEICs among these datasets. To develop a diagnostic model with selected DEICs, ve machine learning algorithms including Logistic Regression (LR), Random Forests (RF), Support Vector Machines (SVM), Least Absolute Shrinkage and Selection Operator (LASSO), neural network (NNET) and ve-fold cross-validation were systematically used to construct the models in TCGA dataset and the sensitivity and speci city of the diagnostic models was evaluated by the receiver operating characteristic (ROC) curves. Principal component analysis was performed to determine whether these DEICs could de nitely classify tumors form normal controls. The diagnostic model also validated in another three independent datasets. Besides, the coe cients of LR were used to calculate the diagnostic score of each sample and the formula is Diagnosis score = , which could well distinguish normal and tumor tissues.

Prognostic analysis
To explore the most signi cant immune cells in the prognosis of melanoma patients, the eligible patients in TCGA dataset were randomly separated into training and testing samples (1:1). Multicategory machine learning methods including LASSO, SVM-recursive feature elimination (SVM-RFE), RF-feature selection (RF-FS) were conducted to identi ed the important immune cells in training dataset. Then, cox regression analysis method was performed to develop prognostic model with selected cells for the risk formula and the risk score is generated as follows: Risk score = , in which N means the number of feature cells, expr i means the expression level of cells and coef i means regression coe cient. The risk score of each sample in training dataset was estimated and the patients were accordingly classi ed into high-and low-risk group by optimal cutoff value. To compare the differences between high-and low-risk group, Kaplan-Meier survival curves were drawn and signi cance was calculated by log-rank tests. The area under the curve (AUC) of Receiver operating characteristic curves (ROC) was used to evaluate the 5year overall survival predictive accuracy of the model. Besides, to proven the robustness of the result, these immune cell features were further validated in testing dataset and GSE54467. Finally, nomograms were established in this study based on patients' clinical characteristics and risk score. To compare the prediction and actual survival, calibration curves were drawn. And decision curves were also plotted to discriminate the clinical usefulness of nomogram and tumor stage.

Strati ed analysis
To assess the relationship between risk score distribution and clinical characteristics, the subgroup analysis of clinical variables included age, gender, race, stage, vital status and tumor status were performed. Moreover, the univariate and multivariate cox logistic regression were carried out to compare the prognostic value between the risk score and clinical variables. Next, to investigate the potential biological phenotypes between high and low risk groups, gene expression data corelated to immune checkpoint regulators and epithelial-mesenchymal transition (EMT) were analyzed [24][25][26][27]. Firstly, the expression data of these genes were extracted from the TCGA dataset and divided into high and low risk groups by applying the optimal cutoff value. Finally, subgroup analysis of these genes was conducted.

Gene set enrichment analysis
In order to explore the different signaling pathways between the low-and high-risk groups, Gene Set Enrichment Analysis (GSEA) was conducted by "clusterPro ler" package in R software. Firstly, the differential analysis of all genes between low-and high-risk groups were generated and these genes were ordered by the value of log2 fold change. Then the cancer hallmark pathway database (h.all.v7.0.symbols) and KEGG pathway database (c2.cp.kegg.v7.0.symbols) in GSEA were performed to investigate the signaling pathways correlated with different subgroups of melanoma. The random sample permutations were 1000 and the q value < 0.05 was the signi cance threshold.

Statistical analysis
All statistical analyses were conducted using R package (v.3.6.0) and corresponding packages. The Upset plot was drawn by the "UpSetR" package. ssGSEA method was estimated by "GSVA" package. LASSO and LR analysis were calculated by the "glmnet" package. SVM and SVM-RFE methods were conducted by "e1017" package. RF and RF-FS algorithms were applied by "randomForest" and "varSelRF" packages respectively. NNET method were performed by "nnet" package. The optimal cutoff values computed by using the "survminer" package. Kaplan-Meier survival analysis was applied by using "survival" and "survivalROC" packages. GSEA analysis was conducted by "clusterPro ler" package. The hazard ratios (HR) and 95% con dence intervals (95% CI) of the prognostic factors were calculated. P < 0.05 was regarded as statistically signi cant in all statistical tests.

Melanoma collection and normal controls
Totally, 944 samples were selected for the subsequent analysis, which were acquired from the ve datasets including TCGA of melanoma, GSE3189, GSE15605, GSE46517 and GSE54467. TCGA of melanoma, GSE3189, GSE15605 and GSE46517 were used to diagnostic analysis. 446 melanoma samples obtained from TCGA of melanoma and GSE54467 were conducted to prognostic analysis. The complete analysis work ow in this study is illustrated in Fig. 1.

Differentially expressed immune cells (DEICs)
Firstly, the 24 immune cells expression matrix was calculated by ssGSEA. According to standard of differential analysis, 19 DEICs were identi ed in TCGA dataset in which 11 cells were up-regulated and 8 genes were down-regulated ( Fig. 2A). 13 DEICs contained 4 up-regulated and 9 down-regulated cells were distinguished in GSE3189 (Fig. 2B). 11 DEICs were found in GSE15605, which consist of 6 up-regulated cells and 5 down-regulated cells (Fig. 2C). Besides, 7 up-regulated cells and 6 down-regulated cells were identi ed in GSE46517 (Fig. 2D). Eventually, 6 overlap of DEICs were found in the four datasets (Fig. 2E).
These immune cells including iDC, DC, Eosinophils, NK CD56bright cells, Mast cells, Treg were selected for subsequent research.

Construct diagnostic model
In order to develop a diagnostic model by using the 6 identi ed DEICs and assess the effectiveness, ve machine learning algorithms were comprehensively conducted to diagnose melanoma samples from normal healthy controls. Additionally, we also applied 5-fold cross-validation to estimate the accuracy of each model in TCGA dataset. The ROC curves manifested that the six DEICs can accurately classify melanoma patients from normal (LR AUC = 0.965, RF AUC = 0.99, SVM AUC = 0.963, LASSO AUC = 0.964 and NNET AUC = 0.989) (Fig. 2F). This diagnostic models also validated in GSE3189, GSE15605 and GSE46517. Similarly, the ROC curves suggested over 90% sensitivity and speci city to distinguish melanomas from normal patients in all datasets ( Fig. 2G-I). Principal components analysis showed that tumors and normal controls could be well differentiated based on the expression of six DEICs (Fig. 2J). Next, we built a diagnostic score model with these DEICs by using LR method and calculated the score of each sample according to the diagnostic formula. The diagnostic score distribution of tumors and normal controls was shown in Fig. 2K. The violin plot revealed that the diagnostic values were signi cant difference between tumor and normal groups.

Develop prognostic model
After eliminating patients with no survival information, 432 melanoma patients were obtained from the TCGA of melanoma and GSE54467. Firstly, the TCGA of melanoma dataset were randomly classi ed into training samples (N = 179) and testing samples (N = 179). The clinical statistical information of training and testing samples is displayed in Table 1, and results indicated that there is no difference between two datasets. Next, Combing the feature selection results of LASSO method (Fig. 3A), RF-FS method (Fig. 3B) and SVM-RFE method (Fig. 3C) showed that four overlapping immune cells were selected out (Fig. 3D). Then these immune cells were further used to construct a risk score system by applying Cox analysis in the training dataset. By applying this risk model, a risk score for each sample will be calculated. The distributions of the risk scores, overall survival (OS) time, vital status, and expression levels of corresponding immune cells in training (Fig. 3E-G), testing (Fig. 3H-J) and GSE54467 (Fig. 3K-M) datasets were respectively shown. Afterwards, melanoma patients in training dataset were classi ed into a highrisk group and a low-risk group by applying the optimal cut-off value of the risk scores. Kaplan-Meier curves showed that patients in high-risk group have a shorter survival time than low-risk with a log-rank test of p = 0.003 (Fig. 4A). To estimating the prediction power of features, the ROC curve was drawn and 5 years of AUC was 0.664 (Fig. 4B). Besides, in order to con rm the robustness of the result, veri cation tests were conducted in testing and GSE54467 datasets. The testing and GSE54467 datasets were divided into high-risk and low-risk groups based on training dataset. Kaplan-Meier curves showed that there is a signi cant difference between high-risk and low-risk group both in testing dataset (log-rank p < 0.001) (Fig. 4C) and GSE54467 dataset (log-rank p = 0.002) (Fig. 3E). The 5 years of AUC were 0.832 (Fig. 3D) and 0.729 (Fig. 4F) respectively.

Nomogram building and validating
In order to supply a simple and accuracy method for OS prediction, nomogram was built on the basis of clinical information and risk scores of patients in train dataset (Fig. 5A). Then, a total point summarized the points of each parameter, which can predict the probability of OS at 3 and 5 years. The calibration plots suggested that the nomogram performed well in comparison with ideal model (Fig. 5D). Moreover, nomograms constructed by testing (Fig. 5B) and GSE54467 (Fig. 5C) were used to validate the results. Surprisingly, calibration plots in testing (Fig. 5E) and GSE54467 (Fig. 5F) for nomogram predicting 3 and 5 years' OS also performed well with the ideal model. Similarly, by applying the decision curve analysis, the nomogram model offered a higher net bene t and a better clinical utility than tumor stage no matter in training (Fig. 5G) or testing (Fig. 5H) and GSE54467 (Fig. 5I) datasets.

Associations between risk model with clinical characteristics and gene phenotypes
The association between risk model and clinical characteristics was investigated and the violin plot manifested that only vital status and tumor status were correlated with risk score (Fig. 6A). Other clinical characteristics, such as age, gender, race and stage had no relationships with risk score. Thus, the risk model was associated with two clinical variables and had on effect on other clinical characteristics. Furthermore, in comparation the prognostic value of risk score with clinical variables, univariate and multivariate logistic regression were applied in training, testing, and GSE54467 datasets ( Table 2). The univariate regression showed that age, stage, tumor status and risk sore were signi cantly associated with OS, but the multivariate regression revealed that only the risk score were signi cantly correlated with OS and could regarded as an independent risk factor in training (HR = 3.517, P = 0.005), testing (HR = 1.869, P = 0.042), and GSE54467 (HR = 2.661, P < 0.000). To evaluate the associations between risk model and selected immune checkpoint related genes, subgroup analysis of immune checkpoint related genes was performed. The violin plot revealed that CD28, CTLA4, ICOS, PDCD1, TIGIT, CD274, CD226, CD40 and CD40LG in high risk group had a higher expression value than those in low risk group (Fig. 6B).
Interestingly, subgroup analysis of EMT related genes showed that majority of EMT related genes were differently expressed between high and low risk group. The high risk group had signi cantly higher expression levels of CTNNB1, FGF2, EGFR, SNAI2, ZEB1, CXCL12, SNAI1 and PDGFB (Fig. 6C).

Gene set enrichment analysis
To investigate the signi cant pathways shared by different high and low risk group, we performed cancer hallmark and KEGG pathway enrichment by GSEA analysis. Based on selection standard and q values. Multiple signi cant cancer hallmark pathways were enriched, such as allograft rejection, complement, EMT and in ammatory response (Fig. 6D). Additionally, KEGG enrichment showed that complement and coagulation cascades, ECM receptor interaction, natural killer cell mediated cytotoxicity and T cell receptor signaling pathway were positively enriched in high risk group (Fig. 6E).

Discussion
Recently, melanoma patients are growing younger and with highly metastasize and deadly threatening, which places a huge burden to thousands of people worldwide. In spite of numerous advanced therapeutic methods were used to treat melanoma, such as chemotherapy and radiotherapy and immunotherapies, their survival rate still remains low [1,3]. Besides, the traditional classi cation is often ineffective and lacks clinical bene ts. Therefore, researchers are struggling to explore the new biomarkers for better diagnose and predict prognosis. Huang et al identi ed eight immune-relate genes biomarker which could predict the prognosis of melanoma [28]. An RNA sequencing-based 12-gene signature was established by applying univariate and multivariate regression models to predict the survival of malignant melanoma patients [29]. Lu et al discovery a ve-miRNA signature by analyzing microarray dataset in GEO database, which could be an independent prognostic biomarker in melanoma patients [30]. Recently, the tumor immune microenvironment in melanoma has become a research hotspot and under active investigation [31]. Moreover, the immune cell types which differentially distributed in cancer tissue on diagnosis has attracted great interest in recent years. Therefore, in this study, we systematically analyzed the immune microenvironment and tried to establish a more evaluable and precise signature for advanced melanoma patients.
In spite of various differential expression of genes were analyzed to diagnose tumors. Nevertheless, little research attention looked at the effects of immune cell on the diagnosis of melanoma. Firstly, we used the ssGSEA method to calculate the relative expression of 24 human immune cells. Because of compared to normal tissues, the distribution of immune cell was signi cantly higher in tumor tissues. The overlapping DEICs were identi ed and put into machine learning analysis. The high sensitivity and speci city of multiple machine learning algorithms indicated that DECIs was an e cient indicator for diagnosis of melanoma. In addition, we built a diagnostic score model by logistic regression method, which could effectively distinguish the melanomas from the normal controls, replying that the immune system is closely associated with the tumorigenesis of melanoma. Similar results have been reported that in ltration of immune cell can be used to diagnose colon cancer, even all digestive system cancers [32,33]. In this sense, immune in ltration opened a novel strategy for diagnosing and treating melanoma.
To subsequently investigate the prognostic value of immune in ltration in melanoma, LASSO, RF-FS and SVM-RFE methods jointly applied to select potential immune cells for building the prognostic model. Finally, four types of immune cell including Th2 cells, T helper cells, Macrophages, iDC were used to construct the risk score system by Cox regression method, which also was validated in internal and external datasets. Among these immune cells, some have been proven to be associated with melanoma. For instance, approximate 70% of melanoma metastatic lymph nodes were detected the distribution of immature DCs which may take an immunosuppressive function in melanoma [34]. Under normal immune environment, Th1 cells and Th2 cells are in a relatively balanced state. Th2 bias signi es the imbalance of Th1/Th2. Th2 could strongly inhibit Th1 responses [35]. Th2 bias is one of the mechanisms of tumor immune escape. Studies have also shown that Th2 dominance could mediate a chronic in ammation which could promote melanoma metastasis. Moreover, Falleni at el con rmed that the accumulation of Macrophages was a poor predictor for survival of melanoma patients and could be regarded as a potential therapeutic target [36]. In order to assess the accuracy of prognostic prediction, we also constructed nomogram integrate risk score and clinical information. The calibration curve for the observed 3-year, and 5-year outcomes showed that the nomogram model performed well with the ideal prediction model. What's more, compared with the tumor stage, the decision curve plots depicted that the nomogram model can acquire more bene t. The multivariate regression analysis also indicated that the risk score of immune cell related biomarker could be regard as an independent prognostic factor in melanoma.
According to the optimal cutoff value of risk score, melanoma patients classi ed into different risk groups. The Kaplan-Meier revealed that patients in high risk group have a poor prognosis. Thus, in order to explore the underlying mechanism with different subgroups, strati ed analyses of clinical characteristics and gene phenotypes were performed. The risk score distribution of clinical features showed that the risk score was only correlated to vital status and tumor status, and had on effect on other clinical features. Presently, checkpoint blockade immunotherapies represent a promising strategy for cancer therapy and acquired extensive investigations [37,38]. However, the e cacy of immunotherapies is dramatic varied in individual patients and different subtypes of cancer. In our research, the expression of immune checkpoint related genes including CD28, CTLA4, ICOS, PDCD1, TIGIT, CD274, CD226, CD40 and CD40LG were highly expressed in high risk of melanoma patients. Besides, epithelial mesenchymal transition (EMT) recognized the indictor for the invasion and progression of many cancers[39, 40]. The selected EMT related genes also including our research and the results also manifested that most of them are highly expressed in high risk group. Hence, we have enough reasons to believe that our immune cell feature closely correlated with the prognosis of melanoma.
Further investigating the potential biological mechanism in high risk phenotype, GSEA method was applied to analyze the candidate pathways. The results showed that high risk phenotype was positively associated with cancer hallmarks such as allograft rejection, complement, EMT and in ammatory response, which supported the previous ndings that EMT and immune related genes are highly expressed in high risk group. The complement system, an essential constituent of innate immunity, affects tumor growth and metastasis by regulating chronic in ammation. Moreover, KEGG pathway analysis showed that complement and coagulation cascades, ECM receptor interaction, natural killer cell mediated cytotoxicity and T cell receptor signaling pathways were enriched in high risk phenotype, which largely consisted with cancer hallmarks analysis. ECM-receptor interaction pathway is important in tumor metastasis [41]. The signi cance of the ECM-receptor interaction pathway implied the interaction between tumor cell and environment are very dynamic[42].

Conclusion
In summary, our study identi ed several differential immune cells and proven the e ciency of immune cells in diagnosing and predicting prognosis of melanoma. The constructed diagnosis and prognosis models might provide a more simple and accurate prediction for melanoma patients in clinical application and management.   Figure 1 The complete work ow of the analysis in this study.     Strati ed analysis and GSEA analysis. A: The relationship between risk score distribution and clinical variables which including age, gender, race, stage, vital status and tumor status. B: Expression of immune checkpoint related genes between the high and low risk melanoma patients. C: Expression of epithelial mesenchymal transition (EMT) related genes between the high and low risk melanoma patients. *p<0.05; **p<0.01;***p<0.001;****p<0.0001. D-E: Gene set enrichment analysis (GSEA) of high vs. low risk score groups by using gene sets of the cancer hallmark pathway (h.all.v7.0.symbols) (D) and KEGG pathway (c2.cp.kegg.v7.0.symbols) (E) downloaded from the MSigDB database. Samples were classi ed into high-and low-risk groups. Each run was performed with 1000 permutations. Enrichment results with signi cant associated pathways are shown.