Analysis of point mutations and copy number variation in Siglec family genes
We created a mutation waterfall map of Siglec family genes in the TCGA-SKCM cohort (Figure 1A). The results showed that the mutation frequency of SIGLEC7 was the highest, reaching 8%; SIGLEC1 and CD22 mutation frequencies were as high as 7%; SIGLEC6, SIGLEC10, CD33, and SIGLEC5 mutation frequencies were 6%; whereas SIGLEC15 and SIGLEC16 had no mutations, suggesting that they were highly conservative. T > A is the main point mutation found in Siglec family genes. Copy number analysis of the TCGA-SKCM cohort revealed that the increase in the copy number of SIGLEC9, SIGLEC7, CD33, SIGLEC10, SIGLEC8, SIGLEC6, SIGLEC5, and SIGLEC14 was greater than the frequency of loss (Figure 1B-C), while the opposite was true for SIGLEC15.
Siglec family gene expression can predict overall survival
We combined the cohorts of TCGA-SKCM and GSE91061 to obtain data of 519 patients, of which clinical follow-up information was available for 507 patients. The expression matrix of the Siglec family genes was extracted and KM analysis was performed. The results showed that 12 genes had statistical significance (P < 0.05), and only CD22 and SIGLEC15 had no statistical significance. Also, the prognosis of the high expression group was worse than that of the low expression group. Univariate analysis of these 14 genes showed that only MAG, CD22, SIGLEC15, and SIGLEC8 were statistically insignificant; the others were statistically significant, and were all protective genes with Hazard-Ratio > 1 . Siglec family genes have highly synergistic effects at the expression level, and the positive correlation between genes is obvious.
Siglec family genes can define melanoma typing
The consensus clustering algorithm was used for clustering in the merge queue (n = 519). With the increase of the consensus index, the slope of the cumulative distribution curve slightly decreased, and the degree of variability of consistent CDF was the largest (Figure 3A-B). This method produces the best clustering result (Figure 3C), and the white part of the matrix heat map is clean. We used these methods to form a Siglec cluster. Using this typing to analyze the cohort with KM analysis, we found that the prognosis of patients in the A cluster was significantly higher than that of patients in the B cluster (Figure 3D). Heat map analysis showed that 14 Siglec genes were overexpressed in the A cluster and underexpressed in the B cluster (Figure 3E). We therefore defined the A cluster as the high expression of Siglec (HES) subtype and the B cluster as the low expression of Siglec (LES) subtype. Principal component analysis (PCA) clearly grouped the HES and LES subtypes, indicating the accuracy of the consistent clustering algorithm (Figure 3F). Unsupervised clustering of HES and LES by gene set variation analysis (GSVA) showed that many immune-related Kyoto encyclopedia of genes and genomes (KEGG) pathways were enriched in the HES cluster. We estimated an enrichment score of 23 immune cell gene sets in each sample by single sample gene set enrichment analysis (ssGSEA). Except for "CD56dim.natural.killer.cell" in both groups, the enrichment scores of other immune cells were significantly higher in the HES than in the LES subtype.
Gene clustering can stratify prognoses
We analyzed the differences between the HES and LES cluster groups and selected the genes with |logFC| > 2 and adjusted p-value < 0.05 to obtain a group of 181 genes. We used the Metascape website for enrichment analysis of these genes. The first three enrichment terms were "regulation of cell activation", "leukocyte activation", and "immune effector process". Univariate Cox analysis showed that of the 181 genes, 178 genes showed significantly altered levels. The expression matrices of these 178 genes were extracted for consistent clustering, and the clustering effect was the best when K = 2 was used (Figure 4B). The prognosis of patients in gene cluster A was significantly better than those in gene cluster B (Figure 4C). The heat maps of these 178 genes showed that these genes were generally highly expressed in gene cluster A and had strong consistency (Figure 4D). The expression levels of the Siglec family genes in gene cluster A were significantly higher than those in gene cluster B (p < 0.001).
Robust prediction using the LASSO model
We used LASSO regression to construct a gene prognosis model based on the results of Cox analysis of 178 genes that showed statistical significance. The merged cohort was randomly divided into two groups: training and test sets, and the training set was selected to construct the prognosis model. With the increase of the lambda value, the coefficients of some genes become 0, indicating that these genes have little contribution to the model and can be omitted (Figure 5A). We used 10X cross-validation to calculate the partial likelihood deviance of the model. The prognostic ability of the model was the best when the number of genes in the model was two. The calculation formula of the model is: Risk score = SRGN* (- 0.1676) + GBP4*(- 0.1537). Using this formula, we generated a risk score for each patient. The Sankey diagram shows that the HES cluster flows to the A cluster and then flows to the low-risk group, but there is no difference in the final survival ratio (Figure 5B). The expression levels of the Siglec family genes in the low-risk group were significantly higher than those in the high-risk group, which was consistent with the previous gene family univariate Cox analysis (Figure 5C). The risk score of the HES cluster was significantly lower than that of LES cluster, and the risk score of cluster A was also significantly lower than that of cluster B (Figure 5D). KM analysis in the training set, the test set, and the merging set showed that the patients could be divided into two groups, and the overall survival rate of high-risk patients was lower than that of low-risk patients (Figure 5E). At the same time, the ROC curve of risk score was drawn for the three cohorts. In 1-/3-/5-year scenarios, the AUC values were all greater than 0.5, indicating that the model has certain prognostic ability.
Correlation analysis of risk score
We used the CIBERSORT algorithm to calculate the infiltration ratio of 22 immune cells, and analyzed the Spearman correlation with gene and risk score of the model. GBP4 was positively correlated with CD8+ T cells and macrophage M1 levels but negatively correlated with macrophage M0/2 (Figure 6A, p < 0.001). The risk score was negatively associated with CD8+ T cells and macrophage M1 and positively correlated with macrophage M0 (Figure 6B). We then used the ESTIMATE algorithm to calculate stromal score, immune score, and ESTIMATE score in the merge queue. The three groups with previous high-risk scores also had high scores in this analysis (Figure 6C). The TMB in the high-risk group was higher than that in the low-risk group (Figure 6D, p = 0.017). Finally, two groups of mutation waterfalls were drawn, and the mutation frequency of the low-risk group was higher than that of the high-risk group (Figure 6E).