The 9 regulators-mediated m5C methylation modification patterns
According to 9 m5C regulators with expression profiles in the TCGA-PTC dataset, PTC samples were identified from normal samples (Fig. 1A). Afterwards, the expression profile data of these 9 m5C regulators were carried out z-score standardization using the scale function in mosaic package. Then, 3 different m5C modification patterns were discovered according to those 9 m5C regulators expression patterns (Fig. 1B and C). These 3 patterns were named as m5C-cluster 1–3. It was observed from Fig. 1D that, the expression level of these 9 m5C regulators showed significant differences among the 3 distinct subtype samples.
Prognostic analysis was also carried out for these 3 major m5C modification patterns, which suggested that m5C-cluster 2 modification pattern showed survival advantage (Fig. 1E). However, due to the speciality of PTC and the good overall prognosis, there was no significant statistical difference among these 3 subtypes. Besides, average survival time of samples in these 3 subtypes was also analyzed, which discovered that the average survival time of C2 subtype samples was 1307.657 days, that of C1 subtype samples was 1125.877 days, and that of C3 subtype samples was 1202.695, with that in C2 higher than those in C1 and C3.
TME-infiltrating cell features in different m5C modification patterns
For exploring those biological behaviors in the different m5C modification patterns, GSVA was conducted. It was illustrated from Fig. 2A that, m5C-cluster 1 significantly associated with the amino acid metabolic pathways, m5C-cluster 2 was enriched to the endocrine system, lipid metabolism, and cancer, whereas m5C cluster-3 was associated with cell cycle, DNA repair and nucleic acid metabolism.
Further, the distribution of clinical features of samples in the above 3 subtypes was statistically analyzed. The statistical results are displayed in Supplementary Table 3 and Fig. 3. It was found from the results that, multiple clinical features in the 3 subtype samples were randomly distributed, with no significant difference.
In addition, the ESTIMATE algorithm was applied in quantifying the differences in stromal cell infiltration among the 3 subtype samples. As shown in Fig. 2B, the stromal score in m5C-cluster 2 was the highest, followed by m5C cluster-3, while m5C cluster-1 had the lowest score. In addition, there were significant differences among them. Thereafter, the CIBERSORT deconvolution algorithm was utilized for comparing the heterogeneities in immunocyte components of the 3 m5C modification patterns (Fig. 2C). Meanwhile, the support vector regression was used to deterimine the immunocyte types in tumors. As a result, high levels of Tregs and monocytes were detected in m5C-cluster 1 and m5C-cluster 3, whereas excessive resting/activited DCs were found in m5C-cluster 2. Recently, research has particularly focused on the RNA modification mechanism in the regulation of DC activation. DCs function to present antigen and to activate the naive T cells, which connect the intrinsic immunity with the adaptive one[42].
Finally, this study analyzed the expression of the 34 known immune checkpoints in the 3 subtype samples. As found from Fig. 2D, there were significant differences in the expression of these 34 immune checkpoints among the 3 subtypes. Most immune checkpoint genes were highly expressed in m5C-cluster 2, followed by m5C-cluster 3, while m5C-cluster 1 had the lowest expression, which was consistent with the average survival time of samples in the 3 subtypes.
m5C gene signature establishment along with functional annotation
For better investigating the possible biological behaviors of all the m5C modification patterns, the limma package was used to determine 690 m5C phenotype-associated DEGs (Supplementary Fig. 3). In addition, KEGG pathway enrichment analysis was carried out on DEGs using the WebGestaltR package. It was surprising that, these genes were enriched to cell cycle, DNA repair, cell adhesion molecules and immune inflammatory response related pathways. These findings verified the important role of m5C modification in cancer cells themselves and in TME immunomodulation (Fig. 4A). For better validating such regulatory mechanism, the unsupervised clustering analysis was performed using those 690 m5C phenotype-associated genes, for the sake of classifying cases to distinct genome subtypes. Similar to clustering analysis of m5C modification patterns, 3 different m5C modification genome phenotypes were found, which were referred to as m5C gene-cluster A-C, separately (Supplementary Fig. 4). According to such results, there were 3 different m5C methylation modification patterns in PTC. Besides, there were diverse signature genes in the 3 different gene clusters (Supplementary Fig. 4). The m5C regulators expression levels were significantly different among the 3 m5C gene-clusters (Supplementary Fig. 5), consistent with the results obtained for m5C methylation modification patterns. The expression quantities of these 9 genes were the highest in gene-cluster B samples, followed by gene-cluster A samples, and were the lowest in the gene-cluster C samples.
Clinical features and transcriptome traits of the m5C-associated phenotypes
First of all, we analyzed the stromal scores of 3 m5C gene-cluster subtypes. The results suggested that (Fig. 4B), there were significant differences in the stromal score of 3 subtypes, among which, gene-cluster C had the highest score, followed by gene-cluster B, while gene-cluster A had the lowest score. Then, we analyzed the distribution of 22 immunocytes in the 3 m5C gene-cluster subtypes. As observed from Fig. 4C, the distribution of 15 immunocytes in three subtypes showed statistically significant differences. These findings revealed the important role of m5C methylation modification in the formation of diverse TME landscapes and tumor-related immune regulation.
Nonetheless, the above results were obtained from patient population alone, which might not precisely estimate the m5C methylation modification patterns of individual cases. Due to the m5C modification complexity and heterogeneity in individual samples, this study established the scoring system (m5C-score) using the phenotype-associated genes for quantifying m5C modification patterns in individual PTC cases. Besides, those attribute alterations in individual patients were visualized by the alluvial diagram (Fig. 5A). It was discovered from the figure that, among the 3 m5C-cluster subtypes, samples in m5C-cluster 2 and m5C-cluster 3 subtypes were mostly distributed in the low m5C-score score group, while those in high m5C-score score group were basically derived from the m5C-cluster 1 subtype. In the 3 m5C gene-cluster subtypes, the m5C-score values of Cluster A and Cluster C samples were lower. Samples aged over 40 years were mostly classified into the low m5C-score score group, while females mostly belonged to the high m5C-score score group.
To further evaluate the differences between low and high score samples, the limma package was used to analyze the DEGs between the two groups. Using the thresholds of logFC > log2(1.2) and p < 0.05, 67 DEGs were screened, including 58 up-regulated and 9 down-regulated ones (Fig. 5B). Moreover, the WebGestaltR package was utilized for the GO and KEGG enrichment analyses of DEGs, with p < 0.05 as the threshold. A total of 62 biological processes (BP), 2 cellular components (CC), 6 molecular functions (MF) and 9 pathways were selected. As shown in Supplementary Fig. 6, these genes were mainly involved in tumor proliferation and immune response-related biological processes/molecular functions and signaling pathways, such as MAPK, TNF and IL-17.
Subsequently, this study observed the correlation of m5C-score with patient survival, and analyzed the difference in prognosis between high and low m5C-score score samples. The results suggested that, samples with low m5C-score scores had better prognosis than those with high score, regardless of DFS or OS (Fig. 5C and D). In addition, it was also discovered that, there was no difference in the clinical features (such as T, M and stage) between high and low m5C-score samples (Supplementary Fig. 7). The expression levels of 9 m5C regulators in high m5C-score group were significantly higher than those in low score group, and there was significant difference between two groups (Supplementary Fig. 8).
Subsequently, this study observed the correlation of m5C-score with TME. First of all, the CIBERSORT method was adopted to evaluate the infiltration level of each immunocyte type in the high and low m5C-score TCGA-TPC samples. The results are presented in Supplementary Fig. 9A. There were significant differences in 6 cell types between high and low m5C-score groups. In addition, this study also calculated the stromal score, immune Score and ESTIMATE score in different samples. As presented inSupplementary Fig. 9B, in the low m5C-score group, the immune Score was significantly higher than that in high m5C-score group, which was consistent with the previous results that the low m5C-score group had better prognosis than the high score group. Moreover, it was discovered through expression of immune checkpoint genes that, there were significant differences in 16 immune checkpoint genes expression levels between high and low m5C-score groups (Supplementary Figure S10). Based on these findings, low m5C-score showed close correlation with immune activation. Further, m5C-score helped to assess m5C modification patterns in individual tumors, and better assess the TME cell infiltration features of tumors, thus contributing to distinguishing the true or false TME immune infiltration.
At last, this study integrated the influences of m5C-score and various immunocyte infiltration levels on the prognosis for PTC patients. From Fig. 6, it was discovered that, resting CD4+ memory T cells and CD8+ T cells were mainly enriched in low m5C-score samples, while activated NK cells and Monocytes were mostly enriched in high m5C-score group. Then, the median infiltration level of the above 4 cell types was used to divide all samples into high and low immunocyte infiltration level groups. It was discovered that, samples with low m5C-score and low infiltration level of resting CD4+ memory T cells had the best prognosis, while those with high m5C-score and low infiltration level of resting CD4+ memory T cells had the poorest prognosis. In addition, samples with low m5C-score and high CD8+ T cells infiltration had the best prognosis, while those with high m5C-score and low CD8+ T cells infiltration had the poorest prognosis. Further, it was found that samples with low m5C-score and high Monocytes infiltration had the best prognosis, while those with high m5C-score and high Monocytes infiltration had the poorest prognosis. According to the prognostic prediction model, we analysed the correlation between m5C-score and Treg expression in 24 PTC patients. The m5c-score showed a negative relationship with CD3 + CD4+/CD3 + CD8+ (r = -0.9543, p < 0.0001; Fig. 5E), but a positive relationship with CD4 + CD25 + Tregs percentage (r = 0.4477, p = 0.015; Fig. 5F).
Construction and verification of the m5C-score-based PTC diagnostic model
First of all, this study calculated the correlation of 49 m5C phenotype-related genes with m5C-score. Then, 10 signature genes related to the m5C-score were screened by the threshold of correlation coefficient > 0.4, which were used as the features to construct the SVM classification model.
In order to verify the classification efficiency and accuracy of the model, we used the expression profile data of TCGA tumor samples as the training set. The m5C-score was utilized to classify the samples into high and low groups. Then, the expression profile data of these 10 genes were used to construct the SVM classification model to classify the TCGA-TPC samples. It was discovered that, compared with the m5C-score classification results, the accuracy reached 98.3%, and the sensitivity was up to 88.9%. The 493 samples were accurately classified, with the area under the ROC curve (AUC) of 0.936 (Fig. 7A). The above results demonstrated that, the classification model constructed based on these 10 signature genes well simulated the classification results of m5C-score. The gene number was substantially reduced, which significantly improved the classification efficiency.
Thereafter, all the 551 TCGA samples (including 493 tumor samples and 58 normal samples) were used as the verification set 1. The above-mentioned 10 genes were used as the features to construct the SVM classification model to classify the samples. Surprisingly, it was discovered that, the model accurately classified TCGA-TPC samples into tumor samples and para-carcimoma tissue samples, with the classification accuracy of 89.7% and the sensitivity of 98.6%. 538 of the 551 samples in verification set 1 were accurately classified, with the AUC of 94.2% (Fig. 7B).
To further verify the model classification efficiency and accuracy, another 3 sets of microarray data were also downloaded, and the 10 signature genes were used for SVM verification. The GSE29265 data set was utilized as the verification set 2, which included 49 samples (20 normal samples and 29 tumor samples), with the model classification accuracy of 95%. 48 of the 49 samples were accurately classified, the model sensitivity to high and low scores was up to 100%, and the AUC was 97.5% (Fig. 7C). Meanwhile, the GSE33630 data set was used as the verification set 3, which included 105 samples (45 normal samples and 60 tumor samples). The model classification accuracy reached up to 100%, all the 105 samples were accurately classified, the model sensitivity to high and low scores was 100%, and the AUC was 100 (Fig. 7D). The GSE65144 data set was used as the verification set 4, which contained 25 samples (13 normal samples and 12 tumor samples). The model classification accuracy was 84.5%, all the 25 samples were accurately classified, the model sensitivity to high and low scores was 100%, and the AUC was 92.3% (Fig. 7E).
Potential Drug screening and evaluation for the m5C-score-based PTC diagnostic model
We firstly used L1000 fireworks display (l1000FWD) tool, and reverse drug screening method for deferentially expressed genes in high and low-risk groups of m5C score, and obtained small molecules (drugs, Supplementary Table 4). In the interaction database between CMAP drug and gene expression, we analyzed 67 drugs that may interact with genes with different changes in the risk model constructed by m5C score, and selected 55 small molecules (drugs, Supplementary Table 5). We compared the potential drug overlap between L1000 and CMAP annotation, and found that there were five overlapping small molecules (S8), namely cephaeline, emetine, anisomycin, ouabain and thapsigargin. CCK8 was used to detect the effect of five potential drugs on the growth and metabolic activity of PTC tumor cells. It was found that compared with the control group, the five drugs could inhibit the growth of thyroid cancer cells in different degrees (Fig. 8A). Consistent with this, results of subcutaneous transplantation model also showed that intraperitoneal injection of these five drugs could significantly inhibit the growth of tumor, respectively (Fig. 8B).