Public cohort datasets and preprocessing
We systematically searched for publicly available ATC transcriptome datasets. In total, the following four cohorts using the same array platform (Affymetrix Human Genome U133 Plus 2.0 Array) were gathered for this study: GSE33630, GSE29265, GSE65144 and GSE76039 [5, 17, 18]. The raw expression data were downloaded from the GEO database (https://www.ncbi.nlm.nih.gov/geo/). The boxplot function in R package was used to determine whether the distribution of the samples’ expression abundance value is uniform. If not, the normalizeBetweenArrays function in the R package limma was applied to correct the expression data, which were then log2 transformed. When a gene symbol corresponded to multiple probes, we selected the highest value as its expression level, and a probe was deleted when it was recorded with multiple genes. To enhance the reliability of the validation of the screened genes, we merged the GEO datasets mentioned above, and batch effects were removed using the R package sva.
We also used transcriptome data and clinical information of PTCs from the TCGA database downloaded from the UCSC Xena browser (https://xenabrowser.net/datapages/), and 505 patients with PTC were enrolled in our study. Considering the excellent survival outcome in most PTC patients, the progression-free interval (PFI) was regarded as the preferable indicator of prognosis [19].
To investigate the treatment response to immunotherapy, a dataset of urothelial cancer patients who received anti-PD-L1 therapy (IMvigor210 cohort) was acquired from the R package IMvigor210CoreBiologies, and a dataset of AB1-HA mesothelioma mice treated with anti-CTLA4 agents (GSE63557) was obtained [20, 21].
To compare the differentiation level among the different samples, a list of 16 TDS genes was obtained from a published study investigating PTC and served as a parameter of thyroid differentiation [22]. We summed the 16 genes in each sample to obtain the TDS and then separated the PTCs from the TCGA dataset into a low-differentiated group and a high-differentiated group according to the median value.
Collection of immune related data
The stromal score and immune score of each sample were calculated by the ESTIMATE package in the R program [23]. Single sample gene set enrichment analysis(ssGSEA), as implemented in the R package GSVA, was used to assess the enrichment levels of 29 immune cells, and a principal component analysis (PCA) was applied.
The lists of IRGs were collected from the ImmPort web portal (https://immport.org/shared/home), which contains vast immunology data and resources. Overall, 1240 IRGs were present in our array and were further analysed.
Function and pathway enrichment analysis
A gene annotation enrichment analysis was carried out by using the R package clusterProfiler. A gene set enrichment analysis (GSEA) was performed three times in our study as follows: one GSEA was performed to compare the differences in IRGs between ATCs and normal tissues or PTCs in the GSE33630 cohort; the second GSEA was performed to identify differentiation-associated immune signalling pathways in the TCGA cohort; and the third was to analyse the differences between the low-risk-score and high-risk-score subtypes in the expression of broad hallmark gene sets in the combined GEO cohort [24]. We also identified the signalling pathways of the deregulated IRGs in ATCs in the GSE33630 cohort by performing Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses.
Cell culture
All cell lines were cultured in DMEM supplemented with 10% fetal bovine serum (FBS) (Gibco). The normal thyroid follicular epithelial cell line Nthy-ori 3-1 and PTC cell line TPC-1 were obtained from the Institute of Medical Biology Chinese Academy of Medical Sciences (Kunming). The ATC cell lines CAL-62, DRO and 8305C were provided by Sun Yat-sen University Cancer Center (Guangzhou). All cell lines were authenticated by short tandem repeat (STR) sequencing and confirmed to be negative of mycoplasma.
qRT-PCR
The total RNA was isolated from the cells using a FastPure® Cell/Tissue Total RNA Isolation Kit (Vazyme). cDNA synthesis was carried out using HiScript® II Q RT SuperMix (Vazyme). qPCR was performed using ChamQ™ SYBR® qPCR Master Mix (Vazyme) on an Applied Biosystems 7500 system. The reaction procedure was performed according to the manufacturer’s instructions. The sequences of the primers were as follows:
human GAPDH forward: CTCCTGCACCACCAACTGCT,
human GAPDH reverse: GGGCCATCCACAGTCTTCTG;
human MMP9 forward: CAGTCCACCCTTGTGCTCTTC,
human MMP9 reverse: TGCCACCCGAGTGTAACCAT;
human SDC2 forward: TGGAAACCACGACGCTGAATA,
human SDC2 reverse: ATAACTCCACCAGCAATGACAG.
GAPDH was used as a control, and the fold changes of the target genes were calculated by the ΔΔCt method; the normalized gene expression was calculated using 2-ΔΔCt.
Immunohistochemical staining
Paraffin-embedded ATC, PTC and normal thyroid tissue specimens were sectioned at 4 µm and incubated at 70 °C for 2 hours. Xylene was used to deparaffinize the samples, and then, the sections were hydrated in gradient ethanol. Antigen retrieval was carried out using sodium citrate and a pressure cooker for boiling for 2 minutes. After washing with phosphate-buffered saline (PBS), the slides were incubated with the following primary antibodies overnight: anti-SDC2 (Proteintech, 1:300), anti-MMP9 (Cell Signaling Technology, 1:100), anti-TG (Proteintech, 1:200), anti-PLAUR (Proteintech, 1:100) and anti-FGFR2 (Cell Signaling Technology, 1:100). After incubating with the appropriate secondary antibodies, the samples were reacted with DAB reagents.
Statistical analysis
The analysis of the count data of differentially expressed IRGs was performed by the R package limma in the GEO cohorts and IMvigor210 cohort and the R package DESeq2 in TCGA cohort. The Benjamini-Hochberg method was applied to adjust the p-value based on the false discovery rate (FDR). The eBayes method in the R package was used to identify the differentially expressed genes (DEGs), with an FDR<0.05 and fold change≥2 as screening conditions. The R packages VennDiagram and Heatmap were used to draw Venn diagrams and heatmaps. ggplot2 was used to generate volcano plots and other plots.
The Wilcoxon test (also called the Mann-Whitney U test) was applied to compare the continuous variables between the two groups. For comparisons among three groups, the Kruskal-Wallis test was used. To evaluate the association between the clinicopathological characteristics or IRGs and PFI in the TCGA cohort, a univariate Cox proportional hazard model was used. A multivariable Cox regression model was further conducted to identify the independent prognostic factors. Both the hazard ratio (HR) and 95% confidence interval (CI) were calculated.
To investigate the role of selected IRGs, a risk score model was constructed by integrating the regression coefficient derived from the multivariable Cox regression analysis and the expression data of screened IRGs as follows: risk score=∑(Coefi*Expi) [25]. Based on the median value, the samples from different cohorts were dichotomized into a low-risk score group and a high-risk score group. Correlation analyses were conducting by using Pearson’s test in the R program. Receiver operating characteristic (ROC) analyses were carried out to calculate the area under the curve (AUC) to assess the accuracy of the risk score in predicting dedifferentiation. Kaplan-Meier curves of the PFI were built, and the log-rank test was applied to determine the differences using SPSS software. The distributions of the response to immunotherapy in the low-risk score group and high-risk score group were evaluated with a chi-square test or two-sided Fisher's exact test.
All graphic representations and statistical analyses were performed by using R software (version 3.6.1), GraphPad Prism (version 7.0) and SPSS software (version 21.0). p<0.05 was considered statistically significant in all analyses.