Identication of a three-gene signature associated with immune inltration to distinguish follicular thyroid cancer and adenoma: an integrated bioinformatic analysis

Follicular thyroid cancer (FTC), accounting for about 15% of all thyroid cancers, was characterized by a more invasive nature and earlier hematogenous spread. The distinction between FTC and follicular thyroid adenoma (FTA) remained challenging for clinical scientists due to their highly similar cytological features at present, and frequently led to potentially unnecessary surgical procedures. The present study aimed to identify potential new biomarkers to improve diagnosis.


Page 3/16
Background Thyroid cancer was the most common endocrine malignancy with a rapidly increasing incidence in the past 30 years throughout the world [1]. Most of thyroid cancer was indolent and curable with low risk of recurrence or disease-speci c mortality. An accurate preoperative diagnosis to identify malignant from lots of thyroid nodules was important, or it would lead to potentially unnecessary surgical procedures and potential complications. In this sense, follicular thyroid neoplasm might be the most controversial area in the thyroid pathology [2,3].
Follicular thyroid cancer (FTC) and follicular thyroid adenoma (FTA) were the malignant and benign thyroid epithelial tumor showing follicular cell differentiation, respectively. FTC, accounting for about 15% of all thyroid cancers, was characterized by a more invasive nature and earlier hematogenous spread [4,5]. The distinction between FTC and FTA was based on the presence of capsular and/or vascular invasion. However, neither of them could be easily identi ed by ne-needle aspiration biopsy (FNAB) cytology, the currently used methodology for preoperative diagnosis after clinical and ultrasound malignancy risk strati cation [5,6]. Several studies had investigated the gene expression pro le and tried to nd some markers to exclude malignancy preoperatively [7][8][9][10][11][12][13][14][15][16], however, reproducibility of results from different studies was rather low and no powerful molecular markers had been established. Although some genetic alterations had been found to play a fundamental role in FTC development in initial studies, such as RAS, H/K/NRAS, and PAX8-PPAR-γ [17][18][19][20][21], they were not found to be speci c for FTC in subsequent studies, as these genetic alterations occurred in both FTCs and FTAs with similar frequencies [22][23][24][25][26]. So effective molecular markers to differentiate FTC from FTA were urgently needed.
Integrated bioinformatic analysis with multiple datasets had emerged recently as an e cacious novel approach to identify novel genes and comprehend the underlying molecular mechanisms of cancer [27,28]. In the present study, we carried out a bioinformatic analyses based on four microarray datasets from Gene Expression Omnibus (GEO) database, to identify potential new markers to distinguish FTC and FTA.
Three of them (GSE82208, GSE15045, and GSE29315) were used to screen differentially expressed genes (DEGs). Enrichment and protein-protein interaction (PPI) analysis were performed, and hub genes were selected. Subsequently, a gene signature was identi ed by logistic regression and further validated in another independent dataset GSE27155. Furthermore, association of the gene signature and tumorin ltrating immune cells was analyzed. The ow chart of this study was shown in Fig. 1. This study would promote our understanding about the molecular mechanism of FTC development and provide potential target genes for differential diagnosis.

Gene Expression Pro le Data
Four microarray gene expression pro les (GSE82208, GSE15045, GSE29315, and GSE27155) were obtained from GEO database (http://www.ncbi.nlm.nih.gov/geo). All included datasets met the following inclusion criteria: 1) expression pro ling by array; 2) tissue samples gathered from human FTC or FTA tissues; 3) included at least 10 samples. For GSE27155 and GSE29315, only FTC and FTA samples were selected for analysis in this study. Information about datasets was summarized in Table 1 and Additional le 1. Integrated Analysis of Microarray Datasets Three GEO datasets (GSE82208, GSE15045 and GSE29315) were used for DEGs screening. The downloaded platform and matrix les were read and pre-processed utilizing Limma package in R/Bioconductor software [29]. The names of probes were converted to international standard gene names (gene symbols) and saved in TXT format. The probes without matching gene symbols were ltered out, and the values of different probes mapped to the same gene were averaged to obtain the nal gene expression value. Limma package was also used to identify DEGs between FTC and FTA samples in each dataset. Subsequently, gene integration for the DEGs identi ed from the three test datasets was conducted employing RobustRankAggreg package [30]. |log2FC| > 0.5 and P-value < 0.05 were de ned as the threshold for identifying DEGs.

Functional and Pathway Enrichment Analysis
Gene Ontology (GO) enrichment analysis was one of the most important tools used to predict the potential functions of target genes based on biological processes (BP), cellular components (CC), and molecular function (MF) [31]. Kyoto Encyclopedia of Genes and Genomes (KEGG) was the major public database comprised of known genes and their biochemical functionalities [32]. So to investigate the functions and processes of the DEGs, ClusterPro ler package in R was utilized to annotate and visualize GO terms and KEGG pathway of upregulated and downregulated genes, respectively [33]. The gene annotation information was obtained from org.Hs.eg.Db. P-values were adjusted for multiple testing depending on the Benjamini-Hochberg False Discovery Rate (BH) method, and the adjusted P-values < 0.05 were regarded as the cut-off criteria for signi cant results.

PPI Network Construction and Hub Genes Identi cation
The Search Tool for the Retrieval of Interacting Genes (STRING) database provided information regarding the predicted and experimental interactions of proteins [34]. In this study, the identi ed DEGs were input into the online STRING database (https://string-db.org/) to detect the potential relationships and construct a PPI network. A combined score of ≥ 0.4 was used as the cut-off value. Thereafter, Cytoscape software (version 3.6.0) was used for the visual exploration of the PPI network. Moreover, the Cytoscape plug-in Molecular Complex Detection (MCODE) was used to screen notable modules in this PPI network.
Degree cutoff = 2, Node Score Cutoff = 0.2, and K-Core = 2 were set as the advanced options. A MCODE score ≥ 5 was used as the selection criterion. Genes with a cut-off degree value of ≥ 24 were identi ed as hub genes. The enrichment analysis of each module and hub genes was performed by the Funrich software 3.1.3, including BP, CC, MF, transcription factor (TF) and biological pathway (BPA). A P-value < 0.05 was set as the cut-off criterion.

Expression Levels of Hub Genes in Malignant Thyroid Nodules
The expression of identi ed hub genes in FTC and FTA samples was explored in an independent dataset GSE27155, and the boxplot was performed to visualize the results.
The Cancer Genome Atlas (TCGA) database contained abundant genetic data of several other histological subtypes of thyroid cancer. UALCAN (http://ualcan.path.uab.edu/index.html) was a userfriendly, interactive web resource for analyzing cancer transcriptome data from TCGA [35]. In the current study, UALCAN was used to explore the expression levels of hub genes in TCGA thyroid cohort. A P-value < 0.05 was considered as statistically signi cant.
Gene Signature Selection for Differential Diagnosis of FTC and FTA Logistic regression was a widely applied and useful statistical, non-linear model for predicting a binarized outcome based on a sequence of independent features. Receiver operating characteristic (ROC) curve was a comprehensive index used to re ect the sensitivity and speci city of continuous variables. In this study, logistic regression with stepwise backward selection was applied to construct a gene signature model for differential diagnosis of FTC and FTA. All the hub genes were used as independent variables. The diagnosis score was calculated as follows: Logic(P) = ∑(C × EXPgene). EXP was the expression of the hub genes, and C was the regression coe cient for the corresponding gene in logistic regression analysis. FTA samples were treated as a reference group, whereas high-risk values indicated a highly probable occurrence of FTC. Then the diagnostic value of the model was evaluated with ROC curve by calculated the area under the curves (AUC). Furthermore, the selected gene signature was validated in independent dataset GSE27155. All analyses were conducted in R. All reported P-values were two-sided with P < 0.05 as signi cant.
Evaluation of Tumor-in ltrating Immune Cells and Its Association with the Gene Signature Selected CIBERSORT was an analytical tool developed by Newman et al to characterize cell composition of complex tissues from normalized gene expression pro les [36]. In the present study, gene expression data of GSE82208 with a platform of Affymetrix U133 was uploaded to the CIBERSORT web portal (http://cibersort.stanford.edu/), to infer the relative proportions of 22 in ltrating immune cell types, with the algorithm run using the LM22 signature matrix at 1000 permutations. For each sample, all evaluated immune cell fractions combined were equal to 1. Besides, CIBERSORT calculated a global P-value for the immune cell fraction result based on the statistical probability of the presence of immune cells in the sample. All samples were included in this analysis regardless of P-value, for those samples with P-value > 0.05 might represent samples with low immune cell in ltrate. The relative proportions of 22 immune cell subpopulation were compared between FTC and FTA groups using Wilcoxon test. Spearman correlation was used to assess correlations between immune cell fractions and score of the gene signature selected above. A P-value < 0.05 was considered as statistically signi cant.

Gene Expression Pro le Data and DEGs Identi ed
In all, 210 genes (55 upregulated and 155 downregulated) were identi ed as DEGs in the FTC samples compared with the FTA ones, such as FOS, IL1B, and TOP2A (Figs. 2A-C and Additional le 2). According to the cut-off criteria, we screened 15 representative upregulated and downregulated genes (Fig. 2D).

GO and KEGG Enrichment Analyses
In order to gain insights into the biological functions of DEGs after gene integration, GO enrichment and KEGG pathway analysis were performed ( Fig. 3 and Additional le 3). GO enrichment analysis revealed that 55 upregulated DEGs were mainly enriched in nuclear division and organelle ssion in BP term, neuronal cell body, leading edge membrane, and condensed chromosome in CC term; however, downregulated DEGs were main enriched in ossi cation, response to metal ion, and epithelial cell proliferation in BP term, extracellular matrix in CC term, and DNA binding, carboxylic acid binding, and organic acid binding in MF term. In addition, 155 downregulated DEGs were analyzed by KEGG analysis and showed that they were most signi cant related to cytokine-cytokine receptor interaction, uid shear stress and atherosclerosis, and IL-17 signaling pathway. No KEGG pathway was signi cant for upregulated DEGs.

PPI Network Construction and Hub Genes Identi cation
Based on the STRING database, we made a PPI network with 159 nodes and 513 edges to further explore the interaction among DEGs, as shown in Fig. 4A and Additional le 4. According to the degree of importance, three modules (modules 1, 2 and 3) with score ≥ 5 were detected by MCODE, as shown in Furthermore, enrichment analysis of module 1 and module 2 were performed and shown in Fig. 5 and Additional le 5. The chromosome segregation (P = 0.008) and regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism (P < 0.001) were identi ed as the most signi cant BP term in module 1, while immune response (P = 0.046) was identi ed as the most signi cant BP term in module 2. Besides, for module 3, transport was identi ed as the most signi cant BP term (P = 0.003), and transporter activity was identi ed as the most signi cant MF term (P < 0.001). No pathway was signi cant.
Seven hub genes with high degree of connectivity were selected ( Table 2). Enrichment analysis by FunRich of these hub genes were shown in Additional le 6 and 7. The cytokine activity (P < 0.001), DNA topoisomerase activity (P = 0.002), transcription factor activity (P = 0.003) and chemokine activity (P = 0.02) were identi ed as the most signi cant MF term, and immune response (P = 0.001) and regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism (P = 0.014) were identi ed as the most important BP term. Validation the Expression Levels of Hub Genes in TCGA and GSE27155 To validate the association of hub genes and malignant thyroid tumors, expression levels of hub genes in other histological subtypes of thyroid cancer in TCGA cohort was explored with UALCAN online tools. Results demonstrated that the expression of all seven hub genes in TCGA was also dysregulated and was consistent with that in GEO cohorts. As shown in Fig. 6A and Table 2, high expression of TOP2A and low expression of IL6, JUN, FOS, EGR1 and CCL2 were found in classical papillary thyroid cancer (PTC) tissue. Meanwhile, the expression levels of six downregulated hub genes, IL6, JUN, FOS, EGR1, IL1B, CCL2, were also downregulated in follicular variant of papillary thyroid cancer (FVPTC) compared to normal thyroid tissue. The expression level of TOP2A was upregulated in FVPTC but not signi cant.
The expression pro les of all 7 hub genes in GSE27155 were also explored. As shown in Fig. 6B, the expression level of CCL2 was signi cant downregulated in FTC samples compared to FTA samples, and the expression level of TOP2A was signi cant upregulated. It was consistent with the results above from the other three GEO datasets.

Gene Signature Selection and Validation for Differential Diagnosis of FTC and FTA
After a stepwise backward selection by logistic regression, three genes (FOS, IL1B, and TOP2A) were selected as signature genes to distinguish FTC patients from FTA patients (Fig. 7A). Using this combination of genes, we generated a nal regression model with diagnostic score calculated as follows: Moreover, ROC analysis was performed for this diagnostic score system. As shown in Fig. 7C, the threegene signature achieved an AUC as 0.839 (95% CI, 0.756-0.922, P < 0.001).
Then the score system of three-gene signature selected above was validated in another independent dataset GSE27155. As shown in Fig. 7D, the ROC curves showed that the score system was useful to distinguish FTC and FTA samples, with an AUC of 0.862 (95% CI, 0.714-1.000, P = 0.004). Taking all the four GEO dataset together, the AUC of the score system was 0.817 (95% CI, 0.740-0.894, P < 0.001, Fig. 8E).

The Landscape of Immune In ltration and Its Association with the Gene Signature Selected
Using CIBERSORT algorithm, we investigated the different composition of tumor-in ltrating immune cells between FTC and FTA tissue. As shown in Fig. 8A-C, M0 and M2 macrophages accounted for the largest proportion among the 22 immune cell subpopulations, whereas the fractions of T cells CD4 memory activated, dendritic cells activated, and neutrophils were rare. The proportions of immune cells varied signi cantly between both intra-and intergroup. Compared with FTA tissue, FTC tissue generally contained a lower proportion for plasma cell and CD8 T-cells (Fig. 8B, P < 0.05). The proportion of CD8 T-cells was moderately correlated with Macrophages M0 and T cells CD4 memory resting (Fig. 8D). Moreover, the proportion of CD8 T-cells showed signi cant negative Spearman correlation (P = 0.039) and Pearson correlation (P = 0.017) with score of three-gene signature selected above.

Discussion
Distinguishing between FTC and FTA remained challenging for clinical scientists resulting in equivocal histopathological diagnoses. Therefore, additional molecular markers to rule out malignancy and provide accurate diagnoses preoperative were urgently needed. In this study, we utilized a bioinformatics method to identify the potential biomarker and underlying molecular mechanisms of FTC.
Totally, 210 genes were identi ed from three pro le datasets, including 55 upregulated and 155 downregulated genes, which were differently expressed in FTC samples compared with FTA ones. Chi J et al had performed a miRNA-mRNA regulatory network analysis and identi ed 86 DEGs and 32 differentially expressed miRNAs between FTC and FTA [37]. Similar work was done by Hu S et al [38]. In comparison to their study focused on individual dataset, our study integrated microarray data with relatively large sample size from multiple GEO datasets. Moreover, in-depth functional enrichment analysis was carried out and a PPI network was built.
Especially, a three-gene signature with differential diagnostic value to distinguish FTC from FTA was selected in our study and further validated in independent dataset. The AUC in all four GEO datasets reached 0.817. It would be very useful and better than any of single gene as biomarker.
Among the various mechanisms involved in cancer development, mounting evidence had found that the dysfunction of immune system might exert an important role. In a malignant environment, immune system homeostasis and control of self-tolerance were signi cantly altered [39]. In our study, cytokinecytokine receptor interaction and IL-17 signaling pathway were the most signi cant pathways in KEGG analysis, suggest that dysfunction of immune system was also important in FTC compared to FTA. Cytokine-cytokine interaction pathway had also been reported to be associated with anaplastic thyroid cancer (ATC), the most aggressive thyroid cancer [40]. IL-17 was important for host defense and contributed to the pathogenesis of autoimmune diseases and cancer. Expression of IL-17 had been found to be upregulated in thyroid cancer tissues when compared with benign lesions including FTA, and was associated with recurrence and mortality [41]. IL-17RB could enhance thyroid cancer cell invasion and metastasis via ERK1/2 pathway-mediated MMP-9 expression [42]. Furthermore, 7 hub genes identi ed in our study were also signi cant enriched in immune response.
To further explore immune response involved in development of FTC, tumor-in ltrating immune cells were assessed. Association of immune cell in ltration and prognosis of patients with differentiated thyroid cancers (DTC) had been reported repeatedly [43][44][45]. However, few studies referred FTC. Here, we hypothesized that FTC would have different compositions of immune cell in ltration from FTA. And we used CIBERSORT deconvolution software to infer the relative proportions of 22 distinct leukocyte cell types in samples of GSE82208, for microarray gene expression data from Affymetrix U133 platform was most suitable for this software. The results showed that, FTC tissue contained a lower proportion for CD8 T-cells than FTA tissue. Moreover, the proportion of CD8 T-cells showed signi cant negative correlation with score of three-gene signature selected above. Cytotoxic CD8 T-cells had been demonstrated to be related to protective anti-tumor immune responses that could eliminate tumor cells [46]. CD8 T-cell in ltration with COX2 expression might predict relapse in DTC patients [47]. However, Cunha et al had also reported DTC patients with chronic lymphocytic thyroiditis background and increased CD8 T-cell tumor in ltration showed increased disease-free survival [43]. The con icting results might suggest a complex role in thyroid cancer which should be further explored.
Six downregulated genes, IL6, JUN, FOS, EGR1, IL1B, CCL2, and one upregulated gene, TOP2A, were identi ed as hub genes in our PPI analysis. The dysregulated expression of all seven hub genes was validated in other histological type of thyroid cancer in TCGA, supporting their role in thyroid cancers.
Most of them were involved in carcinogenesis, for example, JUN was considered as key genes of PTC [48,49], and FOS had been reported to contribute to the progression of ATC [40]. TOP2A gene encoded a DNA topoisomerase that was involved in DNA replication and DNA metabolic processes. High expression of TOP2A was closely related to malignant biological behaviors such as proliferation and invasion, and more importantly TOP2A had been served as a cancer target in clinical application [50]. Immunohistochemical analysis showed that TOP2A correlated with thyroid tumor histology and it was more frequently expressed in tumors with aggressive clinical behavior [51]. Also, TOP2A had been identi ed as a hub gene for ATC [52,53].
Several limitations of our study should be noted. First, although our study recruited 4 datasets, the sample size was still relatively small, and clinical factors, such as sex, age, tumor staging, were not considered. So, we could not construct a personalized prediction model for FTC and FTA to utilize in the clinic. Second, the data used in our study was accessed from a public database while the quality of the data could not be appraised. Third, due to lack of clinical samples, the results obtained solely by means of bioinformatic analysis and were not be con rmed by an independent clinical and experimental studies, such as q-PCR. So further studies with larger sample sizes and experimental veri cation would be necessary to con rm these ndings in the future.

Conclusions
In conclusion, we performed an integrated bioinformatic analysis from four GEO datasets and identi ed several core genes and pathways in FTC. Furthermore, a three-gene signature with diagnostic value was selected and validated by independent dataset. The landscape of immune in ltration in FTC was explored for the rst time. These results would not only be helpful for differential diagnosis, but also provided a novel perspective on the molecular mechanisms underlying development of FTC. However, further experimental studies with larger cohort and more clinical data were required to con rm the ndings. Availability of data and material: Publicly available datasets were analyzed in this study, which can be found in GEO database (https://www.ncbi.nlm.nih.gov/geo/).