Development of A Three-Gene Signature Prediction Model for Lymph Node Metastasis in Papillary Thyroid Cancer


 Background: Thyroid cancer is one of the most prevalent endocrine cancers with a rising incidence rate over the past years. Papillary thyroid cancer (PTC) is the dominant historical type of thyroid cancer. Early lymph node metastasis happens frequently in PTC. However, some of the lymph node metastasis may be troublesome for detecting because of limited methods.Methods: Robust rank aggregation afforded us the shared differential expression genes among multiple datasets. Gene ontology analysis was performed to identify potential functions. Weighted gene co-expression network analysis was used to research the correlations between gene expression patterns with clinical characteristic. Protein-protein interaction network was performed to identify the hub genes. The least absolute shrinkage and selection operator and Logistic regression were performed to construct a prediction model.Results: We developed a three-gene signature prediction model for lymph node metastasis in PTC through transcriptomic analysis. After quality control, we collected 8 microarray datasets from GEO database and an RNA sequencing dataset from TCGA database. We found the transcriptome profiles were correlated with lymph node metastasis and 3 genes were verified to be independent prediction factors towards those statistic approach. Afterwards, we designed a predicable risk score system and effectively confirmed the model in two independent papillary thyroid cancer cohorts.Conclusions: We recommended a successful predicable model of lymph node metastasis in papillary thyroid cancer patients with moderate accuracy.


Background
Thyroid cancer, one of the most prevalent endocrine cancers with a rising incidence rate over the past years, is the fth leading incidence of cancer in female [1]. Papillary thyroid cancer (PTC) is the dominant historical type which contributes to approximately 84% of all thyroid cancer [2,3]. However, through the low mortality and moderate prognosis were frequently mentioned, the recurrence and the complications are still perplexing those PTC patients [4]. Besides, lymphatic invasion, cervical lymph node metastasis, larger size of tumor, increasing diagnosis age and extraordinary enlargement of thyroid tissue increase the progression risk of PTC [5]. Due to early lymph node metastasis happening in PTC frequently [6], early detection and diagnosis are of great value. Currently multiple methods like thyroid and neck ultrasound, CT/MRI and ne-needle aspiration (FNA) for suspicious lateral neck nodes could effectively diagnose thyroid cancer [7,8]. Yet some of the lymph node metastasis may be troublesome for detecting. We therefore urgently called for a reliable and straightforward approach to determine the possibility of lymph node metastasis.
Recent years, high throughput analysis afford us an advanced and e cient technique of evaluating the molecular disruptions in tumor tissues. For instance, A study of predicting chemotherapy sensitivity in cervical cancer, in which expression levels of 22 total and phosphorylated protein were analyzed in 181 frozen tissue samples, resulted in a model that was capable of predicting patients' chemotherapy sensitivity and assessing clinical outcome [9]. Thus, as more and more cancer sequencing databases are established, there is a great potential for us to acquire and analyze these data and guide clinical decisions with the ndings.
In order to discover the predictive value of lymph node metastasis in the papillary thyroid cancer transcriptome, we searched for several RNA-seq as well as gene microarray datasets containing both papillary thyroid cancer samples and normal thyroid samples to identify differentially expressed genes. Genes with powerful correlation to lymph nodes metastasis were selected as well. Afterwards, we designed a predicable risk score system and effectively con rmed the model in two independent papillary thyroid cancer cohorts. The entire ow of our efforts to identify predictive models was presented in Additional le 1: Fig. S1. Eventually, we recommended a successful predicable model of lymph node metastasis in papillary thyroid cancer patients with moderate accuracy.

Methods
Data collection, normalization and preprocessing Gene microarray datasets and the associated clinical data of PTC and normal thyroid samples were downloaded from UCSC XENA (https://xenabrowser.net/) and Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/). Totally 8 GEO datasets and 1 The Cancer Genome Atlas (TCGA) dataset were involved in this research. Gene labels in other forms were all normalized into o cial gene symbols. Furthermore, those samples were proceeded to a quality control step called SimpleAffy to lessen bias. SimpleAffy was used to identify the 3'-to-5' ratios of GAPDH and β-actin. Outliers were excluded from the study.

Differential analysis of gene expression datasets
EdgeR R package afforded us the differential expression analysis between papillary thyroid cancer tissues and normal thyroid tissues. The lter thresholds of P value were set as less than 0.05 in integrated analysis. |log2FC (fold change) | was set as more than 0.1 or more than 0.5 in GEO or TCGA microarray separately due to the different sample sizes. A robust rank aggregation (RRA) algorithm was used to determine the overall differentially expressed genes DEGs in the 8 GEO datasets in which they were all up-regulated or down-regulated. The most relevant genes could be revealed in the analysis.

Functional annotation analysis
Gene Ontology (GO) analysis was performed on the DEGs in order to identify their potential functions. The process was settled by the clusterPro ler 3.11. The result of three major GO terms called biological process (BP), cellular component (CC) and molecular function (MF) could be visualized in dot plot or pie chart form by using the ggplot2 R package.
Weighted gene co-expression network analysis (WGCNA) All samples of TCGA underwent a sample clustering test to identify their relationships. Those outliers were eliminated and then the others went through soft-threshold de ning procedure. A Scale-free gene co-expression network was attained when the correlation coe cient was 0.85. At this point the soft-thresholding power was equal to 12. Afterward, WGCNA algorithm of R software was performed to construct a scale-free network of all DEGs of TCGA. After the construction, cluster analysis was operated to arrange genes with similar expression patterns into gene modules. The minimum size of each module was set as 30. Besides, the cut height threshold for merging modules was de ned as 0.25 and some of the modules could be integrated to analyze. Lately, we constructed the relationships between modules and phenotype. The module-trait association was de ned as the correlation of module eigengenes (MEs) and traits. Accordingly, each correlation value and the signi cance value were calculated separately to reveal the relevant modules, which are intently related to the traits. For each gene modules, gene signi cance (GS) is described as the correlation level between clinical trait and expression pattern. While module membership (MM) stands for the correlation level between MEs and expression pattern. In our study, genes with GS score more than 0.8 and MM score more than 0.2 were de ned as module genes with high correlation to certain phenotype.

Protein-protein interaction (PPI) network and identi cation of hub-genes
The DEGs of both GEO and TCGA datasets were uploaded to the STRING tool (http://www.string-db.org/) to look into their correlation and to discover the hub-genes in the gene network. A darker line represents a stronger edge con dence in the network. Then, Cytoscape software was utilized to visualize the outcome of STRING database and to acquire the hub-genes which have a closest relation and have a more considerable function among the DEGs. In our study, the 100 genes with the highest connectivity degree were identi ed as hub-genes of the PPI network.

Statistical analysis of model acquisition and validation
We randomly separate the TCGA microarray data and relevant clinical data into 2 parts, named a training cohort and a validation cohort. A univariate Logistic regression was applied to DEGs in TCGA training cohort, aid of nding the relationship between those genes and patient's prognostic data. The results were then diminished by using the LASSO algorithm with the R package 'glmnet'. The minimal partial likelihood deviance was carried out as optimal tuning parameter (λ) changed. Also some of the coe cients of gene would reduce to zero. Those genes were excluded and the others were accessed to the multivariate Logistic test. The hazard ratio (HR) and 95% con dence intervals (CI) of each gene would be calculated and only those genes which did not include 1 in the 95% CI were selected as nal trait-relative genes. The coe cient of each gene in multivariate Logistic regression model was summed to calculate the risk score of lymph node metastasis. The function could be computed as follows, in which βi stands for the Logistic regression coe cient of gene i in the training cohort.
By calculating risk score of each sample, we could sort them and divide the cohort into low risk group and high risk group by the median value. A student's t test of two groups in comparing the trait demonstrated that the model could predict the risk effectively. Meanwhile, the receiver operating characteristic (ROC) curves was used to verify the sensitivity and speci city of lymph node metastasis related model risk prediction. After the model construction, the corresponding approach above was applied to TCGA validation cohort as well as GEO independent cohort in turn for risk predicting system con rmation.

Identi cation of DEGs and associated GO analysis in GEO datasets
Firstly, all the human tissue microarrays including papillary thyroid carcinoma samples and paired/unpaired normal thyroid samples were extracted from the GEO database. In order to make sure the quality of the research, SimpleAffy R package was utilized to determine the 3'-to-5' ratios of β-actin and GAPDH (Additional le 2: Fig. S2). After excluding seven tumor samples and 2 normal samples in three datasets, a total of 187 tumor samples and 119 normal control samples in eight datasets (GSE33630 [10,11], GSE60542 [12], GSE66783 [13], GSE5364 [14], GSE129562 [15], GSE97001 [16], GSE3467 [17] and GSE27155 [18,19]) were included this research (Table 1). After performing differential analysis on the 8 datasets by the edgeR and robust rank aggregation algorithm, the DEGs between the normal tissues and PTC tissues of each dataset were obtained, with an adjusted P value < 0.05 and |log2FC (fold change) | > 0.1 as the cut-offs. A total of 531 over-expressed genes and 474 suppressed genes were discovered in PTC tissues. The top ten genes that were overexpressed or suppressed are listed by a heatmap in Fig.  1a. GO analysis was performed on the DEGs in order to identify potential functions. Those DEGs were enriched in GO terms including 500 BPs, 70 CCs and 28 MFs with the cut-off value set as adjust P value < 0.01 and q value < 0.01. TOP 10 GO terms of each category were displayed in Fig. 1b. Among them, extracellular structure organization and extracellular matrix organization, cell-substrate adhesion and renal system development were three signi cant aspects in BP category. The rst two functions are also remarkable in CC term. The most enriched MF terms were sulfur compound binding, glycosaminoglycan binding and serine-type peptidase activity.

Identi cation of DEGs and associated GO analysis in TCGA dataset
The TCGA-THCA dataset consisting of 56 normal thyroid tissues and 497 thyroid cancer tissues were selected by integrated analysis (P value < 0.05 and |log2FC | > 0.5 set as cut-offs), from which 6492 mRNAs with steady differentially expressed patterns were identi ed by edgeR analysis. Similarly, these DEGs were analyzed by GO functional enrichment analysis to nd out the potential biological functions. The BP term were shown in a pie chart ( Fig. 2a). Activation of MAPK activity and leukocyte mediated cytoxicity are the top 2 enriched terms of BPs. While CCs and MFs were exhibited in Fig. 2b. Each category only shows the top ten functions.
Discovery of a strong correlation between lymph node metastasis and DEGs in TCGA dataset by WGCNA algorithm For DEGs of thyroid cancer tissues and normal thyroid tissues in the TCGA dataset, we used WGCNA algorithm to nd out which clinical traits they are signi cantly related to. Firstly, after screening the clinical characteristics and sample tree, a total of 482 thyroid cancer samples were included in the study. In addition, the sample tree and the clinical characteristics of each sample are also shown in Additional le 3: Fig. S3a. Then, through the soft threshold screening (Additional le 3: Fig. S3b), a gene co-expression network with the scale-free characteristics was established, in which the evaluation parameter R 2 was set as 0.85. After the network was established, genes with similar expression level were grouped in the same module and displayed in the form of a hierarchical clustering diagram (Additional le 3: Fig. S3c). Different colors indicated separate modules. It should be noted that the genes in gray color were not classi ed into any modules. After that, the clinical characteristics of the samples were taken out and analyzed for correlation with each module (Fig. 3). We can nd that the brown module (r =-0. 34, p=3e-14) and the turquoise module (r= 0.35, p=6e-15) were highly correlated with lymph node metastasis. By calculating the MM value of these two modules (cut-off set as MM P value < 0.05 and Gene MM value > 0.8), a total of 106 genes that were closely related to the brown module and the turquoise module were identi ed for further analysis. Since the results of WGCNA strongly implied the relevance of these DEGs to lymph node metastasis, we subsequently concentrated on lymph node metastasis and attempted to explore models that could accurately predict lymph node metastasis in PTC.
Potential hub-genes relating to lymph node metastasis revealed by PPI network The shared DEGs in both GEO datasets and TCGA dataset were considered as important genes for papillary thyroid cancer. To narrow the scope and identify the much meaningful genes among them, PPI network analysis was used to nd key regulatory genes in these genes. STRING database (https://string-db.org) was used to identify the PPI network of these 599 shared DEGs. Fig. 5a showed the overall PPI regulatory network of these differential genes.
through the PPI network connection score, one cluster with the highest score was displayed in Fig. 5b. After all genes going through PPI analysis, some potential hub-genes can be found based on the score. The rst 100 hubgenes were extracted as key genes. What's more, those hub-genes were intersected with the WGCNA module genes mentioned above (Fig. 5c). A total of 9 genes were identi ed as important DEGs and related to lymph node metastasis.
construction a risk scoring system by logistic regression analysis and LASSO algorithm We randomly divided the thyroid cancer samples in the TCGA database into two parts. The rst cohort named training cohort was utilized to nd the risk scoring system of lymph nodes metastasis. Meanwhile, the other one named validation cohort was used to con rm the system. Also, an independent GEO cohort with clinical traits was enrolled to verify the model. Their clinical features were listed in Table 2. Univariate logistic analysis was performed on the 9 genes to nd out if they related to lymph nodes metastasis (Fig. 4d). A predictive gene was de ned as a hazard ratio (HR) and 95% con dence interval (CI) greater than or less than 1 and P value less than 0.05. It is gratifying that these nine genes were all found to be associated with lymph node metastasis after the analysis. Among them, EPHB3, also called EPH Receptor B3; MET, one of the receptor tyrosine kinase; ICAM1, also named Intercellular Adhesion Molecule 1; SERPINA1, also called Serpin Family A Member 1 and FN1, also named Fibronectin 1 are the single risk factors for lymph nodes metastasis, while ITPR1, also named Inositol 1,4,5-Trisphosphate Receptor Type 1; PPARGC1A,also called PPARG Coactivator 1 Alpha; GNA14, also named G Protein Subunit Alpha 14 and BCL2,a apoptosis regulator were found as protective factors in this trait. In order to compress the model and identity the key genes, A least absolute shrinkage and selection operator (LASSO) regression model was then used to test these 9 Genes. In the LASSO model, when the value of λ increases, more coe cients (genes) will be set to zero, meaning those variables could be remove from the model due to their shrinking property (Fig.  5a). Six genes model satis ed the minimum partial likelihood deviance due to the ridge regression. The minimum log(λ) was -3.67 at this status. Finally, these six genes were subjected to multivariate logistic analysis in order to nd the nal risk prediction model. The result indicated that MET, ITPR1, and BCL2 were independent prognostic factors for lymph nodes metastasis (Fig. 5c, Table 3). The score of risk estimation model could be calculated as: expression of ITPR11 × 0.589 + expression of MET × 0.841 -expression of BCL2 × 0.786. The factors stand for the respective multivariate Logistic regression coe cients.
Veri cation of the prediction system through validation cohorts from TCGA and GEO datasets TCGA training cohort used for model construction were separate into high-risk group and low-risk group based on each sample's risk score (Fig. 6a). The median risk score (3.879) was set as the cut-off. For two groups, a signi cant difference of lymph nodes metastasis was clearly shown in heatmap (Fig. 6b). A student's t test also demonstrated that the high-risk group had a higher frequency of lymph nodes metastasis than the low-risk group (Fig. 6c) .The receiver operating characteristic (ROC) curve shows the e ciency of the prediction model, in which the area under the curve (AUC) was 0.744 (Fig. 6d). In order to verify the prediction accuracy of the risk prediction model, we used multiple cohorts to verify. TCGA validation cohort has been operated by the same protocol (Fig. 6e). Obviously, the result also showed that a high risk score could be riskier to lymph nodes metastasis by a heatmap (Fig. 6f) and student's t test (Fig. 6g). The AUC over validation model is 0.711 (Fig. 6h). At the same time, we conducted an independent veri cation in another cohort combined from two GEO dataset (GSE3467 and GSE60542) with the same platform (GPL570) and corresponding clinical data, in order to show that the risk scoring system is generally applicable (Fig. 7a). In the heatmap, a strong tendency was discovered of which higher risk score indicated higher frequency of lymph nodes metastasis (Fig. 7b). A student's t test veri es the point (Fig. 7d).
Furthermore, gene expressions of each sample tissue were displayed in Fig. 7c as a heatmap. Finally, we performed a ROC analysis and a meaningful result was revealed with 0.6842 of AUC value. Accordingly, the three-genes model was veri ed in multiple microarray and all showed a great difference. It is a reliable, accurate and independent predictive appliance for determining lymph nodes metastasis in papillary thyroid cancer patients.

Discussion
Genomic analysis has an extraordinarily critical role in tumor research. For example, Paik S et al. derived a 21-genes based recurrence scoring system from a prospective analysis of multiple gene expression levels in a breast cancer population [20]. Besides, transcriptome mapping could also be applied for the determination of molecular subtypes of tumors. It has already been implemented in colorectal cancer [21], breast cancer [22], prostate cancer [23], and pancreatic cancer [24], which has a facilitating effect on clinical decision-making. Moreover, genomic studies provide an insight into the tumor immune microenvironment. Xu M et al. calculated the corresponding immune in ltration scores from the expression of immune-related molecules in breast cancer specimens and the immune score was found to perform a detrimental effect in overall survival and recurrence-free survival [25]. Chakladar J et al. also analyzed immune-related genes in PTC through the combined application of genomics and transcriptomics [26]. In the present research, we investigated multiple independent datasets of PTC, and successfully established a three-gene (MET, ITPR1 and BCL2) prediction model for lymph node metastasis in PTC patients.
In our study, MET and ITPR1 expression in the predictive model was a risk factor for lymph node metastasis, whereas BCL2 was a protective factor. Previous study has reported that BCL2 was found to be highly expressed only in poorly differentiated tumors [27]. Moreover, another study also showed that a lower expression level of BCL2 could act as an early sign of oncogenesis and be a reason for the favorable prognosis [28], suggesting that BCL2 may act as a protective factor in thyroid cancer. As for ITPR1, one study demonstrated that up-expression of ITPR1 shelters renal cancer cells against natural killer cells [29]. Another study showed that ITPR1 could enhance paclitaxel toxicity in breast cancer [30]. We discovered ITPR1 is a risk factor for lymph node metastasis in thyroid cancer, but the biological mechanism still under solving. As a heterodimeric transmembrane receptor tyrosine kinase, MET mediates the activation of multiple signaling pathways, including PI3K/AKT, Ras-Rac/Rho and phospholipase C-γ pathways [31]. Signi cantly higher level of MET was detected in PTC [32,33], non-small cell lung cancer [34], bladder cancer [35] and oral cancer [36]. Those results all add up to a better proof of MET as a cancerpromoting factor and can be corroborated with our ndings. In conclusion, based on the available studies, BCL2 and MET match the results more accurately, while ITPR1 in thyroid cancer has been less studied and needs to be further explored.
The Robust rank aggregation algorithm was utilized to analyze multiple gene sets integrally and identify the common DEGs. This algorithm compensates for the limitations of the previous single data set analysis and minimizes bias. Since PTC samples are generally small, the RRA algorithm could be quite helpful. WGCNA is an effective method for describing associations of gene expression patterns with clinical phenotypes. In thyroid cancer research, WGCNA was widely used [37,38]. Based on the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) statement, studies developing new prediction models should always go through an internal validation (also called self-validation) to quantify the predictive appearance. Also, it is rmly recommended to appraise the model in other data (also called external validation) after developing a prediction model [39]. We then applied our prediction model to TCGA self-validation cohort and one independent GEO cohort.
The outcomes of two dataset strongly t in the model.
The innovative feature of our study is that the combined application of RRA algorithm, WGCNA analysis and PPI methods was rst used to analyze the correlation between PTC and lymph node metastasis. The degradation of samples may cause bias of the results. Therefore, we performed quality control on each sample in the dataset.
Tissues that did not meet the requirements (degradation occurred) were excluded from the follow-up experiment. Furthermore, the predictive model of PTC lymph node metastasis was uniquely established, which might be useful for therapeutic decision-making and clinical monitoring. However, the inadequacy of this study is that the prognostic data in the TCGA database for papillary thyroid cancer are quite good and we failed to nd signi cant differences in prognosis, while the GEO dataset was unable to nd prognosis information. Therefore, it is regrettable that prognosis cannot be measured.
For PTC, lymph node metastasis may arise at an early stage and there is a potential risk for skip metastasis [40], the mechanism of which is currently not clear. Routine diagnosis methods such as neck CT/MRI and neck lymph node ultrasound [7,8] may not be able to fully detect the development of lymph node metastasis. Patients who develop lymph node metastases are likely to require postoperative I 131 radiation therapy and may have a higher recurrence rate and lower survival rate [41]. Therefore, the outcome of this study may have a better role in predicting lymph node metastasis in PTC patients. Patients in the low-risk group have a lower likelihood of lymph node metastasis. With current conventional methods of measuring expression level, such as RT-qPCR or immunohistochemistry, we can easily, accurately and economically obtain risk scores for this patient, thus allowing the model to be better applied in clinical practice.

Conclusions
Summarily, this study is a highly scienti c and accurate method with successful predictive signi cance for determining lymph node metastasis in PTC patient. Among the model, 2 genes were identi ed as risk factors and 1 genes was protective factor in PTC patients. It might be potential treatment targets and urged for further research.    Identi cation of DEGs in GEO datasets along with their GO analysis. a TOP 10 up-regulated and down-regulated genes in eight GEO datasets were displayed as a heat map. Each grid represents the differential expression of these genes in each dataset with the log2FC inset, while the red color representing up-regulated genes and the green color    Revealing of potential hub-genes relating to lymph node metastasis. a The overall PPI network made of 599 DEGs in both TCGA and GEO datasets. b One of the clusters that had the highest score in overall PPI network. c Venn gram revealed the intersection between top 100 genes in PPI network and the genes with high relationship to modules that strongly related to lymph node metastasis. d Forest illustration of the univariate logistic analysis. P value and Hazard ratio of each gene were shown on the left side. Hazard ratio less than 1 indicated that the gene was a protective factor. Meanwhile a bigger-than-1's hazard ratio indicated a harmful factor.

Figure 5
Drawing three genes most relevant to lymph node metastasis by LASSO algorithm and multivariate Logistic regression. a The optimal parameter (λ) was chosen by cross validation. The dashed vertical line on the left intersects over the best log λ, corresponding to the maximum value of AUC. b LASSO coe cient plot of 9 DEGs. Each curve represents a coe cient and the x-axis represents the regularization penalty parameter. Those coe cients that do not become zero as x changes are included in the LASSO regression model. c HRs and 95% CIs of the three genes based on multivariate Logistic regression analysis of the training cohort from TCGA-THCA dataset. Figure 6