Predicting malignancy in thyroid nodules based on conventional ultrasound and elastography: the value of predictive models in a multi-center study

This study aimed to establish predictive models based on features of conventional ultrasound (CUS) and elastography in a multi-center setting, in order to support the preoperative diagnosis of malignancy in thyroid nodules at different risk levels defined by the 2017 American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS). Five hundred forty-eight thyroid nodules from three centers, pathologically confirmed by cytology or histology, were retrospectively enrolled; all were examined by CUS and elastography before fine needle aspiration (FNA) and surgery. CUS characteristics of the nodules were reviewed according to the 2017 ACR TI-RADS. Binary logistic regression analysis was used to develop the prediction models from the statistically significant CUS and elastography features at each level of risk stratification. The models were evaluated for discrimination and calibration. In the model for all nodules (ACR model), patient age, taller-than-wide shape, lobulated or irregular boundary, extra-thyroid extension, microcalcification and the elastic parameter Virtual touch tissue imaging quantification (VTIQ) max were independent predictors of thyroid malignancy (p < 0.05); the area under the curve (AUC) was 0.912 in the training cohort and 0.877 and 0.935 in the internal and external validation cohorts, respectively. In the ACR TR4 model, the predictor of malignancy was VTIQ max (p < 0.001), with AUCs of 0.809, 0.842 and 0.705 in the training, internal validation and external validation cohorts, respectively. In the ACR TR5 model, the predictors were age, taller-than-wide shape and VTIQ max, with AUCs of 0.859, 0.830 and 0.906, respectively. All predictive models showed good calibration (p > 0.05).
Predictive models combining CUS and elastography features would help clinicians make an appropriate preoperative diagnosis of thyroid nodules across different levels of risk stratification. The elastography parameter VTIQ max is particularly valuable for identifying malignancy in moderately suspicious (ACR TR4) nodules.


Introduction
Thyroid nodules are a common clinical problem, and the prevalence of thyroid cancer has increased yearly, driven by the incidence of papillary thyroid cancer (PTC), which accounts for the vast majority of cases [1]. Thyroid imaging reporting and data systems based on ultrasound (US) features are widely used to stratify the risk of malignancy and to help determine at an early stage whether FNA is needed [2]. Various guidelines, such as the European Thyroid Association guidelines for ultrasound malignancy risk stratification of thyroid nodules (EU-TIRADS), the American Association of Clinical Endocrinologists, American College of Endocrinology and Associazione Medici Endocrinologi (AACE/ACE/AME) guidelines, and the updated 2017 American College of Radiology (ACR) Thyroid Imaging, Reporting, and Data System (TI-RADS), give recommendations for the risk classification of thyroid nodules. In the 2017 ACR TI-RADS guidelines, typical ultrasound characteristics are assigned scores, and the total score is divided into five levels (TR1-TR5) that represent the risk of malignancy. A thyroid ultrasound risk stratification system allows the malignant risk of thyroid nodules to be evaluated more objectively and accurately. The number of suspicious US features and the ACR TI-RADS score were potential risk factors for cervical lymph node metastasis in patients with PTC (<10 mm) [3]. ACR TI-RADS has been widely used to screen thyroid nodules that require puncture and further treatment. A higher ACR TI-RADS score predicts an increased risk of malignancy [4]. The system allows clinicians to manage patients more conveniently, effectively and cost-effectively [5].
Elastography has been recognized as an auxiliary method for the ultrasound diagnosis and risk assessment of benign and malignant thyroid nodules. The combination of two-dimensional shear wave elastography (2D SWE) and ACR TI-RADS classification could improve diagnostic sensitivity and accuracy when differentiating thyroid malignancy in nodules with indeterminate FNA cytology [6]. The rate of unnecessary biopsy decreased significantly when the ACR TI-RADS classification was combined with elastography in a multi-center study [7]. A meta-analysis reported that combining elastography with other ultrasound techniques improves the evaluation of indeterminate thyroid nodules [8]. In addition, a prospective study showed that strain elastography (SE) performed better than the Kwak TI-RADS classification in discriminating thyroid nodules, while their combination improved sensitivity [9].
Previous studies mostly predicted malignancy from conventional US (CUS) features together with elastography features of thyroid nodules, including 2D SWE and SE. In clinical practice, clinicians usually first use a risk stratification system to evaluate the malignant risk of a nodule and then make a comprehensive analysis that incorporates the elastography results. Therefore, predicting malignancy with the CUS risk stratification system combined with elastography fits the nodule management process better.
Our study aimed to establish prediction models using the CUS risk stratification system and elastic parameters, and to verify their diagnostic efficacy for thyroid nodules in the ACR TR4 and TR5 classifications. In addition, the models were further verified in nodules of different grades classified according to the ACR TI-RADS guideline, in order to determine an appropriate preoperative diagnostic method for malignancy.

Patients
The multi-center study was approved by the ethics committee of each of the three hospitals and complied with the Declaration of Helsinki. Informed consent was obtained from all individual participants included in the study. The flowchart of inclusion and exclusion is shown in Fig. 1. The inclusion criteria were as follows: (1) age ≥ 18 years; (2) the nodule was solid or mixed cystic and solid (<75% solid); (3) sufficient normal thyroid tissue surrounded the nodule; (4) nodule size ranged from 5 mm to 50 mm; (5) the nodule had not been treated before the examinations. The exclusion criteria were as follows: (1) malignant FNA results without confirmation by surgical pathology; (2) indeterminate nodules that did not undergo repeated puncture or surgery; (3) nodules diagnosed as benign by FNA without 1 year of follow-up; (4) incomplete or poor-quality CUS or elastography images. In patients with multiple nodules, the nodule with the most malignant features was selected. If multiple nodules had the same degree of malignancy, the nodule with the largest diameter was selected. Finally, a total of 548 nodules from 548 patients were included in the study between June 2016 and June 2019: 478 patients from center 1 (Department of Medical Ultrasound, Shanghai Tenth People's Hospital), 60 patients from center 2 (Department of Ultrasound, The First Affiliated Hospital of Harbin Medical University) and 10 patients from center 3 (Department of Ultrasound, The Second Affiliated Hospital of Kunming Medical University).

CUS, SE and 2D SWE examinations
CUS and elastography were performed with the same type of machine (Siemens ACUSON S3000 with a 4-9 MHz multi-frequency 9L4 transducer) in all three centers. The elastic color map was kept at the factory setting of the machine. Four experienced physicians from the three centers participated in image acquisition.

CUS
CUS image acquisition standards were as follows: (1) grayscale and color images of the long-axis and short-axis sections (horizontal and vertical) were stored and measured; (2) the target nodule was placed in the center of the image and occupied about one third of the image area where possible; (3) patients were asked to breathe smoothly and expose the neck during the examination. Characteristics of target nodules in grayscale images were recorded, scored and graded according to the 2017 ACR TI-RADS guideline: composition (cystic or almost completely cystic, 0 points; spongiform, 0 points; mixed cystic and solid, 1 point; solid or almost completely solid, 2 points), echogenicity (anechoic, 0 points; hyperechoic or isoechoic, 1 point; hypoechoic, 2 points; very hypoechoic, 3 points), shape (wider-than-tall, 0 points; taller-than-wide, 3 points), margin (smooth, 0 points; ill-defined, 0 points; lobulated or irregular, 2 points; extra-thyroid extension, 3 points), echogenic foci (none or large comet-tail artifacts, 0 points; macrocalcifications, 1 point; peripheral calcifications, 2 points; punctate echogenic foci, 3 points).
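The scoring scheme above can be sketched in a few lines of Python. This is only an illustration, not the authors' software: the dictionary keys and function names are hypothetical, the value of 3 points for very hypoechoic nodules and the level cut-offs (0 points TR1, 2 points TR2, 3 points TR3, 4-6 points TR4, 7 or more points TR5) are taken from the 2017 ACR TI-RADS guideline rather than from this text, and treating a total of 1 point as TR1 and summing all echogenic foci that are present are assumptions of the sketch.

```python
# Point values follow the feature list in the text; "very_hypoechoic" = 3 points
# and the TR cut-offs come from the 2017 ACR TI-RADS guideline (assumption here).
COMPOSITION = {"cystic": 0, "spongiform": 0, "mixed": 1, "solid": 2}
ECHOGENICITY = {"anechoic": 0, "hyperechoic_or_isoechoic": 1,
                "hypoechoic": 2, "very_hypoechoic": 3}
SHAPE = {"wider_than_tall": 0, "taller_than_wide": 3}
MARGIN = {"smooth": 0, "ill_defined": 0,
          "lobulated_or_irregular": 2, "extra_thyroid_extension": 3}
ECHOGENIC_FOCI = {"none_or_large_comet_tail": 0, "macrocalcifications": 1,
                  "peripheral_calcifications": 2, "punctate_echogenic_foci": 3}


def tirads_points(composition, echogenicity, shape, margin, foci=()):
    """Sum the points of one nodule; `foci` lists every type of focus present."""
    total = (COMPOSITION[composition] + ECHOGENICITY[echogenicity]
             + SHAPE[shape] + MARGIN[margin])
    total += sum(ECHOGENIC_FOCI[f] for f in foci)
    return total


def tirads_level(points):
    """Map a total score to a TR level (a score of 1 is treated as TR1)."""
    if points <= 1:
        return "TR1"
    if points == 2:
        return "TR2"
    if points == 3:
        return "TR3"
    if points <= 6:
        return "TR4"
    return "TR5"


# Example: a solid, hypoechoic nodule with a smooth margin and no echogenic foci.
example_points = tirads_points("solid", "hypoechoic", "wider_than_tall", "smooth")
example_level = tirads_level(example_points)
```

In this example the nodule scores 2 + 2 = 4 points and therefore falls into TR4, the moderately suspicious class examined separately in this study.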

SE
The SE mode was activated using the same probe (4-9 MHz multi-frequency 9L4 transducer, Siemens ACUSON S3000) once the target nodule was stable in grayscale mode. Slight pressure was applied to the patient's neck, and images were acquired after 5 seconds, once the quality number had reached 50. The sampling frame was set to include both the target nodule and the surrounding tissue. Differences in hardness within the region of interest (ROI) are reflected by the color map (blue indicates harder tissue, green softer tissue). Elastic scores were classified into five patterns as described previously [10]: Elastic score-1: the nodule is displayed homogeneously in green. Elastic score-2: the nodule is displayed predominantly in green with a few blue spots. Elastic score-3: the nodule is displayed with equal areas of blue and green. Elastic score-4: the nodule is displayed homogeneously in blue. Elastic score-5: the nodule and the surrounding tissue are displayed homogeneously in blue.

2D SWE
2D SWE was performed with Virtual touch tissue imaging quantification (VTIQ) in the elastic mode. The sampling frame was set to fully enclose the target nodule and the surrounding tissue on grayscale images. The VTIQ mode was started after the image had stabilized, and four modes were acquired: SW quality mode, SW velocity mode, SW displacement mode and SW time mode. High-quality images were selected on the basis of the SW quality mode, which is displayed as a color map (green indicates high quality, red low quality). Differences in hardness between tissues are displayed qualitatively in color (red indicates harder tissue, blue softer tissue) and quantitatively as a numerical value (range 0 to 10 m/s). The ROI was placed on the SW velocity image, and its size was set to 1 × 1 mm. The measurement was repeated seven times, and the maximum value was recorded as the parameter VTIQ max. According to the machine settings, an SW velocity result of "High" was replaced with the upper threshold (10 m/s), which corresponds to the solid portion of nodules [11,12].
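The rule for deriving VTIQ max from the repeated ROI readings can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name and the representation of a reading as either a number or the string "High" are hypothetical, and the readings in the example are invented for illustration only.

```python
UPPER_THRESHOLD_M_S = 10.0  # upper end of the velocity range given in the text


def vtiq_max(readings):
    """Return the maximum shear wave velocity (m/s) over repeated ROI readings.

    A reading reported by the machine as "High" is replaced with the upper
    threshold of 10 m/s, as described in the text.
    """
    values = []
    for reading in readings:
        if isinstance(reading, str) and reading.strip().lower() == "high":
            values.append(UPPER_THRESHOLD_M_S)
        else:
            values.append(float(reading))
    if not values:
        raise ValueError("at least one reading is required")
    return max(values)


# Invented example: seven readings, one of which exceeds the display range.
example = vtiq_max([2.8, 3.1, "High", 3.6, 2.9, 3.3, 3.0])
```

Because one of the seven readings is "High", the example evaluates to 10.0 m/s; without that reading the result would be 3.6 m/s.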

Predictive models
We built predictive models for all nodules and for ACR TR4 and TR5 nodules, respectively, using binary logistic regression. Seventy percent of the 478 nodules from center 1 were assigned to the training cohort, and the remaining 30% to the internal validation cohort. A total of 70 nodules from center 2 (60/70) and center 3 (10/70) formed the external validation cohort. Binary logistic regression was applied in the training cohort to identify predictors of malignancy. The performance of the predictive models was evaluated in terms of discrimination and calibration. Receiver operating characteristic (ROC) curves and area under the curve (AUC) values were used to evaluate the discrimination of each predictive model in the training and validation cohorts. The Hosmer-Lemeshow goodness-of-fit test was used to assess whether the three predictive models were well calibrated.
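For readers less familiar with the discrimination measure, the AUC can be computed directly from predicted probabilities as the proportion of malignant-benign pairs in which the malignant nodule receives the higher score, with ties counted as one half. The sketch below is a plain-Python illustration of that definition, not the R code used in this study; the function name and the toy data are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a malignant case outranks a benign one.

    labels: 1 for malignant, 0 for benign; scores: predicted probabilities.
    """
    malignant = [s for y, s in zip(labels, scores) if y == 1]
    benign = [s for y, s in zip(labels, scores) if y == 0]
    if not malignant or not benign:
        raise ValueError("both classes must be present")
    wins = 0.0
    for m in malignant:
        for b in benign:
            if m > b:
                wins += 1.0   # correctly ordered pair
            elif m == b:
                wins += 0.5   # tie counts as half
    return wins / (len(malignant) * len(benign))


# Toy data: 3 of the 4 malignant-benign pairs are correctly ordered.
toy_auc = roc_auc([0, 0, 1, 1], [0.10, 0.40, 0.35, 0.80])
```

In the toy data the AUC is 0.75. The pairwise loop is quadratic in the number of nodules, which is acceptable for cohorts of a few hundred cases.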

Statistical analysis
Statistical analysis was performed with SPSS 20.0 and R software. SPSS was used to compare variables between the training and validation cohorts. Quantitative and qualitative variables were described and compared between groups; quantitative variables were expressed as mean ± standard deviation and compared with the independent t test. A p value > 0.05 indicated no significant difference in a variable between the training and validation cohorts, whereas a p value < 0.05 indicated a significant difference between the benign and malignant groups of the training cohort. R software was used to build the prediction models by binary logistic regression in the training cohort. Predictive models based on CUS and elastography features were established in the whole training cohort and in its TR4 and TR5 classifications. The predictive models were evaluated for discrimination, using the area under the receiver operating characteristic (ROC) curve (AUC) and accuracy (ACC), and for calibration, using the Hosmer-Lemeshow test.
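The Hosmer-Lemeshow test named above can be sketched as follows: cases are sorted by predicted probability, split into groups, and the observed and expected numbers of malignant nodules are compared in each group. This is an illustrative plain-Python version, not the R routine used in this study; the function names are hypothetical, the default of ten groups is a common convention rather than something stated in the text, and the p value uses a closed-form chi-square tail that is valid only for an even number of degrees of freedom.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail of the chi-square distribution for even df (closed form)."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i)
                                 for i in range(df // 2))


def hosmer_lemeshow(labels, probs, groups=10):
    """Return (statistic, df, p value) for the Hosmer-Lemeshow test."""
    # Stable sort by predicted probability only; ties keep their input order.
    pairs = sorted(zip(probs, labels), key=lambda t: t[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        if expected <= 0.0 or expected >= size:
            continue  # degenerate group, contributes no information
        # Combined term for events and non-events in this group.
        stat += (observed - expected) ** 2 / (expected * (1.0 - expected / size))
    df = groups - 2
    return stat, df, chi2_sf_even_df(stat, df)


# Toy check: predictions of 0.5 with half of the cases malignant in every group.
stat_ok, df_ok, p_ok = hosmer_lemeshow([0, 1] * 4, [0.5] * 8, groups=4)
```

In the toy check, observed and expected counts agree in every group, so the statistic is 0 and the p value is 1. A p value above 0.05, as reported for the three models in this study, means the test found no significant difference between predicted and observed outcomes.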

CUS findings
Nodules classified as ACR TR4 and TR5 are shown in Table 2. In the training cohort, the features that differed significantly between benign and malignant nodules were echogenicity, shape, margin and echogenic foci (all p < 0.001). Composition was the only feature related to malignancy in the TR4 classification (p = 0.009). Shape and margin differed significantly between benign and malignant nodules in the TR5 classification (all p < 0.001) (Table 3).

SE score and 2D SWE
The elastography results of thyroid nodules are presented in Table 3. The elastic parameters SE score and VTIQ max differed significantly between benign and malignant nodules, both overall and in the TR5 classification (all p < 0.001). The cut-off values of elastography in the training cohort are shown in Table 4. VTIQ max had an advantage over SE and CUS features in diagnosing TR4 nodules (p = 0.012) (

Prediction models based on CUS and elastography features
Based on the training cohort, the characteristics of nodules on CUS and elastic images were entered into the binary logistic regression predictive models:

Discrimination
The performance of the predictive models in the training and validation cohorts, evaluated by ROC curves (Fig. 3) and AUC values, is shown in Table 6.

Calibration
All three models showed satisfactory calibration (all p > 0.05), indicating good agreement between prediction and observation (Table 5).

Discussion
CUS is the main noninvasive imaging tool for assessing the risk of malignancy in thyroid nodules and for guiding FNA. Elastography, which evaluates tissue stiffness, also distinguishes benign from malignant thyroid nodules well [13,14]. Although US elastography holds promise as a noninvasive method of assessing cancer risk, its performance is highly variable, perhaps because of factors such as operator dependence. In this study, we identified risk factors associated with thyroid malignancy at different levels of risk stratification after comprehensively evaluating ultrasound and elastic variables in a population of 334 patients. We then developed predictive models, which were validated for discrimination and calibration in the internal and external validation cohorts and applied to nodules of different grades, so as to avoid missing features of malignant thyroid nodules at different levels of risk stratification and to help clinicians make more accurate preoperative decisions. In this study, age was a negative independent predictor of thyroid malignancy in the ACR TR5 model (p = 0.012) but not in the ACR TR4 model according to binary logistic regression analysis. This result indicates that younger age may be associated with a higher probability of malignancy in nodules of a higher risk grade. Other studies have also identified younger age as an independent risk factor for thyroid cancer. However, it is the incidence of thyroid cancer diagnosis, rather than disease prevalence, that has increased dramatically in the past 30 years [15]. Modern medical practices have heightened the detection of subclinical disease, increasing the representation of low-risk patients among those diagnosed [16].
Diagnosis of benign and malignant nodules by FNA is very helpful for the detection of thyroid cancer [17]. However, most nodules are benign and up to one-third of FNA results may be nondiagnostic, so choosing the preferred preoperative approach remains a challenge [18]. A reliable, reasonable and noninvasive method to determine which nodules require FNA is essential. CUS is widely used to differentiate malignant from benign nodules [19]. Five CUS imaging features of thyroid nodules, namely solid composition, hypoechogenicity, taller-than-wide shape, irregular margins and microcalcification, are the most important predictors of malignancy [20].
The 2017 ACR TI-RADS committee has provided recommendations for the diagnosis of thyroid nodules [21]. In this study, each nodule was assigned ultrasound features in accordance with the guideline to predict malignancy. Binary logistic regression analysis was performed to evaluate the association between these features and malignancy. Taller-than-wide shape, lobulated or irregular boundary, extra-thyroid extension and microcalcification were confirmed as predictors of malignancy. In the ACR TR5 model, however, the diagnostic value of some of these features decreased after inclusion in the binary logistic regression analysis, probably because a weak predictive role can be masked by other covariates. In this study, malignant nodules accounted for only 13% of the TR4 classification, and the 2017 ACR guidelines recommend FNA for TR4 nodules measuring 1.5 cm or larger. According to our results, of the 16 malignant nodules, 75% had diameters <1.5 cm, and among these only 2 nodules (17%) showed microcalcification and 5 nodules (41.6%) showed irregular boundaries on US. US therefore has limitations in differentiating malignant from benign nodules. However, 2D SWE, especially VTIQ max, showed good diagnostic performance for TR4 nodules in the regression analysis. Other studies have also reported that SWE is a highly accurate diagnostic modality for the identification of malignant thyroid nodules [22]. Different subtypes of thyroid cancer could be quantitatively distinguished by SWE in the study by Bardet et al. [23]. Other research has reported that SWE with a cut-off value of 22 kPa can differentiate benign from malignant nodules that cannot be diagnosed by FNA [24].
Fig. 3 Images of a 43-year-old woman with pathologically confirmed PTC. a CUS shows a thyroid nodule (arrow) in the left lobe with hypoechogenicity and an irregular boundary, classified as TR4 according to 2017 ACR TI-RADS; b the nodule is displayed predominantly in blue on the SE image; c the VTIQ max of the nodule is shown as "high" (more than 10 m/s)
Discrimination and calibration are the most commonly used pair of indicators for evaluating predictive models, and both were examined by internal and external validation in this study. The AUC was used to evaluate discrimination. Witczak et al. retrospectively reviewed the demographic, biochemical and ultrasound characteristics of 536 thyroid nodules. They reported that serum thyrotropin, sex, microcalcification and margin were independent predictors of malignancy. A model consisting of these variables and age group was developed and demonstrated an AUC of 0.770 [25]. In another study, a radiomics score demonstrated an AUC of 0.921 in the training cohort, which was statistically similar to the AUCs of the ACR score assessed by senior radiologists, and its good discrimination performance was confirmed in the validation cohort [26]. In our research, the model for the overall training cohort demonstrated an AUC of 0.912, which was statistically similar to the AUCs in the internal and external cohorts (0.877 and 0.935). Good discrimination was also confirmed in the TR4 (0.809, 0.842 and 0.705) and TR5 (0.859, 0.830 and 0.906) classifications. The discrimination of the model combining clinical, CUS and elastography characteristics was thus confirmed in the validation cohorts and was no lower than that of the radiomics score. The calibration curves of the prediction models demonstrated good agreement between predicted and actual probability, with all p values above 0.05 (0.211, 0.890 and 0.737), indicating no statistically significant difference between predicted and observed values.
Our study has several limitations. Firstly, the external cohort was small, which led to a wide 95% CI (0.271-1) in the external validation cohort of the ACR TR4 model. The external sample size would have to be increased to make the results more accurate. Secondly, because nodules in the TR2 and TR3 classifications are not suspicious and mildly suspicious for malignancy, respectively, and because the sample sizes were small, we did not analyze these cases. However, there was still 1 (25%) PTC in the TR2 classification and 2 (10%) in TR3 in the training cohort, 3 (27%) in TR3 in the internal cohort and 3 (30%) in TR3 in the external cohort. Therefore, although this study provides initial evidence that multi-factor predictive models can be useful for predicting malignancy of thyroid nodules, a multi-center study with a larger sample size should be performed to validate our results.

Conclusions
Although the diagnostic performances of CUS and elastography were comparable in the overall sample, the elastography parameter VTIQ max showed diagnostic superiority in moderately suspicious (ACR TR4) thyroid nodules. Tailoring preoperative examinations to the level of risk stratification helps avoid missing features of potentially malignant thyroid nodules. Predictive models that combine CUS and elastography characteristics would help clinicians make a more appropriate preoperative diagnosis.