This multicenter study was based on patient data collected from three tertiary referral institutions in South Korea. The institutional review boards (IRBs) of all three institutions approved this retrospective observational study and waived the need for informed consent for the review of patient images and records (Kangbuk Samsung Hospital Institutional Review Board, 2020-03-020; Yonsei University Health System, Severance Hospital, Institutional Review Board, 4-2020-0106; and Seoul National University College of Medicine/Seoul National University Hospital Institutional Review Board, 1911-039-1076). This study was performed in accordance with relevant guidelines and regulations.
We identified 3,590 consecutive patients who underwent thyroid surgery at the three hospitals (Institution A, Jan 2014 to Jun 2019, n = 1,938; Institution B, Jan 2019 to Sep 2019, n = 1,311; and Institution C, Jan 2017 to Jun 2019, n = 341; Fig. 2). Among these patients, we searched for nodules ≥ 1 cm that were diagnosed as Bethesda category III on FNA and surgically excised. Finally, 202 nodules in 202 patients were included in this study (A, n = 112; B, n = 44; and C, n = 46; Fig. 2).
US Examinations and Imaging Interpretation. US examinations were performed using several types of US machines (Supplementary Information 1). One clinician at each hospital reviewed the preoperative thyroid US images, selected the most representative image of each thyroid nodule, and saved it as a JPEG file (Fig. 3). A square region-of-interest (ROI) was then drawn to cover the whole nodule using the Microsoft Paint program (version 6.1; Microsoft Corporation, Redmond, WA, USA). The saved images from the 3 hospitals were randomly mixed and numbered by an experienced radiologist (Fig. 3). They were independently reviewed by the following 8 physicians, none of whom had information on the cytopathologic results of the thyroid nodules: 2 faculty radiologists (7 and 10 years of experience in thyroid imaging), 2 less experienced radiologists (2 and 4 years of experience), 2 faculty endocrinologists (more than 5 years of experience), and 2 less experienced endocrinologists (1 year of experience). Before reviewing the captured images, all 8 physicians were trained using the ACR TI-RADS user’s guide24.
The 8 physicians evaluated the following US features using the TI-RADS system proposed by the ACR 24: composition (cystic or almost completely cystic, spongiform, mixed cystic and solid, solid or almost completely solid), echogenicity (anechoic, hyperechoic or isoechoic, hypoechoic, very hypoechoic), shape (wider-than-tall, taller-than-wide), margin (smooth, ill-defined, lobulated or irregular, extrathyroidal extension), and echogenic foci (none or large comet-tail artifacts, macrocalcifications, peripheral calcifications, punctate echogenic foci). The 8 physicians then determined malignancy risk using the ACR TI-RADS system. The assigned risk levels were TI-RADS (TR) 1 (benign, 0 points), TR2 (not suspicious, 2 points), TR3 (mildly suspicious, 3 points), TR4 (moderately suspicious, 4–6 points), and TR5 (highly suspicious, 7 or more points) (Supplementary Table 2)24.
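The mapping from total points to risk level described above can be sketched in a few lines of Python. This is an illustration only, not the procedure used by the readers. The function names `tirads_level` and `is_positive_at_tr5` are hypothetical, and because the text does not list a level for a total of 1 point, the sketch groups it with TR1 as an explicit assumption.

```python
def tirads_level(points: int) -> str:
    """Map a total ACR TI-RADS point score to a TR level (illustrative sketch)."""
    if points < 0:
        raise ValueError("points must be non-negative")
    if points <= 1:
        # 0 points = TR1 per the text; treating 1 point as TR1 is an assumption.
        return "TR1"
    if points == 2:
        return "TR2"
    if points == 3:
        return "TR3"
    if points <= 6:
        return "TR4"
    return "TR5"  # 7 or more points


def is_positive_at_tr5(points: int) -> bool:
    """Dichotomize a reading at the TR5 level (hypothetical helper)."""
    return tirads_level(points) == "TR5"


if __name__ == "__main__":
    for score in (0, 2, 3, 5, 7):
        print(score, tirads_level(score), is_positive_at_tr5(score))
```

Dichotomizing at TR5 in this way gives each reader a binary call per nodule, which is the form needed to compute sensitivity and specificity.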
Deep Convolutional Neural Network. In this study, we used a computer-aided diagnosis (CAD) program to differentiate malignant from benign lesions. The program was recently developed with 13,560 US images of thyroid nodules using a deep convolutional neural network (CNN)8 (Supplementary Information 2 and Supplementary Fig. 1).
Statistical Analysis. We obtained the final postoperative diagnosis of each thyroid nodule from the electronic medical records of each hospital. Cancer probabilities were calculated using the CNN and presented as percentages (0–100%). Categorical data were summarized as frequencies and percentages, and continuous variables were presented as means ± standard deviations or medians (interquartile ranges). The Shapiro-Wilk test was performed to assess the normality of continuous variables. We evaluated differences in variables using the independent two-sample t-test, Mann-Whitney U test, Chi-square test, or Fisher’s exact test, as appropriate.
Sensitivities and specificities of the 8 physicians and the CNN for predicting malignancy were evaluated and compared using generalized estimating equations (GEE). Among the risk levels of the ACR TI-RADS system, we used TR5 as the cut-off point for the 8 physicians. The cut-off value of the CNN was determined with Youden’s index. Receiver operating characteristic (ROC) curve analysis was performed, and areas under the curve (AUCs) were compared using DeLong’s test. The diagnostic performances of the 8 physicians and the CNN were also evaluated separately in the AUS and FLUS groups and compared using ROC curve analysis.
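As an illustration of how a cut-off can be chosen with Youden’s index (J = sensitivity + specificity − 1), the following Python sketch scans every observed score as a candidate threshold. It is not the study's SAS or R code. The function name `youden_cutoff` and the example data are made up for demonstration, and the convention that a case is positive when its score is at or above the cut-off is an assumption.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) maximizing Youden's J.

    scores: predicted cancer probabilities (e.g. 0-100).
    labels: 1 for malignant, 0 for benign.
    A case is called positive when score >= cutoff (assumed convention).
    Ties in J are resolved in favor of the lowest cutoff.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both malignant and benign cases are required")

    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        sens = tp / n_pos
        spec = tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1], best[2], best[3]


if __name__ == "__main__":
    # Made-up example data, not taken from the study.
    example_scores = [10, 20, 35, 40, 60, 70, 80, 90]
    example_labels = [0, 0, 0, 1, 1, 0, 1, 1]
    print(youden_cutoff(example_scores, example_labels))
```

The exhaustive scan is quadratic in the number of cases, which is acceptable for a cohort of a few hundred nodules; a sorted single pass would be preferable for larger data sets.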
We evaluated interobserver variability among all 8 physicians using Fleiss’ kappa. We then divided the physicians into 2 groups and evaluated interobserver variability separately among the 4 radiologists and among the 4 endocrinologists, also with Fleiss’ kappa. A kappa value (κ) of less than 0 indicated no agreement; 0–0.20, slight agreement; 0.21–0.40, fair agreement; 0.41–0.60, moderate agreement; 0.61–0.80, substantial agreement; and 0.81–1.00, almost perfect agreement25.
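For readers unfamiliar with the statistic, a minimal Python sketch of Fleiss’ kappa and the agreement scale listed above follows. It is a textbook formulation, not the study's code. The names `fleiss_kappa` and `interpret_kappa` are hypothetical, the input layout (one row of category counts per nodule) is an assumption, and values that fall between the published bands (for example 0.205) are assigned to the lower band as a simplification.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of category counts.

    ratings: list of rows, one per subject (nodule); each row holds the number
    of raters who chose each category, and every row sums to the same n >= 2.
    """
    n_subjects = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")
    n_categories = len(ratings[0])

    # Share of all assignments that went to each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement per subject, then averaged over subjects.
    p_subj = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_subj) / n_subjects
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_exp) / (1 - p_exp)


def interpret_kappa(kappa):
    """Label a kappa value using the bands given in the text."""
    if kappa < 0:
        return "no agreement"
    if kappa <= 0.20:
        return "slight agreement"
    if kappa <= 0.40:
        return "fair agreement"
    if kappa <= 0.60:
        return "moderate agreement"
    if kappa <= 0.80:
        return "substantial agreement"
    return "almost perfect agreement"


if __name__ == "__main__":
    # Made-up example: 4 nodules, 8 raters, 2 categories (below TR5, TR5).
    table = [[8, 0], [0, 8], [6, 2], [1, 7]]
    k = fleiss_kappa(table)
    print(round(k, 3), interpret_kappa(k))
```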
All statistical tests were two-tailed, and P < 0.05 was considered to indicate statistical significance. All statistical analyses were performed using commercially available statistical software (SAS, version 9.4, SAS Institute Inc., Cary, NC, USA) and the R Statistical Package (version 4.0.2; Institute for Statistics and Mathematics, Vienna, Austria; www.R-project.org).