Using the Deep Convolutional Neural Network to Evaluate Thyroid Nodules With Atypia of Undetermined Significance/Follicular Lesion of Undetermined Significance Cytology: Multicenter Study

This study compared the diagnostic performances of physicians and a deep convolutional neural network (CNN) in predicting malignancy from ultrasonography (US) images of thyroid nodules with atypia of undetermined significance (AUS)/follicular lesion of undetermined significance (FLUS) results on fine-needle aspiration (FNA). The study included 202 patients with 202 nodules ≥ 1 cm that showed AUS/FLUS on FNA and who underwent surgery at one of 3 institutions. Diagnostic performances were compared between 8 physicians (4 radiologists, 4 endocrinologists) with varying experience levels and CNN, and the AUS and FLUS subgroups were analyzed. Interobserver variability was assessed among the 8 physicians. Of the 202 nodules, 158 were AUS and 44 were FLUS; 86 were benign and 116 were malignant. The areas under the curve (AUCs) of the 8 physicians and CNN were 0.680–0.722 and 0.666, respectively, without significant differences (P > 0.05). In the subgroup analysis, the AUCs of the 8 physicians and CNN were 0.657–0.768 and 0.652 for AUS, and 0.469–0.674 and 0.622 for FLUS. Interobserver agreement was moderate (k = 0.543) among the 8 physicians, substantial (k = 0.652) among the 4 radiologists, and moderate (k = 0.455) among the 4 endocrinologists. For thyroid nodules with AUS/FLUS cytology, the diagnostic performance of CNN in differentiating malignancy on US images was comparable to that of physicians with variable experience levels.


Introduction
Thyroid nodules are common, with incidence rates of up to 68% [1], and ultrasonography (US) is the primary screening method used to detect these nodules with high sensitivity and specificity. Fine-needle aspiration (FNA) is an easy, relatively safe, and highly accurate diagnostic tool that can be performed under US guidance to identify benign and malignant nodules based on US findings.
The Bethesda system is a standardized, category-based reporting system for thyroid cytopathology and is widely used to interpret FNA results [2]. Nodules with Bethesda class III lesions, otherwise known as atypia of undetermined significance (AUS) or follicular lesion of undetermined significance (FLUS), have a malignancy risk of 6-18%, and management plans vary widely, from clinical observation, US follow-up, repeat FNA or core needle biopsy, and molecular testing to thyroid surgery [2,3]. Although thyroid US examination has been shown to help stratify the risk of Bethesda class III lesions [3,4], US assessment is limited by its inherently poor reproducibility [5].
Recently, machine learning and deep learning methods have been developed and have rapidly become a methodology of choice for medical image analysis [6,7]. A deep convolutional neural network (CNN) is trained through an automated process on raw image pixels, rather than on the expert-engineered features used by traditional machine learning algorithms [7]. We recently developed a computer-aided program that uses a CNN to diagnose thyroid nodules according to US features [8]. This CNN can be an objective, operator-independent method for distinguishing benign from malignant lesions, and these advantages are thought to be especially helpful for nodules with AUS/FLUS cytology on FNA, both in predicting malignancy risk and in determining the next management step.

The purpose of this study was to compare the diagnostic performances of physicians with varying experience levels and CNN in predicting malignancy using US images of thyroid nodules with Bethesda class III results on FNA.

Results
Table 1 summarizes the demographic features of the 202 included nodules. There were 86 (42.6%) benign nodules and 116 (57.4%) malignancies confirmed after surgery. The pathologic results after surgery are shown in Table 2. Of the 202 nodules, preoperative FNA found 158 with AUS cytology and 44 with FLUS cytology. There was no statistical difference between the benign and malignant nodules in sex or age. Malignant nodules were significantly smaller than benign ones (P = 0.009) and had higher CNN-calculated cancer probabilities (P < 0.001).
The calculated sensitivity, specificity, and AUC of CNN were 59.5%, 69.8%, and 0.666, respectively, using an estimated cut-off value of 54.1% (Table 3, Fig. 1). CNN showed significantly higher sensitivity than 6 physicians, but not higher than Radiologist 4 (50.0%; P = 0.082) or Endocrinologist 1 (50.9%; P = 0.137). CNN showed significantly lower specificity than all 8 physicians (P < 0.05). The AUC of CNN was similar to those of the 8 physicians, without statistical difference (P > 0.05).
In the 158 nodules of the AUS group, the sensitivity, specificity, and AUC of the 8 physicians ranged from 25.0 to 52.8%, 76.0 to 98.0%, and 0.657 to 0.768, respectively, while the sensitivity, specificity, and AUC of CNN were 62.0%, 66.0%, and 0.652 with a cut-off value of 54.1% (Table 3, Fig. 1). CNN showed significantly higher sensitivity than 6 physicians (range, 25.0-50.0%; P < 0.05), but not higher than Radiologist 4 (52.8%; P = 0.110) or Endocrinologist 1 (52.8%; P = 0.128). CNN showed significantly lower specificity than 7 physicians (range, 82.0-98.0%; P < 0.05), but not lower than Endocrinologist 1 (76.0%; P = 0.123). CNN had a lower AUC than all 8 physicians, but this difference was significant only for Radiologist 2 (P = 0.011).
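The sensitivity, specificity, and AUC values reported above all derive from two inputs per nodule: the CNN cancer probability and the surgical pathology result. The following is a minimal sketch of that calculation, not the authors' software. The toy probabilities and labels are hypothetical, the rule "probability at or above the cut-off counts as malignant" is an assumption, and the AUC is computed with the rank-based (Mann-Whitney) formulation, which may differ in detail from the method used in the study. Only the 54.1% cut-off is taken from the text.

```python
def sensitivity_specificity(probs, labels, cutoff):
    """Return (sensitivity, specificity) at a probability cut-off.

    probs  : cancer probabilities in percent (0-100)
    labels : 1 = malignant, 0 = benign (surgical pathology)
    Assumption: a nodule is called malignant when prob >= cutoff.
    """
    tp = sum(1 for p, y in zip(probs, labels) if y == 1 and p >= cutoff)
    fn = sum(1 for p, y in zip(probs, labels) if y == 1 and p < cutoff)
    tn = sum(1 for p, y in zip(probs, labels) if y == 0 and p < cutoff)
    fp = sum(1 for p, y in zip(probs, labels) if y == 0 and p >= cutoff)
    return tp / (tp + fn), tn / (tn + fp)


def auc(probs, labels):
    """Rank-based AUC: the fraction of (malignant, benign) pairs in which
    the malignant nodule has the higher probability; ties count as 0.5."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, NOT study data.
toy_probs = [88.1, 60.0, 40.0, 70.0, 30.0, 20.0]
toy_labels = [1, 1, 1, 0, 0, 0]

sens, spec = sensitivity_specificity(toy_probs, toy_labels, cutoff=54.1)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} "
      f"AUC={auc(toy_probs, toy_labels):.3f}")
```

As a consistency check on the reported figures, 69 of 116 malignant nodules and 60 of 86 benign nodules reproduce the stated 59.5% sensitivity and 69.8% specificity; these counts are back-calculated here and are not stated in the text.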

Discussion
AUS/FLUS cytology covers a broad, heterogeneous spectrum of diagnoses in which cells show more pronounced architectural and/or nuclear atypia than in benign lesions, but not enough to be considered malignant. Its malignancy risk is 6-18% when noninvasive follicular thyroid neoplasm with papillary-like nuclear features (NIFTP) is excluded, which can make it difficult for clinicians to decide on further management [2]. For nodules in this category, repeat FNA/CNB or molecular tests can be performed as supplementary evaluation methods instead of proceeding to surgery; however, repeat FNA yields the same cytology in 10-30% of nodules [9]. In nodules with AUS/FLUS cytology, US features can help stratify the malignancy risk [3,4,10-12]. A meta-analysis showed that the more suspicious US features a nodule has, the more likely it is to be malignant [3], with similar results observed in nodules with AUS cytology but not in those with FLUS cytology [10,11]. However, US examination itself is highly subjective, operator dependent, and less reproducible than other imaging methods [5,13].
CNN is a typical deep learning algorithm based on feature recognition [14-16]. Recently, machine learning and deep learning methods have been developed, and CNN showed the highest accuracy and specificity when machine learning models were compared for differentiating Bethesda category III nodules from Bethesda IV/V/VI nodules using US images [22]. That study was performed to support treatment decisions and reported the US characteristics of the American College of Radiology Thyroid Imaging Reporting and Data System (ACR TI-RADS) assigned by each radiologist, but it did not compare diagnostic accuracy between clinicians and the machine learning approaches. Our study is meaningful because, as far as we know, it is the first to compare the diagnostic performance of clinicians and CNN in predicting malignancy in thyroid nodules with AUS/FLUS cytology. In this study, the AUC of CNN was similar to those of the 8 physicians for diagnosing malignancy. CNN showed higher sensitivity and lower specificity than the 8 physicians for diagnosing malignancy in AUS/FLUS lesions, in line with other recent studies that found higher sensitivity and lower specificity for CNN than for radiologists [17,20,21,23]. However, our results for both CNN and radiologists showed relatively lower sensitivity, higher specificity, and lower AUC values than other studies [17,20,21]. Our study included only nodules with AUS/FLUS confirmed at FNA, whereas other studies included thyroid nodules regardless of their cytologic results. Furthermore, CNN structures vary between studies, as do the cut-off values applied to the CNN probability outputs (there are diverse approaches to determining the cut-off value). The absolute values of the diagnostic performances are therefore affected by these differences, and it would be more appropriate to compare trends than to weigh absolute values. Moreover, most of our study population consisted of AUS nodules (78.2%), and the diagnostic performance of CNN in the AUS subgroup was similar to that in the overall AUS/FLUS group.
Interobserver variability is an important issue because US is highly subjective and operator dependent, as mentioned above, and diagnosis using captured JPEG images is even more subjective [5,13]. One study evaluated interobserver variability among three radiologists with different experience levels (a resident, a fellow, and a staff radiologist) and observed moderate agreement for each US characteristic (k = 0.473-0.634) except shape (k = 0.034) [20,21]. We analyzed interobserver variability only for the risk levels of the ACR TI-RADS system and did not analyze each US feature. Our results showed moderate interobserver agreement among the 8 physicians. Substantial agreement was observed among the 4 radiologists, which was slightly higher than the agreement among all 8 physicians and among the 4 endocrinologists. Our 4 radiologists had different levels of experience with thyroid US, but their daily work exposed them far more to US images, making them more familiar with US images and the ACR TI-RADS system than the endocrinologists.
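The agreement values discussed above can be illustrated with a short calculation. The text does not state which kappa statistic was used; the sketch below assumes Fleiss' kappa, a common choice when more than two raters assign each nodule to one category, and maps the result to the widely used Landis and Koch descriptive bands, which are consistent with the "moderate" and "substantial" labels in this study. The rating table is hypothetical toy data, and the function names are illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned subject i to category j. Every row must sum to the same
    number of raters."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject must be rated by the same number of raters")
    n_categories = len(counts[0])

    # Overall proportion of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_subjects * n_raters)
             for j in range(n_categories)]
    # Observed agreement for each subject, then averaged.
    p_subj = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_obs = sum(p_subj) / n_subjects
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)


def landis_koch(kappa):
    """Descriptive label for a kappa value (Landis and Koch bands)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical toy data: 4 nodules, 8 raters, 5 ACR TI-RADS risk levels.
toy_ratings = [
    [0, 0, 1, 3, 4],
    [0, 2, 5, 1, 0],
    [0, 0, 0, 2, 6],
    [1, 4, 3, 0, 0],
]
k = fleiss_kappa(toy_ratings)
print(f"kappa={k:.3f} ({landis_koch(k)})")
```

Under these bands, the reported values of 0.543 and 0.455 fall in the moderate range and 0.652 in the substantial range.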
Our study has several limitations. First, there was selection bias due to the retrospective study design. Second, the total sample size was not large despite this being a multicenter study, and there were only 44 nodules with FLUS cytology (21.8%), which is relatively small for generalizing the findings to an entire population. Third, the malignancy rate after surgery was 57.4%, much higher than the malignancy risk reported for this category by the Bethesda system [2]. For AUS/FLUS cytology, excision can be considered when repeat FNA/CNB or molecular tests are not helpful or when nodules show suspicious US characteristics. Because we included only lesions that underwent surgery, a higher malignancy rate is expected. Fourth, we compared only the risk levels of the ACR TI-RADS system without considering each US feature, which was itself a point of conflict between the 8 physicians (Supplementary Table 1).
In conclusion, the diagnostic performance of CNN was comparable to that of physicians with variable experience levels in identifying malignancy on US among thyroid nodules with AUS/FLUS cytology.

Methods
This multicenter study was based on patient data collected from three tertiary referral institutions in South Korea. The institutional review boards (IRB) of all three institutions approved this retrospective observational study, and the need for informed consent was waived for the review of patient images.
US Examinations and Imaging Interpretation. US examinations were performed using several types of US machines (Supplementary Information 1). One clinician at each hospital reviewed the preoperative thyroid US images, selected the most representative image of each thyroid nodule, and saved it as a JPEG file (Fig. 3). A square region-of-interest (ROI) was then drawn to cover each whole nodule using the Microsoft Paint program (version 6.1; Microsoft Corporation, Redmond, WA, USA). The saved images from the 3 hospitals were randomly mixed and numbered by an experienced radiologist (Fig. 3). They were independently reviewed by the following 8 physicians, none of whom had information on the cytopathologic results of each thyroid nodule: 2 faculty radiologists (7 and 10 years of experience in thyroid imaging), 2 less experienced radiologists (2 and 4 years of experience), 2 faculty endocrinologists (more than 5 years of experience), and 2 less experienced endocrinologists (1 year of experience). Before reviewing the captured images, all 8 physicians were trained using the ACR TI-RADS user's guide [24].
Deep Convolutional Neural Network. In this study, we used a computer-aided diagnosis (CAD) program to differentiate malignant from benign lesions. The program was recently developed with 13,560 US images of thyroid nodules using a deep convolutional neural network [8] (Supplementary Information 2 and Supplementary Fig. 1).
Statistical Analysis. We collected data on the final diagnosis of each thyroid nodule after surgery that had been recorded in the electronic medical records of each hospital. Cancer probabilities were calculated using CNN and were presented as percentages (0-100%). Categorical data were summarized as frequencies and percentages, and continuous variables were presented as means ± standard deviations or medians (interquartile ranges). The Shapiro-Wilk test was performed to assess the normality of continuous variables. We evaluated differences in variables using the independent two-sample t-test, Mann-Whitney U test, Chi-square test, or Fisher's exact test.
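Of the tests listed above, Fisher's exact test is the one typically chosen for categorical comparisons with small cell counts. The sketch below shows how a two-sided P value can be obtained for a 2 × 2 table (for example, sex by benign or malignant pathology) from the hypergeometric distribution. It is an illustration under stated assumptions rather than the authors' analysis: the two-sided rule used here sums the probabilities of all tables no more likely than the observed one, and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    With row and column totals fixed, the top-left cell follows a
    hypergeometric distribution. The P value is the total probability of
    every table that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def weight(k):
        # Unnormalized probability of a table whose top-left cell equals k.
        return comb(row1, k) * comb(row2, col1 - k)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = weight(a)
    # Exact integer arithmetic avoids floating-point ties in the comparison.
    extreme = sum(weight(k) for k in range(lowest, highest + 1)
                  if weight(k) <= observed)
    return extreme / total


# Hypothetical counts, NOT study data:
# rows = group (e.g., male / female), columns = benign / malignant.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"two-sided P = {p_value:.4f}")
```

Other software may define the two-sided P value slightly differently, so small discrepancies between packages are possible for asymmetric tables.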
Sensitivities and specificities of the 8 physicians and CNN for predicting malignancy were evaluated and compared using generalized estimating equations (GEE). Of the risk levels of the ACR TI-RADS system, we …

A 12-mm thyroid nodule diagnosed as FLUS on US-guided FNA. The cancer probability calculated by CNN was 88.1%. The patient underwent surgery, and pathology confirmed encapsulated angioinvasive follicular carcinoma.

Supplementary Files
The following supplementary file is associated with this preprint: Supplementarymaterial.docx