Clinical evaluation of malignancy diagnosis of rare thyroid carcinomas by an artificial intelligent automatic diagnosis system

This study evaluated the value of a generally trained artificial intelligence (AI) automatic diagnosis system in the malignancy diagnosis of rare thyroid carcinomas, such as follicular thyroid carcinoma, medullary thyroid carcinoma, primary thyroid lymphoma and anaplastic thyroid carcinoma, and compared its diagnostic performance with that of radiologists of different experience levels. We retrospectively studied 342 patients with 378 thyroid nodules, including 196 rare malignant nodules, using postoperative pathology as the gold standard, and compared the diagnostic performance of three radiologists (one junior, one mid-level, one senior) with that of the AI automatic diagnosis system. The accuracy of the AI system in malignancy diagnosis was 0.825, which was significantly higher than that of all three radiologists and exceeded that of the best radiologist in this study by a margin of 0.097 (P = 2.252 × 10^−16). The mid-level and senior radiologists had higher sensitivity (0.857 and 0.959) than the AI system (0.847), at the cost of much lower specificity (0.533 and 0.478 versus 0.802). The junior radiologist showed relatively balanced sensitivity and specificity (0.816 and 0.549), but both were lower than those of the AI system. The generally trained AI automatic diagnosis system showed high accuracy in the differential diagnosis of benign nodules and rare malignant nodules. It may assist radiologists in screening for rare malignant nodules with which even senior radiologists are not well acquainted.


Introduction
The incidence of thyroid cancer has increased rapidly worldwide in recent decades [1][2][3], on average by 3.6% per year during 1974-2013. Papillary thyroid cancer (PTC) incidence increased for all stages at diagnosis. Overall and distant PTC incidence-based mortality increased by 1.1% and 2.9% per year, respectively, during 1994-2013. Ultrasonography is the first-line imaging tool for the diagnosis of thyroid nodules [4]. Although it is an economical, convenient and easily promoted imaging modality, its diagnostic accuracy is highly dependent on the personal ability and experience of the radiologist. This dependence is especially pronounced for rare thyroid neoplasms.
Follicular thyroid carcinoma (FTC) is the second most common thyroid cancer, accounting for 5-10% of all thyroid cancers [5]. The diagnosis of FTC requires definite capsular and/or vascular invasion and a lack of the nuclear features of papillary carcinoma, which are difficult to identify by fine needle aspiration and frozen section [6]. Compared with PTC, FTC is less prone to lymph node metastasis but more likely to relapse and to metastasize to the lungs and bones. It also shows a marked tendency toward local aggressiveness, invading adjacent anatomical structures [7]. Medullary thyroid carcinoma (MTC) is a rare malignant neuroendocrine tumor that originates from the malignant proliferation and differentiation of parafollicular thyroid cells (C cells) [8]. Although MTC accounts for only 1-2% of thyroid cancers, it has a very high fatality rate, is more invasive and metastatic than common thyroid tumors, and is prone to recurrence and poor prognosis [9]. Primary thyroid lymphoma (PTL) is a malignant neoplasm originating in the lymphoid tissue of the thyroid gland, with or without cervical lymph node infiltration. Its incidence is extremely low, accounting for 2-8% of all thyroid malignancies and less than 2% of all lymphomas [10]. PTL is easily confused with Hashimoto's thyroiditis (HT) and anaplastic thyroid carcinoma (ATC), and misdiagnosis readily occurs in clinical practice. As a result, many patients miss the optimal treatment window or the most suitable treatment. These thyroid carcinomas are rare, but their degree of malignancy is relatively high, so it is clinically critical to be able to distinguish them from benign nodules.
For these rare thyroid carcinomas, most radiologists cannot build up sufficient diagnostic experience because of their low incidence rates. At present, several ultrasound-based risk stratification systems for thyroid nodules [11,12] are available to assist diagnosis. However, these stratification systems were developed from statistics on the pathological types of common thyroid neoplasms, which account for the vast majority of their respective study samples, and their applicability to rare thyroid carcinomas still lacks practical validation.
In principle, genetic testing can help diagnose thyroid nodules. For FPTL cases specifically, RAS mutations may play a relevant role [13]. However, the most common thyroid-related oncogenic alteration, namely the BRAFV600E mutation, performs poorly in differentiating malignancy among follicular patterned tumors [14,15]. In addition, compared with noninvasive ultrasonography, genetic testing requires a more invasive fine needle aspiration biopsy and is also more costly. In recent years, artificial intelligence (AI) algorithms have developed rapidly and have shown promising performance in medical image-based diagnosis. Big data combined with deep learning can effectively address problems in clinical diagnosis and treatment, and shows great advantages and application prospects in the field of medical image diagnosis [16][17][18]. At present, ultrasound automatic diagnosis systems for thyroid nodules based on deep learning algorithms can assist the diagnosis of thyroid neoplasms, simplify the workflow of ultrasound examination, and help radiologists improve the efficiency of clinical practice [19]. Unfortunately, very few studies have so far tried to apply deep learning technologies to the diagnosis of malignant nodules among rare thyroid carcinomas.
In this study, we applied a generally trained commercial thyroid nodule diagnosis system with a fixed cut-off value to the malignancy prediction of rare thyroid carcinomas. The retrospectively collected cases included FTC, MTC, PTL, ATC and squamous cell carcinoma (SCC), together with thyroid nodules determined to be benign by postoperative pathology. We evaluated the diagnostic performance of the AI system and of radiologists of different experience levels in diagnosing rare thyroid neoplasms using common evaluation metrics such as sensitivity, specificity and accuracy, and used the McNemar test to verify whether any observed differences were statistically significant.

Database
This retrospective study was approved by The Cancer Hospital of the University of Chinese Academy of Sciences (Zhejiang Cancer Hospital). The inclusion criteria were: (1) the patient underwent thyroid surgery for the first time and had not received other related treatments; (2) the patient underwent thyroid ultrasound examination before surgery and had clear B-mode ultrasound images; and (3) the patient was confirmed to have a rare thyroid carcinoma (such as FTC, MTC, PTL or SCC) by surgical pathological analysis. From January 2016 to June 2021, a total of 378 thyroid nodules from 342 patients were included in this study. Of these, 196 nodules (51.85%) were rare thyroid carcinomas that met the inclusion criteria at our institution, including 59 cases of FTC, 86 cases of MTC, 29 cases of PTL, 13 cases of ATC, 5 cases of SCC, 3 cases of medullary thyroid carcinoma and 1 case of ATC. For completeness, 182 benign nodules (48.15%) from the same time period were randomly selected; the number was chosen to give a roughly 1:1 ratio to the malignant cases, so that the diagnostic performance of each rater could be evaluated on a balanced sample. The study subjects included 245 females and 133 males. The specific information is shown in Table 1.

Ultrasound examinations by radiologists and AI software
Junior radiologist A with 3 years of working experience, mid-level radiologist B with more than 10 years of working experience and senior radiologist C with more than 20 years of working experience in ultrasound diagnosis performed the clinical ultrasound examinations on patients without knowing their pathological outcomes. The radiologists used the American College of Radiology (ACR) Thyroid Imaging Reporting and Data System (TI-RADS) to score the main characteristics of the thyroid nodules (composition, echogenicity, shape, margin and echogenic foci) [20,21]. The TI-RADS classification was determined by the total score, and each nodule was diagnosed as malignant or benign based on a cut-off value: if the TI-RADS level was >3, the nodule was diagnosed as suggestive of malignancy, otherwise as benign. Three ultrasound images were provided for each case. These ultrasound images were first grouped according to the associated nodules and then analyzed independently by all radiologists and by the AI-SONIC TM Thyroid system with software version A1.01.001.001 and the algorithm referred to as US_THYROID_S (Zhejiang Demetics Medical Technology Co., Ltd., China). The AI system has a fixed cut-off value of 0.6 and was developed on the EfficientNet architecture [22] using the proprietary deep learning framework DE-Light. For network initialization, self-training with the noisy student method was used to transfer a model trained on the ImageNet database to this system. In addition, the system used the Sharpness-Aware Minimization (SAM) algorithm, which simultaneously minimizes the loss value and the loss sharpness, to improve the generalizability of the model [23]. In particular, the SAM algorithm seeks parameters that lie in neighborhoods with uniformly low loss. A softmax classifier applied to the learned feature maps was used to diagnose thyroid nodules, as shown in Fig. 1. This system can automatically detect thyroid nodules in two-dimensional grayscale ultrasound images. Therefore, radiologists needed to outline a thyroid nodule manually only when the AI system failed to detect it automatically. We only needed to import the ultrasound images into the system, and the system returned the predicted malignancy probability of each nodule in the ultrasound images. For every rater, i.e., the radiologists and the AI system, the maximum malignancy score predicted from the images associated with each nodule was taken as the nodule-specific malignancy score. For the AI system, if this probability value was ≥0.6, the nodule was diagnosed as malignant, otherwise as benign.
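The decision rules described above can be illustrated with a minimal Python sketch. The function and constant names are hypothetical and chosen only for illustration; they are not part of the AI-SONIC TM software. The sketch assumes that each nodule has one predicted probability per image and that the radiologist's reading is summarized as an integer TI-RADS level.

```python
from typing import Sequence

AI_CUTOFF = 0.6      # fixed cut-off reported for the AI system
TIRADS_CUTOFF = 3    # radiologists: TI-RADS level > 3 is read as malignant


def nodule_score(image_probabilities: Sequence[float]) -> float:
    """Nodule-level score: the maximum probability over the nodule's images."""
    if not image_probabilities:
        raise ValueError("at least one image probability is required")
    return max(image_probabilities)


def ai_diagnosis(image_probabilities: Sequence[float]) -> str:
    """Return 'malignant' if the nodule-level score is >= 0.6, else 'benign'."""
    score = nodule_score(image_probabilities)
    return "malignant" if score >= AI_CUTOFF else "benign"


def radiologist_diagnosis(tirads_level: int) -> str:
    """Return 'malignant' if the TI-RADS level is > 3, else 'benign'."""
    return "malignant" if tirads_level > TIRADS_CUTOFF else "benign"


# Hypothetical probabilities for the three images of one nodule.
print(ai_diagnosis([0.15, 0.10, 0.12]))   # all images below the cut-off
print(ai_diagnosis([0.20, 0.73, 0.41]))   # one image above the cut-off
print(radiologist_diagnosis(4))
```

Taking the maximum over images means that a single suspicious view is enough to label the nodule malignant, which favors sensitivity over specificity at the nodule level.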

Statistical analysis
To assess the performance of the AI-SONIC TM system, we computed the Receiver Operating Characteristic (ROC) curve and used the Area Under the Curve (AUC) as the evaluation metric. To compare its diagnoses with those of the radiologists, we calculated the sensitivity, specificity and accuracy. In addition, the McNemar test was used to compute p values for statistical comparisons. In all analyses, a p value less than 0.05 was considered to indicate a statistically significant difference. Statistical analysis was performed using Python 3.8 (Python Software Foundation, Delaware, United States).
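As a worked illustration of these metrics, the sketch below computes sensitivity, specificity and accuracy from confusion-matrix counts. The counts are an assumption: they are back-calculated from the rates reported for the AI system (sensitivity 0.847 on 196 malignant nodules, specificity 0.802 on 182 benign nodules) and are not taken from the study's tables.

```python
def diagnostic_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),          # malignant nodules correctly flagged
        "specificity": tn / (tn + fp),          # benign nodules correctly cleared
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Assumed counts, back-calculated from the reported rates (not from Table 2):
# 196 malignant nodules -> 166 true positives, 30 false negatives
# 182 benign nodules    -> 146 true negatives, 36 false positives
ai = diagnostic_metrics(tp=166, fn=30, tn=146, fp=36)
print({name: round(value, 3) for name, value in ai.items()})
# {'sensitivity': 0.847, 'specificity': 0.802, 'accuracy': 0.825}
```

Under these assumed counts the three values agree with the reported 0.847, 0.802 and 0.825, which also shows that the reported accuracy is consistent with the reported sensitivity and specificity given the 196:182 case mix.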

Results
Comparison between the AI system and radiologists of different experience levels
The sensitivity, specificity and accuracy of the AI system and the three radiologists in the malignancy diagnosis of thyroid nodules were computed, as shown in Table 2. The accuracy and specificity of the AI system were higher than those of all surveyed radiologists. The sensitivity of the AI system was lower than that of senior radiologist C (0.847 vs 0.959) and slightly lower than that of mid-level radiologist B (0.847 vs 0.857), but higher than that of junior radiologist A (0.847 vs 0.816). Moreover, we calculated the associated AUC values of the AI system and the three radiologists and plotted the ROC curves in Fig. 2. The AUC value of the AI system was higher than that of all surveyed radiologists.
Fig. 1 The training process of the AI automatic diagnosis system
To test whether the differences between the AI system and the three radiologists in the malignancy diagnosis of thyroid nodules were statistically significant, we applied the McNemar test to compute p values, with the results shown in Table 3. All p values between the AI system and the three radiologists were less than 0.0002. Hence, the differences between the AI system and the three radiologists in diagnosing rare thyroid carcinomas were statistically significant. To demonstrate the AI system concretely, we present a set of representative ultrasound images and the diagnostic results of the AI system in Fig. 3.
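The McNemar test compares two raters on the same nodules using only the discordant pairs, i.e., the nodules that one rater classified correctly and the other did not. The sketch below implements the exact (binomial) two-sided variant with the Python standard library. Which variant the study used (exact or chi-square, with or without continuity correction) is not stated, so this choice is an assumption, and the helper names and example data are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(ai_correct: Sequence[bool],
                      rad_correct: Sequence[bool]) -> Tuple[int, int]:
    """Count nodules where exactly one of the two raters was correct."""
    pairs = list(zip(ai_correct, rad_correct))
    b = sum(1 for a, r in pairs if a and not r)   # AI right, radiologist wrong
    c = sum(1 for a, r in pairs if r and not a)   # radiologist right, AI wrong
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0                                # no discordant pairs at all
    k = min(b, c)
    # Under the null hypothesis each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical paired outcomes for four nodules.
b, c = discordant_counts([True, True, False, True], [True, False, True, False])
print(b, c, mcnemar_exact(b, c))
```

Because concordant pairs do not enter the statistic, two raters with similar overall accuracy can still differ significantly if their errors fall on different nodules.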
Performance comparison of the AI system and three radiologists in diagnosing various subtypes of thyroid nodules
To further compare the diagnoses of the AI system and the three radiologists, we calculated sensitivity and specificity for the thyroid nodule subtypes with relatively large numbers of cases, namely FTC, MTC, PTL and benign nodules. The results are shown in Table 4. For benign nodules, the specificity of the AI system was dramatically higher than that of the three radiologists. In diagnosing PTL, the sensitivity of the AI system was higher than that of junior radiologist A, the same as that of mid-level radiologist B, and lower than that of senior radiologist C. In diagnosing FTC, the sensitivity of the AI system was higher than that of mid-level radiologist B, but lower than that of junior radiologist A and senior radiologist C. For MTC cases, the sensitivity of the AI system was lower than that of senior radiologist C and mid-level radiologist B, but the same as that of junior radiologist A.

Discussion
Thyroid carcinoma is the most common endocrine malignant tumor, accounting for about 1% of all malignant tumors. Papillary thyroid carcinoma (PTC), which has low malignancy and a good prognosis, is the most common type of thyroid carcinoma; it has been the most widely studied and can be diagnosed well clinically. For rare thyroid carcinomas such as FTC, MTC, PTL and ATC, however, differentiating malignant from benign thyroid nodules remains a great challenge. Most radiologists cannot build up sufficient diagnostic experience because of the low incidence of these carcinomas. Even senior radiologist C, with more than 20 years of working experience, achieved only 72.8% accuracy in our study. For junior radiologist A the accuracy was even lower, at 68.8%. The sensitivities of the radiologists in malignancy prediction were all above 80%, but their specificity was as low as 47.8%. This poor diagnostic performance can most likely be attributed to that lack of experience, which affects even senior radiologists. Although the sensitivity of the AI-SONIC TM thyroid automatic diagnostic system, which was trained for general differentiation of benign and malignant nodules, was lower than that of radiologist B (by 1.0 percentage point) and radiologist C (by 11.2 percentage points), the AI system provided a much better diagnostic accuracy of 82.5% in the malignancy diagnosis of thyroid nodules. Its accuracy was 13.7, 12.4 and 9.7 percentage points higher than that of junior radiologist A, mid-level radiologist B and senior radiologist C, respectively. Moreover, the AI system showed a well-balanced performance, with sensitivity and specificity of 84.7% and 80.2%, respectively. Using the McNemar test, the p values comparing the AI system with the three radiologists in terms of malignancy accuracy in this study were all below 0.0002, confirming that the differences were statistically significant.
In the subtype analysis covering FTC, MTC, PTL and benign nodules, the specificity of the AI system (80.2%) was higher than that of all three radiologists. The sensitivity of the AI system for PTL (93.1%) was the same as that of the mid-level radiologist, higher than that of the junior radiologist and lower than that of the senior radiologist. The sensitivity of the AI system for FTC (72.9%) was lower than that of radiologist A and radiologist C, but higher than that of radiologist B. The sensitivity of the AI system for MTC (87.2%) was lower than that of radiologist B and radiologist C. However, the specificity of senior radiologist C was extremely low (47.8%). This is probably because radiologist C was more conservative and tended to overestimate malignancy, resulting in high sensitivity and low specificity. Radiologist B had difficulty discriminating FTC from benign cases, whereas radiologist A had difficulty discriminating PTL from benign cases. The AI system provided a more balanced performance in terms of sensitivity and specificity.
Therefore, the AI system can potentially be used as an auxiliary method for distinguishing rare malignant thyroid nodules, given its higher overall accuracy and especially its higher specificity. One possibility is to let the radiologists decide subjectively whether to adopt the suggestions of the AI system, as long as the AI system can be expected to deliver higher diagnostic accuracy than the radiologists or guessing [24]. Another possibility is to set up a rule under which a favorable outcome would be expected [25]. This is presumably a better approach than a subjective decision because it can be tuned to exploit the AI system's advantage in specificity. Following this rule-based approach, a more sophisticated rule could be constructed from a more thorough statistical characterization of the nodule cases correctly predicted by the AI system, but this is left for future investigation.
Fig. 3 Representative examples of thyroid nodule diagnosis by the AI system. When the risk probability of a thyroid nodule was <0.6, the AI system diagnosed the nodule as "benign" and displayed it in a green bounding box; otherwise it diagnosed the nodule as "malignant" and displayed it in a red bounding box. a-d Original ultrasound images of thyroid nodules. a Pathological diagnosis: benign nodule. b Pathological diagnosis: primary thyroid lymphoma. c Pathological diagnosis: follicular thyroid carcinoma. d Pathological diagnosis: medullary thyroid carcinoma. e-h Diagnosis of thyroid nodules in the AI system. e The AI system diagnosed the nodule as "benign" with a risk probability value of 0.15. All three radiologists diagnosed the nodule as "malignant". f The AI system diagnosed the nodule as "malignant" with a risk probability value of 0.87. All three radiologists diagnosed the nodule as "malignant". g The AI system diagnosed the nodule as "malignant" with a risk probability value of 0.73. All three radiologists diagnosed the nodule as "benign". h The AI system diagnosed the nodule as "malignant" with a risk probability value of 0.90. Radiologist A and radiologist C diagnosed the nodule as "benign" while radiologist B diagnosed the nodule as "malignant".
This study was a single-center retrospective study with its inherent limitations. But because of the low prevalence of rare thyroid carcinomas, such as FTC, MTC and PTL, it can take a fairly long time to accumulate enough cases for statistically reliable evaluation. Prospective multi-center studies with large enough sample sizes should be encouraged for further verification and comprehensive comparison.
In conclusion, our study showed that the AI automatic diagnosis system exhibited high diagnostic accuracy in the malignancy diagnosis of rare thyroid carcinomas in the presence of solid benign nodules. The AI automatic diagnosis system can potentially be used as an auxiliary method for distinguishing rare malignant thyroid nodules. It could likely help reduce missed diagnoses and misdiagnoses of rare malignant thyroid nodules, so that affected patients can be treated in a more timely manner and the incidence of metastasis can be reduced.