Ethical approval
The institutional review boards of all participating institutions approved this study. Representative institutional review board approval was granted by Seoul Metropolitan Government Seoul National University Boramae Medical Center (H-10-2020-195), and the study was conducted in accordance with the Declaration of Helsinki. The requirement for informed consent was waived by the board. The manuscript was written in accordance with the Strengthening the Reporting of Observational Studies in Epidemiology guidelines.21
Dataset collection
Figure 1 shows the flow chart of data collection and the composition of the datasets. US images were collected from the records of patients who underwent surgery or fine needle aspiration cytology for thyroid nodules. From these data, we developed a model to predict the pathology of thyroid nodules (malignant vs. benign) from features of US images. We used data from two medical institutions (Seoul Metropolitan Government Seoul National University Boramae Medical Center and Seoul National University Bundang Hospital) for model development (Set A in Figure 1). The training set consisted of 4182 thyroid US images (1528 benign, 2654 malignant), the tuning set consisted of 1393 thyroid US images (509 benign, 884 malignant), and the internal test set consisted of 1397 images (511 benign, 886 malignant). Images were stored in Digital Imaging and Communications in Medicine (DICOM) file format. For external replication of the developed model, and to address the issue of overfitting, we used data from four institutions (Set B, Incheon St. Mary’s Hospital, Korea; Set C, Seoul St. Mary’s Hospital, Korea; Set D, Korea Cancer Center Hospital, Korea; and Set E, Kuma Hospital, Japan). The institutions were of different sizes; three sites primarily treat Korean patients and one primarily treats Japanese patients. With the collected images, the study was designed according to Figure 5.
Image preparation
DICOM files were converted to Portable Network Graphics (PNG) files, and the images of thyroid nodules were cropped into squares. For consistent cropping, we developed a web-based in-house program written in JavaScript. Clinicians cropped the region of interest into a square by dragging the mouse. The square size was set at a minimum of 299 × 299 pixels to obtain sufficient resolution. Cropped images larger than 299 × 299 pixels were downsized to 299 × 299 pixels. The process is further described in Supplementary Figure 5.
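A minimal sketch of the conversion and resizing steps is given below, assuming the pydicom and Pillow libraries; the crop coordinates (x, y, side) stand in for the region of interest selected in the web-based cropping tool and are illustrative, not part of the original pipeline.

```python
# Sketch only: DICOM -> PNG conversion and square crop/resize to 299 x 299 pixels.
import numpy as np
import pydicom
from PIL import Image

def dicom_to_png(dicom_path, png_path):
    """Convert a DICOM file to an 8-bit PNG (assumes a single-frame image)."""
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)
    pixels = (pixels - pixels.min()) / (pixels.max() - pixels.min() + 1e-8) * 255.0
    Image.fromarray(pixels.astype(np.uint8)).save(png_path)

def crop_and_resize(png_path, x, y, side, target=299):
    """Crop a square region of interest and downsize it to target x target pixels."""
    img = Image.open(png_path)
    side = max(side, target)                      # enforce the minimum square size
    roi = img.crop((x, y, x + side, y + side))    # (left, upper, right, lower)
    return roi.resize((target, target))
```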
Image labeling
One experienced clinician (Y.J.C.) labeled the images as benign (fine needle aspiration cytology Bethesda Category II or surgical histology of a benign tumor) or malignant (fine needle aspiration cytology Bethesda Category V/VI or surgical histology of thyroid carcinoma).
Model development
Supplementary Figure 6 shows a general schematic of the convolutional neural network architecture used in the proposed framework for predicting benignity or malignancy of thyroid nodules. All models in this study (i.e., stress test, scratch-based, and ImageNet-based) were trained under the same conditions. VGG16,22 VGG19,10 and ResNet5023 were selected as classification architectures to validate transfer learning. The VGG16 network contained 13 convolutional, 13 activation, four pooling, and three fully connected layers. The VGG19 network contained 16 convolutional, 16 activation, four pooling, and three fully connected layers. The structure of ResNet allows gradients to flow backward directly through identity connections from the later layers to the initial filters. After the convolutional layers of each model, 1024 features were extracted with the same head structure (average pooling, fully connected layer, and dropout; p = 0.5) and used to predict malignancy or benignity. All experiments were conducted on an NVIDIA RTX 3090 GPU with 300 epochs and a batch size of 16. For the hyperparameter control experiment, the initial learning rate was set to 0.00005, the optimizer was stochastic gradient descent (momentum = 0.9), and the categorical cross-entropy loss function was used. Data augmentation techniques such as zoom (−0.1 to +0.1), rotation (−5° to +5°), and width/height shift (−0.1 to +0.1) were used to generate additional images and train the model more efficiently.
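The training setup described above can be sketched with TensorFlow/Keras as follows; the Keras applications backbones, layer names, and classification head are our assumptions for illustration and are not the authors' exact code.

```python
# Sketch of the reported configuration: backbone + GAP + 1024-unit FC + dropout(0.5),
# SGD (lr = 0.00005, momentum = 0.9), categorical cross entropy, batch size 16, 300 epochs.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def build_model(weights="imagenet"):              # weights=None gives the scratch-trained variant
    base = VGG16(weights=weights, include_top=False, input_shape=(299, 299, 3))
    x = layers.GlobalAveragePooling2D()(base.output)   # average pooling
    x = layers.Dense(1024, activation="relu")(x)       # 1024-feature fully connected layer
    x = layers.Dropout(0.5)(x)                          # dropout, p = 0.5
    out = layers.Dense(2, activation="softmax")(x)      # benign vs. malignant
    model = models.Model(base.input, out)
    model.compile(optimizer=optimizers.SGD(learning_rate=5e-5, momentum=0.9),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

# Data augmentation: zoom, rotation, and width/height shift in the stated ranges.
augmenter = ImageDataGenerator(zoom_range=0.1, rotation_range=5,
                               width_shift_range=0.1, height_shift_range=0.1)

model = build_model()
# model.fit(augmenter.flow(x_train, y_train, batch_size=16),
#           validation_data=(x_tune, y_tune), epochs=300)
```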
Performance evaluation
1) Performance measurements
We compared the performance of each model on the internal test set and the four external replication sets. The performance of each model was evaluated using the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV). Performance is reported as mean and standard deviation.
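The reported metrics can be computed as sketched below, assuming scikit-learn, binary labels (1 = malignant, 0 = benign), and predicted malignancy probabilities; the helper name is hypothetical.

```python
# Sketch only: AUC plus confusion-matrix-derived metrics at a given probability threshold.
import numpy as np
from sklearn.metrics import roc_auc_score, confusion_matrix

def evaluate(y_true, y_prob, threshold=0.5):
    """Return AUC, accuracy, sensitivity, specificity, PPV, and NPV."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "AUC": roc_auc_score(y_true, y_prob),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }
```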
2) Performance comparison between scratch vs. transfer learning
We compared the performance of transfer learning and scratch learning models using three neural networks (VGG16, VGG19, and ResNet50) on thyroid US image datasets. Transfer learning is a common method in computer vision because it can achieve high accuracy in a short time.21 Unlike scratch learning, which requires a model to be trained from random initial weights, transfer learning applies knowledge already gained from one task (the source task) to a different task (the target task). ImageNet is an image database organized according to the nouns of the WordNet hierarchy, in which each node is associated with hundreds of images. Models pre-trained on the ImageNet dataset are among the most popular starting points for transfer learning.
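In code, the two initialization schemes differ only in the backbone weights, as sketched here with Keras applications (an assumption of ours, not the authors' implementation):

```python
# Transfer learning: ImageNet-pretrained weights; scratch learning: random initialization.
from tensorflow.keras.applications import VGG16, VGG19, ResNet50

transfer_backbone = VGG16(weights="imagenet", include_top=False, input_shape=(299, 299, 3))
scratch_backbone = VGG16(weights=None, include_top=False, input_shape=(299, 299, 3))
```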
3) Stress test
We conducted stress tests to determine whether the training dataset was large enough to saturate the error rate on the validation set. We designed an experiment using dataset ratios from 10% to 100% (in 10% intervals) of the total training set. For each ratio, we randomly selected benign and malignant samples in the same proportion as in the total training set. For each interval, 10 tests were performed for each internal and external replication set using the three neural networks.
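The stratified subsampling of the training set can be sketched as follows, assuming scikit-learn; the function and variable names are illustrative.

```python
# Sketch only: stratified random subsets at 10-100% of the training set, keeping the
# benign/malignant proportion fixed; 10 repetitions per ratio via different seeds.
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_subset(x_train, y_train, fraction, seed):
    """Return a stratified random subset containing `fraction` of the training data."""
    if fraction >= 1.0:
        return x_train, y_train
    x_sub, _, y_sub, _ = train_test_split(
        x_train, y_train, train_size=fraction, stratify=y_train, random_state=seed)
    return x_sub, y_sub

# for fraction in np.arange(0.1, 1.01, 0.1):
#     for seed in range(10):
#         x_sub, y_sub = stratified_subset(x_train, y_train, fraction, seed)
#         # train a model on (x_sub, y_sub), then evaluate on the internal and external sets
```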
4) Threshold adjustment
We further tested whether the performance of the algorithm varied with the probability threshold used for classification. We evaluated the algorithms at thresholds of 0.3, 0.5, and 0.7.
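As a brief sketch, threshold adjustment only changes how predicted malignancy probabilities are converted to class labels before the metrics are recomputed; the names below are illustrative.

```python
# Sketch only: label a nodule malignant when its predicted probability meets the threshold.
import numpy as np

def predictions_at(y_prob, threshold):
    return (np.asarray(y_prob) >= threshold).astype(int)

# for t in (0.3, 0.5, 0.7):
#     y_pred = predictions_at(y_prob, t)
#     # recompute accuracy, sensitivity, specificity, PPV, and NPV with y_pred
```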
Statistical evaluation
For statistical analysis of the stress tests and the comparison of initial weights, we compared average AUCs and performed paired t-tests on the US image datasets with the classification models in the internal and external validation sets. Data were analyzed using SPSS Statistics for Windows, version 28 (IBM Corp., Armonk, NY). For the stress test, paired t-tests were used for intragroup comparison of the AUC values of the model trained on 100% of the training set with those of each model trained on 10-90% of the training set. For statistical analysis of models with initial weights learned in different domains, we also performed a comparative analysis of AUC, accuracy, sensitivity, specificity, PPV, and NPV by classifier threshold, using each model trained on 100% of the training set.
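The paired comparison of AUCs can be sketched as follows, using SciPy in place of SPSS as an assumption; the inputs are the matched lists of AUC values from the repeated runs (e.g., the 10 repetitions of the 100%-trained model paired with those of a 10-90%-trained model on the same evaluation set).

```python
# Sketch only: paired t-test of two matched lists of AUC values.
from scipy import stats

def compare_aucs(auc_full, auc_subset):
    """Paired t-test between AUCs of the fully trained model and a subset-trained model."""
    t_stat, p_value = stats.ttest_rel(auc_full, auc_subset)
    return t_stat, p_value
```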