Study Population
This retrospective study was approved by the Institutional Review Board (No. 2024-A 06). A total of 310 patients with PTC from two hospitals in China were enrolled. The training and internal validation sets comprised 249 patients who underwent surgery at Nantong Tumor Hospital between September 1, 2022, and March 1, 2024. The external validation set included 61 patients who underwent surgery at the First People's Hospital of Qinzhou, Guangxi, between September 3, 2021, and March 23, 2023 (Fig. 1).
Inclusion criteria: (1) Pathologically confirmed diagnosis of PTC. (2) Postoperative pathology clarifying the presence or absence of CLNM. (3) First-time thyroid surgery. (4) Complete and clear dual-modality ultrasound videos, including BMUS and SMI videos covering the entire nodule. (5) Ultrasound examination performed within two weeks before surgery. (6) Complete clinical baseline data.
Exclusion criteria: (1) Preoperative therapy, such as radiofrequency ablation. (2) Repeat thyroid surgery. (3) Unclear postoperative pathology. (4) Blurred or incomplete ultrasound videos. (5) Incomplete clinical baseline data.
Clinical baseline data included gender, age, nodule size, pathological results, and nodule location.
Dual-Modality Ultrasound Image Collection
Ultrasound images were obtained with a Samsung RS85 system (Samsung, Seoul, Korea) or a Canon Aplio 800 system (Canon, Tokyo, Japan) equipped with a linear-array probe operating at 10–15 MHz. Patients were placed in the supine position with the head tilted back to fully expose the neck. The probe was swept uniformly and continuously over the nodule region along its long axis from left to right, and the video images were stored. SMI was then performed by the same radiologist with the same linear-array probe, sweeping at the same speed and in the same direction, and the SMI video images were stored.
Data Preprocessing and Nodule Segmentation
Keyframes spanning the nodule from its first to last appearance were extracted from each video. The frame showing the largest cross-sectional area of the nodule was selected, together with two frames evenly sampled from each side of it, yielding five images per nodule (Fig. 2). The regions of interest (ROIs) were manually delineated using the open-source software ITK-SNAP 3.8.0. Two radiologists, each with more than 5 years of experience in thyroid ultrasound examination, independently delineated the tumor boundaries while blinded to postoperative pathology, producing two sets of ROIs. The reproducibility of the radiomics features was then assessed with the intraclass correlation coefficient (ICC), and features with an ICC below 0.75 were excluded. The same radiologist delineated the ROIs on the SMI images with reference to the corresponding BMUS images.
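For illustration, the inter-reader reproducibility filter could be implemented as in the minimal sketch below, assuming each reader's features are stored in a pandas DataFrame (one row per nodule, one column per feature) and using the pingouin package and the ICC2 variant, both of which are assumptions rather than details stated in the text.

```python
# Minimal sketch of the ICC-based reproducibility filter (assumptions:
# pandas DataFrames with identical indices/columns for both readers,
# pingouin for the ICC, and the two-way random-effects ICC2 variant).
import pandas as pd
import pingouin as pg

def reproducible_features(reader1: pd.DataFrame, reader2: pd.DataFrame,
                          threshold: float = 0.75) -> list:
    """Return feature names whose inter-reader ICC meets the threshold."""
    kept = []
    ids = list(reader1.index)
    for feat in reader1.columns:
        long = pd.DataFrame({
            "target": ids * 2,                                  # nodule IDs
            "rater":  ["R1"] * len(ids) + ["R2"] * len(ids),
            "score":  list(reader1[feat]) + list(reader2[feat]),
        })
        icc = pg.intraclass_corr(data=long, targets="target",
                                 raters="rater", ratings="score")
        icc2 = float(icc.loc[icc["Type"] == "ICC2", "ICC"].iloc[0])
        if icc2 >= threshold:                                   # keep ICC >= 0.75
            kept.append(feat)
    return kept
```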
Radiomics Feature Extraction
The radiomics features were automatically extracted using the open-source software PyRadiomics (https://pyradiomics.readthedocs.io/en/latest/index.html). A total of 1567 radiomics features were extracted from the ROI of each ultrasound image, including first-order, shape-based, and texture features.
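A minimal extraction sketch for a single image and its ROI mask is shown below. The file names, file format (NRRD exported from ITK-SNAP), and the choice to enable all feature classes and image filters are assumptions; the exact PyRadiomics parameter file is not specified in the text.

```python
# Illustrative PyRadiomics extraction for one BMUS image and its ROI mask.
from radiomics import featureextractor

extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.settings["force2D"] = True   # 2D ultrasound slices
extractor.enableAllFeatures()          # first-order, shape-based, texture
extractor.enableAllImageTypes()        # original plus filtered images (e.g. wavelet, LoG)

# Hypothetical file names for one nodule image and its mask
features = extractor.execute("nodule_bmus.nrrd", "nodule_roi.nrrd")
radiomic_values = {k: v for k, v in features.items()
                   if not k.startswith("diagnostics_")}
print(len(radiomic_values), "features extracted for this image")
```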
Feature selection and model construction
To eliminate the effect of differing magnitudes among features and to make model training more stable and faster to converge, we first standardized the features with the Z-score. We then used the Mann-Whitney U test to retain features that differed significantly between groups. To reduce redundant and highly correlated features, remove uninformative ones, and thereby improve classification performance, we assessed multicollinearity with Spearman's correlation coefficient [29]; when any pair of features had a coefficient ≥ 0.9, only the feature with the better diagnostic performance was retained. Least Absolute Shrinkage and Selection Operator (LASSO) logistic regression was used for the final feature selection [30]. We employed the Support Vector Machine (SVM) classifier to develop the prediction model. The SVM is widely used in biomedical binary classification; it seeks the optimal hyperplane that best separates the classes, known as the optimal decision boundary [29].
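The full selection chain can be condensed into the sketch below. The cut-offs (p < 0.05, |rho| ≥ 0.9, non-zero LASSO coefficients) follow the text, while the SVM kernel, the cross-validation folds, and the use of univariate AUC as the "diagnostic performance" criterion are assumptions.

```python
# Condensed sketch: z-score -> Mann-Whitney U filter -> Spearman redundancy
# filter -> LASSO (L1) logistic regression -> SVM classifier.
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

def select_and_train(X: np.ndarray, y: np.ndarray) -> SVC:
    X = StandardScaler().fit_transform(X)                        # z-score standardization
    # 1) Univariate Mann-Whitney U filter (keep p < 0.05)
    keep = [j for j in range(X.shape[1])
            if mannwhitneyu(X[y == 0, j], X[y == 1, j]).pvalue < 0.05]
    X = X[:, keep]
    # 2) Spearman redundancy filter: for any pair with |rho| >= 0.9,
    #    keep the feature with the higher univariate AUC
    rho = pd.DataFrame(X).corr(method="spearman").abs().values
    auc = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    drop = set()
    for a in range(X.shape[1]):
        for b in range(a + 1, X.shape[1]):
            if rho[a, b] >= 0.9:
                drop.add(a if auc[a] < auc[b] else b)
    X = np.delete(X, sorted(drop), axis=1)
    # 3) LASSO (L1-penalized) logistic regression keeps non-zero coefficients
    lasso = LogisticRegressionCV(penalty="l1", solver="liblinear", cv=5).fit(X, y)
    X = X[:, lasso.coef_.ravel() != 0]
    # 4) SVM classifier on the selected features
    return SVC(kernel="rbf", probability=True).fit(X, y)
```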
ViT Feature Extraction
In this study, we applied the ViT model to extract features from thyroid ultrasound images. After cropping, the images were resized to a standardized 256 × 256 pixels using linear interpolation. Data augmentation included random horizontal and vertical flips. The ultrasound images were then partitioned into 8 × 8 patches, and each patch was linearly projected into a 1024-dimensional feature vector using learnable weight matrices. To enhance feature representation, a learnable class token was prepended to the sequence, and each linearly projected patch received a learnable positional encoding to preserve spatial context.
Following preprocessing, the sequence of image patches and positional encodings was fed into six Transformer encoder layers. Each encoder layer comprised layer normalization and multi-head self-attention with 16 attention heads for parallel computation and aggregation of attention scores. Subsequently, a second layer normalization was followed by two fully connected networks using ReLU activation, with output dimensions of 768 and 1024, respectively. Residual connections between subcomponents facilitated the propagation of feature representations across encoder layers.
In the final Transformer encoder layer, the hidden state of the class token was flattened into a feature vector that served as the definitive representation of the thyroid ultrasound image. After training, 1024 ViT features were extracted for each image from the 'to latent' output of the last layer.
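The mention of a 'to latent' output suggests an implementation along the lines of the vit-pytorch package, which exposes the pooled class token through an identity module named `to_latent`. The sketch below follows that assumption; the hyperparameters mirror the description above (8-pixel patches, embedding dimension 1024, 6 encoder layers, 16 heads, feed-forward dimensions 768 and 1024), while the grayscale single-channel input and the hook-based extraction are assumptions, and vit-pytorch uses GELU rather than the ReLU stated in the text.

```python
# Minimal sketch of class-token feature extraction via a forward hook on the
# vit-pytorch `to_latent` module (library choice and hook are assumptions).
import torch
from vit_pytorch import ViT

model = ViT(image_size=256, patch_size=8, num_classes=2,
            dim=1024, depth=6, heads=16, mlp_dim=768,
            pool="cls", channels=1)

captured = {}
def save_latent(module, inputs, output):
    captured["vit_features"] = output.detach()   # shape: (batch, 1024)

model.to_latent.register_forward_hook(save_latent)

dummy_batch = torch.randn(4, 1, 256, 256)        # 4 grayscale ultrasound frames
_ = model(dummy_batch)                           # classification logits ignored here
vit_features = captured["vit_features"]          # 1024 ViT features per image
print(vit_features.shape)                        # torch.Size([4, 1024])
```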
Feature fusion and fusion model construction
In this study, all fusion strategies used early fusion, combining features from different modalities into a single feature vector [31]. The fused features were then normalized with z-score normalization, and the features of each fusion model were reduced by Spearman correlation analysis, the U-test, and LASSO, respectively. Finally, the DMU_RAD, DMU_ViT, and DMU_RAD_ViT models were constructed with SVM classifiers. The detailed procedure for model construction is described in the "Feature Selection and Model Construction" section above.
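Early fusion here amounts to concatenating the per-image feature vectors before selection and classification, as in the sketch below; the array names, shapes, and file layout are illustrative only.

```python
# Early-fusion sketch: concatenate radiomics and ViT feature matrices
# (rows = images, columns = features) into single fused matrices.
import numpy as np

rad_bmus = np.load("rad_bmus.npy")   # (n_images, n_rad)  radiomics features, BMUS
rad_smi  = np.load("rad_smi.npy")    # (n_images, n_rad)  radiomics features, SMI
vit_bmus = np.load("vit_bmus.npy")   # (n_images, 1024)   ViT features, BMUS
vit_smi  = np.load("vit_smi.npy")    # (n_images, 1024)   ViT features, SMI

dmu_rad     = np.concatenate([rad_bmus, rad_smi], axis=1)
dmu_vit     = np.concatenate([vit_bmus, vit_smi], axis=1)
dmu_rad_vit = np.concatenate([rad_bmus, rad_smi, vit_bmus, vit_smi], axis=1)
# Each fused matrix is then z-scored, reduced (Spearman, U-test, LASSO),
# and fed to an SVM as described in "Feature Selection and Model Construction".
```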
Statistical analysis
Baseline patient data were analyzed with R (version 4.3.3, https://www.r-project.org) and the compareGroups package. Continuous variables were summarized as mean ± standard deviation, and categorical variables were described as frequencies and percentages. The normality of continuous variables was assessed with the Shapiro-Wilk test. Between-group differences were evaluated with the Mann–Whitney U test or Student's t-test for continuous variables and the Chi-squared test or Fisher's exact test for categorical variables. Statistical significance was defined as two-sided p < 0.05. The DeLong method was employed to compare the AUCs of the different models.
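The authors performed these analyses in R with compareGroups; purely for illustration, the same test logic is sketched below in Python with SciPy. Variable names and the normality-based test choice are assumptions about how the decision rule was applied.

```python
# Illustrative SciPy equivalents of the group-comparison tests described above.
from scipy import stats

def compare_continuous(values_pos, values_neg, alpha=0.05):
    """Student's t-test if both groups pass Shapiro-Wilk, else Mann-Whitney U."""
    normal = (stats.shapiro(values_pos).pvalue > alpha and
              stats.shapiro(values_neg).pvalue > alpha)
    if normal:
        return stats.ttest_ind(values_pos, values_neg).pvalue
    return stats.mannwhitneyu(values_pos, values_neg).pvalue

def compare_categorical(table2x2):
    """Chi-squared test, falling back to Fisher's exact test for small expected counts."""
    chi2, p, dof, expected = stats.chi2_contingency(table2x2)
    if (expected < 5).any():
        return stats.fisher_exact(table2x2)[1]
    return p

# AUC comparison with the DeLong method has no SciPy equivalent; in R it is
# available as pROC::roc.test(..., method = "delong").
```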