The reported prevalence of PTC has increased significantly worldwide, largely because of the widespread use of pre-operative sonography. Although most patients with PTC have an indolent clinical course and a favorable prognosis, the presence of LNM has been identified as a risk factor for recurrence 3,30. Currently, physicians rely on preoperative neck ultrasound to assess cervical lymph node metastasis. However, the sensitivity and specificity of preoperative ultrasound in predicting cervical lymph node metastasis vary considerably, and the incidence of occult LNM has been reported to be as high as 55% 6. The utility of preoperative ultrasonography for lymph node assessment is therefore often limited 8,31,32.
Additionally, several mathematical and statistical models that combine clinical factors with medical imaging examinations have been developed to predict LNM in PTC. Kim et al. 33 used multiple regression analysis with variables such as patient age, gender, tumor size, multifocality, bilaterality, and Hashimoto's thyroiditis to construct a nomogram predicting the risk of CLNM in PTC patients after thyroidectomy, achieving an AUC of 0.70. Jin et al. 10 developed a scoring model for the prediction of CLNM using multiple regression analysis with correlated factors such as large tumor size, irregular margins and BRAF mutations. This model demonstrated an overall sensitivity of 85.1% and a specificity of 75.8%. In summary, these nomograms and scoring systems primarily use Multiple Logistic Regression (MLR) to combine various risk factors and predict a binary outcome. However, this approach has certain limitations. To avoid overfitting, MLR can include only a small set of carefully screened independent variables from which the outcome is predicted. Furthermore, these models are poorly suited to the missing values commonly encountered in electronic medical records. Consequently, new models are needed that can handle a wider range of variables, reduce the impact of missing data, and provide greater precision and robustness.
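The performance figures quoted above (AUC, sensitivity, specificity) can be made concrete with a minimal sketch. The labels and scores below are hypothetical toy values, not data from any cited study, and the 0.5 cut-off is an assumption chosen only for illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, where 1 = metastasis."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores):
    """AUC as the probability that a random positive case outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and model risk scores for 8 patients.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]

# Assumed decision threshold of 0.5 (illustrative only).
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sens, spec = sensitivity_specificity(y_true, y_pred)
area = auc(y_true, scores)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} AUC={area:.4f}")
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes ranking quality across all thresholds, which is why the two kinds of figures are not directly comparable between studies.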
Machine learning (ML) is an application of artificial intelligence (AI) that mines and learns from historical data to build predictive models, and it can overcome or reduce the limitations of MLR34. In the era of "big data", ML can help clinicians make appropriate decisions based on large amounts of digital medical information. Previous studies have shown that sophisticated machine learning techniques can construct precise predictive models from unprocessed data extracted from electronic medical records and medical images35,36.
In recent years, ML has been broadly applied to metastasis prediction and prognosis in cancer37. For instance, Tseng et al.38 developed a decision tree model to predict the risk of recurrence in cervical cancer. Singa et al.39 employed the RDF algorithm to predict the onset of liver cancer in cirrhosis patients. Kim40 and Liang41 et al. established Support Vector Machine (SVM) models to discriminate between recurrent and non-recurrent malignant tumors. For regional lymph node metastasis, Bollschweiler37 et al. employed a single-layer neural network to predict LNM in gastric cancer, achieving an accuracy of approximately 79%. Takada40 et al. employed a decision tree-based data mining approach to predict the risk of axillary lymph node metastasis in breast cancer patients, achieving an Area Under the Curve (AUC) of 0.77. Andrés et al.42, using the decision tree algorithm, predicted the presence of occult LNM in patients with early oral squamous cell carcinoma, achieving an AUC of 0.84. Additionally, the SVM algorithm has shown robust classification ability in predicting LNM in breast cancer, cutaneous melanoma, and colon cancer43–45.
In the real world, cancer occurrence and prognosis can be influenced by many relevant factors such as demographics, family history, age, dietary habits, body weight (obesity), poor lifestyle habits (smoking and alcohol consumption) and environmental exposures. This information can be gathered through routine clinical history data present in electronic medical record systems.
However, even for the most skilled clinicians, it is not easy to synthesize all of this information, and machine learning may be better suited to the task. Advanced machine learning models can extract latent information from input data and can thus serve as strong predictive tools. For example, Hart et al.46 used a machine learning model to predict the 5-year risk of endometrial cancer from personal health data. The model involved 952 patients with endometrial cancer, of whom 57.2% were defined as high risk and 41.8% as intermediate risk. The AUC of the best algorithm was 0.88.
To our knowledge, no studies have yet used machine learning algorithms to predict lymph node metastasis in PTC from electronic medical record data. We hypothesized that machine learning algorithms can improve the accuracy of predicting lymph node metastasis in patients with PTC. The approach was trained and validated using DNN models and machine learning algorithms, with the aim of informing surgical strategies.
In this study, a multiparameter ML-based model was developed and validated to predict LNM in patients with PTC using readily available clinical data extracted from electronic medical records. The findings indicate that the developed model can accurately predict cervical lymph node metastasis in patients with PTC. Notably, the RDF-based prediction model performed best, attaining accuracies of 0.98, 0.98 and 0.96 in the prediction of LNM, CLNM and LLNM, respectively. This is probably due to its more sophisticated classification decisions and different weighting ratios compared to other algorithms. Conversely, the NB, DT and XGB algorithms showed suboptimal performance in the predictive task. Studies 47 have demonstrated that RDF is one of the most precise machine learning models, surpassing other techniques in its ability to handle large numbers of features and highly non-linear data. Furthermore, RDF performs well in mitigating data noise, and its flexibility makes it easier to integrate with other learning algorithms. It is worth emphasizing that the choice of the most appropriate algorithm depends on numerous parameters, including the type of data collected, the size of the data sample and the prediction results.
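To illustrate why an ensemble of trees can be more robust than a single tree, the sketch below trains one-split trees ("stumps") on bootstrap resamples and combines them by majority vote. This assumes that RDF denotes a random (decision) forest, and it is a deliberate simplification: a full random forest also samples features at each split and grows deeper trees. The data, feature meanings and function names are hypothetical, and this is not the implementation used in the study.

```python
import random


def fit_stump(X, y):
    """Find the single-feature threshold rule with the fewest training errors."""
    best = None  # (errors, feature, threshold, label_if_above)
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        # Candidate thresholds: midpoints between adjacent distinct values.
        for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            for above in (0, 1):
                errors = sum(
                    1 for row, label in zip(X, y)
                    if (above if row[f] > t else 1 - above) != label
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, above)
    if best is None:
        # Every feature is constant in this sample: always predict the majority class.
        majority = 1 if 2 * sum(y) >= len(y) else 0
        return (0, float("-inf"), majority)
    return best[1:]


def predict_stump(stump, row):
    f, t, above = stump
    return above if row[f] > t else 1 - above


def fit_forest(X, y, n_trees=25, seed=0):
    """Train each stump on a bootstrap resample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest


def predict_forest(forest, row):
    """Majority vote over all stumps."""
    votes = sum(predict_stump(s, row) for s in forest)
    return 1 if 2 * votes > len(forest) else 0


# Hypothetical toy data: two numeric features per case, label 1 = metastasis.
X = [(0.5, 1), (0.8, 2), (1.0, 1), (0.6, 3), (0.9, 2),
     (2.5, 7), (3.0, 8), (2.8, 6), (3.5, 9), (2.2, 7)]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

forest = fit_forest(X, y)
print(predict_forest(forest, (0.7, 2)), predict_forest(forest, (3.2, 8)))
```

Because each stump sees a slightly different resample, individual errors tend to be outvoted, which is one common explanation for the noise tolerance attributed to forest-type models.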
The advantage of this study lies in applying deep learning and deep NLP techniques to make rational use of the information recorded in EMRs when classifying clinical prediction targets, which is reflected in two aspects: (1) The case dataset is mapped from a low-dimensional space to a high-dimensional space through word embedding techniques, which endows the feature matrix representing the case dataset with rich semantic information and effectively helps the model to learn and mine the potential features of LNM in PTC. (2) The multi-DNN training method based on the Stacking framework exploits the fact that parameters differ when multiple models are trained on the same task with the same training set. By assigning a decision weight to each field model, it enhances the generalization ability of the model on the test set, improves accuracy on low-probability cases, and safeguards the overall predictive performance of the model.
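The idea of assigning a decision weight to each model can be sketched as follows. This is a simplified weighted-averaging stand-in, not the Stacking framework itself (which typically trains a meta-learner on the base models' outputs), and the text does not specify how the weights were obtained. Deriving weights from validation accuracy is an assumption made here for illustration, and all numbers and function names are hypothetical.

```python
def decision_weights(val_accuracies):
    """Normalise per-model validation accuracies into weights that sum to 1."""
    total = sum(val_accuracies)
    return [a / total for a in val_accuracies]


def weighted_vote(model_probs, weights, threshold=0.5):
    """Combine per-model probabilities into one probability and label per case.

    model_probs[m][i] is model m's predicted probability for case i.
    """
    n_cases = len(model_probs[0])
    combined = [
        sum(w * probs[i] for w, probs in zip(weights, model_probs))
        for i in range(n_cases)
    ]
    labels = [1 if p >= threshold else 0 for p in combined]
    return combined, labels


# Hypothetical validation accuracies for three base models.
weights = decision_weights([0.9, 0.6, 0.5])

# Hypothetical predicted probabilities of LNM for two cases from each model.
model_probs = [
    [0.8, 0.3],  # model A (highest weight)
    [0.4, 0.6],  # model B
    [0.3, 0.7],  # model C
]

combined, labels = weighted_vote(model_probs, weights)
print(weights, combined, labels)
```

In this toy example an unweighted majority vote would give the opposite label for both cases, because models B and C outvote model A. The weighting lets the model with the best validation performance dominate the decision.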
This study also has some limitations. First, the model uses DNN models and machine learning algorithms, so clinical interpretation of the important features identified by the model may be challenging. Second, the collected data all come from a single center, and the trained model may not perform well in other healthcare institutions. Multi-center datasets are therefore needed for further training and validation of the model to optimize its diagnostic efficacy and generalizability. Third, not all patients underwent total thyroidectomy, so some occult multifocal cancers may have been missed. Fourth, EMR data are unstructured. Disease history and family history were dictated by patients and recorded by doctors, and patients' descriptions of their diseases may be biased. Future research should aim to increase the sample size, incorporate additional data (e.g., ultrasound images or serological examination), and further improve the model's performance through external data validation.