Machine Learning Models for Risk Prediction of Lymph Nodes Metastasis in Non-Small Cell Lung Cancer: Development and Validation Study

Background: To develop and validate machine learning models for risk prediction of lymph node metastasis (LNM) in non-small cell lung cancer (NSCLC) using clinicopathologic parameters and immunohistochemical features. Methods: From January 2010 to December 2019, data from 639 patients were consecutively collected at Nanfang Hospital. We extracted immunohistochemical and clinicopathological features from the patients' electronic medical records. We established two models (a full model and a selection model) and implemented three algorithms (random forest, support vector machine, and penalized logistic regression). Model performance was evaluated in terms of discrimination (area under the receiver operating characteristic curve (AUC)), calibration, and decision curve analysis. Results: AUC analysis and calibration curves showed that the selection model (AUC values for training and testing, 0.843 and 0.840, respectively) and the full model (AUC values for training and testing, 0.855 and 0.863, respectively) constructed using random forest performed best among all models. Decision curve analysis showed that the full model and the selection model using random forest were clinically useful. The performance of the full model and the selection model was comparable. Conclusion: The random forest model using clinicopathologic-immunohistochemical features can predict LNM in NSCLC patients.

Immunohistochemical features can be combined with clinicopathological characteristics to achieve a preliminary prediction of the risk of LNM in NSCLC patients. We aimed to develop and validate models that can accurately predict the presence of LNM, spare lymph node-negative NSCLC patients an unnecessary lymph node dissection, and treat lymph node-positive patients in time to achieve a good prognosis. We present the following article in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) reporting checklist [19].

Patients
A total of 639 consecutive patients treated for lung cancer at Nanfang Hospital (Guangzhou, Guangdong, China) were identified retrospectively between January 2010 and December 2019. The medical records and histopathology reports were retrospectively analyzed. The inclusion criteria were: (1) patients with primary NSCLC who underwent radical resection with routine systematic lymph node dissection; (2) a post-surgical diagnosis of NSCLC with the absence or presence of LNM. The exclusion criteria were: (1) patients with non-NSCLC tumors such as mixed carcinoma, polymorphic carcinoma, atypical carcinoid, and mucoepidermoid carcinoma; (2) radiotherapy or chemotherapy before surgery; (3) incomplete resection or absence of mediastinal node dissection; (4) missing key information. Finally, 152 eligible patients were enrolled. The inclusion and exclusion process is shown in Fig. 1. The institutional review board at Nanfang Hospital approved the study, and the informed consent requirement was waived.

Predictors
Based on literature reviews [15, 18, 20-24] and on clinicians' and pathologists' opinions on factors that may increase the risk of lymph node metastasis, combined with our existing data, we identified 17 potential predictors covering patients' basic information, clinicopathologic parameters, and immunohistochemical features. Except for the maximum tumor diameter, which is a continuous variable, all predictors are categorical variables. The definition of the predictors is given in Appendix A1.

Outcome
The outcome is the lymph node status of NSCLC patients. All patients underwent anatomical lung resection and systematic nodal dissection by thoracic surgeons. Experienced pulmonary pathologists histologically assessed all resected tumor specimens and nodal samples, and a final diagnosis was made based on the WHO classification.

Model Development And Validation
We randomly divided the complete original dataset into 80% as the training set and 20% as the validation set.
The analysis process consists of four main stages: data preprocessing, feature selection, construction of prediction models, and model evaluation. The first stage, data preprocessing, comprises data standardization and the handling of missing values: any row with a missing categorical value was deleted, while missing continuous values were imputed with random forest imputation. In the second stage, for the selection models, the least absolute shrinkage and selection operator (LASSO) algorithm under 10-fold cross-validation was used to select features related to the outcome; for the full models, all features were retained. In the third stage, based on the selected features, we applied three algorithms (random forest, support vector machine, and penalized logistic regression) to build the prediction models. Finally, we used the area under the receiver operating characteristic curve (AUC), accuracy, sensitivity, specificity, calibration curves, and decision curve analysis (DCA) to evaluate the models. The calibration curve evaluates the degree of calibration, that is, the agreement between predicted and observed values. DCA is a method of assessing a clinical prediction model that helps identify high-risk patients for intervention and low-risk patients who should avoid overtreatment.
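The four-stage workflow above can be sketched with scikit-learn. This is a minimal illustration on synthetic data; the dataset, hyperparameters, and preprocessing details are stand-ins, not the authors' actual configuration (which is given in Appendix A2).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 152-patient, 17-feature cohort
X, y = make_classification(n_samples=152, n_features=17,
                           n_informative=5, random_state=0)

# 80/20 split, stratified so LNM prevalence stays comparable across cohorts
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=0)

# The three algorithms used in the paper (hyperparameters here are illustrative)
models = {
    "random_forest": RandomForestClassifier(n_estimators=500, random_state=0),
    "svm": make_pipeline(StandardScaler(), SVC(probability=True, random_state=0)),
    "penalized_lr": make_pipeline(StandardScaler(),
                                  LogisticRegression(penalty="l2", C=1.0)),
}

# Discrimination on the held-out 20% via AUC
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

In practice the same fitted probabilities would also feed the calibration curves and the decision curve analysis described above.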

Statistical Analysis
Statistical analysis was conducted with R software (version 4.0.2; http://www.R-project.org) and Python (version 3.7.0; https://www.python.org). The models were programmed in Python using the scikit-learn (sklearn) library.
The AUROC curves, calibration curves, and DCA were generated using the "ggplot2", "rms", and "dca" packages, respectively. See Appendix A2 for the tuned parameters of each algorithm. The Chi-square test was used to compare categorical data and the two independent-sample t-test for continuous data. All statistical tests were two-sided, with the significance level set at .05, except for the DeLong test, where it was set at .01. The AUC, accuracy, sensitivity, and specificity were used to assess discrimination; calibration curves were used to assess calibration; and DCA was used to assess clinical usefulness.
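The two univariate comparisons can be reproduced with SciPy. The contingency table and diameter values below are hypothetical illustrations, not the study's data.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table: LNM status (rows) vs cohort (columns)
table = np.array([[68, 18],   # LNM present: training, validation
                  [54, 13]])  # LNM absent:  training, validation
chi2, p_chi, dof, expected = stats.chi2_contingency(table)

# Two independent-sample t-test for a continuous predictor
# (e.g. maximum tumor diameter in cm; values are made up)
diam_train = np.array([2.1, 3.4, 1.8, 2.9, 4.0, 2.5])
diam_valid = np.array([2.3, 3.1, 2.0, 3.5, 2.7])
t_stat, p_t = stats.ttest_ind(diam_train, diam_valid)
```

A p-value above .05 in either test would indicate no evidence of a cohort difference, as reported for this study's split.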

Characteristics of patients in cohorts
A total of 152 patients comprised the primary cohort. Patient characteristics in the training and validation cohorts are given in Table 1. There was no significant difference between the two cohorts in lymph node prevalence (P = .830). The rate of LNM was 55.9% and 58.1% in the training and validation cohorts, respectively. There was no evidence of a statistically significant difference in demographic or clinical characteristics between the training and validation cohorts, which confirmed that the split was reasonable.

Feature Selection
A total of 17 features were collected from the clinicopathologic-immunohistochemical features of the cohort.
The 17 features were used to construct the full models. Using LASSO regression within the training cohort, we selected the five features with non-zero coefficients (distant metastasis, differentiation, CK7, hypertension, CK5/6) from the 17 candidates as potential predictors to build the selection models (Appendix A3).
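For a binary outcome, LASSO selection under 10-fold cross-validation can be done with an L1-penalized logistic regression, as sketched below on synthetic data. The feature names are placeholders; the study's actual predictors and tuning are in Appendices A1-A3.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Placeholder names for the 17 candidate predictors
feature_names = [f"feature_{i}" for i in range(17)]

X, y = make_classification(n_samples=152, n_features=17,
                           n_informative=5, random_state=0)
X = StandardScaler().fit_transform(X)  # LASSO needs standardized inputs

# L1-penalized logistic regression; 10-fold CV picks the penalty strength
lasso = LogisticRegressionCV(penalty="l1", solver="liblinear",
                             Cs=20, cv=10, random_state=0)
lasso.fit(X, y)

# Features surviving the penalty (non-zero coefficients) form the selection model
selected = [name for name, coef in zip(feature_names, lasso.coef_[0])
            if coef != 0]
```

On the study's data this step reduced the 17 candidates to the five predictors listed above.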

Discrimination
The AUC value, accuracy, sensitivity, and specificity for each model are shown in Table 2.

Model Calibration And Decision Curve Analysis
The calibration curves of the six models in the validation cohort are shown in Fig. 3. The calibration curve of the random forest for the full model shows the best agreement between predictions and observations (Fig. 3D). The decision curve for each model is presented in Fig. 4. The decision curves show that the useful threshold probability range for the random forest of the full model was 16%-81%, while that of the random forest of the selection model was > 9%. Within these ranges, using the models to predict LNM adds more net benefit than treating all patients or treating none. Moreover, over a larger threshold range, the selection model's net benefit was higher than that of the full model. Assuming that we choose a predicted probability of 60% to diagnose LNM and initiate treatment, then for every 100 patients who use the selection model, 25 can benefit from it without harming the interests of anyone else; for every 100 patients who use the full model, only 14 can benefit without harming anyone else's interests.
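The net benefit underlying these decision curves follows the standard definition (true positives minus false positives weighted by the odds of the threshold probability, per patient). A minimal sketch, with made-up labels and probabilities for illustration:

```python
import numpy as np

def net_benefit(y_true, y_proba, threshold):
    """Net benefit of treating patients whose predicted probability
    exceeds `threshold`: TP/n - FP/n * (pt / (1 - pt))."""
    y_true = np.asarray(y_true)
    pred = np.asarray(y_proba) >= threshold
    n = len(y_true)
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    return tp / n - fp / n * (threshold / (1 - threshold))

def net_benefit_treat_all(y_true, threshold):
    """Net benefit of the 'treat all' strategy at the same threshold."""
    prev = np.mean(y_true)
    return prev - (1 - prev) * (threshold / (1 - threshold))
```

Plotting `net_benefit` of a model against `net_benefit_treat_all` and the treat-none line (net benefit 0) over a grid of thresholds reproduces a decision curve like Fig. 4; the model is clinically useful wherever its curve lies above both references.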

Discussion
In this study, we developed and validated machine learning models based on clinicopathologic and immunohistochemical parameters to predict LNM in patients with NSCLC. Our results showed that immunohistochemical features could provide additional information to distinguish the absence or presence of LNM in patients with NSCLC. Combined with clinicopathological features, they demonstrated high discrimination, good calibration, and clinical usefulness for diagnosing LNM.
For the construction of the clinicopathological-immunohistochemical model, the correlation between the predictors and the outcome was tested using the LASSO method, which shrinks the regression coefficients. The 17 candidate features were reduced to 5 potential predictors. LASSO is a statistical method that imposes an L1 penalty to improve prediction accuracy and model interpretation, thereby generating sparse models [25]. In recent studies [26][27][28], LASSO regression has been used to integrate single biomarkers into a comprehensive analysis of multiple markers, and the method has been progressively developed and validated. To assess the feature-screening effect of LASSO regression, we also built a full model incorporating all features. In this study, we selected three machine learning algorithms to develop and validate models. It turned out that random forest was the best classifier on the entire dataset, similar to other findings [29][30][31].
The two models' performance under the random forest algorithm is comparable, whether the full model or the selection model. Given this, we are more inclined to choose the selection model of the random forest algorithm as the optimal model, obtaining the greatest clinical utility with the fewest features. The selection model of the random forest algorithm showed a good degree of discrimination (AUC, 0.84) and calibration in the training cohort and performed equally well in the validation cohort (AUC, 0.84). Given that the positive rates of LNM in the two cohorts are comparable, the good discrimination suggests that the model's predictions are robust.
The evaluation of model performance is generally limited to discrimination and calibration. However, the AUROC considers only the specificity and sensitivity of a method and pursues accuracy without considering the patient's clinical benefit. For example, when using a biomarker to predict a patient's disease, whichever value is selected as the cutoff, both false positives and false negatives are possible. Sometimes avoiding false positives is more beneficial; sometimes avoiding false negatives is. Since both situations are unavoidable, we want a way to maximize the net benefit. Therefore, to evaluate clinical usefulness, that is, whether our model can improve patient prognosis, decision curve analysis was conducted. DCA is a novel strategy to assess the value of diagnostic tests that accommodates varying patient preferences regarding the risks of undertreatment and overtreatment [32]. In our study, the decision curve indicated that when the threshold probability is > 9%, using the selection model of the random forest algorithm to predict LNM is more effective than the "treat none" or "treat all" schemes. This study has limitations. First, genomics and radiomics were not considered. Some studies have recently shown that genomics and radiomics can predict LNM in NSCLC [33][34][35], but we are unsure whether adding these factors would improve our model. Second, the small sample size may lead to residual confounding or statistical fluctuation [36]. Third, the lack of external validation may limit the generalizability of the model. Although all the results showed that the established model has satisfactory performance, it still needs to be externally validated on datasets from other centers in future research.

Conclusion
In summary, this study developed a machine learning model that combines immunohistochemical and clinicopathological features, which may help clinicians make individualized preoperative predictions of LNM in NSCLC patients.

Declarations
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.

Ethics approval and consent to participate
The study was approved by the institutional ethics board of Nanfang Hospital, and individual consent for this retrospective analysis was waived.

Consent for publication
The consent to publish this manuscript has been obtained from all authors.

Figure 1
Flow diagram of selected patients.


Supplementary Files
This is a list of supplementary files associated with this preprint.