3.1. Data Description
This study collected 848 patients who visited a hospital in Dalian between 2018.04.06 and 2020.12.15: 478 patients with early-stage lung cancer and 370 patients with benign lung nodules. There were 369 male patients, 186 in the lung cancer group (50.41%) and 183 in the benign nodule group (49.59%), and 479 female patients, 292 in the lung cancer group (60.96%) and 187 in the benign nodule group (39.04%). By age distribution, the 61-70 age group was the largest in both cohorts: it accounted for 41.21% of all early-stage lung cancer patients, and, with 119 patients, for 32.16% of all patients with benign nodules. The specific distribution is shown in Table 1.
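The group percentages above follow directly from the cohort counts; a minimal arithmetic check (counts taken from the study description):

```python
# Reproduce the sex-distribution percentages from the reported cohort counts.
male_total = 369
male_cancer, male_benign = 186, 183
female_total = 479
female_cancer, female_benign = 292, 187

def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)

male_cancer_pct = pct(male_cancer, male_total)       # 50.41
male_benign_pct = pct(male_benign, male_total)       # 49.59
female_cancer_pct = pct(female_cancer, female_total) # 60.96
female_benign_pct = pct(female_benign, female_total) # 39.04
```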
3.2. Performance Comparison of Data Index Screening Algorithms
In this study, we use the XGBoost model as a representative and apply two algorithms for data feature screening: stepwise regression and Boruta. Stepwise regression is a traditional statistical feature screening method; its basic idea is to reduce multicollinearity by eliminating variables that are less important and highly correlated with other variables. The Boruta algorithm is a popular feature screening method in machine learning. It is based on the same idea as the random forest classifier: adding randomness to the system and collecting results from random sample sets can reduce the misleading effects of random fluctuations and correlations. We also used the original dataset as a control group.

The results are shown in Table 2. In the original dataset, all 49 features were used. After Boruta screening, 19 features were included in the study, and after stepwise regression screening, 16 features were included. In terms of the number of included features, stepwise regression retained the fewest, which simplifies the subsequent operation process and shortens the operation time. Comparing accuracy, precision, F1 score, and recall, the accuracy of stepwise regression is 75.29%, the accuracy of Boruta is 72.55%, and the accuracy of the original dataset is 73.73%. The area under the receiver operating characteristic (ROC) curve (AUC) was 0.79 for the original dataset, 0.78 for Boruta, and 0.81 for stepwise regression (Figure 3). These results show that, when tested with the XGBoost model, the stepwise regression algorithm achieves the highest accuracy with the fewest features after filtering. Therefore, in the follow-up research, we choose the 16 features retained by stepwise regression screening as the dataset: Sex, Age, Arg, Asn, Glu, Orn, Ser, Val, C4-OH, C4DC, C5, C5DC, C12, C16, C22, C26.
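The stepwise-style screening described above can be sketched with scikit-learn's `SequentialFeatureSelector`, used here as a stand-in for classical stepwise regression. The data below is synthetic (the study's 49 clinical features are not public), and the target feature count is illustrative:

```python
# Hypothetical sketch of stepwise-style feature screening on synthetic data,
# using SequentialFeatureSelector as a stand-in for stepwise regression.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data; NOT the study's clinical dataset.
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=5,   # the study kept 16 of its 49 features
    direction="forward",      # forward stepwise selection
)
selector.fit(X, y)

selected = np.flatnonzero(selector.get_support())
print("selected feature indices:", selected)
```

Boruta screening follows the same fit-then-mask pattern via the third-party `BorutaPy` package, wrapping a random forest estimator instead of a linear model.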
3.3. Performance Metrics Comparison of Machine Learning Algorithms
A random seed was used to divide the data into training and test sets at a 7:3 ratio, and four machine learning algorithms were compared on accuracy, precision, recall, and F1 score. The receiver operating characteristic (ROC) curve was used to assess predictive ability via the area under the curve (AUC); the larger the AUC value, the stronger the predictive ability. As shown in Table 3, the accuracy of the XGBoost model is 75.29% and its AUC value is 0.81, better than all other models. The random forest model has an accuracy of 72.55% and an AUC value of 0.78. The support vector machine model has an accuracy of 71.37% and an AUC value of 0.77. The K-nearest neighbors model has an accuracy of 66.67% and an AUC value of 0.69 (Figure 4).
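The 7:3 split and four-model comparison can be sketched as follows. The data is synthetic, and scikit-learn's `GradientBoostingClassifier` stands in for XGBoost to avoid an extra dependency; the real study would use the `xgboost` package:

```python
# Illustrative 7:3 split and model comparison on synthetic data
# (NOT the study's metabolite dataset).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=848, n_features=16, random_state=42)
# test_size=0.3 gives the 7:3 train/test split; random_state fixes the seed.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "gbdt": GradientBoostingClassifier(random_state=42),  # XGBoost stand-in
    "random_forest": RandomForestClassifier(random_state=42),
    "svm": SVC(probability=True, random_state=42),
    "knn": KNeighborsClassifier(),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    proba = model.predict_proba(X_te)[:, 1]  # positive-class probability
    results[name] = {
        "accuracy": accuracy_score(y_te, model.predict(X_te)),
        "auc": roc_auc_score(y_te, proba),   # area under the ROC curve
    }
```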
3.4. Performance Comparison of Nomogram and Machine Learning Algorithms
The nomogram shows an accuracy of 68.24%, a sensitivity of 0.71, and a specificity of 0.64, while the machine learning model (XGBoost) shows an accuracy of 75.29%, a sensitivity of 0.74, and a specificity of 0.76. As shown in Table 4, the XGBoost model outperforms the nomogram on these metrics. In the subsequent index feature importance ranking, we therefore apply the XGBoost model.
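Sensitivity and specificity, reported above for both models, follow directly from the confusion matrix; a minimal sketch on toy labels:

```python
# Sensitivity (true positive rate) and specificity (true negative rate)
# computed from a confusion matrix; labels here are toy examples.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = lung cancer, 0 = benign nodule
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)  # fraction of true positives detected
specificity = tn / (tn + fp)  # fraction of true negatives detected
print(sensitivity, specificity)  # 0.75 0.75
```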
3.5. Index Importance Score Ranking
For the 16 included indicators, the XGBoost model was used to score indicator importance, yielding the ranking shown in Figure 6. The order of importance is Orn, Val, C16, Arg, Asn, Glu, Ser, Age, C4DC, C5DC, C5, C22, C4-OH, C12, C26, Sex. Among the amino acids, the most important indicator is ornithine (Orn); among the carnitines, the most important is palmitoylcarnitine (C16).
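Importance-based ranking of the 16 indicators can be sketched as below, again with `GradientBoostingClassifier` standing in for XGBoost and synthetic data in place of the clinical dataset (so the printed order is illustrative, not the study's result):

```python
# Sketch of gradient-boosting feature-importance ranking on synthetic data;
# feature names match the 16 indicators kept after stepwise screening.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

feature_names = ["Sex", "Age", "Arg", "Asn", "Glu", "Orn", "Ser", "Val",
                 "C4-OH", "C4DC", "C5", "C5DC", "C12", "C16", "C22", "C26"]

X, y = make_classification(n_samples=848, n_features=16, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Sort indices by importance score, descending, and map back to names.
order = np.argsort(model.feature_importances_)[::-1]
ranking = [feature_names[i] for i in order]
print(ranking)  # most to least important (synthetic data, order illustrative)
```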