Comparison of the radiomics-based predictive models using machine learning and nomogram for epidermal growth factor receptor mutation status and subtypes in lung adenocarcinoma

The purpose of this study is to develop predictive models for epidermal growth factor receptor (EGFR) mutation status and subtypes [exon 21 point mutation (L858R) and exon 19 deletion mutation (19Del)] and to evaluate their clinical usefulness. A total of 172 patients with lung adenocarcinoma were retrospectively analyzed. Analysis of variance and the least absolute shrinkage and selection operator were used for feature selection from plain computed tomography images. Then, a radiomic score (rad-score) was calculated for the training and test cohorts. Two machine learning (ML) models with 5-fold cross-validation were applied to construct predictive models with the rad-score, clinical features, and the combination of rad-score and clinical features. The nomogram was developed using the rad-score and clinical features. The prediction performance was evaluated by the area under the receiver operating characteristic curve (AUC). Finally, decision curve analysis (DCA) was performed using the best ML and nomogram models. In the test cohorts, the AUCs of the best ML model and the nomogram model were 0.73 (95% confidence interval, 0.59–0.87) and 0.79 (0.65–0.92) in the EGFR mutation groups, 0.83 (0.67–0.99) and 0.85 (0.72–0.97) in the L858R mutation groups, and 0.77 (0.58–0.97) and 0.77 (0.60–0.95) in the 19Del groups. The DCA showed that the nomogram models had results comparable to those of the ML models. We constructed two types of predictive models for EGFR mutation status and subtypes. Because the superiority of the performance of ML and nomogram models varied depending on the prediction groups, appropriate model selection is necessary.


Introduction
Lung cancer is one of the causes of cancer-related deaths; in particular, non-small cell lung cancer (NSCLC) accounts for over 80% of all lung cancers [1]. Adenocarcinoma is the primary histological subtype of NSCLC [2]. Radiation therapy and/or chemotherapy are the treatments of choice for NSCLC; however, molecularly targeted drugs have been widely used in recent years. Epidermal growth factor receptor tyrosine kinase inhibitors (EGFR-TKIs) have been used in patients with lung adenocarcinoma with EGFR mutations [3,4]. EGFR-TKIs have demonstrated longer progression-free survival (PFS) than conventional chemotherapy [5,6]. Moreover, the regulation of EGFR mutations by EGFR-TKIs increases radiosensitivity [7]. Therefore, the identification of EGFR mutations is crucial for planning appropriate treatment regimens.
Among the EGFR mutation subtypes, exon 19 deletion mutation (19Del) and exon 21 point mutation (L858R) account for approximately 90% of all EGFR mutations [3,8]. These mutations are sensitive to EGFR-TKIs. However, 19Del and L858R exhibit different characteristics, and 19Del shows a better response to EGFR-TKIs than L858R [9]. Patients with 19Del also have a longer PFS than patients with L858R after treatment with EGFR-TKIs [10,11]. Therefore, identifying the mutation subtypes is critical for administering appropriate treatment and personalized therapy.
EGFR mutations are typically detected using biopsy or surgical specimens [12]. However, these processes are time-consuming, expensive, and invasive for patients. Tumor heterogeneity also makes it difficult to detect EGFR mutations accurately from tumor-derived tissues [13]. Computed tomography (CT) is a useful tool for the non-invasive diagnosis and analysis of lung cancers [14,15]. Some studies have demonstrated the use of radiomic features of lung tumors on CT images to predict EGFR mutation status and subtypes and/or overall survival (OS) by using machine learning (ML) models [8,16–18]. Radiomics analyzes tumor phenotypes by automatically extracting numerous quantitative features (e.g., size, shape, texture, and histogram) from medical images [14,19,20]. Radiomics has the advantage of analyzing tumor phenotypes non-invasively. Hong et al. reported the clinical effectiveness of a nomogram combined with radiomic features for EGFR mutation status [17]. Furthermore, Zhao et al. reported that radiomics-based nomograms could predict the EGFR mutation subtype [21]. However, most studies evaluated only ML models or only nomogram models, and as far as we know, no studies have compared ML models and nomogram models on the same data set for clinical usefulness. The superiority of the performance of ML and nomogram models may depend on the prediction group.
The purpose of this study is to develop predictive models for EGFR mutation status (EGFR mutation (EGFR+) vs. EGFR wild-type (EGFR-) groups) and subtypes (L858R vs. EGFR- groups and 19Del vs. EGFR- groups) by using ML models and a nomogram, and then to evaluate the clinical usefulness of these models.

Patient data
This study was approved by the Institutional Review Board of the authors' affiliation (#2020-148). A total of 172 patients with NSCLC who had undergone biopsy or surgery between 2016 and 2020 at the authors' affiliation were retrospectively analyzed. The inclusion criteria were as follows: (a) pathologically confirmed as adenocarcinoma, (b) confirmed EGFR mutation status (mutation or wild-type), (c) confirmed EGFR mutation subtype (L858R, 19Del, or other), and (d) non-contrast-enhanced chest CT images acquired before surgery, radiation therapy, chemotherapy, and/or targeted molecular therapy. The exclusion criteria were as follows: (a) patients who had tumors other than adenocarcinoma, (b) patients who had previously undergone surgery or targeted molecular therapy, and (c) simultaneous 19Del and L858R mutations. Patients who met the inclusion criteria were divided into three groups: EGFR+ vs. EGFR-, L858R vs. EGFR-, and 19Del vs. EGFR-. Each group included training and test cohorts. Fifteen patients with EGFR+ had subtypes other than 19Del and L858R. These patients were not included in the subtype groups. The data were randomly divided into training and test cohorts in a ratio of 7:3. Clinical features included age, sex, location of lung tumor, smoking status, staging, C-reactive protein level, carcinoembryonic antigen level, cytokeratin 19 fragment level, and chronic obstructive pulmonary disease status. The lung tumor location was categorized as right upper lobe, right middle lobe, right lower lobe, left upper lobe, or left lower lobe [2,16]. The tumor location referred only to the primary tumor. Figure 1 shows the workflow of this study.
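The grouping and 7:3 split described above can be sketched as follows. This is a minimal illustration on synthetic labels, not the study data; the label names, the class proportions, the random seed, and the use of stratified splitting are all assumptions (the text only states that the split was random).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic mutation labels for 172 hypothetical patients (not study data).
rng = np.random.default_rng(0)
subtype = rng.choice(
    ["wild", "L858R", "19Del", "other"], size=172, p=[0.5, 0.22, 0.2, 0.08]
)

def make_group(subtype, positive):
    """Return patient indices and binary labels for `positive` vs. wild-type."""
    keep = np.isin(subtype, list(positive) + ["wild"])
    idx = np.flatnonzero(keep)
    y = (subtype[idx] != "wild").astype(int)  # 1 = mutation, 0 = wild-type
    return idx, y

# Three prediction groups; "other" subtypes enter only the EGFR+ group.
groups = {
    "EGFR+ vs. EGFR-": ("L858R", "19Del", "other"),
    "L858R vs. EGFR-": ("L858R",),
    "19Del vs. EGFR-": ("19Del",),
}

splits = {}
for name, positive in groups.items():
    idx, y = make_group(subtype, positive)
    # 7:3 split; stratification is an assumption made for this sketch.
    train_idx, test_idx, y_train, y_test = train_test_split(
        idx, y, test_size=0.3, random_state=0, stratify=y
    )
    splits[name] = (train_idx, test_idx, y_train, y_test)

for name, (train_idx, test_idx, _, _) in splits.items():
    print(name, len(train_idx), len(test_idx))
```

Each group is split independently, so the same wild-type patient may fall in the training cohort of one group and the test cohort of another.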

CT imaging
CT examinations were performed on multiple CT scanners, including Aquilion Precision (Canon Medical Systems, Otawara, Japan), Optima CT 660 (GE Healthcare, Waukesha, WI, USA), SOMATOM Sensation 64, SOMATOM Force, and SOMATOM Drive (Siemens Healthcare, Forchheim, Germany). The scanning parameters were as follows: tube voltage, 70-120 kV; tube current, 44-2585 mA; slice thickness, 1.00 or 1.25 mm; image size, 512×512; and field of view, 270-400 mm. All the CT images were acquired with patients in the supine position, holding their breath at end-inspiration with both arms raised. CT images were converted to isotropic voxels with a size of 1.00 mm using linear interpolation.
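Resampling to 1.00 mm isotropic voxels with linear interpolation can be sketched as below. This is only an illustration: the text does not say which tool was used, so `scipy.ndimage.zoom` with `order=1` is an assumption, and the volume, its (z, y, x) axis order, and the input spacing are hypothetical.

```python
import numpy as np
from scipy.ndimage import zoom

def resample_isotropic(volume, spacing, new_spacing=1.0):
    """Resample a 3D volume to isotropic voxels using linear interpolation.

    volume  : 3D array ordered (z, y, x)
    spacing : voxel size in mm along (z, y, x)
    """
    factors = [s / new_spacing for s in spacing]
    # order=1 selects first-order (linear) spline interpolation.
    return zoom(volume, zoom=factors, order=1)

# Hypothetical CT sub-volume: 40 slices of 1.25 mm, 0.7 mm in-plane pixels.
volume = np.random.default_rng(0).normal(size=(40, 64, 64)).astype(np.float32)
iso = resample_isotropic(volume, spacing=(1.25, 0.7, 0.7))
print(iso.shape)  # (50, 45, 45)
```

In practice the segmentation mask would need to be resampled onto the same grid, typically with nearest-neighbor interpolation so that labels stay binary.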

Extraction of radiomic features and feature selection
The lung tumors in the CT images were segmented semiautomatically using GrowCut in the open-source software 3D Slicer (version 4.10.2, Brigham and Women's Hospital) [21,22]. GrowCut segmentation for NSCLC showed a high correlation with the pathology [22]. All segmentations were performed by two medical physicists and supervised by a radiation oncologist with over 16 years of experience in radiation therapy. Observers read CT images on the axial, coronal, and sagittal views using the mediastinum (width, 350 HU; level, 40 HU) and lung window (width, 1500 HU; level, −500 HU) settings. Segmented tumors were used to extract the radiomic features using the open-source software Pyradiomics (version 3.7.1) in Python [23]. In total, 1046 radiomic features were extracted from the segmented tumors. These radiomic features consisted of 100 original features (18 shape features, 14 first-order features, 22 gray-level co-occurrence matrix (GLCM) features, 16 gray-level run length matrix (GLRLM) features, 16 gray-level size zone matrix (GLSZM) features, and 14 gray-level dependence matrix (GLDM) features) and 946 filtered features obtained using two Laplacian of Gaussian (LoG) filters, one gradient filter, and eight wavelet filters. The filtered features were obtained by applying the above filters to the images and then computing the first-order, GLCM, GLRLM, GLSZM, and GLDM features. Radiomic features based on the three-dimensional volume of interest were generated automatically. The major settings for extracting radiomic features were as follows: the bin width for feature extraction, 30, because of its high reproducibility of radiomic features [24]; the sigma of the LoG filter, 1.0 or 3.0 mm; the bin width of the wavelet filter, 10. ResamplePixelSpacing was set as none [25].
After feature extraction, all the features were standardized using the StandardScaler method in the scikit-learn package [26]. To reduce redundant features, the SelectKBest method in the scikit-learn package, which is based on analysis of variance, and the least absolute shrinkage and selection operator (LASSO) were applied to the training cohorts [8,17]. The k-value was set to 100 in the SelectKBest method. Five-fold cross-validation (CV) was applied to determine the tuning parameter that regularizes the magnitude of the penalization, and then features with non-zero coefficients were selected. The radiomics score (rad-score) was calculated as a linear combination of the selected features multiplied by their coefficients [17,21].
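The selection steps above can be sketched with scikit-learn as follows. This is a minimal sketch on synthetic data, not the authors' code: the feature matrix, the cohort sizes, and the choice of `LassoCV` (a linear LASSO fitted to the binary label) are assumptions, as is including the intercept in the rad-score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the radiomic feature matrix (not study data).
X, y = make_classification(
    n_samples=120, n_features=300, n_informative=10, random_state=0
)
X_train, y_train = X[:84], y[:84]   # roughly 7:3
X_test = X[84:]

# 1) Standardize using statistics from the training cohort only.
scaler = StandardScaler().fit(X_train)
# 2) ANOVA F-test filter keeping the k = 100 highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=100)
selector.fit(scaler.transform(X_train), y_train)
Z_train = selector.transform(scaler.transform(X_train))
Z_test = selector.transform(scaler.transform(X_test))

# 3) LASSO with the penalty strength chosen by five-fold CV.
lasso = LassoCV(cv=5, random_state=0, max_iter=10000).fit(Z_train, y_train)
nonzero = np.flatnonzero(lasso.coef_)   # indices of the retained features

def rad_score(Z):
    """Linear combination of retained features and their LASSO coefficients."""
    # Adding the intercept is an assumption of this sketch.
    return lasso.intercept_ + Z[:, nonzero] @ lasso.coef_[nonzero]

train_scores = rad_score(Z_train)
test_scores = rad_score(Z_test)
print(len(nonzero), "features retained")
```

Fitting the scaler, the filter, and the LASSO on the training cohort only, and then applying them unchanged to the test cohort, keeps test information out of the selected features.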

Construction of machine learning and nomogram predictive models
After the rad-score was calculated, clinical features that showed a significant difference were added. Then, two ML models (support vector machine (SVM) and logistic regression (LR)) were used to construct the prediction models for three feature sets (rad-score alone, clinical features alone, and rad-score + clinical features). In the SVM model, the radial basis function kernel was applied, and in both models, the grid search method with five-fold CV was applied to optimize the hyperparameters.
Clinical features that showed a significant difference on univariate analysis and the rad-score were used to construct the nomogram.
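The ML model construction can be illustrated with a short sketch. Everything below is hypothetical: the data are synthetic, the two clinical variables merely stand in for whichever features were significant, and the hyperparameter grids are assumptions because the text does not list the values searched.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic training cohort (not study data).
rng = np.random.default_rng(0)
n = 120
y = rng.integers(0, 2, size=n)
rad = y + rng.normal(scale=1.0, size=n)                  # synthetic rad-score
sex = (rng.random(n) < 0.4 + 0.2 * y).astype(float)      # hypothetical coding
smoking = (rng.random(n) < 0.6 - 0.3 * y).astype(float)  # hypothetical coding

feature_sets = {
    "rad-score": np.column_stack([rad]),
    "clinical": np.column_stack([sex, smoking]),
    "rad-score + clinical": np.column_stack([rad, sex, smoking]),
}
# Grids are assumptions chosen only for illustration.
models = {
    "SVM": (SVC(kernel="rbf"), {"C": [0.1, 1, 10], "gamma": ["scale", 0.1, 1]}),
    "LR": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
}

cv_auc = {}
for fname, X in feature_sets.items():
    for mname, (estimator, grid) in models.items():
        # Grid search with five-fold CV, scored by AUC.
        search = GridSearchCV(estimator, grid, scoring="roc_auc", cv=5)
        search.fit(X, y)
        cv_auc[(fname, mname)] = search.best_score_

for key, auc in cv_auc.items():
    print(key, round(auc, 2))
```

The tuned estimator in `search.best_estimator_` would then be applied once to the held-out test cohort.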

Statistical analysis
For clinical features, quantitative data were assessed using Student's t-test, and categorical data were assessed using the χ² test. The rad-score was compared using the Wilcoxon rank-sum test. The prediction performance of each ML model was evaluated by the area under the receiver operating characteristic curve (AUC) in five-fold CV. Then, the trained models were evaluated on the test cohorts. For the nomogram model, the AUC was also calculated to quantify the discriminative performance of the nomogram [17,21]. The Hosmer-Lemeshow (HL) test was performed to evaluate the goodness-of-fit of the nomogram. Finally, decision curve analysis (DCA) was performed in both cohorts to evaluate the clinical usefulness of the nomogram models and of the ML model that showed the highest AUC in each group [27]. A p-value < 0.05 was considered to indicate a significant difference. All the procedures were performed using in-house programs (Python ver. 3.7.1, R ver. 4.1.1).
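The HL test can be sketched as below. The text does not state the grouping used; ten groups of ordered predicted risk with g − 2 = 8 degrees of freedom is the conventional choice and is assumed here. The function name and the synthetic data are illustrative only.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y_true, y_prob, n_groups=10):
    """Hosmer-Lemeshow chi-square statistic and p-value.

    Patients are sorted by predicted probability and split into
    `n_groups` groups; observed and expected event counts are compared.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    order = np.argsort(y_prob)
    stat = 0.0
    for idx in np.array_split(order, n_groups):
        n_g = len(idx)
        observed = y_true[idx].sum()   # observed events in the group
        expected = y_prob[idx].sum()   # expected events in the group
        mean_p = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1.0 - mean_p))
    dof = n_groups - 2                 # conventional degrees of freedom
    return stat, chi2.sf(stat, dof)

# Synthetic, well-calibrated predictions (not study data).
rng = np.random.default_rng(0)
p = rng.uniform(0.05, 0.95, size=200)
y = rng.binomial(1, p)
stat, p_value = hosmer_lemeshow(y, p)
print(round(stat, 2), round(p_value, 2))
```

A large p-value indicates no detectable difference between predicted and observed event rates, which is read as adequate calibration.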

Results
The clinical data of both cohorts are listed in Supplement Tables 1, 2, and 3. There was a significant difference in sex and smoking in the training cohorts for EGFR mutation status and subtypes. Furthermore, there was a significant difference in age in the training cohort of the EGFR+ vs. EGFR- groups. In the training cohorts, seven, ten, and five radiomic features were selected through feature selection in the EGFR+ vs. EGFR- groups, L858R vs. EGFR- groups, and 19Del vs. EGFR- groups, respectively. As shown in Supplement Tables 1, 2, and 3, the rad-score showed significant differences in both cohorts except for the test cohort of the 19Del vs. EGFR- groups (p = 0.447). Supplement Table 4 presents the selected features used in the ML models.
The AUCs and confidence intervals for each group in the two ML models are shown in Table 1. In each group, the models constructed from the combination of rad-score and clinical features showed the best AUC for both cohorts. The best AUCs of the ML models for the training and test cohorts were 0.82 and 0.73 in the EGFR+ vs. EGFR- groups, 0.83 and 0.83 in the L858R vs. EGFR- groups, and 0.84 and 0.77 in the 19Del vs. EGFR- groups, respectively.
Table 1 The area under the curves and confidence intervals for each group in the two machine learning models. EGFR, epidermal growth factor receptor; SVM, support vector machine; LR, logistic regression; AUC, area under the receiver operating characteristic curve; CI, confidence interval; Rad-score, radiomic score; L858R, exon 21 point mutation; 19Del, exon 19 deletion mutation
Figure 2 shows the nomograms in the training cohort in each group. The AUCs and confidence intervals for each group are shown in Table 1. The AUCs of the nomograms for the training and test cohorts were 0.85 and 0.79 in the EGFR+ vs. EGFR- groups, 0.84 and 0.85 in the L858R vs. EGFR- groups, and 0.86 and 0.77 in the 19Del vs. EGFR- groups, respectively. The HL test showed no significant difference in the training (χ² = 4.82, p = 0.78 in the EGFR+ vs. EGFR- groups; χ² = 11.3, p = 0.18 in the L858R vs. EGFR- groups; and χ² = 10.1, p = 0.26 in the 19Del vs. EGFR- groups) and test cohorts (χ² = 16.2, p = 0.04 in the EGFR+ vs. EGFR- groups; χ² = 4.74, p = 0.79 in the L858R vs. EGFR- groups; and χ² = 11.5, p = 0.18 in the 19Del vs. EGFR- groups), indicating that our models have good fit [28]. Supplement Fig. 1 presents the calibration curves of all nomogram models.
Figure 3 shows the decision curves of the two ML models and the nomogram models in each group. In each group, the ML models used the combination of radiomic and clinical features. These three models showed more net benefit than the treat-all or the treat-none strategy in both cohorts.
For a risk threshold over 35%, the ML and the nomogram models in the test cohort of the EGFR+ vs. EGFR- groups had more net benefit than the treat-all and the treat-none strategies. In the other cohorts, for a risk threshold over 20%, the ML and the nomogram models had more net benefit than the treat-all and the treat-none strategies.
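The net benefit plotted in a decision curve can be computed as in the following sketch. It uses the standard DCA definition, in which the true-positive and false-positive counts are divided by the total number of patients and false positives are weighted by threshold / (1 − threshold). The function names and the four-patient toy data are illustrative assumptions.

```python
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    y_true = np.asarray(y_true).astype(bool)
    pred_pos = np.asarray(y_prob) >= threshold
    n = y_true.size
    tp = np.sum(pred_pos & y_true)     # true positives
    fp = np.sum(pred_pos & ~y_true)    # false positives
    weight = threshold / (1.0 - threshold)
    return tp / n - fp / n * weight

def net_benefit_treat_all(y_true, threshold):
    """Net benefit when every patient is classified as mutation-positive."""
    prevalence = np.mean(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# The treat-none strategy has a net benefit of 0 at every threshold.
y = np.array([1, 1, 0, 0])
prob = np.array([0.9, 0.4, 0.6, 0.1])
print(net_benefit(y, prob, 0.2))        # 0.4375
print(net_benefit_treat_all(y, 0.2))    # 0.375
```

Evaluating these functions over a grid of thresholds and plotting the results gives the decision curve; a model is useful at a given threshold when its curve lies above both reference strategies.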

Discussion
In this study, we constructed ML and nomogram models for predicting EGFR mutation status and subtypes. In the best ML model for the EGFR+ vs. EGFR- groups, the AUC for the test cohort was 0.73 (SVM). Li et al. and Mei et al. reported AUCs of 0.79 and 0.66, respectively [8,18]. The AUCs of the test cohorts of the L858R vs. EGFR- groups and the 19Del vs. EGFR- groups were 0.83 (LR) and 0.77 (SVM), respectively. Liu et al. reported that the AUCs of the L858R vs. EGFR- groups and the 19Del vs. EGFR- groups were 0.92 and 0.77, respectively, using positron emission tomography (PET) CT images [12]. According to Li et al., the accuracy of prediction using only CT images was lower than that of using PET-CT images [29]. Active mutations in EGFR are related to the 18F-FDG uptake manifesting in PET images [30]. Because PET/CT images are considered to depict tumor characteristics in detail, the prediction accuracy was higher than that obtained using plain CT images. Though we cannot directly compare our AUC results because of the different modalities, our models are considered reasonable. The AUC of the 19Del vs. EGFR- groups in the test cohort (0.77) was lower than that of the L858R vs. EGFR- groups (0.83) because of the non-significant difference in rad-score. Moreover, because only a small number of patients were included in the test cohort of the 19Del vs. EGFR- groups, evaluation using a larger number of patients is necessary.
Fig. 2 The nomograms in the training cohort. a Epidermal growth factor receptor (EGFR) mutation vs. wild-type groups. b Exon 21 point mutation vs. wild-type groups. c Exon 19 deletion vs. wild-type groups
Hong et al. reported that the AUC of the nomogram model for the EGFR+ vs. EGFR- groups using contrast-enhanced (CE) CT images was 0.83 in the test cohort [17]. The AUC of our model was 0.79; however, this result was obtained from plain CT images, which carry less information than CE CT images. Our nomogram model is therefore considered reasonable. Some studies reported ML models alone [8,12,18], and some studies reported nomogram models alone [21,24]. However, as far as we know, no studies that compare these models on the same data set have been reported. In this study, we compared ML and nomogram models. Though the AUCs of the ML and nomogram models were almost equal for most models, the ML model showed a higher AUC in the 19Del vs. EGFR- groups in the test cohort, and the nomogram model showed a higher AUC in the EGFR+ vs. EGFR- groups in the test cohort.
Fig. 3 The decision curve analysis (DCA) for two machine learning (ML) and the nomogram models in each group. The y-axis presents the net benefit. The net benefit = true positive rate − (false positive rate × weighting factor), where the weighting factor = threshold / (1 − threshold). The nomogram model had the highest net benefit compared with radiomics alone or clinical features alone and with simple strategies: the gray line, which assumes that all tumors were epidermal growth factor receptor (EGFR) mutation (the treat-all), and the horizontal black line, which assumes that all tumors were EGFR wild-type (the treat-none). a The DCA in the EGFR mutation vs. wild-type groups. b The DCA in the exon 21 point mutation vs. wild-type groups. c The DCA in the exon 19 deletion vs. wild-type groups. In each group, the DCA shows the training (left) and test (right) cohorts, respectively
Moreover, we evaluated clinical usefulness using DCA. The results of DCA showed that the nomogram models had results comparable to those of the ML models. As shown in Fig. 3, the three models showed more net benefit than the treat-all or the treat-none strategy in both cohorts. Since the net benefit in DCA indicates clinical utility [31], we consider all these models to be clinically useful. The model with the higher net benefit has more clinical utility; therefore, clinicians can refer to these results to determine whether clinical decision making based on our models will be useful [31]. Moreover, a nomogram has the advantage of being a simple visualization tool for calculating patient risk. Therefore, nomograms are considered a meaningful tool for clinical usage. Smoking status contributed to the construction of our nomograms, and Mitsudomi et al. reported that EGFR mutations are more common in non-smokers [6], so there is no contradiction with previous findings. Therefore, it may be possible to identify EGFR mutation status and subtypes based on our nomograms. The performance of our nomogram was slightly lower than that of the nomogram constructed by Hong et al. [17]. One reason may be differences in the data set and in the radiomic features that make up the rad-score.
The gold standard for detecting EGFR mutation status is analysis of biopsy or surgical specimens. However, biopsy or surgical specimens could lead to incorrect diagnosis since the samples are often insufficient for analyzing the EGFR mutation status [13,16]. Therefore, combining radiomics with biopsy or surgical specimen results could allow more accurate identification of the EGFR mutation status and subtypes.
Patients treated with EGFR-TKIs have been reported to show longer PFS than those treated with conventional chemotherapy [5,6]. Moreover, 19Del and L858R are reported to exhibit different characteristics, and 19Del shows a better response to EGFR-TKIs than L858R [9]. Furthermore, patients with 19Del have a longer PFS than patients with L858R after treatment with EGFR-TKIs [10,11]. Therefore, it is necessary to identify the EGFR mutation status and subtypes for appropriate selection of EGFR-TKIs, and our prediction models are meaningful in this respect.
Our study has certain limitations. First, only a small number of patients were included in this study; in particular, the sample size of cases with 19Del and L858R mutation subtypes was small. Therefore, a greater number of cases should be examined to improve the validity of our results. In addition, we did not validate our proposed nomograms with an external data set; comparison with other models is also necessary. Second, the clinical data we used were limited. For example, we treated smoking status as "Yes" or "No" in this study and did not consider more detailed smoking habits. Furthermore, we did not use pathological data. In further studies, more clinical information will be added to improve the performance of our models.
We included data from only one institution. In future work, validation using an external data set must be performed.

Conclusion
We constructed prediction models for EGFR mutation status and subtypes based on ML and a nomogram by using radiomic features, and then evaluated the performance of these models. The prediction performances and net benefits were comparable between the ML and nomogram models. Because the superiority of the performance of ML and nomogram models varied depending on the prediction groups, appropriate model selection is necessary.