Prediction of radiomics-based machine learning for specific dosimetric verification of pelvic intensity modulated radiotherapy

Background: Machine learning (ML) and deep learning (DL) technologies have been widely used in quality assurance. Because of the complexity of intensity modulated radiotherapy (IMRT), patient-specific quality assurance (PSQA) before treatment has become an essential part of IMRT. This paper therefore aims to establish several machine learning classification models, based on radiomic features, that predict the gamma pass rate for specific dosimetric verification of pelvic IMRT plans, and to identify the best prediction model. Methods: The measured 3D dosimetric verification results of 196 pelvic intensity-modulated radiotherapy plans, with gamma pass rate criteria of 3%/2 mm and a 10% dose threshold, were analyzed retrospectively. Prediction models were established from extracted radiomic feature data. Four machine learning algorithms (random forest, support vector machine, adaptive boosting, and gradient boosting decision trees) were used, and the AUC value, sensitivity, and specificity of each were calculated. The classification performance of the four prediction models was evaluated. Results: The sensitivities of the random forest, support vector machine, adaptive boosting, and gradient boosting decision trees models were 0.93, 0.85, 0.93, and 0.96, and their specificities were 0.38, 0.69, 0.46, and 0.46, respectively. The AUC values for the random forest model and the adaptive boosting model were 0.81 and 0.82, respectively, and the AUC values for the support vector machine and gradient boosting decision trees models were both 0.87. Conclusions: Machine learning methods based on radiomics can be used to establish a prediction model of gamma pass rate for specific dosimetric verification of pelvic intensity modulated radiotherapy. The classification performance of the support vector machine model and the gradient boosting decision trees model is better than that of the random forest model and the adaptive boosting model. A prediction model built for a specific site helps to improve model performance.


Background
Intensity-modulated radiation therapy (IMRT) provides a highly conformal dose distribution to the tumor while significantly reducing the dose to the surrounding normal tissue [1]. Because of the complexity of IMRT, patient-specific quality assurance (PSQA) before treatment has become an essential part of the whole process [2]. Conventional specific dosimetric verification is phantom-based and includes dose recalculation, data transmission, setup, beam delivery, and γ analysis. It not only increases the workload of the medical physicist but also delays the first treatment of patients [3]. To improve the efficiency and safety of IMRT, treatment plan complexity parameters have been used to predict, before treatment, which plans will not pass [4].
With the development of artificial intelligence, machine learning (ML) and deep learning (DL) technologies have been widely used in quality assurance (QA) [5][6]. ML/DL-based prediction models of the IMRT plan gamma pass rate (GPR) have been trained and tested using data from multiple treatment sites. The feature data of different treatment sites have a very important influence on the classification performance of a GPR prediction model, and a prediction model built for a specific site helps to improve model performance [7][8][9][10]. However, studies on the pelvic site are still lacking. This paper therefore aims to establish several machine learning classification models, based on radiomic features, that predict the gamma pass rate for specific dosimetric verification of pelvic IMRT plans, and to identify the best prediction model.

Data source
Data were retrospectively collected from 196 patients who received pelvic IMRT at the Radiotherapy Center of Hunan Cancer Hospital from November 2021 to November 2022. The QA plans for all patients were calculated using the Pinnacle³ (Version 9.2, Philips) treatment planning system with a computational grid of 3 mm. The 3D dosimetric verification of the QA plans was measured with the Delta4 system. The linear accelerator and the Delta4 system were regularly calibrated during the measurement period to ensure that the equipment performed well.
Using the criteria recommended by the AAPM TG218 report [11] (absolute dose, 3%/2 mm, global normalization, and a 10% dose threshold), the mean GPR was 96.6% (range 78.8% to 100%). The GPR statistics are shown in Fig. 1. To better evaluate the classification performance of the prediction model, the 99.5% confidence level of the mean value of GPR was used as the threshold for the GPR. When the GPR value was greater than this threshold, the measured GPR result was labeled "pass"; otherwise it was labeled "failure".

Feature extraction
Radiomic features are semi-quantitative and/or quantitative features extracted from medical images; combined with artificial intelligence, they play an important role in radiotherapy [12]. In this study, the region enclosed by the isodose line at 10% of the maximum dose was used as the region from which the radiomic features were extracted. The features were extracted in batch using the radiomics library in Python 3.7. The image types included the original image (Original), the wavelet transform image (Wavelet), and the Laplacian of Gaussian filtered image (LoG). There were 7 different types of features: shape features (2D/3D), first-order features, gray level co-occurrence matrix (GLCM), gray level size zone matrix (GLSZM), gray level run length matrix (GLRLM), neighboring gray tone difference matrix (NGTDM), and gray level dependence matrix (GLDM). In total, 1,130 features (Original: 107, Wavelet: 744, LoG: 279) were extracted.
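The region of interest described above can be illustrated with a short sketch. It builds a binary mask of all voxels receiving at least 10% of the maximum dose; the function name `isodose_mask` and the synthetic Gaussian dose grid are assumptions made for illustration and are not taken from the study. The feature extraction step itself is not reproduced here.

```python
import numpy as np

def isodose_mask(dose, fraction=0.10):
    """Binary mask of voxels receiving at least `fraction` of the maximum dose.

    `isodose_mask` is a hypothetical helper name, not part of any library.
    """
    dose = np.asarray(dose, dtype=float)
    threshold = fraction * dose.max()
    return (dose >= threshold).astype(np.uint8)

# Synthetic 3D "dose" grid (assumption): a Gaussian blob standing in for a
# real dose file exported from the treatment planning system.
z, y, x = np.mgrid[-10:11, -10:11, -10:11]
dose = 50.0 * np.exp(-(x**2 + y**2 + z**2) / (2 * 4.0**2))

mask = isodose_mask(dose)
print("voxels inside the 10% isodose region:", int(mask.sum()))
```

A mask of this kind, together with the dose grid, is what a radiomics extractor would typically receive as its segmentation and image inputs.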

Data preprocessing
The whole data set was randomly divided, with 80% of the data (156 cases) used as the training set and 20% used as the test set. Because the data were imbalanced, stratified sampling was used so that the proportion of each class in the training set and the test set was consistent with the original data. Then, the data were normalized with Equation (1):

$$\chi = \frac{X - \mu}{\sigma} \qquad (1)$$

where χ is the value after normalization, X is the original value, μ is the mean of the feature, and σ is the standard deviation of the feature. The normalization was first fitted on the training set, and the same transformation was then applied to the test set.
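The preprocessing steps can be sketched with the standard library alone. The helper names (`stratified_split`, `fit_zscore`, `apply_zscore`), the 140/56 class counts, and the random feature values below are all assumptions made for illustration; the study's actual class counts and tooling are not stated in this section. The point of the sketch is the order of operations: split first, estimate μ and σ on the training rows only, then reuse them on the test rows.

```python
import random
from statistics import mean, pstdev

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def fit_zscore(rows):
    """Per-feature mean and standard deviation, estimated from `rows` only."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # guard against constant features
    return mus, sigmas

def apply_zscore(rows, mus, sigmas):
    """Apply Equation (1), chi = (X - mu) / sigma, feature by feature."""
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]

# Hypothetical data: 196 plans, 3 features, imbalanced labels (1 = pass, 0 = failure).
rng = random.Random(42)
labels = [1] * 140 + [0] * 56
features = [[rng.gauss(10, 3), rng.gauss(0, 1), rng.gauss(-5, 2)] for _ in labels]

train_idx, test_idx = stratified_split(labels)
X_train = [features[i] for i in train_idx]
X_test = [features[i] for i in test_idx]

mus, sigmas = fit_zscore(X_train)              # statistics come from the training set
X_train_n = apply_zscore(X_train, mus, sigmas)
X_test_n = apply_zscore(X_test, mus, sigmas)   # same transformation reused on the test set
```

Fitting the statistics on the training set only keeps information from the test set out of the model inputs.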

Feature selection
Feature selection is a key step in building a machine learning prediction model. Reasonable feature selection can avoid the curse of dimensionality, reduce the training time, increase the interpretability of the model, and reduce overfitting so as to improve the generalization performance of the model [13][14]. Feature selection was performed using an embedded method based on the extremely randomized trees (Extra-Trees) algorithm. Because its split points are selected at random, Extra-Trees builds highly randomized trees, which reduces the variance and distinguishes it from other tree-based algorithms [15]. Extra-Trees was applied to the training data to rank the relative importance of all variables with respect to the target values, and 11 features were finally selected as model inputs based on the evaluation of model performance.
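A minimal sketch of importance-based selection with Extra-Trees follows, assuming scikit-learn's `ExtraTreesClassifier` as the implementation (the text does not name the library). The synthetic data, the 200 trees, and the fixed cut-off of 11 features are assumptions; in the study the number of retained features was chosen from model performance rather than fixed in advance.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.default_rng(0)
n_samples, n_features, n_keep = 156, 50, 11

# Synthetic training data (assumption): only feature 0 carries signal.
y_train = rng.integers(0, 2, size=n_samples)
X_train = rng.normal(size=(n_samples, n_features))
X_train[:, 0] += 3.0 * y_train

# Fit on the training data only and rank features by impurity-based importance.
forest = ExtraTreesClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

ranking = np.argsort(forest.feature_importances_)[::-1]  # most important first
top_idx = ranking[:n_keep]
X_train_selected = X_train[:, top_idx]

print("indices of retained features:", top_idx.tolist())
```

The same `top_idx` would then be applied to the test set so that both sets share identical input columns.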

Model establishment and evaluation
Four machine learning classification algorithms, random forest (RF) [16], support vector machine (SVM) [17], adaptive boosting (AdaBoost) [18], and gradient boosting decision trees (GBDT) [19], were fitted to the training data and then used to predict on the test set. The model parameters were adjusted to achieve better classification performance.
The performance of the binary classification models was evaluated using precision, sensitivity, specificity, F1-score, and the area under the curve (AUC), where the curve is the receiver operating characteristic (ROC) curve. TP and FP represent the numbers of positive and negative samples, respectively, that were predicted as positive. TN and FN represent the numbers of negative and positive samples, respectively, that were predicted as negative. The modeling and analysis were done in Python 3.7.
Precision: the proportion of samples predicted as positive that are truly positive. See Equation (2).

$$\text{Precision} = \frac{TP}{TP + FP} \qquad (2)$$

Sensitivity (Recall): the proportion of positive samples that are correctly predicted as positive. See Equation (3).

$$\text{Sensitivity} = \frac{TP}{TP + FN} \qquad (3)$$

Specificity: the proportion of negative samples that are correctly predicted as negative. See Equation (4).

$$\text{Specificity} = \frac{TN}{TN + FP} \qquad (4)$$

F1-score: a single score that considers both the precision and the recall of the classification model. See Equation (5).

$$\text{F1} = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \qquad (5)$$

ROC curve: the curve with the false positive rate (FPR) as the abscissa and the true positive rate (TPR) as the ordinate at different thresholds. The AUC value is the area of the region below the ROC curve, and larger values indicate a better model.
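As a worked illustration of these definitions, the sketch below computes the four count-based metrics from a confusion matrix and the AUC from prediction scores. The function names and all numbers are hypothetical and are not results from this study. The AUC is computed through its rank interpretation (the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one, with ties counted as one half), which equals the area under the ROC curve.

```python
def classification_metrics(tp, fp, tn, fn):
    """Precision, sensitivity, specificity and F1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)        # recall, true positive rate
    specificity = tn / (tn + fp)        # true negative rate
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }

def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts and scores, chosen only to make the arithmetic easy to follow.
metrics = classification_metrics(tp=8, fp=2, tn=6, fn=4)
auc = roc_auc(labels=[1, 1, 1, 0, 0], scores=[0.9, 0.8, 0.4, 0.5, 0.2])

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
print(f"auc: {auc:.3f}")
```

With these counts, precision is 8/10, sensitivity is 8/12, and specificity is 6/8. In the AUC example, five of the six positive-negative pairs are ordered correctly.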

Feature selection results
Table 1 lists the 11 features used as model inputs after feature screening. By image type, 9 features were derived from the Wavelet images and the remaining 2 from the original images. By feature type, 7 features belonged to GLCM, 3 to GLSZM, and the remaining 1 to GLRLM.

Hyperparameter values
Table 2 shows the main hyperparameter values used in the four models after parameter adjustment. For the RF model, n_estimators was 285 and the max_depth of the trees was limited to 14. For the SVM model, the penalty coefficient C was 32 and the kernel function coefficient gamma was 0.15. For the AdaBoost model, n_estimators was 50 and learning_rate was 0.75. For the GBDT model, smaller hyperparameter values were selected than for the RF model.
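The sketch below shows how the four classifiers could be configured with the hyperparameter values quoted in this section, assuming scikit-learn as the implementation (the text does not name the library). The RBF kernel for the SVM and the default settings for GBDT are assumptions, since those values are not given here. The data are synthetic and the split is a plain slice, so the printed AUC values have no relation to the reported results.

```python
import numpy as np
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

models = {
    "RF": RandomForestClassifier(n_estimators=285, max_depth=14, random_state=0),
    "SVM": SVC(C=32, gamma=0.15, kernel="rbf"),  # kernel choice is an assumption
    "AdaBoost": AdaBoostClassifier(n_estimators=50, learning_rate=0.75, random_state=0),
    "GBDT": GradientBoostingClassifier(random_state=0),  # library defaults (assumption)
}

# Synthetic stand-in for the 11 selected features of 196 plans (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(196, 11))
y = (X[:, 0] + 0.5 * rng.normal(size=196) > 0).astype(int)
X_train, X_test, y_train, y_test = X[:156], X[156:], y[:156], y[156:]

auc = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    if isinstance(model, SVC):
        # SVC without probability estimates exposes a signed margin instead.
        scores = model.decision_function(X_test)
    else:
        scores = model.predict_proba(X_test)[:, 1]
    auc[name] = roc_auc_score(y_test, scores)

for name, value in auc.items():
    print(f"{name}: AUC = {value:.2f}")
```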

Feature importance assessment
Fig. 2 shows, for each of the four models, the input features ranked by SHAP value on the test set, together with the distribution of the influence of each feature on the model output. Color represents the feature value (red is high, blue is low). According to the results, the most important features in the RF, SVM, AdaBoost, and GBDT models are Feature7, Feature0, Feature8, and Feature7, respectively.

Evaluation of classification performance
Table 3 compares the performance evaluation metrics of the four models on the test set. The confusion matrices of the four machine learning models are shown in Fig. 3. Fig. 4 compares the ROC curves of the four models: the AUC values for the RF and AdaBoost models were 0.81 and 0.82, respectively, and the AUC values reached 0.87 for the SVM and GBDT models.

Fig. 2. Assessment of feature importance for the four different models (panels: RF model, SVM model, AdaBoost model, GBDT model).

Discussion
Specific dosimetric verification is an essential step in intensity modulated radiotherapy. Medical physicists need to spend a large amount of time on this work. With the increasing number of patients, it is difficult to complete the specific dosimetric verification for every patient in time, which can delay the first treatment for some patients. In this study, we established machine learning classification models, based on radiomic features, that predict the gamma pass rate for specific dosimetric verification of pelvic intensity modulated radiotherapy plans. Such models would help the medical physicist to judge the "pass" or "failure" of an IMRT plan without making actual phantom-based measurements.
Few studies have combined machine learning with radiomics for GPR classification prediction. Hirashima et al. [8] developed a machine learning model to predict GPR from radiomic features extracted from data covering multiple treatment sites, and compared its performance against a model using plan complexity features. Their results showed that classification performance is affected when the data set contains multiple tumor sites. In our study, four machine learning models (RF, SVM, AdaBoost, and GBDT) were used to predict the GPR of pelvic IMRT plans under the 3%/2 mm tolerance. The AUC values of all four models were above 0.8, and those of the SVM and GBDT models were 0.87. The three tree-based ensemble models had a sensitivity above 0.9, the GBDT model was best in F1-score and sensitivity, and the SVM model had the highest precision and specificity. Overall, the GBDT and SVM models performed better than the other two models and better than the results of Hirashima's study under the same criterion. The reason may be that the data of this study came from a single specific site, the pelvis, which could improve the performance of the classification prediction model to some extent.
Park et al. [20][21][22] showed that the texture features calculated from the GLCM were important indicators of dose distribution complexity and correlated better with the accuracy of plan delivery than other indices. Similar results were obtained in this study: 7 of the 11 features selected as final inputs belong to the GLCM class, and the most important features of the four models are wavelet-LLL_glcm_Imc2, wavelet-LLL_glcm_Idmn, and wavelet-LLH_glcm_ClusterTendency. The GLCM radiomic features were thus important indicators in the GPR classification prediction model. In addition, we found that the GLSZM and GLRLM features also had a great influence on the GPR classification prediction model. To better interpret the output of the machine learning models, SHAP was used to analyse the importance of the features [23]. In each model, the influence of each feature on the model output can be explained according to the trend of its SHAP value, and the contribution of the same feature differed between models.
The pelvic patient data selected in this study mainly concerned gynecological tumors. In the future, more data from different pelvic tumor types (rectum, prostate, bladder, etc.) will be included to further verify the prediction model. In addition, considering the universality of the input features across different radiotherapy techniques, only the radiomic features extracted from dose files were selected for modeling in this paper. In the future, data from different treatment techniques will be included in the predictive models. Multi-institutional validation of the results is essential if machine learning predictive models are to support clinical decision making. Valdes et al. [24] demonstrated the feasibility and effectiveness of applying an IMRT GPR prediction model in different institutions. In the future, we will establish prediction models with better versatility and robustness by considering data sets from different radiotherapy institutions.

Conclusions
In conclusion, machine learning methods based on radiomics can be used to establish a prediction model of gamma pass rate for specific dosimetric verification of pelvic intensity modulated radiotherapy. The classification performance of the SVM and GBDT models is better than that of the RF and AdaBoost models. These models can give the medical physicist time to focus on the "failed" plans and provide safe and efficient PSQA management for patients.

Fig. 3. Confusion matrix of the four machine learning models.

Fig. 4. The ROC curves of the four different models on the test set.

Table 1. Features selected as model inputs.

Table 2. Main hyperparameter values used in the four models.

Table 3. Comparison of performance evaluation metrics of the four models.