Stereotactic Body Radiotherapy for Hepatocellular Carcinoma: Current Evidence and the Feasibility of Radiomics-based Predictive Models

Background: Stereotactic body radiotherapy (SBRT) is an effective but less focused alternative for treatment of hepatocellular carcinoma (HCC). To date, a personalized model for predicting therapeutic response is lacking. This study aimed to review current knowledge and to propose a radiomics-based machine-learning (ML) strategy for local response (LR) prediction.
Methods: We searched the literature for studies conducted between January 1993 and August 2019 that included > 100 patients. Additionally, 172 HCC patients in our hospital were retrospectively analyzed between January 2007 and December 2016. In the radiomic analysis, 41 treated tumors were contoured and 46 radiomic features were extracted.
Results: The 1-year local control was 85.4% in our patient cohort, comparable with current results (87-99%). The Support Vector Machine (SVM) classifier, based on computed tomography (CT) scans in the A phase processed by equal probability (Ep) quantization with 8 gray levels, showed the highest mean F1 score (0.7995) for favorable LR within 1 year (W1R), at the end of follow-up (EndR), and condition of in-field failure-free (IFFF). The area under the curve (AUC) for this model was 92.1%, 96.3%, and 99.2% for W1R, EndR, and IFFF, respectively.
Conclusions: SBRT has high 1-year local control, and our study sets the basis for constructing predictive models for HCC patients receiving SBRT.
Abbreviations: AUC: area under the curve; mRECIST: modified Response Evaluation Criteria in Solid Tumors; DICOM: Digital Imaging and Communications in Medicine; CERR: Computational Environment for Radiotherapy Research; SMOTE: Synthetic Minority Oversampling Technique; LASSO: Least Absolute Shrinkage and Selection Operator; LRG: logistic regression; ROC: Receiver operating characteristic; KM: Kaplan-Meier; OS: overall survival; CP: Child-Pugh; PVT: portal vein thrombosis; LRHGE: Long Run High Gray-Level Emphasis; GLN: Gray-Level Non-Uniformity; LGRE: Long Gray-Level Run Emphasis

that tumors of a specific genotype display heterogeneous phenotypes and anatomic variations [8].
The application of radiomics involves the use of personalized image data that can be incorporated within the clinical decision-support system, enabling the construction of models that predict therapeutic outcomes [9]. As sophisticated tools for image analysis generate large amounts of data, machine learning (ML) has emerged as a powerful methodology for constructing precise predictive models, thus improving predictive performance [10].
In HCC, radiomics is particularly valuable because this cancer is most often diagnosed via dynamic three-phase computed tomography (CT) scans, thus potentiating the use of image features to predict therapeutic responses. To the best of our knowledge, there have been no studies using CT-based radiomics to predict the effect of SBRT on HCC. Therefore, in addition to examining current evidence of SBRT in HCC, we aimed to propose a radiomics-based ML strategy for the prediction of local response (LR) after SBRT.

noise and increase the sensitivity of radiomic analyses, a bin width of 25 Hounsfield units was used for discretization prior to global texture analysis [15]. Radiomic features were subsequently extracted and divided into 4 categories: first-order statistics, shape, global, and texture. For matrix-based texture features, different combinations of quantization method (equal probability (Ep) and uniform (U) quantization) and number of gray levels (8, 16, 32, and 64) were applied [16]. Incorporating CT scans from each phase, the training datasets combined features from different CT phases, quantization methods, and gray levels, and were termed AEp (8, 16, 32, and 64), AU (8, 16, 32, and 64), EEp (8, 16, 32, and 64), EU (8, 16, 32, and 64), DEp (8, 16, 32, and 64), and DU (8, 16, 32, and 64). Altogether, 41 tumors were identified and 46 radiomic features were extracted prior to further processing.
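The difference between the two quantization schemes named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy intensity values, and the choice of quantile cut points are assumptions made for illustration only.

```python
from bisect import bisect_right


def uniform_quantize(values, n_levels):
    """Uniform (U) quantization: equal-width bins between min and max intensity."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1] * len(values)
    width = (hi - lo) / n_levels
    # The maximum sits on the upper edge, so clamp it into the top level.
    return [min(int((v - lo) / width) + 1, n_levels) for v in values]


def equal_probability_quantize(values, n_levels):
    """Equal-probability (Ep) quantization: cut points at intensity quantiles,
    so each gray level receives roughly the same number of voxels."""
    ordered = sorted(values)
    n = len(ordered)
    cuts = [ordered[(k * n) // n_levels] for k in range(1, n_levels)]
    return [bisect_right(cuts, v) + 1 for v in values]


# Hypothetical skewed intensities (one bright outlier), quantized to 4 levels.
voxels = [0, 1, 2, 3, 4, 5, 6, 100]
u_levels = uniform_quantize(voxels, 4)
ep_levels = equal_probability_quantize(voxels, 4)
print(u_levels)   # [1, 1, 1, 1, 1, 1, 1, 4]
print(ep_levels)  # [1, 1, 2, 2, 3, 3, 4, 4]
```

With a single outlier, uniform quantization pushes almost every voxel into the lowest level, whereas equal-probability quantization spreads the voxels evenly across levels.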

Data augmentation and adjustment of imbalanced data
Oversampling with bootstrapping was used to expand our data population to 5000 samples. Synthetic Minority Oversampling Technique (SMOTE) was then used to adjust the imbalanced data. For the minority class, SMOTE is advantageous for making the decision region more general and improving the classifier performance [17].
All 41 identified tumors were used for radiomic analyses. 78% (32/41) of the initial samples were extracted randomly and oversampled for training. The remaining 22% (9/41) was oversampled in a similar way for subsequent testing. The final cohort comprised 5201 training and 1301 testing samples, both of which were then balanced by SMOTE for positive and negative LRs.
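The two augmentation steps described in this section can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used in the study: the helper names, the toy feature vectors, the seed, and the neighbour count `k` are all hypothetical, and the interpolation routine is a reduced version of the SMOTE idea rather than a library implementation.

```python
import random


def bootstrap_oversample(samples, n_target, rng):
    """Resample with replacement until n_target samples are drawn."""
    return [rng.choice(samples) for _ in range(n_target)]


def smote_like(minority, n_new, k, rng):
    """Generate n_new synthetic minority points, each placed at a random
    position on the segment between a minority sample and one of its
    k nearest minority neighbours."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # 0 keeps the base point, values near 1 approach the neighbour
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical two-feature samples; the minority class has 3 of 9 points.
majority = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.1, 0.0), (0.0, 0.3)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]

resampled = bootstrap_oversample(minority, 10, rng)
synthetic = smote_like(minority, len(majority) - len(minority), k=2, rng=rng)
balanced_minority = minority + synthetic  # now the same size as the majority class
```

Because every synthetic point is interpolated between existing minority samples, the augmented data cannot leave the region spanned by the original tumors.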

Feature selection
The radiomic features extracted from the contoured region of each tumor were averaged and subsequently normalized across the cohort. Features were selected with the Least Absolute Shrinkage and Selection Operator (LASSO), a regression method that improves prediction accuracy through penalized estimation [18]. LASSO performs feature selection by tuning a regularization parameter (λ): during this process, most covariate coefficients are shrunk to zero, and the remaining features with non-zero coefficients are selected. For each training set, which included the training features with the corresponding training responses (i.e., W1R, EndR, and IFFF), the features selected by LASSO were used to build the classifiers.
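How the penalty drives coefficients to exactly zero can be seen in a minimal coordinate-descent sketch. It assumes a linear LASSO with objective (1/2n)·||y − Xw||² + λ·||w||₁ on centered features, which the text does not specify. The data, the λ value, and the function names are illustrative assumptions.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values inside [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the partial residual
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, lam) / (norm / n)
    return w


# Hypothetical normalized features: column 0 drives the response strongly,
# column 1 only weakly, column 2 not at all.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2.05, -1.95, 1.95, -2.05]  # equals 2.0 * column 0 + 0.05 * column 1

coefficients = lasso_coordinate_descent(X, y, lam=0.1)
selected = [j for j, c in enumerate(coefficients) if c != 0.0]
print(selected)  # [0]
```

Only the strongly informative feature survives; the weak and irrelevant ones are shrunk to zero, which is the selection behaviour described above.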

Support Vector Machine (SVM) and Logistic Regression (LRG) classifiers
Because of their high prediction accuracy in various clinical settings [19,20], SVM and LRG were adopted in our study to construct classifiers for LR. The SVM classifier deals with non-linear interactions and was used to discriminate whether an LR was achieved or not [21]. The penalty parameter C was set to 1 to determine the tradeoff between fitting error and model complexity. A radial basis function was used as the kernel function in our SVM classifier. The LRG was used to predict the likelihood of positive LR, and a probability equal to 0.5 was set as the minimum threshold to determine the predicted class.
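Two ingredients named in this section, the radial basis function kernel and the 0.5 probability threshold, can be written out in a few lines. The sketch below does not train either classifier. The kernel width `gamma`, the weights, and the bias are placeholder values, since the text does not report them.

```python
import math


def rbf_kernel(a, b, gamma):
    """Radial basis function kernel: exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def logistic_probability(features, weights, bias):
    """Probability of a positive LR from a fitted logistic model."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def predict_class(probability, threshold=0.5):
    """Positive class when the probability reaches the threshold."""
    return 1 if probability >= threshold else 0


# Hypothetical values: gamma, weights, and bias are placeholders, not fitted results.
similarity_same = rbf_kernel((0.0, 0.0), (0.0, 0.0), gamma=0.5)  # identical points
similarity_far = rbf_kernel((0.0, 0.0), (1.0, 1.0), gamma=0.5)   # decays with distance
p = logistic_probability((1.0, -2.0), weights=(0.8, 0.4), bias=0.0)
label = predict_class(p)  # a probability of exactly 0.5 counts as positive
```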

Evaluation of model performance
The performance of the SVM and LRG classifiers was evaluated for accuracy, sensitivity, and specificity after 10-fold cross-validation in the training cohorts. The F1 score was used to evaluate model robustness in the testing cohorts. The classifier with the highest mean F1 score for the three LRs was chosen as the candidate model. Receiver operating characteristic (ROC) curves were then used to assess the output quality with the area under the curve (AUC) and 5-fold cross-validation. All the functions used in our analyses were implemented in Python.
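The evaluation metrics listed above follow directly from the confusion counts and from pairwise score comparisons. The sketch below is an illustration on made-up labels and scores, not study data. It assumes binary labels (1 = positive LR) and that both classes are present.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 1 and 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),          # true positive rate
        "specificity": tn / (tn + fp),          # true negative rate
        "f1": 2 * tp / (2 * tp + fp + fn),      # harmonic mean of precision and recall
    }


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced example: 6 responders, 2 non-responders.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.65, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

In this toy case the F1 score is about 0.83 while specificity is only 0.5, which shows how a classifier on imbalanced data can score well on F1 yet discriminate negatives poorly.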

Statistical analysis
For the evaluation of radiomic data processing methods, the 50th percentile of accuracy was used as the threshold. The numbers of processing methods with an accuracy above the 50th percentile were identified and used for comparison of parameter robustness. Kaplan-Meier (KM) analysis with the log-rank test was used to evaluate the effect of therapeutic response on survival. A P value < 0.05 was considered significant. All statistical analyses and survival calculations were performed in R.
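The Kaplan-Meier estimate used for the survival comparison can be sketched as below. The study performed this analysis in R. The Python version here is only an illustration of the product-limit calculation on hypothetical follow-up times, and it omits the log-rank test.

```python
def kaplan_meier(times, events):
    """Product-limit estimate. events: 1 = death observed, 0 = censored.
    Returns [(time, survival probability)] at each distinct death time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e == 1)
        leaving = sum(1 for tt, e in data if tt == t)  # deaths plus censored at t
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
        i += leaving
    return curve


def median_survival(curve):
    """First time at which the estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached during follow-up


# Hypothetical follow-up in months for five patients.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = kaplan_meier(times, events)
median = median_survival(curve)
```

Censored patients leave the risk set without lowering the curve, which is why the step at month 3 uses four patients at risk rather than five.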

Reported clinical evidence for SBRT in HCC
A total of 5 retrospective studies with > 100 patients were selected (Table 1). To summarize, SBRT led to encouragingly high local control (LC) in HCC (1-year LC: 87% to 99%) [1,22-25]. Despite a lower overall survival (OS) reported by Bujold et al., the other studies showed high 1-year OS, ranging from 94% to 99%. Four of the five studies used SBRT in small- to medium-sized HCC, and in all studies the majority of patients had Child-Pugh (CP) class A liver function. Only Nabavizadeh et al. investigated SBRT in CP class C patients (13%) [25]. The dose fractionation scheme was variable, ranging from 24 Gray (Gy) in 6 fractions to 60 Gy in 3 fractions, resulting in a biologically effective dose ranging from 33.6 Gy to 78 Gy.
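As a worked example of how a fractionation scheme maps to a biologically effective dose (BED), the sketch below applies the standard linear-quadratic formula, BED = n·d·(1 + d/(α/β)). The α/β ratio of 10 Gy is an assumption commonly made for tumor tissue and is not stated in the text. The function name is hypothetical.

```python
def biologically_effective_dose(total_dose_gy, n_fractions, alpha_beta_gy=10.0):
    """Linear-quadratic BED: total dose * (1 + dose per fraction / (alpha/beta))."""
    dose_per_fraction = total_dose_gy / n_fractions
    return total_dose_gy * (1 + dose_per_fraction / alpha_beta_gy)


# 24 Gy in 6 fractions, assuming alpha/beta = 10 Gy (an assumption, see above).
bed_low = biologically_effective_dose(24, 6)
print(round(bed_low, 1))  # 33.6
```

Under this assumption, 24 Gy in 6 fractions reproduces the lower bound of 33.6 Gy quoted above. The BED reported for any other scheme depends on the prescription and the α/β value used in the respective study.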

In-Patient Demographics
Overall, 172 patients receiving SBRT for HCC in our institution were retrospectively reviewed. Patient information and therapeutic outcomes are summarized in Table 2. The median tumor size was 5.4 cm, ranging from 0.8 cm to as large as 20.1 cm. CP class A patients accounted for 76.7% and CP class B patients for 23.3% of the overall cohort. In this cohort, 36.6% of patients presented with portal vein thrombosis (PVT).
Patients in the radiomic cohort showed characteristics similar to those in the overall cohort. The median tumor size was 5 cm, with the largest treated tumor being 13 cm. CP class A patients accounted for 84.8% and CP class B patients for 15.2%. Of the 33 patients, 21.2% showed PVT. The radiation dose ranges were also similar between the two cohorts (25-65 Gy in 4-6 fractions in the overall cohort; 32-60 Gy in 4-6 fractions in the radiomic cohort).

Construction and Evaluation of the Predictive Models
Accounting for the three types of LR, a total of 72 datasets were analyzed. Using SVM, the classifier with the highest accuracy for W1R (98.7%) was built from features processed in the A phase with Ep quantization and 32 gray levels (AEp32) (Figure 1A). For EndR, features processed in the E phase with Ep quantization and 8 gray levels (EEp8) had the highest accuracy (99.3%, 95% confidence interval [CI] = 91.2-99.5%); and for IFFF, features processed in the A phase with Ep quantization and 16 gray levels (AEp16) corresponded to the highest accuracy (99.7%, 95% CI = 99.5-99.8%). With regard to sensitivity and specificity, CT scans in the A and D phases generally showed higher values than those in the E phase (Supplementary Figure 1A). Interestingly, AEp8 exhibited the highest sensitivity (98.2%, 95% CI = 97.9-99.0%) for W1R and the highest specificity (76.5%, 95% CI = 72.1-77.3%) for EndR. However, the specificity was generally low for IFFF in all datasets.
When examining the processing sources and methods, we identified a roughly similar distribution of the three phases for the three LRs among the datasets above the 50th percentile of accuracy (Figure 1B). Moreover, classifiers with higher accuracy were mostly constructed from features with 8, 16, or 32 gray levels, not 64 gray levels. For the quantization methods, features processed with Ep quantization corresponded to higher accuracy for W1R (n = 9), compared with those processed with U quantization (n = 3). In the testing cohort, the dataset processed in the A phase with Ep quantization and 8 gray levels (AEp8) exhibited the highest mean F1 score for the three LRs (0.7995) (Figure 1C). When each parameter was considered separately, Ep quantization and 8 gray levels in the A phase had the highest cumulative F1 scores compared to other parameters (Figure 1D). Based on the above findings, AEp8 was chosen as the candidate model. ROC curves were plotted, and we observed that the AUC for IFFF was the highest (AUC = 99.2%, 95% CI = 99.0-93.2%), followed by EndR (AUC = 96.3%, 95% CI = 96.0-96.9%) and W1R (AUC = 92.1%, 95% CI = 91.8-93.4%) (Figure 1E). For AEp8, the optimal λ for LASSO was 0.00045 and the most predictive feature for W1R was Long Run High Gray-Level Emphasis (LRHGE, coefficient = -5.135). For EndR and IFFF, Long Zone Low Gray-Level Emphasis (LZLGE, λ = 0.00039, coefficient = 7.207) and Long Run Low Gray-Level Emphasis (LRLGE, λ = 0.00035, coefficient = -3.587) were the most predictive features, respectively.
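To make the run-length features named above more concrete, the sketch below computes LRHGE and LRLGE from a gray-level run-length matrix using their standard definitions, (1/N_runs)·Σ p(i,j)·i²·j² and (1/N_runs)·Σ p(i,j)·j²/i², where i is the gray level and j the run length. It is a simplification under stated assumptions: runs are counted in one horizontal direction on a 2D toy image with gray levels starting at 1, whereas radiomics toolkits typically aggregate several directions in 3D. LZLGE is defined analogously on a zone-size matrix and is not shown.

```python
def run_length_matrix(image):
    """Count horizontal runs: runs[(g, r)] = number of runs of gray level g with length r."""
    runs = {}
    for row in image:
        start = 0
        for k in range(1, len(row) + 1):
            # A run ends at the end of the row or when the gray level changes.
            if k == len(row) or row[k] != row[start]:
                key = (row[start], k - start)
                runs[key] = runs.get(key, 0) + 1
                start = k
    return runs


def lrhge(runs):
    """Long Run High Gray-Level Emphasis: weights long runs of bright voxels."""
    n_runs = sum(runs.values())
    return sum(c * g ** 2 * r ** 2 for (g, r), c in runs.items()) / n_runs


def lrlge(runs):
    """Long Run Low Gray-Level Emphasis: weights long runs of dark voxels."""
    n_runs = sum(runs.values())
    return sum(c * r ** 2 / g ** 2 for (g, r), c in runs.items()) / n_runs


# Hypothetical quantized patches (gray levels 1 and 2).
homogeneous = [[2, 2, 2, 2], [2, 2, 2, 2]]
heterogeneous = [[1, 2, 1, 2], [2, 1, 2, 1]]

runs_homogeneous = run_length_matrix(homogeneous)
runs_heterogeneous = run_length_matrix(heterogeneous)
print(lrhge(runs_homogeneous), lrhge(runs_heterogeneous))  # 64.0 2.5
print(lrlge(runs_homogeneous), lrlge(runs_heterogeneous))  # 4.0 0.625
```

Both features fall sharply when long uniform runs break up into short alternating ones, which is how they respond to texture heterogeneity.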

Association of W1R with 1-year OS
Patients with W1R had significantly higher survival (P < 0.001) (Figure 3). The patients with W1R had a median survival of 27.4 months, compared to 8 months in patients without W1R. This result suggested that achieving W1R was associated with better survival probability.

Discussion
Despite great therapeutic outcomes and advances in radiation techniques, the treatment of HCC using SBRT remains challenging. The benefit of SBRT is stronger for CP class A patients, who can receive higher radiation doses and more aggressive fractionation schedules with lower toxicity [ Nonetheless, SBRT is still considered to be of therapeutic benefit for patients with PVT [31]. In the 53 evaluable cases of PVT, EndR was achieved in 42 cases (79.2%), which represented a higher percentage than in patients who received sorafenib alone (7% based on mRECIST) in a comparable study [32].
After a median follow-up of 32.8 months, the 1-year LC in the overall cohort was 85.4%. This relatively lower LC could be explained by the varying definitions of LC across studies [1,22-26]. In our analysis, only those having at least a partial response were considered as achieving LC. In a similar cohort in Taiwan, the reported 1-year in-field control was 85.3% [33]. Even though this finding was quite similar to our result, their LC was defined as the absence of a new lesion or an increase in tumor size, which is less strict than our definition. Therefore, under an equivalent definition our result might be better, and it remains comparable to the literature.
Scorsetti et al. used a definition of LC similar to that in our study and found that higher LC contributed to higher OS in HCC patients after SBRT [28]. In support of this, patients with W1R were shown in our study to have significantly longer survival than those without W1R (Figure 3). Therefore, achieving LC seems to be of great clinical significance. In several previous reports, LC appears to be determined by tumor size [28,34]. However, as aggressive tumors tend to exhibit increased intratumoral heterogeneity, the sole use of tumor size in predicting LR appears insufficient and inappropriate [35-37]. In addition to histologic and genomic studies, radiomics is another, non-invasive approach that enables a spatiotemporal and quantitative measurement of tumoral heterogeneity [38]. In the current study, 26 out of 46 features were matrix-based and related to the quantitative description of size and intensity variations of connected sub-regions. The most predictive features selected by LASSO for the three LRs were LRHGE, LZLGE, and LRLGE, respectively.
These metrics quantify the heterogeneity in size and intensity within tumor volumes in CT images, which is in agreement with a previous study describing the significance of Gray-Level Uniformity for prognosis in HCC patients receiving radiotherapy [39]. These features are related to the spatial correlation with emphasis on gray levels, possibly capturing the intratumoral heterogeneity. However, the value of these features depends on the type of CT scan and the preprocessing methods. In our study, the features extracted from the A phase had higher mean F1 scores in both the SVM and LRG classifiers. This finding is in line with the fact that the development of HCC frequently involves neovascularization by unpaired arteries without associated portal tracts [40,41], and consequently that contrast enhancement in the A phase with early wash-out in the E phase is a widely accepted diagnostic criterion. Therefore, features in the A phase are believed to be much more informative, potentiating the prediction of LR. Unlike other studies [42,43], we used two quantization methods and identified that Ep quantization with 8 and 64 gray levels performed better on the A phase CT scans. The Ep quantization method defines decision thresholds in the tumor volume such that each gray level contains approximately the same number of voxels after quantization [16]. Ep quantization and 8 gray levels each showed higher cumulative F1 scores in the A phase in the SVM and LRG classifiers (Figure 1D). The SVM classifier based on AEp8 also exhibited the highest mean F1 score for the three LRs, suggesting the feasibility of these parameters for LR prediction.
Even though we have proposed a radiomics-based ML strategy for SBRT in HCC, some limitations of the current predictive models still need to be addressed prior to clinical application. First, the sample size of the initial dataset was small. Although we used an oversampling technique, the augmented data retained the intrinsic characteristics of the small number of original tumors, restricting its general utility. An alternative is to use image augmentation, generating a large number of tumor images for training. However, this strategy suffers from a similar limitation. Second, the tumors were not segmented automatically, so the uncertainty in the peripheral regions might be increased. Future development of automated segmentation of tumors from normal liver could help refine this procedure. Furthermore, we did not adopt image filtering such as Laplacian or Gaussian filters, which could have enhanced the reproducibility of feature extraction. Finally, since low specificity was consistently observed for SVM and LRG, IFFF seemed poorly defined and requires further elucidation. Once more patients are included with a clearly defined target response, we believe that the performance of our model will improve.

Conclusions
In conclusion, this is the first study to propose a radiomics-based predictive model for SBRT efficacy in patients with HCC. The findings warrant further studies in larger populations to confirm the feasibility of using our radiomics-based model in the clinic.

Kaplan-Meier plot for patients with or without 1-year response (W1R).