Study population and CT acquisition
Twenty-nine consecutive SPCH cases with a tumor size > 5 mm were considered in this study. These SPCH cases underwent surgical tumor resection by a single surgical team, using the same clinical protocols and perioperative orders, at the National Taiwan University Hospital between January 2013 and December 2017. Of these 29 cases, 9 with other lung nodules in the same lobe were excluded from this study. Another 7 cases without preoperative thin-cut CT images were also excluded. Finally, 13 cases were enrolled in the SPCH group for further analysis (Figure 2). For patient selection in the LPA group, we retrospectively evaluated 3327 consecutive patients who underwent thoracoscopic surgery for lung cancer by the same surgical team at our institute between January 2013 and December 2018. The inclusion criteria for the LPA group were as follows: 1) a diagnosis of lung adenocarcinoma with a pathologically confirmed near-pure (≥70%) lepidic-predominant histological subtype and 2) the availability of preoperative thin-cut CT images. Finally, 49 cases were enrolled in the LPA group for further analysis (Figure 2). All pathologic slides of the enrolled patients were reviewed according to the 2015 World Health Organization criteria [18]. The characteristics of the 13 SPCH patients and 49 LPA patients investigated in our study are summarized in Table 1. The SPCH diagnosis and the percentage of the lepidic subtype were confirmed microscopically by an experienced thoracic pathologist (MSH). The Research Ethics Committee at the National Taiwan University Hospital reviewed and approved this study (project approval no. 202003074RIND; approval date, April 14, 2020) and waived informed consent for this retrospective study.
Image acquisition
Chest CT images were obtained with a 16-, 64-, 128-, or 256-detector row CT scanner from the following manufacturers: GE (LightSpeed 16, LightSpeed VCT, Revolution CT, and Revolution RVO), Siemens (Emotion 16, Sensation 64, and SOMATOM Definition AS+), Philips (iCT 256 and Ingenuity CT), and Canon (Aquilion ONE) Medical Systems. The CT image parameters were as follows: detector collimation, 0.5–0.625 mm; pitch, 0.813–1.2; gantry speed, 0.35 or 0.5 s per rotation; 120 kVp; 41–330 mA; slice thickness, 1.0–1.25 mm; and matrix, 512 × 512.
SPCH/LPA classification model
The SPCH/LPA classification model was based on a divide-and-conquer radiomic analysis. The core idea was to untangle the intertwined radiomic distributions of SPCH and LPA using two different sets of radiomic features. The first set partitioned the SPCH/LPA dataset into two subsets: one with a high confidence of being LPA, and the other a mixture of SPCH and LPA to be further classified using the second set of radiomic features. The rationale was to decompose the LPA samples into two subgroups, each of which was expected to be more homogeneous than its parent group. This reduced the originally imbalanced and intertwined classification problem to a relatively balanced problem with a more homogeneous subset of LPA, opening up an opportunity for better discrimination between SPCH and LPA.
The schema of this study is depicted in Figure 3. The SPCH- or LPA-containing volumes of interest were first extracted from 3D thoracic CT images followed by segmentation processes demarcating the lesion boundaries. Texture features were extracted from histograms and the GLCM [19] of the lesions with a divide-and-conquer paradigm. The performance of the proposed SPCH/LPA classification model was assessed using a leave-one-out cross-validation method.
Tumor segmentation
Segmentation was a key step in extracting radiographic features because the features were computed from the lesion volumes enclosed by the derived lesion boundaries. However, segmentation was challenging for SPCH and LPA because of the weak lesion boundaries produced by their GGN-like appearance, as well as the complex structural composition of blood vessels, bronchial tubes, pleural indentations, and other structures. To better delineate the lesion boundaries of SPCH and LPA, a semiautomatic segmentation algorithm based on the hybrid level-set algorithm proposed by Zhang et al. [20] was developed in the present study. The key idea was to find a lesion boundary that maximized the overall edge gradient strength while ensuring regional uniformity within the boundary. To further account for the complex composition of SPCH and LPA, especially along the lesion boundaries, the computer-generated lesion boundaries were examined and, if necessary, modified manually by two chest radiologists (YC Chen and YC Chang) to reach consensus segmentation results.
Feature extraction
To characterize SPCH and LPA, two types of radiomic features were extracted from the segmented lesions, namely the histogram features and the 3D spatial texture features. The histogram features included skewness, kurtosis, 75th percentile, 97.5th percentile, and uniformity [21]. The 3D spatial texture features were composed of 21 features derived from the GLCM of each lesion as listed in Table 2. The histogram features characterized the gray-level distribution of a lesion, whereas the 3D spatial texture features described the spatial distribution of the gray levels within a lesion. More precisely, the 21 GLCM-based texture features modeled the gray-level co-occurrence characteristics of all horizontally adjacent voxels in a lesion.
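As an illustration of the co-occurrence idea described above, the following minimal Python sketch builds a GLCM from all horizontally adjacent voxel pairs inside a lesion mask and derives one feature from it. It is not the implementation used in the study. Several details are assumptions made only for this example: the gray levels are taken to be already quantized to integers in range(levels), "horizontal" is taken to mean the x axis, the matrix is made symmetric and normalized to sum to 1, contrast is used as a representative GLCM feature that may or may not be among the 21 listed in Table 2, and histogram uniformity is taken to be the sum of squared gray-level probabilities. The function names are hypothetical.

```python
from collections import Counter


def glcm_horizontal(volume, mask, levels):
    """Symmetric, normalized GLCM over horizontally adjacent voxel pairs.

    volume[z][y][x] holds quantized gray levels in range(levels) (assumption);
    mask[z][y][x] is truthy for voxels inside the segmented lesion.
    """
    counts = [[0] * levels for _ in range(levels)]
    for z, plane in enumerate(volume):
        for y, row in enumerate(plane):
            for x in range(len(row) - 1):
                # Only count pairs where both voxels lie inside the lesion.
                if mask[z][y][x] and mask[z][y][x + 1]:
                    a, b = row[x], row[x + 1]
                    counts[a][b] += 1
                    counts[b][a] += 1  # count both directions (symmetric GLCM)
    total = sum(map(sum, counts))
    if total == 0:
        raise ValueError("no horizontally adjacent voxel pairs inside the mask")
    return [[c / total for c in row] for row in counts]


def glcm_contrast(p):
    """Contrast: sum over (i, j) of (i - j)^2 * p[i][j]."""
    return sum(((i - j) ** 2) * v for i, row in enumerate(p) for j, v in enumerate(row))


def histogram_uniformity(values, levels):
    """Uniformity (assumed definition): sum of squared gray-level probabilities."""
    n = len(values)
    hist = Counter(values)
    return sum((hist[g] / n) ** 2 for g in range(levels))


# Toy 1 x 2 x 3 "lesion" with two gray levels, entirely inside the mask.
volume = [[[0, 0, 1], [1, 1, 0]]]
mask = [[[1, 1, 1], [1, 1, 1]]]
p = glcm_horizontal(volume, mask, levels=2)
contrast = glcm_contrast(p)
uniformity = histogram_uniformity([v for pl in volume for r in pl for v in r], levels=2)
```

Restricting the pair count to voxels that both fall inside the mask keeps surrounding parenchyma, vessels, and pleura from contributing to the lesion texture.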
Feature selection and classification model building
Based on the key idea of a divide-and-conquer radiomic analysis, the SPCH/LPA classification model, comprising a two-level decision tree with two SVMs, was employed to differentiate SPCH from LPA. The first level of the decision tree, i.e., the root node, was composed of an SVM followed by a step function, u(Ps1 - 0.5), where Ps1 is the probability of a nodule being an SPCH as estimated by the SVM, with u(x) = 1 for x ≥ 0 and u(x) = 0 for x < 0. Denoting the output of the root node for a nodule by P1, we have P1 = u(Ps1 - 0.5). The SVM of the root node was constructed from the training data, with each training datum represented by its projections onto the first two principal components of the training data points in the feature space. The feature space was formed by the 26 features of all training data, including 5 histogram-based and 21 GLCM-based features. The first two principal components were the eigenvectors corresponding to the two largest eigenvalues derived via principal component analysis [22]. The SVM of the root node maximized the positive predictive value subject to the constraint of 100% sensitivity. A true positive was defined as an SPCH that was correctly classified as such.
Because the SVM of the root node was trained to have a sensitivity of 100%, the second level of the decision tree consisted of only one leaf node, which further classified the positive outcomes of the root node into SPCH or LPA. The classifier of the leaf node was also an SVM, but it was not followed by a step function. The SVM of the leaf node was built using the 26 radiomic features. To avoid overfitting the classification model, a subset of features that best differentiated SPCH from LPA was selected by an SFFS algorithm [23] using the training dataset. While there is no unique rule for determining the maximum number of features for n training data, Jain et al. [24] suggested that the number of features used for a two-class classification problem should be less than n/10. Considering the reduced number of training data at the leaf node, the number of features selected to train the SVM was set to 2 in this study. The SVM of the leaf node yielded the probability of a nodule being an SPCH, denoted by P2. As a result, the probability of a test datum being classified as an SPCH was P = P1 × P2. It should be noted that for a test datum with P1 = 0, the probability of being an SPCH was set to 0, i.e., P = 0.
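The way the two levels combine can be written out in a few lines of Python. This is a sketch of the decision rule only, following the definitions of u(x), P1, P2, and P given above. The function names are hypothetical, and passing the leaf-node output as a callable is a choice made for this example so that the leaf classifier is consulted only for root-positive cases.

```python
def step(x):
    """Unit step u(x): 1 for x >= 0, 0 for x < 0."""
    return 1 if x >= 0 else 0


def spch_probability(ps1, leaf_probability):
    """Combine root and leaf outputs into P = P1 * P2.

    ps1: root-node SVM probability of the nodule being an SPCH.
    leaf_probability: zero-argument callable returning P2 from the leaf-node
        SVM (hypothetical interface); it is evaluated only when P1 = 1.
    """
    p1 = step(ps1 - 0.5)
    if p1 == 0:
        return 0.0  # root node assigns the case to the high-confidence LPA subset
    return p1 * leaf_probability()


# A root-positive case keeps the leaf probability; a root-negative case gets 0.
p_positive = spch_probability(0.50, lambda: 0.8)
p_negative = spch_probability(0.49, lambda: 0.8)
```

Because P1 is binary, P equals either 0 or P2, so every case rejected at the root node shares the same score of 0.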
Performance assessment
To evaluate the performance of the proposed classification model, a leave-one-out cross-validation approach [25] was used in the present study to estimate the model's ability to predict new data that were not involved in model construction. With a total of 62 cases, the leave-one-out cross-validation comprised 62 folds, in which each case served in turn as the test datum while the remaining 61 cases served as the training data. In each fold, the training data were used not only to derive the two principal components for the root node and to select the features for the leaf node but also to construct the two SVMs of the two-level decision tree.
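The fold structure can be sketched as follows. This is a minimal standard-library illustration, not the code used in the study. The fit and predict arguments are hypothetical placeholders: fit stands for every training-dependent step (principal component derivation, feature selection, and SVM construction), so that nothing learned from the held-out case enters the model.

```python
def leave_one_out(n_cases):
    """Yield (train_indices, test_index) for each of the n_cases folds."""
    for test_idx in range(n_cases):
        train_idx = [i for i in range(n_cases) if i != test_idx]
        yield train_idx, test_idx


def cross_validate(cases, labels, fit, predict):
    """Return one held-out prediction per case.

    fit(train_cases, train_labels) -> model   (hypothetical placeholder)
    predict(model, test_case) -> probability  (hypothetical placeholder)
    """
    probabilities = []
    for train_idx, test_idx in leave_one_out(len(cases)):
        model = fit([cases[i] for i in train_idx], [labels[i] for i in train_idx])
        probabilities.append(predict(model, cases[test_idx]))
    return probabilities


# With 62 cases there are 62 folds of 61 training cases each.
folds = list(leave_one_out(62))

# Toy usage: the "model" is simply the SPCH prevalence in the training fold.
toy_probs = cross_validate(
    cases=[0, 1, 2, 3],
    labels=[1, 0, 0, 0],
    fit=lambda xs, ys: sum(ys) / len(ys),
    predict=lambda model, x: model,
)
```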
To demonstrate the advantage of the divide-and-conquer radiomic analysis, a baseline classification model was implemented to separate SPCH from LPA using a single set of radiomic features. The baseline model used an SVM as the classifier with the same set of 26 radiomic features as the SPCH/LPA classification model. The performance of the baseline model was also evaluated by the leave-one-out cross-validation method. The SFFS algorithm was used to select the features in each fold using the training data. Following the rule suggested by Jain et al. [24], with n = 61 training cases per fold and hence n/10 = 6.1, the number of features selected in each fold was no more than 6.
Statistical analysis
To investigate the differentiation capability of each radiomic feature, an independent two-sample t-test was conducted for each of the 26 features. Levene's test was performed prior to the t-test to assess the homogeneity of variance of each radiomic feature, with the null hypothesis of equal population variances. The significance levels of the independent two-sample t-test and Levene's test were both set to 0.05. If the p-value of Levene's test was less than the significance level, the null hypothesis was rejected and the two groups, i.e., SPCH and LPA, were considered to have unequal variances for the tested radiomic feature. Otherwise, the group variances of SPCH and LPA were considered equal.
ROC curve analyses were carried out to assess the leave-one-out cross-validation performances of the proposed SPCH/LPA classification model and the baseline model using the probability of being an SPCH, i.e., P.
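For reference, the area under the ROC curve can be computed directly from the held-out probabilities. The sketch below is an assumption about how such a value could be obtained and is not a description of the software used in the study. It uses the pairwise (Mann-Whitney) formulation, in which a tie between an SPCH score and an LPA score counts as one half. Ties matter here because every case rejected at the root node receives P = 0. The function name and the example scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (SPCH, LPA) pairs ranked correctly; ties count 0.5.

    labels: 1 for SPCH (positive class), 0 for LPA.
    scores: predicted probability of being an SPCH for each case.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical scores: two SPCH cases and three LPA cases, two with P = 0.
auc = roc_auc([1, 1, 0, 0, 0], [0.9, 0.4, 0.0, 0.0, 0.6])
```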