Study design
This study was designed as a multi-factor exploratory study to probe and verify the heterogeneity of MRI radiomics features in LGG and reveal the underlying prognostic values of the biological mechanisms and heterogeneity (Fig. 1).
Data retrieval and processing for this study
Data for 199 LGG patients with enhanced MRI results (T1W1) were downloaded from TCIA database (www.cancerimagingarchive.net). The genomic data of 515 LGG patients were downloaded from the TCGA database (portal.gdc.cancer.gov). Genomic data from 149 patients in TCGA were used to extract radiomic features and construct the model, and data from 412 patients were used to analyze the gene-based prognosis. The R package “survminer” [20] was used to divide patients into high- and low-expression groups using a cutoff value of 4.55 for CCND1 expression. We prepared a radiomics model of CCND1 and explored its prognostic value.
The exclusion criteria were samples without survival data, samples without clinical data, samples with survival time < one month, samples with poor image quality, and samples without enhanced T1W1 images. The inclusion criteria were as follows: samples of WHO grades II and III and samples with clinical and genomic data from TCGA. The detailed inclusion and exclusion criteria are shown in Table 1.
Table 1
Clinical characteristics of the population with high and low CCND1 expression group
Variables
|
Total
(n = 412)
|
Low
(n = 242)
|
High
(n = 170)
|
P value
|
Age, n (%)
|
|
|
|
1
|
~ 40
|
198 (48)
|
116 (48)
|
82 (48)
|
|
41~
|
214 (52)
|
126 (52)
|
88 (52)
|
|
Gender, n (%)
|
|
|
|
0.228
|
Female
|
183 (44)
|
101 (42)
|
82 (48)
|
|
Male
|
229 (56)
|
141 (58)
|
88 (52)
|
|
Histology, n (%)
|
|
|
|
0.684
|
Oligodendroglioma
|
158 (38)
|
93 (38)
|
65 (38)
|
|
Astrocytoma
|
154 (37)
|
87 (36)
|
67 (39)
|
|
Oligoastrocytoma
|
100 (24)
|
62 (26)
|
38 (22)
|
|
Grade, n (%)
|
|
|
|
< 0.001
|
WHO II
|
194 (47)
|
134 (55)
|
60 (35)
|
|
WHO III
|
218 (53)
|
108 (45)
|
110 (65)
|
|
IDH_status, n (%)
|
|
|
|
0.057
|
WT
|
80 (19)
|
55 (23)
|
25 (15)
|
|
Mutant
|
332 (81)
|
187 (77)
|
145 (85)
|
|
Chr_1p_19q_codeletion, n (%)
|
|
|
|
0.65
|
Non-Codel
|
273 (66)
|
163 (67)
|
110 (65)
|
|
Codel
|
139 (34)
|
79 (33)
|
60 (35)
|
|
MGMT_promoter_status, n (%)
|
|
|
|
0.024
|
Unmethylated
|
73 (18)
|
52 (21)
|
21 (12)
|
|
Methylated
|
339 (82)
|
190 (79)
|
149 (88)
|
|
Chemotherapy, n (%)
|
|
|
|
0.001
|
NO
|
173 (42)
|
118 (49)
|
55 (32)
|
|
YES
|
239 (58)
|
124 (51)
|
115 (68)
|
|
Radiotherapy, n (%)
|
|
|
|
0.24
|
NO
|
185 (45)
|
115 (48)
|
70 (41)
|
|
YES
|
227 (55)
|
127 (52)
|
100 (59)
|
|
From UCSC XENA (xenabrowser.net/datapages), the obtained GBMLGG (glioma LGG) RNAseq data in fragments per kilobase of transcript per million mapped fragments format were TCGA and GTEx (gtexportal.org/home) data processed by Using Toil flow unified [21]. To compare the samples, the data were transformed to transcripts per million reads format via log2 and demonstrated in a boxplot using the R package “ggplot2” [22].
Survival analysis
Every variate was subjected to survival analysis using the R packages “survival” [23] and “survminer” [20]. Using Kaplan–Meier (KM) survival curve analysis and the log-rank test, we showed the changes and significance of survival rates among the different groups. A survival rate of 50% was defined as the median survival time.
Landmark analysis plotted the KM curve in time periods, comparing the 12, 18, ..., 78, and 84 months defined as “early stage” from diagnosis to time point, and “late stage” from time point to end of follow-up. In the KM curve analysis via landmarks, the abscissa is the survival time, and the ordinate is the mortality risk.
Univariate landmark analysis
Setting 12, 18, …, 78, and 84 months as time nodes after diagnosis of LGG, we used the R packages “jskm” (CRAN.R-project.org/package = jskm)and “survival” [23] to plot a KM curve via landmark analysis. In addition, “early grade” was defined from diagnosis to time nodes, and “advanced grade” was defined from time nodes to the end of follow-up.
Univariate and multivariable COX analysis
Cox proportional hazards regression can be used to explore the relationship between one factor or more factors and survival outcomes. We used univariate Cox regression to perform an association analysis to explore the factors influencing overall survival (OS). Multivariate Cox regression was used to explore whether one independent factor or various factors would influence OS. To identify significant factors, all meaningful variables identified using univariate Cox regression analysis [hazard ratio (HR) < 1] were subjected to multivariate analysis. We used the R packages “survival” [23] and “forestplot” (CRAN.R-project.org/package = forestplot) for these analyses.
Subgroup analysis and interaction test
Based on univariate Cox regression, we used exploratory subgroup analysis to analyze the impact of the main variable (high vs. low CCND1 expression) on patient prognosis in every covariate subgroup. Furthermore, we analyzed the interactions between CCND1 and other covariates using the log-likelihood ratio test. We used the “cmprsk,” “survival,” and “forestplot” R packages for these analyses.
Correlation analysis
Based on Spearman’s rank correlation coefficient, we explored the correlation between CCND1 and clinical characteristics and presented the results using a correlation heat map.
Analysis of the correlation between CCND1 expression and immune cell infiltration
The gene expression matrix of the LGG samples uploaded to the ImmuCellAI database (bioinfo.life.hust.edu.cn/ImmuCellAI#!) was used to measure immune cell infiltration in each sample [24]. Based on Spearman’s rank correlation coefficient, we explored the correlation between CCND1 expression and immune cell infiltration and presented the results using a correlation heat map.
Gene Set Variation Analysis (GSVA) between low and high CCND1 expression
GSVA is a non-parametric, unsupervised analysis method to evaluate gene set enrichment in microarrays and transcriptomes [25]. Based on the KEGG pathways and hallmark gene sets, GSVA was used to calculate the pathway enrichment score for each of the 412 LGG samples from TCGA-LGG. The R package “limma” [26; 27] was used for differential analysis of low and high CCND1 expression.
Screening of radiomics features
An open-source software (“pyradiomics” python package) [28] containing 107 radiomics factors (RFs) was used to analyze size, shape, intensity, morphology, and texture. Based on the CCND1 threshold value (4.554), the samples were divided into low- and high-expression groups. By intersecting TCGA transcriptome data and TCIA radiomics data, 149 TCGA-TCIA intersection samples were standardized via “pyradiomics.” Then, using the R packages “caret” and “CBCgrps” [29], we divided the data into the training and validation sets randomly at a ratio of 6:4 and analyzed the difference between them via z-score standardization (i.e., mean 0 and standard deviation 1).
The maximum relevance, minimum redundancy (mRMR) screening feature considers the correlations between the features and the predicted variables, as well as the correlation between the features themselves. The metric used was mutual information. For the mRMR method, the correlation between feature subsets and categories was calculated as the mean of the information gain of each feature and category, while the redundancy among features was calculated as the sum of the mutual information among features divided by the square of the number of features in the subset.
Assuming that the dataset contains m features, based on the mRMR formula, the importance of the NTH feature can be expressed as follows:
$${f}^{mRMR}\left({X}_{I}\right)=I(Y, {X}_{i})\frac{1}{\left|S\right|}{\sum }_{{X}_{S}\in S}I({X}_{S}, {X}_{i})$$
\(I\left(Y, {X}_{i}\right)\) is the mutual information of \({X}_{i}\) and target variable Y. \(\frac{1}{\left|S\right|}{\sum }_{{X}_{S}\in S}I({X}_{S}, {X}_{i})\)is average of mutual information between variable \({X}_{i}\) and all the variable in subset of existing variables.
Relief (relevant features), a well-known filtration feature selection method, is a feature-weighting algorithm that ensures different weights for each feature according to its relevance and class. Features are removed when the weight is below a certain threshold. In the relief algorithm, the correlation between features and categories is based on the strength of the ability to distinguish close samples based on features. The larger the weight of the feature, the stronger its classification ability; conversely, a lower weight indicates a weaker classification ability.
Construction of the radiomics model
The logistic regression (LR) model, based on variations in linear regression, is a generalized regression algorithm widely used in classification. LR renders the output value of model distribution between 0 and 1 using the sigmoid function to transform the linear regression using the formula:
\(\text{g}\left(\text{z}\right)=\text{y}\left(\text{x}\right)=1/(1+{\text{e}}^{{x}^{}}-omega^T\) )
The R package ‘stats’ was used to fit the screened radiomic features using the LR algorithm to construct a model for predicting gene expression.
Support vector machines (SVMs) with a Gaussian kernel were employed in the reduced feature space as the classifiers of choice to discriminate the two cohort groups. The rationale for adopting SVMs as classifiers is primarily because of their ability to generate classification hyperplanes, so the margins between the hyperplane and the nearest instances of the classified sample categories are maximized. The R packages “stats” and “caret” were respectively used to fit the screened radiomics features with the LR algorithm and SVM to construct a model for predicting gene expression.
The radiomics model was evaluated using training and validation sets. The evaluation indexes were accuracy, specificity, sensitivity, positive predictive value, and negative predictive value. The X-axis of the receiver operating characteristic curve was the false positive rate, whereas the Y-axis was the true positive rate. A larger area under the curve (AUC) and a more convex curve to the upper-left corner indicated a better model effect. Calibration of the radiomics prediction model was evaluated by drawing the calibration curve and performing the Hosmer–Lemeshow goodness-of-fit test. The Brier score was used to quantify the comprehensive performance of the radiomic prediction model. A Brier score indicated a more consistent model prediction. The degree of clinical benefit of the radiomic prediction model was determined by drawing a decision curve (DCA). The DeLong test was used to compare the AUC values of the training and validation sets to determine the condition of fit.
The radiomics model score was used to predict the probability of gene expression. A Wilcoxon test was used to compare the radiomics scores (RSs) between the high and low CCND1 expression groups.
Intraclass correlation coefficient (ICC)
ICC was used to evaluate the consistency of radiomic features extracted from volume of interest delineated by two radiologists [30]. First, one radiologist delineated all patient images. Another radiologist delineated 20 randomly selected samples and extracted the radiomics features. ICC ≥ 0.80 is typically considered an optimal agreement, 0.51–0.79 is considered moderate agreement, and < 0.50 was considered poor agreement [30]. The correlation between CCND1 and immune-related genes was analyzed based on SVM model prediction.
Statistical analysis
All statistical analyses were conducted using R version 4.0.0 (https://www.r-project.org). P < 0.05 was considered statistically significant.