2.1 Dataset collection and study population
The research design and workflow of this study are shown in Fig. 1.
Single-cell RNA sequencing (scRNA-seq) data and the corresponding clinical information were downloaded from the Chinese Glioma Genome Atlas (CGGA) website (Yu, Hu et al. 2020). Four additional datasets with mRNA expression data and corresponding clinical information were used for model development and validation. These datasets were acquired from three databases: The Cancer Genome Atlas (TCGA), Gene Expression Omnibus (GEO), and CGGA.
The TCGA-LGG cohort was downloaded from the GDC Data Portal. A total of 496 patients met the inclusion criteria: primary tumor, WHO grade II or III, matched clinical and genetic information, and complete overall survival (OS) outcomes. The same inclusion criteria were applied to all datasets used in this study. The GDC Data Portal was also used to download somatic mutation data in mutation annotation format (MAF) for TCGA-LGG patients for gene mutation analysis.
The GSE16011 microarray dataset was downloaded from GEO; after collation, the required data were available for 106 patients (Gravendeel, Kouwenhoven et al. 2009).
Two bulk RNA-sequencing datasets with clinical data (mRNAseq_693 and mRNAseq_325) were acquired from the CGGA database (Yan, Zhou et al. 2021). After batch effects were removed with the “ComBat” function in the “sva” R package (version 3.44.0), the two CGGA datasets were merged into a single dataset (the CGGA dataset) (Johnson, Li et al. 2007). Information for 406 patients was available after collation.
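As a minimal illustration of this merging step (the object names `expr_693` and `expr_325` are our assumptions for the gene-by-sample expression matrices of the two cohorts, not names from the original analysis):

```r
library(sva)  # provides ComBat

# Restrict both cohorts to their shared genes before merging
common_genes <- intersect(rownames(expr_693), rownames(expr_325))
merged <- cbind(expr_693[common_genes, ], expr_325[common_genes, ])

# Batch label: the source cohort of each sample
batch <- c(rep("mRNAseq_693", ncol(expr_693)),
           rep("mRNAseq_325", ncol(expr_325)))

# Batch-corrected matrix = the combined CGGA dataset
cgga_expr <- ComBat(dat = as.matrix(merged), batch = batch)
```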
2.2 Bioinformatics analysis
Only four patients in the CGGA single-cell sequencing dataset (S7, S8, S9 and S14) had lower-grade gliomas; all four were retained to screen for TAM marker genes in LGG (Yu, Hu et al. 2020, Yan, Zhou et al. 2021). Single-cell sequencing data were analyzed with the R package “Seurat” (version 4.3.0) (Yan, Zhou et al. 2021). For quality control, cells expressing between 200 and 2,500 genes and with less than 5% mitochondrial gene expression were retained. The “vst” method was used to identify the top 2,000 highly variable genes as variable features. Principal component analysis (PCA) was performed on the highly variable genes to reduce data dimensionality, and the optimal principal components were used for t-SNE analysis. TAM clusters were annotated based on the source literature of the dataset (Yu, Hu et al. 2020), and differentially expressed genes (DEGs) between the TAM clusters and the other clusters were computed. DEGs were defined by an adjusted p value < 0.05 and |log2FC| > 1.5.
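A sketch of this Seurat workflow under the stated thresholds; the object names, the choice of 15 principal components, and the single “TAM” identity passed to `FindMarkers` are illustrative assumptions rather than the authors' exact code:

```r
library(Seurat)

# `counts`: CGGA scRNA-seq count matrix for patients S7, S8, S9 and S14 (assumed)
sc <- CreateSeuratObject(counts = counts)
sc[["percent.mt"]] <- PercentageFeatureSet(sc, pattern = "^MT-")

# QC: keep cells expressing 200-2,500 genes with < 5% mitochondrial expression
sc <- subset(sc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

sc <- NormalizeData(sc)
sc <- FindVariableFeatures(sc, selection.method = "vst", nfeatures = 2000)
sc <- ScaleData(sc)
sc <- RunPCA(sc, features = VariableFeatures(sc))
sc <- RunTSNE(sc, dims = 1:15)   # 15 PCs is an illustrative choice

# Clusters annotated from the source literature (assumed vector of labels)
Idents(sc) <- cluster_annotations

# DEGs between the TAM clusters and all other cells
deg <- FindMarkers(sc, ident.1 = "TAM", logfc.threshold = 0)
deg <- deg[deg$p_val_adj < 0.05 & abs(deg$avg_log2FC) > 1.5, ]
```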
We screened for m7G variant-associated genes common to the four sources (TCGA, CGGA, GSE16011 and RMVar). Spearman rank correlation coefficients were then calculated between the two gene sets, i.e., the TAM DEGs and the m7G variant-associated genes. The final genes were obtained by filtering at an absolute rank correlation coefficient > 0.6 and an adjusted p < 0.05.
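The correlation filter could be sketched as follows; `expr` (a gene-by-sample matrix), `deg_genes`, and `m7g_genes` are assumed inputs, and the Benjamini-Hochberg adjustment is our assumption since the adjustment method is not stated:

```r
# All DEG x m7G gene pairs
pairs <- expand.grid(deg = deg_genes, m7g = m7g_genes, stringsAsFactors = FALSE)
pairs$rho <- NA_real_
pairs$p   <- NA_real_

for (i in seq_len(nrow(pairs))) {
  ct <- cor.test(expr[pairs$deg[i], ], expr[pairs$m7g[i], ], method = "spearman")
  pairs$rho[i] <- unname(ct$estimate)
  pairs$p[i]   <- ct$p.value
}
pairs$p_adj <- p.adjust(pairs$p, method = "BH")  # adjustment method assumed

# Filtering criteria: |rho| > 0.6 and adjusted p < 0.05
final_genes <- unique(pairs$deg[abs(pairs$rho) > 0.6 & pairs$p_adj < 0.05])
```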
The R package “maftools” (version 2.12.0) was used to analyze the TCGA somatic mutation data (Mayakonda and Koeffler 2016). ESTIMATE was used to infer the presence of stromal and immune cells in tumor tissue (Yoshihara, Shahmoradgoli et al. 2013), while CIBERSORT was used to estimate the proportions of 22 immune cell types in bulk tumors on the basis of gene expression (Newman, Liu et al. 2015).
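For the mutation analysis, a minimal maftools sketch (the MAF file path is illustrative):

```r
library(maftools)

# Load the TCGA-LGG somatic mutations (path assumed)
lgg_maf <- read.maf(maf = "TCGA-LGG.maf.gz")

# Summary figures and an oncoplot of the most frequently mutated genes
plotmafSummary(maf = lgg_maf)
oncoplot(maf = lgg_maf, top = 20)
```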
2.3 Statistical analysis
The TCGA and CGGA RNA-sequencing data were converted to log2(TPM + 1) values before analysis. The microarray data (GSE16011) were processed with the Robust Multichip Average (RMA) method.
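These transformations are straightforward in R; `tpm_tcga` and the local CEL-file directory are assumed inputs:

```r
# RNA-seq: log2(TPM + 1)
expr_tcga <- log2(tpm_tcga + 1)

# Microarray: RMA normalization of the GSE16011 CEL files (affy package)
library(affy)
gse <- ReadAffy(celfile.path = "GSE16011_CEL/")  # path assumed
expr_gse <- exprs(rma(gse))                      # RMA outputs log2-scale values
```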
The TCGA-LGG cohort (n = 496) served as the derivation set; the GSE16011 dataset (n = 106) and the CGGA dataset (n = 406) served as validation sets for model 1, and the CGGA dataset (n = 408) served as the validation set for model 2. Before the models were fitted and evaluated, the “ComBat” function was applied uniformly across all datasets to remove batch effects and reduce bias (Leek, Scharpf et al. 2010, Tang, Ji et al. 2021).
Patient characteristics are presented as medians and interquartile ranges (IQR) for continuous variables and as counts and percentages for categorical variables. Overall survival is presented with Kaplan-Meier curves. Median follow-up time was calculated with the reverse Kaplan-Meier method (Mathoulin-Pelissier, Gourgou-Bourgade et al. 2008). Model predictions used OS as the event outcome. Events per variable (EPV) was used to assess sample size adequacy (Moons, Altman et al. 2015). Before computing correlation coefficients, we assessed whether the data conformed to a bivariate normal distribution; when they did not, the Spearman rank correlation method was used. p < 0.05 was the threshold for statistical significance unless otherwise stated. R software (version 4.2.1) was used for all analyses.
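A sketch of the reverse Kaplan-Meier calculation, assuming a clinical data frame `clin` with columns `os_time` and `os_status` (1 = death; names are our assumptions):

```r
library(survival)

# Reverse Kaplan-Meier: flip the event indicator so censoring becomes the "event";
# the median of this curve is the median follow-up time
fit_fu <- survfit(Surv(os_time, 1 - os_status) ~ 1, data = clin)
print(fit_fu)

# Ordinary Kaplan-Meier curve for overall survival
fit_os <- survfit(Surv(os_time, os_status) ~ 1, data = clin)
```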
2.4 Model development and validation
Two models were developed: a genomic model including only mRNA predictors (model 1) and an extension of model 1 with clinical variables (model 2). The genomic data had no missing values, so imputation was not required for them. The derivation and validation sets had missing values for some clinical variables. These were imputed by multiple imputation with the R package “mice” (version 3.14.0); the number of imputations (m = 10) was chosen according to the percentage of missing values in the derivation set (Murad, Dankner et al. 2020). The imputation model included 9 candidate predictors (the prognostic index of model 1, age, sex, grade, radiation therapy, pharmaceutical therapy, IDH mutation status, 1p/19q codeletion status, and MGMT methylation status) and two outcome indicators (the Nelson-Aalen estimator of the baseline cumulative hazard and the outcome events) (Moons, Donders et al. 2006, White and Royston 2009).
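A sketch of the imputation step; the column names are our assumptions, and the Nelson-Aalen estimator is added with mice's `nelsonaalen` helper:

```r
library(mice)

# Outcome indicators: Nelson-Aalen cumulative hazard estimate plus the event itself
clin$na_hazard <- nelsonaalen(clin, timevar = os_time, statusvar = os_status)

vars <- c("pi_model1", "age", "sex", "grade", "radiotherapy", "pharmacotherapy",
          "IDH", "codel_1p19q", "MGMT", "na_hazard", "os_status")
imp <- mice(clin[, vars], m = 10, seed = 1)  # m = 10 imputations

# Completed datasets, with the (fully observed) survival time re-attached
imputed_list <- lapply(1:10, function(i) {
  d <- complete(imp, action = i)
  d$os_time <- clin$os_time
  d
})
```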
Variable selection among candidate genes was performed with the least absolute shrinkage and selection operator (LASSO) in the “glmnet” R package (Goeman 2010). Ten-fold cross-validation was used to select the optimal value of λ, and the corresponding variables with non-zero coefficients were retained. For the imputed data, variable selection was performed in the stacked data, formed by stacking the multiply imputed datasets on top of each other. The optimal penalty (λ) was determined for each imputed dataset, and the mean λ was used as the final penalty for LASSO variable selection in the stacked data (Steyerberg 2009). A weighted Cox regression (each patient weighted 1/10) was fitted to the stacked data to estimate predictor coefficients (Wood, White et al. 2008). Coefficients can be estimated well from stacked data, but this approach can underestimate standard errors (Steyerberg 2009). Therefore, model performance was assessed by analyzing the 10 imputed datasets in parallel and combining the results with Rubin's rules (Rubin 2004, Marshall, Altman et al. 2009, Wang, Yuan et al. 2021).
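A sketch of the stacked-data path, using `imputed_list` from above; `predictor_cols` names the candidate predictors (assumed), and recent glmnet versions accept `Surv` responses for `family = "cox"`:

```r
library(glmnet)
library(survival)

# Mean optimal lambda across the 10 imputed datasets (10-fold CV each)
lambdas <- sapply(imputed_list, function(d) {
  x <- as.matrix(d[, predictor_cols])
  cv.glmnet(x, Surv(d$os_time, d$os_status), family = "cox",
            alpha = 1, nfolds = 10)$lambda.min
})

# LASSO in the stacked data at the mean lambda; keep non-zero coefficients
stacked <- do.call(rbind, imputed_list)
fit_lasso <- glmnet(as.matrix(stacked[, predictor_cols]),
                    Surv(stacked$os_time, stacked$os_status),
                    family = "cox", alpha = 1, lambda = mean(lambdas))
beta <- as.matrix(coef(fit_lasso))
selected <- rownames(beta)[beta != 0]

# Weighted Cox regression on the stacked data (each patient weighted 1/10)
fml <- as.formula(paste("Surv(os_time, os_status) ~",
                        paste(selected, collapse = " + ")))
fit_cox <- coxph(fml, data = stacked, weights = rep(1/10, nrow(stacked)))
```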
Coefficients for the selected variables were obtained by fitting multivariable Cox regression models. The prognostic index (PI), also referred to as the linear predictor, was extracted from the fitted Cox model; it is the weighted sum of the model's variables, with the regression coefficients as weights (Royston and Altman 2013).
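Extracting the PI from a fitted coxph model is a one-liner (note that the survival package centers the linear predictor at the sample means by default):

```r
# PI = linear predictor = sum of coefficient * predictor value
pi_derivation <- predict(fit_cox, type = "lp")

# The same coefficients applied to new patients
pi_validation <- predict(fit_cox, newdata = validation_data, type = "lp")
```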
Overall performance of the prognostic model was assessed by discrimination and calibration (Steyerberg, Vickers et al. 2010), each evaluated separately in the derivation and validation sets. Discrimination, the ability to separate high- and low-risk patients, was measured with Harrell's overall concordance (C) statistic and time-dependent ROC curves (Harrell Jr, Lee et al. 1996, Steyerberg, Vickers et al. 2010, Wang, Yuan et al. 2021). C-statistic values range from 0.5 (no predictive discrimination) to 1.0 (perfect discrimination) (Wang, Yuan et al. 2021). The C-statistic was calculated with the “validate” function in the R package “rms” (version 4.3.0) (Harrell 2001). Time-dependent ROC curves at 1, 3, and 5 years were created with the R package “survivalROC” (version 1.0.3) (Heagerty, Lumley et al. 2000). Time-dependent C-statistics of the models were visualized with the “cindex” function in the R package “pec” (version 2022.05.04) (Gerds, Kattan et al. 2013).
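The time-dependent ROC analysis could look like the following sketch, with the PI as the marker (follow-up assumed to be recorded in days):

```r
library(survivalROC)

# Time-dependent ROC at 1, 3 and 5 years
for (t in c(1, 3, 5) * 365) {
  roc <- survivalROC(Stime = clin$os_time, status = clin$os_status,
                     marker = clin$PI, predict.time = t, method = "KM")
  cat(sprintf("AUC at %.0f days: %.3f\n", t, roc$AUC))
}
```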
Calibration refers to the agreement between observed and predicted outcomes (Hilden, Habbema et al. 1978) and can be assessed with calibration plots. In this study, patients were divided into several groups according to sample size, and predicted survival probabilities (x-axis) were compared with observed survival probabilities (y-axis) at 1, 3, and 5 years. We also performed a more sensitive assessment of calibration (Houwelingen 2000): patients were assigned to two risk groups based on the median PI, and the observed Kaplan-Meier survival curves were compared with the average predicted probability curves for each risk group. Because packaged plotting functions are inconvenient when the model is later updated, calibration plots were drawn with custom R code, as described by Wang (Wang, Yuan et al. 2021). We also estimated the calibration slope, i.e., the regression coefficient of the PI in the external validation set (Royston and Altman 2013, Stevens and Poppe 2020, Wang 2020). The calibration slope reflects whether the effects of the predictors (regression coefficients) in the original model are over- or under-estimated (Janssen, Moons et al. 2008). A calibration slope of less than 1 indicates that the regression coefficients are too large, which can produce predictions that are too extreme for new patients, i.e., too low for low-risk patients and too high for high-risk patients (Janssen, Moons et al. 2008).
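The calibration slope can be estimated by regressing on the PI in the validation set; `validation_data$PI` is assumed to hold the PI computed from the original model:

```r
library(survival)

# Calibration slope = coefficient of the PI in the external validation set;
# 1 indicates perfectly calibrated effects, < 1 indicates over-fitted coefficients
fit_slope <- coxph(Surv(os_time, os_status) ~ PI, data = validation_data)
cal_slope <- unname(coef(fit_slope))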
Validation of the prediction model comprised internal and external validation. Internal validation tests the reproducibility of the model-building process in a population from the same source as the derivation set (Steyerberg, Harrell Jr et al. 2001). To correct for optimism in model performance, internal validation was performed with 1,000 bootstrap resamples. External validation verifies whether the model retains good predictive performance in data independent of the derivation set, as assessed by the discrimination and calibration of the model in the external validation set.
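A sketch of the bootstrap internal validation with rms; regressing on the PI here stands in for the full fitted model:

```r
library(rms)

# Refit with cph so that bootstrap validation is available
fit_rms <- cph(Surv(os_time, os_status) ~ PI, data = derivation_data,
               x = TRUE, y = TRUE, surv = TRUE)

# 1000 bootstrap resamples; the optimism-corrected C-statistic is Dxy/2 + 0.5
val <- validate(fit_rms, method = "boot", B = 1000)
c_corrected <- val["Dxy", "index.corrected"] / 2 + 0.5
```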
2.5 Model updating
When a model performs poorly in external validation, it can be adjusted/updated based on information from the validation set (Van Calster, McLernon et al. 2019). After updating, the model should be revalidated in an independent dataset (Royston and Altman 2013). Estimating the calibration slope and the baseline cumulative hazard in the external validation set determines how to update the model: the calibration slope only, the baseline hazard only, or both (Van Calster, McLernon et al. 2019, Wang, Yuan et al. 2021). Two main approaches for model updating have been proposed: recalibration and revision (Steyerberg, Borsboom et al. 2004, Janssen, Moons et al. 2008). In external validation, recalibration improves model calibration, whereas revision improves both discrimination and calibration (Toll, Janssen et al. 2008).
Model revision was performed on model 1. Two validation sets were used: the updating set (whose information is used to update the model) and the test set (in which the updated model is validated). In the updating set, the Cox regression model was re-fitted to obtain new regression coefficients for the original predictors (Janssen, Moons et al. 2008). Given the large sample size required by the proposed revision method (Steyerberg, Borsboom et al. 2004), we combined the TCGA-LGG cohort (derivation set, n = 496) and the GSE16011 dataset (updating set, n = 106) into a larger “revision set” (n = 602), in which the regression coefficients of the model were re-fitted. We then compared the baseline survival probabilities of the revision and test sets; if the difference was too large, the baseline survival probability \(S_0(t)\) from the test set was used when calculating the predicted survival probability \(S(t \mid X)\). The predicted survival probability was calculated as \(S(t \mid X) = {S_0(t)}^{\exp(\text{PI})}\), and the baseline survival probability was estimated as \(S_0(t) = \exp[-H_0(t)]\), where \(H_0(t)\) is the baseline cumulative hazard (Royston and Altman 2013). The baseline cumulative hazard was computed with the “basehaz” function in the “survival” R package (version 3.4.0). The CGGA dataset (n = 408) was used as the test set; it had at least 100 events and sufficient power to detect differences in prediction model performance between the two datasets (Vergouwe, Steyerberg et al. 2005, Janssen, Moons et al. 2008). The revised model was re-evaluated in both the revision and test sets.
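A sketch of the revision and prediction steps; the formula stands in for the model's selected genes (hypothetical names), and `revision_set`/`test_set` denote the combined TCGA + GSE16011 data and the CGGA data. Note that `basehaz(..., centered = FALSE)` must be paired with an uncentered PI (`reference = "zero"` in recent survival versions):

```r
library(survival)

# Re-fit the original predictors in the combined revision set (n = 602)
fml <- Surv(os_time, os_status) ~ gene1 + gene2 + gene3  # hypothetical genes
fit_rev <- coxph(fml, data = revision_set)

# Baseline cumulative hazard H0(t) and baseline survival S0(t) = exp(-H0(t))
bh <- basehaz(fit_rev, centered = FALSE)
S0 <- exp(-bh$hazard)

# Predicted survival S(t | X) = S0(t)^exp(PI), e.g., at 5 years (days assumed)
pi_test <- predict(fit_rev, newdata = test_set, type = "lp", reference = "zero")
i5 <- max(which(bh$time <= 5 * 365))
S5 <- S0[i5] ^ exp(pi_test)
```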
Model 2 was re-calibrated based on the baseline survival probability and the calibration slope of the validation set (Janssen, Moons et al. 2008, Moons, Kengne et al. 2012). We used \(S_0(t)\) from the validation set, rather than from the derivation set, to calculate the predicted survival probabilities in the validation set. To obtain the final regression coefficients, the regression coefficient of each predictor was multiplied by the calibration slope. After re-calibration, model calibration was re-evaluated in the validation set.
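A sketch of this recalibration, assuming numerically coded predictors; `cal_slope` is the calibration slope estimated above, and the `iter.max = 0` trick fixes the PI coefficient at 1 so that `basehaz` returns the validation-set baseline:

```r
library(survival)

# Shrink the original coefficients by the calibration slope
coef_recal <- coef(fit_cox) * cal_slope
validation_data$pi_recal <-
  as.numeric(as.matrix(validation_data[, names(coef_recal)]) %*% coef_recal)

# Validation-set baseline hazard, with the recalibrated PI held fixed
fit_val <- coxph(Surv(os_time, os_status) ~ pi_recal, data = validation_data,
                 init = 1, control = coxph.control(iter.max = 0))
bh_val <- basehaz(fit_val, centered = FALSE)

# Re-calibrated 3-year survival prediction (days assumed)
i3 <- max(which(bh_val$time <= 3 * 365))
S3 <- exp(-bh_val$hazard[i3]) ^ exp(validation_data$pi_recal)
```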
2.6 Online calculator
To make the two models easy to use, we provide an online calculator at https://lhj0520.shinyapps.io/M7G-LGG_model/. The site offers single-patient and batch predictions for each model, as well as a tool for removing batch effects against the model derivation set (TCGA).