Patients: This retrospective study was approved by the institutional review board or research ethics board of the two participating academic institutions: The Hospital for Sick Children (Toronto, Ontario, Canada) and The Lucile Packard Children’s Hospital (Stanford University, Palo Alto, California). This study was performed in accordance with the relevant guidelines and regulations. Informed consent was waived by the local institutional review or research ethics boards due to the retrospective nature of the study. An inter-institutional data transfer agreement was obtained for data sharing. Patients were identified from the electronic health record databases at Toronto from January 2000 to December 2018 and at Stanford from January 2009 to January 2016. Patient inclusion criteria were: 1) age 0–18 years, 2) availability of molecular information on BRAF status in histopathologically confirmed pLGG, and 3) availability of preoperative brain MRI with a non-motion-degraded FLAIR sequence. Patients with histone H3 K27M mutation or neurofibromatosis 1 were excluded. Spinal cord tumors were also excluded.
The datasets of 94 patients from The Hospital for Sick Children, Toronto, and 21 patients from The Lucile Packard Children’s Hospital, Stanford, used in this study have been previously published (20). The previous study applied an RF model without variations in sample size to differentiate BRAF fused from BRAF V600E mutated pLGG. Our current study investigates the performance of five commonly used ML models and various sample sizes to predict BRAF fusion or BRAF V600E mutation on an independent validation set using a systematic step-wise increase of training data.
Molecular Analysis: BRAF fusion status was determined using a NanoString panel or fluorescence in situ hybridisation (FISH), while BRAF p.V600E mutation was determined using immunohistochemistry or droplet digital PCR, as previously described (21).
MRI Acquisition, Data Retrieval, Image Segmentation: All patients from The Hospital for Sick Children, Toronto, underwent brain MRI at 1.5T or 3T across various vendors (Signa, GE Healthcare; Achieva, Philips Healthcare; Magnetom Skyra, Siemens Healthineers). Sequences were acquired according to the institutional tumor protocol and included a 2D axial T2 FLAIR sequence (TR/TE, 7000–10000/140–170 ms; 3–6 mm slice thickness; 3–7.5 mm gap). Patients from Lucile Packard Children’s Hospital, Stanford, underwent brain MR imaging on 1.5T or 3T scanners from a single vendor (Signa or Discovery 750; GE Healthcare, Milwaukee, Wisconsin). Sequences were acquired using the institutional brain tumor protocol, which included a 2D axial T2 FLAIR sequence (TR/TE, 7000–10000/140–170 ms; 4–5 mm slice thickness; 1–1.5 mm gap). All MRI data were extracted from the respective PACS and were de-identified for further analyses. Tumor segmentation was performed by a fourth-year radiology resident with neuroradiology research experience (AA) using 3D Slicer (ver. 4.10.2) (22) (http://www.slicer.org). The SlicerRadiomics extension, a scripted loadable module, was used to access the radiomics feature calculation classes implemented in the pyradiomics library (http://pyradiomics.readthedocs.io/). This extension allows selection of all available feature classes and ensures isotropic resampling under “Resampled voxel size” when extracting 3D features. Semi-automated tumor segmentation on FLAIR images was performed with the Level-Tracing-Effect tool. This semi-automatic approach has been found superior to multi-user manual delineation with regard to reproducibility and robustness of results (23). The final placement of ROIs was confirmed by a board-certified radiologist with pediatric neuroradiology training (MWW, 7 years of neuroradiology research experience).
Radiomic Feature-Extraction Methodology: A total of 851 MRI-based radiomic features were extracted from the ROIs on FLAIR images. Radiomic features included histogram, shape, and texture features with and without wavelet-based filters. Features based on Laplacian of Gaussian filters were not extracted. All features are summarized in the Supplementary Table. Bias field correction was applied prior to z-score normalization to standardize the range of all image features (24, 25). Once the features were extracted, we applied z-score normalization again, followed by L2 normalization, to the features of cohort 1 and used the distribution of the features in cohort 1 (training data) to normalize cohort 2 (validation data). Details of pre-processing and radiomic feature extraction in 3D Slicer have been described elsewhere (13, 17, 26).
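The feature-level normalization described above can be illustrated with a minimal sketch. It assumes that L2 normalization is applied per patient (per feature vector), which the text does not specify, and it uses hypothetical toy values and function names (`fit_zscore`, `normalize`); it is not the pipeline used in the study. The key point is that the means and standard deviations are estimated on the training cohort only and then reused for the validation cohort.

```python
import math
from statistics import mean, pstdev


def fit_zscore(train_rows):
    """Estimate per-feature mean and standard deviation on the training cohort only."""
    columns = list(zip(*train_rows))
    means = [mean(col) for col in columns]
    # A constant feature has zero spread; fall back to a unit scale to avoid division by zero.
    stds = [pstdev(col) or 1.0 for col in columns]
    return means, stds


def normalize(rows, means, stds):
    """z-score every feature with the training statistics, then L2-normalize each row.

    Row-wise L2 normalization is an assumption of this sketch.
    """
    normalized = []
    for row in rows:
        z = [(x - m) / s for x, m, s in zip(row, means, stds)]
        norm = math.sqrt(sum(v * v for v in z)) or 1.0
        normalized.append([v / norm for v in z])
    return normalized


# Hypothetical toy data: three training patients, one validation patient, two features.
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]]
valid = [[2.5, 30.0]]

means, stds = fit_zscore(train)
train_n = normalize(train, means, stds)
valid_n = normalize(valid, means, stds)  # validation data never influence the statistics
```

Fitting the statistics on cohort 1 alone keeps information from the validation cohort out of the training process.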
Statistical and ML analysis: We used t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize our dataset. RF, XGBoost, NN1 (100:20:2), NN2 (50:10:2), NN3 (50:20:10:2) were utilized as classification models (27–29).
As a dimensionality reduction technique, t-SNE can be applied to different types of data, including radiomic features. It first applies Principal Component Analysis to map data points from the original high-dimensional space to an intermediate lower-dimensional space (default dimensions = 30). Subsequently, pairwise distances are calculated and probability distributions are fitted to the examples so that data points in closer proximity are assigned higher probabilities. Embeddings (i.e., representations of the data points in latent space, usually a 2D or 3D space) are then initialized, and the same procedure is repeated for the points in the latent space. The Kullback-Leibler divergence between the two sets of probability distributions is then minimized iteratively using gradient descent. Ultimately, if two data points are similar in the high-dimensional space, their embeddings will be close to one another in 2D/3D space.
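The objective that t-SNE minimizes can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a single fixed Gaussian bandwidth in the high-dimensional space (the full algorithm tunes one bandwidth per point from a perplexity setting), it omits the PCA and gradient descent steps, and all point coordinates and function names are hypothetical. It only shows that an embedding that keeps neighbors together yields a lower Kullback-Leibler divergence than one that separates them.

```python
import math


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def gaussian(d2):
    """Similarity kernel in the original space (fixed bandwidth, an assumption)."""
    return math.exp(-d2 / 2.0)


def student_t(d2):
    """Heavy-tailed similarity kernel in the latent space."""
    return 1.0 / (1.0 + d2)


def joint_probs(points, kernel):
    """Turn pairwise kernel values (i != j) into a joint probability distribution."""
    n = len(points)
    weights = [[kernel(sq_dist(points[i], points[j])) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    total = sum(map(sum, weights))
    return [[w / total for w in row] for row in weights]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) summed over all pairs with non-zero probability in P."""
    return sum(pij * math.log((pij + eps) / (qij + eps))
               for p_row, q_row in zip(p, q)
               for pij, qij in zip(p_row, q_row) if pij > 0)


# Two tight clusters in a hypothetical 3D "high-dimensional" space.
high = [[0, 0, 0], [0.1, 0, 0.1], [5, 5, 5], [5.1, 5, 4.9]]
good = [[0, 0], [0.1, 0], [3, 3], [3.1, 3]]  # neighbors stay together
bad = [[0, 0], [3, 3], [0.1, 0], [3.1, 3]]   # neighbors are split apart

P = joint_probs(high, gaussian)
kl_good = kl_divergence(P, joint_probs(good, student_t))
kl_bad = kl_divergence(P, joint_probs(bad, student_t))
```

In the full algorithm, gradient descent moves the latent points so that this divergence decreases at every iteration.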
RF is a learning method consisting of several decision trees that can be used for classification and regression. Decision trees are multi-step thresholders that can overfit to any data if there is no controlling mechanism such as a maximum depth. To enhance generalization capacity, RFs combine many decision trees into an ensemble. Like other tree-based algorithms, RFs are suitable for classification of tabular data, which makes them well suited to radiomics pipelines. The most critical hyperparameters of RF are the number of trees (number of estimators), the maximum depth of each tree, and the minimum and maximum number of examples at leaf nodes.
eXtreme Gradient Boosting (XGBoost) is a popular gradient boosted trees (GBT) algorithm. Similar to RF, GBTs ensemble a collection of decision trees. However, GBTs add trees sequentially such that the errors of the previous trees are corrected by the next tree. Compared to other GBTs, XGBoost utilizes a customized loss function and implements multiple regularization techniques to enhance the model’s generalization and computational efficiency. The learning rate of the tree booster (default value is 0.3) and the maximum depth of the trees are the most important hyperparameters of the algorithm.
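The sequential error-correction idea behind gradient boosted trees can be sketched in a few lines. This is plain gradient boosting with a squared-error loss and depth-1 trees (stumps) on hypothetical one-dimensional data; it does not reproduce XGBoost's regularized objective or the models trained in this study. The learning rate of 0.3 mirrors the default mentioned above.

```python
def fit_stump(x, residuals):
    """Fit a depth-1 regression tree: one threshold, one mean value per side."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boost(x, y, n_trees, learning_rate=0.3):
    """Add trees one at a time, each fitted to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        predictions = [pi + learning_rate * tree(xi) for pi, xi in zip(predictions, x)]
    return lambda xi: base + learning_rate * sum(tree(xi) for tree in trees)


def mse(model, x, y):
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)


# Hypothetical toy regression data.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]

error_0 = mse(boost(x, y, n_trees=0), x, y)    # mean-only baseline
error_1 = mse(boost(x, y, n_trees=1), x, y)
error_50 = mse(boost(x, y, n_trees=50), x, y)
```

A smaller learning rate shrinks each tree's contribution, so more trees are needed to reach the same training error, which is one reason this hyperparameter interacts strongly with the number of trees.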
Neural Networks (NNs) are highly nonlinear classifiers that were initially built by ensembling perceptron blocks. To date, there are multiple well-established categories of NNs, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), and generative adversarial networks. Each of these categories is suitable for specific sets of problems. For example, CNNs are well suited to image and video processing tasks such as object detection or segmentation. Given the type of data, we used feedforward NNs in this study. Feedforward NNs are considered a conventional type of NN, where individual perceptrons form layers and a stack of layers creates the architecture without recurrent paths in the network. We designed an initial NN (NN1) and derived two other architectures (NN2 and NN3) by changing its width and depth. Figure 1 illustrates the three architectures, NN1, NN2, and NN3. The rectified linear unit was used as the activation function of the linear layers. To enhance generalization of the models, we implemented dropout layers in our architectures. The dropout mechanism randomly excludes some nodes of the network from the weight updating process during training.
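A forward pass through such feedforward networks can be sketched as follows. Several details are assumptions of this illustration rather than facts from the study: the notation 100:20:2 is read as the widths of successive layers ending in a two-unit output, the input width is taken to be the 851 extracted features, the dropout probability of 0.5 is hypothetical, and the weights are random rather than trained.

```python
import random


def init_layers(sizes, rng):
    """Create one (weights, biases) pair per consecutive pair of layer sizes."""
    return [([[rng.gauss(0.0, 0.05) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]


def forward(x, layers, drop_p=0.0, rng=None):
    """Linear layer -> ReLU -> optional dropout on hidden layers; linear output layer."""
    for index, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if index < len(layers) - 1:
            x = [max(0.0, v) for v in x]  # rectified linear unit
            if drop_p > 0.0:
                # Inverted dropout: zero a random subset of nodes, rescale the rest.
                x = [0.0 if rng.random() < drop_p else v / (1.0 - drop_p) for v in x]
    return x


N_FEATURES = 851  # assumed input width
ARCHITECTURES = {
    "NN1": [100, 20, 2],
    "NN2": [50, 10, 2],
    "NN3": [50, 20, 10, 2],
}

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(N_FEATURES)]  # hypothetical feature vector
networks = {name: init_layers([N_FEATURES] + widths, rng)
            for name, widths in ARCHITECTURES.items()}

# Inference: dropout disabled, so the output is deterministic.
outputs = {name: forward(sample, layers) for name, layers in networks.items()}
# Training-style pass: dropout enabled with a hypothetical probability of 0.5.
noisy = forward(sample, networks["NN1"], drop_p=0.5, rng=rng)
```

Dropout is active only during training; at inference every node participates, which is why the two calls above can return different values for the same input.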
Internal Cross Validation
Starting with 10% of the training data, all models were cross-validated using a 4-fold approach with a systematic stepwise 2.25% increase in sample size. At each step, experiments were repeated 10 times using randomized subsets of the respective percentage of the training data, resulting in 10 classifiers per step.
At each step, the 10 classifiers were validated on the entire independent external data set.
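The stepwise sampling scheme can be sketched as a schedule of training subsets. The training set size of 100 is a hypothetical placeholder, the way folds are formed is an assumption, and model fitting and external validation are omitted; the sketch only shows how the percentages, repeats, and 4-fold splits fit together.

```python
import random


def training_fractions(start=10.0, step=2.25, stop=100.0):
    """Percentages of the training data used at each step: 10, 12.25, ..., 100."""
    n_steps = int(round((stop - start) / step))
    return [start + i * step for i in range(n_steps + 1)]


def four_fold_splits(indices, k=4):
    """Yield (train, held_out) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def stepwise_subsets(n_train, repeats=10, seed=0):
    """Yield (percentage, repeat, subset) for every step and every random repeat."""
    rng = random.Random(seed)
    for pct in training_fractions():
        # Keep at least four samples so that every fold is non-empty.
        size = max(4, round(n_train * pct / 100.0))
        for repeat in range(repeats):
            yield pct, repeat, rng.sample(range(n_train), size)


N_TRAIN = 100  # hypothetical training cohort size
schedule = list(stepwise_subsets(N_TRAIN))

for pct, repeat, subset in schedule[:1]:
    for train_idx, held_out_idx in four_fold_splits(subset):
        # A model would be fitted on train_idx and evaluated on held_out_idx here,
        # then applied unchanged to the external validation cohort.
        pass
```

With a 2.25% increment from 10% to 100%, the schedule contains 41 sample sizes and, at 10 repeats each, 410 subsets per model.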
Classification performance metrics
Mean AUC and 95% confidence intervals (CI) were calculated at every step for both the training and validation data sets, and the process was repeated for all five models. The external validation data set was never used at any stage of model training and was dedicated to external validation. To examine whether the differences in performance between the models were significant, we conducted a two-sided two-sample Kolmogorov-Smirnov (KS) test on the mean AUCs across training sample sizes for each pair of models.
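The metrics above can be illustrated with a minimal sketch. The text does not state how the confidence intervals were computed, so the normal approximation (mean ± 1.96 standard errors) used here is an assumption, as are all numeric values. The sketch computes only the KS statistic; a p-value could be obtained from a statistics library such as SciPy (`scipy.stats.ks_2samp`).

```python
import math
from bisect import bisect_right
from statistics import mean, stdev


def auc(labels, scores):
    """AUC as the probability that a positive case outscores a negative one (ties count 0.5)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_ci95(values):
    """Mean with a normal-approximation 95% CI (the CI method is an assumption)."""
    m = mean(values)
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    return m, m - half_width, m + half_width


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


# Hypothetical AUCs of the 10 classifiers trained at one step.
step_aucs = [0.78, 0.81, 0.80, 0.76, 0.83, 0.79, 0.82, 0.77, 0.80, 0.84]
step_mean, ci_low, ci_high = mean_ci95(step_aucs)

# Hypothetical mean AUCs of two models across increasing training sample sizes.
model_a = [0.70, 0.74, 0.78, 0.80, 0.82, 0.83]
model_b = [0.62, 0.66, 0.69, 0.72, 0.75, 0.76]
d_stat = ks_statistic(model_a, model_b)
```

Because the KS test compares whole distributions of mean AUCs, it is sensitive to differences in both the level and the shape of the learning curves of two models.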