A scan-rescan randomized case-control MRI neuro-imaging dataset of healthy adults (n = 46, age 26 ± 7 years; 29 women) was used, as described in previous studies.67–70 Twenty-three subjects were randomly assigned to either a night of sleep deprivation, or a normal sleep wake cycle (SWC). T1-weighted and DTI scans were acquired at four time points (TPs) over two days: TP1, around 9 a.m. after a night of normal sleep in their own home; TP2, around 8 p.m. approximately 11 hours after T1; TP3 approximately 23 hours after TP1; And finally, TP4, in the afternoon of the second day around 4 p.m. Participants in the normal SWC group went home to sleep between TP2 and 3, while those in the sleep deprivation group stayed at the hospital. This data collection was approved by the Regional Committee for Medical and Health Research Ethics, South-Eastern Norway (REK Sør-Øst, ref: 2017/2200) and conducted in line with the Declaration of Helsinki. All participants provided written informed consent. Data, code and documentation used in the study are available from the corresponding author upon reasonable request.
4.2 MRI acquisition and processing
Image acquisition consisting of T1-weighted and DTI and processing and was conducted in accordance with recommended standard brain segmentation protocols provided by the FreeSurfer group.75
T1-weighted brain images were scanned using a 3T Siemens Magnetom Prisma scanner (Siemens Healthcare, Erlangen, Germany) using a 32-channel head coil. The acquisition parameters75 were as follows: repetition time (TR) = 2530ms, echo time (TE) = 3.5ms, flip angle = 7°. The voxel size was 1.0 × 1.0 × 1.0 mm and field-of-view (FOV) was 256 × 256 mm2 (256 × 256 matrix) with 176 sagittal slices. Acquisition time was six minutes three seconds.
Preprocessing of T1 data consisted of motion correction, skull stripping and intensity normalization using the FreeSurfer image preprocessing pipeline (autorecon2).76
The DTI scan protocol has been described previously.67 It consisted of a full-brain multi-shell Stejskal-Tanner pulsed mono-planar gradient scheme77 with a single-shot spin-echo multiband-accelerated echo-planar imaging (EPI) readout module.78 Seventy-six axial slices with b-values = [500-1000-2000-3000-4000](s/mm2) and non-coplanar diffusion-sensitized gradient directions were acquired with the corresponding numbers of gradient directions ndir = [12–30–40–50–60]. The following parameters were applied: TR = 2450ms, TE = 85ms, flip-angle = 78°. The voxel size = 2.0 × 2.0 × 2.0 mm, FOV = 212 x 212 mm2 (106 × 106 matrix), slice thickness = 2 mm, and multiband acceleration factor = 4. Acquisition time was eight minutes and 21 seconds. In addition, five non-diffusion-weighted image sets (b = 0) of opposite phase-encode direction –but otherwise identical imaging parameters– were acquired for correction of susceptibility distortions. Acquisition time was 31 seconds.
Each DTI volume was affinely registered to the average non-diffusion weighted volume using the FMRIB's Linear Image Registration Tool (FLIRT)79, correcting for intra-scan subject motion and eddy-current distortions. After non-brain tissue was removed80, voxel-wise eigenvalues and eigenvectors were extracted from the estimated diffusion tensor and fractional anisotropy (FA) was calculated. The fractional anisotropy (FA) map was co-registered to the structural scan using the statistical parametric mapping (SPM) toolbox for Matlab 2019b (The MathWorks, Natick, Massachusetts). We manually checked for co-registration errors.
4.3 Segmentation and Segmentation agreement
Four whole-brain segmentation methods were selected: FreeSurfer Automatic Segmentation (ASEG)81 version 7.1.0, FreeSurfer Sequence Adaptive Multimodal Segmentation (SAMSEG)65,66, FastSurfer55 and VUNO Med-DeepBrain (VUNO Inc., Seoul, South Korea) version 1.0.1. A description of each of these methods is provided in Supplementary Note S3. The selection of segmentation methods used was based on three considerations: First, the selection was limited to methods producing the same anatomical labeling. Second, methods were required to use the same modality as source, being T1-weighted MRI. And last, the different methods represent a mix of underlying methodologies. This mix consists of commonly used conventional and recently introduced methods that aim to outperform conventional methods through deep learning.
We generated segmentation labels from T1-weighted images and semantically matched these across segmentation methods, excluding areas that were not available for all methods. All segmentations were checked manually for all subjects to exclude potential failures. Segmentation agreement was determined by computing the Dice-coefficient for each segmentation method pair, for each anatomical area.82 We interpreted Dice-coefficients using the same range of strength of agreement as for the Kappa coefficient and computed interquartile ranges (IQR) to aid interpretation.83
4.4 Radiomic feature extraction reliability analysis
We used seven classes to subdivide radiomic features, roughly in order of complexity: Shape-based features84, first-order features84, gray level co-occurrence matrices (GLCM)85, gray level dependence matrices (GLDM)86, gray level run length matrices (GLRLM)87, gray level size zone matrices (GLSZM)88, and neighboring gray tone difference matrices (NGTDM). A description of these feature classes can be found in Supplementary Note S4.
We computed radiomic features for each anatomical area, in each scan modality and at each TP using PyRadiomics89 (v.3.0.1) implemented in Python (v.3.8.4). Geometry tolerance was set to 10− 3 mm. Feature definitions are in compliance with Imaging Biomarker Standardization Initiative (IBSI)44 and are described extensively at the proprietary repository.90
Next, we computed radiomic feature reliability for each anatomical area and each radiomic feature. Reliability was assessed for two measurements: First, scan-rescan reproducibility was calculated by computing the two-way mixed intra-class correlation (ICC)91 between TP1 and TP2, thus before any effects of sleep deprivation. Second, robustness to segmentation method variation was computed similarly using ICC among all four segmentation methods. ICC was implemented using the Pingouin (v. 0.3.10) for Python. Resulting values were averaged over all subjects. Throughout this work, averaging of coefficients was performed using Fisher’s z-transformation.92 At first, computation of radiomic feature reliability produces a comprehensive overview of reliability of each feature for each MR modality. Second, reliability metrics were averaged per anatomical region and radiomic feature class. Anatomical regions with failing segmentation agreement were excluded to avoid including segmentation errors or regions with different semantical definitions that were not previously excluded in the matching of segmentation labels from affecting radiomic feature properties. Failure was defined as a Dice-coefficient below 0.5.
4.5 Radiomics-based classification and statistical analysis
To investigate the effect of the segmentation method variation on the discriminative power of radiomic features, we used a binary classifier to separate sleep-deprived subjects from controls. This classifier was trained on the radiomic features produced by each of the four segmentation methods.
Dimensionality of the data was reduced in three steps. First, a selection in modalities and radiomic features was made based on a combination of literary findings on effects of sleep deprivation and two classes with highest radiomic feature reproducibility. Second, anatomical regions with failing agreement, as described in the previous section, were excluded. Left-right hemisphere values were not averaged, such that potential asymmetric properties remained. Last, to better express the temporal relationship in the data while reducing dimensionality, the values at the four time points were used to compute a first-order polynomial. Only the coefficient-over-time of this polynomial was used for prediction.
A neural network was trained on the remaining radiomic features. The input of the network consisted of a one-dimensional vector of radiomic features, and sleep deprivation as binary class label. The network consisted of three blocks, each consisting of a batch normalization layer, a rectified linear unit (ReLU) activation layer93, linear layer and a dropout regularization layer, in that order. These blocks were followed by a sigmoid layer combined with a binary cross-entropy (BCE) loss.
The network was implemented in PyTorch version 1.7.1, on a single Nvidia GeForce RTC 2080 SUPER (Nvidia Corporation, Delaware, California) graphics processing unit (GPU) with CUDA version 10.294. Adam optimization with L1-regularization, default weight initialization, and a constant learning rate of 10− 3 were used for training and a batch size of 16. The regularization factor was set to 10− 3 and dropout probability to 0.3. Training was performed until loss on the validation set refrained from decreasing. After training the parameters were reinstated to the state with lowest validation loss for testing.
A complete data approach, excluding subjects with missing scans, was followed. This is required to allow for the identical computation of the temporal effect using a polynomial. Data was divided into a training partition of 70%, validation partition of 20% and test partition of 10% of the included samples with which 10-fold cross-validation was performed. Each fold was initialized differently, but for each method the same sequence of randomization seed was used with the same partitioning to ensure comparability of the performance of the model of each fold.
Shapiro-Wilk test was used to test normality of the paired BCE-loss performance over all folds. Friedman test implemented in SciPy version 1.6.2 was used to test the hypothesis that model performance expressed as BCE-loss did not differ among methods used.95