All participants, or their legal guardians, gave informed consent for their clinical data to be used for research purposes. All methods were performed in accordance with the relevant guidelines and regulations, and this research was performed in accordance with the Declaration of Helsinki. MRI acquisition was carried out after all participants completed a screening form to determine MRI eligibility. We received de-identified MRI data. This study was approved by the Institutional Review Board of the University of Florida (IRB No. 202101469).
DeepBrainNet architectures for brain age prediction
A DeepBrainNet consists of a CNN connected to a dense layer with 1024 units and a Rectified Linear Unit (ReLU) activation, an 80% dropout layer (active during training), and a single output layer with linear activation (the predicted brain age)2. The CNN can be the “InceptionResNetV2”19 or the “Visual Geometry Group” network with 16 layers (VGG16)20, initialized with ImageNet weights before being trained with MRI data2. Since these CNNs predict from 2D images, DeepBrainNet treats the 2D slices of an MRI as independent samples. The individual’s predicted brain age is the median of the predictions across a set of slices. DeepBrainNet architectures were trained with preprocessed research-grade T1w images from 11,729 individuals (ages 3–95 years) from various geographic locations, scanners, acquisition protocols, and studies, and tested on an independent sample of 2,739 individuals. The required preprocessing of a T1w brain MRI is described in the next section.
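As a point of reference for the architecture just described, the following is a minimal Keras sketch (not the authors’ released code); the input shape and function name are illustrative assumptions, while the 1024-unit ReLU dense layer, 80% dropout, and linear output follow the description above.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import InceptionResNetV2  # or VGG16

def build_deepbrainnet(input_shape=(182, 218, 3)):
    # Backbone CNN initialized with ImageNet weights (input shape is illustrative).
    backbone = InceptionResNetV2(include_top=False, weights="imagenet",
                                 input_shape=input_shape, pooling="avg")
    x = layers.Dense(1024, activation="relu")(backbone.output)
    x = layers.Dropout(0.8)(x)                            # active only during training
    brain_age = layers.Dense(1, activation="linear")(x)   # predicted brain age per slice
    return Model(backbone.input, brain_age)
```

At inference, the per-slice predictions of an individual are aggregated by taking their median, as stated above.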
Preprocessing of the MRIs
We preprocessed all original and synthetic MPRAGEs as required by DeepBrainNet. We skull-stripped them using smriprep (https://www.nipreps.org/smriprep/usage.html): each image was corrected for intensity non-uniformity using N4BiasFieldCorrection24, distributed with ANTs 2.2.0, and skull-stripped with a Nipype implementation of the antsBrainExtraction.sh workflow from ANTs25, using OASIS30ANTs as the target template. We then transformed the skull-stripped images to the 1-mm isotropic voxel FSL skull-stripped T1w template (“MNI152”, LPS orientation (Right → Left, Anterior → Posterior, Inferior → Superior), 218 × 182 × 218 dimensions) using a 12-parameter linear affine transformation estimated via spm_coreg.m from Statistical Parametric Mapping (SPM; https://www.fil.ion.ucl.ac.uk/spm/). Finally, image intensities were scaled from 0 to 255.
The 2D images used by DeepBrainNet for prediction were the 80 axial slices centered at the z = 0 plane (in MNI coordinates) of the normalized T1w. Note that our chosen brain age prediction method (i.e., DeepBrainNet) does not rely on the whole-brain MRI. This is particularly convenient for axial clinical-grade MRIs, which often lack the topmost and bottommost slices.
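A minimal sketch of these two final steps (intensity rescaling and extraction of the 80 axial slices around z = 0 mm), assuming the image has already been normalized to the template; the nibabel-based implementation and function name are illustrative, not the pipeline’s actual code.

```python
import numpy as np
import nibabel as nib

def extract_slices(normalized_t1w_path, n_slices=80):
    img = nib.load(normalized_t1w_path)
    data = img.get_fdata()
    # Rescale intensities to the 0-255 range used by DeepBrainNet.
    data = 255.0 * (data - data.min()) / (data.max() - data.min())
    # Voxel index (k) of the z = 0 mm plane, obtained by inverting the image affine.
    k0 = int(round(np.linalg.inv(img.affine).dot([0.0, 0.0, 0.0, 1.0])[2]))
    lo = k0 - n_slices // 2
    return [data[:, :, k] for k in range(lo, lo + n_slices)]  # 80 axial 2D slices
```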
Quality control of the preprocessed MRIs
Preprocessed MRI images were submitted to a careful quality control (QC) procedure. Since the study included a large number of MRIs, only the subset of preprocessed MRIs most likely to be of bad quality was visually inspected. This subset was selected as follows. We calculated the normalized mutual information (NMI)26 between each preprocessed MRI and the 1-mm isotropic voxel FSL skull-stripped T1w template. We then plotted the histogram of the NMIs and visually defined a threshold based on the values appearing markedly below the main unimodal distribution. We inspected all images below this threshold, and images above it, until they showed no visible preprocessing errors. Since the goal was to demonstrate the feasibility of brain age estimation for clinical images, which generally have lower quality than images intended for research purposes, we were lenient in what we considered a “preprocessing error”. We only removed preprocessed MRIs that were indisputably unrecognizable due to motion, in which the brain extraction failed to remove significant non-brain tissue, or in which the normalization performed poorly. In addition, we discarded images with significant structural abnormalities (e.g., tumors, deformations, very large ventricles, tissue loss).
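The NMI screening can be illustrated with the following hedged sketch; the histogram-based discretization, bin count, and use of scikit-learn’s estimator are illustrative choices, since the text does not specify the NMI implementation.

```python
import numpy as np
import nibabel as nib
from sklearn.metrics import normalized_mutual_info_score

def nmi_to_template(mri_path, template_path, bins=64):
    # Both images are in the same template space, so voxels align after flattening.
    a = nib.load(mri_path).get_fdata().ravel()
    b = nib.load(template_path).get_fdata().ravel()
    # Discretize intensities so a discrete NMI estimator can be applied.
    a_d = np.digitize(a, np.histogram_bin_edges(a, bins=bins))
    b_d = np.digitize(b, np.histogram_bin_edges(b, bins=bins))
    return normalized_mutual_info_score(a_d, b_d)

# Images whose NMI falls well below the bulk of the histogram are flagged for visual inspection.
```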
Data subdivisions
Figure 5 depicts how the dataset was split for training and evaluation. Splits were done at the participant level, i.e., if a participant belonged to a certain data proportion or subdivision, all the slices of all the modalities of that participant belonged to that proportion or subdivision. We split the whole dataset into three main independent subdivisions: 77% for training, 8.5% for characterizing and correcting the “regression dilution” bias (i.e., the linear correction set), and 14.5% held out for external validation. These specific proportions were a consequence of the distribution of modalities in the whole sample, as we explain here and evidence in the Results. First, four modalities had too few samples to be part of the training set. Thus, we decided that they could only be part of a sample used to test the predictions, specifically to test the accuracy in modalities that were not used to train the model. These four modalities were present in only 14.5% of the participants, hence this proportion for the testing set. The training set was set to 77%, i.e., 70% plus 10% of that 70%, because 70% was used for the actual training and the rest was used to monitor accuracy and overfitting during the training process, as we explain in the next section. The remaining 8.5% of the sample was used to estimate the linear correction. Therefore, the training and linear correction sets contained MRIs from only 5 modalities, whereas the testing set contained MRIs from the 8 modalities.
In addition, the training data were split into three folds for cross-validation of the models and hyperparameters. Following the same strategy described above for the whole dataset, at each of the three cross-validation iterations, two folds (i.e., 66.7% of the training data) were used for training and the remaining fold was further subdivided: 30% for correcting the bias and 70% for evaluation (and to define early stopping) during training (10% and 23.3% of the whole training data, respectively).
Also, each subdivision had roughly the same distribution of the number of MRIs per participant, which is why, for each cross-validation iteration, we could not use the same proportions for training, linear correction, and evaluation as those used for the whole dataset (i.e., 66.7%, 10%, and 23.3% versus 77%, 8.5%, and 14.5%, respectively). Finally, to ensure reproducibility, the samples were generated with a fixed seed at the beginning of the study.
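As an illustration only, a participant-level split with a fixed seed could look like the sketch below; the actual assignment additionally depended on the modality distribution and on balancing the number of MRIs per participant, as described above, so this is not the exact procedure used.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

def participant_level_split(subject_ids, seed=0):
    """subject_ids: one entry per image; grouping keeps all images of a participant together."""
    subject_ids = np.asarray(subject_ids)
    idx = np.arange(len(subject_ids))
    # 85.5% (training + linear correction) vs. 14.5% held-out testing.
    outer = GroupShuffleSplit(n_splits=1, test_size=0.145, random_state=seed)
    rest, test = next(outer.split(idx, groups=subject_ids))
    # Within the remainder: 77% training vs. 8.5% linear correction (of the whole sample).
    inner = GroupShuffleSplit(n_splits=1, test_size=0.085 / 0.855, random_state=seed)
    train, lincorr = next(inner.split(rest, groups=subject_ids[rest]))
    return rest[train], rest[lincorr], test
```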
Re-training of the DeepBrainNet models
We used Keras 2.11 with the mean squared error (MSE) as the loss function, the mean absolute error (MAE) as the accuracy metric, and the Adam optimizer27. To tune the model hyperparameters, we cross-validated across a grid defined by the Cartesian product of the models (InceptionResNetV2 and VGG16) and the following hyperparameter values: learning rate = [7e-6, 7e-5, 7e-4], batch size = [80, 160, 240, 320, 400], and a Boolean variable specifying whether importance weights for the observations were applied during training. These weights were inversely proportional to the frequency of occurrence of the observation’s chronological age range (defined by the bins of the histogram of the ages), to prevent the algorithm from learning to predict brain ages more accurately for ages that are more frequent in the training data. The training data were shuffled at the slice level so a batch could include slices from several subjects.
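A hedged sketch of the hyperparameter grid and of the age-frequency importance weights follows; the number of histogram bins and the normalization of the weights are illustrative assumptions.

```python
import itertools
import numpy as np

def age_sample_weights(ages, n_bins=10):
    # Weight each observation inversely to the frequency of its chronological age bin.
    counts, edges = np.histogram(ages, bins=n_bins)
    bin_idx = np.clip(np.digitize(ages, edges[1:-1]), 0, n_bins - 1)
    weights = 1.0 / counts[bin_idx]
    return weights * len(ages) / weights.sum()   # normalize to a mean weight of 1

grid = list(itertools.product(
    ["InceptionResNetV2", "VGG16"],              # CNN backbone
    [7e-6, 7e-5, 7e-4],                          # learning rate
    [80, 160, 240, 320, 400],                    # batch size
    [False, True],                               # use age-frequency sample weights
))
```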
For each cross-validation iteration and grid cell, we trained the original CNN model on the training subset, evaluated its performance using the MAE on the validation subset, corrected the linear bias using the linear correction subset, and calculated a corrected MAE, again on the validation subset. This was done as follows. We loaded the corresponding original DeepBrainNet model and initially set the weights of the fully connected and output layers as trainable, while the weights of the rest of the network were frozen. We trained this model for a maximum of 20 epochs. We then unfroze all the layers of the model and re-trained it for another maximum of 20 epochs. To avoid overfitting, early stopping was applied during training if the loss function did not improve during the last three consecutive epochs (patience = 3) or if the MAE on the training data was lower than that on the validation data (patience = 0). During training, we adjusted the learning rate after each batch using \(\text{learning rate}(\text{batch}) = \text{initial learning rate}/\left(1 + \text{batch}\times \text{decay}\right)\), with decay = 3e-7. After training, we used the trained model to predict the brain age of each slice of the linear correction subset of the cycle and computed the within-subject median to obtain the individual whole-brain brain ages. We then fitted the linear correction model, i.e., brain age ~ prediction (in Wilkinson’s notation), on this subset to characterize the linear bias, where:
1. prediction = chronological age,
2. prediction = chronological age * modality,
3. prediction = chronological age * scanner, or
4. prediction = chronological age * modality * scanner.
We then added an additional layer to the model’s architecture (using Keras’ “Lambda” layer) that takes both the predicted brain age and the chronological age as inputs and outputs a corrected brain age according to the formula: corrected brain age = chronological age + brain age − prediction. We shall call this new model the “corrected model”. Finally, we predicted the brain age of the evaluation subset of the cycle using the trained uncorrected model to obtain an “uncorrected MAE” for that cycle and grid cell, and using the corresponding corrected models to obtain four “corrected MAEs” for that cycle and grid cell.
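The correction layer can be sketched as follows (illustrative, assuming the simplest bias model, prediction = a + b × chronological age; handling modality or scanner terms would require extra inputs, and the coefficients a and b are placeholders obtained from the fitted linear model).

```python
from tensorflow.keras import layers, Model

def add_bias_correction(trained_model, a, b):
    """Wrap a trained brain-age model with the Lambda correction layer described above."""
    chron_age = layers.Input(shape=(1,), name="chronological_age")
    brain_age = trained_model.output                 # uncorrected predicted brain age
    # corrected brain age = chronological age + brain age - prediction(chronological age)
    corrected = layers.Lambda(lambda t: t[1] + t[0] - (a + b * t[1]))([brain_age, chron_age])
    return Model([trained_model.input, chron_age], corrected)
```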
We selected the models and training parameters that yielded the lowest MAE averaged across folds. We used the uncorrected MAE to select the CNN and the set of training hyperparameters. We then used the four corrected MAEs of the selected CNN/hyperparameter combination (grid cell) to select the bias correction model. Using this final combination, we re-trained the original CNN model as described above, keeping only 10% of the whole training set to monitor early stopping and using the remaining 90% for training (hence the 77%, so that the actual training subset equals the 70% typically used in the literature). We then used the held-out linear correction set to correct the linear bias using the selected linear model. Finally, we evaluated the generalization error and internal consistency in the held-out testing set, as described in the next section.
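For completeness, the two-stage fine-tuning recipe described in this section (train the head first, then unfreeze everything, with a batch-wise inverse-time learning-rate decay and early stopping on the validation loss) can be sketched as follows. The standard EarlyStopping callback is used here, whereas the additional training-MAE-below-validation-MAE criterion would require a custom callback; the assumption that the head is the last two layers is illustrative.

```python
import tensorflow as tf

def fine_tune(model, train_ds, val_ds, initial_lr, decay=3e-7, epochs=20):
    # learning rate(batch) = initial learning rate / (1 + batch * decay)
    schedule = tf.keras.optimizers.schedules.InverseTimeDecay(
        initial_lr, decay_steps=1, decay_rate=decay)
    early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=3)

    # Stage 1: only the fully connected and output layers are trainable.
    for layer in model.layers[:-2]:
        layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.Adam(schedule), loss="mse", metrics=["mae"])
    model.fit(train_ds, validation_data=val_ds, epochs=epochs, callbacks=[early_stop])

    # Stage 2: unfreeze all layers and continue training.
    for layer in model.layers:
        layer.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(schedule), loss="mse", metrics=["mae"])
    model.fit(train_ds, validation_data=val_ds, epochs=epochs, callbacks=[early_stop])
    return model
```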
Evaluation of the accuracy in the testing dataset (generalization error)
We reported the accuracy of the predictions using the sample MAE, the correlation between the predicted brain age and the chronological age, and the coefficient of determination (R2) of the linear model \(\text{brain age}=\text{chronological age}+\text{error}\) (i.e., \(\text{slope}=1\) and \(\text{intercept}=0\)). That is, with \(\langle x\rangle\) being the average of \(x\):
$$\begin{aligned}R^{2}&=1-\frac{RSS}{TSS}\\ RSS&=\sum_{i}\left(\text{brain age}_{i}-\text{chronological age}_{i}\right)^{2}\\ TSS&=\sum_{i}\left(\text{brain age}_{i}-\langle\text{brain age}\rangle\right)^{2}\end{aligned}$$
A perfect fit yields R2 = 1. Also, with this constrained definition, R2 can take a negative value when the model \(\text{brain age}=\text{chronological age}+\text{error}\) does not follow the trend of the data.
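A direct translation of this constrained R2 into code is straightforward (sketch):

```python
import numpy as np

def constrained_r2(brain_age, chronological_age):
    """R^2 of the identity model brain age = chronological age + error (slope 1, intercept 0)."""
    brain_age = np.asarray(brain_age, dtype=float)
    chronological_age = np.asarray(chronological_age, dtype=float)
    rss = np.sum((brain_age - chronological_age) ** 2)
    tss = np.sum((brain_age - brain_age.mean()) ** 2)
    return 1.0 - rss / tss   # can be negative when the identity model fits worse than the mean
```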
We adopted a non-parametric approach to calculate these measures of accuracy to deal with possible inflation due to repeated observations. We calculated these measures separately for each modality using a randomly selected repetition per subject. Since this method yields a different result for each run, we used bootstrapping to calculate the accuracy measures and took the average. Bootstrapping also allowed us to estimate the 95% confidence intervals (CIs). We used 10,000 × Nr bootstraps to account for the fact that a random image among a maximum of Nr repeated measures is drawn for each subject and modality. When reporting the accuracy measures for the total sample, i.e., pooling across modalities and repetitions, we used 10,000 × Nr × Nm bootstraps, where Nm is the number of modalities.
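The bootstrap can be sketched as below; the data layout (a long-format table per modality) and column names are assumptions for illustration.

```python
import numpy as np

def bootstrap_metric(df, metric, n_boot, seed=0):
    """df: pandas DataFrame (one modality) with columns 'subject', 'brain_age', 'chron_age'."""
    rng = np.random.default_rng(seed)
    n_subjects = df["subject"].nunique()
    stats = []
    for _ in range(n_boot):
        # Draw one random repetition per subject, then resample subjects with replacement.
        one_rep = df.groupby("subject").sample(n=1, random_state=rng.integers(1 << 31))
        resampled = one_rep.sample(n=n_subjects, replace=True,
                                   random_state=rng.integers(1 << 31))
        stats.append(metric(resampled["brain_age"].values, resampled["chron_age"].values))
    stats = np.asarray(stats)
    return stats.mean(), np.percentile(stats, [2.5, 97.5])   # point estimate and 95% CI
```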
To test part of our first hypothesis, which involves comparing predictive accuracies, we fitted the linear mixed model “|brain-PAD| ~ modality + subject + (1 | subject)”, where | | denotes the absolute value and modality is a categorical variable with a level for each modality, including the original MPRAGE. Note that this is a mixed-effects repeated-measures ANOVA with missing values in long format, i.e., \(\left|\text{brain-PAD}_{i,r,m}\right|={s}_{i}+{\beta }_{m}+{\eta }_{i}+{e}_{i,r,m}\), where \({s}_{i}\) is the fixed effect of the \(i\)th subject, \({\beta }_{m}\) is the fixed effect of the \(m\)th modality, \({\eta }_{i}\) is the random effect of the \(i\)th subject, and \({e}_{i,r,m}\) is the random error for the \(i\)th subject, \(r\)th repetition, and \(m\)th modality. The fit was done by restricted maximum likelihood using the ‘Quasi-Newton’ optimizer and modeling the covariance matrix with a full Cholesky parameterization. By modeling a non-diagonal covariance matrix and adding a random effect on the intercept across subjects, we account for possible correlations between observations due to the use of repeated measures for some subjects (between and within modalities). This should avoid inflation of Type I errors when testing the significance of contrasts in the model. We then tested the within-subject comparison of |brain-PAD| between the \(m\)th synthetic modality and the original MPRAGE by evaluating and testing the contrast comparing their corresponding coefficients in the model, i.e., \({\beta }_{m}-{\beta }_{MPRAGE}\).
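For illustration, a rough Python analogue of this model using statsmodels is shown below; the optimizer and residual covariance options described above (Quasi-Newton, full Cholesky parameterization) are not reproduced here, and the column names are assumptions.

```python
import statsmodels.formula.api as smf

def fit_brainpad_model(df):
    """df: long-format DataFrame with columns 'abs_brain_pad', 'modality', 'subject'."""
    # Mirrors the stated Wilkinson formula: |brain-PAD| ~ modality + subject + (1 | subject).
    model = smf.mixedlm("abs_brain_pad ~ C(modality) + C(subject)", df, groups=df["subject"])
    return model.fit(reml=True)   # fit by restricted maximum likelihood
```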
Finally, note that this linear mixed model also provides an estimate of the absolute value of the brain-PAD for each subject and modality, given by \({s}_{i}+{\beta }_{m}\). Thus, we can also report a “population” estimate of the MAE. To that end, we evaluated the contrast \({\widehat{MAE}}_{m}={\langle {s}_{i}\rangle }_{i}+{\beta }_{m}\), where \({\langle\ \rangle }_{i}\) denotes averaging across \(i\), to estimate the MAE for each modality, and the contrast \({\langle {\widehat{MAE}}_{m}\rangle }_{m}\) to estimate the total MAE.
Reliability of brain age predictions across modalities and repetitions
To test our second hypothesis, we also evaluated the intra-subject reliability of the predictions; specifically, we evaluated Cronbach’s alpha on the brain-PAD. Cronbach’s alpha is a measure of internal consistency21. The ranges α ≥ 0.9 and 0.8 ≤ α < 0.9 denote excellent and acceptable reliability, respectively, whereas lower values denote questionable, poor, or even unacceptable reliability. Very high values (α ≥ 0.95) denote redundancy (which is desirable in our case). For the calculation of Cronbach’s alpha, we reorganized the values into a matrix, \(X\), of number of participants by modality/repetition pairs (the items). We considered all possible 6 × 4 = 24 modality/repetition pairs (i.e., the MPRAGE and the synthetic MRIs for all five modalities, and the maximum of four repetitions). We then dropped participants (rows) having fewer than 3 items and items having more than 95% missing values. Cronbach’s alpha was calculated using the following formula28:
$$\text{cronbach}=\frac{\text{number of items}}{\text{number of items}-1}\left[1-\frac{\mathrm{trace}\left(C\right)}{\sum _{ij}{C}_{ij}}\right],$$
where \(C\) is the covariance matrix of \(X\). To handle the remaining missing values, \(C\) was calculated using pairwise deletion. The lower and upper bounds of a CI (e.g., 95%) for Cronbach’s alpha were given by:
$$\begin{aligned}\text{lower}&=1-\left(1-\text{cronbach}\right){F}^{-1}\left(\alpha /2,\,{df}_{1},\,{df}_{2}\right)\\ \text{upper}&=1-\left(1-\text{cronbach}\right){F}^{-1}\left(1-\alpha /2,\,{df}_{1},\,{df}_{2}\right)\end{aligned},$$
where \({F}^{-1}\) is the inverse of the complementary cumulative distribution function (i.e., the inverse survival function) of the F-distribution, \(\alpha = 1 - CI/100\), \({df}_{1} = \text{number of observations} - 1\), and \({df}_{2} = {df}_{1}\times \text{number of items}\).
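The computation of Cronbach’s alpha with pairwise-deletion covariances and its F-based CI can be sketched as follows (the DataFrame layout is an assumption):

```python
import numpy as np
from scipy import stats

def cronbach_alpha(X, ci=95):
    """X: participants-by-items pandas DataFrame of brain-PAD values (NaNs allowed)."""
    C = X.cov().values                              # pairwise-deletion covariance matrix
    k = C.shape[0]                                  # number of items
    alpha_c = (k / (k - 1)) * (1 - np.trace(C) / C.sum())
    a = 1 - ci / 100
    df1 = X.shape[0] - 1                            # number of observations - 1
    df2 = df1 * k                                   # df1 x number of items
    # scipy's isf is the inverse survival function (inverse complementary CDF) of the F-distribution.
    lower = 1 - (1 - alpha_c) * stats.f.isf(a / 2, df1, df2)
    upper = 1 - (1 - alpha_c) * stats.f.isf(1 - a / 2, df1, df2)
    return alpha_c, (lower, upper)
```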