Data acquisition. We submitted data access applications to nearly all open-access brain imaging data archives and received permission from the administrators of 34 datasets. The full dataset list is shown in table 1. Deidentified data were contributed from datasets approved by their local Institutional Review Boards. The reanalysis of these data was approved by the Institutional Review Board of the Institute of Psychology, Chinese Academy of Sciences. All participants had provided written informed consent at their local institution. All 50,876 participants (contributing 85,721 samples) had at least one session with a T1-weighted structural brain image and information on their sex and age. For participants with multiple sessions of structural images, each image was treated as an independent sample for data augmentation during training. Importantly, scans from the same person were never split between training and testing sets, as that could artifactually inflate performance.
Table 1: The datasets used in the present study
| Full Name of Dataset | Number of T1 Scans | T1 Scans after QC | Age (mean±std) | Number of Subjects | Number of Sites | Manufacturers | Field Strength |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Adolescent Brain Cognition Development | 31176 | 30222 | 13.76±10.08 | 11875 | 21 | SIE/PHI/GE | 3T |
| UK Biobank | 20124 | 19744 | 63.1±7.46 | 20124 | 4 | SIE | 3T |
| Alzheimer's Disease Neuroimaging Initiative | 16596 | 16431 | 74.97±7.4 | 2546 | 57 | SIE/PHI/GE | 3T/1.5T |
| Open Access Series of Imaging Studies | 3150 | 3099 | 67.54±20.64 | 1664 | (5) | SIE | 3T/1.5T |
| REST-meta-MDD sample | 2380 | 2363 | 36.2±15.11 | 2380 | 17 | SIE/PHI/GE | 3T/1.5T |
| Brain Genomics Superstruct Project | 1570 | 1552 | 21.54±2.89 | 1570 | 2 | SIE | 3T |
| Human Connectome Project | 1267 | 1220 | / | 1267 | 1 | SIE | 7T/3T |
| Autism Brain Imaging Data Exchange | 1102 | 1073 | 17.09±8.06 | 1102 | 17 | SIE/PHI/GE | 3T/1.5T |
| Autism Brain Imaging Data Exchange II | 1043 | 1019 | 15.16±9.39 | 1043 | 19 | SIE/PHI/GE | 3T/1.5T |
| 1000 Functional Connectomes Project | 897 | 864 | 25.8±10.76 | 897 | 33 | SIE/PHI/GE | 3T/1.5T |
| ADHD-200 Sample | 876 | 864 | 12.35±3.28 | 876 | 8 | SIE/PHI | 3T/1.5T |
| Consortium for Reliability and Reproducibility | 714 | 691 | 23.45±12.31 | 715 | 2 | SIE/GE | 3T |
| Cambridge Centre for Ageing Neuroscience (Cam-CAN) | 652 | 523 | 54.36±18.55 | 652 | 1 | SIE | 3T |
| Enhanced Nathan Kline Institute - Rockland Sample | 646 | 616 | 38.63±21.21 | 646 | 1 | SIE | 3T |
| Southwest University Longitudinal Imaging Multimodal | 586 | 581 | 20.1±1.3 | 586 | 1 | SIE | 3T |
| Child Mind Institute Healthy Brain Network | 572 | 506 | 10.74±3.65 | 572 | 3 | SIE | 3T/1.5T |
| Establishing Moderators and Biosignatures of Antidepressant Response in Clinic Care | 540 | 523 | / | 540 | 4 | SIE/PHI/GE | 3T |
| Southwest University Adult Lifespan Dataset | 493 | 483 | 45.16±17.45 | 493 | 1 | SIE | 3T |
| Max Planck Institute Leipzig Mind-Brain-Body Dataset | 316 | 291 | / | 316 | 1 | SIE | 3T |
| Beijing Enhanced Sample | 180 | 176 | 21.22±1.94 | 180 | 1 | SIE | 3T |
| Nathan Kline Institute - Rockland Sample | 167 | 151 | 35.59±20.71 | 167 | 1 | SIE | 3T |
| The Center for Biomedical Research Excellence | 147 | 137 | / | 147 | 1 | SIE | 3T |
| The Age-ility Project | 110 | 105 | 21.87±5.39 | 110 | 1 | SIE | 3T |
| Parkinson's Disease Datasets | 68 | 65 | 66.18±7.58 | 68 | 2 | SIE | 3T/1.5T |
| Power et al., 2012 Neuroimage Sample | 63 | 62 | 14.25±6.05 | 63 | 1 (2) | SIE | 3T |
| NYU Institute for Pediatric Neuroscience | 47 | 45 | 30.4±8.98 | 47 | 1 | SIE | 3T |
| Beijing Eyes Open Eyes Closed Sample | 46 | 44 | 22.54±2.18 | 46 | 1 | SIE | 3T |
| Multi-Modal MRI Reproducibility Resource | 42 | 42 | 31.76±9.35 | 42 | 1 | PHI | 3T |
| Adelstein et al., 2011, PLoS ONE Sample | 39 | 36 | 29.59±8.38 | 39 | 1 | SIE | 3T |
| Cleveland CCF | 31 | 29 | 43.55±11.14 | 31 | 1 | SIE | 3T |
| Virginia Tech Carilion Research Institute | 25 | 24 | 26.84±8.17 | 25 | 1 (3) | SIE | 3T |
| Beijing Short TR Sample | 24 | 23 | 23.71±6.74 | 24 | 1 | SIE | 3T |
| FIND Lab sample | 13 | 12 | 24.08±3.73 | 13 | 1 | GE | 3T |
| The Midnight Scan Club dataset | 10 | 10 | 29.1±3.35 | 10 | 1 | SIE | 3T |
| Total | 85712 | 83735 | | 50876 | 217 | | |
Note: Abbreviations: SIE = Siemens, PHI = Philips, GE = General Electric. Numbers in brackets indicate the number of scanners.
MRI preprocessing. We did not feed raw data into the classifier for training; instead, we used accepted preprocessing pipelines that are known to generate valuable features from brain scans. The structural data were segmented and normalized to obtain grey matter density (GMD) and grey matter volume (GMV) maps. Specifically, we used the voxel-based morphometry (VBM) analysis module within the Data Processing Assistant for Resting-State fMRI (DPARSF)(21), which is based on SPM(22), to segment individual T1-weighted images into grey matter, white matter, and cerebrospinal fluid (CSF). The segmented images were then transformed from individual native space to MNI space (a coordinate system created by the Montreal Neurological Institute) using the Diffeomorphic Anatomical Registration Through Exponentiated Lie algebra (DARTEL) tool(23). The two voxel-based structural metrics, GMD and GMV, were fed into the deep learning classifier as two features for each participant. GMD is the unmodulated tissue segmentation map in MNI space. GMV is calculated by multiplying each voxel value in GMD by the Jacobian determinant derived from the spatial normalization step (modulation)(24).
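For illustration, the modulation step can be sketched in a few lines (a minimal sketch assuming nibabel and hypothetical file names; DPARSF/SPM perform this step internally):

```python
import nibabel as nib

# Hypothetical file names; DPARSF/SPM produce equivalent outputs.
gmd_img = nib.load("gmd_mni.nii")    # unmodulated grey matter segmentation in MNI space (GMD)
jac_img = nib.load("jacobian.nii")   # Jacobian determinants from the DARTEL normalization

# Modulation: scale each voxel by the local volume change introduced by
# spatial normalization, converting density (GMD) into volume (GMV).
gmv_data = gmd_img.get_fdata() * jac_img.get_fdata()

nib.save(nib.Nifti1Image(gmv_data, gmd_img.affine), "gmv_mni.nii")
```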
Quality control. Poor-quality raw structural images produce distorted GMD and GMV maps during segmentation and normalization. To prevent such participants from affecting classifier training, we excluded participants in each dataset whose GMV map's Pearson's correlation with the grand mean GMV template fell below a threshold of (mean − 2SD), computed within that dataset. The grand mean GMV template was generated by randomly selecting 10 participants from each dataset and averaging the GMV maps of all these 340 participants (from 34 datasets), all of whom were visually checked for image quality. After quality control, 83,735 samples were retained for classifier training (figure S1).
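A minimal sketch of this exclusion rule (assuming each participant's GMV map has been flattened into a row of gmv_maps and that template is the grand mean GMV template):

```python
import numpy as np

def qc_pass_mask(gmv_maps: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Return a boolean mask of participants passing quality control.

    gmv_maps: (n_participants, n_voxels) flattened GMV maps of one dataset.
    template: (n_voxels,) grand mean GMV template.
    """
    # Pearson's correlation of each participant's GMV map with the template
    r = np.array([np.corrcoef(m, template)[0, 1] for m in gmv_maps])
    threshold = r.mean() - 2 * r.std()  # (mean - 2SD), computed within each dataset
    return r >= threshold               # keep participants at or above the threshold
```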
Deep learning: classifier training and testing for sex. We trained a 3-dimensional Inception-ResNet-v2(25) model adapted from its 2-dimensional version in the Keras built-in applications (see figure 1A for its structure). This state-of-the-art pattern recognition model integrates two classical series of CNN models, Inception and ResNet. We replaced the convolution, pooling, and normalization modules with their 3-dimensional versions and adjusted the number of layers and convolutional kernels to suit 3-dimensional MRI inputs (GMD and GMV as separate input channels). The model consists of one stem module, three groups of convolutional modules (Inception-ResNet-A/B/C), and two reduction modules (Reduction-A/B). It can take advantage of convolutional kernels with different shapes and sizes to extract features at multiple scales, and its residual modules mitigate vanishing and exploding gradients. We used the Keras built-in stochastic gradient descent optimizer with learning rate = 0.01, Nesterov momentum = 0.9, and decay = 0.003 (i.e., learning rate = initial learning rate × 1 / (1 + decay × batch)). The loss function was binary cross-entropy. The batch size was set to 24 and the training procedure lasted 10 epochs for each fold. To avoid potential overfitting, we randomly split 600 samples out of the training sample as a validation sample and set a checkpoint at the end of every epoch, saving the model from the epoch with the lowest validation loss. The testing sample was then fed into this model to evaluate the classifier.
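The optimizer and checkpointing configuration described above can be sketched as follows (a simplified outline using the Keras 2.x API; model, x_train, and the other variable names are placeholders, not the authors' code):

```python
from keras.optimizers import SGD
from keras.callbacks import ModelCheckpoint

# SGD with Nesterov momentum; `decay` shrinks the learning rate per batch:
# lr = lr0 * 1 / (1 + decay * batch)
optimizer = SGD(lr=0.01, momentum=0.9, nesterov=True, decay=0.003)
model.compile(optimizer=optimizer, loss="binary_crossentropy",
              metrics=["accuracy"])

# Checkpoint at the end of every epoch; keep only the weights from the epoch
# with the lowest loss on the 600 held-out validation samples.
checkpoint = ModelCheckpoint("best_sex_classifier.h5",
                             monitor="val_loss", save_best_only=True)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),  # 600 samples split from training data
          batch_size=24, epochs=10,
          callbacks=[checkpoint])
```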
When training a sex classifier, random cross-validation can assign participants from the same site to both the training and testing samples, so the model may not generalize well to datasets from unseen sites because site information leaks during training. To ensure generalizability, we used cross-dataset validation: in the testing phase, none of the data from a given testing dataset had been seen during classifier training. This also ensured that data from a given site (and thus a given scanner) were unseen by the classifier during training (see figure 1B for an illustration). This strict setting can limit classifier performance, but it makes generalization to any participant at any site (scanner) feasible. Five-fold cross-dataset validation was used to assess classifier accuracy. Of note, 3 datasets were always kept in the training sample because of their massive sample sizes: Adolescent Brain Cognition Development (ABCD) (n = 31,176), UK Biobank (n = 20,124), and the Alzheimer's Disease Neuroimaging Initiative (ADNI) (n = 16,596). The remaining 31 datasets were randomly allocated to the training and testing samples; the final allocation schema was the one that best balanced the sample sizes of the 5 folds among 10,000 random allocations (see the sketch below). Both healthy controls and patients with brain-related disorders from the 34 datasets were used to train the sex classifier.
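A minimal sketch of the allocation search (the balance criterion is assumed to be the standard deviation of fold sample sizes, which the text does not specify; dataset_sizes is a hypothetical mapping from dataset name to sample count):

```python
import numpy as np

def best_fold_allocation(dataset_sizes: dict, n_folds: int = 5,
                         n_trials: int = 10000, seed: int = 0):
    """Randomly allocate datasets to folds; keep the most size-balanced split."""
    rng = np.random.default_rng(seed)
    names = list(dataset_sizes)
    best_split, best_imbalance = None, np.inf
    for _ in range(n_trials):
        order = rng.permutation(len(names))
        folds = [[] for _ in range(n_folds)]
        for i, idx in enumerate(order):  # round-robin over a random ordering
            folds[i % n_folds].append(names[idx])
        sizes = [sum(dataset_sizes[n] for n in fold) for fold in folds]
        imbalance = np.std(sizes)        # assumed balance criterion: SD of fold sizes
        if imbalance < best_imbalance:
            best_imbalance, best_split = imbalance, folds
    return best_split
```

With 31 allocatable datasets and 5 folds, each trial yields folds of 6-7 datasets; the trial with the smallest size imbalance defines the cross-dataset folds.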
Transfer learning: classifier training and testing for AD. After obtaining a highly robust and accurate brain imaging-based sex classifier as a base model, we used transfer learning to fine-tune an AD classifier. Rather than retaining the full, sophisticated structure of the base model (Inception-ResNet-v2), we only leveraged the pre-trained weights in the stem module and simplified the upper layers (e.g., replacing Inception-ResNet modules with ordinary convolutional layers). The retained bottom structure works as a feature extractor that takes advantage of the massive training of the sex classifier, while the pruned upper structure reduces the number of parameters (10 million for the AD classifier vs. 54 million for the sex classifier) to avoid potential overfitting and promote generalizability. This derived AD classifier was fine-tuned on the ADNI dataset (2,186 samples from 380 AD patients and 4,671 samples from 698 normal controls (NCs); 76 ± 7 years; 3,493 samples from women). ADNI was launched in 2003 (Principal Investigator: Michael W. Weiner, MD) to investigate biological markers of the progression of mild cognitive impairment (MCI) and early AD (see www.adni-info.org). We used the Keras built-in stochastic gradient descent optimizer with learning rate = 0.0003, Nesterov momentum = 0.9, and decay = 0.002. The loss function was binary cross-entropy. The batch size was set to 24 and the training procedure lasted 10 epochs for each fold. Similar to the cross-dataset validation used for sex classifier training, five-fold cross-site validation was used to assess classifier accuracy (see figure 1C for an illustration). By ensuring that data from a given site (and thus a given scanner) were unseen by the classifier during training, this strict strategy made the classifier generalizable with non-inflated accuracy, better simulating realistic clinical applications than traditional five-fold cross-validation. In addition to GMD+GMV, we also used GMD alone, GMV alone, or z-standardized spatially normalized raw T1-weighted images as the input for the sex/AD classifiers to assess the influence of the input format (table 2). We also trained an age prediction model instead of the sex classifier as the base model in transfer learning to assess the influence of the base model. It used the same structure as the sex classifier, except that we added a fully connected layer of 128 neurons with an ELU activation function before the final layer and changed the dropout rate from 0.5 to 0.2, following the parameters of the brain age prediction model reported by B. A. Jonsson et al.(20).
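The transfer-learning setup can be sketched as below (a speculative outline: the layer name "stem", the number and width of the replacement convolutional layers, and the pooling head are assumptions, not the authors' implementation):

```python
from keras.models import Model
from keras.layers import Conv3D, GlobalAveragePooling3D, Dropout, Dense
from keras.optimizers import SGD

def build_ad_classifier(sex_model):
    """Reuse the pre-trained stem of the sex classifier as a feature extractor,
    then stack ordinary convolutional layers in place of the Inception-ResNet
    modules to shrink the parameter count."""
    stem_output = sex_model.get_layer("stem").output  # hypothetical layer name
    x = Conv3D(128, 3, activation="relu", padding="same")(stem_output)
    x = Conv3D(128, 3, activation="relu", padding="same")(x)
    x = GlobalAveragePooling3D()(x)
    x = Dropout(0.5)(x)
    out = Dense(1, activation="sigmoid")(x)           # AD vs. NC
    model = Model(sex_model.input, out)
    model.compile(SGD(lr=0.0003, momentum=0.9, nesterov=True, decay=0.002),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```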
Furthermore, to test the generalizability of the AD classifier, we directly tested it on three unseen independent AD samples: the Australian Imaging, Biomarkers and Lifestyle Flagship Study of Ageing (AIBL)(26), the Minimal Interval Resonance Imaging in Alzheimer's Disease cohort (MIRIAD)(27), and the Open Access Series of Imaging Studies (OASIS)(28). We used an ensemble method, averaging the outputs of the 5 AD classifiers from the five-fold cross-validation trained on ADNI, to obtain the final classification for each sample (see the sketch below). For AIBL and MIRIAD, we used diagnoses provided by qualified physicians as the sample labels (AIBL: 115 samples from 82 AD patients and 554 samples from 324 NCs; 74 ± 7 years; 374 samples from women. MIRIAD: 409 samples from 46 AD patients and 235 samples from 23 NCs; 70 ± 7 years; 358 samples from women). As OASIS did not specify the criteria for an AD diagnosis, we adopted Mini-Mental State Examination (MMSE) and Clinical Dementia Rating (CDR) criteria modified from the ADNI-1 protocol manual to define AD and NC samples. Specifically, the criteria for AD were (1) MMSE ≤ 22 and (2) CDR ≥ 1.0, and the criteria for NC were (1) MMSE > 26 and (2) CDR = 0. We thus tested the model on 137 samples from 34 AD patients and 986 samples from 213 NC participants in the OASIS dataset after quality control (75 ± 10 years; 772 samples from women). Of note, the scanning conditions and recruitment criteria of these independent datasets varied much more than those among ADNI sites (where scanning and recruitment were deliberately coordinated), so we expected the AD classifier to achieve lower performance on them.
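The ensemble step amounts to averaging the five sigmoid outputs (a minimal sketch; fold_models is a hypothetical list holding the 5 trained AD classifiers):

```python
import numpy as np

def ensemble_predict(fold_models, x):
    """Average the sigmoid outputs of the 5 cross-validation AD classifiers."""
    scores = np.stack([m.predict(x) for m in fold_models])  # (5, n_samples, 1)
    return scores.mean(axis=0)                              # final AD score per sample
```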
We further investigated whether the AD classifier could predict disease progression in people with MCI. MCI is a syndrome defined as relative cognitive decline without symptoms that interfere with daily life; even so, more than half of MCI patients progress to dementia within 5 years(29). Stable MCI (sMCI) samples were defined as scans from an individual who was diagnosed with MCI in any phase of ADNI and had not progressed to AD by the end of the ADNI follow-up; progressive MCI (pMCI) samples were defined as scans from a participant who was diagnosed with MCI in any phase of ADNI and later progressed to AD. For precision, the scans labeled "conversion" or "AD" (i.e., after conversion) for pMCI, and the last scan for sMCI, were excluded from the present study (a sketch of this labeling rule follows). We screened the imaging records of MCI patients who later converted to AD in the ADNI-1/2/GO phases and collected 2,371 images from 243 participants labeled as pMCI. For contrast, we assembled 4,018 samples from 524 participants labeled as sMCI, without later progression. We fed all these MCI images directly into the AD classifier without further fine-tuning, thereby evaluating the performance of the AD classifier on unseen MCI data.
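This labeling rule can be sketched against a hypothetical per-scan diagnosis table (the column names and diagnosis codes are assumptions about the ADNI export format, not the actual schema):

```python
import pandas as pd

def label_mci_scans(df: pd.DataFrame) -> pd.DataFrame:
    """Label each MCI participant's scans as pMCI or sMCI.

    Assumed columns: 'subject', 'visit_date', 'diagnosis', with diagnosis
    in {'NC', 'MCI', 'conversion', 'AD'} for each scan.
    """
    out = []
    for subject, scans in df.sort_values("visit_date").groupby("subject"):
        if "MCI" not in scans["diagnosis"].values:
            continue  # never diagnosed as MCI in any phase
        if scans["diagnosis"].isin(["conversion", "AD"]).any():
            # progressive MCI: drop scans labeled "conversion" or "AD"
            pre = scans[~scans["diagnosis"].isin(["conversion", "AD"])]
            out.append(pre.assign(label="pMCI"))
        else:
            # stable MCI: drop the last scan, per the present study
            out.append(scans.iloc[:-1].assign(label="sMCI"))
    return pd.concat(out, ignore_index=True)
```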
Interpretation of the deep learning classifiers. To better understand the brain imaging-based deep learning classifiers, we calculated occlusion maps for them. We repeatedly tested the images in the testing sample using the model with the highest accuracy among the 5 folds, while successively masking brain areas (volume = 18 mm × 18 mm × 18 mm, step = 9 mm) of all input images. The classifier's accuracy on "intact" samples minus its accuracy on "occluded" samples indicated the importance of the masked brain area to the classifier. Occlusion maps were calculated for both the sex and AD classifiers (see the sketch below). To investigate the clinical significance of the AD classifier's output, we calculated Spearman's correlation coefficients between the predicted scores and the MMSE scores of AD, NC, and MCI samples. We also used general linear models (GLMs) to test whether the predicted scores (or MMSE scores) differed between people with sMCI and pMCI, with the age and sex of the MCI participants included as covariates. We selected the T1-weighted image from the first visit of each MCI participant, yielding data from 243 pMCI patients and 524 sMCI patients.
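A minimal sketch of the occlusion procedure (assuming model is the best-fold classifier and images has shape (n, x, y, z, channels); the 18 mm cube and 9 mm step must be converted to voxels using the data's actual voxel size):

```python
import numpy as np

def occlusion_map(model, images, labels, cube_vox, step_vox):
    """Importance = accuracy on intact images minus accuracy with one cube occluded.

    cube_vox / step_vox: occlusion cube edge and stride in voxels
    (convert 18 mm and 9 mm using the data's voxel size).
    """
    base_acc = np.mean((model.predict(images).ravel() > 0.5) == labels)
    _, nx, ny, nz, _ = images.shape
    importance = np.zeros((nx, ny, nz))
    counts = np.zeros((nx, ny, nz))
    for x0 in range(0, nx, step_vox):
        for y0 in range(0, ny, step_vox):
            for z0 in range(0, nz, step_vox):
                occluded = images.copy()
                occluded[:, x0:x0+cube_vox, y0:y0+cube_vox, z0:z0+cube_vox, :] = 0
                acc = np.mean((model.predict(occluded).ravel() > 0.5) == labels)
                # accuracy drop attributed to the occluded cube
                importance[x0:x0+cube_vox, y0:y0+cube_vox, z0:z0+cube_vox] += base_acc - acc
                counts[x0:x0+cube_vox, y0:y0+cube_vox, z0:z0+cube_vox] += 1
    return importance / np.maximum(counts, 1)  # average where cubes overlap
```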