Dataset
Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. Its primary goal has been to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. The dataset included data from 267 people in an AD group, 230 people in a normal control (NC) group, and 328 people in an MCI group. The MCI group was further divided into progressive MCI (pMCI) and stable MCI (sMCI) groups according to whether the patient developed AD within the following 36 months, yielding 166 pMCI and 162 sMCI patients. The clinical information for each participant group, such as sex, age, and Mini-Mental State Examination (MMSE) score, is presented in Table 1.
Table 1 Demographics and clinical information on subjects.
| | NC | sMCI | pMCI | AD | p value |
| --- | --- | --- | --- | --- | --- |
| Number | 227 | 153 | 166 | 141 | - |
| Sex (F/M) | 113/114 | 63/90 | 65/101 | 60/81 | 2.12×10⁻¹ |
| Age | 73.36 (±6.66) | 71.56 (±7.30) | 74.42 (±7.04) | 74.59 (±7.57) | 4.16×10⁻⁴ |
| Education | 16.77 (±2.43) | 15.98 (±2.87) | 15.88 (±2.92) | 15.82 (±2.48) | 8.64×10⁻⁴ |
| MMSE | 28.90 (±1.45) | 28.02 (±1.78) | 21.67 (±5.36) | 19.97 (±4.27) | 2.55×10⁻⁹⁰ |
| CDRSB | 0.19 (±0.65) | 1.18 (±0.81) | 5.34 (±3.23) | 6.33 (±3.28) | 1.37×10⁻¹⁰³ |
(NC: Normal Control, MCI: Mild Cognitive Impairment, AD: Alzheimer’s Disease, pMCI: Progressive MCI (converted to AD within 36 months), sMCI: Stable MCI (no conversion to AD within 36 months), Education: average number of years of education of the subject group, MMSE: Mini-Mental State Examination score, CDRSB: Clinical Dementia Rating Sum of Boxes)
Data preprocessing
Preprocessing of the sMRI images was performed using SPM12 software. The sMRI images were first subjected to realignment and head-movement correction to screen out images with excessive head movement; the exclusion criteria were displacements greater than 2 mm or rotations greater than 2°. The motion-corrected images were then normalized to MNI (Montreal Neurological Institute) space, and the skull was stripped to obtain 3D brain images of size 121×145×121.
Preprocessing of the fMRI images was performed using GRETNA software. First, the initial 10 time points of each fMRI scan, which were unstable because of machine factors, were removed. Next, slice-timing correction was applied to ensure uniformity of acquisition time across slices within a scan cycle. Head-movement correction was then performed to assess each subject’s head motion and adjust for misalignments between images acquired at different moments. The individual echo-planar imaging (EPI) images were subsequently aligned directly to a standard EPI template. Finally, the fMRI data were parcellated into 90 ROI nodes using the AAL90 template to construct a functional brain network.
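As an illustrative sketch of this last step (NumPy only; the variable name `roi_signals` and the use of Pearson correlation as the connectivity measure are our assumptions, not details taken from the GRETNA pipeline), a 90×90 functional network can be built from the mean ROI time series as follows:

```python
import numpy as np

def functional_network(roi_signals: np.ndarray) -> np.ndarray:
    """Build a functional brain network from ROI time series.

    roi_signals : array of shape (timepoints, 90), the mean BOLD signal of
                  each AAL90 region after preprocessing.
    Returns a 90x90 matrix of Pearson correlations between regions.
    """
    # np.corrcoef expects variables in rows, so transpose to (90, timepoints)
    conn = np.corrcoef(roi_signals.T)
    np.fill_diagonal(conn, 0.0)  # remove self-connections
    return conn

# Example with synthetic data: 130 retained time points, 90 ROIs
rng = np.random.default_rng(0)
network = functional_network(rng.standard_normal((130, 90)))
print(network.shape)  # (90, 90)
```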
Model framework
Figure 1 illustrates our proposed CSEP coupling (CSEPC) framework. The framework consists of three main modules: an Intra-modality module, an Inter-modality module, and a classifier. We first preprocess the fMRI and sMRI data to obtain image data matrices and further transform them into tensor matrices at various levels. These tensor matrices are then input into the Intra-modality module, where cross-scale balanced features are extracted by the structural and functional encoders and each type of feature is compressed into a 64×32 feature map. The Inter-modality module is then employed to calculate the cosine coupling matrix between the structural and functional feature maps. Finally, the classifier module applies contrastive learning to the coupling features between fMRI and sMRI to obtain the final classification results.
Intra-modality module
Our Intra-modality feature encoder consists of two modules: the CSEP module (Fig. 2a) and the residual block (Fig. 2b). Before encoding, we first divide the image into different levels through upsampling and downsampling. We then extract intra-modality multiscale features from different receptive fields using convolution strategies with various kernel sizes and strides. For sMRI, convolution kernels of different sizes capture structural features at different scales; for fMRI, kernels of different sizes explore functional correlations between different brain regions.
At the same time, to reduce the number of parameters and adapt to small-sample medical imaging data, the CSEP module avoids using standard convolution kernels of sizes 3×3×3, 5×5×5, and 7×7×7 to extract multiscale features, as this would significantly increase the number of network parameters and impede model training. Instead, we adopt a dilated convolution strategy (Fig. 2a) that expands the receptive field of the convolution kernel without increasing the number of parameters, thereby providing a wider range of information in each convolution output. For instance, a 3×3×3 dilated convolution kernel with a dilation coefficient of 3 achieves the same coverage as a 7×7×7 kernel while requiring only 7.9% of the larger kernel’s parameters. The equivalent receptive field \(K^{\prime}\) is calculated as follows:
$$K^{\prime}=K+\left(K-1\right)\times\left(d-1\right),\qquad(1)$$
where \(K\) denotes the convolution kernel size, and \(d\) denotes the dilation factor.
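As a quick check (a PyTorch sketch of our own, not the authors' implementation), Eq. (1) and the 7.9% parameter figure can be verified directly from the layer weights:

```python
import torch.nn as nn

def equivalent_receptive_field(k: int, d: int) -> int:
    """Eq. (1): effective kernel size of a dilated convolution."""
    return k + (k - 1) * (d - 1)

# A 3x3x3 kernel with dilation 3 covers the same extent as a dense 7x7x7 kernel
assert equivalent_receptive_field(3, 3) == 7

dilated = nn.Conv3d(1, 1, kernel_size=3, dilation=3, bias=False)
dense = nn.Conv3d(1, 1, kernel_size=7, bias=False)

n_dilated = sum(p.numel() for p in dilated.parameters())  # 27 weights
n_dense = sum(p.numel() for p in dense.parameters())      # 343 weights
print(f"{n_dilated / n_dense:.1%}")  # ~7.9% of the dense-kernel parameters
```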
After determining the convolution strategy for each scale, we proceed to integrate the multiscale features. The specific process is as follows:
Let \(\left\{F_{1},F_{2},F_{3},\ldots,F_{n},\ldots,F_{N}\right\}\in S\) denote the dataset containing N samples and \(\left\{Y_{1},Y_{2},Y_{3},\ldots,Y_{n},\ldots,Y_{N}\right\}\in Y\) denote the corresponding labels.
First, to obtain multiscale input data, each input sample \(F_{n}\) is divided into different levels, \(\left\{\mathrm{Level}_{1}^{n},\mathrm{Level}_{2}^{n},\mathrm{Level}_{3}^{n},\ldots,\mathrm{Level}_{k}^{n},\ldots,\mathrm{Level}_{K}^{n}\right\}\in F_{n}\).
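A minimal sketch of this level construction, assuming the levels are produced by trilinear resampling with hypothetical scale factors (the exact factors and number of levels are not specified here):

```python
import torch
import torch.nn.functional as F

def build_levels(volume: torch.Tensor, scales=(0.5, 1.0, 2.0)):
    """Divide one input volume F_n into K levels by resampling.

    volume : tensor of shape (1, 1, D, H, W), e.g. a 121x145x121 sMRI image.
    Returns a list [Level_1, ..., Level_K] at the given scale factors.
    """
    return [
        volume if s == 1.0 else
        F.interpolate(volume, scale_factor=s, mode="trilinear", align_corners=False)
        for s in scales
    ]

levels = build_levels(torch.randn(1, 1, 121, 145, 121))
print([tuple(l.shape[2:]) for l in levels])  # spatial size of each level
```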
After the input data are divided into levels, the network selects the corresponding balanced dilated-convolution strategy for each level to derive the output features \(X\):
$$y_{k}^{n}=W_{k}\,\mathrm{DCon}_{k}\left(\mathrm{Level}_{k}^{n}\right)+W_{COORD}\,\mathrm{DCon}_{k+1}\left(\mathrm{Level}_{k+1}^{n}\right)\qquad(2)$$

$$y_{k+2}^{n}=W_{COORD}^{\prime}\,\mathrm{DCon}_{k+1}\left(\mathrm{Level}_{k+1}^{n}\right)+W_{k+2}\,\mathrm{DCon}_{k+2}\left(\mathrm{Level}_{k+2}^{n}\right)\qquad(3)$$

$$X_{n}=W_{\mathrm{Level}_{k}}\,y_{k}^{n}+W_{\mathrm{Level}_{k+2}}\,y_{k+2}^{n}\qquad(4)$$
where \(y_{k}^{n}\) represents the output feature after \(\mathrm{Level}_{k}^{n}\) and its adjacent features are fused across levels; \(W_{k}\) represents the weight of level k; \(\mathrm{DCon}_{k}\) represents the dilated convolution operation at level k; \(W_{COORD}\) and \(W_{COORD}^{\prime}\) represent the coordination weights of the intermediate scale with respect to its adjacent scales; and \(W_{\mathrm{Level}_{k}}\) represents the cross-scale fusion weight.
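The PyTorch sketch below shows one possible reading of Eqs. (2)–(4) for three adjacent levels; the scalar fusion weights, the illustrative dilation rates, and the trilinear resampling used to align the branches before summation are our own assumptions, as the text does not specify these details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusion(nn.Module):
    """Sketch of Eqs. (2)-(4): fuse three adjacent levels of one modality."""

    def __init__(self, channels: int = 1):
        super().__init__()
        # One dilated convolution per level (DCon_k, DCon_{k+1}, DCon_{k+2});
        # the dilation rates here are illustrative.
        self.dcon = nn.ModuleList([
            nn.Conv3d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 3)
        ])
        # Scalar weights W_k, W_COORD, W'_COORD, W_{k+2}, W_Level_k, W_Level_{k+2}
        self.w = nn.ParameterDict({
            name: nn.Parameter(torch.ones(1))
            for name in ("k", "coord", "coord2", "k2", "level_k", "level_k2")
        })

    def forward(self, level_k, level_k1, level_k2):
        # Align every branch to the spatial size of level_k before summing.
        def at_base(x):
            return F.interpolate(x, size=level_k.shape[2:], mode="trilinear",
                                 align_corners=False)

        mid = at_base(self.dcon[1](level_k1))  # shared intermediate scale
        y_k = self.w["k"] * self.dcon[0](level_k) + self.w["coord"] * mid          # Eq. (2)
        y_k2 = self.w["coord2"] * mid + self.w["k2"] * at_base(self.dcon[2](level_k2))  # Eq. (3)
        return self.w["level_k"] * y_k + self.w["level_k2"] * y_k2                 # Eq. (4)

fusion = CrossScaleFusion()
x = torch.randn(1, 1, 32, 32, 32)
out = fusion(F.interpolate(x, scale_factor=0.5, mode="trilinear", align_corners=False),
             x,
             F.interpolate(x, scale_factor=2.0, mode="trilinear", align_corners=False))
print(out.shape)  # fused feature X_n at the base level's resolution
```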
Finally, the set of output features is input into a ResNet block for further feature extraction. Figure 2b shows the ResNet architecture. A residual block comprises two paths, \(H\left(X_{n}\right)\) and \(X_{n}\). Path \(H\left(X_{n}\right)\), the residual path, includes three convolution steps, each followed by a normalization step; path \(X_{n}\) is the identity path that carries the input of the residual block. The sum of the two paths is then passed through a rectified linear unit (ReLU) activation function to yield the final extracted feature \(g_{n}\). The module then converts these features into 64×32 feature matrices using matrix correction. The calculation formula is as follows:
$$g_{n}=\max\left(0,\,H\left(X_{n}\right)+X_{n}\right)\qquad(5)$$
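A compact PyTorch version of this residual block, following the three convolution-plus-normalization steps on the residual path and the final ReLU of Eq. (5); the channel count and kernel size are placeholders:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block: g_n = ReLU(H(X_n) + X_n), Eq. (5)."""

    def __init__(self, channels: int):
        super().__init__()
        layers = []
        for _ in range(3):  # three conv + normalization steps on the H(X_n) path
            layers += [nn.Conv3d(channels, channels, 3, padding=1),
                       nn.BatchNorm3d(channels)]
        self.residual_path = nn.Sequential(*layers)

    def forward(self, x):
        # Sum of the residual path and the identity path, then ReLU
        return torch.relu(self.residual_path(x) + x)

block = ResidualBlock(channels=8)
print(block(torch.randn(1, 8, 16, 16, 16)).shape)  # shape unchanged
```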
Inter-modality module
Following the feature extraction process described above, we obtained the cross-scale equalization features of sMRI and fMRI. To ensure that the model comprehensively captures inter-modality features and accounts for their coupling relationships, we incorporated the coupling matrices of these two features as merged features into the classifier for contrastive learning of the associations between modalities. We used cosine similarity to calculate the coupling matrices of the spatiotemporal features and then mapped them onto a probability map using the softmax function, constraining the feature values between 0 and 1. This normalization guarantees a unified measurement for the coupling matrix in subsequent processing. The formula is as follows:
$$G_{n}=\mathrm{Softmax}\left(\frac{W_{s}\,g_{n}^{\prime}\cdot W_{f}\,g_{m}^{\prime\prime}}{\sqrt{d_{k}}}\right)\qquad(6)$$
where \(g_{n}^{\prime}\) and \(g_{m}^{\prime\prime}\) represent the feature maps of sMRI and fMRI, respectively, after processing by the Intra-modality module, and \(d_{k}\) denotes the number of columns in the coupled feature map, which is set to 64 in this study.
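In code, Eq. (6) amounts to a scaled dot product between the two projected feature maps followed by a row-wise softmax. The sketch below is our own reading of it; modeling \(W_{s}\) and \(W_{f}\) as learnable linear maps is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineCoupling(nn.Module):
    """Eq. (6): coupling matrix between sMRI and fMRI feature maps."""

    def __init__(self, dim: int = 32, d_k: int = 64):
        super().__init__()
        self.d_k = d_k
        self.w_s = nn.Linear(dim, dim, bias=False)  # W_s applied to the sMRI map
        self.w_f = nn.Linear(dim, dim, bias=False)  # W_f applied to the fMRI map

    def forward(self, g_s: torch.Tensor, g_f: torch.Tensor) -> torch.Tensor:
        # g_s, g_f: (64, 32) feature maps from the Intra-modality module
        scores = self.w_s(g_s) @ self.w_f(g_f).T / self.d_k ** 0.5  # (64, 64)
        return F.softmax(scores, dim=-1)  # each row sums to 1, values in (0, 1)

coupling = CosineCoupling()
G = coupling(torch.randn(64, 32), torch.randn(64, 32))
print(G.shape, float(G[0].sum()))  # torch.Size([64, 64]), each row sums to ~1
```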
We employ a random matching approach to fuse the inter-modality features: each sMRI feature is paired with an fMRI feature, so that all possible matching combinations between modalities are taken into account. By calculating the coupling matrix, we project the inter-modality contrast features into the label semantic space; the cosine similarity operation resolves the fusion difficulty caused by differing data dimensions. Finally, the fused features are input into the classifier to complete the corresponding classification tasks.
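A minimal sketch of how such random matching could be implemented, under our interpretation that every sMRI feature map is paired with a randomly drawn fMRI feature map before the coupling matrix of Eq. (6) is computed:

```python
import torch

def random_pairs(smri_feats: torch.Tensor, fmri_feats: torch.Tensor):
    """Randomly pair every sMRI feature map with an fMRI feature map.

    smri_feats, fmri_feats : tensors of shape (N, 64, 32).
    Yields (g_n', g_m'') pairs covering all sMRI samples in a random fMRI order.
    """
    perm = torch.randperm(fmri_feats.shape[0])
    for n, m in enumerate(perm.tolist()):
        yield smri_feats[n], fmri_feats[m]

for g_s, g_f in random_pairs(torch.randn(4, 64, 32), torch.randn(4, 64, 32)):
    print(g_s.shape, g_f.shape)  # each pair feeds the coupling computation
```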
Classifier module
In this study, a CNN was selected as the classifier of our model, as shown in Fig. 2c. The four-layer CNN classifier designed in this study learns the global features in the coupling matrix and achieves higher classification performance. The CNN adopts a downsampling convolution strategy that further condenses and extracts the key features among the coupled features, thereby improving the final classification accuracy.
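A sketch of a four-layer downsampling CNN of the kind described; the channel widths, strides, and two-way output head are placeholders, since this section only specifies four layers and a downsampling convolution strategy:

```python
import torch
import torch.nn as nn

class CouplingClassifier(nn.Module):
    """Four-layer CNN over the 64x64 coupling matrix (illustrative sizes)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        channels = [1, 8, 16, 32, 64]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            # stride-2 convolutions downsample the coupling matrix at every layer
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*layers)
        self.head = nn.Linear(64 * 4 * 4, num_classes)  # 64 -> 32 -> 16 -> 8 -> 4

    def forward(self, coupling_matrix):
        # coupling_matrix: (batch, 1, 64, 64), the softmax-normalized G_n
        x = self.features(coupling_matrix)
        return self.head(x.flatten(1))

logits = CouplingClassifier()(torch.randn(1, 1, 64, 64))
print(logits.shape)  # (1, 2)
```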
K-Fold cross-validation
Because of the limited sample size, the CSEPC network model may encounter overfitting problems. To reduce their impact on the experimental results, we performed K-fold cross-validation with K set to 10. The subject samples were randomly divided into ten subsamples of equal size; one subsample was used as the test set to validate the classification performance of the model, while the remaining nine subsamples were used as training data. This cross-validation procedure was repeated 10 times, with each subsample used exactly once as the test set. Following this procedure allows us to check whether the model suffers from overfitting.
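The split can be reproduced with scikit-learn's `KFold` (a sketch; the random seed and the placeholder subject indices are our own, and the training/evaluation calls are left as comments):

```python
import numpy as np
from sklearn.model_selection import KFold

subjects = np.arange(100)  # placeholder subject indices

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(subjects), start=1):
    # In the actual experiments, the CSEPC model would be trained on the nine
    # training subsamples and evaluated on the held-out test subsample here.
    print(f"fold {fold}: {len(train_idx)} training / {len(test_idx)} test subjects")
```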
Experimental setup
We implemented the proposed deep-learning approach using the PyTorch framework. All experiments were run on an NVIDIA Tesla V100 16 GB GPU under Windows Server 2016. In the CSEPC training process, we trained the sMRI_CSEP and fMRI_CSEP models separately. We used stochastic gradient descent (SGD) as the optimizer during training and set the batch size to 1.
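An outline of a training step under these settings (the learning rate, loss function, number of epochs, and the stand-in model are placeholders, as they are not reported in this section):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the CSEPC classifier: any nn.Module mapping a 64x64 coupling
# matrix to two logits would fit here.
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 2))

# Placeholder data: 64x64 coupling matrices with binary labels.
dataset = TensorDataset(torch.randn(8, 1, 64, 64), torch.randint(0, 2, (8,)))
loader = DataLoader(dataset, batch_size=1, shuffle=True)  # batch size 1, as in the paper

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # SGD, as in the paper; lr is a placeholder
criterion = nn.CrossEntropyLoss()                          # placeholder loss

for epoch in range(2):                                     # placeholder epoch count
    for coupling_matrix, label in loader:
        optimizer.zero_grad()
        loss = criterion(model(coupling_matrix), label)
        loss.backward()
        optimizer.step()
```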