Dataset
Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. Its primary goal has been to test whether serial MRI, PET, other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of MCI and early AD. The dataset included data from 267 people in an AD group, 230 people in a normal control (NC) group, and 328 people in an MCI group. The MCI group was further divided into progressive MCI (pMCI) and stable MCI (sMCI) groups according to whether the patient developed AD within the following 36 months, yielding 166 pMCI and 162 sMCI patients. The clinical information for each participant group, such as sex, age, and Mini-Mental State Examination (MMSE) score, is presented in Table 1.
Table 1 Demographics and clinical information on subjects.
| | NC | sMCI | pMCI | AD | p value |
| --- | --- | --- | --- | --- | --- |
| Number | 227 | 153 | 166 | 141 | - |
| Sex (F/M) | 113/114 | 63/90 | 65/101 | 60/81 | 2.12×10⁻¹ |
| Age | 73.36 (±6.66) | 71.56 (±7.30) | 74.42 (±7.04) | 74.59 (±7.57) | 4.16×10⁻⁴ |
| Education | 16.77 (±2.43) | 15.98 (±2.87) | 15.88 (±2.92) | 15.82 (±2.48) | 8.64×10⁻⁴ |
| MMSE | 28.90 (±1.45) | 28.02 (±1.78) | 21.67 (±5.36) | 19.97 (±4.27) | 2.55×10⁻⁹⁰ |
| CDRSB | 0.19 (±0.65) | 1.18 (±0.81) | 5.34 (±3.23) | 6.33 (±3.28) | 1.37×10⁻¹⁰³ |
(NC: Normal Control, MCI: Mild Cognitive Impairment, AD: Alzheimer’s Disease, pMCI: Progressive MCI (converted to AD within 36 months), sMCI: Stable MCI (no conversion to AD within 36 months), Education: average number of years of education of the subject group, MMSE: Mini-Mental State Examination score, CDRSB: Clinical Dementia Rating Sum of Boxes)
Data preprocessing
Preprocessing of the sMRI images was performed using SPM12 software. The sMRI images were first subjected to realignment and head-movement correction to screen out images with excessive head movement; the exclusion criteria were displacements greater than 2 mm or rotations greater than 2°. The motion-corrected images were then normalized to MNI (Montreal Neurological Institute) space, and the skull was stripped to obtain 3D brain images of size 121×145×121.
Preprocessing of the fMRI images was performed using GRETNA software. First, the initial 10 time points of each fMRI scan, which were unstable because of machine factors, were removed. Next, slice-timing correction was applied to ensure uniformity of acquisition time across slices within a scan cycle. Head-movement correction was then performed to assess each subject’s head motion and adjust for misalignments between images acquired at different moments. The individual echo-planar imaging (EPI) images were subsequently aligned directly to a standard EPI template. Finally, the fMRI data were parcellated into 90 ROI nodes using the AAL90 template to construct a functional brain network.
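As an illustrative sketch of this last step (NumPy only; the variable name `roi_signals` and the use of Pearson correlation as the connectivity measure are our assumptions, not details taken from the GRETNA pipeline), a 90×90 functional network can be built from the mean ROI time series as follows:

```python
import numpy as np

def functional_network(roi_signals: np.ndarray) -> np.ndarray:
    """Build a functional brain network from ROI time series.

    roi_signals : array of shape (timepoints, 90), the mean BOLD signal of
                  each AAL90 region after preprocessing.
    Returns a 90x90 matrix of Pearson correlations between regions.
    """
    # np.corrcoef expects variables in rows, so transpose to (90, timepoints)
    conn = np.corrcoef(roi_signals.T)
    np.fill_diagonal(conn, 0.0)  # remove self-connections
    return conn

# Example with synthetic data: 130 retained time points, 90 ROIs
rng = np.random.default_rng(0)
network = functional_network(rng.standard_normal((130, 90)))
print(network.shape)  # (90, 90)
```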
Model framework
Figure 1 illustrates our proposed CSEP coupling (CSEPC) framework. The framework consists of three main modules: an Intra-modality module, an Inter-modality module, and a classifier. We first preprocess the fMRI and sMRI data to obtain image data matrices and further transform them into tensor matrices at various levels. These tensor matrices are then input into the Intra-modality module, where cross-scale balanced features are extracted by the structural and functional encoders and each type of feature is compressed into a 64×32 feature map. The Inter-modality module is then employed to calculate the cosine coupling matrix between the structural and functional feature maps. Finally, the classifier module applies contrastive learning to the coupling features between fMRI and sMRI to obtain the final classification results.
Intra-modality module
Our Intra-modality feature encoder consists of two modules: the CSEP module (Fig. 2a) and the residual block (Fig. 2b). Before encoding, we first divide the image into different levels through upsampling and downsampling. We then extract intra-modality multiscale features from different receptive fields using convolution strategies with various kernel sizes and strides. For sMRI, convolution kernels of different sizes capture structural features at different scales; for fMRI, kernels of different sizes explore functional correlations between different brain regions.
At the same time, to reduce the number of parameters and adapt to small-sample medical imaging data, the CSEP module avoids using standard convolution kernels of sizes 3×3×3, 5×5×5, and 7×7×7 to extract multiscale features, as this would significantly increase the number of network parameters and impede model training. Instead, we adopt a dilated convolution strategy (Fig. 2a) that expands the receptive field of the convolution kernel without increasing the number of parameters, thereby providing a wider range of information in each convolution output. For instance, a 3×3×3 dilated convolution kernel with a dilation coefficient of 3 achieves the same coverage as a 7×7×7 kernel while requiring only 7.9% of the larger kernel’s parameters. The equivalent receptive field \(K^{\prime}\) is calculated as follows:
$$K^{\prime}=K+\left(K-1\right)\times\left(d-1\right),\qquad(1)$$
where \(K\) denotes the convolution kernel size, and \(d\) denotes the dilation factor.
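As a quick check (a PyTorch sketch of our own, not the authors' implementation), Eq. (1) and the 7.9% parameter figure can be verified directly from the layer weights:

```python
import torch.nn as nn

def equivalent_receptive_field(k: int, d: int) -> int:
    """Eq. (1): effective kernel size of a dilated convolution."""
    return k + (k - 1) * (d - 1)

# A 3x3x3 kernel with dilation 3 covers the same extent as a dense 7x7x7 kernel
assert equivalent_receptive_field(3, 3) == 7

dilated = nn.Conv3d(1, 1, kernel_size=3, dilation=3, bias=False)
dense = nn.Conv3d(1, 1, kernel_size=7, bias=False)

n_dilated = sum(p.numel() for p in dilated.parameters())  # 27 weights
n_dense = sum(p.numel() for p in dense.parameters())      # 343 weights
print(f"{n_dilated / n_dense:.1%}")  # ~7.9% of the dense-kernel parameters
```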
After determining the convolution strategy for each scale, we proceed to integrate the multiscale features. The specific process is as follows:
Let \(\left\{F_{1},F_{2},F_{3},\ldots,F_{n},\ldots,F_{N}\right\}\in S\) denote the dataset containing N samples and \(\left\{Y_{1},Y_{2},Y_{3},\ldots,Y_{n},\ldots,Y_{N}\right\}\in Y\) denote the corresponding labels.
First, to obtain multiscale input data, each input sample \(F_{n}\) is divided into different levels, \(\left\{\mathrm{Level}_{1}^{n},\mathrm{Level}_{2}^{n},\mathrm{Level}_{3}^{n},\ldots,\mathrm{Level}_{k}^{n},\ldots,\mathrm{Level}_{K}^{n}\right\}\in F_{n}\).
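A minimal sketch of this level construction, assuming the levels are produced by trilinear resampling with hypothetical scale factors (the exact factors and number of levels are not specified here):

```python
import torch
import torch.nn.functional as F

def build_levels(volume: torch.Tensor, scales=(0.5, 1.0, 2.0)):
    """Divide one input volume F_n into K levels by resampling.

    volume : tensor of shape (1, 1, D, H, W), e.g. a 121x145x121 sMRI image.
    Returns a list [Level_1, ..., Level_K] at the given scale factors.
    """
    return [
        volume if s == 1.0 else
        F.interpolate(volume, scale_factor=s, mode="trilinear", align_corners=False)
        for s in scales
    ]

levels = build_levels(torch.randn(1, 1, 121, 145, 121))
print([tuple(l.shape[2:]) for l in levels])  # spatial size of each level
```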
After the input data are divided into levels, the network selects the corresponding balanced dilated-convolution strategy for each level to derive the output features \(X\):
$$y_{k}^{n}=W_{k}\,\mathrm{DCon}_{k}\left(\mathrm{Level}_{k}^{n}\right)+W_{COORD}\,\mathrm{DCon}_{k+1}\left(\mathrm{Level}_{k+1}^{n}\right)\qquad(2)$$

$$y_{k+2}^{n}=W_{COORD}^{\prime}\,\mathrm{DCon}_{k+1}\left(\mathrm{Level}_{k+1}^{n}\right)+W_{k+2}\,\mathrm{DCon}_{k+2}\left(\mathrm{Level}_{k+2}^{n}\right)\qquad(3)$$

$$X_{n}=W_{\mathrm{Level}_{k}}\,y_{k}^{n}+W_{\mathrm{Level}_{k+2}}\,y_{k+2}^{n}\qquad(4)$$
where \(y_{k}^{n}\) represents the output feature after \(\mathrm{Level}_{k}^{n}\) and its adjacent features are fused across levels; \(W_{k}\) represents the weight of level k; \(\mathrm{DCon}_{k}\) represents the dilated convolution operation at level k; \(W_{COORD}\) and \(W_{COORD}^{\prime}\) represent the coordination weights of the intermediate scale with respect to its adjacent scales; and \(W_{\mathrm{Level}_{k}}\) represents the cross-scale fusion weight.
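The PyTorch sketch below shows one possible reading of Eqs. (2)–(4) for three adjacent levels; the scalar fusion weights, the illustrative dilation rates, and the trilinear resampling used to align the branches before summation are our own assumptions, as the text does not specify these details:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusion(nn.Module):
    """Sketch of Eqs. (2)-(4): fuse three adjacent levels of one modality."""

    def __init__(self, channels: int = 1):
        super().__init__()
        # One dilated convolution per level (DCon_k, DCon_{k+1}, DCon_{k+2});
        # the dilation rates here are illustrative.
        self.dcon = nn.ModuleList([
            nn.Conv3d(channels, channels, 3, padding=d, dilation=d)
            for d in (1, 2, 3)
        ])
        # Scalar weights W_k, W_COORD, W'_COORD, W_{k+2}, W_Level_k, W_Level_{k+2}
        self.w = nn.ParameterDict({
            name: nn.Parameter(torch.ones(1))
            for name in ("k", "coord", "coord2", "k2", "level_k", "level_k2")
        })

    def forward(self, level_k, level_k1, level_k2):
        # Align every branch to the spatial size of level_k before summing.
        def at_base(x):
            return F.interpolate(x, size=level_k.shape[2:], mode="trilinear",
                                 align_corners=False)

        mid = at_base(self.dcon[1](level_k1))  # shared intermediate scale
        y_k = self.w["k"] * self.dcon[0](level_k) + self.w["coord"] * mid          # Eq. (2)
        y_k2 = self.w["coord2"] * mid + self.w["k2"] * at_base(self.dcon[2](level_k2))  # Eq. (3)
        return self.w["level_k"] * y_k + self.w["level_k2"] * y_k2                 # Eq. (4)

fusion = CrossScaleFusion()
x = torch.randn(1, 1, 32, 32, 32)
out = fusion(F.interpolate(x, scale_factor=0.5, mode="trilinear", align_corners=False),
             x,
             F.interpolate(x, scale_factor=2.0, mode="trilinear", align_corners=False))
print(out.shape)  # fused feature X_n at the base level's resolution
```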
Finally, the set of output features is input into a ResNet block for further feature extraction. Figure 2b shows the ResNet architecture. A residual block comprises two paths, \(H\left(X_{n}\right)\) and \(X_{n}\). Path \(H\left(X_{n}\right)\), the residual path, includes three convolution steps, each followed by a normalization step; path \(X_{n}\) is the identity path that carries the input of the residual block. The sum of the two paths is then passed through a rectified linear unit (ReLU) activation function to yield the final extracted feature \(g_{n}\). The module then converts these features into 64×32 feature matrices using matrix correction. The calculation formula is as follows:
$$g_{n}=\max\left(0,\,H\left(X_{n}\right)+X_{n}\right)\qquad(5)$$
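A compact PyTorch version of this residual block, following the three convolution-plus-normalization steps on the residual path and the final ReLU of Eq. (5); the channel count and kernel size are placeholders:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block: g_n = ReLU(H(X_n) + X_n), Eq. (5)."""

    def __init__(self, channels: int):
        super().__init__()
        layers = []
        for _ in range(3):  # three conv + normalization steps on the H(X_n) path
            layers += [nn.Conv3d(channels, channels, 3, padding=1),
                       nn.BatchNorm3d(channels)]
        self.residual_path = nn.Sequential(*layers)

    def forward(self, x):
        # Sum of the residual path and the identity path, then ReLU
        return torch.relu(self.residual_path(x) + x)

block = ResidualBlock(channels=8)
print(block(torch.randn(1, 8, 16, 16, 16)).shape)  # shape unchanged
```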
Inter-modality module
Following the feature extraction process described above, we obtained the cross-scale equalization features of sMRI and fMRI. To ensure that the model comprehensively captures inter-modality features and accounts for their coupling relationships, we incorporated the coupling matrices of these two features as merged features into the classifier for contrastive learning of the associations between modalities. We used cosine similarity to calculate the coupling matrices of the spatiotemporal features and then mapped them onto a probability map using the softmax function, constraining the feature values between 0 and 1. This normalization guarantees a unified measurement for the coupling matrix in subsequent processing. The formula is as follows:
$$G_{n}=\mathrm{Softmax}\left(\frac{W_{s}\,g_{n}^{\prime}\cdot W_{f}\,g_{m}^{\prime\prime}}{\sqrt{d_{k}}}\right)\qquad(6)$$
where \(g_{n}^{\prime}\) and \(g_{m}^{\prime\prime}\) represent the feature maps of sMRI and fMRI, respectively, after processing by the Intra-modality module, and \(d_{k}\) denotes the number of columns in the coupled feature map, which is set to 64 in this study.
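In code, Eq. (6) amounts to a scaled dot product between the two projected feature maps followed by a row-wise softmax. The sketch below is our own reading of it; modeling \(W_{s}\) and \(W_{f}\) as learnable linear maps is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineCoupling(nn.Module):
    """Eq. (6): coupling matrix between sMRI and fMRI feature maps."""

    def __init__(self, dim: int = 32, d_k: int = 64):
        super().__init__()
        self.d_k = d_k
        self.w_s = nn.Linear(dim, dim, bias=False)  # W_s applied to the sMRI map
        self.w_f = nn.Linear(dim, dim, bias=False)  # W_f applied to the fMRI map

    def forward(self, g_s: torch.Tensor, g_f: torch.Tensor) -> torch.Tensor:
        # g_s, g_f: (64, 32) feature maps from the Intra-modality module
        scores = self.w_s(g_s) @ self.w_f(g_f).T / self.d_k ** 0.5  # (64, 64)
        return F.softmax(scores, dim=-1)  # each row sums to 1, values in (0, 1)

coupling = CosineCoupling()
G = coupling(torch.randn(64, 32), torch.randn(64, 32))
print(G.shape, float(G[0].sum()))  # torch.Size([64, 64]), each row sums to ~1
```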
We employ a random matching approach to fuse the inter-modality features: each sMRI feature is paired with an fMRI feature, so that all possible matching combinations between modalities are taken into account. By calculating the coupling matrix, we project the inter-modality contrast features into the label semantic space; the cosine similarity operation resolves the fusion difficulty caused by differing data dimensions. Finally, the fused features are input into the classifier to complete the corresponding classification tasks.
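A minimal sketch of how such random matching could be implemented, under our interpretation that every sMRI feature map is paired with a randomly drawn fMRI feature map before the coupling matrix of Eq. (6) is computed:

```python
import torch

def random_pairs(smri_feats: torch.Tensor, fmri_feats: torch.Tensor):
    """Randomly pair every sMRI feature map with an fMRI feature map.

    smri_feats, fmri_feats : tensors of shape (N, 64, 32).
    Yields (g_n', g_m'') pairs covering all sMRI samples in a random fMRI order.
    """
    perm = torch.randperm(fmri_feats.shape[0])
    for n, m in enumerate(perm.tolist()):
        yield smri_feats[n], fmri_feats[m]

for g_s, g_f in random_pairs(torch.randn(4, 64, 32), torch.randn(4, 64, 32)):
    print(g_s.shape, g_f.shape)  # each pair feeds the coupling computation
```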
Classifier module
In this study, a CNN was selected as the classifier of our model, as shown in Fig. 2c. The four-layer CNN classifier designed in this study learns the global features in the coupling matrix and achieves higher classification performance. The CNN adopts a downsampling convolution strategy that further condenses and extracts the key features among the coupled features, thereby improving the final classification accuracy.
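A sketch of a four-layer downsampling CNN of the kind described; the channel widths, strides, and two-way output head are placeholders, since this section only specifies four layers and a downsampling convolution strategy:

```python
import torch
import torch.nn as nn

class CouplingClassifier(nn.Module):
    """Four-layer CNN over the 64x64 coupling matrix (illustrative sizes)."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        channels = [1, 8, 16, 32, 64]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            # stride-2 convolutions downsample the coupling matrix at every layer
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
        self.features = nn.Sequential(*layers)
        self.head = nn.Linear(64 * 4 * 4, num_classes)  # 64 -> 32 -> 16 -> 8 -> 4

    def forward(self, coupling_matrix):
        # coupling_matrix: (batch, 1, 64, 64), the softmax-normalized G_n
        x = self.features(coupling_matrix)
        return self.head(x.flatten(1))

logits = CouplingClassifier()(torch.randn(1, 1, 64, 64))
print(logits.shape)  # (1, 2)
```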
K-Fold cross-validation
Because of the limited sample size, the CSEPC network model may encounter overfitting problems. To reduce their impact on the experimental results, we performed K-fold cross-validation with K set to 10. The subject samples were randomly divided into ten subsamples of equal size; one subsample was used as the test set to validate the classification performance of the model, while the remaining nine subsamples were used as training data. This cross-validation procedure was repeated 10 times, with each subsample used exactly once as the test set. Following this procedure allows us to check whether the model suffers from overfitting.
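The split can be reproduced with scikit-learn's `KFold` (a sketch; the random seed and the placeholder subject indices are our own, and the training/evaluation calls are left as comments):

```python
import numpy as np
from sklearn.model_selection import KFold

subjects = np.arange(100)  # placeholder subject indices

kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(subjects), start=1):
    # In the actual experiments, the CSEPC model would be trained on the nine
    # training subsamples and evaluated on the held-out test subsample here.
    print(f"fold {fold}: {len(train_idx)} training / {len(test_idx)} test subjects")
```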
Experimental setup
We implemented the proposed deep-learning approach using the PyTorch framework. All experiments were run on an NVIDIA Tesla V100 16 GB GPU under Windows Server 2016. In the CSEPC training process, we trained the sMRI_CSEP and fMRI_CSEP models separately. We used stochastic gradient descent (SGD) as the optimizer during training and set the batch size to 1.
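An outline of a training step under these settings (the learning rate, loss function, number of epochs, and the stand-in model are placeholders, as they are not reported in this section):

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for the CSEPC classifier: any nn.Module mapping a 64x64 coupling
# matrix to two logits would fit here.
model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 2))

# Placeholder data: 64x64 coupling matrices with binary labels.
dataset = TensorDataset(torch.randn(8, 1, 64, 64), torch.randint(0, 2, (8,)))
loader = DataLoader(dataset, batch_size=1, shuffle=True)  # batch size 1, as in the paper

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # SGD, as in the paper; lr is a placeholder
criterion = nn.CrossEntropyLoss()                          # placeholder loss

for epoch in range(2):                                     # placeholder epoch count
    for coupling_matrix, label in loader:
        optimizer.zero_grad()
        loss = criterion(model(coupling_matrix), label)
        loss.backward()
        optimizer.step()
```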