A case-control clinical trial on the diagnostic performance for Alzheimer’s disease of a deep learning-based classification system using brain magnetic resonance imaging

Objective To investigate the diagnostic performance of a deep learning-based classification system using structural brain MRI (DLCS) for Alzheimer’s disease (AD). Methods A single-center, case-control clinical trial was conducted. T1-weighted brain MRI scans of 188 patients with mild cognitive impairment or dementia due to AD and 162 cognitively normal controls were retrospectively collected. The patients were amyloid beta (Aβ)-positive, whereas the controls were Aβ-negative, on 18F-florbetaben positron emission tomography. Sensitivity, specificity, positive predictive value, negative predictive value, and area under the receiver operating characteristic curve were calculated to evaluate the performance of DLCS in the classification of Aβ-positive AD patients from Aβ-negative controls. Results The DLCS was excellent in classifying AD patients from normal controls; sensitivity, specificity, positive predictive value, negative predictive value, and area under the receiver operating characteristic curve for AD were 85.6% (95%CI, 79.8–90.3), 90.1% (95%CI, 84.5–94.2), 91.0% (95%CI, 86.3–94.1), 84.4% (95%CI, 79.2–88.5), and 0.937 (95%CI, 0.911–0.963), respectively.


Introduction
The number of individuals with dementia is increasing globally. More than 130 million people are expected to live with dementia in 2050, [1] with Alzheimer's disease (AD) as the most prevalent type. [2] Since no cure for AD has been developed yet, early diagnosis is crucial for proper management of AD.
However, more than 60% of community-dwelling people with dementia are undiagnosed, due to its insidious nature [3]. To improve the accuracy and advance the timing of AD diagnosis, the National Institute on Aging-Alzheimer's Association proposed new diagnostic criteria for AD that incorporated neuroimaging biomarkers such as amyloid beta (Aβ) deposition and neuronal degeneration [4][5][6][7]. However, while Aβ deposition is an earlier and more specific biomarker of AD than neurodegeneration, assessment of the former (i.e., positron emission tomography [PET]) has many practical drawbacks compared to that of the latter (i.e., magnetic resonance imaging [MRI]), because PET scans are more expensive, involve radiation, and are less available in clinical settings.
Brain MRI is an effective and widely used tool for detecting neuronal loss and structural changes in the brain. Recently, several studies have developed artificial intelligence (AI)-based algorithms for classifying AD using structural brain MRI, with promising performance, including processing time and classification accuracy [8][9][10][11][12][13]. However, in most previous studies [9,10,12,13], training and validation datasets were constructed by randomly splitting a dataset into two. Because the training and validation datasets came from the same population, the performance of the algorithms was likely to be overestimated in those studies. Furthermore, most studies [9-13] did not confirm the presence of Aβ deposition in the AD patients or its absence in the normal controls, despite the fact that about 12% of clinically diagnosed probable AD patients are Aβ-negative [14] and 10-40% of cognitively normal controls are Aβ-positive [15].
In our previous work, we developed a deep learning-based classification system for AD using structural brain MRI (DLCS) as an AI software as a medical device (SaMD) and found its accuracy in classifying probable AD patients from cognitively normal controls to be excellent (0.88-0.94) [16]. However, our previous work shared the same limitations as the previous studies described above. In addition, our previous work did not include mild cognitive impairment (MCI) due to AD in the patient group, which might have exaggerated the performance of the DLCS.
Therefore, in the current clinical trial, we investigated the performance of the DLCS in discriminating Aβ-positive patients with MCI or AD dementia from Aβ-negative cognitively normal controls, all of whom were from a sample independent of the population used for the development of the DLCS.

Study participants
A single-center, case-control clinical trial was conducted and registered in the Korean Clinical Trials Registry (KCT0004758). Data of subjects over 50 years of age who visited Seoul National University Bundang Hospital (SNUBH) and underwent a T1-weighted MRI scan between January 2010 and September 2019 were retrospectively collected. Our data include brain MRI scans with clinical assessment and 18F-florbetaben PET scans from visitors to our dementia clinic as well as from participants of the Korean Longitudinal Study on Cognitive Aging and Dementia (KLOSCAD) [17].
A group of patients with AD and a group with normal cognition (NC) matched for age and sex were screened and enrolled using the following inclusion criteria. The AD group included those who had: (1) a diagnosis of probable or possible AD according to the National Institute of Neurological and Communicative Disorders and Stroke and the Alzheimer's Disease and Related Disorders Association (NINCDS-ADRDA) criteria, or MCI according to the International Working Group on MCI, and (2) amyloid deposition as determined by a positive 18F-florbetaben PET scan. The NC group included those who (1) had no subjective cognitive complaints, (2) had no objective cognitive decline in the Korean version of the Consortium to Establish a Registry for AD (CERAD-K) neuropsychological assessment battery, (3) were functioning independently in the community, and (4) had no amyloid deposition as determined by a negative 18F-florbetaben PET scan. Subjects who had any of the following conditions were excluded: (1) diagnosis of dementia with a cause other than or in addition to AD, i.e., mixed dementia, (2) brain pathologies on T1-weighted MRI that may cause cognitive deficits, (3) more than 1 year between the date of clinical assessment and date of MRI scan (NC and MCI participants only), and (4) white matter hyperintensities with a Fazekas rating of 3 or higher on fluid-attenuated inversion recovery images.
The data of the participants were retrospectively screened and collected from April 27, 2020 to June 5, 2020 (6 weeks). The DLCS was applied to the data between June 8, 2020 and June 19, 2020 (2 weeks).

Sample size calculation
We employed both the sensitivity and specificity of DLCS for AD as primary outcome measures. We calculated the sample size needed to evaluate whether DLCS performed better than a reference, based on a one-sided α of 2.5% (Z_α = 1.96), statistical power of 80% (Z_{1-β} = 0.842), and the results of a pilot study.
The pilot study tested the performance of DLCS using a dataset consisting of 367 AD patients and 316 controls with NC: 130 AD and 130 NC from SNUBH and 237 AD and 186 NC from the Alzheimer's Disease Neuroimaging Initiative database. At a threshold value of 0.38, the DLCS yielded a sensitivity of 82.0% (95% confidence interval [CI], 77.7-85.8%) and specificity of 83.2% (95%CI, 78.6-87.2%). To calculate the sample size n, we used the following formula [18]: where p_0 is the assumed sensitivity/specificity under the null hypothesis H_0, and p_1 is the targeted sensitivity/specificity under the alternative hypothesis H_1. The p_0 and p_1 values were defined as the lower and upper bounds of the 95%CI of the sensitivity and specificity from the pilot study (p_0 = 0.777 and p_1 = 0.858 for sensitivity; p_0 = 0.786 and p_1 = 0.872 for specificity). The null hypothesis was that the sensitivity/specificity of the DLCS is less than or equal to the lower boundary of the assumed sensitivity/specificity. The alternative hypothesis was that it is higher. Based on this, the necessary number of subjects with the disease was 188, and the number of subjects without the disease was 162. Therefore, the final estimated sample size was 350 subjects, consisting of 188 patients with AD and 162 normal controls that were matched for age (<5 years apart) and sex to the AD group.
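The sample size reasoning above can be illustrated with a short sketch. The formula of reference [18] is not reproduced in this text, so the code below assumes a commonly used normal-approximation formula for a one-sided, one-sample test of a proportion, n = (Z_α·sqrt(p_0(1-p_0)) + Z_{1-β}·sqrt(p_1(1-p_1)))² / (p_1-p_0)². The function name is hypothetical. Under this assumption the sensitivity inputs yield 188, in agreement with the reported number of patients, whereas the specificity inputs yield about 160 rather than the reported 162, so the formula actually used may differ in detail.

```python
import math


def sample_size_one_proportion(p0, p1, z_alpha=1.96, z_beta=0.842):
    """Sample size for testing H0: p <= p0 against H1: p = p1.

    Assumed normal-approximation formula (not confirmed to be that of [18]):
        n = (z_alpha*sqrt(p0*(1-p0)) + z_beta*sqrt(p1*(1-p1)))**2 / (p1-p0)**2
    """
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0))
                 + z_beta * math.sqrt(p1 * (1 - p1))) ** 2
    return math.ceil(numerator / (p1 - p0) ** 2)


# p0 and p1 are the lower and upper 95%CI bounds from the pilot study.
n_ad = sample_size_one_proportion(0.777, 0.858)  # sensitivity: AD patients
n_nc = sample_size_one_proportion(0.786, 0.872)  # specificity: NC controls
print(n_ad, n_nc)
```

Because the required number depends on the squared difference p_1 - p_0, small changes in the pilot confidence bounds shift the result noticeably.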
We acquired 18F-florbetaben PET scans in 3D using a Discovery VCT scanner (General Electric Medical Systems, Milwaukee, WI, USA). The subjects were injected with 8.1 mCi (300 MBq) 18F-florbetaben (Neuraceq) through a slow single intravenous bolus (6 MBq) in a total volume of 10 mL. After a 90-min uptake period, 20-min PET images comprising four 5-min dynamic frames were obtained. Images of each time frame were reframed into one summed frame. Board-certified nuclear medicine physicians then determined Aβ-positivity based on visual interpretation of tracer uptake in the gray matter compared to neighboring subcortical white matter in the following four brain regions: the temporal lobes, frontal lobes, posterior cingulate cortex/precuneus, and parietal lobes.

Deep learning-based Alzheimer's disease classification system
We used VUNO Med-DeepBrain AD (version 1.0.0, VUNO Inc., Seoul, South Korea), which is the DLCS for AD. The convolutional neural network model used in VUNO Med-DeepBrain AD has been previously described [16]. Briefly, the DLCS receives a subject's T1-weighted image, extracts coronal slices from areas that span the medial temporal lobe, and feeds each coronal slice as a separate input into a convolutional neural network. The network, which uses Inception-V4 as its backbone, extracts various features that include structural and textural information of the brain from the coronal slice. The feature vector is then concatenated with the subject's age and sex information (which is input to the system at the beginning with the MRI scan) and the location information (slice number) of the coronal slice, and entered into a fully connected network that calculates the probability of the slice belonging to that of a patient with AD. The probabilities of each slice are averaged to calculate a final score that represents the subject's probability of having AD (score ranges from 0 to 1).
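The last two steps of this pipeline, concatenating slice features with covariates and averaging slice-level probabilities, can be sketched as follows. This is only an illustration of the data flow described above, not the implementation of VUNO Med-DeepBrain AD: the function names, the three-element feature vector, the numeric coding of sex, and the example probabilities are all hypothetical, and the convolutional and fully connected networks themselves are omitted.

```python
def build_fc_input(cnn_features, age, sex, slice_index):
    """Concatenate one slice's CNN feature vector with age, sex, and slice number.

    The ordering and the numeric coding of sex are assumptions for illustration.
    """
    return list(cnn_features) + [float(age), float(sex), float(slice_index)]


def subject_score(slice_probabilities):
    """Average per-slice AD probabilities into one subject-level score in [0, 1]."""
    if not slice_probabilities:
        raise ValueError("at least one slice probability is required")
    return sum(slice_probabilities) / len(slice_probabilities)


# Hypothetical values: a 3-dimensional feature vector for coronal slice 17.
features = [0.12, -0.40, 0.93]
fc_input = build_fc_input(features, age=74, sex=1, slice_index=17)

# Hypothetical slice-level outputs of the fully connected network.
score = subject_score([0.62, 0.71, 0.55, 0.80])
print(len(fc_input), round(score, 2))
```

Averaging gives every slice the same weight, so a single slice with an extreme probability has limited influence on the subject-level score.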
In this clinical trial, we processed the MRI data of subjects anonymously, omitting information that could identify the individual (name, sex, birth date, and hospital number). A researcher (K.J.S.), who was blinded to the subjects' clinical diagnoses and did not participate in the construction of the study dataset, performed the processing of the subjects' data with DLCS. The DLCS was installed on a desktop PC with the following specifications: Intel hexa-core 2.90 GHz CPU with 16 GB RAM running on Ubuntu 18.04.4 LTS.

Statistical analysis
We evaluated the accuracy of the DLCS in the diagnosis of AD by comparing its output (a continuous probability ranging from 0 to 1) with the subject's clinical diagnosis. We defined sensitivity and specificity as the primary outcomes, and the area under the receiver operating characteristic curve (AUC) as the secondary outcome.
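How a continuous score is turned into these outcome measures can be shown with a minimal sketch. The scores below are invented for illustration, the function names are hypothetical, and the threshold of 0.38 is the value reported for the pilot study; whether the same threshold was applied in this trial is an assumption. The AUC is computed as the proportion of patient-control pairs in which the patient has the higher score, with ties counted as one half, which is equivalent to the area under the empirical ROC curve.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the probability that a positive case outscores a negative case."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a win
    return wins / (len(scores_pos) * len(scores_neg))


def sensitivity_specificity(scores_pos, scores_neg, threshold):
    """Dichotomize scores at the threshold (score >= threshold means AD)."""
    true_pos = sum(s >= threshold for s in scores_pos)
    true_neg = sum(s < threshold for s in scores_neg)
    return true_pos / len(scores_pos), true_neg / len(scores_neg)


# Hypothetical DLCS scores, not data from this trial.
ad_scores = [0.91, 0.74, 0.45, 0.30]
nc_scores = [0.10, 0.22, 0.41, 0.05]

auc = roc_auc(ad_scores, nc_scores)
sens, spec = sensitivity_specificity(ad_scores, nc_scores, threshold=0.38)
print(auc, sens, spec)
```

Sensitivity and specificity depend on the chosen threshold, whereas the AUC summarizes discrimination across all thresholds.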
Continuous variables were compared between groups using the independent samples t-test, and categorical variables were compared using the chi-square test. We estimated the 95%CIs of sensitivity and specificity using the Clopper-Pearson method [19] and that of the AUC using the DeLong method [20]. Because this clinical trial was conducted retrospectively, participation consent forms from subjects or legal guardians of the subjects were waived.
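The Clopper-Pearson interval mentioned above can be sketched using only the Python standard library. The interval bounds are the proportions at which the exact binomial tail probability equals α/2, found here by bisection; the function names are hypothetical. The counts of 161 of 188 patients and 146 of 162 controls are not stated in this text; they are inferred from the reported sensitivity of 85.6% and specificity of 90.1% and should be read as assumptions.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, lo=0.0, hi=1.0, iters=60):
    """Root of f on [lo, hi], where f is positive at lo and negative at hi."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _bisect(
        lambda p: alpha / 2 - (1 - binom_cdf(k - 1, n, p)))
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _bisect(
        lambda p: binom_cdf(k, n, p) - alpha / 2)
    return lower, upper


sens_ci = clopper_pearson(161, 188)  # inferred true positives / AD patients
spec_ci = clopper_pearson(146, 162)  # inferred true negatives / NC controls
print([round(100 * x, 1) for x in sens_ci + spec_ci])
```

With these inferred counts, the intervals should fall close to the 95%CIs reported for sensitivity and specificity.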

Data Availability
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Results
We enrolled a total of 350 subjects who met the eligibility criteria, with 162 (46.3%) in the NC group and 188 (53.7%) in the AD group. The demographic and clinical characteristics of the participants are summarized in Table 1. The mean age of the whole dataset was 73.3 ± 7.23 (range, 55 to 92) years. Age and sex were comparable between the NC and AD groups, while years of education were higher in the NC group. In the patient group, 76 (40.4%) had MCI due to AD, and the rest had AD dementia. All participants with MCI due to AD had a clinical dementia rating (CDR) score of 0.5. Among the 112 participants with AD dementia, 68 (60.7%) had a CDR score of 0.5, 35 (31.3%) had a CDR score of 1, and the rest (8.0%) had a CDR score of 2 or 3. The models of MR scanners were comparable, while the type of head coil was different between the two groups.
As summarized in Fig. 1, DLCS demonstrated a good diagnostic performance for AD. Its sensitivity for AD was 85.6% (95%CI, 79.8-90.3), and the lower bound of the 95%CI for its sensitivity was higher than the assumed value of 77.7%. Its specificity for AD was 90.1% (95%CI, 84.5-94.2), and the lower bound of the 95%CI for its specificity was higher than the assumed value of 78.6%. Its accuracy, positive predictive value, and negative predictive value for AD were 87.7%, 91.0%, and 84.4%, respectively. The AUC of DLCS for AD classification was 0.937 (95%CI, 0.911-0.963).
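The relationship among the reported measures can be checked with a short sketch. The 2×2 counts are not given in this text; the values below (161 true positives, 27 false negatives, 146 true negatives, 16 false positives) are inferred from the group sizes and the reported percentages and are therefore assumptions. The function name is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard diagnostic accuracy measures from a 2x2 table, in percent."""
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "accuracy": 100 * (tp + tn) / (tp + fn + tn + fp),
        "ppv": 100 * tp / (tp + fp),
        "npv": 100 * tn / (tn + fn),
    }


# Inferred counts: 188 AD patients (tp + fn) and 162 NC controls (tn + fp).
metrics = diagnostic_metrics(tp=161, fn=27, tn=146, fp=16)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}%")
```

Positive and negative predictive values depend on the proportion of patients in the sample (53.7% here), so they would differ in a population with a different prevalence of AD.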

Discussion
This clinical trial demonstrated that the diagnostic performance of DLCS for AD was excellent. To the best of our knowledge, this is the first clinical trial in the field of AI-based AD diagnosis using structural brain MRI data.
According to a previous meta-analysis, the pooled rate of undetected dementia in community-dwelling elderly individuals is 61.7% (95%CI = 55.0-68.0%) [3]. The rate of undetected cases is especially high for dementia with a slowly progressive onset, such as AD [21].
Structural brain MRI has been extensively explored for improving the early diagnosis of AD because of its good accessibility and rich information on neurodegeneration. Various automated MRI measures that are sensitive to AD detection, such as cortical volume [22], cortical thickness [23], shape [24], and texture [25], have been developed. The previously reported MRI-based markers showed good accuracy in diagnosing AD, but they all require heavy preprocessing of data and a long time to process, which is not feasible in a non-research setting such as the clinic. This critical limitation has long delayed the active use of MRI-based markers in clinical practice. However, in our previous work, DLCS was found to take only 23 s per case to process structural brain MR images. DLCS could drastically reduce the processing time by making an inference using a prelearned neural network, rather than preprocessing the whole data and extracting or calculating new features [16]. In addition, the features extracted by DLCS can comprehensively reflect volumetric, shape, and textural information, making it potentially more informative than previously developed single-measure MRI markers. What remained before employing DLCS in clinical practice was to demonstrate its diagnostic performance.
In this clinical trial, the diagnostic performance of the DLCS for AD was found to be excellent according to the criteria of excellent biomarkers proposed by the Ronald and Nancy Reagan Research Institute of the Alzheimer's Association and the National Institute on Aging Working Group on "Molecular and Biochemical Markers of Alzheimer's Disease" [26]. The working group suggested that an excellent evaluating biomarker should have a sensitivity approaching or exceeding 85%, a specificity of approximately 75-85% or greater, and a positive predictive value of approximately 80% or more. The sensitivity, specificity, and positive predictive value of the DLCS were 85.6%, 90.1%, and 91.0%, respectively, which met the requirements proposed by the working group.
The diagnostic performance of the DLCS for AD was comparable to or better than that of clinical diagnosis, fluorodeoxyglucose PET, and cerebral blood flow single-photon emission computed tomography (SPECT). Clinical diagnosis of probable AD according to the NINCDS-ADRDA criteria has shown a sensitivity of 70.9-75.3% and a positive predictive value of 59.5-70.8% for autopsy-proven AD. [27] Fluorodeoxyglucose PET has a sensitivity of 84% and specificity of 74% for autopsy-proven AD [28] and a sensitivity of 75.8% and specificity of 74.3% for amyloid PET-proven AD. [29] Cerebral blood flow SPECT showed a sensitivity of 63% and specificity of 82% for autopsy-proven AD [30] and a sensitivity of 42.9% and specificity of 82.9% for amyloid PET-proven AD. [29]
Both amyloid PET [31] and cerebrospinal fluid (CSF) β-amyloid 42 [32] can detect AD much earlier during the preclinical stage than structural brain MRI. However, in clinical settings, they are not supposed to be administered to patients with AD during the preclinical stage. [33] In contrast, structural brain MRI is conducted not only for diagnosing various types of dementia but also for diagnosing numerous other neurologic disorders and even for health checkups in clinical settings. Since the DLCS can be applied to all structural brain MRI scans taken for any purpose, it can increase the detection rate of AD, which may otherwise go unnoticed, and direct the patients to a timely examination that confirms the presence of AD.
This study has several strengths. First, we minimized the misclassification bias by confirming the presence of Aβ in the AD group and the absence of Aβ in the NC group using 18F-florbetaben PET scans. Assigning patients to the AD group based solely on clinical diagnosis can result in the misenrollment of patients who have AD-like symptoms but do not actually have AD pathology. Likewise, using only clinical diagnosis can also result in misassigning asymptomatic AD patients to the NC group, as it is known that up to one-fourth of cognitively normal elderly individuals can have Aβ pathology, which is recognized as a preclinical form of AD. [15] Second, we minimized the overestimation of the diagnostic performance of the DLCS by including MCI due to AD in the patient group [5]. Since neurodegenerative changes in the brain are less prominent in the prodromal phase, the diagnostic performance of biomarkers using structural brain MRI can be overestimated if patients with prodromal AD (MCI) are not included in the patient group.

Limitations
This study has several limitations. First, all MRI scans used in this study were acquired from a single scanner (Philips) using the same protocol. Therefore, the performance of the DLCS on scans from other vendors or protocols is unknown. Second, DLCS currently only takes 3D T1-weighted images as input data because 3D scans contain higher anatomical detail and resolution than conventional 2D scans. However, 3D scans are not always available in clinical settings, which may undermine the applicability of DLCS. Third, it is not clear which features contributed to the predictions made by the DLCS, which can undermine the explainability of the results. Increasing the explainability and interpretability of deep learning algorithms will be crucial in increasing the trustworthiness of the technology for use in the medical domain. This is an unresolved issue that is currently the topic of much recent research [34].

Conclusions
In conclusion, DLCS, a software as a medical device using structural brain MRI, demonstrated excellent diagnostic performance for AD. When applied to structural brain MRI scans taken for any purpose, DLCS may improve the early detection of AD. The protocol for the current study was approved by the Ministry of Food and Drug Safety in South Korea and the Institutional Review Board of SNUBH.

Consent for publication
Not applicable.

Availability of data and materials
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request. The other authors declare no conflicts of interest.

Funding