Reproducibility and clinical validation of automated habenula segmentation via deep learning in major depressive disorder with 7 Tesla MRI

The habenula is one of the most important brain regions for investigating the etiology of psychiatric diseases such as major depressive disorder (MDD). However, the habenula is challenging to delineate with the naked human eye in brain imaging due to its low contrast and tiny size, and the manual segmentation results vary greatly depending on the observer. Therefore, there is a great need for automatic quantitative analytic methods of the habenula for psychiatric research purposes. Here we propose a fully-automated segmentation and volume estimation method for the habenula in 7 Tesla magnetic resonance imaging based on a novel fully convolutional network. The proposed method, using the data of 69 participants (33 patients with MDD and 36 normal controls), achieved an average precision, sensitivity, and dice similarity coefficient of 0.869, 0.865, and 0.852, respectively, in the automated segmentation task. Moreover, the intraclass correlation coefficient reached 0.870 in the volume estimation task. This study demonstrates that this deep learning-based method can provide accurate and quantitative analytic results of the habenula. By providing rapid and quantitative information on the habenula, we expect our proposed method will aid future psychiatric disease studies.


Introduction
The habenula (Hb) is a paired epithalamic structure adjacent to the dorsomedial thalamus and the third ventricle [1] that can be divided into distinct portions via different cellular morphological features. It integrates information received from the cerebral and limbic cortex and provides forebrain control over the activity of ascending monoaminergic projections from the brainstem [2]. Additionally, based on previous studies of Hb function, the Hb is involved in the pathogenesis of psychiatric disorders such as major depressive disorder (MDD) [3,4].
Compared to normal controls (NCs), the Hb volume of patients with MDD showed atrophy in a post-mortem study [5].
According to previous post-mortem and structural imaging studies, the average volume of the human Hb is 15-30 mm³ [5,6]. Several studies have compared the volume of the Hb between patients with a psychiatric disorder and NCs: among patients with different stages of MDD and NCs [7]; among medicated and unmedicated MDD patients, bipolar disorder patients, and NCs [8]; and among medicated and unmedicated patients with MDD and NCs [9]. The majority of previous human Hb volumetric studies have used manual segmentation to determine Hb volumes [7-10]. However, these conventional manual approaches are time-consuming and laborious, particularly with extensive datasets, and it is challenging to produce accurate segmentation masks due to the anatomical characteristics of the Hb. Thus, manual segmentation results of the Hb by different observers deviate considerably, and it is difficult to determine which best fits the gold standard. To overcome this problem, two examiners trace the individual region and the reliability of their results is evaluated with an intraclass correlation [11]. Yet, this method is still time-consuming for the tracers. Overall, accurate Hb segmentation for quantitative analysis is still a challenging task. An accurate and quick Hb segmentation method might be a fundamental step in medical treatment, such as deep brain stimulation and neurosurgery, for targeting Hb sub-regions related to psychiatric diseases in the future [12,13].
For this reason, a couple of semi- or fully-automatic Hb segmentation approaches have been reported: 1) a reproducibility study of semi-automatic, myelin contrast-based Hb segmentation from 3T magnetic resonance imaging (MRI) [14], and 2) a machine learning technique for fully-automatic Hb segmentation of 1.5T MRI, used to compare the Hb volumes of patients with bipolar disorder and schizophrenia with those of healthy controls [15].
Since those studies relied on image processing steps such as intensity-based thresholding and image registration [14,15], there remain limitations in their ability to reliably perform automatic Hb segmentation in large MRI datasets. Accordingly, the development of accurate methods for fully-automated Hb segmentation of 7T MRI in patients with depressive disorder is necessary. However, research on automatic analytic methods using a deep learning approach in the depressive disorder research field is currently scarce.
Deep learning methods based on convolutional neural networks have recently been demonstrated as a powerful tool for semantic segmentation; compared with traditional segmentation techniques, they can take advantage of large annotated datasets and computational resources [16,17].
Moreover, various studies have reported regional segmentation of the human brain and their performance using u-net based novel fully convolutional networks (FCNs) [18,19].
Nevertheless, there are no such reported cases of deep learning approaches for automated Hb segmentation. Thus, we developed a deep learning-based method for fully-automated Hb segmentation using high-resolution 7T MRI and assessed the clinical utility of this method using brain images of patients with MDD and NCs for the validation of our deep learning approach. Although 7T MRI is an imaging technique suitable for visualizing the Hb, it is still challenging to segment the Hb accurately using naked eye-based manual segmentation because of its low anatomical contrast and tiny size, resulting in low reliability of segmentation results from different observers. To address this limitation, we designed deep learning networks trained on manual segmentation masks from two different examiners. The final Hb segmentation results fused the two pre-trained networks' outputs, taking into account both examiners' manually segmented masks.
Additionally, to perform automatic anatomical structure segmentation, it is more efficient to focus on specific areas of the visual scene, picking out only important features of interest, similar to human visual attention, than to examine every part of the brain with the feature aggregation of a deep learning network. The attention u-net was designed for this purpose and has been proposed to simply and accurately segment the pancreas, which occupies a small area in the abdomen [20,21]. In this study, therefore, we designed our deep learning networks' architecture based on the attention u-net for robust and accurate Hb segmentation.
This study aimed to validate the reproducibility of our deep learning-based computer-aided tool via evaluating the automatic Hb segmentation performance and comparing manual and automated Hb volume estimation in individuals with MDD and NCs.

Global Impression Scale (CGI) [24,25]. This study was conducted in accordance with the Declaration of Helsinki and approved by the Institutional Review Board of the Gil Medical Center (IRB No. GDIRB2018-005), and written informed consent was obtained from all the participants.
The common eligibility criteria for the MDD and NC groups were as follows: (i) no previous abnormal findings on brain imaging; (ii) no intellectual disability, neurocognitive disorders, or history of significant brain injury; (iii) no personality disorder or substance use disorder including alcohol use disorder in the last year; (iv) no major or unstable medical or neurological disorders in the last year; (v) no current serious suicide risk; (vi) right-handedness using the Edinburgh Handedness Test; (vii) not pregnant or lactating; and (viii) no metal material in the body. The NCs were included according to the following additional criteria: (i) no family history of first-degree relatives with a major psychiatric disorder; (ii) no history or symptoms of psychiatric disorders; (iii) no history of taking psychotropics during their lifetime; and (iv) a total score ≤6 on the HDRS-17. The participants who met the DSM-5 diagnostic criteria for MDD [26] were included in the MDD group. The MDD and NC groups were matched for age and sex.
Image and label acquisition. Whole-brain sagittal images were acquired using an 8-channel phased-array coil for 7-T MRI (MAGNETOM 7T, Siemens, Erlangen, Germany). To evaluate the possibility of simultaneously recording relaxation times, such as T1 and T2*, the prototype multi-echo magnetization-prepared 2 rapid gradient echoes (MP2RAGE) sequence by Siemens was utilized [27]. The manual segmentation was performed by two well-trained researchers using the T1 map of the participants' 7T MRIs. The researchers manually segmented the target voxels by tracing the Hb, which differed in signal intensity from that of the contiguous brain tissues, using three-dimensional analytic programs (i.e., ImageJ ver. 1.52a). The reliability of the segmentation was inspected using the overlap index ratio (%) [26].
Experimental overview. Two deep learning networks were trained for automatic Hb segmentation, one on the manual segmentation results of each of the two observers. To evaluate the segmentation results, the fusion output label, which was the intersection of the two networks' automatic segmentation results, was compared with the ground truth (GT), the intersection of the two observers' manual segmentation masks.
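The intersection step described above can be sketched as a voxel-wise logical AND. This is a minimal illustration, assuming binary NumPy masks of equal shape; the function name `intersect_masks` and the toy arrays are hypothetical and are not taken from the study's code.

```python
import numpy as np


def intersect_masks(mask_a, mask_b):
    """Voxel-wise intersection of two binary masks (assumed to share a shape)."""
    a = np.asarray(mask_a).astype(bool)
    b = np.asarray(mask_b).astype(bool)
    if a.shape != b.shape:
        raise ValueError("masks must have the same shape")
    return np.logical_and(a, b).astype(np.uint8)


# Toy 1 x 4 slice: hypothetical outputs of the two networks (one per observer).
pred_1 = np.array([[0, 1, 1, 1]])
pred_2 = np.array([[0, 0, 1, 1]])
fusion = intersect_masks(pred_1, pred_2)

# Toy manual masks from the two observers, intersected the same way to form the GT.
gt_1 = np.array([[0, 1, 1, 0]])
gt_2 = np.array([[1, 1, 1, 0]])
gt = intersect_masks(gt_1, gt_2)
```

Intersection keeps only voxels on which both sources agree, so it favors a conservative mask over an inclusive one.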
Preprocessing and experimental setup. We acquired a region of interest mask in the axial plane of the 7T MRI volume (Fig. 2a). The window level and window width were set to clearly observe the Hb on 7T MRI (window level: 1300, window width: 750; converted to an 8-bit image) (Fig. 2b). To remove unnecessary brain regions, the images were uniformly cropped to 96 pixels (x-axis) and 128 pixels (y-axis) (Fig. 2c), including the Hb (Fig. 2d). To train the segmentation network, we divided the total brain MRI data (n = 69) by a ratio of 6:2:2 (train:validation:test). A total of 5 folds with the same training, validation, and test set ratio were formed. Accordingly, the performance of the model was evaluated on the whole dataset while shifting the test dataset (Fig. 2e).
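The preprocessing steps above can be illustrated with a short sketch. It uses the stated window level (1300), window width (750), crop size (96 × 128), and 6:2:2 ratio over 5 folds. The crop offsets, the interleaved fold assignment, and all function names are assumptions made for illustration, since the text does not specify them.

```python
import numpy as np


def apply_window(image, level=1300.0, width=750.0):
    """Clip to [level - width/2, level + width/2] and rescale to an 8-bit image."""
    lo = level - width / 2.0
    hi = level + width / 2.0
    clipped = np.clip(np.asarray(image, dtype=np.float64), lo, hi)
    return np.round((clipped - lo) / (hi - lo) * 255.0).astype(np.uint8)


def crop(image, top, left, height=128, width=96):
    """Crop a (rows, cols) slice to `height` rows (y) by `width` columns (x)."""
    return image[top:top + height, left:left + width]


def five_fold_splits(n_items, n_folds=5):
    """Yield (train, val, test) index lists: 3 folds train, 1 validation, 1 test."""
    folds = [list(range(n_items))[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_k = (k + 1) % n_folds
        train = [i for j, fold in enumerate(folds)
                 if j not in (k, val_k) for i in fold]
        yield train, folds[val_k], folds[k]


# Toy 256 x 256 slice with one very dark and one very bright pixel.
slice_2d = np.full((256, 256), 1300.0)
slice_2d[0, 0] = 0.0
slice_2d[0, 1] = 4000.0
windowed = apply_window(slice_2d)
patch = crop(windowed, top=64, left=80)   # offsets are hypothetical
splits = list(five_fold_splits(69))
```

Rotating the test fold in this way lets every participant contribute to the test results once while keeping the approximate 6:2:2 proportion in each fold.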
Network. The proposed network was designed based on the attention u-net [20,21], which is a modified version of the traditional u-net [29]. The difference between the attention and basic u-net is that the attention u-net includes attention gates (AGs). The AG, located in the skip-connection layers of the attention u-net, is a module that helps to optimize the model in segmentation tasks of small and polymorphic regions by using a sigmoid function. The sigmoid function inside the AG has an effect similar to the simultaneous localization and segmentation of the object area via activation considering both the skip-connected layer and the previous layer.
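For reference, the additive attention gate of the original attention u-net [20,21] is commonly written as shown below. This is a sketch of that standard formulation and not necessarily the exact notation used in this study. The symbols are assumptions taken from that formulation: x is the skip-connected feature map, g is the gating signal from the previous (coarser) layer, W_x, W_g, and ψ are learned linear transforms with biases b_g and b_ψ, σ1 is the ReLU, and σ2 is the sigmoid.

```latex
q_{att}^{l} = \psi^{T}\,\sigma_{1}\!\left(W_{x}^{T} x_{i}^{l} + W_{g}^{T} g_{i} + b_{g}\right) + b_{\psi}

\alpha_{i}^{l} = \sigma_{2}\!\left(q_{att}^{l}\right)

\hat{x}_{i}^{l} = \alpha_{i}^{l}\, x_{i}^{l}
```

The attention coefficient α lies between 0 and 1 at every position, so multiplying it with the skip-connected features suppresses regions that the gating signal considers irrelevant.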
When the 7T MRI was fed into the model (Fig. 3), the significant feature maps were aggregated for Hb segmentation by the convolution operation. In the feature aggregation process, the feature map, which was reduced in resolution by a repeated pooling operation, was restored to the input image resolution by the up-sampling operation. Up-sampling was performed after the AG operation. In the feed-forward procedure of the AG-based up-sampling, σ1 denotes the rectified linear unit (ReLU) activation function and σ2 the sigmoid activation function.
Implementation details. We trained our networks on a single Tesla V100 (32GB) GPU (graphics processing unit). Each network consisted of 1,984,565 parameters. Our networks were trained using the Adam optimizer [30] to minimize the generalized dice loss [31].
We terminated training early when the loss did not improve for 50 epochs. The initial learning rate (LR) was 0.001, and when the loss did not decrease for 10 epochs, the LR was reduced by a factor of 0.2. Training of the networks terminated early after 100-300 epochs.
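The schedule described above can be sketched as a small framework-independent simulation. It assumes that "reduced by a factor of 0.2" means the LR is multiplied by 0.2, as in common reduce-on-plateau schedulers. The text does not state which loss was monitored or which framework was used, and the function name `simulate_schedule` and the synthetic loss curve are hypothetical.

```python
def simulate_schedule(losses, lr=1e-3, factor=0.2, lr_patience=10, stop_patience=50):
    """Return (stop_epoch, final_lr) for a sequence of per-epoch monitored losses."""
    best = float("inf")
    since_best = 0      # epochs since the best loss (drives early termination)
    since_lr_step = 0   # epochs without improvement since the last LR change
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best = loss
            since_best = 0
            since_lr_step = 0
            continue
        since_best += 1
        since_lr_step += 1
        if since_lr_step >= lr_patience:
            lr *= factor            # assumption: multiply the LR by 0.2
            since_lr_step = 0
        if since_best >= stop_patience:
            return epoch, lr
    return len(losses), lr


# Synthetic curve: 100 improving epochs followed by a plateau.
losses = [1.0 - 0.005 * i for i in range(100)] + [0.6] * 60
stop_epoch, final_lr = simulate_schedule(losses)
```

With a patience of 10 for the LR and 50 for termination, the LR is lowered several times on a plateau before training stops, which gives the optimizer a chance to escape the plateau at a smaller step size.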
Statistics. Demographic data and clinical characteristics were calculated and compared using two-tailed independent t-tests and chi-square tests. The software IBM SPSS Statistics (ver. 21.0) was used and P < 0.05 was set as the limit for statistical significance for these analyses.
We obtained the precision, sensitivity, and dice similarity coefficient (DSC) by comparing the GT and automatic segmentation result of networks for evaluation in the test set.
To evaluate our network, we calculated the coincidence rate between the GTs and the automatic segmentation results. The evaluation was conducted slice-by-slice with binary 2D images using the following equations:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Sensitivity} = \frac{TP}{TP + FN}$$

$$\mathrm{DSC} = \frac{2\,TP}{2\,TP + FP + FN}$$

where the True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) counts were obtained by comparing the voxels of the GT and the fusion segmentation result. Since the whole dataset was divided into 5 folds, each serving in turn as the test dataset, we were able to evaluate every slice in our whole dataset (69 participants). Because the performance of deep neural networks (DNNs) depends on the training and validation sets, the training, validation, and test sets were re-formed for each of the 5 folds.
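As an illustration of the slice-wise evaluation, the sketch below computes precision, sensitivity, and DSC from the TP, FP, and FN counts of two binary masks, using the standard definitions TP/(TP+FP), TP/(TP+FN), and 2TP/(2TP+FP+FN). The function name and the toy masks are hypothetical. Returning 0.0 when a denominator is zero is an assumption, since the text does not state how empty slices were handled.

```python
import numpy as np


def segmentation_metrics(gt, pred):
    """Precision, sensitivity, and DSC for two binary masks of equal shape."""
    gt = np.asarray(gt).astype(bool)
    pred = np.asarray(pred).astype(bool)
    tp = int(np.sum(gt & pred))     # voxels labeled Hb in both masks
    fp = int(np.sum(~gt & pred))    # predicted Hb, but not in the GT
    fn = int(np.sum(gt & ~pred))    # GT Hb missed by the prediction
    # Assumption: metrics default to 0.0 when their denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    dsc = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, sensitivity, dsc


# Toy 2 x 4 slice: 3 overlapping voxels, 1 false positive, 1 false negative.
gt_slice = np.array([[0, 1, 1, 0], [0, 1, 1, 0]])
pred_slice = np.array([[0, 0, 1, 1], [0, 1, 1, 0]])
precision, sensitivity, dsc = segmentation_metrics(gt_slice, pred_slice)
```

TN does not enter any of the three metrics, which is useful for a structure as small as the Hb, where background voxels would otherwise dominate the score.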
For the validation of clinical applications such as volume analysis, we compared manual and automated Hb segmentation in participants with MDD and NCs. Therefore, it was necessary to estimate the size of the Hb volume via 3D volume reconstruction for each participant (see Supplementary Fig. S1). In addition, we divided the total volume of the Hb into the left and right hemispheres to analyze automatic segmentation performance on each side.
After 3D reconstruction of the Hb, the intraclass correlation coefficients (ICCs) were calculated from each pair of brain volumes using the automatic and manual segmentation methods. Before this analysis, the normalization of the Hb volumes was performed using total intracranial volume (ICV). The Hb volumes were divided by the ICV for each participant (Hb volume / ICV × 100) to adjust for individual differences in brain size. To assess the inter-rater reliability (i.e., the degree of agreement between the Hb volumes by automatic and manual segmentation), the ICC method involving the absolute agreement mode, which is sensitive to the differences in the mean values of observations, was used [32,33].
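The volume normalization and agreement analysis can be sketched as follows. The ICC shown is the two-way, single-measure, absolute-agreement form; the text specifies only the absolute agreement mode, so the choice of the single-measure variant is an assumption, as are the function names and the toy values.

```python
def normalize_by_icv(hb_volume_mm3, icv_mm3):
    """Express the Hb volume as a percentage of total intracranial volume."""
    return hb_volume_mm3 / icv_mm3 * 100.0


def icc_absolute_agreement(ratings):
    """Two-way, single-measure, absolute-agreement ICC for an n x k table."""
    n = len(ratings)          # participants
    k = len(ratings[0])       # methods (e.g. manual and automatic)
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)   # between participants
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)   # between methods
    sse = sum((ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
              for i in range(n) for j in range(k))
    mse = sse / ((n - 1) * (k - 1))                                 # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Toy data: identical measurements, and measurements shifted by a constant.
manual = [1.0, 2.0, 3.0, 4.0]
automatic_same = list(manual)
automatic_shifted = [v + 1.0 for v in manual]
icc_same = icc_absolute_agreement(list(zip(manual, automatic_same)))
icc_shifted = icc_absolute_agreement(list(zip(manual, automatic_shifted)))
normalized = normalize_by_icv(24.0, 1500000.0)   # hypothetical ICV in mm^3
```

A constant offset between the two methods leaves a consistency-type correlation at 1 but lowers the absolute-agreement ICC, which is why this mode is sensitive to differences in mean values.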

Results
Demographics. Supplementary Table S1 shows the demographics of the participants in this study. The age and sex ratio did not significantly differ between the two groups. The years of education and depressive symptom severity measured using the HDRS-17, BDI, and CGI differed significantly between the two groups.
Evaluation of habenula segmentation. The average Hb volume obtained with automated segmentation, out of all voxels (256 × 256 × 208), was 24.01 ± 6.42 mm³ (mean ± standard deviation), and with manual segmentation it was 24.19 ± 6.10 mm³. Table 1 shows the performance evaluation of the automated Hb segmentation. The performance of our network reached a mean precision, sensitivity, and DSC of 0.869, 0.865, and 0.852, respectively, using 5-fold cross-validation. We also trained a single attention u-net on the intersected GT of the two raters as an ablation study of our proposed network. It achieved a mean precision, sensitivity, and DSC of 0.847, 0.789, and 0.790, respectively, across the 5 folds. In the ablation study, the proposed network achieved a higher sensitivity than did the network that did not consider the two raters' manual segmentation results separately (see Supplementary Table S2 Figure 5.

Discussion
In this study, we proposed a deep attention u-net-based intersection network for accurate Hb segmentation and quantitative Hb analysis. In our experiments, the mean precision, mean sensitivity, and mean DSC of the automatic segmentation using the intersection of attention u-nets were good across all participants. Additionally, the ICCs between automatic and manual segmentation of the total Hb were excellent in all participants, participants with MDD, and NCs. Therefore, we suggest that the proposed approach is suitable for the segmentation of the Hb, which is a brain region tiny in size with low contrast in brain MRI. To the best of our knowledge, this is the first study to present a fully-automatic, segmentation-based Hb volume estimation method in participants with MDD and NCs using 7T MRI.
In the automatic segmentation procedure, we obtained a mean precision, sensitivity, and DSC of 0.869, 0.865, and 0.852, respectively, in the whole dataset. In recent years, a couple of studies on automated segmentation of the Hb have been reported. The first study performed semi- and fully-automated segmentation in 3T MRI of healthy young adults [14], and the second study performed fully-automated segmentation in children, adolescents, and adults with bipolar disorder and schizophrenia [15]. In the first study, the DSC for binary segmentation reached 0.71 for semi-automated segmentation and 0.69 for fully-automated segmentation, and the DSC of the probability map reached 0.74 for both semi- and fully-automated segmentation [14]. In a more recent study that segmented the Hb with a fully-automated framework, the DSC of the inter-rater reliability tests between manual and automatic segmentation ranged between 0.758 and 0.828 [15]. Although the participants in the previous studies had different clinical characteristics from those in our study, our automatic Hb segmentation approach seems to be more accurate (mean DSC > 0.85) than that of the other studies.
In this study, the ICC of the total Hb ranged between 0.818 and 0.897, depending on the group. In a previous study conducted on healthy young adults, the ICC for the Hb was 0.62 for semi-automated segmentation and 0.47 for fully-automated segmentation [14], which shows the superiority of our approach. However, the ICC of the Hb was different between the groups and hemispheres in our study. Specifically, the ICC of the left Hb was excellent (0.903-0.920), while the ICC of the right Hb ranged from 0.658 (NC) to 0.819 (MDD). It is difficult to accurately explain why the ICCs for the left and right Hb were different; however, the asymmetry of the left and right Hb might be one reason [35].
Another attribute of our approach is that the results of two networks, each trained on GT generated by two different observers, were intersected to output fusion segmentation results.
When trained with the intersected GT, the single attention u-net reached a lower mean sensitivity (0.789) than the proposed fusion network (0.865). We assume that the single attention network was not able to fully capture the anatomical context of the Hb without considering the different viewpoints of two examiners.
Our approach differs from the previous studies for the following reasons. First, this is the first fully-automated segmentation study performed in participants with MDD and NCs using high-resolution 7T MRI that can ideally visualize the Hb. Second, a DNN approach for automatic Hb segmentation and volume estimation was used. We designed a deep learning network based on the attention u-net that was optimized for segmenting small objects (i.e., the Hb) of various shapes. Third, since the segmentation was performed by fusing the outputs of two pre-trained attention u-nets trained on two different GTs, it is believed that a more reliable segmentation was achieved.
The high DSC and reproducibility of the automated segmentation of this study demonstrate that the applicability of the DNN approach for Hb volume estimation in 7T MRI is promising.
Although the Hb is considered to be an important brain region in the etiology of major psychiatric disorders, its small size has made it difficult to investigate via neuroimaging. The Hb is involved in emotional and cognitive processes, having connections to many other areas of the brain (e.g., thalamus, prefrontal cortex, basal ganglia, and brainstem monoaminergic neurotransmitter systems) [36,37]. Recently, many studies have focused on the connectivity between the Hb and other brain regions of interest, such as monoamine centers and the thalamus, in depression [38,39]. However, manual segmentation is time-consuming and highly variable, and the rater must acquire a high level of technical ability and anatomical knowledge for accurate segmentation, which has become a significant barrier to entry into this field of research [15].
Considering that the data acquired through neuroimaging research is gradually increasing and that machine learning techniques are becoming more popular, the automatic segmentation approach in our study is expected to be a useful tool for many future studies.

Conclusion
This study presented an intersection network based on attention u-net for a fully-automated segmentation of the Hb using 7T MRI that performed automatic segmentation and estimated the Hb volume with high accuracy and reproducibility (i.e., high DSC and correlation coefficients). Although the sample size was not large (69 participants), cross-validation confirmed that reliable Hb segmentation results can be obtained using our network.
Furthermore, it is expected that the proposed automatic Hb segmentation method will be useful for future psychiatric neuroimaging studies to facilitate automatic segmentation and volume estimation of the Hb and other important small brain regions in 7T MRI.

Data availability
The datasets generated or analyzed during the current study are available from the corresponding author on reasonable request.

Figure legends
In the network training procedure, two manual segmentation masks were used for the training of two networks, and two segmentation results were obtained. The network evaluation was performed by comparing the intersected GT and fusion output.
Abbreviations: GT, ground truth; MR, magnetic resonance; 7T, 7 Tesla; AG, attention gate.
The evaluation results are presented as mean and standard deviation.
Abbreviations: DSC, dice similarity coefficient.
a Habenula volumes were normalized using total intracranial volume (ICV). Habenula volumes were divided by the ICV for each participant as a normalization process (regional brain volume / ICV × 100%) for the subsequent analyses. Normalized habenula volumes are described as mean ± SD.
Abbreviations: SD, standard deviation; ICC, intraclass correlation coefficient.
Significant results are indicated in bold.