Generalizability and Quality Control of Deep Learning-Based 2D Echocardiography Segmentation Models in a Large Clinical Dataset

Use of machine learning for automated annotation of heart structures from echocardiographic videos is an active research area, but understanding of comparative, generalizable performance among models is lacking. This study aimed to 1) assess the generalizability of five state-of-the-art machine learning-based echocardiography segmentation models within a large clinical dataset, and 2) test the hypothesis that a quality control (QC) method based on segmentation uncertainty can further improve segmentation results. Five models were applied to 47,431 echocardiography studies that were independent from any training samples. Chamber volume and mass from model segmentations were compared to clinically-reported values. The median absolute errors (MAE) in left ventricular (LV) volumes and ejection fraction exhibited by all five models were comparable to reported inter-observer errors (IOE). MAE for left atrial volume and LV mass were similarly favorable to respective IOE for models trained for those tasks. A single model consistently exhibited the lowest MAE in all five clinically-reported measures. We leveraged the 10-fold cross-validation training scheme of this best-performing model to quantify segmentation uncertainty for potential application as QC. We observed that filtering segmentations with high uncertainty improved segmentation results, leading to decreased volume/mass estimation errors. The addition of contour-convexity filters further improved QC efficiency. In conclusion, five previously published echocardiography segmentation models generalized to a large, independent clinical dataset, segmenting one or multiple cardiac structures with overall accuracy comparable to manual analyses, albeit with variable performance. Convexity-reinforced uncertainty QC efficiently improved segmentation performance and may further facilitate the translation of such models.


Introduction
Structural segmentation is an important step for interpreting 2D echocardiography, but manual segmentation is highly time-consuming and subject to significant inter- and intra-observer variability [1]-[6]. To overcome these limitations, several computer-aided methodologies, such as active shape models [7]-[10], level-sets [9], [10], and deep learning (DL)-based algorithms [11]-[14], have been developed to automatically segment cardiac structures in echocardiography images, with models based on DL particularly gaining increasing attention in recent years [15].
To date, most state-of-the-art DL models focus on segmentation of the left ventricular (LV) endocardium [12], [15]-[18], with only a few segmenting other cardiac structures such as the LV epicardium or the left atrium (LA), which could provide additional information for diagnosing and treating heart disease [19]. In 2018, Zhang et al. developed the first model that segments multiple chambers [20]. In 2019, Leclerc et al. published the CAMUS dataset, for which the LV endo- and epicardium and the LA endocardium were manually segmented [21]. This dataset greatly facilitated the development and improvement of multi-structural echocardiography segmentation models [19], [22]-[26], such as those trained with adversarial [19] or motion-segmentation co-learning strategies [25], [26]. The generalizability of these models to external, independent datasets was only partially tested on either the small CAMUS dataset (450 patients) [19] or the single-view EchoNet-Dynamic dataset with only LV endocardial contours [25]. Thus, the performance of these multi-structural segmentation models within a large, independent, clinically-acquired echocardiography dataset remains unknown. Moreover, none of these models has been tested with automated quality control (QC) methods.
QC, an important consideration for translating these AI segmentation models into potential clinical use, is mostly achieved by estimating aleatoric uncertainty, which is described by the noise inherent in observations [27]-[29]. Unlike epistemic uncertainty, which can be eliminated by training on big data or with data augmentation, aleatoric uncertainty can be formalized by a distribution over model outputs [27]. Common uncertainty modeling approaches require multiple segmentation predictions for a single input, obtained either through test-time augmentation [30] or by feeding the input into multiple models generated during training [31], [32]. To estimate the uncertainties, the differences between the final/averaged segmentation and the individual predictions are assessed at the pixel or image level [28], [29]. Previous studies showed that removing uncertain segmentations improved LV segmentation scores [28]. In addition to uncertainty, convexity has been used as a shape prior in image segmentation to improve model regularization and performance [33]. Its use as a quality score for segmentation, however, has not been tested yet. Therefore, the main objective of the present study was to compare five state-of-the-art echocardiography segmentation models on a large (>47k studies), independent clinical echocardiography dataset comprising both apical two-chamber (a2c) and apical four-chamber (a4c) views. Models were compared by their accuracy in assessing any of five standard clinical measures: LV end-diastolic volume (EDV), LV end-systolic volume (ESV), LV ejection fraction (EF), LV wall mass (LVM), and maximal LA volume (LAV). Using the most generalizable segmentation model, we then tested a new method for measuring segmentation aleatoric uncertainty, i.e., cross-validation (CV) model averaging, and evaluated the use of segmentation uncertainty combined with convexity as a QC strategy.
By doing so, the present study provides detailed evaluations for selection of segmentation models and paves the way for the development of an echocardiographic analysis pipeline that can be used in production with automatic QC.

Datasets
The Institutional Review Board at Geisinger approved this retrospective study with a waiver of consent, in conjunction with institutional patient privacy policies. We randomly extracted 88,322 studies, collected between 1998 and 2020, from the Geisinger Xcelera database (Philips Medical Systems) in the Digital Imaging and Communications in Medicine (DICOM) format. Among the transthoracic echocardiography (TTE) videos with DICOM view labels, we identified 50,593 studies (37,704 unique patients) having both a2c and a4c videos longer than one heartbeat (online Fig. 1).
We extracted EDV, ESV, LVM, and LAV measures from the Xcelera database. While EDV, ESV, and LAV were computed using the bi-plane method of disks (MOD-bp) at Geisinger clinic, LVM was calculated using the M-mode linear cube method. All these clinical values were estimated from a single heartbeat or multiple heartbeats selected by cardiologists based on image quality. We also extracted physician-reported EF measurements, which were approximate values or ranges derived either qualitatively or through a 2D- or 3D-volume technique [34] and were not significantly different from MOD-bp EF derived from 2D echocardiograms. After excluding the studies without pixel scale information or clinical volume/mass measurements, we collected 47,431 studies from 35,826 unique patients containing 80,550 a2c and 85,975 a4c videos for segmentation and evaluation (online Fig. 1). The data characteristics are summarized in Table 1. Disease prevalence was based on custom phenotypes [35]-[37].
We also collected two publicly available 2D echocardiography datasets with manual segmentations, CAMUS [21] and EchoNet-Dynamic [16]. CAMUS contained 900 a2c and 900 a4c images at ED and ES from 450 unique patients [21]. EchoNet-Dynamic contained 10,025 a4c videos from 10,025 unique patients [16].

Segmentation
Details of the five candidate segmentation models are provided in online Table S1. The models were applied as follows:

A. Preprocessing
For all five models, after masking out the non-cone pixels, we reshaped the video frame dimensions with preserved aspect ratio using cubic interpolation to the target sizes and normalized/standardized pixel intensity per the model requirements (online Table S1). Since the Stough et al. 3D model required ED-ES clips, we identified the ED and ES frames based on the areas of LV segmentation produced by the Stough et al. 2D model, using a peak-finding algorithm with a minimum distance of 85% of the cardiac cycle duration [16]. After cutting the whole videos into ED-ES clips, we converted the length of each ED-ES clip to 10 frames using trilinear interpolation. Although this preprocessing may be slightly different from the protocols described in the original code/manuscripts, pilot studies did not detect any significant difference in their performance on Geisinger data.
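As an illustration of this ED/ES selection step, the following sketch (not the authors' exact implementation) finds area maxima (ED candidates) and minima (ES candidates) in a hypothetical per-frame LV-area trace, enforcing a minimum peak separation of 85% of the estimated cycle length:

```python
# Sketch: identify ED (LV area maxima) and ES (minima) frames from a
# hypothetical per-frame LV segmentation-area trace `lv_areas`.
import numpy as np
from scipy.signal import find_peaks

def find_ed_es_frames(lv_areas, cycle_len_frames):
    """Return candidate ED and ES frame indices.

    Peaks must be separated by at least 85% of the estimated cycle length,
    mirroring the distance constraint described in the text.
    """
    areas = np.asarray(lv_areas, dtype=float)
    min_dist = max(1, int(0.85 * cycle_len_frames))
    ed_frames, _ = find_peaks(areas, distance=min_dist)    # area maxima -> ED
    es_frames, _ = find_peaks(-areas, distance=min_dist)   # area minima -> ES
    return ed_frames, es_frames

# Two synthetic beats: areas rise to ED peaks (frames 2 and 8) and fall to
# ES troughs (frames 5 and 11).
areas = [50, 60, 70, 55, 40, 30, 45, 58, 72, 54, 38, 28, 44]
ed, es = find_ed_es_frames(areas, cycle_len_frames=7)
```

Successive ED frames then delimit the ED-ES clips fed to the 3D model.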

B. Deployment
We fed preprocessed videos/clips into candidate models and obtained structural segmentations by identifying the maximum Softmax score output by each model. Since Stough et al. proposed using the accumulated output from ten CV models, which showed improved performance on CAMUS test data [24], [25], we accumulated ten sets of Softmax probabilities and identified the maximum score to produce a final prediction for both the 2D and 3D models. To postprocess the segmentations generated by each model, we kept the largest region and filled any holes smaller than 128 pixels for each segmented structure using connected component analysis with scikit-image [21].

C. Evaluation
We estimated volume and mass from predicted segmentations using Simpson's modified method of disks (MOD) [24], [39] and computed the errors relative to clinically reported measures. Within each study, we reported the median volume/mass estimate from all qualifying segmentation results (aggregating across multiple beats and videos). Specifically, after identifying ED and ES frames (see section Segmentation A) [16], we used LV endocardial segmentations from a single a4c view (MOD-sp4 method) or from a2c and a4c views (MOD-bp method) at ED and ES to estimate LV EDV, ESV, and EF. LAV was calculated similarly [4]. To estimate LVM, we computed the MOD-bp volume of the LV wall at ED as the difference between the epicardial and endocardial volumes and multiplied the value by myocardial density (1.05 g/ml) [6].
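The disk-summation arithmetic behind the MOD-bp volume and the LVM computation can be sketched as follows (function names and inputs are illustrative, assuming per-disk short-axis diameters have already been measured from paired a2c/a4c segmentations along a shared long axis):

```python
# Sketch of Simpson's modified method of disks (biplane) and LVM, assuming
# per-disk short-axis diameters (cm) measured from the a2c and a4c LV
# endocardial segmentations along a shared long axis of length L (cm).
import numpy as np

MYOCARDIAL_DENSITY_G_PER_ML = 1.05

def mod_biplane_volume(diams_a2c, diams_a4c, long_axis_cm):
    """V = (pi/4) * sum_i(a_i * b_i) * (L / n), in ml (1 cm^3 = 1 ml)."""
    a = np.asarray(diams_a2c, dtype=float)
    b = np.asarray(diams_a4c, dtype=float)
    n = len(a)
    return np.pi / 4.0 * np.sum(a * b) * (long_axis_cm / n)

def lv_mass(epi_volume_ml, endo_volume_ml):
    """LVM = (epicardial volume - endocardial volume) * myocardial density."""
    return (epi_volume_ml - endo_volume_ml) * MYOCARDIAL_DENSITY_G_PER_ML
```

As a sanity check, constant diameters reduce the formula to the volume of a cylinder, (pi/4) * d^2 * L.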

Segmentation QC
Adapted from state-of-the-art methods [29], we computed two image-level uncertainty measures using the ten sets of outputs from the ten 2D Stough et al. models:

Seg_dev = 1 - (1/10) Σ DICE(S_i, S_mean)
Seg_CoV = SD[DICE(S_i, S_mean)] / Mean[DICE(S_i, S_mean)]

where S_i (i = 1, ..., 10) were the predicted segmentation samples obtained by identifying the maximum Softmax score output by each model for each pixel, and S_mean was the mean predicted segmentation obtained by averaging the Softmax scores from the ten models and then taking the maximum value. DICE was computed using the following equation [40]:

DICE(A, B) = 2|A ∩ B| / (|A| + |B|)

Segmentations with uncertainty larger than a threshold were excluded from downstream volume/mass estimation; this QC strategy was denoted as uncertainty-based QC.
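The Seg_dev computation can be sketched as follows (a minimal illustration, not the published code; `softmax_stack` and the class index `cls` are assumed inputs):

```python
# Sketch: DICE and the Seg_dev uncertainty over ten CV-model outputs.
# `softmax_stack` has shape (n_models, n_classes, H, W).
import numpy as np

def dice(a, b):
    """DICE(A, B) = 2|A intersect B| / (|A| + |B|)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    denom = a.sum() + b.sum()
    return 2.0 * np.logical_and(a, b).sum() / denom if denom else 1.0

def seg_dev(softmax_stack, cls):
    """Seg_dev = 1 - mean DICE between each model's mask and the mean mask."""
    member_masks = softmax_stack.argmax(axis=1) == cls              # per model
    mean_mask = softmax_stack.mean(axis=0).argmax(axis=0) == cls    # averaged
    return 1.0 - np.mean([dice(m, mean_mask) for m in member_masks])
```

When all ten models produce identical masks, Seg_dev is zero; disagreement among members raises it toward one.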
We used 80% of the CAMUS data to define the uncertainty threshold. To increase the incidence of bad segmentations, we augmented the images 20 times by randomly rotating the image with the transducer as a pivot point, adding Gaussian noise, and applying intensity windowing [24]. This led to 1,439 training images. We considered a predicted segmentation to be poor when the ground truth DICE (i.e., DICE between the prediction and the ground truth segmentation) was <0.85 [29]. The Pareto front curve (also called the tradeoff curve), commonly used for optimizing bi-objective problems, was generated by plotting the percentage of poor segmentations remaining after QC against the percentage of segmentations dropped by QC [29]. The ideal uncertainty threshold should remove the fewest samples (objective one) while dropping the largest number of poor segmentations (objective two).
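This bi-objective threshold selection can be sketched as below; from hypothetical candidate thresholds, we pick the one whose operating point (percentage dropped, percentage of poor segmentations remaining) lies closest to the ideal origin (0, 0):

```python
# Sketch: choose the uncertainty threshold whose (percentage dropped,
# percentage of poor segmentations remaining) point is nearest the origin.
import numpy as np

def pareto_threshold(uncertainties, is_poor, candidates):
    u = np.asarray(uncertainties, dtype=float)
    poor = np.asarray(is_poor, dtype=bool)
    best_t, best_d = None, np.inf
    for t in candidates:
        dropped = u > t
        pct_dropped = dropped.mean() * 100.0              # objective one
        pct_poor_left = (poor & ~dropped).sum() / max(poor.sum(), 1) * 100.0
        d = np.hypot(pct_dropped, pct_poor_left)          # distance to (0, 0)
        if d < best_d:
            best_t, best_d = t, d
    return best_t
```

Sweeping `candidates` over the observed uncertainty range traces out the Pareto front itself.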
We also computed convexity of the segmentation contour using a boundary-based method [41].
Specifically, the convexity score was calculated as

Convexity(S) = Per(R) / Per_1(S)

where Per(R) was the perimeter of the minimal rectangle R surrounding the segmentation contour whose edges were parallel to the x and y axes, and Per_1(S) was the sum of the projections of the edges of the segmentation contour S onto the x and y axes [41]. As such, the convexity score ranged over (0, 1]; values approaching 1 represent oval or circular shapes with smooth boundaries [41]. For LV wall segmentation, the convexity score was defined as the minimal value of the LV endo- and epicardial convexity. Convexity filters were added to reinforce the uncertainty-based QC by screening out segmentations that met any of the following criteria: 1) convexity <0.6 (shapes with multiple fragments, substantial indentations, or protrusions, according to the convexity rankings by Zunic and Rosin [41]); 2) Seg_dev >0.15 (equivalent to mean DICE <0.85); or 3) Seg_dev > a structure-specific threshold (i.e., 0.039 for LV endocardium, 0.057 for LV wall, and 0.055 for LA endocardium, as learned from the Pareto front curves in Fig. 3) and convexity <0.96 (around the 10th percentile observed for the independent Geisinger and EchoNet-Dynamic data). This method was denoted as convexity-reinforced uncertainty QC.
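A sketch of the boundary-based convexity score and the three screening criteria listed above (illustrative only; `convexity` operates on a polygonal contour given as (x, y) vertices, and `passes_qc` takes precomputed scores with hypothetical cutoffs):

```python
# Sketch: boundary-based convexity (axis-aligned bounding-rectangle perimeter
# over the contour's L1 perimeter) and the three screening criteria above.
import numpy as np

def convexity(contour):
    """contour: (N, 2) array of (x, y) polygon vertices (closed implicitly)."""
    pts = np.asarray(contour, dtype=float)
    edges = np.diff(np.vstack([pts, pts[:1]]), axis=0)  # include closing edge
    l1_perimeter = np.abs(edges).sum()                  # sum of projections
    width = pts[:, 0].max() - pts[:, 0].min()
    height = pts[:, 1].max() - pts[:, 1].min()
    return 2.0 * (width + height) / l1_perimeter

def passes_qc(seg_dev_value, convexity_value, seg_dev_cutoff):
    """Convexity-reinforced uncertainty QC; returns True if the mask is kept."""
    if convexity_value < 0.6:                                      # criterion 1
        return False
    if seg_dev_value > 0.15:                                       # criterion 2
        return False
    if seg_dev_value > seg_dev_cutoff and convexity_value < 0.96:  # criterion 3
        return False
    return True
```

An axis-aligned rectangle scores exactly 1; a notch in the boundary lengthens the L1 perimeter and pushes the score below 1.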
We tested segmentation QC with the learned cutoffs on the CAMUS test set, EchoNet-Dynamic data, and Geisinger segmented studies. For the former two datasets, we compared changes in ground truth DICE before and after QC; since Geisinger data do not have ground truth segmentation, we compared changes in volume/mass estimation errors before and after QC.

Segmentation
All five tested models showed median absolute errors (MAE) comparable with reported inter-observer errors (IOE) in segmenting the LV endocardium [2]; however, only the Stough et al. 2D model exhibited the lowest MAE across all five clinically-reported measures (Fig. 1). Note that the IOE used in this study were the minimal relative errors estimated in previous studies using the same Simpson's MOD-bp method as used in our study (Table 2) [2], [5], [6].
Bland-Altman density plots are shown in Fig. 2. Pareto front curves show that the two uncertainty scores, i.e., Seg_dev and Seg_CoV, were able to preferentially identify poor segmentations (Fig. 3). Seg_dev, however, exhibited higher efficiency with steeper slopes, especially for LV endocardium segmentation. For demonstration purposes, we selected the threshold closest in Euclidean distance to the origin point (0, 0) for Seg_dev of each structure, i.e., 0.039 for LV endocardium (LV), 0.057 for LV wall (Myo), and 0.055 for LA endocardium (LA) (Fig. 3).
By removing segmentations with Seg_dev above the cutoffs selected above, the mean ground truth DICE improved for all three segmented structures (pre-QC vs. post-QC: LV 0.945 vs. 0.951, Myo 0.894 vs. 0.903, LA 0.931 vs. 0.940), as evaluated on augmented CAMUS test images (Fig. 4A, Table 3).
Using the same LV Seg_dev cutoff, a similar improvement in mean ground truth DICE with fewer poor segmentations was observed for the LV endocardium in the independent EchoNet-Dynamic dataset (Fig. 4B). The mean ground truth DICE increased from 0.893 (pre-QC) to 0.909 (QC with Seg_dev) after removing 68% of poor segmentations (data not shown). QC performance metrics for identifying good segmentations were slightly lower compared to those observed for the CAMUS test set (Table 3). We observed some disassociation between segmentation ground truth DICE and the uncertainty measures: some segmentations with DICE <0.85 exhibited low uncertainty measures (Fig. 5C and 5D), while some segmentations with higher DICE exhibited high uncertainty values (Fig. 5A and 5B), especially those with part of the myocardium outside the image field. In both instances, segmentation convexity scores provided additional, independent insight (Fig. 5). Indeed, the addition of convexity to Seg_dev-based QC greatly increased the sensitivity score while slightly compromising the precision score, increasing the F1 score from 0.86 to 0.92 (Table 3). The percentage of good segmentations removed by QC dropped from 20% to 3% (Table 3). With convexity, Seg_dev-based QC removed only 20% of poor segmentations (data not shown).
We assessed the changes in absolute errors in volume and mass estimation before and after QC in the Geisinger data using both Seg_dev and convexity-reinforced Seg_dev (Tables 4 and 5). With Seg_dev alone, the mean absolute errors for the reported clinical measures decreased by 3-15%, while the mean absolute errors of the removed studies ranged from 17% to 34%. However, in each case, a large proportion of studies (14-71%) was removed based on the specified QC thresholds. We again observed that many segmentations with significant uncertainty had high convexity scores (Fig. 6). Thus, QC with convexity-reinforced Seg_dev removed significantly less data compared to QC with Seg_dev alone (<10%). Although the decreases in errors with convexity-reinforced Seg_dev were lower (Table 4), the errors in the studies removed by this QC were much larger (Table 5). With both QC strategies, LVM estimation exhibited the largest loss of studies, followed by EF estimation (Table 5).

Discussion
Overall, the five published echocardiography segmentation models tested in the present study generalized well to a large external clinically-acquired dataset of 2D echocardiograms on most segmentation tasks, leading to volume and/or mass estimations with accuracy comparable to manual analyses [2], [4]-[6]. This clinically-acquired dataset involved a variety of cardiovascular disease conditions with a wide span of EF values. The comparable errors in LV EDV, ESV, and EF exhibited by the five models suggest good adaptation to patients with different LV conditions. The striping pattern observed in the EF Bland-Altman density plot was likely due to the tendency for the physician-reported EF to take discrete values (e.g., 60%) or ranges (e.g., 60-65%, for which we used the mean value of 62.5%) rather than naturally comprising a continuous representation, as the segmentation result does. Although mostly within the reported IOE range (i.e., 11 ± 12%) [5], the LAV errors were higher than the mean IOE for all four multi-structural segmentation models. This sub-optimal LA segmentation, as compared to LV segmentation, could be partially attributable to the high boundary-to-area ratio of the LA. Another contributing factor could be the fact that the LA was usually associated with more distortion/noise/motion, especially in the bottom region, since it was farther from the transducer. Moreover, for videos not focused on the LA, part of the LA might be outside the image [19], [24], [25]. As for the Zhang et al. model, although it was trained with some data augmentation [20], the relatively small original training dataset likely restricted its generalizability, leading to the modest performance observed in the present study as well as on the CAMUS dataset [19].
Besides training data, model structure and training strategy also likely contributed to the different performance exhibited by the five models. For example, the balance between two objectives during adversarial training may have compromised the segmentation task of the Arafati et al. model [19], contributing to its sub-optimal performance on this large external clinical dataset.
Additionally, the varied segmentation performance could partially arise from the training labels, i.e., the manual segmentations generated by different cardiologists. For example, compared to the precise LV and LA endocardial manual segmentations used to train the Arafati et al. model [19], the manual segmentations of the CAMUS and EchoNet-Dynamic datasets were conservative, especially at the apex and along the free wall of the LV [16], [21]. As a result, models trained on these two datasets produced correspondingly conservative segmentations. QC was achieved in this study by leveraging the 10-fold CV models trained by Stough et al. [24].
Segmentation uncertainty was easily obtained as a by-product when generating the final segmentation through accumulating the outputs of the ten CV models. Our Seg_dev-based QC method efficiently removed the majority of poor segmentations for all three segmented structures, leading to slight increases in the mean ground truth DICE. It was not surprising to detect such minor increases in DICE score, particularly because the segmentation models already had superior performance before QC with less frequent failure (i.e., fewer segmentations with ground truth DICE <0.85). Moreover, the performance of our Seg_dev-based QC method in improving LV segmentation quality was comparable to state-of-the-art results as evaluated on the CAMUS and EchoNet-Dynamic datasets [28]. Like other uncertainty methods [28], this method failed to flag some bad segmentations when all ten models performed consistently poorly. Moreover, although some final segmentations accumulated over ten models looked acceptable, they were dropped due to high uncertainty arising from the presence of low-contrast or invisible surrounding tissue in the images. This problem was more evident for LV wall segmentation. Significantly, the addition of convexity to Seg_dev-based QC rescued those segmentations with good convex shape and added more confidence in filtering out bad segmentations. In fact, the superior precision and sensitivity scores for picking up good segmentations from the CAMUS and EchoNet-Dynamic datasets support our QC method, especially the convexity-reinforced uncertainty strategy, as an effective approach once appropriate cutoffs are set. This was further evidenced by the removal of large errors and the decreased absolute errors in downstream volume/mass estimation as shown in the Geisinger data.
However, it should be noted that there will always be a tradeoff between segmentation quality and the number of studies excluded when defining a cutoff, and this cutoff may need to be adjusted depending on the deployment scenario of interest. The decision on QC cutoffs will be the prior step for the deployment of the Stough et al. models. One limitation of our study was the lack of ground truth segmentation for the large Geisinger dataset. This restricted our evaluation to the estimation of volume and mass, which is downstream of segmentation. The fact that the errors in volume/mass estimation were within human IOE lent sufficient support to the use of our model-generated segmentations to derive key clinical measures. Another limitation was the absence of a second external evaluation dataset independent of any of the five models. However, in our pilot studies, we evaluated the performance of all five models on the CAMUS and part of the EchoNet-Dynamic datasets using the same procedures described in this study (online Table S2). Overall, the Stough et al. 2D and 3D models outperformed the others in these studies. Finally, it would be of great interest to evaluate a QC method on all five models, but this would require either re-training these models in a similar ten-fold CV scheme on non-Geisinger datasets or easy access to multiple sets of model weights for each of the five models, both of which are beyond the scope of the current study.
Although aleatoric uncertainty can be estimated using test-time augmentation, it is tricky to choose an appropriate augmentation range.
In conclusion, all five state-of-the-art echocardiography segmentation models generalized well with good performance on most tasks within a large clinically-acquired echocardiography dataset, with the Stough et al. 2D model exhibiting the best overall performance. Convexity-reinforced uncertainty QC efficiently improved segmentation results and may further facilitate the translation of such models.

AD: absolute difference between 2 observers in percent of their mean; Individual SD: standard deviation of the difference between 2 observers in percent of their mean; AD = √2 × Individual SD.