Overall, the five published echocardiography segmentation models tested in the present study generalized well to a large external clinically-acquired dataset of 2D echocardiograms on most segmentation tasks, leading to volume and/or mass estimations with accuracy comparable to manual analyses [2], [4]–[6]. This clinically-acquired dataset involved a variety of cardiovascular disease conditions with a wide span of EF values. The comparable errors in LV EDV, ESV, and EF exhibited by the five models suggest good adaptation to patients with different LV conditions. The striping pattern observed in the EF Bland-Altman density plot was likely due to the tendency for the physician-reported EF to take discrete values (e.g., 60%) or ranges (e.g., 60–65%, for which we used the mean value 62.5%), whereas the segmentation-derived EF is continuous. Although mostly within the reported IOE range (i.e., 11\(\pm\)12%) [5], the LAV errors were higher than the mean IOE for all four multi-structural segmentation models. This sub-optimal LA segmentation, as compared to LV segmentation, could be partially attributable to the high boundary-to-area ratio of the LA. Another contributing factor could be that the LA was usually associated with more distortion, noise, and motion, especially at the bottom region, since it was farther from the transducer. Moreover, in videos not focused on the LA, part of the LA might fall outside the image.
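The striping noted above for the EF Bland-Altman plot can be made explicit with a short derivation. This is a sketch under the assumption that the plotted difference is model-derived EF minus physician-reported EF; the symbols \(m\), \(r\), \(d\), and \(a\) are introduced here for illustration only. Let \(m\) denote the model-derived EF and \(r\) the reported EF. The plot shows the difference \(d\) against the mean \(a\):

\[
d = m - r, \qquad a = \frac{m + r}{2} \quad\Longrightarrow\quad d = 2\,(a - r).
\]

For a fixed reported value \(r\), all points therefore fall on a straight line of slope 2 that crosses \(d = 0\) at \(a = r\). Because the reported EF takes only a few discrete values (e.g., 60% or 62.5%), the points gather on a small set of parallel lines, which appear as stripes. With the opposite sign convention (\(d = r - m\)), the slope would be \(-2\).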
The generalizable performance exhibited by four of these models (i.e., the Stough et al. 2D and 3D, Ouyang et al., and Arafati et al. models) could be partially attributable to the fact that they were trained either on a very large dataset or with data augmentation [16], [19], [24], [25]. As for the Zhang et al. model, although it was trained with some data augmentation [20], the relatively small original training dataset likely restricted its generalizability, leading to the modest performance observed in the present study as well as on the CAMUS dataset [19].
Besides training data, model structure and training strategy also likely contributed to the different performance exhibited by the five models. While the Ouyang et al. model, which exhibited superior accuracy in segmenting the LV blood pool, leveraged atrous convolutions [16], Stough et al. employed CV model averaging with a simple U-net structure [24], [25]. In fact, models developed by Stough et al., especially the 2D frame-level model, stood out as superior in almost all segmentation tasks, suggesting CV model averaging as a robust method to generalize the segmentation models. Stough et al. also used appearance and shape co-learning strategies when training the 3D segmentation model [25]; the additional objective to enforce temporal coherency between ED and ES phases, however, may compromise the segmentation performance at individual ED and ES frames [25]. Moreover, the Stough et al. 3D model required ED-ES clips, which were generated based on frame-level segmentations; any errors at frame-level segmentation could propagate during 3D segmentation. All these factors could explain the sub-optimal accuracy exhibited by the 3D model, as compared to the Stough et al. 2D model. Similarly, the balance between two objectives during adversarial training may compromise the segmentation task of the Arafati et al. model [19], contributing to its sub-optimal performance on this large external clinical dataset.
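As an illustration of the CV model averaging idea discussed above, the following minimal Python sketch combines per-fold class-probability maps into one segmentation. It assumes that each fold model outputs softmax probabilities and that the outputs are combined by a simple mean; the function name `average_fold_predictions`, the array shapes, and the toy data are hypothetical and are not taken from the Stough et al. implementation.

```python
import numpy as np

def average_fold_predictions(fold_probs):
    """Average per-fold softmax maps and take the most likely class per pixel.

    fold_probs: array of shape (n_folds, n_classes, H, W), one probability
    map per cross-validation model (assumed layout, for illustration only).
    Returns (mean_probs, labels) with shapes (n_classes, H, W) and (H, W).
    """
    fold_probs = np.asarray(fold_probs, dtype=float)
    mean_probs = fold_probs.mean(axis=0)   # average over the fold models
    labels = mean_probs.argmax(axis=0)     # per-pixel class decision
    return mean_probs, labels

# Toy example: ten fold models, four classes, an 8x8 image.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 4, 8, 8))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
mean_probs, labels = average_fold_predictions(probs)
```

Averaging probabilities rather than hard labels keeps the per-pixel spread across folds available, which is the kind of by-product an uncertainty-based QC step can reuse.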
Additionally, the varied segmentation performance could partially arise from the training labels, i.e., the manual segmentations generated by different cardiologists. For example, compared to the precise LV and LA endocardial manual segmentations used to train the Arafati et al. model [19], the manual segmentations of the CAMUS and EchoNet-Dynamic datasets were conservative, especially at the apex and along the free wall of the LV [16], [21]. As a result, models trained on these two datasets, i.e., the Stough et al. 2D and 3D models and the Ouyang et al. model, tended to generate conservative segmentations, especially when the LV was enlarged with part of the myocardium outside the images, leading to underestimated LV volumes at ED and ES. However, the majority of segmentations generated, especially by the Stough et al. 2D model, clustered around zero bias for all five volume/mass estimations. These results, taken together, support the Stough et al. 2D model as the most generalizable segmentation model with great versatility, which could be leveraged for a production deployment in a clinical setting.
QC was achieved in this study by leveraging the 10-fold CV models trained by Stough et al. [24]. Segmentation uncertainty was obtained as a by-product of generating the final segmentation by accumulating the outputs of the ten CV models. Our Seg_dev-based QC method efficiently removed the majority of poor segmentations for all three segmented structures, leading to slight increases in the mean ground truth DICE. Such minor increases in DICE score were not surprising, particularly because the segmentation models already performed well before QC, with infrequent failures (i.e., few segmentations with ground truth DICE <0.85). Moreover, the performance of our Seg_dev-based QC method in improving LV segmentation quality was comparable to the state-of-the-art results evaluated on the CAMUS and EchoNet-Dynamic datasets [28]. Like other uncertainty methods [28], this method failed to flag some bad segmentations when all ten models performed consistently poorly. Conversely, some final segmentations accumulated over the ten models looked acceptable but were dropped due to high uncertainty arising from low-contrast or invisible surrounding tissue in the images. This problem was more evident for LV wall segmentation. Notably, adding convexity to the Seg_dev-based QC retained those segmentations with good convex shape and added confidence in filtering out bad segmentations. In fact, the superior precision and sensitivity scores for picking up good segmentations from the CAMUS and EchoNet-Dynamic datasets support our QC method, especially the convexity-reinforced uncertainty strategy, as an effective approach once appropriate cutoffs are set. This was further evidenced by the removal of large errors and the decreased absolute errors in downstream volume/mass estimation in the Geisinger data.
However, there will always be a tradeoff between segmentation quality and the number of studies excluded when defining a cutoff, and this cutoff may need to be adjusted depending on the deployment scenario of interest. Deciding on the QC cutoffs will be a prerequisite for deploying the Stough et al. 2D model with QC in the clinical setting.
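The QC logic described above, and the role of the cutoffs, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the disagreement score (mean of 1 minus Dice between each fold mask and the majority-vote mask) is a hypothetical stand-in for Seg_dev, convexity is taken as mask area divided by convex hull area, the rule "flag only when uncertainty is high and convexity is low" is one possible reading of the convexity-reinforced strategy, and the cutoff values 0.1 and 0.9 are placeholders.

```python
import numpy as np

def dice(a, b):
    """Dice overlap between two binary masks."""
    a, b = a.astype(bool), b.astype(bool)
    denom = a.sum() + b.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(a, b).sum() / denom

def disagreement(fold_masks, final_mask):
    """Hypothetical uncertainty score: mean (1 - Dice) of each fold vs. final."""
    return float(np.mean([1.0 - dice(m, final_mask) for m in fold_masks]))

def _hull_area(points):
    """Convex hull area of 2D points (monotone chain, then shoelace formula)."""
    pts = sorted(set(map(tuple, np.asarray(points).tolist())))
    if len(pts) < 3:
        return 0.0
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    def half(seq):
        out = []
        for p in seq:
            while len(out) >= 2 and cross(out[-2], out[-1], p) <= 0:
                out.pop()
            out.append(p)
        return out[:-1]
    hull = half(pts) + half(reversed(pts))
    x = np.array([p[0] for p in hull])
    y = np.array([p[1] for p in hull])
    return 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))

def convexity(mask):
    """Mask area divided by the area of its convex hull (1.0 = fully convex)."""
    rows, cols = np.nonzero(mask)
    if rows.size == 0:
        return 0.0
    # Use pixel corners so that hull area and pixel count share the same units.
    corners = np.concatenate([np.stack([rows + dr, cols + dc], axis=1)
                              for dr in (-0.5, 0.5) for dc in (-0.5, 0.5)])
    return float(rows.size / _hull_area(corners))

def passes_qc(fold_masks, dev_cutoff=0.1, convexity_cutoff=0.9):
    """Assumed rule: reject only if uncertainty is high AND convexity is low."""
    fold_masks = np.asarray(fold_masks).astype(bool)
    final_mask = fold_masks.mean(axis=0) >= 0.5      # majority vote over folds
    dev = disagreement(fold_masks, final_mask)
    conv = convexity(final_mask)
    return bool(dev <= dev_cutoff or conv >= convexity_cutoff), dev, conv

# Toy data: ten folds agreeing on a square vs. folds split between two bars.
good = np.zeros((10, 20, 20), dtype=bool)
good[:, 5:15, 5:15] = True
bad = np.zeros((10, 20, 20), dtype=bool)
bad[0::2, 2:18, 2:5] = True     # even folds predict a vertical bar
bad[1::2, 15:18, 2:18] = True   # odd folds predict a horizontal bar
ok_good, dev_good, conv_good = passes_qc(good)
ok_bad, dev_bad, conv_bad = passes_qc(bad)
```

In this toy case the agreeing square passes, while the L-shaped majority vote with split folds is flagged. Tightening `dev_cutoff` or raising `convexity_cutoff` excludes more studies in exchange for higher expected quality, which is the tradeoff noted above.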
One limitation of our study was the lack of ground truth segmentations for the large Geisinger dataset. This restricted our evaluation to the estimation of volume and mass, which is downstream of segmentation. The fact that the errors in volume/mass estimation were within human IOE lends support to the use of our model-generated segmentations to derive key clinical measures. Another limitation was the absence of a second external evaluation dataset independent of all five models. However, in our pilot studies, we evaluated the performance of all five models on the CAMUS dataset and part of the EchoNet-Dynamic dataset using the same procedures described in this study (online Table S2). Overall, the Stough et al. 2D and 3D models outperformed the others in these studies. Finally, it would be of great interest to evaluate a QC method on all five models. However, this would require either re-training these models in a similar ten-fold CV scheme on non-Geisinger datasets or ready access to multiple sets of model weights for each of the five models, both of which are beyond the scope of the current study. Although aleatoric uncertainty can be estimated using test-time augmentation, it is difficult to choose an appropriate augmentation range.
In conclusion, all five state-of-the-art echocardiography segmentation models generalized well, with good performance on most tasks within a large clinically-acquired echocardiography dataset. The Stough et al. models, particularly the frame-level 2D model, exhibited the best performance in segmenting the three key left heart structures, with accuracy comparable to manual analyses. Deploying the proposed convexity-reinforced uncertainty QC method can improve overall performance and enable real-time detection and correction of poor segmentations. Thus, incorporating the Stough et al. 2D model and the proposed QC method into an echocardiographic analysis pipeline could potentially facilitate cardiac research and clinical diagnosis by providing efficient and accurate cardiac measurements. Further modifications to both the segmentation models and the post-segmentation analysis are possible and may help improve performance for both clinical and research applications.