CTR derived from CXR is a valuable index for the evaluation of heart diseases, especially cardiomegaly [1–4]. Measuring it, however, still requires manual operations that are user dependent and time consuming; despite its usefulness, the measurement process is therefore a burden in clinical practice. Recently, AI methods have successfully provided automatic calculation of this index and have been validated technically in various studies [9–12]. To use AI in the clinical setting, a clinical evaluation is needed to assess measurement agreement with the manual method. However, only two published pilot studies [9, 11], both with small datasets, have addressed this issue.
To our knowledge, this study is the first report of observer and method variations to validate CTR measurement using AI on a large dataset (n = 7,517). Using a modified U-Net deep-learning model (i.e., 2D VGG-16 U-Net) for CTR calculation, we found that AI is not yet suitable as an automated method for CTR measurement because of its high variation compared with the manual method. Its CTR calculations can, however, assist the user in obtaining better results. Furthermore, the coefficient of determination (R2) and classification performance tests (e.g., AUC) should not be used to assess suitability, because they may lead investigators to falsely conclude that the AI method can be employed as an automated method. Instead, a Bland-Altman plot with coefficient of variation (CV) parameters, evaluated on a large dataset, should be used to indicate agreement between the methods.
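The recommended agreement analysis can be sketched as follows. This is a minimal illustration, not the study's own code: `bland_altman_stats` is a hypothetical helper, and the CV shown here uses one common definition (SD of the paired differences divided by the grand mean, as a percent); other CV formulations exist.

```python
import numpy as np

def bland_altman_stats(a, b):
    """Bland-Altman bias, 95% limits of agreement, and a coefficient of
    variation for paired measurements from two methods (hypothetical
    helper; CV here = SD of differences / grand mean, as a percent)."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    diff = a - b
    bias = diff.mean()                           # mean difference (bias)
    sd = diff.std(ddof=1)                        # SD of the differences
    loa = (bias - 1.96 * sd, bias + 1.96 * sd)   # 95% limits of agreement
    cv = 100.0 * sd / np.mean((a + b) / 2.0)     # percent CV
    return bias, loa, cv

# toy example: paired CTR readings from two methods on five subjects
manual = [0.48, 0.52, 0.55, 0.60, 0.45]
ai     = [0.47, 0.53, 0.54, 0.62, 0.46]
bias, loa, cv = bland_altman_stats(manual, ai)
```

A Bland-Altman plot is then simply the per-subject differences plotted against the per-subject means, with horizontal lines at the bias and the two limits of agreement.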
We found that the AI method provided excellent outcomes in about 40% of the data, a desirable result for an automated method. However, another 56% of the outcomes were only good (i.e., required adjustment by the user) and need improvement before the method can be used automatically. Most of the required adjustment concerned the heart diameter. Therefore, to achieve automated CTR measurement, the AI method needs improvement in heart-diameter calculation, which is difficult because the heart's pixel values are low and its edges are fused with the lung borders or thoracic spine [20]. In addition, the AI method had a failure rate of about 4% (i.e., poor outcomes), most of which occurred in the normal group (97%: 290/299). In routine clinical usage, where CTR is measured only in suspected cardiomegaly cases, such failures are therefore infrequent (9 failures in 2,517 cardiomegaly cases). Nevertheless, most segmentation failures occurred in hearts with quite short diameters (e.g., Fig. 2J). This may be due to inadequate representation of such heart shapes in the training dataset. Fine tuning the model (retraining it from its previous weights) with such heart-shape data from local sources should further reduce these failures.
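To make the dependence on segmentation explicit, a simplified sketch of how a CTR could be derived from binary heart and lung masks is shown below. This is an assumed, illustrative computation (widest horizontal extents), not the pipeline used in the study; clinical measurement conventions differ in detail.

```python
import numpy as np

def ctr_from_masks(heart_mask, lung_mask):
    """Cardiothoracic ratio from binary 2D segmentation masks:
    widest horizontal extent of the heart divided by the outer span of
    the lungs (an approximation of the internal thoracic diameter).
    Illustrative sketch only; not the study's measurement code."""
    def max_width(mask):
        cols = np.where(mask.any(axis=0))[0]   # columns touched by the mask
        return int(cols.max() - cols.min() + 1) if cols.size else 0
    return max_width(heart_mask) / max_width(lung_mask)

# toy masks: heart spans columns 4-9, lungs span columns 0-13 overall
heart = np.zeros((10, 14), bool); heart[4:8, 4:10] = True
lungs = np.zeros((10, 14), bool); lungs[2:9, 0:6] = True; lungs[2:9, 8:14] = True
ctr = ctr_from_masks(heart, lungs)   # 6 / 14
```

An error of even a few pixels in the heart's lateral borders shifts the numerator directly, which is why heart-diameter segmentation dominates the adjustment burden described above.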
We found that the AI-assisted method had lower inter-observer bias and variation than the manual method (CV and bias: 1.72% vs 2.13% and −0.61 vs −1.62). This may be because AI's excellent outcome in about 40% of the data helps improve measurement agreement. Furthermore, the AI-assisted method is almost fivefold faster than the manual method and increased the F1 score from 0.866 to 0.872 at the standard CTR cutoff of 0.5. This clearly demonstrates the usefulness of the AI method in assisting CTR measurement. Our AI-assisted time performance also agrees with a recent study by Bercean et al. [9], which found a similar magnitude of time reduction (22.5 vs 5.1 secs, or 4.4 times). Even on a small dataset (n = 200), that study also found that the model-assisted method improved individual radiologists' cardiomegaly F1 scores (0.845 to 0.851) compared with the manual method.
We concluded that the classification performance of the AI method was not better than that of the manual method, a finding at odds with a report by Li et al. [11], which found that the sensitivity and negative-predictive value of the AI method were significantly better than those of the manual method. This may be due to two factors. First, the performance of deep-learning algorithms in automated CTR measurement depends on their ability to correctly locate heart and lung boundaries. The algorithm in Li et al. [11] may have achieved more precise anatomical segmentations, although the authors did not provide precision metrics on an open dataset for comparison with the model we used [10]. Second, the algorithm in Li et al. [11] was trained and tested on the same dataset, whereas the model used in this paper was trained on an open dataset and tested in an out-of-sample fashion. It would be useful to validate their finding by performing the classification test using their model on our dataset.
CTRs measured by the manual and AI-assisted methods were in substantial agreement with the reference method (CVs of 2.0% and 2.2%, respectively). The AI method, in contrast, had almost three times higher CVs on all comparisons. This strongly suggests that the AI method is not yet suitable as an automated method. However, its R2 over all data (normal and cardiomegaly groups) and its classification performance at the standard or optimum cutoffs were similar to those of the other methods. This is because R2 measures linear association rather than agreement [21, 22], and highly correlated measurements may still show poor agreement [22], as in our case. Furthermore, correlation typically depends on the range of the measurements. This is why the R2 between the manual and AI methods was good across the normal and cardiomegaly groups combined (R2 = 0.79; CTR range = 0.35–0.85) but poor in the cardiomegaly group alone (R2 = 0.34; CTR range = 0.52–0.85) (Fig. 4A and 4C). On the other hand, if agreement measures such as the Bland-Altman plot and CV show good agreement (Fig. 5B and 5D), then the data will surely be highly correlated [22], as shown in Fig. 5A and 5C. Thus, agreement measures should be employed to evaluate the compatibility of the AI method with the manual method in CTR measurement studies.
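The distinction between correlation and agreement can be demonstrated with simulated data: adding a constant systematic offset to a set of CTR values leaves R2 at 1.0 while every paired measurement disagrees by the same amount. The values below are invented for illustration only.

```python
import numpy as np

# High correlation does not imply agreement: a constant offset keeps
# R^2 at 1.0 while introducing a uniform bias of 0.05 between methods.
rng = np.random.default_rng(0)
manual = rng.uniform(0.35, 0.85, 100)   # simulated "manual" CTR values
ai = manual + 0.05                      # systematically biased "AI" readings

r = np.corrcoef(manual, ai)[0, 1]       # Pearson correlation
r_squared = r ** 2                      # coefficient of determination
bias = (ai - manual).mean()             # mean difference between methods
```

Here R2 is exactly 1.0, yet the AI values would push many borderline-normal patients over the 0.5 cardiomegaly cutoff, which is precisely the kind of disagreement a Bland-Altman analysis exposes and R2 conceals.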
Classification performance tests may also be misleading because they only describe performance in separating the normal and cardiomegaly groups, not how well the methods agree. For example, Fig. 2D and 2F present cases where the AI method gave a false-positive and a false-negative result, respectively. These two cases affect the classification test, but most of the AI data did not (i.e., the AI's CTR values did not change the classification), as shown in Fig. 2E and Table 3. Accordingly, we still obtained excellent classification performance at the standard CTR cutoff (e.g., AUC = 0.902). However, if this AI method were employed to rule out cardiomegaly (i.e., using the CTR cutoff at maximum sensitivity), it would perform poorly (e.g., accuracy of 34.8%) and should not replace the manual approach. Testing agreement is necessary for the evaluation of the AI method if it is to be implemented as an automated method, and its agreement should be comparable to that of the manual method (CV = 2.1%).
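The cutoff-based classification evaluated above can be sketched as follows. The function and the toy data are hypothetical, shown only to make the metric concrete; the study's own evaluation used the full dataset and standard tooling.

```python
def ctr_classification(ctrs, labels, cutoff=0.5):
    """Binary cardiomegaly call at a CTR cutoff, returning (accuracy, F1).
    ctrs: measured ratios; labels: 1 = cardiomegaly, 0 = normal.
    Illustrative helper, not the study's evaluation code."""
    preds = [1 if c > cutoff else 0 for c in ctrs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    acc = sum(1 for p, y in zip(preds, labels) if p == y) / len(labels)
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return acc, f1

# toy data: one false positive (CTR 0.51 in a normal case)
ctrs = [0.45, 0.55, 0.62, 0.48, 0.51]
labels = [0, 1, 1, 0, 0]
acc, f1 = ctr_classification(ctrs, labels)
```

Note that accuracy and F1 stay high as long as the AI's CTR errors do not cross the cutoff, regardless of how large those errors are in absolute terms, which is exactly why these metrics cannot substitute for an agreement analysis.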
We performed observer and method variation tests on a large dataset using only a modified U-Net deep-learning model because we wished to obtain baseline AI performance data. Our results, especially the manual measurements of 7,517 CXRs, will serve as a reference to evaluate other state-of-the-art AI models [23]. Our plan is to test these models on our dataset and accept an AI outcome only if it differs from our manual result by less than ±1.8% (i.e., the excellent category, where the user can accept the outcome without adjustment). Any model with an acceptance rate above 70% will be studied prospectively in a clinical setting and evaluated by our radiologists. Furthermore, at such an acceptance rate, we will perform another retrospective study on our PACS data (around one million CXR images). Such a pioneering study would provide more insight into CTR values and useful information for clinicians.
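The screening rule described above could be computed as in the sketch below. This assumes the ±1.8% threshold is a relative difference against the manual value; the helper name and toy data are hypothetical, and the reader should consult the Methods for the exact criterion.

```python
import numpy as np

def acceptance_rate(manual, ai, tol_pct=1.8):
    """Fraction of cases where the AI CTR differs from the manual CTR by
    less than tol_pct percent, taken here as a relative difference.
    Hypothetical helper illustrating the screening rule in the text."""
    manual, ai = np.asarray(manual, float), np.asarray(ai, float)
    rel_diff = 100.0 * np.abs(ai - manual) / manual
    return float((rel_diff < tol_pct).mean())

# toy data: two of four AI readings fall within the tolerance
manual = [0.50, 0.55, 0.60, 0.48]
ai     = [0.505, 0.57, 0.601, 0.47]
rate = acceptance_rate(manual, ai)
```

Under the stated plan, a candidate model would proceed to prospective evaluation only if this rate exceeded 0.70 on the full 7,517-case reference set.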
There were some limitations in our dataset and methods. We used only normal and cardiomegaly data; there were no data from other pathologies, such as a pericardial fat pad or pleural effusion. These pathologic conditions may limit the DL model's ability to segment the heart and lungs and may lower the performance of CTR measurement. Such data should be included in future studies to better evaluate the model's performance. Furthermore, we investigated only adult cases; evaluation of CTR measurement by AI in pediatric cases is needed. Next, we used only a publicly available dataset; future studies using local datasets are needed to improve the model's performance. Finally, unlike most deep-learning studies of CXR analysis, this study did not address how AI can be trained to match human performance in CTR measurement, but rather assessed the extent to which deep-learning methods can benefit radiologists' practice in a clinical setting. Future studies may focus more on the patterns of errors generated by the algorithms and suggest ways to improve their accuracy.