In this study, we investigated changes in diagnostic accuracy, inter-observer agreement, and intra-observer reproducibility between interpretation with and without AI assistance. The results showed that AI-assisted software can eliminate both inter- and intra-rater variability. Furthermore, with the assistance of bone age AI software, the diagnostic accuracy of bone age assessment can be improved for less experienced radiologists.
With the use of AI and machine learning, especially deep learning, the best-known machine learning method, new possibilities for automated BAA have emerged[5–7]. The most popular deep learning architecture is the convolutional neural network (CNN), which has made tremendous progress in recent years, and numerous publications describe the use of CNNs in BAA[4, 8–11, 14, 15]. The Radiological Society of North America (RSNA) launched a BAA challenge in 2017, in which many machine learning methods achieved good results[18, 19]. The AI tool used in this study is CNN-based.
The emergence of fully automatic AI software helps overcome the complexity and time consumption of the interpretation process. Most publications compare AI with radiologists and report convincingly good results in terms of improved accuracy or reduced complexity and time. However, sending AI results directly to the pediatrician without confirmation by a radiologist is not yet a reality. In clinical practice, the purpose of AI-assisted software is to assist the radiologist, not to be used independently. Only by being validated in daily routine can AI-assisted software truly prove its value. Therefore, two image interpretation scenarios, “without AI” and “with AI”, were included in our research. Our results demonstrated that with the assistance of AI, the accuracy of residents’ results improved significantly, which is consistent with most similar publications.
One of the challenges in BAA is the variability in radiologists’ clinical interpretation of bone age radiographs, both between observers and within the same observer. Can automated bone age tools enhance inter-observer diagnostic consistency or intra-observer diagnostic reproducibility? Only a few papers have focused on this question. A study by Tajmir et al[16] revealed that AI-assisted BAA improved radiologist performance while decreasing variation (ICC was 0.9914 without AI and 0.9951 with AI); however, only three radiologists participated in image interpretation. Lee et al[12] developed a deep learning-based hybrid (GP and modified TW) method for BAA, and the ICC of the two radiologists increased slightly with AI model assistance (from 0.945 to 0.990). In another study by Koc et al[20], the ICC was 0.980 both without and with AI (BoneXpert), so inter-observer variability was not eliminated in their research. Our study demonstrated that AI bone age tools can eliminate both inter-observer and intra-observer variability. Six observers were analyzed, and intra-observer variability was also compared.
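For readers less familiar with the intraclass correlation coefficient (ICC) values quoted above, the following is a minimal sketch of how such a coefficient could be computed from a subjects-by-observers table of bone age readings. It assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1); the cited studies and the present study may have used a different ICC form. The function name `icc_2_1` and the example readings are hypothetical and are not taken from any study.

```python
def icc_2_1(ratings):
    """Two-way random-effects, absolute-agreement, single-rater ICC.

    ratings: list of rows, one row per subject (radiograph),
             one column per observer; all rows have the same length.
    Assumption: this ICC form is chosen for illustration only.
    """
    n = len(ratings)        # number of subjects
    k = len(ratings[0])     # number of observers
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-subject mean square
    msc = ss_cols / (k - 1)              # between-observer mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical bone ages (years) for four children read by three observers
readings = [
    [3.2, 3.4, 3.1],
    [4.0, 4.3, 4.1],
    [5.1, 5.0, 5.4],
    [5.9, 6.2, 6.0],
]
print(round(icc_2_1(readings), 3))
```

Under this form, systematic offsets between observers lower the coefficient as well as random disagreement, which is why an ICC close to 1 is read as high inter-observer agreement. The same function could be applied to repeated readings by one observer to describe intra-observer reproducibility.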
It is well known that the GP and TW methods are the most commonly used clinical approaches for BAA. GP is the most popular method among pediatricians and radiologists, as BAA by GP is relatively quick and easy to learn. However, the GP method itself has significant inter-observer and intra-observer variability[21]. The TW method is considered to be more accurate and objective than the GP method and to have lower variability[22, 23]. We therefore chose the TW method and TW-based AI software for our study. Skeletal maturity varies by ethnicity, geographic location, and socioeconomic status, so Caucasian reference standards cannot be expected to be suitable for comparison in China. A TW3 standard modified for the Chinese population was therefore applied in our research. These modified bone age reference standards were approved by the national official standards certification center. The AI software used in the research was also designed for the Chinese population according to the modified TW3 standard.
TW3-Carpal has been evaluated less often than TW3-RUS, as the epiphysis of the ulna and all carpal bones are less reliable indicators of bone age for females from 2 to 7 years old and males from 3 to 9 years old. However, inter- and intra-observer variability can still be evaluated, as there is an evaluation criterion for TW3-RUS. Thus, we designed our study to investigate inter- and intra-observer variability for both TW3-RUS and TW3-Carpal. The study population consisted of preschool children aged 3 to 6 years.
There were several limitations in this study. First, this was a single-center study with a small, single-ethnicity sample, and only preschool children were enrolled. In the future, prospective multicenter studies with more cases will be performed. Second, the interpretation time was not recorded. Time consumption should be compared, although many studies have already demonstrated that AI-assisted software can markedly reduce diagnostic time[9, 10, 14].
In X-ray bone age assessment of preschool children, bone age AI-assisted software can not only improve diagnostic accuracy but also increase inter-observer agreement and intra-observer reproducibility. AI-assisted software can be an effective diagnostic tool for residents during BAA.