Multi-characteristic reinforcement of horizontally integrated TENet based on wrist bone development criteria for pediatric bone age assessment

Pediatric bone age assessment (BAA) is an essential physiological examination that reflects growth potential and sexual maturation trends. In clinical practice, the “Methodology for Bone Maturation and Evaluation of the Wrist in Chinese Adolescent Children” (CHN-05) is widely used for BAA by Chinese radiologists. CHN-05 adopts the metacarpal length (ML) and the metacarpal margin (MM) as reference standards to estimate bone age. Inspired by the semantic description of CHN-05, we propose a new model for automatic bone age assessment, a network built on topology and edge maps (TENet). In TENet, we design a hand topology module to recognize key hand locations and extract structural semantics. In addition, we devise an edge feature enhancement module to supply precise skeletal edge information throughout the training process. Our model combines the global information of edges with the local information of topology, fusing multiple features horizontally to assess bone age. Experimental results show that TENet achieves a state-of-the-art mean absolute error (MAE) of 5.35 months on the public RSNA dataset. Because the model design follows the semantic logic of CHN-05, it is reliable and interpretable for clinical use.


Introduction
The development of the human skeleton is continuous and proceeds in stages, and the skeleton at different stages has distinct characteristics [1]; therefore, bone age assessment can reflect the growth and maturity level of an individual more accurately. It can not only determine the physiological age of children but also provide an early understanding of their growth potential and sexual maturation trend.
Traditional assessment of bone age usually involves taking radiographs of the subject's hands and wrists, which are then interpreted by a physician. The most commonly used methods are the Greulich & Pyle (G&P) atlas method [2] and the Tanner & Whitehouse (TW3) scoring method [3]. Given racial differences and long-term trends in growth and development, the most suitable bone age standard for contemporary Chinese children is currently CHN-05 [4], which has become the only industry bone age standard in China at present. Recently, deep learning has been widely used in the medical field to extract features [5,6] and to regress the predicted bone age using classical networks such as Inception-v3 and ResNet-50 [7][8][9][10].
Although these methods can achieve good performance in BAA, they are inspired by G&P and TW3, whose criteria are based on samples derived from American and European children in the 1990s. Because of chronological changes and population differences, these methods cannot objectively and accurately evaluate the bone age of contemporary Chinese children. Therefore, in this paper, we propose a new deep learning [11] framework called TENet, inspired by the prior knowledge [12] of CHN-05 (see Fig. 1). TENet takes Chinese children's metacarpal edges and metacarpal length as reference standards for phased feature acquisition and bone age regression, using hand radiographs with only bone age supervision. It imitates the diagnostic logic of CHN-05 through a hand topology [13] module, consisting of Sobel [14] edge detection and hand pose estimation [15], that obtains structured region of interest (ROI) information. We also develop an edge enhancement module that strengthens the detection of overall hand edge features through automatic color equalization [16], bilateral filtering [17], and the Otsu algorithm [18]. Finally, these features are fused with the pre-processed original images for bone age assessment. The contributions of this paper are three-fold.
1. We propose a novel deep learning model to predict bone age, following the prerequisite knowledge and diagnostic criteria of CHN-05, reflecting the reliability and interpretability of the model.
2. We design a pixel-level enhancement map of the edge features of the metacarpal phalanges and a topological structure map based on the semantic knowledge in CHN-05, simultaneously using these two features to complement the assessment of bone age.
3. Under the bimodal horizontal integration of the multi-trait reinforcement map and gender information, our deep learning model reaches promising performance on both public and private datasets.

Related work
In recent years, with the development of artificial intelligence, deep learning has made bone age assessment faster and more efficient [9,19,20,21]. Krit Somkantha et al. [22] used an edge-following technique to extract the boundary of each carpal bone and computed five features for bone age assessment; all features were used as input to support vector regression (SVR). Pengyi Hao et al. [23] used the carpal bones as an ROI for boundary extraction and then evaluated bone age with a regression convolutional neural network from left-hand carpal X-rays of children. Dong Wang et al. [24] designed an anatomical local awareness network module to learn hand structures and extract local information to estimate bone age, together with an anatomical patch training strategy that provides additional regularization during training.
Motivated by CHN-05, which assesses bone age by using metacarpal edges as well as metacarpal length as reference standards in Chinese children, we employ edge algorithms and hand pose estimation to obtain overall features of metacarpal edges along with topologically local features that are more suitable for assessing bone age in Chinese children. Experiments on both the public dataset RSNA and the private dataset show that our framework achieves good results on bone age prediction.

Fig. 1 Overall framework: the blue double line indicates the doctor's manual assessment of bone age results, and the black single line indicates our proposed three-stage TENet framework with reference to the semantic description in CHN-05. In the first stage, the key points are predicted by hand pose estimation and projected on top of the hand Sobel edge map to form a hand topography map to obtain structured ROI information. In the second stage, the enhanced overall features of the edges are obtained by using the ImCAT module. In the third stage, we train a regression model to predict bone age by horizontally incorporating global features, local features, preprocessed maps, and gender information.

Methodology
When estimating bone age using the CHN-05 method, doctors characterize the patient's metacarpal pattern, using the metacarpal edges as well as the metacarpal length as reference criteria, score each feature according to the scoring criteria, and finally sum the scores to make a bone age prediction.
Imitating this process, we design a three-stage TENet for bone age prediction, as illustrated in Fig. 1. In the first stage, we perform fast segmentation of hand radiographs by using conventional Sobel edge detection to obtain the backbone feature map of the hand. Then we predict the key point locations by using the hand pose estimation model and attach them to the hand backbone features to form the hand topography map, which provides the local ROI information.
In the second stage, we segment hand radiographs with improved Canny edge detection with adaptive thresholding to extract their enhanced overall edge features. In the third stage, the topography map obtained in the first stage, the edge feature map obtained in the second stage, and the pre-processed original map are used as inputs to train the regression model to obtain the predicted bone age.
In what follows, we describe the hand topology map in Section 3.1, the extraction of enhanced edge features in Section 3.2, and the backbone network in Section 3.3.

Hand topography
While analyzing the edge features, the physician focuses on the length of the patient's metacarpal phalanges as well as their positional structure. Therefore, we propose a hand topography module that obtains structured ROI information in two steps: it quickly segments the foreground edge of the metacarpals with the conventional Sobel operator, and it attaches the key nodes of the hand obtained by hand pose estimation (see Fig. 2). This is also more in line with the semantic description of CHN-05, which evaluates bone age from four directions: metacarpal, dorsal, radial, and ulnar.

Sobel edge detection
The Sobel algorithm is efficient: it convolves the image A with a horizontal and a vertical kernel to obtain approximations of the luminance difference [25] of the image in the horizontal and vertical directions (1), from which the gradient magnitude and direction of each pixel are computed. With the conventional Sobel edge detection operator, it is possible to efficiently automate the recognition and segmentation of the regions involved in the semantic description of the "first to fifth metacarpals" in CHN-05, and to obtain an overall foreground edge map, thus laying the foundation for subsequent metacarpal length localization.
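
The Sobel step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it assumes the standard 3×3 Sobel kernels, edge replication at the image border, and the usual magnitude and direction formulas. The function names `filter3x3` and `sobel_edges` are hypothetical.

```python
import numpy as np

# Standard 3x3 Sobel kernels (assumed; the paper's own matrices are not reproduced here).
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)


def filter3x3(image, kernel):
    """Slide a 3x3 kernel over a 2-D image (cross-correlation), replicating border pixels."""
    padded = np.pad(image.astype(float), 1, mode="edge")
    h, w = image.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out


def sobel_edges(image):
    """Return the gradient magnitude and direction (radians) of a grayscale image."""
    gx = filter3x3(image, SOBEL_X)   # horizontal luminance difference
    gy = filter3x3(image, SOBEL_Y)   # vertical luminance difference
    magnitude = np.hypot(gx, gy)     # sqrt(gx^2 + gy^2)
    direction = np.arctan2(gy, gx)
    return magnitude, direction
```

Thresholding `magnitude` would give a binary foreground edge map; how the threshold is chosen for the Sobel stage is not stated in the text.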

Hand pose estimation
Hand pose estimation is widely used in 3D modeling. We use the MediaPipe framework, developed and open-sourced by Google Research, to predict the positions of 21 joint points on 2D hand radiographs, so as to focus on the topology of the hand ROI. Because the CHN-05 standard describes "metacarpal length" semantically, the 21 hand key nodes are designed to correspond to the locations of the anterior, middle, distal, and ossification centers in the standard. The vector nodes can thus be attached to the Sobel foreground edge map of the wrist bone to form a topological feature map of the hand. During training, this map guides the deep learning framework to extract features from different anatomical regions of interest, enabling bone age assessment in four directions (metacarpal, dorsal, radial, and ulnar) and improving the performance of the final method (Fig. 3).
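
To illustrate how key nodes might be attached to the edge map, the sketch below stamps square markers onto a copy of a Sobel edge map. It assumes the key points arrive as normalized (x, y) coordinates in [0, 1], which is how MediaPipe Hands reports its 21 landmarks. The function name `overlay_keypoints`, the marker shape, and the marker size are hypothetical choices, and the call to the pose estimator itself is omitted.

```python
import numpy as np


def overlay_keypoints(edge_map, keypoints_norm, radius=2, value=255):
    """Stamp a square marker on a copy of the edge map at each normalized key point."""
    h, w = edge_map.shape
    topo = edge_map.copy()
    for x_norm, y_norm in keypoints_norm:
        # Convert normalized coordinates to pixel indices, clamped to the image.
        col = min(int(x_norm * w), w - 1)
        row = min(int(y_norm * h), h - 1)
        r0, r1 = max(row - radius, 0), min(row + radius + 1, h)
        c0, c1 = max(col - radius, 0), min(col + radius + 1, w)
        topo[r0:r1, c0:c1] = value
    return topo
```

The result keeps the edge pixels and adds the landmark positions, which is one plausible reading of the "hand topography map" described above.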

Edge feature enhancement
In clinical practice, physicians also perform pattern analysis based on metacarpal edge features. We take the traditional Canny (TC) operator as the detection algorithm and modify it; the resulting method, improved Canny edge detection with adaptive [26] thresholding (ImCAT), can precisely locate edge positions and obtain enhanced overall edge features (see Fig. 4). Compared with the TC algorithm, ImCAT introduces three changes:
1. The automatic color equalization (ACE) algorithm is adopted in data pre-processing to address the background area and the low contrast of the ROI in hand radiographs.
2. Instead of Gaussian filtering, bilateral filtering is used to deal effectively with the impulse noise in images.
3. Manual selection of thresholds is replaced by the Otsu algorithm, which calculates the best thresholds for different images, minimizing the uncertainty in the performance of the algorithm.
As illustrated in Fig. 5, ACE addresses the lack of clarity in the original hand radiographs. Meanwhile, the bilateral filter and the Otsu algorithm together tackle a problem inherent to the TC, Laplacian, Prewitt, and Roberts operators, namely their failure to identify the ossification center and the end boundary of the metacarpals. The final result is therefore more consistent with the semantics of CHN-05, in which the ossification center is clearly visible and disc-shaped, with a smooth and continuous edge, and the fusion of epiphysis and diaphysis can be observed.

Image pre-processing
In an X-ray image, a white-colored mask blurs the image itself due to technical limitations. Therefore, we employ the automatic color equalization algorithm to perform regionally adaptive filtering on the original image, which corrects the color difference and yields a spatial-domain reconstructed image. This highlights the features and valuable information in the image and makes it more compatible with human visual perception.
Here I_c(p) − I_c(j) is the grayscale difference between two pixel points p and j, which expresses a biologically inspired lateral inhibition mechanism; d(p, j) is the distance metric function, which controls the regional adaptation of the filtering; S_α(x) is the luminance expression function; and R_c is an intermediate result.
The corrected image is then dynamically extended, that is, the values of R_c in (4) are mapped into [0, 255].
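
The two ACE steps can be illustrated with a brute-force sketch. It assumes the commonly used form of ACE, in which R_c(p) sums S_α(I_c(p) − I_c(j)) / d(p, j) over all pixels j ≠ p, with S_α a clipped linear slope and d the Euclidean distance, followed by a linear min-max stretch into [0, 255]. These specific choices, the parameter `alpha`, and the function name `ace_naive` are assumptions rather than details from the text. The cost is quadratic in the number of pixels, so this form is only practical for small images.

```python
import numpy as np


def ace_naive(image, alpha=5.0):
    """Brute-force automatic color equalization for a small grayscale uint8 image."""
    img = image.astype(float) / 255.0
    h, w = img.shape
    rows, cols = np.mgrid[0:h, 0:w]
    coords = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)
    values = img.ravel()
    response = np.zeros(values.size)
    for p in range(values.size):
        # d(p, j): Euclidean distance from pixel p to every pixel j.
        dist = np.hypot(*(coords - coords[p]).T)
        dist[p] = np.inf  # exclude j == p
        # S_alpha: clipped linear slope applied to the grayscale difference.
        slope = np.clip(alpha * (values[p] - values), -1.0, 1.0)
        response[p] = np.sum(slope / dist)  # intermediate result R_c(p)
    lo, hi = response.min(), response.max()
    if hi == lo:
        return image.astype(np.uint8)  # flat image: nothing to equalize
    # Dynamic extension: map R_c linearly into [0, 255].
    stretched = (response - lo) / (hi - lo) * 255.0
    return np.round(stretched).reshape(h, w).astype(np.uint8)
```

On a low-contrast input, the output spans the full gray range while preserving the ordering between brighter and darker regions.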

Bilateral filtering for noise removal
Impulse noise [27] is often present in hand radiographs. Although many image denoising methods are available, such as Gaussian filtering [28], median filtering [29], and box filtering [30], they do not protect high-frequency information well and thus blur the edge details of the image. In contrast, bilateral filtering achieves edge-preserving denoising by balancing the spatial proximity [31] and pixel similarity [32] of the image, that is, by considering both spatial-domain information and gray-scale similarity. In the formulas for the spatial [33] domain kernel and the value domain kernel, i, j, k, l are the coordinates of the points q(i, j) and p(k, l); δ_x is the Gaussian standard deviation [34]; f(i, j) denotes the pixel value of the image at point q(i, j); W_d measures the proximity of the neighboring point q to the center point p; and W_r measures the magnitude of the difference between the pixel values of q and p.
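
A minimal sketch of a bilateral filter built from the two kernels follows. It assumes Gaussian forms for both W_d (spatial distance) and W_r (gray-level difference), which is the standard formulation. The window radius, the two standard deviations, and the function name `bilateral_filter` are illustrative assumptions, not values from the paper.

```python
import numpy as np


def bilateral_filter(image, radius=2, sigma_d=2.0, sigma_r=25.0):
    """Edge-preserving smoothing of a 2-D grayscale image."""
    img = image.astype(float)
    h, w = img.shape
    padded = np.pad(img, radius, mode="edge")
    weighted_sum = np.zeros_like(img)
    weight_total = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # Neighbor q of every center pixel p, at offset (dy, dx).
            neighbor = padded[radius + dy:radius + dy + h,
                              radius + dx:radius + dx + w]
            # W_d: spatial-domain kernel, depends only on the offset.
            w_d = np.exp(-(dy * dy + dx * dx) / (2.0 * sigma_d ** 2))
            # W_r: value-domain kernel, depends on the gray-level difference.
            w_r = np.exp(-((neighbor - img) ** 2) / (2.0 * sigma_r ** 2))
            weight = w_d * w_r
            weighted_sum += weight * neighbor
            weight_total += weight
    return weighted_sum / weight_total
```

Because W_r collapses toward zero across a strong intensity step, pixels on opposite sides of an edge barely influence each other, which is why the edge survives while small fluctuations are smoothed.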

Otsu algorithm
In the traditional Canny algorithm, the high and low thresholds must be set manually based on experience, typically by stepping through values in a certain range to find the best one; this is time-consuming and not robust. Therefore, the Otsu algorithm is adopted to automatically determine, from the histogram of the image, the optimal threshold that separates the foreground pixels from the background pixels. The principle is as follows. A threshold T partitions the image into background and target, whose proportions are p_0 (10) and p_1 (11) and whose average gray values are m_0 (8) and m_1 (9). Here m_0 is obtained by multiplying each gray value from 0 to the assumed threshold T by its frequency of occurrence, summing, and dividing by the total frequency of that region. With m denoting the global gray mean of the image, relations (12) and (13) hold. The between-class variance is then given by (14); substituting (12) and (13) and simplifying yields (15). The calculation iterates over the gray levels k from 0 to 255 and takes the gray level k that maximizes (15) as the optimal threshold for the image. Otsu is highly efficient and interpretable compared to the traditional Canny algorithm, which repeatedly discovers the trajectory of the entire edge line by seeking out the nodes that can be connected within a certain range.
k is the assumed threshold T; i is the gray value; L is the maximum gray level of the image, which is normally 255; p_i is the frequency of occurrence of each gray value; p_0 denotes the sum of the probabilities of the gray values in the range from 0 to k; p_1 denotes the sum of the probabilities of the gray values in the range from k + 1 to L.
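
The threshold search can be sketched in a few lines of NumPy. The code assumes the standard Otsu criterion, in which the between-class variance p_0 · p_1 · (m_0 − m_1)² is maximized over k; written with cumulative sums, this equals (m · p_0 − Σ_{i≤k} i · p_i)² / (p_0 · p_1). The function name `otsu_threshold` is hypothetical, and the way ImCAT derives Canny's low threshold from the Otsu value is not specified in the text, so it is not shown.

```python
import numpy as np


def otsu_threshold(image, levels=256):
    """Return the gray level k that maximizes the between-class variance."""
    hist = np.bincount(image.ravel(), minlength=levels).astype(float)
    p = hist / hist.sum()                 # p_i: frequency of each gray value
    gray = np.arange(levels)
    p0 = np.cumsum(p)                     # background proportion for each k
    p1 = 1.0 - p0                         # target proportion for each k
    cum_mean = np.cumsum(gray * p)        # sum of i * p_i for i <= k
    m = cum_mean[-1]                      # global gray mean
    valid = (p0 > 0) & (p1 > 0)           # skip k values that leave a class empty
    sigma_b2 = np.zeros(levels)
    sigma_b2[valid] = ((m * p0[valid] - cum_mean[valid]) ** 2
                       / (p0[valid] * p1[valid]))
    return int(np.argmax(sigma_b2))
```

Pixels with gray value above the returned k are treated as target and the rest as background.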

Backbone
We feed the resulting edge feature map, hand topography map, and preprocessed original image into different channels and employ a modified Xception [35] as the backbone network. We remove the topmost layer, then add a 3×3 downsampling layer, a 3×3 max pooling layer, and a fully connected layer with 32 neurons, and incorporate the gender information into the image features through the fully connected layer. Finally, a softmax [36] activation function outputs the probability distribution over bone ages, p_k, k ∈ {1, 2, ..., 228}, which leads to the bone age prediction (see Fig. 6). The objective of the regression model is to reduce the mean absolute error between the true bone age and the predicted bone age, so we use the L1 loss function [37] as the objective function.
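
The input and output ends of the backbone can be illustrated without the network itself. The sketch below stacks the three maps into channels and turns a 228-way softmax output into a scalar bone age. The text does not say how the distribution p_k is reduced to a single prediction; taking the expected value Σ k · p_k is an assumption made here, and all function names are hypothetical.

```python
import numpy as np


def stack_inputs(edge_map, topo_map, preprocessed):
    """Place the three single-channel maps in separate channels (H, W, 3)."""
    return np.stack([edge_map, topo_map, preprocessed], axis=-1)


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def expected_bone_age(logits):
    """Assumed reduction: expected month under p_k, with k = 1..K."""
    probs = softmax(logits)
    months = np.arange(1, probs.shape[-1] + 1)
    return probs @ months


def l1_loss(predicted, target):
    """Mean absolute error between predicted and true bone ages (months)."""
    diff = np.asarray(predicted, dtype=float) - np.asarray(target, dtype=float)
    return float(np.mean(np.abs(diff)))
```

With this reduction the prediction is differentiable with respect to the logits, so an L1 objective can be applied directly to it.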

Dataset
We evaluate TENet on our private dataset and the Radiological Society of North America Bone Age Assessment (RSNA-BAA) [38] public dataset. The private dataset contains 4954 hand radiographs in the training set and 275 hand radiographs in each of the validation and test sets. The ground truth bone ages are from 0 to 202 months. The RSNA-BAA public dataset contains 5611 hand radiographs in the training set and 275 hand radiographs each in the validation and test sets. The ground truth bone ages of RSNA-BAA vary from 0 to 228 months. All hand radiographs (see Fig. 3) are resized to 560 × 560 before processing through TENet. We then report the mean absolute error between the ground truth bone age and the corresponding predicted bone age. Both datasets are provided by the orthopedic department of the hospital, and the collection equipment is an X-ray machine; the private dataset is not publicly available.

Fig. 6 The framework of Xception, which includes the entry flow, the middle flow and the modified exit flow

Implementation details
We implemented TENet in TensorFlow 1.9 and trained it on a system with an NVIDIA TITAN RTX GPU and 32 GB of RAM, which took about 8 hours. TENet was trained for 100 epochs with a batch size of 32. The initial learning rate is 3 × 10^−3, which is reduced to 10^−3 after 50 epochs and by a further factor of 10 after 80 epochs. The optimizer is Adam [39].
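
The step schedule can be written as a small function. This sketch assumes zero-based epoch indices and reads "10 times after 80 epochs" as a tenfold reduction to 10^−4; both are interpretations rather than details confirmed by the text, and the function name is hypothetical.

```python
def learning_rate(epoch):
    """Piecewise-constant learning rate for a 100-epoch run (zero-based epoch)."""
    if epoch < 50:
        return 3e-3   # initial rate
    if epoch < 80:
        return 1e-3   # after 50 epochs
    return 1e-4       # after 80 epochs (assumed tenfold reduction)
```

A function of this shape can be passed to a per-epoch scheduler callback in most training frameworks.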

Bone age assessment:
We first investigate the role of each module in TENet, namely the pre-processed hand image (PHI), edge feature reinforcement (ImCAT), and hand topography (HT). The experimental results (Table 1) show that without the ImCAT and HT modules, the network degrades to Xception and the resulting MAE is 6.92 months. When the edge feature reinforcement module is added to this model, the MAE drops to 5.76 months, an improvement of 1.16 months. This improvement illustrates the need to acquire edge features and the effectiveness of the ImCAT algorithm. We also test separately the performance obtained with the hand topography applied to Xception. In this case, the hand topographies guide the ROI when training the network, reducing the MAE to 5.89 months. Combining these two components, TENet achieves an MAE of 5.25 months. Exp. 1, 2, 3, 4, and 5 are run on the private dataset, and Exp. 6 is run on RSNA. Collectively, the better results on the private dataset than on the public dataset are consistent with our model design being inspired by the CHN-05 wrist bone criterion.

Improved canny detection:
Our model detects hand edge features and combines them with the topology for bone age assessment. We observe that ImCAT also benefits from the design of its modules. We measure performance using the peak signal-to-noise ratio (PSNR), mean square error (MSE), structural similarity (SSIM), root mean square error (RMSE), and mean absolute error (MAE). The results are shown in Table 2. We use four algorithms, including the traditional Canny (TC), as baselines, and ImCAT significantly outperforms all of them, indicating that ImCAT better captures the overall edge detail of the hand.
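
Four of the five image-quality metrics have simple closed forms, sketched below for a pair of equally sized grayscale images. The text does not state which image serves as the reference for Table 2, so `reference` is a placeholder. SSIM is omitted because it requires windowed local statistics. The function name and the default peak value of 255 are assumptions.

```python
import numpy as np


def edge_quality_metrics(reference, result, max_value=255.0):
    """Compute MSE, RMSE, MAE, and PSNR between two equally sized images."""
    ref = np.asarray(reference, dtype=float)
    res = np.asarray(result, dtype=float)
    err = ref - res
    mse = float(np.mean(err ** 2))
    rmse = float(np.sqrt(mse))
    mae = float(np.mean(np.abs(err)))
    # PSNR is unbounded when the images are identical.
    psnr = float("inf") if mse == 0 else float(10.0 * np.log10(max_value ** 2 / mse))
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "PSNR": psnr}
```

Lower MSE, RMSE, and MAE and higher PSNR indicate a result closer to the reference.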

Comparison with state-of-the-art methods
In this subsection, we compare previous state-of-the-art methods with our model. The results are displayed in Table 3, where all the datasets covered span chronological ages of 0-18 years. The performance of the G&P- and TW3-inspired methods for bone age assessment is 9.48 months. The methods employing different network structures improve on this by 1.42 months. Our TENet yields better performance because it takes both overall information and local ROI information into account. TENet has a significant advantage over models that use only local or only global features. This is attributable not only to the proposed hand topology information but also to the enhancement of the overall structure of the edges. Consequently, our model achieves a performance gain of 0.49 months in MAE over the previous method of P. Hao, a relative improvement of 8.3%.

Conclusion
In this paper, we propose a new automatic bone age assessment model, TENet, which imitates the diagnostic logic of CHN-05 by extracting features that correspond to its semantic description of bone age assessment and by horizontally fusing multiple feature representations. First, we design a hand topography module that provides the length and positional structure of the metacarpal phalanges, conforming to the four-direction evaluation of bone age in CHN-05. Second, we develop an enhanced edge feature module that recognizes the hand architecture and extracts significant edge information, meeting the requirement of CHN-05 for clear bone edges and improving the accuracy of bone age assessment. The experimental results on the public RSNA dataset and the private dataset demonstrate that our model achieves decent performance. In addition, our model is designed from the prior knowledge of CHN-05, so it has favorable interpretability and reliability for clinical practice. In future work, we will attempt to combine the feature acquisition module and the score regression module into an end-to-end automated bone age assessment framework.
Author Contributions
Kunyuan Jian was the first author of this paper, responsible for all experiments and figures, and wrote the first draft of the article; Mengning Yang coordinated the submission process and handled the responses to review comments; Cui Song provided the private dataset for this study; Simin Wang and Shuxiang Li provided assistance with the research ideas for this paper.

Data Availability
The RSNA-BAA dataset is openly available in a public repository, https://www.rsna.org/. The private dataset is not available due to [ethical/legal/commercial] restrictions.

Compliance with ethical standards
Conflicts of interest The authors declare that they have no conflict of interest.
Ethical standard This article does not contain any studies with human participants performed by any of the authors.