Dataset
OPG data obtained from a total of 910 Korean outpatients treated at the Jeonbuk National University Dental Hospital were used as research data. Informed consent was obtained from all participants, and all methods were performed in accordance with the relevant guidelines and regulations. This study was approved by the Institutional Review Board of Jeonbuk National University Hospital (CUH 2021-03-021). The age of each subject was determined as of the date of radiography; the age distribution is presented in Table 1. With reference to previous studies in which mandibular first molar and canine teeth were used for age estimation, each image in the present study was cropped into four sub-images covering two tooth types on the left and right sides1,13,16. More specifically, the images were manually segmented under the guidance of skilled dentists into four sub-images: the left and right mandibular first molars and the left and right mandibular canines. The resulting images were then resized to 256 × 256 pixels so as to retain as much pixel information as possible. From the 910 dental radiographs, 1,216 images of first molars and 1,634 images of canines were obtained. When these tooth images were used to train the developed model, all left-side images were mirrored beforehand so that the learning process would not be affected by whether a given tooth was located on the left or right side17,18.
Table 1
Age and gender distribution of the dental X-ray dataset.
Age group | Male | Female | Total
10–19 | 30 | 54 | 84
20–29 | 79 | 116 | 195
30–39 | 23 | 66 | 89
40–49 | 57 | 97 | 154
50–59 | 53 | 111 | 164
60–69 | 74 | 73 | 147
70–79 | 33 | 44 | 77
Total | 349 | 561 | 910
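For illustration, the side-mirroring and resizing steps described above could be implemented as in the following sketch; the function name and the use of the Pillow library are assumptions for demonstration, not the authors' actual preprocessing code.

```python
from PIL import Image

def preprocess_tooth_crop(crop: Image.Image, is_left_side: bool) -> Image.Image:
    """Mirror left-side tooth crops and resize to 256 x 256 grayscale."""
    if is_left_side:
        # Mirroring removes left/right positional cues from the training data.
        crop = crop.transpose(Image.FLIP_LEFT_RIGHT)
    # Convert to grayscale and resize while preserving as much detail as possible.
    return crop.convert("L").resize((256, 256), Image.LANCZOS)
```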
Variational Autoencoder-Linear Regression Model
Single VAE with linear regression
Convolutional neural networks (CNNs) are powerful deep learning tools that can automatically extract feature values from input image data and establish a nonlinear relationship model between the label data and the corresponding feature values19. A convolutional VAE, meanwhile, is a probabilistic graphical model capable of extracting the features of input image data as a continuous probability distribution function using two symmetrical CNNs20. When a dental radiograph \(x\) is entered into the VAE's encoder model \({q}_{\phi }(z|x)\), the probability distribution of the latent variables \(z\), which correspond to the characteristic values that the input data can give rise to, is returned as output. The decoder \({p}_{\theta }(x|z)\), also known as the generative model, in turn returns virtual image data \(\bar{x}\) reconstructed from the input latent variables \(z\).
To train this VAE model, the Kullback-Leibler (KL) divergence between the latent variable distribution \(q(z|x)\) and the prior \(p(z)\) is first calculated, as shown in Eq. (1), and minimized. Next, a reconstruction error is calculated, as shown in Eq. (2), to minimize the difference between \(x\), the dental radiograph entered into the encoder, and \(\bar{x}\), the resulting image obtained from the decoder. The total loss of the VAE model, \({\mathcal{L}}_{VAE}\), combines the KL divergence loss \({\mathcal{L}}_{KL}\) and the image reconstruction loss \({\mathcal{L}}_{rec}\) so that both are minimized during training, as shown in Eq. (3). To optimize the learning process, each loss term was weighted by a hyperparameter \(\gamma\).
$${\mathcal{L}}_{KL}={D}_{KL}\left(q(z|x)\,\|\,p(z)\right)$$
1
$${\mathcal{L}}_{rec}=-{\mathbb{E}}_{q(z|x)}\left[\log p(x|z)\right]$$
2
$${\mathcal{L}}_{VAE}={\gamma }_{1}{\mathcal{L}}_{KL}+{\gamma }_{2}{\mathcal{L}}_{rec}$$
3
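A minimal TensorFlow sketch of Eqs. (1)-(3) is given below, assuming a Gaussian posterior parameterized by \(\mu\) and \(\log {\sigma }^{2}\) and pixel-wise binary cross-entropy as the reconstruction term; the authors' exact reconstruction term and \(\gamma\) values are not specified here, so those choices are placeholders.

```python
import tensorflow as tf

def vae_loss(x, x_bar, mu, log_var, gamma1=1.0, gamma2=1.0):
    """Weighted VAE loss of Eq. (3); gamma1 and gamma2 are placeholder weights."""
    # Eq. (1): closed-form KL divergence between N(mu, sigma^2) and N(0, I),
    # summed over all latent dimensions of the 8x8x512 distribution variables.
    kl = -0.5 * tf.reduce_sum(
        1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=[1, 2, 3]
    )
    # Eq. (2): reconstruction error, here pixel-wise binary cross-entropy
    # summed over the image (one common choice for [0, 1] grayscale inputs).
    rec = tf.reduce_sum(tf.keras.losses.binary_crossentropy(x, x_bar), axis=[1, 2])
    # Eq. (3): weighted sum of both terms, averaged over the batch.
    return tf.reduce_mean(gamma1 * kl + gamma2 * rec)
```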
In the present study, a VAE composed of five convolutional layers was developed, and grayscale images of 256 × 256 pixels were used as input for the model, as shown in Fig. 1(a). From the input images, the encoder model returns the 512 × 8 × 8 distribution variables \(\mu\) and \(\sigma\), from which the latent variables \(z\) are then calculated. The decoder model receives the latent variables \(z\) as input and generates virtual images of 256 × 256 pixels as output. A linear regression equation that estimates age from the latent variables obtained from the encoder was also developed, as shown in Fig. 1(b). The VAE model was built using Python's TensorFlow library and trained in an unsupervised manner, while the linear regression model was built using Python's scikit-learn library and trained in a supervised manner.
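The architecture in Fig. 1 could be sketched in Keras as follows; the filter counts, kernel sizes, and activations are assumptions chosen only so that five stride-2 convolutions map a 256 × 256 × 1 input to the 512 × 8 × 8 distribution variables described above.

```python
import tensorflow as tf
from tensorflow.keras import layers
from sklearn.linear_model import LinearRegression

# Encoder: five stride-2 convolutions, 256x256x1 -> 8x8x512.
inp = tf.keras.Input(shape=(256, 256, 1))
h = inp
for filters in (32, 64, 128, 256, 512):
    h = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(h)
mu = layers.Conv2D(512, 3, padding="same", name="mu")(h)
log_var = layers.Conv2D(512, 3, padding="same", name="log_var")(h)
encoder = tf.keras.Model(inp, [mu, log_var])

def sample_z(mu, log_var):
    # Reparameterization trick: z = mu + sigma * epsilon, epsilon ~ N(0, I).
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

# Decoder mirrors the encoder with transposed convolutions, 8x8x512 -> 256x256x1.
decoder = tf.keras.Sequential(
    [tf.keras.Input(shape=(8, 8, 512))]
    + [layers.Conv2DTranspose(f, 3, strides=2, padding="same", activation="relu")
       for f in (512, 256, 128, 64, 32)]
    + [layers.Conv2D(1, 3, padding="same", activation="sigmoid")]
)

# After unsupervised VAE training, the supervised age regressor can be fitted
# on the flattened latent variables (`images` and `ages` are hypothetical arrays):
# z_mu, _ = encoder.predict(images)
# regressor = LinearRegression().fit(z_mu.reshape(len(images), -1), ages)
```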
Parallel VAE with linear regression
Accuracy and precision are among the most critical factors determining the performance of dental age estimation methods for adults, especially from the perspective of forensic odontology. The accuracy of these methods may vary depending on which tooth is examined and which estimation method is employed; it is therefore important to apply at least two estimation methods to improve both accuracy and precision21. The same applies to dental age estimation using AI techniques: multiple teeth need to be examined comprehensively. Accordingly, the present study developed a parallel VAE model capable of estimating the age of subjects by jointly analyzing the feature values obtained from two types of tooth images while also generating virtual images of the teeth22,23.
Assuming that images of the first molar and canine obtained from a single subject contain common latent variables that may help estimate the subject's age, a parallel VAE model in which part of the latent variables is shared was developed, as shown in Fig. 2(a). Encoder 1 and Decoder 1 of the parallel VAE used the dataset of first molar images as input and output data, while Encoder 2 and Decoder 2 used the dataset of canine images. Among the \(n\) latent variables calculated by Encoder 1, one half (\(k\) variables), denoted \({z}_{1u}\), was assumed to contain unique variables specific to the first molar, and the other half, \({z}_{1c}\), to contain common variables shared by the first molar and canine. Similarly, among the \(n\) latent variables obtained from Encoder 2, one half (\(k\) variables), denoted \({z}_{2u}\), was assumed to contain unique variables specific to the canine, and the other half, \({z}_{2c}\), to contain common variables shared by both tooth types. To ensure that the latent variables \({z}_{1u}\) and \({z}_{2u}\) capture feature values specific to each tooth type while the latent variables \({z}_{1c}\) and \({z}_{2c}\) capture information shared by both, Eq. (4) was added to the VAE loss term22.
The separate loss \({\mathcal{L}}_{separate}\) was obtained by dividing \({\mathcal{L}}_{common}\), the difference between the common variables of the two tooth types, by \({\mathcal{L}}_{unique}\), the difference between the unique variables; both are mean square errors between two sets of latent variables, as expressed in Eqs. (5) and (6). \({\mathcal{L}}_{{KL}_{m}}\) and \({\mathcal{L}}_{{rec}_{m}}\) in Eq. (7) denote the KL loss and image reconstruction loss of the VAE that receives first molar images as input, while \({\mathcal{L}}_{{KL}_{c}}\) and \({\mathcal{L}}_{{rec}_{c}}\) are the corresponding losses of the VAE that receives canine images as input. When the entire parallel VAE model is trained with the loss function in Eq. (7) using gradient descent, the common latent variables of the two tooth types, \({z}_{1c}\) and \({z}_{2c}\), gradually converge toward the same values so as to minimize \({\mathcal{L}}_{common}\). At the same time, the learning process maximizes \({\mathcal{L}}_{unique}\), driving the unique latent variables of each tooth type, \({z}_{1u}\) and \({z}_{2u}\), to be as different from each other as possible.
$${\mathcal{L}}_{separate}= \frac{{\mathcal{L}}_{common}}{{\mathcal{L}}_{unique}}$$
4
$${\mathcal{L}}_{common}=\frac{1}{k}\sum _{i=1}^{k}{({z}_{{1c}_{i}}-{z}_{{2c}_{i}})}^{2}$$
5
$${\mathcal{L}}_{unique}=\frac{1}{k}\sum _{i=1}^{k}{({z}_{{1u}_{i}}-{z}_{{2u}_{i}})}^{2}$$
6
$${\mathcal{L}}_{parallel\text{-}VAE}={\gamma }_{1}{\mathcal{L}}_{{KL}_{m}}+{\gamma }_{2}{\mathcal{L}}_{{KL}_{c}}+{\gamma }_{3}{\mathcal{L}}_{{rec}_{m}}+{\gamma }_{4}{\mathcal{L}}_{{rec}_{c}}+{\gamma }_{5}{\mathcal{L}}_{separate}$$
7
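Under the assumption (made here for illustration) that the first \(k\) entries of each latent vector are the unique part and the last \(k\) the common part, Eqs. (4)-(7) could be computed as in the following sketch; the \(\gamma\) weights and the small epsilon guarding against division by zero are placeholders.

```python
import tensorflow as tf

def parallel_vae_loss(z1, z2, kl_m, kl_c, rec_m, rec_c, gammas, k, eps=1e-8):
    """Total loss of Eq. (7); z1/z2 are (batch, n) latents of Encoders 1 and 2."""
    z1u, z1c = z1[:, :k], z1[:, k:]   # first molar: unique / common variables
    z2u, z2c = z2[:, :k], z2[:, k:]   # canine: unique / common variables
    # Eq. (5): mean square error between the common latent variables.
    l_common = tf.reduce_mean(tf.square(z1c - z2c), axis=-1)
    # Eq. (6): mean square error between the unique latent variables.
    l_unique = tf.reduce_mean(tf.square(z1u - z2u), axis=-1)
    # Eq. (4): minimizing this ratio pulls the common parts together
    # while pushing the unique parts apart.
    l_separate = l_common / (l_unique + eps)
    g1, g2, g3, g4, g5 = gammas
    return tf.reduce_mean(
        g1 * kl_m + g2 * kl_c + g3 * rec_m + g4 * rec_c + g5 * l_separate
    )
```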
A linear regression model capable of estimating age from the latent variables obtained from the parallel VAE was developed, as illustrated in Fig. 2(b). Its configuration was basically the same as that employed for the single VAE in that it received latent variables as input and returned the age of subjects as output. It differed, however, in that the regression model was built using only 3/4 of all latent variables. This is because, once the parallel VAE is sufficiently trained, the common latent variables \({z}_{1c}\) and \({z}_{2c}\) become almost identical, so there is no need to use both of them in the regression model. Ultimately, the developed regression model contained a total of \(n+k\) regression coefficients \(\beta\), along with a single intercept \(\alpha\).
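A sketch of how the \(n+k\) regression inputs could be assembled, again assuming for illustration that the first \(k\) columns of each latent array are the unique variables and the remaining \(k\) the common ones:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_parallel_age_regressor(z1, z2, ages, k):
    """Fit age on z_1u, z_2u, and one copy of the (nearly identical) common part."""
    features = np.concatenate(
        [z1[:, :k],    # z_1u: first-molar-specific latent variables
         z2[:, :k],    # z_2u: canine-specific latent variables
         z1[:, k:]],   # z_1c: common variables (z_2c is omitted as redundant)
        axis=1,
    )  # n + k features per subject, since n = 2k
    return LinearRegression().fit(features, ages)  # n + k betas plus intercept alpha
```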
Generation of Teeth Images That Reflect Age Changes
The present study developed a method for quantitatively determining and analyzing the correlation between age and morphological changes in teeth by generating dental images that vary continuously with age. This was accomplished using the coefficients of the regression model trained on age data, together with the decoder. Once fully trained, the regression model can be expressed as an equation, as shown in Eq. (8).
$${\beta }_{1}{z}_{1}+{\beta }_{2}{z}_{2}+{\beta }_{3}{z}_{3}+\dots +{\beta }_{n}{z}_{n}+\alpha ={y}_{age}$$
8
The regression coefficients \(\beta\) may be used not only to estimate age but also to generate latent variables corresponding to tooth images in which the subject is younger or older than in the reference image. The latent variables extracted from the encoder contain various information, including the brightness of the tooth images, gender, and age. When a regression model is developed with age as the dependent variable, the coefficients of latent variables that are strongly correlated with age tend to take large positive or negative values, while the coefficients of weakly correlated variables tend to be small. Based on this relationship between the latent variables and their coefficients, it is possible to selectively control only the latent variables that are strongly correlated with age. For example, when the regression coefficients \(\beta\) are added to or subtracted from the latent variables obtained from the reference image, the latent variables that are strongly correlated with age change greatly, whereas those with weak correlations change only slightly (Fig. 3). Accordingly, virtual images in which the subject appears younger than in the reference image can be generated by subtracting the regression coefficients \(\beta\) from the extracted latent variables and reconstructing the result using the decoder; virtual images in which the subject appears older can be obtained by adding the coefficients and reconstructing likewise.
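Putting the pieces together, the younger/older image generation could look like the following sketch; `encoder`, `decoder`, and `regressor` follow the interfaces assumed in the earlier sketches, and `step` is a hypothetical scaling factor controlling how far the latent code is shifted along the age direction.

```python
import numpy as np

def generate_age_shifted_images(encoder, decoder, regressor, image, step=1.0):
    """Shift a reference image's latents along the regression direction and decode."""
    mu, _ = encoder.predict(image[np.newaxis, ...])   # latents of the reference image
    z = mu.reshape(1, -1)
    beta = regressor.coef_.reshape(1, -1)             # age regression coefficients
    younger = decoder.predict((z - step * beta).reshape(mu.shape))
    older = decoder.predict((z + step * beta).reshape(mu.shape))
    return younger[0], older[0]
```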