This study was approved by the Institutional Review Board of Yonsei University Dental Hospital (IRB No. 2-2020-0005), which granted an exemption from the requirement for informed consent on the use of patients’ cephalometric data. Written or verbal informed consent was not obtained from any participant because this study had a non-interventional retrospective design, and all cephalometric images were anonymized to ensure confidentiality. This study was performed in accordance with the Declaration of Helsinki.
The inclusion criteria were as follows: (1) patients with permanent dentition whose facial growth was complete and (2) patients who underwent orthodontic therapy or orthognathic surgery between 2015 and 2021. The exclusion criteria were as follows: (1) partial or total edentulism and (2) a history of dentofacial trauma, craniofacial syndromes, or systemic diseases. In total, 1,114 posteroanterior (PA) cephalometric images that met these criteria were included in this study.
The PA cephalograms used in this study were acquired using a Rayscan machine (Ray Co. Ltd., Hwaseong, Korea) and collected from the picture archiving and communication system (PACS) of the Yonsei University Dental Hospital as JPEG files. The images had a resolution of 1930 × 2238 pixels and pixel spacing of 0.13 mm. Each pixel was represented by a single grayscale channel with values ranging from 0 to 255.
The 1,114 PA cephalometric images included in the study were randomly divided into three sets: 803 images for training purposes, 229 for validation, and 82 for testing. The training and validation sets were used exclusively during the model training phase, whereas the test set was used solely to evaluate the reliability of the human examiners and the accuracy of the auto-identification model.
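The random split described above can be sketched as follows. This is a minimal illustration, not the authors' actual procedure; the seed and the image-ID format are assumptions.

```python
import random

# Hypothetical sketch of the 803/229/82 split of the 1,114 image IDs.
# The fixed seed and ID naming are illustrative assumptions.
image_ids = [f"pa_{i:04d}" for i in range(1114)]
rng = random.Random(42)  # fixed seed for a reproducible split
rng.shuffle(image_ids)

train_ids = image_ids[:803]
val_ids = image_ids[803:803 + 229]
test_ids = image_ids[803 + 229:]
```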
Nineteen clinically important PA cephalometric landmarks used in routine dentofacial diagnosis were selected. Table IV and Fig. 4 describe their definitions and positions. Two expert human examiners (an oral and maxillofacial surgeon with over 10 years of clinical experience in dentofacial deformity and an orthodontic specialist with 5 years of orthodontic training) independently and manually identified the landmarks on the 1,032 images used for model training and validation to obtain the ground truth.
During model training, the large original images provide richer feature maps for learning; however, they also strain GPU memory and prolong computation. Therefore, in the first step, each image was resized to 964 × 1119 pixels, approximately one-quarter of the original area (half of each dimension). It is important to retain features from the widest possible area when extracting the approximate coordinates of the 19 landmarks. Thus, the x- and y-coordinates of each landmark were extracted by locating the center of mass of its labeled point, which enabled the construction of a coordinate-based landmark detection model.
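The first-step preprocessing (downsampling the image and rescaling the landmark coordinates to match) can be sketched as follows. Nearest-neighbour resampling is used here for simplicity; the paper does not state which interpolation method was applied.

```python
import numpy as np

# Sketch: downsample a 1930 x 2238 PA cephalogram to 964 x 1119 pixels and
# rescale the 19 (x, y) landmark coordinates by the same factors.
ORIG_W, ORIG_H = 1930, 2238
NEW_W, NEW_H = 964, 1119

def resize_with_landmarks(image, landmarks):
    """image: (H, W) grayscale array; landmarks: (19, 2) array of (x, y) pixels."""
    h, w = image.shape
    ys = np.arange(NEW_H) * h // NEW_H  # source row for each target row
    xs = np.arange(NEW_W) * w // NEW_W  # source column for each target column
    resized = image[np.ix_(ys, xs)]     # nearest-neighbour downsampling
    scaled = landmarks * np.array([NEW_W / w, NEW_H / h])
    return resized, scaled
```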
The landmark detection framework operates through a two-step process, as shown in Fig. 5. The 19 landmark positions were coarsely extracted, and the image was cropped to a certain size based on these rough positions. Subsequently, the fine points were extracted from the cropped images. The adoption of this two-step framework facilitated efficient learning and high accuracy.
ResNet 18 was employed in the first step to preserve the original features and expedite learning while minimizing computational complexity. ResNet 18 mitigates the vanishing-gradient problem in deep networks through residual learning with skip connections, and it is widely used in facial landmark detection tasks.27 ResNet 18 consists of 17 convolution layers and a fully connected layer at the end. The first convolution layer uses a 7 × 7 kernel followed by max pooling to reduce the input size, whereas all subsequent convolution layers use 3 × 3 kernels. The final fully connected layer comprised 38 output features, enabling the derivation of the x- and y-coordinates of the 19 landmarks. Residual shortcut connections were introduced between pairs of convolution layers to optimize the learning process, as shown in Fig. 5, where the solid line represents the input and output having the same dimension, and the dotted line represents an increase in dimension with zero padding and a stride of 2.
Augmentations randomly selected from rotation, scaling, flipping, and contrast adjustment were applied to account for patients with tilted heads and asymmetric X-rays, as shown in Fig. 3. Wing loss was utilized as the loss function in the first step, which helped reduce an excessive focus on outliers when finding the approximate landmark positions.28 Wing loss is more resistant to the impact of outliers than the mean squared error (MSE) loss function.
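A minimal numpy sketch of the wing loss follows. The defaults w = 10 and ε = 2 are taken from the original wing loss paper, not necessarily the values used in this study; the loss is logarithmic for small residuals and linear for large ones.

```python
import numpy as np

# Sketch of wing loss: w * ln(1 + |x|/eps) for |x| < w, else |x| - C,
# where C = w - w * ln(1 + w/eps) keeps the two pieces continuous at |x| = w.
# Parameter values are assumptions based on the original wing loss paper.
def wing_loss(pred, target, w=10.0, epsilon=2.0):
    x = np.abs(pred - target)
    c = w - w * np.log(1.0 + w / epsilon)
    loss = np.where(x < w, w * np.log(1.0 + x / epsilon), x - c)
    return loss.mean()
```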
In the second step, the original image was cropped to a size of 400 × 400 pixels and centered around the 19 landmark positions obtained in the first step. Subsequently, contrast-limited adaptive histogram equalization (CLAHE) was applied to the cropped images. CLAHE is a histogram-flattening method that enhances the contrast of the radiographs, thereby enabling clear visualization of the bone, soft tissue, and background regions.29 ResNet 50 architecture was used in this step. It is similar to ResNet 18 but with deeper networks, and it comprises 49 convolution layers and a fully connected layer at the end. The final fully connected layer was designed with two output features to derive the x- and y-coordinates of one landmark. To optimize learning, residual shortcut connections were applied to three convolution layers with kernel sizes of 1 × 1, 3 × 3, and 1 × 1. The 1 × 1 convolution layers were responsible for dimensionality reduction and restoration, whereas the 3 × 3 layer functioned as a bottleneck with smaller input/output dimensions. MSE loss was utilized as the loss function for the second step.
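The second-step cropping (a 400 × 400 patch centered on each coarse landmark) can be sketched as below. Clamping the window to stay inside the image is an assumption for landmarks near the border; the CLAHE step (e.g. via OpenCV's `createCLAHE`) is omitted here to keep the sketch dependency-free.

```python
import numpy as np

# Sketch: extract a 400 x 400 patch from the full-resolution image, centered
# on one coarse landmark position and clamped to the image bounds.
CROP = 400

def crop_around(image, x, y, size=CROP):
    """image: (H, W) array; (x, y): coarse landmark position in pixels."""
    h, w = image.shape
    half = size // 2
    left = int(min(max(x - half, 0), w - size))
    top = int(min(max(y - half, 0), h - size))
    return image[top:top + size, left:left + size]
```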
The two-step models were initialized with a learning rate of 0.01 during model training, which was then decayed by a factor of 0.5 every 30 epochs. An Adam optimizer with a batch size of 64 was used for 400 epochs. All procedures were conducted using the PyTorch framework running on an NVIDIA Quadro RTX8000 GPU.
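The training schedule above maps directly onto PyTorch's Adam optimizer and step scheduler; a sketch follows, with a placeholder module standing in for either of the two ResNets.

```python
import torch

# Sketch of the stated schedule: Adam at lr 0.01, halved every 30 epochs,
# for 400 epochs. The Linear module is a placeholder for the actual model.
model = torch.nn.Linear(38, 38)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.5)

for epoch in range(400):
    # ... forward/backward passes over batches of 64 would go here ...
    optimizer.step()
    scheduler.step()  # lr is halved at epochs 30, 60, ..., 390
```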
Two expert examiners specializing in oral and maxillofacial surgery and orthodontics manually identified 19 cephalometric landmarks on 82 images that constituted the test set. The MRE and SD were calculated to evaluate the inter-examiner reliability, and the ICC was computed to assess the degree of reliability between the two human experts. The mean values of the x- and y-coordinates determined by the two examiners were used as the gold standard for subsequent analysis.
Automatic detection of the 82 test set images was completed using the constructed AI algorithm, and the MRE and SDR with error ranges of < 1.0, < 2.0, and < 4.0 mm for all landmarks were calculated to evaluate the performance of the proposed model. All calculations were performed in Microsoft Excel using the following formulae:
Radial error: \(R = \sqrt{\Delta x^{2} + \Delta y^{2}}\) (mm)
MRE = \(\frac{\sum_{i=1}^{N} R_{i}}{N}\) (mm)
SDR = \(\frac{\text{Number of accurate identifications}}{\text{Number of total identifications}} \times 100\%\)
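The three formulae can be computed directly from predicted and gold-standard coordinates; a sketch follows, converting pixel distances to millimetres with the 0.13 mm pixel spacing stated above.

```python
import numpy as np

# Sketch of the evaluation metrics: per-landmark radial error (mm),
# mean radial error (MRE), and success detection rate (SDR) at a threshold.
PIXEL_MM = 0.13  # pixel spacing stated in the image-acquisition section

def radial_errors(pred, gold):
    """pred, gold: (N, 2) arrays of (x, y) in pixels; returns errors in mm."""
    return np.linalg.norm(pred - gold, axis=1) * PIXEL_MM

def mre(errors):
    return errors.mean()

def sdr(errors, threshold_mm):
    return (errors < threshold_mm).mean() * 100.0
```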