Accuracy of Artificial Intelligence-Assisted Landmark Identification in Serial Lateral Cephalograms of Class III Patients Who Underwent Two-Jaw Orthognathic Surgery

This study aimed to compare the accuracy of artificial intelligence-assisted landmark identification in serial lateral cephalograms of Class III patients who underwent two-jaw orthognathic surgery using a convolutional neural network (CNN) algorithm. A total of 3,188 lateral cephalograms of Class III patients were allocated into the training and validation sets (3,004 cephalograms of 751 patients) and the test set (184 cephalograms of 46 patients; subdivided into the genioplasty and non-genioplasty groups, n = 23 per group). Each patient in the test set had four cephalograms: initial (T0), pre-surgery [T1, presence of orthodontic brackets (OBs)], post-surgery [T2, presence of OBs and surgical plates and screws (S-PS)], and debonding [T3, presence of S-PS and fixed retainers (FR)]. Statistical analysis was performed using the mean errors of 20 landmarks between the human gold standard and the CNN model. The total mean error was 1.17 mm, without significant difference among the four time-points. After surgery, ANS, A point, and B point showed a larger error than before surgery, while Mx6D and Md6D showed a smaller error. No difference in errors existed at B point, Pogonion, Menton, Md1C, and Md1R between the genioplasty and non-genioplasty groups. The CNN model can be used for landmark identification in serial cephalograms despite the presence of OB, S-PS, FR, genioplasty, and bone remodeling.
identification of the hard tissue landmarks in serial lateral cephalograms, further studies are needed to investigate the accuracy of soft tissue landmark identification in serial lateral cephalograms.


Introduction
Owing to the high prevalence of Class III malocclusion and the negative social perception of a prognathic appearance, 1,2 Korea has become one of the countries that perform two-jaw orthognathic surgery (TJ-OGS) extensively in patients with skeletal Class III malocclusion. To obtain a successful treatment outcome, the following four steps should be performed precisely: (1) diagnosis and gross treatment planning for pre-surgical orthodontic treatment and orthognathic surgery using initial cephalograms, (2) planning for the direction and amount of surgical movement using pre-surgical cephalograms, (3) assessment of surgical outcome and planning for post-surgical orthodontic treatment using post-surgical cephalograms, and (4) comprehensive assessment of orthodontic treatment and orthognathic surgery using debonding cephalograms. 3,4 In addition, superimposition of serial cephalograms taken at different time-points is important for assessing the outcomes of pre- and post-surgical orthodontic treatment and orthognathic surgery. Accurate detection of cephalometric landmarks is mandatory to perform these procedures.
Artificial intelligence (AI) algorithms, including convolutional neural networks (CNNs), can help clinicians detect cephalometric landmarks with an accuracy close to that of human experts. [5][6][7][8][9][10][11][12] Previous AI studies have regarded an error within 2 mm as clinically acceptable performance in landmark identification. 8, [12][13][14][15] However, this appears to be a lenient standard for appropriate clinical use. Therefore, stricter criteria (i.e., an error within 1.5 mm) are necessary in determining the accuracy of landmark identification for clinical relevance.
In addition, most AI studies on the accuracy of automated landmark identification 8, [13][14][15] have trained and tested their models using initial lateral cephalograms only, which do not contain orthodontic brackets (OB), surgical plates and screws (S-PS), fixed retainers (FR), or bone remodeling changes. To the best of our knowledge, no study has compared the accuracy of automated landmark identification in serial cephalograms at the four time-points covering the initial, pre-surgery, post-surgery, and debonding stages in orthognathic surgery cases. Therefore, the purpose of this study was to compare the accuracy of AI-assisted landmark identification in serial lateral cephalograms of Class III patients who underwent pre- and post-surgical orthodontic treatment and TJ-OGS using a cascade CNN algorithm and strict criteria for determining the degree of accuracy.

Results
Evaluation of total landmarks (Table 1). The total landmarks showed a good mean error value (1.17 mm) and a high degree of accuracy (AP, 74.2%).
Evaluation of skeletal landmarks (Table 1). Nasion and Sella showed an excellent mean error value and a very high degree of accuracy (0.59 mm and 95.1%; 0.46 mm and 100%, respectively). Porion and Orbitale showed a good mean error value and a high degree of accuracy (1.07 mm and 76.1%; 1.21 mm and 73.9%, respectively). However, Basion showed a fair mean error value (1.64 mm) and a medium degree of accuracy (63.1%).
ANS and A point showed a good mean error value and a medium degree of accuracy (1.39 mm and 65.2%; 1.41 mm and 63.0%, respectively). PNS had a good mean error value (1.19 mm) and a high degree of accuracy (72.7%).
Pogonion, Menton, and Articulare showed an excellent mean error value and a very high degree of accuracy (0.79 mm and 91.3%; 0.77 mm and 93.5%; 0.77 mm and 93.5%, respectively). B point showed a good mean error value (1.15 mm) and a high degree of accuracy (77.2%).
Evaluation of dental landmarks (Table 1). Mx1C showed an excellent mean error value (0.44 mm) and a very high degree of accuracy (97.8%). Mx6D had a good mean error value (1.43 mm) and a medium degree of accuracy (64.1%). However, Mx1R and Mx6R had a fair mean error value and a low degree of accuracy (1.55 mm and 57.6%; 1.68 mm and 51.6%, respectively).
Md1C demonstrated an excellent mean error value (0.49 mm) and a very high degree of accuracy (97.3%). Md1R had a fair mean error value (1.57 mm) and a low degree of accuracy (58.2%). Md6D had a fair mean error value (1.67 mm) and a low degree of accuracy (51.6%). Md6R exhibited an acceptable mean error value (2.03 mm) and a low degree of accuracy (41.3%).
Comparison of the mean errors among the four time-points (T0, T1, T2, and T3) (Table 2). No significant difference was found in the overall mean errors (P > 0.05). Only three landmarks (ANS, Mx6D, and Md6R) showed a significant difference in the mean errors among the four time-points [ANS, increase in the mean error from T0 and T1 to T2, P < 0.01; Mx6D, decrease in the mean error from T0 to T2, P < 0.05; Md6R, decrease in the mean error from T0 to T2 and T3, P < 0.01].
Comparison of the mean errors between the two time-points [(T0, T1) vs. (T2, T3)] (Table 2). ANS, A point, and B point showed a larger mean error after TJ-OGS than before TJ-OGS [ANS, P < 0.01; A point, P < 0.05; B point, P < 0.01], while Mx6D and Md6D showed a smaller mean error after TJ-OGS than before TJ-OGS [all P < 0.01].
Comparison of the mean errors between the genioplasty and non-genioplasty groups (Table 3). No significant difference between the two groups was found at any time-point in the mean errors of the landmarks located adjacent to the genioplasty area (B point, Pogonion, Menton, Md1C, and Md1R), except for Md1R at T1 (P < 0.05).
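To make the group comparison concrete, the following is a minimal sketch of an independent two-sample t statistic in plain Python. It assumes the equal-variance (pooled) form of the test; the study ran its tests in SPSS and does not state which form was used. The function name `pooled_t` and the error values are hypothetical and are not data from this study.

```python
import math
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for two independent samples.

    Assumption: equal variances in both groups (Student's pooled form).
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Pooled variance: sample variances weighted by their degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df


# Hypothetical per-patient errors (mm) for one landmark at one time-point.
genioplasty = [1.4, 1.9, 1.6, 2.1, 1.7]
non_genioplasty = [1.2, 1.5, 1.1, 1.6, 1.3]
t_stat, df = pooled_t(genioplasty, non_genioplasty)
print(f"t = {t_stat:.3f}, df = {df}")
```

The t statistic would then be compared against the t distribution with `df` degrees of freedom to obtain a P-value, which a statistics package normally reports directly.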

Discussion
Because TJ-OGS changes the position of the skeletal structures and induces bone remodeling, and because the OB, S-PS, and FR produce metallic images, the accuracy and reliability of cephalometric landmark identification in serial lateral cephalograms are important for the assessment of treatment outcomes. 16 As the total landmarks exhibited a good mean error value and a high degree of accuracy (1.17 mm and 74.2%, respectively; Table 1) without significant difference among the four time-points (P > 0.05, Table 2), the accuracy of the AI-assisted digitization was not significantly affected by the presence of OB, S-PS, FR, and bone remodeling change during orthodontic treatment and TJ-OGS. Regardless of the degree of accuracy of each landmark (Table 1) Table 2). The accuracy of the cranial base landmarks can be regarded as a baseline for comparison of serial lateral cephalograms because the positions of these landmarks are not affected by TJ-OGS.
Three error patterns were found in the maxillary skeletal landmarks. First, the mean errors of ANS differed among the four time-points (T0, 1.07 mm; T1, 1.22 mm; T2, 1.78 mm; T3, 1.49 mm; P < 0.01; Table 2) and were larger after TJ-OGS than before TJ-OGS [(T0, T1) vs. (T2, T3), P < 0.01; Table 2], which suggests that the metal image of the S-PS adjacent to ANS as well as surgical shape modification of ANS 17,18 (Fig. 1) could affect the accuracy of AI-assisted landmark detection. Second, although the error of A point was not significantly different among the four time-points (T0, 1.27 mm; T1, 1.28 mm; T2, 1.50 mm; T3, 1.59 mm; Table 2), its mean error was larger after TJ-OGS than before TJ-OGS [(T0, T1) vs. (T2, T3), P < 0.05; Table 2]. This occurred because A point might be less affected by the metal image of the S-PS installed at the maxilla and have a lower chance of surgical shape modification, compared to ANS (Fig. 1). Third, in cases of posterior impaction and/or anteroposterior movement of the maxilla, the position of PNS necessarily changes. However, for PNS, no significant difference was found either among the four time-points (T0, 1.16 mm; T1, 1.14 mm; T2, 1.29 mm; T3, 1.17 mm; P > 0.05; Table 2) or between the two time-points [(T0, T1) vs. (T2, T3), P > 0.05; Table 2]. This might be because (1) the metal image of the S-PS is absent within the ROI of PNS and (2) the end point of the hard palate can still be easily defined.
There are three explanations for the errors in the mandibular skeletal landmarks. First, since there were no metal images within the ROI of Articulare and Menton, their mean errors were not significantly different among the four time-points or between the two time-points (all P > 0.05, Table 2). Second, the mean error of Pogonion was not significantly different among the four time-points or between the two time-points (P > 0.05; Table 2), which suggests that the metal image of the S-PS adjacent to Pogonion (Fig. 1) might not affect the accuracy of AI-assisted landmark detection. Third, although the mean errors of B point did not differ among the four time-points (T0, 1.00 mm; T1, 1.01 mm; T2, 1.29 mm; T3, 1.31 mm; P > 0.05; Table 2), comparison of the two time-points revealed a larger error after TJ-OGS than before TJ-OGS [(T0, T1) vs. (T2, T3), P < 0.01; Table 2]. These findings suggest that the metal image of the S-PS adjacent to B point (Fig. 1) might affect the accuracy of AI-assisted landmark detection.
There are two sources of errors in the dental landmarks. First, regardless of the degree of accuracy in the dental landmarks (Table 1)

Conclusions
The cascade CNN algorithm proposed in this study can be used for landmark identification in serial lateral cephalograms despite the presence of OB, S-PS, FR, genioplasty, and bone remodeling.

Methods
Materials. A total of 3,188 lateral cephalograms of 797 patients with Class III malocclusion were used for the training and validation sets and the test set for automated landmark identification using the CNN model. All procedures were performed in accordance with relevant guidelines. The inclusion criteria were as follows: (1) Class III patients who underwent pre- and post-surgical orthodontic treatment and TJ-OGS with or without genioplasty and (2) Class III patients whose serial lateral cephalograms were available. The exclusion criterion was Class III patients who had craniofacial deformities. The training and validation sets for automated landmark identification by the CNN model included 3,004 lateral cephalograms of 751 Class III patients from 10 institutions (Table 4). Some of the patients in the training or validation set had more than four lateral cephalograms because additional progress lateral cephalograms were taken between time-points, while some of them had missing lateral cephalograms at a specific time-point.
For the test set, Class III patients with cephalograms obtained at the following four time-points were selected: initial (T0), pre-surgery (T1, taken at least 1 month before TJ-OGS; presence of OBs), post-surgery (T2, taken at least 2 months after TJ-OGS; presence of OBs and S-PS), and debonding (T3, presence of S-PS, FR, and bone remodeling change). As a result, the test set consisted of 184 cephalograms of 46 Class III patients from eight institutions (Table 4). It was subdivided into the genioplasty and non-genioplasty groups (n = 23 patients per group). Their characteristics are enumerated in Figure 1. Data sets were obtained from 10 centers in anonymized Digital Imaging and Communications in Medicine (DICOM) file format. Since finding the exact location of landmarks in a large lateral cephalogram image is relatively difficult, a fully automated landmark prediction algorithm with a cascade network was developed. 12 Two steps were followed: (1) detection of the region of interest (ROI; 256 × 256 or 512 × 512 pixels depending on the landmark) using RetinaNet 19 and (2) prediction of the landmark using U-Net 20 (Figure 2).
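The coordinate bookkeeping behind such a two-step cascade can be sketched as follows. This is an illustration only, not the authors' implementation: it assumes that the first stage yields an approximate ROI center and that the second stage yields a heatmap over the ROI whose peak marks the landmark. The function names `roi_window`, `heatmap_peak`, and `to_image_coords`, the image size, and the toy heatmap are all hypothetical.

```python
def roi_window(center, roi_size, image_size):
    """Top-left corner (x0, y0) of a square ROI centred on `center`.

    The window is shifted where necessary so that it stays inside the image.
    """
    cx, cy = center
    width, height = image_size
    half = roi_size // 2
    x0 = min(max(cx - half, 0), max(width - roi_size, 0))
    y0 = min(max(cy - half, 0), max(height - roi_size, 0))
    return x0, y0


def heatmap_peak(heatmap):
    """(x, y) position of the largest value in a 2-D list of rows."""
    cells = ((value, x, y) for y, row in enumerate(heatmap) for x, value in enumerate(row))
    _, x, y = max(cells, key=lambda cell: cell[0])
    return x, y


def to_image_coords(peak, roi_origin):
    """Map a position inside the ROI back to full-image pixel coordinates."""
    return roi_origin[0] + peak[0], roi_origin[1] + peak[1]


# Step 1 (assumed output of the ROI detector): approximate landmark position.
origin = roi_window(center=(100, 2000), roi_size=256, image_size=(1800, 2100))

# Step 2 (assumed output of the landmark network): toy 3 x 3 heatmap.
toy_heatmap = [[0.0, 0.0, 0.0],
               [0.0, 0.0, 0.9],
               [0.0, 0.1, 0.0]]
landmark = to_image_coords(heatmap_peak(toy_heatmap), origin)
```

Restricting the second stage to a small window keeps the fine localization problem at a fixed input size regardless of the size of the full cephalogram, which is the motivation the text gives for the cascade.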
Definitions of the 12 skeletal and eight dental landmarks are presented in Figure 3 and Table 5. The landmarks were digitized by a single orthodontist with 20 years of experience (human gold standard, MHH) and by the CNN model. The mean absolute error for each landmark was calculated from the absolute distance between the human gold standard and the AI-assisted detection. The degree of error was allocated into excellent (< 1.0 mm), good (1.0-1.5 mm), fair (1.5-2.0 mm), acceptable (2.0-2.5 mm), and unacceptable (> 2.5 mm) groups. Then, the accuracy percentage (AP) was calculated as the percentage of measurements in the excellent and good groups among all error groups, which means that an error within 1.5 mm was considered accurate. The degree of accuracy was defined as "very high" (AP > 90%), "high" (AP, 70-90%), "medium" (AP, 50-70%), and "low" (AP < 50%). Repeated measures analysis of variance (ANOVA) test with Tukey HSD, repeated measures multivariate analysis of variance (MANOVA) test, and independent t-test were performed using SPSS ver. 23.0 (IBM Corp., Armonk, NY, USA). P-values of < 0.05 were considered statistically significant.
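The grading scheme described above can be expressed as a short Python sketch. Several details are assumptions, because the text does not specify them: the distance is taken as the Euclidean distance between two (x, y) points, the pixel-to-millimetre factor `mm_per_pixel` is a hypothetical parameter, and values that fall exactly on a boundary (for example 1.5 mm or 70%) are assigned to the higher-numbered category on the error scale and to the better category on the AP scale. The sample errors are illustrative, not study data.

```python
import math

# Upper bounds (mm) of each error grade, taken from the text.
# Assumption: a value exactly on a bound falls into the next grade.
GRADE_BOUNDS = [(1.0, "excellent"), (1.5, "good"), (2.0, "fair"), (2.5, "acceptable")]


def radial_error_mm(gold, predicted, mm_per_pixel):
    """Euclidean distance between two (x, y) pixel positions, in millimetres."""
    return math.hypot(gold[0] - predicted[0], gold[1] - predicted[1]) * mm_per_pixel


def error_grade(error_mm):
    """Map an error in mm to one of the five error groups."""
    for upper_bound, label in GRADE_BOUNDS:
        if error_mm < upper_bound:
            return label
    return "unacceptable"


def accuracy_percentage(errors_mm):
    """Share (%) of errors graded excellent or good, i.e. below 1.5 mm."""
    accurate = sum(1 for e in errors_mm if error_grade(e) in ("excellent", "good"))
    return 100.0 * accurate / len(errors_mm)


def accuracy_degree(ap):
    """Map an AP value to the four degrees of accuracy defined in the text."""
    if ap > 90:
        return "very high"
    if ap >= 70:
        return "high"
    if ap >= 50:
        return "medium"
    return "low"


# Illustrative errors (mm) for one landmark across five cephalograms.
sample_errors = [0.4, 1.2, 1.6, 2.1, 3.0]
sample_ap = accuracy_percentage(sample_errors)   # 2 of 5 below 1.5 mm
sample_degree = accuracy_degree(sample_ap)
```

Under this scheme the mean error and the AP answer different questions: the mean error summarizes the typical distance, while the AP reports how often a landmark falls within the 1.5 mm criterion, so a landmark can have a good mean error and still only a medium AP.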