A Deep Learning-based Positioning Classification of the Mandibular Third Molars: Is Multi-task Deep Learning Useful?

Pell and Gregory, and Winter's classifications are frequently implemented to classify the mandibular third molars and are crucial for safe tooth extraction. This study aimed to evaluate the classification accuracy of convolutional neural network (CNN) deep learning models using cropped panoramic radiographs based on these classifications. We compared the diagnostic accuracy of single-task and multi-task learning after labeling 1,330 images of mandibular third molars from digital radiographs taken at the Department of Oral and Maxillofacial Surgery at a general hospital (2014-2021). The mandibular third molar classifications were analyzed using a VGG 16 model of a CNN. We statistically evaluated performance metrics (accuracy, precision, recall, F1 score, area under the curve [AUC]) for each prediction. We found that single-task learning was superior to multi-task learning (all p<0.05) for all metrics, with large effect sizes and low p-values. Recall and F1 scores for position classification showed medium effect sizes in single- and multi-task learning. To our knowledge, this deep learning study is the first to examine single-task and multi-task learning for the classification of mandibular third molars. Our results demonstrated the efficacy of implementing the Pell and Gregory, and Winter's classifications as specific respective tasks.


Introduction
The mandibular third molar is one of the most commonly impacted teeth. Treatment requires surgical tooth extraction, and extraction of the third molar is one of the most common surgical procedures worldwide. Since mandibular third molars cause a variety of complications, surgical treatment is primarily performed to treat the symptoms associated with impaction [1,2] and to prevent conditions that impair oral health, such as future dentition malocclusion [3]. Infection and neuropathy are common complications after extraction of the mandibular third molars, and the position of these molars is known to influence the occurrence of postoperative complications [4,5]. Therefore, an accurate understanding of the position of the mandibular third molars based on preoperative radiography leads to safer treatment.
Pell and Gregory's [6] and Winter's classifications [7] are often used for classifying third molars. In the Pell and Gregory classification, the mandibular third molars are classified according to their position with respect to the second molars and the ramus of the mandible: the position of the mandibular third molar in the mesio-distal relationship is classified into classes I, II, and III, and its depth is classified into levels A, B, and C. Winter's classification categorizes the slope of the mandibular third molar with respect to its vertical axis. These classifications help describe the condition of the third molar of the lower jaw in standardized language and make it easier to gauge the difficulty of tooth extraction. Unfortunately, clinical dentists and young oral surgeons often misunderstand this diagnosis and may not be able to complete an accurate interpretation process.
Deep learning is a machine learning method that can automatically learn the features required to predict a specific result from the given data. Complex learning is possible using a deep convolutional neural network (CNN) with multiple layers between inputs and outputs. Many achievements have been made in the application of these technologies in the medical field. In particular, analyses using deep learning based on medical images have provided comprehensive knowledge because this methodology can interpret data complexity more appropriately than standard statistical methods. In the field of dentistry, this methodology has also been applied to the identification and diagnosis of dental caries [8], endodontic lesions [9], dental implants [10], orthodontic diagnoses [11], and osteoporosis [12]. Various methods are currently being developed for use in machine learning. Among these, the multi-task learning method learns multiple classification items simultaneously, enabling multiple predictive diagnoses [13]. This is an efficient machine learning method that may improve performance compared to single-task learning by evaluating interrelated concepts.
The aim of the current study was to present a CNN-based deep learning model using panoramic radiographs according to the Pell and Gregory, and Winter's classifications, with the purpose of locating the precise positioning of the mandibular third molars. Furthermore, we propose multi-task learning as another approach for analyzing medical images while improving the generalization function of multiple tasks. In addition, another purpose of this study was to evaluate the accuracy of position classification of the mandibular third molars via multi-task deep learning.

Results
Table 1 shows the performance metrics of the single-task models; position classification showed high performance metrics in single-task learning. Figure S1 shows the ROC curves of single-task learning at 10-fold cross-validation. Table 2 shows the performance metrics of the three-task multi-task model, including information on class, position, and Winter's classification. Table 3 shows the performance metrics of the two-task multi-task model, including information on class and position. Figure S1 also shows the ROC curves of the two types of multi-task learning at 10-fold cross-validation.

Comparison of the single-task and multi-task models in terms of performance metrics
Table 4 shows the results of the statistical evaluation of the single- and multi-task models for each performance metric. In the comparison between the two groups by p-value, the single-task model was superior to the three-task multi-task model for all metrics. In the comparison of the single-task and two-task multi-task (class and position) models, the single-task model was superior in all metrics except the AUC for position classification. Regarding effect size, in the comparison of the single-task and three-task multi-task models, the effect size was large for all metrics except the AUC for position classification. On the other hand, in the comparison of the single-task and two-task multi-task models, recall and F1 score (in the position classification) showed medium effect sizes, and all other parameters showed small effect sizes.

Visualization
Grad-CAM was used to explain the prediction process of the CNN in terms of identifying each category. As a result, we visualized the judgment basis by determining the image area used for classification (Fig. 1, Figure S2). For the classification of class and position, the space above the mandibular third molar was regarded as a characteristic area of the CNN judgment basis. In contrast, for Winter's classification, the entire crown of the mandibular third molar was used as a characteristic area for classification judgment. In the multi-task models, in addition to the characteristics of each task, the characteristics of the other simultaneously learned tasks were added to the criteria. Given this tendency of the multi-task feature areas, we mainly focused on areas common to these models.

Discussion
In this deep learning study, mandibular third molar classification (class, position, Winter's classification) was performed with single-task and multi-task models. We found that the classification evaluation metrics of each single-task model were statistically superior to those of the multi-task model in which the three classification tasks were performed at the same time. There was no statistical difference in classification accuracy between the single-task models and the multi-task model in which two tasks (class and position) were performed simultaneously.
Multi-task modeling uses inductive transfer to improve task learning by using signals from related tasks discovered during training [20]. Multi-task models have a great advantage in reducing computational costs because they can perform multiple tasks simultaneously. In fact, in our research, we found a large difference when comparing the total number of parameters across the single-task models with the number of parameters of the multi-task models. In addition, multi-task models can improve the accuracy of the other classifications by learning the characteristics common to each task [13,21]. However, in our results, the classification performance of the multi-task models decreased with three tasks. This may be because each task has classification criteria based on different characteristics. In multi-task models, classification performance may be degraded as a result of conflicting areas of interest for the classification of each task.
The mandibular third molar classifications performed in this study were the Pell and Gregory classification as well as Winter's classification. In the Pell and Gregory classification, there are classes and positions that are classified according to the mesio-distal positional relationship and vertical depth of the mandibular third molar [6]. The class classification has a criterion that defines the amount of space between the mandibular second molar and the ramus of the mandible. The position classification defines the vertical positional relationship based on the mandibular second molar. The positional relationship between the second molar and the mandibular third molar is thus a common criterion, and accuracy improved when these two tasks were learned in a multi-task model. Unfortunately, no statistically significant improvement in performance metrics was observed. On the other hand, we found a statistically significant decrease in classification performance in the three-task multi-task model with the addition of Winter's classification. Specifically, in Winter's classification, the angulation and inclination of the mandibular third molar are judged, with the orientation of the mandibular third molar as the criterion [7]. Because feature extraction is weighted toward the entire mandibular third molar, it is possible that the features used by the CNN for prediction were different from those of the Pell and Gregory classification.
Few studies have used deep learning to classify the position of the mandibular third molar. Yoo et al. [22] performed class, position, and Winter's classifications of the mandibular third molar. The observed accuracy was 78.1% for class, 82.0% for position, and 90.2% for Winter's classification. Although Winter's classification cannot be compared directly because not all evaluations were made, our results are more accurate for class and position.
For the weights learnt by the CNN, Grad-CAM can use the gradient of the classification score with respect to the convolutional features determined by the network to understand which parts of the image are most important for classification [19]. Grad-CAM can thus visualize the judgment basis of learning by a CNN, which is otherwise regarded as a black box. In this study, visualization was performed using the gradient of the final convolution layer. The Grad-CAM visualization results for the class and position classifications often show similar feature areas, while Winter's classification primarily assigns features to the entire crown. Interestingly, in the multi-task models, the characteristics of the other tasks were added to the judgment basis in addition to the characteristics of each task. The rate of classification errors may therefore have been increased by referring to other areas that deviated from the judgment basis of the original most notable features in multi-task learning.
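The Grad-CAM mechanism described here, weighting the final convolutional feature maps by the globally pooled gradient of the class score, can be sketched as follows. This is a minimal illustration for a generic Keras model; the function name `grad_cam` and the normalization details are ours, not the study's exact implementation:

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    """Compute a Grad-CAM heatmap for the top predicted class.

    image: (H, W, C) float array; returns an (h, w) heatmap in [0, 1]."""
    # Auxiliary model exposing both the target conv layer and the prediction.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(conv_layer_name).output, model.output],
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        idx = int(tf.argmax(preds[0]))          # top predicted class
        class_score = preds[:, idx]
    grads = tape.gradient(class_score, conv_out)   # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))   # global-average-pool the gradients
    cam = tf.reduce_sum(conv_out[0] * weights[0], axis=-1)
    cam = tf.nn.relu(cam)                          # keep positive evidence only
    cam = cam / (tf.reduce_max(cam) + 1e-8)        # normalize to [0, 1]
    return cam.numpy()
```

In this study the gradient of the final convolution layer was used, so `conv_layer_name` would be the last convolutional layer of VGG16.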
Since statistically significant differences are easily recognized in proportion to the sample size in statistical hypothesis tests between two groups, effect sizes as well as statistically significant differences are important for evaluating substantial differences [23]. The effect size can be interpreted as a value that indicates the actual magnitude of a difference and does not depend on the unit of measurement; it is one of the most important indicators for analysis. In this study, there was a correlation between the statistical hypothesis tests between the two groups and the effect sizes, and the statistical evaluations showed that the sample size was appropriate. Our study is the first to report effect sizes for the evaluation of mandibular third molar position classification using deep learning. The effect sizes calculated from this experiment will be useful when pre-designing the sample size in a similar study. To our knowledge, there are few reports on the calculation of effect sizes for comparisons between deep learning models.
Extraction of the mandibular third molar is the most common oral surgical procedure, and its diagnosis is important not only for oral and maxillofacial surgeons but also for general dentists. Accurate diagnosis leads to safe tooth extraction. In the future, as an auxiliary diagnosis, it would be desirable to automatically diagnose the mandibular third molar using deep learning on captured digital panoramic X-ray images. For this purpose, we would like to work on automatic detection of the mandibular third molar using object detection.
The strength of our study over previous studies is that the influence of multi-task learning was statistically evaluated. The mandibular third molar classification grouping performed in this study was as close as possible to the clinical setting. To the best of our knowledge, this is the first study to statistically and visually reveal the influence of multi-task learning on mandibular third molar classification by deep learning. Grad-CAM revealed the areas of interest for each CNN model. The calculated effect sizes can also be used to estimate the sample size for future studies. This approach is suitable for statistically correct evaluation of results, rather than simply comparing values between different groups.
This study had several limitations. First, the amount of data for the current evaluation was modest.
Especially in Winter's classification, there are few buccolingual and inverted cases, which could result in bias. We verified our findings using a stratified K-fold CV so that there is no bias in the dataset used for training; however, it is important to conduct further studies with a larger amount of data. Second, the only CNN type used was VGG16. In the future, CNNs with various characteristics should be evaluated, and it will be necessary to verify the most suitable CNN. The third limitation is the search for a Pareto-optimal solution. In multi-task learning, classification performance is degraded as a result of conflicting areas of interest for the classification of each task. Therefore, in multi-task learning, it is necessary to consider the ratio of the gradients of the loss functions, in which the gradients of each task are relatively balanced.

Conclusions
To our knowledge, this deep learning study of the classification (class, position, Winter) of the mandibular third molar is the first to examine single-task and multi-task models. The multi-task model with two tasks (class and position) was not statistically significantly different from the single-task models, whereas the three-task multi-task classifications were statistically significantly less accurate than the respective single-task classifications. Finally, we found that, in the deep learning classification of the mandibular third molar, it is more effective to treat the Pell and Gregory, and Winter's classifications as their respective tasks. Our results will greatly contribute to the development of automatic classification and diagnosis of mandibular third molars from individual panoramic radiograph images in the future.

Study design
The purpose of this study was to evaluate the classification accuracy of CNN-based deep learning models using cropped panoramic radiographs according to the Pell and Gregory, and Winter's classifications for the location of the mandibular third molars. Supervised learning was chosen as the method for deep learning analysis. We compared the diagnostic accuracy of single-task and multi-task learning.

Data acquisition
We used retrospective radiographic image data collected from April 2014 to December 2020 at a single general hospital. This study was approved by the institutional review board of the institution hosting this work (the institutional review board of Kagawa Prefectural Central Hospital, approval number 1020) and was conducted in accordance with the ethical standards of the Declaration of Helsinki and its later amendments. Informed consent was waived by the institutional review board of Kagawa Prefectural Central Hospital for this retrospective study because no protected health information was used. Study data included patients aged 16-76 years who had panoramic radiographs taken at our hospital prior to extraction of their mandibular third molars.
In the Pell and Gregory classification, the mandibular second molar is the diagnostic criterion. Therefore, cases of mandibular second molar defects, impacted teeth, and residual roots were excluded from the current study. We also excluded unclear images, residual plates after mandibular fracture, and residual third molar roots or tooth extraction interruptions. The exclusions comprised residual third molar roots (39 teeth), mandibular second molar defects or residual teeth (15 teeth), impacted mandibular second molars (12 teeth), tooth extraction interruptions of third molars (9 teeth), unclear images (3 teeth), and residual plates after mandibular fracture (1 tooth). A total of 1,330 mandibular third molars were retained for further deep learning analysis.

Data preprocessing
Images were acquired using dental digital panoramic radiography systems (AZ3000CMR or Hyper-G CMF, Asahiroentgen Ind. Co., Ltd., Kyoto, Japan). All digital image data were output in Tagged Image File Format (TIFF; 2964 × 1464, 2694 × 1450, 2776 × 1450, or 2804 × 1450 pixels) via the Kagawa Prefectural Central Hospital Picture Archiving and Communication System (Hope Dr Able-GX, Fujitsu Co., Tokyo, Japan). Two maxillofacial surgeons manually identified areas of interest on the digital panoramic radiographs using Photoshop Elements (Adobe Systems, Inc., San Jose, CA, USA) under the supervision of an expert oral and maxillofacial surgeon. The images were cropped so as to cut out the mandibular second molar and the ramus of the mandible in the mesio-distal direction and to completely include the apex of the mandibular third molar in the vertical direction (Fig. 2). The cropped images had a resolution of 96 dpi. Each cropped image was saved in Portable Network Graphics format.
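As an illustration, the cropping-and-saving step can be sketched with Pillow. The crop-box coordinates and file names below are placeholders for the surgeons' manual annotations, not values from the study:

```python
from PIL import Image

def crop_roi(src_path, dst_path, box, size=(224, 224)):
    """Crop a manually annotated region of interest from a panoramic
    radiograph and save it as PNG.

    box = (left, upper, right, lower) in pixels."""
    img = Image.open(src_path).convert("L")   # panoramic radiographs are grayscale
    roi = img.crop(box).resize(size)          # resize to a fixed CNN input size
    roi.save(dst_path, format="PNG")
    return roi
```

In practice, each box would span from the distal surface of the second molar to the anterior ramus border mesio-distally and include the third molar apex vertically, as described above.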

Classi cation methods
The Pell and Gregory classification [6] is divided into class and position components. Class was assigned according to the positional relationship between the ramus of the mandible and the mandibular second molar in the mesio-distal direction. The distribution of the mandibular third molar classifications is shown in Table 5.
Class I: The distance from the distal surface of the second molar to the anterior margin of the mandibular ramus is larger than the diameter of the third molar crown.
Class II: The distance from the distal surface of the second molar to the anterior margin of the mandibular ramus is smaller than the diameter of the third molar crown.
Class III: Most of the third molar is located within the ramus of the mandible.
Position was assigned according to the depth of the third molar relative to the mandibular second molar.
Level A: The occlusal plane of the third molar is at the same level as the occlusal plane of the second molar.
Level B: The occlusal plane of the third molar is located between the occlusal plane and the cervical margin of the second molar.
Level C: The third molar is below the cervical margin of the second molar.
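The class and level rules above can be expressed as a small helper function. This is a hypothetical sketch: the `space_ratio` threshold used for Class III is our simplification of "mostly within the ramus", not a criterion stated by the classification:

```python
def pell_gregory(space_ratio, occlusal_depth):
    """Map measurements to a (class, level) pair of the Pell and Gregory scheme.

    space_ratio: (distance from the distal surface of the second molar to the
        anterior ramus border) / (mesio-distal crown diameter of the third molar).
    occlusal_depth: 'occlusal' (at the second molar's occlusal plane),
        'between' (between occlusal plane and cervical margin), or
        'below' (below the cervical margin)."""
    if space_ratio >= 1.0:
        cls = "I"            # enough space for the whole crown
    elif space_ratio > 0.0:
        cls = "II"           # space smaller than the crown diameter
    else:
        cls = "III"          # simplification: no space, tooth mostly in the ramus
    level = {"occlusal": "A", "between": "B", "below": "C"}[occlusal_depth]
    return cls, level
```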
Winter's classification comprises the following six categories [7,14].
Horizontal: The long axis of the third molar is horizontal (from 80° to 100°).
Mesioangular: The third molar is tilted toward the second molar in the mesial direction (from 11° to 79°).
Vertical: The long axis of the third molar is parallel to the long axis of the second molar (from 10° to −10°).
Distoangular: The long axis of the third molar is angled distally and posteriorly away from the second molar (from −11° to −79°).
Inverted: The third molar is inverted relative to the second molar (from 101° to −80°).
Buccoangular or lingualangular: The impacted tooth is tilted in the bucco-lingual direction.
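The in-plane angular ranges above can be mapped to categories with a small helper; this is a sketch, and treating every angle outside the listed ranges as "inverted" follows the residual range given above. Buccoangular/lingualangular cases are judged in the bucco-lingual plane and cannot be derived from this angle alone:

```python
def winter_category(angle_deg):
    """Map the angle between the long axes of the third and second molars
    (degrees, mesial tilt positive) to a Winter's category."""
    if 80 <= angle_deg <= 100:
        return "horizontal"
    if 11 <= angle_deg <= 79:
        return "mesioangular"
    if -10 <= angle_deg <= 10:
        return "vertical"
    if -79 <= angle_deg <= -11:
        return "distoangular"
    return "inverted"   # remaining range: beyond 100° or below -80°
```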

CNN model architecture
The study evaluation was performed using the standard deep CNN model (VGG16) proposed by the Oxford University VGG team [15]. We used a standard CNN consisting of convolutional and pooling layers, with a total of 16 weight layers (i.e., convolutional and fully connected layers).
For efficient model construction, fine-tuning the weights of existing models as initial values for additional learning is possible. Therefore, the VGG 16 model was used for transfer learning with fine-tuning, using weights pre-trained on the ImageNet database [16]. The process of deep learning classification was implemented using Python (version 3.7.10) and Keras (version 2.4.3).
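A transfer-learning setup of this kind might look as follows in Keras. The dense head size (256 units) is an assumption on our part; the SGD settings mirror the training details reported in this paper:

```python
import tensorflow as tf

def build_single_task(num_classes, weights="imagenet"):
    """VGG16 backbone with pre-trained weights as initial values and a new
    softmax head, for transfer learning with fine-tuning (a sketch, not the
    study's exact architecture)."""
    base = tf.keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=(224, 224, 3))
    base.trainable = True                      # fine-tune all layers
    x = tf.keras.layers.Flatten()(base.output)
    x = tf.keras.layers.Dense(256, activation="relu")(x)
    out = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    model = tf.keras.Model(base.input, out)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
        loss="categorical_crossentropy", metrics=["accuracy"])
    return model
```

For example, `build_single_task(3)` would yield a three-class model for the class (I/II/III) or position (A/B/C) task, and `build_single_task(6)` one for Winter's classification.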

Data set and model training
Model training was generalized using K-fold cross-validation. Our deep learning models were evaluated using 10-fold cross-validation to avoid overfitting and bias and to minimize generalization errors. The dataset was split into 10 random subsets using stratified sampling to retain the same class distribution across all subsets. Within each fold, the dataset was split into separate training and test datasets using a 90-10% split. The model was trained 10 times to obtain the prediction results for the entire dataset, with each iteration holding out a different subset for validation. Details on data augmentation can be found in the appendix.
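The stratified split can be sketched with scikit-learn; the per-class label counts below are dummy values (only the 1,330 total matches the dataset), not the study's actual class distribution:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Stratified 10-fold split: each fold keeps the class distribution of the
# full dataset, giving a 90/10 train-test split per fold.
labels = np.array([0] * 700 + [1] * 430 + [2] * 200)   # hypothetical counts, n=1330
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
folds = list(skf.split(np.zeros(len(labels)), labels))
```

Each `(train_idx, test_idx)` pair in `folds` can then drive one training run, so that the predictions over all 10 folds cover the entire dataset exactly once.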

Multi-task
As another approach to the mandibular third molar classifier, a deep neural network with multiple independent outputs was implemented and evaluated. Two multi-task CNNs were proposed. One is a CNN model that analyzes the three tasks of the Pell and Gregory, and Winter's classifications simultaneously; the other analyzes the class and position classifications that make up the Pell and Gregory classification simultaneously. These models can significantly reduce the number of trainable parameters compared with using two or three independent CNN models for mandibular third molar classification. The proposed models have a shared feature-learning layer, comprising convolutional and max-pooling layers, followed by two or three separate branches with independent, fully connected layers used for classification. For classification, two or three separate branches consisting of dense layers were connected to the output layers of the Pell and Gregory, and Winter's classifications, each with softmax activation (Fig. 3). Table 6 shows the number of parameters for each of the two types of multi-task models and the single-task VGG 16 models. In the multi-task models, each model was implemented to learn the classification of the mandibular third molars. In both trainings, cross-entropy was used as the error function.

Deep learning procedure
All CNN models were trained and evaluated on a 64-bit Ubuntu 16.04.5 LTS operating system with 8 GB of memory and an NVIDIA GeForce GTX 1080 (8 GB graphics processing unit). The optimizer used stochastic gradient descent with a fixed learning rate of 0.001 and a momentum of 0.9, which achieved the lowest loss on the validation dataset after multiple experiments. The model with the lowest loss on the validation dataset was chosen for inference on the test datasets. Training was performed for 300 epochs with a mini-batch size of 32.
The model was trained 10 times in the 10-fold cross-validation test, and the result of the entire dataset was obtained as one set. This process was repeated 30 times for each single-task model (class, position, and Winter's classification) and each multi-task model (class and position [two tasks], and all three tasks) using different random seeds.
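The shared-trunk, multi-head design described above (a shared VGG16 feature extractor with independent dense branches and softmax outputs per task) can be sketched in Keras as follows. The branch sizes are assumptions, and `weights=None` is used here only to avoid the ImageNet download; pre-trained weights would be used in practice:

```python
import tensorflow as tf

def build_multi_task(weights=None):
    """Three-task model: shared VGG16 trunk, one dense branch per task
    (class: 3, position: 3, Winter: 6), each ending in its own softmax."""
    base = tf.keras.applications.VGG16(
        include_top=False, weights=weights, input_shape=(224, 224, 3))
    shared = tf.keras.layers.Flatten()(base.output)   # shared feature-learning trunk
    outputs = []
    for name, n_classes in [("class", 3), ("position", 3), ("winter", 6)]:
        h = tf.keras.layers.Dense(256, activation="relu")(shared)
        outputs.append(
            tf.keras.layers.Dense(n_classes, activation="softmax", name=name)(h))
    model = tf.keras.Model(base.input, outputs)
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9),
        loss={"class": "categorical_crossentropy",
              "position": "categorical_crossentropy",
              "winter": "categorical_crossentropy"})
    return model
```

The two-task variant would simply omit the "winter" branch; because the trunk is shared, the total parameter count grows far more slowly than training three independent VGG16 models.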

Performance metrics and statistical analysis
We evaluated the performance metrics of accuracy, precision, recall, and F1 score, along with the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC). The ROC curves were drawn for the complete dataset from the 10-fold cross-validation, producing the median AUC value.
Details on the performance metrics are provided in the appendix.
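For reference, these metrics can be computed with scikit-learn as below. The labels and softmax outputs are toy values, and macro averaging is our assumption, since the averaging mode is not stated here:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

y_true = np.array([0, 1, 2, 1, 0, 2])   # toy ground-truth labels (3 classes)
y_pred = np.array([0, 1, 2, 0, 0, 2])   # toy predicted labels
y_prob = np.full((6, 3), 0.1)           # toy softmax outputs ...
y_prob[np.arange(6), y_pred] = 0.8      # ... each row sums to 1.0

acc = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")
# Multiclass AUC via one-vs-rest on the predicted probabilities.
auc = roc_auc_score(y_true, y_prob, average="macro", multi_class="ovr")
```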
The differences between performance metrics were tested using the JMP statistical software package (version 14.2.0) for Macintosh (SAS Institute Inc., Cary, NC, USA). Statistical tests were two-sided, and p-values < 0.05 were considered statistically significant. Parametric or non-parametric tests were chosen based on the results of the Shapiro-Wilk test. For multiple comparisons, Dunnett's test was performed with the single-task model as the control.
Differences between each multi-task model and the single-task model were calculated for each performance metric using the Wilcoxon test. Effect sizes were calculated as Hedges' g (unbiased Cohen's d) using the following formula [17]: g = ((M1 − M2) / s_pooled) × (1 − 3/(4(n1 + n2) − 9)), where s_pooled = √(((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2)). M1 and M2 are the means for the multi-task and single-task models; s1 and s2, respectively, are the standard deviations for the multi-task and single-task models; and n1 and n2, respectively, are the sample sizes for the multi-task and single-task models.
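The effect-size calculation can be sketched as a small function; this is our implementation of the standard Hedges' g formula, not code from the study:

```python
import math

def hedges_g(m1, s1, n1, m2, s2, n2):
    """Hedges' g (bias-corrected Cohen's d): the standardized mean difference
    using the pooled standard deviation, multiplied by a small-sample
    correction factor."""
    s_pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    d = (m1 - m2) / s_pooled                 # Cohen's d
    correction = 1 - 3 / (4 * (n1 + n2) - 9)  # removes small-sample bias
    return d * correction
```

With 30 repetitions per model, g could be computed per metric from the two groups' means and standard deviations and then interpreted against Cohen's thresholds.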
The effect size was interpreted based on the criteria proposed by Cohen [18], such that 0.8 was considered a large effect, 0.5 a moderate effect, and 0.2 a small effect.

Fig. 2. A depiction of the crop method for data preprocessing.