Ethics and information governance. Written informed consent was obtained from all patients before CBCT examinations. The research protocol followed the principles of the Declaration of Helsinki and was approved by the non-interventional Institutional Review Board (IRB) within research project YDU 82-1147 of Near East University, Faculty of Medicine, Health Sciences Ethics Committee (Nicosia, Cyprus). De-identification was performed in line with the Information Commissioner's "Anonymisation: managing data protection risk" code of practice (https://ico.org.uk/media/1061/anonymisati code.pdf) and validated by the aforementioned institution. Only de-identified, anonymized retrospective data were used for research, without the active involvement of patients.
Testing the system. The primary goal of this study is to evaluate the ability of this AI system (Diagnocat) to enhance the diagnostic capabilities of the dentist and radiologist. To test this, several steps were needed to prepare the dataset for viewing and analysis. These steps are necessary because of the inherent variability of CBCT datasets produced by different CBCT machines, as well as the variability in clinical experience among the examiners. The study therefore has two distinct parts: the first was preparing the dataset for evaluation, and the second was evaluating the usefulness of the system for enhancing diagnostic capabilities.
Part (A): Preparing the dataset for evaluation:
- Image processing. Due to the high variety of CBCT scanning devices and their different calibration settings, CBCT images need to be normalized for both manual and automatic diagnostics. This is usually done with the help of the window level and width DICOM properties extracted from the scan metadata. Unfortunately, the radiodensity of bone and tissue differs between scans even from the same scanner manufacturer when the extracted window is applied, and the difference is significantly higher when corresponding windows are applied to images from different devices. We therefore apply a normalization process based on voxel radiodensity measured in Hounsfield Units (HU):
- HU values below −1000 (air radiodensity) are clipped.
- HU values below 5th and above 95th percentiles of an image are clipped.
- HU values are standardized by subtracting the mean value and dividing the difference by the standard deviation.
The last step may vary depending on the task. When there is no need to preserve difference between a dense bone (2000 HU) and metal (3000 HU) radiodensities, HU values above 2000 can be clipped, and resulting values can be rescaled to [0, 1] range.
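The normalization steps above can be sketched as follows. This is a minimal numpy sketch; the function name, the `clip_dense` flag, and the small epsilon guards are illustrative choices, not the production implementation:

```python
import numpy as np

def normalize_hu(volume, clip_dense=False):
    """Normalize a CBCT volume of Hounsfield Unit (HU) values.

    Illustrative sketch of the three normalization steps described
    above; names and the epsilon guards are assumptions.
    """
    v = volume.astype(np.float64)
    # 1. Clip values below air radiodensity (-1000 HU).
    v = np.clip(v, -1000, None)
    # 2. Clip below the 5th and above the 95th percentile of the image.
    lo, hi = np.percentile(v, [5, 95])
    v = np.clip(v, lo, hi)
    if clip_dense:
        # Task variant: when dense bone (2000 HU) need not be
        # distinguished from metal (3000 HU), clip above 2000 HU
        # and rescale to the [0, 1] range instead of standardizing.
        v = np.clip(v, None, 2000)
        return (v - v.min()) / (v.max() - v.min() + 1e-8)
    # 3. Standardize: subtract the mean, divide by the standard deviation.
    return (v - v.mean()) / (v.std() + 1e-8)
```

The `clip_dense` branch corresponds to the task-dependent variant of the last step mentioned above.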
- Localization datasets. To obtain precise segmentation results for training purposes, dental and radiology specialists used the ITK-SNAP software35, which allows users to navigate 3D images in three planes. Once annotated, each segmentation mask was automatically examined to eliminate human error, e.g. misalignment between a tooth volume and the resulting mask. Each dataset split mimics the distribution of the original one. These modules are:
- ROI localization module. The dataset consists of 562 CBCT scans with segmented teeth and jaws. The scans are equally distributed among 19 scanner models of 12 scanner manufacturers.
- Tooth localization and numeration module. The dataset consists of 684 CBCT scans with segmented and numerated teeth. The scans are equally distributed among 24 scanner models of 15 scanner manufacturers.
- Periodontitis module. The dataset consists of 99 CBCT scans with precisely segmented alveolar bone area and 120 CBCT scans with precisely segmented enamel area of teeth. The scans are equally distributed among 11 scanner models of 8 scanner manufacturers. Each side of a tooth (mesial, distal, oral, and vestibular) has a group of three periodontium landmark points: a point of cementoenamel junction, a point of bone attachment, and a point of bone peak within 2 mm tooth vicinity. The dataset with segmented enamel area is used to obtain the first point, while the dataset with segmented alveolar bone is used to obtain two latter points.
- Caries localization module. The dataset consists of 4398 tooth volumes with a context area. The class labels are: background (no pathology), caries sign, metallic artefact, and non-contrast filling. One instance can have multiple conditions. The dataset was additionally validated by a lead radiologist.
- Periapical lesion localization module. The dataset consists of 2800 tooth volumes with a context area. The class labels are: background (no pathology), periodontal ligament (PDL) widening, poorly circumscribed radiolucency, well circumscribed radiolucency, and radiopacity. One instance can have multiple conditions. The dataset was additionally validated by a lead radiologist.
- Classification (Descriptor) datasets. Descriptor, the main diagnostic module, is a complex model that, besides accurate data collection, requires several iterations of dataset formation and annotation regulations. We provide a detailed description of the annotation process and insights on managing class imbalance and high model uncertainty.
- Annotation protocol. Every radiologist was provided with an instruction describing the annotation, including a list of required pathologies, access to the internal web-based application that provided a data collection form, and an option to download the study DICOM for standalone viewing. Additionally, every radiologist reviewed and described 3 sample CBCTs containing all target pathologies, which were then reviewed by the study supervisor, a highly experienced oral and maxillofacial radiologist, who provided feedback to the radiologist. Each radiologist independently studied a CBCT image in a clinical viewer software and noted the presence or absence of each condition for each tooth in the target list. Radiologists were required to answer either "applicable" or "not applicable" for every condition in table 4.
- Initial annotation. During the first stage of the annotation process, a group of experienced radiologists annotated a large set of images following the annotation protocol. Images were randomly sampled, filtered by the study coordinator according to the inclusion and exclusion criteria, and then passed to radiologists. Before the main annotation process, annotators were trained and evaluated by the study coordinator:
- Participant studied annotation instruction and protocol
- Participant annotated a small set of exemplary images, the study coordinator evaluated the results and provided feedback to the participant
During this stage each sample (distinct patient-tooth) received 1 diagnostic vote for every condition in consideration.
- Test set separation. Following the completion of the first stage of annotation, a test set was separated from the annotated data pool and excluded from all subsequent development activities. Test images were sampled so as to have at least N positives and N negatives for every condition. The choice of N = 300 was motivated by the available number of positive samples for rare conditions. The sampling procedure was as follows.
- Randomly sample a condition.
- If the test set contained less than N positives of the condition, sample a random positive example from the data pool and allocate it to the test set. Additionally, allocate all other samples from the same image.
- Repeat until the test set contains at least N positives and N negatives for each condition.
Each sampled example contains annotation for all target conditions, so the resulting test set contains more than N positives/negatives for the majority of conditions. Additionally, the test set contains a different number of positives and negatives for each condition, typically, negatives outnumbering the positives (class imbalance). This influenced our decision to choose the AUPRC metric for evaluation as it is robust to significant class imbalance.
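The sampling loop above can be sketched as follows. Everything here is illustrative: the `pool` data layout, the function names, and the choice to draw only among conditions that still lack samples are assumptions, not the study's actual code:

```python
import random

def build_test_set(pool, conditions, n=300, seed=0):
    """Sketch of the test-set sampling loop described above.

    `pool` maps image id -> list of (tooth_id, {condition: bool})
    samples. Allocating an image moves all of its samples to the
    test set, as in the described procedure.
    """
    rng = random.Random(seed)
    test_imgs = set()

    def counts(cond):
        # Positives/negatives for one condition in the current test set.
        pos = neg = 0
        for img in test_imgs:
            for _, labels in pool[img]:
                pos += bool(labels[cond])
                neg += not labels[cond]
        return pos, neg

    while True:
        lacking = [c for c in conditions
                   if counts(c)[0] < n or counts(c)[1] < n]
        if not lacking:
            return test_imgs            # every condition has >= n pos and neg
        cond = rng.choice(lacking)      # sample a (still lacking) condition
        pos, _ = counts(cond)
        want_positive = pos < n
        candidates = [img for img in pool if img not in test_imgs
                      and any(bool(l[cond]) == want_positive
                              for _, l in pool[img])]
        if not candidates:
            return test_imgs            # data pool exhausted for this condition
        test_imgs.add(rng.choice(candidates))
```

Because a whole image is allocated at a time, the final set typically exceeds N positives and negatives for most conditions, as noted above.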
- Test set additional annotation. An additional vote from a second radiologist was obtained for each tooth-condition (sample). Then, for the sample where the first two radiologists disagreed, another vote from the third radiologist was obtained. Ground truth was decided by the majority vote (2-vote agreement).
- Model development dataset. A set for model development purposes formed from remaining annotated data pool (i.e. not included in the test set) was split into training and validation subsets as it was fit for the task. As the majority of examples in the train set had only 1 vote, it was expected that some labels would be incorrect. However, deep learning is known to have some level of robustness against noisy training labels, and we hypothesized that the models would be able to learn the correct labels and achieve satisfactory scores. Additionally, the partially trained model could be used to find and correct the erroneous votes by measuring disagreement between votes and model predictions. In the course of this project, this hypothesis was confirmed. While samples with 1 vote were widely used in the train set, model validation was performed using standard 2-vote agreement protocol.
- Rare case mining. Following the separation of the train set, a series of models was trained. The best model was then used to enrich the train set by mining rare cases and finding potentially erroneous votes in the train set. Initially, rare conditions did not have enough positive examples in the train set. To rectify this, the following mining procedure was implemented:
- Define a set of rare condition list where additional data is required.
- Perform inference of the best model available at a time on studies from the non-annotated data pool.
- Calculate information entropy for every condition in the rare condition list.
- Sample teeth with high information entropy.
- Run images containing sampled teeth through the annotation process.
Information entropy is defined as
S = −Σᵢ Pᵢ log Pᵢ,
where Pᵢ is the probability of the ith outcome over the set of all possible outcomes. For a binary task, such as our formulation, i iterates over "present" and "not present". Information entropy is highest when the probabilities of "present" and "not present" are both equal to 0.5. Intuitively, information entropy is a measure of the uncertainty in the probability distribution. High uncertainty on an example excluded from the training and validation sets means that the training process is likely to improve if the example is annotated and added to the training set.
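For a per-tooth binary prediction, the entropy used to rank candidates for annotation can be computed as follows (a minimal sketch; the function name is illustrative):

```python
import math

def binary_entropy(p):
    """Information entropy S = -sum_i P_i log P_i for a binary outcome,
    where p is the predicted probability of "present"."""
    if p in (0.0, 1.0):
        return 0.0  # a certain prediction carries no uncertainty
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))
```

Teeth whose predicted probability of a rare condition lies near 0.5 receive the highest entropy and are therefore routed to annotation first.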
- Incorrect vote mining and rectification. Since we collected only 1 vote for a large number of images allocated to the train set, some of these votes were submitted incorrectly. To rectify this, we implemented the following procedure:
- Perform K-fold inference on all images in the train set using the best model available at the time. K-fold inference procedure:
(a) Split train images into K disjoint subsets
(b) Pick a subset i
(c) Train a model on all subsets except i
(d) Perform inference on subset i and record the resulting scores
(e) Repeat for every subset i
- Calculate radiologist-model disagreement.
- Sample from cases where radiologist-model disagreement was high.
- Collect additional votes for sampled cases using the annotation process.
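The K-fold inference steps (a)-(e) above can be sketched as follows. This is an illustrative sketch: the `train_fn` callback, its `predict` interface, and the random shuffling are assumptions, not the actual training code:

```python
import numpy as np

def kfold_inference(images, labels, train_fn, k=5, seed=0):
    """Out-of-fold scores for every train image, per steps (a)-(e).

    `train_fn(train_x, train_y)` is assumed to return a model with a
    `predict(x)` method. Each image receives exactly one score from a
    model that never saw it during training; comparing these scores
    with the radiologist votes gives the disagreement measure.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(images))
    folds = np.array_split(idx, k)                 # (a) K disjoint subsets
    scores = np.empty(len(images))
    for i, fold in enumerate(folds):               # (b), (e) every subset i
        train_idx = np.concatenate(
            [f for j, f in enumerate(folds) if j != i])
        model = train_fn(images[train_idx], labels[train_idx])  # (c)
        scores[fold] = model.predict(images[fold])              # (d)
    return scores
```

High absolute disagreement between `scores` and the single collected vote then flags cases for additional annotation, as described above.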
- Dataset statistics. We evaluated the performance of our models on a dataset with 705017 samples consisting of 28745 teeth and 25 conditions across 1135 CBCT scans. The scans are distributed among 31 scanner models of 17 scanner manufacturers.
Each step analyzes data at progressively higher spatial resolution, from a coarse voxel size of 3 mm³ at the initial stage of region of interest (ROI) localization down to a fine voxel size of 0.15 mm³ for the final per-tooth diagnosis. This multi-step pathway was required due to the large memory footprint of CBCT images at the original resolution. The flow of the system pipeline is shown in figure 1.
The first step is the ROI localization module (fig. 1). Reducing the field of view (FoV) to an ROI sufficient for the analysis of dental diseases allows the diagnostics to be completed without information loss. The ROI localizer identifies specific regions of jaws and teeth with some extended context and excludes other anatomical regions. The localization module is based on a volumetric modification of the U-Net architecture36 performing 3-class semantic segmentation: background, teeth, and jaw bone. To fit large-FoV volumes, the module operates at 3 mm³ per-voxel resolution.
The cropped image is then passed to the tooth localization and numeration module (fig. 1), which plays a crucial role in the diagnostic pipeline. Tooth localization allows further analysis of different conditions inside and around a tooth, while tooth numeration helps with determining number-specific attributes and inter-tooth relations. The localization and numeration module is implemented as a volumetric U-Net performing semantic segmentation over 54 classes (the background, 52 possible teeth, and an additional class for supernumerary teeth). It operates at 1 mm³ per-voxel resolution.
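The 52 tooth classes match the 32 permanent plus 20 primary teeth of the FDI two-digit notation. A possible mapping from segmentation class index to FDI number might look like this; the exact class ordering used by the system is an assumption, shown only to illustrate the class layout:

```python
# FDI two-digit tooth numbers: 32 permanent (11-18, 21-28, 31-38, 41-48)
# plus 20 primary (51-55, 61-65, 71-75, 81-85) give 52 tooth classes.
PERMANENT = [10 * q + t for q in (1, 2, 3, 4) for t in range(1, 9)]
PRIMARY = [10 * q + t for q in (5, 6, 7, 8) for t in range(1, 6)]
FDI_NUMBERS = PERMANENT + PRIMARY  # 52 entries

def class_to_fdi(class_id):
    """Map a segmentation class index to an FDI tooth number.

    Assumed (illustrative) layout: class 0 is background, classes
    1-52 are teeth in the order above, class 53 is supernumerary.
    """
    if class_id == 0:
        return None                # background
    if class_id == 53:
        return "supernumerary"
    return FDI_NUMBERS[class_id - 1]
```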
At the next step, each localized tooth area is extended with some context and passed to Descriptor (fig. 1), which estimates the probabilities of a tooth being affected by a set of conditions (table 1). Descriptor is the principal classification module and is implemented as an ensemble of ResNeXt37 (with integrated squeeze-and-excitation blocks38) and DenseNet39 architectures performing multiple binary classifications over 25 classes.
Each tooth volume is further examined by three modules for auxiliary classification purpose. (1) The periodontitis module detects and evaluates alveolar bone loss in close vicinity to a tooth. It allows classification of 3 bone loss types of different severity by calculating distances between pairs of periodontium landmarks segmented by a separate landmark localizer. (2) The caries localization module defines signs of caries probability using segmentation of carious lesions found inside a tooth area. (3) The periapical lesion localization module detects periapical lesion presence and allows classification of 4 lesion types found around a tooth. The embedded localizers of three classification modules are implemented as volumetric U-Nets performing semantic segmentation.
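As an illustration of the periodontitis module's distance-based logic, bone loss can be graded from the distance between two of the segmented landmarks (the cementoenamel junction and the bone attachment point). The thresholds below are hypothetical placeholders, not the system's calibrated severity boundaries:

```python
import math

def distance(p, q):
    """Euclidean distance between two 3-D landmark points (in mm)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def bone_loss_severity(cej, bone_attachment, mild_mm=2.0, severe_mm=4.0):
    """Grade alveolar bone loss from the CEJ-to-bone-attachment distance.

    Grading bone loss from distances between periodontium landmark
    pairs follows the module description above; the `mild_mm` and
    `severe_mm` thresholds are illustrative placeholders only.
    """
    d = distance(cej, bone_attachment)
    if d <= mild_mm:
        return "none/mild"
    if d <= severe_mm:
        return "moderate"
    return "severe"
```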
Part (B) Evaluating the ability of the AI system (Diagnocat) to enhance the diagnostic capabilities of the dentist and radiologist:
- Evaluating diagnostic capabilities of the Diagnocat AI system. The primary endpoint was to test the end-to-end performance of the AI system, measuring the tooth localization, numeration, and diagnostic sub-modules as a single system. This allowed estimation of the overall safety and performance of the proposed system.
The Diagnocat AI software was used to obtain binary condition predictions from 3D CBCT scans using its predefined operating point (checkpoints of the trained models), which were then compared to the ground truth to calculate sensitivity (the proportion of correctly identified conditions) and specificity (the proportion of correctly identified teeth not having conditions).
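With those definitions, sensitivity and specificity can be computed from paired binary predictions and ground-truth labels. A minimal sketch; the flat per-tooth-condition data layout is an assumption:

```python
def sensitivity_specificity(predictions, ground_truth):
    """Sensitivity and specificity from paired binary labels.

    sensitivity = TP / (TP + FN): correctly identified conditions.
    specificity = TN / (TN + FP): correctly identified healthy teeth.
    """
    pairs = list(zip(predictions, ground_truth))
    tp = sum(1 for p, g in pairs if p and g)
    tn = sum(1 for p, g in pairs if not p and not g)
    fp = sum(1 for p, g in pairs if p and not g)
    fn = sum(1 for p, g in pairs if not p and g)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return sensitivity, specificity
```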
The secondary endpoint was to evaluate the examiners' performance and compare it to the AI results. Although the examiners were tested on data that each of them had annotated beforehand, the results showed comparable diagnostic quality between Diagnocat and the examiners. For the performance evaluation, a set of 300 CBCT maxillofacial images in DICOM format was sourced consecutively from three clinics (100 images from each site) and anonymized by replacing "PatientName" with an empty string and truncating "PatientBirthDate" to the first day of the nearest year. Subsequently, images were screened against the inclusion and exclusion criteria.
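The described anonymization of the two DICOM attributes can be sketched as follows. It is shown on a plain dict standing in for the DICOM header (a real pipeline would edit the tags with a DICOM library such as pydicom), and mapping "nearest year" to the same calendar year is an assumption:

```python
def anonymize_tags(tags):
    """Blank PatientName and truncate PatientBirthDate (YYYYMMDD)
    to January 1st of the birth year.

    `tags` is a plain dict used here as a stand-in for the DICOM
    dataset; interpreting "nearest year" as the same year is an
    assumption of this sketch.
    """
    out = dict(tags)
    out["PatientName"] = ""                 # replace with empty string
    birth = out.get("PatientBirthDate", "")
    if len(birth) == 8:                     # DICOM DA format: YYYYMMDD
        out["PatientBirthDate"] = birth[:4] + "0101"
    return out
```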
The inclusion criteria were:
- a patient with the ability to consent to participate in the project
- a patient of 21 years or older
- anonymized CBCT image of maxillofacial region, and
- both model and manufacturer of imaging device are not present in the training dataset of the system (allows testing generalizability to new imaging devices).
The exclusion criteria were:
- images containing significant motion artifacts (as judged by the radiologist coordinating the study)
- images containing severe artifacts such as streak artifacts or beam hardening (a low- and medium-artifact remover was applied using device-specific software, when available, for standardization of the images); and
- images of patients with cleft lip and palate, trauma, bone lesions, and severe bone erosions.
The final set of images was then reviewed by a scientific coordinator (an internationally recognized dentomaxillofacial radiologist with at least 18 years of experience), and 10 images were rejected due to significant motion artifacts. To establish the ground truth, examiners were recruited from experienced dentomaxillofacial radiologists. In total, the data was evaluated by four of them, with a mean of 10 years of professional experience.
Each examiner was responsible for annotating the CBCT anatomy on their own, and the examiners were unaware of the patients' conditions. Each examiner was trained by the study coordinator to annotate 3D CBCT scans and fill in the provided form correctly. After the study coordinator evaluated the examiners and approved them as sufficiently trained, the study proceeded to the actual data collection. Each radiologist received a random, non-overlapping portion of the dataset via electronic means (a shared folder). They evaluated the cases in their clinical environment, filled in the spreadsheets, and saved them to separate shared folders; the examiners could not access each other's forms. After they had cumulatively evaluated and annotated the full CBCT dataset, a second round of annotation started, in which the examiners were assigned a different random subset of the dataset. After the second round was finished, the third commenced. At the end of the third round, the scientific coordinator had collected examinations from 3 radiologists for every sample. Evaluations took place between December 2019 and April 2020. Data was extracted at the individual and group comparison levels. To establish the true values of the conditions, a consensus process was performed, in which the ground truth was taken as the majority (at least 2 of 3 votes) for each case, tooth, and condition. The whole process was then reviewed again by the study coordinator for final adjustments and to establish the final ground-truth evaluations for each patient, tooth, and condition. Inference of the Diagnocat system was performed once on the full dataset: an engineer performed inference using the production version of the system.
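The majority-vote consensus for a single (case, tooth, condition) triple can be expressed compactly (a sketch; the boolean vote encoding is an assumption):

```python
def consensus(votes):
    """Majority-vote ground truth from three radiologist votes:
    the condition is present if at least 2 of 3 votes mark it present."""
    present = sum(bool(v) for v in votes)
    return present >= len(votes) // 2 + 1  # strict majority
```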
- Evaluating the clinical performance. After evaluating the diagnostic capability of the Diagnocat AI system, the next step was to evaluate the clinical performance of the system by comparing the accuracy of the diagnosis and the time required for reading between aided and unaided cases. Evaluation duration was compared between aided and unaided readings to determine whether the addition of Diagnocat suggestions changes the time required to review a case. It was estimated that approximately eight weeks were required to conduct this study, allocated across each stage from recruitment to analysis: recruitment and consent - 1 week; training and randomization - 1 week; investigation - 1 week; washout - at least 1 month; investigation - 1 week; and analysis - 2 weeks. The washout period of at least 1 month was intended to minimize memory bias and confounding factors; the crossover design reduced confounding factors as well. To identify the number of required examiners, a power analysis was performed40, indicating that at least 20 examiners were required in total. Thus, 24 dentists were enrolled in the study as examiners and divided into two groups at a 1:1 ratio: (1) Group 1 examined the CBCTs aided by the AI system; (2) Group 2 examined the CBCTs unaided. The power was set to 0.80, with a 5% error level. Enrolled examiners were qualified general dental practitioners of varying experience with no defined specialty interest. The following inclusion and exclusion criteria were applied:
- Qualified dentist - General Dental Practitioner
- At least five years of experience in dentistry
- Ability to interpret dental CBCT
- Access to CBCT software at the workplace
- Unable to commit to the study (exclusion criterion)
- Employees of Diagnocat and their relatives were excluded from participation
The scientific advisor for CBCT scan reading conducted one-hour training sessions for the examiners, including the use of the Diagnocat AI system. 10 training CBCT scans, encompassing the full spectrum of required diagnostics, were used for training and practice purposes. A list of all possible diagnoses was also given to ensure that the scope of diagnosis was calibrated and that participants were aware of it. Remote support was available to guide examiners through the training process. The overall dataset for this study contained 40 CBCT images: 30 study images and 10 images for examiner training. These scans were sampled randomly from the dataset of the standalone performance test. The 30 study images were sequentially numbered after randomization, so each participant had a different sequence of clinical cases. Each CBCT scan required all 32 teeth to be diagnosed with none, one, or more pathologies: 32 units in each CBCT, multiplied by the number of pathologies identified in each unit, with a total of 30 CBCT scans per group. In this way, 960 (30 × 32 = 960) diagnostic activities were carried out in each investigation by each participant; the crossover nature of the study ensured that this was performed twice by each participant. Table 3 shows the conditions that the examiners were asked to diagnose. Once the investigations were completed, the raw data from the forms filled in by the unaided group and by Diagnocat was transformed into the same format using automated scripts written before the study and then sent to an independent blinded assessor. This assessor analyzed the data and compared it with the ground truth (the same as in the standalone Diagnocat performance test). Raw data was compared to the ground truth electronically; scoring was performed via electronic means and the data stored securely. Once this was completed, the groups were decoded and the results compared.