This study was approved by our institution's review board (Osaka University Hospital Ethics Review Committee, No. 20416), and written informed consent was waived because of the retrospective design. The study was performed in accordance with approved guidelines and in compliance with the principles of the Declaration of Helsinki.
Study Participants
Study participants were selected in two ways: first, we searched the list of patients who underwent cervical spine surgery in our spine clinic between May 2012 and December 2020, and second, we searched the medical information system for patients who had cervical x-rays obtained at our hospital between April 2019 and April 2021. From the two lists we included 1674 patients with a total of 4546 x-rays, counting each patient who underwent radiography more than once only once. To validate the capability of AI in real-world clinical practice, we did not exclude patients with deformities or spinal instrumentation; all patients from the two lists were included in our study. All x-rays were lateral views and included flexion, extension, and neutral positions. X-rays were downloaded in DICOM (Digital Imaging and Communications in Medicine) file format and converted to PNG (Portable Network Graphics) file format.
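A minimal sketch of this conversion step is shown below, assuming the pydicom and Pillow libraries (the study does not specify which tools were used); the file paths and the 8-bit normalization scheme are illustrative, not the study's exact pipeline.

```python
# Sketch of DICOM-to-PNG conversion (assumes pydicom and Pillow; paths and
# normalization are illustrative).
from pathlib import Path

import numpy as np
import pydicom
from PIL import Image


def dicom_to_png(dicom_path: Path, png_path: Path) -> None:
    """Read a lateral cervical x-ray in DICOM format and save it as an 8-bit PNG."""
    ds = pydicom.dcmread(str(dicom_path))
    pixels = ds.pixel_array.astype(np.float32)
    # Rescale raw pixel values to the 0-255 range expected by PNG.
    pixels -= pixels.min()
    if pixels.max() > 0:
        pixels /= pixels.max()
    Image.fromarray((pixels * 255).astype(np.uint8)).save(png_path)


for dicom_file in Path("xrays_dicom").glob("*.dcm"):
    dicom_to_png(dicom_file, Path("xrays_png") / (dicom_file.stem + ".png"))
```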
Method of Radiographic Measurement
We used the Cobb method to measure the C2–C7 angle because it is simple and the most commonly used.1,12 To draw a straight line along the C2 inferior endplate, we labeled its anterior and posterior endpoints as anatomic landmarks in a digital viewer, and we used the same method for the C7 vertebra (Figure 1). If the C7 vertebral body was obscured by the shoulder girdle and difficult to see, we used the C6 vertebral endplate as a reference for the C7 vertebral endplate. We used the publicly available image annotation software labelme (https://github.com/wkentaro/labelme) for this manual measurement process.
We labeled the C2 slope and the C7 slope, the angles that the C2 and C7 inferior endplates make with the horizontal, with clockwise being positive in both cases. The C2–C7 angle is obtained by subtracting the C2 slope from the C7 slope, with a negative angle indicating lordosis and a positive angle indicating kyphosis.
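A minimal sketch of this angle computation from the four landmark points is shown below; it assumes (x, y) image coordinates with the y-axis pointing downward, and the exact clockwise-positive sign convention depends on image orientation, so the code is illustrative rather than the study's implementation.

```python
# Sketch of the C2-C7 angle computation from the four endplate landmarks
# (image coordinates with y pointing downward are assumed; sign convention
# is illustrative).
import numpy as np


def endplate_slope(anterior: tuple[float, float], posterior: tuple[float, float]) -> float:
    """Angle (degrees) between the endplate line and the horizontal."""
    dx = posterior[0] - anterior[0]
    dy = posterior[1] - anterior[1]
    return float(np.degrees(np.arctan2(dy, dx)))


def c2_c7_angle(c2_ant, c2_post, c7_ant, c7_post) -> float:
    """C2-C7 angle = C7 slope minus C2 slope; negative = lordosis, positive = kyphosis."""
    return endplate_slope(c7_ant, c7_post) - endplate_slope(c2_ant, c2_post)
```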
Artificial Intelligence Model
The AI model detected four anatomic landmarks: the anterior and posterior endpoints of the C2 and C7 inferior endplates. This anatomic landmark localization was performed by using CNNs to produce a heat map for each landmark and then extracting the coordinates of the maximum value from each heat map13 (Figure 2).
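A minimal sketch of this coordinate-extraction step is shown below, assuming the network outputs one heat-map channel per landmark as a PyTorch tensor; the tensor shape and variable names are illustrative.

```python
# Sketch of extracting landmark coordinates from CNN heat maps
# (assumes a tensor of shape [4, H, W], one channel per landmark).
import torch


def heatmaps_to_landmarks(heatmaps: torch.Tensor) -> list[tuple[int, int]]:
    """Return the (row, col) position of the maximum value of each heat map."""
    landmarks = []
    for channel in heatmaps:                      # one channel per landmark
        flat_index = torch.argmax(channel)        # index into the flattened map
        row, col = divmod(int(flat_index), channel.shape[1])
        landmarks.append((row, col))
    return landmarks
```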
For the CNNs to output heat maps, we used the DeepLabV3 segmentation architecture,14 with the EfficientNet-B4 architecture15 as the backbone. DeepLabV3 is a segmentation architecture that uses atrous convolution to enlarge the field of view of the network, and EfficientNet-B4 is a classification model designed to balance model size and accuracy. The CNNs and angle measurements were implemented using Python version 3.9.5 (a programming language) and PyTorch version 1.8.1 (an open-source machine learning framework). Our model was built using Segmentation Models Pytorch (https://github.com/qubvel/segmentation_models.pytorch), a publicly available Python package, and the backbone (EfficientNet-B4) was pretrained on ImageNet. Training of the CNNs was performed using the Adam optimizer with an initial learning rate of 0.0001 and root-mean-square error as the loss function, and was continued until the loss on validation data extracted from the training data stopped decreasing (i.e., just before overfitting).
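A minimal sketch of this configuration is shown below, assuming the Segmentation Models Pytorch API named above; the mean-squared-error loss stands in for the root-mean-square criterion, and the channel counts and training-step details are illustrative rather than the study's exact settings.

```python
# Sketch of the DeepLabV3 + EfficientNet-B4 heat-map model (assumes
# segmentation_models_pytorch; hyperparameters other than the stated
# learning rate are illustrative).
import segmentation_models_pytorch as smp
import torch

model = smp.DeepLabV3(
    encoder_name="efficientnet-b4",   # EfficientNet-B4 backbone
    encoder_weights="imagenet",       # pretrained on ImageNet
    in_channels=1,                    # grayscale x-ray (assumed)
    classes=4,                        # one heat-map channel per landmark
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.MSELoss()        # stand-in for the RMS criterion


def training_step(images: torch.Tensor, target_heatmaps: torch.Tensor) -> float:
    """One optimization step; early stopping on validation loss is handled outside."""
    optimizer.zero_grad()
    loss = criterion(model(images), target_heatmaps)
    loss.backward()
    optimizer.step()
    return float(loss)
```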
The peak value of each landmark's heat map was used as that landmark's confidence score, and the smallest of the four values was used as the confidence score for that x-ray. We used these confidence scores in later analysis.
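A minimal sketch of this scoring rule, under the same assumed heat-map layout as above, is shown below; heat-map values are assumed to lie in [0, 1], consistent with the confidence score range described later.

```python
# Sketch of the per-x-ray confidence score: the smallest of the four
# heat-map peaks (heat-map values assumed to lie in [0, 1]).
import torch


def confidence_score(heatmaps: torch.Tensor) -> float:
    """heatmaps: tensor of shape [4, H, W], one channel per landmark."""
    per_landmark_peaks = heatmaps.amax(dim=(1, 2))   # peak of each heat map
    return float(per_landmark_peaks.min())           # weakest landmark defines the score
```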
Creation of Ground Truth Data and Validation of Accuracy
A spine surgeon with 18 years’ experience labeled the C2 and C7 endplates on all 4546 x-rays, and we used this as the ground truth. In machine learning, ground truth is labeled data that are considered to be the correct values. Ground truth data are divided into training data and test data. We examined measurement accuracy using two techniques.
The first technique involved the error of the AI algorithm's measurements relative to the ground truth, calculated by 5-fold cross-validation. We randomly divided all ground truth data into five groups: four groups were training data and one group was test data. The algorithm learned from the training data of the four groups and measured the test data of the remaining group. We then calculated the absolute error between the algorithm's measurements and the ground truth measurements on the test data (Figure 3). This process was repeated five times, changing the training and test data groups so that all data were tested. Finally, the average of the absolute errors obtained from the five folds represents the accuracy of the algorithm's measurements. We divided the data into the five groups by patient, not by x-ray, so the CNNs were never trained on x-rays of a patient whose other x-rays (in different positions) were in the test data; a sketch of this split follows below. Validation was performed on a workstation with two NVIDIA GeForce RTX 3090 graphics-processing units, with the CNNs and angle measurements implemented and trained as described above.
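The patient-level split described above can be sketched as follows, assuming scikit-learn's GroupKFold (the study does not specify how the grouping was implemented); the patient IDs and x-ray indices below are toy placeholders.

```python
# Sketch of a patient-level 5-fold split: grouping by patient ID keeps all
# x-rays of one patient in the same fold, so training and test folds never
# share a patient (toy data, scikit-learn assumed).
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy example: 10 x-rays belonging to 5 patients (two positions each).
xray_ids = np.arange(10)
patient_ids = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

for fold, (train_idx, test_idx) in enumerate(
    GroupKFold(n_splits=5).split(xray_ids, groups=patient_ids)
):
    # Train on the x-rays in train_idx, measure those in test_idx, and
    # accumulate absolute errors against the ground truth.
    assert set(patient_ids[train_idx]).isdisjoint(patient_ids[test_idx])
    print(f"fold {fold}: train patients {set(patient_ids[train_idx])}, "
          f"test patients {set(patient_ids[test_idx])}")
```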
The second technique involved comparing the accuracy of the algorithm's measurements with that of surgeons. Surgeon 1, with 11 years' experience, and Surgeon 2, with 7 years' experience, were both spine surgeons. From the 1674 patients, we randomly selected 168 patients (57 men and 111 women) with a total of 416 x-rays, and each surgeon measured these according to the Cobb method described in the section "Method of Radiographic Measurement." The surgeon who created the ground truth also measured the same x-rays again more than 1 month later, and these measurements were recorded as Surgeon 3. The CNNs were trained on the remaining 1506 patients (4130 x-rays), excluding the 168 test patients, and then measured the 168 test patients (416 x-rays). We compared the error of the AI algorithm with the errors of Surgeons 1, 2, and 3.
Repeatability and Measurement Time
We compared the repeatability of measurements and the time needed to obtain them between the AI algorithm and the surgeons. The intraclass correlation coefficient between the two measurement sessions of the ground truth surgeon (Surgeon 3) was used as surgeon repeatability. Surgeon 3 recorded the time needed to measure 10 x-rays, and the average time per x-ray was calculated; for the AI algorithm, we recorded the time to measure all 4546 x-rays and calculated the average time per x-ray.
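How the intraclass correlation coefficient for the two measurement sessions could be computed is sketched below, assuming the pingouin package purely for illustration (the study's statistics were run in SPSS); the angle values are toy placeholders.

```python
# Sketch of computing the intraclass correlation coefficient between the two
# measurement sessions of Surgeon 3 (pingouin assumed; toy values).
import pandas as pd
import pingouin as pg

measurements = pd.DataFrame({
    "xray":    ["a", "b", "c", "a", "b", "c"],
    "session": ["first", "first", "first", "second", "second", "second"],
    "angle":   [-12.3, 4.1, -20.5, -11.8, 3.6, -21.0],   # C2-C7 angles in degrees
})
icc = pg.intraclass_corr(data=measurements, targets="xray",
                         raters="session", ratings="angle")
print(icc[["Type", "ICC"]])
```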
Setting the Confidence Score
We set a confidence score to quantify the level of confidence in the measurements of the AI algorithm. The confidence score is expressed as a value between 0 and 1, where 0 indicates no confidence and 1 indicates full confidence. Excluding x-rays with a low confidence score was expected to reduce the absolute error. By varying the confidence score used as the threshold, we examined the relationship between the number of excluded x-rays and the absolute error.
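A minimal sketch of this threshold sweep is shown below, assuming per-x-ray confidence scores and absolute errors are available as NumPy arrays; the threshold grid and values are illustrative.

```python
# Sketch of the threshold analysis: for each candidate confidence threshold,
# count the excluded x-rays and recompute the mean absolute error on the rest
# (toy values).
import numpy as np

confidence_scores = np.array([0.95, 0.40, 0.88, 0.10, 0.75])
absolute_errors = np.array([1.2, 6.8, 0.9, 14.3, 2.1])          # degrees

for threshold in np.arange(0.0, 1.0, 0.1):
    keep = confidence_scores >= threshold
    if keep.any():
        print(f"threshold {threshold:.1f}: excluded {np.sum(~keep)}, "
              f"mean absolute error {absolute_errors[keep].mean():.2f} deg")
```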
Relationship Between the Absolute Error of Artificial Intelligence and Background Data on Participants
We performed a multivariate analysis with absolute error as the objective variable and with age, sex, whether the patient had undergone surgery, and cervical spine position (flexion, neutral, and extension) as explanatory variables. The absolute errors were compared between the group of patients who had undergone surgery and the group of those who had not.
Statistical Analysis
We used the t test to compare the absolute errors of the surgeons with those of the AI system and to compare the errors of patients who underwent surgery with those who did not. Stepwise multiple regression analysis was performed with the absolute error of the C2–C7 angle as the dependent variable and the patients' demographic data as the independent variables. P values <0.05 (two-sided) were considered statistically significant. Statistical analysis was performed using SPSS Statistics software (version 20; IBM, Armonk, NY, USA).
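The comparisons described above could be reproduced outside SPSS roughly as follows; this is a minimal sketch using scipy and statsmodels, the stepwise variable selection is omitted, and the data-frame columns are toy placeholders rather than the study's data.

```python
# Sketch of analyses equivalent to those described above (SPSS was used in
# the study; scipy/statsmodels stand in here, with toy data).
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats

df = pd.DataFrame({
    "abs_error": [1.2, 3.4, 0.8, 5.1, 2.2, 1.9],   # degrees
    "age":       [55, 63, 41, 72, 58, 66],
    "sex":       [0, 1, 0, 1, 1, 0],               # 0 = male, 1 = female
    "surgery":   [0, 1, 0, 1, 0, 1],               # prior cervical spine surgery
})

# t test: absolute error in operated vs non-operated patients.
t, p = stats.ttest_ind(df.loc[df.surgery == 1, "abs_error"],
                       df.loc[df.surgery == 0, "abs_error"])

# Multiple regression with absolute error as the dependent variable.
model = smf.ols("abs_error ~ age + sex + surgery", data=df).fit()
print(p, model.summary())
```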