Definitions of AIS severity and curve type. The severity of AIS was defined by the CA measured in degrees on coronal radiographs, following the clinical gold standard, which is also the primary consideration in treatment planning29. To measure the CA, the end vertebrae (the vertebrae most tilted from the horizontal) need to be identified at the upper and lower ends of the curve, and the CA is the angle formed by lines drawn along the superior endplate of the upper end vertebra and the inferior endplate of the lower end vertebra (Fig. 1b). AIS was graded into three severity classes (normal/mild: CA ≤ 20°; moderate: 20° < CA ≤ 40°; severe: CA > 40°)30, with the exact CA in degrees and the end-vertebra and apical-vertebra labels annotated according to the corresponding X-ray images.
The curve types were subsequently determined from the GT labels by the location of the apical vertebra. The majority of the population has 12 thoracic vertebrae and 5 lumbar vertebrae. For patients with a single curve (Fig. 1b&c), if the apex is located between the 1st and the 11th thoracic vertebrae, the curve is classified as a thoracic curve (T), whereas if the apex is located between the 12th thoracic and the 5th lumbar vertebrae, it is classified as a thoracolumbar/lumbar curve (TL/L)31. Patients with more than one curve are classified as having a mixed curve (Fig. 1b&c).
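For illustration, this typing rule can be written as a small function; the vertebral-level encoding and the function name below are assumptions for the sketch, not the authors' code:

```python
# Minimal sketch of the curve-typing rule described above. Vertebral
# levels are encoded as strings ("T1"-"T12", "L1"-"L5"); this encoding
# is illustrative only.

def curve_type(apex_levels):
    """Classify curve type from the apical vertebra level(s) of each curve."""
    if len(apex_levels) > 1:
        return "mixed"                               # more than one curve
    apex = apex_levels[0]
    thoracic = [f"T{i}" for i in range(1, 12)]       # T1-T11
    return "T" if apex in thoracic else "TL/L"       # otherwise apex at T12-L5

print(curve_type(["T8"]))          # -> "T"
print(curve_type(["L2"]))          # -> "TL/L"
print(curve_type(["T9", "L3"]))    # -> "mixed"
```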
Dataset collection and preparation. The collection, use, analysis and prospective testing of the X-ray images with paired back images taken by smartphones were approved by the local ethics committee (UW19-620). Both males and females with AIS were recruited. Written consent was obtained from the guardians of the participants prior to data collection. Participants were excluded if they were not within the age range of 10–18 years; were diagnosed with or showed any signs of psychological disorders that might affect compliance with the study; had any pre-diagnosed systemic neurological disorders that might affect mobility (e.g. prior cerebrovascular accident, Parkinson's disease, myopathy); had musculoskeletal diseases including congenital spinal deformities, McCune-Albright syndrome, early-onset scoliosis, previous spine operations or instrumentation, or trauma that might impair posture and mobility; and/or had severe skin disorders and/or lesions on the back.
All back images were taken voluntarily by AIS patients using smartphones (iPhone X, iPhone 12, Redmi Note 8 Pro). The corresponding radiographs, taken in the clinic as routine practice with no extra experimental radiographs acquired, were anonymized and retrieved from the hospital picture archiving and communication system in PNG format. Using the gold-standard clinical classifications, the ground truth (GT) labels of cohorts 1 and 2 were provided by the same spine surgeon (with over 20 years' experience in AIS management), who annotated the coronal radiographs manually.
The severities and curve types of cohort 2 were also blindly assessed by two spine specialists through visual assessment of the participants' backs. Severity classes and curve types were recorded independently for each assessor.
Data preprocessing. The back images of cohort 1 were used to develop ScolioNet for classifying scoliosis severity and curve type. Specifically, we randomly divided cohort 1 at an 8:2 ratio into a training set and an in-house validation set. The in-house validation set was used to evaluate and select among different deep learning models. An independent testing cohort (cohort 2) was recruited to prospectively test the performance of ScolioNet. Back segmentation was first performed on the images taken by smartphones. We empirically chose to segment the back together with the arms, which improved classification performance. Data augmentation methods32–34 including random rotation (−10° to 10°), affine transformation, cropping and padding, Gaussian blur, sharpening, and contrast and brightness adjustment were introduced. Each method was applied with a 50% probability, and the methods were combined for each image in cohort 1.
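A minimal sketch of such a pipeline, written with torchvision (the library choice and the transform magnitudes are illustrative assumptions, not the study's exact values):

```python
import torchvision.transforms as T

# Each augmentation fires with 50% probability, mirroring the description
# above; kernel sizes and jitter magnitudes are assumed for illustration.
augment = T.Compose([
    T.Resize((224, 224)),                                          # fixed input size
    T.RandomApply([T.RandomRotation(degrees=10)], p=0.5),          # rotate -10°~10°
    T.RandomApply([T.RandomAffine(degrees=0,
                                  translate=(0.05, 0.05),
                                  scale=(0.9, 1.1))], p=0.5),      # affine transform
    T.RandomApply([T.RandomCrop(224, padding=8)], p=0.5),          # crop and pad
    T.RandomApply([T.GaussianBlur(kernel_size=5)], p=0.5),         # gaussian blur
    T.RandomAdjustSharpness(sharpness_factor=2.0, p=0.5),          # sharpen
    T.RandomApply([T.ColorJitter(brightness=0.2, contrast=0.2)],
                  p=0.5),                                          # contrast/brightness
])
```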
Performance evaluation. To evaluate the performance of ScolioNet, true positive (TP), true negative (TN), false positive (FP) and false negative (FN) values were calculated35,36. Evaluation metrics included sensitivity (Sn), specificity (Sp), positive predictive value (PPV), negative predictive value (NPV) and accuracy (ACC).
$$Sn = \frac{TP}{TP+FN}$$
$$Sp = \frac{TN}{TN+FP}$$
$$PPV = \frac{TP}{TP+FP}$$
$$NPV = \frac{TN}{TN+FN}$$
$$ACC = \frac{TP+TN}{TP+TN+FP+FN}$$
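In the three-class setting, these values are obtained per class in a one-vs-rest fashion. A minimal sketch using scikit-learn's confusion matrix (the function and variable names are illustrative, not the authors' code):

```python
from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred, n_classes=3):
    """One-vs-rest Sn, Sp, PPV, NPV and ACC per class, as defined above."""
    cm = confusion_matrix(y_true, y_pred, labels=range(n_classes))
    metrics = {}
    for k in range(n_classes):
        tp = cm[k, k]
        fn = cm[k, :].sum() - tp      # class-k samples predicted otherwise
        fp = cm[:, k].sum() - tp      # other samples predicted as class k
        tn = cm.sum() - tp - fn - fp
        metrics[k] = {
            "Sn":  tp / (tp + fn),
            "Sp":  tn / (tn + fp),
            "PPV": tp / (tp + fp),
            "NPV": tn / (tn + fn),
            "ACC": (tp + tn) / cm.sum(),
        }
    return metrics
```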
The ROC curve was plotted from the Sn scores against the (1 − Sp) scores at different thresholds, and the area under the curve (AUC) was computed as the area under the ROC curve, denoted as:
$$AUC = \int_{x=0}^{1} TPR\left(FPR^{-1}(x)\right)\, dx = P(X_1 > X_0)$$
where \({X}_{1}\) and \({X}_{0}\) are the scores for a positive and a negative instance, respectively; TPR represents the true positive rate and equals Sn; FPR represents the false positive rate and equals (1 − Sp).
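A minimal sketch of computing the per-class ROC curves and the macro-average AUC with scikit-learn, assuming integer labels `y_true` and per-class softmax scores `y_score` (illustrative names):

```python
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.preprocessing import label_binarize

# y_true: integer class labels (0, 1, 2); y_score: softmax probabilities
# of shape (n_samples, 3).
y_true_bin = label_binarize(y_true, classes=[0, 1, 2])

# One ROC curve per class, treating each class as positive in turn.
for k in range(3):
    fpr, tpr, thresholds = roc_curve(y_true_bin[:, k], y_score[:, k])

# Macro-average AUC over the three one-vs-rest problems.
macro_auc = roc_auc_score(y_true, y_score, multi_class="ovr", average="macro")
```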
The Multi-layer CNNs and Model Selection. Several deep convolutional network benchmarks were compared using the data from cohort 1 (Table 1). Our CNN framework consisted of 1 input layer, 84 hidden layers and 1 output layer. In the input layer, the input image was processed by convolution, batch normalization (BN) and pooling operations to reduce interference caused by differences in the value ranges of the input data across dimensions.
In the hidden layers, neurons in each layer were connected and propagated information containing internal feature encodings and computational outcomes. The convolutional layers were used for feature extraction and representation, and the commonly used rectified linear unit (ReLU) function was selected to activate the output of a neuron, defined as follows:
$$y = \max(0, x)$$
where x was the weighted sum of a neuron and y was the output of the activation function.
In the output layer, the outputs of all neurons from the hidden layers were fully connected to obtain the probability values required for classification. Since the final output of the classification task was essentially the probability of the input image belonging to each category, a softmax function was introduced at the end to map the output of the fully connected layer into the (0,1) range with a sum of 1, calculated as follows:
$${S}_{i}= \frac{{e}^{{V}_{i}}}{\sum _{j}{e}^{{V}_{j}}}$$
where \(V\) was the input of the softmax layer derived from the fully connected layer and \(S\) was the final output. In the models for back image-based severity and type prediction, \({S}_{i}\) was a value in the range 0–1 representing the probability of a back photo being classified as normal/mild, moderate or severe (severity task), or as T, TL/L or mixed (curve-type task).
For the model selection experiment, we trained five models, Resnet5021, Densenet16937, VGG1638, InceptionV439 and ResNeXt5040, the results of which are shown in Supplementary Fig. 2 and Table 1, where the models with Resnet50 and Densenet169 as backbones performed relatively better. Our initial selection of models was based on the AUC metric as a measure of overall performance, and Resnet50 and Densenet169 performed best on the macro-average AUC for the severity classification task (Supplementary Table 1). Although InceptionV4 was comparable to Resnet50 and Densenet169 on many metrics in both FU and CS statistics (Supplementary Table 1), we finally chose Resnet50 and Densenet169 as backbones in consideration of model size. To further improve performance, we added a multi-tasking strategy and attention.
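A minimal sketch of instantiating the two selected backbones with torchvision and adapting their final layers to a three-class task is shown below; the use of ImageNet-pretrained weights is an assumption, as the text does not state the initialization:

```python
import torch.nn as nn
import torchvision.models as models

# Replace each backbone's final classifier with a 3-class head.
resnet = models.resnet50(weights="IMAGENET1K_V1")       # pretrained weights assumed
resnet.fc = nn.Linear(resnet.fc.in_features, 3)

densenet = models.densenet169(weights="IMAGENET1K_V1")
densenet.classifier = nn.Linear(densenet.classifier.in_features, 3)
```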
The Attention Algorithms and Multi-tasking Strategy. The human visual system tends to focus on the part of an image that is useful for judgment and to ignore unimportant areas. The attention mechanism41,42 is a method that allows a network to mimic the human visual system and attend to a certain part of the image when performing a classification task. The multi-task strategy43–45 is designed to improve model performance by learning multiple tasks in parallel so that the results can interact with each other, either by sharing the feature-extraction weights of the network or by interacting only on some key layers, with classifiers finally used to classify each task.
Different attention algorithms were tested, including the SE block46 and the residual attention block22. Empirically, the residual attention block with an activation function for mixed channel and spatial attention performed best. The activation function is as follows:
$$f\left({x}_{i,c}\right)= \frac{1}{1+{e}^{{-x}_{i,c}}}$$
where x was the weighted sum of a neuron, i indexed the spatial position of the feature map and c indexed the channel of the feature map.
Previous studies22 reported that the residual attention block is an improvement on the residual net; thus we added the attention module to both Resnet50 and Densenet169 for further testing. Resnet50 and Densenet169 consist of 4 residual blocks and 4 dense blocks, respectively, so we added a total of 4 attention blocks, after the input layer and after each submodule. With attention introduced, we empirically found that spatial and channel mixed attention improved the performance of the classification task. The model with Resnet50 as backbone and the attention module performed better and was therefore selected as our final classification model, ScolioNet.
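The following is a simplified PyTorch sketch of a channel-and-spatial mixed attention block of this kind; the published residual attention block22 uses a deeper mask branch, so this is an illustrative reduction under assumed layer choices, not the authors' implementation:

```python
import torch.nn as nn

class MixedAttention(nn.Module):
    """Simplified mixed attention: a sigmoid mask M(x) in (0,1) is computed
    for every spatial position and channel and applied residually,
    out = (1 + M(x)) * x, as in residual attention networks."""
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),               # f(x_{i,c}) = 1 / (1 + exp(-x_{i,c}))
        )

    def forward(self, x):
        # Features are re-weighted but never fully suppressed (residual form).
        return (1 + self.mask(x)) * x
```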
The multi-task strategy was tested in addition to attention, to improve the generalization ability of the model. We did not change the design of the feature-extraction part of the network; we shared the feature-extraction weights between the severity and curve-type tasks and only used 3 parallel classifiers for the tasks at the end. Since our main task was severity classification, we set the total loss to the sum of the three-class classification losses, with the weight of the curve-type loss set to 0.5.
After the selection, ScolioNet consisted of 1 input layer, 84 hidden layers and 1 output layer, where the hidden layers consisted of 4 residual blocks and 3 attention modules. In the output layer, we used three parallel FC layers to implement the multi-task method for the severity classification, while for the curve-type classification we used only one FC layer.
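As an illustration of the shared-backbone multi-task design, the following is a minimal PyTorch sketch; the pooled feature dimension of 2048 (the Resnet50 default) is assumed, and the head layout is collapsed to one FC layer per task for brevity rather than reproducing the authors' exact head:

```python
import torch.nn as nn

class MultiTaskHead(nn.Module):
    """Sketch of a multi-task head: parallel FC classifiers over the same
    shared pooled feature vector, one per task."""
    def __init__(self, in_features=2048):
        super().__init__()
        self.severity_fc = nn.Linear(in_features, 3)  # normal/mild, moderate, severe
        self.type_fc = nn.Linear(in_features, 3)      # T, TL/L, mixed

    def forward(self, features):
        return self.severity_fc(features), self.type_fc(features)

criterion = nn.CrossEntropyLoss()

def multitask_loss(severity_logits, type_logits, severity_gt, type_gt):
    # Curve-type loss weighted by 0.5, as described above.
    return criterion(severity_logits, severity_gt) + 0.5 * criterion(type_logits, type_gt)
```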
Model development and parameter optimization. 1429 images were used for model training, and during training we resized each image to 3×224×224 pixels to make it compatible with the original input dimensions of the network architecture. We trained three-class classification models directly for the severity classification and binary classification models for the curve-type classification. A mixed curve type was assigned when both curve types were present in one image. ROC curves with each class as the positive class were generated to calculate the AUC.
During model development, our validation set consisted of 351 images and the independent test set of 378 images. We performed data augmentation on the in-house validation set so that it proportionally matched the training dataset. Parameter optimization was stopped when the AUC stopped improving. The validation set was used for model selection during training, while the independent test set was used to further assess the accuracy and robustness of the selected model. The in-house dataset was not used for training, but for in-house performance testing of the models before the independent testing study. The prediction of each class performed well, with AUCs of around 0.85. All proposed models performed similarly, and the ROC curve of ScolioNet showed the most balanced performance across the classification tasks (Supplementary Fig. 2).
For model training, we used a workstation configured with an Intel(R) Xeon(R) Gold 5218 2.30 GHz central processing unit, 308 GB of RAM and an NVIDIA GeForce GTX 300 core, running PyTorch (https://pytorch.org/). Models were trained by minimizing the cross-entropy loss between predictions and GT labels,
$$Loss = -\left[y \log\hat{y} + (1-y)\log(1-\hat{y})\right]$$
where \(y\) was the GT label and \(\hat{y}\) was the prediction.
During training, the SGD optimizer in PyTorch was adopted, and a decay factor \(\gamma\) was used to control the learning rate at each epoch, as described below:
$$lr_{new} = lr_{initial} \cdot \gamma^{\lfloor epoch / stepsize \rfloor}$$
where \({lr}_{initial}\) was the initial learning rate, \({lr}_{new}\) was the learning rate at the i-th epoch and step size was the learning-rate decay step. Adjustable parameters such as the initial learning rate, decay and batch size were adjusted simultaneously to improve model performance. We used an initial learning rate of 0.01, momentum of 0.9, weight decay of 0.001, \(\gamma\) of 0.1, step size of 20, batch size of 16 and 50 epochs.
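Under these settings, the optimizer and schedule can be written in PyTorch as in the sketch below; `model` and `loader` are assumed to exist, and a single-task loop is shown for brevity:

```python
import torch

# Training configuration as described above; hyperparameters match the text.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.001)
# StepLR implements lr_new = lr_initial * gamma ** (epoch // step_size).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(50):
    for images, labels in loader:          # batch size 16 assumed in the loader
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()                       # decay the learning rate each epoch
```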
ScolioNet model prospective testing and explicability. The independent testing dataset of 378 images was a prospective collection not seen during model development and in-house testing. No data augmentation or resampling was performed on this dataset, to ensure the validity of the prospective experiments.
With the continuous development of deep learning, the explicability of deep learning models has attracted increasing attention. To improve the interpretability of the model and mine its decision logic, the Class Activation Mapping (CAM) method47 was proposed, and several interpretable algorithms based on CAM48,49 followed. For the explicability of ScolioNet, we used the Score-CAM50 algorithm to explain the decisions of the model. Score-CAM makes use of all the feature maps output by the last convolutional layer to obtain interpretable information and indicates the decision-supporting regions in the original image in the form of an interpretable heatmap. Feature maps from the last convolutional layer contain high-dimensional features and spatial information, and are also the direct input for classification, so they play an important role in explaining the model's predictions. Score-CAM overlays each feature map on the original image as an interpretable mask and inputs the masked image into the model to obtain its probability for the predicted label. The difference between the probability of the masked image and that of the original image for the predicted label is used as the weight of the corresponding feature map. We normalized and up-sampled all the feature maps and then linearly superimposed them according to their weights to obtain an interpretable heatmap. Warm colors in the heatmap represent the importance of a region in supporting the model's classification decision.
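For illustration, a minimal Score-CAM sketch following the description above might look as follows; capturing `feature_maps` with a forward hook and the exact normalization details are assumptions of this sketch, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_cam(model, feature_maps, image, target_class):
    """Score-CAM sketch: each last-conv feature map, normalized and
    up-sampled, masks the input; the change in the predicted class's
    probability weights that map in the final heatmap.
    feature_maps: (1, C, h, w), assumed captured with a forward hook."""
    _, C, _, _ = feature_maps.shape
    base_prob = F.softmax(model(image), dim=1)[0, target_class]

    # Up-sample each map to the input size and normalize it to [0, 1].
    masks = F.interpolate(feature_maps, size=image.shape[-2:],
                          mode="bilinear", align_corners=False)
    lo = masks.amin(dim=(2, 3), keepdim=True)
    hi = masks.amax(dim=(2, 3), keepdim=True)
    masks = (masks - lo) / (hi - lo + 1e-8)

    weights = torch.zeros(C)
    for k in range(C):
        masked = image * masks[:, k:k + 1]          # mask the original image
        prob = F.softmax(model(masked), dim=1)[0, target_class]
        weights[k] = prob - base_prob               # confidence difference

    # Linearly superimpose the maps by their weights; warm regions support
    # the classification decision.
    heatmap = torch.relu((weights.view(1, C, 1, 1) * masks).sum(dim=1))
    return heatmap / (heatmap.max() + 1e-8)
```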
Statistical analysis. The Wilcoxon signed-rank test, performed with the stats.wilcoxon() function in SciPy, was used to assess the interrater agreement of the two spine surgeons and ScolioNet in classifying scoliosis severity and curve type from back appearance. The same procedure was used to compare the agreement between different deep learning models (Supplementary Table 2); p < 0.0001 was considered statistically significant.
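A minimal sketch of this test with SciPy, assuming paired per-participant class labels `rater_a` and `rater_b` (illustrative variable names):

```python
from scipy import stats

# Paired Wilcoxon signed-rank test on per-participant classifications.
statistic, p_value = stats.wilcoxon(rater_a, rater_b)
if p_value < 0.0001:            # significance threshold used in the text
    print("significant difference between raters")
```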