Study design and setting
This study is a validation study of a diagnostic method based on retrospectively collected radiographic examinations. These examinations were analyzed by a neural network for both the presence and severity of knee OA using the Kellgren & Lawrence (KL) OA classification system (10).
Data selection
We extracted all plain radiographic series taken between 2002 and 2016 at Danderyd University Hospital from the hospital's Picture Archiving and Communication System (PACS). Images along with corresponding radiologist reports were anonymized. For the purpose of this study, a random subset of knee image series was selected. In the test set, the selection was biased towards OA and certain KL grade subtypes to reduce the risk of non-OA cases dominating the evaluation data.
Inclusion & exclusion criteria
Projections included were not standardized. Trauma protocols as well as non-trauma protocols were included. Diaphyseal femur and tibia/fibula protocols were included as these display the knee joint, although not in the center of the image. To ensure that the same case was not used twice, we excluded any examination within 90 days of the previous examination (see the sketch below). We excluded 0.6% of cases due to poor image quality, as this could preclude classification.
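For illustration, a minimal sketch of the 90-day exclusion rule, assuming a hypothetical table of examinations with patient IDs and dates (pandas; the column names are assumptions, as the paper does not describe the implementation):

```python
import pandas as pd

# Hypothetical examination log; column names are assumptions for illustration.
exams = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "exam_date": pd.to_datetime(
        ["2002-01-10", "2002-02-15", "2003-05-01", "2004-03-03", "2004-09-01"]),
})

# Sort per patient and compute days since that patient's previous exam.
exams = exams.sort_values(["patient_id", "exam_date"])
days_since_prev = exams.groupby("patient_id")["exam_date"].diff().dt.days

# Keep the first exam per patient (NaN diff) and exams > 90 days after the previous one.
kept = exams[days_since_prev.isna() | (days_since_prev > 90)]
print(kept)
```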
Method of classification
In this method of machine learning, the neural network identifies patterns in images. The network is fed both the input (the radiographic images) and the expected output label (the OA grade classification) in order to establish a connection between the features of the different stages of knee OA (e.g. possible osteophytes, joint space narrowing, sclerosis of subchondral bone) and the corresponding category (11).
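As a minimal sketch of this supervised setup (assumed PyTorch code for illustration; the model here is a trivial stand-in, not the paper's network), the network learns by comparing its predicted KL grade to the labeled grade and adjusting its weights to reduce the discrepancy:

```python
import torch
import torch.nn as nn

# Hypothetical model and data; shapes and names are assumptions for illustration.
model = nn.Sequential(nn.Flatten(), nn.Linear(256 * 256, 5))  # 5 KL grades (0-4)
images = torch.randn(8, 1, 256, 256)   # a batch of radiographs
labels = torch.randint(0, 5, (8,))     # the expected KL grade labels

criterion = nn.CrossEntropyLoss()      # penalizes wrong grade predictions
optimizer = torch.optim.SGD(model.parameters(), lr=0.025)

# One training step: predict, compare with the labels, update the weights.
logits = model(images)
loss = criterion(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```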
Prior to being fed to the network, the images along with the radiologists' reports were uploaded to a custom-built platform to be labelled according to the KL grading scale by members of the research team (SO, AL, MG, EA). The KL grading scale was chosen as it is widely used for knee OA classification (12). The research team also evaluated any potential OA features in the patellofemoral joint when lateral radiographs were present. The lack of recognition of patellofemoral OA as a distinct or contributory factor has been a criticism of the KL grading scale (12). We also created custom output categories, such as medial/lateral OA, to examine how well the network could discern these qualities on its own.
Data sets
We split the data into three sets: a training, a validation and a test set. A patient seeking and receiving an x-ray of the knee joint on multiple occasions separated by more than 90 days could be included multiple times in the same set, but there was no patient overlap between the training, validation and test sets (see the sketch below).
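A minimal sketch of such a patient-level split, using scikit-learn's GroupShuffleSplit (an assumed tool for illustration; the paper does not state how the split was implemented):

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical exam list: each exam carries the ID of the patient it belongs to,
# so all of a patient's exams land in the same split.
exam_ids = list(range(10))
patient_id = [1, 1, 2, 3, 3, 4, 5, 5, 6, 7]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(exam_ids, groups=patient_id))
print(sorted(train_idx), sorted(test_idx))
```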
The test set consisted of 300 cases classified by two senior orthopedic surgeons working independently. Any disagreement was resolved by joint reevaluation of the cases in question until the two surgeons reached a consensus. The test set then served as the ground truth against which the final network was tested.
During training, two sets of images were used: the training set, which the network learned from, and a validation set for evaluating performance and tuning network parameters. The validation set was prepared in the same way as the test set, but by SO and AL, two 4th-year medical students. The training set was labeled only once, by either SO or AL. If images were of poor quality or difficult to label, the students marked them for revisit and they were validated by MG. Initially, images were randomly selected for classification and fed to the network, i.e. passive learning. As learning progressed, cases were selected based on the network's output: 1) first, cases with a high probability of belonging to a class were selected to populate each category, and then 2) cases where the network was most uncertain were used to define the class borders, i.e. active learning (13), as sketched below.
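A minimal sketch of the second, active-learning stage, assuming uncertainty is measured as the entropy of the network's softmax output (an assumption; the paper does not specify the exact uncertainty measure):

```python
import torch

# Hypothetical unlabeled pool: softmax probabilities from the network
# over 5 KL grades for 1000 unlabeled exams.
probs = torch.softmax(torch.randn(1000, 5), dim=1)

# Entropy as an uncertainty score: highest where the network is least sure.
entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)

# Select the 50 most uncertain cases for manual KL labeling.
uncertain_idx = entropy.topk(50).indices
```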
Neural network setup
The network used was a ResNet-type convolutional neural network consisting of a 35-layer architecture with batch normalization for each convolutional layer and an adaptive max pool (see Table 1 for the structure). We randomly initialized the network and trained it using stochastic gradient descent. During training we alternated between the knee labels and previously gathered fracture classification tasks (16,785 exams from other classification tasks (14)), where each task shared the core network.
We initially trained without noise for 100 epochs with an initial learning rate of 0.025. After the initial training we reset the learning rate to 0.01 and trained for another 50 epochs with a combination of white noise (5%) and random erasing of 3 blocks of 10x10 pixels per image (sketched below).
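A minimal sketch of this noise augmentation, assuming "white noise (5%)" means additive Gaussian noise at 5% of the intensity range and that the three 10x10 erased blocks are placed uniformly at random (both are assumptions; the paper does not give these details):

```python
import torch

def augment(image: torch.Tensor) -> torch.Tensor:
    """Add 5% white noise and erase three random 10x10 blocks.

    `image` is assumed to be a (1, 256, 256) tensor with values in [0, 1].
    """
    out = image + 0.05 * torch.randn_like(image)   # additive white noise
    _, h, w = out.shape
    for _ in range(3):                             # random erasing, 3 blocks
        y = torch.randint(0, h - 10, (1,)).item()
        x = torch.randint(0, w - 10, (1,)).item()
        out[:, y:y + 10, x:x + 10] = 0.0
    return out
```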
Table 1. Network structure.

| Type | Blocks | Kernel size | Filters | Group |
|---|---|---|---|---|
| ResNet block | 1 | 3x3 | 64 | Image |
| ResNet block | 1 | 3x3 | 64 | Image |
| ResNet block | 6 | 3x3 | 64 | Core |
| ResNet block | 4 | 3x3 | 128 | Core |
| ResNet block | 2 | 3x3 | 256 | Core |
| ResNet block | 2 | 3x3 | 512 | Core |
| Image max | 1 | - | - | Pool |
| Convolutional | 1 | 1x1 | 72 | Classification |
| Fully connected | 1 | 1x1 | 4 | Classification |
| Fully connected | 1 | 1x1 | 4 | Classification |
The images were additionally augmented with 2 jitters and processed separately up until a max pool merged the features at the per-image or per-exam level, depending on the type of outcome. In addition to the AO classification outputs, which were pooled at the per-exam level, we had image view (i.e. AP, lateral, oblique) as an output.
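A minimal sketch of the kind of building block listed in Table 1, a ResNet-style block with batch normalization per convolutional layer, and of max-pooling per-image features to the exam level (details such as channel counts are assumptions; the paper's exact implementation may differ):

```python
import torch
import torch.nn as nn

class ResNetBlock(nn.Module):
    """Two 3x3 convolutions with batch norm and a residual (skip) connection."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)  # residual connection

# Per-exam pooling: merge per-image feature vectors by taking the
# element-wise maximum over the images in one exam.
per_image_features = torch.randn(4, 512)  # hypothetical: 4 images in one exam
per_exam_features = per_image_features.max(dim=0).values
```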
Input images
The network was presented with all available radiographs in each series. Each DICOM-format radiograph was automatically cropped to the active image area, i.e. any black border was removed, and the image was downscaled to a maximum side length of 256 pixels. The aspect ratio was retained by padding the rectangular image to a square format of 256 x 256 pixels. During training we jittered the images by cropping and rotating.
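A minimal sketch of this preprocessing, assuming 8-bit grayscale pixel data already extracted from the DICOM file (reading DICOM itself is omitted, and treating any non-zero pixel as signal is an assumption):

```python
import numpy as np
from PIL import Image

def preprocess(pixels: np.ndarray) -> Image.Image:
    """Crop away black borders, downscale to <= 256 px, pad to 256 x 256."""
    # Crop to the active image area: keep rows/columns containing any signal.
    rows = np.any(pixels > 0, axis=1)
    cols = np.any(pixels > 0, axis=0)
    cropped = pixels[rows][:, cols]

    # Downscale so the longest side is at most 256 pixels, keeping aspect ratio.
    img = Image.fromarray(cropped)
    img.thumbnail((256, 256))

    # Pad the rectangular image onto a black 256 x 256 square.
    canvas = Image.new("L", (256, 256), 0)
    canvas.paste(img, ((256 - img.width) // 2, (256 - img.height) // 2))
    return canvas
```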
Outcome measures/ statistical analysis
Network performance was measured using the area under the curve (AUC) as the primary outcome measure, with sensitivity, specificity and Youden's J as secondary outcome measures. The proportion of correctly graded OA severity was estimated using the AUC - the area under a receiver operating characteristic (ROC) curve - which is a plot of the true positive rate against the false positive rate. An AUC value of 1 signifies a prediction that is always correct, and a value of 0.5 is no better than random chance at predicting an outcome. There is no exact guide for how to interpret AUC values, but in general an AUC of 0.7-0.8 is considered acceptable, 0.8-0.9 good or very good, and ≥ 0.9 outstanding (15, 16). The Youden index (J) is also used in conjunction with the ROC curve; it is a summary of sensitivity and specificity. It has a range of 0 to 1 and is defined as (17):

J = sensitivity + specificity − 1
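As an illustration, a minimal sketch of computing AUC and Youden's J from predicted probabilities, using scikit-learn (an assumed tool for illustration; the paper performed its statistics in R):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical example data: true binary labels (OA / no OA) and the
# network's predicted probabilities for the positive class.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2])

# AUC: area under the ROC curve (true positive rate vs. false positive rate).
auc = roc_auc_score(y_true, y_prob)

# Youden's J at each threshold: J = sensitivity + specificity - 1 = TPR - FPR.
fpr, tpr, thresholds = roc_curve(y_true, y_prob)
j_values = tpr - fpr
best = np.argmax(j_values)
print(f"AUC = {auc:.2f}, max J = {j_values[best]:.2f} "
      f"at threshold {thresholds[best]:.2f}")
```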
We also present confusion matrices to allow for clear visualization of the algorithm's performance when the true values are known. The network was implemented and trained using PyTorch (v. 1.4). Statistical analysis was performed using R (v. 4.0.0).