A Deep Learning Method for Identifying Predictors of Knee Osteoarthritis Radiographic Progression From Baseline MRI

-- Background --The identication of patients with knee osteoarthritis (OA) likely to progress rapidly in terms of structure is critical to facilitate the development of disease-modifying drugs. -- Methods --Using data from the Osteoarthritis Initiative database (OAI), we implemented a Deep Learning method to predict, from baseline magnetic resonance images, further cartilage degradation, the latter being measured by Joint Space Narrowing at 12 months. -- Results --Using COR IW TSE images, our classication model achieved a ROC AUC score of 65% to be compared with a ROC AUC score of 58.7% obtained by trained radiologists. Additional analyses conducted in parallel to predict pain grade evaluated by the WOMAC pain index achieved a ROC AUC score of 72%. Attention maps provided evidence for distinct specic areas as being relevant in those two predictive models, including the internal femoro-tibial compartment for JSN progression and the intra-articular space for pain prediction. -- Conclusions --This feasibility study demonstrates the interest of deep learning applied to OA, with a potential to support even trained radiologists in the challenging task of identifying patients with a high-risk of disease progression.


Introduction
Osteoarthritis (OA) is a common disease which constitutes the fourth leading cause of disability worldwide [1]. According to the US National Health Interview Survey, up to 14 million American people are considered to have a symptomatic knee [2], with additional tens of millions affected as well in Europe, South America, Asia, or Middle East [3]. As a consequence of ensuing healthcare expenditures and losses of activity, the economic burden associated with OA is estimated to represent up to 2.5% of Growth National Product in Western countries [4].
The standard of care for OA based on both non-pharmacological and symptomatic pharmacological treatments has only a limited effect on function and pain. Thus, a very high unmet medical need still persists for a disease-modifying osteoarthritis drug (DMOAD) counteracting disease progression for both function and pain and avoiding the requirement for knee surgical replacement. As of today, the development of such DMOADs has been unsuccessful for two reasons. First of all, signi cant differences are observed among patients in terms of progression of cartilage degradation. Secondly, in the absence of any established patient strati cation in the form of endotypes re ecting well-characterized pathophysiological mechanisms, the slow and heterogeneous evolution of the disease makes it di cult to evaluate the effectiveness of a treatment in a broad patient population, within the 1 or 2 year(s) usual timeframe of a clinical study [5].
In this context, a personalized medicine approach is being considered to treat OA, consisting in identifying the most appropriate target populations predicted to bene t from DMOADs [6]. Primary e cacy endpoints required to document DMOAD e cacy include both clinical variables such as requirement for joint replacement as well as structural changes. The diagnosis of knee OA and the evaluation of its severity are currently based on imaging, with radiography remaining the most commonly used modality in clinical practice [7]. Speci cally, knee X-rays are used to determine the JSW (Joint Space Width) as a measurement of the distance between tibia and femur considered as an indicator of cartilage thickness. X-rays of the knee performed for an individual patient at various time points allow to de ne the JSN (Joint Space Narrowing) as a change in JSW over time [8]. Current regulatory guidelines for clinical trials aiming at evaluating candidate DMOADs recommend that JSN should be used as the primary endpoint in those trials [9].
One limitation, however, is that a reliable evaluation of JSN during patient follow-up remains di cult [10].
A clustering method on OAI data during a 8 year follow-up concluded that only 29% of patients displayed a radiographic progression (as de ned by JSN), with no further association between progression and pain worsening [11]. In this context, the use of MRI emerges as a better quantitative endpoint recommended for assessing morphological changes in knee cartilage during OA [12]. MRI allows the assessment of meniscal lesions such as root meniscal tears and extrusions known to be associated with OA progression [13,14]. It also detects other lesions predictive of pain, such as the presence of synovitis and synovial uid effusion [15] or bone marrow lesions [16].
We thus undertook the present feasibility study in support of the development of candidate DMOADs with the assumption that the latter should preferably be evaluated in patients likely to progress rapidly. To assess whether knee MR images collected at baseline could predict further cartilage degradation, we implemented a deep learning method using baseline MR images to build up a predictive model for future progression of knee OA, the latter being measured by JSN at 12 months. Additional analyses were conducted in parallel to predict pain grade evaluated by the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC).

Description of the OAI database
The OsteoArthritis Initiative database is a public multi-center longitudinal database assembled by a consortium led by the National Institutes of Health in the US to help better understand and prevent the progression of knee OA [17]. At baseline, a total of 4796 patients had a bilateral standing knee radiograph (X-Ray) and 3D knee MRI. Follow-up visits were done at 12, 24, 36, 48, 72 and 96 months (with 65% of patients enrolled at baseline having a 96-month follow-up visit). The knee MRI sequences include sagittal 3D DESS, coronal 2D IW TSE and sagittal 2D IW TSE fat-suppressed. A detailed description of MRI sequences from the OAI database can be found in Peterfy et al [18]. The database further contains clinical informations (age, sex, Body Mass Index [BMI]...), including as well results of pain assessment from WOMAC, a self-administered questionnaire encompassing for each visit up to 24 items divided into 3 subscales (i.e. pain, stiffness and physical function). Assessments from X-Rays such as Kellgren & Lawrence (KL) grade [19] and JSW were performed as well in the cohort at several locations in the medial and lateral compartments.
2.2 Endpoints used in the study 2.2.1 Joint Space Narrowing OA progression, de ned as cartilage degradation over time, was measured by using X-Ray images as the minimum JSW in the medial compartment of the knee. This semi-automated measurement was obtained at several time points (baseline, 12 months, 24 months). As proposed by Bruyere et al. [20], a 12 month-OA progressor was de ned as a patient's knee exhibiting a JSN at 12 months lower than − 0.5mm: JSN(12 months) = JSW(12 months) -JSW(baseline) ≤ -0.5mm. The threshold of -0.5mm for minimum radiographic JSN was identi ed as clinically relevant in several studies, see Reginster et al [21] for a review. Since the JSN criteria was evaluated separately for each knee, a patient could be a 12 month-OA progressor for a single knee or both. Using this JSW variation as a threshold, we proceeded to identify from knee MRIs those patients predicted to lose at least 0.5mm of knee cartilage.

WOMAC pain score
A secondary objective was to study the prediction of pain encoded by the WOMAC score, using contemporary MR images and clinical data (see description of clinical variables in Table 1 in additional materials). Hence, this objective was not to build a model predictive of future evolutions of the disease, but rather to explain the current state of the disease, still exploiting a combination of imaging and clinical information.

Evaluation
The performance of models described below was evaluated using a ve-fold cross-validation scheme and measured with the following metrics: area under the Receiver Operating Characteristic (ROC AUC score), area under the Precision-Recall (PR) curve and F1 score. These metrics are well suited to binary classi cation tasks which suffer from class imbalance.

Preprocessing of MRI data
Prior to feeding images into the model, several preprocessing steps have been applied sequentially in order to normalize the dataset, as illustrated in Fig. 1 and summarized below.

Image conversion
The OAI database exposes images in the DICOM format. In order to ease image reading and writing operations, each acquisition was converted to the NIFTI format (representing a full, three-dimensional image) by using the dcm2niix software [22].

Image orientation
The OAI database contains images of both knees for each patient. Speci cally, left knee images are RAS (Right, Anterior, Superior)-oriented, while right knee images are LAS (Left, Anterior, Superior)-oriented. In order to homogenize the dataset, orientations were normalized for all images. To this aim, images of right knees have been "mirrored" along the sagittal-axial plane in order to look similar to images of left knees. This operation was performed using the NiPype python library [23].

Bias eld correction
MR images can suffer from local magnetic eld variations, resulting in artefacts in the reconstructed image. To solve this problem, the N4 bias eld correction method [24] was applied to reconstructed images.
The raw MR image is rst re-oriented so that both left and right knees are similarly oriented. Noteworthy, only the left knee image is ipped, whereas the right is maintained as is, in order to obtain uniform orientations across the dataset. The N4 bias eld correction is then applied, together with a color normalization step.

Feature extraction
For both architectures used for 2D and 3D MR images, meaningful features were rst extracted by using models trained on the ImageNet dataset [25], resulting in high performances on the associated challenge. This approach, most suitable to address problems of very high dimensionality, further allows to speed up model training by delegating the computationally cumbersome task of building meaningful representations from images, before feeding those representations into a classi cation neural network.
When applied to MR images, such representations are extracted from each individual slice of the image, resulting in up to 150 (respectively 30) 1280-dimensional feature vectors for SAG 3D DESS (respectively SAG/CORIW TSE images). An overview of the feature extraction process is presented in Fig. 2.
Following-up preprocessing, each slice of the input volume is passed to a pretrained model which will compute 1280 numerical descriptors (features), resulting in depth x 1280 descriptors (where depth corresponds to the number of input slices).

Architectures for 2D MRI sequences
Throughout this manuscript, 2D MRI sequences refer to both COR IW TSE and SAG IW TSE images listed in the OAI database. Those images usually contain less slices than pure 3D sequences such as SAG 3D DESS. Taking this into account, the architecture of the deep learning model used with 2D MRI sequences slightly differs from the ones developed for 3D sequences, as detailed thereafter.

Attention sub-model
In light of previous studies related to multiple instance learning [26], we rst implemented a simple architecture whose aim was to compute attention scores (which can be viewed as "importance" scores) for each slice of the input image. Such scores were further used in the second part of the model. Starting from a set of one-dimensional feature vectors for each slice, a 1-dimensional convolution was applied (hence leading to one 2-dimensional matrix per image), followed by a Gated Recurrent Unit (GRU) layer. Such an architecture reduces the 2D matrix to a 1D vector (with one scalar score per input slice). This score was then scaled in the [0,1] interval through the use of a softmax activation function, thus preventing this sub-model to give full importance to all slices.
Classi cation sub-model Following calculation of importance scores, a mean weighted by those scores was computed from all feature vectors. Consequently, the model learned to select slices carrying information through the attention sub-model, a global mechanism called "attention mechanism". Clinical variables were standardized before being concatenated to this vector, resulting in a 1290-long vector. A description of clinical variables can be found in the additional materials, Table 1. This multimodal 1-dimensional vector was then fed into a multi layer neural network, followed by a softmax activation, outputting nal class probabilities. In this approach, slices carrying little information (e.g. out-of-knee slices) were given low attention scores, hence participating little (or not at all) to the nal logits computed by the second submodel. A global overview of the model, from a group of feature vectors (one per slice of the image) to the nal prediction (e.g. prediction of progression as an example) is presented in Fig. 3.
The purple-shaded area is a rst sub-model aiming to locate regions of interest within input images. The green-shaded area represents the classi cation sub-model, which aggregates both image and clinical information into progression (or pain score) probabilities.

Human Benchmark
To further qualify the performance of our predictive model in the identi cation of 12 month-OA progressors, we undertook a comparative study with two expert radiologists, one senior and the other more junior, on the same task. The senior radiologist has specialized in musculoskeletal imaging for more than 20 years whereas the junior radiologist has 2 years of experience. We rst selected 300 knee MRI with both SAG 3D DESS, 2D COR IW TSE and baseline clinical variables (age, gender, BMI, height, weight and minimum JSW in the medial compartment). These 300 knee images were then used to create 150 pairs of knee MRI, each pair being composed of both a 12 month-OA progressor and a nonprogressor. To account for noise measurement on the minimum medial JSW, the 150 knee MRI of 12 month-OA progressors were chosen such that 10 knees were from "almost certain" 12 month-OA progressors with JSN(12 months) < -1.1mm, 130 satis ed − 1.1mm ≤ JSN(12 months) < -0.6mm and 10 were "doubtful" progressors with − 0.6mm ≤ JSN(12 months) < -0.5mm. These three classes of progressors re ect the distribution of 12 month JSN in the OAI population. In addition, the − 1.1mm threshold for "almost certain" 12 month-OA progressors was computed using the methodology from Parsons et al [27]. This threshold takes into account the standard deviation of JSW at baseline and 12 months and further ensures that, with a high probability (≥ 95%), the observed loss of knee cartilage is associated with a degenerative process rather than simply re ecting noise measurement. This methodology mimics the way ROC AUC is computed for a binary classi er [28] as it evaluates the ability of either the radiologists or the classi er to correctly rank two images (picked at random) knowing that one is a positive sample whereas the other one is a negative.

Prediction of progression at 12 months
The model aims to identify knees for which JSN(12 months) ≤ -0.5mm. The ve curves correspond to the ve-fold cross-validation scheme. The dotted diagonal line (purple) illustrates the performance of a random predictor.
To identify 12 month-OA progressors from baseline knee MRI, we developed models to predict JSN(12 months) ≤ − 0.5mm from 2D (SAG IW TSE and COR IW TSE) as well as 3D (SAG DESS) knee MRI sequences. The most promising results were obtained using the classi cation model depicted in Fig. 3, which takes as an input 8 consecutive slices (centered around the "middle" slice) from a 2D COR IW TSE volume as well as the patient Body Mass Index (BMI) and predicts whether the observed knee is a 12 month-OA progressor, i.e. JSN(12 months) ≤ -0.5 mm. We thus report below on classi cation results obtained with 2D COR IW TSE sequences.
The performance of our classi cation model was evaluated using the ROC AUC score, well suited to this task as a metric given the class imbalance: only 14% of the patients with COR IW TSE images at baseline (3236 patients ; 5709 COR IW TSE knee images) are 12 month-OA progressors. Using COR IW TSE images, the proposed classi cation model achieved a ROC AUC score of 65%. This model achieved a precision of 13% and a recall of 84%. The above results are further summarized in a confusion matrix, reported in Fig. 4.

Human benchmark
The two radiologists concluded that, for most pairs, their decision was virtually random. Both radiologists found that 2D COR IW TSE volumes were less useful than SAG 3D DESS. The junior radiologist obtained a ROC AUC score of 57.82% whereas the senior radiologist obtained 59.72%. This benchmark with human radiologists highlights the di culty of identifying 12 month-OA progressors using only knee MRI and clinical data at baseline. Nonetheless, these results show the added value of AI in assisting radiologists in a complex image analysis task.

Prediction of pain severity
We subsequently applied our machine learning approach to the prediction of pain contemporary to image acquisitions. The grading of pain quanti ed by the WOMAC score was organized into two sets of values, including WOMAC pain score ≥ 2 and WOMAC pain score < 1. The rationale behind this strati cation is two-fold. On one hand, it re ects some clinical relevance in that pain scores below 2 are often identi ed as "no pain". Furthermore, it facilitates a data-driven approach where independent models can be trained and evaluated using only clinical data from different ranges of values. Such models were found to perform better when considering two sets of values with the above mentioned orders of magnitude.
Using this approach, our predictive model for pain achieves a mean PR AUC of 66.8% (+/-1%), a mean ROC AUC of 72.4% (+/-1%) and a mean weighted-F1 score of 65.2% (+/-1%). Corresponding ROC and PR curves obtained for each of the ve training folds are shown in Fig. 5. For comparison, a random predictor would achieve a mean ROC AUC of 50% and a mean weighted-F1 score of 60%. Globally, the model demonstrated good capabilities to identify high-pain knees (i.e. produce a relatively low number of False Negatives), with however a tendency to misclassify non-painful knees (i.e. produce False Positives), as can be seen in the confusion matrix represented Fig. 5, left panel. The associated PR curve is given in the additional materials, Fig. 8.
Graphic representations are shown for the binary task of classifying sets of pain scores across the crossvalidation process.

Model interpretability
In order to get a better understanding of model predictions, the GradCam [29] method was further used to visually pinpoint relevant characteristics within input images, as shown in Fig. 6. Yellow-colored regions were identi ed within the joint as the ones contributing with a high probability to the positive class, i.e. progression in the case of JSN progression prediction (Fig. 6, top row), and high WOMAC pain score in the case of pain prediction (Fig. 6, bottom row), respectively. Purple-colored regions did not contribute to high probabilities in the predictions. Interestingly, this analysis emphasized different regions of interest depending on the task. Speci cally, JSN progression-related regions are highlighted by the model in the internal femoro-tibial compartment. In contrast, for pain prediction, areas of interest are rather located in the intra articular space, where effusion is observed in the case of congestive osteoarthritis.
The bottom row corresponds to prediction of JSN progression and the upper row to pain prediction. Yellow areas are the ones considered of high interest by the model: the more intense the yellow, the higher its contribution to a high score for JSN progression prediction (bottom row, coronal view) or severe pain classi cation (top row, sagittal view). All images are obtained from patient 9932578 (right knee).

Discussion
Predicting disease progression in knee OA is critical to identify patients more likely to bene t from DMOADs and further, to help selecting patients and de ning treatment duration in clinical studies evaluating drug candidates [30]. In the present study, we thus developed a weakly supervised deep learning method to build up predictive models for OA progression at 12 months from baseline MR images. Further analyses were also conducted to predict pain grade evaluated by WOMAC from MR images and clinical data at the same visit.
Using COR IW TSE images, our proposed classi cation model achieved a ROC AUC score of 63%. The latter was comparable to the performance of trained radiologists, obtaining a ROC AUC score of 59.72%. To our knowledge, this is the rst application of a weak supervised learning method to the prediction of knee osteoarthritis progression from baseline MRI. Although not shown, no improvement on performance was observed on prediction of progression when considering a 24 month follow-up. We also successfully designed a task to identify imaging features associated with pain, leading to a model achieving a ROC AUC score of 72%. This encouraging result is likely explained by the presence of synovial effusion in painful knees, very contrasted in images and thus easy to detect for a radiologist. Our results are consistent with a previous study relying upon siamese neural networks to analyse pairs of knees and predict pain with a high AUC (85.3%) [15]. This study con rmed that 86% of correctly predicted painful patients exhibited an effusion-synovitis within areas most associated with pain.
Deep learning methods are often described as "black-boxes", referring to the lack of interpretability of their predictions. Interpretability can however be introduced in the form of "heatmaps" generated using a GradCam method [29] to highlight the relevant regions in the knee MRI used by the predictive model. In our study, such attention modeling of OA progression con rmed the importance of internal joint space, consistent with the fact that the Joint Space Narrowing is evaluated in this anatomic compartiment. The pain prediction model rather showed heatmaps focused on the intra-articular space, where cartilage, meniscal lesions, and effusion synovitis are observed. In future developments, other interpretability methods based on Generative adversarial networks (GANs) could be applied to generate synthetic imaging features re ecting pathophysiologic processes of interest in OA. Whereas GANs modeling the natural history of OA progression observed on knee radiographs have been developed [31] such studies remain to be done on MRI.
Other developments in AI-based image analyses could be considered to improve the predictive models obtained in our feasibility study. For example, whereas we used MRI as inputs for predicting an endpoint determined from knee X ray imaging, further studies could rather use MRI criteria as endpoints of progression in clinical trials of knee osteoarthritis. The latter could circumvent part of the complexity in identifying the future progression of OA based on the absolute joint space narrowing > 0.5 mm as an endpoint, i.e. a criteria di cult to quantify reproducibly, even more so in light of interpersonal variability in joint space measurement. In this regard, we investigated in a post hoc analysis, the use of a different criteria to characterize OA progression. Whereas our initial analyses have been based on JSN(12 months) ≤ -0.5mm, re ecting that a knee is a "12 months OA progressor" when the minimum JSW is reduced by 0.5mm in the medial compartment of the knee, we reasoned that this "absolute" criteria may not be suited to knees with advanced OA at baseline. We thus considered as an alternative a "relative" criteria de ned by: JSN(12 months) ≤ -25% relative to JSW(baseline), choosing the latter threshold as it ensures that the dataset has approximately the same class imbalance as with the "absolute" criteria. In this approach, a classi cation model based on 2D COR IW TSE and validated using a 5-fold cross-validation strategy obtained an average ROC AUC score of 80%, suggesting an interest in considering relative over absolute JSN reduction as an alternative endpoint of OA progression.

Conclusions
Our study further demonstrates the added value of deep learning [32] in musculoskeletal imaging. On 2D radiographs, previous studies have been successfully conducted for bone fracture detection [33], as well as automatic Kellgren and Lawrence Grading for knee OA [34]. Other studies on knee MRI showed strong performance on cartilage segmentation [35], as well as detection or grading of meniscal or anterior cruciate lesions [36]. All these studies relied upon "strong" labeling methods, requiring time-consuming manual image annotations by expert radiologists experts. In comparison, the deep learning approach developed herein is based on "weak" labels for machine learning tasks, i.e. relying on information not explicitly shown in images as targets for predictions. The latter further demonstrates the added value of deep learning in clinical practice as it applies to OA, with the promise of a convergence of intelligences between machines and radiologists in the interpretation of radiological images [37,38]. The future in the eld is likely one of a new era of augmented radiology.

Funding
This study was funded by Servier.

Acknowledgements
The authors thank Dorothée Piva for excellent assistance in preparing the manuscript. Overview of the image preprocessing pipeline.

Figure 2
Global overview of the feature extraction step.

Figure 3
Global overview of the model. ROC curve and confusion matrix for prediction of pain severity. Graphic representations are shown for the binary task of classifying sets of pain scores across the cross-validation process.