Comparing the accuracy and reliability of detecting intensity of spinal inflammation on STIR sequence with ADC values in axial spondyloarthritis

Objective To compare the accuracy and reliability of detecting the intensity of spinal inflammation on short tau inversion recovery (STIR) with the apparent diffusion coefficient (ADC) values of the active MRI lesions in axial spondyloarthritis (axSpA). Fifty active lesions in the STIR sequence of spinal MRI were identified. With reference to the sites of active lesions on STIR, the corresponding region of interest (ROI) on the ADC map was drawn to determine the maximum ADC (ADCmax), mean ADC (ADCmean), normalized maximum ADC (nADCmax) and normalized mean ADC (nADCmean). Four independent readers scored the identified active lesions as “intense” or “non-intense” according to the SPARCC MRI index. These grades were compared with the ADC parameters to assess accuracy and reliability. Regression analyses were used to adjust for potential factors that could affect ADC. Significant differences in ADCmax were found between “intense” and “non-intense” lesions as scored by 3 of the 4 readers (1405.7±271.4 vs 1165.8±223.8, p=0.01; 1420.7±272.1 vs 1209.0±248.5, p=0.01; 1438.0±307.2 vs 1213.6±231.0, p=0.01). For only 1 reader did “intense” and “non-intense” lesions differ in ADCmean (899.2±248.3 vs 711.0±222.6, p=0.01) and nADCmean (4.4±2.1 vs 3.4±1.4, p=0.05). Inter-reader agreement was slight to moderate (Kappa=0.07-0.45). Reliability improved substantially when only the lowest and highest 25th percentiles of ADC values were included (Kappa=0.17-0.75). Regression analyses showed that “intense” lesions were associated with higher ADC values after adjustment for confounders.


Introduction
Magnetic Resonance Imaging (MRI) is considered an objective method for assessment of disease activity in axial spondyloarthritis (axSpA). Spinal inflammation on MRI is the only parameter in axSpA that correlates with inflammatory cellularity in tissue biopsy (1), and is also a positive predictor of response to biologic therapy (2). These provide strong evidence that MRI could be used in disease monitoring.
Short Tau Inversion Recovery (STIR) sequence, a fat suppression sequence in MRI, is the most commonly used imaging technique in axSpA. Its ability to assess the extent of spinal inflammation has been evaluated by various scoring methods (3)(4)(5)(6). Diffusion weighted imaging (DWI), with its derived apparent diffusion coefficient (ADC), is a newer MRI technique for axSpA assessment. In contrast to the STIR sequence, ADC measures the intensity of inflammatory lesions quantitatively. It reflects the magnitude of water diffusion at the tissue level (7) and has been shown to be useful in measuring temporal changes in the intensity of spinal inflammation in ankylosing spondylitis (AS) (8). In our previous publication, ADC was associated with disease activity, functional impairment and patient global assessment in axSpA (9). In this study, we compared the accuracy and reliability of detecting the intensity of spinal inflammation on STIR, graded according to the SPARCC MRI index method, with the ADC values of the active MRI lesions in patients with axSpA. The grading was done by 4 independent readers in a scoring exercise.

Materials And Methods
The data and MRI images were from an on-going multicenter cohort evaluating the utility of DWI in axSpA. The cohort has been registered in the clinical trial registry of the University of Hong Kong (HKUCTR-2087). Detailed methods have been described in our previous publications (9,10). A brief description of the cohort is given here.

Patient recruitment
Patients with expert-diagnosed axSpA who were older than 18 years and had back pain of more than 3 months' duration were consecutively recruited from 7 rheumatology centers (Queen Mary Hospital, Grantham Hospital, Tung Wah Hospital, Pamela Youde Nethersole Eastern Hospital, Tseung Kwan O Hospital, Caritas Medical Center, and Kwong Wah Hospital) and one ophthalmology center (Hong Kong Eye Hospital). Written consent was obtained from all recruited patients. Patients who were pregnant or unable to undergo MRI examination were excluded from the study.

Clinical data
Clinical data collected included basic demographics, characteristics and severity of back pain, and extra-articular features. Physical examination was performed to determine the tender and swollen joint counts, and spinal mobility as represented in the Bath Ankylosing Spondylitis Metrology Index (BASMI) (11). Blood tests including C-reactive protein (CRP), erythrocyte sedimentation rate (ESR), and human leucocyte antigen B27 (HLA-B27) were performed. Self-assessment questionnaires including the Bath Ankylosing Spondylitis Disease Activity Index (BASDAI) (12), Bath Ankylosing Spondylitis Functional Index (BASFI) (13), and Bath Ankylosing Spondylitis Global Index (BASGI) (14) were completed, and the Ankylosing Spondylitis Disease Activity Score (ASDAS) (15) was calculated.

Radiographs
Anteroposterior radiographs of the lumbosacral spine were obtained to assess the modified New York criteria for AS (16). Severity of radiographic sacroiliitis was defined as follows: grade 0, normal; grade 1, suspicious; grade 2, erosion/sclerosis without joint space change; grade 3, sclerosis/erosion with change in joint space or partial ankylosis; and grade 4, complete fusion. The X-rays were read by a single reader (CWSC).

MRI and interpretations
Whole-spine and sacroiliac (SI) joint MRI with STIR and DWI sequences was performed consecutively in the same MRI examination in all patients using a 3T Achieva scanner (Philips Healthcare, Best, the Netherlands). The spinal MRI images covered the cervical to lumbosacral levels. SI joint images were not used in this study. Free-breathing DWI with fat suppression was performed using a single-shot spin-echo echo-planar imaging sequence with 4 b-values (0, 100, 600 and 1000 s/mm²). Details of the imaging parameters have been described in our previous publications (9,10); in summary: TR/TE 5000/80 (STIR), 4000/90 (DWI); field of view 150x240 mm² (STIR), 300x241 mm² (DWI); slice thickness 3.5 mm (STIR), 4 mm (DWI).
Two readers (HHLT, CWSC) independently identified 50 active lesions for the scoring exercise. Discrepancies were resolved by consensus. Active lesions were defined as hyper-intensities in the vertebral body with no associated features of adjacent disc degeneration. With reference to the sites of active lesions on STIR, the corresponding region of interest (ROI) on the ADC map was drawn by a radiologist (KHL) to determine the maximum ADC (ADCmax) and mean ADC (ADCmean). In addition, the background ADC (ADCbg) was determined as the mean ADC of another ROI drawn in a normal-appearing lumbar vertebra, excluding the cortical endplate. Normalized maximum ADC (nADCmax) and normalized mean ADC (nADCmean) were defined as ADCmax/ADCbg and ADCmean/ADCbg respectively. Figure 1 shows a lesion identified in the STIR sequence and the corresponding ROI on the ADC map. All MRI images and ADC values were visualized and determined using OsiriX MD v 9.5.2.
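The ADC parameters defined above can be sketched in a few lines of code. This is a minimal illustration only, assuming the pixel values of each ROI have already been exported as plain lists; the function name `adc_parameters` and the sample numbers are hypothetical and are not taken from the study data or from OsiriX.

```python
from statistics import mean


def adc_parameters(lesion_roi, background_roi):
    """Return ADCmax, ADCmean, nADCmax and nADCmean for one lesion.

    lesion_roi: ADC values of the pixels inside the lesion ROI.
    background_roi: ADC values inside the ROI drawn in a normal-appearing
    lumbar vertebra; their mean is the background ADC (ADCbg).
    """
    if not lesion_roi or not background_roi:
        raise ValueError("both ROIs must contain at least one pixel")
    adc_bg = mean(background_roi)
    if adc_bg <= 0:
        raise ValueError("background ADC must be positive")
    adc_max = max(lesion_roi)
    adc_mean = mean(lesion_roi)
    return {
        "ADCmax": adc_max,
        "ADCmean": adc_mean,
        "nADCmax": adc_max / adc_bg,    # ADCmax / ADCbg
        "nADCmean": adc_mean / adc_bg,  # ADCmean / ADCbg
    }


# Hypothetical pixel values, for illustration only
params = adc_parameters([800, 1000, 1200], [200, 300])
```

Because both normalized values share the same denominator, any error in drawing the background ROI affects nADCmax and nADCmean proportionally.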

Scoring exercise
Four independent readers, blinded to clinical, radiological and ADC parameters, scored the previously identified active lesions. All 50 lesions identified were readable. Lesions were graded as "intense" when the signal intensity was similar to that of cerebrospinal fluid (3). Otherwise, they were graded as "non-intense". The 4 readers were one musculoskeletal radiologist (VWHL) (reader 1), one rheumatologist (HYC) with 8 years of experience in reading MRI (reader 2), and a rheumatologist (TTC) (reader 3) and a medicine trainee (FKPC) (reader 4), both inexperienced in MRI interpretation. The latter two readers had received training in scoring the intensity of MRI lesions according to the Spondyloarthritis Research Consortium of Canada (SPARCC) MRI index (3,4) prior to the scoring exercise, and 3 lesions (not included in the analyses) were used for a validation exercise of the 4 scorers.

Statistical analyses
Clinical, radiological and MRI data were described as mean ± standard deviation (SD) or percentage.
Intra-class correlation coefficient was used to determine the interobserver agreement for SPARCC MRI index scores. Reliability of the "intense" grading between pairs of readers was calculated using Cohen's kappa coefficient (K), and overall reliability using the Fleiss kappa coefficient. Subgroup analyses were performed using 1) data including only the lowest and highest 25th percentiles of maximum ADC, and 2) data including only the lowest and highest 25th percentiles of mean ADC. The degree of reliability was interpreted as follows: 0.00-0.20 as slight; 0.21-0.40 as fair; 0.41-0.60 as moderate; 0.61-0.80 as substantial and 0.81-1.00 as almost perfect agreement.
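For readers unfamiliar with these agreement statistics, the following is a minimal sketch of how Cohen's kappa, Fleiss' kappa and the interpretation bands above can be computed. It is not the SPSS procedure used in this study; the function names and the example grades ("I" for intense, "N" for non-intense) are hypothetical, and the label returned for negative kappa values is an assumption, since the bands above start at 0.00.

```python
from collections import Counter


def cohen_kappa(grades_a, grades_b):
    """Cohen's kappa for two readers grading the same lesions."""
    if len(grades_a) != len(grades_b) or not grades_a:
        raise ValueError("need two non-empty grade lists of equal length")
    n = len(grades_a)
    observed = sum(a == b for a, b in zip(grades_a, grades_b)) / n
    freq_a, freq_b = Counter(grades_a), Counter(grades_b)
    # Chance agreement from each reader's marginal grade frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def fleiss_kappa(grades_per_lesion):
    """Fleiss' kappa; each item holds the grades of all readers for one lesion."""
    n_items = len(grades_per_lesion)
    if n_items == 0:
        raise ValueError("need at least one lesion")
    n_raters = len(grades_per_lesion[0])
    if n_raters < 2 or any(len(g) != n_raters for g in grades_per_lesion):
        raise ValueError("every lesion needs the same number (>= 2) of grades")
    category_totals = Counter()
    agreement_sum = 0.0
    for grades in grades_per_lesion:
        counts = Counter(grades)
        category_totals.update(counts)
        # Proportion of agreeing reader pairs for this lesion
        agreement_sum += (sum(c * c for c in counts.values()) - n_raters) / (
            n_raters * (n_raters - 1)
        )
    observed = agreement_sum / n_items
    total = n_items * n_raters
    expected = sum((t / total) ** 2 for t in category_totals.values())
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


def interpret(kappa):
    """Map a kappa value to the agreement bands used in this study."""
    if kappa < 0:
        return "less than chance"  # assumption: not covered by the bands above
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Hypothetical grades for four lesions by two readers
k = cohen_kappa(["I", "I", "N", "N"], ["I", "N", "N", "N"])
```

In this hypothetical example the two readers agree on 3 of 4 lesions, chance agreement is 0.5, and the resulting kappa of 0.5 falls in the "moderate" band.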
We used ADC values as the "gold standard" to assess whether intensity grading of STIR MRI lesions could predict the true degree of inflammation. This was done using t-tests, univariate regressions and multivariate regressions. Independent t-tests were first used to compare ADCmax, ADCmean, nADCmax, and nADCmean between lesions graded as "intense" and "non-intense" by each reader.
"Intense" lesion gradings with a p-value less than 0.1 were used as independent variables in univariate linear regression analyses to determine their associations with ADC values. Independent multivariate regression models were built using "intense" lesions with a p-value less than 0.1 in univariate analyses as independent variables. In addition to the "intense" lesions, factors known or expected to be associated with a change in ADC values were also tested as regressors in univariate linear regression analyses. These were age and male gender. Independent variables with a p-value less than 0.1 in univariate analyses were re-tested in multivariate regression analyses using the "enter" mode. Results were reported as standardized coefficient (SC) and regression coefficient (RC) with 95% confidence interval (CI). All statistical analyses were performed using the Statistical Package for Social Sciences (SPSS) version 25. A p-value of less than 0.05 was defined as statistically significant.
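The relationship between the t-test comparison and the univariate regression can be made concrete with a small example. When the only regressor is a 0/1 indicator for an "intense" grade, the ordinary least squares regression coefficient equals the difference in mean ADC between the two groups, and the standardized coefficient equals the Pearson correlation. The sketch below uses hypothetical values (not study data); `simple_ols` is an illustrative helper and not the SPSS routine used in this study.

```python
from statistics import mean, stdev


def simple_ols(x, y):
    """Least squares fit of y = intercept + slope * x.

    Returns (intercept, slope, standardized_slope).
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 points")
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the regressor must take at least two distinct values")
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sd_y = stdev(y)
    # Standardized coefficient: slope rescaled by the SDs of x and y
    standardized = slope * stdev(x) / sd_y if sd_y > 0 else 0.0
    return intercept, slope, standardized


# Hypothetical data: 1 = graded "intense", 0 = graded "non-intense"
intense = [1, 1, 0, 0, 0]
adc_max = [1400.0, 1500.0, 1100.0, 1200.0, 1300.0]

intercept, slope, standardized = simple_ols(intense, adc_max)
mean_intense = mean(v for g, v in zip(intense, adc_max) if g == 1)
mean_non_intense = mean(v for g, v in zip(intense, adc_max) if g == 0)
```

Here the intercept is the mean ADCmax of the "non-intense" group and the slope is the difference between the group means, which is why the regression and the t-test address the same contrast before covariates such as age and sex are added.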

Results
Clinical and demographic data are described in Table 1. All selected patients fulfilled the Assessment in Spondyloarthritis International Society (ASAS) classification criteria for axSpA.

Accuracy of intensity of lesions as defined by SPARCC MRI index
Table 2 shows the differences in ADCmax, ADCmean, nADCmax, and nADCmean between lesions graded as "intense" and "non-intense" on STIR. There were differences in ADCmax between "intense" and

Subgroup analyses
In the subgroup including the lowest and highest 25th percentiles of maximum ADC only, the musculoskeletal radiologist (reader 1) and rheumatologist experienced in reading MRI (reader 2) had the best agreement (k = 0.75). The worst agreement was found between the musculoskeletal radiologist (reader 1) and rheumatologist not experienced in reading MRI (reader 3) (k = 0.17).
Overall agreement by Fleiss Kappa was 0.39.
In the subgroup including the lowest and highest 25th percentiles of mean ADC only, the musculoskeletal radiologist (reader 1) and rheumatologist experienced in reading MRI (reader 2) had the best agreement (k = 0.59). The worst agreement was found between the rheumatologist

Univariate and multivariate regression models using ADC values as dependent variables
Independent variables tested in univariate linear regression analyses included age, male gender, and "intense" lesions. ADCmean was positively associated with "intense" lesions by the musculoskeletal
nADCmax was not associated with "intense" lesions by any readers.
Multivariate regressions showed that ADCmean was positively associated with "intense" lesions graded by the rheumatologist inexperienced in reading MRI (reader 3) and by the medicine trainee (reader 4). After adjustment for potential associated factors, ADCmax was positively associated with "intense" lesions graded by the rheumatologist experienced in reading MRI (reader 2), the rheumatologist inexperienced in reading MRI (reader 3), and the medicine trainee (reader 4). nADCmean had no association with "intense" lesions graded by any of the readers. Results are shown in Table 5.

Discussion
In this study, we compared the accuracy and reliability of intensity grading on the STIR sequence against computer-generated ADC parameters in patients with axSpA.
DWI and ADC are new MRI sequences and measurements in spinal inflammation in axSpA. They have been validated in previous studies (8)(9)(17). In contrast to the STIR sequence, they allow quantitative assessment (1, 8) of disease activity. Measurement of ADC, however, is not without limitation. ADC has a wide degree of variability as a result of instrumental variation and errors, and biological variations. A proposed solution is therefore the normalized ADC, calculated as the ratio of the abnormal ADC to the normal ADC, to reduce these variations. At present, there is still a lack of validation data on the two methods in axSpA. In this study, a higher ADC or normalized ADC is assumed to represent a higher degree of inflammation.
We used the SPARCC MRI index as a reference method to score the intensity of MRI inflammatory lesions. The original definition was "The signal from cerebrospinal fluid constituted the reference for designating an inflammatory lesion as intense" (3). Our data show that the human eye can differentiate lesions with a greater degree of inflammation from those with a lesser degree, but that different readers interpret MRI differently. Using this method, most (3 out of 4) readers were able to differentiate the image intensity of maximally inflamed areas. Two of the readers were also able to differentiate the intensity of the mean degree of inflammation within the lesions. This suggests that readers tended to use the most inflamed area as the reference. As ADCmean depends on the way the ROI is drawn, ADCmax could represent a more objective measurement.
Overall, intensity grading of STIR MRI inflammation had poor reliability. Inter-reader agreement on the intensity of lesions was only slight to moderate (Kappa=0.07-0.45). Significant differences in both ADCmax and ADCmean were observed only in the intensity grading by the rheumatologist inexperienced in MRI reading (reader 3). There were also marked discrepancies in the number of "intense" lesions identified by different readers.
When only the most and least inflamed lesions were included in the subgroup analyses, the reliability improved substantially. In the subgroup analyses including the lowest and highest 25th percentiles of maximum ADC only, "intense" lesions identified by the musculoskeletal radiologist (reader 1) and the rheumatologist experienced in reading MRI (reader 2) achieved substantial reliability. The number of "intense" lesions graded correctly also increased in the lowest and highest 25th percentiles. As the differences in intensity of inflammation (as reflected by the ADC parameters) between the "intense" and "non-intense" lesions were small, the human eye would be inferior to a computer in differentiating subtle differences in intensity. Our results showed that STIR MRI could be inferior to ADC in identifying lesion intensity, which is consistent with another international study (18). Reader experience is a factor that improves the reliability of MRI interpretation.
ADC could be affected by a number of factors including the way the ROI was drawn, health of the spine and skeletal maturity (19). Age and osteoporosis have also been reported to affect the ADC (20). Although we did not directly evaluate the effect of osteoporosis in our analyses, we adjusted the ADC values for age and sex, two known risk factors for osteoporosis (21-23). After adjustment, we still found positive associations between ADCmax and "intense" lesions identified by 3 of the readers.
Positive associations were also found between ADCmean and "intense" lesions identified by the rheumatologist inexperienced in reading MRI (reader 3) and by the medicine trainee (reader 4). The results suggest that STIR MRI could differentiate the degree of inflammation despite the effects of age and sex.
STIR MRI showed poor ability to differentiate different nADC values. In fact, no difference in nADCmax was observed between the "intense" and "non-intense" lesions graded by any of the readers.
nADC is defined as the ratio of lesion ADC to normal spine ADC. The value allows comparison between different machines. At present, the best way to perform the normalization remains uncertain.
However, as our study involved only one MRI machine, normalization was not strictly necessary.
Our data also showed that ADC acquired from a single MRI machine outperformed nADC, as the former appeared to be less affected by variability in interpretation.

Conclusion And Future Direction
By comparison with ADC, we showed that STIR MRI has the ability to differentiate the degree of spinal inflammation. However, it is limited by its inability to differentiate subtle differences. Moreover, different readers interpret MRI differently. ADC is an alternative method. With technological advances and the development of artificial intelligence in radiology (24), we believe that ADC may eventually be generated automatically by computer to replace intensity grading on the STIR sequence in patients with axSpA.

Figure 1 Lesion identified in STIR sequence and ROI of the lesion on ADC map. STIR=short tau inversion recovery; ROI=region of interest; ADC=apparent diffusion coefficient