An independent agreement study of modified Pfirrmann grading system for cervical inter-vertebral disc degeneration in cervical spondylotic myelopathy

Abstract Objective Most patients with cervical spondylotic myelopathy (CSM) suffer from neck pain, sensory disturbance, and motor dysfunction. For CSM surgery, it is necessary to evaluate preoperative inter-vertebral disc degeneration (IDD), which determines whether to adopt a fusion strategy, and postoperative IDD, which is one of the main reasons for reoperation. The modified Pfirrmann grading system is commonly used to evaluate IDD. The objective of this study is to evaluate its reliability and reproducibility for cervical IDD in CSM patients and to explore its clinical application value. Methods A total of 165 patients with CSM were enrolled. Six physicians (3 spine surgeons and 3 radiologists) with clinical experience were selected. They graded each cervical inter-vertebral disc according to the modified Pfirrmann grading system, and we used the intra-class correlation coefficient (ICC) and weighted kappa (wκ) to assess inter- and intra-observer agreement. After 12 weeks, the assessment was repeated. Results The inter-observer reliability of the modified Pfirrmann grading system was excellent, with an ICC value of 0.76, and near perfect, with a wκ value of 0.82. The intra-observer reproducibility was excellent, with ICC values ranging from 0.80 to 0.91, and near perfect, with wκ values ranging from 0.83 to 0.92. Conclusion The modified Pfirrmann grading system has excellent inter-observer reliability and intra-observer reproducibility for cervical IDD in CSM. It is also well suited to use by both spine surgeons and radiologists, so clinical and radiological studies applying it should be deemed accurate. Thus, the modified Pfirrmann grading system can be widely used as an appropriate instrument in clinical care.


Introduction
Cervical spondylotic myelopathy (CSM) is the most serious type of cervical spondylosis. 1,2 Over the past 10 years, the number of patients undergoing surgery has increased more than sevenfold. 3 For CSM surgery, evaluating degeneration at the operative and adjacent cervical levels is essential for making a more detailed operation plan. 4,5 Preoperative inter-vertebral disc degeneration (IDD) determines whether to adopt a fusion strategy, while postoperative degeneration of the adjacent segment is one of the main reasons for reoperation. In clinical practice to date, the lack of a clear definition of cervical IDD has left few reliable criteria for determining the CSM surgical strategy or for judging prognosis and risk factors. It is therefore of great importance to establish a comprehensive and rational grading system for cervical IDD based on modern imaging examinations. Magnetic resonance imaging (MRI) is considered the best imaging tool for the evaluation of IDD because it shows both disc morphology and hydration.
In 2007, Griffith et al. 6 proposed the modified Pfirrmann grading system, which classifies lumbar inter-vertebral discs into 8 grades, in order to assess IDD qualitatively (Table 1, Figure 1). The 8 grades represent a progression from a normal disc to severe disc degeneration. Grade 1 corresponds to no disc degeneration, while Grade 8 corresponds to end-stage degeneration.
For IDD in CSM, an adequate and rational grading system would standardize research terminology, allow easier communication among physicians, and help to determine the surgical strategy in individual patients. However, no study has evaluated the modified Pfirrmann grading system and its application in cervical IDD, so it still requires independent validation. In view of the frequent need for MRI studies categorizing IDD in the evaluation of patients with CSM, our study aims to analyze the inter-observer reliability and intra-observer reproducibility of the modified Pfirrmann grading system. In addition, this will be the first study assessing its application value in cervical IDD.

Patient case selection and evaluation
This study was performed in accordance with the Declaration of Helsinki. Institutional review board approval was obtained from our ethics committee, and signed informed consent was provided by all participating subjects. Database records of patients with CSM admitted to Shanghai Longhua Hospital from 2018 to 2019 were retrospectively collected and analyzed. Cervical spine MRI examinations and available clinical data were required for inclusion. Cases with concurrent cervical spine fracture, tumor, or infection, or with instrumentation in the cervical spine, were excluded. A complete MRI examination had to include T2-weighted turbo spin-echo sagittal images without fat suppression so as to cover all grades of the modified Pfirrmann grading system. MRI data were acquired on a 1.5-T whole-body imaging system. The complete and available clinical data included demographic characteristics, complaints, spinal cord and neurological function, concomitant diseases, and treatment history.
One resident of our department, who did not participate in the later statistics and analysis, collected the cases from our patient database. Physicians who had treated the patients were not permitted to act as assessors. Six physicians from two specialties (three spine surgeons and three radiologists) volunteered to be evaluators; they were blinded to the identity of the patients, the treatment they received, and the original classification used in clinical care. To make the study sufficiently reliable, each evaluator was provided with the necessary original literature and relevant information to evaluate cases on the basis of the modified Pfirrmann grading system. Any differing opinions about the system were discussed in face-to-face meetings before the assessment until all the evaluators reached a consensus. Standard image reports were available to evaluators as reference. According to the modified Pfirrmann grading system, the six evaluators each assigned a single grade to every cervical inter-vertebral disc (from C2-C3 to C7-T1).
[Table 1, final row: Hypointense; Indistinct; >60% reduction]
a Grades 1, 2, and 3 are based on the signal intensity of the nucleus and inner fibers of the anulus. For Grade 4, the margins between the inner and outer fibers of the anulus at the posterior margin of the disc are indistinct. For Grade 5, the disc is uniformly hypointense, although there is no loss of disc space height. For Grades 6, 7, and 8, there is progressive loss of disc space height. These could be broadly classified as mild, moderate, and severe loss of disc space height. Very occasionally, although obvious disc collapse is present, hyperintense signal from the nucleus and inner fibers of the anulus is preserved. This is referred to by a double entry, e.g., 4/7, with the former reporting the disc signal and the latter the degree of collapse.
Inter-observer reliability was evaluated by comparing the initial responses of all six evaluators. Intra-observer reproducibility was assessed by comparing the same evaluator's two responses for the same case, given 12 weeks apart; the cases were presented in a random order to minimize recall bias.
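The randomized presentation described above can be sketched in Python as follows. This is only an illustration: the function name, the seeds, and the 1 to 165 case numbering are assumptions made for the example, and the study does not report how its random order was generated.

```python
import random


def presentation_order(case_ids, seed):
    """Return a shuffled copy of case_ids (hypothetical helper).

    A fixed seed makes the order reproducible, so the sequence shown to an
    evaluator can be reconstructed later if needed.
    """
    rng = random.Random(seed)   # independent generator, leaves global state alone
    order = list(case_ids)      # copy so the caller's list is not modified
    rng.shuffle(order)
    return order


# Hypothetical anonymised identifiers for the 165 enrolled cases.
cases = list(range(1, 166))

# A different seed per reading round gives a different order each time.
first_round = presentation_order(cases, seed=1)
second_round = presentation_order(cases, seed=2)
```

Using a separate seed per round (and, if desired, per evaluator) keeps the two readings from sharing a memorable sequence while still covering every case exactly once.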

Statistical analysis
All data analyses were performed using Statistical Package for the Social Sciences (SPSS) software (version 22.0). Because the disc grades are ordinal data, we adopted the intraclass correlation coefficient (ICC) and weighted kappa (wκ) to measure inter- and intra-observer agreement for the modified Pfirrmann grading system (two-way mixed effect model, in which people effects are random and measures effects are fixed). 7 ICC makes it possible to analyze agreement when it varies across multiple responses, while wκ makes it possible to assess agreement when not all disagreements are equally significant. We expressed ICC values with a 95% confidence interval (CI). For each grade of the modified Pfirrmann grading system, Fleiss's κ was used to assess inter-observer reliability, and intra-observer reproducibility was measured by Cohen's κ. 8,9 The range of the ICC value is (0, 1), while that of the κ value is (-1, 1). The larger the value, the better the agreement. Based on the recommendations of Fleiss 10 and Landis et al., 11 there were three levels of ICC, with ICC values of 0.00-0.40 considered poor agreement, 0.40-0.74 fair to good agreement, and 0.75-1.00 excellent agreement, while levels of agreement for κ were divided into five grades, with κ values of 0.00-0.20 considered slight agreement, 0.21-0.40 fair agreement, 0.41-0.60 moderate agreement, 0.61-0.80 substantial agreement, and 0.81-1.00 near-perfect agreement (Table 2).
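To make the idea of weighting disagreements concrete, the following Python sketch computes a weighted Cohen's kappa for two sets of grades on the 8-grade scale. It is not the SPSS procedure used in the study. The function name is hypothetical, and the weighting scheme is an assumption: the paper does not state whether linear or quadratic weights were applied, so linear weights are the default here and `power=2` switches to quadratic weights.

```python
def weighted_kappa(rater_a, rater_b, n_grades=8, power=1):
    """Weighted Cohen's kappa for two raters grading the same discs.

    Grades are integers from 1 to n_grades. Disagreement weights are
    (|i - j| / (n_grades - 1)) ** power, so power=1 gives linear weights
    and power=2 quadratic weights (assumed schemes, not from the paper).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of discs")
    if any(not 1 <= g <= n_grades for g in list(rater_a) + list(rater_b)):
        raise ValueError("grades must lie between 1 and n_grades")

    n = len(rater_a)
    # Observed joint proportions of (grade by A, grade by B).
    observed = [[0.0] * n_grades for _ in range(n_grades)]
    for a, b in zip(rater_a, rater_b):
        observed[a - 1][b - 1] += 1.0 / n

    # Marginal proportions for each rater.
    row = [sum(observed[i]) for i in range(n_grades)]
    col = [sum(observed[i][j] for i in range(n_grades)) for j in range(n_grades)]

    weighted_observed = 0.0   # weighted disagreement actually seen
    weighted_expected = 0.0   # weighted disagreement expected by chance
    for i in range(n_grades):
        for j in range(n_grades):
            w = (abs(i - j) / (n_grades - 1)) ** power
            weighted_observed += w * observed[i][j]
            weighted_expected += w * row[i] * col[j]

    if weighted_expected == 0.0:
        return float("nan")   # both raters used one identical grade throughout
    return 1.0 - weighted_observed / weighted_expected
```

With these weights, a disagreement of one grade (say 4 versus 5) is penalized far less than a disagreement of several grades, which is why a weighted statistic suits an ordinal scale such as this one.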

Results
After applying the exclusion criteria, a total of 165 consecutive cases from our patient database from 2018 to 2019 were included in this study, comprising 94 males and 71 females with an average age of 63.5 ± 2.4 years (range, 43 to 85 years). There were 990 cervical inter-vertebral discs altogether in these individuals, and each assessment yielded 5940 records because each disc was graded by six evaluators. After 12 weeks, we acquired a second set of 5940 evaluations.
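The record counts above follow directly from the study design, as the short Python check below shows. The variable names are illustrative; the six disc levels are those listed in the paper (C2/3 to C7/T1).

```python
patients = 165
disc_levels = ["C2/3", "C3/4", "C4/5", "C5/6", "C6/7", "C7/T1"]
evaluators = 6

# One grade per disc level per patient.
discs = patients * len(disc_levels)

# Every disc is graded once by each evaluator in a single assessment round.
records_per_round = discs * evaluators

# The assessment was carried out twice, 12 weeks apart.
records_total = 2 * records_per_round
```

Here `discs` evaluates to 990 and `records_per_round` to 5940, matching the figures reported in the text.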

Inter-observer reliability
Based on reliability analysis of the results among the six evaluators, the overall inter-observer agreement of the modified Pfirrmann grading system was excellent by ICC and near perfect by wκ. The ICC value was 0.76 [95% CI, (0.74, 0.78)], while the wκ value was 0.82 [95% CI, (0.78, 0.86)]. The inter-observer agreement for each of the six cervical discs (C2/3, C3/4, C4/5, C5/6, C6/7, and C7/T1) was mostly good (ranging over fair to good, excellent, substantial, and near-perfect), which indicated no significant difference in agreement across cervical disc levels (Table 4).
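The verbal labels attached to these values come from the thresholds given in the statistical analysis section. A minimal Python sketch of that mapping is shown below. The function names are hypothetical, and two boundary choices are assumptions: the published ranges overlap at 0.40 and leave small gaps (for example between 0.74 and 0.75), so each band is treated here as starting at its lower bound, and negative kappa values, which the paper's scale does not label, are reported separately.

```python
def icc_level(icc):
    """Map an ICC value to the three-level scale quoted in the paper."""
    if icc >= 0.75:
        return "excellent"
    if icc >= 0.40:   # assumption: 0.40 itself counts as fair to good
        return "fair to good"
    return "poor"


def kappa_level(kappa):
    """Map a kappa value to the five-level scale quoted in the paper."""
    bands = [
        (0.81, "near perfect"),
        (0.61, "substantial"),
        (0.41, "moderate"),
        (0.21, "fair"),
        (0.00, "slight"),
    ]
    for lower_bound, label in bands:
        if kappa >= lower_bound:
            return label
    # Negative values are outside the ranges listed in the paper.
    return "less than chance agreement (not graded in the paper)"


overall_icc_label = icc_level(0.76)       # overall inter-observer ICC
overall_kappa_label = kappa_level(0.82)   # overall inter-observer weighted kappa
```

Applied to the overall results, 0.76 falls in the "excellent" band and 0.82 in the "near perfect" band, which is how the text describes them.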

Intra-observer reproducibility
Similar to the first assessment, the repeated assessment after 12 weeks indicated that the inter-observer agreement of the modified Pfirrmann grading system was also excellent or near-perfect [ICC = 0.79 (0.77, 0.81), wκ = 0.81 (0.77, 0.85)]. The subsequent reproducibility analysis of each evaluator's paired results showed excellent intra-observer agreement, with all ICC and wκ values higher than 0.80. The intra-observer agreement by disc level was excellent as well, showing no difference in agreement across cervical disc levels (Table 6).
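For readers who want to see how an ICC of this kind is obtained from paired readings, the Python sketch below computes a single-measure, consistency-type ICC from two-way ANOVA mean squares, often written ICC(3,1). This is an assumed variant: the paper specifies a two-way mixed model but does not say whether SPSS was set to consistency or absolute agreement, or to single or average measures. The function name and the example grades are illustrative only.

```python
def icc_consistency(ratings):
    """Single-measure consistency ICC from a discs-by-readings table.

    ratings is a list of rows, one per disc; each row holds one grade per
    reading (or per evaluator). Computed as
    (MS_rows - MS_error) / (MS_rows + (k - 1) * MS_error).
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need at least 2 discs and 2 readings in a full table")

    grand = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_rows + (k - 1) * ms_error
    if denominator == 0.0:
        return float("nan")   # no variation at all in the table
    return (ms_rows - ms_error) / denominator


# Illustrative grades for four discs read twice by one evaluator (made up).
example = [[1, 2], [2, 2], [3, 3], [4, 5]]
example_icc = icc_consistency(example)
```

Because this variant removes the column (reading) effect, a reader who grades every disc exactly one step higher on the second occasion would still score 1.0; an absolute-agreement ICC would penalize that shift.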

Discussion
In recent years, the incidence of CSM has significantly increased. 12-15 Cervical IDD is considered to be the trigger of the static mechanism, 16 and it will result in dynamic compression of the spinal cord if it occurs in a motion segment. 13 Briefly, IDD plays a leading role in the pathogenesis of CSM.
A recent study reported that non-surgical treatment is not suitable for moderate or severe CSM and that, once CSM is diagnosed, surgery should be performed as early as possible. 17 As cervical IDD largely determines the surgical strategy and plays a decisive role in prognosis, grading systems for IDD have been further improved, providing important clinical value for standardized evaluation.
In 2001, Pfirrmann et al. 18 developed the best-known MRI-based grading system, dividing IDD into five grades according to disc signal intensity, disc structure, distinction between nucleus and anulus, and disc height. Though this classification has been widely accepted and shown to have excellent inter- and intra-observer agreement, 19 one study 6 found that it lacked discriminatory power when applied to IDD in the elderly spine and that, on the basis of the images and descriptions provided, there were ambiguities in assigning IDD to one grade or another. To address these deficiencies, Griffith et al. 6 proposed a modified Pfirrmann classification that increased the 5 grades to 8 (Table 1, Figure 1) in order to improve discriminatory power when evaluating the elderly spine and to minimize ambiguity when selecting grades.
The establishment of the modified Pfirrmann grading system not only gives spine surgeons a clear definition of IDD, but also supports treatment planning and prediction of prognosis for patients with CSM. At present, the Japanese Orthopaedic Association (JOA) 20 and Neck Disability Index (NDI) 21 scoring systems are the most commonly used criteria for evaluating the treatment of patients with CSM. In particular, the JOA system divides CSM into three levels (mild, moderate, and severe) according to the score, to help physicians determine whether patients need surgery as soon as possible. It is worth mentioning that both scoring systems focus on patients, especially their functional status, and neither the JOA nor the NDI score assesses the cervical spine itself, whether vertebra, inter-vertebral disc, spinal canal, or spinal cord. Hence the advantage of the modified Pfirrmann grading system is obvious.
However, it must be emphasized that cervical IDD is only one consideration for surgery. Other important related factors include non-surgical treatment, cervical spine stability, operative segment, severity of spinal stenosis and spinal cord compression, prognosis, etc. Thus, the modified Pfirrmann grading system can only be regarded as an important reference for surgery, and the optimal treatment plan can be formulated by combining it with JOA and NDI scores.
The results show that the inter-observer agreement of the modified Pfirrmann grading system in both assessments (ICC: 0.76, 0.79; wκ: 0.82, 0.81) is slightly higher than that reported by Griffith et al., 6 while the intra-observer agreement in this study is excellent (ICC: 0.84; wκ: 0.87), similar to that of Griffith et al., 6 indicating that the modified Pfirrmann grading system has very good consistency. It is noteworthy that the evaluators involved in establishing the modified Pfirrmann grading system were all radiologists (two musculoskeletal radiologists and a general radiologist). In contrast, the six physicians in our study came from two specialties (three spine surgeons and three radiologists), which gave a multi-angle and more comprehensive view of the imaging manifestations of IDD and may be one of the factors behind the slight differences in results between the two articles.
The current study has limitations, and addressing them would help to better ascertain the inter- and intra-observer error of this grading system. Firstly, the sample size is relatively small. Though the number of patients included in our study is larger than that of Pfirrmann et al. 18 and Griffith et al., 6 further expanding the sample population would allow more meaningful statistical testing of agreement. Secondly, there is recall bias from evaluators, namely the deviation between repeated assessments seen in all evaluators, as shown in Table 6. This deviation was described by Wang YX et al., 22 whose study indicated no significant difference between repeated assessments performed on the same day by the same evaluator but an obvious deviation when the same evaluator made further assessments 8 months later. Thus, in any study setting, paired assessments should ideally be conducted within a short period of time, and the 12-week interval in our study might still be long. Thirdly, the difference in specialty is an important factor. We must point out that the radiologists did not specialize in the spine and lacked a deep understanding of IDD or the grading system, which may affect the accuracy of the final result. It may therefore be valuable to repeat this study with senior spine surgeons only, to explore whether a higher skill level and specialization yield better agreement than that achieved by junior evaluators or a multidisciplinary team. Finally, though postoperative IDD is a cause of poor prognosis and reoperation, we excluded patients with cervical instrumentation in order to judge the inter-vertebral disc more reliably. We will place more emphasis on postoperative IDD in our later study. High-quality, large-sample, multicenter studies should therefore be performed in our future clinical work to provide spine surgeons with the best evidence-based information.

Conclusion
The modified Pfirrmann grading system has excellent inter-observer reliability and intra-observer reproducibility for cervical IDD in CSM. It is also well suited to use by both spine surgeons and radiologists, so clinical and radiological studies applying it should be deemed accurate. Thus, the modified Pfirrmann grading system can be widely used as an appropriate instrument in clinical care. However, further prospective studies are still needed to evaluate whether this grading system allows better decision-making or prognosis prediction in individual patients.

Table 2. Levels of agreement for the ICC and κ statistics.

Table 3. Assigned grades for inter-vertebral discs in the two assessments.

Table 4. Reliability analysis by disc for the modified Pfirrmann grading system.

Table 5. Reliability analysis by specialty of evaluators for the modified Pfirrmann grading system.

Table 6. Reproducibility analysis for the modified Pfirrmann grading system. A, B, C, D, E, F represent the 6 evaluators who participated in the study.