This is the first study to evaluate the intra- and inter-observer agreement of individual architectural and cytological features of epithelial dysplasia in OLP and OLL according to the 2017 WHO criteria and binary system, and to discuss the challenges faced by pathologists in diagnosing epithelial dysplasia and its severity in these lesions.
Disagreement between evaluators is a well-known consequence of imposing artificial categories on continuous biological changes (14). Additionally, applying the "thirds" epithelial classification proposed by the WHO to the oral mucosa is difficult because the oral epithelium is highly heterogeneous, with wide variation in size, thickness, and architectural complexity. This makes the level of the "thirds" harder to define, particularly in very thin epithelia, in contrast to what occurs in lesions of the cervix, for example (15).
Thus, defining the degree of dysplasia based solely on the number of altered epithelial thirds oversimplifies the classification, as cytological atypia confined to the basal third may be sufficient for a diagnosis of severe dysplasia, depending on individual characteristics such as bulbous epithelial ridges, disorganization of basal cells, and marked cytological atypia (16). Conversely, lesions with mildly atypical features that extend into the middle third of the epithelium may warrant classification as mild dysplasia (17).
In our study, the architectural criteria with the highest inter-observer agreement were an increased number of mitotic figures, premature keratinization in single cells, and superficial mitoses. Among cytological criteria, abnormal variation in cell shape, abnormal variation in cell size, and hyperchromatism showed higher levels of agreement among observers. The most discordant criteria were loss of cohesion between epithelial cells and drop-shaped rete ridges.
The loss of cohesion between epithelial cells has been described as an easily recognizable feature (18). However, we found it difficult to differentiate from spongiosis caused by inflammation, which is common in oral epithelium, as well as from liquefactive degeneration of basal layer cells, findings characteristic of OLP and OLL. Additionally, an important characteristic of these lesions is the presence of saw-tooth rete ridges (19), which may have made it difficult for evaluators to assess drop-shaped projections.
Three other studies (20–22) have also reported which histopathological characteristics showed higher and lower agreement in PMD. In the study by Kujan et al. (21), four observers evaluated 68 cases of epithelial dysplasia using the 2005 WHO criteria (23). Among architectural changes, an increased number of mitotic figures and drop-shaped rete ridges had the highest levels of agreement among the observers. Among cytological changes, agreement was higher for increased nucleus size and variation in cell shape. Irregular stratification of the epithelium, loss of polarity, variation in nucleus size, atypical mitosis, and hyperchromatism had the greatest disagreements among observers (21).
Krishnan et al. (22), in the evaluation of 63 leukoplakia slides by three observers using the 2005 WHO (23) criteria, reported excellent agreement in the assessment of irregular epithelial stratification, abnormal variation in nucleus size and shape, and increased nucleus-to-cytoplasm ratio. However, they found greater disagreement in the evaluations of superficial mitosis, atypical mitosis, premature keratinization, and increased number of mitotic figures (22).
In the study by Ranganathan et al. (20), which used the 2005 WHO criteria (23), six evaluators analyzed 72 cases of PMD (leukoplakia, erythroplakia, proliferative verrucous leukoplakia, lichen planus, and submucous fibrosis). Agreement was higher for an increased number of mitotic figures, loss of cohesion of epithelial cells, and increased nucleus-to-cytoplasm ratio (20). The greatest disagreements were observed in irregular stratification of the epithelium, loss of polarity of basal layer cells, abnormal variation in nucleus shape, and abnormal variation in cell size.
Given these conflicting results (20–22), it is difficult to synthesize any guidance on which microscopic features are consistently reported among oral pathologists (20). However, although assessing an increase in the number of mitotic figures is subjective and no study specifies how many figures must be present to meet this criterion, our study and those of Kujan et al. (21) and Ranganathan et al. (20) all showed some agreement among evaluators on this feature. Furthermore, although both evaluators in our study reported difficulty, in some cases, in differentiating Civatte bodies from dyskeratosis located in the lower third of the epithelium during individual evaluations, we found considerable inter-observer agreement for this criterion.
The only previous study that evaluated inter-observer agreement of epithelial dysplasia in oral lichenoid diseases according to the 2017 WHO criteria (6) was conducted by Zohdy et al. (24) in 2021. These authors evaluated 84 cases of proliferative verrucous leukoplakia (PVL) and 28 cases of OLL and found wide variability in the interpretation of epithelial dysplasia, with low inter-observer reliability among the four examiners (24). The Kappa was classified as very low, measuring 0.05 and 0.11, respectively, in the two repeated evaluations among the examiners. Regarding the grading of epithelial dysplasia according to the WHO criteria (6), our study likewise found no statistically significant agreement among the evaluators, and the Kappa value was very low (κ = -0.107; no agreement).
A recent multicenter study (2020) conducted in India by Ranganathan et al. (20) evaluated the intra- and inter-observer agreement of six observers who examined PMD (leukoplakia, erythroplakia, proliferative verrucous leukoplakia, lichen planus, and submucous fibrosis) using the 2005 WHO classification (23). The study found substantial agreement between two evaluators (κ = 0.8; κw = 0.746; P = 0.012; 80%). However, the other observers had poor to fair agreement (κ = -0.029 to 0.372; no agreement to considerable agreement) (20).
These results reinforce that the grading of epithelial dysplasia is subjective, lacks reproducibility, and may be influenced by evaluators' experience, fatigue, and emotional factors (25). It has been shown that more anxious individuals tend to behave more negatively when making professional decisions (26), which suggests potential consequences for microscopic interpretation.
Speight et al. (14) highlighted in 2015 that the highest levels of disagreement occur when categorizing low-grade lesions. This may be because these lesions present more subtle changes, which can also be observed in reactive lesions as a result of an inflammatory infiltrate (14). We therefore believe that the low inter-observer agreement in our study may also be related to the fact that most of our sample of OLP and OLL presented subtle cytological and architectural changes, together with the intense inflammatory infiltrate that is characteristic of these lesions.
Regarding intra-observer agreement, the study by Zohdy et al. (24), conducted in oral lichenoid diseases, reported a slight improvement in intra-observer reliability when grading epithelial dysplasia by the WHO system (6), which varied among examiners (κ = 0.34, 0.36, 0.52, and 0.65; moderate to substantial agreement).
In our study, intra-observer agreement was likewise better for both evaluators. However, evaluator 1 showed higher agreement in evaluating cytological criteria, while evaluator 2 showed higher agreement in evaluating architectural criteria. Among architectural criteria, the highest agreements were for loss of polarity of basal layer cells, increased number of mitotic figures, and premature keratinization in single cells. Among cytological criteria, the highest agreements were for abnormal variation in nucleus shape and abnormal variation in cell shape. In the classification of epithelial dysplasia, both evaluators showed moderate intra-observer agreement regarding the 2017 WHO criteria (6).
In an attempt to minimize inter- and intra-observer disagreements, many authors (15, 20) recommend the use of the binary system (7), as they believe this approach is standardized and could overcome subjectivity in reporting epithelial dysplasia (20). However, in our study, despite finding better inter-observer agreement with the binary system (7) compared to the WHO system (6), evaluator 2 showed better reliability in assessing the classification of epithelial dysplasia according to the 2017 WHO criteria.
Despite its limitations, the evaluation of epithelial dysplasia offers the pathologist the best opportunity to convey the overall risk of malignancy to the clinician (14). Additionally, high concordance shows only that a method is reproducible, not that it is the most correct one to use. Thus, we emphasize the importance of defining guidelines to improve the interpretation of the criteria used to evaluate epithelial dysplasia and to reduce inter-observer variability. We also highlight the need for studies that seek to correlate the classification of epithelial dysplasia in both systems with the risk of malignant transformation.
The present study has some methodological limitations due to its retrospective design. In addition, although the pathologists trained in Pathology at the same institution and have worked together for years, no standardization was carried out before the evaluations. Furthermore, the literature has questioned whether the Kappa statistic, used to measure reproducibility in a dichotomous decision model in pathology, is the most appropriate measure (27), as its verbal descriptors of reproducibility are considered potentially arbitrary (15). However, it remains the most widely used measure in the majority of studies to date (20–22, 24).
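As a reading aid for the Kappa values discussed in this section, the following is a minimal sketch of unweighted Cohen's Kappa for two raters, computed as (observed agreement - chance agreement) / (1 - chance agreement). The function name `cohen_kappa` and the grades below are hypothetical and invented purely for illustration; they are not data from this study or from any of the cited studies, and the sketch does not reproduce the statistical software used in them.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's Kappa for two raters grading the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must grade the same non-empty set of cases")
    n = len(rater_a)
    # Observed agreement: proportion of cases given identical grades.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, derived from each rater's marginal grade frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[g] * freq_b[g] for g in freq_a) / (n * n)
    if p_e == 1:
        raise ValueError("Kappa is undefined when both raters use one identical grade")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical binary grades for 10 cases (illustration only, not study data).
rater_1 = ["low", "low", "low", "low", "high", "low", "low", "low", "low", "low"]
rater_2 = ["low", "low", "low", "low", "low", "high", "low", "low", "low", "low"]
kappa = cohen_kappa(rater_1, rater_2)
```

With these invented grades the two raters agree on 8 of 10 cases, yet Kappa is approximately -0.11: each rater assigns "low" to 9 of 10 cases, so agreement expected by chance alone is 82%, slightly above the 80% observed. This illustrates how a negative Kappa can coexist with high raw agreement when one category dominates the sample.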