In terms of agreement and reliability statistics for intra-rater variability, based on ΔE and d(0M1), the 2D system outperforms the 3D system both visually and electronically. All four methods show strong patterns of disagreement between repeated measurements in Bland-Altman plots. As hypothesized, the 3D system lacks reliability for hue compared with lightness and chroma, which is more pronounced visually than electronically. The SDCD differs among the four methods and is most favorable for the electronic 2D system. The agreement between the 2D and 3D systems in terms of ΔE is not good; it is lower within the electronic method than within the visual method. The comparability of the 2D and 3D systems remains uncertain because the confidence intervals of ICCs accounting for systematic error are wide. The systematic error between the 2D and 3D systems cannot be neglected. The reliability of the visual and the electronic method is substantially the same within the 2D and 3D systems; this comparability is fair to good.
We discuss the following aspects: 2D and 3D, visual and electronic, ΔE and d(0M1), Bland-Altman plots and statistics (patterns and numbers), single shade designations of the 3D system, validity and reliability, statistical SDCD and known thresholds, agreement and reliability (comparability), human and machine, and intra- and inter-method variability.
2D and 3D system
The 2D and 3D systems differ in the color space assessed [31]. Some 3D shades that are lighter (lightness) or stronger (chroma) are not well covered by the 2D system, which is especially pronounced for the additional bleaching shades available only in the 3D system. Compared with VC, the hue ranges of 3D Master are extended toward yellow-red, and 3D Master shades are more uniformly spaced than those of VC [4]. Conversely, there are spatial gaps in the 3D system, which are filled by the 2D system [31, 39]. In short, both guides are suboptimal and can be improved [12]. Intra-rater variability depends on training. For example, the intra-rater repeatability of the 3D Master shade guide is better than that of the VITA Lumin Vacuum among general practitioners but not among prosthodontic specialists [52]. Our experienced technician was not only trained but also calibrated and ophthalmologically examined to ensure an efficacy rather than an effectiveness approach [53]. The variability between raters, which was not investigated herein, may favor the 3D Master shade guide over the VC shade guide [54]. The coverage error favors the 3D system, although it is unclear whether the difference between the 2D and 3D systems is clinically relevant [10, 12, 55–57]. The accuracy of tooth shade measurement with an intraoral digital scanner was higher when the color was recorded as 3D Master values rather than VC values, whereas a visually perceptible color difference was found more often for VC values [58]. Repeatability was similar for both value types. For some tooth-colored dental materials, it has been suggested to convert 3D shades into VC (2D) shades, which adds a clinically relevant error compared with direct shade determination using the VC shade guide [59]. The clear patterns in Bland-Altman plots for d(0M1) question whether this transformation is meaningful.
Visual and electronic method
The aforementioned gaps filled by the 2D system are supported by additional 2D shades that assess quarter points for the second shade designation number [31], which is an important difference between the visual and the electronic method. A further important difference is the extension of the second shade designation number from the visual four-point scale to the electronic five-point scale. Similarly, the electronic 3D system includes bleaching shades not used by the visual 3D system herein. Thus, there were reasons to expect that a human rater would be inferior to the electronic rater, especially for the 2D system. It is noteworthy that intra-rater agreement in terms of ΔE and d(0M1) is better for the visual 2D measurement than for the electronic 3D measurement. Numerous studies compare instrumental methods with the conventional visual method [1, 6, 9, 13, 16–18, 20–25, 60]. Several studies found that instrumental methods are more accurate or reliable than visual measurements [9, 17, 21–23, 61–63]. Contrary to these findings, a recently published study of ΔE values showed that clinically relevant differences between visual evaluation and an intraoral scanning device (3Shape) are negligible [18]. According to Li and Wang, the reliability of shade matching can be ensured by neither the instrumental nor the visual approach [60]. Furthermore, studies indicate that the difference in color matching between human-eye assessment and computerized colorimetry depends on tooth type [16] and shade [6]. The color dimension with the greatest agreement between operator and spectrophotometer is value (lightness) [24]. No compatibility between visual and digital methods existed for MLR and chroma [64]; compatibility was determined only for the lightness of maxillary central and canine teeth at all regions of the labial surfaces [64].
Regarding repeatability, no significant differences were found among three shade guides in visual color assessment, although repeatability was relatively low (33–43%). Agreement with the colorimetric results was also low (8–34%) [65].
ΔE and d(0M1)
ΔE supports only statistics on agreement; neither Bland-Altman plots nor reliability statistics are feasible. In contrast, d(0M1) enables evaluating patterns of disagreement, further agreement statistics such as the SDCD, and reliability statistics including versions of the ICC that account for systematic errors. Regarding agreement of repeated measurements by the same rater, the differences among the four methods are substantially the same for ΔE < 2.7 and d(0M1). The level of agreement within fixed limits, however, is higher for d(0M1). For example, d(0M1) hardly differentiates 3M1 from 2L2.5 (d(0M1): 15.2 and 15.3, respectively), although their ΔE is 8.3. Thus, if higher lightness is compensated by lower chroma (or higher chroma by lower lightness), d(0M1) will not discriminate well. The systematic errors between 2D and 3D measurements in d(0M1) are plausible, because the 2D and 3D systems differ in the color space assessed (see above). Systematic errors between visual and electronic measurements are small but present within the 2D system, which can be explained by the additional quarter-point shades in the electronic 2D system. It is thus highly plausible that the corresponding systematic error in the 3D system is close to zero: the electronic 3D system does not differ from the visual one.
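The compensation effect described above can be illustrated with a minimal sketch. The CIELAB coordinates below are hypothetical, chosen purely for illustration (they are not the measured values of 3M1 or 2L2.5): two shades can lie at nearly the same Euclidean distance from a common reference, and thus receive nearly identical scalar distance values, while still being clearly distinct from each other in ΔE*ab terms.

```python
import math

def delta_e(lab1, lab2):
    """Euclidean CIELAB color difference (Delta E*ab)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lab1, lab2)))

# Hypothetical CIELAB coordinates (L*, a*, b*) - illustrative only
ref_shade = (90.0, 0.5, 10.0)   # bright, low-chroma reference (a 0M1-like shade)
shade_x = (78.0, 1.0, 18.0)     # darker, more chromatic
shade_y = (84.0, 1.5, 23.0)     # lighter, but even more chromatic

d_x = delta_e(ref_shade, shade_x)  # scalar distance to the reference
d_y = delta_e(ref_shade, shade_y)

print(round(d_x, 1), round(d_y, 1))         # both ~14.4: the scalar hardly differs
print(round(delta_e(shade_x, shade_y), 1))  # yet the mutual Delta E is large
```

Here the lower lightness of one shade is compensated by the higher chroma of the other, so a single distance-to-reference value cannot separate them.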
Bland-Altman plots and statistics – patterns and numbers
According to the Bland-Altman plots, the bias between the 2D and 3D systems is neither constant nor uniquely proportional. Even if these kinds of bias could be adjusted for, as suggested for uniquely proportional bias [46, 47], the clear patterns do not lend themselves to such sophisticated statistical correction. Thus, Bland-Altman plots provide important information hardly conveyed by summary numbers alone.
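The construction behind these plots is simple; the diagnostic value lies in inspecting the scatter for patterns that the summary numbers hide. A minimal sketch of how the bias and the 95% limits of agreement are obtained, using made-up paired d(0M1) values (not data from this study):

```python
import statistics

def bland_altman(x, y):
    """Return bias (mean difference) and the 95% limits of agreement
    for paired measurements x and y. In the plot, each difference is
    drawn against the mean of its pair."""
    diffs = [a - b for a, b in zip(x, y)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Made-up paired d(0M1) values from two methods (illustrative only)
method_a = [11.0, 12.5, 14.0, 15.5, 17.0, 19.0, 20.5, 22.0]
method_b = [11.4, 12.1, 14.6, 15.0, 17.8, 18.5, 21.3, 21.2]

bias, lower, upper = bland_altman(method_a, method_b)
print(round(bias, 2), round(lower, 2), round(upper, 2))
```

The three numbers alone cannot reveal whether the differences trend with the magnitude of the measurement, which is exactly the kind of pattern the plot exposes.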
Single shade designations of the 3D system and d(0M1)
Although the reliability for the hue component of the visual 3D system is zero, the corresponding d(0M1) indicates good reliability. Likewise, for the electronic 3D system, the respective reliabilities are fair versus very good. Thus, reliabilities of single shade designations can be misleading, especially for hue, for which ΔE values are only about 1.5 (see above). Nevertheless, the hue component of the 3D system is problematic, because its reliability is lower than those of lightness and chroma.
Validity and reliability
Colorimetry does not facilitate valid measurements. The value of d(0M1), however, supports pseudo-valid measurements, as the range of d(0M1) values differs across the four methods. The bleaching shades added to the electronic 3D system (but not to the visual 3D system) make the difference: its range (21.6) is about twice that of the visual 2D system (11.0). Reliability in terms of the ICC depends on this range: if the variability of d(0M1) is small, the ICC will be small. As expected, the pooled standard deviation of the electronic 3D system is higher than that of the electronic 2D system. The ICC of the electronic 3D system, however, is lower, which emphasizes the problems of the 3D system, independent of human raters.
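The dependence of the ICC on the spread of measured values can be made explicit. In the simple one-way random-effects form, the ICC is the between-subject variance divided by the total variance. A rough sketch with illustrative variance components (these are not estimates from this study) shows that narrowing the d(0M1) range lowers the ICC even when the measurement error is unchanged:

```python
def icc_one_way(var_subjects, var_error):
    """One-way random-effects ICC: between-subject variance
    as a fraction of total variance."""
    return var_subjects / (var_subjects + var_error)

# Illustrative variance components (not estimates from this study)
var_error = 1.0  # identical measurement error in both scenarios

wide_range = icc_one_way(var_subjects=16.0, var_error=var_error)   # wide d(0M1) spread
narrow_range = icc_one_way(var_subjects=4.0, var_error=var_error)  # narrower spread

print(round(wide_range, 2), round(narrow_range, 2))
```

The same measurement error thus yields a clearly lower ICC once the true spread of shades shrinks, which is why ICC comparisons across methods with different d(0M1) ranges must be interpreted with care.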
Smallest detectable color difference, acceptable and perceptible thresholds
An acceptability threshold of 2.7 and a perceptibility threshold of 1.2 in ΔE are established [14]. The SDCD in terms of d(0M1) depends on the method and is diminished from 2.8 to 1.0 for a row of eight teeth using electronic 2D measurements. These values are statistical estimates and can differ from study to study. However, it is plausible that electronic 2D is the method with the best agreement, including the SDCD. Given the properties of ΔE and d(0M1), electronic 2D is the recommended method for study designs with repeated measurements, such as longitudinal studies.
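If the reported reduction reflects averaging over the eight teeth of a row, the arithmetic can be checked with the common relation SDCD = 1.96 · √2 · SEM: averaging n independent measurements shrinks the SEM, and hence the SDCD, by a factor √n. This is a sketch under that assumption, not a reproduction of the study's computation:

```python
import math

def sdcd(sem):
    """Smallest detectable (color) difference from the standard error
    of measurement: SDCD = 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2) * sem

# Single-tooth SDCD of 2.8 (electronic 2D, from the text) implies this SEM:
sem_single = 2.8 / (1.96 * math.sqrt(2))

# Assumption: a row of 8 teeth acts as 8 independent measurements,
# so the SEM of the row mean shrinks by sqrt(8)
n_teeth = 8
sdcd_row = sdcd(sem_single / math.sqrt(n_teeth))

print(round(sdcd_row, 1))  # ~1.0, consistent with the reported row-level value
```

The match with the reported value of 1.0 (since 2.8 / √8 ≈ 0.99) supports this reading of the row-level SDCD.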
Agreement and reliability (comparability)
Whereas the agreement of repeated measurements by the same rater in terms of SEM and SDCD does not differ between visual and electronic 3D measurements, the reliability (ICC) differs substantially. Thus, for a longitudinal study using the 3D system, a single human rater is not worse than the electronic device. The comparability of the four methods remains uncertain. Therefore, the same method should also be used throughout multicenter studies.
Human and machine
A set of human raters may cause additional problems concerning agreement and reliability. Compared with a set of human raters, a set of devices from the same electronic system should have a higher level of standardization [66], which corresponds to the more favorable ICCs observed. However, n-of-1 trials, as used herein [37] for the single human rater, limit generalizability. It may further be argued that the human rater lacks the ability to perceive hue. But even if the examiner had lacked this ability, it would not have invalidated our conclusions, because we do not make an isolated statement on hue but compare hue with lightness and chroma. These within-human comparisons are supported by the n-of-1 trial design. Moreover, the same within-device comparisons support the hypothesis that hue is not well reproducible; the electronic reliability of hue is merely fair. In addition to our findings, background knowledge further supports that 3D hue cannot be well assessed (see Introduction).
Intra- and inter-method variability – validity revisited
Whereas the reliability within each of the four methods is good to very good, the comparability of the visual and electronic measurements is only fair to good. This also calls into question the validity of the visual and electronic measurements. In turn, this question extends to the difference between the 2D and 3D systems. In fact, Bland-Altman plots for the 2D system suggest that both visual and electronic values are valid only in the d(0M1) ranges of about 12 (A1 – A2, B1 – B2) and greater than 20 (A4, B3 – B4, C3 – C4, D4). The shades B1 and A2 are not well covered by the 3D system [31], which is mirrored in the corresponding Bland-Altman plots. Vice versa, the 3D shades 1M1 and 1M2 (both with d(0M1) < 11.2, the minimum of the 2D system) are not well covered by the 2D system [31] and question the validity of the neighboring 2D shades A1, B1, and B2. In everyday practice, the 3D system may be useful for shades not available in the 2D system. Nevertheless, switching between methods cannot be recommended in scientific studies. The 3D system, however, can be favorable in bleaching studies owing to the added bleaching shades.