In terms of ΔE and d(0M1), the 2D system proved superior to the 3D system for both the visual and the electronic method, as judged by statistics of agreement and reliability for intra-rater variability. In Bland-Altman plots, all four methods showed strong patterns of disagreement between repeated measurements. As hypothesized, the 3D system was less reliable for hue than for lightness and chroma, a phenomenon more pronounced visually than electronically. The SDCD differed among the four methods and was most favorable for the electronic 2D system. The agreement between the 2D and 3D systems in terms of ΔE was poor, and lower for the electronic than for the visual method. The comparability of the 2D and 3D systems remained uncertain, because the confidence intervals of ICCs accounting for systematic error were wide; the systematic error between the two systems cannot be neglected. The reliability of the visual and electronic methods was substantially the same within the 2D and within the 3D system; this comparability was fair to good.
Below, the following aspects are discussed: 2D and 3D, visual and electronic, ΔE and d(0M1), Bland-Altman plots and statistics (patterns and numbers), single shade designations of the 3D system, validity and reliability, statistical SDCD and known thresholds, agreement and reliability (comparability), human and machine, and intra- and inter-method variability.
2D and 3D systems
The 2D and 3D systems differ in the color space assessed [33]. Some 3D shades that are lighter (lightness) or stronger (chroma) are not well covered by the 2D system, which is especially pronounced for the additional bleaching shades available only in the 3D system. Compared to VC, the hue ranges of 3D Master are extended toward yellow-red, and 3D Master shades are more uniformly spaced than those of VC [6]. In contrast, there are spatial gaps in the 3D system which are filled in the 2D system [33, 41]. In short, both guides are suboptimal and can be improved [14].
The variability between raters may favor the 3D Master shade guide over the VC shade guide [58]. The coverage error favors the 3D system, although it is unclear whether the difference between the 2D and 3D systems is clinically relevant [12, 14, 59-61]. However, the clear patterns in Bland-Altman plots for d(0M1) cast doubt on the meaningfulness of converting 3D shades into VC shades (2D) as suggested elsewhere [62].
Visual and electronic methods
The gaps mentioned above that are filled by the 2D system correspond to additional 2D shades used to assess quarter-points for the second shade designation number [33], an important difference between the visual and electronic methods. A further important difference is the extension of the second shade designation number from the visual four-point scale to the electronic five-point scale. Similarly, the electronic 3D system includes bleaching shades not used by the visual 3D system evaluated here. Thus, the human rater could have been expected to be inferior to the electronic rater, especially for the 2D system. Notably, however, intra-rater agreement in terms of ΔE and d(0M1) was better for the visual 2D measurement than for the electronic 3D measurement.
Several studies have found that instrumental methods are more accurate or reliable than visual measurements [11, 19, 23-25, 63-65]. A recent study, however, has shown that clinically relevant differences between the visual evaluation and the intraoral scanning device (3Shape) are negligible [20]. According to Li & Wang, the reliability of shade matching can be ensured neither by the instrumental nor by the visual approach [66]. Furthermore, the difference in color matching between human-eye assessment and computerized colorimetry depends on tooth type [18] and shade [8].
ΔE and d(0M1)
ΔE supports only statistics of agreement; neither Bland-Altman plots nor reliability statistics are feasible. d(0M1), in contrast, enables the evaluation of patterns of disagreement, further agreement statistics such as the SDCD, and reliability statistics including versions of the ICC that account for systematic error. Regarding agreement of repeated measurements by the same rater, the differences among the four methods are substantially the same for ΔE < 2.7 and for d(0M1). The level of agreement within fixed limits, however, is higher for d(0M1). For example, d(0M1) hardly differentiates 3M1 from 2L2.5 (d(0M1): 15.2 and 15.3, respectively), although their ΔE is 8.3. Thus, if higher lightness is offset by lower chroma (or lower lightness by higher chroma), d(0M1) will not discriminate well. The systematic errors between 2D and 3D measurements in d(0M1) are plausible, because the 2D and 3D systems differ in the color space assessed (see above). Within the 2D system, systematic errors between visual and electronic measurements are small, which can be explained by the additional quarter-point shades of the electronic 2D system.
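This limitation of d(0M1) can be sketched numerically. The following minimal Python sketch uses hypothetical CIELAB coordinates (illustrative values, not the measured values of this study, and the assumed coordinates of 0M1 are likewise invented): two shades lying in different directions from the reference shade 0M1 have nearly identical d(0M1) yet a pairwise ΔE well above the 2.7 acceptability threshold.

```python
import math

def delta_e_ab(c1, c2):
    """CIE76 colour difference: Euclidean distance in CIELAB (L*, a*, b*)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

# Hypothetical CIELAB coordinates for illustration only (not measured values):
ref_0m1 = (88.0, 0.5, 12.0)   # assumed lightest reference shade 0M1
shade_a = (78.0, 2.0, 23.5)   # less light, more chromatic than the reference
shade_b = (74.0, 1.0, 17.0)   # even less light, but also less chromatic

d_a = delta_e_ab(shade_a, ref_0m1)        # distance to 0M1, ~15.3
d_b = delta_e_ab(shade_b, ref_0m1)        # distance to 0M1, ~14.9
pairwise = delta_e_ab(shade_a, shade_b)   # ~7.7, well above the 2.7 threshold
print(f"d(0M1): {d_a:.1f} vs {d_b:.1f}; pairwise dE: {pairwise:.1f}")
```

Because d(0M1) collapses a three-dimensional difference onto a single distance from one reference point, trade-offs between lightness and chroma are invisible to it, exactly as described above for 3M1 versus 2L2.5.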
Bland-Altman plots and statistics – patterns and numbers
According to the Bland-Altman plots, the bias between the 2D and 3D systems is neither constant nor uniquely proportional. Even if such bias could be adjusted for – as suggested for uniquely proportional bias [48, 49] – the clear patterns observed here are not amenable to such statistical adjustment. Thus, Bland-Altman plots provide important information that is hardly available from summary numbers alone.
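For readers unfamiliar with the underlying numbers, a Bland-Altman analysis reduces to the mean difference (bias) and the 95% limits of agreement. The sketch below uses invented paired d(0M1) readings (not study data); the differences growing with the means mimic a proportional bias that the two summary numbers alone would obscure.

```python
import statistics

# Hypothetical paired d(0M1) readings from two methods (invented, not study data):
method_1 = [4.0, 7.5, 10.2, 12.8, 15.1, 18.0, 20.4, 21.6]
method_2 = [3.1, 6.0,  9.0, 11.1, 13.0, 15.2, 17.0, 18.1]

diffs = [a - b for a, b in zip(method_1, method_2)]   # per-pair differences
means = [(a + b) / 2 for a, b in zip(method_1, method_2)]  # per-pair means (x-axis)

bias = statistics.mean(diffs)               # mean difference = bias
sd = statistics.stdev(diffs)                # SD of the differences
loa = (bias - 1.96 * sd, bias + 1.96 * sd)  # 95% limits of agreement
print(f"bias={bias:.2f}, LoA=({loa[0]:.2f}, {loa[1]:.2f})")
```

Plotting `diffs` against `means` would show the differences drifting upward with the means – the kind of pattern that must be inspected visually rather than read off the bias and limits of agreement.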
Single shade designations of the 3D system and d(0M1)
Although the reliability of the hue component of the visual 3D system is zero, the corresponding d(0M1) indicates good reliability; likewise, for the electronic 3D system the respective reliabilities are fair versus very good. Thus, reliabilities of single shade designations can be misleading, especially for hue, for which ΔE values are only about 1.5 (see above). Nevertheless, the hue component of the 3D system is problematic, because its reliability is lower than those of lightness and chroma.
Validity and reliability
Colorimetry does not facilitate valid measurements. The value of d(0M1), however, supports pseudo-valid measurements, as the range of d(0M1) values differs across the four methods. The bleaching shades added to the electronic 3D system (but not to the visual 3D system) make the difference: this range (21.6) is about twice that of the visual 2D system (11.0). Reliability in terms of the ICC depends on this range – if the variability of d(0M1) is small, the ICC will be small. As expected, the pooled standard deviation of the electronic 3D system is higher than that of the electronic 2D system. The ICC of the electronic 3D system, however, is lower, which underlines the problems with the 3D system – independent of human raters.
Smallest detectable color difference, acceptable and perceptible thresholds
An acceptability threshold of 2.7 and a perceptibility threshold of 1.2 in ΔE are known [16]. The SDCD in terms of d(0M1) depends on the method and decreases from 2.8 to 1.0 for a row of eight teeth using electronic 2D measurements. These are statistical values and can differ from study to study. It is plausible, however, that electronic 2D is the method with the best agreement, including the SDCD. Given the properties of ΔE and d(0M1), electronic 2D is therefore the recommended method for study designs with repeated measurements, such as longitudinal studies.
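One plausible reading of these numbers – an assumption of ours, not a formula stated in the text – uses the common smallest-detectable-change convention SDC = 1.96·√2·SEM, reduced by √n when averaging over n independent teeth. Under that convention, a single-tooth SDCD of 2.8 drops to about 1.0 for a row of eight teeth, matching the values reported above.

```python
import math

def sdcd(sem, n_teeth=1):
    """Smallest detectable change: 1.96 * sqrt(2) * SEM, reduced by sqrt(n)
    when averaging over n independent teeth (assumed convention, see lead-in)."""
    return 1.96 * math.sqrt(2) * sem / math.sqrt(n_teeth)

# Back-solve the SEM that corresponds to a single-tooth SDCD of 2.8:
sem = 2.8 / (1.96 * math.sqrt(2))
print(round(sdcd(sem), 1), round(sdcd(sem, 8), 1))  # single tooth vs row of eight
```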
Agreement and reliability (comparability)
Whereas intra-rater agreement of repeated measurements in terms of the SEM and SDCD does not differ between visual and electronic 3D measurements, the reliability (ICC) differs substantially. Thus, for a longitudinal study using the 3D system, a single human rater is not worse than the electronic device. The comparability of the four methods remains uncertain; therefore, the same method should also be used across centers in multicenter studies.
Human and machine
Compared with a set of human raters, a set of devices from the same electronic system should have higher levels of standardization [67], which corresponds to the more favorable ICCs observed. However, n-of-1 trials, as used herein for the single human rater, limit generalizability. It may be further argued that the human rater lacks the ability to perceive hue [39]. But even if the examiner had lacked this ability, this would not have invalidated our conclusions, because we did not make an isolated statement on hue, but rather compared hue with lightness and chroma. These intra-human comparisons are supported by the n-of-1 trial design. Moreover, the same intra-device comparisons support the hypothesis that hue is not well reproducible; the electronic reliability of hue is merely fair. In addition to our findings, background knowledge further supports that 3D hue cannot be well assessed (see Introduction).
Intra- and inter-method variability – validity revisited
Whereas the reliability within each of the four methods is good to very good, the comparability of the visual and electronic measurements is only fair to good. This also calls the validity of the visual and electronic measurements into question, and the question extends to the difference between the 2D and 3D systems. In fact, Bland-Altman plots for the 2D system suggest that both visual and electronic values are valid only for d(0M1) values of about 12 (A1 – A2, B1 – B2) and greater than 20 (A4, B3 – B4, C3 – C4, D4). The shades B1 and A2 are not well covered by the 3D system [33], which is mirrored in the corresponding Bland-Altman plots. Conversely, the 3D shades 1M1 and 1M2 (both d(0M1) < 11.2, the minimum of the 2D system) are not well covered by the 2D system [33] and call the validity of the adjacent 2D shades A1, B1, and B2 into question. In daily practice, the 3D system may be useful for shades not available in the 2D system. Nevertheless, switching between methods cannot be recommended in scientific studies. The 3D system, however, may be preferable in bleaching studies owing to the added bleaching shades.