3.1. Effects of Equalization and Tile Size
We explored variations in equalization technique and tile size in tandem. Training and testing through the four cross-validation folds at each tile size, and for each approach to equalization (including unmodified grayscale images), revealed distinct advantages for 200×200-pixel tiles and CLAHE preprocessing. Perhaps not surprisingly, given the muddy, noisy images that histogram equalization produces (exemplified in Fig. 1), that technique yielded few usable tiles after entropy sifting and was not explored further.
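As an illustration of this preprocessing-and-sifting pipeline, the sketch below applies CLAHE to a grayscale image and discards low-entropy tiles; the clip limit, grid size, and entropy threshold shown are illustrative assumptions rather than our exact settings.

```python
# Illustrative sketch of CLAHE preprocessing followed by entropy sifting.
# The clip limit, grid size, and entropy threshold are assumed values,
# not necessarily those used in our experiments.
import cv2
from skimage.measure import shannon_entropy

def clahe_and_sift(path, tile=200, clip_limit=2.0, grid=(8, 8), min_entropy=4.0):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    eq = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=grid).apply(gray)
    kept = []
    for y in range(0, eq.shape[0] - tile + 1, tile):
        for x in range(0, eq.shape[1] - tile + 1, tile):
            t = eq[y:y + tile, x:x + tile]
            # Entropy sifting: drop near-blank tiles with little mark content.
            if shannon_entropy(t) >= min_entropy:
                kept.append(t)
    return kept
```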
Tiles of all sizes derived from the CLAHE images performed well across the in-sample folds. Although Fig. 3 suggests only a marginal advantage for 200×200-pixel tiles, in fact the superiority was more pronounced: in each of the four folds, perfect classification accuracy was attained after just a few epochs and largely persisted thereafter. At all other tile sizes, by contrast, peak accuracy was less than perfect or, in the case of 350×350-pixel tiles, occurred later and with fewer saved models.
Model performance differed markedly on the out-of-sample set, with lower accuracies and sharper differences among tile sizes. The accuracies illustrated in Fig. 3 reflect averaging across the best models from the four cross-validation folds: each image was classified by the best-performing model from each fold, and the four resulting probabilities were averaged to produce a final classification for the image.
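Concretely, the averaging step can be sketched as follows; `fold_models` and `predict_image_proba` are hypothetical stand-ins for the best saved model of each fold and its tile-aggregation logic.

```python
import numpy as np

def ensemble_classify(image_tiles, fold_models, threshold=0.5):
    """Classify one image by averaging per-fold probabilities.

    Each best fold model scores the image's tiles and aggregates them
    into a single probability that the drawing is by Raphael
    (predict_image_proba is a hypothetical helper); the four fold
    probabilities are then averaged into the final prediction.
    """
    probs = [model.predict_image_proba(image_tiles) for model in fold_models]
    mean_prob = float(np.mean(probs))
    return ("Raphael" if mean_prob >= threshold else "Not Raphael"), mean_prob
```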
Although tiles produced from pure grayscale images, or from images preprocessed using background normalization (histogram peak shifting; see Table 1), resulted in lower accuracies than the CLAHE set, 200×200-pixel tiles invariably yielded the best results among the tile sizes tested. Why a particular tile size proves optimal for a given artist may be debatable; that one will emerge, however, seems inevitable, at least in our experience. This holds across artists and media, as well as for subject matter far afield of art [20]. For Raphael, it seems clear that limiting CNN attention to stroke-level detail cannot resolve distinctions between his drawings and those of others; his distinctive "signature," as recognized by the CNN, emerges at a larger feature level (see Fig. 2).
Holding tile size constant at 200×200 pixels, we can examine the effects of preprocessing technique. Fig. 4 shows results for histogram peak-shifted, CLAHE, and pure grayscale tiles across the four folds for both in-sample and out-of-sample datasets. Table 1 reports the fold accuracy means and standard deviations. CLAHE-preprocessed images produced not only higher maxima and mean values but also tighter spreads across folds; the latter suggests less accuracy-compromising noise, which may largely account for the superior performance.
| Dataset | Equalization Type | Mean Accuracy | Standard Deviation |
| --- | --- | --- | --- |
| In-sample | Histogram shift | 0.87 | 0.03 |
| In-sample | CLAHE | 1.0 | 0 |
| In-sample | None | 0.77 | 0.04 |
| Out-of-sample | Histogram shift | 0.68 | 0.067 |
| Out-of-sample | CLAHE | 0.77 | 0.019 |
| Out-of-sample | None | 0.57 | 0.106 |
Table 1 – Mean accuracy scores and standard deviations across four cross-validation folds for in-sample and out-of-sample datasets using different forms of preprocessing
3.2. Chalk vs. Pen: One Class or Two?
We next investigated whether accuracy could be improved by segregating chalk (including charcoal) drawings from those made with a sharp tool, which we collectively refer to as “pen” drawings. The possibility seemed realistic given the clear differences in mark appearance and application technique. Indeed, our initial assumption was that the visual differences might be so pronounced as to require separate classes. As we assembled the dataset, however, we found that many works combine both types of media in varying proportions. Consequently, we knew that efforts to identify a dominant medium would involve subjective judgment and risk biasing the results.
Using all of the pen-dominant drawings from the curated dataset and supplementing these with additional pen-dominant drawings produced a set of 169 images (74 Raphael and 95 comparative), while for chalk-dominant drawings the set contained 94 images (30 Raphael and 64 comparative). These datasets were considerably smaller than our curated set. To preserve a sufficient number of test images, we used three-fold (rather than four-fold) cross-validation for the chalk-dominant set.
After CLAHE preprocessing, we overlapped tiles sufficiently to achieve at least 12,500 total Raphael tiles and a similar number of comparative tiles. The smaller number of source images required greater overlap than was necessary for the curated dataset at the same 200×200-pixel tile size: whereas the latter required 80% overlap, the pen and chalk datasets required overlaps closer to 90%.
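The stride arithmetic behind these overlap figures is straightforward; the sketch below (image width illustrative) shows how tile yield grows as overlap increases.

```python
def tile_positions(image_dim, tile=200, overlap=0.80):
    """Top-left coordinates along one image axis for overlapped tiling.

    stride = tile * (1 - overlap): 40 px at 80% overlap and 20 px at
    90% overlap for 200x200-pixel tiles.
    """
    stride = max(1, round(tile * (1.0 - overlap)))
    return list(range(0, image_dim - tile + 1, stride))

# For an illustrative 2000-pixel-wide image, 80% overlap yields 46 column
# positions while 90% yields 91, roughly quadrupling the tile count once
# both axes are overlapped.
print(len(tile_positions(2000, overlap=0.80)))  # 46
print(len(tile_positions(2000, overlap=0.90)))  # 91
```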
We need not have worried about bias in identifying a dominant medium: as set forth in Table 2, the results obtained with separate chalk and pen datasets were inferior.
| Dataset | Drawing Set | Mean Accuracy | Standard Deviation |
| --- | --- | --- | --- |
| In-sample | Pen-dominant | 0.84 | 0.03 |
| In-sample | Chalk-dominant | 0.80 | 0.07 |
| Out-of-sample | Pen-dominant | 0.61 | 0.07 |
| Out-of-sample | Chalk-dominant | 0.47 | 0.02 |
Table 2 – Mean accuracy scores and standard deviations for pen-dominant and chalk-dominant datasets
The models trained solely on chalk-dominant or pen-dominant images sharply underperformed models trained on the mixed curated image set. As reported in Table 1, averaging across the best fold models trained on the curated set produced an out-of-sample classification accuracy of 0.77: of the 48 images in the mixed-medium out-of-sample set, 37 were classified correctly and 11 incorrectly. The 11 errors were nearly evenly split (six chalk-dominant, five pen-dominant), and all were false positives. The mixed out-of-sample set was itself roughly balanced between pen-dominant (25) and chalk-dominant (23) images.
The models trained on chalk-dominant or pen-dominant images, by contrast, were tested only against out-of-sample images of the same dominant medium. Why do models trained on a dataset that includes both pen-dominant and chalk-dominant drawings perform so much better than models trained on datasets that separate the two? Indeed, for models trained on chalk-dominant drawings, the out-of-sample results were worse than guessing.
Certainly, part of the answer lies in the smaller size of the chalk-dominant dataset. But the large performance disparity suggests other, more influential factors at work. Chalk drawings, particularly those that are centuries old, have far more variability in quality of mark than pen drawings due to the greater material vulnerability of chalk. The mark character of chalk also varies more, from soft and broad to bolder and finer strokes. The Viti drawing illustrated in Fig. 2, for example, is diffuse throughout with broad marks. It was misclassified by all CNNs tested, including those trained on the mixed dataset. Such visual features may simply have too little differentiation among artists drawing in a similar style to be resolved by the CNN.
Moreover, as noted, many drawings contain both chalk and pen passages. An artist’s signature style may reflect not only how specific media are applied but how they are combined. If training images are skewed toward chalk-like or pen-like drawings, the CNN will only rarely encounter the combined media during training. This effectively removes a signature characteristic — i.e., a degree of freedom — from consideration relative to the more broadly trained CNN.
3.3. Behavior Across Cross-Validation Folds
It is well known that CNN performance can be affected unpredictably by various forms of image noise, even in small amounts [21]. Indeed, this unpredictability is often exploited to undermine face recognition and other CNN-based detection systems [22]. CNN-spoofing noise sources include lighting variations, blur, differences in camera sensors, and contrast variations [21,23]. We found that this unpredictable sensitivity can itself be exploited: divergent model responses reveal the presence of accuracy-compromising conditions, allowing us to better gauge the inherent accuracy of a prediction.
Because each of the best cross-validation models was trained on a different but largely overlapping mix of training images, their predictions should be similar for a new image. The number of training images in each fold seemed sufficient, given our experience with other artists, to support good (and consistent) model performance. We verified this by training new models on the entire in-sample image corpus and testing against the out-of-sample set. Although this carries some risk of overfitting to the out-of-sample set, the results reflected only a modest improvement (from an accuracy of 0.77 to 0.83) over averaging the best cross-validation models. Those models, in other words, performed well enough individually that their predictions should be similar.
Various factors may account for divergence among model predictions for a given image. These may arise from characteristics of the drawing itself (rough handling and the passage of time, as described above) or from the training underlying the CNN models. In the former case, the image may contain noise or blur not immediately apparent on visual inspection, and because CNN sensitivity to such defects is unpredictable, different models respond differently. In the latter case, the training set may simply lack enough differently labeled images with sufficient visual similarity to the work under study to classify it reliably. Either way, the degree of divergence (the spread between prediction extrema) strongly correlates with the accuracy of the averaged prediction. For example, a Hebborn forgery in the out-of-sample set was weakly classified (with an averaged probability of 0.54) as drawn by Raphael; in fact, the fold probabilities ranged from 0.18 to 0.75, a very wide spread clearly indicating an unreliable prediction. Had the training set included more Hebborn (or Hebborn-like) images, the output probabilities might have converged toward a correct classification.
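The divergence measure itself is simple to compute; a minimal sketch, where `fold_probs` holds the per-fold probabilities for one image:

```python
import numpy as np

def averaged_prediction_with_spread(fold_probs):
    """Average the best-fold probabilities and report their spread.

    For the Hebborn forgery discussed above, probabilities spanning
    0.18 to 0.75 give a spread of 0.57 despite an average near 0.54,
    flagging the averaged classification as unreliable.
    """
    return float(np.mean(fold_probs)), max(fold_probs) - min(fold_probs)
```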
| Spread | Surviving Image Fraction | Accuracy |
| --- | --- | --- |
| ≤ 0.25 | 0.75 | 0.81 |
| ≤ 0.20 | 0.63 | 0.83 |
| ≤ 0.15 | 0.50 | 0.92 |
| < 0.14 | 0.44 | 1.0 |
| ≤ 0.10 | 0.33 | 1.0 |
Table 3 – Effect of spread between minimum and maximum probabilities across best fold models on classification accuracy. The surviving image fraction represents the proportion of images whose spread satisfies the constraint. Accuracy reaches 100% when the spread is below 0.14.
As indicated in Table 3, when the probability spread falls below 0.14, all qualifying images are classified properly. The price of increased accuracy, however, is the exclusion of images we might wish to study: achieving 100% accuracy on our out-of-sample set required excluding more than half of its images. Of course, the fact that the out-of-sample set includes many noticeably flawed images makes it something of a worst case.
Even with the largest probability spreads, classification accuracy never falls below 0.77. Table 3, although based on a limited number of out-of-sample images, suggests how inherent accuracy varies with the spread. The averaged classification probability assigned to a candidate image can be qualified by the accuracy limit corresponding to the observed spread to produce a more meaningful, adjusted prediction. In addition, we found all classification errors to be false positives. As a result, these inherent accuracy limits do not directly apply to classifications of candidate works as not by Raphael, since our models have yet to produce a false negative.
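A sketch of how such qualification might be automated follows; the bands simply restate Table 3, and treating each group accuracy as a per-image bound is an assumption.

```python
def accuracy_for_spread(spread):
    """Empirical accuracy from Table 3 for an observed fold-probability
    spread, falling back to the worst case of 0.77 for wider spreads.
    Applies only to positive (Raphael) classifications, since all
    observed errors were false positives.
    """
    if spread < 0.14:
        return 1.0
    if spread <= 0.15:
        return 0.92
    if spread <= 0.20:
        return 0.83
    if spread <= 0.25:
        return 0.81
    return 0.77
```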