Designing a controlled experiment to study the so-called painter’s hand
Historical paintings have known and unknown variables that contribute to their physical states, including the materials used, the artist's or artists' technique and style, and the damage and restoration that have occurred over time. Each of these can contribute to the attribution of a painting. The aim of this experiment is to test two hypotheses: (a) that high-resolution profilometry data from a painting's brushstroked surface contain stylistic information (i.e., painters leave behind a measurable “fingerprint”), and (b) that ML analysis of these topographic data can quantitatively distinguish among painters. The experimental goal is therefore to categorize small areas from the surface of paintings by their stylometric information, without the influence of purposeful stylistic choices (of tools or materials) or of factors relating to the subject of the painting.
To ensure control over the stylistic and subjective content of the test set of paintings, we enlisted nine painting students from the Cleveland Institute of Art, each of whom created three paintings of a fixed subject, a photograph of a water lily (Fig. 1). Each painting was created using the same materials (paint, canvas) and tools (paintbrushes), as described in Materials and Methods below. In addition, the students were instructed to treat their three versions as copies. To guide our investigation, four painting specialists (three art historians and a painting conservator) grouped the paintings by artist style using traditional connoisseurship; based on their stylistic similarity, the sets of paintings from four of the nine students were selected for our investigation.
Acquiring and preparing data from paintings
The surface height information for each painting was collected by high resolution optical profilometry. Measurements were conducted over a 12 x 15 cm region centered on the subject of the painting, with a spatial resolution of 50 microns and a height repeatability of 200 nm. Given that brushstrokes and their associated features are on the scale of hundreds of microns, this resolution was sufficient to capture the fine brushstroke features of the painting's surface.
In preparation for the experiments, the height information is digitally split into small patches, the central objects of the investigation, as depicted in Fig. 1B. A typical patch size for these experiments is 1 x 1 cm, or 200 x 200 pixels, though we eventually explored a range of patch sizes from 10 pixels (0.5 mm) to 1200 pixels (6 cm). The effect of this splitting is threefold. First, it eliminates the subjective information (the water lily figure) from the patches, since individual patches are too small to contain recognizable subject matter. Second, it provides a dataset large enough for ML methods for each painter; for example, the three paintings yield 540 patches per artist at the 1 x 1 cm patch size. Finally, because ML attributes each patch individually, reconstructing the original topography from the patches lets us visually map regions of the surface with different quantitative attributions. This will be important for our future studies of historical paintings, where different regions may represent the contributions of different hands, whether from different members of a workshop or from later conservation efforts.
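The patch-splitting step can be sketched as follows. This is a minimal NumPy sketch; the function name and the zero-filled example scan are illustrative, with dimensions taken from the 12 x 15 cm scan at 50 microns per pixel described above:

```python
import numpy as np

def split_into_patches(height_map, patch_px):
    """Split a 2-D height map into non-overlapping square patches.

    Rows/columns that do not fill a complete patch are discarded.
    Returns an array of shape (n_patches, patch_px, patch_px).
    """
    h, w = height_map.shape
    nh, nw = h // patch_px, w // patch_px
    trimmed = height_map[:nh * patch_px, :nw * patch_px]
    patches = trimmed.reshape(nh, patch_px, nw, patch_px).swapaxes(1, 2)
    return patches.reshape(-1, patch_px, patch_px)

# Example: a 12 x 15 cm scan at 50 microns/pixel is 2400 x 3000 pixels;
# 200-pixel (1 cm) patches give 12 x 15 = 180 patches per painting.
scan = np.zeros((2400, 3000))
patches = split_into_patches(scan, 200)
print(patches.shape)  # -> (180, 200, 200)
```

With 180 patches per painting, the two training paintings and one test painting per artist yield the 540 patches quoted above.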
ML methods were then applied to the surface topographic information from the paintings to explore the following questions:
- Is there enough information to differentiate between artists?
- Which length scales contribute useful information?
- How does topographical information compare to photographic data?
- What machine learning methods provide the most accurate performance?
Machine learning to reliably attribute patches of topographical data to individual artists
Convolutional neural networks (CNNs) are a powerful and well-established method for computer vision tasks such as image classification. [10–11] They generally consist of three classes of layers: convolution, pooling, and fully connected layers [12] (see SI Fig. S2). Convolution layers learn translation-invariant features from the data, and pooling layers summarize the learned features. Stacking these layers builds a hierarchical representation of the data. Fully connected layers feed the extracted features into a classifier and output image classes or labels. CNNs are ideal for identification in signals—such as topographical data—that have local spatial correlations and translational invariance. However, training a deep CNN from scratch on a small dataset typically leads to over-fitting, where the network performs well on the training set but does not generalize to unseen data. A common solution is transfer learning [13]: adapting a network that has been pretrained on a large dataset to a different but related task. For CNNs pretrained on images, the initial layers perform general feature extraction and hence are often applicable to a broad variety of image classification problems. The final fully connected layer (and sometimes several of its predecessors) is replaced and retrained for the problem of interest. The network is then fine-tuned in a block-wise manner, first tuning the last few layers and then allowing successively earlier layers to be trainable as well. In this work, we used an architecture called VGG-16 [14], which was pretrained on more than one million images from the ImageNet dataset [15].
This transfer learning procedure gives us the full functionality of a highly tuned deep CNN with specificity to our task of surface topography. In short, the CNN is now outfitted to take a small input patch of the painting and produce an output pertaining to attribution. The output of our network is a 4-D vector whose components correspond to the probability of attribution to each of the four artists in the experiment (Fig. 1C). Patches from two of the three paintings from each artist are used for training/validation, with patches from the third painting reserved for testing. Because of the stochastic nature of the training procedure, which involves presenting random minibatches from the training set over many epochs, the weights in the final trained network would differ if we repeated the whole procedure from the start. We take advantage of this stochasticity by creating ensembles of 10–100 different trained networks for each task we consider, using the mean of the probability vectors from the entire ensemble as the final prediction. We then calculate the overall accuracy as well as F1 scores, a per-artist measure of test accuracy. Such ensemble predictions in many cases outperform those of single networks [16]. Additional details of the network architecture, training, and fine-tuning procedure can be found in Materials and Methods and the SI.
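The ensemble averaging and per-artist F1 computation can be sketched as follows (NumPy only; array shapes and names are illustrative):

```python
import numpy as np

def ensemble_predict(prob_stack):
    """Average the per-network probability vectors and pick the most
    likely artist for each patch.

    prob_stack: array of shape (n_networks, n_patches, n_artists).
    """
    mean_probs = prob_stack.mean(axis=0)
    return mean_probs.argmax(axis=1), mean_probs

def f1_per_class(y_true, y_pred, n_classes):
    """Per-artist F1 score: 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return np.array(scores)

# Example: an ensemble of two networks scoring two patches for four artists.
stack = np.array([
    [[0.7, 0.1, 0.1, 0.1], [0.2, 0.5, 0.2, 0.1]],
    [[0.5, 0.3, 0.1, 0.1], [0.1, 0.6, 0.2, 0.1]],
])
pred, _ = ensemble_predict(stack)
print(pred)  # -> [0 1]
```

Averaging probability vectors before the argmax, rather than taking a majority vote over hard labels, lets confident networks outweigh uncertain ones.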
The results of ensemble ML with 100 different trained networks for attribution using patches of side-length 200 pixels (10 mm) are shown in Fig. 2. Each patch is color coded according to the largest probability (most likely artist), with the opacity of the shading proportional to the magnitude of that probability (i.e., more transparent shadings correspond to more uncertain attributions). Out of 180 patches for each artist in the test painting, we found 12, 0, 2, and 14 patches attributed incorrectly for artists 1 through 4, an overall accuracy of 96.1%. This is remarkable given that random choice would yield only 25%. Further, most of the patches were attributed with high confidence (more opaque shading) for all four artists. The accuracy of ML prediction from the height data is striking given the similarity of the patches in terms of features distinguishable to the human eye (Fig. 1B), as is its success in broad monochrome areas of the painted background.
Exploring the effect of patch size on attribution accuracy
The surprisingly accurate attribution of 1 cm patches leads to a natural question: how does the size of the patch affect the machine's ability to attribute properly? In other words, can we make the patch size smaller than 1 cm and still reliably attribute the hand? Fig. 3 presents results for networks trained on patches with side-lengths ranging from 10 pixels (0.5 mm) to 1200 pixels (6 cm). The predictions are quantified in terms of overall accuracy for all four artists (solid thick curve) and individual-artist F1 scores (thin colored curves). We also calculated precision and recall; the results are shown in SI Fig. S3. To check the self-consistency of the predictions, we conducted repeated training/testing trials at each patch size (details in the SI). The data points and error bars in Fig. 3 represent the mean and standard deviation over those trials.
The accuracy exhibits a broad plateau around 95% between 100 and 300 pixels, the optimal patch size range for attribution among these artists. Below 100 pixels there is a gradual drop-off in accuracy, as each individual patch contains fewer of the distinctive features that facilitate attribution. The F1 scores let us separate out the network performance for each artist. Consistent with the results in Fig. 2, attribution is generally better for artists 2 and 3 than for 1 and 4 across patch sizes below 300 pixels. Nonetheless, the F1 scores for all artists are above 90% near the optimal patch size (around 200 pixels).
At the other end of the patch size spectrum, the ML approach faces a different challenge: although each individual patch contains many informative features, the number of training patches becomes quite small. The single-network accuracy drops off quickly for patch sizes above 300 pixels, decreasing to about 75% at the largest sizes.
Predictions using single-pixel information versus spatial correlations
One of the hallmarks of CNNs is their ability to harness spatial correlations at various scales in an input image in order to make a prediction. However, there is also information at the single-pixel level, since each artist's height data have a characteristic distribution relative to the mean. The probability densities for these distributions, calculated from the two paintings in each artist's training set, are shown in Fig. 4. The height distributions are all single-peaked and similar in width, except for Artist 1, who exhibits a broader tail at heights below the mean than the others. To determine how important spatial correlations are, we can compare the CNN results to an alternative attribution method that is blind to those correlations: maximum likelihood estimation (MLE). For a given patch in the testing set, we calculate the total likelihood of the height values of every pixel in the patch belonging to each of the four distributions in Fig. 4, and attribute the patch to the artist with the highest likelihood. The predictive accuracy of the MLE approach versus patch size is shown as a dashed line in Fig. 3. We expect MLE to perform best at the largest patch sizes, since each patch then provides a larger sampling of the height distribution and hence is easier to assign. Indeed, at a patch size of 1200 pixels, representing nearly a fifth of the area of a single painting, the MLE accuracy approaches 70%, comparable to the CNN accuracy; in this limit the training set is likely too small for the CNN to effectively learn correlation features. As the patch size decreases, the gap between the CNN and MLE performance grows dramatically. In the range of 100–300 pixels, where the CNN performs optimally (~95%), the MLE accuracy is only around 40%: these small patches sample the distribution too sparsely for accurate attribution based on single-pixel height data alone. Clearly the CNN is taking advantage of spatial correlations in the surface heights. This leads to a natural next question: what correlation length scales are involved in the attribution decision?
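The MLE attribution step can be sketched as follows. This is a NumPy sketch; the histogram binning, the per-patch mean subtraction, and the names are illustrative assumptions, since the text does not specify how the densities of Fig. 4 were discretized:

```python
import numpy as np

def mle_attribute(patch, densities, bin_edges, eps=1e-12):
    """Attribute a patch to the artist whose training height distribution
    gives the highest total log-likelihood over the patch's pixels.

    densities: (n_artists, n_bins) normalized histograms estimated from
    each artist's training paintings (the distributions of Fig. 4).
    """
    heights = patch.ravel() - patch.mean()  # heights relative to the mean
    idx = np.clip(np.digitize(heights, bin_edges) - 1,
                  0, densities.shape[1] - 1)
    log_liks = np.log(densities[:, idx] + eps).sum(axis=1)
    return int(np.argmax(log_liks))
```

Summing log-probabilities rather than multiplying raw likelihoods avoids numerical underflow for patches with many pixels.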
Using empirical mode decomposition to determine the length-scales of the brushstroke topography
In order to examine the spatial frequency (length) scales most important in the ML analysis, we employed a preprocessing technique historically used in time-series signal analysis called empirical mode decomposition (EMD) [17], which has recently been extended into the spatial domain. [18–20] Its versatility derives from its data-driven methodology, which relies on unbiased techniques for filtering data into intrinsic mode functions (IMFs) that characterize the signal's innate frequency composition. [21] In our case, we used a bi-directional multivariate EMD [22] to split our 3D reconstruction of each painting's complex surface structure into IMFs that characterize the various spatial scales present.
The first IMF contains the smallest length-scale textures, and subsequent IMFs contain progressively larger features until the sifting procedure is halted and only a residual remains. The process is lossless in the sense that adding all the IMFs and the residual together recovers the entire signal. [17, 21] It is also unbiased in the sense that, unlike standard Fourier analysis techniques, it requires no spatial frequency boundaries to be defined, and hence introduces no edge effects from defining those boundaries.
By investigating each series of IMFs individually, we can estimate the length scale of each as follows. We apply a standard 2D fast Fourier transform to the IMF and calculate a weighted average frequency of the modes. The length scale is the inverse of this average frequency, plotted versus IMF number in Fig. 5B. Among the four artists, the typical scale increases from about 0.2 mm for IMF 1 to 0.8 mm for IMF 5. Figure 5A shows a sample patch and the corresponding IMFs, illustrating the progressive coarsening at higher IMF numbers. To see how the length scale affects the attribution results, we repeated the CNN training using each IMF separately, rather than the full height data. The resulting mean accuracies versus IMF number at three different patch sizes are shown in Fig. 5C. Individual IMFs are by construction less informative than the full height data (the sum of all the IMFs and the residual), and hence we do not reach the 95% accuracy of the earlier CNN results. However, IMFs 1 and 2 (the smallest length scales) achieve accuracies above 80% at a patch size of 10 mm (200 pixels). There is a drop-off in accuracy at larger length scales (IMFs 3–5), indicating that the salient attribution information lies at length scales of 0.2–0.4 mm. These are comparable to the diameter of a single bristle in the two types of brushes used by the artists (0.25 and 0.65 mm, respectively, shown as dashed lines in Fig. 5B). This strongly suggests that the key to attribution using height data lies at scales small enough to reflect the unintended (physiological) style of the artist. This result is consistent with the scale-dependent ML results depicted in Fig. 3: below a patch size of 5 mm, all results are well above the 25% expected for random attribution. Remarkably, even at the scale of 0.5 mm, that is, the scale of 1–2 bristle widths, ML attributed patches with 60% accuracy.
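The length-scale estimate for a single IMF can be sketched as follows. This is a NumPy sketch; the use of spectral power as the weighting is our assumption, as the text specifies only a weighted average frequency:

```python
import numpy as np

def imf_length_scale(imf, pixel_size_mm=0.05):
    """Estimate an IMF's characteristic length scale as the inverse of its
    power-weighted mean spatial frequency, computed via a 2-D FFT."""
    power = np.abs(np.fft.fft2(imf)) ** 2
    fy = np.fft.fftfreq(imf.shape[0], d=pixel_size_mm)
    fx = np.fft.fftfreq(imf.shape[1], d=pixel_size_mm)
    fmag = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)  # cycles per mm
    mask = fmag > 0                      # drop the DC component
    mean_freq = (power[mask] * fmag[mask]).sum() / power[mask].sum()
    return 1.0 / mean_freq               # mm per cycle
```

As a sanity check, a pure sinusoidal texture with a 1 mm wavelength, sampled at the 50 micron pixel size of the profilometry data, yields a length scale of 1 mm.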
Comparing topography versus photography when testing on data with novel characteristics
Image recognition by ML is most often performed on photographic images of the subject, represented as arrays of RGB channels spanning the entire image. We sought to determine how well CNNs applied to patches of the photographic images in row A of Fig. 2 would perform compared to the profilometry data. We were particularly interested in how ML on the two types of data—photo and height-based—would perform if the testing set had novel colors and subject matter absent from the training set. This approach better approximates the challenges of real-world attribution, where we would not necessarily have extensive, well-attributed training data matching the palette and content of the regions of interest in a painting where the algorithm would be applied. To generate qualitatively distinct training and testing sets, we divided each painting into patches of side-length 100 pixels (5 mm) and sorted the patches into three categories (background, foreground, and border) depending on the color composition of each patch (see Fig. 6A for an example). About 25% of the patches are assigned to background, 50% to foreground, and the remaining 25% to border (Fig. 6B). Border patches include regions of both background and foreground and were excluded from both training and testing, so that the two remaining categories stay qualitatively distinct and generalizing from one to the other is genuinely challenging for the algorithm. The mostly dark green and black color palette and lack of defined subjects distinguish the background from the foreground, which is dominated by the painted flower with its various shades of yellows and reds. Could a network trained only on background patches still accurately attribute foreground patches, or vice versa? The mean accuracy results are shown in Fig. 6C, with the left two bars corresponding to training on the background and testing on the foreground, and the right two bars to the reverse scenario.
Because the training sets are significantly smaller (and less representative of the test sets) than in our earlier analysis, we expect lower attribution accuracies. Despite this, networks trained on the height data (blue bars) perform reasonably well, achieving 60% accuracy when trained on the background and 80% when trained on the foreground. (We note that the background training set is about half the size of the foreground one.) In contrast, networks trained on the photo data did significantly worse (red bars), achieving 27% and 43% accuracy, respectively. Clearly, in this context the color and subject information in the photo data, likely the focus of the ML training, was a hindrance, since the test set confronted the network with novel colors and subject matter. The height data, on the other hand, carry a significant small-scale stylistic component that is present whether the artist is painting the foreground or the background, and that can therefore be harnessed for attribution.
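The patch-sorting step underlying this experiment (Fig. 6A) can be sketched as follows; the thresholds and the warm-pixel foreground criterion here are illustrative placeholders, as the text defines the categories only by color composition:

```python
import numpy as np

def categorize_patch(rgb_patch, fg_mask_fn, fg_thresh=0.8, bg_thresh=0.2):
    """Sort a patch into 'foreground', 'background', or 'border' by the
    fraction of its pixels matching a foreground color criterion.

    fg_mask_fn maps an (H, W, 3) RGB array to a boolean per-pixel
    foreground mask. The thresholds are illustrative assumptions.
    """
    frac = fg_mask_fn(rgb_patch).mean()
    if frac >= fg_thresh:
        return "foreground"
    if frac <= bg_thresh:
        return "background"
    return "border"

# Illustrative criterion: warm (red/yellow-dominant) pixels count as the
# flower foreground, against the dark green/black background.
def warm_pixels(rgb):
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    return (r > g) & (r > b)
```

Patches returning "border" under such a rule would be held out of both the training and testing sets, as described above.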