Assume that training in our LEGO® domain is done while leaving one value of some attribute dimension out of the training set, creating a new network A−i, where i signifies the ith dimension, one of whose possible values is not represented at all in the training set. This would not be unlike training an autonomous car using a dataset where only sunny, daylight scenes are included. That 'left out' dimension i, with respect to the visual appearance of the bricks, is one of: color (e.g., leave out one color), size (e.g., leave out all 2x2 bricks), shape (e.g., leave out all slopes), orientation (e.g., leave out images where the long axis of a brick is parallel to the conveyor belt), or pose (e.g., leave out all upside-down poses). We could also consider more than one such 'left out' value or dimension. For each case, we will consider the resulting generalization error.
Going back to Figure 1, the flowchart of the Backcasting process, we use the 'current status quo', namely the state-of-the-art in computer vision, to 'create specific consequent', namely the network A−i, which will be assumed to satisfy all production specifications LEGO® would require. We will consider a number of different sources of possible training biases and their impact on error for this network. For each, the overall error will be estimated and given as the sum of the error of the full network A (GA < 0.000018) and any new errors introduced as a result of the changes in the training set. As in Figure 1, one or more steps will be taken towards the 'logical antecedent', in other words, towards what must be true in order for network A−i to perform as assumed. Whether or not the logical antecedent is true or even realistic will determine our conclusions regarding generalization.
4.1 Training Set Missing One Brick Color
Suppose one color of the staple LEGO® brick set possibilities (red, blue, green, white, black, yellow) is left out of the training dataset and that the resulting system A−c satisfies the empirical error stipulated above. This means 5 colors are sufficiently represented and thus are learned correctly. This places the state of our argument at the 'create specific consequent' stage of the Backcasting flowchart in Figure 1. We now proceed through the boxes in the figure labelled 1 onward to the 'logical antecedent'. (These steps hold for each of the cases described below and are not repeated in those descriptions.) Although the color label set referred to above includes 141 colors, these 6 are the standard ones and, in any case, are sufficient to demonstrate the point here. If those 5 colors are learned and thus 'in the weights' of the network, then those weights might be invoked when the 'left out' color is seen.
Color is complicated (see Koenderink 2010). The most common representation for images, such as in those datasets in common use, is that each pixel in an image is represented by R (red), G (green) and B (blue) values, each coded using 8 bits. In other words, for each of the three colors, 256 gradations are possible, yielding a potential 16,777,216 colors. Luminance or intensity is not independently coded but is rather derived from these as an average of the three values. The values for each of R, G and B come from image sensor outputs, and the most common practice is for the sensor to employ a Bayer filter (a pattern of color filters with a specific spatial distribution), with the output then further processed, including demosaicing of the image. These color filters are designed to specific characteristics; B is a low-pass filter, G is a band-pass filter and R is a high-pass filter, attempting to fully cover the range 400-700nm. Spatially, the Bayer filter is arranged to have 2 green filters for each red or blue, that is, each image pixel is the result of these 4 samples: 2 green, one red and one blue. The filter pass ranges overlap only by small amounts. The goal is to match the wavelength absorption characteristics of human retinal cones, which very roughly are: S (blue) 350-475nm, peaking at 419nm; M (green) 430-625nm, peaking at 531nm; L (red) 460-650nm, peaking at 559nm. Rods, for completeness, respond from about 400-590nm, peaking at 496nm. An interesting characteristic is that beyond 550nm, only the M and L systems respond. Koenderink points out that the distributions in the human eye are not ideal in how they cover the spectrum; an ideal system would more equally cover the full spectrum. This is what the Bayer filter attempts to accomplish. Of course, there are many color models and encoding schemes, but this is the most common.
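The arithmetic of this encoding can be sketched in a few lines. This is only an illustration of the counts quoted above; the function name and the sample pixel are hypothetical, and intensity is taken as the plain average of the three channels, as the text describes.

```python
# Sketch of the 8-bit-per-channel RGB encoding described in the text.
BITS_PER_CHANNEL = 8
LEVELS = 2 ** BITS_PER_CHANNEL   # gradations available per channel
TOTAL_COLORS = LEVELS ** 3       # distinct (R, G, B) triples


def intensity(r, g, b):
    """Intensity derived as the average of the three channel values."""
    return (r + g + b) / 3


# Hypothetical pixel value, chosen only for illustration.
sample_pixel = (200, 100, 0)

print(LEVELS)                    # 256
print(TOTAL_COLORS)              # 16777216
print(intensity(*sample_pixel))  # 100.0
```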
Suppose the 'left out' brick color is yellow. As Koenderink points out, the spectra of common objects span most of the visible light spectrum (see his Fig. 5.18, which shows 6 different spectra of real 'yellow' objects; for each, most of the visible spectrum has some representation except blue). So, if one considers the RGB representation across all the bricks, most of the other colored bricks will have some yellow spectral content but this will never be independently represented. The portion of the spectrum we commonly see as yellow is quite narrow, centred around 580nm. In both human vision and using the Bayer filters, these wavelengths are not sampled by the blue component at all. In human vision, the absorption spectra of the M and L cones have good sensitivity to the yellow wavelengths (their peaks are at 531nm and 559nm respectively). However, using Bayer filters, most applications try to minimize overlap, so at the yellow wavelengths sensitivity is typically low, less than 50% of the peak.
The question we ask is, if no yellow brick is part of the development of A−c, what is the resulting generalization error? Since color is 'in the weights' of this network, and all non-yellow objects likely exhibit some yellow in their reflectance spectra, we now ask if it is enough to enable classification of a yellow brick via mixing.
The color humans perceive is a complex function of the spectrum of the illuminant (and ambient reflected spectra), the angle of incidence of the illuminant on the surface being observed, the albedo of the surface, the angle of the viewer with respect to the surface, and the surrounding colors of the surface point being observed. Most of this is captured by the well-known Bidirectional Reflectance Distribution Function (BRDF) (see Koenderink 2010 for more detail). In our brick binning task, much of this can be ignored. The illuminant, its angle to the conveyor belt, the camera and its angle to the conveyor belt, and the surrounding surface are all constant. The variables remaining are the surface albedo and the angle between the camera's optical axis and the brick surfaces being imaged.
It has long been accepted that colors can be formed by additive mixtures of other, primary colors. For our purposes, this means that even though some spectrum of color is 'in the weights' of the network, the learned colors cannot necessarily be combined to produce any other color if a primary color is missing. Color theory for mixtures tells us:

green is created by combining equal parts of blue and yellow;

black can be made with equal parts red, yellow, and blue, or blue and orange, red and green, or yellow and purple;

white can be created by mixing red, green and blue; alternatively, a yellow (say, 580nm) and a blue (420nm) will also give white.
However, red, blue and yellow are primary colors and cannot be composed as a mixture of other colors. If a primary color is unseen during training, there can be no set of weights that would represent it as a combination of other colors. This would mean that if a primary color is unseen, it could not be classified. In a brick classification task such as ours, color is an important dimension because it divides the entire population into 6 groups; then within each group are the different block types.
The above are all additive mixing rules. Subtractive mixing should be considered as well because networks employ both positive and negative weights, and it might be that this is an alternate avenue for dealing with the color. The additive rules arise by using the RGB color model while the subtractive rules are within the CMYK color model, not often seen in neural network formulations. The three primary colors typically used in subtractive color mixing systems are cyan, magenta and yellow. Cyan is composed of equal amounts of green and light blue. Magenta is composed of equal amounts of red and blue. Yellow is primary in both color models and cannot be composed of other colors. In subtractive mixing, the absence of color is white and the presence of all three primary colors makes a neutral dark gray or black. Each of the colors may be constructed as follows:

red is created by mixing only magenta and yellow

green is created by mixing only cyan and yellow

blue is created by mixing only cyan and magenta

black can be approximated by mixing cyan, magenta, and yellow, but pure black is nearly impossible to achieve.
As Koenderink (2010) says, when describing his Fig. 5.18, the best yellow paints scatter all the wavelengths except the shorter ones. In fact, the spectra he shows for 'lemon skin', 'buttercup flower', 'yellow marigold' or 'yellow delicious apple' have significant strength throughout the red and green regions. In practice, a strong yellow, such as in a brick, could appear as the triple (r, g, b) in an RGB representation, where 0 ≤ r, g, b ≤ 1, and r, g >> b, with b being close to 0. It would be reasonable to think that, since no yellow brick is seen in training, there would be no corresponding ability to classify it.
With some generous assumptions about how the colors are represented in the weights and how they combine through the network, there may be a route to combinations that lead to most colors. After all, this would reflect the distributed representations underlying such networks (Hinton 1984): each entity is represented by a pattern of activity distributed over many computing elements, and each computing element is involved in representing many different entities. But color theory tells us that no combinations can yield yellow; it cannot be overcome using a learning strategy that never sees samples of these colors nor by distributed representation strategies. The only conclusion possible is that network A−c where c ∈ (red, blue, yellow) will not properly generalize, but that c ∈ (green, white, black) might. This would of course be unacceptable to LEGO®. On the other hand, a dataset that is biased towards this latter color subset might actually exhibit performance metrics for test error that could appear promising; this would, however, be a false indication. An analysis of failure instances would reveal this problem.
There are many other color spaces. Rasouli & Tsotsos (2017) review 20 different spaces and show how each leads to different characteristics for the detectability of objects. Above, we show only two of these. It might be that the correct choice of color space for particular training data sets leads to different generalization properties for learned systems. This requires further research but the methods described here might be helpful.
Our hypothetical network A−c where c ∈ (red, blue, yellow) would thus exhibit the following error. Leaving out one of these colors, if we assume all LEGO® bricks are made in all colors equally, means that 1/6th of all bricks are erroneously binned (since 1/6th of all bricks cannot have their color correctly classified). Since the total number of bricks possible is 1,468,800, as enumerated in Section 2.2, a 1/6th error (244800/1468800 = 0.166667) overwhelms the manufacturing error of 0.000018. The overall error G, as defined in Section 4.0, can be estimated as GA−c < 0.166685. For A−c where c ∈ (green, black, white), using generous training assumptions, we can assume no additional error so GA−c < 0.000018.
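The estimate above follows the rule stated at the start of this section: the error of the full network A plus the fraction of bricks newly misbinned. A minimal sketch of that sum, using only figures quoted in the text (the function and variable names are illustrative assumptions):

```python
# Overall error = error bound of the full network A + fraction newly misbinned.
BASE_ERROR = 0.000018        # bound on the full network A, from the text
TOTAL_BRICKS = 1_468_800     # total enumerated in Section 2.2
STAPLE_COLORS = 6            # red, blue, green, white, black, yellow


def biased_error(misbinned, total, base=BASE_ERROR):
    """Estimated error bound when `misbinned` of `total` items cannot be classified."""
    return base + misbinned / total


# A primary color is missing: one sixth of all bricks are misbinned.
missing_primary = biased_error(TOTAL_BRICKS // STAPLE_COLORS, TOTAL_BRICKS)
# A non-primary color is missing: no additional error is assumed.
missing_nonprimary = biased_error(0, TOTAL_BRICKS)

print(f"{missing_primary:.6f}")     # 0.166685
print(f"{missing_nonprimary:.6f}")  # 0.000018
```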
4.2 Training Set Missing One Brick Size
Let us name the learned system where one of the sizes is left out of training A−z, z ∈ (s1, s2, ..., sk), the set of all block sizes, and assume that it satisfies the empirical error stipulated above. There are many sizes in the label set we are using and not all sizes apply to all brick types. The majority of the specialty bricks are of a unique size. Common pieces have several sizes. For example, in the Bricks 1xN type, there are 9 simple sizes, i.e., 1x1, 1x2, 1x3, 1x4, 1x6, 1x8, 1x10, 1x12, and 1x16, but also 40 more complex shapes with varying counts of studs, thus of differing physical size when compared to other bricks but without variation of size with respect to that particular type of brick. In other words, there seem to be at least two different dimensions along which size might be represented: stud count and brick volume. There may be additional ways as well; let's consider stud count only (in Figure 2, the bricks in panels a, b, c, and d have 4, 1, 2, and 3 studs respectively). The minimum stud count is 1 and the maximum is 54 in the label set referenced above. The accurate size distribution is tedious to enumerate; it will likely be as informative to assume that there are equal numbers of each stud count, that is, approximately 73,320/54 = 1358 using the full brick set, or 1133/54 = 21 using the production values for unique brick types cited above. Suppose one stud size is left out of the training set but that the resulting system A−z satisfies the empirical error stipulated above. How could the system generalize to that missing stud count? One might imagine that if a simple linear piece with 4 studs is left out of the training data then, with generous assumptions, the learned portions of the network for the similar shapes might jointly fire and fill in the gap. This would mean that there is some combination of network elements that form a 4-stud straight piece, as shown in Figure 2a. Straightforward possibilities include Figure 2b, c, and d.
But then could the 3-stud piece in Figure 2e participate in this composition? How could the 4-stud bricks of Figure 2e-h be made? It is assumed that all the pieces in this figure, and in all other figures, are part of the training set for the ideal network A since they are all part of the brick sets enumerated in Section 2.1.
There are many similar questions given the variety of ways LEGO® has found to make bricks with 4 studs. It seems that the composition with learned pieces might provide a partial answer, but not a complete solution. Certainly, any error measure would be increased even if we assume a partial solution.
Our hypothetical network A−z would exhibit the following worst-case error. If all the bricks of a single stud measure are binned incorrectly, that is, 21 out of 1130 bricks, the error is 21/1130 = 0.018584 which, when summed with the overall production error, gives a cumulative error of GA−z < 0.018602.
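The equal-count assumption and the resulting worst-case bound can be reproduced as follows. This is a sketch using only the figures quoted in this section; the variable names are illustrative, and the uniform spread of bricks over stud counts is the simplifying assumption made in the text, not a measured distribution.

```python
# Uniform-distribution assumption over stud counts (1 to 54), as in the text.
MAX_STUD_COUNT = 54
BASE_ERROR = 0.000018                  # bound on the full network A

per_count_full_set = 73_320 / MAX_STUD_COUNT    # full brick set
per_count_production = 1_133 / MAX_STUD_COUNT   # production value quoted in the text

# Worst case: every brick of the missing stud count is binned incorrectly.
misbinned, unique_bricks = 21, 1130
size_error = misbinned / unique_bricks + BASE_ERROR

print(round(per_count_full_set))    # 1358
print(round(per_count_production))  # 21
print(f"{size_error:.6f}")          # 0.018602
```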
4.3 Training Set Missing One Brick Orientation
Suppose one orientation in the imaging plane (i.e., parallel with the conveyor belt; recall the imaging geometry described in Section 2.2) is left out of the training set and that the resulting system A−o satisfies the empirical error stipulated above. Simple data augmentation methods, such as spatial shifts or rotations within the image plane, are likely to permit generalization when one, or perhaps more, orientations are omitted from a training set (more on data augmentation methods below). The lack of good representation of all orientations is probably not an insurmountable problem; this would, however, also depend on whether or not there is a need for precision grasping if a robotic manipulator is used, and this is not further considered here. Data augmentation is a relevant method for reducing orientation-in-the-imaging-plane sampling biases in training data given the restricted imaging geometry. The hypothetical network A−o exhibits an overall error that is not increased due to this bias, and thus we can assume that GA−o < 0.000018.
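A toy illustration of such in-plane augmentation is given below. It is a deliberately simplified sketch: images are small grids of numbers, rotations are restricted to multiples of 90 degrees (arbitrary angles would need interpolation), and the shift wraps around the image border. The helper names and the tiny 'brick' image are hypothetical.

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def shift_right(image, k):
    """Cyclically shift every row k pixels to the right."""
    k %= len(image[0])
    return [list(row[-k:] + row[:-k]) if k else list(row) for row in image]


def in_plane_rotations(image):
    """Return the image at 0, 90, 180 and 270 degrees of in-plane rotation."""
    views, current = [], [list(row) for row in image]
    for _ in range(4):
        views.append(current)
        current = rotate90(current)
    return views


# Hypothetical 1x4 brick lying along the belt (1 = brick pixel, 0 = belt).
brick = [[1, 1, 1, 1],
         [0, 0, 0, 0]]

views = in_plane_rotations(brick)
print(views[1])  # the brick now lies across the belt: [[0, 1], [0, 1], [0, 1], [0, 1]]
```

An orientation missing from the captured data can thus be synthesized from one that is present, which is why this bias is treated as benign in the argument above.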
4.4 Training Set Missing One Brick 3D Pose
Suppose one 3D pose is left out and that the resulting system A−p satisfies the empirical error stipulated above. The LEGO® brick domain is well-suited to an aspect graph representation and this is yet another advantage of choosing these bricks as our domain of interest. We could not easily enumerate all the characteristics of most other domains, certainly not of natural images. First, a brief overview of aspect graphs (Cutler 2003).
An Aspect is the topological appearance of an object when seen from a specific viewpoint. Imagine a sphere where each point on the sphere represents the viewing direction formed by that point and the sphere's centre, as is shown in Figure 3. A Viewpoint Space Partition (VSP) is a partition of viewpoint space into maximal regions of constant aspect. An Event is a change in topological appearance as viewpoint changes. A constant aspect then is a contiguous region on the sphere that is bounded by changes in topological appearance, i.e., events. The full space partition would show a different region for each face. Consider the simple example of a cube. The separate regions of the space partition would be composed of regions where only one side is visible, only 2 sides are visible and only 3 sides are visible. There is no partition where 4 sides can be visible. There are 26 of these regions. An Aspect Graph is a graph with a node for every aspect and edges connecting adjacent aspects. The dual of a viewpoint space partition is the aspect graph. Aspect graphs can be made for convex polyhedra, non-convex polyhedra, general polyhedral scenes (same as the non-convex polyhedra case), and non-polyhedral objects (e.g., torus). In general, for convex objects, the size of the VSP and the aspect graph are of order O(n^3) while for non-convex objects, the VSP and aspect graph are of order O(n^9) under perspective projection (O(n^2) and O(n^6) under orthographic projection respectively, where n is the number of aspects) (Plantinga and Dyer 1990). The cube has fewer than this general number because it has 3 pairs of parallel planes that do not intersect and thus do not form a viewing possibility. Recognition algorithms that employ aspect graphs typically match a set of aspects to a possible reconstruction of a hypothesized object (see Rosenfeld 1987 for the first of these; Dickinson et al. 1992 for a nice development using object primitives).
We will assume orthographic projection because we can engineer the imaging geometry to satisfy its properties. Suppose one 3D pose is left out of the training set representation; for example, there are no images of any LEGO® brick with its top surface facing the conveyor belt. If one side is never seen, it affects all aspects in which it participates. For a cube, this would mean: the aspect of the face by itself (1), the aspects with one neighbouring face (4), and the aspects with 2 neighbouring faces (4), for a total of 9 aspects. In other words, 9 of the 26 aspects of the brick need to be generalized using the remaining 17 learned aspects; that is, 35% (9/26) of the possible constellation of aspects required for recognition would not be available. Recognition would fail if the observed viewpoint led to a set of visible aspects that overlapped these 9 aspects. For bricks more complex than a cube, this would differ.
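The cube counts used here can be checked by brute force. The sketch below follows the text's way of counting, treating every one-, two- and three-face view as a separate aspect; it represents each viewing region by the signs of the viewing direction along the cube's three axes. The face names and helper function are illustrative assumptions.

```python
from itertools import product

AXES = "xyz"


def cube_aspects():
    """Enumerate cube aspects as sets of simultaneously visible faces.

    Each non-zero sign vector in {-1, 0, 1}^3 stands for one viewing region;
    a face is visible when the view has a component along its outward normal.
    """
    aspects = []
    for signs in product((-1, 0, 1), repeat=3):
        if signs == (0, 0, 0):
            continue  # no viewing direction
        faces = frozenset(
            ("+" if s > 0 else "-") + axis
            for s, axis in zip(signs, AXES)
            if s != 0
        )
        aspects.append(faces)
    return aspects


aspects = cube_aspects()
by_face_count = {k: sum(1 for a in aspects if len(a) == k) for k in (1, 2, 3)}
# Aspects that involve one particular face, here the one labelled "+z".
affected = [a for a in aspects if "+z" in a]

print(len(aspects))                            # 26
print(by_face_count)                           # {1: 6, 2: 12, 3: 8}
print(len(affected))                           # 9
print(round(len(affected) / len(aspects), 2))  # 0.35
```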
However, each possible missing pose does not necessarily represent a stable configuration for a brick lying on a conveyor belt. Of the brick types listed above (wedge-plates, tiles, bricks 1xN, slopes, plates 3xN/4xN, plates 2xN, plates 1xN), many would have only 2 stable configurations, right side up and upside down, with no possibility of a stable sideways configuration.
Further, there are a number of shapes with unique characteristics such as those in Figure 4a-c. The first could appear on its side or on its back end with the protruding elements acting as stabilizers, so it might be tilted towards the camera. The second could easily be stud-side down, on each vertical side, on the slant, but not likely stable on its bottom or backside. The third might have 5 stable sides. In other words, the number of possible cells of an aspect partition differs for each brick type. It is quite possible to enumerate all of these given we have the library of part labels; however, that enumeration would be tedious, and an approximation will suffice. We assume, as mentioned earlier, that each unique brick has an average of 3 stable poses, so for 1130 unique brick types there would be a total of 3390 stable brick appearances due to changing pose. Call these expected stable poses p1, p2, and p3 (perhaps right side up, upside down, and on the left side, to pick one possible set of examples). For our cube example, above, a missing face would impact 9 out of 26 aspects, or about a third of all aspects. In other words, the 'left out' pose in A−p must be one of the brick's stable poses, p1, p2, or p3, and not an arbitrary pose.
What is the impact if one stable pose, say p1, of all brick types is left out of a training set? The aspect graph itself would be incorrect. Not only would one face (at least) be missing (or presumed flat) but its interactions with its neighboring faces would be incorrect. Recognition of that particular brick, if based on the learned visible aspects, would be impaired unless the particular viewpoint the camera sees is one from which only properly learned aspects are visible. If a third of the aspects are affected, then assuming all viewpoints are equiprobable, one third of all views of this brick would lead to erroneous recognition.
It is worth pointing out that the blocks in Fig. 4c-f are examples of objects for which degenerate views are possible. The cross-section of each is the same and only the length differs, so even though a block may be in a stable pose, the imaging geometry would yield an ambiguous situation (Dickinson et al. 1999).
Is it possible that the other aspects for this brick, or the aspects of other bricks, could together combine to permit correct recognition? We could consider this on a case-by-case basis, but this would be quite tedious. Nevertheless, it might be, with generous assumptions about the capabilities of the learned network, that the learned elements corresponding to the bricks in Figure 4d-f could combine to provide what is needed to recognize the slope of intermediate size, shown in the previous set. But there is no corresponding brick combination for the first one in the previous set, nor for many other bricks; they are all quite unique. Thus, recognition of those bricks is assumed to fail if the proper constellation of aspects is not seen.
It seems safe to think that the error would be larger than that of the ideal network, which has already been pegged at GA < 0.000018. If no brick is seen with p2, for example, it means that one third, or 1130, of the possible brick appearances would have to be recognized as a result of generalization. We can make generous assumptions about the generalization capabilities of the network, specifically that it can successfully handle similar bricks such as the roof-like ones just depicted even if no training sample for a particular pose is present. However, a rough count in the brick library yields 132 unique pieces that have no similar ones from which a generalization can easily be obtained, or in other words, 12% of the possible 1130 unique bricks.
A different possibility would be some kind of data augmentation in training set preparation. No data augmentation could remedy missing poses because no such method inserts patches out of other images (the stud surface appears in other poses of course). Even a data augmentation that could consider geometry would not be able to fill in the missing samples unless prior knowledge of what all the invisible aspects could look like, or assumptions of surface continuity, is somehow applied in the data augmentation process (see also Logan et al. 2018, Ian et al. 2014). In any case, such an action has its own problems (Rosenfeld et al. 2018) and would not be a sensible strategy.
Our hypothetical network A−p would have the following error, assuming the least error implied by the above arguments. Suppose that the generalization process is effective for the bricks for which there might be additive ways of deriving a brick (even if not straightforward), but for the 132 unique bricks they are all classified incorrectly, a 12% error as described. Thus, an estimate of overall test error is GA−p < 0.120018. This estimate is likely too small.
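The figures behind this estimate can be traced step by step. The sketch below uses only numbers quoted in this section; the variable names are illustrative, and the 12% figure is the text's rounding of 132/1130.

```python
BASE_ERROR = 0.000018        # bound on the full network A
UNIQUE_BRICKS = 1130
STABLE_POSES_PER_BRICK = 3   # average assumed in the text
NO_SIMILAR_BRICK = 132       # rough count of pieces with no similar counterpart

stable_appearances = UNIQUE_BRICKS * STABLE_POSES_PER_BRICK
exact_fraction = NO_SIMILAR_BRICK / UNIQUE_BRICKS
rounded_fraction = round(exact_fraction, 2)   # the 12% used in the text
pose_error = rounded_fraction + BASE_ERROR

print(stable_appearances)       # 3390
print(f"{exact_fraction:.4f}")  # 0.1168
print(f"{pose_error:.6f}")      # 0.120018
```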
4.5 Training Set Missing One Brick Shape
Suppose one shape is left out and that the resulting system A−s satisfies the empirical error stipulated above. Shape is a more abstract notion here than the other dimensions considered. It is included because one might imagine a customer requesting an order of all sloped bricks, of all rectangular bricks, of all plates, etc. For example, the bricks in Figure 5a-d are all considered plates and their regularity is the thickness of the base. On the other hand, the bricks in Figure 5e-h are classified as slopes because they all contain a sloped surface.
There are 106 slope brick subtypes. The question here is if all plates (or slopes, or bricks, or wedges, etc.) were left out of the training set, could the resulting network generalize so that they could be recognized sufficiently well?
Let us consider some of the details of our assumed network A−s. These images contain no texture information useful to the task, thus we assume that none is learned. The imaging geometry, especially the lighting, is such that shadows or other cues for shape from shading are not useful. We can also leave out color for the current purpose. It is well-known that many learned networks develop receptive field tunings in their early layers that are very similar to oriented Gabor filters, and this seems a good first level of processing for the brick images. One can easily imagine what a line drawing of each of the LEGO® bricks might be; the bricks are textureless. It is an acceptable assumption, then, that processing continues on such a representation.
How can shape be inferred from a line drawing? There is a wealth of literature on how computers might interpret line drawings. LEGO® brick shapes are mostly polyhedra and one of their shape characteristics (but not a complete characterization to be sure) is the set of labels of their lines and vertices. A labelling of an image is an assignment to each edge of the image of one of the symbols +, −, ⇢ and ⇠ (concavity or convexity), and similarly for each vertex one of the many types of vertices (Clowes 1971, Waltz 1971; several sources enumerate the set, but the exact cardinality is not important). Such a labelling is a reasonable assumption as a representation of shape sufficient to enable discrimination of one shape from another, although it is not difficult to show counterexamples. For the purposes of our argument, this does not really cause any difficulties. Such an 'in principle' solution is instructive even if not precisely what a trained neural network realizes.
Kirousis & Papadimitriou (1988) examined the processing of images of straight lines on the plane that arise from polyhedral scenes. They asked, given an image, is there a scene of which the image is the projection? Such images are called realizable. One classical approach to the realizability problem is through a combinatorial necessary condition. A labelling is legal if at each node of the image there is only one of the legal patterns. A legal labelling is consistent with a realization of the image if the way the edges are seen from the projection plane is the way indicated by the labelling. They provided a proof that it is NP-complete, given an image, to tell whether it has a legal labelling. This is true for opaque polyhedra, and is even true in the simpler case of trihedral scenes (no four planes share a point) without shadows and cracks. Although there are methods for labelling such a line drawing, the problem is that it is exponential to determine if the labelling corresponds to a real object. In other words, since any algorithm for extracting straight lines from images may necessarily have error, any error may signal an illegal labelling where there is none or a legal labelling where there is not one, so a verification step is needed. Once it is known that there is a legal labelling, there exist algorithms for matching a labelled image to known objects that have known labellings. These results, importantly, are independent of algorithm or implementation; they apply to the problem as a whole.
Kirousis & Papadimitriou (1988) also present an effective algorithm for the important special case of orthohedral scenes (all planes are normal to one of the three axes). It is tempting to think that this latter case applies to LEGO® bricks; most are indeed approximately orthohedral (the studs pose the exception), but then again there are many pieces within the library that are strictly polyhedral or involve curved segments. For example, the class Plates 1xN includes the bricks of Figure 6a, the class Plates 2xN includes the bricks of Figure 6b. Slopes include the bricks of Figure 6c. The class Brick 1xN includes bricks such as that of Figure 4d. Even in decomposition these will contain elements that are non-polyhedral.
The task of realizing a general, non-orthohedral scene, given its labelled image, can be solved by employing linear programming, i.e., a polynomial time algorithm exists. So, the problem remains to find the legal labelling. It is known that the labelling of trihedral scenes is NP-complete, as is the labelling of origami scenes, that is, scenes constructed by assembling planar panels of negligible thickness (Parodi 1996, Sugihara 1982, Malik 1987).
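To give a feel for the combinatorics, the sketch below counts the candidate edge labellings an exhaustive search would have to consider. It is only an illustration of the size of the search space under the four-symbol edge alphabet; it does not implement the junction catalogue or any of the cited algorithms, NP-completeness does not by itself imply that exhaustive search is required, and the ASCII label names and function names are stand-ins chosen here.

```python
from itertools import product

# ASCII stand-ins for the four edge symbols (convex, concave, two occluding directions).
EDGE_LABELS = ("+", "-", "->", "<-")


def candidate_labellings(num_edges):
    """Number of label assignments an exhaustive search would have to consider."""
    return len(EDGE_LABELS) ** num_edges


def enumerate_labellings(edges):
    """Lazily yield every assignment of a label to each named edge."""
    for combo in product(EDGE_LABELS, repeat=len(edges)):
        yield dict(zip(edges, combo))


# A cube in general view shows 9 edges (6 on the outline, 3 meeting inside).
print(candidate_labellings(9))   # 262144
print(candidate_labellings(18))  # 68719476736
```

Each additional visible edge multiplies the candidate set by four, which is why legality checking becomes the bottleneck as line drawings grow more complex.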
It is difficult to accept that a deep learning procedure can effectively learn the solution to an NP-complete problem; it might, however, learn approximate solutions that are within some error bound 𝜀−s for subsets of the full problem. We will not explore this route but suggest that this makes the generalization issue even more difficult to address.
Note that we still have not come to the generalization issue. Suppose no slopes are included in the training set but a customer wishes a box of all slopes in the LEGO® catalog. This means that no instance of the configuration of lines and vertices seen in bricks such as that in Figure 6e have participated in any learning. It is highly unlikely that a network can construct a particular configuration out of learned elements that would suffice; after all, as shown, the problem in general is combinatorial and even generous assumptions about learned networks cannot defeat this fact.
In order to quantify the expected error of the hypothetical network A−s, we can be optimistic and say that the error would not be greater than the error incurred by misclassifying the smallest group of bricks in the catalog. In other words, if all slopes are erroneously classified because slopes were not in the training set, it would mean that the remaining 520 - 106 = 414 types are correctly classified. The smallest such group in the catalog is that of plates with 22 instances. Then, if those 22 classes are not part of the training set, 22 x 6 (poses) x 6 (colors) x 72 (orientations) = 9504 images are misclassified out of the possible 1,468,800 images, or 0.647%. Thus, an estimate of overall test error is GA−s < 0.006488. It should be clear that this is a very optimistic estimate.
4.6 Interim Summary
Having looked at 5 different cases of selection bias in training sets, we summarize the above analysis in Table 1. The hypothetical networks A−c, A−z, A−p and A−s all have errors that are orders of magnitude too great given the production standards. As should be clear, if the training set is biased with respect to any characteristic except orientation in the imaging plane or non-primary colors, the resulting network is unlikely to generalize so that the required performance standard can be met. The particular values for generalization error may invite argument, to be sure. However, there are two indisputable facts that emerge. First, the error for these four networks is far beyond the acceptable manufacturing values; it is neither small nor negligible. Second, there are at least a few training biases that cannot be generalized away. LEGO® bricks are a simple domain, one that can be completely characterized. Imagine how such a thought experiment might be impacted by a more complex domain.
Table 1
Summary of training bias analyses.
| Training Set Bias | Learned Network | Generalization Error |
|---|---|---|
| Unbiased | A | GA < 0.000018 |
| Color Biased | A−c, c ∈ (red, blue, yellow) | GA−c < 0.166685 |
| Color Biased | A−c, c ∈ (green, black, white) | GA−c < 0.000018 |
| Size Biased | A−z | GA−z < 0.018602 |
| Orientation Biased | A−o | GA−o < 0.000018 |
| 3D Pose Biased | A−p | GA−p < 0.120018 |
| Shape Biased | A−s | GA−s < 0.006488 |

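The 'orders of magnitude' comparison in the summary can be made concrete by dividing each bound in Table 1 by the bound for the unbiased network. The following is a small sketch using only the table's values; the dictionary keys and function name are illustrative labels.

```python
import math

BASE_ERROR = 0.000018  # bound for the unbiased network A

# Error bounds as listed in Table 1.
bounds = {
    "A-c (primary color missing)": 0.166685,
    "A-z (size)": 0.018602,
    "A-o (orientation)": 0.000018,
    "A-p (3D pose)": 0.120018,
    "A-s (shape)": 0.006488,
}


def orders_of_magnitude(bound, base=BASE_ERROR):
    """Base-10 logarithm of how many times larger `bound` is than `base`."""
    return math.log10(bound / base)


orders = {name: orders_of_magnitude(b) for name, b in bounds.items()}
for name, value in orders.items():
    print(f"{name}: {bounds[name] / BASE_ERROR:,.0f}x the base error, "
          f"about {value:.1f} orders of magnitude")
```

Under these figures, only the orientation-biased network stays at the base error; the other four bounds exceed it by factors of several hundred to several thousand.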