Robust statistics is a large field of mathematics that attempts to provide reliable estimates of statistical parameters for contaminated data sets. One such algorithm is the Minimum Covariance Determinant (MCD) [32], which is available through scikit-learn [37] in Python. By default, this algorithm separates the data into a core cluster containing just over 50% of the points, where the density of points is greatest. In most schemes, the actual estimation of the covariance matrix is carried out in a two-step process. In the first step, the core cluster is identified as the subset of the data that is closest together and, hence, gives the covariance matrix with the smallest determinant. The MDs of all the points are then calculated and a new covariance matrix is computed from some fraction of the data (greater than 50% and less than or equal to 1) with the smallest MDs. The resulting estimate tends to underestimate the true covariance matrix. To illustrate this, a series of simulations based on real data collected for the samples described in [8] was carried out. The true 5 × 5 covariance matrix for the blanks from a sample set [8] was determined and decomposed using the Cholesky method [38] to give a lower triangular matrix. This matrix was then used to generate 50 sets of pseudo-data (5 scores each) with the same covariance matrix [38]. The MCD algorithm was run using data fractions ranging from 0.58 to 1.0.
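As a rough illustration of this procedure (not the actual analysis code), the following Python sketch generates correlated pseudo-data from a Cholesky factor and fits the MCD with scikit-learn; the matrix true_cov below is an arbitrary placeholder, not the covariance matrix of the blanks in [8].

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)

# Placeholder 5 x 5 "true" covariance matrix (equicorrelated, illustrative only);
# in the study this came from the blanks of the sample set in [8].
stdev = np.sqrt([0.89, 0.2, 0.05, 0.01, 1.74e-4])
true_cov = (0.3 * np.ones((5, 5)) + 0.7 * np.eye(5)) * np.outer(stdev, stdev)

# Lower triangular Cholesky factor used to impose the covariance structure.
L = np.linalg.cholesky(true_cov)

# 50 sets of pseudo-data (5 scores each) with the same covariance matrix.
scores = rng.standard_normal((50, 5)) @ L.T

# Robust covariance estimate; support_fraction plays the role of the "data
# fraction" (fractions from 0.58 up to 1 were explored in the text; for a
# fraction of 1 the empirical covariance, np.cov(scores, rowvar=False), applies).
mcd = MinCovDet(support_fraction=0.7, random_state=0).fit(scores)
```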
The eigenvectors of the robust covariance matrix were an excellent match to those of the true covariance matrix at all data fractions; the dot product between corresponding true and robust eigenvectors typically differed from 1 by less than 1%.
The ratio of the average robust eigenvalues to the true eigenvalues at different data fractions is shown in Fig. 2. The true eigenvalues are sorted in descending order, ranging from 0.89 (Eigenvalue 1) to 1.74E-4 (Eigenvalue 5). As is well known, the eigenvalues calculated from the robust covariance matrix underestimate the true values [32]. This underestimation decreases slightly as the fraction of the data used by the MCD increases. Consequently, MDs calculated using the robust covariance matrix will tend to be high. For all data fractions, the relative standard deviation of the eigenvalues is what would be predicted on the basis of the chi-square distribution with 49 degrees of freedom, i.e. approximately 20% (√(2/49)).
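Continuing the sketch above (the names true_cov and mcd come from that illustrative code, not from the original analysis), the eigenvector alignment and eigenvalue ratios described here can be checked as follows:

```python
import numpy as np

# Eigen-decompositions of the true and robust covariance matrices;
# np.linalg.eigh returns eigenvalues in ascending order.
true_vals, true_vecs = np.linalg.eigh(true_cov)
rob_vals, rob_vecs = np.linalg.eigh(mcd.covariance_)

# Eigenvector agreement: |dot product| between matching eigenvectors
# (absolute value, since the sign of an eigenvector is arbitrary).
alignment = np.abs(np.sum(true_vecs * rob_vecs, axis=0))

# Eigenvalue ratios (robust / true), which sit somewhat below 1.
ratios = rob_vals / true_vals

# Relative scatter expected for a variance estimated from n = 50 samples:
# roughly sqrt(2 / (n - 1)) = sqrt(2 / 49), i.e. about 20%.
expected_rel_sd = np.sqrt(2 / 49)
```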
The consequences of the underestimation shown in Fig. 2 for the partial Mahalanobis distance squared (PMDS) are shown in Fig. 3. To generate this plot, for each curve (data fraction), 50 random data points of 5 scores each were generated as before using the Cholesky matrix. The PMDS using the first 3 rotated scores was calculated for all 50 points and sorted from smallest to largest. Each point was then plotted against the corresponding point from the chi-square distribution. So, if PMDS(j) is the jth ranked value, then the jth point in the graph is \(\left(\chi^{2}_{1-(j-0.5)/n}(3),\ \mathrm{PMDS}(j)\right)\). The dashed black line in Fig. 3 has a slope of 1 and so represents data having an exact chi-square distribution. The red line, labelled as having a data fraction of 1, was generated not with the MCD but with the empirical covariance matrix. Both it and the line representing the MCD calculation using all of the data follow the chi-square distribution within expected statistical scatter. The curves for the remaining data fractions are all approximately straight lines, but with steeper slopes, indicating that they overestimate the true MD. It would be straightforward to divide them by their slopes to scale them back to the chi-square values, but the real question is "How does this overestimate affect the detection of outliers?" If the MD of the outliers also scales up in proportion to this overestimate, then there is no significant effect. This question is related to the power of the test, i.e. to one minus the probability of a Type II error (a false negative), in which an outlier is missed.
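A minimal sketch of this construction, again assuming the scores and mcd objects from the earlier illustrative code, is:

```python
import numpy as np
from scipy.stats import chi2

# Rotate the data into the eigenvector basis of the robust covariance matrix,
# with eigenvalues sorted in descending order as in the text.
vals, vecs = np.linalg.eigh(mcd.covariance_)
vals, vecs = vals[::-1], vecs[:, ::-1]

rotated = (scores - mcd.location_) @ vecs      # rotated scores

# Partial Mahalanobis distance squared (PMDS) over the first 3 rotated scores.
pmds = np.sum(rotated[:, :3] ** 2 / vals[:3], axis=1)

# Pair each ranked PMDS with the matching chi-square quantile (3 DOF).
n = len(pmds)
quantiles = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=3)
plot_points = np.column_stack([quantiles, np.sort(pmds)])
```

Plotting plot_points (for example with matplotlib) reproduces the style of Fig. 3 for a single data fraction.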
In order to investigate the ability of a robust MD method to detect an outlier, another series of simulations was done. As before, a Cholesky decomposition was performed on a 5 × 5 covariance matrix calculated from real-life data. This matrix was used to generate 50 pseudo-data points as before. However, this time an outlier was added to the data. The outlier was constructed from the eigenvalues and eigenvectors of the true covariance matrix. For example, in Fig. 4, the outlier, expressed in the eigenvector basis, was \(C=\left(0,\,0,\,m\sqrt{e_{3}},\,-m\sqrt{e_{4}},\,0\right)\), where m is a scaling factor and \(e_{i}\) is the ith eigenvalue. This specific choice produces a point that can be thought of as \(m\sqrt{2}\) standard deviations from the cluster center. This point was then rotated back into the original basis, \(C'=CV^{T}\), and added to the previously generated data. A robust covariance matrix was then calculated using a data fraction of 0.7 and the PMDS calculated using Vectors 3 and 4. Two different methods were then used to determine whether the outlier was "detected". In the first, the outlier was "detected" if it had one of the 3 largest PMDSs (i.e. was above roughly the 95th percentile of the data). In the second approach, the point was considered detected if its PMDS was greater than a cut-off of 9.33, which was the average value of the 95th percentile of the 0.7 curve in Fig. 3. The first approach had the advantage of ensuring there would never be more than 5% false calls, but it also committed the analysis to that level of false calls. It also required no simulations or other methods to determine a cut-off and so was very flexible. At each value of m, one hundred such simulations were performed and the number detected is shown in Fig. 4. A number of other series of simulations were carried out using other outlier directions, and they all produced essentially identical results, as would be expected when the scaling is in terms of standard deviations. Finally, it is noted that the choice of data fraction at or below 90% had little effect on this result, as might be expected from Fig. 3. The use of 70% maintains the high breakdown capability of the MCD algorithm.
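One detection trial of this kind can be sketched as follows (a single value of m; true_cov and L are the placeholders from the first sketch, and 9.33 is the cut-off quoted above):

```python
import numpy as np
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(1)

# Eigenvalues/eigenvectors of the *true* covariance matrix, sorted descending.
e_vals, e_vecs = np.linalg.eigh(true_cov)
e_vals, e_vecs = e_vals[::-1], e_vecs[:, ::-1]

# Outlier in the eigenvector basis: C = (0, 0, m*sqrt(e3), -m*sqrt(e4), 0),
# i.e. m*sqrt(2) standard deviations from the cluster center.
m = 3.0
C = np.array([0.0, 0.0, m * np.sqrt(e_vals[2]), -m * np.sqrt(e_vals[3]), 0.0])
outlier = C @ e_vecs.T                          # rotate back: C' = C V^T

# 50 blanks plus the outlier, then a robust fit with a data fraction of 0.7.
data = np.vstack([rng.standard_normal((50, 5)) @ L.T, outlier])
mcd = MinCovDet(support_fraction=0.7, random_state=0).fit(data)

r_vals, r_vecs = np.linalg.eigh(mcd.covariance_)
r_vals, r_vecs = r_vals[::-1], r_vecs[:, ::-1]
rot = (data - mcd.location_) @ r_vecs
pmds = np.sum(rot[:, 2:4] ** 2 / r_vals[2:4], axis=1)   # Vectors 3 and 4

detected_by_rank = pmds[-1] >= np.sort(pmds)[-3]        # among the 3 largest
detected_by_cutoff = pmds[-1] > 9.33                    # fixed cut-off from Fig. 3
```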
The results of the analysis above suggest that a data point needs to be approximately 3 standard deviations away from the cluster center (in terms of the true covariance matrix) in order to have a 90% chance of being detected. The question then is: "How big is this in terms of crack size?" The answer obviously depends on the inspection being made. However, as an example, data from Sample 13 of [8] is shown in Fig. 5. The data has been rotated by the eigenvectors of the true covariance matrix of the blanks and scaled by the square root of the corresponding eigenvalue. The results suggest that, at least for this case, a threshold of three standard deviations should result in substantial detection of the notches.
Another important consideration is: How well does this approach work as a function of cluster size? In order to investigate this, a series of graphs similar to Fig. 4 was generated for blank clusters of 30, 40, 50, 60, 80, and 100 samples. A logit fit to the data for each graph allowed the a90, the size at which a defect has a 90% chance of being detected, to be determined. The result is shown in Fig. 6. Down to a cluster size of 60, cluster size has little effect on a90. However, below this, a90 starts to increase sharply. As might be expected, this result depends somewhat on the number of parameters contributing to the covariance matrix. More parameters require larger data sets to achieve the same performance, while fewer parameters require smaller ones.
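A logit (logistic) fit of this kind can be sketched as below; the sizes and detected fractions are illustrative placeholders, not the data behind Fig. 6.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative detection data: outlier size m (in standard deviations) and the
# fraction of trials in which the outlier was detected.
sizes = np.array([1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
detected = np.array([0.12, 0.35, 0.68, 0.90, 0.97, 1.0])

def logistic(x, b0, b1):
    # Standard two-parameter logit curve for the probability of detection.
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))

(b0, b1), _ = curve_fit(logistic, sizes, detected, p0=(-5.0, 2.0))

# a90: the size at which the fitted probability of detection reaches 90%,
# obtained by solving b0 + b1 * a90 = logit(0.9) = ln(9).
a90 = (np.log(9.0) - b0) / b1
```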
In fact, the situation is somewhat worse as the cluster size decreases, because it is often the a90/95 that is the quantity of interest for aviation applications [39], [40]. This latter quantity will increase as the scatter in the covariance matrix increases. As can be seen in Fig. 3, a plot of the MDs versus chi-square yields an almost straight line. If the MDs had a chi-square distribution, the slope of this line would be one. The slopes in Fig. 3 are averages over 100 repeat simulations. If, instead, the 95th percentile is plotted in the same way, it also produces approximately a straight line. The closer the slope of that line is to that of the average, the more reproducible the results are and, consequently, the closer the a90/95 is to the a90. Figure 7 shows how the slopes of these two lines vary as a function of sample size for a data fraction of 70%. It can be seen that at sample sizes of 60 or greater there is little effect on the average slope, but below this value the slope begins to rise. This is reflected in similar behaviour for the a90 in Fig. 6. However, for all the sample sizes investigated, the slope associated with the 95th percentile increases as the sample size decreases. The increase is slow at first, but becomes rapid as the sample size decreases further. This would result in a rapidly increasing a90/95 at the smaller sample sizes. Hence, there is an emphasis on attaining large data sets to improve the estimate of a90/95.

An inspection of the underside of the outer wing of a Twin Otter was recently carried out and serves as an example of the sample sizes of interest. Running along a spar, there were typically 13 fasteners between ribs. Consequently, sample sizes well over 100 were easily obtainable for each spar or stringer. In fact, the data for several stringers could be combined, making even larger data sets. Going along a rib, there were typically only 4 fasteners between spars, for approximately 29 fasteners per rib. Still, combining the results for several ribs could push the numbers up. The challenge came when considering fasteners at the intersection of a rib and a spar. In one case, there were only 25 such fasteners in a wing panel that had a different conductivity from the other wing panels. In this case, a qualitative assessment based on a 3D scattergraph was required.
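Returning to the slope comparison behind Fig. 7, a rough sketch of the repeated blank-only simulations (using the placeholder Cholesky factor L and the 3-dimensional PMDS of the earlier sketches) is:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(2)
n, n_rep, dof = 50, 100, 3
quantiles = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=dof)

# Sorted PMDS values for each repeat of the blank-only simulation.
sorted_pmds = np.empty((n_rep, n))
for r in range(n_rep):
    x = rng.standard_normal((n, 5)) @ L.T
    mcd = MinCovDet(support_fraction=0.7, random_state=r).fit(x)
    vals, vecs = np.linalg.eigh(mcd.covariance_)
    vals, vecs = vals[::-1], vecs[:, ::-1]
    rot = (x - mcd.location_) @ vecs
    sorted_pmds[r] = np.sort(np.sum(rot[:, :dof] ** 2 / vals[:dof], axis=1))

# Slopes (through the origin) of the mean curve and of the 95th-percentile curve.
mean_slope = np.sum(quantiles * sorted_pmds.mean(axis=0)) / np.sum(quantiles ** 2)
p95_slope = np.sum(quantiles * np.percentile(sorted_pmds, 95, axis=0)) / np.sum(quantiles ** 2)
```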
The size requirement of this approach depends on the number of parameters that are used to describe the data. As a guideline, Rousseeuw and Hubert [31] state that as a minimum the number of samples needs to be greater than five times the number of dimensions. In the example shown in Fig. 6, this would translate to a minimum of 25 data points. However, as Fig. 6 shows, in order to achieve a reasonable a90, one needs at least 50 samples and preferably more. Intuitively this makes sense, since the more parameters there are, the bigger the covariance matrix becomes and the more coefficients need to be evaluated. If more unknowns are to be evaluated, then more samples are required to achieve the same level of uncertainty.
In principle, one would want to reduce the number of dimensions used in the PMDS to just those that carry relevant information. Simply adding dimensions with little or no contribution from the defect signal increases the magnitude of the cut-off with no corresponding increase in signal. For example, the 90% cut-off of a chi-square with one degree of freedom is 2.71 (in squared standard deviations). With two degrees of freedom (DOF) it increases to 4.61 and, at 3 DOF, it is 6.25. Simulations using a sample size of 50 and the same data as Fig. 6 put the a90 at 2.3, 3.3 and 3.9 standard deviations for 1, 2 and 3 dimensions, respectively. This is approximately linear with the square root of the chi-square limits, as might be expected. On the other hand, reducing the number of dimensions below what is represented in the defect signal can result in a significant increase in the number of misses. For example, in Fig. 5 (right), if only Score 3 were used, one would miss all the defects above the grey circle. Using just Score 4 would result in the loss of all the points to the left of the circle, i.e. most of the notches. Fortunately, the question of which dimensions to include or exclude can be settled empirically, because the rotated scores (Eq. 4) are independent of each other in terms of their contribution to the PMDS, as was shown in Eq. 5. One can add (or remove) a dimension in the PMDS and, if the defects have no contribution to that dimension, the largest outliers will only change marginally. On the other hand, if one adds or removes a dimension with a large contribution from defects, then the PMDS of the affected points can be expected to change significantly. Hence, the choice of dimensions to include in the PMDS can be established by examination of the data. In the case where the inspection is expected to yield very few or no defects, it makes sense to establish which dimensions are likely to be useful using a test piece with a selection of representative flaws. It is important, however, that such a test piece contain enough unflawed samples that a reasonably accurate empirical covariance matrix can be established. Test pieces tend to be notoriously defect-rich.
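The chi-square cut-offs quoted above can be verified directly with SciPy:

```python
from scipy.stats import chi2

# 90% cut-offs for 1, 2 and 3 degrees of freedom: ~2.71, 4.61 and 6.25.
for dof in (1, 2, 3):
    print(dof, chi2.ppf(0.90, df=dof))
```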
The previous examples were based on a single outlier, but the results do not change significantly if more outliers are added. Figure 8 shows the results obtained when 5 outliers were added in the plane defined by Eigenvectors 3 and 4. The five defect points are all the same distance from the origin but randomly oriented. Two curves based on the 5 defects are shown. In one (open circles), a point was considered detected if it had one of the top 7 PMDSs, which implies only two false calls out of 50 if all 5 of the defects are found; however, the number of false calls goes up as the detected fraction goes down. In the second (full circles), a cut-off was used such that an overall 5% false call rate was obtained across all the simulations. This second curve is very close to that obtained earlier with only one defect.
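Placing several defects in the plane of Eigenvectors 3 and 4, all at the same Mahalanobis distance but with random orientations, can be sketched as follows (e_vals and e_vecs are the true eigenvalues and eigenvectors from the earlier illustrative code):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 3.0                                   # distance from the origin in standard deviations
theta = rng.uniform(0.0, 2.0 * np.pi, 5)  # random orientations in the 3-4 plane

# Components along Eigenvectors 3 and 4 in the eigenvector basis; each point
# has Mahalanobis distance m regardless of its orientation.
defects_eig = np.zeros((5, 5))
defects_eig[:, 2] = m * np.sqrt(e_vals[2]) * np.cos(theta)
defects_eig[:, 3] = m * np.sqrt(e_vals[3]) * np.sin(theta)

# Rotate back into the original basis before appending to the blank data.
defects = defects_eig @ e_vecs.T
```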
The distribution of PMDS scores for the simulation in Fig. 8, corresponding to defects at 3 standard deviations, is shown in Fig. 9. The 100 simulations were ranked on the basis of the highest PMDS for a blank. The black curve is from the run with the 5th highest blank PMDS score (95th percentile) and the blue curve from the run with the 5th smallest (5th percentile). Over most of their range, the two curves have very similar slopes, suggesting that all such curves are very similar. Obviously, the false call rate is very high (12%) for the 95th percentile curve. On the other hand, there are no false calls for the 5th percentile curve. One positive feature of the cut-off approach is that, while it may result in a high false call rate a small number of times, in neither of the two cases examined here did it miss any of the defects. The cut-off approach also works better as the number of defects to be found increases. When only one or two defects, if any, are expected, the rank approach is easy to implement and requires no simulations. However, as the number of expected defects increases, one needs increasingly complex decision trees to determine how far down in rank to go to find defects. The deeper one goes, the fewer defects escape, but the more false positives are obtained. The cut-off approach does not suffer from this issue. To obtain a cut-off, one can use simulations, as was done here, but a simpler alternative is to plot the PMDS against the corresponding chi-square value for the same percentile, as is done in Fig. 9. The appropriate cut-off tends to be where the curve breaks from linearity. For the 95th percentile curve, this occurs at a chi-square value of about 3.3 and, for the 5th percentile curve, at a chi-square value of about 5. Both curves give a cut-off of approximately 7, which is close to the value of 6.8 determined from simulations. This approach can be automated with little effort.
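One simplistic way to automate the break-from-linearity rule is sketched below; the fitted fraction and tolerance are assumptions chosen for illustration, not values used in the study.

```python
import numpy as np
from scipy.stats import chi2

def breakpoint_cutoff(pmds, dof, fit_frac=0.8, tol=1.5):
    """Return a PMDS cut-off at the point where the chi-square plot breaks
    from linearity (illustrative heuristic; fit_frac and tol are assumed)."""
    pmds = np.sort(pmds)
    n = len(pmds)
    q = chi2.ppf((np.arange(1, n + 1) - 0.5) / n, df=dof)

    # Slope through the origin, fitted on the lower fit_frac of the points.
    k = int(fit_frac * n)
    slope = np.sum(q[:k] * pmds[:k]) / np.sum(q[:k] ** 2)

    # The first point that rises more than `tol` above the fitted line marks
    # the break; its PMDS value is taken as the cut-off.
    above = np.nonzero(pmds > slope * q + tol)[0]
    return pmds[above[0]] if above.size else pmds[-1]
```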
It should be emphasized that the point of cluster analysis or outlier detection is just that: to detect outliers for further investigation. While the primary interest here has been in analyzing complex signals from relatively small data files, the approach can be very useful for the automatic scanning of large data files of even simple data, such as eddy current signals from heat exchangers. In this case, the approach can use large amounts of "normal" data to automatically identify regions of interest and bring them to the attention of human analysts for investigation, saving the analysts many hours of scanning through large data files. The less common the defects and the larger the data sets, the better the cluster analysis will work. This is in contrast to humans, who tend to be less vigilant when scanning large amounts of data with little expectation of finding anything, the proverbial "needle in a haystack".