Explainable analytics: Understanding causes, correcting errors, and increasingly achieving perfect accuracy from the nature of distinguishable patterns

In addition to pursuing accurate analytics, it is invaluable to clarify how and why inaccuracy exists. We propose a transparent classification method (TC). In training, we discover patterns from positive and negative observations respectively; next, patterns are excluded if they appear in both types. In testing, observations are scored by the pure patterns and connected like social networks. Based on set theory, pure patterns have explanatory power for disentangling the tangled relationships between negative and positive observations. Experimental results demonstrate that TC can identify all positive (e.g., malignant) observations at low ratios of training to testing, e.g., 1:9 in the Breast Cancer Wisconsin (Original) dataset and 3:7 in the Contraceptive Method Choice dataset. Without fine-tuned parameters or random selection, TC eliminates methodological uncertainty. TC can visualize causes, and therefore prediction errors are traceable and can be corrected. Further, TC shows potential for identifying whether the ground truth is incorrect (e.g., diagnostic errors).


Main Text
Accurate prediction plays a pivotal role in analytics; however, in reality people usually face the challenge of explaining how and why a prediction is inaccurate 1-2 . According to a survey 3 , outpatient diagnostic errors occur at a rate of 5.08% (around 12 million US adults) per year. Even a 1% reduction in errors could save the lives of millions of people. We consider three major issues behind errors. First, faults in data: human mistakes or defective machines can produce faulty data. Without domain knowledge, such a fault is difficult to correct. Nevertheless, we should remove inconsistency, i.e., observations of the positive class that are identical to those of the negative one. In addition, positive and negative observations may have similar patterns that are inextricably interwoven, e.g., people with similar profiles may exhibit different behavior. Lim et al. 4 show that the contraceptive method choice (CMC) dataset 5 is the most difficult to classify; in particular, the minimum error rates are greater than 0.4. Second, mismatches between data and methods: data contains categorical (e.g., country), numerical (e.g., age), or both types of values, which impose natural constraints on analysis.
For categorical values, only the number of items and the mode are statistically relevant 6 , so a numerically oriented method is inherently inadequate. Numerical values can be transformed into categorical ones by discretization 7 , which has been widely applied in knowledge discovery and data mining (KDD) applications 8 . However, bias occurs if the categories are unrepresentative of the numerical values.
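As a concrete illustration of the discretization used later in the experiments (truncation to the first decimal place or to an integer, e.g., 1.68≈1.6 or 1.68≈1), the following minimal sketch shows one way such a granularity rule could be implemented; the function name and interface are illustrative, not the authors' code.

```python
import math

def discretize(value, decimals=1):
    """Truncate a numerical value to a fixed granularity,
    e.g. 1.68 -> 1.6 (decimals=1) or 1.68 -> 1 (decimals=0)."""
    if decimals <= 0:
        return math.floor(value)
    factor = 10 ** decimals
    return math.floor(value * factor) / factor

print(discretize(1.68, decimals=1))  # 1.6
print(discretize(1.68, decimals=0))  # 1
```

Coarser granularity maps more numerical values onto the same category, which merges observations and can create the inconsistencies (identical features, different labels) removed in preprocessing.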
Third, the big data challenge: the complexity of data is determined by the number of rows and features (columns). In particular, computation tasks grow rapidly with the number of features, a phenomenon known as the curse of dimensionality (CoD) 9 . To cope with CoD, dimension reduction and feature selection methods reduce complexity by extracting information that is practical for classification and cluster analysis. The extraction, a trade-off between efficiency and effectiveness, may involve pruning large amounts of data.
However, such pruning may have pitfalls 10 or discard clues to errors.

Results
We conducted experiments with two datasets: Breast Cancer Wisconsin (Original) (BCWO) and Contraceptive Method Choice (CMC) 5 . Figure 2 (A)(B) and (C)(D) show the results of BCWO and CMC, respectively. In Figure 2 (A), TC achieves perfect recall (i.e., recall=1.0) at the lowest ratio (i.e., 1:9) and at 7 other ratios. This means TC is not only accurate on small amounts of data but also stable as data increases. TC has one error at the ratios 2:8 and 3:7 because the positive observation PO 223 is predicted as a negative one. From the ratio 4:6 onward, PO 223 belongs to the training data and is not used in testing. For further exploration, at the ratio 10:10 # , PO 223 is irrelevant to PPP except for itself. Indeed, PO 223 is related to PP that are also relevant to NO. PPP can eliminate the impingement of PO 223 on other observations.

Methods
We propose a method of transparent classification, named TC, which not only pursues accuracy but also clarifies the causes of inaccuracy. Further, the design principles of TC ensure reproducibility 11 . Figure 1 shows the processes of TC.

In data preprocessing, TC handles missing values and mixed values. Without involving randomness or reduction, TC preserves the intrinsic nature of the data.

In identifying distinguishable patterns, TC finds patterns from training observations, which are used for predicting which class a test observation belongs to, e.g., a malignant or benign tumor. Through increasing ratios of training to testing, TC represents both the forest and the trees, and input data is given in sequence. To avoid CoD, TC finds patterns by intersecting pairwise observations in each of the classes; these patterns possess the essential features of the data in miniature. In the worst case, n observations produce only n(n-1)/2 patterns (one per pair). By contrast, KDD faces the challenge of CoD, i.e., given the lowest threshold, k items yield 2^k itemsets 12 , and large amounts of itemsets are pruned if the threshold is high.

For positive patterns (PP), TC obtains PP from positive training observations (PO). For pure PP (PPP), TC excludes any positive pattern that also appears in negative training observations (NO). By set theory, the exclusion implies that none of the PPP is included in any of the NO, and hence TC can distinguish between PO and NO. Analogously, negative patterns (NP) and pure NP (PNP) are the counterparts of PP and PPP. Without involving either fine-tuned parameters or random selection, TC eliminates methodological uncertainty.

In establishing the causes, TC accumulates positive, negative, and novel degrees of a test observation O_t by rules 1, 2, and 3, which associate patterns with the observation and provide obvious clues for judgement. In rule 1, O_t containing patterns in PPP gets a positive score (PS). In rule 2, O_t containing patterns in PNP gets a negative score (NS).
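The pattern-discovery and exclusion steps can be sketched as follows. This is a minimal illustration, assuming each observation is represented as a set of (feature, value) items and each pattern as the set intersection of a pair of same-class observations; the representation and function names are assumptions, not the authors' implementation.

```python
from itertools import combinations

def patterns(observations):
    """Discover patterns by intersecting every pair of observations
    of one class; an observation is a set of (feature, value) items."""
    found = set()
    for a, b in combinations(observations, 2):
        inter = frozenset(a & b)
        if inter:
            found.add(inter)
    return found

def pure(own_patterns, other_observations):
    """Keep only patterns contained in no observation of the other
    class, so each surviving pattern distinguishes the two classes."""
    return {p for p in own_patterns
            if not any(p <= o for o in other_observations)}

# Toy data: two positive and two negative training observations.
PO = [{("a", 1), ("b", 2)}, {("a", 1), ("b", 3)}]
NO = [{("a", 1), ("b", 4)}, {("a", 2), ("b", 4)}]
PP = patterns(PO)    # the single pattern {("a", 1)}
PPP = pure(PP, NO)   # empty: ("a", 1) also appears in an NO
```

In this toy case, the only positive pattern also occurs in a negative observation, so the exclusion step leaves no pure positive pattern; by the same set-theoretic argument, any surviving PPP is guaranteed absent from every NO.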
In rule 3, O_t containing none of the patterns in PPP and PNP is considered novel and gets a novelty score (NT) equal to the number of training observations.

In understanding the results of analytics, we evaluate the performance of TC by three measures: Precision, Recall, and AUC (Area Under Curve) 8,12 . According to the standard of diagnostic medicine 13 : AUC=0.5, no discrimination; 0.7≤AUC<0.8, acceptable; 0.8≤AUC<0.9, excellent; and 0.9≤AUC≤1, outstanding.

Regarding the causes of prediction errors, error 1 (a false positive 14 ) occurs if O_t is predicted as positive but is actually negative, denoted by NO_t*. By cause 1.1, NO_t* contains pure positive patterns although it should not. By cause 1.2, NO_t* is novel, namely, it contains no pattern in PPP or PNP. Error 2 (a false negative 14 ) occurs if O_t is predicted as negative but is actually positive, denoted by PO_t*.
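Rules 1-3 can be sketched as a single scoring function. This is an assumption-laden sketch: the text says TC "accumulates" degrees without stating the exact increment, so here each matching pattern is simply counted once; the representation of observations and patterns as sets of (feature, value) items is likewise assumed.

```python
def score(o_t, PPP, PNP, n_train):
    """Score a test observation o_t by rules 1-3: count matching pure
    positive/negative patterns; if none match, flag o_t as novel."""
    ps = sum(1 for p in PPP if p <= o_t)        # rule 1: positive score
    ns = sum(1 for p in PNP if p <= o_t)        # rule 2: negative score
    nt = n_train if ps == 0 and ns == 0 else 0  # rule 3: novelty score
    return ps, ns, nt

# Toy example: one pure positive and one pure negative pattern.
PPP = {frozenset({("a", 1)})}
PNP = {frozenset({("b", 2)})}
print(score({("a", 1), ("c", 3)}, PPP, PNP, n_train=10))  # (1, 0, 0)
print(score({("c", 3)}, PPP, PNP, n_train=10))            # (0, 0, 10)
```

Because the scores come from explicit pattern containment, every prediction can be traced back to the patterns that produced it, which is what makes the error causes below inspectable.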
By cause 2.1, PO_t* contains pure negative patterns but no pure positive pattern although it should. Prediction errors occur due to insufficient training data or labelling errors in training data. Increasing training data helps to reduce prediction errors. If the portion of labelling errors is small, TC has the potential to identify labelling errors. Specifically, false negatives usually have a small NS. The causes of prediction errors, based on set theory, provide rational explanations for the errors made by TC.

Figure 2. (A)(B) In the Breast Cancer Wisconsin (Original) data set (BCWO), we map the class values "malignant" to "1" and "benign" to "0". In case (A), the granularity of discretization is the first decimal place, e.g., 1.68≈1.6, while in case (B) we take an integer for the granularity, e.g., 1.68≈1. (C)(D) In the Contraceptive Method Choice data set (CMC), we map the class values "1=No-use" to "1", "2=Long-term" to "0", and "3=Short-term" to "0". For (C) and (D), we set the same granularity as that of (A) and (B), respectively. For consistency, we remove observations that have identical features but different class labels. The number of observations is thus reduced from 1473 to 1399 in (C) and from 1473 to 980 in (D).
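The consistency step, removing observations whose features appear under more than one class label, could be sketched as follows; the data representation (sets of (feature, value) items paired with labels) and the function name are illustrative assumptions.

```python
from collections import defaultdict

def remove_inconsistent(observations, labels):
    """Drop every observation whose exact feature set occurs with more
    than one class label, keeping only consistently labelled rows."""
    seen = defaultdict(set)
    for obs, lab in zip(observations, labels):
        seen[frozenset(obs)].add(lab)
    return [(obs, lab) for obs, lab in zip(observations, labels)
            if len(seen[frozenset(obs)]) == 1]

# Toy example: the first two rows share features but disagree on the
# label, so both are removed; only the third row survives.
obs = [{("a", 1)}, {("a", 1)}, {("b", 2)}]
labs = [0, 1, 0]
print(remove_inconsistent(obs, labs))  # [({("b", 2)}, 0)]
```

This mirrors the reduction reported for CMC: the coarser the discretization, the more feature sets collide across classes, which is why case (D) loses more rows (1473 to 980) than case (C) (1473 to 1399).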