Clinical diagnostic decisions have a direct impact on the treatment and outcomes of patients. As large volumes of biomedical data are collected and become available for analysis, there is growing interest in, and need for, applying machine learning (ML) methods to diagnose diseases, predict patient outcomes and propose therapeutic treatments. For example, Deep Learning (DL) has been successful in assisting the analysis and classification of medical scans, X-rays, etc. Although DL is generally considered a black box [1] that lacks the transparency needed to explain why a decision is made, for these forms of visual data human users are able to relate the targets to the input data. However, when dealing with relational datasets, where no explicit pattern (except the class label, if given) can be extracted from the input data and related to the decision targets, the ML/DL process remains opaque. If the patterns inherent in the relational data, though not visualized, are succinctly related to the targets, existing ensemble algorithms such as Boosted SVM or Random Forest can produce good predictive results. But the underlying patterns supporting the decision are still opaque and uninterpretable for clinicians [2]. Hence, existing ML approaches on relational data still face difficult problems concerning transparency, low data volume, and/or imbalanced classes [3] [4].
To provide transparency and interpretability, Decision Tree, Frequent Pattern Mining and Pattern Discovery methods have been proposed. For decades, Frequent Pattern Mining [5] [6] [7] has been an essential data mining task for discovering knowledge in the form of association rules from relational data [7]. However, as revealed in our recent work [8] [9] [10], the Attribute-Value Associations (AVAs) forming patterns of different classes/targets can be entangled due to multiple entwining functional characteristics inherent in the source environments. As a result, the patterns discovered directly from the acquired data may contain overlapping or functionally entwined AVAs [8] [10].
In this paper, we therefore present a new classification method, cPDD, based on Pattern Discovery and Disentanglement (PDD), which is capable of tackling this problem. We focus particularly on the imbalanced class problem, since it remains challenging for most traditional ML methods.
The cPDD algorithm is outlined in Fig. 1. From a clinical relational dataset R with, say, N attributes, the frequencies of occurrence of all distinct Attribute-Value (AV) pairs (or second-order Attribute-Value Associations (AVAs)) are first obtained. Each frequency is then converted into a statistical measure known as the adjusted statistical residual (SR) [7], which accounts for the deviation of that frequency from the default model in which the AVs in the pair are statistically independent. A matrix of SRs is thus obtained, in which each SR represents the statistical interdependency of an AV pair. In this matrix, each row represents an AV-vector whose coordinates are the SR values of the AVAs between that AV and the AVs corresponding to the columns. This matrix is referred to as the AVA Statistical Residual Vector Space (SRV). The next step applies principal component decomposition (PCD) to decompose the SRV into different principal components (PCs) and re-projects the projections of the AV-vectors on each PC to a new SRV, referred to as the Re-projected SRV (RSRV). The AV-vectors, with their new set of coordinates in the RSRV, reflect the SRs of the AVAs captured by that PC. We refer to a PC together with its RSRV as a Disentangled Space (DS). Since the number of DSs is as large as the number of AVs, cPDD uses a DS-Screening Algorithm to select a small set of DSs, denoted by DS* = {}, retaining a DS only if the maximum SR in its RSRV exceeds a preset statistical threshold (e.g. 1.96 at the 95% confidence level). The AVs with statistically significant AVAs form Attribute-Value Clusters (AV-Clusters) in the PCs, reflecting groups of strongly associated AVs.
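The steps described above can be illustrated with a minimal Python sketch. It is not the implementation of cPDD: the function names (`sr_matrix`, `disentangle`, `screen`), the toy records, and several details are assumptions made for illustration. In particular, the SR is computed here with the common adjusted-residual form (observed minus expected, divided by the square root of the expected count times the two complement marginal proportions), SRs between values of the same attribute are set to zero, and PCD is taken to be an eigen-decomposition of the covariance of the SRV.

```python
import numpy as np


def sr_matrix(records):
    """Return the distinct AVs and the matrix of adjusted residuals between them."""
    n_rows, n_attrs = len(records), len(records[0])
    # An AV is identified by (attribute index, value).
    avs = sorted({(j, row[j]) for row in records for j in range(n_attrs)})
    col = {av: k for k, av in enumerate(avs)}
    # Indicator matrix: one row per record, one column per AV.
    X = np.zeros((n_rows, len(avs)))
    for i, row in enumerate(records):
        for j, value in enumerate(row):
            X[i, col[(j, value)]] = 1.0
    occ = X.sum(axis=0)                    # occurrences of each AV
    obs = X.T @ X                          # observed co-occurrences of AV pairs
    exp = np.outer(occ, occ) / n_rows      # expected counts under independence
    var = exp * np.outer(1 - occ / n_rows, 1 - occ / n_rows)
    sr = np.zeros_like(obs)
    ok = var > 0                           # skip AVs present in every record
    sr[ok] = (obs[ok] - exp[ok]) / np.sqrt(var[ok])
    # Assumption of this sketch: values of the same attribute are mutually
    # exclusive, so their SR is set to zero rather than treated as an AVA.
    attr = np.array([j for j, _ in avs])
    sr[attr[:, None] == attr[None, :]] = 0.0
    return avs, sr


def disentangle(sr):
    """Decompose the SRV into PCs; return (eigenvalue, pc, rsrv), largest first."""
    centered = sr - sr.mean(axis=0)
    cov = centered.T @ centered / (sr.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    spaces = []
    for k in np.argsort(eigvals)[::-1]:
        pc = eigvecs[:, k]
        # Project every AV-vector on the PC, then map the projection back to
        # SR coordinates; the RSRVs of all PCs sum to the original SRV.
        rsrv = np.outer(sr @ pc, pc)
        spaces.append((eigvals[k], pc, rsrv))
    return spaces


def screen(spaces, threshold=1.96):
    """Keep the spaces whose maximum re-projected SR exceeds the threshold."""
    return [s for s in spaces if s[2].max() > threshold]


# Hypothetical toy data: attribute 0 determines the class, attribute 1 is noise.
records = [
    ("x", "p", "pos"), ("x", "q", "pos"), ("x", "p", "pos"), ("x", "q", "pos"),
    ("y", "p", "neg"), ("y", "q", "neg"), ("y", "p", "neg"), ("y", "q", "neg"),
]
avs, sr = sr_matrix(records)
spaces = disentangle(sr)
selected = screen(spaces)
print(len(avs), "AVs;", len(spaces), "candidate spaces;", len(selected), "kept")
```

Because each RSRV is a rank-one slice of the SRV, a strong association in the original matrix may be shared between several PCs, which is why the screening is applied to each space separately.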
In traditional pattern discovery, discovering high-order patterns from the AVs of a dataset is complex, since the number of AV combinations that serve as pattern candidates is exponential. cPDD instead discovers patterns from a small number of AV-Clusters in a small set DS*. Hence, it not only dramatically reduces the number of pattern candidates, but also separates patterns according to their orthogonal AVA components, revealing orthogonal functional characteristics in AV-Clusters [10] [11] and subgroups in different DS*. Since the AV-Clusters come from a disentangled source, the set of patterns discovered therein is relatively small, with little or no overlapping and few or no “either-or” cases among their AVs. cPDD thus significantly reduces the variance problem and relates more specific patterns to the targets. Unlike traditional PD methods, which often produce an overwhelming number of entangled patterns, cPDD renders a much smaller, succinct set of patterns associated with specific functionality from the disentangled sources for easy and direct interpretation. Furthermore, owing to the reduction of the pattern-to-target variance, the patterns discovered from uncorrelated AVA source environments enhance prediction and classification, and are particularly effective for data with imbalanced classes.
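To give a sense of the scale of this reduction, the following sketch counts pattern candidates under one simple assumption: a candidate is any combination of two or more AVs with at most one value taken from each attribute. The function name `count_candidates` and the attribute sizes used are hypothetical and are not taken from the paper.

```python
from math import prod


def count_candidates(values_per_attribute):
    """Count AV combinations that span at least two attributes.

    Each attribute contributes either one of its values or nothing, so there
    are prod(m + 1) selections in total; the empty selection and the
    single-AV selections are then removed.
    """
    selections = prod(m + 1 for m in values_per_attribute)
    return selections - 1 - sum(values_per_attribute)


# Hypothetical dataset: 10 attributes with 3 distinct values each.
print(count_candidates([3] * 10))   # 1048545
# Hypothetical AV-Cluster: 4 AVs, each from a different attribute.
print(count_candidates([1] * 4))    # 11
```

Under these assumed sizes, searching inside a single small cluster involves about a dozen candidates rather than roughly a million.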
Machine Learning on Clinical Data Analysis
Today, deep learning (DL) and frequent pattern mining are two commonly used methodologies for data analysis. However, in a more general healthcare setting, where data analytics is based predominantly on clinically recorded numerical and descriptive data, the relations between the input (in terms of inherent patterns) and the output (decision targets/classes) are not obvious, particularly when the correlations among the signs, symptoms and test results of patients may be the manifestation of multiple factors [3] [12]. This poses a challenge to DL in clinical applications. Another concern is transparency and assured accuracy [3] [12]. As for transparency, DL is generally considered a black box [1]. Although ensemble algorithms, such as Boosted SVM for imbalanced data (BSI) or Random Forest, are good at prediction, their classification results are highly opaque and difficult for clinicians to interpret [2]. Hence, to provide transparency and interpretability, Decision Tree, Frequent Pattern Mining and Pattern Discovery methods have been proposed. Since the rules discovered by a Decision Tree are guided by class labels, it is unlikely to discover associations between attributes when class labels are not available. Furthermore, as revealed in our recent work [8] [9] [10], associations discovered from relational data can be entangled due to multiple entwining functional characteristics inherent in the source environments. The patterns discovered using existing frequent pattern mining approaches based on likelihood, weight of evidence [7], support, confidence or statistical residuals [6] [7] may contain overlapping or functionally entwined AVA patterns captured in the data, leading to an overwhelming number of redundant patterns that are very difficult to explain.
Although additional pattern clustering, pruning and summarization algorithms [13] [14] have been proposed and produce a smaller set of patterns/pattern clusters, the pattern entanglement problem has not been solved, and the resulting interpretation is neither robust nor comprehensive.
The cPDD method proposed in this paper solves the fundamental pattern entanglement problem and is designed to meet the clinical challenges posed above. It aims to provide results that are explainable to clinicians, using a small number of patterns discovered from the disentangled sources in a more succinct and interpretable form to reveal the diagnostic characteristics of patients and to provide statistical support for prediction. Owing to its ability to disentangle patterns, patterns from minority classes can be discovered in AVA spaces orthogonal to those of the majority classes.
Novelty and Contributions
cPDD extends our recent work [10] on AVA disentanglement to the discovery of statistically significant high-order patterns in AVA disentangled spaces. It provides robust and succinct interpretation, and achieves more specific and precise prediction from clinical data with anomalies and imbalanced class distributions. Its major contributions are threefold.
i. cPDD discovers and disentangles statistically significant high-order patterns to reveal the characteristics of different functional subgroups and/or classes in clinical data.
ii. It provides an explicit pattern representation for interpreting the characteristics of the dataset.
iii. It uses the discovered patterns to classify entities in the dataset with high precision, even when the class distribution is imbalanced.