We evaluated the quality of two classification (aggregation) algorithms: ROC-tree and division at quartiles. The universal nature of the aggregation task allowed us to demonstrate the algorithms on the ‘Credit Card Fraud Detection’ dataset downloaded from https://www.kaggle.com/mlg-ulb/creditcardfraud. This dataset contains far more cases than any available medical dataset while preserving the imbalanced structure inherent in medical data. Furthermore, using a dataset distant from healthcare avoids unnecessary discussion of the acceptability of particular predictive scores (e.g. EuroSCORE, SYNTAX score, CSA-AKI, Charlson comorbidity index, et cetera).
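For illustration, a minimal sketch of loading this dataset and inspecting its imbalance; it assumes the Kaggle CSV has been downloaded locally as creditcard.csv, with its binary ‘Class’ label (1 = fraud):

```python
# Minimal sketch: load the Kaggle 'Credit Card Fraud Detection' data and
# inspect its class imbalance. Assumes the file has been downloaded as
# 'creditcard.csv'; the dataset exposes a binary 'Class' column (1 = fraud).
import pandas as pd

df = pd.read_csv("creditcard.csv")
counts = df["Class"].value_counts()
print(counts)             # absolute counts per class
print(counts / len(df))   # relative frequencies: frauds are a small minority
```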
Classification methods are used across biological and medical sciences as a form of categorization in which discrete groups (strata) of data are created. Classification is one of the most important and difficult tasks of data mining, which is why finding a good classifier and classification algorithm is an important component of data mining. Classification into several tiers is a further step in organizing and understanding data; for example, dividing a patient cohort into high-, medium-, and low-risk strata based on scores is an important step in organizing and understanding clinical contexts.4
One of the most frequently used approaches for dividing a dataset into tiers is division at percentiles, which creates strata containing similar numbers of cases; the use of previously defined cut-off points is also traditional, especially in medical investigations.
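Division at quartiles, the reference algorithm in this study, can be expressed in a few lines; the sketch below applies pandas' qcut to a hypothetical continuous risk score, producing four strata of approximately equal size:

```python
# Sketch of division at quartiles: pd.qcut assigns each case to one of four
# strata of (approximately) equal size based on a continuous score.
import pandas as pd

scores = pd.Series([0.02, 0.10, 0.15, 0.30, 0.45, 0.60, 0.80, 0.95])  # hypothetical risk scores
tiers = pd.qcut(scores, q=4, labels=["low", "medium-low", "medium-high", "high"])
print(tiers.value_counts())   # each stratum holds roughly 25% of the cases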
In the two-class classification task, the Receiver Operating Characteristic (ROC) curve is one of the most widely used tools to assess the performance of algorithms.5, 6 The area under the receiver operating characteristic curve (AUC), also referred to as the c statistic, is by far the most popular index of discrimination ability.7 ROC curves have an attractive property: they are insensitive to changes in class distribution, being independent of the proportion of positive to negative instances in a test set.8
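In practice the ROC curve and AUC can be obtained directly from predicted scores; a minimal sketch with scikit-learn (labels and scores here are purely illustrative):

```python
# Sketch: ROC curve and AUC (c statistic) from true labels and predicted scores.
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [0, 0, 0, 0, 1, 0, 1, 1]                    # illustrative binary labels
y_score = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9]   # illustrative predicted scores
fpr, tpr, thresholds = roc_curve(y_true, y_score)    # points of the ROC curve
print(roc_auc_score(y_true, y_score))                # area under the ROC curve
```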
Several researchers have investigated applications of ROC curves beyond measuring classification success. Ferri et al. (2002) altered decision trees to use the AUC-ROC as their splitting criterion.9, 10 Another binary decision tree construction algorithm based on the c statistic was developed by Hossain et al. (2008). These authors used an AUC measure to select a node based on its classification performance and then used the misclassification rate to choose a split point.11 In our study, we adapted the idea of the ROC-tree as a form of tree that divides the classification process into a number of smaller steps that are intuitive and generally easy to interpret.12 However, we used the Youden index (Bookmaker Informedness) to determine the optimal cut-off point. The misclassification rate, as a complement of accuracy (one can be calculated from the other), can be misleading when the data are imbalanced,13 because of the dominating effect of the majority class.5
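The Youden-index cut-off used at each split can be read directly off the ROC curve; a minimal sketch (the helper name and data are ours, not from the cited implementations):

```python
# Sketch: optimal cut-off by the Youden index J = TPR - FPR
# (equivalently, sensitivity + specificity - 1), maximized over ROC thresholds.
import numpy as np
from sklearn.metrics import roc_curve

def youden_cutoff(y_true, y_score):
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    j = tpr - fpr                       # Youden index at every candidate threshold
    return thresholds[np.argmax(j)]     # threshold with the largest J

print(youden_cutoff([0, 0, 1, 0, 1, 1], [0.1, 0.3, 0.4, 0.5, 0.7, 0.9]))
```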
The Youden index (J = sensitivity + specificity − 1), in contrast to accuracy, directly incorporates both the true positive rate and the true negative rate. This index is recognized as a suitable performance metric for the classification of imbalanced datasets.14
The selection of performance metrics is another issue considered in this study. Accuracy, error rate, sensitivity, and specificity are the metrics most often used for summarizing the performance of classification models. Comparing different classifiers with these measures is easy, but they have many problems, such as sensitivity to imbalanced data and ignoring the performance of some classes.13, 15–17 Class imbalance is one of the significant issues affecting the performance of classifiers.18 The determination of the most suitable performance metric is a major issue in the classification of class-imbalanced datasets.14, 18 In imbalanced datasets, not only is the class distribution skewed, but the misclassification cost is often uneven as well: minority class examples are often more important than majority class examples.5
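The dominating effect of the majority class is easy to demonstrate: a degenerate classifier that labels every case negative reaches high accuracy on imbalanced data while its sensitivity is zero. A sketch with illustrative numbers:

```python
# Sketch: accuracy is misleading on imbalanced data. A classifier that
# predicts the majority class for every case still scores 99% accuracy.
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] + [0] * 99   # 1% minority class
y_pred = [0] * 100        # always predict the majority class
print(accuracy_score(y_true, y_pred))   # 0.99
print(recall_score(y_true, y_pred))     # sensitivity = 0.0
```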
It is recommended to consider a combination of different measures instead of relying on only one measure when dealing with class-imbalanced data.13 Hybrid threshold metrics such as the geometric mean14, 19 or Bookmaker Informedness14 have been shown to be useful as performance metrics for imbalanced datasets. The F-measure (the harmonic mean of precision and recall) is also recommended in this case.19 However, it completely ignores true negatives, which can vary freely without affecting the statistic.20 The Matthews correlation coefficient (MCC) has been described as the metric least influenced by imbalanced data.13
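All of these hybrid metrics can be derived from the same confusion-matrix counts; a minimal sketch (labels and predictions are illustrative):

```python
# Sketch: hybrid threshold metrics from confusion-matrix counts.
import math
from sklearn.metrics import matthews_corrcoef, f1_score

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives

sens, spec = tp / (tp + fn), tn / (tn + fp)
print(math.sqrt(sens * spec))             # geometric mean
print(sens + spec - 1)                    # Youden index / Bookmaker Informedness
print(f1_score(y_true, y_pred))           # F-measure (ignores true negatives)
print(matthews_corrcoef(y_true, y_pred))  # MCC
```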
In our study, we used hybrid measures to compare the classification algorithms. The macro-average of the Youden index, as a metric of discriminative power,21 was significantly higher for the ROC-tree algorithm in the one-vs-one comparison (Table 3). Other hybrid threshold metrics, such as optimized precision and the geometric mean, also showed higher macro-average values for the ROC-tree algorithm in the one-vs-one comparison.
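As an illustration of how such a macro-average is assembled, the sketch below computes a per-class Youden index and averages it with equal class weights; one-vs-rest binarization is used here purely for illustration, and the exact class-pairing scheme of the study is the one described in Methods:

```python
# Sketch: macro-average of a per-class Youden index. Each class is scored
# against the rest and the per-class values are averaged with equal weight.
import numpy as np
from sklearn.metrics import recall_score

def macro_youden(y_true, y_pred, classes):
    js = []
    for c in classes:
        t = np.asarray(y_true) == c   # one-vs-rest binarization
        p = np.asarray(y_pred) == c
        sens = recall_score(t, p)     # true positive rate for class c
        spec = recall_score(~t, ~p)   # true negative rate for class c
        js.append(sens + spec - 1)
    return float(np.mean(js))

print(macro_youden([0, 1, 2, 1, 0, 2], [0, 1, 1, 1, 0, 2], classes=[0, 1, 2]))
```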
The “reproducibility” of cut-off points and metrics was tested by 10-fold cross-validation, which is a more stable extension of split-sample validation.2, 22 In this procedure, cut-off points were determined on nine of the ten folds and tested on the remaining fold, repeated ten times; in this way, every case served once to test the model. The performance is commonly estimated as the average over all assessments.2 The cut-off points derived from the full dataset are accepted as unique and can be used for further evaluation.23
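A minimal sketch of this fold-wise procedure, reusing the youden_cutoff helper sketched earlier (the labels and scores are simulated):

```python
# Sketch: 10-fold cross-validation of a cut-off point. The cut-off is derived
# on nine folds, applied to the held-out fold, and the per-fold performance
# is averaged; every case is used for testing exactly once.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import recall_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)        # simulated binary labels
s = y * 0.3 + rng.random(200) * 0.7     # simulated scores correlated with y

fold_sens = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(s):
    cut = youden_cutoff(y[train_idx], s[train_idx])            # derive on 9/10
    pred = (s[test_idx] >= cut).astype(int)                    # apply to 1/10
    fold_sens.append(recall_score(y[test_idx], pred))
print(np.mean(fold_sens))               # performance averaged over the folds
```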
Study limitations. Extending the number of studied datasets could increase the power of the derived conclusions. The power of the conclusions could also be increased by including more known confusion matrix derivatives, which could lead to the selection of the most effective combination of classification performance metrics. We defined the optimal cut-off point in the ROC analysis using the Youden index; however, a comparison of the stability of cut-off points computed by other known methods could help in selecting the optimal metric for determining the splitting point.
The effects of sampling techniques, such as down-sampling to reduce the number of samples in the majority class, and the assessment of differences in the proportion of the minority class across datasets were not evaluated in our study. However, these methods are known and recognized as effective in the machine learning field. To some extent, the development of ‘failure to rescue’ as a quality indicator24, 25 is an example of down-sampling in health care.
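For completeness, a minimal sketch of random down-sampling of the majority class (not part of our study; the data and the 1:1 target ratio are illustrative):

```python
# Sketch: random down-sampling of the majority class to a 1:1 ratio.
import pandas as pd

df = pd.DataFrame({"x": range(1000), "Class": [1] * 50 + [0] * 950})  # illustrative
minority = df[df["Class"] == 1]
majority = df[df["Class"] == 0].sample(n=len(minority), random_state=0)
balanced = pd.concat([minority, majority]).sample(frac=1, random_state=0)  # shuffle
print(balanced["Class"].value_counts())   # 50 cases per class
```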
In our study, the metrics in the one-vs-one comparison of classes were computed independently for each class, and their averages were then compared. These macro-averages treat all classes equally. Combining this approach with the micro-average, which aggregates the contributions of all classes to compute the average metric, could be effective in evaluating the effect of the individual classes.
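The distinction between the two averaging schemes is easy to see in code; a sketch contrasting them on a per-class metric (data are illustrative):

```python
# Sketch: macro- vs micro-averaging. Macro averages per-class scores with
# equal weight; micro pools all class contributions before averaging, so
# large classes dominate the result.
from sklearn.metrics import recall_score

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 2, 2]
print(recall_score(y_true, y_pred, average="macro"))  # treats all classes equally
print(recall_score(y_true, y_pred, average="micro"))  # dominated by the large class
```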