Stop Oversampling for Class Imbalance Learning: A Critical Review

For the last two decades, oversampling has been employed to overcome the challenge of learning from imbalanced datasets, and many approaches have been offered in the literature. Oversampling, however, is a concern: models trained on fictitious data may fail spectacularly when applied to real-world problems. The fundamental difficulty with oversampling approaches is that, given a real-life population, the synthesized samples may not truly belong to the minority class. As a result, training a classifier on these samples while pretending they represent the minority may result in incorrect predictions when the model is used in the real world. In this paper, we analyze a large number of oversampling methods and devise a new oversampling evaluation system based on hiding a number of majority examples and comparing them to those generated by the oversampling process. Using this evaluation system, we rank all these methods by the number of incorrectly generated examples. Our experiments using more than 70 oversampling methods and three imbalanced real-world datasets reveal that all oversampling methods studied generate minority samples that are most likely to be majority. Given the data and methods in hand, we argue that oversampling in its current forms and methodologies is unreliable for learning from class-imbalanced data and should be avoided in real-world applications.

• Changing the loss function to give the failing minority class a higher cost [70].
• Oversampling the minority class.
• Any combination of previous approaches.
Each of the aforementioned approaches has its own set of benefits and drawbacks [71,72]. Oversampling, however, is the most commonly used approach among them, as seen by the multitude of oversampling methods published in the last two decades. This popularity does not necessarily imply that the oversampling approach is beneficial. Oversampling approaches boost the number of minority-class instances by creating new ones out of thin air, based only on their similarity to one or more of the minority's examples. This is troublesome since such methods may raise the likelihood of overfitting during the learning process [71,72,73,74,75,76]. On paper, the overfitted synthetic datasets produce good machine learning results; however, this is not always the case in practice. Another, more critical problem of oversampling is that the fabricated examples could exist in the real world as members of a different class, regardless of how similar they are to the minority's examples, since there are always examples from class A that lie closest to examples from a different class B. Therefore, we argue that, even if such synthesizing generates favorable outcomes on paper, negative results can easily be obtained in practice. The major goal of this study, in addition to reviewing a large number of oversampling methods, is to prove our counterclaim on the use of oversampling as a solution to the problem of class imbalance, which is as follows: oversampling in its current forms and methodologies is a misleading approach that should be avoided, since it feeds the learning process with falsified instances that are forced to be members of the minority class when they are most likely members of the majority.
To the best of our knowledge, the only methodology for establishing an oversampling method's goodness is the set of classification accuracy metrics obtained after classifying the oversampled datasets, with no tests of the validity of the synthesized instances or of whether they are appropriate for training a model for real-world use. Oversampling practitioners may therefore be pleased with their machine learning outcomes in the lab, but they should consider how much harm could be done in practice outside of the lab, particularly in medical and other vital applications. The harm is exacerbated when we realize that several of these methods have become integral parts of APIs and machine learning packages, such as the Python imbalanced-learn API [77] and the Smote-Variants API [78]. We prove our counterclaim in this paper by applying a number of typical oversampling methods to several benchmark datasets, concealing some of the majority examples, and then comparing the generated examples to the hidden majority examples to determine whether they approximately match. Finding such counterexamples proves our counterclaim.
The structure of this paper is as follows: the literature on the class imbalance problem is reviewed in the second section, the methodology for proving our counterclaim is described in the third section, and the experimental results are presented and discussed in the fourth section.

Literature review of oversampling methods
In the literature, there are various approaches to machine learning from class-imbalanced data. One of the most prevalent, particularly in the form of SMOTE-like approaches, is oversampling. On January 26, 2022, a Google Scholar search for the term "SMOTE" yielded 77,300 results, while a search for "oversampling" yielded 297,000 results. This is merely an indication of the growing trend of oversampling. Figure 1 depicts the nearly exponential increase in the number of articles that dealt with, employed, or addressed oversampling and/or SMOTE.
The relevance of the well-defined class imbalance problem and the simplicity of oversampling solutions are the reasons for this abnormal surge in oversampling research. Anyone with a rudimentary understanding of machine learning can come up with a novel way to produce fresh similar examples given some minority examples. There could be an infinite number of such solutions.
Several studies, such as [79,80,81], have reviewed various oversampling approaches; nevertheless, they are not thorough and have not paid adequate attention to validating the oversampling approach to the problem of class imbalance.
One of the earliest and most extensively utilized approaches for class imbalance is the Synthetic Minority Oversampling Technique (SMOTE) [82]. It interpolates synthetic examples between nearest neighbors drawn from the minority-class cases of the training set. A synthetic sample is thus generated by merging the properties of a seed instance with those of one of its randomly picked k-nearest neighbors. The earliest version of the SMOTE algorithm relied solely on synthetic oversampling. Its authors also used a combination of synthetic oversampling and undersampling, which might be useful [83]. SMOTE was tested on nine benchmark datasets and shown to improve classification performance. SVMSMOTE [84], which is based on SMOTE, focuses on constructing SVM modifications to successfully handle the problem of class imbalance. Oversampling, cost-sensitive learning, and undersampling are some of the heuristics used in SVM modeling. This method produced promising results when compared to other oversampling methods.
Borderline-SMOTE [85] is a SMOTE-based minority oversampling method that only oversamples the minority examples around the borderline. In comparison to SMOTE and the random oversampling methods investigated, the authors' findings show that this solution improves classification results for the minority class.
Reverse-SMOTE (R-SMOTE) [86] performs oversampling by a synthetic inverse minority; it is a technique based on SMOTE and the inverse near-neighbor idea. According to the study, which compared traditional sampling procedures to alternative methods, including SMOTE, R-SMOTE beats other oversampling methods in terms of precision, F-measure, and accuracy. Three benchmark datasets were employed in the comparison.
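The interpolation step that SMOTE-like methods rely on can be illustrated with a minimal sketch. It is a simplified illustration, not the reference SMOTE implementation: the function name `smote_like`, the random choice of seed instances, the Euclidean neighbor search, and the toy data are all assumptions made here for clarity.

```python
import random


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def smote_like(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by SMOTE-style interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority seed and one of its k nearest minority neighbors.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        seed_pt = minority[i]
        # k nearest minority neighbors of the seed (the seed itself excluded)
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: euclidean(seed_pt, p))[:k]
        nb = rng.choice(neighbors)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([s + gap * (n - s) for s, n in zip(seed_pt, nb)])
    return synthetic


# Hypothetical toy minority class: the four corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like(minority, n_new=6, k=2)
```

Note that the procedure consults only minority examples: nothing in it checks whether a generated point falls in a region that, in the real population, is occupied by the majority class.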
Constrained Oversampling (CO) [87] is a technique for reducing noise in oversampling. This method is used to extract the overlapping regions in a dataset. Ant Colony Optimization is then used to define the boundaries of minority regions. Most significantly, in order to create a balanced dataset, fresh samples are synthesized via oversampling under constraints. This method varies from others in that it includes noise-reduction constraints in the oversampling process. CO outperforms a range of oversampling benchmarks, according to their results.
In addition, the Majority Weighted Minority Oversampling Technique (MWMOTE) [88] was offered as a solution to the problem of class-imbalance learning. MWMOTE finds and weights difficult-to-learn informative minority class samples based on their distance from nearby majority class samples. It then creates synthetic samples from the weighted informative minority class samples using a clustering algorithm. The primary premise of MWMOTE is that all generated samples must belong to one of the minority class clusters. In terms of numerous assessment measures, the reported results suggest that MWMOTE is superior or similar to some other existing approaches.
Adaptive synthetic sampling (ADASYN) [89] was proposed with the goal of reducing bias and moving the classification decision boundary in the direction of the hard examples. The primary idea behind ADASYN is to use a weighted distribution for different minority class examples based on their learning difficulty, with more synthetic data created for more difficult minority class examples than for easier ones. The efficacy of this method is demonstrated by the results of experiments conducted on a variety of datasets using five different evaluation measures.
Synthetic Minority Over-Sampling Technique Based on Furthest Neighbor Algorithm (SMOTEFUNA) [6] is another recent method for machine learning from imbalanced datasets. To produce new synthetic minority examples, this method employs the furthest neighbor examples. SMOTEFUNA has a number of advantages over some other approaches, one of which is the lack of tuning parameters, which makes it easier to use in real-world scenarios. Using Naive Bayes and Support Vector Machine classifiers, the authors compared the benefits of their resampling to those of common methods such as SMOTE and ADASYN. The reported findings show that SMOTEFUNA is a viable alternative to the other oversampling methods.
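The difficulty-based weighting behind ADASYN can be sketched as follows. The sketch covers only the allocation step, assuming that difficulty is measured as the share of majority examples among the k nearest neighbors of each minority example. The function name `adasyn_allocation`, the even-split fallback, and the neighbor counts are hypothetical choices made for illustration.

```python
def adasyn_allocation(majority_neighbor_counts, k, n_total):
    """Split n_total synthetic examples across minority seeds.

    majority_neighbor_counts[i] is the number of majority examples among
    the k nearest neighbors of minority example i, used here as a proxy
    for how difficult that example is to learn.
    """
    ratios = [c / k for c in majority_neighbor_counts]
    total = sum(ratios)
    if total == 0:
        # Assumption: if no minority example has majority neighbors,
        # spread the synthetic examples evenly.
        weights = [1 / len(ratios)] * len(ratios)
    else:
        weights = [r / total for r in ratios]
    return [round(w * n_total) for w in weights]


# Hypothetical counts for three minority examples with k = 5.
counts = [0, 1, 4]
allocation = adasyn_allocation(counts, k=5, n_total=10)
```

In this toy case the minority example surrounded mostly by majority neighbors receives the largest share of synthetic examples, which is exactly the region where a generated point is most at risk of coinciding with a real majority example.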
Sampling WIth the Majority (SWIM) [90] is a synthetic oversampling method that is robust in cases of significant class imbalance. SWIM's fundamental feature is that it uses the density of the well-sampled majority class to direct the creation process. SWIM's model was built using both the radial basis function and the Mahalanobis distance. SWIM was put to the test on 25 benchmark datasets, and the findings show that it beats some of the most common oversampling methods.
Other oversampling methods include, but are not limited to, the work of [91,92,93,94,78,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119]. What all oversampling methods have in common is the validation process, which is essentially the evaluation of the performance of the classifier used to classify the oversampled datasets, using one or more accuracy measures such as Accuracy, Precision, Recall, F-measure, G-mean, Specificity, Kappa, Matthews correlation coefficient (MCC), Area under the ROC Curve (AUC), True positive rate, False negative (FN), False positive (FP), True positive (TP), True negative (TN), and ROC curve. Table 1 lists 72 oversampling methods, including their known names, references, the number of datasets utilized, the number of classes in these datasets, the classifiers employed, and the performance metrics used to validate the classification results after oversampling. As can be seen from the previous discussion and Table 1, all the aforementioned oversampling methods use the classification accuracy measures obtained on the synthesized data to verify their goodness, assuming that the synthesized examples belong to the minority class. However, accuracy measures can appear good on paper when the data is overfitted, which is common when using oversampling methods [71,72,73,74,75,76].
Another critical problem with the oversampling approach is the assumption that the synthetic examples belong to the minority class; do they truly belong to the minority class?
None of the previous literature has answered this critical question. This study aims to provide a validation system for oversampling methods in order to determine to what degree these methods synthesize unrealistic examples, that is, examples assumed to belong to the minority when they do not.

Method and Data
The proposed validation system for oversampling methods works by hiding a subset of the majority's examples, which is referred to as the hidden subset. Although the hidden majority examples are part of the population, we exclude them from the training dataset, treating them as if they had not been collected from the real-world domain of discourse. Because no oversampling approach has access to the entire real-world population, this assumption holds.
It is important to make sure that the class imbalance problem still exists after concealing the hidden subset.
After that, we apply the oversampling method that needs to be validated on the remaining dataset in order to generate new examples, which is referred to as the synthetic subset. The hidden subset is then returned to the training set.
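The hiding step described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `hide_majority`, the uniform random selection of hidden examples, and the toy data are hypothetical, and the imbalance check simply requires the visible majority to remain larger than the minority.

```python
import random


def hide_majority(majority, minority, fraction=0.1, seed=0):
    """Split the majority class into a visible part and a hidden subset.

    Raises ValueError if hiding would remove the class imbalance.
    """
    rng = random.Random(seed)
    n_hidden = int(len(majority) * fraction)
    hidden_idx = set(rng.sample(range(len(majority)), n_hidden))
    hidden = [x for i, x in enumerate(majority) if i in hidden_idx]
    visible = [x for i, x in enumerate(majority) if i not in hidden_idx]
    if len(visible) <= len(minority):
        raise ValueError("hiding this many examples removes the class imbalance")
    return visible, hidden


# Hypothetical toy data: 20 majority and 3 minority examples.
majority = [[float(i), float(i + 1)] for i in range(20)]
minority = [[100.0, 100.0], [101.0, 99.0], [99.0, 101.0]]
visible, hidden = hide_majority(majority, minority, fraction=0.1)
```

The oversampling method under validation would then be run on `visible` together with `minority`, while `hidden` is kept aside for the later comparison with the synthetic subset.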
The generated examples in the synthetic subset are claimed to belong to the minority class by all oversampling methods. We compare the similarity between these examples (the synthetic subset) and all examples in the original training set before oversampling to see if these synthesized examples belong to the minority or the majority. Figure 2 illustrates the proposed validation system.

[Figure 2: The proposed validation system. Recoverable flowchart labels: Imbalanced Dataset (Minority, Majority); Hide K samples; Oversampling; Error percent.]
In order to determine the degree of similarity, we need a similarity measure such as Euclidean distance (ED), Manhattan distance (MD), Hassanat distance (HD) [183], etc. In this paper, we opt for HD because it is invariant to noise, outliers, and data scale: the nature of this metric prevents each feature from contributing a distance greater than one, regardless of the scale of the features in the targeted dataset. Furthermore, HD has been shown to outperform a wide range of machine learning similarity measures, including the most common ones like ED and MD [184,185,186,187,188].
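HD can be expressed mathematically as in Equation 1:

$$D(p_i, q_i) = \begin{cases} 1 - \dfrac{1 + \min(p_i, q_i)}{1 + \max(p_i, q_i)}, & \min(p_i, q_i) \ge 0 \\ 1 - \dfrac{1 + \min(p_i, q_i) + |\min(p_i, q_i)|}{1 + \max(p_i, q_i) + |\min(p_i, q_i)|}, & \min(p_i, q_i) < 0 \end{cases} \qquad (1)$$

and the total distance between two examples is

$$HD(p, q) = \sum_{i=1}^{N} D(p_i, q_i) \qquad (2)$$

where p and q are feature vectors and N is the number of features in each vector.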
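A minimal sketch of the Hassanat distance follows, assuming its commonly published definition: for each feature, one minus the ratio (1 + min) / (1 + max), with both values shifted by the absolute value of the minimum when the minimum is negative, and the total distance taken as the sum over all N features. The function names are chosen here for illustration.

```python
def hassanat_feature(a, b):
    """Per-feature Hassanat distance; always in the range [0, 1)."""
    lo, hi = min(a, b), max(a, b)
    if lo >= 0:
        return 1 - (1 + lo) / (1 + hi)
    # Negative minimum: shift both values by |lo| so the smaller becomes zero.
    return 1 - (1 + lo + abs(lo)) / (1 + hi + abs(lo))


def hassanat_distance(p, q):
    """Total Hassanat distance: the sum of the per-feature distances."""
    return sum(hassanat_feature(a, b) for a, b in zip(p, q))


same = hassanat_distance([1.0, 2.0], [1.0, 2.0])   # identical vectors
half = hassanat_feature(0.0, 1.0)                  # 1 - 1/2
huge = hassanat_feature(0.0, 1e9)                  # stays below one
```

Because each feature contributes less than one regardless of its scale, a single feature with a very large range cannot dominate the comparison between a synthetic example and the real examples.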
It is worth mentioning that we are proposing a validation system, not an evaluation system. The similarity measure using HD is meant to find the number of examples from the synthetic subset that are similar to the minority, which is the core of our validation system. Generated examples that are more similar to the majority indicate the error of the oversampling method being validated. This error is calculated according to Equation 3:

$$ER = \frac{CM}{SS} \times 100\% \qquad (3)$$

where CM is the number of synthetic examples that are close to majority examples and SS is the total number of examples in the synthetic subset.
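The error computation can be sketched as below. The decision rule is an assumption made for illustration: a synthetic example is counted as close to the majority when its nearest real example under HD, searched over the original dataset including the hidden subset, is a majority example. The function name `oversampling_error` and the toy data are hypothetical, and the error is expressed as a percentage.

```python
def hassanat_distance(p, q):
    """Hassanat distance between two feature vectors (sum over features)."""
    total = 0.0
    for a, b in zip(p, q):
        lo, hi = min(a, b), max(a, b)
        shift = abs(lo) if lo < 0 else 0.0
        total += 1 - (1 + lo + shift) / (1 + hi + shift)
    return total


def oversampling_error(synthetic, minority, majority):
    """Percentage of synthetic examples whose nearest real example is majority.

    `majority` should contain all original majority examples, including
    the hidden subset that the oversampling method never saw.
    """
    close_to_majority = 0
    for s in synthetic:
        d_min = min(hassanat_distance(s, m) for m in minority)
        d_maj = min(hassanat_distance(s, m) for m in majority)
        if d_maj < d_min:
            close_to_majority += 1
    return 100.0 * close_to_majority / len(synthetic)


# Hypothetical toy data, not taken from the experiments.
minority = [[1.0, 1.0], [2.0, 2.0]]
majority = [[5.0, 5.0], [6.0, 6.0]]
synthetic = [[1.5, 1.5], [5.2, 5.1]]
error = oversampling_error(synthetic, minority, majority)
```

In this toy case one of the two synthetic examples lies next to a majority example, so half of the synthetic subset is counted as erroneous.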

Datasets
We employ three real-life datasets to put our validation system to the test, namely Yeast4, Yeast5, and Yeast6, which are routinely used to evaluate oversampling methods. All of the datasets are freely available at [189]. Table 2 contains information about these datasets. Table 2 shows that the datasets have different minority and majority distributions, even though the number of attributes and classes is the same. It is not necessary to address multi-class datasets to prove our counterclaim, as most oversampling approaches only use binary-class datasets, as shown in Table 1.

Experiments and Results
In our experiments, we used all of the oversampling methods listed in Table 1 on each of the three datasets listed in Table 2. Table 3 shows the number of erroneous synthetic examples (NE), which are those generated as minority examples but found by the proposed validation system to be more similar to majority examples. It also shows the number of synthetic examples (SE) generated by each oversampling method, in addition to the error rate (ER), which is calculated using Equation 3. All the results reported in Table 3 were obtained by hiding only 10% of the majority examples. The averages of five trials on each dataset for each approach are provided in Table 3. In addition, for each approach, the average error is calculated over the error rates on all three datasets. The last column in the table shows the average rank of each method based on the three datasets; the lower the rank, the better the oversampling performance. For example, rank 1 is awarded to the method with the smallest error.
The average error rate of all oversampling methods increases somewhat as the hidden percentage increases, as seen in Figure 3. This is logical since, when oversampling methods synthesize their minority examples, they are unaware of some majority examples; in fact, we expected a significant rise in error as the size of the hidden subset grew larger. In terms of the effect of the dataset on the average oversampling error, we can observe in the same figure that some datasets, such as Yeast5, are easier to oversample than others, such as Yeast6 and Yeast4. However, the difference is not substantial, and more importantly, as the box plots show, the standard deviation of the error rates produced by all oversampling methods on each dataset is extremely high.
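The ranking described above can be sketched as follows, assuming that each method is ranked separately on each dataset (rank 1 for the smallest error) and that the ranks are then averaged. The tie handling (tied methods share the average rank), the function names, the method labels, and the error values are all hypothetical and are not results from the experiments.

```python
def ranks(errors):
    """Rank 1 goes to the smallest error; tied values share the average rank."""
    order = sorted(range(len(errors)), key=lambda i: errors[i])
    result = [0.0] * len(errors)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and errors[order[end + 1]] == errors[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions
        for j in range(pos, end + 1):
            result[order[j]] = shared
        pos = end + 1
    return result


def average_ranks(error_table):
    """error_table maps a method name to its error rates, one per dataset."""
    methods = list(error_table)
    n_datasets = len(error_table[methods[0]])
    per_dataset = [
        ranks([error_table[m][d] for m in methods]) for d in range(n_datasets)
    ]
    return {
        m: sum(r[i] for r in per_dataset) / n_datasets
        for i, m in enumerate(methods)
    }


# Hypothetical error rates (percent) on three datasets.
table = {
    "method_a": [10.0, 20.0, 30.0],
    "method_b": [5.0, 25.0, 30.0],
    "method_c": [50.0, 60.0, 70.0],
}
avg_rank = average_ranks(table)
```

Averaging per-dataset ranks rather than raw error rates keeps a single dataset with unusually high errors from dominating the comparison between methods.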
In order to compare oversampling methods, we ranked them according to their average error across three datasets. The average ranks of the methods are plotted against the average errors they produced on three datasets (Yeast4, Yeast5 and Yeast6) as shown in Figure 4.
Despite the fact that all of the methods discussed generate false examples, Figure 4 indicates that certain methods do better than others in avoiding the generation of false examples. As a result, if oversampling is unavoidable, the method's reliability should be verified using a validation tool such as ours. Methods like M51, for example, have error rates close to 0%, whereas others like M7 and M16 have error rates near 100%. As seen in Figure 5, many synthetic examples are created on or near hidden examples, producing almost identical feature values. Even in a higher-dimensional feature space, such a situation has the potential to occur. The common mistake that all oversampling methods make is to feed such data to a classifier, assuming that all of the examples are realistic and labeled based on reality. The classifier has no other knowledge and learns based on the false assumption, which produces excellent results in labs but unexpected behavior in real-world applications.
The results shown thus far do not necessarily imply that incorrect example synthesis occurs only when the majority examples are hidden from the oversampling method. Even when the majority examples are completely visible to the methods, some methods generate false examples. The published findings of all of the oversampling methods demonstrate this, as none of them claimed to be an accurate method with no errors.
Because some oversampling approaches passed our validation test by producing a relatively small number of unrealistic examples, we validated the best performers on a fourth machine learning dataset, Vehicle3, to further support our counterclaim against the validity of the oversampling approach in general. The validation results of the best performers are shown in Table 4. As can be seen in Table 4, when we changed the dataset, the errors of the "best" oversampling methods increased significantly, demonstrating once again that these oversampling methods fill gaps in the feature space without considering whether the generated examples truly belong to the minority, and falsely treat them as such. This makes training on these examples deceptive, and it could lead to the classifier being overfitted on incorrect data if robust generalization techniques are not used. As a result, when applied to real-world tasks, the entire machine learning system may fail spectacularly, particularly in critical applications such as security, autonomous driving, aviation safety, and medical applications, where even one unrealistic synthesized example could do catastrophic harm.

Conclusion
Oversampling methods have been used and developed for decades to handle the problem of class imbalance learning, and there is a nearly exponential growth trend in this type of research. The main question of this research is: does the oversampling approach, in its current forms and methods, provide an applicable and viable solution for learning from class-imbalanced data? We claim that the current oversampling approach is deceptive and could lead to severe failures in real-world applications. In order to answer the main question and to prove our counterclaim, we reviewed a large number of oversampling methods and analyzed their performance in terms of producing unrealistic examples. For this purpose, we proposed a new validation system for oversampling methods, which we used to validate over 70 different oversampling methods. Our validation results on four common real-world datasets reveal that all of the oversampling methods investigated generate false examples, assuming that they belong to the minority when they do not, causing classifiers to perform well in labs but to be more likely to fail in practice.
The oversampling methods investigated in this paper are ranked according to how many incorrect examples they generate. The ranking shows that some methods are less harmful than others when used to solve real-life problems. When the datasets were changed, however, even these methods were found to be useless. Therefore, we recommend avoiding such methods when dealing with sensitive applications such as security, autonomous driving, aviation safety, and medical applications that use machine learning from class-imbalanced data. Instead, we strongly encourage using ensemble approaches to problems of class imbalance, such as Easy Ensemble [190], Random Data Partitioning [71], etc., because these methods do not create data out of thin air and do not, as the undersampling approach does, deprive the learning process of critical data.
More research should be done in the future to confirm the validity or invalidity of the oversampling approach, investigating more methods and incorporating more data. Furthermore, we recommend that additional research be conducted on real-world applications, including measurements of incorrect predictions made with and without the use of oversampling methods, as well as comparisons with ensemble methods.