MINORITY RESAMPLING BASED ENSEMBLE FRAMEWORK USING ENHANCED EARLY DRIFT DETECTION METHOD FOR IMBALANCED DATA STREAMS

Abstract: As data mining applications grow in popularity, large volumes of data streams are generated over time. The main problems with data streams are that they exhibit a high degree of class imbalance and that the distribution of the data changes over time. In this paper, a Timely Drift Detection and Minority Resampling Technique (TDDMRT) based on the K-nearest neighbor algorithm and Jaccard similarity is proposed to handle class imbalance by finding the current ratio of class labels. The Enhanced Early Drift Detection Method (EEDDM) is proposed for detecting concept drift, and the minority resampling method (KNN-JS) determines whether the current data stream should be regarded as imbalanced and resamples the minority instances in the drifting data stream. The K-nearest neighbor technique is used to resample the minority classes, and the Jaccard similarity measure is applied to the resampled data to generate synthetic data similar to the original data, which is then handled by ensemble classifiers. The proposed ensemble-based classification model outperforms the existing oversampling and undersampling techniques, with an accuracy of 98.52%.


Introduction:
The amount of data generated by smart devices, social media, and all kinds of sensors has increased with the rapid growth of information technology. Different sources generate ordered sequences of instances at high speed, termed streaming data. Stream data analysis drives new business ideas and provides decision-making support using data mining approaches [1]. Streaming analysis is the extraction of sequential patterns from streams of data. Big data streaming analysis has to cope with three characteristics: volume, velocity, and variety [2] [3]. Time is an integral dimension of streaming data, so the stream must be analyzed promptly. Thus, data mining algorithms are essential to capture and mine stream data record by record in realistic time. Streaming data analysis includes online and offline techniques. Even though most sources generate data streams in a nonstationary environment, traditional offline work builds data mining algorithms under the hypothesis that the data are static. Offline data mining comprises data collection, mining on static data sets, interpretation, and evaluation of the result. However, offline algorithms face many unresolved challenges in data storage and query processing due to the huge volume and variety of the data. Online learning algorithms were introduced to cope with this problem.
Data streams are massive, continuously changing, temporally ordered, and possibly infinite in length. Mining and analyzing data streams is crucial and has gained attention over the past decade. Real-world applications involving class-imbalanced data include anomaly detection [4], fault diagnosis [5], medical diagnosis [6], credit card fraud detection [7], oil spill detection from radar images [8], biological sequence detection [9], spam filtering [10], and many others. In a binary classification problem, the rare class, which occurs very infrequently, is considered the positive class, and the majority class is considered negative. When the dataset contains very few positive-class examples, the learner's capacity to learn the positive class can be significantly impaired, making it difficult for the classifier to learn the true class boundary. In classification over a static dataset, instances are drawn from a specified generating function belonging to some distribution, and the learner tries to learn the concept from that static dataset; this assumption is not valid in the streaming data model. The shift in data distribution is called concept drift, and the imbalanced distribution makes many conventional machine learning algorithms less efficient, particularly at predicting examples of minority groups. Hence, data stream classification in the presence of concept drift and class imbalance is very challenging.
In a data stream application, the data arrive continuously, and learning can be done either in batch or online mode. In batch learning, the data samples are processed in chunks at each time step, whereas in online learning the samples are processed incrementally and the classifier model is updated as each sample is received. Online learning can build and update the classification model to accommodate a new data sample while preserving its performance on previous instances, giving rise to the stability-plasticity dilemma [11]. The main difficulty in imbalanced learning is that the under-represented class does not receive attention equivalent to the majority class from the classification algorithm. An online imbalance learning system needs to be developed to increase classification accuracy in order to solve this issue. In problems such as spam filtering and credit card applications, minority-class examples are much harder to collect than majority-class examples, so misclassifying the minority class is very costly. Traditional models aim to maximize overall performance, which raises the chance of predicting an example as the majority class and lowers the chance of recognizing the minority class correctly. In practice, the accuracy of the majority class is often close to 100% while the minority class has less than 10% accuracy. The consequences of class imbalance have been studied on classifiers such as Random Forest [12], Naive Bayes [13], Decision Tree [14], and KNN [15].
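The online setting described above can be illustrated with a toy sketch, not the paper's method: a perceptron whose weights are updated as each streamed sample arrives, so the model never revisits past chunks.

```python
class OnlinePerceptron:
    """Toy online learner: the model is updated sample by sample as the
    stream arrives, instead of training on a static batch."""

    def __init__(self, n_features, lr=0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        score = sum(w * xi for w, xi in zip(self.w, x)) + self.b
        return 1 if score >= 0 else 0

    def update(self, x, y):
        # incremental update: adjust weights only when the prediction is wrong
        error = y - self.predict(x)
        if error:
            self.w = [w + self.lr * error * xi for w, xi in zip(self.w, x)]
            self.b += self.lr * error
```

Because each `update` touches only the current sample, memory use is constant regardless of stream length, which is the property that makes online learners attractive for data streams.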
The problem of class imbalance can be tackled at the data and algorithmic levels. At the data level, techniques such as oversampling the minority class, undersampling the majority class, or resampling are applied to balance the dataset. Oversampling and undersampling are simple, while the most popular method is resampling, which adds or removes data samples randomly. At the algorithmic level, the training mechanism for the minority class is modified in order to improve its accuracy. The learning methods include cost-sensitive learning, meta-learning, and ensemble learning [16]. In cost-sensitive learning, the cost of misclassification with respect to each class is determined from a cost matrix. In meta-learning, preprocessing of the training data is integrated with post-processing of the test data in such a way that the learner itself is not modified. The ensemble method is the most popular approach for handling class imbalance: it tries to improve on the accuracy of a single classifier by combining multiple classifiers as base learners so that the combined classifier outperforms each of them. In recent years, ensemble-based classification has provided better solutions for class imbalance problems.
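The data-level remedies above can be sketched in a few lines. The helpers below are illustrative only; representing each example as a (features, label) pair is an assumption of the sketch.

```python
import random

def _by_label(data):
    """Group (features, label) pairs by label; assumes exactly two labels."""
    labels = sorted({y for _, y in data})
    groups = [[d for d in data if d[1] == lab] for lab in labels]
    return sorted(groups, key=len)  # [minority, majority]

def random_oversample(data, seed=0):
    """Duplicate random minority examples until both classes are equal in size."""
    rng = random.Random(seed)
    minority, majority = _by_label(data)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return data + extra

def random_undersample(data, seed=0):
    """Randomly discard majority examples until both classes are equal in size."""
    rng = random.Random(seed)
    minority, majority = _by_label(data)
    return minority + rng.sample(majority, len(minority))
```

Oversampling keeps all information but duplicates points (risking overfitting), while undersampling shrinks the training set and may discard informative majority samples; this trade-off motivates the synthetic resampling discussed later.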
The rest of this paper is organized as follows. Section II describes related work on class imbalance and concept drift. Section III explains the proposed model for detecting and handling class imbalance in a drifting data stream. Section IV discusses the experimental results, and Section V presents the conclusions and possible future directions.

Related Works:
Existing algorithms for class imbalance and concept drift, which mainly deal with online or offline learning, are studied by the authors in [17]. Data streams may be processed using batch or incremental learning, and ensemble learners are deemed successful for data stream classification problems. In online learning, the data instances are learned one by one without any prior knowledge. From the mining perspective, streaming data analysis has opened many challenges for online learning algorithms, including concept drift, temporal dependencies, imbalanced class distribution, and limited data storage availability. Concept drift happens when an unforeseen change occurs in the streaming data over time. A skewed class distribution over streaming data is known as class imbalance. Concept drift creates significant challenges in the discovery of sequential patterns for online learning algorithms.
Due to uneven data distribution, data streaming applications face the combined issues of concept drift and imbalanced data distribution. Ensemble learning [18] is widely used for handling imbalanced classes. Conventional ensemble member selection algorithms such as bagging, boosting, and random sampling are cost-effective [19]. However, these heuristic approaches lack a proper focus on the performance of ensemble classification over skewed classes. Moreover, conventional approaches underestimate the importance of diversity among ensemble members and the impact of the minority class on ensemble performance. Ensemble member diversity should not be so high that the ensemble shows almost no convergence to the new data pattern. There is also no clear guidance on how to determine the optimal number of classifiers or how to combine them. It is essential to analyze the relationship between the nature of the imbalanced classes and the ensemble size required to handle concept drift on imbalanced classes efficiently. Mostly, ensemble classification algorithms exploit the majority voting combination method, which can lead to ensemble failure when similar classifiers producing the same incorrect results are combined. Therefore, ensemble diversity must be exploited during the combination phase of the ensemble algorithm.
The Uncorrelated Bagging [20] technique is used to learn concept-drifting imbalanced data streams. The Dynamic Ensemble Selection-KNN model [21] ranks the accuracy of the base models in decreasing order and uses the most diverse members to form the ensemble. Random undersampling [22] removes majority-class instances at random, but it may also remove informative samples, which hinders the performance of the classifier. SMOTE [23] was therefore proposed, which generates new minority-class instances using the k-nearest neighbors. SVM SMOTE [24] uses the support vectors to generate new samples. In ADASYN [25], the number of samples generated is proportional to the number of existing neighborhood instances. The Streaming Ensemble Algorithm (SEA) [26] uses a heuristic replacement strategy based on accuracy and diversity. In the selectively recursive approach (SERA) [27], the class distribution is balanced with the current stationary minority class, and the Mahalanobis distance is used as the measure for selecting the previous minority instances most similar to the candidate region. The resampling and time-delayed metrics in OOB and UOB [28] can cope with imbalanced data streams. The combined issue of drift and class imbalance is addressed in two categories, namely chunk-based and online ensembles [29].

Timely Drift Detection and Minority Resampling Technique (TDDMRT)
Many stream data mining strategies do not take the issues of class imbalance and concept drift into account concurrently, and as a result the accuracy of the minority class is inferior. To deal with these challenges, conventional data-level and algorithm-level methods are exploited. Data-level methods focus on changing the training samples to improve the accuracy of ensemble learners. To balance the class distributions, they generate additional samples for the minority classes until their size approaches that of the majority classes.
Since most approaches exploit random sampling, they often miss important samples, leading to poor classification accuracy. By giving importance to the minority classes, the modified ensemble members cover each of the considered groups of examples, but this increases the cost of stream data classification. To minimize the cost of ensemble classification without reducing accuracy, it is essential to fix the optimal size of the ensemble as well as to select diverse ensemble members. Even when the optimal number of ensemble members is determined, an insufficient number of samples in a class has an adverse impact on classification accuracy. Thus, a hybrid method is necessary to solve the issues of concept drift over imbalanced data in a cost-efficient manner.
Data imbalance is the most important factor affecting the performance of models developed through learning algorithms. The data are initially used to train and construct an effective model, and imbalance results in a flawed model that can be biased on test and real-time data [30]. Hence, data imbalance has to be addressed as early as possible to build an effective prediction model. In the proposed method, a novel Timely Drift Detection and Minority Resampling Technique (TDDMRT) is employed to improve the performance of the model.
The datasets are initially acquired from the data perception center and divided into training and test data. The data selected for training are first fed into the TDDMRT scheme, which contains the Enhanced Early Drift Detection Model (EEDDM) and the resampling model based on the K-nearest neighbor algorithm and Jaccard similarity. The imbalanced data are initially identified along with the misclassified instances. The extracted features are fed into the EEDDM model to perform the drift analysis, and the drifted imbalanced data are estimated over the NSL-KDD [31] dataset, whose records are labeled as bad or normal. The drifted imbalanced data are classified into minority and majority classes. The obtained minority classes are resampled using the KNN technique, and the Jaccard similarity measure is applied to the resampled data to generate synthetic data similar to the original data. Thus, the imbalanced data in the dataset are balanced and the model for testing is constructed effectively.
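The minority/majority split used in the pipeline above can be sketched as follows, assuming the current window is a list of (features, label) pairs; the least frequent label in the window is treated as the minority class.

```python
from collections import Counter

def split_minority_majority(window):
    """Split a window of (features, label) pairs into minority and majority
    instances, treating the least frequent label as the minority class."""
    counts = Counter(label for _, label in window)
    minority_label = min(counts, key=counts.get)
    minority = [x for x, y in window if y == minority_label]
    majority = [x for x, y in window if y != minority_label]
    return minority_label, minority, majority
```

Recomputing this split per window lets the current ratio of class labels drive the decision of whether resampling is needed, rather than a ratio fixed at training time.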

Enhanced Early Drift Detection Model (EEDDM)
The early drift detection method was proposed in [32]. In that work, two threshold conditions are established to analyze the drift between points based on the error rate and standard deviation. In the current model, however, three conditions are formed with threshold values. The data are streamed into the EEDDM through the data stream generators. In the proposed model, the mean (m) of the errors and the standard deviation (s) are formulated with the tanh function for the points and are estimated with equation (5). The corresponding maximum distance is given in equation (6), where m' and s' are the maximum mean and standard deviation over the given points in space. The ratio r between the point error and the maximum error is estimated in equation (7). Based on the value of r, three drift levels are obtained over the split dataset: the in-control level, for which r is less than 0.5; the warning level, for which r is more than 0.5 and less than 0.90; and the out-of-control level, for which r is more than 0.9.

Algorithm 1: Enhanced Early Drift Detection Model (EEDDM)
1: get the sample y // y is a data feature or point
2: if y ≤ n
3:     calculate the mean and standard deviation of the error between x and y
4:     estimate the distance of the error with equation (1)
5:     get the maximum distance of the error with equation (2)
6:     estimate the threshold value with equation (3)
7: else
8:     if the error ratio is more than 0.9
9:         "Drift is at the out-of-control level"
10:        add y to the dataset
11:    else if the error ratio lies between 0.5 and 0.9
12:        "Drift is at the warning level"
13:        replace the point y_i with the alternative point y_j
14:    else
15:        reset the dataset (no drift)
16: end
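A minimal Python sketch of the three-level check in Algorithm 1 is given below. Since the underlying equations are not reproduced in this section, the tanh-squashed point error and the running maximum are illustrative assumptions standing in for them; only the 0.5/0.9 thresholds come from the text.

```python
import math

class EEDDMSketch:
    """Hedged sketch of EEDDM's three-level drift check. The tanh-based
    point error and the running maximum are assumptions, not the
    published formulation; the 0.5 and 0.9 thresholds follow the text."""

    IN_CONTROL, WARNING, OUT_OF_CONTROL = "in-control", "warning", "out-of-control"

    def __init__(self):
        self.max_error = 1e-9  # running maximum error (stands in for m' and s')

    def level(self, errors):
        n = len(errors)
        m = sum(errors) / n                                  # mean of the errors
        s = (sum((e - m) ** 2 for e in errors) / n) ** 0.5   # standard deviation
        point_error = math.tanh(m) + math.tanh(s)            # assumed tanh form
        self.max_error = max(self.max_error, point_error)
        r = point_error / self.max_error                     # point/max error ratio
        if r > 0.9:
            return self.OUT_OF_CONTROL
        if r > 0.5:
            return self.WARNING
        return self.IN_CONTROL
```

Each call classifies the current window of errors against the largest error level seen so far, so a sudden rise in errors relative to that maximum is flagged as warning or out-of-control.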

Minority Resampling (KNN-JS)
The drifted instances in the dataset are determined and corrected based on the proposed EEDDM. The corrected imbalanced data are classified into minority and majority classes. The minority classes are resampled through the K-nearest neighbor algorithm [33]. The similarity between a new instance and the existing instances is calculated using the Jaccard similarity measure [34] given in equation 4, where A and B are the new and existing instances of the data. The threshold is set to 0.75, and a new instance whose similarity exceeds this value is accepted as synthetic data. The resampling process is carried out until the dataset is completely balanced.
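A sketch of this KNN-plus-Jaccard resampling loop is shown below. The interpolation between a minority point and one of its k nearest minority neighbors, and the use of the generalized (weighted) Jaccard similarity for numeric vectors, are assumptions of the sketch; only the 0.75 acceptance threshold and the KNN/Jaccard combination come from the text. At least two minority instances with non-negative features are assumed.

```python
import random

def jaccard_similarity(a, b):
    """Generalized (weighted) Jaccard similarity for non-negative numeric
    vectors: sum of element-wise minima over sum of element-wise maxima."""
    num = sum(min(x, y) for x, y in zip(a, b))
    den = sum(max(x, y) for x, y in zip(a, b))
    return num / den if den else 1.0

def knn_js_resample(minority, n_new, k=3, threshold=0.75, seed=0):
    """Generate n_new synthetic minority points by interpolating between a
    random minority instance and one of its k nearest minority neighbors,
    keeping only candidates sufficiently similar to an existing instance."""
    rng = random.Random(seed)
    synthetic = []
    while len(synthetic) < n_new:
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((x - y) ** 2 for x, y in zip(p, base)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()
        candidate = tuple(x + gap * (y - x) for x, y in zip(base, neighbour))
        if max(jaccard_similarity(candidate, p) for p in minority) >= threshold:
            synthetic.append(candidate)
    return synthetic
```

The similarity filter is what distinguishes this from plain SMOTE-style interpolation: candidates that drift too far from the existing minority region are rejected rather than added.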

Classifier model
In the present work, the NSL-KDD cup dataset is used to analyze the effectiveness of the proposed TDDMRT. The data in the NSL-KDD cup dataset are corrected for drift using the EEDDM, balanced using the proposed KNN-JS resampling, and fed into the ensemble of Decision Tree, Naïve Bayes, and Random Forest classifiers.
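The combination step of such an ensemble can be illustrated with a simple plurality vote. Representing the trained base classifiers as plain callables is an assumption of the sketch; the actual Decision Tree, Naïve Bayes, and Random Forest models are not reimplemented here.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the base classifiers' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def ensemble_predict(classifiers, x):
    """Ask every base classifier for a label and combine by plurality vote."""
    return majority_vote([clf(x) for clf in classifiers])
```

As noted in the related work, such plurality voting only helps when the base classifiers are diverse; identical classifiers making the same mistake will outvote a correct minority.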

Experimentation Results and Discussion
The proposed TDDMRT scheme, with the EEDDM and the KNN-JS-based resampling framework, is compared with existing models over the NSL-KDD cup dataset. The performance of TDDMRT over the data chunks is given in Table 1 and Figure 1. The EEDDM is compared with Dynamic Classifier Selection based Local Accuracy DDM (DCS-LA-DDM) and DDM [35], Semi-supervised Adaptive Novel Class Detection and Classification (SAND), Accurate Hoeffding Trees (AHT), Streaming classification with Emerging New Class by Matrix Sketching (SENC-MaS), ECSMiner, and Semi-supervised Adaptive Classification Over data Stream (SACCOS) [36]. The accuracy of the proposed model is about 98.52%, achieved with 7% of the data labeled, which is better than DCS-LA-DDM and SACCOS (ANN). However, the SAND model showed better accuracy than AHT and SENC-MaS over smaller percentages of labels. The comparison of accuracy against the percentage of labels for drift detection over the NSL-KDD dataset is given in Table 2 and Figure 2.

Figure 2. Performance of drift detectors
Table 3 and Figure 3 show the comparison of performance over the balanced data obtained from oversampling (OS), undersampling (US), and TDDMRT. The accuracy, precision, sensitivity, specificity, and detection rate metrics are used to analyze the classification over the NSL-KDD cup data.
The data balanced through the TDDMRT technique showed an accuracy of 86.40% with the Random Forest classifier, which is better than classification through the Naïve Bayes and Decision Tree classifiers. Similarly, the attack detection performance on the dataset is about 0.025 for the RF classifier, better than the NB and DT classifiers under the TDDMRT approach. The sensitivity, precision, and specificity of the NB classifier are about 0.856, 0.782, and 0.075 respectively, which are lower than those of both the RF and DT classifiers over the given data.

Conclusion and Future Work
In this paper, a Timely Drift Detection and Minority Resampling Technique (TDDMRT) based on the K-nearest neighbor algorithm and Jaccard similarity is proposed to handle the combined problem of concept drift and class imbalance. A minority resampling method is applied to handle the class imbalance by finding the imbalance ratio. The Enhanced Early Drift Detection Model (EEDDM) handles the concept drift by analyzing the drift between points based on threshold values. The experimental results show that the proposed method can handle both concept drift and class imbalance with an accuracy of 98.52%, which is better than the results reported in the existing literature. For future work, we plan to apply drift detection in imbalanced data streams to multi-class problems.

Declarations
Ethics approval and consent to participate: Not applicable.

Availability of data and materials
The data that support the findings of this study are available from the corresponding author upon reasonable request.