This section presents the experimental evaluation of the proposed AEDDM method for drift detection. The approach is tested in three sets of experiments from three different perspectives: first, evaluating it for sudden and gradual drift detection on synthetic datasets; second, testing it on drift-induced real-world datasets from both the real/virtual and the sudden/gradual drift points of view; and third, comparing it with similar state-of-the-art methods in an online classification scenario using the Hoeffding Tree classifier. The evaluation aims to show empirically that the proposed method detects sudden and gradual drifts and that the detected drift is real, thus minimizing false alarms. Section 5.1 describes the datasets used in this research along with other experimental settings, Section 5.2 presents experiments on synthetic datasets, Section 5.3 details the experiments on real-world drift-induced datasets, and Section 5.4 compares the classification performance of AEDDM using the Hoeffding tree classifier in an online setting.
5.1 Datasets and Experimental Settings
The proposed approach is evaluated on four synthetic datasets, namely Rotating Hyperplane (Haixun Wang et al., 2003; Fan, 2004), Moving RBF (Losing et al., 2017; Menon & Gressel, 2021), GAUSSIAN and Varying Distributions (VD), and four real-world datasets, namely NOAA, Covertype, KDDCUP99 and ELEC2. These datasets are briefly described below:
Rotating Hyperplane
A hyperplane in d dimensions is represented by the equation \(\sum_{i=1}^{d} w_i x_i = w_0\). If \(\sum_{i} w_i x_i \ge w_0\), instances are labeled positive, otherwise negative (Hulten et al., 2001a). A change in the classification boundary is introduced by gradually changing the feature weights. We used scikit-multiflow to generate a hyperplane dataset with 10 features. The dataset contains slow gradual drift in five features with a 60% change magnitude.
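The labeling rule above can be sketched directly. The following is a minimal numpy sketch of the rule and a simplified weight-shift drift; the paper itself uses the scikit-multiflow generator, so the function names and the drift mechanism here are illustrative only:

```python
import numpy as np

def hyperplane_labels(X, w, w0):
    """Label instances by the hyperplane rule: positive (1) when
    sum_i w_i * x_i >= w0, otherwise negative (0)."""
    return (X @ w >= w0).astype(int)

def drift_weights(w, n_drift_features, step):
    """Simplified drift: shift the first n_drift_features weights by
    `step`, changing the classification boundary (illustrative only)."""
    w = w.copy()
    w[:n_drift_features] += step
    return w

rng = np.random.default_rng(0)
d = 10
X = rng.uniform(0.0, 1.0, size=(1000, d))
w = rng.uniform(-1.0, 1.0, size=d)
w0 = 0.5 * w.sum()                       # threshold near the data's center
y_before = hyperplane_labels(X, w, w0)
y_after = hyperplane_labels(X, drift_weights(w, 5, 0.6), w0)
```

After the weight shift, the labels of instances near the old boundary flip, which is exactly the gradual concept drift the generator produces.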
Moving RBF Dataset
This dataset is also generated using scikit-multiflow and consists of Gaussian distributions whose centroids move with constant speed, producing gradual drift. Concept drift can be introduced by changing the position or number of centroids. The generated dataset has 2 classes, 30 attributes and 50 centroids. The drifted data is generated using the scikit-multiflow random RBF drift generator with a change speed of 0.6.
Gaussian Dataset
A synthetic dataset with 20 features is drawn from normal distributions based on specified ranges of mean and standard deviation. For each feature in the negative samples, the mean range is (0.1, 0.6) and the standard deviation range is (0.05, 0.45), with 30,000 samples. For the positive samples, the mean range is (2, 7) and the standard deviation range is (1.5, 2.5), with 30,000 samples. In the drifted data, the mean range for the positive class is changed to (4, 9) with a standard deviation range of (1.5, 3), while for the negative class the mean range is changed to (0.3, 0.9) with a standard deviation range of (0.1, 0.5).
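The sampling scheme above can be sketched as follows (a numpy sketch with illustrative function names; each feature's mean and standard deviation are drawn uniformly from the stated ranges):

```python
import numpy as np

def gaussian_class(n, n_features, mean_range, std_range, rng):
    """Sample one class: each feature gets its own mean and std drawn
    uniformly from the given ranges, then n samples per feature."""
    means = rng.uniform(*mean_range, size=n_features)
    stds = rng.uniform(*std_range, size=n_features)
    return rng.normal(means, stds, size=(n, n_features))

rng = np.random.default_rng(42)
n, d = 30_000, 20
neg = gaussian_class(n, d, (0.1, 0.6), (0.05, 0.45), rng)   # negative class
pos = gaussian_class(n, d, (2.0, 7.0), (1.5, 2.5), rng)     # positive class
# Drifted versions with the shifted ranges described in the text.
neg_drift = gaussian_class(n, d, (0.3, 0.9), (0.1, 0.5), rng)
pos_drift = gaussian_class(n, d, (4.0, 9.0), (1.5, 3.0), rng)
```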
Varying Distributions (VD) Dataset
In this dataset, class 1 instances are sampled from a binomial distribution (n = 10, p = 0.05) while class 0 instances are sampled from a logistic distribution (loc = 0.38). The VD dataset consists of 19,200 data points, five features and a balanced class distribution. The drifted stream contains 30 batches, where the initial 20 batches contain non-drifted data. In the last 10 batches, the distributions of both classes change progressively: in batches 31 and 32 the distribution of both classes changes in one column, in batches 33 and 34 in two columns, in batches 35 and 36 in three columns, and so on.
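The VD sampling and the column-wise drift scheme can be sketched as follows (a numpy sketch; `drift_columns` and the shift magnitude are hypothetical illustrations of "the distribution changes in k columns", not the paper's exact mechanism):

```python
import numpy as np

rng = np.random.default_rng(7)
n_per_class, n_features = 9600, 5        # 19,200 points total, balanced

# Class 1: binomial(n=10, p=0.05); class 0: logistic(loc=0.38).
class1 = rng.binomial(10, 0.05, size=(n_per_class, n_features)).astype(float)
class0 = rng.logistic(loc=0.38, size=(n_per_class, n_features))

X = np.vstack([class0, class1])
y = np.concatenate([np.zeros(n_per_class), np.ones(n_per_class)])

def drift_columns(X, n_cols, shift=1.0):
    """Hypothetical column-wise drift: shift the first n_cols features,
    mimicking the 'changes in k columns' scheme described above."""
    X = X.copy()
    X[:, :n_cols] += shift
    return X
```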
NOAA Weather Dataset: The NOAA dataset (Ditzler & Polikar, 2013b) contains weather measurements in eight dimensions, with daily records covering 50 years. There are 18,159 records, and the task is to predict whether it will rain or not; the class distribution is 12,461 records for no rain and 5,698 for rain. The eight attributes are temperature, dew point, sea level pressure, visibility, average wind speed, maximum sustained wind speed, minimum temperature, and maximum temperature. The data has been made available by the National Oceanic and Atmospheric Administration (NOAA), hence the name.
Covertype
The Forest Covertype dataset (Cabral & Barros, 2018; Frías-Blanco et al., 2015) assigns forest land to 7 classes based on physical attributes of a 30 m × 30 m region, such as elevation, soil type, wilderness area and slope. The dataset has been made public by the US Forest Service. It contains 581,012 records and 54 attributes, both numerical and categorical. This dataset has been converted into a binary classification dataset.
KDDCUP99
KDDCUP99 is a network intrusion detection dataset used in the KDDCUP 1999 competition. It is a multi-class dataset in which instances are labeled as normal or as one of several known attack types (Pinagé et al., 2020). The original dataset contains more than 4M records. We used a subset containing 494,021 records and 41 dimensions, converted to a binary classification dataset with class values normal and attack.
ELEC2
The electricity dataset, commonly known as ELEC2 (Harries & Wales, 1999), records daily supply, demand and scheduled electricity transfer between the states of New South Wales and Victoria. It contains 45,312 instances with 8 features. The class label indicates whether the price in New South Wales was up or down on a given day relative to a moving average of the last 24 hours. This dataset has been used in various research papers on drift detection (Pinagé et al., 2020; Gama et al., 2004; Costa et al., 2018; A. Liu et al., 2018).
Table 3 summarizes the datasets used in this research. In the case of real-world datasets, it is usually not known whether drift is present, and if it is, where it is located. For experimental evaluation, we explicitly introduced drift into these datasets, as explained in Section 5.3. All the datasets used and generated in this research work are available at https://github.com/Usman07442/ConceptDrift_IJMLC.
Table 3: Summary of the datasets used in this research.

| Dataset | Instances | # of Features | Type | Drift Type / Drift Induced | Size of Drifted Data |
|---|---|---|---|---|---|
| Gaussian | 60,000 | 20 | Synthetic | Sudden | 6,000 |
| VD | 19,200 | 5 | Synthetic | Gradual | 960 |
| Hyperplane | 40,000 | 10 | Synthetic | Gradual | 2,560 |
| Moving RBF | 40,000 | 30 | Synthetic | Gradual | 2,560 |
| NOAA | 18,159 | 8 | Real | Sudden, Gradual | 1,816 |
| Covertype | 581,012 | 54 | Real | Sudden, Gradual | 58,102 |
| KDD99 | 494,021 | 41 | Real | Sudden, Gradual | 49,403 |
| ELEC2 | 45,312 | 8 | Real | Sudden, Gradual | 4,531 |
Experimental Settings
In the offline training phase, the available dataset is divided into three distinct subsets: the initial 70% is used for training (and validation of) the autoencoders, the next 20% for threshold computation, and the last 10% for testing the autoencoder on a normal, non-drifted data stream. The simplest deep autoencoder architecture, with one hidden layer in an undercomplete setting, is used in all experiments; it compresses the input to a lower dimension at the bottleneck. The hidden layer is roughly one-third the size of the input, and the bottleneck roughly one-third the size of the preceding hidden layer in the encoder, while the decoder is an exact mirror of the encoder. Since the input is real-valued, mean squared error (MSE) is used as the loss function with the Adam optimizer. The ReLU activation function is used at all layers except the output layer, where sigmoid is used. For threshold computation, results are averaged over 10 runs to avoid experimental bias. To test the drift detection effectiveness of AEDDM, drift is introduced in the stream (the last 10% of the dataset) starting at batch 20 (the 21st batch); the initial 20 batches contain normal (non-drifted) data. For all datasets, a batch size of 32 is used both for autoencoder training and for windowing. The reference window (the training data with its computed threshold) is fixed, while the detection window moves in batches. The next section describes the experiments on synthetic datasets.
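The 70/20/10 split and the one-third layer-sizing rule described above can be sketched as follows (a minimal Python sketch with illustrative function names; the actual autoencoder training with MSE loss and Adam is omitted):

```python
import numpy as np

def offline_split(X):
    """70/20/10 split of the offline phase: autoencoder training,
    threshold computation, and a clean non-drifted test stream."""
    n = len(X)
    i70, i90 = int(0.7 * n), int(0.9 * n)
    return X[:i70], X[i70:i90], X[i90:]

def layer_sizes(n_inputs):
    """One-third sizing rule for the undercomplete AE: the hidden layer
    is about a third of the input, the bottleneck about a third of the
    hidden layer; the decoder mirrors these sizes."""
    hidden = max(1, round(n_inputs / 3))
    bottleneck = max(1, round(hidden / 3))
    return hidden, bottleneck

X = np.arange(1000).reshape(-1, 1)        # stand-in dataset
train, thresh, test = offline_split(X)
```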
5.2 Experiments on Synthetic Datasets
To show the effectiveness of AEDDM in detecting sudden and gradual drifts, the most common scenarios, four synthetic datasets are used: Gaussian, VD, Hyperplane and Moving RBF. The former two are newly designed in this research work, while the latter two have been used in various research papers in the drift detection domain. The focus of this set of experiments is to determine the best set of parameters for the AEDDM framework, i.e., those that detect drift with minimum delay and the fewest false positives. These parameters include:
k parameter:
The k parameter is the sensitivity parameter that defines the spread of the data considered non-drifted. It is used in threshold computation and can be determined empirically by applying the AEDDM framework to normal non-drifted data as well as to drifted data. We evaluated k = 1, 2, 3 to determine the best value for use in the threshold \(\mu + k\sigma\) (see Tables 4 and 5).
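The threshold rule can be sketched as follows (a numpy sketch; `batch_threshold` is an illustrative name, and the reconstruction errors here are synthetic stand-ins):

```python
import numpy as np

def batch_threshold(recon_errors, k=3):
    """Threshold mu + k*sigma computed over the reconstruction errors
    of the held-out (non-drifted) threshold-computation data."""
    return recon_errors.mean() + k * recon_errors.std()

rng = np.random.default_rng(1)
errors = rng.normal(0.05, 0.01, size=2000)   # synthetic reconstruction errors
thr1 = batch_threshold(errors, k=1)
thr3 = batch_threshold(errors, k=3)
```

A larger k widens the band of errors treated as normal, so k = 3 yields fewer false positives than k = 1 on noisy non-drifted data.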
Alpha Parameter
To signal a batch as drifted or normal, AEDDM uses a batch threshold as well as a count threshold. The average reconstruction error of a batch from the respective AE is compared with the batch threshold, while the exceed count of a batch (explained in Section 4) is compared with the count threshold. The average reconstruction error hides the reconstruction errors of individual instances in a batch; the count threshold counters this by counting the instances whose error exceeds the instance threshold (explained in Section 4). The count threshold can be computed either as the median of batch-wise exceed counts on normal data or as their maximum. The median makes the count threshold more sensitive, producing more false positives, while the maximum makes it less sensitive to noise and false positives. The best fit can be determined empirically for a chosen dataset. We name this the alpha parameter, with two possible values: median or maximum.
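The two alpha settings can be sketched as follows (illustrative names; the per-batch reconstruction errors are synthetic):

```python
import numpy as np

def exceed_counts(errors_per_batch, instance_threshold):
    """Per-batch count of instances whose reconstruction error exceeds
    the instance threshold (the 'exceed count' of Section 4)."""
    return np.array([(batch > instance_threshold).sum()
                     for batch in errors_per_batch])

def count_threshold(counts, alpha="median"):
    """Alpha parameter: median is more sensitive (more false positives),
    maximum is more robust to noise."""
    return np.median(counts) if alpha == "median" else counts.max()

rng = np.random.default_rng(2)
batches = [rng.normal(0.05, 0.01, size=32) for _ in range(20)]  # 20 normal batches
counts = exceed_counts(batches, instance_threshold=0.07)
```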
The logical parameter or “Beta Parameter”
Another parameter of the AEDDM framework is the logical parameter (see Algorithm 1, Step 3a), which determines whether the two thresholds are used in conjunction or disjunction. Whether to combine the batch threshold and count threshold in an AND setting or an OR setting to signal a batch as normal or drifted can be established empirically. Using AND is expected to be more robust to noise and false alarms but may delay drift detection in some cases. The best-fit values of the above three parameters for a dataset can be determined empirically. We name this the beta parameter, with two possible values: "AND" or "OR".
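The combination of the two tests under the beta parameter can be sketched as a single decision function (illustrative names):

```python
def is_drifted(batch_error, exceed_count, batch_thr, count_thr, beta="AND"):
    """Beta parameter: combine the batch-threshold test and the
    count-threshold test in conjunction (robust to noise, possibly
    delayed) or disjunction (more sensitive)."""
    over_batch = batch_error > batch_thr
    over_count = exceed_count > count_thr
    if beta == "AND":
        return over_batch and over_count
    return over_batch or over_count
```

For example, a batch whose average error exceeds the batch threshold but whose exceed count stays below the count threshold is flagged under OR but not under AND.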
Table 4 summarizes the results for all the selected values of the AEDDM parameters on non-drifted data, while Table 5 shows the results on drifted data for the four synthetic datasets. In the case of non-drifted data, warnings and false positives are reported over the entire test batch stream, while in the drifted case both are counted only up to the detection point. In all datasets, drift starts from batch 20. The best set of parameters can be determined from the performance on both the non-drifted and the drifted data. On non-drifted data, the best set of parameters is expected to generate a minimum number of warnings and false positives. In the case of the Gaussian and VD datasets, there is no clear distinction in performance across different values of k, since the two classes are far apart in terms of data distribution (see Table 4). This distinction is much clearer for the Hyperplane and Moving RBF datasets at k = 3, where there are comparatively fewer warnings and false positives on non-drifted data. Based on this, we limited our search for the best parameters to k = 3 across all four datasets. On drifted data, the best set of parameters is the one with which AEDDM detects drift with minimum delay, warnings, and false positives, delay being the most important. Considering performance on both drifted and non-drifted data, Table 6 summarizes the best set of parameters across all four datasets.
Table 4: AEDDM Parameters Calibration on Non-Drifted Data.

| Dataset | k | Count Threshold Measure | Logical Parameter | Warnings | False Positives |
|---|---|---|---|---|---|
| Gaussian (187 batches) | 1 | Median | AND | 0 | 0 |
| | | | OR | 0 | 0 |
| | | Maximum | AND | 0 | 0 |
| | | | OR | 0 | 0 |
| | 2 | Median | AND | 0 | 0 |
| | | | OR | 27 | 0 |
| | | Maximum | AND | 0 | 0 |
| | | | OR | 0 | 0 |
| | 3 | Median | AND | 0 | 0 |
| | | | OR | 16 | 0 |
| | | Maximum | AND | 0 | 0 |
| | | | OR | 2 | 0 |
| VD (60 batches) | 1 | Median | AND | 0 | 0 |
| | | | OR | 2 | 0 |
| | | Maximum | AND | 0 | 0 |
| | | | OR | 0 | 0 |
| | 2 | Median | AND | 0 | 0 |
| | | | OR | 5 | 0 |
| | | Maximum | AND | 0 | 0 |
| | | | OR | 0 | 0 |
| | 3 | Median | AND | 0 | 0 |
| | | | OR | 4 | 0 |
| | | Maximum | AND | 0 | 0 |
| | | | OR | 0 | 0 |
| Hyperplane (124 batches) | 1 | Median | AND | 30 | 77 |
| | | | OR | 27 | 83 |
| | | Maximum | AND | 25 | 2 |
| | | | OR | 46 | 44 |
| | 2 | Median | AND | 46 | 13 |
| | | | OR | 46 | 40 |
| | | Maximum | AND | 7 | 0 |
| | | | OR | 16 | 0 |
| | 3 | Median | AND | 49 | 7 |
| | | | OR | 55 | 10 |
| | | Maximum | AND | 0 | 0 |
| | | | OR | 2 | 0 |
| RBF (125 batches) | 1 | Median | AND | 2 | 123 |
| | | | OR | 2 | 123 |
| | | Maximum | AND | 5 | 117 |
| | | | OR | 2 | 123 |
| | 2 | Median | AND | 17 | 97 |
| | | | OR | 4 | 120 |
| | | Maximum | AND | 24 | 3 |
| | | | OR | 48 | 29 |
| | 3 | Median | AND | 46 | 37 |
| | | | OR | 31 | 70 |
| | | Maximum | AND | 10 | 0 |
| | | | OR | 24 | 0 |
Table 5: AEDDM Parameters Calibration on Drifted Data.

| Dataset | k | Count Threshold Measure | Logical Parameter | Warnings | Detection Delay | False Positives |
|---|---|---|---|---|---|---|
| Gaussian | 1 | Median | AND | 0 | 0 | 0 |
| | | | OR | 0 | 0 | 0 |
| | | Maximum | AND | 0 | 0 | 0 |
| | | | OR | 0 | 0 | 0 |
| | 2 | Median | AND | 0 | 0 | 0 |
| | | | OR | 4 | 0 | 0 |
| | | Maximum | AND | 0 | 0 | 0 |
| | | | OR | 0 | 0 | 0 |
| | 3 | Median | AND | 0 | 0 | 0 |
| | | | OR | 2 | 0 | 0 |
| | | Maximum | AND | 0 | 0 | 0 |
| | | | OR | 0 | 0 | 0 |
| VD (30 batches) | 1 | Median | AND | 0 | 4 | 0 |
| | | | OR | 0 | 0 | 0 |
| | | Maximum | AND | 0 | 4 | 0 |
| | | | OR | 0 | 2 | 0 |
| | 2 | Median | AND | 0 | 5 | 0 |
| | | | OR | 2 | 1 | 0 |
| | | Maximum | AND | 0 | 5 | 0 |
| | | | OR | 0 | 2 | 0 |
| | 3 | Median | AND | 0 | 6 | 0 |
| | | | OR | 2 | 1 | 0 |
| | | Maximum | AND | 0 | 6 | 0 |
| | | | OR | 0 | 2 | 0 |
| Hyperplane (80 batches) | 1 | Median | AND | 8 | -20 | 9 |
| | | | OR | 6 | -20 | 12 |
| | | Maximum | AND | 5 | -1 | 1 |
| | | | OR | 10 | -20 | 4 |
| | 2 | Median | AND | 7 | -19 | 4 |
| | | | OR | 8 | -19 | 5 |
| | | Maximum | AND | 3 | No Det | 0 |
| | | | OR | 5 | No Det | 0 |
| | 3 | Median | AND | 4 | 1 | 0 |
| | | | OR | 5 | 1 | 0 |
| | | Maximum | AND | 1 | No Det | 0 |
| | | | OR | 2 | No Det | 0 |
| RBF | 1 | Median | AND | 0 | -20 | 20 |
| | | | OR | 0 | -20 | 20 |
| | | Maximum | AND | 0 | -20 | 20 |
| | | | OR | 0 | -20 | 20 |
| | 2 | Median | AND | 2 | -20 | 14 |
| | | | OR | 2 | -20 | 17 |
| | | Maximum | AND | 5 | 0 | 0 |
| | | | OR | 9 | -14 | 2 |
| | 3 | Median | AND | 8 | 0 | 3 |
| | | | OR | 6 | 0 | 8 |
| | | Maximum | AND | 4 | 22 | 0 |
| | | | OR | 6 | 21 | 0 |
Based on the results, it is evident that using k = 3, the median as the count threshold measure, and combining the batch threshold and count threshold in the AND setting gives better results. We chose this as the default setting of AEDDM and follow it in all remaining experiments.
Table 6: Best Set of Parameters for Synthetic Datasets.

| Dataset | k | Alpha | Beta |
|---|---|---|---|
| Gaussian | 3 | Median | AND |
| VD | 3 | Maximum | OR |
| Hyperplane | 3 | Median | AND |
| Moving RBF | 3 | Median | AND |
To evaluate the results graphically, Fig. 6 plots the reconstruction errors and exceed counts at the outputs of the layer 1 and layer 2 autoencoders for all four datasets. For batches that are not passed to the layer 2 autoencoder, the layer 2 exceed count is symbolically set to -1. The reconstruction error plots for both autoencoders (Fig. 6(a) and 6(b)) show a clear difference between the non-drifted data (the first 20 batches) and the drifted data starting at batch 20 in three of the four datasets. In the case of the Hyperplane dataset, the difference in reconstruction error is less clear, as there is only a slight difference between the distributions of the positive and negative class data (Fig. 6(iii) a and b). Similarly, the layer 2 exceed count plots (Fig. 6(d)) show a clearer separation between drifted and non-drifted data for the Gaussian, VD and RBF datasets than for the Hyperplane dataset; this distinction is less clear in the layer 1 exceed count plots (Fig. 6(c)). Since drift is detected based on the reconstruction error and exceed counts at layer 2, we are specifically interested in the layer 2 outputs. As we have used a simple deep vanilla autoencoder with a default set of hyperparameters, there is considerable room to improve detection performance on these datasets by calibrating different types of autoencoders with different architectures and hyperparameters. The next section details further experimentation of AEDDM on real-world datasets.
5.3 Experiments on Real World Datasets
The drift detection performance of AEDDM is evaluated on four benchmark real-world datasets: the NOAA weather data, Forest Covertype, the KDDCUP99 intrusion detection dataset and the ELEC2 electricity dataset. In real-world datasets, changes in the distribution of the most relevant features are likely to shift the decision boundary substantially, and such changes are expected to degrade the accuracy of pre-trained classifiers (real drift). Conversely, changes in the distribution of less informative features should not affect the decision boundary, with little or no impact on the performance of pre-trained classifiers (virtual drift). An ideal drift detector should respond to distributional changes in the more informative features and ignore changes in the less informative ones.
To test AEDDM on real-world datasets, we introduced drift by interchanging the values of the top 25% (30% or 40%) of attributes in one setting, and of the bottom 25% (30% or 40%) in another, following the approach of (Sethi & Kantardzic, 2017) and (Castellani et al., 2021). The top and bottom attributes are selected based on feature importance measures such as information gain or mutual information. The drift detection results of AEDDM are summarized in Table 7. For this set of experiments, k = 3 is used for the batch threshold, the median as the count threshold measure (alpha parameter), and AND as the logical parameter (beta parameter). The number of batches in the non-drifted and gradual drift cases depends on the dataset size, while in the sudden drift case the initial 55 batches are considered, based on the size of the smallest dataset. On non-drifted data, AEDDM produces no false positives on any of the four datasets and only a few warnings for NOAA and Covertype (see Table 7). A two-sample t-test at the 5% significance level is used to validate the outcome of AEDDM: the null hypothesis (H0) holds when there is no drift in the data and H1 when there is. The t-test also confirms that there is no drift in the normal, non-drifted batch stream for all four datasets. In the sudden drift top 25% case, AEDDM detects the drift with zero delay for three datasets and with a delay of four batches for ELEC, whereas in the sudden drift bottom 25% case there are only a few warnings (NOAA = 3, ELEC = 5, Covertype = 16), only one false positive each for Covertype and ELEC, and no detections for any dataset, as confirmed by the t-test over the initial 55 batches.
This result strengthens our proposition that a drift detector should detect real drift (drift in important features) while ignoring virtual drift (changes in less important features), which AEDDM effectively demonstrates in these experiments.
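Assuming the interchange is performed between pairs of top-ranked feature columns (the exact pairing used in the cited approach is not specified here, so this is a hedged sketch with hypothetical names), the injection could look like:

```python
import numpy as np

def inject_swap_drift(X, ranked_features, frac=0.25):
    """Hypothetical swap-based drift injection: take the top `frac` of
    features by an importance ranking (e.g. mutual information) and
    interchange the column values of adjacent pairs. Illustrative only;
    the paper's exact interchange scheme may differ."""
    X = X.copy()
    k = max(2, int(len(ranked_features) * frac))
    top = ranked_features[:k]
    for a, b in zip(top[0::2], top[1::2]):   # swap adjacent pairs
        X[:, [a, b]] = X[:, [b, a]]
    return X

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 8))
# ranked_features is a stand-in ranking: feature 0 most important, etc.
drifted = inject_swap_drift(X, ranked_features=list(range(8)), frac=0.25)
```

Swapping the most informative columns perturbs the decision boundary (real drift), while the same operation on the least informative columns should leave the boundary largely intact (virtual drift).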
To induce gradual drift in the real-world datasets, we introduce a new mechanism incorporating gradual drift (sudden at discrete steps): the initial 10% of the data is left unchanged; for the next 10%, the values of the top 25% of attributes are increased by 10% (relative to the original non-drifted values); the next 10% is increased by 20%, and so on, so that in the last chunk the values of the top 25% of attributes are increased by 100%. In the top 25% gradual case, drift is detected in all four datasets, while in the bottom 25% gradual case no drift is detected in NOAA and Covertype. Although the results vary across datasets in the bottom 25% gradual case, the performance of AEDDM is encouraging and can be explored further.
Table 7: AEDDM Drift Detection Results on Real Datasets. For sudden drift, the initial 55 batches are considered with drift starting at batch 20; for gradual drift, the whole batch stream is considered.

| Dataset | Batches (non-drifted) | Warnings (non-drifted) | False Positives (non-drifted) | t-test, α = 5% (non-drifted) | Detection Delay (sudden, top 25/30/40%) | t-test (sudden, top) | Warnings (sudden, bottom 25/30/40%) | False Positives (sudden, bottom) | t-test (sudden, bottom) | Drift Point (gradual, top 25%) | Detect Point (gradual, top 25%) | t-test (gradual, top) | Detect Point (gradual, bottom 25%) | t-test (gradual, bottom) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| NOAA | 55 | 3 | 0 | H0 | 0 (Top 40%) | H1 | 3 | 0 | H0 | 6 | 25 | H1 | No Detection | H0 (L1), H1 (L2) |
| Covertype | 1815 | 51 | 0 | H0 | 0 (Top 25%) | H1 | 16 | 1 | H0 | 182 | 584 | H1 | No Detection | H0 |
| KDDCUP | 1543 | 0 | 0 | H0 | 0 (Top 30%) | H1 | 0 | 0 | H0 | 155 | 616 | H1 | 467 | H1 |
| ELEC2 | 142 | 0 | 0 | H0 | 4 (Top 30%) | H1 | 5 | 1 | H0 | 15 | 77 | H1 | 35 | H1 |
To demonstrate the impact of drift on classifier performance and the effectiveness of AEDDM in detecting real drift while ignoring virtual drift, we experimented with seven widely used batch classifiers: logistic regression, random forest, KNN, SVM, XGB, decision tree and MLP (for the KDDCUP99 and Covertype datasets SVM is omitted due to its long training time). In all four datasets, sudden and gradual drift is introduced using the top 25% and bottom 25% approach; in the sudden case drift starts from batch 20, while in the gradual case it starts after the initial 10% of the batch stream. Classification performance is measured by F1 score, with results averaged over 5 batches, shown in Fig. 7. For the sudden drift scenarios (Fig. 7(a) and (b)) results are reported only for the first 55 batches, while for the gradual drift scenarios (Fig. 7(c) and (d)) results cover the entire batch stream. In the top 25% sudden case (Fig. 7(a)), there is a clear drop in F1 score after index 3 (batch indices start from zero and results are averaged over 5 batches, so index 3 corresponds to batch 20), indicating that the drift is real.
In the bottom 25% sudden case (Fig. 7(b)), there is far less degradation, and the F1 score follows almost the same pattern as in the first 20 non-drifted batches. Here AEDDM shows its robustness to distributional changes in less informative features (virtual drift) and does not signal this drift (see Table 7). Similarly, in the top 25% gradual case (Fig. 7(c)), the F1 score falls gradually at discrete intervals, in accordance with how the drift is introduced, indicating real drift, which AEDDM successfully detects in three datasets (NOAA, Covertype and KDDCUP99) with some delay. In the bottom 25% gradual case (Fig. 7(d)), the distributional changes have little impact on classification performance (virtual drift), and AEDDM ignores these changes in three of the datasets. These results show the effectiveness of the proposed AEDDM method in detecting the distributional changes in real-world datasets that are most likely to affect classifier performance. Apart from testing AEDDM with batch classifiers, we also tested it with a well-known online classifier, the Hoeffding tree. The next section briefly describes these experiments.
5.4 Performance Comparison
In real-world applications, the drift detection mechanism should be transparent to the machine learning task (classification, in our case) working in tandem with the arrival of new data. If a drift is detected that is likely to degrade the classifier's performance, the pre-trained classifier should be retrained on the new data so that it can maintain acceptable predictive performance. To test and compare the proposed AEDDM approach in an online learning environment, we use the Hoeffding tree (Hulten et al., 2001b), an online incremental learning algorithm, as the base classifier. A Hoeffding tree adapts to changes in the data with the addition of every new sample and gives performance comparable to a non-incremental batch learner with unlimited data availability (Montiel et al., 2018).
For this set of experiments, we use the same four real datasets in the top 25% sudden drift setting. AEDDM detects the drift at batch 20 with zero delay for NOAA, Covertype and KDD, and at batch 26 for ELEC, with a delay of six batches. Comparison is made across the following:
- Static Model (NoChange): A Hoeffding tree is trained on the available training data. It is assumed that no drift occurs, so no drift detector is employed and no model update takes place. This acts as the lower baseline.
- Prequential HT: A prequential Hoeffding tree classifier with a pre-train size and batch size of 32. Performance measures such as accuracy and the kappa statistic are averaged over 32 instances. It acts as an upper baseline, since labels are readily available: the HT first predicts each instance and then updates itself with the correct label.
- AEDDM (the proposed method): A Hoeffding tree is trained on the entire available labeled data and used for making predictions until a drift occurs. Each drifted batch becomes part of the training data once labels are available, and the model is retrained. We have not yet formalized the complete adaptation mechanism for AEDDM; that is left for future work. The current focus is the effectiveness of real drift detection, with a limited demonstration of adaptation.
- KS Test: The Kolmogorov-Smirnov test is used for drift detection in an unsupervised manner. It compares an empirical distribution (the incoming batch stream) with a reference distribution (non-drifted data) to test whether both batches come from the same distribution. A p-value below 0.05 at the 5% significance level indicates drift between the current and reference distributions (Z. Wang & Wang, 2020a). A Hoeffding tree is trained on the available training data and used to predict the incoming batches. After the drift point, each incoming batch becomes part of the training data and the Hoeffding tree is retrained.
- ADD: ADD is another autoencoder-based batch drift detection method that uses a thresholding mechanism. It also uses a Hoeffding tree as the base classifier, initially trained on the available training data along with the autoencoders. ADD uses two different thresholds for gradual and sudden drift and signals drift if a batch's reconstruction error exceeds either threshold. On drift, the base classifier is retrained on the current batch, each drifted batch becomes part of the training data to retrain the autoencoders, and the training loss is recomputed.
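The KS-based comparison described above can be sketched with a direct implementation of the two-sample statistic (a numpy sketch; in practice `scipy.stats.ks_2samp` provides the same statistic together with a p-value for the 5% significance test):

```python
import numpy as np

def ks_two_sample(a, b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|
    over the pooled sample points. A large D suggests the two batches
    come from different distributions."""
    a, b = np.sort(a), np.sort(b)
    xs = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, xs, side="right") / len(a)
    cdf_b = np.searchsorted(b, xs, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(5)
# Same distribution vs. a mean-shifted (drifted) distribution.
same = ks_two_sample(rng.normal(size=500), rng.normal(size=500))
shifted = ks_two_sample(rng.normal(size=500), rng.normal(loc=1.0, size=500))
```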
We selected KS-test-based drift detection (Z. Wang & Wang, 2020a) and ADD (Jaworski et al., 2020) for comparison, as both are batch-based drift detection methods that work in batch incremental settings. Results are reported for the first 40 batches, where the initial 20 batches contain non-drifted data and the last 20 contain drifted data, with drift starting at batch 20 in all four datasets. This equal split of drifted and non-drifted batches provides a fair basis for comparing accuracy over the entire batch stream.
In this set of experiments our focus is the drift detection performance of AEDDM followed by an adaptation mechanism, so that the pre-trained classifier can recover from drift-induced performance degradation. The prequential Hoeffding tree and ADD each have their own adaptation mechanism (described above), while the KS test and AEDDM use a similar approach (described above); the static model employs no detection and hence no adaptation. The plots in Fig. 8 show that the accuracy of the base classifier falls from batch 20 and then recovers as the model is retrained through each method's detection and adaptation mechanism. The fall in accuracy is much clearer for the KDDCUP99 and Covertype datasets than for the other two. The no-update model shows, in all four cases, how sharply accuracy falls after drift occurs at batch 20 when there is no detection and adaptation mechanism. The proposed AEDDM approach detects this drift with zero delay (see Table 7), and the adapted base classifier quickly recovers, giving the best average accuracy over the batch stream for KDDCUP99, Forest Covertype and ELEC, and sharing the top rank with the KS approach on NOAA. The average accuracy scores over the entire batch stream are summarized in Table 8; AEDDM outperforms or matches the other methods on all four datasets. Access to the experiments can be provided on reasonable request to the corresponding author.
Table 8: Average Accuracy Over 40 Batches.

| Dataset | AEDDM | ADD | KS | Prequential HT | No Update |
|---|---|---|---|---|---|
| NOAA | 0.68 | 0.67 | 0.68 | 0.66 | 0.48 |
| COVERTYPE | 0.70 | 0.62 | 0.61 | 0.58 | 0.57 |
| KDDCUP99 | 0.94 | 0.91 | 0.92 | 0.93 | 0.49 |
| ELEC | 0.85 | 0.63 | 0.83 | 0.73 | 0.71 |