CPOCEDS-Concept Preserving Online Clustering for Evolving Data Streams

Clustering streaming data is challenging due to many temporal dynamics, such as concept drift, concept evolution, and feature evolution. Concept evolution is the most challenging of these. Due to concept evolution, new classes may emerge or existing classes may disappear, so it is crucial to process streaming data continuously. This paper proposes a novel online clustering method, speciﬁcally for streaming data with concept evolution. It consists of three phases: initialization, clustering and outlier handling. To identify recurrences of previous data in streaming data, it is critical to preserve the sequential properties of data chunks. In the proposed model, representatives from previous windows are added to the current window, making it distinct from existing models. The detection and handling of outliers are very challenging tasks in streaming data analysis. Outliers are often the ﬁrst instances of a new cluster. The proposed model stores the outliers from each data window. When the number of outliers exceeds a certain threshold, the representatives of outliers are added to the next window to identify new classes. To handle the lack of data sets for the training of such models, we created a synthetic data set with 22020 data instances. Using Silhouette Coeﬃcient, Calinski-Harabasz Index, and Davies-Bouldin Index analysis, this model yielded the most favourable results.


Introduction
A huge amount of data is generated each day from a variety of sources. Realtime analysis of such data becomes crucial for the expansion of their services in line with fast-changing trends. Traditional batch data analysis models are based on historical data. Considering the rapid change in trends in numerous domains, this model may not be able to predict the future. As a result, streaming data analysis has become an unavoidable requirement in a variety of fields such as IoT, online markets, network traffic, cyber security, weather forecasting, and so on. The term 'streaming data' refers to data that is continuously generated by various sources in varying format and volume with varying velocity. A data stream DS is represented as DS = {d 1 , d 2 , d 3 , ..., d l } where d i is a feature vector n-dimensional and l is the number of features in that stream. The demand for ever-increasing memory capacity makes it impossible to store such data. Therefore, streaming data should be analyzed on the fly without having access to all data.
Streaming data differs from batch data. One major difference is that the distribution of data in streaming data can change over time, a phenomenon known as concept drift. [19]. Another difference between streaming data and batch data is the dynamic nature of features. Additionally, certain classes may appear or disappear over time, thereby varying the number of classes, a process known as concept evolution. [17]. A new type of class usually has a different set of features. As a result, various chunks may have distinct feature sets. Streaming data analysis also faces some major challenges. First of all, data should only be processed once and can never be retrieved. Due to the high velocity and infinite volume of data, an efficient method and a large memory space are required, which is not realistic. Additionally, the order of the data cannot be controlled. To improve performance, however, it is a necessity to preserve the properties of the previous data if the same pattern of data is reoccurring.
Since there is unpredictability about the stream class labels and their numbers, clustering is an ideal method for mining and analyzing streaming data. Clustering is the most commonly used unsupervised machine learning method for grouping similar objects. Data clustering involves identifying patterns in the data and classifying the data by grouping similar data together. There are several challenges associated with streaming data clustering like infinite size, high speed, dynamic nature, lack of global view of data, outliers, high dimension of data stream and so on [34]. Since the number of clusters for each instance may vary, we should find out how many clusters each instance dynamically has. Streaming data can also contain outliers similar to traditional data. Detecting and handling outliers is a very difficult task in streaming data analysis.
To address the challenges of streaming data analysis, we propose a fully online streaming data clustering technique based on K-Means clustering. The method consists of three phases: initialization, clustering, and handling outliers. Since the method is completely online, a summary of the data is not maintained. By adding representatives of previous windows to the current window, our model preserves the properties of their predecessors. This eliminates the need to store a synopsis. This is what makes our model unique. Outliers are often the first instances of a new cluster.Moreover, we store outliers in our model. If their size exceeds a certain threshold, the representatives of outliers are added to the next window. It helps in identifying a new class if it exists. Thus, the model deals with concept evolution. Also, there are no ideal datasets for training streaming data clustering models currently. As a result, most of the models rely on traditionally labelled data streams for training and testing. In order to test the proposed model, we created a synthetic data set with 22020 data, based on the Silhouette Coefficient, Calinski-Harabasz Index, and Davies-Bouldin Index, which produced the best results.
The rest of the paper is organized as follows: In section 2, we present a review of previous works that mostly deal with streaming data clustering. Our proposed algorithm and methodology are explained in section 3. Whereas section 4 provides a short description of the synthetic dataset, which is used for the training of our model as an ideal dataset. Section 5 describes the evaluation of the model and its analysis of results. Conclusions, open issues, and future works are discussed in section 6.

Background
In this section, we discus various recent studies and research related to the proposed work, as well as the concepts and characteristics of streaming data that would reflect streaming data clustering such as concept drift (sec 2.1), infinite length of data streams (sec 2.2), feature drift (sec 2.3), and concept evolution (sec 2.4). Further, we discuss commonly used data structure in streaming data analysis; windowing (sec 2.5), outliers (sec 2.6), data stream clustering (sec 2.7), and benchmark datasets (2.8).

Concept drift
Concept drift is one of the challenging streaming analytic problems which observes the changes in the distribution of the data over time. Concept drift can be divided into four types: sudden, gradual, incremental and recurring [39]). Concept drift is responsible for the formation of new clusters, the disappearance, or the evolution of existing clusters. Cluster boundaries may also be affected by concept drift. Real concept drift occurs when the cluster boundaries are changed, whereas virtual concept drift occurs when they are not (Gama Ja, et al. [18]).

Infinite Data streams
It is impossible to store all of the data in memory because streaming data is infinite; instead, only a synopsis is stored if needed. A special structure is therefore needed to summarize the input stream incrementally. The most common data structures are feature vectors-which represent an overall overview of the data, prototype arrays -which contain only a few representative samples, coreset trees -which preserve a tree-structured summary, and grids -which preserve the data density in a spatial manner (Silva J, et al. [

Feature drift
It occurs when certain features become irrelevant to a learning task or cease to be relevant to the task [7]. As a result, the associated class may change. [38]. Another key characteristic of streaming scenarios is time-dependent changes in features, called feature evolution. It is difficult and critical to develop an efficient representation of streaming data as a dynamic vector.

Concept evolution
Labels for classes can appear or disappear through concept evolution. Problems such as anomaly detection or handling outliers are intrinsically related to concept evolution [17]. ECSMiner [28], CLAM [4], and MINAS [16] are some algorithms that deal with concept evolution.

Windowing model
Stream data processing, instead of the entire set of data, use the most recent data. A number of window models are used for this purpose: the damped window, where the most recent data takes precedence over the oldest data. The landmark window, where each instance has equal weight, and the sliding window model, where each instance is flipped at each step [5].

Outliers
Outliers are data instances that are distinct from all other data instances. Thakkar et al. [33] divided outlier detection techniques into four categories.: statistical, distance-based, density-based, and clustering-based outlier detection. Nonparametric statistical outlier detection techniques can be adapted to streaming data because they use the existing data points as a distribution model. Distance-based methods (Chauhan.P, et al. [13]) employ the instance's neighbour count to determine whether it is an outlier or not. If there are no k neighbours within a d distance, it is termed an outlier. There is no assumption made on the distribution model. As a result, it works well with data streams. However, they fail to cope with high-dimensional data streams. If the density around the instance is comparable to that of its neighbours, it is not an outlier. Otherwise, it is considered an outlier. This is more effective than distance-based approaches, although it has a larger computational complexity. Data instances that do not belong to any clusters are considered outliers in clustering-based outlier detection methods(Bhosale [8]).

Related Works
In order to cluster streaming data, most works have used variations of clustering methods used for traditional batch data clustering. The recent stream clustering algorithms can be broadly divided into three types: partition-based clustering, distance-based clustering, and density-based clustering.

Online-Offline Clustering
The majority of existing methods (for example, CluStream [2], DenStream [12], StreamKM++ [1], or ClusTree [25]) divide the clustering operation into two phases [32]: online phase and offline phase. In online phase, data instances are summarized and stored in a specific data structure. This synopsis is updated whenever a new data point is arrived. On synopsis, the actual clustering occurs during the offline phase. It runs periodically or re-clusters every time a new data instance appears, ensuring the clusters are always up-to-date.
MuDi-Stream (Multi density data stream clustering algorithm, Amini et al. [6]) is a hybrid method that employs density-based as well as grid-based techniques. Outliers are identified using grids after input data instances are clustered using a density-based technique. Core mini-clusters are utilized for data summarization. Core mini-clusters are feature vectors that have been customized. They maintain the weight, centre, radius, and the foremost distance between a data point and the mean. During the online phase, core mini-clusters are built and updated for each new data instance. As a result, final clustering is done during the offline phase.
Adaptive stream K-Means (Puschmann et al. [30]) and FEAC-Stream (Andrade Silva et al. [15]) are k-means based algorithms for clustering data streams with a variable number of clusters. There are two main phases to these algorithms: the initialization and the continuous clustering. In initialization part, the value of 'k' is determined by the Silhouette coefficient calculation function. Being fully online, these algorithms don't keep a summary of the data, but maintain the clustering result. FEAC-Stream Tracked the clustering quality during the execution using the Page-Hinkley (PH) test(Mouss et al. [29]), and if the quality deteriorates. Hence, the algorithm should be adapted. The quality of clustering in FEAC-Stream is determined by three hyperparameters: size of data sequence, decay rate, and minimum weight threshold. The adaptive stream K-Means measures and monitors standard deviations and means of input data streams in order to detect concept drift. Existing cluster centroids become invalid when there is concept drift.
CEDAS (Clustering of evolving data streams into arbitrarily shaped clusters, Hyde et al. [23]) is a density-based clustering technique for forming irregular shape clusters from evolving data streams. The damped window model is used with the help of a linear decay function rather than an exponential decay function. CEDAS maintains a data synopsis in micro-clusters and constructs a graph structure out of micro-clusters which are larger than an initialized threshold. The decay rate, micro-cluster radius, and minimum density threshold are the three parameters required by CEDAS. These parameters are directly related to the input data and have a significant impact on clustering quality.
Improved Data Stream Clustering Algorithm (Yin et al. [36]) is a densitybased algorithm that is suitable for arbitrary-shaped clusters. This algorithm constantly changes threshold values, depending on the input data. There are two main phases to this algorithms, initialization and the continuous clustering . Using DBSCAN clustering on the n number of data instances, major micro-clusters and critical micro-clusters are created during the initialization phase.The final clustering process will include major micro-clusters with high densities. Low densities characterise critical micro-clusters, which are treated as potential outliers. During the continuous clustering phase, new data instances are intended to be placed to the closest major micro-cluster.If the nearest major micro-cluster does not suit, it is then added to the nearest critical micro-cluster. If neither of these options is appropriate, a new micro-cluster will form. A damped window model is used and low weighted major and critical micro-clusters are removed periodically. The threshold values for major and critical micro-clusters in the algorithm are global instead of unique to each cluster.
DBIECM (Zhang et al. [37]) is an online, distance-based, evolving data stream clustering technique that utilizes the Davies Bouldin Index (DBI) rather than the shortest distance as the assessment criterion. It is a better variant of the Evolving Clustering Method (ECM) (Song et al. [24]). When a new data instance d is arrived and the distance of d is lower than radius of any cluster ,d can be to this cluster. Whenever the distance between a cluster and d exceeds the maximum cluster radius r 0 , a new cluster is created. The number and quality of DBIECM clusters are affected by the hyperparameter maximum cluster radius r 0 .
In the literature, it was noted that some streaming data clustering is performed online and offline for better results. Several other methods are used only in online phases, but most of them store synopses to improve results. In fully online models, the first step is to estimate how many clusters are in each window of the data stream. There are very few methods for detecting and handling outliers. A fully online stream clustering model is proposed that estimates the number of clusters 'k', detects and handles outliers without storing the synopsis. Furthermore, the proposed model's performance is incrementally improved by preserving the properties of previous data windows.

Benchmark Datasets
It is rare to have real-time data streams for analyzing streaming data. Thus, most models stream traditional data sets using some streaming mechanism such as Apache Kafka or simulate synthetic data streams using some applications. There are some data repositories used for stream data clustering, such as Meetup RSVP Stream37 , National Weather Service Public Alerts38, and Stream Data Mining Repository39: a repository includes four different datasets, namely Sensor Stream (2,219,803 instances, 5 features, and 54 classes), Power Supply Stream (29,928 instances, 2 features, and 24 classes), Network Intrusion Detection 10 percentage Subset (494,021 instances, 41 features, and 23 classes), and Hyper Plane Stream (100,001 instances, 10 features, and 5 classes). Data mining framework Massive Online Analysis (MOA) [9] includes four datasets that are suitable for data stream analysis. Table 1 summarizes some other popular datasets used for streaming datasets. Except for the Charitable Donation Dataset, all datasets in this section have true class labels. It is clear from this discussion that most datasets used for streaming data analysis are labeled and thus suitable for supervised learning. As of today, there are no datasets that satisfy all of the characteristics of streaming data.

Overview of Proposed Method
Figure-2 illustrates the proposed streaming data clustering framework. We propose a fully online method for clustering streaming data that uses the K-Means algorithm. The method consists of three phases: initialization, clustering, and handling outliers. Since the model is fully online, no summary of data is stored. By adding representatives of previous windows to the current window, our model preserves the properties of their predecessors without storing previous data or a synopsis. This is what makes our model distinctive. During the initialization phase, k is estimated based on the inertia of the data instances in the window. This 'k' is used to perform K-Means clustering on the data window. Each cluster's radius, which is calculated from its mean and standard deviation, can be used to determine clusters and outliers. Outliers are often the first instances of a new cluster. Outliers can be stored for each window. If the number of outliers reaches the average size of clusters, then outliers will be added to the next window for clustering and identifying the new cluster if it exists. This is how the proposed method handles outliers as well as clustering the evolving data streams.

Windowing
In streaming data clustering, it must cluster the most recent data rather than the entire data. Therefore, the proposed model uses the landmark window [5] model that includes N most recent data points. Data between two landmarks is included in the landmark window model. Window length is determined by instance count or elapsed time. For the model proposed, instance count is used to determine window length. With this window model, the new window starts where the old window ends, and successive windows don't overlap. Equation 1 can be used to determine window instances and equation 2 to determine the window number of a particular data instance.Here, l is the window length, d j is the j th instance, and SW n is the n th window. Indexes j and n begin at zero.

Handling Outliers
There may be patterns of some data instances that are different from other patterns in a particular data set called outliers. An outlier may be present due to malicious activities, data collection problems, instrument errors, or transmission problems [5]. Outlier may negatively affect the data analysis. So, identifying and handling the outliers are crucial issues in data analysis. Detecting outliers from streaming data can be difficult. There is no way to detect outliers with a K-Means model. Outliers are separated based on a threshold distance. Mean and standard deviation are used to set threshold distance. In

Get Representatives
To get representation each cluster is micro-clustered again using the K-Medoid clustering technique and the micro-cluster centers are stratified samples for representing the cluster. The number of micro-clusters 'k' is estimated at 10 percent of the cluster size. As a result of merging all the representatives and adding to the next data window, we can make all properties of data points in the previous data window live. In this way, the model preserves the properties of its predecessors without storing any data or synopsis of its predecessor.

Algorithm 4 getRepresentatives(C, e)
Require: C-Clusters and e-elitism rate Ensure: representatives of each cluster reps for appending with next data window W 1: while C ̸ = ϕ do

Dataset
There may be some data points that are outliers in some cases. However, at times, the number of outliers at a particular location will increase and form a new class. The proposed model is excellent for clustering these types of data and identifying novel classes. However, there are several real-time use cases for such problems, and the analysis of evolving data and identifying novel classes are major challenges in streaming data analysis. According to all the literature that we have done, there are currently no datasets that satisfy all of the characteristics of streaming data, especially concept evolving datasets that are suitable for training our model. Thus, we have created a synthetic dataset of 22020 data for training the proposed model. The synthetic dataset consists primarily of two classes with ten thousand data points each, which are drawn at specific locations. A random sample of 20 outliers is added to the dataset, then the dataset is shuffled. The third class is also generated with 2000 data points at a certain location and distributed to the existing dataset in such a way that the initial data chunks contain the most data points from the first two classes and the subsequent chunks contain more data points from the third class in increasing order, with some outliers in each chunk.

Result Analysis
The hyperparameters of the proposed model are window size (amount of data per chunk), threshold for outliers, and number of cluster representatives. However, the most efficient performance could be obtained for a window size of 250 by a threshold of 25 percent growth of the edge of each cluster, and 10 percent of each cluster's size as a representative. Figure4 shows the clustering result of some sample windows.
In cases where the ground truth labels are unknown, the model must be evaluated using the model itself. To evaluate the proposed model, we calculate the Silhouette Coefficient, Calinski-Harabasz Index, and Davies-Bouldin Index. The yellow points are the outliers, and the stars are the cluster centroids. This observation shows that the data stream windows start with two classes and some outliers. Outliers are becoming more common in one region, and then they form a new class.

Silhouette Coefficien
For each sample, the Silhouette Coefficient [31] consists of two scores: a: The intra-cluster distance. b: The mean nearest cluster distance When a set of samples is considered, the Silhouette Coefficient is the mean of the Silhouette Coefficients for each sample. Incorrect clustering will receive a score of -1 while highly dense and well-separated clustering will receive a score of +1. A score of zero indicates clusters overlapped. For the proposed model with a window size of 250, the Silhouette Score varies between 0.6 and 0.71.

Calinski-Harabasz Index
The Calinski-Harabasz index [11] is also known as the Variance Ratio Criterion. Calinski-Harabasz scores higher when the clusters are more well-defined. Index values are calculated as the ratio of the sum of inter and intra cluster dispersion for all clusters (distance squared is the measure of dispersion).
Calinski-Harabasz score s is defined as the ratio of the measure of dispersion within and between clusters when comparing a set of data E of size n with k clusters: is trace of the inter-cluster dispersion matrix and tr(Y k )is the trace of the intra-cluster dispersion matrix defined by: where C p the set of points in cluster p, c p the center of cluster p , c E the center of E, and n p the number of points in cluster p.

Davies-Bouldin Index
Davies-Bouldin scores [14][21] are easier to compute than Silhouette scores. Indexes like this one are used to represent the degree of similarity between clusters; where similarity is a measure of the distance between clusters in relation to their size. Due to the fact that the distances are calculated only point-bypoint, the index depends entirely on the quantity and feature information in the dataset. The lowest score is zero. Lower values indicate better partitioning.
The index is defined as the average similarity between each cluster C i for i = 1, ..., k and C j . In the context of this index, similarity is defined as a measure R ij that trades off: s i is the cluster diameter d ij is the distance between cluster centroids i and j R ij = s i + s j d ij Then Davies-Bouldin index can be defined as:

Fig. 7: Davies-Bouldin Index
Summarizing the different aspects discussed in the results analysis section, the proposed model shows its superiority over existing methods in terms of concept evolution, concept preservation, and handling outliers. The novel stream clustering technique maintains a consistent range of values for different evaluation metrics, such as the silhouette coefficient, the Calinski-Harabasz index, and the Davies-Bouldin index, and proves that the proposed model is superior to the existing one.

Complexity Analysis
Suppose l is the length of the data stream, and we divide it into b blocks (windows), each with a size of n = l b . We cluster each window into K clusters using K-Means clustering, an algorithm of complexity O n 2 . In the next step, we select some representatives of each cluster K with size m = n K by using the K-Medoid clustering technique, which has a complexity of O (m * k * t), where k = m * 0.1 and t is the constant number of iterations. As a result, the total complexity of the algorithm is calculated as follows: Thus, the complexity is directly proportional to the square of the length of the string l and inversely proportional to the number of blocks b.

Conclusion
The features of each chunks in evolving data streams may vary. As a result, clustering of evolving data streams and finding new classes are the major challenges in streaming data analysis. Using K-Means clustering, we proposed a fully online streaming data clustering method. The method is divided into three phases: initialization, clustering and outliers handling. The uniqueness of the model is reflected in handling outliers. In the proposed model, the outliers from each window are stored, and representatives of these outliers are added to the next window to identify new classes. Thus we handle the concept evolution. In streaming data analysis, another barrier is the lack of datasets for the training. In order to test the proposed model, we created a synthetic dataset with 22020 data instances. Based on the evaluation of the Silhouette Coefficient, Calinski-Harabasz Index, and Davies-Bouldin Index, the model performed very well. Due to the high velocity and volume of real-world data, the proposed model faces some challenges while analyzing it. In the future, we will model a scalable and distributed clustering algorithm for handling real-world streaming data.

Declaration
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript. Moreover, there is no ethical clearance required for either the data or the procedures.

Ethical Approval
The manuscript is not published elsewhere or submitted to more than one journal for simultaneous consideration.

Availability of supporting data
The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request.