Samples should cover as much of the variety in the data as possible, avoid bias, introduce only an acceptable degree of redundancy, and reduce noise, all at the lowest possible computational cost. In this approach, we employ the DBSCAN clustering algorithm to identify dense areas within the dataset, aided by the proposed Jump mechanism and the Adjusted Rand Index (ARI). Our primary objective is not to establish clusters but to identify representative samples. DBSCAN is chosen for its ability to discern density-based structures within the data, which lets us pinpoint regions of significance. ARI, a measure of clustering similarity, lets us evaluate how effectively the approach identifies these dense areas and selects suitable samples from them. Although ARI itself does not impose significant computational overhead, the overall computational cost of the approach may vary with the size and dimensionality of the dataset and with the minPoints parameter of DBSCAN. The efficacy of the approach also hinges on the suitability of DBSCAN and ARI for the specific dataset and objectives.
In the proposed model, the Jump is a critical component: it keeps the probability of finding new minor clusters high because it does not neglect sparse areas. The Jump mechanism initially divides the dataset into two segments, visited and unvisited. The visited segment is further divided into promising areas and weak or sparse areas, based on the number of clusters created and the number of noise points, both of which the model tracks. The jump collects a fixed number of records in every iteration. Fig. 1 illustrates the flowchart of the jump, while Fig. 2 depicts the model.
2.1 Algorithm for jump mechanism
- The initial jump is randomly made within the range of the dataset with a specific jump size.
- If the data points are good (resulting in many new clusters), this area is considered promising, and the next jump is made next to it.
- Jumping continues in this area until it is fully utilized, meaning the returned data is similar to the data already collected or contains too much noise.
- A new jump is executed to a far unvisited area. The model controls the jump, utilizing DBSCAN and ARI to determine whether to accept, exclude, or replace the sample.
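The steps above can be sketched in Python as a block-selection rule. This is a minimal sketch under assumptions that are not stated in the text: the dataset is addressed as consecutive blocks of `jump_size` rows, "next to" means the neighbouring block index, and "far" means the unvisited block whose nearest visited block is farthest away. The names `first_block`, `adjacent_block`, `far_block`, `next_block`, and `block_rows` are hypothetical.

```python
import random


def first_block(n_blocks, rng=random):
    """Initial jump: a random block anywhere in the dataset."""
    return rng.randrange(n_blocks)


def adjacent_block(last, visited, n_blocks):
    """Unvisited neighbour of the last block, or None if both sides are used."""
    for candidate in (last + 1, last - 1):
        if 0 <= candidate < n_blocks and candidate not in visited:
            return candidate
    return None


def far_block(visited, n_blocks):
    """Unvisited block farthest from every visited block (assumed meaning of 'far')."""
    unvisited = [b for b in range(n_blocks) if b not in visited]
    if not unvisited:
        return None          # whole dataset has been visited
    if not visited:
        return unvisited[0]  # nothing visited yet, any block will do
    return max(unvisited, key=lambda b: min(abs(b - v) for v in visited))


def next_block(last, visited, n_blocks, promising):
    """Stay next to a promising area; otherwise jump to a far unvisited one."""
    if promising:
        block = adjacent_block(last, visited, n_blocks)
        if block is not None:
            return block
    return far_block(visited, n_blocks)


def block_rows(block, jump_size, n_rows):
    """Row indices covered by one block of fixed jump size."""
    return range(block * jump_size, min(n_rows, (block + 1) * jump_size))
```

Whether an area counts as promising would be decided from the clustering result of the last jump, for example many new clusters and little noise.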
We cluster the data returned by each jump to identify clusters and noise points, which drive the jump decisions and enable comparison between clusters using ARI. Because each jump returns only a limited amount of data, clustering is efficient. The effectiveness of DBSCAN nevertheless depends on two key parameters, minPoints and eps, which we specify manually in this study.
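To make the role of minPoints and eps concrete, the following is a minimal pure-Python sketch of DBSCAN applied to the records returned by one jump. It is an illustration only, not the implementation used in this study; the Euclidean distance, the function names, and the toy batch are assumptions. In practice a library implementation would normally be preferred.

```python
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_points):
    """Return one label per point; NOISE (-1) marks noise points."""
    labels = [None] * len(points)  # None means not yet visited
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point of the current cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return labels


def summarize(labels):
    """Number of clusters and number of noise points in one jump."""
    return len(set(labels) - {NOISE}), labels.count(NOISE)


# Toy jump batch (made-up data): two dense groups and one isolated point.
batch = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(batch, eps=1.5, min_points=3)
n_clusters, n_noise = summarize(labels)
```

The two counts returned by `summarize` correspond to the quantities the jump relies on: the number of clusters created and the number of noise points.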
We assess the output of each jump using ARI against all previously accepted samples. Clusters meeting our criteria are accepted and added to the array of accepted clusters. Conversely, if the ARI falls below a certain threshold, the cluster is excluded. Excluded clusters are then compared with the most relevant clusters from the accepted set. If an excluded cluster demonstrates superior characteristics, it replaces the relevant accepted cluster, which is moved to the excluded array. This process continues until no new clusters are found, which ensures that we obtain the optimal representatives of the large dataset. This is the stopping criterion, so there is no need to specify a sample size.
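For reference, ARI can be computed from the contingency counts of two labelings, as sketched here in Python. ARI is defined for two labelings of the same set of points; how a new jump is aligned with the previously accepted samples is not specified in the text, so the sketch shows only the index itself. Treating the DBSCAN noise label as an ordinary label and returning 1.0 in degenerate cases are assumptions of this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same points."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same points")
    n = len(labels_a)
    if n < 2:
        return 1.0  # degenerate case: no pairs to compare

    # Pairs of points that share a cluster in both, in A only, or in B only.
    sum_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_both - expected) / (max_index - expected)


same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])     # identical up to renaming
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # worse than chance
```

A value of 1 means the two labelings agree up to a renaming of the clusters, values near 0 correspond to chance-level agreement, and negative values indicate agreement below chance.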
Fig. 1 Jump Algorithm Flow Chart
2.2 Flowchart and proposed model
1 Start
2 Supply minPoints, eps, and the jump size
3 Execute the initial jump, cluster it, and add it to the accepted clusters, since there is nothing yet to compare it to
4 Execute the next jump, cluster it, and compare the result with the accepted clusters for similarity using the Adjusted Rand Index (ARI). If the ARI is less than 0, the cluster is not accepted.
5 If the ARI is greater than or equal to 0, the cluster is checked further against the most similar accepted cluster to see whether it offers better sampling.
5.1 If the current cluster (otherwise destined for exclusion) offers better clusters, it replaces the weaker, already accepted cluster. The replaced cluster is then added to the rejected clusters.
5.2 If the current cluster is not better than the most similar accepted cluster, it is added to the rejected clusters.
6 Check whether the exit limit has been reached.
6.1 If not, execute another jump (step 4).
6.2 If yes, stop.
7 Stop.
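The bookkeeping behind these steps can be sketched in Python as follows. This is only a sketch: the ARI comparison and the "better cluster" test are passed in as a hypothetical `decide_fn`, because their exact form is not fixed by the steps, and the exit limit is assumed to be a number of consecutive jumps that contribute nothing new. The names `run_sampling`, `cluster_fn`, and `decide_fn` are illustrative.

```python
def run_sampling(jumps, cluster_fn, decide_fn, exit_limit):
    """Maintain the accepted and rejected arrays over a sequence of jumps.

    jumps      -- iterable of jump batches (records collected by each jump)
    cluster_fn -- clusters one batch, e.g. with DBSCAN (assumed callable)
    decide_fn  -- returns (action, index); action is "accept", "replace",
                  or "reject", index is the accepted entry to replace
    exit_limit -- assumed: consecutive jumps without anything new before stopping
    """
    accepted, rejected = [], []
    stale = 0
    for batch in jumps:
        candidate = cluster_fn(batch)
        if not accepted:
            accepted.append(candidate)  # step 3: nothing to compare against yet
            continue
        action, index = decide_fn(candidate, accepted)
        if action == "accept":
            accepted.append(candidate)
            stale = 0
        elif action == "replace":
            rejected.append(accepted[index])  # step 5.1: weaker cluster moves out
            accepted[index] = candidate
            stale = 0
        else:
            rejected.append(candidate)        # step 4 or 5.2: candidate rejected
            stale += 1
        if stale >= exit_limit:               # step 6: exit limit reached
            break
    return accepted, rejected


# Toy run with stand-in functions: a "cluster" is just a number here.
def toy_decide(candidate, accepted):
    return ("reject", None) if candidate in accepted else ("accept", None)


accepted, rejected = run_sampling(
    [[1], [2], [3], [3], [3], [9]], lambda batch: batch[0], toy_decide, exit_limit=2
)
```

In the toy run the loop stops after two consecutive rejected jumps, so the final batch is never visited and no sample size has to be fixed in advance.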