Proposing an algorithm for real-time anomaly detection in large amounts of data is not an easy task, in part because many researchers set out to show that their own solution is better. In earlier research [1] we proposed an algorithm for detecting anomalies in large real-time data. There we tested the accuracy of the algorithm by comparing it with several algorithms singled out from previous research, such as ARIMA [10], Moving Average [11] and Holt-Winters [12].
The purpose of this research is to test the performance of the proposed algorithm, HW-GA, and to compare it with other algorithms used for finding anomalies in large amounts of data. The comparison covers algorithms such as HW-GA, ARIMA, Moving Average and Holt-Winters.
The analysis of large amounts of data is a topical field because the amount of data increases every day. When dealing with large amounts of data, it should be kept in mind that they are described by three characteristics: volume, velocity and variety. Hence the need for performance testing, in order to meet the velocity (speed) characteristic of large amounts of data. It is a vast field of research because it involves algorithms from different disciplines. To select an algorithm, it is first important to specify the data to be analyzed, since the choice of algorithm depends on the data.
The study uses comparative methods to draw conclusions about the relative performance of the algorithms. Experiments, statistical analysis and visualization were carried out in R, a free software environment for statistical computing and graphics.
Both benchmark and real-time data are used to test the algorithm: the Numenta Anomaly Benchmark (NAB) [3] dataset, and real-time data from the e-dnevnik application, which is the electronic education system in North Macedonia.
Related Work
In our previous research [2] we compared many algorithms, such as MAD, RunMAD, Boxplot, Twitter ADVec, DBSCAN, Moving Range Technique, Statistical Control Chart Techniques, ARIMA and Moving Average, to find which one is fastest. The most important aspects we considered, in order to find an anomaly detection algorithm suitable for future implementation in an online environment, were the execution time (complexity), the CPU usage and the quality of the algorithm, measured through the numbers of true positive (TP), false positive (FP), false negative (FN) and true negative (TN) anomalies found.
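The evaluation criteria above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the experiments themselves were run in R), and the detector, its `window` and `threshold` parameters, and the toy series are assumptions introduced here, not the tuned algorithms compared in [2]. The sketch measures CPU time with `time.process_time` and counts TP, FP, FN and TN against known labels.

```python
import time


def moving_average_flags(values, window=5, threshold=3.0):
    """Flag points that deviate from the mean of the preceding window.

    Hypothetical stand-in detector; window and threshold are illustrative.
    """
    flags = [False] * len(values)
    for i in range(window, len(values)):
        mean = sum(values[i - window:i]) / window
        flags[i] = abs(values[i] - mean) > threshold
    return flags


def confusion_counts(predicted, actual):
    """Count TP, FP, FN and TN for boolean predictions against boolean labels."""
    pairs = list(zip(predicted, actual))
    return {
        "TP": sum(1 for p, a in pairs if p and a),
        "FP": sum(1 for p, a in pairs if p and not a),
        "FN": sum(1 for p, a in pairs if not p and a),
        "TN": sum(1 for p, a in pairs if not p and not a),
    }


def timed(detector, values):
    """Run a detector and return its flags together with the CPU seconds used."""
    start = time.process_time()
    flags = detector(values)
    return flags, time.process_time() - start


# Toy series: constant signal with one labeled spike at index 12.
values = [10.0] * 20
values[12] = 20.0
labels = [False] * 20
labels[12] = True

flags, cpu_seconds = timed(moving_average_flags, values)
counts = confusion_counts(flags, labels)
print(counts, cpu_seconds)
```

On a series this small the CPU time is close to zero; a meaningful speed comparison would repeat the run many times on realistic data volumes.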
The best algorithms selected from that research, ARIMA and Moving Average, are compared here with our proposed algorithm [1] and with Holt-Winters. We tested the correctness of our algorithm in our previous research [1]; in this research we test its performance, that is, its speed and CPU usage.
The authors of [3] propose the Numenta Anomaly Benchmark (NAB), which is used in our research. NAB attempts to provide a controlled and repeatable environment of open-source tools to test and measure anomaly detection algorithms on streaming data. The perfect detector would detect all anomalies as soon as possible, trigger no false alarms, work with real-world time-series data across a variety of domains, and automatically adapt to changing statistics.
The authors of [4] propose an online and unsupervised anomaly detection algorithm for streaming data that uses an array of sliding windows and probability density-based descriptors (PDDs) built on these windows. The algorithm consists of three main steps: 1) a main sliding window is moved over the streaming data and segmented into an array of non-overlapping subwindows; 2) PDDs with dimension reduction, based on kernel density estimation, are used to estimate the probability density of the data in each subwindow; and 3) a distance-based anomaly detection rule determines whether the current observation is anomalous. The experimental results and performance are reported on the Numenta Anomaly Benchmark. Compared with the anomaly detection algorithm based on hierarchical temporal memory proposed by Numenta (which outperforms a wide range of other anomaly detection algorithms), the authors report that their algorithm performs better in many cases, that is, with higher detection rates and earlier detection of contextual anomalies and concept drifts.
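To make the three steps concrete, the following is a heavily simplified sketch in Python and not the algorithm of [4]. It omits the dimension reduction, scores the newest subwindow rather than a single observation, and all function names, the evaluation grid and the bandwidth are assumptions made for illustration. Each subwindow is described by a Gaussian kernel density estimate on a fixed grid, and the newest descriptor is compared with the mean of the earlier ones by Euclidean distance.

```python
import math


def kde_descriptor(samples, grid, bandwidth=1.0):
    """Gaussian kernel density estimate of `samples`, evaluated at each grid point."""
    norm = len(samples) * bandwidth * math.sqrt(2 * math.pi)
    return [
        sum(math.exp(-0.5 * ((g - x) / bandwidth) ** 2) for x in samples) / norm
        for g in grid
    ]


def descriptor_distance(a, b):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def newest_subwindow_distance(window, sub_size, grid, bandwidth=1.0):
    """Distance of the newest subwindow's descriptor from the mean earlier descriptor.

    The main window is cut into non-overlapping subwindows of `sub_size`
    points; at least two full subwindows are required.
    """
    subwindows = [
        window[i:i + sub_size]
        for i in range(0, len(window) - sub_size + 1, sub_size)
    ]
    descriptors = [kde_descriptor(s, grid, bandwidth) for s in subwindows]
    history, newest = descriptors[:-1], descriptors[-1]
    mean = [sum(column) / len(history) for column in zip(*history)]
    return descriptor_distance(newest, mean)


grid = [float(g) for g in range(0, 21)]           # assumed value range 0..20
normal_window = [10.0, 11.0, 9.0, 10.0] * 4       # four similar subwindows
shifted_window = [10.0, 11.0, 9.0, 10.0] * 3 + [17.0, 18.0, 16.0, 17.0]

normal_distance = newest_subwindow_distance(normal_window, 4, grid)
shifted_distance = newest_subwindow_distance(shifted_window, 4, grid)
print(normal_distance, shifted_distance)
```

A detection rule would then flag the newest subwindow when its distance exceeds a threshold; how that threshold is chosen is left open in this sketch.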
Martin et al. [5] investigate to what extent sequence-based Markov models can be used for anomaly detection by means of the end users' control sequences in video streams, i.e., event sequences such as play, pause, resume and stop. The approach is investigated over three different temporal resolutions in the data, namely 1 hour, 1 day and 3 days. It supports anomaly detection in ongoing streaming sessions, since it recalculates the probability that a specific session is anomalous for each new streaming control event that is received. Two experiments are used to measure the potential of the approach, which gives promising results in terms of precision, recall, F1-score and Jaccard index when compared to k-means clustering of the sessions.
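The idea of rescoring a session with every new control event can be sketched with a first-order Markov chain. This is an illustrative simplification in Python, not the model of [5]: the training sessions, the additive smoothing and the scoring by average negative log-likelihood are assumptions introduced here, and every event in a scored session is assumed to occur in the training data.

```python
import math
from collections import defaultdict


def fit_transitions(sessions, smoothing=1.0):
    """Estimate smoothed first-order transition probabilities between events."""
    events = sorted({event for session in sessions for event in session})
    counts = defaultdict(lambda: defaultdict(float))
    for session in sessions:
        for prev, nxt in zip(session, session[1:]):
            counts[prev][nxt] += 1.0
    probs = {}
    for prev in events:
        total = sum(counts[prev].values()) + smoothing * len(events)
        probs[prev] = {
            nxt: (counts[prev][nxt] + smoothing) / total for nxt in events
        }
    return probs


def running_scores(session, probs):
    """Average negative log-likelihood per transition, updated after each new event.

    Higher values mean the session so far is less likely under the model.
    """
    scores = []
    total = 0.0
    for n, (prev, nxt) in enumerate(zip(session, session[1:]), start=1):
        total += -math.log(probs[prev][nxt])
        scores.append(total / n)
    return scores


# Hypothetical training data: mostly play, pause, resume, stop sessions.
training = [["play", "pause", "resume", "stop"]] * 8 + [["play", "stop"]] * 2
probs = fit_transitions(training)

typical = running_scores(["play", "pause", "resume", "stop"], probs)
unusual = running_scores(["play", "pause", "pause", "pause", "stop"], probs)
print(typical[-1], unusual[-1])
```

Because the score is recomputed after every transition, a session can be flagged while it is still in progress, for example once the running score crosses a chosen threshold.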
Filipe Falcão et al. [6] experimentally evaluate a pool of twelve unsupervised anomaly detection algorithms on five attack datasets. The results allow them to elaborate on a wide range of arguments, from the behavior of the individual algorithms to the suitability of the datasets for anomaly detection. They identify the families of algorithms that are more effective for intrusion detection, and the families that are more robust to the choice of configuration parameters. Further, they confirm experimentally that attacks with unstable and non-repeatable behavior are more difficult to detect, and that datasets where anomalies are rare events usually result in better detection scores.
Huihui Zhu et al. [7] propose a real-time anomaly detection framework with low computational complexity and high efficiency. They introduce the Histogram of Magnitude Optical Flow (HMOF), which captures the motion of video patches. They show that HMOF is more sensitive to motion magnitude and more efficient at distinguishing anomaly information. Experimentally, they show that the framework outperforms state-of-the-art methods and can reliably detect anomalies in real time.
Based on the existing research summarized above, we have identified a research gap related to the analysis of the speed of the anomaly detection algorithm HW-GA.