Monitoring, also known as computer surveillance or supervision, is the process of checking and verifying the activity of a computer system in order to obtain relevant information [6].
Many terms have been used in the literature to refer to the observation and evaluation of computer technologies, such as technological surveillance [7], a term used notably by AIRI (Associazione Italiana per la Ricerca Industriale) [8] and EIRMA (European Industrial Research Management Association) [9], technological intelligence [10], [11], technological forecasting [12], [13] and technology assessment [14], [15].
IT systems monitoring is one of the daily activities of an IT team: it allows supervision, analysis, and management by acting directly after receiving alerts that warn of potential failures. Monitoring meets the following four objectives [16]:
- Help with decision-making to achieve the expected results.
- Document the supervision project to feed the learning, communication and even advocacy processes.
- Report to the actors concerned.
- Contribute to strengthening the skills of the actors involved.
By collecting information from the network, monitoring helps IT teams gain a better understanding of the behavior of employees within the same network and study problems such as the misuse or overload of IT resources.
Many existing tools aim to evaluate system usage and monitor performance measures, such as Munin [17], OpenNMS [18], Nagios [19], Cacti [20], Zabbix [21] and many others. These tools generally plot graphs of system performance measurements collected using the Simple Network Management Protocol (SNMP) [22].
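To make this concrete, the sketch below shows the kind of SNMP poll such tools perform. It is a minimal example, assuming the classic pysnmp hlapi interface, a hypothetical device address and the default "public" read-only community string.

```python
# Minimal SNMP poll sketch (assumes pysnmp's classic hlapi; host address and
# community string are illustrative placeholders).
from pysnmp.hlapi import (
    getCmd, SnmpEngine, CommunityData, UdpTransportTarget,
    ContextData, ObjectType, ObjectIdentity,
)

def poll_sysdescr(host: str, community: str = "public") -> str:
    """Fetch sysDescr.0 (the device description) from an SNMP agent."""
    error_indication, error_status, _, var_binds = next(
        getCmd(
            SnmpEngine(),
            CommunityData(community, mpModel=1),               # SNMPv2c
            UdpTransportTarget((host, 161), timeout=2, retries=1),
            ContextData(),
            ObjectType(ObjectIdentity("SNMPv2-MIB", "sysDescr", 0)),
        )
    )
    if error_indication or error_status:
        raise RuntimeError(str(error_indication or error_status))
    return str(var_binds[0][1])

# Example (hypothetical address): print(poll_sysdescr("192.0.2.10"))
```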
To identify the network model, we need to follow a process of five layers corresponding to the logical monitoring functions shown in Fig. 1: collection, representation, reporting, analysis, and presentation.
Earlier works have focused on improving the efficiency of this monitoring process by handling the large amounts of data to be collected, analysed, and stored [23], [24], [25]. Even if the critical limits induced by the monitoring process, namely processing power, bandwidth overload and storage capacity, can be overcome, a major challenge remains: identifying important events among vast quantities of measurement data, visualizing them, and interpreting them to make decisions and detect failures [26]. In this context, a large body of previous work on surveillance [27], [28], [29], [30], [31], [32] has focused in particular on the analysis layer. According to Sihyung Lee et al. [5], high-level interpretations of derived measurement data are used to decide whether to alert the operator, record the reported measurements or reconfigure the network.
Although much work addresses issues in more than one layer, this logical separation of the different monitoring functions clarifies their roles and interactions. In our approach, we focus first on the "Collection" layer, to solve the management problems posed by the large quantities of recorded and analysed data, and then on the "Analysis" layer, to allow IT teams to predict failures in their terminals instead of merely displaying them in a graph. Our approach is based on the principles of big data and on the use of Multi-Agent Systems.
In what follows, we analyse all the surveillance operations based on the five levels of the surveillance process illustrated by Sihyung Lee et al. [5] in Fig. 1, then present a sub-classification to highlight the different logical functions of surveillance. We then analyse the functions of each layer in detail and present two issues that span all the monitoring operations.
2.1 Collection layer
This layer represents the collection of measurement data from a computer network. Collection can be done actively or passively [34]. Active monitoring can directly measure what users want to observe without waiting for a particular event to occur; however, it should minimize its impact on normal traffic and on the continued operation of the network. The size and frequency of the active poll are the two parameters that determine the impact of monitoring on the operation of a computer network [35].
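As a rough illustration of these two parameters, the sketch below implements a simple active probe in which the payload size and the polling interval are configurable; the target host, port and values used are illustrative assumptions.

```python
# Active-probe sketch: payload size and polling interval are the two knobs
# that determine how much the measurement itself disturbs the network.
import socket
import time

def active_probe(host: str, port: int = 80, payload_size: int = 64,
                 interval: float = 30.0, rounds: int = 3) -> list:
    """Measure connect-and-send latency; larger payloads and shorter
    intervals increase the probe's impact on normal traffic."""
    latencies = []
    for i in range(rounds):
        start = time.monotonic()
        with socket.create_connection((host, port), timeout=2) as sock:
            sock.sendall(b"\x00" * payload_size)   # probe payload of configurable size
        latencies.append(time.monotonic() - start)
        if i < rounds - 1:
            time.sleep(interval)                   # polling frequency
    return latencies

# Example (hypothetical target): active_probe("192.0.2.10", payload_size=128, interval=60)
```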
Passive monitoring runs on dedicated devices or is performed directly by network equipment such as routers and switches. Passive surveillance is non-intrusive and affects network behavior less than active surveillance; in addition, it observes the actual behavior of the network.
Work on active and passive monitoring raises several problems, such as the latency in obtaining monitoring information, the rate of loss of data received through packets, and the storage capacity [36], [37], which must be large enough not to block the monitoring process or reduce the performance of the network. Other works discuss bottlenecks [38], [39], which limit the operation of the process when the processing of the collected monitoring data becomes oversized.
To solve these performance issues, several studies suggest combining active and passive surveillance [40], while others suggest adding new functions to existing routers to make active monitoring easier and more accurate:
- R. Kompella et al. [41], S. Machiraju [42] and P. Papageorge [43] suggest that routers should transmit measurement packets with an adjustable, configurable priority.
- C. Estan [43] and E.A. Hernandez [44] propose sampling, which reduces the overhead on monitoring nodes and enables efficient monitoring; however, the unsampled information is lost.
As a result, most of the work on this layer focuses on improving the accuracy of the collected measurements, but does not address the problems that can arise from the filtering and processing of these data.
2.2 Representation layer
The role of this layer is to give a uniform form to the information sent by the network equipment. It ensures that application-layer information (in the sense of the OSI model) in transit is readable by the application layer of another system. The representation layer manages three aspects: the format of the data (e.g. PostScript, binary or ASCII), compatibility with the host's operating system, and the encapsulation of the data for transmission over the network.
According to R. Kompella [45], there are three important problems related to the "Representation" layer:
- First, the representation of collected metrics should be normalized to ensure that each analysis function uses the same form and to prevent these metrics from being converted into separate intermediate representations. In this perspective, several works express the use of these standards through SNMP MIBs, CIM, IPFIX (IP Flow Information Export) and YANG monitoring tools.
- The second problem consists in synchronizing in time the measurements coming from heterogeneous and distributed collection devices [46].
- The last problem concerns the representation of measures, which must be concise to save storage space and network bandwidth [47]. This is achieved by identifying and removing duplicate measures and by combining values to generate derived values.
The data encoded in this layer is represented as character strings, integers, and floating-point numbers; different machines handle it differently depending on the encoding methods they follow.
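A minimal sketch of such a normalized representation is given below; the field names and the duplicate-detection key are illustrative choices, not a standard schema.

```python
# Normalization and de-duplication sketch (illustrative record format).
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricRecord:
    source: str        # device or agent that produced the measurement
    metric: str        # e.g. "cpu_load" or "ifInOctets"
    timestamp: int     # epoch seconds, synchronized across collectors
    value: float

def deduplicate(records):
    """Keep only the first occurrence of each (source, metric, timestamp)."""
    seen, unique = set(), []
    for rec in records:
        key = (rec.source, rec.metric, rec.timestamp)
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```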
2.3 Reporting layer
The reporting layer consists of producing a measurement report of the data collected across a computer network. It provides standard methods for signing flow records and verifying them [48]. It also specifies an encapsulation format that can be used for encryption as well as for digital signatures.
Reports are triggered either:
- Periodically, with a predefined frequency, generally of the order of a few minutes. This periodic method decreases the number of measurements transferred, as well as the frequency of data requests.
- Only when a predefined event occurs.
Periodic polling is less efficient, but it allows the identification of new, undefined events. Event-triggered polling, on the other hand, is more efficient, but the events to be reported must be carefully defined, because undefined events go unnoticed. These two methods therefore present a trade-off between the efficiency and the precision of monitoring. Reporting must also be efficient in terms of the bandwidth consumed when transferring measurement data to the supervision equipment.
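The sketch below contrasts the two triggering modes; the thresholds, periods and callback names are illustrative assumptions.

```python
# Reporting-trigger sketch: periodic reports versus event-triggered reports.
import time

def periodic_report(read_metric, send, period=300.0, rounds=3):
    """Send a report every `period` seconds, whatever the observed value."""
    for _ in range(rounds):
        send(read_metric())
        time.sleep(period)

def event_triggered_report(read_metric, send, threshold, check_every=10.0, rounds=3):
    """Send a report only when the metric crosses a predefined threshold;
    events that were not defined in advance go unreported."""
    for _ in range(rounds):
        value = read_metric()
        if value > threshold:
            send(value)
        time.sleep(check_every)
```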
Several recommendations are found in the literature to improve the efficiency of reports, by reducing the number of measurements:
- Reporting and elimination of redundant measures [49].
- Aggregation of measures with similar characteristics [50] (see the sketch after this list).
- Consolidation of identical polling requests from different applications into a single request [51].
- Use of bandwidth-saving measurement encodings: IPFIX [52] defines such encodings.
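As an example of the second recommendation, the sketch below averages measurements that share the same source and metric; the grouping key and the use of the mean are illustrative choices.

```python
# Aggregation sketch: combine measures with similar characteristics.
from collections import defaultdict
from statistics import mean

def aggregate(samples):
    """samples: iterable of (source, metric, value) tuples.
    Returns one averaged value per (source, metric) pair."""
    groups = defaultdict(list)
    for source, metric, value in samples:
        groups[(source, metric)].append(value)
    return {key: mean(values) for key, values in groups.items()}

# aggregate([("router1", "cpu_load", 0.42), ("router1", "cpu_load", 0.58)])
# -> {("router1", "cpu_load"): 0.5}
```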
Despite the various research efforts at this layer, reporting remains a problem when the data volume is very large. Our approach refines the reporting phase to provide more efficiency and more precision at the same time.
2.4 Analysis layer
This layer is the most important in our research: it consists in analysing the data collected by the preceding layers, because each network problem has its own characteristics and is impacted by many factors such as CPU load, hard-disk temperature, ping, RAM occupation and response time.
In this layer, we also find the characteristics specific to network problems and the measurement intervals of the data, which makes it possible to fully exploit the potential of the analysed data.
Operations classified in this layer extract high-level interpretations of the network condition by analysing the collected metric data. In general, the collected measurements give a result that is (see the sketch after this list):
- Boolean (yes/no), to indicate whether a component is available or not.
- Numeric, e.g. a response time.
- Qualitative, obtained while executing a query, for example the state of the network at a specific time (good, medium, or poor).
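The sketch below shows how a single raw measurement can yield these three kinds of output; the thresholds are illustrative assumptions.

```python
# Analysis-output sketch: boolean, numeric and qualitative interpretations
# derived from one response-time measurement (thresholds are illustrative).
from typing import Optional

def interpret_response_time(response_ms: Optional[float]) -> dict:
    available = response_ms is not None            # boolean: component up or not
    if not available or response_ms >= 500:
        state = "poor"
    elif response_ms < 100:
        state = "good"
    else:
        state = "medium"
    return {
        "available": available,      # yes/no
        "response_ms": response_ms,  # numeric
        "state": state,              # qualitative: good / medium / poor
    }

# interpret_response_time(230) -> {"available": True, "response_ms": 230, "state": "medium"}
```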
Although a variety of analysis functions are identified in [53], [54] and [55], we present here the six most common ones:
- General-purpose traffic analysis.
- Estimation of traffic demand.
- Classification of traffic by application.
- Extraction of communication models.
- Fault management.
- Automatic update of network documentation.
The results of these analysis functions are widely used in the design of configuration changes. In our approach, we focus on the application of machine learning to fault management; several research works have been carried out in this direction, which we cite in Section 5.
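As a rough illustration of this direction (not our final method), the sketch below trains an anomaly detector on synthetic historical metrics and flags deviating observations; it assumes scikit-learn is available, and the features and data are invented.

```python
# Fault-detection sketch: learn normal terminal behaviour from past metrics,
# then flag deviations (synthetic data, scikit-learn's IsolationForest).
import numpy as np
from sklearn.ensemble import IsolationForest

# Historical samples: [cpu_load, ram_usage, response_time_ms] per observation.
rng = np.random.default_rng(0)
normal_history = rng.normal(loc=[0.3, 0.5, 120.0], scale=[0.05, 0.1, 20.0], size=(500, 3))

model = IsolationForest(contamination=0.01, random_state=0).fit(normal_history)

# New observations: the second one overloads CPU, RAM and response time.
new_samples = np.array([[0.32, 0.55, 115.0],
                        [0.95, 0.90, 900.0]])
print(model.predict(new_samples))   # 1 = normal, -1 = suspected fault
```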
2.5 Presentation layer
The presentation layer is responsible for presenting the collected data to operators: it is easier to monitor a network through visual representations than through raw numerical data. The Multi Router Traffic Grapher (MRTG) [56] and the SNMP Network Analysis and Presentation Package (SNAPP) [57] provide visual representations of SNMP data, such as bandwidth usage, the physical state of a machine (CPU load, RAM, etc.), application availability or known attacks on a firewall.
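A minimal sketch of such a visual representation is given below; the traffic values are invented and matplotlib is assumed to be installed.

```python
# Visualization sketch: plot one hour of 5-minute bandwidth polls
# (illustrative values, in the spirit of an MRTG-style graph).
import matplotlib.pyplot as plt

minutes = list(range(0, 60, 5))
mbps_in = [12, 14, 13, 30, 55, 60, 58, 35, 20, 15, 14, 13]

plt.figure(figsize=(8, 3))
plt.plot(minutes, mbps_in, marker="o", label="inbound traffic")
plt.xlabel("minutes")
plt.ylabel("Mbit/s")
plt.title("Bandwidth usage (illustrative data)")
plt.legend()
plt.tight_layout()
plt.savefig("bandwidth.png")
```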
However, visualizing typically large volumes of metric data proves to be a challenge, because a good visualization tool must be able to deduce the importance and relevance of the information and represent it visually in an intuitive way. Summarizing and presenting the important events among vast amounts of measurement data will therefore become even more important, as Internet usage is increasing rapidly and more and more data is being collected, especially with the emergence of new intelligent sensors and the use of IoT. To overcome this difficulty, SNAPP [58] identifies correlated events and presents them in the form of groups, so that operators can judge the relevance of alerts efficiently through a simpler presentation.