Development of a method for processing log files using clustering

A log file is a document that keeps track of all events that occur on a website or server. Many log files are very large, so outdated content may be regularly overwritten, or entire collections of log files with names that include, for example, a date may be created. In the event of technical problems, site inaccessibility, virus infection, hacker attacks and Distributed Denial of Service (DDoS) attacks, the resource administrator can use the information in the log to find the cause, which makes it easier and faster to eliminate unwanted incidents. The paper analyzes the definition, types, location, use and examples of log files. Data are transferred to the MySQL database using the Squid-db package; since the Squid proxy server is a caching proxy server, it stores resources, so subsequent requests are served quickly. Clustering is then performed on the table of data transferred to MySQL via the Squid proxy, with unnecessary entries deleted from the table beforehand, which significantly speeds up data processing. The application of the clustering method to the stated problem is fast and simple. For the problem stated, the study analyzes distance metrics and determines the degree of proximity of clusters, and of objects within clusters, in Euclidean space. Experiments are conducted, and the results are satisfactory.


Introduction
Problem statement. The goal of this study is to transfer text-based log files to a MySQL database using the Squid proxy software package and to apply clustering to the result. Three sections are used to solve the stated problem.
1) Definition and application of log files, with a review of the literature in this field; 2) Transfer of text log files to the MySQL database via the Squid proxy package; 3) Application of the clustering method to the data obtained.
Definition of logs. The function of logs is to record the operations performed on a machine so that an administrator can analyze certain tasks in the future. Regular browsing of logs allows collecting all the errors in the network (especially hidden ones), as well as collecting statistics and detecting malicious activity among site visits.
A log file is a document that keeps track of all events that occur on a website or server. The term comes from the English word ''log'', which denoted a kind of record of events as far back as the Middle Ages.
Since events are recorded in a journal, data collected for various purposes are stored and can be used in the future.
In case of technical problems, site inaccessibility, virus infection, hacker attacks and Distributed Denial of Service (DDoS) attacks, the resource administrator can use the information in the log to find the cause, which makes it easier and faster to eliminate unwanted events.
With the help of log information, an Internet marketer can study the behavior of site visitors, assess the quality of traffic, develop recommendations for improving that quality, promote the site, and choose the best strategies for it.
Types of logs. Since every major program installed on a server usually writes to the system log, each of these programs will have its own log. In particular, the following common logs can be distinguished (Kimberly and Wessels 1997):
• Main log file (general information: operations with the system, File Transfer Protocol (FTP), Domain Name System (DNS));
• System boot log (helps to diagnose the system if it does not load; stores key system events, for example, device errors);
• Web server logs (server calls, error information on the web server);
• Database server logs (database queries, server errors);
• Hosting panel logs (attempts to access the panel, license renewals, statistics on the use of server resources);
• Mail server logs (records of sent and delivered messages, mail server errors, reasons for mail rejections);
• Task scheduler (cron) logs (records of task execution, errors when launching cron jobs).
Cron is a classic service (a computer program on UNIX systems) used to perform tasks on a schedule. Recurring actions are described by instructions placed in crontab files and in special folders.
The location of the logs depends on the software or on the path specified by the administrator. That is, the availability and location of logs vary significantly depending on what software is used. As a rule, most server logs are stored under /var/log/, but some services may store logs in other directories.
For convenience, the appropriate paths can always be specified in the software configuration files. For example, the syslog/rsyslog daemon can be configured on a server; it organizes service logs and stores them in the defined log files.
Possible sources of events:
• Results of the activities of a particular user;
• Interrupts delivered to the program from a device;
• Events created by the programs themselves;
• Events caused by software errors;
• Events arising from the operating system or other software, as well as events from any other source.
The main subject of logging is a change of state in the running program. The simplest version of a log file is a text file consisting of lines. All information in log files is recorded in a special format that makes it possible to understand the causes of events.
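For illustration, a minimal Python sketch of how a program writes events to a log file in a fixed, parseable format; the file name and format string are illustrative, not taken from the paper:

```python
import logging

# Configure a log file and a line format: timestamp, severity, source, message.
logging.basicConfig(
    filename="app.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)

logger = logging.getLogger("myapp")
logger.info("user 'alice' logged in")
logger.error("database connection failed")
# The file then contains lines such as:
# 2024-01-15 10:32:07,123 INFO myapp: user 'alice' logged in
```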

About the application of log files
The scope of log files is wide. These files can be used to track the history of a software process, record the status of tools and machines, track user activity, support security, and so on.
As a rule, dedicated software is used to explore and analyze log files; it enables quick and visual study of the logged data about the operation of the software system.
Many log files are very large, so outdated content may be regularly overwritten, or entire collections of log files with names that include, for example, a date may be created. In many cases, databases are preferred over plain log files.
Here are some examples of how log files are used. If a program raises many unexpected exceptions, they can easily be written to a log file for subsequent error analysis. As another example, user login information in multi-user client-server systems can be recorded, which allows monitoring unauthorized operations (Grinkevich 2009).

Literature review

1) Logs collect valuable information during the execution of software systems. The rich knowledge provided by logs is widely used by researchers and practitioners during software development and operation to perform various tasks. In Hassani et al. (2018), an empirical study of log-related issues is performed on two large open-source software systems. It is found that log-related files are subject to statistically significantly more changes. The tool used can detect 75 inconsistent logging statements across 40 reported logging issues.
Modifications and error corrections of files with logging problems are statistically significantly more common. Logs are difficult to maintain without highly qualified professionals. Most defective logging statements go undetected for a long time (320 days on average); once problems are reported, they are resolved quickly (within an average of 5 days). The evidence obtained suggests the need for automated tools that can quickly detect logging problems. After the study was completed, seven main causes of logging problems were identified. Based on these causes, an automated tool was developed that detects four of these problem categories. With this tool, it is possible to detect 75 inconsistent logging statements in 40 existing logging issues. It is also reported that new issue types detected by the tool will be addressed in future work.
2) Alekseev et al. (2015) describe the application of a highly efficient system for processing and analyzing log documents for the PanDA infrastructure of the ATLAS experiment at the Large Hadron Collider (LHC), which is responsible for managing approximately 2 million daily workloads worldwide. The system is based on the ELK technology stack, which consists of several components; data collected in ElasticSearch (ES) can be viewed using the Kibana visualization tool. These components are integrated with the PanDA infrastructure and differ from previous log processing systems in terms of scalability. The authors provide examples of all the components for current tasks and of the configuration they use to demonstrate the benefits of a centralized log processing and storage service. 3) Teixeira et al. (2018) present a research approach for the analysis of an organization's log files. The goal of log file analysis is to provide information about how the client uses the program, to detect anomalies, and to show possible ways to correct them. The main goal of that study is to define an architecture for applying data analysis methods to search logs and acquire knowledge about programs. 4) Each Siemens Magnetic Resonance Imaging (MRI) system records events to log files while it is running. The log documents and their contents are constantly updated by the software providers, which results in different information content depending on the software version. Bayesian decision trees and Bayesian networks complemented with neural networks are compared, and the neural networks are estimated to provide better classification (Kuhner et al. 2017). 5) In Ribeiro et al. (2018), a method is presented to visualize the game patterns observed in the log data of a geometry game. Based on the principle of synchronized word trees, the analysis identifies five new behavior patterns that show how players approach the game, using VisCareTrails, a visual software system. The results provide convincing evidence of the usefulness of VisCareTrails for determining pedagogically significant player behavior from semi-structured data related to educational games.

AlterWind Log Analyzer functions
AlterWind Log Analyzer is a tool for analyzing web server log files and website traffic statistics. The log file analyzer allows creating all traditional reports and includes a number of unique features and unique reports. Depending on the purpose, one of three versions of AlterWind Log Analyzer can be chosen: AlterWind Log Analyzer Professional creates unique reports for website promotion and pay-per-click programs to optimize the website for search engines.
AlterWind Log Analyzer Standard identifies the interests of visitors and customers, analyzes the results of advertising campaigns, studies when visitors visit the website, and makes the website more convenient and interesting for customers.
Some additional features of AlterWind Log Analyzer:
• More than 430 search engines are available in the database of any edition of the log file analyzer;
• Log files of any format can be analyzed;
• Standard log file formats can be detected automatically;
• A large number of log files can be analyzed simultaneously;
• Log files created on different servers and in different formats can be processed together, which is very convenient for analyzing the traffic of related websites;
• Reports can be fully customized: the design of reports, the output data, and their volume can be changed.

Analysis of web logs
Any user working with programs on the Internet is constantly monitored; he/she is followed by many participants of the Global Web. There are several purposes for analyzing web logs. First of all, web logs can show what a visitor does on the site: where the user came from, what he/she wants to read, and where he/she goes. As a result, the route of this visitor can be traced.
If users' routes are studied and the most frequently used ones are selected, then their interests can be understood.
Both the set of operations performed by the user and the distribution of these actions over time are of interest, and it is these features that are mainly studied.
Types of web logs. Web logs differ. When per-request logging is enabled on the web server, a log entry is generated, in the configured log format, for each request-response pair.
The most common format is the Common Log Format (CLF). The information for each request covers seven fields. The fields and their values are presented in Table 1.
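As a sketch, the seven CLF fields (remote host, identity, authenticated user, timestamp, request line, status code, and response size) can be extracted in Python as follows; the example line follows the canonical CLF layout, and the regular expression is ours:

```python
import re

# One named group per CLF field.
CLF = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)

line = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
m = CLF.match(line)
if m:
    print(m.group("host"), m.group("status"), m.group("size"))
```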
Another web log format is the Extended Common Log Format (ECLF); unlike the simple format, it has no fixed list of fields, and their final number depends on the web server manufacturer.
For example, Microsoft's web server, Internet Information Services (IIS), in addition to recording the date and time of the user's request, allows about 20 other parameters to be recorded. The fields and their definitions are shown in Table 2.
Finding a web log. A web log, or web server journal, is a text file that contains information about each user request; one record corresponds to one request-response pair. The fields of each record are separated by a tab or a space, which makes it easier to view and analyze the file in Excel.
To find the path to the log file, open the web server manager (Administrative Tools / Internet Information Services (IIS) Manager), open the resource tree of the computer, and, with the cursor in the Web Sites area, open the Properties window.
The website's Properties dialog includes a logging section. The logging settings window is hidden behind the Properties button, and the path to the log file is shown under the General tab.
Obtaining useful information from web logs. Web logs are obviously interesting for the load testing of Internet applications. In order to perform such tests successfully, it is necessary to create a model of the typical load.
The load model can be characterized by the following parameters:
• Number of simultaneous users during the monitoring period;
• Most visited pages of the site;
• Typical behavior of system users;
• Pauses in user actions when moving from one page to another (or when using various functions of the system).
As mentioned above, a log file is a text file that can be opened in any text editor. Each field is separated by a space or tab. Excel can be used for its analysis.
First of all, the number of users accessing the application over a certain period of time is of interest. This information is very easy to obtain: it is enough to count the number of unique IP addresses registered by the system each day.
For example, suppose a log file contains records of 134 unique IP addresses from which requests were sent within an hour. This means that 134 different users accessed the application during this monitoring period. Note that a user can appear under two or more IP addresses when accessing the site. The main reason for this is the dynamic assignment of IP addresses by the user's provider: once the connection is interrupted, the user reconnects to the Internet under the same name and, at the same time, receives a new IP address.
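A minimal Python sketch of this count, assuming a whitespace-separated log in which the date is the first field and the client IP address the third (adjust the indices to the actual format):

```python
from collections import defaultdict

users_per_day = defaultdict(set)

with open("access.log") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 3:
            continue
        date, ip = fields[0], fields[2]
        users_per_day[date].add(ip)  # a set keeps each IP only once

for date, ips in sorted(users_per_day.items()):
    print(date, len(ips))  # unique IP addresses per day
```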
If the application supports user authentication, the log can instead be analyzed by the User column. This prevents the apparent growth in the number of users caused by different IP addresses.
Log file monitoring methods can be categorized as follows: 1. Fault detection; 2. Anomaly detection.
For fault detection, the expert creates a database of error message templates.
One of the important features, not often found in log file lines, is an incident. Incidents can be recognized in log file lines because the same incident type often corresponds to a particular line pattern. For example:
• Router myrouter1 interface 192.168.13.1 down;
• Router myrouter2 interface 10.10.10.12 down;
• Router myrouter5 interface 192.168.22.5 down.
This incident type corresponds to the line pattern ''Router * interface * down''. Line patterns can be identified by manually reviewing log files, but this is feasible only for small log files.
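The pattern above can be expressed, for example, as a regular expression; the following Python sketch maps any matching line to the single incident type ''interface down'':

```python
import re

pattern = re.compile(r"Router (\S+) interface (\S+) down")

lines = [
    "Router myrouter1 interface 192.168.13.1 down",
    "Router myrouter2 interface 10.10.10.12 down",
    "Router myrouter5 interface 192.168.22.5 down",
]
for line in lines:
    m = pattern.search(line)
    if m:
        # m.group(1) is the router name, m.group(2) the interface address.
        print("incident: interface down on", m.group(1), m.group(2))
```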
Transferring log files from text documents to various databases is one of the key issues, and the Squid proxy is one of the best software packages in this regard. In MySQL, there is a quick way to transfer data from plain text documents to the database via the LOAD DATA INFILE directive (it can be used freely, as the program is open source) (Aleksey 2008).
The design of the Squid web proxy and the URL log generator is shown in Fig. 1.
Squid proxy server is used to collect and manage network traffic log files. Squid proxy provides extensive options for compiling log files.
The second step is to discard all lines with a result code other than 200, because ''200 OK'' is the success code (Ma et al. 2020). With minor changes, this script can be adapted not only for Squid but also for other services that work with log files. An important condition: the fields in the LOAD DATA INFILE directive have to be specified exactly.
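A minimal sketch of these two steps using the mysql-connector-python driver. The table name squid_log, its columns, the credentials, and the file path are assumptions for illustration; the native Squid access.log pads fields with repeated spaces, so it may need to be normalized to single separators before loading, and LOAD DATA LOCAL INFILE must be enabled on the server:

```python
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="loguser", password="secret",
    database="squid_db", allow_local_infile=True,
)
cur = conn.cursor()

# Step 1: bulk-load the text log into MySQL with LOAD DATA INFILE.
cur.execute("""
    LOAD DATA LOCAL INFILE '/var/log/squid/access.log'
    INTO TABLE squid_log
    FIELDS TERMINATED BY ' '
    LINES TERMINATED BY '\\n'
    (time, duration, client_ip, result_code, bytes, method, url)
""")

# Step 2: discard every record whose result code is not a successful 200
# (Squid result codes look like TCP_MISS/200).
cur.execute("DELETE FROM squid_log WHERE result_code NOT LIKE '%/200'")

conn.commit()
cur.close()
conn.close()
```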
Advantages of this scenario include high performance (for example, about 4 million lines in 10-12 s on a machine with a 3.2 GHz CPU and 1024 MB of RAM) and the fact that each line can be identified uniquely by the last ''hash'' field (so data are not lost when MySQL fails).
An example of the Squid proxy setup is shown below.
As seen from the list, these steps are used to collect logs into the Squid-db database. The sequence is as follows:
1. The required dependencies are added to the Squid proxy server;
2. The scripts and configuration used to connect to the database are prepared;
3. Access rights are granted;
4. A configuration file is created to link the script with the database;
5. The database is created, the schema is imported, and the username is created;
6. Finally, the corresponding part of the Squid proxy configuration is edited.
In this study, a MySQL server is used; the log file is transferred from Squid-db to the MySQL database. Figure 2 presents the database structure.
By accessing the Live Notes HTTP proxy, all web traffic can be viewed in real time (Fig. 3) (Xu et al. 2020).
The clustering process is performed using this database (Fig. 4).

About clustering
One of the important issues is the study of the information needs of users in connection with the intensive development of Information and Communication Technologies (ICT), increasing data transmission speeds, and the expansion of communication channels.
Clustering is the grouping of the elements of a set by their similarity, so that the resulting groups do not intersect. The elements of a set may be data items, points, feature vectors, and so on. Each resulting group is called a cluster (Mandel' 1988).
Cluster analysis is a multidimensional statistical procedure that collects data and then groups the objects.
Most researchers (Golovchiner 2009) believe that the term ''cluster analysis'' was first proposed in 1939 by the American psychologist Robert Tryon.
What is the difference between clustering and classification? Clustering groups the objects of a set only according to the result obtained.
In classification, each object is assigned to a predefined group. In general, cluster analysis is applied in the following stages:
• Selecting objects for clustering;
• Determining the set of variables by which the selected objects are evaluated and, if necessary, normalizing the values of the variables;
• Calculating the values of a proximity measure between objects;
• Applying the cluster analysis method to group similar objects into clusters;
• Presenting the analysis results.
After obtaining and analyzing the results, the selected metric and clustering method can be adjusted until an optimal result is obtained.
Distance metrics. First, a feature vector must be constructed for each object; as a rule, it is a set of numeric variables, for example, a person's height and weight. However, algorithms working with qualitative characteristics also exist.
Once the feature vectors are determined, normalization can be performed so that all components contribute equally to the calculation of the ''distance''. During normalization, all values are brought into a certain range. Finally, the ''distance'' between each pair of objects is measured as their degree of proximity. Many metrics are available; for example, Euclidean distance is the most common distance function and represents the geometric distance in multidimensional space.
Squared Euclidean distance is used to give more weight to objects that are farther apart.
City Block Distance (Manhattan Distance). This distance is the sum of the absolute coordinate differences. In most cases, it leads to the same results as the ordinary Euclidean distance. However, for this distance, the effect of large individual differences is reduced (since they are not squared).
Chebyshev distance can be useful when two objects must be considered ''different'' as soon as they differ in any single coordinate.
Power distance is used when it is necessary to progressively increase or decrease the weight placed on dimensions in which the respective objects differ greatly.
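For reference, under the usual definitions (the notation below is ours), for objects $x = (x_1, \ldots, x_p)$ and $y = (y_1, \ldots, y_p)$ these metrics can be written as:

$$d_E(x,y) = \sqrt{\sum_{k=1}^{p}(x_k - y_k)^2}, \qquad d_{E^2}(x,y) = \sum_{k=1}^{p}(x_k - y_k)^2,$$

$$d_M(x,y) = \sum_{k=1}^{p}|x_k - y_k|, \qquad d_C(x,y) = \max_{k}|x_k - y_k|, \qquad d_P(x,y) = \Big(\sum_{k=1}^{p}|x_k - y_k|^{s}\Big)^{1/r},$$

where $r$ and $s$ are the user-defined parameters of the power distance.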
The clustering procedure used in this study:
1. Calculates the distances between each pair of points of the input data;
2. Treats each point as a separate cluster;
3. Merges the two closest clusters and updates the distance matrix;
4. Repeats the third step until a single cluster remains.
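A compact sketch of these four steps using SciPy's agglomerative implementation; the one-dimensional data array stands in for numeric values of the code field and is invented for illustration:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

data = np.array([[200.0], [200.0], [301.0], [302.0], [404.0], [404.0], [500.0]])

# linkage computes the pairwise distances, starts from singleton clusters,
# and repeatedly merges the two closest clusters until one remains.
Z = linkage(data, method="single", metric="euclidean")

# Cut the resulting hierarchy into, e.g., three flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```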
Clustering is applied in computer science as well as in other sciences.
Its application fields may include: 1. Data analysis; 2. Data search and retrieval; 3. Face recognition, etc.
Description of object features. Each object is described by a set of its own characteristics called features. Features may or may not be numeric.
Alternatively, an object can be described by its distances to all other objects of the training sample.
Data explanation through creating the cluster structure. Clustering similar objects and then applying a suitable analysis method to each cluster simplifies subsequent data processing and decision-making.
Hierarchical cluster analysis (clustering) is an analysis method for creating a hierarchy of clusters.
According to the Public Opinion Society (Khaydukov 2009), an office worker spends 1.5 h a day on personal issues: communicating on social networks, reading the news, visiting entertainment portals, and so on.
Hierarchical clustering splits large clusters into smaller ones; these, in turn, break down into smaller ones, and so on. The result is called a taxonomy.
Taxonomy (from the Greek táxis, arrangement or order, and nómos, law) is the theory of systematizing difficult-to-organize areas of knowledge; such a classification usually has a hierarchical structure. The concept of taxonomy first appeared in biology (the term was proposed in 1813 by the Swiss botanist Augustin de Candolle for the classification of plants) (Shatalkin 2012).
The advantage of hierarchical clustering is that the required number of clusters can be determined by examining the characteristics of the resulting tree, for example, by grouping subtrees that lie at sufficiently large distances from each other. The structure obtained is convenient for finding clusters: it is built once and does not need to be rebuilt when looking for a different number of clusters. The disadvantage is that the algorithm is very demanding in terms of memory. Working with the data is the most difficult part of clustering; in this study, the data are taken from a log file. Before any clustering, the proximity matrix has to be determined, which contains the distance between each pair of points. The matrix is then updated to show the distance between each pair of clusters.
Two main types of hierarchical clustering algorithms are available: bottom-up and top-down. Top-down algorithms work on a top-down principle: initially all objects are placed in one group, which is then split into smaller ones. Bottom-up algorithms are more common: they place each object in a separate group at the beginning and then combine the groups into larger ones until all the objects in the sample are included in one cluster. Thus, a system of nested partitions is established. The results of such algorithms are generally presented as a tree (dendrogram).
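For example (the five two-dimensional points are made up), a bottom-up result can be rendered as a dendrogram with SciPy and matplotlib:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[1, 1], [1.2, 1.1], [5, 5], [5.1, 4.9], [9, 1]])
Z = linkage(points, method="complete")

# Each merge of two clusters appears as one junction in the tree.
dendrogram(Z, labels=["a", "b", "c", "d", "e"])
plt.ylabel("merge distance")
plt.show()
```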
Hierarchical clustering algorithms. To calculate the distance between clusters, two definitions are commonly used: single linkage or complete linkage.
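Under the usual definitions, for clusters $A$ and $B$ and an object-level distance $d$:

$$d_{\mathrm{single}}(A,B) = \min_{a \in A,\; b \in B} d(a,b), \qquad d_{\mathrm{complete}}(A,B) = \max_{a \in A,\; b \in B} d(a,b).$$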
Hierarchical clustering algorithms can be divided into two major groups: 1) Agglomerative methods, for bottom-up hierarchical clustering; 2) Divisive methods, for top-down hierarchical clustering.
The first group of algorithms works on the principle of first combining the closest objects into pairs, then combining pairs with other objects or pairs on the same principle, and so on. This continues until one group is created.
Top-down hierarchical clustering algorithms work on the opposite principle: first the largest cluster, containing all objects in the sample, is taken, and it is then gradually divided into many clusters. Figure 5 presents an example of agglomerative and divisive hierarchical clustering algorithms.

Determining the proximity of clusters and objects in clusters
As mentioned, Fig. 2 shows the database structure in MySQL. The code field of this structure is used for clustering, and the objects of the clusters are viewed as points. Assume that $n$ clusters are obtained, denoted $K_1, K_2, \ldots, K_n$. The objects of the $i$-th cluster are denoted $K_{i1}, K_{i2}, \ldots, K_{im_i}$. The average value of the objects of the $i$-th cluster is calculated as

$$O_i = \frac{1}{m_i} \sum_{j=1}^{m_i} K_{ij}, \qquad (1)$$

where $O_i$ is the average of the values of the objects of the $i$-th cluster.
Excluding the $i$-th cluster, the average value of the objects of the remaining clusters is calculated:

$$\bar{O}_i = \frac{1}{\sum_{k \neq i} m_k} \sum_{k \neq i} \sum_{j=1}^{m_k} K_{kj}. \qquad (2)$$

The Euclidean distance between the two averages is denoted by $S_i$:

$$S_i = \sqrt{(O_i - \bar{O}_i)^2}. \qquad (3)$$

Using formula (3), the $S_i$ values are found and arranged in ascending order, and the proximity of an arbitrary cluster to the other clusters is determined.
Using formula (3) and the values in Table 4, the $S_i$ values are calculated. Table 2 presents the values obtained for each cluster.
A barchart corresponding to these values is drawn and presented in Fig. 6.
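A sketch of formulas (1)-(3) and the barchart in Python; the cluster contents below are illustrative and are not the values of Table 4:

```python
import numpy as np
import matplotlib.pyplot as plt

clusters = [
    np.array([200.0, 200.0, 200.0]),   # objects of K_1
    np.array([301.0, 302.0]),          # objects of K_2
    np.array([404.0, 404.0, 403.0]),   # objects of K_3
]

O = [c.mean() for c in clusters]  # formula (1): per-cluster averages
S = []
for i in range(len(clusters)):
    rest = np.concatenate([c for j, c in enumerate(clusters) if j != i])
    S.append(abs(O[i] - rest.mean()))  # formulas (2)-(3) in one dimension

plt.bar([f"K{i+1}" for i in range(len(S))], S)
plt.ylabel("S_i (distance to the remaining clusters)")
plt.show()
```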

Conclusion
As a result of the analysis of the applied web logs, it was possible to obtain much information that fully characterizes the existing load on the system. In the future, this information can be used to simulate the test load. The completeness of the resulting model and its maximum adaptation to the real situation are prerequisites for the success of the test project.
Over time, efficient clusters attract large investments. Combining clusters on the basis of vertical integration forms a certain system for the dissemination of various scientific knowledge and technologies (Shatalkin 2012).
An important distinguishing feature of a cluster is its innovativeness. One of the advantages of cluster applications is the scalability of systems: with the available custom software, companies can easily configure multiple devices that follow the same instructions and refer to the same data sets (Pautov and Popov 2015).
The key advantage of cluster analysis is that it groups objects by a whole set of parameters rather than by a single one. In addition, cluster analysis, unlike most mathematical and statistical methods, imposes no restrictions on the type of objects under consideration and accepts a variety of arbitrary primary data. Cluster analysis makes it possible to survey large amounts of data, drastically reducing and compressing large arrays and making them brief and visual (Tsikhan 2003).
Cluster analysis, like other methods, also has its drawbacks.
In particular, the composition and number of groups depend on the chosen partitioning criteria. When the original data array is compressed into a more compact form, certain distortions can occur, and the individual properties of separate objects can be lost because they are replaced by the aggregated values of cluster parameters. The possibility that the set under review contains no cluster structure at all is often not taken into account (Deepika et al. 2020).
Despite all this, different clusters will always exist and develop in parallel. In the future, the above-mentioned aspects will be taken into account, and the clustering will be of better quality.