Traffic state prediction based on stochastic repetitive hill climbing method

A reasonable traffic state network structure is a prerequisite for traffic state prediction. To overcome the tendency of the hill climbing method to fall into local optima, a traffic state prediction method based on the random repetitive hill climbing method is proposed. Multiple network structures are obtained by iteratively running the hill-climbing method on randomly generated directed acyclic graphs; the selection criteria for nodes and directed edges in the optimal Bayesian network structure are determined by defining the confidence degree of directed edges and calculating a confidence threshold; using the optimal Bayesian network structure, four traffic states (smooth, smooth, congested and blocked) are predicted and evaluated comprehensively. The analysis results show that the overall accuracy of traffic state prediction exceeds 85% when only two variables, time of day and holiday, are selected, which can provide an effective method and data support for highway operation state monitoring, early warning and decision analysis.


Introduction
The complete acquisition, real-time identification and accurate prediction of highway traffic state is the basis for accurately grasping the behavior of traffic system, making scientific traffic management decisions and giving full play to the potential of traffic facilities on the one hand [1,2]; on the other hand, it is also an important theoretical basis for analyzing the mechanism of traffic state evolution and grasping the law of traffic congestion occurrence [3].
Bayesian networks (BN) were first introduced by Pearl in 1986 for expert systems and are an ideal model for data mining and the representation of uncertain knowledge [4]. In the field of intelligent transportation, applying Bayesian networks to uncertainty inference on travel behavior, accident causation and traffic flow prediction is an active research topic at home and abroad [5]. Existing research focuses mainly on Bayesian network learning and the implementation of network inference algorithms, but it relies on a narrow range of structure learning algorithms, and the trade-offs in selecting network nodes and directed edges have not been studied deeply enough to verify the advantages and disadvantages of a network structure [6]. In this paper, we propose a traffic state prediction method based on the stochastic repetitive hill climbing method. With weather environment, holidays and time of day as the influencing variables, the hill climbing method is run iteratively and the confidence level of directed edges is calculated to obtain a Bayesian network structure suitable for traffic state prediction, which is then used for inference prediction of the traffic state [5,7].

Overview
A Bayesian network is a two-tuple BN = (G, P), where G = (X, E) denotes a Directed Acyclic Graph (DAG); $X = \{X_1, X_2, \ldots, X_n\}$ is the set of nodes in the DAG, representing random variables; E is the set of directed edges, representing direct dependencies between variables; P is the set of conditional probability distributions attached to the variables. For the joint distribution $P(X_1, \ldots, X_n)$ of the n variables in the network, the chain rule together with the conditional independencies encoded by G gives

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P(X_i \mid \pi(X_i)),$$

where $\pi(X_i)$ denotes the set of parent nodes of $X_i$ in G. This shows that a Bayesian network is a representation of the decomposition of a joint probability distribution, which reduces the complexity of the probabilistic model and makes probabilistic inference more convenient.
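As an illustration of how a Bayesian network decomposes a joint distribution into local conditional distributions, the following minimal sketch evaluates the joint probability of a hypothetical three-node network (Holiday -> State <- TimeOfDay). The network shape, the variable names and every probability value are assumptions made for illustration and are not taken from the data used in this paper.

```python
# Hypothetical network: Holiday -> State <- TimeOfDay (all numbers are illustrative).
P_HOLIDAY = {"holiday": 0.1, "workday": 0.9}
P_TIME = {"peak": 0.3, "offpeak": 0.7}
P_STATE = {  # P(State | Holiday, TimeOfDay)
    ("holiday", "peak"): {"smooth": 0.5, "congested": 0.5},
    ("holiday", "offpeak"): {"smooth": 0.8, "congested": 0.2},
    ("workday", "peak"): {"smooth": 0.4, "congested": 0.6},
    ("workday", "offpeak"): {"smooth": 0.9, "congested": 0.1},
}


def joint(holiday, time, state):
    """P(holiday, time, state) as the product of each node's local distribution."""
    return P_HOLIDAY[holiday] * P_TIME[time] * P_STATE[(holiday, time)][state]


# Summing the factorised joint over all configurations must give 1.
total = sum(
    joint(h, t, s)
    for h in P_HOLIDAY
    for t in P_TIME
    for s in ("smooth", "congested")
)
```

Each node only stores a distribution conditioned on its parents, so the number of parameters grows with the size of the parent sets rather than with the full joint table.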

Bayesian network modeling approach
The Bayesian network modeling process consists of two phases, network learning and network inference. Network learning consists of two parts: structure learning and parameter learning. Structure learning finds the network structure $G_{best}$ that best matches the given training sample D. Parameter learning computes the probability distributions of the variables in the network from the structure learning result and the given training data $D'$, i.e., it updates the original prior probability distributions in the network.

Structural Learning
Constructing Bayesian network structures from domain knowledge alone has limitations, and searching for the optimal network structure is an NP-hard problem, so using data effectively to construct network structures automatically has become both a focus and a difficulty of Bayesian network research. Existing methods fall into two main categories: constraint-based and scoring-based.
Constraint-based structure learning algorithms are simple and fast, but the independence test is error-prone and directly affects the computation of the network structure, so it is difficult to guarantee the learning accuracy. The scoring-based structure learning algorithm defines a scoring function in advance and then searches for the highest rated network structure, so the scoring function and the search strategy are the main factors affecting the algorithm.
The most straightforward search strategy in the scoring-based structure learning algorithm is the exhaustive method, which calculates the score of every structure and then selects the structure with the highest score. In practice, this method is not feasible because the number of possible network structures is too large. Hill-Climbing (HC) is a greedy local search algorithm: starting from a given initial structure, it repeatedly applies whichever add-edge, delete-edge or reverse-edge operation improves the network score, and iterates until no single operation improves the score further.
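The hill-climbing search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this paper: the function names `is_acyclic`, `neighbours` and `hill_climb` are hypothetical, and the toy `score` simply counts agreement with a hypothetical target structure, whereas a real scoring function would be computed from data.

```python
from itertools import permutations


def is_acyclic(nodes, edges):
    """Repeatedly strip nodes without incoming edges; a cycle leaves nodes behind."""
    remaining = set(nodes)
    while remaining:
        sources = [
            n for n in remaining
            if not any(v == n and u in remaining for u, v in edges)
        ]
        if not sources:
            return False
        remaining -= set(sources)
    return True


def neighbours(nodes, edges):
    """Yield every edge set reachable by one add, delete or reverse operation."""
    for u, v in permutations(nodes, 2):
        if (u, v) in edges:
            yield edges - {(u, v)}                # delete edge
            yield (edges - {(u, v)}) | {(v, u)}   # reverse edge
        elif (v, u) not in edges:
            yield edges | {(u, v)}                # add edge


def hill_climb(nodes, start_edges, score):
    """Greedy ascent: take the best improving neighbour until none improves."""
    current = frozenset(start_edges)
    current_score = score(current)
    while True:
        best, best_score = None, current_score
        for candidate in neighbours(nodes, current):
            if not is_acyclic(nodes, candidate):
                continue
            s = score(candidate)
            if s > best_score:
                best, best_score = frozenset(candidate), s
        if best is None:
            return current, current_score
        current, current_score = best, best_score


# Toy score (assumption): reward edges in a hypothetical target, penalise others.
NODES = ["time", "holiday", "state"]
TARGET = frozenset({("time", "state"), ("holiday", "state")})


def score(edges):
    return len(edges & TARGET) - len(edges - TARGET)


result, result_score = hill_climb(NODES, {("state", "time")}, score)
```

Because the search only accepts strict improvements, it always terminates, but with a data-based score the structure it stops at depends on the initial structure.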

Parameter Learning
Parameter learning methods include maximum likelihood estimation and Bayesian estimation, which represent the classical and Bayesian schools of mathematical statistics, respectively.
The classical school considers the conditional probability $P(D \mid \Theta)$ of the training sample D as a function of $\Theta$, and the maximum likelihood estimate of the parameter $\Theta$ is the $\Theta^{*}$ that maximizes $P(D \mid \Theta)$, i.e.

$$\Theta^{*} = \arg\max_{\Theta} P(D \mid \Theta).$$

The Bayesian school treats $\Theta$ as a random variable and calculates its posterior probability distribution. However, the integral required to calculate $P(\Theta \mid D)$ is usually difficult to evaluate, so an approximation is usually used instead: the maximum a posteriori (MAP) estimate, i.e., Bayesian MAP estimation.
Maximum likelihood estimation does not consider the effect of prior knowledge on the parameter to be estimated, whereas Bayesian estimation treats the parameter as a random variable and can therefore incorporate prior knowledge, which makes Bayesian estimation more reasonable.
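The difference between the two estimators can be illustrated for a single conditional probability table. The sketch below is an assumption-laden example: the function name `estimate_cpt`, the `pseudo_count` parameter and the sample data are hypothetical. With `pseudo_count=0` it returns the maximum likelihood estimate; with a positive `pseudo_count` it returns the MAP estimate under a symmetric Dirichlet prior with concentration `pseudo_count + 1`, which is one common choice of prior and not necessarily the one used in this paper.

```python
from collections import Counter


def estimate_cpt(samples, states, pseudo_count=0.0):
    """Estimate P(state | parent configuration) from (parent_config, state) pairs.

    pseudo_count = 0 gives the maximum likelihood estimate (relative frequency).
    pseudo_count > 0 adds that many virtual observations to every state, which
    equals the MAP estimate under a symmetric Dirichlet(pseudo_count + 1) prior.
    """
    counts = Counter(samples)
    parent_configs = {parent for parent, _ in samples}
    cpt = {}
    for parent in parent_configs:
        total = sum(counts[(parent, s)] for s in states) + pseudo_count * len(states)
        cpt[parent] = {
            s: (counts[(parent, s)] + pseudo_count) / total for s in states
        }
    return cpt


# Hypothetical observations: traffic state during the "peak" period.
STATES = ["smooth", "congested", "blocked"]
samples = [("peak", "congested")] * 3 + [("peak", "smooth")]

mle = estimate_cpt(samples, STATES)                    # blocked gets probability 0
bayes = estimate_cpt(samples, STATES, pseudo_count=1)  # blocked gets 1/7
```

The maximum likelihood estimate assigns zero probability to a state that never appears in the sample, while the prior keeps a small probability for it; this matters for rare states.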

Network Inference
Network inference refers to the use of Bayesian network structure and its probability distribution to calculate the posterior probabilities of variables under given conditions. The known variables are called evidence variables, denoted as E, and the values are denoted as e; the variables for which the posterior probabilities need to be calculated are called query variables, denoted as Q, and their posterior probability distributions are P (Q|E = e).
Bayesian network inference is also an NP-hard problem. In recent years, some progress has been made on inference algorithms for specific types of Bayesian networks, such as the variable elimination method and the junction tree algorithm. Compared with variable elimination, the junction tree algorithm improves inference efficiency by sharing computation steps and is fast, which has made it the most widely used exact inference algorithm.
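To make the posterior P(Q|E = e) concrete, the sketch below computes it by plain enumeration for a hypothetical network Weather -> State <- Time, summing out the unobserved variable and normalising. Enumeration is used here only because it is short; it is not the exact-inference algorithm discussed above, and the network, names and numbers are all illustrative assumptions.

```python
# Hypothetical network: Weather -> State <- Time (illustrative numbers only).
P_WEATHER = {"rain": 0.2, "clear": 0.8}
P_TIME = {"peak": 0.3, "offpeak": 0.7}
P_STATE = {  # P(State | Weather, Time)
    ("rain", "peak"): {"smooth": 0.2, "congested": 0.8},
    ("clear", "peak"): {"smooth": 0.6, "congested": 0.4},
    ("rain", "offpeak"): {"smooth": 0.7, "congested": 0.3},
    ("clear", "offpeak"): {"smooth": 0.9, "congested": 0.1},
}
STATES = ("smooth", "congested")


def joint(state, weather, time):
    """Joint probability from the network factorisation."""
    return P_WEATHER[weather] * P_TIME[time] * P_STATE[(weather, time)][state]


def posterior_given_time(time):
    """P(State | Time = time): sum out the unobserved Weather, then normalise."""
    unnormalised = {
        s: sum(joint(s, w, time) for w in P_WEATHER) for s in STATES
    }
    z = sum(unnormalised.values())
    return {s: p / z for s, p in unnormalised.items()}


posterior = posterior_given_time("peak")
# The predicted state is the one with the largest posterior probability.
prediction = max(posterior, key=posterior.get)
```

With Time = peak as evidence, the posterior is 0.52 for smooth and 0.48 for congested, so the predicted state in this toy example is smooth.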

Network structure learning algorithm based on random repetition
The hill-climbing method is a greedy local search algorithm and easily falls into a local optimum rather than the global optimum. This paper proposes Random Restart Hill-Climbing (RRHC) to overcome this defect: r DAGs are generated at random, and the hill-climbing method is run from each of them as the initial structure to obtain r network structures. Finally, the confidence level of each directed edge across the r network structures is calculated, and the directed edges that occur more frequently are selected by a confidence threshold to form the final network structure.

Traffic state inference prediction model

Definition of traffic state inference prediction
Traffic state inference prediction refers to the construction of a Bayesian network with the traffic state identified by the traffic flow parameters as the outcome variable and the meteorological and temporal factors affecting the traffic state as the influencing variables; then the posterior probability of the traffic state under the given conditions is calculated and the maximum value is taken as the traffic state prediction result.

Traffic state inference prediction modeling process
According to the above Bayesian network modeling method, the traffic state prediction modeling process is divided into the following stages:

Data preprocessing and variable selection
Raw data for a period of time are selected from a database storing road environment, meteorological conditions, traffic status information, etc.; the raw data are pre-processed (including resampling at a fixed interval, interpolation of missing data, discretization of continuous values, etc.); the influencing variables for constructing the Bayesian network are selected.
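Two of the pre-processing steps named above, interpolation of missing data and discretization of continuous values, can be sketched as follows. The function names, the choice of linear interpolation, and the bin edges and labels are assumptions for illustration; the text does not specify which interpolation or binning scheme was used.

```python
from bisect import bisect_right


def interpolate_missing(values):
    """Fill None entries linearly between the nearest observed neighbours.

    Gaps at either end of the series copy the nearest observed value.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observed values to interpolate from")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = values[left] + weight * (values[right] - values[left])
    return filled


def discretize(value, edges, labels):
    """Map a continuous value to a label; len(labels) must be len(edges) + 1."""
    return labels[bisect_right(edges, value)]


# Hypothetical speed series with gaps, and hypothetical bin edges.
speeds = interpolate_missing([10, None, None, 40, None])
level = discretize(35.0, [30, 60], ["low", "mid", "high"])
```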

DAG random generation and directed edge blacklist setting
Set the number of randomly generated DAGs r; set the blacklist of directed edges according to the interrelationship between variables and a priori knowledge, i.e., the DAGs obtained from each run of the hill-climbing algorithm do not contain any directed edges in the blacklist.
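One simple way to generate a random initial DAG that respects a blacklist is to draw a random node ordering and only allow edges that point forward in that ordering. The sketch below assumes this construction; the function name `random_dag`, the `edge_prob` parameter and the example blacklist are hypothetical and are not taken from this paper.

```python
import random


def random_dag(nodes, blacklist=(), edge_prob=0.5, rng=None):
    """Return a random set of directed edges that is acyclic and avoids the blacklist.

    Edges only run from earlier to later nodes in a random ordering, so no
    cycle can be formed.
    """
    rng = rng or random.Random()
    order = list(nodes)
    rng.shuffle(order)
    banned = set(blacklist)
    edges = set()
    for i, u in enumerate(order):
        for v in order[i + 1:]:
            if (u, v) not in banned and rng.random() < edge_prob:
                edges.add((u, v))
    return edges


# Hypothetical prior knowledge: the traffic state cannot influence the calendar.
NODES = ["time", "holiday", "weather", "state"]
BLACKLIST = {("state", "time"), ("state", "holiday"), ("state", "weather")}

initial_dags = [random_dag(NODES, BLACKLIST, rng=random.Random(seed)) for seed in range(5)]
```

The same blacklist also has to be enforced inside the hill-climbing search, so that add-edge and reverse-edge operations never introduce a blacklisted edge.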

Stochastic repetitive Bayesian network learning
The scoring function is selected, and the hill-climbing algorithm is run iteratively with each randomly generated DAG as the initial structure; the confidence of each directed edge and the confidence threshold are then calculated from the r DAGs obtained, which yields the optimal Bayesian network structure; finally, parameter learning is performed.
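The edge selection step can be sketched as below, assuming that the confidence of a directed edge is the fraction of the r learned DAGs that contain it. That definition, the function names and the example DAGs are assumptions for illustration; the threshold is passed in as a parameter because its calculation is not reproduced here.

```python
from collections import Counter


def edge_confidence(dags):
    """Fraction of the learned DAGs in which each directed edge appears."""
    counts = Counter(edge for dag in dags for edge in set(dag))
    r = len(dags)
    return {edge: count / r for edge, count in counts.items()}


def select_edges(dags, threshold):
    """Keep the directed edges whose confidence exceeds the threshold."""
    confidence = edge_confidence(dags)
    return {edge for edge, c in confidence.items() if c > threshold}


# Hypothetical results of r = 4 hill-climbing runs.
learned = [
    {("time", "state"), ("holiday", "state")},
    {("time", "state")},
    {("time", "state"), ("state", "holiday")},
    {("time", "state"), ("holiday", "state")},
]

confidence = edge_confidence(learned)
final_edges = select_edges(learned, threshold=0.4)
```

In this toy example the edge time -> state appears in every run (confidence 1.0), holiday -> state in half of them (0.5), and state -> holiday in one (0.25), so a threshold of 0.4 keeps the first two.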

Inference prediction
The posterior probability of the query variable (traffic state) is calculated based on the values of the evidence variables, and the maximum value is taken as the inference prediction result of the traffic state.

Comprehensive evaluation of the model
The evaluation indexes are selected and the prediction results are evaluated comprehensively. If the evaluation result does not meet the prediction requirements, adjust the influencing variables and parameters in the model and repeat the above process until the model meets the evaluation requirements.

Model evaluation criteria
Based on the actual traffic status and the Bayesian network prediction results, the confusion matrix is established as shown in Table 1. In Table 1, $n_{i,j}$ denotes the number of samples in status category $c_i$ that the Bayesian network predicts as category $c_j$. The confusion matrix reflects the inference prediction performance of the Bayesian network: the i-th row reflects the recall of category $c_i$, the j-th column reflects the precision of category $c_j$, and the diagonal reflects the overall accuracy. For a specific traffic state (e.g., the blocked state $c_j$), the precision and recall are calculated as shown in equation (10) and equation (11), respectively, and the overall accuracy of the prediction is calculated as shown in equation (12).
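The three evaluation indexes can be computed from a confusion matrix as in the following sketch. It assumes the standard definitions that match the row and column description above (rows are actual categories, columns are predicted categories); the function name `evaluate` and the example matrix are hypothetical.

```python
def evaluate(confusion):
    """Return (precision per class, recall per class, overall accuracy).

    confusion[i][j] is the number of samples of actual class i predicted as class j.
    """
    k = len(confusion)
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(k)) for j in range(k)]
    total = sum(row_sums)
    # Precision of class j: correct predictions of j over all predictions of j.
    precision = [
        confusion[j][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(k)
    ]
    # Recall of class i: correct predictions of i over all actual samples of i.
    recall = [
        confusion[i][i] / row_sums[i] if row_sums[i] else 0.0 for i in range(k)
    ]
    # Overall accuracy: diagonal sum over the total number of samples.
    accuracy = sum(confusion[i][i] for i in range(k)) / total
    return precision, recall, accuracy


# Hypothetical two-class confusion matrix.
precision, recall, accuracy = evaluate([[8, 2], [1, 9]])
```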

Data preprocessing and variable selection
The literature gives a basic method for real-time traffic state identification from traffic flow parameters. In this paper, we first apply this method to the traffic flow parameters for December 2014 obtained from the Performance Measurement System (PeMS) of the California Department of Transportation (Caltrans) for detector station No. 717490 on Highway 101, and build a traffic status database from the result. After pre-processing, weather conditions, visibility, day of week, time of day, holidays, etc. were selected as the influencing variables.

Bayesian network learning results
The Bayesian Information Criterion (BIC) is chosen as the scoring function, and the hill climbing algorithm is run iteratively from 100 randomly generated DAGs to obtain the highest-scoring DAG of each run, 100 DAGs in total. From Eq. (9), the final network structure is composed of the directed edges whose confidence exceeds the threshold of 0.574 (see Table 2).

Analysis of traffic state prediction results
In summary, only two variables, time of day and holiday, are selected as evidence variables; the posterior probability of the query variable (traffic status) is calculated with the junction tree algorithm, and the state with the maximum posterior probability is taken as the traffic status prediction result. The confusion matrix was established by comparison with the actual traffic status, and the evaluation indexes of the prediction results were obtained (see Table 3). As shown in Table 3, the overall accuracy of traffic status prediction for the detection station section exceeds 85% when only the two variables time of day and holiday are selected, and reaches 87.2% for the three traffic states smooth, smooth and congested, which can support forecasting and early warning in highway operation status monitoring. The prediction results for the blocked state are unsatisfactory because traffic congestion is affected not only by day-of-week, time-of-day and holiday factors but also by occasional factors such as traffic accidents and bad weather, while traffic congestion accounts for only 8.14% of the total network-learning sample; the network learning results therefore do not fully express the causality of the blocked state. The comparison of Table 3 and Table 4 shows that traffic state prediction based on the random repetitive hill climbing method has significantly better evaluation indexes than the traditional hill climbing method, which compensates for the tendency of the traditional hill climbing method to fall into a local optimum.

Conclusion
In this paper, we proposed a traffic state prediction method based on the random repetitive hill climbing method by mining the basic data such as traffic state, weather environment and holiday information, and constructed a Bayesian network using this method to realize the inference prediction of traffic state. The method can provide an effective method and data support for highway operation status monitoring and early warning and decision analysis.
In this paper, the number of iterations, i.e., the number of randomly generated DAGs, is specified in advance in the random repetitive hill climbing process. Too many iterations slow down learning of the network structure; too few may still leave the search in a local optimum. Therefore, how the result converges as the number of iterations grows needs further investigation. Traffic accidents, bad weather and other incidental factors can have a significant impact on the traffic state; how to improve the prediction results for the blocked state, for example by filtering the historical data of specific time periods to build a Bayesian network, also needs further investigation.