Suzhou is a prefecture-level city in Anhui Province, China. It governs four counties and one district: Dangshan, Xiaoxian, Lingbi, and Sixian counties and Yongqiao District. Suzhou has a total area of 9,939 square kilometers, of which 195.89 square kilometers is urban construction land. The annual average temperature is 16 °C, and the annual precipitation is 975 mm.
At the end of 2017, the total population of Suzhou was 6.55 million. The total length of highways was 16,471 kilometers, the length of expressways was 359 kilometers, the length of urban roads was 1,737.61 kilometers, and the length of first-class roads had reached 350 kilometers. By the end of 2017, Suzhou had 440,900 automobiles, the road freight volume was 252.27 million tons, and average car ownership was 67.32 per thousand people. In addition, highway passenger traffic in Suzhou was 36.71 million per year.
Real-time meteorological data
We obtained 10 years of meteorological data from the National Meteorological Information Center (http://data.cma.cn). These data were collected and stored in the database by the meteorological observatories of the five counties and districts of Suzhou. The data include atmospheric pressure (hPa), temperature (°C), relative humidity (%), precipitation (mm), wind direction (°), wind speed (m/s), visibility (m), snow depth (cm), evaporation (mm), total cloud amount (%), etc.
At the meteorological observatories, atmospheric pressure, temperature, humidity, precipitation, wind direction, and wind speed are all recorded by electronically controlled mechanical equipment. This equipment has an embedded chip that automatically collects surrounding meteorological data at fixed intervals (every 2 minutes, 10 minutes, 1 hour, or 1 day). The collected data are then automatically encoded into a binary data stream and sent to the database. Variables such as visibility and total cloud amount are recorded manually by observers and stored in the database.
The cases included all recorded road traffic accidents in the city of Suzhou from January 1, 2008 through December 31, 2017. The traffic accident data were obtained from the Traffic Police Detachment of the Public Security Bureau of Suzhou. They mainly included the accident number, the administrative division, the accident time, the total number of deaths, the number of injured, the identified cause of the accident, the direct property loss, the accident location, and the highway administrative grade. Other vehicle, population, and road information was obtained from the Suzhou Municipal Bureau of Statistics website.
Based on the loss caused by the accident and the “Notice of the Ministry of Public Security on Revising the Classification Standard”, traffic accident severity was divided into four levels:
Minor road traffic accident (level Ⅰ): one or two people are slightly injured at one time, and the property loss is less than 1,000 yuan (RMB, the same below);
General road traffic accident (level Ⅱ): fewer than 10 people are injured, and the property loss is less than 30,000 yuan;
Serious road traffic accident (level Ⅲ): one or two deaths with fewer than 10 people injured; or more than 10 people injured; or a property loss of more than 30,000 yuan and less than 60,000 yuan;
Particularly serious road traffic accident (level Ⅳ): one or two deaths at one time with more than 10 people injured; or more than two deaths at one time; or a property loss of more than 60,000 yuan.
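The four levels above can be read as a rule cascade checked from most to least severe. The sketch below is one possible reading, written in Python rather than the software used in the study. The function name and arguments are hypothetical, and several points are assumptions because the text leaves them open: slight and serious injuries are not distinguished, boundary values (exactly 10 injured, exactly 30,000 or 60,000 yuan) are assigned to the lower candidate level, and accidents with no injuries and a loss under 1,000 yuan are treated as level Ⅰ.

```python
def severity_level(deaths, injured, loss_yuan):
    """Return 1-4 for levels I-IV (hypothetical helper, see assumptions above)."""
    # Level IV: more than two deaths, or deaths with more than 10 injured,
    # or a property loss above 60,000 yuan.
    if deaths > 2 or (deaths >= 1 and injured > 10) or loss_yuan > 60000:
        return 4
    # Level III: one or two deaths, or more than 10 injured,
    # or a property loss above 30,000 yuan (up to 60,000).
    if deaths >= 1 or injured > 10 or loss_yuan > 30000:
        return 3
    # Level II: no deaths, more than two injured or a loss of 1,000 yuan or more.
    if injured > 2 or loss_yuan >= 1000:
        return 2
    # Level I: at most two injured and a loss below 1,000 yuan.
    return 1


print(severity_level(deaths=0, injured=1, loss_yuan=500))
print(severity_level(deaths=1, injured=12, loss_yuan=0))
```

Checking the most severe level first keeps each rule short, since any record that reaches a lower check has already failed the higher ones.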
All data were sorted using SPSS 23.0. The accident data and the meteorological data were matched by time (hour). For a missing meteorological value, if the values for the previous hour and the next hour were available, their average was used as the substitute value. If the adjacent values were also missing, the record was deleted. Traffic accident records were deleted if the related details were missing.
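The neighbor-average rule for missing meteorological values can be sketched as follows. This is a minimal illustration in Python, not the SPSS procedure used in the study; the function name and the use of `None` to mark a missing hourly reading are assumptions.

```python
def fill_single_gaps(values):
    """Fill a missing hourly value with the mean of its two neighbors.

    `values` is a list of hourly readings with None for missing entries.
    Gaps whose neighbors are also missing stay None, so the caller can
    delete those hours afterwards.
    """
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        # Look at the ORIGINAL neighbors so runs of gaps are not filled
        # from values that were themselves imputed.
        has_prev = i > 0 and values[i - 1] is not None
        has_next = i + 1 < len(values) and values[i + 1] is not None
        if has_prev and has_next:
            filled[i] = (values[i - 1] + values[i + 1]) / 2
    return filled


temperature = [10.0, None, 14.0, None, None, 20.0]
filled = fill_single_gaps(temperature)
# Hours that are still missing would be dropped before matching with accidents.
usable = [v for v in filled if v is not None]
print(filled, usable)
```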
Before data treatment, all variables were converted to numerical types as required by the R packages. Ordered categorical variables, such as road grade, were converted from national roads, provincial roads, county roads, and rural roads to “1”, “2”, “3”, and “4”. Categorical variables were converted to dummy variables; for example, “1” indicated that the accident occurred on a highway and “2” indicated that it did not. Different variables may also have different dimensions (units), which can lead to large differences in scale between the data and, if left unprocessed, may affect the results of the analysis. To eliminate the influence of dimension and value range on the results, the data need to be standardized:
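As an illustration only, the encoding and scaling steps might look like the sketch below. It is written in Python rather than R, the names are hypothetical, and the choice of z-score standardization is an assumption: the exact standardization formula used in the study is not given in this section, and min-max scaling would be another common option.

```python
from statistics import mean, pstdev

# Ordered road grades mapped to 1-4, as described in the text.
ROAD_GRADE = {"national": 1, "provincial": 2, "county": 3, "rural": 4}


def encode_highway(on_highway):
    """Dummy coding from the text: 1 = on a highway, 2 = not."""
    return 1 if on_highway else 2


def zscore(column):
    """Assumed standardization: subtract the mean, divide by the std. dev."""
    mu, sigma = mean(column), pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0] * len(column)
    return [(x - mu) / sigma for x in column]


grades = [ROAD_GRADE[g] for g in ["national", "rural", "county"]]
wind_speed = zscore([2.0, 4.0, 6.0])
print(grades, encode_highway(True), wind_speed)
```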
Random forest model
Random forest models place no requirements on data types and can solve both regression and classification problems. They are widely applicable and are well suited to situations with many variables or large sample sizes. The disadvantages are that the model is sometimes difficult to interpret and that the computational cost can be high.
Random forest models have already been used in studies of traffic accident and severity prediction. Random forests are designed to produce accurate predictions that do not overfit the data. They are similar to bagged trees in that bootstrap samples are drawn to construct multiple trees; the difference is that each tree is grown with a randomized subset of predictors, hence the name “random” forests. A large number of trees are grown, hence a “forest” of trees. Random forests are something of a “black box” approach because it is impractical to inspect each individual tree. However, the method provides indicators that aid interpretation: the results table can be used to compare the relative importance of the predictors. The process is therefore easier to interpret than a method such as a neural network.
Out-of-bag samples can be used to calculate unbiased error rates and variable importance without the need for test sets or cross-validation. Because a large number of trees are grown, the generalization error converges to a limit, so adding more trees does not cause over-fitting, which is a very useful property for prediction. Another advantage of random forests (RF) is that the predicted output depends on only one user-selected parameter, the number of predictors chosen randomly at each node.
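To see why out-of-bag samples are available for every tree, it helps to look at a single bootstrap draw. The short Python sketch below (a hypothetical helper, not part of the study's R workflow) draws n observations with replacement and measures the share that was never selected; for large n this share approaches 1/e, about 37%, and those observations can serve as that tree's validation data.

```python
import random


def oob_fraction(n, seed=0):
    """Share of n observations left out of one bootstrap sample."""
    rng = random.Random(seed)
    # Sampling n indices with replacement; the set keeps the distinct ones.
    in_bag = {rng.randrange(n) for _ in range(n)}
    return 1 - len(in_bag) / n


print(round(oob_fraction(10000), 3))
```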
The random forest model was implemented with the randomForest package (version 4.6-14) in R (version 3.5.1). We randomly chose 75% of the data as training data; the remaining 25% were treated as testing data. Following previous studies, variable effects were assessed by varying the input variable values one at a time and observing how the output changed [9-10].
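The random 75/25 split and the one-variable-at-a-time analysis can be sketched as below. This is a Python illustration under stated assumptions, not the R code used in the study: the function names are hypothetical, and `toy_model` is a stand-in for a fitted model's prediction function.

```python
import random


def train_test_split(rows, train_frac=0.75, seed=0):
    """Shuffle the rows and cut them into training and testing parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def one_at_a_time(predict, baseline, index, values):
    """Vary one input over `values`, holding the others at the baseline."""
    outputs = []
    for v in values:
        x = list(baseline)
        x[index] = v
        outputs.append(predict(x))
    return outputs


train, test = train_test_split(list(range(100)))


def toy_model(x):
    # Stand-in for a fitted model: responds twice as strongly to x[0] as to x[1].
    return 2 * x[0] + x[1]


print(len(train), len(test))
print(one_at_a_time(toy_model, baseline=[1, 1], index=0, values=[0, 1, 2]))
```

Comparing the spread of outputs across variables gives a rough picture of which inputs the model responds to most strongly.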
To select the model parameters, the number of variables “mtry” and the number of decision trees “ntree” were chosen for better model fitting. mtry is the number of variables randomly sampled as candidates at each split; a “for()” loop in R can traverse all candidate values and select the mtry value with the lowest error rate. ntree is the number of trees to grow; too high a value increases the complexity, and too low a value increases the error rate. Random forest modeling was performed with the mtry that gave the minimum mean error, and the relationship between the model error rate and the number of decision trees was visualized to select the optimal ntree. A random forest model was then established to obtain the importance of each variable. The out-of-bag (OOB) error rate was computed from the data not in the bootstrap sample, and the predicted results were compared with the actual results.
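The mtry search can be illustrated with a deliberately small stand-in for a random forest. The Python sketch below is not the randomForest package and makes several simplifying assumptions: each "tree" is a depth-1 stump, the data are synthetic with only the first of four features informative, and all function names are hypothetical. It shows the loop structure only: for each candidate mtry, grow ntree trees on bootstrap samples, score each observation with the trees for which it was out of bag, and keep the mtry with the lowest OOB error.

```python
import random
from collections import Counter


def fit_stump(X, y, features):
    """Fit a depth-1 tree: best threshold split among the candidate features."""
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    if best is None:  # no usable split: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (features[0], float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    f, t, l_lab, r_lab = stump
    return l_lab if row[f] <= t else r_lab


def oob_error(X, y, mtry, ntree, seed=0):
    """Out-of-bag misclassification rate of a forest of stumps."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    votes = [Counter() for _ in range(n)]
    for _ in range(ntree):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        feats = rng.sample(range(p), mtry)  # random candidate predictors
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx], feats)
        for i in range(n):
            if i not in in_bag:  # only out-of-bag rows vote
                votes[i][predict_stump(stump, X[i])] += 1
    scored = [i for i in range(n) if votes[i]]
    if not scored:
        return float("nan")
    wrong = sum(votes[i].most_common(1)[0][0] != y[i] for i in scored)
    return wrong / len(scored)


# Synthetic data (assumption): 4 features, only feature 0 determines the class.
data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(4)] for _ in range(80)]
y = [1 if row[0] > 0.5 else 0 for row in X]

errors = {}
for mtry in range(1, 5):
    errors[mtry] = oob_error(X, y, mtry=mtry, ntree=25)
best_mtry = min(errors, key=errors.get)
print(errors, "best mtry:", best_mtry)
```

The same pattern extends to ntree: hold mtry at its selected value, record the OOB error as trees are added, and stop increasing ntree once the error curve flattens.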
The back-propagation neural network model
The back-propagation neural network (BPNN) is a type of artificial neural network with a classical multilayer topology and feed-forward connections. A BPNN does not need any a priori assumptions about whether the relationships between variables are linear or non-linear, and offers the opportunity to investigate and create a first discriminant analysis in problems where the phenomena (the relationships between input and output) are not well known.
In this investigation, the neural network model was implemented in R using the neuralnet package (version 1.44.2). The data were randomly divided into training and test sets at a ratio of 3:1. The neural network structure diagram was output through the “plot” function. The parameters were selected according to the method of Mussone et al., and 10 neurons were set in the hidden layer. Using the algorithm, the weight matrix of the model and a visualization of the importance of each variable were obtained. Sensitivity analysis was performed by varying the number of hidden neurons (7-14).
In addition, to assess the accuracy of the model predictions, the predicted results were compared with the actual results.
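The feed-forward topology with a single hidden layer can be sketched in a few lines. The Python code below is an illustration only, not the neuralnet package: it shows just the forward pass of a network with 10 hidden neurons, the weights are random and untrained, and the sigmoid activation, the five-input example, and all names are assumptions. Back-propagation would adjust these weights to reduce the prediction error on the training set.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_inputs, n_hidden=10, seed=0):
    """Random weights; the last entry of each weight list is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-1, 1) for _ in range(n_inputs + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(network, x):
    """Input layer -> hidden layer -> single output neuron."""
    hidden, output = network
    h = [sigmoid(sum(w * v for w, v in zip(ws[:-1], x)) + ws[-1])
         for ws in hidden]
    return sigmoid(sum(w * v for w, v in zip(output[:-1], h)) + output[-1])


# The hidden-layer size can be varied (for example 7 to 14) to see how
# the architecture changes; here only the structure is built and evaluated.
for n_hidden in range(7, 15):
    net = init_network(n_inputs=5, n_hidden=n_hidden)
    print(n_hidden, round(forward(net, [0.1, 0.2, 0.3, 0.4, 0.5]), 3))
```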