Research on Multi-factor Forest Fire Prediction Model Using Machine Learning Method in China

Forest fires can cause serious harm in many ways. Scientific prediction of forest fires and the selection of a suitable fire prediction model are therefore of great importance to China's forest fire prevention and control work. Based on data on fire hotspots, meteorology, terrain, vegetation, infrastructure, and socio-economics collected from 2003 to 2016, we used a random forest model as a feature-selection method to determine 13 major drivers of forest fires in China (such as temperature and terrain). The forest fire prediction models developed in this study are based on four machine-learning algorithms: an artificial neural network, a radial basis function network, a support-vector machine, and a random forest. The models were evaluated using five performance indicators: accuracy, precision, recall, f1 value, and area-under-the-curve value. We used the optimal model to obtain the probability of forest fire occurrence in the various provinces of China and to create a spatial distribution map of the areas with high incidences of forest fires. The results show that the prediction accuracy of the four forest fire prediction models is between 75.8% and 89.2%, and the area-under-the-curve value is between 0.840 and 0.960. The random forest model has the highest accuracy (89.2%) and area-under-the-curve value (0.960), so it was used as the optimal model to predict the probability of forest fire occurrence in China. The prediction results indicate that the areas with high incidences of forest fires are mainly concentrated in northeastern China (Heilongjiang Province and the northern Inner Mongolia Autonomous Region) and southeastern China (including Fujian Province), among other regions. In areas at high risk of forest fires, management departments can improve forest fire prevention and control by establishing watchtowers and deploying other monitoring equipment.
This study not only helps in understanding the main drivers of forest fires in China, but also provides a reference for the selection of high-precision forest fire prediction models and a basis for forest fire prevention and control work. The results show that the main influencing variables are longitude, latitude, average surface temperature, daily maximum surface temperature, accumulated precipitation, average relative humidity, average temperature, daily maximum temperature, altitude, population, and NDVI. These variables were used in the subsequent model fitting. The mean decrease in accuracy obtained from the random forest model was then used to rank these drivers.


Variable Handling
The dependent variable is binary (i.e., whether a forest fire occurs), so we used ArcGIS to create a certain percentage of random points (non-fire points) and assigned 1 to fire points and 0 to non-fire points [52]. To ensure that the data were not over-dispersed, random points were selected, based on prior experience, in a ratio of 1:1 [53]; in principle, randomness in space and time should be maintained [54]. We used the ArcGIS 10.4 software to create the random points and then used the 2015 National Land Use data to exclude random points that fell on bodies of water or urban land. We obtained a total of 65,492 fire points and random points.

Similarly, from the infrastructure data and socio-economic data, we extracted the information corresponding to the sample points. We set aspect and special festivals as categorical variables and the others as continuous variables. Table 1 shows the classification of aspect [57]. During certain traditional festivals in China, people burn paper to commemorate their loved ones, which raises the probability of a forest fire. We assigned the value 1 (special festival) to the following dates: Chinese New Year's Eve, the first day of the first lunar month, the second day of the first lunar month, the fifteenth day of the first lunar month, the Qingming Festival, and the Zhongyuan Festival (July 15th of the lunar calendar). Non-special festivals were set to 0. After processing, we obtained 20 independent variables and their possible values (see Table 2). Finally, we cleaned the sample points and the various types of extracted data to remove abnormal samples from the original dataset (including samples with missing data and samples with observations significantly outside the normal range).
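The 1:1 fire/non-fire sampling described above can be sketched as follows. This is a minimal illustration with synthetic coordinates; in the study the points are created in ArcGIS and masked against the 2015 land-use data, which is not reproduced here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical fire-point table; the real data come from satellite hotspot records.
fire = pd.DataFrame({
    "lon": rng.uniform(73, 135, 500),   # rough longitude range of China
    "lat": rng.uniform(18, 53, 500),    # rough latitude range of China
    "label": 1,                         # fire point
})

# Draw the same number of random background points (1:1 ratio), labeled 0.
# Points falling on water bodies or urban land would be excluded at this step.
non_fire = pd.DataFrame({
    "lon": rng.uniform(73, 135, len(fire)),
    "lat": rng.uniform(18, 53, len(fire)),
    "label": 0,                         # non-fire point
})

sample = pd.concat([fire, non_fire], ignore_index=True)
print(sample["label"].value_counts().to_dict())  # balanced classes
```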

Data Normalization
Given the different dimensions and magnitudes of the factors above, the data were normalized to eliminate the variation in dimensions, avoid large differences in the magnitudes of the input and output data, and balance the contributions of the various factors. All data were scaled to the range [0, 1].
Table 3 shows the normalization formulas and the specific interpretations of the independent variables.

In the formula, the input vector is x ∈ R^n, the hidden-layer output is h ∈ R^m, and the output vector is y ∈ R^k. The weight matrix connecting the input layer to the hidden layer is W^(1) ∈ R^(m×n), with bias b^(1) ∈ R^m; the weight matrix and bias from the hidden layer to the output layer are W^(2) ∈ R^(k×m) and b^(2) ∈ R^k.
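The single-hidden-layer network described by these quantities (W^(1), b^(1), W^(2), b^(2)) can be sketched in NumPy. The hidden-layer size, random weights, and sigmoid activation below are illustrative assumptions, not the fitted model; min-max normalization to [0, 1] is included as the preprocessing step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Min-max normalization to [0, 1], applied column-wise to each driver.
def min_max(x):
    return (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

X = min_max(rng.normal(size=(8, 13)))  # 8 samples, 13 driving factors

n, m, k = 13, 10, 1                    # input, hidden, output sizes (m is illustrative)
W1 = rng.normal(size=(m, n)); b1 = rng.normal(size=m)  # input -> hidden
W2 = rng.normal(size=(k, m)); b2 = rng.normal(size=k)  # hidden -> output

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

h = sigmoid(X @ W1.T + b1)   # hidden-layer output, h in R^m per sample
y = sigmoid(h @ W2.T + b2)   # fire-occurrence probability in (0, 1)
print(y.shape)
```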

Radial Basis Function Neural Network
The radial basis function (RBF) neural network is a feedforward structure with an input layer, a single hidden layer, and an output layer [61]. Its advantages are concise training and fast learning convergence, and it can approximate any nonlinear function. It has been widely used in time-series forecasting, nonlinear control systems, and graphics processing. The basic idea of an RBF network is as follows. The RBF is used as the "base" of the hidden units to form the hidden-layer space. The hidden layer transforms the input vector, mapping the low-dimensional input data into a high-dimensional space, so that the data become linearly separable there. The output of the RBF neural network is

y_j = Σ_{i=1}^{m} w_ij · exp(−‖x − c_i‖² / (2σ_i²)),  j = 1, …, k,

where c_i and σ_i are the center and width of the i-th hidden unit and w_ij is the weight from hidden unit i to output j.

The optimal solution α* = (α_1*, …, α_l*)^T is then obtained. A positive component α_j* with 0 < α_j* < C is selected, and the threshold is calculated as

b* = y_j − Σ_{i=1}^{l} y_i α_i* K(x_i, x_j).

Finally, the decision function is constructed:

f(x) = sgn( Σ_{i=1}^{l} α_i* y_i K(x, x_i) + b* ).

Model Performance Evaluation
In this study, we used five performance indicators: accuracy, precision, recall, f1 value, and area-under-the-curve (AUC) value.

The form is shown in Table 4:
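From the entries of the confusion matrix in Table 4 (TP, TN, FP, FN), the five indicators can be computed directly. The labels and scores below are made-up values for illustration; AUC is computed here via its rank-statistic equivalence (the probability that a random positive is scored above a random negative).

```python
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # illustrative labels
y_prob = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.6, 0.7, 0.3])   # illustrative scores
y_pred = (y_prob >= 0.5).astype(int)

tp = int(np.sum((y_pred == 1) & (y_true == 1)))
tn = int(np.sum((y_pred == 0) & (y_true == 0)))
fp = int(np.sum((y_pred == 1) & (y_true == 0)))
fn = int(np.sum((y_pred == 0) & (y_true == 1)))

accuracy = (tp + tn) / (tp + tn + fp + fn)        # overall fraction correct
precision = tp / (tp + fp)                        # TP / (TP + FP)
recall = tp / (tp + fn)                           # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

# AUC as the fraction of (positive, negative) pairs ranked correctly.
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
auc = np.mean(pos[:, None] > neg[None, :])

print(accuracy, precision, recall, f1, auc)
```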

The comparison between the predicted and actual values in the test dataset is shown in Figure 4. Note: due to the large sample size, only part of the sample comparison chart is displayed; this is also the case for the following comparison charts.

Radial Basis Function Neural Network
The input and output layer variables of the RBF neural network were the same as those of the ANN.

Support-Vector Machine
We used the LIBSVM package in the MATLAB software to construct the SVM. The model was built with the RBF kernel function to handle nonlinear data. We used the grid-search method and 10-fold cross-validation to select the parameters and determine the penalty parameter C and the kernel parameter g. Figure 6
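The study runs this search in LIBSVM under MATLAB; an analogous sketch in Python with scikit-learn is shown below. The synthetic dataset stands in for the fire/non-fire samples, the parameter grids are illustrative, and scikit-learn's `gamma` plays the role of the kernel parameter g.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the sample points with 13 driving factors.
X, y = make_classification(n_samples=200, n_features=13, random_state=0)

# Grid search with 10-fold cross-validation over C and gamma (g).
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
    cv=10,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```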

Random Forest
We used the randomForest package in the R language to train the model on the random training samples. We then used cross-validation to determine the optimal model parameters and the optimal number of decision trees, obtaining the number of trees and the accuracy on the test and training data. As shown in Figure 8, when the number of decision trees reaches 400, the accuracy stabilizes. We used the optimal number of decision trees to create the comparison charts of the actual and predicted values for the test set (Figure 9) and the mean decrease in accuracy of the 13 forest fire driving factors (Figure 10). It can be seen from Figure 10
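The paper performs this step with the randomForest package in R; an analogous sketch in Python with scikit-learn follows. The data and tree counts are illustrative, and scikit-learn's default impurity-based importances stand in for the paper's mean decrease in accuracy (the closer analogue would be `sklearn.inspection.permutation_importance`).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the fire samples with 13 driving factors.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Compare tree counts by cross-validation; 400 trees were stable in the study.
for n_trees in (100, 400):
    score = cross_val_score(
        RandomForestClassifier(n_estimators=n_trees, random_state=0),
        X, y, cv=5,
    ).mean()
    print(n_trees, round(score, 3))

# Fit at the chosen tree count and rank the driving factors by importance.
rf = RandomForestClassifier(n_estimators=400, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[:5])  # indices of the five most important factors
```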

We used the prediction results of the four models to construct confusion matrices and obtain the accuracy, precision, recall, f1 value, and AUC value, as shown in Table 6. Figure 11 visualizes the accuracy, precision, recall, and f1 values of the four models, and Figure 12 shows their ROC curves. The accuracy and f1 value of each model exceed 75%, and the AUC value exceeds 0.80; thus, all four models perform well. Among the four, the RF model has the highest predictive ability, with an accuracy of 89.2%, an f1 value of 89%, and the highest AUC value, reaching 0.960. Compared with the other three models, the RBF neural network has the lowest predictive ability, with an accuracy of 75.8% and an AUC value of 0.840. As shown in Figures 11 and 12, the RF model outperforms the other three models. We therefore consider the RF model the most suitable of the four for forest fire prediction in China.

We entered the forest fire driving factors selected by feature selection into the four models (ANN, RBF neural network, SVM, and RF) for training. We then evaluated them using five criteria: accuracy, precision, recall, f1 value, and AUC value.