Agricultural Irrigation Area Prediction based on Improved Random Forest Model

Abstract: Food security is a major issue of common concern throughout the world, and the prediction of irrigation area can help address food and agricultural problems. In this paper, world data on grain production and irrigation area are analyzed, and an improved Random Forest regression model is proposed and applied to the prediction of irrigation area. Based on the ordinary Random Forest and Limit Tree regression algorithms, an improved Random Forest prediction model for irrigation area in China is proposed. First, the arithmetic mean (AMM) of the mean square error (MSE) and the mean absolute error (MAE) is used as the improved impurity function and as the evaluation index of the irrigation-area prediction. Then, grid search is used to determine the optimal number of decision trees in the ordinary Random Forest and in Limit Tree regression (70 and 30 trees, respectively), and a new improved Random Forest model is established. Next, the model is compared with other prediction models, and 10-fold cross-validation demonstrates its rationality. Finally, error analysis of the improved Random Forest model shows that the prediction error is small. The model is expected to be applied to the annual analysis of irrigation area in China.


Introduction
In recent years, the problems of food production and food supply have once again become a focus of attention all over the world. Agriculture is the source of food [1]. Irrigation area is an important factor determining agricultural development and grain harvests, and irrigation is an important land-management measure in agriculture [2]. The prediction of irrigation area is therefore of great help to the development of agriculture.
Irrigation area is an important item of basic agricultural data in the database of the Food and Agriculture Organization of the United Nations (FAO). This paper establishes a prediction model of irrigation area based on the data provided by FAO.
Random Forest regression is an important prediction algorithm in data analysis and data mining [3], and it can solve nonlinear problems. In machine learning, a random forest is an ensemble of multiple decision trees; for classification its output is the mode of the classes output by the individual trees, and for regression it is their average. The concept of random forests evolved from the Random Decision Forests proposed by Ho of Bell Laboratories in 1995 [4]. Leo Breiman and Adele Cutler [7] then developed the random forest algorithm, combining Breiman's idea of "bootstrap aggregation" with Ho's "random subspace method" [4] to construct a set of decision trees; the output of the forest is the bagged aggregation of the outputs of the individual trees. This model can solve nonlinear prediction problems while avoiding large errors and overfitting.
Based on the Random Forest and Limit Tree regression algorithms, this paper improves their impurity function, obtaining a new evaluation index, Average-MSE-MAE, and studies it. At the same time, an improved prediction method for irrigation area is obtained [8].
The work and innovations of this paper are as follows: (1) Based on the Random Forest and Limit Tree regression models, the impurity function of the CART algorithm and the Random Forest model are improved.
(2) Grid search is used to tune the number of decision trees and obtain better parameters.
(3) Cross-validation is used to evaluate and train the model multiple times, reducing the probability of model error.
(4) Combining the Random Forest and Limit Tree regression prediction models, a new improved Random Forest model is proposed and compared with other prediction models.
(5) China, a major agricultural country, is selected as a representative case.
1 Materials and methods

Grain types and production analysis
This study is based on two agricultural data sets from the FAO statistical database (FAOSTAT): detailed data on agricultural irrigation area from 1961 to 2017 and on food products of countries around the world from 1961 to 2007. In order to study and predict the national irrigation area (unit: 1000 Ha) and establish a prediction model, this paper focuses on China, a major grain-producing country. The research is divided into the following steps: raw data preprocessing, screening of important agricultural data for China, establishment of a mathematical prediction model, experimental evaluation, improvement of the prediction model, and comparative analysis. The simplified process is shown in Fig.1.

First, a statistical analysis of the world's food types is carried out with Matplotlib and Seaborn, as shown in Fig.2. The top three items in the world's total production are milk (excluding butter), eggs, and alcoholic beverages (excluding beer), which far exceed the totals of the other items. This shows that the demand for dairy products is relatively large in all countries. In practice, grain production and planting cannot be separated from a country's irrigation area, so the data exploration below turns to the irrigation area of various regions of the world.

The raw irrigation-area data collected from FAOSTAT are processed by deduplication, cleaning, and transposition. Visualizing them with Matplotlib and Seaborn yields the trend of irrigation area over time, as shown in Fig.3. China ranks first in irrigation area and production quantity; by analysis and in practice, China is a major grain-producing country, so it is meaningful to study the irrigation area of China.
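The preprocessing steps above (deduplication, cleaning, transposition) can be sketched with pandas. The file name and column names below are hypothetical stand-ins for the FAOSTAT export, and a tiny inline table replaces the real data:

```python
import pandas as pd

# Toy stand-in for the raw FAOSTAT export; the real file and its
# column names ("Area", "Year", "Value") are assumptions.
raw = pd.DataFrame({
    "Area":  ["China", "China", "China", "India"],
    "Year":  [1961, 1961, 1962, 1961],
    "Value": [30400.0, 30400.0, 30600.0, 24700.0],
})

clean = raw.drop_duplicates().dropna(subset=["Value"])  # deduplication + cleaning
# "Transposition": reshape to one row per year, one column per country.
table = clean.pivot_table(index="Year", columns="Area", values="Value")
print(table)
```

From `table`, a per-country series such as `table["China"]` can be passed directly to Matplotlib/Seaborn to obtain the trend plots of Fig.3.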
Below, a more detailed analysis of China's grain and irrigation-area data is carried out. In order to study and predict China's agricultural production, the Random Forest model is improved based on the Random Forest and Limit Tree algorithms, and the irrigation area is studied, analyzed, and predicted [2]. A decision tree consists of a decision graph and output results and is used to create a plan to reach a goal. In machine learning, a decision tree is a prediction model: it predicts the class label or value of samples and is accordingly called a classification tree or a regression tree.
The goal of a decision tree is to create a mathematical model that uses samples to predict target values. The training process of a decision tree [10] is as follows: according to a selected index, the training set is split into multiple subsets, and the subsets are then split recursively. This process, called recursive partitioning, stops when a subset yields the target value.
The year is taken as the variable x, and the predicted variable y is the irrigation area of China (the target value). The data are then presented as the set {(x_1, y_1), (x_2, y_2), …, (x_n, y_n)}, where x_n is the n-th year.
Decision trees are mainly divided into regression trees and classification trees [7], usually referred to jointly as CART (Classification and Regression Trees). In this study, the regression tree is used for prediction (the output of a regression tree is a real number).
The CART algorithm can solve both classification and regression problems. The tree constructed by CART regression is a binary tree, and all inputs that fall on the same leaf node have the same output. In a CART regression tree, each leaf node represents a continuous predicted value of the target.
Selecting and measuring the splitting variables and points is an important step in decision tree learning. Generally, the impurity of the split nodes is used as the measure, and there are two important impurity functions for regression tasks, introduced below [11].

(1) Mean Square Error
First, the data set D = {(x_1, y_1), …, (x_n, y_n)} is given.
The goal of regression prediction is to construct a function F(x) that fits the elements of the data set. The mean square error [13] of the prediction model is given by Equation (3):

MSE = (1/n) Σ_{i=1}^{n} (F(x_i) − y_i)²   (3)
To minimize the mean square error [14], the objective function of Equation (4) is established:

min_F (1/n) Σ_{i=1}^{n} (F(x_i) − y_i)²   (4)

where F(x_i) is the predicted value and y_i is the true value.
Next, assume that a CART regression tree has L leaves, i.e., it divides the input x into L units ω_1, …, ω_L and produces at most L target prediction values. Rewriting the minimum-mean-square-error objective for CART regression gives Equation (5):

min Σ_{k=1}^{L} Σ_{x_i ∈ ω_k} (y_i − Ĉ_k)²   (5)

where Ĉ_k is the predicted value of the k-th leaf obtained by the CART regression prediction model.
Since each leaf yields one predicted value, the final predicted value of a leaf is set to the mean of the training targets it contains, giving Equation (6):

Ĉ_k = mean(y_i | x_i ∈ ω_k)   (6)
To minimize the mean square error of the tree model on the training set, i.e., the sum of the mean square errors of the leaves, appropriate splitting variables and points must be selected when dividing each node of the regression tree. A heuristic method traverses all candidate splitting variables and points and selects the split with the minimum sum of squared errors. Let the m-th feature x^(m) and the value r_m be the splitting variable and point; they divide the set into two regions, giving Equations (7) and (8):

ω_1(m, r) = {x | x^(m) ≤ r_m}   (7)
ω_2(m, r) = {x | x^(m) > r_m}   (8)
According to CART, the optimal splitting variable m and splitting point r are obtained from Equation (9):

min_{m, r} [ min_{C_1} Σ_{x_i ∈ ω_1(m, r)} (y_i − C_1)² + min_{C_2} Σ_{x_i ∈ ω_2(m, r)} (y_i − C_2)² ]   (9)

At the same time, the inner minima are attained at the leaf means, giving Equation (10):

Ĉ_j = mean(y_i | x_i ∈ ω_j(m, r)),  j = 1, 2   (10)
The splitting variable m, splitting point r, and predicted values Ĉ_j are computed repeatedly until the minimum mean square error is reached. The input x is thus divided into L units ω_1, …, ω_L, yielding the decision tree model of Equation (11):

f(x) = Σ_{k=1}^{L} Ĉ_k · I(x ∈ ω_k)   (11)

where I(·) is the indicator function.
The above is the analysis of the mean-square-error impurity function.
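The split search of Equations (7)-(10) can be sketched for the single-feature case relevant here (x is the year), where all candidate split points are traversed and the one minimizing the summed squared error of the two leaves is kept:

```python
import numpy as np

def best_split(x, y):
    """Exhaustive search for the split point r minimising the summed squared
    error of the two resulting leaves (single-feature version of Eq. (9));
    leaf values are the means of their targets (Eq. (10))."""
    best_r, best_err, best_c1, best_c2 = None, np.inf, None, None
    for r in np.unique(x)[:-1]:                 # candidate split points
        left, right = y[x <= r], y[x > r]       # regions of Eq. (7)-(8)
        c1, c2 = left.mean(), right.mean()      # optimal leaf constants
        err = ((left - c1) ** 2).sum() + ((right - c2) ** 2).sum()
        if err < best_err:
            best_r, best_err, best_c1, best_c2 = r, err, c1, c2
    return best_r, best_err, best_c1, best_c2

# Illustrative data with two obvious clusters.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
r, err, c1, c2 = best_split(x, y)
print(r, c1, c2)
```

Applying the search recursively to each resulting region builds the full tree of Equation (11).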

(2) Mean Absolute Error
The mean absolute error is similar to the mean square error: whereas the mean square error measures the prediction error by the square of the difference between the predicted value and the true value, the mean absolute error [15] measures it by the absolute value of that difference. The mean absolute error of a CART regression tree is calculated by Equation (12):

MAE = (1/n) Σ_{i=1}^{n} |F(x_i) − y_i|   (12)

The objective function of the CART regression tree is then the minimum mean absolute error [16], giving Equation (13):

min Σ_{k=1}^{L} Σ_{x_i ∈ ω_k} |y_i − Ĉ_k|   (13)

The procedure for selecting appropriate splitting variables and points is similar to that for the mean square error.
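The two error measures, and the paper's combined Average-MSE-MAE index (their arithmetic mean), can be sketched as follows; the toy values are illustrative only:

```python
import numpy as np

def mse(y_true, y_pred):
    # Equation (3): mean of squared differences
    return np.mean((np.asarray(y_pred) - np.asarray(y_true)) ** 2)

def mae(y_true, y_pred):
    # Equation (12): mean of absolute differences
    return np.mean(np.abs(np.asarray(y_pred) - np.asarray(y_true)))

def average_mse_mae(y_true, y_pred):
    # Arithmetic mean of MSE and MAE: the combined evaluation index.
    return 0.5 * (mse(y_true, y_pred) + mae(y_true, y_pred))

y_true = [1.0, 2.0, 3.0]
y_pred = [1.1, 1.9, 3.2]
print(average_mse_mae(y_true, y_pred))
```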
In order to improve the prediction of China's irrigation area, this study uses China's irrigation-area data by year. Based on the ordinary Random Forest and Limit Tree regression, an improved Random Forest model is established.

Improved Random Forest model based on ordinary Random Forest and Limit Tree algorithm
Bootstrap aggregation, also known as the Bagging algorithm [17], is an ensemble learning algorithm in machine learning [18]. Combined with a regression algorithm, it improves the accuracy and stability of prediction, reduces the variance of the results, and helps avoid overfitting.
The principle of the Bagging algorithm is as follows: given a training set of size n, bootstrap sampling (uniform sampling with replacement) draws a new training set of size n. A regression model is fitted on each such set, and the Bagging result is the average of the models' predictions. The algorithm's steps are shown in Fig.5. This is the Bagging method for the original tree model.
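A minimal sketch of this procedure, using sklearn's BaggingRegressor (whose default base learner is a decision tree) on a synthetic trend series standing in for the real irrigation data:

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor

rng = np.random.RandomState(0)
X = np.arange(57, dtype=float).reshape(-1, 1)        # e.g. 57 years, indexed 0..56
y = 300.0 + 20.0 * X.ravel() + rng.normal(0, 5, 57)  # synthetic, not the FAO series

# 50 decision trees, each fitted on a bootstrap sample; predictions averaged.
model = BaggingRegressor(n_estimators=50, random_state=0)
model.fit(X, y)
pred = model.predict([[30.0]])
print(pred)
```

Averaging over bootstrap samples is what reduces the variance of the single-tree prediction mentioned above.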

(1) Random Forest
To obtain a better prediction-tree model [19], it is extended to the Random Forest regression model [6]. In this learning algorithm, a random subset of the features is selected at each candidate split, a process called Feature Bagging.
Feature Bagging greatly reduces the influence of strong correlation between decision trees. In this paper, sklearn's RandomForestRegressor is called to establish the Random Forest prediction model.
A Random Forest is a model integrating multiple decision trees, as shown in Fig.6.

(2) Limit Tree
The Limit Tree (extremely randomized trees) [20] is also an ensemble of single decision trees. The difference lies in the split points: instead of computing the best point of each feature, as in the ordinary Random Forest, candidate split points are drawn at random, and among them the point with the highest score is selected as the split point of the node. For the regression problem studied in this paper, the number of features considered at each node is n, the total number of features. In this paper, sklearn's ExtraTreeRegressor is called to build the extreme Random Forest prediction model (hereinafter referred to as the Limit Tree).
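The two base models can be sketched with sklearn as below, already using the tree counts tuned later in the paper (70 and 30). Note that sklearn's ExtraTreeRegressor is a single tree; the ensemble counterpart, ExtraTreesRegressor, is used here. The data are synthetic, not the FAO series:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

rng = np.random.RandomState(0)
X = np.arange(57, dtype=float).reshape(-1, 1)        # synthetic year index
y = 300.0 + 20.0 * X.ravel() + rng.normal(0, 5, 57)  # synthetic irrigation trend

rf = RandomForestRegressor(n_estimators=70, random_state=0).fit(X, y)  # best splits
et = ExtraTreesRegressor(n_estimators=30, random_state=0).fit(X, y)    # random splits
print(rf.score(X, y), et.score(X, y))  # in-sample R^2
```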
Next, the improved random forest regression model and limit tree regression model are established for algorithm prediction, and the prediction effect is evaluated.

(3) Improved Random Forest Model
The Chinese irrigation-area data are standardized [21] (made dimensionless) so that the target value approximately follows the N(0, 1) normal distribution. The standardization formula is z = (y − μ)/σ, where μ and σ are the mean and standard deviation of the target. For the Random Forest regression tree, the most important parameter is the number of decision trees.
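The standardization step can be sketched directly in NumPy; the values below are illustrative, not the actual Chinese series:

```python
import numpy as np

y = np.array([30400.0, 30600.0, 31000.0, 31500.0, 32000.0])  # illustrative values
z = (y - y.mean()) / y.std()  # z = (y - mu) / sigma
print(z.mean(), z.std())      # approximately 0 and 1
```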

In order to optimize the prediction model, the number of decision trees is adjusted to get better tree model parameters.
The results are shown in Table 2. It can be seen that the prediction error of the Random Forest is small: when the number of decision trees is 70, the mean square error and mean absolute error are the smallest, only 0.0083 and 0.0608. Finally, min{INDEX_RF} = (0.0083 + 0.0608)/2 = 0.0346.
The prediction effect of Limit Tree regression is also good. The mean square error is smallest with 40 decision trees, while the mean absolute error is smallest with 30. Comparing 30 and 40 trees on both errors together, 30 trees gives the smaller combined values, so 30 trees is chosen as the optimal parameter, with a mean square error and mean absolute error of about 0.0036 and 0.0462, giving min{INDEX_EF} = (0.0036 + 0.0462)/2 = 0.0249.

In conclusion, the number of decision trees is set to 70 for the Random Forest and 30 for Limit Tree regression. Taking these two tuned values as the better parameters of the experiment, the new improved Random Forest model is finally established.
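The grid search over the number of trees, scored by the Average-MSE-MAE index, can be sketched with a manual loop (a custom index is easier to plug in this way than via GridSearchCV). The data are synthetic and the candidate grid is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.RandomState(0)
X = np.arange(57, dtype=float).reshape(-1, 1)        # synthetic year index
y = 300.0 + 20.0 * X.ravel() + rng.normal(0, 5, 57)  # synthetic trend

def index_score(model):
    """Average-MSE-MAE index computed from cross-validated predictions."""
    pred = cross_val_predict(model, X, y, cv=5)
    mse = np.mean((pred - y) ** 2)
    mae = np.mean(np.abs(pred - y))
    return 0.5 * (mse + mae)

grid = {n: index_score(RandomForestRegressor(n_estimators=n, random_state=0))
        for n in (10, 30, 50, 70, 90)}
best_n = min(grid, key=grid.get)   # tree count with the smallest index
print(best_n, grid[best_n])
```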
2 Experimental Evaluation

Comparative Analysis
In order to verify the effect of the improved Random Forest regression model more widely, it is compared with other prediction models [22].
A 10-fold cross-validation was carried out with sklearn. The model scores of KNeighbors, LinearRegression, DecisionTree, SVR, AdaBoost, GradientBoosting, and the original Bagging [9] were obtained and compared [23], as shown in Table 3.

To better evaluate the improved Random Forest regression model for predicting irrigation area, k-fold cross-validation was carried out [24]. Let k = 10; based on the improved Random Forest regression model, 10-fold cross-validation yields the coefficient of determination [22], mean square error [13], and mean absolute error [15], shown in Fig.9. After 10-fold cross-validation, the coefficient of determination of the improved Random Forest model is above 95% in every fold, with a mean of 98%; the mean square error is 135.52 (unit: 1000 Ha) and the mean absolute error only 6.08 (unit: 1000 Ha). This shows that the effect of the improved Random Forest regression prediction model is significant.

The model is used to forecast the irrigation area of different countries in different years, and its rationality is verified by comparison with other prediction models. Through 10-fold cross-validation, the improved prediction algorithm shows a high degree of fit, low mean square and mean absolute errors, and strong adaptability to different training sets. Experiments show that the proposed improved prediction model has smaller error and better robustness; it can effectively predict the annual change of irrigation area and has application value for agricultural production in various countries.
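A comparison in the spirit of Table 3 can be sketched as below, scoring a few of the listed regressors by mean R² over 10 shuffled folds; the series is synthetic and only a subset of the paper's models is shown:

```python
import numpy as np
from sklearn.model_selection import cross_val_score, KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = np.arange(57, dtype=float).reshape(-1, 1)        # synthetic year index
y = 300.0 + 20.0 * X.ravel() + rng.normal(0, 5, 57)  # synthetic trend

cv = KFold(n_splits=10, shuffle=True, random_state=0)
models = {
    "KNeighbors": KNeighborsRegressor(),
    "LinearRegression": LinearRegression(),
    "DecisionTree": DecisionTreeRegressor(random_state=0),
    "RandomForest(70)": RandomForestRegressor(n_estimators=70, random_state=0),
}
# Mean R^2 across the 10 folds for each model.
scores = {name: cross_val_score(m, X, y, cv=cv).mean() for name, m in models.items()}
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.4f}")
```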