Ecological Footprint Prediction based on Global Macro Indicators in G-20 Countries using Machine Learning Approaches


Encompassing human demand in terms of grazing land, built-up infrastructure, cropland, forest products, and carbon uptake, the ecological footprint (EF) is one of the most important environmental-economic issues in the world. In the present study, data from global databases were used. The ability of penalized regression approaches (PR, including Ridge, Lasso, and Elastic Net) and artificial neural networks (ANN) to predict EF indices in the G-20 over the past two decades (1999–2018) was examined and compared. For this purpose, 10-fold cross-validation was used to assess predictive performance and to tune the penalty parameter of the PR models. Based on the results, only a slight improvement in prediction performance over linear regression was observed. The Elastic Net model selected more global macro indices than Lasso; although Lasso retained only some of the indicators, it still had the best predictive performance among the PR models. Even though the findings from the PR methods were only slightly better than linear regression, their value in selecting a subset of controllable indicators, by shrinking the coefficients and creating a parsimonious model, was apparent. As a result, penalized regression methods would be preferred on the grounds of feature selection and interpretability rather than predictive performance alone. On the other hand, neural-network-based models, with higher coefficients of determination (R2) and lower RMSE values than PR and OLS, performed notably well and proved more accurate in predicting EF. The results showed that the ANN could provide accurate and appropriate predictions for EF indicators in the G-20 countries.

Artificial neural networks (ANNs), proposed by McCulloch and Pitts (1943), are processing systems inspired by the neural networks of the human brain (Van Gerven and Bohte 2017). Relatively raw data can be handled directly by neural networks, and the network is trained to potentially acceptable results (Mignan and Broccardo 2020). Several ANN forms have been released so far, but they all share an analogous framework (Devillers 1996). Thus, a basic ANN is created by connecting multiple layers, namely input, output, and hidden layer(s), that transmit information from one artificial neuron to another (Sözen et al. 2005). A three-layer ANN is illustrated in Figure 1; each layer contains its own weight matrix, bias vector, and output vector. The network receives a set of input predictors, computes their weighted sum using the summation function, and then generates output through an activation function. In other words, the inputs received by a node are combined and scaled by weight and bias coefficients and then transformed through a non-linear activation function (Khan and Roy 2018), producing the input to each neuron of the next layer. The final output is passed through a transfer function that maps the summed input to the output value. The logistic (sigmoid) function is the most widely used transfer function, providing the non-linearity needed in the mapping (Dunn et al.). The activation functions link the weights to produce the output at each layer in the forward pass, and the result of a forward pass is taken as the new prediction. After evaluating the derivatives of the error function between the predicted and actual outputs, the backward pass starts: the derivatives are propagated backward, the weights are updated, and new error terms are computed for each layer. This procedure is repeated layer by layer until the input layer is reached again.
The amount of training is indicated by the number of epochs and the learning rate, and should be monitored against a validation or test set to avoid over-fitting (Waldmann 2018). Variable selection with an ANN proceeds by first training the network with all predictor variables in the input layer. To remove predictors from the input layer, a backward propagation algorithm is applied: irrelevant input nodes and their connection weights are deleted, and the input weights used for model training are then corrected by reducing the error in subsequent iterations. Additional settings may be required to remove unusual input nodes (Tetko et al. 1996). An ANN learns from experience and recognizes relationships between variables even when no tangible relationship is evident (Bandyopadhyay and Chattopadhyay 2007). In modeling with ANNs, the numbers of hidden layers and hidden nodes are important; improper selection of these parameters impairs the model's ability to generalize. Two issues can occur in these cases: (1) underfitting, when the network performs poorly on the training set, cannot fit the data, and consequently cannot perform well on the test set; and (2) overfitting, when the network performs remarkably well on the training set and fits the data very closely, but is unable to generalize to unseen data (Janković et al. 2020).
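The forward and backward passes described above can be sketched with a minimal two-layer network. The architecture (3 inputs, 6 sigmoid hidden units), learning rate, and toy data below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    # Logistic transfer function, the most widely used non-linearity.
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: 3 standardized predictors, one bounded response.
X = rng.normal(size=(50, 3))
y = sigmoid(X @ np.array([1.0, -2.0, 0.5]))[:, None]

# Weight matrices and bias vectors for a 3-6-1 network.
W1, b1 = rng.normal(scale=0.5, size=(3, 6)), np.zeros(6)
W2, b2 = rng.normal(scale=0.5, size=(6, 1)), np.zeros(1)

lr = 0.5
losses = []
for epoch in range(500):
    # Forward pass: weighted sums followed by activation functions.
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    losses.append(float(np.mean((y_hat - y) ** 2)))

    # Backward pass: error derivatives propagated layer by layer.
    d_out = (y_hat - y) * y_hat * (1 - y_hat)   # output-layer error term
    d_hid = (d_out @ W2.T) * h * (1 - h)        # hidden-layer error term

    # Weight updates (gradient descent on the mean squared error).
    W2 -= lr * (h.T @ d_out) / len(X)
    b2 -= lr * d_out.mean(axis=0)
    W1 -= lr * (X.T @ d_hid) / len(X)
    b1 -= lr * d_hid.mean(axis=0)

print(f"training MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

Monitoring `losses` against a held-out set, as the text recommends, is how over-fitting would be detected in practice; here only the training error is tracked for brevity.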

Literature review
The inhumane effects of global warming and environmental degradation threaten the global economy and pose a challenging issue for humanity in the 21st century. The world is trying to curb the spread of ecological degradation by enacting laws and regulations through national and international agreements. Improper consumption patterns and lifestyles have destroyed much of what nature formed over millions of years, and there are conflicting arguments about the destruction of the environment and natural resources. Human needs have changed ecosystems through ecological pressures such as land-use change, resource extraction, deforestation, overfishing, and pollution (Rudolph and Figge 2017). From 1961 to 2010, demand for renewable resources increased by about 140% (from 7.6 billion to 18.1 billion gha), so the planet's bio-productive capacity will not be adequate to meet human needs (Galli et al. 2015). According to the EF Atlas (Global Footprint Network 2010), we have been living in ecological overshoot since the 1970s: our use of the planet's resources exceeds its ability to regenerate them, and it now takes one and a half years to reproduce the resources we consume in a single year. Undoubtedly, studies on the relationship between natural resources and comprehensive environmental indicators such as EF are insufficient, and more research is needed to move towards a sustainable environment. A statistical look at the ecological footprint, its factors, and its prediction using different models therefore seems necessary; compared to classical prediction models, machine-learning approaches offer potential gains in efficiency. The G-20 countries (https://www.g20.org) account for a major share of global economic activity, so studying EF there using machine-learning techniques to reduce environmental deterioration is attractive and not without merit. To our knowledge, there is no such study.
Therefore, this paper applies machine learning (ML) to EF research and evaluates its potential benefits in the G-20 countries over the last two decades. In the current study, these models were trained, validated, and then tested to comprehensively evaluate their statistical performance.

Descriptive statistics and data preparation
The data needed to examine ecological footprint indicators over the last two decades (1999–2018) were extracted from https://databank.worldbank.org, http://www.fao.org/, https://freedomhouse.org/, and https://data.footprintnetwork.org. To support more effective decision-making in the G-20, we considered three dependent variables (outputs), i.e., Ecological Footprint (EF), EFB, and TEFB, along with predictors including Political Rights and Civil Liberties (PRPRCL). The studied predictor variables and their sources are introduced in Table 1, and some descriptive information on the data set over the last two decades is presented in Table 2.

Table 1: Introduction and definition of studied factors

Table 3 shows the descriptive statistics of the predictor variables for each country. The normality of the residuals (errors) of the linear regression model and their autocorrelation were investigated using the Shapiro-Wilk and Durbin-Watson tests. The Durbin-Watson statistic, which measures the correlation between successive residuals, is defined as:

$D = \dfrac{\sum_{t=2}^{n}(e_t - e_{t-1})^2}{\sum_{t=1}^{n} e_t^2}$

where a value of $D$ in the range 1.5–2.5 indicates no autocorrelation between the residuals. After fitting the initial model, multicollinearity was examined using the variance inflation factor (VIF; Myers and Myers 1990):

$\mathrm{VIF}_j = \dfrac{1}{1 - R_j^2}$

where $R_j^2$ is the coefficient of determination from regressing the $j$-th predictor on all the other predictors. Finally, the validation of the initial fitted model is depicted (for instance, in Figure 2).
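The two diagnostics above can be computed directly from the residuals; the following sketch does so on simulated data (not the paper's G-20 panel), with two deliberately correlated predictors so the VIF flags multicollinearity.

```python
import numpy as np

# Simulated data; x2 is deliberately correlated with x1.
rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 2 + x1 - x2 + rng.normal(size=n)

# OLS fit and residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Durbin-Watson statistic: values near 2 (roughly 1.5-2.5) suggest
# no first-order autocorrelation in the residuals.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def vif(X_pred, j):
    # Regress predictor j on the remaining predictors, then 1 / (1 - R^2).
    others = np.column_stack([np.ones(len(X_pred)),
                              np.delete(X_pred, j, axis=1)])
    target = X_pred[:, j]
    b, *_ = np.linalg.lstsq(others, target, rcond=None)
    r2 = 1 - np.sum((target - others @ b) ** 2) \
           / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

predictors = np.column_stack([x1, x2])
vifs = [vif(predictors, j) for j in range(predictors.shape[1])]
print("Durbin-Watson:", round(dw, 2), "VIFs:", [round(v, 1) for v in vifs])
```

With independent noise the Durbin-Watson value lands near 2, while the strong x1-x2 correlation drives both VIFs well above the common rule-of-thumb threshold of 10.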

Statistical procedures and model configuration
If we consider a standard multiple linear regression model

$y = \mathbf{1}\beta_0 + X\beta + e$

where $y = (y_1,\dots,y_n)^{\top}$ is the response-variable vector, $X$ is the $n \times p$ matrix of predictor variables, $\beta_0$ is the intercept, $\beta = (\beta_1,\dots,\beta_p)^{\top}$ is the vector of regression coefficients, and $e$ is the vector of error terms, assumed normally distributed, $e \sim N(0, \sigma_e^2 I)$, then $\beta_0$ and the $\beta_j$ are estimated by minimizing the residual sum of squares (RSS) (Waldmann et al. 2013):

$\hat{\beta}^{OLS} = \arg\min_{\beta_0,\beta} \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2$

and the PR coefficients are expressed as:

$\hat{\beta}^{PR} = \arg\min_{\beta_0,\beta} \Big\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + P(\lambda,\beta)\Big\}$

Here, to prevent overfitting, the hyper-parameter $\lambda$ tunes the amount of shrinkage imposed by the penalty function. In fact, the bias-variance trade-off is set by this hyperparameter: its value is directly related to bias and inversely related to variance, i.e., as $\lambda$ increases, bias increases and variance decreases.

Applying an $\ell_2$-norm penalized least-squares criterion [i.e., $P(\lambda,\beta) = \lambda\lVert\beta\rVert_{\ell_2}^2$] to the linear regression coefficients (Hoerl and Kennard 1970), the RR estimates are obtained as:

$\hat{\beta}^{RR} = \arg\min_{\beta_0,\beta} \Big\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2\Big\}$

In RR, the shrinkage is tuned so that no coefficient becomes exactly zero; the penalty only reduces their variance, so the estimates are biased.

In another PR variant, the coefficient values are obtained by applying the Lasso constraint (an $\ell_1$-norm penalized least squares, $P(\lambda,\beta) = \lambda\lVert\beta\rVert_{\ell_1}$) (Tibshirani 1996). An important feature of Lasso is that it allows coefficients to be exactly zero, thus performing variable selection. If the predictors are standardized so that $\sum_{i} x_{ij} = 0$ and $\sum_{i} x_{ij}^2 = 1$, the Lasso coefficients are estimated as:

$\hat{\beta}^{Lasso} = \arg\min_{\beta_0,\beta} \Big\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert\Big\}$

The EN method is another form of PR, combining the two penalties applied in RR and Lasso (Zou and Hastie 2005):

$\hat{\beta}^{EN} = \arg\min_{\beta_0,\beta} \Big\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\Big[(1-\alpha)\sum_{j=1}^{p}\beta_j^2/2 + \alpha\sum_{j=1}^{p}\lvert\beta_j\rvert\Big]\Big\}$

where $0 \le \alpha \le 1$ is the penalty weight. If $\alpha = 1$, EN behaves like Lasso but modifies how it deals with highly correlated variables (Waldmann et al. 2013).
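The four estimators above can be contrasted in a few lines of scikit-learn. The data below are simulated (only the first three of eight predictors matter), and the penalty strengths are illustrative; the paper tunes them by 10-fold cross-validation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
n, p = 100, 8
X = rng.normal(size=(n, p))
# Only the first three predictors truly drive the response.
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta_true + rng.normal(scale=1.0, size=n)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),        # l2 penalty: shrinks, never zeroes
    "Lasso": Lasso(alpha=0.3),        # l1 penalty: can zero coefficients
    "ElasticNet": ElasticNet(alpha=0.3, l1_ratio=0.5),  # mixed penalty
}
zeros = {}
for name, model in models.items():
    model.fit(X, y)
    zeros[name] = int(np.sum(np.abs(model.coef_) < 1e-8))
    print(f"{name}: zeroed coefficients = {zeros[name]}")
```

As the section predicts, OLS and ridge keep all eight coefficients non-zero (ridge only shrinks them), while the Lasso and Elastic Net penalties drive the irrelevant coefficients exactly to zero, performing variable selection.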

Cross-validation and parameter optimization
To maintain the original distribution of the variables, the data in both the training and test sets were scaled using min-max normalization; 70% of the data was used to train the models and the remaining 30% for testing. The performance of the models was evaluated using 10-fold cross-validation: the whole data set was randomly divided into ten equal subsets, one subset was held out as the validation set for testing the model, and the remaining K−1 subsets were used to train it. This method reduces the dependence of performance on a particular train-test split, reduces the variance of the performance criteria, and ensures that the results are free from sampling bias (James et al. 2013). The lambda value that minimizes the cross-validated prediction error in the training set is taken as the optimal value, determined automatically by the cv.glmnet function with its default 10-fold cross-validation. The cv.glmnet function also returns the lambda yielding the simplest adequate model, i.e., the value within one standard error of the optimum, lambda.1se: setting lambda equal to lambda.1se gives a simpler model than lambda.min, though it may be slightly less accurate. The lambda value was chosen over the folds $\mathcal{F}_k$ by the following cross-validation criterion:

$\mathrm{CV}(\lambda) = \frac{1}{K}\sum_{k=1}^{K}\sum_{(x_i, y_i)\in\mathcal{F}_k}\big(y_i - x_i^{\top}\hat{\beta}_{\mathcal{D}\setminus\mathcal{F}_k,\lambda}\big)^2$

where $\hat{\beta}_{\mathcal{D}\setminus\mathcal{F}_k,\lambda}$ is estimated on $\mathcal{D}\setminus\mathcal{F}_k$ (here $\mathcal{D}$ is the full data set and $\mathcal{D}\setminus\mathcal{F}_k$ is the data excluding the $k$-th fold); the $k$-th fold serves as the test set and the rest of the data as the training set. Thus, $\lambda$ was chosen as:

$\hat{\lambda} = \arg\min_{\lambda}\,\mathrm{CV}(\lambda)$
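The tuning scheme just described can be reproduced in Python. The sketch below uses simulated data and reimplements the two rules of cv.glmnet, the error-minimizing lambda ("lambda.min") and the one-standard-error rule ("lambda.1se"), together with min-max scaling and a 70/30 split; grid bounds and fold counts are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(size=200)

# 70/30 train-test split, then min-max scaling fitted on the training set.
split = int(0.7 * len(X))
scaler = MinMaxScaler().fit(X[:split])
X_train, y_train = scaler.transform(X[:split]), y[:split]

lambdas = np.logspace(-3, 0, 20)
cv = KFold(n_splits=10, shuffle=True, random_state=0)
fold_mse = np.empty((10, len(lambdas)))
for k, (tr, va) in enumerate(cv.split(X_train)):
    for j, lam in enumerate(lambdas):
        model = Lasso(alpha=lam).fit(X_train[tr], y_train[tr])
        fold_mse[k, j] = np.mean((model.predict(X_train[va]) - y_train[va]) ** 2)

mean_mse = fold_mse.mean(axis=0)
se_mse = fold_mse.std(axis=0, ddof=1) / np.sqrt(10)
i_min = int(np.argmin(mean_mse))
lambda_min = lambdas[i_min]
# One-standard-error rule: the largest lambda whose CV error is within
# one SE of the minimum -> a simpler but nearly as accurate model.
within = np.where(mean_mse <= mean_mse[i_min] + se_mse[i_min])[0]
lambda_1se = lambdas[within.max()]
print(f"lambda.min = {lambda_min:.4f}, lambda.1se = {lambda_1se:.4f}")
```

Because lambda.1se is by construction at least as large as lambda.min, it shrinks the coefficients more and so yields the sparser of the two models, matching the trade-off described in the text.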

Performance evaluation
The behavior of the models when presented with new data was evaluated using the following criteria: MSE, RMSE, MAD, MAE, and the coefficient of determination (R2).
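These criteria are straightforward to compute; the sketch below applies the standard definitions to toy predictions. Note that MAD is taken here as the median absolute deviation of the errors, one common convention, since the paper does not spell out its formula.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Standard regression performance criteria."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)                       # mean squared error
    rmse = np.sqrt(mse)                           # root mean squared error
    mae = np.mean(np.abs(err))                    # mean absolute error
    # MAD as median absolute deviation of the errors (assumed convention).
    mad = np.median(np.abs(err - np.median(err)))
    # R^2: 1 minus residual sum of squares over total sum of squares.
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAD": mad, "R2": r2}

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.8, 5.3, 2.4, 6.5])
metrics = evaluate(y_true, y_pred)
print({k: round(v, 4) for k, v in metrics.items()})
```

A model is preferred when it attains the highest R2 together with the lowest error criteria, which is exactly how the PR, OLS, and ANN models are ranked in Tables 4 and 5.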

Results and discussion
Ecological footprint, as a measure of human demand on natural capital, can help people understand consumption and its impact on the planet and convince local leaders to improve people's well-being by investing in it. According to Table 3, the EF indicator of the United States, Canada, and Australia is higher than that of the other G-20 countries.

Obviously, correlation does not imply causality; although finding causation is not the goal of the present study, it can be helpful to evaluate the potential relationships between parameters, especially between independent and dependent variables. The corresponding P-values for variables whose correlation was not significant (P > 0.05) are specified in Figure 3. Strong correlation among predictors can confuse interpretation. For example, there is a significant correlation of 0.70 between "ICRL" and "GDP", and both are also correlated with the response variable (EF). When we fitted the regression model on these two variables together, the "GDP" coefficient was positive and the "ICRL" coefficient was negative, implying that "GDP" affects "EF" positively and "ICRL" negatively. When we instead fit the model on each of these variables separately, the coefficient of each variable was unexpectedly positive.

This means that the regression coefficient of each variable depends on which other variables are included in the model; in addition, the coefficients of these correlated variables become over-inflated. These coefficients fluctuated sharply, indicating that they are prone to over-fitting the training set (i.e., high variance in the bias-variance trade-off). VIF can be used to find correlated variables, but this index does not always specify which variables should be removed (Myers and Myers 1990). A better option is to use PR to control the estimation of the coefficients. PR models impose constraints on the coefficient values and shrink them towards zero; these constraints reduce the magnitude and fluctuation of the coefficients and the variance of the model. The feature importances show considerable variance, so to deal with this problem and with multicollinearity, using PR models seems reasonable (Kuhn and Johnson 2013). All features (predictors) were retained in the model by RR but were severely shrunk in weight compared to OLS (Figure 4-B vs. 4-A). PR models, with fewer free parameters than OLS, are less exposed to over-fitting the training set; although OLS may perform better than PR on the training data, a PR model generalizes better to new data in the presence of extreme variance. In general, a predictive model aims to maximize outcome prediction rather than to uncover underlying causal structure (Yarkoni and Westfall 2017), although the predictive and causal viewpoints share some resemblances. Statistically, the model with the best fit is not necessarily the most effective at predicting real-world results (Shmueli 2010). Lasso allows some coefficients to be exactly zero and, in the case of correlated variables, retains only one variable and sets the rest to zero (Wei et al. 2015). This approach therefore performs variable selection and yields a final model with fewer parameters, as seen in Figure 4-C. On this basis, the coefficients of "Year", "TPOP", "UP" and "PROTOCOL" were shrunk to zero, creating a parsimonious model. In this way, Lasso lets us focus on the strongest predictors to understand how TEFB will change; in the variable-selection process, "CO2E", "GFC" and "GDP" were identified as the most important variables. Not surprisingly, EF is strongly associated with these global macro indicators. It is noteworthy that "Year", "TPOP" and "UP" are of little importance, indicating that they are not the main drivers of EF prediction in the PR model. Figure 4 confirms the results of Table 4, with the most accurate prediction obtained by Lasso, followed by EN and RR.

Table 4: Comparison of validation of prediction for ecological footprint indicators in G-20 countries using linear and penalized regression models

In fact, reality may be much more complex than any proposed model. Hence, there is no assurance that the events routinely studied are so simple that they can be approximated by models understandable to humans (Yarkoni and Westfall 2017): "Everything should be made as simple as possible, but not simpler" (attributed to Albert Einstein). In Figure 4-D, EN behaves as a combination of RR and Lasso, both shrinking and selecting variables. Overall, the PR models performed better than OLS, with decent R-squared and stable RMSE values.
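The coefficient instability described above can be demonstrated on simulated data (not the paper's ICRL/GDP series): with two highly correlated predictors, OLS can assign the joint coefficients opposite signs even though each predictor relates positively to the response on its own (whether it does depends on the noise draw), while ridge shrinkage pulls the two coefficients towards each other.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # near-perfect correlation with x1
y = x1 + x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([x1, x2])
ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Fitted separately, each predictor has a clearly positive slope.
b1 = LinearRegression().fit(x1[:, None], y).coef_[0]
b2 = LinearRegression().fit(x2[:, None], y).coef_[0]

print("separate slopes:   ", round(b1, 2), round(b2, 2))
print("OLS joint coefs:   ", np.round(ols.coef_, 2))
print("ridge joint coefs: ", np.round(ridge.coef_, 2))
```

The gap between the two joint OLS coefficients is essentially arbitrary (high variance), whereas the ridge penalty collapses that gap, which is the stabilizing behavior the section attributes to PR.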
Plots of test MSE by λ value and the practical aspects of the PR models are shown in Figures 5, 6 and 7 for EF, EFB and TEFB. Shrinkage forces the coefficients of correlated variables towards each other rather than allowing one to be positive and the other negative; in this way we reduced the data noise, making the model more accurate in identifying real signals. Correlated predictors cause their coefficients to be inflated, and the advantage of EN is that it combines RR-style adjustment with the Lasso variable-selection feature.

The studied models are assessed on the highest coefficient of determination (R2) and the lowest RMSE values; MSE, MAD, and MAE values are also available in Tables 4 and 5. Table 4 compares the validation of EF-indicator prediction using OLS and PR. A visual comparison of the ANN and LM predictions is shown in Figure 11; according to this figure, the ANN predictions are centered around the line more closely than those made by the LM. Prediction validation for the EF indicators using ANN compared to OLS is presented in Table 5.

Table 5: Comparison of validation of prediction for ecological footprint indicators in G-20 countries using linear regression and artificial neural networks

As shown in Table 5, the MSE, RMSE, MAD, and MAE of the ANN models are lower than those of the LM, and the ANN models predicted new data better than the OLS model. For EF, the ANN model with six neurons performed best, while for the other two indicators the two-layer model performed better, as expected given its remarkable flexibility/nonlinearity. The RMSE and R2 obtained for EF with the appropriate ANN model were 0.413 and 0.9082, respectively; the corresponding values were 0.350 and 0.991 for EFB, and 0.695 and 0.941 for TEFB. Overall, the results obtained are consistent with previous studies and the literature in this area.
By studying environmental impact prediction using neural network modeling, Spitz and Lek (1999) concluded that ANNs can learn the complex relationships between ecological variables, evaluate impacts, and produce operational predictions. They suggested that reasonable predictions could help managers take preventive, protective, and compensatory measures. Similar benefits have been reported for industrial performance, especially in helping executives make effective business decisions while accounting for the costs of CO2 monitoring.

Conclusions
In this study, we presented the application of PR and ANN models to predict EF based on global macro indicators in the G-20 countries. The results show that EF values are highest in the United States, Canada, Australia, the European Union, and Germany. Using PR approaches to identify important global macro indicators proved valuable. Among the PR methods, Lasso was parsimonious in selecting variables yet performed best; still, the number of indicators selected by EN was higher, meaning that this model was more generous in variable selection. Overall, however, there was no substantial difference in variable selection, and therefore in interpretation, between Lasso and EN. Understanding both the variable-selection characteristics and the predictive behavior of each model is essential for informed decision-making. For example, translating a well-predicting forecasting model with many indicators into a brief tool may be difficult; in contrast, an interpretable model with poor predictive performance may call into question the usefulness of its indicators (Greenwood et al. 2020). The results showed better predictive performance of PR compared to OLS, albeit slightly, and the ANNs showed the highest prediction accuracy among the models evaluated. In general, both types of models could predict the EF indicators, but the relatively simple ANN architecture outperformed the PR models in predicting EF from global macro indicators in the G-20.