Wind speed forecast based on combined theory, multi-objective optimisation, and sub-model selection

Wind energy is a key energy source for a sustainable and pollution-free global power supply. However, because of its characteristic irregularity, nonlinearity, non-stationarity, randomness, and intermittency, previous studies have focused on either stability or accuracy alone, and the forecasting performance of their models was poor. Moreover, previous research did not consider the selection of sub-models for the combined model, which weakened generalisability. Therefore, to further improve the forecasting accuracy and stability of wind speed forecasting models, and to solve the problem of sub-model selection in the combined model, this study developed a wind speed forecasting model using data pre-processing, a multi-objective optimisation algorithm, and sub-model selection for the combined model. Simulation experiments showed that our combined model not only improved forecasting accuracy and stability but also chose different sub-models and different weights for different data, thereby improving generalisability. Specifically, the MAPEs of our model are less than 4.96%, 4.60%, and 5.25% in one-, two-, and three-step forecasting, respectively. Thus, the proposed combined model is demonstrated to be an effective tool for grid dispatching.


Introduction
For decades, meeting the demand for electricity has been a major global challenge, and the use of renewable energy has become increasingly important. From the perspective of economic growth, energy plays a vital role in harnessing power from nature. Fossil fuels have long been used to generate electricity; however, owing to the fossil fuel crisis and current global environmental issues, the energy mix is changing. Renewable energy sources (e.g. wind, tidal, and solar energy) are gaining increasing attention (Wang et al. 2004). Wind energy is clean, abundant, environmentally friendly, inexhaustible, and inexpensive; it represents a feasible alternative to fossil fuels.
In recent decades, significant progress has been made in wind turbines, from the small to the large scale. The reasons for this rapid growth are that many parts of the world are rich in the required raw materials, the turbines are ecologically friendly and achieve low-carbon power generation, and new policies have been implemented towards the introduction of new renewable generators (Enrique et al. 2019; Aguilar Vargas et al. 2019). For example, in France, wind power installation increased by 14.04% in 2017 (Global Wind Energy Council 2019), mainly owing to tariff subsidies. According to statistical research, the global cumulative installed wind capacity reached nearly 591 gigawatts (GW) by the end of 2018, with an annual growth rate of 9.6% (Fried et al. 2021). However, owing to changes in wind speed and direction (especially the former), the integration of a large portion of wind power into power systems faces major challenges (Meng et al. 2016). These challenges can be classified into two categories: operational issues (Yan et al. 2017) and planning and economic issues (Bruninx et al. 2016). Improving wind forecasting is one of the most effective ways to overcome these challenges.
In recent research, improving the forecast accuracy of wind speed has become a hot topic, and new development directions and numerous wind speed forecasting methods have been proposed. Some of these technologies provide more accurate wind speed forecasts for specific wind speed data, whereas others are effective for multiple wind speed datasets (Chang 2014). These different wind-speed forecasting models can be classified into the following six categories: (1) Persistence models. Persistence models, which assume that the wind speed at a certain time in the future will match the most recently observed wind speed (Soman et al. 2010), are used for short-term forecasts. However, as the forecast timescale increases, the forecast accuracy of the model decreases. These models can be used as reference models against which new models for short-term wind-speed forecasting are tested (Sfetsos 2000).
(2) Physical models. Physical models are used in atmospheric weather forecast modelling; they require copious amounts of numerical weather forecast data, including humidity, temperature, pressure, speed, and topological parameters, which leads to a large accumulation of data. To forecast the wind speed, a long calculation time is required for data correlation. Therefore, these models are best suited to long-term wind speed forecasting (Lei et al. 2009). (3) Statistical models. Historical data are used in statistical models to forecast the wind speed. Statistical models include linear and nonlinear models, and these have also been used in time series forecasting (Smith and Mehta 1993; Riahy and Abedi 2008). However, linear methods such as the autoregressive moving average (Torres et al. 2005; Brown et al. 1984), autoregressive integrated moving average (ARIMA) (Grigonyte and Butkeviciute 2016), Box-Jenkins method, and Markov chain models (Shamshad et al. 2005) mostly assume the wind speed to be normally distributed; hence, when the main component of the wind speed dataset is nonlinear, the accuracy of these forecasting methods decreases rapidly. Nonlinear statistical methods require a large amount of historical data to train and develop forecasts; they primarily include fuzzy logic methods (Damousis et al. 2004), support vector machines (SVMs) (Niu et al. 2018), and probabilistic methods (Heng et al. 2022).
(4) Hybrid models. Hybrid methods combine complementary characteristics (e.g. linear and nonlinear data) and the advantages of various methods to obtain the best forecasting performance. (5) Artificial intelligence methods. Artificial intelligence methods, including artificial neural networks (ANNs) such as the back propagation neural network (BPNN) (Guo et al. 2011), generalised regression neural network (GRNN) (Specht 1991), radial basis function neural network (RBFNN) (Specht 1991), multiple layer perceptron (Wang et al. 2014), and long short-term memory (LSTM) (Dorvlo et al. 2002), have also been proposed to forecast wind speeds (Li and Shi 2010; Yu et al. 2017; Wang et al. 2016). To improve the forecasting accuracy, researchers have also used optimisation algorithms such as differential evolution and cuckoo search (Xiao et al. 2015), amongst others. (6) Combined models. The stability and forecasting accuracy achievable with the above methods do not meet the levels desired by wind farm operators. Therefore, to achieve a higher level of accuracy and more stable forecasting results, researchers have developed combined models that incorporate the advantages of single forecast models and can be widely used for wind-speed forecasting (Wang et al. 2021; Xiao et al. 2018).
In general, combined models not only overcome certain difficulties of single models but also combine their advantages, making them superior to the individual models mentioned above. However, for combined models, the selection of sub-models has always been difficult. Some researchers choose both linear and nonlinear models so that the combined model can fit both linear and nonlinear data; others select models that perform well on a given dataset.
However, certain problems remain: (a) There is a lack of complementary theoretical knowledge for selecting linear and nonlinear models; in other words, why should these particular methods be selected as sub-models? (b) The candidate sub-models are limited. The sub-models of the resulting combined model are only the best among those listed; because the candidates do not cover most models, the selected single models are not sufficiently convincing. (c) For different datasets, because of the different data characteristics, it is impossible to find a single combined model suitable for multiple datasets. (d) Most models feature a single objective. For wind speed data, it is usually difficult to guarantee both forecasting accuracy and stability using a single objective function in the combined model, owing to the nonlinear characteristics of the wind-speed time series.
To the best of our knowledge, no model has been proposed that can resolve the problems stated above. Thus, we propose a forecasting model based on model selection, multi-objective functions, and combined model theory. In the following section, we use CM to denote our proposed combined model.
Our contributions and innovations are as follows:
1. A combined model based on outlier detection and processing is constructed; the negative effects of outliers are eliminated using data analysis methods, whilst the main trend of the wind-speed time series is retained. Outliers in the original wind-speed data can cause poor forecasting results, and the data analysis module of the CM eliminates them.
2. Based on singular spectrum analysis (SSA), data pre-processing is used to extract the main features of the wind speed data. The wind speed data are denoised to make them smoother and more reflective of the trend of the original data.
3. An optimal sub-model selection criterion based on multiple forecast criteria is proposed. This new predictive evaluation criterion, called the weighted information criterion (WIC), is used to select the optimal predictive sub-models; it combines six criteria: the mean absolute percentage error (MAPE), root mean square error (RMSE), Akaike information criterion (AIC), Bayesian information criterion (BIC), direction accuracy (DA), and modified direction accuracy (MDA). These criteria are used not only to select the best sub-models from various forecasting models but also to improve the forecasting accuracy.
4. The weights of the sub-models used to build the combined model are obtained using a multi-objective optimisation algorithm.
5. The single models and weights of the ideal combined model vary with the data, indicating that no consistent model is suitable for all datasets.
The remainder of this paper is organised as follows. The theories and methods are introduced in Sect. 2. In Sect. 3, we introduce the multiple objective functions, non-dominated sorting genetic algorithm-III, and model-building process. In Sect. 4, the performance metrics, three numerical experiments, and summary are presented. Finally, Sect. 5 presents the conclusions.

Theories and methods
This section introduces the theories and methods used in CM. In the table in the appendix, we fit the original wind-speed time series using linear and nonlinear functions; we found that the wind speed had both linear and nonlinear characteristics. Therefore, both linear and nonlinear models were selected to forecast and study the wind-speed time series. The basic methods and theories are presented in the appendix. The flowchart of the proposed combined model is shown in Fig. 1.

The non-dominated sorting genetic Algorithm-III and model proposal

For a single-objective model, only one objective function must be determined for forecasting. However, theoretical and practical examples have shown that it is difficult to achieve high stability and accuracy by relying on only one objective. For multi-objective problems, multiple goals must be proposed for the optimisation algorithm to optimise; therefore, the objective functions and the optimisation algorithm [the non-dominated sorting genetic algorithm-III (NSGA-III)] are introduced in this section.

Non-dominated sorting genetic Algorithm-III (NSGA-III)
The NSGA-III (Deb and Jain 2014) includes five main steps: (1) Population initialisation: Decision variables are randomly generated based on the given upper and lower boundaries.
(2) Offspring selection (the selection algorithm for the next generation of individuals; it is based on a genetic algorithm): according to natural selection theory, offspring selection is introduced in the algorithm to determine which individuals are more likely to produce good solutions.
(3) Generation of new offspring: the parent generation generates new offspring via crossovers and mutations. (4) Non-dominated sorting: the solutions are sorted according to non-domination relations. (5) Reference-point-based selection mechanism: to select a new group of size N in the next generation, a reference-point-based selection mechanism is introduced in the NSGA-III; this guarantees the uniformity of the distribution and enhances the optimisation drive for multi-objective optimisation problems. We expect the entire process to identify group members corresponding to reference points close to the Pareto-optimal frontier; these group members constitute a set of Pareto-optimal solutions. A minimal sketch of the non-domination check used in step (4) is given below.
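As an illustration of step (4), the following sketch filters a set of objective vectors down to its non-dominated (Pareto) subset. It is a minimal NumPy implementation of the standard dominance definition, not the authors' code.

```python
import numpy as np

def nondominated_mask(F):
    """Boolean mask of the Pareto-nondominated rows of an
    (n_points, n_objectives) matrix F, all objectives minimised."""
    n = F.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # point j dominates i if it is no worse in every objective
        # and strictly better in at least one
        dominates_i = np.all(F <= F[i], axis=1) & np.any(F < F[i], axis=1)
        if dominates_i.any():
            mask[i] = False
    return mask
```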

Model proposal
The structure and construction process of the proposed multi-objective combined model CM are introduced in this section.
1. We select the data to be tested and use SSA (Abdollahzade et al. 2015) to denoise it.
(1) Embedding: the original series $x_1, \dots, x_N$ is mapped into lag vectors of length $L$, $X_i = (x_i, \dots, x_{i+L-1})^{T}$, $i = 1, \dots, K$, with $K = N - L + 1$, which form the trajectory matrix

$$X = \begin{bmatrix} x_1 & x_2 & \cdots & x_K \\ x_2 & x_3 & \cdots & x_{K+1} \\ \vdots & \vdots & \ddots & \vdots \\ x_L & x_{L+1} & \cdots & x_N \end{bmatrix}, \quad (1)$$

where $X$ is a Hankel matrix whose elements on each anti-diagonal are equal.
(2) Singular value decomposition: singular value decomposition (SVD) is applied to the trajectory matrix $X$. Let $S = XX^{T}$ and calculate the eigenvalues $\lambda_1, \dots, \lambda_L$ of $S$; because $S$ is a symmetric matrix, $\lambda_1 \geq \cdots \geq \lambda_L \geq 0$, and the orthonormal eigenvectors of $S$ corresponding to these eigenvalues are obtained as $U_1, \dots, U_L$.
Let $d = \mathrm{rank}(X)$; then, the SVD of the trajectory matrix $X$ can be expressed as $X = X_1 + \cdots + X_d$, where $X_i = \sqrt{\lambda_i}\, U_i V_i^{T}$ and $V_i = X^{T} U_i / \sqrt{\lambda_i}$. The rank of every matrix $X_i$ is one, and these matrices are referred to as elementary. The vectors $U_i$ are the left singular vectors of $X$, and the set $\{\sqrt{\lambda_1}, \dots, \sqrt{\lambda_d}\}$ is called the spectrum of the trajectory matrix; hence, this is a singular spectrum decomposition.
(3) Grouping eigenvalues: by grouping the elementary matrices $X_i\ (i = 1, \dots, d)$, the index set $\{1, \dots, d\}$ is divided into $m$ disjoint subsets $I_1, \dots, I_m$. The matrix $X_I$ corresponding to group $I = \{i_1, \dots, i_p\}$ is defined as $X_I = X_{i_1} + \cdots + X_{i_p}$. Because the indices are divided into $m$ groups, $X$ can be expressed as $X = X_{I_1} + \cdots + X_{I_m}$.
(4) Diagonal averaging: each matrix $X_I$ of the group decomposition is Hankelised; then, by forming a one-to-one correspondence between the obtained Hankel matrix and time-series data, each Hankel matrix is transformed into a new series of length $N$. For any matrix $Z_{L \times K}$ with elements $z_{ij}$, the $k$th value of the Hankelised series is the average over all elements $z_{ij}$ satisfying $i + j = k + 1$. The diagonal-averaging method is applied to generate a reconstructed series $\tilde{x}^{(k)} = (\tilde{x}^{(k)}_1, \dots, \tilde{x}^{(k)}_N)$ from each group matrix. In this way, the initial series $x_1, \dots, x_N$ is decomposed into the sum of $m$ reconstructed subseries. Taking $1 \leq l \leq m$ and $\dot{x}_n = \sum_{k=1}^{l} \tilde{x}^{(k)}_n\ (n = 1, 2, \dots, N)$, the series $\dot{x}_1, \dots, \dot{x}_N$ is the denoised version of $x_1, \dots, x_N$. A compact implementation of these four steps is sketched below.
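The four SSA steps above can be condensed into a short routine. The following is a minimal NumPy sketch, assuming the leading `n_keep` components are grouped as signal and the rest discarded as noise; the window length and component count in the usage comment are illustrative choices, not values from the paper.

```python
import numpy as np

def ssa_denoise(x, L, n_keep):
    """Denoise a 1-D series by SSA: embed, decompose,
    keep the leading components, then diagonal-average."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    K = N - L + 1
    # (1) Embedding: L x K trajectory (Hankel) matrix of lag vectors
    X = np.column_stack([x[i:i + L] for i in range(K)])
    # (2) SVD of the trajectory matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # (3) Grouping: keep the n_keep leading elementary matrices as signal
    X_hat = (U[:, :n_keep] * s[:n_keep]) @ Vt[:n_keep]
    # (4) Diagonal averaging: average anti-diagonals back into a series
    rec = np.zeros(N)
    count = np.zeros(N)
    for j in range(K):
        rec[j:j + L] += X_hat[:, j]
        count[j:j + L] += 1
    return rec / count

# e.g. denoised = ssa_denoise(wind_speeds, L=48, n_keep=6)
```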
2. According to the model selection algorithm WIC, we select $N$ single models $M_i$ and record each single model. Then, we reconstruct the data.

3. For the $N$ selected single models $M_i\ (i = 1, 2, \dots, N)$, the forecasting value of model $M_i$ is $\hat{y}_{it}$, an initial weight $\hat{\omega}_i$ is given (to construct a combined model $M_0$), and the combined forecasting value is $\hat{y}_t = \sum_{i=1}^{N} \hat{\omega}_i \hat{y}_{it},\ t = 1, 2, \dots, L$, where $t$ represents each time point in the time series. In CM, to achieve high accuracy and stability, one objective function measures the forecasting accuracy and another measures the forecasting stability of the combined forecast against the real values $\{y_t,\ t = 1, 2, \dots, L\}$ of the time series.

4. The multi-objective optimisation algorithm NSGA-III is used to optimise the weight of each forecasting model $M_i\ (i = 1, 2, \dots, N)$ in $\hat{y}_t = \sum_{i=1}^{N} \hat{\omega}_i \hat{y}_{it},\ t = 1, 2, \dots, L$. This includes five steps: population initialisation, offspring selection (the selection algorithm that produces the next generation of individuals in the genetic algorithm), new offspring generation, non-dominated sorting, and reference-point-based selection.

5. A solution set $S$ is obtained. The solutions in $S$ are Pareto-optimal, and each solution corresponds to a set of weight values $\hat{\omega}_{ij}$, where $i$ indexes the weight of the $i$th model $M_i\ (i = 1, 2, \dots, N)$ and $j$ indexes the $j$th solution in $S$. Because the number of objective functions is three, the elements of $S$ are three-dimensional arrays; if one objective function value is reduced, the other objective function values increase. Finally, the solution $s_{best} = \arg\min_{s \in S} \lVert s \rVert_2$, i.e. the solution with the smallest Euclidean distance to the origin, is selected. Because each solution corresponds to a set of weight values, the weight $\hat{\omega}_{i,best}$ of each single model in CM is found; hence, the forecasting value of the optimised CM is $\hat{y}_t = \sum_{i=1}^{N} \hat{\omega}_{i,best} \hat{y}_{it}$. In our study, the parameters of NSGA-III were as follows: generated reference points: 10; maximum number of iterations: 50; population size: 80; crossover percentage: 0.5; mutation percentage: 0.5; and mutation rate: 0.02. A sketch of this weight-optimisation step is given below.
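The following sketch shows the shape of steps 4 and 5 using the pymoo library (assumed installed). The two objectives used here, mean absolute error for accuracy and error standard deviation for stability, are placeholder choices standing in for the paper's objective functions, and the data are synthetic.

```python
import numpy as np
from pymoo.algorithms.moo.nsga3 import NSGA3
from pymoo.core.problem import ElementwiseProblem
from pymoo.optimize import minimize
from pymoo.util.ref_dirs import get_reference_directions

class WeightProblem(ElementwiseProblem):
    """Sub-model weights w minimising (mean |error|, error std dev)
    of the combined forecast; Y_hat is (n_models, T), y is the truth."""
    def __init__(self, Y_hat, y):
        self.Y_hat, self.y = Y_hat, y
        super().__init__(n_var=len(Y_hat), n_obj=2, xl=0.0, xu=1.0)

    def _evaluate(self, w, out, *args, **kwargs):
        e = self.y - w @ self.Y_hat            # combined forecast error
        out["F"] = [np.mean(np.abs(e)), np.std(e)]

rng = np.random.default_rng(0)
y = 8.0 + rng.normal(0.0, 2.0, 144)                # stand-in wind speeds
Y_hat = y + rng.normal(0.0, 0.5, (5, 144))         # five mock sub-model forecasts

ref_dirs = get_reference_directions("das-dennis", 2, n_partitions=10)
res = minimize(WeightProblem(Y_hat, y),
               NSGA3(pop_size=80, ref_dirs=ref_dirs),
               ("n_gen", 50), seed=1, verbose=False)
# step 5: choose the Pareto solution closest to the origin
F, X = np.atleast_2d(res.F), np.atleast_2d(res.X)
best_w = X[np.argmin(np.linalg.norm(F, axis=1))]
```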

Numerical experiment
To evaluate the forecasting accuracy and stability of our CM, 10-min wind speed datasets from four stations were selected for multi-step forecasting. In this study, three datasets were selected as the research objects, and the data differed slightly depending on the forecasting step. The ratio between the training and test sets was 125:18. When forecasting the second value, the training and test sets were each moved one data point forward, keeping their sizes unchanged. For example, when forecasting the first value, the first 1000 data points were used for training, points 1001 to 1144 were used for testing, and the first value was forecasted. For the second value, points 2 to 1001 were used for training and points 1002 to 1145 for testing, yielding the second forecasting value; this process was repeated for a total of 1008 forecasted values. Training stopped when the training error reached MSE = 10^-6 (after normalisation). Three datasets were used for the one-step, two-step, and three-step forecasting. A sketch of this rolling scheme is given below.
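The rolling-origin scheme described above can be expressed in a few lines. In this sketch, `model_fn` is a hypothetical fit-and-forecast callable (a persistence baseline is used as a stand-in); window and output sizes mirror the paper's description.

```python
import numpy as np

def rolling_forecast(series, window=1000, n_out=1008, model_fn=None):
    """Rolling-origin evaluation: a fixed-length training window slides
    forward one point per forecast, and one value is produced each time."""
    if model_fn is None:
        model_fn = lambda train: train[-1]   # persistence stand-in
    preds = np.empty(n_out)
    for t in range(n_out):
        preds[t] = model_fn(series[t:t + window])
    return preds
```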

Performance metrics
To evaluate the characteristics of the model more comprehensively, certain performance indices were considered. Eight metrics, mean absolute error (MAE), RMSE, standard deviation of absolute percentage error (STDAPE), direction accuracy (DA), Theil U statistic 1 of forecasting results (U1), Theil U statistic 2 of forecasting results (U2), MAPE, and coefficient of determination (R 2 ), were used, as shown in Table 1; furthermore, the Diebold-Mariano (DM) test and forecasting availability test were used. These metrics were taken from a study by Wang et al. (2018).
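For concreteness, here are minimal NumPy versions of three of the listed metrics (MAPE, RMSE, and DA). Definitions of DA vary across papers; the version below scores the sign agreement of successive movements and is an assumption, not necessarily the exact formula of Wang et al. (2018).

```python
import numpy as np

def mape(y, yhat):
    return np.mean(np.abs((y - yhat) / y)) * 100.0

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def direction_accuracy(y, yhat):
    """Percentage of steps where the forecast movement from the last
    actual value has the same sign as the actual movement."""
    actual_move = y[1:] - y[:-1]
    forecast_move = yhat[1:] - y[:-1]
    return np.mean(np.sign(actual_move) == np.sign(forecast_move)) * 100.0
```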

Experiment I: One-step-ahead forecasting comparison between our proposed combined model (CM) and other models
This experiment aimed to compare the performance of our forecasting model against some widely used statistical forecasting models, ANNs, and other established models.
The forecasting results for one-step-ahead forecasting are shown in Table 2 and Fig. 2. We can see that the extreme learning machine (ELM) consistently obtained the optimal results at Sites 1, 2, and 3. At Site 4, the adaptive network-based fuzzy inference system (ANFIS) performance was optimal among the branch models. CM consistently realised the optimal performance among all models. Table 2 and Fig. 2 show the numerical simulation results and model performance, respectively.
1. At Site 1, for the branch models, ELM achieved the best results in terms of MAE, RMSE, U1, U2, and R2. The STDAPE value (4.96%) of ELM was 0.01% higher than those of ARIMA and ANFIS, which achieved a value of 4.95%. The DA measures whether the trend of the forecasting results is consistent with the true data; for this metric, ANFIS produced the optimal value of 43.20%. ANFIS also achieved the lowest MAPE value, of 5.79%. 2. For Site 2 in Table 2 and Fig. 2, the optimal results for MAE, RMSE, STDAPE, U1, and MAPE among the five single models were obtained by the ELM. 3. To achieve the best forecasting results, the sub-models had to be adapted to the dataset. 4. For the combined model, the five sub-models selected for CM were not simply the five models with the lowest MAPE (e.g. the BPNN at Site 3, with a value of 6.74%); this is because the MAPE values of the single models were not the only indexes used to select the sub-models and build CM. 5. Each of the single models selected to build CM achieved one or more optimal scores according to the metrics.

Experiment II: Two-step-ahead forecasting comparison between our proposed model (CM) and other models
This experiment compared the two-step-ahead forecasting performance of CM with some widely used statistical forecasting models, ANNs, and other established models. Table 3 and Fig. 3 show the two-step-ahead forecasting results for the four sites. From these results, ELM performed better than the other single models at the four sites (four datasets), and CM obtained the optimal values among all models under the eight metrics. Remarks: 1. For the two-step-ahead forecasting (see Table 3), the optimal results in all metrics could not be obtained from any single model. 2. The forecasting results of ELM almost achieved the optimal values at the four sites, as shown in Table 3 and Fig. 3. The ELM weights selected to build CM for the four sites were 0.5698, 0.5249, 0.4914, and 0.5994, respectively. 3. At Site 2, ELMAN achieved the best DA value; however, the CM for Site 2 was not built with ELMAN. This indicates that a single standard or metric cannot determine whether a single model should be selected to construct CM. 4. For Sites 3 and 4, the models selected to build CM were the same: ELM, ARIMA, ANFIS, GRNN, and BPNN. However, the weights of these models differed between Sites 3 and 4, indicating that the CMs for these two sites differed. 5. For the four sites (four datasets), the CMs differed. 6. ELMAN and RBFNN were not selected to construct the CMs for the four sites.

Experiment III: Three-step-ahead forecasting comparison between our proposed model (CM) and other models
This experiment compared the three-step-ahead forecasting performance of CM with several widely used statistical forecasting models, ANNs, and other established models. ELM and BPNN outperformed the other single models at the four sites (four datasets). Table 4 and Fig. 4 show the results, which can be summarised as follows: 1. For Site 1, the indicator R2 gives the proportion of variation in the dependent variable that can be explained by the independent variable via the regression relationship; LSTM achieved the optimal value (0.9709). The optimal values of U1 and RMSE were also achieved by LSTM, at 0.6707 and 32.96%, respectively. 2. At Site 2, although the SVM did not achieve the best scores under the metrics (lower than ELM), the weight of the SVM used to build CM was 0.5055, higher than that of ELM. 3. At Site 3, the BPNN performed best among the nine single models, and the weight of the BPNN used to build CM was 0.4829, the highest among the nine single models. 4. Similar to Site 3, an optimal single model was observed at Site 4; ELM had an overwhelming advantage among the nine models. The CM of Site 4 was built using ELM, ARIMA, ANFIS, GRNN, and LSTM.

Remarks:
1. For the three-step-ahead forecasting (Table 4), the best results for all metrics at the four sites could not be obtained by any single model. 2. ELM, ARIMA, and ANFIS were selected to build the three-step-ahead forecasting CM for all four sites, which indicates that these models were fit for three-step-ahead forecasting and the four datasets. 3. For the four sites, the CMs were built with different single models, to obtain the best forecasting results. 4. The weight of LSTM was 0.6189, greater than the weight of ELM; however, ELM obtained the best metric scores. Thus, the weights of the branch models were unrelated to their individual metric scores. 5. For Sites 1 and 4, the single models of the combined model were identical: ELM, ARIMA, ANFIS, GRNN, and LSTM. However, the weights of these models differed between Sites 1 and 4, indicating that the CMs for these two sites were different. Therefore, for the four sites (four datasets), the CMs chosen by our proposed method differed. 6. ELMAN and RBFNN were not selected to build CMs for the four sites. 7. The five models with the lowest MAPE were not necessarily selected to construct CM; for example, ELMAN at Site 4 (with a MAPE of 6.59%) was not chosen. This is because the MAPE of the single models alone is not a sufficient index for selecting the branch models of CM.

Experiment IV: DM test and forecasting availability
To further evaluate our CM, we used two evaluation methods, the DM test and forecasting availability, to assess the model quality. The DM test, proposed by Diebold and Mariano (1995), focuses on forecasting accuracy and evaluates the forecasting performance of two or more time-series models. Forecasting availability measures the effectiveness of the forecasting by the sum of squares and the mean square deviation of the forecasting errors; both were used to further evaluate and analyse the performance of CM. Among all models, the optimal performance was achieved by CM.
The results of the DM test and forecasting availability are shown in Tables 5 and 6, respectively.
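For reference, a minimal implementation of the DM statistic is sketched below. It compares two competing forecast-error series under a power loss using the usual normal approximation; this is a generic textbook version, not the authors' exact test code.

```python
import numpy as np
from scipy.stats import norm

def dm_test(e1, e2, h=1, power=2):
    """Diebold-Mariano statistic for the loss differential between two
    forecast-error series e1 and e2, with horizon h (squared loss by default)."""
    d = np.abs(e1) ** power - np.abs(e2) ** power
    T = len(d)
    dbar = d.mean()
    # long-run variance of d: autocovariances up to lag h-1
    var_d = np.mean((d - dbar) ** 2)
    for k in range(1, h):
        var_d += 2.0 * np.mean((d[k:] - dbar) * (d[:-k] - dbar))
    stat = dbar / np.sqrt(var_d / T)
    return stat, 2.0 * norm.sf(abs(stat))   # statistic and two-sided p-value
```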
(1) As shown in Table 5, CM differed considerably from the other models, regardless of the dataset or order.
(2) The results of forecasting availability are listed in Table 6. In wind speed forecasting, the first-and second-order forecasting availability of CM for the four datasets and one-step, two-step, and three-step forecasting outperformed those of the other models. For example, at Site 1 (one-step forecasting), the first-order forecasting availabilities of each model were 0.9419, 0.9400, 0.9421, 0.9401, 0.9098, 0.9383, 0.8582, 0.9222, 0.8945, and 0.9504, respectively.

Remark
(1) The results of the DM test showed that CM differed from the other models; the higher the value, the greater the difference. (2) The forecasting availability results show the differences between CM and the other models; the higher the value, the better the forecasting performance. The first- and second-order values of CM are close to 1, which indicates that CM is significantly better than the other models.

Summary
From the four experiments, we obtained the following findings: (1) The sub-models of CM were not static. Across the three different datasets, four sites, and different step-ahead forecasting scenarios, the preferred sub-models varied in order to achieve the best results.
(2) No single model could consistently achieve the best results among the sub-models, owing to the complexity of the data. (3) In the former three experiments, the multi-objective optimisation algorithm was used to optimise the weights of the combined model's sub-models, and the weights of the sub-models in CM differed. In addition, the multi-objective optimisation algorithm could balance the objective functions of the combined model, thereby ensuring the accuracy and effectiveness of the forecasting. (4) The model selection chose the best sub-models to construct CM in different situations (datasets, forecasting steps, and sites); this made the selection of the sub-models more reasonable, and the combined model helped achieve the best results.
Our experiments demonstrate that CM has a stronger forecasting power and higher forecasting accuracy than the benchmark model.

Conclusion
In this study, our proposed combined model obtained optimal results compared with the other models. No single ANN could perfectly solve the problem for different step-ahead forecasting levels and different datasets. Furthermore, based on the experiments, the forecast system developed in this study [i.e. our proposed model (CM)] considered not only different time and site data but also the disadvantages of existing combined models. To overcome these disadvantages, a model was proposed using multiple objective functions and model selection theory; its innovations can be summarised as follows: (1) The multi-objective functions guaranteed the stability and accuracy of CM's results, because they considered both aspects.
(2) The model selection theory made the choice of the single models in CM more reasonable, instead of relying on manual selection. (3) Our proposed model confirmed that for different data and different step-ahead forecasting processes, the optimal combined model is not fixed; the sub-models used to construct CM must be adjusted to the data. These developments are rarely seen in other studies; therefore, this study fills that research gap.
To summarise, by overcoming these disadvantages and making these innovations, our proposed model based on multiple objective functions and model selection was found to be stable and accurate (the MAPE is less than 4.60% and the STDAPE is less than 5.68%); it overcame the difficulty of selecting sub-models for the combined model. With these forecasting results and theories, CM can also be applied to futures, forwards, securities, house prices, and other forecasting fields. More benchmark models could be added to the model selection to help CM achieve better results; however, more models will increase the runtime, so the number of models and the runtime should be balanced in future studies.

Appendix
Only with a better understanding of the features of the data can we select an appropriate model and prepare for future work. To achieve better results, we must consider the characteristics of the data. Generally speaking, a linear model fits linear data better, and a nonlinear model fits nonlinear data better.
Wind speed data are not simply linear or nonlinear, but both. Therefore, it is necessary to judge the linearity and nonlinearity of the data used in this paper, for which we constructed the experiments above.
From the results of Tables 7 and 8, hypothesis tests show that the wind speed data have both linear and nonlinear components. Therefore, considering both linear and nonlinear models in our proposed forecasting model is correct and necessary.

Support vector machines
The support vector machine (SVM) method is a machine learning method proposed by Vapnik (1997). The nonlinear regression function can be written as

$$f(x) = \omega^{T} \varphi(x) + b,$$

where $\varphi(x)$ is the mapping function that maps the data from a low-dimensional space to a high-dimensional space, and $\omega$ and $b$ are the coefficient vector and threshold, respectively.
An $\varepsilon$-insensitive loss function is used to optimise the regression function, and the best regression function is found by minimising

$$\min \ \frac{1}{2}\lVert \omega \rVert^{2} + C \sum_{i=1}^{n} (\xi_i + \xi_i^{*})$$

subject to the constraints

$$y_i - \omega^{T}\varphi(x_i) - b \leq \varepsilon + \xi_i, \qquad \omega^{T}\varphi(x_i) + b - y_i \leq \varepsilon + \xi_i^{*}, \qquad \xi_i, \xi_i^{*} \geq 0,$$

where $C$ is a penalty factor, $\xi_i$ and $\xi_i^{*}$ are slack variables, $b$ is the offset, and $x_i$ and $y_i$ are the input and output, respectively. After solving this problem, the regression function can be obtained as

$$f(x) = \sum_{i=1}^{n} (\alpha_i - \alpha_i^{*}) K(x, x_i) + b,$$

where $K(x, x_i)$ is the kernel function of the SVM.
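As a usage illustration (not the authors' configuration), the sketch below fits a kernel SVR to lagged wind speed values with scikit-learn; the lag count, kernel, hyperparameters, and synthetic data are placeholder choices.

```python
import numpy as np
from sklearn.svm import SVR

def make_lagged(x, lags=5):
    """Turn a series into (X, y) pairs: predict x[t] from the previous lags."""
    X = np.column_stack([x[i:len(x) - lags + i] for i in range(lags)])
    return X, x[lags:]

rng = np.random.default_rng(0)
wind = 8.0 + np.sin(np.arange(600) / 20.0) + rng.normal(0.0, 0.3, 600)  # stand-in data

X, y = make_lagged(wind)
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1)   # epsilon-insensitive loss
svr.fit(X[:500], y[:500])
pred = svr.predict(X[500:])                    # one-step-ahead forecasts
```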

Long short-term memory network
The long short-term memory (LSTM) network is a recurrent neural network (RNN) proposed by Hochreiter and Schmidhuber (1997). Following the notation of Greff et al. (2017), it can be written as

$$z^{t} = g(W_{z} x^{t} + R_{z} y^{t-1} + b_{z}),$$
$$i^{t} = \sigma(W_{i} x^{t} + R_{i} y^{t-1} + p_{i} \odot c^{t-1} + b_{i}),$$
$$f^{t} = \sigma(W_{f} x^{t} + R_{f} y^{t-1} + p_{f} \odot c^{t-1} + b_{f}),$$
$$c^{t} = z^{t} \odot i^{t} + c^{t-1} \odot f^{t},$$
$$o^{t} = \sigma(W_{o} x^{t} + R_{o} y^{t-1} + p_{o} \odot c^{t} + b_{o}),$$
$$y^{t} = h(c^{t}) \odot o^{t},$$

where $t$ is the time; $N$ is the number of LSTM cells; $W_{z}, W_{i}, W_{f}, W_{o} \in \mathbb{R}^{N \times M}$ are the input weights; $R_{z}, R_{i}, R_{f}, R_{o} \in \mathbb{R}^{N \times N}$ are the recurrent weights; $p_{i}, p_{f}, p_{o} \in \mathbb{R}^{N}$ are the peephole weights (Gers and Schmidhuber 2000); $b_{z}, b_{i}, b_{f}, b_{o} \in \mathbb{R}^{N}$ are the biases; $g(\cdot)$, $h(\cdot)$, and $\sigma(\cdot)$ are activation functions; $z^{t}$ is the activation of the input block; $i^{t}$ is the activation of the input gate; $f^{t}$ is the activation of the forget gate; $c^{t}$ is the cell state at time $t$; $o^{t}$ is the activation of the output gate; and $y^{t}$ is the output of the cell at time $t$.
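To make the structure concrete, below is a minimal PyTorch one-layer LSTM regressor for one-step forecasting. The hidden size and the use of the last time step's output are illustrative choices, and PyTorch's LSTM omits the peephole connections described above.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """One-layer LSTM regressor for one-step-ahead forecasting;
    expects inputs of shape (batch, seq_len, 1)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        out, _ = self.lstm(x)          # out: (batch, seq_len, hidden)
        return self.head(out[:, -1])   # regress on the last time step

model = LSTMForecaster()
x = torch.randn(8, 48, 1)              # a batch of 48-step windows
print(model(x).shape)                  # torch.Size([8, 1])
```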

Autoregressive integrated moving average
The autoregressive integrated moving average (ARIMA) model is one of the most popular forecasting models in the wind speed forecasting field (Contreras et al. 2003) and can be written as

$$\phi(B)(1 - B)^{d} X_{t} = \theta(B) e_{t},$$
$$\phi(B) = 1 - \phi_{1} B - \phi_{2} B^{2} - \cdots - \phi_{p} B^{p},$$
$$\theta(B) = 1 - \theta_{1} B - \theta_{2} B^{2} - \cdots - \theta_{q} B^{q}.$$

These formulas are recorded as ARIMA(p, d, q), where $B^{q} X_{t} = X_{t-q}$, $X_{t}$ is the time series at time $t$, $e_{t}$ is the random error at time $t$, and $B$ is the backward shift operator.
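A minimal fitting example with statsmodels is shown below; the order (2, 1, 1) and the synthetic series are placeholders, not the orders used in the paper.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
x = np.cumsum(rng.normal(size=500))      # stand-in nonstationary series
fit = ARIMA(x, order=(2, 1, 1)).fit()    # ARIMA(p=2, d=1, q=1)
print(fit.forecast(steps=3))             # three-step-ahead forecasts
```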

Back-propagation neural network
Back-propagation neural network (BPNN) is a widely used multi-layer feedforward neural network based on a gradient descent method that minimises the sum of squared errors between the actual and expected output values. The output function lies between 0 and 1, converting input to output to achieve continuous nonlinear mapping (Guo et al. 2011). The inputs to the BPNN are first normalised as

$$X_{i} = \frac{X^{0}_{i} - X_{\min}}{X_{\max} - X_{\min}},$$

where $X_{\min}$ and $X_{\max}$ are the minimum and maximum values of the input or output vectors, and $X^{0}_{i}$ denotes the real value of each vector.
Step 1. Calculate the outputs of all hidden-layer nodes:

$$net_{j} = \sum_{i} w_{ji} x_{i} + b_{j}, \qquad y_{j} = f(net_{j}),$$

where $net_{j}$ is the activation value of node $j$, $w_{ji}$ represents the connection weight from input node $i$ to hidden node $j$, $b_{j}$ represents the bias of neuron $j$, $y_{j}$ represents the output of hidden-layer node $j$, and $f$ is the activation function of a node, usually a sigmoid function.
Step 2. Calculate the output data of the neural network:

$$O_{1} = f_{0}\Big(\sum_{j} w_{0j} y_{j} + b_{0}\Big),$$

where $w_{0j}$ represents the connection weight from hidden node $j$ to the output node, $b_{0}$ represents the bias of the output neuron, $O_{1}$ represents the output data of the network, and $f_{0}$ is the activation function of the output-layer node.
Step 3. Minimize the global error via the training algorithm:

$$E = \frac{1}{2} \sum_{k=1}^{m} (Z_{k} - O_{k})^{2},$$

where $Z$ represents the real data vector of the output and $m$ represents the number of outputs.
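In practice, such a network can be trained in a few lines. The following scikit-learn sketch mirrors the description above (one sigmoid hidden layer, gradient descent on the squared error); the hidden size and synthetic data are placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (400, 5))    # stand-in inputs
y = np.sin(X.sum(axis=1))               # stand-in target

# one hidden layer with a sigmoid ("logistic") activation, trained by
# stochastic gradient descent on the squared error, as described above
bpnn = MLPRegressor(hidden_layer_sizes=(15,), activation="logistic",
                    solver="sgd", learning_rate_init=0.05,
                    max_iter=5000, random_state=0)
bpnn.fit(X, y)
print(bpnn.predict(X[:3]))
```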

Generalized regression neural network
Generalized regression neural networks (GRNN) were proposed by Specht (1991); their theoretical basis is nonlinear kernel regression analysis. The network is based on nonlinear regression theory and consists of four layers of neurons: the input layer, pattern layer, summation layer, and output layer.
Definition 1 Let the joint probability density function of the random variables $x$ and $y$ be $f(x, y)$, and let the observed value of $x$ be $X$. Then, the regression estimate is

$$\hat{Y}(X) = E[y \mid X] = \frac{\int_{-\infty}^{\infty} y f(X, y)\, dy}{\int_{-\infty}^{\infty} f(X, y)\, dy}. \quad (30)$$

Definition 2 The probability density function $f(x, y)$ is unknown but can be obtained using a nonparametric estimate $\hat{f}(x, y)$ from the sample observations of $x$ and $y$:

$$\hat{f}(X, y) = \frac{1}{n (2\pi)^{(p+1)/2} \delta^{p+1}} \sum_{t=1}^{n} \exp\!\left[-\frac{(X - X_{t})^{T}(X - X_{t})}{2\delta^{2}}\right] \exp\!\left[-\frac{(y - Y_{t})^{2}}{2\delta^{2}}\right],$$

where $X_{t}$ and $Y_{t}$ are the sample observations of $x$ and $y$, respectively; $\delta$ is the smoothing parameter; $n$ is the number of samples; and $p$ is the dimensionality of the random vector $x$. Substituting $\hat{f}(x, y)$ for $f(x, y)$ in formula (30) finally gives

$$\hat{Y}(X) = \frac{\sum_{t=1}^{n} Y_{t} \exp\!\left[-\frac{(X - X_{t})^{T}(X - X_{t})}{2\delta^{2}}\right]}{\sum_{t=1}^{n} \exp\!\left[-\frac{(X - X_{t})^{T}(X - X_{t})}{2\delta^{2}}\right]},$$

where $\hat{Y}(X)$ is the weighted average of all sample observations $Y_{t}$, and the weight of each $Y_{t}$ is an exponential of the squared Euclidean distance between the corresponding sample $X_{t}$ and $X$.
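The final estimator is simple enough to implement directly. Below is a minimal NumPy version of the Gaussian-weighted average above, with `delta` as the smoothing parameter (the default value is illustrative).

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, delta=0.5):
    """GRNN as a Nadaraya-Watson estimator: Gaussian-weighted average
    of training targets, with smoothing parameter delta."""
    X_query = np.atleast_2d(X_query)
    preds = np.empty(len(X_query))
    for k, x in enumerate(X_query):
        d2 = np.sum((X_train - x) ** 2, axis=1)   # squared distances to X_t
        w = np.exp(-d2 / (2.0 * delta ** 2))      # Gaussian kernel weights
        preds[k] = np.sum(w * y_train) / np.sum(w)
    return preds
```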

Radial basis function neural network
The radial basis function neural network (RBFNN) (Schwenker et al. 2001) is an efficient feedforward neural network that exhibits better approximation performance and global optimisation ability than other feedforward networks. The structure of the network is simple, and the training speed is fast. An RBFNN includes three layers: an input layer, a hidden layer with a nonlinear activation function (the radial basis function), and an output layer.
Definition 1 The modelled input is a real-number vector $x \in \mathbb{R}^{n}$. The output of the neural network is a scalar function of the input vector, $u : \mathbb{R}^{n} \to \mathbb{R}$, given by

$$u(x) = \sum_{i=1}^{N} a_{i}\, \rho(\lVert x - c_{i} \rVert),$$

where $N$ represents the total number of neurons in the hidden layer, $c_{i}$ is the centre vector of the $i$th neuron, and $a_{i}$ is the weight of neuron $i$ in the linear output neuron.
Definition 2 A radial basis function is a radially symmetric scalar function, defined as a monotone function of the Euclidean distance between any point $x$ in space and the centre vector $c_{i}$. The most commonly used radial basis function is the Gaussian kernel,

$$\rho(\lVert x - c_{i} \rVert) = \exp\!\left(-\frac{\lVert x - c_{i} \rVert^{2}}{2\sigma^{2}}\right),$$

where $c_{i}$ is the centre vector and $\sigma$ is a width parameter that controls the radial scope of the function.
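A compact way to realise this network is to fix the centres (e.g. a random subset of the training inputs) and solve for the output weights by least squares; the sketch below does exactly that, with the centre count and width as placeholder choices.

```python
import numpy as np

def rbf_design(X, centers, sigma):
    """Gaussian RBF design matrix: phi[t, i] = rho(||x_t - c_i||)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, (300, 2))
y = np.sin(3.0 * X[:, 0]) + np.cos(2.0 * X[:, 1])        # stand-in target

centers = X[rng.choice(len(X), size=20, replace=False)]  # fixed centres
Phi = rbf_design(X, centers, sigma=0.5)
a, *_ = np.linalg.lstsq(Phi, y, rcond=None)              # output weights a_i
y_hat = rbf_design(X, centers, 0.5) @ a
```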
Adaptive network-based fuzzy inference system

Jang (1993) combined the best features of fuzzy systems and neural networks to construct the adaptive network-based fuzzy inference system (ANFIS). ANFIS integrates the human inference style of a fuzzy inference system (FIS) by using input-output sets and a set of if-then fuzzy rules. An FIS (Neshat et al. 2012) has structured knowledge, in which each fuzzy rule describes the current behaviour of the system; however, it lacks adaptability to changes in the external environment. Therefore, the concept of neural network learning is combined with the FIS in ANFIS (İnan et al. 2007). ANFIS uses neural network learning and fuzzy inference to simulate complex nonlinear mappings and can deal with uncertain, noisy, and imprecise environments (Liu and Ling 2003). ANFIS uses the training process of the neural network to adjust the membership functions and related parameters to fit the expected dataset (Wu et al. 2009).

Layer 1: This is the input layer, which is responsible for fuzzifying the input signal. Each node $i$ is an adaptive (square) node with node function

$$O^{1}_{i} = \mu_{A_{i}}(x) \quad \text{or} \quad O^{1}_{i} = \mu_{B_{i}}(y),$$

where $x$ (or $y$) is the input of node $i$; $A_{i}$ and $B_{i}$ are fuzzy sets; and $O^{1}_{i}$ is the membership value of $A_{i}$ or $B_{i}$, indicating the degree to which $x$ and $y$ belong to $A_{i}$ and $B_{i}$. Usually, $\mu_{A_{i}}$ and $\mu_{B_{i}}$ are chosen as bell-shaped or Gaussian functions. The parameters of the membership function are called premise parameters.
Layer 2: The nodes in this layer multiply the incoming signals and calculate the firing strength of each rule:

$$O^{2}_{i} = w_{i} = \mu_{A_{i}}(x)\, \mu_{B_{i}}(y),$$

where the output of each node represents the credibility of the rule.

Layer 3: This layer normalises all firing strengths; each node is represented by N. The $i$th node calculates the ratio of the $i$th rule's firing strength to the sum of all rules' firing strengths:

$$O^{3}_{i} = \bar{w}_{i} = \frac{w_{i}}{\sum_{j} w_{j}}.$$

Layer 4: This layer calculates the output of each fuzzy rule:

$$O^{4}_{i} = \bar{w}_{i} f_{i} = \bar{w}_{i} (p_{i} x + q_{i} y + r_{i}),$$

where $\bar{w}_{i}$ is the output of layer 3 and $\{p_{i}, q_{i}, r_{i}\}$ is the parameter set (consequent parameters).

Layer 5: The single node of this layer is a fixed node that calculates the total output over all input signals:

$$O^{5} = \sum_{i} \bar{w}_{i} f_{i}.$$

Extreme learning machine

Extreme learning machine (ELM) is a machine learning algorithm proposed by Huang, designed for single-layer feedforward neural networks (SLFNNs) (Huang et al. 2006). The main feature of ELM is that the parameters of the hidden-layer nodes can be randomly generated without adjustment; the learning process only needs to calculate the output weights.
Definition 3 An SLFNN consists of three parts: an input layer, a hidden layer, and an output layer. The output function is

$$f(x) = \sum_{i=1}^{N} \beta_{i} h_{i}(x) = h(x)\beta,$$

where $x$ is the input vector, $\beta_{i}$ is the output weight of the $i$th hidden node, and $h(x)$ is the hidden-layer output mapping, called the activation function, defined as

$$h_{i}(x) = G(a_{i}, b_{i}, x).$$

Here, $b_{i}$ is the parameter of the feature mapping (also called the node parameter), and $a_{i}$ is called the input weight. In the calculation, the parameters of the feature mapping are randomly initialised and not adjusted; hence, the feature mapping of the ELM is also random.
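Because only the output weights are learned, an ELM reduces to a single least-squares solve. Below is a minimal NumPy sketch, with tanh as an assumed activation G and the hidden size as a placeholder.

```python
import numpy as np

def elm_fit(X, y, n_hidden=50, seed=0):
    """ELM training: random, fixed hidden layer; output weights beta
    solved in closed form via the pseudo-inverse."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(X.shape[1], n_hidden))  # random input weights a_i
    b = rng.normal(size=n_hidden)                # random node parameters b_i
    H = np.tanh(X @ A + b)                       # hidden-layer output h(x)
    beta = np.linalg.pinv(H) @ y                 # output weights
    return A, b, beta

def elm_predict(X, A, b, beta):
    return np.tanh(X @ A + b) @ beta
```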

Elman neural network
The Elman neural network is a machine learning algorithm proposed by Elman for SLFNNs. In addition to the input layer, the hidden layer, and the output layer, it has a special context unit, which memorises the previous output value of the hidden-layer units and can be considered a delay operator. The feedforward part of the network has trainable connection weights, whilst the recurrent part is fixed; that is, it does not undergo learning correction. The mathematical model of the Elman network can be written as follows (Elman 1990):

$$x(k) = f\big(W^{I1} x_{c}(k) + W^{I2} u(k-1)\big),$$
$$x_{c}(k) = x(k-1) + a\, x_{c}(k-1),$$
$$y(k) = g\big(W^{I3} x(k)\big),$$

where $f(x)$ is a sigmoid function and $0 \leq a < 1$ is a self-connected feedback gain operator. If $a = 0$, the network is a standard Elman neural network; if $a \neq 0$, it is a modified Elman neural network. Here, $u$ is the $n$-dimensional input vector, $x$ is the hidden-layer output, $x_{c}$ is the $n$-dimensional context-layer output, $y$ is the $m$-dimensional output of the network, and $W^{I1}$, $W^{I2}$, and $W^{I3}$ are connection weight matrices of dimensions $n \times n$, $n \times q$, and $m \times m$, respectively. We set the actual output of the $k$-step system to $y_{d}(k)$, define the error function as

$$E(k) = \frac{1}{2} \big(y_{d}(k) - y(k)\big)^{T} \big(y_{d}(k) - y(k)\big),$$

and take the derivative of $E$ with respect to the connection weights $A$, $W^{I1}$, $W^{I2}$, and $W^{I3}$. The learning algorithm of the Elman network can then be obtained using the gradient descent method; for example,

$$\Delta w^{I2}_{jq} = \eta_{2}\, \delta^{h}_{j}\, u_{q}(k - 1), \quad j = 1, 2, \dots, n; \ q = 1, 2, \dots, r,$$

where $r$ is the number of input-layer nodes, $n$ is the number of hidden-layer and context-layer nodes, and $m$ is the number of output-layer nodes. $\eta_{1}$, $\eta_{2}$, and $\eta_{3}$ are the learning steps of $W^{I1}$, $W^{I2}$, and $W^{I3}$, respectively.
Weighted information criterion (WIC)

The weighted information criterion (WIC) was initially proposed to find the best ANN model (Egrioglu et al. 2008). In this paper, the WIC was applied to four real time-series datasets to measure the forecasting performance of the examined sub-model architectures and to decide the sub-models of the combined model. The number of input nodes depends on the number of sub-models of the combined model. The mean absolute percentage error (MAPE), root mean square error (RMSE), Akaike information criterion (AIC), Bayesian information criterion (BIC), and direction accuracy (DA) were chosen as the model selection criteria. The MAPE and RMSE detect deviations between the actual and forecast values; the AIC and BIC penalise large models; and the DA measures the forecast direction accuracy. Although the MAPE and RMSE can also be used individually to select a model, the models they select tend not to be sufficiently detailed. These criteria are calculated as

$$\mathrm{MAPE} = \frac{1}{T} \sum_{t=1}^{T} \left| \frac{y_{t} - \hat{y}_{t}}{y_{t}} \right| \times 100\%, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} (y_{t} - \hat{y}_{t})^{2}},$$
$$\mathrm{AIC} = T \ln\!\Big(\frac{\mathrm{SSE}}{T}\Big) + 2m, \qquad \mathrm{BIC} = T \ln\!\Big(\frac{\mathrm{SSE}}{T}\Big) + m \ln T,$$

where $y$ is the actual value, $\hat{y}$ is the forecasted value, $\mathrm{SSE} = \sum_{t=1}^{T} (y_{t} - \hat{y}_{t})^{2}$, $T$ is the total number of data items, and $m$ is the number of ANN weights.
In this paper, we also used a special criterion called the modified direction accuracy (MDA), proposed according to the direction accuracy criterion. The algorithm of the model selection strategy based on the WIC is as follows: (1) All structures of the sub-models that constitute the combined model are determined. For instance, with five input-layer nodes, one output-layer node, and up to 15 hidden-layer nodes, the total number of possible structures considered is 18.
(2) The best weights of the AIC, BIC, RMSE, MAPE, DA, and MDA are determined and calculated using the training data.
(3) The six criteria must be standardized across the neural network structures.
(4) The WIC value of each structure is computed as the weighted combination of the standardized criteria.
(5) We choose the architecture with the minimum WIC. A schematic of this selection is sketched below.
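The following sketch shows the shape of such a selection rule: standardise each criterion column, take a weighted sum, and pick the minimum. The min-max standardisation and the orientation convention are assumptions made for illustration; the exact WIC formula is given in Egrioglu et al. (2008).

```python
import numpy as np

def wic_select(criteria, weights):
    """criteria: (n_structures, n_criteria) array, oriented so that
    smaller is better for every column (invert DA/MDA beforehand).
    Each column is min-max standardised, a weighted sum is taken,
    and the index of the structure with minimum WIC is returned."""
    C = np.asarray(criteria, dtype=float)
    C = (C - C.min(axis=0)) / (C.max(axis=0) - C.min(axis=0) + 1e-12)
    wic = C @ np.asarray(weights, dtype=float)
    return int(np.argmin(wic)), wic
```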
The theory of the combined model

Combination forecasting theory indicates (Bates and Granger 2001) that if $M$ forecasting models can solve a certain forecasting problem, the weight coefficients should be appropriately selected, and the results of the $M$ forecasting methods are then combined to obtain a new model, named the combined model; the results of the combined model are better than those of the $M$ individual models. Assuming that the actual time-series data are $y$ with sample points indexed by $t$, $\hat{y}_{it}$ is the forecast value obtained by the $i$th forecasting model, $e_{it}$ is the forecast error, and $\hat{\omega}_{i}$ is the weight coefficient of the $i$th forecasting model, the general combined forecast model can be expressed as

$$y_{t} = \sum_{i=1}^{M} \big(\hat{\omega}_{i} \hat{y}_{it} + e_{it}\big), \quad t = 1, 2, \dots, L, \quad (58)$$

where $\hat{\omega}_{i}$ is the estimated value of $\omega_{i}$ and represents the weight of each single model, and $\hat{y}_{t} = \sum_{i=1}^{M} \hat{\omega}_{i} \hat{y}_{it}$ is the forecast value of the combined model. Determining the weight coefficient of each model is the key step in establishing a combined forecasting model; an optimal combined model is then obtained by solving the corresponding optimisation problem over the weights. The optimisation process stops when the predefined absolute error or the maximum number of iterations is reached.

Multi-objective optimization theory
Multi-objective optimization can solve optimization problems whose objective function has multiple measurement indexes over the domain of definition. The objective function checks and balances the shortcomings of the forecast model in multiple aspects by assigning weights to each measurement index, to improve the forecasting accuracy and stability. Generally speaking, multi-objective optimization problems (MOPs) can be divided into two categories: constrained and non-constrained problems. A constrained problem with $J$ inequality and $K$ equality constraints can be expressed as (Laumanns et al. 2006)

$$\min \ F(x) = \big(f_{1}(x), f_{2}(x), \dots, f_{M}(x)\big)^{T}$$
$$\text{s.t.} \quad g_{j}(x) \leq 0, \ j = 1, \dots, J; \qquad h_{k}(x) = 0, \ k = 1, \dots, K; \qquad x \in X, \quad (61)$$

where $M$ is the number of objectives, $x = (x_{1}, x_{2}, \dots, x_{n})^{T}$ is the decision vector, and $n$ is the number of decision variables. In (61), $X = \prod_{i=1}^{n} [x^{L}_{i}, x^{U}_{i}] \subseteq \mathbb{R}^{n}$ is called the decision space, where $x^{L}_{i}$ and $x^{U}_{i}$ are the lower and upper limits of the decision variables, respectively.
When the inequality and equality constraints in (61) are omitted, an unconstrained multi-objective problem is obtained, expressed as

$$\min \ F(x) = \big(f_{1}(x), f_{2}(x), \dots, f_{M}(x)\big)^{T}, \quad x \in X.$$

This method has been widely used in many fields, such as carbon dioxide emissions forecasting (Liu et al. 2022), electricity price forecasting (Fu et al. 2020), and electricity demand forecasting.