Hourly solar irradiation forecast using hybrid local gravitational clustering and group method of data handling methods

The foundation for many solar energy applications, as well as for economic and environmental concerns, is global solar irradiation information. However, due to variations in solar irradiation and in measurement conditions, reliable global solar irradiation data are frequently difficult or impossible to acquire. In addition, precise forecasts of solar irradiation play an increasingly important role in electric energy planning and management due to the integration of photovoltaic solar systems into power networks. Hence, this paper proposes a new hybrid model for 1-h ahead solar irradiation forecasting called LGC-GMDH (local gravitational clustering-group method of data handling). The novel LGC-GMDH model is based on local clustering that adequately captures the underlying features of the solar irradiation time series. Each cluster is then forecasted using the GMDH method, which is a self-organized system capable of handling very complicated nonlinear problems. Finally, these local forecasts are reconstructed in order to obtain the global forecast. A comparative study between the proposed model and traditional individual models such as the backpropagation neural network (BP), support vector machines (SVM), and long short-term memory (LSTM), as well as hybrid models such as BP-MLP, RNN-MLP, LSTM-MLP, hybrid wavelet packet decomposition (WPD), convolutional neural network (CNN) with LSTM-MLP, and ANFIS clustering, shows that the proposed model overcomes conventional model deficiencies and achieves more precise forecasting outcomes.


Introduction
Nowadays, there is a natural desire to develop renewable energies. Indeed, we are talking about the energy transition, which refers to the shift from the current energy system using non-renewable resources to an energy mix based mainly on renewable resources. Thus, the inclusion of renewable energies in the electricity grid is one of the current issues. Maintaining the balance of the network implies that electricity production is equal to electricity consumption. However, photovoltaic energy, like wind energy, is a so-called intermittent energy. Indeed, it strongly depends on the weather. In order to be able to use renewable energies, solutions must therefore be found for this intermittency. The development of storage technologies, consumption control, and the development of the electricity network are thus envisaged. Likewise, the forecasting of photovoltaic production will have a primordial role in the future (Ahmed et al. 2020; Guermoui et al. 2020; Lai et al. 2021).
Responsible Editor: Marcus Schulz

A critical step in predicting the amount of electricity produced by a solar panel is predicting the amount of energy received by the panel. Assuming that the inclination of the panels is low, we speak of a forecast of GHI (Global Horizontal Irradiance), which is the total irradiation received by a horizontal surface on the ground. The methodology used to predict irradiation depends on the desired forecast duration. Indeed, it is possible to use the following:
• Statistical models considering GHI measurements as time series. These are, for example, autoregressive moving average models (ARMA) (Voyant et al. 2012), autoregressive integrated moving average models (ARIMA) (Shadab et al. 2019), the Kalman filter (Soubdhan et al. 2016), and Markov chains (Vindel and Polo 2014), which are commonly used statistical techniques. These approaches are generally used for reasonably short forecast horizons. However, statistical models cannot cope with non-stationary data that change abruptly. Solar irradiation data are non-stationary, resulting in poor forecast accuracy for sudden fluctuations.
• Artificial intelligence (AI) and machine learning have gained much interest regarding solar irradiation forecast models as more research is done. Artificial neural networks (ANN) (Premalatha and Valan Arasu 2016; Benali et al. 2019), support vector machines (SVM) (Wang et al. 2019), extreme learning machines (ELM) (Burianek and Misak 2016), the wavelet transform (Yadav and Behera 2014), deep learning (Lai et al. 2021), and ensemble learning are a few examples of typical AI techniques (Voyant et al. 2017; Alkhayat and Mehmood 2021). These methods can handle the nonlinearity present in the solar irradiation time series and usually have higher accuracy than statistical models.
• Numerical weather prediction (NWP) involves complex mathematical models that simulate atmospheric change and forecast weather patterns using current weather circumstances.
In meteorology, NWP models have met with a great deal of success, but because of their complexity, they cannot be easily implemented (Perez et al. 2013; Verbois et al. 2018).
• Physical models, which observe clouds and their movement, are used to forecast the amount of energy received on Earth (Chu et al. 2016; Blanc et al. 2017; Caldas and Alonso-Suárez 2019). This can be done by analysing images taken by ground cameras (forecasts up to 30 min) or satellite imagery (forecasts up to 5 h).
• Combining these different models to take advantage of their complementarity creates a category named hybrid forecasting models. Despite being very complicated, these models have become popular because they can incorporate several models and provide more accurate results for forecasting solar irradiation. Several such models are proposed in different studies; a review of this category for forecasting solar irradiation is given in Refs. (Guermoui et al. 2020; Álvarez-Alvarado et al. 2021).
However, solar irradiation is a dynamic time series exhibiting high nonlinearity (Gan et al. 2012), making it hard for conventional forecasting models to obtain good results. Hence, as a solution, clustering was introduced recently as a category of hybrid models for forecasting purposes (Benmouiza and Cheknane 2013, 2018). Clustering (or data partitioning) is an unsupervised classification method that brings together a set of learning algorithms whose goal is to group unlabelled data with similar properties. In general, data are assigned to groups (clusters) so that observations within each group are similar to one another in terms of the variables or attributes of interest, and the groups themselves are distinct from one another (Ghayekhloo et al. 2015; Benmouiza and Cheknane 2018).
Our previous published papers focused on several hybrid models, with detailed testing of hybrid clustering models. In Ref. (Benmouiza and Cheknane 2013), hourly global horizontal solar irradiation forecasting based on the combination of an unsupervised k-means clustering algorithm and artificial neural networks (ANN) was studied. The k-means algorithm was used to extract useful information from the dynamic solar irradiation time series. Despite being a simple clustering algorithm, its combination with ANN gave good results. Moreover, we published a paper in 2016 (Benmouiza and Cheknane 2016a) dealing with overcoming the stationarity problem of the solar irradiation time series, especially on small scales. In addition, for better forecasting results, hybrid ARMA and nonlinear autoregressive (NAR) neural network models were introduced: ARMA is well suited to the linear components of the solar irradiation time series and NAR to its nonlinear components. After that, we noticed some weaknesses in the neural network architecture, especially in selecting a suitable number of inputs as well as the hidden layers and their neurons. Hence, a clustering phase was added to the neural network for better results. In 2018 (Benmouiza and Cheknane 2018), we proposed an improved clustered adaptive neuro-fuzzy inference system (ANFIS) to forecast hour-ahead solar irradiation data. First, we classified the historical solar irradiation time series to decrease the input sample size using clustering methods. Three methods were used: fuzzy c-means (FCM), subtractive clustering, and grid partitioning. These methods classify the input data into groups; each group has similar properties, which helps to understand the correlation between the data and, consequently, simplifies the forecasting process. However, further investigation and more advanced clustering methods should be elaborated in order to enhance the forecasting results.
Hence, based on this idea, we propose in this paper a hybrid model named LGC-GMDH in order to forecast measured hourly solar irradiation data one hour ahead. It combines the local gravitational clustering (LGC) method and the group method of data handling (GMDH) (Farlow 1981; Onwubolu 2016). The main idea consists of dividing the data set into training and testing sets using the k-fold cross-validation technique (Kohavi 1995; Klipp et al. 2005). Then, the selected best set is clustered into groups with the same properties using LGC clustering. However, due to the dynamic behaviour of the solar irradiation time series, phase space reconstruction is used before the clustering phase (MacQueen 1967; Benmouiza and Cheknane 2013, 2018). It is a method used to reconstruct the full system dynamics from a single time series. Takens (1981) showed that a single series of observations describing a chaotic system can be regenerated into a series of multidimensional vectors. As long as the embedding size is sufficiently large, the regenerated vectors may exhibit a wide range of essential characteristics of the real time series.
After that, each obtained cluster from the previous step will be forecasted as a single system using GMDH. It is a multi-layered network with a predetermined structure. It offers the advantage of expressing nonlinear dynamics mathematically and allowing higher order terms without instabilities (Farlow 1981;Onwubolu 2016). GMDH is also used to find input-output connections in models. It looks for connections between one output and a vast number of potential inputs. The network determines which inputs are relevant to the specified system. During training, the network is constructed layer by layer. Each layer contains neurons with just two inputs; their output is a quadratic function of their two inputs.
Next, a global forecast is reconstructed based on each local forecast. Then, error metrics such as the mean absolute error (MAE), root mean square error (RMSE), coefficient of determination (R²), and forecasting skill (FS) are used in order to compare the obtained forecast with the measured test solar irradiation time series.
The main contributions of this paper are as follows:
- A new multi-branch hybrid structure with clustered inputs is developed, which considers the highly complicated dynamics of the hourly solar irradiation time series and exhibits outstanding prediction performance.
- The LGC algorithm used in this paper introduces data mining and clustering notions in order to locally forecast the clustered solar irradiation time series, whose members have similar characteristics.
- The proposed LGC-GMDH framework greatly enhances hour-ahead solar irradiation forecasting, taking advantage of feature selection using the LGC algorithm and the power of the multilayer neural network using GMDH with its predetermined structure.
- A comparison is made between the results obtained from our proposed LGC-GMDH model and several forecasting models proposed in the literature, namely, single models such as the backpropagation (BP) neural network, recurrent neural network (RNN), long short-term memory (LSTM), and support vector machines (SVM), and hybrid models such as BP-MLP, RNN-MLP, LSTM-MLP, the hybrid wavelet packet decomposition (WPD), and convolutional neural network (CNN) with LSTM-MLP. Also, the ANFIS clustering model is used in this comparison in order to judge the goodness of our work.
The rest of this paper is organized as follows: "Methods" describes the principle of the proposed LGC-GMDH model as well as the error metrics. "Data" introduces the solar irradiation data used. "Simulation results" presents the simulation results and discussions. The comparison between the proposed model and the other models presented in the literature is shown in "Comparison." The last section is devoted to the conclusion.

Methods
Our objective is to deliver a proper 1-h ahead forecast model for the hourly solar irradiation time series. Hence, the LGC-GMDH model is proposed. This part discusses the model framework as shown in Fig. 1, which summarizes the adopted methodology. The details of the whole process are given in the following steps:

Step A: data pre-processing
The first stage of the proposed methodology concerns the data. It is necessary to collect and analyse the data before any use in the forecasting phase. The raw data should be divided into training and testing data sets. However, these data should first be pre-processed, since they contain missing values due to lack of measurement and suffer from the overfitting problem. To solve this, the k-fold cross-validation technique was applied. Part (A) of Fig. 1 depicts a conceptual cross-validation model in its simplest form. First, the data are divided into k random subsets, S1, S2, …, SK, each referred to as a fold. All folds are approximately the same size. The model is then run k times to evaluate its performance. Each time, one subset is used as a test set, while the other subsets are used as training and validation sets. After that, the model with the most outstanding performance is selected from among the k conducted tests (Kohavi 1995; Refaeilzadeh et al. 2009; Wong 2015).
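The splitting stage of Step A can be sketched as follows (a minimal pure-Python illustration; `k_fold_split` is a hypothetical helper name, and the selection of the best-performing fold by error metric is omitted):

```python
import random

def k_fold_split(data, k=5, seed=0):
    """Split a data set into k roughly equal folds (random assignment).

    Returns a list of (train, test) pairs: in each pair, one fold serves
    as the test set and the remaining k-1 folds as training/validation
    sets, as described in Step A.
    """
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    # partition the shuffled indices into k folds of near-equal size
    folds = [sorted(idx[i::k]) for i in range(k)]
    splits = []
    for i in range(k):
        test = [data[j] for j in folds[i]]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits
```

In practice, the model would be evaluated on each of the k splits and the best-performing configuration carried forward to the clustering stage.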

Step B: data clustering
The nonlinearity present in the solar irradiation time series could affect the forecast model (Gan et al. 2012). In this step, clustering is proposed as a solution to discover similar patterns in the solar irradiation time series. Clustering is an effective unsupervised learning method that has been used in various areas such as pattern recognition, image processing, bioinformatics, and information retrieval. Clustering is the process of separating a data collection, using a similarity metric, into uncoupled groups. The items within the same cluster are similar, and the items in separate clusters are distinct (Benmouiza and Cheknane 2013, 2018). Several clustering techniques have been developed, falling into categories such as partition-based clustering, hierarchical clustering, grid-based clustering, density-based clustering, and model-based clustering.
However, before the clustering phase, the phase space reconstruction method is applied in order to rebuild the global behaviour of a dynamic system from a single variable of the system in question (MacQueen 1967; Benmouiza and Cheknane 2016b, 2018). Our goals are as follows:
- Determine the minimum appropriate embedding dimension for phase-space reconstruction of the time series (Kennel et al. 1992; Benmouiza and Cheknane 2018);
- Identify regions of the reconstructed phase space that have similar characteristics using the local gravitational clustering algorithm.
The theoretical background of both the phase space reconstruction and LGC algorithms is given in detail in the following sections.

Step C: data forecasting
In this step, the clusters obtained from the previous step are forecasted using the GMDH (Farlow 1981; Onwubolu 2016). The main idea is to create local forecasts and then reconstruct a global forecast. The GMDH model is a self-organized system in which the structure optimizes itself based on the information provided by the user. This neural network attempts to build a function in a feed-forward network based on a second-degree transfer function. A computer programme determines the number of hidden layers and neurons with the best deterministic transfer function in the GMDH network. As a result, the best possible model structure is discovered. Several nonlinear functions, such as the Volterra series and the Kolmogorov-Gabor polynomial, are used to establish a connection between the input and output variables (Farlow 1981; Onwubolu 2016; Vaishnav and Vajpai 2018).
Error metrics are used in order to judge the goodness of the forecast by comparing the obtained data with the measured data. Once the criteria are reached, the algorithm ends.
In what follows, the mathematical modelling and the choice of the best parameters for all the methods mentioned earlier are given in detail.

Phase space reconstruction
Phase space reconstruction, introduced by Takens in 1981 (Takens 1981), is a technique that allows the global behaviour of a dynamic system to be reconstructed from a single variable of that system. It consists of embedding a time series into a high-dimensional space in order to understand its dynamics, which helps to provide a simple multidimensional representation of a nonlinear time series. The principle of phase space reconstruction is based on delaying the studied time series. We delay the time series to generate vectors of chosen dimensions that constitute the reconstructed phase space (Benmouiza and Cheknane 2013, 2018).
The state vectors are defined as the set of d components corresponding to the system's d degrees of freedom. Therefore, it is theoretically necessary to have as many independent measurements as the system has degrees of freedom. In general, we will consider that we have only a sampled series resulting from the experimental observation of a fluctuating quantity, from which the system's phase space has to be estimated. A possible estimate consists in defining the set of vectors X_n using finite differences:

X_n = [x_n, ẋ_n, ẍ_n, …]^T (1)

The downside of this method is its high sensitivity to noise. Whitney (1936) proposed an alternative considering the vectors:

X_n = [x_n, x_(n−τ1), x_(n−τ2), …, x_(n−τ(d−1))]^T (2)

where d is the dimension of the estimated space and D is the dimension of the state space (Whitney 1936). The authors in (Packard et al. 1980) applied this method in the context of chaotic systems. A dynamic system is said to be chaotic if it is a deterministic dynamic system whose behaviour is very dependent on the initial conditions, in the sense that close initial trajectories diverge exponentially. The validity of the approach in the context of chaotic systems was justified by F. Takens (1981), for an infinite series, for any values of τ_i, and with the sufficient condition on the dimension d:

d ≥ 2D₂ + 1 (3)

where D₂ is the correlation dimension (Grassberger and Procaccia 1983). The space formed by the set of vectors given in Eq. (2) is topologically and dynamically equivalent to the real phase space, in the sense that the geometric properties of the attractor are preserved.
The phase space reconstruction technique involves two parameters to be adjusted. The first is the dimension of the phase space we want to generate, and the second is the delay we want to apply to the time series.
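The delay-vector construction of Eq. (2), with uniform delays τ_i = iτ, can be sketched as follows (a minimal pure-Python illustration; `delay_embed` is a hypothetical helper name):

```python
def delay_embed(series, d, tau):
    """Reconstruct the phase space of a scalar time series (Takens 1981).

    Each reconstructed vector is [x(n), x(n+tau), ..., x(n+(d-1)*tau)],
    so an N-point series yields N - (d-1)*tau vectors of dimension d.
    """
    n_vectors = len(series) - (d - 1) * tau
    return [[series[n + i * tau] for i in range(d)]
            for n in range(n_vectors)]
```

The two free parameters d (embedding dimension) and tau (delay) are exactly the two quantities whose selection is discussed next.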

The choice of delay τ
In the case of non-noisy observations over infinite times, the choice of the delay τ is irrelevant (Takens 1981), but it becomes critical (Michel and Flandrin 1996; Benmouiza and Cheknane 2013, 2018) when the observation time is limited.
• If τ is too small, all the coordinates are strongly correlated: x_k ≈ x_(k+1), the vectors defined by Eq. (2) with τ_i = iτ are almost collinear, the phase space tends to be a straight line, and the estimated dimension tends to 1.
• If, on the contrary, τ is too large, the coordinates are almost independent. The set of vectors {X_n}, n = 1, …, N, traverses almost all of the phase space: the system is close to a vector noise with independent coordinates, and the estimated dimension tends towards the value of the reconstruction dimension.
The most commonly used method is to choose τ as the lag of the first zero of the estimated autocorrelation function given by Eq. (4). It has also been proposed to take the first minimum of the mutual information (Fraser and Swinney 1986) to determine τ.
Several approaches exist for choosing the dimension d:
• The estimation of the dimension of the attractor by the algorithm of Grassberger and Procaccia (1983), using the condition of Eq. (3);
• The study of the dynamics estimated by the reconstruction using different techniques such as singular value analysis (Auvergne 1988), the estimation of the intrinsic local dimension (Michel and Flandrin 1996), the false nearest neighbours method (Kennel et al. 1992), and the vector field method (Kaplan and Glass 1992). The dimension is then chosen as the smallest value d for which two points close in dimension d remain close in dimension d + 1.
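The false nearest neighbours idea can be illustrated as follows (a simplified O(n²) sketch of the Kennel et al. (1992) test, keeping only the distance-ratio criterion with a hypothetical tolerance `r_tol`; the dimension d would be increased until the returned fraction is near zero):

```python
import math

def fnn_fraction(series, d, tau, r_tol=10.0):
    """Fraction of false nearest neighbours in an embedding of
    dimension d: embed the series, find each vector's nearest
    neighbour, and count the pair as 'false' if the extra (d+1)-th
    coordinate pushes them apart by more than r_tol times their
    distance in dimension d."""
    # only keep vectors that also have a (d+1)-th coordinate
    n = len(series) - d * tau
    emb = [[series[i + j * tau] for j in range(d)] for i in range(n)]
    false = 0
    for i in range(n):
        # nearest neighbour of vector i in dimension d
        j = min((jj for jj in range(n) if jj != i),
                key=lambda jj: math.dist(emb[i], emb[jj]))
        dist_d = math.dist(emb[i], emb[j])
        extra = abs(series[i + d * tau] - series[j + d * tau])
        if dist_d == 0 or extra / dist_d > r_tol:
            false += 1
    return false / n
```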

Clustering by local gravitation
Gravity theory has been used in clustering for a long time. Numerous gravity-based clustering algorithms have been developed, which replicate the process of objects being attracted and merged by their gravitational force. Each data point is considered an object by these methods and is assigned a mass in feature space. Local gravitation clustering (LGC) is a method that uses local gravitational attraction to cluster data points, in which each data point is treated as a massed object that is attracted to its neighbours. Two key stages are included in the LGC algorithm: first, LGC differentiates interior points, border points, and unlabelled points; second, LGC uses soft-connecting technologies to connect the interior points. More details about this algorithm are given in what follows.

LGC Algorithm
Inspired by Newton's theory of gravity, the local gravitation in the data clustering technique reflects the relationship between a data point and its immediate neighbours. According to the theory of gravity, the attraction force between two point masses may be calculated using the following formula:

F_ij = G (m_i m_j / D_ij²) D̂_ij (5)

where F_ij is the force between the two points i and j, D_ij is the distance between the two mass points m_i and m_j, G is the gravitational constant, and D̂_ij is the unit vector along the line connecting the two points, along which the force acts. Because distances between points within the same local region are unlikely to change considerably, Eq. (5) can be simplified (Eq. (6)), and the local resultant force (LRF) of point i is obtained by summing the simplified forces exerted by its k-nearest neighbours (Eq. (7)). D̂_ij contains the directional information between point i and its neighbours, and the m_j values act as weighting factors in composing the forces in the neighbourhood. Equation (7) shows that neighbours with greater masses have more influence than those with smaller masses, and vice versa. Therefore, the authors of the LGC algorithm suggest that this new concept of the LRF replace Newton's theory of gravity, as expressed in Eq. (8).
where the mass m_i is defined in Eq. (9). The centrality (CE)-based local clustering method is proposed to take advantage of the LRF's information. Equation (10) defines the CE of data point i, where k is the number of neighbours. A CE value larger than 0 indicates the approximate direction of an LRF. According to Eq. (10), if a point is at the border of a cluster, the LRF directions of its neighbours are nearly opposite to D̂_ij and its CE is small; the opposite is true if a point is at the centre of a cluster.
In summary, data points in the centre area are differentiated from those in the border region using the physical-principle-based LGC technique, which calculates the LRF for each data point. In the first phase of the algorithm, LGC calculates the LRF for every data point. After classifying the data points into interior, border, and unmarked points, LGC performs the clustering process in the second stage based on the LRF information, connecting the interior data points. The data points that have been selected as border points are then used to complete each cluster.
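The first stage can be loosely illustrated as follows. This is not the published LGC formulation (Eqs. (5)-(11) involve point masses and a specific CE definition); here unit masses and unit direction vectors are assumed, and CE is approximated by how strongly the k neighbour directions cancel: forces on an interior point cancel out, while forces on a border point all pull to one side.

```python
import math

def lgc_classify(points, k=3, ce_threshold=0.5):
    """Loose sketch of the first LGC stage: compute a local resultant
    force (LRF) for each point from its k nearest neighbours, then use
    a centrality-like (CE) score to separate interior points from
    border points. Assumes distinct points and unit masses; the
    soft-connecting second stage is not shown."""
    labels = []
    for i, p in enumerate(points):
        # k nearest neighbours of p (excluding p itself)
        nbrs = sorted((q for j, q in enumerate(points) if j != i),
                      key=lambda q: math.dist(p, q))[:k]
        # LRF: sum of unit vectors pointing from p towards its neighbours
        lrf = [sum((q[d] - p[d]) / math.dist(p, q) for q in nbrs)
               for d in range(len(p))]
        # |LRF| near k means all neighbours lie on one side (border);
        # |LRF| near 0 means the attractions cancel (interior)
        ce = 1.0 - math.hypot(*lrf) / k
        labels.append("interior" if ce > ce_threshold else "border")
    return labels
```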

Group method of data handling (GMDH)
The GMDH modelling technique is an effective method for the identification of higher-order nonlinear systems. A.G. Ivakhnenko was the first to use this term (Farlow 1981). Furthermore, GMDH is an inductive self-organizing algebraic model, so it is not essential to know the precise physical model in advance to use it successfully. Instead, GMDH automatically discovers the dominant relationships among the system variables during the training phase. In other words, the optimum network structure is automatically chosen to minimize the prediction error criterion, and any superfluous neurons are removed from the network. As a result, GMDH has excellent generalization capabilities and can accommodate the complexity of nonlinear systems (Kondo 1998).

GMDH Model
The GMDH network is a data-driven modelling method. It uses mathematical functions to describe the complicated nonlinear connections between the provided input and output data sets. The GMDH network comprises several layers, each of which contains neurons. Each neuron has two inputs and a single output (Nazerfard et al. 2006). The output of each neuron is computed using the Ivakhnenko polynomial (Farlow 1981) stated in Eq. (15):

g = a0 + a1 x1 + a2 x2 + a3 x1 x2 + a4 x1² + a5 x2² (15)

The GMDH networks are constructed by combining two input variables in each layer. The coefficients of the polynomial are then calculated for each combination and its associated output using the least-squares fitting method or regression analysis. To assess and verify the output of the polynomials once they have been computed, external accuracy criteria are used.
External criteria of accuracy The model adequacy is evaluated using the external criterion, known as the regularity criterion. It evaluates the GMDH network's mean square error for each model neuron to assess the model output. When applied to the network, regularity also tells us which input combinations are more important. This method also evaluates the neuron polynomial's capacity and fitness to produce the intended system output: the lower the regularity criterion, the closer the neuron polynomial fits the data. The regularity criterion is given by Eq. (16):

R = (1/N) Σ_(i=1..N) (y_i − g_i)² (16)

where R is the regularity criterion measure, N is the number of samples, y is the desired output, and g is the GMDH neuron output.
Each neuron output is subjected to the regularity criterion in the GMDH network. In the next part, we discuss the sorting process used to decide which neurons survive to go on to the next layer.
Sorting-out procedure description Since the number of neurons and the number of layers are not specified, GMDH modelling is self-organizing. In order to ensure optimal performance, the best-performing neurons in each layer of the GMDH are chosen using the external criterion defined in the preceding part. Neurons whose regularity criterion falls below a pre-set threshold are retained, while those that perform worse than the other neurons are removed and discarded from the network. Moreover, each layer keeps its lowest regularity criterion. When the following layer's lowest regularity criterion is greater than the preceding layer's, adding layers will no longer bring improvement. The last layer's neuron provides the network's final output (Water et al. 2000). The main procedure for GMDH algorithm implementation is described in the following steps (Nazerfard et al. 2006; Onwubolu 2016; Vaishnav and Vajpai 2018).
• Step (1): The data set is separated into a training set and a checking set. The GMDH neurons' weights are estimated using the training data, while the checking data are utilized to organize the network structure. A heuristic method is used to divide the data, either by choosing random locations for each group or by analysing the variance of the data (Onwubolu 2008).
• Step (2): Generate every conceivable combination of two inputs among all of the input variables. The number of possible combinations is n = m(m − 1)/2, where m and n are the number of input variables and the number of combinations, respectively. After that, expand the inputs to a quadratic polynomial Z for each combination. In this way, the polynomial (g) coefficients, previously stated in Eq. (15), are calculated for each combination using the training set. To get the polynomial weights, the least-squares fitting method is used, and the value obtained is given by Eq. (17).
• Step (3): The third step is a detailed evaluation and statistical testing of each polynomial using the checking data points. Eq. (18) is used to determine the output of each polynomial. The regularity criterion of each neuron in the first layer is then determined. Neurons that meet the requirement of being "less than" a certain value are permitted to continue and become inputs for the neurons in the following layer. The remaining neurons are eliminated from the network.
• Step (4): Finally, the whole process, starting with the second step, is repeated until the requirement for ending the GMDH network is fulfilled, at which point the procedure ends. The GMDH network ends when the lowest regularity criterion in the current layer is no longer smaller than the lowest regularity criterion in the preceding layer. In order to get the final GMDH model, the route of the neurons in each layer that corresponds to the lowest regularity criterion is traced backwards. The flowchart of the GMDH algorithm is given in Fig. 2.
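A single layer of Steps (2)-(3) can be sketched as follows (a simplified NumPy illustration assuming the quadratic Ivakhnenko polynomial of Eq. (15) and a mean-squared-error regularity criterion; `gmdh_layer` and `keep` are hypothetical names, and the layer-stacking stopping rule of Step (4) is not shown):

```python
import numpy as np

def fit_neuron(x1, x2, y):
    """Fit g = a0 + a1*x1 + a2*x2 + a3*x1*x2 + a4*x1^2 + a5*x2^2
    by least squares and return the coefficient vector."""
    A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def neuron_output(coef, x1, x2):
    A = np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])
    return A @ coef

def gmdh_layer(X_train, y_train, X_check, y_check, keep=4):
    """One GMDH layer: fit a quadratic neuron for every pair of inputs,
    rank the neurons by the regularity criterion (mean squared error on
    the checking set), and keep the best ones for the next layer."""
    m = X_train.shape[1]
    candidates = []
    for i in range(m):
        for j in range(i + 1, m):
            coef = fit_neuron(X_train[:, i], X_train[:, j], y_train)
            g = neuron_output(coef, X_check[:, i], X_check[:, j])
            R = float(np.mean((y_check - g) ** 2))  # regularity criterion
            candidates.append((R, i, j, coef))
    candidates.sort(key=lambda c: c[0])
    return candidates[:keep]
```

Stacking such layers until the best checking-set error stops improving reproduces the self-organizing behaviour described above.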

Error metrics
A forecasting model's predicted values must be tested to determine its performance. To find the optimal model, appropriate error metrics must be calculated. An error metric is a method for measuring the quality of a model and may be used by the forecaster to compare several models. It allows us to better assess how effectively the model performs its various tasks. To this end, we have chosen four metrics in order to evaluate the forecasted data, noted ŷ, versus the measured data, noted y, over a number of observations noted N. These statistical metrics are summarized as follows (Benmouiza and Cheknane 2018; Botchkarev 2019; Huang et al. 2021).

Root mean square error (RMSE)
In mathematics, the root mean square error (RMSE) is defined as the square root of the average of the squared errors. It is also described as a measure used to evaluate the quality of a forecasting model or predictor, among other things. RMSE takes into account both variance (the dispersion of predicted values from one another) and bias (the distance of a predicted value from its true value). This error measure is likewise always positive, with lower values being preferable. The RMSE value is in the same unit as the predicted value, which benefits this approach: in comparison to the mean square error (MSE), this makes it simpler to comprehend. The RMSE may also be compared to the mean absolute error (MAE) to see whether the prediction includes significant but occasional errors. The greater the gap between RMSE and MAE, the greater the inconsistency of the error magnitudes. This measure may gloss over issues related to low data volume. Equation (19) gives the calculation of RMSE.

Mean absolute error (MAE)
MAE is the average of the absolute differences between forecasted and actual values. The MAE informs us of how much inaccuracy we may anticipate from the prediction on average. The error values are in the predicted values' original units, and MAE = 0 implies that there is no error in the forecasted values. The lower the MAE value, the better the model; in other words, when comparing several models, the model with the lowest MAE is considered superior. However, MAE does not show the relative magnitude of the errors, making it impossible to distinguish between large and small errors. It may be used together with other measures (such as the root mean square error) to assess whether the errors are larger. Furthermore, MAE may obscure issues related to low data volume. The MAE is calculated using Eq. (20):

R-squared value
R-squared (R²) is a statistical measure that illustrates how close the data are to the fitted regression line. It is also known as the coefficient of determination (or the coefficient of multiple determination in multiple regression). R² represents the proportion of the response-variable variance explained by a linear model, and its values lie between 0 and 100%. A value of 0% indicates that the model explains none of the variability of the response data around their mean, while 100% indicates that the model explains all of it. Hence, the higher the R² value, the better the model fits the data. R² is calculated using Eq. (21):
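A minimal sketch of the coefficient of determination in its usual 1 − SS_res/SS_tot form (assuming NumPy; names are ours):

```python
import numpy as np

def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variance around the mean
    return float(1.0 - ss_res / ss_tot)

# A perfect forecast explains all of the variance
print(r_squared([100.0, 300.0, 500.0], [100.0, 300.0, 500.0]))  # 1.0
```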

Forecasting skill
The forecasting skill (FS) measures the degree of improvement of a prediction over a reference model. It is calculated using Eq. (22) (Mazorra Aguiar et al. 2015; Schmidt et al. 2016), where RMSE smart is the RMSE of the smart persistence model, which assumes that the clear-sky index k*t for each time horizon h persists to the next time step. The clear-sky index k*t is the ratio of GHI, the measured hourly global horizontal solar irradiation at ground level, to GHI clear, the calculated clear-sky hourly solar irradiation (W.M.O. 1981; Tadj et al. 2014).
A perfect forecast will have an FS equal to 1, and an FS equal to 0 means there is no improvement over the reference model.
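Equations (22)–(24) are not reproduced in this section; assuming the standard skill-score form used in the cited literature, FS = 1 − RMSE_model/RMSE_smart, a minimal sketch (names are ours) is:

```python
import numpy as np

def forecasting_skill(rmse_model, rmse_smart):
    """FS = 1 - RMSE_model / RMSE_smart (standard skill-score form).

    FS = 1 is a perfect forecast; FS = 0 means no improvement over
    the smart persistence reference.
    """
    return 1.0 - rmse_model / rmse_smart

def clear_sky_index(ghi, ghi_clear):
    """k* = GHI / GHI_clear, guarded against division by zero at night."""
    ghi = np.asarray(ghi, dtype=float)
    ghi_clear = np.asarray(ghi_clear, dtype=float)
    return np.where(ghi_clear > 0, ghi / np.maximum(ghi_clear, 1e-9), 0.0)

# Halving the reference RMSE corresponds to a skill of 0.5
print(forecasting_skill(50.0, 100.0))  # 0.5
```

The smart persistence forecast itself is then GHI_clear(t+1) multiplied by the current clear-sky index k*t.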

Data
The suggested approach was evaluated in Ghardaia, Algeria. Algeria is situated in the heart of North Africa, along the Mediterranean coast, between the latitudes of 19° and 38° North and the longitudes of 8° West and 12° East. A significant portion of the Sahara lies in its southern section (nearly 86% of the country's total area). Thanks to its location in the Sun Belt and its excellent climatic conditions, Algeria is a key player in the field of solar technology and has the most significant solar energy potential in the MENA area. Daily sunlight duration is approximately 7.3 h in the North, 8.3 h in the highlands, and more than 10 h in the southern regions. Overall, the country's sunlight duration surpasses 3000 h/year, and in the Sahara it may approach 3500 h/year (Benmouiza 2015).
This paper is especially interested in a particular area in the southwest of Algeria, namely the city of Ghardaia (positioned at 33.46° N, 3.78° E, at an altitude of 463 m). The available data were gathered by the enerMENA meteorological network project's meteostation (Schüler et al. 2016). The Ghardaia station measures global (GHI), direct normal (DNI), and diffuse (DHI) solar irradiation, and also monitors meteorological factors (air temperature, humidity, wind speed, and atmospheric pressure). As a result, a database spanning more than 9 years enabled us to build an in-depth knowledge of the area's solar potential. For DNI measurements, the meteostation is equipped with CHP1 pyrheliometers, and for diffuse and global horizontal irradiation measurements, it is equipped with CMP11 thermal pyranometers. These instruments are mounted on a Solys2 solar tracker with a sun sensor.

Simulation results
The main objective is to provide the best model for 1-h-ahead GHI forecasting. Hence, we have proposed clustering using the local gravitation method and forecasting using GMDH. To this end, we first train the model from 1 January 2020 to 30 November 2020 and test it for 744 h (the total number of predicted hours) from 1 December 2020 to 31 December 2020. Then, 2 years of data are chosen for training and 6 months for testing. First, using the time-delay embedding technique, the phase space reconstruction of the hourly solar irradiation was obtained. We determined the best time delay for the phase space reconstruction to be one, with an embedding dimension equal to two. Next, the LGC algorithm was applied to the results obtained from the phase space reconstruction. The clustering results from the LGC show that the number of clusters is 3.
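The time-delay embedding step described above can be sketched as follows (a minimal illustration assuming NumPy; the function name is ours, and the paper's reported parameters are delay τ = 1 and dimension 2):

```python
import numpy as np

def delay_embed(x, dim=2, tau=1):
    """Reconstruct the phase space of a scalar time series x.

    Returns an array of shape (len(x) - (dim - 1) * tau, dim) whose rows
    are the delay vectors [x_t, x_{t+tau}, ..., x_{t+(dim-1)*tau}].
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

series = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
print(delay_embed(series, dim=2, tau=1))
# rows: [0, 1], [1, 2], [2, 3], [3, 4]
```

Each row is a point in the 2-dimensional reconstructed phase space; the LGC algorithm then clusters these points rather than the raw scalar series.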
The simulation results of the reconstructed phase space, the LGC clustering, and the three subsets obtained from the LGC clustering algorithm of the solar irradiation are shown in Figs. 3, 4, and 5, respectively.
These figures show that the solar irradiation time series has been grouped into three clusters. Cluster N°2 contains high values of solar irradiation, describing the sunny hours throughout the year (for example, the hours in the middle of the day when there are no clouds). These contrast with the medium values of cluster N°3, which depict hours when the sky is partly cloudy (for example, hours from 9 AM to 11 AM and from 2 PM to 4 PM). Finally, the low values of cluster N°1 describe hours when the sky is entirely cloudy (for example, when it is raining or snowing) or the hours around sunrise and sunset.
The next step consists of forecasting each cluster alone and then reconstructing the global forecast. In the simulation, various numbers of parameters are tried in order to achieve the best forecast, and error calculations are used to assess the system's performance. The simulation results for the best configuration of the forecasted and measured test data are shown in Fig. 6 for December 2020; the red line represents the forecasted data, and the blue dotted line shows the measured hourly global solar irradiation data. Furthermore, the relative error and its histogram are also shown in this figure.
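Since the GMDH equations are not reproduced in this section, the following is a heavily simplified, single-layer sketch of the GMDH idea: every pair of inputs feeds a quadratic "Ivakhnenko polynomial" neuron fitted by least squares, and the best-performing neuron is kept. All names are ours, and the paper's actual self-organizing network is deeper and selects several neurons per layer.

```python
import numpy as np

def quad_features(a, b):
    """Design matrix of the Ivakhnenko polynomial
    y = w0 + w1*a + w2*b + w3*a*b + w4*a^2 + w5*b^2."""
    return np.column_stack([np.ones_like(a), a, b, a * b, a ** 2, b ** 2])

def fit_gmdh_layer(X, y):
    """Fit one GMDH layer: a quadratic neuron for every input pair,
    keeping the neuron with the lowest mean squared error."""
    best = None
    for i in range(X.shape[1]):
        for j in range(i + 1, X.shape[1]):
            Phi = quad_features(X[:, i], X[:, j])
            w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
            err = float(np.mean((Phi @ w - y) ** 2))
            if best is None or err < best[0]:
                best = (err, i, j, w)
    return best  # (mse, input index i, input index j, weights)

def predict_gmdh(best, X):
    _, i, j, w = best
    return quad_features(X[:, i], X[:, j]) @ w
```

In the hybrid scheme, such a model would be fitted separately on the delay vectors of each LGC cluster, and the per-cluster predictions recombined to obtain the global forecast.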
In addition, the training and testing data versus its fit with their R-squared values are shown in Fig. 7. The majority of the locations in both the predicted and measured series are within a few degrees of each other. Because of the large number of cloudy days, there are some delays in the system's response time.
A further evaluation of the forecasted hourly global horizontal time series was carried out by computing the error metrics reported in Table 1. From Table 1, we can clearly see that a high number of layers and neurons gives the lowest errors and the highest R² and forecasting skill. However, it takes more computational time; hence, it is unnecessary to increase their number further, since the results are close to each other. Moreover, from the results of Table 1 and Fig. 6, which shows the plots of the best configuration, the overall RMSE equals 35.96 W/m² and the MAE equals 22.26 W/m², which may be regarded as excellent predicted values.
Furthermore, based on Fig. 7, the R-squared value computed with Eq. (21) equals 0.982, together with a good forecasting skill of 48.12%. Based on these simulation findings, this approach proves to be an excellent way to forecast solar irradiation data.
In the same way, we have tried larger data sets in order to forecast more than one month. Hence, we have chosen solar irradiation measurements covering three years, from 1 January 2018 to 31 December 2020, for the site of Ghardaia, Algeria. Thirty-three months were used for training and 3 months as the forecast horizon. The same methodology applied for the 1-month forecast was applied to the 3-month case. The clustering results for the reconstructed phase space are shown in Fig. 8.
In addition, the comparison between measured and forecasted hourly solar irradiation data for three months, as well as the measured-versus-forecasted fit, is shown in Fig. 9. From these figures, we can clearly see the goodness of the LGC-GMDH model: it forecasts well even on cloudy days, with an R² value equal to 98.57%. These results clearly prove the robustness of the proposed model for forecasting solar irradiation time series.

Comparison
Various comparison tests were conducted with other models to check the accuracy of the proposed LGC-GMDH forecast model. Single forecasting methods such as artificial neural networks (ANN) (Faceira et al. 2015; Voyant et al. 2017), support vector machines (SVM) (Wang et al. 2019), recurrent neural networks (RNN) (Zhang and Behera 2012), and their hybrid models have been extensively utilized in solar irradiation forecasting. Accordingly, single benchmark models such as backpropagation (BP) (Laopaiboon et al. 2019), SVM, generic RNN, and LSTM (Malakar et al. 2021; Kumari and Toshniwal 2021) are studied extensively in this paper. Hybrid models that use multivariate time series data have also been tested alongside them: the BP-MLP, RNN-MLP, LSTM-MLP, and WPD-CNN-LSTM-MLP models (Huang et al. 2021). Moreover, the clustered ANFIS network using fuzzy c-means, subtractive clustering, and grid partitioning presented in (Benmouiza and Cheknane 2018), which provides good results compared with other models, is also considered in this comparison.
For the comparison, solar irradiation data and other related climatic parameters (including air temperature, relative humidity, wind speed, and other measurements) for the location of Denver (39°44′ N, 105°11′ W, Colorado, USA) are utilized. An earlier portion of this record served as the training dataset, and the data from 1 January 2016 to 31 December 2016 were used for testing the models. The original data are captured and recorded on a continuous one-minute basis; they are then aggregated to hourly values. The RMSE, MAE, MBE, R², FS, SBF, and WIA have been chosen as error metrics in order to judge the goodness of the proposed model.
Here MBE is the mean bias error; SBF is the slope of the best-fit line, the straight line that best approximates the given set of data; and WIA is Willmott's index of agreement, a standardized measure of the degree of model prediction error that relates the mean square error to the potential error. It is calculated using Eq. (25), where y_i is the measured data, ŷ_i is the forecasted data, and ȳ is the average of the measured data.
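Equation (25) is not reproduced in this section; assuming the standard form of Willmott's index of agreement (which the text's description matches), a minimal sketch with our own naming is:

```python
import numpy as np

def willmott_index(y, y_hat):
    """Willmott's index of agreement (standard form):
    d = 1 - sum((y - y_hat)^2) / sum((|y_hat - ybar| + |y - ybar|)^2),
    where ybar is the mean of the measured data."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    ybar = y.mean()
    num = np.sum((y - y_hat) ** 2)                                 # mean square error part
    den = np.sum((np.abs(y_hat - ybar) + np.abs(y - ybar)) ** 2)   # potential error part
    return float(1.0 - num / den)

# A perfect forecast gives d = 1
print(willmott_index([100.0, 300.0, 500.0], [100.0, 300.0, 500.0]))  # 1.0
```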
A value of 0 indicates perfect performance for the RMSE, MAE, and MBE, while a value of 1 indicates a perfect model for R², FS, SBF, and WIA. The results are presented in Table 2.
In addition, for a clearer presentation of this comparison, a ranking of these methods based on each error metric is given in Table 3.
Moreover, the error bars for these results are shown in Fig. 10.
As can be seen, the prediction accuracy of single models, such as BP, SVM, RNN, and LSTM, is not favourable. However, the prediction error is significantly reduced once these models are integrated with other network structures. Among the single models, the lowest RMSE equals 68.7213 W/m² for LSTM, and the best FS equals 0.3167 for the BP model. In comparison, these errors are much lower for hybrid models, with an RMSE equal to 46.1336 W/m² for WPD-CNN-LSTM-MLP, together with an FS equal to 0.590 and a WIA equal to 0.9652. Furthermore, the category of clustering hybrid models gives more accuracy: the errors improve with the ANFIS-clustering model, whose RMSE, MAE, MBE, R², and FS are 45.5463 W/m², 19.2985 W/m², 21.224 W/m², 98.60%, and 59.8%, respectively. However, the proposed LGC-GMDH model performs better than all of the above, with an RMSE, MAE, MBE, R², FS, WIA, and SBF of 43.7268 W/m², 19.1856 W/m², 20.395 W/m², 98.87%, 61.3%, 0.9689, and 0.8727, respectively. This comparison concludes that the LGC-GMDH can be used to forecast hourly global solar irradiation data, provided the best configuration is chosen for better forecasting performance.

Conclusion
This paper suggests a new hybrid LGC-GMDH machine learning model for 1-h-horizon global horizontal irradiation forecasting. In this model, we have developed a three-phase structure. The data were divided into training and testing sets using k-fold cross-validation. Clustering using the local gravitation algorithm was applied in order to obtain groups with similar dynamic characteristics; to this end, a phase space reconstruction was used to build a 2-dimensional representation of the global horizontal irradiation data. The obtained clusters were then used as inputs to the GMDH network to obtain the local forecasts, from which a global forecast was reconstructed. The proposed model achieves excellent forecasting results because it takes into consideration the underlying characteristics of the irradiation time series together with the power of neural networks.
Compared with other models, the proposed model shows the lowest RMSE, equal to 43.7268 W/m², an R² equal to 0.9887, and the highest FS, equal to 0.613. The comparison also illustrates that the network structure significantly impacts the model's overall accuracy. The LGC-GMDH model has thus been shown to provide accurate hour-ahead irradiation forecasts and a useful reference for practical applications. Future work could test optimisation methods for choosing the optimal network configuration in order to refine the results further.
Author contribution The corresponding author BK contributed to the study conception and design. In addition, he performed the material preparation, data collection, and analysis. BK wrote the first draft of the manuscript and read and approved the final manuscript.
Data availability Data used within this research are available upon request from the corresponding author.

Declarations
Ethics approval No ethical approval is required, since the research described in this paper only concerns the analysis of meteorological data that have been made public by the laboratory that collected them, and no participant has been involved, directly or indirectly, as a subject of the research itself.

Consent to participate
No participant has been involved, directly or indirectly, in the research itself.

Consent for publication
All the authors mentioned in the manuscript have agreed to authorship, read and approved the manuscript, and given consent for submission and subsequent publication of the manuscript. All named authors agreed on the order of authorship before submission, and all authors have agreed to the name of the corresponding author. The manuscript has not been published anywhere else.

Competing interests
The author declares no competing interests.