Assessment of LSTM, Conv2D and ConvLSTM2D Prediction Models for Long-Term Wind Speed and Direction Regression Analysis

Abstract: This paper assessed the prediction accuracies of three forecast-based architectures (Long Short-Term Memory, LSTM; Convolutional Neural Network, Conv2D; and the hybrid ConvLSTM2D) for multivariate-input, multi-step wind speed and direction forecasts. These high-level neural network architectures were set up as Keras sequential models trained to learn the historical patterns in the processed weather input datasets. To build the forecast models, sampled time-series weather observations at different station heights were obtained and reshaped for network-layer compatibility, while the Adamax algorithm was used for network optimization. The trained and evaluated model performances with different input data sequences (normalized/un-normalized) were assessed, and the forecast results were also compared with the actual observations and the Conv1D model. Upon optimal network training, the Conv2D model returned MSE, MAE and RMSE values of 0.82, 4.48 and 0.91 %, respectively; the LSTM model returned 1.03, 4.75 and 1.01 %; and the ConvLSTM2D model returned 2.11, 10.13 and 1.45 %. On validation, the Conv2D model returned values of 3.16, 14.73 and 1.77 %, respectively; the LSTM model 3.21, 14.98 and 1.82 %; and the ConvLSTM2D model 3.27, 15.92 and 1.91 %. The findings show that better prediction and evaluation could be achieved for all the trained model architectures as compared to the untrained models. Also, from the predicted model results, the Keras sequential models were found to be useful for replicating time-series historical wind speed and direction, given well-tuned model hyperparameters and a suitable input sequence structure.


Introduction
In recent studies, the Long Short-Term Memory (LSTM), Recurrent Neural Network (RNN), Gated Recurrent Unit (GRU), Convolutional Neural Network (CNN) and hybrid LSTM-CNN models have been widely deployed in many machine-learning frameworks (short- to long-term applications) [1][2][3][4][5]. These forecast neural networks have shown different state-of-the-art results depending on the selected network optimizers [6][7][8][9][10][11] and hyperparameter-tuning techniques [12][13][14][15][16]. Within the deep-learning framework, both the network optimizer and the tuning approach play a critical role in setting up an accurate forecast system, as they impact the model's learning ability and forecast skill. Among these prediction models, the LSTM network has been preferred and utilized in complex frameworks for processing long-term non-linear input data sequences, compared with other state-of-the-art models. This choice rests on the LSTM gating mechanism (memory cells), which supports better feature extraction and modeling of long-term dependencies in the input sequence [17,49]. It is well known in machine learning, however, that achieving reliable state-of-the-art results through LSTM-network tuning is a difficult task, as it increases the model complexity [18] (that is, it requires the right selection and optimization of different model hyperparameters [11]) and computational time. Notwithstanding, this forecast model has been found to be more efficient and reliable with optimally tuned hyperparameters [19].
Moreover, the sequential model's adaptivity in extracting hierarchical local features [20] from a given input sequence has made the LSTM model competitive with existing machine-learning architectures [21] and implementable in various fields: petroleum production forecasts [22], stock/financial market forecasts [23][24], workload prediction for cloud monitoring [25,26], text processing [27][28][29], speech recognition [30][31], short-term photovoltaic and temperature forecasts [32][33][34][35][36][37][38][39], and micro-pollutant forecasts in watersheds [40], among others.
In comparison studies, convolutional networks have also been considered competitive forecast models against the LSTM because of: (1) fast/timely convergence, (2) easily trained forecast models, and (3) the ability to interpret the learnt features of a given input sequence. Among the Keras sequential models, Baccouche et al [41] utilized the sequential Conv3D (3-D CNN) model for human action recognition. Keren et al [17] utilized a convolutional RNN with sequential input data within audio classification frameworks. Yang et al [42] deployed a sequential CNN to extract effective spatial-temporal features from given video frames. Ngoc Vu [18] considered the SCNN for the slot-filling task in spoken language understanding. Vinyals et al [43] and Ciresan et al [44] utilized CNN models for image caption generation and traffic sign recognition, respectively. In the aforementioned studies, these forecast models outperformed some existing forecasting networks (RNNs, ARIMA, self-organizing maps, support vector machines, among others) in: (1) time-series prediction, (2) regression and (3) classification analyses [32,45]. Despite the wide application of different forecast model architectures in learning the spatial-temporal features/patterns of a given input sequence, (1) the selection and optimal tuning of different model hyperparameters and (2) the internal covariate shift of model layers under a distributed input sequence remain the main limitations of deep-learning models [46]. To mitigate the internal shift effects within the network layers, Molina et al and Saxena [47][48] suggested the batch normalization technique as one useful method for correcting the internal distribution shifts between the model input and flatten layers.
Nevertheless, neural network architectures built with the Keras sequential models in TensorFlow have shown significant improvement over conventional forecast models [50][51][52][53][54][55] and have recently been deployed in deep-learning frameworks [11,17,49]. With Keras neural networks, different model hyperparameters can be optimally tuned, and the internal covariate shift of the model layers is easily resolved by incorporating a 'batch normalization layer' into the traditional LSTM and CNN model layers. The Keras sequential model is a linear stack of layers (model layer by layer) that can be concatenated to build a high-level neural network architecture [26]. As an important, emerging and promising tool in energy assessment and trading forecasts, different forecast models in the Keras library [56] are available for designing reliable forecast-based architectures [11,17,26,57]. In a few recent studies, Malakar et al [32] designed an LSTM model for short-term solar energy prediction based on 4 sequentially arranged layers (1 input, 2 hidden and 1 output); Kong et al [58] utilized the LSTM network model for residential short-term load prediction within deep-learning frameworks; Sutskever et al [57] utilized the LSTM neural network in machine translation; Reimers et al [11] adopted the LSTM model for sequence tagging. Xie et al [72] also utilized a multi-variable LSTM to predict short-term wind speeds. These findings revealed that the Keras sequential models are essential for short- to long-term forecast projections owing to their deep-learning ability [59][60].
However, for multivariate-input, multi-step wind speed and direction forecasts, we could not find any extensive work: Keras neural networks for long-term wind prediction have not been explored within an African context. We therefore adopt the batch normalization technique as one solution to the ConvLSTM2D model's covariate shifts and assess the accuracy of Keras neural networks for wind speed and direction projection. In this study, we propose and evaluate three neural network architectures (Conv2D, LSTM and ConvLSTM2D) built with Keras sequential models (TensorFlow 2.0) for projecting time-series wind speed and direction at selected weather stations. Each Keras neural network is developed as a linear stack of network layers compiled together (Fig 1) as an essential forecast model in wind studies. Owing to (1) the high-level model computations, (2) the huge cost of the ConvLSTM2D setup and (3) the poor performance of interconnected layers when trained with long-term non-linear input datasets, the batch training approach is considered the better option. Thus, batches of input data arrays from the weather datasets are fed into the Keras neural networks, and the hidden-layer units are allowed to extract local features from the input sequence. In preparing the forecast models for the deep-learning framework, the historical time-series variables of 3 stations are normalized (scaled down) and utilized as training, validation and testing input data arrays, respectively, while model input datasets from 5 additional stations are obtained for evaluating the trained and untrained networks (compiled layer models without training procedures). Also, the effect of over-learning of the developed neural networks on the historical weather input dataset is addressed with dropout regularization between 20 and 40 % [37].
Furthermore, for model hyperparameter tuning and selection of a reliable forecast system, the (1) model parameters, (2) learning rate of the network optimizer and (3) layer activation functions are appropriately selected and initialized (Figs 1, S1-2), while the model training and validation losses are monitored. Lastly, the prediction accuracy of the Conv1D model with unscaled inputs at 300 epochs (Fig S4) and scaled input arrays at 100 epochs (Fig S5 a-b) is compared with the LSTM performance at 100 epochs (Fig S9 a-b). Based on the model performances, the models' forecast accuracy against the actual model is assessed ((1) time-series wind speed and direction for 0-200 timesteps, (2) wind roses and the frequency of occurrence in 12 wind sectors, (3) trained and untrained network model accuracy) for the selection and recommendation of a reliable forecast model for wind speed projection. Hence, the study objectives are centered on the development of high-level Keras neural networks within the deep-learning framework that would be: (1) utilized as stand-alone forecast models in energy assessment based on the historical wind conditions of a given site; and (2) used to determine whether Keras neural networks, given their timely convergence, can be deployed in a deep-learning framework as reliable forecast tools for replicating historical wind speed and direction patterns over a given time horizon. Following this introduction, the rest of the paper is structured as follows: section 2 describes the monitoring stations and data pre-processing; section 3 explains the methodology for (1) model input data preparation, (2) Keras sequential model setup and development, and (3) assessment of the model metrics (training and prediction accuracy). Lastly, the results and discussion, conclusion and directions for further studies relating to the developed forecast models are presented in sections 4 and 5, respectively.

Stations/Data Description
The multivariate time-series observations at different sampling rates (5-min/10-min/1-hour) and heights (2/10/20/60 m AGL) at 8 stations (Col. 1 of Table 1) were continuously monitored by weather sensors deployed at the South African Weather Stations (Paarl, Geelbek and Darling) and the South African Wind Atlas stations (WM01, WM02, WM03 and WM04). For the sampled periods, historical time series of weather variables were collected and processed for the model system architectures (Tables 2-4): wind direction at 10 m (WD_10) and 20 m (WD_20); air temperature over 1 min at 2 m (Tair/Temp), 10 m (TS_10) and 60 m (TS_60); temperature gradient (Tgrad); relative humidity (Hum/Rhum) at a 1-min rate; wind speed at 10 m (WS_10) and 20 m (WS_20); as well as the gust (highest wind) at 10 m height (Gust_10). For each selected station, time series of 4 variables were obtained (Col. 4) and separated into 3 input variables (Col. 5) and 1 output variable (Col. 6). Data cleansing was carried out by checking for missing data points in each variable at each station height; no missing data points were found in the collected dataset. For the proposed forecast models (LSTM, Conv2D and ConvLSTM2D), the 3 input variables in 2-dimensional format (rows and columns) were used to prepare the model input data arrays (Cols. 8-13), while the 1 output variable (Col. 6) was used to prepare the model output array (Col. 14). The technique adopted for reshaping the time-series station variables into the LSTM model input array (Col. 11), the Conv2D input array (Col. 12) and the ConvLSTM2D model input array (Col. 13), with the corresponding model output array (Col. 14), is discussed in section 3.1. The prevailing wind flows at the considered stations are North-Easterly, South-Easterly, Westerly and South-Westerly to North-Westerly, as shown in Fig. S3.
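As a concrete illustration, the separation of each station's 4 variables into a 3-input array and a 1-output array (Cols. 5-6 of Table 1) can be sketched as below; the column names, sample size and value ranges are illustrative placeholders, not the exact station files:

```python
import numpy as np
import pandas as pd

# Hypothetical station frame: 3 input variables and 1 output variable,
# mirroring the structure of Cols. 5-6 of Table 1 (values are random).
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "WD_10": rng.uniform(0, 360, 1000),     # wind direction (deg)
    "Temp":  rng.uniform(0.3, 46, 1000),    # air temperature (deg C)
    "Rhum":  rng.uniform(9.7, 93.7, 1000),  # relative humidity (%)
    "WS_10": rng.uniform(0, 25, 1000),      # target: wind speed (m/s)
})

X = df[["WD_10", "Temp", "Rhum"]].to_numpy()  # 3-input array (rows x 3)
y = df[["WS_10"]].to_numpy()                  # 1-output array (rows x 1)
```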

Methodology
The forecast accuracy of a developed sequential model depends on the right selection and tuning of the network hyperparameters, the model system architecture and the data quality of the input sequence. To make a reliable forecast of the local wind speed and sector-wise direction, a time series of high-quality historical datasets (the given sequence) is required. Thus, high-quality time series of past observations for the sampled periods were obtained, scaled down and reshaped according to the model's specification for accepting input arrays into the network layers, as discussed below.

3.1 Input/Output Data Structure

In a deep-learning forecast model, a high-level neural network requires input data arrays that are passed from the input nodes into the hidden layers so that the network can process and learn the historical pattern and make a reasonable prediction. To ensure that the proposed sequential neural model trains well and converges quickly, the time series of each selected station variable was normalized to a similar range of values. Also, because the sampled time-series observations were noisy and stochastic in nature, input data arrays at different sampling rates would negatively impact the model training and validation performances; data normalization was therefore also used to improve the model's generalization ability. As shown in the Conv1D model training and validation losses (Fig S4), un-normalized datasets as model input arrays into the 1st network layer did not allow the gradient descent to converge and prevented the model from learning deeply over 300 iterations. In contrast, the plots of model performance with normalized input arrays (Figs a-b of S5) show that the gradient descent converged more quickly and generalized with the model input data arrays at 100 iterations [61].
The time series of the multivariate input variables for model training, validation and testing (100000×3), as well as for evaluation (41200×3), were obtained (Cols. 5 and 7 of Table 1) at the station heights (Col. 1). The model input variables, namely (1) wind direction (0-360°), (2) air temperature (0.3-46 °C) and (3) relative humidity (9.7-93.7 %), were scaled down to similar values ranging from 0 to 1 [26,32]:

$$x_{norm,i} = \frac{x_i - x_{min}}{x_{max} - x_{min}}$$

where $x_i$ is the station variable value (> 0) at a given instance $i$; $x_{min}$ and $x_{max}$ are the minimum and maximum values per station variable, respectively; and $x_{norm,i}$ is the normalized station value.
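The min-max scaling above can be sketched in a few lines of NumPy (the sample wind-direction values are illustrative):

```python
import numpy as np

def min_max_scale(x):
    """Scale a station variable to the 0-1 range (min-max normalization)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Illustrative wind-direction samples (degrees)
wind_dir = np.array([10.0, 95.0, 180.0, 360.0])
scaled = min_max_scale(wind_dir)  # every value now lies in [0, 1]
```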
The Conv2D-based network (Fig 1a) requires a 4-dimensional (4-D) input data array with a corresponding 3-D input shape. The Conv2D input shape for the input layer was obtained using the input_shape argument:

$$\text{Input\_Shape\_Conv2D} = (\text{batch}, \text{timesteps}, \text{features}) \tag{2}$$

where batch = 1; timesteps = the number of rows per input variable, denoted t1, t2, t3, ..., tn for n = 100000; and features = 3 (the number of columns of input variables). From Eq. (2), the training, validation and testing data arrays had 1 batch, 100000 timesteps and 3 features, while the input shape of the evaluation data array had 1 batch, 41200 timesteps and 3 features (Col. 9).
Note: For the Conv2D model, information about its input-shape structure (Input_Shape_Conv2D) was passed to the input (1st) layer, while the subsequent model layers perform automatic shape inference within the neural network (see the block diagram of Fig 1a).
Meanwhile, a 4-D input shape for the 1st network layer of the ConvLSTM2D model was obtained as:

$$\text{Input\_Shape\_ConvLSTM2D} = (\text{time}, \text{timesteps}, \text{features}, \text{channels})$$

where the channel value is set to 1 (channels = 1). The summary of the 4-D input shape is presented in Col. 10.
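The reshaping of a normalized (timesteps × features) array into the three model input layouts (Cols. 11-13 of Table 1) can be sketched as follows; the exact axis ordering for the ConvLSTM2D array is an assumption inferred from the 5-D shapes quoted later in the text:

```python
import numpy as np

timesteps, features = 1_000, 3              # the paper uses 100000 x 3 per station
flat = np.random.rand(timesteps, features)  # normalized station inputs

# 3-D LSTM input array: (batch, timesteps, features) -- Col. 11
x_lstm = flat.reshape(1, timesteps, features)

# 4-D Conv2D input array: (batch, rows, timesteps, features) -- Col. 12
x_conv2d = flat.reshape(1, 1, timesteps, features)

# 5-D ConvLSTM2D input array: (batch, time, timesteps, features, channels=1) -- Col. 13
x_convlstm = flat.reshape(1, 1, timesteps, features, 1)
```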
Since the batch training method reduces the model computation complexity, we adopted it as the better option. With this method, each batch of input arrays was passed through the compiled system architecture (Tables 2-4) and the weighted sums of the trained model were stored; thereafter, the procedures were repeated for the subsequent batches (Tables 2-5).
Meanwhile, the transfer functions for initializing the input, hidden, dense and output layers of the neural networks were defined. For the layers (input, hidden and dense) of the Conv2D and ConvLSTM2D models, the weighted sum of the input data arrays used the hyperbolic tangent transfer function with a zero-centered output [63]:

$$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

where $e$ is the base of the natural logarithm and $x$ is the vector of outputs, taking scaled inputs and producing output values in the range -1 to 1.
For the LSTM model layers, the weighted sum of the model input array was determined using the rectified linear unit (relu) function, while the sigmoid function was defined for all the model output layers [64]:

$$\mathrm{relu}(x) = \max(0.0,\, x)$$

where the sigmoid function takes inputs at any scale and outputs values in the range 0 to 1.
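The three transfer functions above can be written out directly; this is a plain NumPy sketch, equivalent to the Keras activation strings used in the model setups:

```python
import numpy as np

def tanh(x):
    """Hyperbolic tangent: zero-centred output in (-1, 1)."""
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    """Rectified linear unit: max(0, x)."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Logistic sigmoid: output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))
```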

3.2 Conv2D Model Setup

The Conv1D model is a convolutional network that works best with very short input data arrays for univariate time-series forecasts. One main limitation of this model is its poor performance when used with long-term non-linear input sequences. To overcome this limitation, a more robust model (the 2-dimensional CNN, Conv2D), suitable for long-term multivariate inputs to multi-step outputs, was developed. Unlike the Conv1D model, which utilizes a 1-D MaxPooling layer with a 1-D kernel size (Table 4), the Conv2D sequential model utilizes 2-D MaxPooling layers and 2-D kernel sizes in its architecture [42,18] (Table 2). Thus, the Conv2D model (Fig 1a) was built on the traditional Conv1D model architecture with the '1-D MaxPooling layer and 1-D kernel size' replaced by a '2-D MaxPooling layer and 2-D kernel size'. Developing the Conv2D neural model architecture entails building a 2-D CNN model with normalized input/output data arrays. For this network setup, the hyperparameters that influence the model's learning ability and performance from the input to the output layers were identified and initialized before the network layers were compiled (Fig 1a). Firstly, from Eq. (4), normalized input arrays (0.0-1.0) were considered for the Conv2D model for: (1) smoother gradient descent, (2) faster convergence [65,50] of the training and validation losses (Figs a-b of 2 and 5), and (3) a reduced number of epochs for learning the input sequence (Figs a-b of S4-5). Next, the 7-layer Conv2D block diagram (Fig. 1a) and sequential model system architecture (Table 2) from the Keras library [56] were built with the following layers: '1 input layer and 1st 2-D MaxPooling layer', '1 convolutional hidden layer and 2nd 2-D MaxPooling layer', 1 flatten layer, 1st dense layer and 1 output (2nd connected dense) layer, respectively.
The MaxPooling layer distills the output of the previous layer by reducing the spatial dimensionality of the input sequence volume through a downsampling approach [66-68] and passes it to the subsequent layer, while the flatten layer takes the processed input sequence from the previous layer and narrows the feature sequence by wrapping it into a 1-D vector (from the 4-D output shape [1,1,20000,64] to the 2-D output shape [1,1280000]). The 1st connected dense layer extracts the wrapped input sequence from the flatten layer and interprets it for model predictions before passing it to the output nodes of the layer. Lastly, the following model hyperparameters were selected: 128, 64 and 72 convolutional filters (layer neurons) for the input, hidden and dense layers, respectively; a kernel size of the 2-D convolution window (width = 1, height = 1) for the input and hidden layers; a stride value of 1 with the padding set to 'same' (zero-padded inputs so that the input and output data arrays have similar spatial dimensions); dropout regularization rates between 20 and 40 % (ensuring that unused or randomly selected neurons in complex learning are ignored); and the tanh function to activate the input, hidden and dense layers with the sigmoid function for the output-layer activation. Other parameters for the Conv2D model build-up and learning were: epochs = 100 (iterations), batch_size = 1, nb_classes = timesteps = 20000. Thereafter, step 6 (Appendix A.1) was used to compile and summarize the system architecture (the output shape of each convolutional layer). For the network compilation (Fig 1a and step 6) that performs the regression analysis, the neural network was designed with: (1) the Adamax version of stochastic gradient descent as the model optimizer (to achieve better model performance); (2) the loss function (mean_square_error, MSE); and (3) the model metric (mean_absolute_error, MAE) for assessing the training and validation performances.
Meanwhile, the fit function (step 8) was used for training the compiled Conv2D network layers with the model input data arrays. The procedures utilized for developing the Conv2D neural network (Fig 1a, Cols. 9 and 12 of Table 1, and Table 2) [32] were wrapped into 12 steps (Appendix A.1).
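A minimal Keras sketch of the Conv2D architecture described above follows; the filter counts, kernel size, activations and compiler settings follow Table 2, while the reduced timestep count, pooling sizes and single-unit output layer are illustrative assumptions to keep the example lightweight:

```python
from tensorflow import keras
from tensorflow.keras import layers

timesteps, features = 256, 3  # the paper uses 20000 timesteps per batch

# Sketch of the 7-layer Conv2D sequential architecture (cf. Table 2).
model = keras.Sequential([
    layers.Conv2D(128, kernel_size=(1, 1), padding="same", activation="tanh",
                  input_shape=(1, timesteps, features)),   # input layer
    layers.MaxPooling2D(pool_size=(1, 2)),                 # 1st 2-D MaxPooling
    layers.Dropout(0.2),                                   # dropout regularization
    layers.Conv2D(64, kernel_size=(1, 1), padding="same", activation="tanh"),
    layers.MaxPooling2D(pool_size=(1, 2)),                 # 2nd 2-D MaxPooling
    layers.Flatten(),                                      # wrap to 1-D vector
    layers.Dense(72, activation="tanh"),                   # 1st connected dense
    layers.Dense(1, activation="sigmoid"),                 # output layer
])
model.compile(optimizer="adamax", loss="mse", metrics=["mae"])
```

Training would then call `model.fit(x_conv2d, y, epochs=100, batch_size=1)` on the reshaped 4-D input arrays.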

ConvLSTM2D Model Setup
Firstly, the 5-D ConvLSTM2D input arrays were obtained from Eq. (7). Next, the 7-layer network of the ConvLSTM2D model block diagram (Fig. 1b) and system architecture (Table 3) were built with: 1 input layer with a 1st batch normalization layer (instead of the 1st 2-D MaxPooling layer and dropout), 1 convolutional hidden layer with a 2nd batch normalization layer (instead of the 2nd 2-D MaxPooling layer and dropout rate), 1 flatten layer, 1 dense layer and 1 output layer, respectively. The batch normalization layer stabilized (regularized) the layer input arrays for a faster deep-learning process and also reduced the model learning time. Because of this model's complexity, the batch normalization layer was adopted: it takes the output data sequence from the previously connected layer (input or hidden layer) and normalizes it (resolving the internal covariate shift in the distributed sequence from the previous layer) [48,68,69] before passing the normalized sequence to the subsequent layers. The flatten layer accepts the processed input sequence from the 2nd batch normalization layer and reduces/wraps this feature sequence into a single 1-D vector (from the 5-D output shape [1,1,20000,3,60] to the 2-D output shape [1,3600000]). The dense layer directly connected to the flatten layer extracts the wrapped input sequence and interprets it before passing it to the output layer. Lastly, the following model hyperparameters were considered: 30, 60 and 25 filters (layer neurons) for the input, hidden and connected dense layers, respectively; kernel sizes of (7,7) and (6,6) for the input and hidden layers, respectively; return_sequences = 'True' and padding = 'same'; and the tanh activation function for the input, hidden and dense layers with the sigmoid function for the model output layer. For the network compiler (Fig 1b), Adamax was used as the network optimizer, mean_square_error as the loss function and mean_absolute_error as the model metric for checking the training performance.
Other considered parameters for the model build-up and learning process were: -epochs = 100, batch_size = 1, nb_classes = timesteps = 20000. The procedures for developing the ConvLSTM2D neural model (Fig 1b, Cols. 10 and 13 of Table 1, and Table 3) were wrapped into 12 steps (Appendix A.2).
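The 7-layer ConvLSTM2D setup, with batch normalization in place of MaxPooling and dropout, can be sketched in Keras as below; the filters, kernel sizes and activations follow Table 3, while the small input grid and single-unit output are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the 7-layer ConvLSTM2D architecture (cf. Table 3). The spatial
# size (64 rows x 3 cols) is reduced from the paper's 20000 timesteps.
model = keras.Sequential([
    layers.ConvLSTM2D(30, kernel_size=(7, 7), padding="same",
                      activation="tanh", return_sequences=True,
                      input_shape=(1, 64, 3, 1)),  # (time, rows, cols, channels)
    layers.BatchNormalization(),                   # 1st batch normalization
    layers.ConvLSTM2D(60, kernel_size=(6, 6), padding="same",
                      activation="tanh", return_sequences=True),
    layers.BatchNormalization(),                   # 2nd batch normalization
    layers.Flatten(),                              # wrap 5-D output to 1-D vector
    layers.Dense(25, activation="tanh"),           # connected dense layer
    layers.Dense(1, activation="sigmoid"),         # output layer
])
model.compile(optimizer="adamax", loss="mse", metrics=["mae"])
```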

LSTM Model Setup
The long short-term memory (LSTM) is a special form/extension of the recurrent neural network (RNN) that has looped network layer(s) for deep learning and allows the information of a given input sequence to persist for a longer time [26,50]. To calculate the model's predicted values over the short to long term, the simple structure and operation of an LSTM cell has been presented in Fig. 4 of the literature [72], comprising the following 4 stages [73]:

$$i_t = \sigma\left(W_i X_t + R_i h_{t-1} + b_i\right)$$
$$f_t = \sigma\left(W_f X_t + R_f h_{t-1} + b_f\right)$$
$$\tilde{c}_t = \tanh\left(W_c X_t + R_c h_{t-1} + b_c\right), \qquad c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma\left(W_o X_t + R_o h_{t-1} + b_o\right), \qquad h_t = o_t \odot \tanh(c_t)$$

where Wi, Wc, Wf and Wo are the input neurons' shared weight matrices; ht denotes the hidden state at a given time instance (t); Xt is the input of the network; σ is the gate activation function; ht-1 is the output at the previous time (t-1); Ri, Rc, Rf and Ro are the recurrent weight matrices; and bi, bc, bf and bo are the bias vectors.
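The four stages can be traced in a plain NumPy sketch of a single LSTM cell step; the weight shapes and dimensions here are illustrative, not the paper's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x_t, h_prev, c_prev, W, R, b):
    """One LSTM time step following the standard gate equations.
    W, R and b map each gate name ('i', 'f', 'c', 'o') to its input
    weights, recurrent weights and bias vector, respectively."""
    i = sigmoid(W["i"] @ x_t + R["i"] @ h_prev + b["i"])      # input gate
    f = sigmoid(W["f"] @ x_t + R["f"] @ h_prev + b["f"])      # forget gate
    c_hat = np.tanh(W["c"] @ x_t + R["c"] @ h_prev + b["c"])  # candidate cell state
    o = sigmoid(W["o"] @ x_t + R["o"] @ h_prev + b["o"])      # output gate
    c_t = f * c_prev + i * c_hat                              # cell-state update
    h_t = o * np.tanh(c_t)                                    # hidden state / output
    return h_t, c_t

# Illustrative dimensions: 3 input features, 4 hidden units
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((4, 3)) for g in "ifco"}
R = {g: rng.standard_normal((4, 4)) for g in "ifco"}
b = {g: np.zeros(4) for g in "ifco"}
h, c = lstm_cell_step(rng.standard_normal(3), np.zeros(4), np.zeros(4), W, R, b)
```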
Firstly, the 3-D LSTM model input arrays were obtained from Eq. (9). Next, the 5-layer LSTM model block diagram (Fig. 1c) and system architecture (Table 5) were built with: 1 input layer with a dropout rate of 20 %, 1 hidden layer with an added dropout rate of 35 %, 1 flatten layer, 1 dense layer and 1 output layer. The flatten layer takes the input sequence from the hidden layer and reduces this feature sequence into a 1-D vector (from the 3-D output shape [1,20000,64] to the 2-D output shape [1,1280000]). Also, this network model utilized the relu activation function for all layers (input/hidden/dense) and the sigmoid function for the model output-layer activation. Lastly, the following network hyperparameters were considered: 128, 64 and 72 layer neurons (input, hidden and dense, respectively); and return_sequences = 'True' for the input and hidden layers. For the network optimizer (Fig 1c), the Adamax version of stochastic gradient descent was also utilized, with mean_square_error as the loss function and mean_absolute_error as the metric. Other considered parameters for the neural network build-up and learning were: epochs = 100, batch_size = 1, nb_classes = timesteps = 20000. The procedures for developing the LSTM neural model (Fig 1c, Cols. 8 and 11 of Table 1, and Table 5) were wrapped into 12 steps (Appendix A.3).
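The 5-layer LSTM setup can be sketched in Keras as below; the neuron counts, dropout rates, activations and compiler settings follow Table 5, while the reduced timestep count and single-unit output layer are illustrative assumptions:

```python
from tensorflow import keras
from tensorflow.keras import layers

timesteps, features = 256, 3  # the paper uses 20000 timesteps per batch

# Sketch of the 5-layer LSTM architecture (cf. Table 5).
model = keras.Sequential([
    layers.LSTM(128, activation="relu", return_sequences=True,
                input_shape=(timesteps, features)),  # input layer
    layers.Dropout(0.20),                            # 20 % dropout
    layers.LSTM(64, activation="relu", return_sequences=True),
    layers.Dropout(0.35),                            # 35 % dropout
    layers.Flatten(),                                # wrap 3-D output to 1-D vector
    layers.Dense(72, activation="relu"),             # connected dense layer
    layers.Dense(1, activation="sigmoid"),           # output layer
])
model.compile(optimizer="adamax", loss="mse", metrics=["mae"])
```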

3.5 Model Performance Evaluations

The model performances (training, validation, evaluation and prediction) of each neural network were assessed using the model loss function and metrics as follows:

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(Y_{pred,i} - Y_{act,i}\right)^{2} \tag{15}$$
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|Y_{pred,i} - Y_{act,i}\right| \tag{16}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(Y_{pred,i} - Y_{act,i}\right)^{2}} \tag{17}$$
$$\mathrm{ME} = \frac{1}{N}\sum_{i=1}^{N}\left(Y_{pred,i} - Y_{act,i}\right) \tag{18}$$

where Ypred,i denotes the i-th predicted wind speed and direction of the Conv2D, ConvLSTM2D and LSTM models; Yact,i is the i-th actual wind speed and direction; N is the number of input data points per variable (Col. 7 of Table 1); and the estimated ME is the model mean error of the wind speed (m/s).
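The four evaluation metrics (MSE, MAE, RMSE and ME) can be computed directly from the predicted and actual series, as in this sketch:

```python
import numpy as np

def forecast_metrics(y_pred, y_act):
    """Return (MSE, MAE, RMSE, ME) between predicted and actual series."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_act, dtype=float)
    mse = np.mean(err ** 2)     # mean square error
    mae = np.mean(np.abs(err))  # mean absolute error
    rmse = np.sqrt(mse)         # root mean square error
    me = np.mean(err)           # mean error (bias), m/s for wind speed
    return mse, mae, rmse, me
```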

Results
The results of the developed Conv2D, ConvLSTM2D and LSTM network models are depicted in Figs S1-10 (supplementary file), Figs 2-10 and Table 6. The training, validation, testing and prediction errors (1st and 2nd batched datasets), together with the evaluated station results, are summarized in Table 6, in which the best performance is highlighted in bold.

Model hyperparameters tuning
In the model learning process of the historical wind speed and direction patterns, the appropriate allocation of filter sizes (layer neurons) and kernel sizes, the network optimizer, and the right selection of the activation function of each network layer directly influence how the forecast model layers process and learn the input sequences at different station heights (Figs S1a and 2a, Figs S2a and 5a, and Figs S9a and 8a). Model performance comparisons of ConvLSTM2D and Conv2D with independent input data for the tuned hyperparameters (selected filters and kernel sizes) are presented (Figs a and d of S1-2). For the Conv2D setup, a kernel size of (1,1) with small filters (12 for the input layer, 6 for the hidden layer) and 72 dense-layer filters was first considered (Fig S1a); then a new network model with the same kernel size but larger filters (28 for the input layer, 14 for the hidden layer and 72 dense-layer filters) was built for comparison (Fig. S1d). From the network performances (Figs a and d of S1), it is clear that the Conv2D model with 28 and 14 filters learnt fairly well by 9 epochs with less model oscillation. Also, a fair wind speed forecast (Figs b and e) with minimal model errors (Figs c and f) was achieved compared with the model with 12 (input) and 6 (hidden) filters. For further tuning and better results with the Conv2D neural network, the forecast model with a kernel size of (1,1) but larger filter sizes of 128 and 64 (input and hidden layers) was adopted and initialized for reliable time-series wind predictions (Fig 1a and Table 2). Meanwhile, for the ConvLSTM2D model with kernel size = (4,4) and filter sizes of 48 (input), 24 (hidden) and 25 (dense layer), the network learnt slowly with the input sequence (Fig S2a).
Increasing the hyperparameters to (8,8) kernels with larger filter sizes/neurons (64, 32 and 25, respectively), the forecast model did not learn (Fig S2d): the network training and validation quickly diverged at 9 epochs with over-predicted wind speeds (Fig S2e) and high forecast mean errors (Fig. S2f). To achieve better performance and convergence with the ConvLSTM2D network, the new forecast model was built with 'filter size = 30, kernel size = (7,7) at the input layer', 'filter size = 60, kernel size = (6,6) at the hidden layer' and 25 dense-layer filters (Fig 1b and Table 3). Lastly, the LSTM network performances (training and validation) were assessed (Fig S9a) with layer neurons of 128 (input), 64 (hidden) and 45 (dense layer), and compared with a newly built model with similar input and hidden neurons but 72 dense-layer neurons (Fig S9b). The LSTM network built with 72 dense neurons showed significant improvement over the older forecast model with 45 dense-layer neurons. For reliable forecasts, the LSTM model with 72 dense-layer neurons was retained for better model learning and prediction of the time-series wind speeds and directions (Fig 1c and Table 5). Comparing the Conv2D and ConvLSTM2D architectures (Tables 2 and 3), the filter sizes allocated to the Conv2D input and hidden layers differ from those allocated to the ConvLSTM2D. In the designed network architectures of the Conv2D (Fig 1a and Table 2) and LSTM models (Fig 1c and Table 5), both model input layers were allocated a higher filter size/neuron count of 128 (step 2 of Appendices A.1 and A.3) for handling the input data sequence, later decreased to 64 layer filters (step 3) as the sequence deepened into the hidden layer. Also, the 1st dense layer connected to the flatten layer had more filters (72, step 4b) than the hidden layer with 64 filters (step 3).
As the model learning process deepens within the network layers, the filter sizes allocated to the dense layer also increase, but with output spatial-volume reduction at the flatten layer (from 5-D to 2-D in the ConvLSTM2D model, 4-D to 2-D in the Conv2D, and 3-D to 2-D in the LSTM/Conv1D models). Meanwhile, for the hybrid ConvLSTM2D network (Fig 1b and Table 3), the convolutional input layer was allocated fewer filter maps (filter size = 30 in step 2 of A.2), increased to a filter size of 60 (step 3) at the hidden layer. As the ConvLSTM2D model deepened for input-sequence learning, its hidden and dense layers usually required more filters than the input layer. However, the 1st dense layer connected to the flatten layer had a smaller filter size of 25 (instead of the 72 filters shown in step 4b) than the hidden layer with 60 filters (step 3 of A.2). This was due to the computational complexity of the ConvLSTM2D model if assigned a higher dense-layer filter count. The reduction in the dense-layer filters of the ConvLSTM2D led to a faster learning process and efficient computation with moderate memory needs.

Model result comparison and forecast accuracy (training)
The model performances (validation and training losses) of the Conv2D (Figs. 2a-b), ConvLSTM2D (Figs. 5a-b) and LSTM (Figs. 8a-b) networks were assessed and compared (Table 6, in terms of the mean squared error, MSE, and mean absolute error, MAE). The Conv1D model performance was also assessed (Figs. S5a-b). For the 1st batch of the input sequence, the Conv2D training and validation losses converged at 6 epochs, but the validation loss oscillated in a twisty pattern over epochs 11-100 (Fig. 2a); for the 2nd training batch, the model learnt better with a lower gradient descent but with validation-loss oscillations over epochs 26-100 (Fig. 2b). For the ConvLSTM2D model, the training and validation losses converged at 11 epochs but with validation-loss oscillations over epochs 79-100 (Fig. 5a), while the model learnt better on the 2nd training batch with loss convergence at 16 epochs but validation-loss oscillations over epochs 91-100 (Fig. 5b). Furthermore, the LSTM training and validation losses were compared (Figs. 8a-b): for the 1st training batch, the network converged at 5 epochs with validation-loss oscillations over epochs 32-100 (Fig. 8a); for the 2nd batch, the model had a steeper gradient descent with loss convergence at 5 epochs but validation-loss oscillations over epochs 24-100 (Fig. S10a). A closer look at the model performances (validation and training losses) in Figs. 5a-b shows that the ConvLSTM2D would be a better choice for model training than the Conv1D (Figs. S5a-b) and LSTM (Figs. 8a-b, S10a) networks.
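The loss metrics compared above follow the standard definitions, reproduced here as a short sketch with illustrative values (not the paper's station data):

```python
import math

def mse(y_true, y_pred):
    """Mean squared error over paired observations."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error over paired observations."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

actual    = [5.0, 6.2, 4.8, 7.1]   # illustrative wind speeds
predicted = [5.3, 6.0, 5.1, 6.7]
print(mse(actual, predicted), mae(actual, predicted), rmse(actual, predicted))
```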
The Conv2D model, in turn, showed significant improvement over the LSTM and ConvLSTM2D model performances.
The cause of the validation-loss oscillations could not be ascertained, but it may be attributed to the historical input arrays (batched datasets) fed into the network input layer. Thus, passing either a more recent input data sequence or additional historical input arrays into the training/validation process may have a greater impact on the model performance and oscillation stability [26].
Meanwhile, the Conv2D model predictions of the time-series wind speeds over 4 horizons (0-50, 50-100, 100-150 and 150-200 timesteps) were presented (Figs. 3-4) and compared with the other models' predictions (ConvLSTM2D, Figs. 6-7; LSTM, Figs. 8d and S10d; Conv1D, Fig. S5d). These results show that all trained networks replicated the historical patterns of the wind speeds (Figs. S6a-f); however, the LSTM model over-estimated the historical wind speeds, although it still outperformed the Conv1D predictions. The Conv2D predictions also showed significant improvement over the ConvLSTM2D and the other models' results, although both forecast models (Figs. S8a-b) accurately reproduced the actual wind-direction patterns for the considered timesteps/horizons. For the LSTM predictions with 45 dense-layer neurons (Fig. S9c), the actual wind speeds were highly overestimated compared with the predictions using 72 dense-layer neurons (Fig. 8d). The comparison of LSTM models built with different dense-layer neurons (Fig. 8d vs. S9c) revealed that the model learning process was highly sensitive to the selected dense-layer neurons, which had a strong impact on prediction accuracy. With independent weather input data arrays, the models' ability to replicate the historical wind directions was assessed, and the time series of wind-direction forecasts from the trained LSTM models were compared (Fig. 8e). From the forecast results (Figs. 8e, S8a and S8b), the LSTM, Conv2D and ConvLSTM2D wind-direction predictions were very close to the actual values at most of the considered time instances. Also, the wind roses of the Conv2D, LSTM and Actual models, together with their frequency occurrences in 12 wind sectors, are presented (Figs. 10a-f); those of the ConvLSTM2D model are depicted in Figs. S7a-b.
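Multi-step forecasts of this kind require the raw time series to be framed into (input, target) windows before it can be reshaped for the network layers. A minimal sketch of such sliding-window framing (the `make_windows` helper and the window lengths are illustrative, not the paper's exact preprocessing):

```python
def make_windows(series, n_in, n_out):
    """Frame a univariate series into supervised (input, target) pairs
    for multi-step forecasting: n_in past steps -> n_out future steps."""
    X, y = [], []
    for i in range(len(series) - n_in - n_out + 1):
        X.append(series[i:i + n_in])
        y.append(series[i + n_in:i + n_in + n_out])
    return X, y

speeds = list(range(10))               # stand-in for sampled wind speeds
X, y = make_windows(speeds, n_in=4, n_out=2)
print(X[0], y[0])                      # [0, 1, 2, 3] [4, 5]
print(len(X))                          # 5 windows
```

The resulting 2-D list of windows would then be reshaped to the 3-D, 4-D or 5-D input expected by the LSTM, Conv2D or ConvLSTM2D layers, respectively.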
From these results, the Conv2D and LSTM models replicated the sector-wise actual directions and their corresponding frequency occurrences (%) better than the ConvLSTM2D predictions across the 12 wind sectors (0°, 30°, 60°, …, 359°). The Conv2D prediction errors (Figs. 2c-d) were compared with the ConvLSTM2D (Figs. 5c-d) and LSTM errors over 200 timesteps (Figs. 8c and S10): the Conv2D errors (Figs. 2c-d) were the smallest, followed by the ConvLSTM2D (Figs. 5c-d) and then the LSTM prediction errors (Figs. 8c and S10). Lastly, the accuracies of the Conv2D, ConvLSTM2D and LSTM models, both untrained (compiled layer models without a training procedure) and trained (layer models with a training procedure), were assessed with evaluation input data arrays from 5 stations (SM-ST1 to SM-ST5). In terms of the MSEs and MAEs, the Conv2D model produced the smallest training and validation losses of all the models (Table 6). The Conv2D evaluation results were also better for the station 1, 3 and 4 datasets than those of the other models; the LSTM model outperformed the Conv2D and ConvLSTM2D models at station 5, while the ConvLSTM2D and LSTM models outperformed the Conv2D at station 2. For the untrained models, the ConvLSTM2D showed a slight improvement (at stations 1-2) over the Conv2D and LSTM evaluations, while the evaluated results for stations 3-5 were similar for all untrained models (Fig. 9c). Using different datasets, the evaluated results (MSE/MAE/RMSE) of the 4 prediction models developed by Qiu et al. [62] were compared with our study findings [38]. Lastly, our predicted/validated/evaluated model results in Table 6 outperformed the evaluation metric values summarized by Xie et al. [72] (Table 5). Thus, the prediction errors obtained from optimally tuned model hyperparameters and high-quality datasets show that the forecast accuracy of the Keras neural network architectures outperformed the traditional neural networks in the existing literature.
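The sector-wise frequency occurrences underlying the wind roses can be computed with a simple binning routine; the sketch below uses made-up direction values, not the station observations:

```python
def sector_frequencies(directions_deg, n_sectors=12):
    """Bin wind directions (degrees) into equal-width sectors and return
    the frequency of occurrence (%) per sector, as used for wind roses."""
    width = 360 / n_sectors            # 30-degree sectors for n_sectors=12
    counts = [0] * n_sectors
    for d in directions_deg:
        counts[int((d % 360) // width)] += 1
    total = len(directions_deg)
    return [100 * c / total for c in counts]

dirs = [10, 15, 40, 95, 100, 100, 350, 355]   # illustrative directions
freqs = sector_frequencies(dirs)
print(freqs[0], freqs[3], freqs[11])           # 25.0 37.5 25.0
```

Comparing such per-sector percentages between the predicted and actual directions gives the sector-wise agreement reported above.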

Conclusion
From the study results, it has been shown that the Keras neural network architectures for wind speed and direction regression analysis require well-structured input data and the right selection and optimal tuning of the model hyperparameters before they can be used for reliable wind predictions. Accurate wind speed and direction predictions were achieved with the developed network models and the weather input arrays. From the findings, the following conclusions are drawn:
• The allocation of filters (layer neurons) and kernel sizes should be based on the input array dimensions and the sizes of the interconnected layers within the network, as they strongly affect the model learning ability (convergence or divergence) and directly influence the sequential network's prediction accuracy.
• Independent high-quality datasets from different station heights should be normalized before being used as the model input data array (sequence). The model trained with the scaled input arrays (0-1) was the better choice for reliable forecast time horizons.
• From the evaluation results, better predictions were obtained with the trained network models than with compiled layer models without a training procedure.
• From the prediction results, the Keras sequential models were able to replicate the time series of historical wind speed and direction, subject to the tuning of the model hyperparameters and the structure of the input sequence.
• From the training and validation results, the Adamax-optimized Conv2D model recorded the best performance, with timely convergence for a small training batch_size = 1, compared with the other forecast-model configurations.
• From the model training and validation losses, the Conv2D neural network outperformed all other forecast models (ConvLSTM2D, LSTM and Conv1D) for wind speed and direction analysis. Hence, the Conv2D model would be a better choice as a stand-alone forecast model for wind assessment, based on its ability to learn the historical wind conditions of a given site.
• Lastly, the wind roses of the Conv2D and LSTM models against the Actual model, together with their frequency occurrences, show that the proposed neural networks within the deep-learning framework are a reliable forecast tool for replicating the historical wind speed and direction patterns at the considered station heights.
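The 0-1 scaling recommended above is standard min-max normalization; a minimal sketch with illustrative values (not the station datasets) follows:

```python
def minmax_scale(values):
    """Scale a feature to the 0-1 range; the normalization found to give
    better forecasts than raw (un-normalized) input sequences."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

wind_speed = [2.0, 4.0, 6.0, 10.0]    # illustrative wind speeds (m/s)
print(minmax_scale(wind_speed))
```

In practice the scaling parameters (lo, hi) fitted on the training split would be retained and reused to transform the validation/evaluation inputs and to invert the scaling on the model outputs.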
The ConvLSTM2D model performances (training/validation/prediction) were encouraging and call for improvement through a hyperparameter tuning process based on larger dense-layer filter allocations and smaller kernel sizes. Also, since a batched training method was adopted to reduce the computational complexity, the model framework could be refined with unbatched and seasonal datasets in further studies. It is worth noting that the LSTM model is a good choice when the ability to memorize long-learnt patterns through memory cells/gates is required, but it remains unclear whether training with longer noisy (non-linear) input sequences would improve the ConvLSTM2D and LSTM model performances. Future studies will consider a bi-directional LSTM, the incorporation of more hidden layers into the LSTM and ConvLSTM2D models, and the effects of a batch normalization layer on the LSTM performance. Furthermore, the ConvLSTM2D model framework, alongside the other developed sequential models, will be fully explored in classification tasks; a more powerful machine-learning platform to host the ConvLSTM2D model would be required for this future work.

Supplementary Files
The supplementary files associated with this preprint are available for download.