Optimization Algorithms as a Training Approach with Deep Learning Methods to Develop an Ultraviolet Index Forecasting Model


The solar ultraviolet index (UVI) is a key public health indicator for mitigating ultraviolet-exposure-related diseases. In practice, however, ultraviolet irradiance is difficult to measure, requiring expensive ground-based physical models and time-consuming satellite-observed data. Accurate short-term forecasting is therefore crucial for effective public health decisions concerning UVI-related diseases. To this end, this study developed and compared the performance of different hybridized deep learning models for forecasting the daily UVI. Ultraviolet irradiance-related data were collected for the Perth station in Western Australia. A hybrid deep learning framework combining a convolutional neural network with long short-term memory, called CLSTM, was formulated. A comprehensive dataset (i.e., satellite-derived Moderate Resolution Imaging Spectroradiometer data, ground-based datasets from Scientific Information for Landowners, and synoptic-scale climate indices) was fed into the proposed network, which was tuned with four optimization techniques. The results demonstrated the excellent forecasting capability (i.e., low error and high efficiency) of the recommended hybrid CLSTM model compared with the counterpart benchmark models. Overall, this study showed that the proposed hybrid CLSTM model successfully captures the complex, non-linear relationships between the predictor variables and the daily UVI. The complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN)-CLSTM model appears to be an accurate forecasting system capable of reacting quickly to measured conditions. Further, the genetic algorithm was found to be the most effective optimization technique. These inferences can considerably enhance real-time exposure advice for the public and help mitigate solar UV-exposure-related diseases such as melanoma.

This study aims to apply a CLSTM hybrid machine learning model, which exploits the benefits of both convolutional layers (i.e., salient feature extraction) and LSTM layers (i.e., storing sequence data for an extended period), and to evaluate its ability to efficiently forecast the UVI for the next day. The model was constructed and fed with hydro-climatic data associated with UV irradiance in the Australian context, and was optimized using ant colony optimization, a genetic algorithm, particle swarm optimization, and a differential evolution algorithm. Model accuracy (i.e., the efficiency of, and errors involved in, UVI estimation) was assessed against the performance statistics of conventional standalone data-driven models (e.g., SVR, decision tree, MLP, CNN, LSTM, and gated recurrent unit (GRU)). The inferences obtained from the modeling results are also discussed, as they could be tremendously useful in building expert judgment to protect public health in the Australian region and beyond.
In Perth, Summer (December to February) had the most extreme UV levels, whereas Autumn (March to May) has moderate to high, Winter (June to August) lower to moderate, and Spring (September to November) higher to extreme UVI values.

Materials and Methods
Oscillation Index is highly correlated with solar irradiance and with reconstructions of mean Northern Hemisphere temperature fluctuations (Yan et al., 2011). In North America and the North Pacific, land and sea surface temperatures, precipitation, and storm tracks are determined mainly by atmospheric variability associated with the Pacific North American (PNA) pattern. The modern instrumental record indicates a recent trend towards a positive PNA phase, which has resulted in increased warming and snowpack loss in northwest North America (Liu et al., 2017). This study used fifteen climate mode indices to increase predictor diversity.

Multilayer perceptron (MLP)
The MLP is a simple feedforward neural network with three layers and is commonly used as a reference model for comparison in machine learning research (Ahmed and Lin, 2021). The three layers are the input layer, a hidden layer with n nodes, and the output layer. The input data are fed into the input layer and transformed in the hidden layer via a non-linear activation function (i.e., a logistic function), and the target output is estimated, Eq. (1). The computed output is then compared with the measured output, and the corresponding loss, i.e., the mean squared error (MSE), is estimated. The model parameters (i.e., weights and biases) are updated using backpropagation until the minimum MSE is obtained. The model is trained for several iterations and then tested on new data sets to assess prediction accuracy.
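The forward pass and MSE loss described above can be sketched as follows; the layer sizes, the random toy data, and the NumPy implementation are illustrative assumptions, not details taken from the study.

```python
import numpy as np

rng = np.random.default_rng(42)

def logistic(z):
    # Non-linear activation applied in the hidden layer
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(X, W1, b1, W2, b2):
    """Three-layer MLP: input -> hidden (logistic) -> linear output."""
    H = logistic(X @ W1 + b1)   # hidden-layer activations
    return H @ W2 + b2          # regression output

# Toy setup: 5 predictors, 8 hidden nodes, 1 output (e.g., the daily UVI)
n_in, n_hidden = 5, 8
W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, 1))
b2 = np.zeros(1)

X = rng.normal(size=(10, n_in))          # 10 samples
y = rng.normal(size=(10, 1))             # measured output
y_hat = mlp_forward(X, W1, b1, W2, b2)   # computed output
mse = float(np.mean((y - y_hat) ** 2))   # loss minimized by backpropagation
```

In practice the weights would be updated by backpropagation until this MSE stops decreasing.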

Support vector regression (SVR)
The SVR is constructed based on statistical learning theory. In SVR, a kernel trick is applied that transfers the input features into a higher-dimensional space to construct an optimal separating hyperplane as follows (Ji et al., 2017):

f(x) = w · φ(x) + b,

where w is the weight vector, b is the bias, and φ(·) indicates the high-dimensional feature space. The coefficients w and b, which define the location of the hyperplane, can be estimated by minimizing the following regularized risk function:

Minimize: (1/2)‖w‖² + C Σᵢ (ξᵢ + ξᵢ*),

subject to the ε-insensitive constraints, where K(xᵢ, xⱼ) is the non-linear kernel function appearing in the dual formulation. In this study, we used a radial basis function (RBF) as the kernel, represented as

K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / (2σ²)),

where σ is the bandwidth of the RBF.
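The RBF kernel above can be computed directly from pairwise distances, which is how the kernel trick avoids forming φ(x) explicitly. The following is a minimal NumPy sketch (the data and bandwidth are illustrative):

```python
import numpy as np

def rbf_kernel(X1, X2, sigma=1.0):
    """RBF kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2)).

    The kernel trick lets SVR operate in the implied high-dimensional
    feature space without ever computing phi(x) explicitly.
    """
    # Squared Euclidean distances between all pairs of rows
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return np.exp(-sq / (2.0 * sigma**2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = rbf_kernel(X, X, sigma=1.0)
```

The resulting Gram matrix is symmetric with ones on its diagonal, as required of a valid kernel.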

Decision tree (DT)
A decision tree is a predictive model used for both classification and regression analysis (Jiménez-Pérez and Mora-López, 2016). As our data are continuous, we used it for regression predictions. It is a simple tree-like structure that uses the input observations (i.e., x1, x2, x3, …, xn) to predict the target output (i.e., Y). The tree contains many nodes; at each node, a test on one of the inputs (e.g., x1) is applied, and the outcome is estimated. The left or right sub-branch of the decision tree is selected based on the estimated outcome. After a specific node, the prediction is made, and the corresponding node is termed the leaf node. The prediction averages all the training points that reach that leaf node. The model is trained using all input variables, and the corresponding loss, the mean squared error (MSE), is calculated to determine the best split of the data. The maximum number of features considered during each partition is set to the total number of input features.
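The MSE-based split search described above can be sketched for a single feature as follows; the toy data are illustrative, and a full tree would apply this search recursively over all features.

```python
import numpy as np

def best_split(x, y):
    """Find the threshold on one input feature that minimizes the summed
    squared error of the two child nodes, as done when growing a
    regression tree; each leaf predicts the mean of its training targets."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_t, best_score = None, np.inf
    for i in range(1, len(x_sorted)):
        left, right = y_sorted[:i], y_sorted[i:]
        score = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if score < best_score:
            best_score = score
            best_t = 0.5 * (x_sorted[i - 1] + x_sorted[i])
    return best_t, best_score

# Targets jump between x = 2 and x = 3, so the best split lands at 2.5
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.0, 10.0, 10.0])
t, score = best_split(x, y)
```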

Convolutional neural network (CNN)
The CNN model was originally developed for document recognition (Lecun et al., 1998) and is now widely used for prediction. Aside from the input and output layers, the CNN architecture has three types of hidden layers: convolutional layers, pooling layers, and a fully connected layer. The convolutional layers abstract local information from the data matrix using a kernel. The primary advantage of this layer is the implementation of weight sharing and spatial correlation among neighbors (Guo et al., 2016). The pooling layers are subsampling layers that reduce the size of the data matrix. A fully connected layer, similar to a traditional neural network, is added after the final pooling layer, following an alternating stack of convolutional and pooling layers.
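For one-dimensional inputs such as the daily predictor series used here, the convolution and pooling operations can be sketched as follows; the signal and kernel values are illustrative.

```python
import numpy as np

def conv1d_valid(x, kernel):
    """1-D 'valid' convolution: the same kernel weights are shared across
    all positions, exploiting local correlation among neighbors."""
    k = len(kernel)
    return np.array([np.dot(x[i:i + k], kernel) for i in range(len(x) - k + 1)])

def max_pool1d(x, size=2):
    """Subsampling layer: keep the maximum of each non-overlapping window,
    reducing the size of the feature map."""
    n = len(x) // size
    return x[:n * size].reshape(n, size).max(axis=1)

signal = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 1.0])
kernel = np.array([0.5, 0.5])            # simple moving-average kernel
features = conv1d_valid(signal, kernel)  # length-5 feature map
pooled = max_pool1d(features, size=2)    # reduced to length 2
```

A fully connected layer would then flatten `pooled` and map it to the output.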

Long short-term memory (LSTM)
An LSTM network is a special form of recurrent neural network that stores sequence data for an extended period (Hochreiter and Schmidhuber, 1997). The LSTM structure has three gates: an input gate, an output gate, and a forget gate. The model regulates these three gates and determines how much data from previous time steps must be stored and transferred to the next steps. The input gate controls the input data at the current time as follows:

iₜ = f(Σᵢ wᵢ xₜⁱ + Σₕ wₕ hₜ₋₁ʰ + Σ_c w_c cₜ₋₁ᶜ),   (5)

where xₜⁱ is the input received from the i-th node at time t; hₜ₋₁ʰ is the result of the h-th node at time t−1; and cₜ₋₁ᶜ is the cell state (i.e., memory) of the c-th node at time t−1. The symbol w represents the weight between nodes, and f is the activation function. The output gate transfers the current value from Eq. (5) to the output node, Eq. (6). Then, at the final stage, the current value is stored as the cell state via the forget gate, Eq. (7).
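One LSTM time step, with the input, forget, and output gates jointly regulating the cell state, can be sketched as below. The weight shapes, random initialization, and toy sequence are illustrative assumptions; the study's actual layer configuration is given later in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W, U, b hold the parameters of the four transforms:
    input gate (i), forget gate (f), output gate (o), candidate cell (g)."""
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])  # input gate
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])  # forget gate
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])  # output gate
    g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])  # candidate cell
    c = f * c_prev + i * g    # updated cell state (long-term memory)
    h = o * np.tanh(c)        # hidden state passed to the next step
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
W = {k: rng.normal(scale=0.1, size=(n_hid, n_in)) for k in "ifog"}
U = {k: rng.normal(scale=0.1, size=(n_hid, n_hid)) for k in "ifog"}
b = {k: np.zeros(n_hid) for k in "ifog"}

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):                     # run a short input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c, W, U, b)
```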

Gated recurrent unit (GRU) network
The GRU network is an LSTM variant having only two gates, a reset gate and an update gate (Dey and Salem, 2017). The implementation of this network can be represented by the following equations:

zₜ = σ(W_z xₜ + U_z hₜ₋₁),
rₜ = σ(W_r xₜ + U_r hₜ₋₁),
h̃ₜ = tanh(W_h xₜ + U_h (rₜ ⊙ hₜ₋₁)),
hₜ = (1 − zₜ) ⊙ hₜ₋₁ + zₜ ⊙ h̃ₜ.

Ant colony optimization

The ACO algorithm is a widely used simulation optimization algorithm in which myriad artificial ants work in a simulated mathematical space to search for optimal solutions to a given problem. The ant colony algorithm is well suited to multi-objective optimization, as it follows a natural distribution and a self-evolving simple process. However, as network information grows, the ACO algorithm faces constraints such as local optima and feature redundancy when selecting optimal pathways (Peng et al., 2018).
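The pheromone-guided search that ACO performs can be sketched on a toy problem as follows. The five candidate "paths" and their costs are purely illustrative stand-ins for the predictor subsets the study's ACO explores; the deposit rule (inverse cost) is one common choice, not the study's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy problem: five candidate "paths" with known costs; the ants should
# concentrate pheromone on the cheapest one (index 3, cost 1.0).
costs = np.array([5.0, 4.0, 6.0, 1.0, 3.0])
pheromone = np.ones_like(costs)
evaporation, n_ants = 0.5, 50
best_cost = np.inf

for iteration in range(30):
    # Each ant picks a path with probability proportional to pheromone
    probs = pheromone / pheromone.sum()
    choices = rng.choice(len(costs), size=n_ants, p=probs)
    best_cost = min(best_cost, float(costs[choices].min()))
    # Evaporate, then deposit pheromone inversely proportional to cost
    pheromone *= (1.0 - evaporation)
    for c in choices:
        pheromone[c] += 1.0 / costs[c]

best_path = int(np.argmax(pheromone))
```

The positive feedback between sampling probability and deposited pheromone is also what makes plain ACO prone to the local-optimum trapping noted above.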

Differential evolution optimization
The differential evolution (DEV) algorithm is renowned for its simplicity and powerful stochastic search behavior. The initial population X₍g=0₎ is generated at random within given boundaries, which can be written as:

xᵢⱼ⁽⁰⁾ = xⱼ⁽ᴸ⁾ + rand[0,1] · (xⱼ⁽ᵁ⁾ − xⱼ⁽ᴸ⁾),

for each i, j, where xⱼ⁽ᵁ⁾ and xⱼ⁽ᴸ⁾ represent the upper and lower limits of the boundary vector parameters. For every generation g, a new candidate vector is created by randomly selecting vectors from the previous generation in the following manner:

vᵢ = x_{r1} + F · (x_{r2} − x_{r3}),

where vᵢ is the candidate vector, CR ∈ [0,1] and F ∈ [0,2] are control parameters, and r1, r2, r3 are randomly selected indices that ensure the candidate vector differs from the generation vectors. The population for the new generation is assembled from the vectors of the previous generation X₍g−1₎ and the candidate vectors vᵢ; selection retains whichever of the parent and candidate yields the better objective value. The process repeats with the next generation's population until the pre-defined objective function is satisfied.
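The mutation, crossover, and greedy selection steps above can be sketched as a minimal DE/rand/1/bin loop; the population size, control-parameter values, and sphere test function are illustrative, not the study's settings.

```python
import numpy as np

rng = np.random.default_rng(3)

def differential_evolution(f, bounds, n_pop=20, F=0.8, CR=0.9, n_gen=100):
    """Minimal DE/rand/1/bin: mutation with factor F, binomial crossover
    with rate CR, and greedy selection against the parent vector."""
    lo, hi = bounds
    dim = len(lo)
    # Random initial population within the given boundaries
    pop = lo + rng.random((n_pop, dim)) * (hi - lo)
    fit = np.array([f(x) for x in pop])
    for g in range(n_gen):
        for i in range(n_pop):
            r1, r2, r3 = rng.choice([j for j in range(n_pop) if j != i], 3, replace=False)
            v = np.clip(pop[r1] + F * (pop[r2] - pop[r3]), lo, hi)  # mutation
            mask = rng.random(dim) < CR                             # crossover
            mask[rng.integers(dim)] = True      # guarantee one mutant gene
            u = np.where(mask, v, pop[i])
            fu = f(u)
            if fu <= fit[i]:                    # greedy selection
                pop[i], fit[i] = u, fu
    best = pop[np.argmin(fit)]
    return best, float(fit.min())

# Sphere function: global minimum 0 at the origin
lo, hi = np.full(3, -5.0), np.full(3, 5.0)
best, best_val = differential_evolution(lambda x: float(np.sum(x**2)), (lo, hi))
```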

Particle swarm optimization
The particle swarm optimization (PSO) method was developed for the optimization of continuous non-linear functions and has roots in artificial life and evolutionary computation (Kennedy and Eberhart, 1995). The method is built on a simple concept: each particle's current position in the swarm is tracked, and a velocity vector moves the particle from its previous to its new position. The movement of the particles in the swarm, however, depends on the behavior of the other individuals. The process is therefore stochastic; it uses each particle's memory to calculate its new position, together with the knowledge gained by the swarm as a whole. Nearest-neighbor velocity matching and craziness, the elimination of ancillary variables, and the incorporation of multidimensional search and acceleration by distance were the precursors of the PSO algorithm simulation (Eberhart and Shi, 2001). Each particle in the simulation moves in the n-dimensional design space and responds to two quality factors called 'gbest' and 'pbest': gbest represents the best location and value achieved by any particle in the population globally, and pbest represents the best-fitted solution achieved by the particle itself so far. Thus, at each time step, each particle changes its acceleration towards its two best quality-factor locations, weighted by separate random numbers for the 'gbest' and 'pbest' terms. The basic steps of the PSO algorithm are given below, according to Eberhart and Shi (2001):
1. The process starts by initializing sample particles with random velocities and locations on n dimensions in the design space.
2. The velocity vector for each particle in the swarm is then updated from its initial value.
3. The updated velocity vector is applied, and each particle's fitness evaluation is compared with its pbest. If the new value is better than the previous one, pbest is updated to the new value, and the pbest location is set equal to the current location in the design space.
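The steps above can be sketched as a minimal PSO loop. The inertia weight, acceleration coefficients, swarm size, and sphere test function are illustrative assumptions, not the study's configuration.

```python
import numpy as np

rng = np.random.default_rng(5)

def pso(f, dim=3, n_particles=20, n_iter=100, w=0.7, c1=1.5, c2=1.5):
    """Minimal PSO: each particle accelerates towards its own pbest and
    the swarm's gbest, weighted by separate random numbers r1 and r2."""
    pos = rng.uniform(-5.0, 5.0, (n_particles, dim))
    vel = rng.uniform(-1.0, 1.0, (n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.array([f(x) for x in pos])
    g = np.argmin(pbest_val)
    gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    for it in range(n_iter):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([f(x) for x in pos])
        improved = vals < pbest_val                 # update personal bests
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        g = np.argmin(pbest_val)                    # update the global best
        if pbest_val[g] < gbest_val:
            gbest, gbest_val = pbest[g].copy(), pbest_val[g]
    return gbest, float(gbest_val)

# Sphere function: global minimum 0 at the origin
gbest, gbest_val = pso(lambda x: float(np.sum(x**2)))
```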

Genetic algorithm
The genetic algorithm (GA) is a heuristic search method based on the principles and concepts of natural selection and evolution. The method was introduced by John Holland in the mid-1970s.
Genetic similarity determines the selection progress indicator. Random individual objectives, together with their rank, are transferred to the next generation, while the remaining individuals participate in the steps of selection, crossover, and mutation. The parent selection process can happen several times and can be achieved by many different schemes, such as the ranked roulette-wheel method. For any pair of selected parents, the crossover and mutation processes of the next generation are defined. After that, the fitness of all individuals scheduled for the next generation is evaluated. This process repeats from generation to generation until a termination criterion is met.
The GA methodology is quite similar to another stochastic search algorithm, PSO. Both methods begin their search from a randomly generated population of designs that evolves over successive generations, and neither requires a specific starting point for the simulation. The first operator is the "selection" procedure, similar to the "survival of the fittest" principle. The second operator is "crossover", which mimics mating in a biological population. Both methods use the same convergence criteria for selecting the optimal solution in the problem space (Hassan et al., 2004). However, GA differs in two ways from most traditional optimization methods. First, GA does not operate directly on the design parameter vector but on a symbolic representation known as a chromosome. Second, it optimizes a whole population of chromosomes at once, unlike other optimization methods that handle a single candidate at a time (Weile and Michielssen, 1997).
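The selection, crossover, and mutation operators described above can be sketched as a minimal real-coded GA. The rank-weighted selection, single-point crossover, Gaussian mutation, elitism, and sphere test function are illustrative choices under stated assumptions, not the study's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(11)

def ga_minimize(f, dim=3, n_pop=30, n_gen=80, p_mut=0.2):
    """Minimal real-coded GA: rank-based (roulette-like) parent selection,
    single-point crossover, Gaussian mutation, and elitism."""
    pop = rng.uniform(-5.0, 5.0, (n_pop, dim))
    for gen in range(n_gen):
        fit = np.array([f(x) for x in pop])
        pop = pop[np.argsort(fit)]               # best individual first
        # Rank-weighted selection probabilities (best rank -> largest weight)
        weights = np.arange(n_pop, 0, -1, dtype=float)
        probs = weights / weights.sum()
        children = [pop[0].copy()]               # elitism: keep the best
        while len(children) < n_pop:
            pa, pb = pop[rng.choice(n_pop, 2, p=probs)]
            cut = rng.integers(1, dim)           # single-point crossover
            child = np.concatenate([pa[:cut], pb[cut:]])
            if rng.random() < p_mut:             # Gaussian mutation
                child += rng.normal(scale=0.3, size=dim)
            children.append(child)
        pop = np.array(children)
    fit = np.array([f(x) for x in pop])
    i = int(np.argmin(fit))
    return pop[i], float(fit[i])

# Sphere function: global minimum 0 at the origin
best, best_val = ga_minimize(lambda x: float(np.sum(x**2)))
```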

Complete Ensemble Empirical Mode Decomposition with Adaptive Noise (CEEMDAN)
The complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) approach initiates by discretizing the n-length predictor series of any model χ(t) into intrinsic mode functions (IMFs) and residues to conform with a tolerance criterion. However, to ensure no information leakage into the IMFs and residues, the decomposition is performed separately on the training and testing subsets. The actual IMF is produced by taking the mean of the empirical mode decomposition (EMD)-grounded IMFs across trials, with white noise added to model the predictor-target variables. The CEEMDAN has been used in machinery, electricity, and medicine applications, such as impact-signal denoising, daily peak-load forecasting, health-degradation monitoring for rolling bearings, and friction-signal denoising combined with mutual information (Li et al., 2019).
The CEEMDAN process is as follows:
Step 1: Decompose the P noise-added realizations χ[n] + ε₀w⁽ᵖ⁾[n], p = 1, …, P, using EMD, and average their first intrinsic modes:

IMF₁[n] = (1/P) Σₚ IMF₁⁽ᵖ⁾[n].

Step 6: The k value is incremented, and steps 4-6 are repeated. Consequently, the final residue is achieved:

R[n] = χ[n] − Σₖ₌₁ᴷ IMFₖ[n],

where K is the limiting case (i.e., the highest number of modes). To comply with replicability of the earliest input χ[n], the CEEMDAN approach therefore satisfies

χ[n] = Σₖ₌₁ᴷ IMFₖ[n] + R[n].

The determination of a model's valid predictors has no precise formula. However, the literature suggests three methods, i.e., trial and error, the autocorrelation function (ACF) and partial autocorrelation function (PACF), and the cross-correlation function (CCF), for selecting lagged UVI memories and predictors to build an optimal model. In this study, the PACF was used to determine significant antecedent behavior in terms of the lag of UVI, and the CCF determined the predictors' statistical similarity to the target variable (Figures 6a and 7a). The predictors were normalized using min-max scaling, Eq. (30): x_norm = (x − x_min) / (x_max − x_min), where x is the respective predictor, x_min is its minimum value, x_max is its maximum value, and x_norm is the normalized value. After normalizing the predictor variables, the data sets were partitioned: 70% of the data were used for training, 15% for testing, and the remaining 15% for validation. The LSTM model was followed by a hybrid model combining a 3-layered CNN with a 4-layered LSTM, as illustrated in Figure 2. For the conventional models, the traditional antecedent lagged matrix of the daily predictor variables was applied. The optimization algorithms were applied before the CCF and PACF analyses and before predictors were removed from the model. Table 2 shows the predictors selected by the four optimization techniques in association with the UVI.
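The min-max normalization of Eq. (30) and the 70/15/15 partition can be sketched as below; the chronological ordering of the split and the stand-in series are assumptions for illustration.

```python
import numpy as np

def min_max_normalize(x):
    """Eq. (30): x_norm = (x - x_min) / (x_max - x_min),
    mapping each predictor into [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def partition(data, train=0.70, test=0.15):
    """Chronological 70/15/15 split into training, testing, validation."""
    n = len(data)
    n_train = int(n * train)
    n_test = int(n * test)
    return data[:n_train], data[n_train:n_train + n_test], data[n_train + n_test:]

x = np.arange(100, dtype=float)   # stand-in for one daily predictor series
x_norm = min_max_normalize(x)
train, test, valid = partition(x_norm)
```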

Model performance assessment
In this study, the effectiveness of the deep learning hybrid model was assessed using a variety of performance evaluation criteria, e.g., Pearson's correlation coefficient (r), root mean square error (RMSE), Nash-Sutcliffe efficiency (NS) (Nash and Sutcliffe, 1970), and mean absolute error (MAE). The relative RMSE (denoted as RRMSE) and relative MAE (denoted as RMAE) were used to explore the geographic differences between the study stations. The exactness of the relationship between predicted and observed values was used to evaluate the effectiveness of a predictive model. When the error distribution in the tested data is Gaussian, the RMSE is a more appropriate measure of model performance, complemented by Legates-McCabe's index (LM).
A hybrid CLSTM model was found to be the most accurate. Figure 9 shows the effect of applying CEEMDAN as a data feature-extraction method on the percent change in RMAE values within the testing phase of the UVI forecast. The contribution of the data decomposition method (i.e., CEEMDAN) to model performance was significant. The decrease in RMAE using GA was between 17% and 63%, with the CLSTM showing the largest decrease (i.e., 63%). Moreover, with the PSO-optimized models, the RMAE values for the deep learning models decreased by ~2% to 60%, and the smallest RMAE decrease was found for the ACO algorithm, with a reduction of ~3% to 36%. However, the CLSTM model with the four optimization methods showed the highest improvement among all the deep learning approaches, reducing the RMSE by 36% to 63%. It is worth mentioning that the percent increase in RMAE was ~83% for the DEV algorithm using the SVR method.
Overall, the CEEMDAN data decomposition algorithm, combined with the four optimization algorithms, showed significant improvement for UVI forecasting over the testing phase.
After additional analysis, the forecasted-to-observed UVI and the absolute forecasting errors (|FE| = |UVIfor − UVIobs|) are displayed in Figure 10. The box plot demonstrates the dispersion of the observed and forecasted UVI from the proposed deep learning approach and the comparison models (Figure 10). Skin cancer is significantly high in Australia, and an accurate forecasting system in this region is therefore essential.
To function effectively, alert systems must generate accurate irradiance forecasts, but the UVI is generally determined by many factors (i.e., the solar zenith angle, altitude, cloud fraction, aerosol optical properties, albedo, and the vertical ozone profile) (Deo et al., 2017). This study extensively utilized four optimization techniques (i.e., GA, ACO, DEV, and PSO) to obtain the optimum predictors for UVI forecasting. The incorporated predictors from three distinct data sets (i.e., SILO, MODIS, and CI) were optimized. The optimization techniques selected diversified lists of variables, except for RMM1 and RMM2, which were selected by all four algorithms. Predictors such as total-column ozone, AOD, and cloud fraction were significant under the GA algorithm. In most cases, the hydro-meteorological variables were found insignificant by all four algorithms, which agrees with the general understanding of UV levels. The objective algorithm (i.e., GA) also selected SOI, GBI, AAO, Nino4, Nino12, RMM1, and RMM2 as potential predictors. Ground-based measurements and modelling studies are essential (Alados et al., 2004, 2007), but are challenging to implement in practice.
Furthermore, secondary factors affecting UV levels (i.e., clouds and aerosols) are rarely known with sufficient precision. Considering practical feasibility, an algorithm that is data-efficient, simple to develop, flexible, and user-friendly should be considered a viable alternative source of information (Igoe et al., 2013a, 2013b; Parisi et al., 2016). Therefore, the developed forecasting model can play a vital role in enabling decision-makers to adopt prompt measures without difficulty.
The proposed hybrid deep learning network (i.e., CEEMDAN-CLSTM) for predicting surface UV radiation also demonstrated low forecasting errors, i.e., around 10% error for the next-day forecast and 13-16% error for the 7-day up to the 4-week forecasts. This further affirms that the quantitative UV forecast is appropriate for heliotherapy applications, which tolerate error levels of up to 10-25%. The CEEMDAN-CLSTM's performance is competitive on UV data from multiple regions; thus, the model can be adapted to forecast other useful UV action spectra, such as vitamin D production and the erythemal UV index. A key limitation of machine learning is its tendency to overfit the training dataset and to generalize poorly to datasets from different distributions. In the context of UV forecasting, this dictates that the model must be retrained with data from the weather station in the geographic region where it will be used. In a geographical region with highly variable weather conditions, such as London in 2019, the performance of artificial neural network models dropped significantly (Raksasat et al., 2021). The model's capability to extract seasonal patterns may also explain why the addition of ozone, cloud fraction, and AOD information significantly improved the performance of CEEMDAN-CLSTM, particularly when the GA algorithm was applied.

Conclusion
This study conducted daily UV index forecasting at the Perth station using aggregated significant antecedent satellite-derived variables associated with UV irradiance. The forecasting was made using a novel hybrid deep learning model (i.e., CEEMDAN-CLSTM) and compared