4.1 Parameterization
Table 1 summarizes the performance of the 6 lookback (LB) and horizon (H) parameter combinations (PCs) in terms of 4 performance metrics (PMs): R2, RMSE (psu), MAE (psu) and MAPE (%). The LB:36, H:36 combination, with values of 0.5542318, 0.7936542, 0.5714286 and 1.76698, respectively, took the 6th performance position (PP). LB:36, H:24, with values of 0.5659614, 0.7831428, 0.5658741 and 1.74519, respectively, took the 5th PP. LB:36, H:12, with values of 0.5980433, 0.7536442, 0.5502256 and 1.697232, respectively, took the 4th PP. LB:12, H:12, with values of 0.6054062, 0.6821565, 0.5375925 and 1.65196, respectively, took the 3rd PP. LB:24, H:24, with values of 0.7389093, 0.5815567, 0.4080222 and 1.25989, respectively, took the 2nd PP. LB:24, H:12, with values of 0.77123, 0.5443722, 0.3624757 and 1.120972, respectively, took the 1st PP. The results show that the values of all the PMs of the 6 LASSO models vary with the LB and H parameter combinations. The difference between the R2 of the LASSO model with the best PC (1st PP) and that of the model with the worst PC (6th PP) was 0.2169982. Relative to the R2 of the best model, this difference amounts to approximately 28.14%, which implies that the best model explained considerably more of the variation in SSS than did the worst model. Compared with the RMSE, the MAE and MAPE are more versatile because they offer relatively simple, interpretable and practical measures for assessing the performance of the models and the accuracy of the predicted values at the same time. The lower the MAE and MAPE are, the better the model performance and the more accurate the predicted values. The difference between the MAE of the LASSO model with the worst PC (6th PP) and that of the model with the best PC (1st PP) was 0.2089529 psu. Relative to the MAE of the worst model, this is a reduction of approximately 36.57%, which implies that the best model forecasts SSS substantially more accurately than does the worst model.
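The two percentages quoted above can be reproduced from the Table 1 values. The short sketch below assumes that the R2 gap is expressed relative to the R2 of the best model and the MAE gap relative to the MAE of the worst model, since these are the denominators that recover 28.14% and 36.57%; the variable names are illustrative only.

```python
# Values copied from Table 1 (best PC: LB:24, H:12; worst PC: LB:36, H:36).
r2_best, r2_worst = 0.77123, 0.5542318
mae_best, mae_worst = 0.3624757, 0.5714286  # psu

# Absolute gaps between the best and worst parameter combinations.
r2_gap = r2_best - r2_worst
mae_gap = mae_worst - mae_best

# Assumption: R2 gap relative to the best R2, MAE gap relative to the worst MAE.
r2_gap_pct = 100.0 * r2_gap / r2_best
mae_gap_pct = 100.0 * mae_gap / mae_worst

print(f"R2 gap:  {r2_gap:.7f} ({r2_gap_pct:.2f}% of the best R2)")
print(f"MAE gap: {mae_gap:.7f} psu ({mae_gap_pct:.2f}% of the worst MAE)")
```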
Table 1: Performance of the Lookback (LB) and Horizons (H) parameter values in time series forecasting of SSS with the ML LASSO models based on the R2, RMSE, MAE and MAPE values
Parameter Combination | Predictor Variables (Input) | R2 | RMSE (psu) | MAE (psu) | MAPE (%) | Performance Position
LB:36, H:36 | sss, ws, hws, sst, adt, sla, precip | 0.5542318 | 0.7936542 | 0.5714286 | 1.76698 | 6th
LB:36, H:24 | sss, ws, hws, sst, adt, sla, precip | 0.5659614 | 0.7831428 | 0.5658741 | 1.74519 | 5th
LB:36, H:12 | sss, ws, hws, sst, adt, sla, precip | 0.5980433 | 0.7536442 | 0.5502256 | 1.697232 | 4th
LB:24, H:24 | sss, ws, hws, sst, adt, sla, precip | 0.7389093 | 0.5815567 | 0.4080222 | 1.25989 | 2nd
LB:24, H:12 | sss, ws, hws, sst, adt, sla, precip | 0.77123 | 0.5443722 | 0.3624757 | 1.120972 | 1st
LB:12, H:12 | sss, ws, hws, sst, adt, sla, precip | 0.6054062 | 0.6821565 | 0.5375925 | 1.65196 | 3rd
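Table 1 varies the lookback (LB) and horizon (H) settings. As an illustration only, the sketch below shows one common way in which such settings can turn a monthly series into supervised samples: each sample uses the previous LB values as inputs and the following H values as targets. The function name make_windows and the 60-value placeholder series are assumptions and are not taken from the source, which does not describe its windowing procedure.

```python
import numpy as np


def make_windows(series, lookback, horizon):
    """Split a 1-D series into (inputs, targets) sliding windows.

    Each input row holds `lookback` consecutive values and the matching
    target row holds the `horizon` values that immediately follow them.
    """
    series = np.asarray(series, dtype=float)
    n_samples = len(series) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this lookback/horizon pair")
    inputs = np.stack([series[i:i + lookback] for i in range(n_samples)])
    targets = np.stack(
        [series[i + lookback:i + lookback + horizon] for i in range(n_samples)]
    )
    return inputs, targets


# Hypothetical placeholder: 60 monthly values indexed 0..59.
monthly_series = np.arange(60)
X, Y = make_windows(monthly_series, lookback=24, horizon=12)
print(X.shape, Y.shape)
```

Under this scheme a longer lookback or horizon leaves fewer samples for fitting, which is one reason why performance can differ between parameter combinations.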
4.2 Determination of variable importance for SSS forecasting
In Table 2, the relative importance of the potential predictor variables (PPVs) is ordered according to the value of the maxSuppSize parameter utilized, from 1 to 6. When the maxSuppSize was 1, V1 (ws), with a coefficient of 0.07405407 and an intercept of 32.68467276, emerged as the most important predictor variable. When the maxSuppSize was 2, V1 (ws) and V2 (hws), with coefficients of 0.5631397 and -0.2651062, respectively, and an intercept of 32.5008548, emerged as the most important predictor variables. When the maxSuppSize was increased to 3, V1 (ws), V2 (hws) and V5 (sla), with coefficients of 0.3885072, -0.1784569 and -6.2265858, respectively, and an intercept of 33.1928511, emerged as the most important predictor variables. When the maxSuppSize was 4, V1 (ws), V2 (hws), V5 (sla) and V4 (adt), with coefficients of 0.14046467, -0.06784094, -2.72801923 and -2.72839010, respectively, and an intercept of 34.52565342, emerged as the most important predictor variables. At a maxSuppSize of 5, V1 (ws), V2 (hws), V5 (sla), V4 (adt) and V3 (sst), with coefficients of 0.06456058, -0.03330373, -1.44796925, -1.44796835 and -0.02981000, respectively, and an intercept of 42.86146292, emerged as the most important predictor variables. At a maxSuppSize of 6, V1 (ws), V2 (hws), V5 (sla), V4 (adt), V3 (sst) and V6 (precip), with coefficients of 0.06453669, -0.03312278, -1.45196918, -1.45194054, -0.03004772 and 0.07919975, respectively, and an intercept of 42.91275984, emerged as the most important set of predictor variables. When maxSuppSize = 3, 4, 5 and 6, V1 (ws), V2 (hws) and V5 (sla) consistently top the list of the most important PPVs in the same descending order of importance. This finding implies that these 3 PPVs are crucial for achieving relatively accurate SSS forecast values, although it was subjected to further verification through subsequent experiments.
This result implies that the performance of the ML LASSO regression model is also affected by the importance of the predictor variables. At maxSuppSize values of 4, 5 and 6, the coefficients of V5 (sla) and V4 (adt) (-2.72801923 and -2.72839010; -1.44796925 and -1.44796835; and -1.45196918 and -1.45194054, respectively) are extremely close to each other. A correlation analysis (collinearity test), used to ascertain the strength of the relationship between sla and adt and the uniqueness of their coefficients, gave R2 = 1. Thus, the closeness of the coefficients reflects collinearity, a condition in which two predictor variables are highly correlated (here, R2 > 0.9). The negative effect of such collinearity on the performance of the ML LASSO regression model in the final SSS forecast task must be avoided by eliminating adt, which has the lower variable importance of the two, as determined at maxSuppSize values of 3, 4, 5 and 6.
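The collinearity test described above can be sketched as a squared Pearson correlation between the two variables, flagged against the 0.9 threshold. The data below are synthetic placeholders, not the study's sla and adt series. The sketch also assumes, as one plausible explanation that the source does not state, that adt at a fixed location differs from sla only by a time-invariant offset, which is sufficient to give R2 = 1.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic placeholder for 60 monthly sea level anomaly values (metres).
sla = rng.normal(loc=0.0, scale=0.05, size=60)

# Assumption: adt equals sla plus a constant offset (hypothetical value).
constant_offset = 0.6
adt = sla + constant_offset


def r_squared(x, y):
    """Squared Pearson correlation coefficient between two 1-D arrays."""
    r = np.corrcoef(x, y)[0, 1]
    return float(r ** 2)


r2_sla_adt = r_squared(sla, adt)
is_collinear = r2_sla_adt > 0.9  # threshold quoted in the text

print(f"R2(sla, adt) = {r2_sla_adt:.6f}, collinear: {is_collinear}")
```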
Table 2: The output of the 7 x 1 sparse matrix from the L0Learn package utilized for the validation of the relative importance of the potential predictor variables for machine learning forecasting of SSS
Maximum Support Size | Variable Importance in Descending Order | Coefficient | Intercept
maxSuppSize = 1 | V1 (ws) | 0.07405407 | 32.68467276
maxSuppSize = 2 | V1 (ws); V2 (hws) | 0.5631397; -0.2651062 | 32.5008548
maxSuppSize = 3 | V1 (ws); V2 (hws); V5 (sla) | 0.3885072; -0.1784569; -6.2265858 | 33.1928511
maxSuppSize = 4 | V1 (ws); V2 (hws); V5 (sla); V4 (adt) | 0.14046467; -0.06784094; -2.72801923; -2.72839010 | 34.52565342
maxSuppSize = 5 | V1 (ws); V2 (hws); V5 (sla); V4 (adt); V3 (sst) | 0.06456058; -0.03330373; -1.44796925; -1.44796835; -0.02981000 | 42.86146292
maxSuppSize = 6 | V1 (ws); V2 (hws); V5 (sla); V4 (adt); V3 (sst); V6 (precip) | 0.06453669; -0.03312278; -1.45196918; -1.45194054; -0.03004772; 0.07919975 | 42.91275984
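The idea behind Table 2, namely reporting the best set of predictors for each allowed support size, can be illustrated with a small exhaustive search. The sketch below is not the L0Learn algorithm and does not use the study's data: it is a brute-force least-squares stand-in on synthetic data, and the function name best_subset and all numerical values are hypothetical.

```python
from itertools import combinations

import numpy as np


def best_subset(X, y, max_supp_size):
    """Exhaustive least-squares search over supports of size 1..max_supp_size.

    Returns, for each support size k, the predictor subset with the
    smallest residual sum of squares, plus its intercept and coefficients.
    """
    n, p = X.shape
    best = {}
    for k in range(1, max_supp_size + 1):
        for support in combinations(range(p), k):
            # Design matrix: intercept column plus the chosen predictors.
            A = np.column_stack([np.ones(n), X[:, list(support)]])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            if k not in best or rss < best[k]["rss"]:
                best[k] = {"support": support, "intercept": float(coef[0]),
                           "coef": coef[1:], "rss": rss}
    return best


# Synthetic data: 6 candidate predictors, only columns 0 and 4 drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 30.0 + 2.0 * X[:, 0] - 1.5 * X[:, 4] + 0.01 * rng.normal(size=200)

result = best_subset(X, y, max_supp_size=3)
for k, fit in result.items():
    print(k, fit["support"], np.round(fit["coef"], 3), round(fit["intercept"], 3))
```

As in Table 2, the coefficients and intercept change as the support grows, so they are comparable only within one support size.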
4.3 Validation of variable importance for SSS forecasting
Table 3 shows that experiment A, which utilized 6 PPVs (ws, hws, sla, adt, sst and precip), predicted the SSS (January to December 2021) with R2 and RMSE values of 0.77123 and 0.5443722, respectively. Experiment B, which utilized 5 PPVs (ws, hws, sla, adt and sst), predicted the SSS with R2 and RMSE values of 0.8189632 and 0.4842613, respectively. Experiment C, which utilized 4 PPVs (ws, hws, sla and adt), predicted the SSS with R2 and RMSE values of 0.8239762 and 0.4775096, respectively. Experiment D, which utilized 3 PPVs (ws, hws and sla), predicted the SSS with R2 and RMSE values of 0.8239762 and 0.4775096, respectively. Experiment E, which also utilized 3 PPVs (ws, hws and adt), predicted the SSS with R2 and RMSE values of 0.8239761 and 0.4775098, respectively. Experiment F, which utilized 2 PPVs (ws and hws), predicted the SSS with R2 and RMSE values of 0.8223169 and 0.4797549, respectively. Experiment G, which utilized 1 predictor variable (ws), predicted the SSS with R2 and RMSE values of 0.8216164 and 0.4806997, respectively. The PPV sets utilized in experiments A, B, C, D, E, F and G therefore occupy the 6th, 5th, 1st, 1st, 2nd, 3rd and 4th performance positions, respectively. The highest RMSE (0.5443722), recorded by experiment A, which utilized all 6 PPVs and took the 6th (worst) performance position, implies that the ML LASSO model could not, on its own, determine the most important factors driving the temporal changes in SSS. The poor performance also implies that the LASSO model could not detect collinearity in such relatively sparse datasets. Given the same parameter combination (LB:24, H:12), the relative differences among the results of experiments A, B, E, F and G in terms of the PMs (R2 and RMSE) confirm the need to predetermine the important predictor variables that should be combined to build a relatively accurate ML LASSO regression model with such sparse datasets.
Additionally, the identical results of experiments C and D in terms of the PMs substantiate the need for conducting collinearity/multicollinearity tests on the PPVs for forecasting SSS when using such an ML LASSO model. The results also show that V1 (ws), V2 (hws) and V5 (sla) consistently top the list of the most important PPVs in the same descending order of importance in experiments A, B, C and D, as they do in Table 2 for maxSuppSize = 3, 4, 5 and 6. This finding substantiates these 3 PPVs as the most crucial for achieving relatively accurate SSS forecast values for the coastal zone. The consistent, unbiased selection of the wind speed and high wind speed as the 1st and 2nd most important variables, respectively, agrees with Gimeno et al. (2012), who argue that the rate of E at the ocean surface depends essentially on three variables, one of which is the wind speed. This further implies that the E predictor variable can be reasonably represented by wind speed in a regression model that seeks to forecast SSS in a coastal zone, particularly where E is not accessible.
Table 3: Performance metrics of the LASSO regression model experiments A-G utilized for validating (verifying) the relative importance of the predictor variables for machine learning forecasting of SSS
Experiment | PPVs | R-squared (R2) | RMSE | PPVs Performance Position
A | V1 (ws); V2 (hws); V5 (sla); V4 (adt); V3 (sst); V6 (precip) | 0.77123 | 0.5443722 | 6th
B | V1 (ws); V2 (hws); V5 (sla); V4 (adt); V3 (sst) | 0.8189632 | 0.4842613 | 5th
C | V1 (ws); V2 (hws); V5 (sla); V4 (adt) | 0.8239762 | 0.4775096 | 1st
D | V1 (ws); V2 (hws); V5 (sla) | 0.8239762 | 0.4775096 | 1st
E | V1 (ws); V2 (hws); V4 (adt) | 0.8239761 | 0.4775098 | 2nd
F | V1 (ws); V2 (hws) | 0.8223169 | 0.4797549 | 3rd
G | V1 (ws) | 0.8216164 | 0.4806997 | 4th
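The near-identical metrics of experiments C, D and E are what one would expect when one predictor is an exact linear function of another. The sketch below illustrates this with ordinary least squares on synthetic data; it is not the LASSO model of the study, it reports in-sample R2 rather than forecast skill, and every series and coefficient in it is a hypothetical placeholder. The relation adt = sla + constant is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 60  # hypothetical number of monthly samples

# Synthetic placeholder predictors.
ws = rng.normal(5.0, 1.5, n)
hws = ws + rng.normal(2.0, 0.5, n)
sla = rng.normal(0.0, 0.05, n)
adt = sla + 0.6  # assumption: exact linear function of sla

# Synthetic placeholder target.
sss = 33.0 + 0.3 * ws - 0.15 * hws - 4.0 * sla + rng.normal(0.0, 0.2, n)


def fit_r2(predictors, target):
    """In-sample R2 of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(target))] + list(predictors))
    # lstsq returns the minimum-norm solution, so it also copes with the
    # rank-deficient design that results when sla and adt are both included.
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    residual = target - A @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(np.sum((target - target.mean()) ** 2))
    return 1.0 - ss_res / ss_tot


r2_c = fit_r2([ws, hws, sla, adt], sss)  # analogue of experiment C
r2_d = fit_r2([ws, hws, sla], sss)       # analogue of experiment D
r2_e = fit_r2([ws, hws, adt], sss)       # analogue of experiment E

print(f"C: {r2_c:.7f}  D: {r2_d:.7f}  E: {r2_e:.7f}")
```

Because the added column carries no new information, the three fits coincide to within numerical precision, which mirrors the pattern in Table 3.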
4.4 SSS forecast
The best ML LASSO model, built with 3 predictor variables, forecasts 31.96201, 32.14657, 32.88663, 32.88309, 32.95712, 33.6475, 34.66366, 34.52677, 33.94008, 33.06252, 32.17958 and 31.21796 psu as the SSS values for the 12 consecutive months of 2021, from January to December (Table 4). The predicted maximum SSS (34.66366 psu) occurred in July, while the predicted minimum SSS (31.21796 psu) occurred in December. Figure 2 presents the plot of the monthly SSS forecasts for January to December 2021 using the LASSO regression model. The y-axis represents the SSS values, while the x-axis represents time, with the intervals 0-9, 10-19, 20-29, 30-39, 40-49 and 50-59 (training) and 60-69 (forecasting) spanning the 2016, 2017, 2018, 2019, 2020 and 2021 epochs. The vertical red line is the boundary between the training and forecasting outputs of the SSS model. The predicted SSS values are relatively low from January to May, higher from June to September, and decline again from October to December (Figure 1).
Table 4: Predicted and observed monthly SSS values for January to December 2021
2021 (Months) | Predicted SSS (psu) | Observed SSS (psu)
January | 31.96201 | 32.74971
February | 32.14657 | 33.16754
March | 32.88663 | 33.11268
April | 32.88309 | 33.10024
May | 32.95712 | 33.57706
June | 33.6475 | 33.23043
July | 34.66366 | 34.00293
August | 34.52677 | 34.29564
September | 33.94008 | 33.31446
October | 33.06252 | 31.69396
November | 32.17958 | 30.93628
December | 31.21796 | 31.1894
4.5 Validation of the SSS forecast
Relative to the observed SSS in Table 4, the RMSE, MAE and MAPE of the SSS predicted by the ML LASSO method are 0.742761142 psu, 0.620565 psu and 1.903923287%, respectively. Among the 3 PMs, the MAPE offers a relatively simple, interpretable and realistic measure of the error (accuracy) of a regression model, particularly where there are negligible or no outliers. Generally, a MAPE of < 10% is considered to indicate "high prediction accuracy" (Lewis, 1982; Ağbulut et al., 2021b). In this regard, the computed MAPE of approximately 1.90%, which is roughly one-fifth of the 10% threshold, implies a relatively high prediction accuracy for the ML LASSO regression model. Although the predicted maximum SSS (34.66366 psu) was found in July, as previously mentioned, the observed maximum SSS (34.29564 psu) occurred in August. Similarly, while the predicted minimum SSS (31.21796 psu) occurred in December, the observed minimum SSS (30.93628 psu) occurred in November. This slight disparity shows that errors in the predicted SSS values also affect the months in which the predicted maximum and minimum SSS values occur. However, a reasonable increase in the number of monthly epochs in the time series datasets used in training the model can reduce the model's error, improve its accuracy, and consequently eliminate such disparity.
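The three error measures and the months of the extreme values can be recomputed directly from Table 4. The sketch below uses the standard definitions of RMSE, MAE and MAPE, with the observed value in the MAPE denominator; the source does not spell out its formulas, so this is an assumption, although it does reproduce the reported figures.

```python
import math

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Values copied from Table 4 (psu).
predicted = [31.96201, 32.14657, 32.88663, 32.88309, 32.95712, 33.6475,
             34.66366, 34.52677, 33.94008, 33.06252, 32.17958, 31.21796]
observed = [32.74971, 33.16754, 33.11268, 33.10024, 33.57706, 33.23043,
            34.00293, 34.29564, 33.31446, 31.69396, 30.93628, 31.1894]

errors = [p - o for p, o in zip(predicted, observed)]
n = len(errors)

rmse = math.sqrt(sum(e * e for e in errors) / n)   # psu
mae = sum(abs(e) for e in errors) / n              # psu
mape = 100.0 * sum(abs(e) / abs(o) for e, o in zip(errors, observed)) / n  # %

# Months in which the extreme predicted and observed values occur.
max_pred_month = months[predicted.index(max(predicted))]
max_obs_month = months[observed.index(max(observed))]
min_pred_month = months[predicted.index(min(predicted))]
min_obs_month = months[observed.index(min(observed))]

print(f"RMSE = {rmse:.6f} psu, MAE = {mae:.6f} psu, MAPE = {mape:.4f}%")
print("Maximum:", max_pred_month, "(predicted) vs", max_obs_month, "(observed)")
print("Minimum:", min_pred_month, "(predicted) vs", min_obs_month, "(observed)")
```

The largest absolute errors fall in October and November, which is consistent with the mismatch between the predicted and observed months of the minimum SSS.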