An approach for good modelling and forecasting of sea surface salinity in a coastal zone using machine learning LASSO regression models built with sparse satellite time-series datasets

doi:10.21203/rs.3.rs-4016353/v1


Research Article


https://doi.org/10.21203/rs.3.rs-4016353/v1

This work is licensed under a CC BY 4.0 License

Version 1


The risks of upstream seawater intrusion from coastal zones, particularly to the environment and people’s health, are gradually becoming serious issues that require proactive environmental monitoring and good modelling approaches. However, the temporal resolutions of relevant contemporary all-weather satellites that detect sea surface salinity (SSS) are unable to support real-time applications that can provide the required early warning information for mitigating such risks. Our current practical knowledge of the efficiency of machine learning (ML) least absolute shrinkage and selection operator (LASSO) regression models built with relatively sparse all-weather satellite data for achieving relatively accurate predictor variable selection, collinearity detection, and high SSS prediction accuracy is still limited. In this paper, we utilized relatively sparse time series all-weather satellite datasets consisting of 6 potential predictor variables (PPVs), namely wind speed (WS), high wind speed (HWS), sea surface temperature (SST), absolute dynamic topography (ADT), sea level anomalies (SLAs) and precipitation (PRECIP) (January 2016-December 2020), to construct an ML LASSO model (using the forecastML library in R/R-studio) to predict SSS on a tropical coast (Nigerian coastal zone). We utilized the same datasets for building the L0-regularized regression (L0) model (using the L0Learn library) to determine the relative importance of the PPVs for the ML time series forecasting of the SSS and to detect collinearity. The output was used to assess the ability of the LASSO model to determine the relative importance of the PPVs for forecasting SSS and detecting collinearity. We determined the best combination of lookback (LB) and h-step-ahead (H) parameter values for building a relatively accurate ML LASSO model with the datasets. We determined and validated the relative importance of the PPVs for forecasting the monthly SSS using the LASSO model with the best combination of parameter values.
We predict and validate the monthly SSS values for January-December 2021 with a relatively accurate model. We show that the LB:24 and H:12 parameter values, with an RMSE of 0.54437, are the best for building a relatively accurate LASSO model with such datasets. We show that the WS, HWS, and SLA are the most important PPVs for achieving relatively accurate SSS forecasts with the model. However, we show the limitations of such a LASSO model in achieving relatively accurate predictor variable selection and collinearity detection. We show practical solutions to such limitations by utilizing the L0 model to assist the LASSO model in achieving relatively high SSS prediction accuracy. Finally, we predict the monthly SSS values using the relatively accurate LASSO model and validate them with the observed SSS (January-December 2021) and obtain an RMSE of 0.7428 and a MAPE of 1.9039%. A MAPE value approximately 5 times less than 10% implies a high SSS prediction accuracy that can be replicated to provide useful early warning information for mitigating such risks in any coastal zone. The results imply that the good practice for using such satellite datasets to build a relatively accurate ML LASSO model for forecasting should begin with rigorous supervised-automatic deletion of observation records with null values and outliers, followed by unbiased selection of appropriate parameter values and important predictor variables and collinearity assessment.

Sea surface salinity

sparse satellite datasets

machine learning

time series forecasting

lasso model

coastal zone

Approximately 97% of all stored free water is found in the oceans, which make up 71% of the surface of the planet (Durack, 2015). Despite the key roles that oceans play in the Earth’s energy budget, the complexity involved still poses some difficulties in understanding all the relevant processes (Lo Bue et al., 2021). A crucial aspect of ocean expeditions in oceanography has been the study of ocean currents, which usually involves measuring relevant phenomena, particularly tides, temperature, salinity, and wind speed. The relationships among such dynamic phenomena have implications for the physical, chemical and biological conditions of oceans from time to time. For example, the interactions between surface temperature and surface salinity have implications for surface density in oceans because warm and/or fresher water is characterized by relatively low density, while cold and/or salty water is usually of relatively high density. This phenomenon primarily causes the surface density of the ocean to invariably impact the vertical flow through the thermohaline circulation that constitutes part of the large-scale ocean circulation (Delworth et al., 1993; Dinnat et al., 2019). The implications of spatial and temporal anomalies in sea surface salinity (SSS) along coastal zones, particularly on relatively small (local or national) spatial scales, include the increasing risk of upstream seawater intrusion. 
More often than not, the risk is associated with socioeconomic and environmental problems such as (a) the relatively high cost of tidal river water treatment for domestic and industrial purposes, (b) the threat to a sustainable freshwater supply for household consumption (Sneath, 2023) coupled with exposure to high blood pressure that may result from drinking water containing relatively high salt concentrations, (c) a decrease in the viability of an agricultural sector that depends on achieving an optimum yield of sensitive plants such as paddy rice and horticultural crops (CGIARCSA, 2016; Trung et al., 2016), and (d) disturbed natural ecosystems that cannot support species diversity and composition. It should be noted that relatively low-salinity water is important for establishing an enabling natural environment for industrial growth and economic development in coastal areas, particularly for the manufacturing and food processing industries. Therefore, proper modelling and forecasting of SSS changes along coastal zones are crucial for providing useful early warning information for mitigating any future risk of upstream seawater intrusion from the open sea through the contiguous continental shelf (Ajibola-James et al., 2023).

Saline water in ocean aquatic ecosystems accounts for approximately 80% of global surface (atmospheric) freshwater fluxes (FWFs) (Durack, 2015). On the global scale, the concentration of dissolved salt in ocean water, which varies in space and time, is a function of evaporation (E), precipitation (P), discharge and ice freezing/melting (Dinnat et al., 2019). Recent studies have shown that the effects of the relationships between some of these factors, particularly E and P, or between evapotranspiration (Et) and P on spatial and temporal changes in SSS may vary significantly with the latitude of the geographic location and spatial scale of the study area. Cheng et al. (2020) observed that large-scale (continental) regions in the subtropics have a distinct pattern of a net negative FWF (E > P), but the higher latitudes have a net positive FWF (E < P) pattern. Liu et al. (2022) reported that 24.1 ~ 58% of the global ocean area exhibited a significant positive correlation between the FWF and SSS tendencies derived from satellite products. Ajibola-James et al. (2023) observed that a relatively small-scale (local or national) region in the tropics has a net positive FWF (Et < P) on both annual and seasonal timescales. However, none of these studies considered the relative importance of other potential SSS drivers, such as wind speed, high wind speed, sea surface temperature, absolute dynamic topography, and sea level anomalies, in relation to E and P for spatial and/or temporal changes in SSS along a coastal area on a relatively small spatial scale. Additionally, it should be noted that the rate of E at the ocean surface depends essentially on three variables, namely, the wind speed, the available heat energy, and the difference in humidity between the air and the surface (Gimeno et al., 2012). 
Consequently, our current understanding of the most important factors driving both spatial and temporal changes in SSS in coastal areas on a relatively small scale is still limited, particularly in regions (including the tropics) that are traditionally undersampled and understudied using appropriate remote sensing techniques.

Prior to the advent of remote sensing technologies that lend themselves to sea surface data observations, the traditional approach to such surface data acquisition was in situ measurement. However, remoteness and large spatial extent limit conventional in situ measurements, while the seasonal cloud cover of dynamically important regions limits the applications of the predominant optical satellite surface observations. The launch of different all-weather satellite missions with a focus on sea surface observations, particularly the European Space Agency’s (ESA’s) Soil Moisture and Ocean Salinity (SMOS) satellite in 2009, which carried a microwave imaging radiometer using aperture synthesis (MIRAS); the subsequent National Aeronautics and Space Administration’s (NASA’s) Aquarius in 2011; and the Soil Moisture Active Passive mission (SMAP) in 2015, which used L-band (1.4 GHz) radiometry to measure SSS at approximately 0.2 practical salinity unit (psu) accuracy, brought a paradigm shift to global satellite observations of SSS and other relevant sea surface variables such as high wind speed, wind speed, and sea surface temperature. This is evident in the relatively high spatial resolutions of approximately 1° (111 km) and 0.25° (27.75 km) in the Aquarius and SMAP SSS datasets, respectively, in relation to the conventional in situ measurements of near-surface salinity obtained by the Argo network of drifting floats at an average spacing of approximately 3° (333 km) (Kramer, 2012; Ajibola-James et al., 2023). It has been observed that the regional or local (relatively small) scale root-mean-squared differences (RMSDs) between Aquarius L3 data (V 5.0) and the Scripps Argo in situ salinity measurements for three different spatial scales (sizes), namely 0.18 psu for 1° x 1°, 0.165 psu for 3° x 3° and 0.11 psu for 10° x 10°, varied with the size of the area of interest (AOI) and were each within the mission accuracy requirement of 0.2 psu (Aquarius/SAC-D, 2018). 
This also showed that the accuracy increases consistently with the size of the AOI. In relation to the Aquarius/SMAP SSS data composite, Boutin et al. (2016) showed that the SMOS SSS data were much more affected by radio frequency interference (RFI) and land contamination. Yi et al. (2020) investigated the effect of such contamination on the accuracy of the Aquarius/SMAP SSS in the South China Sea (SCS) and reported that the RMSD was approximately 0.26 psu at a mean distance of approximately 160 km from the coastline but decreased to 0.18 psu in the middle of the SCS at a distance of approximately 450 km from the coastline. This finding implies that the accuracy increases with distance from the shoreline. In terms of the average time series of the RMSD, they also argued that the SMOS Level-3 data showed more than double the value (0.41 psu) relative to the Aquarius/SMAP data (0.18 psu). However, the current temporal resolutions of these all-weather satellite SSS products are inadequate for real-time application and cannot provide the required early warning information for mitigating the risk of upstream seawater progression from coastal zones.

Machine learning (ML) is a subset of artificial intelligence that enables computers to cleverly and intuitively make relatively accurate predictions based on previous learning by ML models. To make the model selection process simpler for forecasting, ML entails a variety of strategies for identifying patterns and relationships in the data (Chan-Lau, 2017). A notable advantage of ML models and algorithms is their increasing ability to handle the time component of relatively large amounts of data (complex structured, semistructured and unstructured datasets with several characteristics, including volume, velocity, veracity, value and validity) in predictive studies. A time series is a sequence of data collected over a specific period of time. The time scale component of a data series may be either every minute or hourly or daily or monthly or yearly. When only one of the time scales is involved, it is regarded as a single seasonality. Any situation involving datasets with more than one of the time scales, for example, hourly and daily or hourly, daily and monthly or daily, monthly, and yearly, is called multiple seasonality. Time series forecasting has become a significant part of ML since there are many prediction problems with a time component. However, it should be underscored that the concept of a trade-off seems to be inevitable in the application of such ML solutions to real-world problems. In terms of trade-offs, Chan-Lau (2017) considered various ML methods from two categorical perspectives, namely, ‘flexibility’ and ‘interpretability’. He opines that the latter should be given priority over the former and hence suggests a selection of relevant ML methods in decreasing order of interpretability, least absolute shrinkage and selection operator (LASSO) regressions, least squares (LS), generalized additive models (GAM), trees (T), support vector machines (SVM), and methods combining different base learning methods such as bagging and boosting. 
He further argues that the predictive power and interpretability of a linear regression model that improves fit by including a large number of independent variables are negatively affected. To alleviate and/or possibly overcome such effects, he proposed two types of linear models based on methods such as ‘subset selection’ and ‘shrinkage’. Typical examples of ML models that use the latter category of methods, which is of primary concern to this particular study, are the L0-regularized regression model (L0) and LASSO model. Both models are considered sparse learning models, which can assist in eliminating the least important set of predictor variables to optimize the forecast accuracy.

LASSO regression is one of the two most widely used shrinkage methods (Tibshirani, 1996; Chan-Lau, 2017). According to Tibshirani (1996), LASSO development was motivated by Breiman’s nonnegative garotte approach, which usually minimizes a problem’s equation by commencing its computation with the conventional ordinary least squares (OLS) estimate and shrinking it by nonnegative factors that have constrained sums. An extensive simulation study by Breiman showed that the garotte method is capable of lowering the prediction error relative to another method called the subset selector. The study also showed that the garotte is capable of competing well with the ridge regression method (a relaxed variant of LASSO in terms of predictor variable penalty, which operates a continuous process that shrinks coefficients towards zero). However, when the true model has many small nonzero coefficients, the garotte cannot compete adequately with the ridge regression. The observed weakness in the garotte approach is related to the two notable disadvantages of OLS: prediction accuracy (low bias but large variance) and interpretability (inability to determine and select the most important subset from a large set of predictors). Despite the relatively stable characteristics of ridge regression, it lacks the ability to set the coefficients of predictors to zero (Tibshirani, 1996). The challenges associated with OLS, garotte and ridge regression have been adequately addressed in LASSO regression models, which conduct variable selection and parameter estimation at the same time (Tibshirani, 1996). Moreover, L0 and LASSO can be applied complementarily such that the output of the former can be used to further optimize the input (in terms of predictor variable selection) and performance (in terms of forecast accuracy) of the latter. 
A recent study showed that L0 does a relatively good job at optimization through the integration of two complementary algorithms, cyclic coordinate descent (CD) and partial swap-inescapable (PSI) (Hazimeh and Mazumder, 2020). In one of the variants of ML LASSO regression models built with the forecast package in R-studio software, finding a parameter value that provides a correct balance between bias and variance is made easier with the use of the cross-validation (CV) function, a resampling approach. Bergmeir et al. (2018) opined that “When purely (nonlinear, nonparametric) autoregressive methods are applied to forecasting problems, as is often the case (e.g., when using machine learning methods), the aforementioned problems of CV are largely irrelevant, and CV can and should be used without modification, as in the independent case.” Their opinion was the basis for the deployment of a relatively new forecasting package called forecastML by Redell (2020), which is compatible with both R and Python platforms.

Previous studies involving modelling and predicting water surface salinity with statistical models (Urquhart et al., 2012; Qing et al., 2013) and ML models (Nguyen et al., 2018) have utilized both optical and all-weather remotely sensed satellite data. However, studies that utilized such all-weather satellite data for ML time series modelling and SSS forecasting, particularly in coastal areas at relatively small spatial scales, are still very rare. Our current understanding of the efficiency of sparse learning ML models, particularly LASSO regression models, built with relatively sparse (less than 10 predictor variables for less than 70 monthly epochs) all-weather satellite time series datasets for determining the most important factors driving temporal changes in SSS and for modelling and forecasting SSS in coastal areas is still limited. Additionally, our current evidence-based working knowledge of the most relevant parameter combinations and the best parameter values that can be combined for building a relatively accurate ML LASSO model with sparse time series datasets is still relatively poor. What should be considered good modelling practices in the tasks of modelling and predicting SSS using ML LASSO built with relatively sparse datasets is still unknown. In this regard, the objectives of this study are to (i) determine the best combination of lookback (LB) and h-step-ahead (H) parameter values for building a relatively accurate ML LASSO model with sparse time series satellite datasets; (ii) determine and validate (verify) the relative importance of 6 potential predictor variables (PPVs) for forecasting monthly SSS using the model with the best combination of parameter values; and (iii) predict and validate the monthly SSS values for January-December 2021 with a relatively accurate model.

The location adopted for this experimental study was the Nigerian coastal zone, which comprises the immediate maritime area (IMA) and the contiguous Exclusive Economic Zone (EEZ) and reaches approximately 200 nautical miles (370 km) offshore of the Nigerian continental shelf (Anyikwa & Martinez, 2012). The IMA was established for the purpose of this study. Its offset, measured between the shoreline and the edge of the observation points in the contiguous EEZ, ranged from 58 to 100 km (Figure 1). To significantly reduce the effect of the errors associated with satellite SSS data acquisitions close to land masses on the data accuracy, as observed by Boutin et al. (2016), the IMA was excluded from the study area. The study area was restricted to 278 data observation points in the contiguous EEZ of approximately 295,027.4 km² (Figure 1). In the area, the mean monthly rainfall ranges from approximately 28 mm in January to approximately 374 mm in September (Inger et al., 2005), while the mean daily temperature range is 25–36°C (298.15–309.15 K), depending on the time of day and the month of the year (Usoro, 2010). Several rivers, including the Niger, Forcados, Nun, Ase, Imo, Warri, Bonny, and Sombreiro Rivers, discharge freshwater to the coastal region of Nigeria. Given the actual evaporation of 1,000 mm per annum, a total runoff of 1,700–2,000 mm, and an additional flow of 50–60 km³ calculated for the water balance of the Niger system, a total of 250 km³ per year eventually discharges into the Gulf of Guinea (Golitzen et al., 2005; Ajibola-James et al., 2023).

3.1 Satellite observations and map

This study utilized time series satellite datasets on wind speed (ws), high wind speed (hws), sea surface temperature (sst), absolute dynamic topography (adt), sea level anomalies (sla) and precipitation (precip), which constitute the PPVs, and sea surface salinity (sss), which is the target (response) variable. The sss, ws, hws and sst were retrieved from NASA's Soil Moisture Active Passive (SMAP) online repository managed by NASA’s Joint Propulsion Laboratory (JPL, 2020). The adt and sla were retrieved from the Copernicus’ online repository managed by the Copernicus Climate Change Service (CCCS, Undated), while the precip was retrieved from NASA’s Earthdata online repository (Huffman, 2019). All the satellite datasets were retrieved directly from the various online repositories in the network common data form-4 (netCDF-4) file format. The base map material used for the study area was sourced from the article of Anyikwa & Martinez (2012) and modified as appropriate.

3.2 Data preparation

Prior to the modelling and prediction tasks of the study, the appropriate data preparation tasks (data extraction, cleaning and selection) were implemented using automatic (scripted) procedures. The datasets were automatically extracted from the network common data form (.nc and .nc4) files into comma-separated values (.csv) files by executing a Python 3.10.2 script with the glob, netCDF4, pandas, numpy and xarray libraries in the Spyder IDE (Integrated Development Environment) 5.2.2 software. The data cleaning, which involved rigorous supervised-automatic deletion of the observation records with null values and outliers (induced by RFI and land contamination) in each of the datasets in the .csv files, was achieved through three consecutive tasks: (a) automatic deletion of null values by executing a Python script with the pandas, numpy, csv and xarray libraries in the IDE; (b) visual identification and verification of outliers by overlaying each of the monthly SSS observations in the .csv files on Google Earth Pro to ascertain their proximity to land and tendency for land contamination; and (c) automatic deletion of the predetermined outliers by using their concatenated location coordinates as criteria for executing a Python script with the same libraries and IDE that were utilized in (a) above (Ajibola-James et al., 2023). A total of 278 appropriate satellite observation points, which constitute the study area (Fig. 1), were selected for analysis by executing a Python script with the pandas, numpy, csv and xarray libraries in the IDE. The points were imported and merged with the base map using the overlay function in ArcMap 10.4.1.
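Cleaning tasks (a) and (c) can be sketched as follows. This is a minimal illustration rather than the script used in the study: it relies only on the Python standard library instead of pandas, and the field names (lat, lon, sss), the coordinate-key format and the sample values are all hypothetical.

```python
import math

def coord_key(record):
    """Concatenate latitude and longitude into one lookup key (hypothetical format)."""
    return f"{record['lat']:.3f}_{record['lon']:.3f}"

def is_null(value):
    """Treat None and NaN as null values."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def clean(records, outlier_keys):
    """Drop records with any null field, then drop records at flagged locations."""
    no_nulls = [r for r in records if not any(is_null(v) for v in r.values())]
    return [r for r in no_nulls if coord_key(r) not in outlier_keys]

# Illustrative records only; these are not values from the study datasets.
records = [
    {"lat": 4.125, "lon": 6.375, "sss": 33.1},
    {"lat": 4.375, "lon": 6.375, "sss": float("nan")},  # null value: removed as in task (a)
    {"lat": 4.625, "lon": 6.625, "sss": 12.4},          # flagged location: removed as in task (c)
    {"lat": 3.875, "lon": 5.875, "sss": 34.0},
]
outlier_keys = {"4.625_6.625"}  # locations assumed to have been flagged visually in task (b)

cleaned = clean(records, outlier_keys)
print(len(cleaned))
```

Keying the outliers on concatenated coordinates means the same flagged locations can be removed from every monthly file without repeating the visual check.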

3.3 Parameterization

To determine the best combination of lookback (LB) and h-step-ahead (H) parameter values for building the LASSO model with relatively good forecast accuracy, comparative analyses of the 6 possible combinations of LB and H values (LB:36, H:36; LB:36, H:24; LB:36, H:12; LB:24, H:24; LB:24, H:12 and LB:12, H:12) in terms of the performance metrics were carried out using an appropriate algorithm to construct 6 variants of the models, as subsequently described.
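The 6 combinations listed above are consistent with taking candidate values of 12, 24 and 36 months and keeping only the pairs in which LB is at least as large as H (the constraint described in Section 3.4). The short sketch below enumerates them under that assumption; the variable names are illustrative.

```python
from itertools import product

candidate_values = [12, 24, 36]  # months; assumed candidate set inferred from the text

# Keep only pairs where the lookback is at least as long as the horizon.
combinations = [
    (lb, h)
    for lb, h in product(candidate_values, repeat=2)
    if lb >= h
]
# Order from the longest lookback and horizon downwards, as listed in the text.
combinations.sort(reverse=True)

for lb, h in combinations:
    print(f"LB:{lb}, H:{h}")
```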

3.4 Least absolute shrinkage and selection operator regression models and algorithm

The variants of the ML LASSO models and algorithms coupled with appropriate LB and H parameter combinations were built with the direct forecasting algorithm of the forecastML package (library) 0.9.1 in R 4.1.3/R-studio 2022.02.3–492 by utilizing a user-defined prediction function. The package is essentially for time series forecasting with ML methods. Additionally, the cv.glmnet function of the glmnet package 4.1-4, which performs k-fold cross-validation, produces a plot and returns a value for lambda (and gamma if relax = TRUE), was used in R-studio to complement the forecastML library.
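To illustrate how a LASSO fit sets some coefficients exactly to zero for a given lambda, the following is a minimal coordinate-descent sketch in plain Python. It is not the glmnet implementation, which differs in many details (for example standardization and fitting over a path of lambda values); the function names and the toy data are assumptions made for illustration only.

```python
def soft_threshold(z, gamma):
    """Soft-thresholding operator: shrink z towards zero by gamma."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate descent for (1/(2n)) * ||y - X b||^2 + lam * ||b||_1.

    X is a list of rows; columns and y are assumed centred, so no intercept is fitted.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0   # correlation of predictor j with the partial residual
            norm = 0.0  # squared norm of predictor j
            for i in range(n):
                pred_others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_others)
                norm += X[i][j] ** 2
            beta[j] = soft_threshold(rho / n, lam) / (norm / n)
    return beta

# Toy data: y = 2 * x1 + 0.1 * x2 with two centred, orthogonal predictors.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [2.1, -1.9, 1.9, -2.1]

print(lasso_cd(X, y, lam=0.5))  # the weak predictor is dropped
print(lasso_cd(X, y, lam=0.0))  # no penalty: ordinary least squares
```

With the penalty switched on, the weak second predictor receives a coefficient of exactly zero while the strong one is shrunk, which is the variable-selection behaviour that cross-validation over lambda is meant to tune.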

The initial tasks of the ML procedure for achieving objective (i) of this study involved the creation of a data frame for the sss, ws, hws, sst, adt, sla and precip datasets; the creation of a dataset of lagged features for the modelling; and the use of validation datasets for outer-loop nested cross-validation. In the data frame, sss was set as the response (dependent) variable, while the 6 PPVs were set as the predictor (independent) variables. The subsequent task involved training LASSO models across forecast horizons and validation datasets. For the purpose of replicability, set.seed(555) was used for executing the code prior to the creation of a dataset of lagged features for modelling with appropriate arguments, which included numeric parameters of appropriate values for the LB and H. In general, the LB period defines how many previous time steps are used to predict the subsequent time step, while the H typically defines the forecast time steps in days, weeks, months or years. For this particular study, H was defined in months. A separate seed, set.seed(777), was used for plotting the forecasts for each validation dataset. It should be underscored that the forecastML approach assumes that the value of the LB should be equal to or greater than the value of the horizon to compute the results for any supported ML model, including LASSO. For example, if H = 36, LB must be ≥ 36; otherwise, no result will be computed.
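One way to see why LB must be at least H in direct forecasting is that, for an h-step-ahead forecast, only values observed at least h steps before the target are available as features. The sketch below builds such lagged rows for a single series in plain Python. It is an illustration of the general idea under that assumption, not a reproduction of forecastML's internal behaviour; the function name and the toy series are hypothetical.

```python
def lagged_rows(series, lookback, horizon):
    """Build (features, target) pairs for a direct h-step-ahead forecast.

    Only lags from `horizon` up to `lookback` are usable, because values
    more recent than `horizon` steps back are unknown at forecast time.
    """
    if lookback < horizon:
        raise ValueError("lookback must be >= horizon; no usable lags remain")
    lags = range(horizon, lookback + 1)
    rows = []
    for t in range(lookback, len(series)):
        features = [series[t - lag] for lag in lags]
        rows.append((features, series[t]))
    return rows

series = list(range(1, 11))  # toy series 1..10, not study data
rows = lagged_rows(series, lookback=4, horizon=2)
print(rows[0])    # features are the values 2, 3 and 4 steps before the target
print(len(rows))  # number of usable training rows
```

The row count also shows the cost of a long lookback on a short series: each extra lag removes one usable training row, which matters when only 60 monthly epochs are available.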

To achieve objective (i), 6 possible combinations of LB and H values, as previously detailed under ‘Parameterization’, were utilized for building 6 variants of the ML models. The best combination of LB and H parameter values for building a relatively accurate ML LASSO model with sparse time series satellite datasets was determined by comparing the accuracy of the 6 variants of the models in terms of 4 performance metrics, namely, R-squared (R²), root mean squared error (RMSE), mean absolute error (MAE) and mean absolute percentage error (MAPE). R² is a measure of the amount of variation explained. The RMSE is a measure of accuracy for comparing the forecasting errors of two different models for the same response variable. The MAE is a good measure of the absolute differences between the actual and predicted values, while the MAPE is a good measure of the absolute percentage differences between the actual and predicted values. The return_error(data_fit) function was used to compute the LASSO residual error metrics by horizon (for each of the steps-ahead forecasts) and the global error in terms of the MAE and MAPE, while the R² was computed by running appropriate lines of code, and the RMSE was computed with the MLmetrics library for the same period (January 2016-December 2020).
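For reference, the four metrics can be written out as below. This is a sketch using the standard textbook definitions; in particular, R² is taken here as the coefficient of determination (1 minus the ratio of residual to total sum of squares), which is an assumption because the text does not state the exact formula used. The sample values are illustrative, not study data.

```python
import math

def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def mape(obs, pred):
    """Mean absolute percentage error, in percent of the observed value."""
    return 100 * sum(abs(o - p) / abs(o) for o, p in zip(obs, pred)) / len(obs)

def r_squared(obs, pred):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1 - ss_res / ss_tot

# Illustrative salinity-like values in psu; not taken from the study.
obs = [34.0, 35.0, 36.0]
pred = [34.5, 35.0, 35.5]

print(rmse(obs, pred), mae(obs, pred), mape(obs, pred), r_squared(obs, pred))
```

Because SSS values sit in a narrow band around 30 to 36 psu, the MAPE will tend to be small in absolute terms even for errors of several tenths of a psu, so it is best read alongside the RMSE and MAE.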

3.5 L0-regularized regression model and algorithm for determining variable importance

To achieve objective (ii), two variants of sparse learning models (L0 and LASSO) were built and applied sequentially. The first variant, L0, was built with the L0Learn library. It was used to determine the relative importance of the predictor variables for the ML time series forecasting of the SSS. An L0 model characterized by the L0L2 penalty and the CDPSI (Cyclic Coordinate Descent and Partial Swap-Inescapable) algorithm was utilized. The L0-regularized modelling approach was based on the integration of two complementary algorithms, CD and PSI, because it does a relatively good job at optimization (Hazimeh and Mazumder, 2020). A 7 x 1 sparse matrix of class "dgCMatrix" was computed, and the coefficients of the 6 PPVs of sss were computed at different “Maximum Support Size” (maxSuppSize) values ranging from 1 to 6 using the model fitted with the L0Learn.fit function in the L0Learn library. The outputs of this particular analysis at each of the maxSuppSize values were compiled and ordered in a table according to the relative importance of the PPVs.
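For orientation, an L0L2-penalized least-squares problem of the kind described above can be written, in a general form, as follows. The symbols (λ for the L0 penalty weight, γ for the L2 penalty weight, β₀ for the intercept) and the scaling of the loss term are assumptions made for illustration and may differ from the conventions used inside L0Learn.

```latex
\min_{\beta_0,\,\beta}\;
  \frac{1}{2}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^2
  \;+\; \lambda\,\lVert\beta\rVert_0
  \;+\; \gamma\,\lVert\beta\rVert_2^2,
\qquad
\lVert\beta\rVert_0 = \#\{\, j : \beta_j \neq 0 \,\}
```

Here y_i is the sss value and x_i the vector of the 6 PPVs at epoch i. Loosely, capping the maximum support size at k restricts attention to solutions with at most k nonzero coefficients, so the order in which PPVs enter the model as k grows from 1 to 6 gives a ranking of their relative importance.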

3.6 LASSO experiment for validation of variable importance

The last task in achieving objective (ii) involved validation (verification) of the relative importance of the PPVs for the ML time series forecasting of SSS for a period of 12 months (January to December 2021) using a variety of trained LASSO regression models and algorithms to conduct a series of experiments A–G. For this particular task of the objective, the algorithm adopted the best LB:24, H:12 combination, which took the 1st performance position in the comparative analyses of the 6 possible parameter combinations. Experiment A utilized 6 PPVs, V1 (ws), V2 (hws), V5 (sla), V4 (adt), V3 (sst) and V6 (precip). Experiment B utilized 5 PPVs, V1 (ws), V2 (hws), V5 (sla), V4 (adt) and V3 (sst). Experiment C utilized 4 PPVs, V1 (ws), V2 (hws), V5 (sla) and V4 (adt). Experiment D utilized 3 PPVs, V1 (ws), V2 (hws), and V5 (sla). Experiment E also utilized 3 PPVs, V1 (ws), V2 (hws) and V4 (adt). Experiment F utilized 2 PPVs, V1 (ws) and V2 (hws), while Experiment G utilized 1 PPV, V1 (ws). The outputs of the comparative analyses of the experiments were compiled and ordered in a table according to the relative importance of the combinations of the PPV(s), which was determined by the computed models’ performance metrics (PMs), R² and RMSE. However, given the sparse monthly epochs of the time series dataset (60 months) utilized, it should be underscored that the PMs were computed for only the time frame included in training the model.

3.7 LASSO for forecasting SSS

The typical LASSO regression model and algorithm coupled with the best parameter combination (LB:24, H:12) and the 3 most important predictor variables, V1 (ws), V2 (hws), and V5 (sla), utilized for Experiment D were adopted for forecasting the sss values to achieve the initial part of objective (iii). In this regard, the sss values were predicted on a monthly time scale of 12 months (January-December 2021) and subsequently validated.

3.8 Validation of the SSS forecast

To achieve the final part of objective (iii), the PMs of the ML LASSO model were computed for the time frame excluded from training the model. This was essentially because the true test of an ML model’s performance in forecasting future values is usually determined by the value of its PMs in forecasting target values that are not included in the model’s training datasets. In relation to the observed SSS, the RMSE of the SSS predicted by LASSO for January-December 2021 was computed with the MLmetrics library in R/R Studio by running the code RMSE(sss_pred1, sss_obs1), where sss_pred1 is the predicted SSS value and sss_obs1 is the actual satellite SSS for January-December 2021. The output was also verified in MS Excel by using the formula SQRT(SUMSQ(A2:A13-B2:B13)/COUNTA(A2:A13)), where A2:A13 and B2:B13 represent the columns containing the predicted and observed SSS values, respectively. The formula was typed in a cell and executed with CTRL + SHIFT + ENTER keys to compute the RMSE. The MAE and MAPE were also computed in MS Excel given their relatively good interpretability. To compute the MAE, the absolute difference between each of the observed and the predicted SSS records in columns A2:A13 and B2:B13, respectively, was first given by the formula ABS(B2-C2), and the MAE value was subsequently given by the formula SUMPRODUCT(E2:E13)/COUNT(E2:E13). To compute the MAPE, the absolute percentage error for each of the observed and predicted SSS records in columns A2:A13 and B2:B13, respectively, was given by =ABS(B2-C2)/B2*100, and the MAPE value was subsequently given by the formula =AVERAGE(G2:G13).

4. Results and discussion

4.1 Parameterization

In Table 1, the LB:36, H:36 parameter combination (PC), with R², RMSE (psu), MAE (psu), and MAPE (%) performance metrics (PMs) of 0.5542318, 0.7936542, 0.5714286, and 1.76698, respectively, achieved the 6th performance position (PP). LB:36, H:24 achieved the 5th PP with 0.5659614, 0.7831428, 0.5658741, and 1.74519, respectively. LB:36, H:12 achieved the 4th PP with 0.5980433, 0.7536442, 0.5502256, and 1.697232, respectively. LB:24, H:24 achieved the 2nd PP with 0.7389093, 0.5815567, 0.4080222, and 1.25989, respectively. LB:24, H:12 achieved the 1st PP with 0.77123, 0.5443722, 0.3624757, and 1.120972, respectively. LB:12, H:12 achieved the 3rd PP with 0.6054062, 0.6821565, 0.5375925, and 1.65196, respectively. The results show that the values of all the PMs of the 6 LASSO models vary with the LB and H parameter combinations. The difference between the R² of the LASSO model with the best PC (1st PP) and the LASSO model with the worst PC (6th PP) was 0.2169982. This finding implies that the best model explained more of the variation in SSS than did the worst model, by a margin equal to approximately 28.14% of the best model's R². Compared with the RMSE, the MAE and MAPE appear more versatile because they offer relatively simple, interpretable and practical measures for assessing the performance of the models and the accuracy of the predicted values at the same time. The lower the MAE and MAPE are, the better the model performance, and the more accurate the predicted values are. The difference between the MAEs of the LASSO model with the best PC (1st PP) and the LASSO model with the worst PC (6th PP) was 0.2089529 psu. This finding implies that the best model forecasts SSS with an MAE approximately 36.57% lower than that of the worst model.

Table 1: Performance of the Lookback (LB) and Horizons (H) parameter values in time series forecasting of SSS with the ML LASSO models based on the R², RMSE, MAE and MAPE values

Parameter Combination	Predictor Variables (Input)	R²	RMSE (PSU)	MAE (PSU)	MAPE (%)	Performance Position
LB:36, H:36	sss, ws, hws, sst, adt, sla, precip	0.5542318	0.7936542	0.5714286	1.76698	6th
LB:36, H:24	sss, ws, hws, sst, adt, sla, precip	0.5659614	0.7831428	0.5658741	1.74519	5th
LB:36, H:12	sss, ws, hws, sst, adt, sla, precip	0.5980433	0.7536442	0.5502256	1.697232	4th
LB:24, H:24	sss, ws, hws, sst, adt, sla, precip	0.7389093	0.5815567	0.4080222	1.25989	2nd
LB:24, H:12	sss, ws, hws, sst, adt, sla, precip	0.77123	0.5443722	0.3624757	1.120972	1st
LB:12, H:12	sss, ws, hws, sst, adt, sla, precip	0.6054062	0.6821565	0.5375925	1.65196	3rd
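The comparison between the best and worst parameter combinations can be retraced with a short Python sketch that uses only the values in Table 1. The variable names are illustrative. The sketch assumes that the R² margin is expressed as a share of the best model's R² and the MAE margin as a share of the worst model's MAE, which is the reading that reproduces the percentages quoted in the text.

```python
# Table 1 values keyed by (LB, H); metrics are R², RMSE (psu), MAE (psu), MAPE (%).
table1 = {
    (36, 36): {"r2": 0.5542318, "rmse": 0.7936542, "mae": 0.5714286, "mape": 1.76698},
    (36, 24): {"r2": 0.5659614, "rmse": 0.7831428, "mae": 0.5658741, "mape": 1.74519},
    (36, 12): {"r2": 0.5980433, "rmse": 0.7536442, "mae": 0.5502256, "mape": 1.697232},
    (24, 24): {"r2": 0.7389093, "rmse": 0.5815567, "mae": 0.4080222, "mape": 1.25989},
    (24, 12): {"r2": 0.77123, "rmse": 0.5443722, "mae": 0.3624757, "mape": 1.120972},
    (12, 12): {"r2": 0.6054062, "rmse": 0.6821565, "mae": 0.5375925, "mape": 1.65196},
}

# Rank by RMSE: the lowest error is the best combination, the highest the worst.
best = min(table1, key=lambda pc: table1[pc]["rmse"])
worst = max(table1, key=lambda pc: table1[pc]["rmse"])

r2_gap = table1[best]["r2"] - table1[worst]["r2"]
r2_gap_share = 100 * r2_gap / table1[best]["r2"]      # share of the best model's R²

mae_gap = table1[worst]["mae"] - table1[best]["mae"]
mae_reduction = 100 * mae_gap / table1[worst]["mae"]  # reduction relative to the worst MAE

print(best, worst, round(r2_gap, 7), round(r2_gap_share, 2),
      round(mae_gap, 7), round(mae_reduction, 2))
```

Note that the two percentages use different baselines, so they are not directly comparable with each other.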

4.2 Determination of variable importance for SSS forecasting

In Table 2, the relative importance of the predictor variables is ordered according to the value of the maxSuppSize parameter utilized, from 1 to 6. When the maxSuppSize was 1, V1 (ws), with a coefficient of 0.07405407 and an intercept of 32.68467276, emerged as the most important predictor variable. When the maxSuppSize was 2, V1 (ws) and V2 (hws), with coefficients of 0.5631397 and -0.2651062, respectively, and an intercept of 32.5008548, emerged as the most important predictor variables. When the maxSuppSize was increased to 3, V1 (ws), V2 (hws) and V5 (sla), with coefficients of 0.3885072, -0.1784569 and -6.2265858, respectively, and an intercept of 33.1928511, emerged as the most important predictor variables. When the maxSuppSize was 4, V1 (ws), V2 (hws), V5 (sla) and V4 (adt), with coefficients of 0.14046467, -0.06784094, -2.72801923, and -2.72839010, respectively, and an intercept of 34.52565342, emerged as the most important predictor variables. At a maxSuppSize of 5, V1 (ws), V2 (hws), V5 (sla), V4 (adt) and V3 (sst), with coefficients of 0.06456058, -0.03330373, -1.44796925, -1.44796835 and -0.02981000, respectively, and an intercept of 42.86146292, emerged as the most important predictor variables. At a maxSuppSize of 6, V1 (ws), V2 (hws), V5 (sla), V4 (adt), V3 (sst) and V6 (precip), with coefficients of 0.06453669, -0.03312278, -1.45196918, -1.45194054, -0.03004772 and 0.07919975, respectively, and an intercept of 42.91275984, emerged as the most important set of predictor variables. When maxSuppSize = 3, 4, 5, and 6, V1 (ws), V2 (hws), and V5 (sla) consistently top the list of the most important PPVs in the same descending order of importance. This finding implies that the 3 PPVs are crucial for achieving relatively accurate SSS forecast values, although it was subjected to further verification through subsequent experiments.
This result implies that the performance of the ML LASSO regression model is also affected by the importance of the predictor variables. At maxSuppSize values of 4, 5, and 6, the coefficients of V5 (sla) and V4 (adt) (-2.72801923 and -2.72839010; -1.44796925 and -1.44796835; and -1.45196918 and -1.45194054, respectively) are extremely close to each other. The correlation analysis (collinearity test) used to ascertain the strength of the relationship between sla and adt and the uniqueness of their coefficients shows that R² = 1. Thus, the closeness of the coefficients reflects collinearity, a condition in which two predictor variables are highly correlated (here, R² > 0.9). The negative effect of such collinearity on the performance of the ML LASSO regression model in the final “SSS forecast” task must be avoided by eliminating adt, which has the lower variable importance of the two, as determined by the results for maxSuppSize = 3, 4, 5, and 6.

Table 2: The output of the 7 x 1 sparse matrix from the L0Learn package utilized for the validation of the relative importance of the potential predictor variables for machine learning forecasting of SSS

Maximum Support Size	Variable Importance in Descending Order	Coefficient	Intercept
maxSuppSize = 1	V1 (ws)	0.07405407	32.68467276
maxSuppSize = 2	V1 (ws) V2 (hws)	0.5631397 -0.2651062	32.5008548
maxSuppSize = 3	V1 (ws) V2 (hws) V5 (sla)	0.3885072 -0.1784569 -6.2265858	33.1928511
maxSuppSize = 4	V1 (ws) V2 (hws) V5 (sla) V4 (adt)	0.14046467 -0.06784094 -2.72801923 -2.72839010	34.52565342
maxSuppSize = 5	V1 (ws) V2 (hws) V5 (sla) V4 (adt) V3 (sst)	0.06456058 -0.03330373 -1.44796925 -1.44796835 -0.02981000	42.86146292
maxSuppSize = 6	V1 (ws) V2 (hws) V5 (sla) V4 (adt) V3 (sst) V6 (precip)	0.06453669 -0.03312278 -1.45196918 -1.45194054 -0.03004772 0.07919975	42.91275984
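The collinearity test described above can be sketched in Python as a squared Pearson correlation checked against the 0.9 threshold. The numbers below are made up for illustration and are not the study's data. The sketch assumes that, at a fixed location, adt differs from sla only by a time-invariant offset (in altimetry, absolute dynamic topography is conventionally the sea level anomaly plus a mean dynamic topography), which is one plausible explanation for R² = 1.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical monthly sea level anomalies (m); not the study's data.
sla = [0.02, 0.05, -0.01, 0.08, 0.03, -0.04]
offset = 0.45  # hypothetical time-invariant offset (m)
adt = [value + offset for value in sla]

r_squared = pearson_r(sla, adt) ** 2
is_collinear = r_squared > 0.9  # threshold used in the text
print(r_squared, is_collinear)
```

When two predictors are related in this way, either one carries the same information for a linear model, which is consistent with retaining sla and dropping adt.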

4.3 Validation of variable importance for SSS forecasting

Table 3 shows that experiment A, which utilized 6 PPVs (ws, hws, sla, adt, sst and precip), predicted SSS (Jan.-Dec., 2021) with R² and RMSE values of 0.77123 and 0.5443722, respectively. Experiment B, which utilized 5 PPVs (ws, hws, sla, adt and sst), predicted the SSS with R² and RMSE values of 0.8189632 and 0.4842613, respectively. Experiment C, which utilized 4 PPVs (ws, hws, sla and adt), predicted the SSS with R² and RMSE values of 0.8239762 and 0.4775096, respectively. Experiment D, which utilized 3 PPVs (ws, hws and sla), predicted the SSS with R² and RMSE values of 0.8239762 and 0.4775096, respectively. Experiment E, which also utilized 3 PPVs (ws, hws and adt), predicted the SSS with R² and RMSE values of 0.8239761 and 0.4775098, respectively. Experiment F, which utilized 2 PPVs (ws and hws), predicted the SSS with R² and RMSE values of 0.8223169 and 0.4797549, respectively. Experiment G, which utilized 1 predictor variable (ws), predicted the SSS with R² and RMSE values of 0.8216164 and 0.4806997, respectively. The results show that the PPVs utilized in experiments A, B, C, D, E, F and G occupy the 6th, 5th, 1st, 1st, 2nd, 3rd and 4th performance positions, respectively. The highest RMSE (0.5443722), recorded by experiment A, which utilized all 6 PPVs and took the 6th (worst) performance position, implies that the ML LASSO model on its own could not determine the most important factors driving the temporal changes in SSS. The poor performance also implies that the LASSO model could not detect collinearity in such relatively sparse datasets. Given the same parameter combination (LB:24, H:12), the relative differences in the results of experiments A, B, E, F, and G in terms of the PMs (R² and RMSE) confirm the need to predetermine the important predictor variables that should be combined for building a relatively accurate ML LASSO regression model with such sparse datasets.
Additionally, the identical PMs obtained in experiments C and D, which differ only in the inclusion of adt, substantiate the need for conducting collinearity/multicollinearity tests on PPVs for forecasting SSS when using such an ML LASSO model. The results also show that V1 (ws), V2 (hws), and V5 (sla) consistently top the list of the most important PPVs in the same descending order of importance in experiments A, B, C, and D, similar to where maxSuppSize = 3, 4, 5, and 6 in Table 2. This finding substantiates the 3 PPVs as the most crucial for achieving relatively accurate SSS forecast values for the coastal zone. The consistent unbiased selection of the wind speed and high wind speed as the 1st and 2nd most important variables, respectively, agrees with the opinion of Gimeno et al. (2012), who argue that the rate of evaporation (E) at the ocean surface depends essentially on three variables, which include the wind speed. This further implies that the E predictor variable can be reasonably represented by wind speed in a regression model that seeks to forecast SSS in a coastal zone, particularly where E is not accessible.

Table 3: Performance metrics of the LASSO regression model experiments A-G utilized for validating (verifying) the relative importance of the predictor variables for machine learning forecasting of SSS

Experiment	PPVs	R-squared (R²)	RMSE	PPVs Performance Position
A	V1 (ws) V2 (hws) V5 (sla) V4 (adt) V3 (sst) V6 (precip)	0.77123	0.5443722	6th
B	V1 (ws) V2 (hws) V5 (sla) V4 (adt) V3 (sst)	0.8189632	0.4842613	5th
C	V1 (ws) V2 (hws) V5 (sla) V4 (adt)	0.8239762	0.4775096	1st
D	V1 (ws) V2 (hws) V5 (sla)	0.8239762	0.4775096	1st
E	V1 (ws) V2 (hws) V4 (adt)	0.8239761	0.4775098	2nd
F	V1 (ws) V2 (hws)	0.8223169	0.4797549	3rd
G	V1 (ws)	0.8216164	0.4806997	4th
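The performance positions in Table 3 can be reproduced with a small Python sketch. It assumes that the positions follow a dense ranking on RMSE, in which tied experiments share a position and the next distinct value takes the next position. The variable names are illustrative.

```python
# RMSE values of experiments A-G, copied from Table 3.
rmse_by_experiment = {
    "A": 0.5443722,
    "B": 0.4842613,
    "C": 0.4775096,
    "D": 0.4775096,
    "E": 0.4775098,
    "F": 0.4797549,
    "G": 0.4806997,
}

# Dense ranking: sort the distinct RMSE values, then look up each experiment's value.
distinct_rmse = sorted(set(rmse_by_experiment.values()))
positions = {
    name: distinct_rmse.index(rmse) + 1
    for name, rmse in rmse_by_experiment.items()
}
print(positions)
```

Under this convention experiments C and D tie, which matches their identical R² and RMSE values in the table.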

4.4 SSS forecast

The best ML LASSO model, built with the 3 most important predictor variables, forecasts SSS values of 31.96201, 32.14657, 32.88663, 32.88309, 32.95712, 33.6475, 34.66366, 34.52677, 33.94008, 33.06252, 32.17958, and 31.21796 psu for the 12 consecutive months from January to December 2021 (Table 4). The predicted maximum SSS (34.66366 psu) occurred in July, while the predicted minimum SSS (31.21796 psu) occurred in December. Figure 2 represents the plot of the monthly SSS forecasts for January to December 2021 using the LASSO regression model. The y-axis represents the SSS values, while the x-axis represents the years where the values 0-9, 10-19, 20-29, 30-39, 40-49, 50-59 (training) and 60-69 (forecasting) represent the 2016, 2017, 2018, 2019, 2020 and 2021 epochs, respectively. The vertical red line is the boundary between the training and forecasting outputs of the model. The predicted SSS values are relatively low from January to May, higher from June to October, and decline again in November and December (Table 4).

Table 4: Predicted and observed monthly SSS values for January to December 2021

2021 (Months)	Predicted SSS (PSU)	Observed SSS (PSU)
January	31.96201	32.74971
February	32.14657	33.16754
March	32.88663	33.11268
April	32.88309	33.10024
May	32.95712	33.57706
June	33.6475	33.23043
July	34.66366	34.00293
August	34.52677	34.29564
September	33.94008	33.31446
October	33.06252	31.69396
November	32.17958	30.93628
December	31.21796	31.1894
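As a cross-check, the validation metrics can be recomputed directly from Table 4. The Python sketch below is an independent recalculation using the same definitions of RMSE, MAE and MAPE; it is not the R or MS Excel workflow used in this study, and the variable names are illustrative.

```python
import math

months = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
# Predicted and observed SSS (psu) for 2021, copied from Table 4.
predicted = [31.96201, 32.14657, 32.88663, 32.88309, 32.95712, 33.6475,
             34.66366, 34.52677, 33.94008, 33.06252, 32.17958, 31.21796]
observed = [32.74971, 33.16754, 33.11268, 33.10024, 33.57706, 33.23043,
            34.00293, 34.29564, 33.31446, 31.69396, 30.93628, 31.1894]

n = len(observed)
errors = [p - o for p, o in zip(predicted, observed)]

rmse = math.sqrt(sum(e * e for e in errors) / n)
mae = sum(abs(e) for e in errors) / n
mape = 100 * sum(abs(e) / o for e, o in zip(errors, observed)) / n

# Months in which the predicted and observed extremes occur.
predicted_max_month = months[predicted.index(max(predicted))]
observed_max_month = months[observed.index(max(observed))]
predicted_min_month = months[predicted.index(min(predicted))]
observed_min_month = months[observed.index(min(observed))]

print(rmse, mae, mape)
print(predicted_max_month, observed_max_month, predicted_min_month, observed_min_month)
```

The largest absolute errors in Table 4 fall in October and November, so those two months contribute most to the RMSE.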

4.5 Validation of the SSS forecast

In relation to the observed SSS in Table 4, the RMSE, MAE and MAPE of the SSS predicted by the ML LASSO model are 0.742761142 psu, 0.620565 psu and 1.903923287%, respectively. Among the 3 PMs, the MAPE offers a relatively simple, interpretable and realistic measure of the error (accuracy) in a regression model, particularly where there are negligible or no outliers. Generally, a MAPE of < 10% is considered to indicate “high prediction accuracy” (Lewis, 1982; Ağbulut et al., 2021b). In this regard, the computed MAPE of approximately 1.90%, which is roughly one-fifth of the 10% threshold, implies a relatively high prediction accuracy in the ML LASSO regression model. Although the predicted maximum SSS (34.66366 psu) was found in July, as previously mentioned, the observed maximum SSS (34.29564 psu) occurred in August. While the predicted minimum SSS (31.21796 psu) occurred in December, the observed minimum SSS (30.93628 psu) occurred in November. This slight disparity shows that errors in the predicted SSS values also affect the months in which the predicted maximum and minimum SSS values occur. However, a reasonable increase in the number of monthly epochs in the time series datasets used in training the model may reduce the model’s error, improve its accuracy, and consequently reduce or eliminate such disparity.

5. Conclusion and recommendation

Given that the current temporal resolutions of all-weather satellite SSS datasets (including SMOS, Aquarius and SMAP) are unable to support a desirable real-time application, a relatively accurate forecast of temporal changes in SSS along coastal zones is crucial for providing useful early warning information for mitigating any upcoming risk of upstream seawater intrusion to people’s well-being and environmental health. In this study, we show that relatively sparse archived (non-real-time) all-weather satellite datasets can be properly utilized to construct an efficient ML LASSO model for forecasting temporal changes in SSS along a coastal zone to provide useful early warning information for mitigating such risk.

Our findings in terms of parameterization, important predictor variable selection, and performance metrics in relation to the experimental application of the ML model are subsequently highlighted.

Among the 6 parameter combinations evaluated in this study, LB:24, H:12, with an RMSE of 0.5443722 psu, is the best for building a relatively accurate ML LASSO model with such sparse time series satellite datasets (6 predictor variables for 60 monthly epochs).
Wind speed, high wind speed, and sea level anomalies are the most important (in descending order) of the 6 PPVs for forecasting the monthly SSS using the model with the best combination of parameter values (LB:24, H:12).
The ML LASSO model that we built with the best parameter values and the 3 most important predictor variables was found to forecast SSS values (from January to December 2021) with relatively high prediction accuracy (given a MAPE of approximately 1.90%, which is roughly one-fifth of the 10% threshold).
We found that the ML LASSO regression model built with relatively sparse all-weather satellite time series datasets could not determine the most important factors driving temporal changes in SSS and could not detect collinearity without the support of the L0-regularized regression model.
However, the L0-regularized model with the L0L2 penalty and CDPSI compensated for the inadequacies found in the LASSO models by helping to determine the most important predictor variables and to detect collinearity before the best LASSO model achieved relatively accurate SSS prediction.

The results imply that a good modelling procedure for using such sparse all-weather satellite datasets to build a relatively accurate ML LASSO model for time series forecasting of SSS should begin with data cleaning, which involves rigorous supervised automatic deletion of observation records with null values and outliers (induced by RFI and land contamination), followed by unbiased selection of appropriate LB and H parameter values and important predictor variables and collinearity assessment. The results further imply that the E predictor variable can be reasonably represented by wind speed in a regression model that seeks to forecast SSS in a coastal zone, particularly in a situation such as that of this study where the E is inaccessible.

Finally, we recommend further studies that utilize additional monthly epochs of time series datasets for training the ML LASSO model to possibly improve its accuracy and ascertain the anticipated effect of such improvement in terms of eliminating the slight disparity found in the accuracy of the months in which the predicted maximum and minimum SSS values occurred.

Declarations

Acknowledgements The authors appreciate GeoInheritance Limited for providing appropriate support for successful data preparation tasks (e.g., data extraction, cleaning and selection) using Python scripts.

Authors’ contributions O.A. and F.O. contributed to the study conception and design. Data retrieval from online repositories, data preparation, data analysis and allied coding were performed by O.A. The study was supervised by F.O. The first draft of the manuscript was written by O.A., and the coauthor commented on previous versions of the manuscript. The final manuscript was written by O.A. with the approval of the coauthor.

Funding Not applicable.

Availability of data and material The data utilized in this study are available upon request from the corresponding author.

Code availability The code is available upon request from the corresponding author.

Ethics approval Not applicable.

Consent to participate Not applicable.

Consent for publication Not applicable.

Conflict of interest The authors declare that they have no conflicts of interest.

References

Ağbulut, Ü., Gürel, A. E., & Sarıdemir, S. (2021b). Experimental investigation and prediction of performance and emission responses of a CI engine fuelled with different metal-oxide based nanoparticles–diesel blends using different machine learning algorithms. Energy, 215, 119076.
Ajibola-James, O., Okeke, F. I., & Ojinnaka, O. C. (2023). Assessment of variability of sea surface salinity using integrated all-weather satellite data in a tropical coast (Nigerian coastal zone). Research Square. Preprint. https://doi.org/10.21203/rs.3.rs-3449318/v1
Anyikwa, O. B., & Martinez, N. (2012). Continental Shelf Act, 2012. A legislation drafting project submitted in partial fulfilment of the requirements for the award of the Degree of Master of Laws (LL.M.) in International Maritime Law at the International Maritime Law Institute, IMO. 1 - 40. https://imli.org/wp-content/uploads/2021/03/Obiora-Bede-Anyikwa.pdf
Aquarius/SAC-D (2018). Aquarius salinity validation analysis. Aquarius Project Document: AQ-014-PS-0016, 1–45. https://salinity.oceansciences.org/docs/AQ-014-PS-0016_AquariusSalinityDataValidationAnalysis_DatasetVersion5.0.pdf
Bindoff, N.L., W.W.L. Cheung, J.G. Kairo, J. Arístegui, V.A. Guinder, R. Hallberg, N. Hilmi, N. Jiao, M.S. Karim, L. Levin, S. O’Donoghue, S.R. Purca Cuicapusa, B. Rinkevich, T. Suga, A. Tagliabue, & P. Williamson, (2019). Changing ocean, marine ecosystems, and dependent communities. In H.-O. Pörtner, D.C. Roberts, V. Masson-Delmotte, P. Zhai, M. Tignor, E. Poloczanska, K. Mintenbeck, A. Alegría, M. Nicolai, A. Okem, J. Petzold, B. Rama, N.M. Weyer (Eds.), IPCC Special Report on the Ocean and Cryosphere in a Changing Climate, 447–588. In press. https://www.ipcc.ch/site/assets/uploads/sites/3/2019/11/09_SROCC_Ch05_FINAL-1.pdf
Boutin, J., Chao, Y., Asher, W. E., Delcroix, T., Drucker, R., Drushka, K., Kolodziejczyk, N., Lee, T., Reul, N., Reverdin, G., Schanze, J., Soloviev, A., Yu, L., Anderson, J., Brucker, L., Dinnat, E., Santos-Garcia, A., Jones, W., Maes, C., Meissner, T., Tang, W., Vinogradova, N., & Ward, B. (2016). Satellite and in situ salinity: understanding near-surface stratification and subfootprint variability. Bulletin of the American Meteorological Society, 97(8), 1391–1407. https://doi.org/10.1175/bams-d-15-00032.1
Casey, R. (2021, November 12). Concern grows over Atlantic Ocean ‘conveyor belt’ shutdown. Aljazeera. https://www.aljazeera.com/news/2021/11/12/concern-grows-over-atlantic-ocean-conveyor-belt-shutdown
CCCS (Undated). Sealevel_glo_phy_climate_L4_my_008_057. Global ocean gridded L4 sea surface heights and derived variables reprocessed. Dataset accessed: 2022-07-10, https://doi.org/10.48670/moi-00145
Chan-Lau, J. A. (2017). Lasso Regressions and Forecasting Models in Applied Stress Testing. International Monetary Fund (IMF) Working Paper, WP/17/108. https://www.imf.org/~/media/Files/Publications/WP/2017/wp17108.ashx
Delworth, T. D., Manabe, S. & Stouffer, R. J. (1993). Interdecadal variations of the thermohaline circulation in a coupled ocean–atmosphere model. J. Clim. 6, 1993–2011.
Dinnat, E. P., Le Vine, D. M., Boutin, J., Meissner, T., & Lagerloef, G. (2019). Remote sensing of sea surface salinity: Comparison of satellite and in situ observations and impact of retrieval parameters. Remote Sensing, 11(7). https://doi.org/10.3390/rs11070750
Durack, P. J. (2015). Ocean salinity and the global water cycle. Oceanography 28(1):20–31, http://dx.doi.org/10.5670/oceanog.2015.03
FAO. (1986). Marine fishery resources of Nigeria: A review of exploited fish stocks. Chapters 1-3. https://www.fao.org/3/r9004e/R9004E00.htm#TOC
FAO. (2003). Monitoring, measurement and assessment of fishing capacity: the Nigerian experience. In S. Pascoe, D. Gréboval (eds.), Measuring capacity in fisheries. FAO Fisheries Technical Paper, 445, 314p. https://www.fao.org/3/y4849e/y4849e0c.htm
Gimeno, L., Nieto, R., Drumond, A., & Durán-Quesada, A. M. (2012). Ocean Evaporation and Precipitation. In: Meyers, R. A. (ed.) Encyclopedia of Sustainability Science and Technology. Springer, New York, NY. https://doi.org/10.1007/978-1-4419-0851-3_734
Golitzen, K. G. (Ed.), Andersen, I., Dione, O., Jarosewich-Holder, M., & Olivry, J. (2005). The Niger River Basin: A vision for sustainable management. World Bank, Washington, DC. https://doi.org/10.1596/978-0-8213-6203-7
Huffman, G.J., Stocker, E.F., Bolvin, D.T., Nelkin, E.J., Jackson, T. (2019). GPM IMERG Final Precipitation L3 1 month 0.1 degree x 0.1 degree V06, Greenbelt, MD, Goddard Earth Sciences Data and Information Services Center (GES DISC), Accessed: 2022-08-08, https://doi.org/10.5067/GPM/IMERG/3B-MONTH/06
JPL. (2020). JPL CAP SMAP Sea Surface Salinity Products. Ver. 5.0. PO.DAAC, CA, USA. Dataset accessed: 2022-07-10, https://doi.org/10.5067/SMP50-3TMCS
Khorram, S. (1982). Remote sensing of salinity in the San Francisco Bay Delta. Remote Sensing of Environment, 12(1), 15–22. https://doi.org/10.1016/0034-4257(82)90004-9
Khorram, S., & Cheshire, H. M. (1985). Remote sensing of water quality in the Neuse River Estuary, North Carolina (USA). Photogrammetric Engineering & Remote Sensing, 51(3), 329–341. https://www.asprs.org/wp-content/uploads/pers/1985journal/mar/1985_mar_329-341.pdf
Lagerloef, G. S. E., Swift, C. T., & Le Vine, D. M. (1995). Sea surface salinity: the next remote sensing challenge. Oceanography, 8(2), 44–50. https://doi.org/10.5670/oceanog.1995.17
Lerner, R. M., & Hollinger, J. P. (1977). Analysis of 1.4 GHz radiometric measurements from Skylab. Remote Sensing of Environment, 6(4), 251–269. https://doi.org/10.1016/0034-4257(77)90047-5
Lewis, C. D. (1982). Industrial and business forecasting methods: A radical guide to exponential smoothing and curve fitting. London: Butterworth Scientific.
Lo Bue N, Artale V and Schroeder K (2021) Editorial: Impact of Deep Oceanic Processes on Circulation and Climate Variability: Examples From the Mediterranean Sea and the Global Ocean. Front. Mar. Sci. 8:801479. http://dx.doi.org/10.3389/fmars.2021.801479
Mckeon, J. B., & Rogers, R. H. (1976). Water quality map of Saginaw Bay from computer processing of Landsat-2 data. 1–8.
Nguyen, P. T. B., Koedsin, W., McNeil, D., & Van, T. P. D. (2018). Remote sensing techniques to predict salinity intrusion: application for a data-poor area of the coastal Mekong Delta, Vietnam. International Journal of Remote Sensing, 39(20), 6676–6691. https://doi.org/10.1080/01431161.2018.1466071
Qing, S., Zhang, J., Cui, T., & Bao, Y. (2013). Retrieval of sea surface salinity with MERIS and MODIS data in the Bohai Sea. Remote Sensing of Environment, 136, 117–125. https://doi.org/10.1016/j.rse.2013.04.016
Redell, N. (2020). forecastML Overview. https://cran.r-project.org/web/packages/forecastML/vignettes/package_overview.html
Tibshirani, R. (1996). Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B., 58(1), 267-288.
Urquhart, E. A., Hoffman, M. J., Murphy, R. R., & Zaitchik, B. F. (2013). Geospatial interpolation of MODIS-derived salinity and temperature in the Chesapeake Bay. Remote Sensing of Environment, 135, 167–177. https://doi.org/10.1016/j.rse.2013.03.034
Usoro, E. (2010). Encyclopedia of the World’s coastal landforms, Vol. 1, London, p. 949.

The authors declare no competing interests.
