Deterministic and Stochastic Principles to Convert Discrete Water Quality Data into Continuous Time Series

Limited water quality data is often responsible for incorrect model descriptions and misleading interpretations in terms of water resources planning and management scenarios. This study compares two hybrid strategies to convert discrete concentration data into continuous daily values for one year in distinct river sections. Model A is based on an autoregressive process, accounting for serial correlation, water quality historical characteristics (mean and standard deviation), and random variability. The second approach (model B) is a regression model based on the relationship between flow and concentrations, and an error term. The generated time series, here referred to as synthetic series, are propagated in time and space by a deterministic model (SihQual) that solves the Saint-Venant and advection-dispersion-reaction equations. The results reveal that both approaches are appropriate to reproduce the variability of biochemical oxygen demand and organic nitrogen concentrations, leading to the conclusion that the combination of deterministic/empirical and stochastic components is compatible. A second outcome arises from comparing the results for distinct time scales, supporting the need for further assessment of the statistical characteristics of water quality data, which relies on the development of monitoring strategies. Nonetheless, the proposed methods are suitable to estimate multiple scenarios of interest for water resources planning and management.


Introduction
The use of models for water resources planning and management purposes has been extensively reported (Yao et al. 2019; Yaseen et al. 2019). Despite being a well-recognized practice, it includes multiple sources of uncertainty, such as process description and numerical approximations (Guzman et al. 2015). Among these, input data limitations are often indicated as one of the main challenges (Strokal et al. 2019). According to Martin and McCutcheon (1998), the absence of information can result in failure to represent critical events, leading to inconsistencies in gradients and mass flow simulations.
Although high-frequency data have become more available with advancing technology (Bowes et al. 2016; Miller et al. 2017; Leigh et al. 2019), monitoring plans remain limited by several conditions, such as those related to sample collection, integrity, and laboratory analysis. Furthermore, data acquisition through sensors and remote sensing techniques is often limited by calibration and validation issues, interference, and spectral resolution (Gholizadeh et al. 2016). Since concentrations of water quality parameters are affected by several external conditions (meteorological, hydrological, and geomorphological characteristics, sources of pollution, land use, etc.), this challenge will remain as long as new compounds appear in the environment, highlighting the importance of studies related to data requirements for modeling, management, and research purposes.
In this sense, statistical analysis is useful to understand and describe a phenomenon using past observations. For water quality problems, usual strategies include interpolation functions (Lim et al. 2019; Dadhich et al. 2021), autoregressive models (Zhang and Hirsch 2019; Elkiran et al. 2019), and regression techniques (Huang et al. 2017; Del Giudice et al. 2018). More recently, Tung and Yaseen (2020) presented a thorough review of methods and examples based on principles of artificial intelligence for water quality modeling in rivers, including black-box and white-box models. While reviewing several techniques, these authors call attention to the common problem of missing and unavailable data and recommend integrating physicochemical features with numerical approaches to improve the results.
Another relevant aspect in this context is proper communication, one of the challenges for the acceptance of models and theories in real-life applications. Taherdoost (2018) stated that numerous studies do not provide clear guidelines for operational aspects. Moreover, mechanisms to identify data anomalies and correlations may be useful to distinguish and communicate simulation conclusions, such as information on system accuracy, comparative assessment, visual data enhancement, and hierarchical and sequential arrangement of spatial-temporal patterns (Marrin 2017; Tung and Yaseen 2020). This is especially relevant to assist acceptability by stakeholders and managers, since temporal analysis is usually not accounted for in water resources legislation in most countries, including Brazil. Although the management approach is often interested in long-term averages, one of the benefits of describing a system on high-frequency temporal scales is an estimate of possible variance and uncertainty (Uusitalo et al. 2015).
In this context, the present study addresses the conversion of discrete data into continuous information by comparing two simplified hybrid methods. Both approaches integrate deterministic and stochastic principles, using a historical monitoring dataset of flows, biochemical oxygen demand, and organic nitrogen concentrations, gathered since the 1980s in five sections along 85 km of the Iguaçu River, located in Paraná state, Brazil. The synthetic series of each model are the input for the deterministic module in the SihQual model (Hydrodynamic and Water Quality Simulation). The objective is to test their physical meaning and to obtain more reliable transport and fate estimates for different substances; this integration of stochastic and deterministic approaches was consolidated in a previous study (Ferreira et al. 2019). Comparisons are made in terms of overall data distribution and transgression of legal limits against observed data. The goal of grouping is to identify general patterns and representation gaps.
While literature often recognizes that data representation is a key aspect in modeling studies, the contribution of this study is the exploration of different approaches to convert temporal scales of water quality information. In addition, we show how this process should be improved, especially concerning the temporal evolution of concentration data. Although the methods are simple, synthetic series generation must be carefully assessed. This study also presents recommendations towards the application of these strategies to assess water quality over time and space, indicating future monitoring needs and research efforts.

Modeling Strategy
The experiments are based on the integration of five simulation modules, as presented in Fig. 1: (i) a deterministic hydrodynamic model, (ii) a deterministic water quality model under unsteady state, (iii) a kinetic rates model, (iv) stochastic model A, and (v) stochastic model B. The objective is to convert historical water quantity and quality data, originating from discrete sampling, into continuous information. This procedure highlights the use, differences, and role of stochastic models A and B.
The first module solves the Saint-Venant equations, providing the advection field and the cross-sectional areas in which the pollutants are diluted and transported. This information is applied to solve the one-dimensional water quality equation, which is a mass balance considering the processes of advection, dispersion, and reaction. Upstream and lateral boundary conditions are also required inputs, in addition to dispersion and reaction coefficients. Stochastic model A assists in the determination of transformation rates and also provides a time series as input for the deterministic water quality module. Model B is also tested as an auxiliary to generate the upstream boundary condition. Model A is based on an autoregressive process that accounts for serial correlation, statistical traits (mean and standard deviation), and random noise. The latter part deals with the inherent uncertainty related to data representativeness and external factors, leading to multiple possible combinations. This aspect is handled with an algorithm that selects the one series most similar to the observed data. Model B is a regression approach, in which concentrations are generated from observed flows. It is also a stochastic approach, since a lognormal probability density function is fitted to the error of the regression model. Hence, an infinite number of synthetic water quality time series can be generated from a unique streamflow time series.
Both methods are parsimonious, aiming at simplified estimates for changes (specifically concentration statistical metrics and flow conditions) in the overall water quality variability at each river control section.

Stochastic Model A
The first-order autoregressive process, described by Loucks and Beek (2017), is the basis of model A to convert discrete information into a continuous dataset:
$$C_{j+1} = \mu + \rho \left( C_j - \mu \right) + z_j \, \sigma \sqrt{1 - \rho^2} \qquad (1)$$
where $C_j$ represents the concentration at interval $j$, $\mu$ is the mean and $\sigma$ is the standard deviation of $C$, $\rho$ is the autocorrelation parameter (deterministic component), and $z_j$ represents a Gaussian random noise (stochastic component).
The stochastic component is a set of a thousand series of normally distributed random values, with zero mean and unit variance. Because the historical dataset is not continuous, the coefficient $\rho$ is adopted. It represents the sample autocorrelation parameter, which indicates the dependency between lagged concentrations. Further discussions related to this critical trait for time series analysis can be found in Sturludottir et al. (2017).
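The generation step of model A can be illustrated with a minimal sketch. It assumes the standard first-order recursion C[j+1] = mean + rho (C[j] - mean) + z[j] std sqrt(1 - rho^2); the function name `ar1_series`, the seed, and the numerical values are hypothetical and are not taken from the study. The sketch does not treat negative draws, which the Gaussian noise can occasionally produce.

```python
import math
import random


def ar1_series(mean, std, rho, n_steps, rng):
    """Generate one synthetic series with a first-order autoregressive process.

    Assumed recursion:
    C[j+1] = mean + rho * (C[j] - mean) + z[j] * std * sqrt(1 - rho**2),
    where z[j] is standard Gaussian noise.
    """
    scale = std * math.sqrt(1.0 - rho ** 2)
    # Start from the stationary distribution (a choice made for this sketch)
    series = [mean + std * rng.gauss(0.0, 1.0)]
    for _ in range(n_steps - 1):
        noise = rng.gauss(0.0, 1.0)
        series.append(mean + rho * (series[-1] - mean) + noise * scale)
    return series


rng = random.Random(42)  # fixed seed so the sketch is reproducible
# Illustrative values only: mean 10 mg/L, standard deviation 4 mg/L, rho = 0.5
ensemble = [ar1_series(10.0, 4.0, 0.5, 365, rng) for _ in range(1000)]
```

The factor sqrt(1 - rho^2) keeps the variance of the generated series equal to the prescribed one regardless of the persistence, which is why the mean and standard deviation can be set independently of rho.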
The model result for each concentration was limited to a maximum value equal to the upper limit for extreme values in boxplots (i.e., the 75th percentile, or 3rd quartile, plus 3 times the interquartile range of the observed data):
$$\mathrm{lim} = Q_3 + 3 \left( Q_3 - Q_1 \right) \qquad (2)$$
In Eq. 2, lim is the defined limit, while $Q_1$ and $Q_3$ represent the first and third quartiles, respectively. This is one of the standard criteria to define outliers, although other methods could be used (Helsel et al. 2019). An upper limit is needed to control the generation of values that are unlikely to occur, while critical events of maximum concentrations are still accounted for.
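As a worked illustration of this limit, the sketch below computes the third quartile plus three times the interquartile range and applies it to a few generated values. The quartile definition (the "inclusive" method of the Python standard library) and the choice of clipping, rather than redrawing, are assumptions of the sketch; the function names and data are hypothetical.

```python
import statistics


def upper_limit(observed):
    """Return Q3 + 3 * (Q3 - Q1) for the observed concentrations."""
    q1, _, q3 = statistics.quantiles(observed, n=4, method="inclusive")
    return q3 + 3.0 * (q3 - q1)


def cap_series(series, limit):
    """Clip generated values at the upper limit (one possible way to apply it)."""
    return [min(value, limit) for value in series]


observed = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]  # illustrative data
limit = upper_limit(observed)                 # Q1 = 4, Q3 = 8, so limit = 20
capped = cap_series([5.0, 12.0, 40.0], limit)  # -> [5.0, 12.0, 20.0]
```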
Because multiple daily time series are generated, a single series most similar to the data measured in the simulated period is selected. The criterion is the minimum difference between observed and fitted series quantiles: quartiles 1, 2, and 3 and concentration of 10 and 90% of occurrence.
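A possible form of this selection step is sketched below. The distance measure (sum of absolute differences over the five quantiles) is an assumption, since the text only states that the difference is minimized; the function names and data are hypothetical.

```python
import statistics


def summary_quantiles(values):
    """Return the 10%, 25%, 50%, 75% and 90% quantiles of a sample."""
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    quartiles = statistics.quantiles(values, n=4, method="inclusive")
    return [deciles[0], quartiles[0], quartiles[1], quartiles[2], deciles[8]]


def select_closest(candidates, observed):
    """Pick the candidate whose quantiles are closest to the observed ones.

    The distance is the sum of absolute quantile differences (assumed metric).
    """
    target = summary_quantiles(observed)

    def distance(series):
        return sum(abs(a - b) for a, b in zip(summary_quantiles(series), target))

    return min(candidates, key=distance)


observed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0]
candidates = [
    [v + 5.0 for v in observed],   # shifted far from the observations
    [v + 0.1 for v in observed],   # nearly identical distribution
    [v * 3.0 for v in observed],   # much wider spread
]
best = select_closest(candidates, observed)  # selects the second candidate
```

Because only quantiles are compared, the selected series matches the observed distribution but not necessarily the timing of individual samples.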
Even though this method does not represent variability patterns (e.g., trends or cycles), the model has been successfully applied to reproduce diverse study cases, including daily data (Ferreira et al. 2019).

Stochastic Model B
A linear regression model is established between flows and concentrations to generate concentrations from observed flows. It is also a stochastic approach since a lognormal probability density function (pdf) is fitted to the error of the regression model; hence, an infinite number of synthetic water quality time series can be generated from a unique streamflow time series. In this study, 1000 series are generated. Since the model is applied to an observed daily flow time series, variability patterns (if present) will influence the results.
The generation of synthetic daily concentrations from daily flows was performed in three main steps: (i) filling gaps of missing data in the daily flow time series; (ii) building a regression model for flows and concentrations; and (iii) generating synthetic daily concentrations.
The first step is based on a relationship among different stations, which is detailed in the supplementary material. In the following step, a quadratic polynomial regression model is fitted to the flows and concentrations of each station:
$$C = \beta_0 + \beta_1 Q + \beta_2 Q^2 + e \qquad (3)$$
where $C$ is the concentration, $Q$ is the flow, $\beta_0$, $\beta_1$, and $\beta_2$ are the regression parameters, and $e$ is the error. A 3-parameter lognormal pdf (LN3) was fitted to the errors of the model ($e$). The choice was guided by the graphic similarity between the histograms of the original error series and the one generated by the LN3 model. The error is defined as the difference between the original water quality time series and that generated by the fitted regression model, as discussed in Coelho (2019).
In the last step, time series of synthetic daily concentrations were generated by applying the regression model to the daily flow time series of each station. Negative values were avoided by performing a new draw whenever the model produced a negative value. The model also limits the generated values according to Eq. 2.
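Steps (ii) and (iii) can be sketched as follows, assuming a regression of the form C = b0 + b1 Q + b2 Q^2 + e fitted by ordinary least squares and an error drawn from a shifted (3-parameter) lognormal distribution. The LN3 parameters are passed in as given values rather than fitted, and all function names, parameter values, and data are hypothetical. Clipping at the upper limit and the fallback value after repeated negative draws are choices of this sketch.

```python
import math
import random


def fit_quadratic(flows, concs):
    """Ordinary least squares fit of C = b0 + b1*Q + b2*Q**2 (normal equations)."""
    rows = [[1.0, q, q * q] for q in flows]
    # Augmented matrix [X'X | X'y]
    aug = [[sum(r[i] * r[j] for r in rows) for j in range(3)]
           + [sum(r[i] * c for r, c in zip(rows, concs))] for i in range(3)]
    # Gaussian elimination with partial pivoting
    for col in range(3):
        pivot = max(range(col, 3), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(col + 1, 3):
            factor = aug[k][col] / aug[col][col]
            aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    # Back substitution
    coef = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        known = sum(aug[i][j] * coef[j] for j in range(i + 1, 3))
        coef[i] = (aug[i][3] - known) / aug[i][i]
    return coef


def synthetic_concentrations(flows, coef, ln3, limit, rng, max_draws=1000):
    """Apply the regression to daily flows and add a shifted-lognormal error."""
    shift, mu, sigma = ln3  # assumed LN3 parameters: location, log-mean, log-std
    series = []
    for q in flows:
        trend = coef[0] + coef[1] * q + coef[2] * q * q
        for _ in range(max_draws):
            value = trend + shift + math.exp(rng.gauss(mu, sigma))
            if value >= 0.0:  # redraw when the result is negative
                break
        else:
            value = 0.0  # fallback of this sketch if every draw was negative
        series.append(min(value, limit))  # upper limit of Eq. 2, applied by clipping
    return series


flows_obs = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]   # illustrative flows
concs_obs = [8.1, 6.9, 6.2, 5.4, 5.1, 4.8]         # illustrative concentrations
coef = fit_quadratic(flows_obs, concs_obs)
rng = random.Random(1)
daily_flows = [15.0, 22.0, 48.0, 35.0]
series = synthetic_concentrations(daily_flows, coef, (-1.0, -0.2, 0.5), 20.0, rng)
```

Repeating the last call with different random draws yields as many synthetic series as needed from the same flow record.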

SihQual -Hydrodynamic and Water Quality Simulation
The SihQual model solves the hydrodynamic problem coupled with water quality. It is a tool to study the fate and transport of pollutants in rivers under steady and unsteady state conditions using deterministic and statistical strategies (Ferreira et al. 2016, 2019). The deterministic module refers to the Saint-Venant equations and the one-dimensional advection-dispersion-reaction equation, while the second approach explores simplified methods of time series generation (models A and B in this study). This latter module was first developed to meet the temporal scale required for a numerical solution, since water quality data is often unavailable in high-frequency samples. Scale compatibility for the numerical solution is achieved through piecewise cubic interpolation.
Explicit finite differences are applied to solve the partial differential equations. This is an advantage when implementing different modeling configurations, such as the ones introduced in this study.
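To illustrate what an explicit finite-difference update looks like for the water quality equation, the sketch below advances a one-dimensional advection-dispersion-reaction problem by one time step. It is not the SihQual scheme: constant velocity and cross-section, upwind differences for advection, central differences for dispersion, first-order decay, and a zero-gradient outflow boundary are all simplifying assumptions, and the parameter values are hypothetical (only the 50 s time step comes from the text).

```python
def adr_step(conc, upstream, velocity, dispersion, decay, dx, dt):
    """Advance concentrations one time step with an explicit scheme.

    Upwind differences for advection, central differences for dispersion,
    first-order decay for the reaction term.
    """
    n = len(conc)
    new = [0.0] * n
    new[0] = upstream  # upstream boundary condition
    for i in range(1, n):
        downstream = conc[i + 1] if i < n - 1 else conc[i]  # zero-gradient outflow
        advection = -velocity * (conc[i] - conc[i - 1]) / dx
        diffusion = dispersion * (downstream - 2.0 * conc[i] + conc[i - 1]) / dx ** 2
        new[i] = conc[i] + dt * (advection + diffusion - decay * conc[i])
    return new


dx, dt = 500.0, 50.0               # grid spacing (m) and time step (s)
velocity, dispersion = 0.5, 10.0   # m/s and m2/s, illustrative values
decay = 0.2 / 86400.0              # first-order decay rate in 1/s
profile = [5.0] * 21               # initial concentrations along 10 km
for _ in range(1728):              # one day of 50 s steps
    profile = adr_step(profile, 10.0, velocity, dispersion, decay, dx, dt)
```

With these values the scheme is stable because velocity*dt/dx + 2*dispersion*dt/dx**2 + decay*dt stays below 1; each new value depends only on values from the previous time step, which is what makes the update explicit.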
The dispersion coefficient in the water quality component is assumed constant, while point and diffuse contributions from the basin are calculated using population data and export coefficients; details are in Ferreira et al. (2016). The calibration technique, named TRATS (Transformation Rates Time Series), generates daily decay rates as a result of (i) attributes of the river station, (ii) reference values from the literature, and (iii) random variability. Details are presented in Ferreira et al. (2020).
The water quality module is tested simulating organic matter and nutrient concentrations, given by the parameter biochemical oxygen demand (BOD), and organic nitrogen (N-org), respectively. The following transformation processes are represented in the mass balance: deoxygenation, BOD removal through sedimentation, organic nitrogen sedimentation, and conversion of organic nitrogen to ammonia.

Study Case and Dataset
The study is conducted for 83.5 km of the Iguaçu River, located in Paraná state, south of Brazil. The flow and water quality data come from two databases in five control sections, IG2 to IG6 -stations 65009000, 65017006, 65019980, 65025000, and 65028000, respectively (IAT 2018; Fernandes 2019). A summary of the dataset is presented in the supplementary material, while the study area is shown in Fig. 2.
Since the deterministic simulations are performed for the year 2010, an important assumption is that the observed historical dataset represents the true range of concentrations in this period of interest. Furthermore, one of the main premises in model A is that such information is representative of the statistical characteristics of the population (mean and standard deviation). Model B, on the other hand, assumes that the relationship between flows and concentrations is well established by the observations in each control section of the river and that no significant trends, cycles, or shifts are present.
Other data required for deterministic simulations include cross-section geometry, rating curves, and daily flow records for the period of interest, which are also available at IAT (2018).

Results and Discussion
Two cases were selected for comparison in this study: (i) test TD (Daily Test): daily data generated with $\rho$ = 0.5 and seasonal $\mu$ and $\sigma$ of the historical dataset; (ii) test TH (Hourly Test): hourly data with the same $\mu$ and $\sigma$ of the monitoring time series and $\rho$ = 0.9. As hourly data have higher persistence than daily values, $\rho$ for TH is higher than for TD. The parameters specified were chosen after testing discussed in Ferreira et al. (2019). The parameters $\beta_0$, $\beta_1$, and $\beta_2$ of model B were estimated by the Ordinary Least Squares method, as presented in the supplementary material.
The two stochastic models were applied to the first river section (IG2) and then propagated downstream in the Iguaçu River by the SihQual model. For comparison, models A and B were also applied to the other control points (IG3 to IG6). Therefore, nine datasets were compared in the river sections IG3 to IG6: monitoring data; results from the deterministic model propagating four stochastic series, two of model A and two of model B; and two synthetic series each from models A and B. Unlike model A, in which one series among the multiple candidates is selected, two random solutions generated by model B were chosen. This setting is summarized in Table 1.
Hydrodynamic simulations and other intermediary results are not the focus of this study; details can be found in Ferreira et al. (2016, 2019). Additionally, the criterion to limit the generation of extreme values by models A and B did not substantially alter the time series, since the rejected data represent 0 to 10% of the total values in each synthetic time series.

Deterministic Water Quality Modeling Integrated to Stochastic Models A and B
The results for each case are compared as boxplots and duration curves in Figs. 3 and 4. Corresponding coefficients of variation are presented in the supplementary material. The series of models A and B are daily values, while the results from the SihQual model are every 50 s (which is the time step for numerical solution). The goal of this comparison is to verify the concentration range obtained by each approach for the simulated year.
In general, models A and B proved to be adequate to represent the upstream boundary condition for the deterministic model. The first one generated similar variability (series 6 and 8) for biochemical oxygen demand; for organic nitrogen, option TH (hourly data generation) produced boxplots with a wider interquartile range than expected, although still close to observed data. Deterministic solutions are characterized by a high persistence in the series, which explains the lower variability (series 3, 5, 7, and 9) when compared to the series from the stochastic models (series 2, 4, 6, and 8). The high persistence in deterministic data is a consequence of the explicit numerical solution, in which a value is calculated from previous time steps.
Fig. 3 Boxplots and duration curves of compared BOD series
Fig. 4 Boxplots and duration curves of compared N-org series
Most of the differences among duration curves are for higher concentrations, with less than 20% occurrence. This was observed mainly for the BOD simulation in station IG4, as verified in Fig. 3. The other discrepancies tend to be small when compared to the typical uncertainties associated with BOD and organic nitrogen data (Coelho 2019). The differences at higher percentiles occur because this is the region of extreme events. Just as hydrological models have difficulties in representing flow peaks (Onyutha 2019), water quality concentrations that occur at low frequencies have larger variances and, therefore, greater uncertainty.
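For readers less familiar with duration curves, the sketch below shows one way to build them from a concentration series. The plotting position rank / (n + 1) is an assumed choice, since the study does not state which one was used; the function name and data are hypothetical.

```python
def duration_curve(values):
    """Return (exceedance_percent, concentration) pairs, highest value first.

    Exceedance is estimated with the plotting position rank / (n + 1),
    an assumed choice.
    """
    ordered = sorted(values, reverse=True)
    n = len(ordered)
    return [(100.0 * rank / (n + 1), value)
            for rank, value in enumerate(ordered, start=1)]


curve = duration_curve([4.0, 9.0, 6.0, 12.0])
# curve == [(20.0, 12.0), (40.0, 9.0), (60.0, 6.0), (80.0, 4.0)]
```

The region with less than 20% occurrence corresponds to the first entries of the curve, where few values define the shape and differences between series are most visible.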

Meaning for Water Resources Management
Although the variability of the year simulated was well reproduced by all strategies, an alternative analysis showed that data frequency is a key aspect when dealing with tools to convert discrete into continuous information. Figure 5 compares the number of days in which the limit of BOD concentrations was exceeded during the simulated year. Since the results from the deterministic model are available at every fifty seconds (i.e., 1728 values per day), three approaches were considered to allow daily comparisons: the first, second, and third quartiles of each day in the simulated year.
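The daily aggregation described above can be sketched as follows: the 50 s output is split into blocks of 1728 values and the three quartiles of each block are kept. The quartile definition ("inclusive" method of the Python standard library), the function name, and the data are assumptions of the sketch.

```python
import statistics

STEP_SECONDS = 50
VALUES_PER_DAY = 86400 // STEP_SECONDS  # 1728 values per day


def daily_quartiles(series):
    """Split a 50 s series into days and return (Q1, Q2, Q3) for each full day."""
    result = []
    for start in range(0, len(series) - VALUES_PER_DAY + 1, VALUES_PER_DAY):
        day = series[start:start + VALUES_PER_DAY]
        q1, q2, q3 = statistics.quantiles(day, n=4, method="inclusive")
        result.append((q1, q2, q3))
    return result


# Two illustrative days with constant concentrations of 4 and 8 mg/L
two_days = [4.0] * VALUES_PER_DAY + [8.0] * VALUES_PER_DAY
quartiles = daily_quartiles(two_days)  # [(4.0, 4.0, 4.0), (8.0, 8.0, 8.0)]
```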
The limits for comparison are based on the legal regulation CONAMA No. 357/2005 in Brazil (CONAMA 2005), which establishes water quality classes 1, 2, 3, and 4 for BOD concentration with a maximum threshold of 3, 5, 10, and larger than 10 mg-O2/L, respectively. For organic nitrogen, the limits were defined as 3.7 mg-N/L for classes 1 and 2, and 13.3 mg-N/L for classes 3 and 4.
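Counting the days in each class can be sketched as below for BOD, using the thresholds of 3, 5, and 10 mg-O2/L stated above. Treating a value equal to a threshold as belonging to the lower class is an assumption of the sketch, and the function names and data are hypothetical.

```python
BOD_LIMITS = (3.0, 5.0, 10.0)  # upper bounds of classes 1, 2 and 3 in mg-O2/L


def water_class(concentration, limits=BOD_LIMITS):
    """Return the class (1 to 4) for a daily concentration."""
    for index, limit in enumerate(limits, start=1):
        if concentration <= limit:
            return index
    return len(limits) + 1  # above every limit: last class


def days_per_class(daily_values, limits=BOD_LIMITS):
    """Count how many daily values fall in each class."""
    counts = {k: 0 for k in range(1, len(limits) + 2)}
    for value in daily_values:
        counts[water_class(value, limits)] += 1
    return counts


counts = days_per_class([2.5, 4.0, 5.0, 7.5, 12.0, 30.0])
# counts == {1: 1, 2: 2, 3: 1, 4: 2}
```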
The results in Fig. 5 reveal that the approaches differ in the indication of water quality impairment during the year. For most of the comparisons, deterministic simulations tend to indicate the need for more restrictive actions in the watershed than stochastic approaches, since their results show more days in which the river is classified as class 3 or 4. Most of these differences between stochastic and deterministic solutions for BOD are observed for sections IG3 and IG4 in Fig. 5, which receive higher pollution loads than sections IG5 and IG6. This observation suggests that representing sub-daily effects will be important in future work for sections where temporal dynamics are relevant.
This behavior can be explained with the assistance of violin plots built to illustrate the distribution of each complete series (Fig. 6). This type of visualization is similar to boxplots, but it combines the box shape with a density trace (kernel density plot, understood as a smoothed histogram), revealing structure found within the data (Hintze and Nelson 1998).
Although the central tendency of both types of datasets (deterministic and stochastic) is similar in Fig. 6, differences in data distribution are expected, since daily time series (stochastic models) are being compared to sub-daily values (50 s from the deterministic model). Nonetheless, the comparison of violin shapes shows that data distribution is compatible with the patterns of exceeding daily concentration limits observed in Fig. 5. For BOD estimations in section IG3, for example, violin plots of series 3, 5, 7, and 9 show that concentrations from the SihQual model are mostly above the limits of class 3 and 4, as previously indicated in Fig. 5.
Fig. 5 Number of days in the simulated year (N) for which the river stations are classified as Class 1, 2, 3, or 4 according to BOD and N-org limits; rows differ by the measure used to take daily values from deterministic simulations: quartiles 1, 2, and 3, respectively

Conclusions
This study aimed for more reliable representations of pollutants transport and fate over time in rivers. Because field samplings do not usually match the required model input, two approaches to generate continuous data from discrete sampling were compared -models A and B. The verification of generated data was achieved through the integration with a physical-based model and analysis of their meaning in real-world applications.
The conversion of temporal scales of water quality data is a challenging research topic, since it requires considering several processes of low occurrence frequency. Such processes, for example daily and weekly patterns or rain-driven events, are usually missed by discrete field samplings. This study presented a comparison of two models to convert sparse historical monitoring data into daily time series for one year, combining the observed data information with deterministic and stochastic representations. Model A generates concentrations from a combination of past values, while Model B generates concentrations from the linear relation with discharges. In addition, both methods have a stochastic component in the form of a Gaussian random noise (model A) and a lognormal probability density function (model B).
Fig. 6 Violin plots for BOD series (left column) and N-org series (right column); continuous lines indicate legal concentration limits
Results indicate that there is no significant difference between generating time series from a concentration dataset and from a concentration-flow dataset. Therefore, the combinations of deterministic and stochastic parts in models A and B are equivalent and suitable to reproduce the overall variability of historical monitoring datasets in different river sections. Synthetic series showed a consistent behavior in reproducing mass transport since, as input to the deterministic model, they produced reasonable simulations.
Nonetheless, data representativeness is a key aspect to guarantee reliable results, since all the statistical information should be from a representative period of the time series. In model A, preservation of statistical metrics and a consistent description of natural persistence in the required timescale are essential -however, it is difficult to determine this latter characteristic due to irregular and low-frequency data; model B relies mainly on a well-established relationship between discharges and mass distributions -which can be highly variable and challenging because of the scarcity of concentration measurements during high flow events.
Despite the uncertainties due to the representativeness of concentration and flow samples, model adjustments, and the hypotheses used to fill in missing data, both hybrid approaches are suitable to reproduce historical water quality variability. This is a valuable analysis in water resources planning and management, since the methods are parsimonious and quickly yield forecasts for multiple scenarios: changes in the metrics that characterize the system in model A (average, standard deviation, quartiles, and concentrations of 10 and 90% of frequency) or in flow conditions in model B.
The analysis of the overall patterns indicated that, to properly convert the information into different time scales and support decisions in water resources management, future efforts should account for data distribution. It is also necessary to understand and represent cyclical effects (such as daily, weekly, and seasonal patterns) and statistical characteristics of water quality series, such as trends. This is imperative to assess the validity periods of synthetic water quality series and to allow long-term simulations, in which variability patterns are more relevant.
In this context, all the approaches presented here have the potential to represent these aspects in future efforts. Furthermore, statistical validation and uncertainty analysis are also recommended to increase model reliability.

Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1007/s11269-021-02908-1.

Funding
No funding was received to assist with the preparation of this manuscript.

Author Contributions

Data Availability
All data and material used in this study are available on request and at http://www.iat.pr.gov.br/Pagina/Sistema-de-Informacoes-Hidrologicas.