Dynamic Multivariate Analysis for Pollution Profiling and Abatement Recommendation in La Buong Watershed of Vietnam

Analysis of temporal patterns in high-dimensional time-series water quality data is essential for informing better pollution management. In this study, Dynamic Factor Analysis (DFA) and Cluster Analysis (CA) were adopted to analyze time-series water quality data monitored at five stations (SB1, SB2, SB3, SB4 and SB5) on the La Buong river in Southern Vietnam. Application of DFA identified two temporal patterns at SB1 and SB2 and three temporal patterns at SB3, SB4 and SB5. Analysis of the factor loadings of water variables revealed run-off-driven patterns with the contribution of Total Suspended Solids (TSS), turbidity or Fe at all stations. The association of other variables such as BOD₅ and COD with this run-off pattern at SB1, SB2, SB4 and SB5 indicated that they share a common driver. In contrast, the separation of variables such as phosphate (PO₄³⁻) from the run-off pattern at SB3, SB4 and SB5 suggested a local point-source origin. The factors derived from DFA were then used in a time-point CA to explore the temporal distribution of pollution intensities. Comparisons between cluster values and two regulatory benchmarks, A2 for drinking water and B1 for irrigation water, suggested a land-use approach for abating TSS, Fe, BOD₅ and COD at most sites. Control of point sources of BOD₅ and COD is needed at SB3, along with PO₄³⁻, ammonium (NH₄⁺) and Escherichia coli (E. coli) at SB1 and SB4.


Introduction
Long-term monitoring of water quality in a watershed is essential to understand hydro-chemical properties and water resource pollution (Vega et al. 1998; Felipe-Sotelo et al. 2007). Continuous surveying of water quality over time also helps keep track of the dynamics of water quality.
Insights from analyzing water quality time series can unravel the link between anthropogenic and natural drivers and water quality (see Diamantini et al. 2018 for examples) and inform appropriate policies for better management of water resources. However, long-term water quality monitoring programs usually produce multi-dimensional time-series datasets, which require sophisticated methods of analysis (Dixon and Chiswell 1996).
Univariate statistical analysis describes the value distribution of each water quality variable and, at best, the extent to which variables are correlated. However, univariate methods fall short when a comprehensive interpretation of a dataset is needed, because they cannot capture the complexity of the data structure (Le et al. 2017). Insights generated from univariate methods can hardly support integrated water quality management, since they usually target specific types of pollution. In this context, composite indices such as the Water Quality Index (WQI) were formulated as measures that consider multiple water parameters. The index has been shown to fit the objective of communicating the health status of water bodies rather than the structural composition of water datasets (Pesce and Wunderlin 2000; Wunderlin et al. 2001). Other information that can be derived from datasets, such as sources of pollution, is also hidden in the WQI after normalization and weighting of water parameters (Le et al. 2017).
Multivariate statistical analysis, with its two popular techniques of factor analysis (FA) and cluster analysis (CA), is an approach for handling multi-constituent datasets. FA explores latent structures that explain the correlation between variables (Liu et al. 2003; Kowalkowski et al. 2006; Budaev 2010). The primary objective of FA is to capture the maximum variability in the original dataset with a minimum number of factors and thereby minimize information redundancy (Kowalkowski et al. 2006), while CA is a multivariate technique for grouping observations into clusters with high internal homogeneity and high heterogeneity between groups (Shrestha and Kazama 2007; Le et al. 2017). FA combined with CA has multiple applications, such as explaining temporal and spatial characteristics of water quality (Vega et al. 1998; Felipe-Sotelo et al. 2007; Juahir et al. 2011; Vialle et al. 2011; Magyar et al. 2013) and exploring sources driving water quality (Singh et al. 2005; Le et al. 2017). Dynamic factor analysis (DFA) was suggested as a form of FA for time-series data because of its ability to explore temporal co-variability in a dataset (Molenaar 1985). Understanding these dynamic relationships in time-series data has had many applications, especially in ecological and environmental research, such as the noted works by Y. M. Kuo

La Buong river is a downstream branch of the Dong Nai river system, one of the largest and most important national river basins in Vietnam (Khoi et al. 2019). The La Buong watershed has undergone rapid socio-economic development, imposing high pressures on the water environment (Fig. 1). Drivers of such pressures include population increase, urbanization and agricultural intensification, to name a few. Khoi et al. (2019) used a modelling approach with the SWAT model to show that the streamflow and water quality of the La Buong river would be impaired by climate change and by changes in land use and land cover (LULC).
However, the application of the SWAT model in assessing water quality carries many uncertainties, especially in areas characterized by diffuse and heterogeneous sources of pollution (Glavan and Pintar 2012). The modelling approach also aimed at predicting water quality rather than understanding the driving forces of pollution. To address the complexity of water pollution in the area, it is essential to adopt a data-driven multivariate time-series technique. Therefore, in this research, we applied a dynamic multivariate technique for pollution characterization and then suggested policies for local authorities in water quality management. The paper's objective is twofold: i) to understand the temporal patterns of water quality parameters and their drivers, and ii) to assess the pollution intensities of pollution patterns and suggest abatement solutions.

Material And Method
2.1 Study area, sample collection and data preprocessing
La Buong river traverses a distance of 52 km with a catchment of 478.5 km² (Fig. 1). The terrain of the catchment area is relatively flat, with an average elevation of 93 m. The catchment is affected by a tropical monsoon climate, with the rainy season from May to October and the dry season from November to April (Khoi et al. 2019). The mean annual rainfall is 1999 mm, with 87-93% falling in the rainy season and 7-13% in the dry season. The yearly average evaporation is about 1155 mm, and the average air temperature is 26°C. The La Buong catchment is characterized by intensive agricultural activity. Rhodic Ferralsols and Ferric Acrisols are the dominant soils in the catchment, covering up to 75% of the catchment area.
Five monitoring sites were distributed along La Buong river. Figure 1 shows the locations of these sites along with the borders of the sub-basins. Figure 2 presents the land-use characteristics of each sub-basin. Four sites, SB1, SB2, SB4 and SB5, are located near residential areas. The most intensive level of built-up land was observed in SB1 (Fig. 2a) and SB5 (Fig. 2e). Besides built-up land, agriculture is the main land-use type in the SB1, SB2 (Fig. 2b) and SB4 (Fig. 2d) sub-basins. Lastly, SB3 (Fig. 2c) is the site that directly receives discharge from the Giang Dien industrial park.
Water quality data were monitored monthly during 2009-2017 by the Department of Natural Resources and Environment of Dong Nai province.
The monitoring program measured 21 water parameters. However, five parameters (Salinity, Zinc, Total grease, Aldrin and Endosulfan) were excluded from this study due to their low variances (see Figure A1 in the Appendix). Hence, our study analyzed 16 parameters, including Temperature, pH, Conductivity, Turbidity, Total Suspended Solids (TSS), DO, COD, BOD₅

2.3 Dynamic factor analysis (DFA) and cluster analysis (CA)

DFA
DFA was adopted to analyze the dynamic patterns of the dataset. DFA is a dimensionality reduction technique for time-series data (Kuo et al. 2014). The method is useful for identifying latent temporal patterns in multivariate datasets by mining their lagged covariance. The full mathematical formula for DFA is given in Eq. (2). In this study, DFA was conducted using the Python module statsmodels (Seabold and Perktold 2010).
In this formulation, exogenous (or explanatory) variables (x) can be included to improve the model fit (Kuo et al. 2014). The regression matrix B is of size i x N, where i is the number of exogenous variables. The regression matrix helps identify exogenous variables that have significant influences on the response variables (ibid.). However, exogenous variables were not included, because unraveling the impacts of driver variables is outside the scope of this research.
To find the optimal number of factors, Akaike's information criterion (AIC) (Akaike 1974) was used.

In this study, the hierarchical clustering algorithm of the Python scikit-learn module (Pedregosa et al. 2011) was used to cluster the data. The Ward method was selected for defining similarities between observations. The number of clusters was identified by interpreting the dendrogram, and the validity of the clustering result was examined with the silhouette score (Rousseeuw 1987). The independent factors derived from DFA were used as inputs for CA to address the concern about multi-collinearity in a dataset, which can impose implicit weighting and affect the analytical result (Kisekka et al. 2013; Kuo et al. 2013).
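The clustering step can be sketched as follows, assuming the factor scores are available as an array with one row per time point. The data here are synthetic, and the helper `ward_clusters` and the candidate cluster counts are hypothetical. As a simplification, the sketch picks the cluster count with the highest silhouette score, whereas the study read the count from the dendrogram and used the silhouette score only as a validity check.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score


def ward_clusters(factors, candidate_k=(2, 3, 4)):
    """Ward hierarchical clustering for several cluster counts.

    Returns the count with the highest silhouette score, its labels,
    and the score for every candidate. Hypothetical helper.
    """
    scores, labels_by_k = {}, {}
    for k in candidate_k:
        labels = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(factors)
        scores[k] = silhouette_score(factors, labels)
        labels_by_k[k] = labels
    best_k = max(scores, key=scores.get)
    return best_k, labels_by_k[best_k], scores


# Synthetic factor scores: two well-separated groups of 54 time points each,
# standing in for dry-month and wet-month observations.
rng = np.random.default_rng(1)
dry = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(54, 2))
wet = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(54, 2))
factors = np.vstack([dry, wet])

best_k, labels, scores = ward_clusters(factors)
print(best_k, scores)
```

A dendrogram for visual inspection could be drawn from the same array with `scipy.cluster.hierarchy.linkage(factors, method="ward")` and `scipy.cluster.hierarchy.dendrogram`.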
The derived clusters were profiled in terms of their temporal features and pollution intensities. To link pollution intensities with temporal patterns, only parameters that showed associations with the derived dynamic factors were selected. Two Vietnamese regulatory standards, A2 for drinking water and B1 for irrigation water, were adopted as benchmarks for assessing pollution intensities. The 95% confidence intervals (CIs) of water quality parameters were calculated for each cluster using the bootstrap method (Efron 1979) to make statistical inferences about the level of pollution in comparison with the regulatory benchmarks.

Besides the high correlation between BOD₅ and COD, which was intuitive, another notable type of correlation was between DO and nutrient parameters (reactive nitrogen and PO₄³⁻). The highest correlation was between DO and NO₂⁻ in SB1 and SB2 (Fig. 3-a,b), probably explained by the denitrification process forming nitrite under low-DO conditions (Le et al. 2017, 2019).
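The cluster-level bootstrap intervals described in the method above can be sketched as follows. This is a sketch under stated assumptions: a percentile bootstrap of the cluster mean is assumed (the text names only the bootstrap method), the measurements are synthetic, and the limit value of 30 is a placeholder rather than an actual A2 or B1 threshold.

```python
import numpy as np


def bootstrap_ci(values, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean (assumed variant)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement: one row of indices per bootstrap replicate.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    means = values[idx].mean(axis=1)
    lower, upper = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return lower, upper


def compare_to_limit(ci, limit):
    """Classify a cluster against a regulatory limit using its interval."""
    lower, upper = ci
    if lower > limit:
        return "above"
    if upper < limit:
        return "below"
    return "inconclusive"


# Synthetic concentrations for one cluster (illustrative units only).
rng = np.random.default_rng(2)
cluster_values = rng.normal(loc=50.0, scale=10.0, size=40)

ci = bootstrap_ci(cluster_values)
hypothetical_limit = 30.0  # placeholder, not a real A2 or B1 value
status = compare_to_limit(ci, hypothetical_limit)
print(ci, status)
```

Using the whole interval rather than the cluster mean alone makes the comparison conservative: a cluster is flagged as exceeding a limit only when the entire interval lies above it.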

Pairwise correlation of water quality variables
Cross-group associations were important, as they may indicate shared drivers of pollution (Wunderlin et al. 2001). In SB1 and SB2 (Fig. 3-a,b), a significant correlation was observed between Fe and NO₂⁻ (0.66 and 0.85 for SB1 and SB2 respectively). SB2 was also notable for its high correlation between run-off-related variables (such as TSS and Fe) and organic parameters (BOD₅, COD). This indicated a run-off source of NO₂⁻ and of organic pollution in sub-basins SB1 and SB2 respectively. SB3 (Fig. 3-c), by contrast, had very low correlation between the run-off and organic groups (for instance, the MIC value of the Fe-COD pair was only 0.48). The notable relationships in SB3 were between organic (BOD₅ and COD) and nutrient parameters (PO₄³⁻ and NO₃⁻). In SB4 and SB5 (Fig. 3-

Temporal patterns of variables and drivers of pollution
An analysis of the determination coefficients of variables, combined with graphical validation, showed that valid factors could capture the temporal patterns of variables with a coefficient higher than 0.4. This finding was similar to the work of Y. M. Kuo, Chiu, and Yu (2014), which suggested that a correlation higher than 0.25 is moderate. Figure 5 presents the derived valid factors projected onto the space of scaled variables (for determination coefficients, see Table A2 of the Appendix). Table 1 shows the loadings of variables in valid factors for the five pollution sites. At the SB1 site, two common patterns were recognized, including factor 1 of Turbidity, TSS, COD, BOD₅

In contrast, the independence of the derived factors from one another at each site suggested distinct drivers of pollution. In SB1, NH₄⁺ had a pattern that diverged from the run-off components, pointing to sources of NH₄⁺ other than run-off. A very likely source of NH₄⁺ in SB1 was human excreta (Van Drecht et al. 2003), taking into account the location of SB1 in a residential area. Effluent from residential areas was also likely the main source of PO₄³⁻ in SB3, SB4 and SB5, as suggested by Riemersma et al. (2006). Similarly, in SB3, the separation of COD and BOD₅ from the run-off factor revealed that organic pollution might be driven by point-source activities; such influence was very likely considering that water bodies at the SB3 site receive discharge from the nearby industrial zone. The isolation of the COD pattern from the BOD pattern in SB3 also indicated the influence of industrial effluent. The same inference was valid for E. coli at the SB4 site. Unlike total Coliform, which was run-off driven, the trend of E. coli in SB4 was driven by discrete point sources at surrounding livestock farms. Lastly, the patterns of PO₄³⁻ at SB3, SB4 and SB5 were driven by point sources.

3.2.3 Pollution profile and policy implication
Application of hierarchical clustering to the factors showed that there were optimally two clusters at SB1 and SB4 and three clusters at the three remaining sites, SB2, SB3 and SB5. Figure A4 of the Appendix shows dendrograms plotting the distances between clusters at each site. Figures 6 and 7 present the temporal and pollution characteristics of the clusters respectively; for a numeric presentation of the cluster profiles, see Table A3 of the Appendix. Water quality parameters associated with the run-off factors identified in the DFA models had an apparently higher range of mean distributions in clusters occupied mainly by wet-month data points, i.e. May, June, July, August, September and October. For example, at the SB1 site, cluster 1 was composed approximately 71% of wet-season data points (Fig. 6-a)

Profiling the clusters in terms of their seasonal characteristics and varying pollution intensities supported pollution management through comparison with the regulatory limits A2 (for drinking water) and B1 (for irrigation water) (Fig. 7). Run-off-induced pollution that breached the two standards was a year-round phenomenon, especially for Fe, which exceeded both the A2 and B1 criteria. The problem with TSS was less severe in the dry season in SB1 (cluster 2), SB2 (cluster 1) and SB5 (cluster 1), when the pollution level exceeded only the lower regulatory level B1. This also suggested that TSS and Fe pollution of the La Buong river is largely a natural phenomenon that is intensified in the wet season by surface run-off. This phenomenon was also observed in another study in the Dong Nai river basin (Quan and Meon 2015). Therefore, a proper abatement scheme for TSS and Fe pollution should consider controlling pollution sources through land-use practices in the wet season, combined with water treatment before use.
Furthermore, because the TSS pattern at the SB5 site showed the influence of urban run-off and a quarry site in the wet season, measures such as green infrastructure for abating run-off contamination and pollution control at the quarry site should be considered at SB5.
For organic parameters that could be linked to run-off drivers, such as BOD₅ and COD in SB1, SB2, SB4 and SB5, an obvious pattern of exceeding regulatory levels (mostly B1) could be observed in the wet-season clusters. This finding was strongly policy-relevant, as organic pollution abatement at the SB1, SB2 and SB4 sites could be conducted through land-use management policy at the watershed scale aimed at limiting the influence of run-off. For SB5, more attention should be given to urban run-off. A similar suggestion was also suitable for NO₂⁻ and PO₄³⁻ pollution in SB2. In contrast, exceedance of the regulatory limits for BOD₅ and COD at the SB3 site could be reduced at point sources, with a particular focus on the industrial zone.

Figure 1
Map of study area and locations of monitoring sites

Nash-Sutcliffe coefficients for dynamic factor models at the five stations

Figure 6
Temporal profiles of derived clusters (percent of months) at each monitoring site; a blank cell represents zero percent