Application of synthetic data to establish the working framework for multivariate statistical analysis of river pollution traceability — the heavy metals in Nankan River, Taiwan

This study applied multivariate statistical analysis (MSA) to synthetic data simulated by a river water quality model to verify whether the MSA can correctly infer the pollution scenario assigned in the river water quality model. The results showed that when assessing the number and possible locations of pollution sources based on the results of cluster analysis (CA), two instead of three pollution point source were identified when considering the hydraulic variations of surface water. When discussing the principal component analysis (PCA) result, the second principal component (PC2) and the Pearson correlation coefficients among the pollutants should also be considered, which can infer that Cu, Pb, Cr, and Ni are contributed by the same pollutant point source, and Cu is also influenced by another pollutant point source. This result also implies that the solid and liquid partition coefficients (Kd) of pollutants can affect the interpretation of the PCA results, so the Kd values should be determined before tracing the pollution sources to facilitate the evaluation of the source characteristics and potential targets. This study established a working framework for surface water pollution traceability to enhance the effectiveness of pollution traceability.


Introduction
Over the past few years, heavy metal pollution incidents have occurred frequently around the world. For example, in 2004, the concentrations of heavy metals in the Niger River in Nigeria were 50 mg/L for cadmium (Cd), 30 mg/L for lead (Pb), 2080 mg/L for chromium (Cr), and 780 mg/L for nickel (Ni) (Olatunji and Osibanjo 2013); the detection rates of arsenic (As), mercury (Hg), Cd, Cr, and Pb in the Yangtze River Basin were 62.7%, 24.9%, 18.3%, 17.9%, and 22.4% (Fu and Zhao 2016), respectively. The concentrations of heavy metals in the Korotoa River in Bangladesh were 11 mg/L for Cd, 35 mg/L for Pb, 83 mg/L for Cr, and 46 mg/L for As in 2013 (Islam et al. 2015). Heavy metal contamination in surface water is a global environmental problem (Edet and Offiong 2002;Jain 2005;Tiwari and Singh 2014;Tiwari et al. 2015;Ali and Khan 2018). With the passage of time, the pollutants in surface water have changed from single heavy metal pollution to mixed heavy metal pollution (Zhou et al. 2020).
The quality of surface water is a major factor affecting human health and ecosystems, especially in urban areas where rivers and their tributaries are affected by large amounts of pollutants from industrial and domestic wastewater and agricultural wastewater (Qadir et al. 2008). In Lake Manzala in Egypt, heavy metal pollution in the water comes mainly from agricultural drainage, sewage, and industrial waste (Bahnasawy et al. 2009). In China, heavy mining activities, sewage, runoff from agricultural land and continued industrial growth in recent decades have seriously impacted the water quality of the Huaihe River and exacerbated its heavy metal pollution (Wang et al. 2017 In another study of the rivers near Dhaka, Bangladesh, the concentrations of Pb, Cd, Ni, copper (Cu), and Cr were all higher than average due to industrial wastewater, the widespread use of pesticides, waterborne transport, untreated domestic sewage discharges, and industrial wastewater from tanneries (Ahmad et al. 2010). Therefore, to help administrators set priorities and make sound decisions to determine the best course of action to improve water quality, it is necessary to reduce uncertainties by explaining the temporal and spatial variability of water quality and identifying potential sources of pollution (Wang et al. 2013;Venkatramanan et al. 2014). Multivariate statistical methods, such as cluster analysis (CA), principal component analysis (PCA), and Pearson correlation analysis, are effective tools for environmental studies. By identifying hidden relationships among variables, these methods can reduce large and complex chemical data to a small number of factors without significant information loss and provide a better understanding of water quality impacts and the possible sources of contamination (Ali et al. 2016;Banerjee et al. 2016;Bilgin and Konanç 2016;Khound and Bhattacharyya 2017;Kuang et al. 2016;Singh et al. 2004;Zhang et al. 2015).
The first pollution source apportionment with MSA was reported by Vega et al. (1998). In this study, polluting sources causing spatial (anthropocentric pollution) and temporal (seasonal and climatic) variation of Spanish Pisuerga river water quality and hydro-chemistry have been assessed and differentiated by CA and PCA. Later, CA and PCA have been widely used in assessing water quality variation and source apportionment, for instance, clustering the sampling sites and relating to physicochemical characteristics and pollution levels of the water systems in Northern Greece (Simeonov et al. 2003), attributing the contamination to domestic and industrial wastewater, agricultural activities, and natural processes in Moroccan Oum Er Rbia River (Barakat et al. 2016) and deciphering three dominant pollution sources of Chinese Huaihe River, including the influences of tributaries, power plants and vehicle exhaust, and agricultural activities (Wang et al. 2017).
However, due to the complexity of the actual environmental situation, similarities between pollution sources, and a lack of standards, it is not possible to fundamentally investigate how different pollution sources affect the results of multivariate statistical analysis (MSA). If designing a known pollution scenario (the characteristics of two known pollution sources and the contribution of each pollutant), it is possible to clearly determine how different pollution sources or pollutant contributions affect the results of MSA, and in the future, such information can be used to discuss these results.
The purpose of this study was to apply MSA to synthetic data simulated by a river water quality model to verify whether the MSA can correctly infer the pollution scenario assigned in the river water quality model. Therefore, this study inferred (1) the locations, and (2) the characteristics of pollutant sources.

Research framework
To understand more accurately how different pollution sources affect the results of MSA for surface water, this study used a scenario set with synthetic data as a standard to verify the results and then evaluated the accuracy and limitations of the MSA method. The overall roadmap of this study is shown in Fig. 1 Fig. 1 The working framework for surface water pollution traceability following four steps: (1) Collect information on hydrological and hydrogeological conditions, water quality monitoring data, and data on potential pollution sources in the study area; (2) Compare the river pollution indicators by the Taiwan Environmental Protection Administration (EPA) to understand the severity of water pollution; (3) Establish a river water quality model for the study area based on the collected data, and consider two sources of heavy metal pollution (containing the same and different pollutants) in the model to simulate the concentrations of heavy metal pollutants in each grid, which is called synthetic data in this study; (4) Apply the MSA method to evaluate the synthetic data, and use the parameters of the transmission model to validate the explanatory power of MSA to evaluate heavy metal pollution in rivers.

Description of study area
The Nankan River is located in northern Taiwan, has a basin area of 214.67 km 2 , originates from the Linkou Plateau in Taoyuan City, and flows northward for approximately 30.73 km before entering the Taiwan Strait (Fig. 2). There are 9 water quality monitoring stations in the Nankan River Basin: S1 to S6 are water quality monitoring stations along the mainstream of the river, and S7 to S9 are water quality monitoring stations along the tributaries of the river. According to a report from the Taiwan EPA, there are more than 700 fixed pollution sources in the Nankan River Basin. The printed circuit board (PCB) industry dominates in the upper reaches, and many AH facilities are allocated along the tributaries. The composition of the wastewater from PCB manufacturing is very complicated, with a

Legend
: County : River : Nankan River basin : water quality staƟons relatively high concentration of heavy metal ions, including Cu, Pb, Ni, Cr, and others (LaDou 2006;Gwak et al. 2018). The AH pollution is due to a reliance on chemicals in the breeding process and may come from direct discharge, runoff, or pollutant infiltration into surface water. The heavy metal pollution in AH wastewater mainly includes Cu and As.

River Pollution Index
Taiwan Environmental Protection Administration (TEPA) has developed a River Pollution Index (RPI) classification system for river water quality evaluation based on the purpose of water usage and degree of protection for each stream section (TEPA 2002). The RPI involves four parameters: dissolved oxygen (DO), biochemical oxygen demand (BOD), suspended solids (SS), and ammonia nitrogen (NH 3 -N), each of which is ultimately converted to a four-state quality sub-index (1, 3, 6, and 10). The overall index is then divided into four pollution levels. Table 1 presents the equation for RPI calculation and criteria for the four RPI classes (good, slightly polluted, moderate polluted, and gross polluted).

Water quality modeling and synthetic data
In this study, the Water Quality Analysis Simulation Program (WASP7) Ambrose et al. 2006) was used to simulate the water quality, and the simulated concentrations of heavy metals in river water were referred to as synthetic data. WASP is a surface water simulation program that can calculate the dynamic mass balance. Moreover, it has advanced and extended transmission functions between discrete grids, and provides a series of modules to calculate common water quality and poison problems.
The grid design and hydraulic conditions are given according to the stream direction, the pollution discharge situation, and the actual monitoring points. The simulated rivers are designated as an algorithmic grid, and the tributary drainage and pollution discharge points are configured according to their relative positions. After completing the grid design, the hydrological parameters of each grid must be determined. Due to the limited amount of data, the Hydraulic Engineering Center's River Analysis System (HEC-RAS) is used to calculate the hydraulic parameters of each grid in the WASP model, including flow velocity, water depth, through-hole area, water volume, etc. The study area is a natural channel, which is generally wide at the top and narrow at the bottom, and the contact surface is the sediment. Therefore, the Manning's formula was used to convert among the water level, flow rate and velocity, and the resulting flow data were imported into the water quality model for water quality simulation.

Theory/calculation
In this study, SAS 9.0 was used for MSA, which included CA, PCA, and Pearson correlation analysis.

Cluster analysis
CA is the process of clustering similar samples together to form clusters and classifying these clusters based on "distance". The smaller the relative distance, the higher the degree of similarity, and the more likely a group of samples will be classified into the same cluster. The purpose of CA is to find a small number of clusters so that the ratio of the relative intracluster variation to the intercluster variation is minimized.
Agglomerative hierarchical clustering is a common method for determining the similarity relationship between any one variable and an entire dataset, which is usually illustrated as a dendrogram (McKenna 2003). In this study, the synthetic data were first standardized in terms of the Z-score, and differences in the variable size and measurement units were adjusted to reduce the effect of the differences on the variance (Simeonov et al. 2003;Liun et al. 2003). Next, the standardized synthetic data were clustered using completelinkage clustering, and statistically significant clustering dendrograms were drawn.

Principal component analysis
PCA can simplify multiple related variables into a few unrelated principal components. The small number of principal components obtained by linear combination can retain most of the information of the original variables. PCA extracts the eigenvalues and eigenvectors from the covariance matrix of the original variables and uses orthogonal transformations to convert the observations of a set of potentially correlated variables (each entity has a different value) into a set of linearly uncorrelated variables, i.e., principal components. This transformation is defined in such a way that the first principal component (PC1) has the maximum possible variance (meaning that the variability of the data is considered as much as possible), while each subsequent component has the maximum variance subject to constraints and is orthogonal to the previous components.

Pearson's coefficient of correlation (r)
Pearson correlation analysis is used to compare and confirm the PCA results. The correlation coefficient is a numerical measure of the correlation between two correlated variables or pairs of variables, denoted by r. It measures the strength of the linear relationship between these variables. The values +1, −1, and 0 indicate a positive correlation, a negative correlation, and no correlation between pairs of variables, respectively.

Water pollution level
Taoyuan City is one of the major industrial towns in Taiwan. From 2007 to 2020, 597 datasets were obtained for the Nankan River. In terms of the River Pollution Index (RPI), which is a comprehensive indicator used by the Taiwan EPA to evaluate the water quality of rivers, 97% of the data reveal an RPI over 6, indicating that the Nankan River was seriously polluted during the monitoring period. The maximum RPI at each water quality monitoring station ranged from 7.8 to 9.0. The worst RPI was observed at water quality monitoring station S4 (Dakwai River Bridge), followed by S3 (Gueishan Bridge) and S6 (Jhuweida Bridge) on the mainstream of the river and S9 (Hontia Bridge) on the tributary. These results indicate sources of pollution along both the mainstream and the tributaries of the river.

Synthetic data
In this study, the WASP model was used to synthesize the heavy metal concentrations in the Nankan River. In the WASP model, based on a survey report from the Taiwan EPA (DEPT, 2015), the Nankan River was divided into 36 water quality grids from the upstream to the downstream, including 17 right bank tributaries and 9 left bank tributaries. In the Nankan River Basin, the main sources of pollution are the PCB industry in the upstream section and AH along the tributaries, and the pollutants are primarily heavy metals. Fig. 3 and Table 2 show the hydrological characteristics of the 1D steady state model grid of the Nankan River and the locations of the two point sources. The K d values for the heavy metals in the WASP toxicity module were set with reference to Sheppard et al. (2009) (Table 3).
This study assumed that the wastewater from both point sources was discharged directly into the river without treatment. For the two sources, some heavy metal pollutants were the same, and some were different. For the same heavy metal pollutants, different ratios between the heavy metal concentrations discharged from the two sources were designed to investigate the differences in the results of the subsequent statistical analyses. Source 1 (Grid No. 1) was the wastewater discharged from the PCB industry located in the upstream section of the mainstream, and the heavy metal discharge was designed to be 268 mg/L of Cu, 4.87 mg/L of Pb, 4.23 mg/L of Ni, and 19.3 mg/L of Cr according to data from the Taiwan Industrial Development Bureau (TIDB 2000). Source 2 (Grid No. 23) was the AH wastewater located at the junction of the midstream reaches of the mainstream and the tributaries. The heavy metal emissions were based on data from the Taiwan Council of Agriculture (TCA 2010), and designed to be 0.2 mg/L for As and 90.6 mg/L for Cu.
The simulated results show that there was only one synthetic value for each of the heavy metals including Pb, Cr, Ni, and As in each grid, but Cu had different synthetic values in the grids depending on the designed ratio. Table 4 shows the synthetic data of the concentrations of five heavy metals in each grid simulated by the WASP model.

Locations of pollution sources
CA is a powerful tool for assessing the locations of pollution sources. The CA results for the spatial distribution of the concentration of heavy metals in water can further indicate the locations of pollution sources. The results were further compared with the scenarios in the WASP model to Fig. 3 The hydrology characteristic of major input source of pollution and horizontal water quality model grid for the Nankan River determine whether the CA results in this study were correctly inferred from the actual situation.
According to the characteristics of the synthetic data, the CA yielded three clusters (Fig. 4), and the results show a clear spatial relationship between the upstream and downstream sections of the river. The first cluster includes grids 1 to 13 and belongs to the upstream section of the river. The concentration of Cu in the synthetic data of this cluster is high, followed by Pb, Ni, and Cr, and the heavy metal concentration in the synthetic data decreases as the distance to the most upstream location increases, so it was inferred that there was only one pollution source in the most upstream location. The synthetic data of this cluster were influenced by the wastewater from PCB manufacturing from Source 1 (Grid No. 1), and the inferred results are consistent with the scenario designed for the WASP model.
The second cluster includes grids 14 to 23 and belongs to the midstream section of the river. The concentration of pollutants in the synthetic data of this cluster did not increase significantly, and the concentration of heavy metals in the synthetic data decreased as the distance to the upstream section increased, so it was inferred that there was only one pollution source located in the upstream section of this cluster. The influence of the first pollution source on this cluster was significantly reduced, and the concentration of heavy metals in the synthetic data was significantly reduced. Due to the large number of tributary drainages from Grid No. 14 (13 in total), the heavy metals in the synthetic data were assumed to be diluted by water (Vega et al. 1998), so this cluster was taken as the second cluster according to the CA results.
The third cluster includes grids 24 to 36 and belongs to the downstream section of the river. As pollution was added in the synthetic data of this cluster, and the concentration of Cu pollution also increased, while the concentration of other pollutants did not increase significantly, and the concentration of heavy metals in the synthetic data decreased as the distance to the upstream section increased. It was inferred that there was a new pollution source in this cluster located in the upstream section of the river. The main pollutants were As and Cu. The synthetic data of this cluster were influenced The three clusters from the CA results initially indicated that there could be three pollution sources (or pollution characteristics). However, the water conditions and the water quality indicated that the second cluster was influenced by a large number of tributary discharges, which led to a difference in the synthetic data for the first cluster. Therefore, there should be only two pollution sources, located upstream of the first and third cluster. This result is consistent with the scenario designed for the WASP model. In summary, the CA results should be examined together with the hydrological conditions and water quality of the surface water to deduce a more accurate source location.

Pollutant source characteristics
PCA is a powerful tool used to assess the characteristics of pollution sources by understanding the relationship between pollutants. To understand the influences of the three clusters of pollution sources (water quality characteristics) based on the CA results, the two clusters of pollution sources set by the WASP model in the subsequent PCA results, and the assessment of the pollution source characteristics. The synthetic data were divided into clusters according to the two aforementioned results, and a PCA was conducted to investigate the differences between the clusters. Figure 5 shows that based on the CA results, the synthetic data were divided into three clusters, and a PCA was conducted separately. Figure 5 shows that for the first cluster of grids 1 to 13 (upstream reach), PC1 was selected, with eigenvalues greater than 1. The results of PC1 (total explained variance is 98.93%) show that the four heavy metals, Cu, Pb, Cr, and Ni, were all concentrated on the right side and should be from the same pollution source. Only one pollution source -the PCB manufacturing industry -was considered. As was located in the center (origin) because it was from the second pollution source -AH wastewater. Thus, the synthetic data did not include As. Figure 5 shows that for the second cluster of grids 14 to 23 (midstream reach), PC1 was selected, with eigenvalues greater than 1. The results of PC1 (total explained variance of 83.86%) show that the four heavy metals, Cu, Pb, Cr, and Ni, were all concentrated on the right side and should be from the same pollution source. At this time, there was only one pollution source -the PCB manufacturing industry. In addition, As was located to the left in the PC1 results, so the presence of another pollution source should be evaluated. This result is consistent with the scenario set by the WASP model, in which As was the pollutant of the second source, AH wastewater, and the PCA results can be used to correctly identify As.  Figure 5 shows that for the third cluster of grids 24 to 36 (downstream reach), PC1 was selected, with eigenvalues greater than 1. The results of PC1 (total explained variance of 99.02%) show that the five heavy metals were all concentrated on the right side, and the preliminary assessment indicated that they should be from the same pollution source, which is not consistent with the scenario set by the WASP model. These are the general results and inferences for PC1.
In addition, although the results of the second principal component (PC2) did not meet the selection criteria, more information would be obtained if the PC2 results were included. The results of PC2 are summarized in Table 5. The regional drainage and pollutant K d values were also considered, and the WASP settings (pollutant concentration and location) were used for the study. The results for each river segment are discussed below.
In the upstream reach, the four pollutants should theoretically be in the same direction, but based on the results, they were divided into two directions, i.e., Cr and Ni were in one direction, and Pb and Cu were in another. It is assumed that such division is due to the effects of pollution source concentration. However, the pollution source setting situation (Fig. 3) reveals that, the original concentrations of Pb and Ni were 4.87 mg/L and 4.23 mg/L, respectively, which are on the same order of magnitude, and the original concentrations of Cu and Cr were 268 mg/L and 19.3 mg/L, respectively, which are on a similar order of magnitude. If the PCA results were related to the concentrations of the pollutants, the clustering results should be as described above, but the actual results are different. This discrepancy was obviously not caused by the concentrations of the pollutants.
Based on the pollutant K d values, the pollutants were divided into two clusters (Table 3); i.e., the first cluster included Cu, As, and Pb, with K d values of 40, 25, and 23, respectively, which are all on the same order of magnitude; the second cluster included Cr and Ni, with K d values of 0.52 and 0.21, respectively, which are also on the same order of magnitude. However, the K d values of the two clusters differed by two orders (100 times), and this clustering was consistent with the PCA results. Therefore, it can be inferred that for the same pollution source, the K d values of the pollutants affect the PCA results and the determination of the source characteristics. K d refers to the ratio of a pollutant in particulate matter and in water when the water-particulate matter two-phase system has reached equilibrium. The K d value reflects the migration ability of pollutants between the aqueous and particulate phases and potential ecological hazards and is an important physicochemical characteristic parameter to describe the behavior of pollutants in aquatic environments.
In the midstream reach, the inclusion of PC2 did not affect the assessment results because all five pollutants were clustered in the same direction. However, the distance between Cu and Pb increased compared to the results of the upstream reach. Thus, it was inferred that Cu was affected by the dispersion of the second source (providing trace concentration).
In the downstream reach, the pollutants were divided into two clusters, one for Cu and As and the other for Pb, Cr, and Ni. This clustering is generally consistent with the scenario set by the WASP model because the pollution source for Pb, Cr, and Ni was the PCB manufacturing industry in the upstream reach, and the pollution source for As was AH in the downstream reach. The source of Cu was mainly the AH, so the PCA results are closer to those of As, but some Cu pollutants were still from the upstream PCB manufacturing industry. In addition, Pb, Cr, and Ni were relatively close to each other, the Pearson correlation coefficients among them were all 0.99 (Table 6), and the Pearson correlation coefficient between Cr and Ni was as high as 1.00, which is assumed to be related to the K d value. Water, sediment, and zoo benthos are crucial carriers and storage media for heavy metal migration and transformation . Sediment in aquatic ecosystems can markedly absorb several trace metals and play a key role in regulating the migration, transformation, and purification of heavy metals in aquatic ecosystems (Banerjee et al. 2016). Figure 6 shows the PCA results of the two clusters of synthetic data according to the WASP model. Among them, the results in Fig. 6B are the same as those in Fig. 5, so they are not described in this paper. Fig. 6A shows that for the first cluster of grids 1 to 23 (upstream and midstream reaches), PC1 was selected, with eigenvalues greater than 1. The results of PC1 (total explained variance of 82.13%) show that the four heavy metals, Cu, Pb, Cr, and Ni, were all concentrated on the right side and from the same pollution source. This result is consistent with the scenario set by the WASP model, in which the only pollution source was the PCB manufacturing The results of principal component analysis after three clusters by cluster analysis. A Cluster 1-grid numbered 1 to 13; (B) Cluster 2grid numbered 14 to 23; (C) Cluster 3-grid numbered 24 to 36 ▸ industry. In addition, As was located to the left of the PC1 results and should thus be evaluated as another pollution source. This result is consistent with the scenario set by the WASP model, in which As was the pollutant of the second source, AH wastewater, and the PCA results can be used to correctly identify As. This result is highly similar to that of Fig. 5.
In summary, three clusters of pollution sources (water quality characteristics) were obtained from the CA results, and two clusters of pollution sources were set by the WASP model. The model results were similar to the subsequent PCA results and the assessment of the pollution source characteristics, so without a standard, CA should still be conducted first to facilitate the inferences about the pollution source characteristics during the PCA. In addition, when pollutants are from the same source, the correlation between pollutants is affected by the K d values. In other words, the correlation between pollutants may be high if the K d values are relatively close (preferably on the same order). However, if a pollutant has a second source, the correlation between pollutants is related to the source of the main pollution contribution in that area.

Conclusions
CA provides clustering results according to water quality characteristics. The results showed that when assessing the number and possible locations of pollution sources based on the results of cluster analysis (CA), two instead of three pollution point source was identified when considering the hydraulic variations of surface water. Therefore, it is recommended that when tracing pollution sources, information on the inlet, outlet, and volume of water should be obtained to assess the number and possible locations of pollution sources.
PCA is a powerful tool for exploring the relationship between pollutants. When discussing the principal component analysis (PCA) result, the second principal component (PC2) and the Pearson correlation coefficients among the pollutants should also be considered, which can infer that Cu, Pb, Cr, and Ni are contributed by the same pollutant point source, and Cu is also influenced by another pollutant point source. This result also implies that the solid and liquid partition coefficients (Kd) of pollutants can affect the interpretation of the PCA results, so the Kd values should be determined before tracing the pollution sources to facilitate the evaluation of the source characteristics and potential targets.
This study applied the common and powerful MSA method to establish a working framework for pollution traceability in surface water, including the collection of front-end surface water background information, the generation of synthetic data (i.e., assumed pollution scenario) and the evaluation of the MSA results and the factors that may affect the analysis results, to enhance the effectiveness of pollution traceability. Table 5 The PC2 results of principal component analysis after three clusters by cluster analysis (1) "-" means the pollutant is in the negative quadrant of PC2 (2) "+" means the pollutant is in the positive quadrant of PC2 (3) "*" means the pollutant is at the origin of PC2