Modeling The Formation of Trihalomethanes In Rural And Semi Urban Drinking Water Distribution Networks of Costa Rica

Chlorination is one of the most important stages in the treatment of drinking water due to its effectiveness in the inactivation of pathogenic organisms. However, the reaction between chlorine and natural organic matter (NOM) generates harmful disinfection by-products (DBPs), such as trihalomethanes (THMs). In this research, drinking water quality data were collected from the distribution networks of 19 rural and semi-urban systems supplied by springs, surface water, or a mixture of both, in three provinces of the Pacific slope of Costa Rica from April 2018 to September 2019. Twelve models were developed from four data sets: all water sources, spring, surface, and mixture of spring and surface waters. Linear, logarithmic, and exponential multivariate regression models were developed for each data set to predict the concentration of total trihalomethanes (TTHMs) in the distribution networks. Concentrations of TTHMs ranged from < 0.20 to 91.31 µg/L, with chloroform being the dominant species, accounting for 62% of TTHMs on average. Turbidity, free residual chlorine, total organic carbon (TOC), dissolved organic carbon (DOC) and ultraviolet absorbance at 254 nm (UV254) showed a significant correlation with TTHMs. In all the data sets the linear models presented the best goodness-of-fit and were moderately robust. Four models, the best of each data set, were validated with data from the same systems, and, according to the criteria of R², SE, MSE and MAE, the spring water and mixed spring/surface water models showed a satisfactory level of explanation of the variability of the data. Moreover, the models seem to better predict TTHMs concentrations below 30 µg/L. These models were satisfactory and could be useful for decision-making in drinking water supply systems and be considered in possible modifications to current legislation.


Introduction
Disinfection is one of the most important stages in water treatment to reduce the content of pathogenic material.
In most of the world, chlorine disinfection is the most widely used method for its high effectiveness in preventing pathogenic microorganisms and its low cost (Mazhar et al. 2020). However, chlorine can react with natural organic matter (NOM) present in water from supply sources and generate disinfection by-products (DBPs) such as trihalomethanes (THMs) (Richardson and Plewa 2020). The formation of THMs is influenced by a number of factors: operational variables (e.g., pH, disinfectant type and dose, residence time), environmental conditions (e.g., water temperature and seasonal variation) and water characteristics (e.g., type and concentration of NOM, bromide ion concentration) (Al-Tmemy et al. 2018).
Several studies have reported adverse human health effects from exposure to THMs, for example, bladder cancer (Costet et [...]). Monitoring of THMs is important to avoid the aforementioned adverse effects and to comply with legislation. However, the most common method for THMs determination, gas chromatography, is expensive and time consuming (Mukundan and Van Dreason 2014). As an alternative to measuring THMs, multiple prediction models have been developed. These models can be generated from laboratory or field data by collecting samples at the treatment plant and/or distribution network (Sadiq et al. 2019). Laboratory-based models have the advantage that many variables can be controlled; however, they do not capture certain aspects that occur at real scale (Chowdhury et al. 2009). Models obtained with field data have the advantage of capturing variables such as the influence of the distribution network infrastructure; however, they are specific to each site (Shahi et al. 2020) and therefore cannot be generalized to any context (Semerjian et al. 2009). Prediction models can be classified into mechanistic ones, based on the kinetics of chlorine reactions, and empirical ones (Kumari and Gupta 2015). Empirical DBP models are based on the water quality, operational and environmental conditions that influence DBP formation. These models are developed using statistical regression or artificial neural networks (Sadiq et al. 2019). According to the same study, empirical models help in understanding the factors that contribute to the formation of THMs and are a tool for decision-making.
In the literature, most models predicting the formation of THMs have been developed in temperate and urban regions (Semerjian et al. 2009) and in few cases in tropical regions, for example, in Thailand (Feungpean et al. 2015). The present research is the first attempt to develop THMs prediction models in Costa Rica and, to the best of the authors' knowledge, in the Central American and Caribbean region. Furthermore, this study focused on rural and semi-urban areas, for which no studies were found in the literature.
In Costa Rica, 93% of the population received drinking water in 2019 (PEN and CONARE 2020). Moreover, in the same year, 19.4% of homes in rural and semi-urban areas were supplied with water by local Associations Administrators of Aqueduct and Sewerage Systems (ASADAs in Spanish) (Sánchez-Hernández 2019). In addition, in 2016, 14.3% of the population was supplied by 24 municipalities and the rest by duly organized public companies (AyA 2016). The main water sources used are groundwater, springs, surface water and a mixture of the latter two; in all cases, chlorine disinfection is the method used (Arellano-Hartig et al. 2020). In general, due to economic and analytical capacity limitations, monitoring of THMs is scarce, mainly at the ASADA and municipal level. Thus, the objective of this study was to develop a series of prediction models of TTHMs in the distribution systems of rural and semi-urban areas supplied by springs, surface water and a mixture of both sources. This is the first study of its kind carried out in the country and is expected to serve as a tool for decision-making in the aqueducts regarding their operation and the parameters to be monitored. Furthermore, it may be useful for the Ministry of Health when considering a reform of existing legislation.

Materials And Methods
The study was performed in three different zones of the country (Fig. 1). The sites are located on the Pacific slope, which presents a dry season from December to March, a rainy season from May to October and two months of transition, April and November (Manso et al. 2005). Nineteen small distribution systems in rural or semi-urban areas were selected. The population of most of the systems ranges from 328 to 8000 inhabitants. The length of the distribution networks ranges from 1.2 to 13 kilometers. The raw water sources of the systems were surface water (6), springs (6) and a mixture of both (7). The treatment processes of some of the surface and mixed water sources included 2 conventional treatment systems, 6 slow sand filtration systems and 2 with only screening or sedimentation. The water was chlorinated in 16 cases with solid Ca(ClO)2, in one case with liquid NaClO, and in two systems with chlorine generated in situ by electrolysis. In this study, mainly for spring water, chlorination was the only treatment; therefore, water subjected solely to chlorination was considered as treated water.

Water sampling and analytical procedures
Water samples from the 19 systems were collected in three different sampling campaigns, in the dry, transition and rainy seasons, respectively. The study period was from April 2018 to September 2019. On each sampling day, four samples of the distribution network were taken at four different points, as recommended by the local legislation (MINSA 2018): at the exit of the chlorination storage tank (minimum estimated design contact time of 30 min) and at the beginning, the middle and the end of the distribution network.
Total and dissolved organic carbon (TOC and DOC, respectively) were determined using a Teledyne Tekmar TOC [...] In the field, pH was determined at all sampling points using Hanna HI 8-124 equipment and free chlorine using a colorimeter (Pocket Colorimeter II, Hach) following the DPD method (N,N-diethyl-p-phenylenediamine). Turbidity and apparent color were determined in the laboratory in less than 24 hours after sampling using 2100Q and DR900 equipment (both Hach). In all cases, the methods of the Standard Methods (APHA et al. 2017) or those recommended by the equipment manufacturers were followed.

Mathematical model development
The models were developed using the water samples taken at the exit of the chlorinated water storage tank and in the distribution network of each system. The models were developed from four data sets according to the source water of the systems: (1) all sources, (2) spring, (3) surface, and (4) mixture of surface and spring waters, referred to as mixed. Prior to the analysis, each database was randomly divided into two groups: calibration data (70% of the total) and validation data (30% of the total); a similar procedure was reported by Golfinopoulos and Arhonditsis (2002) for the development of multivariate regression models for the prediction of THMs in a water treatment plant in Greece.
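The random 70/30 division described above can be sketched as follows. This is a minimal illustration, not the procedure actually run in the study: the function name `split_calibration_validation`, the fixed seed and the integer sample IDs are all assumptions made for the example.

```python
import random


def split_calibration_validation(records, calibration_fraction=0.70, seed=0):
    """Randomly split records into calibration and validation subsets.

    The seed is a hypothetical choice that only makes the example repeatable.
    """
    shuffled = list(records)          # copy so the caller's list is untouched
    rng = random.Random(seed)
    rng.shuffle(shuffled)
    n_calibration = round(len(shuffled) * calibration_fraction)
    return shuffled[:n_calibration], shuffled[n_calibration:]


# Hypothetical data set of 20 sample IDs (not the study's data).
samples = list(range(20))
calibration, validation = split_calibration_validation(samples)
print(len(calibration), len(validation))
```

Applying the split separately to each of the four data sets keeps the 70/30 proportion within every water source type.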
Initially, the normality of TTHMs and of the variables reported by Sadiq et al. (2019) as potentially influential in the formation of THMs (temperature, pH, turbidity, color, free residual chlorine, TOC, DOC and UV254) was evaluated using the Anderson-Darling test (Ryan 2007). As will be discussed later, the variables presented a non-normal distribution, as shown in Table S1 (Online Resource 1); therefore, as recommended by Kargaki et al. (2020) for nonparametric data, the Spearman correlation test with a significance level (α) of 0.05 was used. Using this test, the Spearman correlation coefficient (r_s) and its respective p value were determined. Similar to the criteria that Chowdhury et al. (2008) applied to Pearson's correlation coefficient in THMs model development, in the present research an r_s below 0.3 means weak correlation, between 0.3 and 0.7 moderate correlation, and greater than 0.7 strong correlation.
Furthermore, the correlation was considered statistically significant if the p value was < 0.05 and non-significant otherwise.
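As an illustration of the correlation criteria above, the following sketch computes the Spearman coefficient as the Pearson correlation of tie-averaged ranks and then labels its strength. The function names and the example values are hypothetical. Applying the thresholds to the absolute value of the coefficient is an assumption, and the p value is left out because it would normally come from a statistics package.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_rs(x, y):
    """Spearman coefficient: Pearson correlation of the ranked data."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_strength(rs):
    """Classify |r_s| with the 0.3 and 0.7 cut-offs (absolute value assumed)."""
    magnitude = abs(rs)
    if magnitude < 0.3:
        return "weak"
    if magnitude <= 0.7:
        return "moderate"
    return "strong"


# Hypothetical paired observations (not the study's data).
doc = [1, 2, 3, 4, 5]
tthm = [2, 1, 4, 3, 5]
rs = spearman_rs(doc, tthm)
print(round(rs, 3), correlation_strength(rs))
```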
Multiple regression analysis was performed in the Minitab 17 statistical software program for the development of linear and nonlinear models. TTHMs concentrations were considered as the dependent variable, while the other water quality parameters were considered as the independent variables. Once the potential variables to include in the models were identified, as recommended by Feungpean et al. (2015), the stepwise method was used to identify the variables that are significant in explaining the variability captured by the model. In the stepwise method, each variable is included or excluded by comparing the p value of the F test against the alpha values to enter or leave the model, considering a significance level of 0.05.
To find the model with the best performance and goodness-of-fit, linear and non-linear models were generated for each data set. Transformations were applied to the dependent and/or independent variables (e.g., square root, exponential, logarithmic) (Pardoe 2012). In all cases, data exclusion criteria were used, such as: studentized deleted residuals greater than 3, high leverage points, Cook's distance and DFFITS (Acuña [...]).

Models' validation and applicability
The best model obtained for each data set was validated using the data excluded from model calibration (30% of the total data). For validation, predicted and measured TTHMs were compared using the criteria R², SE and MSE (Shahi et al. 2020). In addition, as in the study mentioned, a t test was performed to determine whether there was a significant difference between the mean of the measured TTHMs and the mean of those predicted by the models. A test of equal variances was first performed to determine whether equal variance could be assumed in the t test. Next, the t test was performed by calculating the t value and its respective p value. If the p value was > 0.05, the difference between the measured and predicted values was considered non-significant; otherwise it was considered significant.
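The comparison of measured and predicted TTHMs can be sketched as below. This is a hedged example under stated assumptions: R² is computed here as 1 − SS_res/SS_tot on the validation data, which may differ from the definition used by the statistical software; SE is omitted because its definition depends on the model's degrees of freedom; and only the pooled (equal-variance) t statistic is computed, with the p value left to a statistics package. All function names and numbers are hypothetical.

```python
from math import sqrt


def validation_metrics(measured, predicted):
    """Return R2, MSE and MAE for paired measured/predicted values."""
    n = len(measured)
    residuals = [m - p for m, p in zip(measured, predicted)]
    ss_res = sum(e * e for e in residuals)
    mean_measured = sum(measured) / n
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    return {
        "R2": 1 - ss_res / ss_tot,                  # assumed definition
        "MSE": ss_res / n,                          # mean squared error
        "MAE": sum(abs(e) for e in residuals) / n,  # mean absolute error
    }


def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances (pooled variance)."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    var_a = sum((x - mean_a) ** 2 for x in a) / (na - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (nb - 1)
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    return (mean_a - mean_b) / sqrt(pooled_var * (1 / na + 1 / nb))


# Hypothetical TTHMs values in µg/L (not the study's data).
measured = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]
metrics = validation_metrics(measured, predicted)
t_value = pooled_t_statistic(measured, predicted)
print(metrics, t_value)
```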

Results And Discussion
Water quality parameters
Table 1 presents the main characteristics of the treated/chlorinated water of the 19 systems. In general, the water quality was maintained from the outlet of the chlorinated water storage tank to the end of the network. The temperature range is typical for tropical countries and the pH values were close to 7. The turbidity and color of all samples were relatively low, indicating the efficiency of the treatments and/or that the water sources were good.
Similarly, in most cases TOC and DOC were quite low. Moreover, UV254 indicates a low presence of humic substances, and SUVA, in most cases less than 2 L/mg·m, suggests non-humic NOM and low molecular weight aliphatic compounds (Edzwald and Tobiason 2011). The low values of the above NOM-related parameters and the low concentrations of free residual chlorine explain the low concentrations of TTHMs, where only two samples slightly exceeded the 80 μg/L regulated by the US EPA (US EPA 1998). As for the dominant species of THMs, chloroform accounted for the highest share (62% of TTHMs on average) and occurred in low concentrations (10.60 ± 13.86 μg CHCl3/L). In addition, the species CHBrCl2, CHBr2Cl and CHBr3 were frequently found, but at much lower concentrations (i.e., < 2 μg/L). Such speciation of THMs has been reported in other studies (Sérodes et al. 2003). In general, for all the parameters (except pH and free residual chlorine), surface water values were at least double the spring water ones, and the values of the mixed and whole data sets were in between. That is expected, as surface water is highly influenced by allochthonous and autochthonous production, and the effect is also observed in the whole and the mixed water data sets.
Furthermore, the higher concentration of precursors (e.g., TOC, UV254) is reflected in higher THMs concentrations.
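For reference, SUVA in L/mg·m is conventionally obtained by converting UV254 from cm⁻¹ to m⁻¹ and dividing by DOC in mg/L. A minimal sketch follows; the function name and the input values are hypothetical and are not taken from Table 1.

```python
def suva(uv254_per_cm, doc_mg_per_l):
    """SUVA in L/(mg·m): UV254 (cm^-1) times 100 gives m^-1, divided by DOC (mg/L)."""
    return 100.0 * uv254_per_cm / doc_mg_per_l


# Hypothetical sample: UV254 = 0.020 cm^-1 and DOC = 1.5 mg/L.
example_suva = suva(0.020, 1.5)
print(round(example_suva, 2))  # below the 2 L/mg·m threshold discussed above
```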

Correlation of independent variables with THMs in treated water
The Anderson-Darling statistical test (Ryan 2007) showed that the dependent (TTHMs concentrations) and most of the independent variables presented a non-normal distribution across all data sets (p value < 0.05) (Table S1, Online Resource 1). This is expected because the data comes from systems with different operational characteristics. The data presented a positively skewed distribution, which is characterized by having a large amount of data in the low ranges of the parameter compared to the higher ranges. Therefore, to evaluate the correlation between the variables, Spearman's non-parametric test was used (Kurajica et al. 2020).
Temperature and pH showed non-significant and weak correlations (p value > 0.05, r_s < 0.3) in all data sets (Table 2), as expected because both parameters were relatively stable [...] correlation for both parameters. Accordingly, an increase in temperature tends to increase the reaction rate between organic matter and chlorine, and the THMs concentration increases with pH because many hydrolysis reactions, which occur in basic medium, promote their formation. Turbidity presented a weak correlation in all data sets (r_s < 0.3) and was significant (p value < 0.05) only in the whole data set and the surface water data set (Table 2). Tsitsifli and Kanakoudis (2020) reported a greater correlation between turbidity and TTHMs (r = 0.553) for two treatment plants using surface sources. With regard to apparent color, a low and significant positive correlation was observed in the surface water data set; in the others, the correlation was not significant (Table 2). Free residual chlorine showed a significant correlation in the whole data set and in the spring and mixed water data sets (Table 2). In addition, the correlation was moderate and positive in all data sets. In contrast, some authors reported negative correlations between this parameter and TTHMs (Feungpean et [...]). Finally, SUVA only presented a significant, but low, negative correlation in the mixed water data set (Table 2). Other studies have reported low and negative, but not significant, correlations for SUVA (Babaei et al. 2015).

Modeling THMs formation within distribution system
As shown in Table 3, linear, logarithmic, and exponential models were developed for each type of water. All models were significant (p value < 0.05 of F test) and in most cases the Durbin-Watson value was between 1.5 and 2.5, as recommended in the literature to avoid autocorrelation problems (Tsitsifli and Kanakoudis 2020). The models presented a wide range of adjusted R², from 0.132 to 0.687, indicating varied performance and fit to the data.
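The Durbin-Watson check mentioned above can be illustrated with a short sketch. The statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals; values near 2 suggest little first-order autocorrelation. The residuals and the helper `within_recommended_range` are hypothetical, with the 1.5 to 2.5 window taken from the text.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for an ordered sequence of model residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


def within_recommended_range(dw, low=1.5, high=2.5):
    """True when the statistic falls inside the window cited in the text."""
    return low <= dw <= high


# Hypothetical residuals in µg/L (not taken from the fitted models).
example_residuals = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0]
dw = durbin_watson(example_residuals)
print(dw, within_recommended_range(dw))
```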
The most appropriate models (in bold in Table 3) were selected not only because of the values of the coefficient of determination, but also because of statistical parameters related to the error (i.e., SE, MSE, MAE). For the whole data set and the spring and mixed water data sets, models 1, 4 and 10, respectively, presented the lowest values of SE, MSE and MAE and were selected although they presented a slightly lower R². However, in these models the R² of 0.448, 0.657 and 0.531, respectively (Table 3) [...] (Table 3). Therefore, models 1, 4, 7 and 10, all linear, were selected as the ones with the best performance and goodness-of-fit. Among those models, the greatest goodness-of-fit is observed for spring waters (of higher quality), followed by the model of the mixed water data set, then the model of the whole data set, with the lowest performance in the case of the surface water data set. In general, those models can be considered moderately robust and could be improved by including some parameters and operational variables that affect the formation of THMs in distribution networks (e.g., bromide ion, contact time, chlorine dose) (Nikolaou et al. 2004).
Through a more detailed analysis of each of the chosen models, the most influential variables in the formation of THMs can be determined by type of water source. Thus, model 1, similar to models reported by Kumari and Gupta (2015), includes the variables pH, free residual chlorine, DOC and UV254. In the case of the spring water data set, model 4, free residual chlorine, DOC and turbidity were included; the latter variable has also been used in THMs prediction models (Al-Tmemy et al. 2018). Finally, in the surface and mixed water data sets, models 7 and 10, free residual chlorine and organic matter content, such as DOC and TOC respectively, are observed as influential.
Nomenclature: TTHM: total trihalomethanes (µg/L); Cl: free residual chlorine (mg/L); UV254: ultraviolet absorption at 254 nm (cm⁻¹); DOC: dissolved organic carbon (mg/L); AP: apparent color; T: turbidity (NTU).
[...] Table 4). Furthermore, Fig. 2 shows that most of the data are within the prediction interval for all the models. In the case of the whole data set and surface water (Fig. 2a and 2c), the data tend to move away from the line of best fit above 30 µg/L. In the case of the models for spring water and mixed water (Fig. 2b and 2d), with lower TTHMs concentrations, the data tend to be distributed more evenly. Therefore, these models seem to perform better at TTHMs concentrations lower than 30 µg/L.

Conclusions
Several TTHMs models were developed for chlorinated water in tropical rural and semi-urban Costa Rican systems. The TTHMs concentrations ranged from < 0.20 to 91.31 μg/L, with CHCl3 accounting on average for 62% of the total. Depending on the data set, several parameters, including turbidity, TOC, DOC, free residual chlorine, and UV254, presented a significant correlation with TTHMs (p value < 0.05). Four linear models presented the best goodness-of-fit and were moderately robust. From the validation stage, it was found that, according to the criteria of R², SE, MSE and MAE, the spring water and mixed spring/surface water models showed a satisfactory level of explanation of the variability of the data. Moreover, all the models seem to better predict TTHMs concentrations below 30 µg/L. Therefore, considering the specific characteristics of the chlorinated water (low NOM and low TTHMs produced), the models developed could be useful for decision-making in drinking water supply systems and be considered in possible modifications to current legislation.