## 3.1 Problem description

This case was based on a real chemical leakage accident that occurred on January 18, 2017, at 9:23 a.m., when a truck carrying crude benzene overturned at Fen River Bridge and fell into the Fen River(marked with "source site" in Fig. 3), Xinjiang County, Yuncheng City, Shanxi Province, north China. The Fen River was contaminated with approximately 4,320 kg of crude benzene. For modeling purpose, this release was treated as a short-duration release. Monitoring points for the downstream crude benzene were set up at monitoring sites labeled in Fig. 3. It was assumed in this work that the pollutant was released instantly from a point source.

The samples of river water quality were taken at "Monitoring site B". Guo and Cheng (2019) provided the monitoring data used in this study, and the pollutant concentration at the monitoring location was simulated. Figure 3 depicted the contaminated river portion and the monitoring locations. The average cross-sectional area and the mean flow rate were 25 m2, 2880 m/h, respectively (Guo and Cheng 2019). Taking the location of contaminant source as the coordinate origin, the coordinates of Monitoring site A were referred as (*x*0, 0).

The release mass, location, and release time of the contamination source were assumed to be unknown, which could be defined by three parameters (*M, x*0, *t*0), where *M* was release mass, *x*0 was the source location and *t*0 was release time. We supposed that the values of the source parameters were evenly distributed within the prior range presented in Table 1.

Table 1

True values and prior distributions of contaminant source parameters.

Parameter | Prior distribution | True value |

*M* | *U* [2000, 10000] | 4320 kg |

\({x_0}\) | *U* [1000, 25000] | 16000m |

*t*0 | *U* [50, 720] | 120 min |

Time series of the observed pollutant concentration was presented in Fig. 4, providing critical observation data for CA model calibration. It was shown that the pollutant concentration at monitoring site B approximately follows the normal distribution, with the highest observed concentration occurring 11 hours after the observation.

## 3.2 Result analysis

**Concentration simulation results.** The CA model was applied to simulate the contaminant transport during the incident and was calibrated using time series data about observed concentration. All CA simulations in this section used a cell size of 5 m and a time step of 1 min. The model was calibrated by comparing the CA model’s predicted concentrations to the observed data at the monitoring sites (A and B). Available 1-hour concentrations were collected at Monitoring site A and Monitoring B. The model concentration was calibrated by the contaminant transfer coefficient (m) and correction coefficient (d). Based on calibration, *m* and *d* were set to 0.18 and 0.39, respectively.

Four statistical indicators were calculated to evaluate the CA transport modelling: the root square error (*RMSE*), the relative *RMSE* (*RRE*), the mean absolute deviation (*MAD*) and the mean deviation (*MD*). These indicators were used to assess agreement between CA based simulated results and observed values for pollutant concentrations at two observation locations in the river.

Root mean square error (*RMSE*) is used to assess the effectiveness of CA model. *RMSE* shows the average of the squared difference between model simulated and observed values:

$$RMSE=\sqrt {\frac{1}{N}\sum\nolimits_{{n=1}}^{N} {{{\left( {{X_{\bmod {\text{el}}}} - {X_{{\text{obs}}}}} \right)}^2}} }$$

10

where *N* represents the number of observed values throughout the simulation time; \({X_{\bmod {\text{el}}}}\)and \({X_{obs}}\)are model-simulated and observed values at each comparison point *i*. The lower the *RMSE*, and the better the model simulation performance. A *RMSE* of 0 indicates that the simulations exactly match the observations.

*RRE* is defined as the ratio of *RMSE* to the observed change (Ji 2017):

$$RRE=\frac{{RMSE}}{{{\text{Observed change}}}} \times 100{\text{=}}\frac{{\sqrt {\frac{1}{N}\sum\nolimits_{{n=1}}^{N} {{{\left( {{X_{\bmod {\text{el}}}} - {X_{{\text{obs}}}}} \right)}^2}} } }}{{{X_{\hbox{max} }} - {X_{\hbox{min} }}}} \times 100$$

11

where Xmax and Xmin represent the maximum and minimum values of the observations, separately.

*MAD* denotes the average distance between model-predicted and observed values, while *MD* denotes the difference between the average model-predicted and the observed values (Pontius et al.2008):

$$MAD=\frac{{\sum\nolimits_{{n=1}}^{N} {\left| {{X_{\bmod {\text{el}}}} - {X_{{\text{obs}}}}} \right|} }}{N}$$

12

$$MD=\frac{{\sum\nolimits_{{n=1}}^{N} {\left( {{X_{\bmod {\text{el}}}} - {X_{{\text{obs}}}}} \right)} }}{N}$$

13

Figure 5 compared CA-based model results and observed values at Monitoring Site A and B. A time series of predicted and observed concentrations at Monitoring sites (Fig. 5) revealed that the observed and predicted data agreed well.

Table 2 presented the statistical error analysis for concentrations. In Table 2, the absolute error varied from 0.011 mg/L at Site A to 0.103 mg/L at Site B; the *RMS* error varied from 0.068 mg/L at Site A to 0.082 mg/L at Site B; and the *RRE* values of pollution concentration ranged from 2.87–5.59%. According to the statistical results, CA model simulated the concentration trends well, which agreed with the graphical results. The ability of the model to simulate contaminant transport was proved by the general agreement between predicted and actual outcomes.

Table 2

Statistical analysis of observed and model concentrations at Monitoring sites.

Site | Number of data | Observed mean(mg/L) | Modeled mean (mg/L) | Mean absolute error (mg/L) | RMSE (mg/L) | RRE MAD MD (%) (mg/L) (mg/L) |

A | 9 | 1.025 | 1.014 | -0.011 | 0.068 | 5.59 0.100 -0.002 |

B | 9 | 1.661 | 1.558 | -0.103 | 0.082 | 2.87 0.289 -0.047 |

**MCMC convergence diagnostics.** Convergence diagnostics are used to determine whether samples produced by MCMC simulations are drawn from the appropriate posterior distribution. There are numerous diagnostic approaches for determining whether a Markov chain has reached its stationary distribution (Cowles and Carlin 1996), some of which compute sample statistics from the Markov chain. The trace plot is an informal but widely used diagnostic tool. Trace plots can exhibit parameter values over many iterations. For this plot, the y axis is the value of a parameter, while the x axis shows the iteration number. A chain is considered to have reached stationarity when the distribution of points is centred on a constant mean.

Using uniform priors on the source parameters *M*,\({x_0}\)and\({t_0}\):

$$P\left( M \right)=\left\{ \begin{gathered} 1.25 \times {10^{ - 4}},\quad \quad 2,000<M<10,000, \hfill \\ 0,\quad \quad \quad \quad \quad \quad else \hfill \\ \end{gathered} \right.$$

12

$$P\left( {{x_0}} \right)=\left\{ \begin{gathered} 4.167 \times {10^{ - 5}},\quad \quad 1,000<{x_0}<25,000 \hfill \\ 0,\quad \quad \quad \quad \quad \quad else \hfill \\ \end{gathered} \right.$$

13

$$P\left( {{t_0}} \right)=\left\{ \begin{gathered} 1.493 \times {10^{ - 3}},\quad \quad 50<{t_0}<720 \hfill \\ 0,\quad \quad \quad \quad \quad \quad else \hfill \\ \end{gathered} \right.$$

14

In this case study, only the observed error was taken into account for simplicity, and it was also set to follow the normal distribution N (0, 0.52) with mean 0 and standard deviation 0.5. In this situation, the number of observations N was equal to 9. The posterior probability density function (PDF) was calculated as follows:

$$P(M,{x_0},{t_0}\left| D \right.)=\frac{{7.77667 \times {{10}^{ - 12}}}}{{{{\left( {2\pi {\sigma ^2}} \right)}^{{9 \mathord{\left/ {\vphantom {9 2}} \right. \kern-0pt} 2}}}}}\exp \left\{ {\sum\limits_{{i=1}}^{9} {\left[ { - \frac{{{{\left( {{D_{\bmod {\text{el}},i}}\left( {M,{x_0},{t_0}} \right) - {D_{{\text{mea}},i}}} \right)}^2}}}{{2{\sigma ^2}}}} \right]} } \right\}$$

15

For simplicity, assume the proposal distribution for the MH algorithm was also a uniform proposal distribution\(q\left( {\bar {\theta }\left| {{\theta _i}} \right.} \right)=U\left( {{\theta _i} - \Delta \theta ,\;{\theta _i}+\Delta \theta } \right)\), and step size was 5% of a prior range, \(\Delta \theta \left( {M,{x_0},{t_0}} \right)=\left( {400,1200,33.5} \right)\). Each Markov chain was run for a total of 20,000 iterations. The acceptance rate of a candidate was approximately 0.42, indicating that the sampling was efficient. The plot of sample distribution of the three parameters was shown in Fig. 6. In order to facilitate visualization in Fig. 6, that sample was thinned to 2,000 values by taking every tenth value. These plots indicated that the three chains converged to the posterior after around 2000 iterations. The last 18,000 samples are used for inference about the posterior distribution of the contaminant source.

**Analysis of the identification results.** Converged MCMC chains can be used for the estimation of the posterior distributions of source parameters. For instance, the mean values of 3 unknown parameters could be estimated by calculating the mean of MCMC samples. Because the samples acquired by the first 2000 iterations are unstable, the last 18,000 were used in the computation of the posterior mean, the posterior standard deviations (the standard deviation of the sampled values), and the relative error of the parameters. Table 3 presented the posterior summary statistics of the parameter estimation given by Bayesian MCMC. The probability histograms of the last 18,000 values for the source parameters in Fig. 7. The sample skewness for *M, x*0 and *t*0 were 0.360, -0.449 and − 0.145, respectively. The sample kurtosis for *M, x*0 and *t*0 were equal to -0.320, -0.212 and 0.629, respectively.

Table 3

Summaries of the posterior distribution for three source parameters

Parameter | True value | Mean | Mean absolute error | Standard deviation | Relative error (%) |

*M*(kg) | 4320 | 4964 | 644 | 836.7 | 14.9 |

*x*0 (m) | 16000 | 13827 | -2173 | 1182.4 | -13.6 |

*t*0 (min) | 120 | 142.2 | 22.2 | 19.2 | 18.5 |

As illustrated in Figs. 6 and 7 and Table 3, several conclusions could be drawn: |

(1) The identification results tended to stay stable enough after about 2000 iterations. The location of contaminant source *x*0 was most likely at 12,500 − 16,500 m; the release mass *M* was most probably at 4,200–5,200 kg; the release time *t*0 was most probably at 120–150 min. These results were close to the true values and thus validating the efficiency of the MCMC method.

(2) According to Table 3, the standard deviations of *M, x*0 and *t*0 were 836.7, 1182.4 and 19.2, respectively. The relative errors for *M*, *x*0 and *t*0 were 14.9%, -13.6% and 18.5%, respectively. The result depends on the likelihood and prior distribution, which are supposed to be normal and uniform distributions, respectively.

(3) The estimated mean results using the Bayesian-MCMC method were compared to the actual values for *M*, *x*0, and *t*0, and the relative errors were all less than 20%.

(4) Fig. 7 depicted the histogram of the generated samples. The absolute values of both skewness and kurtosis were less than 1, so the posterior distribution for all three parameters approximately followed a normal distribution.

## 3.3 Discussion

**Transition rules of CA model.** The 2D CA model developed in this study was primarily for river pollution simulation. The performance of the CA model was compared with the hydrodynamic-water quality model Delft3D-Flow (Deltares 2013), in which the hydrodynamic-water quality model is the Navier Stokes equations and advection-diffusion equation. The model Delft3D-Flow is chosen as its effectiveness in modeling hydrodynamic and transport processes. Instead of using an advection diffusion equation, the CA model has been built to calculate the amount of pollutant mass transferred between neighbouring cells. A comparison of results from the Delft3D-Flow and CA models revealed no significant difference. The *MAD* and *MD* for Delft3D-Flow model were 0.255 mg/L and − 0.028 mg/L, respectively (Guo and Cheng 2019), while CA model had slightly larger *MAD* and *MD* values for Site B. These results indicated acceptable simulation accuracy of the proposed CA model in this case study.

Finding appropriate transition rules for CA model is a crucial task (Balzter et al. 1998). Transition rules govern how a cell is influenced by its neighborhoods. In most cases, the transition rules in the CA model are obtained based on the specific problem and the physical properties of the problem. In this study, two major advection and dispersion processes in rivers are simulated using a set of transition rules composed of explicit algebraic expressions. Contaminants are assumed to be conservative and do not consider the fate and decay process. Future studies should consider the fate and decay process (such as chemical and biological kinetics) to generate appropriate transition rules for CA models. CA model is based on proposed rules and its applicability and accuracy need to be confirmed (Li et al.2019). More recent work has shown that combining CA models with neural networks can help the CA model to mine the rules (Lauret et al. 2016; Xing et al.2020)

**Performance of the Bayesian-MCMC method.** This study proposes a Bayesian-MCMC method based on the MH algorithm for identifying single-point pollution sources in rivers. Providing a full posterior distribution of the parameters is one important advantage of the Bayesian-MCMC method over simulation-optimization methods, which can provide only a single optimal solution of the source parameters (Hutchinson and Oh, et al.2017). Identifying contaminant source from limited observed data is an ill-posed inverse problem due to irreversible contaminant fate and transport process (Tong and Deng 2001). Therefore, the uniqueness of the obtained solutions and whether the problem was ill-posed are vital to consider. The Bayesian-MCMC method is particularly well suited to this situation, because it can provide a probabilistic description of the possible contaminant sources in order to consider uncertainties in input data. The comprehensive example above confirmed the Bayesian-MCMC method's critical significance in contamination source detection.

This study assumed only a single-point contaminant source in river pollution accidents. Compared with single-point source identification problems, only few studies have been carried out on identifying multi-point sources (Wang et al. 2018; Zhu and Chen 2022). The method described in this study can be expanded to conclude multi-point sources, with the only difference being the number of contaminated sources that must be evaluated. It should be noted that more work is required to evaluate the likelihood function by including such parameters (Wang and Jin,2013), but in the current investigation, we assume that observation error follows a normal distribution. More research is required on this subject.

**Impact analysis of observation error.** The likelihood function is employed in Bayesian inference, which is closely related to the observation error distribution. So far, there has been little discussion about observation errors in Bayesian contaminant inference. Concentrations at a monitoring site are subject to observation error. In this case study, the observation error was based on a random sample from a normal distribution. The effect of observation error on the identification results was discussed in this section.

To illustrate the effect of observation errors on the identification results, assuming that the observation error followed the normal distribution\(N\left( {0,{\sigma ^2}} \right)\), we had intentionally specified three different standard deviations σ as 0.05, 0,25 and 0.5, respectively. Statistical results were compared based on the mean and standard deviation of the source parameters. Table 4–6 presented the comparison of statistical results for source parameters. In order to intuitively represent the impact, we showed the variation of the relative error and the sampling relative error with various standard deviations of observation error in Fig. 8. The standard deviation and relative error of the source parameters decreased as the oservation errors decreased. Therefore, the posterior results are influenced by measurement errors. The lower the measurement error variances, the lower the posterior variance and the more certain the identified results.

Table 4

Comparison of statistical results for release mass *M*

\(\sigma\) | True value | Mean | Standard deviation | Relative error (%) | Sampling relative error(%) |

0.50 | 4320 | 4964 | 836.75 | 14.9 | 0.14 |

0.25 | 4320 | 4925 | 253.35 | 14.0 | 0.04 |

0.05 | 4320 | 4680 | 84.26 | 8.3 | 0.01 |

Table 5

Comparison of statistical results for source location *x*0

\(\sigma\) | True value | Mean | Standard deviation | Relative error (%) | Sampling relative error(%) |

0.5 | 16000 | 13827 | 1182.4 | -13.6 | 0.05 |

0.25 | 16000 | 14035 | 366.7 | -12.3 | 0.02 |

0.05 | 16000 | 14425 | 205.4 | -9.8 | 0.01 |

Table 6

Comparison of statistical results for release time *t*0

\(\sigma\) | True value | Mean | Standard deviation | Relative error (%) | Sampling relative error(%) |

0.5 | 120 | 142.2 | 19.2 | 18.5 | 0.11 |

0.25 | 120 | 140.3 | 18.3 | 16.9 | 0.11 |

0.05 | 120 | 134.5 | 15.6 | 12.1 | 0.09 |