Groundwater contamination source identification based on Sobol sequences–based sparrow search algorithm with a BiLSTM surrogate model

In the traditional linked simulation–optimization method, solving the optimization model requires massive invoking of the groundwater numerical simulation model, which causes a huge computational load. In the present study, a surrogate model of the origin simulation model was developed using a bidirectional long and short-term memory neural network method (BiLSTM). Compared with the surrogate models built by shallow learning methods (BP neural network) and traditional LSTM methods, the surrogate model built by BiLSTM has higher accuracy and better generalization performance while reducing the computational load. The BiLSTM surrogate model had the highest R2 of the three with 0.9910 and the lowest RMSE with 3.7732 g/d. The BiLSTM surrogate model was linked to the optimization model and solved using the sparrow search algorithm based on Sobol sequences (SSAS). SSAS enhances the diversity of the initial population of sparrows by introducing Sobol sequences and introduces nonlinear inertia weights to control the search range and search efficiency. Compared with SSA, SSAS has stronger global search ability and faster search efficiency. And SSAS identifies the contamination source location and release intensity stably and reliably. The average relative error of SSAS for the identification of source location is 9.4%, and the average relative error for the identification of source intensity is 1.83%, which are both lower than that of SSA at 11.12% and 3.03%. This study also applied the Cholesky decomposition method to establish a Gaussian field for hydraulic conductivity to evaluate the feasibility of the simulation–optimization method.


Introduction
Groundwater contamination occurs below the surface; it is difficult to detect and often has a lag time when it is detected. And little information was known about the sources of groundwater contamination when it was first discovered (Datta et al. 2014;Snodgrass and Kitanidis 1997). This poses great difficulties for the rational design of groundwater contamination remediation programs and the accurate determination of contamination liability. Therefore, the study of groundwater contamination source identification (GCSI) is very important.
It should be noted that contamination migration in groundwater systems is an irreversible process; thus, GCSI is a nonlinear and unconventional inverse problem (Yeh 1986;Skaggs and Kabala 1994). Numerous solution methods have been proposed in the literature to solve the inverse problem of groundwater contamination source identification (Wang and Hu 2017). The methods for GCSI can be divided into the following categories: (i) analytical approach Skaggs et al. 1998); (ii) direct approach (Atmadja and Bagtzoglou 2001;Bagtzoglou 2003); (iii) simulation-optimization approach (Ayvaz 2010;Pan et al. 2021); (iv) probabilistic and geostatistical simulation approach (Sun 2007;Butera et al. 2013). The analytical approach means analyzing the contamination sources along the reverse direction of groundwater flow when the current groundwater contaminant distribution conditions are known. The direct approach means that the inverse problem is solved directly by giving the corresponding mathematical assumptions to the inverse problem through the regularization method, which transforms the non-adaptive problem into an adaptive one. Probabilistic and geostatistical simulation approaches include multiple nonlinear regression methods and the great likelihood method. The simulation-optimization approach is one of the most widely used among these approaches.
The current literature includes several simulation-optimization approaches for solving the problem of GCSI. It is important to note that coupling the mathematical and optimization models is required in the simulation-optimization approach to determine the aquifer's response to the generated variable values. So far there are three methods (Amirabdollahian and Datta 2013): the response matrix approach , the embedding approach (Gorelick 1983), and the linked simulation-optimization approach. However, the response matrix approach is only suitable for linear systems and the embedding approach has data processing problems in the face of highly non-uniform practical cases (Lu 1994;Ye 1993).
The groundwater transport solute simulation model is an equation constraint for the optimization model when using the linked simulation-optimization approach. The objective function is to minimize the deviation between the simulated and observed values. However, solving the optimization model requires multiple calls to the mathematical simulation model, which will cause a huge computational load. Therefore, a surrogate model strategy is employed to approximate the input-output mapping relationship of the solute transport model (Pan et al. 2022a). Surrogate model techniques fall into three main categories: data-driven, projection, and hierarchical-based approaches. Among them, the data-driven methods are widely used to construct surrogate models of solute transport models, due to their superior accuracy and rapidity (Asher et al. 2015).
Generally speaking, data-driven surrogate model construction methods can be broadly classified into two categories: shallow learning methods and deep learning methods. Yan and Minsker (2006) proposed an adaptive neural network to replace the groundwater simulation model and made good progress in a groundwater contamination remediation case. Ouyang et al. (2017) developed surrogate models for numerical groundwater simulation models using support vector machine methods, kriging methods, and other shallow learning methods. Hou et al. (2019) used the kernel limit learning machine method to develop a surrogate model and effectively solve the groundwater contamination source identification problem. However, the surrogate models using shallow learning methods have difficulty in the accuracy requirements when facing problems with high dimensionality and complex linearity. Therefore, it is essential to find methods with high and robust approximation accuracy when constructing surrogate models.
Long and short-term memory (LSTM) neural network as a deep learning method is more suitable for solving complex problems. LSTM is a special kind of recurrent neural network (RNN), which can solve the long-term dependence problem and the gradient disappearance or explosion problem that exist in general RNN (Hochreiter and Schmidhuber 1997). In recent years, LSTM has been widely used in the field of groundwater research. An et al. (2020) completed the prediction of karst spring flow using LSTM. Li et al. (2021) used LSTM to build a surrogate model for groundwater contamination source identification. Bidirectional LSTM (BiLSTM) consists of the forward LSTM and backward LSTM, which is an extension of the traditional LSTM and can greatly improve the performance of the model. Due to the superior performance of the BiLSTM, this study applies it to GCSI.
In addition, the search efficiency of the optimization model is also important in the linked simulation-optimization approach. So far, the most widely used optimization algorithms are gradient-based traditional optimization algorithms and heuristic optimization algorithms. Heuristic optimization algorithms, such as simulated annealing algorithms (Jha and Datta 2013), genetic algorithms (Xia et al. 2019), and particle swarm optimization algorithms (Guneshwor et al. 2018), are faster and more effective than traditional optimization algorithms. The sparrow search algorithm (SSA) was first proposed by Xue and Shen (2020), which is inspired by the foraging and anti-predatory behaviors of sparrows and has high convergence performance and local search capability. SSA has a simple structure, fewer parameters, and better local search capability than traditional optimization algorithms. However, SSA also suffers from the defects of easily falling into local optimum and low convergence accuracy. To solve these problems, this paper firstly introduces Sobol sequences to enhance the population diversity of initial sparrows, and finally introduces nonlinear inertia weights to improve the global search ability and search efficiency. In general, this paper uses a sparrow search algorithm based on the Sobol sequence (SSAS) to solve the optimization model, which is better than ordinary SSA, resulting in faster convergence and higher solution accuracy (Duan and Liu 2022).
As the most important hydrogeological parameter in groundwater modeling, hydraulic conductivity has uneven spatial variability. In previous studies, scholars often use the homogeneity assumption and parametric partitioning methods to portray the hydraulic conductivity field to fit the actual situation as much as possible (Pan et al. 2022b). The hydraulic conductivity of an aquifer can usually be considered a field that is spatially correlated and usually follows a lognormal distribution, so it can be treated as a Gaussian field after logarithmic treatment. This paper uses the Cholesky decomposition method to establish a Gaussian field of hydraulic conductivity, to evaluate the effectiveness and practicability of the link simulation-optimization approach.
To solve the GCSI problem, the present study uses the BiLSTM method to build a surrogate model for the numerical groundwater simulation model. And the accuracy of the surrogate models built by the BiLSTM method is compared with the accuracy of the surrogate models built by the LSTM method and the BP neural network method. Finally, the surrogate model with the highest accuracy from the three surrogate models is selected and linked to the optimization model, and the optimization model is solved using SSAS to identify the contamination source information. In addition, the Cholesky decomposition method is used to establish the Gaussian field of hydraulic conductivity when establishing the numerical simulation model of groundwater in this paper, to evaluate the feasibility and effectiveness of the simulation-optimization method proposed in this paper.

Numerical simulation model
A numerical groundwater simulation model containing groundwater flow and solute transport models is used to describe the solute transport law. The governing equation of groundwater flow in the model can be expressed as: where x and y are the gradient operator, K is the hydraulic conductivity, H is the water head, w is the source and sink term, μ is the specific yield, and Γ is the boundary of the spatial domain.
The governing equation of groundwater contaminant transport can be expressed as: where c is the contaminant concentration, D is the hydrodynamic dispersion tensor, u i denotes mean linear seepage velocity that satisfies Darcy's Law, and I is the source and sink term.

Bidirectional long and short-term memory (BiLSTM) neural network
Recurrent neural network (RNN) has the concept of temporal order, and its implicit layer input ( Fig. 1) includes not only the input at the current moment (x t ) but also the output of the implicit layer at the previous moment (s t−1 ). It can be mathematically expressed as (Graves et al. 2013): where s t is the Hidden Statu, W denotes the weight matrix of the influence of the last value of the implied layer on the current implied layer, U denotes the weight matrix from the input layer to the implied layer, x t is the input data at time step t, y t is the output of the neural network, f and g are activation functions, and V denotes the weight matrix from the implied layer to the output layer. LSTM is a special type of RNN that can be used to deal with long-term dependency problems due to a special structure called a memory unit in LSTM that replaces the hidden layer in RNN. There are three gate structures within an LSTM memory cell, which are forget gate, input gate, and output gate. The memory unit structure of LSTM is shown in Fig. 2. The basic formula of LSTM can be mathematically given as (Alom et al. 2019): where f t is the forget gate, i t is the input gate, o t is the output gate, h t−1 is the output of the cell at moment t − 1, h t is the output of the cell at moment t, x t denotes the input at moment t, C t−1 denotes the state of the cell at moment t − 1, C t denotes the cell state at moment t, C t indicates a candidate value, σ denotes the sigmoid function, W denotes the weighting factor, and b denotes the offset factor.
BiLSTM consists of two LSTMs superimposed on each other (Fig. 3), and the output is jointly determined by the states of the two LSTMs, which greatly improves the accuracy of the traditional LSTM. The Forward layer is calculated forward from moment 1 to moment t to get the output of the hidden layer at each moment and propagate backward; the Backward layer is propagated backward from moment t to moment 1 to get the output of the hidden layer at each moment and propagate forward. Finally, the final result is obtained by combining the corresponding outputs of the Forward and Backward layers at each moment through the activation function (Siami-Namini et al. 2019).

Sparrow search algorithm based on Sobol sequence (SSAS)
In SSA, individual sparrows are divided into three categories: discoverers, joiners, and vigilantes, and each individual position represents a solution. Discoverers are responsible for searching foraging areas with abundant food and providing foraging directions. The joiners follow the finders to forage. Vigilantes are responsible for  monitoring the foraging area. The identity of discoverers and joiners is dynamically updated throughout the population. During foraging, resources are obtained by constantly updating the positions of the three individuals.
A population of N sparrows can be expressed as: where d denotes the number of dimensions of the problem variable to be optimized, and n is the number of sparrows. The fitness value of sparrows is denoted as: where f denotes the fitness value. During each iteration, the position of the discoverer is updated as follows: where t represents the number of iterations, iter max is the maximum number of iterations, X i,j denotes the position information of the ith sparrow in the jth dimension, R 2 and ST denote the warning value and the safety value, Q is a random number obeying a normal distribution, and L denotes a 1 × d matrix.
The formula for updating the joiner's position is as follows: where X P denotes the current optimal position of the discoverer, X worst is the current global worst position, and A is a 1 × d matrix where each element is randomly assigned to 1 or − 1.
The formula for updating the location of the vigilantes is as follows: where X best is the current global optimal position, β is the step control parameter, f i is the current fitness value of the individual sparrow, f g and f w are the current global optimal and worst fitness values respectively, and ε is a constant, which serves to avoid the denominator from being zero. SSA uses random numbers in the search space to generate initialized populations. This method has low ergodicity and uneven distribution of individuals, which affects the performance of the algorithm. In contrast to the common pseudorandom numbers, the low discrepancy sequence (LDS) is more evenly distributed. Among them, the Sobol sequence has a shorter computational cycle and faster sampling speed (Joe and Kuo 2003). The positions of individuals in the initialized populations based on Sobol sequence mapping are as: where x max and x min are the maximum and minimum values of the individual positions respectively, and K n ∈ [0, 1] denotes the random numbers generated by the Sobol sequence.
Moreover, the traditional SSA lacks effective control over the step length, which is manifested by the fact that after the optimal solution is found, other individuals quickly converge to the optimal solution, thus easily falling into the local optimum. For this reason, the nonlinear inertia weight ω is introduced to control the search range and convergence speed. The inertia weight ω is calculated as follows: With the improvement, the discoverer's position is updated with the following formula:

Fig. 4 General situation of the study area
The Gaussian random field of hydraulic conductivity There are many methods to carve out the hydraulic conductivity field, and the Cholesky decomposition method is used in this study. The following function is used to describe the spatial correlation of the natural logarithm Y = lnK of the hydraulic conductivity (Zhang 2017): where 2 Y is the variance, and x and y are the correlation lengths in the x and y directions, respectively. Then, the correlation matrix C Y is decomposed into the lower triangular 2 y matrix (L) and upper triangular matrix (U) using Cholesky decomposition: Finally, generate the random data set of the Y-field: where is the mean value of the Y-field, and is a set of standard random Gaussians.

Site overview
This study is carried out for a hypothetical calculation case. The study area is 250 m long and 200 m wide, and the aquifer thickness is 40 m.It is a two-dimensional non-uniform medium with regular boundaries. As shown in Fig. 4, the western boundary Γ 1 and the eastern boundary Γ 2 are the specific head boundaries, and the northern boundary Γ 3 and the southern boundary Γ 4 are the no-flow boundaries. Set the head of A, B, C, and D as 30 m, 29 m, 30.5 m, and 29.5 m respectively. The groundwater flow is unsteady, and the direction is southwest to northeast. The total simulation period for contaminant transport is 10 years, divided into 10 simulation periods (one simulation period per year). The background pollutant concentration is set to 0 mg/L, and the pollutant will not undergo chemical changes and biodegradation. The 1 Grid spacing in y-direction, Δy (m) 1

Fig. 5
The Gaussian random field of hydraulic conductivity groundwater source at a site discharges pollutants to the aquifer in periods 3-9. Six monitoring wells (Observation wells 1, 2, 3, 4, 5, and 6) were set up in the study area to monitor the pollutant concentrations in groundwater, where wells 1 to 5 were used to participate in GCSI and well 6 was used to check the accuracy of inversion identification (Ge et al. 2022).

Numerical simulation model
Based on the specific conditions of the study area, a numerical simulation model of groundwater flow and contaminant transport was developed. Table 1 shows the basic parameters of the aquifer. When building the simulation model, the Gaussian random field of hydraulic conductivity was established using the Cholesky decomposition method. Firstly, the study area space is discretized into 50,000 (250 × 200) grids using GMS software. Then, set the x-direction correlation length x as 20, y-direction correlation length as 15, Y-field mean Y(X) as 3.912, and Y-field variance 2 Y as 1, using Eqs. (20)-(22) for calculation. Finally, the Gaussian random field of hydraulic conductivity was obtained as shown in Fig. 5: Figure 6 shows the initial flow head distribution in the study area. The actual GCS information is given in Table 2. However, Fig. 6 The initial flow field during GCSI, the position of the GCS, the initial release period, the period of stopping the release, and the release intensity of the GCS during each release period are unknown. The purpose of GCSI is to identify these unknown variables.
The actual GCS information of the study area is input into the numerical simulation model, and then the simulation model is run to obtain the contaminant concentrations in the observation wells for each stress period as the observation data (Fig. 7).

Surrogate models
The variables to be identified in this paper are the horizontal coordinate xof the source, the vertical coordinate y of the source, and the release intensity st1 to st10 of the source, which are 12 variables in total. The inputs of the surrogate model are the above 12 variables, and the outputs are the contaminant concentrations monitored by five observation wells at each time, totaling 50 variables. In the surrogate model, the training data is used to extract the inherent features and laws of the simulation model, and the validation data is used to evaluate the generalization ability of the surrogate model. To ensure that the data cover the sampling interval uniformly and increase the sample ergodicity, the Latin hypercube sampling (LHS) method is used to sample the above 12 input variables. A total of 1000 sets of samples were randomly combined and input into the simulation model, and the corresponding 1000 sets of pollutant concentration outputs were calculated. The input-output data sets were then randomly divided into training data set (70%) and validation data set (30%).
Surrogate models are built using BP, LSTM, and BiL-STM methods, and the performance of the surrogate models is evaluated and compared by root mean squared error (RMSE) and coefficient of determination (R 2 ). The equations of each evaluation index are as follows: Root mean squared error (RMSE): Coefficient of determination (R 2 ):  where m is the number of samples, y i is the output of the simulation model, ŷ i is the output of the surrogate model, and − y i is the average of the output of the simulation model.

Nonlinear optimization model
The accuracy of the two surrogate models was compared and analyzed. The surrogate model with the highest accuracy was selected and linked to the optimized model. The nonlinear optimization model includes three parts: the objective function, the decision variables, and the constraints. In this paper, the minimization of the error between the calculated values of the surrogate model and the actual observed values is used as the objective function. The x and y coordinates of the source location and the source release intensity st1 to st10 are used as decision variables. And the surrogate model is used as the equation constraint, together with other constraints, to construct the optimization model for the GCSI. The optimization model is constructed in the following formulation: where C obs i denotes the actual observed contamination concentration in the ith observation well, C cal i is the corresponding calculated value of the surrogate model, and Z is the objective function in the optimization model.
It should be noted that this paper carries out the identification of groundwater contamination source release intensity and contamination source location through hypothetical examples and does not carry out the identification of hydraulic conductivity. The Gaussian field of hydraulic conductivity established in this study is only to evaluate whether the GCSI's method based on BiLSTM and SSAS is feasible and highly accurate in the hypothetical case.

Accuracy comparison of surrogate models
The proposed BiLSTM is compared with BP and LSTM. From Fig. 8, it can be observed that the surrogate model based on BiLSTM converges faster than the other two models during training, completing convergence in 40, 100, and (24)  Table 3 compares the three surrogate models in terms of generalization performance and training time. As can be seen from the table, the training time of BiLSTM is slightly higher than the other two models, but BiLSTM achieves higher generalization accuracy compared to LSTM and BP. The training cost of the surrogate models is negligible compared to the time saved in solving the optimized model, so BiLSTM is chosen as the surrogate model for the next optimization step. Figure 9 shows the comparison of the predicted and actual observed values for the three surrogate models, illustrating that BiLSTM shows good performance not only in the validation dataset but also in a specific set of reference values.

Solution of the optimization model
The surrogate model based on BiLSTM is linked to the optimization model, and then the optimization model is solved using SSAS and SSA. When solving the optimization model, using surrogate models instead of simulation models for iterative calculations can drastically reduce the calculation load. The location, the initial release period, the ending release period, and the release intensity of the GCS in each release period were finally determined. Figure 10 shows the convergence curves of SSAS and SSA, and it can be seen that the fitness of SSAS is significantly lower than that of SSA, and the convergence speed of SSAS is faster than that of SSA. Compared with SSA, SSAS has a more uniform initial population distribution, stronger global search ability, and search efficiency. It can be seen from Table 4 that the identified values of st1, st2, and st10 are significantly lower than those of other periods, so it can be inferred that the initial release time is st3 and the termination time is st10. The average relative error of SSAS for the identification of source location is 9.4%, and the average relative error for the identification of source intensity is 1.83%, which are both lower than that of SSA at 11.12% and 3.03%.
To further compare the accuracy of the two search algorithms, the source locations and release intensities identified by SSAS and SSA were substituted into the simulation model for the forward calculation, and the simulated and observed values were compared through observation well No. 6 (Fig. 11).

Discussion
A hypothetical case was applied for this study for three reasons: (a) the real information about the pollution source in  the hypothetical case is known, which saves a lot of human and material resources and also verifies the effectiveness of the proposed method. (b) Since there are few actual cases available for the study, this study tries to fit the actual study system as much as possible. The Cholesky decomposition method was used to establish the Gaussian field of hydraulic conductivity in this study, and the value of K was fixed. Compared with the case designs of previous researchers (Pan et al. 2022b), the Gaussian field can be more closely fitted to the complex actual contaminated sites (but there is a gap in the complexity of the actual system). (c) Theoretical research ultimately serves the actual production life. The method proposed in this study is applied to a hypothetical case to achieve good results, which is a reference for the actual GCSI problem. From Ouyang et al. (2017) and Hou et al. (2019), it can be seen that the surrogate models built based on shallow learning methods have no obvious advantages or disadvantages in terms of accuracy. Li et al. (2021) compare shallow learning methods with LSTM deep learning methods and conclude that the LSTM surrogate model is superior. This study compares shallow learning methods, LSTM, and improved LSTM (BiLSTM), and finally concludes that the BiLSTM surrogate model has higher accuracy. However, the results of this paper only show that the BiLSTM method has some potential in building surrogate model and is not necessarily applicable to all cases.

Conclusions
In this study, a simulation-optimization method based on BiLSTM and SSAS is constructed based on a hypothetical case, and the spatial and temporal characteristics of point contamination sources are identified based on this framework. The following conclusions were drawn: (1) In this study, the BiLSTM method was applied to construct a surrogate model of the contaminant transport simulation model. Compared with BP and traditional LSTM, BiLSTM has better generalization performance for sequential data sets, and the surrogate model built by it has higher approximation accuracy for simulation models. The BiLSTM surrogate model has the R 2 value of 0.9910 and the RMSE of 3.7732. In addition, BiL-STM can reduce the large computational load generated by a large number of calls to simulation models in the simulation-optimization framework.
(2) In solving the optimization model, an improved SSA is proposed in this study. SSAS introduces Sobol sequences to enhance the population diversity of the initial sparrow and introduces nonlinear inertia weights to improve the global search ability and search efficiency. And it has high accuracy in the identification of both contaminant source location and release intensity (the relative errors are 9.4% and 1.83%).