Application of Machine Learning Approaches in Particle Tracking Model to Estimate Sediment Transport in Natural Streams

Several empirical equations and machine learning approaches have been developed to predict dispersion coefficients in open channels; however, the ability of some learning-based models to predict these coefficients has not yet been evaluated, and the direct application of machine learning-based dispersion coefficients to Lagrangian sediment transport models has not been studied. In this research, data from previous studies were used to evaluate the ability of ensemble machine learning models, i.e., random forest regression (RFR) and gradient boosting regression (GBR), to predict longitudinal and transverse dispersion in natural streams. The optimal principal parameters of the ensemble models were adjusted using the grid-search cross-validation technique, and the machine learning-based dispersion models were integrated with a Lagrangian particle tracking model (PTM) to simulate suspended sediment concentration in natural streams. The resulting suspended sediment concentration distribution was compared with the field data. The results showed that the GBR model, with a coefficient of determination (R²) of 0.95, performed better than the RFR model, with R² = 0.9, in predicting the longitudinal dispersion coefficients in a natural stream in both training and testing stages. However, the RFR model with R² = 0.94 performed better than the GBR (R² = 0.91) in predicting the transverse dispersion in the testing stage. Both models underestimated the dispersion coefficients in the training and testing stages. Comparison between the PTM with ensemble dispersion coefficients and empirical-based dispersion relationships revealed the better performance of the GBR model compared to the other two methods.

The eroded material from these activities adds to the sediment load, which can significantly impact water quality and threaten the lives and habitats of aquatic species (Ahmari et al., 2021; Sulaiman et al., 2021). Numerical models have been utilized to assess sediment transport in natural streams (e.g., Wu, 2004; Fang and Rodi, 2003) and estuaries (e.g., Tu et al., 2019; Ouda and Toorman, 2019); however, the mechanisms of sediment movements in a flow domain are not well understood (Shi and Yu, 2015). In recent decades, advancements in computer technology have significantly improved hydraulics and sediment transport numerical models that use computational techniques to solve mathematical equations governing flow and sediment movement in open channels. The choice of a model for each specific problem depends on the project's requirements, availability of data, and knowledge about the physical processes that determine the system's behavior.
There are two approaches to modeling sediment transport in streams: Eulerian-Eulerian and Eulerian-Lagrangian (Shi and Yu, 2015). Eulerian-Eulerian models simulate sediment movement as a continuous phase and consider the statistical properties of the sediment cloud, while Eulerian-Lagrangian models treat sediment particles as a dispersed phase and track the movement of individual sediment particles in the flow domain. Several studies have shown the applicability of Eulerian-Eulerian (Oh, 2011; Shi and Yu, 2015) and Eulerian-Lagrangian models (Tsai et al., 2020; Park and Seo, 2018; Oh and Tsai, 2018; Fan et al., 2016; Tsai et al., 2014; Macdonald et al., 2006; Niño and García, 1998) in natural streams.
The majority of stochastic models that simulate natural processes such as sediment entrainment or deposition are based on the Lagrangian approach (Oh, 2011). In these models, a particle tracking model (PTM) tracks individual sediment particles in the flow domain, using discretized advection and dispersion terms. Therefore, any changes in advection and dispersion terms have the potential to directly affect the movement of sediment particles. The advection term shows the effect of flow velocity on the movement of particles, and the dispersion coefficient shows the diffusivity effect. The longitudinal and transverse dispersion coefficients address variations of suspended sediment concentration along and across a stream channel. Figure 1 shows the results of sediment transport modeling in a prismatic open channel and the temporal and spatial distribution of suspended sediment concentration in the longitudinal and transverse directions.
Figure 1 Temporal and spatial evolution of suspended sediment concentration in the longitudinal and transverse directions of a prismatic rectangular open channel (Baharvand et al., 2023a)
Despite studies on modeling sediment transport in waterbodies, the effects of several parameters of sediment transport (e.g., size, shape, density, and composition of sediment particles, flow velocity, shear stress, etc.) are not well understood (Lick, 2009). The dispersion coefficient is one of the essential parameters in modeling sediment transport in natural streams, and several empirical equations, shown in Table 1, have been developed to estimate longitudinal (Dx) and transverse (Dy) dispersion coefficients. In these equations, H (m) is water depth, u (m/s) is depth-averaged velocity, u* (m/s) is shear velocity, W (m) is stream width, and Sn is stream sinuosity coefficient. In recent years, machine learning (ML) approaches have been used in several studies as an alternative to empirical equations to predict the dispersion coefficient in natural streams. The main advantage of using ML models (also known as soft computing techniques) is their independence from the physics of the problem (Baharvand et al., 2021; Salazar and Crookston, 2019). Several studies have examined the ability of these approaches to predict longitudinal dispersion coefficients. For instance, Toprak and Savci (2007) compared the performance of a learning-based model, fuzzy logic, with different empirical equations (e.g., an equation proposed by Kashefipour and Falconer, 2002), and showed the superiority of fuzzy logic models. Toprak and Cigizoglu (2008) also compared the accuracy of several ML models (i.e., feed-forward back propagation, radial basis function-based neural networks, and generalized regression neural networks) with different widely used empirical longitudinal dispersion coefficient equations and concluded that ML-based models are more reliable and more accurate than empirical equations in predicting longitudinal dispersion coefficients in natural streams. Azamathulla and Ahmad (2012) showed the superiority of gene expression programming (GEP) in predicting transverse dispersion in natural streams compared to Equation 5 (Fischer, 1967) and Equation 7 (Ahmad, 2007). Over the past two decades, several studies have shown the advantages of various soft computing models over empirical approaches in estimating dispersion coefficients (e.g., Azamathulla and Ghani, 2011; Piotrowski et al., 2012; Antonopoulos et al., 2015; Parsaie and Haghiabi, 2015; Zahiri and Nezaratian, 2020; Najafzadeh et al., 2021; Nezaratian et al., 2021).
Although the application and sensitivity analysis of machine learning approaches to predict the dispersion coefficient in natural streams have been studied by several researchers, these approaches have not been used in Lagrangian-based sediment transport models. In addition, the ability of some learning-based models, such as random forest and gradient boosting regression, to predict dispersion coefficients has not been evaluated. In this study, we used data from previous studies to examine the performance of ensemble machine learning models in predicting longitudinal and transverse dispersion in natural streams. The ML-based dispersion models were integrated with a Lagrangian particle-based sediment transport model, and the performance of the PTM-ML-based dispersion models was compared with the result of the PTM-empirical dispersion relationships.

Methodology
The two main steps of this study were the development and evaluation of dispersion models for natural streams using soft computing techniques, and combining learning-based dispersion models with a particle tracking model developed by Baharvand et al. (2023a). Field data from previous studies were used to develop machine learning models that could predict longitudinal and transverse dispersion coefficients. The model architecture was programmed in Python 3.9.0, a high-level, general-purpose programming language (Rossum, 1995), using several Python-based packages such as NumPy, SciPy, Pandas, Matplotlib, ArcPy, Seaborn (Waskom, 2021), and Scikit-learn. The following presents a summary of the PTM development process, case study area, field data collection, and development of the dispersion models.

Particle Tracking Model
The present study utilized a particle tracking model with a continuous sediment source to simulate the added sediment load in natural streams. The architecture and governing equations of the PTM are discussed by Baharvand et al. (2023a). The hydrodynamic parameters required by the PTM were exported from an HEC-RAS 2D model and calculated using a linear interpolation technique in the flow domain to estimate total displacement of the particles caused by the advection term. The details of the integration of the PTM with HEC-RAS 2D are presented in Baharvand (2022) and Baharvand et al. (2023b). The advection and dispersion terms were considered as displacement equations for individual particles in the flow domain in the PTM.
The PTM was developed based on two-dimensional advection coefficients in longitudinal and transverse directions to estimate sediment particles' advection displacement using streamwise (u) and transverse (v) velocity components. The dispersion coefficients were estimated in three dimensions to address the diffusivity effect on particles in x, y, and z directions. Different experimental and field studies have reported various values for three-dimensional dispersion coefficients (Dx, Dy, Dz). Some of the widely used equations to estimate longitudinal and transverse dispersion coefficients are listed in Table 1. In this study, discretized advection and dispersion coefficients were estimated using hydrodynamic parameters (flow depth, shear velocity, etc.) at the position of the sediment particles.
The dispersion coefficients in longitudinal and transverse directions were estimated using two methods: 1) empirical equations and 2) ML-based approaches. The performance of the longitudinal dispersion coefficient equation by Elder (1959) (Equation 1) and the transverse dispersion coefficient equation by Gualtieri and Mucherino (2008) (Equation 8) in a PTM developed for open channel flows was assessed by Baharvand et al. (2023a). In the present study, these empirical equations were used in the PTM to simulate sediment transport in a natural stream, and the results were compared with the PTM coupled with ML-based dispersion coefficients. More information on the development of the PTM with empirical equations may be found in Baharvand et al. (2023a). The following sections discuss the development of the ensemble ML-based dispersion prediction models, vertical displacement, stochastic term development, and discretized Lagrangian transport equations.
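As a minimal illustration of the empirical route, the sketch below evaluates a longitudinal dispersion coefficient from local flow depth and shear velocity. It assumes that Equation 1 takes the commonly cited form of Elder's relation, Dx = 5.93 H u*; Table 1 is not reproduced in this text, so the coefficient, the function name, and the input values are assumptions used for illustration only.

```python
def elder_longitudinal_dispersion(depth_m, shear_velocity_ms):
    """Longitudinal dispersion coefficient Dx (m^2/s).

    Assumes the commonly cited form of Elder (1959): Dx = 5.93 * H * u*.
    """
    return 5.93 * depth_m * shear_velocity_ms


# Hypothetical hydrodynamic values at one particle position.
dx = elder_longitudinal_dispersion(depth_m=1.2, shear_velocity_ms=0.05)
print(f"Dx = {dx:.4f} m^2/s")
```

In a PTM, a function of this kind would be evaluated at each particle position and time step, which is the same interface an ML-based estimator has to provide.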

Ensemble Dispersion Prediction Models
The dispersion coefficients in longitudinal and transverse directions can be predicted using empirical and ML-based models, and several ML approaches have been used to estimate dispersion coefficients in natural streams. For instance, Azamathulla and Wu (2011) … After an initial assessment of different machine learning models, ensemble ML models were used in this study to predict longitudinal and transverse dispersion coefficients. The advantage of ensemble approaches is that they build a powerful model from a collection of weak prediction models (Hastie et al., 2009; Shao and Deng, 2018). Among the several types of ensemble approaches (i.e., bagging, stacking, and boosting), RFR (classified as a bagging ensemble model) and gradient boosting regression (GBR) (classified as a boosting ensemble model) were selected as the ensemble tree-based models. The two main reasons for this selection were that 1) both models had demonstrated a high level of accuracy in generating longitudinal dispersion coefficients in previous studies with different input datasets, and 2) the literature lacks adequate information on the use of these two approaches for predicting transverse dispersion coefficients in natural streams.
The RFR and GBR ensemble models were developed using the scikit-learn package (Pedregosa et al., 2011) in Spyder IDE, a widely used open-source scientific environment written in Python.

Natural Stream Dispersion Dataset
The parameters used to estimate longitudinal and transverse dispersion coefficients in natural streams, obtained from previous studies, include flow velocity (u), flow depth (H), and shear velocity (u*). In some of the studies, stream width is also considered an effective parameter for predicting dispersion coefficients (Toprak and Savci, 2007; Zeng and Huai, 2014; Kargar et al., 2020; Nezaratian et al., 2021). Since the variation in the channel width in the present case study is insignificant (Section 2.2.1), it is not considered an effective parameter. Other parameters, such as channel longitudinal slope, Froude number, and Reynolds number, were also not included as effective parameters because of their lesser importance (Najafzadeh et al., 2021).
In the present work, 309 dispersion coefficient data points (77 longitudinal and 232 transverse) from more than 41 natural streams in North America were collected from the literature and used to train machine learning models to predict longitudinal and transverse dispersion in natural streams. Effective parameters for estimating longitudinal dispersion coefficients were extracted from studies by Fischer (1967), Yotsukura et al. (1970), Nordin and Sabol (1974), Rutherford (1994), Graf (1995), and Toprak and Cigizoglu (2008). The transverse dispersion coefficient dataset was obtained from Nezaratian et al. (2021), which contains data from various studies, including Fischer (1967), Yotsukura et al. (1970), Holley and Abraham (1973), Krishnappan and Lau (1977), Beltaos (1979), Rutherford (1994), Jeon et al. (2007), Baek and Seo (2008), and Lee and Seo (2013). The effective parameters for estimating dispersion coefficients, hereinafter called the feature dataset (H, u, u*), were plotted against the dispersion coefficients, together with the frequency of each feature, in Figure 3 for the longitudinal and transverse directions. Table 2 shows descriptive statistics of the longitudinal and transverse dispersion database used to develop the ensemble prediction models for the dispersion coefficients.

Development of Ensemble Machine Learning Models
Each machine learning model has specific hyperparameters (parameters that control the learning process) whose values strongly influence the performance of the learning-based model. GridSearchCV is a method that tunes each hyperparameter in an ML model by performing an exhaustive search of optimal parameters in a grid-wise manner (Alwated and El-Amin, 2021).
The GridSearchCV technique can perform pairwise computations of the hyperparameters. The present study used the GridSearchCV technique from the Scikit-learn package to assess the performance of the hyperparameters for the RFR and GBR ensemble models and relied on a randomly selected 80% of the collected dataset for each dispersion coefficient.
The tree-specific (max feature, min sample leaf, min sample split) and boosting (number of estimators) hyperparameters were considered in the GridSearchCV method to find the optimal combination for both ensemble models. In this technique, 5-fold cross-validation is employed, which splits the dataset into five random subsets (Figure 2).
Among the tree-specific hyperparameters, max feature represents the number of features that are considered when searching for the best split. Min sample leaf determines the minimum number of data points required in a leaf, which controls model overfitting. Min sample split refers to the minimum number of observations needed to split a node in the tree-based model. The number of estimators is a boosting hyperparameter that controls the number of trees in the model. Once the optimal hyperparameter combination was defined, the model's accuracy in both the training stage (with a randomly selected 80% of the dataset) and the testing stage (the remaining 20% of the dataset) was tested using different statistical measures. After the performance of the models was confirmed, the tuned ensemble models were used as ML-based dispersion estimators in the PTM to calculate the longitudinal and transverse dispersion coefficients in a natural stream.
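The tuning procedure described above can be sketched with scikit-learn as follows. This is an illustrative outline, not the study's script: the feature matrix is a synthetic stand-in for the (H, u, u*) dataset, the target is an arbitrary function chosen only so the example runs, and the candidate values in `param_grid` are assumptions rather than the grid actually searched.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

rng = np.random.default_rng(0)
n = 200
# Synthetic stand-in for the feature dataset (H, u, u*); NOT the paper's data.
X = np.column_stack([
    rng.uniform(0.2, 5.0, n),    # H (m)
    rng.uniform(0.1, 2.0, n),    # u (m/s)
    rng.uniform(0.01, 0.3, n),   # u* (m/s)
])
y = 5.93 * X[:, 0] * X[:, 2] * (1.0 + 0.1 * rng.standard_normal(n))

# Random 80/20 split into training and testing stages.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Hypothetical candidate values for the four tuned hyperparameters.
param_grid = {
    "max_features": [2, 3],
    "min_samples_leaf": [3, 5],
    "min_samples_split": [2, 4],
    "n_estimators": [50, 100],
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # five random folds

models = {
    "RFR": RandomForestRegressor(random_state=0),
    "GBR": GradientBoostingRegressor(learning_rate=0.1, random_state=0),
}
tuned = {}
for name, model in models.items():
    search = GridSearchCV(model, param_grid, cv=cv, scoring="r2")
    search.fit(X_train, y_train)  # exhaustive search over the grid
    tuned[name] = search
    test_r2 = search.best_estimator_.score(X_test, y_test)
    print(name, search.best_params_, f"test R2 = {test_r2:.3f}")
```

After fitting, `search.best_estimator_` is refit on the full training split with the best combination, so it can be passed directly to whatever code needs a dispersion estimate.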

Random Forest Regression
Random forests (RF) are bagging ensemble machine learning algorithms developed by Breiman (2001). This approach is a meta estimator that fits many prediction decision trees created by the bootstrapping technique on various sub-samples of the dataset (Pedregosa et al., 2011). The performance of random forests in classification and prediction problems is comparable to that of learning-based approaches such as boosting methods (Breiman, 2001) and support vector machines (Zhu, 2007).
This method controls the overfitting of the model and increases the prediction performance by using an averaging method (Pedregosa et al., 2011).
The RF method uses a measure of importance to rank the input variables and increase the prediction's accuracy. The importance of the variables is estimated by comparing the prediction error metrics of out-of-bag (OOB) samples, which provide an opportunity to estimate the unbiased prediction error during the random forest construction phase of the training stage, with those of in-bag samples (IBS). The RF model's flowchart is illustrated in Figure 4, and detailed information on random forest governing equations may be found in Breiman (2001).
Figure 4 Random Forest regression model flowchart
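The out-of-bag idea can be illustrated with scikit-learn's `RandomForestRegressor`. The sketch below uses synthetic data (not the study's dataset) in which the second feature has no effect on the target. Note that scikit-learn's `feature_importances_` attribute is an impurity-based measure, which differs from the OOB error comparison described above; it is used here only as a readily available ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(size=(300, 3))  # hypothetical columns standing in for H, u, u*
# Synthetic target: depends on columns 0 and 2 only.
y = 4.0 * X[:, 0] + 2.0 * X[:, 2] + 0.05 * rng.standard_normal(300)

rfr = RandomForestRegressor(
    n_estimators=200,
    min_samples_leaf=3,
    oob_score=True,   # score each sample using only trees that did not see it
    random_state=1,
)
rfr.fit(X, y)

print(f"OOB R2 = {rfr.oob_score_:.3f}")
print("impurity-based importances:", np.round(rfr.feature_importances_, 3))
```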

Gradient Boosting Regression (GBR)
Some boosting ensemble models can predict target values using weak classifiers generated by feature data samples. Gradient boosting is a popular algorithm that, unlike the random forest algorithms that use independent regression trees, builds an ensemble of regression trees in sequence, with each tree using the previous tree to learn and improve the prediction accuracy. In essence, boosting addresses the bias-variance tradeoff by using weak decision tree models and boosts the model accuracy using sequential trees. New weak models concentrate on the training rows of the weighted datasets that had lower prediction accuracy in the previous step. Figure 5 shows the gradient boosting regression flowchart. The shrinkage factor is an essential variable in GBR that describes the contribution of each weak model to the final prediction; the residual prediction of each tree in the ensemble is multiplied by this factor. The shrinkage factor (also known as the learning rate) ranges from 0 to 1. After different values for the shrinkage factor were tried in a trial-and-error process in this study, it was set at 0.1 to develop the GBR model.
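The sequential nature of gradient boosting can be seen by tracking the training error as trees are added. The following sketch uses scikit-learn's `GradientBoostingRegressor` with the shrinkage factor of 0.1 reported above; the data are synthetic and the number of trees is an illustrative choice, not the tuned configuration of the study.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(size=(300, 3))  # synthetic features, not the paper's data
y = np.sin(3.0 * X[:, 0]) + X[:, 1] * X[:, 2] + 0.05 * rng.standard_normal(300)

gbr = GradientBoostingRegressor(
    n_estimators=250,
    learning_rate=0.1,  # shrinkage factor applied to each tree's contribution
    random_state=2,
)
gbr.fit(X, y)

# staged_predict yields the ensemble prediction after each added tree.
train_mse = [mean_squared_error(y, pred) for pred in gbr.staged_predict(X)]
print(f"MSE after 1 tree:    {train_mse[0]:.4f}")
print(f"MSE after 50 trees:  {train_mse[49]:.4f}")
print(f"MSE after 250 trees: {train_mse[-1]:.4f}")
```

A smaller shrinkage factor makes each tree contribute less, so more trees are generally needed to reach the same training error.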

Performance Standards
Three statistical measures were utilized to assess the accuracy of predicted values: coefficient of determination (R²), root mean square error (RMSE), and mean absolute error (MAE). The total discrepancy ratio (DRs) was also used to evaluate the performance of ensemble models adopted in the present study. The DRs is a statistical measure that was used in several studies to assess the accuracy of ML-based models in predicting the dispersion coefficient in natural streams (e.g., Kashefipour and Falconer, 2002; Zeng and Huai, 2014; Nezaratian et al., 2021). Equations 9 to 12 show the R², RMSE, MAE, and DRs relationships, respectively.
where $x_i$ is the $i$-th observed dispersion coefficient, $\hat{x}_i$ is the $i$-th predicted dispersion coefficient, $\bar{x}$ refers to the mean dispersion coefficient, and $N$ is the number of data.
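Since Equations 9 to 12 are not reproduced in this text, the sketch below implements the four measures under stated assumptions: R² is taken as one minus the ratio of residual to total sum of squares, RMSE and MAE follow their standard definitions, and the total discrepancy ratio is assumed to be the mean of log10(predicted/observed), so that negative values indicate underestimation. The exact forms used in the study may differ.

```python
import numpy as np


def r_squared(obs, pred):
    """Assumed form: 1 - SS_res / SS_tot."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    ss_res = np.sum((obs - pred) ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))


def mae(obs, pred):
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.mean(np.abs(obs - pred)))


def total_discrepancy_ratio(obs, pred):
    """Assumed form: mean of log10(pred / obs); < 0 means underestimation."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    return float(np.mean(np.log10(pred / obs)))


# Hypothetical observed and predicted dispersion coefficients (m^2/s).
observed = np.array([1.0, 2.0, 4.0, 8.0])
predicted = np.array([0.9, 1.8, 4.2, 7.0])
print(r_squared(observed, predicted), rmse(observed, predicted),
      mae(observed, predicted), total_discrepancy_ratio(observed, predicted))
```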

Particles Vertical Displacement
The PTM does not consider the effects of the vertical advection term, and the vertical displacement of particles is calculated by considering the vertical dispersion coefficient and the settling velocity, which is due to gravitational forces acting on suspended particles. Therefore, the Van Rijn (1993) particle settling velocity equation was used to compute the temporal vertical displacement of each particle. Equation 13 shows the settling velocity of sediment particles.
where ws (m/s) is particle settling velocity, ν (m²/s) is the kinematic viscosity of water, S is the ratio of particle density to fluid density, and d (m) is the particle diameter.
Unlike longitudinal and transverse dispersion coefficients, which are computed based on both empirical and ML-based approaches discussed in previous sections, the vertical dispersion coefficient is computed using Equation 14, proposed by Van Rijn (1987).
where H represents flow depth, z is the vertical elevation of particles, and Dzm is the maximum vertical dispersion coefficient, in which u* is shear velocity and κ refers to the von Kármán constant (κ = 0.4).
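Equations 13 and 14 are not reproduced in this text. As a hedged illustration, the sketch below uses the forms commonly attributed to Van Rijn: a three-range settling velocity formula (Stokes range for fine particles, a transitional expression for sand sizes, and a square-root law for coarse particles) and a parabolic-constant vertical mixing profile with Dzm = 0.25 κ u* H. These forms, the size thresholds, and the default values are assumptions and may not match the equations used in the study.

```python
import math

G = 9.81      # gravitational acceleration (m/s^2)
KAPPA = 0.4   # von Karman constant


def settling_velocity(d, s=2.65, nu=1.0e-6):
    """Settling velocity ws (m/s) for particle diameter d (m).

    Assumed three-range form commonly attributed to Van Rijn.
    """
    if d <= 1.0e-4:   # fine particles: Stokes range
        return (s - 1.0) * G * d**2 / (18.0 * nu)
    if d <= 1.0e-3:   # sand range
        return 10.0 * nu / d * (
            math.sqrt(1.0 + 0.01 * (s - 1.0) * G * d**3 / nu**2) - 1.0
        )
    return 1.1 * math.sqrt((s - 1.0) * G * d)  # coarse particles


def vertical_dispersion(z, depth, shear_velocity):
    """Vertical dispersion coefficient Dz (m^2/s) at elevation z (m).

    Assumed parabolic-constant profile: parabolic in the lower half of the
    depth and constant (Dzm) in the upper half.
    """
    dzm = 0.25 * KAPPA * shear_velocity * depth
    rel = z / depth
    if rel < 0.5:
        return 4.0 * dzm * rel * (1.0 - rel)
    return dzm


print(f"ws(0.2 mm) = {settling_velocity(2.0e-4):.4f} m/s")
print(f"Dz(z=0.2 m) = {vertical_dispersion(0.2, 1.0, 0.05):.5f} m^2/s")
```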

Stochastic Random Walk Method
Random behavior of sediment particles in natural streams can be modeled using different stochastic approaches, such as the stochastic jump diffusion model (Oh, 2011) and the random walk method. Due to the acceptable performance of the random walk method, various studies have used this method to calculate the random motion of sediment particles in turbulent flows (Lane and Prandle, 2006; Taghvayi, 2013; Shi and Yu, 2015). In the present study, the random walk method was used to generate the stochastic term, using a normal distribution probability function (Equation 15) with a mean of μ = 0 and a standard deviation of σ = 1.

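$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \tag{15}$$
In the discretized Lagrangian transport equations, $\Delta x$, $\Delta y$, and $\Delta z$ are the total particle displacements in the streamwise, transverse, and vertical directions after one computational time step ($\Delta t$), and $u_p$ and $v_p$ are the linearly interpolated velocity components in the streamwise and transverse directions at the particle location, respectively.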
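The discretized transport equations themselves are not reproduced in this text. The sketch below shows a textbook random-walk step that is consistent with the description: advection by the interpolated velocities in x and y, settling in z, and a dispersion displacement of ξ√(2DΔt) in each direction, with ξ drawn from a standard normal distribution. The function name, the omission of dispersion-gradient drift terms, and all input values are assumptions, not the study's implementation.

```python
import numpy as np


def random_walk_step(u_p, v_p, w_s, d_x, d_y, d_z, dt, rng):
    """Return (dx, dy, dz) particle displacements over one time step dt.

    u_p, v_p : velocity components at the particle positions (m/s)
    w_s      : settling velocity, positive downward (m/s)
    d_x, d_y, d_z : dispersion coefficients at the particles (m^2/s)
    """
    # Stochastic term: standard normal (mean 0, standard deviation 1).
    xi = rng.normal(loc=0.0, scale=1.0, size=(3,) + np.shape(u_p))
    dx = u_p * dt + xi[0] * np.sqrt(2.0 * d_x * dt)
    dy = v_p * dt + xi[1] * np.sqrt(2.0 * d_y * dt)
    dz = -w_s * dt + xi[2] * np.sqrt(2.0 * d_z * dt)  # no vertical advection
    return dx, dy, dz


rng = np.random.default_rng(0)
n = 10_000
# Hypothetical uniform flow conditions for all particles.
u_p, v_p = np.full(n, 0.5), np.full(n, 0.02)
dx, dy, dz = random_walk_step(
    u_p, v_p, w_s=0.025, d_x=0.3, d_y=0.01, d_z=0.005, dt=1.0, rng=rng
)
print(dx.mean(), dy.mean(), dz.mean())
```

Either an empirical equation or a tuned ML estimator can supply `d_x` and `d_y` here, which is the point at which the two dispersion approaches enter the PTM.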

Study Area
A section of Wilson Creek near Highway FM 2478 in McKinney, Texas, U.S. was selected as the study area (Figure 6), as road expansion and bridge replacement projects were expected to add sediment load to the creek. The bridge location and construction site footprints on the south and north sides of Wilson Creek are shown in Figure 6. The sediment regime in the creek was monitored during the construction period in 2021. According to the historical flow data at the USGS 08059590 gauge station (11.2 km downstream of the bridge location), the mean daily flow varied between 0 and 37.4 m³/s, with an average of 1.98 m³/s (U.S. Geological Survey (USGS), 2016). Two storm events, on April 29 and November 3, 2021, with average daily discharges of 37.4 m³/s and 7.4 m³/s, respectively, were considered, and it was estimated that during these events a total of 12.8 and 2.4 tonnes/day of sediment, respectively, entered the creek due to overland erosion from the construction sites (Ahmari et al., 2022a).

Field Measurements
Flow and sediment characteristics were monitored at the Wilson Creek bridge site from December 2020 to December 2021. The monitoring site extended from upstream of the bridge to downstream of the area impacted by construction activities and was visited after each storm event, with an average of 2 to 4 visits per month, depending upon the rainfall and streamflow conditions. The field program included monitoring total suspended solids (TSS), turbidity (Tu), bedload material, substrate type, and depositional areas. Two ISCO 6712 Teledyne water sampler units were installed upstream and downstream of the bridge location to collect samples during storm-based events for lab analysis of the concentration of the suspended sediment (Figure 6). The upstream unit collected water samples before the bridges, where the sediment load in the creek was not impacted by the local overland erosion in the construction areas. The samples collected by the downstream sampler represented the cumulative effect of the sediment load from upstream and the sediment load entering the creek from the bridge construction areas. The TSS data from the upstream and downstream units were compared to estimate the elevated suspended sediment in the creek due to the overland erosion.
Discrete TSS and turbidity (Tu) samples were collected at the location of the automated water samplers, as well as from the area between the bridge location and the downstream water sampler unit, where the sediment regime in the creek was impacted by the construction activities (section 2 in Figure 6). The turbidity of the samples was measured in NTU, using a Hach 2100Q portable turbidimeter. The TSS and Tu discrete data were collected mainly during low flow conditions to complement the sediment data collected by the automated water samplers during storms. Water samples collected by the automated samplers and grab samples were sent to the lab for TSS analysis, using EPA method 160.2 (United States Environmental Protection Agency, 2017). More information on the field monitoring program can be found in Ahmari et al. (2022a).
Due to the lack of direct measurement of TSS on April 29, 2021, the suspended sediment concentration was estimated using the TSS-Tu relationships (Equation 19) that were developed based on the data collected by water samplers and discrete data collected upstream and downstream of the bridge. In these equations, TSS is in mg/L and Tu is in NTU. Grab samples were collected from the eroded materials at the construction site and depositional areas in the creek to determine the sediment gradations contributing to the total sediment load. A minimum of 500 grams of sample was collected from each area and sent to the lab for gradation tests. The content of gravel, sand, silt, and clay in the samples was measured by sieve analysis and hydrometry tests after debris and other objects were removed. Based on this analysis, the average fractions of clay/silt, sand, and gravel were estimated as 10%, 75%, and 15%, respectively (Baharvand et al., 2023b). These values were used in the PTM as the gradation of the sediment load entering the creek from the north and south construction areas.

Wilson Creek Particle Tracking Model
This study adopted the particle tracking model developed by Baharvand et al. (2023a), which simulates sediment transport in natural streams and estimates dispersion coefficients by employing empirical equations. A calibrated HEC-RAS 2D model for the study area was coupled with the PTM to provide hydrodynamic parameters. Two storm events, on April 29 and November 3, 2021, with average daily discharges of 37.4 m³/s and 7.4 m³/s, respectively, were considered. The PTM was used to predict the suspended sediment concentration variations in different sections of the creek downstream of the bridge location, and its performance was evaluated by using field data collected from the bridge construction site in Wilson Creek. The dispersion coefficients estimated using ML methods were also used in the Wilson Creek PTM. The distributions of the suspended sediment in the creek were predicted by the PTM, using empirical and ML-based methods, and the results were compared with the field data.

Prediction of Longitudinal Dispersion Coefficient
This section presents the longitudinal dispersion coefficient that was estimated by using the random forest (RFR) and gradient boosting regression (GBR) models. The accuracy of each model is discussed in the following section for several hyperparameter tuning scenarios, using the GridSearchCV method.

RFR and GBR Ensemble Prediction Models
Comparisons of the observed and estimated longitudinal dispersion coefficients using the RFR and GBR models in the training and testing stages are shown in Figures 7 and 8. As can be seen, the RFR model predicted the longitudinal dispersion coefficient with R² = 0.93 in the training stage (Figure 7a); however, the GBR model showed more accuracy with R² = 0.95 (Figure 7b). The RMSE and MAE of the GBR model were smaller than those of the RFR model for the training stage (Table 3), which means that the GBR model estimated Dx more accurately. The total discrepancy ratios of the RFR and GBR models for the training stage were estimated as DRs = -0.16 and -0.… The RFR and GBR models' predictions of the longitudinal dispersion show that the GBR model has advantages over the RFR in both the training and testing stages. As discussed in Section 2.1.1, the hyperparameter values for both stages of both models were determined through the GridSearchCV approach. By utilizing the GridSearchCV technique, the optimal values for the RFR hyperparameters to predict longitudinal dispersion, i.e., max feature, min sample leaf, min sample split, and number of estimators, were estimated as 3, 3, 2, and 200, respectively (Table 3). The optimal max feature, min sample leaf, min sample split, and number of estimators hyperparameters of the GBR model were estimated as 3, 5, 4, and 250, respectively. Table 4 summarizes the statistical measures for the optimal hyperparameter values for the RFR and GBR models in the training and testing stages.

Prediction of Transverse Dispersion Coefficient
The optimal values for the RFR hyperparameters, i.e., max feature, min sample leaf, min sample split, and number of estimators, for predicting the transverse dispersion coefficient were estimated as 3, 5, 2, and 250, respectively. The optimal max feature, min sample leaf, min sample split, and number of estimators hyperparameters of the GBR model, however, were estimated as 3, 10, 2, and 250, respectively. The GBR model, optimized with GridSearchCV, demonstrated better accuracy than the RFR model in predicting the transverse dispersion coefficient during training, but both models underestimated Dy in training and testing. Both models showed acceptable accuracy in predicting longitudinal and transverse dispersion coefficients.

Suspended Sediment Concentration in Wilson Creek Predicted by PTM
The suspended sediment concentrations predicted by the PTM with machine learning and the PTM with empirical dispersion models are compared for two flow scenarios, i.e., Q = 37.4 m³/s on April 29, 2021 and Q = 7.4 m³/s on November 3, 2021. The simulation time (t = 2400 s) was determined based on each dispersion model to ensure that all particles injected at the sediment sources exited the downstream boundary of the model.

Suspended Sediment Concentration -April 29, 2021 Storm
The increase in suspended sediment concentration in Wilson Creek due to the overland erosion in the construction site was estimated using ensemble learning-based and empirical dispersion models and is presented in Figure 15. As mentioned in Section 2.2.1, the construction activities … The maximum increases in suspended sediment concentration (SSC) on the north side of the creek were estimated as 146, 105, and 153 mg/L by the PTM with the empirical dispersion equation, RFR, and GBR, respectively (Figures 15a-c). The SSC values for the south side of the creek were 201, 216, and 287 mg/L, respectively. The PTM with the RFR dispersion model predicted the maximum SSC on the north and south sides 24% lower than the GBR and 7% higher than the values estimated by the empirical dispersion model. The maximum SSC predicted by the PTM with the GBR was 40% higher for the south side and 5% higher for the north side compared to the estimates by the PTM using the empirical dispersion equation.
The elevated suspended sediment concentration across CS-1 to CS-4 is depicted in Figure 16.
The maximum elevated SSC across CS-1 was estimated as 160, 198, and 229 mg/L by the PTM with the empirical equation, RFR, and GBR, respectively (Figure 16a). The suspended sediment concentration at cross section CS-1 (27 m downstream of the bridge site) was compared with the TSS values estimated from Equation 19b. The TSS estimated using the TSS-Tu relationship (Equation 19b) and the turbidity measurements in the construction zone (section 2 in Figure 6) for different days of field measurement ranged from 120 to 226 mg/L (Baharvand et al., 2023b). As shown in Figure 16a, the PTMs with the empirical equation and the RFR model estimated the maximum suspended sediment concentrations in section CS-1 within an acceptable range, showing the applicability of these PTMs. The PTM with the empirical equation and the RFR model underestimated the SSC, compared to the maximum measured TSS during field measurement, by about 29% and 12%, respectively; however, the difference between the maximum SSC estimated by the PTM with the GBR dispersion model and the upper limit of the TSS range predicted using Equation 19b was negligible.
The elevated suspended sediment concentrations across Wilson Creek at cross sections CS-2, CS-3, and CS-4, located 67, 163, and 335 m downstream of the bridge site, are shown in Figures 16b-d. As the sediment plume traveled downstream, it became diluted, and the suspended sediment concentration decreased. Unlike the suspended sediment concentration distribution in CS-2 (Figure 16b), the maximum sediment concentration estimated by the PTM with the empirical dispersion model was higher than the results from the PTM with the ensemble dispersion models in CS-3 and CS-4 (Figures 16c and d). According to these figures, the PTM with the empirical dispersion model estimated greater sediment concentrations than the other two dispersion models in CS-3 and CS-4, especially from the center towards the south side of the creek.
The RFR and GBR models in CS-1 and CS-2 showed higher SSC on the south side of the creek (Figures 16a and b). The maximum elevated sediment concentrations across CS-3 were estimated as 28, 19, and 23 mg/L for the PTMs with the empirical equation, RFR, and GBR, respectively. In the downstream sections (CS-3 and CS-4), the sediment concentration estimated by the PTM with the empirical model was higher than that of the ensemble models, especially from the center to the south side of the creek. These differences could be due to the underestimation of the empirical dispersion coefficients, which results in less displacement of particles and an increase in sediment concentrations.
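The link between a smaller dispersion coefficient and less particle displacement can be illustrated with a generic one-dimensional random-walk step, in which each particle moves by u Δt plus a random increment scaled by sqrt(2 D Δt). This is a minimal sketch of a common Lagrangian formulation, not necessarily the scheme used in the present PTM; the velocity, time step, particle count, and D values are hypothetical.

```python
import math
import random

def spread_after_walk(d_coeff, n_particles=5000, n_steps=50, dt=1.0,
                      velocity=0.5, seed=1):
    """Standard deviation of particle positions after a 1-D advective random walk."""
    rng = random.Random(seed)
    # Random displacement per step scales with sqrt(2 * D * dt)
    step_scale = math.sqrt(2.0 * d_coeff * dt)
    positions = []
    for _ in range(n_particles):
        x = 0.0
        for _ in range(n_steps):
            x += velocity * dt + rng.gauss(0.0, 1.0) * step_scale
        positions.append(x)
    mean = sum(positions) / n_particles
    variance = sum((p - mean) ** 2 for p in positions) / n_particles
    return math.sqrt(variance)

# Hypothetical dispersion coefficients (m^2/s): a low and a high estimate
low_d_spread = spread_after_walk(d_coeff=0.05)
high_d_spread = spread_after_walk(d_coeff=0.20)

print(f"spread with low D:  {low_d_spread:.2f} m")
print(f"spread with high D: {high_d_spread:.2f} m")
```

With the lower coefficient the particle cloud stays narrower, so the same number of particles occupies a smaller area and the local concentration is higher.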
Three longitudinal profiles illustrate the variations of SSC estimated by the different dispersion models along the creek. Figure 15a shows the longitudinal profiles on the south side (L-1), center line (L-2), and north side (L-3) of the creek. The changes in elevated SSC along L-1 to L-3 are shown in Figure 17 for the PTM with the empirical equation, RFR, and GBR. According to Figure 17a, along the south longitudinal section (L-1), the maximum elevated SSC varied between 193 mg/L (PTM with the empirical dispersion model) and 261 mg/L (PTM with the GBR dispersion model). The high concentration values predicted for the area between 10 m and 30 m downstream of the bridge location were expected because of the proximity of this area to the sediment sources on both sides of the creek. Downstream of this high concentration area, the sediment concentration decreased gradually for all three dispersion models to approximately 47 mg/L at 70 m downstream of the bridge on the south side. As shown in Figure 17a, the PTM with the GBR estimated a higher concentration along L-1 up to 70 m downstream of the bridge than the PTM with the RFR and empirical dispersion models. The PTM with the RFR and the empirical-based dispersion models showed similar SSC values in the downstream areas (100 m and further from the bridge), where the sediment plume became diluted with the creek flow.
The suspended sediment concentration along the center line of the creek (line L-2) was estimated as 5 to 53 mg/L by the three dispersion models in an area between 5 m and 51 m downstream of the bridge location (Figure 17b). In this area, the PTM with the GBR dispersion model resulted in higher concentrations than the RFR and empirical dispersion models (Figure 17b). Beyond 50 m downstream of the bridge location along the center line, the SSC estimated by the PTM with the GBR was lower than the RFR prediction. Overall, the PTM with the empirical-based dispersion model estimated the SSC in a range between the RFR and GBR estimates along the center line of the creek for high flow conditions (Q = 37.4 m³/s on April 29, 2021), except between 37 m and 87 m downstream of the bridge, where the flow experiences the first meander of the channel.
Along the north side of the creek (line L-3), the sediment concentration estimated by the PTM with the empirical dispersion model was 40% higher than that of the PTM with the RFR dispersion model and less than 6% lower than that of the PTM with the GBR dispersion model for an area up to 16 m downstream of the bridge location (Figure 17c). Beyond 38 m downstream of the bridge location, the SSC from the PTM with the empirical dispersion model was smaller than the values estimated by the PTM with the RFR and GBR dispersion models. The SSC from the PTM with the GBR was slightly higher than the result of the RFR dispersion model from 32 to 165 m downstream of the bridge. All three PTMs showed similar trends in the fully mixed area (from 190 m to 350 m downstream of the bridge).
Figure 17 shows that the PTM with the GBR dispersion model predicted higher concentration values than the RFR and empirical models from the bridge location to approximately 50 m downstream, along the south, center, and north sides of the creek. In the downstream sections, the PTM with the empirical and RFR dispersion models showed similar concentration distributions. Unlike along the south and center lines of the channel, the PTM with the empirical dispersion model estimated higher sediment concentrations than the RFR dispersion model in areas near the bridge location along the creek's north bank. However, the PTM with the empirical dispersion model showed smaller concentrations than the RFR and GBR dispersion models in downstream sections on the north side of the creek.
The longitudinal profiles L-1 to L-3, shown in Figure 18a, were also used to investigate the distribution of the sediment concentration along the creek. The elevated suspended sediment concentration across CS-1 to CS-4 on November 3, 2021 is depicted in Figure 19. The maximum elevated SSC across CS-1 was estimated as 421, 514, and 506 mg/L by the PTM with the empirical equation, RFR, and GBR dispersion models, respectively (Figure 19a). TSS samples were collected by the water sampler units on November 3, 2021 at the center line of the creek upstream of the bridge and at section CS-2 (Figure 6). The average TSS upstream of the bridge and at section CS-2 was measured as 35 mg/L and 174 mg/L, respectively. This means that the local overland erosion elevated the average TSS by 139 mg/L along the center line of the creek between the sampling point upstream of the bridge and CS-2. The PTM estimated the elevated suspended sediment concentration at the creek's center line (point 3 in Figure 19b) as 156, 146, and 143 mg/L using the empirical equation, RFR, and GBR, respectively. Therefore, the PTM with the empirical equation, RFR, and GBR overestimated the elevated sediment concentration at the center line by 12, 5, and 2.8%, respectively. The average TSS of 174 mg/L at CS-2 was calculated from the TSS of three water samples collected during the November 3 storm, i.e., 31, 171, and 319 mg/L. Considering the variability of TSS with time during the storm and the fact that a steady-state flow was simulated for this storm, this range of difference between the actual and estimated TSS values seems reasonable.
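The center-line comparison at CS-2 can be retraced step by step: average the three storm samples, subtract the upstream TSS, and express each PTM estimate relative to the resulting elevation. The sketch below uses only the values reported above; the variable names are illustrative.

```python
# TSS of the three water samples collected at CS-2 during the storm (mg/L)
samples_cs2 = [31.0, 171.0, 319.0]
mean_cs2 = sum(samples_cs2) / len(samples_cs2)

# Average TSS measured upstream of the bridge (mg/L)
upstream_tss = 35.0

# The text works with the rounded average (174 mg/L), giving a 139 mg/L elevation
measured_elevation = round(mean_cs2) - upstream_tss

# Elevated SSC at the center line of CS-2 estimated by each PTM (mg/L)
ptm_estimates = {"empirical": 156.0, "RFR": 146.0, "GBR": 143.0}

overestimates = {
    name: 100.0 * (value - measured_elevation) / measured_elevation
    for name, value in ptm_estimates.items()
}

print(f"mean TSS at CS-2: {mean_cs2:.1f} mg/L")
print(f"measured elevation: {measured_elevation:.0f} mg/L")
for name, pct in overestimates.items():
    print(f"{name}: +{pct:.1f}%")
```

The computed overestimates agree with the reported 12, 5, and 2.8% to within rounding.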
According to Figure 19c, the PTM with the empirical equation, RFR, and GBR estimated the average elevated SSC in section CS-3 as 127, 113, and 114 mg/L, respectively. These values show that the PTM with the RFR and GBR models predicted similar sediment concentrations in the downstream part of the creek, where the sediment was almost fully mixed across the channel. All three models predicted very small sediment concentration values at section CS-4 (< 0.1 mg/L). These values were negligible compared to the sediment concentrations estimated at sections CS-1 and CS-2, and their variations were less significant.
Three longitudinal profiles were used to extract the suspended sediment concentration distributions estimated by the PTM using the three dispersion models. The extent of profiles L-1 to L-3 is depicted in Figure 18a. The increases in the SSC along L-1 to L-3 are shown in Figure 20 for the different dispersion models. Comparing Figure 20 with Figure 17 shows that the sediment concentration values of all the PTMs were significantly higher than those estimated for the high flow scenario. It could be argued that the smaller flow area and depth during the low flow resulted in locally higher SSC in the creek, even though the sediment load entering the creek during high flow (12.8 tonnes/day) was much higher than the sediment load entering the creek during the low flow condition (2.4 tonnes/day). The average width of the creek 350 m downstream of the bridge was approximately 16 and 34 m during the low and high flow scenarios, respectively. The average flow depth during the high flow scenario was approximately 2.5 times that during the low flow scenario. The flow depth at the center line of CS-1 and CS-4 for the low flow scenario was 0.9 m and 2.3 m, respectively, while the flow depth at the same locations during the high flow scenario was 2.9 m and 5.5 m, respectively. Most of the sediment particles entering the creek from the construction site were deposited before section CS-3 during the low flow scenario; fewer particles were transported to downstream sections than during the high flow scenario (Baharvand et al., 2023b). Therefore, a higher concentration of suspended particles was expected between the bridge location and section CS-3 during the low flow scenario.
The PTM with the empirical dispersion model predicted smaller sediment concentrations along the north, south, and center line profiles between the bridge and 40 m downstream, as shown in Figure 20, whereas the PTM with the RFR and GBR predicted similar sediment concentrations. On the south side of the creek, the suspended sediment concentration estimated by the PTM with the RFR and GBR models decreased from 452 mg/L to 63 mg/L at 209 m downstream of the bridge (Figure 20a). The maximum concentration estimated for the south side using the PTM with the empirical dispersion model was 341 mg/L at 7 m downstream of the bridge, decreasing to 36 mg/L at 209 m downstream of the bridge. A sharp increase in sediment concentration of more than 200 mg/L was detected between 244 m and 254 m downstream of the bridge location on the south side, which is also visible in Figures 18b and c. The sharp increase in the SSC could be due to the shallower flow depth at this location (0.98 m) compared to adjacent areas. From this location, the creek bed elevation decreases, and the flow depth increases to 2.3 m over a distance of 18 m. The applied shear stress at this location was greater than the particles' critical shear stress, and no deposition occurred at this point (Baharvand et al., 2023b). Therefore, a large number of resuspended particles was expected at this location, which increases the sediment concentration significantly.
The maximum concentration estimated for the central part of the channel was 585 mg/L according to the PTM with the RFR and GBR dispersion models; the PTM with the empirical dispersion model estimated a maximum concentration of 459 mg/L at this location (Figure 20b).
On the north side of the creek, the PTM with the empirical dispersion model predicted lower sediment concentration values in most areas along the longitudinal section than the PTM with the RFR and GBR dispersion models (Figure 20c). The SSC was negligible along the north, south, and central profiles beyond 285 m downstream of the bridge due to a high rate of sediment deposition after this segment of the creek. Field observations and the results of numerical modeling of sediment deposition in Wilson Creek corroborate the low SSC in this area of the creek (Baharvand et al., 2023b).

Conclusion
Previous studies have used empirical equations and machine learning approaches to estimate dispersion coefficients in natural streams, but some learning-based approaches, such as bagging and boosting ensemble models, have not been studied. Also, despite various studies on the sensitivity and application of machine learning, these methods have not been used in Lagrangian-based sediment transport models. The present study used data from previous studies to investigate the performance of two ensemble machine learning models, random forest regression (RFR) and gradient boosting regression (GBR), in predicting longitudinal and transverse dispersion coefficients in natural streams. The resulting data-driven dispersion models were integrated with a Lagrangian particle tracking model (PTM) to simulate suspended sediment concentration in natural streams.
of the creek and were used to develop the suspended sediment concentration distributions across and along the channel. A comparison between the cross-sectional sediment concentrations and the field data showed acceptable accuracy of all three models in predicting suspended sediment concentration distributions in the creek. The average sediment concentrations from the PTM with the GBR model correlated better with the results of the field investigations for the high and low flow scenarios; however, the sediment concentration heat maps showed that the mixing process of the PTM with the GBR was faster than that of the PTM with the RFR and empirical models.
The present model simulated steady-state flow and estimated the vertical dispersion coefficient with empirical and machine learning-based models. Future research could improve the model's capabilities by adding unsteady flow and bedload transport and by using machine learning for the vertical dispersion coefficient. A further limitation is that the dataset used may not be applicable to other streams; future research should increase the number of dispersion coefficient datasets to enhance the model's reliability.

Figure 2
Figure 2 Flowchart of the in-stream sediment transport model with learning-based and empirical dispersion coefficients

Figure 5
Figure 5 Flow chart of gradient boosting regression model

As shown in Figure 5, the GBR is trained using N regression trees (weak models). In the first step, the weak model is trained using the feature matrix X (a matrix of H, u, u*); the target value is Y (dispersion terms Dx and Dy). The prediction residual error of the first step ($R_1$) is estimated using the observed ($y_1$) and predicted ($\hat{y}_1$) target values. Once the residual of the first step has been determined, the second weak model is trained using the feature matrix X, with the residual errors calculated in step 1 ($R_1$) as the target values. The step 2 prediction residual error ($R_1 - \hat{R}_1$) is used as the target value for the next step, and this sequential prediction process is repeated until the ensemble tree-based gradient boosting model is trained.
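The sequential residual fitting described above can be sketched in a few lines of plain Python. This is a simplified illustration, not the implementation used in the study: it uses a single input feature instead of the matrix of H, u, and u*, one-split trees (stumps) as the weak models, a mean-value starting prediction, and a learning rate, all of which are assumptions made for brevity. The data are synthetic.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    # Candidate thresholds: every distinct x except the largest, so both sides are non-empty
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean

def fit_gbr(x, y, n_trees=50, learning_rate=0.1):
    """Train weak models sequentially, each one on the residuals of the previous steps."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    trees = []
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        tree = fit_stump(x, residuals)
        trees.append(tree)
        predictions = [pi + learning_rate * tree(xi)
                       for xi, pi in zip(x, predictions)]
    return lambda xi: base + learning_rate * sum(tree(xi) for tree in trees)

# Synthetic feature and target values (illustrative only, not from the dataset)
x = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, 1.6]
y = [1.0, 1.8, 3.1, 4.2, 4.8, 6.1, 7.2, 7.9]

model = fit_gbr(x, y)
mean_y = sum(y) / len(y)
mse_baseline = sum((yi - mean_y) ** 2 for yi in y) / len(y)
mse_boosted = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(f"baseline MSE: {mse_baseline:.3f}, boosted MSE: {mse_boosted:.3f}")
```

Each added stump reduces the training error left by the previous ones, which is why the boosted error ends up below the error of the constant starting prediction.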

Figure 7
Comparison between observed and predicted longitudinal dispersion coefficients (Dx) by a) RFR, and b) GBR in the training stage

Figure 8 compares the Dx that was observed and predicted by the RFR and GBR in the testing stage. As shown, the GBR model predicted the longitudinal dispersion coefficient more accurately in the testing stage (R² = 0.9) than the RFR model (R² = 0.86). The RMSE of the RFR model was determined as 79.1 m²/s in the testing stage, which is higher than the RMSE of the GBR model (66.1 m²/s). Figures 9 and 10 illustrate the overestimation and underestimation of the RFR and GBR for each observation in the dataset in the training and testing stages.
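For reference, the two goodness-of-fit statistics quoted above can be computed as in the sketch below. The RMSE follows its standard definition; R² is computed here as one minus the ratio of residual to total sum of squares, which is an assumption, since some studies report the squared correlation coefficient instead. The observed and predicted values are synthetic.

```python
import math

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Synthetic dispersion coefficients (m^2/s), for illustration only
observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]

print(f"RMSE = {rmse(observed, predicted):.3f} m^2/s")
print(f"R^2  = {r_squared(observed, predicted):.2f}")
```

A lower RMSE and an R² closer to 1 both indicate closer agreement between the predicted and observed dispersion coefficients.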

Figure 11
Figure 11 shows a comparison of the observed and predicted transverse dispersion coefficients (Dy) using the RFR and GBR in the training stage. The coefficient of determination of the RFR model for the training stage was calculated as R² = 0.92, and the RMSE was determined as 0.007 m²/s. The discrepancy ratio of the RFR in the training stage was DRs = -0.071, showing that the

Figure 13
Observed and predicted transverse dispersion coefficients (Dy) for each data sample by a) RFR, and b) GBR in the training stage

Figure 14
Observed and predicted transverse dispersion coefficients (Dy) for each data sample by a) RFR, and b) GBR in the testing stage

Figure 15
Increase in suspended sediment concentration in Wilson Creek due to overland erosion corresponding to the April 29, 2021 storm, estimated by the PTM using different dispersion coefficients: a) Empirical equation, b) RFR model, and c) GBR model

produced 12.8 tonnes/day of sediment yield on April 29, 2021, which elevated the sediment concentration downstream of the bridge location. The suspended sediment concentrations estimated by the PTM with the RFR and GBR models were compared at four cross sections downstream of the bridge location.

Figure 16
Elevated suspended sediment concentration across Wilson Creek due to overland erosion corresponding to the April 29, 2021 storm: a) Cross section CS-1, b) Cross section CS-2, c) Cross section CS-3, and d) Cross section CS-4. Cross sections are shown in Figure 15

for the PTM with the empirical equation, RFR, and GBR. According to Figure 17a, along the south longitudinal section (L-1), the maximum elevated SSC varied between 193 mg/L (PTM with the empirical dispersion model) and 261 mg/L (PTM with the GBR dispersion model).

Figure 19
Elevated suspended sediment concentration across Wilson Creek due to overland erosion corresponding to the November 3, 2021 storm: a) Cross section CS-1, b) Cross section CS-2, c) Cross section CS-3, and d) Cross section CS-4. Cross sections are shown in Figure 18

values were negligible compared to the sediment concentrations estimated at sections CS-1 and CS-2, and their variations were less significant. Three longitudinal profiles were used to extract the suspended sediment concentration distributions estimated by the PTM using the three dispersion models. The extent of profiles L-1 to L-3 is depicted in Figure 18a. The increases in the SSC along L-1 to L-3 are shown in Figure 20 for the different dispersion models.

Figure 20
Variation of suspended sediment concentration along a) south side (L-1), b) center line (L-2), and c) north side (L-3) of Wilson Creek (November 3, 2021 storm)

Table 1
Commonly used longitudinal and transverse dispersion empirical equations

Table 2
Descriptive statistics of the longitudinal and transverse dispersion dataset
* STD: Standard Deviation

Table 3
Hyperparameter tuning scenarios of longitudinal dispersion coefficient (Dx). The ideal hyperparameters are shown as bold numbers

Table 4
Hyperparameter tuning scenarios of transverse dispersion coefficient (Dy). The ideal hyperparameters are shown as bold numbers