Total phosphorus removal in multi-soil-layering nature-based technology: Evaluation of influencing factors and prediction using data-driven methods

doi:10.21203/rs.3.rs-1971008/v1


Research Article


This work is licensed under a CC BY 4.0 License

Version 1


Excess phosphorus (P) in wastewater can cause eutrophication, posing a serious risk to the safety of water resources and ecosystems. Effective removal of pollutants, including P, from wastewater is therefore a key strategy for protecting the environment and public health. Multi-soil-layering (MSL) is a promising nature-based technology that relies mainly on an iron-containing soil mixture to remove P from wastewater. Fourteen water quality indicators were measured in the MSL influent, including pH, dissolved oxygen, total suspended solids, electrical conductivity, organic matter, nutrients, and coliform bacteria, to determine which have the strongest relationship with total phosphorus (TP) removal. The influence of hydraulic loading rate (HLR) and climatic variables (air temperature, rainfall, and evaporation) on TP removal was also investigated. Four data-driven methods, namely multiple linear regression (MLR), k-nearest neighbors (KNN), random forest (RF), and neural network (NN), were applied to predict TP removal at the MSL system outlet. The results reveal that, in contrast to climatic variables, the HLR has a significant impact (p < 0.05) on TP removal (47–90%) in the MSL system. Furthermore, a feature selection technique suggested HLR, pH, PO₄³⁻, and TP as the relevant input variables affecting TP removal in the MSL system, and an examination of accuracy shows that the RF model achieves good prediction accuracy (R² = 0.93) and can help to understand MSL behavior for pollutants.

Multi-soil-layering

Phosphorus removal

Hydraulic loading rate

Feature selection

Data-driven methods

Phosphorus (P) is a vital component for all forms of life and is widely abundant in nature [1, 2]. High rates of water consumption, associated with population growth, industrial progress, and improved quality of life, have led to a significant increase in its quantity, especially in sewage effluents [3]. Furthermore, pesticides and fertilizers, as well as agricultural and animal farms, have all contributed to increased P contamination, resulting in ecosystem eutrophication issues [4], particularly in rural areas [5]. Consequently, protecting human health and the environment from P-pollution in rural areas using adequate wastewater treatment plants (WWTP) is a serious concern nowadays.

Given these facts, and to gain economic and environmental benefits while reducing P-pollution, we used a new environmentally friendly soil-based WWTP called the multi-soil-layering (MSL) system. Investment in the MSL is cost-effective, and the system is characterized by high hydraulic capacity, simple maintenance, and a service life of over twenty years [6]. The soil mixture, which is the main component of the MSL system, represents the anaerobic zone and consists of local materials (soil, charcoal, sawdust, granular iron). Charcoal has a strong affinity for contaminants and can act as an adsorbent [7], while sawdust constitutes a source of carbon. Regarding iron, several studies have demonstrated its exceptional capacity to immobilize P in wastewater through adsorption and precipitation [3, 4]. In the same context, the permeable layer (e.g., gravel) is considered an aerobic zone; it can facilitate the removal of pollutants including P, improve water dispersion and distribution, and reduce the risk of clogging [8].

Regarding P removal mechanisms, Zhou et al. [9] have reported that micropores on the surface of soil mixture blocks contribute to the adsorption of P, while Fe³⁺ in these blocks can co-precipitate with P. Chen [10] concluded that P precipitation driven by liberated iron ions, rather than adsorption, was the primary process in the MSL system. In the same regard, it has been suggested that physicochemical mechanisms, rather than microbiological activities, are key to P removal in the MSL system [9]. In addition, the optimization of MSL performance is related to several factors, such as pH, adequate aeration, low hydraulic loading rate (HLR), number of layers, and large surface area of soil mixture blocks [11–14].

Despite its relevance as a water quality indicator, P removal in the MSL system has not been sufficiently investigated, particularly at the modeling level. Furthermore, the number of studies that used computer-based simulation methods to predict P removal is still limited, with a reported accuracy of about R² = 0.85 [15]. Therefore, this research aims to investigate P removal in the MSL system in a semi-arid environment and to accurately predict this removal. In the same context, the data-driven models employed so far to simulate MSL performance are limited to stepwise-cluster analysis (SCA) and neural network (NN) methods [13–17]. Thus, this study is the first application of random forest (RF) and k-nearest neighbor (KNN) to predict MSL removal efficiency, especially for P removal. These data-driven methods were chosen because they have proved effective in predicting water quality characteristics in several investigations [18, 19]. Furthermore, they can deal with the nonlinear relationships between TP removal and candidate predictors. Given the large number of potential predictors, we select the relevant ones using a hybrid feature selection technique based on the Pearson correlation coefficient and the Bayesian information criterion (BIC) [20]. This technique is relevant because it identifies which variables are capable of explaining P removal in the MSL system, and because using those variables as inputs helps enhance the performance of the data-driven methods.

On the other hand, because of the semi-arid climate of the study area, we explore, in the first study of its kind, the effect of climatic variables including air temperature, evaporation, and rainfall on the efficiency of the MSL in removing total phosphorus (TP) pollution. This paper aims to (i) assess the impact of the HLR and climatic variables on TP reduction in the MSL system, (ii) use a feature selection technique to determine the relevant explanatory variables for TP removal, (iii) evaluate the potential of the data-driven methods (KNN, RF, and NN) to predict TP removal as compared to the multiple linear regression (MLR) method, and (iv) conduct a feature importance analysis.

2.1. Study area

The study area is classified as a semi-arid zone and belongs to the Haouz region in Central Morocco. The experimental MSL systems are operated downstream of eight active households in the village of Talat Margen (Latitude: 31.3351; Longitude: -7.94636). January is the coldest month, with an average temperature of 12°C, while July is the hottest, with an average temperature of 29°C. In the previous decade, the temperature reached its highest extreme (48.6°C) in July and its lowest extreme (-7°C) in January. The average annual rainfall varies from 160 mm to 258 mm [21]. The major part of the water resources is used for the irrigation of agricultural land and the provision of drinking water. In addition, the village has neither a wastewater treatment plant nor a sewage network.

2.2. System design

The experiment included five MSL pilot plants, each measuring 65 cm in height, with a diameter of 40 cm and a total volume of 26 liters. A schematic configuration of the MSL systems is shown in Fig. 1. Each MSL was built as stratified layers of gravel particles (approximately 3–5 mm) alternating with soil mixture blocks (SMB) with an effective pore space of 28%. The SMBs are arranged horizontally and surrounded by gravel particles with a thickness of 5 cm. The SMBs play an important role in purification processes within the MSL system and consist of soil (70%) combined with granular iron metal (10%), charcoal (10%), and sawdust (10%). In this study, the soil is composed of 88% sand, 5.6% silt, and 6.2% clay. The wastewater storage tank has a capacity of 1000 liters and is connected to all the MSL systems. Each system was fed continuously at a different hydraulic loading rate (HLR) to assess the influence of the HLR on MSL efficiency. The HLRs for the five pilot plants (MSL1, MSL2, MSL3, MSL4, and MSL5) were 250, 500, 1000, 2000, and 4000 L/m²/day, respectively.
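To give a feel for what these loading rates mean in practice, the sketch below converts each HLR into a daily inflow per unit. It assumes that every pilot plant is a cylinder with the stated 40 cm diameter and that the HLR applies to its circular top surface; the function and constant names are illustrative and not taken from the study.

```python
import math

# Assumption: each unit is a cylinder of 40 cm diameter and the HLR is
# applied over its circular top surface.
DIAMETER_M = 0.40
HLRS = {"MSL1": 250, "MSL2": 500, "MSL3": 1000, "MSL4": 2000, "MSL5": 4000}  # L/m²/day


def surface_area_m2(diameter_m):
    """Area of the circular surface receiving the wastewater."""
    return math.pi * (diameter_m / 2) ** 2


def daily_inflow_l(hlr, diameter_m=DIAMETER_M):
    """Daily inflow in liters = HLR (L/m²/day) x surface area (m²)."""
    return hlr * surface_area_m2(diameter_m)


for name, hlr in HLRS.items():
    print(f"{name}: {daily_inflow_l(hlr):.1f} L/day")
```

Under this assumption the surface area is about 0.126 m², so the inflow ranges from roughly 31 L/day for MSL1 to roughly 503 L/day for MSL5.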

The systems were installed outdoors, in direct contact with the atmosphere. As a consequence, they were exposed to natural precipitation as well as ambient air temperature. The effect of these climatic variables, as well as of evaporation, on TP removal in the MSL system was therefore investigated. Monthly climatic data were obtained from the Tensift River Basin Agency (ABHT) in Marrakesh, Morocco. Evaporation was measured using a circular class A evaporation pan with a diameter of 1.20 m and a depth of 0.25 m.

2.3. Sample analysis and climatic variables

The experiment lasted a year, with samples taken once every two weeks. Laboratory analyses for each water quality indicator were repeated to ensure consistency of results and to increase the number of data sets. Thus, the total data points were 246 with an average of 49 samples from each MSL. The indicators that have been investigated are related to the physicochemical and bacterial properties of the sewage. General parameters such as pH and electrical conductivity (EC, µS cm⁻¹) as well as dissolved oxygen (DO, mg L⁻¹) were measured using WTW multi 340i/set (Germany). The remaining indicators were analyzed according to normalized methods (Fig. 2).

For fine particulate and organic matter, the filtration technique was employed to measure total suspended solids (TSS, mg L⁻¹) content, while the dichromate open reflux technique was used to measure organic matter content (five-day biochemical oxygen demand (BOD₅, mg L⁻¹) and chemical oxygen demand (COD, mg L⁻¹)). Ammonium (NH₄⁺, mg L⁻¹) and nitrite (NO₂⁻, mg L⁻¹) levels were measured by the indophenol and diazotization techniques, respectively. Nitrates (NO₃⁻, mg L⁻¹) were analyzed as NO₂⁻ content after their reduction through a cadmium copper column. Kjeldahl mineralization, distillation of NH₄⁺, and a final acidimetric titration were the three steps used to determine total Kjeldahl nitrogen (TKN, mg L⁻¹) content. Total nitrogen (TN, mg L⁻¹) was calculated as the sum of all forms of nitrogen pollution (NH₄⁺, NO₂⁻, NO₃⁻, and TKN). The molybdate and ascorbic acid technique was employed to analyze the orthophosphate (PO₄³⁻, mg L⁻¹) level, while TP (mg L⁻¹) was determined as PO₄³⁻ after potassium peroxodisulfate digestion.

The fecal indicator bacteria, specifically total coliforms (TC, log units) and fecal coliforms (FC, log units), were measured using a series of dilution methods according to [22]. In addition, Lactose 2,3,5 Triphenyl Tetrazolium Chloride with Tergitol agar media were used to count these bacteria (more information is given in Sbahi et al. [14]). The level of coliform bacteria was expressed as log colony forming units (CFU). All of the previously stated water quality indicators were monitored at the MSLs' inlet, whereas TP removal was evaluated at the outlet of each MSL system.

2.4. Statistical analysis and feature selection

ANOVA and Tukey tests were performed to determine whether there was a significant difference in TP removal, with significance set at p < 0.05. Pearson's correlation coefficient (r) was used to measure the linear relationship between TP removal and climatic variables. For feature selection, the key explanatory variables for TP removal were determined using Pearson's r followed by the BIC metric, which penalizes the number of parameters present in the selection [20]. The BIC is computed as follows:

$$BIC= - 2\log (\psi )+k\log (n)$$

where $\psi$ denotes maximum likelihood, k the number of parameters to estimate, and n the number of data points. The selection with the lowest BIC score is preferable. The R software (version 3.5.2) was used to conduct all of the statistical tests.
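As a minimal illustration of how this criterion ranks candidate subsets, the Python sketch below (the study itself worked in R) evaluates the BIC for two linear models. It assumes Gaussian errors, so that the maximized log-likelihood follows from the residual sum of squares, and counts the intercept and the error variance as two extra parameters. The residual sums of squares are made-up numbers chosen only to show the trade-off, not results from the study.

```python
import math


def bic(log_likelihood, k, n):
    """BIC = -2 * log(psi) + k * log(n), with log(psi) the maximized log-likelihood."""
    return -2.0 * log_likelihood + k * math.log(n)


def gaussian_log_likelihood(rss, n):
    """Maximized log-likelihood of a linear model with Gaussian errors,
    expressed through its residual sum of squares (rss)."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)


n = 246  # number of data points reported in the study
# Hypothetical candidates: (number of predictors, residual sum of squares).
candidates = {"HLR+pH+PO4+TP": (4, 9500.0), "HLR only": (1, 11000.0)}

scores = {
    # +2 accounts for the intercept and the error variance (an assumption).
    name: bic(gaussian_log_likelihood(rss, n), predictors + 2, n)
    for name, (predictors, rss) in candidates.items()
}
best = min(scores, key=scores.get)  # lowest BIC is preferred
print(best, {name: round(score, 1) for name, score in scores.items()})
```

With these illustrative numbers the larger subset wins because its better fit outweighs the log(n) penalty on the three additional parameters.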

2.5. Data-driven methods and implementation

After the relevant input variables were identified, the experimental data were randomly divided into training (80% of total data) and validation (20%) data sets. k-fold cross-validation (CV) combined with grid search was used to explore the optimal parameters of each method and to minimize overfitting [26]. The k-fold CV technique consists of randomly dividing the training data set into k folds of equal size (we adopted k = 10). In each round, the model is trained on nine of the ten folds and the held-out fold is used to measure the prediction accuracy, so that each fold serves once as the held-out set. The parameter combination with the highest R-squared (R²) value is adopted for each model. All these processes were implemented in R software (3.5.2) using the trainControl( ) and train( ) functions of the caret package [26].
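The splitting scheme can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the caret implementation used in the study; the function names, the seed, and the rounding of the 80% cut are assumptions.

```python
import random


def train_validation_split(n, train_fraction=0.8, seed=0):
    """Shuffle the row indices and cut them into training and validation parts."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * n))
    return indices[:cut], indices[cut:]


def k_fold_indices(train_indices, k=10):
    """Yield (fit_indices, held_out_indices) pairs, one per fold."""
    folds = [train_indices[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        fit = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield fit, held_out


train, validation = train_validation_split(246)   # 246 data points in the study
splits = list(k_fold_indices(train, k=10))
print(len(train), len(validation), [len(held) for _, held in splits])
```

In a grid search, every candidate parameter combination would be scored on the ten held-out folds and the combination with the best mean score would be retained, while the validation indices stay untouched until the final evaluation.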

2.5.1. Multiple linear regression (MLR)

In water resources research, the MLR method has been widely used as a prediction model [27, 28]. Each input variable is weighted in this model such that the value of the regression coefficients optimizes each variable's effect on the final equation. The statistical model for MLR can be summarized by the following equation:

$$y_i = b_0 + \sum_{j=1}^{p} b_j x_{ij} + \varepsilon_i$$

where $y_i$ is the dependent variable to be explained for observation $i$, $x_{i1}, \ldots, x_{ip}$ are the explanatory variables, $\varepsilon_i$ is the random error of the model, and $b_0, b_1, \ldots, b_p$ are the parameters to be estimated.
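To make the estimation step concrete, the following sketch fits the coefficients by ordinary least squares, solving the normal equations with Gauss-Jordan elimination in plain Python. The data are synthetic and noise-free (generated from y = 1 + 2·x₁ − 3·x₂), and the function names are illustrative; the study itself fitted the model in R.

```python
def fit_mlr(X, y):
    """Return [b0, b1, ..., bp] minimizing the squared error (normal equations)."""
    rows = [[1.0] + [float(v) for v in x] for x in X]  # prepend intercept column
    m = len(rows[0])
    # Augmented system (A^T A | A^T y).
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(m)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(m)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]


def predict_mlr(coefs, x):
    """Evaluate b0 + sum_j b_j * x_j for one observation."""
    return coefs[0] + sum(b * xj for b, xj in zip(coefs[1:], x))


# Synthetic, noise-free data generated from y = 1 + 2*x1 - 3*x2.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, -2, 0, 2]
coefs = fit_mlr(X, y)
print([round(c, 6) for c in coefs], round(predict_mlr(coefs, [3, 2]), 6))
```

Because the toy data contain no noise, the fit recovers the generating coefficients; with real measurements the residual term would absorb the unexplained variation.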

2.5.2. k-nearest neighbors (KNN)

KNN is a nonparametric method for classification and regression tasks [29]. The concept of this method is to select the k closest neighbors of the studied point in order to predict its value. Garg et al. [30] have reported that the number of neighbors (kmax) has a significant influence on the prediction result. The neighborhood is determined through a distance metric, such as the Euclidean distance. In this study, the distance between samples was calculated using the Minkowski distance, which generalizes the Euclidean distance as follows:

$$D=\left(\sum_{i=1}^{k} \left|x_i - y_i\right|^{p}\right)^{1/p}$$

where $x_i$ and $y_i$ are the values of the $i$-th input variable for two individual samples, $k$ is the number of candidate input variables, and $p$ is a distance parameter (power) provided to the model for use in computing the Minkowski distance.
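A minimal sketch of the distance and of a KNN regression prediction is given below in plain Python. It averages the targets of the nearest neighbors with equal weights; the kernel weighting mentioned in the tuning section is deliberately left out, and the toy data are invented for illustration.

```python
def minkowski(x, y, p=2):
    """Minkowski distance; p = 1 gives Manhattan, p = 2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def knn_predict(train_X, train_y, query, k=3, p=3):
    """Predict by averaging the targets of the k nearest training samples."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: minkowski(pair[0], query, p))
    nearest = ranked[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Toy one-feature data set (invented values).
train_X = [[0.0], [1.0], [2.0], [10.0]]
train_y = [10.0, 20.0, 30.0, 100.0]

print(minkowski([0, 0], [3, 4], p=2))                    # Euclidean case
print(knn_predict(train_X, train_y, [1.0], k=3, p=3))    # mean of the 3 closest targets
```

In practice the inputs would normally be put on a common scale first, since a distance-based method is otherwise dominated by the variable with the largest numeric range (for example HLR compared with pH).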

2.5.3. Random forest (RF)

In theory, RF is a data-driven method that is based on multiple decision trees, consisting of a set of N trees {T₁ ($x$), T₂ ($x$), . . ., T_N ($x$)}. For a regression task, that is, to predict a numerical variable, RF uses a random selection of the data and creates multiple regression trees [31]. From a given data set, the RF starts with several bootstrap samples, which are drawn randomly with replacement. Each bootstrap sample is then used to fit a corresponding decision tree. Each tree produces an output, and the final output is the average of the predictions of all decision trees [32]. The RF model can process large data sets and produce estimates of variable importance. The regression predictor of RF is described by the following equation:

$$f\left( x \right)=\frac{1}{N}\sum_{K=1}^{N} T_K\left( x \right)$$

where $x=\{x_1, x_2, \ldots, x_{\alpha}\}$ is an $\alpha$-dimensional input vector and $T_K$ is the $K$-th tree of the forest. The number of predictors randomly tried at each node split is called mtry. This parameter is essential for tuning the RF method, whereas the number of trees (ntree) may be specified by the user during the training process.
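The bootstrap-and-average idea can be illustrated with a deliberately small Python sketch. It bags one-split regression trees ("stumps") on a single feature, so the random choice of mtry predictors at each node, which a full random forest adds, is not shown. The data are synthetic and the function names are illustrative.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree on a single feature, minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[1:]:      # both sides are guaranteed non-empty
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    if best is None:                           # all x identical: predict the mean
        mean = sum(ys) / len(ys)
        return lambda x: mean
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm


def fit_forest(xs, ys, n_trees=50, seed=0):
    """Fit each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict_forest(trees, x):
    """Final output: the average of the N tree outputs."""
    return sum(tree(x) for tree in trees) / len(trees)


# Synthetic step-shaped data: low response for x < 5, high response otherwise.
xs = list(range(10))
ys = [0.0] * 5 + [10.0] * 5
trees = fit_forest(xs, ys)
print(predict_forest(trees, 0), predict_forest(trees, 9))
```

Averaging over many bootstrap trees smooths out the variability of any single tree, which is the main reason the ensemble tends to generalize better than one deep tree.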

2.5.4. Neural network (NN)

NNs are strong nonlinear methods that were inspired by notions of how the brain operates [33]. The NN model in this study was implemented as a feed-forward network with a single hidden layer. Its architecture therefore includes three layers: an input layer that contains the input variables, an output layer that provides the prediction result, and a hidden layer that conducts all of the calculations between the input and output layers [33]. Every unit (neuron) in the hidden layer is linked to all units in the input layer. Similarly, every unit in the output layer is linked to every unit in the hidden layer. The number of hidden neurons (size) and the weight decay are the two parameters to calibrate for the NN method. The mathematical expression of a unit can be written as follows:

$$y=f\left( \sum_{i} w_i x_i + b \right)$$

where $w$ is the weight between the input $x$ and the hidden units, $f$ is the activation function, and $b$ is the bias associated with the output layer. The application of all data-driven methods was carried out using the R software (3.5.2).
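The unit expression and the three-layer structure can be sketched as a forward pass in plain Python. The logistic hidden activation and the linear output unit are assumptions made for this illustration, since the activation functions are not stated here, and the weights are arbitrary values rather than fitted ones.

```python
import math


def sigmoid(z):
    """Logistic activation (assumed for the hidden layer)."""
    return 1.0 / (1.0 + math.exp(-z))


def unit(inputs, weights, bias, activation):
    """Single unit: y = f(sum_i w_i * x_i + b)."""
    return activation(sum(w * x for w, x in zip(weights, inputs)) + bias)


def forward(x, hidden_weights, hidden_biases, out_weights, out_bias):
    """One hidden layer followed by a linear output unit (regression)."""
    hidden = [unit(x, w, b, sigmoid) for w, b in zip(hidden_weights, hidden_biases)]
    return unit(hidden, out_weights, out_bias, lambda z: z)


# Arbitrary example: 4 inputs, 2 hidden units with zero weights, 1 output.
x = [0.5, 8.2, 4.9, 7.4]
hidden_weights = [[0.0] * 4, [0.0] * 4]
hidden_biases = [0.0, 0.0]
print(forward(x, hidden_weights, hidden_biases, out_weights=[2.0, 4.0], out_bias=1.0))
```

With all hidden weights at zero each hidden unit outputs 0.5, so the example evaluates to 1 + 2·0.5 + 4·0.5 = 4; training would adjust the weights, with weight decay penalizing large values.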

2.5.5. Performance metrics

Models’ performances were evaluated by measuring the difference between the real and predicted values using the mean-absolute-error (MAE) and root-mean-square-error (RMSE). These metrics measure the average model prediction error and goodness of fit respectively. In addition, the R² that expresses the amount of variation described by the model was also used to evaluate the model accuracies [34].

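$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|\gamma_i-\widehat{\gamma}_i\right|$$

$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\gamma_i-\widehat{\gamma}_i\right)^{2}}$$

$$R^{2}=1-\frac{\sum_{i=1}^{n}\left(\widehat{\gamma}_i-\gamma_i\right)^{2}}{\sum_{i=1}^{n}\left(\widehat{\gamma}_i-\overline{\gamma}\right)^{2}}$$

where $\gamma$, $\widehat{\gamma}$, and $\overline{\gamma}$ are the model output, the real output, and the mean of the real output, respectively; n denotes the total number of real data points.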
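For readers who want to reproduce these metrics, a minimal Python sketch under the usual textbook definitions is given below: MAE as the mean absolute difference, RMSE as the square root of the mean squared difference, and R² as one minus the ratio of the residual to the total sum of squares. Whether the study computed R² in exactly this form is an assumption, and the numbers are invented.

```python
import math


def mae(real, predicted):
    """Mean absolute error."""
    return sum(abs(r - p) for r, p in zip(real, predicted)) / len(real)


def rmse(real, predicted):
    """Root mean square error."""
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(real, predicted)) / len(real))


def r_squared(real, predicted):
    """1 - (residual sum of squares) / (total sum of squares around the real mean)."""
    mean_real = sum(real) / len(real)
    ss_res = sum((r - p) ** 2 for r, p in zip(real, predicted))
    ss_tot = sum((r - mean_real) ** 2 for r in real)
    return 1.0 - ss_res / ss_tot


# Invented removal percentages, for illustration only.
real = [80.0, 90.0, 70.0, 60.0]
predicted = [82.0, 88.0, 71.0, 61.0]
print(mae(real, predicted), round(rmse(real, predicted), 4), r_squared(real, predicted))
```

Because TP removal is expressed in percent, MAE and RMSE computed this way are in percentage points, which is how the error values reported in the results should be read.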

3.1. Total phosphorus removal and hydraulic loading rate

Table 1 shows descriptive statistics for the fourteen water quality parameters monitored in the MSLs influent. For instance, the pH measured in the raw wastewater has minimum, maximum, and average values of 7.20, 8.90, and 8.17, respectively. In the same context, to examine the effect of the hydraulic shock load on MSL efficiency in removing TP, the five MSL units were subjected to an increasing HLR (between 250 and 4000 L/m²/day).

Table 1

Summary statistics of the HLR and the MSLs influent characteristics
Parameter	Unit	Range	Mean	Standard deviation
HLR	L/m²/day	250–4000	-	-
pH	-	7.2–8.9	8.17	0.35
DO	mg/L	0.36–1.3	0.78	0.31
EC	µS/cm	1722–2421	2015	198
TSS	mg/L	206.4–380.2	278.9	39.7
BOD₅	mg/L	244–396	314.4	46.5
COD	mg/L	430–619	504.2	47.1
NH₄⁺	mg/L	15.4–34.15	23.1	4.8
NO₂⁻	mg/L	0.11–0.93	0.43	0.24
NO₃⁻	mg/L	10.3–34	20.5	5.44
TKN	mg/L	22.3–49.3	33	6.3
TN	mg/L	57–109.3	77	12.42
PO₄³⁻	mg/L	3.1–7.2	4.9	1.3
TP	mg/L	5.2–9.9	7.4	1.14
TC	log units	5.9–7.4	6.83	0.47

Regarding MSL performance, there are significant (p < 0.05) differences between the efficiencies of the MSL systems subjected to different HLRs, although the difference becomes insignificant (p > 0.05) when the HLR rises from 2000 to 4000 L/m²/day (Fig. 3a). The results show a decrease in the TP level from 7.40 mg L⁻¹ to 0.73 mg L⁻¹ at an HLR of 250 L/m²/day, and to 3.96 mg L⁻¹ at an HLR of 4000 L/m²/day (Fig. 3a). Thus, increasing the HLR raises the TP level in the MSL effluent. This finding appears to be confirmed by the scatterplot (Fig. 3b), which shows a linearly decreasing trend of TP removal as a function of HLR. In addition, for a better understanding of the performance and stability of the MSL systems, Fig. 3c depicts the change in TP content over time. As illustrated, the effluent concentrations of MSL1, MSL2, and MSL3 were relatively stable over the experiment duration. In contrast, the TP level was unstable in the effluents of MSL4 and MSL5, which were exposed to a high HLR (2000 and 4000 L/m²/day), indicating that the performance of the MSL system decreases and becomes unstable as the HLR increases. These findings are in line with the published literature [9, 35], which found that increased HLRs may reduce the TP removal efficiency of the MSL system.

Overall, despite the decreased efficiencies demonstrated by MSL4 and MSL5, all the MSL systems consistently maintained significant (p < 0.05) TP reduction, with removal rates ranging from 47 to 90%. In addition, TP removal is found to be inversely related to the hydraulic load applied to the surface area of the MSL systems. As the HLR increases, the residence time of the wastewater within the MSL decreases, reducing its interaction with the MSL layers and, therefore, decreasing the purification efficiency of the MSL [36]. This finding was also supported by Toreu et al. [37], who found that P adsorption by soils rises as reaction time increases.
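The reported removal range can be checked directly from the mean concentrations quoted above, using the usual definition of removal efficiency as the relative drop between influent and effluent. The short Python sketch below does that arithmetic; the function name is illustrative.

```python
def removal_efficiency(influent, effluent):
    """Percent removal = (C_in - C_out) / C_in * 100."""
    return (influent - effluent) / influent * 100.0


TP_IN = 7.40  # mean influent TP, mg/L

low_hlr = removal_efficiency(TP_IN, 0.73)    # effluent at 250 L/m²/day
high_hlr = removal_efficiency(TP_IN, 3.96)   # effluent at 4000 L/m²/day
print(f"{low_hlr:.1f}% at 250 L/m²/day, {high_hlr:.1f}% at 4000 L/m²/day")
```

The two values, about 90.1% and 46.5%, agree with the reported range of roughly 47 to 90%.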

Regarding P removal mechanisms in the MSL system, many studies [38–40] have reported that its removal is done by adsorption on iron hydroxides and aluminum hydroxides present in soil mixture and soil colloids. An et al. [8] have reported that iron metal in the soil mixture immobilizes P by adsorption and precipitation during the percolation of wastewater. Indeed, soil mixture iron is transformed into ferrous iron (Fe²⁺) and then transported to the gravel layer (aerobic conditions) to be oxidized to ferric ion (Fe³⁺), allowing for P purification in the MSL system once again [41]. Therefore, increasing P contact time with adsorbents inside the MSL system is a key component in its removal. However, short-cut flow or preferential flow would lead to the reduction of contact time between wastewater and the MSL media, which will negatively influence the transformation of iron to ferric ions. Furthermore, Song et al. [15] reported that MSL parameters such as aeration, pH, and wastewater dispersion can also influence TP removal. Indeed, sufficient oxygen could promote the oxidation of iron to ferric ions in the aerobic zone (gravel layers) resulting in increased P adsorption by soil particles, while pH could influence the proportions of hydroxyl ions (OH⁻) [42]. Jianbo et al. [43] have previously verified this assertion, stating that the pH of a phosphorus solution has a significant impact on the whole adsorption process, notably on the adsorption capacity.

3.2. Effect of climatic variables

In this paper, the influence of rainfall, temperature, and evaporation on the performance of the MSL is examined. Figure 4a shows that the rainiest months are from January to April, and the driest months are from June to August. Precipitation is heavier in the spring and winter, whereas the summer is drier, with rainfall from 3 to 13 mm. Regarding temperature, the coldest month was January (2.6°C), while the hottest months were June to August (45°C), which explains the significant increase in evaporation during this period (between 207 mm and 326 mm).

In the same context, Fig. 4b depicts the efficiency of TP removal in the MSL systems throughout the experiment period. This removal was stable in MSL1 during the period (Jul.17 - Jun.18), with an average removal between 86% and 92%. MSL2 also showed good performance for TP removal during the experiment period (between 77% and 89%), whereas MSL3 varied somewhat, with a decrease (Sep.17 - Oct.17) followed by an increase (Nov.17 - Jun.18). However, the performance of MSL4 and MSL5 fluctuated significantly between increases and drops, with MSL5 recording the lowest removal in October (16.7%). This result can be explained by the low concentration of TP in the MSL influent in October (as indicated in Fig. 3). This might imply that when TP concentrations in the influent are low, the performance of an MSL system subjected to a high HLR is limited.

Pearson's coefficient was also used to assess the relationship between TP removal and climatic variables (Fig. 4c). The results reveal that the linear correlation was weak (-0.1 ≤ r ≤ 0.03), with no significant (p > 0.05) relationship between local climatic data and the MSL performance. This result is consistent with the findings of Masunaga et al. [39], which indicated that seasonal variables did not appear to have a significant impact on the TP level in the MSL system. Overall, the performance of the MSL system remained constant throughout the year, with no significant influence of climatic variables in the semi-arid study area, especially at lower HLRs. However, increasing the HLR makes its performance unstable.

3.3. Feature selection

When the number of candidate input variables for an output variable becomes large, selecting those that have significant relationships with the output can be an effective strategy [44, 45]. In this study, the Pearson coefficient and BIC metrics were used for this purpose. The pairwise correlation matrix (Fig. 5a) indicates the input variables that change linearly with TP removal. Each square includes both the correlation coefficient (Pearson's r) and the significance asterisks. We can note a significant (p < 0.05, r = −0.77) negative relationship between TP removal and the HLR (Fig. 5a). In contrast, the relationship is significantly (p < 0.05) positive with the levels of TSS, pH, NO₃⁻, NO₂⁻, TP, and TC in the MSL influent, where Pearson's coefficient ranges between 0.15 and 0.26.
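For reference, the Pearson coefficient behind such a correlation matrix can be computed as in the Python sketch below. It returns only r; the significance test that produces the asterisks is not included, and the sample values are invented.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Invented values: a response that falls steadily as the predictor rises.
predictor = [1.0, 2.0, 3.0, 4.0]
response = [8.0, 6.0, 4.0, 2.0]
print(pearson_r(predictor, response))
```

A value near −1 indicates a strong decreasing linear relationship, values near 0 (as found for the climatic variables) indicate little linear association, and r by itself says nothing about nonlinear dependence.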

Regarding the BIC metric, ten subset selections were evaluated and compared based on these candidate significant variables. This task was carried out using the lmSubsets package in R (version 3.5.2). Figure 5b shows that the first subset has the lowest BIC value, while Fig. 5c gives more details about the ten subset selections. The candidate variables are indicated on the x-axis, while the y-axis shows the selection number. The results of these selections were very similar, with only a few variables switching between being included and excluded. Therefore, the BIC selection consisting of the features HLR, pH, PO₄³⁻, and TP is suggested to be significantly (p < 0.05) pertinent for explaining TP removal in the MSL system.

In line with this result, Lamzouri et al. [17] also used pH, TP, and HLR as input features to predict the TP content at the outlet of the MSL system. Kotti et al. [46] have also used HLR as an important input variable to predict the removal of P in free-water surface CWs. Similarly, the empirical equation developed by Akratos et al. [47] was based on HLR as an important factor for predicting the removal of P in horizontal CWs. Vidal et al. [48] reported that pH appeared to be a good estimator for the reduction of TP in a sand filter. Overall, the selected variables have also been used in previous studies as informative predictors of P removal in wastewater treatment systems.

3.4. Tuning parameters

In this study, the goal of tuning the data-driven models is to identify a set of parameters that results in an optimal model with lower error values and higher prediction accuracy. To tune the models, the parameters of each one were assessed, and the optimal combination was selected based on the highest R² value. Figure 6 illustrates the tuning parameter profile for each method. In the case of the NN model (Fig. 6a), accuracy increases with the number of hidden neurons and reaches its highest value (R² = 0.90) at six neurons (size = 6) with a weight decay of 0.03. Thus, the appropriate NN architecture to predict TP removal in the MSL system has four units in the input layer, six units in the hidden layer, and one unit in the output layer (Fig. 6b). Regarding the RF tuning profile (Fig. 6c), the accuracy increases considerably as mtry increases, and mtry = 4 achieves the best performance for the RF model (R² = 0.93). Furthermore, determining the number of trees at which the error rate becomes stable during the tuning process is an important step.

As can be seen in Fig. 6d, the model error decreases as the number of trees rises. Therefore, the optimal number of trees to use, according to RF's final model, is 500. Regarding the KNN model, as was already noted, the grid search approach was employed to determine the final values of the Minkowski distance and the maximum number of neighbors (kmax) while maintaining the kernel function as "optimal" for weighting. Thus, the accuracy increases in lockstep with distance (Fig. 6e). Therefore, the set of parameters (kmax = 3, distance = 3, kernel = optimal) provides the best performance for the KNN model.

In the same context, after training the models on 80% of the total data set, they were compared and evaluated using three performance metrics. A comparison of the estimated performance is shown in Fig. 7a. In the case of the MLR, the model shows modest performance, followed by the KNN model. In terms of the MAE, its mean value varies from 9.2% (MLR) to 4.64% (RF). Furthermore, the mean RMSE values are roughly consistent with the MAE values: 11.96% (MLR), 8.34% (KNN), 6.55% (RF), and 7.26% (NN).

Overall, the RF model was the most accurate data-driven model during the training process. These conclusions are supported by the boxplots in Fig. 7b, which show that the residual distributions for all the models are different. Furthermore, the plot (Fig. 7b) suggests that the residuals of the RF model are close to those of the NN model, but are typically less than those of the KNN and MLR models.

3.5. Data-driven model evaluation

After tuning the models and identifying which one performs well, the performance metrics were used to assess the data-driven model's prediction accuracy. Figure 8 illustrates a comparison between the predicted and the real values for the TP removal in the MSL system using the validation data set (20%).

Regarding the MLR model (Fig. 8a1), some real data points are overestimated, resulting in a significant difference between real and predicted values. This might be owing to the MLR model's failure to account for the non-linear relationship between the predictors and the output variable. In addition, the model’s ability to predict TP removal is unsatisfactory in that the RMSE, MAE and R² are respectively 12.12%, 9.22% and 0.64 (Fig. 8a2). However, although the KNN model presents an improvement over the linear model, some data points are underestimated (Fig. 8b1). This result can be confirmed by the fit between the real and predicted values in that the RMSE, MAE and R² were 7.18%, 4.61% and 0.87 respectively (Fig. 8b2).

In the case of the RF, it can be assumed that this model shows a nearly perfect match between the real and predicted values (Fig. 8c1). Furthermore, the predictive performance of the RF model (RMSE = 5.29%, MAE = 3.63% and R² = 0.93) in this article outperformed the previous models in terms of TP removal predictions (Fig. 8c2). Similarly, for the NN model, Fig. 8d1 visualizes a good convergence between the predicted and real values. This finding can be confirmed by the good fit between the real and model predicted values (R² = 0.90). In addition, the estimated values of RMSE (6.11%) and MAE (4.51%) demonstrate that the NN model was also suitable for predicting TP removal in the MSL system (Fig. 8d2).

In summary, the MLR and KNN models exhibited modest to desirable prediction accuracy, while the RF and NN models showed good prediction accuracy. These findings support the predictive power of the NN in predicting MSL system performance. However, the RF model was shown to be more accurate in this investigation. Furthermore, compared with previous studies [15, 17], whose accuracy in predicting TP removal in the MSL system ranged between R² = 0.60 and R² = 0.85, the RF model developed in this study improved this prediction (R² = 0.93) based on relevant and informative input variables.

Finally, given that only one parameter needs tuning (mtry), the accuracy obtained, and the simplicity of implementation, it can be concluded that RF is a simple yet powerful data-driven approach that accurately predicts TP removal in the MSL system. Furthermore, compared to the SCA and NN methods, the balance between complexity and accuracy makes the RF method a sound choice for predicting MSL performance.

3.6. Feature importance

Following the prediction of TP removal, feature importance analysis was performed to identify the variables that most influence this removal. Based on the most accurate model (RF), the importance of each predictor was calculated using the variable_importance( ) function. Figure 9 indicates that all of the input variables selected by the feature selection technique are important and help explain the output variable.

The results identified the HLR as the most important parameter controlling TP removal in the MSL system and highlighted the significance of their nonlinear relationship. Furthermore, the TP level in the MSL influent ranks as the second most important variable, followed by PO₄³⁻ and pH, respectively. These findings are consistent with the literature [9, 46, 48], implying that the level of pollutants in raw wastewater, in combination with the hydraulic shock load, governs MSL removal efficiency.
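One common way to rank predictors is permutation importance: shuffle one input column at a time and measure how much the prediction error grows. The sketch below illustrates that general idea only; it is not necessarily what the variable_importance( ) function computes. The model `toy_model`, its coefficients, the feature names, and the synthetic data are all hypothetical.

```python
import random


def mse(model, rows, targets):
    """Mean squared error of a callable model over a list of feature tuples."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(model, rows, targets, n_repeats=20, seed=0):
    """Average increase in MSE when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    scores = []
    for j in range(len(rows[0])):
        total_increase = 0.0
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only feature j replaced by a shuffled value.
            shuffled = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
            total_increase += mse(model, shuffled, targets) - baseline
        scores.append(total_increase / n_repeats)
    return scores


def toy_model(row):
    """Hypothetical stand-in for a fitted model; the third input is ignored."""
    hlr, tp_in, _unused = row
    return 95.0 - 0.04 * hlr - 0.5 * tp_in


# Synthetic inputs, invented for illustration (not data from this study).
data_rng = random.Random(42)
rows = [
    (data_rng.uniform(0, 1000), data_rng.uniform(2, 12), data_rng.uniform(0, 1))
    for _ in range(60)
]
targets = [toy_model(r) for r in rows]

names = ["HLR", "TP_in", "unused"]
scores = permutation_importance(toy_model, rows, targets)
for name, score in sorted(zip(names, scores), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```

In this toy setting the input that drives most of the output variation receives the largest score, and an input the model ignores scores zero, which mirrors how a ranking such as the one in Figure 9 is read.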

4. Conclusion

The current study investigated TP removal in MSL systems under various HLRs and attempted to predict this output variable using four data-driven methods. The following conclusions can be drawn:

HLR had a significant effect (p < 0.05) on total phosphorus removal by the MSL;
The MSL removes TP efficiently (> 90%) under low HLR, possibly because of the longer contact time between pollutants and the MSL media;
Climatic variables do not appear to have a significant effect on TP removal;
Feature selection suggested that TP removal was significantly (p < 0.05) affected by the HLR, influent phosphorus concentration, and pH;
The RF model successfully predicted TP removal in the MSL (R² = 0.93);
TP removal in the MSL is governed more strongly by nonlinear than by linear relationships, and RF proved useful in capturing this nonlinearity.

Overall, MSL nature-based technology is a promising solution for wastewater treatment in rural areas. Moreover, investigating the relationship between influent characteristics, hydraulic conditions, and pollutant removal using a combined strategy based on feature selection and computer-based simulation methods may assist in better understanding MSL behavior.

Funding

The authors declare that no funds, grants, or other support were received during the preparation of this manuscript.

Competing interests

The authors have no relevant financial or non-financial interests to disclose.

Authors' contributions

Conceptualization, Methodology, Validation, Formal analysis, writing original draft: Sofyan SBAHI; Investigation, Visualization: Abdessamed HEJJAJ and Abderrahman LAHROUNI; Writing - Review & Editing: Naaila OUAZZANI; Conceptualization, Validation, Resources, Supervision, and Writing - Review & Editing: Laila MANDI. All authors read and approved the final manuscript.

Ethics Approval: Not applicable

Consent to Participate: Not applicable

Consent to Publish: Not applicable

Availability of Data and Materials

The data used and analysed in the current study are available from the corresponding author on reasonable request.

Acknowledgments

The authors would like to thank the National Center for Studies and Research on Water and Energy (Cadi Ayyad University) for scientific and technical support of this research.

References

Santos, A. F., Almeida, P. V., Alvarenga, P., Gando-Ferreira, L. M., & Quina, M. J. (2021). From wastewater to fertilizer products: Alternative paths to mitigate phosphorus demand in European countries. Chemosphere, 284. https://doi.org/10.1016/j.chemosphere.2021.131258
Wang, L., Chen, L., Guo, B., Tsang, D. C. W., Huang, L., Ok, Y. S., & Mechtcherine, V. (2020). Red mud-enhanced magnesium phosphate cement for remediation of Pb and As contaminated soil. Journal of Hazardous Materials, 400, 123317. https://doi.org/10.1016/j.jhazmat.2020.123317
Wang, Q., Liao, Z., Yao, D., Yang, Z., Wu, Y., & Tang, C. (2021). Phosphorus immobilization in water and sediment using iron-based materials: A review. Science of the Total Environment, 767, 144246. https://doi.org/10.1016/j.scitotenv.2020.144246
Arias, M., Da Silva-Carballal, J., García-Río, L., Mejuto, J., & Núñez, A. (2006). Retention of phosphorus by iron and aluminum-oxides-coated quartz particles. Journal of Colloid and Interface Science, 295(1), 65–70. https://doi.org/10.1016/j.jcis.2005.08.001
Ma, X., Li, Y., Zhang, M., Zheng, F., & Du, S. (2011). Assessment and analysis of non-point source nitrogen and phosphorus loads in the Three Gorges Reservoir Area of Hubei Province, China. Science of the Total Environment, 412–413, 154–161. https://doi.org/10.1016/j.scitotenv.2011.09.034
Chen, X., Luo, A. C., Sato, K., Wakatsuki, T., & Masunaga, T. (2009). An introduction of a multi-soil-layering system: A novel green technology for wastewater treatment in rural areas. Water and Environment Journal, 23(4), 255–262. https://doi.org/10.1111/j.1747-6593.2008.00143.x
Boyer, A., Ning, P., Killey, D., Klukas, M., Rowan, D., Simpson, A. J., & Passeport, E. (2018). Strontium adsorption and desorption in wetlands: Role of organic matter functional groups and environmental implications. Water Research, 133, 27–36. https://doi.org/10.1016/j.watres.2018.01.026
An, C. J., McBean, E., Huang, G. H., Yao, Y., Zhang, P., Chen, X. J., & Li, Y. P. (2016). Multi-soil-layering systems for wastewater treatment in small and remote communities. Journal of Environmental Informatics, 27(2), 131–144. https://doi.org/10.3808/jei.201500328
Zhou, Q., Sun, H., Jia, L., Zhao, L., & Wu, W. (2021). Enhanced pollutant removal from rural non-point source wastewater using a two-stage multi-soil-layering system with blended carbon sources: Insights into functional genes, microbial community structure and metabolic function. Chemosphere, 275, 130007. https://doi.org/10.1016/j.chemosphere.2021.130007
Chen, Y.-C. (2021). Phosphorus and nitrogen removal from water using steel slag in soil-based low-impact development systems. Journal of Water Process Engineering, 44, 102385.
AFNOR. (1983). Recueil de normes françaises: eau, méthodes d’essai. French Standardization Association (AFNOR) Paris, France.
Luanmanee, S., Boonsook, P., Attanandana, T., Saitthiti, B., Panichajakul, C., & Wakatsuki, T. (2002). Effect of intermittent aeration regulation of a multi-soil-layering system on domestic wastewater treatment in Thailand. Ecological Engineering, 18(4), 415–428.
Sbahi, S., Ouazzani, N., Hejjaj, A., & Mandi, L. (2021). Nitrogen modeling and performance of Multi-Soil-Layering (MSL) bioreactor treating domestic wastewater in rural community. Journal of Water Process Engineering, 44, 102389. https://doi.org/10.1016/J.JWPE.2021.102389
Sbahi, S., Ouazzani, N., Latrach, L., Hejjaj, A., & Mandi, L. (2020). Predicting the concentration of total coliforms in treated rural domestic wastewater by multi-soil-layering (MSL) technology using artificial neural networks. Ecotoxicology and Environmental Safety, 204, 111118. https://doi.org/10.1016/j.ecoenv.2020.111118
Song, P., Huang, G., An, C., Shen, J., Zhang, P., Chen, X., … Sun, C. (2018). Treatment of rural domestic wastewater using multi-soil-layering systems: Performance evaluation, factorial analysis and numerical modeling. Science of The Total Environment, 644, 536–546. https://doi.org/10.1016/J.SCITOTENV.2018.06.331
Hong, Y., Huang, G., An, C., Song, P., Xin, X., Chen, X., … Zheng, R. (2019). Enhanced nitrogen removal in the treatment of rural domestic sewage using vertical-flow multi-soil-layering systems: Experimental and modeling insights. Journal of Environmental Management, 240, 273–284. https://doi.org/10.1016/j.jenvman.2019.03.097
Lamzouri, K., Mahi, M., Latrach, L., & Mandi, L. (2017). Quantitative evaluation of the effect of parameters affecting biological and physicochemical phosphate removal from wastewaters in a Multi-Soil-Layering system. Revue Marocaine des Sciences Agronomiques et Vétérinaires, 5(September), 313–318.
Guo, C., & Cui, Y. (2022). Machine learning exhibited excellent advantages in the performance simulation and prediction of free water surface constructed wetlands. Journal of Environmental Management, 309, 114694.
Kim, M., Kim, Y., Kim, H., Piao, W., & Kim, C. (2016). Evaluation of the k-nearest neighbor method for forecasting the influent characteristics of wastewater treatment plant. Frontiers of Environmental Science and Engineering, 10(2), 299–310. https://doi.org/10.1007/s11783-015-0825-7
Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods and Research, 33(2), 261–304. https://doi.org/10.1177/0049124104268644
Aahd, A., Simonneaux, V., Sadik, E., Brahim, B., & Fathallah, S. (2009). Estimation des volumes d’eau pompés dans la nappe pour l’irrigation (plaine du Haouz, Marrakech, Maroc). Comparaison d’une méthode statistique et d’une méthode basée sur l’utilisation de la télédétection. Revue des sciences de l’eau, 22(1), 1–13. https://doi.org/10.7202/019820ar
Standards, M. (2006). Moroccan standard approved by order of the Minister of Industry, Trade and Economy Last Level. Moroccan Industrial Standardization Service.
AFNOR, N. F. T. (1997). T 90-105. Qualité de l’eau-dosage des matières en suspension–méthode par centrifugation, Association française de normalisation, Paris.
APHA. (1985). Standard methods for the examination of water and wastewater. APHA, Washington.
Rodier, J. (1996). L’analyse de l’eau naturelle, eaux résiduaires, eau de mer. Dunod, Paris, 1, 1383.
Kuhn, M., & Johnson, K. (2013). Applied predictive modeling (Vol. 26). Springer. https://doi.org/10.1007/978-1-4614-6849-3
Kadam, A. K., Wagh, V. M., Muley, A. A., Umrikar, B. N., & Sankhua, R. N. (2019). Prediction of water quality index using artificial neural network and multiple linear regression modelling approach in Shivganga River basin, India. Modeling Earth Systems and Environment, 5(3), 951–962. https://doi.org/10.1007/s40808-019-00581-3
Xiao, H., Bai, B., Li, X., Liu, J., Liu, Y., & Huang, D. (2019). Interval multiple-output soft sensors development with capacity control for wastewater treatment applications: A comparative study. Chemometrics and Intelligent Laboratory Systems, 184, 82–93. https://doi.org/10.1016/j.chemolab.2018.11.007
Mohurle, S., & Devare, M. (2020). A Study of KNN Classifier to Predict Water Pollution Index. Advances in Intelligent Systems and Computing, 1025, 457–466. https://doi.org/10.1007/978-981-32-9515-5_44
Garg, A., Huang, H., Kushvaha, V., Madhushri, P., Kamchoom, V., Wani, I., … Zhu, H. H. (2020). Mechanism of biochar soil pore–gas–water interaction: gas properties of biochar-amended sandy soil at different degrees of compaction using KNN modeling. Acta Geophysica, 68(1), 207–217. https://doi.org/10.1007/s11600-019-00387-y
Breiman, L. (2001). Random forests. Machine learning, 45(1), 5–32.
Tran, D. A., Tsujimura, M., Ha, N. T., Nguyen, V. T., Binh, D. Van, Dang, T. D., … Pham, T. D. (2021). Evaluating the predictive power of different machine learning algorithms for groundwater salinity prediction of multi-layer coastal aquifers in the Mekong Delta, Vietnam. Ecological Indicators, 127, 107790. https://doi.org/10.1016/j.ecolind.2021.107790
Chollet, F., & Allaire, J. J. (2018). Deep learning with R. Shelter Island. Manning Publications Co. Biometrics, 76, 361–362.
Yeom, C. U., & Kwak, K. C. (2017). Short-term electricity-load forecasting using a tsk-based extreme learning machine with knowledge representation. Energies, 10(10), 1613. https://doi.org/10.3390/en10101613
Wei, C., & Wu, W. (2018). Performance of single-pass and by-pass multi-step multi-soil-layering systems for low-(C/N)-ratio polluted river water treatment. Chemosphere, 206, 579–586.
Sato, K., Masunaga, T., & Wakatsuki, T. (2005). Water Movement Characteristics in a Multi-Soil-Layering System. Soil Science and Plant Nutrition, 51(1), 75–82. https://doi.org/10.1111/j.1747-0765.2005.tb00009.x
Toreu, B. N., Thomas, F. G., & Gillman, G. P. (1988). Phosphate-sorption characteristics of soils of the North Queensland coastal region. Australian Journal of Soil Research, 26(3), 465. https://doi.org/10.1071/SR9880465
Pattnaik, R., Yost, R. S., Porter, G., Masunaga, T., & Attanandana, T. (2008). Improving multi-soil-layer (MSL) system remediation of dairy effluent. Ecological Engineering, 32(1), 1–10. https://doi.org/10.1016/j.ecoleng.2007.08.006
Masunaga, T., Sato, K., Mori, J., Shirahama, M., Kudo, H., & Wakatsuki, T. (2007). Characteristics of wastewater treatment using a multi-soil-layering system in relation to wastewater contamination levels and hydraulic loading rates: Original article. Soil Science and Plant Nutrition, 53(2), 215–223. https://doi.org/10.1111/j.1747-0765.2007.00128.x
Wakatsuki, T., Esumi, H., & Omura, S. (1993). High performance and N and P-removable on-site domestic waste water treatment system by multi-soil-layering method. Water Science and Technology, 27(1), 31–40. https://doi.org/10.2166/wst.1993.0010
Guo, J., Zhou, Y., Jiang, S., & Chen, C. (2019). Feasibility investigation of a multi soil layering bioreactor for domestic wastewater treatment. Environmental Technology (United Kingdom), 40(17), 2317–2324. https://doi.org/10.1080/09593330.2018.1441331
Kwesi Asomaning, S. (2020). Processes and Factors Affecting Phosphorus Sorption in Soils. Sorption in 2020s, 45, 1–16. https://doi.org/10.5772/intechopen.90719
Lü, J., Sun, L., Zhao, X., Lu, B., Li, Y., & Zhang, L. (2009). Removal of phosphate from aqueous solution using iron-oxide-coated sand filter media: Batch studies. In Proceedings - 2009 International Conference on Environmental Science and Information Application Technology, ESIAT 2009 (Vol. 1, pp. 639–644). IEEE. https://doi.org/10.1109/ESIAT.2009.104
Park, J. G., Jun, H. B., & Heo, T. Y. (2021). Retraining prior state performances of anaerobic digestion improves prediction accuracy of methane yield in various machine learning models. Applied Energy, 298, 117250. https://doi.org/10.1016/j.apenergy.2021.117250
Hvala, N., & Kocijan, J. (2021). Input variable selection using machine learning and global sensitivity methods for the control of sludge bulking in a wastewater treatment plant. Computers & Chemical Engineering, 154, 107493. https://doi.org/10.1016/j.compchemeng.2021.107493
Kotti, I. P., Sylaios, G. K., & Tsihrintzis, V. A. (2016). Fuzzy Modeling for Nitrogen and Phosphorus Removal Estimation in Free-Water Surface Constructed Wetlands. Environmental Processes, 3(1), 65–79. https://doi.org/10.1007/s40710-016-0177-8
Akratos, C. S., Papaspyros, J. N. E., & Tsihrintzis, V. A. (2009). Artificial neural network use in ortho-phosphate and total phosphorus removal prediction in horizontal subsurface flow constructed wetlands. Biosystems Engineering, 102(2), 190–201. https://doi.org/10.1016/j.biosystemseng.2008.10.010
Vidal, B., Hedström, A., & Herrmann, I. (2018). Phosphorus reduction in filters for on-site wastewater treatment. Journal of Water Process Engineering, 22, 210–217. https://doi.org/10.1016/j.jwpe.2018.02.005
