The data used in this article were obtained from the digital library of theses and dissertations of the Federal University of Paraná (UFPR). The two studies referenced, (HERNANDEZ, 2019) and (PAULA, 2019), examined the same Wastewater Treatment Plant (WWTP), located in the city of Curitiba, State of Paraná, over the same period. Based on the available data, the model used CNN-based algorithms to predict the behavior of a biological system from the following physical-chemical input variables: sewage flow rate (L/s), COD (Chemical Oxygen Demand, mg/L), TSS (Total Suspended Solids, mg/L), and VSS (Volatile Suspended Solids, mg/L). The output variable is the biogas flow rate (Nm³/h), since controlling this variable helps ensure good biogas production.
(PAULA, 2019) conducted a Spearman and Kendall Tau correlation analysis between the variables in the liquid phase of the influent and those in the gaseous phase (biogas production), in order to assess how the biogas produced relates to the physical-chemical input parameters of the system. Among the solids, VSS showed a strong positive correlation with biogas production, TSS a moderate correlation, and SSF (Fixed Suspended Solids) a weak or very weak correlation. This indicates that SSF is unsuitable for predicting biogas production, which is consistent with the fact that SSF represents the inert fraction of domestic sewage and is therefore irrelevant as a substrate during anaerobic digestion.
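The two rank correlations named above can be sketched in a few lines. This is a minimal illustration, not the procedure used by (PAULA, 2019): the readings are made up, the function names are hypothetical, and the Kendall variant shown is tau-a, which is an assumption because the source does not state which variant was applied.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    n = len(x)
    return score / (n * (n - 1) / 2)


# Made-up readings for illustration only (not data from the study).
vss = [110.0, 150.0, 95.0, 180.0, 130.0]   # mg/L
biogas = [40.0, 52.0, 38.0, 60.0, 55.0]    # Nm3/h

rho = spearman(vss, biogas)
tau = kendall_tau_a(vss, biogas)
print(f"Spearman rho = {rho:.2f}, Kendall tau-a = {tau:.2f}")
```

Both coefficients depend only on the ordering of the readings, which is why they suit monotonic but not necessarily linear relationships between the solids fractions and biogas production.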
The Wastewater Treatment Plant (WWTP) studied has six modified Upflow Anaerobic Sludge Blanket (UASB) reactors and is considered medium-sized, with a design flow rate of 440 L/s. It serves a population of approximately 252,764 inhabitants, and its hydraulic retention time is approximately 8 hours (Ross, 2015). Preliminary treatment consists of two bar screens, 10 cm and 5 cm in size, a 6 mm mechanical bar screen, a grit chamber, and a Parshall flume.
The database was formatted so that it could be read by the program and normalized so that variables of different orders of magnitude were rescaled to a common range, with all values falling between zero and one. The normalized data were then used to train the neural network and, subsequently, to validate and test the results obtained.
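One common way to bring every variable into the zero-to-one range is min-max scaling. The sketch below assumes that technique; the source does not name the exact formula used, and the function names and flow values are hypothetical.

```python
def min_max_fit(column):
    """Return the (min, max) pair needed to scale a column to [0, 1]."""
    return min(column), max(column)


def min_max_scale(column, lo, hi):
    """Map each value to (value - lo) / (hi - lo)."""
    if hi == lo:
        # A constant column carries no information; map it to zero.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


def min_max_invert(scaled, lo, hi):
    """Recover the original units from scaled values."""
    return [s * (hi - lo) + lo for s in scaled]


# Made-up sewage flow readings in L/s, for illustration only.
flow = [310.0, 420.0, 365.0, 440.0]

lo, hi = min_max_fit(flow)
flow_scaled = min_max_scale(flow, lo, hi)
flow_back = min_max_invert(flow_scaled, lo, hi)
print(flow_scaled)
```

Keeping `lo` and `hi` from the training data makes it possible to apply the same scaling to validation and test data and to convert predictions back to physical units.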
The variables obtained refer to the influent and effluent in the liquid phase of the anaerobic treatment process and to the gas phase, which corresponds to the final stage of that process. The data for the liquid phase of the influent and for the gas phase were measured hourly by calibrated electronic equipment, whose measurement uncertainty was analyzed by (HERNANDEZ, 2019), yielding 24 samples per day for 3 consecutive days in each of the 5 months analyzed. The effluent data came from laboratory analyses conducted three times a day, at 8 a.m., 12 p.m., and 4 p.m., on the same three consecutive days in each of the 5 months of the research.
The quantity of data collected does not limit the application of computational models, as shown by successful cases in the literature: (HAMED, KHALAFALLAH, and HASSANIEN, 2003), (CANETE, SAZ-OROZCO, et al., 2016), (CHOI and PARK, 2001), (MJALLI, AL-ASHEH, and E.ALFADALA, 2007), (SAKIEWICZ, PIOTROWSKI, et al., 2020), (ASADI, GUO, and MCPHEDRAN, 2020), (KUSIAK and WEI, 2012), (ALSULAILI and REFAIE, 2021), (NOURANI, ELKIRAN, and ABBA, 2018), and (HEJABI, SAGHEBIAN, et al., 2021). In many cases a substantial amount of data is available, but measurement problems leave part of it incomplete or uncalibrated, and those records are excluded because they do not represent the actual situation. Removing such unreliable data reduces the dataset but makes the modeling more efficient and more accurate.
In all the studies mentioned in the previous paragraph, regardless of the quantity of available data, predictions were successfully made using AI tools, and the results were satisfactory. When data are missing, statistical methods such as interpolation and extrapolation can be used to fill the gaps, and the resulting synthetic data can then be used in modeling. However, synthetic data can introduce a higher level of error, and using them may not lead to appropriate results (NOURANI, ELKIRAN, and ABBA, 2018). In this study, the effluent was sampled only three times a day, which leaves 21 of the 24 hourly values missing for each day. Interpolation and extrapolation could generate values for those hours, but a series reconstructed from only three daily samples would not give the expected confidence in the model. For this reason, the effluent information was excluded from the development of the model.
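The proportion of synthetic data that interpolation would introduce can be illustrated with a short sketch. It assumes simple linear interpolation between the three sampling times and holds the first and last values constant outside them; the effluent values and the function name are hypothetical, and this is not a procedure applied in the study.

```python
def interpolate_hourly(sample_hours, sample_values):
    """Build 24 hourly values by linear interpolation between samples.

    Hours before the first sample or after the last one reuse the
    nearest sampled value (a crude form of extrapolation).
    """
    hourly = []
    for hour in range(24):
        if hour <= sample_hours[0]:
            hourly.append(sample_values[0])
        elif hour >= sample_hours[-1]:
            hourly.append(sample_values[-1])
        else:
            for k in range(len(sample_hours) - 1):
                h0, h1 = sample_hours[k], sample_hours[k + 1]
                if h0 <= hour <= h1:
                    frac = (hour - h0) / (h1 - h0)
                    v0, v1 = sample_values[k], sample_values[k + 1]
                    hourly.append(v0 + frac * (v1 - v0))
                    break
    return hourly


# Sampling times from the text: 8 a.m., 12 p.m., and 4 p.m.
sample_hours = [8, 12, 16]
# Made-up effluent COD values in mg/L, for illustration only.
effluent_cod = [120.0, 150.0, 135.0]

hourly_cod = interpolate_hourly(sample_hours, effluent_cod)
synthetic_count = sum(1 for hour in range(24) if hour not in sample_hours)
synthetic_share = synthetic_count / 24
print(f"{synthetic_count} of 24 hourly values would be synthetic "
      f"({synthetic_share:.1%})")
```

With three measured values per day, 21 of the 24 hourly values would be generated rather than measured, which is the situation that motivated leaving the effluent data out of the model.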
Among the different types of networks available, the feed-forward backpropagation (FFBP) network is the one most often chosen by authors because it is considered the most suitable for modeling anaerobic digestion (HOLUBAR, ZANI, et al., 2002). This is attributed to its ability to correct the weights applied to the layers by propagating the prediction error backward through the network.
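The weight-correction mechanism can be sketched with a very small network written from scratch. This is an illustration under stated assumptions, not the architecture used in this work: one hidden layer of three sigmoid units, a linear output, plain stochastic gradient descent, and made-up normalized samples. The class name `TinyFFBP` and all hyperparameters are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyFFBP:
    """One hidden layer, sigmoid activation, linear output, backpropagation."""

    def __init__(self, n_inputs, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        """Feed-forward pass: returns hidden activations and the output."""
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        output = sum(w * h for w, h in zip(self.w2, hidden)) + self.b2
        return hidden, output

    def train_step(self, x, target, lr):
        """One backpropagation update for a single sample."""
        hidden, output = self.forward(x)
        error = output - target  # gradient of 0.5 * (output - target)^2
        for j, h in enumerate(hidden):
            # Error reaching hidden unit j, computed before w2[j] changes.
            delta_hidden = error * self.w2[j] * h * (1.0 - h)
            self.w2[j] -= lr * error * h
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * delta_hidden * xi
            self.b1[j] -= lr * delta_hidden
        self.b2 -= lr * error

    def mse(self, inputs, targets):
        squared = [(self.forward(x)[1] - t) ** 2
                   for x, t in zip(inputs, targets)]
        return sum(squared) / len(squared)


# Made-up normalized samples: [flow, COD, TSS, VSS] -> biogas flow.
inputs = [
    [0.10, 0.20, 0.15, 0.10],
    [0.40, 0.35, 0.30, 0.35],
    [0.55, 0.60, 0.50, 0.55],
    [0.70, 0.65, 0.75, 0.70],
    [0.90, 0.85, 0.80, 0.95],
    [0.30, 0.25, 0.40, 0.30],
]
targets = [0.12, 0.36, 0.58, 0.71, 0.92, 0.31]

net = TinyFFBP(n_inputs=4, n_hidden=3)
mse_before = net.mse(inputs, targets)
for epoch in range(2000):
    for x, t in zip(inputs, targets):
        net.train_step(x, t, lr=0.1)
mse_after = net.mse(inputs, targets)
print(f"MSE before: {mse_before:.4f}, after: {mse_after:.4f}")
```

Each call to `train_step` runs the forward pass, measures the output error, and moves every weight against its share of that error, which is the correction of layer weights referred to above.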
In their research, HAMED, KHALAFALLAH, and HASSANIEN (2003) used BOD and the concentration of suspended solids as the basis for their ANN model. They found that these two parameters were sufficient for modeling, as they are considered good indicators of ETP (Effluent Treatment Plant) performance, according to Droste (2009).
The hourly data comprised the collection time, sewage flow rate, COD, organic load, TSS, SSF (Fixed Suspended Solids), VSS, biogas flow rate, and the CH4, CO2, O2, and H2S content of the biogas. Data collected only three times a day were not used in the analysis. To reduce the sensitivity of the ANN and improve its prediction capacity, the number of input parameters applied to the network was reduced to four: sewage flow rate, COD, TSS, and VSS, as shown in Fig. 4. This simplification helps avoid overfitting by eliminating unnecessary inputs that would introduce noise into the model.
According to (PAULA, 2019) and (HERNANDEZ, 2019), leaks in the biogas lines were identified during part of the data collection period. Therefore, in this article, only readings taken after the identified leak in the WWTP had been corrected were considered.
Three metrics were used to evaluate the model's performance: the Pearson correlation, which measures the correlation between two data series; the coefficient of determination (R²), the squared version of the Pearson correlation, commonly used to measure model efficiency; and the Mean Squared Error (MSE). R² indicates the degree of correlation between the experimental and predicted values and is given by Eq. 1.
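The three metrics can be computed directly from paired series of measured and predicted values. The sketch below follows the description above, taking R² as the square of the Pearson correlation; the function names and the biogas values are hypothetical and are not results from this study.

```python
from math import sqrt


def pearson_r(observed, predicted):
    """Pearson correlation between measured and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p)
              for o, p in zip(observed, predicted))
    spread_o = sqrt(sum((o - mean_o) ** 2 for o in observed))
    spread_p = sqrt(sum((p - mean_p) ** 2 for p in predicted))
    return cov / (spread_o * spread_p)


def r_squared(observed, predicted):
    """Coefficient of determination taken as the squared Pearson r."""
    return pearson_r(observed, predicted) ** 2


def mean_squared_error(observed, predicted):
    """Average of the squared differences between the two series."""
    n = len(observed)
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n


# Made-up biogas flow values in Nm3/h, for illustration only.
observed = [40.0, 52.0, 38.0, 60.0, 55.0]
predicted = [42.0, 50.0, 39.0, 58.0, 56.0]

r = pearson_r(observed, predicted)
r2 = r_squared(observed, predicted)
mse = mean_squared_error(observed, predicted)
print(f"r = {r:.3f}, R2 = {r2:.3f}, MSE = {mse:.2f}")
```

Note that the MSE is expressed in the squared units of the series it is computed on, so values obtained on normalized data are not directly comparable with values obtained in Nm³/h.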
Accordingly, a higher R² value and an MSE value tending to zero indicate better performance during prediction.
The results presented in the next section are the mean of the values obtained in each of 10 consecutive training runs, for both the R² correlation index and the MSE.