Two-Stage Model for Machine Learning/Transient-based Leak Detection in Pressurized Pipelines


This study introduces a two-stage approach for high-resolution leak localization in large-scale pipelines by coupling machine learning with transient hydraulics. The two stages are leak zone identification and in-zone leak detection. A transient simulation model using the Method of Characteristics (MOC) is developed to generate the learning data for the pipeline under consideration. The search space is then reduced, and the maximum leak detection error is bounded, by determining the most likely leaky zone using Support Vector Regression (SVR). Next, the zone dataset is built by introducing leak candidates into the identified zone. Finally, an ensemble classifier consisting of a set of linear discriminant components is trained to locate the leak within the zone using majority voting. The models are applied to a theoretical pipeline and an experimental Reservoir-Pipe-Valve (RPV) system. The performance of the applied machine learning algorithms is compared with that of well-known algorithms over a variety of kernels and hyperparameters. The impacts of different levels of uncertainty in pipe roughness and initial flow on the models' accuracy are also investigated. The results show that the proposed model is accurate, stable, and robust against hydraulic simulation uncertainties.


Leaks in pipe systems cause a noticeable loss of clean water, ranging from 5% to 50% of the total water supplied (Plath, et al. 2014). Leaks in pipes occur for various reasons, such as poor quality of pipe materials and finish, errors in operation and maintenance, corrosion, and high internal and external pressures. Leaks also create severe operational difficulties and impose high costs on operation management (Shamloo and Haghighi 2009).

Large leaks or bursts are usually visible early and reported by people. However, tiny leaks that are not visible on the surface can remain undiscovered for a long time while gradually growing and affecting network performance. Early leak detection saves water, prevents small leaks from turning into bursts, and is essential for water companies for economic, environmental, and reputational reasons (Sophocleous, et al. 2019) and for public satisfaction. Fixing leaks at an early stage also protects water health and security, especially when inverse leakage is likely. Contamination from groundwater, surface flow, and nearby wastewater may be drawn into the network during an inverse leakage. Hence, early leak detection facilities are an indispensable part of any pipe system to decrease the losses and threats caused by leaks.

During the last decades, various noninvasive methods have been introduced for leak detection in pipe systems. One group consists of hardware-based methods such as acoustic techniques, which measure the sound of water leaking from the pipeline (Ozevin and Harding 2012). Other hardware-based techniques are non-acoustic, such as infrared thermography (Chunli, et al. 2005) and ground-penetrating radar (Abouhamad, et al. 2016).
Despite the advantages of hardware-based techniques, they are mostly expensive and labor-intensive, and their application is limited to small-scale systems.

Another category of noninvasive methods is model-based techniques that take advantage of [...] automatically through experience (Mitchell 1997). These algorithms build a model based on sample data, known as training data, to make predictions or decisions without explicit programming. The sample data can be obtained from real events in the field or generated by a simulation model. Samples may be labeled or unlabeled. Labeled data are presented in the form of input-output pairs and are suitable for supervised learning tasks like regression and classification. Each input describes a system state, and the output corresponds to its expected outcome, called a label. Unlabeled data contain pure information about the system and do not include any prior knowledge or judgment in the form of a label assigned by a supervisor. This type of data is used in unsupervised learning tasks like clustering. ML-based leak detection methods exploit ML algorithms to learn the patterns of different leak states from a sample dataset and predict unseen leak events in reality. Compared to the model-based approaches, the ML-based [...] unique results. Also, SVM and ANN need both pressure and flow measurements for acceptable leak detection accuracy.

The performance of ML algorithms can be significantly affected by the quality and quantity of datasets. Due to the lack of proper data acquisition systems, the high cost of flow measurement, and the low frequency of leak events, an insufficient amount of real data is often [...] showed that, in addition to high performance in leak detection, the model is robust against high levels of uncertainty in pipes' friction factors and nodal demands. In that study, the network junctions were considered as leak candidates, so the applied ML algorithm had to handle only a limited number of classes. In practice, the resolution of leak detection depends on the spacing of leak candidates. Higher resolutions need more, closely spaced candidates.

Taking such an approach increases the complexity of the ML problem and consequently reduces the leak detection accuracy. To the authors' best knowledge, these are the only studies in joint [...]

In the following sections, first, the applied hydraulic simulation model and dataset generation are explained. After that, the leak zone identification stage is described based on the concept of SVR. Next, the application of the ensemble learning approach to in-zone leak detection is explained. The model is applied to two case study pipelines. The first one is a numerical pipeline used to evaluate the method in large-scale problems with a large number of leak candidates.

The second is an experimental reservoir-pipe-valve (RPV) system used to study the method in an actual situation. Finally, the results are discussed, and the study is concluded.

Figure ( [...]

The performance of data-driven models can be highly affected by the quality of the dataset.

In leak detection, the training set should contain a sufficient number of samples, covering a variety of leak scenarios, to comprehensively capture the behavior of pipelines in the presence of leaks (Zhang, et al. 2016). In the context of ML, the input vector of each sample is called the feature vector. It should include the most informative elements that represent the problem's behavior, thus leading to higher performance of the applied learning algorithm. In this study, a vector containing the transient pressure heads of the pipe system is used as the input to describe a certain leak state, and the location of the corresponding leak is taken as the label.
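To illustrate how one such sample could be produced, the sketch below generates the valve pressure-head trace of a reservoir-pipe-valve line with a single leak, using a bare-bones Method of Characteristics. It is a simplified stand-in and not the simulation model of this study: friction is neglected, the valve closes instantaneously, the leak is an orifice located at a grid node, and all names and default values (`simulate_valve_head`, `leak_cda`, the pipe dimensions) are hypothetical. The trace spans three pipe-length travel times, and the leak position serves as the label.

```python
import math

def simulate_valve_head(length=1000.0, wave_speed=1000.0, diameter=0.5,
                        h_res=50.0, q0=0.05, n_reach=20,
                        leak_node=None, leak_cda=0.0, g=9.81):
    """Frictionless MOC sketch: head at the valve after an instantaneous closure.

    Returns the valve head at every time step over a duration of 3L/a.
    leak_node is an interior grid node (1..n_reach-1); leak_cda is the lumped
    discharge coefficient times orifice area (hypothetical values).
    """
    area = math.pi * diameter ** 2 / 4.0
    b = wave_speed / (g * area)            # characteristic impedance B = a/(gA)
    n = n_reach
    has_leak = leak_node is not None and leak_cda > 0.0
    q_leak0 = leak_cda * math.sqrt(2.0 * g * h_res) if has_leak else 0.0

    # Steady initial state (no friction, so the head is uniform).
    # qu[i]: flow arriving at node i; qd[i]: flow leaving node i.
    head = [h_res] * (n + 1)
    qu = [q0 + q_leak0 if has_leak and i <= leak_node else q0 for i in range(n + 1)]
    qd = [q0 + q_leak0 if has_leak and i < leak_node else q0 for i in range(n + 1)]

    trace = []
    for _ in range(3 * n):                 # 3L/a equals 3*n steps of dt = dx/a
        new_h, new_qu, new_qd = head[:], qu[:], qd[:]
        # Upstream reservoir: fixed head, flow from the C- characteristic.
        cm = head[1] - b * qu[1]
        new_qu[0] = new_qd[0] = (h_res - cm) / b
        for i in range(1, n):
            cp = head[i - 1] + b * qd[i - 1]   # C+ from the upstream neighbour
            cm = head[i + 1] - b * qu[i + 1]   # C- from the downstream neighbour
            if has_leak and i == leak_node:
                # Orifice leak: (cp + cm - 2H)/B = CdA*sqrt(2gH), quadratic in sqrt(H).
                c = b * leak_cda * math.sqrt(2.0 * g)
                s = (-c + math.sqrt(c * c + 8.0 * (cp + cm))) / 4.0
                new_h[i] = s * s
            else:
                new_h[i] = 0.5 * (cp + cm)
            new_qu[i] = (cp - new_h[i]) / b
            new_qd[i] = (new_h[i] - cm) / b
        # Downstream valve, fully closed: zero flow, head from the C+ characteristic.
        new_h[n] = head[n - 1] + b * qd[n - 1]
        new_qu[n] = new_qd[n] = 0.0
        head, qu, qd = new_h, new_qu, new_qd
        trace.append(head[n])
    return trace

# One (feature vector, label) pair: leak at node 10 of 20, i.e. at mid-length.
features = simulate_valve_head(leak_node=10, leak_cda=1e-4)
label = 10 * 1000.0 / 20                   # leak position in metres
```

Repeating the call over many leak nodes and sizes would give a labeled training set. A model with friction would also need the initial steady state to account for head loss along the pipe.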

The length of the applied pressure trace is $3L/a$, where $L$ and $a$ are the pipe length and wave speed, respectively. The reason for this choice is that this part of the response signal contains information on the complete pipeline without significant energy dissipation (Bohorquez, et al. 2020). Accordingly, to set up the required datasets, the following steps are taken: (1) [...] where $x$ is the distance along the pipe, $t$ is time, $a$ is wave speed, $g$ is gravitational acceleration, $A$ is cross-sectional pipe area, $D$ is pipe diameter, $Q$ is instantaneous discharge, $H$ is the instantaneous piezometric head, and $f$ is friction factor. The [...] where $\varepsilon$ is pipe roughness, $D$ is pipe diameter, $Re = VD/\nu$ is the Reynolds number, $V$ is pipe flow velocity, and $\nu$ is the kinematic viscosity of the fluid. In this study, the unsteady friction factor is calculated by the following model initially introduced by Brunone, et al. (1995) and revised by Vitkovsky (2000).
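For orientation, the quantities listed above (distance, time, wave speed, gravitational acceleration, area, diameter, discharge, head, friction factor) are those of the classical one-dimensional water-hammer equations, and the Brunone model with the Vitkovsky sign correction is commonly written as shown below. Both are standard textbook forms, given here as an assumption about the formulation meant. The symbols $k$ (Brunone friction coefficient) and $f_q$ (quasi-steady friction factor) are introduced for illustration.

```latex
% Continuity and momentum equations for transient pipe flow
\frac{\partial H}{\partial t} + \frac{a^{2}}{gA}\,\frac{\partial Q}{\partial x} = 0,
\qquad
\frac{\partial Q}{\partial t} + gA\,\frac{\partial H}{\partial x}
  + \frac{f\,Q\,|Q|}{2DA} = 0

% Unsteady friction factor: quasi-steady part plus an acceleration-dependent part
f = f_q + \frac{kD}{V\,|V|}
    \left( \frac{\partial V}{\partial t}
         + a\,\operatorname{sign}(V)\left|\frac{\partial V}{\partial x}\right| \right),
\qquad V = \frac{Q}{A}
```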
In which [...]

In the classification approach, the leak's location is considered a discrete value, and some locations along the pipeline are designated as leak candidate locations. Each location is a label, and the classifier is supposed to find the correct label for new inputs. In the regression approach, the leak location is a continuous real number. Thus, the location of possible leaks is not limited to certain places.

The regressor returns a real number as the leak location. In long pipelines, the number of leak candidate locations must be increased to preserve the leak localization resolution. This increases the ML complexity by increasing the number of classes (leak candidate locations), and consequently, standard ML algorithms exhibit poor performance in terms of accuracy and reliability.

In this study, to address the issue mentioned above, a regression-classification scheme is
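The bookkeeping behind such a two-stage scheme can be sketched as follows: a regressed leak location is turned into a zone, leak candidates are placed inside the zone, and the ensemble components' predictions are combined by majority voting. This is a minimal sketch under stated assumptions: the zone is taken as the regression estimate plus or minus the maximum leak detection error (MLDE), candidates are evenly spaced, and the function names and numbers are hypothetical. The regressor and the component classifiers themselves are not implemented here; their outputs are represented by `x_hat` and `votes`.

```python
from collections import Counter

def leak_zone(x_hat, mlde, pipe_length):
    """Stage 1 output: interval around the regressed location (assumed x_hat +/- MLDE)."""
    return max(0.0, x_hat - mlde), min(pipe_length, x_hat + mlde)

def zone_candidates(zone, spacing):
    """Evenly spaced leak candidate locations inside the zone, end points included."""
    lo, hi = zone
    count = int((hi - lo) / spacing + 1e-9) + 1
    return [lo + k * spacing for k in range(count)]

def majority_vote(votes):
    """Stage 2 decision: the candidate index predicted by most ensemble components."""
    return Counter(votes).most_common(1)[0][0]

zone = leak_zone(x_hat=480.0, mlde=50.0, pipe_length=1000.0)
candidates = zone_candidates(zone, spacing=10.0)
votes = [5, 5, 6, 5, 4]          # hypothetical component predictions (candidate indices)
location = candidates[majority_vote(votes)]
```

Restricting the classifier to the zone keeps the number of classes tied to the zone width rather than to the full pipeline length, which is the motivation given above for combining regression with classification.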

In this study, the K-fold cross-validation method (Refaeilzadeh, et al. 2009) is applied to tune the model parameters and to handle the issues related to overfitting and noise in samples.

In this technique, the training set is randomly split into $K$ subsets of the same size. Each of these disjoint subsets is called a fold. Then, for each fold, the classifier is trained using the out-of-fold samples, and after that, the model performance is assessed using the in-fold data as a validation [...]
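The fold mechanics described above can be sketched in a few lines of plain Python. The helper names (`k_fold_indices`, `cross_validate`) and the `fit`/`score` callables are hypothetical placeholders for whichever learner is being tuned, and folds differ in size by at most one sample when the set does not divide evenly.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Randomly split sample indices into k disjoint folds of (nearly) equal size."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_samples, k, fit, score, seed=0):
    """Train on the out-of-fold samples, validate on the in-fold samples, average."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit(train_idx)               # user-supplied training routine
        scores.append(score(model, val_idx)) # user-supplied validation metric
    return sum(scores) / k

folds = k_fold_indices(10, 5)
```

For hyperparameter tuning, `cross_validate` would be called once per candidate setting and the setting with the best averaged score retained.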

Leak zone identification
The first stage in the proposed method is to train the leak zone identification algorithm and determine the extent of the leaky zone according to its MLDE. To do so, SVR models with different kernels are trained using the training set and tested using the first test set. [...] shortens the corresponding leaky zone by decreasing the MER. The right column of Table ( [...] proportion is reduced to 40% for the test set. [...] (Table 5), and the second is to investigate the impact of the number of component LDA classifiers on its performance (Table 6). According to [...] to enhance the robustness and reliability of the model in various zones along the pipeline, the algorithm with 40 models and a subspace of 50% is selected for in-zone leak detection. [...]

The authors have no relevant financial or non-financial interests to disclose.