Data preparation is the most time-consuming and cumbersome step in geology and geophysics workflows. It is the first step in any modelling process, especially when the data are known to contain outliers arising from different factors. If the data are not cleaned before modelling, the output will be noisy and will ultimately hamper decision making. Geoscientists typically apply rule-based workflows or manually clean geophysical log data before further use. The proposed workflow intends to use data-science-based algorithms to automate this data-cleaning process.

Different methods are available to identify outliers in an independent variable. Univariate outliers are easy to identify and require only one feature; multivariate outliers, however, require investigation of distributions in *n*-dimensional space. Boxplot- or histogram-based methods are used to remove univariate outliers in a feature. If the feature follows a suitable distribution, for example a normal distribution, a Z-score approach can be used as an outlier removal technique: values with an absolute Z-score above a threshold of 3 or 3.5 are treated as outliers. This approach is similar to the use of a boxplot, which defines abnormal values in terms of the interquartile range (IQR), the difference between the upper quartile (Q3) and the lower quartile (Q1). By convention, values above Q3 + 1.5 IQR or below Q1 − 1.5 IQR are subjected to scrutiny. Figure 2(a) shows a boxplot marking the median, quartiles, IQR, and outliers. The application of the Z-score as an outlier detection technique is common across sectors; examples include the work of [12], [13], who successfully applied the Z-score algorithm to high-dimensional and time-series data.
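As an illustration, both rules can be sketched in a few lines of NumPy on synthetic data (a minimal sketch; the density-like values and injected spikes are hypothetical, not from the cited studies):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic log-like feature: mostly Gaussian, with a few injected spikes.
values = rng.normal(loc=2.45, scale=0.05, size=1000)  # e.g. bulk density, g/cc
values[:5] = [3.5, 1.0, 3.2, 0.9, 3.8]                # artificial outliers

# Z-score rule: flag |z| above a chosen threshold (3 or 3.5 are common).
z = (values - values.mean()) / values.std()
z_outliers = np.abs(z) > 3

# Boxplot/IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
```

Note the trade-off visible even in this sketch: the IQR bounds are robust to the spikes themselves, whereas the Z-score's mean and standard deviation are inflated by the very outliers being sought, which is why thresholds of 3 to 3.5 are applied after inspection rather than blindly.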

[4] worked on the utility of ML algorithms to streamline and shorten log-cleaning and petrophysical workflows. They used the XGBoost algorithm [14] to predict different petrophysical properties such as clay volume and saturation. In the training stage, manually cleaned and interpreted logs served as the dependent variable, with compressional slowness, density, and other geophysical logs as independent variables. The well log predictions were judged satisfactory by petrophysicists and were used further in the estimation of rock properties. Contrary to usual practice, they found that formation tops had a minimal impact on prediction accuracy. However, this insight requires further analysis, as formation tops were given only 0 or 1 labels; different formations ideally showcase different behaviours and measurements, so formation tops could be labelled distinctly. The suggested workflow did not use mud log data, which are available at every well location and could prove to be a valuable input. In addition, the outlier removal process relies on manually cleaned data that might not be readily available in many cases, such as exploratory wells. Another method useful for identifying anomalies is the isolation forest (IF), which randomly selects a feature and a split value to recursively partition the observations, so that anomalies are isolated in fewer splits. The IF workflow [10] and the extended IF are presented by [15] in their work on different types of datasets, excluding well logs.
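The isolation principle can be sketched with scikit-learn's IsolationForest on synthetic two-feature data (a minimal sketch; the feature choices, values, and contamination fraction are illustrative assumptions, not the cited authors' settings):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Two synthetic log-like features (e.g. density and compressional slowness),
# with a handful of anomalous readings appended at the end.
normal = rng.normal(loc=[2.45, 80.0], scale=[0.05, 5.0], size=(500, 2))
spikes = np.array([[3.9, 200.0], [1.0, 20.0], [4.2, 10.0]])
X = np.vstack([normal, spikes])

# IF isolates points via random feature/split choices; anomalies require
# fewer splits to isolate and receive label -1 from fit_predict.
clf = IsolationForest(contamination=0.01, random_state=0)
labels = clf.fit_predict(X)  # 1 = inlier, -1 = outlier
```

Because no cleaned reference logs are needed, an unsupervised detector of this kind sidesteps the dependence on manually cleaned training data noted above for exploratory wells.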

Anomaly detection has recently been applied to the optimization of production. The work of [16] applied ML algorithms to identify possible issues in plunger-lift operations. This work was a proof of concept for a south Texas field that helped in anomaly detection and focused more on possible solutions. A series of heuristic and ML models were constructed to facilitate labelling and categorization tasks, and the signatures of different production-related features were then linked with potential causes. They tested the algorithm on 50 wells over 3 months; the results suggest a reduction in the time spent on manual identification of problems and savings close to USD 220/well/month. Another example can be found in the work of [17], which applied several methods, including IQR cutoffs computed with NumPy's percentile() function, and demonstrated a use case of anomaly detection in real-time production data. Figure 2(b) shows an example of outliers present in well logs identified using a crossplot.
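For streaming production data, a fixed global cutoff is less useful than bounds that follow the trend. A rolling-window variant of the IQR rule might look like the following sketch (the synthetic rate series, window length, and upset positions are assumptions for illustration, not details from [17]):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic hourly production rate with a slow upward trend and two upsets.
rate = 500 + np.linspace(0, 50, 240) + rng.normal(0, 5, 240)
rate[100] = 150.0  # e.g. a plunger-lift upset
rate[180] = 120.0
s = pd.Series(rate)

# Rolling IQR cutoffs: compute Q1/Q3 over a trailing window so the bounds
# track the trend, then flag points outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR.
q1 = s.rolling(48, min_periods=24).quantile(0.25)
q3 = s.rolling(48, min_periods=24).quantile(0.75)
iqr = q3 - q1
flags = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
```

The trailing window means the detector adapts as operating conditions drift, at the cost of leaving the first window of samples unflagged until enough history accumulates.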

The literature review suggests there are multiple methods to identify outliers in datasets. A comparison of different anomaly detection algorithms was performed by [18], who tested distance-based, K-means, density-based, statistical, and SVM algorithms on stellar population parameters from the astronomical domain. In that experiment, the kernel-based novelty technique gave the best results. The work of [19], [20] explores Class-Based Machine Learning and cross-entropy clustering-Gaussian mixture model-hidden Markov model workflows for well log data processing and interpretation; they demonstrate limited results for outlier identification and focus on interpretation through classification of well log data. The proposed workflow uses different algorithms and directly prepares well logs that can be consumed by geomechanical modelling. This research explores the applicability of six different ML algorithms to achieve cleaned data.