Data preparation is the most time-consuming and cumbersome step in geology and geophysics workflows. It is the first step in any modelling process, especially when the data are known to contain outliers arising from different factors. If the data are not cleaned before modelling, the output will be noisy and will ultimately hamper decision making. Geoscientists typically apply rule-based workflows or manually clean geophysical log data before further use. The proposed workflow intends to use data-science-based algorithms to automate this data-cleaning process.

Different methods are available to identify outliers in an independent variable. Univariate outliers are easy to identify and require only one feature; multivariate outliers, however, require investigation of distributions in *n*-dimensional space. Boxplot- or histogram-based methods are used to remove univariate outliers in a feature. If the feature follows a suitable distribution, for example a normal distribution, a Z-score approach can be used as an outlier removal technique: values with an absolute Z-score above a threshold of 3 or 3.5 are treated as outliers. This approach is similar to the use of a boxplot, which defines abnormal values in terms of the interquartile range (IQR), the difference between the upper quartile (Q3) and the lower quartile (Q1). By convention, values above Q3 + 1.5 IQR or below Q1 − 1.5 IQR are subjected to scrutiny. Figure 2(a) shows a boxplot marking the median, quartiles, IQR, and outliers. The application of the Z-score as an outlier detection technique is common across sectors; examples include the work of [12], [13], who successfully applied the Z-score algorithm to high-dimensional and time-series data.
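As an illustration, both rules can be sketched in a few lines of NumPy on synthetic data (a minimal sketch; the density-like values and injected spikes are hypothetical, not from the cited studies):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic log-like feature: mostly Gaussian, with a few injected spikes.
values = rng.normal(loc=2.45, scale=0.05, size=1000)  # e.g. bulk density, g/cc
values[:5] = [3.5, 1.0, 3.2, 0.9, 3.8]                # artificial outliers

# Z-score rule: flag |z| above a chosen threshold (3 or 3.5 are common).
z = (values - values.mean()) / values.std()
z_outliers = np.abs(z) > 3

# Boxplot/IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
```

Note the trade-off visible even in this sketch: the IQR bounds are robust to the spikes themselves, whereas the Z-score's mean and standard deviation are inflated by the very outliers being sought, which is why thresholds of 3 to 3.5 are applied after inspection rather than blindly.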

[4] worked on the utility of ML algorithms to streamline and shorten log-cleaning and petrophysical workflows. They used the XGBoost algorithm [14] to predict different petrophysical properties such as clay volume and saturation. In the training stage, manually cleaned and interpreted logs served as the dependent variable, with compressional slowness, density, and other geophysical logs as independent variables. The well log predictions were judged satisfactory by petrophysicists and were used further in the estimation of rock properties. Contrary to usual practice, they found that formation tops had a minimal impact on prediction accuracy. However, this insight requires further analysis, as formation tops were given only 0 or 1 labels; different formations ideally showcase different behaviours and measurements, so formation tops could be labelled distinctly. The suggested workflow did not use mud log data, which are available at every well location and could prove to be a valuable input. In addition, the outlier removal process relies on manually cleaned data that might not be readily available in many cases, such as exploratory wells. Another method useful for identifying anomalies is the isolation forest (IF), which randomly selects a feature and a split value to recursively partition the observations, so that anomalies are isolated in fewer splits. The IF workflow [10] and the extended IF are presented by [15] in their work on different types of datasets, excluding well logs.
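The isolation principle can be sketched with scikit-learn's IsolationForest on synthetic two-feature data (a minimal sketch; the feature choices, values, and contamination fraction are illustrative assumptions, not the cited authors' settings):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Two synthetic log-like features (e.g. density and compressional slowness),
# with a handful of anomalous readings appended at the end.
normal = rng.normal(loc=[2.45, 80.0], scale=[0.05, 5.0], size=(500, 2))
spikes = np.array([[3.9, 200.0], [1.0, 20.0], [4.2, 10.0]])
X = np.vstack([normal, spikes])

# IF isolates points via random feature/split choices; anomalies require
# fewer splits to isolate and receive label -1 from fit_predict.
clf = IsolationForest(contamination=0.01, random_state=0)
labels = clf.fit_predict(X)  # 1 = inlier, -1 = outlier
```

Because no cleaned reference logs are needed, an unsupervised detector of this kind sidesteps the dependence on manually cleaned training data noted above for exploratory wells.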

Anomaly detection has recently been applied to the optimization of production. The work of [16] applied ML algorithms to identify possible issues in plunger-lift operations. This work was a proof of concept for a south Texas field that helped in anomaly detection and focused more on possible solutions. A series of heuristic and ML models were constructed to facilitate labelling and categorization tasks, and the signatures of different production-related features were then linked with potential causes. They tested the algorithm on 50 wells over 3 months; the results suggest a reduction in the time spent on manual identification of problems and savings close to USD 220/well/month. Another example can be found in the work of [17], which applied several methods, including IQR cutoffs computed with NumPy's percentile() function, and demonstrated a use case of anomaly detection in real-time production data. Figure 2(b) shows an example of outliers present in well logs identified using a crossplot.
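For streaming production data, a fixed global cutoff is less useful than bounds that follow the trend. A rolling-window variant of the IQR rule might look like the following sketch (the synthetic rate series, window length, and upset positions are assumptions for illustration, not details from [17]):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Synthetic hourly production rate with a slow upward trend and two upsets.
rate = 500 + np.linspace(0, 50, 240) + rng.normal(0, 5, 240)
rate[100] = 150.0  # e.g. a plunger-lift upset
rate[180] = 120.0
s = pd.Series(rate)

# Rolling IQR cutoffs: compute Q1/Q3 over a trailing window so the bounds
# track the trend, then flag points outside Q1 - 1.5*IQR .. Q3 + 1.5*IQR.
q1 = s.rolling(48, min_periods=24).quantile(0.25)
q3 = s.rolling(48, min_periods=24).quantile(0.75)
iqr = q3 - q1
flags = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
```

The trailing window means the detector adapts as operating conditions drift, at the cost of leaving the first window of samples unflagged until enough history accumulates.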

The literature review suggests there are multiple methods to identify outliers in datasets. A comparison of different anomaly detection algorithms was performed by [18], who tested distance-based, K-means, density-based, statistical, and SVM algorithms on stellar population parameters from the astronomical domain. In that experiment, the kernel-based novelty technique gave the best results. The work of [19], [20] explores Class-Based Machine Learning and cross-entropy clustering-Gaussian mixture model-hidden Markov model workflows for well log data processing and interpretation; they demonstrate limited results for outlier identification and focus on interpretation through classification of well log data. The proposed workflow uses different algorithms and directly prepares well logs that can be consumed by geomechanical modelling. This research explores the applicability of six different ML algorithms to achieve cleaned data.