Lithology identification based on interpretability integration learning

An interpretable model for intelligent lithology identification is proposed, which utilizes Ensemble Learning Stacking, Permutation Importance (PI), and Local Interpretable Model-agnostic Explanations (LIME) techniques. The aim of this method is to provide more accurate geological information and scientific support for oil and gas resource exploration. Two logging datasets from the public domain were used in the experiments; support vector machine (SVM), random forest (RF), and naive Bayes (NB) were employed as base learners, while SVM was utilized as the meta-learner for lithology classification via the stacking algorithm. The accuracy of the model was verified using evaluation metrics such as Area Under Curve (AUC), precision, recall, and F1-score. The PI and LIME techniques were employed to explain the lithology identification model. The results indicate that the stacking algorithm achieved the best evaluation indexes and highest prediction accuracy. With respect to the overall interpretation, PHIND, GR, and RT were found to have the most significant influence on lithology identification in a natural gas protection area in the United States, while DEN, CAL, and PEF were observed to be the most influential variables for lithology identification in the Daqing Oilfield in China. From the perspective of a single sample, the LIME algorithm can provide a quantitative prediction probability and the degree of influence of the characteristic variables.


Introduction
Oil and gas are non-renewable resources that are becoming increasingly exploited and consumed by humans. As a result, there is a growing demand for high-speed and accurate oil and gas resource exploration. Lithology identification is a critical step in accurately determining rock porosity and oil saturation, and is also the basis for studying geological reservoir characteristics, calculating reserves, and geological modeling in oil and gas exploration. Therefore, the development of rapid and accurate lithology identification methods using machine learning has become a significant research topic in the field.
Data-driven machine learning methods have proven to be effective in uncovering complex nonlinear relationships among high-dimensional features. In recent years, machine learning has rapidly developed and has been widely applied in various fields, including geosciences (Saporetti et al. 2018, 2019; Sun et al. 2019; Asante-Okyere et al. 2020). In lithologic identification, several related studies have also been conducted. Support Vector Machine (SVM) (Bressan et al. 2020) can handle non-linear geological data and achieve high classification accuracy, but it is sensitive to the choice of kernel function and parameter tuning. Artificial Neural Networks (ANN) (Asante-Okyere et al. 2020) can learn complex non-linear patterns in geological data, but they require large amounts of labeled data and are sensitive to the choice of architecture and hyperparameters. AdaBoost can handle complex geological data and improve classification accuracy, but it is sensitive to outliers and noise in the data. Random Forest (RF) (Ao et al. 2019) can handle high-dimensional geological data and achieve high accuracy, but it is prone to overfitting and its decision-making process can be difficult to interpret.
Stacking is an ensemble learning algorithm that employs a parallel learning approach: several base algorithms (known as "primary learners") generate initial prediction values, and a meta-learner then optimizes these initial predictions to obtain the final prediction results. Stacking has been shown to improve the generalization ability (Liu et al. 2020; Jia et al. 2021) and robustness of the model, resulting in improved accuracy of rock type classification by combining the outputs of multiple base models. Additionally, stacking can help identify which base models perform well on specific subsets of geological data, which can inform further model selection and feature engineering.
Currently, intelligent lithologic identification research focuses on improving model accuracy, but the interpretation of these models' predicted results is often insufficient. Machine learning algorithms that are more accurate tend to be less interpretable (Ibrahim et al. 2019), which hinders the progress of lithology identification models. To address this issue, several interpretive algorithms have emerged in recent years. For example, SHAP has been used for coronary heart disease mortality prediction, interpretable machine learning has been applied to estimate crop yields (Mateo-Sanchis et al. 2021), and LIME has been used for traffic safety interpretability studies (Das et al. 2021). These interpretive algorithms can improve the interpretability of machine learning algorithms and increase the reliability of lithologic identification models.
Previous studies have shown that random forest and stacking-based ensemble learning models are widely used in lithology identification. However, the generalization ability of these models in lithology recognition and prediction evaluation still needs further verification, and their interpretability requires improvement. In this paper, we use two public logging datasets, from the Council Grove gas reserve located in Kansas, USA (Dubois et al. 2003, 2007) and Daqing Oilfield in China, to verify the performance of the stacking algorithm in lithology classification, using SVM, RF, and NB as base learners and SVM as the meta-learner. Precision, recall, F1-score, and AUC are evaluated, and the interpretability of the identification model is studied using two explanatory algorithms: PI and LIME.

Data description and pre-processing
Two datasets from the public domain were used in the study. Dataset 1 (Data 1) consists of facies logs from nine wells in the Council Grove gas reserve located in Kansas, USA, and Dataset 2 (Data 2) comprises 12 wells from Daqing Oilfield, China, each with the same logs and lithologies (Cao 2018).
Data 1 has 9 lithologies: Nonmarine sandstone, Nonmarine coarse siltstone, Nonmarine fine siltstone, Marine siltstone and shale, Mudstone (limestone), Wackestone (limestone), Dolomite, Packstone-grainstone (limestone), and Phylloid-algal bafflestone (limestone), which were determined from core samples at half-foot intervals and matched with the logging data at the well locations. The dataset contains 3165 samples in total. The feature variables comprise five wireline log measurements. See Table 1 and Fig. 1.
Data 2 is the actual logging data of a tight sandstone working area in Daqing Oilfield, which has been repeatedly verified by experts. This area is a key area for tight sandstone oil exploration and mainly contains five lithology categories: mudstone, siltstone, argillaceous siltstone, silty mudstone, and oil shale. The dataset contains 5978 samples in total. The feature variables comprise eight conventional logging curves common to each well. See Table 2 and Fig. 2.
To ensure the reliability of the data, preprocessing is performed on the original logging data: outliers are removed, the curves are filtered, and min-max normalization is applied. The calculation method is as follows:

X_nor = (X − X_min) / (X_max − X_min)    (1)

where X_nor is the normalized logging data, and X_max and X_min are the maximum and minimum values of the original logging data of the single attribute, respectively.

To divide the dataset, we select samples from each category, with a ratio of 8:2 for training and testing.
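As a concrete illustration, the min-max normalization of Eq. (1) can be sketched in a few lines of Python; the logging matrix below is synthetic and merely shaped like real curve values:

```python
import numpy as np

def min_max_normalize(X):
    """Eq. (1): X_nor = (X - X_min) / (X_max - X_min), applied per log curve (column)."""
    X = np.asarray(X, dtype=float)
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    return (X - x_min) / (x_max - x_min)

# Illustrative logging matrix: rows are depth samples, columns are curves
# (fabricated values standing in for, e.g., GR and DEN readings).
X = np.array([[ 60.0, 2.45],
              [ 80.0, 2.60],
              [100.0, 2.75],
              [ 70.0, 2.50]])
X_nor = min_max_normalize(X)
print(X_nor.min(axis=0), X_nor.max(axis=0))  # every column now spans [0, 1]
```

Normalizing each attribute independently keeps curves with very different physical ranges (API units vs. g/cm³) from dominating distance-based learners such as SVM.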

Correlation analysis
There may be intricate and obscure connections between feature variables and labels, as well as between pairs of feature variables, so it is essential to identify and quantify the degree of correlation. This can aid in better preparing the data to meet the requirements of machine learning algorithms, such as linear regression, whose efficacy diminishes in the presence of such correlations. In this article, the Spearman coefficient is used to measure the correlation between selected feature variables and between variables and labels. The correlation coefficient between characteristic variables should satisfy |R| ≤ 0.75 (Chen et al. 2022). The correlation between feature variables and labels is discussed in Section "Interpretability of the model". Figure 3(a) shows that the five feature variables used in Data 1 all meet the requirement of mutual independence. As can be seen from Fig. 3(b), among the feature variables used in Data 2, DEN and AC are highly correlated, so AC was removed; the other feature variables meet the requirement of mutual independence.
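A minimal sketch of this screening step, using SciPy's spearmanr on synthetic curves (the DEN/AC coupling below is fabricated purely to mimic the redundancy found in Data 2):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n = 200
gr = rng.normal(80, 15, n)                     # hypothetical gamma ray curve
den = rng.normal(2.5, 0.1, n)                  # hypothetical density curve
ac = 300.0 - 80.0 * den + rng.normal(0, 1, n)  # fabricated AC, strongly tied to DEN

features = {"GR": gr, "DEN": den, "AC": ac}
names = list(features)
keep = set(names)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        r, _ = spearmanr(features[names[i]], features[names[j]])
        if abs(r) > 0.75:           # violates the |R| <= 0.75 screening rule
            keep.discard(names[j])  # drop the later of the two redundant curves
print(sorted(keep))  # AC is removed, mirroring the Data 2 result
```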

Support Vector Machine
Support Vector Machine (SVM) is a powerful tool for accurately constructing the widest classification boundary of data and solving non-linear classification problems (Wang et al. 2017). This algorithm calculates a hyperplane that maximizes the distance between different sample categories (Shankar et al. 2018, 2020), and can be applied to both linearly separable and non-separable data. SVM uses the hinge loss function to calculate empirical risk and includes a regularization term in the solution system to optimize structural risk. It is a sparse and robust classification method. SVM can use the kernel method to perform non-linear classification, which makes it one of the most common kernel learning methods (Hsieh 2009).

Random forest
The Bagging algorithm is a commonly used ensemble learning technique. In this algorithm, each learner's training set is obtained by randomly sampling from the original training set, and the new training set is the same size as the original. The Random Forest algorithm is an ensemble learning method based on the Bagging algorithm. It has high classification accuracy and handles outliers and noise well. Random Forest can also handle high-dimensional data without requiring feature selection (Breiman 2001; Genuer et al. 2017).

Naive Bayes
Naive Bayes (NB) is a commonly used classification method based on Bayes' theorem and the assumption of independent feature conditions. It is a stable and efficient model that requires minimal parameter estimation and is less sensitive to missing data. The NB algorithm assumes that the properties of the data set are independent of each other, resulting in a simple logic that is agnostic to different types of data sets. Theoretically, it has a lower error rate than other classification methods.

Stacking
Ensemble learning is a fundamental method in data science that relies on the outcomes of multiple models, combining the results of several weak learners to achieve better performance than a single strong model. Stacking is a popular method in ensemble learning. As shown in Fig. 4, the Stacking framework consists of two layers of prediction models: the primary learners in the first layer and the meta-learner in the second layer. The Stacking prediction approach first inputs the raw data into each primary learner to obtain the primary learners' prediction results. Then, these prediction results are used as input for the meta-learner to produce the final prediction results. Stacking is a powerful technique that can significantly enhance the prediction accuracy of a model, especially for complex datasets.
The Stacking prediction method combines the advantages of different learners through the integration of multiple primary learners to make the prediction model with strong generalization ability; further, the meta-learner is used to optimize the output results of primary learners to improve the overall prediction accuracy (Xu et al. 2020).
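The two-layer framework described above maps directly onto scikit-learn's StackingClassifier. The sketch below mirrors the paper's setup (SVM, RF, and NB as primary learners, SVM as meta-learner) but runs on synthetic data with default hyperparameters, not the paper's tuned models:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic stand-in for a logging dataset: 5 feature curves, 3 lithology classes.
X, y = make_classification(n_samples=600, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

# First layer: the three primary learners; second layer: an SVM meta-learner
# trained on the primary learners' cross-validated predictions (cv=5).
base_learners = [("svm", SVC(probability=True, random_state=42)),
                 ("rf", RandomForestClassifier(random_state=42)),
                 ("nb", GaussianNB())]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=SVC(random_state=42), cv=5)
stack.fit(X_tr, y_tr)
print("test accuracy:", round(stack.score(X_te, y_te), 3))
```

The cv argument matters: the meta-learner sees out-of-fold predictions rather than the primary learners' fits on their own training data, which is what protects the stack from simply inheriting an overfit base model.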

Evaluation identification results
To evaluate the identification results of the four models, precision, recall, F1-score, and AUC were selected as indicators. All evaluation metrics were obtained on the validation set.
Precision refers to the probability of correctly predicting positive samples among all the predicted positive samples. Recall is the probability of correctly identifying positive samples in the actual positive sample set. F1-score is the harmonic mean of precision and recall.
Receiver Operating Characteristic curve (ROC) and AUC are measures used to evaluate the performance of classification models. The x-axis of the ROC curve is the false positive rate (the probability of identifying a positive case when it is actually negative), while the y-axis represents the true positive rate (the probability of correctly identifying a positive case). AUC is the area under the curve. When comparing different classification models, the ROC curve of each model can be drawn and the area under the curve can be used as an indicator of the model's performance. A higher AUC value indicates a more accurate classifier (Swets 1988).
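The four metrics above can be computed with scikit-learn as sketched below; the labels and probability matrix are toy values, not results from the paper, and "macro" averaging (equal weight per class) is assumed because it suits imbalanced lithology classes:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

# Toy ground truth and predictions for a 3-class problem (illustrative values only).
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_pred = np.array([0, 1, 1, 1, 2, 2, 1, 1])

p = precision_score(y_true, y_pred, average="macro")
r = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")

# AUC needs class-membership scores; this probability matrix stands in for
# model output (each row sums to 1 across the three classes).
proba = np.array([[0.80, 0.10, 0.10],
                  [0.50, 0.40, 0.10],
                  [0.20, 0.70, 0.10],
                  [0.10, 0.80, 0.10],
                  [0.10, 0.20, 0.70],
                  [0.05, 0.15, 0.80],
                  [0.20, 0.50, 0.30],
                  [0.10, 0.60, 0.30]])
auc = roc_auc_score(y_true, proba, multi_class="ovr", average="macro")
print(round(p, 3), round(r, 3), round(f1, 3), round(auc, 3))
```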

Interpretive model for lithology identification
An unexplained lithology recognition model may fail to yield trustworthy geological information: machine learning models act like a black box, making it difficult to inspect the operations within the model, and the only way to enhance performance is to experiment with different parameters. To address this issue, interpretability algorithms such as PI and LIME are needed. These algorithms provide insights into the inner workings of the model, thereby increasing the interpretability of the results.

Permutation importance
PI is a useful, model-agnostic technique for evaluating the importance of features. It establishes the significance of a feature by randomly permuting that feature's values: the greater the resulting change in model performance, the more important the feature (Breiman 2001). The procedure is as follows: 1) select a feature after model training; 2) randomly permute the values of that feature and calculate new prediction results; 3) compare the old and new prediction results to determine the impact of the feature on the model prediction.
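The three-step procedure can be sketched directly with NumPy (libraries such as eli5 and sklearn.inspection provide the same idea ready-made); the model and data below are synthetic stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

def permutation_importance(model, X, y, n_repeats=10, seed=42):
    """Importance of feature j = drop in score after shuffling column j only."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)                     # step 1: score after training
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            rng.shuffle(X_perm[:, j])                # step 2: permute feature j
            scores.append(model.score(X_perm, y))    # ...and re-score the model
        importances[j] = baseline - np.mean(scores)  # step 3: old vs. new results
    return importances

# shuffle=False places the two informative features in columns 0 and 1.
X, y = make_classification(n_samples=400, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)
imp = permutation_importance(model, X, y)
# The informative columns should receive clearly larger importances than the noise columns.
```

Because only one column is disturbed at a time while the trained model stays fixed, the measure is independent of the model family, which is what makes PI suitable for explaining the Stacking classifier.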

Local interpretable model-agnostic explanations
LIME is a model-agnostic local interpretability algorithm proposed by Ribeiro et al. (2016). LIME can faithfully reflect the behavior of the classifier when predicting samples. It targets a single sample and assumes that a simple linear model can explain the data points locally. LIME applies small perturbations to the input values around the local point, observes the prediction behavior of the model, and assigns weights according to the distance between the perturbed points and the original data, so as to obtain an interpretable model and prediction results. The specific formula is as follows:

ξ(x) = argmin_{g∈G} L(f, g, π_x) + Ω(g)    (6)

For the explanatory model g of example x, the approximation between g and the original model f is evaluated through the loss function L. In Formula (6), Ω(g) represents the model complexity of the explanatory model g, G represents all possible explanatory models, and π_x defines the neighborhood of x; the model is made interpretable by minimizing L.
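A minimal sketch of this idea: perturb around x, weight the perturbations by a distance kernel playing the role of π_x, and fit a simple weighted linear surrogate g (a ridge regression here, whose penalty stands in for the complexity term Ω(g)). All data and settings below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

def lime_explain(predict_proba, x, class_idx, n_samples=2000,
                 kernel_width=1.0, seed=42):
    """Fit a distance-weighted linear surrogate g around x to mimic f locally."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))  # perturb around x
    f_z = predict_proba(Z)[:, class_idx]                     # black-box outputs
    d = np.linalg.norm(Z - x, axis=1)
    pi_x = np.exp(-(d ** 2) / kernel_width ** 2)             # neighborhood weights
    # Ridge's penalty plays the role of the complexity term Omega(g).
    g = Ridge(alpha=1.0).fit(Z, f_z, sample_weight=pi_x)
    return g.coef_                                           # per-feature contributions

X, y = make_classification(n_samples=500, n_features=4, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
f = RandomForestClassifier(random_state=0).fit(X, y)
weights = lime_explain(f.predict_proba, X[0], class_idx=y[0])
# Positive weights push the prediction toward the class; negative ones away from it.
```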

Experimental results in Data 1
We utilized SVM, RF, and NB as base learners, with SVM as the meta-learner, to identify lithology in Dataset 1. The results for the precision, recall, and F1-score indicators are presented in Table 3. The results demonstrate that all four models perform well in lithology recognition on the training set, with the Stacking model achieving the best identification result. Stacking's three evaluation indicators have the highest means: precision was 86.40%, recall was 92.23% (over 6% higher than the three primary models), and F1-score was 89.12%, demonstrating the effectiveness of ensemble learning. Furthermore, the standard deviation analysis suggests that Stacking has certain advantages in stability and can, to some extent, overcome the imbalance in category numbers.
In this study, the ROC curves were analyzed for each of the four models (Fig. 5). The micro-mean AUC of SVM was 0.82 and the macro-mean AUC value was 0.72, with Class5

Experimental results in Data 2
We employed SVM, RF, and NB as base learners, with SVM serving as the meta-learner. We utilized these models to identify lithology in Dataset 2 and evaluated their performance based on precision, recall, and F1-score, as shown in Table 4. The results indicated poor lithology recognition using SVM, with precision at 50.36%, recall at 68.28%, and F1-score at 62.06%. Although the lithology identification results of RF and NB were not particularly outstanding, Stacking demonstrated the best performance: precision was 84.48%, recall was 89.43%, and F1-score was 86.09%, over 7% higher than the other three models, highlighting the effectiveness of ensemble learning in complex situations. Moreover, standard deviation analysis revealed that Stacking also exhibited certain advantages in terms of stability. In this study, ROC curves were analyzed for the four models (Fig. 6). The micro-mean AUC value of SVM was 0.77, and its macro-mean AUC value was 0.53. The AUC value of Class5 was 0.35, indicating the worst discrimination effect. RF and NB both have micro-mean AUC values of 0.86, with macro-mean AUC values of 0.70 and 0.82, respectively. In the RF prediction results, Class2 and Class5 were less well identified, with AUC values not exceeding 0.6. In the NB identification results, the recognition of Class2 is relatively poor, with an AUC value below 0.7, indicating poor overall identification. The micro-mean AUC of Stacking is 0.91 and the macro-mean is 0.86, so its overall performance is better than that of the other three models. The AUC value of Class2 was the lowest at 0.72, still a large improvement over the other models; Class5 had the highest AUC value of 0.96, the highest identification accuracy among all categories of all models.

The importance of factors in Data 1
Different evaluation factors affect the accuracy of lithology identification differently, so this study determines the magnitude of each factor's influence from a global perspective. That is, the importance of each evaluation factor is calculated: the higher the importance value, the greater the effect of that factor on the evaluation results. We employed the PermutationImportance class from the eli5.sklearn library in Python to calculate the feature importances of our classifier. The random seed was set to 42, and we chose the Stacking model as our prediction model.
The importance order from largest to smallest is average of neutron and density logs (PHIND), gamma ray log (GR), resistivity measurement (RT), porosity index (POR), and photoelectric effect log (PE) (Fig. 7). Comparing this result with the correlation analysis between feature variables and labels in Section "Correlation analysis", PHIND, which has the highest correlation with the labels, also has the highest impact weight, and PE, which has the lowest correlation, has the smallest impact weight. However, GR, despite its relatively small correlation, ranks higher in impact weight than RT and POR, which have higher correlations. The PI method thus uncovers a potential relationship between GR and the lithology categories that is not reflected in the correlation analysis.

Lithology recognition model local interpretation in Data 1
We utilized the ExplainPrediction class from the eli5.sklearn library in Python to interpret our model based on LIME (Local Interpretable Model-agnostic Explanations). Specifically, we employed cosine similarity to compute the similarity between samples and used a seeded perturbation method to generate perturbed samples by adding random noise to the original sample, setting the random seed to 42. The original model used the stacking approach, and we employed the least squares method to fit the local linear model.
The LIME algorithm selects samples from categories 5 and 3 (Figs. 8 and 9). The results show the contribution of each feature to the prediction for each sample. A positive number in the figure indicates that the feature has a positive effect on the result, and a negative number the opposite. From Fig. 8, the model assigns an 82% probability to sample Data 1-27 being identified as Class5. The model classified this sample based on such factors as PHIND, PE, and POR, with PHIND having the highest contribution of 0.494, consistent with the results of the global interpretation. In addition, the contributions of RT and GR are negative, acting as interference against predicting this sample as Class5.
The model assigns a 66% probability to sample 43 being identified as Class3. The rationale for this classification is based on POR, GR, PE, PHIND, and RT, indicating that all features support this prediction, though the effect sizes differ.
Comparison of the LIME results with the correlation analysis in Section "Correlation analysis" reveals that the feature with the highest importance for sample 27 in Data 1 is PHIND at 0.494, and it also has the highest correlation with the corresponding label. The importance of features PE, POR, RT, and GR is inconsistent with their correlation with the labels, which is likely due to their inherently low importance (all below 0.16) and the presence of sample-specific characteristics. For sample 43, the ranking of feature importance for POR, GR, and PE is consistent with their correlation with the labels, while the inconsistency of PHIND is due to its low importance of 0.018 and the sample belonging to a minority label (Litho3, which accounts for 19.4% of the total samples). The above analysis illustrates the significance of LIME's interpretability results.

The importance of factors in Data 2
The importance order from largest to smallest is density log (DEN), caliper log (CAL), photoelectric effect log (PEF), gamma ray log (GR), spontaneous potential log (SP), shallow resistivity log (LLS), and deep resistivity log (LLD) (Fig. 10).
Comparing this result with the correlation analysis between feature variables and labels in Section "Correlation analysis", DEN, which has the highest correlation with the label, also has the highest impact weight, and LLD, which has the lowest correlation, has the lowest impact weight. However, SP, despite its relatively low correlation, ranks higher in impact weight than LLS, which has a higher correlation.

Lithology recognition model local interpretation in Data 2
The LIME algorithm selects samples from categories 3 and 1 (Figs. 11 and 12) and interprets them. The results show that the model assigns a 64% probability to sample Data 2-111 being identified as Class3. The rationale for this classification is based on such factors as DEN, CAL, and GR, of which DEN has the highest contribution of 0.195. The results are consistent with the global interpretation. In addition, SP, LLD, and LLS interfere with the prediction of Class3, though only slightly. The model assigns an 89% probability to sample Data 2-243 being identified as Class1. The classification of this sample is based on such factors as PEF, CAL, DEN, SP, LLS, and LLD. In addition, GR acts as interference against predicting this sample as Class1.
In Data 2, the feature with the highest importance for sample 111 is DEN at 0.195, which is also highly correlated with the corresponding label. The importance of the CAL, GR, PEF, and SP features is consistent with the order of correlation with the corresponding label. For sample 243, the ordering of feature importance for PEF, CAL, DEN, SP, and others is similar to the order of correlation with the label.

Discussion
Previous machine learning-based lithology identification studies generally report feature importance, but the importance measures are tied to the specific machine learning algorithm, so they are not universal. In contrast, this study adopts the model-agnostic PI algorithm from a global perspective, which can more objectively explain the importance of the model's different input features.
In addition, in view of the limitations of previous lithology recognition models, this study introduced the LIME algorithm to locally explain the model and reveal the specificity of different samples for different categories. The results show that different input features have different effects on the classification of each sample, both supporting and interfering, and the magnitudes of these effects also differ.
Based on the results of the two interpretation algorithms, in Data 1, PHIND, GR, and RT are the most important feature variables for the model to identify lithology, and PHIND usually has a positive effect on the identification results. In Data 2, DEN is the most important variable and has a strong positive effect; SP, LLD, and LLS contribute less and usually have a negative impact on the identification results. These interpretations are meaningful in practice: the geological backgrounds of the two regions are clearly different, and there are obvious differences in the lithology of the rocks storing natural gas and oil, which leads to different influencing factors in lithology identification. This indicates that we can trust our prediction model to a certain extent.

Conclusion
Based on the analysis of lithology identification in a natural gas protection area in the United States and Daqing Oilfield, China, this study selected GR, RT, PE, POR, PHIND, CAL, SP, AC, DEN, LLD, LLS, and PEF as the characteristic variables for identification. The ensemble learning model combined with the PI and LIME algorithms was used to identify lithology, achieving high identification accuracy. The main findings of this study are: (1) The ensemble learning model represented by the Stacking algorithm is the most suitable for lithology identification on the two datasets, with Data 1 achieving a precision of 86.40%, recall of 92.23%, and F1-score of 89.12%, and Data 2 achieving a precision of 84.48%, recall of 89.43%, and F1-score of 86.09%. These results are better than those of the other three machine learning models. Moreover, the Stacking algorithm shows advantages in stability and can effectively overcome the imbalance in category numbers.
(2) From a holistic perspective, the three characteristic variables PHIND, GR and RT have the greatest influence on the lithology identification of the natural gas protection area in the United States, while DEN, CAL and PEF have the greatest influence on the lithology identification of Daqing Oilfield in China. From the perspective of single sample interpretation, the LIME algorithm is able to provide a quantitative prediction probability and the degree of influence of characteristic variables.
These findings provide a significant contribution to the interpretation and understanding of the lithology identification process, and can serve as a reference for similar studies in the future.

Author's contribution
All authors have contributed to the conception and design of this study. Data collection was carried out by Xiaochun Lin. Xiaochun Lin and Shitao Yin constructed the experimental models and carried out the development and testing of the proposed methods. The manuscript was written by Xiaochun Lin, and all authors have provided feedback and comments on the manuscript. All authors have reviewed and approved the final version of the manuscript.

Data availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations
Competing interests The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.