Rock Thickness Identification by XGBoost with Logging Data

Based on research on the response mechanism of rock formations and reservoirs to logging curves, 12 logging curves were selected by considering the depth characteristics of the formations, and four algorithms were used to identify rock formations and reservoirs: logistic regression (LR), support vector machine (SVM), random forest (RF) and XGBoost. Of the 60 wells in the study block, 57 were used for training, and the remaining 3 were used as prediction samples to test the algorithms. Each of the four machine learning algorithms was used to identify rock formations and reservoirs, and the predictions were obtained separately. All 4 algorithms reached an accuracy of over 90% for rock formation and reservoir layer identification, but XGBoost performed best on the 4 scoring criteria of F1-score, precision, recall and accuracy. Its accuracy for rock formation identification exceeded 95%, and a correlation analysis between the logging curves and rock formations was performed on this basis. The results show that RMN, RLLD and RLLS respond most clearly to the sandstone layer, off-surface reservoir and effective thickness layer, whereas CAL has the least effect on formation and reservoir identification. These findings can provide an effective reference for the selection and dimensionality reduction of subsequent logging curves.


Introduction
At present, the main methods of logging-based lithology identification are conventional logging identification, conventional logging crossplot identification, imaging logging identification, principal component analysis, neural networks, and shear-wave crossplot identification. In recent years, machine learning methods for predicting rock formations and reservoir layers have also diversified. He Ping 1 proposed using the fuzzy ISODATA algorithm to optimize rock formation prediction. Song Jianguo et al. 2 used the nonlinear prediction capability of the random forest algorithm to predict reservoir layers. Wang Zhong et al. 3 used a support vector machine to eliminate multicollinearity between variables in a study of rock layer identification. Liu Ke et al. 4 used a mixture density network to build a theoretical model for effective prediction of sand thickness. Shan Jingfu 5 used a BP neural network to identify gas reservoir layers with complex rock layers. Lei Huang et al. 6 used a deep learning approach for rock formation identification and correlation analysis. Beyond the choice of machine learning algorithm, a reasonable choice of logging curves and features also has a large impact on the accuracy of rock layer identification 7 . For example, Mou Dan et al. 8 used five logging curves (deep laterolog, acoustic transit time, compensated neutron, density and natural gamma) to identify basal rocks in the Liaohe Basin. Wang Zehua et al. 9 used five curves (resistivity, acoustic transit time, natural gamma, density and acoustic impedance) to identify basal rocks in the Liaohe Basin.
Logging curves were used to identify the 4 major lithofacies in the Junggar Basin; Zhu Yixiang and Shi Guangren 10 used 15 logging curves, such as acoustic transit time, bulk density, photoelectric absorption cross-section index, natural gamma, compensated neutron, deep laterolog, and shallow laterolog, as multidimensional geological characterization parameters to identify coal seams; Guo Xu 11 used five logging curves (lateral resistivity, density, natural gamma, acoustic velocity, and spontaneous potential) to identify coal seams and interpret their positions qualitatively; Yu et al. 12 used 4 logging curves (acoustic transit time, density, spontaneous potential and natural gamma) and selected appropriate kernel functions to predict the three major rock types of siltstone, mudstone and conglomerate in an oil field. However, the above studies share some problems: high model complexity prevents a mature trained model from being obtained quickly, the formation identification results are easily affected by noise in the raw data, and the formation data obtained from the logging curves are not clear enough to support more efficient formation identification 13 . To address these problems, this paper classifies the formations into sandstone layer, off-surface reservoir and effective thickness layer from the perspective of reservoir evaluation and identifies the rock layers in this block with XGBoost, random forest, support vector machine and logistic regression. A comparison of the 4 algorithms shows that XGBoost can effectively solve the problem of accurately identifying rock layers and reservoirs, which supports the general applicability of the method.

Description of the study block
Based on the field logging data and actual conditions, the working wells in this study block are basically shallow wells with depths ranging from 1,260 to 1,314 m.
The logging data obtained from the field cover well depths ranging from 1,035 to 1,300 m. The well section in this study block passes through the Sa Zero, Sa One, Sa Two and Sa Three formations, which contain three to four large interbedded formations, all at depths below 1,200 m.

Geological characteristics of the study block
According to the logging data of 60 wells obtained from an oilfield block in Daqing and related articles, the development area has poor oil content, physical properties and reservoir lithology, the strata are strongly sensitive, and the degree of mobilization of crude oil reserves is low. From the perspective of geological structure, the development of the Songliao Basin is divided into early, middle and late stages. In the early stage, the study area experienced tectonic movement dominated by fault subsidence; in the middle stage, the land mass subsided; and in the late stage, the land mass was uplifted again 14 . The average content of clay minerals is 27.3%; kaolinite, montmorillonite and illite constitute the majority of clay minerals, and the cementation type of the rock is mainly pore cementation, with contact cementation accounting for a smaller proportion 15 . The reservoir physical characteristics of the study block can be obtained from an analysis of the field logging curves (Table 1). From the porosity distribution in Table 1, the block can be classified as a medium- and low-porosity reservoir. The reservoir physical properties of the study block are related to one another; for example, there is an obvious correlation between porosity and permeability (Fig. 1): as permeability increases, the porosity of a rock layer also increases. The figure also shows that the porosity of the vast majority of the measured strata is greater than 18%, which verifies that the block belongs to medium- and low-porosity reservoirs. Reservoirs with good sorting and coarse lithology have greater porosity and permeability and a higher oil-bearing grade, which is expressed in the logging curves as a higher apparent resistivity, smaller DEN, larger BHC and larger SP anomaly.
Based on the above analysis and combined with the petrophysical logging response mechanism 16 , 12 logging curves were selected (Table 2).
To study the implied relationships among the 12 logging curves in Table 2, relationship curves were obtained by correlating the potential, gradient, and resistivity data (Figure 2); these show a certain trend relationship between the logging curves in this test area. Machine learning can accurately identify the implied relationships in the logging curve data and establish the corresponding algorithmic parameter equations.
The function for solving the maximum feature-space margin is expressed in terms of the Lagrange multiplier vector, the current data, the linear function of the data, and the adjustment function.
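For orientation, the maximum-margin problem referred to above is commonly written as a Lagrangian. The block below gives the standard textbook hard-margin form, which is not necessarily the formulation used in this study. All symbols are assumptions introduced here: $\boldsymbol{\alpha}$ is the Lagrange multiplier vector, $\mathbf{x}_i$ is a data sample with label $y_i \in \{-1, +1\}$, and $\mathbf{w}^{\top}\mathbf{x}_i + b$ is the linear function of the data. No attempt is made to map the "adjustment function" onto a term.

```latex
L(\mathbf{w}, b, \boldsymbol{\alpha})
  = \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \left[ y_i \left( \mathbf{w}^{\top}\mathbf{x}_i + b \right) - 1 \right],
  \qquad \alpha_i \ge 0 .
```

Minimizing over $\mathbf{w}$ and $b$ while maximizing over $\boldsymbol{\alpha}$ yields the separating hyperplane with the largest margin, which is $2 / \lVert \mathbf{w} \rVert$ in this notation.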
Random forest is an ensemble learning algorithm based on decision trees. It uses CART decision trees as weak learners, built with a method that randomly selects a small number of features; the number of selected features defaults to the square root of the total number of features 18 .
Aggregating all the training data and observing their distribution shows that the sandstone layer, off-surface reservoir and effective thickness layers account for a tiny percentage of the total formation data; that is, some classes in the dataset have far more or far fewer samples than the others. The task therefore becomes an imbalanced classification problem (Fig. 3). Under these conditions, traditional machine learning methods, given the characteristics of each algorithm, handle the problem poorly, which reduces the accuracy of rock formation identification. Therefore, the SMOTETomek algorithm is used for resampling: it generates enough new data to ensure a reasonable decision space for the minority classes, balancing the data classes while preserving the original distribution (Fig. 4).
Accuracy represents the proportion of the whole sample that the classifier judges correctly, and the formula is as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
where TP is the number of samples of this class correctly predicted as this class; FP is the number of samples of other classes incorrectly predicted as this class; FN is the number of samples of this class incorrectly predicted as another class; and TN is the number of samples of other classes correctly predicted as not belonging to this class.
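The scoring criteria used in this paper (accuracy, precision, recall and F1-score) can all be derived from the four counts TP, FP, FN and TN. The following is a minimal sketch in plain Python, treating one class as the class of interest. The function names `binary_counts` and `scores` and the label values are hypothetical and serve only as illustration; they are not taken from the study.

```python
def binary_counts(y_true, y_pred, positive):
    """Count TP, FP, FN, TN, treating `positive` as the class of interest."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1          # this class, predicted as this class
        elif pred == positive:
            fp += 1          # other class, predicted as this class
        elif truth == positive:
            fn += 1          # this class, predicted as another class
        else:
            tn += 1          # other class, predicted as another class
    return tp, fp, fn, tn


def scores(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from the four counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = sandstone layer, 0 = any other formation.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

result = scores(*binary_counts(y_true, y_pred, positive=1))
print(result)
```

On imbalanced data, accuracy can stay high even when many samples of the minority class are missed, which is why recall and F1-score are reported alongside it.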
Four algorithms were used to perform the formation and reservoir identification based on the study block logging data (Fig. 5). Valid data after preprocessing were entered into the LR, SVM, RF, and XGBoost algorithms. Because the logging data in this classification are imbalanced, valid data for the rock and reservoir layers are limited, so recall and accuracy should both be considered in the results. For the off-surface reservoir layer, XGBoost has an 85% recall with 94% accuracy; for the effective thickness reservoir layer, XGBoost has an 81% recall with accuracy remaining above 95%, while the LR, SVM and RF algorithms also have an accuracy of over 90%. The overall F1-scores show that these 3 algorithms are not as effective as XGBoost for integrated rock layer identification. In effective thickness layer identification, the LR and SVM algorithms cannot converge because of the distribution ratio of the three types of interlayers in the stratum and the characteristics of the two algorithms 22 , so they are not discussed here. The training time of XGBoost on the logging curve data is also the shortest of the 4 algorithms, at only 3 seconds, indicating that XGBoost is well suited to rock formation and reservoir identification (Table 3).
On this basis, a correlation analysis of the logging curve characteristics was performed to determine which logging curves are most sensitive for rock formation and reservoir identification (Fig. 6) 23 . The results showed that the response of the RMN, RLLD and RLLS is the most obvious for the sandstone layer, off-surface reservoir layer and effective thickness layer, and the caliper (CAL) curve has the least effect on the rock and reservoir layer identification. These results can provide an effective reference for the selection and dimensionality reduction of subsequent logging curves.
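One simple way to rank logging curves by their sensitivity to a formation label is the absolute Pearson correlation between each curve and a binary class indicator. The sketch below assumes this measure; the study does not state which correlation or importance measure was used, so this is an illustration and not the authors' procedure. The curve values, the labels, and the helper names `pearson` and `rank_curves` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_curves(curves, labels):
    """Sort curve names by |correlation with the label|, strongest first."""
    scored = {name: abs(pearson(values, labels))
              for name, values in curves.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


# Hypothetical samples: label 1 = sandstone layer, 0 = other formation.
labels = [0, 0, 1, 1, 0, 1]
curves = {
    "RLLD": [2.0, 2.5, 9.0, 9.5, 3.0, 10.0],  # resistivity rises in sandstone
    "CAL": [8.1, 8.3, 8.1, 8.3, 8.2, 8.2],    # caliper barely changes
}

ranking = rank_curves(curves, labels)
for name, strength in ranking:
    print(f"{name}: {strength:.3f}")
```

Curves at the bottom of such a ranking are natural candidates for removal when reducing the number of input features.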

Analysis of the applicability of the method
The sandstones in the study block mostly develop trough cross-laminations but also massive structures, deformation structures, water-escape structures, oblique laminations and horizontal laminations 24 . The mudstone is a rock dominated by clay minerals. According to the silt content, it is divided into two types, silty mudstone and mudstone: silty mudstone contains 10% to 50% silt, and mudstone contains <10% silt. Therefore, the research method proposed in this paper can be applied well to strata with the above characteristics 25 . The intelligent method proposed in this paper is a pioneering experiment in this field because few researchers have studied the problem of rock stratification, and those who have done so have relied on experiments.
The specific interpretation results for the three wells are shown in Figure 7.
(2) The study selected 4 machine learning algorithms, LR, SVM, RF and XGBoost, and compared the F1-scores, precision, recall, accuracy and computation time of the 4 algorithms. Through a comprehensive comparison of these five evaluation criteria, the study shows that the XGBoost algorithm is the most effective for rock and reservoir layer identification, with an average accuracy of more than 95%.
(3) When faced with an uneven distribution of logging data, such as the three types of interlayer interference and nonreservoir interference, the SMOTETomek algorithm is selected to interpolate the interlayer and reservoir layers, which can effectively balance the data and improve the accuracy of formation and reservoir identification.
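The balancing step in conclusion (3) combines two ideas: SMOTE creates synthetic minority samples by interpolating between neighbouring minority points, and Tomek-link cleaning then removes pairs of mutually nearest samples that carry different labels. The sketch below is a simplified two-class illustration in plain Python, not the implementation used in the study. The function names, the choice of `k`, the removal of both members of each link, and the sample coordinates are all assumptions.

```python
import math
import random


def smote(minority, n_new, k=2, rng=None):
    """Interpolate n_new points between minority samples and their
    k nearest minority neighbours (needs at least 2 minority samples)."""
    rng = rng if rng is not None else random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b)
                               for b, o in zip(base, other)))
    return synthetic


def nearest(i, X):
    """Index of the sample closest to X[i], excluding X[i] itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: math.dist(X[i], X[j]))


def remove_tomek_links(X, y):
    """Drop both members of each Tomek link (mutual nearest neighbours
    with different labels). Removing both is an assumption made here."""
    nn = [nearest(i, X) for i in range(len(X))]
    drop = {i for i in range(len(X))
            if nn[nn[i]] == i and y[i] != y[nn[i]]}
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_tomek(X, y, minority_label, rng=None):
    """Oversample the minority class to parity, then clean Tomek links."""
    minority = [x for x, lab in zip(X, y) if lab == minority_label]
    n_new = max(len(X) - 2 * len(minority), 0)
    new_points = smote(minority, n_new, rng=rng)
    X_all = list(X) + new_points
    y_all = list(y) + [minority_label] * len(new_points)
    return remove_tomek_links(X_all, y_all)


# Hypothetical two-feature samples: 0 = non-reservoir, 1 = reservoir layer.
X = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.0, 0.4),
     (0.95, 1.05), (1.0, 1.0), (1.2, 0.9), (0.9, 1.2)]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1]

X_bal, y_bal = smote_tomek(X, y, minority_label=1, rng=random.Random(0))
print(len(X_bal), y_bal.count(0), y_bal.count(1))
```

Because the synthetic points lie on segments between existing minority samples, the oversampled class stays within the region the original minority data already occupied.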