Semi-supervised interlayer intelligent recognition method

The development and distribution of interlayers in sandstones increase the heterogeneity of the reservoir formation, which in turn affects the movement of fluids within it. Accurate evaluation of sandstone interlayers is therefore essential, as it is of great importance for identifying the distribution of underground fluid systems. Because interlayers account for only a small proportion of the logged interval compared with conventional reservoir layers, their logging data are inadequate for training traditional machine learning models, most of which assume a balanced sample distribution. By contrast, semi-supervised learning requires only a small number of labeled samples and combines them with a large number of unlabeled samples for modeling. To verify the feasibility of semi-supervised learning for interlayer identification, the Donghe sandstone section of the H oilfield was taken as an example. First, core analysis results were used to label the logging data; then, to uncover more response information characterizing the interlayers on the logging curves, multiple features were extracted to construct cross features. Finally, an improved model based on autoencoders, the probabilistic autoencoder (PAE), is proposed to address interlayer recognition with imbalanced samples. The PAE model calculates the probability that an unlabeled sample belongs to each class and classifies new samples according to the maximum probability. Experimental results show that, compared with traditional machine learning methods and ensemble learning methods, PAE achieves higher recognition accuracy and better generalization through its parameter update algorithm, and can serve as a simple and fast method for interlayer recognition.
These results demonstrate that the semi-supervised method is of great significance for the exploration and development of complex heterogeneous oil reservoirs.


Introduction
At present, 70% of China's oil production still comes from mature oil fields, whose remaining recoverable reserves are considerable (Hu 2008; Liu et al. 2009). However, the presence of interlayers in reservoirs increases reservoir heterogeneity and is an important factor affecting the distribution of remaining oil. Therefore, elucidating the spatial distribution of interlayers is the primary task of tapping the remaining oil potential (Li et al. 2021). Two types of such layers are commonly distinguished. The first comprises non-permeable layers that can impede or regulate fluid movement in the reservoir; they have a wide distribution area, can be correlated between wells, range in thickness from tens of centimeters to tens of meters, and are usually situated between unit layers. The second comprises relatively low-permeability and non-permeable layers distributed within a unit sandstone layer; their distribution is more irregular, their area smaller, and their extension shorter, and most of them lie within the sand layer, with thicknesses of tens of centimeters to several meters. Layers of this second type cannot effectively impede or regulate fluid movement, but can affect the distribution of oil and water and the path of water flooding in local areas (Li et al. 2018; Zhang et al. 2014). In this study no distinction is made between the two, and they are collectively referred to as interlayers. The interlayer acts as a division boundary of the reservoir, has a shielding effect on it, and enhances the heterogeneity within the layer (Li et al. 2018; Qu et al. 2019). As impermeable layers, interlayers impede or regulate the flow of fluids, so their accurate identification is essential for delineating the distribution of subsurface fluid systems (Liu et al. 2011; Han et al. 2014). Therefore, elucidating the development characteristics of interlayers, describing the internal structure of single sand bodies, and uncovering the heterogeneity within reservoir layers can provide a solid geological foundation for the later development of water-drive reservoirs (Fu et al. 2016; Wang et al. 2016).
Because the origins, lithology, and diagenesis of interlayers differ between regions, there are various schemes for classifying them. At present, some scholars divide interlayers into three categories: muddy interlayers, calcareous interlayers, and physical interlayers. Two methods are commonly used to identify them. The first comprises qualitative identification methods, mainly geostatistical and spider-web model methods (Deng et al. 2016; Saritha et al. 2013). However, since the accuracy of logging data is usually affected by vertical resolution and thin-layer effects, and underground geological conditions are complex and heterogeneous, it is difficult to identify interlayers accurately and effectively with qualitative identification charts (Lun et al. 2017; Guo et al. 2019). The second method uses upper and lower limits of electrical or physical properties: a cross plot is made to determine the corresponding logging values or physical-property limits of the interlayer, and the interlayer is then identified according to the threshold values (Guo et al. 2020). This method is objective and fast, but there are few interlayer samples for core calibration, and the number of samples differs greatly between interlayer types, so its applicability may be limited.
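As a minimal illustration of the second (threshold) approach, a single-class rule can be written in a few lines. The cutoff values and the rule itself are purely illustrative, not calibrated limits from any field:

```python
def is_muddy_interlayer(gr, dt, gr_cut=75.0, dt_cut=210.0):
    """Flag a depth point as a muddy interlayer when the natural gamma ray
    value exceeds an upper limit and the acoustic transit time falls below
    a cutoff (illustrative cutoffs only; real limits come from cross plots
    calibrated against cores)."""
    return gr > gr_cut and dt < dt_cut

# Hypothetical depth points: (GR in API, acoustic transit time in us/m)
flag = is_muddy_interlayer(90.0, 200.0)   # high GR, low DT -> flagged
```

This kind of rule is fast and objective, but, as noted above, it fails where interlayer responses overlap, which motivates the nonlinear method introduced next.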
As a feature extraction technique, deep learning can automatically learn the relationship between features and tasks and extract complex features from simple ones. This paper therefore attempts to use deep learning methods to accurately identify interlayers in well-logging data. First, we introduce the characteristics of well-log data. Then, we explain the proposed method, the probabilistic autoencoder (PAE), and its differences from traditional autoencoder neural networks. Finally, we compare traditional machine learning classification methods and ensemble learning methods with PAE. Experimental results show that the proposed PAE model achieves a good recognition effect on log data.

Auto-encoder
The goal of the auto-encoder is to learn a compressed and distributed representation of the dataset (Lun et al. 2014;Sun et al. 2014;Zhang et al. 2014). It consists of two parts, an encoder and a decoder, both of which include a deep neural network. As a result, the auto-encoder is capable of processing nonlinear problems (Zhang et al. 2016;Yu et al. 2017).
As shown in Fig. 1, the encoder of the auto-encoder is a deterministic mapping function that transforms the input vector through a typical nonlinear affine mapping. The mapping relationship is shown in Eq. (1):

Z = s(WX + b)    (1)

In Eq. (1), X is an input vector of dimension d, s is the activation function, and the parameter set is θ = {W, b}, where W is a weight matrix of dimension d′ × d and b is an offset vector of dimension d′.
The decoder is a mapping function that inversely maps the latent variable Z back to the input space, which is also a typical affine mapping. The mapping relationship is shown in Eq. (2):

X̂ = s(W′Z + b′)    (2)

where θ′ = {W′, b′} is the parameter set of the decoder.
The purpose of the auto-encoder is to make the reconstruction vector X̂ reproduce the input vector X as closely as possible. Therefore, we hope that the auto-encoder can learn, through iterative training, the functional relationship of Eq. (3):

f(X) ≈ X    (3)

where f is the model function that maps X to X̂, and θ is the parameter set of the model.
In general, a trained auto-encoder can reconstruct, through the latent variable, a vector nearly identical to the input data. If the feature correlation and data distribution of a sample differ from those of the training set, there will be a large difference between the reconstructed vector and the input vector (Suk et al. 2015; Zhou et al. 2017). The difference between the input data and the reconstructed data can be measured by the root mean square error (RMSE), defined by Eq. (4):

RMSE = sqrt((1/n) Σ_{i=1}^{n} (X_i − X̂_i)²)    (4)

where n is the dimension of the input data, X is the original data, and X̂ is the reconstructed vector.
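As a minimal sketch of Eqs. (1), (2), and (4), the forward pass of such an auto-encoder and its reconstruction error can be written as follows. The dimensions, random weights, and sigmoid activation here are illustrative assumptions, not the paper's trained network:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Illustrative sizes: d input features compressed to d' latent features.
d, d_latent = 6, 3
W, b = rng.normal(scale=0.1, size=(d_latent, d)), np.zeros(d_latent)   # encoder {W, b}
W2, b2 = rng.normal(scale=0.1, size=(d, d_latent)), np.zeros(d)        # decoder {W', b'}

def encode(x):
    return sigmoid(W @ x + b)          # Eq. (1): Z = s(WX + b)

def decode(z):
    return sigmoid(W2 @ z + b2)        # Eq. (2): X_hat = s(W'Z + b')

def rmse(x, x_hat):
    return float(np.sqrt(np.mean((x - x_hat) ** 2)))  # Eq. (4)

x = rng.normal(size=d)                 # an example input (untrained net)
err = rmse(x, decode(encode(x)))       # reconstruction error / anomaly score
```

After training on normal (sandstone) samples, `err` stays small for in-distribution inputs and grows for anomalous ones, which is the property the scoring scheme below relies on.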

Improved auto-encoder based model
It is generally believed that a model is the cognitive pattern generated from learning data, and that the parameters of the model can be obtained through training or experience (Miyaji et al. 2008; Ng 2011). The parameters that PAE needs to learn are the parameters of the auto-encoding neural network and the parameter sets of the four sample classes. The network parameters of the PAE are Θ = {θ, θ′}, where θ is the parameter set of the encoder and θ′ is the parameter set of the decoder. The parameter set of each of the four classes is ∅ = {S_center, Count, B_min, B_max}, where S_center is the scoring center point, Count is the number of samples involved in the parameter calculation, and B_min and B_max are the lower and upper bounds of the scoring interval. The neural network parameters can be obtained from the sandstone training samples, and the parameter set ∅ of each of the four classes is calculated by passing that class's samples through the trained auto-encoder neural network. As shown in Fig. 2, the overall algorithm architecture is divided into two modules: the feature engineering module and the parameter learning module. The feature engineering module preprocesses the original data and analyzes the distribution characteristics of the samples in the feature space; its main operations include feature cleaning, feature crossing, and normalization. The parameter learning module obtains the two types of model parameters; its components are the auto-encoder parameter training layer, the interlayer parameter extraction layer, the PAE model calculation layer, and the classification parameter update layer. We now assume that the category of a new sample is related to the distance between the score calculated for the new sample and the center points of the four classes, and to the score intervals of those classes.
Here we have two intuitions: ① if we compute the distance between the score calculated for the new sample and the center points of the four classes, the class with the smallest distance is the most likely category of the new sample; ② the wider the score interval of a class, the greater the probability that the new sample falls into that class's interval. Based on these points, the score formula is designed as Eq. (5):

d_score^i = |S_point − S_center^i| / S_range^i    (5)

where S_point is the reconstruction error calculated by Eq. (4), S_center^i is the scoring center point of the i-th class, S_range^i is the score interval of the i-th class, and d_score^i is the distance from the new sample to the i-th class.
Based on the four distances computed for each new sample, a softmax transformation normalizes them into the probability that the sample belongs to each category. The conversion formula is shown in Eq. (6):

p_i = exp(−d_score^i) / Σ_{j=1}^{4} exp(−d_score^j)    (6)

By solving Eq. (6), we obtain the probability that the new sample belongs to each of the four classes. The class with the highest probability is taken as the type of the new sample.

Algorithm for updating model parameters

To avoid reducing the generalization of the model through unreasonable sampling of the interlayer and sandstone samples, this paper uses the improved auto-encoder method to intelligently identify interlayers and borrows the self-training idea from semi-supervised methods: the parameters are updated with new samples. As the number of samples gradually increases, the influence of the original training data on the model parameters decreases, making the model more general and achieving self-adaptation and intelligence. The specific update algorithm is shown in Fig. 3.
As shown in Fig. 3, the updated parameter set is ∅ = {S_center, Count, B_min, B_max}. S_center is updated by folding the new sample into the average of the current total score over the number of samples included in the calculation, giving a new S_center. B_max and B_min are updated from the reconstruction errors of new samples. It is assumed here that the probability threshold α is 0.85: when the probability that a new sample belongs to a certain type of interlayer is greater than 0.85, the sample is assumed to fall within the confidence interval of that type, and the new sample is then used to update the model parameters.
In the update algorithm shown in Fig. 3, p_i is the probability that a new sample belongs to the i-th type of interlayer, α is the assumed lower probability limit, S_center^i is the score center point of the i-th type of interlayer, Count_i is the total number of samples for the i-th type, and B_min^i and B_max^i are the minimum and maximum boundaries of the i-th type on the score interval. S_center_new^i is the updated center point of the i-th type of interlayer.
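A minimal sketch of this update rule follows; the dictionary layout and the running-mean form of the center update are assumptions consistent with the description above, not the authors' exact code:

```python
def update_class_params(params, p, s_point, alpha=0.85):
    """If the maximum class probability exceeds the threshold alpha, fold
    the new sample's reconstruction score s_point into that class's
    running-mean center and widen its score bounds if needed; otherwise
    leave all parameters unchanged."""
    i = max(range(len(p)), key=lambda k: p[k])    # most probable class
    if p[i] <= alpha:
        return params                             # not confident: no update
    c = params[i]
    c["count"] += 1
    c["center"] += (s_point - c["center"]) / c["count"]   # running mean
    c["b_min"] = min(c["b_min"], s_point)
    c["b_max"] = max(c["b_max"], s_point)
    return params

# Two illustrative classes with made-up initial parameters:
classes = [
    {"center": 0.10, "count": 1, "b_min": 0.05, "b_max": 0.15},
    {"center": 0.50, "count": 1, "b_min": 0.40, "b_max": 0.60},
]
classes = update_class_params(classes, p=[0.90, 0.10], s_point=0.30)
```

Here the first class's probability 0.90 exceeds α = 0.85, so only that class's center, count, and bounds absorb the new score, mirroring the self-training idea of letting confident predictions refine the model.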

Method for identifying new samples
As shown in Fig. 4, each log feature on the left represents the original input vector, where D denotes depth. The subscript of a logging feature denotes the i-th sample. Through feature engineering steps such as feature cleaning and feature crossing, these vectors are converted from the original 10-dimensional logging features into 66-dimensional features. F represents a feature in the high-dimensional space after feature engineering; its subscript, consistent with that of the logging features, denotes the i-th sample, and its superscript denotes the j-th feature of the i-th sample in the high-dimensional feature space. The features are then passed through the probabilistic auto-encoder to obtain the class-membership probabilities p, whose subscript denotes the probability set of the i-th sample and whose superscript denotes the class: 1 is a muddy interlayer, 2 is a sandstone sample, 3 is a physical interlayer, and 4 is a calcareous interlayer. Finally, the probability set is converted into a sample category label by the argmax function, which takes the interlayer type corresponding to the maximum probability in the set as the result.
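The identification path above can be sketched end to end. The pairwise-product feature crossing, the interval-normalized distance of Eq. (5), and the softmax over negative distances of Eq. (6) are assumed forms based on the text (the paper's exact 10-to-66-dimensional recipe is not given), and the numeric centers and intervals are illustrative only:

```python
from itertools import combinations
import numpy as np

def cross_features(x):
    """Illustrative feature crossing: keep the original features and
    append all pairwise products (an assumed recipe, for illustration)."""
    x = np.asarray(x, dtype=float)
    pairs = [x[i] * x[j] for i, j in combinations(range(len(x)), 2)]
    return np.concatenate([x, pairs])

def classify(s_point, centers, ranges):
    """Assumed form of Eqs. (5)-(6): distance of the sample's reconstruction
    score to each class center, normalized by the class score interval,
    then a softmax over negative distances so the nearest class receives
    the highest probability; argmax gives a class label 1..4 as in Fig. 4."""
    d = np.abs(s_point - np.asarray(centers)) / np.asarray(ranges)  # Eq. (5)
    p = np.exp(-d) / np.exp(-d).sum()                               # Eq. (6)
    return p, int(np.argmax(p)) + 1

v = cross_features([1.0, 2.0, 3.0])          # 3 originals + C(3,2) crosses
p, label = classify(0.22,
                    centers=[0.05, 0.20, 0.35, 0.60],  # illustrative values
                    ranges=[0.04, 0.10, 0.08, 0.15])
```

With these made-up parameters, a reconstruction score of 0.22 lies closest (in interval-normalized terms) to the second class's center, so the argmax step assigns label 2.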

Analysis of logging curve responses
Interlayers are easily identified from cores, thin sections, and scanning electron microscope images. However, in petroleum exploration and development, core data are very limited while well log data are abundant, so identifying interlayers from well logging data is more feasible. After the cores were depth-positioned, the logging curves and core features of the three types of interlayers in the Donghe sandstone section of the H Oilfield were analyzed. On conventional logging curves, muddy interlayers show an obvious increase in the natural gamma ray value, a low acoustic transit time, scattered density values, and relatively low deep resistivity; the curves are mostly finger-shaped or small-amplitude bell-shaped. Calcareous interlayers show significantly lower natural gamma ray values, low acoustic transit times, high density values, and high deep resistivity; the curves are mostly saw-toothed or box-shaped. Calcium-shale interlayers show a natural gamma ray value between those of muddy and calcareous interlayers, a low acoustic transit time, a high density value, and relatively high deep resistivity; the curves present finger-like and jagged shapes (Fig. 5). Table 1 shows the types of interlayers and their corresponding natural gamma ray, P-wave transit time, density, and deep resistivity response characteristics.
According to the analysis of the cross plots and core characteristics of the muddy, physical, and calcareous interlayers, although a small number of calcareous and muddy interlayer samples can be identified by cross plots, most interlayer parameters have overlapping regions, making low-dimensional linear separation through cross plots difficult. Therefore, a deep learning method is used to nonlinearly transform the logging curve characteristics and map the low-dimensional logging features into a high-dimensional space to find a decision boundary, finally realizing fine identification of the interlayers. As the results in Fig. 5 show, the sandstone regions are more independent and better differentiated than those of the interlayer samples. Because the auto-encoder algorithm used in this paper needs to reflect a preference for normal samples, the sandstone samples are used to train the auto-encoder model so that it acquires better decision boundaries, thereby improving its performance.

Settings for model parameters
The auto-encoder neural network forces the network to learn low-dimensional features of the dataset by compressing the original representation and reconstructing it; thus the network structure of the auto-encoder is symmetrical, as shown in Fig. 1. To explore the influence of the number of network layers on the model, models with different numbers of layers were constructed for comparison. The choice of hidden layer sizes is mainly related to enabling Tensor Cores, which accelerate tensor and matrix operations, so powers of 2 are used as the numbers of hidden layer nodes. The input feature size is 66, so the selectable numbers of hidden layer nodes are 64, 32, 16, and 8; with only 4 hidden nodes it is difficult for the network to reconstruct the original features from four-dimensional data. Therefore, the largest number of layers chosen is 9 and the smallest is 5. The specific network structure parameters are shown in Table 2. The activation function of the intermediate encoder and decoder layers is the sigmoid function, to prevent gradient explosion, and the activation function of the head and tail layers is the ReLU function, to avoid over-fitting and to prevent the reconstruction error from becoming too small. The loss function is the root mean square error between the input vector and the reconstructed vector. In this paper, 5757 log data points from 10 wells in the study area were selected as the dataset, including 751 muddy interlayer samples, 3471 sandstone reservoir samples, 1006 calcareous interlayer samples, and 531 physical interlayer samples. 80% of the dataset was used to train the model and its parameters, and the remaining 20% was used as the test set. The sandstone reservoir samples processed by feature engineering were then fed into the model.
To train the model, the sandstone training samples are shuffled and fed to the model continuously in batches; the RMSE between the reconstruction vector and the input vector is calculated, the parameter gradients of the last layer are computed from this error, and the chain rule is then used to compute the parameter gradients of each preceding layer. The parameters are updated according to the learning rate and the parameter gradients.
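This training loop can be sketched with a linear auto-encoder and hand-derived chain-rule gradients. The toy data, the absence of biases and activations, and the hyperparameters are deliberate simplifications of the paper's network, kept only to show the shuffle/batch/backpropagate/update cycle:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the sandstone training matrix (illustrative data only):
X = rng.normal(size=(200, 6))

d, h = X.shape[1], 3
W1 = rng.normal(scale=0.1, size=(d, h))   # encoder weights
W2 = rng.normal(scale=0.1, size=(h, d))   # decoder weights
lr, epochs, batch = 0.05, 50, 32

initial_rmse = float(np.sqrt(np.mean((X @ W1 @ W2 - X) ** 2)))

for _ in range(epochs):
    rng.shuffle(X)                        # shuffle the training samples
    for s in range(0, len(X), batch):     # feed the samples in batches
        xb = X[s:s + batch]
        z = xb @ W1                       # encode (linear sketch)
        xr = z @ W2                       # decode / reconstruct
        err = xr - xb                     # gradient of 0.5*MSE wrt xr
        gW2 = z.T @ err / len(xb)         # last-layer gradient first
        gW1 = xb.T @ (err @ W2.T) / len(xb)   # chain rule, one layer back
        W2 -= lr * gW2                    # update by learning rate * gradient
        W1 -= lr * gW1

final_rmse = float(np.sqrt(np.mean((X @ W1 @ W2 - X) ** 2)))
```

Even this stripped-down version shows the reconstruction RMSE falling as the loop runs; the paper's network adds biases, sigmoid/ReLU activations, and more layers, but the update cycle is the same.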

Training of model parameters
To explore the influence of different probability thresholds on the model, the probability threshold was varied from 0 to 1 in steps of 0.1. The accuracy rates of the different models under different probability thresholds are shown in Fig. 6a. As the threshold rises from 0 to 0.8, the accuracy of the three models with different numbers of network layers gradually increases, indicating that adding new samples whose class probability exceeds the threshold as labeled data improves the model. However, when the probability threshold exceeds 0.8, the model's performance gradually weakens, indicating that when the initial model cannot classify unlabeled samples well, the improvement from adding new samples for training is limited.
The change of the loss function during training is shown in Fig. 6b. The three network models of different depths begin to converge after about 1500 iterations. PAE5 reached convergence at about 1800 iterations, while PAE7 converged at about 3500 iterations with a loss value of about 1 × 10⁻⁴, at which point the training of the auto-encoder neural network was complete and its parameters were obtained. Both the convergence speed and the reconstruction error of the PAE7 model lie between those of the other two network models. Considering the convergence behavior in Fig. 6b and the available computing power, it is appropriate to set the number of iterations to 4000, and the network should not be too deep: backpropagation multiplies the error of each layer and passes it backward, so too many layers cause error information to be lost during transmission.
After the auto-encoder neural network training is completed, the 4607 training set samples are fed into the model to calculate the initial score center points and score intervals. For each new sample, the score center points and score intervals of the four classes are then updated according to the model parameter update algorithm, so that the model not only learns from the training data but also updates itself with new samples, thereby alleviating the sampling problems caused by sample imbalance. The score center points and score intervals of the training set are shown in Table 3.
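Deriving a class's initial scoring parameters from the reconstruction errors of its labeled training samples might look like the following. The mean-center and min/max-bound forms are assumptions consistent with the parameter set ∅ = {S_center, Count, B_min, B_max} described earlier, and the scores are made-up numbers:

```python
import numpy as np

def class_score_params(scores):
    """Initial scoring parameters of one class from the reconstruction
    errors (Eq. 4 scores) of its labeled training samples: mean center,
    sample count, and min/max interval bounds (assumed forms)."""
    s = np.asarray(scores, dtype=float)
    return {
        "center": float(s.mean()),   # S_center: scoring center point
        "count": int(s.size),        # Count: samples used so far
        "b_min": float(s.min()),     # B_min: lower bound of the interval
        "b_max": float(s.max()),     # B_max: upper bound of the interval
    }

# Illustrative reconstruction scores for one interlayer class:
params = class_score_params([0.10, 0.14, 0.12, 0.20])
```

Running this once per class would populate a table analogous to Table 3, after which the update algorithm of Fig. 3 takes over for each new sample.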

Experimental results of the model
We use three comparison methods to verify the effect of the PAE model: the decision tree (DT) algorithm from traditional machine learning, which is based on approximating discrete function values; the linear regression (LR) method, which solves for conditional probability distributions; and the support vector machine (SVM) algorithm, which seeks the maximum-margin hyperplane in the feature space. The parameters of the three machine learning algorithms mainly involve the optimization parameters, the search ranges, and the experimentally determined optimal parameter combinations; the specific settings are shown in Table 4. The PAE model uses the PAE7 algorithm with the 7-layer network structure discussed above. Table 5 shows the recognition effect of each model on the different types of interlayers. Compared with the simple decision tree and linear regression algorithms, the support vector machine, which finds the maximum-margin hyperplane in the multidimensional feature space, performs better. Because the samples were unbalanced, the traditional machine learning algorithms showed an obvious imbalance in recognition performance: sandstone, with the most samples, was learned best, while the interlayer classes, with fewer samples, were identified poorly. The PAE model, whose update algorithm is better matched to the sample characteristics of the dataset, was superior to the traditional machine learning methods, showing that the semi-supervised idea has clear advantages in dealing with unbalanced logging data. From the model accuracy indices, physical interlayers are difficult to distinguish from calcareous interlayers, and the results of multiple models verify this point: physical and calcareous interlayers overlap in the feature space, making effective identification difficult.
Compared with the other two types of interlayers, muddy interlayers were relatively independent in the feature space, so the recognition effect was better.
Although physical interlayers were difficult to distinguish from calcareous interlayers, the identification results of the PAE model were still better than those of traditional machine learning models. This is because the PAE model could better capture the characteristics of the dataset samples, and the update algorithm could better adapt to the characteristics of the dataset samples.
To verify the generalization ability of the model, we used the training set to train and update the model and then applied it to identify the interlayers in a new well; the result is shown in Fig. 7. The left side shows the original log curves, mainly GR, ACC, CNLC, and DENC, and the right side shows the interpretation of the optimal model, PAE7. From the interpreted log on the right, the PAE model can effectively identify the various types of interlayers, although the recognition accuracy for a few of the calcareous and physical interlayers is slightly worse. The PAE7 interpretation is largely consistent with the interlayer types calibrated from the cores; even the calcareous and physical interlayers that are difficult to separate in the dimensionality-reduction analysis are identified fairly accurately by the PAE7 model, though the log data in the transition between sandstone and interlayer remain difficult to classify. This shows that the semi-supervised anomaly detection PAE7 model can be used in practice for interlayer identification, with high recognition accuracy and strong generalization, and can serve as a simple and fast method for identifying interlayers.

Discussion
In this paper, PAE was used for interlayer prediction from logging data. The trained auto-encoder develops a preference for the data it was trained on, so that during prediction a higher anomaly score is calculated for new samples that differ from the training samples; the size of the score reflects the degree of abnormality of the new sample. The imbalance between interlayer samples and sandstone samples was addressed by using the semi-supervised idea to construct an update algorithm for the model parameters. The auto-encoder is a lightweight neural network whose number of parameters is two orders of magnitude lower than that of large-scale neural networks such as CNNs and RNNs. The advantage is that the PAE model can run in real time: a well-trained PAE can be deployed independently on embedded devices to monitor logging data in real time, reducing the amount of network transmission by sending only the results to the visualization terminal, which lowers data transmission and equipment deployment costs.

Conclusion
For reservoirs developed by long-term water injection or gas injection, the distribution of interlayers has a great influence on tapping the remaining oil potential. The distribution of interlayers can be obtained from well logging data, seismic data, and drill coring. However, the precision of seismic data is limited and the cost of drilling and coring is too high, so using well logging data to analyze the distribution of underground interlayers is an effective and economical method. Because many factors affect the accuracy of well logging data, accurately identifying interlayers from these data has become an urgent problem. The research results show that the interlayer recognition method based on the auto-encoding neural network proposed in this paper can effectively solve this problem and lays a good foundation for the next step of finely characterizing the spatial distribution of interlayers.