In this study, the data are processed using normalization and principal component analysis (PCA). Normalization maps data measured on different scales onto a unified standard range, using methods such as min-max normalization, Z-score normalization, and decimal scaling normalization. By eliminating scale differences, it allows data of different magnitudes to be compared and analyzed under the same standard, which improves model performance, increases model stability, reduces the complexity of the algorithm's search space, helps in handling features with different scales, and accelerates convergence. In this study, min-max normalization is used; the formula is provided in Eq. (1).
In Eq. (1), X represents the feature's original data, Xmin is the minimum value of the feature data, Xmax is the maximum value, and Y is the normalized output value.
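As a minimal sketch (not the authors' code), Eq. (1), Y = (X − Xmin) / (Xmax − Xmin), can be written as:

```python
import numpy as np

def min_max_normalize(x):
    """Map a 1-D feature array onto [0, 1] via Eq. (1): Y = (X - Xmin) / (Xmax - Xmin)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:          # constant feature: avoid division by zero
        return np.zeros_like(x)
    return (x - x_min) / (x_max - x_min)

# Illustrative tunnel depths (m) spanning very different magnitudes
depth = [15.0, 87.0, 350.0]
print(min_max_normalize(depth))   # values rescaled into [0, 1]
```

The guard for a constant feature is an implementation choice, not part of Eq. (1).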
PCA extracts the most representative feature vectors from high-dimensional data by mapping it to a new orthogonal feature space, reducing dimensionality, eliminating redundant information, and retaining the structure and variability of the data. Its benefits can be seen in several ways: first, by ranking principal components according to their eigenvalues or contribution rates, PCA effectively reduces the dimensionality of the data, saving storage space and computational cost and helping to avoid overfitting; second, PCA can reveal correlations between variables and help select the most important features for further analysis and modeling; and finally, PCA can be used for data visualization, aiding understanding of the data. In investigations of gas emergence, it can help identify and comprehend the key variables and driving mechanisms that contribute to gas emergence.
If there are n samples and m features in the dataset, it can be represented as an n × m matrix, where each row corresponds to a sample and each column to a feature. Principal component analysis consists of the following three steps:
(1) Normalizing the data: standardize each feature to a mean of 0 and a variance of 1. This treatment removes the effects of scale.
(2) Computing the covariance matrix: compute the m × m covariance matrix C of the dataset. The covariance matrix describes the associations between the features in the dataset.
(3) Eigenvalue decomposition and selection of principal components: apply eigenvalue decomposition to the covariance matrix C to obtain its eigenvalues and corresponding eigenvectors. The k eigenvectors with the largest eigenvalues are selected as the principal components, where the number of principal components k is typically less than or equal to the number of original features m.
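The three steps above can be sketched directly in NumPy; this is an illustrative implementation on synthetic data, not the pipeline used in the study:

```python
import numpy as np

def pca(X, k):
    """PCA following the three steps above.

    X : (n, m) data matrix, one row per sample.  Returns the (n, k)
    projection and the top-k eigenvectors (principal components).
    """
    # Step 1: normalize each feature to mean 0, variance 1
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: m x m covariance matrix of the standardized data
    C = np.cov(Xs, rowvar=False)
    # Step 3: eigendecomposition; keep the k eigenvectors with largest eigenvalues
    eigvals, eigvecs = np.linalg.eigh(C)       # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1][:k]
    components = eigvecs[:, order]             # (m, k)
    return Xs @ components, components

# Toy example: 3 features, the third nearly a copy of the first (redundant)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)
Z, comps = pca(X, k=2)
print(Z.shape)   # the redundant feature is compressed away
```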
The fundamentals of transfer learning
Transfer learning is an essential area of machine learning. It was proposed as a solution to the performance problems of classic machine learning algorithms caused by data shortages, labelling challenges, or significant domain differences. Transfer learning reuses knowledge gained in one domain to improve the effectiveness of the task in another domain. By importing knowledge, features, models, or methods from the source domain, the learning process on the target domain can be sped up, the amount of data required can be reduced, and generalization can be improved. Transfer learning was therefore developed to address insufficient data and domain disparities, to increase the efficiency and usability of algorithms, and to make machine learning more adaptable to complicated real-world scenarios. Figure 1 depicts typical transfer learning paradigms.
Transfer learning methods are typically classified into fine-tuning and feature selection. Fine-tuning is a model-based transfer learning strategy: a well-performing model trained on the source domain supplies the initial parameters, which are then further trained on the target domain. During fine-tuning, certain layers or parameters of the model are often adjusted to better meet the specific needs of the target task. This approach is most commonly used with deep learning models such as pretrained convolutional neural networks (CNNs) or language models such as BERT and GPT. Using the generic features and knowledge learned in the source domain, the model can quickly adapt to the task on the target domain through fine-tuning. Feature selection is a feature-based transfer learning strategy that applies features acquired in one domain to tasks in another. The feature representations extracted from the source domain are regarded as having some generality and can be reused to solve the corresponding target tasks. Fine-tuning and feature selection are not mutually exclusive; they can be employed in tandem. For example, during fine-tuning, a portion of the pretrained model's layers can be frozen and only a subset fine-tuned, while feature selection can be applied selectively to better fit the target task. When transferring knowledge, the probability distribution of the underlying data plays a crucial role: the success of transfer learning relies on the similarity of the data probability distributions in the source and target domains, i.e., on the feature distributions and marginal probability density functions of the two domains being statistically close. By migrating and adapting the model from the source domain to the target domain, we can speed up the learning process and boost performance in the target domain.
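The idea of fine-tuning with frozen layers can be illustrated with a deliberately tiny NumPy sketch (not the authors' implementation): a "pretrained" first layer is kept frozen as a feature extractor, and only a new output head is trained on target-domain data.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Pretrained" first layer: imagine these weights were learned on a source domain.
W1 = rng.normal(size=(4, 8))          # maps 4 raw features to 8 hidden features

def hidden(X):
    return np.tanh(X @ W1)            # frozen feature extractor (never updated)

# Hypothetical target-domain task: fit only the new head w2, leaving W1 untouched.
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0])

w2 = np.zeros(8)
for _ in range(500):                  # plain gradient descent on the head only
    H = hidden(X)
    err = H @ w2 - y
    w2 -= 0.01 * H.T @ err / len(X)

final_loss = float(np.mean((hidden(X) @ w2 - y) ** 2))
print(final_loss)                     # far below the loss of the untrained head
```

In a real deep learning framework the same effect is achieved by disabling gradient updates for the frozen layers.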
The general method of transfer learning is depicted in Fig. 2.
Data source
Tunnel depth H (m), groundwater level h (m), rock quality designation RQD (%), and water yield W (m3/(h·m)) are the effective characteristics investigated in this study for the water flowing into the tunnel. The database of this study contains more than 1080 datasets, drawn from the papers of Mahmoodzadeh et al (2017) and Mahmoodzadeh et al (2021), respectively.
(1) Tunnel depth (H)
As the tunnel deepens, the surrounding rock experiences greater pressure, which closes joints and fissures in the tunnel rock and reduces its permeability. In this study, tunnel depth is measured from the tunnel center.
(2) Groundwater level (h)
As the groundwater level rises, the water pressure on the rocks surrounding the tunnel increases, and so does the water flow into the tunnel. This study considers the effective groundwater level relative to the tunnel center when evaluating water inflow.
(3) RQD
Water inflow depends on the joint state of the tunnel rock. The RQD parameter is scored according to the number and spacing of joints in the tunnel rock; thus, RQD can affect tunnel water inflow.
(4) Water yield (W)
The water yield of an aquifer is determined by seepage and water inrush through the rock mass. Depending on conditions, the rock's aquifer shows varying water output and solubility: thinner limestone and dolomite supply more water than mudstone and shale, and water inrush is substantially more likely in aquifers with high solubility. The previous authors suggested using a limited but effective number of criteria to forecast tunnel water inflow, believing that a large number of characteristics would not necessarily boost accuracy and could decrease the reliability of water inflow prediction. In related investigations, previous researchers selected these characteristics as the effective factors that affect tunnel water inflow.
Figure 3 shows the statistical characteristics of the tunnel water inflow case data and the correlations between the variables. The diagonal shows the distribution of each parameter as a statistical histogram. Each parameter has a wide distribution covering a broad range of values, so the statistical database of this paper is in principle broadly representative. From Fig. 3, the distributions of H, h, W, and WI are clearly skewed, while the distribution of RQD follows a bell-shaped pattern. The fitted curves in the off-diagonal panels indicate the strength of the linear relationships between the variables. For the prediction target WI, the predictors H, W, and h have a strong positive correlation with WI, while WI has a certain negative correlation with RQD. A summary of the database used in this paper is given in Table 1.
Table 1
A summary of the database of water inflow.

           H        h        RQD      W        WI
Count      1080     1080     1080     1080     1080
Mean       87.45    49.62    48.06    5.71     82.92
Min        15       0        4        1        40.5
Max        350      312      99       33.6     274.6
Std        67.46    55.39    20.24    3.94     30.58
Forecasting models can be made more adaptable and useful by increasing the quantity of data and widening the parameter ranges. Assume, for instance, that we want to use the proposed prediction models to forecast the water inflow in a brand-new tunnel. In that situation, the input parameter values for the new tunnel must fall within the bounds of the training datasets for these models; otherwise, high accuracy cannot be expected from the prediction models. Table 1 offers a summary of the database. Compared with the literature, this study uses more datasets, and the parameters' ranges are wider; incorporating such data into the database makes the predictive models more accurate.
In this research, the primary datasets were first subjected to a descriptive statistical analysis. Boxplots of all parameters are shown in Fig. 4. The middle line is almost at the center of the box for every parameter except W, indicating that the datasets are approximately symmetrical. In addition, H and RQD are the only parameters with outliers (one and two, respectively), whereas the other two parameters (h and W) have none. Outliers in the initial dataset can make it difficult to identify patterns among the variables; finding the outliers and clusters in the data therefore makes it possible to create a more consistent collection of data from which to build predictive models.
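The outliers in Fig. 4 follow the usual boxplot convention (points beyond 1.5 × IQR from the quartiles), which can be sketched as follows; the sample values are hypothetical, not from the study's database.

```python
import numpy as np

def iqr_outliers(x):
    """Indices outside the usual boxplot whiskers: beyond 1.5 * IQR from the quartiles."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return np.where((x < lo) | (x > hi))[0]

# Hypothetical RQD-like sample with one obviously extreme value at the end
sample = np.array([45, 50, 48, 52, 47, 49, 51, 46, 150.0])
print(iqr_outliers(sample))   # [8] -- the index of the extreme value
```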
The probability distributions of the data have a direct and significant impact on transfer learning. When the source and target domains have similar data probability distributions, transfer learning usually produces good results; this means the data in the two domains have statistically similar feature distributions and marginal probability density functions. In this situation, the model developed in the source domain can be deployed in the target domain with some migration and adaptation, speeding up the learning process and enhancing performance in the target domain. Transfer learning becomes more difficult when the data probability distributions in the source and target domains differ: the model learned in the source domain may not adapt readily to the target-domain data, because features and knowledge from the source domain are no longer valid or relevant under the shifted distribution. In that case, measures are required to handle the distribution discrepancies in order to achieve successful transfer learning. Furthermore, the influence of data probability distributions on transfer learning is related to task similarity: if the tasks in the source and target domains are very similar, transfer learning may still produce good results even when the data distributions differ. The principal component analysis of data1, data2, and data3 reduces their number of features to match the number of features of the tunnel water inflow data. Figure 5 shows the correlation between the principal component analysis results of data1, data2, and data3 and the probability distribution of the tunnel water inflow characteristic W; in Fig. 5, the first row represents data1, the second row data2, and the third row data3.
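One simple way to quantify how similar two data distributions are is the two-sample Kolmogorov-Smirnov statistic; the sketch below (on synthetic samples, not the study's data) illustrates the idea that a small gap between empirical CDFs indicates distributions suitable for transfer.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
    empirical CDFs of a and b.  Smaller values mean more similar distributions."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
similar = ks_statistic(rng.normal(0, 1, 1000), rng.normal(0, 1, 1000))
shifted = ks_statistic(rng.normal(0, 1, 1000), rng.normal(3, 1, 1000))
print(similar, shifted)   # the shifted distribution gives a much larger gap
```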
The earlier authors considered it feasible to employ a few, but significant, parameters to forecast the water inflow in the tunnel. They argued that including many characteristics would not necessarily improve prediction accuracy and would raise the risk of decreasing the reliability of water inflow prediction. These characteristics were chosen based on the findings of other researchers in related investigations, and the earlier work also recognised these four parameters as important factors that influence tunnel water inflow.
The data for gas emergence were acquired from the literature (Bi et al 2023; Liu 2021; Ma 2021), with the individual data specifics presented in Table 2.
Table 2
Description of gas emergence data

Authors | Influencing factors | Amount of data | Designation
Bi, S | coal seam depth X1 (m); coal seam thickness X2 (m); coal seam dip angle X3 (°); gas content of mining seam X4 (m3/t); coal seam spacing X5 (m); mining height X6 (m); adjacent layer gas content X7 (m3/t); adjacent layer thickness X8 (m); interlayer lithology X9; working face length X10 (m); advancing speed X11 (m/d); extraction rate X12 (%); daily production X13 (t/d) | 30 | Data1
Liu, Y | depth of buried coal seam (m); coal seam gas content (m3/t); thickness of coal seam (m); mining intensity (t/d); working face extraction rate (m); working face length (m); coal seam inclination angle (°); gas content of adjacent seam (m3/t); thickness of the adjacent layer (m); distance between seams (m); mining height (m); pushing speed (m3/t) | 30 | Data2
Ma Z.H. | daily advance distance (m); daily output of working face (t); elevation of bottom plate (m); coal seam burial depth (m); gas extraction (m3/min); thickness of coal seam (m); thickness of adjacent layer (m); coal seam spacing (m) | 30 | Data3

The tunnel water inflow data, data1, data2, and data3 are shown as heatmaps to help visualize the relationships within the data. Red denotes a positive correlation, blue a negative correlation, and the darker the color, the stronger the correlation. Figure 6 depicts the data heatmaps: the tunnel water inflow data are shown in subplot (a), data1 in subplot (b), data2 in subplot (c), and data3 in subplot (d).
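The cells of such a heatmap are Pearson correlation coefficients. As an illustrative sketch on synthetic columns (the signs of the relationships are assumptions chosen to mimic Fig. 3, not the study's data):

```python
import numpy as np

# Hypothetical columns standing in for H, WI, and RQD.
rng = np.random.default_rng(0)
H = rng.uniform(15, 350, 200)
WI = 0.5 * H + rng.normal(0, 20, 200)          # positively related to H
RQD = 100 - 0.2 * H + rng.normal(0, 10, 200)   # negatively related to H

# Each off-diagonal entry is the Pearson correlation between two variables;
# positive entries would be red cells, negative entries blue cells.
corr = np.corrcoef(np.vstack([H, WI, RQD]))
print(np.round(corr, 2))
```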
Model selection
Machine learning has demonstrated its potential in many tunnelling problems: inversion of geological parameters (Liu et al 2020; Feng et al 2021), prediction of surface deformation (Cao et al 2021), monitoring and prediction of tunnel hazards (Cha et al 2017; Ma et al 2021), classification of tunnel envelopes (Zhao et al 2022), prediction of tunnel convergence (Torabi-Kaveh et al 2020), and deformation of tunnel envelopes (He et al 2020). However, little work has been done on ML-based prediction of water inflow in tunnels. Despite the strengths of many ML techniques, it is generally accepted that no single ML model can best solve all engineering problems. As a result, researchers have tried to gauge how effective different ML techniques are at handling different optimization problems. In this study we use seven ML models, ANN (Artificial Neural Network), ET (Extra Trees), GB (Gradient Boosting), KNN (K-Nearest Neighbors), MLP (Multi-Layer Perceptron), SVM (Support Vector Machine), and XGBoost (Extreme Gradient Boosting), together with the TabNet model. The primary traits of these models, which motivate their use, are as follows:
An ANN is a computational model made up of many connected artificial neurons that simulates the nervous system of the human brain; it carries out computations by transmitting and processing information in a weighted fashion. An ANN is typically trained using a back propagation algorithm, which compares the discrepancy between the network's output and the expected output and optimizes the network by adjusting the connection weights in response to this difference. An ANN offers several benefits. Distributed processing: an ANN's design enables concurrent computation and processing, which is an advantage for handling massive amounts of data and many concurrent tasks. Nonlinearity: an ANN can learn and express nonlinear mapping relationships, making it more adaptable and powerful in complicated situations. Adaptive learning: through the training process, an ANN can automatically adjust its connection weights. Fault tolerance: the distributed architecture and redundant connections of an ANN make it resistant to partial failures, so it tolerates damage to some nodes or connections. These characteristics give ANNs a wide range of applications and research value in the fields of artificial intelligence and machine learning.
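The back propagation loop described above (compare output with expected output, adjust the connection weights) can be sketched on a deliberately tiny network; this toy example is illustrative only and is not the network architecture used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-4-1 network trained with back propagation on XOR-style targets.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0.0], [1.0], [1.0], [0.0]])

W1 = rng.normal(0, 1, (2, 4))
W2 = rng.normal(0, 1, (4, 1))

mse0 = float(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))  # loss before training
for _ in range(5000):
    H = np.tanh(X @ W1)                       # forward pass
    err = H @ W2 - y                          # output vs. expected output
    grad_W2 = H.T @ err / len(X)              # back-propagate the error ...
    grad_W1 = X.T @ ((err @ W2.T) * (1 - H ** 2)) / len(X)
    W2 -= 0.1 * grad_W2                       # ... and adjust the weights
    W1 -= 0.1 * grad_W1

mse = float(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))
print(mse0, "->", mse)                        # training error shrinks
```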
The Extra Trees algorithm is an ensemble learning technique that uses decision trees as its base learners and incorporates randomness into how each decision tree is built. The Extra Trees approach increases the diversity and resilience of the model by randomly choosing a subset of features and a threshold while building each decision tree. When splitting decision tree nodes, Extra Trees differs from conventional decision tree algorithms in that it chooses a feature from the feature subset and a random threshold, rather than the best feature and threshold according to a split criterion (e.g., information gain or the Gini coefficient). This randomization strengthens the Extra Trees approach: it lowers the risk of overfitting and increases computational efficiency. As a fast, reliable, and highly parallelizable ensemble learning method, Extra Trees is widely used in disciplines such as data analysis and predictive modelling. The randomization boosts model diversity and increases prediction stability and accuracy, making the method suitable for complicated problems and big data. Consequently, the Extra Trees algorithm is adopted.
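The node rule that distinguishes Extra Trees, a random feature and a random threshold rather than the best split, can be sketched as follows (illustrative only, on synthetic data):

```python
import numpy as np

def random_split(X, rng):
    """Extra-Trees-style node rule: pick a random feature, then a random
    threshold between that feature's min and max (not the best split)."""
    feature = int(rng.integers(X.shape[1]))
    lo, hi = X[:, feature].min(), X[:, feature].max()
    threshold = float(rng.uniform(lo, hi))
    left = X[:, feature] <= threshold          # boolean mask of the left child
    return feature, threshold, left

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
feature, threshold, left = random_split(X, rng)
print(feature, round(threshold, 3), int(left.sum()), "samples go left")
```

Because no split criterion is evaluated, each node costs only a random draw, which is where the algorithm's speed and extra diversity come from.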
Gradient Boosting (the GB algorithm) is an optimization technique frequently used in machine learning and artificial intelligence. It iteratively trains a number of weak learners to build a strong ensemble model. The GB algorithm has the following benefits over other machine learning algorithms. First, it is versatile and effective: it can be applied to a variety of problems, including classification and regression, and by merging many weak learners it handles complex data relationships well and delivers excellent predictions in many scenarios. Second, it is adaptive: based on the discrepancy between the true labels and the previous round's predictions, the GB algorithm adjusts the contribution of each weak learner in the subsequent round, gradually improving the model and adapting to the properties of the data. The GB algorithm also exhibits strong robustness and generalizability: by combining several weak learners and maximizing their strengths, it lowers the chance of overfitting, and it can deal with common practical issues such as high-dimensional sparse data and missing values, which strengthens the model in real-world applications. The GB method was therefore selected as a research model for this work, as a powerful and versatile optimization algorithm with widespread applications and research value in machine learning and artificial intelligence.
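The iterative idea, each round fitting a weak learner to the current residuals, can be sketched with decision stumps for regression (an illustrative toy, not the GB implementation used in the study):

```python
import numpy as np

def fit_stump(X, r):
    """Best single-feature threshold stump for residuals r (least squares)."""
    best = None
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():
                continue
            pred = np.where(left, r[left].mean(), r[~left].mean())
            sse = float(((r - pred) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, j, t, r[left].mean(), r[~left].mean())
    _, j, t, vl, vr = best
    return lambda Z: np.where(Z[:, j] <= t, vl, vr)

# Gradient boosting for regression: each round fits a weak learner to the
# residuals of the current ensemble, then adds it with a shrinkage factor.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(120, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=120)

pred, stumps, lr = np.zeros(120), [], 0.3
for _ in range(30):
    stump = fit_stump(X, y - pred)       # fit the current residuals
    stumps.append(stump)
    pred += lr * stump(X)

print(float(np.mean((y - pred) ** 2)))   # MSE shrinks as rounds accumulate
```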
The KNN (K-Nearest Neighbors) algorithm is an instance-based learning technique that identifies the K nearest neighbours of the target sample in order to make a classification or regression prediction. Rather than explicitly building a model, the KNN method makes judgements based on the training samples already available. First, the KNN algorithm is straightforward to understand and does not require an explicit training procedure, making it simple to use. Second, it can be applied to many types of data and problems and makes no assumptions about the distribution of the data. Because multiple neighbours are taken into account, the KNN method also has strong fault tolerance: its predictions remain generally stable in the presence of noisy or anomalous data. Overall, KNN is an easy-to-use machine learning technique for small datasets and straightforward classification or regression tasks. As a result, the KNN algorithm is adopted as a research model.
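For regression, KNN simply averages the targets of the K nearest training samples; a minimal sketch on hypothetical data:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Average the targets of the k nearest training samples (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]
    return float(y_train[nearest].mean())

# Toy regression: the query at 2.1 is closest to the points at 1, 2, and 3,
# so the prediction is the mean of their targets.
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([2.0, 4.0, 6.0, 20.0])
print(knn_predict(X_train, y_train, np.array([2.1]), k=3))   # 4.0
```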
The MLP (Multi-Layer Perceptron) is a machine learning model based on the artificial neural network (ANN). It is made up of several layers of neurons, each of which is connected to the neurons in the previous and next layers. The MLP algorithm has several advantages. First, it is capable of parallel distributed processing: the MLP's architecture enables it to perform computation and processing in parallel, giving it an edge when dealing with massive amounts of data and multiple concurrent jobs. Second, an MLP can learn and represent nonlinear mapping relationships: by introducing activation functions that apply nonlinear transformations in each neuron, it can learn and express complicated nonlinear relationships, increasing its flexibility and capability in problems with complicated characteristics. As a result, the MLP algorithm is adopted as a research model.
The SVM (Support Vector Machine) algorithm is a popular machine learning model whose goal is to find the best hyperplane dividing the sample space into distinct classes; this hyperplane is chosen to maximize the margin between the support vectors of the two classes. The SVM method has several advantages. First, it has strong generalization capability: because SVM seeks the maximum-margin hyperplane, it can effectively handle high-dimensional data and difficult nonlinear problems while avoiding overfitting. Second, SVM can handle small sample datasets: because it is based on the principle of margin maximization, it is not sensitive to the size and dimensionality of the dataset and can still deliver good classification results on small samples. In summary, SVM is chosen as a research model.
XGBoost (Extreme Gradient Boosting) is a gradient-boosting-tree-based machine learning algorithm with strong predictive modelling and feature engineering capabilities. Its benefits include the following. Computational performance: through parallel processing and well-optimized algorithms, XGBoost achieves exceptional computational performance; it employs multithreading and approximation techniques to use computing resources efficiently and to accelerate training and inference. Scalability: XGBoost can handle high-dimensional feature spaces and supports modelling on large-scale datasets. Interpretability: XGBoost can output importance scores for individual features, making the model's conclusions more understandable. In short, XGBoost is a powerful and efficient machine learning algorithm with exceptional performance and a broad range of application fields. As a result, this model is selected as a research model.
TabNet is an artificial neural network (ANN)-based method designed to analyze tabular data; it was proposed in 2019 and has since established a solid reputation. TabNet improves the modelling of tabular data by introducing a sparse attention mechanism and a decision-tree-like structure between the input and output. Compared with many standard ANNs, TabNet can learn adaptively: it is trained with a back propagation algorithm that compares the difference between the network's output and the expected output and adjusts the connection weights accordingly, enabling TabNet to adapt to new datasets and tasks and to improve forecast performance by learning and modifying model parameters. Furthermore, TabNet is fault-tolerant: owing to its distributed topology and redundant connections, it tolerates the failure of some nodes or connections, which increases its robustness and helps it sustain strong performance in the presence of noisy or missing data. As a result, TabNet is selected as a study model to be compared with the other models. The principle of TabNet is shown in Fig. 7.
The aforementioned models can be broadly classified into models based on decision trees and models based on artificial neural networks; correspondingly, the two categories of transfer learning methods used are fine-tuning and feature selection. Figure 8 illustrates how the fine-tuning strategy is applied to the artificial-neural-network-based models to achieve transfer learning. As depicted in Fig. 9, the decision-tree-based models employ the feature selection method to implement transfer learning.