Long-term rainfall forecast model based on the TabNet and LightGBM algorithms

Because of unbalanced data sets and distribution differences in long-term rainfall prediction, current rainfall prediction models generalize poorly and cannot achieve good results in real scenarios. This study uses multiple atmospheric parameters (such as temperature, humidity, and atmospheric pressure) to establish a TabNet-LightGBM rainfall probability prediction model. It applies feature engineering (generating descriptive statistical features and feature fusion) to improve model accuracy, the Borderline-SMOTE algorithm to mitigate data set imbalance, and adversarial verification to reduce the distribution difference. The proposed model is verified with five years of precipitation data from 26 stations in the Beijing-Tianjin-Hebei region of China; the test set consists of one month of rainfall at each station. The experimental results show that the model performs well, with an AUC above 92%. The method proposed in this study further improves the accuracy of rainfall prediction and provides a reference for data mining tasks.


Introduction
Rainfall is an important parameter in weather forecasting and flood control, so obtaining precipitation information quickly and accurately has attracted increasing attention from meteorological researchers 1,2. Meteorological disasters such as droughts and floods now occur frequently and cause serious losses, which calls for further improvement in the accuracy of weather forecasts 3. Rainfall is affected by many key factors, such as hydrology, location, and circulation, and constitutes a nonlinear system 4.
Therefore, it is of great significance to establish an accurate, well-generalized rainfall prediction model 5,6.
At present, there are various methods for predicting the probability of rainfall. Suning Liu 7 developed a recursive approach to long-term prediction of monthly precipitation using genetic programming, and Hongya Li 8 also studied rainfall prediction. With decades of data accumulation, neural networks stand out among many methods by virtue of their excellent capability for processing massive data. Jinle Kang 11 deployed Long Short-Term Memory (LSTM) network models to predict precipitation based on meteorological data from 2008 to 2018 in Jingdezhen City.
Yongtao Wang 12 combined the artificial bee colony (ABC) algorithm with a backpropagation neural network in a precipitation prediction model. Yang Liu 13 used the BP-NN algorithm with an added precipitable water vapor (PWV) feature to establish a high-accuracy short-term rainfall prediction model. These results show that machine-learning-based rainfall forecast models are practical and reliable.
Because of unbalanced data sets and distribution differences in long-term rainfall prediction, current rainfall prediction models generalize poorly and cannot achieve good results in real scenarios. This paper makes improvements in the following aspects: (1) fusing the TabNet network and LightGBM 27 to obtain better generalization ability; (2) using adversarial verification to reduce the distribution difference; (3) generating descriptive statistical features and using feature fusion to improve model accuracy; (4) using the Borderline-SMOTE algorithm to mitigate data imbalance.

Theory of the TabNet algorithm
Topological structure of the TabNet algorithm. Deep neural networks have achieved great success on images 23, text 24, and audio 25. For tabular data sets, however, tree models still dominate. In many data mining competitions, XGBoost and LightGBM are the first choice among many algorithms because they (1) fit hyperplane boundaries in tabular data well, (2) offer good interpretability, and (3) train quickly.
For a traditional DNN, blindly stacking network layers easily leads to over-parameterization, so DNN performance on tabular data sets is often unsatisfactory. The TabNet network, proposed by Sercan Ö. Arık 14 in August 2019, retains the end-to-end and representation-learning characteristics of a DNN while adding the interpretability and sparse feature selection of tree models, and it has gradually become a first choice for tabular data tasks. (1) Feature transformer layer: computes the features and splits its output into the decision output of the current step and the information passed to the subsequent step 14. The structure is shown in Figure 2: the Feature transformer layer consists of two parts. The parameters of the first half are shared, meaning they are trained jointly across all steps, while the second half is not shared and is trained separately in each step. Since every step receives the same input features, the shared layers perform the common part of the feature calculation, and the step-specific layers compute the features of each step.
GLU is a gated linear unit 25, which adds a gate to the original FC layer.
A residual connection is used in the layer, and its output is multiplied by √0.5 to keep the variance of the network stable.
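A minimal PyTorch sketch of this structure; the two-block depth and equal layer widths are illustrative, not the paper's exact configuration:

```python
import math

import torch
import torch.nn as nn

class GLUBlock(nn.Module):
    """One FC -> BN -> GLU unit of the Feature transformer layer."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        # The FC layer outputs 2 * out_dim so the GLU can split it into
        # a linear half and a gating half.
        self.fc = nn.Linear(in_dim, 2 * out_dim)
        self.bn = nn.BatchNorm1d(2 * out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.bn(self.fc(x))
        linear, gate = h.chunk(2, dim=-1)
        return linear * torch.sigmoid(gate)  # GLU gating

class FeatureTransformerStep(nn.Module):
    """Two stacked GLU blocks with the residual scaled by sqrt(0.5)."""

    def __init__(self, dim: int):
        super().__init__()
        self.glu1 = GLUBlock(dim, dim)
        self.glu2 = GLUBlock(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.glu1(x)
        # Residual connection; sqrt(0.5) keeps the variance stable across blocks.
        return (x + self.glu2(x)) * math.sqrt(0.5)
```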
The Feature transformer layer performs the calculation of the features selected in the current step. A decision tree only constructs combinations based on size relations of single features and does not consider more complex cases; TabNet uses the more expressive Feature transformer layer for these calculations and, on some feature combinations, does better than decision trees. (2) Attentive transformer layer: a learnable mask layer selects the salient features, and this sparse feature selection makes the learning in each step more effective 14. According to the structure of the Attentive transformer layer, its calculation can be written as

M[i] = sparsemax(P[i−1] · h_i(a[i−1]))

where sparsemax is a sparse variant of softmax that encourages sparsity by Euclidean projection onto the probabilistic simplex 15, h(·) denotes the FC + BN layers, a[i−1] is the part of the previous step's output produced by the Split layer, and P[i−1] is the prior-scales term, which indicates to what degree each feature was used in the previous steps.
If a feature has been used many times in the previous steps, it should no longer be selected, so the model uses the prior-scales term, P[i] = ∏_{j≤i} (γ − M[j]), to reduce the weight of such features. As the formula shows, if γ = 1, each feature can be used only once.
The Attentive transformer layer thus obtains the mask matrix of the current step from the results of the previous step while keeping the mask sparse and non-repetitive. The mask vectors of different samples can differ, which means TabNet allows different samples to select different features, i.e., it is instance-wise, a property tree models lack. For additive models such as XGBoost, one step is one tree, and the features used in that tree are selected over all samples (for example, by computing information gain), so it cannot be instance-wise 28.
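A minimal PyTorch sketch of the Attentive transformer and the prior-scales update, assuming the sparsemax implementation from the entmax package and an illustrative relaxation parameter gamma:

```python
import torch
import torch.nn as nn
from entmax import sparsemax  # pip install entmax

class AttentiveTransformer(nn.Module):
    """Produces the sparse feature-selection mask M[i] for one decision step."""

    def __init__(self, attn_dim: int, n_features: int):
        super().__init__()
        self.fc = nn.Linear(attn_dim, n_features)  # h(.) = FC + BN
        self.bn = nn.BatchNorm1d(n_features)

    def forward(self, a_prev: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # M[i] = sparsemax(P[i-1] * h(a[i-1]))
        return sparsemax(prior * self.bn(self.fc(a_prev)), dim=-1)

def update_prior(prior: torch.Tensor, mask: torch.Tensor, gamma: float = 1.5):
    """P[i] = P[i-1] * (gamma - M[i]); with gamma = 1 a feature can be used once."""
    return prior * (gamma - mask)
```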
(3) Split layer: cuts the vector output by the Feature transformer layer into two parts, one for the decision output of the current step and one passed to the next step.
The global importance of the normalized features can be expressed as formula 5, which also demonstrates the interpretability of TabNet. In general, TabNet uses a sequential multi-step framework to construct a neural network similar to an additive model; the key components are the Attentive transformer layer and the Feature transformer layer.
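For reference, the aggregate feature importance defined in the TabNet paper 14, which formula 5 normalizes, has the form

M_agg[b, j] = Σ_i η_b[i] · M_b,i[j] / Σ_j′ Σ_i η_b[i] · M_b,i[j′],  with η_b[i] = Σ_c ReLU(d_b,c[i]),

where d_b,c[i] is the decision output of step i for sample b, so each step's mask is weighted by that step's contribution to the final decision.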
Self-supervised learning of the TabNet algorithm. TabNet applies a self-supervised learning method, obtaining a representation of the tabular data through an encoder-decoder framework, which also helps classification and regression tasks. The encoded representation is the sum vector of the encoder without the FC layer.
The encoded representation is used as the input of the decoder, which uses a Feature transformer layer to reconstruct features from the representation vector; summing over several steps yields the reconstructed features.
Let S ∈ {0,1}^{B×D} be the matrix for masking the features and f the feature data; the input of the encoder is then (1 − S) ⊙ f. If the decoder output is f̂, self-supervised learning minimizes the difference between the true values S ⊙ f and the reconstruction S ⊙ f̂. Since the magnitudes of different features are not necessarily the same, a normalized MSE is used as the loss.
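A minimal sketch of this pretraining loss, assuming the per-feature standard deviation over the batch as the normalizer:

```python
import torch

def tabnet_pretraining_loss(f, f_hat, S, eps=1e-9):
    """Normalized MSE between S * f and S * f_hat: each feature is scaled by its
    standard deviation so features of different magnitudes contribute comparably."""
    std = f.std(dim=0, keepdim=True) + eps
    diff = S * (f_hat - f) / std
    return (diff ** 2).sum() / S.sum().clamp(min=1)  # average over masked entries

# The encoder sees (1 - S) * f and the decoder must reconstruct the masked part.
f = torch.randn(32, 18)                       # B x D feature matrix
S = torch.bernoulli(torch.full_like(f, 0.2))  # mask ~20% of the entries
```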
In addition, so that the model learns a representation of the entire feature data, the mask matrix S is resampled in every round of self-supervised training to ensure the overall representation ability of the encoder.

Descriptive statistical features.
Figure 3 shows the correlation between each feature and the probability of rainfall. The probability of rainfall is determined by many factors, but it correlates relatively strongly with RHU_AVG (average relative humidity), GST_MIN (minimum surface temperature), TEM_MIN (minimum atmospheric temperature), and EVP_SMALL (small-scale evaporation). These features are combined with the discrete features to generate descriptive statistical features, such as the mean and standard deviation of the humidity at each station, which increases the dimensionality of the data set by 18.
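A minimal pandas sketch of generating such per-station statistics; the column names station_id and RHU_AVG are illustrative stand-ins for the data set's actual fields:

```python
import pandas as pd

df = pd.DataFrame({
    "station_id": [1, 1, 2, 2],
    "RHU_AVG": [55.0, 60.0, 70.0, 72.0],  # average relative humidity
})

# Per-station mean and standard deviation of humidity, merged back
# as two new descriptive statistical features.
stats = (df.groupby("station_id")["RHU_AVG"]
           .agg(RHU_AVG_mean="mean", RHU_AVG_std="std")
           .reset_index())
df = df.merge(stats, on="station_id", how="left")
print(df)
```

Repeating this for the other highly correlated features yields the 18 added dimensions mentioned above.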

Feature fusion.
Precipitable water vapor (PWV) 16 is an important meteorological parameter: abundant water vapor is the basic condition for the formation of rainfall and strong convective weather. PWV is therefore one of the important inputs for weather forecasting.
Zenith total delay (ZTD) occurs because Global Navigation Satellite System (GNSS) signals are refracted by the atmosphere as they pass through the troposphere. ZTD comprises the zenith hydrostatic delay (ZHD) and the zenith wet delay (ZWD) 17, and ZHD accounts for approximately 90% of ZTD 18. ZHD can be calculated with (7):

ZHD = 0.002277 · P / (1 − 0.00266 · cos(2φ) − 0.00028 · H)    (7)

where P is the surface pressure of the station in hPa, φ is the latitude of the station in radians, and H is the geodetic height of the station in km. ZWD is then obtained by subtracting ZHD from ZTD, and PWV can be calculated with (8):

PWV = Π · ZWD,  Π = 10⁶ / (ρw · Rv · (k3/Tm + k′2))    (8)

where ρw is the density of liquid water, Rv is the specific gas constant of water vapor, k′2 and k3 are atmospheric refractivity constants, Tm is the weighted mean temperature of the atmosphere, and Π is the conversion factor.
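A minimal sketch of this ZTD-to-PWV conversion, assuming the standard Saastamoinen ZHD model and Bevis-style constants for Π; the exact constants used in the paper may differ:

```python
import math

def zhd_saastamoinen(P_hpa: float, lat_rad: float, H_km: float) -> float:
    """Zenith hydrostatic delay (m) from surface pressure (hPa), latitude (rad)
    and geodetic height (km), per formula (7)."""
    return 0.002277 * P_hpa / (1 - 0.00266 * math.cos(2 * lat_rad) - 0.00028 * H_km)

def pwv_from_ztd(ztd_m: float, P_hpa: float, lat_rad: float, H_km: float,
                 Tm_K: float) -> float:
    """PWV (m) = Pi * ZWD with ZWD = ZTD - ZHD, per formula (8)."""
    zwd = ztd_m - zhd_saastamoinen(P_hpa, lat_rad, H_km)
    rho_w, Rv = 1000.0, 461.5   # liquid water density (kg/m^3), vapor gas constant
    k2p, k3 = 0.221, 3.739e3    # refractivity constants in SI units (K/Pa, K^2/Pa)
    Pi = 1e6 / (rho_w * Rv * (k3 / Tm_K + k2p))  # conversion factor, ~0.15
    return Pi * zwd

# Illustrative call: ZTD of 2.4 m at 39.9 deg N, 50 m height, Tm ~ 273 K
print(pwv_from_ztd(2.4, 1013.0, math.radians(39.9), 0.05, 273.0))  # ~0.014 m
```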
Evaluation metrics.
In the confusion matrix, TP denotes a true positive, FN a false negative, FP a false positive, and TN a true negative. AUC is the area under the ROC curve; this article uses AUC as the evaluation index of the model.
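With the predicted probabilities in hand, the AUC can be computed with scikit-learn, for example:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]              # rain / no-rain labels
y_prob = [0.10, 0.40, 0.35, 0.80]  # predicted rainfall probabilities
print(roc_auc_score(y_true, y_prob))  # area under the ROC curve
```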

Unbalanced data set.
Unbalanced data is common in financial risk control, anti-fraud, advertising recommendation, and medical diagnosis. In unbalanced data the proportions of positive and negative samples differ greatly, and a model built on such data tends to favor the majority-class label, which has low practical value. Figure 5 shows the rainfall probability distribution.
SMOTE synthesizes a new minority sample as x_new = x + δ · (x′ − x), where δ is a random number between 0 and 1 and x′ is a sample randomly selected from the k nearest minority-class neighbors of x.
However, SMOTE has an important flaw: it treats all minority samples equally and ignores the class information of neighboring samples, so sample aliasing often occurs and classification performance suffers. To improve on this, this study uses Borderline-SMOTE to address the data imbalance.
Borderline-SMOTE is an improved oversampling algorithm based on SMOTE that synthesizes new samples only from minority samples on the class border. Its sampling process divides the minority samples into three categories, namely Safe, Danger, and Noise, and oversamples only the Danger samples (a minimal code sketch follows the list below):
(1) Safe: more than half of the surrounding samples are minority samples (point B in Figure 8).
(2) Danger: more than half of the surrounding samples are majority samples; these are regarded as samples on the boundary (point C in Figure 8).
(3) Noise: the sample is entirely surrounded by majority-class samples and is regarded as noise (point A in Figure 8).

After processing the training set with the Borderline-SMOTE algorithm, the ratio of non-rainy days (0) to rainy days (1) is shown in Figure 9, which shows that the data imbalance problem has been alleviated.
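As referenced above, a minimal sketch of this oversampling step using the BorderlineSMOTE implementation from the imbalanced-learn package, on toy data standing in for the real training set:

```python
from collections import Counter

from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification

# Toy imbalanced data standing in for the rain (1) / no-rain (0) training set.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))  # roughly 9:1

# 'borderline-1' synthesizes new samples only from the Danger (border) minority points.
sampler = BorderlineSMOTE(kind="borderline-1", random_state=42)
X_res, y_res = sampler.fit_resample(X, y)
print("after:", Counter(y_res))  # classes balanced
```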

Adversarial verification.
Long-term rainfall prediction faces distribution differences between data sets, which lead to a large gap between the training set and the test set, an unstable model, and no practical application value. This study uses adversarial verification to mitigate this problem.
We add a label distinguishing training data from test data, for example 'is_test' = 0 for the training set and 'is_test' = 1 for the test set, merge the two data sets, and then train a model (LightGBM in this study) to classify 'is_test'. If the AUC is larger than 0.7, the distributions of the training set and the test set differ considerably; if the AUC is 0.5, there is no obvious distribution difference. The AUC in this study is 0.89, which proves that the data set has obvious distribution differences.

Figure 10 shows the feature importance distribution: WIN_MAX (maximum wind speed) contributes the most to the classification, indicating that it has the largest distribution difference between the training set and the test set, so we delete the WIN_MAX feature to reduce that difference. The figure also shows that RHU_AVG (average relative humidity) and RHU_MIN (minimum relative humidity) contribute considerably to the model, but deleting these two features would seriously reduce the model's fit on the training set, and the AUC of the test set would drop with it; adversarial verification is clearly not orthogonal.

Figure 11 is a flowchart of the entire experiment. Table 2 shows that the LightGBM and TabNet models used in this experiment obtained 91.0313% and 91.8567% AUC on the test set, respectively, about 5% higher than the BP-NN and LSTM models used in the comparative experiments.
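A minimal sketch of this adversarial-verification procedure; the two random frames are illustrative stand-ins for the real training and test tables:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

cols = ["TEM_MIN", "RHU_AVG", "WIN_MAX"]  # a few of the paper's features
train_df = pd.DataFrame(np.random.rand(200, 3), columns=cols)
test_df = pd.DataFrame(np.random.rand(100, 3) + 0.3, columns=cols)  # shifted on purpose

# Label each row by its origin and try to classify that label.
data = pd.concat([train_df.assign(is_test=0), test_df.assign(is_test=1)],
                 ignore_index=True)
X, y = data.drop(columns="is_test"), data["is_test"]

clf = lgb.LGBMClassifier(n_estimators=100)
prob = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
print("adversarial AUC:", roc_auc_score(y, prob))  # ~0.5 means similar distributions

# The most important features are the ones whose distributions differ most.
clf.fit(X, y)
print(dict(zip(X.columns, clf.feature_importances_)))
```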
Similarly, after fusing TabNet and LightGBM, the AUC of the model improves slightly, especially on the test set, which reduces overfitting and demonstrates the feasibility of model fusion. These results prove that the fused LightGBM-TabNet model predicts the probability of rainfall well.
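The fusion itself is a simple average of the two models' predicted probabilities; a minimal sketch with illustrative numbers:

```python
import numpy as np

# Predicted rainfall probabilities from each model on the same test rows.
prob_lgb = np.array([0.12, 0.80, 0.55])
prob_tabnet = np.array([0.20, 0.74, 0.61])

prob_fused = (prob_lgb + prob_tabnet) / 2  # equal-weight model fusion
```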

Conclusion
Rainfall is affected by a variety of meteorological factors and is a complex nonlinear system. This paper therefore proposes a prediction model that combines the TabNet neural network with the LightGBM decision tree. Feature engineering generates descriptive statistical features to improve the model's accuracy; feature fusion mines the potential value of each feature to raise the model's upper limit; the Borderline-SMOTE algorithm mitigates the imbalance of the data set; and in the training phase, adversarial verification reduces the distribution difference between the training set and the test set.
Finally, the predictions of LightGBM and TabNet are averaged to reduce the impact of overfitting. The model achieves an AUC of 0.9278. This result proves the reliability of using a hybrid TabNet-LightGBM model to predict precipitation and provides a new method for data mining tasks. Future research should use more data, better parameters, and more reasonable feature engineering to increase the generalization ability of the model.