Modeling tropospheric zenith wet delays in the Chinese mainland based on machine learning

In this study, the tropospheric zenith wet delay (ZWD) modeling was realized based on the regression analysis of 7 years (2013–2019) of radiosonde data at 182 sites in the Chinese mainland and the surroundings through two machine learning (ML) approaches including back-propagation neural network (BPNN) and random forest (RF). Furthermore, the forecasting performance of the ML-based models and other formulae for ZWD calculation was assessed by the sounding profiles at the discrete sites for the year 2020. Our results show that RF-based and BPNN-based blind ZWD models obtain an overall accuracy of 4.7 cm in the Chinese mainland, which is slightly superior to that of the empirical Global Pressure and Temperature (GPT3) model (RMS: 4.8 cm). On the other hand, when sites access meteorological data, the ML-based models with meteorological parameterization (CZWD-C and CZWD-F) can achieve overall accuracies of 3.5 cm and 3.7 cm in the Chinese mainland, respectively. Their accuracies improved by 19% and 16% compared to the Saastamoinen model and improved by 12% and 9% in contrast to the Askne and Nordius formula. Moreover, ZWD forecasting accuracy in the Chinese mainland is significantly improved by introducing surface meteorological parameters into the functional formulation, in particular with the surface water vapor pressure. Furthermore, compared with the GPT3, the ML-based models with meteorological parameterization can improve ZWD forecasting accuracy across mainland China, especially in regions with monsoon climate patterns.


Introduction
Microwave signals emitted from radio sources or satellites are refracted due to the composition and physical condition of the neutral atmosphere when they traverse through this region.The refraction effect caused by the neutral atmosphere is commonly referred to as tropospheric slant delay, and it depends on the refractivity along the signal propagation path.In general, the tropospheric slant delay is separated into two components: a hydrostatic part due to the induced dipole moment of the mixed gases and a nonhydrostatic (wet) part due to the permanent dipole moment of the water molecule (Davis et al. 1985).With the high variation of the humidity in the lower atmosphere both spatially and temporally, it is difficult to accurately estimate the tropospheric wet delay of the observations at low elevation angles, especially in specific weather and climate conditions.Therefore, the tropospheric wet delay modeling has been one of the main limiting factors in the analysis and applications of space geodetic observations.It has been proved that accurate a priori values for wet delays can effectively improve the performance of the precise kinematic positioning in terms of convergence time and the positioning precision after convergence is achieved (De Oliveira et al. 2017;Lu et al. 2017).
The common concept for modeling tropospheric slant wet delay is to determine the delay in the zenith direction and multiply it with an elevation-dependent mapping function based on the assumption of an azimuthally symmetric 171 Page 2 of 11 atmosphere (Davis et al. 1985;Böhm et al. 2006;Landskron and Böhm 2018).In this study, we set aside the modeling of the troposphere mapping function and focus our efforts on modeling zenith wet delay (ZWD).Currently, several formulae with meteorological parameterization have been developed to forecast ZWD, including the traditional Hopfield (1969) model, Saastamoinen (1972) model, and Askne and Nordius (1987) (A&N) formula.Among them, the Saastamoinen and Hopfield models were built based on the analyses and representation of the vertical distribution of temperature, pressure, and humidity in the lower atmosphere.When surface water vapor pressure and temperature are available, both models can be used to estimate ZWD values.However, ZWD forecasting accuracy for both models is still limited due to the poor capability of accounting for time-varying features of the atmospheric water vapor in the height domain.On the other hand, the A&N formula was developed by considering vertical humidity distribution in the troposphere.The A&N formula can forecast ZWD using the parameters of weighted mean temperature (T m ), water vapor decay factor, and surface water vapor pressure, which is the most accurate and widely used formula for ZWD determination.
Furthermore, in addition to the traditional ZWD formulae mentioned above, global empirical troposphere models can directly yield ZWD values by the input parameters of the day of the year (DOY) (or Modified Julia Day, MJD) and approximate station coordinate (Dousa and Elias 2014).These blind troposphere models, such as Global Pressure and Temperature 3 (GPT3) (Landskron and Böhm 2018) and the former version GPT2w (Böhm et al. 2015), adopted the functional formulations with the predefined periodic form to account for climatological mean and seasonal variations of meteorological parameters and their vertical distribution.They were realized mainly based on the monthly reanalysis data of the numerical weather model (NWM).Therefore, these blind models are unsuitable for accurately forecasting the nonstationary patterns or short-term fluctuations in the neutral atmosphere.Moreover, due to the limited spatial resolution of NWM data and the usage of predefined functional formulation, the global empirical troposphere models cannot well represent complex variation in the tropospheric wet delay for the regions with specific climate patterns.
In recent years, thanks to the powerful capacity to extract information from a large set of the historical dataset and model nonlinear features through regression analysis, the machine learning (ML) technique has attracted increasing attention, and it has been widely applied in geoscience studies, for example, ionospheric parameter modeling (Chen et al. 2022;Xiong et al. 2021;Zhukov et al. 2021), precipitation forecast (e.g., Min et al. 2018;Manandhar et al. 2019), and atmospheric forecast and climate analysis (Han et al. 2017;Arcomano et al. 2020).In the current study, considering the drawbacks of the traditional formulae and empirical troposphere models mentioned above, we developed comprehensive ZWD models for the Chinese mainland through two machine learning approaches, including back-propagation neural network (BPNN) and random forest (RF), which are called CZWD_BP and CZWD_RF, respectively.The developed ML-based models not only directly yield ZWD at the respective site using the DOY and approximate station coordinates but can also further improve forecasting performance with the support of surface meteorological data.Moreover, the developed ML-based ZWD models for mainland China were assessed by the comparisons with the numerical integration of temperature and humidity profiles at discrete sites for the year 2020.
The next section describes the dataset and methodology of the tropospheric ZWD determination.Then, we will introduce the fundamentals of random forest and backpropagation neural network approaches, along with the tropospheric ZWD modeling in mainland China through these two machine learning approaches.Furthermore, the ZWD forecasting accuracy is assessed by comparing it with the numerical integration of the atmospheric profiles at discrete radiosonde sites.The summary and conclusions are given in the last section.

Dataset and methodology of ZWD calculation
In the current study, radiosonde data were used to model the tropospheric ZWD in mainland China through two ML approaches and validate the modeling accuracy.The radiosonde measurements provide vertical profiles of meteorological variables covering the troposphere, which include air temperature, total pressure, relative (specific) humidity, geopotential height, wind speed, direction, etc., at a series of significant isobaric levels (Li et al. 2020).These atmospheric vertical profiles are usually obtained twice or four times per day at UTC 00:00, 06:00, 12:00, and 18:00 at near-globally distributed sites.Up to date, several institutions, such as the Integrated Global Radiosonde Archive (IGRA) and the University of Wyoming, provide public archives for freely accessing quality-assured radiosonde data at more than 1500 stations worldwide.
Based on the meteorological profiles obtained from radiosonde data, the tropospheric zenith wet delay ΔL z w can be determined by integrating non-hydrostatic refractivity in the zenith direction from the station altitude h s to the top of the lower atmosphere (Nilsson et al. 2013) h top , which can be expressed as follows: Subscript 'w' and superscript 'z' in the (1) represent wet delay and zenith direction, respectively; N nh denotes non- hydrostatic refractivity in the lower atmosphere, which is a function of air temperature and humidity variables.Hence, equation (1) can be rewritten as follows: where p w and T are partial pressure of water vapor in (hPa) and air temperature in (K) at each level, respectively; k i (i = 1, 2, 3) are atmospheric refractivity coefficients; R d and R w denote the gas constant for dry gases and water vapor, respectively; Z w is the compressibility factor for atmospheric water vapor that represents the deviation of the atmosphere from an ideal gas.It should be noted that a strict quality control strategy was performed for the preprocessing of radiosonde data.On the one hand, the sounding records with two adjacent pressure levels exceeding 200 hPa or those with a top pressure layer of more than 300 hPa are removed.On the other hand, only the radiosonde data containing records with mandatory levels and significant levels are considered.Finally, 7 years (2013-2019) of IGRA sounding profiles at 182 sites in mainland China and the surroundings were employed for the ZWD modeling through the machine learning (BPNN and RF) approach.In addition, the radiosonde data for the year 2020 were used for model validation.
Compared with the rigorous method, the traditional formulae such as the Saastamoinen model and A&N formula are usually used to estimate ZWD values when surface meteorological data at the site are available.The functional expression of the Saastamoinnen model is shown as follows (Davis et al. 1985): where p w,s and T s are the partial pressure of water vapor and surface temperature at the site.and h s represent the site's latitude and height, respectively.
Currently, the A&N formula is the most accurate way to derive tropospheric ZWD estimations by the input parameters of the surface water vapor pressure, weighted mean temperature T m , and water vapor decrease factor (Askne and Nordius 1987): where g m is the gravitational acceleration at the mass center of the vertical column of the atmosphere.

Machine learning approaches for ZWD modeling
The main objective of the current study is to construct the functional relationship between input variables and the tropospheric ZWD in the Chinese mainland through two machine learning (BPNN and RF) approaches.The input variables or features of the experimental dataset include the float DOY, station latitude, longitude, and ellipsoidal height, surface water vapor pressure, and surface temperature.The target variable is the time-varying ZWD values derived by the numerical integration of vertical meteorological profiles obtained from radiosonde measurements.
In the modeling, we devised three regression analysis schemes to study the effect of surface meteorological parameters on the ZWD modeling and to analyze the difference in ZWD forecasting between BPNN and RF approaches.For CZWD-A and CZWD-D, the timevarying ZWD series is modeled as nonlinear functions of the DOY and station coordinate.Both models are utilized for geodetic sites which cannot acquire surface meteorological data.Concerning CZWD-B, CZWD-C, CZWD-E, and CZWD-F, they are developed with meteorological parameterization; thus, they are suitable for geodetic sites with the availability of collocated synoptic data or NWM operational/forecast products.The nonlinear functional relationships between the tropospheric ZWD and these input variables for different modeling schemes are listed in Table 1. (4)

Random forest
Random forest (RF) is an ensemble learning approach proposed by Breiman (2001), which is known for its robust performance on classification and regression tasks.According to the principle of bootstrap aggregation, the RF forms the ensemble of multiple decision trees during training and outputs the mode of the classes for classification or obtains average predictions for regression (Izquierdo-Verdiguier and Zurita-Milla 2020).Thanks to the adaptive feature of the decision rule, the RF has the strong capacity to account for the nonlinear relationship between input features and the target variables in regression tasks.In the current study, we adopted the classification and regression tree (CART) (Berk 2008) as the decision tree algorithm to realize the tropospheric ZWD modeling in the Chinese mainland.The basic structure of the RF-based tropospheric ZWD modeling is presented in Fig. 1.
As illustrated in Fig. 1, the RF generates multiple homogeneous training sets from the original samples based on the bootstrap sampling principle and builds multivariable regression trees for each training set according to the decision tree algorithm.Concerning the growth of each regression tree, a bootstrap subset that includes about two-thirds of the original samples is selected.Then, the tree is grown from the training dataset by successively portioning input space with binary splits based on random input feature selection.For the regression task, the split node is determined based on the criterion of minimum regression error, which is represented as the weighted sum of the regression error as follows: Herein E(D s1 ) and E(D s2 ) are the regression error for subsets D s1 and D s2 , respectively.The respective sample number for (5) both subsets is N s1 and N s2 ; N denotes the total sample num- ber.According to the regression algorithm, the mean square error is calculated as the regression error: where y is the target variable and its mean value is repre- sented as y .Compared with an individual regression tree, the RF regression can effectively overcome the overfitting issue by integrating multiple decision trees produced by randomly selecting a subset of training samples and a subset of variables for the splits at each tree node (Maxwell et al. 2018).The final regression predictions of the RF system are determined by averaging the outputs of all decision trees as follows: where X and T k represent the input variables and the output of each decision tree, respectively.The final output of the RF is denoted as Y.

Back-propagation neural network
As a part of the artificial neural network, the back-propagation neural network (BPNN) is a multilayer feed-forward neural network realized based on the error back-propagation algorithm.The BPNN optimizes the differences between the network output and expected output through a gradient descent algorithm and is suitable to solve nonlinear optimization issues with multiple input parameters, which has been widely used in various regression tasks (Hu et al. 2017;Li et al. 2021).Generally, a BPNN system incorporates three layers: an input layer, an output layer, and one or multiple hidden layers.The number of neurons in the input and output layers represents the number of input and output variables, ( 6)  respectively.In this study, one hidden layer was applied in the BPNN-based ZWD modeling.The structure of the tropospheric ZWD modeling through the BPNN approach is illustrated in Fig. 2, and the respective functional formulation can be described as follows: Herein, g(⋅) indicates the activation function; W IH is the weight matrix for connecting neurons of the input layer and hidden layer, while W HO denotes the weight matrix used for connecting neurons of the hidden layer and output layer; B IH and B HO indicate the respective offset matrix corresponding to the W IH as well as W HO .For a BPNN system, each neuron in the current layer is directly connected to the neurons of the subsequent layer with an activation function.In the modeling, we adopted a hyperbolic tangent function as the activation function for connecting neurons between the input layer and hidden layer, which is expressed as follows: On the other hand, the linear activation function was applied to connect the neurons between the hidden layer and the output layer:

Hyperparameter determination
The number of neurons in the hidden layer for a BPNN system and the number of trees for an RF system is the hyperparameter that affects the generalization ability for different modeling schemes.To obtain a better generalization ability and study the sensitivity of modeling performance with respect to the number of hyperparameters, we devised 10 network/tree structures with the number of neurons/trees varying from 5 to 50.For each network/ tree structure, the ZWD modeling accuracy in terms of root-mean-squares (RMS) error is determined based on a tenfold cross-validation method, as shown in Fig. 3.We can find that the RMS error of ZWD estimations for both BPNN and RF approaches decreases as the increasing number of hyperparameters, which reveals that ensemble learning can help the BPNN-based and RF-based ZWD modeling achieve better generalization ability.Moreover, when the number of neurons in the hidden layer varies from 5 to 50, the ZWD modeling accuracy for the BPNN approach improved by 0.4-0.5 cm, while it only improved by approximately 0.1 cm for the RF approach when the number of trees varied from 5 to 50.It indicates that the BPNN-based ZWD modeling in the Chinese mainland is more sensitive to the varying hyperparameter than the RF approach.
Furthermore, we can see that the improvement in ZWD modeling accuracy for the RF approach is nearly negligible when the number of trees is larger than 20.A similar variation tendency can also be observed in the results for BPNN-based modeling if the number of neurons in the hidden layer is larger than 25.Therefore, we fixed the optimal hyperparameters to 25 and 20 for the realization of the BPNN-based and RF-based ZWD modeling, respectively.

Modeling validation and analysis
In this section, the developed ML-based models and other formulae for ZWD calculation were assessed using radiosonde data at 176 sites in mainland China and the surroundings for the year 2020.This study derived the ZWD references by numerically integrating temperature and humidity profiles from the site height to the top radiosonde level.Concerning the ZWD forecasting, the CZWD-A, CZWD-D, and GPT3 can directly forecast ZWD using the input parameters of the float DOY (or MJD) and approximate station coordinates, whereas the Saastamoinen model, A&N formula, and the ML-based ZWD models with meteorological parameterization require surface meteorological information at the site.Here, these surface meteorological data are obtained from the lowest data of the sounding profiles.

Modeling accuracy of ZWD predictions
Figure 4 displays the overall distribution of ZWD differences between models and radiosonde data at 176 sites in 2020.The statistical indicators in terms of mean difference (bias), standard deviation (STD), and RMS error are presented in Table 2 and Fig. 4. Table 2 and Fig. 4 show that ML-based ZWD models with meteorological parameterization perform best among these models and formulae for forecasting ZWD values in mainland China.For example, the CZWD-C/CZWD-F models can achieve mean biases of 0.4/0.4cm and overall RMS errors of 3.5/3.7 cm in the Chinese mainland, whose accuracies improved by 12% and 9% in comparison with the A&N formula and improved by 19% and 16% in contrast to the Saastamoinen model, respectively.In addition, we can find that ZWD accuracy for both BPNN and RF approaches improves significantly when surface meteorological parameters are taken into account in the modeling, as illustrated in the upper and middle panels.When considering surface water vapor pressure in the modeling, the BPNN-based and RF-based ZWD models can obtain overall accuracies of 3.8 cm (CZWD-B) and 3.7 cm (CZWD-E) in mainland China, respectively.When surface temperature is further taken into account in the functional formulation, the ZWD modeling accuracy for RF and BPNN approaches upgrades to 3.5 cm (CZWD-C) and 3.7 cm (CZWD-F), respectively,  which improved by 27% and 24% with respect to the blind GPT3 model.Thus, it is obvious that the surface water vapor pressure plays an important role in the ZWD modeling through BPNN and RF approaches in mainland China than the surface temperature.Furthermore, one can see that the blind CZWD-A and CZWD-D models yield ZWD values with a bias of − 0.7 cm and an overall RMS error of 4.7 cm in the Chinese mainland, whose accuracy is slightly better than that of the GPT3 model (bias: − 1.2 cm, RMS error: 4.8 cm).

Spatial distribution of model accuracy
In this subsection, the site-wise bias and RMS error for these models and formulae were calculated by ZWD residuals between model outputs and references derived by radiosonde data in 2020, which are utilized to study the spatial distribution of ZWD forecasting performance in the Chinese mainland.
Figure 5 shows the site-wise bias of ZWD estimations or these models and formulae in the Chinese mainland.It is obvious that empirical ZWD models including CZWD-A, CZWD-D, and GPT3 perform worse in forecasting ZWD in the regions with specific climate patterns.For example, the GPT3 model underestimates ZWD values by around 2 cm in the regions with East Asia monsoon climate type (Southeast China and its adjacent sea areas in the Pacific Ocean), while it exhibits negative biases exceeding 3 cm in the regions with Indian monsoon climate type (Southwest region outside the Chinese mainland).Meanwhile, the blind CZWD-A and CZWD-D models also display prominent negative anomalies in the regions with the Indian monsoon climate type.The Saastamoinen model shows positive biases at the island sites in the South China Sea, East China Sea, and the coast of Japan, whereas it exhibits negative biases with magnitudes of 2-6 cm in South China.Furthermore, we can see that BPNN-based and RF-based ZWD models with meteorological parameterization generally can forecast ZWD values with absolute biases within 1 cm for most sites.In comparison with the empirical models mentioned above, the RF-based and BPNN-based ZWD models with meteorological parameterization can improve forecasting ability and mitigate negative biases that occurred in the regions with East Asia monsoon climate patterns.
Figure 6 displays site-wise RMS errors of ZWD estimations in mainland China and the surroundings for these models and formulae.It is evident that the forecasting performance of these empirical ZWD models (CZWD-A, CZWD-D, and GPT3) is affected by the specific climate patterns in the Chinese mainland and the surroundings.As illustrated in the left panels, these empirical models perform relatively better in forecasting ZWD value in North China, in which their accuracy is superior to 3.5 cm.However, because the empirical models cannot reproduce complex variations of the atmospheric water vapor in the regions with East Asia monsoon and Indian monsoon climate types, the forecast accuracy of these empirical models degrades in the Compared with the empirical models, the A&N formula and ML-based models with meteorological parameterization have prominent improvements in forecasting accuracy across the Chinese mainland and the surroundings.We can see that these models, which are dependent on the surface meteorological parameters, can estimate ZWD values for North China with RMS errors within 3 cm.On the other hand, they significantly mitigate large RMS anomalies that occurred in Southeast China and regions with Indian monsoon climate type, in which they achieve accuracies of 4.5-6.0cm.Moreover, we can find that the Saastamoinen performs worst among these ZWD models that are dependent on surface meteorological parameters.Although surface water vapor pressure is considered in the functional formulation, the Saastamoinen still suffers from large RMS errors with magnitudes of 5.5-8.0 cm in Southeast China and the regions with the Indian monsoon climate type.

Temporal variation of model accuracy
To investigate the temporal variation of ZWD forecasting accuracy, we calculated the sub-daily ZWD (model-minusreference) residuals at each site in the Chinese mainland at each epoch in 2020.Based on the ZWD residuals at all tested sites, the bias and RMS error statistics at each observational epoch were obtained, as illustrated in Figs.7 and 8.
As can be seen from Figs. 7 and 8, the ZWD forecasting accuracy in mainland China for these models and formulae shows prominent seasonal variations with larger RMS errors in summer and smaller RMS errors in winter.The temporal characteristics of ZWD prediction accuracy can be attributed to the seasonal variation of the atmospheric water vapor in mainland China.There is a high content and dynamic of atmospheric water vapor in these regions during summer.Thus, it is difficult for these models to account for complex variations of atmospheric moisture, especially in regions with monsoon climate patterns.In addition, we can see that forecasting accuracy for the empirical models (CZWD-A, CZWD-D, and GPT3) in terms of RMS error varies from 2.5 to 7 cm throughout the year 2020, which is worse than that of the A&N formula and the ML-based models with meteorological parameterization (RMS errors: 2-5.5 cm).Moreover, we can see that the absolute biases of the ML-based model with meteorological parameterization are basically within 1.5 cm during the year 2020.These features indicate that the ZWD formulae and models that are dependent on surface meteorological parameters, to some extent, can well represent the variation of the atmospheric moistures and alleviate the large forecasting errors that occurred in summer for the empirical models.

Conclusions
In the current work, the tropospheric zenith wet delay modeling was realized based on 7 years (2013)(2014)(2015)(2016)(2017)(2018)(2019) of radiosonde data at 182 sites in mainland China and the surroundings through two machine learning (BPNN and RF) approaches.In the modeling, we applied three parameterization schemes to construct the nonlinear functional relationship between the tropospheric ZWD and the input variables in consideration of different application conditions.The empirical ML-based models are suitable for the sites without the surface meteorological data, while the ML-based models with meteorological parameterization are utilized to improve ZWD forecasting accuracy at sites accessible to surface meteorological data.
Furthermore, the developed ML-based models and other formulae for ZWD were assessed by the radiosonde data in mainland China and the surroundings in 2020.Our results show that RF-based and BPNN-based models with meteorological parameterization (CZWD-C and CZWD-F) can estimate ZWD values with overall RMS errors of 3.5 and 3.7 cm, respectively, whose accuracies are superior to that of the traditional Saastamoinen model and the A&N formula.On the other hand, the RF-based and BPNN-based blind models (CZWD-A and CZWD-D) generally obtain an overall accuracy of 4.7 cm in the Chinese mainland, which is slightly better than that of the GPT3 model.From the aspect of modeling accuracy, the RF regression approach has better applicability in the tropospheric ZWD modeling in the mainland of China than the BPNN method.Moreover, the ZWD forecasting accuracy of the ML-based models in the Chinese mainland has significantly improved by introducing surface water vapor pressure in the functional formulation.Compared with the empirical ZWD models, the RF-based and BPNN-based ZWD models with meteorological parameterization mitigate negative biases in the regions with East Asia monsoon climate type and improve forecasting accuracy across the mainland of China, especially the regions with monsoon climate patterns.

Fig. 1
Fig. 1 Structure of random forest for the tropospheric ZWD modeling

Fig. 2
Fig. 2 Structure of back-propagation neural network models

Fig. 3
Fig. 3 Generalization ability of BPNN-based and RF-based ZWD models in terms of RMS errors for different schemes (left panels represent RF-based modeling results and right panels denote BPNNbased modeling results)

Fig. 4
Fig. 4 Distribution of ZWD residuals (model-minusreference) for these models and formulae validated by the radiosonde data at 176 sites in 2020.PDF denotes the probability density function presented as the red curve lines in the figure

Fig. 5
Fig. 5 Mean differences of ZWD values between models and radiosonde data for 176 sites analyzed during 2020

Fig. 6
Fig. 6 RMS errors of ZWD values between models and radiosonde data for 176 sites analyzed during 2020

Fig. 7
Fig. 7 Time series of biases of ZWD estimations for these models in 2020

Table 1
Functional formulations for the tropospheric ZWD modeling schemes (Note: lat and lon denote station latitude and longitude in degree, and h_ell represents the height of the site in meters) C (DOY, lat, lon, h_ell, e s , T s ) CZWD-BP CZWD-D f D (DOY, lat, lon, h_ell) CZWD-E f E (DOY, lat, lon, h_ell, e s ) CZWD-F f F (DOY, lat, lon, h_ell, e s , T s ) 171 Page 4 of 11