Data. In this study, the following climate-related variables with yearly frequency from 1950 to 2021 were collected from multiple sources:
1. Global Mean Sea Level (GMSL) in mm was gathered from the mean of two data sources35,36. The data for the year 2021 were collected and calibrated from GSFC (2021)2.
2. Glaciers and Ice Sheets Mass Balance (Mass) in Gt (Gigaton) was estimated by two methods: (i) the product of global mean glacier mass in water equivalence from the world glacier monitoring service [1950–2021]37 and the total glaciers and ice sheets area38; (ii) Summation of 17 mountain glaciers mass balance39, Greenland and Antarctica ice sheet mass balance [2002–2016]40, with missing values imputed by Kalman filter41; overlapping period of estimation (i) and (ii) were calibrated by the average values.
3. Arctic Sea Ice August Extent (SeaIce) in Mkm2 was gathered from two sources: (i) observed August Arctic sea-ice extent [1870–2008]42, and (ii) observed north hemisphere sea ice extent in August [1979–2021]43. The least squares method was adopted to calibrate and combine these two data sources.
4. Global Surface Temperature (with sea ice area measured by the air above sea ice) (GMST) in °C was obtained from Rohde (2020)1.
5. Global Specific humidity (Humidity) in kg/kg (mass of water vapor per kilogram of moist air)44.
6. Greenhouses gases: CO245,46, CH447,48, and N2O47,49.. Global warming potential (GWP), which is measured by units of CO2 equivalents in the environmental impacts, is adopted to analyze the effect of greenhouse gases in this study. The formulation to calculate GWP is shown in Eq. (1), following Intergovernmental Panel on Climate Change (IPCC) reports50.
$$\begin{array}{r}GWP=C{O}_{2}+28C{H}_{4}+265{N}_{2}O \left(1\right)\end{array}$$
7. Sunspot Number (SSN)51.
8. Regional Mean Sea Level (RMSL) data were obtained from the Permanent Service for Mean Sea Level52,53: New York RMSL is obtained from the Battery station, and Osaka RMSL is obtained from the Osaka station.
9. New York Hourly Water Level data were collected from the University of Hawaii Sea Level Center36, and Osaka Hourly Water Level data were collected from Japan Oceanographic Data Center54.
Augmented Dickey-Fuller (ADF) test results confirm all the variables are integrated of order 1, i.e. I(1) processes, based on historical data analysis.
Path analysis model identification and estimation for global mean surface temperature (GMST) and global mean sea level (GMSL). Structural equation modeling (SEM) refers to a family of multivariate procedures designed to infer causal relationships among a set of variables55. It can be viewed as a system of multivariate regression equations with variables that can serve as both independent and dependent variables. Latent variables are also common although in this work we did not use any latent variables. Inference of SEM usually focuses on estimating the model implied population covariance matrix through the likelihood function:
$$\begin{array}{r}{F}_{ML}=\hspace{0.25em}log\left|\varSigma \left(\theta \right)\right|+tr\hspace{0.25em}\left(S{\varSigma \left(\theta \right)}^{-1}\right)-\text{l}\text{o}\text{g}\left|S\right|-q, \left(2\right)\end{array}$$
where \(\varSigma \left(\theta \right)\) is the model implied covariance matrix with \(\theta\) representing model parameters including path coefficients, \(S\) is the sample variance-covariance matrix, and \(q\) is the number of parameters to be estimated. The unified structural equation model (uSEM)4, also known as Vector Autoregression – Structural Equation Modeling (VAR-SEM) or the Dynamic SEM modeling can be used to analyze the proposed causal associations among a set of time-series variables by incorporating both the contemporaneous relations and the longitudinal relations, simultaneously. Contemporaneous relations reflect relationships between variables at the same time point, while longitudinal temporal relations are defined as relationships between variables at different time points.
Variable selection for the hypothesized full uSEM model in this work was done in two approaches:
1) A system-wise variable selection based on the entire set of uSEM equations via the regularized unified structural equation modeling (RuSEM) method22, where regularization is done through the adaptive Lasso56.
2) An equation-wise variable selection based on each individual uSEM equation, which can also be viewed as an autoregressive distributed lags (ARDL) model. The variable selection for each ARDL was done through the backward stepwise variable selection method.
The adaptive LASSO56 penalizes parameters after each parameter is scaled by the un-penalized maximum likelihood estimators (MLE) in the following equation:
$${F}_{\text{alasso }}={F}_{ML}+\lambda {∥{\theta }_{ML}^{-1}\text{*}{\theta }_{pen}∥}_{1}, \left(3\right)$$
where \(\lambda\) is the regularization parameter with a common initialization value of 0.1.
For equation-wise variable selection, each individual ARDL model has the following general form:
$$\begin{array}{r}{y}_{t}={\beta }_{0}+{\beta }_{1}{y}_{t-1}+\dots +{\beta }_{p}{y}_{t-p}+{\delta }_{0}{x}_{t}+{\delta }_{1}{x}_{t-1}+\dots +{\delta }_{q}{x}_{t-q}+{ϵ}_{t}, \left(4\right)\end{array}$$
where \({y}_{t}\) is the endogenous variable at time t, \({x}_{t}\) is the exogenous variable at time t, \({\beta }_{0}\) is the intercept, \({\beta }_{i},i\in \{1,\dots ,p\}\) and \({\delta }_{j},j\in \{0,\dots ,q\}\) are the coefficients. The Akaike information criterion (AIC) is used to de-select unnecessary paths in the backward stepwise variable selection.
For our work, the final path model selected based on the system-wise and the equation-wise variable selections methods happen to be identical – signifying the robustness of our climate pathway result. The final path model, namely, the final refitted uSEM using only the significant pathways identified, is represented in the matrix form as follows, which can also be viewed as a system of ARDL time series regression equations:
$$\begin{array}{rr}\left[\begin{array}{c}Humidit{y}_{t}\\ TEM{P}_{t}\\ Mas{s}_{t}\\ Sea{Ice}_{t}\\ GMS{L}_{t}\end{array}\right]=& \left[\begin{array}{ccccc}0& 0& 0& 0& 0\\ 0.443& 0& 0& 0& 0\\ 0& -0.022& 0& 0& 0\\ 0& -0.335& 0& 0& 0\\ 0& 0.089& 0& 0& 0\end{array}\right]\left[\begin{array}{c}{Humidity}_{t}\\ TEM{P}_{t}\\ Mas{s}_{t}\\ {SeaIce}_{t}\\ GMS{L}_{t}\end{array}\right]+\\ & \left[\begin{array}{ccccc}0.433& 0.374& 0& 0& 0\\ -0.329& 0.683& 0& 0& 0\\ 0& 0& 1.009& 0& 0\\ 0& 0& 0.320& 0.347& 0\\ 0& 0& -0.228& -0.097& 0.606\end{array}\right]\left[\begin{array}{c}{Humidity}_{t-1}\\ TEM{P}_{t-1}\\ Mas{s}_{t-1}\\ {SeaIce}_{t-1}\\ GMS{L}_{t-1}\end{array}\right]+\\ & \left[\begin{array}{c}0\\ 0.226\\ 0\\ 0\\ 0\end{array}\right]\left[GW{P}_{t-1}\right]+\left[\begin{array}{c}0.034\\ 0.029\\ -0.052\\ -0.026\\ 0.026\end{array}\right] \left(5\right)\end{array}$$
GWP forecasts under COP26. The 26th United Nations Climate Change Conference (COP26), held in Glasgow, UK, from 31 October to 13 November 2021, has reached two key resolutions to reduce the anthropogenic greenhouse gas emissions: (1) to reduce the anthropogenic CO2 (carbon dioxide) emission by 45% by 2030, compared to its 2010 levels, and (2) to reduce the anthropogenic CH4 (methane) emissions by 30% by 2030, compared with its 2020 levels.
It is assumed that without anthropogenic emissions, the greenhouse concentration should remain stable, as the same as last year. The global CO2 concentration level in 2010 and 2009 was measured at 389.89 parts per million (ppm) and 387.34 ppm respectively, featuring a yearly increase of 2.54 ppm. With the assumption, we conclude all 2.54 ppm increase is anthropogenic, that is, due to human activities. Per COP26, for 2030, the expected concentration increase caused by anthropogenic emission of CO2 will be \(\left(1-0.45\right)\text{*}2.54\hspace{0.25em}\approx 1.40\) ppm. To simplify the matter, we have adopted a linear decrease, which means every year we will see the amount of emission decrease stays constant (we can also use other models such as exponential, which will not significantly change our results). Therefore, if we can follow through the COP26 resolution, the annual increase of anthropogenetic CO2 concentration, from 2020 to 2030, in a linear decreasing pattern, will be (2.43, 2.31, 2.20, 2.09, 1.97, 1.86, 1.74, 1.63, 1.51, 1.40) ppm.
The global CH4 concentration levels in 2020 and 2019 were measured at 1879.2 parts per billion (ppb) and 1866.69 ppb respectively, featuring a yearly increase of 12.56 ppb. With the assumption above, we conclude all 12.56 ppb is anthropogenic, that is, due to human activities. Per COP26, for 2030, the expected concentration increase caused by anthropogenic emission of CH4 will be \(\left(1-0.3\right)\hspace{0.25em}\text{*}\hspace{0.25em}12.56\hspace{0.25em}\approx \hspace{0.25em}8.79\) ppb. To simplify the matter, we have adopted a linear decrease, which means every year we will see the amount of emission decrease stays constant (we can also use other models such as exponential, which will not significantly change our results). Therefore, if we can follow through the COP26 resolution, the annual increase of anthropogenetic CH4 concentration, from 2020 to 2030, in a linear decreasing pattern, will be (12.18, 11.80, 11.43, 11.05, 10.67, 10.30, 9.92, 9.54, 9.17, 8.79) ppb.
Since no constraints were proposed for N2O, it is assumed the N2O concentration will increase following its historical trend, which is well fitted by an ARIMA(1,1,1) model. The uncertainty of the forecasted GWP level associated with the COP26 scenario was calculated under the assumption that the coefficient of variation (\(CV=\sigma /\mu\), where\(\hspace{0.25em}\sigma\) is the standard deviation and \(\mu\) is the mean) remains constant. The standard deviation in the COP26 scenario is the corresponding standard deviation under the unrestricted scenario times the ratio of the mean of these two scenarios (\({\sigma }_{C}={\sigma }_{N}{\mu }_{C}/{\mu }_{N}\)). Furthermore, because the greenhouse gas concentration is unlikely to decrease, we use the expected mean to represent the lower bound and only the upper prediction interval bound was retained for the COP26 restriction scenario (Fig. 3b). The final identified uSEM, the time series forecasts of GWP and its uncertainties, are used to forecast the future trend and uncertainties of GMST and GMSL until 2100 under either the unrestricted scenario or the COP26 restriction scenario. Meanwhile, backtesting tests using historical data were conducted to validate the forecast models. The backtesting used a 9:1 ratio of the training set to the test set. For the test data, both one-step ahead (predictors updated with true observed values at each time step) and multi-step ahead forecasting (predictors updated with forecasted values at each time step) were performed. Both results show that all the true values fall within the 99% forecast intervals validating the robustness of our model. Details are available in the supplementary materials.
Regional mean sea level (RMSL) forecasts. To forecast the RMSL, we have adopted the ARDL model with the stepwise variable selection method. The ARDL variable selection results are provided in Eq. (6) below:
$$\begin{array}{rr}NYRMS{L}_{t}& =2.2970+0.3259NYRMS{L}_{t-1}+0.9442GMS{L}_{t}\\ OSARMS{L}_{t}& =-2.997+0.538OSARMS{L}_{t-1}+0.836GMS{L}_{t}, \left(6\right)\end{array}$$
where \(NYRMS{L}_{t}\) is the regional mean sea level in New York in year \(t\), \(OSARMS{L}_{t}\) is the regional mean sea level in Osaka in year \(t\), \(GMS{L}_{t}\) is the global mean sea level in year \(t\). Results for the other six coastal locations examined are available in the supplementary materials. To reduce the impact of missing values in regional sea level data, we used a longer time span (1950–2015) for the training data when backtesting regional models. Model validation is confirmed by the backtesting results, showing that all the true values fall within the predicted range for all regions.
Consistency with physics-based models and forecasts. Although the data driven uSEM pathway modeling does not rely on physical theories and models, it still comes as a relief when we found our results highly consistent with the recent forecasts based largely on scientific theories. The IPCC 2021 climate change report57 has shown projections for global mean surface temperature (GMST) and global mean sea level (GMSL) rise under different scenarios. As the IPCC reports use a different method to assess greenhouse gas emissions than our model, we use Meinshausen et al. (2011)58 to compare the scenarios of our model with those of the IPCC. According to Meinshausen et al. (2011), the GWP concentration in 2100 will be 568.96 ppm in the SSP 1-2.6 scenario and 749.91 ppm in the SSP 2-4.5 scenario, which is comparable to our COP26 scenario forecast of 591.72 ppm and the unrestricted scenario forecast of 753.13 ppm, respectively. According to IPCC, GMST under SSP2-4.5 scenario will be approximately 2.7 (2.1–3.5)°C by 2100, which is comparable to our projection of 2.79°C, and 1.8 (1.3–2.4)°C under SSP 1-2.6 scenario, which is close to our projection of 1.58 (1.58–2.15)°C under COP26 scenario. The GMSL is anticipated to rise by 463 mm to 783 mm under SSP2-4.5 scenario and 344 mm to 644 mm under SSP1-2.6 scenario in 2100, which agree well with our estimates of 686.69 (604.02-769.36) mm for the unrestricted scenario and 537.08 (537.08–610.60) mm for the COP26 scenario. All estimates are calibrated to our baselines as mentioned before.
Per physics driven forecasts conducted by our group10, the GMSL will be 2.3 feet (701.04 mm) by 2100, which is highly consistent with our data driven forecast (686.69 mm) for the unrestricted scenario. The 95% confidence interval of the New York regional sea level is estimated to range from 1.6 feet (487.68 mm) to 4.1 feet (1249.68 mm) above the baseline based on the physics driven models10, while the 95% confidence interval based on our data driven models is 929.60-1157.30 mm, residing completely inside the former confidence interval.
From a climate science point of view, the relationship between temperature and the specific humidity is not linear and outlined by the August-Roche-Magnus (ARM) formula59,60 as follows:
$$\begin{array}{rr}{e}_{s}& =6.1094\text{e}\text{x}\text{p}\left(\frac{17.625T}{T+243.04}\right)\\ q& =\beta *RH*{e}_{s}/\left(p-(1-\beta ){e}_{s}\right), \left(7\right)\end{array}$$
where \({e}_{s}\) is the saturation vapor pressure in hPa, \(T\) is the temperature in Celsius degree (\(T\) = GMST + 14.105°C), \(q\) is the specific humidity in kg/kg, RH is the relative humidity ranging from 0 to 1, \(p\) is the atmospheric pressure in hPa (\(\approx\)1013.25), and \(\beta\) is a constant coefficient (\(\approx\)0.622). Although the ARM formula indicates a nonlinear relationship between temperature (\(T\)) and humidity (\(q\)), we can prove that this nonlinear relation can be well approximated by a linear relationship between \(T\) and \(q\) via Taylor Theorem, thus validating the sufficiency of our linear model structure.
According to the Taylor expansion, a function \(q\left(T\right)\) that is infinitely differentiable at \(T={T}_{0}\) is equal to a power series of the form \(q\left(T\right)=\sum _{i=0}^{{\infty }}{a}_{i}{\left(T-{T}_{0}\right)}^{i}\), where \({a}_{i}=\frac{{q}^{\left(n\right)}\left({T}_{0}\right)}{i!}\). Therefore, we can write the ARM as:
$$\begin{array}{rr}q\left(T\right)& ={a}_{0}+{a}_{1}\left(T-{T}_{0}\right)+{R}_{2}\left(T\right). \left(8\right)\end{array}$$
According to the Taylor Theorem, the error term \({R}_{2}\left(T\right)\) satisfies \({R}_{2}\left(T\right)=\frac{{q}^{\left(2\right)} \left(T{\prime }\right)}{2!}{\left(T-{T}_{0}\right)}^{2}\) for some \(T{\prime }\) in the range of \(T\) and \({T}_{0}\). Choosing \({T}_{0}=14.62\) and using the first two terms as the linear representation \({q}_{lin}\left(T\right)\) of the ARM, we can show that the difference between the ARM formula (\({q}_{ARM}\left(T\right)\)) and the linear representation (\({q}_{lin}\left(T\right)\)) is:
$$\begin{array}{r}\left|{q}_{ARM}\left(T\right)-{q}_{lin}\left(T\right)\right|=\left|{R}_{2}\left(T\right)\right|<1.94\times {10}^{-4}RH<1.94\times {10}^{-4},\text{ for }T\in \left[\text{13.80,17.50}\right]. \left(9\right)\end{array}$$
The temperature range includes the observed data from 1950 and the 99% prediction intervals of our model. Therefore, a linear representation is sufficient in approximating the non-linear relationship between GMST and Humidity given by the ARM formula. A numeric comparison is provided in the supplementary materials.