Causal discovery of drivers of surface ozone variability in Antarctica using a deep learning algorithm

The discovery of causal structures behind a phenomenon under investigation has been at the heart of scientific inquiry since the beginning. Randomized control trials, the gold standard for causal analysis, may not always be feasible, such as in the domain of climate sciences. In the absence of interventional data, we are forced to depend only on observational data. This study demonstrates the application of a neural-network-based causal discovery algorithm that relies only on such data to identify the drivers of surface ozone variability in Antarctica. The analyses reveal the overarching influence of the stratosphere on the surface ozone variability in Antarctica, buttressed by the southern annular mode and tropospheric wave forcing in mid-latitudes. We find no significant and robust evidence for the influence of tropical teleconnection on the ground-level ozone in Antarctica. As the field of atmospheric science is now replete with a massive stock of observational data, both satellite and ground-based, this tool for automated causal structure discovery might prove to be invaluable for scientific investigation and flawless decision making.


Ubiquitous throughout the troposphere and stratosphere, ozone plays a significant role in atmospheric radiative forcing, atmospheric chemistry, and air quality. Considered an atmospheric cleanser, ozone in the stratosphere (90% of the total amount) protects life on Earth by filtering harmful UV radiation. Analyses of both satellite and ground-based measurements of total column ozone (TCO) indicate that stratospheric ozone has been on a downward trend across the globe, owing to a steady increase in anthropogenic emissions of reactive chlorofluorocarbons (CFCs) [1]. However, the stratospheric ozone hole is on a recovery path in response to the Montreal Protocol and its subsequent amendments [2][3][4]. Nonetheless, the precise causes of the observed changes in stratospheric ozone are complicated to isolate. They remain uncertain due to the inability of existing chemistry-climate models (CCMs) to reproduce the observations.

A variety of methods has been applied to date for the analysis of ground-level ozone, ranging from simple statistical models like multiple linear regression (MLR) to sophisticated chemistry-climate models (CCMs) such as GEOS-Chem [6]. However, these models face difficulty in dealing with complicated cause-effect relationships among meteorology and air pollutants. Multiple regression models are limited in their interpretability because they are based on cross-correlation, which might be highly biased due to autocorrelation effects or spurious correlations arising from an unaccounted third process or a common driver. In addition, they offer no insight into the directionality of relationships. Therefore, CCMs are used to investigate the impact of changes in emissions and meteorology through controlled perturbation of the system, allowing simulation results to be interpreted as causal effects forced by the interventions.
Nonetheless, the ability of CCMs to resolve essential processes such as land-biosphere interactions, stratosphere-troposphere transport (STT), and detailed atmospheric dynamics remains questionable, restricting their interpretability and conclusions [7].

Causality is a fundamental scientific notion and is indispensable for accurate forecasting, flawless explanation, and decision making. Discovering causal relations from observational data has drawn much attention recently, as the traditional way of causal analysis using interventions or randomized control trials might be impractical, infeasible, or outright unethical. For example, causal discovery methods relying solely on observational data have been used recently to study ocean-atmosphere interactions [8], the Walker circulation [9], and the mid-latitude winter circulation in the northern hemisphere [7]. With methodologies based on conditional independence tests, heuristic scoring, or deep learning, we can identify causal linkages in observational data using the premise that causes temporally precede their effects in time series. In this paper, we use one such causal model, based on a deep neural network, to discover the potential drivers of surface ozone variability over Antarctica. By using a carefully designed causal discovery algorithm, this method overcomes the pitfalls of common statistical approaches, i.e., spurious correlations arising from common drivers, autocorrelation, or indirect effects.

[...] Pole station recurring every year and have concentrations equivalent to or higher than those during the primary peak (JJA). In contrast, secondary peaks at all other stations are sporadic and rarely exceed those during the primary peak.
The occurrence of the secondary peak in Antarctica has been attributed to enhancement episodes due to NOx emission from the snowpack [13] and to photolysis of remote peroxyacetyl nitrate (PAN), formed above continental source regions, upon descent within the Antarctic region [30]. Notwithstanding, these peaks might also result from the transport of photochemically produced ozone in the planetary boundary layer (PBL) over the Antarctic plateau to other parts of Antarctica by the prevalent katabatic flow, apart from the direct transport of ozone-enriched air masses from the upper troposphere and lower stratosphere (UTLS).

Results
We identify the ozone enhancement events (OEEs) at all stations included in this study using the methodology adopted by Cristofanelli et al. (2018) [31]. The OEEs identified at all stations in Antarctica are shown in magenta in Fig. 1. To identify the OEEs, we first fit an annual sinusoidal curve (green curve in the top panel of Fig. 1) to the daily surface ozone dataset and then estimate a Gaussian probability density function (PDF) of the residuals from the sinusoidal fit (grey curve in the top panel of Fig. 1). We fit another Gaussian distribution to all points lying beyond one standard deviation (1σ) of this PDF. The intersection of these two PDFs (vertical dashed line in the bottom panel of Fig. 1 [...]

[...] propagating from the troposphere to the stratosphere control the variation in stratosphere-troposphere exchange (STE), which is in turn modulated by the meridional circulation driven by momentum imparted by the waves.

We now investigate the influence of descending air mass transport from the UTLS on OEE variability in Antarctica. Fig. 2 (b and c) shows the interannual variation in the frequency of OEEs during the spring and summer seasons (the seasons with the highest number of OEEs in a year). The frequency of OEEs during these two seasons had an increasing trend during the 1990s. However, this increasing trend has been disrupted in recent years and has turned into a decreasing trend during summer.

To confirm the discovery made by TCDF, we fit another MLR on surface ozone using only the discovered causal drivers; the resulting fit is shown in Fig. S3. It clearly shows that we can achieve the same fit (shown by R²) as before without using the non-causal variables, suggesting that non-causal variables are irrelevant for explaining ozone variability. We tried another MLR analysis (Figure S4 [...]

[...] TCDF. In contrast, they are the same across all stations as analysed using PCMCI.
These caveats might be handled better using a more advanced causal discovery algorithm, such as Amortized Causal Discovery [50], that leverages shared dynamics across different causal graphs and robustly deals with hidden confounders to discover causal links from time-series data.

In summary, we perform the causal analysis of surface ozone variability in Antarctica using a state-of-the-art causal discovery framework based on a deep temporal convolutional network. This framework avoids the drawbacks of common multivariate regression methods and generates a causal graph that is sparse and interpretable. The generated causal graphs were found to be consistent with existing knowledge. With exponential growth in the amount of observational data from both satellite and ground-based measurements, causal discovery methods might provide novel insights across various domains of atmospheric and climate sciences, which can aid knowledge discovery and guide robust policymaking.
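
The OEE identification steps described at the start of the Results (annual sinusoidal fit, Gaussian PDF of the residuals, second Gaussian for the outlying points, threshold at the intersection of the two PDFs) can be illustrated with a minimal Python sketch. It is one possible reading of those steps, not the code used in the study. The assumptions are: "beyond one" means more than one standard deviation above the residual mean, both Gaussians are normalised densities estimated from sample moments, and days whose residual exceeds the intersection are flagged. The function names and the synthetic series are hypothetical.

```python
import numpy as np

def fit_annual_sinusoid(day, ozone):
    """Least-squares fit of a mean plus one annual cosine/sine pair."""
    w = 2.0 * np.pi * day / 365.25
    A = np.column_stack([np.ones_like(w), np.cos(w), np.sin(w)])
    coef, *_ = np.linalg.lstsq(A, ozone, rcond=None)
    return A @ coef

def gaussian_intersection(mu1, s1, mu2, s2):
    """Point between mu1 and mu2 where two normal densities are equal."""
    # Equating the log-densities gives a quadratic a*x^2 + b*x + c = 0.
    a = 1.0 / (2.0 * s2**2) - 1.0 / (2.0 * s1**2)
    b = mu1 / s1**2 - mu2 / s2**2
    c = mu2**2 / (2.0 * s2**2) - mu1**2 / (2.0 * s1**2) + np.log(s2 / s1)
    roots = np.roots([a, b, c])
    roots = roots[np.isreal(roots)].real
    lo, hi = sorted((mu1, mu2))
    inside = roots[(roots > lo) & (roots < hi)]
    if inside.size == 0:
        raise ValueError("densities do not intersect between the two means")
    return float(inside[0])

def find_oee(day, ozone):
    """Flag days whose residual exceeds the PDF-intersection threshold."""
    residual = ozone - fit_annual_sinusoid(day, ozone)
    mu1, s1 = residual.mean(), residual.std()
    # Assumption: only the positive tail (beyond +1 sigma) is refitted.
    tail = residual[residual > mu1 + s1]
    mu2, s2 = tail.mean(), tail.std()
    threshold = gaussian_intersection(mu1, s1, mu2, s2)
    return residual > threshold, threshold

# Synthetic daily series (hypothetical values): seasonal cycle, noise,
# and one injected ten-day enhancement episode.
rng = np.random.default_rng(0)
day = np.arange(3 * 365, dtype=float)
ozone = 25.0 + 8.0 * np.cos(2.0 * np.pi * (day - 200.0) / 365.25)
ozone = ozone + rng.normal(0.0, 2.0, size=day.size)
ozone[500:510] += 12.0

flags, threshold = find_oee(day, ozone)
```

Because the sketch fits moments rather than a binned histogram, the threshold it returns can differ from one obtained by curve fitting to the empirical PDF.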

In this study, ground-based surface ozone measurements from five Antarctic stations, namely Arrival Heights, Marambio, Neumayer, South Pole, and Syowa, are used (see Table S1 for details [...]

Estimation of stratosphere-troposphere transport
We use the Lagrangian transport model HYSPLIT with meteorological data from the National Centers for Environmental Prediction (NCEP) (2.5° latitude-longitude grid) and the Global Data Assimilation System (GDAS) (1° latitude-longitude grid) to generate 15-day backward trajectories on a daily basis at 500 m above ground level (agl). Both meteorological datasets have been used widely in studies concerned with air mass transport in Antarctica, and they capture the meteorological variability in the Antarctic region reasonably well [10-13]. The generated backward trajectories are discretized onto a 1° x 1° latitude-longitude grid. After that, we use NCEP tropopause data to identify the trajectories coming from the stratosphere. Any trajectory with endpoints whose pressure is lower than the tropopause pressure of the associated grid cell is marked as being influenced by stratospheric transport and is counted over each month to estimate the monthly frequency of stratosphere-troposphere transport (STT).
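
The trajectory classification described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the processing chain used in the study: the array layout, the function name `monthly_stt_frequency`, and the month labels are hypothetical, the tropopause pressure is assumed to have been sampled already at the grid cell of every trajectory endpoint, and a trajectory is flagged as soon as any one of its endpoints lies above the tropopause.

```python
import numpy as np

def monthly_stt_frequency(traj_pressure, tropopause_pressure, month_ids):
    """Count stratosphere-influenced trajectories per month.

    traj_pressure:       (n_traj, n_points) endpoint pressures in hPa
    tropopause_pressure: (n_traj, n_points) tropopause pressure in hPa at the
                         grid cell of each endpoint (assumed pre-sampled)
    month_ids:           (n_traj,) month label of each trajectory start
    """
    traj_pressure = np.asarray(traj_pressure, dtype=float)
    tropopause_pressure = np.asarray(tropopause_pressure, dtype=float)
    month_ids = np.asarray(month_ids).ravel()

    # A pressure lower than the tropopause pressure means the endpoint
    # lies above the tropopause.
    above_tropopause = traj_pressure < tropopause_pressure
    # Assumption: one such endpoint is enough to flag the trajectory.
    is_stratospheric = above_tropopause.any(axis=1)

    # Sum the flagged trajectories within each distinct month label.
    labels, inverse = np.unique(month_ids, return_inverse=True)
    counts = np.bincount(
        inverse.ravel(),
        weights=is_stratospheric.astype(float),
        minlength=labels.size,
    ).astype(int)
    return is_stratospheric, labels, counts


# Toy input: three trajectories with two endpoints each.
pressure = [[600.0, 250.0], [700.0, 500.0], [300.0, 180.0]]
tropopause = [[300.0, 300.0], [300.0, 300.0], [250.0, 250.0]]
months = ["1999-01", "1999-01", "1999-09"]

flags, labels, counts = monthly_stt_frequency(pressure, tropopause, months)
```

In the toy input the first and third trajectories each have an endpoint above the tropopause, so one stratospheric trajectory is counted in each of the two months.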

The goal of causal discovery is to uncover causal relationships using observational data. Before finding causal relationships between distinct combinations of drivers at different time delays, causal discovery methods must overcome numerous hurdles posed by the causative process or by the sampling process generating the observational data. Because the cause typically occurs before the effect, utilising the concept of time aids in determining the directionality of a causal link. Causal links are the relationships found to be significant even after accounting for the influences of other drivers (observed or hidden) or autocorrelations.

Granger Causality
Testing time-lagged causal connections in the framework of Granger causality (GC) is a popular approach to causal discovery. A widely used way to identify the drivers of atmospheric ozone variability within a linear GC framework is to use an autoregressive model [14]. [...]

TCDF has a few hyperparameters, such as the number of epochs, number of hidden layers, kernel size, dilation coefficient, loss function, significance level for intervention loss, and learning rate. Here, we perform the causal discovery using TCDF due to its simplicity and its ability to deal with hidden confounders. Our use of the TCDF framework differs from that of Nauta et al. (2019) [27] in that the algorithm is not constrained to look for the cutoff attention score in the first half during the attention interpretation stage, as this would restrict the number of potential causes to just half of all time series included in the study. The causal graphs discovered by TCDF are compared with those derived using PCMCI to ascertain their robustness. We have also performed MLR and lasso regression to identify the inadequacies of these traditional statistical techniques for causal discovery.

We include various proxies (Fi) representing exogenous processes that drive the changes in surface ozone in Antarctica. As surface ozone has substantial seasonal variability, we include regression parameters expressed by a cosine and sine harmonic expansion utilising four harmonics in eq. 1, i.e., periods of 12 months (first harmonic), 6 months (second), 4 months (third), and 3 months (fourth). All data are rescaled to [-0.5, 0.5] before modeling. Since we are interested in determining the drivers of surface ozone variability, we also estimate the adjusted coefficient of determination (adjusted R²) for the MLR, as it measures the improvement in model fit when a parameter is added to the model [17].

We test for the stationarity of our datasets using the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test and the Augmented Dickey-Fuller (ADF) unit root test before performing causal discovery using PCMCI, and stationarise the dataset by first-order differencing as required. We take six months as the maximum time lag τmax to account for tropical teleconnections to the polar region, and the significance level α is taken as 0.05.

We estimate the average causal effect (ACE) following the framework of potential outcomes.
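
The rescaling to [-0.5, 0.5], the four-harmonic seasonal expansion, and the adjusted R² described above can be illustrated with a short Python sketch. It is a hedged example rather than the study's regression code: the function names, the synthetic monthly series, and the single `driver` proxy are hypothetical, and ordinary least squares with an intercept is assumed for the MLR.

```python
import numpy as np

def rescale(x):
    """Linearly rescale a series to the interval [-0.5, 0.5]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()) - 0.5

def harmonic_terms(t_months, n_harmonics=4):
    """Cosine/sine pairs with periods of 12, 6, 4 and 3 months."""
    t = np.asarray(t_months, dtype=float)
    cols = []
    for k in range(1, n_harmonics + 1):
        angle = 2.0 * np.pi * k * t / 12.0
        cols.extend([np.cos(angle), np.sin(angle)])
    return np.column_stack(cols)

def adjusted_r2(y, X):
    """OLS fit with intercept; returns the adjusted R^2 of the fit."""
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    n, p = len(y), X.shape[1]
    r2 = 1.0 - ss_res / ss_tot
    # Penalise the number of predictors p (intercept excluded).
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Synthetic monthly data (hypothetical): annual cycle, one driver, noise.
rng = np.random.default_rng(0)
t = np.arange(240)
driver = rng.normal(size=t.size)
ozone = (25.0 + 8.0 * np.cos(2.0 * np.pi * t / 12.0)
         + 2.0 * driver + rng.normal(scale=0.5, size=t.size))

y = rescale(ozone)
H = harmonic_terms(t)
r2_seasonal = adjusted_r2(y, H)
r2_full = adjusted_r2(y, np.column_stack([H, rescale(driver)]))
```

Comparing `r2_seasonal` with `r2_full` shows how much the fit improves when the proxy is added on top of the seasonal terms, which is the use of the adjusted R² described in the text.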

Estimation of causal effects
The causal effect is defined as the difference between two potential outcomes. Here, the first potential outcome concerns the treatment (intervention) group and the other the control group [9]. Analytically, ACE is defined as: [...] where do(X = x) represents an intervention that sets X to x.

As both potential outcomes cannot be observed simultaneously, the strong ignorability assumption is required to identify causal effects. Since the causal graph must be known (existence and absence of links) for causal effect estimation, we use the causal graphs discovered using TCDF in this study. Assuming the relationships to be linear with no interactions, the dependence of Y on X and confounders C can be expressed mathematically as: [...]

Here, we use a doubly robust nonparametric estimator based on the theory of influence functions, called the generalized augmented inverse probability weighted (gAIPW) estimator [28, 29]. We present the ACE along with its 95% confidence interval estimated using 500 bootstrap samples.

Inter-annual variation in the frequency of OEEs and the altitude of the associated back-trajectories. a) Time series of OEEs during spring. b) OEEs occurring during summer. The fraction of trajectories coming from the UTLS (crossing 500 hPa) corresponding to OEEs, simulated using c) GDAS and d) NCEP meteorological reanalyses, is also shown.

Causal graph for surface ozone at all four stations (Neumayer, Syowa, Arrival Heights and South Pole) considered in this study, generated using PCMCI at the 5% significance level. Here, the color of the nodes shows the autocorrelation, whereas the color of the detected links represents the conditional cross-correlation between the concerned nodes. The numbers in the middle of the detected links represent the detected lags between cause and effect.
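
The bootstrap confidence interval for the ACE mentioned under "Estimation of causal effects" can be illustrated with a minimal Python sketch. Under the linear, no-interaction assumption stated there, the effect of a unit change in X on Y equals the coefficient of X when Y is regressed on X and the confounders C, so the sketch uses ordinary least squares. It does not implement the gAIPW estimator used in the study, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def linear_ace(y, x, c):
    """Coefficient of x in an OLS regression of y on [1, x, c]."""
    A = np.column_stack([np.ones(len(y)), x, c])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1]

def bootstrap_ci(y, x, c, n_boot=500, level=0.95, seed=0):
    """Percentile bootstrap interval for the linear ACE estimate."""
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        estimates[b] = linear_ace(y[idx], x[idx], c[idx])
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(estimates, [tail, 100.0 - tail])
    return lower, upper

# Synthetic data (hypothetical): C confounds X and Y, and the true
# effect of X on Y is 0.7.
rng = np.random.default_rng(1)
c = rng.normal(size=500)
x = 0.8 * c + rng.normal(size=500)
y = 0.7 * x + 0.5 * c + rng.normal(scale=0.3, size=500)

ace = linear_ace(y, x, c)
ci_low, ci_high = bootstrap_ci(y, x, c)
```

The resampling here treats rows as independent. For autocorrelated time series a block bootstrap would be a more cautious choice, which is a design consideration beyond what the text specifies.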
