The methods describe briefly in 4 subsections.
Data Gathering And Management
Iran has 31 provinces (Wikipedia.org) and their shape files are available from ArcGIS Online (www.arcgis.com). The Carbon Monoxide (CO), Water vapor (H2O), Nitrogen Dioxide (NO2), Ozone (O3), Sulfur Dioxide (SO2), Absorbing Aerosol Index (AER), and Atmospheric Formaldehyde (HCHO) from Sentinel-5P: Near Real-Time (NRTI) Satellite are air pollution indices and Pressure, Total precipitation rate, Air temperature and Wind speed from NASA Global Land Data Assimilation System Version 2 (GLDAS-2) are weather indices of this study. They are downloaded with Google Earth Engine Data Catalog (https://developers.google.com/earth-engine/datasets). (Table A1_1 in Appendix A)
The analysis time period is: 1) comparing the first wave (from 2020-03-14 to 2020-04-20) with the previous year (from 2019-03-14 to 2019-04-20) and 2) comparing the second wave (from 2021-04-13 to 2021-04-25) with the previous year (from 2020-04-13 to 2020-04-25) of lock-down days in Iran. (Wikipedia.org) (Table A1_2 in Appendix A)
Unadjusted Effect: T-test And Wilcoxon Rank-sum Test
The summary statistics and statistical tests are independent sample t-test in two conditions for equal/unequal variance, F test for variance comparisons, Shapiro-Wilk Normality Test and nonparametric Wilcoxon Rank Sum test with R version 4.1.3 (Team 2022). The hypothesis is:
$$\left\{\begin{array}{c}{H}_{0}: {\mu }_{COVID19}={ \mu }_{Previous Year}\\ {H}_{1}:{\mu }_{COVID19}\ne { \mu }_{Previous Year}\end{array}\right. \left(1\right)$$
for each air pollution indices of each province in first and second lock-down waves. They are unadjusted sample averages because they don’t adjust with any covariates.
Adjusted Effects: Generalized Additive Mixed Model
The weather conditions are not the same in the two time period of analysis, therefore a recent study adjust the air pollution indices with meteorological variables and simple linear regression (Venter, Aunan et al. 2020). The limitation of this adjustment is 1) considering only the linear effects and 2) country-specific estimates. The Generalized Additive Mixed models (GAMM) can capture the nonlinear effects for example with smoothing spline, \(f(.)\), and it estimates the overall random effects for provinces. In this regard, the confounding variables are pressure, total precipitation rate, Air temperature, and Wind speed of the time-period of analysis. The model is:
The COVID Status covariate have two values (0- Previous Year and 1- COVID-19) and the provinces, \({U}_{i}\), enter the model as random effect. The model comparisons are with adjusted \({R}^{2}\), Akaike information criterion (AIC) and Bayesian information criterion (BIC). All calculations are with gamm() and s() functions in mgcv package with R (Wood 2006, Wood and Wood 2015, Harezlak, Ruppert et al. 2018)
Clustering: Functional Data Analysis
The functional clustering (FC) methods from Functional Data Analysis (FDA) toolbox (Ramsay, Silverman et al. 2005) are new and promising methods for air quality and meteorological time series studies such FC based on the Partitioning Around Medoids (PAM) algorithm for air quality monitoring network of NO2, PM10 and O3 and functional hidden dynamic geostatistical models in Italy (Ignaccolo, Ghigo et al. 2008, Maranzano, Otto et al. 2022), for climate zone identifications (Di Giuseppe, Jona Lasinio et al. 2013) and Functional Zoning (Ignaccolo, Ghigo et al. 2013). The FC methods are also widely used in environmental statistics such as: Yield Pattern Analysis (Pascucci, Carfora et al. 2018), river network data (Haggarty, Miller et al. 2015, Jamaludin 2017) and water quality functional spatial clustering with calibrating Ocean Colour data from MERIS, SeaWiFS and MODIS satellite data (Maritorena, d'Andon et al. 2010, Carlo, Paolo et al. 2017). The curves and time series observation have dependency to each other, and FDA methodology consider this structure with advanced statistical methods.
The air pollution indices ,\({X}_{ijwC}\left(t\right)\), are functional data with \({\{X}_{ijwC}\left(t\right), i=1,\dots ,7 , j=1,\dots ,31, w=\text{1,2} , C=\text{1,2} \},\) and \(i\) is air pollution indices, \(j\) is the province name, \(w\) is the wave (first and second waves) and \(C\) is the COVID-19 status (Previous year and COVID-19) (Ramsay, Silverman et al. 2005). Air pollution indices have different units and in order to enter them in the model, the preprocessing steps for \({X}_{ijwC}\left(t\right)\)are scaling and centering. Then, the functional average of all air pollution indices, \({\stackrel{-}{X}}_{jwC}\left(t\right)\), for each province, waves and covid status are:
Therefore, each province has four new functional variables including \({\stackrel{-}{X}}_{j11}\left(t\right)\) for the first wave and previous year, \({\stackrel{-}{X}}_{j12}\) for the first wave and COVID-19 year, \({\stackrel{-}{X}}_{j21}\left(t\right)\) for the second wave and previous year and \({\stackrel{-}{X}}_{j22}\left(t\right)\) for the second wave and COVID-19 of each province \(j=1,\dots ,31\). If \(w=1\), then \(t \in \left[\text{1,37}\right]\) and if \(w=2\), then \(t\in \left[\text{1,12}\right]\).
In the next step, the functional principal component analysis (FPCA) for each of \({\{\stackrel{-}{X}}_{j11}\left(t\right),{\stackrel{-}{X}}_{j12},{\stackrel{-}{X}}_{j21}\left(t\right),{\stackrel{-}{X}}_{j22}\left(t\right), for j=1,\dots ,31\}\) among all provinces decompose them and reduce their dimensions (Yao, Müller et al. 2005). The functional variables have some missing values at some observation time points and they are spare, and the principal components analysis through conditional expectation (PACE) approach (Yao, Müller et al. 2005) is FDA methodology to deal with them (Ramsay, Silverman et al. 2005). The summary statistics of \({\stackrel{-}{X}}_{jwC}\left(t\right)\)are:
$${E\stackrel{-}{X}}_{wC}\left(t\right)={\mu }_{wC} t \in T \left(4\right)$$
$${COV(\stackrel{-}{X}}_{wC}\left(s\right),{\stackrel{-}{X}}_{wC}\left(t\right))={G}_{wC}\left(s,t\right) t,s \in T (5)$$
The decomposition of the covariance is:
$$\left\{\begin{array}{c}{G}_{wC}\left(s,t\right)=\sum _{k}{\lambda }_{kwC}{\varphi }_{kwC}\left(s\right){\varphi }_{kwC}\left(t\right) \\ {\varphi }_{kwC}\left(.\right):eigenfunctions \\ {\lambda }_{kwC} :eigenvalues \\ t,s \in T\end{array} \left(6\right)\right.$$
Now, the \(j\)th random curve is approximately equivalent to the \({\stackrel{-}{X}}_{jwC}\left(t\right)={\mu }_{wC}\left(t\right)+ \sum _{k}{\xi }_{jkwC}{\varphi }_{kwC}\left(t\right)\)where \({\xi }_{jkwC}\) are uncorrelated random variables(Yao, Müller et al. 2005). The \(k\) is estimated with the 95% fraction of variance explained (FVE). Finally, the clustering procedure is clustering the \({\widehat{\xi }}_{jkwC}\) with hierarchal clustering algorithm and EM Clustering with hclust() from stats package and FClust() from fdapace R package (Carroll, Gajardo et al. 2020). The result plot is \({\widehat{\xi }}_{jk11} vs {\widehat{\xi }}_{jk12}\) and \({\widehat{\xi }}_{jk21} vs {\widehat{\xi }}_{jk22}\).