## 2.2 Statistical Analysis

To compare the observed PM2.5 data with the WHO 24-hour average guidelines, the hourly PM2.5 concentrations were converted to 24-hour averages using the moving average method. For hourly time series such as PM2.5 monitoring data, the 24-hour moving average can be stored at any hour within its 24-hour input window. Three positions are generally used: the first, the middle and the last hour of the window; for example, the openair package of R for air quality analysis offers these three options. We therefore examined the similarity between the hourly observed PM2.5 time series and the 24-hour moving average PM2.5 time series recorded at the first, middle and last hour. Hereafter, these three series are referred to as the leftmost, center and rightmost moving average PM2.5 time series, respectively. The cross-correlation function (CCF) and the Euclidean distance were used to analyze the similarity between each of these three series and the hourly PM2.5 time series: the CCF measures similarity in the shape of the fluctuations, and the Euclidean distance measures the distance between the two time series.
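The three storage positions can be illustrated with a minimal Python sketch. This is not the R code used in the study; the function name `moving_average`, its parameters, and the choice of "middle" hour for an even window (the later of the two central hours) are assumptions made for illustration, and the input is assumed to be evenly spaced with no missing values.

```python
from statistics import fmean
from typing import List, Optional, Sequence


def moving_average(hourly: Sequence[float], window: int = 24,
                   align: str = "right") -> List[Optional[float]]:
    """Moving average stored at the first, middle or last hour of each window.

    Hypothetical helper for illustration. Hours that never receive a
    full-window average are left as None.
    """
    offsets = {
        "left": 0,              # leftmost: store at the first hour of the window
        "center": window // 2,  # center: assumption for even windows (later middle hour)
        "right": window - 1,    # rightmost: store at the last hour of the window
    }
    if align not in offsets:
        raise ValueError("align must be 'left', 'center' or 'right'")
    offset = offsets[align]

    out: List[Optional[float]] = [None] * len(hourly)
    # Slide the window over the series; each full window yields one average.
    for start in range(len(hourly) - window + 1):
        out[start + offset] = fmean(hourly[start:start + window])
    return out
```

With a 3-hour window on the values 1 to 5, for example, the same three averages (2, 3, 4) appear at positions 0 to 2 for `"left"`, 1 to 3 for `"center"` and 2 to 4 for `"right"`, which is exactly the time shift between the leftmost, center and rightmost series.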

CCF analysis has been used, for example, to study the association between confirmed COVID-19 cases and meteorological variation [26] and to examine the relationship between El Niño-Southern Oscillation (ENSO) variability, represented by the Southern Oscillation Index (SOI), and the associated time series of the number of new fish [27]. The CCF is used to investigate the lead-lag relationship between two time series at different time points and can determine the optimal time shift between them [27, 28]. Correlation coefficients of 1 and −1 indicate perfect relationships in the same and opposite directions, respectively. Ignoring its sign, the strength of a relationship can be described roughly from the absolute value of the coefficient as follows: almost negligible (< 0.2), small (0.2–0.4), substantial (0.4–0.7), marked (0.7–0.9) and very dependable (0.9–1.0) [29]. The CCF described by Shumway and Stoffer [27] is calculated as

$${\gamma }_{XY}\left(k\right)=\frac{{C}_{XY}\left(k\right)}{{S}_{X}{S}_{Y}}, \tag{1}$$

where \({C}_{XY}\left(k\right)\) is the cross-covariance function at lag *k*, and \({S}_{X}\) and \({S}_{Y}\) are the sample standard deviations of time series *X* and *Y*, respectively. The cross-covariance function is given by

$${C}_{XY}\left(k\right)=\frac{1}{N}\sum _{t=1}^{N-k}\left({x}_{t}-\bar{x}\right)\left({y}_{t+k}-\bar{y}\right),\quad k=0,1,\dots ,N-1 \tag{2}$$

and

$${C}_{XY}\left(k\right)=\frac{1}{N}\sum _{t=1-k}^{N}\left({x}_{t}-\bar{x}\right)\left({y}_{t+k}-\bar{y}\right),\quad k=-1,-2,\dots ,-\left(N-1\right), \tag{3}$$

where \(\bar{x}\) and \(\bar{y}\) are the sample means of the two time series and *k* is the lag (in time points), which is usually much smaller than the number of time points *N* in the sample time series.
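The CCF defined above can be sketched in plain Python as follows. The function names and the `max_lag` parameter are hypothetical, and the sketch is an illustration rather than the R routine used in the study. It assumes that the standard deviations are computed with divisor *N*, i.e. as the square root of the lag-zero auto-covariance, which keeps the coefficient within [−1, 1]; a divisor of *N* − 1 would rescale the values slightly.

```python
from math import sqrt
from typing import Sequence


def cross_covariance(x: Sequence[float], y: Sequence[float], k: int) -> float:
    """Sample cross-covariance C_XY(k) with divisor N (0-based indexing)."""
    n = len(x)
    if len(y) != n:
        raise ValueError("x and y must have the same length")
    if abs(k) >= n:
        raise ValueError("lag k must satisfy |k| < N")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Non-negative lags pair x[t] with the later value y[t + k];
    # negative lags start at t = -k so that the index t + k stays valid.
    t_range = range(0, n - k) if k >= 0 else range(-k, n)
    return sum((x[t] - x_bar) * (y[t + k] - y_bar) for t in t_range) / n


def ccf(x: Sequence[float], y: Sequence[float], k: int) -> float:
    """Cross-correlation at lag k. Assumes neither series is constant."""
    s_x = sqrt(cross_covariance(x, x, 0))  # assumption: divisor N, not N - 1
    s_y = sqrt(cross_covariance(y, y, 0))
    return cross_covariance(x, y, k) / (s_x * s_y)


def best_lag(x: Sequence[float], y: Sequence[float], max_lag: int) -> int:
    """Lag in [-max_lag, max_lag] with the highest correlation coefficient."""
    return max(range(-max_lag, max_lag + 1), key=lambda k: ccf(x, y, k))
```

Under this sign convention, a maximum at a positive lag *k* means that the second series follows the first by *k* time points, and a maximum at lag zero means that the two series are aligned.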

When two data sets have a very dependable positive relationship, their temporal variations are very similar. We examined the relationship between each 24-hour moving average PM2.5 time series and the corresponding hourly time series to reveal the lead-lag correlations over 72 time points (hours). The time point with the highest positive correlation coefficient is where the shapes of the two time series are most similar. A 24-hour moving average time series that represents the hourly time series well therefore has a high correlation coefficient at a short lead or lag time, and a maximum correlation at time point zero means that there is no lead or lag. This kind of similarity preserves shape but does not capture magnitude: a difference in magnitude (vertical shift) may still exist between the two time series.

Two time series can also be compared using distance-based similarity measures, e.g., Euclidean distance, dynamic time warping (DTW), and others [21, 23, 30–32]. Euclidean distance is a point-to-point measure, whereas DTW is a point-to-many measure. Both concepts are illustrated graphically by Serrà and Arcos [32] and Cassisi et al. [23]. This study used the point-to-point distance because we considered coincident events in the 24-hour moving average and hourly observed PM2.5 time series. The similarity represented by the Euclidean distance [21, 33] is calculated as

$${E}_{D}=\sqrt{\sum _{i=1}^{N}{\left({x}_{i}-{y}_{i}\right)}^{2}}. \tag{4}$$
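As a minimal sketch of the point-to-point distance, the Python function below compares two equally long series position by position. The function name is hypothetical, and skipping positions where either series has no value (for example, hours at the edges of the record that have no full 24-hour window) is an assumption made for illustration, not a step stated in the text.

```python
from math import sqrt
from typing import Optional, Sequence


def euclidean_distance(x: Sequence[Optional[float]],
                       y: Sequence[Optional[float]]) -> float:
    """Point-to-point Euclidean distance between two aligned time series."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    # Assumption: keep only the time points where both series have a value.
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return sqrt(sum((a - b) ** 2 for a, b in pairs))
```

Because each value is compared only with the value at the same time point, a time shift between otherwise identical series increases the distance, which is why this measure suits the comparison of coincident events.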

A smaller distance, which reflects a smaller vertical shift, indicates greater similarity between the two time series. The Euclidean distance and CCF analyses were therefore used together to evaluate how well the three types of 24-hour moving average PM2.5 time series represent the hourly PM2.5 variation.

Next, we compared the 24-hour moving average PM2.5 time series with the 24-hour average PM2.5 values recommended in the WHO air quality guidelines. The 24-hour moving average PM2.5 data were binned by hour of day (0.00, 1.00, ..., 23.00). For each hour, the frequencies of concentrations falling within the air quality guideline (AQG) level, within interim targets (IT) 4, 3, 2 and 1, and above IT1 were calculated as

$${F}_{Th,Hr}=\frac{{n}_{Th,Hr}}{{N}_{Hr}}\times 100, \tag{5}$$

where \({F}_{Th,Hr}\) is the frequency (%) of concentrations falling in each threshold (*Th*) range (AQG ≤ 15 µg m⁻³, 15 µg m⁻³ < IT4 ≤ 25 µg m⁻³, 25 µg m⁻³ < IT3 ≤ 37.5 µg m⁻³, 37.5 µg m⁻³ < IT2 ≤ 50 µg m⁻³, 50 µg m⁻³ < IT1 ≤ 75 µg m⁻³, and > 75 µg m⁻³) for an hour (*Hr*) of 0.00, 1.00, ..., 23.00; \({N}_{Hr}\) is the total number of concentration values in hour *Hr*; and \({n}_{Th,Hr}\) is the number of concentration values within threshold range *Th* in hour *Hr*. For any given hour, the \({F}_{Th,Hr}\) values sum to 100 over all threshold ranges. Visualizing all \({F}_{Th,Hr}\) values reveals the diurnal variation in the contributions of the AQG and interim target ranges. All analyses described above were performed using R statistical software and related packages.
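The hourly frequency calculation can be sketched in Python as shown below. This is an illustration under stated assumptions, not the R code used in the study: the function names, the label strings (including `">IT1"` for values above 75 µg m⁻³) and the input format of `(hour, concentration)` pairs are all hypothetical.

```python
from bisect import bisect_left
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple

# Upper bounds (µg m-3) of the threshold ranges listed in the text.
UPPER_BOUNDS = [15.0, 25.0, 37.5, 50.0, 75.0]
LABELS = ["AQG", "IT4", "IT3", "IT2", "IT1", ">IT1"]  # hypothetical label names


def classify(conc: float) -> str:
    """Assign a concentration to its range; upper bounds are inclusive."""
    # bisect_left returns the index of the first bound >= conc,
    # so each range has the form (lower, upper].
    return LABELS[bisect_left(UPPER_BOUNDS, conc)]


def threshold_frequencies(
    records: Iterable[Tuple[int, Optional[float]]]
) -> Dict[int, Dict[str, float]]:
    """Percentage of values in each threshold range, per hour of day."""
    counts: Dict[int, Counter] = {}
    for hour, conc in records:
        if conc is None:  # assumption: hours without an average are skipped
            continue
        counts.setdefault(hour, Counter())[classify(conc)] += 1

    freqs: Dict[int, Dict[str, float]] = {}
    for hour, counter in counts.items():
        total = sum(counter.values())  # N for this hour
        freqs[hour] = {label: 100.0 * counter[label] / total for label in LABELS}
    return freqs
```

Each inner dictionary holds one percentage per threshold range for a given hour and sums to 100, so plotting the 24 dictionaries side by side gives the diurnal profile described above.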