COVID data. We use a daily number of COVID confirmed cases for states (and union territories, UTs) of India from a crowdsourced platform (covid19india.org), which collates data from official state bulletins among other sources. The data on this platform is updated on a real-time basis with information from the official state press bulletins, official (Chief Minister, Health Minister) handles, Press Information Bureau, Press Trust of India, ANI reports. We use data on confirmed COVID cases from March 14 to June 9, 2020. The total number of confirmed COVID cases in India as of June 9th was 267,261 of which 28 major states (and UTs) had more than 100 cases and accounted for 99.9% of all cases (Fig. 1). We include data from Ladakh (a new UT) in the erstwhile state of Jammu and Kashmir in our analysis. We also used data on the number of tests conducted from the same source (i.e. covid19india.org).
Ambient PM2.5 exposure data. Inadequate in-situ data of ambient PM2.5 in India13 propels us to utilize satellite-based ambient PM2.5 exposure. We derive surface PM2.5 from satellite aerosol optical depth (AOD) product following our previous works19,20. For this study, we analyze AOD retrieved by Moderate Resolution Imaging Spectroradiometer (MODIS) using MAIAC (Multi-Angle Implementation of Atmospheric Correction) algorithm at 1 km 1 km resolution. We convert daily MAIAC AOD to PM2.5 using a dynamic scaling factor derived from MERRA–2 (Modern-Era Retrospective analysis for Research and Applications, version 2) reanalysis data and calibrate the satellite-PM2.5 using the coincident in-situ measurements of PM2.5 by a ground-based network that is maintained and quality-controlled by the Central Pollution Control Board (CPCB). We have used a percentile-based regression technique to develop region-specific correction factors and tuned the satellite-PM2.5 closer to the ground-based measurements20. Since the satellite-PM2.5 represents satellite overpass time (10 am to 2 pm), we have applied a diurnal scaling factor from MERRA–2 reanalysis data to estimate 24-hr PM2.5 from the retrieved PM2.5. Comparison of 24-hr satellite-derived calibrated PM2.5 with CPCB measurements (Fig. 2) reveals an excellent correlation (R = 0.98) and a reasonable root mean square error (24.2 g/m3)attesting to the quality of the satellite data product given the large spatial data gaps in ground-based observations.
Meteorological data. We analyze the temperature and specific humidity data from ERA5 reanalysis. ERA5 is an updated and improved version of the ERA-Interim datasets21. We use the hourly data at the lowest level averaged over the diurnal scale in our analysis. We note that the surface pressure can be lower than 1000 hPa in a high-altitude grid for which the ERA5 reanalysis extrapolates the values from the level above in the post-processing.
Modeling framework. The ambient PM2.5 exposure-COVID relationship has been examined by GAM, which uses daily data on the count of confirmed COVID cases, 24-hr average PM2.5 exposure and weather data on humidity and temperature for the 28 major states of India. We estimate daily state-level PM2.5 and meteorological parameters by averaging all the grids lying within the state boundary. For the grids that partially belong to a particular state, the area-weighted average is considered in Arc-GIS using the state-level shapefiles. The population-weighted daily exposure estimates allow us to minimize exposure misclassification as the urban areas (where the COVID cases are higher) have a larger population (and hence larger weight) than the rural areas in a state/union territory (UT).
The dependent variable—daily number of confirmed COVID cases is a count variable; hence, we use GAM with a quasi-Poisson family for over-dispersed data14,22,23. The median incubation period for COVID cases in China has been estimated to be 5.1 days, with 97.5% of cases developing symptoms within 11.5 days24,25. Thus, in this study, we model short-term exposure to ambient PM2.5 (lag 0–7, lag 0–14, lag 0–21) as a moving-average7,22. Our basic GAM is defined as follows:
In this model, gyit is a log link function of the expectation yit ≡EYit, where Yit is the series of the count of daily confirmed COVID cases in a state. t denotes the day of observation and α is the intercept. PMil represents the linear term for l+1 day moving average of PM2.5 in state i. Our main coefficient of interest is β which captures the association between short-term exposure to PM2.5 and COVID confirmed cases. The delayed effect of meteorological factors—temperature and humidity are also controlled for the same period.
To capture the possible non-linear relationship between the meteorological factors and COVID incidence, we use a natural cubic spline (ns ()) function with three degrees of freedom. The choice of degrees of freedom is based on previous studies3,7,26,27. yi,t–1 denotes previous day count of daily confirmed COVID cases in state i, to account for serial correlation in our data7,26,27. We also introduce state fixed effects, statei, to account for unobserved time-invariant characteristics of states like population density, availability of health infrastructure and total population. Lastly, time fixed effects, dayt, are added to the model to capture unobserved factors at the day level, they account for shocks like national lockdown, lifting of movement restrictions7. These fixed effects (state and time) help us in exploiting the temporal and spatial variation in our data.
Sensitivity analysis. We employ six additional sensitivity analysis to check for robustness of our results. First, we exclude the state of Maharashtra which accounts for the largest contribution (one-third) to the total cases in India (as of June 9, 2020) from our analysis to check whether the results hold. Second, we include total tests conducted in each state in our analysis to check whether our results are sensitive to this alternate specification. This variable (total tests conducted) captures how states are ramping up their efforts to detect COVID infections which in turn affects the total confirmed case count in each state. The data on total cases confirmed in each state is not available for every day and hence the inclusion of this variable in our model has reduced our analysis sample considerably. Third, we test for the robustness of our results by altering the choice of the analysis period. We do this by restricting our study only to the post lockdown period, that is from March 25 to June 9, 2020. Fourth, we test whether our findings are sensitive to an alternate non-linear specification for the exposure variable. We replace the linear PM2.5 variable with a quadratic B-spline23. The quadratic B-spline for the moving average PM2.5 is defined by 3 knots placed at 20 μg m–3, 60 μg m–3 and 100 μg m–3. All the spline basis variables are centered at 25 μg m–3 (which is the WHO air quality guideline for 24-hour PM2.5).
Our next sensitivity analysis is to test the impact of exposure misclassification (if any, due to the use of state-level population-weighted PM2.5 exposure) on our results. We could not carry out city-level analysis as the COVID cases are not reported systematically at city-level. Instead, we used the second-order geographical and administrative unit (i.e. district) as the unit of analysis. On April 30, 2020 Government of India through a circular (available at https://static.mygov.in/rest/s3fs-public/mygov_158831498053877021.pdf) categorized all districts of India (total number of districts in India is 733) into the red zone (hotspots), orange zone (incidence of cases, doubling rate is relatively lower than red zones) and green zone (no confirmed cases in the last 21 days). We note that even the district-level data on COVID confirmed cases is only available from April 21, 2020, and the data is not available for each day for every district (the reason for mainly relying on the state-level analysis). Out of 130 districts categorized as red zone districts, we conducted our analysis on 118 districts which have consistent data.
Our final robustness check comprises of conducting an alternate analysis which used meta-analysis following a two-stage process. In the first stage, state-specific PM2.5-COVIDrelationshipsare analyzed using a time series Poisson model which accounted for overdispersion in the data. The first stage model is similar to equation 1 described above, it used a linear control for exposure (moving average), a natural cubic spline with df = 2 for humidity, df = 3 for temperature, df = 4 for time and is also controlled for the previous day-case count to account for serial correlation. In the second stage, these individual state associations are combined using a random-effects meta-analysis to arrive at a national level estimate. The random-effects meta-analysis model was fitted using maximum likelihood estimation.
The analysis has been carried out in R (version 3.5.1) using glm command in the mgcv package. The meta-analysis was conducted using mvmeta command. The estimates of ambient PM2.5 exposure is expressed as relative risk (RR) with RR > 1 representing an increased risk due to unit increases in PM2.5 exposure by 10 μg m–3 and corresponding 95% confidence intervals have also been provided.