Estimating SARS-COV-2 Exposure Indoors in Delhi Given Outdoor Pollution Metrics Using Machine Learning

The Global Burden of Disease study published by The Lancet (Ritchie and Roser, 2013) states that one million deaths occurred from 1990 to 2017 due to air pollution. In 2018, the WHO estimated a death toll of 3.8 million due to indoor pollution (WHO, 2018). During the pandemic, it is essential for countries like India, with a huge population and high levels of pollution, to take strict measures to control pollution. The 2020 US Policy Report in The Lancet (2020) affirmed that there is a positive correlation between PM2.5 or PM10 particle concentration and COVID-19 infection, as the virus can piggyback on particulate matter. This case study is based on an Indian urban locality and aims to analyze and estimate the correlations between PM2.5 particles, the AQI, weather conditions and COVID-19 infections using machine learning models. The best-performing model is also used to predict the outdoor AQI and COVID-19 infection rates in the suburban localities of northwestern Delhi, and the data so obtained aid in calculating and extrapolating the indoor mortality probability due to COVID-19 infection in metropolitan cities of India such as Delhi.


Introduction
The Air Quality Index (AQI) of a place can be classified into five categories: Satisfactory, Moderate, Poor, Very Poor and Severe. Delhi has an AQI ranging between 400 and 600, which falls in the Very Poor to Severe categories, due to high particulate matter concentration and other detrimental gases such as carbon dioxide, sulfur oxides and oxides of nitrogen. PM2.5 refers to fine solid aerosols with a particle diameter of ≤ 2.5 µm that are found suspended in ambient air. PM2.5 in indoor environments is primarily derived from common outdoor sources such as motor vehicles, biomass burning (predominantly in rural areas), and industrial emissions (Nor et al. 2021; Su, W., Wu, X., Geng, X. et al. 2019; Nadzir, M. S. M. et al. 2020). This is because all forms of outdoor emissions have an impact on indoor environments, given the continuous flow of air. Prolonged exposure to PM2.5 can be detrimental to human health (Burnett et al. 2018), as this fine particulate matter can be easily inhaled and can penetrate deep into the lungs (Nor et al. 2021, Marcazzan 2001, Zhang 2015, HEI 2020). PM2.5 has a significantly longer lifetime in the air than respiratory liquid droplets and can remain suspended for an extended period. This longer lifetime may pose a significant viral exposure threat to a healthy person, especially in indoor environments (Marcazzan 2001). The fine particulate matter is easily propagated by tiny turbulent eddies in the air that arise from activities such as human movement and walking (Xing 2016, Zwoździak 2015).
In this paper, we use datasets obtained from Kaggle and the Open Government Data Platform of India to first visualise graphically the AQI trends of the past 5-6 years and the ranges of AQI observed in different months of the year. To reduce complexity, we restrict the analysis to a small locality of Ghaziabad district, in the National Capital Region of Delhi, the capital and one of the most polluted cities of India. The locality we consider is Indirapuram-Vasundhara in Ghaziabad, located between 28.64N, 77.37E and 28.66N, 77.38E. According to Indian newspapers and government data, this locality had the highest rate of COVID-19 infections in November and December 2020. This information is further validated using machine learning ensemble models, namely the Random Forest Regressor and the Gradient Boosting Regressor, which predict the possible AQI and weather conditions of the north-western Delhi localities with an accuracy of 80%. The predictions are visualised and correlated using the Pearson correlation coefficient and, based on the correlations, we calculate the change in the indoor-to-outdoor mortality rate ratio given the change in indoor particulate matter concentration caused by outdoor pollution. Together, these data give further insight into the indoor mortality probability due to COVID-19 infection, which is the purpose of the case study.

2. COVID-19 in India:
The Kaggle dataset had state-wise and district-wise details of the total number of coronavirus cases, tests carried out, positivity rate based on current population, and other metrics. We again reduce the data to the total number of cases reported daily in Delhi from June 2020 till June 2021.
3. Delhi Weather data: Obtained from Wunderground using their easy-to-use API, this dataset comprises temperature (average and min-max), humidity, precipitation, and other condition details of Delhi weather from 1990 till 2016. Further weather conditions and daily mean temperatures till 2020 have been obtained by scraping AccuWeather forecasts for Delhi.

Implementing Machine Learning:
We first use the AQI dataset to obtain the correlation between PM2.5 concentration and AQI, which is found to be 0.8 on average for the data from 2015 to 2020, signifying a strong correlation between the two. We train ensemble regressors, namely the Random Forest Regressor and the Gradient Boosting Regressor, on this data and obtain AQI predictions for the Indirapuram and Vasundhara locality based on its outdoor PM2.5 concentrations. Ensemble modeling is a process in which multiple diverse base models are used to predict an outcome. The motivation for using ensemble models is to reduce the generalization error of the prediction: the approach seeks the wisdom of crowds, while the ensemble acts and performs as a single model. Most practical data science applications utilize ensemble modeling techniques. Following Leo Breiman's work (Breiman 2001), random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. The generalization error for forests converges almost surely (a.s.) to a limit as the number of trees in the forest becomes large. Thus, using random forests for AQI prediction helps keep the predictions close to the actual AQI conditions of Indirapuram and Vasundhara.
The Gradient Boosting Machine (GBM) algorithm is used for supervised machine learning and produces an ensemble of weak learners (Garcia de Oliveira, 2019). The most widely used implementations of the GBM technique are LightGBM (Ke et al., 2017) and the XGBoost library (Chen and Guestrin, 2016). Despite being a collection of weak learners, a GBM can, with the help of hyperparameter tuning, outperform most ensemble models. Accordingly, of the two models chosen for prediction, the Gradient Boosting Regressor, which is based on the GBM algorithm, performs better than the Random Forest Regressor.
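The comparison of the two regressors can be illustrated with a minimal sketch using scikit-learn. The data below are synthetic and the linear PM2.5-to-AQI relationship, the noise level, and the hyperparameters are all assumptions made for illustration; they are not the study's dataset or settings, so the scores printed will not match the reported ones.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def make_synthetic_aqi(n_samples=500, seed=0):
    """Create illustrative PM2.5 and AQI values (NOT the study's data)."""
    rng = np.random.default_rng(seed)
    pm25 = rng.uniform(20.0, 400.0, size=n_samples)
    # Assumed relationship for illustration: AQI rises with PM2.5, plus noise.
    aqi = 1.2 * pm25 + rng.normal(0.0, 40.0, size=n_samples)
    return pm25.reshape(-1, 1), aqi


def compare_regressors(X, y, seed=0):
    """Fit both ensemble regressors and return their held-out R^2 scores."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    models = {
        "random_forest": RandomForestRegressor(n_estimators=200, random_state=seed),
        "gradient_boosting": GradientBoostingRegressor(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = r2_score(y_test, model.predict(X_test))
    return scores


X, y = make_synthetic_aqi()
scores = compare_regressors(X, y)
for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

Scoring on a held-out split rather than on the training data is what makes the R² value a measure of generalization, which is the quantity ensemble methods aim to improve.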
The models predict the AQI values with accuracies (based on the R² score) of 77.4% for the Random Forest Regressor and 80% for the Gradient Boosting Regressor. The AQI predictions were highest (mean value of 375 with a standard deviation of 25) in November and December 2020, indicating high PM2.5 concentration in these localities during that time of the year. Further, according to the Hindustan Times newspaper of India, Indirapuram and Vasundhara had the highest number of COVID-19 cases during November and December 2020. Although this does not indicate causality, a correlation between particulate matter and COVID-19 infections is evident, subject to external validation. To support our findings, research carried out by Nor, N.S.M., Yip, C.W, Ibrahim, N. et al. (2020) showed that particulate matter of diameter less than or equal to 2.5 µm could be a potential SARS-CoV-2 carrier. No correlation was found between the virus concentration and the diameter of the particulate matter (Marcazzan 2001). However, positive correlations between PM2.5 and other respiratory viruses such as the influenza virus have been reported previously, emphasizing the probability of particulate matter being a transport carrier for SARS-CoV-2 (Xing 2016).

Table-1. Machine Learning Models Used
The AQI dataset is further used to find the correlation between PM2.5 concentration and temperature and weather conditions. The temperature and weather conditions, i.e., humidity, for Delhi are obtained from the weather dataset specified earlier. We merge the AQI and weather datasets on their common dates and compute the correlation accordingly. For humidity, the correlation coefficient with PM2.5 is 0.076, and for temperature it is -0.41. Thus, temperature and humidity in Delhi have a significant negative and an insignificant positive correlation, respectively, with PM2.5. Temperature and humidity have been described as factors affecting the concentration of indoor particulate matter, with the influence of indoor and outdoor temperature being greater for offices and classrooms with a glass exterior wall, whereas relative humidity is the main factor for the rest of a building with a concrete wall structure. However, when analyzed in the Indian setup, i.e., Delhi, weather conditions and temperature had contradicting impacts on particulate matter concentration, implying that the correlations differ not only from building to building but also from background to background.
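The merge-on-date and Pearson correlation step described above can be sketched with pandas as follows. The two small tables, their column names (`date`, `pm25`, `temperature_c`, `humidity_pct`) and all values are hypothetical placeholders, not the study's data, so the resulting coefficients are purely illustrative.

```python
import pandas as pd

# Hypothetical miniature tables; column names and values are assumptions.
aqi = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-11-01", "2020-11-02", "2020-11-03", "2020-11-04", "2020-11-05"]
    ),
    "pm25": [310.0, 280.0, 350.0, 190.0, 240.0],
})
weather = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-11-02", "2020-11-03", "2020-11-04", "2020-11-05", "2020-11-06"]
    ),
    "temperature_c": [22.0, 20.0, 26.0, 24.0, 23.0],
    "humidity_pct": [55.0, 60.0, 48.0, 52.0, 50.0],
})


def pm25_weather_correlations(aqi_df, weather_df):
    """Keep only dates present in both tables, then correlate with PM2.5."""
    merged = aqi_df.merge(weather_df, on="date", how="inner")
    corr = merged[["pm25", "temperature_c", "humidity_pct"]].corr(method="pearson")
    # Return the PM2.5 column of the correlation matrix without the self-correlation.
    return merged, corr["pm25"].drop("pm25")


merged, pm25_corr = pm25_weather_correlations(aqi, weather)
print(f"{len(merged)} common dates")
print(pm25_corr)
```

An inner join is used so that only days observed in both datasets contribute to the coefficient; days missing from either source are dropped rather than imputed.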
This indicates the presence of other external factors, such as casual behaviour of citizens, a low testing rate, a slow vaccination drive, inadequate measures, and the lack of a strict lockdown and restrictions. Hence, although the correlation between COVID-19 infections and PM2.5 particles (or the predicted AQI) is strong and positive, the Pearson correlation coefficient is estimated to be 0.68 because of these cofactors. This is a question of external validity in research [31].
According to the Health Effects Institute's Report of 2019, particulate matter (PM) pollution was considered the third most important cause of death in 2017, with the rate being highest in India. Air pollution was considered to cause over 1.1 million premature deaths in 2017 in India (HEI 2019), of which 56% was due to exposure to outdoor PM2.5 concentration and 44% was attributed to indoor air pollution.
As per WHO (2016), one death out of nine in 2012 was attributed to air pollution, of which around three million deaths were solely due to outdoor air pollution. According to an article (Emily Henderson 2020), 1.67 million deaths occurred in India due to air pollution in 2019. This means that the mortality rate of India associated with PM2.5 exposure in 2019 was approximately 12.846 deaths per 10,000 people, taking a population of roughly 1.3 billion. Given the pandemic and the increasing pollution in India despite several efforts by the government, it is reasonable to assume that the mortality rate due to PM2.5 exposure has increased in the last two years.
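As a quick check of the 2019 mortality rate, the arithmetic can be written out. The population of roughly 1.3 billion is an assumption, since the text does not state the denominator used; it is chosen because it reproduces the quoted figure of 12.846.

```latex
\[
\frac{1.67 \times 10^{6}\ \text{deaths}}{1.3 \times 10^{9}\ \text{people}}
\approx 1.2846 \times 10^{-3}
\approx 12.846\ \text{deaths per } 10{,}000 \text{ people}
\]
```

Under this assumption, the value 12.846 corresponds to a rate per 10,000 people, which is about 1.28 deaths per 1,000 people.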
Beixi Jia et al. (2021) found that the estimated PM2.5-related mortality in India increased at an annual rate of 2.7% during 1998-2015. Further, the article states that aggressive air pollution control strategies should be adopted in North India because of its current health risks. Based on this assumption, we use the formula obtained in NCBI's Mortality due to Indoor PM2.5 exposure Report (Ji W 2015):
$$\Delta \log M_{in,j} = \Delta \log M_{all,j} \times \frac{\Delta C_{out-in}\, t_{in}}{\Delta C_{out}\, t_{out} + \Delta C_{out-in}\, t_{in}}$$
where $\Delta \log M_{all,j}$ is the increase in mortality due to the jth outcome associated with total PM exposure for each 10 μg/m³ increase in outdoor PM10 or PM2.5. Here j represents three major health outcomes: all-cause, cardiovascular, and respiratory mortality.
$\Delta C_{out}$ is the increase in outdoor PM10 or PM2.5 concentrations, which is set as 10 μg/m³.
$\Delta C_{out-in}$ is the increase in outdoor-originated PM10 or PM2.5 concentrations found in the indoor environment.
$t_{out}$ is the duration of direct exposure to outdoor PM pollution.
$t_{in}$ is the duration of indoor exposure to PM of outdoor origin.
$\Delta \log M_{in,j}$ estimates the increase in mortality due to the jth outcome associated with indoor exposure to outdoor-origin PM for each 10 μg/m³ increase in PM10 or PM2.5. Here, we use the PM2.5 concentration change explicitly.
Using this formula, we obtain a ratio of 3:7 between $\Delta \log M_{in,j}$ and $\Delta \log M_{all,j}$. This means that for an increase of 7 units in mortality due to the jth outcome associated with total PM exposure for each 10 μg/m³ increase in outdoor PM2.5, there is an increase of 3 units in mortality due to the jth outcome associated with indoor exposure to outdoor-origin PM. The calculations are carried out for a time span of 24 hours and a $\Delta C_{out-in}$ of 7.5 μg/m³ because, according to Leung Dennis Y.C (2015), approximately 75% of the variation in indoor air pollutant concentration is due to variation in outdoor air pollutant concentration. Previously, Douglas W. Dockery et al., based on a survey model, had estimated that the mean infiltration rate of outdoor fine particulates was approximately 70% and that full air conditioning of a building reduced the infiltration of outdoor fine particulates by about one half, while preventing dilution and purging of internally generated pollutants. However, for the Delhi suburban setup, we see that the infiltration rate, although within the 95-percentile spread of a normal distribution with a 70% mean, tends to be on the higher side due to the high rates of pollution in India. Further, according to Chun Chen et al. (2011), indoor/outdoor ratios vary considerably due to differences in size-dependent indoor particle emission rates, the geometry of the cracks in building envelopes, and the air exchange rates. Thus, it is difficult to draw uniform conclusions. For our case study, however, we find that the indoor environment is highly influenced by the outdoor ambience: when the outdoor PM2.5 concentration influences the indoor concentration by 75%, a 70% increase in mortality due to outdoor PM2.5 concentration corresponds to a 30% increase in mortality due to the increase in indoor PM2.5 concentration.
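The 3:7 ratio can be reproduced with a short worked sketch. It assumes the formula has the exposure-weighted form Δlog M_in,j / Δlog M_all,j = (ΔC_out-in · t_in) / (ΔC_out · t_out + ΔC_out-in · t_in), and it assumes the 24 hours are split equally into 12 hours outdoors and 12 hours indoors. The equal split is not stated in the text; it is an assumption chosen here because it yields the quoted ratio. The function name is hypothetical.

```python
from fractions import Fraction


def indoor_mortality_share(delta_c_out, delta_c_out_in, t_out, t_in):
    """Share of the total mortality increment attributed to indoor exposure
    to PM of outdoor origin, under the assumed exposure-weighted formula."""
    outdoor_exposure = Fraction(delta_c_out) * Fraction(t_out)
    indoor_exposure = Fraction(delta_c_out_in) * Fraction(t_in)
    return indoor_exposure / (outdoor_exposure + indoor_exposure)


share = indoor_mortality_share(
    delta_c_out=10,                  # outdoor increase, ug/m^3 (from the text)
    delta_c_out_in=Fraction(15, 2),  # 7.5 ug/m^3 indoors (75% of outdoor)
    t_out=12,                        # hours outdoors (assumed equal split)
    t_in=12,                         # hours indoors (assumed equal split)
)
print(share)  # 3/7
```

Because the share depends on the time split, a population that spends most of the day indoors would see a larger indoor share under the same formula; the function can be called with other values of `t_out` and `t_in` to explore this.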
Further, based on research published in PNAS (Z. Bazant 2021), we can quantify the concentration of pathogen C(r,t) suspended in droplets of radius r at 25 ℃, exhaled by an infected person in a room with a healthy person in the vicinity, as:
$$\frac{\partial C(r,t)}{\partial t} = \text{production rate from exhalation} - L_r$$
where $L_r$ is the loss rate of pathogens from ventilation, filtration, sedimentation, and deactivation. For SARS-CoV-2, Buonanno et al. (2020) estimated a $C_q$ range of 10.5 to 1,030 quanta/m³ based on the estimated infectivity $c_i$ = 0.01 to 0.1 of SARS-CoV-2 and the reported viral loads in sputum, although the precise value depends strongly on the infected person's respiratory activity. Here $C_q$ is the concentration of infection quanta exhaled by an infectious individual. Hence, air purification and ventilation, along with proper maintenance of the 6 ft rule, become very important even within households. When the indoor PM2.5 concentration increases, the probability of being infected by these pathogen-laden suspended droplets increases, given that the virus can use particulate matter as a carrier; this explains the increase in indoor mortality probability when the outdoor particulate matter concentration increases.
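The balance between production and loss can be sketched numerically for a well-mixed room. The sketch assumes that the loss term is first order in concentration (loss = λC, with λ lumping ventilation, filtration, sedimentation and deactivation) and that production is a constant rate P spread over the room volume V. The symbols λ, P and V and all numerical values are hypothetical and are not taken from the cited work.

```python
import math


def concentration(t_hours, production_per_hour, room_volume_m3, loss_rate_per_hour):
    """Analytic solution of dC/dt = P/V - lambda*C with C(0) = 0."""
    steady_state = production_per_hour / (room_volume_m3 * loss_rate_per_hour)
    return steady_state * (1.0 - math.exp(-loss_rate_per_hour * t_hours))


def euler_concentration(t_hours, production_per_hour, room_volume_m3,
                        loss_rate_per_hour, steps=10_000):
    """Forward-Euler integration of the same balance, as a numerical cross-check."""
    dt = t_hours / steps
    c = 0.0
    for _ in range(steps):
        c += dt * (production_per_hour / room_volume_m3 - loss_rate_per_hour * c)
    return c


# Hypothetical values chosen only for illustration.
P = 100.0       # quanta exhaled per hour
V = 50.0        # room volume in m^3
LAM_POOR = 0.5  # total loss rate per hour with poor ventilation
LAM_GOOD = 5.0  # total loss rate per hour with purification and ventilation

for lam in (LAM_POOR, LAM_GOOD):
    c_two_hours = concentration(2.0, P, V, lam)
    print(f"loss rate {lam}/h: C(2 h) = {c_two_hours:.3f} quanta/m^3, "
          f"steady state = {P / (V * lam):.3f} quanta/m^3")
```

Under these assumptions the steady-state concentration is P/(λV), so raising the loss rate through ventilation or filtration lowers the airborne concentration proportionally, which is the rationale for the purification measures discussed in the text.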

Results And Discussion
The outcomes of this research are intuitive as well as mathematical.
Outcome 1 COVID-19 infections have a positive and significant correlation with the PM2.5 concentration in the air, and PM2.5 concentration has a positive but insignificant correlation with humidity and a negative and comparatively significant correlation with the temperature of the locality, in the urban setting of northwestern Delhi. The correlation between PM2.5 concentration and COVID-19 infections is lower than expected because external factors were not controlled for (a limitation of external validity). AQI anomalously has a negative and negligible correlation with humidity. Weather conditions such as smoke, blowing sand, widespread dust and haze are associated with high PM2.5 concentrations and hence with a higher COVID-19 infection probability.
Outcome 2 Indoor air is affected by outdoor pollutant concentration: an increase of 70% in the natural logarithmic value of mortality due to outdoor PM2.5 exposure will cause an increase of 30% in the indoor natural logarithmic value of mortality due to exposure to PM2.5 of outdoor origin. The concentration of infection quanta exhaled by an infectious individual (one already having COVID-19) lies within a range of 10.5 to 1,030 quanta/m³, and hence aggressive air purification measures must be taken to reduce the concentration of exhaled infection quanta suspended in the air, which might otherwise aggravate infection rates.
In conclusion, air pollution is one of the most prominent concerns in today's world, and the government as well as the public should take precautions to protect themselves from its adverse effects both indoors and outdoors (Xing 2016, Zwoździak 2015, Hänninen 2005). If not contained, it can result not only in an increase in respiratory or thoracic diseases but also in fatality in co-morbid cases. Proper air purification systems and sufficient ventilation are a must in urban houses, and in rural areas biomass fuel consumption should be controlled to reduce indoor particulate matter pollution (Chakraborty 2014). Photocatalytic materials can be utilized for indoor air purification (Hoang Bui 2021), and several companies across the world are producing advanced purifiers to tackle viral transmission and mortality due to air pollution in urban cities. This paper has given a thorough understanding and analysis of air pollution and its effects on SARS-CoV-2 transmission rates, both indoors and outdoors. Further scope lies in using survey and experimental methods to analyze and compare the effects of indoor and outdoor pollution on human health. A more detailed quantification of the transmission rate and of the effect of being within 6 ft of an infected person indoors can also be carried out.