Identifying Hotspots in the Distribution of Human Infectious Diseases Using a Bayesian Framework: A Lead to Drivers, Prevention, and Surveillance of Disease Emergence

Background The ongoing COVID-19 pandemic underscores the need for surveillance systems that detect threats and regions at high risk from emerging infectious diseases (EIDs). Because human-driven perturbations act on the human-animal-pathogen interface at an ecological scale, integrating these environmental drivers is essential. We propose robust mathematical models to map, detect, and identify significant drivers of EID outbreaks for three viral EID groups: Filoviridae, Coronaviridae, and Henipaviruses.

Methods We modeled the presence-absence data using spatially explicit binomial and zero-inflation binomial (ZIB) logistic regression, with and without spatial autoregression (iCAR). The presence data for the EIDs were extracted from WHO archives, ProMED-mail and published studies, and we generated pseudo-absences within the spatial distribution of the mammalian reservoirs. Various environmental and demographic rasters were used to explain the distribution of EIDs. The True Skill Statistic and deviance parameters were used to compare the accuracy of the different models.

Results Our results show that the residual or night-time temperature, known as the minimum temperature, has a direct influence on the distribution of most EID outbreaks analyzed. An increase in minimum temperature was found to be an important driver for filoviruses and the other diseases analyzed in the study. Recent research has shown that increased surface temperature and unpredictable seasonal rainfall due to climate change have an indirect effect on disease emergence through sudden changes to reservoir habitats, loss of biodiversity and migration of small mammals (29,30). Minimum temperature is the limiting factor for parasite development and vector distribution in malaria transmission and in other vector-borne disease outbreaks such as Zika. Research on the role of temperature outside vector-borne diseases is limited. This direct spatial dependence of disease emergence on minimum temperatures is worrying. With climate change, increasing night-time minimum temperatures are lengthening the frost-free season in most mid- and high-latitude regions, potentially increasing the latitudinal extent of disease emergence. We found that low altitude and high rainfall have a significant influence on the distribution of Henipavirus outbreaks. Studies have hypothesized that the emergence of Nipah in the lower Gangetic plains and low-lying marshes could be attributed to flooding, which results in the destruction of mammalian habitats.


Introduction
The shift in the geographical footprint of pathogens and/or infected hosts due to ecosystem disruption leads to emerging infectious diseases (EIDs) (1), of which COVID-19 is a current example at the center of international attention. Infectious diseases of animal origin, or zoonoses, account for more than 70% of the infectious diseases that have emerged in recent decades (2-4). With the onset of SARS-CoV-2, anticipating the emergence of new pathogens has become the major public health challenge of our time. Disease emergence is closely linked to anthropogenic changes that disrupt the human-animal-environment interface (5-7). The increasing unpredictability of the global climate and the decreasing distance between humans, animals and ecosystems at the local scale play a major role in the emergence of infections in human populations. Hence, there is an urgent need to place the One Health approach at the center of infectious disease epidemiology and surveillance programmes.
One such cutting-edge programme, PREDICT-2, USAID's latest Emerging Pandemic Threats funding programme, ended just weeks before the COVID-19 outbreak (8). Over the years, these active surveillance programmes have become less attractive to stakeholders because their results are only realized over a long period of time, which leads to pressure for accountability, a lack of immediate action, and uncertainty about the overall impact. However, with pathogens emerging, evolving and re-emerging at alarming rates (4,9), preventing outbreaks in the first place through surveillance based on robust mathematical models can be achieved at lower economic and social cost, whereas producing vaccines or new drugs for each individual emerging pathogen is not viable. Mathematical models used to predict disease risks are based on assumptions and are not error-free. Nevertheless, they provide the best available 'reasonable basis for action' (10) for the prevention of EIDs. To identify hotspots of disease emergence, it is essential to assess the influence of possible emergence factors and to have detailed spatial and quantitative information on these factors. Species Distribution Models (SDMs) can not only predict disease occurrence, but also quantify associations between a disease and its drivers and predict future outbreaks (11). SDMs are often used to understand the causal process behind spatial distributions, relying on regression to identify correlations with bioclimatic factors. Highly complex machine learning techniques, such as maximum entropy (MaxENT), can be adapted to SDMs (12). Much of the spatial research relies on complex SDMs such as MaxENT (13) and boosted regression trees (BRTs), which are best suited to presence-only data. Presence data are often subject to sampling and spatial biases due to heterogeneous reporting rates (14). Spatial restrictions or clustering due to environmental effects should, however, not be neglected, as they help explain the causal process.
Recent work has shown that applying a spatial Bayesian framework to SDMs produces more accurate results when dealing with limited and clumped data and takes random effects into account, thus providing better estimates of the factors influencing risk (15,16). Hierarchical Bayesian SDMs allow observations to be interpreted as the result of several ecological processes such as bioclimatic factors, spatial dependence, and anthropogenic disturbances. By using an adaptive Metropolis algorithm within a Gibbs sampler to compute the posterior distribution, these Bayesian SDMs improve optimization efficiency and reduce computation time. Here, we propose to build reasonably accurate and robust mathematical models to map and compare the predictive risk of three groups of viral infectious diseases transmissible to humans that are under scrutiny: Filoviridae, including Ebola and Marburg viral diseases (EVD & MVD); Coronaviridae, such as SARS, MERS and COVID-19; and Henipaviruses (Nipah & Hendra diseases) of the Paramyxoviridae family. We also explore the potential hotspots and quantify the significance of environmental factors using SDMs in a Bayesian framework. The aim is to provide a biogeographic perspective, measure predictive risks, identify drivers, and compare Bayesian SDMs in predicting the emergence of the viral diseases studied.

Methods
Using a Bayesian framework, we modeled the presence-absence data using a two-stage, spatially explicit, hierarchical logistic regression (17). First, we modeled the potential presence of an EID in each grid cell as a function of local bioclimatic and population density variables, using disease-level coefficients and a spatial random effect. Then, we assumed that the frequency of presence/absence data at each site follows a binomial distribution, with the sampling intensity varying from site to site. Once the models were fitted, we compared them based on parameter summaries and model deviance.
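The two-stage structure described above can be written compactly. The following is a sketch in the notation commonly used for zero-inflated binomial models with an iCAR term; the symbols are illustrative assumptions rather than notation from this manuscript. Here $z_i$ is the latent suitability of site $i$, $y_i$ the number of presences recorded in $t_i$ observations, $X_i$ and $W_i$ the covariates of the suitability and observability processes, $\rho_{j(i)}$ the spatial random effect of the grid cell $j$ containing site $i$, $\mu_j$ the mean of $\rho$ over the neighbours of cell $j$, and $n_j$ the number of those neighbours.

```latex
\begin{align}
z_i &\sim \mathrm{Bernoulli}(\theta_i), &
\operatorname{logit}(\theta_i) &= X_i \beta + \rho_{j(i)} \\
y_i \mid z_i &\sim \mathrm{Binomial}(t_i,\; z_i \delta_i), &
\operatorname{logit}(\delta_i) &= W_i \gamma \\
\rho_j \mid \rho_{j' \neq j} &\sim \mathrm{Normal}\!\left(\mu_j,\; V_\rho / n_j\right)
\end{align}
```

Under these assumptions, dropping the $\rho_{j(i)}$ term gives the non-spatial ZIB model, and fixing $z_i = 1$ for all sites gives the plain binomial model.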

Presence-absence data
When considering presence-absence data for SDMs, the most common biases arise from the assumptions of perfect detection and stationary hosts. Disease occurrence depends on the spatial distribution of the disease reservoir and intermediate hosts. We used zero-inflation binomial models (18) to account for imperfect detection of occurrence. Autocorrelation and non-stationarity of the mammalian hosts were taken into account using intrinsic conditional autoregressive (iCAR) models to avoid overestimation in spatial inference and prediction (16). We extracted distributional data on the global occurrence of Filoviridae, Coronaviridae and Henipavirus human disease outbreaks over time from WHO archives, ProMED-mail and published studies (Supplementary Table 1). In cases where the origin of an outbreak was unclear, we narrowed it down to the general region or district of origin. Laboratory outbreaks, outbreaks leading to asymptomatic disease (Reston Ebola disease in the Philippines), and outbreaks in domestic animals (Hendra outbreaks in horses) and wildlife (Ebola in gorilla populations) were excluded. We also excluded the recent SARS-CoV-2 outbreak, as the origin of the infection remains controversial; the analysis with the coordinates of the assumed origin of SARS-CoV-2 is included in the Supplementary material. We geo-referenced the sites of origin and constructed spatial buffers of 10 km around the geographical coordinates. For each group of viruses, we generated 500 random spatial points within the spatial buffers to constitute the presence points. Pseudo-absences were randomly generated within the spatial extents of the reservoirs and intermediate hosts of each virus group in a 1:2 ratio, giving 1000 absence points.
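The presence-point step above can be illustrated with a minimal Python sketch. It is not the authors' code (their R code is in the repository cited under Declarations): the origin coordinates, the function names and the spherical-Earth approximation are all assumptions made for illustration, and pseudo-absence sampling is reduced to its 1:2 count because it requires the host range polygons.

```python
import math
import random

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius (spherical approximation)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def random_point_in_buffer(lat, lon, radius_km, rng):
    """Draw one point uniformly by area from a disc around (lat, lon)."""
    # The square root makes the draw uniform over area, not over radius.
    dist = radius_km * math.sqrt(rng.random())
    bearing = rng.uniform(0.0, 2.0 * math.pi)
    ang = dist / EARTH_RADIUS_KM
    phi1, lam1 = math.radians(lat), math.radians(lon)
    # Destination point on a sphere given start, bearing and distance.
    phi2 = math.asin(math.sin(phi1) * math.cos(ang)
                     + math.cos(phi1) * math.sin(ang) * math.cos(bearing))
    lam2 = lam1 + math.atan2(
        math.sin(bearing) * math.sin(ang) * math.cos(phi1),
        math.cos(ang) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), math.degrees(lam2)


def sample_presence_points(origins, n_points, radius_km=10.0, seed=0):
    """Sample n_points, each inside the buffer of a randomly chosen origin."""
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        lat, lon = rng.choice(origins)
        points.append(random_point_in_buffer(lat, lon, radius_km, rng))
    return points


# Hypothetical outbreak origins (latitude, longitude), for illustration only.
origins = [(0.35, 32.58), (-4.32, 15.31)]
presence = sample_presence_points(origins, n_points=500)

# 1:2 presence to pseudo-absence ratio, as described in the text.
n_pseudo_absence = 2 * len(presence)
```

Choosing an origin at random for each point is one possible design; sampling a fixed number of points per buffer would be an equally reasonable reading of the text.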

Bioclimatic and population predictor datasets
We extracted climatic and elevation covariates such as monthly maximum and minimum temperatures, rainfall and elevation from global Bioclim data (19) at a spatial resolution of 2.5 arc-minutes, or about 4.5 km at the equator. We used Moderate Resolution Imaging Spectroradiometer (MODIS) Land Cover Type (MCD12Q1) data (20) and human-induced land use change from the Global Human Modification of Terrestrial Systems data set (21). Finally, we used the Gridded Population of the World, Version 4 (GPWv4) for the human population density raster (22). The raster layers were resampled to a fixed resolution of 4.5 km and stacked into a raster brick. We obtained the geographical distribution and spatial extent of the primary hosts and reservoir mammals from the IUCN Red List (23); the mammals are listed in Supplementary Table 2.
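As a quick check on the grid resolution quoted above, the arc-minute figure can be converted to kilometres. This is a minimal sketch assuming a spherical Earth with the WGS84 equatorial radius; the function name is hypothetical. It gives roughly 4.6 km at the equator, in line with the "about 4.5 km" stated in the text, and shows how the east-west cell width shrinks with latitude.

```python
import math

EQUATORIAL_RADIUS_KM = 6378.137  # WGS84 equatorial radius


def cell_width_km(arc_minutes, latitude_deg=0.0):
    """Approximate east-west width of a grid cell at a given latitude."""
    angle_rad = math.radians(arc_minutes / 60.0)
    return angle_rad * EQUATORIAL_RADIUS_KM * math.cos(math.radians(latitude_deg))


equator_km = cell_width_km(2.5)
lat60_km = cell_width_km(2.5, latitude_deg=60.0)
print(f"{equator_km:.2f} km at the equator, {lat60_km:.2f} km at 60 degrees")
```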

Model fitting and model prediction
The models were fitted using the R package "hSDM", which uses a hierarchical Bayesian approach that incorporates spatial dependency into the analysis by accounting for geographical clumping, which can be explained by biological (reservoir and host movement) or bioclimatic variables. In our study, we analyzed the data using binomial and ZIB hierarchical SDMs with and without spatial autoregression. The ZIB models combine a Binomial process for observability and a Bernoulli process for habitat suitability (24,25). To model spatial autocorrelation, we used SDMs with a Gaussian intrinsic conditional autoregressive (iCAR) term (26). The models were fitted in a Bayesian framework that allows the use of pre-validated predictors and the estimation of parameter uncertainties. A mixture of topographical, climatic, landscape, and human-dependent predictors was used. The effect of a predictor was considered significant if the 95% credible interval of its posterior distribution did not include zero. We used non-informative priors with a large variance of 10e6 (mean = 0), except for the spatial random effects, for which a weakly informative prior, Uniform (min = 0, max = 10), was used. Two parallel MCMC chains were run for each parameter, and the convergence of the chains was checked visually using traceplots (Supplementary information) and with Gelman and Rubin's convergence diagnostic. The high-risk areas or hotspots for each viral EID group were predicted using a maximum sensitivity + specificity threshold selection, and the accuracy of the model was determined by the True Skill Statistic (TSS) (27).
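To make the convergence check named above concrete, here is a minimal Python sketch of the basic Gelman and Rubin potential scale reduction factor for one parameter sampled by several chains. It is an illustration under stated assumptions, not the routine used for the analysis: the function name is hypothetical, the chains are simulated, and the degrees-of-freedom correction that some implementations apply is omitted.

```python
import random
import statistics


def gelman_rubin(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    # W: average within-chain variance.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # B / n: variance of the chain means (between-chain component).
    between_over_n = statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between_over_n
    return (pooled / within) ** 0.5


rng = random.Random(1)
chain_a = [rng.gauss(0.0, 1.0) for _ in range(2000)]
chain_b = [rng.gauss(0.0, 1.0) for _ in range(2000)]

# Two chains sampling the same distribution: value close to 1.
rhat_mixed = gelman_rubin([chain_a, chain_b])

# One chain shifted by 3 units, mimicking chains that have not mixed.
rhat_stuck = gelman_rubin([chain_a, [x + 3.0 for x in chain_b]])
```

Values close to 1 are usually read as consistent with convergence, whereas clearly larger values indicate that the chains disagree.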

Results

Model comparison
We used hierarchical SDM binomial, ZIB, binomial iCAR and ZIB iCAR models to map the predictive risk of viral EIDs. To compare the models with respect to deviance, we constructed a null model in each case and calculated the percentage of deviance explained relative to that null model. The spatial autoregression models performed better than their non-spatial counterparts in all three groups (Table 1). In the Filoviridae prediction model, 69% of the null deviance could be explained by the bioclimatic and population predictors using the ZIB model, which does not allow for accurate identification of hotspots. In contrast, the inclusion of random effects through iCAR allowed us to explain 100% of the null deviance, resulting in a perfect or saturated model. Similarly, for the Coronaviridae the ZIB models with imperfect detection performed slightly better, with 74% and 100% of the null deviance explained without and with iCAR, respectively. For the Henipavirus EID model, however, the binomial iCAR model was slightly better, but we chose to summarize the ZIB model with iCAR because the possibility of imperfect detection of outbreak events remains a concern. In addition, the ZIB model was able to explain 71% of the null deviance using predictors for Henipavirus events, which is higher than the binomial model, and using it throughout facilitates model standardization and comparison. Prediction maps, MCMC traceplots and TSS evolution for each model are available in the Supplementary information.
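The percentage of deviance explained can be illustrated as follows. This is a minimal sketch that assumes one Bernoulli observation per site and an intercept-only null model; the manuscript does not spell out how its null model was constructed, so both choices, like the function names, are assumptions.

```python
import math


def bernoulli_deviance(y, p):
    """Deviance (-2 log-likelihood) of 0/1 outcomes under probabilities p."""
    eps = 1e-12  # keeps log() finite when p is exactly 0 or 1
    log_lik = 0.0
    for yi, pi in zip(y, p):
        pi = min(max(pi, eps), 1.0 - eps)
        log_lik += yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
    return -2.0 * log_lik


def percent_deviance_explained(y, p_model):
    """100 * (1 - D_model / D_null), with an intercept-only null model."""
    prevalence = sum(y) / len(y)
    d_null = bernoulli_deviance(y, [prevalence] * len(y))
    d_model = bernoulli_deviance(y, p_model)
    return 100.0 * (1.0 - d_model / d_null)


y = [1, 0, 1, 0]
saturated = percent_deviance_explained(y, [1.0, 0.0, 1.0, 0.0])
no_better_than_null = percent_deviance_explained(y, [0.5, 0.5, 0.5, 0.5])
```

A model that reproduces every observation explains essentially 100% of the null deviance, which is how a saturated model shows up under this measure.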

Detection of EID hotspots
The distribution of the pathogenic filoviruses causing Marburg and Ebola viral disease (MVD & EVD) was restricted to the African continent (Figure 1). The ZIB iCAR model had a high TSS of 0.99 due to the addition of spatial autoregression. High-risk regions for EIDs caused by Coronaviridae predominate across the Indian subcontinent (Figure 2), detected by a ZIB iCAR model with a high TSS of 1. The hotspots of henipavirus diseases were limited to the west coast of India, Southeast Asia, China and Southern Australia (Figure 3).
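For readers unfamiliar with the True Skill Statistic, the sketch below shows how it is computed (sensitivity + specificity - 1) and how a threshold maximizing sensitivity + specificity can be selected. The data are made up for illustration and the function names are hypothetical; both classes must be present in the observations.

```python
def tss(y_true, p_hat, threshold):
    """True Skill Statistic at a given probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, p_hat):
        predicted = 1 if p >= threshold else 0
        if y == 1 and predicted == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


def best_threshold(y_true, p_hat):
    """Threshold among the predicted values that maximizes the TSS."""
    candidates = sorted(set(p_hat))
    return max(candidates, key=lambda t: tss(y_true, p_hat, t))


# Toy, perfectly separable example (illustrative values only).
observed = [1, 1, 1, 0, 0, 0]
predicted = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
threshold = best_threshold(observed, predicted)
score = tss(observed, predicted, threshold)
```

Because maximizing sensitivity + specificity is equivalent to maximizing the TSS, a single search serves both the threshold selection and the accuracy measure.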

Discussion
Our research demonstrates that it is possible to map high-risk regions for future outbreaks of major EIDs, including EVD, coronaviral diseases and Nipah virus disease, using robust hierarchical SDMs in a Bayesian framework. Our models also predict the hotspots for potentially new EIDs, or Disease X, belonging to the three viral groups analyzed. The models identified important factors that influence the spatial distribution of potential disease outbreaks. About 70% of the spatial distribution in the ZIB models without spatial autoregression could be explained by environmental and human factors (69-74% of the null deviance). We were able to obtain near "perfect" models, with a maximum TSS of at least 0.99 and an overall accuracy of one, when including random effects in the ZIB models with iCAR. Minimum temperature increase and human-driven land use were found to be the common drivers for all three EID groups.
The hotspots for filovirus diseases, EVD and MVD, were observed in the forested regions of Uganda, southern Sudan and eastern parts of the Democratic Republic of Congo, with smaller areas in West and Central Africa, up to Angola (Figure 1.B). The ZIB model shows that 69% of detected areas (Table 1) were spatially dependent on climatic and human factors. The addition of iCAR accounts for the effect of "unknown" variables (28). These variables could be bushmeat consumption, biodiversity loss or other bioclimatic or human behavior covariates not used in the study analyses. Similarly, coronavirus disease hotspots were also dependent on anthropogenic land use modifications and were distributed mainly in the Indian subcontinent, with some areas in China and South-East Asia (Figure 2.B). Given the population and connectivity of a country like India, the emergence of a coronavirus there could lead to a pandemic like COVID-19. These maps highlight the need for active surveillance in high-risk regions to prevent future outbreaks and threats like COVID-19. In addition to factors related to temperature and land-use change, henipavirus hotspots are also dependent on areas of low elevation and high rainfall and are scattered along the west coast of India, in Bangladesh, along the coast to Malaysia and in smaller areas of the Indonesian archipelago (Figure 3.B).
Our results show that the residual or night-time temperature, known as the minimum temperature, has a direct influence on the distribution of most EID outbreaks analyzed. An increase in minimum temperature was found to be an important driver for filoviruses and the other diseases analyzed in the study. Recent research has shown that increased surface temperature and unpredictable seasonal rainfall due to climate change have an indirect effect on disease emergence through sudden changes to reservoir habitats, loss of biodiversity and migration of small mammals (29,30). Minimum temperature is the limiting factor for parasite development and vector distribution in malaria transmission (31) and in other vector-borne disease outbreaks such as Crimean-Congo Hemorrhagic Fever and Zika (5,20). Research on the role of temperature outside vector-borne diseases is limited. This direct spatial dependence of disease emergence on minimum temperatures is worrying. With climate change, increasing night-time minimum temperatures are lengthening the frost-free season in most mid- and high-latitude regions (32), potentially increasing the latitudinal extent of disease emergence.
We found that low altitude and high rainfall have a significant influence on the distribution of Henipavirus outbreaks. Studies have hypothesized that the emergence of Nipah in the lower Gangetic plains and low-lying marshes could be attributed to flooding, which results in the destruction of mammalian habitats (33). Rapid changes in habitats due to human land use change lead to starvation and migration of the known reservoirs of Nipah virus, fruit bat species (Family Pteropodidae), with contamination of fruit trees near human habitations and increased exposure to the pathogen (13,33,34). Our results support this hypothesis of an increased risk of Nipah outbreaks associated with low-lying plains, flooding, and rapid human-induced habitat changes. EVD and coronaviral diseases have also been found to be associated with human-induced land modifications. EVD has long been linked to deforestation, mining, population growth and land fragmentation (35-37). Our results show that EVD outbreaks are not directly related to population density, contrary to a recent study (36), but rather to the effects of population increase on land use, such as urbanization, deforestation, mining and hunting. In contrast, population density was significantly related to coronavirus hotspots. Whether high population density leads to observer bias and thus increased reporting of outbreaks needs to be examined in detail. The report of a SARS-like pneumonia in 2012 in miners in Tongguan, Mojiang (38) raises the issue of potentially unreported sporadic outbreaks in sparsely populated regions. Studies show that the emergence of coronaviral diseases such as SARS (39) and MERS (40) is directly related to exposure to body fluids from mammals raised in confined spaces for bushmeat and recreational activities, respectively. "Wild flavour" bushmeat restaurants and markets are often located in densely populated cities where the demand for exotic proteins is high (41), and cases are therefore more likely to be reported in densely populated areas. In the case of MERS, reporting is higher in large cities because camel owners seek treatment for respiratory distress in tertiary hospitals located there. Population density is, however, crucial in the spread of an epidemic and therefore remains an important factor in the detection of hotspots and in active surveillance.
The inclusion of spatial autoregression, which is not explicitly accounted for in MaxENT (15,42) and other commonly used SDMs, provided us with near-perfect models with high TSS and accuracy for detection in each case. This approach with iCAR can be easily extended to other outbreaks in disease epidemiology with limited, clumped data and for which significant explanatory variables are unknown (24). Furthermore, by using a ZIB model we account for imperfect detection. The obvious limitation of our study is the observability bias in the detection of outbreaks. Not all spillover events and outbreaks are reported, especially in sparsely populated regions. We tried to account for imperfect detection with ZIB models, which combine a Bernoulli distribution for the suitability process and a Binomial distribution for the detectability process (24). Active surveillance is essential in high-risk regions to detect underreported outbreak events in humans and thereby mitigate this bias in future models.

Conclusion And Recommendations
Overall, the results of the study pave the way for the identification of hotspots with robust mathematical models and standardized open-source satellite imagery. This approach, using SDMs in a Bayesian hierarchical structure, is efficient and accurate in detecting hotspots and significant drivers of outbreaks and can be replicated in other settings and for other diseases. It also provides a map for active surveillance, which is essential in epidemic prevention. Here are our recommendations for getting a head start in disease prediction and prevention.
1. Assess areas at high risk of flooding and identify them as hotspots for disease emergence in the tropics. With unpredictable rainfall and rising sea levels, these areas need active disease surveillance.
2. The direct relationship between disease emergence and rapid changes in surface temperatures poses a threat of latitudinal spread of disease. There is an urgent need for global efforts to communicate the impact of climate change on the future emergence of a disease such as COVID-19 and thus to include EIDs in the assessment of the economic costs of climate change.
3. Develop alternatives to rapid land use changes, such as deforestation and land fragmentation for agriculture and livestock, and to bushmeat consumption, in order to meet the growing demand for protein.
4. Outbreaks and pandemics such as COVID-19 can be prevented by using robust mathematical modelling techniques and freely available satellite imagery.

Declarations
Ethics approval and consent to participate.

Consent for publication
Not applicable

Availability of data and materials
All data and R code for the models used in this manuscript can be accessed at https://github.com/soushie13/ecohealth

Competing interests
The authors declare that they have no competing interests.

Funding
The manuscript was a part of the doctoral dissertation of SJ during her PhD funded by the University of Guyane.

Authors' contributions
SJ extracted, analyzed, and modeled the data. SJ, MC and REG interpreted the results. MC, MN and REG were major contributors in writing the manuscript. All authors read and approved the final manuscript.

The maps have been provided by the authors.