A framework to predict zoonotic reservoirs under data uncertainty: a case study on betacoronaviruses

doi:10.21203/rs.3.rs-4304994/v1


Research Article

https://doi.org/10.21203/rs.3.rs-4304994/v1

This work is licensed under a CC BY 4.0 License

Version 1

1. Modelling approaches aimed at identifying currently unknown hosts of zoonotic diseases have the potential to make high-impact contributions to global strategies for zoonotic risk surveillance. However, geographical and taxonomic biases in host-pathogen associations might influence reliability of models and their predictions.

2. Here we propose a methodological framework to mitigate the effect of biases in host–pathogen data and account for uncertainty in models’ predictions. Our approach involves identifying “pseudo-negative” species and integrating sampling biases into the modelling pipeline. We present an application on the Betacoronavirus genus and provide estimates of mammal-borne betacoronavirus hazard at the global scale.

3. We show that the inclusion of pseudo-negatives in the analysis significantly improves the overall performance of our model (AUC = 0.82 and PR-AUC = 0.48, on average) compared to a model that does not use pseudo-negatives (AUC = 0.75 and PR-AUC = 0.39, on average), reducing the rate of false positives. Results of our application unveil currently unrecognised hotspots of betacoronavirus hazard in subequatorial Africa and South America.

4. Our approach addresses crucial limitations in host–virus association modelling, with important downstream implications for zoonotic risk assessments. The proposed framework is adaptable to different multi-host disease systems and may be used to identify surveillance priorities as well as knowledge gaps in zoonotic pathogens’ host-range.

Ecological Modeling

Infectious Diseases

zoonosis

viruses

hosts

predictive modelling

More than half of the infectious diseases that emerged in humans over the last five decades have an animal origin, i.e., are zoonoses (Jones et al., 2008; Kock & Caceres-Escobar, 2022). Among these, diseases with a wildlife reservoir represent a major and rising public health concern (Kruse et al., 2004; Morse et al., 2012; Smith et al., 2014). While global change drivers affect the risk of zoonotic disease emergence by altering human exposure to pathogens of wildlife (Plowright et al., 2017), the diversity and distribution of pathogens in their wild reservoirs represent the hazard–or potential risk–of zoonotic outbreaks (Hosseini et al., 2017). Moreover, increased pathogen sharing in wildlife due to hosts’ range changes may increase pathogen circulation and opportunities for spillover (Carlson et al., 2022). Understanding spatial and temporal patterns of zoonotic hazard stands as one of the most crucial tasks in addressing the burden of emerging and re-emerging zoonotic diseases at a global scale. However, there is very limited knowledge of the wide range of pathogens harbored by wildlife, and the role of undetected reservoirs in the maintenance of zoonotic pathogens remains largely unexplored. Indeed, the complete spectrum of reservoir hosts of many pathogens of primary public health concern remains severely underestimated. For example, only a very small fraction of the viral diversity of mammals is currently known (Carlson et al., 2019; Carroll et al., 2018) despite the scientific interest in characterising these pathogens, especially those in groups which have crossed the species barrier on multiple occasions (Holmes, 2022; Jones et al., 2008). Wild disease reservoirs are generally difficult to identify, as quantitative data on patterns of incidence and prevalence required for reservoir characterisation are extremely scarce for most disease systems and host species (Viana et al., 2014). 
Such limitations make the assessment of zoonotic hazard particularly challenging, and may jeopardise the establishment of One Health strategies for prevention of zoonotic spillover from wildlife.

Given the crucial knowledge gaps on host-pathogen associations and drivers of spillover from wildlife, novel data-driven approaches and technologies are becoming increasingly popular as part of an “emerging disease toolkit” that aims to support surveillance of emerging zoonotic diseases alongside traditional methodologies (Carlson et al., 2021). Among these new tools, machine learning approaches have been used to identify potential, understudied reservoirs, and provide estimates of zoonotic hazard. These models rely upon the assumption that currently unknown host species have similar ecological, phylogenetic, and life-history characteristics to known hosts (Han et al., 2015; Olival et al., 2017; Robles-Fernández et al., 2022). By identifying trait profiles of known hosts, this approach makes it possible to identify candidate hosts among species that have no known associations with the target pathogen but similar characteristics to known hosts (Becker et al., 2022; Blagrove et al., 2022; Wardeh et al., 2021). Trait-based statistical models can prove useful to rapidly identify potential reservoir species and pinpoint priorities for zoonotic surveillance across host taxa and geographical locations without necessarily having to perform expensive and labour-intensive field investigations. Approaches that employ these methods have been used in a wide array of contexts and disease systems, such as the identification of targets for Nipah virus surveillance in India (Plowright et al., 2019), and the recognition of unknown hosts of zoonotic leishmaniasis (Glidden et al., 2023).

As trait-based models rely upon profiles of known host species, they are prone to output the same patterns learned from observed host–pathogen data (Becker et al., 2022). This is a major limitation, as host–pathogen data are extremely limited and inherently biased due to spatial, taxonomic, and temporal variation in sampling efforts (Becker et al., 2019). In fact, discovery efforts have been preferentially directed towards certain vertebrate and viral taxa (Mollentze & Streicker, 2020). For example, the observed network of mammal-virus associations is heavily biased due to unstable discovery rates across decades and taxa (Gibb et al., 2022). This problem is intensified by the fact that models for host prediction rely on presence-only data, and most studies assume that all the species currently not known to carry the target pathogen are either not susceptible or not exposed to it. This assumption implies that if a species is associated with a pathogen, the association will certainly be detected. However, the previously discussed biases in host-pathogen data make this assumption problematic, especially when sampling patterns follow a non-random distribution across taxonomy, geography, and life-history characteristics of the host species. Most studies recognise this limitation; however, in the absence of any alternative, the use of such “negatives” has become standard in this type of modelling (Becker et al., 2022; Pandit et al., 2018). There is a concern that models may be reproducing the same biases that occur in input data, rather than predicting hosts that truly harbor the target pathogen (Wille et al., 2021). As a result, the most studied species may be artificially predicted as the most likely hosts of zoonotic pathogens.

Here, we propose a methodological framework for predicting wild reservoirs of target pathogens, presenting an application on betacoronaviruses in mammals. Betacoronaviruses are a primary public health concern due to their spillover and pandemic potential, as shown by the SARS and MERS epidemics and more recently by the COVID-19 pandemic. By explicitly acknowledging biases and uncertainties in host–virus association data, and calibrating our analysis accordingly, we seek to provide a standardised and robust approach to predict reservoir hosts that may be used to guide zoonotic disease surveillance.

We defined a framework to model the probability that mammalian species are reservoirs of betacoronaviruses, based on the characteristics of currently known hosts (Fig. 1). First (Section 2.1), we identified known mammal hosts of Betacoronavirus species, including both reservoir hosts and susceptible hosts. We also identified “pseudo-negative” species, defined as those without known association with betacoronaviruses despite existing research efforts in their geographic location for taxonomically related species. In doing so, we also accounted for uncertainty around host status definition of susceptible hosts and species classified as pseudo-negatives (Section 2.2). Second, we trained an ensemble of predictive models to estimate the probability of species’ reservoir status based on life-history, ecological and phylogenetic features (Section 2.3–2.4). Following an independent validation protocol, we tested the performance of the models, and then predicted reservoir status of mammal species not known to be associated with betacoronaviruses (Section 2.5). Finally, we mapped the distribution of observed and predicted reservoirs to highlight global hotspots of betacoronavirus hazard.

Betacoronaviruses are positive-sense single-stranded RNA viruses with high zoonotic and pandemic potential. Within this genus, three viruses have notably crossed the animal-human barrier, leading to severe outbreaks and large-scale epidemics with human-to-human transmission: SARS-CoV (Severe Acute Respiratory Syndrome Coronavirus), MERS-CoV (Middle East Respiratory Syndrome Coronavirus), and SARS-CoV-2, responsible for the COVID-19 pandemic. All three diseases caused by betacoronaviruses have been recognised as priority diseases within the World Health Organization's blueprint, underscoring the importance of betacoronaviruses as a public health risk. These characteristics make betacoronaviruses an excellent case-study to test and validate our modelling pipeline for hazard prediction.

2.1 Definition of host status

Known hosts. We obtained data on association of mammalian species with the target viral genus from VIRION, the most complete database of the vertebrates’ virome to date (Carlson, Gibb, et al., 2022). The database contains host–virus associations obtained with different detection methods, namely (in descending order of strength of evidence): virus isolation, detection of viral nucleic acid, and serology. In this way, we were able to identify 213 positive species, i.e., species where betacoronaviruses, or antibodies against them, have been detected. To train our models with more detailed information on host status, we further classified positive species into two categories: (1) susceptible hosts (potential reservoirs) and (2) confirmed reservoir hosts, based on a systematic screening of the scientific literature. Information on positive species’ category (reservoir host or susceptible host) was used to assign different relative importance to the species during model training, so that reservoir hosts contributed more to model estimation than susceptible hosts (Section 2.2).

We defined reservoir hosts as species which met two criteria: (1) individuals can acquire the virus in natural circumstances; and (2) the virus can persist in the host population for a prolonged period. To identify reservoir hosts, we surveyed Web of Science by querying a string made of the scientific binomial of positive species (e.g., ‘Pteropus vampyrus’) AND the term ‘reservoir’ AND the name of the viral genus (i.e., ‘Betacoronavirus’). This step was performed in R using the packages ‘httr’ (Wickham & Wickham, 2022) and ‘jsonlite’ (Ooms, 2014). We only retained records from observational studies where samples (e.g., oral swabs, rectal swabs, and/or whole blood) were collected from wild animals, reflecting natural exposure to the virus. According to our working definition of reservoir hosts, we screened resulting articles searching for temporal and spatial evidence of ongoing infection (maintenance) of one or more virus species belonging to the target genus in the candidate reservoir host population. The 25 species that met the definition and were included as betacoronavirus reservoirs are listed in Table S1. We considered as susceptible hosts all species for which evidence of a betacoronavirus infection was found, but that failed to meet the ‘reservoir host’ conditions. This resulted in the identification of 177 susceptible hosts for which life-history traits, phylogeny, and geographical data were available.
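The query construction described above can be illustrated with a short Python sketch (the authors worked in R). It only builds the search strings; it does not call any Web of Science endpoint. Quoting the binomial so that it is searched as a phrase is an assumption, and `build_query` is a hypothetical helper name.

```python
def build_query(binomial: str, genus: str = "Betacoronavirus") -> str:
    """Join the species binomial, the term 'reservoir' and the viral genus with AND."""
    # Assumption: the binomial is wrapped in double quotes for phrase matching.
    return f'"{binomial}" AND reservoir AND {genus}'


# Two positive species named in the text, used here only as examples.
positive_species = ["Pteropus vampyrus", "Rhinolophus affinis"]
queries = [build_query(name) for name in positive_species]

for query in queries:
    print(query)
```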

Pseudo-negatives. Available data on host–virus associations do not typically include negative associations (i.e., data on species that have been tested and found unsusceptible to a given virus), as negative results are less likely to be published and species-wide non-susceptibility to a virus is difficult to prove (Wardeh et al., 2021). Despite acknowledging this limitation, most studies that aim to identify likely reservoir hosts customarily include all the species currently not known to carry the target pathogen as “negatives” in their analysis (Becker et al., 2022; Pandit et al., 2018). This corresponds to using ‘lack of evidence’ as ‘evidence of absence’, albeit with a high risk of generating false negatives (i.e., understudied positive species mistakenly classified as negatives). We argue that better selection of negative species would mitigate such errors in input data, reducing noise in this type of analysis and yielding greater robustness of predictions.

In order to reduce the risk of including false negatives (i.e., unrecognised positive species) in our analysis, we designed a methodology to identify “pseudo-negatives” among species that are likely to have undergone virological sampling but have no documented associations with the target virus. Virological sampling strategies in wildlife generally adhere to taxonomic and geographical patterns (Gibb et al., 2022), so that sampling efforts focus on species that belong to the same taxa and geographical area as known hosts to maximise viral discovery (Becker et al., 2019; Carroll et al., 2018). We integrated these sampling patterns into the definition of pseudo-negatives. Specifically, we classified a species without any known association to the target virus as a pseudo-negative if at least 50% of its geographical range overlapped with the range of one or more positive species from the same taxonomic family. Positive species include both susceptible hosts and reservoir hosts associated with betacoronaviruses. Viral sharing probability between phylogenetically related mammals peaks at geographical overlap values of 50% (Albery et al., 2020), suggesting that closely related species with high spatial overlap are more likely to have been sampled for the same virus. Species’ geographical distributions were represented based on area of habitat maps (AOH) from Lumbierres et al. (2022). We used AOHs as they provide more accurate information on species distribution than species’ range polygons, because portions of the range that are not suitable for the species are excluded. We tested whether our use of geographic overlap introduced a bias by favouring either large-ranged or small-ranged species as pseudo-negatives.
We ran a Wilcoxon test to assess whether our selection of pseudo-negatives was biased by species’ range size, and found that the median range size of pseudo-negatives was significantly larger than the median range size of a random sample of mammals (p < 0.001) but also significantly smaller than the median range size of positive species (p < 0.001). While this comparison highlighted that neither positive species nor pseudo-negative species are a representative sample of the average mammal in terms of range size (Figure S1), larger-than-random range size of pseudo-negatives is actually desirable as it may counterbalance the exceptionally large range size of positive species, which is probably due to a combination of sampling bias and exposure to pathogens (Choo et al., 2023). Our pseudo-negative protocol identified 1,117 pseudo-negative species that met the criteria. Assuming that species which live in sympatry with many known positive species are more likely to have been sampled than species sympatric with fewer positive species, we used the number of overlaps alongside other information to assign different “weights” to pseudo-negative species during model training (Section 2.2). To assess the effect of the inclusion of pseudo-negatives on model performance, we repeated the analysis including all species from positive families that are not known to be associated with betacoronaviruses (n = 3,618) and weighted them equally in the models following Eq. 1. We assessed any significant performance improvements given by the inclusion of pseudo-negatives through a Wilcoxon signed-ranks test, a robust non-parametric test for statistical comparisons of classifiers (Demšar, 2006).
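The pseudo-negative rule described above (at least 50% of a species' range overlapping the range of one or more positive species from the same family) can be sketched in Python as follows. This is a toy illustration, not the authors' implementation: ranges are represented as sets of grid-cell identifiers rather than AOH rasters, and all species names, family labels and the function names are hypothetical.

```python
from collections import defaultdict


def overlap_fraction(candidate_cells: set, positive_cells: set) -> float:
    """Share of the candidate's range that falls inside the positive species' range."""
    if not candidate_cells:
        return 0.0
    return len(candidate_cells & positive_cells) / len(candidate_cells)


def find_pseudo_negatives(ranges, family, positives, threshold=0.5):
    """Return {species: n_overlaps} for unassociated species meeting the rule."""
    positives_by_family = defaultdict(list)
    for species in positives:
        positives_by_family[family[species]].append(species)

    result = {}
    for species, cells in ranges.items():
        if species in positives:
            continue
        # Count same-family positive species covering >= threshold of the range.
        n_overlaps = sum(
            overlap_fraction(cells, ranges[pos]) >= threshold
            for pos in positives_by_family[family[species]]
        )
        if n_overlaps >= 1:
            result[species] = n_overlaps
    return result


# Toy data: each range is a set of grid-cell ids (hypothetical species).
ranges = {
    "bat_A": {1, 2, 3, 4},    # positive
    "bat_B": {3, 4, 5, 6},    # positive
    "bat_C": {3, 4, 5},       # 2/3 inside bat_A, 3/3 inside bat_B
    "bat_D": {7, 8, 9, 10},   # no overlap with any positive species
    "rodent_E": {1, 2, 3},    # overlaps bat_A, but belongs to another family
}
family = {"bat_A": "Fam1", "bat_B": "Fam1", "bat_C": "Fam1",
          "bat_D": "Fam1", "rodent_E": "Fam2"}
positives = {"bat_A", "bat_B"}

pseudo_negatives = find_pseudo_negatives(ranges, family, positives)
print(pseudo_negatives)
```

The returned count of qualifying overlaps corresponds to the quantity that the text later uses when weighting pseudo-negative species.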

2.2 Accounting for uncertainty and sampling bias

To address the inherent uncertainty associated with host status definition, we developed instance weights that enabled us to incorporate sampling effort and the varying degrees of host status evidence into model training. Such weights are species-specific and determine the relative influence of each individual data point (i.e., species) on the predictive models. By implementing instance weights, we sought to mitigate uncertainty and noise in input data, allowing the model to focus on those species that provide a greater amount of information, such as reservoir hosts and well-studied pseudo-negatives. This approach enabled our model to prioritise pattern learning from data instances that carry more information, enhancing its ability to identify potential viral reservoirs when provided with species with unknown host status. Weights (\(w\)) within reservoir hosts, susceptible hosts and pseudo-negatives were divided by their sum to ensure that values were proportionally adjusted and consistent with Eq. 1,

\(\sum_{i=1}^{n_r} w_{\text{reservoir}_i} + \sum_{i=1}^{n_s} w_{\text{susceptible}_i} = \sum_{i=1}^{n_p} w_{\text{pseudo-negative}_i}\)

Eq. 1

where \({n}_{r}\), \({n}_{s}\), \({n}_{p}\) are the number of reservoirs, susceptible species, and pseudo-negatives in the dataset (respectively). As shown in Eq. 1, we constrained positive instances (reservoir and susceptible hosts) to have the same total weight as pseudo-negative instances. This allowed us to balance relative class proportion and avoid overfitting to the overrepresented class (pseudo-negatives).

We quantified sampling effort in different ways for positive and pseudo-negative species. For susceptible hosts, sampling effort (\(Seff\)) was measured as the total number of associations of the species with the target viral genus according to VIRION. For pseudo-negative species, sampling effort (\({Seff}^{*}\)) was represented as the mean number of associations between the species’ family and the target viral genus as indexed in VIRION. We assigned the maximum weight to all reservoir hosts, hence there was no need to estimate sampling effort for them. Relationships between weights and sampling effort for reservoirs, susceptible species and pseudo-negatives are described in Equations 2–4 (see also Fig. 2).

Weights of reservoirs are constant, i.e., assuming certainty of the available information:

\(w_{\text{reservoir}_i} = \frac{1}{n_r}\)

Eq. 2

Weights of susceptible species are penalised based on sampling effort (\(Seff\)) of each species, i.e., assuming a reduced probability of being a true reservoir if reservoir status was not confirmed despite high sampling effort:

\(w_{\text{susceptible}_i} = \frac{1}{\log\left(1 + Seff_i\right)}\)

Eq. 3

Weights of pseudo-negatives are proportional to sampling effort (\({Seff}^{*}\)) and adjusted for the number of overlaps (> 50% of the species’ range) of the pseudo-negative on the range of positive species (\({n}_{overlaps}\)). This assumes higher certainty of “true negative” status for species without viral information that are potentially subject to high sampling efforts:

\(w_{\text{pseudo-negative}_i} = \left(1 + {Seff}^{*}_i\right) \times n_{overlaps}\)

Eq. 4
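As a rough illustration of Eqs. 1–4, the Python sketch below computes raw weights for a handful of hypothetical species, normalises them within each group, and then rescales the pseudo-negative group so that its total matches the total weight of the positive species. Several points are assumptions rather than details given in the text: the logarithm in Eq. 3 is taken as the natural log, the final rescaling step is only one possible way of satisfying Eq. 1, and all sampling-effort values are invented.

```python
import math


def reservoir_weight(n_reservoirs: int) -> float:
    """Eq. 2: constant weight for every reservoir host."""
    return 1.0 / n_reservoirs


def susceptible_weight(seff: int) -> float:
    """Eq. 3: weight decreases with sampling effort (natural log assumed)."""
    return 1.0 / math.log(1 + seff)


def pseudo_negative_weight(seff_star: float, n_overlaps: int) -> float:
    """Eq. 4: weight increases with family-level effort and number of overlaps."""
    return (1 + seff_star) * n_overlaps


def normalise(weights):
    """Divide weights by their sum so that the group total is 1."""
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy data: 2 reservoirs, 2 susceptible hosts, 2 pseudo-negatives.
reservoirs = [reservoir_weight(2), reservoir_weight(2)]
susceptible = normalise([susceptible_weight(1), susceptible_weight(5)])
pseudo_raw = normalise([pseudo_negative_weight(2.0, 1),
                        pseudo_negative_weight(2.0, 3)])

# Assumed final step: scale pseudo-negatives to the positive total (Eq. 1).
positive_total = sum(reservoirs) + sum(susceptible)
pseudo = [w * positive_total for w in pseudo_raw]

print(reservoirs, susceptible, pseudo)
```

In this toy case the less intensively sampled susceptible host receives the larger weight, and the pseudo-negative with three overlaps receives three times the weight of the one with a single overlap, which matches the direction of the penalties described in the text.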

2.3 Predictors of host status

We obtained predictors of host status from different sources, focussing on biological traits which have been demonstrated to correlate with viral host status in mammals (Tonelli et al., 2023; Wardeh et al., 2021). Mammalian life-history traits were retrieved from the COMBINE dataset (Soria et al., 2021). We checked for collinearity among traits and selected a subset with low correlation (variance inflation factor, VIF < 5) which included: body mass, gestation length, longevity, litter size, litters per year, and weaning age. We also included the mean of pairwise phylogenetic distances (millions of years) of each species from other mammals to account for phylogenetic covariance of species traits, and the mean of pairwise phylogenetic distances from known hosts to account for species’ phylogenetic proximity to the known host range of betacoronaviruses. Phylogenetic distances were calculated using the phytools package in R (Revell, 2012) based on a phylogenetic tree of mammals obtained from PHYLACINE (Faurby et al., 2018). We chose to randomly pick one tree out of those provided by PHYLACINE, given that distances calculated over 100 alternative trees were highly collinear (Pearson’s r > 0.99). We selected temperature and precipitation variables available in WorldClim2 to represent bioclimatic conditions within each species’ range (Fick & Hijmans, 2017). The mean and standard deviation of each bioclimatic variable were extracted from within species’ ranges, and range size of the species (km²) was also included as a predictor in the analysis after logarithmic (log₁₀) transformation.

2.4 Modelling reservoir status

Model ensemble. We used a stacked ensemble to predict reservoir status (1–positive, 0–negative) as a function of species traits. Stacked ensembles combine multiple base models (in our case binary classifiers) and aggregate their outputs to obtain the final predictions (Polikar, 2012). By aggregating the outputs of multiple classifiers, any errors made by individual classifiers are likely to be balanced out by other classifiers, often resulting in better predictive performance compared to that of single classifiers (Sagi & Rokach, 2018). We opted for a set of classifiers that encompasses diverse algorithms. We trained four classifiers: random forest (RF), extreme gradient boosting (XGBoost), single layer neural network (NNET) and generalised additive model (GAM). All models were trained and tuned in R using the mlr3 package (Lang et al., 2019). Information on hyperparameters that have been tuned for the different classifiers is provided in Table S2.

Tuning and validation. Each classifier was tuned and validated with a repeated (n = 20) nested cross-validation, a procedure that avoids over-fitting in model selection and provides robust and unbiased performance evaluation (Cawley & Talbot, 2010). A nested cross-validation has an inner cross-validation loop (used for hyperparameter tuning) nested within an outer cross-validation loop (used for model validation). In each iteration, the dataset was split into five outer folds stratified by host status category and mammalian order so that each fold had the same proportion of pseudo-negatives, susceptible hosts, and reservoir hosts. Each of the five outer folds contains a training set (80% of the data) and a testing set (20% of the data). The training set of each outer fold was further split into three inner folds, each containing a training set (67% of the training set of the outer fold) and an assessment set (33% of the training set of the outer fold). For each classifier we used Bayesian optimisation to navigate the hyperparameter space. The hyperparameter combination that yielded the best average performance on the different assessment sets, quantified as the true skill statistic (TSS; Allouche et al., 2006), was then fit on the training set and used to predict the test set within the outer fold. We adopted a probability threshold of 0.5 to separate positive from negative species.
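The fold structure of one repetition of the nested cross-validation can be sketched in Python as below. This is a simplified illustration using only the standard library: it reproduces the 5 outer × 3 inner splitting of indices but omits the stratification by host status and mammalian order, the Bayesian hyperparameter search and the model fitting, all of which the authors carried out in R with mlr3. The function names and the sample size of 150 are hypothetical.

```python
import random


def k_fold_indices(indices, k, rng):
    """Shuffle the indices and split them into k nearly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_splits(n_samples, outer_k=5, inner_k=3, seed=0):
    """Yield (outer_train, outer_test, inner_splits) for one repetition."""
    rng = random.Random(seed)
    outer_folds = k_fold_indices(range(n_samples), outer_k, rng)
    for i, outer_test in enumerate(outer_folds):
        # Outer training set: all outer folds except the held-out one (80%).
        outer_train = [idx for j, fold in enumerate(outer_folds)
                       if j != i for idx in fold]
        # Inner loop: split the outer training set again for tuning.
        inner_folds = k_fold_indices(outer_train, inner_k, rng)
        inner_splits = []
        for m, assessment in enumerate(inner_folds):
            inner_train = [idx for q, fold in enumerate(inner_folds)
                           if q != m for idx in fold]
            inner_splits.append((inner_train, assessment))
        yield outer_train, outer_test, inner_splits


splits = list(nested_splits(n_samples=150))
for outer_train, outer_test, inner_splits in splits:
    print(len(outer_train), len(outer_test),
          [(len(tr), len(val)) for tr, val in inner_splits])
```

Repeating this routine with 20 different seeds would give 20 × 5 = 100 outer test sets, the number of hold-out sets on which performance is reported.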

We assessed the performance of the model ensemble by aggregating test set predicted probabilities of the single classifiers via weighted average and comparing them to observed status. By using a weighted average, we used individual performances of the four classifiers in the ensemble to adjust their contributions to final predictions. In this way, predictions of high-performance classifiers received higher weight than those of weak classifiers, resulting in increased classification performance (Polikar, 2012). Averaged predicted probability \(\mu_i(x)\) of reservoir status for a given species i was computed as follows (Eq. 5):

\(\mu_i(x) = \frac{1}{T}\sum_{t=1}^{T} w_t \, p_{t,i}(x)\)

Eq. 5

where the performance weight \(w_t\) of a given classifier t (out of T classifiers) is its TSS estimate on the training set, divided by the sum of TSS across all classifiers (so that weights sum to 1); \(p_{t,i}(x)\) is the predicted probability of reservoir status for species i, output by classifier t. Once we obtained ensemble predictions, performance was assessed by computing nine metrics (Table S3). As the nested routine was repeated 20 times, we computed performance metrics for each of the 100 hold-out test sets.
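The performance-weighted aggregation can be illustrated with a minimal Python sketch. The TSS values and predicted probabilities below are invented for illustration. Because the weights are described as summing to 1, the sketch computes the plain weighted sum and does not apply an additional 1/T factor; this is an interpretation of Eq. 5, not a statement of how the authors implemented it. The sketch also assumes that every classifier has a positive TSS.

```python
def ensemble_probability(tss_by_model: dict, prob_by_model: dict) -> float:
    """Weighted average of classifier probabilities, with weights proportional to TSS."""
    total_tss = sum(tss_by_model.values())
    # Normalise TSS values so that the weights sum to 1.
    weights = {name: tss / total_tss for name, tss in tss_by_model.items()}
    return sum(weights[name] * prob_by_model[name] for name in weights)


# Hypothetical TSS estimates and predicted probabilities for one species.
tss = {"RF": 0.5, "XGBoost": 0.3, "NNET": 0.1, "GAM": 0.1}
probs = {"RF": 0.9, "XGBoost": 0.7, "NNET": 0.2, "GAM": 0.4}

mu = ensemble_probability(tss, probs)
print(round(mu, 2))  # 0.72
```

With these numbers the two best-performing classifiers dominate the result, so the ensemble probability (0.72) sits above the unweighted mean of the four probabilities (0.55).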

2.5 Host status prediction

We predicted likely reservoirs of betacoronaviruses among mammalian species that (1) are not known to host betacoronaviruses (unknown status), or (2) are known to be susceptible to betacoronaviruses but are not confirmed reservoirs. We chose to only predict host status within mammalian families where positive species are already observed, as patterns that underlie observed associations between betacoronaviruses and mammals might not be generalisable across widely different mammalian taxa; thus, we provide more conservative predictions of viral hazard. We predicted betacoronavirus host status of 3,893 mammalian species using the model ensemble. The prediction process was repeated 100 times to further account for prediction uncertainty, each time applying the different model structures used during the nested cross-validation routine. We then mapped predicted hotspots of betacoronavirus hazard using species’ AOH maps and summing the number of predicted and observed reservoir hosts per 10 km × 10 km grid cell.
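The final mapping step, counting reservoir hosts per grid cell, can be sketched in Python as follows. The sketch assumes that each species' AOH has already been converted to a set of grid-cell coordinates; the species names, cell coordinates and the function name `reservoir_richness` are hypothetical, and no raster library is involved.

```python
from collections import Counter


def reservoir_richness(aoh_cells_by_species: dict, reservoirs: set) -> Counter:
    """Count how many reservoir species occur in each grid cell."""
    richness = Counter()
    for species in reservoirs:
        # Each species adds 1 to every cell of its area of habitat.
        richness.update(aoh_cells_by_species.get(species, set()))
    return richness


# Toy AOH maps: sets of (row, column) grid cells (hypothetical species).
aoh = {
    "species_1": {(0, 0), (0, 1)},
    "species_2": {(0, 1), (1, 1)},
    "species_3": {(1, 1)},  # not a reservoir, so it is ignored
}
observed_or_predicted = {"species_1", "species_2"}

richness = reservoir_richness(aoh, observed_or_predicted)
print(richness[(0, 1)])  # 2
```

Applying the same function to the reservoir set predicted in each of the 100 iterations would give a per-cell distribution of counts, from which a measure of spread could be derived.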

3.1 Ensemble accuracy

Our model ensemble to predict betacoronavirus reservoirs outperformed individual classifiers in some of the assessed metrics (Figure S2) and achieved moderate performance during validation (we report mean and sd of performance metrics calculated over the 100 test sets): TSS (mean = 0.50, sd = 0.073), TNR (mean = 0.80, sd = 0.039), TPR (mean = 0.69, sd = 0.081). Furthermore, the model trained without pseudo-negatives generally performed worse than the model trained with pseudo-negatives: TSS (mean = 0.35, sd = 0.069), TNR (mean = 0.62, sd = 0.042), TPR (mean = 0.73, sd = 0.075). See Fig. 3 for a full comparison across all assessed metrics.
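The reported metrics can be cross-checked with the standard definition of the true skill statistic, TSS = TPR + TNR − 1 (Allouche et al., 2006). The short Python sketch below applies it to the mean values reported above; since the published means are rounded to two decimals, small discrepancies with the reported TSS are expected.

```python
def true_skill_statistic(tpr: float, tnr: float) -> float:
    """TSS = sensitivity + specificity - 1."""
    return tpr + tnr - 1


# Mean TPR and TNR reported for the two models.
with_pseudo_negatives = true_skill_statistic(tpr=0.69, tnr=0.80)
without_pseudo_negatives = true_skill_statistic(tpr=0.73, tnr=0.62)

print(round(with_pseudo_negatives, 2))     # 0.49
print(round(without_pseudo_negatives, 2))  # 0.35
```

The values agree with the reported TSS means (0.50 and 0.35) up to rounding, and show that the gain comes from the higher true negative rate, which more than offsets the slightly lower true positive rate.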

Classification performance of the model ensemble showed significant variation across mammalian orders (Figure S3). Due to worse-than-random accuracy in primates, probably due to a combination of phylogenetic signal and range size (two important predictors in our models, see Figure S4), we chose to treat them differently to avoid overestimating potential hazard from this taxon and selected a convenient probability threshold of 0.74 to maximise TSS on the validation set. While potentially causing overfitting on primates’ observed host status, this approach provided more realistic predictions of primate reservoirs (see Section 3.3).
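Choosing a taxon-specific probability threshold that maximises TSS can be sketched in Python as below. The probabilities and labels are invented toy data, the candidate grid of thresholds is an assumption, and species with a predicted probability equal to the threshold are treated as positive, which the text does not specify.

```python
def tss_at_threshold(probs, labels, threshold):
    """True skill statistic when probabilities >= threshold are called positive."""
    tp = sum(p >= threshold and y == 1 for p, y in zip(probs, labels))
    fn = sum(p < threshold and y == 1 for p, y in zip(probs, labels))
    tn = sum(p < threshold and y == 0 for p, y in zip(probs, labels))
    fp = sum(p >= threshold and y == 0 for p, y in zip(probs, labels))
    return tp / (tp + fn) + tn / (tn + fp) - 1


def best_threshold(probs, labels, candidates):
    """Return the first candidate threshold with the highest TSS."""
    return max(candidates, key=lambda t: tss_at_threshold(probs, labels, t))


# Toy validation data for one taxon: negatives receive inflated probabilities.
probs = [0.55, 0.60, 0.70, 0.80, 0.90]
labels = [0, 0, 0, 1, 1]
candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95

chosen = best_threshold(probs, labels, candidates)
print(chosen, tss_at_threshold(probs, labels, 0.5))
```

In this toy case the default threshold of 0.5 labels every species as positive (TSS = 0), whereas a higher threshold separates the two classes. As the text notes, tuning the threshold on the same data used for validation risks overfitting to the observed host status.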

3.2 Predicted betacoronavirus reservoirs

Our model ensemble predicted 848 likely reservoirs of betacoronaviruses when prediction was limited to mammal families with at least one known positive species (63.3% of predicted reservoirs had unknown status, 18.4% had susceptible status, and 18.3% had pseudo-negative status). Most predicted reservoirs were rodents (34.8%) and bats (25.2%), followed by carnivores (11.0%), shrews (11.0%), and even-toed ungulates (10.9%) (Fig. 4); other less represented groups were primates, pangolins, and odd-toed ungulates (7.1% together). The number of predicted reservoirs was 250 under a stricter scenario where prediction was only allowed within families with at least one reservoir species (56.8% of predicted reservoirs had unknown status, 24.4% had susceptible status, and 18.8% had pseudo-negative status). In this case, most of the newly predicted reservoirs were bats (56.8%), followed by rodents (39.6%) and shrews (3.6%).

3.3 Observed and predicted hotspots of betacoronavirus hazard

We mapped the spatial distribution of observed and predicted reservoirs to identify hotspots of betacoronavirus hazard at the global scale (Fig. 5). This prediction was limited to species within families with at least one susceptible species. Acknowledging the weak model performance on primates, we also provide two additional maps showing Betacoronavirus hazard hotspots where primates predicted using the 0.50 probability threshold (n = 294) were either included or excluded (Figure S5). These maps show that the main hotspots of observed and predicted reservoirs remain unchanged regardless of the way predicted primates were treated. Acknowledging the fact that models might not be able to correctly extrapolate patterns beyond the currently-observed taxonomic boundaries of reservoirs, we also provide a more conservative scenario where prediction was limited to families with at least one reservoir species (Figure S6).

Additionally, we accounted for uncertainty of reservoir prediction across 100 model iterations and mapped the degree of uncertainty around the hotspots of reservoir richness (Fig. 6). We identify observed hotspots of betacoronavirus reservoirs in Southeast Asia and, to a lesser extent, in Europe. Reservoirs predicted by our model are more widely distributed than just these two areas, with major hazard hotspots located in Southeast Asia, subequatorial Africa, and tropical South America. Under the more conservative scenario (Figure S6), hotspots of betacoronavirus reservoirs do not include South America, as predicted Molossidae and Phyllostomidae bat species that occur in the region were excluded.

4.1 Considerations on framework methodology and validation

We presented a modelling approach aimed at the identification of unknown reservoirs of zoonotic viruses, which included key methodological aspects aimed at addressing well-known shortcomings in host–virus association modelling (Becker et al., 2022). This includes mitigating biases in the observed host–pathogen data, obtaining reliable performance estimates, and accounting for uncertainty around model outputs. In our case-study, we integrated sampling effort and existing knowledge on mammal–betacoronavirus associations in the training phase of our analytical pipeline. This was achieved by assigning different weights to the species included in our dataset, enabling our model to prioritise the correct classification of species that contain a larger portion of information. For example, the intermediate horseshoe bat (Rhinolophus affinis) – a confirmed reservoir of Sarbecovirus in South China (Li et al., 2022) – contributed about four times as much to model estimates compared to the hog badger (Arctonyx collaris) – a candidate intermediate host for SARS-like coronaviruses (Bell et al., 2004; Zhao et al., 2020). By calibrating varying levels of priority per species, our models targeted predictive traits crucial for reservoir identification. Furthermore, with the use of ensemble modelling, different tuned models were able to target distinct facets of Betacoronavirus host profiles, leading to better ensemble classification once individual models’ predictions were put together, as indicated by the greater accuracy of the ensemble compared to the individual classifiers.

Our pseudo-negative protocol was designed to pick negatives among species that were more likely to have actually been tested for betacoronaviruses, based on geographical and taxonomic biases in virological surveys in wildlife. We show that including pseudo-negatives in the analysis enhanced the performance of our model ensemble across most evaluated metrics (Fig. 3), with statistically significant differences observed (Table S4). In particular, the use of pseudo-negatives lowered the number of false positives produced by the model. This result suggests that providing higher-quality negative data (by reducing the erroneous inclusion of understudied positive species as negatives) might effectively reduce noise in input data and allow the classifiers to better separate hosts from non-hosts. We recognise shortcomings in our selection of pseudo-negatives, particularly when dealing with species characterised by extremely large range areas. Extremely widespread species will rarely be selected as pseudo-negatives, as it is unlikely for these species to have an overlap equal to or greater than half of their range area with any positive species. This potential shortfall may lead to the exclusion of some true negative associations of widespread species that may have been frequently sampled and found negative. Providing modelling approaches with data on true negative associations (i.e., evidence that the virus cannot infect a given species) will help overcome this important limitation. Additionally, true negatives will allow models to identify the actual biological boundaries of pathogens’ host ranges with higher accuracy and help reduce overprediction.
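
The geographic part of the pseudo-negative rule, and its bias against widespread species, can be sketched as follows. This is a simplified illustration under stated assumptions: ranges are represented as sets of equal-area grid-cell ids, the taxonomic criterion is omitted, and the function names and toy species are hypothetical.

```python
def overlap_fraction(candidate_range, positive_range):
    """Share of the candidate's range that a positive species also occupies."""
    if not candidate_range:
        return 0.0
    return len(candidate_range & positive_range) / len(candidate_range)

def select_pseudo_negatives(ranges, positives, threshold=0.5):
    """Unlabelled species overlapping any positive species by >= threshold."""
    selected = []
    for species, cells in ranges.items():
        if species in positives:
            continue  # known hosts are never candidates
        if any(overlap_fraction(cells, ranges[pos]) >= threshold
               for pos in positives):
            selected.append(species)
    return sorted(selected)

# Hypothetical toy ranges expressed as grid-cell ids.
ranges = {
    "positive_bat": set(range(0, 10)),           # known host, cells 0..9
    "local_rodent": set(range(5, 13)),           # 8 cells, 5 shared -> 0.625
    "widespread_carnivore": set(range(0, 100)),  # 100 cells, 10 shared -> 0.10
    "distant_shrew": set(range(50, 60)),         # no shared cells
}
pseudo_negatives = select_pseudo_negatives(ranges, {"positive_bat"})
```

Only `local_rodent` passes the threshold. The widespread species shares all ten cells with the positive species, yet those cells are only a tenth of its own range, which is the limitation discussed above.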

We employed nested cross-validation to estimate the generalization error of our model ensemble in a robust and unbiased way. As highlighted by a recent application to the prediction of Leishmania hosts (Glidden et al., 2023), nested cross-validation is an efficient resampling strategy to obtain performance estimates when using small datasets. It allows both model training and selection within the same resampling regime without causing overfitting, as opposed to non-nested cross-validation where fine-tuning could bias the model towards the dataset, potentially yielding overly optimistic performance estimates (Cawley & Talbot, 2010). Use of robust, unbiased validation protocols is crucial to provide realistic indications of the reliability of data-driven approaches and communicate the likelihood of inaccurate hazard predictions. Also, by explicitly representing uncertainty around predicted hazard hotspots, we highlight areas where the reliability of predictions is weaker and identify potential knowledge gaps that may guide targeted sampling efforts to reduce such uncertainty.
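
The structure of nested cross-validation can be shown with a small, dependency-free sketch. The learner (a one-dimensional k-nearest-neighbour classifier), the fold counts, the accuracy metric, and the toy data are all assumptions chosen for brevity and do not correspond to the models or settings used in the study.

```python
def make_folds(n_items, n_folds):
    """Split indices 0..n_items-1 into n_folds interleaved folds."""
    return [list(range(i, n_items, n_folds)) for i in range(n_folds)]

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0

def accuracy(train, test, k):
    """Share of test points classified correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def split(data, fold):
    """Return (training rows, held-out rows) for one fold of indices."""
    held_out = set(fold)
    train = [row for i, row in enumerate(data) if i not in held_out]
    test = [data[i] for i in fold]
    return train, test

def cross_validate(data, k, n_folds):
    """Plain k-fold cross-validated accuracy for one hyperparameter value."""
    scores = [accuracy(*split(data, fold), k)
              for fold in make_folds(len(data), n_folds)]
    return sum(scores) / len(scores)

def nested_cv(data, candidate_ks, outer_folds=3, inner_folds=2):
    """Tune k in an inner loop, score the tuned model in an outer loop."""
    outer_scores = []
    for fold in make_folds(len(data), outer_folds):
        outer_train, outer_test = split(data, fold)
        # Inner loop: choose k using the outer-training data only.
        best_k = max(candidate_ks,
                     key=lambda k: cross_validate(outer_train, k, inner_folds))
        # Outer loop: evaluate on data never seen during tuning.
        outer_scores.append(accuracy(outer_train, outer_test, best_k))
    return sum(outer_scores) / len(outer_scores)

# Hypothetical toy data: one feature, label 1 when the feature is >= 0.6.
data = [(x / 10, 1 if x >= 6 else 0) for x in range(12)]
estimate = nested_cv(data, candidate_ks=[1, 3])
```

The key design point is that the outer test fold plays no part in choosing `best_k`, so the averaged outer score is not inflated by the tuning step.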

While the model achieved good overall performance in predicting the main groups of mammalian reservoirs of betacoronaviruses – such as bats, rodents, and insectivores – it showed weaker accuracy for species in other taxonomic orders. We therefore took a more cautious approach and limited our predictions to families in which susceptibility to betacoronaviruses has already been confirmed. Weaker model fit was only evident within less-represented taxonomic orders in our dataset, especially primates, where our model exhibited a tendency to overpredict host status, indicating a potential limitation in capturing characteristics that drive host status in this order. Nonetheless, the way we treated primates did not quantifiably affect the identification of hotspots of host richness, as the major hotspots located in Southeast Asia, subequatorial Africa, and tropical South America remained the same. It is worth mentioning that there is molecular evidence that apes and monkeys of Africa and Asia could be susceptible to SARS-CoV-2 infection (Li et al., 2023), opening the possibility that some primate species could also play a role in betacoronavirus maintenance.
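
The cautious restriction described above amounts to a simple filter on model output. The sketch below assumes that a species-to-family lookup and a set of known hosts are available; the function name and the placeholder species labels are hypothetical.

```python
def restrict_to_confirmed_families(predicted_hosts, family_of, known_hosts):
    """Keep predictions only in families that contain a confirmed host."""
    confirmed_families = {family_of[species] for species in known_hosts}
    return {species for species in predicted_hosts
            if family_of[species] in confirmed_families}

# Hypothetical toy input; species labels are placeholders.
family_of = {
    "bat_1": "Rhinolophidae",
    "bat_2": "Rhinolophidae",
    "primate_1": "Cercopithecidae",
}
known_hosts = {"bat_1"}
predicted_hosts = {"bat_2", "primate_1"}
retained = restrict_to_confirmed_families(predicted_hosts, family_of, known_hosts)
```

Here the predicted primate is dropped because its family contains no confirmed host, while the second bat is retained.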

4.2 Highlighting surveillance priorities and knowledge gaps in betacoronaviruses’ host-range

The application of our framework revealed potential neglected hotspots of Betacoronavirus hazard. Previous studies highlighted geographical and taxonomic biases in coronavirus sampling and identified knowledge gaps from critical hotspots of mammalian diversity in Africa, Asia, and Latin America (Anthony et al., 2017; Ruiz-Aravena et al., 2022). Currently observed hotspots of betacoronavirus reservoirs are predominantly clustered in Southeast Asia and southern Europe, in agreement with patterns of coronavirus surveillance in bats preceding the COVID-19 pandemic (Cohen et al., 2023). The currently observed distribution indicates that the seeming lack of betacoronaviruses within certain taxa and geographical regions is more likely due to sampling biases than to a genuine absence of natural reservoirs of these viruses (Ruiz-Aravena et al., 2022). We predicted high richness of reservoir hosts in tropical areas of the world, including South and Central America, subequatorial Africa, and Southeast Asia, the last of which is considered the main hotspot of Betacoronavirus-positive species (Muylaert et al., 2022; Sánchez et al., 2022).

Despite the neotropics being among the main hotspots of mammalian diversity – especially of bat species (López-Aguirre et al., 2018) – the diversity of coronaviruses in the region has been only partially described. Interestingly, novel lineages of the Betacoronavirus genus found in insectivorous bats sampled in Costa Rica were only loosely related to known Old-World bat betacoronaviruses (Corman et al., 2013). This is consistent with patterns of viral diversification in Central and South America, suggesting that circulation of evolutionarily distinct lineages of betacoronaviruses in the New World is higher than currently documented (Cibulski et al., 2020). We identified Vespertilionidae and Molossidae as the families with the most newly predicted reservoirs in Central and South America. The identification of a new betacoronavirus, exhibiting 96.5% amino acid identity with MERS-CoV, in Nyctinomops laticaudatus (family Molossidae) in Mexico emphasises the need for surveillance efforts within Molossidae and the closely related family Vespertilionidae (Anthony et al., 2013). This is particularly important in order to gain insights into the Betacoronavirus lineages harbored by bat species in the neotropics and to assess the risk of spillover events in the context of growing invasion and degradation of neotropical ecosystems (Hansen et al., 2013).

In African countries, coronavirus surveillance has been mostly opportunistic (Markotter et al., 2020). Existing studies report highly diverse and novel coronavirus species, some of which are related to human betacoronaviruses. Among these, circulation of MERS-CoV received particular attention, as camels sampled throughout West and North Africa either exhibited previous exposure to this virus or tested positive for viral RNA. These findings suggest ongoing circulation of MERS-CoV and the potential presence of wild reservoirs – most likely bats (Reusken et al., 2013) – fostering the chain of transmission in the continent. Both MERS and SARS-related viruses have been detected in a wide diversity of African bats from the families Hipposideridae, Phyllostomidae, Pteropodidae, and Vespertilionidae (Annan et al., 2013; Pfefferle et al., 2009; Quan et al., 2010; Shehata et al., 2016), including the Cape serotine bat (Neoromicia capensis) in South Africa (Geldenhuys et al., 2018) and the African pipistrelle bat (Pipistrellus hesperidus) in Uganda (Anthony, Gilardi, et al., 2017). Our predictions of abundant reservoir presence in Africa, coupled with growing evidence of viral detection in African bats, underscore the importance of field investigations in Sub-Saharan Africa to resolve uncertainty around the role played by different potential reservoirs in viral maintenance in the region.

Surveillance activities on betacoronaviruses have been largely focused on bats, as they comprise the main reservoirs of high-consequence betacoronavirus diseases. Nevertheless, a major proportion of the known genetic diversity within the Betacoronavirus lineage A (subgenus Embecovirus) is associated with rodents, which probably constitute an important reservoir of this viral clade (Lau et al., 2015; Wang et al., 2015). According to our model, the diversity of potential reservoir species in the order Rodentia might be underestimated by 2- to 6-fold (under a strict and a liberal scenario of prediction, respectively), with the highest number of predicted rodent reservoirs identified in the families Cricetidae, Sciuridae, and Muridae. SARS-related betacoronaviruses have been found in Muridae species in Yunnan Province of China (Ge et al., 2017), and mice have been speculated to have a role in the circulation of Swine Acute Diarrhea Syndrome Coronavirus (SADS-CoV, an alphacoronavirus) (Yang et al., 2020). This evidence reveals that diverse coronaviruses may circulate in rodents, suggesting that rodent species may be natural reservoirs of coronaviruses and a candidate source of Betacoronavirus emergence. Our results indicate that monitoring strategies should include the systematic screening of other mammals aside from bats, including rodents, ungulates, and shrews that have been proven to be susceptible to betacoronavirus infection (Corman et al., 2014; Leopardi et al., 2023). This would help retrace the evolution of the Betacoronavirus genus and characterise viral and host features associated with the highest spillover risk.

5 Conclusions

Predictive modelling has the potential to guide decision-making in viral discovery efforts and surveillance strategies of zoonotic risk. Yet, current data limitations prevent these tools from being integrated into actionable and applicable surveillance strategies focussed on spillover risk anticipation. Our framework is a step towards addressing the uncertainties associated with hazard modelling and prediction, and is applicable across various contexts and diseases. By integrating information on taxonomic and geographical patterns of virological sampling in the definition of pseudo-negatives, we achieved statistically significant improvements in models’ performance and reduced overprediction. With the rapid development of artificial intelligence applications and aerial imagery, our approach could be coupled with other modelling pipelines to identify the local-scale distribution of reservoir hosts and assess drivers of spillover risk at the human–wildlife interface at finer scales (Layman et al., 2023). Further developments in the fields of virology and immunology will bring crucial contributions to our framework, generating essential knowledge for model improvement and validation. As data from these fields become more readily available, they will provide essential information on drivers of host–pathogen interaction at the molecular level, including receptor structure and molecular binding affinity, which will increase the efficacy of modelling approaches (Sundaram et al., 2022). In conclusion, overcoming current data gaps and explicitly addressing the limitations of predictive approaches will allow modelling pipelines to be better integrated into strategies for zoonosis risk assessment and mitigation in the future.

References

Albery, G. F., Eskew, E. A., Ross, N., & Olival, K. J. (2020). Predicting the global mammalian viral sharing network using phylogeography. Nature Communications, 11(1). https://doi.org/10.1038/s41467-020-16153-4
Allouche, O., Tsoar, A., & Kadmon, R. (2006). Assessing the accuracy of species distribution models: Prevalence, kappa and the true skill statistic (TSS). Journal of Applied Ecology, 43(6). https://doi.org/10.1111/j.1365-2664.2006.01214.x
Annan, A., Baldwin, H. J., Corman, V. M., Klose, S. M., Owusu, M., Nkrumah, E. E., Badu, E. K., Anti, P., Agbenyega, O., Meyer, B., Oppong, S., Sarkodie, Y. A., Kalko, E. K. V., Lina, P. H. C., Godlevska, E. V., Reusken, C., Seebens, A., Gloza-Rausch, F., Vallo, P., … Drexler, J. F. (2013). Human betacoronavirus 2c EMC/2012-related viruses in bats, Ghana and Europe. Emerging Infectious Diseases, 19(3). https://doi.org/10.3201/eid1903.121503
Anthony, S. J., Gilardi, K., Menachery, V. D., Goldstein, T., Ssebide, B., Mbabazi, R., Navarrete-Macias, I., Liang, E., Wells, H., Hicks, A., Petrosov, A., Byarugaba, D. K., Debbink, K., Dinnon, K. H., Scobey, T., Randell, S. H., Yount, B. L., Cranfield, M., Johnson, C. K., … Mazet, J. A. K. (2017). Further evidence for bats as the evolutionary source of Middle East respiratory syndrome coronavirus. MBio, 8(2). https://doi.org/10.1128/mBio.00373-17
Anthony, S. J., Johnson, C. K., Greig, D. J., Kramer, S., Che, X., Wells, H., Hicks, A. L., Joly, D. O., Wolfe, N. D., Daszak, P., Karesh, W., Lipkin, W. I., Morse, S. S., Mazet, J. A. K., & Goldstein, T. (2017). Global patterns in coronavirus diversity. Virus Evolution, 3(1). https://doi.org/10.1093/ve/vex012
Anthony, S. J., Ojeda-Flores, R., Rico-Chávez, O., Navarrete-Macias, I., Zambrana-Torrelio, C. M., Rostal, M. K., Epstein, J. H., Tipps, T., Liang, E., Sanchez-Leon, M., Sotomayor-Bonilla, J., Aguirre, A. A., Ávila-Flores, R. A., Medellín, R. A., Goldstein, T., Suzán, G., Daszak, P., & Lipkin, W. I. (2013). Coronaviruses in bats from Mexico. Journal of General Virology, 94(PART 5). https://doi.org/10.1099/vir.0.049759-0
Becker, D. J., Albery, G. F., Sjodin, A. R., Poisot, T., Bergner, L. M., Chen, B., Cohen, L. E., Dallas, T. A., Eskew, E. A., Fagre, A. C., Farrell, M. J., Guth, S., Han, B. A., Simmons, N. B., Stock, M., Teeling, E. C., & Carlson, C. J. (2022). Optimising predictive models to prioritise viral discovery in zoonotic reservoirs. The Lancet Microbe. https://doi.org/10.1016/s2666-5247(21)00245-7
Becker, D. J., Crowley, D. E., Washburne, A. D., & Plowright, R. K. (2019). Temporal and spatial limitations in global surveillance for bat filoviruses and henipaviruses. Biology Letters, 15(12). https://doi.org/10.1098/rsbl.2019.0423
Bell, D., Roberton, S., & Hunter, P. R. (2004). Animal origins of SARS coronavirus: Possible links with the international trade in small carnivores. Philosophical Transactions of the Royal Society B: Biological Sciences, 359(1447). https://doi.org/10.1098/rstb.2004.1492
Blagrove, M. S., Pilgrim, J., Kotsiri, A., Hui, M., Baylis, M., & Wardeh, M. (2022). Monkeypox virus shows potential to infect a diverse range of native animal species across Europe, indicating high risk of becoming endemic in the region. BioRxiv.
Carlson, C. J., Albery, G. F., Merow, C., Trisos, C. H., Zipfel, C. M., Eskew, E. A., Olival, K. J., Ross, N., & Bansal, S. (2022). Climate change increases cross-species viral transmission risk. Nature, 607(7919). https://doi.org/10.1038/s41586-022-04788-w
Carlson, C. J., Farrell, M. J., Grange, Z., Han, B. A., Mollentze, N., Phelan, A. L., Rasmussen, A. L., Albery, G. F., Bett, B., Brett-Major, D. M., Cohen, L. E., Dallas, T., Eskew, E. A., Fagre, A. C., Forbes, K. M., Gibb, R., Halabi, S., Hammer, C. C., Katz, R., … Webala, P. W. (2021). The future of zoonotic risk prediction. In Philosophical Transactions of the Royal Society B: Biological Sciences (Vol. 376, Issue 1837). https://doi.org/10.1098/rstb.2020.0358
Carlson, C. J., Gibb, R. J., Albery, G. F., Brierley, L., Connor, R. P., Dallas, T. A., Eskew, E. A., Fagre, A. C., Farrell, M. J., Frank, H. K., Muylaert, R. L., Poisot, T., Rasmussen, A. L., Ryan, S. J., & Seifert, S. N. (2022). The Global Virome in One Network (VIRION): an Atlas of Vertebrate-Virus Associations. MBio, 13(2). https://doi.org/10.1128/mbio.02985-21
Carlson, C. J., Zipfel, C. M., Garnier, R., & Bansal, S. (2019). Global estimates of mammalian viral diversity accounting for host sharing. Nature Ecology and Evolution, 3(7). https://doi.org/10.1038/s41559-019-0910-6
Carroll, D., Daszak, P., Wolfe, N. D., Gao, G. F., Morel, C. M., Morzaria, S., Pablos-Méndez, A., Tomori, O., & Mazet, J. A. K. (2018). The global virome project. Science, 359(6378), 872–874.
Cawley, G. C., & Talbot, N. L. C. (2010). On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11.
Choo, J., Nghiem, L. T. P., Benítez-López, A., & Carrasco, L. R. (2023). Range area and the fast–slow continuum of life history traits predict pathogen richness in wild mammals. Scientific Reports, 13(1), 20191.
Cibulski, S., de Lima, F. E. S., & Roehe, P. M. (2020). Coronaviruses in Brazilian bats: A matter of concern? In PLoS Neglected Tropical Diseases (Vol. 14, Issue 10). https://doi.org/10.1371/journal.pntd.0008820
Cohen, L. E., Fagre, A. C., Chen, B., Carlson, C. J., & Becker, D. J. (2023). Coronavirus sampling and surveillance in bats from 1996–2019: a systematic review and meta-analysis. Nature Microbiology, 8(6). https://doi.org/10.1038/s41564-023-01375-1
Corman, V. M., Kallies, R., Philipps, H., Göpner, G., Müller, M. A., Eckerle, I., Brünink, S., Drosten, C., & Drexler, J. F. (2014). Characterization of a Novel Betacoronavirus Related to Middle East Respiratory Syndrome Coronavirus in European Hedgehogs. Journal of Virology, 88(1). https://doi.org/10.1128/jvi.01600-13
Corman, V. M., Rasche, A., Diallo, T. D., Cottontail, V. M., Stöcker, A., Souza, B. F. de C. D., Corrêa, J. I., Carneiro, A. J. B., Franke, C. R., Nagy, M., Metz, M., Knörnschild, M., Kalko, E. K. V., Ghanem, S. J., Morales, K. D. S., Salsamendi, E., Spínola, M., Herrler, G., Voigt, C. C., … Drexler, J. F. (2013). Highly diversified coronaviruses in neotropical bats. Journal of General Virology, 94(PART 9). https://doi.org/10.1099/vir.0.054841-0
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7.
Faurby, S., Davis, M., Pedersen, R., Schowanek, S. D., Antonelli, A., & Svenning, J. C. (2018). PHYLACINE 1.2: The Phylogenetic Atlas of Mammal Macroecology. In Ecology (Vol. 99, Issue 11). https://doi.org/10.1002/ecy.2443
Fick, S. E., & Hijmans, R. J. (2017). WorldClim 2: new 1-km spatial resolution climate surfaces for global land areas. International Journal of Climatology, 37(12). https://doi.org/10.1002/joc.5086
Ge, X. Y., Yang, W. H., Zhou, J. H., Li, B., Zhang, W., Shi, Z. L., & Zhang, Y. Z. (2017). Detection of alpha- and betacoronaviruses in rodents from Yunnan, China. Virology Journal, 14(1). https://doi.org/10.1186/s12985-017-0766-9
Gearty, W., & Jones, L. A. (2023). rphylopic: An R package for fetching, transforming, and visualising PhyloPic silhouettes. Methods in Ecology and Evolution. https://doi.org/10.1111/2041-210X.14221
Geldenhuys, M., Mortlock, M., Weyer, J., Bezuidt, O., Seamark, E. C. J., Kearney, T., Gleasner, C., Erkkila, T. H., Cui, H., & Markotter, W. (2018). A metagenomic viral discovery approach identifies potential zoonotic and novel mammalian viruses in Neoromicia bats within South Africa. PLoS ONE, 13(3). https://doi.org/10.1371/journal.pone.0194527
Gibb, R., Albery, G. F., Mollentze, N., Eskew, E. A., Brierley, L., Ryan, S. J., Seifert, S. N., & Carlson, C. J. (2022). Mammal virus diversity estimates are unstable due to accelerating discovery effort. Biology Letters, 18(1). https://doi.org/10.1098/rsbl.2021.0427
Glidden, C. K., Murran, A. R., Silva, R. A., Castellanos, A. A., Han, B. A., & Mordecai, E. A. (2023). Phylogenetic and biogeographical traits predict unrecognized hosts of zoonotic leishmaniasis. PLOS Neglected Tropical Diseases, 17(5), e0010879.
Han, B. A., Schmidt, J. P., Bowden, S. E., & Drake, J. M. (2015). Rodent reservoirs of future zoonotic diseases. Proceedings of the National Academy of Sciences of the United States of America, 112(22). https://doi.org/10.1073/pnas.1501598112
Hansen, M. C., Potapov, P. V., Moore, R., Hancher, M., Turubanova, S. A., Tyukavina, A., Thau, D., Stehman, S. V., Goetz, S. J., Loveland, T. R., Kommareddy, A., Egorov, A., Chini, L., Justice, C. O., & Townshend, J. R. G. (2013). High-resolution global maps of 21st-century forest cover change. Science, 342(6160). https://doi.org/10.1126/science.1244693
Holmes, E. C. (2022). The Ecology of Viral Emergence. Annual Review of Virology, 9. https://doi.org/10.1146/annurev-virology-100120-015057
Hosseini, P. R., Mills, J. N., Prieur-Richard, A.-H., Ezenwa, V. O., Bailly, X., Rizzoli, A., Suzán, G., Vittecoq, M., García-Peña, G. E., Daszak, P., & others. (2017). Does the impact of biodiversity differ between emerging and endemic pathogens? The need to separate the concepts of hazard and risk. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1722), 20160129.
Jones, K. E., Patel, N. G., Levy, M. A., Storeygard, A., Balk, D., Gittleman, J. L., & Daszak, P. (2008). Global trends in emerging infectious diseases. Nature, 451(7181), 990–993.
Kock, R., & Caceres-Escobar, H. (2022). Situation analysis on the roles and risks of wildlife in the emergence of human infectious diseases. In Situation analysis on the roles and risks of wildlife in the emergence of human infectious diseases. https://doi.org/10.2305/iucn.ch.2022.01.en
Kruse, H., Kirkemo, A. M., & Handeland, K. (2004). Wildlife as source of zoonotic infections. In Emerging Infectious Diseases (Vol. 10, Issue 12). https://doi.org/10.3201/eid1012.040707
Lang, M., Binder, M., Richter, J., Schratz, P., Pfisterer, F., Coors, S., Au, Q., Casalicchio, G., Kotthoff, L., & Bischl, B. (2019). mlr3: A modern object-oriented machine learning framework in R. Journal of Open Source Software, 4(44). https://doi.org/10.21105/joss.01903
Lau, S. K. P., Woo, P. C. Y., Li, K. S. M., Tsang, A. K. L., Fan, R. Y. Y., Luk, H. K. H., Cai, J.-P., Chan, K.-H., Zheng, B.-J., Wang, M., & Yuen, K.-Y. (2015). Discovery of a Novel Coronavirus, China Rattus Coronavirus HKU24, from Norway Rats Supports the Murine Origin of Betacoronavirus 1 and Has Implications for the Ancestor of Betacoronavirus Lineage A. Journal of Virology, 89(6). https://doi.org/10.1128/jvi.02420-14
Layman, N. C., Basinski, A. J., Zhang, B., Eskew, E. A., Bird, B. H., Ghersi, B. M., Bangura, J., Fichet-Calvet, E., Remien, C. H., Vandi, M., Bah, M., & Nuismer, S. L. (2023). Predicting the fine-scale spatial distribution of zoonotic reservoirs using computer vision. Ecology Letters, 26(11). https://doi.org/10.1111/ele.14307
Leopardi, S., Desiato, R., Mazzucato, M., Orusa, R., Obber, F., Averaimo, D., Berjaoui, S., Canziani, S., Capucchio, M. T., Conti, R., Di Bella, S., Festa, F., Garofalo, L., Lelli, D., Madrau, M. P., Mandola, M. L., Moreno Martin, A. M., Peletto, S., Pirani, S., … Terregino, C. (2023). One health surveillance strategy for coronaviruses in Italian wildlife. Epidemiology and Infection, 151. https://doi.org/10.1017/S095026882300081X
Li, L., Zhang, L., Zhou, J., He, X., Yu, Y., Liu, P., Huang, W., Xiang, Z., & Chen, J. (2022). Epidemiology and Genomic Characterization of Two Novel SARS-Related Coronaviruses in Horseshoe Bats from Guangdong, China. MBio, 13(3). https://doi.org/10.1128/mbio.00463-22
Li, M., Du, J., Liu, W., Li, Z., Lv, F., Hu, C., Dai, Y., Zhang, X., Zhang, Z., Liu, G., Pan, Q., Yu, Y., Wang, X., Zhu, P., Tan, X., Garber, P. A., & Zhou, X. (2023). Comparative susceptibility of SARS-CoV-2, SARS-CoV, and MERS-CoV across mammals. ISME Journal, 17(4). https://doi.org/10.1038/s41396-023-01368-2
López-Aguirre, C., Hand, S. J., Laffan, S. W., & Archer, M. (2018). Phylogenetic diversity, types of endemism and the evolutionary history of New World bats. Ecography, 41(12). https://doi.org/10.1111/ecog.03260
Lumbierres, M., Dahal, P. R., Soria, C. D., Di Marco, M., Butchart, S. H. M., Donald, P. F., & Rondinini, C. (2022). Area of Habitat maps for the world’s terrestrial birds and mammals. Scientific Data, 9(1). https://doi.org/10.1038/s41597-022-01838-w
Markotter, W., Coertse, J., De Vries, L., Geldenhuys, M., & Mortlock, M. (2020). Bat-borne viruses in Africa: a critical review. In Journal of Zoology (Vol. 311, Issue 2). https://doi.org/10.1111/jzo.12769
Mollentze, N., & Streicker, D. G. (2020). Viral zoonotic risk is homogenous among taxonomic orders of mammalian and avian reservoir hosts. Proceedings of the National Academy of Sciences of the United States of America, 117(17). https://doi.org/10.1073/pnas.1919176117
Morse, S. S., Mazet, J. A. K., Woolhouse, M., Parrish, C. R., Carroll, D., Karesh, W. B., Zambrana-Torrelio, C., Lipkin, W. I., & Daszak, P. (2012). Prediction and prevention of the next pandemic zoonosis. In The Lancet (Vol. 380, Issue 9857). https://doi.org/10.1016/S0140-6736(12)61684-5
Muylaert, R. L., Kingston, T., Luo, J., Vancine, M. H., Galli, N., Carlson, C. J., John, R. S., Rulli, M. C., & Hayman, D. T. S. (2022). Present and future distribution of bat hosts of sarbecoviruses: Implications for conservation and public health. Proceedings of the Royal Society B: Biological Sciences, 289(1975). https://doi.org/10.1098/rspb.2022.0397
Olival, K. J., Hosseini, P. R., Zambrana-Torrelio, C., Ross, N., Bogich, T. L., & Daszak, P. (2017). Host and viral traits predict zoonotic spillover from mammals. Nature, 546(7660). https://doi.org/10.1038/nature22975
Pandit, P. S., Doyle, M. M., Smart, K. M., Young, C. C. W., Drape, G. W., & Johnson, C. K. (2018). Predicting wildlife reservoirs and global vulnerability to zoonotic Flaviviruses. Nature Communications, 9(1). https://doi.org/10.1038/s41467-018-07896-2
Pfefferle, S., Oppong, S., Drexler, J. F., Gloza-Rausch, F., Ipsen, A., Seebens, A., Müller, M. A., Annan, A., Vallo, P., Adu-Sarkodie, Y., Kruppa, T. F., & Drosten, C. (2009). Distant relatives of severe acute respiratory syndrome coronavirus and close relatives of human coronavirus 229E in bats, Ghana. Emerging Infectious Diseases, 15(9). https://doi.org/10.3201/eid1509.090224
Plowright, R. K., Becker, D. J., Crowley, D. E., Washburne, A. D., Huang, T., Nameer, P. O., Gurley, E. S., & Han, B. A. (2019). Prioritizing surveillance of nipah virus in India. PLoS Neglected Tropical Diseases, 13(6). https://doi.org/10.1371/journal.pntd.0007393
Plowright, R. K., Parrish, C. R., McCallum, H., Hudson, P. J., Ko, A. I., Graham, A. L., & Lloyd-Smith, J. O. (2017). Pathways to zoonotic spillover. In Nature Reviews Microbiology (Vol. 15, Issue 8). https://doi.org/10.1038/nrmicro.2017.45
Polikar, R. (2012). Ensemble learning. In Ensemble Machine Learning: Methods and Applications. https://doi.org/10.1007/978-1-4419-9326-7_1
Quan, P. L., Firth, C., Street, C., Henriquez, J. A., Petrosov, A., Tashmukhamedova, A., Hutchison, S. K., Egholm, M., Osinubi, M. O. V., Niezgoda, M., Ogunkoya, A. B., Briese, T., Rupprecht, C. E., & Lipkin, W. I. (2010). Identification of a severe acute respiratory syndrome coronavirus-like virus in a leaf-nosed bat in Nigeria. MBio, 1(4). https://doi.org/10.1128/mBio.00208-10
Reusken, C. B. E. M., Haagmans, B. L., Müller, M. A., Gutierrez, C., Godeke, G. J., Meyer, B., Muth, D., Raj, V. S., De Vries, L. S., Corman, V. M., Drexler, J. F., Smits, S. L., El Tahir, Y. E., De Sousa, R., van Beek, J., Nowotny, N., van Maanen, K., Hidalgo-Hermoso, E., Bosch, B. J., … Koopmans, M. P. G. (2013). Middle East respiratory syndrome coronavirus neutralising serum antibodies in dromedary camels: A comparative serological study. The Lancet Infectious Diseases, 13(10). https://doi.org/10.1016/S1473-3099(13)70164-6
Revell, L. J. (2012). phytools: An R package for phylogenetic comparative biology (and other things). Methods in Ecology and Evolution, 3(2). https://doi.org/10.1111/j.2041-210X.2011.00169.x
Robles-Fernández, Á. L., Santiago-Alarcon, D., & Lira-Noriega, A. (2022). Wildlife susceptibility to infectious diseases at global scales. Proceedings of the National Academy of Sciences of the United States of America, 119(35). https://doi.org/10.1073/pnas.2122851119
Ruiz-Aravena, M., McKee, C., Gamble, A., Lunn, T., Morris, A., Snedden, C. E., Yinda, C. K., Port, J. R., Buchholz, D. W., Yeo, Y. Y., Faust, C., Jax, E., Dee, L., Jones, D. N., Kessler, M. K., Falvo, C., Crowley, D., Bharti, N., Brook, C. E., … Plowright, R. K. (2022). Ecology, evolution and spillover of coronaviruses from bats. In Nature Reviews Microbiology (Vol. 20, Issue 5). https://doi.org/10.1038/s41579-021-00652-2
Sagi, O., & Rokach, L. (2018). Ensemble learning: A survey. In Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery (Vol. 8, Issue 4). https://doi.org/10.1002/widm.1249
Sánchez, C. A., Li, H., Phelps, K. L., Zambrana-Torrelio, C., Wang, L. F., Zhou, P., Shi, Z. L., Olival, K. J., & Daszak, P. (2022). A strategy to assess spillover risk of bat SARS-related coronaviruses in Southeast Asia. Nature Communications, 13(1). https://doi.org/10.1038/s41467-022-31860-w
Shehata, M. M., Chu, D. K. W., Gomaa, M. R., Abisaid, M., El Shesheny, R., Kandeil, A., Bagato, O., Chan, S. M. S., Barbour, E. K., Shaib, H. S., McKenzie, P. P., Webby, R. J., Ali, M. A., Peiris, M., & Kayali, G. (2016). Surveillance for coronaviruses in bats, Lebanon and Egypt, 2013–2015. In Emerging Infectious Diseases (Vol. 22, Issue 1). https://doi.org/10.3201/eid2201.151397
Smith, K. F., Goldberg, M., Rosenthal, S., Carlson, L., Chen, J., Chen, C., & Ramachandran, S. (2014). Global rise in human infectious disease outbreaks. Journal of the Royal Society Interface, 11(101). https://doi.org/10.1098/rsif.2014.0950
Soria, C. D., Pacifici, M., Di Marco, M., Stephen, S. M., & Rondinini, C. (2021). COMBINE: a coalesced mammal database of intrinsic and extrinsic traits. Ecology, 102(6). https://doi.org/10.1002/ecy.3344
Sundaram, M., Schmidt, J. P., Han, B. A., Drake, J. M., & Stephens, P. R. (2022). Traits, phylogeny and host cell receptors predict Ebolavirus host status among African mammals. PLoS Neglected Tropical Diseases, 16(12). https://doi.org/10.1371/journal.pntd.0010993
Tonelli, A., Caceres-Escobar, H., Blagrove, M., Wardeh, M., & Di Marco, M. (2023). Identifying life-history patterns along the fast-slow continuum of mammalian viral carriers.
Viana, M., Mancy, R., Biek, R., Cleaveland, S., Cross, P. C., Lloyd-Smith, J. O., & Haydon, D. T. (2014). Assembling evidence for identifying reservoirs of infection. In Trends in Ecology and Evolution (Vol. 29, Issue 5). https://doi.org/10.1016/j.tree.2014.03.002
Wang, W., Lin, X. D., Guo, W. P., Zhou, R. H., Wang, M. R., Wang, C. Q., Ge, S., Mei, S. H., Li, M. H., Shi, M., Holmes, E. C., & Zhang, Y. Z. (2015). Discovery, diversity and evolution of novel coronaviruses sampled from rodents in China. Virology, 474. https://doi.org/10.1016/j.virol.2014.10.017
Wardeh, M., Baylis, M., & Blagrove, M. S. C. (2021). Predicting mammalian hosts in which novel coronaviruses can be generated. Nature Communications, 12(1). https://doi.org/10.1038/s41467-021-21034-5
Wardeh, M., Blagrove, M. S. C., Sharkey, K. J., & Baylis, M. (2021). Divide-and-conquer: machine-learning integrates mammalian and viral traits with network features to predict virus-mammal associations. Nature Communications, 12(1). https://doi.org/10.1038/s41467-021-24085-w
Wille, M., Geoghegan, J. L., & Holmes, E. C. (2021). How accurately can we assess zoonotic risk? PLoS Biology, 19(4). https://doi.org/10.1371/journal.pbio.3001135
Yang, Y. Le, Yu, J. Q., & Huang, Y. W. (2020). Swine enteric alphacoronavirus (swine acute diarrhea syndrome coronavirus): An update three years after its discovery. In Virus Research (Vol. 285). https://doi.org/10.1016/j.virusres.2020.198024
Zhao, X., Chen, D., Szabla, R., Zheng, M., Li, G., Du, P., Zheng, S., Li, X., Song, C., Li, R., Guo, J.-T., Junop, M., Zeng, H., & Lin, H. (2020). Broad and Differential Animal Angiotensin-Converting Enzyme 2 Receptor Usage by SARS-CoV-2. Journal of Virology, 94(18). https://doi.org/10.1128/jvi.00940-20

Additional Declarations

The authors declare no competing interests.
