We defined a framework to model the probability of mammalian species to be reservoirs of betacoronaviruses, based on the characteristics of currently known hosts (Fig. 1). First (Section 2.1) we identified known mammal hosts of Betacoronavirus species, including both reservoir hosts and susceptible hosts. We also identified “pseudo-negative” species, as those without known association with betacoronaviruses despite existing research efforts in their geographic location for taxonomically related species. In doing so, we also accounted for uncertainty around host status definition of susceptible hosts and species classified as pseudo-negatives (Section 2.2). Secondly, we trained an ensemble of predictive models to estimate the probability of species’ reservoir status based on life-history, ecological and phylogenetic features (Section 2.3–2.4). Following an indepdent validation protocol, we tested the performance of the models, and then predicted reservoir status of mammal species not known to be associated with betacoronaviruses (Section 2.5). Finally, we mapped the distribution of observed and predicted reservoirs to highlight global hotspots of betacoronavirus hazard.
Betacoronaviruses are positive-sense single-stranded RNA viruses with high zoonotic and pandemic potential. Within this genus, three viruses have notably crossed the animal-human barrier, leading to severe outbreaks and large-scale epidemics with human-to-human transmission: SARS-CoV (Severe Acute Respiratory Syndrome Coronavirus), MERS-CoV (Middle East Respiratory Syndrome Coronavirus), and SARS-CoV-2, responsible for the COVID-19 pandemic. All three diseases caused by betacoronaviruses have been recognised as priority diseases within the World Health Organization's blueprint, underscoring the importance of betacoronaviruses as a public health risk. These characteristics make betacoronaviruses an excellent case-study to test and validate our modelling pipeline for hazard prediction.
2.1 Definition of host status
Known hosts. We obtained data on association of mammalian species with the target viral genus from VIRION, the most complete database of the vertebrates’ virome to date (Carlson, Gibb, et al., 2022). The database contains host–virus associations obtained with different detection methods, namely (in descending order of strength of evidence): virus isolation, detection of viral nucleic acid, serology. In this way, we were able to identify 213 positive species, i.e., species where betacoronaviruses, or antibodies thereof, have been detected. To train our models with more-detailed information on host status, we further classified positive species in two categories: (1) susceptible hosts (potential reservoirs) and (2) confirmed reservoir hosts, based on a systematic screening of the scientific literature. Information on positive species’ category (reservoir host or susceptible host) were used to assign different relative importance to the species during model training, so that reservoir hosts gave a bigger contribution to model estimation compared to susceptible hosts (Section 2.2).
We defined reservoir hosts as species which met two criteria: individuals can acquire the virus in natural circumstances; the virus can persist in the host population for a prolonged period. T o identify reservoir hosts, we surveyed Web of Science by querying a string made of the scientific binomial of positive species (e.g., ‘Pteropus vampyrus’) AND the term ‘reservoir’ AND the name of the viral genus (i.e., ‘Betacoronavirus’). This step was performed in R using the packages ‘httr’ (Wickham & Wickham, 2022) and ‘jsonlite’ (Ooms, 2014). We only retained records from observational studies where samples (e.g., oral swabs, rectal swabs, and/or whole blood) were collected from wild animals, reflecting natural exposure to the virus. According to our working definition of reservoir hosts, we screened resulting articles searching for temporal and spatial evidence of ongoing infection (maintenance) of one or more virus species belonging to the target genus in the candidate reservoir host population. The 25 species that met the definition and were included as betacoronavirus reservoirs are listed in Table S1. We considered susceptible hosts all those species for which evidence of a betacoronavirus infection was found, but that failed to meet the ‘reservoir host’ conditions. This resulted in the identification of 177 susceptible hosts for which life-history traits, phylogeny, and geographical data were available.
Pseudo-negatives. Available data on host–virus associations do not tipically include negative associations (i.e., data on species that have been tested and found unsusceptible to a given virus), as negative results are less likely to be published and species-wide non-susceptibility to a virus is difficult to prove (Wardeh et al., 2021). Despite acknowledging this limitation, most studies that aim to identify likely reservoir hosts customarily include all the species currently not known to carry the target pathogen as “negatives” in their analysis (Becker et al., 2022; Pandit et al., 2018). This corresponds to using ‘lack of evidence’ as ‘evidence of absence’, albeit with a high risk of generating false negatives (i.e., understudied positive species mistakenly classified as negatives). We argue that better selection of negative species would mitigate such errors in input data, reducing noise in this type of analyses and yielding greater robustness of predictions.
In order to reduce the risk of including false negatives (i.e., unrecognised positive species) in our analysis, we designed a methodology to identify “pseudo-negatives” among species that are likely to have undergone virological sampling but have no documented associations with the target virus. Virological sampling strategies in wildlife generally adhere to taxonomic and geographical patterns (Gibb et al., 2022), so that sampling efforts focus on species that belong to the same taxa and geographical area of known hosts to maximise viral discovery (Becker et al., 2019; Carroll et al., 2018). We integrated these sampling patterns into the definition of pseudo-negatives. Specifically, we classified a species without any known association to the target virus as a pseudo-negative if at least 50% of its geographical range overlapped with the range of one or more positive species from the same taxonomic family. Positive species include both susceptible hosts and reservoir hosts associated with betacoronaviruses. Viral sharing probability between phylogenetically related mammals peaks at geographical overlap values of 50% (Albery et al., 2020), suggesting that closely related species with high spatial overlap are more likely to have been sampled for the same virus. Species’ geographical distributions were represented based on area of habitat maps (AOH) from Lumbierres et al. (2022). We used AOHs as they provide more accurate information on species distribution than species’ range polygons, as portions of the range that are not suitable for the species are excluded. We tested if our use of geographic overlap introduced a bias, by favouring either large-ranged or small-ranged species as pseudo-negatives. We ran a Wilcoxon test to assess whether our selection of pseudo-negatives was biased by species’ range size, and found that the median range size of pseudo-negatives was significantly larger than the median range size of a random sample of mammals (p < 0.001) but also significantly smaller than the median range size of positive species (p < 0.001). While this comparison highlighted that neither positive species nor pseudo-negative species are a representative sample of the average mammal in terms of range size (Figure S1), larger-than-random range size of pseudo-negatives is actually desirable as it may counterbalance the exceptionally large range size of positive species, which is probably due to a combination of sampling bias and exposure to pathogens (Choo et al., 2023). Our pseudo-negative protocol identified 1,117 pseudo-negative species that met the criteria. Assuming that species which live in sympatry with many known positive species are more likely to have been sampled than species sympatric with fewer positive species, we used the number of overlaps alongside other information to assign different “weights” to pseudo-negative species during model training (Section 2.2). To assess the effect of the inclusion of pseudo-negatives on model performance, we repeated the analysis including all species from positive families that are not known to be associated with betacoronaviruses (n = 3,618) and weighted them equally in the models following Eq. 1. We assessed any significant performance improvements given by the inclusion of pseudo-negatives through a Wilcoxon signed-ranks test, a robust non-parametric test for statistical comparisons of classifiers (Demšar, 2006).
2.2 Accounting for uncertainty and sampling bias
To address the inherent uncertainty associated with host status definition, we developed instance weights that enabled us to incorporate sampling effort and the varying degrees of host status evidence into model training. Such weights are species-specific and determine the relative influence of each individual data point (i.e., species) on the predictive models. By implementing instance weights, we tried to mitigate uncertainty and noise in input data, putting the model in condition to target those species that provide a greater amount of information such as reservoir hosts and well-studied pseudo-negatives. This approach enabled our model to prioritise pattern learning from data instances that have greater information, enhancing model ability to identify potential viral reservoirs when provided with species with unknown host status. Weights (\(w\)) within reservoir hosts, susceptible hosts and pseudo-negatives were divided by their sum to ensure that values were proportionally adjusted and consistent with Eq. 1,
\({\sum }_{r = 1}^{{n}_{r}}{w}_{reservoirs}+{\sum }_{pr = 1}^{{n}_{pr}}{w}_{susceptible hosts}={\sum }_{p =1}^{{n}_{p}}{w}_{pseudo-negatives}\) | Eq. 1 |
where \({n}_{r}\), \({n}_{s}\), \({n}_{p}\) are the number of reservoirs, susceptible species, and pseudoabsences in the dataset (respectively). As shown in Eq. 1, we imposed positive instances (reservoir and susceptible hosts) to have an equal total weight to pseudo-negative instances. This allowed us to balance relative class proportion and avoid overfitting to the overrepresented class (pseudo-negatives).
We quantified sampling effort in different ways for positive and pseudo-negative species. For susceptible hosts, sampling effort (\(Seff\)) was measured as the total number of associations of the species with the target viral genus according to VIRION. For pseudo-negative species, sampling effort (\({Seff}^{*}\)) was represented as the mean number of associations between the species’ family and the target viral genus as indexed in VIRION. We assigned the maximum weight to all reservoir hosts, hence there was no need to estimate sampling effort for them. Relationships between weights and sampling effort for reservoirs, susceptible species and pseudo-negatives are described in Equations 2–4 (see also Fig. 2).
Weights of reservoirs are constant, i.e., assuming certainty of the available information:
\({w}_{reservoi{r}_{i} }=\frac{1}{{n}_{r}}\) | Eq. 2 |
Weights of susceptible species are penalised based on sampling effort (\(Seff\)) of each species, i.e., assuming a reduced probability of being a true reservoir if reservoir status was not confirmed despite high sampling effort:
\({w}_{{susceptible}_{i}}= \frac{1}{Log\left({1+Seff}_{i}\right)}\) | Eq. 3 |
Weights of pseudo-negatives are proportional to sampling effort (\({Seff}^{*}\)) and adjusted for the number of overlaps (> 50% of the species’ range) of the pseudo-negative on the range of positive species (\({n}_{overlaps}\)). This assumes higher certainty of “true negative” status for species without viral information that are potentially subject to high sampling efforts:
\({w}_{{pseudo-negative}_{i}}=\left(1+ {{Seff}^{*}}_{i}\right)\times {n}_{overlaps}\) | Eq. 4 |
2.3 Predictors of host status
We obtained predictors of host status from different sources, focussing on biological traits which have been demonstrated to correlate with viral host status in mammals (Tonelli et al., 2023; Wardeh et al., 2021). Mammalian life-history traits were retrieved from the COMBINE dataset (Soria et al., 2021). We checked for collinearity among traits and selected a subset with low correlation (vif < 5) which included: body mass, gestation length, longevity, litter size, litters per year, and weaning age. We also included the mean of pairwise phylogenetic distances (million of years) of each species from other mammals to account for phylogenetic covariance of species traits, and the mean of pairwise phylogenetic distances from known hosts to account for species’ phylogenetic proximity to the known host range of betacoronaviruses. Phylogenetic distances were calculated using the phytools package in R (Revell, 2012) based on a phylogenetic tree of mammals obtained from PHYLACINE (Faurby et al., 2018). We chose to randomly pick one tree out of those provided by PHYLACINE, given that distances calculated over 100 alternative trees were highly collinear (Pearson’s r > 0.99). We selected temperature and precipitation variables available in WorldClim2 to represent bioclimatic conditions within each species’ range (Fick & Hijmans, 2017). Mean value and standard deviation of each bioclimatic variable was extracted from within species’ ranges, and range size of the species (Km2) was also included as a predictor in the analysis after logarithmic (log10) transformation.
2.4 Modelling reservoir status
Model ensemble. We used a stacked ensemble to predict reservoir status (1–positive, 0–negative) as a function of species traits. Stacked ensembles combine multiple base models (in our case binary classifiers) and aggregate their outputs to obtain the final predictions (Polikar, 2012). By aggregating the outputs of multiple classifiers, any errors made by individual classifiers are likely to be balanced out by other classifiers, often resulting in better predictive performance compared to that of single classifiers (Sagi & Rokach, 2018). We opted for a set of classifiers that encompasses diverse algorithms. We trained four classifiers: random forest (RF), extreme gradient boosting (XGBoost), single layer neural network (NNET) and generalised additive model (GAM). All models were trained and tuned in R using the mlr3 package (Lang et al., 2019). Information on hyperparameters that have been tuned for the different classifiers is provided in Table S2.
Tuning and validation. Each classifier was tuned and validated with a repeated (n = 20) nested cross validation, a procedure that avoids over-fitting in model selection and provides robust and unbiased performance evaluation (Cawley & Talbot, 2010). A nested cross validation has an inner cross validation loop (used for hyperparameter tuning) nested within an outer cross validation (used for model validation). In each iteration, the dataset was split into five outer folds stratified by host status category and mammalian order so that each fold had the same proportion of pseudo-negatives, susceptible hosts, and reservoir hosts. Each one of the five outer folds contains a training set (80% of the data) and a testing set (20% of the data). The training set of each outer fold was further split into three inner folds, each containing a training set (67% of the training set of the outer fold) and an assessment set (33% of the training set of the outer fold). For each classifier we used bayesian optimisation to navigate within the hyperparameter space. The hyperparameter combination that yielded the best average performance on the different assessment sets, quantified as true skill statistics (TSS; Allouche et al., 2006), was then fit on the training set and used to predict the test set within the outer fold. We adopted a probability threshold of 0.5 to separate positive from negative species.
We assessed the performance of the model ensemble by aggregating test set predicted probabilities of the single classifiers via weighted average and comparing them to observed status. By using a weighted average, we used individual performances of the four classifiers in the ensemble to adjust their contributions to final predictions. In this way, predictions of high-performance classifiers received higher weight than those of weak classifiers, resulting in increased classification performance (Polikar, 2012). Averaged predicted probability \({{\mu }}_{\text{i}}\left(\text{x}\right)\) of reservoir status for a given species i was computed as follows (Eq. 5):
\({\mu }_{i}\left(x\right)= \frac{1}{T}\sum _{t=1}^{T}{w}_{t}{p}_{t,i }\left(x\right)\) | Eq. 5 |
where the performance weight \({\text{w}}_{\text{t}}\) of a given classifier t (out of T classifiers) is its TSS estimate on the training set, divided by the sum of TSS across all classifiers (so that weights sum to 1); \({p}_{t,i}\left(x\right)\) is the predicted probability of reservoir status for species i, outputted by classifier t. Once we obtained ensemble predictions, performance was assessed by computing nine metrics (Table S3). As the nested routine was repeated 20 times, we computed performance metrics for each of the 100 hold out test sets.
2.5 Host status prediction
We predicted likely reservoirs of betacoronaviruses among mammalian species that (1) are not known to host betacoronaviruses (unknown status), (2) are known to be susceptible to betacoronaviruses but are not confirmed reservoirs. We chose to only predict host status within mammalian families where positive species are already observed, as patterns that underly observed associations between betacoronaviruses and mammals might not be generalisable across widely different mammalian taxa; thus, we provide more conservative predictions of viral hazard. We predicted betacoronavirus host status of 3,893 mammalian species using the model ensemble. The prediction process was repeated 100 times to furtherly account for prediction uncertainty, each time applying the different model structures used during the nested cross-validation routine. We then mapped predicted hotspots of betacoronavirus hazard using species’ AOH maps and summing the number of predicted and observed reservoir hosts per 10km×10km grid cell.