Bee occurrence records
We collected publicly available records of bees (Hymenoptera: Apoidea) in Canada and continental U.S.A. from the BeeBDC database 15, 2,664,435 observations from 3,361 species. We then excluded spurious observations, such as those with unflagged geospatial problems (e.g., incomplete or unlikely coordinates), as well as paleontological records and records lacking basis information (i.e., whether the observation was based on a living or preserved specimen). We then excluded observations with invalid or problematic basis, coordinates, coordinates reprojection, datum, date, URI reference, or type status, retaining 2,540,129 records from 3,353 species. We selected all unique records by taxon ID, full name, family, species, references, and coordinates, resulting in 964,887 unique records. We cleared all taxonomic issues, using the taxonomic backbone in Discover Life 25. We also excluded all bee records from Hawaii (i.e., 1,409 records of 44 species), as well as any observations of exotic species not registered in North America (12,432 records of 72 species). We excluded 274 bee records from 6 species whose coordinates had no associated urbanization or climate information (e.g., points over water). We selected species with at least 5 independent records (excluding 1,408 observations and 714 species), and grid cells that included at least 5 observations (excluding 83,109 observations). The final dataset included 866,247 unique records from 2,553 species. The data cleaning process was conducted in R v. 4.2.0 using packages ‘rgbif’ 26, ‘scrubr’ 27, and coordinateCleaner 28.
Taxonomic diversity
We divided the study area in 10 x 10 km2 grid cells, and assessed taxonomic diversity as the cell bee species richness. Species richness is just a simple count of present species that does not require abundance or sampling effort data, making it convenient for analyses using synthetic data such as BeeBDC, especially since it contains data from unstructured recreational observations without sampling effort data. We used species richness to estimate taxonomic diversity while controlling for sampling intensity (i.e., number of records) in our statistical models.
Phylogenetic diversity
To estimate phylogenetic diversity we used the maximum-likelihood tree of bees provided by Henríquez-Piskulich, Hugall, and Stuart-Fox (2023) 29, one of the most recent and complete global bee phylogenies. We added missing species as polytomies at the basal node of their congeners, since many bee species lack extensive genomic data 30. Additionally, we added four missing genera as polytomies of their sister genera: Melissoptila (tribe Eucerini 31), Microalictoides (related to Diadasia, Conanthalictus, Protoduforea, Sphecodosoma, and Xeralictus 32), Neopasites (paraphyletic in respect to Biastes 33) and Rhopalolemma (tribe Neolarrini 34). The grafting of missing species was conducted with R package ‘rtrees’ v.1.0.1 35. Phylogenies derived in this way are reliable for calculating community phylogenetic diversity 36.
We assessed three common metrics of phylogenetic diversity (PD), namely, mean pairwise distance (MPD 37), mean nearest taxon distance (MNTD 37), and Faith’s phylogenetic diversity (Faith’s PD 38). Mean phylogenetic distance (MPD) refers to the mean branch length separating each pair of species in the sample. Mean nearest taxon distance (MNTD) is the average distance between each species in the sample and its closest neighboring species (i.e., the closest species in the tree present in the sample). Faith’s phylogenetic diversity (Faith’s PD) is defined as the sum of the branch lengths connecting the sample species across the phylogenetic tree. We calculated all PD metrics with R v. 4.2.0 and packages ‘ape’ v.5.6.2 39 and ‘PhyloMeasures’ v.2.1 40.
Functional diversity
We evaluated five functional traits to estimate functional diversity, including body length (separately for females and males), trophic specialization, nesting location, sociality, and kleptoparasitism (Table S.7.1) (Table 1). These are important traits that determine several functional aspects, including bee floral visits, habitat use and availability, phenology, exposition to flooding, and risks of overheating, parasitism, and predation 8,14,19,41,42. These traits are also susceptible to urbanization, given the effects of urbanization on flower composition and distribution, nest availability, and prevalence of diseases and parasitism 8. We used Gower distances to compute the functional distance matrix since they can handle combinations of quantitative and qualitative traits, as well as missing data 43,44.
Table 1
Description of the selected functional traits. We obtained the measurements for each functional trait from Moreno-García, Nguyen, and Li (2024).
Functional traits | Measurement | Rationale |
Female length (mm) | Average or mid-point (for ranges) of female body length (head to last tergite) | Related to flower accessibility, resistance to overheating, predation 14. Bee size is altered by urbanization; however, the direction of this effect is unclear 8,14. |
Male length (mm) | Equivalent to female body length | We distinguished between male and female lengths due to bee sexual dimorphism 65. |
Trophic specialization | - Oligolectic/ Specialized - Polylectic/ Generalized | Describes bee diet breadth or lecty 42. Trophic specialization of non-kleptoparasitic species is based on pollen sources, that of kleptoparasites is based on nectar sources). Urbanization tends to benefit polylectic species 8. |
Nest location | - Below-ground - Above-ground - Both (hereafter ‘Opportunistic’) | Determines habitat availability as well as the species exposition to flooding, predators, and other risks 19. Urbanization tends to benefit above-ground nesting species 8. |
Sociality | - Eusocial - Solitary or primitively social (hereafter ‘Solitary’) | Influences resource gathering, distribution area, annual activity cycle, and predation risk 41. Urbanization tends to benefit social species 8. |
Kleptoparasitism | - Kleptoparasitic - Non-kleptoparasitic | Parasitic bees lay their eggs in the nests of other species to avoid parental care 19, having reduced nesting and resource gathering costs 66. Parasitic bees tend to have a narrow host selection which makes them less common in anthropogenic habitats 22. |
1Intertegular distance (ITD) is probably a better estimator of size since it is clearly defined and not highly dependent on the position of the pinned specimen. However, ITD information is especially scarce for species first described long ago and not included in very recent taxonomic keys. |
We used two metrics to quantify functional diversity (FD): mean pairwise distance (MPD 37), and mean nearest taxon distance (MNTD 37) based on functional traits. We calculated the functional MPD and MNTD using the functional distance matrix as well as a species presence matrix. Functional MPD and MNTD are equivalent to phylogenetic MPD and MNTD but rely on functional distances rather than branch lengths. We calculated all FD metrics with R v. 4.2.0 and packages ‘cluster’ v.2.1.3 45, and ‘picante’ v.1.8.2 46.
Independent variables
We studied the response of bee diversity to urbanization, which was measured using human population density and built surface 47. However, our models would be spurious if we ignored the effects of other important environmental drivers such as climate 18 and agriculture 21,48. Hence, we studied the effects of urbanization, climate, and land use on bee diversity (i.e., bee richness, phylogenetic, and functional diversity). We assessed the climatic conditions by quantifying the effects of temperature, precipitation, and solar radiation 18. To quantify land use conditions, we calculated cropland proportion as well as the diversity of land uses within each grid cell. Additionally, we controlled the effects of other confounding variables, including the number of records, and the spatial correlation, and ecoregion of the samples.
We measured urbanization as the product between population density and percent of built surface. We obtained human population density data from the Global Human Settlement Layer for 2020 at a resolution of 1 km (GHSL 49). We selected the data for 2020 since most of our records were recorded during that period. We re-projected the points at a resolution of 10 km to match the bee occurrence data and extracted the values of human population density for each grid cell. We obtained the built surface, cropland proportion, and land use diversity from Dynamic World 50 using the average map from 2020, which classified each grid cell into one of nine categorical land uses (i.e., water, trees, grass, flooded vegetation, crops, shrub and scrub, built, bare, and snow and ice 51). We used cropland proportion and land use heterogeneity since both agriculture and habitat heterogeneity have extensive effects on bee diversity 11,14,48. For the proportion of built area and cropland, we counted the 0.01 x 0.01 km2 (10 x 10 m2) Dynamic World grid cells contained within each 10 x 10 km2 grid cells that were classified as cropland or built surface using Google Earth Engine 51. For the diversity of land uses, we used the Shannon-Weaver index 52 to calculate the diversity of land uses among the 0.01 x 0.01 km2 grid cells within each of the 10 x 10 km2 grid cells using Google Earth Engine. We obtained the climate data from WorldClim v.2.1 53. We downloaded the climate data for 1970–2000 at a resolution of 2.5’ (approximately 5 km at the equator). While the temporal period covered by WorldClim is not ideal for the study, it contains geographically consistent measurements of average temperature, precipitation, and solar radiation. We reprojected the data at a resolution of 10 km and extracted the climate data at the grid cell center. The centers of some grid cells were located on water bodies, so we used the average over the coordinates of the bee occurrences within those grid cells instead. We obtained the map of ecoregions level II from the US EPA portal 54. We extracted the ecoregion value for the center of each 10 x 10 km2 grid cell. Some grid cells did not return any values (e.g., located within water bodies). Thus, we used buffers of increasing size to assign ecoregion values and visually resolved all the cases in which a grid cell was added to either multiple or no regions.
Statistical methods
To evaluate the effects of urbanization, climate, agriculture, and landscape heterogeneity on bee diversity, we used linear mixed models with the bee diversity metrics as response variables and land use, climate, and sampling intensity explanatory variables. To account for spatial autocorrelations among grid cells, we included Gaussian correlations among the coordinates of grid cell centers in the model and included a random effect of ecoregion at the grid center. We fitted six models, one for each of the bee diversity metrics (species richness, phylogenetic MPD, MNTD, and Faith’s PD, and functional MPD, and MNTD). Each model included seven explanatory variables: urbanization (product of population density and built surface proportion), temperature, precipitation, solar radiation, cropland proportion, Shannon-Weaver diversity of land uses, and number of bee occurrences (as an estimate of sampling intensity). We used variance inflation factors (VIF) to detect potential collinearity between covariates and found that all VIF were below 2 55 but those of solar radiation (3.5, correlation with temperature = 0.64, correlation with precipitation = -0.57) and temperature (2.7). We decided to maintain temperature and solar radiation in the model following a prior analysis of global bee diversity 18. We log-transformed urbanization, cropland proportion, and bee sampling intensity. We scaled and centered all explanatory variables (mean = 0, std. dev. = 1). We calculated the marginal and conditional R2 using Nakagawa and Schielzeth’s method 56,57. We repeated the analyses excluding exotic species to assess the robustness of our results to species introductions (introductions are more common in urban areas 3). We also repeated the models using standardized effect sizes (SES) of phylogenetic and functional diversity based on species pool within each ecoregion level II 54 to control for the regional bee richness. The results for the models excluding exotic species and SES metrics, as well as the raw models with outliers are located respectively, in Supplementary Materials Appendices S.3-S.5. The models for the SES for functional diversity were very consistent, whereas those for phylogenetic diversity revealed some interesting discrepancies. Any important differences between the main results and those of the additional models are discussed in the results section.
To test whether different bee families and species with different functional traits are associated with different environmental conditions, we used linear mixed models with the values of the environmental variables as response variables, the categorical bee families and functional traits as explanatory variables (bee family, nesting location, trophic specialization, sociality, and kleptoparasitism), and ecoregion level II as a random intercept. We also evaluated the effect of the environmental variables on bee length, by using female and male lengths as response variables, the environmental variables as explanatory variables, and ecoregion as a random intercept.
We assessed the phylogenetic conservatism of bee functional traits to examine possible correlations between bee phylogenetic and functional diversity. To study phylogenetic conservatism, we evaluated the correlation between the phylogenetic and functional distance matrices, and computed the phylogenetic signal of all studied functional traits but nesting location (since the estimation of phylogenetic signals for categorical variables with more than 2 levels is non-trivial 58). We evaluated the phylogenetic signal of female and male total length using Pagel’s λ, assuming Brownian motion 59, and using package ‘phyloint’ v.0.1 60. We computed the phylogenetic signal of trophic specialization, sociality, and kleptoparasitism using the D index, assuming Brownian motion 61, and using package ‘caper’ v.1.0.2 62.
For all models, we evaluated model assumptions, including normality of the residuals, homoscedasticity, and presence of outliers and influential observations. We performed all the analyses using R v.4.2.0 and packages ‘nlme’ v.3.1.157 63 and ‘MuMIn’ v.1.47.1 64.