In a region characterized by a long taxonomic and naturalistic tradition, our analyses reveal high heterogeneity in the spatial distribution of completeness percentages, as well as variability between taxa in the number of spatial units that can be considered well surveyed. These findings are not novel: geographical and taxonomic shortcomings and biases in biodiversity databases have been acknowledged on numerous occasions and across various organisms and data sources (e.g. Dennis and Thomas, 2000; Meyer et al. 2016; Titley et al. 2017; Troudet et al. 2017; Hughes et al. 2021; García-Roselló et al. 2023). These shortcomings raise doubts about the reliability of these data for providing a comprehensive understanding of biodiversity distribution, which is crucial for establishing effective conservation strategies (Rocchini et al. 2023). Regarding geographical inequalities, our European data show a distinct latitudinal gradient in the occurrence of probably well-surveyed cells. This pattern runs parallel to the southward increase in biological diversity (Myers et al. 2000) and the northward increase in taxonomic resources and workforce (dos Santos et al. 2020). However, a steeper longitudinal gradient in cell completeness values is also apparent, likely due in part to the limited focus of GBIF’s objectives in Eastern Europe (Gaiji et al. 2013).
When examining the disparities among taxonomic groups, we observe that the proportion of 30-arcminute cells with completeness values equal to or higher than 90% ranges from 40% of the total in Aves to 0.4% in Insecta (29.3% and 0.1%, respectively, for completeness values equal to or higher than 95%). The completeness patterns of the 20 studied classes can be divided into two groups, clearly discriminated by the amount of information available about them. The data available for vertebrates and vascular plants are nearly fifty times more abundant than those available for invertebrates and mosses, and the number of cells with completeness values of at least 90% is eight times greater. The information is taxonomically biased because, in general, a smaller part of the data corresponds to the more diversified groups such as invertebrates, but this gradient in the amount of information is also manifested geographically. Thus, the worst-surveyed groups exhibit high completeness values only in some places in the northern and central regions of Western Europe, while this pattern widens in the better-surveyed groups, tending to encompass the entire western part of the subcontinent. The conclusion is nevertheless that even a region with a long taxonomic tradition shows high heterogeneity in the taxonomic and geographic distribution of its completeness values. This allows us to examine how this inequality influences the capacity to obtain a reliable sample capable of representing the environmental variability of the territory.
The results show that a small number of 30-arcminute cells can cover a large share of the total environmental variability of the territory, as measured by the %MESS variable. Selecting only 5% of the 30-arcminute European cells at random is sufficient to represent approximately 92% of the total environmental variability. However, in cells with completeness values equal to or higher than 90% (around 10.8% of total cells, on average), the mean represented environmental variability barely reaches 54% (minimum = 17%, maximum = 85%, depending on the taxonomic class). The mean %MESS value is even lower (34%) if the cells with completeness values equal to or higher than 95% are selected (5.9% of the total). This result demonstrates the uncoordinated and contingent character of the accumulation of biodiversity information, and the need for additional effort, which should be more intense for those taxa whose data have a lower geographical coverage.
We can consider the random selection of spatial units a relatively efficient way of obtaining the environmental representation of a territory. Thus, the difference between the %MESS value obtained from cells with completeness equal to or higher than 90% and that obtained from a similar number of cells randomly selected across Europe can be considered a measure of how far the data of a taxonomic group are from adequate environmental coverage. In the case of Europe, this difference is negatively correlated with the number of European cells in which each class is present (r = -0.525; p < 0.02). Thus, the larger the spatial coverage of a group's data, the more efficient the environmental representativeness of its data. Non-vascular plants and invertebrates, being under-surveyed, showed much smaller %MESS values (35.8% on average; maximum = 84.6%, minimum = 17.7%) than vertebrate and vascular plant classes (75.4% on average; maximum = 66.7%, minimum = 62.7%). However, given the shape of the growth curve relating environmental representativeness to the number of spatial cells added, these much less surveyed taxa showed a potentially higher rate of increase in their environmental representativeness with the addition of new data. Although the differing amount of data available for each class would be the main factor explaining the degree of environmental representativeness of each group's data, our results suggest that some class-specific attributes of the compiled information, and/or of the distribution and environmental adaptations of the organisms, could also play a role.
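The comparison with a random baseline described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the procedure used in our analyses: %MESS is approximated here as the percentage of all cells whose values lie within the per-variable range of the selected cells (that is, cells that would not require extrapolation), and the function names, the synthetic environmental matrix, and the number of random draws are all hypothetical.

```python
import numpy as np

def pct_covered(env, selected):
    """Percentage of all cells whose environmental values fall inside the
    per-variable min-max range of the selected cells (assumed %MESS proxy)."""
    ref = env[selected]
    lo, hi = ref.min(axis=0), ref.max(axis=0)
    inside = np.all((env >= lo) & (env <= hi), axis=1)
    return 100.0 * inside.mean()

def random_baseline(env, n_cells, n_draws=200, seed=0):
    """Mean coverage obtained from random selections of n_cells cells."""
    rng = np.random.default_rng(seed)
    scores = [
        pct_covered(env, rng.choice(len(env), size=n_cells, replace=False))
        for _ in range(n_draws)
    ]
    return float(np.mean(scores))

def coverage_gap(env, well_surveyed, n_draws=200, seed=0):
    """Random-baseline coverage minus the coverage of the well-surveyed cells.
    Larger values mean the data are further from adequate coverage."""
    observed = pct_covered(env, well_surveyed)
    expected = random_baseline(env, len(well_surveyed), n_draws, seed)
    return expected - observed

# Synthetic example: 2000 cells, 3 environmental variables (hypothetical data).
rng = np.random.default_rng(42)
env = rng.normal(size=(2000, 3))
# Hypothetical well-surveyed cells, clustered at one end of the first gradient.
well_surveyed = np.argsort(env[:, 0])[:200]
print("coverage of well-surveyed cells:", pct_covered(env, well_surveyed))
print("gap to random baseline:", coverage_gap(env, well_surveyed))
```

In this sketch, a gap near zero would indicate that the well-surveyed cells represent the environment about as well as a random sample of the same size, whereas a large positive gap would indicate environmentally clustered survey effort.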
Correlative Species Distribution Models may be used to forecast the occurrence of a taxon in the absence of exhaustive information (Guisan et al. 2017). Unfortunately, the lack of information in some localities and the low level of completeness in others lead to an unknown number of false absences, which hinders model estimations (Lobo, 2016). Another requirement of these modelling exercises is that the response variable should be well distributed across the gradient of environmental conditions existing in the selected region. When this is not the case, model predictions will extrapolate beyond the range of environmental conditions observed during model building (Jiménez-Valverde et al. 2013; Yackulic et al. 2013). Our study indicates that the lack of completeness is widespread across many groups and regions in Europe. Evidently, much more information exists than is present in GBIF. However, most of this information is not freely accessible and remains hidden (Hochkirch et al. 2021). Consequently, the use of the available information on the identity and distribution of organisms in biodiversity assessments and conservation efforts still requires strategic sampling approaches and additional efforts to make the currently hidden biodiversity data accessible to the public (Jetz et al. 2012). Furthermore, these extra compilation efforts should be directed mainly towards those spatial units capable of improving the environmental representation currently provided by the spatial units considered well surveyed. This is the only way to obtain a representative sample capable of producing effective interpolations and reliable predictions of species distributions.
Considering the magnitude and speed of the biodiversity crisis (Glaubrecht, 2023), it seems reasonable that humankind should not wait for reliable data to plan and implement conservation measures but should instead facilitate the necessary actions to obtain the required data for those groups and regions capable of mitigating the existing biased picture of biodiversity information.