2.2 Sampling design
We used the IFN (“Inventário Florestal Nacional”, National Forest Inventory) database provided by the Brazilian Forest Service (“Serviço Florestal Brasileiro”; SFB 2019), which comprises 376 grids of 20 km x 20 km (named conglomerates) at least 20 km apart from each other (Fig. 1). We selected only 144 plots located in crystalline Caatinga (see Moro et al. 2016). Between October 2013 and August 2014, each conglomerate was installed with four sampling subunits of 20 m x 50 m each, for a total area of 4,000 m². In each subunit, all trees, cacti, and palms with diameter at breast height (DAP, from the Portuguese “diâmetro à altura do peito”) greater than or equal to 10 cm were sampled. As most woody plants in Caatinga have DAP ≤ 10 cm (SFB 2019), a smaller plot (10 m x 10 m) was installed inside the subunit, in which shrubs with a DAP between 5 and 9.9 cm, herbs, and lianas were sampled. Some species were recorded in more than one life-form. A total of 2,148 botanical samples of tree, shrub, herb, and liana individuals were collected and sent to the “Herbário Prisco Bezerra” of the Federal University of Ceará (UFC), where the species were identified with assistance from specialists, following the Flora e Funga do Brasil. Although the IFN data include both exotic and native species, we considered only native species in the following analyses (Table S1). Results with native and exotic species together are in Table S2.
2.3 Predictor variables
Standardized climatic data - For all sites, we extracted the 19 bioclimatic variables from CHELSA (www.chelsa.org.br), a high-resolution (30 arc seconds, or ~1 km at the equator) database for the 1979–2013 period (Karger et al. 2017). We retained the uncorrelated variables (r < 0.25 and p > 0.05): mean annual temperature (°C*10, MAT), mean annual precipitation (mm/year, MAP), and precipitation of driest quarter (mm/quarter). Across the study sites, these variables ranged from 23.7°C to 28.9°C, 506 mm/yr to 1,172 mm/yr, and 3 mm/quarter to 30 mm/quarter, respectively. Because the Caatinga is a semi-arid region, we considered climatic variables related to temperature and precipitation as potentially significant ecological drivers of species richness.
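The variable screening described above can be illustrated with a minimal sketch. It assumes a greedy rule (a variable is kept only if its absolute Pearson correlation with every variable already kept is below the threshold) and checks r only, not the p-value; the function names and the toy values are hypothetical and are not the study data or the authors' procedure.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def uncorrelated_subset(columns, threshold=0.25):
    """Greedily keep variables with |r| < threshold against all kept variables.

    Assumption: variables are screened in dictionary order; the source does
    not state which member of a correlated pair was dropped.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson_r(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy values for six hypothetical sites (illustration only).
toy_sites = {
    "MAT": [24, 25, 26, 27, 28, 29],
    "MAT_x10": [240, 250, 260, 270, 280, 290],  # redundant with MAT (r = 1)
    "MAP": [900, 700, 800, 800, 700, 900],      # uncorrelated with MAT (r = 0)
}

retained = uncorrelated_subset(toy_sites)
print(retained)
```

In this toy case the redundant temperature column is discarded and the two uncorrelated variables are retained.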
Soil sampling - Soil data were collected by the IFN in each plot within a 2 m radius of the central point of each conglomerate, with samples taken at a depth of 0–20 cm using a Dutch auger or digger. The soil samples were sent to a specialized laboratory for analysis. From 15 chemical and physical variables, we retained only the uncorrelated ones: phosphorus (mg kg⁻¹), base saturation index (%), and clay content (g kg⁻¹). Higher values of these variables are positively related to soil fertility (Lu et al. 2002; Santos et al. 2012). For more details about soil sampling, see the National Forest Inventory Field Manual and forms, available at: SNIF - IFN - Dados abertos (www.florestal.gov.br).
Deforestation variables - The crystalline Caatinga map (Moro et al. 2015) was used as a mask layer to clip the MapBiomas land use and land cover data. We then used georeferenced digital files containing data on vegetation remnants from the most recently updated map (accuracy = 81.8%; MapBiomas 2019), setting the date of the satellite imagery to the end of 2014 to coincide with the forest inventory in the state of Ceará (SFB 2019). The MapBiomas dataset was reclassified into two categories, habitat and non-habitat. The cell size of the satellite images used to create the forest cover maps was 30 m x 30 m (MapBiomas 2019). The habitat category included only Caatinga vegetation, whereas the non-habitat category included several land use types: pasture (1,924,517.82 ha, 39.57%), agriculture (497,305.69 ha, 10.22%), agriculture and pasture mosaic (1,642,524.19 ha, 33.77%), temporary crops (217,739.72 ha, 4.48%), urban infrastructure (81,540.90 ha, 1.68%), mining (268.80 ha, 0.005%), perennial crops (279,565.96 ha, 5.75%), soybean crops (1.43 ha, 0.00003%), other temporary crops (217,738.30 ha, 4.48%), and other non-vegetated areas (2,853.32 ha, 0.06%) (MapBiomas 2019; Souza Jr et al. 2020).
Additionally, a shapefile composed of layers of paved roads (provided by DNIT – “Departamento Nacional de Infraestruturas de Transportes”; http://servicos.dnit.gov.br/vgeo) and unpaved roads (provided by OSM – OpenStreetMap; http://www.openstreetmap.org/) (IBGE 2010) was superimposed on the shapefile of the forest remnants to obtain a more accurate picture of the habitat amount pattern of the Caatinga. A strip of 110 m and 60 m on each side of the paved and unpaved roads, respectively, was arbitrarily considered deforested area (i.e., non-habitat) and subtracted from the original forest area (based on Antongiovanni et al. 2018).
We considered habitat amount, patch size (the size of the patch in which a sample plot is located; Watling et al. 2020), and isolation (a distance-based metric using nearest-neighbour distance; Prugh 2009), following Fahrig (2013, 2017). These metrics were calculated with the class-level metrics of the landscapemetrics package (Hesselbarth et al. 2019) in R (R Core Team 2019). We extracted the landscape parameters for buffers from 1 km to 6 km around the focal plot. For all groups (i.e., trees, shrubs, herbs, and lianas), the 6 km buffer was retained because it had the greatest effect, that is, it produced the strongest species-landscape relationship (Jackson and Fahrig 2015). Patch size was obtained by measuring the area of the patch that contains the plot, and isolation was calculated as the mean nearest-neighbour distance from the focal patch to three other patches in the local landscape (Fahrig 2013; Watling et al. 2020).
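To make two of these landscape metrics concrete, the following minimal sketch computes habitat amount within a circular buffer and the size of the patch containing a focal cell on a small binary raster (1 = habitat, 0 = non-habitat). It is an illustration under stated assumptions, not the landscapemetrics implementation: the function names, the toy raster, the radius expressed in cell units, and the 8-neighbour connectivity rule are all assumptions made here.

```python
from collections import deque


def habitat_amount(grid, focal, radius):
    """Proportion of cells within `radius` (cell units) of `focal` that are habitat."""
    focal_row, focal_col = focal
    inside = habitat = 0
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            if (r - focal_row) ** 2 + (c - focal_col) ** 2 <= radius ** 2:
                inside += 1
                habitat += value
    return habitat / inside


def patch_size(grid, focal, cell_area=1.0):
    """Area of the habitat patch containing `focal` (assumed 8-neighbour rule)."""
    if grid[focal[0]][focal[1]] != 1:
        return 0.0
    seen = {focal}
    queue = deque([focal])
    while queue:  # breadth-first flood fill over connected habitat cells
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (0 <= nr < len(grid) and 0 <= nc < len(grid[0])
                        and grid[nr][nc] == 1 and (nr, nc) not in seen):
                    seen.add((nr, nc))
                    queue.append((nr, nc))
    return len(seen) * cell_area


# Toy 5 x 5 raster (illustration only).
toy_raster = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]

amount = habitat_amount(toy_raster, focal=(1, 1), radius=1)
size_in_cells = patch_size(toy_raster, focal=(0, 0))
print(amount, size_in_cells)
```

With 30 m x 30 m cells, passing `cell_area=0.09` would express patch size in hectares. Repeating `habitat_amount` over several radii and comparing the fit of the resulting models is one way to look for the buffer with the strongest species-landscape relationship.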
Chronic disturbance index (CDI) - We used the CDI calculated by Antongiovanni et al. (2020), which was estimated from 14 variables, such as human population, infrastructure, grazing, logging, and fire. A correlation matrix among the 14 variables showed that 96.8% of the 91 estimated correlations were lower than 0.4, supporting their relative independence. All data are available in the Dryad Digital Repository (Antongiovanni et al. 2020). For the 45 plots that had no data, we derived the CDI value from the three nearest points/pixels with CDI values for Ceará state.
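Two small steps in this paragraph can be sketched in code: the count of pairwise correlations among 14 variables (14 choose 2 = 91), and the filling of plots without CDI data from the three nearest points. The source does not state how the three values were combined, so the sketch assumes a simple unweighted mean; the function names, coordinates, and CDI values are hypothetical.

```python
import math


def n_pairwise(n_variables):
    """Number of distinct variable pairs in a correlation matrix."""
    return math.comb(n_variables, 2)


def fill_from_nearest(target_xy, known_points, k=3):
    """Mean CDI of the k known points closest to `target_xy`.

    `known_points` holds (x, y, cdi) tuples. Averaging is an assumption;
    the source only says the three nearest points/pixels were used.
    """
    nearest = sorted(known_points, key=lambda p: math.dist(target_xy, p[:2]))[:k]
    return sum(p[2] for p in nearest) / len(nearest)


# Hypothetical points with known CDI values (illustration only).
known = [(0, 0, 0.2), (1, 0, 0.4), (0, 1, 0.6), (10, 10, 0.9)]

pairs = n_pairwise(14)
filled = fill_from_nearest((0.2, 0.2), known)
print(pairs, round(filled, 3))
```

The distant fourth point is ignored because only the three closest points contribute to the filled value.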
2.5 Data analysis
We tested for spatial correlation with semivariograms, considering the statistical model with species richness and the spatial distribution of the plots. As the spatial correlation was low, we used generalized linear models (GLMs) with the following response variables: plant richness, dark diversity, completeness, and species pool. As predictor variables, we used annual mean temperature, annual mean precipitation, and precipitation of driest quarter for climate; base saturation index, assimilable phosphorus, and clay content for soil; and acute disturbance (habitat amount, patch size, patch isolation) and CDI for disturbances. The predictor variables were z-transformed to put them on the same scale. For the response variables, we used a Poisson distribution corrected for overdispersion for plant richness, dark diversity, and species pool, and a Gaussian distribution for completeness. All analyses were performed in R version 4.4.1 (R Core Team 2019).
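As an illustration of two steps in this workflow, the sketch below z-transforms a predictor and computes the Pearson dispersion statistic that is commonly used to diagnose overdispersion in count models (values well above 1 suggest overdispersion; it is also the scale estimate used by quasi-Poisson GLMs in R). The source does not name the specific correction applied, so this is not a description of the authors' procedure; the function names and the toy numbers are hypothetical.

```python
from statistics import mean, stdev


def z_transform(values):
    """Centre on the mean and scale by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square divided by residual degrees of freedom.

    `fitted` are the model-predicted means for a Poisson-type model,
    where the variance is assumed equal to the mean.
    """
    chi_square = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    return chi_square / (len(observed) - n_params)


# Toy predictor and toy richness counts (illustration only).
scaled = z_transform([2, 4, 6])
dispersion = pearson_dispersion(observed=[1, 9, 5], fitted=[5, 5, 5], n_params=1)
print(scaled, dispersion)
```

Scaling the predictors this way makes model coefficients comparable across variables measured in different units, such as °C, mm, and g kg⁻¹.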