2.1. Study Area
The study area is a protected area in Ethiopia, designated as Babile Elephant Sanctuary (BES). BES is situated in Eastern Ethiopia between Oromia and Somali National Regional States. It is located between the latitudes of 08°02'29" N and 09°23'19" N and longitudes of 41°45'29" E and 43°22'21" E (Fig. 1). BES is the largest protected area in the country, covering an area of 8,388 km2 (EWCA & WSD, 2010). The study area includes a 20 km buffer zone to compare the extent of settlement distribution in and around BES.
The sanctuary is characterized by arid and semi-arid climatic conditions with a bimodal rainfall pattern. The main land cover of the area is natural vegetation including open shrubland, dense shrubland, and woodlands. The vegetation of BES can be categorized into three major types: riverine forest, Acacia-Commiphora Woodland in lowlands and mid-highlands, and Bushland/scrubland (Zelalem Wodu, 2007). Including the African Elephants and Black Lions, the sanctuary harbours 30 species of mammals belonging to seven orders and 15 families. The sanctuary is one of the areas of unquestionable national significance to be included in the Ethiopian Protected Area System (Vreugdenhil et al., 2012). It is in category II of the International Union for Conservation of Nature (IUCN) Management category and is administered by Ethiopia Wildlife Conservation Authority (EWCA) (Gebremeskel Gizaw, 2021). It is increasingly suffering from the settlement, intensive agriculture, and livestock grazing. The area of BES is shared by 15 woredas of two regions. The population density in these woredas ranges from 21 in Degahamido of Somali Region to 711 in Haromaya of Oromia Region based on population predictions for 2017(CSA, 2013).
The study area is selected because it is one of the PAs in the country seriously affected by population pressure and settlement. Neil and Greengrass (2021) showed how uncontrolled immigration and settlement threatened BES with a four-fold increase in the number of houses from 2006 to 2017. Increasing demand for natural resources in the area like fodder, fuel wood, timber, water, and land for cultivation are reasons for human encroachment into the reserve area (Aschalew 2020; Reddy and Workneh 2014; Sintayehu & Kassaw, 2019). The size and quality of the sanctuary for the conservation of biodiversity have been reduced due to the huge immigration of farmers and their livestock who are permanently settled in and around the sanctuary (EWCA & WSD, 2010). Browsing animals, especially camels, are in direct competition for resources with the Elephants. BES has received the least attention for conservation since its establishment although it has been designated as the largest sanctuary for Elephants in the country (Belayneh et al., 2013).
2.2. Data source
Sentinel-2 level 2A Image was the main source of data for the settlement mapping. It is freely available at https://scihub.copernicus.eu/dhus/#/home. The Level-2A processing includes a scene classification and an atmospheric correction applied to Top-Of-Atmosphere (TOA) resulting in orthoimage, Bottom-Of-Atmosphere (BOA) corrected reflectance product with 10m, 20m, and 60m resolution (Sentinel Hub, 2022). Using multi-temporal data increases the accuracy of mapping scattered settlements (Hoffman-Hall et al., 2019; Ji et al., 2020; Marconcini et al., 2020; Muro et al., 2020; Xu, 2021). For this study, images from the two main seasons were used, i.e., wet season (August/September 2021) and dry season (March 2022). The total number of scenes used for this study was 14 (Table 1). The study area is covered by five tiles. The dry season images were used directly. But for the wet season images, cloud pixels were masked out from the main image (Table 1, column 3) and replaced by pixels from additional wet season images (Table 1, column 4).
Table 1
Images used for the study
Tile identifier | Dry season image date | Green season main image date | Green season additional image date |
37PHK | 06 Mar, 2022 | 28 Aug, 2021 | 08 Aug, 2021 |
38PKQ | 06 Mar, 2022 | 28 Aug, 2021 | 12 Sep, 2021 |
37PHL | 11 Mar, 2022 | 18 Aug, 2021 | 28 Aug, 2021 |
38PKR | 08 Mar, 2022 | 28 Aug, 2021 | 08 Aug, 2021 |
38PLQ | 16 Mar, 2022 | 30 Aug, 2021 | - |
2.3. Image Preprocessing and interpretation
Remotely sensed raw data generally contain flaws and defects. Removal of flaws through some methods is termed pre-processing methods and mainly includes geometric correction, radiometric correction, and atmospheric correction methods (Reddy, 2008). These corrections were not needed for Sentinel-2 level 2 data. Besides, Sentinel-2 L2A data are Ortho-rectified and UTM geocoded. Since sentinel-2 imagery comes with different spatial resolutions, resampling is needed. So, all images were resampled to 10 meters pixel size.
2.3.1. Cloud masking
Cloud detection is a preliminary step before processing optical remote sensing data (Hagolle et al., 2010). Clouds significantly alter the spectral signatures obtained from satellite data, which often leads to the false identification of land cover change (Tarrio et al., 2020). But, as the thermal band is still a key data source for cloud detection, cloud screening in Sentinel-2 is challenging (Tarrio et al., 2020). High omission of clouds and shadows was reported (Claverie et al., 2018; Tarrio et al., 2020). The data quality report (Enache, 2022) shows the overall accuracy of L2 scene classification varies for cloud versus clear pixels between 57.3 to 96.8%. The main problems were the over-detection of clouds over bright targets and the under-detection of semi-transparent clouds or cloud edges (Enache, 2022). This study applied a new method of cloud masking based on simple multi-temporal thresholding that referenced cloud-free dry season images. The advantage of using this method was described in the supplementary data. The threshold of 0.045 was found fairly good and applied for cloud masking in this study. The following equation is applied.
B1w - B1d > t−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−(1)
Where B1w and B1d are reflectance values of the coastal aerosol band (B1) for the cloudy wet season image and the cloud-free dry season image respectively. t is a threshold that differentiates between cloudy and non-cloudy pixels.
2.3.2. Feature selection and training data preparation
For any modelling, the input data is the most important consideration. Remote sensing bands and their derivatives were used for feature classification. Textural indices derived from satellite imagery are useful to improve classification performance (Mellor et al., 2015). Co-occurrence measures were applied in ENVI software which was based on the work of Haralick, Dinstein, & Shanmugam (1973). All available texture measures were tried with the spectral indexes on a 3 by 3 window size. All features with promising results were selected. After recursive trials on the test area, 19 variables were selected to identify settlement structures (Table 2). These were based on 12 indexes, three texture metrics, and three-time variables. Feature importance was also analysed based on the random forest gain ratio. To prepare the training data, we used high-resolution remote sensing data to collect training samples. Training polygons were delineated for each class by overlaying Google earth image, true colour sentinel image and Normalized Difference Vegetation Index (NDVI) data. The samples were distributed all over the study area with more preference given to the overlapping area of tiles. This decreases workload by using the same training repeatedly. Areas with imagery dates before 2019 especially for settlement samples were avoided.
To avoid misallocation issues, samples were taken only from the middle of the class objects. However, this was not always possible as settlement areas were fragmented. The inclusion of the surrounding area outside the sanctuary was important because more compact settlements were found around this area. The training data quality was assessed by investigating the histogram (Lillesand et al., 2015).. When the histogram of a land cover class is uniform and unimodal for some feature, the training data can be considered of good quality. The training data collection produced 506 polygons containing 20,752 pixels for the training data, of which only 186 polygons or 1762 pixels (8.5%) were for the settlement class.
Table 2
Features used as inputs for classification
No | Index/variable | Equation | Reference | Seasonal variable | Texture metrics |
1 | Built-Up Areas Index (BAI) | (B2 – B8)/ (B2 + B8) | (Mhangara et al., 2011) | Dry and wet | - |
2 | Brightness Index- BI | [(B42 + B32)/2]0.5 | (Escadafal, 1989) | dry | - |
3 | Bare soil index (BSI) | [(B11 + B4)- (B8 + B2)]/ [(B11 + B4)+ (B8 + B2)] | (Chen et al., 2004) | wet | contrast |
4 | Band Ratio for Built-up Area (BRBA) | B4 / B11 | (Waqar et al., 2012) | dry | variance |
5 | Brightness index (BI3) | (B4 + B3 + B2)/3 | - | Seasonal difference | Mean of dry season index |
6 | Brightness index (BI4) | (B2 + B3 + B4 + B8)/4 | (Zhou et al., 2009) | dry | - |
7 | Modified Bare Soil Index (MBSI) | 2(B4-B3)/ (B4 + B3-2) | (Zhang et al., 2018) | wet | - |
8 | Normalized Difference Green Blue (NDGB) | (B3 – B2)/ (B3 + B2) | (Marconcini et al., 2020) | Dry and wet | - |
9 | Normalized Difference Red Blue (NDRB | (B4 – B2)/ (B4 + B2) | (Marconcini et al., 2020) | dry | - |
10 | B8-B1 index | (B8– B1)/ (B1 + B8) | - | dry | - |
11 | Tasseled Cap Wetness (TCW) | (B2∗0.2578)+(B3∗0.2305)+ (B4∗0.0883)+(B8∗0.1071)- (B11∗0.7611)-(B12∗ 0.5308) | (Shi & Xu, 2019) | Wet and seasonal difference | - |
12 | B11 | Raw sentinel SWIR band | | Dry season and seasonal difference | - |
2.3.3. Classification method and accuracy assessment
Although the target of this study was to identify settlement areas, multiple classes were used based on a pre-experiment test. More classes are necessary to improve classification accuracy by encompassing the spectral variability of a landscape (Adams & Gillespie, 2006). Six-class were identified (two classes of settlement and four classes of non-settlement). The settlement classes are bright roofs and dark roofs. Non-settlement classes comprise roads, bare land, seasonal vegetation and permanent vegetation. Because it is successful in several studies, the random forest algorithm (RF) was applied. Two parameters are important in the RF algorithm. The number of trees is usually set between 100 and 1000 (Rodriguez-galiano et al., 2012). Tree depth (the maximum number of splits in a tree) is usually set to the square root of the total number of input variables for classification (Belgiu & Drăgu, 2016). Hoffman-Hall et al.(2019) applied 500 trees and 9 splits in a similar study. In this study, the number of trees was set to 120, and the tree was based on a pre-test result. The default tree depth of 30 was maintained in ArcGIS. The classification was done for each tile separately mainly because mosaicking 5 tiles of sentinel data for 13 bands and 10m resolution is computationally intensive. After classification, the results of all tiles were merged and reclassified into binary classes.
Regarding the accuracy assessment, the result was compared with World Settlement Footprint (WSF) 2019, which is a 10m resolution global dataset (Marconcini et al., 2020) and available at https://download.geoservice.dlr.de/WSF2019/. WSF uses multi-temporal Sentinel-1 (S1) radar and Landsat-8 optical imagery as an input and Support Vector Machines (SVM) as a classifier. It also incorporates existing settlement datasets in the -post-classification steps. It was selected for its high accuracy. The agreement and disagreement areas were identified and compared.
Table 3
Number of samples for accuracy assessment
Density Class | Settlement density(pixels/km2) | Area in class- km2 (%) | Number of samples for accuracy |
1 | 0 | 131559 (69) | 114 |
2 | 1–50 | 3964 (21) | 182 |
3 | 51–500 | 1718 (9) | 164 |
4 | > 500 | 351(2) | 87 |
| Total | | 545 |
A formal accuracy assessment was also done. Sampling design, response design, and error analysis are important considerations in accuracy assessment (Congalton & Green, 2019; Jensen, 2015). The sampling design is stratified random sampling. Stratification was based on the settlement density of the resulting map. Settlement pixels were counted in 1km square grids and the study area was divided into 4 classes based on settlement density. This served as strata for sampling. The total number of samples was 545. The number of samples for each stratum was allocated as mentioned in Table 3. More samples were allocated to the middle classes (2nd and 3rd ) to emphasize the scattered settlements as it was the focus of this study. The sampling unit was 3 by 3 pixels of the resulting map to avoid the transfer of misallocation error. when there was any settlement structure in a 30 by 30m grid, it was considered as reference settlement data. Otherwise, it was recorded as non-settlement. Reference samples were collected from Google earth and using GPS. Overall accuracy, F-score, omission error, and commission error were used as accuracy measures. The F-score (Pérez-Hoyos; A. Udías & Rembold, 2020) is more meaningful for imbalanced classification. It is the harmonic mean of the user’s and producer’s accuracy.
F-score= (2*UA*PA)/ (UA + PA) --------------(2)
Where UA and PA are user’s accuracy and producer’s accuracy respectively. The F-score ranges between 0 and 1. The higher the value of the F-score the better the accuracy.
2.3.4. Settlement map integration method
The WSF map and the classified map of this study were integrated and used for subsequent analysis. First, the settlement pixels of WSF were masked. When the non-settlement area in WSF was mapped as a settlement in the study, the following threshold expressions were applied to remove the commission error from the classified map: Reflectance of B6 in the wet season > 0.22, BAI for wet season > -0.6 and NDGB < 0.9. These threshold values were joined by the Boolean ‘AND’ to remove less bright objects as well as contrasts of vegetation and bare land which were confused with the settlement.
2.4. Spatial analysis of settlement distribution
To see the variation in settlement density, cluster analysis was performed. First, the raster data was changed to point. Then the points were changed to clusters with the Ordering Points To Identify the Clustering Structure (OPTICS) algorithm using the Density-Based Spatial Clustering with the Application of Noise (DBSCAN) package (Hahsler et al., 2019). Density-based clustering works by grouping regions of high density and separating them from regions of low density. This helps identify coherent, homogeneous regions in the data set (Pattnaik, 2020). The minimum number of points (minPts) was set to 3 and the neighbourhood distance (eps_cl) to 500 meters. The polygon enclosing each cluster was drawn using the concave hull algorithm (Gombin et al., 2020) in R packages. The second iteration of DBSCAN was done on polygons with an area of less than 100 m2. Lastly, the concave hull algorithm was performed on all clusters and the geometry was cleaned by dissolving overlapping boundaries. Settlement density was calculated by dividing the number of settlement points contained in a polygon by its area.
Comparison of settlement by proximity was also conducted using buffer zones. This study includes an 11,579 km2 buffer area on the outer boundary of BES for comparison of settlement inside and outside the PA. The BES inside the area was 8388 km2. Concentric buffers were drawn at 10 km intervals inside and outside the BES for comparison. The number of settlement clusters and their density were compared in these proximity zones.
Furthermore, it will be helpful for future decision-making if the location of extended settlement-free areas in BES is mapped. To map non-settlement area, first, the settlement polygons inside BES were buffered by 1 km. Then, the non-settlement area was identified by calculating the symmetric difference from the BES boundary. The resulting polygons were ranked by their area to identify where extended settlement-free areas were located.