A novel band selection architecture to propose a built-up index for hyperspectral sensor PRISMA

Processing of hyperspectral remote sensing datasets poses challenges in terms of computational expense pertaining to data redundancy. As such, band selection becomes indispensable to address redundancy while preserving the optimal spectral information. This paper proposes a novel architecture using Genetic Algorithm (GA) optimizing technique with Random Forest (RF) classifier for efficient band selection with the Hyperspectral Precursor of the Application Mission (PRISMA) dataset. The optimal bands are BLUE (λ = 492.69 nm), NIR (λ = 959.52 nm), and SWIR 1 (λ = 1626.78 nm). This paper also involves an application of the selected bands to accurately identify and quantify built-up pixels by means of a new spectral index named Hyperspectral Imagery-based Built-up Index (HIBI). The proposed index was used to map built-up pixels in six cities around the world namely Jaipur, Varanasi, Delhi, Tokyo, Moscow and Jakarta to establish its robustness. This analysis shows that the proposed index has an accuracy of 94.02%, higher than all the other indices considered for this study. Moreover, the spectral separability analysis also establishes the efficiency of the proposed index to differentiate built-up pixels from spectrally similar land use or land cover classes.


Introduction
Land cover in urban areas tends to change more drastically over a short period than elsewhere because of incessant urbanization. Between 1960 and 2018, the share of the global urban population increased from 33.61% to 55.27% (Zha et al. 2003; Mukherjee et al. 2020). India's urban share increased from 17.97% in 1961 to 31.16% in 2011 and is expected to reach 40% by 2030 (Kaur and Luthra 2018). Urbanization studies have lately been pursued with renewed enthusiasm by urban planners, economists, and researchers. Urban expansion has been quantified using economic, demographic, and geographical approaches. The quantification of the urban measurements with an economic and demographic perspective measures the change in the ratio of urban to the total population and the contribution

Communicated by H. Babaie
Rajarshi Bhattacharjee rajbhatt78645@gmail.com

For the calculation of remote sensing-based indices, the scientific community has primarily used LANDSAT series satellite imagery because of its easy availability. The LANDSAT program was activated in 1972, and since its inception, these satellites have acquired multispectral images that record the Earth's reflected solar energy in the visible and non-visible ranges of the electromagnetic spectrum (Mukherjee et al. 2020).
The built-up spread can be estimated from several remotely sensed datasets or satellite images, using spectral values that depend on the land-use category (Xu, 2008). These estimates can be analyzed with the help of several classification algorithms (Poyil and Misra, 2015; Rawat and Kumar, 2015). Among the numerous classification techniques, the index-based thresholding method has been frequently applied by researchers because of its computational efficiency and ease of implementation (Zha et al. 2003; He et al. 2010). The urban land-use class of the Pearl River Delta of China has been classified with high accuracy by Chen et al. (2006) using multiple remote sensing-based indices. For mapping bare land and built-up areas in urban regions, several indices have been used in various studies, such as the Urban Index (UI) (Kawamura et al. 1996), Normalised Difference Built-up Index (NDBI) (Zha et al. 2003), Normalised Difference Bareness Index (NDBaI) (Zhao and Chen, 2005), Index-based Built-up Index (IBI) (Xu, 2008), Enhanced Built-Up and Bareness Index (EBBI) (As-syakur et al. 2012), Dry Built-up Index (DBI) (Rasul et al. 2018), and powered B1 built-up index (PB1BI) (Mukherjee et al. 2020). Indices like UI, NDBI, and EBBI are designed for the high-speed mapping of bare land or built-up areas. Nevertheless, these indices cannot reliably separate the built-up and bare land classes (He et al. 2010; Ukhnaa et al. 2019). Some researchers attribute this inability to the complexity of the spectral response patterns of built-up areas, vegetation, and bare land, predominantly in pixels that group heterogeneous objects (He et al. 2010). All these built-up indices have been derived using LANDSAT imagery. The LANDSAT-series satellites carry multispectral sensors having only a few bands (Loveland and Irons, 2016).
As compared to multispectral data, hyperspectral data provide more substantial information. The greater level of spectral detail provides better prospects for analyzing the land use/land cover (LU/LC) pattern (Jarocińska et al. 2022). In this study, the band combination best suited for built-up delineation has been obtained using a genetic algorithm (GA) based optimization technique with random forest (RF) as the classifier (Nagasubramanian et al., 2018).
This study uses hyperspectral data to derive a new index for built-up area delineation. The PRISMA sensor dataset has been used in this analysis. This sensor was built by the Italian Space Agency and launched in 2019. This hyperspectral sensor consists of approximately 250 bands in a spectral range of 400-2500 nm. PRISMA images have a spatial resolution of 30 meters, similar to that of LANDSAT imagery (Vangi et al. 2021). The index has been named the 'Hyperspectral Imagery-based Built-up Index (HIBI)', which can properly delineate the built-up features and distinguish between built-up and non-built-up features. The novelty of this work is that, for the first time, the GA-based RF classifier has been applied to the PRISMA datasets. The optimal bands have also been identified, which can be used by the research community for urban pattern identification in the future.
This index has been tested by creating classified maps of six cities in the world. The HIBI mapping outcomes have been compared to the results of several other existing built-up indices. The proposed index has also been compared with a machine learning (ML) classifier and a deep learning (DL) classifier.

Study area description
For analysing the classification performance of the proposed index HIBI, three cities have been chosen from India, and three other cities have been chosen from outside India. The cities selected from India are Delhi, Jaipur, and Varanasi. These cities have a high population density. Delhi is the national capital of India. It is the second most populous city in India. The city of Jaipur is the state capital of Rajasthan and the most populous city of the state. It is also the tenth most populous city in India (CensusInfo India 2011). Varanasi is known as the spiritual capital of India and is one of the most famous cities in the world (Garg et al. 2020). The cities that have been selected outside India are Tokyo (the capital city of Japan), Moscow (the capital city of Russia), and Jakarta (the capital city of Indonesia). The city of Tokyo is known as the most populous city in the world. Moscow is the second most populous city in Europe. Jakarta also features in the top hundred most populous cities of the world (UN, DESA, PD 2014; Hui et al. 2017). The location map of the study sites is shown in Fig. 1.

Datasets used
The remote sensing datasets comprise PRISMA satellite images. All the datasets used in this analysis have been Level-2D products. These datasets have been atmospherically corrected and geocoded (Vangi et al. 2021). Except for the proposed index, all the other indices have been estimated from LANDSAT-8 imageries. The PRISMA and LANDSAT-8 images are from April 2021.

Methodology adopted for the study
The built-up indices UI, NDBI, IBI, EBBI, DBI, and PB1BI have been computed for the preprocessed LANDSAT-8 scene. An additional built-up index has been computed using multispectral data (LANDSAT-8) with band placement similar to that of the hyperspectral PRISMA data. This index has been termed in the manuscript as 'Multispectral Imagery-based Built-up Index (MIBI)'. The output images (represented in the form of maps) have been generated for each built-up index by applying the optimum threshold. The proposed built-up index (HIBI) has then been estimated using the PRISMA dataset, and a comparison has been drawn with all the other built-up indices. Finally, HIBI has been compared with a machine learning (ML) classifier named Support Vector Machine (SVM) and a deep learning (DL) classifier known as Convolutional Neural Network (CNN).
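As a hedged illustration of the index-plus-threshold workflow described above, the following sketch computes NDBI, which Zha et al. (2003) define as the normalised difference of the SWIR and NIR reflectances, and labels a pixel as built-up when its index value falls inside a threshold window. The reflectance values, the window bounds, and the function names are illustrative assumptions and are not taken from this study.

```python
def ndbi(swir1, nir):
    """Normalised Difference Built-up Index for one pixel: (SWIR1 - NIR) / (SWIR1 + NIR)."""
    denom = swir1 + nir
    return 0.0 if denom == 0 else (swir1 - nir) / denom


def classify(index_values, lower, upper):
    """Label a pixel built-up (1) when its index value lies in [lower, upper], else 0."""
    return [1 if lower <= v <= upper else 0 for v in index_values]


# Hypothetical surface reflectances (SWIR1, NIR) for three pixels
pixels = [(0.30, 0.20), (0.15, 0.45), (0.25, 0.24)]
values = [ndbi(s, n) for s, n in pixels]

# Hypothetical threshold window; in practice it is estimated from training pixels
labels = classify(values, lower=0.05, upper=0.40)
```

The same two-step pattern (compute an index image, then apply a lower and an upper cutoff) applies to the other indices compared in this study; only the band arithmetic changes.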

Development of built-up index HIBI and its comparison with ML and DL techniques
Spectral curves of the LU/LC features have been prepared from the PRISMA dataset for the six selected cities. Seven PRISMA bands situated between 400 and 450 nm of the EM spectrum have not been considered for this analysis, as these bands lie in the aerosol domain. The median reflectance value of each feature class has been shown in the spectral graph. The spectral pattern of the PRISMA bands for the study regions is represented in Fig. 2. A major challenge while working with a hyperspectral dataset is the selection of the best waveband combination, in this case the best band combination for built-up area delineation. To address this challenge, a genetic algorithm (GA) based optimization technique using random forest (RF) as the classifier has been implemented. The number of decision trees used is 100, bootstrap has been set to true to increase the computational efficiency, and the random state has been set to 2 so that results are reproducible across executions. GA is a population-based stochastic search optimization method inspired by the principles of natural genetics and natural selection. Waveband combinations are represented by long bit strings known as chromosomes. A score is assigned to each chromosome using a fitness function (Goldberg 2001). In this case, the fitness function evaluates how well a chromosome (a combination of bands) discriminates between built-up and non-built-up regions. The chromosomes evolve over consecutive generations through the selection, crossover, and mutation genetic operators, exploring the solution space until the best solution has been achieved or an end criterion has been met. Chromosomes for reproduction can be selected in many ways. One way is to pick, from the whole population, chromosome pairs with better fitness scores to execute crossover.
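The GA loop outlined above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the population size, number of generations, crossover and mutation rates, tournament selection, elitism, and the toy fitness function are all hypothetical. In the study's setting, the `fitness` callable would instead return the cross-validated F1 score of an RF classifier trained on the bands switched on in the chromosome.

```python
import random


def pick(scored, rng):
    """Tournament selection: the fitter of two randomly drawn chromosomes wins."""
    a, b = rng.sample(scored, 2)
    return a[1] if a[0] >= b[0] else b[1]


def run_ga(n_bands, fitness, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.02, seed=2):
    """Evolve bit-string chromosomes (1 = band selected) towards higher fitness."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bands)]
                  for _ in range(pop_size)]
    best, best_score = None, float("-inf")

    for gen in range(generations + 1):
        scored = [(fitness(c), c) for c in population]
        gen_score, gen_best = max(scored, key=lambda pair: pair[0])
        if gen_score > best_score:
            best, best_score = gen_best[:], gen_score
        if gen == generations:  # end criterion: generation budget exhausted
            break

        next_pop = [best[:]]  # elitism: keep the best chromosome found so far
        while len(next_pop) < pop_size:
            p1, p2 = pick(scored, rng), pick(scored, rng)
            child = p1[:]
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_bands)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            # Bit-flip mutation keeps diversity and helps escape local optima
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        population = next_pop
    return best, best_score


# Toy fitness (assumption): reward three "useful" band positions, penalise the rest
TARGET = {3, 7, 11}


def toy_fitness(chromosome):
    selected = {i for i, gene in enumerate(chromosome) if gene}
    return len(selected & TARGET) - 0.1 * len(selected - TARGET)


best, score = run_ga(16, toy_fitness)
```

Keeping the fitness function as a plain callable separates the search from the classifier, so the same loop could wrap any band-scoring model.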
The genetic details of two chromosomes can be randomly combined using the crossover operator. The mutation operator modifies some part of a chromosome, which keeps the GA from settling on locally optimal solutions (Nagasubramanian et al. 2018). An appropriate fitness function must be selected carefully. In this analysis, the F1 score of the built-up class has been chosen to assess the performance of the classifier. F1 is defined as the harmonic mean of the recall and precision values (Powers 2020), and a good F1 score is indicative of good classification performance. The F1 score ranges from 0 to 1 (Fourure et al. 2021). F1 has been calculated using these formulas (Nagasubramanian et al. 2018):

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]

where TP, FP, and FN are the numbers of true positive, false positive, and false negative pixels of the built-up class. A 10-fold cross-validation has been conducted to evaluate the fitness of the classifier. The GA gives the best possible band combination. In this case, it selected Band 13 (λ = 492.69 nm) from the BLUE region, Band 22 (λ = 562.73 nm) from the GREEN region, Band 34 (λ = 669.81 nm) from the RED region, Band 66 (λ = 959.52 nm) from the NIR region, Band 129 (λ = 1626.78 nm) from the SWIR1 region, and Band 197 (λ = 2229.75 nm) from the SWIR2 region. The flowchart of the GA-RF architecture for selecting the best waveband combination is shown in Fig. 3.

The GA architecture has given a combination of six bands from six different EM regions for built-up delineation. Still, not all six bands can be used for built-up estimation. Careful analysis of the LU/LC spectral curve shows that the BLUE region depicts a high reflectance value for the built-up class compared to other classes. On the basis of this observation, the BLUE band has been kept as the first term in both the numerator and denominator of the proposed index. The urban pixels have a low reflectance value as compared to the barren land in the NIR band. Healthy vegetation reaches its reflectance crest in the NIR region, which makes this band an automatic inclusion in the index to differentiate built-up from vegetation. However, the spectral response of dry vegetation and bare land peaks in the SWIR1 band, leading to a spectral footprint similar to that of construction materials. Thus, these three bands have been used in the present study to considerably improve the extraction of built-up pixels by increasing the contrast between bare land and built-up areas. The built-up class is also separated from the water bodies (inland as well as the sea) in these regions. In the GREEN, RED, and SWIR2 regions, the spectral curve of the built-up class overlaps at several places with some of the other classes. Taking all these observations into account, a generic built-up delineation index consisting of the BLUE, NIR, and SWIR1 bands has been proposed to recognize the built-up pixels on the basis of a suitable threshold.

In this analysis, the GA has indicated which specific bands in the given regions need to be used to get the best result. Since the PRISMA image is a hyperspectral dataset, there are multiple bands in each region. For example, this dataset contains 10-12 bands in the BLUE region of the EM spectrum. So unless an optimization algorithm is implemented, it is impossible to select the best band out of the given 10-12 bands. A similar scenario arises in the other regions of the EM spectrum with the PRISMA data. In the NIR region, this dataset contains around 60-65 bands. The SWIR 1 and SWIR 2 regions also include a large number of bands. The GA-RF architecture has been implemented using Google Colab (a free cloud-based Python platform).

The thresholding technique can be helpful for assigning an upper cutoff (U) and a lower cutoff (L) for a single pixel (Mukherjee et al. 2020). The pixels inside the range [L, U] have been delineated as built-up pixels, and the other pixels in the image have been marked as non-built-up pixels. For accurate classification using any index, the U and L bounds must be properly estimated using statistical techniques.

Across all the cities combined, the test set includes 385,235 non-built-up pixels. A 7 × 7 window with a 7 × 7 stride has been used to generate the training image chipset for the CNN model. The large difference between the numbers of training and testing pixels can be seen because training has been used only for the SVM and CNN methods, whereas testing data have been associated with every index used in this study. The stratified random sampling technique has been used for creating the training and testing data. The testing dataset gives a reference for comparing the delineated built-up results generated from the indices as well as from the SVM and CNN classifiers. In the SVM technique, the kernel function has been set to the Radial Basis Function ('rbf'), and the decision function shape has been set to one-vs-one ('ovo'). The CNN classifier has been trained for 50 epochs with an initial learning rate of 0.001 and a dropout value of 0.25. The activation function has been set to ReLU, and the batch size to 128. The Adam optimization function has been used for training. The values of the additional hyperparameters Beta1, Beta2, and epsilon have been 0.001, 0.009, and 1e-08, respectively. The binary built-up map obtained after LU/LC classification has been used for the generation of training and testing data. The datasets contain samples from both the built-up and non-built-up pixels. The same set of training pixels has been used for generating the threshold interval window for all the built-up delineation indices and for training the SVM and CNN classifiers. Similarly, the same set of testing pixels has been used to evaluate and compare the classification performance of the several techniques.
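The 7 × 7 window with a 7 × 7 stride mentioned above amounts to cutting the image into non-overlapping chips. The sketch below shows that tiling on a toy single-band grid; the function name, the nested-list representation, and the choice to discard edge pixels that do not fill a complete window are assumptions for illustration (a real PRISMA cube would carry a band axis as well).

```python
def extract_chips(image, size=7, stride=7):
    """Split a 2-D grid (list of rows) into size x size chips taken every `stride` pixels."""
    rows, cols = len(image), len(image[0])
    chips = []
    for r in range(0, rows - size + 1, stride):
        for c in range(0, cols - size + 1, stride):
            # Each chip is a list of `size` row slices starting at (r, c)
            chips.append([row[c:c + size] for row in image[r:r + size]])
    return chips


# Toy 14 x 21 image whose pixel value encodes its position (row * 21 + col)
image = [[r * 21 + c for c in range(21)] for r in range(14)]
chips = extract_chips(image)
```

With the stride equal to the window size, no pixel is shared between chips, so each training chip is spatially independent of its neighbours.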
Several parameters have been chosen for the accuracy measurements like Sensitivity, Specificity, Positive Prediction Value (PPV), Negative Prediction Value (NPV), Total accuracy, and Cohen's Kappa (κ) (Mukherjee et al. 2020). These parameters for accuracy measurements have been defined in Table 1.
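The accuracy parameters listed above can all be derived from a binary confusion matrix. The following sketch uses the standard textbook definitions, which are assumed here to match those of Table 1; the pixel counts in the usage example are hypothetical and are not results from this study.

```python
def accuracy_metrics(tp, fp, fn, tn):
    """Binary accuracy measures, with built-up as the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Agreement expected by chance, from the row and column marginals
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    return {
        "sensitivity": tp / (tp + fn),  # built-up reference pixels predicted built-up
        "specificity": tn / (tn + fp),  # non-built-up reference pixels predicted non-built-up
        "ppv": tp / (tp + fp),          # predicted built-up pixels that are truly built-up
        "npv": tn / (tn + fn),          # predicted non-built-up pixels that are truly non-built-up
        "accuracy": accuracy,
        "kappa": (accuracy - p_chance) / (1 - p_chance),
    }


# Hypothetical counts: 100 built-up and 900 non-built-up reference pixels
metrics = accuracy_metrics(tp=90, fp=10, fn=10, tn=890)
```

Cohen's kappa is worth reporting alongside total accuracy here because the test set is dominated by non-built-up pixels, which inflates raw accuracy.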

Performance of the built-up index HIBI
This work has analysed the mapping of built-up and non-built-up regions in the considered study regions using the HIBI transformation. The multispectral bands of the LANDSAT-8 satellite image have been used to compute all the considered indices except HIBI. The HIBI index has been compared with the other indices (UI, NDBI, PB1BI, EBBI, DBI, MIBI, and IBI) and with the SVM and CNN classifiers to estimate its effectiveness and accuracy. Additionally, the built-up and non-built-up coverage area of the study region has been tabulated. Estimating the threshold bounds from the sample training set is non-trivial, as the built-up sample pixels do not follow a parametric distribution such as the Gaussian. The bootstrapping thresholding technique has been applied in this study to overcome this difficulty.
The classification performance of the proposed index has been compared with other supervised classification algorithms along with ML and DL classifiers.

Spectral separability measurement
One of the commonly used measures of spectral separability is the Jeffries-Matusita (JM) distance, which has been used in this study. This measure is reliable for spectral separability because it behaves like the probability of correct classification (Padma and Sanjeevi, 2014). The probability densities of the spectral vectors S1 and S2 over the bands (l = 1, 2, ..., L) are denoted as p_l and q_l, and the JM distance has been calculated from them following Ghiyamat et al. (2013). The JM distance ranges from 0 to 2, where 2 indicates the maximum separability (Rao et al. 2014). The LANDSAT-8 image has been chosen as the base image, and on that image, 50 pure pixels have been chosen for each of the built-up, cropland, vegetation, bare soil, sandbar, and waterbody classes. These groups of pure pixels have then been overlaid on the corresponding classified image generated from each index. Since the geo-coded locations of the points are the same, the points lie at the same location in all the index-based classified images. The spectral distance between built-up and each of the other classes has been calculated for all index-based classified images. This procedure has been carried out in ERDAS Imagine software. The entire process has been repeated for the HIBI index using the PRISMA datasets.
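To make the separability measure concrete, the sketch below uses one common discrete form of the JM distance, JM = 2(1 - Σ sqrt(p_l q_l)), where p_l and q_l are the two spectra normalised to sum to one. This form is an assumption chosen because it has the 0 to 2 range stated above; the exact expression used by Ghiyamat et al. (2013) may differ. The spectra in the usage example are hypothetical.

```python
import math


def jm_distance(s1, s2):
    """JM distance between two non-negative spectra treated as discrete distributions."""
    total1, total2 = sum(s1), sum(s2)
    p = [v / total1 for v in s1]
    q = [v / total2 for v in s2]
    # Bhattacharyya coefficient: 1 for identical distributions, 0 for disjoint ones
    bc = sum(math.sqrt(a * b) for a, b in zip(p, q))
    return max(0.0, 2.0 * (1.0 - bc))  # clamp tiny negative rounding errors


# Hypothetical reflectance spectra over four bands
built_up = [0.20, 0.22, 0.25, 0.30]
vegetation = [0.04, 0.08, 0.45, 0.20]
d_same = jm_distance(built_up, built_up)
d_diff = jm_distance(built_up, vegetation)
d_max = jm_distance([1.0, 0.0], [0.0, 1.0])
```

Identical spectra give a distance of 0, while spectra with no energy in common bands reach the maximum of 2.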

Creation of training and testing data
The training datasets have been used to generate the threshold interval window of the built-up delineation indices and to train the Support Vector Machine (SVM) and CNN models. The training and testing datasets have been prepared from the LANDSAT-8 and PRISMA images of the study regions. Both training and testing data have been divided into built-up and non-built-up pixels. The training set comprises 250 built-up pixels (combining all the cities) and 1000 non-built-up pixels (combining all the cities). The test set consists of a total of 64,535 built-up pixels from the above-mentioned cities. In Table 4, the accuracy parameters have been computed for all the considered indices along with the accuracies of the SVM and CNN classifiers. For every accuracy parameter, the average value over the six considered cities has been reported, because each index and the ML and DL algorithms have been applied to all six cities. The built-up and non-built-up regions have been estimated using HIBI for the six considered cities, and the results have been tabulated in Table 5. The GA has been implemented in this analysis to decide which individual bands need to be selected from each of the respective EM regions. The F1 score for this algorithm has been estimated as 0.92.
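Since the GA fitness in this study is an F1 score evaluated with 10-fold cross-validation, a minimal sketch of that evaluation may help. Everything below is an assumption for illustration: the interleaved fold assignment, the `fit_predict` callable interface, the one-feature midpoint classifier standing in for the RF, and the toy data. It does not reproduce the 0.92 value reported above.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 of the positive class; 2TP / (2TP + FP + FN) equals the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def cross_validated_f1(X, y, fit_predict, k=10):
    """Mean F1 over k folds; fit_predict(train_X, train_y, test_X) returns predicted labels."""
    scores = []
    for fold in range(k):
        test_idx = [i for i in range(len(y)) if i % k == fold]
        train_idx = [i for i in range(len(y)) if i % k != fold]
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
        scores.append(f1_score([y[i] for i in test_idx], preds))
    return sum(scores) / k


def midpoint_classifier(train_X, train_y, test_X):
    """Stand-in classifier (assumption): threshold halfway between the two class means."""
    pos = [x for x, t in zip(train_X, train_y) if t == 1]
    neg = [x for x, t in zip(train_X, train_y) if t == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return [1 if x > cut else 0 for x in test_X]


# Toy, well-separated one-feature data: 20 non-built-up (0) and 20 built-up (1) samples
X = [0.1 * i for i in range(20)] + [5.0 + 0.1 * i for i in range(20)]
y = [0] * 20 + [1] * 20
score = cross_validated_f1(X, y, midpoint_classifier)
```

Averaging over folds gives each chromosome a fitness that is less sensitive to one particular train/test split than a single hold-out score.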

Qualitative assessment of the built-up indices
The visual comparison among all the indices, along with SVM and CNN, has been shown in the form of classified images for one of the cities (i.e., Jaipur). Figure 5 depicts the classified maps. From the maps, it can be seen that all the indices except HIBI overestimate the built-up regions. From the spatial distribution patterns, it can be inferred that UI shows the maximum overestimation, followed by PB1BI and EBBI. Figure 6 represents the standard false colour composite (SFCC) and true colour composite (TCC) images of the study stretch.

Result of the spectral separability
The spectral separation between the built-up class and the non-built-up classes has been tabulated in Table 2. The HIBI index has the best spectral separability values between built-up and all the non-built-up classes.

Threshold window for built-up indices
The bootstrap thresholding has been applied, and the threshold window, along with the range diagrams, has been depicted in Fig. 4.
The HIBI index has the highest built-up range percentage among all the indices. This shows that HIBI is the most robust and dynamic of the indices considered. It can classify the built-up and non-built-up regions more effectively. The false positive (FP) value will be lower for HIBI than for the other considered built-up indices. The built-up threshold range percentage has been tabulated in Table 3. The percentage reported here is the average over all six considered cities.
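The text does not spell out the bootstrap thresholding procedure, so the following is only one plausible reading, sketched under loud assumptions: a percentile bootstrap in which the index values of the built-up training pixels are resampled with replacement, a lower and an upper quantile are taken from each resample, and the averages give the window [L, U]. The quantile levels, the number of resamples, the function name, and the training values are all hypothetical.

```python
import random


def bootstrap_bounds(values, n_boot=1000, lower_q=0.05, upper_q=0.95, seed=0):
    """Average the lower and upper quantiles of the index values over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(values)
    lows, highs = [], []
    for _ in range(n_boot):
        # Resample the training index values with replacement
        sample = sorted(rng.choice(values) for _ in range(n))
        lows.append(sample[int(lower_q * (n - 1))])
        highs.append(sample[int(upper_q * (n - 1))])
    return sum(lows) / n_boot, sum(highs) / n_boot


# Hypothetical index values of 50 built-up training pixels
training_values = [0.10 + 0.01 * i for i in range(50)]
L, U = bootstrap_bounds(training_values)

# A pixel is delineated as built-up when its index value falls inside [L, U]
is_built_up = [L <= v <= U for v in (0.05, 0.30, 0.70)]
```

Because the bounds come from empirical quantiles, no Gaussian or other parametric shape is imposed on the built-up index values.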

Quantitative accuracy assessment of the built-up indices
For each index, the accuracy estimation has been performed by considering the testing pixels as a reference for the six cities.

Sensitivity: The measure of how often the predicted pixel is built-up when the reference testing pixel is also a built-up pixel

Specificity: The measure of how often the predicted pixel is non-built-up when the reference testing pixel is also a non-built-up pixel

Discussion
The urban indices based on remote sensing technology have generally been used to differentiate between bare soil and built-up regions. These indices exhibit a low level of accuracy because these land-use categories possess a high degree of homogeneity. However, the application of HIBI has been found to be very effective in discriminating between bare soil and built-up areas, which has been a significant limitation of several pre-existing indices. The bootstrapping thresholding has been applied to determine the index range. The HIBI has been created using hyperspectral data, and the other indices have been generated using multispectral data. This analysis also shows that hyperspectral images are better for pixel-based classification in comparison to multispectral images. From the HIBI images of Jaipur city, it can be clearly seen that this index can properly delineate and discriminate between built-up and bare soil classes. The standard false colour composite (SFCC) and true colour composite (TCC) images of the study stretch have been prepared from the PRISMA image. The MIBI has also been estimated with band placement similar to that of HIBI, but since it uses multispectral data, the associated accuracy of MIBI is lower than that of HIBI. The narrower bandwidth of hyperspectral data provides more accuracy compared to the wider bandwidth of multispectral datasets. So the built-up delineation using hyperspectral data is more precise. The accuracy parameters have been tabulated for both HIBI and MIBI in Table 4. This index can also successfully distinguish between built-up and sandbar. The sandbar gets mixed with urban pixels, and this can produce a wrong classification result. Sandbar formation is a common phenomenon for a river like the River Ganga (Jain and Singh 2020). This river is the lifeline of Varanasi city, and one part of the city is situated near the river bank. So for the proper delineation of the city stretch, the sandbar needs to be categorized into the non-built-up classes, and earlier indices have not been very capable of doing this. The HIBI index can be beneficial for delineating cities that lie near a sandbar-intruded river bank. Coastal cities with shorelines can also be appropriately demarcated by this index. In cities where the tree canopy covers the buildings, this pixel-spectra-based index can produce an error-prone result because of the heterogeneous landscape (As-syakur et al. 2012). If the spatial resolution of the satellite imagery is enhanced, better information can be retrieved in heterogeneous urban regions by capturing small-scale objects (Tran et al. 2011). The performance of HIBI has been reasonably accurate when compared with the SVM and CNN classifiers.
The execution of the SVM and especially the CNN algorithm is challenging, as it requires substantial knowledge and skill in computer programming. Moreover, considerably high-end systems are required for the execution of such algorithms. The selection of perfectly homogeneous training samples or datasets also plays a key role in the accuracy of the SVM and CNN classifiers. The execution of HIBI is computationally inexpensive and much easier as compared to the ML and DL algorithms. This index can be executed easily in any open-source GIS software like QGIS. However, the performance of HIBI depends highly on the appropriate selection of the threshold, like any other remote sensing-based index.

Conclusion
A new pixel-spectra-based remote sensing index has been presented and analysed for the delineation of the non-built-up and built-up regions of several cities (three in India and three outside India). The index has been calculated using the PRISMA images. Three bands, namely BLUE (λ = 492.69 nm), NIR (λ = 959.52 nm), and SWIR1 (λ = 1626.78 nm), have finally been used for estimating the HIBI. The analysis indicates that the proposed index can be a more accurate alternative for mapping built-up pixels in comparison to the existing indices considered in this study. These three bands are the optimal bands and can be used by future researchers for built-up classification. This index provides better separability between the sandbar and built-up classes than the other indices considered. The other indices except HIBI overestimate the built-up area for this region. Like all supervised classification methods, the performance of HIBI is subject to the selection of training data, i.e., the training set needs to be selected cautiously to ensure optimal performance.

Conflict of interest
The authors report there are no competing interests to declare.