Land is an invaluable natural resource that serves as the foundation for human civilization and sustains various ecological systems. The proper utilization of land resources is crucial to mitigating the irreversible loss caused by unplanned use. In the context of rapid urbanization and its adverse impacts on society and the environment, understanding the dynamics of land use and land cover (LULC) is essential in applications such as resource management and sustainable development (Ferreira et al., 2019; Rawat et al., 2020; Boulila et al., 2021; Chen et al., 2021; Zhang et al., 2021). To comprehend the complex relationships between land and human activities, researchers and decision-makers employ LULC classification systems. These systems provide a systematic framework for classifying and mapping various types of man-made and natural features on the Earth’s surface within a specific time frame using statistical and scientific analysis methods (Alshari et al., 2021). The classification of LULC provides valuable insights into the spatial distribution and temporal changes of land resources at various scales. It assists in identifying areas of intensive human intervention and thereby informs land management strategies, which in turn aids in assessing the effects of land use changes on ecosystems and habitat fragmentation. Therefore, LULC classification generally facilitates decision-making processes related to effective monitoring and analysis of land resources (Cheng et al., 2017; Ferreira et al., 2021; Tesfay et al., 2022).
Remote sensing (RS), which entails gathering data about objects or phenomena from a distance, has been instrumental in obtaining valuable information for LULC classification. Over time, RS technology has continuously evolved, resulting in enhanced data quality and detail, primarily driven by high-resolution imagery. The advancements in sensor capabilities and image processing techniques, coupled with the development of robust classification methods, have created opportunities to obtain highly accurate LULC information efficiently. However, classifying remotely sensed images remains a complex task (Broni-Bediako et al., 2022). One significant challenge that has hindered progress in RS classification is the limited availability of reliable labelled ground truth datasets (Jozdani et al., 2022). Although there has been a significant increase in the accessibility of freely available satellite and aerial imagery, fully harnessing the potential of this data requires processing and transforming satellite images into structured semantic information. Consequently, there is a pressing need to create readily usable datasets that facilitate the efficient utilization of available imagery. These datasets should be carefully labelled and validated to ensure their reliability and accuracy. By providing researchers with such datasets, it becomes possible to develop and evaluate classification algorithms more effectively, enabling the extraction of meaningful information from RS data and the application of LULC information with high precision.
The introduction of the pioneering and widely used UC Merced land use dataset for remotely sensed image classification (Yang & Newsam, 2010) has served as a basis for extensive research on classification models, including machine learning (ML) and deep learning (DL) models, particularly convolutional neural networks (CNNs). As a result, numerous similar datasets have been generated, as summarized in Table 1. The benefits of generating such benchmark datasets are manifold, including the evaluation of newly developed CNN structures and experimental studies on the fusion of multiple CNN models, specialized feature extraction techniques, and more. Computational methods rely heavily on experimental data for development and testing (Adegun et al., 2023), and their performance can be assessed only by comparison against existing knowledge. Therefore, benchmark datasets with known and verified outcomes are essential, as emphasized by Sarkar et al. (2020). These benchmark datasets provide researchers with standardized and diverse collections of remotely sensed images, facilitating the development of accurate and robust classification algorithms. Furthermore, the availability of such datasets enables researchers to develop and train DL models on large-scale and diverse data, which can produce more reliable results and improved classification performance. Moreover, these benchmark datasets promote reproducibility and comparability across different studies. By utilizing the same dataset, researchers can assess and compare the performance of their models consistently. This ensures that results are reliable and comparable, fostering a more cohesive research community in the field of remotely sensed image classification.
However, the existing benchmark datasets for LULC classification lack representation of the Indian landscape, with only a single study focusing on tiled Sentinel image patches for LULC classes in Bangalore, India (Pallavi et al., 2022). Furthermore, other available datasets often include non-land cover classes or exhibit excessive complexity that hampers effective classification. Additionally, these datasets are predominantly used with CNN models, which are DL neural networks that require large datasets for training. This practice disregards the foundational data science principle of starting with less complex models and gradually increasing complexity, and it leads to the underutilization of ML models. Moreover, some of the datasets are quite large, which poses significant challenges for training ML models. To overcome these limitations, this study aims to generate a tailored, medium-sized benchmark dataset specifically for the Indian context. Medium-sized datasets strike a balance, providing sufficient data for effective training of DL models without risking overfitting, while also allowing traditional ML models to leverage their efficiency and simplicity for competitive performance. In contrast, large datasets can overwhelm ML models and incur additional training time and resources, while very small datasets may not offer enough data for DL models to generalize effectively. As a result, the study presents an opportunity to evaluate both CNN and ML models, recognizing the significance of incorporating traditional ML models alongside advanced CNN models. The study offers an overview of relevant research in LULC classification, underscoring the importance of generating synthetic datasets tailored to specific contexts. The methodology used to generate the Indian LULC patch dataset is then described, and state-of-the-art CNN models and traditional ML models are employed for classification. Finally, the study presents experimental results and performance evaluation, followed by a comprehensive discussion of the findings and recommendations for future research.
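As an illustration of how a patch dataset can be cut from a larger scene, the following minimal sketch enumerates non-overlapping tile windows over an image grid. The function name `tile_windows`, the 64-pixel patch size, and the choice to drop partial edge tiles are assumptions made for illustration only; they are not the procedure used in this study.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (row_start, row_stop, col_start, col_stop)


def tile_windows(height: int, width: int, patch: int = 64) -> Iterator[Window]:
    """Yield non-overlapping patch windows over a height x width grid.

    Hypothetical helper: partial tiles at the bottom and right edges
    are dropped so that every patch has the same size.
    """
    for r in range(0, height - patch + 1, patch):
        for c in range(0, width - patch + 1, patch):
            yield (r, r + patch, c, c + patch)


# Example: a 200 x 130 pixel scene tiled into 64 x 64 patches.
windows = list(tile_windows(200, 130, patch=64))
print(len(windows), windows[0], windows[-1])
```

Each window can then be used to slice the corresponding pixels from every band of the scene, and the resulting patches can be labelled by class.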
1.2 Related work
In this section, we review earlier studies on the generation and classification of LULC scenes or datasets. In this context, we present datasets of remotely sensed aerial and satellite imagery of LULC and similar scenes. Additionally, we review the state-of-the-art CNN image classification models for LULC classification.
The generation and availability of satellite imagery datasets, such as UC Merced and EuroSAT, have significantly contributed to LULC classification and to the development and improvement of DL models. EuroSAT, a dataset consisting of high-resolution satellite images covering ten different land cover classes in Europe, has provided researchers with a valuable resource for training and evaluating LULC classification algorithms. Petrovska et al. (2020) employed a two-stream concatenation method, using CNNs to extract features and a support vector machine (SVM) to classify them, on the UC Merced and WHU-RS RS image datasets. Rajagopal et al. (2020) proposed a model based on residual network feature extraction, which extracts features from the diverse convolution layers of a deep residual network; the model was tested on the UC Merced land use and WHU-RS datasets. Studies conducted on the well-known benchmark datasets have demonstrated that RS scene classification based on heterogeneous feature extraction and fusion of CNN models is superior to many state-of-the-art scene classification algorithms (Chaib et al., 2017; Iftenea et al., 2017; Muhammad et al., 2018; Wang et al., 2020). Laban et al. (2018), on the other hand, used the WHU-RS, UC Merced, and Brazilian Coffee Scenes (BCS) datasets to study scale selection methods for remotely sensed images fed to CNN architectures.
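The feature-fusion idea behind these approaches can be sketched in a few lines. The example below assumes that each CNN stream has already produced a feature vector for an image, L2-normalizes each vector, and concatenates them into a single descriptor that a classifier such as an SVM could consume. The normalization step and the function names are assumptions for illustration; the cited studies may fuse features differently.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(stream_a: Sequence[float], stream_b: Sequence[float]) -> List[float]:
    """Concatenate two normalized feature vectors into one fused descriptor."""
    return l2_normalize(stream_a) + l2_normalize(stream_b)


# Toy feature vectors standing in for the outputs of two CNN streams.
fused = fuse_features([3.0, 4.0], [0.0, 5.0, 0.0])
print(fused)
```

Normalizing before concatenation keeps one stream from dominating the fused descriptor merely because its activations have a larger magnitude.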
Table 1
List of widely used existing RS and aerial LULC benchmark datasets
Dataset | Description | Total images | Classes | Image size (pixels) | Resolution (m) | Reference |
UC-Merced | The aerial ortho imageries were obtained from the United States Geological Survey National Map of specific regions within the U.S. The land-use images consist of red, green, and blue bands. However, classifying the dataset is challenging due to the presence of highly overlapped classes, e.g., dense residential, medium residential, and sparse residential classes, which primarily vary in the density of structures they contain. | 2,100 | 21 | 256 × 256 | 0.3 | Yang and Newsam, 2010 |
WHU-RS | The aerial scenes are collected from Google Earth imagery. Later, Sheng et al., 2012 expanded the data set with 7 new classes. The datasets have a wide range of scale, orientation, illuminations, as well as spatial resolutions with a maximum of 0.5 m. | 950 | 12 | 600 × 600 | ≥ 0.5 | Xia et al., 2010 |
WHU-RS19 | See WHU-RS above. | 1,005 | 19 | 600 × 600 | ≥ 0.5 | Sheng et al., 2012 |
BCS | The BCS dataset contains only two scene classes (coffee and noncoffee) acquired from SPOT satellite imageries across four counties in the Brazilian state of Minas Gerais. The green, red, and near-infrared bands were used in this dataset because they are the most suitable and demonstrative bands for differentiating vegetation areas. | 1,438 (coffee) & 36,577 (non-coffee) | 2 | 64 × 64 | - | Penatti et al., 2015 |
SAT-6 | Images were extracted from the National Agriculture Imagery Program. The region and the uncompressed digital Ortho quarter quad tiles (DOQQs), which are GeoTIFF images, conform to the topographic quadrangles of the United States Geological Survey. | 405,000 | 6 | 28 × 28 | - | Basu et al., 2015 |
SAT-4 | See SAT-6 above. | 500,000 | 4 | 28 × 28 | - | Basu et al., 2015 |
RSSCN7 | The images in this dataset were gathered from Google Earth and were sampled at four different scales, with 100 images per scale. The primary difficulty of this dataset arises from the variations in scale among the images. Furthermore, the dataset poses a significant challenge due to the extensive diversity of images captured under various seasonal and weather conditions. | 2,800 | 7 | 400 × 400 | - | Zou et al., 2015 |
RSC11 | The dataset was obtained from Google Earth and consists of high-resolution RS images depicting several U.S. cities. Within this dataset, certain scene classes exhibit visual similarities, thereby amplifying the challenge of accurately distinguishing between the scene images. | 1,232 | 11 | 512 × 512 | 0.2 | Zhao et al., 2016 |
NWPU-RESISC45 | The dataset was developed by Northwestern Polytechnical University (NWPU) for Remote Sensing Image Scene Classification (RESISC). It exhibits a high degree of diversity within each class and similarity between different classes. | 31,500 | 45 | 256 × 256 | 0.2–30 | Cheng et al., 2017 |
PatternNet | The image scene was collected from Google Earth imagery and, in some cases, through the Google Map API for selected cities in the United States. It was specifically gathered for RS image retrieval approaches. | 30,400 | 38 | 256 × 256 | 0.062–4.693 | Zhou et al., 2018 |
EuroSAT | A comprehensive dataset consisting of geo-referenced images captured by the Sentinel-2 satellite has been compiled, encompassing various European cities spread across more than 34 countries. This benchmark dataset comprises 13 spectral bands, providing a rich resource for analysis and research purposes. | 27,000 | 10 | 64 × 64 | 10 | Helber et al., 2019 |
AID | As Google Earth photos originate from various RS sensors, the images are multi-source. This poses greater difficulties than using photographs from a single source. | 10,000 | 30 | 600 × 600 | 0.5–8 | Xia et al., 2017 |
The existing benchmark datasets have been extensively used for analyzing the performance of classification models with different feature extraction and classification methods. DL models such as GoogLeNet, DenseNet, Visual Geometry Group 19 (VGG19), Residual Network 50 (ResNet50), and InceptionV3 have been evaluated on the EuroSAT dataset (Dewangkoro & Arymurthy, 2021; Helber et al., 2019). Carranza-García et al. (2019) used a CNN model for LULC classification of remotely sensed imagery and compared their proposed DL architecture with other ML models such as SVM, random forest (RF), and k-nearest neighbors (KNN). They reported that DL was the fastest in both training and testing and concluded that CNNs are a very powerful technique for LULC classification. Basu et al. (2015) comparatively analyzed DL models, including deep belief networks, CNNs, and stacked denoising autoencoders, using their own SAT-4 and SAT-6 datasets, and additionally developed a customized CNN architecture, DeepSat, which was found to outperform the other models. Xia et al. (2017) classified their AID dataset using GoogLeNet, VGG-VD-16, and CaffeNet and concluded that VGG-VD-16 performed best, with 89.64% accuracy.
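Accuracy figures of this kind are commonly derived from a confusion matrix. As a minimal sketch, the code below computes overall accuracy (correctly classified samples divided by all samples) and per-class recall from a small, made-up confusion matrix. The matrix values and function names are illustrative assumptions, and individual studies may report differently defined metrics.

```python
from typing import List


def overall_accuracy(cm: List[List[int]]) -> float:
    """Fraction of samples on the diagonal; rows are true classes, columns are predictions."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total if total else 0.0


def per_class_recall(cm: List[List[int]]) -> List[float]:
    """Recall per true class: diagonal entry divided by the row sum."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


# Made-up 3-class confusion matrix (not taken from any cited study).
cm = [
    [50, 2, 3],
    [4, 45, 1],
    [0, 5, 40],
]
oa = overall_accuracy(cm)
recalls = per_class_recall(cm)
print(round(oa, 4), [round(r, 4) for r in recalls])
```

Reporting per-class recall alongside overall accuracy helps reveal classes that are frequently confused, such as the residential density classes in UC-Merced.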
In recent studies, Naushad et al. (2021) introduced a method based on wide residual networks that surpassed the performance of ResNet, achieving an accuracy of 99.17%. Temenos et al. (2023) evaluated various existing CNN models alongside their newly developed model, Deep SHAP. The existing models, such as shallow CNN, GoogLeNet, DenseNet121, Inception V3, ResNet50, ResNet101, VGG16, and GeoSystemNet, outperformed the proposed model. Furthermore, Yaloveha et al. (2021) observed that using spectral indices in classification tasks improved accuracy compared with using only RGB channels, with classification accuracy increasing from 64.72% to 84.19%. Broni-Bediako et al. (2022) also reported variations in their model's performance across different datasets, achieving accuracy rates of 96.56% and 96.10% on the NWPU-RESISC45 and AID single-label RGB aerial image datasets, respectively, as well as 99.76% and 93.89% on the EuroSAT single-label and BigEarthNet multilabel multispectral satellite image datasets, respectively. Helber et al. (2019) applied ResNet-50 and GoogLeNet to various datasets and found that GoogLeNet performed best on UC Merced (UCM), with an accuracy of 97.32%, while ResNet-50 excelled on SAT-6, with an accuracy of 99.56%. Thiagarajan et al. (2021) achieved remarkable results using the HFEL–CCGSA method, reporting a classification accuracy of 99.99% for SAT-4 and SAT-6 and surpassing AlexNet, LeNet-5, and ResNet. On the EuroSAT dataset, however, its accuracy was 99.49%, which was lower than that of the GeoSystemNet model. Finally, Chen and Tsou (2021) proposed DRSNet, a novel deep CNN architecture specifically designed for recognizing small-patch Landsat 8 RS images, and demonstrated impressive performance on the EuroSAT, BCS, and UC-Merced datasets as well.
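To make the notion of a spectral index concrete, the sketch below computes the normalized difference vegetation index (NDVI), defined as (NIR - Red) / (NIR + Red), for a few reflectance pairs. NDVI is used here only as a familiar example; it is an assumption that it resembles the indices used in the cited work, and the reflectance values are invented.

```python
from typing import List, Tuple


def ndvi(nir: float, red: float) -> float:
    """Normalized difference vegetation index; returns 0.0 when both bands are zero."""
    denom = nir + red
    return (nir - red) / denom if denom != 0 else 0.0


# Invented (NIR, red) reflectance pairs for three pixels.
pixels: List[Tuple[float, float]] = [(0.5, 0.1), (0.2, 0.2), (0.0, 0.0)]
index_values = [ndvi(nir, red) for nir, red in pixels]
print([round(v, 4) for v in index_values])
```

An index computed this way can be stacked with, or substituted for, the raw bands as an additional input channel for a classifier.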