SDFC dataset: a large-scale benchmark dataset for hyperspectral image classification

Hyperspectral image (HSI) classification plays an important role in a wide range of remote sensing applications in military and civilian fields. Over the past decades, significant efforts have been devoted to developing datasets and introducing novel approaches for HSI classification, and promising classification performance has been achieved. However, existing datasets generally suffer from limited categories and annotated samples, a lack of sample diversity, and low spatial resolution. These limitations severely restrict the development and evaluation of data-driven models, especially deep neural network-based ones. In recent years, advances in imaging spectroscopy have made it possible to acquire hyperspectral image data with high spectral and spatial resolution. Therefore, in this paper, we contribute a large-scale benchmark dataset for hyperspectral image classification, denoted ShanDongFeiCheng (SDFC), to address the issues of existing datasets. The proposed SDFC is characterized by (1) large-scale annotated samples with diverse categories; (2) high spatial resolution; and (3) high intra-class variance yet relatively low inter-class variance, which makes the HSI classification task much more challenging. We evaluated 10 classic traditional and deep neural network-based models on SDFC, and the results can serve as useful baselines for further experiments. Moreover, given the state-of-the-art performance of SpectralNet, we selected it as the representative method and evaluated it across datasets to analyze how different datasets affect the classification model. The comprehensive review and analysis of the representative classification models on both existing and proposed datasets demonstrate the advantages and challenges of our proposed dataset and provide promising perspectives for future HSI classification studies.


Introduction
Nowadays, the drastically increased number of remote sensing images acquired by different imaging instruments (e.g., hyperspectral sensors and synthetic aperture radar) (Paoletti et al. 2019; Hladik et al. 2013) provides the opportunity to measure the earth's surface more precisely, which also poses new challenges for intelligent earth observation. Hyperspectral imaging, as one of the most representative remote sensing techniques, can simultaneously obtain the spatial, spectral, and radiation information of ground objects, and it plays a crucial role in material analysis and in the classification of Land Use and Land Cover (LULC) (Samat et al. 2016). Hyperspectral image (HSI) classification aims at assigning a semantic label to each pixel to facilitate further applications (He et al. 2018). It is increasingly applied in diverse areas, such as Ecology, Geology, Geomorphology, Soil Science, and Atmospheric Science (Wei and Zhou 2021).
During the past decades, several HSI datasets and various methods, including traditional and deep neural network-based ones, have been proposed to improve the accuracy of HSI classification (Qing et al. 2021). A large number of methods, especially deep neural network-based ones, have achieved advanced classification performance on publicly available datasets (Abdulsamad et al. 2021; Mou et al. 2018; Huang et al. 2021; Yan et al. 2020). For instance, state-of-the-art models such as A2S2K-ResNet (Roy et al. 2021), HybridSN (Roy et al. 2020), and SpectralNet (Chakraborty et al. 2021) obtain over 97% accuracy on the Pavia University dataset. However, existing datasets often suffer from limited categories and annotated samples, a lack of sample diversity, and low spatial resolution, which restricts the development of HSI classification in two respects: (1) simple test data cannot further reveal the potential of classification methods once these methods achieve accuracy close to 100%; (2) limited training samples affect the generalization ability of data-driven methods, especially neural network-based ones.
With the development of imaging techniques, hyperspectral images with higher spectral and spatial resolution can now be obtained. Therefore, in this paper we propose a new large-scale benchmark dataset, denoted ShanDongFeiCheng (SDFC), to address the above-mentioned issues. The proposed SDFC has 20 categories and 1201 × 601 pixels, of which 652743 pixels are manually labeled. The false-color image is shown in Fig. 1. SDFC aims to provide a benchmark resource for the evaluation and further development of state-of-the-art classification models. Moreover, we also present a comprehensive review of up-to-date methods in the field. Experimental results indicate that the proposed dataset is effective for assessing the potential and shortcomings of existing methods. To sum up, the main contributions of this paper are threefold:
(1) We contribute a large-scale benchmark dataset (SDFC) for HSI classification in order to address the limitations of existing datasets. To the best of our knowledge, SDFC possesses the largest number of annotated samples and categories, as well as a high spatial resolution. The high intra-class variance yet relatively low inter-class variance in SDFC also poses new challenges to the classification task. These characteristics make SDFC more suitable for evaluating and advancing state-of-the-art methods for HSI analysis.
(2) We provide a comprehensive review of the recent progress in the HSI classification field, including publicly available datasets and state-of-the-art methods.
(3) We investigate the classification performance of representative methods on both existing and proposed datasets, providing fair comparisons in diverse scenarios. These results can serve as baselines and inspire promising perspectives for future studies.
The rest of the paper is organized as follows: in Sect. 2, we introduce the existing datasets for HSI classification, and the details of the proposed SDFC dataset are presented in Sect. 3. The experimental evaluation of representative methods is described in Sect. 4. Finally, conclusions and discussions regarding potential research directions are summarized in Sect. 5.

Public datasets
During the past decades, several public hyperspectral image datasets have been proposed and applied to the hyperspectral image classification task. In this section, we introduce the datasets listed in the open-source community Papers with Code in chronological order.

Indian Pines dataset
The Indian Pines dataset was captured by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in 1992 (Wang et al. 2018; Zhu et al. 2022). The scene covers the Purdue University Agronomy farm and its surrounding area in the northwest of West Lafayette. There are 145 × 145 pixels in this dataset and the wavelength range is from 400 to 2500 nm. The spatial resolution is 20 m and the spectral resolution is 10 nm. After the removal of water absorption bands, the remaining 200 bands are utilized for hyperspectral image classification. Figure 2 shows the false-color image of Indian Pines and the corresponding ground-truth. This dataset possesses 10249 labeled samples, which consist of 16 categories, including Alfalfa,

Salinas Scene dataset
This dataset was captured in 1998 by the AVIRIS sensor over the Salinas Valley area in California (Zheng et al. 2020). It consists of 512 × 217 pixels and the wavelength range is from 400 to 2500 nm. The spatial and spectral resolutions are 3.7 m and 10 nm, respectively. The Salinas Scene dataset retains 204 bands for hyperspectral image classification, which are not affected by water absorption or low SNR.

Pavia University dataset
The Pavia University dataset was introduced by the Telecommunications and Remote Sensing Laboratory of Pavia University in 2001 (Zhu et al. 2022). It was obtained by a Reflective Optics System Imaging Spectrometer (ROSIS) sensor during a flight campaign over Pavia, Northern Italy (Roy et al. 2021). The Pavia University image consists of 610 × 610 pixels with wavelengths from 430 to 860 nm. It has a 1.3 m spatial resolution and a 4 nm spectral resolution. After removing the broken bands, the remaining 103 bands are used for classification. The false-color image and the corresponding ground-truth are shown in Fig. 5. This dataset contains a total of 42776 labeled samples in 9 classes. Unlike the above datasets, whose samples are derived from rural scenes, the Pavia University dataset mainly covers an urban scene, and its categories are Asphalt, Meadows, Gravel, Trees, Painted Metal Sheets, Bare Soil, Bitumen, Self-Blocking Bricks, and Shadows.

Pavia Center dataset
Similar to the Pavia University dataset, Pavia Center is another urban dataset provided by the Telecommunications and Remote Sensing Laboratory of Pavia University, also captured by the ROSIS sensor in the Pavia area of Northern Italy. It consists of 1096 × 1096 pixels with wavelengths from 430 to 860 nm. It has the same spatial and spectral resolutions as the Pavia University dataset. After the removal of broken bands, the remaining 102 bands are utilized for HSI classification. The false-color image and the corresponding ground-truth are shown in Fig. 6. The dataset contains 148152 labeled samples from 9 classes, i.e., Water, Tree, Meadow, Brick, Bare Soil, Asphalt, Bitumen, Tile, and Shadows.

Botswana dataset
The Botswana dataset was captured by the Hyperion sensor on the NASA EO-1 satellite over the Okavango Delta area of Botswana in 2004 (Bandara et al. 2022; He et al. 2018). It possesses 145 bands and consists of 1476 × 256 pixels with wavelengths from 400 to 2500 nm; the spatial resolution is 30 m and the spectral resolution is 10 nm. Figure 7 shows the false-color image and the corresponding ground-truth. This dataset contains 3248 labeled samples, which are grouped into 14 classes. These categories are Water, Hippo Grass, Floodplain Grasses1, Floodplain Grasses2, Reeds, Island Interior, Acacia Woodlands, Acacia Shrublands, Acacia Grasslands, Short Mopane, Mixed Mopane, Exposed Soils, Riparian, and Firescar.

Other datasets
In addition to the datasets mentioned above, there remain several other datasets for hyperspectral image classification, e.g., the WashingtonDC dataset (Palash et al. 2021), Houston 2013 (Hang et al. 2021), the WHU-Hi dataset (Hu et al. 2020), the Cuprite dataset (Cen et al. 2020), and the Matiwan Village dataset (Cen et al. 2020). These datasets have fewer categories, resulting in small intra-class diversity. For example, there are only five categories in the WashingtonDC dataset. Moreover, these categories generally belong to similar semantic branches. Specifically, the main categories of the WHU-Hi dataset and the Matiwan Village dataset both belong to Plants. The application of these datasets is rather limited compared to the seven datasets introduced earlier; therefore, we focus our investigations on those widely used datasets.

Brief summary
As summarized above, tremendous efforts have been devoted to dataset construction for HSI classification. However, most existing datasets show the following limitations when facing the rapid development of classification models, especially data-driven ones:
(1) The number of categories and labeled samples is rather small, which severely restricts the evaluation of the effectiveness of data-driven methods.
(2) The spatial resolution is relatively low, which leads to spectral mixing in pixels located at the boundary between different ground objects. This could lead to inaccurate labelling of boundary pixels, which may affect the classification performance to a certain extent.
(3) The existing datasets possess low intra-class variance and high inter-class variance, so most modern classification models can generate features that are discriminative enough to achieve high accuracy. This is not conducive to assessing the potential of these models when dealing with more general and challenging cases.
Existing datasets thus exhibit several limitations, including small-scale annotated samples, few categories, saturated classification accuracy for advanced methods, and limited annotated areas. These limitations hinder the application of deep learning methods to hyperspectral images. To address the overfitting issues raised by limited training data, we propose a new large-scale benchmark dataset. It covers 652743 labeled pixels in 20 classes annotated for hyperspectral image classification.

Image acquisition
The SDFC dataset was captured by the High-spectral Aerial Hyperspectral Sensor (HAHS) over the Feicheng area (36°13′N, 116°46′E) in 2018. Feicheng is located in Shandong Province, China, on the western side of Mount Tai. The city covers an area of about 1,277 square kilometers, with a length of 48 km from north to south and a width of 37.5 km from east to west. The landscapes covered by the dataset are diverse, with altitudes of 58-600 m. The entire region is located in the mid-latitude zone and has a warm continental monsoon climate. The data were captured at a flight altitude of 5 km under good visibility conditions. There are 1201 × 601 pixels in this dataset, and the wavelength range is from 400 to 1000 nm with 63 bands. The spatial resolution is 50 cm and the spectral resolution is 10 nm; the spatial resolution is thus higher than that of the other datasets. Figure 1 shows the false-color image of the SDFC dataset.

Annotations & statistics
In total, 652743 pixels of the SDFC dataset were annotated, including the boundary areas that are often neglected in existing datasets. The proportion of annotated pixels in the entire image is about 90%, which is the largest among existing datasets. To address the relatively low inter-class variance of the dataset, we made sure that the samples of each category were widely distributed in the image and annotated them at a fine-grained level. For example, roofs are divided into five categories based on their material, namely Concrete Roof, Caigang Watt Roof, Asphalt Roof, Glazed Tile Roof, and Clay Tile Roof.
To ensure the reliability of the labelling process, we designed the following rules. First, we took advantage of the RGB image mentioned above and divided the image into multiple regions at the pixel level according to the shape and color of land objects. Then, we merged and categorised these regions according to their spectral curves, with verification by field trips. Finally, we summarized 20 classes and defined a color for each class to obtain the pixel-level ground-truth. The land objects and corresponding colors are shown in Table 2 and Fig. 8. It is worth noting that three researchers were involved in the labelling process and the final result was determined by voting to ensure fairness. Moreover, following the existing protocol, unrecognisable regions were defined as background and were not used as training data.
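The voting step among the three annotators can be illustrated with a minimal sketch. The function name `majority_vote`, the background label `0`, and the rule that a pixel without a strict majority falls back to background are all assumptions made for illustration; the source only states that the final result was determined by voting.

```python
from collections import Counter

BACKGROUND = 0  # assumed label for unrecognisable regions


def majority_vote(label_maps, background=BACKGROUND):
    """Fuse per-pixel labels from several annotators by majority vote.

    label_maps: list of 2-D lists of ints, one map per annotator, same shape.
    A pixel without a strict majority is set to `background` (an assumption).
    """
    rows, cols = len(label_maps[0]), len(label_maps[0][0])
    fused = [[background] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            votes = Counter(m[r][c] for m in label_maps)
            label, count = votes.most_common(1)[0]
            # Accept the label only if more than half of the annotators agree.
            if count * 2 > len(label_maps):
                fused[r][c] = label
    return fused


# Toy 2 x 2 example with three hypothetical annotators.
annotator_a = [[1, 2], [3, 0]]
annotator_b = [[1, 2], [4, 0]]
annotator_c = [[1, 5], [5, 7]]
fused_map = majority_vote([annotator_a, annotator_b, annotator_c])
```

With three annotators, a strict majority means at least two identical labels; pixels on which all three disagree are treated as background in this sketch.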

Spectral separability analysis
In order to evaluate how well the dataset features support classification, we need a quantitative standard. Therefore, we selected three indicators to compare the SDFC dataset against existing datasets, namely the intra-class average distance, the inter-class average distance, and the similarity matrix. Finally, we applied t-distributed Stochastic Neighbor Embedding (t-SNE) to analyse the distribution of the selected samples.
More specifically, an HSI can be denoted as $X \in \mathbb{R}^{w \times h \times b}$, where $w$ and $h$ are the spatial sizes of the dataset and $b$ is the number of bands. We denote the numbers of samples in each class by $N = \{N_1, N_2, \cdots, N_C\}$, where $C$ is the number of classes. $X_i = \{X_{i1}, X_{i2}, \cdots, X_{iN_i}\}$ is the set of samples of class $i$, where $X_{ij} = (X_{ij1}, X_{ij2}, \cdots, X_{ijb})$ is the spectral vector of the $j$th pixel in class $i$. Generally, the intra-class average distance reflects the distribution of samples within one class, and the inter-class average distance reflects the distribution of the average spectral curves among different classes. A smaller intra-class average distance and a larger inter-class average distance mean that the dataset is more separable. For the intra-class average distance, Intra_Class(i) denotes the intra-class average distance of class $i$, and Intra_Class_mean is the average of Intra_Class(i) over all classes. The inter-class average distance is computed from the class means, where $X_i^{mean}$ indicates the average spectral vector of class $i$. Moreover, we also applied the cosine of the angle between two spectral curves as the similarity measure, defined as follows:
\[ similarity_{ij} = \frac{X_i^{mean} \cdot X_j^{mean}}{\lVert X_i^{mean} \rVert \, \lVert X_j^{mean} \rVert} \]
where $similarity_{ij}$ is the cosine similarity between the mean of class $i$ and the mean of class $j$. The higher the value, the higher the similarity. The similarity matrices of the compared datasets are shown in Fig. 9, where the value of each grid cell represents the similarity between the class means of the two given classes. We can observe that, overall, the values in the SDFC dataset are larger than those in the other datasets. This indicates that the dataset is more challenging for the classification task.
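As a minimal sketch, the three indicators could be computed as follows. The distance definitions are assumptions made for illustration: the intra-class distance is taken as the mean Euclidean distance from each sample to its class mean, and the inter-class distance as the mean Euclidean distance over all pairs of class means. These may differ from the definitions used to produce the reported figures. Only the cosine similarity follows directly from the description in the text. All function names are hypothetical, and plain Python is used for clarity; a real pipeline would likely vectorize these loops.

```python
import math


def class_mean(samples):
    """Band-wise mean spectral vector of one class."""
    n = len(samples)
    return [sum(s[k] for s in samples) / n for k in range(len(samples[0]))]


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def intra_class_distance(samples):
    """Assumed definition: mean distance from each sample to the class mean."""
    mu = class_mean(samples)
    return sum(euclidean(s, mu) for s in samples) / len(samples)


def inter_class_distance(means):
    """Assumed definition: mean distance over all unordered pairs of class means."""
    pairs = [(i, j) for i in range(len(means)) for j in range(i + 1, len(means))]
    return sum(euclidean(means[i], means[j]) for i, j in pairs) / len(pairs)


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_matrix(means):
    """Cosine similarity between every pair of class means."""
    return [[cosine_similarity(u, v) for v in means] for u in means]


# Toy data: two classes, two samples each, two bands.
classes = {
    "A": [[1.0, 0.0], [3.0, 0.0]],
    "B": [[0.0, 2.0], [0.0, 4.0]],
}
means = [class_mean(classes[name]) for name in sorted(classes)]
intra = {name: intra_class_distance(s) for name, s in classes.items()}
intra_mean = sum(intra.values()) / len(intra)
inter = inter_class_distance(means)
sim = similarity_matrix(means)
```

In the toy data the two class means are orthogonal, so their cosine similarity is zero; in real hyperspectral data all reflectance values are non-negative, so similarities between class means tend to be close to one, which is why small differences in the matrix matter.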
To further observe the distribution of samples from different categories, we randomly selected ten samples per category and visualised them with t-SNE. From Fig. 10 we can see that, in the SDFC dataset, samples in the same category have a higher dispersion and the positions of different categories are interleaved. Thus, the pixels of different classes in the SDFC dataset are packed more closely together than in the other datasets.
Fig. 9 The similarity matrices of seven datasets
Fig. 10 The t-SNE of seven datasets

Representative HSI classification methods
In order to verify the performance of the representative methods on SDFC dataset, we select ten methods.

Experimental setup
To validate the performance of the compared methods on the SDFC dataset, we designed two sets of experiments. In the first set, ten methods were run on the SDFC dataset to assess the capabilities of these models. In the second set, we chose the method with the best and most stable performance from the first set and applied it across datasets to analyse the characteristics of each dataset. For a more comprehensive evaluation, we designed two training protocols:
(1) Percentage: 1, 5, 10, 15, and 20% of the dataset were randomly selected as the training data.
(2) Balanced: 50, 100, 150, and 200 samples in each class were randomly selected to form the training set.
For fair comparisons, the parameters were fine-tuned starting from the open-source implementations. The training details are listed in Table 4. All models were trained with a 2 GHz CPU and two NVIDIA RTX 2080 Ti GPUs.
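The two training protocols can be sketched as follows. This is an illustrative sketch under stated assumptions: labels are given as a flat list with `0` as background, the percentage protocol samples uniformly over all labelled pixels (a per-class stratified draw is another common reading), and a class with fewer samples than requested contributes all of its samples to the balanced training set. The function names are hypothetical.

```python
import random
from collections import defaultdict


def group_by_class(labels, background=0):
    """Map each class label to the list of pixel indices carrying it."""
    groups = defaultdict(list)
    for idx, lab in enumerate(labels):
        if lab != background:
            groups[lab].append(idx)
    return groups


def percentage_split(labels, fraction, seed=0, background=0):
    """Percentage protocol: draw `fraction` of all labelled pixels at random."""
    rng = random.Random(seed)
    labelled = [i for i, lab in enumerate(labels) if lab != background]
    n_train = max(1, round(fraction * len(labelled)))
    train = set(rng.sample(labelled, n_train))
    test = [i for i in labelled if i not in train]
    return sorted(train), test


def balanced_split(labels, per_class, seed=0, background=0):
    """Balanced protocol: draw the same number of pixels from every class."""
    rng = random.Random(seed)
    train = []
    for _, idxs in sorted(group_by_class(labels, background).items()):
        # Assumption: a small class contributes everything it has.
        k = min(per_class, len(idxs))
        train.extend(rng.sample(idxs, k))
    chosen = set(train)
    test = [i for i, lab in enumerate(labels)
            if lab != background and i not in chosen]
    return sorted(train), test


# Toy label vector: 10 background pixels and three classes of unequal size.
labels = [0] * 10 + [1] * 100 + [2] * 60 + [3] * 40
train_pct, test_pct = percentage_split(labels, 0.10)
train_bal, test_bal = balanced_split(labels, 50)
```

Fixing the seed makes a split reproducible, which matters when several methods are compared on the same training set.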

Experimental indicators
There are three standard evaluation metrics in HSI classification: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (kappa). The overall accuracy is the ratio of the number of correctly classified samples to the total number of samples. However, it does not characterise the per-category classification well for a dataset with an extremely unbalanced number of samples per category. Therefore, we also applied the average accuracy, which first calculates, for each category, the ratio of the number of correctly predicted samples to the total number of samples in that category, and then averages these per-category accuracies. The kappa coefficient is often used to measure the consistency between the predictions and the ground truth. The formula is as follows:
\[ kappa = \frac{p_0 - p_e}{1 - p_e}, \qquad p_e = \frac{a_1 b_1 + a_2 b_2 + \cdots + a_c b_c}{n^2} \]
where $p_0$ is the overall accuracy, $a_1, a_2, \cdots, a_c$ are the numbers of true samples in each category, $b_1, b_2, \cdots, b_c$ are the numbers of predicted samples in each category, and $n$ is the total number of samples.
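The three metrics can be computed from a confusion matrix, as in the following minimal sketch. It uses the standard Cohen's kappa with the chance agreement taken from the row sums (true samples per class) and column sums (predicted samples per class). The function names and the toy labels are illustrative only.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts samples of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def overall_accuracy(cm):
    total = sum(map(sum, cm))
    return sum(cm[i][i] for i in range(len(cm))) / total


def average_accuracy(cm):
    """Mean of the per-class accuracies, skipping classes without samples."""
    per_class = [cm[i][i] / sum(cm[i]) for i in range(len(cm)) if sum(cm[i]) > 0]
    return sum(per_class) / len(per_class)


def kappa(cm):
    n = sum(map(sum, cm))
    p0 = overall_accuracy(cm)
    a = [sum(row) for row in cm]        # true samples per class
    b = [sum(col) for col in zip(*cm)]  # predicted samples per class
    pe = sum(ai * bi for ai, bi in zip(a, b)) / (n * n)
    return (p0 - pe) / (1 - pe)


# Toy example with three classes and ten samples.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
oa, aa, k = overall_accuracy(cm), average_accuracy(cm), kappa(cm)
```

In this toy case OA is 0.8 while AA is 0.75, because the smallest class is only half correct; this gap is the reason AA is reported alongside OA on unbalanced datasets.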

Comparison between different methods on the SDFC dataset
The aim of the first set of experiments is to compare the performance of different methods on the SDFC dataset. Table 5 reports the performance of different methods under the percentage protocol, and Table 6 records their performance under the balanced protocol. Figure 11 shows the trends of OA and Kappa under the percentage protocol. It can be observed that most models perform better as the number of training samples increases, although the upward trend gradually slows down. SpectralNet performs well in this set of experiments, especially when the training samples are sufficient. This result is consistent with the performance reported on Papers with Code. Figure 12 shows the trends of OA and Kappa for different methods under the balanced protocol. SpectralNet again achieves robust overall performance. However, when training samples are insufficient, the FDSSC method achieves the best performance, which suggests that SpectralNet is more sensitive to the number of training samples. Moreover, we find that most classification-based methods achieve relatively better performance than segmentation-based ones, since classification-based methods pay more attention to the features of each pixel, while segmentation-based methods focus more on the spatial context. From Fig. 13, it can be seen that the prediction maps of the classification-based method contain more salt-and-pepper noise than those of the segmentation-based method. The segmentation-based method, on the other hand, produces more regular results with less edge noise.

Comparisons across datasets
The aim of the second set of experiments is to compare the performance of the same classification method on different datasets. Given that SpectralNet achieves the best overall performance in the first set of experiments, it is selected as the base model for the second set. We can see that SpectralNet tends to obtain above 90% accuracy on most datasets when sufficient training samples are available. For example, the OA on the Pavia Center dataset with 5% of training samples is similar to that with 20%, with a value of 0.99. However, the classification performance on the SDFC dataset is the lowest among all datasets, which indicates that there is still room for further improvement of methods on SDFC. It can also be found that SpectralNet performs the worst on the Botswana dataset with limited training samples, because the total number of labeled samples in this dataset is already small. As the number of training samples increases, the classification results on the Botswana dataset improve accordingly. Figure 15 shows the trends of OA and Kappa on different datasets under the balanced protocol. From the figures, it can be found that the SDFC dataset is much more difficult to classify than the others. Table 9 records the per-class prediction accuracy on three datasets: Indian Pines, Salinas Scene, and SDFC. The accuracy of five categories in Indian Pines is below 70%, namely Alfalfa, Corn, Grass-pasture-mowed, Oats, and Stone-steel-towers. In SDFC, however, nine categories have such low accuracy, i.e., Crops 2, Shadow, Concrete Roof, Cars, Grass, Clay Tile Roof, Special Material, White Tile Floor, and Black Tile Floor, which indicates the challenges of the proposed dataset.

Conclusion
In this paper, we provide a larger hyperspectral image dataset, denoted the SDFC dataset, with a significantly higher annotation ratio and more categories. The SDFC dataset has a high intra-class variance yet relatively low inter-class variance, which further encourages the exploration of stronger hyperspectral image classification techniques. We compare the proposed dataset with multiple existing datasets, and state-of-the-art methods are used to demonstrate that, although advanced performance is obtained on public datasets, there is still great room for further improvement. In summary, we expect that the proposed SDFC dataset will facilitate the future progress of data-driven classification methods.