A novel strategy for classifying spectral-spatial shallow and deep hyperspectral image features using 1D-EWT and 3D-CNN

Hyperspectral images (HIs) are used in diverse disciplines, such as resource handling, land cover analysis, food science, anomaly detection, and precision agriculture. Researchers have developed numerous visual processing and machine intelligence algorithms to handle this type of data as efficiently as possible. Deep learning approaches have advanced significantly in the field of machine vision and are also having a substantial impact on the analysis of hyperspectral data. To increase its discriminative potential for HI classification, this work proposes a powerful 3D-CNN (Convolutional Neural Network) architecture in which the shallow features extracted using the 1D-EWT (Empirical Wavelet Transform) serve as input and the final outputs of the CNN are the projected class-related outcomes. The framework is named PEC, where P stands for PCA, E for 1D-EWT, and C for the 3D-CNN model. Prior to feature extraction, the HI undergoes spectral dimension reduction via Principal Component Analysis (PCA). To predict segmentation for a volumetric region of a 3D HI sample, 3D-CNNs use 3D convolutional kernels. Although these CNNs use more parameters, their ability to exploit interslice context can boost performance. The CNN model's parameters are optimised using a limited training set. The proposed PEC framework achieves a considerable overall accuracy of 99.58% and 99.94% (percentage increases of 0.08% and 0.54%) and Kappa scores of 99.51% and 99.92% (percentage increases of 1.5% and 0.8%) compared to similar approaches on the two real-world datasets IP and PU, respectively, with 30% training samples.


Introduction
In practice, hyperspectral imaging (HI) refers to the process of capturing and analyzing an image across a large number of narrow wavelength bands, allowing the image to be split into hundreds of colours (multispectral imaging is limited to a few colours, such as red, green, blue and near infrared). The techniques of spectroscopy (detecting materials based on how light interacts with a subject) enable HI to gather sufficient bands of data to accurately represent each pixel of an image.
A drone equipped with a hyperspectral camera may identify plant diseases, weeds, soil erosion issues and water control issues, and predict crop yields in agricultural applications. HI is a non-destructive, non-contact technique that can be utilised without harming the object being examined. While multispectral datasets typically consist of roughly 5-10 bands with very high bandwidths, hyperspectral datasets typically contain over a hundred spectral bands with relatively narrow (5-10 nm) bandwidths. The most frequently employed spectral ranges include the visible and the infrared (near and short wave).
Hyperspectral sensors are passive components that gather data in the form of a collection of images corresponding to various electromagnetic spectrum bands. A hyperspectral data cube is created by combining these images; this cube can then be processed and examined to analyze the spectral data for a variety of uses. Tan et al. (2017) discuss HI-based advancements and identify HI tools for social, environmental and military applications over the past three decades. The vast majority of the Earth can be imaged using modern sensor technology, which has exceptional spatial, spectral, and temporal resolution. These characteristics enable the effective application of hyperspectral imaging in a plethora of remote sensing applications that necessitate the determination of the physical characteristics of various complex surfaces as well as the identification of visually comparable materials with precise spectral signatures. Spectral resolution and spatial resolution are two characteristics of hyperspectral images (HIs).
HI has two spatial dimensions (H_x and H_y), measuring the geometric relation of the image pixels to one another, and one spectral dimension (S), which reveals the changes in image pixel values with respect to wavelength (Khan et al. 2018).
One of the objectives of HI classification is to identify the class label of every pixel within the image; this is among the key issues for the bulk of these applications. The fundamental difficulties in HI-based classification are coping with few training samples, high-dimensional data, and expensive human labeling (Paoletti et al. 2019).
To retrieve relevant spectral properties, band selection and dimension reduction are the two most preferred approaches, suggested in Chang et al. (1999); Bai et al. (2015) and Harsanyi and Chang (1994); Tan et al. (2013), respectively. It is challenging to appropriately categorise objects using only spectral data because of their different spatial distributions and spectral variation (Rajan et al. 2008). Spatial elements are becoming increasingly important in HI classification with the advancement of computer-based intelligent image processing and vision (Fauvel et al. 2008). A plethora of spatial feature extraction strategies have been devised in recent years, including the wavelet transform (Zhu and Yang 1998), the Gabor filter (Li and Du 2014) and the grey-level co-occurrence matrix (Pesaresi et al. 2008).
Applications of the wavelet transform have been evolving for three decades and span many fields. The authors in Pittner and Kamarthi (1999); Mallat (1999) discuss such applications for analyzing signals and images, data compression, pattern recognition and human vision. For HI classification, many other feature extraction methodologies have also been devised, all of which build on the localization characteristics of the wavelet transform (Hsu and Ths 2003; Hsu et al. 2002). In wavelet-based feature extraction techniques, the regional spectral disparity of the HI curve in diverse spectral channels at every wavelength band can be evaluated automatically, delivering instrumental details for HI classification. Hsu and Ths (2003) provides additional information on wavelet-based feature extraction techniques. The aforementioned spatial feature extraction methods using the wavelet transform, Gabor filter and grey-level co-occurrence matrix are together referred to as "handcrafted" or "shallow" features (Xu et al. 2018). These features are distinct from those obtained from the various layers of a deep learning model. Owing to the significant contributions and achievements of deep learning techniques in the realm of visual information processing, a practical alternative method for automatically extracting valuable characteristics from HIs has emerged (Krizhevsky et al. 2012; Ji et al. 2013; Sun et al. 2014; Sandeep Kumar et al. 2022).
Hyperspectral data exhibits non-stationary and non-linear characteristics. Since the Empirical Mode Decomposition (EMD) proposed by Huang et al. (1998) offers solutions for non-linear and non-stationary signals, the adaptive techniques of the Empirical Wavelet Transform (EWT) and EMD have immense potential for HI applications. EMD has also been applied to dimensionality reduction (Gormus et al. 2012), separation of harmonics from damaged bearings (Immovilli et al. 2010), and basic HI classification (Demir and Erturk 2008; Demir and Ertürk 2010; Ertürk et al. 2012; Jonnadula et al. 2020).
Numerous studies also evaluate Ensemble Empirical Mode Decomposition (EEMD), EMD, and EWT from the application perspective of bearing defect diagnostics. Prabhakar and Geetha (2017) observe that EWT outperforms EEMD and EMD, with significantly lower time complexity.
In this study, we use an EWT-based shallow feature extraction method to handle the frequency-band features of hyperspectral images. To extract and classify the deep features, the EWT-based features are then fed into a deep learning architecture. This work improves the study of HIs by adapting Jayapriya et al. (2020), which classified EWT-based HI features using deep learning. The proposed classification framework outperforms six previous models of analysis.
The paper's remaining content is laid out as follows. Section 2 presents the suggested framework, comprising established methods, the proposed mathematical models, and details of the classification process modeled in this study. The HI dataset(s) used, the experimental conditions, and the framework's findings are detailed in Section 3. The experimental analysis in Section 3 shows that the framework performs effectively even with limited training samples and learns discriminative hierarchical spectral-spatial features. Section 3.4 discusses the computational complexities of the algorithms used in the suggested framework and in the previously implemented method (Jayapriya et al. 2020). The block diagram of the proposed PEC framework is shown in Fig. 4. Finally, Section 4 summarises the conclusions and discussions.

Proposed Framework
Due to the high dimensionality of hyperspectral image data and the scarcity of training examples, HI applications struggle with class imbalance. To deal with such issues, we propose a novel framework that combines three distinct techniques. In the first stage, the spectral dimension is lowered via principal component analysis (PCA); the adapted dimensionality reduction process is elaborated in Section 2.1. The data is then transformed using 1D-EWT to extract shallow features; shallow feature extraction with EWT is the core methodology of the framework and is discussed in detail in Section 2.2. The resulting frequency components are finally fed into a 3D-CNN for deep hierarchical feature extraction and classification; the 3D-CNN-based methodologies are outlined in Section 2.3. The detailed description of the proposed framework follows.

Dimensionality reduction
Let Img ∈ R^(M×N×d) represent the spectral-spatial HI cube, where Img is the initial input and M, N and d indicate the numbers of rows, columns and channels, respectively. Each HI pixel in the Img cube has d spectral values. The ground truth of Img takes one value from the set of C classes, Y = {y_1, y_2, ..., y_C}. The HI data Img is subjected to conventional PCA along the spectral bands to reduce spectral redundancy. PCA retains the spatial dimensions while decreasing the number of spectral bands from d to m, where d ≫ m (refer to the block diagram in Fig. 1). Since the spatial content is crucial for identifying any object and must be preserved, only the spectral bands are reduced. I_p ∈ R^(M×N×m) is the updated input after PCA, where M, N and m represent the width, height and number of spectral bands, respectively.
These values constitute the PCA-reduced data cube. Spectral dimension reduction improves classification performance while reducing overfitting and computational expense. We use I_p to represent this dimensionally reduced image. Further details on PCA and the underlying mathematics can be found in Jonathon (2014).
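The spectral reduction step above can be sketched in a few lines. The snippet below is an illustrative reimplementation with a plain SVD (the experiments actually use scikit-learn's PCA); the array shapes are toy values, not the paper's datasets.

```python
import numpy as np

def pca_reduce(img, m):
    """Reduce the spectral dimension of an (M, N, d) HSI cube to m bands via PCA.

    Each pixel's d-dimensional spectrum is projected onto the first m
    principal spectral directions; the spatial layout is preserved.
    """
    M, N, d = img.shape
    X = img.reshape(-1, d).astype(np.float64)   # one row per pixel
    X -= X.mean(axis=0)                          # centre each spectral band
    # Right singular vectors give the principal spectral directions.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    I_p = X @ Vt[:m].T                           # project onto first m components
    return I_p.reshape(M, N, m)

cube = np.random.rand(8, 8, 50)                  # toy 8x8 image with 50 bands
reduced = pca_reduce(cube, 5)
print(reduced.shape)                             # (8, 8, 5)
```

With m = 30 (IP) or m = 15 (PU, SA) in place of 5, this mirrors the d → m reduction that produces I_p.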

Feature extraction with 1D-EWT
In the signal processing domain, the Empirical Wavelet Transform (EWT) proposed by Gilles (2013) has been widely adopted for numerous applications. In 2014, Gilles et al. (2014) further extended the model to utilise EWT for visual processing applications. The seminal work of Prabhakar and Geetha (2017); Jayapriya et al. (2020) adapts the EWT for classification of hyperspectral data.
The fundamental concept of EWT is to build a wavelet filter bank based on Fourier supports discovered in the spectrum of the processed signal. Like traditional wavelets, the empirical wavelets are dilated versions of a single mother wavelet in the temporal domain. The novel aspect of the empirical wavelet is the empirical detection of the corresponding dilation factors rather than adherence to a predetermined scheme.
The crux of EWT is adaptive segmentation of the Fourier spectrum of the original signal, followed by filtering. A set of wavelets is created by the 1D-EWT based on arbitrary divisions of the 1D Fourier domain. There are N sequential intervals (of different bandwidths) between 0 and π, with ω_0 = 0 and ω_N = π. The remaining (N−1) limits are found by looking for local maxima of the spectrum. Let M be the overall number of maxima; if M ≥ N, the first N−1 maxima are kept; if M < N, all maxima are kept and N is revised. An intermediate frequency ω_n (n = 1, 2, ..., N−1) is then determined between two local maxima (Lopez-Gutierrez 2022). A detailed illustration of the transformation process can be found in Lopez-Gutierrez (2022); Mohapatra et al. (2022).
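The boundary-detection step described above can be sketched as follows. This is a simplified sketch of one common variant (keep the strongest spectral maxima, place each boundary midway between neighbours); production implementations such as ewtpy add spectrum filtering and regularisation on top of this.

```python
import numpy as np

def ewt_boundaries(signal, N):
    """Locate interior boundaries of the EWT Fourier segmentation.

    Finds local maxima of the magnitude spectrum, keeps the N largest,
    and places each boundary midway between two neighbouring maxima.
    """
    spec = np.abs(np.fft.rfft(signal))
    # indices where the spectrum is a strict local maximum
    maxima = [i for i in range(1, len(spec) - 1)
              if spec[i] > spec[i - 1] and spec[i] > spec[i + 1]]
    # keep the N largest maxima (if fewer exist, all are kept and N shrinks)
    kept = sorted(sorted(maxima, key=lambda i: spec[i], reverse=True)[:N])
    # each interior boundary sits midway between two consecutive maxima
    return [(a + b) / 2 for a, b in zip(kept, kept[1:])]

t = np.arange(256)
sig = np.sin(2 * np.pi * 10 * t / 256) + np.sin(2 * np.pi * 60 * t / 256)
print(ewt_boundaries(sig, 2))   # one boundary between the two tones: [35.0]
```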
The aforementioned components (detail and approximation) are utilised to recover the original signal, as they correspond to the various frequency components/modes of the EWT. Each mode of the EWT is retrieved and used as a feature for further investigation. The modes are ordered from low to high frequency. In the proposed framework, the 1D-EWT with N_e = 2 is applied to each band of I_p, giving rise to I_p_ewt, where N_e is the number of modes. Once formed, the modes are stacked; for N_e modes there are (2^(N_e) − 1) possible stack combinations. As observed by Prabhakar and Geetha (2017), these stacks are suitable as features for HI classification. The next step is to select the first K spectral bands of I_p_ewt. In the experimental analysis we set K to 30 for IP and 15 for the remaining datasets, in line with the conclusions of Roy et al. (2020) on selecting the number of principal components for hyperspectral classification. To apply image classification methods, the selected bands of the transformed image I_p_ewt are split into small, overlapping 3D patches; the truth label of each patch is the label of its centre pixel. From I_p_ewt, 3D neighbouring patches P ∈ R^(S×S×K) are derived, which encompass the S × S window (spatial extent) as well as all K channels.
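The patch-extraction step above can be sketched as follows. This is a minimal sketch (the padding mode and toy array sizes are assumptions for illustration, not the paper's settings): every pixel yields one S × S × K patch labelled by its centre pixel.

```python
import numpy as np

def extract_patches(cube, gt, S):
    """Split an (M, N, K) cube into overlapping S x S x K patches.

    Each patch is labelled with the ground-truth class of its centre
    pixel; borders are zero-padded so every pixel yields one patch.
    """
    M, N, K = cube.shape
    r = S // 2
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="constant")
    patches, labels = [], []
    for i in range(M):
        for j in range(N):
            patches.append(padded[i:i + S, j:j + S, :])
            labels.append(gt[i, j])           # centre-pixel label
    return np.array(patches), np.array(labels)

cube = np.random.rand(10, 10, 15)             # toy I_p_ewt with K = 15
gt = np.random.randint(0, 9, (10, 10))        # toy ground-truth map
P, y = extract_patches(cube, gt, S=5)
print(P.shape)                                # (100, 5, 5, 15)
```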

Deep feature extraction and classification with 3D-CNN
The use of convolutional neural networks (CNNs) has demonstrated promising results for a number of computer vision and image classification tasks. The CNN offers the advantage of being able to extract spatial features from the data with its kernel, which is not possible with other networks. With CNNs, edges, colour distributions, etc. can be detected in an image, making them particularly robust for recognizing and classifying images.
CNNs can be categorised by the dimension of the convolutional kernel that is utilised. In a 1D-CNN, the kernel slides along one dimension. To predict a single-slice HI map, a 2D-CNN uses two-dimensional convolutional kernels. Because 2D convolutional kernels take one slice as input, they can leverage context across the height and width of that slice but are inherently unable to leverage context from adjacent slices. A 3D-CNN addresses this issue by predicting segmentation for a volumetric patch using 3D convolutional kernels. Moreover, owing to the volumetric nature of HI data and the presence of a spectral dimension, a 2D-CNN is too restrictive to successfully extract key features from the spectral channels. As per the studies of Ji et al. (2013); Fırat et al. (2022), a 3D-CNN can deal with both spectral and spatial characteristics, whereas a 2D-CNN captures spatial information only. Because these CNNs use a greater number of parameters, exploiting interslice context can improve performance, but at the cost of computational complexity. These features motivated the use of the 3D-CNN; at the same time, the dimension reduction methodology adopted in our framework tackles the computational complexity inherent in the 3D-CNN classification model. We refer to the schematic representation of the 3D-CNN model in Fig. 2, adapted from human action recognition applications (Ji et al. 2013). The model uses a three-dimensional filter which convolves with three-dimensional volumetric data.
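The interslice-context argument above can be made concrete with a toy single-channel 3D convolution. This is an illustrative sketch (naive loops, valid padding, made-up sizes), not the framework's implementation: note that every output value mixes kd adjacent spectral slices, which a 2D kernel applied slice-by-slice cannot do.

```python
import numpy as np

def conv3d_single(x, w, b=0.0):
    """Valid-mode 3D convolution of one volume x (H, W, D) with kernel w (kh, kw, kd),
    followed by a ReLU activation."""
    H, W, D = x.shape
    kh, kw, kd = w.shape
    out = np.zeros((H - kh + 1, W - kw + 1, D - kd + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                # each output mixes a kh x kw spatial window AND kd spectral slices
                out[i, j, k] = np.sum(x[i:i + kh, j:j + kw, k:k + kd] * w) + b
    return np.maximum(out, 0.0)   # ReLU

x = np.random.rand(7, 7, 9)       # toy patch: 7x7 spatial, 9 spectral slices
w = np.random.rand(3, 3, 3)       # 3D kernel spanning 3 spectral slices
print(conv3d_single(x, w).shape)  # (5, 5, 7)
```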
The schematic representation of the 3D-CNN classification model adapted in our framework is displayed in Fig. 3. The deep learning model of the PEC framework has two 3D convolutional layers, two dropout layers, a flatten layer and three dense layers. Here, the 3D convolution generates a feature map based on the spatio-temporal information of the input volumetric sample. The convolutional layer output represents high-level features in the data, and the fully connected layers learn non-linear combinations of these features. Dropout is the regularisation technique used to prevent overfitting in this model: a specific proportion of the network's neurons is switched off at random, and the input and output connections of those neurons are disabled. This improves the model's feature learning capability.

Proposed PEC Framework for HI Classification
The road map of the proposed PEC framework is shown in Fig. 4. The framework adapts the three distinct techniques discussed above. In the first stage, the spectral dimension is lowered via principal component analysis (PCA); the adapted dimensionality reduction process is elaborated in Section 2.1. The data is then transformed using 1D-EWT to extract shallow features; shallow feature extraction with EWT is the core methodology of the framework and is discussed in detail in Section 2.2. The resulting frequency components are finally fed into a 3D-CNN for deep hierarchical feature extraction and classification; the 3D-CNN-based methodologies are outlined in Section 2.3.

PCA Modeling
Let M, N, d be the rows, columns and channels (spectral values) of the image cube R^(M×N×d).
Let A ∈ R^d and let LS_N be the set of all N-dimensional linear subspaces. The N-th principal subspace (ls_N) can then be defined as:
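In standard PCA terms, this definition can be sketched as the variance-maximisation problem below. This is a hedged reconstruction consistent with the symbols above, not necessarily the paper's exact notation:

```latex
\mathrm{ls}_N \;=\; \operatorname*{arg\,max}_{S \,\in\, LS_N} \; \mathbb{E}\!\left[ \left\lVert P_{S}\, A \right\rVert^{2} \right]
```

where P_S denotes the orthogonal projection of the (centred) random vector A onto the subspace S, i.e. ls_N is the N-dimensional subspace retaining the most variance of A.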

1D-EWT Modeling
Let U_s(n, t) be the detail coefficients and U_s(0, t) the approximation coefficients of the transform, where N_e = 2 is applied on I_p and |bands| = 2^(N_e) − 1.
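For reference, the standard EWT coefficients take the inner-product form below. This is a sketch following Gilles (2013) with ψ_n the empirical wavelets and φ_1 the scaling function; the paper's exact notation may differ:

```latex
U_s(n, t) \;=\; \langle f, \psi_n \rangle \;=\; \int f(\tau)\, \overline{\psi_n(\tau - t)}\, d\tau,
\qquad
U_s(0, t) \;=\; \langle f, \phi_1 \rangle \;=\; \int f(\tau)\, \overline{\phi_1(\tau - t)}\, d\tau
```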

3D-CNN Modeling
The activation value a^L_{i,j,k} at output position (i, j, k) of the L-th filter when convolving with the input HI X is defined in terms of the Rectified Linear Activation function (RLA). Figure 4 is the empirical representation of our proposed framework.
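In the usual form of a 3D convolution followed by the RLA, this activation can be sketched as follows (a hedged reconstruction with assumed kernel extents P, Q, R and bias b_L; the paper's exact indexing may differ):

```latex
a^{L}_{i,j,k} \;=\; \mathrm{RLA}\!\left( b_{L} \;+\; \sum_{c}\sum_{p=0}^{P-1}\sum_{q=0}^{Q-1}\sum_{r=0}^{R-1} w^{L}_{c,p,q,r}\; X_{c,\, i+p,\, j+q,\, k+r} \right)
```

where c runs over the input channels and w^L is the L-th 3D kernel.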
Algorithm-1 represents the step-wise operational approach of the proposed PEC framework for hyperspectral image classification which in turn adapts above mathematical modeling of PCA, 1D-EWT and 3D-CNN.

Experimental Setup and Outcomes
To carry out the undertaken study and validate the proposed PEC framework, extensive testing is done using the Google Colab Pro online platform with the Keras deep learning framework and allied Python libraries on three HI datasets. The classification performance is evaluated with three accuracy metrics: overall accuracy (OA), average accuracy (AA), and the kappa coefficient (k). The evaluating parameters are defined as follows.
In other words, the average accuracy is the mean of the per-class accuracies:

AA = (sum of the accuracies of the individual classes) / (number of classes)

OA = (Σ_i m_{i,i}) / N

k = (N Σ_i m_{i,i} − Σ_i C_i G_i) / (N² − Σ_i C_i G_i)

where i denotes the class number, N is the total number of classified values compared to truth values, m_{i,i} are the correctly classified values located on the diagonal of the error matrix, and C_i, G_i are the total numbers of predicted values and true values of class i, respectively.
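The three metrics can be computed directly from a confusion matrix; the sketch below follows the definitions above (rows are true classes, columns are predictions, and the toy 2-class matrix is made up for illustration).

```python
import numpy as np

def metrics_from_confusion(m):
    """Compute OA, AA and kappa from a C x C confusion matrix m.

    m[i, i] are the correctly classified counts; row sums give the
    true-class totals G_i and column sums the predicted totals C_i.
    """
    N = m.sum()
    diag = np.trace(m)
    oa = diag / N                                  # overall accuracy
    per_class = np.diag(m) / m.sum(axis=1)         # accuracy per true class
    aa = per_class.mean()                          # average accuracy
    # kappa compares OA against chance agreement from the marginals C_i, G_i
    chance = (m.sum(axis=0) * m.sum(axis=1)).sum() / N**2
    kappa = (oa - chance) / (1 - chance)
    return oa, aa, kappa

m = np.array([[50, 2], [5, 43]])                   # toy 2-class error matrix
oa, aa, kappa = metrics_from_confusion(m)
print(round(oa, 3), round(aa, 3), round(kappa, 3))  # 0.93 0.929 0.859
```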

Dataset description
The first dataset in our experimental analysis, the Indian Pines (IP) dataset, was recorded over the western portion of Indiana in 1995, in the 0.4-2.5 µm wavelength range. It has 200 spectral bands (after excluding 20 water-absorption bands) and a 145 × 145 spatial dimension. There are sixteen classes in total. The ground truth of IP is shown in Fig. 5(a).
The second dataset of our experiment is the 610 × 340 pixel University of Pavia (PU) scene. There are nine (9) classes in total. The image contains 103 spectral bands (wavelength range 430-860 nm) at a spatial resolution of 1.3 m. The ground truth of PU is shown in Fig. 5(b). The third dataset, the Salinas dataset (SA), covers the Salinas Valley in California. Its 512 × 217 spatial extent contains 204 spectral bands. This is a high-resolution dataset with sixteen distinct classes, captured at a resolution of about 3.7 m. The ground truth of SA is shown in Fig. 5(c).

Experimental settings
In the experimental settings, we used the PCA algorithm of the Scikit-learn library (Pedregosa et al. 2011) to retain the first m PCs of the input image Img, producing the dimensionally reduced image I_p. For the three datasets IP, PU and SA, m is set to 30, 15, and 15, respectively, in order to minimise computing costs. The whiten parameter of the PCA is set to true, which aids in raising forecasting accuracy. For feature extraction with 1D-EWT, the EWT1D() method of the Python package ewtpy (Carvalho et al. 2020) is applied to the individual spectral bands of I_p.
Since a higher decomposition level results in heavier computation, the number of modes N_e is limited to 2. The parameters lengthFilter and sigmaFilter are set to 10 and 5, respectively. The (2^(N_e) − 1) coefficient feature matrices produced by the 1D-EWT procedure each have the same spatial dimensions as I_p. We choose the first K high-frequency components of I_p_ewt's spectral bands. To employ image classification techniques, the selected bands of the transformed image I_p_ewt are extracted into small, overlapping 3D patches whose truth labels are determined by the label of the centre pixel. Our 3D neighbouring patches P ∈ R^(S×S×K) generated from I_p_ewt span all K spectral bands as well as the S × S spatial extent.

Algorithm 1 Proposed PEC framework for Hyperspectral Image classification.
As discussed, the proposed Deep learning model consists of two 3D-Convolutional layers. Four filters of dimension 3 × 3 × 7 are present in the first 3D convolution layer. Eight filters of dimension 3 × 3 × 3 are present in the second 3D convolution layer. Both the 3D convolution layers employ Relu activation. The first 3D convolution layer receives the input samples and extracts the features maps using different numbers of kernels. These features are fed to second 3D convolution layers. Feature maps from the previous layer are re-passed to a dropout layer at a 40% dropout rate and flattened to feed the first fully connected layer (Dense). Output features are once again fed into second fully connected layer (Dense) comprising 128 neurons with Relu activation followed by second dropout layer with a rate of 40%. Finally, the model obtains the expected output from the final fully connected layer using the softmax activation function. Random batches with a size of 1024 are employed for training. Adam with 0.01 learning rate and 1e-4 decay is used as a batch optimizer while the Categorical cross-entropy loss function is employed as the loss function. 100 epochs have been used for training.
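The convolutional part of this architecture stays light, which can be checked by counting its trainable weights. The sketch below counts only the two Conv3D layers from the description above; the 1-then-4 input channel counts are an assumption implied by the stacking, and the dense-layer sizes are omitted because they depend on the patch dimensions.

```python
def conv3d_params(filters, k, in_ch):
    """Trainable parameters of a 3D convolution layer: one k-volume kernel
    per (input channel, filter) pair, plus one bias per filter."""
    kx, ky, kz = k
    return filters * (kx * ky * kz * in_ch) + filters

c1 = conv3d_params(4, (3, 3, 7), 1)   # first Conv3D: 4 filters of 3x3x7
c2 = conv3d_params(8, (3, 3, 3), 4)   # second Conv3D: 8 filters of 3x3x3
print(c1, c2)                          # 256 872
```

Only about a thousand convolutional weights, consistent with the claim that the PCA/EWT front end keeps the 3D-CNN computationally affordable.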

Classification outcomes and analysis
In Table 1 we present the experimental outcomes of the PEC framework with thirty percent and ten percent training samples for the undertaken datasets (IP, PU and SA). Figure 6 portrays the PEC-framework-predicted images for these datasets with thirty and ten percent training, along with their ground truths.
In Table 2 we compare the findings of Jayapriya et al. (2020) with those of the proposed PEC framework.
The method suggested by Jayapriya et al. (2020) uses EWT for shallow feature extraction, a 2D-CNN for deep feature extraction, and three associated classification algorithms: Random Forest, Multi-SVM, and ELM. In contrast, the proposed PEC framework uses 3D convolutional layers along with 1D-EWT and PCA for the extraction and classification of deep spectral and spatial features; the PEC framework is therefore not too expensive computationally. In Table 3, the classification effectiveness of the suggested approach is compared to that of several techniques: SSRN (Zhong et al. 2018), 2D-CNN (Makantasis et al. 2015), 3D-CNN (Ben Hamida et al. 2018), and Multi-scale 3D-CNN (He et al. 2017).
We summarise the technical context of these allied approaches. The values for the competing approaches in the aforementioned tables are obtained in line with the data of Roy et al. (2020). It may be noted that SSRN is a supervised deep learning framework (Zhong et al. 2018) that mitigates the accuracy degradation observed in earlier deep learning models. In particular, back-propagation of gradients is facilitated by residual blocks, which connect every other 3D convolutional layer via identity mapping. To extract more discriminative features, SSRN treats spectral characteristics and spatial information in two independent blocks. To further regularise the learning process and enhance the classification accuracy of the trained models, the SSRN model applies batch normalisation to each convolutional layer. Similarly, Makantasis et al. (2015) present a 2D-CNN-based classification technique that automatically and hierarchically produces high-level features from HIs; this approach uses a convolutional neural network to encode the spectral and spatial information of the pixels and a multi-layer perceptron to carry out the classification task. In Ben Hamida et al. (2018), the authors investigated how well DL architectures perform in classifying RS hyperspectral datasets and developed a novel 3D deep learning strategy that processes both spectral and spatial information simultaneously; some of the suggested models achieve a better classification rate than state-of-the-art methods at reduced processing cost. The Multi-scale 3D deep convolutional neural network (M3D-DCNN) suggested by He et al. (2017) not only improves classification but also scales to large datasets by jointly learning 2D multi-scale spatial features and 1D spectral features from HI data in an end-to-end strategy.
The model produced cutting-edge results on the standard datasets despite not using any handcrafted features or pre- or post-processing such as PCA, sparse coding, etc.
The influential factors in adopting a 30% training set for the framework are analysed with respect to the three undertaken datasets. For simplicity, the effect of the number of frequency bands (K) on the accuracy metrics for one dataset (IP) is shown in Table 4. Figure 7(a), (b) and (c) illustrate how the accuracy metrics vary with different training-sample proportions for various frequency bands K. Table 5 is a self-explanatory representation of the computational complexities of the methods used by the competing models and the suggested framework. In the 3D-CNN, K_s and C stand for kernel size and channels, respectively; N stands for the number of input channels, W for the image's width, and H for its height. In the 2D-CNN, K_s, W and H denote the kernel size and the input image's width and height. The number of training samples is represented as n in SVM and RF, whereas p represents the number of available features; n_trees indicates the number of trees. By exploiting random hidden weights, the ELM has a time complexity in terms of L and N, where L is the number of hidden-layer neurons and N is the number of training samples. In the next entry of Table 5, N represents the number of signal samples and N_mra denotes the multiresolution analysis (MRA) or signal decomposition level. The last row of the table gives the computational complexity of the PCA algorithm applied to an input matrix of dimension N × D.

Complexity Analysis
The proposed PEC model, using 1D-EWT for feature extraction, has a computational complexity advantage of a factor of N over the 2D-EWT employed in Jayapriya et al. (2020). The classification approach of Jayapriya et al. (2020) includes 2D-CNNs and additional machine learning algorithms (MSVMs, ELMs, and Random Forest), whereas the suggested PEC model uses only 3D-CNNs for classification. As multiple classifiers are suggested in Jayapriya et al. (2020) while the proposed PEC technique uses a single 3D-CNN, an exact comparison is not possible.

Conclusions and discussions
Experimental findings of the proposed PEC framework are contrasted with those of other cutting-edge work, notably Jayapriya et al. (2020). The suggested PEC framework surpasses these by producing better classification accuracy in terms of Overall Accuracy (OA) (Alberg et al. 2004), Average Accuracy (AA) (Fung and LeDrew 1988), and the Kappa coefficient (k) (Fung and LeDrew 1988). There is immense potential for further enhancing classification efficiency by applying the discussed approaches to improve spatio-temporal feature learning.