FusAtNet: Dual Attention based SpectroSpatial Multimodal Fusion Network for Hyperspectral and LiDAR Classification

With recent advances in sensing, multimodal data is becoming easily available for various applications, especially in remote sensing (RS), where many data types like multispectral imagery (MSI), hyperspectral imagery (HSI), LiDAR etc. are available. Effective fusion of these multisource datasets is becoming increasingly important, since multimodal features have been shown to generate highly accurate land-cover maps. However, fusion in the context of RS is non-trivial considering the redundancy involved in the data and the large domain differences among the modalities. In addition, the feature extraction modules for different modalities hardly interact with each other, which further limits their semantic relatedness. As a remedy, we propose a feature fusion and extraction framework, namely FusAtNet, for collective land-cover classification of HSI and LiDAR data in this paper. The proposed framework effectively utilizes the HSI modality to generate an attention map via a "self-attention" mechanism that highlights its own spectral features. Simultaneously, a "cross-attention" approach harnesses a LiDAR-derived attention map that accentuates the spatial features of the HSI. These attentive spectral and spatial representations are then explored further, along with the original data, to obtain modality-specific feature embeddings. The modality-oriented joint spectro-spatial information thus obtained is subsequently utilized to carry out the land-cover classification task. Experimental evaluations on three HSI-LiDAR datasets show that the proposed method achieves state-of-the-art classification performance, including on the largest HSI-LiDAR dataset available, University of Houston (Data Fusion Contest 2013), opening new avenues in multimodal feature fusion for classification.


Introduction
With the advent of advanced sensing technologies, simultaneous acquisition of multimodal data for the same underlying phenomenon is possible nowadays. This is especially important in remote sensing (RS), owing to the presence of satellite image data from several sources like multispectral (MSI), hyperspectral (HSI), synthetic aperture radar (SAR) and panchromatic (PCI) sensors, as well as light detection and ranging (LiDAR), to name a few. Each source provides a different kind of information about the same geographical region, which can aid in tasks relating to total scene understanding. For example, the detailed spectral information from HSI is commonly used to discriminate various materials based on their reflectance values, finding applications in agricultural monitoring, environment-pollution monitoring, urban-growth analysis and land-use pattern analysis [1,2]. Similarly, LiDAR data is used to obtain elevation information, which is useful to distinguish between objects made of the same material [3]. Since the attributes of these modalities complement each other, they are extensively used in a cumulative fashion for multimodal learning in the remote sensing domain [4,5].
In the recent past, many ad-hoc and conventional techniques have been introduced for the fusion of HSI and LiDAR modalities, owing to their ability to mine latent representations and features from the raw data [6,7]. These fusion strategies have been applied in different application scenarios, as seen in [8,9,10,11,12], where conventional methods such as support vector machines (SVM), random forests (RF), rotation forests (RoF) etc. have been actively used for classification. [13] proposes an in-flight fusion of LiDAR and HSI data, in which the intensity of the HSI data is corrected with the help of cross-calibrated return intensity information obtained from an airborne laser scanner (ALS).
Similarly, in the present era, deep learning is being actively used in the domain of multimodal fusion [14,15]. The deep learning approach generally follows a multistream architecture where each stream corresponds to a single modality. The extracted features are then concatenated into a joint representation for further classification. Convolutional neural networks (CNNs) in particular have been widely utilised as feature extractors in the remote sensing community and shown to be more powerful than conventional techniques for supervised inference tasks [16]. Although multistream deep architectures have produced excellent performance measures, a key disadvantage of such an approach is that feature extraction for the different modalities is carried out individually instead of utilising features from both modalities jointly. This causes some important shared high-level features from both modalities to be missed. Another key point is that such a method may leave the different features significantly unbalanced, so the information may not be equally represented [17]. Given the multi-source feature embeddings, feature aggregation is an important stage: simple concatenation or pooling of individually extracted features may carry redundant information, making the system prone to overfitting. Lastly, producing a large number of features by mere concatenation increases the dimensionality, and in the absence of large labelled datasets the model may suffer from the curse of dimensionality [16]. Limited training samples or imbalanced data, along with the need to avoid human intervention in selecting features, have encouraged researchers to search for better joint feature learning methods.
Recently, the usage of attention learning mechanisms has shown remarkable performance gains for different visual inference tasks [18,19,20]. Ideally, the attention modules highlight the prominent features while suppressing the irrelevant ones through a self-supervised learning paradigm. However, in most of these research works, attention based learning is carried out only on a single modality and hence only similar kinds of features are highlighted. Therefore, we are left with the task of designing a network that takes the attention mask from one modality and uses it to enhance the representations of the other modality (Fig. 1). Based on this premise, the idea of multimodal attention is envisioned, where a complementing modality not only synergistically adds relevant information to the existing modality but also highlights features that went "unnoticed" by the attention map derived from the existing modality. Inspired by these discussions, we propose FusAtNet, an attention based multimodal fusion network for land-cover classification given an HSI-LiDAR pair as input, as illustrated in Fig. 3. Our method extracts spectral features using "self-attention" on the HSI and incorporates multimodal attention using the proposed "cross-attention" mechanism, which uses the LiDAR modality to derive an attention mask that highlights the spatial features of the HSI (Fig. 2). This interaction between spectral and spatial features leads to an intermediate representation that is further refined through self-attention based learning. This rich final representation is then used for classification.

Figure 1. The self-attention module (left) works only on a single modality, where both the hidden representations and the attention mask are derived from the same modality (HSI). In the cross-attention module (right), the attention mask is derived from a different modality (LiDAR) and is harnessed to enhance the latent features of the first modality.
The key contributions are summarised as follows:
• To the best of our knowledge, ours is one of the first approaches to introduce the notion of attention learning for HSI-LiDAR fusion in the context of land-cover classification.
• In this regard, we introduce the concept of "cross-attention" based feature learning among the modalities, a novel and intuitive fusion method which utilises attention from one modality (here LiDAR) to highlight features in the other modality (HSI).
• We demonstrate state-of-the-art classification performance on three benchmark HSI-LiDAR datasets, outperforming all existing deep fusion strategies, along with a thorough robustness analysis.

Related work
By definition, the task of image fusion aims at synergistically combining images from different related modalities to generate a merged representation of the information present in the images, improving visual inference performance over the individual images. Growing interest from the multimedia community is reflected in various works like [21], where audio-visual crossmodal representation learning was proposed, [22], where RGB-depth multimodal features were fused for scene classification, and shared cross-modal image retrieval [23]. It is also an emerging topic in medical image classification: while [24] fuses information from MRI/PET, [25] combines four different modalities using CNNs as feature extractors for image segmentation. Unsupervised methods like [26] generate joint latent representations from data of different modalities using deep belief networks.
Remote sensing has been utilising several classical multimodal fusion methods such as decision fusion [27], kernel based fusion [28], PCA [29], intensity-hue-saturation (IHS) [30], wavelet based fusion methods [31] etc. for various applications in order to improve the classification performance of even the most conventional models.
In addition to classical methods, deep learning is being actively used in the remote sensing field both for feature attention and for multimodal learning. On the feature attention side, [19] presents a novel spectral-attention framework to highlight the reflectance characteristics of the hyperspectral image for better classification performance. Similarly, [20] introduces a spectral-spatial attention network using residual learning (to tackle vanishing gradients [32]) and a convolution-deconvolution framework (to extract distinct spatial features), which collectively assist in robust classification. From the perspective of multimodal fusion in remote sensing, deep learning normally involves concatenating the features extracted by unimodal networks and then sending them for classification, with the entire model trained end to end. For example, [33] concatenated the Kronecker product of LiDAR-derived features with the spectral features obtained from the HSI and used them for classification with a CNN model. [34] and [35] use a two-stream model of image fusion where, in one stream, a 3D-CNN extracts the spectral-spatial features from the HSIs while a 2D-CNN extracts the depth features from the LiDAR data. It is important to note that the LiDAR data has been rasterised in the image domain as a digital elevation model (DEM) and digital surface model (DSM). The features are concatenated and sent to a deep neural network for fusion and finally classification. [36] proposed an adaptive technique of HSI-LiDAR fusion: initially, the LiDAR and HSI features are extracted using a two-stream CNN where each stream corresponds to one modality. The streams follow similar architectures and contain cascaded residual blocks (inspired by the Face Alignment Network (FAN) [37] and hourglass networks [38]) to retain both the original and the extracted features for fusion.
The extracted features are then combined with original features using an adaptive technique based on squeeze and excitation networks [39] where, instead of simply concatenating the features, each feature is assigned a specific weight. The weighted tensors are flattened and concatenated and sent to a fully connected layer for classification.
As already mentioned, the existing techniques for HSI-LiDAR fusion overlook the aspect of attention based feature learning. In contrast, FusAtNet incorporates different attention learning modules within its framework for better cross-modal feature extraction. Additionally, we introduce the notion of cross-modal attention, which is a novel paradigm in the realm of feature fusion.

Proposed method
The objective of this work is to perform pixel-based classification by harnessing the spectral and spatial information contained in HSIs and the depth and intensity information encoded in LiDAR.
To accomplish this task, we consider HSI training patches x_H^i and the corresponding LiDAR patches x_L^i, where B_1 and B_2 denote the number of channels in the HSI and LiDAR modalities respectively, while n denotes the number of available groundtruth samples. The groundtruth labels are y_i ∈ {1, 2, ..., K}, where K represents the number of groundtruth classes. The patches are sent to the proposed FusAtNet model and processed as they pass through the various modules discussed in section 3.2.

Model overview
The intent behind this research work is to synergistically explore the spectral-spatial properties of the HSI and the spatial/elevation characteristics of the LiDAR modality using the "cross-attention" framework. The role of the attention modules is to selectively highlight the hotspots in the extracted hyperspectral features in order to increase the interclass variance and thus improve the classification accuracy. This is achieved in two steps. Firstly, the HSI features are passed through a feature extractor and a spectral attention module, whose combination emphasises the spectral information in the HSI features; simultaneously, the LiDAR features are passed through a spatial attention framework and the resultant mask accentuates the spatial characteristics of the HSI. Secondly, the highlighted features are reinforced with the original features and passed through modality extraction and modality attention modules, the outputs of which are combined to judiciously highlight the important sections of the two modalities. The resultant features are then sent to the classification module.

Network architecture
The comprehensive architecture of FusAtNet is displayed in Fig. 3 and all the experiments adhere to it. FusAtNet essentially contains six modules that are used in three phases. In the first phase, the hyperspectral feature extractor F_HS, the spectral attention module A_S and the spatial attention module A_T are used to jointly extract and highlight the spatial-spectral features from the HSI. In the second phase, the modality feature extractor F_M and the modality attention module A_M are used to selectively highlight the modality-specific features. In the third phase, the modality-specific spectral-spatial features are sent to the classification module C. All the modules are inherently CNN modules, where the size of all kernels is fixed to 3×3 and the non-linearity is fixed to ReLU. The modules are discussed as follows:

Figure 3. Schematic of FusAtNet (presented on the Houston dataset). Initially, the hyperspectral training samples X_H are sent to the feature extractor F_HS to get latent representations and to the spectral attention module A_S to generate a spectral attention mask. Simultaneously, the corresponding LiDAR training samples X_L are sent to the spatial attention module A_T to get the spatial attention mask. The attention masks are individually multiplied with the latent HSI representations to get M_S and M_T. M_S and M_T are then concatenated with X_H and X_L and sent to the modality feature extractor F_M and the modality attention module A_M. The outputs of the two are then multiplied to get F_SS, which is sent to the classification module C for pixel classification.

Hyperspectral feature extractor F_HS: F_HS consists of a 6-layer CNN and is used to extract the spectral-spatial features from the HSIs. The first five layers contain 256 filters each, while the sixth layer has 1024 filters. All convolution operations are applied with zero padding, and the output of each convolution operation is batch normalised.
The module can be represented as F_HS(θ_F, x_H^i), where θ_F represents the weights of the module. The output of F_HS is a patch of size 11×11×1024.
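As a concrete illustration, a feature extractor of this shape can be sketched in PyTorch as follows. This is a minimal sketch, not the authors' code: class and variable names are our own, and details the paper does not fix (stride, initialisation, layer ordering of BN and ReLU) are assumptions.

```python
import torch
import torch.nn as nn

class HSFeatureExtractor(nn.Module):
    """Sketch of F_HS: six 3x3 convolutions (five with 256 filters, the
    sixth with 1024), zero padding so the 11x11 spatial size is preserved,
    and batch normalisation plus ReLU after each convolution."""
    def __init__(self, in_bands=144):  # 144 = Houston HSI band count
        super().__init__()
        layers, c_in = [], in_bands
        for c_out in [256] * 5 + [1024]:
            layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                       nn.BatchNorm2d(c_out),
                       nn.ReLU(inplace=True)]
            c_in = c_out
        self.body = nn.Sequential(*layers)

    def forward(self, x):        # x: (N, B1, 11, 11)
        return self.body(x)      # -> (N, 1024, 11, 11)

f_hs = HSFeatureExtractor(in_bands=144)
out = f_hs(torch.randn(2, 144, 11, 11))
print(out.shape)  # torch.Size([2, 1024, 11, 11])
```

The zero padding is what keeps the 11×11 spatial support intact through all six layers, matching the stated 11×11×1024 output.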
Spectral attention module A_S: A_S draws its attention mask from the HSI. The module is a CNN with 3 convolution blocks of 2 convolution layers each. In addition, the first and second convolution blocks are each followed by a residual block. There is a maxpooling layer after each residual block and after the sixth convolution layer. The last layer of this module is a global average pooling (GAP) layer. Overall, the architecture of this module is inspired by [19]. The number of kernels in the first five convolution layers is 256 and in the sixth is 1024, all of which use zero padding. Each convolution operation is followed by a batch normalisation layer. The module is denoted as A_S(θ_AS, x_H^i), where θ_AS are the weights of this attention module. The output of this module is a vector of size 1×1024, which is multiplied with the output of F_HS to get the highlighted spectral features, as denoted in Eq. (1):

M_S(x_H^i) = F_HS(θ_F, x_H^i) ⊗ A_S(θ_AS, x_H^i)    (1)

Here, M_S denotes the extracted features highlighted with the spectral attention mask and ⊗ represents the broadcasted element-wise matrix multiplication operation (such that the resultant product retains the size of the larger operand).
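The broadcasted multiplication of Eq. (1) can be illustrated with stand-in arrays. This is a NumPy sketch: the values are random and only the shapes follow the text.

```python
import numpy as np

# Stand-in for the F_HS output: an 11x11x1024 feature patch.
features = np.random.rand(11, 11, 1024)

# Stand-in for the A_S output: a 1x1024 spectral attention vector
# produced by the global average pooling layer.
spectral_mask = np.random.rand(1, 1024)

# Reshaping to (1, 1, 1024) lets NumPy broadcast the vector over the
# spatial grid, so the element-wise product keeps the larger shape.
m_s = features * spectral_mask.reshape(1, 1, 1024)
print(m_s.shape)  # (11, 11, 1024)
```

The same single channel weight thus scales every spatial location of that channel, which is exactly what "highlighting spectral features" means here.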

Spatial attention module A_T: Denoted by A_T(θ_AT, x_L^i), where θ_AT denotes the weights, the spatial attention module is a 6-layer CNN that generates an attention mask from the LiDAR modality. The first three layers consist of 128 filters each, while each of the last three layers has 256 filters. There are two residual layers, one after the second and one after the fourth convolution layer. All convolution layers are followed by a batch normalisation operation. The output from this module is a patch of size 11×11×1024 that is multiplied with the extracted features from F_HS to get the spatially highlighted features M_T, denoted in Eq. (2) as:

M_T(x_H^i, x_L^i) = F_HS(θ_F, x_H^i) ⊗ A_T(θ_AT, x_L^i)    (2)

Modality feature extractor F_M: The F_M module follows the same structure as F_HS and can be represented as F_M(θ_FM, ·), where θ_FM are the weights of the module. It is fed the spectrally and spatially highlighted features M_S and M_T along with the original X_H and X_L, and its output is a patch of size 11×11×1024, which can be represented as in Eq. (3):

F_M(θ_FM, x_H^i ⊕ M_S ⊕ M_T ⊕ x_L^i)    (3)

where ⊕ represents concatenation along the channel axis.
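The channel-wise concatenation feeding F_M (and A_M) can be sketched as follows. NumPy sketch: the band counts follow the Houston dataset, and the ordering of the four operands is our assumption since the text only lists them.

```python
import numpy as np

B1, B2 = 144, 1                        # Houston: 144 HSI bands, 1 LiDAR raster
x_h = np.random.rand(11, 11, B1)       # original HSI patch
x_l = np.random.rand(11, 11, B2)       # original LiDAR patch
m_s = np.random.rand(11, 11, 1024)     # spectrally attended features (Eq. 1)
m_t = np.random.rand(11, 11, 1024)     # spatially attended features (Eq. 2)

# ⊕ : concatenation along the channel axis before F_M / A_M.
fused_input = np.concatenate([x_h, m_s, m_t, x_l], axis=-1)
print(fused_input.shape)  # (11, 11, 2193) for these band counts
```

Because the operands share the 11×11 spatial support, concatenation only grows the channel dimension (144 + 1024 + 1024 + 1 = 2193 here).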

Modality attention module A_M: The architecture of A_M is similar to that of A_T and is denoted by A_M(θ_AM, ·), θ_AM being the weights. The role of this module is to create an attention mask that focuses on the specific traits of each modality, and its input is therefore kept the same as that of F_M. This is represented in Eq. (4):

A_M(θ_AM, x_H^i ⊕ M_S ⊕ M_T ⊕ x_L^i)    (4)

The output of the module is an 11×11×1024 patch that is multiplied with the output of F_M, as shown in Eq. (5), and the result is sent to the classification module:

F_SS(x_H^i, x_L^i) = F_M(θ_FM, x_H^i ⊕ M_S ⊕ M_T ⊕ x_L^i) ⊗ A_M(θ_AM, x_H^i ⊕ M_S ⊕ M_T ⊕ x_L^i)    (5)

where F_SS are the final spectral-spatial features.
Classification module C: The input to the C module is the final spectral-spatial features F_SS(x_H^i, x_L^i). The module is a 6-layer fully convolutional network where the first four layers consist of 256 filters each, while the fifth and sixth layers contain 1024 and K filters respectively, K being the number of classes. The filter size for the last layer is set to 1×1 and no padding is used in any layer. All layers except the last are operated on by a ReLU activation function and batch normalisation, while the last layer is a softmax layer. The module can be defined as C(θ_C, F_SS(x_H^i, x_L^i)), where θ_C are the classification weights. The output of C is a vector of size 1×K.
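Per pixel, a 1×1 convolution with K filters is just a 1024 → K linear map followed by softmax. The sketch below illustrates that mapping with random stand-in weights; the way the unpadded convolutions collapse the spatial support down to a single 1×K vector is our simplification (a plain mean here), not the paper's exact mechanism.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

K = 15                                  # Houston has 15 classes
feats = np.random.rand(11, 11, 1024)    # features entering the last layer

# 1x1 conv with K filters == per-pixel linear map from 1024 channels to K.
w = np.random.randn(1024, K) * 0.01     # random stand-in weights

# Crude spatial collapse (mean over pixels), then the class scores.
scores = feats.reshape(-1, 1024).mean(axis=0) @ w
probs = softmax(scores)                 # the 1xK softmax output of C
print(probs.shape)
```

The softmax output is the class-probability vector to which the cross-entropy loss of the next section is applied.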

Training and inference
The output from C is subjected to a categorical cross-entropy loss, which is backpropagated to train the FusAtNet model in an end-to-end fashion, as given in Eq. (6):

L_C = -(1/n) Σ_{i=1}^{n} Σ_{k=1}^{K} y_{ik} log(ŷ_{ik}),  with ŷ^i = C(θ_C, F_SS(x_H^i, x_L^i))    (6)

where L_C is the classification loss. During the testing phase, a given test sample (x_H^j, x_L^j) is passed through the fusion modules and follows the same path as the training samples. The resultant output F_SS(x_H^j, x_L^j) is sent to the classification module C, which assigns the predicted class label.
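The categorical cross-entropy of Eq. (6) on one-hot targets can be sketched with toy values (NumPy; the numbers are illustrative only):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """L_C = -(1/n) * sum_i sum_k y_ik * log(p_ik), as in Eq. (6).
    eps guards against log(0) for degenerate predictions."""
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

# Two toy samples, K = 3 classes, one-hot targets.
y_true = np.array([[1, 0, 0], [0, 1, 0]], dtype=float)
y_pred = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1]])  # softmax outputs

loss = categorical_cross_entropy(y_true, y_pred)
print(round(loss, 4))  # 0.2899
```

Only the predicted probability of the true class enters each term, so the loss is the mean of -log(0.8) and -log(0.7) here.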

Experimental setup
This section discusses the datasets used to validate FusAtNet and the protocols followed while training it.

Datasets
To evaluate the efficacy of our method, three HSI-LiDAR datasets have been considered.
Houston dataset: This dataset consists of a hyperspectral imagery and a LiDAR depth raster and was introduced in the GRSS Data Fusion Contest 2013. The dataset was acquired over the University of Houston campus and its surroundings by the National Center for Airborne Laser Mapping (NCALM). The HSI is composed of 144 hyperspectral bands with wavelengths varying from 0.38 μm to 1.05 μm, with each raster of size 349×1905 and spatial resolution 2.5 m. A total of 15029 groundtruth samples are available, distributed over 15 classes and divided into training and testing sets containing 2832 and 12197 pixels respectively [40]. However, for our experiments, 12189 pixels are considered in the test set, since a few of the pixels were interfering with the data preprocessing. The dataset can be visualised in Fig. 4.
Trento dataset: This dataset was collected using the AISA Eagle sensor over rural regions in Trento, Italy. The HSI is composed of 63 bands with wavelengths in the range of 0.42 μm to 0.99 μm, while the LiDAR data consists of 2 rasters showing elevation. The dimension of each band is 166×600, while the spectral and spatial resolutions are 9.2 nm and 1.0 m respectively. There are a total of 6 classes in the imagery, with groundtruth available for 30214 pixels, divided into 819 training pixels and 29395 test pixels [40]. The dataset is displayed in Fig. 5.

MUUFL Gulfport dataset: This dataset was acquired over the campus of the University of Southern Mississippi Gulf Park, Long Beach, Mississippi, in November 2010. The HSI imagery originally contained 72 bands; however, due to noise, the initial and final four bands are omitted, leaving a total of 64 bands. The LiDAR modality consists of two elevation rasters. All bands and rasters are coregistered, with a total size of 325×220. There are a total of 53687 groundtruth pixels encompassing 11 classes [41,42]. For training, 100 pixels per class are selected, leaving a total of 52587 pixels for testing. The HSI and LiDAR imageries along with the groundtruth pixels can be viewed in Fig. 6.

Training protocols
Our method is compared against other conventional and state-of-the-art multimodal learning methods from [40] for fusing the HSI and LiDAR modalities, such as SVM [43], extreme learning machines (ELM) [44], CNN-PPF [45] and the two-branch CNN [40] with spectral and spatial feature extraction. The SVM (both hyperspectral and LiDAR) and ELM (only hyperspectral) models for the Trento dataset have been retrained and re-evaluated, since the values in [40] seemed incorrect. All analyses have been carried out on both HSI-only data (represented as (H) in the results and classification maps) as well as fused HSI and LiDAR data (represented as (H+L)) to affirm the efficacy of multimodal learning over unimodal learning.

Table 1. Accuracy analysis on the Houston dataset (in %). 'H' represents only HSI while 'H+L' represents fused HSI and LiDAR. (Per-class rows covering SVM, ELM and Two Branch CNN in the (H) and (H+L) settings are omitted here.)

To assess the performance of the methods, overall accuracy (OA), producer's accuracy (PA), average accuracy (AA) and Cohen's kappa (κ) have been used as evaluation metrics. Both the HSI and LiDAR data are subjected to min-max normalisation to scale the modalities and speed up convergence.
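A min-max normalisation step of this kind might look as follows. NumPy sketch: per-band scaling is our assumption, as the text only states that both modalities are min-max normalised.

```python
import numpy as np

def min_max_normalise(cube):
    """Scale each band of an (H, W, B) cube to [0, 1] independently."""
    mn = cube.min(axis=(0, 1), keepdims=True)   # per-band minimum
    mx = cube.max(axis=(0, 1), keepdims=True)   # per-band maximum
    return (cube - mn) / (mx - mn + 1e-12)      # eps avoids divide-by-zero

# Toy cube with radiance-like magnitudes (shape kept small for brevity).
hsi = np.random.rand(50, 60, 10) * 4000.0
hsi_n = min_max_normalise(hsi)
print(round(float(hsi_n.min()), 6), round(float(hsi_n.max()), 6))
```

Scaling both modalities into a common [0, 1] range keeps the gradient magnitudes comparable across streams, which is what speeds up convergence.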
The network uses a fixed patch size of 11×11 for all the datasets. These patches are created around the pixels with known groundtruth labels. In addition, to boost the performance of our model, we resort to a data augmentation technique (used in [5]) of rotating the training patches by 90°, 180° and 270° in the clockwise direction. All weight initializations are carried out using Glorot initialization [47], and training is performed for 1000 epochs. A small initial learning rate of 0.000005 is chosen, since a higher learning rate leads to larger fluctuations when the Adam optimizer is used with Nesterov momentum [48].
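The rotation-based augmentation above can be sketched with np.rot90. One detail to watch: NumPy rotates counter-clockwise, so clockwise turns use the complementary k values.

```python
import numpy as np

def augment_patch(patch):
    """Return the patch plus its 90, 180 and 270 degree clockwise rotations.
    np.rot90 rotates counter-clockwise in the first two axes, so k=3
    corresponds to a single 90-degree clockwise turn."""
    return [patch] + [np.rot90(patch, k=k) for k in (3, 2, 1)]

patch = np.random.rand(11, 11, 144)        # one Houston training patch
augmented = augment_patch(patch)
print(len(augmented), augmented[1].shape)  # 4 (11, 11, 144)
```

Since the patches are square, every rotated copy keeps the 11×11×B shape and can be fed to the network unchanged, quadrupling the training set.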

Results and discussion
Our proposed method is evaluated on the Houston, Trento and MUUFL datasets in Tables 1, 2 and 3 respectively. In all cases, our method outperforms the state-of-the-art methods by a significant margin in every respect, be it OA (the respective accuracies on the Houston, Trento and MUUFL datasets being 89.98%, 99.06% and 91.48%), AA (respectively 94.65%, 98.50% and 78.58%) or κ. It is also easily observed that, in terms of classwise/producer's accuracy, our method is better than the other methods for most of the classes and only marginally exceeded by other methods for a few of them. For the Houston dataset, the accuracy for the 'commercial' class (92.12%) is significantly improved by our method in comparison to the others. This can be attributed to the fact that commercial regions generally have a variable layout with frequent elevation changes that are effectively captured by the LiDAR based attention maps. Similarly, for the Trento dataset, the 'road' class shows a notable increase in accuracy (93.32%), also on account of the variation in road profile with respect to elevation. The classification maps for the Houston, Trento and MUUFL datasets are presented in Fig. 4, 5 and 6 respectively. It can be visually verified that the classification maps obtained from FusAtNet tend to be less noisy and have smoother interclass transitions. It is also observed in Fig. 4 that methods such as SVM and the two-branch CNN tend to classify the shadowy areas as water (in the right portion of the maps) because of their darker tone; our approach largely mitigates this problem as well.

Ablation study
We further carried out different ablation studies to highlight the individual aspects of our model. In Table 4, we evaluate our model's performance by iteratively removing each of the attention modules. It is evident that in the absence of any one of the attention modules, the model underperforms. In addition, the importance of the spatial characteristics of the LiDAR modality is also demonstrated, since keeping only the LiDAR based spatial attention module gives better accuracy than keeping only the HSI based spectral attention module for all three datasets. Table 5 displays the performance of our method when trained without data augmentation. Since our model is quite deep, there is a decrease in performance when no augmentation is applied to the training samples. The magnitude of this decrease is largest for the Houston dataset (4.76%), since it has the most features in comparison to the other datasets and hence requires comparatively more iterations to converge to a good accuracy.
Furthermore, an additional ablation study is carried out on all the datasets to check the effect of decreasing the training set size on the performance of our model, as displayed in Table 6. As expected, the accuracy progressively decreases as the number of training samples decreases, further reinforcing the high data requirement of deep learning models.

Conclusions and future work
We introduce a novel fusion network for HSI and LiDAR data for the purpose of producing improved land-cover maps. Our network, called FusAtNet, judiciously utilizes different attention learning modules to learn joint feature representations given both input modalities. To this end, we propose the notion of cross-attention, where the feature learning stream for a given modality is influenced by the other modality. The results obtained on multiple datasets confirm the efficacy of the proposed fusion network. Owing to the generic nature of FusAtNet, it can be extended to support a varied range of modalities with minimum overhead. In the future, we plan to extend the network to support more than two modalities. We also plan to perform rigorous model engineering to limit the number of learnable parameters without compromising performance, for example by using dilated convolutions in the attention modules.