SIS-CAM: An Enhanced Integrated Score-Weighted Method Combined with Gradient Optimization for Interpreting Convolutional Neural Networks

doi:10.21203/rs.3.rs-4174042/v1

Download PDF

Research Article

SIS-CAM: An Enhanced Integrated Score-Weighted Method Combined with Gradient Optimization for Interpreting Convolutional Neural Networks

https://doi.org/10.21203/rs.3.rs-4174042/v1

This work is licensed under a CC BY 4.0 License

You are reading this latest preprint version

The opacity of deep convolutional neural network(CNN) models has hindered their performance enhancement across various domains, posing challenges in understanding their internal mechanisms. To address this, computer vision has developed approaches to assess CNN interpretability via visualization. However, existing techniques often encounter noise during gradient calculation and may produce rough, blurry saliency maps, leading to the localization of meaningless information. This paper proposes SIS-CAM, optimizing gradients using squared values during backpropagation and integrating the initial saliency map with the input image via feature fusion. The image is iteratively integrated with a masked approach, averaged, and linearly combined with the initial saliency map. This approach refines gradients through squaring, enhancing visual features of neuron activation and improving the saliency map’s effectiveness in capturing information. The improved gradients are integrated with feature mappings to derive preliminary masks, which are merged with the input image to derive secondary masks for accurate delineation of boundary features. Integration operations on the secondary masks compute average scores of masked input images, which are then amalgamated with the initial saliency map to generate the final map. The proposed method undergoes qualitative and quantitative evaluation, including Deletion tests, Insertion tests, Average Drop, Average Insertion tests, Class Discriminative Visualization, and sanity checks on 2000 images from the ILSVRC2012val dataset. Experimental findings show that SIS-CAM effectively reduces noise in saliency maps, accurately captures target boundary characteristics, and exhibits superior visual performance compared to the baseline model.

Explainable AI

Convolutional neural networks

Class activation maps

Saliency map

Deep convolutional neural networks have significantly transformed computer vision by using their sophisticated layered structure, allowing machines to accurately perceive and understand the visual realm. This proficiency is essential across a broad spectrum of devices, encompassing tasks such as image classification [1, 2] and recognition of objects [3, 4]. However, a crucial concern emerges regarding the comprehensibility and clarity of deep neural networks (DNNs). These models are opaque and have trouble offering compelling explanations for their decision-making process. This limitation mitigates the deep networks’ endurance for vital implementations, thus even for crucial applications. Instance these networks engage in intricate activities, their decision-making processes become less transparent, which raises issues in domains that demand a high level of lucidity and dependability, such as healthcare diagnostics [5], facial recognition, and vehicular autonomy.

Various interpretability methods have been developed to showcase the underlying principles of a model and enable a comprehensive understanding of its inner workings. These methods visualize the saliency maps by annotating regions with stronger colors that are more important for the decision-making process, thereby demonstrating the model’s representations and decision-making process. This visualization approach is currently the mainstream method for visualizing saliency maps. Zhou et al. [6] employing class activation mapping (CAM), known as a revolutionary technique, different layers within a network structure can effectively function as unsupervised item detectors. Their methodology employed a global average pooling layer [7] to generate a visual depiction of the weighted amalgamation of feature maps at the critical penultimate layer. This methodology resulted in the development of perceptive heat maps, accurately delineating the specific regions of an input image that the CNN scrutinizes to determine its classification. However, a fascinating intricacy of this method was the necessity to train a linear classifier anew for each unique category.

Cam-based explanations [6, 8, 9] elegantly unravel the mystery of neural decisions for specific inputs. They achieve this by combining activation maps from convolutional layers in a linear and weighted manner, resulting in visual clarity. Innovative techniques such as Grad-CAM [9] employ a unique strategy by calculating gradients via the network’s target layer via back-propagation of the prediction score. These gradients operate as coefficients to combine the forward feature maps. Significantly, these techniques [10, 11], which are frequently faster, need only one or a predetermined number of network queries [12]. Zhang [13] developed a novel visual interpretability saliency model that has a problem of containing an excessive amount of irrelevant information. This is because the feature maps may not consistently match with the intended category. Certain approaches [14, 15] utilize high-resolution signals, such as saliency maps extracted from internal network layers e.g., feature maps and gradients, to enhance accuracy. These methods are employed to accomplish one of the two strategies is predominantly employed to achieve the objectives. A viable approach involves leveraging the outputs of the lower layers of convolutional neural networks, typically characterized by higher spatial resolutions [15, 16]. Nonetheless, the shallow layers of these models exhibit characteristics akin to edge detectors. Consequently, the saliency map generated from shallow layers possesses diminished capability in discriminating between different classes. The subsequent method enhances the resolution of the input image by increasing its scale, allowing for the extraction of higher-resolution signals from the deeper layers [14]. Simply averaging these signals, Grad-CAM [9] amalgamates gradients and feature maps from inputs of various scales by computing their mean. However, techniques [15, 16] fail to fully exploit the distinctiveness of signals produced across multiple scales. To circumvent issues stemming from gradients, approaches such as Score-CAM [17] and IS-CAM [18] have been devised, relying solely on forward propagation to derive the weight value of the active map. However, these methods also suffer from issues such as excessive noise, coarse localization, and occasionally highlighting irrelevant regions.

To enhance the capture of important semantic regions in saliency maps and improve their localization accuracy, a novel CAM method called SIS-CAM is proposed. This method draws inspiration from Grad-CAM [9] and performs a square operation on the gradients obtained from backpropagation, optimizing it to augment the neural network’s capacity for capturing intricate features. Additionally, inspired by IS-CAM and Abs-CAM [19], SIS-CAM generates both initial and final saliency maps. The initial saliency map enhances the neural network’s capture of detailed features and strengthens the information of important semantic regions, while the final saliency map improves the precise localization of these important regions. Now, let’s analyze the novel aspects of this approach:

Enhancing Gradient Magnitudes through Squaring: Traditional methodologies rely on gradients acquired through backpropagation to delineate the pertinent regions within the input image influencing the CNN’s predictions. SIS-CAM improves upon this procedure by squaring these gradients. Squaring enhances neuron activation properties, thereby aiding in the identification of pivotal features within the image for the CNN’s decision-making process. This yields a saliency map characterized by enhanced clarity and detailed information, elucidating the importance of each image component.
Integration of input image masks into the process: SIS-CAM goes beyond merely improving gradients. Additionally, it integrates input image masks into the procedure, so enhancing the analysis with an additional level of complexity. The process commences by generating an initial saliency map, also known as the primary mask, by the fusion of enhanced gradients and feature mappings. This map offers a comprehensive depiction of the noteworthy regions within the image.
Generation of Secondary Masks: The generation of secondary masks involves integrating the initial masks with the input image. The use of secondary masks is essential for accurately predicting image border properties, which are frequently significant in comprehending the context and intricacies of images.
Computation of Mean Scores for Secondary Masks: SIS-CAM then computes the average scores of the concealed input images. This phase enhances the comprehension of the most significant parts of the image by calculating the average scores across the secondary masks. This aids in mitigating the saliency of irrelevant areas, thereby enhancing precision.
Integration with the Initial Saliency Map for the Final Saliency Map: The final step involves merging the average scores derived from the secondary masks with the initial saliency map to generate the final saliency map. This amalgamation yields the final saliency map, providing a comprehensive and detailed representation of the most significant features within the image as detected by the CNN.

The pursuit of enhanced comprehensibility in CNNs continues to be a significant obstacle in the field of deep learning. The proposed SIS-CAM introduces innovation by improving the gradients utilized in conventional saliency maps and integrating a multi-layered mask technique. This leads to a more precise, elaborate, and comprehension of the prominent aspects in an image that have the biggest impact on CNN’s predictions, thus enhancing the interpretability of these intricate models.

This novel approach provides a more flexible and adaptive trajectory for deep learning visual interpretation. The introduction of SIS-CAM represents a significant advancement in the effort to improve the interpretability of Convolutional Neural Networks. This research establishes a strong basis for our work and represents a crucial milestone in the development of CNN analysis. It emphasizes the need for clear, precise, and detailed display of models. With SIS-CAM, our goal is to establish a higher standard in the area of CNN interpretability, making a significant contribution to the wider field of artificial intelligence and machine learning.

In the domain of interpretability analysis concerning CNNs, numerous Saliency methodologies exist. These approaches are broadly classified into Perturbation-based, Gradient-based, and Class Activation Mapping-based Saliency Methods. To provide a more direct visualization of the classification structure, we have constructed a comprehensive classification diagram, as depicted in Fig. 1.

2.1 Perturbation-based Saliency Method

Various techniques have recently been suggested to create many altered images by obscuring portions of the original image. The modified images are subsequently inputted into CNNs to investigate the impacts and repercussions of the obscured regions on the output results. The technique of employing altered images as input to examine the impact of various regions on the output is known as the perturbation-based saliency approach. The process entails iteratively applying masks to the photos to generate a substantial quantity of altered input images [20]. Later on, several image perturbation techniques have been suggested, including Fong model [21], who introduced the concept of a perturbation mask as a trainable process. This enables the perturbation mask to effectively cover significant regions of the input image. Petsiuk et al. [22] employed random sampling to create several normalized masks for the input images. Subsequently, these masks were integrated with the input images, and the weighted sum of the masks along with the projected scores was computed to generate the ultimate saliency map [23]. While the process of explaining may seem straightforward, it necessitates a substantial amount of processing. Subsequently, Agarwal et al. [22] utilized generative models for creating perturbation masks that are visually more realistic, aiming to enhance the quality of perturbed images. However, this perturbation method necessitates a substantial time investment to refine and iterate the process of mask production. Contrary to the approaches mentioned earlier, XRAI [11] partitions the input image into segments and subsequently assesses the importance of different sections recursively. It then aggregates smaller regions into larger blocks according to their significance scores. However, the requirement for a substantial number of inquiries gives rise to concerns over the significant computational expense and limited efficiency. The random input sampling for explanation RISE [24] method is utilized for elucidating black-box models. RISE generates numerous randomly normalized masks between 0 and 1. These masks are then multiplied with input images to obtain softmax outputs of specific classes, which are weighted to produce more detailed perturbation maps. However, its drawback lies in its computational intensity.

2.2 Gradient-based Saliency Method

The gradient-based saliency technique assesses the importance of specific locations in the input image by analyzing the gradient of the model’s output with respect to the input features. Initially proposed by Karen Simonyan et al. [25], this approach calculates the sensitivity gradients of different areas in the input image based on the model’s confidence scores [26, 27]. By modifying the backpropagation rule, it becomes feasible to assign distinct scores to the input features, thereby elucidating the significance of particular regions within the input features. Zeiler et al. [20] introduced a novel technique entitled deconvolution, which allows for the examination of the characteristics of an input image that are learned by various layers of the convolutional neural network. Still, the deconvolution approach yields a significant amount of extraneous noise. Other research efforts have presented a novel idea of integrated gradients [10, 11], which leverages integrated gradients to measure the weight of input feature regions and comply with its specified sensitivity and invariance. To mitigate the problem of noise resulting from gradients, SmoothGrad [28] uses the technique of introducing Gaussian noise to the input image to get rid of noise from the gradient map. However, this approach is hindered by the extensive number of iterations and the lengthy processing time it requires. The gradient-based saliency method offers exact localization of image features, instead, it lacks in effectively determining the relevancy and discriminative capabilities of various regions.

2.3 Class Activation Mapping-based Saliency Method

In the initial phase, the Class activation mapping-based saliency technique computes the weighted scores for each activation map generated by the final convolutional layer of the convolutional neural network through global average pooling. Subsequently, these activation maps are linearly amalgamated with the scores to produce the ultimate saliency map. However, a notable limitation of CAM [6] is the absence of pooling layers in numerous image classification models. Hence, to apply the CAM technique to classification models that lack pooling layers, it is necessary to make alterations to the model’s design and endure retraining. Furthermore, CAM is restricted to scrutinizing solely the final convolutional layer of the model and is incapable of evaluating any intermediate layers. Grad-CAM [9] introduced a versatile architecture applicable to a wider range of CNN models, achieving this without depending on a Global Average Pooling (GAP) layer and possessing the capability to assess any intermediate layer of the CNN model. Nevertheless, Grad-CAM does exhibit significant limitations. In the case of images featuring many items forming the same class, it is capable of generating only a single saliency map. Additionally, the final saliency map tends to exhibit a significant amount of noise and lacks accuracy in accurately delineating the regions of interest. To address these challenges, GradCAM++ [8] utilizes a weighted fusion of the positive derivative of the target class score concerning the feature map of the final convolutional layer. This methodology overcomes the constraint of producing only one saliency map for an image containing multiple objects of the same class, thereby notably improving image visualization. Both Grad-CAM and Grad-CAM + + employ gradients to calculate the weight scores of the activation maps generated from the convolutional layers. However, this approach gives rise to issues such as gradient saturation, gradient disappearance, and gradient noise. Additionally, it fails to account for the influence of forward propagation on predictions. To address this issue, Score-CAM [17] replaces the former approach of employing gradients as the weight of the activation map with forward propagation. This approach effectively resolves the overall issues caused by gradients and results in an improved saliency map. IS-CAM [18] integrates an integration operation into the Score-CAM channel. This procedure merges the input mask and computes the average scores obtained from the normalized mask. The resulting weight scores are then linearly combined with the activation map. LFI-CAM [29] integrates an attention branch network to generate an attention map during the forward propagation phase. This attention map is then utilized to enhance the accuracy of CAM using the attention mechanism. The Group-CAM [30] method employs an approach known as "split-transform-merge". This involves initially dividing the class activations into separate groups and then summing the activation maps within each group to create the initial mask. The initial mask is subsequently modified and transformed to produce the input image mask. The input mask is subsequently employed in the neural network to acquire scores, which are then weighted and combined with the original mask to produce the saliency map. This process significantly enhances the efficiency of saliency map production, while some residual noise persists. The Relevance-CAM [31] method employs hierarchical relevance propagation to acquire weighted components, demonstrating strong performance in shallow object localization. POLY-CAM [32] combines the detailed activation maps from shallow network features with the enlarged activation maps produced by the network’s final layer, resulting in enhanced precision for localizing picture features. FD-CAM [33] improves the acceptability and distinguishability of score-based weights by conducting group channel switch operations. It amalgamates the score-based weight scores with the gradient-based weight scores to augment the discriminability of the saliency map. Abs-CAM [19] boosts the gradient by computing its absolute value and linearly merging it with the feature maps generated by the convolutional layer. The resultant image is subsequently inputted into the model for forward propagation to compute the weight scores. These scores are then allocated to each initial saliency map to produce the final saliency map.

This section will provide an introduction to the SIS-CAM algorithm and elucidate its mechanism for making explanation to CNNs. The algorithm’s overall structure is depicted in Fig. 2. Initially, the gradients are squared element-wise to produce the optimal initial saliency map. Subsequently, the improved saliency map is utilized as a mask for element-wise multiplication with the original image, yielding a blurred image. Ultimately, the blurred image undergoes an integration process before being inputted into the model to generate the final saliency map. The detailed procedures are outlined in Algorithm 1.

3.1 Initial Mask Obtained by Gradient Optimization

${X}_{0}^{ }\in {\mathbb{R}}^{3\times H\times W}$ denote the input image, where $\text{f}\left(·\right)$ represents a deep convolutional neural network model used for classification prediction. The score for type c predicted by the mode for the input image ${X}_{0}^{ }$ is denoted as: ${f}^{c}\left({X}_{0}^{ }\right)$. Upon feeding the input image ${X}_{0}^{ }$ into the model $\text{f}\left(·\right)$, it traverses through the l-th target convolutional layer, yielding a collection of feature maps ${A}_{l}^{ }$, where the k-th feature map is denoted as ${A}_{l}^{k}$. Initially, the gradient of ${f}^{c}\left({X}_{0}^{ }\right)$ with respect to each pixel $\left(\text{i}, \text{j}\right)$ of the k-th feature map of the l-th target convolutional layer needs to be computed.$\begin{array}{c}{\text{g}}_{ij}^{kc}=\frac{\partial {f}_{ }^{\text{c}}\left({X}_{0}\right)}{\partial {A}_{l}^{k}},\#\left(1\right)\end{array}$

Afterwards, the obtained gradients are squared element-wise to get the squared optimized gradients. The optimized gradients are further subjected to global average pooling in both the height and width dimensions, leading to the attainment of equal weighting coefficients across different neurons.

$$\begin{array}{c}{w}_{k}^{c}=\frac{1}{Z}\sum _{i} \sum _{j} {{(g}_{ij}^{kc})}_{ }^{2},\#\left(2\right)\end{array}$$

$where,$ $Z$ denotes the total number of pixels in a feature map ${A}_{l}^{k}$.

The weight coefficients $w$ obtained are linearly combined with the corresponding channel’s feature map. The initial saliency map serves to mask the original image for perturbation operations. It is imperative to initially upsample the activation map corresponding to significant regions in the original input space, and subsequently employ the upsampled initial saliency map to perturb the input image, emphasizing spatial locations strongly linked to the activation map. The results are subsequently denoised and adjusted through the process of upsampling and normalization to obtain the initial saliency map ${H}_{l}^{k}$. To facilitate the production of smoother activation map masks, the original activation values within each activation map are normalized to fall within the range of [0,1], rather than assigning binary values to all elements.

$$\begin{array}{c}{H}_{l}^{k}=Norm\left(\text{U}\text{p}\left({RELU(w}_{k}^{c}{A}_{l}^{k})\right)\right),\#\left(3\right)\end{array}$$

$$where,$$

$$\begin{array}{c}Norm\left({A}_{l}^{k}\right)=\frac{{A}_{l}^{k}-min{A}_{l}^{k}}{max{A}_{l}^{k}-min{A}_{l}^{k}}.\#\left(4\right)\end{array}$$

where “Up” represents up-sampling operation and “Norm” signifies the normalization operation.

3.2 Final Saliency Map Obtained by Integral

In prior studies, the traditional approach of extracting significant regions in input images and negating irrelevant regions typically entailed doing pixel-wise multiplication of a mask with the original image. Still, when it comes to classification tasks, the differentiation between significant and insignificant regions in several images is frequently ambiguous, resulting in the neglect of subtle pixel boundaries in the traditional pixel-wise multiplication method. Consequently, a new way is suggested to generate a blurred image by applying a unique linear computation technique to the input image and mask. This method overcomes the constraints of the pixel-wise multiplication approach.

$$\begin{array}{c}{H}_{l}^{k}={X}_{0}^{ }*{H}_{l}^{k}* \frac{k}{C}+{H}_{l}^{k-1}.\#\left(5\right)\end{array}$$

Where $C$ denotes the overall quantity of feature maps generated by the target layer, and $k$ represents the current feature map.

Subsequent to obtaining the blurred image through linear calculation, integration operations are performed on these blurred images. Initially, the parameter $N$ is set as the number of intervals between [0,1]. Given the similarities between integral and mean operations, the mean operation is used to approximate the integral operation. The map of all intervals for a single blurred image is calculated as follows:

$$\begin{array}{c}{M}_{n}^{k}={(X}_{0}^{ }*{H}_{l}^{k})*coe.\#\left(6\right)\end{array}$$

Here the symbol $coe$ represents the integral coefficient for each interval. ${M}_{n}^{k}$ denotes the map of the $n$- th interval for the $k$- th blurred image.

$$\begin{array}{c}coe=\frac{n+1}{N}.\#\left(7\right)\end{array}$$

For each blurred image, $N$ maps are generated for each interval, where $N$ represents the number of intervals. These maps are input into the neural network model to obtain $N$ scores. A subsequent averaging operation yields the final score for the current blurred image. Applying the same process to all $k$ blurred images results in the final scores for $k$ blurred images.

$$\begin{array}{c}{S}_{k}^{c}=\frac{1}{N}\sum _{n=1}^{N} \left({f}^{c}\left({M}_{n}^{k}\right)\right).\#\left(8\right)\end{array}$$

After the averaging process, a softmax function is applied to determine the influence of each image region on the original image prediction classification.

$$\begin{array}{c}{\alpha }_{k}^{c}=\frac{\text{exp}\left({S}_{k}^{c}\right)}{\sum _{k} \text{exp}\left({S}_{k}^{c}\right)}.\#\left(9\right)\end{array}$$

Finally, k-th channel’s initial saliency map ${T}_{l}^{k}$ will yield a contribution score ${\alpha }_{k}^{c}$for class c. The final saliency map is derived from linear calculation and summation superposition of the initial saliency map and the contribution score. The final saliency map ${\mathcal{L}}_{SIS-CAM}^{c}$ is derived following the ReLU operation:

$$\begin{array}{c}{\mathcal{L}}_{SIS-CAM}^{c}=ReLU\left(\sum _{k} {\alpha }_{k}^{c}{T}_{l}^{k}\right).\#\left(10\right)\end{array}$$

Algorithm 1

outlines the specific details of the proposed SIS-CAM method.

Algorithm 1: The proposed SIS-CAM interpretable algorithm

Input: Image${X}_{0}^{ }$, Model $\text{f}(·)$,Class $c$, Target convolutional layer $\mathcal{l}$, Integral coefficient $coe$, Integral steps $N$

Output: $\text{S}\text{a}\text{l}\text{i}\text{e}\text{n}\text{c}\text{y} \text{m}\text{a}\text{p} {\mathcal{L}}_{AIS-CAM}^{c}$

1: Initialization:

2: ${\mathcal{L}}_{AIS-CAM}^{c}=0$, ${M}_{ }^{1},\dots {,M}_{ }^{k}\leftarrow \left[\right],{A}_{\mathcal{l}}^{k}\leftarrow {f}_{\mathcal{l}}\left({X}_{0}^{ }\right) ,$

3: $C\leftarrow$the number of channels in ${A}_{\mathcal{l}}$, $coe=0.0$;

4: ${f}_{ }^{c}\left({X}_{0}^{ }\right)\leftarrow$Score of target class c;

5: For $k \varvec{i}\varvec{n} [0, \dots ,C-1]$ do

6: Gradients ${G}_{k}^{c}\leftarrow Backpropagation$;

7: Gradients Squaring: ${{(g}_{ij}^{kc})}_{ }^{2}$;

8: Weight ${w}_{k}^{c}\leftarrow Gap\left({G}_{k}^{c}\right)$;

9: ${T}_{l}^{k}=ReLU\left(\right({w}_{k}^{c}{A}_{l}^{k}\left)\right)$;

10: ${H}_{l}^{k}=normalize\left(upsample\right({T}_{l}^{k}\left)\right)$;

11: If $k\ne 0:$

12: ${H}_{l}^{k}={X}_{0}^{ }*{H}_{l}^{k}* \frac{k}{C}+{H}_{l}^{k-1}$;

13: For $n \varvec{i}\varvec{n} [1, \dots ,N]$ do

14: $coe$ = $\frac{n+1}{N}$;

15: ${M}_{n}^{k}= {(X}_{0}^{ }*{H}_{l}^{k})*coe$;

16: ${M}_{ }^{k}.append\left({M}_{i+1}^{k}\right)$;

17: End

18: End

19: ${M}_{ }^{1},\dots ,{M}_{ }^{k}\leftarrow$Batchify(${M}_{ }^{1},\dots {,M}_{ }^{k}$);

20: ${S}_{k}^{c}=\frac{1}{N}\sum _{n=1}^{N} \left({f}^{c}\left({M}_{n}^{k}\right)\right)$;

21: ${\alpha }_{k}^{c}\leftarrow \frac{\text{e}\text{x}\text{p}\left({S}_{k}^{c}\right)}{\sum _{k} \text{e}\text{x}\text{p}\left({S}_{k}^{c}\right)}$;

22: Return:${\mathcal{L}}_{AIS-CAM}^{c}\leftarrow ReLU\left(\sum _{k} {\alpha }_{k}^{c}{T}_{l}^{k}\right)$;

In this section, we will perform multiple evaluations to assess the effectiveness of the proposed SIS-CAM explanation approach. In Section 4.1, we will provide an introduction to the datasets, normalization parameters, neural network models, and baseline approaches that were utilized in the studies. In Section 4.2, we will compare SIS-CAM with the baseline approaches and assess the SIS-CAM method through the visualization of saliency maps. Section 4.3 will introduce the Faithfulness evaluation, which is a quantitative analysis method for assessing the model. In the forthcoming Section 4.4, experiments will be conducted to investigate the capability of SIS-CAM to distinguish between various categories within the same input image. In Section 4.5, we will conduct a quantitative evaluation of the approach using the Sanity Check. In Section 4.6, an original SIS-CAM will be compared with a simplified version of SIS-CAM by removing the feature fusion module, with the aim of validating the importance of the feature fusion module. In Section 4.7, we will conduct experiments to investigate the influence of the integration step count N on the efficacy of SIS-CAM. In Section 4.8, we will use SIS-CAM to perform the same experiments on the VGG-16 model to validate the generalizability of SIS-CAM across different CNN models.

4.1 Experimental Setup and Baselines

In the upcoming evaluation experiments, the pre-trained VGG-19 network from the PyTorch model zoo is employed as the base model by default. The datasets utilized are sourced from ILSVRC2012_val, comprising 50,000 images. The initial step involves reshaping the input image to dimensions of (224 * 224 * 3) and normalizing its range to [0, 1]. Subsequently, normalization is performed using the mean vector [0.485, 0.456, 0.406] and standard deviation vector [0.229, 0.224, 0.225] without additional preprocessing. For equitable comparison, all saliency maps are upsampled to dimensions of 224 * 224 using bilinear interpolation.

In this analysis and evaluation, we will use Grad-CAM, Grad-CAM++, IS-CAM, Group-CAM, Score-CAM, and Abs-CAM as the reference baselines. For IS-CAM and Group-CAM, we utilized either the parameters recommended by the authors or default parameters. For the RISE method, we set the patch size to 64, stride to (8, 8), and used 8000 masks for the RISE method. For the proposed SIS-CAM method in this paper, we will conduct experiments using an integration step number N = 15.

4.2 Qualitative Evaluation via Visualization

The saliency maps produced by our SIS-CAM method were qualitatively compared with seven other explanation methods for the same input image. Results are depicted in Fig. 3. In comparison to gradient-based methods such as Grad-CAM and Grad-CAM++, SIS-CAM’s saliency maps display reduced random noise, greater concentration in the target area, and improved class discrimination capabilities. While Grad-CAM obtains weights by globally average pooling the gradients and Grad-CAM + + uses high-order derivatives of the gradients to obtain weights, SIS-CAM uses the square of the gradients, enforcing gradient enhancement, thus making it more effective in identifying the target area. Group-CAM obtains the initial saliency map as a mask similar to SIS-CAM but without gradient optimization. Score-CAM circumvents the random noise problem caused by gradients by obtaining the saliency map through forward propagation.

IS-CAM incorporates an integration operation into the channels of Score-CAM, integrating the input mask and averaging the scores derived from the normalized mask to obtain weighted scores, which are then linearly combined with the activation map. SIS-CAM integrates the integration operation of IS-CAM and conducts a one-step preprocessing of the input image using the optimized gradients and feature maps of the convolutional layers, effectively enhancing the boundaries between the target area and background noise. Abs-CAM is similar to SIS-CAM, but it introduces an integration operation and uses square optimization in the gradient optimization, further enhancing the visualization effect of the saliency map. SIS-CAM retains the advantages of the gradients and further optimizes the disadvantages of gradients, as well as inheriting the advantages of the integration operation. Therefore, SIS-CAM outperforms other methods in saliency map generation.

4.3 Faithfulness Evaluation

Consensus regarding evaluation metrics for interpretability analysis of saliency maps or neural network models remains elusive. This study introduces Deletion and Insertion metric tests to assess the ability of interpretability methods to extract important pixel regions from input images. Additionally, it introduces Average Drop and Average Increase as metrics to assess the objectivity and credibility of interpretability methods. These metrics are primarily computed through element-wise multiplication of input images with saliency maps to observe alterations in the score of the target class. This study will utilize the top 50% pixels of the attention map as a mask and conduct element-wise multiplication with the input image. These metrics have been extensively employed in recent research, and therefore, this study introduces them to assess the effectiveness of interpretability methods.

4.3.1 Metrics Evaluation

Insertion

The Insertion metric, as formalized by Petsiuk et al. [22], quantifies the degree to which the probability of the predicted class increases with the addition of more significant pixels. This metric involves substituting pixels in a fully blurred version of the image with clear pixels from the original image and computing the Area Under the Curve (AUC) of the softmax classification scores as a quantitative measure. The importance of each pixel is derived from the generated saliency map. A sharp increase in the probability curve (relative to the increased pixel proportion) indicates a more effective explanation.

Deletion

The Deletion metric, also formalized by Petsiuk et al. [22], is the inverse of the Insertion metric. In this approach, clear pixels from the original input image are progressively replaced with highly blurred pixels until all pixels are blurred. Subsequently, the AUC of the softmax classification scores serves as a quantitative measure. The Deletion metric gauges the extent to which the probability of the predicted class diminishes as significant pixels are removed. A sharp decline in the probability curve signifies better explanatory capability.

Over-all

The Over-all metric is obtained by subtracting the Deletion metric from the Insertion metric. A good interpretability method should not only perform well in either the Insertion or Deletion metrics, and Over-all provides a comprehensive evaluation of an interpretability method.

Average Increase

Average Increase is determined as the mean percentage reduction in model confidence for a particular class in the image when solely the explained image region (acquired through element-wise multiplication of the input image with the top 50% of the attention map pixels utilized as a mask) is employed as input.

$$\begin{array}{c}Average Increase= \begin{array}{c}\sum _{i=1}^{N} \frac{\text{max}\left(0,{Y}_{i}^{c}-{O}_{i}^{c}\right)}{{Y}_{i}^{c}}\times 100 \#\#\#\#\end{array},\#\left(11\right)\end{array}$$

${Y}_{i}^{c}$ : Representation of the model’s output score for class c on the i-th image.

${O}_{i}^{c}$ : Representation of the model’s output score for class c on the i-th image when only the explained image region is used as input.

Average Drop : Average Drop quantifies the frequency of the increase in model confidence when solely the explained image region is supplied as input across the entire dataset. Formally, the confidence increase rate metric is defined as:

$$\begin{array}{c}Average Drop= \left({\sum }_{i=1}^{N} \frac{{1}_{{Y}_{i}^{c}<{O}_{i}^{c}}^{ }}{N}\right)\times 100,\#\left(12\right)\end{array}$$

where 1 is an indicator function that yields 1 when the parameter is true. The definitions of the other symbols remain consistent with those specified in the preceding metric.

Both Average Increase and Average Drop metrics serve to assess the fidelity of interpretation maps produced by interpretation methods in target recognition tasks. We follow the measurement calculation formula in [22].

4.3.2 Deletion and Insertion Visual Results

In the preceding, we presented the principles of deletion and insertion metrics, which involve gradually decreasing the model’s prediction score by removing the pixel region most relevant to the image category [6, 22]. Conversely, the insertion metric involves gradually adding the original image’s pixels to a completely blurred image and offers the added benefit of attenuating the influence of adversarial attack instances [27]. Experimental outcomes illustrating the Deletion curve and Insertion curve are depicted in Fig. 4, demonstrating that SIS-CAM outperforms the other four interpretable methods in both the visualized saliency map effect and actual indicator values. Furthermore, Table.1 compares SIS-CAM with six other interpretability methods in Deletion, Insertion, and Overall metrics, all of which were tested using an average result obtained from 2000 randomly extracted images from the ILSVRC2012val dataset. The results indicate that SIS-CAM outperforms the other methods in these insertion and Over-all metrics. The favorable metric data reflects SIS-CAM’s ability to better identify the pixel region where the target category is located in the input image. The data shows that SIS-CAM has the highest value for Over-all metric and the highest value for the Insertion metric, indicating that SIS-CAM has more concentrated saliency map areas and less noise compared to the other interpretable methods.

Table 1

The average AUC scores of the Insertion curve (higher scores indicate better performance) and Deletion curve (lower scores indicate better performance) are computed across 2000 ILSVRC2012_val images. The best results are highlighted in **bold**.
Methods	Insertion	Deletion	Over-all
Grad-CAM Grad-CAM++ Score-CAM Group-CAM Abs-CAM IS-CAM SIS-CAM	0.52938 0.52065 0.54857 0.52076 0.54258 0.54727 0.56261	0.09128 0.09663 0.08774 0.10637 0.08869 0.08908 0.09174	0.43810 0.42402 0.46083 0.41439 0.45389 0.45819 0.47087

4.3.3 Average Drop and Average Increase

Here we will introduce the evaluation of the objective faithfulness of SIS-CAM compared to other mainstream interpretable methods using Average Drop and Average Increase. We conducted comparisons between SIS-CAM and five other interpretable methods, performing experiments on each method using 2000 randomly selected images from the ILSVRC2012_val dataset. As depicted in Table 2, SIS-CAM achieves an Average Drop of 0.24352 and an Average Increase value of 0.29215. SIS-CAM ranks first in the Average Drop metric and second in the Average Increase metric. The experimental results indicate that SIS-CAM has more precise saliency map areas and less noise compared to other models.

Table 2

Average Drop (lower values indicate better performance) and Average Increase in Confidence (higher values indicate better performance) were computed across 2000 ILSVRC2012_val images.
Methods	Average-Drop	Average-Increase
Grad-CAM Grad-CAM++ Score-CAM Group-CAM Abs-CAM IS-CAM SIS-CAM	0.29960 0.29930 0.25761 0.30293 0.26432 0.25785 0.24352	0.30000 0.25100 0.28350 0.26078 0.28500 0.27750 0.29215

4.4 Class Discriminative Visualization

To validate the discriminative capability of the proposed method across various target categories within the same image, we selected an input image with two categories, still from the ILSVRC2012val dataset, containing the animal categories "Tiger cat" and "Bull mastiff." As shown in Fig. 5, we classified the input image using VGG-19, setting the image label as "bull mastiff," with confidence scores of 0.32% and 46% for "bull mastiff" and "Tiger cat," respectively. The former’s classification score is significantly lower than the latter, and based on Fig. 5, it is evident that SIS-CAM can accurately identify the pixel areas where the target categories are located. Hence, it can be deduced that SIS-CAM effectively discriminates between different categories within the same image.

4.5 Sanity Check

According to [34], evaluating the saliency map results of interpretable methods solely through visual inspection may introduce certain biases. Therefore, in this section, we adopt a sanity check [34] to assess the saliency map results of SIS-CAM. Following the method proposed in [34], we perform cascading and independent randomization on the VGG-19 network to visualize the saliency maps of SIS-CAM at each model step. Through this sanity check approach, we can validate that SIS-CAM is capable of obtaining saliency maps for specified classes based on the actual model weights. As depicted in Fig. 6, the sanity check outcomes yielded by SIS-CAM at various convolutional layers exhibit variability, indicating SIS-CAM’s high sensitivity to model parameters and its ability to accurately reflect the level and quality of the model.

4.6 Visual Comparison of Saliency Map Ablation

To illustrate the efficacy of the saliency map generated with the inclusion of the feature fusion process relative to the saliency map generated without it, we conduct a visual comparison between the saliency maps produced by the feature fusion process and those generated without it. The visual outcomes are shown in Fig. 7, which reveals that adding the feature fusion process results in a more accurate determination of the target class’s location and precise visualization of the important feature areas of the target class.

4.7 Impact of Intervals Evaluation

In addition, to demonstrate that the hyperparameter Integral steps N has little impact on the interpretation of SIS-CAM for convolutional neural networks, we set N = 10, 15, 20, 25 and randomly selected 400 ILSVRC2012_val images for interpretability analysis of VGG-19 on these 400 images. We used the Deletion, Insertion, Avg-Increase, and Avg-Drop metrics for evaluation. As shown in Fig. 8, the maximum difference in the Deletion and Insertion metrics for the variation of the Integral steps N parameter in SIS-CAM is only 0.00433 and 0.00544, respectively. Similarly, the maximum difference for Avg-Increase and Avg-Drop is 0.02295 and 0.0175, indicating that the SIS-CAM method is minimally affected by the variation of the hyperparameter Integral steps N.

4.8 Recent State of Art Models

To showcase the versatility of the proposed approach across various CNN architectures, interpretability assessments were conducted on an alternate CNN model, VGG-16, alongside other explanation methods, as illustrated in Fig. 9. The findings suggest that the proposed SIS-CAM method can be used for interpretability analysis across different CNN models. Furthermore, it is evident that SIS-CAM possesses strong target class localization capabilities.

This study introduces a novel convolutional neural network interpretation method termed SIS-CAM. It combines the strengths of gradient-based and score-based interpretation techniques by optimizing the gradient squared and conducting integration operations on the input image’s mask. Experimental findings indicate that SIS-CAM yields interpretation maps with reduced noise, enhanced credibility, and effective discriminative capabilities. A large number of qualitative and quantitative experiments confirm the superiority of SIS-CAM. However, the integration process requires multiple inputs of masked images into the model to obtain average scores, which increases the time consumption. Additionally, squaring the gradient only partially eliminates noise from the gradient, rather than completely removing the noise issue present in the gradient. Therefore, in future work, further exploration of these two aspects is necessary to achieve an efficient and low-noise interpretable method.

Author Contribution

Yuquan Zhang: Contributed to this research by participating in manuscript writing, experimental dataset collection, paper illustrations, citation organization, experimental code development, experimental data collection, and paper formatting.Umer Sadiq Khan: Contributed to this research by participating in manuscript writing, citation organization, experimental data collection, and paper formatting.Fang Xu: Contributed to this research by participating in paper illustrations, citation organization, and experimental data collection.Yan Zhang: Contributed to this research by participating in paper illustrations, citation organization, and experimental data collection.Zhimin Li: Contributed to this research by participating in paper illustrations, citation organization, and experimental data collection.Yi Ma: Contributed to this research by participating in paper illustrations, citation organization, and experimental data collection.Zhen Liu: Contributed to this research by participating in paper illustrations, citation organization, and experimental data collection.Na Yang: Contributed to this research by participating in paper illustrations, citation organization, and experimental data collection.

Krizhevsky, A., I. Sutskever, and G.E. Hinton, ImageNet classification with deep convolutional neural networks. Communications of the ACM, 2017. 60(6): p. 84-90.
Simonyan, K. and A. Zisserman, Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Zhang, P., et al., Feature aggregation with transformer for RGB-T salient object detection. Neurocomputing, 2023. 546: p. 126329.
Redmon, J., et al. You only look once: Unified, real-time object detection. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
Saleem, H., A.R. Shahid, and B. Raza, Visual interpretability in 3D brain tumor segmentation network. Computers in Biology and Medicine, 2021. 133: p. 104410.
Zhou, B., et al. Learning deep features for discriminative localization. in Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
Lin, M., Q. Chen, and S. Yan, Network in network. arXiv preprint arXiv:1312.4400, 2013.
Chattopadhay, A., et al. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. in 2018 IEEE winter conference on applications of computer vision (WACV). 2018. IEEE.
Selvaraju, R.R., et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. in Proceedings of the IEEE international conference on computer vision. 2017.
Qi, Z., S. Khorram, and F. Li. Visualizing Deep Networks by Optimizing with Integrated Gradients. in CVPR Workshops. 2019.
Kapishnikov, A., et al. Xrai: Better attributions through regions. in Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
Li, Q., Understanding Saliency Prediction with Deep Convolutional Neural Networks and Psychophysical Models. arXiv preprint arXiv:2204.06071, 2022.
Zhang, Q., L. Rao, and Y. Yang. A novel visual interpretability for deep neural networks by optimizing activation maps with perturbation. in Proceedings of the AAAI Conference on Artificial Intelligence. 2021.
Wang, B., et al., Multi-scale low-discriminative feature reactivation for weakly supervised object localization. IEEE Transactions on Image Processing, 2021. 30: p. 6050-6065.
Jiang, P.-T., et al., Layercam: Exploring hierarchical class activation maps for localization. IEEE Transactions on Image Processing, 2021. 30: p. 5875-5888.
Rebuffi, S.-A., et al. There and back again: Revisiting backpropagation saliency methods. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
Wang, H., et al. Score-CAM: Score-weighted visual explanations for convolutional neural networks. in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. 2020.
Naidu, R., et al., IS-CAM: Integrated Score-CAM for axiomatic-based explanations. arXiv preprint arXiv:2010.03023, 2020.
Zeng, C., et al., Abs-CAM: a gradient optimization interpretable approach for explanation of convolutional neural networks. Signal, Image and Video Processing, 2023. 17(4): p. 1069-1076.
Zeiler, M.D. and R. Fergus. Visualizing and understanding convolutional networks. in Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part I 13. 2014. Springer.
Fong, R.C. and A. Vedaldi. Interpretable explanations of black boxes by meaningful perturbation. in Proceedings of the IEEE international conference on computer vision. 2017.
Agarwal, C., D. Schonfeld, and A. Nguyen, Removing input features via a generative model to explain their attributions to classifier's decisions. 2019.
Sundararajan, M., A. Taly, and Q. Yan. Axiomatic attribution for deep networks. in International conference on machine learning. 2017. PMLR.
Petsiuk, V., A. Das, and K. Saenko, Rise: Randomized input sampling for explanation of black-box models. arXiv preprint arXiv:1806.07421, 2018.
Simonyan, K., A. Vedaldi, and A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034, 2013.
Bach, S., et al., On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 2015. 10(7): p. e0130140.
Nam, W.-J., et al. Relative attributing propagation: Interpreting the comparative contributions of individual units in deep neural networks. in Proceedings of the AAAI conference on artificial intelligence. 2020.
Smilkov, D., et al., Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825, 2017.
Lee, K.H., et al. Lfi-cam: Learning feature importance for better visual explanation. in Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
Zhang, Q., L. Rao, and Y. Yang, Group-cam: Group score-weighted visual explanations for deep convolutional networks. arXiv preprint arXiv:2103.13859, 2021.
Lee, J.R., et al. Relevance-cam: Your model already knows where to look. in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
Englebert, A., O. Cornu, and C. De Vleeschouwer, Poly-CAM: High resolution class activation map for convolutional neural networks. arXiv preprint arXiv:2204.13359, 2022.
Li, H., et al. FD-CAM: Improving Faithfulness and Discriminability of Visual Explanation for CNNs. in 2022 26th International Conference on Pattern Recognition (ICPR). 2022. IEEE.
Adebayo, J., et al., Sanity checks for saliency maps. Advances in neural information processing systems, 2018. 31.

No competing interests reported.

Download PDF

Reviewers invited by journal
30 Mar, 2024
Submission checks completed at journal
27 Mar, 2024
Editor assigned by journal
27 Mar, 2024
First submitted to journal
27 Mar, 2024

You are reading this latest preprint version

SIS-CAM: An Enhanced Integrated Score-Weighted Method Combined with Gradient Optimization for Interpreting Convolutional Neural Networks

Status:

Version 1

Abstract

Figures

1 Introduction

2 Related Work

2.1 Perturbation-based Saliency Method

2.2 Gradient-based Saliency Method

2.3 Class Activation Mapping-based Saliency Method

3 Methods

3.1 Initial Mask Obtained by Gradient Optimization

3.2 Final Saliency Map Obtained by Integral

4 Experiments and Evaluation

4.1 Experimental Setup and Baselines

4.2 Qualitative Evaluation via Visualization

4.3 Faithfulness Evaluation

4.3.1 Metrics Evaluation

4.3.2 Deletion and Insertion Visual Results

4.3.3 Average Drop and Average Increase

4.4 Class Discriminative Visualization

4.5 Sanity Check

4.6 Visual Comparison of Saliency Map Ablation

4.7 Impact of Intervals Evaluation

4.8 Recent State of Art Models

5 Conclusion

Declarations

Author Contribution

References

Additional Declarations

Status:

Version 1