Deep convolutional neural networks have significantly transformed computer vision: their sophisticated layered structure allows machines to perceive and understand the visual world with high accuracy. This proficiency underpins a broad spectrum of applications, including image classification [1, 2] and object recognition [3, 4]. However, a crucial concern arises regarding the comprehensibility of deep neural networks (DNNs). These models are opaque and struggle to offer compelling explanations for their decision-making process, a limitation that hinders their adoption in critical applications. As these networks take on increasingly intricate tasks, their decision-making becomes less transparent, raising issues in domains that demand a high level of clarity and dependability, such as healthcare diagnostics [5], facial recognition, and autonomous driving.
Various interpretability methods have been developed to expose the underlying principles of a model and enable a comprehensive understanding of its inner workings. Many of them produce saliency maps, coloring the regions that matter most for a decision more strongly, thereby revealing the model's representations and decision-making process; this visualization approach is currently the mainstream one. Zhou et al. [6] introduced class activation mapping (CAM), a revolutionary technique showing that layers within a network can effectively function as unsupervised object detectors. Their method employs a global average pooling layer [7] to produce a weighted combination of the feature maps at the critical penultimate layer, yielding perceptive heat maps that accurately delineate the regions of an input image the CNN scrutinizes when determining its classification. A notable limitation of this method, however, is that a linear classifier must be trained anew for each category.
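The CAM construction described above can be sketched in a few lines (a minimal NumPy sketch, assuming the penultimate-layer feature maps and the target class's classifier weights have already been extracted from the network; the function name `cam` is ours):

```python
import numpy as np

def cam(feature_maps, class_weights):
    """Class Activation Mapping: linearly combine the penultimate-layer
    feature maps using the target class's classifier weights (the weights
    learned on top of global average pooling)."""
    # feature_maps: (K, H, W); class_weights: (K,)
    heatmap = np.tensordot(class_weights, feature_maps, axes=1)  # (H, W)
    heatmap = np.maximum(heatmap, 0)      # keep only positive evidence
    if heatmap.max() > 0:
        heatmap /= heatmap.max()          # normalize to [0, 1] for display
    return heatmap
```

Because the weights come from the classifier itself, a new linear classifier is needed for every class one wishes to visualize, which is the retraining cost noted above.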
CAM-based explanations [6, 8, 9] elegantly unravel the mystery of neural decisions for specific inputs by combining activation maps from convolutional layers in a linear, weighted manner. Grad-CAM [9] computes gradients at the network's target layer by back-propagating the prediction score and uses these gradients as coefficients to combine the forward feature maps. Significantly, such techniques [10, 11] are frequently fast, needing only one or a predetermined number of network queries [12]. Zhang [13] developed a novel visual-interpretability saliency model, but it retains an excessive amount of irrelevant information because the feature maps do not consistently match the intended category. Certain approaches [14, 15] utilize high-resolution signals, such as saliency maps extracted from internal network layers (e.g., feature maps and gradients), to enhance accuracy; they predominantly follow one of two strategies. The first leverages the outputs of the lower layers of convolutional neural networks, which typically have higher spatial resolution [15, 16]. However, the shallow layers of these models behave much like edge detectors, so saliency maps generated from them have diminished capability to discriminate between classes. The second strategy increases the scale of the input image so that higher-resolution signals can be extracted from the deeper layers [14]; Grad-CAM [9], for example, amalgamates gradients and feature maps from inputs at various scales by simply computing their mean. Such techniques [15, 16] therefore fail to fully exploit the distinctiveness of the signals produced at each scale.
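The Grad-CAM weighting just described reduces to globally average-pooling the back-propagated gradients and using the result as per-channel coefficients (a minimal NumPy sketch; in practice the gradients would come from autograd on the target-class score):

```python
import numpy as np

def grad_cam(feature_maps, gradients):
    """Grad-CAM: gradients of the class score w.r.t. a target conv layer
    are global-average-pooled into per-channel weights, which then combine
    the forward feature maps of that layer."""
    # feature_maps, gradients: (K, H, W), taken from the same conv layer
    weights = gradients.mean(axis=(1, 2))                  # GAP -> (K,)
    heatmap = np.tensordot(weights, feature_maps, axes=1)  # (H, W)
    heatmap = np.maximum(heatmap, 0)                       # ReLU
    if heatmap.max() > 0:
        heatmap /= heatmap.max()                           # scale to [0, 1]
    return heatmap
```

Only a single backward pass through the network is required, which is why these techniques need so few network queries.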
To circumvent issues stemming from gradients, approaches such as Score-CAM [17] and IS-CAM [18] have been devised, relying solely on forward propagation to derive the weights of the activation maps. However, these methods also suffer from excessive noise and coarse localization, and occasionally highlight irrelevant regions.
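The gradient-free idea behind Score-CAM can be illustrated as follows (a hedged sketch: `score_fn` stands in for a forward pass returning the target-class score, and the activation maps are assumed to be already upsampled to the input resolution):

```python
import numpy as np

def score_cam(score_fn, image, feature_maps, eps=1e-8):
    """Score-CAM-style weighting: each activation map, normalized and
    applied as a mask on the input, is scored by a forward pass; those
    scores weight the maps. No gradients are involved."""
    # image: (H, W); feature_maps: (K, H, W), upsampled to image size
    weights = np.empty(len(feature_maps))
    for k, fmap in enumerate(feature_maps):
        mask = (fmap - fmap.min()) / (fmap.max() - fmap.min() + eps)
        weights[k] = score_fn(image * mask)   # target-class score of masked input
    heatmap = np.tensordot(weights, feature_maps, axes=1)
    heatmap = np.maximum(heatmap, 0)
    if heatmap.max() > 0:
        heatmap /= heatmap.max()
    return heatmap
```

Each activation map costs one extra forward pass, and noisy maps still receive nonzero scores, which is one source of the coarse, noisy localization noted above.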
To enhance the capture of important semantic regions in saliency maps and improve their localization accuracy, a novel CAM method called SIS-CAM is proposed. Drawing inspiration from Grad-CAM [9], it performs a square operation on the gradients obtained from backpropagation, augmenting the neural network's capacity to capture intricate features. Additionally, inspired by IS-CAM and Abs-CAM [19], SIS-CAM generates both an initial and a final saliency map: the initial saliency map sharpens the capture of detailed features and strengthens the information in important semantic regions, while the final saliency map improves the precise localization of those regions. The novel aspects of this approach are as follows:
- Enhancing Gradient Magnitudes through Squaring: Traditional methods rely on gradients acquired through backpropagation to delineate the regions of the input image that influence the CNN's predictions. SIS-CAM improves on this procedure by squaring the gradients, which enhances neuron activation properties and aids in identifying the features most pivotal to the CNN's decision-making. This yields a saliency map with greater clarity and detail, elucidating the importance of each image component.
- Integration of Input Image Masks into the Process: SIS-CAM goes beyond merely improving gradients; it also incorporates input image masks into the procedure, enriching the analysis with an additional level of detail. The process begins by generating an initial saliency map, the primary mask, through the fusion of the enhanced gradients and the feature maps. This map offers a comprehensive depiction of the noteworthy regions of the image.
- Generation of Secondary Masks: Secondary masks are produced by combining the initial mask with the input image. They are essential for accurately capturing image boundary properties, which are frequently significant in comprehending the context and intricacies of an image.
- Computation of Mean Scores for Secondary Masks: SIS-CAM then computes the average scores of the masked input images. Averaging the scores across the secondary masks sharpens the comprehension of the most significant parts of the image and mitigates the saliency of irrelevant areas, thereby enhancing precision.
- Integration with the Initial Saliency Map for the Final Saliency Map: Finally, the average scores derived from the secondary masks are merged with the initial saliency map to generate the final saliency map, a comprehensive and detailed representation of the features within the image that the CNN finds most significant.
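The steps above can be sketched end to end as follows. This is a hedged NumPy sketch only: `score_fn` stands in for a forward pass returning the target-class score, the feature maps and gradients are assumed to be already upsampled to the input resolution, and the thresholded secondary masks and multiplicative fusion rule are illustrative assumptions rather than the method's exact formulation:

```python
import numpy as np

def sis_cam_sketch(score_fn, image, feature_maps, gradients, eps=1e-8):
    """Illustrative SIS-CAM pipeline following the five steps in the text."""
    # Step 1: square the back-propagated gradients to amplify strong
    # responses, then pool them into per-channel weights.
    weights = (gradients ** 2).mean(axis=(1, 2))             # (K,)
    # Step 2: initial saliency map (primary mask) from the fusion of the
    # enhanced gradients and the feature maps.
    initial = np.maximum(np.tensordot(weights, feature_maps, axes=1), 0)
    initial = initial / (initial.max() + eps)                # (H, W) in [0, 1]
    # Step 3: secondary masks combine the primary mask with the input image
    # (here, thresholded variants of the primary mask -- an assumption).
    thresholds = (0.25, 0.5, 0.75)
    masked_inputs = [image * (initial >= t) for t in thresholds]
    # Step 4: mean score of the masked inputs; averaging suppresses the
    # contribution of irrelevant regions.
    mean_score = np.mean([score_fn(m) for m in masked_inputs])
    # Step 5: merge the mean score with the initial map to obtain the
    # final saliency map (multiplicative fusion assumed here).
    final = initial * mean_score
    return final / (final.max() + eps)
```

The squared gradients discard sign information, so this sketch treats strongly negative gradients as evidence too; how the actual method handles sign is left to the later sections.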
The pursuit of enhanced comprehensibility in CNNs remains a significant obstacle in deep learning. The proposed SIS-CAM innovates by improving the gradients used in conventional saliency maps and integrating a multi-layered mask technique. This leads to a more precise and elaborate comprehension of the prominent aspects of an image that most influence the CNN's predictions, thus enhancing the interpretability of these intricate models.
This novel approach provides a more flexible and adaptive trajectory for the visual interpretation of deep learning models. The introduction of SIS-CAM represents a significant advancement in the effort to improve the interpretability of convolutional neural networks. This research establishes a strong basis for our work and a crucial milestone in the development of CNN analysis, emphasizing the need for clear, precise, and detailed visualization of model behavior. With SIS-CAM, our goal is to set a higher standard for CNN interpretability, making a significant contribution to the wider field of artificial intelligence and machine learning.