COVID-19/Pneumonia Classification Based on Guided Attention

With the novel coronavirus 2019 (COVID-19) continuing to have a devastating effect around the globe, many scientists and clinicians are actively seeking new techniques to assist in tackling this disease. Modern machine learning methods have shown promise in assisting the healthcare industry through data- and analytics-driven decision making, inspiring researchers to develop new angles to fight the virus. In this paper, we aim to develop a robust method for the detection of COVID-19 from patients' chest X-ray images. Despite recent progress, the scarcity of data has thus far limited the development of a robust solution. We extend upon existing work by combining publicly available data from 5 different sources and carefully annotating the resulting images into three categories: normal, pneumonia, and COVID-19. To achieve high classification accuracy, we propose a training pipeline based on the directed guidance of traditional classification networks, where the guidance is provided by an external segmentation network. With this pipeline, we observed that widely used, standard networks can achieve an accuracy comparable to models tailor-made for COVID-19; furthermore, one network, VGG-16, outperformed the best of the tailor-made models.


Introduction
Since its introduction into the human population in late 2019, COVID-19 continues to have a devastating effect on the global populace, with the number of infected individuals steadily rising 1 . With widely available treatments still outstanding and the continued strain placed on many healthcare systems across the world, efficient screening of suspected COVID-19 patients and their subsequent isolation is of paramount importance to mitigate the further spread of the virus. Presently, the accepted gold standard for patient screening is reverse transcriptase-polymerase chain reaction (RT-PCR), where the presence of COVID-19 is inferred from the analysis of respiratory samples 2 . Despite its success, RT-PCR is a highly involved manual process with slow turnaround times, with results becoming available up to several days after the test is performed. Furthermore, its variable sensitivity, lack of standardized reporting, and widely ranging total positive rate [3][4][5] call for alternative screening methods.
Chest radiography imaging (such as X-ray or computed tomography (CT) imaging) has gained traction as a powerful alternative, where the diagnosis is administered by expert radiologists who analyze the resulting images and infer the presence of COVID-19 through subtle visual cues [6][7][8][9][10] . Of the two imaging methods studied, X-ray imaging has distinct advantages with regard to accessibility, availability, and rate of testing 11 . Furthermore, portable X-ray imaging systems do not require patient transportation or physical contact between healthcare professionals and suspected infected individuals, thus allowing for efficient virus isolation and a safer testing methodology. Despite its obvious promise, the main challenge facing radiography examination is the scarcity of trained experts who could conduct the analysis at a time when the number of possible patients continues to rise. As such, a computer system that could accurately analyze and interpret chest X-ray images could significantly alleviate the burden placed on expert radiologists and further streamline patient care. Image identification techniques are readily adopted in Artificial Intelligence (AI) and could prove to be a powerful solution to the problem at hand. Despite recent progress in the development of AI algorithms [12][13][14][15] , one of the fundamental issues facing the development of a robust solution is the scarcity of publicly available data. We extend upon existing works by combining various publicly available data sources and carefully annotating the images across three classes: normal, pneumonia, and COVID-19. The data is then divided into training, validation, and testing subsets with an 8:1:1 split, with a strict class balance maintained across all sets.
Deep learning models, such as convolutional neural networks (CNNs), have gained traction in the field of medical imaging 16 , and here we train 10 promising CNNs for the purpose of COVID-19 classification in chest X-ray images. To assist the models, we utilize a purpose-built extraction mask as part of a three-stage procedure. The mask accurately extracts the lung areas from the chest X-rays (CXRs), with the resulting images fed into one of the CNNs. To better quantify the performance of our proposed framework, we benchmark our results against the recently developed COVID-Net models 12 . To ensure consistency, we use our dataset to produce predictions across an array of different COVID-Net models.
The structure of the rest of this paper is as follows. Section 2 summarizes the data collected from the 5 most relevant datasets. Section 3 describes the proposed three-stage workflow using a guided attention mechanism. Section 4 presents the results obtained during all stages, further improvements of the proposed workflow, its advantages over other models, and possible implementations. Section 5 synthesizes the key points of the developed model based on the guided attention mechanism.
Since the datasets use different labels for their findings, we applied the following mapping. We assigned viral and bacterial pneumonias to the "Pneumonia" label; SARS, MERS-CoV, COVID-19, and COVID-19 (ARDS) to the "COVID-19" label; and "no findings" and "normal" diagnoses to the "Normal" label. Table 1 summarizes statistical information on the study dataset. During the second stage, we chose the 4 most accurate networks, which were then fine-tuned. In the process of fine-tuning, both the feature extractor and the classifier were trained. In the third stage, the networks were trained with a guided attention mechanism. This mechanism is based on the U-net segmentation network, whose output is used to focus the classifier on the lung area of an image. Besides direct guidance by U-net, the network is additionally trained with indirect supervision through the application of Grad-CAM. Indirect supervision is used in the training process since Grad-CAM's attention heatmaps reflect the areas of an input image supporting the network's prediction. In this way, the prediction is based on the areas on which we expect the network to focus, while indirect supervision forces the networks to attend to the desired object in the image rather than its other parts. The training workflow of the model is shown in Fig. 1 below. All three stages are described in Sect. 3.1 and 3.2 in more detail.
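The label mapping above can be sketched as a simple lookup table; the raw label strings below are assumptions based on the text, since the exact strings used by each source dataset are not listed in the paper.

```python
# Map source-dataset findings onto the three study classes.
# The raw keys are hypothetical; real datasets may use other spellings.
LABEL_MAP = {
    "viral pneumonia": "Pneumonia",
    "bacterial pneumonia": "Pneumonia",
    "sars": "COVID-19",
    "mers-cov": "COVID-19",
    "covid-19": "COVID-19",
    "covid-19 (ards)": "COVID-19",
    "no findings": "Normal",
    "normal": "Normal",
}

def map_label(raw: str) -> str:
    """Normalize a raw finding string and map it to a study class."""
    return LABEL_MAP[raw.strip().lower()]
```

A consistent normalization step (strip and lowercase) avoids silently creating duplicate classes when sources differ only in capitalization.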
It should be noted that different COVID-Net models 12 are also considered in this study. To date, COVID-Net models are the state-of-the-art models for distinguishing COVID-19 and pneumonia cases. All COVID-Net models are abbreviated as CXR in the remainder of the paper.

Stage I and Stage II
As mentioned above, we chose 10 deep learning networks in order to find out which network architectures are most effective at recognizing COVID-19 and pneumonia. The networks vary in the number of weights, architecture topology, the way they process data, etc. Additionally, CXR models are used for comparison purposes. Table 2 summarizes information about the networks used in the first stage. It is important to note that in the first stage only the classification heads were trained, with the body weights frozen. According to the results of the hyperparameter tuning procedure, the SGD gradient descent optimizer with a learning rate of 10^-4 proved to be optimal. Having trained several state-of-the-art networks, we found that most of them diverged. To address this, L2-regularization with λ = 0.001 was applied to all trained networks. All networks were trained with a batch size of 32. To avoid overfitting during network training, we applied Early Stopping regularization monitoring the validation loss with a patience of 10 epochs. For training the networks in both the first and second stages we used the cross-entropy loss, calculated as follows:

L_{CE} = -\sum_{i=1}^{C} y_i \log(p_i + \varepsilon),

where C is the number of classes (3 in our study), p_i is the predicted probability, y_i is the ground-truth label (ternary indicator), and ε is a small positive constant.
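The cross-entropy loss above can be written directly from its definition; this is a minimal NumPy sketch for a single sample, not the framework loss the authors actually used.

```python
import numpy as np

def cross_entropy(y, p, eps=1e-7):
    """Categorical cross-entropy for one sample.

    y   -- one-hot ground-truth vector (ternary indicator over C classes)
    p   -- predicted probability vector of the same length
    eps -- small positive constant, keeping log() finite when p_i = 0
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return -np.sum(y * np.log(p + eps))
```

With a perfect prediction the loss is (numerically) zero; it grows as probability mass moves away from the true class, which is what drives the gradient updates during Stages I and II.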
For training and testing the networks during the first stage, the dataset was split in an 8:1:1 ratio, i.e. the training subset includes 2122 images (80.7%), the validation subset 242 images (9.2%), and the testing subset 267 images (10.1%). The split of data within the training, validation, and testing phases was performed according to the distribution shown in Table 3.
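A per-class 8:1:1 split as described above (maintaining the class balance within each subset) can be sketched as follows; this is our own illustration of the strategy, not the authors' code.

```python
import random

def stratified_split(items, labels, seed=0):
    """Split (items, labels) 8:1:1 into train/val/test, preserving
    the ratio within each class by splitting each class separately."""
    by_class = {}
    for x, y in zip(items, labels):
        by_class.setdefault(y, []).append(x)

    rng = random.Random(seed)  # fixed seed for a reproducible split
    train, val, test = [], [], []
    for y, xs in by_class.items():
        rng.shuffle(xs)
        n_train = int(0.8 * len(xs))
        n_val = int(0.1 * len(xs))
        train += [(x, y) for x in xs[:n_train]]
        val += [(x, y) for x in xs[n_train:n_train + n_val]]
        test += [(x, y) for x in xs[n_train + n_val:]]
    return train, val, test
```

Rounding per class explains why the paper's actual proportions (80.7/9.2/10.1%) deviate slightly from an exact 80/10/10.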

Stage III
Once the performance and accuracy metrics of all networks were estimated, the 4 networks that showed the best results in the first stage were chosen for fine-tuning. Besides training both the bodies and heads of the networks, we introduced a guided attention mechanism for the considered networks. We were inspired by 34 , where the authors proposed a framework that provides guidance on the attention maps generated by a weakly supervised deep learning neural network. The attention block in our pipeline is based on U-net 35 . As shown in Fig. 1, the proposed algorithm applies segmentation masks to the features of the network body (feature extractor) using multiplication. Applying the attention block to the output feature vector of the network's backbone allows the networks to put more weight on the features that are more relevant for distinguishing the different classes. Additionally, during this stage we applied attention maps obtained with the help of the Grad-CAM technique 36 . Furthermore, the loss differs from the one used in Stage I and Stage II and is calculated as follows:

L_{total} = L_{clas} + \alpha L_{attn},

where L_clas is the cross-entropy loss, L_attn is the attention loss, and α is the coefficient used to scale the attention component within the total loss. L_attn is calculated according to Eq. (5) in 34 .
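The two operations described above can be sketched compactly; this is our own illustration under stated assumptions, with `alpha=0.5` as a placeholder value (the paper does not report the coefficient here).

```python
import numpy as np

def guided_attention_features(features, mask):
    """Apply a U-net lung mask to backbone feature maps by elementwise
    multiplication, as in the attention block described above.

    features -- (H, W, K) feature maps from the network body
    mask     -- (H, W) segmentation mask with values in [0, 1]
    """
    return features * mask[..., None]  # broadcast mask over channels

def total_loss(l_clas, l_attn, alpha=0.5):
    """Combined Stage III objective: classification loss plus a
    scaled attention term (alpha is a hypothetical value)."""
    return l_clas + alpha * l_attn
```

Multiplying features by the mask zeroes out activations outside the lung area, so the classification head can only rely on evidence inside the region the segmentation network deems relevant.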
To correctly apply U-net in the guided attention mechanism, we trained this network on the lung segmentation task.The data used for the training of this network is taken from the V7 Labs repository 37 .
The segmentation dataset contains 6500 AP/PA chest X-ray images with pixel-level polygonal lung segmentations. Some examples of COVID-19-affected patients with segmented lung areas are shown in Fig. 2.

Visual model validation
While modern neural networks enable superior performance, their lack of decomposability into intuitive and understandable components makes them hard to interpret. In this regard, achieving model transparency is useful for explaining their predictions. One of the techniques currently used for model interpretation is the Class Activation Map (CAM) 38 . Though CAM is a good technique for demystifying the workings of CNNs, it suffers from some limitations. One drawback of CAM is that it requires the feature maps to directly precede the softmax layers, so it applies only to a particular kind of network architecture that performs global average pooling over convolutional maps immediately before prediction. Such architectures may achieve inferior accuracies compared to general networks on some tasks or may simply be inapplicable to new tasks. Deeper representations of a CNN capture higher-level visual constructs. Furthermore, convolutional layers naturally retain spatial information, which is lost in fully connected layers, so we expect the last convolutional layer to have the best tradeoff between high-level semantics and detailed spatial information. For these reasons, we decided to use another popular technique known as Grad-CAM. This model interpretation technique, published in 36 , aims to improve on the shortcomings of CAM and claims to be compatible with any kind of architecture. The technique does not require any modifications to the existing model architecture, which allows its application to any CNN-based architecture. Unlike CAM, Grad-CAM uses the gradient information flowing into the last convolutional layer of a CNN to understand the importance of each neuron for a decision of interest. Grad-CAM improves on its predecessor, providing better localization and clearer class-discriminative saliency maps.
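The core Grad-CAM computation can be written in a few lines once the last-layer activations and their gradients are available; obtaining those tensors is framework-specific, so this sketch takes them as inputs.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from last-conv-layer tensors.

    activations -- (H, W, K) feature maps of the last convolutional layer
    gradients   -- (H, W, K) gradients of the class score w.r.t. those maps

    Each feature map is weighted by the spatial average of its gradients
    (the global-average-pooling step), the weighted maps are summed, and
    ReLU keeps only regions that positively support the class.
    """
    weights = gradients.mean(axis=(0, 1))                 # alpha_k per channel
    cam = np.maximum((activations * weights).sum(axis=-1), 0.0)  # ReLU
    if cam.max() > 0:
        cam = cam / cam.max()                             # normalize to [0, 1]
    return cam
```

The resulting coarse map is typically upsampled to the input resolution and overlaid on the chest X-ray to produce the heatmaps discussed in the validation sections below.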

Stage I
Having trained the 10 neural networks, we found that 2 of them tend to diverge more than the others. This is likely connected with normalization layers: networks such as MobileNet V2 and VGG-16 do not have Batch/Instance/Layer/Group Normalization layers in their architecture. As a result, these networks start diverging (MobileNet V2) or hit a validation loss/accuracy plateau (VGG-16) after approximately 100 epochs. Popular regularization techniques such as Lasso Regression (L1 regularization), Ridge Regression (L2 regularization), ElasticNet (L1-L2 regularization), Dropout, and Early Stopping may help to avoid this problem. We therefore applied Ridge Regression, Dropout layers, and Early Stopping in our training pipeline. The remaining networks did not suffer from overfitting; however, they could not reach better validation loss/accuracy values. When a given model reached its best validation loss, we saved the associated model weights using a checkpoint callback. Figure 3 demonstrates how the networks were trained during the first stage. Blue asterisks reflect the best accuracy values on the validation subsets.
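The early-stopping behaviour used here (monitor validation loss, stop after 10 epochs without improvement) amounts to the following logic; this is a minimal re-implementation for illustration, not the framework callback actually used in the experiments.

```python
class EarlyStopping:
    """Stop training when validation loss has not improved for
    `patience` consecutive epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")  # best validation loss seen so far
        self.wait = 0             # epochs since the last improvement

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0         # improvement: also the point at which
            return False          # a checkpoint callback would save weights
        self.wait += 1
        return self.wait >= self.patience
```

Because weights are saved at each new best validation loss, the model kept at the end is the one from the best epoch, not the (possibly overfit) final epoch.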
Since the loss value is hard to interpret directly, we compared commonly used network metrics such as accuracy and F1-score. Table 4 and Table 5 summarize these metrics estimated in the first stage. As seen, MobileNet V2, EfficientNet B1, EfficientNet B3, and VGG-16 achieved better results than the other networks.

Stage II
Based on the results of the first stage, MobileNet V2, EfficientNet B1, EfficientNet B3, and VGG-16 demonstrated a much better ability than the other networks to distinguish COVID-19 and pneumonia on X-ray images. These networks were therefore chosen for fine-tuning. Additionally, we compared how the fine-tuned networks differ from the best networks of the first stage. The results of the models' performance are shown in Fig. 4, where blue asterisks reflect the best accuracy values on the validation subsets.
Having compared accuracy and F1-score values in the first (Table 4 and Table 5) and second stages (Table 6 and Table 7), we can state that MobileNet V2 and VGG-16 gained a larger boost in accuracy than the EfficientNet models. Once fine-tuning was performed, MobileNet V2 and VGG-16 achieved +6% and +9% accuracy changes on the validation subset and +1% and +4% accuracy changes on the testing subset, respectively.
On the other hand, EfficientNet B1 and EfficientNet B3 achieved +2% and +3% accuracy changes on the validation subset and -1% and +6% accuracy changes on the testing subset. It should also be noted that the largest boost in the classification of COVID-19 was achieved by VGG-16. This network gained an +11% boost, while MobileNet V2, EfficientNet B1, and EfficientNet B3 reached +2%, 0%, and +6%, respectively.

Model validation using Grad-CAM
As mentioned in Sect. 3.3, despite deep learning models having facilitated unprecedented accuracy in image classification, one of their biggest problems is model interpretability, which is a core component in understanding and debugging a model. We used the Grad-CAM technique to validate the models' correct and incorrect predictions and to verify which series of neurons activated in the forward pass during prediction. For the sake of visualization, we chose 3 patients with different findings: normal, pneumonia, and COVID-19. Source images of these findings with their ground truth (GT) heatmaps are shown in Fig. 7 and Fig. 8.
Using Grad-CAM, we validated where our 4 best networks (MobileNet V2, EfficientNet B1, EfficientNet B3, VGG-16) are looking, verifying that they activate around the correct patterns in the image. The Grad-CAM technique uses the gradients flowing into the final convolutional layer to produce a coarse localization heatmap highlighting the regions in the image that are important for predicting the target concept, i.e. COVID-19 or pneumonia areas. However, these heatmaps may differ from traditional localization outputs such as segmentation masks or bounding boxes. In this regard, the heatmaps are used only for approximate localization.
In order to interpret the models, Fig. 7 and Fig. 8 show the visualization of gradient class activation maps; additional cases of the networks' heatmaps are shown in Appendix B and Appendix C. Based on the obtained results, we may state that training the models using masks (Stage III) has a positive effect on the models' search for the correct patterns. Networks such as MobileNet V2 (Fig. 7c and Fig. 8c) and VGG-16 (Fig. 7f and Fig. 8f) identify the affected areas correctly, despite inaccuracies in the location of the heatmaps. On the other hand, interpretation of the EfficientNet networks showed that they do not activate around the proper patterns in the image. This allows us to assume that EfficientNet B1 and EfficientNet B3 have not properly learned the underlying patterns in our dataset and/or that we may need to collect additional data.

Conclusion
In this study, we demonstrated a training pipeline based on directed guidance for neural networks. This guidance forces the neural networks to pay attention to the areas identified by an external segmentation network.
Having trained a set of deep learning models, we found that the proposed pipeline allows for increased classification accuracy. This pipeline was used for the detection of COVID-19 and for distinguishing its presence from that of pneumonia. Among the obtained results, MobileNet V2 performed comparably to the tailor-made CXR model CXR-4A, despite being 15 times less complex. According to the performed experiments, the networks trained with the proposed pipeline perform comparably to practicing radiologists in the classification of multiple thoracic pathologies in chest X-ray radiographs. Our pipeline may have the potential to improve healthcare delivery and increase access to chest radiograph expertise for the detection of a variety of acute diseases.

Declarations

Figures
Visualization of network heatmaps for a pneumonia finding

Figure 1 Proposed

Figure 4 Accuracy

Figure 5 Comparison

Table 1 -
Statistical information on the dataset used in the study

Table 2 -
Description of the models used during the first stage

Using Amazon SageMaker 33 , we tuned a given model and found its best version through a series of training jobs run on the collected dataset. Having performed hyperparameter tuning based on a Bayesian optimization strategy, a set of hyperparameter values for the best-performing model was found, as measured by validation accuracy. The optimal architecture of the network head consists of the following layers:

Table 3 -
Description of the data distribution within training, validation, and testing subsets

Table 4 -
Performance metrics within different subsets obtained after the first stage

Table 5 -
Performance metrics within different classes obtained after the first stage

Table 6 -
Performance metrics within different subsets obtained after the second stage

Table 7 -
Performance metrics within different classes obtained after the second stage