The proposed workflow in this study is divided into three stages. During the first stage, 10 networks, namely MobileNet V2, DenseNet-121, EfficientNet B0, EfficientNet B1, EfficientNet B3, EfficientNet B5, VGG-16, ResNet-50 V2, Inception V3, and Inception ResNet V2, were trained on the prepared dataset; all of them are de facto industry standards in the field of deep learning. During the second stage, we chose the 4 most accurate networks, which were then fine-tuned. In the process of fine-tuning, both the feature extractor and the classifier were trained. In the third stage, the networks were trained with a guided attention mechanism. This mechanism relies on a U-net segmentation network whose output is used to focus the classifier on the lung area of an image. Besides direct guidance by U-net, the networks are additionally trained with indirect supervision through the application of Grad-CAM. Indirect supervision is used in the training process because Grad-CAM attention heatmaps reflect the areas of an input image that support the network's prediction. As a result, the prediction is based on the areas we expect the network to focus on, while indirect supervision forces the networks to focus on the desired object in the image rather than on its other parts. The training workflow of the model is shown in Fig. 1 below. All three stages are described in more detail in Sect. 3.1 and 3.2.
It should be noted that different COVID-Net models [12] are also considered in this study. To date, COVID-Net models are among the state-of-the-art models for distinguishing COVID-19 and pneumonia cases. All COVID-Net models are abbreviated as CXR further in the paper.
3.1. Stage I and Stage II
As mentioned above, we chose 10 deep learning networks in order to find out which network architectures are most effective in recognizing COVID-19 and pneumonia. The networks differ in the number of parameters, architecture topology, the way they process data, etc. Additionally, the CXR models are used for comparison purposes. Table 2 summarizes information about the networks used in the first stage.
Table 2 – Description of the models used during the first stage

Model | Size of an input image | Size of an output feature vector | Parameters, millions | Size, MB | Reference
MobileNet V2 | 224x224 | 7x7x1280 | 2.6 | 14 | [25]
DenseNet-121 | 224x224 | 7x7x1024 | 7.2 | 33 | [26]
EfficientNet B0 | 224x224 | 7x7x1280 | 4.2 | 29 | [27]
EfficientNet B1 | 240x240 | 8x8x1280 | 6.7 | 31 | [27]
EfficientNet B3 | 300x300 | 10x10x1536 | 11.0 | 48 | [27]
EfficientNet B5 | 456x456 | 15x15x2048 | 28.8 | 75 | [27]
VGG-16 | 224x224 | 7x7x512 | 14.9 | 528 | [28]
ResNet-50 V2 | 224x224 | 7x7x2048 | 25.6 | 98 | [29]
Inception V3 | 299x299 | 8x8x2048 | 22.1 | 92 | [30]
Inception ResNet V2 | 299x299 | 5x5x1536 | 54.5 | 215 | [31]
CXR Small | 224x224 | 7x7x2048 | 117.4 | 1448 | [32]
CXR Large | 224x224 | 7x7x2048 | 127.4 | 1486 | [32]
CXR-3A | 480x480 | 13x13x1536 | 40.2 | 617 | [32]
CXR-3B | 480x480 | 15x15x2048 | 11.7 | 293 | [32]
CXR-3C | 480x480 | 15x15x2048 | 9.2 | 210 | [32]
CXR-4A | 480x480 | 13x13x1536 | 40.2 | 617 | [32]
CXR-4B | 480x480 | 15x15x2048 | 11.7 | 293 | [32]
CXR-4C | 480x480 | 15x15x2048 | 9.2 | 210 | [32]
To train the abovementioned networks, we used the bodies of these networks with their ImageNet weights frozen. Using Amazon SageMaker [33], we tuned a given model and found its best version through a series of training jobs run on the collected dataset. Having performed hyperparameter tuning based on a Bayesian optimization strategy, we found the set of hyperparameter values for the best-performing model, as measured by validation accuracy. The optimal architecture of the network head consists of the following layers (a minimal sketch of this head is given after the list):
- Global Average Pooling layer;
- Densely-connected layer with 128 neurons and ELU activation;
- Dropout layer with dropout rate equal to 0.10;
- Densely-connected layer with 64 neurons and ELU activation;
- Dropout layer with dropout rate equal to 0.05;
- Densely-connected layer with 3 neurons;
- Softmax activation layer.
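The head described above, placed on top of a frozen ImageNet backbone, could look as follows in TensorFlow/Keras. This is a minimal sketch rather than the exact training code used in the study; the choice of MobileNet V2 as the backbone and the 224x224 input size are illustrative assumptions.

```python
import tensorflow as tf

# Illustrative backbone; any of the architectures from Table 2 could be used instead.
backbone = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3))
backbone.trainable = False  # Stage I: the body (feature extractor) stays frozen

# Classification head found by hyperparameter tuning (see the list above).
model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="elu"),
    tf.keras.layers.Dropout(0.10),
    tf.keras.layers.Dense(64, activation="elu"),
    tf.keras.layers.Dropout(0.05),
    tf.keras.layers.Dense(3),        # three classes: normal, pneumonia, COVID-19
    tf.keras.layers.Softmax(),
])
```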
It is important to note that during the first stage only the classification heads were trained, with the body weights kept frozen. According to the results of the hyperparameter tuning procedure, the SGD optimizer with a learning rate of $10^{-4}$ proved to be optimal. Having trained several state-of-the-art networks, we found that most of them diverged. For this reason, L2-regularization with λ = 0.001 was applied to all trained networks. All networks were trained with a batch size of 32. To avoid overfitting during training, we applied early stopping monitoring the validation loss with a patience of 10 epochs. For training the networks in both the first and second stages, we used the cross-entropy loss, calculated as follows:

$$L_{clas} = -\sum_{i=1}^{C} y_i \log(p_i + \varepsilon),$$

where C is the number of classes (3 in our study), $p_i$ is the predicted probability, $y_i$ is the ground-truth label (ternary indicator), and ε is a small positive constant.
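For illustration, the optimizer, regularization, and early-stopping settings described above could be wired together as in the following sketch (again assuming TensorFlow/Keras; model, train_ds, and val_ds refer to the hypothetical head model and data pipelines from the previous sketch, and the number of epochs is an arbitrary upper bound):

```python
import tensorflow as tf

# L2 regularization (lambda = 0.001) is attached to the trainable Dense layers,
# e.g. tf.keras.layers.Dense(128, activation="elu",
#                            kernel_regularizer=tf.keras.regularizers.l2(0.001)).

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),  # learning rate found by tuning
    loss=tf.keras.losses.CategoricalCrossentropy(),         # cross-entropy; probabilities clipped internally
    metrics=["accuracy"],
)

# Early stopping monitors the validation loss with a patience of 10 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# train_ds and val_ds are assumed tf.data pipelines yielding (image, one-hot label)
# batches of size 32.
model.fit(train_ds, validation_data=val_ds, epochs=100, callbacks=[early_stop])
```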
For training and testing the networks during the first stage, the dataset was split in an 8:1:1 ratio, i.e. the training subset includes 2122 images (80.7%), the validation subset 242 images (9.2%), and the testing subset 267 images (10.1%). The split of the data into the training, validation, and testing subsets was performed according to the distribution shown in Table 3 (a sketch of such a split is given after the table).
Table 3 – Description of the data distribution within the training, validation, and testing subsets

Dataset | Diagnosis | Training | Validation | Testing
CCXRD | Normal | 14 | 2 | 2
CCXRD | Pneumonia | 133 | 15 | 17
CCXRD | COVID-19 | 407 | 46 | 51
ACCXRD | Normal | 102 | 12 | 13
ACCXRD | Pneumonia | 0 | 0 | 0
ACCXRD | COVID-19 | 46 | 6 | 6
FCCXRD | Normal | 1 | 1 | 1
FCCXRD | Pneumonia | 0 | 1 | 1
FCCXRD | COVID-19 | 27 | 4 | 4
CRD | Normal | 0 | 0 | 0
CRD | Pneumonia | 0 | 0 | 0
CRD | COVID-19 | 177 | 20 | 22
RSNAPDD | Normal | 567 | 63 | 70
RSNAPDD | Pneumonia | 648 | 72 | 80
RSNAPDD | COVID-19 | 0 | 0 | 0
Total | | 2122 (80.7%) | 242 (9.2%) | 267 (10.1%)
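As an illustration, a stratified 8:1:1 split of this kind could be produced as sketched below; image_paths and labels are hypothetical Python lists of file paths and class names gathered from the five source datasets, and the exact splitting procedure used in the study may differ.

```python
from sklearn.model_selection import train_test_split

# image_paths and labels are assumed lists of file paths and class names
# ("normal", "pneumonia", "covid-19") collected from the five source datasets.
train_paths, rest_paths, train_labels, rest_labels = train_test_split(
    image_paths, labels, test_size=0.2, stratify=labels, random_state=42)

# Split the remaining 20% evenly into validation and testing subsets (10% each overall).
val_paths, test_paths, val_labels, test_labels = train_test_split(
    rest_paths, rest_labels, test_size=0.5, stratify=rest_labels, random_state=42)
```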
3.2. Stage III
Once the performance and accuracy metrics of all networks were estimated, the 4 networks that showed the best results in the first stage were chosen for fine-tuning. Besides training both the bodies and heads of these networks, we introduced a guided attention mechanism. We were inspired by [34], where the authors proposed a framework that provides guidance on the attention maps generated by a weakly supervised deep neural network. The attention block in our pipeline is based on U-net [35]. As shown in Fig. 1, the proposed algorithm applies segmentation masks to the features of the network body (feature extractor) using multiplication. Applying the attention block to the output feature vector of the network's backbone allows the networks to put more weight on the features that are most relevant for distinguishing the different classes. Additionally, during this stage we applied attention maps obtained with the help of the Grad-CAM technique [36]. Furthermore, the loss differs from the one used in Stage I and Stage II and is calculated as follows:
$$L = L_{clas} + \alpha L_{attn},$$

where $L_{clas}$ is the cross-entropy loss, $L_{attn}$ is the attention loss, and α is the coefficient that scales the attention component of the total loss. $L_{attn}$ is calculated according to Eq. (5) in [34].
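A rough TensorFlow/Keras sketch of the attention multiplication and the combined loss is given below; apply_guided_attention, total_loss, and the value of α are hypothetical names and assumptions, and the attention term is only stubbed since its exact form follows Eq. (5) of [34].

```python
import tensorflow as tf

def apply_guided_attention(features, lung_mask):
    """Multiply backbone feature maps by a (resized) U-net lung mask.

    features:  backbone output, shape (batch, h, w, c)
    lung_mask: U-net segmentation output, shape (batch, H, W, 1), values in [0, 1]
    """
    h, w = features.shape[1], features.shape[2]
    mask = tf.image.resize(lung_mask, (h, w))  # match the spatial size of the features
    return features * mask                     # element-wise attention weighting

def total_loss(y_true, y_pred, attention_loss, alpha=0.5):
    """Combined loss: cross-entropy plus a scaled attention term.

    alpha is an assumed value; attention_loss stands in for the L_attn term of
    Eq. (5) in [34], computed from the Grad-CAM heatmaps and the lung masks.
    """
    clas = tf.keras.losses.categorical_crossentropy(y_true, y_pred)
    return clas + alpha * attention_loss
```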
To correctly apply U-net in the guided attention mechanism, we trained this network on a lung segmentation task. The data used for training this network were taken from the V7 Labs repository [37]. The segmentation dataset contains 6500 AP/PA chest X-ray images with pixel-level polygonal lung segmentations. Some examples of COVID-19-affected patients with segmented lung areas are shown in Fig. 2.
3.3. Visual model validation
While modern neural networks enable superior performance, their lack of decomposability into intuitive and understandable components makes them hard to interpret. In this regard, achieving model transparency is useful for explaining their predictions. Nowadays, one of the techniques used for model interpretation is the Class Activation Map (CAM) [38]. Though CAM is a good technique for demystifying the workings of CNNs, it suffers from some limitations. One drawback of CAM is that it requires feature maps to directly precede the softmax layer, so it applies only to a particular kind of network architecture that performs global average pooling over convolutional maps immediately before prediction. Such architectures may achieve inferior accuracy compared to general networks on some tasks or simply be inapplicable to new tasks. De facto, the deeper representations of a CNN capture higher-level constructs. Furthermore, CNNs naturally retain spatial information that is lost in fully connected layers, so we expect the last convolutional layer to offer the best trade-off between high-level semantics and detailed spatial information. For this reason, we decided to use another popular technique known as Grad-CAM. This model interpretation technique, published in [36], aims to address the shortcomings of CAM and claims to be compatible with any kind of architecture. The technique does not require any modification of the existing model architecture, which allows it to be applied to any CNN-based architecture. Unlike CAM, Grad-CAM uses the gradient information flowing into the last convolutional layer of a CNN to assess the importance of each neuron for a decision of interest. Grad-CAM improves on its predecessor, providing better localization and clearer class-discriminative saliency maps.
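As a rough sketch of the Grad-CAM computation described above (assuming a Keras functional model in which the last convolutional layer can be retrieved by name; this is not the exact implementation used in the study):

```python
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index):
    """Compute a Grad-CAM heatmap for one image and one class.

    image: tensor of shape (1, H, W, 3); last_conv_layer_name is assumed to be known.
    """
    # Model mapping the input to the last convolutional feature maps and the predictions.
    grad_model = tf.keras.Model(
        model.inputs,
        [model.get_layer(last_conv_layer_name).output, model.output])

    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image)
        class_score = preds[:, class_index]

    # Gradients of the class score with respect to the feature maps, averaged over
    # the spatial dimensions to obtain one importance weight per channel.
    grads = tape.gradient(class_score, conv_maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))       # shape (1, channels)

    # Weighted sum of the feature maps followed by ReLU and normalization.
    cam = tf.einsum("bhwc,bc->bhw", conv_maps, weights)
    cam = tf.nn.relu(cam)
    cam = cam / (tf.reduce_max(cam) + 1e-8)
    return cam[0]                                      # heatmap of shape (h, w)
```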