In this section, we present the dataset used in this research and the most important experiments, illustrated with performance curves that show the stability of the model's performance during training and validation.
4.1. Dataset
The dataset proposed by Kermany et al. [17] is used to evaluate the proposed approach. It consists of 5856 CXR images divided into two classes: pneumonia-infected images and normal (not infected) images. The images belong to children aged between 1 and 5 years and were classified by specialists at the Guangzhou Women and Children's Medical Center [42]. The original dataset is divided into 79% for training, 19% for testing, and 2% for validation; however, we re-split it into 75% for training and 25% for validation, with the same number of images used for prediction. Table 3 shows the distribution of the original dataset in detail.
Table 3
The original distribution of the proposed dataset
| | Infected | Normal | Total |
| --- | --- | --- | --- |
| Training | 3875 | 1341 | 5216 |
| Testing | 390 | 234 | 624 |
| Validation | 8 | 8 | 16 |
| Total | 4273 | 1583 | 5856 |
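For illustration, the 75% / 25% re-split described above can be performed as a stratified split over the pooled image files. This is a minimal sketch; the directory layout, file extension, and random seed are assumptions rather than details taken from the paper.

```python
import os
from glob import glob
from sklearn.model_selection import train_test_split

# Hypothetical layout: chest_xray/<subset>/<class>/<image>.jpeg
root = "chest_xray"
paths, labels = [], []
for cls in ("PNEUMONIA", "NORMAL"):
    files = glob(os.path.join(root, "**", cls, "*.jpeg"), recursive=True)
    paths += files
    labels += [cls] * len(files)

# Stratified split keeps the infected/normal ratio identical in both subsets.
train_paths, val_paths, train_labels, val_labels = train_test_split(
    paths, labels, test_size=0.25, stratify=labels, random_state=42
)
print(len(train_paths), len(val_paths))  # 4392 / 1464 for all 5856 images
```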
4.2. Evaluation Metrics
The performance of the proposed model is evaluated using five metrics: accuracy, precision, recall, AUC, and loss. AUC is the area under the ROC curve, which relates the true-positive rate to the false-positive rate; it is commonly used in binary classification models and gives equal importance to the correct classification of both classes.
Binary cross-entropy is used as the loss function to measure the difference between the estimated probability and the actual probability [43]. The first four metrics are formulated in Formulae 5–8.
$$Accuracy =\frac{TP + TN }{TP + FN + FP + TN} \left(5\right)$$
$$Precision = \frac{TP}{TP + FP} \left(6\right)$$
$$Recall = \frac{TP}{TP + FN} \left(7\right)$$
$$AUC = \int_{0}^{1} \frac{TP}{P}\, d\left(\frac{FP}{N}\right) = \frac{1}{P\,N} \int_{0}^{N} TP\, dFP \left(8\right)$$
where TP, FN, FP, and TN refer to True Positive, False Negative, False Positive, and True Negative, respectively, and P and N denote the total numbers of positive and negative samples. TP is the number of infected images correctly classified as infected; FN is the number of infected images incorrectly classified as normal; FP is the number of normal images incorrectly classified as infected; and TN is the number of normal images correctly classified as normal.
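For a concrete illustration, the five metrics can be computed from a confusion matrix. The following sketch uses scikit-learn; the toy arrays `y_true` and `y_score` are assumptions for demonstration and do not come from the paper's experiments.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, log_loss

# Toy labels (1 = infected, 0 = normal) and predicted probabilities.
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 1])
y_score = np.array([0.9, 0.8, 0.3, 0.2, 0.6, 0.7, 0.1, 0.95])
y_pred = (y_score >= 0.5).astype(int)  # threshold the sigmoid output

# sklearn returns the matrix as [[TN, FP], [FN, TP]] for labels {0, 1}.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = (tp + tn) / (tp + fn + fp + tn)   # Eq. (5)
precision = tp / (tp + fp)                   # Eq. (6)
recall = tp / (tp + fn)                      # Eq. (7)
auc = roc_auc_score(y_true, y_score)         # Eq. (8), area under the ROC curve
bce = log_loss(y_true, y_score)              # binary cross-entropy loss [43]
print(f"acc={accuracy:.2f} prec={precision:.2f} rec={recall:.2f} "
      f"auc={auc:.2f} loss={bce:.2f}")
```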
4.3. Experiments
Three Python environments are used in this study: Jupyter Notebook, used to redistribute the dataset; Kaggle, used for the implementation; and Spyder, used for re-sampling.
Figures 7 and 8 show the ratios of the two image classes during training and validation after redistribution; note that the imbalanced data distribution remains.
In the third main stage of this research, data augmentation is used to obtain balanced ratios of images from both classes during the training and validation stages, as shown in Figs. 9 and 10, and as sketched below.
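A minimal sketch of how such augmentation could be set up with the Keras `ImageDataGenerator`; the specific transformations, image size, and directory path are assumptions, since the paper does not list them. Balanced class ratios are obtained by writing extra augmented copies of the minority (normal) class into the training directory beforehand.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Assumed augmentation settings; the paper does not specify the transforms.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
)

# Batches flow from the re-sampled (balanced) training directory.
train_flow = train_gen.flow_from_directory(
    "data/train_balanced",   # hypothetical path
    target_size=(224, 224),  # assumed input size
    batch_size=24,           # batch size from Table 4
    class_mode="binary",
)
```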
A different number of layers is used for the two feature-extraction models depending on the optimizer: 85 layers of the pre-activation ResNet and DenseNet169 are used with the Adam optimizer, while 37 layers are used with the SGD optimizer. For the classification step, an ANN is used with one hidden layer containing 450 neurons and a ReLU activation function, and one output layer with a single neuron and a sigmoid activation function, as sketched below. This design is used in each of the three main stages: the original data, class weighting, and re-sampling.
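A sketch of the described classification head on top of a DenseNet169 feature extractor. Only the 450-neuron ReLU hidden layer and the single sigmoid output neuron come from the text; the input size, ImageNet weights, pooling, and the omission of the pre-activation ResNet branch are simplifying assumptions.

```python
from tensorflow.keras import Model, layers
from tensorflow.keras.applications import DenseNet169

# Feature extractor: DenseNet169 without its top classifier (assumed config).
base = DenseNet169(include_top=False, weights="imagenet",
                   input_shape=(224, 224, 3), pooling="avg")

# ANN classifier: one hidden layer (450 neurons, ReLU) and one sigmoid output.
x = layers.Dense(450, activation="relu")(base.output)
out = layers.Dense(1, activation="sigmoid")(x)
model = Model(inputs=base.input, outputs=out)
```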
Parameter setting is one of the most sensitive steps and directly affects the model's results; it is used to reduce network error. The parameter values are chosen experimentally for all experiments, as presented in Table 4, and depend on the optimizer used. For instance, a learning rate of 0.0001 is used with the Adam optimizer for updating the network weights, together with 30 epochs, where an epoch denotes one pass of the training dataset through the network in batches, after which the network parameters are updated. The default number of steps per epoch equals the number of samples in the training dataset divided by the batch size, and the default number of validation steps equals the number of samples in the testing dataset divided by the batch size. With the SGD optimizer, L2 regularization is used to overcome the overfitting problem, the learning rate is changed to 0.00001, and the number of epochs is changed to 40, each epoch representing one forward and backward pass over the training and validation data. Dropout regularization is used in the re-sampling stage instead of L2 regularization; a training sketch follows.
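These hyper-parameters map onto a standard Keras training call as sketched below for the SGD configuration. The names `base`, `train_flow`, `val_flow`, `num_train_samples`, and `num_test_samples` (reused from the earlier sketches) and the L2 factor are assumptions; the learning rate, epoch count, loss, and step formulas follow the text and Table 4.

```python
from tensorflow.keras import Model, layers, metrics
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.regularizers import l2

# Rebuild the head with L2 regularization on the hidden layer
# (regularization factor assumed; the paper does not state it).
x = layers.Dense(450, activation="relu",
                 kernel_regularizer=l2(1e-4))(base.output)
out = layers.Dense(1, activation="sigmoid")(x)
model = Model(inputs=base.input, outputs=out)

BATCH_SIZE = 24
model.compile(
    optimizer=SGD(learning_rate=1e-5),            # SGD learning rate from Table 4
    loss="binary_crossentropy",                   # loss from Section 4.2
    metrics=["accuracy", metrics.Precision(),
             metrics.Recall(), metrics.AUC()],
)
history = model.fit(
    train_flow,
    steps_per_epoch=num_train_samples // BATCH_SIZE,  # training samples / batch size
    validation_data=val_flow,
    validation_steps=num_test_samples // BATCH_SIZE,  # testing samples / batch size
    epochs=40,                                        # 40 epochs for SGD
)
```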
4.4. Experiment Results
In this section, we present the results and the performance curves that express how the model behaves during the training and validation phases. In each of the three main stages (original distribution, class weight, and re-sampling), the results are divided into sub-stages based on the optimizer used and other additional criteria. The metrics used for evaluation are precision, recall, accuracy, loss, and AUC. In all curves, the blue curve represents the model's performance during training and the orange curve its performance during validation.
Table 4
Training Hyper-parameters
| Parameter | Value |
| --- | --- |
| Batch size | 24 |
| Optimizer | Adam, SGD |
| Learning rate | 0.0001 (Adam) and 0.00001 (SGD) |
| Number of epochs | 30 (Adam) and 40 (SGD) |
| Steps per epoch | Number of training samples / batch size |
| Validation steps | Number of testing samples / batch size |
| Regularization method | L2, Dropout |
The model's performance was highly fluctuating and unstable, especially during the validation phase, when using the Adam optimizer with the data in its original distribution, as illustrated in Fig. 11.
When using the SGD optimizer and reducing the complexity of the model, still with the data in its original distribution, the validation curves became less jagged, as shown in Fig. 12.
The performance was still poor, and even after several experiments the results did not improve, so we first increased the proportion of validation data from 19% to 25%. Since the data is imbalanced, two methods are applied to overcome this problem: class weighting and re-sampling.
When using the Adam optimizer with class weighting, the model's validation curves express an inability to generalize to all the validation data, as shown by the highly volatile orange curve in Fig. 13. Despite the excellent final values for all metrics in the validation phase (a recall of 99%, a loss of 0.01, and 97% for the remaining metrics), the performance curves indicate that this performance is not guaranteed. This prompted us to implement another optimizer, SGD.
After applying the SGD optimizer with the class-weight technique, the curves became significantly more stable for all metrics (Fig. 14). The run ended with 0.97 for precision and AUC and a loss of 0.24, while performance was more modest for accuracy (0.89) and recall (0.88).
The metric values improved when the weight of the least available class in the data, the normal class, was set to 1.5 against 1.0 for the most available class. Although the curves are less stable than in the previous stage, as shown in Fig. 15, the numerical results are slightly better: precision reached 98%, AUC remained at 0.97, recall remained at 0.88, accuracy improved by 0.01 to 90%, and the loss dropped to 0.23. A sketch of this weighting follows.
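For reference, a sketch of how the two weighting schemes could be passed to Keras. The class-index mapping (0 = normal, 1 = infected, following alphabetical directory order) is an assumption, as are the reused names `model`, `train_flow`, and `val_flow`.

```python
# Equal class weights vs. different class weights (normal weighted 1.5 vs. 1.0).
equal_weights = {0: 1.0, 1: 1.0}       # 0 = normal, 1 = infected (assumed)
different_weights = {0: 1.5, 1: 1.0}   # minority class up-weighted to 1.5

# Keras multiplies each sample's loss by the weight of its class.
history = model.fit(
    train_flow,
    validation_data=val_flow,
    epochs=40,
    class_weight=different_weights,
)
```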
We found that the model's performance after producing balanced image ratios through data augmentation was poor, especially when using the Adam optimizer with the L2 regularization method: all validation curves were very jagged and unstable until the end (Fig. 16).
After using the SGD optimizer, reducing the number of model layers, and applying a dropout rate of 0.8 (chosen experimentally) to the balanced data, the curves became more stable and less fluctuating, although a small gap remained between the training and validation curves, as illustrated in Fig. 17.
We reduced this gap by increasing the dropout rate to 0.9; the improvement is evident in Fig. 18, and the corresponding head is sketched below.
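A sketch of the re-sampling-stage head, where dropout replaces L2 regularization; placing the dropout layer before the hidden layer, and reusing the `base` extractor from the earlier sketch, are our assumptions.

```python
from tensorflow.keras import Model, layers

# Dropout regularization instead of L2 (rate 0.8, later increased to 0.9).
x = layers.Dropout(0.9)(base.output)
x = layers.Dense(450, activation="relu")(x)
out = layers.Dense(1, activation="sigmoid")(x)
model = Model(inputs=base.input, outputs=out)
```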
4.5. Comparison results from all experiments
The experiments in this study are divided into three main stages. The first stage trains the model on the data with its original distribution, while the remaining two stages deal with the imbalanced data: the class-weighting stage and the re-sampling stage using data augmentation. The goal of this study is to reach a performance that is good both numerically and behaviorally, as reflected in the performance curves, so that the model is as stable as possible. The results are presented in detail through the performance curves in the previous subsection, and Table 5 summarizes the results of each experiment.
The model's performance was poor and unstable when trained on the data with its original distribution, and Table 5 shows poor numerical results as well, so there was a need to increase the proportion of data in the validation stage and to use techniques for dealing with the imbalanced distribution. The goal is a trade-off that combines the stability of the performance curve with the numerical results when choosing the best configuration. Despite the high values of the class-weight experiment with the Adam optimizer, its performance curves were zigzag, which means the model's performance is not guaranteed. Therefore, the results of the different-class-weight stage with the SGD optimizer are nominated, as they combine good performance with an acceptable shape of the performance curves. We use these results in the comparison with related work in the next section.
4.6. Comparison with other approaches
The results of the proposed approach are compared with those of nine previous approaches that use the same dataset proposed in [17]. These results are presented in Table 6, Fig. 19, and Fig. 20. The proposed approach achieves promising results compared to the other approaches, especially for the precision metric. In terms of accuracy, it achieves better results than many previous approaches, including ResNet-34 [15]; MobileNetV2, CNN, and LSTM from [29]; the pre-trained Xception model [20]; ResNet-50 [24]; and Xception, VGG19, and CNN from [37]. The difference is also clear in terms of AUC against other models, including MobileNetV2, CNN, and LSTM [29], the pre-trained Xception model [20], and ResNet-50 [24]; these results are also summarized in Fig. 23. The last column of Table 6 shows the difference in loss between our results and ResNet-34 [15], the four algorithms of [27], [23], and [37] (Figs. 21–23).
Table 5
The comparison results of our experiments
| Stage | Result stage | Precision | Recall | Accuracy | AUC | Loss |
| --- | --- | --- | --- | --- | --- | --- |
| Original distribution stage with Adam optimizer | Training result | 0.74 | 0.99 | 0.74 | 0.53 | 4.42 |
| | Validation result | 0.62 | 0.99 | 0.62 | 0.54 | 6.10 |
| Original distribution stage with SGD optimizer | Training result | 0.96 | 0.95 | 0.94 | 0.98 | 0.14 |
| | Validation result | 0.90 | 0.92 | 0.88 | 0.94 | 0.27 |
| Class weight stage with Adam optimizer | Training result | 0.99 | 0.99 | 0.99 | 0.99 | 0.002 |
| | Validation result | 0.97 | 0.99 | 0.97 | 0.97 | 0.01 |
| Equal class weight stage with SGD optimizer | Training result | 0.95 | 0.93 | 0.92 | 0.97 | 0.19 |
| | Validation result | 0.97 | 0.88 | 0.89 | 0.97 | 0.24 |
| Different class weight stage with SGD optimizer | Training result | 0.96 | 0.93 | 0.92 | 0.97 | 0.18 |
| | Validation result | 0.98 | 0.88 | 0.90 | 0.97 | 0.23 |
| Re-sampling stage with SGD optimizer and dropout of 0.8 | Training result | 0.974 | 0.939 | 0.957 | 0.992 | 0.115 |
| | Validation result | 0.880 | 0.890 | 0.883 | 0.953 | 0.279 |
| Re-sampling stage with SGD optimizer and dropout of 0.9 | Training result | 0.965 | 0.927 | 0.946 | 0.987 | 0.142 |
| | Validation result | 0.876 | 0.907 | 0.888 | 0.948 | 0.292 |
Table 6
Comparison results with other works
| Reference | Algorithm | Accuracy | Recall | Precision | AUC | Loss |
| --- | --- | --- | --- | --- | --- | --- |
| [2] | CNN model | 84 | - | - | - | 0.8 |
| [3] | CNN | 78 | - | - | - | - |
| [15] | ResNet-34 | 92 | 99 | 90 | - | 1.6 |
| [27] | VGG16 | 87 | 96 | - | - | 0.3 |
| | VGG19 | 88 | 95 | - | - | 0.3 |
| | Inception V3 | 70 | 84 | - | - | 0.9 |
| | ResNet50 | 77 | 97 | - | - | 0.6 |
| [29] | ResNet152V2 | 99 | 99 | 99 | 99 | - |
| | MobileNetV2 | 96 | 99 | 95 | 97 | - |
| | CNN | 92 | 92 | 95 | 96 | - |
| | LSTM | 91 | 92 | 93 | 95 | - |
| [20] | Xception pre-trained model | - | 99 | 84 | 96 | 0.04 |
| [24] | Pre-trained ResNet-50 | 90 | 93 | 93 | 89 | 0.03 |
| [23] | VGG16 | 87 | - | - | - | 0.3 |
| | Xception | 82 | - | - | - | 0.45 |
| [37] | Xception | 83 | - | 95 | - | 0.6 |
| | DenseNet201 | 93 | - | 99 | - | 1.9 |
| | MobileNet_v2 | 96 | - | 98 | - | 0.24 |
| | VGG19 | 85 | - | 80 | - | 1.3 |
| | CNN | 84 | - | 94 | - | 0.4 |
| | VGG16 | 86 | - | 87 | - | 1 |
| | Inception_v3 | 94 | - | 93 | - | 1.76 |
| | ResNet 50 | 96 | - | 98 | - | 1.5 |
| | Inception_ResNet_V2 | 96 | - | 98 | - | 1.1 |
| The proposed model (different class weight stage with SGD) | Pre-activation ResNet with DenseNet169 | 90 | 88 | 98 | 97 | 0.23 |
4.7. Results Discussion
The proposed approach reaches a performance that combines good numerical results with behaviorally acceptable performance curves. In addition, we overcame several challenges encountered during this study, such as the small size of the test data, which first showed up in the performance curves. To meet this challenge, we repartitioned the training and validation data from scratch, and we then used two techniques to deal with the imbalanced data distribution: class weighting and re-sampling with data augmentation. Another challenge was overfitting, which we addressed by reducing the number of pre-activation ResNet layers and by using the regularization type suited to each of the two techniques: L2 with the class-weighting technique, and dropout with the re-sampling technique.
We also changed the optimizer from Adam to SGD, which had a positive effect on the stability of the model's performance compared to the results obtained with the Adam optimizer. A possible explanation is that SGD can update the weights for every training sample (or small batch), which makes the performance trajectory more stable, specifically on the loss scale. A second conclusion is that the method used to deal with imbalanced data plays a role in determining the most suitable regularization method: L2 was best suited to the class-weight technique, while dropout was best suited to the re-sampling technique, which may be due to the volume of data used in each stage.
The results were very good on the precision, AUC, and loss metrics, but more modest on the accuracy and recall metrics. According to our analysis, the reason is the unequal weight assigned to each class, which may cause a conflict in the classification of the two classes. When comparing our results with those of other models, we obtained good results on most metrics, except for recall (Figs. 19–23). Finally, the proposed approach outperforms many previous studies in terms of reaching stable performance, among them studies whose performance fluctuated regardless of the reported values [22–24], [27], and [44]. In addition, our results have a small loss rate, and the difference is obvious compared to the two studies in [26] and [2], which reported considerably higher loss rates.