A. Datasets
Collecting a COVID-19 dataset is not trivial. We, however, collected a number of CXR benchmark collections (C1 to C3) from the literature (see Table I). They help showcase and validate the usability and robustness of our model.
C1: COVID-19 collection [24] is an open-source collection that is made available and maintained by Joseph Paul Cohen. As of now, it is composed of 73 COVID-19 positive CXRs, along with CXRs of other diseases such as MERS, SARS, and viral Pneumonia. For our purpose, only COVID-19 positive posteroanterior CXRs are considered.
C2: Pneumonia collection [25] (Kaggle CXR collection) is composed of 5863 CXRs. Of these, 1583 are normal (healthy) CXRs and the remaining 4280 show various manifestations of viral and bacterial Pneumonia.
C3: Two publicly available Tuberculosis (TB) collections [26] are considered: a) Shenzhen, China and b) Montgomery County, USA. These CXR benchmark collections were made available by the U.S. National Library of Medicine, National Institutes of Health (NIH). The Shenzhen, China collection is composed of 340 normal cases and 342 positive cases of TB. The Montgomery County, USA collection is composed of 80 normal CXRs and 58 TB positive CXRs.
A few samples from the aforementioned collections are visualized in Fig. 4. Using the aforementioned collections, we constructed six different combinations of data to train and validate our model. As provided in Table II, these six dataset combinations (D1 to D6) are listed below:
D1: In dataset D1, 73 COVID-19 positive CXRs and 340 healthy CXRs from the Shenzhen, China collection are considered.
D2: For dataset D2, 73 COVID-19 positive CXRs and 80 healthy CXRs from the Montgomery County, USA collection are considered.
D3: D3 consists of 73 COVID-19 positive CXRs and 1583 healthy CXRs from the Pneumonia collection.
TABLE I
DATA COLLECTION (PUBLICLY AVAILABLE).

Collection     | # of positive cases | # of negative cases
C1: COVID-19   | 73                  | –
C2: Pneumonia  | 4280                | 1583
C3: TB (China) | 342                 | 340
    TB (USA)   | 58                  | 80
TABLE II
EXPERIMENTAL DATASETS USING TABLE I.

Dataset | COVID-19 (+ve / -ve) | Pneumonia (+ve / -ve) | TB China (+ve / -ve) | TB USA (+ve / -ve)
D1      | 73 / –               | – / –                 | – / 340              | – / –
D2      | 73 / –               | – / –                 | – / –                | – / 80
D3      | 73 / –               | – / 1583              | – / –                | – / –
D4      | 73 / –               | – / 1583              | – / 340              | – / 80
D5      | 73 / –               | 4280 / 1583           | – / –                | – / –
D6      | 73 / –               | 4280 / 1583           | 342 / 340            | 58 / 80

Index: +ve = positive cases and -ve = negative/healthy cases.
D4: Dataset D4 contains 73 COVID-19 positive CXRs and 2003 healthy CXRs, combined from the Shenzhen, Montgomery County, and Pneumonia collections.
D5: In dataset D5, 73 COVID-19 positive CXRs, 4280 Pneumonia positive CXRs, and 1583 healthy CXRs from the Pneumonia collection are considered.
D6: In dataset D6, 73 COVID-19 positive CXRs and 6683 non-COVID CXRs (comprising 4280 Pneumonia positive, 400 TB positive, and 2003 healthy CXRs) are considered. A minimal sketch of how such combinations can be assembled is given below.
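As a rough illustration (not the authors' actual pipeline), each combination above amounts to concatenating positive and negative image sets from the source collections and assigning binary labels; the variable names in the usage comment are hypothetical.

```python
import numpy as np

def build_dataset(pos_sets, neg_sets):
    """Combine collections into one binary-labeled dataset.

    pos_sets / neg_sets: lists of image arrays taken from the source
    collections (e.g., C1 COVID-19 positives vs. Shenzhen healthy CXRs
    for D1). Returns stacked images X and labels y (1 = COVID-19 +ve).
    """
    X = np.concatenate(pos_sets + neg_sets, axis=0)
    y = np.concatenate([np.ones(sum(len(s) for s in pos_sets)),
                        np.zeros(sum(len(s) for s in neg_sets))])
    return X, y

# D1: 73 COVID-19 positives vs. 340 Shenzhen healthy CXRs (names hypothetical)
# X, y = build_dataset([covid_pos], [shenzhen_healthy])
```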
The primary motivation behind constructing the various data combinations (D1 to D6) is to show the robustness of the Truncated Inception Net in detecting COVID-19 positive cases. Further, COVID-19 is believed to be closely related to traditional Pneumonia. Therefore, a separate dataset (D5) was constructed to test whether the proposed model can differentiate COVID-19 positive cases from traditional Pneumonia positive cases. Besides, CXRs with Tuberculosis manifestations were added in D6 to show that our model is robust enough to identify COVID-19 among other diseases such as TB and Pneumonia, as well as healthy CXRs. The robustness also lies in the way we collected data, where regional variation is a crucial element: the healthy CXRs in D1, D2, and D3 come from different regions of the world. Considering multiple combinations of data from different places can help develop cross-population train/test models1.
1Even though our tests suggested that the proposed model can serve as a cross-population train/test model, a full study of this is beyond the scope of the paper.
As an input to our model, CXR images were scaled down to 224 × 224 × 3 to match the input dimensions of the Truncated Inception Net. Such resizing also reduces computational complexity. Further, the images were normalized using the min-max scaling scheme.
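For illustration, a minimal preprocessing sketch follows, assuming the CXRs are stored as image files on disk; the function name and file-handling details are our own and not taken from the paper.

```python
import numpy as np
from PIL import Image

def preprocess_cxr(path, target_size=(224, 224)):
    """Resize a CXR to 224x224x3 and apply min-max scaling to [0, 1]."""
    img = Image.open(path).convert("RGB")          # force 3 channels
    img = img.resize(target_size, Image.BILINEAR)  # scale down to 224x224
    arr = np.asarray(img, dtype=np.float32)
    # Min-max scaling: (x - min) / (max - min), guarding against flat images
    lo, hi = arr.min(), arr.max()
    return (arr - lo) / (hi - lo) if hi > lo else np.zeros_like(arr)
```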
B. Validation protocol and evaluation metrics
To validate our proposed model, a 10-fold cross-validation scheme was adopted for training and testing on all six datasets: D1 – D6. A cross-validation scheme ensures that the model's performance is not biased by the presence of outlier data samples in the training or testing datasets. For each of the 10 folds, six different evaluation metrics were employed: a) Accuracy (ACC); b) Area under the ROC curve (AUC); c) Sensitivity (SEN); d) Specificity (SPEC); e) Precision (PREC); and f) F1 score. These can be computed as follows:
ACC = (tp + tn)/(tp + tn + fp + fn),
SEN = tp/(tp + fn),
SPEC = tn/(tn + fp),
PREC = tp/(tp + fp), and
F1 score = 2 × (PREC × SEN)/(PREC + SEN),
where tp, fp, tn, and fn denote the total numbers of true positives, false positives, true negatives, and false negatives, respectively. The mean score across all 10 folds was taken for each of the above metrics to obtain the final result on a particular dataset.
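As a sketch of this protocol, the code below runs stratified 10-fold cross-validation and computes the per-fold metrics directly from the confusion-matrix counts. The arrays `X`, `y` and the `build_model` factory are hypothetical placeholders for the actual Truncated Inception Net pipeline; any classifier exposing scikit-learn-style `fit`/`predict` returning binary labels would fit.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def evaluate_10_fold(X, y, build_model):
    """10-fold CV; returns (mean, std) of each metric across folds."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    scores = {m: [] for m in ("ACC", "SEN", "SPEC", "PREC", "F1")}
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()                      # fresh model per fold
        model.fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        # Confusion-matrix counts (assumes >= 1 predicted positive per fold)
        tp = np.sum((pred == 1) & (y[test_idx] == 1))
        tn = np.sum((pred == 0) & (y[test_idx] == 0))
        fp = np.sum((pred == 1) & (y[test_idx] == 0))
        fn = np.sum((pred == 0) & (y[test_idx] == 1))
        acc = (tp + tn) / (tp + tn + fp + fn)
        sen = tp / (tp + fn)                       # sensitivity / recall
        spec = tn / (tn + fp)                      # specificity
        prec = tp / (tp + fp)                      # precision
        f1 = 2 * prec * sen / (prec + sen)
        for key, val in zip(scores, (acc, sen, spec, prec, f1)):
            scores[key].append(val)
    return {k: (np.mean(v), np.std(v)) for k, v in scores.items()}
```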
In traditional deep learning tasks, a primary metric like accuracy is sufficient to judge the performance of a model. On the contrary, such an assumption does not hold for imbalanced datasets. In such cases (common in medical datasets), the positive class to be predicted often has far fewer samples than the negative class, so accuracy can remain high even if the model labels all test data as negative. For example, on D6 (73 positives against 6683 negatives), a model predicting every CXR as negative would still reach roughly 98.9% accuracy while achieving zero sensitivity. Therefore, special attention is given here to metrics like Sensitivity/Recall, Precision, and F1 score.
In the context of COVID-19, the SEN metric plays a crucial role when deploying a model for screening patients in the early stages of a pandemic. Sensitivity measures the likelihood that the model does not miss COVID-19 positive samples/patients, which helps prevent further spreading of the infection. Secondly, precision measures the likelihood that the model does not misclassify normal patients as COVID-19 positive. This metric becomes very important in the later stages of a pandemic, when medical resources are limited and must be reserved for the patients in need. Besides, the F1 score summarizes the combined performance of a model as the harmonic mean of its precision and sensitivity.
C. Results and analysis
Before providing quantitative results, we first show the activation maps generated by our proposed model for a COVID-19 positive, a Pneumonia positive, and a TB positive CXR, visualized in Fig. 5. Qualitatively speaking, such feature maps help illustrate how different their features can be.
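A minimal sketch of how such activation (feature) maps can be extracted from a trained Keras model is given below; the `layer_name` argument is a hypothetical placeholder for whichever convolutional layer is visualized, and `model` stands for the trained network.

```python
import tensorflow as tf

def activation_maps(model, image, layer_name):
    """Extract intermediate feature maps for one preprocessed CXR.

    `layer_name` is hypothetical -- substitute the convolutional layer
    whose activations are to be visualized.
    """
    extractor = tf.keras.Model(inputs=model.input,
                               outputs=model.get_layer(layer_name).output)
    # Add a batch dimension: (224, 224, 3) -> (1, 224, 224, 3)
    return extractor(image[None, ...]).numpy()[0]
```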
Following the validation protocol and evaluation metrics described in Section III-B, we present the mean scores achieved using the 10-fold cross-validation train/test scheme on each of the six datasets: D1 – D6. The experimental results are documented in Table III. The standard deviation (σ) is also reported in all cases; its consistently low values support the statistical robustness of our model. The proposed Truncated Inception Net achieves a classification ACC, AUC, SEN, SPEC, PREC, and F1 score of 99.96%, 1.0, 0.98, 0.99, 0.98, and 0.98, respectively, on dataset D5 (COVID-19 positive case detection against Pneumonia and healthy cases), and of 99.92%, 0.99, 0.93, 1.0, 1.0, and 0.96, respectively, on dataset D6 (COVID-19 positive case detection against Pneumonia, TB, and healthy CXRs). Since the custom datasets are highly imbalanced in class representation, sensitivity and precision are the most significant metrics in our case, as discussed in Section III-B; accordingly, the proposed model achieves high sensitivity and precision on these datasets. For a better understanding of the results, six ROC curves, one per dataset (D1 to D6), are shown in Fig. 6.
TABLE III
RESULTS: AVERAGE ACC IN %, AUC, SEN, SPEC, PREC, AND F1 SCORE USING 10-FOLD CROSS-VALIDATION WITH STANDARD DEVIATION σ.

Dataset | ACC           | AUC          | SEN           | SPEC         | PREC         | F1
D1      | 99.50 ± 0.325 | 0.99 ± 0.023 | 0.96 ± 0.015  | 1.0 ± 0.0    | 1.0 ± 0.0    | 0.97 ± 0.007
D2      | 94.04 ± 3.250 | 1.0 ± 0.0    | 0.88 ± 0.0924 | 1.0 ± 0.0    | 1.0 ± 0.0    | 0.93 ± 0.045
D3      | 100 ± 0.0     | 1.0 ± 0.0    | 1.0 ± 0.0     | 1.0 ± 0.0    | 1.0 ± 0.0    | 1.0 ± 0.0
D4      | 99.85 ± 0.019 | 0.99 ± 0.100 | 0.96 ± 0.020  | 1.0 ± 0.0    | 1.0 ± 0.0    | 0.97 ± 0.015
D5      | 99.96 ± 0.002 | 1.0 ± 0.0    | 0.98 ± 0.015  | 0.99 ± 0.100 | 0.98 ± 0.002 | 0.98 ± 0.013
D6      | 99.92 ± 0.100 | 0.99 ± 0.006 | 0.93 ± 0.096  | 1.0 ± 0.0    | 1.0 ± 0.0    | 0.96 ± 0.055
Additionally, since 10-fold cross-validation was computed for every dataset, the results obtained from each fold on dataset D6 are provided in Table IV, for a better understanding of how the average scores and their standard deviations were computed. Besides, the proposed Truncated Inception Net performs, on average, 2.3 ± 0.18 times faster than the Inception Net V3 model. In Table V, computational times (taking 10 different CXR samples as input) are used to demonstrate the difference between them. The primary reason is the large number of parameters in the original Inception Net V3 model: it contains more than 21.7 million trainable parameters, in contrast to our model's 2.1 million, making ours a better choice for training on small datasets and also for active learning. Therefore, for mass screening in resource-constrained areas, employing such a faster tool is the more practical option.
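As a hedged sketch of how such parameter counts and inference times can be measured, the code below uses Keras's stock InceptionV3 as the reference model; the Truncated Inception Net itself is not reproduced here, and the random batch merely stands in for 10 preprocessed CXR samples.

```python
import time
import numpy as np
import tensorflow as tf

def trainable_params(model):
    """Total number of trainable parameters in a Keras model."""
    return int(sum(np.prod(w.shape) for w in model.trainable_weights))

def time_inference(model, n_samples=10, runs=5):
    """Average wall-clock time to predict on n_samples CXR-sized inputs."""
    batch = np.random.rand(n_samples, 224, 224, 3).astype("float32")
    model.predict(batch, verbose=0)  # warm-up run (graph building)
    start = time.perf_counter()
    for _ in range(runs):
        model.predict(batch, verbose=0)
    return (time.perf_counter() - start) / runs

# Reference: full Inception Net V3 (~21.8M trainable parameters)
inception_v3 = tf.keras.applications.InceptionV3(
    weights=None, input_shape=(224, 224, 3), classes=2)
print(trainable_params(inception_v3), time_inference(inception_v3))
```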