This section details the dataset used and the proposed methodology, including three CNN architectures, three classifiers (SVM, RF, and KNN), and the automatically extracted features.
In this research, several publicly available datasets of COVID-19 and non-COVID-19 chest X-ray images have been used [40]. The COVID-19 images were collected from different sources, such as the Cohen dataset on GitHub (https://github.com/ieee8023/covid-chestxray-dataset), the Italian Society of Medical and Interventional Radiology (SIRM) website, Radiopaedia, and the Radiological Society of North America (RSNA); in total, 310 COVID-19 chest X-ray images were collected [38].
A collection of normal and pneumonia images was then taken from a publicly available Kaggle dataset (https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia): 310 normal chest X-ray images and 310 pneumonia images, covering both viral and bacterial pneumonia [39, 40]. All these images were added to the dataset and augmented during training; the augmentation was done to prevent the CNN from overfitting and memorizing the exact details of the training images. The augmentation includes flipping, rotating, translating, and scaling the images, which increases the total number of images and makes the dataset suitable for deep learning [40]. The augmented dataset was published online by Alqudah and Qazan and is publicly available at (https://data.mendeley.com/datasets/2fxz4px6d8). Fig. 1 A, B, and C show X-ray images of Normal, COVID-19, and Pneumonia subjects, respectively.
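The augmentation operations listed above (flipping, rotating, and translating) can be sketched as follows; this is an illustrative NumPy example, not the authors' implementation, and the shift range is a hypothetical choice.

```python
import numpy as np

def augment(image, rng):
    """Generate simple augmented variants of a 2-D X-ray image:
    horizontal flip, 90-degree rotation, and a small random translation."""
    variants = [image]
    variants.append(np.fliplr(image))                # horizontal flip
    variants.append(np.rot90(image))                 # rotation
    shift = int(rng.integers(-10, 11))               # hypothetical translation range
    variants.append(np.roll(image, shift, axis=1))   # horizontal translation
    return variants

rng = np.random.default_rng(0)
img = rng.random((128, 128))                         # stand-in for an X-ray image
augmented = augment(img, rng)
print(len(augmented))  # original + 3 variants -> 4
```

In practice a scaling step and random parameter ranges would also be applied, multiplying the effective dataset size.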
- Convolutional Neural Network (CNN)
In general, the Convolutional Neural Network (CNN) is a recent advancement of the traditional artificial neural network (ANN); it is composed of self-learning neurons with their own weights and biases. Each neuron receives pixels from the input image, computes a dot product, and optionally applies a non-linearity [41]. The input layer is a layer that enhances the image through operations such as mean subtraction and feature scaling. The feature that distinguishes CNNs from ANNs is the large number of hidden layers, which is why the approach is called deep learning. CNNs consist of convolutional layers that convolve the image with learnable filters, each producing a feature map at the output [42].
The spatial dimensions of the generated feature map are then reduced using max-pooling layers. This is performed by sliding a window over the image with a predefined stride and keeping the maximum pixel value within each window in the new image. Alternatively, the pooling layer can use the average value rather than the maximum [43]. The ReLU layer introduces non-linearity into the network; its function is f(x) = max(0, x) [10]. The last layer is a fully connected layer, which represents the output layer holding the classification scores; these scores form a single vector that is passed to the softmax classifier, which selects the class with the highest probability [44].
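The ReLU and max-pooling operations described above can be sketched minimally in NumPy (illustrative only; real frameworks implement these as layers):

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: f(x) = max(0, x)."""
    return np.maximum(0, x)

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2: keep the largest value in each window."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fm = np.array([[1., -2., 3., 0.],
               [4., 5., -6., 7.],
               [-1., 2., 0., 1.],
               [3., -4., 8., -2.]])
pooled = max_pool_2x2(relu(fm))
print(pooled)  # [[5. 7.]
               #  [3. 8.]]
```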
AOCTNet [34], proposed by Alqudah, was initially used to classify optical coherence tomography (OCT) images; in this research paper it is used to classify chest X-ray images. It starts with a first convolutional layer that uses 32 filters of size 3×3 with zero padding of one, while the remaining convolutional layers use 16 filters of the same size, except for the third, which uses only 8 filters. A batch normalization layer is introduced between each convolutional layer and the non-linearity (ReLU) layer to reduce training time, accelerate the training stage, and decrease the network's sensitivity to initialization. A max-pooling layer with a 2×2 window and a stride of 2 is used in AOCTNet [35]. The output stage consists of a fully connected layer with an output size of 2, a softmax layer, and a classification layer. Figure 2 illustrates the AOCTNet architecture graphically.
ShuffleNet [45] is a computationally efficient, lightweight CNN architecture commonly used in mobile devices and embedded systems with limited computing capacity. ShuffleNet offers better performance on ImageNet classification tasks than MobileNet. The architecture of ShuffleNet is composed essentially of stacked ShuffleNet units grouped into three stages. In each stage, the first building block is applied with a stride of 2; the number of output channels stays the same within a stage but is multiplied for the next one. ShuffleNet uses two operations to reduce computational cost while preserving accuracy [45]: group convolution and pointwise channel shuffle. The channel shuffle operation divides the channels within each group into several subgroups and then feeds each subgroup to a different group in the next layer [45]. Figure 3 illustrates the ShuffleNet units, which are designed specifically for small networks.
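The channel shuffle operation can be illustrated in NumPy: reshape the channel axis into (groups, channels per group), transpose, and flatten, so each group's channels are interleaved across the groups. This is a generic sketch of the operation from [45], not the authors' code.

```python
import numpy as np

def channel_shuffle(x, groups):
    """ShuffleNet channel shuffle for a tensor of shape (N, C, H, W)."""
    n, c, h, w = x.shape
    x = x.reshape(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(0, 2, 1, 3, 4)               # swap group and subgroup axes
    return x.reshape(n, c, h, w)                 # flatten back: channels interleaved

x = np.arange(6).reshape(1, 6, 1, 1)             # channels 0..5, two groups of 3
shuffled = channel_shuffle(x, groups=2)
print(shuffled.ravel())  # [0 3 1 4 2 5]
```

After the shuffle, the next grouped convolution sees channels from every group, which is what lets grouped convolutions stay cheap without isolating information inside each group.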
Generally, MobileNet is the most commonly used CNN for embedded-device applications, such as smartphone and robot vision, thanks to its lightweight architecture [46]. The 25-layer MobileNet architecture built for the current study employs four Conv2D layers, seven batch normalization layers, seven ReLU layers, three ZeroPadding2D layers, and single layers of DepthwiseConv2D, Global Average Pooling, Dropout, and Dense [47]. Like the other CNNs, MobileNet is pretrained on ImageNet [48]. Figure 4 displays the MobileNet architecture.
Due to the limited availability of large medical image datasets, training an entire convolutional neural network (CNN) from scratch on medical images to achieve a given classification accuracy is often infeasible. Therefore, random initialization of the weights is replaced by an existing CNN that has already been trained on a large dataset, usually ImageNet; this process is called transfer learning. There are two transfer-learning approaches. The first is fixed feature extraction, whereby the output of an intermediate layer is used to train a separate classifier. The second is fine-tuning, which replaces the last fully connected layer with one sized for the new classes and retrains the entire CNN; this retraining updates the weights in the deeper layers [23, 33].
In this paper, the pretrained AOCTNet, MobileNet, and ShuffleNet CNNs are modified to fit the goal of this research. For AOCTNet, the first layer, the image input layer (Image Input Layer), is modified to a size of 128 × 128 × 3, and the fully connected (FC) layer (FC Layer) is modified to 3 outputs. For MobileNet, the image input layer (input_1) is modified to a size of 128 × 128 × 3, the FC layer (Logits) is modified to 3 outputs, and the classification layer (ClassificationLayer_Logits) is modified to match the new FC layer. Finally, for ShuffleNet, the image input layer (Input_gpu_0|data_0) is modified to a size of 128 × 128 × 3, the FC layer (node_202) is modified to 3 outputs, and the classification layer (ClassificationLayer_node_203) is modified to match the new FC layer.
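The modifications above were made inside a deep-learning framework. As a framework-agnostic illustration of the idea only (keep the learned feature extractor, replace the final FC layer with one sized for 3 classes), consider this toy NumPy sketch; all layer names and shapes here are hypothetical, not the actual networks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-in for a pretrained network: a feature extractor followed by
# an FC layer with 1000 ImageNet outputs (shapes are hypothetical).
pretrained = {
    "conv_features": rng.random((64, 128)),
    "fc": rng.random((128, 1000)),
}

def adapt_for_new_task(net, n_classes, rng):
    """Transfer-learning sketch: keep the learned feature extractor and
    replace the final FC layer with a freshly initialised one sized for
    the new classes (here 3: Normal, COVID-19, Pneumonia)."""
    adapted = dict(net)
    in_features = net["fc"].shape[0]
    adapted["fc"] = rng.standard_normal((in_features, n_classes)) * 0.01
    return adapted

model = adapt_for_new_task(pretrained, n_classes=3, rng=rng)
print(model["fc"].shape)  # (128, 3)
```

Fine-tuning would then continue training all weights on the new chest X-ray data, not just the replaced layer.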
- X-ray Images Preprocessing
Since this research focuses on lightweight CNNs, any reduction in the size of the input images reduces the time required for both training and classifying new images. Accordingly, the input image layer of all architectures was modified to a size of 128 × 128 × 3. Because this size is for RGB images while chest X-rays are only 2D images, the images were concatenated along a third dimension to make them RGB.
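The grayscale-to-RGB concatenation can be done by stacking the 2D image three times along a new third axis, as in this minimal NumPy sketch (resizing to 128 × 128 is assumed to have been done beforehand):

```python
import numpy as np

def to_rgb(gray):
    """Stack a 2-D grayscale X-ray along a third axis so it matches
    the 128 x 128 x 3 input layer of the modified networks."""
    return np.stack([gray, gray, gray], axis=-1)

gray = np.zeros((128, 128), dtype=np.uint8)  # stand-in for a resized X-ray
rgb = to_rgb(gray)
print(rgb.shape)  # (128, 128, 3)
```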
Automatic feature extraction with a CNN is usually performed at the fully connected (FC) layer of the architecture. The FC layer extracts only representative features able to distinguish between the input classes, which in this research are the Normal, COVID-19, and Pneumonia cases. In this paper, the features are extracted from the FC layer of the three architectures (AOCT-Net, MobileNet, and ShuffleNet), which is a common method since the FC layer precedes the softmax classifier. Based on the selected layer, only three features are extracted per image (one per class), and these features are finely selected and representative [49, 50]. The scatter distribution of the features extracted by the three models is shown in Figure 5.
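Extracting the FC-layer activations amounts to running the forward pass and stopping just before the softmax. A toy NumPy sketch of the idea (the two-matrix "network" and all shapes are hypothetical stand-ins for the real architectures):

```python
import numpy as np

def extract_fc_features(x, weights):
    """Run a toy forward pass and return the FC-layer activations,
    which serve as the extracted feature vector (3 values per image)."""
    hidden = np.maximum(0, x @ weights["conv_features"])  # ReLU feature stage
    return hidden @ weights["fc"]                         # FC output: 3 features

rng = np.random.default_rng(2)
weights = {"conv_features": rng.random((100, 64)),
           "fc": rng.random((64, 3))}
images = rng.random((5, 100))             # 5 flattened images (hypothetical size)
features = extract_fc_features(images, weights)
print(features.shape)  # (5, 3): three features per image, as in the paper
```

These per-image feature vectors are what the SVM, RF, and KNN classifiers below are trained on in place of the softmax.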
Furthermore, these features are used to build hybrid systems that replace the softmax classifier with the Support Vector Machine (SVM), Random Forest (RF), and K-Nearest Neighbor (KNN) classifiers. All classifiers are discussed in detail in sub-sections 2.4 to 2.7.
The default classifier of a CNN is the softmax classifier, a very powerful and commonly used discriminant classifier. The softmax discriminant function (SDF) assigns a new test sample to an output class using a non-linear transformation of the distance between the test sample and the training samples. The learning rule for a softmax classifier with binary units is similar to the regular binary-unit rule; the only difference is that the softmax function is a generalization of the logistic sigmoid function that can handle classification problems with more than two possible classes [49, 50].
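A minimal sketch of the softmax function itself, which turns the FC-layer scores into the probability vector from which the highest-probability class is selected (the example scores are arbitrary):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax: exponentiate shifted scores, then normalise."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])  # hypothetical FC outputs for 3 classes
probs = softmax(scores)
print(probs.argmax())  # class with the highest probability -> 0
```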
- Support Vector Machine (SVM) Classifier
The Support Vector Machine (SVM) is one of the leading and most commonly used supervised machine-learning algorithms for classifying data into two main groups. The SVM uses the training data to create a model that distinguishes the input data and that can be used to predict the class of new data. The main objective of the SVM is to find the best hyperplane separating the entire dataset while maximizing the distance between the separating hyperplane and the nearest data points [51, 52, 53]. In this research, we used a radial basis function (RBF) kernel.
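An RBF-kernel SVM of this kind can be sketched with scikit-learn; the synthetic three-feature data here merely stands in for the extracted CNN features, and hyperparameters beyond the RBF kernel are library defaults, not the paper's settings.

```python
from sklearn.svm import SVC
from sklearn.datasets import make_classification

# Synthetic 3-class data with 3 features, mimicking the FC-layer features.
X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

clf = SVC(kernel="rbf")  # RBF kernel, as used in the paper
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy on the synthetic data
```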
- K-Nearest Neighbor (KNN) Classifier
The K-nearest neighbor (KNN) algorithm is a fast, lazy, widely used, non-parametric machine-learning method. In general, KNN's input vector comprises the feature space, and the target class membership is determined by majority voting over the classes of its neighbors. The majority vote is weighted by the distance between each feature point and the center of mass of the vector [54, 55]. In this research, we used a Hamming distance function with three neighbors and the exhaustive search method.
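The stated configuration (three neighbors, Hamming distance, exhaustive search) maps naturally onto scikit-learn, where brute-force search is the exhaustive method; the toy binary features below are illustrative, not the paper's data.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 3)).astype(float)  # toy binary feature vectors
y = X[:, 0].astype(int)                              # label tied to first feature

# 3 neighbours, Hamming distance, brute-force (exhaustive) search.
knn = KNeighborsClassifier(n_neighbors=3, metric="hamming", algorithm="brute")
knn.fit(X, y)
print(knn.score(X, y))  # training accuracy on the toy data
```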
- Random Forest (RF) Classifier
In 2001, Breiman invented the Random Forest (RF) algorithm. It is composed of a large number of decision trees that operate as an ensemble. Each individual tree in this classifier outputs a class prediction, and the class with the most votes becomes the model's prediction. Simplicity and power are the key ideas behind random forests: in data-science terms, a collection of fairly uncorrelated models is the reason the random forest model works so well [55, 56]. In this research, we used an RF with 100 bags for bootstrapping.
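A random forest with 100 bootstrapped trees can be sketched in scikit-learn as follows; the synthetic data again stands in for the extracted features, and other hyperparameters are library defaults rather than the paper's settings.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=3, n_informative=3,
                           n_redundant=0, n_classes=3, n_clusters_per_class=1,
                           random_state=0)

# 100 trees with bootstrap sampling (bagging), as in the paper.
rf = RandomForestClassifier(n_estimators=100, bootstrap=True, random_state=0)
rf.fit(X, y)
print(rf.score(X, y))  # training accuracy on the synthetic data
```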
The confusion matrix is the most commonly used tool for evaluating the performance of an artificial intelligence algorithm; it compares the system output to reference data. From the confusion matrix, the most common metrics are derived, such as accuracy, specificity, sensitivity, and precision. To compute each of these, four statistical indices are determined: true positive (TP), false positive (FP), false negative (FN), and true negative (TN) [57]. Accuracy, sensitivity, precision, and specificity were therefore determined as follows:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$
$$\text{Sensitivity} = \frac{TP}{TP + FN} \quad (2)$$
$$\text{Precision} = \frac{TP}{TP + FP} \quad (3)$$
$$\text{Specificity} = \frac{TN}{TN + FP} \quad (4)$$
Accuracy indicates the classifier's ability to properly distinguish between classes; sensitivity refers to its ability to correctly detect true positives; specificity measures the actual negatives that the classifier correctly identifies; and precision indicates how many of the predicted positives are actually positive [57]. In addition, the classifier's F1 score, which measures the accuracy of a test, and the Matthews Correlation Coefficient (MCC), which represents the correlation between the observed and predicted classes, were computed [58, 59]:
$$F1 = \frac{2 \times \text{Precision} \times \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}} \quad (5)$$
$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \quad (6)$$
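All of the metrics above follow directly from the four confusion-matrix counts; a minimal sketch with arbitrary example counts:

```python
import math

def metrics(tp, fp, fn, tn):
    """Compute accuracy, sensitivity, precision, specificity, F1, and MCC
    from the confusion-matrix counts TP, FP, FN, TN."""
    accuracy    = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    precision   = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f1  = 2 * precision * sensitivity / (precision + sensitivity)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return accuracy, sensitivity, precision, specificity, f1, mcc

# Example counts (illustrative only).
print(metrics(tp=90, fp=10, fn=5, tn=95))
```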