In the sections below, we present four important techniques used in our research.
3.3 Feature Extraction using CNN
Feature extraction using Convolutional Neural Networks (CNNs) is a critical process in image analysis and computer vision tasks. It involves extracting relevant patterns or features from raw image data to represent them in a more abstract and meaningful way. The feature extraction process is paramount for reducing the dimensionality of the image data and capturing essential information from images, such as edges, textures, shapes, and other discriminative patterns.
CNNs consist of multiple convolutional layers. Each convolutional layer applies a set of learnable filters (kernels) to the input image, performing convolutions to produce feature maps. These feature maps represent the presence of specific patterns or features at various spatial locations within the input image. Mathematically, the operation of applying a filter \({W}_{i}\) to a portion of the input image X can be represented as a convolution operation followed by a bias term and an activation function as shown in Eq. 1.
$${Z}_{i} = f\left(\sum _{l=1}^{d}\sum _{m=1}^{d}\left({X}_{(l,m)} * {W}_{i(l,m)}\right) + {b}_{i}\right) \quad (1)$$
where \({Z}_{i}\) is the output feature map corresponding to the i-th filter, \({X}_{(l,m)}\) is the input image patch value at position (l, m), \({W}_{i(l,m)}\) are the corresponding filter weights, \({b}_{i}\) is the bias term for the i-th filter, and f is an activation function such as ReLU.
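As an illustration, Eq. 1 can be sketched in NumPy for a single filter with valid padding; the 3 × 3 image, 2 × 2 filter, and zero bias below are arbitrary toy values, not data from this work.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def conv_feature_map(X, W, b):
    """One d x d filter W slid over image X (valid padding), per Eq. 1."""
    d = W.shape[0]
    rows, cols = X.shape
    out = np.zeros((rows - d + 1, cols - d + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = X[i:i + d, j:j + d]
            # weighted sum over the patch, plus bias, through the activation
            out[i, j] = relu(np.sum(patch * W) + b)
    return out

X = np.array([[1., 2., 0.],
              [0., 1., 3.],
              [2., 1., 1.]])
W = np.array([[1., 0.],
              [0., 1.]])
Z = conv_feature_map(X, W, b=0.0)
print(Z)  # [[2. 5.] [1. 2.]]
```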
After each convolutional layer, a pooling layer (max pooling) was applied to reduce the spatial dimensions of the feature maps. Mathematically, the max pooling operation producing output \({Y}_{(i,j)}\) can be represented as in Eq. 2.
$${Y}_{(i,j)} = \max_{(p,q)\, \in \, \text{pooling region}} {Z}_{(i+p,\, j+q)} \quad (2)$$
where \({Y}_{(i,j)}\) is the output of the max pooling operation at position (i, j) and \({Z}_{(i+p,j+q)}\) is the feature map value at position (i + p, j + q). The max pooling operation is applied over a predefined pooling region.
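A minimal NumPy sketch of Eq. 2, using a 2 × 2 pooling region with stride 2 (toy values, not data from this work):

```python
import numpy as np

def max_pool(Z, pool=2, stride=2):
    """Eq. 2: each output value is the max over one pooling region of Z."""
    rows, cols = Z.shape
    out_h, out_w = rows // stride, cols // stride
    Y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            Y[i, j] = Z[i * stride:i * stride + pool,
                        j * stride:j * stride + pool].max()
    return Y

Z = np.array([[1., 5., 2., 0.],
              [3., 4., 1., 1.],
              [0., 0., 7., 2.],
              [2., 1., 3., 6.]])
Y = max_pool(Z)
print(Y)  # [[5. 2.] [2. 7.]]
```

Note how the 4 × 4 map shrinks to 2 × 2: pooling halves each spatial dimension while keeping the strongest response in each region.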
A non-linear activation function, ReLU (Rectified Linear Unit), was applied after each convolutional and pooling operation to introduce non-linearity and enable the network to learn complex representations. The activation function is given in Eq. 3.
f(x) = max(0, x) (3)
Eventually, the feature maps are flattened into a vector representation, which serves as the input to the fully connected (dense) layers of the network. In this work, the flatten operation produced a vector of length 25088, corresponding to the total number of neurons in the output feature maps of the preceding convolutional and pooling layers. The fully connected layers further process the extracted features to perform tasks such as classification or regression, as represented in Eq. 4.
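As a sanity check on that figure, a flattened length of 25088 is consistent with, for example, 512 feature maps of size 7 × 7; the shape below is an illustrative assumption, since only the flattened length is stated here.

```python
import numpy as np

# Hypothetical final feature-map stack whose flattened length is 25088
# (7 x 7 x 512 is one shape consistent with that figure -- an assumption,
# as the text states only the flattened length).
feature_maps = np.zeros((7, 7, 512))
flat = feature_maps.reshape(-1)
print(flat.shape[0])  # 25088
```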
A = f(W X + b) (4)
where A is the output vector, W is the weight matrix of the layer, X is the input vector, b is the bias vector, and f is the activation function. These layers combine the learned features from previous layers and map them to the desired output classes. In this research, the layer has 64 neurons, meaning 64 sets of weights and biases are learned during training. To curtail overfitting, a dropout rate of 0.25 was used during training. Figure 2 shows a diagrammatic representation of each process.
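Eq. 4 for this 64-neuron layer can be sketched directly in NumPy; the random weights below are purely illustrative stand-ins for the learned parameters.

```python
import numpy as np

def dense(X, W, b, f=lambda z: np.maximum(0.0, z)):
    """Fully connected layer per Eq. 4: A = f(W X + b), with ReLU as f."""
    return f(W @ X + b)

rng = np.random.default_rng(0)
X = rng.standard_normal(25088)        # flattened feature vector (Eq. 4 input)
W = rng.standard_normal((64, 25088))  # 64 neurons -> 64 rows of weights
b = np.zeros(64)                      # one bias per neuron
A = dense(X, W, b)
print(A.shape)  # (64,)
```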
The CNN has 20 layers of various types, including Conv2D, MaxPooling2D, Dropout and fully connected (FCL) layers. The ReLU activation function was used for the inner layers, with a dropout rate of 0.25. The CNN learns to extract meaningful features from a labeled training dataset consisting of images and their corresponding ground-truth labels. The trained CNN is then evaluated on a separate test dataset that it has not seen during training, ensuring an unbiased assessment of the model's performance on unseen data. The feature extraction process remains the same during testing, but the extracted features are used for inference or prediction without further parameter updates.
In a CNN, the number of epochs refers to the number of times the entire training dataset is passed forward and backward through the neural network. Each epoch consists of one forward pass (computing predictions and losses) and one backward pass (updating weights using backpropagation). For the purpose of this research, 50 epochs were used, a choice informed by the complexity of the model, the size of the training dataset, and the learning rate, which determines the size of the steps taken during gradient descent optimization. To assess training behaviour across epochs, Fig. 3 shows the accuracy curves and Fig. 4 shows the loss curves. As seen in Fig. 3, both training accuracy and validation accuracy increase rapidly and approach 1.0 as the number of epochs increases. Likewise, both training loss and validation loss decrease rapidly and approach a minimum value. This suggests that the CNN model learns effectively from the training data and generalizes well to unseen data, with overfitting successfully avoided.
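The epoch structure described above, one forward and one backward pass over the full dataset per epoch, can be sketched with a tiny stand-in model; logistic regression replaces the CNN here purely for brevity, and the synthetic data is illustrative.

```python
import numpy as np

# Minimal epoch-loop sketch: 50 epochs, each with one forward pass
# (predictions + loss) and one backward pass (gradient update).
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))
y = (X[:, 0] > 0).astype(float)
w, b, lr = np.zeros(5), 0.0, 0.1

losses = []
for epoch in range(50):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # forward pass
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    losses.append(loss)
    grad = p - y                            # backward pass (cross-entropy grad)
    w -= lr * X.T @ grad / len(y)
    b -= lr * grad.mean()

print(losses[0], losses[-1])  # loss shrinks across epochs
```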
The next stage is the performance analysis of the CNN model. Table 1 shows the classification report for the binary classification between COVID-19 and Normal cases. Precision measures the proportion of true positive predictions (correctly identified COVID-19 cases) out of all positive predictions (all cases predicted as COVID-19). A precision of 0.9749 for COVID-19 and 0.9671 for Normal indicated that the model had a high percentage of correct positive predictions for both classes. Recall measures the proportion of true positive predictions out of all actual positive cases (all COVID-19 cases in the dataset). A recall of 0.9669 for COVID-19 and 0.9751 for Normal indicated that the model captured a high percentage of actual positive cases for both classes. The F1-score is the harmonic mean of precision and recall, providing a single metric that balances the two. A high F1-score (0.9709 for COVID-19 and 0.9711 for Normal) indicated a good balance between precision and recall for both classes. Support represents the number of samples in each class: there are 362 samples for both the COVID-19 and Normal classes. Accuracy measures the overall correctness of the model's predictions across all classes; an accuracy of 0.9710 indicated that the model correctly predicted the class for approximately 97.10% of the samples. The macro average computes the unweighted mean of precision, recall, and F1-score across all classes; here it is 0.9710 for each metric, indicating balanced performance across classes. The weighted average weights each class's metrics by its number of samples; here it is also 0.9710 for each metric, confirming balanced performance given the equal class distribution.
Table 1
Classification report for CNN Model
| | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Covid | 0.9749 | 0.9669 | 0.9709 | 362 |
| Normal | 0.9671 | 0.9751 | 0.9711 | 362 |
| Accuracy | | | 0.9710 | 724 |
| Macro avg | 0.9710 | 0.9710 | 0.9710 | 724 |
| Weighted avg | 0.9710 | 0.9710 | 0.9710 | 724 |
Overall, the classification report suggested that the model achieved high precision, recall, and F1-score for both COVID-19 and Normal classes, indicating strong performance in distinguishing between the two classes. The high accuracy further confirms the overall effectiveness of the model. Figure 5 shows the confusion matrix for the CNN model, with Normal taken as the positive class: 350 instances (true negatives) were correctly predicted as Covid when they were actually Covid, 12 instances (false positives) were incorrectly predicted as Normal when they were actually Covid, 9 instances (false negatives) were incorrectly predicted as Covid when they were actually Normal, and 353 instances (true positives) were correctly predicted as Normal when they were actually Normal.
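The per-class metrics in Table 1 can be recomputed directly from these confusion-matrix counts, which serves as a consistency check between Fig. 5 and the classification report (Normal is taken as the positive class, matching the labels above):

```python
# Confusion-matrix counts reported above (Normal = positive class):
tn, fp, fn, tp = 350, 12, 9, 353

precision_covid = tn / (tn + fn)   # of all "Covid" predictions, fraction correct
recall_covid = tn / (tn + fp)      # of all actual Covid cases, fraction found
precision_normal = tp / (tp + fp)
recall_normal = tp / (tp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_covid, 4), round(recall_covid, 4))    # 0.9749 0.9669
print(round(precision_normal, 4), round(recall_normal, 4))  # 0.9671 0.9751
print(round(accuracy, 4))                                   # 0.971
```

The recomputed values match Table 1, so the report and the confusion matrix describe the same predictions.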
3.4 Machine Learning Models
The objective of the research is to use machine learning models and combine them into an ensemble model that predicts, via a software tool, whether or not a patient has COVID-19. The models used in the research are Decision Tree, Random Forest, Support Vector Machine, and AdaBoost.
3.4.1 Decision Tree
A decision tree is a non-linear supervised learning algorithm used for classification and regression tasks. It partitions the feature space into regions and makes predictions based on the majority class within each region. We can denote a decision tree as T. The prediction of a decision tree for a sample x can be represented as:
ŷT = T(x) (5)
3.4.2 Random Forest
Random Forest aggregates predictions from multiple decision trees. Let \({T}_{\varvec{i}}\) represent the i-th decision tree in the Random Forest. The prediction of the Random Forest for a sample x can be represented as:
ŷRF = \(\frac{1}{N} \sum _{i=1}^{N}{T}_{i}\left(x\right)\) (6)
where N is the number of trees in the Random Forest.
3.4.3 Support Vector Machine (SVM)
Support Vector Machine finds the hyperplane that best separates the classes in the feature space. Let's denote an SVM model as SVM. The prediction of an SVM for a sample x can be represented as:
ŷSVM = SVM (x) (7)
3.4.4 AdaBoost
AdaBoost combines predictions from multiple weak learners. Let \({H}_{i}\) represent the i-th weak learner. The prediction of AdaBoost for a sample x can be represented as:
$${\hat{y}}_{AdaBoost} = \mathrm{sign}\left(\sum _{i=1}^{N}{\alpha }_{i}{H}_{i}\left(x\right)\right) \quad (8)$$
where \({\alpha }_{i}\) is the weight assigned to the i-th weak learner and N is the number of weak learners.
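Eq. 8 can be evaluated by hand for a few hypothetical weak learners; the outputs and weights below are illustrative values only.

```python
import numpy as np

# Eq. 8 with three hypothetical weak learners H_i and weights alpha_i.
H = np.array([+1, -1, +1])         # weak-learner outputs for one sample x
alpha = np.array([0.5, 0.2, 0.4])  # learner weights (illustrative)

# Weighted vote: 0.5 - 0.2 + 0.4 = 0.7, so the sign is positive.
y_hat = np.sign(np.sum(alpha * H))
print(y_hat)  # 1.0
```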
3.4.5 Soft Voting Classifier (Ensemble Model)
In a Soft Voting Classifier, predictions from individual models are combined by averaging their class probabilities. Let's denote the Soft Voting Classifier as Voting. The prediction of the Soft Voting Classifier for a sample x can be represented as:
ŷVoting = argmax \(\left(\frac{1}{M}\sum _{j=1}^{M}{p}_{j}\left(x\right)\right)\) (9)
where M is the number of individual models (in this case, 4) and \({p}_{j}\left(x\right)\) represents the probability estimates from the j-th model for each class.
To combine the predictions from the Decision Tree, Random Forest, SVM, and AdaBoost models into a Soft Voting Classifier, we computed the probability estimates \({p}_{j}\left(x\right)\) for each sample x using each model and then averaged these probabilities across all models. The class with the highest average probability was chosen as the final prediction for the software.
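A soft-voting ensemble of these four models, per Eq. 9, can be sketched with scikit-learn's `VotingClassifier`; the toy dataset and hyperparameters below are illustrative, not those used in this work.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Illustrative binary-classification data standing in for the image features.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

voting = VotingClassifier(
    estimators=[
        ('dt', DecisionTreeClassifier(random_state=0)),
        ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
        ('svm', SVC(probability=True, random_state=0)),  # soft voting needs predict_proba
        ('ada', AdaBoostClassifier(random_state=0)),
    ],
    voting='soft',  # average class probabilities across models, then argmax
)
voting.fit(X, y)
pred = voting.predict(X[:5])
print(pred)
```

`voting='soft'` implements Eq. 9 directly: each model's `predict_proba` output is averaged and the argmax over the averaged probabilities is returned.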