Selecting a CNN architecture is a difficult step because most existing architectures are deployed in large-scale applications: they contain millions of parameters, target thousands of classes, and require high computational power. To obtain an appropriate model, a CNN was designed that performs well on a small number of images with very limited computational resources (CPU and GPU). The proposed model has 5 convolution layers and 3 fully connected layers; ReLU is used as the activation function in the hidden layers to add nonlinearity during training of the network, and dropout is applied after the first two fully connected layers to prevent over-fitting. Because of the reduced number of classes, the limited hardware resources available, and the small number of images, the numbers of neurons, parameters, and filters were scaled down relative to pre-trained CNN models.
Given the input volume size (W1), the receptive field size (F), the stride (S), and the amount of zero padding (P), the spatial size of the output volume of each layer can be computed [22]. The following equation gives the exact output volume size for all layers of the proposed model.
Output size \((W2) = \frac{W1-F+2P}{S}+1\) (1)
Where: W1 is the size of the input volume, F is the filter size, P is the amount of zero padding, and S is the stride.
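Equation (1) can be checked with a short helper; the floor division is an assumption about how fractional sizes are handled, matching the "valid"-padding convention of common deep-learning frameworks rather than anything stated here:

```python
def conv_output_size(w1, f, p, s):
    """Spatial output size of a conv/pool layer, Eq. (1).

    Frameworks floor the division when (W1 - F + 2P)/S is
    fractional, which is assumed here.
    """
    return (w1 - f + 2 * p) // s + 1

# First convolutional layer of the proposed model:
# 150x150 input, 5x5 filter, no padding, stride 2.
print(conv_output_size(150, 5, 0, 2))  # 73
```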
A. Input layer: the input layer of our CNN model accepts images of size 150x150x3 belonging to two classes (suggested COVID-19 and non-COVID-19). This layer only passes the input to the first convolution layer without any computation; it therefore has no learnable parameters, and the number of parameters in this layer is 0.
B. Convolutional layer: the proposed model has five convolutional layers. The first convolutional layer filters the 150x150x3 input image using 32 kernels of size 5x5x3 with a stride of 2 pixels. Since ⌊(150 − 5)/2⌋ + 1 = 73, and since this layer has a depth of K = 32, the output volume of this layer is 73x73x32. The product of the output volume dimensions gives the total number of neurons in this first conv layer, 73*73*32 = 170,528, and each of these neurons is connected to a region of size 5x5x3 in the input volume. The second convolutional layer takes the (pooled) output of the first convolutional layer as input and filters it using 32 kernels of size 3x3x32. The third, fourth, and fifth convolutional layers are connected to each other without intervening pooling layers. The third convolutional layer takes the output of the second pooled convolutional layer as input and filters it with 64 kernels of size 3x3x32; the fourth convolutional layer has 64 kernels of size 5x5x64, and the fifth convolutional layer has 64 kernels of size 3x3x64. All convolutional layers of the proposed model use the ReLU nonlinearity as their activation function. ReLU is chosen because it is faster than other nonlinearities such as tanh for training deep CNNs with gradient descent [23].
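The first-layer figures quoted above can be reproduced directly; the parameter count at the end is the standard weights-plus-bias formula and is not a number quoted in the text:

```python
# Conv1 of the proposed model: 150x150x3 input, 32 kernels of 5x5x3, stride 2.
w1, f, p, s, k = 150, 5, 0, 2, 32

w2 = (w1 - f + 2 * p) // s + 1   # frameworks floor the fractional 73.5 to 73
neurons = w2 * w2 * k            # total neurons in the 73x73x32 output volume
print(w2, neurons)               # 73 170528

# Standard count: (filter weights + 1 bias) per kernel, for all 32 kernels.
params = (f * f * 3 + 1) * k
print(params)
```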
C. Pooling layer: there are three max-pooling layers, after the first, second, and fifth convolutional layers of the proposed model. The first max-pooling layer reduces the output of the first convolutional layer with a filter of size 3x3 and stride 1. The second max-pooling layer takes the output of the second convolutional layer as input and pools it using 2x2 filters with stride 1. The third max-pooling layer has a filter of size 2x2 with stride 2. These layers have no learnable parameters; they only perform a down-sampling operation along the spatial dimensions of the input volume, hence the number of parameters in these layers is 0.
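A minimal pure-Python sketch of the max-pooling operation described above; the 4x4 feature map and its values are invented for illustration:

```python
def max_pool2d(x, size, stride):
    """Max-pool a 2D grid (list of lists); no learnable parameters."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            row.append(max(x[i + di][j + dj]
                           for di in range(size) for dj in range(size)))
        out.append(row)
    return out

fmap = [[1, 3, 2, 1],
        [4, 6, 5, 0],
        [2, 8, 7, 3],
        [1, 2, 9, 4]]
# 2x2 window, stride 2 -- like the third max-pooling layer.
print(max_pool2d(fmap, 2, 2))  # [[6, 5], [8, 9]]
```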
D. Fully Connected (FC) layer: the proposed model has three fully connected layers, including the output layer. The first two fully connected layers have 64 neurons each, and the final layer, which is the output layer of the model, has only one neuron. The first FC layer accepts the output of the fifth conv layer after the 3D volume of data is converted into a vector (flattening). These layers compute the class score, and the number of neurons in each layer is predefined during the development of the model. As in an ordinary NN, and as the name implies, each neuron in a fully connected layer is connected to every value in the previous layer.
E. Output layer: the output layer is the last (third) FC layer of the model and has 1 neuron with a sigmoid activation function, because the model is designed to classify 2 classes (binary classification).
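A sketch of the single sigmoid output neuron; the 0.5 decision threshold and the mapping of the positive class to "COVID-19" are illustrative assumptions, not details given in the text:

```python
import math

def sigmoid(z):
    """Squash the output neuron's pre-activation to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(z, threshold=0.5):
    # Assumed convention: sigmoid >= 0.5 maps to the COVID-19 class.
    return "COVID-19" if sigmoid(z) >= threshold else "non-COVID-19"

print(sigmoid(0.0))   # 0.5 -- the decision boundary
print(predict(3.0))
print(predict(-3.0))
```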
F. Feature Extraction
Feature extraction is the main stage of most image classification problems because, before the classification stage starts, the important features used to classify the images must be extracted; when a CNN is trained, the network learns what type of features to extract from the input image. Features are extracted by the convolution layers of the CNN, and feature extraction is the main purpose of these layers. They contain a series of filters, or learnable kernels (Fig. 2), that aim to extract local features from the input image.
The feature map \(M_i\) is computed as:
\(M_i = \sum_{k} w_{ik} * x_k + b_i\) (2)
Where: \(w_{ik}\) is the filter connecting the kth input channel to feature map i, \(x_k\) is the kth channel of the input image, and \(b_i\) is the bias term.
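Equation (2) can be sketched in plain Python for a single feature map; following common framework practice, the `*` is implemented as cross-correlation (no kernel flip), and the toy input and kernel values are invented for illustration:

```python
def feature_map(channels, filters, bias):
    """Eq. (2): M_i = sum_k (w_ik * x_k) + b_i for one output map i.

    channels : list of 2D input channels x_k
    filters  : list of 2D kernels w_ik, one per input channel
    bias     : scalar b_i
    """
    fh, fw = len(filters[0]), len(filters[0][0])
    oh = len(channels[0]) - fh + 1
    ow = len(channels[0][0]) - fw + 1
    out = [[bias] * ow for _ in range(oh)]       # start every cell at b_i
    for x, w in zip(channels, filters):          # sum over input channels k
        for i in range(oh):
            for j in range(ow):
                out[i][j] += sum(w[di][dj] * x[i + di][j + dj]
                                 for di in range(fh) for dj in range(fw))
    return out

x = [[[1, 2, 0], [0, 1, 3], [2, 1, 0]]]   # one 3x3 input channel
w = [[[1, 0], [0, 1]]]                    # one 2x2 kernel
print(feature_map(x, w, 0.0))  # [[2.0, 5.0], [1.0, 1.0]]
```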
In this case the features capture the (grayscale, i.e., black-and-white) intensity patterns of the given image. Each value of the feature map is then passed through an activation function to add nonlinearity to the network. After the nonlinearity, the feature map is fed into a pooling layer to reduce its resolution and the computational complexity of the network. Extracting useful features from the input image thus consists of multiple similar steps: cascaded convolution layers, added nonlinearity, and pooling layers.
G. Classification of Proposed Model: in the proposed model, classification is performed in the fully connected layers. As shown in the model above (Fig. 3), it has a total of three fully connected layers, including the output layer. The main function of these layers is to classify the input image based on the features extracted by the convolution layers. The first fully connected layer accepts the output of the convolution and pooling layers, but these outputs are joined together and flattened into a single vector before being fed into the fully connected layer. Each value of the vector represents the probability that a certain feature (grayscale intensity in the dataset) belongs to a class. Fig. 3 illustrates how the input values flow into the fully connected layers of the network.
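The flatten-then-fully-connected flow described above can be sketched as follows; the weights, biases, and tiny input volume are invented for illustration and are far smaller than the model's real 64-unit layers:

```python
import math

def flatten(volume):
    """Join a 3D volume (channels x H x W) into a single vector."""
    return [v for ch in volume for row in ch for v in row]

def dense(x, weights, biases, activation):
    """One fully connected layer: every neuron sees every input value."""
    return [activation(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(weights, biases)]

relu = lambda z: max(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))

# Toy 1x2x2 volume standing in for the fifth conv layer's pooled output.
vol = [[[1.0, 0.0], [2.0, 1.0]]]
vec = flatten(vol)                                   # [1.0, 0.0, 2.0, 1.0]
h = dense(vec, [[0.5, -1.0, 0.25, 0.0]], [0.1], relu)  # hidden FC layer
score = dense(h, [[2.0]], [-0.5], sigmoid)[0]        # sigmoid output neuron
print(round(score, 3))
```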