Figure 1 depicts the architecture of the proposed model. It consists of two modules, namely the pneumonia prediction module and the severity prediction module.
3.1 Pneumonia Prediction Module
The chest X-ray images of the lungs are given as input to the model. The images are first passed through a pre-processing module: to remove unnecessary details from the chest X-rays and extract the information needed for further processing, we perform resizing, shuffling and normalization. Data augmentation such as rotation and scaling is also carried out. The images are resized to 224 × 224 pixels. The pre-processed X-ray images are then used to train Recurrent Neural Networks (RNNs).
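A minimal sketch of this pre-processing step is shown below, assuming a TensorFlow/Keras pipeline; the rotation and zoom factors are illustrative values, not the settings used in the experiments.

```python
# Minimal pre-processing sketch (illustrative parameters, not the authors'
# exact pipeline): resize to 224 x 224, normalize to [0, 1], and augment
# with rotation and scaling. Shuffling is done when batching the dataset.
import tensorflow as tf

def preprocess(image):
    """image: an H x W x 1 (or H x W x 3) chest X-ray tensor."""
    image = tf.image.resize(image, (224, 224))   # resizing
    image = tf.cast(image, tf.float32) / 255.0   # normalization
    return image

# Data augmentation: random rotation and scaling (zoom).
augment = tf.keras.Sequential([
    tf.keras.layers.RandomRotation(0.05),  # hypothetical rotation factor
    tf.keras.layers.RandomZoom(0.1),       # hypothetical zoom factor
])
```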
In a conventional neural network, the inputs and outputs are independent of each other. To address this, we use an RNN, in which the output of the (k−1)th step is fed as input to the kth step. RNNs are widely used because of their hidden layer, which plays a vital role in remembering the sequence of information; the network retains previously computed information through its "memory". The same weights and biases are applied at every step, which keeps the number of parameters from growing. Here x, h and y denote the input, the hidden state (activation) and the output, and Whx, Whh and Wyh denote the weights applied to the input-to-hidden, hidden-to-hidden and hidden-to-output connections, respectively, as shown in Figure 2. The output obtained at the last step is compared with the actual output, the error is calculated, and the error (gradient) is back-propagated through the network to adjust the weights so as to obtain the expected result. However, an RNN cannot remember long sequences of information, and the flow of gradients in an RNN leads to two major problems: vanishing gradients and exploding gradients. The gradient is computed by repeated multiplication of derivatives, so if the resulting gradient becomes too small it vanishes, and if it grows beyond a threshold it explodes.
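As a concrete illustration of this recurrence, the following NumPy sketch computes one forward step of a vanilla RNN using the weights Whx, Whh and Wyh described above; the shapes of the matrices are left to the caller and are purely illustrative.

```python
# One forward step of a vanilla RNN: the previous hidden state h_prev and
# the current input x_t produce the new hidden state h_t and the output y_t.
import numpy as np

def rnn_step(x_t, h_prev, Whx, Whh, Wyh, bh, by):
    h_t = np.tanh(Whx @ x_t + Whh @ h_prev + bh)  # hidden state ("memory")
    y_t = Wyh @ h_t + by                          # output of this step
    return h_t, y_t
```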
To overcome these problems, we move to LSTM and GRU.
3.1.1 Long Short-Term Memory:
An LSTM can remember long sequences of information with the help of gates. These gates control the flow of information within the network by eliminating unnecessary information from preceding steps and feeding the needed information to the succeeding steps. The architecture of the LSTM is shown in Figure 3.
The LSTM consists of three gates: the forget gate, the input gate and the output gate. These three gates determine the cell state (which acts as the memory), the central part of the LSTM; information is added to or removed from the cell based on the gates. xt denotes the current input, Ct the content of the latest cell state and Ct−1 the cell state of the previous LSTM unit; ht denotes the current output and ht−1 the previous LSTM unit's output. Wf, Wi and Wo are the weights, and bf, bi and bo the biases, applied to the forget, input and output gates, while WC and bC are the weight and bias of the candidate cell state. Two activation functions are used: (1) the sigmoid function (σ) and (2) the tanh function. The sigmoid activation is used in all three gates and squashes its output into the range 0 to 1, while the tanh activation squashes its output into the range −1 to 1 and is used for the candidate cell state and for the output. Using these functions the network learns which data are important and which are not; the gates therefore play the vital role of deciding which information is kept and which is discarded.
Forget gate:
The inputs to the forget gate are ht−1 and xt. Depending on these inputs, the sigmoid function decides which parts of Ct−1 to keep. If the output of σ is close to 0, the corresponding information is useless and is discarded from Ct−1; if it is close to 1, the information is useful and is kept in Ct−1 [35]. The equation for the forget gate is given in (1).
ft = σ(Wf ⋅ [ht−1, xt] + bf) (1)
Input Gate:
This gate is used to update the cell state. The inputs to the sigmoid function of the input gate are ht−1 and xt, and the same inputs are applied to the tanh function. The outputs of the two functions are multiplied, so that the sigmoid output decides how much of the tanh output (the candidate cell state) should be kept. The equations of the input gate are given in (2) and (3).
it = σ(Wi ⋅ [ht−1, xt] + bi) (2)
C̃t = tanh(WC ⋅ [ht−1, xt] + bC) (3)
Output Gate:
The sigmoid function takes two inputs, the current input (xt) and the output of the preceding hidden layer (ht−1). The updated cell state is passed through the tanh function, and the outputs of the sigmoid and tanh functions are multiplied to produce the next hidden state. The equations are given in (4) and (5).
Ot = σ(WO ⋅ [ht−1, xt] + bO) (4)
ht = Ot tanh(Ct) (5)
Cell State:
The previous cell state Ct−1 is multiplied by the output of the forget gate, which decides whether the old cell state is kept or forgotten. The result is then added to the output of the input gate in order to update the cell state to its new value. The equation of the cell state is given in (6).
Ct = ft Ct−1 + it C̃t (6)
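The following NumPy sketch implements one LSTM step directly from equations (1) to (6); the weight shapes (each W acting on the concatenation [ht−1, xt]) are an assumption made here for illustration.

```python
# One LSTM step following equations (1)-(6). Each W has shape
# (hidden, hidden + input); z is the concatenation [h_{t-1}, x_t].
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, Wf, bf, Wi, bi, Wc, bc, Wo, bo):
    z = np.concatenate([h_prev, x_t])     # [h_{t-1}, x_t]
    f_t = sigmoid(Wf @ z + bf)            # (1) forget gate
    i_t = sigmoid(Wi @ z + bi)            # (2) input gate
    c_tilde = np.tanh(Wc @ z + bc)        # (3) candidate cell state
    o_t = sigmoid(Wo @ z + bo)            # (4) output gate
    c_t = f_t * c_prev + i_t * c_tilde    # (6) new cell state
    h_t = o_t * np.tanh(c_t)              # (5) new hidden state / output
    return h_t, c_t
```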
3.1.2 Gated Recurrent Unit (GRU):
A GRU is similar to an LSTM but uses only two gates, (1) the reset gate and (2) the update gate, whereas an LSTM uses three. Because it has fewer gates, a GRU is somewhat faster than an LSTM, with only a slight difference in performance. The update gate plays a role similar to the combined input and forget gates of the LSTM, while the reset gate decides to what extent past information is discarded. Unlike the LSTM, the GRU has no separate cell state and no output gate.
The equations of the GRU cell are given in (7) to (10):
rt = σ(Wxr xt + Whr ht−1 + br) (7)
zt = σ(Wxz xt + Whz ht−1 + bz) (8)
h̃t = tanh(Wxh xt + Whh (rt ⊙ ht−1) + bh) (9)
ht = zt ⊙ ht−1 + (1 − zt) ⊙ h̃t (10)
where rt, zt, xt and ht are the reset gate, update gate, input vector and output vector, respectively; W and b denote the weights and biases, σ and tanh denote the sigmoid and hyperbolic tangent activation functions, and ⊙ denotes element-wise multiplication.
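In the same style as the LSTM sketch above, one GRU step following equations (7) to (10) can be written as the NumPy sketch below (weight shapes are again illustrative).

```python
# One GRU step following equations (7)-(10); * is element-wise multiplication.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wxr, Whr, br, Wxz, Whz, bz, Wxh, Whh, bh):
    r_t = sigmoid(Wxr @ x_t + Whr @ h_prev + br)               # (7) reset gate
    z_t = sigmoid(Wxz @ x_t + Whz @ h_prev + bz)               # (8) update gate
    h_tilde = np.tanh(Wxh @ x_t + Whh @ (r_t * h_prev) + bh)   # (9) candidate state
    h_t = z_t * h_prev + (1.0 - z_t) * h_tilde                 # (10) new output
    return h_t
```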
We train the model using two RNN architectures, LSTM and GRU. The features obtained from each architecture are used independently to predict pneumonia. If the prediction is positive, we assess the severity of the patient's condition using a CT scan, since a CT scan provides more detail than an X-ray image; if the prediction is negative, no CT scan is needed. The severity of pneumonia is identified using Convolutional Neural Networks (CNNs) with several architectures: LeNet-5, AlexNet, VGG-16, GoogLeNet and ResNeXt-50. The results obtained from each architecture are tabulated and compared to determine the severity of the disease. A sketch of such an LSTM/GRU-based classifier is given below.
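The Keras sketch below is one hedged way to build the binary pneumonia classifier described above. Treating each of the 224 image rows as one time step of 224 features is an assumption made here for illustration (the text does not fix how the image is presented to the RNN), and the layer sizes and optimizer are illustrative rather than the reported settings; GRU can be swapped in for LSTM unchanged.

```python
# Sketch of the pneumonia prediction network (assumed input: 224 time steps
# of 224 features, i.e. one image row per step; sizes are illustrative).
import tensorflow as tf
from tensorflow.keras import layers, models

def build_rnn_classifier(cell="lstm"):
    rnn_layer = layers.LSTM(128) if cell == "lstm" else layers.GRU(128)
    model = models.Sequential([
        layers.Input(shape=(224, 224)),          # 224 time steps x 224 features
        rnn_layer,                               # LSTM or GRU feature extractor
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),   # pneumonia: positive / negative
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```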
3.2 Severity Prediction Module:
When pneumonia is confirmed for a patient, we recommend that they undergo a CT scan, which is then pre-processed. The processed image is analyzed using a Convolutional Neural Network (CNN). The hyperparameters of the CNN models are determined without human intervention by grid search, which is an added advantage of using CNN models.
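A simple sketch of such an automatic hyperparameter search is given below; the searched grid (filter counts and learning rates) and the build_cnn helper are hypothetical, since the exact search space is not listed here.

```python
# Plain grid search over two hypothetical CNN hyperparameters.
# build_cnn(filters, lr) is assumed to return a compiled Keras model;
# x_train / y_train / x_val / y_val are the prepared CT-scan data.
import itertools

def grid_search(build_cnn, x_train, y_train, x_val, y_val):
    best_score, best_params = 0.0, None
    for filters, lr in itertools.product([16, 32, 64], [1e-2, 1e-3, 1e-4]):
        model = build_cnn(filters=filters, lr=lr)
        model.fit(x_train, y_train, epochs=5, verbose=0)
        _, acc = model.evaluate(x_val, y_val, verbose=0)
        if acc > best_score:                      # keep the best validation accuracy
            best_score, best_params = acc, (filters, lr)
    return best_params, best_score
```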
The CNN is a powerful architecture in that it loosely mimics the neural pattern of the human visual system. It consists of several layers, each containing a set of neurons that analyze portions of the image. A CNN compares the image portion by portion; the patterns it looks for are called filters or features. It extracts the image features and maps them to a lower dimension without losing the image's essential characteristics. To do so, it has the following layers.
Input Layer
The CT scan image is given as input to the input layer. Before feeding it to the input layer, the image is converted into a single-dimensional column vector. For example, if the image dimension is 32 × 32, it is reshaped into a single column of size 1024 × 1; if there are n training samples, the input dimension is (1024, n).
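This flattening can be done with a single NumPy reshape, as in the short sketch below (sizes follow the 32 × 32 example; the image values are dummy data).

```python
# Flatten n images of size 32 x 32 into a (1024, n) input matrix,
# matching the example above.
import numpy as np

images = np.random.rand(5, 32, 32)               # n = 5 dummy images
inputs = images.reshape(images.shape[0], -1).T   # shape (1024, 5)
print(inputs.shape)                              # (1024, 5)
```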
Convolution Layer
The image features are extracted in this layer, which is therefore also called the feature-extractor layer; it is mainly used to extract the important features from the input image. The layer contains numerous filters that perform the convolution operation shown in Figure 5. Taking the dot product between a portion of the input image (of the same size as the filter) and the filter is termed a convolution operation, and each such product yields a single number of the output feature map. For example, if the input image is 6 × 6 and the filter is 3 × 3, the expected output dimension is 4 × 4. The filter is then slid over the succeeding portions of the input image until the entire image has been covered, and the output of this layer becomes the input to the next layer. Two example dot products are shown below. This layer also includes the ReLU (Rectified Linear Unit) activation function, which sets negative values to zero and is applied to the output of the convolution operation.
3×1 + 0×0 + 1×(−1) + 1×1 + 5×0 + 8×(−1) + 2×1 + 7×0 + 2×(−1) = 3 − 1 + 1 − 8 + 2 − 2 = −5
0×1 + 1×0 + 2×(−1) + 5×1 + 8×0 + 9×(−1) + 7×1 + 2×0 + 5×(−1) = −2 + 5 − 9 + 7 − 5 = −4
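The NumPy sketch below performs the same sliding dot product ("valid" convolution) followed by ReLU. The 3 × 3 filter and the two patches are inferred from the worked products above; the full 6 × 6 image of Figure 5 is not reproduced here.

```python
# "Valid" convolution (sliding dot product) followed by ReLU.
import numpy as np

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)  # dot product
    return out

def relu(x):
    return np.maximum(x, 0)  # negative values become zero

kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])
patch1 = np.array([[3, 0, 1], [1, 5, 8], [2, 7, 2]])
patch2 = np.array([[0, 1, 2], [5, 8, 9], [7, 2, 5]])
print(np.sum(patch1 * kernel), np.sum(patch2 * kernel))  # -5 -4, as above
```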
Pooling Layer:
This layer reduces the dimension of the image obtained from the convolution layer and is also called down-sampling; it is mainly used to reduce the computational overhead. We can perform either average pooling (which takes the average of the pixels in each window) or max pooling (which takes the maximum). For example, if the input to this layer is 4 × 4, max pooling converts it to 2 × 2, as shown in Fig. 6.
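A minimal sketch of 2 × 2 max pooling with stride 2, matching the 4 × 4 to 2 × 2 example above (the input values are arbitrary), is:

```python
# 2 x 2 max pooling with stride 2: a 4 x 4 feature map becomes 2 x 2.
import numpy as np

def max_pool_2x2(x):
    h, w = x.shape
    # group the rows and columns in pairs, then take the maximum of each block
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 1, 2],
                        [7, 2, 9, 0],
                        [3, 1, 4, 8]])
print(max_pool_2x2(feature_map))
# [[6 4]
#  [7 9]]
```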
Fully Connected Layer (FC)
The FC part has three layers: the FC input layer, the FC layer and the FC output (softmax) layer. The input layer converts the output of the pooling layer into a single vector, the FC layer applies weights and biases to the features generated by the input layer, and the softmax layer generates the probabilities used for multi-class prediction.
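A short sketch of this stage is shown below: the pooled features are flattened, a weight matrix and bias are applied, and a softmax turns the scores into class probabilities (all shapes and values are illustrative).

```python
# Fully connected stage: flatten -> affine transform -> softmax probabilities.
import numpy as np

def softmax(z):
    z = z - np.max(z)        # subtract the max for numerical stability
    e = np.exp(z)
    return e / np.sum(e)

pooled = np.random.rand(2, 2, 16)        # illustrative pooled feature maps
flat = pooled.reshape(-1)                # FC input layer: single vector (64,)
W = np.random.rand(3, flat.size)         # 3 output classes (illustrative)
b = np.zeros(3)
probs = softmax(W @ flat + b)            # class probabilities, sum to 1
```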
Architectures of CNN:
To predict the severity of pneumonia we use five of the most popular CNN architectures, as follows.
LeNet-5:
LeNet-5 was introduced in 1998. It has two convolution layers, two sub-sampling (pooling) layers and three FC layers, and contains nearly 60,000 parameters. The convolution filter size is 5 × 5, giving 6 × (5 × 5 + 1) = 156 weights in total, where the +1 accounts for the bias. Each pixel in the first convolution layer (C1) is connected to 5 × 5 pixels and 1 bias, resulting in 156 × 28 × 28 = 122,304 connections in total.
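A hedged Keras sketch of a LeNet-5-style network of this shape is given below; the activations and pooling type follow a common modern variant rather than the original 1998 design, and the number of output classes is illustrative.

```python
# LeNet-5-style network: two 5x5 convolutions with pooling, then three
# fully connected layers (roughly 60,000 parameters in total).
import tensorflow as tf
from tensorflow.keras import layers, models

n_classes = 3  # illustrative number of output classes

lenet5 = models.Sequential([
    layers.Input(shape=(32, 32, 1)),
    layers.Conv2D(6, kernel_size=5, activation="tanh"),    # C1: 6*(5*5+1)=156 weights
    layers.AveragePooling2D(pool_size=2),                  # S2
    layers.Conv2D(16, kernel_size=5, activation="tanh"),   # C3
    layers.AveragePooling2D(pool_size=2),                  # S4
    layers.Flatten(),
    layers.Dense(120, activation="tanh"),                  # FC1
    layers.Dense(84, activation="tanh"),                   # FC2
    layers.Dense(n_classes, activation="softmax"),         # output (softmax) layer
])
```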
AlexNet:
AlexNet was introduced in 2012. It consists of five convolution layers and three fully connected layers, with nearly 60 million parameters. The input image size is 224 × 224 × 3. The image passes through the first convolution layer with an 11 × 11 filter, a 5 × 5 filter in the second layer, and 3 × 3 filters in layers three to five. Max pooling with a 3 × 3 window and stride 2 is used, followed by the three fully connected layers. AlexNet has more convolution layers and far more parameters than LeNet; overfitting is addressed using dropout layers, and the max pooling layers reduce the size of the feature maps.
VGG-16 (Visual Geometry Group):
This architecture was developed in 2014 and was the first runner-up of ILSVRC 2014. It has thirteen convolution layers and three FC layers, with about 138 million parameters. VGGNet reduces the number of parameters per convolution layer by using small filters, which improves training compared with AlexNet's larger filters. The convolution filter size is 3 × 3, and max pooling uses 2 × 2 windows with a stride of 2.
GoogLeNet (Inception):
GoogLeNet was the winner of ILSVRC 2014 and was named in homage to Prof. Yann LeCun's LeNet. Compared with VGGNet it achieves a lower error rate [49-50]. It has been released in several versions: V1 (2014), V2 and V3 (2015) and V4 (2016). V1 consists of 1 × 1, 3 × 3 and 5 × 5 convolution layers along with max pooling layers; in total it contains 22 layers and about 5 million parameters, with an error rate of 6.67%. In V2, batch normalization was introduced to reduce the parameters of V1, and the 5 × 5 convolutions were replaced by 3 × 3 convolutions; this ultimately reduces cost while increasing accuracy, with an error rate of 4.8%. V3 consists of 48 layers, with the parameter count increasing to about 24 million; it produces higher accuracy, reducing the error rate to 3.58%, roughly half that of V1. V4 was introduced in 2016 and combines Inception with ResNet (Residual Network); due to the residual connections it trains faster, the stem module was upgraded, and it has about 43 million parameters.
ResNeXt-50 (Residual Network with 50 Layers):
ResNeXt-50 was introduced in 2017. Compared with the other architectures it has a lower error rate of 3.03%. It has fifty layers and about 25.5 million parameters, and its residual connections address the vanishing-gradient problem that arises as CNNs grow deeper.