Deep learning is a subset of machine learning in the field of artificial intelligence that mimics the neurons of the human brain through neural networks [27, 28]. It allows machines to solve complex problems in pattern recognition and computer vision, even when the data are highly diverse, unstructured, and interconnected. Artificial neural networks play a significant role in these tasks, but regular artificial neural networks do not scale to varied and large data sizes. To overcome this limitation, deep neural networks offer the advantage of integrated feature extractors and classifiers [26].
Many algorithms exist among deep learning techniques, but in this study we focused on convolutional neural networks, since they accept raw images as input and require no separate feature extraction beyond size normalization [26].
Convolutional neural networks (CNNs or ConvNets) are a kind of neural network used for supervised learning [20] and one of the most promising deep learning techniques for pattern recognition tasks [29]. They are preferable for image and video processing, especially for object detection, face recognition, pattern recognition, etc.
In regular artificial neural networks, the network receives a single vector as input and transforms it through a series of hidden layers to reach the last fully-connected layer, known as the output layer. Each hidden layer is fully connected to all neurons of the previous layer. For classification applications, the output layer holds the class scores. In contrast, the convolutional layers of CNNs have neurons arranged in 3D volumes. For example, if the input to a CNN is an image, the image is arranged in a 3D (width, height, depth) format, where width and height represent the dimensions of the input image and the depth represents its channels (i.e., R, G, B). Fig. 3 and Fig. 4 show the arrangements of the neurons in regular networks and CNNs, respectively.
As the recent literature shows, CNNs have been used extensively for character and numeral recognition tasks [4]. They have the advantage of sharing weights across layers, which reduces the number of parameters and improves performance [9].
3.2.1 Convolutional Neural Network Layers
Convolutional neural networks consist of multiple layers, as shown in Fig. 4. Convolutional layers, pooling layers, and fully-connected layers are the main layers used to build a CNN architecture. In addition to these layers, a CNN architecture requires activation functions such as sigmoid, ReLU, tanh, Leaky ReLU, etc. to be functional. In general, a CNN can be divided into two parts: feature extraction and classification. The convolution and pooling operations form the feature extraction part, whereas the fully-connected layers form the classification part. Brief descriptions of the CNN layers are given in the following sub-sections.
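To make this two-part structure concrete, the following is a minimal sketch in Keras; the input shape, filter counts, and number of classes are illustrative assumptions, not the exact configuration of this study.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative two-part CNN: a 28x28 grayscale input and 10 classes
# are assumptions for the sketch, not this study's configuration.
model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    # Feature extraction part: convolution + pooling
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Classification part: flatten + fully-connected + SoftMax output
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```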
Features are the pixel-level information of an image and are important components of character recognition: in image recognition, the relevant features must be extracted before classification. To extract these features, a CNN consists of input, convolutional, activation, padding, and pooling layers.
Input Layer: - This layer holds the image data: a color or grayscale three-dimensional matrix filled with pixel values. Depending on how the data are stored, the pixel values may first need to be reshaped into the expected input format. The pixel values are normalized by dividing each value by 255 to bring them into the range 0 to 1, which makes learning faster and yields better model performance.
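A minimal NumPy sketch of this normalization step (the 28x28 grayscale shape is an assumption for the example):

```python
import numpy as np

# Illustrative pre-processing for the input layer.
image = np.random.randint(0, 256, size=(28, 28, 1), dtype=np.uint8)
x = image.astype("float32") / 255.0   # pixel values now in [0, 1]
```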
Convolution Layer: - Convolution is one of the core building blocks of a CNN, because the features of the image are extracted within this layer. The convolution operation is computed by sliding vital components called kernels across the patches of the input image [30]. The kernel produces a feature map that holds, at each position, the sum of the element-wise matrix multiplication between the kernel and the image patch. This feature map is then fed to subsequent layers to learn several other features of the input image.
In the following formula, the feature map F is calculated from the input image I and kernel K. The indexes of the width and height of the result matrix are marked with a and b, respectively.
$$F\left[a,b\right]=\left(I\text{*}K\right)\left[a,b\right]=\sum _{i}\sum _{j}I\left[a-i,b-j\right]K[i,j]$$
1
Fig. 5 shows how a 3x3-pixel filter is convolved over a 6x6-pixel input image and generates a 4x4 feature map. Formula 2 specifies the size of the feature map.
$$n_{new}=n-f+1$$
2
Where \(n_{new}\) represents the feature map size, n represents the dimension of the input image, and f represents the filter or kernel dimension. In this study, we applied a kernel size of (3,3) at each convolutional layer.
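A minimal NumPy sketch of this operation (stride 1, no padding; note that, as in most deep learning libraries, the kernel is applied without the flip of Eq. 1, i.e. as cross-correlation):

```python
import numpy as np

def conv2d_valid(image, kernel):
    # Slide the kernel over the image and, at each position, sum the
    # element-wise products, producing the feature map of Eq. 1.
    n, f = image.shape[0], kernel.shape[0]
    n_new = n - f + 1                        # Eq. 2: feature-map size
    fmap = np.zeros((n_new, n_new))
    for a in range(n_new):
        for b in range(n_new):
            fmap[a, b] = np.sum(image[a:a + f, b:b + f] * kernel)
    return fmap

image = np.arange(36, dtype=float).reshape(6, 6)   # 6x6 input, as in Fig. 5
kernel = np.ones((3, 3))                           # 3x3 kernel, as in this study
print(conv2d_valid(image, kernel).shape)           # (4, 4) feature map
```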
Activation / Non-Linearity: - Any kind of neural network needs an activation, or non-linearity, to be powerful. In the hidden layers, we may consider activation functions such as the Rectified Linear Unit (ReLU), Logistic (Sigmoid), and Hyperbolic Tangent (Tanh).
In CNNs, ReLU is the most commonly used activation function; it is applied to the convolution result plus its bias value. The ReLU activation sets all negative values to zero. Eq. 3 gives the mathematical expression of the ReLU function as applied here.
$$y=\max\left(0,\left(w\text{*}I\right)+b\right)$$
3
where w denotes the weights, I the input, and b the bias.
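A one-line NumPy sketch of the ReLU itself (the input values are illustrative):

```python
import numpy as np

def relu(x):
    # Eq. 3 without the convolution term: negatives become zero,
    # positives pass through unchanged.
    return np.maximum(0.0, x)

pre_activation = np.array([[-1.5, 0.0], [2.0, -0.3]])   # (w * I) + b, illustrative
print(relu(pre_activation))                             # [[0. 0.] [2. 0.]]
```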
Stride and Padding: - The stride specifies how far the kernel moves at each step. By default, the stride is one, meaning the kernel slides pixel by pixel. A larger stride can be set if less overlap between the receptive fields is required; a bigger stride makes the resulting feature map smaller, and potential kernel locations are skipped. In our model, we applied the default stride value.
In a ConvNet, the feature map is always smaller than the input because of the convolution applied, so its dimensions no longer match the input size and the feature map shrinks. To prevent this shrinking, padding is a desirable technique; when padding is applied, the feature map size is computed as follows.
$$n_{new}=n+2p-f+1$$
4
Where \(n_{new}\) represents the feature map size, n represents the dimension of the input image, f represents the filter or kernel dimension, and p is the padding applied to the input image.
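A small helper capturing Eqs. 2 and 4, generalized with a stride s (the stride term is our addition; with s = 1 it reduces to the formulas above):

```python
def feature_map_size(n, f, p=0, s=1):
    # floor((n + 2p - f) / s) + 1; with p = 0 and s = 1 this is Eq. 2,
    # and with s = 1 it is Eq. 4.
    return (n + 2 * p - f) // s + 1

print(feature_map_size(6, 3))        # 4 -> the 6x6 input of Fig. 5
print(feature_map_size(6, 3, p=1))   # 6 -> padding preserves the input size
```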
The general convolutional layer is expressed in the following equations.
$${C}_{p}^{l}=\sigma (I\text{*}{K}_{p}^{l}+{b}_{p}^{l})$$
5
Where \(\sigma\) is the activation function used; for example, if we use a sigmoid activation function, it is given by the following.
$$\sigma \left(x\right)=\frac{1}{1+{e}^{-x}}$$
6
$${C}_{p}^{l}\left(a,b\right)=\sigma \left(\left(I\text{*}K\right)\left[a,b\right]+{b}_{p}^{l}\right)=\sigma \left(\sum _{i}\sum _{j}I\left[a-i,b-j\right]\text{*}K\left[i,j\right]+{b}_{p}^{l}\right)$$
7
Where C is a convolution layer, l indicates the layer number, p indexes the feature map, and i, j denote the kernel indices. The indices a, b are the row and column positions of the feature map, respectively. σ is the activation function, I is the image, K is the kernel, and b is the bias [31].
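Putting Eqs. 6 and 7 together, a NumPy sketch of one convolutional layer (single kernel, stride 1, no padding; again applied without the kernel flip, as libraries do):

```python
import numpy as np

def sigmoid(x):
    # Eq. 6
    return 1.0 / (1.0 + np.exp(-x))

def conv_layer(image, kernel, bias):
    # Eq. 7: convolve, add the per-map bias, then apply the activation.
    f = kernel.shape[0]
    n_new = image.shape[0] - f + 1
    out = np.zeros((n_new, n_new))
    for a in range(n_new):
        for b in range(n_new):
            out[a, b] = np.sum(image[a:a + f, b:b + f] * kernel) + bias
    return sigmoid(out)

image = np.random.rand(6, 6)
print(conv_layer(image, np.ones((3, 3)), bias=0.1).shape)   # (4, 4)
```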
Pooling Layer: - The pooling layer usually serves as a bridge between the convolutional layer and the fully-connected layer. After the convolution, activation, stride, and padding operations are done, pooling is the next step. The pooling layer takes the output of the previous layer and down-samples it by choosing the prominent features [30]. This down-sampling reduces the dimensionality of each convolutional layer's output, which makes the computation more efficient; it helps both to minimize the training time and to overcome overfitting.
Although many pooling techniques exist, max-pooling is the most common one for the pooling operation, because it keeps the most important or most abundant features. It takes the maximum value by sliding an "n x n" window over the input.
Figure 6. Max pooling: 2x2 windows from a 4x4 input with a one-step stride
Figure 6 shows the computation process and the result of the max-pooling layer using a 2x2 window with stride 1. In our study, we also applied max pooling.
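A NumPy sketch of the max-pooling operation of Fig. 6:

```python
import numpy as np

def max_pool(fmap, window=2, stride=1):
    # Keep the maximum of each window x window region, as in Fig. 6
    # (2x2 window, stride 1: a 4x4 input yields a 3x3 output).
    n_new = (fmap.shape[0] - window) // stride + 1
    out = np.zeros((n_new, n_new))
    for a in range(n_new):
        for b in range(n_new):
            r, c = a * stride, b * stride
            out[a, b] = fmap[r:r + window, c:c + window].max()
    return out

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(fmap))   # 3x3 matrix of window maxima
```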
Fully-connected layer: - In convolutional neural networks, a fully-connected layer is just like a hidden layer of a regular neural network and operates on a 1D vector. Since the convolutional layers generate 3D data, the data must be flattened into 1D. The flattened data passes through one or more fully-connected layers and feeds the SoftMax unit, which holds the output classes. Eq. 8 expresses such a layer, where f is the flattened feature vector, w the weights, b the bias, and σ the activation.
$$\widehat{o}=\sigma \left(w\text{*}f+b\right)$$
8
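A NumPy sketch of the flattening step and Eq. 8 (all shapes are illustrative assumptions):

```python
import numpy as np

volume = np.random.rand(3, 3, 8)   # 3D output of the last conv/pool layer
f = volume.reshape(-1)             # flattened to a 1D vector of 72 values

w = np.random.rand(10, f.size)     # weights of a 10-unit fully-connected layer
b = np.random.rand(10)
scores = w @ f + b                 # (w * f) + b of Eq. 8; the SoftMax unit
                                   # below turns these scores into probabilities
```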
Output layer: - The output layer unit assigns a probability value to each of the possible classes [32]. The class with the largest probability is taken as the predicted label. In one-hot representation, the largest probability value is set to "1" and the remaining values are set to "0". In a ConvNet, the SoftMax unit is taken as the output layer; it calculates the probability that a given image belongs to a particular output class.
$$p(y=j|{\theta }^{\left(i\right)})=\frac{{e}^{{\theta }_{j}^{\left(i\right)}}}{\sum _{k=0}^{K}{e}^{{\theta }_{k}^{\left(i\right)}}}$$
9
Where, "\({\theta }^{\left(i\right)}\)" holds the value of the transpose of the weight's matrix \(W\) multiplied by the feature vector X. Given as follows:
$$\theta ={W}_{0}{X}_{0}+{W}_{1}{X}_{1}+\dots +{W}_{k}{X}_{k}=\sum _{i=0}^{k}{W}_{i}{X}_{i}={W}^{T}X$$
10
Where \(X\) and \(W\) are the network input and the weight vector, respectively. The term "\({W}_{0}{X}_{0}\)" is the bias.
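A NumPy sketch of the SoftMax unit of Eq. 9 (the score values are illustrative):

```python
import numpy as np

def softmax(theta):
    # Eq. 9: class probabilities from the scores theta = W^T X.
    # Subtracting the maximum is a standard numerical-stability trick.
    e = np.exp(theta - theta.max())
    return e / e.sum()

theta = np.array([1.0, 3.0, 0.5])
p = softmax(theta)
print(p, p.argmax())   # probabilities sum to 1; argmax is the predicted class
```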
Loss function: - Since this work is a multi-class classification problem, we applied the multi-class cross-entropy loss function to optimize the parameter values of the model.
Assuming that the actual output label is \(O\) and the predicted output is \(\widehat{O}\), the loss function is expressed by:
$$L=-\sum _{i=1}^{n}O\left(i\right)\mathrm{log}\widehat{O}\left(i\right)$$
11
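A NumPy sketch of this loss for a single example (the label and prediction values are illustrative):

```python
import numpy as np

def cross_entropy(o_true, o_pred, eps=1e-12):
    # Eq. 11: multi-class cross-entropy between the one-hot label O
    # and the SoftMax probabilities O-hat; eps guards against log(0).
    return -np.sum(o_true * np.log(o_pred + eps))

o_true = np.array([0.0, 1.0, 0.0])     # one-hot actual label
o_pred = np.array([0.1, 0.8, 0.1])     # SoftMax output
print(cross_entropy(o_true, o_pred))   # ~0.223: small loss for a good prediction
```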