A Convolutional Neural Network (ConvNet/CNN) is a deep learning method that takes an input image, assigns importance (learnable weights and biases) to various aspects or objects in the image, and distinguishes between them. Convolution is a linear mathematical operation between matrices. A CNN is built from convolutional layers, non-linearity layers, pooling layers, and fully connected layers. CNNs have performed well on a wide range of machine learning problems. The dataset under study was used to train deep learning models (S. Fan et al., 2020), namely EfficientNet and InceptionNet (Patil, 2018).
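The linearity of convolution, and the way a non-linearity and a pooling step follow it, can be illustrated with a small NumPy sketch. The 6x6 inputs, the 3x3 kernel, and the helper name `conv2d_valid` are illustrative assumptions, not part of the models trained in this work. The sliding-window operation shown is the cross-correlation form that deep learning frameworks conventionally call convolution.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=(6, 6))   # hypothetical single-channel inputs
b = rng.normal(size=(6, 6))
k = rng.normal(size=(3, 3))   # hypothetical kernel (learned in a real CNN)

# Linearity: conv(2a + 3b) equals 2*conv(a) + 3*conv(b).
lhs = conv2d_valid(2 * a + 3 * b, k)
rhs = 2 * conv2d_valid(a, k) + 3 * conv2d_valid(b, k)
is_linear = np.allclose(lhs, rhs)

# Non-linearity layer (ReLU) followed by 2x2 max pooling with stride 2.
activated = np.maximum(lhs, 0.0)
pooled = activated.reshape(2, 2, 2, 2).max(axis=(1, 3))

print(lhs.shape, is_linear, pooled.shape)
```

The ReLU step is what breaks linearity; without it, stacked convolutional layers would collapse into a single linear map.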
EfficientNet is a convolutional neural network architecture and scaling approach that uses a compound coefficient to scale the depth, width, and resolution of the network together. Unlike conventional practice, which scales these factors arbitrarily, the EfficientNet scaling method scales network width, depth, and resolution uniformly with a set of fixed scaling coefficients (A. Kumar et al., 2021). The model was implemented using the transfer learning approach. A peak accuracy of 91% was achieved.
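Compound scaling can be sketched in a few lines. The base coefficients below (1.2 for depth, 1.1 for width, 1.15 for resolution) are the values commonly quoted for the EfficientNet baseline and are an assumption here; the text above does not state which coefficients were used in this work. The function name `compound_scale` is hypothetical.

```python
# Assumed base coefficients for depth, width, and resolution.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

# The coefficients are chosen so that the compute cost roughly doubles
# for each unit increase of phi: alpha * beta^2 * gamma^2 is close to 2.
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2

for phi in range(4):
    depth, width, resolution = compound_scale(phi)
    print(f"phi={phi}: depth x{depth:.2f}, width x{width:.2f}, resolution x{resolution:.2f}")
print(f"FLOPs growth per unit phi: {flops_factor:.3f}")
```

A single coefficient therefore controls all three dimensions at once, which is what distinguishes this scheme from scaling each dimension by hand.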
Inception v3 is an image recognition model that has been shown to attain greater than 78.1% accuracy on the ImageNet dataset. Because of its high accuracy on image classification problems, InceptionNet was implemented in an attempt to improve on the previous result. The model is the culmination of many ideas developed by multiple researchers over the years. It is based on the original paper "Rethinking the Inception Architecture for Computer Vision" (Szegedy et al., 2016). Its symmetric and asymmetric building blocks include convolutions, average pooling, max pooling, concatenations, dropouts, and fully connected layers. Batch normalization is applied to activation inputs and is used extensively throughout the model. The loss is computed using softmax. A peak accuracy of 94% was achieved. Although it produced better results on the test data than the EfficientNet model, the model took a long time to train, and its predictions on test data were still not satisfactory.
The Vision Transformer (ViT) has emerged as a viable alternative to convolutional neural networks (CNNs), which are the current state of the art in computer vision and are widely employed in image recognition applications. In terms of computational efficiency and accuracy, ViT models have been reported to outperform state-of-the-art CNNs by almost a factor of four.
Transformers can attend to regions that are far apart from the very first layers of the network. This is a significant advantage over CNNs, whose receptive field is limited in the early layers. Another advantage of transformer models is that they are highly parallelizable.
The attention mechanism emphasizes the crucial parts of the input data and fades out the rest. A self-attention module replaces the convolutional layer, so that the model can relate each pixel to pixels far from its location. Self-attention is a type of attention mechanism that allows every element of a sequence to interact with the others and determine which of them it should attend to most. Self-attention can therefore address some of the limitations of convolutional networks. The difference in behavior stems from the inductive biases built into CNNs, which help these networks learn the particularities of the analyzed images more quickly, but which also limit them and make it harder to capture global relations.
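The self-attention computation described above can be written as a minimal single-head sketch in NumPy. The sequence length, embedding size, and random projection matrices are illustrative assumptions; in a trained model the projections are learned.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (n, n): every element scores every other
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, dim = 5, 8                           # hypothetical sizes
x = rng.normal(size=(seq_len, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Each row of `weights` holds how strongly one element attends to every element of the sequence, including distant ones, which is the global interaction that a small convolution kernel cannot provide in a single layer.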
Vision Transformers, on the other hand, are free of these biases, allowing them to capture global and wider-ranging relationships at the cost of more time-consuming training. Vision Transformers were also found to be significantly more robust to input visual distortions such as adversarial patches or permutations (Park et al., 2022).
The ViT model is made up of multiple Transformer blocks. Within each block, a MultiHeadAttention layer applies self-attention to the sequence of patches. The Transformer blocks produce a tensor of shape [batch_size, num_patches, projection_dim], which is then processed by a softmax classifier head to produce the final class probabilities.
The first stage of the model splits an input image into a sequence of image patches. These patches are then passed through a trainable linear projection layer: the PatchEncoder layer linearly transforms each patch by projecting it onto a vector of size projection_dim. This layer serves as an embedding layer, producing fixed-size vectors. A learnable position embedding is then added to each projected vector so that the sequence preserves positional information. It injects crucial information about the relative or absolute position of each image patch in the sequence.
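The patching and embedding stage can be sketched in NumPy as follows. The 224x224 RGB input and the 16x16 patch size match the pre-trained model used in this work; the projection size of 64, the random weights, and the helper name `image_to_patches` are assumptions made only for illustration.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch size")
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)          # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))                        # stand-in for a real image
patch_size, projection_dim = 16, 64                      # projection_dim is hypothetical

patches = image_to_patches(image, patch_size)            # (196, 768)

# Trainable parameters in a real model; random values here.
w_proj = rng.normal(scale=0.02, size=(patches.shape[1], projection_dim))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], projection_dim))

# Linear projection followed by the addition of one position embedding per patch.
tokens = patches @ w_proj + pos_embed                    # (196, 64)
print(patches.shape, tokens.shape)
```

A 224x224 image with 16x16 patches yields (224 / 16)^2 = 196 patches, each holding 16 * 16 * 3 = 768 raw values before projection.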
A principal element to note in the position embedding module is the 0th token, known as the class token. The concept was inspired by BERT's class token. Like the other embeddings, it is learned, but it does not originate from the image; instead, it is built into the model design. If the transformer is not provided with the positional data, it will not know what order the patches are in. The transformer encoder receives this sequence of embedded patch vectors.
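How the class token joins the sequence can be sketched as below, assuming 196 patch embeddings and a hypothetical embedding size of 64. The random arrays stand in for learned parameters and for the projected patches.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, dim = 196, 64                               # dim is a hypothetical size

patch_tokens = rng.normal(size=(num_patches, dim))       # stand-in for projected patches
cls_token = np.zeros((1, dim))                           # learned in a real model, not taken from the image
pos_embed = rng.normal(scale=0.02, size=(num_patches + 1, dim))

# The class token takes position 0; position embeddings then cover all 197 slots.
sequence = np.concatenate([cls_token, patch_tokens], axis=0) + pos_embed
print(sequence.shape)
```

The encoder output at position 0 is what the classification head reads, so the class token acts as a summary slot for the whole image.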
Each Transformer encoder block consists of a Multi-Head Attention layer and a Multi-Layer Perceptron (MLP) layer. The Multi-Head Attention layer splits its inputs across several heads, allowing each head to learn a different pattern of self-attention. The outputs of all the heads are then combined and passed to the Multi-Layer Perceptron. Layer normalization (LayerNorm) is applied before each of these two sub-layers, and a residual connection is applied after each. Finally, an additional learnable classification module (the MLP head) is attached to the output of the transformer encoder and determines the network's output classes.
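The encoder block described above can be sketched in PyTorch as follows. The class name `EncoderBlock`, the layer sizes, and the single linear layer used as the classification head are illustrative assumptions and do not reproduce the configuration of the pre-trained model.

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    """Pre-norm Transformer encoder block: LayerNorm -> sub-layer -> residual add."""

    def __init__(self, dim=64, num_heads=4, mlp_dim=128, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Dropout(dropout),
            nn.Linear(mlp_dim, dim), nn.Dropout(dropout),
        )

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # self-attention: q = k = v
        x = x + attn_out                                      # residual connection
        x = x + self.mlp(self.norm2(x))                       # residual connection
        return x

torch.manual_seed(0)
block = EncoderBlock().eval()
head = nn.Linear(64, 12)                  # hypothetical head for 12 classes

tokens = torch.randn(2, 197, 64)          # (batch, class token + 196 patches, dim)
with torch.no_grad():
    encoded = block(tokens)
    logits = head(encoded[:, 0])          # classify from the class token at position 0
print(encoded.shape, logits.shape)
```

A full model stacks several such blocks; the residual connections keep the token shape unchanged from block to block.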
The proposed work fine-tunes google/vit-base-patch16-224-in21k, a Vision Transformer (ViT) pre-trained on ImageNet-21k (14 million images, 21,843 classes) at a resolution of 224x224. The model is given each image as a sequence of fixed-size patches (resolution 16x16) that are linearly embedded. To train the model, the images must first be converted to pixel values. The Transformers library's feature extractor accomplishes this by resizing, normalizing, and converting the images into 3D arrays that can be fed into the model (Dosovitskiy et al., 2020).
Data augmentation techniques (John B et al., 2002) were applied to the training-set images to improve the generalization ability of the model, using PyTorch's torchvision.transforms module, which provides common image transformations. PyTorch also provides functionality for loading and storing data samples with their corresponding labels. The built-in DataLoader class was used to create the training and validation dataloaders; it wraps an iterable around the dataset, making it easy to access and iterate over the data samples. The model was configured with the following parameter values:
- Learning rate: 5e-4
- Loss function: CrossEntropyLoss
- Optimizer: Adam
- Image size: 32
- Patch size: 16 x 16
- Number of classes to classify: 12
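The configuration listed above can be wired together as in the following PyTorch sketch. The learning rate, loss function, optimizer, image size, and number of classes come from the list; the synthetic tensors, the batch size of 8, and the small linear classifier are placeholders assumed only so that the example runs on its own. The actual work uses augmented real images and a fine-tuned ViT in their place.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
NUM_CLASSES, IMAGE_SIZE, LEARNING_RATE = 12, 32, 5e-4

# Synthetic stand-in for the augmented training images and their labels.
images = torch.randn(16, 3, IMAGE_SIZE, IMAGE_SIZE)
labels = torch.randint(0, NUM_CLASSES, (16,))
train_loader = DataLoader(TensorDataset(images, labels), batch_size=8, shuffle=True)

# Placeholder classifier (hypothetical); the fine-tuned ViT would go here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * IMAGE_SIZE * IMAGE_SIZE, NUM_CLASSES))

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)

model.train()
for batch_images, batch_labels in train_loader:   # one pass over the data
    optimizer.zero_grad()
    loss = criterion(model(batch_images), batch_labels)
    loss.backward()
    optimizer.step()

print(f"batches: {len(train_loader)}, last batch loss: {loss.item():.4f}")
```

A validation dataloader would be built the same way with `shuffle=False` and evaluated under `torch.no_grad()`.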