Image Classification using ImageNet Classifiers in Environments with Limited Data

In this research, we compare and contrast various image classification algorithms and how effective they are in specific problem sets where data might be scarce, such as prediction of rare phenomena (for example, natural calamities) and enterprise solutions. We have employed various state-of-the-art algorithms in this study, credited as some of the best classifiers at the time of their inception. These classifiers have also been suspected to fall prey to overfitting on the datasets they were initially tested on, viz. ImageNet and Common Objects in Context (COCO); we test to what extent these classifiers generalize to the new data provided by us in a transfer learning framework. We utilize transfer learning on the ImageNet classifiers to adapt to our smaller dataset and examine various techniques, such as data augmentation, batch normalization, and dropout, to mitigate overfitting. All the classifiers follow a standard fully connected architecture. The end result should provide the reader with an overall analysis of which algorithm or approach to use in conditions where data might be limited, while also giving a brief overview of the progress of image classification algorithms since their advent. We also provide an analysis of the effectiveness of data augmentation on limited datasets by providing results achieved with and without it. In our case, we found MobileNet (with its lightweight nature contributing to low computational costs) and InceptionV3 (owing to its lower training time) to be the best performing classifiers for applying transfer learning to limited datasets out of the classifiers we have used for our study. This paper aims to establish preemptive standards that can be used to evaluate models for object recognition and image classification in problems containing limited amounts of data.


Introduction
Image Classification has always been one of the most widely used applications of Computer Vision. There is no "perfect" way to classify images, just better ones. With increasing research and implementation in this area, we perennially see new approaches evolving, taking inspiration from older methods and transforming them into neoteric standards. However, most of these aforementioned studies focus on classification of humongous datasets while establishing themselves as some of the best classifiers.
We found that there is a severe lack of references and studies when it comes to the performance of these classifiers when generalizing to a dataset which has limited data for learning (or adaptation, in the case of Transfer Learning).
The aim of this research is to answer the following questions, posed at the onset of this study:
1. The ILSVRC (ImageNet Large Scale Visual Recognition Challenge) entries have been accused of falling prey to overfitting on the dataset. This is a concept which seems far-fetched at the outset, considering the enormous size of the ImageNet dataset, but it has made its way around the research circle, causing speculation and giving rise to multiple studies and experiments, including one that tested how well the ImageNet classifiers would generalize to a dataset created to closely mimic ImageNet [1]. That study shows that the accuracy results are not replicated on a dataset created in a similar fashion to ImageNet; however, as Recht et al. 2019 [1] observe and document, the models which performed better on ImageNet than their predecessors performed even better on this similar dataset relative to the same predecessors, hence cementing the performance of these classifiers relative to each other. We aim to test this notion in our study on a low-scale dataset, using these classifiers to test not only the top-1 accuracy in absolute terms but also in relative terms.
2. ImageNet is a humongous dataset with a whopping 14 million images encapsulated within itself [2]. The results of the ILSVRC are geared towards providing solutions that specifically cater to industrial or academic problems in which access to large amounts of data is available and the only pertinent issue is utilizing all this data to form accurate predictions for classification and detection.
However, this doesn't account for situations where only a scarce amount of data might be available to work with; these situations also frequently occur in industry and might leave even the most prepared organizations, with their big data suites and weathered data scientists, in a rut.
These situations can range from enterprise solutions to time series to aggregate modelling to prediction of rare phenomena.
Through our study, we propose to view how well these state-of-the-art classifiers function in a custom environment where data is limited. The main thing to note is whether the same algorithms function with better accuracy and speed than their conventional counterparts when trained on a smaller dataset for a different purpose.
We have divided the paper into 5 segments. Section 2 covers the background of our research in the form of a literature survey. The introduction of AlexNet [4] led to a widespread resurgence of CNNs as more individuals realised how the power of these neural nets allowed their creators to clinch the highly-coveted top spot in the 2012 edition of the prestigious ILSVRC.
Only a couple of years later, the VGG16 and VGG19 classifiers (where the numeral stood for the number of layers each network possessed), named after the Visual Geometry Group of Oxford University where they were incepted, placed second in the classification track of ILSVRC 2014 (achieving a top-5 test error of just 7.32%, becoming one of the first networks to go below the 10% mark) and first in the localisation track. One of the major changes that Karen Simonyan & Andrew Zisserman implemented in their convolutional network was the inclusion of multiple smaller filters to cover the same receptive field as a single large filter; this change not only introduced increased non-linearity but also decreased the number of parameters, which, in turn, helped reduce overfitting and the time it would take for the model to converge.
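The parameter saving from replacing one large filter with a stack of smaller ones can be checked with a quick back-of-the-envelope computation (the channel count below is illustrative, not VGG's actual layer sizes):

```python
# Parameters of one k x k convolutional layer: (k*k*c_in + 1) * c_out (weights + biases).
def conv_params(k, c_in, c_out):
    return (k * k * c_in + 1) * c_out

c = 256  # illustrative channel count
one_5x5 = conv_params(5, c, c)       # a single 5x5 layer
two_3x3 = 2 * conv_params(3, c, c)   # two stacked 3x3 layers cover the same 5x5 receptive field
print(one_5x5, two_3x3)              # the 3x3 stack is ~28% smaller, with an extra non-linearity
```

The same arithmetic explains why the stacked design both regularises the network (fewer parameters) and deepens it (one extra activation per stacked layer).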
In the very same year, Szegedy et al. designed a convolutional neural network, named GoogLeNet, that deviated from the basic architecture found in CNNs at the time to significantly decrease the number of parameters while vastly increasing the number of layers. The network, codenamed Inception (owing to the authors' cinephilic tendencies), incorporated 22 layers and narrowly beat VGG in the ImageNet Classification Challenge with a top-5 classification error of just 6.67%. Taking inspiration from the Network-in-Network approach [6], GoogLeNet incorporated many 1x1 convolutions to overcome computational bottlenecks via dimensionality reduction, similar to the approach adopted by the VGG networks. What truly set GoogLeNet apart and allowed it to place above VGG in the ImageNet challenge, though, was its implementation of convolutions, where the stacking of convolutions was done not only sequentially but also on the same level. This allowed it to capture both global and local features more effectively due to different-sized filters being employed at the very same level. This idea of going deeper not only vertically but also horizontally was inspired by the theoretical work of Arora et al. regarding sparse structure and how modules with high correlations should be grouped together to connect the previous layer to the next by forming filter banks [7]. Eventually, more versions of the Inception model were released which used the concept of factorised convolutions (as seen in the VGGnets) more frequently, and enhanced it to break up square filters into pairs of one-dimensional filters (a filter of size NxN would now be factorised to produce two filters, viz. Nx1 and 1xN).
The year right after the Inception and VGG networks were introduced, researchers engineered a neural network [8] that could incorporate an exponential number of layers while continually increasing the performance of the model, something which had been elusive to the entire deep learning community prior to this (successful efforts were made to mitigate this, but none came close to the breakthroughs achieved by this network).
The two major issues that had served as hurdles in the quest for going deeper in CNNs were the vanishing/exploding gradient problem and the computational time (and power) required to train these networks. ResNet addressed the former with skip-connections: simply identity shortcut connections which stack identity mappings while skipping over one or more layers to ensure that the vanishing gradient problem doesn't occur. He et al.
theorised that the introduction of these residual blocks should allow these networks to achieve a training error lower than that of their shallower counterparts. Many other networks followed, such as ResNeXt, a blend of ResNet and Inception which used a hybrid architecture based on both these networks and introduced a hyper-parameter called cardinality corresponding to the number of independent paths that could be taken while traversing the network [9], and DenseNet, a network which uses skip-connections while exploiting feature reuse by aggregating feature maps with depth concatenations [10], [11].
MobileNet is another recent model, which employs factorised convolutions to drastically reduce training time and model size. It utilizes the concept of depthwise separable convolutions [12], similar to the one applied in Inception models, to a larger extent, creating an extremely light-weight deep neural architecture with highly optimised latency to account for a more accessible model with low computational requirements. MobileNet also features two novel hyper-parameters, viz. the width multiplier, α, and the resolution multiplier, ρ, which allow it to transition into being even more light-weight by altering the size of the model [13].
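The parameter reduction from depthwise separable convolutions can be sketched in the same back-of-the-envelope style (the sizes below are illustrative, not MobileNet's actual layers; the width multiplier α would further scale the channel counts):

```python
# Standard conv weights: k*k*c_in*c_out.
# Depthwise separable: k*k*c_in (one k x k filter per input channel)
# + c_in*c_out (a 1x1 pointwise convolution that mixes channels).
def standard_conv(k, c_in, c_out):
    return k * k * c_in * c_out

def depthwise_separable(k, c_in, c_out):
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 128, 256
std = standard_conv(k, c_in, c_out)
sep = depthwise_separable(k, c_in, c_out)
print(std, sep)  # roughly an 8-9x reduction in weights for 3x3 kernels
```

This ratio is what lets MobileNet stay light-weight enough for low-compute settings while keeping a deep architecture.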
We now briefly discuss the related work in the field of image classification models and their evolution in recent years. Datasets are a vital part of any image recognition research. One might hope that a given classifier would perform well on a new data set assembled from the same source following the same protocols. Researchers showed that early datasets resulted in overfitting, leaving classifiers with minimal accuracy on other datasets [14]. Studies showed that deeper networks achieved higher accuracy across various transfer tasks whereas wider networks showed lower accuracy [2], [14], [15].
Researchers have replicated new datasets based on two prominent benchmark datasets, CIFAR-10 and ImageNet, and demonstrated that even with small variations, current classification models generalize to the new data with accuracy drops ranging from 3%-15% on CIFAR-10 and 11%-14% on ImageNet [13], [15], [16]. Their studies show that adaptivity is an unlikely explanation for the accuracy drops and that the differences in accuracy stem from a larger distribution gap between the datasets, while adaptivity and generalisation gaps are also listed as additional possible causes.
Another substantial piece of research presents a comparative study of classifiers using popular datasets, based on relative data bias and cross-dataset generalisation, showing how a given classifier trained on one dataset performs on a different dataset. The authors tried to raise awareness about an important issue, i.e. bias in datasets. The results of the in-depth study of cross-dataset generalisation were rather disheartening, since almost all the datasets are assembled from one source, the Internet, which raises too many red flags [4], [6], [17]. A simple explanation for this is bias in datasets, such as selection, label, or even capture bias, but most importantly, the negative set bias.
The major difference in our research is that we study to what extent these classifiers generalise to different datasets, thereby showing the extent of overfitting and how effectively these classifiers generalise when used across various target sets with limited data, which in turn provides a substantial base for using a specific approach/algorithm under specific conditions.

Learning with the Training Wheels On
Transfer Learning is a technique utilized in Machine Learning to reuse knowledge learnt in previous problems by applying it to related problems. This concept, which vies for the cross-utilization of knowledge to tackle related, novel problems, is inspired by the intrinsic ability of the human brain to transfer knowledge across different tasks [21]. Transfer Learning is touted to be the next big driver of the professional success of Machine Learning by many eminent individuals in the deep learning community [10], [11], [18]. There have been various comprehensive reviews of transfer learning that have incorporated recent advancements in the various types of transfer learning that exist [1], [9], [19].
It can be interesting to see the situations in which Transfer Learning is adopted when we focus on its formal definition. We can look at the framework provided employing the usage of domains, tasks, and marginal probabilities [1], [19].
The framework, concisely put, consists of a domain, D, defined as a two-element tuple encapsulating the feature space, χ, and marginal probability, P(X), where X represents a sample data point (X={x_i}, where i=1,...,n with n sample points), and a task, T, defined as a two-element tuple encapsulating the label space, γ, and the objective (predictive) function, η, where η can also be represented by P(γ|X) employing conditional probability. η is learned from feature-label pairs (x_i, y_i), such that η(x_i)=y_i. Transfer learning arises when the source (S) and target (T) differ in one of four situations:
i. χ_S ≠ χ_T: In this situation, the feature spaces of the source domain and target domain are different.
ii. P(X_S) ≠ P(X_T): In this situation, the marginal probability distributions of the source domain and target domain are different.
iii. γ_S ≠ γ_T: In this situation, the label spaces of the source domain and target domain are different.
iv. P(Y_S|X_S) ≠ P(Y_T|X_T): In this situation, the conditional probabilities of the source task and target task are different.
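Collected in one place (following the standard formulation of this framework, of which situations iii and iv above are two of the four cases), the definitions read:

```latex
\mathcal{D} = \{\chi,\; P(X)\}, \qquad X = \{x_1, \dots, x_n\} \subseteq \chi,
\qquad \mathcal{T} = \{\gamma,\; \eta\}, \qquad \eta(x_i) = y_i, \quad \eta \equiv P(\gamma \mid X).

\text{Transfer learning applies whenever source and target differ in at least one component:}
\chi_S \neq \chi_T, \quad P(X_S) \neq P(X_T), \quad \gamma_S \neq \gamma_T, \quad P(Y_S \mid X_S) \neq P(Y_T \mid X_T).
```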
Even though almost every Transfer Learning implementation would generally involve all four situations in part, our work and experimentation primarily focus on the third and fourth situations, wherein the size and manner of the label space and conditional probabilities of the initial and final tasks differ considerably from those of the ImageNet winners' work.
Transfer Learning has wide-reaching applications which pervade all fields of Artificial Intelligence and Machine Learning, but it becomes extremely useful when we take into consideration our own problem: Image Classification. Since Image Classification depends heavily on Feature Extraction, and most of the layers during training contribute to that same task, using a model already trained to extract features makes sense. But transfer learning doesn't simply reuse a pre-trained model on a new dataset. In our case, it takes the layers required for feature extraction, which are already trained, and connects them with some Fully Connected (FC) layers with some dropout, so that the new model "learns" to use the pre-trained model's learned characteristics to capture the characteristics of the new problem set, hence creating a new model catered to our problem.
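A minimal sketch of this pattern in Keras (the layer sizes, dropout rate, and the choice of MobileNet as the base are illustrative, not our exact experimental configuration; in practice `weights="imagenet"` loads the pre-trained weights):

```python
import tensorflow as tf

# Pre-trained convolutional base without its dense (top) layers; pass
# weights="imagenet" to load the ImageNet-trained feature extractor.
base = tf.keras.applications.MobileNet(
    include_top=False, weights=None, input_shape=(224, 224, 3), pooling="avg"
)
base.trainable = False  # freeze the feature extractor: only the new head is trained

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(256, activation="relu"),   # new FC layers for the target task
    tf.keras.layers.Dropout(0.5),                    # dropout on the new head
    tf.keras.layers.Dense(4, activation="softmax"),  # four target classes, as in our dataset
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

Only the small dense head carries trainable weights, which is what makes this viable on a dataset of a few hundred images.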

Methodology
In our work, we use 8 different classifiers to derive the needed results. Five of the aforementioned classifiers are used in a transfer learning framework with the ImageNet-trained models (without the dense layers) used as the base models, whereas the rest of the models (viz. the DNN, CNN, and AlexNet) are trained and validated purely on our dataset. Even though the expectations for these (non-transfer-learning) primitive networks were minimal, we included them in our study to illustrate the contrast in performance and delineate the relative superiority of the models employing transfer learning.
Our dataset consists of four basic classes, comprising only around 800 images. These classes are generalized supersets of many of the specific classes found in ImageNet and are, thus, more than viable for Transfer Learning.
Our approach is outlined briefly as an algorithm (represented via the flow chart in Figure 1), the final step of which is comparing all these results to obtain the best performing classifier.
To ensure a standard for comparison, we utilized the same number of Fully Connected Dense Layers (hereon referred to as FC) with the same numbers of nodes after the convolutional layers of the base model used (as seen in Table 2 (a) below). The exception to this was our Basic Deep Neural Network: we felt that, to ensure a fair comparison and to make up for the absence of any feature extraction (due to the omission of convolution layers), the DNN needed extra layers (its composition can be seen in the table below). Each layer in all our models is accompanied by ReLU non-linearity [3], barring the last Fully Connected layer, which is followed by a Softmax layer to realize classification.
Since parameter tuning is the oil that makes the cogs turn in a neural network, it is desirable to have a lot of parameters available for tuning and tweaking. But in order to have a lot of parameters, we need a lot of training examples. All our networks were very prone to overfitting due to the limited amount of data available to train with. To reduce overfitting, we've employed Data Augmentation [6], [20], [21], [22] to artificially increase the size of the database by using label-preserving image transformations [14]. Data Augmentation works on the principle of mathematical transformations such as scaling, rotation, translation, cropping, flipping the image on both the horizontal and the vertical axis, and off-center randomized zoom. We use three prominent transformations, namely, horizontal flip, shear, and zoom. All of these are applied with randomness in each iteration of the epoch so that the "same" image is never visited twice by the classifier during training, increasing the model's ability to generalize [6], [12], [17]. The value we used for the scope of randomness in the zoom transformation was set to 0.2 (this corresponded to a range of [0.8, 1.2]).
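The three transformations above can be expressed, for instance, with Keras's ImageDataGenerator (only the zoom range of 0.2 is stated explicitly; the shear value here is an assumption mirroring it, and the directory path is hypothetical):

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    horizontal_flip=True,  # random horizontal flips
    shear_range=0.2,       # random shear (value assumed, not stated in the text)
    zoom_range=0.2,        # random zoom drawn from [0.8, 1.2], as described
)
# Each epoch draws freshly randomized transformations, so the classifier never
# sees exactly the same pixels twice, e.g.:
# train_gen = train_datagen.flow_from_directory(
#     "data/train", target_size=(224, 224), batch_size=32, class_mode="categorical")
```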
We illustrate the impact of data augmentation by providing results achieved with and without it.
Dropout was also considered but, after considerable testing, proved to be superfluous (quite ironically), potentially because the training dataset, being small, didn't exhibit the co-adaptation of neurons that dropout is designed to break up to avoid overfitting [5], [15].

Results and Discussion
The authors have completed a comprehensive survey of all the milestone image classification algorithms and tested them under the aforementioned conditions. The project includes the implementation and outlines the explicit and implicit comparison of these algorithms. Various combinations of regularisation, hyper-parameter tuning, and activation functions were used, and hence the best outcomes (to the best of our humble abilities, that is) determined. The authors have also examined various techniques to reduce loss while comprehending neoteric discoveries in the sphere of Deep Learning. We hope that this comparison serves as a starting point for others to choose and compare algorithms for their limited-data image classification nets. All hyper-parameters have been set after a lot of trial and error to provide a fair and standardised comparison. We notice many unintuitive results after conducting our work, which were surprising when compared to the models' ImageNet performance. We have employed many measures to reduce overfitting and loss, yet if we glance at the accuracy graphs above (comparing the training and validation accuracies), we realise that, even after these measures, a lot of overfitting still takes place owing to the small amount of data available to us.
It is fair to assume that this result would be replicated in any other similar circumstance when using these algorithms with limited data; these results, therefore, are not, and should not be considered, endemic and specific to our dataset. These results might be expected to be replicated as long as the classes to be classified are similar to the ones already present in the ImageNet database, barring which only the results of the DNN, CNN, and AlexNet would be of insight to the user. But, taking the more probable (and certainly more hopeful and convenient) scenario of the target classes (the ones you wish to classify) being similar to the source classes (the ones present in ImageNet), this conclusion should serve as a starting point for readers to discern which approach is best to utilise and, hence, where to spend resources like optimisation time and computation.
Without further ado, these are the results that we received with our implementation. We also calculate the scores corresponding to our models, computed from individual confusion matrices, as shown in Table 4 below. As mentioned earlier, we see certain results which digress from the norm (and from the ImageNet results).
As expected, the DNN achieves the lowest accuracy owing to its inability to efficiently extract features, which stunts it from achieving a high accuracy. Its accuracy of 63.75% is still impressive considering that the time taken for it to train and validate is just around 20 minutes. One thing to note here, though, is that training this DNN had very high computational requirements; it even made Google Colab's Tensor Processing Unit crash while we implemented it, owing to the high number of nodes it took to train the network. Suffice it to say, this is not a network that should be used for image classification of any type, and it is probably wise to steer away from this model when trying to achieve valuable results.
Next on the list, shockingly enough, coming second to last if we order our implementations by accuracy achieved, is AlexNet (the CaffeNet version). It might come as a shock to some that the prodigious model which shook up the entire academic community by winning the ILSVRC 2012 loses to a normally trained one-layer CNN (68.75% as compared to 76.88%). However, upon closer look, there are three things to note that could possibly explain this anomaly. Firstly, this is a CaffeNet version of AlexNet. The difference, as explained in the literature survey of AlexNet, is that while the actual AlexNet is trained on two GPUs, which significantly increases its accuracy, the CaffeNet model is implemented on a single GPU. Secondly, unlike all other algorithms we implemented in our work (whose genesis came from the ILSVRC), our version of AlexNet (or rather, CaffeNet) is not pre-trained on ImageNet. That is to say, the weights are trained purely from scratch and no transfer learning is applied. Thirdly, and perhaps most importantly, the implementation of CaffeNet which we used stuck to the archaic norm of large filter sizes (as defined in the AlexNet architecture) of the order of 11x11, 7x7, and 5x5, as compared to the recent norm of using simple, small 3x3 filters with deeper CNNs; the filter banks were also massive (starting from 96 filters in the first convolutional layer and reaching up to a whopping 384 filters in the third and fourth convolutional layers) to account for the vast size of ImageNet, as compared to the mere 32 filters used in our CNN. These three points make a huge difference and, hence, affect our accuracy to a large degree, especially when the limited size of the dataset is taken into consideration. This, to an extent, possibly explains why our AlexNet with 5 convolutional layers was one-upped by our CNN with just 1 convolutional layer.
However, another thing to note is that while our CNN takes 32.9 seconds to train one epoch, AlexNet does the same in just 17 seconds (almost half the time). Therefore, despite its archaic design, AlexNet retains a clear speed advantage. We also look at the effects of data augmentation and the impact it has on the accuracy and loss of our classifiers in Table 5. As anticipated, without augmentation we can see a lot more overfitting taking place as compared to the case where we used data augmentation. We can also observe a drop in accuracy for most of the networks, barring VGG16, VGG19, MobileNet, and InceptionV3, which perform similarly to their original (augmented) performance. We see an astounding drop in the accuracy of ResNet from 98.12 to 71.25, while other networks display increased robustness. The DNN, CNN, and CaffeNet also see drops in accuracy, albeit much less severe than that of ResNet.
This analysis might then lead readers to believe that InceptionV3 is hands-down the best algorithm to use in terms of both the time taken to train and the accuracy received. This can only be deemed partially true, however, because there is still a facet that we need to analyse in order to do justice to this comparative analysis: accounting for the standardization.
We have used 50 epochs and 4 dense layers with the exact same numbers of nodes (along with the Adam optimizer at its default learning rate) to standardize our comparison and provide readers with an efficient comparison and guide to utilising models on small datasets. However, one cannot help but wonder whether, with adequate tweaking, other algorithms might perform better than the current top-spot holder, InceptionV3.
While there is no way to conclusively answer this question, there is a way to pointedly tackle one facet of it through our work: by looking at the highest validation accuracy achieved throughout the epochs and the corresponding training accuracies.
One of the run-of-the-mill ways to counter overfitting is "early stopping". However, it is imperative to determine whether the validation accuracies peaked due to the natural circumstances of good training or whether a peak was just an anomaly (brought about by flukes) which later got rectified as the epochs continued.
In order to verify this, it is very important to take note of the training accuracies, so as to tell definitively (to an extent) whether a model would actually perform better with early stopping. Looking at the corresponding results, we see that almost all algorithms (apart from the DNN, because of its lower training accuracy) performed better before hitting the 50th epoch. The 5% accuracy difference between the two VGG models gets reduced to a measly 0.63% difference, whereas ResNet50 and InceptionV3 still enjoy extremely high matching accuracies of 99.37%.
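In practice, this retrospective analysis corresponds to something like Keras's EarlyStopping callback (the patience value below is illustrative, not a setting used in our runs):

```python
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_accuracy",     # watch validation accuracy, as in our analysis
    patience=10,                # illustrative: stop after 10 epochs without improvement
    restore_best_weights=True,  # roll back to the best epoch rather than the 50th
)
# model.fit(train_gen, validation_data=val_gen, epochs=50, callbacks=[early_stop])
```

The `restore_best_weights` flag is what recovers the peak-epoch model, but checking that the corresponding training accuracy is also high remains the guard against rewarding a fluke.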
What is a bit out-of-the-blue, though, is the accuracy achieved by MobileNet. With a whopping 100% validation accuracy (a perfect score, if there ever were one), MobileNet blows all of its competitors out of the water. To confirm that this was not just a fluke or an anomaly, we see it repeating this glorious accolade an astounding 8 times (twice the number of times it was achieved by our previous, now dethroned, top-spotters, ResNet50 and InceptionV3). To further establish this point, we see that the corresponding training accuracies are also in line with the results achieved in validation.
With its low computation time of just 17.4 seconds per epoch and the high validation accuracy that may be achieved through early stopping, MobileNet is a neck-and-neck competitor to InceptionV3, with similar accuracies and training time.

Future Scope
The future scope and implications are clear: there needs to be a rising focus on achieving state-of-the-art efficiency on smaller datasets rather than just building models that predict well after being fed astronomical amounts of data. There have been many recent strides in that direction through Tiny ImageNet; however, we feel there is still scope for more focus on this endeavor.
Some improvements that readers might implement include delving into Generative Adversarial Networks (GANs, also called dueling neural networks) and various other approaches and techniques that have emerged recently, which require a high level of expertise to implement and possess high computational requirements. GANs, specifically, are an innovative and disruptive technique to expand and artificially create a larger dataset from the limited data available, which may be utilized by readers to make their classification problem more approachable via networks which work better with more training data, viz. ResNet in contrast to MobileNet.

Conclusion
In conclusion, we feel that the two questions that we posed at the start of the introduction, setting out the basis and need of this research work, are sufficiently tackled.
The first point, which questioned the authenticity of the results of the ILSVRC (shrouded by allegations of overfitting), stands answered when we look at how all the algorithms performed in the same order we expected them to (barring AlexNet, for which a detailed possible explanation is outlined above). We also tested pre-trained models to see if they would work on a target dataset with similar expectations, and that too was fulfilled and verified in the same breath.
The second point, which questioned the efficacy of these models and algorithms on a smaller dataset and asked for a comparative analysis, has also been fulfilled by the authors as outlined above.
Lastly, to summarise, we'd like to suggest that our readers employ either InceptionV3 or MobileNet, based on the computational apparatus available and the task to be completed, when faced with a choice between these algorithms. MobileNet, while incredibly viable because of its light-weight nature, low computational requirements, and high validation accuracy, is given competition and sufficiently rivaled by InceptionV3, which has (albeit with higher computational requirements) a lower training time and equally high (if not higher) validation accuracies.

Declarations
Conflict of Interest: The authors declare no conflict of interest.

Figure 1: Representation of our methodology via a flow chart
Graphical representation of training accuracy achieved by classifiers
Figure 4: Graphical representation of validation accuracy achieved by classifiers