Enhancing Image Classification Performance via Unsupervised Pre-trained Transformer Language Models

Image classification and categorization are essential to a machine's capability of telling images apart. As Bidirectional Encoder Representations from Transformers (BERT) has become popular in many natural language processing tasks in recent years, it is intuitive to use these pre-trained language models to enhance computer vision tasks, e.g., image classification. In this paper, by encoding image pixels with pre-trained transformers and then connecting them to a fully connected layer, the classification model outperforms the Wide ResNet model and the linear-probe iGPT-L.


Background and Problem
Unsupervised pre-training is important in modern deep learning research. Lee et al. [1] used pre-training approaches in computer vision tasks in 2009, and later, from 2010 to 2016, Nair and Hinton [2] showed that the pre-training process is supplementary in CV tasks and can therefore be omitted in some cases. However, pre-training started to flourish in the natural language processing domain after Mikolov et al. [3] proposed Word2Vec. Not long after, Devlin et al. [4]'s BERT language model came to dominate the most frequently used natural language processing tasks; it closely resembles Vincent et al. [5]'s denoising autoencoder model, which was originally designed for images. The pre-training process has become one of the most important procedures in deep learning; however, existing image classification methods do not achieve satisfactory accuracy.

Recent Work
Chen et al. [6], inspired by unsupervised natural language representation learning, trained image representations with sequence Transformers and showed on CIFAR-10 that they outperform Wide ResNet. Wang et al. [7] reviewed that convolutional neural networks were proposed in the 1960s and first implemented in the 1980s, but it was not until LeCun et al. [8]'s first experiment on handwritten digit recognition that CNNs' great potential was revealed. In the 2010s, Krizhevsky et al. [9] proposed the deep architecture AlexNet by concatenating multiple CNN layers. Several years later, many variants of AlexNet were proposed and accuracy on ImageNet was greatly improved, e.g., ZFNet [10], VGG [11], GoogLeNet [12], ResNet [13], ResNeXt [14], Inception-ResNet-v2 [15], and DenseNet [16]. Lu and Weng [17] concluded that, for multi-source image classification tasks, additional information such as signatures, texture, context, and ancillary data can be combined to achieve better performance; however, it is difficult to handle the dichotomy between pixels and natural language text in a single model. Cui et al. [18] proposed several whole-word-masking pre-trained Chinese language models, which are improved versions of the BERT [4] pre-trained language models, namely RBT3, RBTL3, and RoBERTa-wwm-ext-large. These models achieved better performance in Chinese machine reading comprehension, Chinese document classification, and other downstream natural language tasks. He and Peng [19] combined the vision stream and the language stream as two parallel channels for extracting multi-source information in the image classification task; tested on the CUB-200-2011 image dataset, the combination of GoogLeNet [12] and CNN-RNN [20] achieved 85.55% and outperformed many competitors.

Results
Three of the most popular approaches to image classification tasks are per-pixel, subpixel, and heterogeneous. Lu and Weng [17] found that, for the per-pixel approach, nonparametric classifiers, e.g., neural networks, support vector machines, and decision trees, were the most well-known algorithms for their performance and generalization advantages in the late 1990s and 2000s. Fernández et al. [21] compared different classifiers on small datasets and found that the random forest algorithm ranks first among 179 classifiers.

Approach
My approach consists of a pre-training stage followed by a fine-tuning stage. In pre-training, I use BERT objectives and the sequence Transformer architecture to predict language tokens.
Given an unlabeled dataset X, the BERT objective samples a sub-sequence M ⊂ [1, n], the index set of tokens to mask, such that each index i has an independent 15% probability of appearing in M; M is called the BERT mask. As in equation (1), the language model is trained by minimizing the BERT objective of the "masked" elements x_M conditioned on the "unmasked" ones x_{[1,n]\M}.
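Based on these definitions, equation (1) presumably takes the standard BERT masked-modeling form, the negative log-likelihood of the masked elements conditioned on the unmasked ones:

$$ L_{\mathrm{BERT}} = \mathbb{E}_{x \sim X}\; \mathbb{E}_{M} \sum_{i \in M} \left[ -\log p\left(x_i \mid x_{[1,n] \setminus M}\right) \right] \qquad (1) $$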
The transformer decoder takes the image pixel and meta-character sequence x_1, …, x_n and produces a d-dimensional embedding for each position. I then use a fully connected layer as a non-linear function from embeddings to image classes. A dropout layer and a softmax layer are used for better transfer performance between the training and test datasets.
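As a minimal sketch of this head (assuming PyTorch and the HuggingFace transformers library; the model identifier, dropout probability, and class count are illustrative placeholders, not the exact configuration used in this work):

```python
import torch
import torch.nn as nn
from transformers import BertModel


class PixelSequenceClassifier(nn.Module):
    """Pre-trained BERT-like encoder followed by dropout, a fully connected
    layer, and softmax over image classes."""

    def __init__(self, pretrained_name="hfl/chinese-roberta-wwm-ext-large",
                 num_classes=10, dropout_prob=0.1):
        super().__init__()
        self.encoder = BertModel.from_pretrained(pretrained_name)
        hidden = self.encoder.config.hidden_size  # d-dimensional embedding size
        self.dropout = nn.Dropout(dropout_prob)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, input_ids, attention_mask):
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # The [CLS] position serves as the image's language-model embedding.
        cls_embedding = outputs.last_hidden_state[:, 0]
        logits = self.classifier(self.dropout(cls_embedding))
        return torch.softmax(logits, dim=-1)
```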

Per-pixel Image Encoder
In the per-pixel image classification approach, every RGB channel of every pixel in an image has a pixel-channel code ranging from 0x00 to 0xff for different colors. Thus, the pixels of an image can be treated just like the ASCII characters of a document. Generally speaking, the performance that pre-trained language models achieve on document classification tasks can therefore be transferred to image classification naturally.
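A minimal sketch of this per-pixel encoding (NumPy-based; the function name and byte-to-character mapping are assumptions for illustration, not the original implementation):

```python
import numpy as np


def image_to_char_sequence(image: np.ndarray) -> str:
    """Serialize an H x W x 3 uint8 image into a character sequence,
    treating each pixel-channel value (0x00-0xff) as a character code."""
    # Channel order follows the paper: Red first, then Green, then Blue.
    red = image[:, :, 0].flatten()
    green = image[:, :, 1].flatten()
    blue = image[:, :, 2].flatten()
    values = np.concatenate([red, green, blue])
    return "".join(chr(v) for v in values)
```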
Recall that Kim [22] showed that the unsupervised pre-trained language model word2vec combined with a CNN outperformed many other machine learning algorithms, e.g., support vector machines and conditional random fields, on many datasets such as movie reviews, the Stanford Sentiment Treebank, and TREC questions. Cui et al. [18] likewise improved performance over many other machine learning algorithms: their BERT and RoBERTa-wwm-ext-large models both achieved an F1-score of 97.8% on the THUCNews dataset, which contains 65 thousand news articles in 10 domains. I used the RBTL3 and RoBERTa-wwm-ext-large pre-trained language models on two Chinese judicial datasets, which have 2 and 4 classes, respectively. The Case-2 dataset is annotated with civil and criminal cases, and the Case-4 dataset with civil, criminal, intellectual property, and administrative cases. The Case-2 dataset has 19,508 training documents and 2,000 test documents, and the Case-4 dataset has 34,789 training documents and 2,013 test documents.
From Table (1), by combining the pre-trained language model with a fully connected layer as the document classification model, the test accuracy exceeds that of the other popular machine learning algorithms. Therefore, the pixel channels of an image can be properly represented by these pre-trained language models. The CIFAR-10 dataset contains 60,000 color images with a resolution of 32x32 in 10 classes, and the CIFAR-100 dataset has 100 classes containing 600 images each. The image is encoded by the sequence of RGB channel values, in the order of Red channel, Green channel, and Blue channel, and then any other provided metadata is encoded as a sequence of ASCII characters. In the Concatenation layer, the pixel-channel values and metadata are concatenated, with a special [CLS] token at the start, a [SEP] token between the channel values and the metadata, and a [SEP] token at the end. In the Trim layer, due to the limit on the maximum sequence length of the BERT language model, a sequence longer than 512 needs to be trimmed before being sent to the BERT model: keeping the first 256 and last 256 characters of the concatenated sequence, the trimmed result contains the first 255 red-channel values, some blue-channel values, and all the metadata in common cases. In the Encoder layer and the Embedding layer, the trimmed sequence of values is encoded by BERT-like models, and the encoded representation of the [CLS] token is taken as the image's language-model embedding. In the Feature-Extraction layer, a combination of one dropout layer, one fully connected layer, and one softmax layer, as in equation (2), is used. In the Output layer, the classification label of the image is fed to the model.
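A minimal sketch of the Concatenation and Trim layers (again assuming a HuggingFace BERT-like tokenizer; "hfl/rbtl3" is an assumed model identifier, and how channel characters map to tokens may differ from the original implementation):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("hfl/rbtl3")  # assumed identifier


def build_input(channel_sequence: str, metadata: str, max_len: int = 512):
    # Concatenation layer: channel values, then a [SEP], then metadata;
    # the tokenizer itself adds the leading [CLS] and the final [SEP].
    text = channel_sequence + tokenizer.sep_token + metadata
    # Trim layer: keep the first 256 and last 256 characters so the sequence
    # fits the 512-token limit; the tokenizer's own truncation absorbs the
    # special tokens it adds.
    if len(text) > max_len:
        text = text[:256] + text[-256:]
    return tokenizer(text, truncation=True, max_length=max_len,
                     padding="max_length", return_tensors="pt")
```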

Discussion
This paper proposes a novel idea of using pre-trained language models for image representation and takes image classification as an example of their performance in computer vision tasks. The finding might benefit various subjects, namely education, medicine, chemistry, physics, cosmology, geography, climate, and materials science, in their specific research. Tests showed that the proposed model outperforms the iGPT-L model without augmentation on the image datasets: the model achieved an accuracy of 99.60% ∼ 99.74% on the CIFAR-10 image set and an accuracy of 99.10% ∼ 99.76% on the CIFAR-100 image set.

Training
As in Cui et al. [18], the experiment was performed on a Google Cloud TPU v3, with 32 GB of RAM and 8 chips each with 16 GB of high-speed memory, providing 420 teraflops of computational capability.

Results
Using the proposed model, I tried different pre-trained language models to see their impact on classification accuracy. From Table (2), for a dataset of the same size but with more classes, the classification model needs more epochs and training time. For the same RoBERTa language model with different numbers of transformer layers, 24 transformer layers achieved better accuracy than 3 layers; however, the training cost grows with the larger language model. For the same fine-tuned language model, the number of classes has some impact on accuracy: the fewer classes the dataset has, the more accurate results the model can achieve.

Table 2. Comparison of the accuracy of the pre-trained language models on the CIFAR-10 and CIFAR-100 datasets for image classification. RoBERTa-large is short for RoBERTa-wwm-ext-large.

Comparison
Compared to iGPT-L's accuracy of 96.3% on the CIFAR-10 dataset without augmentation and 82.8% on the CIFAR-100 dataset, our models achieve better results. The reason for the model's outstanding performance lies in the large pre-training data of BERT, fine-tuned on top of that with RoBERTa, and the use of an extra language corpus of 5.4 billion tokens of wiki data and other resources. The transformers in the pre-trained language models use multiple layers to represent images and may be used in other computer vision tasks, e.g., object detection and gesture prediction.

Acknowledgements
I would like to thank my tutor, Professor Sun Lifeng at Tsinghua University, for his guidance and discussion.

Author contributions
Shen Dezhou made substantial contributions to the conception and design of the work and to the acquisition, analysis, and interpretation of data. He also created the code used, drafted the work, and substantively revised it. Shen Dezhou has approved the submitted version and agreed both to be personally accountable for the author's own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated, resolved, and the resolution documented in the literature.

Competing Interests statement
The authors declare no competing interests.

Figures
Figure (2) shows the different layers of the image classification model, e.g., the concatenation, trim, encoder, representation, and feature-extraction layers. The channel values and texts are concatenated and then sent to the Trim layer; the trimmed input sequence is encoded and then goes to the feature-extraction layer. The dropout and softmax functions are used to obtain the optimal result.
Tables
Table (1) shows that the pre-trained language models solve the document classification task well.
Table (2) shows that the pre-trained language models converge on different image datasets and achieve satisfactory accuracy. Language models with distinct architectures need different numbers of epochs to converge on different image datasets.