2C-Net: integrate image compression and classification via deep neural network

Supporting intelligent vision tasks without image reconstruction can save substantial computation in the era of big data. With the help of Deep Neural Networks (DNNs), integrating image compression and intelligent vision tasks at the feature-representation level becomes a promising new approach. However, how to simultaneously perform the non-linear transformation for image compression and extract image patterns for intelligent vision tasks within a shared DNN remains an open problem. In this paper, a versatile framework is studied to explore common feature representations for both image compression and classification. A fully shared latent representation is extracted in a compact way to support both the compression and classification tasks. The General Feature Extraction module and the Feature-Analytic Classifier are proposed to generate and utilize the shared latent representation. The whole framework is then jointly optimized by considering multiple factors (i.e., rate, quality, and accuracy). Extensive experiments validate that the proposals improve the performance of both learning-based image compression and classification. The results show that the proposed method outperforms conventional codecs such as BPG and JPEG2000 in compression efficiency, while achieving acceptable accuracy on different image classification datasets without image reconstruction.


Introduction
With the rapid development of communication and multimedia technologies, images (including videos) have become the most dominant data shared through the Internet. This brings huge pressure not only to image compression, but also to intelligent computer vision tasks. Generally, images are compressed immediately after capture to reduce data size, which only takes human vision into consideration and ignores the requirements of intelligent computer vision tasks. Accordingly, images must be decompressed before feature extraction for computer vision tasks (e.g., image classification) (see Fig. 1a).
With the explosion of image data, intelligent content management becomes increasingly important in many scenarios, such as automatic content approval on social media platforms or instant image search among thousands of photos on a personal smartphone. To avoid a mass of redundant computation for decompression, compact descriptors can be extracted and stored independently for directly executing intelligent tasks [1,2], as shown in Fig. 1b. In this case, the extra overhead of each image must always be stored and transmitted, even for occasional use. To address this, extracting features directly in the compressed domain for intelligent tasks would be an ideal and effective approach [3,4]. However, such attempts have not achieved the expected performance, because the compressed domains of conventional compression methods are not designed for understanding image contents.
Taking advantage of Deep Neural Networks (DNNs), the new generation of learning-based image compression algorithms has achieved prominent coding performance and outperformed conventional compression standards by 10-20% [5-17]. More importantly, the DNN-based structure for compression is very similar to that for classification. On the basis of these similarities, Torfason et al. [19] utilized the feature maps extracted by a learning-based encoder [8] for image understanding, and achieved acceptable performance on image classification and segmentation. Shen et al. [18] later verified the structural and feature similarities between learning-based image compression and classification, which indicated that realizing intelligent computer vision tasks with latent representations is feasible and even more efficient.
These methods opened an interesting and meaningful direction: realizing image compression and computer vision tasks (e.g., image classification) in an integrated framework by sharing the latent representations learned by DNNs. However, existing integrated methods with fully shared latent representations ignore either generalization or compactness. This paper focuses on finding a generalized and compact shared latent representation that integrates the image compression and classification tasks at the feature-representation level. The proposed collaborative compression and classification network is named 2C-Net for simplicity. 2C-Net is composed of four primary parts, as shown in Fig. 2: General Feature Extraction (GF-Extr), Rate Reduction (R-Red), General Feature Application (GF-App), and Rate-multi-Distortion Optimization (RmD-Opt).
• GF-Extr is fully shared by the compression and classification tasks, and is carefully designed to endow the shared latent representation with generalization, compactness, and completeness.
• R-Red adopts a hyperprior model for efficient entropy coding and reduces the total data size of the shared latent representation.
• GF-App contains a network symmetrical to GF-Extr for image reconstruction, and a feature-analytic classifier network for image classification.
• RmD-Opt balances the trade-off among compression ratio, reconstruction quality, and classification accuracy through pretraining and joint fine-tuning.

Fig. 2 The overall flowchart of the 2C-Net framework. 2C-Net is composed of four main modules. The General Feature Extraction module is equipped with a simplified network to extract compact and generalized shared latent representations. The Rate Reduction module applies a hyperprior model to perform efficient entropy coding, so as to reduce the total bit-rate. The General Feature Application module consists of two branches: the reconstruction network is the decoder for human vision, while the feature-analytic network is an image understanding branch for intelligent machine vision tasks. Finally, the Rate-multi-Distortion Optimization module is designed to balance compression ratio, reconstruction quality, and classification accuracy
Experimental results show that 2C-Net achieves convincing comprehensive performance, outperforming current collaborative image compression and classification methods [18,19]. It outperforms conventional image compression standards (i.e., JPEG) by saving about 38% of bit-rate at the same reconstruction quality. It also outperforms its baseline coding method and other classic learning-based compression methods, which makes it competitive in real applications. From the perspective of image classification, 2C-Net achieves 80.4% top-1 accuracy on the Caltech101 dataset [20] and 75.1% mAP on the Pascal VOC 2012 dataset [21] without fully decoding the bit stream, which is close to the pixel-domain accuracy. It also achieves 62.8% top-1 accuracy on the large-scale ImageNet ILSVRC 2012 dataset [22], which indicates its generalization capability and application prospects. Extensive ablation experiments prove that the proposed modifications achieve a good balance between effectiveness and efficiency, providing guidance for future work to further improve performance or apply the method to real scenarios.
The key contributions of this paper can be summarized as follows:
• We realize a collaborative image compression and classification framework by extracting compact and generalized shared latent representations.
• An improved network scheme is proposed to specifically deal with the compatibility between image compression and classification.
• The proposed method achieves competitive comprehensive performance on image compression and classification, which proves the feasibility of collaboration between human vision and computer vision.

Related work
In this section, we first review the recent studies on image classification and compression, which are two of the basic components of 2C-Net. Then, recent progress in collaborative image compression and vision tasks is introduced.

Image classification
Image classification is one of the fundamental studies in Artificial Intelligence (AI). Since AlexNet [23] achieved superior performance in 2012, Convolutional Neural Networks (CNNs) have been widely applied to image classification. VGGNet [24] then extended neural networks to a deeper stage and showed the importance of network depth in image classification. To reduce the number of parameters and extract better feature representations, Inception-Net [25] inventively replaced large convolutional kernels with a set of small ones. He et al. [26] then employed shortcut connections to build the residual block and proposed the famous ResNet to solve the gradient-vanishing problem in training, which allowed network structures to go ever deeper. On this basis, further improvements were made in DenseNet [27] and NASNet [28], making these algorithms outperform the Human Vision System (HVS) on the image classification task. These methods built important theoretical and practical foundations for 2C-Net, but cannot be directly applied in our framework, as they all take original images as input without considering the effect of image compression.

Image compression
Image compression algorithms are designed to reduce the bit-rate while keeping the visual quality of reconstructed images. Conventional codecs like JPEG [29], JPEG2000 [30], and BPG [31] mainly consist of transformation, quantization, and entropy coding. In recent years, DNNs have ushered in a new generation of image compression that outperforms conventional algorithms [10,32,33]. Ballé et al. [9] proved that Rate-Distortion Optimization (RDO) in deep image compression is equivalent to minimizing the Kullback-Leibler divergence over the data distribution, and proposed a hyperprior model for entropy coding in DNN-based image compression. Recently, autoregressive context models [11,15,33-39] and non-local attention models [12] were introduced into hyperprior-based deep compression methods and achieved remarkable performance gains.
With the help of attention models, Li et al. [14] and Cai et al. [13] further improved the rate-distortion performance of learning-based codecs.

Cooperation of image compression and intelligent tasks
Traditionally, image compression algorithms have two basic targets: compression ratio and reconstruction quality for human vision. With the popularity of AI, images have also become an important input to computer vision systems, which brings a new task for image compression: cooperating with intelligent tasks. Relevant studies can be summarized from three different aspects:

Task aware image compression
Commonly, images are compressed for transmission and must be reconstructed before computer vision tasks, as shown in Fig. 1a. Existing research [40] has shown that the distortion from image compression may cause great accuracy loss in DNN-based computer vision algorithms. Thus, many studies have contributed to adapting image compression methods to different vision tasks, such as introducing a quantization network to realize adjustable quantization [41], or utilizing an attention map to emphasize the foreground object [42]. Le et al. [43] concatenated vision tasks directly after the codec and jointly optimized the codec and the vision task together. Chamain et al. [44] made a further exploration to discuss the effectiveness of each part when jointly optimizing a codec with vision tasks. These methods improved classification performance on reconstructed images, but image decompression is still needed before the intelligent task.

Individual feature compression
Another similar way is to seek compact descriptors or features for vision tasks individually [1,2,45], as shown in Fig. 1b. The Moving Picture Experts Group (MPEG) announced standards named Compact Descriptors for Visual Search (CDVS) [1] and Compact Descriptors for Video Analysis (CDVA) [2] to generate standard descriptors for quick image and video retrieval, respectively. Recently, MPEG established a new standard group called 'Video Coding for Machines' to explore a next-generation codec that is friendly to intelligent vision tasks [45-49]. Tseng et al. [50] also proposed a linear and reversible transform to provide compact features for image classification. These compact features effectively reduce the storage and computation demands of vision tasks, but they are not complete enough to reconstruct images with high quality.

Tasks cooperating through shared features
Seeking effective representations in a shared feature space is another promising way to avoid image reconstruction for intelligent vision tasks [50-53]. Many early studies attempted to extract color histograms [3] and texture features [4] from the Discrete Cosine Transform (DCT) domain or the wavelet domain for fast image retrieval. However, these statistical features are insufficient for understanding image content, because both DCT and wavelet are signal-level transforms.
On the basis of the architectural similarity shared by DNN-based image compression and vision tasks, studies on generating a partially shared bit stream for both image compression and intelligent tasks have attracted much attention. Zhang et al. [54] trained compression and feature extraction networks separately and then combined them for image retrieval through fine-tuning. Liu et al. [55] considered image classification features a subset of the essential information required by image reconstruction and proposed a scalable compression scheme for human and machine vision tasks. These methods can generate a partially shared bit stream for compression and vision tasks, but they also bring extra computational costs during encoding.
Another way is to generate a fully shared feature representation for image compression and vision tasks. Among these methods, Shen et al. [18] and Liu et al. [56] tried to reuse compact features extracted by a DNN-based codec for image compression and retrieval tasks, but suffered performance loss due to the weak generalization capability of the compact features. Torfason et al. [19] employed features from very early DNN layers to ensure generalization, but also introduced a huge amount of spatial redundancy, which harms compression efficiency.
Our 2C-Net focuses on the fully shared strategy, which takes both compactness and generalization into consideration from the very beginning, and tries to keep the generalization capability within the shared latent representation through network modification and Rate-multi-Distortion Optimization.

2C-Net : integrate image classification and compression
In this section, we propose the improved 2C-Net framework to realize collaborative compression and classification by sharing a set of general and compact latent representations (as shown in Fig. 1c). Specifically, the 2C-Net framework (see Fig. 2) is composed of four main modules, i.e., GF-Extr, R-Red, GF-App, and RmD-Opt. GF-Extr shoulders the critical mission of generating a shared latent representation that benefits both image compression and classification synchronously. The R-Red module is used to reduce the bit-rate of the latent representation output by GF-Extr. In GF-App, a network symmetrical to GF-Extr is adopted to reconstruct images from the shared latent representation. At the same time, a feature-analytic classifier is proposed to reorganize the latent representation to fit the requirements of image classification. Finally, the overall optimization of 2C-Net can be described as an RmD-Opt process. The most crucial problems in this collaborative task then become:
• How to generalize the shared latent representation and then adaptively apply it to image reconstruction and classification.
• How to reduce the bit-rate of the shared latent representation with acceptable distortion in both reconstruction quality and classification accuracy.
To solve these problems, the details of each module are provided as follows.

GF-Extr : general feature extraction
In this paper, the variational image compression method named Ballé18 [9] is adopted as the baseline, because its basic framework is widely used in today's DNN-based compression methods. Moreover, the encoder of Ballé18 is relatively simple, which is consistent with our demand for feature generalization. On this basis, further modifications are made for better generalization, compactness, and completeness in our proposed GF-Extr.

Modification for generalization
The original feature extractor in Ballé18 [9] is specifically designed for image compression, without taking image classification into consideration. To maintain the generalization of the latent representation and facilitate feature extraction in the deep layers of 2C-Net, we use a lightweight residual unit as the basic unit of GF-Extr. As illustrated in Fig. 3, the lightweight residual unit is composed of only one convolution layer to extract primary and common features, while the shortcut is used to prevent gradient vanishing when training for the classification task. The whole general feature extraction process thus contains only six convolutional layers: four lie in residual blocks, and the remaining two sit at the beginning and the end of the network to adjust the number of feature channels.
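To make the idea concrete, the lightweight residual unit amounts to a single convolution plus an identity shortcut. The following minimal NumPy sketch illustrates this for one channel with a 3 × 3 kernel and zero padding; the kernel values and sizes are illustrative assumptions, not the paper's trained weights:

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive single-channel 3x3 convolution with zero padding ('same' size)."""
    h, w = x.shape
    padded = np.pad(x, 1)
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + 3, j:j + 3] * kernel)
    return out

def lightweight_residual_unit(x, kernel):
    """One convolution layer plus an identity shortcut: out = x + conv(x).

    The shortcut passes information (and, in training, gradients) unchanged,
    which is what prevents gradient vanishing when such units are stacked.
    """
    return x + conv2d_same(x, kernel)

# With an identity kernel the unit simply doubles its input,
# showing that the shortcut path is preserved exactly.
identity = np.zeros((3, 3))
identity[1, 1] = 1.0
x = np.arange(16, dtype=float).reshape(4, 4)
y = lightweight_residual_unit(x, identity)
```

In a trained network the kernel would of course be learned; the sketch only shows how the shortcut keeps the input signal intact alongside the convolved features.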

Modification for compactness
The compactness of the extracted features is another important concern for efficient compression of the shared latent representation. Thus, the number of down-sampling layers should be adjusted according to the resolution of the input image. For images with 256 × 256 or higher resolution, we employ four down-sampling layers to balance the detail and the compactness of the shared latent representation. Besides, we also follow Ballé18 [9] in using Generalized Divisive Normalization (GDN) [57] to better Gaussianize the shared latent representation, which benefits rate reduction.
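As a quick sanity check on the choice of four down-sampling layers, each stride-2 convolution halves the spatial resolution. A small sketch (the helper name is ours):

```python
def latent_spatial_size(h, w, n_down=4):
    """Spatial size of the latent representation after n_down stride-2 layers.

    Each stride-2 convolution with 'same' padding halves the resolution,
    rounding up for odd sizes.
    """
    for _ in range(n_down):
        h, w = (h + 1) // 2, (w + 1) // 2
    return h, w

# A 256 x 256 input maps to a 16 x 16 latent grid, i.e., each latent
# vector summarizes a 16 x 16 pixel region of the input image.
size = latent_spatial_size(256, 256)
```

Fewer down-sampling layers would keep more spatial detail but inflate the latent representation; more would shrink it further at the cost of reconstruction quality.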

Modification for completeness
Concerning high-quality image reconstruction, we also take feature completeness into account. Some common operations used in primary feature extraction for classification (e.g., max-pooling, average-pooling) are not included in GF-Extr, because they irreversibly discard information. Instead, strided convolutional layers are used for down-sampling to ensure the reversibility of the general feature extraction.

R-Red : rate reduction
After extracting compact latent representations, entropy coding can further reduce the total bit-rate. Different from former attempts at combining image compression and intelligent tasks, we use the hyperprior model [9] to better fit the characteristics of the shared latent representation. Basically, the Rate-Distortion Optimization problem can be formulated as

L = R + λ · D = E_{x∼p_x}[−log₂ p_ŷ(ŷ)] + λ · E_{x∼p_x}[d(x, g(ŷ))],   (1)

where R stands for rate, D is the reconstruction distortion, x is the input image, and p_x is its natural distribution. y = f(x) is the encoder output, while ŷ stands for the quantized latent representation. g(⋅) is the decoder and d(⋅) is the measurement of reconstruction distortion.
As shown in Fig. 4, for hyperprior-based methods, a set of compact representations z is further extracted to approximate the distribution of the quantized latent representation ŷ. Then, the hyperdecoder takes the quantized compact representation ẑ as input and reconstructs the full statistical characteristics of ŷ to guide its arithmetic coding and decoding. Therefore, the probability distribution of ŷ is

p_ŷ|ẑ(ŷ | ẑ) = ∏ᵢ p(ŷᵢ | ẑ).   (2)

In practice, we use a Gaussian distribution as the prior to model the distribution of the quantized shared latent representation ŷ, and use a fully factorized model to describe the distribution of the compact representation ẑ. Then, according to Ballé18 [9], their distributions can be expressed as

p_ẑ|ψ(ẑ | ψ) = ∏ᵢ (p_{zᵢ|ψ⁽ⁱ⁾} ∗ U(−1/2, 1/2))(ẑᵢ),   (3)

p_ŷ|ẑ(ŷ | ẑ) = ∏ᵢ (N(μᵢ, σᵢ²) ∗ U(−1/2, 1/2))(ŷᵢ),   (4)

where N represents the Gaussian distribution with mean μᵢ and variance σᵢ². U(−1/2, 1/2) represents the uniform distribution over [−1/2, 1/2], which is used to approximate the non-differentiable quantization operation during training. Under this hyperprior assumption, the task of the R-Red module is to force the numerical distribution of the shared latent representation ŷ to fit the Gaussian distribution in Eq. 4. The closer the actual distribution of ŷ is to the prior distribution, the smaller the divergence between them, so the total bit-rate can be reduced.
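The bit cost implied by the Gaussian-uniform model in Eq. 4 can be sketched directly: the probability mass of a quantized symbol is the Gaussian CDF evaluated over the unit-width bin around it, and its rate is the negative log of that mass. A minimal illustration using only the standard library (function names are ours):

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of a Gaussian N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def rate_bits(y_hat, mu, sigma):
    """Estimated bits for one quantized symbol under N(mu, sigma^2) * U(-1/2, 1/2).

    Convolving the Gaussian with U(-1/2, 1/2) turns the density into the
    probability mass of the unit-width quantization bin centered at y_hat.
    """
    p = normal_cdf(y_hat + 0.5, mu, sigma) - normal_cdf(y_hat - 0.5, mu, sigma)
    return -math.log2(max(p, 1e-12))
```

Symbols close to the predicted mean are cheap to code, while outliers are expensive; this is exactly why forcing ŷ toward the prior distribution reduces the total bit-rate.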

GF-App : general feature application
Processed by the GF-Extr and R-Red modules, the shared latent representation contains sufficient but highly compressed information for both image compression and classification. A specific GF-App module is then designed to decouple the shared representation and apply it to image reconstruction and classification, respectively.

Image reconstruction
The network for image reconstruction follows the basic idea of the auto-encoder, using a nearly symmetrical structure to the simplified network in GF-Extr (see Fig. 5). Concretely, transposed convolution layers are used for up-sampling, whereas the residual units and inverse GDN (IGDN) layers keep the same settings as in GF-Extr. With this modification, the reconstruction network can learn to perform the inverse transformation of GF-Extr and accurately reconstruct images from the shared latent representation.

Fig. 6 The detailed structure of the feature-analytic classifier. The depth-wise separable convolution is used to reorganize the shared latent representation for the classification task. The shared latent representations are first processed by H individual 3 × 3 convolution kernels. Then, a 1 × 1 convolution kernel is used to learn the weight of each channel for feature fusion along channels. After the depth-wise separable convolution, the res-blocks use a similar structure to those in ResNet-18 [26], but the network is extended to better extract useful patterns across channels

Feature-analytic classifier
Different from conventional classifiers, the classification branch in the GF-App module takes the latent representation as input, which has a smaller spatial size but more feature channels compared with the original image. Thus, a specific feature-analytic classifier (shown in Fig. 6) is proposed to accommodate these inputs. First, a depth-wise separable convolution is used to decouple and reorganize the shared latent representation. Then, feature channel extension is applied to place more emphasis on the channel dimension during further feature extraction.

Feature reorganization: The GDN layer in GF-Extr and the hyperprior model used in the R-Red module both restrict the numerical distribution of the shared latent representation ŷ to a channel-wise Gaussian distribution. As a result, shared latent representations from different feature channels have different statistical characteristics. Standard convolution would brutally mix these differently Gaussianized features along the channel dimension and eventually decrease the classification accuracy of 2C-Net. To deal with this problem, a depth-wise separable convolution (shown in the bottom part of Fig. 6) is used at the beginning of the feature-analytic classifier to reorganize the received latent representation ŷ. Every feature channel in ŷ is processed by an independent convolutional kernel to mitigate the effect caused by Gaussianization. Then, the transformed feature channels are fused according to the weights learned by a 1 × 1 convolution. Besides its effectiveness in adaptive feature reorganization, the depth-wise separable convolution also saves computational resources by replacing a 3 × 3 × H_in × H_out standard convolution with 3 × 3 × 1 × H_in and 1 × 1 × H_in × H_out convolutions, where H_in and H_out represent the input and output channel numbers of the depth-wise separable convolution.
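The parameter saving claimed above is easy to verify numerically. A small sketch (the channel counts 192 and 256 are illustrative values, not the paper's configuration):

```python
def standard_conv_params(h_in, h_out, k=3):
    """Parameters of a standard k x k convolution layer (biases omitted)."""
    return k * k * h_in * h_out

def depthwise_separable_params(h_in, h_out, k=3):
    """Parameters of a depth-wise k x k convolution (one kernel per input
    channel) followed by a 1 x 1 point-wise convolution (biases omitted)."""
    return k * k * h_in + h_in * h_out

# Example: 192 input channels, 256 output channels.
std = standard_conv_params(192, 256)        # 442,368 parameters
dws = depthwise_separable_params(192, 256)  # 50,880 parameters
```

For these illustrative channel counts, the depth-wise separable variant needs roughly 9× fewer parameters, while still producing one reorganized feature per output channel.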
Feature channel extension: After the depth-wise separable convolution, a residual network, typically ResNet-18 [26], is used as the backbone classifier. Considering that the input latent representation has a small spatial size but a large number of feature channels, we propose to emphasize further feature extraction along the channel dimension. Thus, the feature channels of all residual blocks are extended, whereas operations on the spatial dimension, such as the large-kernel convolution layer and the max-pooling layer, are removed. Moreover, we propose to extend the feature channels gradually and to avoid any channel-number "bottleneck" during network propagation (see Fig. 6). Because the input latent representation has very low data redundancy, any information loss caused by channel reduction is irreversible and may lead to a considerable accuracy decrease.

RmD-Opt : rate-multi-distortion optimization
Different from single image compression or classification tasks, the optimization of 2C-Net should be considered from three aspects: total bit-rate, image reconstruction quality, and classification accuracy. Treating both the reconstruction and classification losses as distortion terms in the training of 2C-Net, the optimization of 2C-Net can be described as Rate-multi-Distortion Optimization (RmD-Opt):

L = R + λ₁ · D_rec + λ₂ · D_cls,   (5)

where D_rec and D_cls stand for the reconstruction loss and classification loss, respectively, λ₁ and λ₂ are balance coefficients, and R represents the bit-rate of the compressed shared latent representation.

Measurements in optimization
Since the HVS and DNN-based feature extraction are both sensitive to the structural information of image contents [58,59], we select the Multi-Scale Structural SIMilarity index (MS-SSIM) as the distortion measurement for image reconstruction, which is a perceptual loss widely used in image quality assessment [60] and image compression [9]. The reconstruction distortion term in Eq. 5 is then modeled as

D_rec = 1 − MS-SSIM(x, x̂),   (6)

where x̂ represents the image reconstructed from the latent representation ŷ, and x is the original image. With the guidance of MS-SSIM, the shared latent representation is expected to generalize better to the classification task, with fewer structural losses in image reconstruction.
For the classification task, cross-entropy is utilized to measure the classification loss:

D_cls = −(1/N) · ∑ᵢ₌₁ᴺ yᵢ · log(pᵢ),   (7)

where N represents the training batch size, yᵢ is the training label, and pᵢ is the output of the softmax layer.
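Combining the rate term with the two distortion terms above, the joint objective can be sketched as follows (a minimal illustration; the toy values in the comments are ours, not experimental settings):

```python
import math

def cross_entropy(labels, probs):
    """Mean cross-entropy over a batch: labels are class indices,
    probs are the rows of softmax outputs."""
    n = len(labels)
    return -sum(math.log(p[y]) for y, p in zip(labels, probs)) / n

def rmd_loss(rate_bpp, ms_ssim, labels, probs, lam1, lam2):
    """Rate-multi-distortion objective: R + lam1 * D_rec + lam2 * D_cls,
    with D_rec = 1 - MS-SSIM and D_cls the batch cross-entropy."""
    d_rec = 1.0 - ms_ssim
    d_cls = cross_entropy(labels, probs)
    return rate_bpp + lam1 * d_rec + lam2 * d_cls
```

Larger lam1 pushes training toward reconstruction fidelity, larger lam2 toward classification accuracy, and both trade off against the rate term.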

Quality-accuracy consistency
The joint optimization of 2C-Net involves three aspects, namely total bit-rate R, reconstruction quality Q, and classification accuracy A. In our 2C-Net framework, GF-Extr is specifically designed to satisfy both image compression and classification, as described in Sect. 3.1. On the basis of this structure, the relation between Q and A is explored to support the RmD-Opt in 2C-Net.
For this purpose, 2C-Net is first trained to optimize only Q and A, without taking R into consideration, to verify whether there are potential conflicts between the two tasks. The performance curves of Q and A are reported in Fig. 7. In addition, a classifier with the same network structure as GF-Extr plus the classifier branch (removing the reconstruction branch) is trained specifically for classification to serve as the control group. Along the training process, Q and A increase almost synchronously, especially in the early stage of training. Only in the later stage of training do reconstruction quality and classification accuracy begin to conflict, which needs to be further reconciled.

Optimization strategy
Because the optimization of 2C-Net contains multiple distortions that differ greatly in order of magnitude, it is hard to find the best balance through direct joint training from scratch. According to the exploration of the Q-A relation, a practical optimization strategy is proposed for the RmD-Opt problem.
Accordingly, the GF-Extr module is first optimized for the compression task until the training procedure is close to convergence. Then, the latent representation extracted by GF-Extr is fed into the feature-analytic classifier to train for the classification task. When both distortion terms in Eq. 5 are close to convergence, different sets of λ₁ and λ₂ are selected to balance the comprehensive performance of 2C-Net and approximate the joint optimum.

Experiment
In this section, the comprehensive performance of 2C-Net is compared with other methods of similar framework (i.e., Shen18 [18] and Torfason18 [19]) and a baseline method that directly uses the output of Ballé18 [9] for image classification. Considering that 2C-Net is designed for executing intelligent computer vision tasks in the compressed domain, it should take retaining compression performance as a premise. Comparison experiments with classical and state-of-the-art image compression algorithms are therefore executed, to see whether the modifications to the backbone hurt the performance of image compression. In addition, ablation studies are executed to present the effectiveness of each modification.

Datasets
Several typical datasets for both image compression and image classification are used in the experiments: the ImageNet ILSVRC 2012 dataset [22], the Caltech101 dataset [20], the Pascal VOC 2012 dataset [21], and the Kodak dataset [61].
The ImageNet ILSVRC 2012 dataset contains 1M natural images with different resolutions and various image contents, distributed across 1000 diverse classes. The Caltech101 dataset provides a single-label classification task with low-resolution images (typically 200 × 300) over 101 categories; each class contains roughly 40 to 800 images, totaling around 9000 images. The Pascal VOC 2012 dataset is designed for a multi-label classification task, in which a single image may contain one or more objects. It provides 17,125 high-resolution images covering 20 classes of objects. The Kodak dataset contains 24 high-resolution uncompressed images, commonly used to evaluate image compression methods.

Preprocessing
All images used in the training and testing stages for both the compression and classification tasks are simply normalized to [0, 1]. During training and testing for the compression task, the width and height of each input image are cropped to the nearest integer multiple of 64, to ensure that the input image and the reconstructed image have the same spatial size.
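The cropping rule can be sketched as follows (helper name is ours):

```python
def crop_to_multiple(height, width, multiple=64):
    """Crop spatial dimensions down to the nearest multiple of `multiple`.

    With dimensions divisible by 64, the stride-2 down-samplings and their
    symmetric up-samplings restore exactly the input size, so the original
    and reconstructed images can be compared pixel by pixel.
    """
    return (height // multiple) * multiple, (width // multiple) * multiple
```

For example, a 500 × 374 image would be cropped to 448 × 320 before being fed to the encoder.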
Fig. 7 The relation between reconstruction quality and classification accuracy during the training process. The blue line is the trend of classification accuracy during training, while the brown line indicates the reconstruction quality at the corresponding training epoch. As a comparison, the blue dashed line shows the accuracy curve when training the same network only for the classification task.

Training scheme
According to the training strategy in Sect. 3.4, we first train GF-Extr, R-Red, and the reconstruction branch of the GF-App module for image compression with Eq. 1, where λ is set to 10, 50, 100, and 200 for different bit-rates. We choose Adam as the optimizer, with the initial learning rate set to 10⁻⁴; the learning rate is then decayed by a factor of 10 every 200,000 steps. The whole training procedure lasts for three epochs over the ImageNet ILSVRC 2012 dataset. When the above modules are close to convergence, we fix their parameters and only train the feature-analytic classifier branch in the GF-App module. In this stage, the batch size is set to 8 and the initial learning rate is 10⁻⁴. For the Caltech101 dataset, we additionally use knowledge distillation at the last softmax layer to avoid over-fitting on this small-scale dataset. Finally, the whole 2C-Net framework is trained jointly according to Eq. 5 with different sets of weight coefficients λ₁ and λ₂ to further balance compression ratio, reconstruction quality, and classification accuracy.
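The step-wise learning-rate decay described above can be sketched as (function name is ours):

```python
def learning_rate(step, base=1e-4, decay_every=200_000):
    """Initial rate `base`, divided by 10 after every `decay_every` steps."""
    return base * (0.1 ** (step // decay_every))
```

So the first 200,000 steps run at 10⁻⁴, the next 200,000 at 10⁻⁵, and so on.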

Comparison setups
To verify the comprehensive performance, we compare the proposed 2C-Net with the similar fully shared latent-representation-based methods "Shen18" and "Torfason18". In addition, the baseline version of 2C-Net (denoted "Ballé18_ex"), which directly concatenates the same feature-analytic classifier after the Ballé18 encoder, is also compared.
Moreover, we also compare our method with reconstruct-then-classify methods based on JPEG, BPG, and the codec of 2C-Net. For these methods, images are compressed by the encoders and reconstructed before being fed into the classifier. We utilize the backbone classifier of 2C-Net (i.e., ResNet-18) as their classifier. Each ResNet-18 classifier is trained and tested on the corresponding reconstructed images.
During the comparison, the compression efficiency is measured by compression ratio (Bits Per Pixel, BPP) and reconstruction quality (MS-SSIM), while the classification performance is measured by Top-1 accuracy and mAP.
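The two rate-side metrics above are straightforward to compute; a minimal sketch (function names and the example bitstream size are illustrative, not from the paper):

```python
def bits_per_pixel(bitstream_bytes: int, height: int, width: int) -> float:
    """Compression rate in Bits Per Pixel (BPP): total bits over pixel count."""
    return bitstream_bytes * 8 / (height * width)

def top1_accuracy(predictions, labels) -> float:
    """Fraction of samples whose top-1 prediction matches the ground-truth label."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)

# e.g. a hypothetical 12 kB bitstream for a 768x512 Kodak image:
print(bits_per_pixel(12_000, 512, 768))   # 96000 / 393216 ≈ 0.244 BPP
print(top1_accuracy([1, 2, 3], [1, 2, 0]))  # 2 of 3 correct ≈ 0.667
```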

Evaluation and analysis
The comprehensive performance is evaluated on three widely used image classification datasets (i.e., Caltech101, Pascal, and ImageNet). Results are shown in Table 1 and Fig. 8. The results of Shen18 and Torfason18 are taken directly from their papers, which report results on only part of these datasets.
Compared to other fully shared latent representation-based methods, the proposed 2C-Net achieves significant performance gains in both the rate-distortion curve (Fig. 8, left) and the rate-accuracy curve (Fig. 8, right). This indicates that the shared latent representation extracted by the proposed 2C-Net strikes a good balance between compactness and generalization. Compared with the baseline method Ballé18_ex, our 2C-Net achieves higher compression efficiency and classification accuracy at a similar compression ratio. As mentioned in Sect. 4.2.1, the proposed 2C-Net differs from the baseline Ballé18_ex only in the GF-Extr module, so this result directly verifies the effectiveness of the proposed GF-Extr module.
Since the proposed 2C-Net classifies directly on the shared latent representation, it saves about 7-73 ms of reconstruction time per image while achieving better compression efficiency than reconstructing-then-classifying methods. Moreover, on the Caltech101 dataset, the proposed 2C-Net achieves similar or even better classification accuracy than those reconstructing-then-classifying methods, and its accuracy on the Pascal dataset is also close to theirs.
For the comprehensive performance on the ImageNet dataset, the proposed 2C-Net outperforms Torfason18 by 4-5 dB in MS-SSIM at a similar compression ratio, which is a huge improvement in compression efficiency. In terms of classification accuracy, 2C-Net falls behind Torfason18 and BPG_rec, because the latent representation of Torfason18 is extracted from very shallow layers and thus retains more details to support large-scale classification. However, this kind of latent representation also contains huge spatial redundancy. To make a clearer comparison, Fig. 9 gives a full view of the comprehensive performance of 2C-Net and Torfason18 in the three-dimensional space spanned by BPP, MS-SSIM, and accuracy. To better demonstrate the reconstruction quality, the MS-SSIM is shown in decibels, calculated as MS-SSIM (dB) = −10 · log10(1 − MS-SSIM). The results in Fig. 9 indicate that the modifications in 2C-Net are effective in extracting compact and generalized latent representations, yielding a better solution plane in the performance space spanned by BPP, MS-SSIM, and accuracy.
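The decibel conversion used for MS-SSIM here is the mapping commonly used in the learned-compression literature, −10 · log10(1 − MS-SSIM); a one-line helper makes the scale concrete:

```python
import math

def ms_ssim_db(ms_ssim: float) -> float:
    """Convert a linear MS-SSIM score in [0, 1) to decibels.

    Higher is better; e.g. MS-SSIM = 0.99 maps to 20 dB.
    """
    return -10.0 * math.log10(1.0 - ms_ssim)

print(ms_ssim_db(0.99))  # 20.0
```

On this scale, a 4-5 dB gap corresponds to a large difference in residual distortion, since each 10 dB reduces (1 − MS-SSIM) by a factor of ten.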

Extended study
To evaluate and understand the effectiveness of each improvement proposed in 2C-Net, a series of extended experiments is conducted with the same data pre-processing and training scheme on the Kodak and Pascal VOC 2012 datasets.

Compression performance
As a collaborative image compression and classification method, the proposed 2C-Net is also competent at the conventional image compression task. Thus, we further compare 2C-Net with other compression methods on the widely used Kodak dataset to give a general view of its compression efficiency.
Considering the trade-off between compression efficiency and computational complexity, 2C-Net takes the network of Ballé18 [9] as its baseline, which is the foundation of many emerging DNN-based image compression methods. In this section, early DNN-based methods such as Rippel17 [62] and a compression method equipped with an auto-regressive context model and non-local attention (i.e., Non-local19 [12]) are involved in the comparison, together with conventional codecs.

The average compression performance on the Kodak dataset is shown in Fig. 10. It is obvious that emerging DNN-based methods achieve more than 1 dB performance gain compared with conventional methods. With the modification of GF-Extr and the Gaussian hyperprior model used in R-Red, 2C-Net further improves the compression performance on the basis of Ballé18. This proves that GF-Extr can extract general features compatible with classification, while keeping competitive performance on image compression. As an advanced method built on the hyperprior model, Non-local19 further improves the coding performance by adopting non-local attention and an auto-regressive context model, which confirms the extensibility of the baseline method. However, it requires more complicated computation, which makes it inappropriate as the baseline for this work. Figure 11 gives several examples of reconstructed images of different kinds, compressed at different bit-rates with different codecs.

Fig. 9 The detailed performance comparison on the ImageNet dataset. Comprehensively, the proposed 2C-Net achieves a better solution plane over the 3D performance space spanned by rate, distortion, and accuracy.

Fig. 10 The RD curves averaged on the Kodak dataset. Ballé18 is the baseline codec of 2C-Net, while Non-local19 is a recent compression method based on the hyperprior model but additionally adopting non-local attention and a context model. The conventional compression methods are drawn in dashed lines.
It can be observed that 2C-Net provides much better subjective quality than JPEG and JPEG2000 at very low bit-rates (i.e., < 0.2 BPP). Although increasing the bit-rate may reduce the gap, 2C-Net still presents clearer details.

Latent representation: generalization or compactness
Usually, features from shallow layers are more generalized but contain huge spatial redundancy, while deep features are more compact but specific to a certain task. Thus, the balance between generalization and compactness is a crucial issue for GF-Extr. In this experiment, different combinations of pooling layers and conv-layers are tested to uncover their effect on the generalization and compactness of the latent representation. We trained three types of 2C-Net, with 2 pooling layers and 4 conv-layers (P2C4), 4 pooling layers and 4 conv-layers (P4C4), and 4 pooling layers and 8 conv-layers (P4C8) in the GF-Extr module, respectively. All of them share the same 2C-Net structure except for the above-mentioned modifications and the symmetrical changes in the reconstruction network. All three models are trained to work at a similar compression ratio for a fair comparison. The compression efficiency and classification accuracy of the different configurations are shown in Table 2. P2C4 achieves the best classification accuracy, whereas its reconstruction quality degrades greatly. When the number of pooling layers is increased to 4, reconstruction quality improves significantly, whereas classification accuracy degrades. When the number of conv-layers is further increased to 8, reconstruction quality remains unchanged, but classification accuracy degrades further. These results indicate that the number of pooling layers plays a more important role in balancing the generalization and compactness of the shared latent representation, while more conv-layers hurt the classification performance. At a similar compression ratio, the GF-Extr module with four pooling layers and four convolution layers achieves a better balance between reconstruction quality and classification accuracy.
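The three variants can be sketched as follows. This is a structural illustration only: the paper does not specify the exact layer parameters, so the kernel sizes are assumptions, and downsampling is modeled here with stride-2 convolutions; the point is how the pooling count controls the spatial resolution (and hence the redundancy) of the latent representation.

```python
import torch
import torch.nn as nn

def gf_extr_variant(num_pool: int, num_conv: int, channels: int = 128) -> nn.Sequential:
    """Illustrative GF-Extr variant: `num_conv` conv layers, of which the first
    `num_pool` downsample by 2 (layer details are assumptions, not the paper's)."""
    layers, in_ch = [], 3
    for i in range(num_conv):
        stride = 2 if i < num_pool else 1  # each "pooling" halves H and W
        layers += [nn.Conv2d(in_ch, channels, 3, stride=stride, padding=1), nn.ReLU()]
        in_ch = channels
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 256, 256)
for name, (p, c) in {"P2C4": (2, 4), "P4C4": (4, 4), "P4C8": (4, 8)}.items():
    print(name, tuple(gf_extr_variant(p, c)(x).shape))  # spatial size 256 / 2**p
```

P2C4 keeps a 4x larger spatial map (more generalized, more redundant), while P4C4/P4C8 produce the 16x-downsampled, more compact latent used by the final 2C-Net.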

Effectiveness of feature reorganization
The R-Red module restricts the numerical distribution of the shared latent representation, which may affect normal feature expression. To reduce this impact, we use a depth-wise separable convolution to reorganize the received latent representation before it is fed into the feature-analytic classifier. To evaluate its effectiveness, a 2C-Net without the depth-wise separable convolution is trained with the same training scheme. Moreover, because the modification is made only in the feature-analytic classifier, all the other modules are reused and kept fixed to ensure that both models share the same latent representation and the same compression efficiency. The experimental results are shown in Table 3. They clearly show that the classifier with the depth-wise separable convolution achieves better classification accuracy at all bit-rates, indicating the necessity and effectiveness of feature reorganization.
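A depth-wise separable convolution factors a standard convolution into a per-channel spatial filter followed by a 1x1 cross-channel mixer. A minimal PyTorch sketch of such a reorganization layer is below; the class name, kernel size, and channel count (128, matching the shared latent) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FeatureReorganization(nn.Module):
    """Depth-wise separable convolution reorganizing the latent representation
    before classification (a sketch; kernel size is an assumption)."""
    def __init__(self, channels: int = 128):
        super().__init__()
        # Depth-wise: one 3x3 filter per channel (groups == channels).
        self.depthwise = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        # Point-wise: 1x1 convolution mixing information across channels.
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

y = FeatureReorganization()(torch.randn(2, 128, 16, 16))
print(tuple(y.shape))  # (2, 128, 16, 16): shape preserved, channels remixed
```

The factorization keeps the parameter count low (roughly 3·3·C + C·C weights instead of 3·3·C·C), which is why it adds little overhead to the classifier.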

Feature channels: wide or wider
In the feature-analytic classifier, wider residual blocks are used to extract features across channels and avoid information loss during network propagation. In the proposed 2C-Net, the initial number of channels in the feature-analytic classifier is set to 128, equal to the number of channels of the shared latent representation. To evaluate the impact of feature channel extension, we add a convolution layer after the DS-conv to adjust the number of feature channels to 64 and 192, while each of the following residual blocks doubles the channel count of the previous one. Taking 2C-Net at a low bit-rate as an example, all the models are trained to 0.17 BPP. The classification accuracy and number of parameters are reported in Table 4. The results indicate that wider initial feature channels benefit classification accuracy at the cost of computational complexity. The proposed feature-analytic classifier uses 128 feature channels to make 2C-Net more cost-effective.
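Because every later stage doubles the previous width, the initial channel count compounds through the whole trunk, which is why the parameter budget is so sensitive to it. The sketch below illustrates this scaling; the stage structure here is a stand-in (plain convolutions rather than the paper's residual blocks), so the absolute counts are not those of Table 4.

```python
import torch.nn as nn

def classifier_trunk(init_channels: int, num_stages: int = 4) -> nn.Sequential:
    """Sketch of the feature-analytic classifier trunk: a 1x1 conv adjusts the
    initial width, then each stage doubles the channel count of the previous
    one (stage internals are assumptions, not the paper's residual blocks)."""
    layers = [nn.Conv2d(128, init_channels, 1)]  # latent has 128 channels
    ch = init_channels
    for _ in range(num_stages):
        layers += [nn.Conv2d(ch, ch * 2, 3, padding=1), nn.ReLU()]
        ch *= 2
    return nn.Sequential(*layers)

for width in (64, 128, 192):
    n_params = sum(p.numel() for p in classifier_trunk(width).parameters())
    print(width, n_params)  # parameter count grows steeply with initial width
```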

Discussion
The above-mentioned ablation experiments verify that the network modifications made in our proposed 2C-Net framework provide a good trade-off between image compression and classification at the feature representation level. Among these modifications, the feature reorganization conducted by the depth-wise separable convolution improves classification accuracy at all bit-rates while reducing computational cost. On the other hand, the implementation of GF-Extr and the extension of feature channels are more flexible choices: according to practical demands, different implementations can be applied to adapt the 2C-Net framework to other situations.

Conclusion
In this paper, we propose the 2C-Net framework to integrate image compression and classification through a shared latent representation. By inheriting the network from learning-based image compression, this work achieves competitive compression performance. On this basis, careful modifications are proposed to empower the network with general feature extraction for classification. Experimental results show that learning-based image compression has great potential to be compatible with classification, and that this ability generalizes across datasets. Nevertheless, this new framework is an early and ambitious attempt, and many challenging problems remain. We will further explore new possibilities in the following aspects: first, further improving the classification performance using advanced networks and training techniques, and verifying the efficiency on more difficult classification datasets; second, extending the approach to other intelligent tasks, such as retrieval and segmentation; third, using the output of the hyper-encoder to further improve the performance; and finally, exploring an incremental bit stream that can support intelligent tasks with a subset of the stream and reconstruct images with the full stream.

Fig. 11 The reconstruction quality of 2C-Net, JPEG, and JPEG2000 at low, middle, and high bit-rates