Balancing the Encoder and Decoder Complexity in Image Compression for Classification

This paper presents a study on the computational complexity of coding for machines, with a focus on image coding for classification. We first conduct a comprehensive set of experiments to analyze the size of the encoder (which encodes images to bitstreams), the size of the decoder (which decodes bitstreams and predicts class labels), and their impact on the rate-accuracy trade-off in compression for classification. Through empirical investigation, we demonstrate a complementary relationship between the encoder size and the decoder size, i.e., it is better to employ a large encoder with a small decoder and vice versa. Motivated by this relationship, we introduce a feature compression-based method for efficient image compression for classification. By compressing features at various layers of a neural network-based image classification model, our method achieves adjustable rate, accuracy, and encoder (or decoder) size using a single model. Experimental results on ImageNet classification show that our method achieves competitive results with existing methods while being much more flexible. The code will be made publicly available.


Introduction
Data compression is a fundamental problem in information theory as well as many real-world applications. Recently, coding for machines has emerged as a promising scheme in data compression, in which one aims to represent data using as few bits as possible while retaining high prediction accuracy for downstream vision tasks. Coding-for-machines approaches already have many potential applications, such as edge-cloud computing [1][2][3], privacy-preserving communication [4,5], and the Internet of Things [6,7].
Most existing research on coding for machines focuses on the rate-accuracy trade-off, where rate measures the average number of bits per sample produced by the encoder, and accuracy measures the downstream task (e.g., ImageNet classification [8]) prediction accuracy on the decoder side. This is a natural extension of the classical rate-distortion trade-off in lossy compression, and various prior works have studied coding-for-machines problems based on rate-distortion theory [9][10][11]. In particular, Dubois et al. [9] have theoretically proved that, for certain tasks, it is possible to achieve high prediction accuracy at an extremely low rate compared to reconstruction-oriented compression.
Fig. 1: Difference between coding for humans (or compression for reconstruction, Fig. 1a) and coding for machines (or compression for prediction, Fig. 1b). This paper focuses on coding for machines, where one wants to encode data X to a compressed representation Z such that predicting a target Y from Z is as good as predicting from X.

Yet what is feasible in theory may not always be achievable in practice. Rate-distortion theory, which describes the best possible rate-distortion (or rate-accuracy, in our context) trade-off, assumes an unbounded encoder and decoder family. However, many practical applications involve constraints on computational resources (e.g., for mobile and wearable devices), which limit the choice of encoder and decoder architectures. Such constraints render the theoretical rate-accuracy bounds inapplicable, calling for a more practical analysis of coding-for-machines performance under computational constraints.
Motivated by this, we first study the rate-accuracy trade-off in coding for machines under computational constraints on the encoder and decoder. Our analysis reveals that such constraints greatly impair the rate-accuracy trade-off, leading to a three-way trade-off between rate, accuracy, and computational complexity. Targeting this trade-off, we propose an end-to-end learning framework for adjusting these three quantities using a single neural network model, which has not been achieved by existing methods.
To summarize, we make the following contributions to the study of coding for machines: (Sec. 3) We empirically investigate the impact of encoder/decoder size on coding-for-machines performance. Our analysis reveals a complementary relationship between the encoder and decoder sizes, as well as a three-way trade-off between rate, distortion, and encoding complexity; (Sec. 4 and 5) We propose a novel method for adjusting this three-way trade-off that requires only a single neural network model. Experiments show that our method achieves rate-distortion-complexity performance comparable to existing methods while being much more flexible for deployment.

Preliminaries
This section briefly reviews the theoretical foundations of lossy compression. Readers familiar with rate-distortion theory and coding for machines may skip this section.
Rate-distortion theory. Let X ∼ p_X denote the source data variable. In lossy compression for reconstruction (Fig. 1a), also referred to as coding for human vision, the goal is to represent X using as few bits as possible, from which one can obtain a good reconstruction X̂. The distortion is quantified by a function d(·, ·) measuring the difference between X and X̂. Given a distortion threshold D ∈ R, the minimum rate (i.e., average number of bits per sample when compressing an i.i.d. sequence of X) required to achieve E[d(X, X̂)] ≤ D is given by the information rate-distortion function [12]:

R(D) = min_{p_{X̂|X} : E[d(X, X̂)] ≤ D} I(X; X̂),   (1)

where I(·; ·) is mutual information. Note that the reconstruction X̂ need not be in the same space as the source X. In coding for machines, for example, the reconstruction is often a prediction target instead of the original data.

Rate-distortion in coding for machines. Let p_{X,Y} be a joint distribution, where X denotes data and Y is the prediction target. In compression for prediction (Fig. 1b), we again want to represent X using a low-rate representation, but on the decoding side, the objective now is to infer Y instead of reconstructing X. Prior research has applied rate-distortion theory to coding for machines in various ways [9][10][11]. Among them, a representative approach [9] is to regard the compressed representation Z as the "reconstruction" and adopt the following distortion function:

d(x, z) = D_KL( p_{Y|X}(· | x) ∥ p_{Y|Z}(· | z) ),   (2)

where D_KL is the KL divergence. Intuitively, Eq. (2) measures how well a representation z can be used to predict Y, compared to predicting Y using x. This distortion equals the best-case classification log loss and simulates the prediction error in many downstream tasks. With this distortion, the information rate-distortion function becomes:

R(D) = [ I(X; Y) − D ]⁺,   (3)

which follows from Theorem 2 of [9]. Intuitively, the distortion threshold D determines the maximum information loss regarding Y allowed during encoding. When D = 0, no information loss is allowed, and the encoder must retain all task-related information I(X; Y) in the representation Z. In this case, the minimum achievable rate equals I(X; Y), and the compression is lossless in the sense that predicting from Z is as good as predicting from X. When D > 0, the encoder is allowed to discard a subset of I(X; Y), and the minimum achievable rate decreases accordingly.

Assumptions. We consider neural network-based (as opposed to hand-crafted) encoders and decoders, the complexity of which can be controlled by tuning the number of network layers and the number of dimensions per layer. We apply elementwise uniform quantization and use factorized entropy models [13] for Z. We consider image classification as the downstream prediction task. More general settings (e.g., vector quantization and other downstream tasks) are left to future work.
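As a concrete instance of Eq. (3), consider a balanced binary label Y that is a deterministic function of X (the setting used again in Sec. 3.1); a short worked calculation of the lossless-prediction rate, under that assumption, is:

```latex
% Lossless-prediction case (D = 0), assuming Y = f(X) so that H(Y | X) = 0
% and the two classes are equiprobable.
\begin{aligned}
R(0) &= I(X; Y) = H(Y) - H(Y \mid X) = H(Y) \\
     &= -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2}
      = 1 \text{ bit per datapoint}.
\end{aligned}
```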

Rate-Distortion under Computational Constraints
We perform two experiments (Sec. 3.1 and Sec. 3.2) to study the impact of encoder and decoder size on the R-D trade-off in coding for machines. In both experiments, we assume that the encoder and decoder are neural networks accompanied by uniform scalar quantization, following the non-linear transform coding framework [14]. Computational constraints are thus imposed by varying the neural network depth (number of layers) and width (number of dimensions per layer). Sec. 3.3 discusses the observations and motivates our proposed method for image coding for machines (Sec. 4).

Experiment: 2-D datapoint classification
We consider a toy problem of 2-D datapoint compression for classification, shown in Fig. 2a. The data X is 2-D, the label Y is binary with equal class probabilities, and the difference between the two classes is an angular shift in polar coordinates. In this experiment, the rate is measured in bits per datapoint, and the distortion is measured by the classification log loss (also known as the cross-entropy loss).
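The exact generating process of Fig. 2a is not spelled out in the text, so the following is only a hypothetical sketch of such a dataset: both classes share the same radial distribution, and only an angular shift separates them. The number of sectors, the radial range, and the shift size are all assumptions made for illustration.

```python
import numpy as np

def make_toy_data(n=10_000, sectors=4, seed=0):
    """Hypothetical 2-D data where the two classes differ only by an angular
    shift in polar coordinates (a stand-in for the dataset of Fig. 2a)."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)                   # balanced binary labels
    r = rng.uniform(0.5, 1.5, size=n)                # assumed radial range (task-irrelevant)
    width = 2 * np.pi / sectors                      # angular width of one sector
    base = rng.integers(0, sectors, size=n) * width  # which angular sector the point falls in
    theta = base + rng.uniform(0, width / 2, size=n) + y * (width / 2)  # class 1 shifted by half a sector
    x = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
    return x.astype(np.float32), y.astype(np.int64)
```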
According to Eq. (3), we know that one could achieve zero distortion with R(0) = I(X; Y) = H(Y) = 1 bit per datapoint (when compressing long sequences). To verify this, we train a neural compressor [14] with a powerful encoder and decoder, both of which are MLPs with 3 hidden layers. Its R-D performance is shown as the red triangle marker in Fig. 2e. We observe that this particular compressor successfully achieves a performance close to R(0), i.e., zero distortion with a rate close to 1 bit.
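For illustration, below is a minimal sketch of the kind of neural compressor used in this experiment: an MLP encoder, scalar quantization with a factorized entropy model, and an MLP decoder trained on the rate-distortion Lagrangian. The hidden width, latent dimension, and the use of CompressAI's EntropyBottleneck are our assumptions rather than the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from compressai.entropy_models import EntropyBottleneck  # factorized entropy model

def mlp(d_in, d_out, hidden=64, n_hidden=3):
    layers, d = [], d_in
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, d_out))
    return nn.Sequential(*layers)

class ToyCompressor(nn.Module):
    def __init__(self, latent_dim=2, L_e=3, L_d=3, n_classes=2):
        super().__init__()
        # L_e = 0 means no learned transform: the input is quantized directly.
        self.encoder = mlp(2, latent_dim, n_hidden=L_e) if L_e > 0 else nn.Identity()
        self.entropy_model = EntropyBottleneck(latent_dim)
        self.decoder = mlp(latent_dim, n_classes, n_hidden=L_d)

    def forward(self, x):
        z = self.encoder(x)
        # Uniform scalar quantization with a differentiable proxy; the entropy
        # model also returns per-element likelihoods used to estimate the rate.
        z_hat, likelihoods = self.entropy_model(z.unsqueeze(-1).unsqueeze(-1))
        logits = self.decoder(z_hat.squeeze(-1).squeeze(-1))
        bits = -torch.log2(likelihoods).sum(dim=(1, 2, 3))  # estimated bits per datapoint
        return logits, bits

# One rate-distortion training step with a placeholder trade-off weight lmbda.
model, lmbda = ToyCompressor(L_e=3, L_d=3), 1.0
x, y = torch.randn(8, 2), torch.randint(0, 2, (8,))
logits, bits = model(x)
loss = bits.mean() + lmbda * F.cross_entropy(logits, y)
```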
Then, we reduce the encoder and decoder sizes by decreasing the number of hidden layers and dimensions. Let L_e denote the number of hidden layers in the encoder, and L_d the number of hidden layers in the decoder. We try various combinations of L_e ∈ [0, 3], L_d ∈ [0, 3] and show the results in Fig. 2e. We make the following observations based on the results:
• Given a powerful encoder, restricting the decoder does not hurt R-D: we keep L_e = 3 and reduce L_d to 0 (i.e., a linear decoder). This configuration is shown as the blue triangle marker in Fig. 2e. In this case, restricting the decoder size does not affect the R-D performance, indicating that given a powerful encoder, a simple decoder suffices to make accurate predictions.
• Given a powerful decoder, restricting the encoder increases rate: we gradually reduce L_e from 3 to 0 while keeping L_d = 3. Note that L_e = 0 refers to a simple elementwise uniform quantizer. Results are shown as the yellow and red curves in Fig. 2e. We see that, to achieve the same distortion, the rate increases as the encoder size decreases. Note that even weak encoders are able to achieve near-zero distortion, as long as they are given a sufficiently high rate (e.g., when L_e = 0, L_d = 3).
• For a weak encoder, restricting the decoder increases distortion: we keep L_e = 0 and decrease L_d from 2 to 0, shown as the blue curves in Fig. 2e. This increases the distortion at all rates, indicating that a powerful decoder is necessary to make accurate predictions when using weak encoders.
To better understand the behavior of encoders with different numbers of layers, we visualize their quantization regions in Fig. 2b for L_e = 3, and Fig. 2c and 2d for L_e = 0. We see that the large encoder (Fig. 2b) encodes only task-related information, which is the polar angular shift in this toy problem, and data points with the same label are quantized to the same code. In contrast, the small encoder (Fig. 2c and 2d) is not able to extract the task information due to its restricted capacity, and extra bits are used to code task-irrelevant information, i.e., datapoint positions in Cartesian coordinates. This toy problem of 2-D classification gives intuition about the functionality of encoders and decoders with different sizes (and thus different expressive power). Next, we conduct experiments on natural images to verify our observations.

Experiment: CIFAR-10 classification
This section considers a more realistic scenario where we use vision transformer-based [15] encoders and decoders to compress the CIFAR-10 dataset for classification. CIFAR-10 [16] is a natural image classification dataset that is widely used in machine learning research. It contains 60,000 RGB images (50,000 for training, 10,000 for testing) of ten different object categories, and each image has 32 × 32 pixels. We use a vision transformer-based model (shown in Fig. 3) to compress the images and make predictions from the compressed representation. The encoder and decoder sizes are controlled by their numbers of ViT blocks, denoted by L_e and L_d, respectively. Rate is measured in bits per pixel (bpp), distortion is measured by the cross-entropy loss, and the Lagrange multiplier used is λ = 16.0. We train the model for 100k iterations with batch size 256, learning rate 0.001, the Adam optimizer [17], and a cosine learning rate schedule. We apply various combinations of L_e and L_d, and the experimental results are presented in Fig. 4.

The main observations are summarized as follows.
Increasing the encoder size significantly reduces bpp, but increasing the decoder size does not. This can be observed in Fig. 4a. For a fixed encoder size L_e, increasing the decoder size L_d leads to an unchanged (for L_e = 16) or worse bpp (e.g., for L_e = 1). However, for a fixed L_d, increasing L_e from 1 to 16 reduces the bpp from around 1.0 to around 0.07. In other words, the rate is highly (negatively) correlated with L_e, but it is approximately independent of L_d. This is largely expected, as the rate is independent of the decoder once the data source and the encoder are given.
Increasing either the encoder size or the decoder size improves the classification accuracy. This can be seen in Fig. 4b. When fixing L_e = 1 (first row in the figure), increasing L_d from 1 to 16 improves the accuracy from 77.0% to 99.9%. When fixing L_d = 1 (first column in the figure), a similar trend can be observed. This suggests that a powerful end-to-end model (i.e., the encoder concatenated with the decoder) is necessary to make accurate predictions. Note that increasing the model size does not always improve classification accuracy. For example, when L_e = 16, increasing the decoder size L_d from 4 to 16 decreases the accuracy from 100.0% to 98.6%. This is presumably because training larger models typically requires more training iterations and data [18], which our simple setting does not provide (we use the same training recipe for all models).
A complementary relationship exists between the encoder and decoder sizes. Looking at Fig. 4c, we see that the best choice of L_d for L_e = 1 is L_d = 16, indicating that a powerful decoder is necessary to achieve good R-D performance when the encoder is weak. Conversely, L_d = 1 is one of the best choices when L_e = 16, indicating that a simple decoder is sufficient when the encoder is powerful. If we list the best (L_e, L_d) pairs for each L_e (i.e., we compute the argmin over the rows in Fig. 4c), the best L_d shrinks as L_e grows, which again reflects the complementary relationship between the encoder and decoder sizes.

Desired properties of a practical method
Our observations in the previous sections suggest several insights for designing a flexible and efficient method for compression for prediction. First, since there is a multi-way trade-off among rate, distortion, and encoder capacity, a flexible method should be capable of operating at various combinations of these factors to accommodate different application scenarios. Second, the method may take advantage of the complementary relationship between the encoder and decoder. For example, when the encoder is powerful enough, one needs only a simple decoder to produce an accurate prediction. Otherwise, a complex and powerful decoder is needed. In the following section, we propose a method that satisfies these properties.
Adjusting the Rate, Prediction Accuracy, and Encoder-Decoder Complexity Using a Single Model

We present an end-to-end framework for image compression for prediction. Our method, FICoP (Flexible Image Compression for Prediction), uses only a single model to achieve adjustable rate, distortion, and encoding/decoding complexity, making it a flexible method for practical applications.

Overview
Fig. 5 overviews our approach. Suppose an existing neural network model takes data X as input, produces a (first-order) Markov chain of features, and predicts the conditional label distribution q_{Y|X} (we use q to distinguish it from the true data distribution p_{X,Y} = p_X · p_{Y|X}). Our method is a plug-and-play extension that takes the base model and inserts entropy bottlenecks (EBs) in between its layers (Fig. 5a). Each EB can be viewed as a splitting point that divides the model into an encoder q_{Z|X} and a decoder q_{Y|Z}, and at test time (Fig. 5d), one can freely choose which EB to activate. By splitting the model in this way, we can control the encoding complexity and, at the same time, respect the complementary relationship between the encoder and decoder. For example, activating an EB at an early layer results in a small encoder and a large decoder, and vice versa.
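To make the plug-and-play splitting concrete, here is a minimal sketch (not the released implementation) of a wrapper that interleaves a base model's stages with EBs and splits at an arbitrary index k at test time; the stage list and the EB interface (returning a quantized feature and an estimated bit cost) are assumptions.

```python
import torch.nn as nn

class FICoPWrapper(nn.Module):
    """Wrap a base model given as an ordered list of stages, placing one entropy
    bottleneck (EB) after every candidate splitting point."""
    def __init__(self, stages, make_eb):
        super().__init__()
        self.stages = nn.ModuleList(stages)  # stage k maps feature k-1 to feature k
        self.ebs = nn.ModuleList([make_eb(k) for k in range(len(stages) - 1)])

    def forward(self, x, k, lmbda):
        # Encoder q_{Z|X}: stages up to and including k, then the k-th EB.
        for stage in self.stages[: k + 1]:
            x = stage(x)
        z_hat, bits = self.ebs[k](x, lmbda)   # quantize the feature, estimate its rate
        # Decoder q_{Y|Z}: the remaining stages, starting from the quantized feature.
        for stage in self.stages[k + 1 :]:
            z_hat = stage(z_hat)
        return z_hat, bits                    # prediction logits and estimated bits
```

A small k thus yields a small encoder and a large decoder, and vice versa, mirroring Fig. 5a.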
For each EB, the training objective is to minimize the rate-distortion Lagrangian:

L = H(Z) + λ · E_{x,z}[ D_KL( p_{Y|X}(· | x) ∥ q_{Y|Z}(· | z) ) ],   (4)

where H(Z) denotes the entropy of the discrete latent representation Z estimated by a neural entropy model [13], the KL term is the distortion, and λ is a Lagrange multiplier that trades off rate and distortion. A key contribution of our method is to optimize Eq. (4) for multiple EBs over a range of λ using a single end-to-end training process (Fig. 5c). Note that we use λ in the loss function as well as pass it to the model as an input.

The entropy bottleneck (EB) module
Fig. 5b shows the entropy bottleneck module. It follows the hyperprior structure [13], a two-layer VAE architecture commonly used for image compression. The compressed representation Z contains two components Z_1 and Z_2 with an auto-regressive prior p_Z = p_{Z2|Z1} · p_{Z1}. We describe the details and highlight the differences from the original design as follows.
A lightweight bottleneck architecture. Unlike Ballé et al. [13], who use a stack of CNN layers, we use single convolutional layers for all transformations. Our objective is to keep each EB small so that we can insert multiple EBs into any existing model without significantly increasing the overall model size. Non-linearity is achieved by layer normalization [19] operations, which have been shown to be effective in image compression [20][21][22].

Quantization with a straight-through gradient estimator. Following standard practice, we quantize Z_1 and Z_2 using elementwise uniform quantization before invoking the entropy coding algorithm. However, as opposed to the additive uniform noise in [13], we apply hard quantization during training as well, and the gradients are approximated using the straight-through estimator (STE) [23,24]. Although STE has been shown to be sub-optimal in compression for reconstruction [25,26], we find that STE works better than additive uniform noise in our setting (Appendix B.2).
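A minimal sketch of the straight-through rounding described above (hard quantization in the forward pass, identity gradient in the backward pass):

```python
import torch

def quantize_ste(x: torch.Tensor) -> torch.Tensor:
    """Nearest-integer rounding whose gradient is approximated by the identity
    (straight-through estimator): forward uses round(x), backward passes gradients through."""
    return x + (torch.round(x) - x).detach()
```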
Probabilistic models and variable-rate compression. We model the priors for Z_1 and Z_2 using the discretized Gaussian distribution [27]:

p_{Z1}(z_1) = ∫_{z_1 − 1/2}^{z_1 + 1/2} N(t; 0, σ_1²) dt,   p_{Z2|Z1}(z_2 | z_1) = ∫_{z_2 − 1/2}^{z_2 + 1/2} N(t; μ_2, σ_2²) dt,   (5)

where σ_1 is a function of λ (through an embedding layer), and {μ_2, σ_2} are functions of Z_1 (through a convolutional layer), as shown in Fig. 5b. The embedding layer consists of a sinusoidal positional encoding [28] followed by an MLP, following [29]. We achieve rate-adaptive quantization [30,31] by applying an affine transform and its inverse to Z'_1 before and after nearest-integer rounding, respectively:

Z_1 = a(λ) · ⌊ (Z'_1 − b(λ)) / a(λ) ⌉ + b(λ),   (6)

where Z'_1 is the variable before quantization, ⌊·⌉ denotes nearest-integer rounding, and the affine parameters a(λ) and b(λ) are produced by the λ embedding layer. This adaptive quantization is also applied to Z_2, which we omit in Fig. 5b for simplicity. Equation (5) effectively conditions the prior p_Z = p_{Z2|Z1} · p_{Z1} on λ, and Eq. (6) conditions the encoder q_{Z|X} on λ, allowing us to control the rate-distortion trade-off by varying the λ input to the model.
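The sketch below illustrates the two ingredients above: the discretized Gaussian likelihood of Eq. (5) and the λ-conditioned affine quantization of Eq. (6). How the scale and shift are produced from the λ embedding is an assumption; only the functional form follows the description.

```python
import torch
from torch.distributions import Normal

def ste_round(x):
    """Nearest-integer rounding with a straight-through gradient (as in Sec. 4.2)."""
    return x + (torch.round(x) - x).detach()

def discretized_gaussian_likelihood(z_hat, mu, sigma):
    """Probability mass of the unit-width quantization bin centered at z_hat under
    N(mu, sigma^2), i.e., the discretized Gaussian prior of Eq. (5)."""
    dist = Normal(mu, sigma)
    return dist.cdf(z_hat + 0.5) - dist.cdf(z_hat - 0.5)

def adaptive_quantize(z, scale, shift):
    """Eq. (6): apply an affine transform, round to the nearest integer, and invert
    the transform; (scale, shift) would come from the lambda embedding layer."""
    return ste_round((z - shift) / scale) * scale + shift

# The rate estimate (in bits) under the prior would then be
# -torch.log2(discretized_gaussian_likelihood(z_hat, mu, sigma)).sum().
```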

Training objective
Given a base model equipped with multiple EBs, we train them jointly in a single training process (Fig. 5c). Specifically, let k denote the index of an EB, and let p_K be a pre-defined distribution over the indices, which we choose to be uniform in our experiments. At each training iteration, we activate the k-th EB and deactivate the others, for a k sampled from p_K. To achieve variable-rate training, we also randomly sample λ from a distribution p_Λ throughout training. This λ is then used in the loss function as well as to condition the EBs. In our experiments, we choose p_Λ to be a log-uniform distribution over the range [0.1, 16], and we discuss these hyperparameter choices in Appendix B.1.
Formally, the training objective is to minimize the following loss w.r.t. all model parameters (including the base model and all EBs):

L = E_{k∼p_K, λ∼p_Λ}[ H(Z_k) + λ · E_{x,z}[ D_KL( p_{Y|X}(· | x) ∥ q_{Y|Z_k}(· | z) ) ] ],   (7)

where the first term corresponds to the rate and the second to the distortion.
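A schematic of one step of this joint training, assuming a wrapper like the one sketched in Sec. 4.1: sample an EB index k from a uniform p_K and a λ from a log-uniform p_Λ over [0.1, 16], then minimize the rate plus λ-weighted log loss. The optimizer and data handling are standard; omitted details are assumptions.

```python
import math, random
import torch
import torch.nn.functional as F

def sample_lambda(lo=0.1, hi=16.0):
    """lambda ~ p_Lambda, a log-uniform distribution over [lo, hi]."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

def training_step(model, optimizer, images, labels, num_ebs):
    k = random.randrange(num_ebs)            # k ~ p_K (uniform over EB indices)
    lmbda = sample_lambda()                  # sampled trade-off parameter
    logits, bits = model(images, k, lmbda)   # only the k-th EB is active, conditioned on lambda
    rate = bits.mean()                       # rate term, estimating H(Z_k)
    distortion = F.cross_entropy(logits, labels)  # log-loss distortion
    loss = rate + lmbda * distortion         # Monte-Carlo estimate of Eq. (7) for one (k, lambda)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```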

Experiments
Unless otherwise specified, we use ResNet-50 [32] as the base model in this section. We show that our method generalizes well to other model architectures in Appendix B.

Dataset and metrics. We use the 1,000-class ImageNet dataset [8] for training (train split) and evaluation (val split). In evaluation, all images are resized to 224 × 224 pixels, and the rate is measured in bits per pixel (bpp) after resizing. The distortion is estimated by the log loss (also known as the cross-entropy loss) of the model prediction w.r.t. the ground-truth label. We also report the top-1 classification accuracy, which is more interpretable. All metrics are computed for each image in the val set and then averaged across the entire val set.
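For concreteness, a small sketch of how the per-image metrics above could be computed (variable names are placeholders; num_bits would come from the entropy coder or the likelihood estimate):

```python
import torch
import torch.nn.functional as F

def per_image_metrics(logits, label, num_bits, height=224, width=224):
    """Per-image bits per pixel, log loss, and top-1 correctness; these values are
    then averaged over the entire ImageNet val split."""
    bpp = num_bits / (height * width)
    log_loss = F.cross_entropy(logits.unsqueeze(0), label.view(1)).item()
    top1 = float(logits.argmax().item() == label.item())
    return bpp, log_loss, top1
```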
Augment ResNet-50 with FICoP. ResNet-50 contains five stages (excluding the first convolutional layer), each of which contains multiple blocks. We take ResNet-50 with pre-trained weights as the base model and insert EBs at the splitting points shown in Fig. 6a. The splitting points are referred to as T_i.j, where i is the stage index and j is the block index within the stage (the indices start from 1). The resulting model is referred to as ResNet-50 + FICoP and is trained on the ImageNet train split for 160k iterations with a batch size of 256. For T_1.1 to T_4.1, we train the EBs with λ ∈ [0.1, 16]. For T_5.1, we found that the model is not sensitive to λ and the rate is always close to zero, so we only train it at λ = 0.1. The full details of the training hyperparameters are given in Appendix A.
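As an illustration of how a pre-trained ResNet-50 could be broken into sequential stages for EB insertion, here is a sketch using torchvision's resnet50; the exact grouping, and hence the correspondence to the splitting points T_i.j of Fig. 6a, is an assumption made for illustration only.

```python
import torch.nn as nn
from torchvision.models import resnet50

def resnet50_stages():
    """Split a pre-trained ResNet-50 into an ordered list of stages between which
    entropy bottlenecks can be inserted (hypothetical grouping)."""
    m = resnet50(weights="IMAGENET1K_V1")
    stem = nn.Sequential(m.conv1, m.bn1, m.relu, m.maxpool)
    head = nn.Sequential(m.avgpool, nn.Flatten(1), m.fc)
    return [stem, m.layer1, m.layer2, m.layer3, m.layer4, head]
```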

Validating the rate-distortion-complexity trade-off
Fig. 6b shows the rate-distortion (R-D) results of ResNet-50 + FICoP operating at each splitting point, each of which produces a separate R-D curve. The encoding complexity at each splitting point is reported in Table 1. We also show results for the original ResNet-50 as a reference.
Comparing the R-D curves in Fig. 6b, we observe that a deeper splitting point (which incurs higher encoding complexity) achieves better R-D performance. This is consistent with our analysis, as a more powerful encoder is able to compress away more task-irrelevant information, thus reducing the rate for the same distortion. Note that when the encoding complexity approaches that of the entire ResNet-50, i.e., at T_5.1, the log loss converges to that of the original ResNet-50 at a near-zero rate, which can be viewed as approaching the information R(D) function.

Comparing ResNet-50 + FICoP with existing methods
Existing methods. We consider several related works as baselines. To the best of our knowledge, no existing method achieves a rate-distortion-complexity trade-off using a single model as ours does, so a strictly fair comparison is not possible. We briefly describe these approaches as follows:
• Dubois et al. [9] use CLIP [33] together with an entropy bottleneck as the encoder and an MLP as the decoder. We refer to this method as CLIP + EB. Its setting differs from ours in that (a) CLIP is trained on image-text pairs instead of ImageNet, and (b) the method operates only at high encoding complexity.
• Matsubara et al. [3] use a lightweight CNN encoder and a truncated ResNet-50 decoder, with knowledge distillation techniques applied during training. The method is referred to as Entropic Student. In their setting, one needs to train multiple models to operate at different rates, and only low encoding complexity is supported.

Fig. 6c shows the accuracy-rate results of our method compared to the baseline methods, and Table 1 shows the corresponding encoding complexities. The Entropic Student method employs a lightweight encoder (around 0.47 GFLOPs) and thus achieves a much lower accuracy-rate curve than CLIP + EB (around 4.4 GFLOPs). Our method, however, is able to adjust the encoding complexity to control the accuracy-rate trade-off. In the low-complexity regime, ResNet-50 + FICoP at T_1.1 and T_2.1 achieves accuracy-rate curves similar to Entropic Student, and in the high-complexity regime, ResNet-50 + FICoP at T_5.1 achieves an accuracy-rate curve comparable to CLIP + EB. This demonstrates the flexibility and effectiveness of our approach in controlling the rate-distortion-complexity trade-off. Note that the accuracy-rate curves are not always monotonic because the method is trained to optimize the log loss, which does not always translate into classification accuracy.

FICoP with various base model architectures
FICoP with ConvNeXt [34]. To verify that our approach generalizes well to various model architectures, we choose a modern convolutional neural network architecture, ConvNeXt, as the base model. We apply FICoP to ConvNeXt-tiny, a lightweight version of ConvNeXt that achieves 82.5% top-1 accuracy on ImageNet, and show the results in Fig. 7. In the figure, Fig. 7a shows the model architecture and the layers at which we insert EBs, Fig. 7b shows the distortion-rate results, and Fig. 7c shows the accuracy-rate results. We observe a rate-distortion-complexity trade-off, which is consistent with our previous analysis. Furthermore, since ConvNeXt is a more powerful base model than ResNet-50, it leads to a significant performance boost when compared to the baseline methods.
FICoP with Swin-Transformer [35]. We also apply our approach to a Transformer architecture, Swin-Transformer, and show the results in Fig. 8. The observations are consistent with the previous experiments, showing that our approach works well with Swin-Transformer and outperforms the baseline methods. We thus conclude that FICoP is a general approach that can be applied to various image classification model architectures.

Experimental analysis
We investigate our method by ablating its components. All other settings are the same as in the previous section.

Impact of jointly training multiple entropy bottlenecks (EBs).
As our approach trains one shared base model jointly with multiple EBs, a natural question is how its performance differs from that of training a separate base model for each EB. We thus train a separate ResNet-50 model for each EB at {T_i.1, i = 1, 2, ..., 5}, referred to as "separate base models" in Fig. 9a. We observe that using a shared base model achieves comparable performance (better at low encoding complexity but worse at high encoding complexity).

Our problem setting is also related to the information bottleneck, although the latter does not require entropy coding. Thus, our study on encoder/decoder complexity also applies to information bottleneck methods. Several existing works also investigate the impact of computational constraints on information theory. Xu et al. [54] and Kleinman et al. [55] consider the amount of usable information between variables under computational constraints. A more closely related work is Harell et al. [10], where the authors prove that, in a fixed model, post-training compression of a deeper feature is better in terms of rate-distortion. Our paper explores the case where the base model is trainable, which is complementary to Harell et al. [10].

Conclusion
We have extended the rate-distortion trade-off in coding for machines to incorporate computational constraints. In particular, we show that a more powerful encoder leads to better rate-accuracy performance, and we reveal a complementary relationship between the encoder size and the decoder size required to achieve good performance. Experimental results have confirmed the existence of a three-way trade-off between rate, distortion, and model complexity, as well as showing that the proposed method is advantageous over earlier methods in practical situations.
Limitations and future work. An important assumption in our method is that the base model's input, intermediate features, and output form a Markov chain, which does not hold for models with hierarchical features and skip connections [56,57]. Also, we only consider image classification as the downstream task in this work. Future work could investigate more general cases, e.g., machine vision tasks such as image segmentation, human vision tasks such as image denoising and inpainting, and combinations of both [58].

Declarations
Availability of Data and Material. The research presented in this manuscript relies on the ImageNet dataset, a well-known and publicly accessible database widely used in the field of computer vision. We will ensure that materials related to our research, including the source code and trained models necessary for replicating the reported results, will be made publicly available upon acceptance of the paper.
Competing interests.We hereby declare that there are no known financial or nonfinancial competing interests that influence the work reported in this manuscript.
Funding.Research reported in this publication was supported by the National Institutes of Health (National Cancer Institute R01CA277839).The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Authors' contributions. This work is a collaborative effort that benefits from the contributions of each author. Z.D. was primarily responsible for conducting the majority of the experimental work and drafting and revising the manuscript. M.A.F.H. provided the initial prototype and foundation upon which the proposed method was further developed and refined. J.H. contributed to the conceptualization of the study through participation in brainstorming sessions and offering ideas. F.Z. is the project leader and oversees all aspects of the project's development and execution.

Appendix A Training Hyperparameters
We show the ImageNet training hyperparameters in Table 1, which covers three models: ResNet-50 [32], ConvNeXt [34], and Swin-Transformer [35]. For ResNet-50, we use the tv_resnet50 model from the timm library, which is pre-trained on ImageNet and has a top-1 accuracy of 76.1%. For ConvNeXt and Swin-Transformer, we use the models and training script provided by the torchvision library.
The training devices and the training time for each experiment are also reported in Table 1. We estimate that the total computation for all experiments is around 800 GPU hours (more than 20 training runs for ResNet-50, each costing around 36 GPU hours).

Fig. 2: We train neural compressors with various encoder/decoder sizes to compress the data points in (a) for classification. The encoder and the decoder are MLPs with L_e and L_d hidden layers, respectively. L_e = 0 refers to an elementwise uniform quantizer. Figures (b), (c), and (d) show the quantization boundaries of the encoders, where the background colors indicate the predicted label. Figure (e) shows the rate-distortion results for various encoder-decoder pairs.

Fig. 4: CIFAR-10 compression-for-classification results. We train the model (Fig. 3) with various combinations of encoder layers L_e and decoder layers L_d, and we report the bits per pixel (bpp), classification accuracy, and R-D loss for each combination. Lower bpp and R-D loss are better, and higher classification accuracy is better.

Fig. 5: Overview of our method. In figure (b), E denotes entropy coding (with quantization), D denotes decoding, and LN denotes Layer Normalization.

Fig. 6: ImageNet results using ResNet-50 as the base model, in terms of (b) distortion-rate and (c) accuracy-rate performance. Note that our approach (ResNet-50 + FICoP) is able to control the rate-distortion-complexity trade-off at test time using a single trained model (by activating different EBs and adjusting λ), while the baseline methods train a separate model for each R-D point.

Table 1: Encoding complexity of our method compared to previous ones. FLOPs are estimated for a 224 × 224 input image.