Optimal JPEG Quantization Table Fusion by Optimizing Texture Mosaic Images and Predicting Textures

The JPEG standard allows the use of a customized quantization table; however, finding an optimal quantization table in a timely manner remains challenging. This work aims to solve the dilemma of balancing computational cost and image-specific optimality by introducing a new concept of texture mosaic images. Instead of optimizing a single image or a collection of representative images, conventional JPEG optimization techniques can be applied to the texture mosaic images to obtain an optimal quantization table for each texture category. We use the simulated annealing technique as an example to validate our framework. To effectively learn the visual features of textures, we use the ImageNet pre-trained MobileNetV2 model to train on and predict a new image's texture distribution, then fuse the optimal texture tables to produce an image-specific optimal quantization table. Our experiments demonstrate around 30% size reduction with a slight, visually indistinguishable decrease in FSIM quality on the evaluation datasets. Moreover, our rate-distortion curve shows competitive, and often superior, performance against prior works under high-quality settings. The proposed method, denoted as JQF, achieves per-image optimality for JPEG encoding with less than one second of additional timing cost.


Introduction
As of today, JPEG [3] remains the most used lossy image compression standard for digital images in content sharing and various image capture devices [1] [2]. Since JPEG was developed in 1992, new image coding standards such as JPEG 2000 [4], WebP [5], and HEIF [6] have been proposed to respond to the continual expansion of multimedia applications. JPEG 2000 adopts the Discrete Wavelet Transform (DWT) and arithmetic coding with a layered file format that offers flexibility. Compared to JPEG's fixed-size transform, the DWT is in principle open-ended in image size and compresses better than JPEG. WebP and HEIF leverage intra-frame coding technologies from video codecs. WebP is based on the VP8 coding standard and uses intra-prediction, in-loop filtering, and block-adaptive quantization to outperform JPEG. HEIF uses the still image coding of the state-of-the-art High Efficiency Video Coding (HEVC) standard [7], which roughly doubles video compression efficiency compared to the Advanced Video Coding (AVC) standard [8].
Although the new formats are proven superior to JPEG both subjectively and objectively [9] [10], JPEG has remained the most used image format for the past three decades. JPEG 2000 is broadly used for medical images [11] and remote sensing images in Geographic Information Systems (GIS) [12]. The WebP image format is advocated by Google and is well supported in popular web browsers. WebP has become the second most popular Internet image format next to JPEG [1], but with merely a 1.2% market share. Since 2017, Apple has used HEIF as the default photo capture format on the iPhone [13]. We expect more mobile phone manufacturers to follow, bringing the digital imaging world one step closer to replacing JPEG with the state-of-the-art HEVC-coded image format [6]. However, we expect JPEG to stay for at least a few more decades [2] because JPEG still satisfies average user demands. As JPEG accounts for the largest share of Internet image traffic at 72.9%, much research on reducing JPEG bandwidth has been proposed by tech giants such as Google [14] [15] [16] and Facebook [17].
The success of JPEG comes from its low computational complexity and its coding effectiveness with respect to the human visual system (HVS). JPEG typically achieves a 10:1 compression gain without noticeable quality loss and is very cheap to implement. JPEG divides the image into 8 × 8 blocks and uses the Discrete Cosine Transform (DCT) to shift the pixels from the spatial domain to the frequency domain. The transformed DCT coefficients are rearranged in zigzag order. A quantization table is used to quantize the DCT coefficients, resulting in reduced coefficients and a sparse DCT block suited for run-length encoding (RLE). Figure 1 (a) shows the default luminance quantization table provided in the JPEG standard. Since JPEG is lossy, the compression rate can be adjusted, allowing a selectable tradeoff between storage size and image quality. Although a smaller quality setting Q achieves better image size reduction, it may introduce visual artifacts such as blocking and ringing effects. Therefore, imaging applications usually set the default JPEG quality to 75 or above. JPEG allows the use of customized tables, but optimizing the quantization table remains challenging due to the vast solution space and the lack of reliable HVS quality measurements. As a result, modern image applications and digital camera processors tend to compress JPEG images with minimal quantization values to preserve quality.
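For reference, the quality setting Q maps to the final table through a simple linear scaling of a base table. The sketch below follows the commonly used libjpeg scaling rule (the function name and clamping details are our own; other encoders may differ):

```python
# Default JPEG luminance quantization table (Annex K of the standard),
# flattened in raster order.
STD_LUMA = [
    16, 11, 10, 16, 24, 40, 51, 61,
    12, 12, 14, 19, 26, 58, 60, 55,
    14, 13, 16, 24, 40, 57, 69, 56,
    14, 17, 22, 29, 51, 87, 80, 62,
    18, 22, 37, 56, 68, 109, 103, 77,
    24, 35, 55, 64, 81, 104, 113, 92,
    49, 64, 78, 87, 103, 121, 120, 101,
    72, 92, 95, 98, 112, 100, 103, 99,
]

def scale_table(table, quality):
    """Scale a base quantization table by JPEG quality Q (libjpeg rule).

    Q = 50 reproduces the base table; larger Q yields smaller quantizers,
    i.e. finer quantization and larger files.
    """
    quality = max(1, min(100, quality))
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    # Baseline JPEG restricts quantizers to 1..255.
    return [max(1, min(255, (q * scale + 50) // 100)) for q in table]
```

At Q = 95, for example, every entry is scaled by 10/100, so the DC quantizer 16 becomes 2, which is why high-quality encodes leave little quantization headroom.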

Rate-distortion Optimization
The JPEG standard table was obtained from a series of psychovisual experiments that determined the visibility thresholds of the DCT basis functions at a pre-defined viewing distance [2]. Since the early 1990s, many works have explored the potential of designing better quantization tables [19] [29]. Some focus directly on improving rate-distortion performance and differ mainly in the optimization strategy, such as coordinate-descent enumeration [21] [26] [30], dynamic programming [22], and genetic algorithms [23].
Early works such as Ratnakar and Livny [22] use Mean Squared Error (MSE) as the distortion measurement during optimization, which has been shown to correlate poorly with the HVS. Ernawan and Nugraini [25] proposed a psychovisual threshold through quantitative experiments based on the contribution of the DCT coefficients at each frequency order to the reconstruction error. Their psychovisual threshold strikes a balance between quality and compression ratio and derives a new quantization table from 40 real images. A significant performance improvement is reported, but the table does not appear to be generally applicable to all images.
Hu et al. [27] indicate that human-perceived distortion is discretely characterized by jumps and can be modeled with a stair quality function (SQF). The SQF utilizes the notion of just noticeable difference (JND), which refers to the visibility threshold below which no change can be detected by the HVS [31]. With the JND concept, Zhang et al. [28] redefine the distortion metric by soft-thresholding MSE values, using the JND as a per-coefficient threshold, and propose a JND-based perceptual optimization method. Hu's method operates as an image-specific optimization on the testing image with an exhaustive search algorithm. Their work delivers significant bitrate savings compared to the JPEG standard and shows a superior size reduction vs. quality tradeoff over Ernawan's PSY table. However, Hu et al. do not report the time cost of the image-specific optimization.
In recent years, many researchers have proposed deep learning-based methods to optimize the JPEG quantization table end-to-end for image compression [32], artifacts correction [17], and specific computer vision tasks [33] [16]. Two essential issues, the non-differentiable JPEG quantization and the entropy coding estimation, have been appropriately handled, enabling end-to-end learning of the quantization table with neural network backpropagation.

Simulated Annealing
The stochastic optimization process known as simulated annealing [34] has been successfully applied to find vector quantization parameters. Monro and Sherlock [19] [20] attempted to use simulated annealing to determine quantization tables for DCT coding. Monro et al. anneal all 64 quantization values with a cost function composed of the RMSE error and a selected target compression ratio. Their optimization process searches for optimal tables on selected images that minimize the RMSE error while maintaining the chosen target compression ratio. Around a single-digit percentage improvement in RMSE error is reported compared to the standard JPEG table.
From the early 2000s onward, new objective FR-IQA methods like SSIM [35] and FSIM [36] were proposed and shown to be statistically closer to the HVS. Jiang et al. [24] utilize SSIM as the quality metric to evaluate image distortion during simulated annealing. In their work, a multi-objective optimization equation is proposed to minimize bitrate while maximizing SSIM. To solve the equation, they estimate the Pareto optimal point for finding an optimal quantization table, such that no other feasible point has both a lower bitrate and a higher SSIM index. However, since the Pareto optimal point differs for every image, the multi-objective optimization framework only proves helpful on a per-image basis. Hopkins et al. [29] adopt FSIM as the quality metric and revise the annealing process to focus on compression maximization, with a temperature function that rewards lower error. They use the standard JPEG table as the initial table and randomly perturb ten quantization table values at each step. A set of 4,000 images was selected from the RAISE [37] dataset as a training set to run four groups of annealing processes at quality settings 35, 50, 75, and 95. The four globally optimized quantization tables are reported to reduce the compressed size by around 20% over the JPEG standard table while claiming a 10% improvement in FSIM error. However, the corpus of 4,000 training images is merely a proxy for universal pictures and is not custom-tailored per image.

Image-Specific Optimization
Google's JPEG encoder Guetzli [14] aims to produce visually indistinguishable images at a reduced bitstream size. Using a closed-loop optimizer, Guetzli optimizes global quantization tables and selectively zeros out specific DCT coefficients in each block. Most of Guetzli's size reduction comes from identifying DCT coefficients to zero out without significantly decreasing the quality score. However, Guetzli zeros out DCT coefficients too aggressively and causes noticeable artifacts at low bitrate settings. Compared to Hopkins' globally optimized quantization table, Guetzli's per-image optimization strategy achieves a 29-45% data size reduction at the cost of computational complexity and memory consumption (up to 30 minutes on a high-resolution image), which is considered impractical in real-world applications.
The SJPEG encoder [39] uses a simple adaptive quantization method by collecting and analyzing the histogram of DCT coefficients in each sub-band. Starting with the scaled standard table, SJPEG uses a least-squares fitted slope to decide an overall λ value for the rate-distortion cost function. With the obtained λ, SJPEG searches for the quantizer with the minimal cost within each sub-band's delta range. The rate and distortion are approximated by the logarithm of the quantized coefficient and the square of the quantization error, respectively. The adaptive quantization method is fast because it simplifies the optimization problem and uses quantization error as the quality measurement.

The Proposed JQF Framework

Figure 2 shows the block diagram of the proposed JPEG Quantization Table Fusion (JQF) framework. The training workflow is a series of offline procedures to cluster texture patches and optimize per-texture quantization tables. The prediction workflow splits the input image into patches of the same size used in training, which the deep CNN model classifies to obtain a texture distribution that structurally describes what kinds of texture categories compose the input image. We then aggregate the corresponding optimal texture tables to form a custom-tailored quantization table for the input image.

Texture Patches Clustering
Since JPEG uses 8 × 8 image blocks for the DCT transform, the texture patch size B must be a multiple of 8. We choose block size B = 64 and crop the training images into 64 × 64 texture patches in a non-overlapping manner. Because we do not know how many types of textures exist in our RAISE training set, the unsupervised K-Means clustering algorithm is used to cluster the textures into K = 100 categories.
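The non-overlapping cropping can be sketched in a few lines of plain Python (names are ours; a real pipeline would operate on decoded image arrays):

```python
def crop_patches(image, B=64):
    """Crop a 2-D grayscale image (a list of pixel rows) into
    non-overlapping B x B patches.  Border rows/columns that do not
    fill a whole patch are discarded."""
    h, w = len(image), len(image[0])
    patches = []
    for y in range(0, h - B + 1, B):
        for x in range(0, w - B + 1, B):
            patches.append([row[x:x + B] for row in image[y:y + B]])
    return patches

# A 128 x 192 image yields (128 // 64) * (192 // 64) = 6 patches.
example = [[0] * 192 for _ in range(128)]
patches = crop_patches(example, B=64)
```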
Typically, the image clustering problem can be considered as two separate steps: 1) design appropriate visual features to extract, and 2) train a classifier that fits the class assignments [40]. In recent years, the success of deep Convolutional Neural Networks (CNNs) in various computer vision tasks [41] [42] has made CNNs a preferred choice for feature extraction. ImageNet [41] pre-trained CNN models such as VGG-16 [42] and MobileNetV2 [43] have become building blocks in many computer vision applications. In this context, we use the last activation maps after the layers of ConvNets, called bottleneck features, to extract texture features. In our experience, both the pre-trained VGG-16 and MobileNetV2 models can extract representative visual features for clustering. We choose the MobileNetV2 network for its compact model size and prediction efficiency on mobile devices. To speed up the K-Means clustering algorithm, we apply principal component analysis (PCA) to the extracted 1280-dimensional feature vectors to further reduce the dimension to 500, covering 82% of the variance.

Annealing on Texture Mosaic Images

The quantization table is an 8 × 8 matrix of 8-bit unsigned values, so it is unrealistic to enumerate the whole solution space for an optimal table. Therefore, we anneal JPEG's default luminance quantization table in Figure 1 (a) under the acceptance criterion

FSIM(I_r, I_s) - FSIM(I_r, I_c) < γ · FSIM(I_r, I_s), with γ = 0.5%, (1)

where I_r is the raw image, and I_c and I_s are the JPEG images compressed with the candidate table and the standard table, respectively. The physical meaning of equation (1) is that we only accept a solution whose FSIM degradation is less than 0.5%. To avoid being trapped in a local minimum, there is a probability P(i) of accepting a worse solution, affected by the temperature function T(i) and the energy delta ∆E derived from the current solution's score improvement ratio.
The temperature function used to shape the probability is given by

T(i) = 1 + ρ · i / M, (2)

where i is the iteration index, M is the maximum number of annealing iterations, and ρ is a constant controlling the temperature so that the acceptance probability approaches 1/(ρ + 1) at the end of the annealing process. The probability P(i) of accepting a candidate is calculated by

P(i) = min(1, e^∆E / T(i)), with ∆E = (S_i - S_{i+1}) / S_i, (3)

where S_i is the current solution's score, composed of C_i, the current compressed JPEG file size, and D_i = FSIM(I_r, I_c), the quality distortion.
With this design of the probability P(i), we have a higher probability of accepting a worse solution at the early stage, which prevents us from being trapped in a local minimum. The probability then decreases gradually with the number of iterations, and the annealing becomes a hill-climbing process. Each iteration finishes either by accepting a candidate solution or by randomly perturbing again to get the next candidate. In this work, we choose M = 3000 and ρ = 10 to anneal each texture mosaic image t and obtain the corresponding optimized table T_t.
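The annealing loop described above can be sketched as follows. This is a simplified illustration under our own naming, not the exact implementation: `score` stands in for the real objective (compressed JPEG size, with candidates violating the FSIM tolerance rejected, e.g. by scoring them as infinity), and the single-quantizer perturbation is an assumption for brevity.

```python
import math
import random

def anneal_table(base_table, score, M=3000, rho=10, seed=0):
    """Simulated annealing over a quantization table (flat list of 64
    values in 1..255).  `score(table)` returns a cost to minimize.
    Better candidates are always accepted; worse ones are accepted with
    a probability that decays toward 1/(1 + rho) as i approaches M."""
    rng = random.Random(seed)
    current = list(base_table)
    current_cost = score(current)
    best, best_cost = list(current), current_cost
    for i in range(M):
        candidate = list(current)
        k = rng.randrange(64)                       # perturb one quantizer
        candidate[k] = min(255, max(1, candidate[k] + rng.choice([-1, 1])))
        cost = score(candidate)
        delta = (current_cost - cost) / current_cost  # improvement ratio
        temperature = 1 + rho * i / M
        if delta > 0 or rng.random() < math.exp(min(0.0, delta)) / temperature:
            current, current_cost = candidate, cost
        if current_cost < best_cost:
            best, best_cost = list(current), current_cost
    return best, best_cost
```

With a toy cost function such as the squared distance to a target table, the loop steadily drives the cost down while occasionally taking uphill steps early on.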
We show the final annealed quantization tables T_1 and T_19 of category ids 1 and 19 in Figure 3. Tables T_1 and T_19 obtain size reductions of 37.9% and 25.42% with FSIM quality degradations of 0.48% and 0.40%, respectively, while remaining visually indistinguishable. We draw two findings from the annealed results: 1) a smooth texture like category 1 can achieve more bitrate saving than a complex texture with high-frequency signals under the given quality distortion tolerance; 2) the annealing process tends to raise the quantizer magnitudes in the low-frequency bands and reduce the quantizers in the high-frequency bands.

Texture Training and Prediction
Having used the pre-trained MobileNetV2 network to extract texture features for clustering, we fine-tune the ConvNet parameters and train the fully connected layers for the K texture categories. The texture patches from the RAISE dataset are split 80%-20% for training and testing. We employ the Adam optimizer with default settings in PyTorch, a learning rate of 0.0001, and a batch size of 2048. We train the texture CNN model for 30 epochs, reaching a top-3 testing accuracy of 95.08%. For prediction, we crop the input image into patches of the same size used in training and execute a forward pass to obtain each patch's texture category. Figure 4 shows the texture distribution of the lighthouse (kodim19) image from the Kodak dataset; textures are displayed using their texture mosaic images.
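Given a per-patch classifier, the predicted texture distribution is simply the normalized histogram of patch labels. A minimal sketch with a stand-in classifier (the real model is the fine-tuned MobileNetV2; all names here are ours):

```python
from collections import Counter

def texture_distribution(patches, predict_texture, K=100):
    """Return the fraction of patches assigned to each of K texture
    categories.  This histogram is the weight vector later used to
    fuse the per-texture quantization tables."""
    counts = Counter(predict_texture(p) for p in patches)
    n = len(patches)
    return [counts.get(t, 0) / n for t in range(K)]

# Stand-in classifier: bucket a 64x64 patch by its mean intensity.
def fake_predict(patch):
    mean = sum(map(sum, patch)) / (64 * 64)
    return min(99, int(mean) // 3)
```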

Quantization Table Fusion
We consider two strategies to aggregate the texture-optimized quantization tables: voting by majority and weighted averaging. The voting strategy picks the quantization table of the texture category with the most counts in the histogram as the final quantization table. The weighted-average strategy weights the annealed quantizers by each texture category's percentage. We select the weighted-average policy for its overall better performance. The fused optimal quantization table T_O is calculated by

T_O = Σ_t W_t · T_t,

where T_t denotes the optimal table of texture t, and W_t is the weight of the corresponding texture, i.e., its percentage of the image's patches. We provide the final fused quantization table at Q = 95 for the lighthouse image in Figure 1 (b). The fused table shows the same tendency we observed when annealing the texture tables, i.e., larger quantizers in the low-frequency bands and smaller quantizers in the high-frequency bands. Our observation echoes the conclusion of Monro and Sherlock [19] [20] that the JPEG standard table improperly over-estimates the low-frequency parts and under-estimates the high frequencies.
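The weighted-average fusion is an element-wise convex combination of the per-texture tables, rounded back to valid quantizer values. A minimal sketch under our own naming:

```python
def fuse_tables(tables, weights):
    """Element-wise weighted average of per-texture quantization tables.

    `tables[t]` is a flat list of 64 quantizers for texture t, and
    `weights[t]` is that texture's share of the image's patches
    (the weights sum to 1)."""
    fused = [
        sum(w * table[k] for table, w in zip(tables, weights))
        for k in range(64)
    ]
    # Round back to valid 8-bit quantizer values in 1..255.
    return [min(255, max(1, round(v))) for v in fused]

# Two textures with a 75% / 25% patch split.
t1, t2 = [16] * 64, [32] * 64
fused = fuse_tables([t1, t2], [0.75, 0.25])
```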

Dataset and Implementation Details
We use the RAISE dataset [37] as our training database and cross-validate it with the Kodak dataset [44]. The RAISE dataset is a real-world image dataset primarily designed for the evaluation of digital forgery detection algorithms. It consists of 8,156 high-resolution raw images in resolution around 4288 × 2848, uncompressed, and guaranteed to be camera-native. The Kodak dataset [44] has 24 lossless images, commonly used for evaluating image compression. Each image is about 768 × 512 in resolution.
We randomly select 50 images from the RAISE-1k subset as the testing set; the rest of the images are used as the training set. We crop the 1,487 training images into B × B patches with stride 256, generating a total of 261,712 texture patches. We then cluster the patches into K textures and perform the simulated annealing process on the stitched mosaic images. At most 225 patches are randomly selected from each texture category to limit the required annealing time.

We compare our JQF method's performance with the JPEG standard table (denoted STD), Hopkins' proposed global table [29] (denoted SAT), and SJPEG [39]. Since both JQF and SAT optimize the luminance table while SJPEG's optimization includes the chrominance table, we encode all the raw images as grayscale JPEG for a fair comparison. The compression is done with the command line program cjpeg from libjpeg [18] and the sjpeg command.

Table 1 shows the baseline performance with block size B = 64, K = 100 texture categories, M = 3000 annealing iterations, and target JPEG quality Q = 95. On the two benchmark datasets, the proposed annealing and fusion framework delivers FSIM quality fairly close to SAT's, decreasing by 0.07% and 0.03%, while further reducing the compressed JPEG file size by 15.58% and 13.48%. In other words, compared with the JPEG standard, our method achieves a significant size reduction (near 30%) with a slight (0.4%) FSIM quality decrease. The FSIM quality degradation falls within our annealing criterion γ = 0.5% and is visually indistinguishable, as shown in Figure 5. Interestingly, our method improves PSNR by 0.64% and 1.11% over SAT. Although it is well known that PSNR does not align with the HVS, it reflects signal fidelity, and we do not mind gaining PSNR while also reducing the compressed size. Compared to SJPEG, our proposed method is less aggressive on the same testing datasets, with 2.22% and 1.06% size increases but overall higher quality.
Both SJPEG and JQF have improved PSNR against SAT on the Kodak dataset; we think it indicates that the image-specific optimization better adapts to the individual image than a global one.

Annealing Iterations
The simulated annealing algorithm terminates at a globally optimal solution with probability one as long as the number of iterations is sufficiently large [45]. However, finding a number of iterations that ensures a significant probability of success is time-consuming. Instead, we choose a number of annealing iterations that yields an FSIM similar to that of Hopkins' work as a fair control parameter. Table 2 shows the result: more annealing iterations reduce the compressed file size while keeping the average FSIM quality degradation within the γ = 0.5% tolerance. Since the -0.4% FSIM reduction is close to our preset tolerance, we choose M = 3000 as our default number of annealing steps.

Quantization Table Fusion Performance
We compare the optimization performance of the K = 100 texture mosaic images with the 50 RAISE testing images in Table 3. The 3000-iteration annealing process delivers a 26.26% size reduction compared to the standard table on those 100 texture images. However, the weighted average of the optimal texture tables further boosts the size reduction to 29.97% on the real-world testing images. The result shows that the fused quantization table better adapts to the image content.

(Figure 5: Quality comparison of selected Kodak images compressed at JPEG quality Q = 95 using the JPEG standard table, SAT, and our JQF fused table. Images are encoded in YUV420 color with the JPEG standard chrominance table for better display quality in the paper. Our JQF baseline achieves significant size reduction with visually indistinguishable quality.)

Table 4 shows the size reduction comparisons for block sizes 32, 48, 64, and 96, each using 100 texture clusters. The outcome aligns with our expectation that smaller image blocks generally adapt better to image content, and the overall size reduction tendency is apparent. However, the improvement is not significant, except for the considerable performance drop at block size 96. This is intuitive: a large image block may span specific objects and fail to represent a single texture, or the texture prediction accuracy drops.

Number of Texture Categories
The optimization performance for category numbers K = 25, 50, 100, and 150 is shown in Table 5. Unsurprisingly, the more texture categories we split into, the better the size reduction we achieve. Still, the improvement is noticeable but not significant as the number of categories grows. Theoretically, we could cluster image patches into a massive set of textures to describe the image content. Nevertheless, since our annealing process runs on the texture mosaic images, annealing on an enormous collection of images may become impractical.

Rate-Distortion Curve Comparison
We compare our JQF method with the JPEG standard (STD), Hopkins' simulated annealing table [29] (SAT), SJPEG [39], Ernawan's psychovisual table [25] (PSY), and Luo's deep learning derived tables [16] (DLT). We use SJPEG to encode images from the testing set at quality settings Q = 35, 50, 75, and 95, then obtain the corresponding average bitrates to generate four rate-distortion data points as the benchmark. We adjust the JPEG quality target Q of each competing method to generate the closest bitrate and record its quality metrics. The rate-distortion curve comparisons for the Kodak and RAISE datasets are shown in Figure 6. Firstly, we observe that the PSY table's SSIM and FSIM performance is very close to the JPEG standard's, with slightly higher PSNR. That makes sense because PSY derives its quantization table from a primitive threshold obtained in psychovisual experiments, which helps maintain the HVS quality metrics.
Secondly, the SAT table does not deliver the right size vs. quality tradeoff, even compared to the JPEG standard. Both SAT and our method run simulated annealing with FSIM error and compressed size as the acceptance criteria, so both approaches generate inferior FSIM rate-distortion performance. However, the SAT table seems unable to gain signal fidelity in PSNR either, especially on the RAISE dataset.
Thirdly, our annealing process uses FSIM as an error budget to exchange for better size reduction; the stochastic process explores the solution space to search for a more optimized table. Furthermore, the texture-based quantization fusion approach further reduces distortion and enhances the PSNR RD-curve performance. As image-specific optimization approaches, our method and SJPEG deliver significant PSNR gains at JPEG's high-quality settings. The proposed JQF method performs less optimally than JPEG at low-quality settings because we use the same M = 3000 annealing iterations for all target qualities; if we annealed for fewer iterations or relaxed the tolerance γ in equation (1), we would obtain optimal texture tables with better quality but a worse compression ratio.

Lastly, JQF, SJPEG, and DLT are in the winning group and perform relatively close to one another. The quality of our proposed JQF is slightly ahead of SJPEG's. Luo et al. demonstrate the advantage of end-to-end learning by deriving a globally optimized table that beats image-specific optimization approaches like JQF and SJPEG; the lead is not significant, but it is surprising. Their deep learning method can obtain another 0.6 dB PSNR gain when applied to specific images, but a computational analysis is not included in their paper [16]. On the HVS-oriented SSIM and FSIM metrics, JQF and SAT provide lower performance curves because the simulated annealing methods trade quality loss for bitrate reduction. SJPEG approximates visual quality using the DCT coefficient rounding error, trading perceptual quality for speed, so it also provides lower SSIM and FSIM curves. The situation is the same for DLT, which only formulates pixel reconstruction error in its cost function, producing the worst quality on the HVS metrics.

Prediction Computational Cost
We compare the time cost of the proposed JQF method and SJPEG's adaptive quantization in Table 6. The MobileNetV2-based prediction model is lightweight (only 10.5 MB) and efficient to compute; furthermore, for high-resolution images, we can predict the texture distribution on a down-sampled version. We benchmarked on a workstation with an Intel Core i7-9700K CPU and an Nvidia GeForce RTX 2080 Ti GPU, taking about 0.11 seconds with the GPU and 1.55 seconds with pure CPU computation per RAISE testing image. The computing cost of JQF is competitive with SJPEG's when using the GPU and acceptable on the CPU. Given that deep learning inference chips may become more generally available in the future, predicting texture distributions for JPEG optimization remains feasible.

Figure 7 demonstrates the visual artifacts introduced by all competing methods. The artifacts are positively correlated with the bitrate savings at low bitrate (Q = 35). As suggested in [29] [14], we think the JPEG optimization problem finds its value at high-quality settings, where there is more room to trade humanly imperceptible quality for bitrate savings. As a 30-year-old codec, JPEG merely applies predictive coding to DC coefficients after the DCT transform, and the coding gain mainly comes from runs of consecutive quantized zero coefficients for RLE. JPEG has no intra-prediction coding tools like HEVC [6], so optimization at low bitrate becomes a zero-sum game between size and quality. For low-bitrate JPEG optimization, alternative research proposes either pre-editing [15] or post-processing with artifacts correction [17] to maintain good visual quality at a reduced bitrate.

Conclusion
We propose a novel JPEG quantization table fusion (JQF) framework that solves the dilemma of balancing computational cost and image-specific optimality by introducing the new concept of texture mosaic images. We use the simulated annealing technique as a proof of concept to validate our method. We use the K-Means clustering algorithm to cluster texture categories and the ImageNet pre-trained MobileNetV2 CNN model to learn and predict textures. We then fuse the texture-optimized tables to produce an image-specific optimal quantization table. Our experiments demonstrate a significant size reduction with visually indistinguishable perceptual quality compared to the JPEG standard table on the Kodak dataset and a real-world consumer photo dataset. Our method's rate-distortion curve shows competitive, and often superior, performance against prior works under high-quality JPEG settings. With the proposed JQF framework, per-image optimality for JPEG encoding is achieved at an affordable additional computational cost.