Bagged Tree And ResNet Based Joint End-to-End Fast CTU Partition Decision Algorithm For Video Intra Coding

Coding block partition accounts for a large part of the encoding complexity in many video coding standards. To reduce this complexity, this paper proposes an end-to-end fast algorithm for the partition structure decision of the coding tree unit (CTU) in intra coding. The algorithm can be extended to various coding standards with fine tuning, and it is applied to the intra coding of the HEVC reference software HM16.7 as an example. In the proposed method, the splitting decision of a CTU is first made by a well designed bagged tree model. Then, the partition problem of a 32×32 sized CU is modeled as a 17-output classification task and solved by a well trained residual network (ResNet). Jointly using the bagged tree and the ResNet, the proposed fast CTU partition algorithm generates the partition quad-tree structure of a CTU through an end-to-end prediction process, instead of multiple decision making procedures at each depth level. Besides, several effective and representative datasets are constructed in this work to lay the foundation for high prediction accuracy. Experimental results show that, compared with the original HM16.7 encoder, the proposed algorithm reduces the encoding time by 59.79% on average, while the BD-rate loss is as low as 2.02%, which outperforms most state-of-the-art approaches in the fast intra CU partition area.


Introduction
Video coding standards have been continuously developed and updated for decades to meet the market's increasing demand for videos of higher definition. Over the years, various coding standards have been invented and superseded, such as advanced video coding (AVC), high efficiency video coding (HEVC), versatile video coding (VVC), audio video coding standard (AVS) and AOMedia video 1 (AV1). Each brought considerable coding benefits in its time; for example, HEVC, developed by the Joint Collaborative Team on Video Coding (JCT-VC), and its successor VVC are each able to achieve about 50% bit-rate reduction over their predecessors while maintaining the same video quality [1]. However, as the compression rate has been pushed higher generation after generation, the complexity of video coding standards has increased dramatically due to the introduction of many efficient but complicated coding tools.
Block based coding is one of the most important coding techniques and is commonly used in many popular video coding standards. Taking HEVC and VVC for example, HEVC uses quad-tree structure to partition a coding unit (CU), and VVC uses quad-tree plus binary tree (QTBT) structure to finish block partitioning. In block based video coding standards, each CU can be iteratively further split into sub-CUs according to a specific partition structure. A flexible CU partition rule brings various size combinations for a CU. With many CU sizes and modes to be selected, rate-distortion optimization (RDO) is adopted to select the optimal CU partition structure for a CTU along with prediction modes. During the encoding process, many video coding standards traverse all the possible CU sizes and prediction modes, then RDO is used to select the best combination with the minimum cost. As a result, RDO brings the most encoding complexity burden by exhaustive calculation on the CU size decision process, and it restricts the application of video coding standards in many real-time scenes. Thus, it is very necessary and meaningful to design a fast CU partition algorithm for existing video coding standards.
In the past several years, many approaches have been proposed to reduce the massive encoding complexity of various standards, such as HEVC and VVC. Heuristic based CU depth decision approaches were first proposed and widely studied [2,3,4,5,6]. For example, Liu et al. [3] proposed a fast CU depth decision algorithm based on statistical analysis, using a three-stage method to make the splitting decision according to prior information. Shen et al. [7] proposed an early determination and a bypass strategy for CU size decision by using the texture property of the current CU and coding information from neighboring CUs. In some approaches, the CU depth range was shortened, and some CU depth levels were skipped according to statistical information [8,9].
Although the heuristic based fast methods have achieved acceptable results, they cannot properly account for the partition properties of various video sequences. In other words, too many factors influence the partition result, and these factors may change with different sequences, so it is usually unclear which combination of factors performs best in practice. Generally, only several key factors closely correlated with CU partition are considered, and such a small number of factors may lead to a poor result.
Classical machine learning algorithms were introduced to overcome the drawback of statistical information based methods. Support vector machine (SVM) models with three outputs are used to obtain a trade-off between bit distortion and encoding complexity [10,11]. Besides, to further improve the prediction accuracy, two or more SVM models are employed at each CU depth. Zhang et al. [12] employed two SVMs at each depth to perform early CU split and early CU termination decisions. Zhu et al. [13] used cascaded SVMs and defined a misclassification cost and a risk area to jointly make a CU partition decision. Meanwhile, Grellert et al. [14] and Zhang et al. [15] also proposed SVM based approaches focusing on feature analysis. Besides, decision tree and data mining methods were also used to reduce the encoding complexity [16,17,18,19,20]. Also, Fisher's linear discriminant analysis and the k-nearest neighbors classifier were employed to quickly decide a CU partition [21], and Kim et al. [22] proposed a joint online and offline Bayesian decision rule based fast CU partition algorithm. Moreover, Yang et al. [23] proposed an efficient low complexity intra coding algorithm for versatile video coding (VVC) by using learning based classifiers.
Although classical machine learning based methods outperform the heuristic based ones thanks to their advantages in dealing with high dimensional problems, their inputs (i.e. features) play a very important role in the encoding results. Feature design is entirely manual and requires much experience. Additionally, many key features are already commonly used in existing works; hence, to significantly improve the encoding performance, researchers must find more efficient features, which is usually quite difficult. Furthermore, one or more classifiers generally need to be designed for each depth, which requires much work and training time. Algorithms using convolutional neural networks (CNN) were proposed to address these problems [24,25,26]. Thanks to the ability of CNNs to automatically extract region features, these algorithms have enormous advantages in image processing, and CNN based algorithms have achieved many good encoding results.
Liu et al. [27] devised a CNN based fast algorithm to prune no fewer than two partition modes from the RDO processing of each CTU. However, their CNN structure is too shallow to fully learn the relation between the image data and the partition structure. Besides, considering all available partition modes for a CU, pruning only two or more modes is not enough.
Besides, the algorithm proposed by Kim et al. [28] used image data and encoding information based vector data to train a CNN for the prediction of the CTU depth. However, in their algorithm, both image data and vector data must be collected before the prediction phase, which requires more pre-encoding time. Besides, three kinds of CNN structures have to be constructed, each designed for a CU at a certain depth. Therefore, at least three CNNs are needed for one video sequence.
Furthermore, Xu et al. [29] proposed an approach using a CNN and a long short-term memory (LSTM) network. Specifically, the CNN was used to predict the CU partition for intra coding, and the LSTM was used to predict the CU partition for inter coding. In their algorithm, the CNN and especially the LSTM part are very complex and need much time to be trained and refined.
As we can see, all these CNN based methods still complete the CU partition prediction at the depth level, which needs one or more CNN models for each CU depth. It means that at least three deep learning models are needed for a single video sequence. In other words, existing works still model the CU partition as a binary classification problem, which wastes much of the capacity of CNNs. In this paper, we construct a deep structure to exploit the sufficient learning capacity of ResNet. Meanwhile, we propose an end-to-end solution for the CTU partition. Importantly, the proposed algorithm is a general solution to the partitioning problem in many block based coding standards, and it models the CU partition as a multi-classification problem for the first time. To verify the effectiveness of our method, we implement it in HEVC intra coding as an example. First, bagged tree models are employed to classify a CTU, sharply reducing the number of classification categories. Then an end-to-end ResNet is trained to predict the final partition structure of a 32×32 CU instead of predicting the splitting decision of a CU at a certain depth. The main contributions of this paper are as follows: 1) Using the great superiority of ResNet, this paper proposes a deep neural network structure to predict the partition structure of a CU. It explores more sufficient relations between the image data and the partition. 2) Adopting a two-stage prediction strategy that combines bagged tree and ResNet, the proposed algorithm achieves more accurate prediction results and sharply reduces the number of CU partition categories. 3) For the first time, the CU partition task is modeled as a 17-class problem.
An end-to-end CNN prediction model is designed. As a result, only one CNN model is needed for a single video sequence instead of three or more. 4) A general solution to CU partitioning in video intra coding is proposed and verified on HEVC as an example. In a similar way, it can be implemented in other video coding standards to achieve gains in coding complexity reduction. This paper is organized as follows. Section 2 introduces the quad-tree based partition structure and gives a brief review of bagged tree and ResNet. Section 3 describes the proposed bagged tree and ResNet based end-to-end joint fast CTU partition algorithm. Section 4 reports the experimental results. The conclusions are summarized in Section 5.

Background
In this section, firstly we describe the CU partition technique in HEVC. Then, we give a brief review about bagged tree method, followed by a short introduction of ResNet model.

CU Partition of HEVC
To encode a video sequence, each frame is divided into multiple non-overlapping squares of size 64×64. As the largest coding unit (CU), a CU of size 64×64 is also called a coding tree unit (CTU). Larger CUs preserve more semantic information, while smaller CUs achieve more precise pixel prediction. Thus, HEVC employs a recursive splitting process to traverse all the possible partition results of a CU. Fig. 1 shows the recursive splitting process of a CTU; this process terminates when it reaches the smallest CU, whose size is usually configured before encoding and defaults to 8×8. Therefore, there are four kinds of CUs of different sizes, i.e. 64×64, 32×32, 16×16 and 8×8, shown as squares of different colors in Fig. 1. Following the quad-tree structure, HEVC employs RDO to make the optimal partition decision for a CTU. The left part of Fig. 2 shows a partition example of a CTU, and the right part of Fig. 2 is the corresponding quad-tree structure. As we can calculate, there are two options for a CU of size 16×16 (i.e. nonsplit, or split into four sub-CUs of size 8×8). However, the options for a CU of size 32×32 increase to 17 (1 + 2^4), and for a CTU the options go up to 83522 (1 + 17^4). Performing RDO over this space greatly increases the encoding complexity, which is too large to be handled directly by multi-classification techniques.
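The option counts above follow from a simple recursion: a CU either stays whole or splits into four sub-CUs, each of which contributes its own count independently. A minimal sketch that reproduces the numbers:

```python
# Count the possible quad-tree partitions of a CU by recursion:
# 1 option for "no split", plus (count for half-size)^4 when split.
def num_partitions(size, min_size=8):
    if size == min_size:
        return 1                 # smallest CU cannot split further
    return 1 + num_partitions(size // 2, min_size) ** 4

assert num_partitions(16) == 2      # nonsplit, or four 8x8 sub-CUs
assert num_partitions(32) == 17     # 1 + 2^4
assert num_partitions(64) == 83522  # 1 + 17^4
```

This is why the paper splits the task in two: 17 classes per 32×32 CU is tractable for a classifier, while 83522 classes per CTU is not.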

Bagged Tree
Traditional machine learning methods have shown great advantages in classification tasks. As one of the most famous classical machine learning models, the decision tree is widely used in various application scenarios, such as image recognition and data mining. Just like its name, the decision tree employs a tree structure to make a classification decision. For a classification task, each leaf node in a decision tree model represents a target class, and each father node contains an attribute along with a threshold. In this way, the decision making process goes deeper, and the predicted labels are generated.
Among traditional machine learning methods, prediction efficiency, time complexity and implementation difficulty vary a lot. However, compared with other models, the tree model has its unique advantages. First, it accords with human intuition in some classification problems, so it can achieve quite high accuracy on such problems through proper training. Second, it is much easier to train and takes less time to finish prediction, which brings negligible time overhead. Third, it is also easier to implement into existing works due to its simple "if...else..." based prediction process.
However, the decision tree is prone to overfitting the training set. The bagging technique is introduced to address this problem, yielding the bagged tree model: an ensemble of many decision tree classifiers, each trained on a random subset of the training dataset. After training, the final prediction of a bagged tree model for an instance is made by taking the majority vote of the predictions from all the individual decision trees on the input sample. Fig. 3 shows the structure of a bagged tree model consisting of n decision trees, and it also illustrates how the bagged tree generates the final prediction for an instance. Since the bagging technique decreases the variance of the model without increasing the bias, the performance of the bagged tree model is better than that of a single decision tree. Therefore, in the proposed joint fast CU partition algorithm, the bagged tree is used in the first phase to predict the splitting result of a CTU.
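The bootstrap-and-vote mechanics can be sketched in a few lines. Here simple decision stumps on a single hypothetical feature stand in for the paper's full decision trees over 10 CTU features; only the ensemble logic (bootstrap sampling plus majority vote) is the point.

```python
import random
from collections import Counter

# Minimal bagging sketch: train stumps on bootstrap samples, then
# predict by majority vote. The stump and feature are illustrative,
# not the paper's actual trees or features.
def train_stump(samples):
    # samples: list of (feature, label) drawn with replacement (bootstrap)
    thr = sum(x for x, _ in samples) / len(samples)
    # orient the stump to minimize training error
    err_pos = sum((x >= thr) != (y == 1) for x, y in samples)
    hi = 1 if err_pos <= len(samples) - err_pos else 0
    return lambda x: hi if x >= thr else 1 - hi

def bagged_predict(stumps, x):
    votes = Counter(s(x) for s in stumps)   # majority vote of all trees
    return votes.most_common(1)[0][0]

random.seed(0)
data = [(float(v), int(v > 5)) for v in range(11)]   # toy training set
stumps = [train_stump(random.choices(data, k=len(data)))
          for _ in range(15)]

assert bagged_predict(stumps, 10.0) == 1
assert bagged_predict(stumps, 0.0) == 0
```

Each stump sees a slightly different bootstrap sample, so the ensemble averages out individual errors, which is the variance-reduction effect the text describes.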
One of the most important properties of the bagged tree is that feature importance can be evaluated during training, without repeatedly training models on different feature combinations as most feature selection techniques require. Caruana et al. [30] proposed a feature importance measurement technique for bagged tree referred to as multiple counting. In multiple counting, the importance value of every feature is calculated according to the number of data samples decided by this feature across all decision trees of the ensemble [30].

Residual Network
In recent years, CNN has been one of the hottest topics due to its excellent performance on image problems, such as image recognition and object detection. Compared with traditional machine learning methods, CNN extracts effective features by using filters of various sizes instead of hand-crafted ones. Through convolution, CNN also greatly reduces the number of parameters required in a neural network. However, researchers found that once deeper networks start converging, the training error gets higher and the accuracy degrades rapidly as the network depth increases [31].
To address this problem, He et al. [31] proposed ResNet by using a deep residual learning framework. Fig. 4(a) shows the structure of a basic building block in ResNet. With the introduction of the shortcut in Fig. 4(a), a deep network constructed from many building blocks can converge easily, and the degradation problem is solved. A ResNet is constructed by stacking a number of building blocks like Fig. 4(a), and He et al. [31] provided two kinds of building blocks (Fig. 4(b-c)) from which a well performing ResNet of the expected depth can be generated. The building block in Fig. 4(b) is used to build a relatively deep network (ResNet-34). The building block in Fig. 4(c), called the "bottleneck", is used to stack ResNet-50/101/152, which denote ResNets containing 50, 101 and 152 layers, respectively.
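A toy numpy sketch of the identity-shortcut idea behind a building block; tiny dense layers stand in for the real convolutions, and the weights and sizes are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(v):
    return np.maximum(v, 0.0)

def residual_block(x, w1, w2):
    f = w2 @ relu(w1 @ x)   # the residual branch F(x)
    return relu(f + x)      # the shortcut adds the input back

x = rng.standard_normal(8)
w1 = rng.standard_normal((8, 8)) * 0.01   # near-zero weights
w2 = rng.standard_normal((8, 8)) * 0.01

y = residual_block(x, w1, w2)
# With near-zero weights the block approximates relu(x): the shortcut
# lets a deep stack start close to the identity mapping, which is why
# very deep ResNets still converge.
assert np.allclose(y, relu(x), atol=1e-2)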
ResNet can be as deep as 152 layers, or even deeper, while maintaining good convergence. As a result, ResNet is more suitable than other CNNs for solving the end-to-end CU partition problem. Therefore, the proposed algorithm models the CU partition problem as a multi-classification problem.

The Proposed Bagged Tree and ResNet Based Joint CTU Partition Method
In this section, we describe the proposed fast CTU partition algorithm for HEVC intra coding. First, we illustrate the overall process, explaining how the bagged tree model and the ResNet model work jointly. Then, effective features are designed to train the bagged tree models, and the architecture of the ResNet is designed to achieve good prediction performance. Finally, the databases are constructed to train and validate our bagged tree and ResNet models. Using these databases, we train 20 bagged tree models (each used for sequences of one resolution class and one QP value). Besides, we also train four ResNet models with the same architecture, each generated for a certain quantization parameter (QP) value, i.e. 22, 27, 32 and 37.

Flowchart of Our Fast Splitting Method
The flowchart of the proposed two-stage CTU partition algorithm is shown in Fig. 5. In the first stage, a bagged tree based classifier predicts the splitting status of a CTU, and the output is either split or nonsplit. Specifically, if a CTU is predicted as nonsplit, it goes directly to the CTU partition structure determination process shown in Fig. 5, and the final partition is a whole 64×64 CTU without any splitting. On the other hand, if a CTU is predicted as split by the bagged tree classifier, it is immediately split into four sub-CUs of size 32×32. Each of these four CUs is then passed to the second stage for further processing.
In the second stage, a 32×32 sized CU is fed to the fine-trained ResNet classifier, which outputs one of 17 labels, each representing a partition structure for a CU of size 32×32. In this way, all four sub-CUs of a CTU predicted as split in the first stage are classified, and four corresponding predicted labels are generated. These four labels, together with the parent CTU, are then processed further.
A CTU partition structure determination process follows the two stages described above. According to the four labels generated by the ResNet in stage 2, the partition structure of the corresponding parent CTU is finally decided. Fig. 6 shows the flowchart of the proposed algorithm. PL, whose value ranges from 1 to 17, is the predicted label output by the ResNet for a CU of size 32×32. The corresponding partition quad-tree structure is illustrated in Fig. 6, where the PL values of the four sub-CUs (numbered 1, 2, 3 and 4) are assumed to be 1, 5, 17 and 2. Once the partition of a CTU is generated, the optimal rate-distortion cost can be calculated directly without numerous comparisons, and the encoding process is then executed.
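The two-stage decision above can be sketched as follows. `predict_split` and `predict_pl` are hypothetical stand-ins for the trained bagged tree and ResNet models; only the control flow is taken from the paper:

```python
# Sketch of the end-to-end two-stage decision for one 64x64 CTU.
def decide_ctu_partition(ctu, predict_split, predict_pl):
    # Stage 1: bagged tree decides split / nonsplit for the CTU.
    if not predict_split(ctu):
        return "64x64"                  # whole CTU, no splitting
    # Stage 2: ResNet assigns one of 17 partition labels (PL) to
    # each 32x32 sub-CU; the four labels fix the CTU quad-tree.
    return [predict_pl(cu) for cu in split_into_four(ctu)]

def split_into_four(ctu):
    half = len(ctu) // 2
    return [
        [row[:half] for row in ctu[:half]],   # top-left
        [row[half:] for row in ctu[:half]],   # top-right
        [row[:half] for row in ctu[half:]],   # bottom-left
        [row[half:] for row in ctu[half:]],   # bottom-right
    ]

# Toy usage with dummy predictors:
ctu = [[0] * 64 for _ in range(64)]
assert decide_ctu_partition(ctu, lambda c: False, lambda c: 1) == "64x64"
assert decide_ctu_partition(ctu, lambda c: True, lambda c: 5) == [5, 5, 5, 5]
```

Note that no recursion over depths remains: one bagged tree call plus at most four ResNet calls replace the full recursive RDO search.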
Compared with the traditional recursive RDO, the proposed algorithm effectively skips the unnecessary search over different combinations of CU sizes. Besides, compared with other existing fast partition methods implemented at the depth level, this paper proposes a novel end-to-end two-stage CTU partition algorithm based on the bagged tree and ResNet techniques. In this way, the CTU partition result can be computed through one well-trained bagged tree model and one fine-tuned ResNet model. As a result, we not only significantly reduce the time spent on RDO, but also spend less time training the required learning models compared with existing works, which train as many as three or more models at the depth level.

Features Design for Bagged Tree Model
Traditional machine learning methods rely heavily on handcrafted features. Many existing works have developed a number of useful attributes for CU splitting label prediction. Based on these existing features, in our previous work [32] we also designed several novel features proven to be effective. For the particular task of this paper, we select 29 feature candidates used in [32]; they are listed in Table I along with their meanings. These feature candidates come from four sources: information from neighboring CTUs, side information from the pre-encoding process, statistical information of CTU pixels, and information from pixel filtering results. We give a brief introduction of these feature candidates in the following paragraphs; more details are available in our previous work [32]. The results of [33] suggested that information from neighboring CTUs is useful for the partition decision of the current CU. Thus, the features m_nbCtuAboRd, m_nbCtuLefRd, m_nbCtuAblRd and m_nbCtuAbrRd are extracted; they are the RD costs of the neighboring CTUs located above, to the left of, above-left of and above-right of the current CTU, respectively. Besides, for a target CTU, m_nbCtuAboDepth, m_nbCtuLefDepth, m_nbCtuAblDepth and m_nbCtuAbrDepth denote the average depth values of its above, left, above-left and above-right neighboring CTUs, respectively.
Besides, some bypass results during the encoding process are also important and widely used [34]. Hence, we pre-encode the current CTU with the PLANAR mode and use several useful encoding results as key features. The features totalCost, totalDistortion and totalBins represent the total cost, the total distortion and the number of bits under the PLANAR mode, respectively, and m_aveCBF is the coded block flag (CBF) of the current CTU encoded with the PLANAR mode. Moreover, HEVC uses the Hadamard transform to quickly estimate the encoding performance, so the encoding cost, the distortion and the number of bits generated under the Hadamard transform are extracted as features, denoted as m_costHadamard, m_sadHadamard and m_bitsHadamard, respectively.
Obviously, some classic statistics, such as the mean, variance and gradient, can reflect the content complexity of a CU and the difference among the four sub-CUs comprising one CU. Thus, for a target CTU, we calculate the mean and the variance of all pixel values as m_meanMain and m_varMain, respectively:

m_meanMain = (1/n) Σ_{i=1}^{n} p_i,
m_varMain = (1/n) Σ_{i=1}^{n} (p_i − m_meanMain)²,

where p_i is the luminance value of the ith pixel and n is the number of pixels in the current CU. Furthermore, the variance of the four sub-CU means is calculated and denoted as m_varMeanSub, and m_varVarSub is the variance of the four sub-CU variances:

m_varVarSub = (1/4) Σ_{i=1}^{4} (varSub_i − meanVarSub)²,

where varSub_i represents the variance of the luma pixels within the ith sub-CU, and meanVarSub is the mean of the four sub-CU variances (i.e. varSub_1, varSub_2, varSub_3 and varSub_4).
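These statistical features are straightforward to compute; a numpy sketch for a 64×64 CTU, using the paper's feature names as identifiers:

```python
import numpy as np

# Mean/variance features of a 64x64 CTU, plus the variance of the
# four 32x32 sub-CU means and of the four sub-CU variances.
def stat_features(ctu):                    # ctu: 64x64 luma array
    m_meanMain = ctu.mean()
    m_varMain = ctu.var()
    subs = [ctu[:32, :32], ctu[:32, 32:], ctu[32:, :32], ctu[32:, 32:]]
    sub_means = np.array([s.mean() for s in subs])
    sub_vars = np.array([s.var() for s in subs])
    m_varMeanSub = sub_means.var()         # variance of the 4 sub-CU means
    m_varVarSub = sub_vars.var()           # variance of the 4 sub-CU variances
    return m_meanMain, m_varMain, m_varMeanSub, m_varVarSub

flat = np.full((64, 64), 100.0)            # uniform CTU: all variances are 0
assert stat_features(flat) == (100.0, 0.0, 0.0, 0.0)
```

A flat CTU yields zero for every variance feature, which is exactly the kind of CTU the classifier should predict as nonsplit.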
In addition, we also reuse the feature m_nmse, which was designed and proven efficient in [11]. m_nmse is the mean of the squared errors between each pixel and the mean of its eight neighboring pixels within a CTU:

m_nmse = (1/N²) Σ_i Σ_j (p_{i,j} − p̄_{i,j})²,

where p_{i,j} is the luma value of the pixel located at (i, j) in the current CU, p̄_{i,j} is the mean of its eight neighboring pixels, and N is the side length of a CU (64 in the case of a CTU). Besides, we also apply several kinds of filters to the pixels of a CTU and process the filter responses to obtain useful features, which reflect the pixel variation, the content complexity and even the edge information of a target CTU. m_edgeSobel and m_dcom are features calculated from the responses of Sobel filters, which are widely used in edge detection. Fig. 7 shows four Sobel filters for detecting edges in different directions. Each filter is applied, with overlap, to every 3×3 square pixel block within the current CU; Fig. 7(a) is an example of such a 3×3 block. m_edgeSobel and m_dcom are obtained by accumulating these filter responses over all blocks, where k denotes the kth 3×3 square pixel block within the current CU and N is the number of pixels of a CU along one side (64 in the case of a CTU).
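A numpy sketch of m_nmse under the stated definition. Border handling is an assumption here (only interior pixels are used), since the original does not specify it:

```python
import numpy as np

# m_nmse sketch: for each interior pixel, squared error between the
# pixel and the mean of its eight neighbours, averaged over the CU.
def nmse(cu):
    n = cu.shape[0]
    errs = []
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            # sum of the 3x3 window minus the centre = eight neighbours
            neigh_sum = cu[i - 1:i + 2, j - 1:j + 2].sum() - cu[i, j]
            errs.append((cu[i, j] - neigh_sum / 8.0) ** 2)
    return sum(errs) / len(errs)

assert nmse(np.full((8, 8), 7.0)) == 0.0   # flat block: no local error
assert nmse(np.eye(8) * 8.0) > 0.0         # textured block: positive error
```

Intuitively, m_nmse is small for smooth CUs and large for textured ones, making it a cheap local-complexity indicator.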
As we can see, m_edgeSobel reflects information about horizontal and vertical edges, and m_dcom represents the comprehensive gradient information of the current CU from four directions. Moreover, Bay et al. [35] systematically analyzed the significant advantage of the Haar wavelet in capturing edge changes and scene shading of a picture. So we apply three Haar filters of different directions to the pixels of a target CTU. Fig. 8(b-d) show the Haar filters of the horizontal, vertical and diagonal directions, respectively, and Fig. 8(a) is an example of a 2×2 square pixel block on which the Haar filters are performed. Specifically, we first split a CTU into non-overlapping 2×2 sub-squares. Then, each sub-square is filtered by the same Haar filter to obtain the filter response of a certain direction, where d_x, d_y and d_xy denote the responses of the horizontal, vertical and diagonal directions, respectively. Based on these responses, we sum them to obtain the features:

m_haarSumx = Σ_k d_x^(k),  m_haarSumy = Σ_k d_y^(k),  m_haarSumxy = Σ_k d_xy^(k),
m_haarSumAbsx = Σ_k |d_x^(k)|,  m_haarSumAbsy = Σ_k |d_y^(k)|,  m_haarSumAbsxy = Σ_k |d_xy^(k)|,

where k runs over all (N/2)² sub-squares, and N is the number of pixels of the current CU along one side (64 in the case of a CTU). Furthermore, the number of interest points within a picture can reflect human visual perception, yet its influence on CU splitting has not been previously explored. So we extract the interest point related feature m_numAveInterestPoint for a target CTU by applying particular filters to the CTU. The feature m_numAveInterestPoint represents the average number of interest points per pixel of a CTU.
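A numpy sketch of the Haar features. The sign convention of the 2×2 responses is an assumption based on the standard Haar wavelet (right-minus-left, bottom-minus-top, diagonal difference), since Fig. 8 is not reproduced here:

```python
import numpy as np

# Haar features over non-overlapping 2x2 sub-squares of a CU.
def haar_features(cu):
    a, b = cu[0::2, 0::2], cu[0::2, 1::2]   # top-left, top-right pixels
    c, d = cu[1::2, 0::2], cu[1::2, 1::2]   # bottom-left, bottom-right
    dx = (b + d) - (a + c)                  # horizontal response
    dy = (c + d) - (a + b)                  # vertical response
    dxy = (a + d) - (b + c)                 # diagonal response
    return (dx.sum(), dy.sum(), dxy.sum(),
            np.abs(dx).sum(), np.abs(dy).sum(), np.abs(dxy).sum())

cu = np.zeros((64, 64))
cu[:, 33:] = 10.0                           # vertical edge through column 33
sums = haar_features(cu)
assert sums[0] == 640.0                     # 32 straddling blocks x response 20
assert sums[1] == 0.0                       # no vertical-direction change
```

The signed sums cancel over symmetric textures while the absolute sums do not, so together they distinguish directional edges from noise.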
It reflects how much attention people will pay to a CTU and how many details a CTU contains. The three filters shown in Fig. 9 are applied to the current CTU, and the three corresponding results D_xx, D_yy and D_xy are obtained as the filter responses in the horizontal, vertical and diagonal directions, respectively. The final value of m_numAveInterestPoint for the current CTU is obtained as:

P(i, j) = D_xx(i, j) D_yy(i, j) − (0.9 D_xy(i, j))²,
B(i, j) = 1 if P(i, j) > t, and 0 otherwise,
m_numAveInterestPoint = (1/N²) Σ_i Σ_j B(i, j),

where P(i, j) is the interest value of the pixel located at (i, j), B(i, j) is the Boolean value indicating whether pixel (i, j) is decided to be an interest point, t is the threshold for judging an interest point, and N, whose value is 64, is the size of the current CTU. Because the original interest point detection method uses more complicated filters to obtain P(i, j), which is very time consuming, the relative weight 0.9 is used to minimize the error between them.
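The thresholding step can be sketched as follows. The filter responses D_xx, D_yy, D_xy are supplied by the caller as stand-ins, since the exact box filters of Fig. 9 are not reproduced; the 0.9-weighted Hessian-style combination follows the description above:

```python
import numpy as np

# Average number of interest points per pixel of a CTU, given the
# three directional filter-response maps.
def num_ave_interest_points(dxx, dyy, dxy, t):
    p = dxx * dyy - (0.9 * dxy) ** 2       # interest value P(i, j)
    b = p > t                              # B(i, j): interest-point flag
    return b.sum() / b.size                # average over all N*N pixels

n = 64
dxx = np.ones((n, n)); dyy = np.ones((n, n)); dxy = np.zeros((n, n))
assert num_ave_interest_points(dxx, dyy, dxy, t=0.5) == 1.0   # all pixels pass
assert num_ave_interest_points(dxx, dyy, dxy, t=2.0) == 0.0   # none pass
```

The feature is thus bounded in [0, 1]; detail-rich CTUs, which tend to require deeper splitting, score higher.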
Features Selection for Bagged Tree Model

These 29 feature candidates are too many to implement directly. Hence, the candidates are ranked by the feature importance measurement technique introduced in Section 2 during the bagged tree training phase. According to the ranking results, we find a clear division in importance values between the 10th and the 11th most important features: the importance values of the top 10 features stay at a relatively high level, while those of the remaining candidates are generally very low. Thus, we pick the top 10 most important features as the final attributes, so each bagged tree model has its own unique feature set consisting of its top 10 features. Table II lists the top 10 features of the classifiers for the five resolution classes (i.e. A, B, C, D and E) and the different QP values, together with the corresponding importance values. As can be seen from Table II, the top 10 features of the classifiers for different QPs and resolution classes are almost the same, although their rankings change.

ResNet Model Designing and Training
Four ResNets with the same architecture are used; they are dedicated to video sequences encoded under different QP values, i.e. 22, 27, 32 and 37, respectively. The ResNets used in this work are based on the 34-layer ResNet in [31]. The specific architecture is shown in Fig. 10, where each convolution block represents a convolution followed by a normalization and a ReLU calculation. The ResNet consists of 34 layers and takes a luma CU of size 32×32 as input. With a fully connected layer at the very end, the ResNet outputs a vector describing the probability of the CU belonging to each class. Using this probability vector and the ground truth vector of the target CU, the cross entropy loss is calculated as:

L = − Σ_c V_2[c] log V_1[c],

where V_1 is the probability vector output by the ResNet, containing the probabilities of the CU belonging to every class, V_2 is the ground truth class label of the CU in one-hot encoding, and L is the cross entropy loss between V_1 and V_2. The cross entropy loss measures the difference between the predicted label and the ground truth of the 32×32 sized CU. Since it is convex with respect to the predicted probabilities, its minimum can be found by taking partial derivatives. Generally, training the ResNet model means finding suitable parameters that minimize the cross entropy loss over all the training samples.
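A numpy sketch of the 17-class cross entropy between the probability vector V_1 and a one-hot ground truth V_2, as described above:

```python
import numpy as np

# Cross entropy between a predicted probability vector and a one-hot
# ground truth; eps guards against log(0).
def cross_entropy(v1, v2, eps=1e-12):
    return -np.sum(v2 * np.log(v1 + eps))

v1 = np.full(17, 1.0 / 17.0)               # uniform prediction over 17 classes
v2 = np.zeros(17); v2[4] = 1.0             # one-hot ground truth (class 5)
loss = cross_entropy(v1, v2)
assert abs(loss - np.log(17.0)) < 1e-9     # -log(1/17) = log 17
```

A uniform prediction yields the maximum-entropy loss log 17 ≈ 2.83; a confident correct prediction drives the loss toward 0.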
We use the deep learning framework PyTorch to train the ResNets in this paper. The Adam optimizer is used with a learning rate of 0.01. Specifically, the models are trained for 40 epochs with a batch size of 32.
As for the databases used to train the bagged tree models, we extract the 10 key features of each CTU from a training sequence under a particular QP and resolution class. Besides, the splitting label of each CTU is also extracted to form the training database. So there are 20 databases in total, each constructed for one QP value and one resolution class, as shown in Table III. DB xy denotes the database used to train the bagged tree model for resolution class x and QP y. For example, DB A22 is the database constructed to train the bagged tree model which makes the splitting decisions for the CTUs of a sequence under resolution class A and QP 22.
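The 20-database organization can be sketched as a simple grouping by (resolution class, QP) key. The sample field names here are hypothetical; only the 5 × 4 = 20 bucket structure is taken from the text.

```python
# Sketch (assumed field names): bucket CTU samples into the 20 training
# databases DB xy, one per (resolution class x, QP y) pair.
QPS = (22, 27, 32, 37)
CLASSES = ("A", "B", "C", "D", "E")

def build_databases(samples):
    """samples: iterable of dicts with keys 'cls', 'qp', 'features', 'split'."""
    dbs = {(c, q): [] for c in CLASSES for q in QPS}
    for s in samples:
        dbs[(s["cls"], s["qp"])].append((s["features"], s["split"]))
    return dbs

dbs = build_databases([{"cls": "A", "qp": 22, "features": [0.1] * 10, "split": 1}])
```

Each bucket then trains its own bagged tree model, which is why DB A22 feeds exactly one classifier.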
Specifically, to form the databases used to train the bagged tree models, 36,000 CTU samples are randomly picked from the training sequences encoded by HM16.7. To balance each database, the ratio between the samples labeled split and those labeled nonsplit is kept at 1:1. When there are not enough samples in the whole encoded sequence, all the nonsplit samples are picked, and the same number of split samples is randomly picked.
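The balancing rule above (1:1 ratio, capped at 36,000 samples, falling back to the minority-class count when a sequence is short) can be sketched as:

```python
import random

# Sketch of the 1:1 class balancing described above: take equally many
# split and nonsplit samples, bounded by the smaller class and by the
# 36,000-sample database cap. The fixed seed is only for reproducibility.
def balance(split_samples, nonsplit_samples, cap=36000, seed=0):
    rng = random.Random(seed)
    n = min(len(split_samples), len(nonsplit_samples), cap // 2)
    return rng.sample(split_samples, n) + rng.sample(nonsplit_samples, n)

# Short sequence: only 40 nonsplit CTUs available, so 40 + 40 are kept.
db = balance(list(range(100)), list(range(40)))
```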
As for the databases constructed to train the ResNets, due to the end-to-end prediction structure, each partition structure of a 32×32 sized CU corresponds to a class label. There are 17 classes in total, and these classes together with their corresponding partition structures are shown in Fig. 11. As Fig. 11 shows, these classes cover all the possible partition results for a CU of size 32×32. In this way, a partition structure can be predicted by the ResNet through a single prediction.
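A quick way to see why there are exactly 17 classes: a 32×32 CU is either not split at all (1 case), or split into four 16×16 sub-CUs, each independently split into 8×8 blocks or not (2^4 = 16 cases). The concrete index order of Fig. 11 is not given here, so the mapping below is an assumption used only for illustration.

```python
# Sketch (assumed class indexing, not Fig. 11's actual order): map a
# 32x32 CU partition to one of 17 class labels. split32 is the depth-1
# split flag; sub_splits holds the four depth-2 split flags.
def partition_label(split32, sub_splits=(0, 0, 0, 0)):
    if not split32:
        return 1                     # class 1: CU not split (assumed index)
    bits = sum(b << i for i, b in enumerate(sub_splits))
    return 2 + bits                  # classes 2..17: one per 4-bit mask

labels = {partition_label(True, tuple((b >> i) & 1 for i in range(4)))
          for b in range(16)} | {partition_label(False)}
```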
In the databases for the ResNets, each sample includes the luma values of a 32×32 sized CU and its ground truth label according to the encoding results. To form the databases, all 32×32 sized CUs in a training sequence are collected. Different from the 20 bagged tree models, each ResNet model is designed for the sequences encoded under a certain QP, so 4 ResNet models need to be generated in total. Hence, 4 databases are generated (denoted as DB-22, DB-27, DB-32 and DB-37), constructed for the sequences encoded with QP 22, 27, 32 and 37, respectively. Taking the generation of DB-22 as an example, we first encode five training sequences with QP 22. Then, the luma values and class labels of all 32×32 sized CUs from these five sequences are collected to make up DB-22. DB-27, DB-32 and DB-37 are generated similarly. It is worth noting that the only difference among these four databases is the labels; the luma pixel values are the same across databases, because they all come from the same training sequences. In this paper, chroma CUs are split in the same way as luma CUs, so models and datasets for chroma are not considered.

Experimental results and discussion
In order to verify the performance of the proposed bagged tree and ResNet joint fast algorithm (BTRNFA), we implemented this algorithm in the HEVC reference encoder HM16.7. The original HM16.7 software is available online (https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.7/). Besides, the deep learning framework PyTorch is used to perform the training and prediction of the ResNet models. Test sequences were selected from the HEVC common test conditions (CTC) [36]. To analyze the fitting performance and the generalization of the models, BTRNFA is also carried out on the five sequences used in model training.
Experiments were conducted on an Aliyun host running a 64-bit Windows operating system. The host has 4 cores of an Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50 GHz (3.10 GHz) and 16 GB of memory. Coding parameters were set as default, and all results were generated under the all-intra main configuration. The number of frames to be encoded was set to the maximal value for each sequence according to the CTC.
There are three versions of the proposed BTRNFA. The first version is denoted as BTRNFA DT , in which only the bagged tree model is active. In this situation, the CUs of depth 0 are partitioned according to the predicted labels output by a bagged tree model, while the CUs of depths 1 and 2 are processed by RDO. The second version is BTRNFA ResNet , in which only the ResNet model for the CUs of depths 1 and 2 is active, and RDO is performed on the CUs of depth 0. The third version, BTRNFA joint , is the most aggressive: both the bagged tree model and the ResNet model are activated, so BTRNFA joint achieves the greatest reduction of encoding time.
First, BTRNFA DT , BTRNFA ResNet and BTRNFA joint were carried out for performance comparison, and we analyzed the contributions of the bagged tree model and the ResNet model to the proposed joint algorithm. In addition, we compared the BTRNFA joint algorithm with the CNN based state-of-the-art CU depth decision algorithms, i.e. the CU partition mode decision algorithm (PMDA) in [27], the complexity reduction method (CRM) in [29] and the fast CTU partition approach (FPA) in [26]. To measure rate-distortion performance, the BD-rate was calculated. Complexity reduction was measured by the encoding time saving (denoted as TS), calculated as

TS = (time_ref - time_pro) / time_ref × 100%,

where time_pro is the sequence encoding time of the proposed algorithm, and time_ref is the encoding time spent by the reference encoder.
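The TS metric is a one-liner; the sketch below just makes the sign convention explicit (a positive TS means the proposed encoder is faster).

```python
# Time saving TS = (time_ref - time_pro) / time_ref * 100, in percent.
def time_saving(time_ref, time_pro):
    return (time_ref - time_pro) / time_ref * 100.0

# Illustrative values only: a reference run of 100 s reduced to 40.21 s
# corresponds to the paper's average figure of 59.79% time saving.
ts = time_saving(100.0, 40.21)
```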

Performance of Three Versions of The Proposed Algorithm
In this section, we discuss the BD-rate and time saving performance of the three versions of the proposed BTRNFA (i.e. BTRNFA DT , BTRNFA ResNet and BTRNFA joint ). Experimental results of each version are shown in Table IV. We can see from Table IV that BTRNFA DT achieves 0.25% BD-rate loss with a complexity reduction of 22.37% on average. Its BD-rate loss is the smallest among the three versions, and its encoding time saving is also the smallest. The average 0.25% BD-rate loss shows that our bagged tree model is well trained and has a high prediction accuracy on the test sequences. However, the time saving is limited, because only the bagged tree model is activated in this version, so the encoding complexity of CU partitioning at depths 1 and 2 is not reduced.
On the contrary, the bagged tree model is disabled in BTRNFA ResNet , and only the ResNet model is active. In this case, RDO is performed on each CTU, while the partition structures of CUs with depth 1 are generated by the end-to-end prediction. This version achieves 1.81% BD-rate loss with a time saving of 49.83% on average compared with the original algorithm in HM16.7. We can see from Table IV that, compared with BTRNFA DT , the time saving of BTRNFA ResNet increases by about 27.46% at a 1.56% sacrifice in terms of BD-rate. These results show that the prediction accuracy of the ResNet is not as high, so more BD-rate loss is caused. On the other hand, they also show that more complexity can be saved at CU depths 1 and 2 than at depth 0.
BTRNFA joint is the most aggressive version, in which both the bagged tree model for the CTU and the ResNet model for the CU in depth 1 are activated. Its results are shown in the last two columns of Table IV. BTRNFA joint achieves as much as 59.79% complexity reduction with only 2.02% BD-rate loss. Since it has the largest time saving among the three versions, its BD-rate loss is acceptable. As we can see, the performance of the proposed BTRNFA joint varies with the sequences listed in Table IV. To figure out which kinds of videos are suitable for BTRNFA joint , we use the ratio of time saving to BD-rate loss for further analysis. On this measure, BTRNFA joint performs better on videos of resolution classes C and D than on those of classes A, B and E. It means our method performs better on videos of low resolution, which may be due to a higher prediction accuracy on CTUs in low resolution videos. Besides, observing the performance differences on sequences of the same resolution class, we can see that BQTerrace shows better results than BasketballDrive. Considering BD-rate and time saving jointly, PartyScene and BlowingBubbles outperform BQMall and BasketballPass, respectively. It is easy to find that the sequences BasketballDrive, BQMall and BasketballPass all contain multiple objects moving fast in the scene, while sequences containing many still or slow-moving contents make it easy for our method to achieve good results. Thus, we can conclude that videos of low resolution with less moving content are well suited to the proposed method.
As for the performance of our method under other QP settings, on the one hand we can use the current models to directly encode videos under other QPs. This does not affect the RD performance, but only reduces the benefit in time saving. On the other hand, we can train one model for each particular QP. This will obviously result in better RD performance and more time saving as well, since the model is targeted to that QP. In short, experiments on the existing QPs have verified the effectiveness of our algorithm; for other QPs, our method can achieve a better trade-off between RD performance and time saving by training the corresponding models.

Prediction Accuracy and Inferring Overhead of the Proposed Method
To analyze the performance of the proposed BTRNFA joint on each stage, we present the prediction accuracy of bagged tree models and ResNet models as well as the GPU inferring time overhead.
Firstly, Table V shows the prediction accuracy of the different bagged tree models for CUs in depth 0. The average accuracy of these 20 bagged tree models on the splitting prediction for CUs in depth 0 is 89.86%. The prediction accuracy is quite stable at around 90% for all four QP values, as can be observed from the last row of Table V. Besides, we find that the four bagged tree models trained for resolution class A all perform below average, perhaps because the video sequence used for training is not representative.
Besides, Table VI presents the prediction accuracy of the four ResNets at different levels. On the overall prediction level, the four ResNets show quite stable accuracy, distributed between 56% and 58%. Although the overall accuracy is low, their accuracy at the individual depth levels is quite good. In Table VI, the prediction level depth 1 means the split flag prediction (i.e. split or non-split) of CUs in depth 1, and the prediction level depth 2 means the splitting decision prediction for CUs in depth 2, which is also a binary decision. As we can see from the third row of Table VI, although the overall prediction accuracy of the four ResNet models is not high, the prediction accuracy of the splitting decision at each depth level is quite high. This is because the difference between classes is not obvious in the multi-classification task for CU partitioning, and a slight prediction error does not necessarily lead to a splitting decision error at a given depth level. The high prediction accuracy of the ResNets at each depth level also verifies the correctness of our algorithm. This paper aims at reducing the encoding complexity by deciding the partitioning structure of a CTU early, so the complexity introduced by the ResNet model must be considered. As a result, an appropriately sized network structure, shown in Fig. 10, is designed, and the encoding time of the proposed method includes the corresponding inference time, which is shown in Table VII. It can be observed from Table VII that the inferring overhead of the four ResNets is about 1% of the original HM encoder. It is worth noting that the inference process is carried out on GPU, so this consumption is very low.
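The gap between overall accuracy and per-depth accuracy can be made concrete: each of the 17 class labels decomposes into one depth-1 split flag and four depth-2 split flags, so a misprediction to a neighboring class often flips only a single depth-2 flag. The class indexing below is an assumption (1 = no split, 2..17 = split with a 4-bit sub-CU mask), not Fig. 11's actual order.

```python
# Sketch (assumed indexing): decompose a 17-class partition label into
# the depth-1 split flag and the four depth-2 split flags.
def depth_flags(label):
    if label == 1:
        return 0, (0, 0, 0, 0)       # depth-1 flag, four depth-2 flags
    bits = label - 2
    return 1, tuple((bits >> i) & 1 for i in range(4))

# Mispredicting class 17 as class 16 is a full classification error,
# yet it keeps the depth-1 flag and three of the four depth-2 flags.
f1a, f2a = depth_flags(17)
f1b, f2b = depth_flags(16)
```

This is why a ~57% overall accuracy is compatible with the much higher per-depth accuracies reported in Table VI.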

Performance Comparison with CNN Based State-of-the-Art Algorithms
This section compares the encoding performance of the proposed BTRNFA joint with two other CNN based fast intra mode decision algorithms. These two algorithms are implemented on different versions of the reference software. In order to make a fair comparison, we apply the proposed BTRNFA joint to the reference software versions of the two comparison methods respectively. The corresponding results are presented in the following paragraphs.
The PMDA in [27] reduces the hardware complexity of the encoder by removing more than two CTU partition modes from RDO. It achieves 60.86% complexity reduction on average, while the BD-rate loss increases by as much as 2.74% on HM12.0. Compared with their results, the proposed BTRNFA joint saves about 0.73% BD-rate loss with a negligible encoding time increment of 0.54% on average. Detailed results shown in Table VIII indicate that BTRNFA joint exceeds PMDA on BD-rate with quite close time saving. As for the encoding complexity reducing approach CRM in [29], an early terminated hierarchical CNN is proposed to decide the CU partitioning. The encoding complexity of HEVC intra mode decision is dramatically reduced by replacing the RDO with the predicted partitioning result. On average, 61.91% of the encoding complexity is saved with a 2.24% BD-rate loss increment compared with the original HM16.5. Comparison results are shown in Table IX. Compared with CRM, the proposed BTRNFA joint achieves 0.22% less BD-rate loss, i.e. 2.02%, with only 2.60% more encoding time. Besides, only one ResNet model is required in BTRNFA joint , while the number of CNN models used in CRM is as large as 3. This means that the proposed algorithm has advantages in both model training and encoder implementation.
Generally speaking, the proposed BTRNFA joint outperforms the two CNN based state-of-the-art algorithms. On average, an additional 0.73% and 0.22% of BD-rate is saved compared with PMDA and CRM, respectively. Though the time saving of BTRNFA joint is very close to theirs, our approach shows a clear advantage in the number of models required in encoding by introducing an end-to-end partitioning solution. Thus, the complexity of model training and implementation is quite low and competitive for the proposed BTRNFA joint .

Comparisons of CU Partition Results
To compare the differences of the CU partition results between the original algorithm in HM16.7 and the proposed BTRNFA joint algorithm, we encode the 180th frame of the sequence RaceHorses (416×240) with QP 22 using both algorithms. Fig. 12 presents the CU partition results of the original HM16.7, in which the black lines represent the CU boundaries. Fig. 13 shows the CU partition results of the proposed BTRNFA joint , in which the gold and red lines represent the CUs whose partition decisions differ from those of the original HM16.7. It can be observed from Fig. 13 that identical CU partition results take up a quite large portion. Setting the results in Fig. 12 as the baseline, we observe that almost all the partition results in background and plain areas stay the same in Fig. 13. This observation indicates that our bagged tree model and ResNet model achieve a high accuracy on CUs with flat content. Besides, as we can see, the differences between Fig. 12 and Fig. 13 mainly exist in the edge areas, which means that our algorithm should be improved on the prediction of the particular CUs that cross the boundaries of objects.
In general, compared with the optimal CU partition results generated by the original HM16.7 shown in Fig. 12, there are not many differences in the CU partition results generated by the proposed BTRNFA joint . Our BTRNFA joint algorithm performs well, and its CU partition results are satisfactory.

Figure 2
A partition example of a CTU and the corresponding quad-tree splitting structure in HEVC intra coding. The structure of the bagged tree: an illustration of how the final prediction for a sample input is generated by a bagged tree model, which is composed of multiple decision trees, each trained on a random subset of data. An example of the CTU partition structure determination process with the corresponding partition quad-tree. P L is the predicted label for a 32×32 sized CU output by the ResNet; P L is set as an integer from 1 to 17. Architecture of the ResNet used in this paper; the dotted lines represent the shortcuts which increase the dimensions.

Figure 11
Class labels and corresponding partition structures of a 32×32 sized CU for ResNet multi-classification.
The variate label is the class index of different partitions for a CU of size 32×32.

Figure 12
Partition results of the 180th frame in the sequence RaceHorses (416×240), which is encoded by the original HM16.7 with QP 22. The black lines represent the splitting boundaries between CUs.

Figure 13
Partition results of the 180th frame in the sequence RaceHorses (416×240), which is encoded by the third version BTRNFAjoint of the proposed algorithm with QP 22. Black lines represent the same partition results as those of the original HM16.7. Gold lines represent the boundaries of CUs which are split by the original HM16.7 but not split by BTRNFAjoint. The boundaries of CUs which are decided non-split by the original HM16.7 but split by BTRNFAjoint are shown with red lines.