Rethinking the Value of Local Feature Fusion in Convolutional Neural Networks

The traditional CNN head for classification tasks typically consists of a global average pooling layer placed before the final fully-connected classifier. However, such a simple and lightweight head lacks feature fusion ability and cannot give full play to the strong feature extraction ability of the network body. In the present work, we analyze the Basic Block and Bottleneck structures in ResNet in depth and reveal the importance of performing feature fusion inside local patches via 1 × 1 convolution. We propose a new head structure consisting of three stages with a series of 1 × 1 convolutions to replace global average pooling. With little additional FLOPs and inference speed drop, our new head improves the accuracy of ResNet18 by 3.6% on ImageNet and of ResNet56 by 5.0% on CIFAR-100.


Introduction
Convolutional neural networks (CNNs) are widely used in computer vision tasks, including scene recognition and image classification. The overall structure of a CNN has three parts: the stem, the body, and the head [5]. The stem receives the image input, extracts basic features, and down-samples. The body accounts for the vast majority of the computing cost and is composed of many stages (usually 4 stages for ImageNet [6] and 3 stages for CIFAR [7]), each corresponding to network layers or blocks with the same output resolution. The head receives the output feature map from the body and encodes it into a 1-D vector for final inference. Figure 1 shows the three parts of different network architectures in detail. The head provides a summary of features and has a great impact on the overall performance of the network, yet it is easily overlooked and rarely studied.
Early architectures [1, 8] flatten the feature map to obtain the input of the fully-connected layer, as in Fig. 1a. NiN [2] proposes global average pooling to avoid overfitting and reduce computation cost, see Fig. 1b. Many recent architectures [3, 9-11] introduce a fully-connected classifier after the global average pooling, as in Fig. 1c. Some architectures [4, 12, 13] add an additional wide 1 × 1 convolution before global average pooling, as shown in Fig. 1d. The head structures in Fig. 1c-d are commonly adopted; they down-sample the spatial resolution of the body output (where feature maps usually have a resolution of 7 × 7) to 1 × 1 directly via global average pooling. Such head structures lose too much spatial information and do not have enough feature fusion ability.
A 1 × 1 convolution has the ability to fuse existing features at the same spatial location, which we call local feature fusion. Lin et al. [2] have already revealed the powerful local feature fusion ability of 1 × 1 convolution and the importance of local feature fusion.
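To make this concrete, the following small PyTorch check (our own illustration, not code from the paper) verifies that a 1 × 1 convolution applies the same linear map to the channel vector at every spatial location, i.e., it fuses information only across channels within each local position:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 7, 7)                        # (N, C, H, W) feature map
conv1x1 = nn.Conv2d(64, 128, kernel_size=1, bias=False)

y = conv1x1(x)                                      # (1, 128, 7, 7)

# Equivalent per-pixel matrix multiply with the same weights: each position's 64-dim
# channel vector is mapped to 128 dims independently of its spatial neighbours.
w = conv1x1.weight.view(128, 64)                    # (out_channels, in_channels)
y_ref = torch.einsum("oc,nchw->nohw", w, x)
print(torch.allclose(y, y_ref, atol=1e-5))          # True
```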
In this paper, we propose a CNN head to replace the traditional head based on global average pooling, see Fig. 1e. Our head structure uses more 1 × 1 convolution layers for better feature fusion, and smoothly reduces the spatial resolution of the body output to 1 × 1 to make full use of spatial information.
The experimental results show that our head greatly enhances the deep network features. There are also other ways to enhance the discrimination power of deep features, including designing new loss functions and normalizing the feature vector [14-20]. These optimization methods are orthogonal to our new head structure and can be used in combination with it.
We summarize our main contributions as follows: (1) We reveal the importance of local feature fusion through 1 × 1 convolution, and analyze the properties of ResNet blocks from the perspectives of local feature fusion ability and efficiency. (2) We propose a new CNN head, which better fuses local features and makes full use of spatial information. (3) Extensive experiments show that the proposed head achieves large accuracy gains on CIFAR-100 and ImageNet classification: the improvement is 3.6% for ResNet18 on ImageNet and 5.0% for ResNet56 on CIFAR-100.

Fig. 1 The stem, body, and head parts of different network architectures. a-d Architectures of AlexNet [1], NiN [2], Basic Block-based ResNet [3] and MobileNet-v2 [4]. e Our proposed head, which consists of a single 1 × 1 convolution layer (followed by batch normalization and ReLU activation) and an average pooling layer for spatial down-sampling in each stage

Related Work
Deep convolutional neural networks have achieved great success in computer vision since the introduction of AlexNet [1]. Various advanced architectures such as VGG [8], ResNet [3], DenseNet [11], and ResNeXt [10] have been proposed ever since. Neural Architecture Search (NAS) technology [21] has also spawned architectures obtained by automatic search, including NASNet [22], MnasNet [23], etc. The MobileNet family [4, 24, 25] is designed for lightweight computation and replaces the expensive 3 × 3 convolution with depth-wise separable convolution; ShuffleNet [13, 26] further proposes channel shuffle and group convolution to reduce the cost of the 1 × 1 convolution in MobileNet. Lin et al. [2] propose MLPConv in NiN for feature extraction in local patches before combining them into higher-level concepts, which is the first use of 1 × 1 convolution for local feature fusion. GoogLeNet [9] then uses 1 × 1 convolution to reduce dimensions in the multi-branch Inception module. The 1 × 1 convolution layers in the ResNet Bottleneck [3] are responsible for reducing and then increasing dimensions, leaving the 3 × 3 layer a bottleneck with smaller dimensions. The well-known MobileNet family [4, 24, 25] is built primarily from depth-wise separable convolution [27], a form of factorized convolution that splits a standard convolution into a depth-wise convolution and a 1 × 1 convolution that fuses information among channels.
The Bottleneck structure adopted in ResNet50/101/152 enables very deep ResNet models to achieve even better performance. He et al. [3] interpret the use of the Bottleneck as a trade-off between building a deeper network and saving computing cost. We find that the 1 × 1 convolutions in the Bottleneck have a strong ability of local feature fusion, which benefits network performance a lot, especially for deep models.
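For reference, the two block types compared below can be sketched as follows (a simplified illustration that omits stride, projection shortcuts, and the residual addition, not the exact torchvision definitions). The Bottleneck wraps its 3 × 3 convolution with two 1 × 1 convolutions that first reduce and then restore the channel dimension, and it is these 1 × 1 layers that perform local feature fusion:

```python
import torch.nn as nn

def basic_block_branch(channels):
    """Residual branch of a Basic Block: two 3x3 convolutions, no 1x1 fusion layer."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
    )

def bottleneck_branch(channels, expansion=4):
    """Residual branch of a Bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand."""
    mid = channels // expansion
    return nn.Sequential(
        nn.Conv2d(channels, mid, 1, bias=False),
        nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
        nn.Conv2d(mid, mid, 3, padding=1, bias=False),
        nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
        nn.Conv2d(mid, channels, 1, bias=False),
        nn.BatchNorm2d(channels),
    )
```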
GENet [28] explains the adaptability of the Basic Block, Bottleneck [3], and Inverted Bottleneck [4] in different network stages from the perspective of intrinsic rank, and regards the latter two as low-rank approximations of the Basic Block. GENet is obtained by neural architecture search; it uses the Basic Block in stage-1/2, and the Bottleneck and Inverted Bottleneck in stage-3/4.

We find that the Bottleneck-based ResNet (BL-ResNet) is slower than the Basic Block-based ResNet (BB-ResNet) even under the same number of weighted layers, in which case BL-ResNet has fewer blocks. We replace all the blocks in ResNet18 with Bottlenecks to produce BL-ResNet26. As Table 1 shows, the 8-block BL-ResNet26 is even slower than the 16-block ResNet34. We attribute such results to the larger memory access cost of the Bottleneck, as it introduces feature maps with four times the number of channels.
Though slower, BL-ResNet outperforms BB-ResNet in accuracy with the same number of blocks stacked, as shown in Table 1. For example, BL-ResNet26 achieves similar accuracy to ResNet34, even though it contains eight fewer weighted layers and is more affected by the limitation of the receptive field. With the same number of stacked blocks ((2,2,2,2) in the four stages), BL-ResNet26 outperforms ResNet18 by 3.07% in accuracy.
To examine which block replacement leads to the accuracy gain and the inference speed drop, we replace the blocks in BB-ResNet with Bottlenecks stage by stage. The results in Table 2 show that the Bottleneck is more appropriate in deeper stages: deep stages benefit more from the Bottleneck, and Bottlenecks in deep stages might introduce less memory access cost (feature maps in deep stages are smaller in size). Both ResNet34 and ResNet50 are composed of (3,4,6,3) residual blocks in the four stages. The only difference between the two is that ResNet34 is based on the Basic Block and ResNet50 is based on the Bottleneck. ResNet50 is 2.18% more accurate than ResNet34, while ResNet34 is 42% faster. We gradually replace the blocks in ResNet34 with Bottlenecks stage by stage from stage-4 to stage-1 and show the results in Table 3. Replacing only the three blocks in stage-4 with Bottlenecks improves accuracy by 1.36% with only a 5% inference speed drop. This accuracy gain accounts for the main improvement of ResNet50 over ResNet34. Restoring the stage-1 blocks in ResNet50 to Basic Blocks has little impact on accuracy but improves inference speed by 22%. We hold the view that shallow layers concentrate more on basic texture extraction, while deep layers pay more attention to feature fusion as they contain high-level semantic features. The secret of why the Bottleneck fits deep stages better may lie in its 1 × 1 convolutions, which have the ability of local feature fusion.

What Does 1 × 1 Convolution Do
Block structures like the Bottleneck [3] and Inverted Bottleneck [4] contain 1 × 1 convolutions, so they can better fuse local features. Lin et al. [28] regard the Bottleneck and Inverted Bottleneck as low-rank approximations of the Basic Block in GENet. However, pure Basic Block-based models are not competitive in performance, especially for very deep networks. We argue that it is the local feature fusion ability of the 1 × 1 convolution layers that makes the Bottleneck and Inverted Bottleneck fit deeper network stages better. Some network architectures [4, 12, 13, 25, 28] insert a wide 1 × 1 convolution layer just before global average pooling to better fuse local features and increase the dimension of the classifier input (see Fig. 1d for detail). In MobileNet-v2 [4], the width multiplier is applied to the network stem and body but not to the head if it is lower than 1.0. Sandler et al. [4] explain that the wide 1 × 1 convolution in the head benefits compact models a lot.
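For contrast with the head we propose later, a Fig. 1d-style head amounts to a single wide 1 × 1 fusion layer followed by global average pooling and a linear classifier. The sketch below is purely illustrative; the channel counts (320 → 1280 → 1000) follow MobileNet-v2 as an example, and the exact normalization/activation details of specific networks may differ:

```python
import torch.nn as nn

# Sketch of a Fig. 1d-style head: wide 1x1 fusion on the 7x7 body output,
# then global average pooling and a linear classifier.
fig1d_head = nn.Sequential(
    nn.Conv2d(320, 1280, kernel_size=1, bias=False),
    nn.BatchNorm2d(1280),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(1280, 1000),
)
```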
The head part of the network contains the deepest stages, including the classifier, and local feature fusion is particularly important there. The fully-connected classifier in the head is a simple linear function and does not have enough feature fusion ability. Thus, the dimension and the quality of the classifier's input restrict the performance of the classifier.
Depth-wise separable convolution-based networks [4, 24, 25] place most of the computational overhead on the 1 × 1 convolutions, which perform local feature fusion. Although the feature extraction ability of the network body is somewhat limited by depth-wise separable convolution, the performance of these networks is still very good.

Structure of the New Head
Most CNN heads perform global average pooling after stage-4, leading to a loss of spatial information. To solve this problem, we propose a new head structure, as Fig. 1e shows. Our new head contains three stages. The first stage (stage-4) directly receives the output of the network body, and we append two additional stages in the head with spatial resolutions of 4 × 4 (stage-5) and 2 × 2 (stage-6). Each of the three stages consists of one 1 × 1 convolution layer (followed by batch normalization [29] and ReLU [30]). The input of stage-5 is obtained via adaptive average pooling with an output size of 4 × 4 to adapt to input images of different resolutions.
Large convolution kernels like 3 × 3, 5 × 5, or even 7 × 7 have strong feature extraction ability but are not well suited to local feature fusion, as they are inevitably affected by neighboring pixels in the feature maps. Taking ResNet18 as an example, Table 8 shows that a 3 × 3-based head is 0.48% worse than a 1 × 1-based head even when the number of channels is fixed. This result shows that large kernels degrade performance when used for local feature fusion. We discuss the results of using various units in the head in detail later.
We denote the widths of the three convolution layers in the head (e.g., (512, 1024, 512)) as the width of the head. We set the width ratio of the three 1 × 1 convolution layers to 1:2:1, since halving the width of the first (stage-4) layer saves FLOPs, and halving the width of the last (stage-6) layer improves parameter efficiency.
Our head smoothly down-samples the resolution of the body output to 1 × 1. Such a head structure makes full use of the spatial information and performs better feature fusion, thus improving the overall performance of the network.
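A minimal PyTorch sketch of this head is given below, under our reading of the description above; the module and layer names, the choice of a plain 2 × 2 average pooling between stage-5 and stage-6, and the final pooling-plus-linear-classifier wiring are our assumptions rather than the authors' released code:

```python
import torch
import torch.nn as nn

def conv1x1_bn_relu(in_ch, out_ch):
    """One head stage: 1x1 convolution followed by batch normalization and ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class LocalFusionHead(nn.Module):
    """Three-stage head with widths in a 1:2:1 ratio, e.g. (512, 1024, 512) for ResNet18."""
    def __init__(self, in_channels=512, widths=(512, 1024, 512), num_classes=1000):
        super().__init__()
        w4, w5, w6 = widths
        self.stage4 = conv1x1_bn_relu(in_channels, w4)   # fuses the 7x7 body output
        self.pool4 = nn.AdaptiveAvgPool2d(4)             # down-sample to 4x4
        self.stage5 = conv1x1_bn_relu(w4, w5)
        self.pool5 = nn.AvgPool2d(2)                     # 4x4 -> 2x2
        self.stage6 = conv1x1_bn_relu(w5, w6)
        self.pool6 = nn.AdaptiveAvgPool2d(1)             # 2x2 -> 1x1
        self.classifier = nn.Linear(w6, num_classes)

    def forward(self, x):
        x = self.pool4(self.stage4(x))
        x = self.pool5(self.stage5(x))
        x = self.pool6(self.stage6(x))
        return self.classifier(torch.flatten(x, 1))

# Example: feat = body(images) has shape (N, 512, 7, 7) for ResNet18;
# logits = LocalFusionHead()(feat) then has shape (N, 1000).
```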

ImageNet Classification
We evaluate our method on the ILSVRC2012 dataset [6], which contains 1000 classes. The models are trained on the 1.28 million training images and evaluated on the 50k validation images.

Training Setup
We train our models using PyTorch 1.9 [31] with PyTorch Automatic Mixed Precision (AMP) acceleration. We use the standard SGD optimizer with momentum set to 0.9, weight decay set to 1 × 10⁻⁴, and the initial learning rate set to 0.2. Models are trained for 90 epochs using a cosine decay schedule [32] and 2 epochs of linear warm-up [33] on 4 RTX3080 GPUs, with a total batch size of 512. For MobileNet-v2, we adjust the weight decay to 4 × 10⁻⁵ and the initial learning rate to 0.1, and extend training to 150 epochs as in [4]. For GENet, we adjust the training to 120 epochs. As we focus on revealing the importance of local feature fusion and the effectiveness of our new head, we only use the standard data augmentation of random resized crop and random horizontal flipping in most of our experiments. We do not use any other tricks such as label smoothing [33], mix-up [34], cut-mix [35], auto-augmentation [36], etc., unless otherwise specified. The results marked with "*" in Table 4 show that our head is also effective under a stronger training schedule.
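For concreteness, a sketch of this optimization schedule is shown below (SGD with momentum 0.9, weight decay 1 × 10⁻⁴, base learning rate 0.2, 2-epoch linear warm-up, cosine decay over 90 epochs); the exact warm-up and decay formulas and the model/data plumbing are our assumptions:

```python
import math
import torch
import torchvision

model = torchvision.models.resnet18(num_classes=1000)   # placeholder body + head
epochs, warmup_epochs, base_lr = 90, 2, 0.2
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=1e-4)

def lr_at(epoch: int) -> float:
    """Linear warm-up for the first epochs, then cosine decay to zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    t = (epoch - warmup_epochs) / (epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * t))

for epoch in range(epochs):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(epoch)
    # ... one epoch of AMP training (torch.cuda.amp.autocast + GradScaler) over the data ...
```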
Although replacing the residual blocks in stage-4 with Bottlenecks in ResNet18 and ResNet34 could improve accuracy by about 1% with only a 6% GPU inference speed drop (as discussed in Sec. 3.1), we follow the original block configuration to minimize structural modifications and highlight the effect of our new head.

Main Results
We test our head on various network structures, including the traditional convolution-based ResNet [3], the depth-wise separable convolution-based MobileNet-v2 [4], the group convolution-based RegNet [5], and GENet [28], which combines different blocks. Table 4 shows that our proposed head improves many network architectures. Increasing the head width can improve model accuracy with little additional FLOPs and inference speed drop, although it introduces more parameters.
As Table 4 shows, our head improves the top-1 accuracy of ResNet18 by 3.61%, from 71.15% to 74.76%, with only 12% additional FLOPs. This accuracy approaches the performance achieved by the much bigger ResNet34 (74.41%). Similarly, ResNet50 with our head achieves a top-1 accuracy of 78.23%, exceeding the baseline by 1.64% with only 9% additional FLOPs. This performance is even better than ResNet101 (78.03%). A large model like ResNet152 also achieves an accuracy gain of 0.62%. The most surprising result is that ResNet18 with a width multiplier of 0.5 obtains more than 8% accuracy gain and reaches a top-1 accuracy of 71.26%, which is even higher than the ResNet18 baseline. The proposed head is also effective for RegNet models, which are based on group convolution. Depth-wise separable convolution-based networks like MobileNet-v2 place most of the computational overhead on 1 × 1 convolutions; they do not lack local feature fusion like ResNet does and therefore benefit less from our head. Our head improves the accuracy of MobileNet-v2 by around 1%. We also discover that the stage-4 1 × 1 convolution in the head brings very limited improvement to MobileNet-v2, which is quite different from other networks.
Compared with MobileNet-v2, our head achieves obviously more improvement on ResNet models. Network bodies based on depth-wise separable convolution are very lightweight, but their feature extraction ability is relatively weak. A traditional 3 × 3 convolution-based network body has great potential in feature extraction; however, it is often limited by the head's lack of local feature fusion ability. Our head greatly improves the local feature fusion ability of the network and pushes these computationally efficient bodies further. This leads us to rethink CNN architecture design.

Dropout in Head
The weighted layers inside the head have more channels and a smaller spatial resolution. They contain more parameters with a relatively low parameter reuse rate, which may lead to overfitting. We therefore add dropout to the 1 × 1 convolution layers in the head, except the first one. The drop probability is set between 0 and 0.2, depending on the network body and the width of the head.
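A minimal way to add this to the head sketched earlier is shown below; the paper states only that dropout is applied to every 1 × 1 head layer except the first, so the exact placement (after ReLU) and the element-wise nn.Dropout choice are our assumptions:

```python
import torch.nn as nn

def conv1x1_bn_relu_drop(in_ch, out_ch, p=0.1):
    """Stage-5/stage-6 fusion layer with dropout on its output (p chosen in [0, 0.2])."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.Dropout(p),   # channel-wise nn.Dropout2d would be an alternative choice
    )
```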

Fewer Stage Networks
Radosavovic et al. [5] found that the top RegNet models at high FLOPs contain very few blocks (one or two) in stage-4, but that 3-stage models perform much worse than 4-stage models. With the introduction of our head, we evaluate 3-stage and even 2-stage networks again. For the 2-stage network, we reconstruct stage-3 with a single 1 × 1 convolution layer (followed by batch normalization and ReLU).
Table 5 shows our results on fewer-stage networks. With the introduction of our head, fewer-stage models perform notably better than their baselines, and are no longer far behind the 4-stage models.

CIFAR-100 Experiments
We conduct more experiments on the CIFAR-100 dataset [7], which consists of 50k training images and 10k testing images in 100 classes. The models are trained on the training set and evaluated on the test set.

Training Setup
We use the standard SGD optimizer with momentum set to 0.9, weight decay set to 5 × 10⁻⁴, and the initial learning rate set to 0.1. Models are trained for 200 epochs with a cosine decay schedule and a batch size of 128, without any other training tricks. We use the standard data augmentation of random crop with 4 pixels of padding and random horizontal flipping. We build the head by adding three 1 × 1 convolutions on stage-3 (8 × 8), stage-4 (4 × 4), and stage-5 (2 × 2). We do not insert dropout into the head of CIFAR models, as it leads to a validation accuracy drop on CIFAR-100.
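Under the assumptions of the head sketch given earlier, the CIFAR variant changes only the input side: the body output of a Basic Block CIFAR ResNet (64 channels at 8 × 8) feeds the first fusion stage, and the pooling path becomes 8 × 8 → 4 × 4 → 2 × 2 → 1 × 1. A purely illustrative instantiation (the widths below are examples, not the exact configurations of our main results):

```python
# Hypothetical CIFAR-100 usage of the LocalFusionHead sketch from the ImageNet section.
head = LocalFusionHead(in_channels=64, widths=(256, 512, 256), num_classes=100)
# feat = cifar_body(images)    # (N, 64, 8, 8) for a Basic Block CIFAR ResNet
# logits = head(feat)          # (N, 100)
```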

Main Results
Table 6 and Fig. 2 show our main results on CIFAR-100. Our proposed head improves the accuracy of various ResNet models by a large margin with only around 10% additional FLOPs and inference speed drop. ResNet20 with our head outperforms the ResNet20 baseline by 5.07%, approaching the ResNet110 baseline in accuracy with only 18% of its FLOPs and 33% of its parameters. For extremely compact models, ResNet20 with a width multiplier of 0.5 even approaches the original ResNet20 baseline in accuracy with the introduction of our head. This pattern is repeated at greater depth, where our head further improves ResNet110 by 4.74% and BL-ResNet164 by 2.89% in accuracy.

BB-ResNet vs. BL-ResNet
We obtain ResNet models of different sizes via depth scaling following [3], and Fig. 2a shows the performance of BB-ResNet and BL-ResNet with different numbers of blocks stacked. BL-ResNet always outperforms BB-ResNet under the same FLOPs constraint, except for the extreme case of the smallest Bottleneck-based model, BL-ResNet11. However, when we turn to GPU inference speed rather than FLOPs, things change. The Bottleneck is slower than the Basic Block due to its large memory access cost, so BL-ResNet models do not necessarily perform better under the same inference speed constraint. Figure 2b shows that BB-ResNet outperforms BL-ResNet at inference latencies below 8 milliseconds per batch. Only when the latency reaches 8.5 milliseconds per batch do ResNet74 and BL-ResNet47 converge to the same accuracy of 74.3%. This reveals that relatively shallow BB-ResNet models better balance accuracy and running speed, though they are not as accurate as BL-ResNet with the same FLOPs.

Head Effect
Our proposed head improves both BB-ResNet and BL-ResNet, but it has a greater effect on BB-ResNet, since BL-ResNet already performs some local feature fusion in the network body, whereas BB-ResNet has no 1 × 1 convolution for local feature fusion in the whole network. Furthermore, the input of the BB-ResNet classifier has too few channels, so the dimension increase in our head also helps, as discussed in Sec. 3.2.
As shown in Table 6, ResNet164 is 2.48% less accurate than BL-ResNet164. With the introduction of our head, the gap drops to 1.15%. Without our head, BB-ResNet and BL-ResNet reach the same accuracy at a latency of 8.5 milliseconds per batch. With our head, BB-ResNet is consistently more efficient than BL-ResNet until the latency reaches 12.2 milliseconds per batch, where ResNet110 and BL-ResNet65 converge to the same accuracy of 79.7%. After that, BL-ResNet takes over.
In conclusion, our head nicely solves BB-ResNet's lack of local feature fusion. BB-ResNet with our head is very competitive in accuracy compared to BL-ResNet, while being significantly faster.

Head Width
Figure 2 shows the accuracy of ResNet20, ResNet56, and ResNet110 with the head width varying from (128, 256, 128) to (1536, 3072, 1536), compared to the BB-ResNet baselines. When the head width is below (512, 1024, 512), increasing the head width significantly improves model accuracy with negligible additional FLOPs. Once the head width reaches (768, 1536, 768), extending the width of the head yields diminishing returns.
ResNet20 with the largest head width of (1536, 3072, 1536) even reaches an accuracy of 78.01% with a total of only nine Basic Blocks stacked. However, a wider head has the drawback of introducing a large number of parameters, so we limit the head width of the models in our main results to better trade off model accuracy against parameter efficiency.

Stages of Head
Applying 1 × 1 layers in the head does improve performance. However, are all the layers across stages 4 to 6 necessary and effective? We study the ImageNet results of performing local feature fusion via 1 × 1 convolution in different stages of our proposed head, based on ResNet18. We set the width of all the 1 × 1 convolution layers in the head to 2048 for convenience and insert dropout layers appropriately for each configuration. As Table 7 shows, performing feature fusion on smaller feature maps works better in a single-layer head, which contradicts many network architectures, including MobileNet-v2, which adds a wide 1 × 1 convolution layer on the 7 × 7 feature maps at the end of stage-4. However, as we add more feature fusion layers in the head, spatial resolution becomes more and more important. Moving our proposed head back by one stage introduces a 1% drop in accuracy (see the eighth and ninth rows in Table 7 for detail). Stacking three fully-connected layers does achieve a good result with minimal additional FLOPs; however, it is the least parameter efficient. Stacking three 1 × 1 convolution layers on the largest 7 × 7 feature maps fails to beat our proposed head under the same parameter quantity and with obviously larger FLOPs.
We also investigate the effect of performing local feature fusion in the network stem, by adding one 1 × 1 layer for feature fusion after each convolution in the stem, based on ResNet18. Since there is only one 7 × 7 convolution in the stem, we introduce another "split-7x7" variant [32] for better comparison; in total there are three 1 × 1 layers for the "split-7x7"-style ResNet and one 1 × 1 layer for the original ResNet. A full local-feature-fusion stem composed of only 1 × 1 convolution and pooling is also tested. As Table 8 shows, performing local feature fusion in the stem does improve performance a little; however, it introduces much more memory access cost due to the extra large feature maps, and is inefficient in speed.
Our proposed head on stage-4 (7 × 7), stage-5 (4 × 4), and stage-6 (2 × 2) smoothly down-samples the spatial resolution of the body output and makes full use of spatial information. It achieves a better balance among model accuracy, parameter efficiency, and computational overhead.

Different Units in Head
Our proposed head follows the "feature extraction unit → down-sampling pooling unit" formula; we use 1 × 1 convolution as the feature extraction unit and average pooling as the down-sampling unit. We study the ImageNet results of using different units in the head, based on ResNet18. The well-known SSD architecture [37] for object detection uses a combination of 3 × 3 and 1 × 1 convolutions after the backbone network to obtain smaller feature maps for detecting larger objects, so we consider the combination of 3 × 3 and 1 × 1 convolution as well. The feature extraction unit in stage-4 is fixed as a 1 × 1 convolution to adjust the number of channels.
Table 9 shows the results of adopting different units in the head. We found that average-

Fig. 2
CIFAR-100 accuracy of ResNet with or without our head. BB-ResNet represents the Basic Block-based ResNet, and BL-ResNet represents the Bottleneck-based ResNet. Models of different sizes are obtained via depth scaling following [3]. Since the Bottleneck contains more weighted layers and is slower, we set the upper bound of the number of stacked blocks to 54 for BL-ResNet and 81 for BB-ResNet. a Relationship between model accuracy and FLOPs. b Relationship between model accuracy and GPU latency

Table 1
ImageNet accuracy and inference speed of ResNet based on different residual blocks

Table 2
Changes in accuracy and inference speed when replacing the Basic Blocks of each ResNet stage with Bottlenecks. "stageX-BL" represents the models that replace all the blocks in stage-X from Basic Block to Bottleneck, and "BL" represents the models that replace all the blocks in the network with Bottlenecks. GPU inference speed is obtained on a single RTX2080ti GPU. We used PyTorch-AMP FP16 acceleration and a batch size of 64 when testing ImageNet models, and a batch size of 128 without AMP when testing CIFAR models. Bold values represent the relatively good results proposed by us

Table 3
"BB" represents the Basic Block, and "BL" represents the Bottleneck. The number of residual blocks in the four stages of each ResNet model is fixed to (3,4,6,3). The first row represents the original ResNet34, and the last row represents the original ResNet50. GPU inference speed is obtained on a single RTX2080ti GPU with PyTorch-AMP FP16 acceleration and a batch size of 64. Bold values represent the relatively good results proposed by us

Table 4
Main results on ImageNet classification. The heads of the baseline models are reproduced following their original structures, and we use the head-width tuple to represent our proposed head. Results marked with "*" correspond to a 300-epoch training schedule with strong tricks. Note that the head of the original MobileNet-v2 [4] and GENet [28] already contains a 1 × 1 convolution at stage-4. GPU inference speed is obtained on a single RTX2080ti GPU with PyTorch-AMP FP16 acceleration and a batch size of 64. Bold values represent the relatively good results proposed by us

Table 5
Performance of fewer stage networks with or without our head

Table 6
Main results on CIFAR-100 classification. BL-ResNet represents the Bottleneck-based models. We use the head-width tuple to represent our head. GPU inference speed is obtained on a single RTX2080ti GPU with a batch size of 128. Bold values represent the relatively good results proposed by us

Table 7
Performance of ResNet18 models with head spanning different stages

Table 8
Performance of ResNet18 models with and without local feature fusion in stem.