Efficient Re-parameterization Operations Search for Easy-to-Deploy Network Based on Directional Evolutionary Strategy

Traditional NAS methods improve performance at the expense of deployability, and structural re-parameterization (Rep) is expected to resolve this trade-off. However, most current Rep methods rely on prior knowledge to select the re-parameterization operations, so the performance of the architecture is bounded by the available operation types and by that prior knowledge. At the same time, some re-parameterization operations hinder the optimization of the network. To break these restrictions, this work designs an improved re-parameterization search space that includes more types of re-parameterization operations and can further enhance the performance of convolutional networks. An automatic re-parameterization enhancement strategy based on neural architecture search (NAS) is designed to explore this search space effectively and find a strong re-parameterization architecture. We also solve the optimization problem caused by using certain re-parameterization operations to enhance ResNet-style networks. In addition, we visualize the output features of the architecture to analyze why the re-parameterization architecture takes the form it does. On public datasets, our method achieves better results than related work. Under the same training conditions as ResNet, we improve the accuracy of ResNet-50 by 1.82% on ImageNet-1k.

Without artificial restrictions, NAS tends to retain multi-path structures and many "shortcut" operations. Such structures are not friendly to most terminal devices. Meanwhile, architectures searched in the DARTS [1] search space achieve better performance on benchmark datasets, but some convolution operations in that search space (such as 7×7 and 5×5 separable convolutions and the 7×1-1×7 sequential convolution) are not well optimized on terminal devices, so these architectures [2-4, 6, 23-25] deploy poorly on edge devices. In Fig. 1, we test the accuracy and inference time of networks on an NVIDIA embedded board. The networks found in traditional search spaces do not perform well when deployed to edge devices; in particular, their accuracy and inference speed on ImageNet-1K are inferior to those of manually designed networks. Therefore, improving the performance of an architecture without sacrificing its deployability on terminal devices is a challenging task.
The structural re-parameterization technology provides a new idea. Rep methods effectively improve the performance of traditional networks, but the specific operations have to be selected based on prior knowledge [26,27], such as the depth, the type of convolution operations and the number of convolution operations. This makes the re-parameterized network suboptimal. To further improve the performance of the model and find the globally optimal networks, we use a NAS method to search for a set of re-parameterization operations that can be fused into traditional networks. In this way, we obtain models that both perform well and are easy to deploy. In addition, the performance of re-parameterization models is closely related to the types of re-parameterization operations. Some previous works [28,29] search for the best combination of re-parameterization operations automatically, but the set of available operation types caps the achievable network performance. Therefore, model performance can be further improved by exploring more re-parameterization operations.

Fig. 1 The inference speed and accuracy of networks tested on NVIDIA AGX Xavier. On the general-purpose ImageNet-1K dataset, our networks achieve better accuracy and inference speed.
In addition to the re-parameterization operations involved in the above work, there are other convolution operations that can be fused into VGG-style networks after an appropriate transformation. Thus, an improved re-parameterization search space (IREPS) is designed to include more re-parameterization operations. A better set of re-parameterization operations can be searched from this larger space and fused into easily deployable networks without accuracy degradation.
For the larger re-parameterization search space, a directional evolutionary strategy (DES) is designed to explore an optimal architecture population. While training the SuperNet parameters, it learns the importance of blocks and candidate operations and uses them as indicators to generate different offspring architectures. The search strategy can therefore skip poor architectures directly, which makes the algorithm converge rapidly. Finally, to explain how the architecture forms and why performance improves, we visualize the architecture and the output features. In summary, the contributions are as follows:
1. An improved re-parameterization search space is designed, which contains more re-parameterization operations. Compared with other Rep search spaces, it can further improve the performance of traditional convolutional networks.
2. To explore the larger search space, DES is proposed to search for a set of re-parameterization operations. This search strategy makes a trade-off between the diversity of architectures and the efficiency of the search.
3. Extensive experiments on image classification and its downstream tasks demonstrate that our architecture achieves better results than other related work.
2 Related Work

Network Architecture Search
Neural architecture search (NAS) is a widely used technique that aims to search for feature extraction networks matched to given tasks. Evolutionary algorithm-based NAS [5,7,8,30,31] uses the principle of "survival of the fittest" to select architectures and relies on inheritance, mutation, crossover and random generation to obtain new offspring. The evolutionary algorithm is a well-established global optimization method with high robustness and wide applicability. However, evolutionary algorithm-based NAS converges slowly because offspring architectures are generated randomly. Gradient-based NAS [1,2,4,6,32-36] benefits from the introduction of a differentiable function, which relaxes the discrete search space into a continuous one so that it can be optimized by gradient-based algorithms. From the perspective of parameter optimization, it can be divided into two categories. One is bilevel optimization [1,6,33,35,36], which optimizes the architecture parameters under the condition that the weight parameters are optimal (Eq. 1), where α denotes the architecture, w_α denotes the network weights bound to the architecture α, and L_train and L_val denote the losses on the training and validation datasets. It first optimizes the network weights w, then finds the α that minimizes the validation loss L_val. The other is single-level optimization [2,34,37], which does not nest the optimization of w inside that of α. As Eq. 2 indicates, both w and α are optimized in one optimization process.
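As a sketch under the symbol definitions above, the two formulations can be written in the form commonly used for differentiable NAS. The exact typeset form of Eqs. 1 and 2 is an assumption here; only the symbols α, w_α, L_train and L_val come from the text.

```latex
\begin{align}
\min_{\alpha}\ & \mathcal{L}_{val}\bigl(w_{\alpha}^{*}, \alpha\bigr)
  \quad \text{s.t.} \quad
  w_{\alpha}^{*} = \arg\min_{w}\ \mathcal{L}_{train}(w, \alpha) \tag{1} \\
\min_{w,\,\alpha}\ & \mathcal{L}_{train}(w, \alpha) \tag{2}
\end{align}
```

In the bilevel form the inner problem must be (approximately) solved before each update of α, whereas the single-level form updates w and α with gradients of the same training loss.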
Although gradient-based NAS converges quickly, the Matthew effect makes the architectures lack diversity, so the resulting architecture is not globally optimal. In this work, the single-level optimization approach is used to optimize the weights of the SuperNet and learn the importance of its operations. Instead of sampling from the SuperNet randomly, DES assumes that the optimal combination of re-parameterization operations varies across training phases (epochs) and generates different architectures based on the currently optimal re-parameterization operations. Thus, DES can speed up the convergence of the search strategy and explore globally optimal architectures.

Structural Re-parameterization
The structural re-parameterization technology is an equivalent parameter conversion technology. In our work, it refers to equivalently converting a multi-branch architecture into a single-branch architecture. ACNet [38] proposes to fuse 1D asymmetric convolutions into a square convolution to enhance the feature representation capability of the square convolution. DBB [26] aims to enhance the representation of a single convolution by combining diverse branches and gives methods for fusing multiple convolution operations in various combinations. RepVGG [27] constructs a residual-like branch structure on the VGG network and fuses the trained residual-like structure into a 3×3 convolution by structural re-parameterization. Based on the re-parameterization technique, RepNAS [28] designs a re-parameterization search space in which all multi-branch structures can be transformed into single-branch structures. These methods rely on several re-parameterization techniques.
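To make "equivalent parameter conversion" concrete, the following minimal sketch folds an inference-mode BN layer into the preceding convolution, a standard transformation in this family of methods. It uses a single-channel 1-D convolution in pure Python for brevity; the helper names (`conv1d`, `batch_norm`, `fuse_conv_bn`) and all numeric values are illustrative assumptions, not taken from the paper's code.

```python
import math

def conv1d(x, w, b=0.0):
    """'Same' 1-D cross-correlation with zero padding (odd kernel length)."""
    pad = len(w) // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(w[k] * xp[i + k] for k in range(len(w))) + b
            for i in range(len(x))]

def batch_norm(z, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BN with fixed running statistics."""
    std = math.sqrt(var + eps)
    return [gamma * (v - mean) / std + beta for v in z]

def fuse_conv_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold the BN scale and shift into the convolution weight and bias."""
    scale = gamma / math.sqrt(var + eps)
    return [scale * wk for wk in w], (b - mean) * scale + beta

# Illustrative values only.
x = [1.0, -2.0, 0.5, 3.0, -1.0]
w, b = [0.2, -0.5, 0.1], 0.3
gamma, beta, mean, var = 1.5, -0.2, 0.4, 2.0

two_step = batch_norm(conv1d(x, w, b), gamma, beta, mean, var)
w_fused, b_fused = fuse_conv_bn(w, b, gamma, beta, mean, var)
one_step = conv1d(x, w_fused, b_fused)

# Largest difference between the Conv-BN pair and the single fused Conv.
max_err = max(abs(p - q) for p, q in zip(two_step, one_step))
```

Because both the convolution and inference-mode BN are affine, the fused layer reproduces the two-step output up to floating-point rounding, so `max_err` should be negligible.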

Proposed strategy and search space
Here, the improved re-parameterization search space is designed first. Secondly, the parameters of the SuperNet are optimized by a batch optimization method. Afterward, we introduce how the offspring architectures are generated. Finally, a re-parameterization verification method is implemented to speed up the verification process.

Improved re-parameterization search space
The convolution operations in traditional convolutional networks are called fixed operations. In this work, a fixed operation is a 3 × 3 convolution.
In the re-parameterization search space, all candidate operations can be fused into the fixed operation. Therefore, once the traditional network to be re-parameterized (ResNet, VGG, etc.) is determined, the parameters of the deployed architecture are also determined. In addition, when two different operations are re-parameterized, the center weights of the operations need to be aligned before the weight parameters are fused. Hence, the re-parameterization search space has the following characteristics:
1. In the structural re-parameterization search space, the number of parameters of the network depends only on the number of channels. Therefore, changing the number of operations in a block only affects the resource consumption of training, not the number of parameters or the inference speed of the deployed network.
2. In the re-parameterization search space, convolution operations with the same groups and channels but different kernel sizes can be fused into each other if the centers of the convolutions overlap exactly.
In ACNet [38], taking the 3×3 convolution as an example, the cruciform (cross-shaped) weights at the center of the kernel carry the most important feature information. Thus, better feature extraction can be achieved by enhancing the cruciform weights at the center of the convolution. In this work, we aim to include more operations in the re-parameterization search space, which can further enhance the cruciform weights at the center of the fixed operation.
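The center-alignment rule can be illustrated with a small pure-Python sketch: 1×3 and 3×1 kernels are zero-padded so that their centers coincide with the center of a 3×3 kernel, and the padded kernels are then summed. The merged kernel receives the extra weights exactly on the cruciform positions. The helper names and numeric values are hypothetical, and the sketch covers a single channel without bias.

```python
def conv2d_same(x, k):
    """Single-channel 'same' cross-correlation with zero padding (odd kernel sizes)."""
    kh, kw = len(k), len(k[0])
    ph, pw = kh // 2, kw // 2
    H, W = len(x), len(x[0])
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for a in range(kh):
                for b in range(kw):
                    r, c = i + a - ph, j + b - pw
                    if 0 <= r < H and 0 <= c < W:
                        s += k[a][b] * x[r][c]
            out[i][j] = s
    return out

def embed_centered(k, size=3):
    """Zero-pad a smaller kernel so its center lands on the size x size center."""
    kh, kw = len(k), len(k[0])
    top, left = (size - kh) // 2, (size - kw) // 2
    big = [[0.0] * size for _ in range(size)]
    for a in range(kh):
        for b in range(kw):
            big[top + a][left + b] = k[a][b]
    return big

def add_kernels(*kernels):
    """Element-wise sum of 3x3 kernels."""
    return [[sum(k[a][b] for k in kernels) for b in range(3)] for a in range(3)]

# Illustrative input and branch kernels.
x = [[1.0, 2.0, 0.0, -1.0],
     [0.5, -1.5, 2.0, 1.0],
     [3.0, 0.0, -2.0, 0.5],
     [1.0, 1.0, 1.0, -0.5]]
k33 = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.1], [0.2, 0.1, -0.3]]
k13 = [[0.4, -0.6, 0.2]]
k31 = [[0.3], [-0.5], [0.7]]

# Multi-branch output: run each branch separately and add the results.
branches = [conv2d_same(x, k) for k in (k33, k13, k31)]
multi = [[sum(br[i][j] for br in branches) for j in range(4)] for i in range(4)]

# Single-branch output: merge the kernels first, then convolve once.
merged = add_kernels(k33, embed_centered(k13), embed_centered(k31))
fused = conv2d_same(x, merged)

max_err = max(abs(multi[i][j] - fused[i][j]) for i in range(4) for j in range(4))
```

The corner weights of `merged` are identical to those of `k33`; only the center row and center column change, which matches the cruciform enhancement described above.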

Batch optimization of SuperNet parameters
In the search process, we use a binary encoding (i.e., 0, 1) as a mask over the SuperNet to obtain different subnets. A 1-element indicates that a branch participates in forward propagation and a 0-element indicates that it does not. To trade off efficiency and accuracy, we introduce batch optimization of the parameters (Eq. 3), where P and B denote the population size and the number of sampled subnets, M_j denotes a subnet sampled from the SuperNet S, and L_{M_j} is the loss of that subnet on the training dataset. Eq. 3 shows that the weight parameters of the SuperNet can be optimized by updating them with the gradients of a batch of subnets. Further, the update can be approximated by the average gradient of a subset of the individuals in the population.
In the search process, the number of branches in each sub-architecture is limited to C, and all sub-architectures share the weights of the SuperNet. The single-level method is used to optimize the parameters on the training dataset D_train. Following the formulations in Eqs. 2 and 3, the search process can be written as a constrained single-level problem, where θ and ω are the architecture parameters and the weight parameters of the network, respectively. We generate different sub-architectures to form a population under resource constraints, and optimize the weight and architecture parameters of the SuperNet by sampling architectures from the population.
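The batch update described above can be sketched on a toy SuperNet whose branches are scalar weights: each subnet is a binary mask, and the shared weights are updated with the average gradient of B masks sampled from the population. The squared-error loss, the function names and the values below are assumptions made for illustration; the paper's exact update rule (Eq. 3) may differ.

```python
import random

def subnet_grad(weights, mask, x, target):
    """Gradient of one masked subnet's squared error w.r.t. the shared weights."""
    y = sum(m * w * x for m, w in zip(mask, weights))
    return [2.0 * (y - target) * m * x for m in mask]

def batch_update(weights, population, batch_size, x, target, lr, rng):
    """Update shared weights with the average gradient of sampled subnets."""
    sampled = rng.sample(population, batch_size)
    grads = [subnet_grad(weights, m, x, target) for m in sampled]
    avg = [sum(g[i] for g in grads) / batch_size for i in range(len(weights))]
    return [w - lr * g for w, g in zip(weights, avg)]

rng = random.Random(0)
# Each mask keeps C = 2 of the 4 branches; branch 3 is never active here.
population = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 1, 0]]
weights = [0.5, 0.5, 0.5, 0.5]

new_weights = batch_update(weights, population, batch_size=2,
                           x=1.0, target=2.0, lr=0.1, rng=rng)
```

A branch that is masked out in every sampled subnet receives a zero gradient, so its shared weight is left unchanged by that step, while branches that appear in at least one sampled subnet are updated.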

Generation of the architecture
Evolutionary algorithm-based NAS converges slowly because offspring architectures are generated randomly. Therefore, we introduce a differentiable method to learn the importance of blocks and re-parameterization operations, and then use it to guide the generation of the sub-architectures. We use the Sigmoid function to quantify the importance β of re-parameterization blocks and the importance α of candidate operations. As shown in Fig. 2, each layer of the SuperNet is composed of a re-parameterization block and a fixed operation, and the block is composed of multiple candidate operations O_p(·). The output B_i(x) of the i-th layer is given by Eqs. 6 and 7, where f_o(x) and F(x) are the output features of the candidate operations and the fixed operation, respectively. From Eqs. 6 and 7, two conclusions follow: 1) From a local perspective, the more a re-parameterization operation enhances the fixed operation within a block, the larger its value of α. 2) The features of the re-parameterization operations are finally multiplied by the weight β of the current block, so from a global perspective, the more important the output features of the block are for the fixed operation, the larger the value of β. Therefore, both the global and local characteristics of the architecture are considered when generating offspring architectures.
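One form of the layer output that is consistent with the description above is sketched below. It is an assumption rather than the paper's exact Eqs. 6 and 7: the intermediate symbol R_i(x) for the block output is hypothetical, and σ denotes the Sigmoid function.

```latex
\begin{align}
R_i(x) &= \sum_{o \in O_p} \sigma(\alpha_{i,o})\, f_o(x), \\
B_i(x) &= F(x) + \sigma(\beta_i)\, R_i(x).
\end{align}
```

Under this form, each candidate feature f_o(x) reaches the layer output scaled by σ(α_{i,o}) σ(β_i), which matches the statement that the features of the re-parameterization operations are finally multiplied by the block weight β.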
The architectures in the population can be divided into three parts: 1) the architectures retained from the previous population (parent architectures), 2) the offspring architectures generated by crossover and mutation, and 3) the new offspring architectures sampled from the SuperNet. The third part is formed by architectures generated according to importance. Specifically, random noise σ_α and σ_β is added to the architecture parameters α and β to ensure the diversity of the architectures. The Sigmoid function is applied to α_0, α, β_0 and β, where α_0 and β_0 are the initialization weights of α and β. The range of the perturbation is defined from these activated values, with σ_α and σ_β drawn from a random distribution: we take the deviation of the maximum and minimum weight values from the baseline as the range of perturbation. When sampling offspring architectures from the SuperNet, the edges with higher weights are retained by globally sorting (β_i + σ_β) · (α_i + σ_α), where (·) denotes the multiplication of two matrices. This selection process (Eqs. 8-10) produces a binary code in which a 1-element means that the network uses the connection, with rank(·) denoting the global ranking. As shown in Fig. 2, a red line indicates a selected branch and a black line indicates an unselected one. Because appropriate perturbations are added, the candidate operations with high weights are retained, while the operations with low weights at the current stage are not completely ignored.

Algorithm 1 Directional evolution strategy for neural architecture search
Require: SuperNet S, population P = {P_1, ..., P_k}, evolution number E_evo, warm-up number E_warm, parameter optimization epochs E_p, architecture parameters α, β.
1: while i < E_warm do
2:   Warm up SuperNet S
3: end while
4: while j < E_evo do
5:   for mini-batch data X, target Y in dataset do
     ...
9:     Calculate the loss and compute the gradients according to Eq. 3.
10:    Update network parameters ω and architecture parameters α, β.
     ...
     Re-parameterize the architectures and obtain their performance.
14:  Select architectures according to performance and perform mutation and crossover. Meanwhile, sample sub-architectures from the SuperNet according to Eqs. 8-10.
15: end while
16: Output: architectures P = {P_1, P_2, ..., P_k}.
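The importance-guided sampling step can be sketched as follows: perturb the activated block and operation weights, rank the products globally, and keep the top entries as 1-elements of the binary code. The uniform noise and its half-width `noise` are placeholders (the source derives the perturbation range from the deviation of the maximum and minimum weights from the baseline), and the function name and arguments are hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def sample_offspring(alpha, beta, keep, noise=0.05, rng=None):
    """Return a binary code keeping the `keep` highest perturbed scores globally.

    alpha[i][o]: raw importance of candidate operation o in block i.
    beta[i]:     raw importance of block i.
    noise:       half-width of a uniform perturbation (an assumed placeholder).
    """
    rng = rng or random.Random()
    scored = []
    for i, ops in enumerate(alpha):
        b = sigmoid(beta[i]) + rng.uniform(-noise, noise)
        for o, a in enumerate(ops):
            s = b * (sigmoid(a) + rng.uniform(-noise, noise))
            scored.append((s, i, o))
    scored.sort(reverse=True)  # global ranking across all blocks
    code = [[0] * len(ops) for ops in alpha]
    for _, i, o in scored[:keep]:
        code[i][o] = 1
    return code

alpha = [[2.0, -2.0, 0.5], [1.0, 0.0, -1.0]]  # two blocks, three candidates each
beta = [1.0, -1.0]

deterministic = sample_offspring(alpha, beta, keep=3, noise=0.0)
perturbed = sample_offspring(alpha, beta, keep=3, noise=0.05,
                             rng=random.Random(1))
```

With the noise set to zero the ranking depends only on the learned importance; a non-zero perturbation lets lower-weighted operations occasionally enter the code, which is the diversity mechanism described above.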

Performance estimation of population
In evolutionary algorithm-based NAS, evaluating the architectures takes a lot of time. In addition, there is still a tiny deviation between the performance of an architecture before and after re-parameterization. Although this deviation can be ignored in practical applications, we are searching for the architecture that performs best after re-parameterization, so the deviation has a certain impact on the choice of architecture. Thus, in this work, the re-parameterization operations are fused into the fixed operation before the architecture performance is verified, as shown in Fig. 3. Note that the BN layer is also fused into the fixed convolution, so each multi-branch Conv-BN layer becomes a single Conv layer. Hence, using re-parameterized architectures for performance evaluation both speeds up the evaluation and eliminates the deviation.
We use β and α to indicate the importance of blocks and candidate operations, respectively. Therefore, the weights and biases of the candidate operations need to be scaled by α · β before the architecture is re-parameterized. Table 1 shows the time consumption of the verification process on CIFAR-10 for population sizes of 64, 128 and 256. Our approach increases the speed of architecture evaluation by around 60% compared with the naive evaluation method. In this experiment, considering factors such as architecture diversity, search time and computational resources, the population size is set to 128. The overall training procedure is summarized in Algorithm 1.
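The scaling step before fusion can be sketched at a single output position. The sketch assumes the layer output has the form F(x) + sigmoid(β) Σ sigmoid(α_o) f_o(x), and that candidate kernels have already been embedded in a 3×3 grid with aligned centers; both are assumptions for illustration, and all names and values are hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def response(kernel, bias, patch):
    """Output of one 3x3 branch at a single position (dot product plus bias)."""
    return sum(kernel[r][c] * patch[r][c]
               for r in range(3) for c in range(3)) + bias

def fuse_block(fixed, candidates, alpha, beta):
    """Fold candidates, scaled by sigmoid(alpha_o) * sigmoid(beta), into the fixed op.

    `fixed` and each candidate are (kernel, bias) pairs.
    """
    kernel = [row[:] for row in fixed[0]]
    bias = fixed[1]
    for (ck, cb), a in zip(candidates, alpha):
        s = sigmoid(a) * sigmoid(beta)
        for r in range(3):
            for c in range(3):
                kernel[r][c] += s * ck[r][c]
        bias += s * cb
    return kernel, bias

fixed = ([[0.1, 0.2, 0.1], [0.0, 0.5, 0.0], [-0.1, 0.2, 0.3]], 0.05)
candidates = [
    ([[0.0, 0.0, 0.0], [0.3, -0.2, 0.4], [0.0, 0.0, 0.0]], 0.1),   # embedded 1x3
    ([[0.0, 0.6, 0.0], [0.0, 0.1, 0.0], [0.0, -0.3, 0.0]], -0.2),  # embedded 3x1
]
alpha, beta = [0.8, -0.4], 1.2
patch = [[1.0, 2.0, 0.0], [-1.0, 0.5, 3.0], [2.0, -2.0, 1.0]]

# Multi-branch evaluation with explicit importance weights.
multi_branch = response(fixed[0], fixed[1], patch) + sigmoid(beta) * sum(
    sigmoid(a) * response(ck, cb, patch)
    for (ck, cb), a in zip(candidates, alpha))

# Single-branch evaluation after scaling and fusing.
fused_kernel, fused_bias = fuse_block(fixed, candidates, alpha, beta)
single_branch = response(fused_kernel, fused_bias, patch)
```

Scaling both the weights and the biases is what keeps the two evaluations equal; scaling only the kernels would leave a bias mismatch after fusion.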

Experiments
To verify that the improved re-parameterization search space is effective on different datasets, we search for a set of re-parameterization operations for ResNet on CIFAR-10 and ImageNet-1K. Because ResNet already has a residual structure, the residual connection is naturally removed from the search space. Our experiment is divided into two stages: the search stage and the retraining stage.

Search Architectures on CIFAR-10
We add a re-parameterization block similar to the residual structure to the 3×3 convolution in VGG-16 and ResNet-18. For a fair comparison, the network structure and data augmentation techniques follow ACNet [38] and ResNet [18]. We use the SGD optimizer with a learning rate of 0.1 to optimize the network parameters. To optimize the architecture parameters α and β, we use the Adam optimizer with a learning rate of 0.0001 and betas of (0.5, 0.999). We limit the number of branches to 2/3 of the total branch number. During training, we sample 5 architectures and use them to update the SuperNet. The probability of both mutation and crossover is 0.5.
We search for 500 epochs and then retrain the architectures on the CIFAR-10 dataset. Except for the learning rate and the drop-path probability, the retraining process is the same as in DARTS [1]. The learning rate and the drop-path probability are set to 0.05 and 0.08, respectively.
Table 2 shows our results. We achieve 1.02% higher accuracy than RepVGG [27] and 0.21% higher than RepNAS [28]. Our architecture also has a clear advantage at inference time: since the re-parameterized architecture retains only convolution and non-linear operations, the inference time of IrepResNet-18 and IrepVGG-16 is 3.93 ms and 1.76 ms per image, respectively, which is faster than architectures such as DARTS.
Table 1 Architecture evaluation time measured on an Nvidia A100 GPU. The time to evaluate an architecture with the re-parameterization technique consists of two parts: the time consumed by re-parameterizing the architecture and the time of the forward inference. The results are averaged over verifying 10 populations, with a batch size of 512 at full precision (fp32).

Experiments on ImageNet-1K
To examine the generalization ability, we evaluate on ImageNet-1K, which contains 1.3M training images and 50K validation images from 1000 classes. To save computational resources and speed up the search, we remove the 2 × 2 dilated convolution from the search space, based on the conclusion of ACNet [38].
We set B = 1 and E_warm = 5, and the batch size is 256. We use the Adam optimizer with a learning rate of 0.0001 and betas of (0.5, 0.999) to optimize α and β. We limit the number of branches to 1/2 of the total branch number. We search for 100 epochs and then fix the structure of the SuperNet to retrain the architectures for 120 epochs. To be fair, we use the same data augmentation techniques as ResNet [18].
We plot the structure of IrepResNet in Appendix A. The structure of IrepResNet-50 is split in the middle: the first eight layers retain all re-parameterization operations, and the last eight layers exclude all enhancement operations. To better explain this phenomenon, we visualize the output feature values of ResNet-50 and IrepResNet-50. As shown in Fig. 4, IrepResNet-50 discriminates targets more strongly than ResNet-50. For ResNet-50, the task of the first eight layers is mainly to separate the foreground from the background in the image, while the task of the last eight layers is mainly to further distinguish the subtle differences between foreground and background, so that the target in the image can be focused on accurately. This division of tasks is significant for the formation of the IrepResNet-50 structure. The first eight layers obtain less feature information because they have few channels. Thus, the re-parameterization operations are important for improving the current feature information, which gives them higher weights. The last eight layers can acquire more feature information because they have more channels, which allows them to take on the task of focusing on the target and refining the foreground and background in the picture. Compared with the first eight re-parameterization blocks, the feature information in the last eight re-parameterization blocks is less important for the last eight fixed operations, which gives these re-parameterization operations smaller weight values. Therefore, searching the architecture under resource constraints makes the algorithm prefer to retain the re-parameterization operations in the shallow layers. We also visualize the output features and architectures of IrepResNet-34 and IrepResNet-18, as shown in Appendix A. The re-parameterization operations preserved in IrepResNet-34 and IrepResNet-18 show the same trend.

Table 2 Comparison with state-of-the-art image classifiers on the CIFAR-10 dataset. We calculated the number of model parameters and measured the inference time on an Nvidia A100 GPU with a batch size of 1 at full precision (fp32).

Generalization performance on Downstream Task
We transfer our ImageNet-pretrained IrepResNet-50 and IrepResNet-18 models to the downstream task of object detection to validate the generalization of the models. Specifically, the pre-trained model is used as the backbone for the detection algorithms FPN [39] and CenterNet [40] on the COCO dataset. For the optimization of the detection models, we follow the optimization approach and hyperparameter settings of MMDetection [41]. FPN and CenterNet are fine-tuned on a single NVIDIA A100 GPU with batch sizes of 16 and 64, respectively. In addition, the backbone of the fine-tuned model can be re-parameterized to achieve faster forward inference. The results in Table 4 show that IrepResNet achieves better performance than FPN, CenterNet, and DyRep.

Search under different resource constraints
To explore how different resource constraints affect the performance of the architecture, we search IrepVGG-16 and IrepResNet-18 on the CIFAR-10 dataset. Specifically, the number of branches is set to different fractions (up to 1) of the total branch number. As shown in Table 5, the architecture achieves better performance when the resource constraint reaches 2/3. We found that the architecture performs worse than RepNAS [28] when the same number of branches as in RepNAS is retained in the improved re-parameterization search space (four branches for each block). Based on ACNet [38], the main reason is that dilated convolutions enhance the 3 × 3 convolution less than the 1 × 3 and 3 × 1 convolutions do. For Fig. 5, we subtract the architecture weights of the 1 × 2 dilated convolution from those of the 1 × 3 convolution, and those of the 2 × 1 dilated convolution from those of the 3 × 1 convolution, apply an equivalent transformation, and plot the differences as a heatmap. The larger the difference, the more important the asymmetric convolutions (3 × 1 and 1 × 3) in the same layer. This indicates that the feature enhancement effect of the asymmetric convolutions is stronger than that of the dilated convolutions in this experiment. Therefore, when the same number of branches as in RepNAS [28] is retained, some of the 1 × 3 and 3 × 1 convolutions may be replaced by dilated convolutions because of the Matthew effect of the gradient-based learning method, which leads to architectures with potentially weaker performance than RepNAS [28].

Conclusion
To further improve traditional convolutional networks, we designed a more comprehensive re-parameterization search space and searched it with a directional evolutionary strategy to improve the performance of ResNet. A re-parameterization block similar to the residual connection is added to each 3 × 3 convolution in the ResNet model. An optimal architecture population is then found by exploring the derivative architectures of the re-parameterization architecture that is optimal at the current stage. Extensive experiments demonstrate that the proposed improved re-parameterization search space can further improve the performance of models and that the resulting models perform well in downstream tasks. Moreover, we explain the reasons for the formation of the architecture and analyze the enhancement effect of different re-parameterization operations. It is worth mentioning that the improved re-parameterization search space proposed in this paper can also serve as a bridge between coarse-grained search and fine-grained search. That is, the re-parameterization model obtained by coarse-grained search (at the level of architecture operations) can be divided into 1 × 1 convolutions and 2 × 2, 2 × 1 and 1 × 2 dilated convolutions, and the model can then be transformed into a network that is friendly to channel pruning, which can further reduce the FLOPs and inference time of the model.

Fig. 2 (a): The SuperNet. After training, each branch and block of the SuperNet is given a different weight. (b): The offspring architectures are generated from the SuperNet according to Eqs. 8-10. (c): We use a binary code to represent the offspring architectures, and the binary code (1) represents the architecture in (b).

Fig. 3 Using the re-parameterized structure to obtain the accuracy. Each sub-architecture in the population can be re-parameterized into the rightmost structure.

Fig. 4 We visualize the output feature values of the convolutions in ResNet-50 and IrepResNet-50 to better interpret the structure of our network. We consider the structure Conv1×1-Conv3×3-Conv1×1 as one layer. The first and second columns are the original image and the heatmap of the last layer output. The other columns are the heatmaps of the feature outputs of the 4th to the 11th layers. IrepResNet-50 has significantly better feature focus than ResNet-50.

Fig. A2 Architectures of IrepResNet-50, IrepResNet-18 and IrepResNet-34. Each row represents the enhancement of a 3×3 convolution. From left to right, the rows show the re-parameterization structure from the first layer to the last layer.

Fig. A3 We visualize the output feature values of the convolutions in IrepResNet-18 and IrepResNet-34 to better interpret the structure of our architecture. We consider the structure Conv3×3-Conv3×3 as one layer.
The re-parameterization techniques can be described as: 1) Conv-BN to Conv, 2) a Conv for branch addition, 3) sequential Conv structure to Conv, 4) a Conv for depth concatenation, 5) K×K average pooling to K×K Conv, 6) a Conv for multi-scale Convs. The above works enhance the feature extraction ability of convolution through re-parameterization, but the types of re-parameterization operations they use are limited. In this work, we aim to build more types of re-parameterization operations.
Algorithm 1 Directional evolution strategy for neural architecture search. Require: SuperNet S, population P = {P_1, ..., P_k}, evolution number E_evo, warm-up number E_warm, parameter optimization epochs E_p, architecture parameters α, β. 1: while i < E_warm do

Table 3 Results of our models on the ImageNet-1K dataset compared with other NAS methods and models. All experiments on ImageNet-1K were performed on an Nvidia A100 GPU. We calculated the number of model parameters and measured the inference time with a batch size of 1 at full precision (fp32).

Table 4 Results on object detection.

Table 5 Performance of the architecture on the CIFAR-10 dataset under different resource constraints.