Layered feature representation for differentiable architecture search

The differentiable architecture search (DARTS) approach has made great progress in reducing the computational cost of automatically designing neural architectures. DARTS tries to discover an optimal architecture module, called a cell, from a predefined super network containing all possible network architectures. A target network is then constructed by stacking this cell multiple times and connecting the copies end to end. However, this repeated depth-wise design pattern fails to sufficiently extract the layered features distributed in images and other media data, leading to poor network performance and generality. To address this problem, we propose an effective approach called Layered Feature Representation for Differentiable Architecture Search (LFR-DARTS). Specifically, we iteratively search for multiple cell architectures from the shallow to the deep layers of the super network. In each iteration, we optimize the architecture of a cell by gradient descent and prune weak connections from this cell. Meanwhile, the super network is deepened by increasing the number of copies of this cell, creating an adaptive network context in which to search for a depth-adaptive cell in the next iteration. Thus, LFR-DARTS obtains the cell architecture at a specific network depth, which embeds the ability of layered feature representation into each cell so as to sufficiently extract the layered features of data. Extensive experiments show that our algorithm solves the existing problem and achieves competitive performance on CIFAR10 (2.45% error rate), fashionMNIST (3.70%) and ImageNet (25.5%) at low search cost.


Introduction
Over the last few years, deep neural networks (DNNs) have demonstrated powerful capabilities in feature extraction (Ravi and Zimmermann 2001; Guyon et al. 2008; Cai and Zhu 2018; Nixon and Aguado 2019) and data mining (Bramer 2007; Zhu 2009; Cai et al. 2020; Verma and Singh 2021). Thus, DNNs are applied to a large variety of challenging tasks, such as image recognition (Hu et al. 2018; Kaiming et al. 2016), speech recognition (Geoffrey et al. 2012; Alex et al. 2013), machine translation (Sutskever et al. 2014; Wu et al. 2016), and other complex tasks (Sunil et al. 2019; Lotfollahi et al. 2020; Heydarpour et al. 2016; Bobadilla et al. 2021). However, designing an advanced neural network typically requires substantial effort from human experts. To eliminate this handcrafted process, neural architecture search (NAS) (Zoph and Le 2016; Zoph et al. 2018; Real et al. 2019) has been proposed to automatically search for a suitable neural network in a predefined search space. Its excellent performance has increasingly attracted researchers' attention.
Most NAS approaches apply reinforcement learning (RL) (Zoph and Le 2016; Zoph et al. 2018; Irwan et al. 2017) or evolutionary algorithms (EA) (Real et al. 2017; Liu et al. 2017; Real et al. 2019) to perform architecture search. Both search procedures require sampling and evaluating numerous architectures from a discrete search space to obtain the optimal one, which incurs prohibitive computational overhead. For example, NASNet (RL-based) trains and evaluates more than 20,000 neural networks across 500 GPUs over 4 days. AmoebaNet (Real et al. 2019) (EA-based) even takes 3150 GPU-days to discover an optimal neural architecture.
To eliminate this high computational overhead, (Liu et al. 2018b) recently proposed differentiable architecture search (aka DARTS), which relaxes the discrete search space to be continuous and optimizes a common cell architecture in a super network (also called the search network in the following) by gradient descent. The identical cells are then stacked multiple times and connected end to end to construct a target network for a specific task. This kind of NAS approach indeed reduces computational costs through the differentiable search strategy. However, the resulting target network shows poor performance on test datasets, especially when transferred to a large-scale dataset, since a repeated and simple depth-wise network structure is hard pressed to sufficiently extract the layered features distributed in media data. For image data, the layered features express semantic information at different granularities, and this information generally needs to be handled by convolutional kernels with different configurations. The simple neural architecture produced by DARTS clearly cannot fully extract and utilize these useful features. Therefore, how to search for cell architectures with layered feature representation for a target network becomes our research question.
To address the above problem, we propose an effective approach called Layered Feature Representation for Differentiable Architecture Search (LFR-DARTS). Specifically, we initialize a search network composed of multiple cells containing all candidate operations and then iteratively search for the architecture of each cell from the shallow to the deep layers of the search network. In each iteration, we first optimize the architecture of a specified cell by gradient descent and gradually prune weak connections from this cell. To effectively learn the importance of candidate operations and highlight the optimal ones during this process, we design a new functional network layer called Normalization-Affine and introduce an entropy constraint on the operations being optimized. Once the optimal architecture of a cell is obtained, we deepen the search network by increasing the number of copies of this cell to N (a configurable hyperparameter) at its original location while keeping the other cells unchanged, so as to create an adaptive network context in which to search for a depth-adaptive cell in the next iteration. Therefore, LFR-DARTS allows each cell to be searched at a specific, adaptive network depth, which is conducive to embedding the ability of layered feature representation into each cell so as to sufficiently extract data features (Fig. 1).
In terms of search efficiency, our approach requires less search time than DARTS since we constantly prune weak operations from the search network, progressively accelerating its forward and backward propagation. Moreover, our optimization of the cell architecture is simpler yet more efficient than that of DARTS, as demonstrated by the diagnostic experiments in Sect. 4.3. We validate LFR-DARTS on the image classification tasks of CIFAR10, fashionMNIST and ImageNet. It takes only 0.45 GPU days (NVIDIA GTX1080Ti) to obtain an optimal neural architecture on the training set of CIFAR10. Our neural network achieves state-of-the-art performance on the validation set of CIFAR10 (i.e., a 2.65% test error rate with 2.7M parameters and a 2.45% test error rate with 4.4M parameters). We then transfer the neural architecture to the fashionMNIST and ImageNet datasets. Under the same circumstances, our network achieves a 3.70% test error rate on fashionMNIST (with 2.5M parameters) and 74.5% top-1 accuracy on ImageNet (with only 4.9M parameters).
In summary, we make the following contributions in this work:

1. We propose a layered feature representation approach for differentiable architecture search to solve the problem of insufficient layered feature extraction in DARTS. We design a hierarchical search scheme that searches for a depth-adaptive cell architecture in each search iteration. At the end of each iteration, we dynamically increase the number of copies of the currently obtained cell to N at its original depth so as to deepen the search network. Compared with other differentiable search approaches, our hierarchical and dynamic search scheme allows the discovered network to sufficiently extract feature information of different granularities and levels and to integrate it for decision making.
2. A new functional network layer (called Normalization-Affine) and an entropy constraint are developed to highlight important operations among the candidates while suppressing weak ones, providing higher reliability for optimal architecture selection.
3. Extensive experiments show the advantages of our method in neural architecture search. Compared to other DARTS approaches, our discovered cells are able to represent the different levels of feature information hidden in data. As a result, our algorithm achieves competitive or even better network performance and generalization on several datasets.

Related work
In recent years, NAS has become a research hotspot in artificial intelligence, and many search algorithms have been proposed to explore neural networks. According to the strategy used to explore the search space, existing NAS approaches can be roughly divided into three categories (Thomas et al. 2018): reinforcement learning (RL)-based approaches, evolutionary algorithm (EA)-based approaches, and gradient-based approaches.
The early approaches (Zoph and Le 2016; Zoph et al. 2018; Bowen et al. 2016; Cai et al. 2018; Liu et al. 2018a) use RL to optimize the search policy for discovering optimal architectures. NASNet trains a recurrent neural network as a controller to sequentially decide the types and parameters of neural networks. ENAS (Hieu et al. 2018) reduces the computational burden of NASNet by sharing the weights of common operations among child networks. The EA-based methods apply evolutionary algorithms to evolve and optimize a population of network structures (Real et al. 2017, 2019; Wang et al. 2020; Ma et al. 2020). AmoebaNet (Real et al. 2019) encodes each neural architecture as a variable-length string. The strings mutate and recombine to produce a new population of networks; the high-performance networks are retained and generate the next promising generation.
However, both RL-based and EA-based approaches require excessive computational overhead despite achieving advanced performance. To address this issue, gradient-based approaches (Liu et al. 2018b, a, c; Xuanyi and Yang 2019) have been proposed to accelerate architecture search. Typically, DARTS relaxes the discrete search space to be continuous and uses gradient descent to jointly optimize the neural architecture and the network weights. SNAS (Liu et al. 2018c) proposes to constrain the architecture parameters to be one-hot to tackle the inconsistency in optimization objectives between the search and evaluation scenarios. GDAS (Xuanyi and Yang 2019) develops a differentiable sampler over the search space to avoid simultaneously training all the neural architectures in the space. DARTS+ (Liang et al. 2019), RobustDARTS (Arber et al. 2020) and PDARTS (Liu et al. 2018c) employ early stopping to restrict the excessive number of "skip" operations. FairDARTS (Chu et al. 2020) proposes a collaborative competition strategy to address the unfair advantage in exclusive competition. NASSA (Hao and Zhu 2021) designs a new importance metric for candidate operations that enables more reliable architecture selection. Although the gradient-based approaches show high search efficiency, their network structures lack the ability of layered feature representation.

Method
In this section, we present our proposed algorithm Layered Feature Representation for Differentiable Architecture Search (LFR-DARTS) in detail. We first introduce a classical differentiable NAS algorithm DARTS in Sect. 3.1, which is a basis of our LFR-DARTS. Then, we describe the concrete search procedure of our algorithm in Sect. 3.2. Finally, in Sect. 3.3, we introduce a minimum entropy constraint and formulate the gradient optimization for the search network.

Preliminary: DARTS
In DARTS, the goal of architecture search is to discover an optimal cell with the most important operations from a search network. The search network consists of L identical cells with the given candidate operations. These cells are connected to each other in order, and each cell is considered as a directed acyclic graph (DAG) with B nodes $\{x_0, x_1, \ldots, x_{B-1}\}$, where $x_0$ and $x_1$ are the two input nodes of the cell, $x_{B-1}$ is the output node, and the others are intermediate nodes. The nodes are connected to their predecessors by multiple kinds of operations (e.g., convolution, pooling). These operations share an operation space $O$ (Table 1), in which each operation is represented as $o(\cdot)$. The feature transformation $f(\cdot)$ from node $i$ to a subsequent node $j$ can be represented as the weighted sum of these operations:

$$f_{i,j}(x_i) = \sum_{o \in O} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in O} \exp(\alpha_{o'}^{(i,j)})}\, o(x_i), \tag{1}$$

where $x_i$ denotes the feature maps of node $i$, and $\alpha$ is the architecture parameter used to weight its corresponding operation.
An intermediate node $x_j$ is computed from all of its predecessors:

$$x_j = \sum_{i<j} f_{i,j}(x_i). \tag{2}$$

The output $x_{B-1}$ of a cell is calculated by concatenating the intermediate nodes along the channel dimension:

$$x_{B-1} = \mathrm{concat}(x_2, x_3, \ldots, x_{B-2}). \tag{3}$$

The output of one cell becomes the input of the next cell. The cell is a special information-processing or feature-extraction block; thus, the internal architecture of the cell (including the operation types and the connections between nodes) is critical to the performance of a neural network.
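The continuous relaxation above can be sketched in a few lines. The following is a minimal NumPy sketch of one DARTS-style edge: the toy operations and vector input are illustrative stand-ins, not the real operation space of Table 1.

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over the architecture parameters."""
    e = np.exp(a - a.max())
    return e / e.sum()

def mixed_edge(x, alphas, ops):
    """DARTS-style edge: softmax-weighted sum of candidate op outputs."""
    w = softmax(alphas)
    return sum(wk * op(x) for wk, op in zip(w, ops))

# toy candidate operations acting on a feature vector
ops = [lambda x: x,                 # identity / skip connection
       lambda x: np.maximum(x, 0),  # relu-like transform
       lambda x: np.zeros_like(x)]  # "zero" (no connection)

x = np.array([-1.0, 2.0])
alphas = np.array([0.0, 0.0, 0.0])  # equal alphas -> uniform mixture
y = mixed_edge(x, alphas, ops)      # -> average of the three op outputs
```

With equal architecture parameters the edge simply averages the candidate outputs; training shifts the softmax mass toward the most useful operation.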

The procedure of layered architecture search
A convolutional neural network (CNN) has a hierarchical structure so as to extract the layered visual features of images. As Simonyan et al. (2013) and Zeiler and Fergus (2014) describe, discriminative information is hidden in the feature maps of different layers, and each layer is characterized by the specific features it represents. Many excellent network structures (Szegedy et al. 2015; Kaiming et al. 2016; Hu et al. 2018) consistently obey this rule. But differentiable NAS algorithms (Liu et al. 2018b) search only a single cell architecture (a normal cell and a reduction cell) in a predefined search space, and then construct a target neural network from the repeated cells. This contradicts common sense, and there is no guarantee that a neural network with such a repetitive, oversimplified structure is capable of sufficiently extracting layered features. It causes poor performance, especially when transferring the cell architecture to a large-scale dataset.
Following these characteristics of neural networks, we propose a new differentiable NAS algorithm called Layered Feature Representation for Differentiable Architecture Search (LFR-DARTS). Firstly, we specify the number of target cells to be searched and initialize a search network with a few identical cells that share the same internal structure and candidate operations. These cells are connected in order, which naturally places each cell at a different depth of the search network. Then, we iteratively search for multiple cells with different architectures from the shallow to the deep layers. In each iteration, we first optimize the architecture of a cell by gradient descent and gradually prune weak connections from this cell. Once a cell's optimal architecture is discovered, we fix it in the search network and then search for a deeper-adaptive cell in the next iteration.
To embed the capability of layered feature representation into the cells, we dynamically increase the depth of the search network during the search process, rather than keeping it static as in DARTS. Concretely, when we discover the optimal architecture of a cell (if it is a normal cell), we increase its number to N copies at its original depth in the search network while keeping the other cells unchanged. In this way, our gradually growing search network creates an adaptive network context for searching for cells adapted to different network depths.
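The deepening step can be sketched as follows. This is a simplified sketch that treats cells as opaque objects; the function name `deepen` and the cell labels are illustrative, not the paper's implementation.

```python
def deepen(network_cells, finalized_idx, n_copies):
    """Replace the cell at finalized_idx with n_copies of its (now fixed)
    architecture, keeping all other cells unchanged."""
    cell = network_cells[finalized_idx]
    return (network_cells[:finalized_idx]
            + [cell] * n_copies
            + network_cells[finalized_idx + 1:])

# toy search network: normal and reduction cells in order
net = ["norm1", "reduce1", "norm2", "reduce2", "norm3"]
net = deepen(net, 0, 3)  # first normal cell finalized, copied 3 times
# net -> ["norm1", "norm1", "norm1", "reduce1", "norm2", "reduce2", "norm3"]
```

The next iteration then searches the following cell inside this deeper context, so each later cell sees an activation distribution closer to its depth in the final target network.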
However, we find that several problems arise when applying the architecture optimization strategy of DARTS to our search process. First, the optimization strategy of DARTS is designed for searching a single cell, not multiple cells with different hierarchical features: the architecture parameters α in DARTS are shared between cells, so optimization produces only one common cell. Second, the search procedure is complicated, as DARTS needs to alternately optimize the architecture parameters α and the network weights ω by gradient descent, with α trained on the validation set and ω on the training set, which greatly increases search time. To solve these problems, we design a new functional layer called Normalization-Affine (NA), which immediately follows each candidate operation and provides a selection indicator for optimal operations.
For any candidate operation, our NA functional layer first normalizes the output of this operation and then reweights the normalized result with a trainable parameter to learn its importance. We formulate the NA layer for the k-th operation in a set of candidate operations as

$$\mathrm{NA}_k(x_{in}) = \phi_k \cdot \mathrm{norm}(x_{in}), \tag{4}$$

$$\mathrm{norm}(x_{in}) = \frac{x_{in} - \mu}{\sqrt{\sigma^2 + \epsilon}}, \tag{5}$$

where the trainable weight parameter $\phi_k$ is referred to as an affine parameter and is used to weight each operation, and $\epsilon$ is a very small value close to zero.
$x_{in} = \{x_{in}^1, x_{in}^2, \ldots, x_{in}^m\}$ is the input tensor of the NA layer, i.e., the output tensor of the k-th operation, and it contains m feature maps. $\mu = \{\mu_1, \mu_2, \ldots, \mu_m\}$ and $\sigma = \{\sigma_1, \sigma_2, \ldots, \sigma_m\}$ are the mean vector and the standard deviation vector of the mini-batch $x_{in}$; each of their m elements corresponds to one feature map of $x_{in}$. The normalization function $\mathrm{norm}(\cdot)$ partially comes from Batch Normalization (Ioffe and Szegedy 2015), which is one of the most common and useful normalization approaches in CNN models.
Combining Eqs. 1, 4 and 5, we obtain the information conversion from node $i$ to node $j$, shown as Eq. 6:

$$f_{i,j}(x_i) = \sum_{k=1}^{|O|} \phi_k \cdot \frac{o_k(x_i) - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}}, \tag{6}$$

where $o_k(\cdot)$ is the k-th operation in the set of candidate operations, $x_i$ is its input, and $\mu_k$, $\sigma_k$ denote the mini-batch mean and standard deviation vectors of the output of the k-th operation. Each NA layer, corresponding to one operation, contains a learnable affine parameter $\phi$, which is trained and updated together with the weight parameters $\omega$ by gradient descent. Since different cells are located at different depths of the network, their affine parameters are trained to learn layered neural architectures. In addition, we optimize the affine parameters and the weight parameters in the same gradient descent step rather than by alternating optimization, which saves half of the search time compared to DARTS.
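The Normalization-Affine computation can be sketched as below. This is a minimal NumPy sketch, assuming 4-D activations of shape (batch, channels, H, W) and a scalar affine parameter per operation; the variable names and the toy input are illustrative.

```python
import numpy as np

def na_layer(x, phi, eps=1e-5):
    """Normalization-Affine: standardize each feature map of an
    operation's output over the mini-batch, then reweight the
    normalized result by the trainable scalar phi."""
    mu = x.mean(axis=(0, 2, 3), keepdims=True)    # per-feature-map mean
    sigma = x.std(axis=(0, 2, 3), keepdims=True)  # per-feature-map std
    return phi * (x - mu) / np.sqrt(sigma ** 2 + eps)

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.standard_normal((8, 4, 6, 6))  # toy operation output
out = na_layer(x, phi=0.5)
# after normalization the output is zero-mean and its scale is set by phi
```

Because every candidate operation's output is standardized to the same scale before reweighting, the magnitudes of the affine parameters become directly comparable across operations.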
Our dynamic search approach gradually prunes weak operations from the search network based on the affine parameters. The importance score S of an operation between any pair of nodes is defined as

$$S_k = \frac{|\phi_k|}{\sum_{j=1}^{|O|} |\phi_j|}, \tag{7}$$

where $\phi_k$ denotes the affine parameter corresponding to the k-th operation in the operation space. The larger $S_k$ is, the more likely the corresponding candidate operation is to be retained during the search process. One might doubt whether the normalization in the NA layer is really necessary. We found experimentally that directly using the affine parameter to weight an operation without prior normalization does not achieve an ideal result. The reason is that the output distributions of different operations can vary widely, which makes it hard to identify the importance of operations from the affine parameters $\phi$ alone. Moreover, for any operation, its weight parameters $\omega$ are optimized and updated simultaneously with the corresponding $\phi$; optimizing the two kinds of parameters together can make them vary synchronously, so that increasing one and decreasing the other yields the same result. Therefore, normalization before reweighting is necessary: it makes the outputs of different operations uniform so that the affine parameters genuinely represent the importance of the operations.
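Scoring and pruning can be sketched as follows. This sketch assumes the importance score is the normalized magnitude of the affine parameters (consistent with the sign-dependent gradient in Sect. 3.3); the function names and toy operation labels are illustrative.

```python
import numpy as np

def importance(phis):
    """Importance score of each candidate operation on an edge:
    normalized magnitude of its affine parameter."""
    a = np.abs(np.asarray(phis, dtype=float))
    return a / a.sum()

def prune_weakest(ops, phis, keep):
    """Keep only the `keep` operations with the highest scores,
    preserving their original order on the edge."""
    s = importance(phis)
    kept = sorted(np.argsort(-s)[:keep])
    return [ops[i] for i in kept], [phis[i] for i in kept]

ops = ["op_a", "op_b", "op_c", "op_d"]
phis = [0.1, -0.9, 0.5, 0.05]
kept_ops, kept_phis = prune_weakest(ops, phis, keep=2)
# kept_ops -> ["op_b", "op_c"]  (largest |phi| survive)
```

Note that the sign of the affine parameter does not matter for retention: an operation with a large negative weight still carries strong information, so only the magnitude is scored.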

Network optimization with entropy constraint
During the search process, we optimize the affine parameters ϕ together with the weights ω by gradient descent. We then try to pick out the operation with the highest importance score among the candidates. However, the importance scores of the operations between a pair of nodes can be very close to each other, which makes it challenging to select an optimal operation among them. We therefore add an entropy constraint over these candidate operations to concentrate the high scores on one or a few operations, so that the operations with high scores can be identified and selected more easily. To this end, we redesign the loss function as

$$L = L_{CE} + \lambda \sum_{p=1}^{B} H_p, \qquad H_p = -\sum_{k=1}^{|O|} S_k^{(p)} \log S_k^{(p)}, \tag{9}$$

where $L_{CE}$ is a general cross-entropy loss function and $\sum_{p=1}^{B} H_p$ denotes the summation of the entropies over all sets of candidate operations in the cell currently being searched, with $H_p$ the entropy of the p-th set of candidate operations. B is the number of nodes in a cell and $|O|$ is the size of the operation space. λ is a scaling factor that controls the rate of convergence.
Minimizing the loss function in Eq. 9 forces the entropy to decrease, so that the importance score distribution S gradually tends toward a single peak. The larger the scaling factor λ is, the stronger the constraint, and thus the more pronounced the differences between importance scores become after a fixed number of training steps. We verify the effectiveness of the entropy constraint in Sect. 4.3.2.
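The entropy term of Eq. 9 is cheap to compute. The following is a minimal NumPy sketch assuming the normalized-magnitude score definition of Eq. 7; `edge_entropy` and `total_loss` are illustrative names, not the paper's code.

```python
import numpy as np

def edge_entropy(phis, eps=1e-12):
    """Entropy of the importance-score distribution on one edge."""
    a = np.abs(np.asarray(phis, dtype=float))
    s = a / a.sum()
    return float(-(s * np.log(s + eps)).sum())

def total_loss(ce_loss, phi_sets, lam=5e-3):
    """Cross-entropy loss plus the entropy constraint over all
    sets of candidate operations in the cell being searched."""
    return ce_loss + lam * sum(edge_entropy(p) for p in phi_sets)

uniform = edge_entropy(np.ones(8))                 # maximal: close to log 8
peaked = edge_entropy([10.0] + [0.1] * 7)          # mass on one operation
```

A uniform score distribution attains the maximum entropy log |O|, so driving the sum of entropies down is exactly what pushes the scores toward a single peak.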
With the minimum entropy constraint, we formulate the optimization and gradient computation for the affine parameters. According to Eqs. 6 and 9, the gradient w.r.t. an affine parameter $\phi_k$ can be computed as

$$\frac{\partial L}{\partial \phi_k} = \frac{\partial L_{CE}}{\partial \phi_k} - \lambda \sum_{j=1}^{|O|} (1 + \log S_j)\, \frac{\partial S_j}{\partial \phi_k}, \tag{10}$$

where $\frac{\partial S_j}{\partial \phi_k}$ depends on the sign of $\phi_k$:

$$\frac{\partial S_j}{\partial \phi_k} = \frac{\delta_{jk} - S_j}{\sum_{i=1}^{|O|} |\phi_i|} \quad \text{if } \phi_k > 0, \tag{11}$$

$$\frac{\partial S_j}{\partial \phi_k} = -\frac{\delta_{jk} - S_j}{\sum_{i=1}^{|O|} |\phi_i|} \quad \text{if } \phi_k < 0, \tag{12}$$

where $\delta_{jk} = 1$ if $j = k$ and $\delta_{jk} = 0$ otherwise. From Eqs. 11 and 12, we can observe that the entropy constraint also delivers interactive information between different affine parameters, which promotes competition among the operations. Moreover, training our search network incurs no extra computational burden beyond that of a common convolutional neural network. The pseudocode of our proposed algorithm LFR-DARTS is presented in Algorithm 1.
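The sign-dependent Jacobian of the score with respect to the affine parameters can be checked numerically. This sketch assumes the normalized-magnitude score of Eq. 7 and compares the analytic Jacobian against central finite differences; it is a sanity check, not the paper's implementation.

```python
import numpy as np

def scores(phi):
    """Importance scores: normalized magnitudes of the affine parameters."""
    a = np.abs(phi)
    return a / a.sum()

def ds_dphi(phi):
    """Analytic Jacobian: dS_j/dphi_k = sign(phi_k)(delta_jk - S_j)/sum_i|phi_i|."""
    s = scores(phi)
    t = np.abs(phi).sum()
    return (np.eye(len(phi)) - s[:, None]) * np.sign(phi)[None, :] / t

def ds_dphi_numeric(phi, h=1e-6):
    """Central finite-difference Jacobian, for comparison."""
    n = len(phi)
    jac = np.zeros((n, n))
    for k in range(n):
        e = np.zeros_like(phi)
        e[k] = h
        jac[:, k] = (scores(phi + e) - scores(phi - e)) / (2 * h)
    return jac

phi = np.array([0.7, -0.3, 0.5, 0.2])
analytic = ds_dphi(phi)
numeric = ds_dphi_numeric(phi)
```

The two Jacobians agree to finite-difference precision away from $\phi_k = 0$, where the absolute value is non-differentiable.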

Experiments
In this section, we compare the performance of our algorithm LFR-DARTS with other NAS approaches and human-designed networks on several popular image classification datasets, including CIFAR10, fashionMNIST and ImageNet (Jia et al. 2009). Following DARTS (Liu et al. 2018b), we conduct our experiment in two steps: (1) architecture search: searching for the optimal cells on the training set of CIFAR10; (2) architecture evaluation: constructing an evaluation network from the obtained cells and testing its performance on the test sets of CIFAR10, fashionMNIST and ImageNet.

Architecture search and result
The initial search network G consists of χ = 5 cells, where two reduction cells (with stride 2) are inserted between three normal cells (with stride 1). The number of nodes in a cell is B = 7. The initial operation space O is the same as in Zoph et al. (2018); Liu et al. (2018b), and the size of the space is |O| = 8 at the beginning of each iteration. In each iteration, the search network is trained for T epochs before the final architecture of the corresponding cell is obtained; this search training ensures the performance stability of the search network after each pruning step. In our experiments, T is set to 60 because we find that the accuracy of the search network remains relatively steady after about 60 epochs of training: fewer training epochs usually lead to performance collapse because the network parameters have not converged, while more training epochs contribute little to the result. All of our experiments are run on one device with an Intel Core i7-8700K CPU and an NVIDIA GTX1080Ti GPU.
The architecture search is implemented in the deep learning framework PyTorch (Paszke et al. 2017) with 16 initial channels and a batch size of 96. The initial learning rate is 0.025 and is annealed down to zero following a cosine schedule. A standard SGD optimizer with a momentum of 0.9 and a weight decay of 3 × 10^{-4} is adopted. The hyperparameter λ is fixed at 5 × 10^{-3}, and the other experiment settings remain the same as in DARTS (Liu et al. 2018b). We run the experiment five times with different random seeds and pick the best cells based on validation performance. The cells discovered by the LFR-DARTS algorithm are presented in Fig. 2.

Evaluation on CIFAR10
An evaluation network consisting of 16 cells discovered in Sect. 4.1 is trained from scratch for 600 epochs on CIFAR10 with mini-batch=50, initial learning_rate=0.025, init_channels=36, drop_path_prob=0.15, momentum=0.9, weight_decay=3 × 10^{-4} and an auxiliary tower of weight=0.4. We use standard image preprocessing and data augmentation, i.e., random cropping, horizontal flipping and batch normalization. Other settings remain the same as in Hieu et al. (2018); Liu et al. (2018b). The two reduction cells of Fig. 2(d) and (e) are located at 1/3 and 2/3 of the total depth of the evaluation network, respectively. The other positions of the network are filled with the three kinds of normal cells in Fig. 2(a), (b) and (c).
To explore the performance limit of our discovered neural architecture, we further increase the initial channels to 50 for an evaluation network that contains 15 cells and more parameters (denoted as the large setting in Table 2). We compare our neural network with networks designed by experts and by other NAS methods under fair conditions in which all the NAS networks have fewer than 5M parameters. Each evaluation network is trained 5 times with different random seeds, and the results show that our discovered network has excellent performance and strong stability.
The test results and the comparison with other approaches are summarized in Table 2. As shown there, LFR-DARTS achieves a test error rate of 2.65% with only 2.7M parameters on the validation set of CIFAR10. With more parameters (4.4M), LFR-DARTS further reduces the error rate to 2.45%, outperforming almost all existing state-of-the-art works at a lower computational cost than DARTS.

Evaluation on fashionMNIST
The discovered cell architectures are first transferred to another dataset, fashionMNIST, which consists of a training set of 60,000 images and a test set of 10,000 images; each image is a 28×28 grayscale image associated with a label from 10 classes. The evaluation network is constructed with 15 cells and 36 initial channels. We train this evaluation network with the same settings as on CIFAR10; the results are summarized in Table 3.

Evaluation on ImageNet
We transfer our architecture discovered on CIFAR10 to the large-scale ImageNet dataset, and the result also demonstrates excellent generalization performance. Following the mobile setting in Zoph et al. (2018); Hieu et al. (2018), we construct our evaluation network with 15 cells and train it for 250 epochs with batch size 160, 48 initial channels, weight decay 3 × 10^{-5}, momentum=0.9 and an initial SGD learning_rate=0.1 (decayed linearly to 0.0). We keep the other hyperparameters and settings the same as on CIFAR10. We compare our algorithm with other approaches, and the results are presented in Table 4. We also compare the complete architecture search and evaluation process of our method, DARTS and random search on the three datasets; please refer to Figs. 3 and 4. Fig. 3 illustrates the loss curves on CIFAR10 (a), fashionMNIST (b) and ImageNet (c), and Fig. 4 shows the training/testing process at the evaluation stage on the same three datasets. The results show that our method achieves better performance and generalization than the baseline methods.

Efficiency of architecture search
Our LFR-DARTS algorithm shows high search efficiency and space utilization. In our experiments, the architecture search contains χ iterations. For each iteration, we record the time and space cost of one training epoch; the results are presented in Table 5. Within each iteration, we gradually remove weak operations from the search network in 3 steps. At the beginning, one training epoch takes 196 seconds with 10094M of GPU memory; by the last step, this drops to 87 seconds with 8676M of GPU memory. The training time and space costs decrease steadily, which shows that our method speeds up the search process.
To further show the difference in search efficiency between our algorithm and DARTS, we investigate the forward-propagation and backward-propagation time of both methods during the search process. For both search networks, we set the same batch size=32 and training epochs=300, and then monitor how the time of one propagation changes. The results are displayed in Fig. 5, which compares the time of one forward and backward propagation during the search process. The propagation time of our method descends step by step in Fig. 5(a) as the search network constantly drops weak operations. DARTS, in contrast, needs to conduct a bilevel optimization of architecture parameters and network weights; we measure its propagation time and illustrate it in Fig. 5(b). As can be seen, our method is faster than DARTS even in the initial phase, and the search process is accelerated gradually in the later phases, which also makes it suitable for searching deeper networks.

Effectiveness of entropy constraint
In this section, we experimentally verify the effectiveness of the entropy constraint introduced in Sect. 3.3. The entropy constraint is part of the loss function in Eq. 9. Minimizing the entropy over the candidate operations makes the distribution of the operations' importance scores tend toward one or a few peaks. This distribution makes it easier to filter out optimal operations, since the constraint highlights the most important ones. In fact, the parameter λ is closely related to the distribution of the importance scores, so an appropriate scaling factor is key. We conduct experiments to compare the effect of different values of λ on the results. During a search stage, we randomly choose one set of operations (containing 8 operations) in a cell being searched and observe the differences in its importance scores under different λ settings. The result is displayed in Fig. 6, where λ = 0.0 means no entropy constraint. The four sub-figures show the distributions of the importance scores at four different training epochs. The distributions vary as training progresses, and different values of λ have different impacts on the results. In the initial phase of training (Fig. 6(a)), the importance scores are randomly distributed; after a period of training (Fig. 6(b)), the high scores concentrate on a few operations. From these experiments, we find that the importance scores converge better with λ = 0.005 than with other settings: values of λ that are too large or too small lead to unsatisfactory convergence, and the results get worse when λ ≥ 0.05. Thus, an appropriate entropy constraint on the candidate operations makes a positive contribution to the architecture search process.

Discussion
From the visualized results of the discovered cells in Fig. 2, we make some interesting observations that are consistent with common sense: shallow network layers prefer small separable convolutional kernels (as in Fig. 2(a)), while deeper layers prefer large dilated separable convolutional kernels (as in Fig. 2(c)). The discovered cells therefore show a striking depth-adaptive characteristic: small convolutional kernels in the shallow layers excel at extracting fine-grained feature information from the data, while large convolutional kernels in the deeper layers are well suited to processing fused features. Sufficient layered information provides a more reliable basis for decision making. The network architectures stacked repeatedly in DARTS, however, cannot meet this requirement of feature extraction. LFR-DARTS takes this problem into account and thus improves the performance of differentiable approaches. Our method also provides a valuable reference for developing more elaborate and useful cell architectures.
In addition, our search process is divided into multiple stages performed iteratively: each cell architecture is searched based on the previously obtained cells and the current network depth. Although these cells are not globally optimal, their combination provides an approximately optimal solution for architecture search. This greedy search scheme is currently adopted by most differentiable search methods, and there remain many promising improvements to be made to it.
Our work has shown many advantages in designing network architectures for image tasks. It is also worth applying our method to other fields, such as object detection and natural language processing. We will further explore differentiable approaches to architecture search to solve the problem of automated model design in other fields.

Conclusion
In this paper, we propose a novel differentiable NAS algorithm called Layered Feature Representation for Differentiable Architecture Search (LFR-DARTS) to solve the existing problem of insufficient layered feature representation. In this way, LFR-DARTS improves the performance and generalization of the discovered network architectures compared to other differentiable NAS algorithms. Specifically, we develop a layered and dynamic architecture search scheme that discovers multiple optimal cells from shallow to deep layers while gradually pruning weak operations from the search network. Besides, to effectively learn the importance of candidate operations and highlight the optimal ones during the search process, we design a new functional layer, Normalization-Affine, and introduce an entropy constraint on the operations. Extensive experiments on image classification tasks demonstrate that our algorithm achieves better performance while requiring low computational costs.