Convolutional neural networks can be scaled up to increase accuracy by adding more layers, but the resources available for doing so are limited. The standard approaches to model scaling, however, are inconsistent: some models scale in depth, others in width, and some merely consume higher-resolution images to obtain better results. Scaling a model arbitrarily often yields little or no performance improvement and requires extensive manual tuning. EfficientNet instead uses a technique based on a compound coefficient to scale up models quickly and simply. Rather than arbitrarily growing width, depth, or resolution, compound scaling scales each dimension uniformly with a preset, fixed set of scaling factors. By combining this scaling method with AutoML, the developers of EfficientNet created a family of eight models (EfficientNet-B0 through B7) of various sizes that outperformed state-of-the-art convolutional neural networks in terms of accuracy and efficiency.
Model Scaling
The underlying logic is that scaling a single dimension can improve model performance, but scaling all three dimensions together (width, depth, and image resolution), while taking the available resources into account, increases the model's overall performance the most. The compound scaling method is illustrated in the figure below.
- Scaling a ConvNet: By the most common definition, scaling a ConvNet means modifying the network's dimensions to improve its performance. These dimensions are depth, width, and resolution.
- Compound scaling: The authors of EfficientNet suggest starting with a baseline network (N) and concentrating on expanding its depth (L), width (C), and resolution (H, W) while keeping the baseline architecture fixed. This differs from the typical approach of searching for the best layer architecture. The optimization problem is therefore to choose the depth (d), width (w), and resolution (r) coefficients that maximize the network's accuracy within the constraints of the available resources, namely memory and the number of feasible floating-point operations (FLOPS), as formalized below.
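Written out (following the formulation in the EfficientNet paper, with the baseline architecture N held fixed and per-stage details omitted), the optimization problem is roughly:

$$
\max_{d,\,w,\,r}\ \operatorname{Accuracy}\big(N(d, w, r)\big)
\quad \text{s.t.} \quad
\operatorname{Memory}(N(d, w, r)) \le \text{target memory}, \qquad
\operatorname{FLOPS}(N(d, w, r)) \le \text{target FLOPS}
$$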
To further reduce the search space ⟨L, C, H, W⟩, the authors also suggest restricting all layers to be scaled uniformly with a constant ratio. Thus, following the EfficientNet paper, the dimensions of the network are defined as:
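$$
\text{depth: } d = \alpha^{\Phi}, \qquad
\text{width: } w = \beta^{\Phi}, \qquad
\text{resolution: } r = \gamma^{\Phi},
\qquad \text{s.t. }\ \alpha \cdot \beta^{2} \cdot \gamma^{2} \approx 2, \quad \alpha \ge 1,\ \beta \ge 1,\ \gamma \ge 1
$$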
The compound coefficient Φ is set by the user and reflects how many additional resources are available for scaling. The constants α, β, and γ are found through a small grid search and determine how those resources are allocated to the network's depth, width, and resolution, respectively.
It is also worth noting that the authors observed that the FLOPS of a regular convolution operation are proportional to d, w², and r². Since convolution operations dominate the computation cost in ConvNets, applying compound scaling to a ConvNet increases the total FLOPS by approximately (α·β²·γ²)^Φ. Hence the constraint α·β²·γ² ≈ 2, which makes the total FLOPS grow by roughly 2^Φ.
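As a concrete illustration, the short Python sketch below (not from the official EfficientNet code) computes the per-dimension multipliers implied by a given compound coefficient Φ, using the grid-search values reported in the EfficientNet paper (α = 1.2, β = 1.1, γ = 1.15):

```python
# Compound scaling sketch: given a compound coefficient phi, compute the
# multipliers applied to depth (number of layers), width (number of
# channels), and input resolution, plus the resulting FLOPS growth.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # grid-search values reported for EfficientNet

def compound_scale(phi: float):
    depth_mult = ALPHA ** phi        # scales the number of layers
    width_mult = BETA ** phi         # scales the number of channels
    resolution_mult = GAMMA ** phi   # scales the input image side length
    flops_mult = (ALPHA * BETA**2 * GAMMA**2) ** phi  # ~= 2 ** phi
    return depth_mult, width_mult, resolution_mult, flops_mult

if __name__ == "__main__":
    for phi in range(1, 4):
        d, w, r, f = compound_scale(phi)
        print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, "
              f"resolution x{r:.2f}, FLOPS x{f:.2f}")
```

Because α·β²·γ² ≈ 1.92 for these values, each unit increase of Φ roughly doubles the total FLOPS.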
3. EfficientNet architecture
As noted above, compound scaling enlarges the network's width, depth, and resolution rather than altering the operations carried out within each layer of the network. The architecture of the model is shown below.
MBConv
Residual blocks use skip connections to link the start and end of a convolutional block. In a typical residual block the channels are wide at the start, narrow in the middle of the block, and wide again at the end, so in terms of the number of channels the pattern is wide -> narrow -> wide.[18]
An inverted residual block follows the opposite pattern: narrow -> wide -> narrow. MBConv uses depth-wise separable convolutions to make CNNs more efficient and adaptable for mobile platforms. A 1x1 convolution first expands the narrow input channels, a 3x3 depth-wise convolution then processes the widened representation while keeping the parameter count low, and a final 1x1 convolution compresses the channels back down, with the skip connection joining the narrow ends of the block.
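To make this pattern concrete, here is a minimal PyTorch sketch of an inverted residual (MBConv-style) block. It is an illustrative implementation, not EfficientNet's reference code: the expansion ratio of 6 is just a common default, and the squeeze-and-excitation step described in the next subsection is omitted.

```python
import torch.nn as nn

class MBConv(nn.Module):
    """Inverted residual block: 1x1 expand -> 3x3 depth-wise -> 1x1 project,
    with a skip connection when the input and output shapes match."""

    def __init__(self, in_ch: int, out_ch: int, expand_ratio: int = 6, stride: int = 1):
        super().__init__()
        mid_ch = in_ch * expand_ratio
        self.use_skip = stride == 1 and in_ch == out_ch
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False),   # expand: narrow -> wide
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),
            nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride,
                      padding=1, groups=mid_ch, bias=False),       # depth-wise 3x3
            nn.BatchNorm2d(mid_ch),
            nn.SiLU(),
            nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False),  # project: wide -> narrow
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_skip else out
```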
Squeeze and Excitation (SE) Block
An SE block is a CNN component that models interdependencies between channels by dynamically recalibrating feature maps channel-wise, giving relevant channels more weight than unimportant ones. See the illustration below.
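A minimal sketch of such a block, again in PyTorch and with an illustrative reduction ratio of 4, could look like this: the spatial dimensions are averaged away ("squeeze"), a small bottleneck MLP produces one weight per channel ("excitation"), and the input is rescaled channel-wise.

```python
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation block: global average pooling followed by a
    two-layer bottleneck that outputs one scaling weight per channel."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: (B, C, H, W) -> (B, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.SiLU(),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                            # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                           # channel-wise recalibration
```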
EfficientNet combines the MBConv block with the SE block, resulting in the structure described below. Every network begins with a stem and ends with the same top layers; these parts are common to all eight models, and the architecture experimentation happens in the blocks in between.
Each model then contains seven blocks. The number of sub-blocks within these blocks varies from block to block and increases as we move from EfficientNet-B0 to EfficientNet-B7. The architecture is built from five modules, which are combined to form the sub-blocks that are then used within the blocks in a specific arrangement.[11, 21]