We investigate the impact of zero initialization on ANNs by setting initial weight values to zero and comparing this approach with the conventional method of random weight initialization. Eq. (1) shows the backpropagation formula.
$$\frac{\partial\,Loss}{\partial\,Weight}=\left(\text{an error from upper layers}\right)\times\left(\text{a result from lower layers}\right)\tag{1}$$
The gradient of the loss with respect to a weight is calculated as the product of an error generated by the upper layers and an output value from the lower layers. Naturally, the output value from the lower layers is entirely independent of the value of the weight being updated. For instance, whether that weight is 0.03, − 0.1, 0, or an extremely large value, the lower-layer output remains the same. Indirectly, however, the weight value affects the error received from above and thereby participates in the weight gradient computation. Other reports claim that initializing weights to zero leads to poor learning, even though, judging from these two factors (the independent lower-layer output and the indirect upper-layer error), a value of zero does not appear particularly problematic for learning. This discrepancy calls for a detailed investigation.
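As a concrete check of this independence, the short PyTorch sketch below (toy values of our own choosing, not the paper's code) computes the gradient of a single weight for several different initial values; the gradient always equals the lower-layer output.

```python
import torch

# A minimal numerical sketch (toy values assumed) of Eq. (1): the gradient of a
# weight is the upstream error times the output coming from the layer below, so
# it does not depend on the weight's own current value.
x = torch.tensor([2.0])                    # "a result from lower layers"
for w_init in (0.03, -0.1, 0.0, 1e6):      # different initial weight values
    w = torch.tensor([w_init], requires_grad=True)
    y = w * x                              # forward pass through a single weight
    y.backward(torch.tensor([1.0]))        # fix the "error from upper layers" to 1
    print(w_init, w.grad.item())           # always prints 2.0, i.e. the value of x
```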
Analysis of Configurations
When a layer is initialized entirely to zero, it directly affects both the layer above and the layer below it. The left of Fig. 1(a) illustrates the relationship between a zero-initialized layer and the layer below it; for simplicity, the activation function is omitted. If the inputs to the zero-initialized layer are diverse, its weights will inevitably differ after the first backpropagation1, despite starting at zero. However, no error information propagates further backward through the zero weights during this first pass. In the next backpropagation, the zero-initialized layer's weights already have non-zero values, so backpropagation information can be passed down to the layer below.
Conversely, the right of Fig. 1(a) shows the relationship between a zero-initialized layer and the layer above it. Since all weights of the zero-initialized layer are zero, the inputs to the upper layer are also zero, so the gradients of that layer's weights are zero during backpropagation, which halts its learning. Nevertheless, backpropagation information is still transmitted downward through the upper layer's randomly initialized weights. If the inputs at the bottom remain diverse, the weights of the lower, zero-initialized layer receive varied gradients and are updated. In the next forward pass, the upper layer then receives diverse inputs, and its weights begin to learn.
Figure 1(a) supposes that there are randomly initialized layers above and below the zero-initialized layer. Under these conditions, learning proceeds without problems. However, a problem arises if two or more zero-initialized layers appear in succession. In the left of Fig. 1(b), the gradients of the upper layer's weights and of the nodes feeding into the upper layer all become zero after backpropagation, so neither backpropagation information nor weight updates are transmitted. Consequently, no learning occurs regardless of the number of iterations, except for the bias of the upper layer.
To address this issue, one effective strategy is to randomly initialize the bias values when the weights are zero, as shown in the right of Fig. 1(b). This setup allows the upper layer to receive diverse input values through the biases of the lower layer, generating varied gradients during backpropagation. However, because no backpropagation information is transmitted through the upper layer's zero weights at first, the learnable parameters of the lower layer do not learn during the first update. In subsequent backward passes, the weights of the upper layer have non-zero values, so its gradients can be transmitted to the lower layer, eventually producing non-zero values in the lower-layer weights. This dynamic is further exemplified in Fig. 1(c) and 1(d), where layers initialized to zero develop non-zero weight values after several cycles of forward and backward propagation. We use a setting similar to the right of Fig. 1(b) in our experiments: weights are initialized to zero and biases are randomly initialized.
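The following PyTorch sketch (a toy network with assumed sizes, not the models used in our experiments) illustrates this bootstrapping behaviour for two consecutive zero-initialized layers, with and without randomly initialized biases.

```python
import torch
import torch.nn as nn

# A minimal sketch of the situation above: two consecutive zero-initialized
# linear layers. With zero biases nothing learns; with randomly initialized
# biases the upper layer learns on the first step and the lower zero-initialized
# layer follows on the next one.
def make_net(random_bias: bool) -> nn.Sequential:
    torch.manual_seed(0)
    net = nn.Sequential(nn.Linear(8, 16), nn.Linear(16, 16), nn.Linear(16, 4))
    for layer in (net[1], net[2]):              # two successive zero-initialized layers
        nn.init.zeros_(layer.weight)
        if random_bias:
            nn.init.uniform_(layer.bias, -0.5, 0.5)
        else:
            nn.init.zeros_(layer.bias)
    return net

x, target = torch.randn(32, 8), torch.randn(32, 4)
for random_bias in (False, True):
    net = make_net(random_bias)
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for step in range(2):
        opt.zero_grad()
        nn.functional.mse_loss(net(x), target).backward()
        print(f"random_bias={random_bias} step={step} "
              f"|grad W_upper|={net[2].weight.grad.abs().sum():.3f} "
              f"|grad W_lower|={net[1].weight.grad.abs().sum():.3f}")
        opt.step()
```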
In the conditions of Fig. 1(a) and (c), learning may still be thwarted by the activation functions used. For example, many commonly used activation functions, such as ReLU, produce a gradient of zero when their input is zero. As a result, no gradient is transmitted to the lower layer, hindering proper learning progression.
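A one-line check of this behaviour in PyTorch (assumed setup):

```python
import torch

# ReLU's gradient at an input of exactly zero is zero in PyTorch, so a
# zero-valued activation passes no gradient down to the layer below it.
x = torch.zeros(5, requires_grad=True)
torch.relu(x).sum().backward()
print(x.grad)        # tensor of zeros: no gradient flows through ReLU at 0
```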
Most modern optimization algorithms, such as Adam31, do not apply raw gradient values directly to weight updates; Adam instead uses values that incorporate previous gradient information. During the initial backpropagation, Adam effectively applies one of three values: 0, 1, or − 1. If the computed gradient is zero, it remains zero; otherwise, it is mapped to 1 or − 1 according to its sign, regardless of its original magnitude. These values are then scaled by the learning rate for the weight update. Although the weights may appear uniform after this first update, the original gradient values are retained in the optimizer's state and used in subsequent updates, ensuring diversity in the weight values over time.
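The sketch below (a toy example with hand-crafted gradients, not taken from our experiments) illustrates this property of Adam's first update: the applied change is approximately the learning rate times the sign of the gradient, regardless of its magnitude.

```python
import torch

# On the very first Adam step, the applied update is approximately
# lr * sign(gradient), i.e. one of {-lr, 0, +lr}, whatever the gradient's size.
w = torch.tensor([1.0, 1.0, 1.0, 1.0], requires_grad=True)
opt = torch.optim.Adam([w], lr=0.01)

# Hand-crafted gradients of very different magnitudes (one of them exactly zero).
w.grad = torch.tensor([1e-4, -250.0, 0.0, 3.0])
opt.step()

print(w.detach() - 1.0)   # roughly [-0.01, +0.01, 0.0, -0.01]: only the sign matters
```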
In the case of residual connections25, the calculations are generally similar to those depicted in Fig. 1, with some notable differences. For example, if a residual connection feeds into the middle nodes in the right of Fig. 1(a), the gradients of the weights in the upper layer are diverse, allowing these weights to be learned immediately, unlike the original scenario in the right of Fig. 1(a).
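The following sketch (a toy configuration constructed for illustration, not ResNet itself) contrasts the first-step gradient of the upper layer with and without a skip connection around a zero-initialized layer.

```python
import torch
import torch.nn as nn

# A zero-initialized layer followed by an upper layer. Without a skip
# connection, the upper layer's weight gradient is zero on the first backward
# pass; with a skip connection adding the diverse input around the zero layer,
# it is non-zero immediately.
torch.manual_seed(0)
zero_layer = nn.Linear(16, 16)
nn.init.zeros_(zero_layer.weight)
nn.init.zeros_(zero_layer.bias)
upper = nn.Linear(16, 4)

x = torch.randn(32, 16)
for use_skip in (False, True):
    upper.zero_grad()
    h = zero_layer(x) + (x if use_skip else 0)    # optional residual connection
    upper(h).sum().backward()
    print(use_skip, upper.weight.grad.abs().sum().item())
```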
However, zero initialization has a problem in ResNets25, where a batch normalization layer follows each layer operation. If a layer is initialized to zero, all inputs to the subsequent batch normalization layer will be either zero or equal to the bias values. When the inputs are zero, the data distribution has a mean and variance of zero. In the backpropagation of the batch normalization layer, the variance appears in the denominator, and a very small constant is added to prevent the denominator from becoming zero during learning. In this situation, the gradients become very large because the denominator is tiny. Moreover, the forward pass is intended to produce outputs with a mean of zero and a variance of one, but because the variance of the inputs is zero, the output variance cannot be one. Therefore, while learning is technically possible, the layer operates in a way that is completely different from its intended behavior, so this method is not recommended.
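The short check below (assumed shapes and an arbitrary upstream error) illustrates the magnitude of the gradients that batch normalization passes backward when its input is all zeros.

```python
import torch
import torch.nn as nn

# When a BatchNorm layer receives an all-zero input, the per-channel variance
# is zero and only the small eps keeps the denominator finite, so the gradients
# passed backward become very large.
bn = nn.BatchNorm2d(3)                       # eps defaults to 1e-5
x = torch.zeros(8, 3, 4, 4, requires_grad=True)

y = bn(x)
(y * torch.randn_like(y)).sum().backward()   # arbitrary upstream error

print(x.grad.abs().max())                    # on the order of 1/sqrt(eps), i.e. very large
```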
In the case where the inputs equal the bias values, the outputs of the batch normalization layer still exhibit a mean and variance of zero. This occurs because batch normalization is performed per channel across the batch, and when all values in a channel are identical, the normalized outputs are all zero. According to the original paper32, this follows from batch normalization's property of canceling the effect of a bias. Therefore, a setting such as the right of Fig. 1(b) proves ineffective. In ResNets, where a batch normalization layer immediately follows each layer operation, zero initialization is applicable only to the fully connected layers used at the end of the network.
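The following minimal check (assumed shapes and bias values) shows that a channel-wise constant input, equal to the preceding layer's bias, is mapped to zero by batch normalization, cancelling the bias information.

```python
import torch
import torch.nn as nn

# When every value in a channel is identical, batch normalization maps the
# whole channel to zero, so a bias added before the BatchNorm layer is
# effectively cancelled.
bn = nn.BatchNorm2d(num_features=3)
bias = torch.tensor([0.5, -1.2, 3.0]).view(1, 3, 1, 1)

x = torch.zeros(8, 3, 4, 4) + bias      # each channel is constant (= its bias)
y = bn(x)

print(y.abs().max())                    # ~0: the bias information is removed
```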
On the other hand, in layer normalization33, biases do affect the outputs of the normalization layer. If both weights and biases are initialized to zero, a situation similar to the batch normalization case occurs. However, if only the weights are initialized to zero and the biases are not, the inputs to the normalization layer have a distribution with non-zero mean and variance. The outputs of the layer can then have a mean of zero and a variance of one, because layer normalization operates differently, normalizing over the feature dimension of each sample rather than over the channel across the batch. Since layer normalization is used in multilayer perceptron mixers (MLP-Mixers)4 and vision transformers (ViTs)34, it is possible to zero-initialize both internal layers and the last layer in these models.
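The sketch below (assumed dimensions) contrasts this with the batch normalization case: a zero-weight linear layer with a randomly initialized bias still yields a well-behaved layer-normalized output.

```python
import torch
import torch.nn as nn

# A zero-weight linear layer with a random bias feeds LayerNorm. Because
# LayerNorm normalizes over the feature dimension (not over the batch/channel),
# the bias vector itself provides non-zero variance, so the normalized output
# is well-behaved rather than degenerate.
linear = nn.Linear(16, 32)
nn.init.zeros_(linear.weight)          # zero-initialized weights, random bias kept

ln = nn.LayerNorm(32)
x = torch.randn(8, 16)
y = ln(linear(x))                      # every row equals LayerNorm(bias)

print(y.mean(dim=-1).abs().max(), y.var(dim=-1, unbiased=False).mean())
# mean ~0 and variance ~1 per sample, unlike the BatchNorm case above
```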
Performance on Plain Neural Networks
Table 1 presents the results of multilayer perceptrons (MLPs)1 and convolutional neural networks (CNNs)1. In Table 1, “Weight” refers to the method used for weight initialization, and “Layer” specifies the layers to which that initialization method is applied. The term “All” in “Layer” indicates that all layers are initialized as described under “Weight.” “Zero all” means that both weights and biases across all layers are set to zero. “1” refers to the first layer, closest to the input layer. Unless specified as “All” or “Zero all,” layers not explicitly mentioned use weights initialized from a uniform distribution, which is PyTorch’s standard setting. For instance, under “Zero” in “Weight” and “1” in “Layer,” all layers except the first are initialized with values drawn from a uniform distribution. “Default” in “Weight” refers to PyTorch’s default initialization settings. “Xavier” in “Weight” denotes initialization using the Xavier uniform distribution. Note that when weights are initialized to zero, biases remain randomly initialized, except under the “Zero all” condition. The values following “Constant” in Table 1 represent initialization with that constant. The bold text in Table 1 indicates the highest mean accuracy observed for each model on each dataset, and all values are rounded for clarity.
Table 1
| Model | Weight | Layer | MNIST mean (%) | CIFAR-10 mean (%) | CIFAR-100 mean (%) |
| --- | --- | --- | --- | --- | --- |
| MLPs | Default | All | 98.66 | 53.49 | 24.57 |
| MLPs | Normal | All | 95.28 | 40.67 | 12.10 |
| MLPs | Xavier | All | 98.43 | 51.83 | 25.32 |
| MLPs | Constant 1 | All | 97.02 | 52.70 | 24.85 |
| MLPs | Constant 2 | All | 96.49 | - | - |
| MLPs | Constant 3 | All | 95.69 | - | - |
| MLPs | Constant 5 | All | 94.87 | - | - |
| MLPs | Constant 10 | All | 92.12 | - | - |
| MLPs | Constant 0.5 | All | 97.50 | - | - |
| MLPs | Constant 0.1 | All | 98.13 | 53.71 | 25.34 |
| MLPs | Constant − 0.1 | All | 97.88 | 53.43 | 25.27 |
| MLPs | Zero all | All | 11.35 | 10.00 | 1.00 |
| MLPs | Zero | All | 85.12 | 53.53 | 25.54 |
| MLPs | Zero | 1 | **98.67** | 53.21 | 24.65 |
| MLPs | Zero | 2 | 98.60 | 53.63 | **25.58** |
| MLPs | Zero | 3 | 98.61 | **54.10** | 25.26 |
| CNNs | Default | All | 99.19 | 71.61 | 39.07 |
| CNNs | Normal | All | 97.18 | 48.03 | 8.40 |
| CNNs | Xavier | All | 99.14 | 70.67 | 36.88 |
| CNNs | Constant 1 | All | 99.03 | 68.03 | 35.42 |
| CNNs | Constant 0.1 | All | 99.13 | 67.64 | 35.30 |
| CNNs | Constant − 0.1 | All | 98.75 | 64.23 | 25.34 |
| CNNs | Zero all | All | 11.35 | 10.00 | 1.00 |
| CNNs | Zero | All | 99.12 | 68.03 | 36.69 |
| CNNs | Zero | 1 | **99.25** | 70.90 | 39.38 |
| CNNs | Zero | 2 | 99.20 | 72.62 | 40.99 |
| CNNs | Zero | 3 | **99.25** | 72.96 | **42.01** |
| CNNs | Zero | 4 | - | 73.31 | 41.12 |
| CNNs | Zero | 1,2 | 99.16 | 66.74 | 36.36 |
| CNNs | Zero | 1,3 | 99.21 | 70.94 | 39.22 |
| CNNs | Zero | 1,4 | - | 71.24 | 37.45 |
| CNNs | Zero | 2,3 | 99.22 | 71.84 | 40.01 |
| CNNs | Zero | 2,4 | - | 73.57 | 40.68 |
| CNNs | Zero | 3,4 | - | **73.98** | 41.68 |
| CNNs | Zero | 1,2,3 | - | 68.76 | 36.58 |
| CNNs | Zero | 1,2,4 | - | 67.05 | 34.02 |
| CNNs | Zero | 1,3,4 | - | 73.24 | 39.44 |
| CNNs | Zero | 2,3,4 | - | 73.91 | 40.82 |

*Default: PyTorch's default initialization settings; Xavier: Xavier uniform distribution for initialization; Zero all: both weights and biases across all layers are set to zero; Constant X: initialization with a specific constant value X; -: configuration not reported for that dataset.
**Bold: highest mean accuracy achieved by each model on each dataset.
Results for MNIST in Table 1 show that uniform initialization typically outperforms the other methods in MLPs and CNNs, with the exception of zero initialization. As expected, no training occurs under the “Zero all” condition. Performance deteriorates as the initialization constant increases. Interestingly, the results are not symmetrical between “Constant 0.1” and “Constant − 0.1.” When all layers are initialized to zero, performance decreases: average accuracy drops to 85.12% for MLPs and, much less severely, to 99.12% for CNNs. Performance appears to improve only when at least one layer is randomly initialized.
For CIFAR-10, as shown in Table 1, the default setting in PyTorch yields the best performance for MLPs and CNNs, with the exception of zero initialization. Similar to earlier results, performance declines when all layers are initialized to zero. Additionally, there is variability and asymmetry in performance when initialized with “Constant 0.1” and “Constant − 0.1.”
This trend is further investigated with the CIFAR-100 dataset. The performance differences between “Constant − 0.1” and “Constant 0.1” continue across datasets, with the best outcomes observed in models incorporating zero-initialized layers.
First, Fig. 2 illustrates how the weights of a CNN's zero-initialized last layer change over time, progressing from uniform to diverse values as the epochs increase; a single convolutional neural network is used to create each of these figures. Second, the accuracy trends of zero-initialized CNNs were investigated across epochs. Initializing the last layer to zero consistently outperformed uniform initialization at every epoch, suggesting that zero initialization in the final layer can have a significant positive impact on the learning process. Third, except for the MLP results on MNIST, zero initialization showed a statistically significant performance improvement. Finally, Fig. 2 compares models whose last CNN layer is initialized to a constant, to zero, or with the default settings; the bars represent averages and standard deviations. As shown in Fig. 2, models initialized with a constant value consistently outperform those with random initialization across all datasets, suggesting that constant values can enhance accuracy. Note that zero initialization is a specific case of constant initialization.
Performance on Modern Architectures
Results for ResNets, ViTs, and MLP-Mixers on the CIFAR-10 and CIFAR-100 datasets are shown in Fig. 3. In Fig. 3, “Below” refers to the linear layer closest to the input of the MLP module, while “Above” refers to the linear layer farthest from the input. For example, ViTs under the “Last, Above” condition in “Layer” have seven zero-initialized layers: the “Above” layer in each MLP module and the last layer used for classification are initialized to zero. Note that even if a layer is initialized to zero, its biases are randomly initialized.
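As a rough illustration only, the helper below sketches how such a condition could be applied in PyTorch; the function and the attribute names in the usage comment are hypothetical and not the code used for our experiments.

```python
import torch.nn as nn

# A hedged sketch: zero the weights of selected linear layers while
# re-randomizing their biases, leaving all other parameters at their defaults.
def zero_init_layers(layers):
    for layer in layers:
        nn.init.zeros_(layer.weight)               # weights set to zero
        nn.init.uniform_(layer.bias, -0.01, 0.01)  # biases stay random (assumed range)

# Example usage for a "Last, Above" style setting on a generic transformer-like
# model (attribute names are assumptions):
# zero_init_layers([block.mlp.fc2 for block in model.blocks] + [model.head])
```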
All data in Fig. 3 pass the normality test, allowing t-tests to be conducted with the following options: unpaired, assuming a Gaussian distribution, not assuming equal standard deviations, and two-tailed. Performance decreases or remains unchanged except for MLP-Mixers under the “Zero, Last” condition on CIFAR-10 and ViTs under the “Zero, Last” condition on CIFAR-100. Consistently, the “Below” condition tends to decrease performance. However, three aspects are important to note. First, although the decrease in performance is statistically significant, it is not drastic; for instance, models that typically achieve an average of 92.8% do not drop below 92.2% with zero initialization (Fig. 3(a)). Second, many results are comparable to those achieved by randomly initialized models. Third, some results even surpass those of randomly initialized models, suggesting that zero initialization could enhance performance in modern deep learning algorithms.
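For reference, the test described above corresponds to Welch's t-test; a minimal SciPy sketch with placeholder accuracy values is shown below.

```python
from scipy import stats

# Placeholder accuracy lists (hypothetical values, not our measurements).
zero_init_acc = [92.4, 92.6, 92.5, 92.3, 92.7]
random_init_acc = [92.8, 92.9, 92.7, 92.6, 93.0]

# Unpaired, two-tailed t-test without assuming equal standard deviations
# (Welch's t-test); two-sided is SciPy's default alternative.
t, p = stats.ttest_ind(zero_init_acc, random_init_acc, equal_var=False)
print(t, p)
```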
Table 2 details the ratio of randomly initialized parameters to total parameters in the models that achieved the highest average accuracy for each dataset. Notably, on the MNIST dataset, CNNs achieve the highest accuracy when the zero-initialized layer is either the first or the third layer, so Table 2 reports the case where the third layer is zero-initialized. Remarkably, CNNs with approximately 20% of their parameters randomly initialized outperform the others on the CIFAR-10 dataset. Additionally, even a small fraction of zero-valued parameters, amounting to just 0.1% of the total, can contribute significantly to performance improvement. We particularly examine ResNets with the last layer initialized to zero and ViTs and MLP-Mixers under the “Last, Above” condition. It is noteworthy that the performance of MLP-Mixers is maintained even with half of their parameters initialized to zero.
Table 2
The ratio of randomly initialized parameters to total parameters
| Dataset | Model | Ratio | Total number of parameters |
| --- | --- | --- | --- |
| MNIST | MLPs | 0.504 | 1,238,730 |
| MNIST | CNNs | 0.936 | 79,626 |
| CIFAR-10 | MLPs | 0.999 | 8,157,010 |
| CIFAR-10 | CNNs | 0.199 | 381,066 |
| CIFAR-10 | ResNets | 0.999 | 25,819,466 |
| CIFAR-10 | ViTs | 0.696 | 19,774,606 |
| CIFAR-10 | MLP-Mixers | 0.502 | 17,110,026 |
| CIFAR-100 | MLPs | 0.757 | 8,247,100 |
| CIFAR-100 | CNNs | 0.377 | 473,316 |
| CIFAR-100 | ResNets | 0.992 | 26,003,876 |
| CIFAR-100 | ViTs | 0.695 | 19,819,696 |
| CIFAR-100 | MLP-Mixers | 0.500 | 17,156,196 |
In summary, our comprehensive investigation across various datasets and models reveals that while zero initialization can occasionally decrease performance, it enhances outcomes under certain conditions, such as when applied to the last layer of MLP-Mixers and ViTs. These findings challenge conventional beliefs about weight initialization in neural networks, indicating that zero initialization combined with other initialization methods can effectively improve the performance of modern deep learning architectures. They also challenge the belief that as many parameters as possible should always be randomly initialized.