Setting the complex calculus and mathematical equations aside, a backpropagation algorithm works, put very simply, as presented in Figure 2. First, the model performs a forward pass. Next, the gradient of the loss function with respect to each weight is calculated using the chain rule. The weights are then updated from these gradients by a gradient method. This process repeats until the updated weights yield a model output with a reasonably low loss.
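The loop described above can be sketched in a few lines of NumPy. This is only a minimal illustration under stated assumptions: the sigmoid activations, the cross-entropy loss, plain gradient descent as the "gradient method", the toy data, the learning rate, and every function name are hypothetical choices made for the example and are not taken from the implementation reported in the following section.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(layer_sizes, seed=0):
    """Random initial weights and zero biases for a fully connected network."""
    rng = np.random.default_rng(seed)
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    weights = [0.5 * rng.standard_normal((n_out, n_in)) for n_in, n_out in pairs]
    biases = [np.zeros((n_out, 1)) for _, n_out in pairs]
    return weights, biases


def forward(A0, weights, biases):
    """Forward pass; returns the activations of every layer (index 0 is the input)."""
    acts = [A0]
    for W, b in zip(weights, biases):
        acts.append(sigmoid(W @ acts[-1] + b))
    return acts


def cross_entropy(A, Y):
    eps = 1e-12  # avoids log(0)
    return float(-np.mean(Y * np.log(A + eps) + (1 - Y) * np.log(1 - A + eps)))


def backward(acts, weights, Y):
    """Chain rule from the output layer back to the first layer in `weights`."""
    m = Y.shape[1]
    grads_W, grads_b = [], []
    dZ = acts[-1] - Y  # sigmoid output combined with cross-entropy loss
    for l in reversed(range(len(weights))):
        grads_W.insert(0, dZ @ acts[l].T / m)
        grads_b.insert(0, dZ.sum(axis=1, keepdims=True) / m)
        if l > 0:
            # propagate through layer l, then through the sigmoid of layer l-1
            dZ = (weights[l].T @ dZ) * acts[l] * (1 - acts[l])
    return grads_W, grads_b


def train_standard(X, Y, layer_sizes, lr=0.1, iterations=200, seed=0):
    """Standard loop: forward pass, all gradients, update all weights, repeat."""
    weights, biases = init_params(layer_sizes, seed)
    losses = []
    for _ in range(iterations):
        acts = forward(X, weights, biases)
        losses.append(cross_entropy(acts[-1], Y))
        grads_W, grads_b = backward(acts, weights, Y)
        for l in range(len(weights)):
            weights[l] -= lr * grads_W[l]
            biases[l] -= lr * grads_b[l]
    return weights, biases, losses


# Toy data (an assumption for the example): 40 points, label 1 when x0 + x1 > 0.
rng = np.random.default_rng(1)
X = rng.standard_normal((2, 40))
Y = (X[0:1] + X[1:2] > 0).astype(float)
weights, biases, losses = train_standard(X, Y, [2, 4, 1])
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Each pass through the loop in `train_standard` uses the weights produced by the previous pass, which is why the iterations have to run in sequence.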
As mentioned before, this process cannot be vectorized across iterations: each iteration depends on the weights produced by the previous one, so the iterations must take place one at a time on the CPU. My proposed method begins by making some alterations to the backpropagation algorithm.
Training an NN begins by initializing the weights randomly. Because these weights are random, they may or may not be close to the optimal weights one desires; the closer they are, the fewer iterations are needed to train the model. In standard backpropagation, the gradients must be computed all the way back to the first weights of the model before the parameters are updated. If the random initial weights are far from the optimal ones, the weights of the first layers can therefore approach their optimal values very slowly. The reason is the chain rule: the gradient of an early layer depends on the gradients of all the layers after it, so every iteration must compute all the gradients, update all the weights, and run a forward pass through the whole model to obtain a new loss before new gradients can be calculated, which can be very time-consuming.
In my proposed method, the weights are updated one layer at a time, from the last layer back to the first. After a forward pass, the gradients of the last layer are calculated and its weights are updated. Then another forward pass, through the last layer only, is performed to obtain a new loss. Using this new loss, a backpropagation step computes the gradients of the last layer and the layer before it, the weights of these two layers are updated, and another forward pass through only these two layers is performed. The same process repeats, each time including one more layer going backward, until the first (input) layer is reached and every layer has been updated at least once. Then another forward pass, this time through the whole model, is performed, and the loss it yields is used to repeat the whole procedure and update all the parameters once more.
A simple illustration of this technique is presented in Figure 3.
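The steps above can also be sketched in NumPy. This is one possible reading of the procedure, not the implementation evaluated in the following section. It assumes that a forward pass "using only the last k layers" reuses the cached activations that feed those layers, which remain valid because the earlier layers have not changed yet. The sigmoid activations, the cross-entropy loss, the toy data, the learning rate, and all function names are likewise hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(layer_sizes, seed=0):
    """Random initial weights and zero biases for a fully connected network."""
    rng = np.random.default_rng(seed)
    pairs = list(zip(layer_sizes[:-1], layer_sizes[1:]))
    weights = [0.5 * rng.standard_normal((n_out, n_in)) for n_in, n_out in pairs]
    biases = [np.zeros((n_out, 1)) for _, n_out in pairs]
    return weights, biases


def forward(A0, weights, biases):
    """Forward pass through the given layers; index 0 of the result is A0."""
    acts = [A0]
    for W, b in zip(weights, biases):
        acts.append(sigmoid(W @ acts[-1] + b))
    return acts


def cross_entropy(A, Y):
    eps = 1e-12  # avoids log(0)
    return float(-np.mean(Y * np.log(A + eps) + (1 - Y) * np.log(1 - A + eps)))


def backward(acts, weights, Y):
    """Chain rule from the output back to the first layer in `weights`."""
    m = Y.shape[1]
    grads_W, grads_b = [], []
    dZ = acts[-1] - Y  # sigmoid output combined with cross-entropy loss
    for l in reversed(range(len(weights))):
        grads_W.insert(0, dZ @ acts[l].T / m)
        grads_b.insert(0, dZ.sum(axis=1, keepdims=True) / m)
        if l > 0:
            dZ = (weights[l].T @ dZ) * acts[l] * (1 - acts[l])
    return grads_W, grads_b


def train_layerwise(X, Y, layer_sizes, lr=0.1, sweeps=200, seed=0):
    """Each sweep updates the last layer, then the last two, and so on."""
    weights, biases = init_params(layer_sizes, seed)
    n_layers = len(weights)
    losses = []
    for _ in range(sweeps):
        acts = forward(X, weights, biases)  # forward pass through the whole model
        losses.append(cross_entropy(acts[-1], Y))
        for k in range(1, n_layers + 1):
            start = n_layers - k  # index of the first layer in the current group
            grads_W, grads_b = backward(acts[start:], weights[start:], Y)
            for i in range(k):
                weights[start + i] -= lr * grads_W[i]
                biases[start + i] -= lr * grads_b[i]
            # partial forward pass: recompute only the last k layers,
            # starting from the cached activations that feed them
            acts[start:] = forward(acts[start], weights[start:], biases[start:])
    return weights, biases, losses


# Toy data (an assumption for the example): 40 points, label 1 when x0 + x1 > 0.
rng = np.random.default_rng(1)
X = rng.standard_normal((2, 40))
Y = (X[0:1] + X[1:2] > 0).astype(float)
weights, biases, losses = train_layerwise(X, Y, [2, 5, 4, 3, 1])
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Under this reading, one sweep over a network with L layers updates the last layer L times and the first layer once, for L(L+1)/2 layer updates in total (10 for the four-layer example here), while only one forward pass per sweep runs through the whole model from the input.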
This process has two advantages. First, the parameters are updated layer by layer, which reduces the number of iterations needed to reach the optimal weights and, in turn, the training time. Second, each “vector” given to the CPU contains more than one iteration of the model, which also reduces the time needed to obtain the desired optimal weights. Although this is not the same as vectorizing the code, it speeds up the training process to a significant extent. In the following section, I implement this technique in a four-layer NN built for the cat classification task and report the results for both standard backpropagation and the new propagation technique, so that the reader can compare them and gain a better intuition for the proposed method.