Cyclical Learning Rates (CLRs) for Improving Training Accuracies and Lowering Computational Cost

Prediction of different lung pathologies from chest X-ray images is a challenging task requiring robust training and testing accuracies. In this article, one-class classifier (OCC) and binary classification algorithms have been tested to classify 14 different diseases (atelectasis, cardiomegaly, consolidation, effusion, edema, emphysema, fibrosis, hernia, infiltration, mass, nodule, pneumonia, pneumothorax and pleural thickening). We have utilized 3 different neural network architectures (MobileNetV1, AlexNet, and DenseNet-121) with three different optimizers (SGD, Adam, and RMSProp) to compare the best possible accuracies. Cyclical learning rates (CLRs), a hyperparameter tuning technique, were found to give faster convergence of the cost towards the minimum of the cost function. Here, we present a unique approach of re-training, with CLRs, binary classification models previously trained with a learning rate decay technique. Doing so, we found significant improvement in training accuracies for each of the selected conditions. Thus, utilizing CLRs in callback functions seems a promising strategy for image classification problems.

'Pneumothorax', 'Pleural Thickening' and 'Hernia'. Metadata associated with the image dataset consist of the patient's age, gender, unique patient ID, and the view position (anterior-posterior or posterior-anterior) of the X-ray image. All methods involving human participants were performed in accordance with the relevant guidelines and regulations.

A2. Exploratory data analysis:
From the total set, 60,361 images have the label 'No Finding' (healthy), while the others carry one or more of the 14 disease classes, in various combinations. Overall, this amounts to around 836 unique labels: a unique label can be any of the 14 primary classes ('No Finding' excluded) or any combination of them. Figure 1 depicts the distribution of these labels across the 15 primary classes.
One-hot encoding was applied to convert the 836 unique labels to 15 primary class labels [9]. A comparison of the number of images in the 15 primary classes before and after one-hot encoding is shown in Table 1; a plot of the number of images after one-hot encoding is shown in figure 2. Dynamic batch training was utilized to decrease computational time and memory. Based on optimal performance, an iterative loop of 32 images/batch was used for training until all the images were exhausted (Table 3). Apart from utilizing less memory, this method keeps fewer errors in memory for updating hyperparameters through backpropagation, which increases the training speed drastically. High-resolution X-ray training images give higher fractional improvements in area under the curve (AUC) [11] and can also help localize a disease pattern. To keep the data balanced, the dataset for the one-class classifier contains 2,800 images of the "No Finding" class and 200 images from each disease class. We again chose a 1:4 split of test to training set to be consistent with the binary classifiers. Further, preprocessing through the ImageDataGenerator class with the same parameters as the binary classifiers was performed for this split. Dynamic training with an optimal batch of 16 images/batch was performed.
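The label-flattening step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the class list follows the 14 conditions named in the text plus 'No Finding', and it assumes labels arrive as '|'-separated strings, as in the ChestX-ray14 metadata.

```python
# Map a multi-label string such as 'Effusion|Mass' to a 15-dimensional
# binary vector over the 14 disease classes plus 'No Finding'.
# Class names and helper names are illustrative.
CLASSES = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema", "Effusion",
    "Emphysema", "Fibrosis", "Hernia", "Infiltration", "Mass", "Nodule",
    "Pleural_Thickening", "Pneumonia", "Pneumothorax", "No Finding",
]

def one_hot(label_string):
    """Convert e.g. 'Effusion|Mass' into a 0/1 vector over CLASSES."""
    present = set(label_string.split("|"))
    return [1 if c in present else 0 for c in CLASSES]
```

A multi-label image thus contributes a 1 in each of its primary-class positions, which is how the 836 combined labels collapse onto the 15 primary classes counted in Table 1.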

A4. Model architectures for binary & one-class classifiers:
A4.1. Binary classifier: A 2D convolutional neural network is applied using the MobileNetV1 network architecture [12]. The model parameters of MobileNet previously trained on ImageNet have been utilized via transfer learning (Figure 3).
For MobileNetV1, the previously trained ImageNet weights are passed through a global average pooling layer, taking the average of each feature map instead of adding fully connected layers. This technique helps to interpret feature maps as category confidence maps, reduces overfitting, and is more robust to spatial translations of the input as it sums out the spatial information [13]. To further reduce overfitting, a dropout regularization layer that drops ~50% of the input units for variance reduction has been applied after the global average pooling layer. The model then passes through 4 dense layers with 250, 50, 10, and 2 output nodes and linear activation functions. In each dense layer, L1 and/or L2 regularization is applied to the layer's kernel, bias, and activity: a kernel regularizer with L1 and L2 penalties of 0.001 and 0.01 respectively is applied to the layer's kernel, a bias regularizer with an L2 penalty of 0.01 is applied to the layer's bias, and an activity regularizer with an L2 penalty of 0.001 is applied to the layer's output. After each dense layer, batch normalization is used to stabilize the learning process and dramatically reduce the number of training epochs required. Finally, the architecture is completed with a dense layer comprising a sigmoid activation function and 1 output node. The stochastic gradient descent (SGD) optimizer with learning rate decay has been used to train the model, as it gave superior performance compared to the RMSProp and Adam optimizers for all the classifiers except "Hernia"; the Adam optimizer with a learning rate of 0.01 was found to perform better for "Hernia". A momentum parameter has been used to help accelerate gradient vectors in the right direction (Table 4). False-positive predictions arise when the algorithm is unable to identify the "No Finding" class, a problem falling under the category of "Anomaly Detection".
A one-class classifier is an unsupervised learning algorithm focusing on the problem of anomaly detection [14]. The model contains a negative class (inlier or normal class) and a positive class (outlier or anomaly class). In our case, the normal or inlier class is the "No Finding" class; the anomaly class is formed by combining 200 images of each disease class. The benefit of this approach is that if the prediction/test image fed to the algorithm is not from any of the 14 disease classes, it will still be categorized as an "Anomaly", simply because the algorithm could not classify it as a "No Finding" image. If the algorithm classifies an image with a disease other than these 14 diseases as "No Finding", it gives rise to a false negative prediction. The one-class classifier thus serves to address both false positive and false negative predictions. The model architecture for the one-class classifier is the same as the binary classifier.
A5. Cyclical learning rates: The first step in applying CLRs is to define a maximum learning rate and a base learning rate [4]; the learning rate is then allowed to vary between the two. We have utilized the learning rate finder technique (described in section A6) to decide the maximum and base learning rates. For one condition, the "Pneumothorax" binary classifier, maximum and base learning rates of 0.03 and 0.0075 respectively were obtained using the learning rate finder. The step size is an important parameter: it is simply the number of batches over which the learning rate moves from the base learning rate to the maximum learning rate, or vice-versa, i.e. the number of training batches in half a cycle. Typically, a step size of 2-8 times the number of training batches in 1 epoch is ideal [4]. For "Pneumothorax", the total number of training batches in 1 epoch is 541, so a step size of 1082 (two epochs) was used. Finally, a mode policy needs to be defined for calculating learning rates: the mode is the pattern in which the learning rate varies within the bounds of the maximum and minimum learning rates. The "triangular" policy for the "Pneumothorax" binary classifier is shown in figure 4. The learning rate monotonically increases from the base learning rate to the maximum learning rate over two epochs and decreases back to the base learning rate over the next two epochs. Since the "Pneumothorax" model with the CLR technique and "triangular" policy is trained for 36 epochs, a total of 9 full cycles can be observed in figure 4.
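The "triangular" policy can be written compactly following Smith's formulation [4]. The defaults below are the "Pneumothorax" settings from the text (base 0.0075, maximum 0.03, step size 1082 batches); this is a sketch of the schedule itself, not the callback wiring.

```python
import math

# Triangular CLR: the learning rate ramps linearly from base_lr to
# max_lr over step_size batches, then back down, repeating every
# 2 * step_size batches.
def triangular_lr(iteration, base_lr=0.0075, max_lr=0.03, step_size=1082):
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)
```

With 541 batches per epoch and step_size = 1082, the peak is reached after exactly two epochs and the schedule returns to the base rate after four, matching the 9 full cycles seen over 36 epochs.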
We have also, in parallel, utilized a more complex policy we call the "modified triangular2" policy. As in the "triangular2" policy, the peak learning rate decays each cycle to the midpoint between the base learning rate and the previous peak; unlike "triangular2", however, after 3 complete cycles training continues with the original maximum learning rate obtained from the learning rate finder technique, and this process repeats until the whole training is exhausted. In the "Pneumothorax" binary classifier, the maximum learning rate in the first cycle is 0.03 from the learning rate finder, followed by a second cycle with a maximum learning rate of 0.01875 and a third cycle with a maximum learning rate of 0.013125 (figure 5), after which the peak resets to 0.03.
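The peak schedule implied by those numbers can be reconstructed as follows. This is our reading of the "modified triangular2" policy, not the authors' code: the peak decays to the midpoint between the base rate and the previous peak each cycle (0.03 → 0.01875 → 0.013125 for "Pneumothorax") and resets to the original maximum every third cycle.

```python
# Peak learning rate for a given 0-based cycle index under the
# "modified triangular2" policy as described in the text. Within each
# cycle the learning rate still follows the triangular shape between
# base_lr and this peak.
def modified_triangular2_max(cycle_index, base_lr=0.0075, max_lr=0.03):
    k = cycle_index % 3  # amplitude halves each cycle, resets every 3rd
    return base_lr + (max_lr - base_lr) / (2 ** k)
```

The halving of the amplitude (max_lr - base_lr), rather than of max_lr itself, is what reproduces the reported second-cycle peak of 0.01875 instead of 0.015.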
A6. Learning rate finder: The upper and lower bounds of the CLR have been determined by the learning rate finder technique, which locates where the cost function is minimum. Training the model with a learning rate finder as a callback for 1-5 epochs was enough to find the learning rate with minimum cost. For the "Pneumonia" binary classifier, the minimum and maximum learning rates swept were 1e-7 and 1 respectively (figure 6). The learning rate increases exponentially after each batch, starting from the minimum learning rate. The "Pneumothorax" model's loss vs. learning rate curve, trained for 10 epochs, shows minimum loss at a learning rate of 3e-2 (figure 6); the loss increased as the learning rate approached 1. The base learning rate for CLR can be taken as one-fourth of the maximum learning rate [4].
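The exponential sweep at the heart of the finder can be sketched as below; the function name and the batch count are illustrative. The batch loss is recorded at each of these rates, and the rate with minimum loss (3e-2 for "Pneumothorax") becomes the CLR maximum, with the base taken as one-fourth of it.

```python
# Learning rate finder sweep: grow the learning rate exponentially from
# min_lr to max_lr over num_batches batches, one value per batch.
def finder_lr(batch, num_batches, min_lr=1e-7, max_lr=1.0):
    growth = (max_lr / min_lr) ** (1.0 / (num_batches - 1))
    return min_lr * growth ** batch
```

An exponential (rather than linear) sweep spends comparable numbers of batches in each decade of learning rates, which is what makes the loss-vs-learning-rate curve readable on a log axis.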
A7. With binary classifiers, CLRs out-perform normal training with a learning rate decay policy: We have run 3 model architectures (MobileNetV1, AlexNet, and DenseNet121) to compare the performance (computational cost & accuracy) of the classifiers [15,16]. MobileNetV1 with an SGD optimizer was found to be the most efficient; DenseNet121 had good accuracy but significantly higher computational cost, and AlexNet had significantly lower accuracies when trained for the same number of epochs (Table 5). The problem of false-positive and false-negative predictions was addressed using one-class classifiers. For the models of "Infiltration", "Atelectasis", "Fibrosis" & "Pneumothorax", the accuracies were consistently low after training for the selected number of epochs, so we chose these conditions to test CLRs on (Table 6). A selected model trained for 32 epochs using CLRs with a maximum learning rate of 0.1, a base learning rate of 0.025, a step size of 2, and a "triangular" policy provided a final training accuracy of 83.01%. CLRs showed improved accuracy and a lower computational cost compared to training a network with constant learning rates. The "Pneumothorax" model was found to perform best when CLR is used with a "triangular" policy. As shown in figure 7, it took 47 epochs for the model with a constant learning rate to reach an accuracy of 79.26%. The model with CLR using the "modified triangular2" policy crossed the 79.26% accuracy level at epoch 38 and reached an accuracy of 80.92% in 41 epochs, while the "Pneumothorax" model with CLR using a "triangular" policy crossed the 79.26% accuracy level in just 36 epochs, achieving a final accuracy of 79.83%. The loss as a function of the number of epochs decreased faster with CLRs under both the "triangular" and "modified triangular2" policies (Figure 8).
The loss of the "Pneumothorax" model with CLR decreased more quickly than that of the "Pneumothorax" model with a constant learning rate.
The "Fibrosis" model was found to give better results in the case of the CLR technique with a "modi ed triangular2" policy. A comparison of " brosis" model trained for 32 epochs is shown in gure 9. The model reached an accuracy of 85.04% in 32 epochs when trained with a constant learning rate policy.
The model reached an accuracy of 86.96% in 32 epochs when trained with CLR using a "triangular" policy, crossing the 85% accuracy level in 30 epochs. With CLR using the "modified triangular2" policy, it reached an accuracy of 88.15% in 32 epochs, crossing the 85% accuracy level in just 25 epochs. The loss was consistently lower with the "modified triangular2" policy (figure 10).

Discussion & Future Scope:
Networks built on depthwise separable convolutions, like MobileNets, have been gradually pruned to improve the speed of dense networks [17]. MobileNetV1 with ImageNet weights and an SGD optimizer is found to outperform the other optimizers and architectures in terms of training time taken and accuracy attained. Achieving a high test accuracy depends directly on the learning rate hyper-parameter when training neural networks [18,19,20,21]. Three forms of triangular CLRs have been stated to accelerate neural network training [18,19]. Further, tuning the batch size hyper-parameter to adjust learning rates has also been shown to improve learning accuracy [22]. Hyperparameter tools like Hyperopt, SMAC, and Optuna, using grid search, random search, and Bayesian optimization, have been found efficient in tuning batch sizes [23,24]. To the best of our knowledge, our work is the first to present a comprehensive characterization of the effect of CLRs on the training and testing accuracy of dense network models. In general, training any model with a CLR technique is found to perform better than training with a constant learning rate. For the "Pneumothorax" binary classifier, the CLR technique with the "triangular" policy is found to outperform both CLR with the "modified triangular2" policy and constant learning rate training. For the "Fibrosis" binary classifier, CLR with the "modified triangular2" policy was found to give better results than the other two policies.
Primarily, we found two main advantages of training with CLRs over constant learning rates: first, with decaying learning rates the model can get stuck at saddle points or in local minima because of low learning rates, whereas CLRs periodically raise the learning rate again; secondly, CLRs reduce the effort of choosing an optimal learning rate by trial and error. A poor choice of initial learning rate can make the model oscillate indefinitely. In setting a learning rate, there is a trade-off between the rate of convergence and overshooting: a high learning rate will make the learning jump over minima, while a learning rate that is too low will either take too long to converge or get stuck in an undesirable local minimum [25]. CLRs cyclically provide higher learning rates too, which helped the model jump out of local minima of the cost function. With these findings, implementing CLRs for improving prediction accuracies seems a promising strategy for object detection and machine translation.

Declarations
Supplementary data: None