Reliable adaptive distributed hyperparameter optimization (RadHPO) for deep learning training and uncertainty estimation

Training and validating Neural Networks (NN) is very computationally intensive. In this paper, we propose a distributed-system-based NN infrastructure that achieves two goals: to accelerate model training, specifically for hyperparameter optimization, and to re-use some of the intermediate models to evaluate the uncertainty of the final model. By accelerating model training, we can obtain a large set of candidate models and compare them in a shorter amount of time. Automating this process reduces development time and provides an easy way to compare different models. Our application runs different models on distinct servers with a single training data set, each with tweaked hyperparameters. By adding uncertainty to our results, our framework provides not just a single prediction but a distribution over predictions. Adding uncertainty is essential to some NN applications, since most models assume that the input data distributions are identical between training and test time; in reality they often differ, which can produce catastrophic mistakes. Since our solution is a distributed system, we make our implementation robust to common distributed-system failures (servers going down, loss of communication among some nodes, and others). Furthermore, we use a gossip-style heartbeat protocol for failure detection and recovery. Finally, preliminary results using a black-box approach to generate the training models show that our infrastructure scales well on different hardware platforms.


Introduction
Neural Networks (NN) have been hugely successful in many different tasks, from winning the game of Go against professional players to early skin cancer detection. Traditionally, researchers design and train a NN by setting values for tunable configuration parameters (hyperparameters) and then using gradient descent to iteratively optimize the model's weights. Unfortunately, when configuring a NN model for a machine learning application, there is often no clear-cut optimal way to select values for hyperparameters, including the structure of the model (number of layers, number of neurons per layer, type of each layer, and others). Different base model structures can train at different rates and classify data with differing accuracy and precision. Therefore, it is often beneficial for scientists training a machine learning application to compare model performance under different parameters and model configurations. For example, when training a sequential model, programmers can choose to compare the performance of standard Recurrent Neural Network (RNN) layers versus Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU) layers. They can also test the performance of any of these against 1-D convolutional layers or more complex architectures such as transformers. Even within each layer, further optimizations can be made by tuning different hyperparameters, such as the learning rate, error function, batch size, or number of epochs. Unfortunately, there is currently no best way to choose these, so hyperparameter tuning often consists of guessing values and checking whether changing a hyperparameter affects the model's performance; every machine learning group has its initial favorites to try. The problem is that, since training a model can be very time-consuming, making one small change and waiting to see how it affected performance is undesirable for someone looking to tune their model efficiently.
Hyper-parameter optimization (HPO) for NN has been studied before. There are currently four leading families of solutions: random search, multi-fidelity stochastic, multi-fidelity non-stochastic, and population-based training (PBT). The work in [1] summarizes these main research tracks. Random search is the easiest to understand and implement, since it simply selects the different hyper-parameters at random. In [2], grid search (exhaustive search) is compared to randomly chosen trials, and the authors conclude that random search is more efficient for hyperparameter optimization. Grid and random search assume that trials are independent, so these solutions are easy to parallelize. Multi-fidelity stochastic methods are mainly based on Bayesian optimization (BO), a powerful solution for varied design problems, including HPO. The paper [3] presents a summary of how BO can be used for different tasks, with specific examples in Automatic Machine Learning and hyper-parameter tuning. BO is an iterative algorithm with two key ingredients: a probabilistic surrogate model and an acquisition function that decides which point to evaluate next. In each iteration, the surrogate model is fitted to all observations of the target function; then the acquisition function, which uses the predictive distribution of the probabilistic model, scores the utility of different candidate points, trading off exploration and exploitation. Compared to evaluating the expensive black-box function, the acquisition function is cheap to compute and can be thoroughly optimized. Bayesian optimization is sample-efficient, but it is inherently sequential, although there are some attempts at distributing it [4,5]. Non-stochastic multi-fidelity solutions, like Hyperband [6], focus on speeding up random search through adaptive resource allocation and early stopping. In [6], researchers present the Asynchronous Successive Halving Algorithm (ASHA), a practical and straightforward hyper-parameter optimization method suitable for massive parallelism that exploits aggressive early stopping. The algorithm is inspired by the Successive Halving Algorithm (SHA) described in [7]. Some proposed solutions mix Bayesian optimization and halving algorithms; one of these is presented in [8], where results are thoroughly evaluated on a diverse set of benchmarks to demonstrate improved performance compared to a wide range of other state-of-the-art approaches. Population-Based Training (PBT), as explained in [9], exploits partial training to iteratively increase the fitness of a population of models. PBT starts by training many NN models in parallel with random hyperparameters. Instead of training them independently, it uses information from the rest of the population to refine the hyper-parameters and direct resources to the most promising models: a worker node can copy model parameters from a better-performing worker and explore new hyperparameters by randomly perturbing the values.
Some other approaches use machine learning algorithms for HPO. For example, in [10] a recurrent neural network (RNN) is used to generate model descriptions of neural networks, and this RNN is trained with reinforcement learning to maximize the expected accuracy of the generated architectures on a validation set. On the CIFAR-10 dataset, starting from scratch, this method can design a novel network architecture that rivals the best human-invented architecture in test set accuracy. The problem with this approach is how to select the hyper-parameters for the controller RNN itself.
The main contribution of this paper is to provide an easy-to-use distributed implementation of Hyper Parameter Optimization (HPO) for Machine Learning (ML). The proposed solution is robust (resilient to errors in the hardware or the network), easy to parallelize, and scales to the system resources available to the scientist, which can be a laptop with a multi-core CPU, a graphics accelerator (GPU), or multiple computers in a distributed cluster. Furthermore, our proposed HPO framework outputs not only the best model and its accuracy but also the uncertainty of the predicted output. Finally, we provide the code for scientists to replicate our results or extend the framework with different HPO algorithms of their interest (Appendix 5.5).
Figure 1 shows the framework for our HPO solution. We explain each of the leading implementation details in Sect. 2; specifically, the distributed-system features based on gossip protocols are described in Sect. 2.3, while the HPO features are explained in Sect. 2.4. The rest of the paper is organized as follows: Sect. 2 describes the implementation, results are presented in Sect. 3, and Sect. 4 presents our conclusions and future work.

System design
In this section, we describe our proposed implementation. We decided from the start that our HPO algorithm would have distributed-system features built in. Most published applications concentrate on explaining their HPO algorithm; they then mention how easy (or not) it would be to distribute it, providing few actual numbers about scalability or guidance on how to use it in a distributed manner. To avoid treating distribution as an afterthought, we added these features from the start, always keeping in mind the ability to scale the HPO implementation to the amount of hardware (CPU/GPU computation) the user may have. To do this, we chose a programming language for the HPO solution that allows us to express distributed and parallel paradigms and run machine learning algorithms. We chose GoLang because it has native concurrency support and provides channels, which we leverage for all communication. There is also support in GoLang for NN, although not as well tested as some other machine learning libraries written in Python. Python performs worse than GoLang for most applications, as it is an interpreted language; while this depends on the implementation, it remains true for machine learning libraries. The primary benefit of using Python for some parts of this project is its more extensive user base and its public libraries, which are widely used in machine learning. However, due to Python's poor concurrency support, it is inefficient to use Python to implement the distributed-system capabilities.

Fig. 1 Architectural framework
We decided to implement two different versions of our HPO framework: one that only uses GoLang (distributed and ML components) and another that uses GoLang for the distributed components and Python for machine learning. The expectation is that an all-GoLang implementation will be fast, while a GoLang-Python hybrid will be more flexible for the machine learning scientist, who only needs to provide a Python script as input without changing any previously written code. The project implementation is divided into four main parts: the user interface (Sect. 2.2), the distributed-system features (Sect. 2.3), the machine learning and HPO features (Sect. 2.4), and the uncertainty estimation features (Sect. 2.5).

Graphical user interface (GUI)
One of the features most requested by ML scientists for HPO implementations is user-friendliness. To address that, we provide a GUI (Fig. 2). Our solution allows the user to customize a configuration file specifying the train/test datasets to use, the number of trials (distinct models to try per process), the optimization algorithm (grid search, random search, or Bayesian optimization), the different hyperparameters (learning rate, epochs, and others), and the Python script the user would like to tune. The user can either create a new JSON configuration file using the client program or provide an existing configuration file to be run through the interface. The application outputs a list of the trials, each trial's respective accuracy, and, finally, the most optimal configuration.
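As an illustration, the configuration could be modeled on the GoLang side roughly as follows; the field names and JSON schema here are hypothetical sketches, not the exact format our client program generates:

package main

import (
    "encoding/json"
    "fmt"
)

// Config sketches what such a configuration file might contain; the field
// names and schema are hypothetical, not our exact JSON format.
type Config struct {
    TrainData     string    `json:"train_data"`
    TestData      string    `json:"test_data"`
    Trials        int       `json:"trials"`    // distinct models to try per process
    Optimizer     string    `json:"optimizer"` // "grid", "random", or "bayesian"
    Script        string    `json:"script"`    // Python script to tune
    LearningRates []float64 `json:"learning_rates"`
    Epochs        []int     `json:"epochs"`
}

func main() {
    raw := []byte(`{
        "train_data": "mnist_train.csv", "test_data": "mnist_test.csv",
        "trials": 12, "optimizer": "random", "script": "train.py",
        "learning_rates": [0.001, 0.1], "epochs": [5, 20]}`)
    var cfg Config
    if err := json.Unmarshal(raw, &cfg); err != nil {
        panic(err)
    }
    fmt.Printf("%+v\n", cfg)
}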

Distributed system features
We built a system that is scalable, makes it easy to distribute machine learning training jobs, and is fault-tolerant, based on some of our previous work [11]. The system includes:
• Membership protocol: we use Paxos [12] with slight variations for our implementation.
• Gossip-based failure detection heartbeat protocol [13,14].
The back end takes a user's configuration and runs several neural networks with the same inputs but different hyperparameters to help compare the outcomes. We leverage the concurrency in GoLang [15] by generating a system of channels to send and receive messages between servers within the environment. To support failure detection, every worker process that the master starts to complete a training job is started with its own heartbeat table. We use a gossip approach [13] to implement failure detection: every worker node updates its heartbeat and monitors two neighbor nodes' heartbeat tables. If a node is identified as failed, it is labeled as "died," and its job is sent to another worker process. For consensus, our implementation uses a Paxos-style election algorithm [12], employing one master (proposer) process, two shadow master processes, and multiple worker (acceptor) processes to execute functions in parallel while providing fault tolerance. The master process distributes the training jobs to workers, monitors work completion, and handles failed workers. The master process records its progress to an internal log saved to stable storage, ultimately providing a persistent state for recovery; this internal log and the values committed to the commit log are backed up to a file in the working directory. The two shadow masters are running processes that remain synchronized and ready to take over as the primary master when a proposer failure is detected. We select specific nodes (at random) identified as proposers or acceptors to achieve consensus using the previously mentioned Paxos-style algorithm; acceptors vote in the consensus algorithm. Once consensus has been reached, the proposal is committed to stable storage in a CommitLog, which is output to a file in the current directory. The file consists of lines of commits, formatted as (id command message), where id identifies the committed proposal and the rest corresponds to the function and arguments proposed but not yet adopted. The two shadow master servers read from the CommitLog as the persistent state after a failure/restart, ensuring the system resumes service exactly where it left off. If the master is not heard from within 2 s, a shadow master begins continuing the next task in sequence from the commit log, becoming the master. This ensures a consistent state through failures and provides recovery from a complete system shutdown. Figure 3 shows a summary diagram of the distributed-system features of the project.
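As an illustrative sketch (simplified relative to our actual implementation, with hypothetical names and timeouts), each node's heartbeat table can be maintained roughly as follows: the node periodically increments its own counter, merges the tables gossiped by its two neighbors, and marks as "died" any node whose counter has not advanced within a timeout.

package main

import (
    "fmt"
    "sync"
    "time"
)

// HeartbeatTable is an illustrative sketch of a gossip-style heartbeat
// table; names and timeouts are hypothetical, not our exact implementation.
type HeartbeatTable struct {
    mu       sync.Mutex
    counters map[int]int       // node id -> heartbeat counter
    lastSeen map[int]time.Time // node id -> last time its counter advanced
}

func NewTable() *HeartbeatTable {
    return &HeartbeatTable{counters: map[int]int{}, lastSeen: map[int]time.Time{}}
}

// Beat increments this node's own heartbeat counter.
func (t *HeartbeatTable) Beat(self int) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.counters[self]++
    t.lastSeen[self] = time.Now()
}

// Merge folds in a neighbor's gossiped counters, keeping the freshest.
func (t *HeartbeatTable) Merge(neighbor map[int]int) {
    t.mu.Lock()
    defer t.mu.Unlock()
    for id, c := range neighbor {
        if c > t.counters[id] {
            t.counters[id] = c
            t.lastSeen[id] = time.Now()
        }
    }
}

// Failed lists nodes whose counters have not advanced within the timeout;
// their jobs would then be reassigned to other workers.
func (t *HeartbeatTable) Failed(timeout time.Duration) []int {
    t.mu.Lock()
    defer t.mu.Unlock()
    var died []int
    for id, seen := range t.lastSeen {
        if time.Since(seen) > timeout {
            died = append(died, id)
        }
    }
    return died
}

func main() {
    table := NewTable()
    table.Beat(1)                        // our own heartbeat
    table.Merge(map[int]int{2: 5, 3: 7}) // gossip from two neighbors
    time.Sleep(20 * time.Millisecond)
    fmt.Println("died:", table.Failed(10*time.Millisecond))
}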

Hyper parameter optimization features
We added three different algorithms for HPO, which are explained in the following subsections.

Grid
In grid search, we try every combination of the set of hyper-parameters selected by the user. Our implementation of grid search selects hyperparameters differently for numeric (e.g., learning rate) and categorical (e.g., activation function) hyperparameters. For numeric parameters, we run several trials per hyperparameter; in each trial, we test a unique value, where the values are equally distributed across the hyperparameter range. For instance, let a numeric hyperparameter have range $[h_{min}, h_{max}]$ and let the number of trials-per-parameter be $t_p$. The set of hyperparameter values to be used is then

$h_v = \left\{ h_{min} + i \, \frac{h_{max} - h_{min}}{t_p - 1} \;:\; i = 0, \dots, t_p - 1 \right\},$

where $|h_v| = t_p$. The total number of trials is given by the number of hyperparameters and the number of trials per hyperparameter: letting $n$ be the number of hyperparameters and $t_p$ the number of trials per parameter, we then have $t_p^{\,n}$ trials in total. There are cases where the usage of a hyperparameter depends on the values of other hyperparameters. For instance, in the case of a "solvers" categorical hyperparameter $h_s = \{$"Momentum", "Adam", "RMSProp", "AdaGrad"$\}$, the learning-rate hyperparameter $h_{lr}$ is not used when $h_s =$ "Adam". To handle this, we adjust only the trials-per-parameter.
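A minimal sketch of the equally spaced value generation for one numeric hyperparameter (illustrative code, not our exact implementation):

package main

import "fmt"

// gridValues returns tp equally spaced values across [min, max],
// one per trial, as described above.
func gridValues(min, max float64, tp int) []float64 {
    if tp == 1 {
        return []float64{min}
    }
    step := (max - min) / float64(tp-1)
    vals := make([]float64, tp)
    for i := range vals {
        vals[i] = min + float64(i)*step
    }
    return vals
}

func main() {
    // e.g., learning rate in [0.001, 0.1] with 5 trials per parameter
    fmt.Println(gridValues(0.001, 0.1, 5))
}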

Random
In randomized search, we do not test all possible combinations as in grid search. Instead, we try random combinations of the values specified by the user, initializing the number of random configurations we want to test in the parameter space. This has been shown to be efficient, even in higher-dimensional optimization problems [2].
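As a sketch, drawing one random configuration from user-specified ranges and choices might look as follows (the parameter names here are illustrative):

package main

import (
    "fmt"
    "math/rand"
)

// randomConfig draws one configuration uniformly at random from the
// user-specified numeric range and categorical choices (a sketch).
func randomConfig(lrMin, lrMax float64, epochs []int, solvers []string) (float64, int, string) {
    lr := lrMin + rand.Float64()*(lrMax-lrMin)
    ep := epochs[rand.Intn(len(epochs))]
    sv := solvers[rand.Intn(len(solvers))]
    return lr, ep, sv
}

func main() {
    for i := 0; i < 3; i++ { // number of random configurations to test
        lr, ep, sv := randomConfig(0.001, 0.1, []int{5, 10, 20}, []string{"Adam", "RMSProp"})
        fmt.Printf("trial %d: lr=%.4f epochs=%d solver=%s\n", i, lr, ep, sv)
    }
}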

Bayesian optimization
Bayesian optimization (BO) [16] follows Bayes' theorem; that is, for some events A and B,

$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.$

This can be rewritten to represent hyperparameter optimization as

$P(s \mid H) = \frac{P(H \mid s)\, P(s)}{P(H)},$

for some objective function $f_o$, score $s = f_o(H)$, and set of hyperparameters $H$ [1]. For the implementation of BO, we use the GoLang library [17], since we need to create the samples and assign them to the different workers, and it is all the better if this is done in GoLang as well.
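To make the loop concrete, the following is a rough sketch of a BO iteration in GoLang under strong simplifying assumptions: the surrogate below (a distance-weighted mean with spread given by the distance to the nearest observation) is a toy stand-in for a proper probabilistic model such as a Gaussian process, and the acquisition function is a simple upper-confidence bound; neither is the mechanism used by the library [17].

package main

import (
    "fmt"
    "math"
    "math/rand"
)

type obs struct{ x, y float64 }

// surrogate is a toy stand-in for the probabilistic surrogate model:
// a distance-weighted mean of observed scores, with the spread taken as
// the distance to the nearest observation.
func surrogate(history []obs, x float64) (mu, sigma float64) {
    wsum, mean := 0.0, 0.0
    minD := math.Inf(1)
    for _, o := range history {
        d := math.Abs(o.x - x)
        w := 1.0 / (d + 1e-6)
        wsum += w
        mean += w * o.y
        if d < minD {
            minD = d
        }
    }
    return mean / wsum, minD
}

// acquisition trades off exploitation (mu) against exploration (sigma)
// with a simple upper-confidence bound.
func acquisition(mu, sigma float64) float64 { return mu + 2.0*sigma }

func main() {
    // Toy objective: the best "learning rate" is 0.03.
    objective := func(lr float64) float64 { return -math.Pow(lr-0.03, 2) }
    history := []obs{{0.001, objective(0.001)}, {0.1, objective(0.1)}}
    for iter := 0; iter < 10; iter++ {
        bestX, bestA := 0.0, math.Inf(-1)
        // The acquisition function is cheap, so we optimize it over many
        // candidates before each expensive objective evaluation.
        for i := 0; i < 100; i++ {
            x := 0.001 + rand.Float64()*(0.1-0.001)
            if a := acquisition(surrogate(history, x)); a > bestA {
                bestX, bestA = x, a
            }
        }
        history = append(history, obs{bestX, objective(bestX)})
    }
    best := history[0]
    for _, o := range history {
        if o.y > best.y {
            best = o
        }
    }
    fmt.Printf("best lr: %.4f (score %.6f)\n", best.x, best.y)
}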

Machine learning features
To test the HPO solutions described above, our implementation needs to be able to run and execute NN models. To do so, users have two options:
• Option One: provide the script they desire to tune in GoLang (using the GoNet or Gorgonia machine learning libraries).
• Option Two: provide the script in Python (using any machine learning library, including TensorFlow, PyTorch, and others).
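For Option Two, the GoLang side only needs to launch the user's script with a candidate hyperparameter set and collect the resulting score. A minimal sketch, assuming a hypothetical train.py that accepts --lr and --epochs flags and prints its test accuracy as the last token of its output (none of these names or conventions are mandated by our implementation):

package main

import (
    "fmt"
    "os/exec"
    "strconv"
    "strings"
)

// runTrial launches the user's training script with one hyperparameter
// configuration and parses the accuracy it is assumed to print as the
// last token of its output.
func runTrial(script string, lr float64, epochs int) (float64, error) {
    cmd := exec.Command("python3", script,
        "--lr", strconv.FormatFloat(lr, 'f', -1, 64),
        "--epochs", strconv.Itoa(epochs))
    out, err := cmd.Output()
    if err != nil {
        return 0, err
    }
    fields := strings.Fields(string(out))
    if len(fields) == 0 {
        return 0, fmt.Errorf("no output from %s", script)
    }
    return strconv.ParseFloat(fields[len(fields)-1], 64)
}

func main() {
    acc, err := runTrial("train.py", 0.01, 10)
    if err != nil {
        fmt.Println("trial failed:", err)
        return
    }
    fmt.Printf("accuracy: %.4f\n", acc)
}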

Adding uncertainty to the output
Uncertainty in ML is an important subject. Most algorithms assume that the input data distribution is identical between training and test time, but in reality it often is not. For example, if we train a traffic sign classifier, the model can output high confidence that a sign is a speed limit sign when in fact it is a stop sign that has been graffitied. ML often provides high-confidence output for out-of-distribution input that should have been classified as "I do not know." The objective of uncertainty estimation for ML is to provide not just a single prediction but a distribution over predictions. For example, the output of a classifier will be a label together with a confidence that allows the user to determine whether the model can be "trusted" or needs to defer to a human expert. Some of the most successful solutions to quantify uncertainty are based on ensembles [18], where an ensemble of models is trained with the same data and the results are aggregated. Ensembles perform model combination; i.e., they combine weak models to obtain a more powerful one. Since our framework already generates many intermediate models while performing HPO with Bayesian optimization and other methods, we use the M best of these generated models to return not only the best parameters but also the estimated uncertainty. For classification problems, with input $x$ and label $y$, the NN models the probabilistic predictive distribution $p_\theta(y \mid x)$, where $\theta$ are the parameters of the NN. In our project, we let M be the number of NN models in the ensemble and $\{\theta_m\}_{m=1}^{M}$ their parameters. We use the entire training set to train each NN in parallel and combine the predictions (as described in [19]):

$p(y \mid x) = M^{-1} \sum_{m=1}^{M} p_{\theta_m}(y \mid x, \theta_m),$

which, for classification, corresponds to averaging the predicted probabilities. The user can decide the parameter M. An example of the output for an image with high uncertainty can be seen in Fig. 4a: the handwritten digit is a bit confusing even for humans; we would most probably guess a four, but with some doubt. Our model outputs the same: the label is four, but it comes with an uncertainty of 0.4. By contrast, the unambiguous image in Fig. 4b produces very low uncertainty. Depending on the application, the user can decide what to do with outputs of high uncertainty: should they be ignored, or deferred to an "expert" (e.g., the car driver) to evaluate the image and decide whether to stop or go on.
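A minimal sketch of this combination step for a classifier follows; here one minus the averaged winning probability is used as an illustrative uncertainty score (an assumption for this example, in the spirit of the output shown in Fig. 4).

package main

import "fmt"

// ensemblePredict averages per-model class probabilities and returns the
// winning label plus a simple uncertainty score (1 - averaged confidence).
// The uncertainty definition here is an illustrative choice.
func ensemblePredict(probs [][]float64) (label int, uncertainty float64) {
    numClasses := len(probs[0])
    avg := make([]float64, numClasses)
    for _, p := range probs { // p(y|x) = (1/M) * sum_m p_m(y|x, theta_m)
        for c, v := range p {
            avg[c] += v / float64(len(probs))
        }
    }
    best := 0
    for c, v := range avg {
        if v > avg[best] {
            best = c
        }
    }
    return best, 1 - avg[best]
}

func main() {
    // Three ensemble members, each predicting probabilities for 3 classes.
    probs := [][]float64{
        {0.5, 0.3, 0.2},
        {0.7, 0.2, 0.1},
        {0.6, 0.3, 0.1},
    }
    label, u := ensemblePredict(probs)
    fmt.Printf("label=%d uncertainty=%.2f\n", label, u)
}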

Results
To test the implementation's validity, we use a standard dataset for the automatic classification of handwritten characters [20]. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The digits (size 28 × 28) have been size-normalized and centered in a fixed-size image.
Some researchers have achieved "near-human performance" on the MNIST database using convolutional neural networks; MNIST is often considered the "hello, world!" of machine learning. We settled on a set of adjustable hyper-parameters and their ranges, which can be changed at any time by the user.

Execution time
We used three different computers to produce the execution times. The first is a laptop with a configuration similar to what many users own, the second is a scientific desktop computer with 2 GPUs, and the third is a supercomputing cluster (Argonne National Lab). These three configurations demonstrate that our implementation can work in environments commonly available to users and that we can scale to all available cores and nodes. To achieve full multicore utilization, we use GoRoutines; a GoRoutine is a lightweight thread managed by the Go runtime. GoRoutines run in the same address space, so access to shared memory must be synchronized; they are incredibly cheap compared to traditional threads, with very low overhead. The HPO algorithm (Sect. 2.4) uses all available cores through these GoRoutines; it does not use GPUs for the HPO search itself, since that did not produce better execution times (the hyperparameter values must first be computed before the different candidate configurations can be submitted to the different GPUs). We found that our problem is better distributed by using all available cores to compute the hyperparameters and the GPUs to do the NN training. This solution also has the advantage of not changing the NN script provided by the machine learning scientist, since the libraries they use to create their models (TensorFlow, PyTorch, and others) already distribute the work to the GPU and do not need to be changed. To scale to the different nodes in a cluster, we use our library [21], which assigns GoRoutines to different nodes on the system.
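A rough sketch of this fan-out of trials across all cores, with a stub standing in for the actual NN training call (a simplification of our implementation, which also distributes across nodes [21]):

package main

import (
    "fmt"
    "runtime"
    "sync"
)

// trainStub stands in for a full NN training run on one configuration.
func trainStub(lr float64) float64 { return 0.99 - (lr-0.03)*(lr-0.03) }

func main() {
    configs := []float64{0.001, 0.005, 0.01, 0.03, 0.05, 0.1}
    jobs := make(chan float64, len(configs))
    type result struct{ lr, acc float64 }
    results := make(chan result, len(configs))

    var wg sync.WaitGroup
    for w := 0; w < runtime.NumCPU(); w++ { // one worker per available core
        wg.Add(1)
        go func() {
            defer wg.Done()
            for lr := range jobs {
                results <- result{lr, trainStub(lr)}
            }
        }()
    }
    for _, lr := range configs {
        jobs <- lr
    }
    close(jobs)
    wg.Wait()
    close(results)

    best := result{acc: -1}
    for r := range results {
        if r.acc > best.acc {
            best = r
        }
    }
    fmt.Printf("best lr=%.3f acc=%.4f\n", best.lr, best.acc)
}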

Laptop computer
AMD Ryzen 7 processor with 8 CPU cores, 32 MB L3 cache, and 3.6 GHz clock speed, running the Windows 10 operating system. Table 1 shows the execution times (the time it took to get the optimal configuration) and the accuracy obtained for that optimal configuration, for both Random search and Bayesian optimization.
Due to the limited computing power of a personal computer, the number of trials and concurrent processes was limited to 12 (2 per worker). It can also be seen that Bayesian optimization's execution time is slightly worse than random search's, since BO is sequential: efficient at sampling but inefficient to parallelize.

Desktop computer with GPU
AMD Threadripper 3970X, 32 cores, 3.70 GHz, 128 MB L3 cache, running the Ubuntu 18.04 operating system, with 2 Nvidia GeForce RTX 2080 Ti GPUs connected by NVLink. Table 2 presents the execution times and the accuracy obtained for the optimal configuration, for both Random search and Bayesian optimization, on the AMD Threadripper desktop. Note that there are slight variations in the accuracy between Tables 1 and 2, but from the ML point of view these variations are negligible.

Cluster
For our tests, we used Argonne National Lab's Cooley cluster (six racks). The Cooley architecture is based on Intel Haswell, with one NVIDIA Tesla K80 (dual GPUs) per node, 126 nodes, 1512 cores in total, 47 TB of memory, and 4 TB of GPU memory.
Once the different hyperparameter sets to test have been selected, the different models are entirely independent, so the algorithm scales well to additional cores/nodes. We tested the different architectures with the same HPO and optional parameters as described in Sect. 3. The user selects how many hyperparameter sets to evaluate to find the optimum; for our tests, we fixed the number of sets for Sects. 3.1.1 and 3.1.2 to 12, but made it a multiple of the number of nodes for Sect. 3.1.3 to show scalability. Tables 1 and 2 show that our code uses all the cores available in the CPU (the times for only one trial are almost 12 times lower than the HPO ones). Furthermore, each node can independently run the same model (same inputs but different hyperparameters), so we expect near-perfect scalability. The only reason we see a small overhead in Fig. 5 is that the cluster we are using is MPI-aware, not GoLang-aware, so we need to create GoRoutines that are assigned to the different nodes' IP addresses, which causes a slight slowdown [21].

[Tables 1 and 2, fragments: accuracy (value at the optimum) 0.9921 ± 0.0007 and 0.9920 ± 0.0008; execution times 425.20 ± 5.00 s and 650.12 ± 5.00 s; accuracy (value at the optimum) 0.9900 ± 0.0010 and 0.9900 ± 0.0010.]

Resiliency
To test recovery and replication of nodes, we use a timer in our main thread to manually kill nodes and then start each node back up, observing that the system still works and the node has rejoined the cluster. All tests finished, and the time to recover from worker and master failures is negligible compared to the time taken by the NN classifier.

Uncertainty
As explained in Sect. 2.5, we use a relatively simple idea to return not only the model's best set of hyperparameters and its associated accuracy, but also an estimate of the uncertainty, as can be seen in Fig. 4. In all the experiments we ran, the extra time taken to evaluate the uncertainty of the ensemble is O(x), where x is the number of models used in the ensemble. For x = 5-10, the uncertainty is evaluated in less than two seconds on all tests described in Sect. 3.1, which is negligible compared with the average time it takes to train the model. The code for this can be found in the Appendix.

Conclusion
In this paper, we developed a functional software tool for tuning hyper-parameters, including a graphical user interface, that allows fast and robust testing and evaluation of ML algorithms; we also provide uncertainty estimates with very little overhead. Furthermore, we integrated the training of a neural network with three different HPO algorithms (grid, random, and BO) and a Paxos consensus algorithm to accelerate training and provide failure recovery and coordination for the system. We also improve predictive speed by distributing the task among many nodes and/or GoRoutines. Finally, we allow the ML scientist to scale the performance to their available computational resources. In future work, we intend to implement distributed BO and ASHA as additional HPO algorithms and to improve the usability of our proposed solution on a cluster. We also need to provide a better way of distributing the GoRoutines and to use the GoLang RPC library; right now, we need to provide the IP addresses of the different cluster nodes in a file.
Author contributions John wrote the HPO algorithms and tests; Maria wrote the Distributed algorithms and tests, and Gerardo did tests and reviews.
Funding John Li was supported by the XSEDE EMPOWER program under National Science Foundation Grant number ACI-1548562. Dr. Maria Pantoja was supported by Argonne National Lab Visitor Scholar Programs. Dr. Gerardo Fernández-Escribano was supported by Junta de Comunidades de Castilla-La Mancha under the project with reference SBPLY/21/180501/000195, and the Ministerio de Ciencia, Innovación y Universidades del Gobierno de España under the project with reference PID2021-123627OB-C52, projects which were funded by the "Fondo Europeo de Desarrollo Regional" (FEDER).

Declarations
Ethical approval This research does not include human and/or animal studies.
Conflict of interest I declare that the authors have no competing interests as defined by Springer, or other interests that might be perceived to influence the results and/or discussion reported in this paper.