Improved training of deep convolutional networks via minimum-variance regularized adaptive sampling

Fostered by technological and theoretical developments, deep neural networks (DNNs) have achieved great success in many applications, but their training via mini-batch stochastic gradient descent (SGD) can be very costly due to the tens of millions of parameters that may need to be optimized and the large amounts of training examples that must be processed. The computational cost is exacerbated by the inefficiency of the uniform sampling typically used by SGD to form the training mini-batches: since not all training examples are equally relevant for training, sampling them under a uniform distribution is far from optimal, making the case for the study of improved methods to train DNNs. A better strategy is to sample the training instances under a distribution where the probability of being selected is proportional to the relevance of each individual instance; one way to achieve this is through importance sampling (IS), which minimizes the variance of the gradients w.r.t. the network parameters, consequently improving convergence. In this paper, an IS-based adaptive sampling method to improve the training of DNNs is introduced. This method exploits side information to construct the optimal sampling distribution and is dubbed regularized adaptive sampling (RAS). An experimental comparison using deep convolutional networks for classification of the MNIST and CIFAR-10 datasets shows that, when compared against SGD and against another state-of-the-art sampling method, RAS improves the speed and variance of the training process without incurring significant overhead or affecting classification performance.


Introduction
During the past decade, the increase in available data and computational power, as well as theoretical developments, propelled the dominance of Deep Neural Networks (DNNs) in applications such as computer vision, pattern classification, natural language processing, forecasting and so on (Alom et al. 2018). Convolutional neural networks (CNNs, the topic of this paper) have found a niche of application in medicine (Al-Waisy et al. 2020; Alkadi et al. 2019; Hesamian et al. 2019; Varela-Santos and Melin 2021; Yao et al. 2020), security (Liu and Lang 2019), activity recognition (Thakur and Biswas 2020), and biometrics (Choudhary et al. 2020), among other areas.
However, due to the millions of trainable parameters that DNNs contain and the large amounts of data to be processed, and notwithstanding the capabilities of modern computers, training DNNs can be very costly (Wang et al. 2018) and can also be highly sensitive to the values of the hyper-parameters involved. In other words, now that the efficacy of DNNs has been amply demonstrated, the ability to produce these models efficiently becomes increasingly relevant, motivating the study of novel methods for improved training. Such is the topic addressed herein.
Because of the non-convexity of the problem, the standard method for training many types of DNNs is mini-batch gradient descent, often referred to simply as Stochastic Gradient Descent (SGD). The conventional way of forming the data mini-batches for SGD is through uniform sampling of the training data-instances, but since data instances are not equally useful, this sampling strategy is not very efficient. In recent years, methods to improve the training of DNNs via more efficient sampling schemes have been proposed around the idea of prioritizing the selection of instances that are most important for learning. Notice that an instance's importance constantly changes during training because the model parameters (w.r.t. which the gradients for SGD are computed) are constantly modified. Thus, the importance distribution is not static, but dynamic; it cannot be computed once/off-line and then be used throughout the training, but it must be updated on-line along with the parameters of the model. For training of DNNs this poses a computational challenge.
As technology moves forward, quantum computing (the other topic in this special issue) has been gradually emerging as an option to tackle the computational complexity of otherwise intractable problems and to reduce the cost of learning models (Lamata 2020; Saggio et al. 2021; Zahorodko et al. 2021). Examples include quantum Boltzmann machines (Ajagekar and You 2020), machine learning (Abohashima et al. 2020; Havlíček et al. 2019; Schuld and Petruccione 2021) and deep learning (Schuld 2021; Wiebe et al. 2015). Even though quantum computing promises the benefit of significantly more efficient computation, the search for novel optimization methods for improved training remains very relevant because current insights indicate that, like DNNs, practical quantum and hybrid models could be trained via SGD (Schuld et al. 2019; Sweke et al. 2020). Furthermore, it has been observed that quantum neural network training landscapes very often (for quantum circuits of more than a few qubits) exhibit plateaus that, like in DNNs, lead to the vanishing gradient problem. This makes the need for improved training techniques for classical, hybrid and (eventually) quantum models increasingly crucial in this area.
The importance distribution for SGD can be defined by satisfying a strongly principled optimization criterion via a more general strategy known as Importance Sampling (IS). In statistics, IS is a technique for estimating properties of a target distribution based on samples from a surrogate distribution (Tokdar and Kass 2010). IS is also an important variance-reduction technique (Owen and Zhou 2000), which makes it relevant for training via SGD because minimizing the variance of the stochastic gradients within the training mini-batches leads to a smoother and faster reduction of the loss function, effectively improving training (Zhao and Zhang 2014). In this setting, the optimal distribution for reducing the gradient variance is proportional to the per-instance gradient norm (Zhao and Zhang 2015), but computing such a distribution for DNNs is infeasible in practice due to the computational cost involved. In Sect. 2 we review several strategies that have been developed to overcome this obstacle.
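To make the variance-reduction argument concrete, the following self-contained Python sketch (a toy illustration under assumed data, not part of the paper's method) compares the variance of unbiased mini-batch gradient estimates obtained under uniform sampling and under sampling proportional to the per-instance gradient norms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem with heterogeneous instances, so that the
# per-instance gradient norms differ widely (data and names are illustrative).
X = rng.normal(size=(1000, 5)) * rng.uniform(0.1, 3.0, size=(1000, 1))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=1000)
w = rng.normal(size=5)
per_grad = 2 * (X @ w - y)[:, None] * X          # per-instance gradients of (x_i.w - y_i)^2

def batch_estimates(p, m=32, trials=2000):
    """Importance-weighted mini-batch estimates of the mean gradient under p.
    The weights 1/(N p_i) keep the estimator unbiased for any distribution p."""
    n = len(p)
    out = np.empty((trials, per_grad.shape[1]))
    for t in range(trials):
        idx = rng.choice(n, size=m, p=p)
        out[t] = ((1.0 / (n * p[idx]))[:, None] * per_grad[idx]).mean(axis=0)
    return out

uniform = np.full(len(X), 1.0 / len(X))
norms = np.linalg.norm(per_grad, axis=1)
by_norm = norms / norms.sum()                    # importance distribution proportional to gradient norm

var_uniform = batch_estimates(uniform).var(axis=0).sum()
var_by_norm = batch_estimates(by_norm).var(axis=0).sum()
print(var_uniform, var_by_norm)                  # the second value is expected to be smaller
```

Both estimators target the same full-data gradient; only their variance differs, which is precisely the property exploited by the methods reviewed next.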
This work contains three important contributions: (1) We examine the Adaptive Sampling (AS) method (Gopal 2016), which includes interesting and comprehensible notions that have been shown to offer improvements over regular SGD but that also, to the best of our knowledge, have not been tested for the training of DNNs due to scalability limitations; (2) We identify a main drawback of AS and propose modifications intended to counteract it, thus producing the Regularized Adaptive Sampling (RAS) method. This proposal is of relevance to the scientific community because it constitutes a novel and rigorously formulated alternative for the training of DNNs; (3) Our RAS proposal is experimentally compared against the standard sampling scheme for SGD as well as against the state-of-the-art method known as Online Batch Selection (OBS) (Loshchilov and Hutter 2015) in the context of image classification, using the MNIST and CIFAR-10 datasets. Our results show that RAS outperforms OBS in general, and outperforms SGD on some occasions while being equivalent in the rest, all of this with little computational overhead.
The rest of this paper is divided into six sections. Section 2 provides an overview of related work in the literature; Sect. 3 introduces the mathematical formulation of the selected sampling approaches; Sect. 4 contains the description of our proposal; Sect. 5 reports the results from our experimental comparison; Sect. 6 offers a discussion around our understanding of noteworthy results; Sect. 7 closes the paper.

Related work
A diversity of strategies to improve the convergence of SGD has been described in the literature. Herein we are mainly interested in the methods for improving the way in which data instances are selected to form training mini-batches by replacing the uniform distribution typically employed to sample training instances.
Regarding strategies based on IS: Zhao and Zhang (2015) show that the optimal importance distribution is proportional to the per-instance gradient norm and establish a clear connection with the variance of the gradient estimates in SGD. Gopal (2016) employs side information (e.g. the classes in a classification problem) to define a distribution that is directly proportional to the norm of the gradients. Note that these two works are not directed at the training of full-scale DNNs, whose size hinders the direct application of IS based on the per-sample gradient norm. For full-scale DNNs, Katharopoulos and Fleuret (2017) generated the importance distribution using the loss value as an alternative importance metric that nevertheless succeeds at reducing the variance of the gradients. Later on, Katharopoulos and Fleuret (2018) derived an upper bound (16) to use in lieu of the actual gradient norm; this is an important contribution that we incorporate into our method. Johnson and Guestrin (2018) proposed a robust approximate version of IS (RAIS) that optimizes the sampling distribution $p_i$ at time $t$ (omitted for compactness) w.r.t. an uncertainty set parameterized by vectors $c$ and $d$, and trained adaptively over $n$ examples, leading to $p_i \propto \langle c, s_i\rangle + \langle d, s_i\rangle \, n \big/ \sum_{j=1}^{n}\langle d, s_j\rangle$, where the $s_i$ are features predictive of the true gradient norms, such as gradient norms from previous iterations. Although RAIS avoids a smoothing parameter used by algorithms that approximate the gradient norms via point estimates, we have found that said parameter can be tuned easily. Furthermore, RAIS introduces its own parameters and its gains depend on a learning-rate schedule.
A different, empirical approach to define importance distributions employs the loss value of individual training instances. The loss is a measure of how well a model can solve particular instances; if the loss of an instance is high then it should be sampled more frequently so that the model has more opportunities to learn to solve it. This should make the learning process more efficient. Following this idea, Loshchilov and Hutter (2015) ranked instances w.r.t. their latest known loss value and built a sampling distribution that decays exponentially as a function of that ranking; this is called OBS.
Similarly, Jiang et al. (2019) proposed Selective-Backprop (SB), designed to accelerate (in real time) the training of deep models by reducing the number of backward passes over the training data. This is achieved by selecting each example with a probability that is a monotonically increasing function of the CDF of the losses $L_i$ across the last $R$ examples: $p_i = \mathrm{CDF}_R(L_i)^{\nu}$. This heuristic employs a parameter $\nu > 0$ to control the selectivity. The authors compared SB against the method of Katharopoulos and Fleuret (2018) and observed significant speedups, albeit by sacrificing a small fraction of accuracy on complex datasets.
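The SB selection rule can be summarized in a few lines; the sketch below is an illustration of the formula above, not the authors' implementation, and the names are placeholders.

```python
import numpy as np

def sb_select_prob(loss_i, recent_losses, nu=1.0):
    """Selective-Backprop style selection probability (sketch):
    the empirical CDF of the last R losses evaluated at the current
    loss, raised to the power nu (larger nu = more selective)."""
    cdf = np.mean(np.asarray(recent_losses) <= loss_i)
    return cdf ** nu
```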
In yet another approach, inspired by how humans and animals learn, Bengio et al. (2009) presented training instances to learning models in gradually increasing amounts and degrees of complexity, achieving faster convergence and better generalization in a variety of toy problems, recognition of shapes, and language models; this method is called Curriculum Learning (CL), a precursor to current sample-weighting strategies. A disadvantage of CL is that it employs a predefined learning schedule based on the judged difficulty of each training example, which strongly depends on human expertise.
Recently, Santiago et al. (2021) described a sample-weighting strategy that automatically learns the optimal weights of training instances, overcoming the disadvantage of CL. Learning of the weights is achieved by solving, at each SGD step, $\arg\max_{w}\; w^{T}\nabla_{\theta} - \lambda\,\lVert w - \mathbf{1}\rVert$, s.t. $w_j \ge 0$ and $\sum_j w_j = M$, for model parameters $\theta$, weights $w_j$, $j = 1, \ldots, M$, and batch size $M$. In this problem, $\lambda > 0$ is a parameter that controls the regularization term; a smaller (larger) $\lambda$ makes the weights less (more) uniform. The authors approximated the gradient norms $\lVert\nabla_{\theta}\rVert$ as described in Katharopoulos and Fleuret (2018) and compared their weighting strategy against the IS-based method presented in that same work and against other approaches, achieving very competitive results on CIFAR-10/100, including smaller classification errors, on average, than standard SGD.
Other proposals to optimize the training of DNNs can also be found in specific contexts. Alain et al. (2015) explored the application of IS to asynchronous SGD, where the required gradients are computed in parallel and this task is distributed over a set of machines. Faghri et al. (2020) also discuss data-parallel SGD, with emphasis on the quantization schemes used to communicate the gradients between processors. Wu et al. (2017) designed a distribution that maximizes the diversity of the losses in a training batch and applied this idea to deep embedding learning, which is the problem of clustering instances in the so-called embedding space induced by the learning model. Fan et al. (2017) used reinforcement learning to train a neural network that selects instances in order to optimize the convergence of a second neural network. Joseph et al. (2019) designed a mini-batch selection strategy based on the maximization of a submodular function that captures relevant information from the data, used to speed up the training. Finally, a review on optimization approximation and other techniques to enable large-scale machine learning can be found in the work of Wang et al. (2022).
Our RAS method presented in Sect. 4 is an adaptation of the AS algorithm of Gopal (2016) for DNNs, with the added improvement of a redefined probability distribution that combines a gradient-norm estimate and the loss of individual instances to form the mini-batches. This proposal is compared against uniform sampling and against OBS. Details of these methods are provided in the following section.

Selective sampling approaches
Consider the problem of optimizing the parameters of a deep learning model $y_i = \phi(x_i;\theta)$ for classification, where $\theta$ represents the model parameters, $x_i \in \mathbb{R}^n$ represents an input data instance and $y_i$ is its corresponding class label. Assuming a non-negative loss function $L(\phi(x_i;\theta), y_i)$, the standard solution $\theta^*$ is:

$$\theta^* = \arg\min_{\theta}\; \mathbb{E}_{x_i \sim U}\!\left[ L\!\left(\phi(x_i;\theta), y_i\right) \right], \qquad (1)$$

where $U$ represents the uniform distribution over the $N$ training instances; given a suitable learning rate, (1) can be iteratively estimated via the SGD algorithm.
As discussed in previous sections, employing a non-uniform distribution for the selection of the training instances can make the process of solving (1) more efficient through two main approaches: constructing the sampling distribution based on the values of $L(\phi, y)$, or based on the gradients $\nabla_\theta L(\phi, y)$. This should be integrated into the standard training procedure, which presents the training data to the learning model in mini-batches (simply, batches) of convenient size. Methods that implement these approaches are reviewed below.

Online Batch Selection: Following the first approach, Loshchilov and Hutter (2015) rank the data instances according to their latest known loss value and assign to the j-th ranked instance a probability $p_j$ defined as:

$$p_j = \frac{\exp\!\left(\log(s_t)/N\right)^{-j}}{\sum_{i=1}^{N}\exp\!\left(\log(s_t)/N\right)^{-i}}, \qquad (2)$$

where $\exp(\log(s_t)/N)$ is the probability drop between consecutive instances and $s_t = p_1/p_N$ at iteration $t$ acts as a selection-pressure parameter that can be kept constant during training or may be decreased over time. Using (2), Loshchilov and Hutter designed a loss-dependent instance-selection method called OBS, and showed that it can accelerate the training of a CNN. OBS is an appealing proposal because it can be implemented very easily over a learning model, as a wrapper function with little overhead. In this work, RAS is compared against OBS.
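As an illustration of (2), the following sketch builds the OBS rank-based distribution from the latest known losses; it is not the reference implementation, and it assumes the convention that rank 1 corresponds to the largest latest loss. The default `s` matches the value $s = 10^8$ used later in our experiments.

```python
import numpy as np

def obs_probabilities(latest_losses, s=1e8):
    """Rank-based OBS distribution (sketch): the selection probability of an
    instance decays exponentially with its loss rank; s = p_1 / p_N controls
    the selection pressure."""
    N = len(latest_losses)
    order = np.argsort(-np.asarray(latest_losses))   # indices sorted by descending loss
    ranks = np.empty(N, dtype=int)
    ranks[order] = np.arange(1, N + 1)               # rank 1 = largest latest loss
    unnorm = np.exp(-np.log(s) / N * ranks)          # exponential decay with the rank
    return unnorm / unnorm.sum()
```

We now review a method that uses the second approach to select training instances.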
Adaptive Sampling: Following the second approach, Gopal (2016) developed an instance-selection method called AS, which applies Importance Sampling to minimize the gradient variance with respect to a non-uniform probability distribution defined over a partition $C$ of the dataset $D$, generated on the basis of side information such as the class labels $y$.

Briefly explained, $C = \{c_1, c_2, \ldots, c_k\}$ and each bin $c_j$ contains $|c_j|$ instances, $j = 1, \ldots, k$. In our notation a superindex is used to indicate the j-th bin and a subindex indicates the i-th element of the corresponding j-th bin. That is, the $N$ instances are split into $k$ mutually exclusive, not necessarily balanced, bins. The probability of selecting the i-th instance is:

$$P\!\left(X = x_i^j\right) = p_j \cdot \frac{1}{|c_j|}, \qquad (3)$$

where $p_j = P(C = c_j)$ is the probability of selecting the j-th bin, whose selection should reduce the variance $\mathbb{V}$ of the descent direction, posing the problem:

$$\min_{P}\; \mathbb{V}\!\left[d_t\right] \quad \text{s.t.} \quad \sum_{j=1}^{k} p_j = 1,\; p_j \ge 0, \qquad (4)$$

where $d_t$ is the descent direction at iteration $t$:

$$d_t = \frac{|c_j|}{N\, p_j}\,\nabla_\theta L\!\left(\phi(x_i^j;\theta^t), y_i^j\right), \quad c_j \sim P,\; x_i^j \sim U(c_j). \qquad (5)$$

The solution to (4) due to Gopal (2016) gives the optimal distribution $P$ at the t-th iteration:

$$p_j^t \propto \sqrt{|c_j| \sum_{i \in c_j}\left\|\nabla_\theta L\!\left(\phi(x_i^j;\theta^t), y_i^j\right)\right\|^2}. \qquad (6)$$

In the work of Gopal (2016), AS was tested on three classification datasets using binary logistic regressors, and it was shown to effectively reduce the variance of the objective function employed. However, AS was not tested on the training of deep networks. Prior to the development of this work, we tested AS on the training of a CNN for classification of the MNIST dataset without obtaining satisfactory results (Fig. 1; left column: AS with Adadelta, right column: AS with Adam). This motivated further study, which indicated that the cause was intrinsic to the AS algorithm, rather than the choice of hyper-parameters or optimizers employed, and in turn led us to propose the modifications to improve AS that are described below.
Careful examination of the instances selected by AS shows that, as training proceeds, the probability of selecting instances from only one (or very few) of the bins (i.e. classes) in the dataset increases dramatically (see the top two rows in Fig. 2). As the network is shown examples from only one or a few of the problem classes, training fails and generalization collapses; this is particularly noticeable when small batches are employed, as the plots for batch-sizes 64 and 128 show. Although increasing the batch-size does help to maintain a more balanced selection over the bins, using large batch-sizes in general hinders the training process.
Derived from this observation, our insight is that the selection probability of any single bin in the partition of the dataset should not grow excessively relative to the probabilities of the other bins, since this eventually causes the distribution to collapse onto a single bin. In other words, excessive growth of any single bin's probability must be impeded, with the aim of maintaining a balanced distribution of the sampled instances. We interpret this as a form of regularization over the bin probabilities and identify the modified AS method by the name Regularized Adaptive Sampling. The regularization effect of the modifications introduced in RAS can be seen in the two bottom rows of Fig. 2, where, with the exception of the case of batch-size 64 with the Adam optimizer, the distributions over the classes appear perceptibly more balanced.
Examination of the instance-selection probability in AS shows that the second factor in (3), which represents the probability of the i-th instance to be selected within the j-th bin, is not connected to the loss value (or gradient of the loss w.r.t. the model's parameters) corresponding to that instance. In other words, the relative relevance of individual instances is lost in the computation of P(x i ). To correct this, let us start by revisiting the motivation for IS.
Indeed, IS can be seen as a method to estimate (1), surrogating the uniform distribution with a non-uniform one. Thus we can pose a Monte Carlo (MC) estimator:

$$\theta^* = \arg\min_{\theta}\; \mathbb{E}_{x_i \sim \mathcal{V}}\!\left[ \zeta(x_i)\, L\!\left(\phi(x_i;\theta), y_i\right) \right], \quad \zeta(x_i) = \frac{1}{N\,\mathcal{V}(x_i)}, \qquad (7)$$

where $\zeta(\cdot)$ is called the importance weight and $\mathcal{V}$ the importance distribution. Bouchard et al. (2015) showed that SGD is a powerful tool to optimize the sampling distribution of MC estimators, further motivating the idea of adapting the sampling distribution along with the model parameters. Assuming that $P$ and $Q^j$ are independent, the new probability of picking the i-th instance can be defined as $\mathcal{V}(x_i^j) \equiv P(X = x_i^j) = p_j\, q_i^j$, so that (7) becomes:

$$\theta^* = \arg\min_{\theta}\; \mathbb{E}\!\left[ \frac{1}{N\, p_j\, q_i^j}\, L\!\left(\phi(x_i^j;\theta), y_i^j\right) \right]. \qquad (8)$$

Recalling that the main goal of using non-uniform sampling of the instances is to reduce the variance of the descent direction, the introduction of $Q^j$ generates a second problem similar to (4) but w.r.t. $Q^j$:

$$\min_{Q^j}\; \mathbb{V}\!\left[d_t^j\right] \quad \text{s.t.} \quad \sum_{i \in c_j} q_i^j = 1,\; q_i^j \ge 0. \qquad (9)$$

Formally, the original gradient-variance minimization problem is decomposed into two minimization problems, (4) and (9). Problem (4) is the minimization of the gradient variance w.r.t. $P$ assuming a fixed $Q^j$, and (9) is the complementary minimization of the gradient variance w.r.t. $Q^j$ assuming a fixed $P$.
The solution to (4) has already been given in (6) and can simply be re-written to accommodate $Q^j$ as:

$$p_j^t \propto \sqrt{\sum_{i \in c_j} \frac{1}{q_i^j}\left\|\nabla_\theta L\!\left(\phi(x_i^j;\theta^t), y_i^j\right)\right\|^2}. \qquad (10)$$

For (9), use $1/N = (|c_j|/N)(1/|c_j|)$ in (8) so that:

$$\theta^* = \arg\min_{\theta}\; \sum_{j=1}^{k} \frac{|c_j|}{N}\left[\frac{1}{|c_j|}\sum_{i \in c_j} L\!\left(\phi(x_i^j;\theta), y_i^j\right)\right], \qquad (11)$$

and notice that the last summation can be separated to pose the following optimization problem exclusively over the instances of the j-th bin, $\forall j = 1, \ldots, k$:

$$\arg\min_{\theta}\; \frac{1}{|c_j|}\sum_{i \in c_j} L\!\left(\phi(x_i^j;\theta), y_i^j\right), \qquad (12)$$

so that the gradient $d_t^j$ to be used in (9) is:

$$d_t^j = \frac{1}{|c_j|\, q_i^j}\,\nabla_\theta L\!\left(\phi(x_i^j;\theta^t), y_i^j\right), \quad x_i^j \sim Q^j. \qquad (13)$$

It can be shown (cf. Avalos-López et al. (2021)) that $\mathbb{E}[d_t^j]$ is independent of $q_i^j$ and can be ignored in the solution of (9). We are left with $\mathbb{E}\!\left[(d_t^j)^T d_t^j\right]$ and proceed to minimize the variance following Gopal (2016). Finally, the optimal $Q^j$ at the t-th iteration is given by:

$$q_i^{j,t} \propto \left\|\nabla_\theta L\!\left(\phi(x_i^j;\theta^t), y_i^j\right)\right\|. \qquad (14)$$

Algorithm 1: Regularized Adaptive Sampling.

The selective sampling implemented by RAS can be understood as two consecutive selection procedures: first a bin is selected according to (10), and then training instances within that bin are selected according to (14). It is worth remarking that in both cases the derived distributions are directly proportional to the norm of $\nabla_\theta L$, since this optimizes the criterion of minimum variance of the descent direction, which is known to improve training convergence. The pseudocode of RAS is given in Algorithm 1, and its flowchart is in Fig. 3.
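The two-stage selection can be summarized with the following minimal NumPy sketch; it is an illustration rather than Algorithm 1 itself. The function name and the additive form of the smoothing are assumptions (the smoothing and the β-periodic refresh of the bin probabilities are detailed in the next subsection), and every bin is assumed to be non-empty.

```python
import numpy as np

rng = np.random.default_rng(0)

def ras_sample_batch(scores, bin_of, num_bins, batch_size,
                     delta_p=0.9, delta_q=0.1):
    """Two consecutive selection procedures (sketch): (i) pick bins with
    probability proportional to the aggregated per-instance scores
    (gradient-norm estimates or losses), cf. (10); (ii) pick instances
    inside each chosen bin proportionally to their own scores, cf. (14)/(15).
    The additive smoothing below is an assumed form of the regularization."""
    # Bin-level distribution P
    bin_mass = np.array([np.sqrt(np.sum(scores[bin_of == j] ** 2))
                         for j in range(num_bins)])
    p = bin_mass / bin_mass.sum()
    p = (1 - delta_p) * p + delta_p / num_bins      # keep every bin selectable

    chosen = []
    draws = rng.choice(num_bins, size=batch_size, p=p)
    for j, n_j in zip(*np.unique(draws, return_counts=True)):
        idx = np.flatnonzero(bin_of == j)           # instances of bin j
        q = scores[idx] / scores[idx].sum()
        q = (1 - delta_q) * q + delta_q / len(idx)  # keep every instance selectable
        chosen.extend(rng.choice(idx, size=n_j, p=q, replace=True))
    return np.asarray(chosen)
```

A full implementation would additionally refresh the bin probabilities only every β iterations and replace the per-instance scores by the loss values or by the upper bound (16), as explained below.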

Computational considerations
In the implementation of the RAS method there are a few computational drawbacks that must be addressed:

1. It is impractical to recompute the sampling distribution (10) after each iteration. Instead, p_j can be updated every β iterations (see line 11 in Algorithm 1).

2. Computing the gradient norm for an individual instance requires computing a square root. However, Katharopoulos and Fleuret (2017) show that sampling using the loss (which is readily available during training) exhibits similar variance-reducing properties to sampling according to the gradient norm. Thus, (14) can be replaced with:

$$q_i^{j,t} \propto L\!\left(\phi(x_i^j;\theta^t), y_i^j\right), \qquad (15)$$

which means that the selection of training instances is set proportional to the loss function evaluated on the output of the model for the corresponding data and target.

3. During training, P and Q^j may degenerate to zero for some bins or some instances, respectively, causing those bins/instances to never be selected again. To avoid this, P and Q^j are smoothed to mitigate the appearance of extreme values. This requires two more hyper-parameters, δ_p, δ_q ∈ (0, 1), to modulate the smoothing of P and Q^j, respectively.

4. Computing the full gradient becomes prohibitive for deep learning models with millions of parameters. Katharopoulos and Fleuret (2018) derived an upper bound of the gradient norm in terms of the gradient of the loss w.r.t. the pre-activation outputs of the network's last layer,

$$\left\|\nabla_\theta L\!\left(\phi(x_i;\theta), y_i\right)\right\| \;\le\; \hat{L}\,\left\|\nabla_{z_i^{(L)}} L\!\left(\phi(x_i;\theta), y_i\right)\right\|, \qquad (16)$$

where $z_i^{(L)}$ denotes those pre-activations and $\hat{L}$ is a constant common to all instances, that can be employed in lieu of the full gradient norm (a minimal sketch of how this quantity is computed for the softmax cross-entropy loss is given below).

The results of our experimental evaluation of RAS are presented in the following section.
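As a closing illustration of item 4 above, the following PyTorch sketch (an illustration under our reading of (16), not the paper's implementation) computes a per-sample score: for the softmax cross-entropy loss, the gradient of the loss w.r.t. the logits is softmax(z) − onehot(y), so the right-hand side of (16) can be evaluated, up to the shared constant, without per-sample backward passes through the whole network. The function name and tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def last_layer_grad_norm(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-sample norm of the gradient of the softmax cross-entropy loss
    w.r.t. the logits (last-layer pre-activations): softmax(z) - onehot(y).
    Up to a constant shared by all instances, this is the bound in (16)."""
    probs = F.softmax(logits, dim=1)                         # (B, C)
    onehot = F.one_hot(targets, num_classes=probs.size(1))   # (B, C)
    return (probs - onehot.to(probs.dtype)).norm(dim=1)      # (B,)
```

These scores can then be plugged into (14) in place of the exact per-instance gradient norms.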

Experimental setup and results
The proposed RAS technique is compared against uniform sampling (i.e. typical SGD) and against a loss-rank-based selection strategy, OBS (Loshchilov and Hutter 2015). This comparison is carried out on two image-classification benchmarks: MNIST and CIFAR-10 (Krizhevsky et al. 2009). The standard MNIST dataset is a popular benchmark for the evaluation of classification models because it can be easily solved by small modern networks. For this dataset a CNN with two convolutional layers (each batch-normalized and followed by a max-pooling layer) and two fully connected (FC) layers was implemented (see Table 1). The ReLU activation function was used for all layers except the last one, which employs the softmax cross-entropy loss, as is commonly done for classification problems. Dropout (with a rate of 0.5) and data normalization were used on the first FC layer.
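As a rough illustration of the network just described (Table 1 itself is not reproduced here), a PyTorch module might look as follows; the channel counts, kernel sizes and hidden width are placeholders rather than the values of Table 1.

```python
import torch.nn as nn

class MnistCNN(nn.Module):
    """Sketch of the MNIST network described in the text: two convolutional
    layers, each batch-normalized and followed by max-pooling, then two fully
    connected layers; ReLU everywhere except the output, which is trained with
    the softmax cross-entropy loss (nn.CrossEntropyLoss on the logits)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(128, num_classes),   # logits for the cross-entropy loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```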
For the CIFAR-10 dataset, which is a more difficult problem, a standard VGG-16 CNN (Simonyan and Zisserman 2015) was implemented. Batch normalization (for each convolutional layer) and Dropout (for the FC layers) were employed. All layers included the ReLU activation function, except the output layer, which used the softmax cross-entropy loss. The architecture of this CNN is provided in Table 2.
Typical base-line classification performance on the MNIST dataset is above 99%, while a good base-line performance of VGG-16 on the CIFAR-10 dataset can be set around 85% correct classification (Kawaguchi et al. 2020). Notice that in our comparison we are not interested in obtaining top-level performance of the networks, but rather in the characterization of the compared sampling methods under a range of experimental conditions such as different problems, optimizers, and a range of batch sizes. All of the compared methods were implemented using the PyTorch library for Python 3.6, and the experiments were executed on a Linux box (with Ubuntu Bionic Beaver OS) equipped with a GeForce GTX 1070 GPU.

The experiments consisted of training each of the CNNs for a fixed number of epochs and recording the training loss and test accuracy per epoch. The number of epochs was chosen considering computational cost and the need to observe training convergence of the corresponding networks. Under these criteria, the networks for the MNIST problem were trained for 50 epochs each, and the networks for CIFAR-10 were trained for 40 epochs each (because the VGG-16 networks are much more expensive to run). For all the experiments, the side information in RAS was set to be the class labels, and the hyper-parameters were set as follows: β = 3, δ_p = 0.9 and δ_q = 0.1; these δ values were selected by exhaustive search of the range (0, 1) in steps of 0.1 for both hyper-parameters. Results of this search are illustrated in Fig. 4, which shows the training-loss surfaces obtained for the CNN in Table 1 over the parameters δ_p and δ_q; as can be seen, the values selected correspond to the optimal region on these surfaces. Finally, OBS requires two extra hyper-parameters whose tried-out values can be found in Loshchilov and Hutter (2015); based on those we set r = 0.5 and s = 10^8.

Both of the test datasets correspond to classification problems with ten classes and present a similar structure and size. The MNIST dataset contains 60,000 training instances and 10,000 test instances, all of which are gray-scale images of 28 × 28 pixels representing handwritten digits. The CIFAR-10 dataset comprises 50,000 training instances and 10,000 test instances, as color images of 32 × 32 pixels that represent different classes of vehicles and animals. The original instances were used without data augmentation and without validation subsets (i.e. the datasets are initially split into training and test subsets without overlap and this split is maintained throughout the experiments).
In general, training of a DNN through pure SGD can be inefficient; because of this, training of networks for practical use is usually carried out with the help of optimizers. In this work the compared instance-selection methods are combined with two of the most popular optimizers, Adadelta and Adam (see https://pytorch.org/docs/stable/optim.html#algorithms). Also, the use of mini-batches in training of DNNs is practically universal; although its effect is significant, the batch size is a free parameter that needs to be set empirically. When selecting a batch size there is a trade-off between speed and performance: smaller batch sizes imply more training time but can lead to better performance, while larger batch sizes require shorter training time but the obtained networks may not perform as well. With the objective of obtaining a characterization as complete as possible, all of the experiments consider batch sizes from 2^6 = 64 to 2^10 = 1024 instances.
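As hypothetical glue code (not the implementation used for the experiments), any of the compared selection schemes can be plugged into a standard PyTorch training loop by turning its current per-instance probabilities into a `WeightedRandomSampler`, refreshing the probabilities as training proceeds:

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_loader(dataset, probs, batch_size):
    """Build a DataLoader that draws instances according to the (current)
    per-instance probabilities of a selection scheme (uniform, OBS, RAS)."""
    sampler = WeightedRandomSampler(
        weights=torch.as_tensor(probs, dtype=torch.double),
        num_samples=len(dataset),
        replacement=True,
    )
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

# e.g. optimizer = torch.optim.Adadelta(model.parameters())
#      optimizer = torch.optim.Adam(model.parameters())
```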
The first set of results, for MNIST with Adadelta, is shown in Fig. 5. Notice that although the training loss with RAS does not converge for batch sizes of 64 and 128 instances, in the other three cases training with RAS is faster (in terms of epochs; elapsed times are discussed below) than with SGD and OBS. The second set of results, for MNIST with the Adam optimizer, is shown in Fig. 6. In this case it can be observed that convergence is very challenging for SGD, while OBS and RAS have some difficulties with batch sizes 64 and 128 but in general show better convergence than SGD.
The third set of results, for CIFAR-10 with Adadelta, is presented in Fig. 7. In this figure it can be seen that, with the exception of batch size 1024 and the last 10 epochs of batch size 64, the training-loss and test-accuracy plots of RAS are in general more stable (i.e. show smaller dispersion and fluctuation) than those of OBS and SGD. In the fourth set of results, for CIFAR-10 with Adam, shown in Fig. 8, the training-loss and test-accuracy plots of SGD and RAS are relatively smooth and very similar between methods, while the plots of OBS show that this method fails to converge for batch sizes of 256, 512 and 1024 instances.
Finally, the average training times per epoch required by each combination of a sampling method and an optimizer are shown in Fig. 9 as plots for different batch sizes of the networks trained for MNIST and CIFAR-10. Note that the times required for training on CIFAR-10 are about an order of magnitude larger than those required for training on MNIST; also that, as the batch size increases, the training time decreases, in accordance with the literature (Smith et al. 2017); and that the difference in required time between methods is quite significant for the simple problem of MNIST but not so for the more complex problem of CIFAR-10.

Discussion
Considering our set of results as a whole, it can be appreciated that there is a wide variety of individual behaviors of the compared methods that can be analyzed in terms of the convergence of the loss function. These can be broadly separated into two groups: (1) experiments where the methods behave as expected (success) and (2) experiments where the methods misbehave (failure). For instance, on the MNIST dataset with the Adadelta optimizer (Fig. 5), RAS failed in the experiments for batch sizes 64 and 128, but succeeded in the rest of the experiments. For the same dataset and optimizer, SGD and OBS always behaved as expected. Table 3 provides a visual summary of the qualitative outcome of the compared methods over all the experiments, where a success is represented by a circle and a failure by a square. The bottom row shows the success rate per column and the rightmost column the success rate per experimental setup, each setup being the combination of a dataset and an optimizer: (1) MNIST w/Adadelta; (2) MNIST w/Adam; (3) CIFAR-10 w/Adadelta; (4) CIFAR-10 w/Adam.
The average success rate per method is SGD: 0.65, RAS: 0.65 and OBS: 0.55. As recorded in Table 3, employing extreme batch sizes results in lower success rates overall. Focusing on the central values of the batch size (256 and 512), it can be seen that only the RAS method is 100% successful along the four experimental setups; SGD succeeds in 62.5% and OBS in only 50% of the experiments with these batch sizes. Why is it that the methods misbehave if one chooses the smallest or largest batch sizes? For excessively small batches, a possible explanation in the case of the selective sampling methods is that small numbers of instances do not allow the computation of reliable estimates of the required probabilities; however, loss plots that increase over time instead of decreasing (see RAS in Figs. 5 and 6), or that show abrupt and significant jumps (see OBS in Fig. 6), can also be indicative of excessively large learning rates. On the other hand, the effect of very large batch sizes is more easily understood and expected: an excessively large batch reduces both the gradient noise and the number of parameter updates per epoch, which slows convergence. Based on the results described above, it is considered that the smaller batch sizes (64 and 128) sometimes generate quite significant difficulties for some methods (and may make them more sensitive to the hyper-parameters), while the larger batch sizes (512 and particularly 1024) almost always generate inferior results for all methods. Because of this, and to simplify further analysis, a batch size of 256 (located in the middle of the range of batch sizes considered) is regarded as a convenient compromise for all three methods, and the corresponding results will be explored in detail.
The results for batch size 256 are shown in Fig. 10 (with Adadelta) and Fig. 11 (with Adam); these are limited to the CIFAR-10 dataset because it is a problem of greater complexity than MNIST. The plots show the median and range over 15 trials. Regarding Fig. 10, it can be observed that the training-loss curve obtained by RAS is smoother and decreases faster than the curves of SGD and OBS. The loss of OBS intersects that of RAS only at epoch 27, and that of SGD does so only at epoch 31. This is mirrored by the test accuracy, for which OBS equals RAS at epoch 33 and SGD does so at epoch 31. On average, there is an advantage of 30.5 epochs (out of the 40-epoch budget) in favor of RAS. Further, the dispersion of the data (illustrated by the shaded region around each curve) is much larger for SGD and OBS than for RAS (this applies to both the training loss and the test accuracy). It can be concluded that in this test it is preferable to choose RAS.
The results in Fig. 11 show a different scenario to that described above. According to these, the loss and accuracy obtained by RAS are essentially equivalent to those of SGD (in median value as well as dispersion) throughout the whole training process. In contrast, the loss curve of OBS describes quite a different path and only intersects that of RAS near the end of the training, at epoch 36 out of 40. This is reflected in the test accuracy curves, for which the curve of OBS never reaches the accuracy level obtained by RAS and SGD.
To identify statistically significant differences between the discussed results, a two-sided Wilcoxon rank-sum test (Gibbons and Chakraborti 2020; Hollander et al. 2013) is performed for each pair of methods using as input data: (a) the integral of the training loss and (b) the integral of the test accuracy (including 15 experimental trials of 40 epochs for each method). These data are visualized through probability density estimates (Hill 1985) in Figs. 12 and 13, for the loss and the accuracy, respectively. The horizontal line within each density plot represents the median of the corresponding data. The objective of the Wilcoxon test is to compare the null hypothesis that data from the different methods are samples from continuous distributions with equal medians against the alternative hypothesis that they are not. The results (p-values) are reported in Tables 4 and 5, for the training loss and the test accuracy, respectively. Methods combined with the Adadelta optimizer and with the Adam optimizer are tested separately. The results show that, at the 5% significance level, there is not enough evidence to reject the null hypothesis in the tests comparing the loss of RAS vs. SGD with Adam, and the loss of SGD vs. OBS with Adadelta (to signify this, the data densities for these two cases appear connected in Fig. 12); in every other case the statistical tests found sufficient evidence to reject the null hypothesis, allowing us to conclude that the discussed differences between methods are statistically significant. In addition, the plots in Figs. 12 and 13 show that RAS is clearly superior to OBS in every case, that it is superior to SGD with Adadelta, and that it matches the performance of SGD with Adam in terms of training loss.

Finally, although the purpose of this work is not to improve the classification accuracy, given that classification is the end result of the networks, it is relevant to discuss the outcomes produced by the training methods. To this end, the differences between the medians (over 15 trials) of the highest test accuracy obtained by RAS and the other methods are, using Adadelta: RAS−OBS = +0.09%, RAS−SGD = −0.63%; and using Adam: RAS−OBS = +2.84%, RAS−SGD = −0.84%. As can be seen, the differences between the test accuracy that RAS can obtain and that of the other methods are negligible (smaller than 1%); the only exception is the difference between RAS and OBS using Adam, which is 2.84% in favor of RAS. From this, it can be said that while RAS offers the benefit of a more stable and faster convergence, it does so without negatively affecting the performance of the trained networks.
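For reference, the test just described can be reproduced with a few lines of SciPy; the arrays below are placeholders standing in for the recorded per-trial curves (15 trials × 40 epochs), not the actual experimental data.

```python
import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
loss_ras = rng.random((15, 40))   # placeholder: training-loss curves of RAS
loss_sgd = rng.random((15, 40))   # placeholder: training-loss curves of SGD

# Summarize each trial by the integral (area under the curve) of its loss,
# then compare the two methods with a two-sided Wilcoxon rank-sum test.
auc_ras = np.trapz(loss_ras, axis=1)
auc_sgd = np.trapz(loss_sgd, axis=1)
stat, p_value = ranksums(auc_ras, auc_sgd)
print(p_value)   # p < 0.05 would indicate a significant difference in medians
```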

Conclusion
A novel method to sample data in order to form mini-batches for the training of DNNs has been described. This method is an improved IS-based adaptive sampling strategy designed to accelerate the training process, which works by defining optimal distributions over data bins (i.e. problem classes) and instances that reduce the variance of the gradient norms and lead to faster and smoother convergence of a network's parameters. Based on an experimental comparison including two classification problems of different complexity, two popular optimizers and a range of batch sizes, it was shown that the proposed RAS method either matches or outperforms standard SGD and always surpasses the performance of another selective sampling method (OBS).
It was also shown that while different optimizers cause SGD and OBS to behave quite differently, our proposed method is more robust to this variable. However, like the other compared methods, RAS is sensitive to the choice of batch size. All of the remaining hyper-parameters involved were kept fixed to simplify the analysis. Although the RAS method implies a computational overhead that results in longer elapsed real times per epoch, this is only significant in relation to the very short times required for training on the simple MNIST problem. Meanwhile, in terms of training epochs, an average advantage of 30.5 epochs (out of the 40-epoch budget) was achieved by RAS on the CIFAR-10 dataset when combined with Adadelta, and no difference was observed against SGD when combined with Adam (while OBS was substantially impaired in this case).
Variations of the proposed sampling strategy can be easily implemented, such as switching the selective sampling on/off at times when a speed-up is guaranteed, an idea presented by Katharopoulos and Fleuret (2018). Also, there is a variety of choices for the side information employed to define the importance of the bins and instances; for instance, Joseph et al. (2019) have explored the use of the entropy of the model as a measure of how informative the instances in a training iteration are. These ideas are being considered for future work.
Other possible avenues of research include the application of our method to other types of networks, including generative models, recurrent networks, models trained through transfer learning (Mari et al. 2020), and the most recent developments like hybrid quantum-classical models (Liu et al. 2021). The potential application of our method to these models and more is feasible as long as the models are trained in a supervised fashion through some form of SGD.

Availability of data and material Data are available from the authors upon reasonable request.

Conflict of interest
The authors declare that they have no competing interests.

Ethics approval and consent to participate Not Applicable.
Consent for publication Not Applicable.