Truly Sparse Neural Networks at Scale

Recently, sparse training methods have started to be established as a de facto approach for training and inference efficiency in artificial neural networks. Yet, this efficiency is just in theory. In practice, everyone uses a binary mask to simulate sparsity since the typical deep learning software and hardware are optimized for dense matrix operations. In this paper, we take an orthogonal approach, and we show that we can train truly sparse neural networks to harvest their full potential. To achieve this goal, we introduce three novel contributions, specially designed for sparse neural networks: (1) a parallel training algorithm and its corresponding sparse implementation from scratch, (2) an activation function with non-trainable parameters to favour the gradient flow, and (3) a hidden neurons importance metric to eliminate redundancies. All in one, we are able to break the record and to train the largest neural network ever trained in terms of representational power -- reaching the bat brain size. The results show that our approach has state-of-the-art performance while opening the path for an environmentally friendly artificial intelligence era.


Introduction
Artificial Neural Networks (ANNs) succeeded in a broad range of application domains (LeCun et al. (2015)) due to their ability to learn complex transformations from data while achieving superior generalisation performance. However, current state-of-the-art networks are typically highly overparameterised (e.g. Zhang et al. (2017)) and demand extensive computational resources to be trained, which become a bottleneck where such resources are limited ). Reducing the memory footprint and training time of ANNs are active areas of research, crucial to handle the rapid expansion of machine learning which has resulted in enormous datasets, with millions to billions of examples and features, but also to decrease the high environmental impact of the energy-hungry deep learning algorithms. Taking inspiration from nature, a solution to improve neural network scaling is to use sparse connectivity. The traditional dense-to-sparse training paradigm (known mainly as network pruning) (Mozer & Smolensky, 1989;LeCun et al., 1989;Han et al., 2015;Frankle & Carbin, 2019) o er computational benefits just in the inference phase as first, it trains a dense network in order to prune unimportant connections and to obtain a sparsely connected neural network. Therefore, to obtain scalable and e icient ANNs, contrary to general practice, artificial neural networks, like biological neural networks, should not have fully connected layers also in the training phase. Recently, a new sparse-to-sparse training paradigm (or simply, sparse training) began to establish inside the research community, with several studies focus on developing memory and computational e iciency from the start by training directly sparse neural networks from scratch. The first attempt (Mocanu et al., 2016) has used just static sparsity limiting the capacity of the model sparse connectivity graph to fit the data distribution. This concept has been revised and drastically improved by introducing the Sparse Evolutionary Training (SET) algorithm with dynamic (or adaptive) sparsity in (Mocanu et al. (2018)). Currently, the sparse training concept has started to be a de facto approach for e icient training of ANNs, as demonstrated in (Bellec et al., 2018;Dettmers & Zettlemoyer, 2019;Mostafa & Wang, 2019b;Evci et al., 2019Evci et al., , 2020Jayakumar et al., 2020;Raihan & Aamodt, 2020). These algorithms search for an optimal sparse topology according to some salience criteria, while simultaneously optimising the model weights. Here, the topology of a neural network refers to the way the neurons are connected, and it is a crucial factor in network functioning, and learning (Miikkulainen (2010)). The resulting networks have a significantly lower number of parameters by design, and they have empirically shown to outstretch higher generalisation power than their dense counterparts in a number of cases, especially in the case of multilayer perceptrons and recurrent neural networks (Evci et al. (2020); Bourgin et al. (2019); Liu et al. (2020cLiu et al. ( , 2021). Besides this, intrinsically sparse models allow, in theory, real scalable deep learning solutions in low-resource devices, standard computers, and in the cloud.
The main limitation to achieve this theoretical scalability level is given by the fact that all state-of-the-art deep learning frameworks are based on very well-optimised dense matrix multiplications on Graphics Processing Units (GPUs), while sparse matrix operations are practically ignored. The only notable exception is given by the NVIDIA A100 GPU which was released in 2020 (Je Pool (2020)) and supports a hardware fixed 2:4 sparsity level (i.e. 50% sparsity level). Within these frameworks, one can only simulate the sparsity by using a binary mask over the connections; therefore, the model will carry on training dense matrices. As follows, until optimised hardware for sparse operations appears, one would have to focus on optimising the algorithms.
In (Liu et al. (2020b)), the authors developed an e icient implementation of sparse multilayer perceptrons (MLPs) trained with SET. For the first time, they built sparse MLP models with over one million artificial neurons on commodity hardware, only utilising one CPU core. Still their sparse framework is completely sequential and cannot yet compete against advanced professional frameworks designed to accelerate the learning of dense neural networks. Additionally, there is the need for revisiting various aspects (e.g. optimizers and activation functions) of sparse neural networks training since most of the literature has mainly focused on dense models. The choice of the activation function for deep neural networks has a critical impact on the performance of the training procedure. An inappropriate selection can lead to the loss of information of the input during forward propagation and the exponential vanishing/exploding of gradients during backpropagation (Hayou et al. (2019)). It is crucial to question weather the activation functions currently used for densely connected networks still behave reliably in the sparse context. SReLU is a relatively little-known activation function suggested in (Jin et al. (2016)) and it has proven to outperform ReLU (Agarap (2018)) when training sparse networks over di erent datasets (Mocanu et al. (2018); Dubowski (2020)) as it improves the networks gradient flow (ab Tessera et al. (2021)). However, this activation function requires to learn four additional parameters per neuron, which becomes a non-negligible number and introduces a serious computational overhead if we want to train models with millions or billions of hidden units.
To alleviate the aforementioned limitations, this paper proposes four new contributions which advance the scalability of neural networks by exploiting sparsity: • We introduce Weight Averaging Sparse Asynchronous Parallel SGD (WASAP-SGD), a parallel algorithm to train truly sparse neural networks and expand their feasible size on commodity hardware, without any GPU support. • We propose a variant of ReLU, called ALternated Le ReLU (All-ReLU), to achieve performance comparable to SReLU for SET without the additional overhead for training its associated parameters. • We introduce the concept of neuron importance and a method (Importance Pruning) to blend it into the sparse training procedure, which allows us to shrink even more the number of weights and to accelerate sparse training considerably. • We developed a customised and modularized so ware framework for sparse neural networks to test our theoretical contributions. It allows us to break the record and to train as a proof-of-concept the largest neural network model ever trained, i.e. 50 million neurons. The three approaches (WASAP-SGD, All-ReLU and Importance Pruning) represent independent contributions to sparse neural network literature, but also they can be used together as complementary methods to improve further the performance of sparse models, as we illustrate in our experimental results.

Results
Our work is focused on the most straightforward type of neural networks, MLPs, as they count for 61% of a typical Google TPU (Tensor Processing Unit) workload for production neural networks applications, while convolutional neural networks represent merely 5% (Jouppi et al. (2017)). Despite the numerous algorithms available to train sparse neural networks from scratch, we decided to base our evaluation on the SET algorithm, given its simplicity and good performance on a broad range of domains. Unlike the other sparse training techniques mentioned in section 4 that calculate and store information for all the parameters, including the non-existing ones, SET is memory-e icient because it uses information just from the existing parameters, and it does not require high computational complexity. These are all favourable advantages to our goal of developing large scale sparse neural networks. We evaluate and discuss the performance of our proposed methods on sparse MLP models by considering five publicly available datasets listed in Table 1.

Problem formulation
and a network f (x ; θ) with L layers parameterized by θ (weights w and biases b). We train the network to minimize the loss function L (f (x ; θ), y ). The motivation of sparse neural networks is to use a fraction of parameters to reparameterize the whole network, while preserving the performance as much as possible. Hence, a sparse neural network can be denoted as f s (x ; θ s ) with a given sparsity level. Initially, the network is uniformly initialized with a sparse distribution in which the sparsity level S l of each layer l is controlled by a parameter (see Mocanu et al. (2018) for details) and stays constant during the training. More precisely, for each layer l the connections are defined in a sparse adjacency weights matrix W (l ) = [[w 11 , w 12 , . . . , w 1n l ], . . . , [w n l −1 1 , w n l −1 2 , . . . , w n l −1 n l ]] in which the elements are either null (w i j = 0) when there is no connection between neuron i and neuron j or have a connection weight (w i j 0) when the connection between i and j exists. Initially each W (l ) is a Erdős-Rényi random graph (Erdös & Rényi (1959)).

WASAP-SGD method
We propose a novel parallel training method with two phases based on a data parallelism strategy (where the learning phase of a model is partitioned by input samples) to improve the scalability of sparse neural networks. The algorithm, called WASAP-SGD, is based on asynchronous SGD (Dean et al. (2012)) training for the first phase and Stochastic Weight Averaging (Izmailov et al. (2018)) for the second phase. This two-phase method helps in filling the gap between sparse and dense neural networks' performances (accuracy, running time and generalization). We consider a system with K workers, which repeatedly compute gradient contributions based on independently drawn data mini-batches from the given dataset D. We also consider a shared parameter server, which communicates with each of the workers independently, to give state information and get updates that it applies according to the algorithm it follows. The master and each worker have a replica of the sparse model to be trained. Moreover, each worker has access to a subset of the training data, as depicted in Figure 2. In Algorithm 1 we show the pseudocode for WASAP-SGD, describing how standard asynchronous SGD using a parameter server is extended with a local training phase followed by a sparse model averaging step to improve its generalization performance. Moreover, it is designed to include the topology adaptation step of sparse networks. The training is carried out asynchronously by all workers. We adopt a simple SGD update rule with momentum, which has shown to be e ective for training intrinsically sparse models: The master must periodically pause the asynchronous update to carry on the weight evolution algorithm on the sparse model to generate the new topology. Each update must include a minor modification, since individual weights may be outdated due to the topology evolution (as illustrated in Figure 3). Then, to improve the model generalization performance, each worker locally updates its sparse replica for the next phase (phase two). Once phase two is concluded, the K di erent models are averaged: The averaging step does not preserve the sparsity level S , since each worker has updated its topology independently from each other. Hence, the final model θ f s will have a di erent sparsity level S (l ) , for each layer l , where S (l ) ≥ S, l = 1, . . . , L . Thus, unimportant connections, accounting for a certain fraction S (l ) − S, will be pruned in each layer. More precisely, the unimportant connections are pruned based on their magnitude, Figure 3. Worker k' fetches parameters w t +1 and push gradients ∆w t +1 to the parameter server. These gradients may contain non-valid updates, since in that time frame the global model may have performed the topology evolution, hence they need to be corrected.
corresponding to the largest negative weights and the smallest positive weights in W (l ) .

All-ReLU
The new proposed activation function, All-ReLU (Alternated Le ReLU), is designed for training sparse MLPs and is able to accelerate training, without adding any additional computational complexity. All-ReLU is inspired by the S-shaped rectified linear activation unit (SReLU) presented in Jin et al. (2016). The intuition behind it came from analysing the SReLU parameters as well as the input distribution of the learned topology. Like SReLU, this function can improve the networks gradient flow, and consequently achieve better accuracy. However, since All-ReLU does not require to train additional parameters, it can be considered as simple and fast as ReLU to use. Given an ANN with L layers, our proposed Alternate Le ReLU (All-ReLU) is defined as follows for each layer l : where x i is the input value, α is the slope for the negative side of the input and % represent the modulo operation. The input layer (l = 1) and the output layer (l = L) are excluded. We believe that the proposed activation function can accelerate convergence by breaking symmetry during training and preserving the gradient flow through the network, hence leading to better performance for sparse models.

Importance Pruning
To substantially reduce the size of neural networks, we propose a novel method, for selecting the most important neurons, based on their strength (importance). In graph theory, the node strength is the sum of weights of links connected to the node (Barrat et al. (2004)). Taking inspiration from this graph measure, we determine the importance of each neuron based on the summation of absolute weights of its incoming connection. For each neuron j in layer l we define its importance as follows: where Γ (l −1) j is the set of all neurons connected to neuron j , i.e. Γ (l −1) given that n (l −1) is the number of hidden neurons from the previous layer l − 1 and w (l ) i j denotes the weight of connection linking neuron i to j in two consecutive layers as defined in W (l ) . This neuron importance metric can rapidly identify the main hubs of the sparse network, i.e. nodes that are positioned to make strong contributions to global network performance.
This metric can be easily integrated during training and we named this procedure Importance Pruning, once the topology is stable, to reduce overfitting with almost no loss in accuracy and substantially lessen the number of parameters up to 80% with respect to the initial sparse network and, as a consequence, decrease the overall training running time. In Algorithm 2, we provide an example of how Importance Pruning can be integrated with dynamic sparse training. The pseudocode refers to the SET algorithm; however, it could be easily replaced by any other sparse-to-sparse training technique.

Large scale Sparse Neural Network framework
We extended the sparse framework presented in Liu et al. (2020b) by implementing the theoretical contributions presented in this paper. The initial implementation was sequential and it was not able to obtain the same accuracy as Keras on some datasets such as CIFAR10. Our focus was on MLP as we did not have the human resources to develop RNNs and CNNs from scratch and we let this as future work. With respect to the speed of our approaches, it is worth to highlight that first we substantially improved the training time of a truly sparse MLP from its previous implementation in Liu et al. (2020b) with no parallelisation by replacing Cython with Numba (Lam et al. (2015)) and adopting 32-bit float precision instead of 64-bit. These minor changes ensure minimum resource requirements. WASAP-SGD is designed using Python 3.7, one of the most popular Machine Learning languages, combined with Message Passing Interface (MPI).
With our framework, we were able to build the largest Sparse MLP model, in terms of the number of neurons, ever trained on a single machine on the cloud (with no GPU). Note that the enormous models mentioned in literature have been usually trained in a distributed fashion or on several GPUs. Since a high-dimensional dataset for this task is hard to find, we use an artificial one to train a model with 50 million neurons on a machine with 96 virtual cores and 768 GB of RAM. This experiment demonstrates how the Sparse MLP can achieve what its dense counterpart cannot, due to memory error.  (Baldi et al., 2014) Physics particles 28 105000 12 50000 6 2 Madelon (Guyon et al., 2005) Artificial data 500 2000 600 2 FashionMNIST (Xiao et al., 2017)

Performance on Sequential Trained Sparse MLPs
This section summarises the performance for the comparison between All-ReLU and ReLU on sparse MLP models, and their integration with Importance Pruning to speed up the training. All SET-MLP variants have been run using our own truly sparse implementation framework and just one CPU core. Table 2 lists the maximum accuracy for each method/dataset combination, along with the total training time (expressed in minutes), the number of parameters at the beginning (st ar t _n W ) and at the end of the training (end _n W ). This is particularly interesting to report when Importance Pruning is applied to the sparse models for understanding the benefits in terms of memory footprint. The resulting learning curves and errors of our experiments are shown in Figure 6 and Figure 7, for both testing and training sets. For all figures, we obtain the mean accuracy by averaging the best test accuracy from 5 trials over 500 epochs. Moreover, for each model, we display the resulting number of parameters for the dense MLP version, the basic SET-MLP and SET-MLP where the neuron importance metric is adopted for taking out unimportant hidden units. Since a key result is the relative space improvement from the Importance Pruning, in Figure 4, we plot the relative model size (expressed in number of parameters) against the relative error to highlight the di erences with and without Importance Pruning (where Figure 4a summarizes the final training results and Figure 4b for the final testing results).  Table 2. On each dataset, we report the best classification accuracy and error obtained by each model on the test data over five di erent runs for 500 epochs. st ar t _n W represents the number of weights in the model at the beginning of the training, while end _n W represents the number of parameters in the final model. Importance Pruning (y/n) indicates if the proposed pruning strategy based on our neuron importance metric is activated. Training reports the overall running time needed for training the models. Furthermore, the table reports the performance of fully connected MLPs (Dense MLPs) with both activation functions. The networks have been trained using momentum SGD in its standard sequential version. It is worth mentioning that the Dense MLP has been run using Keras using all CPU cores and SET-MLP using our own implementation and just one CPU core.
We can observe that All-ReLU consistently outperforms ReLU on all datasets, indicating that the novel activation function associated with a sparse-to-sparse training algorithm helps to model better the data distribution. Moreover, when Importance Pruning is activated, we can notice a significant reduction in the number of parameters, which leads to a remarkable speedup in running time, with almost no loss in terms of accuracy. Looking at the CIFAR10 dataset (Figure 6e), we can see that the new activation function is capable of boosting the accuracy on test data by more than 2%. SET-MLP on CIFAR10, a er 500 epochs when using SReLU reaches about 70.30 % accuracy. This result suggests that All-ReLU indeed fills the performance gap with SReLU successfully. The model version with Importance Pruning can achieve comparable performances while training roughly 40% fewer parameters (where the reduced number of parameters refers to the final model and it is obtained by gradually reducing the connections during training) and gaining a speedup of 60 minutes. In this case, the importance metric seems to be more stable when adopting All-ReLU, resulting in a minor loss in performance. A similar outstanding result is obtained with Madelon (Figure 6c), where All-ReLU obtains 3% increase in accuracy with no Importance Pruning and about 2% with Importance Pruning (where the model uses 80% fewer parameters). Here, the Importance Pruning method has improved the performance significantly.  For FashionMNIST, All-ReLU can surpass both ReLU ( Figure 6d) and SReLU. The latter achieves around 90.10 % accuracy when trained for 500 epochs with the same settings, while All-ReLU achieves 91.38 %. The smaller version obtained via Importance Pruning ends with 20% of the original parameters, gaining a 41 minutes speedup. We attain similar performance for Leukemia and Higgs, although the increase in accuracy is more predominant for image data and Madelon. Lastly, in Figure 5, we show the gradient flow for the sparse models trained with All-ReLU and ReLU on CIFAR10 (Figure 5a), FashionMNIST ( Figure 5b) and Madelon ( Figure 5c). We recall that gradient flow is the first-order approximation of the decrease in the loss expected a er a gradient step, hence the higher, the better. All-ReLU visibly improves this metric, which is associated with e icient training of sparse neural networks (ab Tessera et al. (2021)

Performance on Parallel Trained Sparse MLPs
The available algorithms for parallelisation of dense neural networks are not suitable for sparse neural networks training. Hence they are not considered in our experimental evaluation of the proposed method. We did not test WASAP-SGD with Madelon and Leukemia because there is no added value in parallelising the training under the conditions where the data and/or the model size are minimal. In Table 3 we summarise the performance of the proposed parallel algorithm on three di erent datasets where All-ReLU is adopted, with and without Importance Pruning. For completeness, we report the results for a variant where phase one is synchronous, called WASSP-SGD (Weight Averaging Sparse Synchronous Parallel SGD). In this way, we want to empirically demonstrate that our asynchronous version is more suitable to train sparse models. Furthermore, for easy comparison, we again report the accuracy and training time for the sequential version (baseline). Since we run the experiments on a machine with six physical cores, we employed five workers and one master (parameter server). The time reduction does not improve significantly as the number of workers surpasses the number of physical processors. The synchronous variant WASSP-SGD is implemented by following the suggestions from (Goyal et al. (2017)), such as their gradual warmup and linear scaling rule for the learning rate. Conversely, for our WASAP-SGD, where the first phase is asynchronous, we observed it benefits from larger learning rates for the first few epochs, followed by fixed learning rates.
In Table 3, we also show the average running time using the Keras implementation of SET-MLP with a mask over the parameters. The statistics are provided for three di erent configurations of the hardware: training on one CPU core only, training with no constrains (all 12 CPUs cores) and GPU training. These numbers allow us to make a comparison between our proposed sparse framework and a popular deep learning library like Keras. By looking at the results in Table 3, we can observe that the proposed parallel algorithm exhibits persistently better convergence when the first phase is carried out asynchronously (WASAP-SGD). The same outcome holds in terms of training time. If we compare the running times of WASAP-SGD against the one from the sequential version, we gain an improvement of about half among all datasets, without introducing a notable increase in memory footprint. Our parallel method is able to outperform Keras CPU-based wall-clock running time for training of sparse MLPs significantly. We would like to emphasise that our sparse implementation together with parallelisation and Importance Pruning gets quite close to GPU training time (see Table 3). Plus, this is the kind of comparison that people typically do not make, because (for dense networks) does not make sense to compare CPU with GPU. Moreover, as we could notice from the results in subsection 2.2 for the Leukemia dataset, this computational advantage of GPUs manifests its limitation when it comes to training huge models. In this case, Keras is not able to allocate the dense tensors, resulting in a memory error. Additionally, it is worth mentioning that GPU training utilises more resource than CPU training and classic GPU training, as it also uses the CPU besides GPU. Hence, we believe that our approach has more potential to tackle large scale deep learning models.

Extreme large sparse MLPs
To the best of our knowledge, the largest public (claimed) dense neural network has 160 billion (B) parameters, where a parameter roughly corresponds to a synapse in the human brain. Given the human brain is estimated to have about 100 trillion synapses, that neural network could be said to be about 0.16% of the human brain. State-of-the-art networks in a generic sense contain around 16M neurons. For comparison, the 16 million (M) neurons number when compared with the 100B neurons in a human brain-only represents 0.016% of human brain size. Now, it is worth noting that size is not the only thing that matters. Most advanced models today perform worse when their network size is increased blindly. While it is true that DNN capabilities increase with their network size, there is also a fair amount of engineering work that goes into making larger networks accurate. Hopefully, sparsity will help in overcoming some challenges when training such large models.
In this scenario, we tried to push the limit of ANNs on a virtual machine with 96 cores and 768GB of RAM to train extreme large sparse MLP models. We are attempting to enter a region where neural networks have never been explored. Hence, we believe that any small finding is important. Because of limited time and resources,  Table 3. The table reports the accuracy, average running time and average resource utilisation over five di erent runs for 500 epochs when using parallel training with WASAP-SGD and our proposed sparse implementation framework. For completeness, we report the performance for the synchronous (phase 1) version of the algorithm as well. Here, this synchronous version is called WASSP-SGD (Weight Averaging Sparse Synchronous Parallel SGD). Moreover, we include the performance of sequential training for facilitating the comparison. Importance Pruning (y/n) indicates if the proposed pruning strategy based on our neuron importance metric is activated. Note that in this setting, the memory usage will decrease during training. To have an idea on how these performances can be compared against a state-of-the-art framework, we report the running time and memory statistics when using the Keras implementation of SET-MLP with a mask over the parameters. These statistics are provided for three di erent configurations of the hardware.
we did not repeat experiments to get statistical confidence. We just run them once for few epochs, on di erent regimes, to collect statistics such as matrix initialisation time, training time per epoch, inference time and topology evolution time.
Our main goal was to train a sparse model with more than 125 million neurons because we believe that this is the latest state-of-the-art (just for inference) size, according to (Mohammad Hasanzadeh Mofrad & Hammoud (2021)). For the sake of clarity, we mention that these results ignore entirely the training focusing just on inference. We want to go one step further and train such models. Herein, we have discovered that we are very close of reaching the limits of our proposed implementation framework of training extremely large sparse neural networks, but it is good to know the limits in order to know how to proceed further.
There are not many available datasets with a high number of features and a reasonable amount of data points to exploit data parallelism. For this reason, we created an artificial dataset by adopting the function make_classification 3 from Scikit-learn, a free so ware machine learning library for the Python programming language. The algorithm is adapted from (Guyon (2003)) and was designed to generate the "Madelon" dataset. We generated a binary classification task with 10000 samples, where each sample has 65536 features. We used 30% of the data as test data and 70% as train data. The models are trained using our parallel algorithm WASAP-SGD with momentum (set to 0.9), weight decay and dropout (set to 0.4). Moreover, the batch size is set to 128, and the learning rate is 0.01. The statistics for di erent architectures are presented in table Table 4. Moreover, we report the number of workers, the number of parameters and the value for , which controls the sparsity level. We stress that the training is performed on a single machine with no GPU, while popular state-of-the-art models are usually trained with distributed algorithms on multi GPUs.

Architecture Epsilon Neurons [#] Parameters [#] Workers [#] Weight initialization [min]
Training [min]  While trying to build these large sparse models, multiple challenges and technical di iculties have emerged: Inference bottleneck: Up to this moment, our work was focused on the training phase of neural networks since inference did not play an important role for common sized networks. However, once we started building extreme large models, we immediately noticed how important it is to optimise this phase as well. We tried to overcome the bottleneck by parallelising the inference in batches with python mutiprocessing. We are aware that there exist more sophisticated approaches, but this simple solution could already decrease the running time considerably. MPI overflow: Parallelization in Python integrates Message Passing Interface via mpi4py module. When we first selected this framework, we did not consider that mpi4py does not support the parallelisation of objects greater than 2 31 bytes, limiting the size of sparse matrices to be created. This limitation is probably becoming more noticeable nowadays, given the importance of Big Data analysis, and in Ascension & Araúzo-Bravo (2020), the authors developed BigMPI4py, a Python module that wraps mpi4py, supporting object sizes beyond this boundary. Matrix initialisation time: When building such larges models, weights initialisation starts to play a significant role. If the sparse matrix initialisation is not implemented e iciently, this may cause a bottleneck. In this regard, we had to vectorise this step in order to reduce the weight initialization running time.
Memory allocation issues: Sparsity allows for creating a bigger model (in terms of the number of hidden units) than when one adopts fully connected layers. However, this advantage has its limitations as well. When adopting a data parallelism strategy, the sparse model is replicated among all workers to accelerate the training procedure. When the model becomes extremely large, the number of workers which can fit in memory decreases (as can be noticed in Table 4). At this point, the training should become distributed by having each workers running on a di erent node in a network. Alternatively, other strategies may be explored to overcome the memory issues. Our implementation adopts MPI standard; hence it could be automatically run in a distributed fashion. Nevertheless, the feasible size of sparse models is much larger than their dense counterparts. In this regard, the largest dense network we can build on the virtual machine has around 600000 neurons.
We note that the weight evolution step is able to scale successfully without adding too much overhead. Moreover, we want to briefly outline that we managed to train a model with 10 million neurons ( = 1) via sequential training on Leukemia. Each training epoch takes around 33 minutes, and 18 seconds, inference takes 12 minutes and 9 seconds, and evolution time 30 seconds. Similarly, we could train a model with 50 million neurons where one epoch takes about 1 hour and 15 minutes. In the end, we were not able to train sparse models with more than 50 million neurons due to the challenges listed above. A er a point, to grow the number of neurons, we had to increase the sparsity as well as decrease the number of workers. Nonetheless, from these extreme experiments, we could learn some of the limitations of our approach and the chosen technologies. Lastly, it is important to highlight that these limitations come from the implementation itself and they are not limitations of the three theoretical contributions.

Discussion
To improve the scalability of ANNs by exploiting sparsity, three main contributions have been introduced in this work. Each of them has been implemented in a truly sparse manner with sparse matrices and operations. First, we introduced WASAP-SGD, a new parallel algorithm to e iciently train sparse neural networks asynchronously. This type of communication protocol has proven to be more e ective than synchronous training for sparse models, both from a running time and convergence point of view. Furthermore, the last training phase ensures higher generalization performance. For example, for CIFAR10, we could bring down the sequential training time of 590 minutes to 282 without Importance Pruning and to 246 with Importance Pruning while the GPU training time is around 200. It should be pointed out that all communications are intrinsically sparse, reducing the overhead significantly. From our experimental evaluation, we argue that sparsity could more easily overcome the typical negative traits of asynchrony. At the same time, the communication overhead is mitigated since the processes in the system exchange sparse updates. Lastly, the concept of staleness-adaptive AsyncPSGD and delayed compensation strategies have been explored, but they did not improve statistical e iciency.
Secondly, we proposed a new activation function called All-ReLU to boost sparse MLPs performances. All-ReLU has shown promising results, outperforming ReLU in all five datasets, across various domains, without adding any extra computational complexity to the training procedure. To present some outstanding results, it has shown to significantly increase accuracy on test data for CIFAR10 and Madalon by more than 2% when compared to ReLU. The benefit on CIFAR10 becomes even more visible if we train the network longer. In this case, All-ReLU increases the accuracy by 3.4 % when compared to the classic ReLU. Moreover, if the model is trained longer, All-ReLU gets on par results (72-73%) with SReLU, achieving an accuracy close to the 73-74% reported in original SET (Mocanu et al., 2018). For image datasets, we hypothesized that the higher performance of our activation function might be caused by its ability to capture the feature shi that is common in this data. Our results are in line with independent parallel literature on sparse neural networks, as we demonstrate that All-ReLU achieves better performance by enriching the gradient flow during the training process. These findings make stronger the contribution of both works and the generality of the concept.
Lastly, we proposed a metric to define neuron importance which can be employed to remarkably shrink the number of parameters via Importance Pruning, an active pruning strategy. The sparsity level of the network can be increased without performance loss using our proposed method, which reduces computation time and memory requirements. The combination of the new activation and Importance Pruning has been tested across all datasets, resulting in better or comparable outcomes when pruning is included. The most interesting case has been revealed for Madelon data, where the combination of the two methodologies has significantly improved performance up to 77%. This happens because the artificial dataset contains many redundant features which are eliminated, and it might imply that the neuron importance metric is useful for implicit feature selection.
The three methods are complementary and can be combined to obtain large scale sparse MLPs. In subsection 2.4 we performed some experiments to train extremely large networks (in terms of neurons), and we managed to train a sparse MLP with 50M neurons on a single machine. Thanks to this investigation, we could identify many important limitations in order to open several new research directions. Our contributions allow advancing the state-of-the-art in representational power (i.e. number of neurons) of artificial neural networks. Currently, up to our knowledge, the largest ANNs, built on supercomputers, accommodate the size of a frog's brain (about 16 million neurons) (Goodfellow et al. (2016)). A er some technical challenges are overcome, with sparse neural networks, we may create on the same supercomputers ANNs close to the human brain size (about 80 billion neurons).
There are several directions for future work. The concept of staleness-adaptive AsyncPSGD for the first training phase has been under-explored for a high number of workers. Although these adaptations do not seem to help in our experiments, continuing to investigate asynchrony-aware SGD is of interest for very sparse large models. Future research directions also include investigating the nature of sparse training with more extensive experiments on various model architecture, as CNNs and Transformers. The latter would likely benefit the most since they use large dense layers. Additionally, it would be intriguing to consider a decentralized architecture with no parameter server involved. From an implementation point of view, it would be great to develop the parallelization in C++, in order to achieve better performance and overcome some sloppy characteristics of Python, or to incorporate customized sparse matrix operations developed for GPUs (Gale et al., 2020). Lastly, we need to adopt distributed settings if we want to outstretch the size of ANNs and approach the human brain's size. A clear limitation for All-ReLU consists in choosing the slope α. Although we provide some practical advice (see subsection 5.2), it would be interesting to find a way to tune this parameter before training automatically.
We believe that our research opens the path for obtaining better performance for current state-of-the-art sparse training research in terms of accuracy, computational requirements, and energy costs. With regard to the latter, people use artificial intelligence for climate change, but they do not improve deep learning to save energy. With our work, we hope to raise more awareness concerning this problem and show that it is possible to pursue sustainable supercomputing. Finally, it can pave the way to develop much larger neural networks with billion of neurons which can help us to tackle challenging problems in complex domains such as health care.

Sparse Neural Networks Training
Fully connected neural networks have been shown to have a substantial number of redundant parameters, and, in some cases, more than 95% of the parameters can be predicted from the remaining ones without accuracy loss (Denil et al. (2013)). In early work on sparsification, Optimal Brain Damage LeCun et al. (1989) and Optimal Brain Surgeon (Hassibi et al. (1993)) use gradient methods to sparsify networks during training. They observed that a sparse network demonstrated several advantages over its dense counterparts, such as better generalisation, reduced memory footprint and faster inference time. Additionally, the methodology of sparsification includes Hebbian pruning (Srivastava et al. (2014)), matrix decomposition ), and graph techniques (Kepner & Gilbert (2011);Kepner et al. (2017; Kumar et al. (2018); Prabhu et al. (2018)).

Dense-to-Sparse Training
Recently, more and more studies attempted to obtain memory and computational e iciency methods for the inference phase of deep neural networks. Numerous post-pruning techniques (dense-to-sparse training) have been proposed to reduce the number of parameters and speed up the inference phase across a broad range of neural network architectures (Han et al., 2015;Srinivas & Babu, 2015;Han et al., 2016;Narang et al., 2017;Zhu & Gupta, 2017;Zhou et al., 2019); yet, these approaches require to fully train the dense model first. Several methods strove to learn the sparse networks during training (Louizos et al., 2018;Wen et al., 2018;Liu et al., 2020a;Kusupati et al., 2020;Yuan et al., 2021). However, these techniques begin with (or, are using at some moment during training) a fully-connected model, and as a consequence, they are not memory e icient. Another viable way is one-shot pruning, which aims to find sparse neural networks by pruning once before the main training phase (Lee et al., 2019(Lee et al., , 2020. In this setting, at least one iteration of the dense model requires to be trained to identify the sparse sub-networks, and therefore the pruning process is unfeasible for memory-limited scenarios. Additionally, this method cannot meet the performance of dynamic sparse training, especially at extreme sparsity levels ).

Sparse-to-Sparse Training
The aforementioned issues can be naturally overcome by training intrinsically sparse neural networks from scratch to obtain the e iciency for both the training and inference phases. With respect to Static Sparse Training, Mocanu et al. (2016) introduced intrinsically sparse networks by exploring the scale-free and small-world topological properties in Restricted Boltzmann Machines. Later, some works focus on designing sparse CNNs based on Expander graphs and show comparable performance (Prabhu et al. (2018); Kepner & Robinett (2019)).
In ) they tests pruning-based topologies, as well as RadiX-Nets from (Kepner & Robinett (2019)). Once more, results show that these static sparse networks obtain accuracies comparable to dense networks, but extreme levels of sparsity cause instability in training. A training technique that allows for sparsity throughout the entire training process in a dynamic fashion (sparse-to-sparse training) was first introduced in Mocanu (2017) (2020)) was introduced as a novel method for training sparse models without the need of a "lucky initialisation"; it can match and sometimes exceed the performance of pruning based approaches. Lastly, in (Liu et al. (2021)), the authors propose an approach to successfully train sparse Recurrent Neural Networks (RNNs) with a stable number of floating-point operations (FLOPs) and a fixed parameters count.

Parallel Training of Deep Neural Networks
Accelerating training for Deep Neural Networks (DNNs) is a daunting challenge and techniques range from distributed algorithms to hardware optimisations. Stochastic Gradient Descent (SGD) (Robbins & Monro (1951)) together with backpropagation, and in particular, its mini-batch variants (Bottou (2010)) are the de-facto methods to train DNNs. SGD, however, is inherently sequential with a dependency across iterations and, this dependency limits parallelism. In (Ben-Nun & Hoefler (2018)), the authors provide an extensive survey about the vast catalogue of parallelisation approaches in deep learning. There are three prominent strategies to partition the learning phase of a model: partitioning by input samples (data parallelism), by network structure (model parallelism), and by layer (pipelining). Data parallelism can be easily implemented, and it is, therefore, the most widely used implementation strategy on multi-GPUs (Li et al. (2016)). We have explored this option focusing on CPUs only, where each core utilises the same sparse model to train on di erent data subsets. The replicas communicate updates through a centralised parameter server (shared memory) which maintains the current state of all parameters for the sparse model. In this architecture, there is no synchronisation between CPU cores during the forward pass, but the gradients must be synchronised for the weights update. As for the parallelisation of SGD algorithms, one can choose to do it in either a synchronous or asynchronous way (Figure 8).

Synchronous Parallel SGD
In a shared parameter server system, the local workers can compute the gradients over their mini-batch of data, and then add the gradients to the global model; this approach is commonly referred to as Synchronous Parallel SGD (SyncPSGD) due to its barrier-based nature. In its standard form, SyncPSGD has scalability issues due to the waiting time that is inherent in the aggregation when independent workers compute with di erent speed (Bäckström et al. (2019)). the authors proposed Post-Local SGD, where the classic large mini-batch SGD is followed by a local SGD phase which allows workers to independently update their models for a few steps before synchronising. Moreover, (Gupta et al. (2020)) introduces a variant where they adapt Stochastic Weight Averaging Dean et al. (2012) to accelerate DNNs training. In the second phase, each worker refines its network separately, and at the end, they average the weights of the resulting models to produce the final result.

Asynchronous Parallel SGD
An alternative type of parallelisation is Asynchronous Parallel SGD (AsyncPSGD) Dean et al. (2012), in which workers fetch and update the shared model independently of each other. Hence, the training procedure has no barrier imposed. Notably, for sparse problems, Hogwild method (Recht et al. (2011)) shows that updating only the relevant parameters without any synchronisation could guarantee a nearly linear speedup with the number of processors (Recht et al. (2011)). Although AsyncPSGD can achieve faster speed due to the absence of waiting overhead, the lack of coordination implies that gradients may be computed on stale (old) version of the weights, which leads to statistical ine iciency. This problem is well-known, and some researchers have analysed its negative e ect on the convergence speed Avron et al. (2014); Lian et al. (2015). Another critical factor to monitor when considering di erent scales of asynchrony is that it introduces momentum to the SGD update, called the implicit momentum ).
Recent studies in Lan & Zhou (2018); Bäckström et al. (2019); Zheng et al. (2020) proposes di erent stalenessadaptive SGD algorithms to reduce the negative impact of asynchrony and approach the performance of sequential SGD. Moreover, they allow for fine-tuning the implicit momentum and increase the number of workers while maintaining statistical e iciency.

Parallel Training of Sparse Networks
To reduce the communication overhead in parallel distributed DNN training, various quantisation techniques and sparse gradient updates have been developed Wangni et al. (2018); Alistarh et al. (2017);Stich et al. (2018). In this regard, one significant advantage of sparse models is that the sparse gradient communication is automatically at hand. Related work on parallelisation for sparse DNN is presented in (Sattar & Anfuzzaman (2020)) as a solution to the Sparse DNN Challenge posed by MIT/IEEE/Amazon ). However, their work is focused on sparse neural networks created using RadiX-Net ) and trained on GPU which do not evolve the topology over time. Moreover, their solution is implemented in Tensorflow, where sparse layers are represented by dense matrices with a mask over weights. Similarly, in (Mohammad Hasanzadeh Mofrad & Hammoud (2021)), the authors devised a parallel strategy for large sparse neural networks (up to 125 millions of neurons), but even if they employ truly sparse matrices their approach is focused on the inference phase only. Further related work for sparse neural network inference is presented in (Hidayetoğlu et al. (2020);Pawłowski et al. (2020);Hasanzadeh-Mofrad et al. (2020)). In short, none of the available parallel training algorithms is designed for training truly sparse dynamic neural networks.

Activation Functions
Activation functions determine the output of a deep learning model, its accuracy, and also the computational e iciency, which can make or break a large scale neural network. They are essential for an artificial neural network to help the model learn complex non-linear patterns in the data.

ReLU
Rectified Linear Unit (ReLU) is one of the most widely used activation functions in neural networks (Glorot et al. (2011);Nair & Hinton (2010)), which can e ectively solve the problem of vanishing gradient (small gradient prevents the weight from altering its value) and slow training time of saturated activation function. Its mathematical expression is: and its derivatives can be easily calculated with a meagre computational cost; this is a desirable advantage for choosing ReLU to speed up the training. When the input value x i is lower than zero, the resulting derivative will also be zero, leading to a disconnection of the neuron (zero-sparsity). Disconnecting some neurons may reduce overfitting; however, this will hinder the neural network from learning in some cases due to the neurons death problem. The ReLU function also keeps the mean activation value to be greater than zero, which makes it di icult for the network to determine the direction with the fastest gradient drop in the backpropagation process, thus a ecting the network convergence. This "free" sparsity (in terms of neuron activations) obtained by adopting ReLU may represent an advantage for training fully-connected layers. However, this might be untrue for sparsely connected neurons since, in these settings, the fact they are not capable of capturing some significant aspect of the data on the negative side and the dead neurons problem could lead to a higher impact on performance overall.

Leaky ReLU
Leaky ReLU (LReLU) is a modification of ReLU which replaces the zero part of the domain in [− inf, 0] by a low slope α. The reason for using LReLU instead of ReLU is that constant zero gradients can also result in slow learning, as when a saturated neuron uses a sigmoid activation function. Additionally, others do not even activate. According to the authors in (Maas (2013)), this sacrifice of the zero-sparsity might bring worse results than when the neurons are entirely deactivated, suggesting the leaky rectifiers' non-zero gradient does not substantially impact training optimisation.

PReLU
Parametric Rectified Linear Unit (PReLU) is proposed by (He et al. (2015)) and generalizes the traditional rectified unit. The authors reported that its performance was considerably better than ReLU in large-scale image classification tasks. It is the same as Leaky ReLU with the exception that the slope parameter is learned during training via backpropagation.

Activation Functions in Sparse Networks
It is crucial to question weather the activation functions currently used for densely connected networks still behave reliably in the sparse context. S-shaped rectified linear activation unit (SReLU) is a relatively little-known activation function suggested in (Jin et al. (2016)) and used by Mocanu et al. (Mocanu et al. (2018)) in their implementation of the SET algorithm. When compared with the most common activation functions, SReLU in SET models has shown to perform best at all sparsity levels (Dubowski (2020)) in various domain.
Very recently, in (ab Tessera et al. (2021)) the authors have shown that SReLU and PReLU are more suitable activation function for sparse networks as they improve the networks gradient flow and achieve better accuracy when compared to the other activations. In this regard, they proposed a normalised measure of gradient flow called E ective Gradient Flow (EGF), which is better suited to examining the training dynamics of sparse networks. Their results related to activation functions are in line with our findings. Previous papers were already suggesting that low gradient flow is an exacerbated issue in sparse networks ; Evci et al. (2020), but they did not investigate its relation with activation functions.

Neuron Importance
The importance of hidden units in neural networks is still an open problem, crucial to understand neural networks' behaviour and to enhance the explainability of these black-box models. Many papers on dense deep networks speculate about the significance of a neuron towards a prediction. They tend to use the activation value of the hidden unit or its product with the gradient as a proxy for feature importance (e.g. Zagoruyko & Komodakis (2017)); however, both metrics can have undesirable outcomes. To overcome these problems, in (Dhamdhere et al. (2019)), the authors proposed the notion of conductance to gain a better understanding of neuron relevance through extensive ablation studies.
We argue that a neuron importance metric can be straightforwardly identified in sparse neural networks trained with a sparse training algorithm where the topology evolves overtime to find the best weight configuration. The proposed metric is shown to be valuable since it can remove a big chunk of unimportant units and related connections with almost no loss in accuracy. More importantly, this metric can be simultaneously derived for all neurons without requiring expensive computations. To define the importance of a neuron, we borrow some terminology and ideas from graph theory and hence, we introduce them here. In network science, a hub is a high-degree node that occupies a central role in the overall organisation of a network. Hubs have a significantly larger number of links in comparison with other nodes in the network (Barabási & Pósfai (2016)). They can be found in many real networks, such as the brain (van den Heuvel & Sporns (2013)) or the Internet. The loss of such well-connected hubs can be extremely devastating to network function. Given the role of hubs and their significance to networks, their locations and functions in the brain are of clear interest to neuroscientists. Accordingly, we would expect to find a similar biological structure in sparse ANNs as well.

Data availability
The data used in this paper are public datasets, freely available online, as reflected by their corresponding citations from Table 1.

Code availability
Prototype so ware implementations of the models used in this study are freely available online.

Algorithm 1: WASAP-SGD
Result: Trained sparse model θ f s Input: Number of workers K , t = 0, t = 0, phase = 1, epoch = 0, sparsity level S Step size η and momentum µ Training dataset D wit labels I ; Mini-batch size B SGDSparseUpdate(·), a function that updates the weights using momentum SGD TopologyEvolutionStep(·), a function that updates the sparse topology RetainValidUpdates(·), a function that retain only gradients applicable to the topology defined by θ t s Epoch τ 1 and τ 2 , at which to exit phase one and phase two respectively  36 We get K di erent models at the end of phase 2 37 Produce averaged model θ f s and select a fraction S of weights with bigger magnitude for each layer 5 Supplementary Information 5.1 From SReLU to All-ReLU          In this appendix, we illustrate the tuning of the hyperparameter α required by the proposed All-ReLU function when considering the FashionMNIST dataset. In Figure 19, we report the learning curve for di erent values of the slope, averaged over five runs, while in Table 5 are shown the best results in term of accuracy. The best value of α for training a SET model on FashionMNIST is 0.6. To avoid the expensive grid search, we want to propose a practical method to narrow down the choices of this parameter. Based on our intuition, we propose to pick the slope α based on the observed skewness of the input data distribution. When the input is skewed on the le side, the chosen slope should reflect the level of distortion or asymmetry in the data distribution.   It is essential to notice that any alpha level greater than 0.05 leads to better results than ReLU. Hence, even with suboptimal values is possible to attain satisfactory results.

Importance Pruning Post-Training
The proposed method Importance Pruning based on our neuron importance metric can be easily applied one time only, once the sparse training procedure is concluded. With the experiments reported in Table 6, we empirically demonstrate that the Importance Pruning technique should be integrated during training to gain the best results in terms of the memory footprint, running time and accuracy.

Datasets
We evaluate and discuss the performance of our proposed methods on sparse MLP models by considering five publicly available datasets listed in Table 1. We report both the number of data points used as training data and as testing data. The Leukemia dataset is obtained from the NCBI GEO repository (Edgar & Lash (2002)) with the accession number GSE13159. In (Liu et al. (2020b)), the authors provide a table with class labels (unbalanced) and their corresponding number of test samples. HIGGS (Baldi et al. (2014)) dataset is a classification problem to distinguish between a signal process which produces Higgs bosons and a background process which does not. The data has been produced using Monte Carlo simulations. Due to the limited resources available, we merely considered a fraction of the full dataset. Madelon (Guyon et al. (2005)) is an artificial dataset that has five informative features and 15 linear combinations of those features. The other 480 features are probes that provide no information about the class label. Lastly, FashionMNIST (Xiao et al. (2017)) and CIFAR10 (Krizhevsky (2009)) are both image datasets.

Implementation details
The following environment has been selected to implement the parallel algorithm WASAP-SGD, based on the implementation provided by Anderson et al. (2017) and Liu et al. (2020b): Language: Pure Python 3.7 for quick prototyping, where SciPy and Numpy are employed for sparse matrix operations while Numba accelerates some critical part of the code such as backpropagation. Framework: Message Passing Interface (MPI) standard. Library: mpi4py, the library is constructed on top of the MPI-1/2 specifications and provides an object-oriented interface which directly follows MPI-2 C++ bindings.
All the experiments performed in this chapter are executed on a typical laptop with the following configuration:

Experiments hyperparameters
We primarily used the same configuration of parameters as in the original SET paper (Mocanu et al. (2018)). Like SET, our methods were also tested on a multilayer perceptron, in which the fully connected layers have been replaced with sparse layers. The MLP models in subsection 2.2 are trained with sequential momentum SGD (momentum is set to 0.9), weight decay, a dropout rate of 0.3 and fixed learning schedule. In Table 7 we provide an overview of the primary hyperparameters. The slope α for All-ReLU has been mostly identified via grid search. Similarly, the epoch τ, that determines the starting point of Importance Pruning is determined based on a local search within a limited set of value, and it is set to 200 for all models. The fraction ζ of the smallest positive weights and the largest negative weights to be removed is always set to 0.3. Here, controls the sparsity level as discussed in Mocanu et al. (2018), η is the learning rate and B the batch size. Furthermore, to improve the learning process of our networks, we standardise the features of our datasets such that each attribute has zero mean and unit variance. The models in subsection 2.3 employ the same hyperparameters configurations, but the training procedure is parallelised with WASAP-SGD.