A. Federated Averaging (FedAvg)
Federated averaging (FedAvg) is a communication-efficient algorithm for distributed training with a large number of clients [17]. In FL, clients keep their data locally for privacy protection, and a central parameter server coordinates communication between the clients. The central server distributes the model parameters to each client and collects the updated parameters from the clients. It then aggregates these updates to obtain the new global model [17, 19], as depicted in Fig. 1. The FedAvg algorithm alternates between a few local stochastic gradient updates at the client nodes and a model-averaging update at the server [5].
The FedAvg algorithm is known for its simplicity and efficiency in distributed training [5]. When the clients are homogeneous, it is equivalent to local SGD [19]. Several empirical studies have shown that the model produced by FedAvg generalizes well to new, unseen tasks after a few fine-tuning steps [5].
To characterize the iteration complexity required of the clients, a recent study formally established that FedAvg learns an expressive representation in any setting where the underlying shared representation is a linear map [5]. The FedAvg algorithm is computationally efficient but requires a very large number of training rounds to produce good models. Once a minimum level of parallelism over clients is achieved, major speedups are obtained by increasing the computation performed on each client. The amount of computation is controlled by three parameters: C, the fraction of clients participating in each round; E, the number of training passes each client makes over its local dataset per round; and B, the local minibatch size used for the client updates.
The FedAvg algorithm can be described as follows:
• The central server initializes the model parameters.
• In each round, a random subset of clients is selected to participate in the training process.
• The selected clients perform multiple local training steps on their local data using the current model parameters.
• The clients send their updated model parameters to the central server.
• The central server aggregates the model parameters received from the clients by computing their weighted average.
• The central server sends the updated model parameters to all clients.
The pseudocode for the FedAvg algorithm is shown below:
Algorithm 1: FederatedAveraging. The K clients are indexed by k; B is the local minibatch size, E is the number of local epochs, and η is the learning rate.

Server executes:
  initialize \({w}_{0}\)
  for each round \(t = 1, 2, \dots\) do
    \(m \leftarrow \max(C \cdot K, 1)\)
    \({S}_{t} \leftarrow\) (random set of \(m\) clients)
    for each client \(k \in {S}_{t}\) in parallel do
      \({w}_{t+1}^{k} \leftarrow \text{ClientUpdate}(k, {w}_{t})\)
    \({m}_{t} \leftarrow \sum_{k \in {S}_{t}} {n}_{k}\)
    \({w}_{t+1} \leftarrow \sum_{k \in {S}_{t}} \frac{{n}_{k}}{{m}_{t}} {w}_{t+1}^{k}\)

ClientUpdate(\(k, w\)):  // run on client \(k\)
  \(\mathcal{B} \leftarrow\) (split \(\mathcal{P}_{k}\) into batches of size \(B\))
  for each local epoch \(i\) from 1 to \(E\) do
    for each batch \(b \in \mathcal{B}\) do
      \(w \leftarrow w - \eta \nabla \ell(w; b)\)
  return \(w\) to server
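For concreteness, the following is a minimal Python (NumPy) sketch of one FedAvg round following the pseudocode above. It is an illustration rather than our implementation: the function names (client_update, fedavg_round) are ours, and a least-squares loss stands in for the generic loss \(\ell(w; b)\).

import numpy as np

def client_update(w, client_data, epochs=1, batch_size=10, lr=0.01):
    """E local epochs of minibatch SGD on one client; a least-squares loss is a placeholder for l(w; b)."""
    X, y = client_data
    w = w.copy()
    n = len(y)
    for _ in range(epochs):                               # E local epochs
        order = np.random.permutation(n)
        for start in range(0, n, batch_size):              # batches of size B
            b = order[start:start + batch_size]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)      # gradient of the placeholder loss
            w -= lr * grad                                  # w <- w - eta * grad
    return w

def fedavg_round(w_global, clients, C=0.1, **local_kwargs):
    """One server round: sample max(C*K, 1) clients and average their updates weighted by n_k."""
    K = len(clients)
    m = max(int(C * K), 1)
    selected = np.random.choice(K, size=m, replace=False)
    n_k = np.array([len(clients[k][1]) for k in selected], dtype=float)
    updates = np.stack([client_update(w_global, clients[k], **local_kwargs) for k in selected])
    return (n_k[:, None] * updates).sum(axis=0) / n_k.sum()  # sum_k (n_k / m_t) w_{t+1}^k

In our actual experiments the client update is a PyTorch training loop rather than this least-squares placeholder, but the sampling and weighted-averaging structure is the same.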
The FedAvg algorithm has been shown to achieve good performance in various settings [12]. It has also been extended and modified in various ways to improve its performance and efficiency [11, 17, 21].
To evaluate the effectiveness of federated learning with differential privacy, we implemented the algorithm in Python 3.9. Our implementation served as the basis for experiments investigating the algorithm's performance with client-side differential privacy across various privacy settings and numbers of participants.
B. Differential Privacy
Differential privacy (DP) is a mathematical framework for ensuring the privacy of individuals when their data is collected and used by a machine learning algorithm. It was first introduced in [6]. Mathematically, a mechanism \(M\) satisfies (\(\epsilon\), \(\delta\))-differential privacy if, for all possible datasets \(D\) and \(D'\) that differ in a single individual's data point, and for all possible subsets \(S\) of the output space [6],
$$Pr\left[M(D) \in S\right] \le {e}^{\epsilon }\, Pr\left[M\left(D'\right) \in S\right] + \delta \quad (1)$$
Here, \(\epsilon\) controls the privacy guarantee: smaller values of \(\epsilon\) provide stronger privacy, and \(\epsilon = 0\) corresponds to perfect privacy but also to low utility. For example, with \(\epsilon = 3\) (the budget used in our experiments), the probability of any particular output can change by at most a factor of \({e}^{3} \approx 20\) (plus \(\delta\)) when a single record changes. The parameter \(\delta\) allows a small probability of failing to satisfy pure \(\epsilon\)-differential privacy.
The key idea behind DP is that an algorithm is differentially private if, for any two datasets that differ in only one individual's data, the outputs of the algorithm are nearly indistinguishable. This means that little can be learned about any individual's data by observing the output of the algorithm. There are several ways to add noise in order to achieve DP. The Gaussian mechanism, the most common, adds a random value drawn from a Gaussian distribution to each released quantity, while the Laplace mechanism adds a random value drawn from a Laplace distribution.
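As a brief illustration of this idea (not part of our training pipeline), the sketch below applies the Laplace mechanism to a counting query. The function name is ours, and the sensitivity of 1 assumes that adding or removing one record changes the count by at most one.

import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release true_value with Laplace noise of scale sensitivity / epsilon."""
    rng = rng if rng is not None else np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately release a count whose sensitivity is 1
# (adding or removing one record changes the count by at most 1).
noisy_count = laplace_mechanism(42, sensitivity=1.0, epsilon=0.5)

A smaller \(\epsilon\) yields a larger noise scale and hence stronger privacy at the cost of accuracy, which is the trade-off discussed next.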
The amount of noise that is added to the data is determined by the privacy budget. The privacy budget is a measure of how much privacy is desired. A larger privacy budget means that less noise is added to the data, which results in a more accurate model. However, a larger privacy budget also makes it easier for an attacker to learn about an individual's data.
In [1], the authors show that it is possible to train deep learning models with DP. They do this by using a technique called differentially private stochastic gradient descent (DP-SGD). DP-SGD is a variant of stochastic gradient descent that adds noise to the gradients during training. This helps to ensure that the model does not learn too much about any individual's data. They also discuss the challenges of applying DP to deep learning. One challenge is that DP can make the training process more difficult and less efficient. Another challenge is that it can be difficult to choose the right privacy budget. Despite these challenges, DP is a promising approach for protecting the privacy of individuals when their data is used to train deep learning models.
In this study, we employed the differentially private stochastic gradient descent (DP-SGD) algorithm, which adds random noise to the gradients computed by SGD in order to provide differential privacy guarantees. The amount of noise is controlled by two parameters: the noise multiplier and the privacy budget. The noise multiplier determines the strength of the added noise, while the privacy budget determines the level of privacy protection provided by the algorithm.
The two most important techniques used in the DP-SGD algorithm are gradient clipping and momentum. Gradient clipping limits the magnitude of the gradients to prevent large updates that could compromise privacy. Momentum, on the other hand, helps the algorithm converge more quickly by adding a fraction of the previous gradient to the current gradient.
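The sketch below illustrates the core DP-SGD update (per-example gradient clipping followed by Gaussian noise) in plain PyTorch for exposition only; in our experiments these steps are handled by Opacus rather than by this code, and the function name and hyperparameter values here are placeholders.

import torch

def dp_sgd_step(model, loss_fn, xb, yb, lr=0.1, max_grad_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD step: clip each example's gradient, add Gaussian noise, then update."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(xb, yb):                                   # per-example gradients
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        clip = torch.clamp(max_grad_norm / (norm + 1e-6), max=1.0)  # scale down to max_grad_norm
        for s, g in zip(summed, grads):
            s += g * clip
    with torch.no_grad():
        for p, s in zip(params, summed):
            noise = torch.randn_like(s) * noise_multiplier * max_grad_norm
            p -= lr * (s + noise) / len(xb)                    # noisy average gradient

Opacus performs the same per-sample computation in a vectorized, efficient way and additionally tracks the cumulative privacy loss across training steps.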
To reduce the risk of data leakage during transmission between the hospitals and the global model, we applied the DP-SGD algorithm on the client side (as depicted in Fig. 3) using the Opacus framework [22].
Opacus, a Python library, is a pivotal instrument in this domain, facilitating privacy-preserving training of deep learning models by injecting noise into gradient computations. At the core of Opacus lies the delicate balance involved in adding noise to gradients: excessive noise can obscure crucial information and compromise model performance, while inadequate noise fails to provide robust privacy guarantees (notice the noised models in Fig. 3 before their updates are sent to the global model). Opacus navigates this balance by examining the norms of the gradients, ensuring an acceptable trade-off. Opacus also acknowledges that outliers in the data are prone to divulge sensitive information because of their disproportionately large gradients. To mitigate this risk, Opacus adopts a per-sample approach: it computes the gradient of each sample within a minibatch and subjects it to individual gradient clipping. This strategy ensures that no single sample unduly influences the model's learning process, enhancing privacy protection; see Fig. 2 for an illustration [22]. The library provides a set of utilities for adding differential privacy to PyTorch models, including the make_private_with_epsilon function, which automatically sets the noise multiplier based on the desired privacy budget (\(\epsilon\)).
In our experimental setup, we used the make_private_with_epsilon function to add differential privacy to the client-side training process with a target privacy budget of \(\epsilon = 3\). The function automatically calculated the corresponding noise multiplier from the specified epsilon value, batch size, and other parameters.
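For reference, a minimal sketch of this setup is shown below using the Opacus PrivacyEngine API. The model, data, and hyperparameter values other than the privacy budget are placeholders chosen for illustration and do not correspond to the configuration reported in our experiments.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Placeholder model and data; our real model and hospital datasets differ.
model = nn.Sequential(nn.Linear(20, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
dataset = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))
train_loader = DataLoader(dataset, batch_size=32)

privacy_engine = PrivacyEngine()
model, optimizer, train_loader = privacy_engine.make_private_with_epsilon(
    module=model,
    optimizer=optimizer,
    data_loader=train_loader,
    epochs=5,                 # planned local epochs for this client (placeholder)
    target_epsilon=3.0,       # privacy budget used in our experiments
    target_delta=1e-5,        # assumed delta, set relative to the dataset size
    max_grad_norm=1.0,        # per-sample gradient clipping bound (placeholder)
)

After this call, the standard PyTorch training loop can be used unchanged, with Opacus clipping per-sample gradients and injecting the noise required to meet the target budget.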
Overall, our experimental setup with differential privacy enabled us to evaluate the effectiveness of the FedAvg algorithm in a decentralized learning task while preserving the privacy of the individual clients' data. By combining FedAvg with differential privacy, we believe that our results can help pave the way for practical applications of federated learning in sensitive domains, such as healthcare or finance, where data privacy is a critical concern.