Migrating federated learning to centralized learning with the leverage of unlabeled data

Federated learning carries out cooperative training without sharing local data; the resulting global model generally performs better than independent local models. Because local data never leave the clients, federated learning preserves the privacy of local users. However, the performance of the global model may degrade when diverse clients hold non-IID training data, since the differing distributions of local data lead to weight divergence among local models. In this paper, we introduce a novel teacher–student framework to alleviate the negative impact of non-IID data. On the one hand, we retain the privacy-preserving advantage of federated learning; on the other hand, we exploit the accuracy advantage of centralized learning. We use unlabeled data and global models as teachers to generate a pseudo-labeled dataset, which significantly improves the performance of the global model; in turn, the global model as a teacher provides more accurate pseudo-labels. In addition, we perform a model rollback to mitigate the impact of latent noisy labels and data imbalance in the pseudo-labeled dataset. Extensive experiments verify that our teacher ensemble enables more robust training, and an empirical study verifies that reliance on the centralized pseudo-labeled data makes the global model almost immune to non-IID data.


Introduction
Federated learning is a promising decentralized, privacy-preserving learning paradigm in which all local users cooperatively train a global model without exposing local private data [1]. Each client trains a local model and sends parameters or gradients to the server for global aggregation; the server returns a global model to the clients for further training iterations. Compared with centralized learning, federated learning preserves clients' privacy at the cost of a modest accuracy loss [2]. However, data from clients are not always independent in practice. Non-IID (non-independent and identically distributed) data cause weight divergence among local models and deteriorate the performance of the corresponding aggregated global model, incurring a further accuracy loss [3,4]. The non-IID data issue is also called the problem of statistical heterogeneity.
Leveraging unlabeled data is one approach to alleviating the adverse effects of non-IID data [5][6][7][8][9][10][11][12][13]. It has some advantages: unlabeled data are easier to collect than labeled data, and fewer privacy concerns are raised over unlabeled data. For example, when a small financial agency wants to investigate users' salaries, it can collect more data if it forgoes direct questions about salary and asks marginal questions instead, because people with very high or low salaries, or in specific industries, are more sensitive about their individual salaries. Although anonymization can protect user privacy, it is not absolutely safe in the face of de-anonymization attacks. The marginal information without a direct target is exactly the unlabeled data we want to utilize. As users' privacy concerns decline when only nonsensitive data are collected, the agency, which we will consider a client, is able to access more information. Since the client cannot make good use of these data without direct target information, especially when the data distribution is biased, a larger agency can act as a server to collect this non-sensitive marginal data and use it for prediction.
Different studies make different assumptions about where the unlabeled data reside and how to utilize them. Some literature [5][6][7][8] leverages unlabeled data on the clients with local semi-supervised learning methods to improve the performance of federated learning. Local semi-supervised learning relies on clients to make use of the unlabeled data, which puts more storage and computation pressure on local clients; this may be unacceptable for resource-limited devices. Others [9][10][11][12][13] make the opposite assumption that the unlabeled data are on the server, where they are used for auxiliary centralized learning and for alleviating the weight divergence of local models. However, most of these methods leverage the data poorly when encountering non-IID data. Federated distillation [9][10][11] communicates logits between server and clients and aggregates the logits to generate pseudo-labels for the unlabeled data; the aggregation accuracy of the logits depends on the size of the unlabeled dataset and declines significantly on non-IID data, resulting in a decrease in model performance. Ensemble learning on unlabeled data with the local models and the global model as base learners [12,13] suffers from unstable pseudo-labeling accuracy due to the existence of biased local models.
In this work, we introduce a novel teacher-student framework to alleviate the negative impact of non-IID data. On the one hand, we maintain the privacy-preserving advantage of federated learning; on the other hand, we take advantage of the accuracy of centralized learning. We focus on the utilization of unlabeled data on the server side to improve the performance of federated learning. We collect the aggregated global models from different training rounds, in time-series order, to form an ensemble of T teachers (see Fig. 1). The teacher ensemble gives pseudo-labels to the unlabeled data in an ensemble way and thereby promotes the next round of global model training.
Different from previous work, we change the generation of the global model. As a result of training with the pseudo-labeled data, we obtain a better global model and a better teacher model with which to update a base learner in the ensemble. The pseudo-labeling of the ensemble migrates the knowledge contained in the distributed labeled data to the centrally collected unlabeled data, which has a feature space similar to the local data, a more uniform distribution, and more accurate pseudo-labels. We thus bring the performance of distributed federated learning closer to that of centralized learning.
Compared with pseudo-labeling by a single model, a teacher ensemble can provide more accurate pseudo-labels. The key to ensemble learning is base learners with good performance and diversity; diversity refers to the differences among the learners, especially differences in their outputs [14]. As multiple models join the teacher ensemble, combining their different outputs through a specific voting method produces more accurate output. Due to the randomness of the model training process, at least before the global model converges, the collection of global models over time still maintains diversity and brings the advantages of ensemble learning into play. In addition, this collection method ensures the consistency and stability of the global model. We show that an ensemble of aggregated global models keeps pseudo-labeling at high accuracy and confidence, even on non-IID data.
We evaluated our method under different local distributions on CIFAR-10/100 with a CNN and a ResNet-8. Experiments show that our method improves the utilization of unlabeled data in terms of the quantity of valid data, label accuracy, and distribution, and achieves higher test accuracy. With the knowledge learned from local labeled data transferred to the unlabeled data, we achieve performance comparable to centralized supervised learning on a dataset of the same size as the unlabeled data.

Preliminary
Federated learning is a distributed machine learning framework. It uses the computation capability of edge devices to collaboratively train a global model on a server, which generally performs better than any independent local model. At the same time, each local device does not send personal data to the server but keeps it locally to protect data privacy. The collaboration between the various local devices is embodied in the aggregation step of federated learning. Federated averaging (FedAvg) [1] is one of the most commonly used aggregation methods; it uses a weighted average of all local models' parameters as the aggregated global model, with weights proportional to the amount of data on the local devices.
We consider N local clients, M of which participate in each training round. Each model is parameterized by w; a labeled dataset is denoted as D = {(x_j, y_j)}_{j=1}^{|D|}, y_j ∈ {1, . . . , C}, where C is the number of categories for classification. Given a predictor f : x → y and a loss function l : y × y → R, the risk of a model parameterized by w on a classification task over D is defined as L(D; w) = (1/|D|) Σ_{(x,y)∈D} l(f(x; w), y).

Federated learning There are two components in federated learning: multiple clients with local models and a server with a global model. The server provides an initial model to the clients; the clients update the model with their private data and submit the model parameters to the server. The aggregated global model is then used as a new initial model, and federated learning starts a new round of training. This procedure iterates for R rounds.
FedAvg [1] is a standard aggregation method in federated learning. With model parameters exchange, clients collaboratively train a global model without exposing their data. By averaging the parameter values of multiple models, FedAvg aggregates the multiple models into a single model to complete the aggregation process in federated learning.
In local training, each client i performs supervised learning with its labeled private data D_i after being initialized with the global model:

w_i ← w_i − η∇L(D_i; w_i), (1)

where w_i is the parameter of the local model on client i, η is the learning rate, and cross-entropy is used as the loss function in L(·) in this work. In FedAvg, the server averages the local models to obtain an updated global model:

w_g = Σ_{i∈S} (|D_i| / Σ_{j∈S} |D_j|) w_i, (2)

where S is the collection of clients participating in the training with |S| = M, and w_g is the averaged global model. The proportion of the local data to the total training data is used as the weight of the local model parameters. In the standard process of federated learning, the global model w_g is then sent back to the clients for the next round of training, and so on for a finite number of rounds to complete the whole learning process, as shown in Fig. 2a.
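To make the aggregation step concrete, here is a minimal NumPy sketch of FedAvg-style weighted averaging. It is not the authors' implementation; the function name `fedavg` and the flat-vector model representation are illustrative assumptions.

```python
import numpy as np

def fedavg(local_weights, data_sizes):
    """Weighted average of local model parameters, as described above.

    local_weights: one flat parameter vector per client.
    data_sizes:    number of samples per client; the aggregation weight
                   of client i is |D_i| / sum_j |D_j|.
    """
    total = float(sum(data_sizes))
    w_global = np.zeros_like(np.asarray(local_weights[0], dtype=float))
    for w_i, n_i in zip(local_weights, data_sizes):
        w_global += (n_i / total) * np.asarray(w_i, dtype=float)
    return w_global
```

In practice each entry of `local_weights` would be a full parameter set (e.g., a flattened state dict), but the weighting logic is the same.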
Non-IID data In contrast to ideal independently and identically distributed data, non-IID data are a real and natural occurrence. Data generated by different users, from different geographic locations and in different time windows, lead to non-identical data distributions [15]. With P_i and P_j denoting the data distributions of any two clients i and j, and x a sample with label y, non-IID data in federated learning typically refers to differences between P_i and P_j. From the perspective of the joint distribution, i.e., P(x, y) = P(y|x) · P(x) = P(x|y) · P(y), there are four main ways the distributions can differ: a. different P(y|x) with the same P(x); b. different P(x) with the same P(y|x); c. different P(x|y) with the same P(y); d. different P(y) with the same P(x|y). In addition, unbalancedness across clients' data is also a kind of non-IID setting. In fact, data distributions between clients may contain a more complex mixture of these effects.

Fig. 2 Overview of D2C-FL. a shows the framework of federated learning; we use classical FedAvg as the aggregation method. b shows the way we generate the global model
Most existing work on simulating non-IID data focuses on making P(y) differ. Two methods are mainly used: i. distribute data directly in proportion to a given percentage according to the established data preferences of different clients; ii. randomly distribute data to clients following a Dirichlet distribution. The former is much easier to operate.
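The Dirichlet-based simulation mentioned above can be sketched as follows. This is a hypothetical helper, not from the paper; the function name, the `alpha` default, and the per-class splitting scheme are assumptions.

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Split sample indices across clients with per-class proportions
    drawn from Dir(alpha); smaller alpha gives more skewed (more
    non-IID) clients."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet([alpha] * n_clients)
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for i, part in enumerate(np.split(idx, cuts)):
            client_idx[i].extend(int(j) for j in part)
    return client_idx
```

Every sample index is assigned to exactly one client, so the partition is disjoint and complete.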
When non-IID data are encountered, the distribution differences between local data lead to weight divergence of local models, and the model obtained by FedAvg deviates from the ideal model. Introducing additional data into training is a common approach to alleviating the adverse effects of non-IID data. In particular, unlabeled data are much easier to collect than labeled data, especially when the unlabeled data contain less private information and no relevance to a specific user. Since both are produced by users, the distribution of the collected unlabeled data is somewhat consistent with the joint distribution of all local data.

Overview of the teacher-student framework
In this section, we propose a novel teacher-student framework that migrates decentralized learning to centralized learning in federated learning (D2C-FL) and alleviates the impact of non-IID data. With a teacher ensemble, we migrate the knowledge of the feature-to-label mapping from the distributed labeled data to the centralized unlabeled data. With this auxiliary training, we aim to bring the performance of decentralized federated learning closer to that of centralized learning.
In federated learning, the statistical heterogeneity of local data causes weight divergence of local models and thus deteriorates the performance of the global model. Rather than sending the aggregated model back directly, we change the way the global model is obtained. Figure 2 shows an overview of our method. We take the averaged model as a temporary global model and use the student model learned from the pseudo-labeled data as an auxiliary to generate a teacher, which also serves as the new global model. The teacher ensemble consists of T adjacent global models, as shown in Fig. 1. The pseudo-labeling of the ensemble generates pseudo-labels, which are used as supervision to learn a student model. In general, the prediction of multiple teacher models in an ensemble achieves better accuracy than that of a single model. The generated teacher, instead of the fragile averaged model, is sent back to the local clients for their next round of training. To mitigate the impact of possible noise and imbalance in the pseudo-labeled data, we reset the global model to varying degrees before the next global update.
We assume that the server can meet the additional storage and computation requirements. The above process is repeated a finite number of times until it converges to an ideal student model. We summarize the whole training process in Algorithm 1.

Algorithm 1 Illustration of D2C-FL.
We collect global models in federated learning to form a teacher ensemble. With pseudo-labeling an unlabeled dataset with the ensemble, we perform further training on the centralized pseudo-labeled data and re-update the global model.

Require: initial model weights w_init; unlabeled data D_u; size of teacher ensemble T; number of participants in each round M; local labeled data D_i; learning rate η.
Ensure: global model weights w_g.
 1: Initialization:
 2: w_g ← w_init;
 3: Fill the Teacher Ensemble with T copies of w_init.
 4:
 5: procedure Server
 6: for r ← 1 to R do
 7:   Sample M clients as S to participate in training.
 8:   for each client c_i ∈ S do
 9:     w_i ← w_g;
10:     w_i ← Update(w_i, D_i, η); Equation (1)
11:     Send w_i to the server.
12:   end for
13:   w_g ← Aggregate the local models; Equation (2)
14:   Get pseudo-labeled dataset D_pseudo; Equations (3)-(9)
15:   w_stu ← Update(w_stu, D_pseudo, η); Equation (10)
16:   Get a new teacher w_teacher and roll back student model w_stu; Equations (11) and (12)
17:   Replace the most corrupt model in the Teacher Ensemble with w_teacher;
18:   w_g ← w_teacher.
19: end for
20: end procedure
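As a reading aid, the server loop above can be sketched in Python with toy vector models. This is a sketch, not the authors' code: the callback names (`local_fn`, `pseudo_fn`, `train_fn`), the fixed three clients per round, and the eviction of the oldest teacher in place of the paper's "most corrupt" rule are all illustrative assumptions.

```python
import numpy as np
from collections import deque

def d2c_round(w_g, teachers, w_stu, local_fn, pseudo_fn, train_fn,
              alpha, beta, w_init):
    """One D2C-FL communication round (toy sketch)."""
    # 1. Clients initialize from w_g and train locally; the server
    #    averages the returned models (toy: three identical clients).
    w_g = np.mean([local_fn(w_g) for _ in range(3)], axis=0)
    # 2. The teacher ensemble pseudo-labels the unlabeled data.
    d_pseudo = pseudo_fn(list(teachers))
    # 3. The student learns from the pseudo-labeled dataset.
    w_stu = train_fn(w_stu, d_pseudo)
    # 4. New teacher: weighted mix of averaged model and student.
    w_teacher = alpha * w_g + (1 - alpha) * w_stu
    # 5. Student rollback toward a fresh random initialization.
    w_stu = beta * w_stu + (1 - beta) * w_init
    # 6. Evict one ensemble member; the new teacher becomes the
    #    next global model sent back to the clients.
    teachers.popleft()
    teachers.append(w_teacher)
    return w_teacher, teachers, w_stu
```

The callbacks stand in for real local training, ensemble pseudo-labeling, and student training; only the control flow mirrors Algorithm 1.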

Pseudo-labeling the unlabeled data
We collect unlabeled data and obtain pseudo-labels through the pseudo-labeling of the teacher ensemble. For the sake of simplicity, we do not use the logits output of the ensemble as soft labels to learn a student model, as would be done in distillation. We directly use the class value assigned by the logits as a hard label; the dataset composed of pseudo-labels and unlabeled data then has the same form as a dataset used in general supervised learning.
To get a single-valued label for a sample, we assign the class value with the highest probability among all the teachers as the pseudo-label. Relying on the highest probability means that the most confident teacher in the ensemble determines the label of the sample, which gives full play to the high-quality teachers in the ensemble. For an unlabeled sample x_u, let f_i(x_u; w_i) denote the output of teacher i in the teacher ensemble, i.e., the logits, where w_i is the parameters of the teacher model. The labels given by the T teachers and their maximum prediction probabilities are

y_max_pseudos = {arg max_c f_i(x_u; w_i)}_{i=1,...,T}, (3)
y_max_probs = {max_c softmax(f_i(x_u; w_i))}_{i=1,...,T}, (4)

and the pseudo-labeling process can be expressed as:

y_pseudo = y_max_pseudos[arg max{y_max_probs}] if max{y_max_probs} ≥ τ; null if max{y_max_probs} < τ, (5)

where y_pseudo is the final pseudo-label given to sample x_u. The threshold τ determines whether the data qualify for subsequent training. If the highest probability reaches the threshold τ, the corresponding label is assigned to the sample as its pseudo-label and further learned by the student model; otherwise, the sample is discarded from the subsequently generated dataset. Likewise, samples with extremely high prediction probabilities, e.g., greater than 0.9999, are most likely the product of overconfidence in a biased model and are also discarded. Discarding data with very low or extremely high prediction probabilities mitigates over-fitting of the student model to noisy labels. A moderate but sufficiently efficient τ is essential: since τ filters the labels according to the probabilities, i.e., the logits normalized by the softmax function, the choice of τ has a direct impact on the labeling accuracy and the quantity of the pseudo-labeled data.
A small τ may admit many hard negative samples into the pseudo-labeled data and decrease the labeling accuracy, while a large τ may lose many easy positive samples and reduce the quantity of pseudo-labeled data. Pre-experiments tell us that setting τ to a fixed value is the simplest yet effective approach.
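A minimal sketch of this most-confident-teacher labeling with the two-sided confidence filter might look as follows. The helper is hypothetical; `tau_hi` stands in for the 0.9999 over-confidence cutoff mentioned above, and the teachers' outputs are assumed to be softmax probability vectors already.

```python
import numpy as np

def max_teacher_pseudo_label(probs, tau=0.75, tau_hi=0.9999):
    """Pick the label from the single most confident teacher.

    probs: array of shape (T, C), the softmax outputs of T teachers
           for one unlabeled sample. Returns a class index, or None
           when the top confidence is below tau or suspiciously high.
    """
    probs = np.asarray(probs)
    top_conf = probs.max(axis=1)       # each teacher's max probability
    best = int(top_conf.argmax())      # index of the most confident teacher
    conf = float(top_conf[best])
    if conf < tau or conf >= tau_hi:   # filter low- and over-confident
        return None
    return int(probs[best].argmax())
```

Samples mapped to `None` are simply dropped from the pseudo-labeled dataset.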
Averaging the predictions of the teachers as the output of the ensemble is another way to get pseudo-labels, and it is a commonly used method in ensemble learning. With the averaged output

p_avg = (1/T) Σ_{i=1}^{T} softmax(f_i(x_u; w_i)), (6)
y_avg_pseudo = arg max_c p_avg, y_avg_prob = max_c p_avg, (7)

the corresponding pseudo-labeling process can be expressed as:

y_pseudo = y_avg_pseudo if y_avg_prob ≥ τ; null if y_avg_prob < τ, (8)

where y_avg_pseudo and y_avg_prob are the label and the corresponding probability given by the average output of the T teachers. After all the pseudo-labels are checked against the threshold τ and assigned to the unlabeled data, we finally get a new labeled dataset:

D_pseudo = {(x_u, y_pseudo) | y_pseudo ≠ null}, (9)

where we use D_pseudo to distinguish it from the local data D_i. Then, the student model is updated through

w_stu ← w_stu − η∇L(D_pseudo; w_stu), (10)

where w_stu is the parameters of the student model. The student model serves as an auxiliary to generate a new global & teacher model and to update the teacher ensemble.
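The averaging variant and the construction of D_pseudo can be sketched similarly. Again these are hypothetical helpers, assuming each teacher's output is already a softmax probability vector.

```python
import numpy as np

def avg_teacher_pseudo_label(probs, tau=0.75):
    """Average the T teachers' softmax outputs and threshold the result."""
    mean = np.asarray(probs).mean(axis=0)   # shape (C,)
    if mean.max() < tau:
        return None                         # below threshold: discard
    return int(mean.argmax())

def build_pseudo_dataset(unlabeled, teacher_probs, tau=0.75):
    """Keep only samples whose ensemble confidence passes the threshold."""
    pairs = []
    for x_u, probs in zip(unlabeled, teacher_probs):
        y = avg_teacher_pseudo_label(probs, tau)
        if y is not None:
            pairs.append((x_u, y))
    return pairs
```

The surviving (sample, pseudo-label) pairs form the dataset the student model is trained on.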

Update the teacher ensemble
We update the teacher ensemble as shown in Fig. 1. In the first T rounds, the student model remains the same as the averaged global model; therefore, the first T collected teachers are exactly the first T averaged global models. Since we do not want randomly initialized teachers to give bad predictions on the unlabeled data before the target number of available models has gathered in the ensemble, the collection of teachers requires T rounds of preparation.
Once the target number of teachers is reached, we are able to perform pseudo-labeling to get a new dataset and therefore to train a new student model, which differs from the averaged model. A new teacher is generated by:

w_teacher = α · w_g + (1 − α) · w_stu, w_g ← w_teacher, (11)

where w_teacher is the parameter of the generated teacher, and w_g ← w_teacher means the teacher is sent back to the clients as the new global model for the next training round. Here, α is the weight of the averaged global model. When α = 0, the student model becomes the new teacher and is sent to the clients directly, which is consistent with [12]; when α = 1, the averaged global model becomes the new teacher, and the unlabeled data and the student model lose their auxiliary value. Neither the student nor the averaged global model alone is the best choice for the teacher.
The most corrupt teacher in the ensemble is replaced by the new teacher to update the ensemble. As long as neither the averaged model nor the student model has converged, our teacher ensemble maintains its diversity, which is important for ensemble learning.
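A compact sketch of this teacher update follows. `update_teacher` is a hypothetical helper, and evicting the oldest ensemble member stands in for the "most corrupt" replacement rule, which would require evaluating each teacher.

```python
import numpy as np

def update_teacher(w_avg, w_stu, teachers, alpha):
    """Blend the averaged global model and the student into a new
    teacher; the new teacher replaces one ensemble member (here the
    oldest) and will be sent back to clients as the next global model."""
    w_teacher = (alpha * np.asarray(w_avg, dtype=float)
                 + (1 - alpha) * np.asarray(w_stu, dtype=float))
    return w_teacher, teachers[1:] + [w_teacher]
```

With alpha = 0 the student alone becomes the teacher; with alpha = 1 the averaged model does, and the student has no effect.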

Model rollback
To alleviate the confirmation bias [16] of the global model on latent noisy labels and data imbalance in the pseudo-labeled data, we add randomness to each round of training via model rollback.
Rollback of student If the teachers in the ensemble have a poor alignment between their prediction probabilities and test accuracy, the threshold τ cannot filter out the noisy data during pseudo-labeling. To this end, we perform a rollback on the student model in its iterative updating. Specifically, we take a weighted average of the student model and a new randomly initialized model:

w_stu ← β · w_stu + (1 − β) · w_init, (12)

where w_init is the new randomly initialized model and β is the weight of the student model in the re-initialization. β = 0 means resetting the student model to a completely random model at the beginning of each training round; β = 1 means withdrawing the rollback of the student.
When learning with pseudo-labeled data, resetting the model in each training round prevents the model from over-fitting wrong labels and yields a more stable convergence.
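The student rollback reduces to a one-line interpolation; the sketch below (a hypothetical helper) makes the two edge cases explicit.

```python
import numpy as np

def rollback_student(w_stu, w_init, beta):
    """Pull the student part-way back toward a random initialization.

    beta = 0 resets the student completely; beta = 1 disables the
    rollback and keeps the trained student unchanged.
    """
    return (beta * np.asarray(w_stu, dtype=float)
            + (1 - beta) * np.asarray(w_init, dtype=float))
```

Intermediate beta keeps part of the learned parameters while injecting enough randomness to escape confirmed noisy labels.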

Rollback of teachers
As detailed in Sect. 3.3, each updated student model is combined with the averaged temporary global model to generate a new teacher. We regard this combination as a rollback of the teacher: compared with taking the student as the teacher directly, we regress the teacher to an intermediate point between the student model and the temporary global model. Even though the student model draws on a much richer and more even dataset, it is not appropriate to let it become the teacher on its own. If the student model were used to re-pseudo-label the very dataset it learned from, its wrong predictions would be learned and reinforced again, deteriorating the performance of the global model [17].
Analysis We validated our method with different α and β on CIFAR-10 with a simple CNN (2 convolutional layers). We considered 100 clients, 10 of which participated in each round, with 20000 local labeled data and 10000 unlabeled data. Each client performed supervised learning on the same amount of 600 samples. We simulated two different non-IID settings; the data distributions of 20 randomly sampled clients are shown in Fig. 3. The test accuracies with different α and β are shown in Fig. 4. Neither training based on a randomly initialized model (i.e., β = 0) nor on the student of the last round alone (i.e., β = 1) achieves the best test accuracy. Likewise, making the student the teacher directly (i.e., α = 0) is not necessarily the optimal solution; an intermediate value of α can also be optimal. However, taking the model obtained by average aggregation as the teacher (i.e., α = 1) is definitely not optimal, since with α = 1 the global model is independent of the student model. Furthermore, the global model is then not affected by β at all, as shown at the bottom of Fig. 4: α = 1 makes the student model a redundant model on the server side, and the local clients learn nothing from the unlabeled data. The rollback of the student and teachers realizes better performance, and the rollback of the student clearly has the larger effect on the global model.

Experimental setup
Datasets and models In the experiments, we evaluate classification on CIFAR-10/100 [18]. CIFAR-10/100 contains 50000 training images and 10000 test images over 10/100 classes. To highlight the guidance of unlabeled data on the overall training performance, we allocate more data to the server as unlabeled data. We assign 20000 training images to the clients; the rest of the training data is assigned to the server along with the whole test set. 80% of the data on the server side is used as unlabeled training data, and 20% is used to test the global model. We use a CNN with two convolutional and three fully connected layers, which has the same structure as LeNet-5 [19]. We also compare the performance of ResNet-8, following [12], as a more complex model.

Federated learning settings Referring to the setting in [20], we consider 100 clients in total, 10 of which are selected in each round to participate in training. Each client keeps 600/100 samples drawn from the 20000-sample data pool according to the client's specific data distribution preference for CIFAR-10/100. When the data in the pool are deficient, the pool is replenished with already-drawn data; as a result, there is a certain amount of data duplication between clients.
Heterogeneity settings In addition to the IID setting, we also consider two classical non-IID settings following [21]: α-bias and 2-class. IID: the data on each client are distributed evenly, with every client fetching the same amount of data from each class in the local data pool. α-bias: each client has a data preference for a certain class to the degree of α, meaning that a fraction α of its data belongs to the preferred class, while the rest is distributed uniformly over the other classes. In this experiment, we fix α at 0.8 for the corresponding training and evaluations. 2-class: the data on each client belong evenly to two classes.
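The α-bias preference split can be sketched as follows. This is a hypothetical helper, not the paper's code; the preferred-class assignment (class i mod C) and the defaults are assumptions, and integer rounding may leave each client slightly short of `per_client` samples.

```python
import numpy as np

def alpha_bias_partition(labels, n_clients, alpha=0.8, per_client=600, seed=0):
    """Give client i a fraction `alpha` of its samples from a preferred
    class and spread the remainder evenly over the other classes."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes = list(np.unique(labels))
    # One shuffled pool of indices per class; popping keeps clients disjoint.
    pools = {c: list(rng.permutation(np.where(labels == c)[0]))
             for c in classes}
    clients = []
    for i in range(n_clients):
        pref = classes[i % len(classes)]
        n_pref = int(alpha * per_client)
        n_rest = (per_client - n_pref) // (len(classes) - 1)
        idx = [pools[pref].pop() for _ in range(n_pref)]
        for c in classes:
            if c != pref:
                idx += [pools[c].pop() for _ in range(n_rest)]
        clients.append(idx)
    return clients
```

The 2-class setting could be obtained analogously by popping per_client/2 samples from each of two assigned class pools.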
Training details We set both local epochs and global epochs to 5, the batch size to 128, and the initial learning rate of local training to 0.001, decayed by a factor of 0.95 each epoch. Following the suggestion of [13], we directly set the size T of the teacher ensemble to 10 for the experiments. The threshold τ for filtering unlabeled data is set to 0.75. The optimal (α, β) in the model rollback varies with the model and the dataset. Adam is used as the optimizer and is reset in each communication round to obtain a cyclical learning rate.
Baselines FedAvg [1], as a typical aggregation method, is used as one of the baselines. FedProx [22] is considered an effective and efficient method for non-IID federated learning; we also conducted a comparison experiment with it. We further compared our method with FedDF [12] and FedBE [13]. The difference between them lies in the composition of the teacher ensemble: FedDF takes the local models as the teacher ensemble directly, while FedBE collects teachers by using multiple linear combinations of local models to simulate sampling from the local models' distribution. The coefficients of the linear combinations follow a Dirichlet distribution, and this simulation achieves test accuracy comparable to sampling models from a Gaussian. Distillation with soft labels is omitted to simplify the experiment; hard labels are used as a substitute to retrain a global student model. The One-Shot federated learning method [20] is not considered since it is similar to a one-round version of FedDF, except that it performs more local training epochs in each round. As the effect of our method largely depends on the leverage of centralized unlabeled data, we also compared our method with centralized learning on the same amount of labeled data.

Performance under different data settings
In this section, we compare the performance under different data settings. We evaluated the convergence of the global model from the aspect of test accuracy on CIFAR-10 and CIFAR-100, respectively.
Performance under the IID data setting Figure 5 shows the test accuracies of all the baselines with CNN and ResNet-8 under the IID data setting. All methods with a teacher ensemble begin pseudo-labeling at round 10, before which their accuracies are consistent with FedAvg. Benefiting from the leverage of unlabeled data, our method achieves a significant improvement over FedAvg on both CIFAR-10 and CIFAR-100. Thanks to the pseudo-labeling of the unlabeled data, our model obtains a new, richer dataset to learn from and achieves a much better generalization effect. Due to the smaller amount of data on clients and the inherent complexity of classification on CIFAR-100, FedAvg performs poorly there, and the improvement of our method is more obvious.
Like FedAvg, FedProx did not converge completely within our limited communication rounds. We extended its training rounds, and its test accuracy eventually converged to {0.619, 0.577}. Since FedProx lacks auxiliary data as a supplement, it has difficulty achieving higher test accuracy. In addition, the convergence rate of FedProx is far below ours.
By comparison with FedDF and FedBE, our method obtains a significant improvement in test accuracy with both CNN and ResNet-8. In addition to acquiring knowledge from the pseudo-labeled data, which is consistent with FedDF and FedBE, we perform a rollback on the teachers and the student to deal with possible noisy data and imbalance in the pseudo-labeled data. As a result, our model fits the data features better and obtains a more accurate mapping between features and labels. Figure 5 shows that ResNet-8 generally performs better than CNN, with the exception of FedProx. The generation and extensive use of pseudo-labeled data result in a significant increase in test accuracy. Due to their poor leverage of unlabeled data (see Sect. 4.3), FedDF and FedBE with CNN suffer a loss of test accuracy before the model converges. This accords with the observation in [3], which showed that a pre-trained model would not learn from FedAvg training on non-IID data and could even suffer an accuracy drop on CIFAR-10. It is exactly the poor leverage of unlabeled data that makes it possible to generate another non-IID dataset and decrease the test accuracy. The variants with ResNet-8 (except FedDF on CIFAR-10) may even keep declining and cannot reach convergence within 100 training rounds. A possible explanation for this result is the stronger confirmation bias of ResNet, which has been studied in [16].
In the case of training on fully labeled data, the residual block in ResNet solves the degradation problem of deep neural networks. When it comes to training on pseudo-labeled data, however, the residual block can also cling stubbornly to possible noisy labels and cause degradation again. Once the knowledge in noisy data is confirmed, the threshold blocks most samples during filtering, and it becomes hard for the weakened ensemble to give pseudo-labels. When pseudo-labeled data are no longer generated, the global model gradually degenerates to FedAvg as local models fit the local labeled data, resulting in a decrease in test accuracy. Our method alleviates this problem through a different formation of the teacher ensemble and a rollback on the teachers and student, making the model accuracy increase almost monotonically on CIFAR-10. Due to the complexity of CIFAR-100, the test accuracy on CIFAR-100 fluctuates more obviously, but this does not prevent our method from achieving higher test accuracy than the others.
Performance under the 0.8-bias data setting The test accuracies with different models under the 0.8-bias data setting are compared in Fig. 6. On the whole, our method still achieves the best test accuracy. ResNet-8 with FedProx performs even worse than CNN due to the deepened heterogeneity of the local data and the resulting over-fitting on the preferred labels.
Performance under the 2-class data setting Figure 7 shows the comparison between the baselines with different models under the 2-class data setting. With the greater degree of non-IID, it is hard even for FedAvg and FedProx to increase their test accuracy within 100 training rounds. The performance of FedBE with ResNet-8 on both CIFAR-10 and CIFAR-100 degrades to that of FedAvg, as in the 0.8-bias data setting, but with a steeper degradation. While FedDF and FedBE with CNN and FedDF with ResNet-8 perform consistently with the IID data setting, our method still performs better.
Summarizing the performance under the different data settings, our method consistently outperforms FedAvg and achieves generally higher test accuracy than FedProx, FedDF and FedBE. When encountering non-IID data, FedAvg's performance degrades significantly, which is consistent with the observations in [3]. Although FedProx is somewhat effective with heterogeneous data, it is limited by the amount of locally available data and brings little improvement in either test accuracy or convergence rate. The methods leveraging centrally held unlabeled data, including ours, show little fluctuation in test accuracy. While our method with ResNet-8 on CIFAR-10 gains only a limited 0.5% improvement under the 2-class data setting, it gains up to 3% on CIFAR-10 and 4% on CIFAR-100 under the other data settings; with CNN it brings maximum increases of 4% and 8% on CIFAR-10 and CIFAR-100, respectively.
Comparison with centralized learning While our method trains on 20000 local labeled samples and 40000 unlabeled samples simultaneously in a semi-supervised manner, we compare it with supervised learning on 40000 and 60000 centralized labeled samples, respectively. Figure 8 shows the results. While being almost immune to non-IID data, we achieve test accuracy comparable to supervised centralized learning. Benefiting from the learning on clients' labeled data, we take full advantage of the 40000 unlabeled samples and achieve better performance than using only 40000 labeled samples. Thus, we successfully migrate decentralized federated learning to centralized learning.
Performance of the student model The student model is trained only with the pseudo-labeled data on the server side. We compare the test accuracy of the global model with that of the student model. For a fair comparison, we consider the student model in multiple settings. Student 1: a normal student model with rollback; student 2: a student model without rollback, i.e., β = 1.0; student 3: the student model with rollback, tested before the rollback is applied. Figure 9 shows the results. The test accuracy of student 1 is lower than the others because student 1 undergoes a rollback with a β-weighted update step after each round of training, and the replacement random parameters drag down the test accuracy. Student 2 trains without rollback and student 3 is tested before rollback, resulting in test accuracies similar to the teacher's. However, when less heterogeneous data are encountered, the global model, which is the aggregation of the student model and the averaged temporary global model, gains a boost in test accuracy. The aggregation acts as a second federated learning between the student and the local clients, and it improves both the student and the local models.
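The rollback described above can be pictured as a partial reset of the student after each round. A minimal sketch, assuming the rollback blends the trained weights with freshly initialized random weights by the factor β (the interpolation form, initialization scale, and `rollback` helper are our assumptions based on the text; β = 1.0 matches the no-rollback setting of student 2):

```python
import numpy as np

def rollback(params, beta, seed=0):
    """Hedged sketch of the model rollback: keep a beta-weighted share of
    the trained weights and blend in freshly initialized random weights,
    partially resetting the student so it sheds confirmation of noisy
    pseudo-labels accumulated during the round."""
    rng = np.random.default_rng(seed)
    return {
        name: beta * w + (1.0 - beta) * (0.01 * rng.standard_normal(w.shape))
        for name, w in params.items()
    }
```

Under this reading, the gap between student 1 and students 2/3 is exactly the (1 - β) share of random weights injected after each round, which the subsequent aggregation with local models then recovers from.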

Utilization of unlabeled data
Utilization of unlabeled data on CIFAR-10 with CNN With 40000 × 80% unlabeled samples on the server for global training, Fig. 10 shows the evolution of the pseudo-labeled data during training in terms of accuracy and quantity, as well as the distribution in the last training round. The preparation of the teacher ensemble in our method leaves the pseudo-labeled dataset empty in the first T = 10 rounds, during which the model accuracy is consistent with FedAvg. To make the comparison clearer, we suspend pseudo-labeling in FedDF and FedBE for the same period, which also gives them a better initialization of their teacher ensembles.
As Fig. 10 shows, our method always has the highest pseudo-labeling accuracy in all the data settings. As expected, it has become almost impervious to non-IID data, while the other methods decline in data accuracy as the degree of non-IID deepens. The pseudo-labeled data are selected by a prediction probability threshold τ, which indirectly reflects that our prediction results have much better alignment between confidence and accuracy (i.e., higher confidence, higher accuracy) in the confidence interval [τ, 1.0]. While the other methods are less accurate than ours, the size of the dataset they generate also decreases, much more steeply in the non-IID case. In our method, the accuracy and size are little affected by the non-IID data and rise steadily toward better convergence. The bottom row of Fig. 10 shows the data distribution in the last communication round. The data generated by our method have a more stable and even distribution, forming an approximately IID dataset.
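The threshold-based selection can be sketched as follows: the ensemble's averaged class probabilities are computed per sample, and only samples whose top confidence reaches τ receive a hard pseudo-label. The function name and array layout below are illustrative, not taken from the paper.

```python
import numpy as np

def pseudo_label(probs, tau=0.75):
    """Select confident samples from an ensemble's averaged predictions.
    `probs` has shape (num_samples, num_classes); returns the indices of
    samples whose maximum probability reaches tau, with their hard labels."""
    conf = probs.max(axis=1)
    keep = np.where(conf >= tau)[0]
    return keep, probs[keep].argmax(axis=1)
```

Both the labeling accuracy and the dataset size in Fig. 10 are direct functions of this filter: a better-calibrated ensemble admits more samples above τ while keeping the admitted labels accurate.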
The high quality of the pseudo-labeled dataset generated by our method leads to steadily increasing model accuracy. The evolution of the quality of the pseudo-labeled dataset is consistent with that of the model accuracy shown in Sect. 4.2. As the generation of pseudo-labeled data yields a sharp increase in model accuracy and near-immunity to non-IID data, we successfully generate a centralized pseudo-labeled dataset for the global model to learn from.
Utilization of unlabeled data in other cases Figure 11 shows the evolution of pseudo-labeled data on CIFAR-100 with CNN. Our method still holds an advantage in data accuracy, especially on non-IID data. Although it has no advantage in data quantity, benefiting from high data accuracy and the model rollback, it can still achieve the best test accuracy. Data size is not the only factor that determines model performance; labeling accuracy and the prevention of over-fitting are both important. Figures 12 and 13 show the same comparisons on CIFAR-10 and CIFAR-100 with ResNet-8. We gain advantages in both data accuracy and quantity, along with a more stable and even distribution; we again obtain a pseudo-labeled dataset of high quality. Different from its performance with CNN, FedBE with ResNet-8 begins to degenerate to vanilla FedAvg after only a few rounds of training on the pseudo-labeled data. The sharp degradation of the pseudo-labeling empties the pseudo-labeled dataset, and its model accuracy continues to decline, as shown in Sect. 4.2. We suspect that the aggregation of the local models and the residual block in ResNet-8 reinforce the teacher's knowledge of noisy data. A single local model as a teacher in FedDF does not have such a strong confirmation bias, and the teachers in our method rely more on the student model, a β fraction of which is rolled back toward a randomly initialized model in each communication round; both factors keep the generation of pseudo-labeled data for global training going.
With different sizes of unlabeled dataset We investigate different sizes of the unlabeled dataset on CIFAR-10 with CNN in Fig. 14, using {10000, 20000, 30000, 40000} as the variable values. The result shows that the model accuracy increases with the size of the unlabeled dataset. Although we cannot pseudo-label and utilize all the unlabeled data, due to inevitable model error and the threshold τ, a larger unlabeled dataset always leads to a larger pseudo-labeled dataset, and our method gains more than the others regardless of the data settings. As long as the pseudo-labeled dataset is not empty, the better utilization of the unlabeled data, with high labeling accuracy and uniformly distributed labels, ensures that D2C-FL is unlikely to perform worse than FedAvg.
With different τ for the threshold We evaluate the effect of τ on the utilization of unlabeled data on CIFAR-10 with CNN. Figure 15 shows the results. When τ is set to a large value, the pseudo-labeled dataset attains high labeling accuracy but contains few samples; when τ is set to a small value, the dataset attains lower labeling accuracy but contains many samples. When τ = 0.75, the pseudo-labeled dataset maintains high labeling accuracy while containing a fairly large number of samples, and the test accuracy of the global model climbs to a higher level once the pseudo-labeled data are introduced.

Case studies
Different ensemble sizes Keeping 40000 unlabeled samples on the server side, Fig. 16 shows the utilization of the data and the corresponding test accuracy with different teacher ensemble sizes. Due to the preparation of the teachers, the pseudo-labeling processes for different ensemble sizes begin in different communication rounds. A single teacher generates a pseudo-labeled dataset with the highest labeling accuracy but the smallest data size. As the ensemble size increases, the labeling accuracy decreases slightly, while the data size increases noticeably: owing to the diversity of the teachers, multiple teachers give each sample multiple opportunities to pass the threshold τ and join the pseudo-labeled dataset. Whenever a poor teacher generates a wrong label with high confidence, the sample becomes noise data and is learned by the student model. As a result, the test accuracy fluctuates irregularly in the 0-3% range instead of improving significantly. Nevertheless, multiple teachers are more likely to produce the best performance.
Different optimizers We evaluate our method with SGD as the optimizer. The (α, β) is set to (0.2, 0.8). As another learning rate strategy, we fix the learning rate at 0.001 rather than using a cyclical learning rate. The comparison is shown in Fig. 17. The learning with Adam has a faster convergence but exhibits slight over-fitting. Under the IID and 0.8-bias data settings, SGD and Adam achieve comparable test accuracy. Under the 2-class data setting, with a deeper non-IID degree, SGD with a constant learning rate performs better. This indicates that keeping a constant rather than cyclical learning rate is preferable when encountering non-IID data. The advantage that remains is that the test accuracy is little affected by the non-IID data.

Federated learning
Although federated learning has made good progress in both privacy protection and decentralized learning, several issues specific to federated learning have been extensively studied: statistical heterogeneity due to different user characteristics, system heterogeneity resulting from the different computing and storage capabilities of edge devices, communication cost under poor communication conditions, and security against malicious clients or a malicious server [2,23-26]. All of these can lead to poor convergence of the aggregated model. In this work, we mainly focus on the problem of statistical heterogeneity; related work is discussed in the next subsection.

Challenges on non-IID data in federated learning
In conventional distributed learning, the model owner has rich data and distributes it to other devices to train a model collaboratively. In contrast, the server in federated learning has no data of its own to learn from; it uses not only the computing capability of local devices but also the data on them. As a result, the distributions of the data on different devices vary according to the users' characteristics, and the data are not independent owing to possible connections between users. Such non-IID data in the context of federated learning are also called statistical heterogeneity. Zhao et al. [3] verified that the test accuracy of the global model trained by FedAvg decreased significantly under non-IID settings and that the convergence rate also slowed down. How to eliminate the effect of non-IID data on the performance of the global model has been extensively studied.
Some existing studies deal with non-IID data by sharing part of the labeled data. Zhao et al. [3] shared 5% of the labeled data with the server to improve the performance of the global model. Yoshida et al. [27] proposed to use 1% shared IID data to train a model playing the same role as the local models in aggregation, so as to alleviate the impact of non-IID data. However, sharing labeled data carries a great risk when the key privacy-sensitive information lies precisely in the labels. Active selection of participants is another way to address statistical heterogeneity. Lu et al. [28] proposed to select participants by clustering according to data distribution, so that the local data used in training approximate the global distribution. Taïk et al. [29] and Li et al. [30] proposed different metrics for selecting participants according to local data information; Li also considered the clients' resource constraints, while Taïk used encryption algorithms to obtain information about the local data and performed a re-selection of samples. However, these approaches are limited to the locally available data and bring limited improvement to the model in federated learning. Compared with them, we prefer the method proposed by Wang et al. [21], who used reinforcement learning to address the problem of participant selection, although their method was not compared in our experiments.
The improvement brought by these methods is limited, and the model performance is still restricted by the non-IID data.
Sattler et al. [31] and Briggs et al. [32] argued that a single model cannot fit the distributions of all clients' data; they proposed to cluster the local clients and aggregate different global models for different types of data distributions. When there are no explicit clusters of clients, Jamali-Rad et al. [33] proposed to learn the task correlation between clients with a contractive encoding of local data to perform more efficient federated aggregation of heterogeneous data.
Hu et al. [34] proposed to use a GAN to train a feature extractor for local data to enhance the correlation between clients; the weight of each client participating in aggregation is determined by the feature quality. Jeong et al. [10] proposed federated augmentation (FAug) to augment local data, so as to turn the non-IID data across local clients into IID data. However, clients executing FAug incur additional computing and storage costs, making the conditions for participation more stringent. Li et al. [35] accelerated convergence by adding the difference between local and global models as a regularization term, which differs from our purpose.

Data disputes in federated semi-supervised learning
Federated semi-supervised learning focuses on the problem of label deficiency in federated learning [5]. It also refers to leveraging additional unlabeled data to improve model performance. There are different scenarios depending on the locations of the labeled and unlabeled data. Part of the existing work considers the scenario where the server has labeled data and the clients, because of the cost of labeling, have only unlabeled data. Zhang et al. [6] used the consistency regularization loss [36], which is widely used in semi-supervised learning, and adopted a group-based model averaging method. Liu et al. [7] employed a minimax optimization-based client selection strategy to select the clients holding high-quality models and used geometric median aggregation to robustly aggregate model updates.
Another part of the work assumes that only the clients hold data, with both labeled and unlabeled samples. Albaseer et al. [37] directly performed vanilla semi-supervised learning locally. Jeong et al. [5] proposed to decompose the parameters and perform disjoint learning on labeled and unlabeled data, respectively. Long et al. [38] sent different parts of the model parameters to the server through a teacher-student framework, with the communication cost decreasing as the model converged.
There is also the scenario in which the unlabeled data are collected only on the server side, while the clients own the labeled data. Jeong et al. [10] and Sattler et al. [9] aggregated the outputs of the supervised local models (i.e., logits) for each class to perform distillation. Itahara et al. [11] shared all the unlabeled data and generated logits for each sample for distillation. Although using logits decreases the communication cost, the model performance is often poor. Lin et al. [12] still communicated model parameters between the clients and the server; the local supervised models are used as an ensemble to produce logit predictions for the unlabeled data. Chen et al. [13] fitted the distribution of local models and sampled from it to obtain an ensemble of higher quality. The unlabeled data were used to retrain the global model after being labeled with the logits.
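The logit-based distillation these methods share can be sketched as a cross-entropy between the student's softened predictions and the soft labels obtained from the averaged teacher logits. The temperature value and helper names below are illustrative, not taken from any of the cited papers.

```python
import numpy as np

def softened(logits, T):
    """Temperature-scaled softmax, computed stably."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits_list, T=2.0):
    """Cross-entropy of the student's softened predictions against soft
    labels derived from the averaged logits of the teacher models."""
    target = softened(np.mean(teacher_logits_list, axis=0), T)
    log_p = np.log(softened(student_logits, T) + 1e-12)
    return -np.mean((target * log_p).sum(axis=-1))
```

Averaging logits before the softmax is one common choice; averaging the per-teacher probabilities instead is another, and the cited works differ in such details.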
In this work, we study only the last case. Following [12,13], we use an ensemble to pseudo-label the unlabeled data.

Pseudo-labeling and knowledge transference
Consistency regularization is a commonly used method in semi-supervised learning. It encourages the model to make the same prediction on an original sample and its perturbed versions. In contrast, we use model predictions to generate pseudo-labels to learn the knowledge in unlabeled data, which has been studied in [16]. In the context of federated learning, some existing work has used ensemble learning to make predictions. Ensemble learning constructs and integrates multiple base learners, which can achieve better performance than any base learner alone [39]. Guha et al. [20] and Lin et al. [12] drew on the idea of bagging in ensemble learning; they treated local clients as naturally formed bags to form the ensemble. The former carried out the ensemble learning process only once, while the latter iterated it a finite number of times for incremental performance improvements. The base models in bagging can be trained in parallel, just as the training process on local clients can be performed in parallel. Mao et al. [14] and Chen et al. [13] both proposed to apply a linear transformation to the base learners. Mao et al. pursued the optimal projective direction of the linear transformation for a better-performing ensemble. Chen et al. simulated sampling from the possible distribution of base learners through multiple linear transformations to obtain better base learners; the sampled base learners constituted a new ensemble while maintaining its diversity.
In contrast to the parallel formation of the ensemble, we collect the base learners of our ensemble in tandem: the global model obtained in each round of federated learning is collected as a base learner. Generally, the classification results obtained by multiple base learners, i.e., the ensemble, are better than those of a single classifier, and the more accurate the pseudo-labeling, the less likely the target model is to over-fit the noisy data. Given the strong temporal correlation among our base learners, this collection method resembles boosting in ensemble learning.
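The tandem collection can be sketched as a sliding window over the global models of recent rounds. Keeping only the most recent models of a fixed ensemble size is an assumption based on the ensemble-size experiments above, and the class name is ours.

```python
from collections import deque

class TandemTeachers:
    """Collect the global model from each communication round as a base
    learner, keeping only the most recent `size` models as the teacher
    ensemble (the fixed-size sliding window is an assumption)."""
    def __init__(self, size=3):
        self.models = deque(maxlen=size)

    def collect(self, global_model):
        self.models.append(global_model)

    def ready(self):
        # pseudo-labeling starts only once the ensemble is fully prepared,
        # matching the empty pseudo-labeled dataset in the first rounds
        return len(self.models) == self.models.maxlen
```

Unlike bagging-style ensembles built from parallel local models, the members here are successive snapshots of the same training trajectory, which is why they are strongly temporally correlated.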

Discussion
Cost analysis We exchange model parameters between the server and clients in each communication round, so the per-round communication cost is consistent with general methods. Since leveraging unlabeled data drastically increases the model's test accuracy and yields faster convergence than FedAvg, it reduces the required communication rounds and indirectly saves communication cost. The additional computing and storage costs resulting from the extra training of the student model are borne by the server and do not burden the local clients.
Unlabeled data In some fields, such as the service industry or consumer applications, the service provider may collaborate with users to enhance the user experience by accessing their data. With the release of personal information protection laws, data security laws, and user agreements that spell out how data will flow and be used, user data will become safer and more private.
Robustness to attacks Since we have migrated the focus of model learning from decentralized labeled data to centralized unlabeled data, it is intuitive that our method will be more robust against malicious clients, although this remains to be verified.
Limitations The assumption of large amounts of unlabeled data that do not involve privacy may hinder the application of our method in a wider range of scenarios, as in the disputes in Sect. 5.3. When only limited unlabeled data are available, the improvement we can bring to federated learning is also limited, since the unlabeled data and accurate pseudo-labels are the basis of the performance improvement D2C-FL brings to federated learning. Prospectively, generative methods have been used in federated learning to produce synthetic data and assist model learning, as in [40]. Ideally, such generative methods should be used only to assist global learning, adding no additional computing and storage costs to clients, as we did; this is our next step.

Conclusion
In this work, we leveraged a large amount of unlabeled data to improve the performance of federated learning, especially on non-IID data. We assume the unlabeled data come from users but are detached from them and pose no privacy threat. We collected the global models obtained in each training round as teachers to make predictions on the unlabeled data. By using these predictions as pseudo-labels, we successfully migrated the focus of federated learning from decentralized labeled data to centralized pseudo-labeled data. In addition, we used a model rollback to alleviate the impact of possible data imbalance and noisy labels in the pseudo-labeled data. Simulations show that our method brings a great improvement to federated learning, achieving comparable or even higher accuracy than other methods. We also achieved performance comparable to centralized supervised learning with a labeled dataset of the same size as our unlabeled data. To explain the performance of the proposed method, we analyzed the utilization of the pseudo-labeled data in terms of accuracy, quantity and distribution. The result is that our method achieves greater utility of the unlabeled data and is almost immune to non-IID data.
Ping Xiong received his BEng degree in mechanical engineering from Lanzhou Jiaotong University, Lanzhou, China, in 1997, and the MEng and PhD degrees from Wuhan University, Wuhan, China, in 2002 and 2005, respectively, both in automation. He is currently a Professor with the School of Information and Security Engineering, Zhongnan University of Economics and Law, Wuhan. His research interests include network security, data mining, and privacy preservation.