Robust self-supervised learning for source-free domain adaptation

Source-free domain adaptation (SFDA) is derived from unsupervised domain adaptation (UDA) and addresses the practically important setting in which the source domain data is not accessible. Self-supervised learning is widely used in previous works on this problem. However, inaccurate pseudo-labels are hardly avoidable and degrade the adapted target model. In this work, we propose an effective method, named RS2L (robust self-supervised learning), to reduce the negative impact of inaccurate pseudo-labels. Two strategies are adopted. The first, a structure-preserved pseudo-labeling strategy, generates better pseudo-labels from the stored predictions of the k-closest neighbors. The second is self-supervised learning with a mask. We use threshold masks to select samples for the different operations, i.e., self-supervised learning and structure-preserved learning. The threshold values differ between the masks, so some samples may participate in both operations. Experiments on three benchmark datasets show that our method achieves state-of-the-art results.


Introduction
Though achieving great success, unsupervised domain adaptation (UDA) [1,2] algorithms are of no avail when the source domain data is inaccessible. Yet, due to privacy, distributed computing or limited computational resources, source domain data is often inaccessible. To this end, source data-free domain adaptation (SFDA) [3], which assumes that only the model pre-trained on the source domain is available and the source data itself is not, was brought forward recently.
The methods for SFDA can be roughly divided into two categories. The first category generates source-domain-like images [4,5]. With the help of image generation technologies, labeled artificial images substitute for the unseen source data; the generator and the target model can then collaborate with each other without source data. Nevertheless, one challenge is whether the generated images accurately reflect the distribution of the source domain. The other category is based on the idea of setting pseudo-labels. Using the predictions of target samples as the input of a clustering algorithm is a common technique for generating pseudo-labels [3,6,7]. To further exploit the structure of the target data, many works try to mitigate the negative impact of noisy pseudo-labels [8][9][10][11][12][13]. However, these works exploit structure only as an aid to the pseudo-labels. By comparison, we use the neighborhood structure itself to generate a novel kind of pseudo-label.
It is obvious that noisy pseudo-labels degrade the adapted target model significantly. As shown in Fig. 1a, the unseen source sample features are well clustered while the target sample features scatter over a wide area before model adaptation. From Fig. 1b, it can be observed that when the traditional pseudo-labels are applied to adapt the source model, some extracted target sample features approach the wrong clusters if their labels are incorrect. The ideal situation is shown in Fig. 1c, which illustrates that the target sample features close to the centers of the source feature clusters are pulled tightly around those centers, while the target sample features near the class boundary gradually approach their neighbors. Unless most neighbors of a target sample feature belong to other classes, the feature will eventually approach the correct class center.
On the basis of the above analysis, we propose a new SFDA method, dubbed robust self-supervised learning (RS2L). Specifically, our method consists of a structure-preserved pseudo-labeling supervision strategy and a self-supervised learning with mask strategy. Instead of using a clustering center as the pseudo-label, we force the prediction of a target sample to approximate the mean of its neighbors' predictions. The samples at the class boundary are considered low-confidence samples and are used for structure-preserved pseudo-labeling supervision. Meanwhile, we select high-confidence samples, which lie near a class center, through a mask and carry out the self-supervised learning process on them. In this way, the samples keep the local structure and the impact of noisy labels is relieved. We make sure each sample is selected once and only once at each epoch to maintain a balance between the high-confidence samples and the other samples.
Our contributions can be summarized as follows. (1) We propose RS2L, a new SFDA method, which uses structure-preserved pseudo-labels for supervised learning. For target samples at the class boundary, the labels are determined by their neighbors and are much more robust. (2) A new self-supervised learning strategy with a threshold mask is proposed so that the selected samples better reflect the label distribution of the target samples. (3) Experiments on several benchmarks demonstrate that our method yields results comparable to or outperforming the state-of-the-art SFDA methods.

Unsupervised domain adaptation
Different from domain adaptation algorithms that use labels [14,15], recent advances have promoted the development of unsupervised domain adaptation (UDA), which can access source data (the key difference from SFDA). Methods can be roughly classified into four strategies. The first is based on statistical moment matching, such as CMD [16] and CAN [17]; it mitigates the gap between two domains by minimizing some defined statistical discrepancy metric. The second uses an adversarial learning framework, such as DANN [18] and ADDA [19]; these methods introduce a domain discriminator for domain classification and then force the feature extractor to confuse the domain discriminator so as to learn domain-invariant features. The third is based on an adversarial generation framework, such as CoGAN [20] and CycleGAN [21], which combines the domain discriminator with a generator, generating fake data and aligning the distributions of the two domains at the pixel level. The last uses a self-training strategy, such as MTAE [22] and ssUDA [23]; it implicitly minimizes the discrepancy between two domains by incorporating auxiliary self-supervised learning tasks into the original task. Despite the great performance achieved, all of these methods need access to the source domain data.

Source-free domain adaptation
Despite the lack of source data, SFDA still tries to exploit the correlation between domains. There are two main routes to address this problem. One line is concerned with faking a source domain. For example, 3C-GAN [4] generates labeled target-style data with a class-conditional generative adversarial net; CPGA [5] first trains a source prototype generator and then aligns each pseudo-labeled target sample to the corresponding source prototype; VDM-DA [24] generates source-domain-style features and then aligns the generated features with the target features. The other line uses a pseudo-label strategy. SHOT [3] obtains pseudo-labels by building cluster centers similar to weighted k-means clustering, and information maximization is used to ensure balance between classes; BAIT [6] performs a bi-alignment using two classifiers with one classifier head fixed; SCLM [9] and ProxyMix [10] exploit geometric information to mitigate the negative impact of noisy labels.

Problem statement
We are given an unlabeled dataset $\mathcal{D} = \{x_i\}_{i=1}^{n}$ in the target domain, where $n$ is its cardinality. For the source model, the feature extractor is $g: \mathcal{X} \rightarrow \mathbb{R}^{d}$ and the classifier is $f: \mathbb{R}^{d} \rightarrow \mathbb{R}^{C}$, where $d$ is the feature dimension and $C$ is the number of classes. The label prediction of the learned target model is denoted as $p(x) = f(g(x))$ and the prediction of the original source model is denoted as $p_s(x)$. The goal of SFDA is to predict the labels $\{y_i\}_{i=1}^{n}$ of the target dataset $\mathcal{D}$ without accessing the source domain data.
Overview As shown in Fig. 2, our model is based on SHOT-IM [3], but no parameters are fixed during model adaptation. Two loss functions are added to SHOT-IM. One loss function realizes the structure-preserved pseudo-labeling supervision strategy, and the other realizes the self-supervised learning with mask strategy. For the structure-preserved pseudo-labeling, the pseudo-labels are derived from neighborhood scores stored in a score bank. The scores stored in the score bank are used to compute the self-entropy that measures the confidence of the corresponding samples. High-confidence samples are used for self-supervised learning while low-confidence samples are used for structure-preserved supervision.

Structure-preserved pseudo-labeling supervision
Fig. 1 (b) If a sample is allocated to the wrong class, the error will be gradually strengthened; (c) distribution of the learned features after our strategies: features near the class boundary gradually approach their neighbors rather than a class center.

Fig. 2 The pipeline of our RS2L framework. Two supervision strategies are adopted. The pseudo-labels for self-supervision come from the samples themselves, while those for structure-preserved pseudo-labeling supervision come from their neighbors. Masks are used to select samples for the corresponding strategy.

In the adaptation process, we hope to keep the local structure of the target samples, i.e., samples with the same class label should remain in one cluster in the feature space. The target model is first initialized by the trained source model, with the prediction $p(x) = p_s(x)$. Since the predictions on target samples are noisy due to the domain shift, it may not be a promising avenue to use them as one-hot labels to adapt the source model. Based on the basic fact that similar features should produce similar predictions, it is reasonable to push the prediction of a sample closer to those of its neighbors. We therefore use this analysis to suppress the label noise on low-confidence samples, since high-confidence samples already have more reliable pseudo-labels.

First, to compute the similarity between two samples quickly, similar to [25], we build a memory bank to store each sample feature. The feature bank $F = \{g(x_i)\}_{x_i \in \mathcal{D}}$ stores $\ell_2$-normalized features for convenient calculation. Similarly, we set up a score bank $S = \{f(g(x_i))\}_{x_i \in \mathcal{D}}$ storing the corresponding softmax prediction scores. In every iteration, the banks $F$ and $S$ are updated with the sample features and scores of the current batch. To further alleviate the noise in nearby samples, we follow the method in [26] and adopt an exponential moving average (EMA) update of $S$:

$$s(x_i) \leftarrow \gamma\, s(x_i) + (1-\gamma)\, p(x_i),$$

where $\gamma$ is a momentum hyper-parameter and $s(x)$ denotes the prediction score of $x$ stored in $S$.

Next, we discuss how to generate the structure-preserved pseudo-label of a sample from its $k$ neighbors. We find the $k$-nearest neighbors in the feature bank for each current target sample feature based on the cosine distance $d(i, j)$ between $x_i$ and $x_j$. The pseudo-label $\tilde{p}_i$ of sample $x_i$ is the mean of its neighbors' stored scores,

$$\tilde{p}_i = \frac{1}{k} \sum_{j \in \mathcal{N}_i} s(x_j),$$

where $\mathcal{N}_i$ denotes the index set of the neighbors of $x_i$ in the memory bank and each $\mathcal{N}_i$ contains $k$ elements.

To select low-confidence samples for structure-preserved pseudo-labeling supervision, we build a binary mask based on the softmax predictions, called the low-confidence mask. First, we calculate the self-entropy of each sample. The approach in [8] showed that predictions with smaller self-entropy contain less noise and are therefore more reliable. The self-entropy of a sample $x_i$ is calculated as

$$H(x_i) = -\sum_{c=1}^{C} s(x_i)_c \log s(x_i)_c.$$

Then, we temporarily divide the samples into $C$ class-wise groups according to the maximal element of the prediction. Finally, for a sample $x_i$ that belongs to the $m$-th group, we define its mask as

$$M_n(x_i) = \begin{cases} 1, & x_i \in T_m, \\ 0, & \text{otherwise}, \end{cases}$$

where $T_m$ denotes the set of samples whose self-entropy values are within the top $\mu\%$ of the $m$-th group, and $\mu \in [0, 100]$ is a parameter decided experimentally.
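The bank bookkeeping above is straightforward to implement with two tensors. The following PyTorch-style sketch is only an illustration under our assumptions, not the authors' released code: the names feat_bank, score_bank, update_banks_and_pseudo_labels and low_confidence_mask are ours, and it uses the simple-mean form of the neighbor pseudo-label described above.

import torch
import torch.nn.functional as F

@torch.no_grad()
def update_banks_and_pseudo_labels(feats, probs, idx, feat_bank, score_bank, k=5, gamma=0.2):
    # feats: (b, d) batch features; probs: (b, C) softmax predictions;
    # idx: (b,) dataset indices of the batch; feat_bank: (n, d); score_bank: (n, C).
    feats = F.normalize(feats, dim=1)                                  # l2-normalized features
    feat_bank[idx] = feats                                             # refresh feature bank F
    score_bank[idx] = gamma * score_bank[idx] + (1 - gamma) * probs    # EMA update of score bank S
    sim = feats @ feat_bank.t()                                        # cosine similarity to all stored features
    sim[torch.arange(len(idx)), idx] = -1.0                            # exclude the sample itself
    nn_idx = sim.topk(k, dim=1).indices                                # k nearest neighbors
    return score_bank[nn_idx].mean(dim=1)                              # pseudo-label = mean of neighbors' stored scores

def self_entropy(scores, eps=1e-8):
    # per-sample self-entropy of the stored scores, used to rank confidence
    return -(scores * (scores + eps).log()).sum(dim=1)

def low_confidence_mask(score_bank, mu=10.0):
    # M_n: 1 for samples whose self-entropy lies in the top mu% of their predicted-class group
    ent = self_entropy(score_bank)
    groups = score_bank.argmax(dim=1)
    mask = torch.zeros(len(score_bank))
    for m in groups.unique():
        sel = (groups == m).nonzero(as_tuple=True)[0]
        n_top = max(1, int(round(len(sel) * mu / 100.0)))
        mask[sel[ent[sel].topk(n_top).indices]] = 1.0                  # highest entropy = least confident
    return mask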
Since the labels of our structure-preserved pseudo-labeling (SPPL) supervision are not in one-hot form, our goal is to reduce the distribution difference between the labels and the model output. The SPPL supervision loss for a sample $x_i \in \mathcal{D}$ is defined as

$$\ell_{sp2l}(x_i) = \mathrm{KL}\big(\tilde{p}_i \,\|\, p(x_i)\big),$$

where $\mathrm{KL}$ denotes the Kullback-Leibler (KL) divergence [27] loss. Then, the structure-preserved pseudo-labeling supervision loss for each batch is

$$\mathcal{L}_{sp2l} = \frac{1}{b} \sum_{i=1}^{b} M_n(x_i)\, \ell_{sp2l}(x_i),$$

where $b$ is the batch size. In the next section, we discuss how to use the samples filtered out by the mask $M_n$, i.e., those with $M_n(x_i) = 0$.
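A minimal sketch of this masked KL objective under the same assumptions (the function name sppl_loss and the mean-over-batch normalization are ours):

import torch.nn.functional as F

def sppl_loss(logits, pseudo, mask_n, eps=1e-8):
    # logits: (b, C) target-model outputs; pseudo: (b, C) neighbor pseudo-labels; mask_n: (b,) low-confidence mask
    log_p = F.log_softmax(logits, dim=1)
    kl = (pseudo * ((pseudo + eps).log() - log_p)).sum(dim=1)    # KL(pseudo || p(x)) per sample
    return (mask_n * kl).mean()                                  # apply the mask before averaging over the batch

Masking here, rather than building a separate low-confidence dataset, is exactly the design choice discussed in the Remark below.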

Remark
We use a mask to select low-confidence samples in the $\mathcal{L}_{sp2l}$ loss instead of building two new datasets, i.e., one of high-confidence samples and one of low-confidence samples, as in previous approaches [11,17,28]. Those methods alternately sample one batch from each of the two datasets. The defect of this kind of sampling is that consistency between the sampled high-confidence and low-confidence samples cannot be guaranteed: to keep the data balanced, as many high-confidence samples are drawn as low-confidence ones, so the distribution of the sampled data is effectively changed. We address this issue by simply applying a mask before backpropagation. In other words, we ignore the distinction between high-confidence and low-confidence samples when forming batches and only treat them differently in the backward pass.

Self-supervised learning with mask
In this section, we select only the highly confident samples, which have low self-entropy values, for self-supervised training. Since the high-confidence samples are closer to the class centers and the target model is initialized with the source model, the information of the source domain data can be retained in this way, so we do not need to fix the parameters of the classifier as other approaches do [3]. A second binary mask, called the high-confidence mask, is used to select samples for self-supervised learning. We set the mask threshold in an adaptive manner [7]. First, we compute the mean self-entropy over all target samples,

$$\bar{H} = \frac{1}{n} \sum_{i=1}^{n} H(x_i).$$

Next, we count the number of samples whose self-entropy is less than $\bar{H}$,

$$n' = \sum_{i=1}^{n} \mathbb{I}\big[H(x_i) < \bar{H}\big],$$

where $\mathbb{I}$ is the indicator function. We then obtain the adaptive threshold $\nu = 100 \cdot n'/n$. Finally, for a sample $x_i$ that belongs to the $m$-th group defined in the previous section, its self-supervision mask is

$$M_h(x_i) = \begin{cases} 1, & x_i \in E_m, \\ 0, & \text{otherwise}, \end{cases}$$

where $E_m$ denotes the set of samples whose self-entropy values are within the smallest $\nu\%$ of the $m$-th group. If a sample $x_i$ belongs neither to $T_m$ nor to $E_m$, we set the corresponding entry of the low-confidence mask $M_n$ to '1.' In this way, all target samples take part in the model adaptation.
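The adaptive threshold can be computed once from the score bank. The sketch below follows the equations above; the function name high_confidence_mask and computing the threshold from the whole bank at once are our assumptions.

import torch

def high_confidence_mask(score_bank, eps=1e-8):
    # M_h: 1 for samples whose self-entropy lies in the smallest nu% of their predicted-class group,
    # where nu = 100 * n' / n and n' counts samples with self-entropy below the mean.
    ent = -(score_bank * (score_bank + eps).log()).sum(dim=1)    # self-entropy per sample
    nu = (ent < ent.mean()).float().mean()                       # adaptive fraction n'/n
    groups = score_bank.argmax(dim=1)
    mask = torch.zeros(len(score_bank))
    for m in groups.unique():
        sel = (groups == m).nonzero(as_tuple=True)[0]
        n_low = max(1, int(round(len(sel) * nu.item())))
        mask[sel[(-ent[sel]).topk(n_low).indices]] = 1.0         # smallest entropy = most confident
    return mask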
The self-supervision loss for a sample $x_i \in \mathcal{D}$ is the following label-smoothed cross-entropy loss,

$$\ell_{self}(x_i) = -\sum_{c=1}^{C} q(x_i)_c \log \delta_c\big(f(g(x_i))\big),$$

where $\delta_c(a) = \frac{\exp(a_c)}{\sum_{j} \exp(a_j)}$ denotes the $c$-th element of the softmax output of a $C$-dimensional vector $a$, $q(x_i)_c = (1-\alpha)\hat{p}_c + \alpha/C$ is the smoothed label, and $\alpha$ is the smoothing parameter, empirically set to 0.1. Here $\hat{p}$ denotes a one-hot vector whose $c$-th element $\hat{p}_c$ is '1' if $c = \arg\max_j s(x_i)_j$ and '0' otherwise, i.e., the pseudo-label comes from the sample's own stored prediction. The self-supervision loss for a batch is then

$$\mathcal{L}_{self} = \frac{1}{b} \sum_{i=1}^{b} M_h(x_i)\, \ell_{self}(x_i).$$

Remark We use the score bank $S$ to calculate the self-entropy values of the target samples. The scores are more stable than the predictions within an iteration since they accumulate historical information. Comparative experiments confirm that this strategy achieves better results.
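A corresponding sketch of the masked, label-smoothed self-supervision term; the helper name self_loss and the use of the stored scores as the source of the one-hot pseudo-label are our assumptions.

import torch.nn.functional as F

def self_loss(logits, bank_scores, mask_h, alpha=0.1):
    # logits: (b, C) current outputs; bank_scores: (b, C) stored scores; mask_h: (b,) high-confidence mask
    num_classes = logits.size(1)
    one_hot = F.one_hot(bank_scores.argmax(dim=1), num_classes).float()
    q = (1 - alpha) * one_hot + alpha / num_classes              # smoothed target q(x_i)
    ce = -(q * F.log_softmax(logits, dim=1)).sum(dim=1)          # per-sample cross-entropy
    return (mask_h * ce).mean()                                  # apply the mask before batch averaging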

Overall loss
Finally, we combine the loss functions above with the information maximization loss $\mathcal{L}_{im}$ of [3] to form the final objective. For simplicity, we ignore the combination coefficients in the final objective, i.e., we fix all of them to '1.' The overall objective is formulated as

$$\mathcal{L} = \mathcal{L}_{sp2l} + \beta_1 \mathcal{L}_{self} + \beta_2 \mathcal{L}_{im}, \qquad \beta_1 = \beta_2 = 1. \tag{12}$$
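Putting the pieces together, a single adaptation step could look as follows. This sketch reuses the helpers defined in the earlier sketches; the information-maximization term shown (entropy minimization plus a diversity term, as in SHOT-IM) is our reading of [3], and the model attributes feature_extractor and classifier are hypothetical.

import torch.nn.functional as F

def info_max_loss(logits, eps=1e-8):
    # SHOT-style information maximization: low per-sample entropy, high batch-level diversity
    p = F.softmax(logits, dim=1)
    ent = -(p * (p + eps).log()).sum(dim=1).mean()               # conditional entropy
    p_mean = p.mean(dim=0)
    return ent + (p_mean * (p_mean + eps).log()).sum()           # minus the marginal entropy

def adaptation_step(model, batch, idx, feat_bank, score_bank, mask_h, mask_n, optimizer):
    feats = model.feature_extractor(batch)
    logits = model.classifier(feats)
    probs = F.softmax(logits, dim=1)
    pseudo = update_banks_and_pseudo_labels(feats.detach(), probs.detach(), idx, feat_bank, score_bank)
    loss = (sppl_loss(logits, pseudo, mask_n[idx])               # structure-preserved term
            + self_loss(logits, score_bank[idx], mask_h[idx])    # self-supervision term (beta_1 = 1)
            + info_max_loss(logits))                             # information maximization (beta_2 = 1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()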

Datasets
Office-31 [29] is a standard domain adaptation dataset consisting of three distinct domains (Amazon, DSLR, Webcam) with 31 classes. Office-Home [30] contains 4 domains (Real, Clipart, Art, Product) with 65 classes and a total of 15,500 images. VisDA [31] is a more challenging large-scale dataset with 12 classes; its source domain contains 152k synthetic images while the target domain has 55k real object images. These three datasets are the most widely used benchmarks in the DA field, and their sizes are small, medium and large, respectively.

Implementation details
Network details We adopt ResNet-50 as the backbone for Office-31 and Office-Home and ResNet-101 for VisDA, each followed by a fully connected (FC) layer, as the feature extractor, and an FC layer as the classifier head. We adopt SGD with momentum 0.9, weight decay 1e-3 and learning rate $\eta_0 = 1e{-2}$ for the new layers and the layers learned from scratch in all experiments, except $\eta_0 = 1e{-3}$ for VisDA-C. We further adopt the same learning rate scheduler $\eta = \eta_0 \cdot (1 + 10 \cdot t)^{-0.75}$ as SHOT [3], where $t$ is the training progress changing from 0 to 1. Besides, we set the batch size to 64 for all tasks. Training details We randomly specify a 0.9/0.1 split of the source dataset and generate the best source hypothesis based on the validation split. The maximum number of epochs for source training on Office-31, Office-Home and VisDA-C is empirically set to 100, 50 and 10, respectively. For learning in the target domain, the maximum number of epochs is empirically set to 30. Besides, we run our method three times with random seeds 2020, 2021 and 2022 in PyTorch and report the mean accuracy. We set k = 5 for all datasets, and set γ to 0.2, 0.6, 0.0 and μ to 90, 10, 40 for Office-31, Office-Home and VisDA-C, respectively. One advantage of using masks is that every sample, whether high-confidence or low-confidence, is selected once per epoch. Hence the first and second items in Eq. (12) are balanced, and we set β1 = 1 for all datasets. We also assign equal weight to the third item in Eq. (12), both for simplicity and following prior work [3].
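For reference, the learning rate schedule quoted above is easy to reproduce; this is a generic sketch, not the authors' code.

def lr_at(progress, lr0=1e-2):
    # SHOT-style schedule: eta = eta_0 * (1 + 10 * t) ** (-0.75), with t in [0, 1]
    return lr0 * (1 + 10 * progress) ** (-0.75)

# e.g., halfway through training: lr_at(0.5) = 0.01 * 6 ** (-0.75), roughly 0.0026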

Baseline methods
We compare our RS2L with two categories of domain adaptation algorithms: vanilla UDA and SFDA. Eight state-of-the-art SFDA methods are selected: SHOT [3], SFDA [8], BAIT [6], MA [4], CPGA [5], NRC [32], SCLM [9] and ProxyMix [10]. We use bold to highlight the best results and underline to highlight the second-best results among the source-free methods.

Qualitative results
We first show the results of cross-domain object recognition on Office-31 in Table 1. RS2L achieves the second-best average result, only 0.2% below the best SFDA method, ProxyMix. Compared to other SFDA methods, RS2L performs competitively on all tasks except A ⇒ W. Our analysis suggests that some classes in domain W contain very few samples, which is not sufficient to depict the data structure of the target domain. Since we do not need any additional clustering method to generate pseudo-labels, our algorithm is quite simple.
Then, we look at the results on Office-Home in Table 2. Among all methods, RS2L obtains the best results on 17 tasks and improves the average accuracy by 0.6% over the previous best method, SCLM. These results suggest that the increase in data size helps the performance improvement.
As shown in Table 3, strong results are also achieved on VisDA-C. Specifically, our method reaches the best average accuracy of 86.9% and obtains the best results on 5 of the 12 classes. Compared to the second-best method, ProxyMix, RS2L improves by 1.2% on average.

Further experiments
In this section, we further analyze our RS2L method on the Office-Home dataset from the following three aspects.
Feature visualization Based on the 65-way classification results of the task A ⇒ C, we show the t-SNE [34] visualization of the target features in Fig. 3. Red points denote the unseen source data; their features are already compact before adaptation. After adaptation, the target sample features (yellow points) are distributed more compactly than before adaptation, as expected.
Training stability We show the accuracy curve of task A ⇒ C on Office-Home in Fig. 4. The accuracy grows quickly during training and then converges, as expected. Therefore, the training procedure of RS2L is stable and reliable.
Parameter sensitivity We analyze the three hyper-parameters k, γ and μ in this section. First, we set equal weights in Eq. (12) for simplicity. Then, we take a close look at k. Figure 5a shows that, with γ = 0.6 and μ = 10 fixed, the accuracy changes only slightly as k varies from 2 to 10. Intuitively, k is influenced by the size of the dataset: if the number of samples of a class is too small, some of the k neighbors of a sample will inevitably belong to other classes. Figure 5b, c shows that the parameters γ and μ also affect the model only slightly when we fix k = 5. Note that the parameter μ decides the threshold of the mask M_n.

Ablation study
In this subsection, we discuss the effectiveness of the key components of RS2L. The RS2L model consists of three independent loss functions. To further analyze the contribution of each part, we retest the model with different combinations of $\mathcal{L}_{sp2l}$, $\mathcal{L}_{self}$ and $\mathcal{L}_{im}$. The results are shown in Table 4. As mentioned in Sect. 3.3, if a sample $x_i$ belongs neither to the high-confidence samples of the $\mathcal{L}_{self}$ strategy nor to the low-confidence samples of the $\mathcal{L}_{sp2l}$ strategy, we set $M_n(x_i)$ to '1' instead of '0.' However, if the loss $\mathcal{L}_{self}$ is removed from a combination, the mask $M_h$ is removed as well; in that case, we set the elements of $M_n$ only according to the criterion in Sect. 3.2.
We also evaluate our method with and without masks. $\mathcal{L}^{*}_{sp2l}$ denotes structure-preserved pseudo-labeling supervision using all data, and $\mathcal{L}^{*}_{self}$ is self-supervised learning with all data. The results show that the masks have a positive impact on model training.
After fixing the parameters k, γ and μ, we search over a range of values to identify (empirically) optimal values for the loss scaling factors β1 and β2 in Fig. 6. Since the ablation study in Table 4 already covers the zero settings, in this supplementary experiment on Office-Home we only report results when the parameters are nonzero: β1 and β2 each take values in 0.5, 0.8, 1, 1.2 and 1.5. According to the experimental results, the maximum accuracy is concentrated around β1 = 1 and β2 = 1.

Conclusion
We proposed a new source-free domain adaptation method called robust self-supervised learning (RS2L). Two strategies are employed. The structure-preserved pseudo-labeling strategy generates local-structure-preserving pseudo-labels from neighbors; a low-confidence mask is used to filter out the high-confidence samples, and our new pseudo-labeling strategy is applied to the remaining samples. In this way, we reduce the negative impact of inaccurate pseudo-labels at the class boundary. The self-supervised learning with mask strategy is employed for the high-confidence samples. Since the high-confidence mask is computed from scores that accumulate historical information, self-supervised learning with the mask obtains better performance. Experiments on three popular benchmarks demonstrate the effectiveness of RS2L without access to source data.