Pseudo-trilateral adversarial training for domain adaptive traversability prediction

Traversability prediction is a fundamental perception capability for autonomous navigation. Deep neural networks (DNNs) have been widely used to predict traversability during the last decade. The performance of DNNs is significantly boosted by exploiting a large amount of data. However, the diversity of data across different domains imposes significant gaps in prediction performance. In this work, we make efforts to reduce these gaps by proposing a novel pseudo-trilateral adversarial model that adopts a coarse-to-fine alignment (CALI) to perform unsupervised domain adaptation (UDA). Our aim is to transfer the perception model with high data efficiency, eliminate prohibitively expensive data labeling, and improve the generalization capability during the adaptation from easy-to-access source domains to various challenging target domains. Existing UDA methods usually adopt a bilateral zero-sum game structure. We prove that our CALI model, a pseudo-trilateral game structure, is advantageous over existing bilateral game structures. The proposed work bridges theoretical analyses and algorithm designs, leading to an efficient UDA model with easy and stable training. We further develop a variant of CALI, Informed CALI (ICALI), which is inspired by the recent success of mixup data augmentation techniques and mixes informative regions based on the results of CALI. This mixture step provides an explicit bridge between the two domains and exposes under-performing classes more during training. We show the superiority of our proposed models over multiple baselines in several challenging domain adaptation setups. To further validate the effectiveness of our proposed models, we then combine our perception model with a visual planner to build a navigation system and show the high reliability of our model in complex natural environments.

Fig. 1: Transferring models from the available domain to the target domain. The existing available data might come from a simulator or from data collected in certain environments, at a certain time, and with certain sensors. In contrast, the target deployment might have significantly varying environments, time, and sensors.

Introduction
We consider the deployment of autonomous robots in real-world unstructured field environments, where the environments can be extremely complex, involving random obstacles (e.g., big rocks, tree stumps, man-made objects), cross-domain terrains (e.g., combinations of gravel, sand, wet, and uneven surfaces), as well as dense vegetation (tall and low grasses, shrubs, trees). Whenever a robot is deployed in such an environment, it needs to understand which area of the captured scene is navigable. A typical solution to this problem is visual traversability prediction, which can be achieved by learning scene semantic segmentation [Yang et al., 2023, Jin et al., 2021].
Visual traversability prediction has been tackled using deep neural networks whose models are typically trained offline with well-labeled datasets. However, there might exist a gap between the data used to train the model and the data encountered at test time. There are several existing datasets for semantic segmentation, e.g., GTA5 [Richter et al., 2016], SYNTHIA [Ros et al., 2016], Cityscapes [Cordts et al., 2016], ACDC [Sakaridis et al., 2021], Dark Zurich [Sakaridis et al., 2019], RUGD [Wigness et al., 2019], RELLIS [Jiang et al., 2020], and ORFD [Min et al., 2022]. Nevertheless, it is usually challenging for existing datasets to well approximate the true distributions of unseen target environments where the robot is deployed. Even the gradual collection and addition of new training data on an ongoing basis cannot ensure a comprehensive representation of target environments within the distribution. In addition, manually annotating labels for dense predictions, e.g., semantic segmentation, is prohibitively expensive due to the large volumes of data. Therefore, developing a generalization-aware deep model is crucial for the robustness, trustworthiness, and safety of robotic systems, considering the demands of the practical deployment of deep perception models and the costs/limits of collecting new data in many robotic applications, e.g., autonomous driving, search and rescue, and environmental monitoring.
To tackle this challenge, a broadly studied framework is transfer learning [Pan and Yang, 2009], which aims to transfer models between two domains, a source domain and a target domain, that have related but different data distributions. The prediction on the target domain can be considered as a strong generalization since testing data (in the target domain) might fall outside the independently and identically distributed (i.i.d.) assumption and follow a very different distribution than the training data (in the source domain). The "transfer" process is significant for our model development since we can view the available public datasets [Richter et al., 2016, Cordts et al., 2016, Wigness et al., 2019, Jiang et al., 2020] as the source domain and treat the data in the to-be-deployed environments as the target domain. In this case, we have access to images and corresponding labels in the source domain and images in the target domain, but no access to labels in the target domain. Transferring models in this setup is called Unsupervised Domain Adaptation (UDA) [Wilson and Cook, 2020, Zhang, 2021].
Domain Alignment (DA) [Ganin et al., 2016, Hoffman et al., 2016, 2018, Tsai et al., 2018, Vu et al., 2019] and Class Alignment (CA) [Saito et al., 2018] are two conventional ways to tackle the UDA problem. DA treats the deep features as a whole. It works well for image-level tasks such as image classification, but has issues with pixel-level tasks such as semantic segmentation [Saito et al., 2018], as the alignment of whole distributions ignores class features and might misalign class distributions, even when the whole feature distributions of the source domain and target domain are already well aligned. CA is proposed to solve this issue for dense predictions with multiple classes.
It is natural and necessary to use CA to tackle the UDA of semantic segmentation, as we need to consider aligning class features. However, CA can be problematic and might fail to outperform DA for segmentation; in the worst case, it might exhibit unacceptable negative transfer, meaning the performance with adaptation is even more degraded than without adaptation. We empirically found that the training of CA can be unstable and prone to divergence, leading to low performance or even failure of training. Typically, CA adopts a network consisting of a feature extractor and two classification heads. The net is first trained on the source domain such that the decision boundaries of the two heads are both able to well classify source features of different classes. Then the net is trained on the target domain in an adversarial manner, where the discrepancy between the two heads is used as the adversarial objective. During adversarial training, the goal of the feature extractor is to generate features such that the discrepancy between the two heads is minimized, while the goal of the classification heads is to adjust the decision boundaries such that the discrepancy is maximized. By doing so, the target features are enforced to be aligned with the trained source features. However, adversarial training can be highly unstable, especially when the adversarial objective is not properly valued. We conjecture the reason for the problem of CA is the value of the adversarial objective: the discrepancy between the two heads is too large due to the domain shift, placing the two classification heads in a near-optimal state at the beginning of training. In this case, the training of the two classification heads can quickly converge, breaking the equilibrium of the zero-sum game and leading to a failure of the whole training.
To address the potential shortcomings of CA, our intuition is to apply DA as a constraint that reduces the general distance between the two domains before applying CA. This also reduces the discrepancy value between the two classification heads, so the adversarial training over the discrepancy can be stabilized. An example explaining this intuition is shown in Fig. 2. To achieve this, we investigate the relationship between the upper bounds of the target-domain prediction error under DA and CA, and provide a theoretical analysis of these upper bounds for the two alignments in the UDA setup. Our theoretical analysis justifies the use of DA as a constraint on CA.
This paper presents an extended and revised version of our recent work CALI [Chen et al., 2022]. In summary, our contributions include the theoretical analysis relating the two upper bounds, the CALI model and its ICALI variant, and the combination of the proposed segmentation model with a visual planner to build a visual navigation system.

Related Work
Semantic Segmentation: Semantic segmentation aims to predict a unique human-defined semantic class for each pixel in the given images. With the prosperity of deep neural networks, the performance of semantic segmentation has been boosted significantly, especially by the advent of FCN [Long et al., 2015], which first proposed using deep convolutional neural nets to predict segmentation. The following works try to improve on FCN through multiple proposals, e.g., using different kernel sizes or dilation rates to aggregate multi-scale features [Chen et al., 2017a,b, Yu and Koltun, 2015]; building image pyramids to create multi-resolution inputs [Zhao et al., 2017]; applying probabilistic graphs to smooth the prediction [Liu et al., 2017]; and compensating features at deeper levels via an encoder-decoder structure [Ronneberger et al., 2015].
Recently, Transformers [Vaswani et al., 2017, Dosovitskiy et al., 2020] have gained huge popularity for various vision tasks, including semantic segmentation. Transformer-based models employ an attention mechanism to capture long-range dependencies among pixels. Different Transformer-based segmentation models have been developed: finer-grained and more globally coherent dense predictions are achieved by assembling tokens from various stages of the vision transformer into image-like representations at various resolutions [Ranftl et al., 2021]; semantic segmentation is treated as a sequence-to-sequence prediction task and a pure transformer without convolution or resolution reduction is used to encode an image as a sequence of patches [Zheng et al., 2021]; and a novel hierarchically structured Transformer encoder is combined with a lightweight MLP decoder to build a simple, efficient, yet powerful semantic segmentation framework [Xie et al., 2021]. Recent state-of-the-art works [Ranftl et al., 2021, Zheng et al., 2021, Xie et al., 2021] for semantic segmentation rely heavily on the Transformer structure. However, all of these methods are fully supervised, and their performance might degrade catastrophically when a domain shift exists between the training data and the data encountered at deployment.

Unsupervised Domain Adaptation:
The main approaches to tackle UDA include adversarial training (a.k.a. distribution alignment) [Ganin et al., 2016, Hoffman et al., 2016, 2018, Tsai et al., 2018, Saito et al., 2018, Vu et al., 2019, Luo et al., 2019, Wang et al., 2020] and self-training [Zou et al., 2018, Zhang et al., 2017, Mei et al., 2020, Hoyer et al., 2021]. Self-training maintains a teacher-student framework to conduct the knowledge transfer. The teacher model is trained on the source domain and used to predict segmentation for target images. The predictions from the teacher model are then used as pseudo labels to train the student model. During the training of the student model, existing methods use different ways to identify the erroneous regions, including confidence scores [Zou et al., 2019, 2018] and entropy [Xie et al., 2022, Chen et al., 2019, Pan et al., 2020]. Although self-training is becoming a popular method for segmentation UDA in terms of empirical results, it still lacks a sound theoretical foundation. In this paper, we only focus on the alignment-based methods, which not only remain close to the UDA state-of-the-art performance but are also well supported by sound theoretical analyses [Ben-David et al., 2007, Blitzer et al., 2008, Ben-David et al., 2010].
The alignment-based methods adapt models by aligning the distributions of the source domain and target domain in an adversarial training process, i.e., making the deep features of source images and target images indistinguishable to a discriminator net. Typical alignment-based approaches to UDA include Domain Alignment [Ganin et al., 2016, Hoffman et al., 2016, 2018, Tsai et al., 2018, Vu et al., 2019], which aligns the two domains using global features (aligning the feature tensor from source or target as a whole), and Class Alignment [Saito et al., 2018, Luo et al., 2019, Wang et al., 2020], which only considers aligning features of each class from source and target, regardless of whether the domain distributions are aligned. In Saito et al. [2018], the authors are inspired by the theoretical analysis of Ben-David et al. [2010] and propose a discrepancy-based model for aligning class features; there is a clear relation between the theory guidance [Ben-David et al., 2010] and the design of the network, loss, and training method. Some recent works [Luo et al., 2019, Wang et al., 2020] are similar to our proposed work in spirit and show improved results compared to Saito et al. [2018], but it remains unclear how to relate those algorithms to theory or why their structures, losses, and training procedures are designed the way they are.
Visual Navigation: To achieve visual navigation autonomously, learning-based methods have been widely studied recently [Shen et al., 2019, Bansal et al., 2020]. For example, imitation learning based approaches have been largely explored to train a navigation policy that enables a robot to mimic human behaviors or navigate close to certain waypoints without a prior map [Manderson et al., 2020, Hirose et al., 2020]. To fully utilize the known dynamics model of the robot, a semi-learning-based scheme is also proposed [Bansal et al., 2020] that combines optimal control and a deep neural network to navigate through unknown environments. A large amount of work on visual navigation can also be found in the computer vision community, e.g., Shen et al. [2019], Gupta et al. [2017a], Chaplot et al. [2020], Gupta et al. [2017b], Wu et al. [2019], all of which use fully learning-based methods to train navigation policies; these work remarkably well when training data is sufficient but can fail frequently if no or very limited data is available.
Background and Preliminary Materials

Expected Errors
We consider segmentation tasks where the input space is X ⊂ ℝ^{H×W×3}, representing the input RGB images, and the label space is Y ⊂ {0, 1}^{H×W×K}, representing the ground-truth K-class segmentation images. The label for a single pixel at (h, w) is denoted by a one-hot vector y^{(h,w)} ∈ ℝ^K whose elements are 0 by default, except that the i-th element is set to 1 if the i-th class is specified. Domain adaptation involves two domain distributions over X × Y, named the source domain D_s and the target domain D_t. In the setting of UDA for segmentation, we have access to m_s i.i.d. labeled samples U_s from the source domain and unlabeled samples U_t from the target domain. In the UDA problem, we need to reduce the prediction error on the target domain. With a slight abuse of notation, we also use h to denote a hypothesis, which is a function X → Y; we denote the space of h as H. With the loss function l(·, ·), the expected error of h on D_s is defined as

$$\epsilon_s(h) := \mathbb{E}_{(x,y)\sim \mathcal{D}_s}\, l(h(x), y). \quad (1)$$

Similarly, we can define the expected error of h on D_t as

$$\epsilon_t(h) := \mathbb{E}_{(x,y)\sim \mathcal{D}_t}\, l(h(x), y). \quad (2)$$
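A minimal numeric sketch of the empirical counterpart of these expected errors, using a per-pixel 0-1 loss for K-class segmentation (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def empirical_error(pred, label):
    """Empirical estimate of the expected error (Eqs. 1-2) on a finite
    sample, using per-pixel 0-1 loss. pred and label have shape
    (N, H, W, K): softmax scores and one-hot ground truth."""
    pred_cls = pred.argmax(axis=-1)   # (N, H, W) predicted class ids
    true_cls = label.argmax(axis=-1)  # (N, H, W) ground-truth class ids
    return float((pred_cls != true_cls).mean())

# Toy check: two 2x2 images with K = 3 classes.
rng = np.random.default_rng(0)
label = np.eye(3)[rng.integers(0, 3, size=(2, 2, 2))]  # one-hot labels
perfect = label.copy()                                 # a perfect predictor
print(empirical_error(perfect, label))                 # -> 0.0
```

On the target domain the labels are unavailable, which is exactly why the bounds in the next subsection matter: they control ε_t(h) through quantities that can be estimated without target labels.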

Upper Bounds for Expected Errors
Two important upper bounds related to the source and target error are given in Ben-David et al. [2010].
The first upper bound is given in the following theorem.
Theorem 1 For a hypothesis h,

$$\epsilon_t(h) \le \epsilon_s(h) + d_1(\mathcal{D}_s, \mathcal{D}_t) + \lambda, \quad (3)$$

where d_1(·, ·) is the L1 divergence between two distributions, and the constant term λ does not depend on any h. However, it is claimed in Ben-David et al. [2010] that the bound with the L1 divergence cannot be accurately estimated from finite samples, and that using the L1 divergence can unnecessarily inflate the bound. Another divergence measure is thus introduced to replace the L1 divergence, with a new bound derived. The new measure is defined as follows.

Definition 1 Given two domain distributions D_s and D_t over X, and a hypothesis space H that has finite VC dimension, the H-divergence between D_s and D_t is defined as

$$d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) = 2 \sup_{h \in \mathcal{H}} \left| \Pr_{x \sim \mathcal{D}_s}[h(x) = 1] - \Pr_{x \sim \mathcal{D}_t}[h(x) = 1] \right|. \quad (4)$$

The H-divergence resolves the issues of the L1 divergence. If we replace d_1(D_s, D_t) in Eq. (3) with d_H(D_s, D_t), then a new upper bound for ε_t(h), named UB_1, can be written as

$$\epsilon_t(h) \le \epsilon_s(h) + d_{\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) + \lambda. \quad (5)$$

An approach to compute the empirical H-divergence is also proposed in Ben-David et al. [2010]; see Lemma 1 below.
Lemma 1 For a symmetric hypothesis class H (one where for every h ∈ H, the inverse hypothesis 1 − h is also in H) and two sample sets U_s (of size m_s) and U_t (of size m_t), the approximated empirical H-divergence is computed as

$$\hat{d}_{\mathcal{H}}(U_s, U_t) = 2\left(1 - \min_{h \in \mathcal{H}} \left[\frac{1}{m_s}\sum_{x \in U_s} I[h(x) = 0] + \frac{1}{m_t}\sum_{x \in U_t} I[h(x) = 1]\right]\right), \quad (7)$$

where I[a] is an indicator function which is 1 if a is true, and 0 otherwise.

The second upper bound is based on a new hypothesis space called the symmetric difference hypothesis space, defined below.
Definition 2 For a hypothesis space H, the symmetric difference hypothesis space H∆H is the set of hypotheses

$$\mathcal{H}\Delta\mathcal{H} = \{\, h(x) \oplus h'(x) \mid h, h' \in \mathcal{H} \,\}, \quad (8)$$

where ⊕ denotes the XOR operation. Then we can define the H∆H-distance as

$$d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) = 2 \sup_{h, h' \in \mathcal{H}} \left| \Pr_{x \sim \mathcal{D}_s}[h(x) \ne h'(x)] - \Pr_{x \sim \mathcal{D}_t}[h(x) \ne h'(x)] \right|. \quad (9)$$

Similar to Eq. (5), if we replace d_1(D_s, D_t) with the H∆H-distance d_{H∆H}(D_s, D_t), the second upper bound for ε_t(h), named UB_2, can be expressed as

$$\epsilon_t(h) \le \epsilon_s(h) + d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) + \lambda, \quad (10)$$

where λ is the same term as in Eq. (3).
The two bounds (Eq. (5) and Eq. (10)) for the target domain error are given separately in Ben-David et al. [2010]. It has been independently demonstrated that DA corresponds to optimizing over UB_1 [Ganin et al., 2016], where optimization over UB_1 (Eq. (5) with the divergence in Eq. (7)) is proved equivalent to adversarial learning with Eq. (11) plus supervised learning on the source data, and that CA corresponds to optimizing over UB_2 [Saito et al., 2018], where d_{H∆H} is approximated by the discrepancy between two different classifiers.
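To make Lemma 1 concrete, here is a toy computation of the empirical H-divergence over a small, explicitly symmetric hypothesis class of 1-D threshold classifiers (a sketch standing in for the discriminator network; all names are illustrative):

```python
import numpy as np

def empirical_h_divergence(src, tgt, hypotheses):
    """Empirical H-divergence of Lemma 1 (Eq. 7): minimise over h the sum
    of the fraction of source samples labelled 0 and target samples
    labelled 1, then map through 2 * (1 - min)."""
    best = min(
        np.mean([h(x) == 0 for x in src]) + np.mean([h(x) == 1 for x in tgt])
        for h in hypotheses
    )
    return 2.0 * (1.0 - best)

# Symmetric class of thresholds: h_t(x) = 1[x > t] plus each inverse.
thresholds = np.linspace(-1.0, 1.0, 21)
hyps = [lambda x, t=t: int(x > t) for t in thresholds]
hyps += [lambda x, t=t: int(x <= t) for t in thresholds]   # inverses

src = [-0.9, -0.8, -0.7]   # "source" features
tgt = [0.7, 0.8, 0.9]      # "target" features
print(empirical_h_divergence(src, tgt, hyps))   # -> 2.0 (fully separable)
```

When the two sample sets coincide, no hypothesis can do better than chance and the divergence collapses to 0, which matches the intuition that an easily fooled domain classifier indicates aligned features.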
Training DA is straightforward since we can easily define binary labels for each domain, e.g., we can use 1 as the source domain label and 0 as the target domain label; adversarial training over the domain labels can then achieve domain alignment. CA, however, is difficult to implement because we do not have target labels: the target class features are completely unknown to us, making naive adversarial training over each class impossible. The existing, theoretically well-supported way to perform CA [Saito et al., 2018] is to indirectly align class features by devising two different classifier hypotheses. The two classifiers must be well trained on the source domain, able to classify the different source classes with different decision boundaries. Then, given the shift between source and target domains, the two trained classifiers might disagree on target domain classes. Note that since the two classifiers are already well trained on the source domain, regions where they agree represent target features that are close to the source domain, while regions where they disagree indicate a large shift between source and target. We use the disagreements to approximate the distance between the source and the target. If we can minimize the disagreements of the two classifiers, the features of each class in source and target will be enforced to align well.

Adversarial Training
A standard way to achieve the alignment for deep models is to use the adversarial training method, which is also used in Generative Adversarial Networks (GANs) [Goodfellow et al., 2014]. Therefore we explain the key concepts of adversarial training using the example of GANs.
A GAN is proposed to learn the distribution p_r of a set of given data {x} in an adversarial manner. The architecture consists of two networks: a generator G and a discriminator D. The generator G is responsible for generating fake data (with distribution p_g) from random noise z ∼ p_z to fool the discriminator D, whose role is instead to accurately distinguish between the fake data and the given data. Optimization of a GAN involves a minimax game over a joint loss for G and D:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_r}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], \quad (11)$$
where we use 1 as the real label and 0 as the fake label. Training with Eq. (11) is a bilateral game in which the distribution p_g is aligned with the distribution p_r.
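The value of the bilateral game in Eq. (11) can be evaluated numerically. The sketch below (names are illustrative) checks the classical equilibrium fact: when p_g = p_r, the optimal discriminator outputs 0.5 everywhere and the value is −log 4:

```python
import numpy as np

def gan_value(d_real, d_fake):
    """Value V(D, G) of the bilateral zero-sum game in Eq. (11):
    E_{x~p_r}[log D(x)] + E_{z~p_z}[log(1 - D(G(z)))].
    d_real / d_fake are the discriminator's outputs on real and
    generated samples (stand-ins for D(x) and D(G(z)))."""
    d_real, d_fake = np.asarray(d_real), np.asarray(d_fake)
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

# At equilibrium p_g = p_r, the optimal D outputs 0.5 everywhere:
v_eq = gan_value([0.5, 0.5], [0.5, 0.5])
print(v_eq)   # -> -1.386... (= -log 4)

# A confident, correct D raises the value, which is what D maximises:
print(gan_value([0.9, 0.9], [0.1, 0.1]))
```

The generator lowers this value by making d_fake indistinguishable from d_real, which is exactly the alignment mechanism reused by DA with domain labels in place of real/fake labels.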

Methodology
In this work we investigate the relationship between UB_1 (Eq. (5)) and UB_2 (Eq. (10)) and prove that UB_1 turns out to be an upper bound of UB_2, meaning DA can act as a necessary constraint on CA. This is also consistent with our intuition: DA aligns features globally in a coarse way while CA aligns features locally in a finer way, so constraining CA with DA is a coarse-to-fine process. We use the coarse alignment to reduce the general domain distance and regularize the adversarial objective for the fine alignment to a proper level. DA has been shown to be a zero-sum game between a feature extractor and a domain discriminator [Ganin et al., 2016], while CA is proved to be a zero-sum game between a feature extractor and two classifiers [Saito et al., 2018]. Both DA and CA are bilateral games; see (a) and (b) in Fig. 3. In this work, we propose a novel concept, the pseudo-trilateral game structure (PTGS), for efficiently integrating the game structures of DA and CA; see (c) in Fig. 3. Three players are involved in the proposed PTGS: a feature extractor G, a domain discriminator D, and a family of classifiers Cs. The game between G and Cs is the CA, while the game between G and D is the DA. According to the identified relation in Eq. (12), the two upper bounds ÛB_1 and ÛB_2 need to use the same features, hence we connect the domain alignment and class alignment through a shared feature extractor. Both D and Cs try to adjust G such that the features generated from G for source and target are well aligned both globally and locally. There is no game between the Cs and the D. The DA and CA in the PTGS are performed in an alternating way during training.
Notations used in this paper are explained as follows. We denote the segmentation model h as h_{θ,φ}(x) = C_θ(G_φ(x)), which consists of a feature extractor G_φ parameterized by φ and a classifier C_θ parameterized by θ, where x is a sample from U_s or U_t. If multiple classifiers are used, we denote the j-th classifier as C_j. We denote the discriminator as D_ψ, parameterized by ψ.

Bounds Relation
We start by examining the relationship between DA and CA from the perspective of the target error bound. We propose to use this relation to improve the segmentation performance of class alignment, which is desired for dense prediction tasks. We provide the following theorem.

Theorem 2 If we assume there is a hypothesis space H for the segmentation model h_{θ,φ} and a hypothesis space H_D for the domain classifiers D_ψ, with H∆H ⊂ H_D, then we have

$$\widehat{UB}_2 = \epsilon_s(h) + d_{\mathcal{H}\Delta\mathcal{H}}(\mathcal{D}_s, \mathcal{D}_t) + \lambda \;\le\; \epsilon_s(h) + d_{\mathcal{H}_D}(\mathcal{D}_s, \mathcal{D}_t) + \lambda = \widehat{UB}_1. \quad (12)$$

The proof of this theorem is provided in Section 7.1. Essentially, we limit the hypothesis spaces H and H_D in Eq. (12) to the space of deep neural networks. Directly optimizing over ÛB_2 might be hard to converge since ÛB_2 is a tighter upper bound on the target prediction error. The bounds relation in Eq. (12) shows that ÛB_1 is an upper bound of ÛB_2. This gives us a clue for improving the training process of class alignment: the domain alignment can serve as a global constraint and narrow down the search space for the class alignment. It also implies that integrating domain alignment and class alignment might boost both the training efficiency and the prediction performance of UDA. This inspires us to design a new model, which we describe in the subsequent sections.

Model Structure
Following our proposed PTGS and the identified relation in Eq. (12), we design the structure of our CALI model as shown in stage I of Fig. 4. Four networks are involved: a shared feature extractor G_φ, a domain discriminator D_ψ, and two classifiers C_{θ1} and C_{θ2}. Furthermore, as defined in Eq. (9), h and h′ are two different hypotheses, so we have to ensure the classifiers C_{θ1} and C_{θ2} are different. Note that the supervision signal from the source domain (y_s in Fig. 4) is used to train C_{θ1} and C_{θ2} because both classifiers are expected to produce correct decision boundaries on the source domain.
No label from the target domain is used during training.
We also show the steps of our ICALI model in Fig. 4, stages II and III. Stage II shows how a pair of a mixed image and the corresponding mixed label is generated. First, the prediction of the source image, o_{1s}, is used to compute the performance (measured by the mean Intersection over Union (mIoU)) for all classes. Note that o_{1s} is the prediction of C_{θ1} in stage I. The classes are then divided into a group of well-performing classes and a group of under-performing classes; the ratio for this division is a hyperparameter. A selection mask for source data, M_s, is then generated by extracting the regions of under-performing classes in the source ground-truth label. We can easily obtain the mask for target data: M_t = 1 − M_s. Next, the mixed data can be generated by

$$x_m = M_s \odot x_s + M_t \odot x_t, \qquad y_m = M_s \odot y_s + M_t \odot o_{1t},$$

where o_{1t} is the prediction of C_{θ1} on the target image in stage I. In stage III, the newly generated data {x_m, y_m} are used to train the models G_φ and C_{θ1}. Note that C_{θ2} is not included in the training of stage III so that the difference between the two classifiers can be well maintained.
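The stage-II mixing step can be sketched with NumPy as below. This is a simplified illustration (class-id label maps instead of one-hot tensors; all names are hypothetical), showing how regions of under-performing classes are cut from the source and pasted over the target, with the target's label contribution coming from the stage-I pseudo label o_1t:

```python
import numpy as np

def class_mix(x_s, y_s, x_t, o_1t, under_ids):
    """Informed mixing (a sketch of ICALI stage II). Pixels whose source
    label belongs to an under-performing class come from the source pair
    (x_s, y_s); all other pixels come from the target image and its
    pseudo label o_1t."""
    m_s = np.isin(y_s, under_ids).astype(x_s.dtype)    # source mask M_s
    m_t = 1.0 - m_s                                    # target mask M_t
    x_m = m_s[..., None] * x_s + m_t[..., None] * x_t  # mixed image
    y_m = m_s * y_s + m_t * o_1t                       # mixed (pseudo) label
    return x_m, y_m

# Toy 2x2 example; class 7 is "under-performing".
y_s  = np.array([[7, 1], [1, 7]])
o_1t = np.array([[2, 2], [2, 2]])
x_s  = np.zeros((2, 2, 3))
x_t  = np.ones((2, 2, 3))
x_m, y_m = class_mix(x_s, y_s, x_t, o_1t, under_ids=[7])
print(y_m)   # class 7 kept from the source, pseudo label 2 elsewhere
```

The mixed pair explicitly bridges the two domains and oversamples the under-performing classes, which is the intended effect described above.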

Losses
We denote raw images from the source or target domain as x, and the label from the source domain as y. We use the semantic labels in the source domain to train all of the nets except the domain discriminator in a supervised way (see the solid red one-way arrow in Fig. 4). We need to minimize the supervised segmentation loss since Eq. (12) and the related equations show that the source prediction error is also part of the upper bound of the target error. In this section, we omit superscripts for all model notations. The supervised segmentation loss for training CALI is a standard cross-entropy:

$$\mathcal{L}_{seg}(G, C_1, C_2) = -\,\mathbb{E}_{(x,y)\sim \mathcal{D}_s} \sum_{h,w,k} \big( y \odot \log h_{\theta,\phi}(x) \big)^{(h,w,k)},$$

where ⊙ represents the element-wise multiplication between two tensors.
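A minimal sketch of this pixel-wise cross-entropy written with the element-wise product of the one-hot label and the log-prediction (shapes and names are illustrative):

```python
import numpy as np

def seg_loss(probs, y_onehot, eps=1e-12):
    """Supervised segmentation loss: per-pixel cross-entropy via the
    element-wise product of one-hot labels and log-predictions.
    probs, y_onehot: arrays of shape (H, W, K)."""
    return float(-np.mean(np.sum(y_onehot * np.log(probs + eps), axis=-1)))

# A confident, correct prediction gives a near-zero loss.
y = np.eye(3)[np.array([[0, 1], [2, 0]])]   # (2, 2, 3) one-hot labels
good = 0.99 * y + 0.005 * (1 - y)           # each pixel's scores sum to 1
print(round(seg_loss(good, y), 4))
```

Because the one-hot label zeroes out all but the true-class term, only log p(true class) contributes at each pixel, matching the ⊙ formulation above.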
To perform domain alignment, we need to define the joint loss function for G and D, where no segmentation labels but only domain labels are used; we use the standard cross-entropy to compute the domain classification loss for both source data (CE_s(x)) and target data (CE_t(x)). Using 1 as the source domain label and 0 as the target domain label, we have

$$CE_s(x) = -\,\mathbb{E}_{x \sim \mathcal{D}_s} \log D(G(x)) \quad (16)$$

and

$$CE_t(x) = -\,\mathbb{E}_{x \sim \mathcal{D}_t} \log\big(1 - D(G(x))\big). \quad (17)$$

Note we include G in Eq. (16) since both the source data and the target data are passed through the feature extractor. This differs from a standard GAN, where the real data is fed directly to D without passing through the generator.
To perform class alignment, we need to define the joint loss function for G, C_1, and C_2:

$$\mathcal{L}_{dis}(G, C_1, C_2) = \mathbb{E}_{x \sim \mathcal{D}_t}\, d\big(C_1(G(x)), C_2(G(x))\big),$$

where d(·, ·) is the distance measure between the two distributions produced by the two classifiers. In this paper, we use the same L1 distance as Saito et al. [2018], i.e., d(p, q) = (1/K)‖p − q‖₁, where p and q are two distributions and K is the number of label classes.
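The discrepancy measure d(p, q) is simple enough to verify numerically; a sketch (names are illustrative):

```python
import numpy as np

def discrepancy(p, q):
    """Classifier discrepancy d(p, q) = (1/K) * ||p - q||_1 between the
    per-pixel class distributions of the two heads; K = number of classes.
    p, q: arrays of shape (..., K)."""
    p, q = np.asarray(p), np.asarray(q)
    k = p.shape[-1]
    return float(np.mean(np.sum(np.abs(p - q), axis=-1) / k))

# Agreement gives zero discrepancy; disagreement raises it.
p = np.array([[0.7, 0.2, 0.1]])
q = np.array([[0.1, 0.2, 0.7]])
print(discrepancy(p, p))   # -> 0.0
print(discrepancy(p, q))   # -> 0.4
```

During training, G minimizes this quantity on target samples while C_1 and C_2 maximize it, implementing the zero-sum game described in Section "Background and Preliminary Materials".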
To prevent C_1 and C_2 from converging to the same network throughout the training, we use the cosine similarity as a weight regularization to maximize the difference between the weights of C_1 and C_2, i.e.,

$$\mathcal{L}_{w}(C_1, C_2) = \frac{w_1 \cdot w_2}{\lVert w_1 \rVert\, \lVert w_2 \rVert},$$

where w_1 and w_2 are the weight vectors of C_1 and C_2, respectively.
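A sketch of this regularizer on flattened weight vectors (minimizing it pushes the two classifiers apart; names are illustrative):

```python
import numpy as np

def weight_similarity(w1, w2):
    """Cosine similarity between the flattened weight vectors of the two
    classifiers; used as a regularization term to be minimised."""
    w1, w2 = np.ravel(w1), np.ravel(w2)
    return float(w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2)))

print(weight_similarity([1.0, 0.0], [1.0, 0.0]))   # identical  -> 1.0
print(weight_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal -> 0.0
```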
The extra supervised training in ICALI only involves a simple loss: the supervised segmentation loss applied to the mixed data {x_m, y_m}.

Fig. 5: Images from the camera on a ground robot in a real indoor environment. Left column: RGB images; Middle column: surface normal images; Right column: predicted segmentation images (yellow: navigable; green: non-navigable). Blue dots are sampled points from the predicted boundary. The orange dashed line represents an approximated convex function shape while the red dashed line is an approximated concave function shape.

Training Algorithm
We integrate the training processes of domain alignment and class alignment to systematically train our CALI model. To be consistent with Eq. (12), we adopt an iterative mechanism that alternates between domain alignment and class alignment. We present the pseudocode for the training of CALI and ICALI in Algorithm 4.1. Note that the adversarial training order for V_1 in Algorithm 4.1 is max_ψ min_φ, instead of min_φ max_ψ, meaning in each training iteration we first update the feature extractor and then the discriminator. The reason for this order is that we empirically find the features from G are relatively easy for D to discriminate; if we trained D first, D might become an accurate discriminator early in training, leaving no adversarial signal for training G and thus making the whole training fail. The same order applies to the training of the pair of G and Cs with V_2.
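The alternating schedule described above can be sketched as a skeleton loop. The update callables below are illustrative stand-ins for the actual gradient steps on the networks (none of the names come from the paper):

```python
def train_cali(num_iters, step_seg, step_da, step_ca):
    """Skeleton of the alternating CALI training loop: supervised
    segmentation on source, then domain alignment, then class alignment,
    each with the extractor G updated before its adversary."""
    for _ in range(num_iters):
        step_seg()                 # supervised loss on source labels
        # Domain alignment (order max_psi min_phi): G first, then D,
        # so D cannot win before G receives an adversarial signal.
        step_da("G"); step_da("D")
        # Class alignment: G minimises the head discrepancy first,
        # then the classifiers C1/C2 maximise it.
        step_ca("G"); step_ca("C")

log = []
train_cali(2, lambda: log.append("seg"),
           lambda who: log.append("da-" + who),
           lambda who: log.append("ca-" + who))
print(log)
```

Running the skeleton records the update order seg, da-G, da-D, ca-G, ca-C per iteration, which mirrors the ordering argument made in the text.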

Visual Planner
We design a visual receding horizon planner that combines the learned image segmentation to achieve feasible visual navigation. Specifically, we first compute a library of motion primitives [Howard and Kelly, 2007, Howard et al., 2008], where each primitive is a sequence of robot poses. Then we project the motion primitives onto the image plane and compute a navigation cost for each primitive based on an evaluation of collision risk in image space and of target progress. Finally, we select the primitive with minimal cost to execute. The trajectory selection problem can be defined as

$$p^* = \arg\min_{p} \; w_1 C_c(p) + w_2 C_t(p),$$

where $C_c(p) = \sum_{j} c_c^j$ and $C_t(p) = \sum_{j} c_t^j$ are the collision cost and target cost of one primitive p, summed over its poses, and w_1, w_2 are the corresponding weights, respectively.
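The selection step reduces to a weighted argmin over the primitive library; a minimal sketch with toy cost values (all names and numbers are illustrative):

```python
import numpy as np

def select_primitive(costs_c, costs_t, w1, w2):
    """Receding-horizon selection: total cost of each primitive is
    w1 * C_c(p) + w2 * C_t(p); return the index of the cheapest one."""
    total = w1 * np.asarray(costs_c) + w2 * np.asarray(costs_t)
    return int(np.argmin(total))

# Three candidate primitives: the second is both safe and on target.
c_collision = [0.9, 0.1, 0.5]   # C_c(p), summed per-pose collision risks
c_target    = [0.2, 0.3, 0.9]   # C_t(p), summed per-pose target costs
print(select_primitive(c_collision, c_target, w1=1.0, w2=1.0))   # -> 1
```

The weights w_1 and w_2 trade off safety against progress; the next two subsections define how c_c^j and c_t^j are computed.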

Collision Avoidance
In this work, we propose a Scaled Euclidean Distance Field (SEDF) for obstacle avoidance. Conventional collision avoidance is usually conducted in the map space [Gao et al., 2017, Han et al., 2019], where an occupancy map and the corresponding Euclidean Signed Distance Field (ESDF) have to be provided in advance or constructed incrementally in real time. In this work we instead eliminate this expensive map construction process and evaluate the collision risk directly in the image space. Specifically, we first compute a SEDF image E(S) based on an edge map Edge(S) detected from the learned binary segmentation S(I), where I is the input image. We then project the motion primitives from the map space to the image space and evaluate all primitives' projections in E(S).
To perform obstacle avoidance in image space, we have to detect the obstacle boundary in Edge(S). To achieve this, we propose to categorize the edges in Edge(S) into two classes, Strong Obstacle Boundaries (SOBs) and Weak Obstacle Boundaries (WOBs). We treat the boundary from the binary segmentation as a function of a single variable in image space, and we use the twin notions of convexity and concavity of functions to define the SOBs and WOBs, respectively. SOBs mean obstacles are near the robot (e.g., random furniture closely surrounding the robot), and they cause the boundaries to exhibit an approximated ConVex Function Shape (CVFS) (see Fig. 5(c)). WOBs indicate obstacles far from the robot (e.g., wall boundaries from which the robot keeps a large clearance), and they typically make the boundaries reveal an approximated ConCave Function Shape (CCFS) (see Fig. 5(f)). In this work, we only consider obstacles with SOBs and adopt a straightforward way (as in Eq. (22)) to detect boundary segments in Edge(S) with CVFS. We use a point set Ω to represent the boundary segments, where (u, v) are the coordinates in the image frame, as shown in Fig. 6(a), and v_thres is a pre-defined value for evaluating the boundary convexity. If we use ∂Ω to denote the boundary of a set, then in our case we have ∂Ω = Ω. Then the definition of an EDF is

$$EDF(x) = \min_{y \in \partial\Omega} d(x, y), \quad (23)$$

where d(x, y) = ‖x − y‖ is the Euclidean distance between vectors x and y. However, directly computing an EDF using Eq. (23) in the image space will propagate the obstacles' gradients to the whole image space, which might cause the planning evaluation space to be too limited. To address this, we introduce a scale factor α ∈ [0, 1] to compute a corrected version of the EDF, where U and V are the row and column index sets, respectively. Some examples of E with different α values can be seen in Fig. 6. Assuming x_j is the j-th pose in one primitive and its image coordinates are (u_j, v_j), the collision risk for x_j is read from the SEDF at that projection, c_c^j = E(u_j, v_j).
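A brute-force sketch of a scaled EDF over a small image grid. One plausible reading of the scaling (an assumption, since the exact form of the corrected field is not reproduced here) is to cap the distance field at a fraction α of the maximum image-space distance, limiting how far obstacle gradients propagate:

```python
import numpy as np

def scaled_edf(boundary_mask, alpha):
    """Brute-force Scaled Euclidean Distance Field (a sketch).
    EDF(x) = min over boundary pixels y of ||x - y|| (Eq. 23); the scale
    factor alpha in [0, 1] clips the field at alpha times the image
    diagonal (an assumed form of the scaling)."""
    h, w = boundary_mask.shape
    ys, xs = np.nonzero(boundary_mask)
    pts = np.stack([ys, xs], axis=1).astype(float)          # boundary set
    grid = np.stack(np.meshgrid(np.arange(h), np.arange(w),
                                indexing="ij"), axis=-1).astype(float)
    dists = np.linalg.norm(grid[:, :, None, :] - pts[None, None], axis=-1)
    edf = dists.min(axis=2)                                 # Eq. (23)
    cap = alpha * np.hypot(h - 1, w - 1)                    # clip radius
    return np.minimum(edf, cap)

mask = np.zeros((5, 5), dtype=bool)
mask[2, 2] = True                       # a single obstacle-boundary pixel
e = scaled_edf(mask, alpha=0.25)
print(e[2, 2], round(float(e[0, 0]), 3))
```

With α = 0.25, pixels far from the boundary saturate at the cap, so the planner sees a flat (risk-free) region beyond the clipped radius instead of obstacle gradients over the whole image.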

Target Progress
To evaluate target progress during navigation, we propose to use a distance on SE(3) as the metric. We define three types of frames: the world frame F_w, the primitive pose frame F_pj, and the goal frame F_g. The transformation of F_pj in F_w is denoted as T_wpj, while that of F_g in F_w is T_wg. A typical approach to represent the distance is to split a pose into a position and an orientation and define two distances, on R^3 and SO(3).
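A hedged sketch of such a split pose distance is below, assuming the weighted p-norm fusion d = (a · d_trans^p + b · d_rot^p)^(1/p) with p = 2, Euclidean translation distance, and the SO(3) geodesic (rotation angle of R1ᵀR2) as the rotational distance:

```python
import numpy as np

def se3_distance(T1: np.ndarray, T2: np.ndarray,
                 a: float = 1.0, b: float = 1.0, p: int = 2) -> float:
    """Distance between two 4x4 homogeneous transforms.

    Splits each pose into translation and rotation, then fuses
    d_trans (Euclidean on R^3) and d_rot (geodesic on SO(3)) as
    (a * d_trans**p + b * d_rot**p) ** (1/p).
    """
    t1, t2 = T1[:3, 3], T2[:3, 3]
    R1, R2 = T1[:3, :3], T2[:3, :3]
    d_trans = np.linalg.norm(t1 - t2)
    # Riemannian (geodesic) distance on SO(3): rotation angle of R1^T R2.
    cos_theta = np.clip((np.trace(R1.T @ R2) - 1.0) / 2.0, -1.0, 1.0)
    d_rot = np.arccos(cos_theta)
    return float((a * d_trans**p + b * d_rot**p) ** (1.0 / p))
```

The scaling factors a and b trade off positional against rotational progress; their values here are placeholders, not the ones used in the experiments.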
Then the two distances can be fused in a weighted manner with two strictly positive scaling factors a and b and an exponent parameter p. We use the Euclidean distance as d_trans(t_wpj, t_wg) and the Riemannian distance over SO(3) as d_rot(R_wpj, R_wg), and set p to 2. Then the distance (target cost) between two transformation matrices can be defined [Park, 1995] as:

Experiments

Datasets
We evaluate CALI together with several baseline methods on a few challenging domain adaptation scenarios, investigating several public datasets, e.g., GTA5 [Richter et al., 2016], Cityscapes [Cordts et al., 2016], RUGD [Wigness et al., 2019], and RELLIS [Jiang et al., 2020], as well as a small self-collected dataset, named MESH (see the first column of Fig. 9). The GTA5 dataset contains 24,966 synthesized high-resolution images of urban environments from a video game, with pixel-wise semantic annotations of 33 classes. The Cityscapes dataset consists of 5,000 finely annotated images with labels for 19 commonly seen categories in urban environments, e.g., road, sidewalk, tree, person, and car. RUGD and RELLIS are two datasets aimed at evaluating segmentation performance in off-road environments; they contain 24 and 20 classes with 8,000 and 6,000 images, respectively, and cover various scenes such as trails, creeks, parks, villages, and puddle terrains. Our dataset, MESH, includes features like grass, trees (particularly challenging in winter due to foliage loss and monochromatic colors), and mulch. It helps us further validate the performance of our proposed model for traversability prediction in challenging scenes, particularly off-road environments.

Implementation Details
To be consistent with our theoretical analysis, the implementation of CALI adopts only the necessary indications of Eq. (12). First, Eq. (12) requires that the inputs of the two upper bounds (one for DA and the other for CA) be the same. Second, nothing but domain classification and hypothesis discrepancy is involved in Eq. (12) and the other related analyses (Eqs. (3)-(10)). Accordingly, we strictly follow the guidance of our theoretical analyses. First, CALI performs DA at the intermediate-feature level (f in Fig. 4), instead of the output-feature level used in Vu et al. [2019]. Second, we exclude the multiple additional tricks, e.g., the entropy-based and multi-level feature alignment and class-ratio priors in Vu et al. [2019], and the multi-step training for the feature extractor in Saito et al. [2018]. We also implement the baseline methods without those techniques for a fair comparison. To avoid possible performance degradation caused by class imbalance in the datasets, we regroup rare classes into classes with a higher pixel ratio. For example, we treat building, wall, and fence as the same class, and person and rider as the same class, in the adaptation of GTA5→Cityscapes. In the adaptation of RUGD→RELLIS, we treat tree, bush, and log as the same class, and rock and rockbed as the same class. Details about the remapping can be seen in Fig. 15 and Fig. 16 in Section 7.2.
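The regrouping step is a per-pixel lookup-table remap of label IDs; a minimal sketch is below. The class IDs here are hypothetical placeholders — the actual groupings follow the tables in Fig. 15 and Fig. 16.

```python
import numpy as np

# Hypothetical old-ID -> new-ID mapping for illustration only.
REMAP = {
    0: 0,  # building -> building/wall/fence group
    1: 0,  # wall     -> building/wall/fence group
    2: 0,  # fence    -> building/wall/fence group
    3: 1,  # person   -> person/rider group
    4: 1,  # rider    -> person/rider group
    5: 2,  # road     -> road
}

def remap_labels(label_map: np.ndarray) -> np.ndarray:
    """Regroup rare classes by applying a lookup table per pixel."""
    lut = np.zeros(max(REMAP) + 1, dtype=np.int64)
    for old_id, new_id in REMAP.items():
        lut[old_id] = new_id
    return lut[label_map]
```

Because the remap is a single fancy-indexing operation, it adds negligible cost to the data-loading pipeline.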
We use the PyTorch [Paszke et al., 2019] framework for implementation. Training images from the source and target domains are cropped to half of their original dimensions. The batch size is set to 1 and the weights of all batch-normalization layers are fixed. We use ResNet-101 [He et al., 2016] pretrained on ImageNet [Deng et al., 2009] as the model G for extracting features. We use the ASPP module in DeepLab-V2 [Chen et al., 2017a] as the structure for C_1 and C_2. We use a structure similar to Radford et al. [2015] as the discriminator D, which consists of 5 convolution layers with 4 × 4 kernels, channel sizes {64, 128, 256, 512, 1}, and a stride of 2. Each convolution layer is followed by a Leaky-ReLU [Maas et al., 2013] parameterized by 0.2, except that the last convolution layer is followed by a Sigmoid function. During training, we use SGD [Bottou, 2010] as the optimizer for G, C_1, and C_2 with a momentum of 0.9, and use Adam [Kingma and Ba, 2014] to optimize D with β_1 = 0.9 and β_2 = 0.99. We set a weight decay of 5e-4 for all SGD optimizers. The initial learning rates of all SGDs for performing domain alignment are set to 2.5e-4, and that of Adam is set to 1e-4. For class alignment, the initial learning rate of the SGDs is set to 1e-3. All of the learning rates are decayed by a poly learning-rate policy, where the initial learning rate is multiplied by (1 − iter/max_iters)^power with power = 0.9. All experiments are conducted on a single Nvidia Geforce RTX 2080 Super GPU.
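The poly learning-rate policy above reduces to a one-line schedule:

```python
def poly_lr(base_lr: float, it: int, max_iters: int, power: float = 0.9) -> float:
    """Poly learning-rate policy: base_lr * (1 - iter / max_iters) ** power."""
    return base_lr * (1.0 - it / max_iters) ** power
```

For example, with the DA setting above (base_lr = 2.5e-4, power = 0.9), the rate starts at 2.5e-4 and decays smoothly to 0 at max_iters; in PyTorch this is typically applied by updating each parameter group's `lr` before the optimizer step.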

Comparative Studies
We present comparative experimental results of our proposed model, CALI, against different baseline methods: the Source-Only (SO) method, the Domain-Alignment (DA) [Vu et al., 2019] method, and the Class-Alignment (CA) [Saito et al., 2018] method. Specifically, we first perform evaluations on a sim2real UDA in city-like environments, where the source domain is GTA5 and the target domain is Cityscapes. We then consider a real2real transfer in forest environments, where the source and target domains are RUGD and RELLIS, respectively. All models are trained with full access to the images and labels in the source domain and with access only to the images in the target domain. The labels in the target datasets are used only for evaluation. Finally, we further validate our model's performance by adapting from RUGD to our self-collected dataset, MESH.
To ensure a fair comparison, all methods use the same feature extractor G; both DA and CALI use the same domain discriminator D; both CA and CALI use the same two classifiers C_1 and C_2. We also use the same optimizers and optimization-related hyperparameters, wherever applicable, for all models under comparison.
We use the IoU to evaluate each class and the mIoU for overall segmentation performance on testing images. IoU is computed as n_tp / (n_tp + n_fp + n_fn), where n_tp, n_fp, and n_fn are the numbers of true positive, false positive, and false negative pixels, respectively.
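The metric can be computed directly from per-class pixel counts:

```python
import numpy as np

def per_class_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """IoU per class: n_tp / (n_tp + n_fp + n_fn); NaN for absent classes."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:
            ious[c] = tp / denom
    return ious

def miou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes that appear in prediction or ground truth."""
    return float(np.nanmean(per_class_iou(pred, gt, num_classes)))
```

Averaging with `nanmean` skips classes that never occur in a test image, which matters when evaluating subsets of regrouped classes as in Tables 1 and 2.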

GTA5→Cityscapes
Quantitative comparison results of GTA5→Cityscapes are shown in Table 1, where segmentations are evaluated on 9 classes (as regrouped in Fig. 15). Our proposed methods (CALI & ICALI) show significant advantages over multiple baseline methods, and ICALI achieves the best performance for all categories as well as the best overall performance (mIoU*).
In our testing case, SO achieves the highest score for the class person even without any domain adaptation. One possible reason is that the deep features of the source person and the target person, from the model trained solely on the source domain, are already well-aligned. If we interfere with this well-aligned relation through unnecessary additional efforts, the target prediction error might increase (see the mIoU values for person from the other three methods). We call this phenomenon negative transfer; it also happens to other classes if we compare SO with DA/CA, e.g., sidewalk, building, sky, and vegetation. In contrast, CALI maintains improved performance compared to SO and DA/CA. The comparison between CALI and the baselines validates our analysis of DA and CA (Section 4.1): either DA or CA alone is problematic for semantic segmentation, particularly when we strictly follow what the theory supports and do not include any other training tricks (which might increase the training complexity and make the training unstable). It also implies that integrating DA and CA lets the two benefit each other, yielding significant improvements; more importantly, CALI is well supported theoretically, and its training process is easy and stable. Based on CALI, ICALI further improves performance and achieves the best results for all classes, validating the effectiveness of introducing the extra training on mixed data into CALI. Fig. 7 shows examples of the qualitative comparison for the UDA of GTA5→Cityscapes. CALI predictions are less noisy than those of the baseline methods, as shown in the second and third columns (sidewalk and car on road), and show better completeness (part of the car is missing in the baselines; see the fourth column). ICALI further refines these predictions.

RUGD→RELLIS
We show quantitative results of RUGD→RELLIS in Table 2, where only 5 classes‡ are evaluated. We observe trends similar to those in Table 1. More specifically, both tables show that CA suffers from negative transfer (compared with SO) for both sim2real and real2real UDA. However, if we constrain the training of CA with DA, as in our proposed model CALI, the performance is remarkably improved. Some qualitative results are shown in Fig. 8. However, if we compare CALI and ICALI, the gain in the RUGD→RELLIS setting is much less significant than the one in Table 1. Looking at the qualitative results in Fig. 8, some predictions of ICALI look even worse than CALI's, e.g., the last two rows. This is because the mixture step in ICALI highlights under-performing classes only in the source domain (as we can only reliably identify well- and under-performing classes using the labels provided in the source domain). This works well when the label shift between the source domain and the target domain is mild, e.g., in the adaptation of GTA5→Cityscapes. However, in the adaptation of RUGD→RELLIS, the label shift is remarkable: the proportion of classes has changed significantly, and only a few of the classes in the source domain appear in the target domain. In this case, highlighting classes that rarely appear in the target domain might mislead the model and degrade the performance.
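The label shift discussed above can be quantified by comparing the class-frequency vectors of the two domains; one simple choice (an illustration, not a metric used in the paper) is the total variation distance:

```python
import numpy as np

def class_frequencies(label_maps, num_classes: int) -> np.ndarray:
    """Per-class pixel frequency over a list of label maps."""
    counts = np.zeros(num_classes)
    for lbl in label_maps:
        counts += np.bincount(lbl.ravel(), minlength=num_classes)
    return counts / counts.sum()

def label_shift(src_labels, tgt_labels, num_classes: int) -> float:
    """Total variation distance between class-frequency vectors.

    0 means identical class proportions; values near 1 mean severe
    label shift (classes dominant in one domain barely appear in the other).
    """
    p = class_frequencies(src_labels, num_classes)
    q = class_frequencies(tgt_labels, num_classes)
    return 0.5 * float(np.abs(p - q).sum())
```

A large value of this statistic between RUGD and RELLIS (or MESH) would be consistent with the observation that ICALI's informed mixing helps less under strong label shift.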

RUGD→MESH
Our MESH dataset contains only unlabeled images, which restricts us to a qualitative comparison for the UDA of RUGD→MESH, as shown in Fig. 9. We collected data in winter forest environments, which differ significantly from the images in the source domain (RUGD), collected in a different season, e.g., summer or spring. These cross-season scenarios make the prediction more challenging. However, evaluating UDA performance in cross-season scenarios is highly practical: we might have to deploy our robot at any time, even in extreme weather conditions, while our available datasets are far from covering every season and weather condition. From Fig. 9, we can still see the obvious advantages of our proposed CALI model over the baselines. Since the label shift between RUGD and MESH is still large, the advantage of ICALI over CALI remains unremarkable.

Discussions
In this section, we discuss our model (CALI) behaviors in more detail. Specifically, we first explain the advantages of CALI over CA from the perspective of the training process. Second, we show the severe consequences of mistakenly using the wrong order of adversarial training.
The most important part of CA is the discrepancy between the two classifiers, which is the only training force for the functionality of CA. It has been empirically shown in Saito et al. [2018] that the target prediction accuracy increases as the target discrepancy decreases; hence the discrepancy is also an indicator of whether the training is on the right track. We compare the target discrepancy changes of CALI and our baseline CA in Fig. 10, where the curves for the three UDA scenarios are presented from (a) to (c) and we only show the data before iteration 30k. It can be seen that before around iteration 2k, the target discrepancy of both CALI and CA decreases drastically, but thereafter the discrepancy of CA starts to increase. On the other hand, if we impose a DA constraint over the same CA (iteratively), leading to our proposed CALI, the target discrepancy keeps decreasing as expected. This validates that integrating DA and CA makes the training process of CA more stable, thus improving the target prediction accuracy. As mentioned in Algorithm 1, we have to use the adversarial training order max_{ψ_D} min_{φ_G}, instead of min_{φ_G} max_{ψ_D}. The reason is related to our network structure. Following the guidance of Eq. (12), we use the same input for the two classifiers and the domain discriminator; hence the discriminator has to receive the intermediate-level feature as input. If we use the order min_{φ_G} max_{ψ_D} in CALI, then the outputs of the discriminator behave as in Fig. 11(a): the domain discriminator of CALI quickly converges to the optimal state and can accurately discriminate whether a feature is from the source or the target domain. In this case, the adversarial loss for updating the feature extractor is near 0, so the whole training fails. This is validated by the changes of the target discrepancy curve, as shown in Fig. 11(b), where the discrepancy value decreases by a small amount in the first few iterations and then quickly increases to a high level, showing that the training diverges and the model collapses. This is also verified by the prediction results at (and after) around iteration 1k, as shown in Fig. 12, where the first row shows source images and the second row shows target images.

Navigation Missions
To further show the effectiveness of our proposed CALI model in real deployments, we build a navigation system by combining the proposed CALI segmentation model (trained with the RUGD→MESH setup) with our visual planner. We test the behavior of our navigation system in two different forest environments (MESH#1 in Fig. 13 and MESH#2 in Fig. 14), where it shows high reliability. In the navigation tasks, the image resolution is 400 × 300, and pure segmentation inference runs at around 33 frames per second (FPS). However, since a complete perception system requires several post-processing steps, such as navigability definition, noise filtering, Scaled Euclidean Distance Field computation, and motion primitive evaluation, the response time of the whole perception pipeline (in Python) is around 8 FPS without any engineering optimization. The segmentation inference for navigation is performed on an Nvidia Tesla T4 GPU. We set the linear velocity to 0.3 m/s and control the angular velocity to track the selected motion primitive. The path length is 32.26 m in Fig. 13 and 28.63 m in Fig. 14. Although the motion speed is slow in these navigation tasks, as a proof of concept with a very basic motion planner, the system behaves as expected, and we have validated that the proposed CALI model is able to accomplish navigation tasks in unstructured environments.

Conclusion and Future Work
We present CALI, a novel unsupervised domain adaptation model specifically designed for semantic segmentation, which requires fine-grained alignment at the level of class features. We carefully investigate the relationship between coarse alignment and fine alignment in theory. The theoretical analysis guides the design of the model structure, losses, and training process. We have validated that the coarse alignment can serve as a constraint on the fine alignment and that integrating the two alignments can boost UDA performance for segmentation. The resulting model shows significant advantages over baselines in various challenging UDA scenarios, e.g., sim2real and real2real.
To further improve this framework in the future, we note one problem observed with our model during deployment: segmentation boundaries can jump (vary rapidly) when the model makes high-frequency predictions on images from the video captured by the onboard camera. Our model predicts a segmentation for each frame independently and ignores inter-frame coherence. This sometimes causes non-negligible disturbance to smooth navigation, as our visual planner relies heavily on the segmentation boundaries to generate the SEDF for obstacle avoidance. To further increase the practicality of our model in real deployments, incorporating temporal correlation during training or inference is necessary to help maintain stable segmentation boundaries. In addition, we observe that the proposed ICALI model only has a significant performance gain when the label shift between the source domain and the target domain is small. However, we usually have a large label shift between the source and the target in real deployments, e.g., RUGD→RELLIS or RUGD→MESH. Addressing this issue and generalizing ICALI to cases with large label shifts is worth more investigation in the future.
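One lightweight way to inject temporal correlation at inference time, not used in the paper but illustrative of the direction suggested above, is to smooth the per-frame class probability maps with an exponential moving average before taking the argmax:

```python
import numpy as np

class TemporalSmoother:
    """Exponential moving average over per-frame class probability maps.

    A hypothetical post-processing step: the emitted probabilities are a
    decayed blend of past predictions, which damps frame-to-frame jumps
    in segmentation boundaries at the cost of some lag on fast motion.
    """

    def __init__(self, momentum: float = 0.8):
        self.momentum = momentum
        self.state = None  # running (H, W, C) probability map

    def update(self, probs: np.ndarray) -> np.ndarray:
        if self.state is None:
            self.state = probs.astype(float)
        else:
            self.state = self.momentum * self.state + (1.0 - self.momentum) * probs
        return self.state
```

The momentum value trades stability against responsiveness; a smarter variant would warp the previous state by estimated camera motion before blending.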

Remapping of Label Space
We regroup the original label classes according to the semantic similarities among classes. In GTA5 and Cityscapes, we cluster building, wall, and fence as the same category; traffic light, traffic sign, and pole as the same group; car, train, bicycle, motorcycle, bus, and truck as the same class; and person and rider as the same class. See Fig. 15. Similarly, we also regroup classes in RUGD and RELLIS, as shown in Fig. 16.

Fig. 2 :
Fig. 2: An example to explain the intuition of our CALI model. S_1/T_1: features of class #1 from source/target data; S_2/T_2: features of class #2 from source/target data. (a) We first have two classifiers F_1 and F_2 well-trained on the source domain; (b) the discrepancy between the two trained classifiers can be large on the target domain due to the large domain shift; (c) our CALI model applies domain alignment to reduce the general domain distance between the source domain and the target domain before performing the class alignment. In this case, the adversarial objective for class alignment, the discrepancy between the two classifiers, can be reduced to a proper level (see the shadow area change from (b) to (c)). With a well-regularized adversarial objective, the training of CA can be stabilized and the performance improved.
To further improve CALI, we develop a mixture-based, supervision-augmented version of CALI, Informed CALI (ICALI), which introduces an extra supervised learning process following the pseudo-trilateral training of CALI. The data for the extra learning is built by mixing informative regions from the source domain and the target domain. The informative regions are defined as image areas occupied by under-performing classes during training. The advantages of ICALI are twofold: (a) mixing the data from different regions leads to new data augmentations and further boosts the representation learning; (b) by mixing data informatively, the under-performing classes are exposed to the model more frequently, so that the performance of those classes is improved.
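The mixing stage described above can be sketched as a masked copy-paste; this is a simplified illustration of the idea (the full ICALI pipeline in Fig. 4 also mixes labels and feeds the result to Stage III training), and the under-performing class set here is a placeholder:

```python
import numpy as np

def informed_mix(src_img: np.ndarray, src_lbl: np.ndarray,
                 tgt_img: np.ndarray, underperforming) -> tuple:
    """Paste regions of under-performing source classes onto a target image.

    Pixels whose source ground-truth label belongs to an under-performing
    class (identifiable only in the source domain, where labels exist) are
    copied from the source image into the target image, exposing those
    classes to the model more often.
    """
    mask = np.isin(src_lbl, list(underperforming))  # (H, W) informative region
    mixed = tgt_img.copy()
    mixed[mask] = src_img[mask]
    return mixed, mask
```

The corresponding mixed label map would combine the source labels inside the mask with pseudo-labels for the target pixels outside it, which is where the label-shift sensitivity discussed in the RUGD→RELLIS results arises.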

Fig. 3 :
Fig. 3: Different game structures. (a) Bilateral game for DA, similar to GANs [Goodfellow et al., 2014]; (b) bilateral game for CA; (c) our proposed pseudo-trilateral game structure (PTGS). G is the feature extractor; D is the domain discriminator; Cs are a family of classifiers.

Fig. 4 :
Fig. 4: Framework of our proposed model. Stage I: the CALI model. f represents the shared feature map; p_1/o_1 and p_2/o_2 are the categorical probability/class label predictions; S/T represent domain labels: 1 (source) and 0 (target); L_1 represents the L_1 distance measure between two vectors. Stage II: mixing source data and target data. o_1s and o_1t are label predictions from C_θ1 for the source image and target image, respectively. Stage III: extra supervised training using the new mixed data. p_m/o_m are the categorical probability/class label predictions for the mixed data. Stages II & III make the whole framework ICALI. See Section 4.2 for more details.

Fig. 7 :
Fig. 7: Qualitative results on the adaptation GTA5→Cityscapes. Results of our proposed models are listed in the second-to-last (ICALI) and third-to-last (CALI) columns. GT represents the ground-truth labels.

Fig. 8 :
Fig. 8: Qualitative results on the adaptation RUGD→RELLIS. Results of our proposed models are listed in the second-to-last (ICALI) and third-to-last (CALI) columns. GT represents the ground-truth labels.

Fig. 9 :
Fig. 9: Qualitative results on the adaptation RUGD→MESH. Results of our proposed models are listed in the last (ICALI) and second-to-last (CALI) columns.

Fig. 13 :
Fig. 13: Navigation behaviors in the MESH#1 environment. The left-most column shows a top-down view of the environment; the purple triangle marks the starting point and the blue star the target point. We also show the segmentation (top row) and planning results (bottom row) at four different moments during the navigation, from the second column to the last.

Fig. 15 :
Fig. 15: Label remapping for GTA5→Cityscapes. The name of each new group is marked in bold.

Table 1 :
Quantitative comparison of different methods in the UDA of GTA5→Cityscapes. mIoU* represents the mean IoU over all classes.

Table 2 :
Quantitative comparison of different methods in the UDA of RUGD→RELLIS. mIoU* is the mean IoU over all classes.