Tripartite real-time semantic segmentation network with scene commonality

Abstract. Two-branch real-time semantic segmentation networks can quickly acquire low-level details and high-level semantics. However, the large contextual gap between the two branches adversely affects their fusion and limits further improvement of real-time segmentation accuracy. This paper proposes a tripartite real-time semantic segmentation network with scene commonality (TriSCNet) to address this problem. First, we add a parallel scene commonality branch to the current two-branch architecture to learn the intrinsic common features of similar street scene images, such as the spatial location distribution of various objects and the internal connections between them at the semantic level. Further, with the guidance of this commonality, we propose an external branch attention module to enrich and enhance the feature information of the traditional two branches. Finally, we utilize an alignment and selective fusion module to correct the misaligned context in the semantic branch and highlight the essential spatial information in the detailed branch. Our proposed TriSCNet achieves an excellent trade-off between accuracy and speed, yielding 77.9% mIoU at 67.2 FPS on the Cityscapes test set and 75.8% mIoU at 127.4 FPS on the CamVid test set, respectively.


1 Introduction
Semantic segmentation is a fundamental task in computer vision that assigns a semantic class label to each pixel according to the features contained in the image itself, finally presenting several distinct semantic regions in the image. As an important milestone in deep-learning-based semantic segmentation, fully convolutional networks (FCNs) 1 replace fully connected layers with fully convolutional layers, allowing the network to accept inputs of any resolution and output segmentation results of the same size. In recent years, with the maturity of deep learning technology, high-precision segmentation models based on FCN have developed rapidly, but for applications requiring real-time prediction, such as automatic driving, 2 video surveillance, and driving assistance, most models cannot be put into practical use because of their huge time complexity.
To meet the requirement of real-time performance, many low-latency and effective real-time segmentation networks have emerged in recent years, among which two-branch networks perform best. They mainly adopt the following two branching approaches. (1) Single-scale input. BiSeNet 3 added a spatial path on the input image to specifically capture spatial detail information, which was fused with the semantic information acquired by the context path to improve the performance of the model. Fast-SCNN 4 reduced the convolutional computation of the detailed branch by sharing shallow network features, which greatly promoted real-time performance. (2) Multi-scale input. ContextNet 5 encoded two scales of input to different depths to obtain global and local contexts, respectively. However, the two branch features obtained through these methods have the following problems: first, the spatial information obtained from the detailed branch lacks robust boundary features and is easily weakened by strong semantics; second, the contextual information obtained from the semantic branch generally loses the features of small objects, which affects the final pixel category prediction. Therefore, we aim to efficiently construct universal relationships between global and local information in the scene to guide the learning of the two branches in the network.
Recent research has shown that attention modules can enhance the representation ability of features by constructing channel and spatial correlations between them. For example, SANet 6 selectively weighted different channels using its own global information and incorporated global information to enhance the ability of pixel grouping. CBAM 7 utilized a channel attention module and a spatial attention module for adaptive feature optimization. In these modules, the mapping weights learned from the global and local context of the input feature maps themselves play a crucial role in localizing important information. However, low-level features lack semantic information, and high-level features often lack spatial details. Under the influence of such deficiencies, the learnable mapping information used to enhance features is often limited to the internal advantage information of the input feature maps, which cannot effectively help the current input construct its missing but crucial features and hampers learning in the subsequent layers.
Multi-scale feature fusion can embed spatial detail information into semantic features and significantly improves segmentation accuracy. SFNet 8 proposed the concept of semantic flow and designed a flow alignment module (FAM) to rapidly align the contextual regions between high-level and low-level features, which promoted the optimization of fusion between feature maps. However, when FAM obtains and uses the flow information, the resolutions of the high-level feature maps involved before and after are inconsistent, which decreases the effectiveness of feature alignment. In addition, for fast prediction, current networks usually adopt simple element-wise summation 9 or concatenation 3 to fuse different feature maps, but neither reasonably emphasizes crucial spatial information, which degrades the accuracy of object boundary localization during resolution recovery.
To solve the aforementioned problems and enhance the learning effectiveness of the two-branch architecture on street scenes, we propose a tripartite real-time semantic segmentation network with scene commonality (TriSCNet). We construct an additional branch and propose an attention module, which work together to help the original two branches learn and enrich feature information by establishing connections between details and semantics. Subsequently, we utilize a feature fusion module to separately process semantic and detailed features and fuse them, thereby promoting the balance between accuracy and real-time performance in the model. We compare the performance of our model with other state-of-the-art models on the Cityscapes 10 and CamVid 11 benchmarks to demonstrate its superiority. In addition, ablation studies and feature visualizations are provided to better illustrate the function of each method.
The main contributions of this article are as follows:

• We design a scene commonality branch (SCB), which supplements the original branch features with additional information, aiming to help the network better understand the panoramic layout of objects in complex street scenes.

• An external branch attention module (EBAM) with two inputs is proposed to reduce the semantic gaps between the two branch features. It uses two global contextual mappings, obtained respectively from the branch features and the common features, to guide the original branch feature maps to fully mine their own channel information and supplement the information they lack.

• We design the alignment and selective fusion module (ASFM) to fuse the enhanced two branch features using both feature alignment and boundary refinement, promoting our TriSCNet to achieve a better trade-off between accuracy and speed.
2 Related Work

2.1 High-Precision Segmentation Models
Early segmentation methods 12,13 utilized an encoder-decoder architecture to obtain and parse convolutional features of different scales, which influenced many subsequent networks. To reduce the loss of detailed information caused by downsampling, DeepLabV3+ 14 and PSPNet 15 kept the resolution of the feature maps unchanged and gradually expanded the receptive field to gather contextual information by pyramid pooling operations. RefineNet 16 and ExFuseNet 17 fused and supplemented information at different scales through multi-path methods that repeatedly exploited high-level and low-level features. Gated-SCNN 18 obtained fine boundary results under the supervision of a designed boundary loss function. Considering that convolutional neural networks focus more on local information exchange, and building on the development of non-local networks 19 and transformers, 20 networks such as Swin Transformer 21 and SETR 22 established interaction responses between local positions and the global context, greatly improving segmentation accuracy and opening a new path for the development of semantic segmentation. High-precision networks effectively improve their accuracy through various functional feature modules but slow down prediction. In contrast, PP-LiteSeg 23 designed three lightweight modules to achieve a superior trade-off between accuracy and speed. Therefore, designing simple and effective feature processing methods will greatly help improve the accuracy of real-time segmentation.

2.2 Two-Branch Real-Time Segmentation Models
The two-branch architecture has demonstrated its effectiveness in real-time segmentation networks and has promoted the rapid development of this field. BiSeNetV2 24 adopted a guided aggregation layer to enhance and fuse features from two branches. CABiNet 25 leveraged improved global and local attention to capture long-distance and local contextual dependencies. FBSNet 26 employed a symmetrical encoder-decoder structure with two branches to extract deep semantic information and preserve shallow boundary details. DCNet 27 contained two independent sub-networks, which respectively obtain a sufficient receptive field and capture the location dependencies of each pixel. These networks promote the development of the two-branch architecture and greatly boost accuracy, but they focus more on how to make the two branches better converge on their respective tasks and fuse their features, and seldom formulate methods to decrease the semantic gap between the two branches. In this regard, we add a third branch to help the traditional two branches and the whole network more effectively understand and learn both spatial and semantic information within streetscapes.

2.3 Attention Mechanism
In vision tasks, the attention mechanism effectively constructs correlations between feature information and enhances the representation ability of features. SENet 28 selectively measured correlations across channels in the feature maps using the squeeze-and-excitation (SE) module. DANet 29 utilized a self-attention mechanism to capture contextual dependencies in both the channel and spatial directions. SANet 6 designed the squeeze-and-attention (SA) module to solve the problem of pixel grouping and better guide pixel prediction. However, most current attention operations focus on a single input itself. In the feature extraction process, low-level features can capture local information but fail to understand the overall objects; conversely, high-level features can perform global modeling but easily ignore local objects. These inherent feature defects keep propagating along the network and weaken the ability of the attention mechanism to enhance the features. In this regard, we add the common features with the highest-order global scale to the attention mechanism to guide feature enhancement, aiming to obtain more abundant information.

2.4 Multi-Scale Feature Fusion
Multi-scale context information is of great significance in improving segmentation accuracy. LSPANet 30 utilized dual-branch decoding to fuse information extracted by the encoder at different stages. FPANet 31 designed a lightweight feature pyramid fusion module (FPFM) to fuse two different levels of features. DDRNet 32 leveraged the deep aggregation pyramid pooling module (DAPPM) to fully mine and mix multi-scale contextual information. These methods promote segmentation accuracy by integrating multi-scale features but ignore the positional deviation between feature context and local information. FaPN 33 utilized deformable convolution to align features at different scales and refine the boundary quality of segmentation. SFNet 8 designed a FAM, which promoted feature fusion optimization with slight overhead. In our proposed fusion module, we improve FAM and utilize fast pixel-level attention to selectively maintain reliable spatial detail information to help refine the semantic boundaries of objects.

3 Proposed Method

3.1 Overall Network Architecture
Semantic segmentation maps an RGB image X_0 ∈ R^(H×W×3) to a score map Y ∈ R^(H×W×C) containing C semantic categories. Figure 1 shows the detailed network architecture of TriSCNet.
Due to the outstanding performance of STDC2 34 in real-time semantic segmentation tasks, we utilize it as the encoder to construct both the detailed branch and the semantic branch for feature extraction. During encoder downsampling, we share the feature maps of the intermediate stages to avoid the excessive computational overhead caused by the detailed branch, and use the two branches to obtain detailed feature maps F_d ∈ R^(H/8 × W/8) and semantic feature maps F_s ∈ R^(H/32 × W/32), respectively. Based on the two original branches, we use the SCB as a parallel third branch to extract the common features g, which serve as one input of the designed EBAM to help the other branches enhance their features. Furthermore, we also add them to the semantic branch to enrich the context by providing multi-scale global receptive fields. In the end, we utilize the ASFM to fuse the two different feature maps obtained from the detailed branch and the semantic branch, and predict the final segmentation results after recovering the original resolution by upsampling.
During training, we add auxiliary loss functions to the detailed branch and the semantic branch, respectively, to strengthen their effectiveness in feature enhancement. We use the detail auxiliary loss 34 as the De-Head to supervise the detailed branch in learning boundary information efficiently, and utilize a cross-entropy segmentation head as the Se-Head to improve the segmentation performance of the semantic branch. In the implementation, each input feature is upsampled to the same resolution as the original input image, i.e., H × W, and C predicted categories are output, which are used to compute the difference between the predicted maps and the ground truth to optimize training and improve segmentation performance. Because these heads are only used in the training stage, they incur no additional time cost in testing.
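As an illustration, the following minimal PyTorch sketch shows what such a training-only segmentation head could look like; the layer widths, the 3 × 3 / 1 × 1 structure, and the ignore_index value are our assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    """Auxiliary segmentation head (e.g., the Se-Head): project features to
    C class scores and upsample to the input resolution for supervision."""

    def __init__(self, c_in, c_mid, num_classes):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, num_classes, 1),
        )

    def forward(self, feat, out_size):
        # Upsample the C-channel prediction to H x W before the loss
        return F.interpolate(self.conv(feat), size=out_size,
                             mode="bilinear", align_corners=False)

# Training-only supervision; the head is dropped at inference:
# logits = se_head(f_s, (H, W))
# loss_se = F.cross_entropy(logits, labels, ignore_index=255)
```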

3.2 Scene Commonality Branch
Due to the significant gap in contextual information between low-level details and high-level semantics in the traditional two-branch architecture, we design a parallel SCB to learn the inherent common features of similar street scene images, aiming to help the original two-branch network fully understand the scene structure.

Fig. 1 Overall framework. Our network consists of three branches: the detailed branch (blue line), the semantic branch (orange line), and the SCB (green line). EBAM is the proposed attention module, and ASFM is the proposed feature fusion module. De-Head and Se-Head are the additional supervised losses for details and semantics, respectively.
In street scene datasets, the spatial locations of the categories in a scene exhibit certain regularities and connections, and this inherent semantic correlation can effectively help the network analyze the information of streetscapes. As shown in Fig. 1, we use the SCB to obtain this semantic correlation, i.e., the common features g. Specifically, this branch first extracts global information at three different scales by pooling, namely at 1/8, 1/16, and 1/32 of the image resolution, and then concatenates them and utilizes a 1 × 1 convolution to aggregate and balance the crucial features in the detailed and semantic information. The final output g constructs the connections between objects of different positions and sizes in streetscapes and can guide the two original branches to narrow the differences between branch features. The calculation process of SCB can be expressed as

g = Conv(cat(GAP(F_1/8), GAP(F_1/16), GAP(F_1/32))),   (1)

where GAP(·) is global average pooling, cat(·) is concatenation, Conv(·) is the 1 × 1 convolution, and F_1/8, F_1/16, and F_1/32 are the shared encoder features at the three scales. The added third branch combines shallow and deep features to learn scene-specific features. We utilize this commonality as the highest-order auxiliary information to design the module in Sec. 3.3, and integrate it with the semantic branch to enhance the global understanding and analysis of images, helping the network obtain more meaningful features.
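A minimal PyTorch sketch of this branch, under the assumption that the three pooled descriptors come from the shared encoder stages and that the channel widths are free parameters, might look as follows.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneCommonalityBranch(nn.Module):
    """Sketch of the SCB: globally pool the 1/8, 1/16, and 1/32 features,
    concatenate, and aggregate with a 1 x 1 convolution (Eq. 1)."""

    def __init__(self, c8, c16, c32, c_out):
        super().__init__()
        self.proj = nn.Conv2d(c8 + c16 + c32, c_out, 1, bias=False)

    def forward(self, f8, f16, f32):
        # GAP collapses each scale to a global 1 x 1 descriptor
        pooled = [F.adaptive_avg_pool2d(f, 1) for f in (f8, f16, f32)]
        # cat + 1 x 1 conv aggregates and balances the pooled information
        return self.proj(torch.cat(pooled, dim=1))  # common features g
```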

3.3 External Branch Attention Module
Because semantics and details are unbalanced within the feature information, attention weights learned only from the input features themselves exacerbate this imbalance during feature reconstruction, thereby increasing the differences between branch features.
To this end, we design the EBAM shown in Fig. 2 to remedy this deficit by adding additional guidance information. EBAM has two inputs, where X_i are the branch feature maps from either the detailed branch or the semantic branch, and g are the common features obtained through the SCB. Specifically, we input X_i and g into this module and, based on their respective global information, derive the weight coefficients ω_1 and ω_2 using the Sigmoid function. Subsequently, we utilize ω_1 and ω_2 to respectively reweight the channels of the branch features X_i, and aggregate the two attention results from different levels by element-wise summation to obtain the optimized features X_out with more abundant information. The aforementioned steps can be written as

ω_1 = σ(GAP(X_i)),   (2)

ω_2 = σ(GAP(g)),   (3)

X_out = ω_1 · X_i + ω_2 · X_i,   (4)

where σ(·) represents the Sigmoid function.
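A possible PyTorch realization of Eqs. (2)-(4) is sketched below; the 1 × 1 projections that map each global descriptor to the channel dimension of X_i are our assumption, since the exact layers are not specified here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExternalBranchAttention(nn.Module):
    """Sketch of EBAM: channel reweighting of X_i guided both by its own
    global context (w1) and by the external common features g (w2)."""

    def __init__(self, c_x, c_g):
        super().__init__()
        self.fc_x = nn.Conv2d(c_x, c_x, 1, bias=False)  # assumed, for Eq. (2)
        self.fc_g = nn.Conv2d(c_g, c_x, 1, bias=False)  # assumed, for Eq. (3)

    def forward(self, x, g):
        w1 = torch.sigmoid(self.fc_x(F.adaptive_avg_pool2d(x, 1)))
        w2 = torch.sigmoid(self.fc_g(F.adaptive_avg_pool2d(g, 1)))
        # Two channel reweightings aggregated by element-wise summation
        return w1 * x + w2 * x  # Eq. (4)
```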
In short, by aggregating the connections among different categories from the common features g, our EBAM helps the detailed branch learn more spatial boundary features; meanwhile, the semantic branch can gradually learn the semantic information of small objects. We demonstrate the effectiveness of this module through the ablation experiments in Sec. 4.3.1 and the visualized results in Fig. 4.

3.4 Alignment and Selective Fusion Module
There are two problems when fusing high-level and low-level feature maps. One is that repeated downsampling and upsampling causes a certain degree of semantic misalignment between high-level and low-level features at corresponding feature positions, as shown with the black dashed line in Fig. 3. The other is that the contours of the semantic features tend to be coarse and require refinement using boundary information from the low-level features. To address these two problems, we propose the ASFM shown in Fig. 3. First, we employ and modify the dynamic and learnable interpolation method from FAM 8 to align the context in the high-level feature maps, i.e., the Align operation in ASFM. Specifically, we combine the low-level feature maps F_l with the high-level feature maps F_h, which have the same resolution and channels after upsampling, to calculate the learnable offset Δ_h through convolution, and add Δ_h to the default sampling locations P_h to obtain the new sampling points P̂_h. We resample F_h, keeping its resolution unchanged before and after the operation, to obtain the aligned semantic feature maps F̂_h at the same resolution. The steps above can be written as

Δ_h = Conv(cat(F_l, F_h)),   (5)

P̂_h = P_h + Δ_h,   (6)

F̂_h = Sample(F_h, P̂_h),   (7)

where Sample(·) is the interpolation function. Second, we use two obtained weights to selectively introduce reliable detailed features into the semantic features to refine the segmentation boundaries. Specifically, we construct the semantic correlations between F_l and F̂_h at corresponding pixels by pixel-level attention composed of dot-product operations and two activation functions. We then utilize the two obtained correlation weights S and R to select the desired spatial information. Since the weight S is biased toward semantic consistency and cannot effectively capture rich spatial information from the detailed features, we use (1 − S), which is more relevant to the details, to achieve this task. Moreover, some of the spatial information in the detailed features is accurate but relatively fragile compared with the high-level features, so we multiply R with the high-level features to preserve more robust boundary information. Finally, we combine these two selected details with the semantic features and obtain the fused feature maps F_out, which refine and strengthen the reconstruction of boundaries. The whole process can be calculated as

S = σ(F_l · F̂_h),   (8)

R = φ(F_l · F̂_h),   (9)

F_out = F̂_h + (1 − S) · F_l + R · F̂_h,   (10)

where φ(·) represents the ReLU function and · is the dot product. In summary, our ASFM uses the alignment method to solve the semantic misalignment between different features and uses element-wise multiplication and two nonlinear mapping functions to emphasize the necessary spatial information in the massive detailed features. We prove its effectiveness through the ablation experiments in Sec. 4.3.2 and the visualized results in Fig. 4.
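To make the Align and selection steps concrete, here is a rough PyTorch sketch; the offset-prediction layer, the per-pixel channel dot product, and the exact composition in Eq. (10) follow our reading of the description above and are not a verified reimplementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_warp(feat, flow):
    """Resample feat at the offset sampling points (Eqs. 6-7)."""
    n, _, h, w = feat.shape
    # Default sampling locations P_h as a normalized grid in [-1, 1]
    ys = torch.linspace(-1.0, 1.0, h, device=feat.device).view(h, 1).expand(h, w)
    xs = torch.linspace(-1.0, 1.0, w, device=feat.device).view(1, w).expand(h, w)
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(n, -1, -1, -1)
    # Convert the pixel-space offset Delta_h to normalized coordinates
    norm = torch.tensor([(w - 1) / 2.0, (h - 1) / 2.0], device=feat.device)
    return F.grid_sample(feat, grid + flow.permute(0, 2, 3, 1) / norm,
                         mode="bilinear", align_corners=True)

class ASFM(nn.Module):
    """Sketch of align-then-select fusion following Eqs. (5)-(10)."""

    def __init__(self, c):
        super().__init__()
        self.offset = nn.Conv2d(2 * c, 2, 3, padding=1)  # predicts Delta_h

    def forward(self, f_l, f_h):
        # Upsample F_h to the resolution of F_l before predicting the offset
        f_h = F.interpolate(f_h, size=f_l.shape[-2:], mode="bilinear",
                            align_corners=False)
        delta = self.offset(torch.cat([f_l, f_h], dim=1))  # Eq. (5)
        f_h_hat = flow_warp(f_h, delta)                    # Eqs. (6)-(7)
        # Pixel-level dot product over channels, then two activations
        corr = (f_l * f_h_hat).sum(dim=1, keepdim=True)
        s = torch.sigmoid(corr)                            # Eq. (8)
        r = F.relu(corr)                                   # Eq. (9)
        # (1 - S) selects details; R keeps robust high-level boundaries
        return f_h_hat + (1.0 - s) * f_l + r * f_h_hat     # Eq. (10)
```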

4 Experiments
In this section, we first introduce the Cityscapes and CamVid datasets and the details of training and inference. Then we investigate the effect of each module in ablation studies. Finally, we compare our performance with other state-of-the-art methods and demonstrate the effectiveness of our model on the benchmark datasets.

4.1 Datasets
Cityscapes 10 is a semantic understanding dataset of urban street scenes captured while driving, including 5000 finely annotated images and 20,000 coarsely annotated images. In our experiments, we only use the finely annotated images to verify the effectiveness of our proposed method. The dataset is divided into training, validation, and test sets with 2975, 500, and 1525 images, respectively. The annotations include 30 categories, 19 of which are used for semantic segmentation tasks. The images have a high resolution of 2048 × 1024, which is challenging for real-time semantic segmentation.
CamVid 11 contains 701 street scene images with a resolution of 960 × 720. These images are extracted from captured road scene videos and are divided into 367 for training, 101 for validation, and 233 for testing. We conduct segmentation experiments using 11 commonly used categories out of the 32 provided semantic categories.

4.2 Implementation Details

4.2.1 Training
Our experiments are performed with PyTorch 1.8 on an RTX 3090. We adopt a training strategy similar to other advanced real-time segmentation networks 3,24,34 for a fairer performance comparison. In detail, we use mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 5 × 10⁻⁴. We adopt the poly learning rate strategy and update the learning rate as lr = lr_start × (1 − iter/max_iter)^power, with an initial learning rate of 1 × 10⁻² and a power of 0.9. We set the crop resolution to 1024 × 512 and 960 × 720, the batch size to 32 and 24, and the number of iterations to 50K and 20K, and adopt a warmup strategy in the first 1000 and 300 iterations to help the model gradually stabilize, on the Cityscapes and CamVid datasets, respectively. In addition, we employ data augmentation on the images, including random cropping, random horizontal flipping, and random scaling in the range [0.25, 2]. With this configuration, the training time of our model on Cityscapes and CamVid is 13.8 and 5.5 h, respectively, which is basically equal to the training time of STDC-Seg 34 that also uses STDC2 as the backbone.
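For reference, the schedule described above can be sketched in a few lines of Python; the linear form of the warmup is our assumption, since only its length is specified.

```python
def poly_lr(base_lr, it, max_iter, power=0.9, warmup_iters=1000):
    """Poly learning rate: lr = base_lr * (1 - it/max_iter)^power."""
    if it < warmup_iters:
        # Assumed linear warmup over the first iterations
        return base_lr * (it + 1) / warmup_iters
    return base_lr * (1.0 - it / max_iter) ** power

# Per-iteration usage with an SGD optimizer:
# for group in optimizer.param_groups:
#     group["lr"] = poly_lr(1e-2, it, max_iter=50_000)
```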

4.2.2 Inference
To compare intuitively with current advanced models, we set the batch size of the test data to 1. We use the test set to evaluate the effectiveness of the trained model and calculate the inference speed with TensorRT 8.2.0.6 on a single Nvidia GTX 1080Ti. The input image size is set to 1024 × 512 and 1536 × 768 for Cityscapes, and 960 × 720 for CamVid.
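As a rough point of comparison, FPS can be estimated on the PyTorch side as below; note this is only a sketch, and the speeds reported in this paper come from TensorRT-accelerated inference, which this snippet does not reproduce.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, size=(1, 3, 512, 1024), n_warmup=50, n_iter=200):
    """Estimate inference FPS for a model on random input of a given size."""
    model.eval().cuda()
    x = torch.randn(size, device="cuda")
    for _ in range(n_warmup):   # warm up CUDA kernels and caches
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(n_iter):
        model(x)
    torch.cuda.synchronize()    # wait for all queued GPU work
    return n_iter / (time.time() - start)
```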

4.3 Ablation Studies

4.3.1 EBAM for Two Branches
We perform ablation experiments on Cityscapes to reflect the guidance effect of common features on relatively low-level features and to demonstrate the effectiveness of EBAM. By using the two different inputs of EBAM, we compare the effect of different parts of the module on network performance. As can be seen in Table 1, compared with EBAM2, EBAM3 uses the added common features to obtain extra attentional mapping information, which guides the input feature maps to capture more abundant information and improves the accuracy of the model by 1.1% mIoU. Moreover, the mIoU of EBAM3 is boosted by 2.3% over EBAM1 and achieves 74.2%, which proves the effectiveness of our designed EBAM.

4.3.2 ASFM
With the addition of EBAM, we further explore the effectiveness of ASFM in fusing the two branch feature maps. In Table 2(a), we obtain 75.1% mIoU using the original FAM and take it as the baseline for the ablation experiments in this section. We study the contributions of the various parts of our ASFM to feature fusion. As shown in Table 2(b), using our modified alignment operation based on FAM boosts mIoU by 0.2% over the baseline, demonstrating its effectiveness. In (c) and (d), the model gains a further 0.3% and 0.4% mIoU, respectively, by adding Detail S and Detail R. In (e), the accuracy improves further when using the complete ASFM, with a gain of 0.8% mIoU compared with (a). These two detail operations effectively embed refined boundary information into the strong semantic features. We add a modest amount of computation for feature fusion and ultimately obtain 75.9% mIoU at 174.6 FPS. Clearly, our fusion module further improves model performance while ensuring fast prediction.

4.3.3 Visualization of Our Ideas
In Fig. 4, we demonstrate the effectiveness of our ideas by visualizing feature maps. First, we show the roles of the common features g in EBAM for the detailed and semantic branches, respectively, and mark representative regions with boxes to provide intuitive comparisons. In the last two images of the first row, compared with the EBAM using only X_i as input, the boundaries (black curves) of the objects in the blue boxes become clearer and more robust when the input g is added. In the first two images of the second row, the semantic representations of the small objects and the distinction between the regions of each category are further enhanced in the red boxes. Both results indicate that the commonality obtained from the SCB plays a strong guiding role in enriching the two branch features, further demonstrating the effectiveness of EBAM. Second, the last two images in the second row show the effect of feature fusion before and after using ASFM. Our fusion module helps the target boundaries and regions in the scene become clearer and smoother, and the individual semantic regions become easier to distinguish.

4.4 Comparison with Existing Models
In recent years, real-time segmentation networks have received increasing attention and have continually improved the trade-off between speed and accuracy. In this subsection, we use a GTX 1080Ti to test the trained model on Cityscapes and CamVid, and compare our method with other state-of-the-art methods.

4.4.1 Cityscapes
We train our proposed model on Cityscapes and evaluate its segmentation accuracy and inference speed (TensorRT is applied to accelerate inference), comparing it with other models; the bold results reflect the comparative advantages of our model. As shown in Table 3, our TriSCNet combines high speed with high accuracy, achieving 75.5% mIoU and 174.6 FPS at 1024 × 512 resolution, which is faster than previous models at the same resolution. When the prediction resolution is 1536 × 768, TriSCNet achieves 77.9% mIoU at 67.2 FPS, a higher accuracy than the other models. At this resolution, compared with STDC2-Seg75 and PP-LiteSeg-B2, which use the same backbone network, the accuracy increases by 1.1% and 0.4%, respectively, while adding only a small number of parameters. In Fig. 5, we visualize the final prediction results of TriSCNet on the Cityscapes dataset and overlay them on the original images. It can be seen that the predefined semantic objects in the images are segmented accurately and finely.

4.4.2 CamVid
To further illustrate the effectiveness of our designed model, we evaluate its performance on CamVid while maintaining the same image inference resolution (i.e., 960 × 720) as other models. As shown in Table 4, the model achieves superior performance with 75.8% mIoU and 127.4 FPS. The advantage of the model in prediction precision is presented in bold; it surpasses the accuracy of STDC2-Seg75 and PP-LiteSeg-B by 1.9% and 0.8% mIoU, respectively. The visualized segmentation results on this dataset are shown in Fig. 6. Our model produces accurate segmentation results for most objects but is less effective in segmenting some narrow and small objects.

5 Conclusion
We construct a three-branch network, TriSCNet, to address the problem of significant contextual differences between branches in the traditional two-branch architecture. We use the designed SCB to learn the spatial and semantic connections between different categories in panoramic information, strengthening the ability of the network to understand complex scenes. The proposed EBAM, based on the additional branch, effectively establishes correlation and interaction between different levels of information and reduces the semantic differences between branches. The ASFM promotes the alignment between the two branch features and effectively preserves crucial details from massive spatial information, which leads to further performance improvement. Ablation experiments and visualizations on the Cityscapes and CamVid datasets demonstrate that our proposed three-branch network achieves an excellent balance of speed and accuracy.
Our proposed model also has certain limitations. First, affected by factors such as shooting distance and light intensity, our model produces some erroneous segmentation results on the test set. Second, our model is not optimal in terms of memory footprint compared with some state-of-the-art models, which may limit its application in extremely demanding environments. In the future, we will explore lighter and more effective backbones and establish long-distance interactions between multiple categories to achieve a better balance between speed and accuracy.


Fig. 3 Alignment and selective fusion module. Δ_h is the flow field in FAM. Align is the alignment operation, which outputs the aligned high-level features. R and (1 − S) are the correlations between pixels at corresponding positions in the two branch features, obtained by different mapping functions.

Fig. 4 Feature visualizations for the EBAM and the ASFM. In the first two rows, the first row from left to right shows: the input image, the ground truth, and two results on the detailed branch (without/with g in EBAM); the second row from left to right shows: two results on the semantic branch (without/with g in EBAM) and two results after feature fusion (without/with ASFM). The last two rows follow the same layout as the first two.

Fig. 5 The visualized segmentation results of TriSCNet on Cityscapes. (a)-(d) The input images, the ground truth, the predictions, and the composites of the input images and predictions.

Fig. 6 The visualized segmentation results of TriSCNet on CamVid. The black areas in the figure are categories ignored during training. (a)-(c) The input images, the ground truth, and the predictions.

Table 1 The ablation experiment on each input of the EBAM. X_i and g denote the two inputs of the module, which are the original branch features and the common features, respectively. EBAM1 indicates that EBAM is not used; EBAM2 indicates that only the input X_i is used; EBAM3 is the complete EBAM.

Table 2 The effects of each part of ASFM and a comparison between ASFM and FAM. Align is the alignment operation; Detail S is the operation that preserves the spatial information of the detailed features via the Sigmoid function; Detail R is the operation that enhances the boundary semantics of the detailed features via the ReLU function.

Table 3 Performance comparisons on Cityscapes between TriSCNet (ours) and other state-of-the-art models.

Table 4 Performance comparisons on CamVid between TriSCNet (ours) and other state-of-the-art models.