ContourNet: Research on Contour Based Nighttime Semantic Segmentation

Due to the scarcity of nighttime semantic segmentation datasets and the high demands placed on network models, semantic segmentation of nighttime scenes has developed slowly. This paper proposes a new network model, ContourNet, which models features at multiple levels. In addition, a separate contour network module is designed to predict object contours accurately, improving performance on objects that are distant, small, or have highly continuous contours. Extensive experiments demonstrate that the proposed ContourNet significantly improves the nighttime semantic segmentation ability of existing models, also improves daytime segmentation accuracy to a certain extent, and generalizes well. Specifically, after adding the proposed contour module, MIoU increases by 5.1% on the nighttime dataset Rebecca and by 2.5% on the daytime dataset CamVid.


Introduction
Like the human eye, computer vision is one of the bridges between computers and the outside world, and semantic segmentation is an important task within it. FCN (Fully Convolutional Networks) [1] is the pioneering work of semantic segmentation, and since FCN, deep-learning-based semantic segmentation methods have emerged in an endless stream. However, due to the scarcity of nighttime datasets and the high demands placed on network models, semantic segmentation of night scenes has always developed slowly. Nevertheless, roughly half of every day is night, with uneven illumination, and successful night scene segmentation is crucial for influential applications such as autonomous driving [2,3] and robot vision [4].
Many network models perform well on daytime datasets but struggle with nighttime datasets, highlighting the need for dedicated nighttime semantic segmentation. Comparing daytime and nighttime images, it becomes apparent that color contrast is less distinct at night and surface textures are less clear. Objects are often obscured by darkness and dim yellow lights, leaving only contour information relatively intact. Therefore, extracting contour information together with other information in a single network may not be ideal. Based on this insight, we design a contour network parallel to the semantic network, which focuses the network's attention on contours to improve the segmentation of continuous and small objects. To fully integrate contextual information, we employ multiple decoders to output shallow, intermediate, and deep features during the decoding phase, balancing global and local information. Additionally, we employ a multi-task learning strategy, adding contour labels to supervise the contour network alongside the semantic-label supervision of the semantic network. Finally, to obtain clearer local information, we apply the Sobel operator to the original input image and concatenate the result, along the channel dimension, with the output of the contour network.
The main contributions of this paper are as follows:
1. To balance local and global information, we propose a novel semantic network with multiple codec structures that can extract features at various levels simultaneously. Furthermore, we apply Sobel edge detection to the initial input image to extract clearer regional information, resulting in more precise semantic segmentation.
2. To address the partial information loss caused by uneven illumination, we focus on contour information, which typically retains more information at night, and design a specialized network dedicated to contour feature extraction.

Semantic Segmentation
The development of artificial intelligence has driven progress in computer vision, and semantic segmentation is one of its key tasks. More and more application scenarios require extracting relevant semantic categories and knowledge from images while establishing contact with the outside world, making this extraction process central to computer vision. FCN [1] is a pioneering work in the field: after effective inference and learning, it transforms an input image of any size into a semantic interpretation of the same size, trained end-to-end and pixel-to-pixel. SegNet [5] adopts a symmetric encoder-decoder network and keeps the pooling indices during encoding so that image resolution can be better restored during decoding. Peng, D., et al. [6] proposed state-of-the-art plug-and-play encoder and decoder modules, namely Semantic-Aware Normalization (SAN) and Semantic-Aware Whitening (SAW), which greatly improve the generalization ability of a model: facing test data whose distribution is inconsistent with the training data, SAN and SAW still help the model maintain its performance as much as possible. Unlike methods that enhance feature representation by changing the internal structure of the network, for example by integrating adaptive aggregation descriptor [7] or deep click [8] features, we use multiple outputs to fuse different kinds of information. The current mainstream segmentation networks adopt a structure with multiple encoder inputs and a single decoder output; on the contrary, we adopt a single encoder as input and multiple decoders as output. This structure better extracts features of different depths and thus better combines contextual information.

Fully Supervised Edge Detection
Existing edge detection methods can be divided into three categories: traditional edge detection algorithms, learning-based methods, and deep-learning-based methods.
Traditional edge detection algorithms first extract the image gradient and then generate edges by threshold segmentation [9]. Because they are simple and computationally cheap, they are widely used in image painting and similar software. Learning-based methods perform shallow feature extraction on images and train corresponding detectors to generate object-level contours, for example via gradient descent [10]. They perform better than traditional edge detection but have limited generalization ability.
In recent years, many semi-supervised [11] and fully supervised deep learning methods have been proposed. Contour detection is originally a binary classification problem, but DeepContour [12] transforms it into a multi-class problem and uses different parameters for different categories. HED (Holistically-Nested Edge Detection) [13] was the first to introduce end-to-end edge detection, adopting a fully convolutional neural network with deep supervision. Inspired by Xception, DexiNed [14] conceived a network model that can be used for edge detection without pre-training or fine-tuning.

Multi-task Learning
Multi-task learning jointly trains multiple related tasks, aiming to improve the generalization and robustness of each task through information transfer and integration among them; it naturally aligns with human cognitive learning mechanisms. Teichmann, M., et al. [15] proposed a joint classification, detection, and semantic segmentation approach that uses a unified architecture in which the encoder is shared among the three tasks. Cheng, D., et al. [16] designed a multi-task model that trains a segmentation network and an edge detection network simultaneously to detect hierarchical semantic features. Rather than unifying multiple tasks through shared losses, our goal is not to train a general multi-task network but to use boundary prediction to assist semantic segmentation. Unlike the works above, our semantic and boundary information are fused at the end, and we use the higher-level information contained in the semantics to suppress noise in the lower-level boundary information.

Vision Tasks in The Dark
Image analysis in the dark suffers from overexposure and insufficient illumination. For low-light images, LIME [17] selects the maximum value of each pixel channel to estimate the illumination map, refines it with structure priors, and finally synthesizes the enhanced image according to Retinex theory. URetinex-Net [18] formulates image enhancement as a learnable network that decomposes a low-light image into a reflection layer and an illumination layer; the model consists of three learning-based modules responsible for data initialization, efficient optimization deployment, and specified illumination enhancement, respectively. SCI (Self-Calibrated Illumination) [19] builds a weight-sharing illumination learning process with self-calibration modules that use simple operations for enhancement, abandoning complex network design. The main idea of these networks is to enhance dark images so that they look as if captured during the day. Unlike approaches that change the image itself, we focus on contours, which remain relatively salient in the dark, and establish a connection between contours and semantic segmentation to improve segmentation quality.

The Main Architecture of ContourNet
In this section, we first provide an overview of ContourNet, including the backbone network, the semantic network, the contour network, and the Sobel edge operator module. Then the internal structure and collaboration of each module are introduced in detail.
As a whole, ContourNet adopts a multi-head codec structure. First, the input image is processed by the encoder to obtain four shallow feature maps $f_0, f_1, f_2, f_3$ with rich local information, and deep feature maps $F_2, F_3, F_4$. Next, the four shallow feature maps enter the contour network, which extracts their contours, while the three deeper feature maps enter the semantic network, which extracts semantic information. Finally, the contour and semantic information are fed into the contour module for fusion; the semantic information carries higher-level features and has a denoising effect on the boundary information. The overall structure of ContourNet is shown in Fig. 1.

Semantic Network
Our network ContourNet uses ResNet [20] as the backbone to encode the input image and obtain rich semantic information. Building on ResNet, we add two additional outputs to obtain the shallow feature $F_2$ with rich local information and the middle feature $F_3$ with rich semantic information. Denote the input image by $I \in \mathbb{R}^{3 \times H \times W}$; after passing through the ResNet backbone, we obtain the feature maps $F_2$, $F_3$, and $F_4$. As a rule of thumb, $F_2$ is a shallow feature, $F_4$ is a deep feature, and $F_3$ is intermediate between the two, so $F_2$, $F_3$, and $F_4$ are feature maps of different depths. After the encoder, $F_4$ is upsampled to a feature map of the same size as $F_3$, and the two are fused; proceeding in the same way, we finally obtain a feature map combining shallow, deep, and intermediate information, which contains richer contextual information and high-level semantics and serves as the output of the whole semantic network. The semantic network is enclosed by the pink and blue dashed boxes in Fig. 1.
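For concreteness, a minimal PyTorch sketch of the multi-level decoding described above, assuming a ResNet-50 backbone with 1×1 projections and bilinear upsampling; the channel widths and projection layers are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class SemanticNetSketch(nn.Module):
    """Sketch of the multi-level decoder: F4 is upsampled and fused with F3,
    the result with F2, mixing deep semantics with shallow detail. Channel
    sizes follow ResNet-50; the 1x1 projections are our own illustrative choice."""
    def __init__(self, num_classes=12):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.stem = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool)
        self.layer1, self.layer2 = resnet.layer1, resnet.layer2
        self.layer3, self.layer4 = resnet.layer3, resnet.layer4
        # 1x1 projections so feature maps of different depths can be summed
        self.proj2 = nn.Conv2d(512, 256, 1)   # F2 (layer2 output)
        self.proj3 = nn.Conv2d(1024, 256, 1)  # F3 (layer3 output)
        self.proj4 = nn.Conv2d(2048, 256, 1)  # F4 (layer4 output)
        self.classifier = nn.Conv2d(256, num_classes, 1)

    def forward(self, x):
        x = self.stem(x)
        f1 = self.layer1(x)
        F2 = self.layer2(f1)   # shallow: rich local detail
        F3 = self.layer3(F2)   # intermediate
        F4 = self.layer4(F3)   # deep: rich semantics
        # upsample F4 to F3's size and fuse, then fuse the result with F2
        d = self.proj4(F.interpolate(F4, size=F3.shape[-2:], mode='bilinear',
                                     align_corners=False)) + self.proj3(F3)
        d = F.interpolate(d, size=F2.shape[-2:], mode='bilinear',
                          align_corners=False) + self.proj2(F2)
        return self.classifier(d)
```

The shallow maps $f_0, f_1, f_2, f_3$ that feed the contour network are omitted here for brevity; only the semantic branch is sketched.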

Contour Network
The contour network consists of three contour extraction modules and one Sobel operator module. The image gradient $\nabla I$ and the outputs of the first and second convolution layers of the semantic network are used as inputs to the first contour module, which outputs boundary information. Compared with the contour information, the semantic information has undergone further feature extraction and contains higher-level features, so it can guide contour extraction. For the contour label, we obtain a binary edge map by applying traditional image processing to the semantic label. During training, a cross-entropy loss supervises the contour network to ensure that it extracts richer contour information. The contour network is enclosed by the yellow dashed boxes in Fig. 1.
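Since the contour labels are derived from the semantic labels by traditional image processing, one minimal way to generate them, assuming (as a guess at the unspecified procedure) a simple neighbor-difference test, is:

```python
import numpy as np

def contour_label(semantic_label: np.ndarray) -> np.ndarray:
    """Mark a pixel as contour (1) if any 4-neighbor has a different class id."""
    lbl = semantic_label.astype(np.int32)
    edge = np.zeros_like(lbl, dtype=np.uint8)
    edge[:-1, :] |= (lbl[:-1, :] != lbl[1:, :]).astype(np.uint8)   # down
    edge[1:, :]  |= (lbl[1:, :] != lbl[:-1, :]).astype(np.uint8)   # up
    edge[:, :-1] |= (lbl[:, :-1] != lbl[:, 1:]).astype(np.uint8)   # right
    edge[:, 1:]  |= (lbl[:, 1:] != lbl[:, :-1]).astype(np.uint8)   # left
    return edge
```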

Contour Module
Currently popular semantic segmentation methods generally process color, shape, and texture information in a single deep CNN. However, this can be suboptimal, because it extracts some unimportant information and increases redundancy. In addition, compared with clear daytime vision, the color and texture information of a night image is significantly reduced, as if hidden in the dark with only the outer contour exposed. Therefore, we design a contour network in parallel with the semantic network that processes only contour-related information. Moreover, the contour network is not directly fused with the output of the semantic network; instead, the deeper semantic information helps the contour module focus only on the required parts. This gives the contour network rich contextual information and high resolution at the same time, allowing it to adopt an efficient shallow architecture. The details of the contour module are shown in Fig. 2.
In ContourNet, the contour module is used multiple times. Let $s$ and $e$ denote the intermediate-layer outputs of the semantic network and the contour network, respectively; see Eq. (1).
First, we perform further feature extraction on the input semantic information $s$ to obtain higher-level information, and concatenate it with the contour information:

$$s_1 = C^{R}_{1\times1}(s), \qquad e_1 = e \,\Vert\, s_1 \qquad (1)$$

In Eq. (1), $C^{R}_{1\times1}$ denotes a $1\times1$ convolution with a ReLU activation function, while $\Vert$ denotes the channel-wise concatenation of the resulting output $s_1$ with the intermediate-layer output $e$ obtained through the contour network.
The high-level semantic information guides the contour module and reduces latent noise in contour extraction.
Then, feature extraction is performed on the contour information, where the convolution operation captures more information and the sigmoid accelerates network convergence during training:

$$e_2 = C^{S}_{1\times1}(e_1) \qquad (2)$$

In Eq. (2), $C^{S}_{1\times1}$ denotes a $1\times1$ convolution with a Sigmoid activation function, which aims to prevent the flow of contour information into the semantic information.
Then, the semantic information $s$ and the gated contour information $e_2$ are multiplied element-wise to obtain the filtered features:

$$m = s \otimes e_2 \qquad (3)$$

where $\otimes$ denotes element-wise multiplication. To extract the necessary information, we perform an element-wise addition of the initial semantic information, the intermediate-layer semantic information, and the final contour information, and then apply kernel-based channel-wise weighting (a $1\times1$ convolution) to obtain more refined features:

$$F = C_{1\times1}\,(s \oplus s_1 \oplus m) \qquad (4)$$

In this context, $\oplus$ denotes element-wise addition, which fuses information from multiple sources without increasing the number of channels. This operation has low computational complexity, making it an efficient way to integrate information from different parts.

Fig. 2 The details of the Contour module. The two gray dashed boxes represent two convolution operations that use different activation functions
After this processing, $F$ is passed to the next contour module for further processing. It is worth noting that all the formulas involved are differentiable, so end-to-end back-propagation can be carried out. The final output feature map is superimposed with multi-layer weights and focuses on boundary information, which makes it equivalent to an attention module over boundary information. In our experiments, a total of four contour modules are used, placed on the first, second, and third layers and the final fusion module of the semantic network. If feature-map sizes do not match during concatenation, we resample the feature maps by bilinear interpolation until their sizes are consistent, and then concatenate.
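A rough PyTorch sketch of one contour module following Eqs. (1)-(4) as given above; the common channel width `ch`, and the assumption that both streams have already been projected to it, are ours rather than the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContourModule(nn.Module):
    """One Contour module: 1x1 conv + ReLU on the semantic stream,
    concatenation with the contour stream, 1x1 conv + Sigmoid gate,
    element-wise product with the semantics, and a final 1x1 conv
    over the element-wise sum ("kernel-based channel-wise weighting")."""
    def __init__(self, ch):
        super().__init__()
        self.conv_relu = nn.Sequential(nn.Conv2d(ch, ch, 1), nn.ReLU(inplace=True))
        self.conv_sig = nn.Sequential(nn.Conv2d(2 * ch, ch, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(ch, ch, 1)

    def forward(self, s, e):
        if e.shape[-2:] != s.shape[-2:]:  # bilinear resize on size mismatch
            e = F.interpolate(e, size=s.shape[-2:], mode='bilinear',
                              align_corners=False)
        s1 = self.conv_relu(s)            # Eq. (1): refine semantics
        e1 = torch.cat([e, s1], dim=1)    # Eq. (1): concatenate streams
        e2 = self.conv_sig(e1)            # Eq. (2): sigmoid gate
        m = s * e2                        # Eq. (3): element-wise filtering
        return self.fuse(s + s1 + m)      # Eq. (4): sum, then channel reweighting
```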

Joint Multi-task Learning
We take an end-to-end approach to training the contour and semantic networks, supervising both semantic segmentation and contour map prediction. In particular, a contour map can be viewed as a semantic map with only two categories, so standard cross-entropy loss functions are used for both contour and semantic prediction. In this paper, the loss of the contour part is denoted $L_{CCE}$ and the loss of the semantic part $L_{SCE}$, and the loss of the entire network is:

$$L = \lambda_1 L_{CCE} + \lambda_2 L_{SCE} \qquad (5)$$

Eq. (5) involves two types of cross-entropy loss: $L_{CCE}$ for binary classification and $L_{SCE}$ for multi-class classification. A single cross-entropy loss can be expressed as $H(p, q) = \sum_{i=1}^{n} p(x_i) \log \frac{1}{q(x_i)}$, where $p$ represents the label and $q$ the prediction; its value is smaller when the label and prediction are more similar.
Here $\lambda_1$ and $\lambda_2$ are two hyperparameters that control the weights of the contour-flow loss and the semantic-flow loss in the total loss, and $\hat{e} \in \mathbb{R}^{H \times W}$ and $\hat{s} \in \mathbb{R}^{H \times W}$ represent the boundary labels and semantic labels, respectively. Both before and after the fusion of the contour map and the semantic map, and after the Sobel operator module, a standard cross-entropy loss is used for supervised learning.
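A minimal sketch of the joint loss in Eq. (5), with PyTorch's built-in cross-entropy functions standing in for $L_{CCE}$ and $L_{SCE}$; the default $\lambda$ values are placeholders:

```python
import torch
import torch.nn.functional as F

def total_loss(sem_logits, contour_logits, sem_label, contour_label,
               lam1=1.0, lam2=1.0):
    """Eq. (5): weighted sum of the binary contour loss L_CCE and the
    multi-class semantic loss L_SCE. The lambda values are placeholders."""
    l_cce = F.binary_cross_entropy_with_logits(
        contour_logits.squeeze(1), contour_label.float())  # (B, H, W) edges
    l_sce = F.cross_entropy(sem_logits, sem_label.long())  # (B, C, H, W) classes
    return lam1 * l_cce + lam2 * l_sce
```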

Sobel Edge Detector
The Sobel operator is a discrete differentiation operator for edge detection based on first-order image gradients. Compared with the Canny operator, it requires less computation. In the Sobel operator, the weights are inversely proportional to the distance from the center of the convolution kernel. The differences of image $A$ in the $x$ and $y$ directions can therefore be approximated as:

$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * A, \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * A$$

where $*$ denotes the convolution operation.
Next, the horizontal and vertical gradient values at each point of the image are combined, and the gradient magnitude is computed as:

$$G = \sqrt{G_x^2 + G_y^2}$$

After using Sobel to extract edges from the initial input image, we obtain more specific shallow contour information, which is then concatenated with the deep contour information through the contour module. The final contour information has richer contextual information while greatly increasing the continuity of the contours.
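The same computation expressed as fixed-kernel convolutions in PyTorch; the single-channel grayscale input is an assumption for brevity (an RGB image can be converted first or filtered per channel):

```python
import torch
import torch.nn.functional as F

def sobel_edges(img: torch.Tensor) -> torch.Tensor:
    """Apply the standard Sobel kernels to a (B, 1, H, W) tensor and
    return the gradient magnitude G = sqrt(Gx^2 + Gy^2)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = torch.tensor([[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]],
                      device=img.device).view(1, 1, 3, 3)
    gx = F.conv2d(img, kx, padding=1)
    gy = F.conv2d(img, ky, padding=1)
    # small eps keeps the square root differentiable at zero gradient
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-12)
```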

Experimental Results
In this section, we evaluate the generalization ability of ContourNet. We conduct experiments on the daytime dataset CamVid and the nighttime datasets NightCity and Rebecca (a small nighttime-scene semantic segmentation dataset recently annotated in our laboratory).
Evaluation metric: We use MIoU as our evaluation metric, i.e., the ratio of intersection to union between the predicted results and the ground truth is computed for each class and then averaged. Taking binary MIoU as an example:

$$\mathrm{MIoU} = \frac{1}{2}\left(\mathrm{IoU}_{P} + \mathrm{IoU}_{N}\right), \qquad \mathrm{IoU}_{P} = \frac{TP}{TP + FP + FN}, \qquad \mathrm{IoU}_{N} = \frac{TN}{TN + FN + FP}$$

where $TP$, $FP$, $FN$, and $TN$ represent true positives, false positives, false negatives, and true negatives, respectively.
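A reference implementation of the metric for label maps; averaging only over classes that appear (and ignoring void-class handling) is a simplifying assumption:

```python
import numpy as np

def miou(pred: np.ndarray, label: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes present in the prediction or the label."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, label == c).sum()
        union = np.logical_or(pred == c, label == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```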
Experimental details: To accelerate convergence, the network parameters are initialized with ImageNet pre-trained weights. In the training phase, the Adam optimizer is used with $\beta_1 = 0.5$ and $\beta_2 = 0.999$, and cross entropy is used as the loss function. The learning rate decays following the polynomial schedule: given an initial learning rate $l_0$, the learning rate at the $i$-th iteration is

$$l_i = l_0 \left(1 - \frac{i}{T}\right)^{p}$$

where $p$ is a constant and $T$ is the total number of iterations. Hyperparameter values for the different datasets are listed in Table 1.
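A sketch of this setup with PyTorch's scheduler; the stand-in model and the values of $T$ and $p$ are placeholders (the paper's per-dataset values are in Table 1):

```python
import torch

# Polynomial decay, l_i = l_0 * (1 - i / T)^p, via PyTorch's LambdaLR.
model = torch.nn.Linear(8, 8)  # stand-in for ContourNet
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.5, 0.999))
T, p = 10000, 0.9  # total iterations and decay exponent (placeholders)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda i: (1 - i / T) ** p)
```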
All experiments in this paper use the same data augmentation: images are horizontally flipped and randomly cropped to a fixed size. For NightCity [21], CamVid [22], and Rebecca, the cropped sizes are 1024 × 512, 768 × 576, and 1280 × 720, respectively.
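A minimal version of this augmentation for tensor image/label pairs; the crop size shown is a placeholder, and each dataset's size from the list above should be passed in:

```python
import random
import torchvision.transforms.functional as TF

def flip_and_crop(img, label, size=(512, 1024)):
    """Horizontal flip with probability 0.5, then a random crop of the
    given (height, width); image and label are transformed identically."""
    if random.random() < 0.5:
        img, label = TF.hflip(img), TF.hflip(label)
    h, w = size
    top = random.randint(0, img.shape[-2] - h)
    left = random.randint(0, img.shape[-1] - w)
    return TF.crop(img, top, left, h, w), TF.crop(label, top, left, h, w)
```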

NightCity Dataset
NightCity [21] is a large nighttime road-scene dataset, with images captured by car dashcams across multiple cities. Images are annotated at the pixel level with 19 categories of interest. Of the 4297 night images, 2998 are used for training and 1299 for testing.
On the NightCity dataset, in addition to the proposed ContourNet, we run comparative experiments with other models, including FCN-8s [1], BiSeNet [23], EGNet [24], and HyperSeg-s [25]. We report not only the MIoU of each method on the validation set but also the per-category IoUs. The quantitative results in Table 2 demonstrate that, without using the extended dataset Cityscapes [26], ContourNet performs well on NightCity. Notably, in terms of IoU, our method outperforms the others in 15 of the 19 labeled subcategories; only 4 subcategories have slightly lower accuracy than HyperSeg-s, while in terms of MIoU our method is about 2 points higher than HyperSeg-s.
From the segmentation results in Fig. 3, it is easy to see that our method is more accurate for small objects such as traffic signs, traffic lights, and poles; moreover, for objects with continuous contours, such as buildings, sidewalks, and void regions, the proposed algorithm also shows clear advantages, see the areas enclosed by the red rectangular boxes.

Fig. 3 Our predictions compared with those of FCN-8s [1], BiSeNet [23], and HyperSeg-s [25] on the NightCity [21] val set. As can be seen, ContourNet segments distant objects and objects with strong continuity better

CamVid Dataset
CamVid [22] is a dataset released by the University of Cambridge that focuses on urban road scenes. It comprises 701 fully labeled images with a resolution of 960 × 720, including 367 training images, 101 validation images, and 233 test images. All pixels are classified into 12 categories, of which 11 are labeled classes of interest and one is unlabeled. During training, to ensure that the training images amount to more than 70% of the whole set, both the training and validation sets are used for training; the effectiveness of the algorithm is then verified on the test set, where MIoU is calculated.
On the CamVid dataset, with ResNet-50 as the backbone and input images cropped to 768 × 576, ContourNet achieves an MIoU of 73.1%. In the quantitative comparison in Table 3, the performance of the proposed method is slightly lower than HyperSeg-s, but for a daytime dataset the results are sound, fully demonstrating the generalization ability of ContourNet. As on the nighttime NightCity dataset, the model segments continuous contours and small objects, such as utility poles and traffic signs, more effectively; see the areas enclosed by the red rectangular boxes in Fig. 4.

Rebecca Dataset
Rebecca [30] is a dataset recently released by our laboratory for nighttime semantic segmentation. It focuses on nighttime road scenes and contains 600 images in total, 422 for training and 172 for validation. Rebecca uses the same classification as CamVid, with pixels divided into 11 classes of interest and 1 unlabeled class; the biggest difference is that Rebecca has a high resolution of 1920 × 1080.

Fig. 4 Our predictions compared with those of FCN-8s [1], BiSeNet [23], and HyperSeg-s [25] on the CamVid [22] test set. As can be seen, our network segments traffic signs, signal lights, and clearly defined pole-shaped objects better

Table 4 details the per-subcategory IoUs and overall MIoU of ContourNet and other networks on Rebecca. As the table shows, ContourNet has advantages in subclasses similar to HyperSeg-s, but achieves an MIoU of 63.8%, slightly better than HyperSeg-s. The semantic segmentation results of ContourNet and the other networks are shown in Fig. 5. Observing the areas enclosed by the red borders, ContourNet segments distant and small objects significantly better than the other networks, and also has merits for buildings, street lights, and dark skies with distinct outlines.

Fig. 5 Our predictions compared with those of FCN-8s [1], BiSeNet [23], and HyperSeg-s [25] on the Rebecca [30] val set. As can be seen, highly continuous poles, distant objects, and plant backgrounds are all segmented better

Ablation Experiments
Ablation experiments were conducted on the daytime dataset CamVid and the nighttime dataset Rebecca. The quantitative comparisons on the Rebecca val set are listed in Table 5, with the qualitative analysis of the individual modules shown in Fig. 6; the quantitative comparisons on the CamVid test set are listed in Table 6, with the qualitative analysis shown in Fig. 7.
In Tables 5 and 6, Contour-no means that only the semantic network is used for training, Contour-nosobel means that the semantic network and contour network are trained without the Sobel module, and ContourNet is the result of using the semantic network, contour network, and Sobel module simultaneously. From Tables 5 and 6, compared to Contour-no (the plain semantic network), Contour-nosobel (with the Contour module but without the Sobel module) improves accuracy by 4.3% on Rebecca and 1.8% on CamVid; compared to Contour-nosobel, ContourNet (with both the Contour and Sobel modules) improves accuracy by a further 0.8% and 0.7%, respectively. Overall, the proposed algorithm improves performance by 2.5% on the daytime dataset (CamVid) and 5.1% on the nighttime dataset (Rebecca).

Fig. 6 Results of ablation experiments on the Rebecca val set. We can see that our network has better segmentation ability for distant objects and rod-shaped objects
Comparing different categories within the same dataset: on the nighttime dataset Rebecca, adding the contour network significantly improves the segmentation accuracy of several categories, such as sky, utility poles, bicycles, and other faintly contoured or small objects that are difficult to recognize; on the daytime dataset CamVid the situation is similar. In the ablation results on Rebecca in Fig. 6, after adding the contour network, the segmented poles are continuous and complete, as are objects farther from the camera, as shown in the areas enclosed by red rectangles. Notably, although adding the Sobel module did not yield a large MIoU improvement, it did lead to better continuity of objects and finer details in the predicted images.

Conclusion
In this paper, we propose a semantic segmentation network called ContourNet, a multi-task parallel network with both contour and semantic branches. Based on our analysis of nighttime data, we shift attention toward object contours and use a separate cross-entropy loss to ensure that the information extracted by the contour network depends only on contours. Extensive experiments show that our model brings clear improvements on nighttime datasets. The architecture is efficient, producing clearer predictions for continuous object contours and significantly improving prediction performance for small and distant objects. With the contour network added, segmentation improves on both nighttime and daytime datasets, fully demonstrating that ContourNet generalizes well.

Fig. 1 Method overview. ContourNet can be divided into three parts: semantic network, backbone, and contour network. The backbone is based on ResNet [20], and $L_i$ is the $i$-th layer of ResNet. The semantic information is extracted by the semantic network in the blue dotted box, and the contour information is extracted by the contour network in the yellow dotted box

Table 6 Experimental results of ablation of ContourNet on CamVid

Fig. 7 Results of ablation experiments on the CamVid test set. After adding ContourNet, the ability to segment traffic signs and distant objects is enhanced

Table 2 Results of different methods on the NightCity [21] val set, changing only the network model

Table 3 Results of different methods on the CamVid [22] test set, changing only the network model

Table 4 Results of different methods on the Rebecca [30] val set, changing only the network model

Table 5 Experimental results of ablation of ContourNet on Rebecca. We can see that our network model improves almost every category