A Novel Edge-Inspired Depth Quality Evaluation Network for RGB-D Salient Object Detection

Recently, pairs of RGB and depth images, denoted as RGB-D images, have been introduced to improve the performance of salient object detection, because a salient object may stand out in at least one modality. However, most existing methods still face three dilemmas. Firstly, the edges of the predicted salient objects are blurry. Secondly, how to integrate RGB images and depth images effectively still needs to be explored. Thirdly, the quality of the depth images has a strong impact on the performance of salient object detection, so the selection of depth images is worth exploring. To address the above problems, we propose an edge-inspired depth quality evaluation network, which evaluates the quality of the depth images based on edge information. More specifically, the depth quality evaluation module includes two parts: the quality decider and the depth aggregator. The former judges the quality of the depth images, while the latter produces the weighted depth features. Then, an edge detection module is proposed to predict the edges of the salient object and produce edge features. In addition, the features from the VGG backbone, the edge features and the depth features are integrated by a multi-modality feature fusion module, which is composed of a series of hybrid dilated convolutions. Moreover, the integrated features are fused by a three-feature interactive module and a double-feature interactive module to predict the final salient map. Our experiments on four RGB-D datasets demonstrate that our proposed method outperforms previous high-performance RGB-D salient object detection methods.

However, SOD still suffers from three dilemmas. First and foremost, the edge of the predicted salient object lacks clarity and sharpness. Furthermore, the integration of RGB images and depth images is inefficient. Besides, low-quality depth images may degrade the performance of SOD. As shown in Fig. 1, the edge of the image is one of the most effective features for SOD, which can help to locate the salient object and sharpen its outline. When the Ground Truth of the edge is used as an input in an experiment, the performance of SOD improves dramatically. Therefore, scholars pay much attention to predicting a clear edge of the salient object, for instance [34][35][36]. However, the edges predicted by recent algorithms are usually blurry and unsatisfactory in real applications, because they contain a lot of crisp noise; this noise may disturb some details and degrade the performance. Secondly, the strategy for integrating RGB-D images is still worth exploring. Most learning-based models fuse RGB-D images by using early fusion [37], late fusion [38] or middle fusion [39]. Although all of these strategies have achieved encouraging progress, they still face challenges in both extracting representative features and aggregating them for high performance, and there is still room for improvement in sophisticated architectures. Last but not least, the quality of the depth images fluctuates in both the training and test datasets, which may lead to sub-optimal solutions. Specifically, a high-quality depth image can point out the salient object with a high probability, whereas a low-quality depth image may introduce a lot of interference information. Therefore, high-quality depth images should be employed completely, while low-quality ones should be excluded directly.
To mitigate the above issues with one end-to-end deep learning algorithm, we present an Edge-Inspired Depth Quality Evaluation Network (EDQNet) for RGB-D SOD, which predicts the salient object with sharp edges, fuses the RGB images with the depth images with high efficiency and excludes the low-quality depth images during training and inference simultaneously. Note that our proposed method can predict a high-quality salient map as well as a high-quality salient edge; the salient edge is almost the contour of the salient object, which is clear and sharp with little noise thanks to the coincidence of edges. At the beginning, a Depth Quality Evaluation Module (DQEM) is proposed to evaluate the quality of the depth images, which includes two parts: the quality decider and the depth aggregator. The former judges the quality of the depth images based on the value of the soft dice loss between the edge of the depth image and the criterion, which is the average of the edge of the Ground Truth and the predicted edge in the training session. The latter extracts and integrates the depth features, making decisions based on the output value of the quality decider. High-quality depth images are fully used by the depth aggregator and vice versa. In addition, a Multi-Modality Feature Fusion Module (MMF) is proposed to integrate the multi-modality features; it first extracts the features from the VGG backbone by using a series of hybrid dilated convolutions (HDC) [40], which are then concatenated with the depth features and the edge features, followed by convolutions with batch normalization and ReLU. In the next step, the multi-modality features from the MMF are extracted and integrated progressively by a Three-feature Interactive Module (TIM) and a Double-feature Interactive Module (DIM). Furthermore, a novel Edge Detection Module (EDM) is introduced to predict clear and sharp edges of the salient object. The features from VGG are processed by a Global Context Module (GCM), upsampled, and turned into four edge maps by 1 × 1 convolutions; the salient edge is predicted as the product of the four edge maps, activated by the sigmoid function. We drop the first layer of VGG because it contains too many details, which may disturb the SOD. Besides, the features are extracted by using the VGG backbone [41].
Our contributions are summarized as follows: (1) First and foremost, we design a novel end-to-end EDQNet, tackling RGB-D SOD from three perspectives simultaneously: predicting the edge of the salient object, integrating the RGB and depth images by an attention mechanism, and judging and excluding the low-quality depth images. (2) Secondly, the DQEM is introduced to evaluate the quality of the depth images based on edge information. To the best of our knowledge, this is the first attempt to estimate depth quality by using edge information. (3) Thirdly, we propose a novel EDM to predict the salient edge and the edge information in an end-to-end deep learning algorithm, which is capable of effectively weakening the noise in the edge map. The Ground Truth of the edge is produced by the Canny operator [42]. (4) Finally, we compare our approach with 15 SOTA RGB-D SOD methods, which demonstrates its superiority. For better understanding, Table 1, named Commonly Used Symbols, lists the full names and corresponding abbreviations of the concepts, networks, modules, etc.

Related Works
The utilization of RGB images for SOD based on CNNs has been extensively explored for years. Based on the goal of this paper, in this section we review RGB-D SOD, edge detection methods in SOD, and the quality evaluation of depth images.

Salient Object Detection
In this paper, we pay particular attention to RGB-D SOD. Qu et al. [27] first introduced CNNs to infer salient objects from RGB-D images. Subsequently, Zhu et al. [31] designed a main network for RGB images with a sub-network for depth images, and then incorporated the depth features into the main network. Fu et al. [43] utilized a Siamese network for simultaneous RGB and depth feature extraction, discovering the commonality between them. Li et al. [44] designed a cross-modal weighting network for RGB-D SOD. The authors of [45] adopted a bifurcated backbone to split multi-level features into student features and teacher features, suppressing distraction in the low-level features. Zhao et al. [46] designed a single-stream network that directly takes the depth map as the fourth channel of an RGB image, proposing the depth-enhanced dual attention module. All of the above methods explore how to extract and integrate multi-level RGB-D features. In this paper, on the one hand, a novel and effective multi-modality feature fusion module is proposed to integrate the RGB features and the depth features; on the other hand, the depth images are used to predict the salient object in an independent branch. Furthermore, the DQEM is proposed to evaluate the quality of the depth images, in order to eliminate negative effects from low-quality depth images.

Edge Detection in SOD
Nowadays, scholars pay much attention to improving the performance of SOD by using contour and edge information. The edge feature is one of the most effective features, because a clear and sharp edge is helpful for locating the salient object and providing clear details. Guan et al. [47] trained an edge detection stream based on holistically-nested edge detection (HED) [48] to extract hierarchical edge features; the edge contours are then integrated with the salient detection stream as complementary information. Zhuge [49] proposes a novel FCN that integrates multi-level features under the guidance of edge features based on an edge extraction branch. Compared with both methods, the major difference of our EDQNet is that we use an independent edge detection module to predict the edge result as well as to extract edge features, whereas the above algorithms use edge detection only as a complement. Furthermore, NLDF [50] implemented a loss function to penalize errors on the edges. Since its salient edges are derived from salient objects through a fixed Sobel operator, it cannot be used in an end-to-end deep learning algorithm; compared with our EDQNet, there is no FCN-based edge detection module in NLDF. Zhao [36] designs two modules to extract local features and edge features independently and fuses them with a one-to-one guidance module; nonetheless, the structure of its edge detection branch is so simple that it cannot extract features sufficiently. Our proposed EDQNet designs an independent module to predict the edge map and extract edge features, and we use both the binary cross entropy and the soft dice loss to train the EDM in the training session. When it comes to edge detection on RGB-D images, Zhang et al. [51] proposed a complementary interaction fusion framework to locate salient objects with fine edge details; it integrates the features of the RGB images and the features of the depth images to generate the salient edge. In contrast, our proposed EDQNet predicts the salient edge only from the RGB images, because the quality of the depth images varies, which may introduce noise.

Depth Quality Evaluation in RGB-D
The quality of the depth images has a great influence on the performance of RGB-D SOD, so scholars pay much attention to alleviating the impact of low-quality depth images. Some examples of high-quality and low-quality depth images are shown in Fig. 2.
A high-quality depth image is very helpful for SOD, whereas a low-quality depth image may be nothing but noise. At present, judging the quality of depth images within an end-to-end structure that automatically discards low-quality depth images is still a challenge. As early attempts, Cong et al. [52] first proposed a no-reference depth quality metric [53] to alleviate the contamination of low-quality depth. Fan [37] evaluated the depth quality by comparing the results of two identical networks, one taking RGB images and the other taking depth images as input. Wang et al. [54] designed three hand-crafted features to excavate depth images following a multiscale methodology. Chen et al. [55] proposed to locate the most valuable regions of depth images by comparing two results: a salient map predicted from RGB-D input and a salient map predicted from two sub-networks with the RGB image and the depth image as inputs, respectively. Zhang [56] proposes a novel depth quality-inspired feature manipulation process, which uses global average pooling and the IoU loss to evaluate the quality of the depth images; in addition, a holistic attention integrating the depth images with the low-level features is introduced to enhance cross-modal fusion. Different from the existing methods, which focus on the depth images themselves, our depth quality evaluation method takes a novel perspective. As shown in Fig. 2, we use the edge of the depth image as the indicator to evaluate its quality: high-quality depth images usually have a clear edge that largely coincides with the edge of the Ground Truth, whereas there is little overlap between the edge of the Ground Truth and the edge of a low-quality depth image.

Methodology
This section first presents the overall architecture of our EDQNet, then introduces the principles and details of the proposed modules, and finally investigates the loss function. Figure 3 shows the overall architecture of EDQNet. First and foremost, the VGG backbone in the bottom half of Fig. 3 is used to extract multi-level features. Then, the EDM uses the multi-level features from the second to the fifth layer of VGG to predict the salient edge and multi-level edge features. More significantly, the DQEM includes two parts, the depth aggregator and the quality decider, which are shown in the top half of Fig. 3. Firstly, the edge of the Ground Truth, the edge of the depth image and the predicted salient edge are all fed into the quality decider to evaluate the quality of the depth image. Subsequently, the depth aggregator extracts features of the depth image by using another VGG backbone and multiplies them by the output value of the quality decider. Besides, the features from the depth image are integrated by two depth feature modules to predict a salient map. In the next step, four MMFs are used to integrate the multi-modality features, including salient features, edge features and weighted depth features. The outputs of the four MMFs are processed by TIMs and DIMs to integrate the features; their inputs are three adjacent integrated features and two adjacent integrated features, respectively. Finally, the output of the DIM is upsampled by two deconvolutions to predict the salient map. Thanks to the above modules, the EDQNet predicts two salient maps and a salient edge simultaneously.

Edge Detection Module
The EDM is designed to predict the edge map of the salient object and the edge features, as displayed in Fig. 4. Four input features (f 1, f 2, f 3, f 4) first undergo the global context module [57] to extract multi-level features. Then, they are upsampled directly to the size of the RGB image by deconvolution and processed by 1 × 1 convolutions, without batch normalization or activation function. Furthermore, all of them are processed by the sigmoid activation function to map the values into the range from 0 to 1. The final edge map is the product of the four results, followed by a sigmoid operation, since taking the product is an effective way to reduce noise. In conclusion, the EDM has five outputs, including a salient edge map and four edge features. In the corresponding formulation, GCM means the global context module, whose output channel number is a quarter of the input channel number; conv 1x1 refers to the convolution with kernel size 1 used by both the EDM and the MMF; up i means the upsample operation with the corresponding multiple; and f i indicates the feature from the second to the fifth layer of the VGG backbone. The EDM predicts the edge result, denoted as b_m, and the edge features, labeled as (b 1, b 2, b 3, b 4).
For reference, the remaining notation of Fig. 3 is as follows. The output of the quality decider is named the score. The depth aggregator extracts the features of the depth image, denoted as (df 1, df 2, df 3, df 4), by using another VGG backbone and weights them by the score. The depth features (df 0, df 1, df 2, df 3, df 4) are integrated by two Depth Feature Modules (DFM) progressively to predict a salient map based only on the depth image. The MMFs integrate the multi-modality features, whose outputs are named (I 1, I 2, I 3, I 4). They are processed by TIMs, whose outputs are labeled (fusion 2, fusion 3). The outputs of the TIMs are further fed into the DIM to produce the integrated feature, denoted as med_fusion 2.
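To make the data flow concrete, the following is a minimal PyTorch sketch of the EDM computation described above. It is an illustration under assumptions rather than the authors' implementation: the GCM is simplified to a 1 × 1 channel-reduction convolution, bilinear interpolation stands in for the deconvolution, and all channel and spatial sizes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGCM(nn.Module):
    """Placeholder for the global context module: reduces channels to a quarter (assumption)."""
    def __init__(self, in_ch):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, in_ch // 4, kernel_size=1)

    def forward(self, x):
        return self.reduce(x)

class EDM(nn.Module):
    """Edge Detection Module sketch: four VGG features -> four edge features + one edge map."""
    def __init__(self, in_channels=(128, 256, 512, 512), out_size=256):
        super().__init__()
        self.out_size = out_size
        self.gcms = nn.ModuleList([SimpleGCM(c) for c in in_channels])
        # 1x1 convolutions producing single-channel edge maps (no BN / activation, as in the text)
        self.heads = nn.ModuleList([nn.Conv2d(c // 4, 1, kernel_size=1) for c in in_channels])

    def forward(self, feats):
        edge_feats, edge_maps = [], []
        for f, gcm, head in zip(feats, self.gcms, self.heads):
            b = gcm(f)                                              # GCM(f_i)
            b = F.interpolate(b, size=(self.out_size, self.out_size),
                              mode='bilinear', align_corners=False)  # up_i (deconv in the paper)
            edge_feats.append(b)                                    # edge feature b_i
            edge_maps.append(torch.sigmoid(head(b)))                # sigmoid(conv_1x1(b_i))
        # final edge map: product of the four sigmoid-activated maps, followed by a sigmoid
        b_m = torch.sigmoid(edge_maps[0] * edge_maps[1] * edge_maps[2] * edge_maps[3])
        return b_m, edge_feats

# toy usage with random VGG-like features at assumed resolutions
feats = [torch.randn(1, c, s, s) for c, s in zip((128, 256, 512, 512), (128, 64, 32, 16))]
edge_map, edge_features = EDM()(feats)
print(edge_map.shape)  # torch.Size([1, 1, 256, 256])
```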
As discussed in the related works, one major problem of edge detection is that the predicted salient edge contains noise. In this paper, we use the combination of the soft dice loss and the weighted binary cross-entropy loss as the loss function of the EDM, which is shown from Eq. (3) to Eq. (5).
Here, Loss b refers to the total loss of the EDM, and BCE b and soft_dice b indicate the binary cross entropy and the soft dice loss, respectively. The two weighting parameters are set to 1 and 0.5, respectively. In Eq. (4), y, y + and y − indicate the total number of pixels of the edge map, the number of edge pixels and the number of non-edge pixels, respectively; y i and ỹ i mean the Ground Truth and the predicted value of pixel i, and N and i refer to the total number of pixels and the pixel index when the image is flattened into one dimension. The essence of the soft dice loss [58] is to compute the overlap proportion between the Ground Truth and the predicted map. The binary cross entropy regards every pixel as an independent sample, while the soft dice loss regards the salient edge as a whole. As a result, the combination of binary cross entropy and soft dice loss is useful for suppressing the noise in the predicted edge. In Eq. (5), i, j, X and Y refer to the width coordinate, the height coordinate, the total width and the total height, respectively, and ỹ i,j and y i,j indicate the predicted value and the Ground Truth value at location (i, j) of the image.
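As a concrete illustration of the combined objective, the sketch below implements one plausible form of the weighted binary cross entropy and the soft dice loss with the reported weights (1 and 0.5). The exact class-balancing scheme and the dice formulation are assumptions based on the description above, not the authors' code.

```python
import torch

def weighted_bce(pred, target, eps=1e-6):
    """Class-balanced BCE: edge pixels (rare) weighted by the non-edge ratio and vice versa (assumption)."""
    pred = pred.clamp(eps, 1.0 - eps)
    num_pos = target.sum()
    num_neg = target.numel() - num_pos
    w_pos = num_neg / (num_pos + num_neg)
    w_neg = num_pos / (num_pos + num_neg)
    loss = -(w_pos * target * torch.log(pred) + w_neg * (1 - target) * torch.log(1 - pred))
    return loss.mean()

def soft_dice_loss(pred, target, eps=1e-6):
    """Soft dice: one minus the overlap proportion between prediction and ground truth."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum()
    return 1.0 - (2.0 * inter + eps) / (union + eps)

def edge_loss(pred_edge, gt_edge, w_bce=1.0, w_dice=0.5):
    """Total EDM loss with the weights (1 and 0.5) reported in the paper."""
    return w_bce * weighted_bce(pred_edge, gt_edge) + w_dice * soft_dice_loss(pred_edge, gt_edge)

# toy usage
pred = torch.rand(1, 1, 256, 256)                 # predicted edge probabilities
gt = (torch.rand(1, 1, 256, 256) > 0.9).float()   # sparse binary edge ground truth
print(edge_loss(pred, gt).item())
```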

Multi-Modality Feature Fusion
Multi-Modality Feature Fusion (MMF) plays an indispensable role in this paper, since it integrates salient features from the VGG backbone, edge features from the EDM and depth features from the depth aggregator simultaneously. The structure of the MMF is shown in Fig. 5, where f i, b i and df i denote the ith feature from the VGG backbone, the ith edge feature from the EDM and the ith depth feature from the depth aggregator, respectively; I i denotes the ith output of the MMF, whose channel number is a quarter of that of the input f i; conv (purple) indicates a traditional convolution for feature extraction; and dconv followed by a number refers to a dilated convolution with that dilation rate. In the first step, the MMF extracts and integrates features from the VGG backbone by using a series of HDCs [40]. The composition of HDCs is capable of capturing multi-level features with larger receptive fields without sacrificing image resolution. An HDC is composed of dilated convolutions with different dilation rates, for instance (1,2,3), (1,2,5) and (3,4,5); the dilation rates in one group must not share a common divisor greater than 1, which avoids the gridding effect. In the formulation of the first step of the MMF, conv k=1 refers to a traditional convolution with kernel size 1 and conv k=3,d=1 means a dilated convolution with kernel size 3 and dilation rate 1; both convolutions include batch normalization and a leaky ReLU operation.
Furthermore, as illustrated in Fig. 5, the four parallel branches of f i, the edge feature and the depth feature are concatenated and processed by two convolutions. In the corresponding formulation, concat means concatenation and conv k=3 refers to a traditional convolution with kernel size 3; c i, hdc i1, hdc i2 and hdc i3 refer to the output of the convolution and the three outputs of the HDC, and b i and df i indicate the edge feature and the depth feature, respectively.
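A minimal sketch of an MMF-style block is given below, assuming the dilation rates (1, 2, 5), the branch layout and all channel sizes; these details are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HDC(nn.Module):
    """Hybrid dilated convolution: 3x3 dilated convs whose rates share no common divisor > 1."""
    def __init__(self, ch, rates=(1, 2, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=r, dilation=r),
                          nn.BatchNorm2d(ch), nn.LeakyReLU(inplace=True))
            for r in rates
        ])

    def forward(self, x):
        return [b(x) for b in self.branches]   # three parallel outputs hdc_1, hdc_2, hdc_3

class MMF(nn.Module):
    """MMF sketch: HDC over a VGG feature, then fusion with an edge feature and a depth feature."""
    def __init__(self, vgg_ch, edge_ch=1, depth_ch=64, out_ch=None):
        super().__init__()
        out_ch = out_ch or vgg_ch // 4             # output channels: a quarter of the input (per the paper)
        self.reduce = nn.Sequential(nn.Conv2d(vgg_ch, out_ch, 1),
                                    nn.BatchNorm2d(out_ch), nn.LeakyReLU(inplace=True))
        self.hdc = HDC(out_ch)
        fused_ch = out_ch * 4 + edge_ch + depth_ch  # c_i + three HDC outputs + edge + depth
        self.fuse = nn.Sequential(
            nn.Conv2d(fused_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, f_vgg, b_edge, df_depth):
        c = self.reduce(f_vgg)
        hdc1, hdc2, hdc3 = self.hdc(c)
        fused = torch.cat([c, hdc1, hdc2, hdc3, b_edge, df_depth], dim=1)
        return self.fuse(fused)

# toy usage: a 512-channel VGG feature, a 1-channel edge feature and a 64-channel depth feature
# at matching (assumed) resolutions
out = MMF(512)(torch.randn(1, 512, 32, 32), torch.randn(1, 1, 32, 32), torch.randn(1, 64, 32, 32))
print(out.shape)   # torch.Size([1, 128, 32, 32])
```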

Depth Quality Evaluation Module
The DQEM judges the quality of the depth images by calculating the overlap between the edge of the depth image and the edge of the Ground Truth, and reweights the depth features based on this value. It consists of two parts, namely the quality decider and the depth aggregator. The quality decider has three inputs in the training session: the edge of the Ground Truth, the edge of the depth image and the predicted edge. The average of the edge of the Ground Truth and the predicted edge is used as the criterion. Both the edge of the Ground Truth and the edge of the depth image are put into the data batch, together with the RGB images, depth images and Ground Truth. The output of the quality decider is the value of the soft dice loss between the edge of the depth image and the criterion. Then, the depth aggregator makes decisions and reweights the features from the VGG backbone based on this value.

Quality Decider
For the quality decider, as discussed above, there are three inputs in the training session. The edge of the Ground Truth and the edge of the depth image are generated by the Canny operator, while the predicted edge is the output of the EDM. We use the average of the predicted edge and the edge of the Ground Truth as the criterion: the edge of the Ground Truth contributes the stability of the criterion, while the predicted edge contributes its variability, and this combination improves the performance of SOD effectively. In the inferring session, however, we only use the predicted salient edge as the criterion, because the edge of the GT is unknown at inference time and cannot be used as an input; moreover, the predicted salient edge is clear and accurate enough to serve as a criterion. Then, the soft dice loss is calculated between the criterion and the edge of the depth image. The process of the quality decider in the training session and in the inference session is shown in Fig. 6 and described in Eq. (10) and Eq. (11), respectively:
score tr = soft_dice_loss(MEAN(bgt, b_m), d_b)  (10)
score in = soft_dice_loss(b_m, d_b)  (11)
where soft_dice_loss and MEAN refer to the soft dice loss and the averaging function, respectively; bgt, b_m and d_b indicate the Ground Truth edge, the predicted edge and the edge of the depth image, respectively; and score tr and score in are the outputs of the quality decider in the training session and in the inferring session, respectively.
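The scoring procedure of Eq. (10) and Eq. (11) can be sketched as follows; the soft dice implementation and the use of probability-valued edge maps are assumptions.

```python
import torch

def soft_dice_loss(pred, target, eps=1e-6):
    """Soft dice loss: 1 minus the overlap proportion between two edge maps."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum()
    return 1.0 - (2.0 * inter + eps) / (union + eps)

def quality_score(d_b, b_m, bgt=None):
    """Quality decider score.
    d_b : edge of the depth image (Canny), b_m : predicted salient edge from the EDM,
    bgt : edge of the Ground Truth (Canny), available only during training.
    """
    if bgt is not None:                     # training: criterion = average of GT edge and predicted edge
        criterion = 0.5 * (bgt + b_m)
    else:                                   # inference: criterion = predicted edge only
        criterion = b_m
    return soft_dice_loss(criterion, d_b)   # score_tr / score_in from Eq. (10) / Eq. (11)

# toy usage with random edge maps in [0, 1]
d_b, b_m, bgt = torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256), torch.rand(1, 1, 256, 256)
print(quality_score(d_b, b_m, bgt).item(), quality_score(d_b, b_m).item())
```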

Depth Aggregator
The depth aggregator first extracts the depth features by using the VGG backbone and then makes a decision based on the value produced by the quality decider. As demonstrated in Fig. 3, a 1 × 1 convolution is first used to process the depth image, adjusting the number of channels from 1 to 3. Then, an independent VGG backbone is introduced to extract features from the depth image, which are further processed by 3 × 3 convolutions with batch normalization and ReLU. Furthermore, the depth aggregator makes decisions based on the score. Specifically, there are an upper threshold and a lower threshold. If the score is greater than the upper threshold, the depth features are classified as high-quality and are used completely. Similarly, if the score is less than the lower threshold, the depth features are classified as low-quality and only a tiny percentage of them is used. When the score lies between the upper and the lower threshold, the depth features are partially weighted. The reweighted depth features are denoted as rdf 1, rdf 2, rdf 3, rdf 4. The strategy is illustrated in Eq. (12), where up, low and score represent the upper threshold, the lower threshold and the output value of the quality decider, respectively, and df i and rdf i mean the ith depth features and the ith reweighted depth features, respectively.
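A possible implementation of the piecewise reweighting is sketched below. The threshold values and the weights for the low-quality and intermediate cases are placeholders, since the constants of Eq. (12) are not given in this excerpt.

```python
import torch

def reweight_depth_features(depth_feats, score, up=0.7, low=0.3, tiny=0.1):
    """Piecewise reweighting of depth features based on the quality score (Eq. 12 sketch).
    up, low and tiny are placeholder values; the paper does not state them in this excerpt."""
    if score > up:            # high-quality depth: use the features completely
        w = 1.0
    elif score < low:         # low-quality depth: keep only a tiny percentage
        w = tiny
    else:                     # in between: weight the features by the score itself (assumption)
        w = float(score)
    return [w * df for df in depth_feats]

# toy usage with four random depth features df_1..df_4
dfs = [torch.randn(1, 64, s, s) for s in (128, 64, 32, 16)]
rdfs = reweight_depth_features(dfs, score=0.55)
print([r.shape for r in rdfs])
```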

Depth Features Fusion Modules
Two Depth Features Fusion Modules (DFM) are introduced to integrate the features df 0, df 1, df 2, df 3, df 4 progressively, predicting a salient map based only on the depth images. The structure of the DFM is shown in Fig. 7. The three inputs are first processed by a convolution to extract features and adjust the number of channels. Then, the two higher-level features are upsampled by factors of 4 and 2, respectively. Finally, they are concatenated and processed by another convolution:
out = conv(concat(up 4(conv(df i)), up 2(conv(df i-1)), conv(df i-2)))  (13)
where conv, up and concat represent the convolution followed by batch normalization and ReLU, the upsample operation and the concatenation operation, respectively, and df i, df i-1, df i-2 refer to the three adjacent features.
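The following sketch illustrates Eq. (13); the output channel count, the use of bilinear upsampling and the feature sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DFM(nn.Module):
    """Depth Features Fusion Module sketch: fuse three adjacent depth features (Eq. 13)."""
    def __init__(self, ch_hi, ch_mid, ch_lo, out_ch=64):
        super().__init__()
        def cbr(c_in):   # conv + BN + ReLU, as described in the text
            return nn.Sequential(nn.Conv2d(c_in, out_ch, 3, padding=1),
                                 nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv_hi, self.conv_mid, self.conv_lo = cbr(ch_hi), cbr(ch_mid), cbr(ch_lo)
        self.fuse = cbr(out_ch * 3)

    def forward(self, df_i, df_im1, df_im2):
        # df_i is the highest-level (smallest) feature, df_im2 the lowest-level one
        hi = F.interpolate(self.conv_hi(df_i), scale_factor=4, mode='bilinear', align_corners=False)
        mid = F.interpolate(self.conv_mid(df_im1), scale_factor=2, mode='bilinear', align_corners=False)
        lo = self.conv_lo(df_im2)
        return self.fuse(torch.cat([hi, mid, lo], dim=1))

# toy usage: three adjacent depth features at assumed resolutions
out = DFM(512, 512, 256)(torch.randn(1, 512, 16, 16),
                         torch.randn(1, 512, 32, 32),
                         torch.randn(1, 256, 64, 64))
print(out.shape)   # torch.Size([1, 64, 64, 64])
```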

Three-Feature Interactive Module and Double-Feature Interactive Module
The adjacent three features from the MMF are put into the TIM, whose structure is depicted in Fig. 8(a). First of all, the three features are processed by 3 × 3 convolutions to extract features and adjust the number of channels. After that, the relatively high-level feature is upsampled by bilinear interpolation, while the relatively low-level feature is downsampled by max pooling. Finally, the three features are added pixel-wise. Supposing I 1, I 2, I 3, I 4 are the outputs of the MMFs and fusion 2, fusion 3 are the outputs of the TIMs, the correlation between them is formulated as:
fusion n = conv(pool(conv(I i)) + conv(I j) + up(conv(I k))), n ∈ (1, 2); i ∈ (1, 2); j ∈ (2, 3); k ∈ (3, 4)  (14)
where conv, pool and up indicate the 3 × 3 convolution with batch normalization and ReLU, the 2-times max pooling and the 2-times upsampling, respectively, and I i and I k refer to the relatively low-level and relatively high-level features compared with I j. The Double-Feature Interactive Module (DIM) is introduced to integrate the outputs of the TIMs. The basic structure of the DIM is the same as that of the TIM, as shown in Fig. 8(b); the main difference is that the DIM executes the relevant operations twice. Supposing fusion 2 and fusion 3 are the inputs of the DIM and med_fusion 2 refers to its output, the structure of the DIM is formulated analogously, where conv, pool and up indicate the same operations as in the TIM; fusion 3 and fusion 2 refer to the relatively high-level and relatively low-level input features of the DIM, respectively; and f_step1 high and f_step1 low refer to the relatively high-level and low-level intermediate features in the process, respectively.
In Fig. 8(a), I i-1, I i and I i+1 refer to the features from the MMF, while fusion j means the output of the TIM. In Fig. 8(b), fusion 2 and fusion 3 indicate the outputs of the TIMs, while m_fusion i means the output of the DIM.
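A minimal sketch of the TIM computation in Eq. (14) is given below; the channel counts and spatial sizes are assumptions, and the DIM would repeat the same pattern of operations twice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TIM(nn.Module):
    """Three-feature Interactive Module sketch (Eq. 14): fuse three adjacent MMF outputs."""
    def __init__(self, ch_lo, ch_mid, ch_hi, out_ch=64):
        super().__init__()
        def cbr(c_in):
            return nn.Sequential(nn.Conv2d(c_in, out_ch, 3, padding=1),
                                 nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))
        self.conv_lo, self.conv_mid, self.conv_hi = cbr(ch_lo), cbr(ch_mid), cbr(ch_hi)

    def forward(self, I_i, I_j, I_k):
        # I_i: relatively low-level (largest), I_j: middle, I_k: relatively high-level (smallest)
        lo = F.max_pool2d(self.conv_lo(I_i), kernel_size=2)                   # pool(conv(I_i))
        mid = self.conv_mid(I_j)                                              # conv(I_j)
        hi = F.interpolate(self.conv_hi(I_k), scale_factor=2,
                           mode='bilinear', align_corners=False)              # up(conv(I_k))
        return lo + mid + hi                                                  # pixel-wise addition

# toy usage: three adjacent MMF outputs with assumed channel counts and resolutions
fusion = TIM(128, 128, 128)(torch.randn(1, 128, 64, 64),
                            torch.randn(1, 128, 32, 32),
                            torch.randn(1, 128, 16, 16))
print(fusion.shape)   # torch.Size([1, 64, 32, 32])
```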

The Loss Function
The parameters of our proposed EDQNet are supervised by feedback from two kinds of results, namely salient maps and edge maps. The loss function for the salient maps is the binary cross-entropy loss, where G sal refers to the Ground Truth, S sal and S deep indicate the predicted salient map and the salient map predicted from the depth images, and BCE refers to the binary cross entropy. Furthermore, the loss function of the salient edge map has been discussed from Eq. (3) to Eq. (5).
The overall learning objective can be formulated as follows.
where the two weighting factors are set to 0.5 and 1, respectively.
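The sketch below shows one plausible way to assemble the overall objective. Which term receives the factor 0.5 and which receives 1 is an assumption, since the corresponding symbols were lost in this excerpt.

```python
import torch
import torch.nn.functional as F

def soft_dice(pred, target, eps=1e-6):
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def total_loss(s_sal, s_deep, g_sal, pred_edge, gt_edge, w_deep=0.5, w_edge=1.0):
    """Overall objective sketch: BCE on both predicted salient maps plus the edge loss of Eq. (3)-(5).
    Assigning the reported weights (0.5 and 1) to the depth-branch and edge terms is an assumption."""
    loss_sal = F.binary_cross_entropy(s_sal, g_sal)      # main salient map
    loss_deep = F.binary_cross_entropy(s_deep, g_sal)    # salient map from the depth branch
    loss_edge = F.binary_cross_entropy(pred_edge, gt_edge) + 0.5 * soft_dice(pred_edge, gt_edge)
    return loss_sal + w_deep * loss_deep + w_edge * loss_edge

# toy usage
maps = [torch.rand(1, 1, 256, 256) for _ in range(3)]
g_sal = (torch.rand(1, 1, 256, 256) > 0.5).float()
g_edge = (torch.rand(1, 1, 256, 256) > 0.9).float()
print(total_loss(maps[0], maps[1], g_sal, maps[2], g_edge).item())
```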

Experiments and Analysis

In this section, numerous experiments are conducted on four benchmark datasets to verify the effectiveness and superiority of our proposed EDQNet and its modules, evaluated by four evaluation metrics.

Evaluation Metrics
Four widely-used metrics are used to evaluate the performance, including the Mean Absolute Error (MAE), the F-measure (F β−max) [18], the S-measure (S α) [62] and the E-measure (E ξ) [63]. The MAE reflects the average absolute error between the predicted salient map and the ground truth at the pixel level:
MAE = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} |S(i, j) − G(i, j)|
where S and G represent the predicted salient map and the binary ground truth map, respectively, and H, W, i and j mean the height, the width, the row index and the column index, respectively.
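For reference, a minimal NumPy sketch of the MAE metric is shown below (probability map and binary ground truth both in [0, 1]).

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a predicted saliency map and the binary ground truth."""
    pred = pred.astype(np.float64)
    gt = gt.astype(np.float64)
    return np.abs(pred - gt).mean()     # average over all H x W pixels

# toy usage
pred = np.random.rand(256, 256)
gt = (np.random.rand(256, 256) > 0.5).astype(np.float64)
print(mae(pred, gt))
```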
The F-measure is an overall performance indicator based on region similarity. We use the maximum F-measure, defined as
F β^i = ((1 + β²) · P i · R i) / (β² · P i + R i)
where P i and R i are the corresponding precision and recall values at threshold i (i ∈ {1, 2, ..., 255}), and β² is set to 0.3. We calculate the maximal F value over the PR curve, denoted as F max.
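The maximum F-measure can be computed as in the following sketch, sweeping 255 thresholds with β² = 0.3; the min-max normalization step is a common convention rather than something specified here.

```python
import numpy as np

def max_f_measure(pred, gt, beta2=0.3, num_thresholds=255):
    """Maximum F-measure over 255 thresholds, with beta^2 = 0.3 (standard SOD protocol)."""
    pred = (pred - pred.min()) / (pred.max() - pred.min() + 1e-8)   # normalize to [0, 1]
    gt = gt > 0.5
    f_scores = []
    for t in range(1, num_thresholds + 1):
        binary = pred >= t / 255.0
        tp = np.logical_and(binary, gt).sum()
        precision = tp / (binary.sum() + 1e-8)
        recall = tp / (gt.sum() + 1e-8)
        f = (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
        f_scores.append(f)
    return max(f_scores)

# toy usage
pred = np.random.rand(256, 256)
gt = (np.random.rand(256, 256) > 0.5).astype(np.float64)
print(max_f_measure(pred, gt))
```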
The S-measure is a structure measure combining the region-aware structural similarity (S r) and the object-aware structural similarity (S o):
S α = α · S o + (1 − α) · S r
where α ∈ [0, 1] is a hyper-parameter balancing S r and S o and is set to 0.5. In Eq. (23), ssim refers to the structural similarity, and K, k and W k mean the total number of patches, the current patch and the weight of each patch, respectively. In Eq. (24), S and G refer to the predicted salient map and the Ground Truth, their overlined counterparts denote the corresponding mean values, and σ S, σ G and σ SG indicate the standard deviations and the covariance, respectively.
The object-aware structural similarity is described from Eq. (26) to Eq. (28).
Fig. 10 Visual examples of our models with and without DQEM. The images, depth images and Ground Truth in the first three columns come from NJUD and NLPR in the inference session. The fourth and fifth columns show the salient maps predicted by EDQNet with and without DQEM, respectively, and the last two columns show the predicted edges with and without DQEM, respectively.

where u ∈ [0, 1] is a hyper-parameter balancing O FG and O BG, which refer to the foreground similarity and the background similarity, respectively. The mean values and standard deviations of the predicted foreground and background enter these terms, denoted by the mean of the predicted background, the mean of the predicted foreground, the standard deviation of the predicted background and the standard deviation of the predicted foreground, respectively. The E-measure utilizes both image-level and local pixel-level statistics for evaluating the salient map, as shown from Eq. (29) to Eq. (32), where W, H, i and j are the width, the height, the column index and the row index, respectively. The enhanced alignment matrix is obtained by applying a simple convex function f to the alignment matrix, which is symmetric. The symbol • refers to the Hadamard product between the deviation matrices of the Ground Truth and of the predicted salient object, respectively.

Implementation Details
Our proposed EDQNet is implemented in PyTorch and trained for 300 epochs on a single Nvidia Tesla T4 GPU. The Adam optimizer is used with default values; the initial learning rate is set to 1e-4 and the batch size is 10. The poly learning rate policy is used, with the power set to 0.9. For data augmentation, every input in the training session is resized to 256 × 256 with random flipping, rotation, color enhancement and random pepper noise. In the training session, the input data are combined together so that salient object detection, edge detection and quality evaluation of the depth images can be trained in an end-to-end procedure. During the inference session, the test data are fed into the EDQNet to predict two salient maps and a salient edge.
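The reported optimizer settings can be reproduced with a sketch like the one below; the model, the number of iterations per epoch and the point at which the scheduler is stepped are placeholders.

```python
import torch

# minimal optimizer / schedule sketch matching the reported settings (Adam, lr = 1e-4, batch size 10,
# 300 epochs, poly learning-rate policy with power 0.9); the model is a stand-in for EDQNet
model = torch.nn.Conv2d(3, 1, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

max_iter = 300 * 100                                    # epochs x iterations per epoch (placeholder)
poly = lambda it: (1 - it / max_iter) ** 0.9            # poly decay with power 0.9
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=poly)

for it in range(3):                                     # toy loop: step the schedule once per iteration
    optimizer.zero_grad()
    loss = model(torch.randn(10, 3, 256, 256)).mean()   # batch size 10, inputs resized to 256 x 256
    loss.backward()
    optimizer.step()
    scheduler.step()
    print(it, scheduler.get_last_lr())
```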

Quantitative Comparisons
We compare the performance of our EDQNet with 15 models, as shown in Table 2 (the best scores are shown in bold red; for MAE a lower value indicates better performance, while for F β−max, F β−avg, S α and E ξ a higher value indicates better performance). From Table 2, our EDQNet outperforms all models on NJUD and STERE in all four evaluation metrics (i.e., MAE, F β−max, S α, E ξ). On SIP, the F β−max of our model outperforms all other algorithms, while the MAE, S α and E ξ rank fourth, third and sixth, respectively. Furthermore, the F β−max and S α rank first on NLPR, whereas the MAE and E ξ of EDQNet rank second and seventh, with a 9% decline compared with the best models. The experimental results in Table 2 demonstrate that our proposed EDQNet can boost the performance of RGB-D SOD effectively. In terms of computational complexity, the FLOPs and the number of parameters of our proposed method are 44.27 G and 44.9 M, respectively. Figure 9 shows visual comparisons of different methods. These examples cover various scenarios, including small objects (1st and 2nd rows), low contrast between the salient object and the background (3rd and 4th rows), multiple salient objects (5th and 6th rows) and complex scenes (7th and 8th rows). Our proposed method (the 4th column) is compared with six models (from the 5th to the 10th column). All images in the visual comparison are predicted by training the source codes from GitHub or are downloaded from GitHub directly. For the scenario of small objects, the compared methods cannot predict a complete salient object and introduce some noise around it. For the scenario of low contrast, the existing salient detectors mostly obtain poor object smoothness and recognize some non-salient parts as the salient object. When it comes to multi-object detection, several methods miss some salient objects. For the last scenario of complex scenes, the compared approaches mostly predict a blurry salient object and miss some detailed information. To sum up, compared with six state-of-the-art algorithms, our method consistently produces salient maps that are closer to the Ground Truth under various cases.

Ablation Study
In this section, we validate the effectiveness of the different modules. We first verify the effectiveness of the DQEM by changing the criterion of depth evaluation (Table 4, Ablation Experiments of DQEM). Next, we evaluate the EDM by using different compositions of edge information (Table 5, Ablation Experiments of EDM). Furthermore, the MMF is evaluated by comparing different structures of feature fusion. All ablation studies are evaluated on the four benchmark datasets. In Tables 4 and 5, as in Table 2, the best scores are shown in bold red; a lower MAE and higher F β−max, F β−avg, S α and E ξ indicate better performance.

Effectiveness of Depth Quality Evaluation Module
The main variable of the DQEM is the strategy for evaluating the quality of the depth images. First and foremost, the EDQNet without the DQEM is introduced to illustrate the superiority of our proposed DQEM. We show visual comparisons with and without the DQEM in Fig. 10. The depth images are divided into two classes: the high-quality depth images in the first three rows and the low-quality depth images in the last three rows. For the first and second rows, there are fewer blemishes in the salient maps predicted by our EDQNet. For the third row, we can see that the salient object predicted by our proposed EDQNet contains much more detailed information. Furthermore, for the high-quality depth images, the edges predicted by our EDQNet are much clearer. For the low-quality depth image in the fourth row, the salient map predicted by the EDQNet without the DQEM has poor object smoothness and there is significant noise in the edge. For the fifth row, the model without the DQEM misses some salient objects and the contour of the salient object is not sharp. Last but not least, the depth image introduces an obvious interference, since the pillar and the round mark have the same depth, yet only the round mark is regarded as the salient object. The EDQNet without the DQEM regards both the round mark and the pillar as salient, whereas our model with the DQEM is capable of suppressing the interference from the depth image and only regards the round mark as the salient region. Furthermore, we evaluate our proposed criterion against four variations, which are shown in Table 3, with detailed results reported in Table 4. It is obvious that our proposed EDQNet under the current strategy is the most effective. From Table 4, the edge of the Ground Truth has the worst performance, because it is completely stationary, which obstructs the search for the global optimum. On the contrary, the predicted edge improves the performance slightly, since it introduces variance into the criterion. Furthermore, we average the edge of the Ground Truth and the predicted edge and adopt a one-size-fits-all approach using only one threshold: when the score is above the threshold, the depth image is regarded as high-quality and used completely, and vice versa. The performance of the one-size-fits-all approach is better than criteria ① and ②, because it takes the variability and the stability into consideration simultaneously. In addition, our proposed method sets up a transition zone to use the depth images in proportion, which improves the performance of our proposed method further.
Fig. 11 The evolution of the EDM: (a) integrating both the edge and the edge features; (b) our proposed EDQNet. In Fig. 11(a), the predicted edge is downsampled and integrated with the outputs of the TIM (marked by the red box); in this way, the model uses both the edge features and the edge itself. In Fig. 11(b), the integration of the edge with the outputs of the TIM is removed.

Effectiveness of Edge Detection Module
We show the performance contributed by different compositions of edge information, including the edge features and the edge map. The baseline is the EDQNet without the edge features and the edge map; its performance already surpasses the majority of RGB-D SOD methods. However, it is worth noting in Table 5 that the model using only the edge map performs worst, and the model with both the edge features and the salient edge performs similarly to the baseline, since the edge map may introduce some noise and therefore has a negative effect on performance. As a result, as illustrated in Fig. 11, we remove the part that integrates the salient edge with the multi-level features.

Effectiveness of Multi-Modality Feature Fusion
We compare the performance of the MMF with another structure and choose the better one. As shown in Fig. 12, the main difference between them is the order of concatenating and extracting features. The EDQNet first extracts features from the VGG backbone by using the composition of HDCs and then integrates them with the depth features and the edge features. On the contrary, the compared method first concatenates the features from the VGG backbone, the depth features and the edge features, and then extracts the integrated features by using the composition of HDCs. The experimental results in Table 6 show that our proposed MMF performs better.

Conclusion
In this paper, we summarize three dilemmas of RGB-D SOD: the blurry edge, multi-modality feature fusion and the quality evaluation of depth images. To address them together, the EDQNet is proposed to predict the salient object and its edge while evaluating the quality of the depth images simultaneously. It includes three main components: the EDM, the DQEM and the MMF. The EDM predicts the edge independently and provides edge features for the MMF, which integrates the multi-modality features by using the composition of HDCs. Furthermore, the DQEM evaluates the quality of the depth images by using the quality decider and reweights the depth features by using the depth aggregator. Extensive experiments on four benchmark datasets demonstrate that our proposed method is superior to 15 methods. More importantly, both the EDM and the MMF can be widely used in SOD based on RGB or RGB-D images, boosting the performance effectively. For any task of multi-level, multi-source feature fusion, the MMF is capable of extracting multi-level features and integrating multi-modality features. As for the EDM, it is a completely independent module, so it can be plugged into any FCN-based architecture to predict the edge and edge features.
In the future, we will pay more attention to improving the DQEM. In this paper, we use the edge information as the quality criterion for the depth images. Nonetheless, the criterion only reflects the quality of a depth image as a whole; the quality of specific areas of the depth image is not evaluated. In the next step, we will try to separate the depth images into several patches and then identify the coincident pixels and the redundant pixels in each area to evaluate the quality of the patch. The high-quality patches would be integrated with the features, while the low-quality ones would be excluded directly. Furthermore, the piece-wise function in the depth aggregator can be improved by heuristic optimization [71]. In addition, when it comes to the loss function, the generalized uncorrelated constraint and the mixed regularization [72] can be introduced to optimize it.