A Mechanical Parts Image Segmentation Method Against Illumination for Industry

Most current image edge detection methods rely on manually designed features to extract edges, so false and missed detections often occur when the image contains adverse interference. The surface of mechanical parts is smooth, and photographs taken in the industrial field often contain specular reflection and shadow at the same time, which affects edge detection results. To achieve excellent edge detection performance, we propose a semantic segmentation model based on an encoder-decoder structure. It adopts a joint learning strategy, using two decoders to handle the image decomposition task and the segmentation task respectively and sharing their parameters to eliminate the influence of illumination, thereby improving segmentation performance. In the training phase, asymmetric convolution and BN fusion are combined to improve detection efficiency. In addition, we built a gear part dataset for experimentation. The results show that in the task of edge detection of mechanical parts affected by illumination, our method performs better than classical methods.


Introduction
Common mechanical parts such as gears and slender shafts are widely used in the military, aerospace, automobile and manufacturing industries [1]. In actual production, the machining precision of mechanical parts is affected by the precision of machine tools, thermal deformation, datum end face positioning, cutting vibration and other factors [2], and the precision of parts directly affects the working performance and service life of machines and equipment [3]. Therefore, factory inspection of the dimensional precision of parts is an extremely important step in industrial production.
With the development of computer vision, nondestructive testing and other related technologies, many efficient non-contact measurement methods for mechanical parts have emerged, in which the segmentation algorithm for part images is crucial. Xie X et al. [4] use an improved Roberts operator to extract the contour of the target, then use Zernike moments for sub-pixel positioning, while the Otsu method automatically selects the segmentation threshold; this achieves good detection efficiency and accuracy. Zhu G et al. [5] use the Canny operator and bilinear interpolation to extract sub-pixel edges and the least squares method to fit the circle contour, obtaining good results in measuring the concentricity of precision parts. Ofir N et al. [6] regard edge detection as a search over a set of discrete curves for faint edges under noise interference, and effectively detect these faint edges. These traditional methods mostly extract edges by analyzing the shape, texture, color and other features of the target image [7], then calculate the size of the parts. Generally, building a model for a specific application requires not only familiarity with the edge detection process but also manually designed feature extractors, professional knowledge and a parameter tuning process, so these methods cannot be widely applied.
In recent years, progress in deep learning theory and practice has provided a good reference for image edge detection and segmentation. Among these tasks, semantic segmentation is a challenging problem. Changes in imaging conditions, including shadows, reflections, and light source color and intensity, may have a negative impact on the segmentation process. Since image segmentation is a process of grouping pixels based on visual or semantic characteristics [8], sharp changes in pixel values may lead to inaccurate segmentation. In online precision inspection of mechanical parts in industrial scenes, image quality is poor due to the movement of the parts and the specular reflection of the metal surface. Existing models are not ideal for processing images with large differences between light and dark, which greatly affects the segmentation precision and accuracy for part images.
Intrinsic image decomposition is the process of decomposing an image into a reflectance component and an illumination component. The reflectance component is an inherent property of the object and does not change with lighting. The illumination component changes continuously with light source factors, including the specular reflection and shadow that affect image semantic segmentation. Therefore, it is more effective to use the reflectance image for semantic segmentation because it does not contain the negative effects of illumination. Conversely, the category information from semantic segmentation contains prior knowledge of object reflectance, which can guide intrinsic image decomposition.
We designed a convolutional neural network with an encoder-decoder structure. After extracting image features through one encoder, two decoders are used to process image segmentation and image decomposition tasks respectively. Regarding segmentation and decomposition as a mutually-promoting combined process, the model is trained with a joint learning strategy to eliminate the influence of light to improve the accuracy of mechanical parts segmentation.
The contributions of this paper are as follows: a) we design a semantic segmentation model for part images and use a joint learning method to improve semantic segmentation performance; b) we create a labeled dataset of mechanical part images; c) we use asymmetric convolution and BN fusion to optimize model performance.

Related Works
Multi-Task Learning
Multi-task learning is a method that combines multiple tasks and learns them simultaneously to enhance the representation and generalization ability of a model. Joint learning can be realized through neural networks. Currently, the main work can be divided into two categories: parameter sharing [9] and tagging strategies [10].
There are three existing parameter sharing schemes, as shown in Figure 1: hard sharing [11], soft sharing [12] and hierarchical sharing [13]. Hard sharing is the most widely used mechanism today: data representations of multiple tasks are embedded into the same semantic space, and task-specific layers then extract feature representations for each task. Hard sharing is easy to implement and works well for strongly correlated tasks, but it often performs poorly on weakly related tasks. In soft sharing, each task learns with its own network, and each network can access information in the networks of other tasks, such as feature values and gradients. Although the soft sharing mechanism is very flexible and needs no assumptions about task dependencies, additional parameters are required because each task has its own network. Hierarchical sharing performs simple tasks in shallow layers of the network and complex tasks in deep layers. It is more flexible than hard sharing and requires fewer parameters than soft sharing, but designing an efficient hierarchical structure for multiple tasks relies entirely on expert experience. Image segmentation is based on the category information of the object, and this category information contains prior knowledge of the object's reflectance, so segmentation and decomposition can be treated as strongly correlated tasks. In this paper, a hard sharing mechanism is used to extract features with one encoder, and two decoders handle the corresponding tasks separately. Furthermore, soft sharing between the decoders is combined with this scheme to improve the performance of each task through parameter sharing.
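The hard-sharing layout described here (one shared encoder feeding two task-specific heads) can be sketched as follows. This is a toy illustration only: single linear layers stand in for the convolutional encoder and decoders, and all array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w_shared):
    # Shared trunk (hard sharing): both tasks read the SAME features.
    return np.maximum(x @ w_shared, 0.0)  # linear layer + ReLU

x = rng.normal(size=(4, 8))          # toy batch of flattened inputs
w_shared = rng.normal(size=(8, 16))  # parameters shared by both tasks
w_seg = rng.normal(size=(16, 2))     # segmentation-only head
w_dec = rng.normal(size=(16, 3))     # decomposition-only head

h = encoder(x, w_shared)   # computed once, reused by both heads
seg_out = h @ w_seg        # segmentation logits (2 classes)
dec_out = h @ w_dec        # reflectance regression (3 channels)
```

Because `h` is computed once and consumed by both heads, gradients from both losses flow into `w_shared`, which is exactly what makes the tasks share representation.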

Image Semantic Segmentation
Image segmentation is the process of assigning a label to each pixel in an image such that pixels with the same label are connected with respect to some visual or semantic property [8], thereby dividing the image into regions. Semantic segmentation technology based on deep learning has greatly improved image edge detection performance. Long J et al. [14] propose the fully convolutional network (FCN), which replaces the fully connected layers of a CNN with convolution layers, retains the spatial information of the target in the output, and realizes semantic segmentation by pixel-level classification. Benefiting from rich spatial information and a large receptive field [15], many classic models such as U-Net [16], Mask R-CNN [17] and Deeplab [18] have been proposed based on FCN. Most of these models use public datasets such as ImageNet [19] to demonstrate performance, and researchers improve and optimize the proposed models to satisfy specific scenarios. To improve segmentation performance, Stan T et al. [20] sample a large number of small images from a fixed-size X-ray dataset to train their neural network. Smith A et al. [21] design a U-Net based convolutional neural network and construct an annotated chicory dataset, successfully completing a plant root segmentation task; this also demonstrates the feasibility of creating one's own dataset for deep learning. Vuola A et al. [22] compare the advantages and disadvantages of U-Net and Mask R-CNN and develop an integrated model, which achieves better results in a nuclei segmentation task.
In the task of mechanical part image segmentation in industrial scenes, changes in lighting conditions alter the appearance of objects when images are collected in the field, which has a negative impact on semantic segmentation. In this paper, we jointly learn the segmentation and decomposition tasks to reduce the impact of illumination during segmentation.

Intrinsic Image Decomposition
An image can be decomposed into countless combinations of reflectance and illumination, so image decomposition is a long-standing ill-posed problem [23]. Li et al. [24] add non-local texture constraints to traditional techniques to optimize intrinsic image decomposition, a significant improvement over previous algorithms. With the development of the field, the latest research on intrinsic image decomposition has turned to deep learning. Shi et al. [25] use neural network decoders to jointly optimize each component by learning the correlations between intrinsic attributes, achieving robust and realistic decomposition results. Building on this study, we use image segmentation attributes as an aid to improve the performance of the other task in joint learning.

Method
To reduce the influence of illumination on part image segmentation, we apply joint learning in an encoder-decoder model: a shared encoder extracts features, and two decoders learn image segmentation and image decomposition respectively. Through intrinsic image decomposition, the reflectance image, free of the illumination component, guides the semantic segmentation task; at the same time, the class attributes provided by semantic segmentation contain prior knowledge of the target object's reflectance, which in turn guides the image decomposition task. In addition, asymmetric convolution and BN fusion are used to enhance feature learning ability and accelerate inference respectively.

Joint Learning Method
Image intrinsic decomposition is based on Retinex theory [26], which decomposes an image into the product of a reflectance image and an illumination image. The results of image decomposition are not unique, and most current work is devoted to solving this ill-posed problem. Supposing I is the original image, R is the reflectance image, S is the illumination image, and (x, y) are the pixel coordinates in the image, the classical intrinsic image decomposition can be formulated as:

I(x, y) = R(x, y) × S(x, y)    (1)

Since the reflectance image is an object's own property, it is not affected by any light. According to the ShapeNet [25] model, given an image I, the process of obtaining the reflectance component R and illumination component S by intrinsic decomposition can be written as:

(R, S) = f(I; θ)    (2)

where θ contains all the parameters learned by the intrinsic decomposition decoder. MSE is used to optimize each component of θ. Let R* be the ground truth in the dataset, R the reflectance learned by the decoder, and r = R* − R the learning difference. To obtain the most realistic reflectance image, we minimize:

L1 = (1/n) Σ_C Σ_{i,j} r²_{i,j,C}    (3)

where i, j are pixel coordinates, n is the total number of pixels, and C is the RGB channel index of the color image.
The image semantic segmentation task assigns each pixel a label based on what it represents. Through a series of convolution and pooling operations, we obtain a low-resolution multichannel feature map that captures the relationship between the object and its environment. This feature map provides the contextual semantic information of the segmentation target within the entire image.
For a dataset with n classes and labels, the probability that each pixel belongs to each class is predicted, and these probabilities sum to 1:

Σ_{i=1}^{n} P_i(x, y) = 1    (4)

where P_i(x, y) represents the probability that the pixel at coordinates (x, y) belongs to class i. After upsampling from the deep feature map, P can be predicted, and the largest P is selected as the label of the pixel. Pixels with the same label are gathered to generate a mask, which delineates regions of the same class.
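As a concrete illustration of this per-pixel labeling, the sketch below takes the argmax of toy class probabilities (a 2 × 2 image, 3 classes; the values are made up, not model outputs) and groups equal labels into a class mask:

```python
import numpy as np

# Toy per-pixel class probabilities; each pixel's probabilities sum to 1,
# as in the softmax output described above.
P = np.array([[[0.7, 0.2, 0.1],
               [0.1, 0.8, 0.1]],
              [[0.2, 0.2, 0.6],
               [0.5, 0.3, 0.2]]])

assert np.allclose(P.sum(axis=-1), 1.0)

# Select the largest probability per pixel as its label, then gather
# equal labels into a per-class mask.
labels = P.argmax(axis=-1)
mask_class0 = (labels == 0)
```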
For our image semantic segmentation task, to reduce computation, the dataset has only two classes: conveyor background and part foreground. We use the loss function L2 to measure the number of wrongly predicted foreground and background pixels. The smaller L2 is, the better the output, so we minimize:

L2 = (1/n) Σ_i (y_i − ŷ_i)²    (5)

where y_i is the predicted class of pixel i (0 for background, 1 for foreground), ŷ_i is the ground truth, and n is the total number of pixels in the image.
For the whole model, to achieve joint learning, we combine the loss functions of the segmentation decoder and the decomposition decoder and jointly train the parameters:

L = γ1 · L1 + γ2 · L2    (6)

where γ1 and γ2 are manually set coefficients representing the weights of the corresponding loss functions. L is optimized to obtain the best image segmentation output; the impact of γ is shown in the evaluation section.
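Under the definitions above, the joint objective can be sketched directly. The default weights γ1 = 0.3 and γ2 = 0.7 follow the values the paper later selects in Experiment 2; the function names are illustrative.

```python
import numpy as np

def decomposition_loss(r_pred, r_true):
    # L1: mean squared error between predicted and ground-truth reflectance.
    return np.mean((r_true - r_pred) ** 2)

def segmentation_loss(y_pred, y_true):
    # L2: squared error over binary pixel labels; for hard 0/1 predictions
    # this equals the fraction of wrongly classified pixels.
    return np.mean((y_pred - y_true) ** 2)

def joint_loss(r_pred, r_true, y_pred, y_true, gamma1=0.3, gamma2=0.7):
    # L = gamma1 * L1 + gamma2 * L2, the joint objective above.
    return (gamma1 * decomposition_loss(r_pred, r_true)
            + gamma2 * segmentation_loss(y_pred, y_true))
```

For example, with a perfect reflectance prediction and one wrong pixel out of four, L1 = 0, L2 = 0.25, and the joint loss is 0.7 × 0.25 = 0.175.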

Asymmetric Convolution
To achieve better segmentation performance, most ideas for model improvement focus on: 1) how to connect the layers [27]; 2) combining different layers to improve learning quality [28]. Asymmetric convolution [29] uses an improvement scheme independent of the network structure. It does not increase computation and can fulfill the real-time requirements of image detection in the industrial field. The structure of asymmetric convolution is shown in Figure 2.
If several 2D kernels with compatible sizes operate at the same stride on the same input to produce outputs with the same resolution, and their outputs are summed, these kernels can be added at the corresponding positions to obtain an equivalent kernel producing the same output:

I ⊛ K1 + I ⊛ K2 = I ⊛ (K1 ⊕ K2)    (7)

where I is the input, K is a convolution kernel, ⊛ is convolution, and ⊕ is the addition of kernel parameters at corresponding positions. Taking a 3 × 1 convolution kernel as an example, suppose there are M convolution kernels and the output of the j-th kernel K_j forms the j-th channel; the value of a point P in the output can be expressed as:

O_j(P) = Σ_{(x,y)∈W} K_j(x, y) · I_W(x, y)

where W is the corresponding sliding window and I_W is the input patch covered by W. If the points P output by the three convolution kernels correspond to the same sliding window, then the additivity of formula 7 holds (dark color in Fig. 2).
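This additivity is easy to verify numerically. The sketch below zero-pads the 3 × 1 and 1 × 3 kernels into 3 × 3 arrays (so all branches share the same sliding window) and checks that summing three separate convolution outputs equals one convolution with the position-wise summed kernel. The naive loop convolution is for illustration only.

```python
import numpy as np

def conv2d_valid(img, k3):
    # Naive 3x3 cross-correlation, stride 1, no padding.
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k3)
    return out

rng = np.random.default_rng(1)
img = rng.normal(size=(8, 8))

k33 = rng.normal(size=(3, 3))
k31 = np.zeros((3, 3)); k31[:, 1] = rng.normal(size=3)  # 3x1 branch, padded
k13 = np.zeros((3, 3)); k13[1, :] = rng.normal(size=3)  # 1x3 branch, padded

# Sum of the three branch outputs vs. output of the summed kernel.
branch_sum = (conv2d_valid(img, k33) + conv2d_valid(img, k31)
              + conv2d_valid(img, k13))
fused_out = conv2d_valid(img, k33 + k31 + k13)
```

The equality follows from the linearity of convolution, which is exactly what lets the trained branches be collapsed into a single kernel at inference time.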

BN Fusion
Batch normalization [30] can accelerate the convergence of model training, make the training process more stable, and avoid gradient explosion or vanishing. Usually, neural networks apply batch normalization after convolution, which requires two computation steps; BN fusion combines these two steps into one.
For a convolution layer, the output is determined by the weights ω and the bias b:

X_l = ω * X_{l−1} + b    (8)

The batch normalization is shown in formula 9:

BN(X_l) = γ · (X_l − µ) / √(σ² + ε) + β    (9)

where γ and β are trainable parameters iterated by back propagation; as restore parameters, they retain the distribution of the original data to a certain extent. µ is the mean of the input X, σ² is its variance, and ε is a very small constant that avoids division by zero.
The homogeneity of convolution allows the subsequent BN operation and linear scaling to be folded into the convolution layer with an additional bias. Expanding X_l from formula 8 gives:

BN(X_l) = γ · (ω * X_{l−1} + b − µ) / √(σ² + ε) + β = ω̂ * X_{l−1} + b̂    (10)

We construct a new convolution kernel with weights ω̂ = γ · ω / √(σ² + ε) and bias b̂ = β + γ · (b − µ) / √(σ² + ε). The output of the new kernel is the same as that of the original Conv + BN, but requires only one computation. After the model is trained, BN fusion is used to speed up inference.
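A minimal numerical check of this folding, with a per-channel linear map standing in for the convolution and BN running in inference mode on fixed running statistics:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "convolution" as a per-channel affine map y = w*x + b, followed by
# batch normalization with fixed (running) statistics.
w, b = rng.normal(size=4), rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mu = rng.normal(size=4)
var = rng.uniform(0.5, 2.0, size=4)
eps = 1e-5

x = rng.normal(size=(10, 4))

# Two-step reference: convolution, then batch normalization.
y_conv = w * x + b
y_ref = gamma * (y_conv - mu) / np.sqrt(var + eps) + beta

# BN fusion: fold gamma, mu, var, beta into new weights and bias.
scale = gamma / np.sqrt(var + eps)
w_fused = scale * w
b_fused = beta + scale * (b - mu)
y_fused = w_fused * x + b_fused
```

The fused single step reproduces the two-step output exactly, which is why the fusion is applied only after training, when µ and σ² are frozen.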

Model Structure
In the model used in this paper, both the encoder and the decoder have 5 layers. The encoder uses a 3 * 3 convolution kernel with a stride of 2 to extract features in each layer, followed by batch normalization to reduce the correlation between layers. After the BN operation, the rectified linear unit (ReLU) [31] is used as the activation function. The decoder upsamples with feature sizes symmetrical to the encoder. The model is shown in Figure 3. There is a mirror link between the encoder and the decoder; we use the copy-and-crop method of U-Net [16] to make the upsampling process more accurate. Parameters are shared between the two decoders: the feature values after ReLU activation are shared with each other, for two reasons. (1) The reflectance image obtained by intrinsic decomposition contains no illumination component, so it guides the segmentation process and reduces incorrect segmentation caused by highlights and shadows. (2) The class information obtained by semantic segmentation contains prior knowledge about the object's reflectance, which guides the generation of more accurate reflectance images.
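The mirror link can be illustrated with a toy downsample/upsample pair, where the saved encoder feature is stacked with the upsampled decoder feature; stride-2 subsampling and nearest-neighbour upsampling stand in for the real stride-2 convolutions and learned upsampling, so this is only a structural sketch.

```python
import numpy as np

def downsample(x):
    # Stride-2 subsampling stands in for the stride-2 encoder convolution.
    return x[::2, ::2]

def upsample(x):
    # Nearest-neighbour upsampling stands in for the decoder upsampling.
    return x.repeat(2, axis=0).repeat(2, axis=1)

x = np.arange(16.0).reshape(4, 4)
skip = x                             # encoder feature saved for the mirror link
decoded = upsample(downsample(x))    # decoder path alone loses spatial detail
merged = np.stack([decoded, skip])   # "copy" step: concatenate as channels
```

The `skip` channel carries the full-resolution detail that the downsample/upsample round trip discards, which is why the copied encoder features sharpen the upsampling.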
In the training phase, we use three convolution kernels of sizes 3 * 1, 1 * 3 and 3 * 3, all of which use a 3 * 3 sliding window to match the existing square convolution kernel. Each of the three branches is a Conv + BN operation. After training is completed, they are fused into a standard 3 * 3 square convolution kernel; this process requires no additional hyperparameters. The output is composed as follows:

O = Σ_k [γ_k · (I ⊛ K_k − µ_k) / √(σ_k² + ε) + β_k],  k ∈ {3×1, 1×3, 3×3}    (11)

where I is the input image and the output O is the sum of the three branches after convolution and batch normalization; µ, γ, σ and β are the parameters of each branch's BN operation.
The enhanced square convolution kernel contains the BN fusion:

K̂ = ⊕_k (γ_k / √(σ_k² + ε)) · K_k,   b̂ = Σ_k [β_k − γ_k · µ_k / √(σ_k² + ε)]    (12)

where K̂ and b̂ are the kernel and bias after BN fusion. The branches are converted into a standard convolution kernel by adding the kernel parameters at the corresponding positions.

Results and discussion
The experiment configuration is as follows: the CPU is an AMD R5 2600, the GPU is an NVIDIA GTX 1660Ti, and the RAM is 16 GB. Experiments are programmed in the TensorFlow framework [32]; the code runs on Python 3.8, with CUDA 10.1 and cuDNN 7.6 as the deep learning environment. We use the RMSProp [33] optimizer to train the model with a learning rate of 0.1 and a batch size of 2; the model converges after training for 100 epochs.

Dataset
In this paper, we need effective semantic segmentation of images collected on the conveyor belt in industrial production, so we construct a new dataset containing many metal part images. The illumination differs across areas of the conveyor belt, and gears of different sizes occupy different numbers of pixels. A total of 600 original images were collected with manual setup in areas without specular reflection. After collecting the original images, we add 5% Gaussian noise in Photoshop to obtain noisy images. The dataset thus contains 1200 images; we extract the target area and crop each image to 320 * 320 resolution. We randomly select 1000 images as the training set, and the remaining 200 images are used as the test set.
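The noise-and-crop augmentation can be sketched as below. Note the assumption: the paper applies "5% Gaussian noise" in Photoshop without stating the exact noise model, so here it is interpreted as zero-mean noise with a standard deviation of 5% of the 8-bit range, and a simple center crop stands in for extracting the target area.

```python
import numpy as np

def add_gaussian_noise(img, ratio=0.05, rng=None):
    # Zero-mean Gaussian noise, std = `ratio` of the 8-bit value range.
    # (Assumed interpretation of the paper's "5% Gaussian noise".)
    rng = rng or np.random.default_rng(0)
    noisy = img + rng.normal(0.0, ratio * 255.0, size=img.shape)
    return np.clip(noisy, 0.0, 255.0)

def center_crop(img, size=320):
    # Crop the central size x size region around the target area.
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]
```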
Intrinsic image decomposition training requires labels that cannot be annotated by hand. Therefore, we use part images with ideal shooting results as the benchmark and render these images with specified highlight intensities; the original low-light images are taken as the ground truth reflectance. As for the semantic segmentation labels, we manually define the black area as foreground and the white area as background. An example from the dataset is shown in Figure 4.

Evaluation Criteria
The proposed model performs two tasks simultaneously, but the main target is image segmentation. Therefore, for the intrinsic image decomposition decoder, we take parameters from the existing ShapeNet model [25], fine-tune them on the industrial parts dataset, and use MSE to measure the decomposition effect.
For the image semantic segmentation task, we use pixel accuracy to evaluate the prediction:

PA = (TP + TN) / (TP + TN + FP + FN)

where TP (True Positive) is the number of pixels correctly predicted as the target and TN (True Negative) is the number of pixels correctly predicted as the background; conversely, FP (False Positive) and FN (False Negative) are the numbers of wrongly predicted pixels.

Figure 4. Example of the dataset. The first image is taken in the ideal environment, achieving the best imaging result; the second is the manually marked foreground and background, used as the label for image semantic segmentation; the last is an image with interference factors collected in the simulated industrial environment, used as the input of our model.
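Both pixel accuracy and the IoU used alongside it reduce to simple counting over binary masks; a sketch with a toy 2 × 2 prediction:

```python
import numpy as np

def pixel_accuracy(pred, gt):
    # PA = (TP + TN) / (TP + TN + FP + FN): fraction of correct pixels.
    return float(np.mean(pred == gt))

def iou(pred, gt):
    # IoU = TP / (TP + FP + FN) for the binary foreground class.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union)

pred = np.array([[1, 1], [0, 0]], dtype=bool)  # toy predicted mask
gt   = np.array([[1, 0], [0, 0]], dtype=bool)  # toy ground truth
```

Here one of four pixels is wrong, so PA = 3/4, and with one overlapping foreground pixel out of two in the union, IoU = 1/2.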
To comprehensively analyze segmentation performance and compare with mainstream segmentation models, we also use IoU to evaluate the segmentation effect:

IoU = TP / (TP + FP + FN)

Experimental Results and Analysis
Experiment 1: effectiveness of parameter sharing in joint learning. This experiment verifies the effectiveness of parameter sharing. We use the control variable method to test the following cases. Case 1: without joint learning, only the semantic segmentation decoder works. Case 2: the intrinsic image decomposition decoder shares parameters one-way with the semantic segmentation decoder, as an aid to the segmentation task. Case 3: the two decoders pass parameters to each other to achieve joint learning. The experimental results are shown in Table 1. The data show that the single segmentation task alone produces only moderate results, and one-way assistance improves the accuracy of semantic segmentation. In Case 3, with joint learning, not only is the segmentation result better, but the decomposition is also improved, because the two tasks promote each other.
Experiment 2: weights of the loss function. This experiment verifies the influence of the loss function weights. Letting the sum of γ1 and γ2 be 1, we analyze different weight proportions for the two tasks and how they affect the segmentation performance of the model. The experimental results are shown in Table 2. As the table shows, segmentation performance improves as the decomposition loss weight γ1 increases; however, as γ1 grows too large, the segmentation evaluation indices decay rapidly. Therefore, according to the experimental results, γ1 = 0.3 and γ2 = 0.7 are selected as the relatively optimal combination.
Experiment 3: influence of asymmetric convolution. This experiment compares the effects of the normal convolution kernel and asymmetric convolution blocks on the model. It was carried out with the parameter configuration of the first two experiments, and three groups of test results are selected, as shown in Table 3. The comparison shows that a conventional kernel enhanced by asymmetric convolution blocks can improve segmentation performance, but the increase in pixel accuracy is limited to about 0.5%.
Experiment 4: influence of BN fusion. On the basis of the previous experiment, the convolution layer and BN layer are fused. To verify the ability of BN fusion to accelerate processing, we divide the test set images into 10 batches of 100 images each and measure the time required to process each batch under different conditions, as shown in Figure 5. According to the figure, each batch of images processed with BN fusion is generally faster than the baseline, and the average processing speed is improved by about 4.5%.
Experiment 5: comparison with classical semantic segmentation models. We verify the effectiveness of our model in the segmentation of the industrial part dataset against two existing classical models: FCN [14] and U-Net [16].
The experimental results are shown in Table 4. They show that, compared with the traditional single-task segmentation model, joint learning can significantly improve segmentation performance on the part dataset by eliminating adverse interference. The comparison results are shown in Figure 6. The first group's image has a simple object, and the segmentation results of our method and the comparison methods are both excellent. In the second group, the metal part image contains specular reflection and shadow at the same time; our method eliminates both interferences. In the contrast experiment, SegNet eliminates shadow interference but leaves many gaps in the external contour, while U-Net produces a continuous contour but also judges the shadow as target. To demonstrate resistance to highlights, the third group of images has obvious bright details on the upper right of the target; our method correctly identifies this region as target, while the two comparison algorithms fall short to some extent. The last group shows a complex part; although our method is clearly better than the comparison methods, many details remain to be optimized.

Conclusion
Aiming at mechanical part images affected by illumination, this paper studies deep learning-based semantic segmentation methods for edge detection of mechanical part images. Although classic CNN methods can learn some effects of illumination, they cannot completely eliminate these disadvantages. Therefore, this paper proposes a joint learning method that guides semantic segmentation with the illumination-free reflectance feature map to improve its prediction performance.
The method proposed in this paper targets mechanical part images, but each image in the dataset contains only one target part. In the industrial field, a collected image usually contains multiple targets, and some parts may even overlap. In future work, we will pursue multi-object instance segmentation to make the algorithm more suitable for images collected in industrial scenes.