Surface defects inspection of cylindrical metal workpieces based on weakly supervised learning

Weakly supervised learning uses image-level labels to train convolutional neural networks to locate defects. In industrial vision systems, metal surfaces are anisotropic under illumination from all directions, and the specular reflection of strong active lighting inevitably causes local overexposure, especially on cylindrical metal surfaces. In this paper, the injector valve is taken as representative of cylindrical metal workpieces. The variety and complexity of cylindrical metal workpiece defects make pixel-level annotation require expensive manual work, which hinders the application of convolutional neural networks in industry. To address these challenges, this paper proposes an end-to-end weakly supervised learning framework named Integrated Residual Attention Convolutional Neural Network (IRA-CNN). IRA-CNN uses only image-level annotation for training and performs defect classification and defect segmentation simultaneously. Weak supervision is achieved by extracting category-related spatial features from the defect classification scores. IRA-CNN is composed of multiple Integrated Residual Attention Blocks (IRA-Blocks) as the feature extractor, which improves accuracy and achieves real-time performance. Each IRA-Block contains an Integrated Attention Module (IAM), which includes a channel attention submodule and a spatial attention submodule. The channel attention submodule adaptively extracts the channel attention feature map to improve the bilateral nonlinearity and robustness of the network. IAM integrates well into IRA-CNN, enabling the network to suppress interference from useless background areas and highlight defect areas. The proposed method achieves satisfactory performance on our defect dataset and meets the requirements of the industrial process. Experimental results show that the method has good generalization ability.
The defect classification accuracy reaches 97.84%, and the segmentation accuracy is significantly improved compared with the benchmark method.


Introduction
A cylindrical metal workpiece must mate with other parts in kinematic pairs, so its surface condition directly affects the performance of the workpiece and even of the whole mechanical system. The injector valve, taken here as representative of cylindrical metal workpieces, has multiple surface quality hazards (Fig. 1 shows the three kinds of defects in the injector valve: non-defect: 1, 2; dirt: 3, 4; EC: 5, 6; scratch: 7, 8, 9). The task is to classify these defects and to segment the dirt and scratch areas. Compared with a normal injector valve, a dirt area has a smaller gray value and an irregular shape. The main characteristics of scratch defects are a gray value similar to the normal area, a low contrast ratio, and a strip-shaped distribution. The EC defect is characterized by small pits over a large area of the injector valve; according to experimental observation, EC causes pits in all areas of the whole valve, so it covers the entire valve area and segmentation is meaningless for EC. Therefore, EC does not need to be processed by image segmentation. These defects have a negative impact on the quality of the injector. They are difficult to find by manual observation, and some surface defects can only be observed with a high-resolution camera. Therefore, it is necessary to inspect the surface defects of the injector valve for product quality control.
Existing surface defect inspection methods based on machine vision mainly target four types of surfaces: (1) non-textured surfaces; (2) repeated-pattern surfaces; (3) uniformly textured surfaces; (4) non-uniformly textured surfaces. Cylindrical metal workpieces can be classified as non-textured surfaces. For this kind of surface, traditional machine vision methods generally comprise two stages: feature extraction and pattern classification. Features usable for pattern classification include statistical measures such as first-order and second-order moments, as well as handcrafted features such as Local Binary Patterns (LBP) [4] and the Gray-level Co-occurrence Matrix (GLCM) [5]. Classifiers include generative models such as the Bayesian classifier and discriminative models such as K-Nearest Neighbors (KNN), the Support Vector Machine (SVM) [6], and Random Forest [7]. The performance of defect inspection depends to some extent on how well the features are designed. Feature extraction in traditional methods relies mainly on manually designed descriptors, which requires professional knowledge and a complicated parameter tuning process. Moreover, each method targets a specific industrial scenario and has poor robustness and generalization ability.
In recent years, deep learning has shown its advantages in defect segmentation [8, 9] and in defect detection and classification [10, 11, 13-16, 18]. By automatically extracting image features, deep learning achieves high precision and broad applicability. Common surface defect inspection tasks include fabric, steel, solar panel, and LED chip defect detection, and deep learning is mainly applied to defect classification and detection and to defect segmentation. For defect classification and detection, Xu et al. [11] proposed SDD-CNN for roller bearing defect classification; the method uses SSAD for data expansion and InceptionV3 [12] as the classification network, and its classification accuracy reaches 99.56%. Chen et al. [13] proposed a multispectral CNN structure for surface defect classification of solar cells, establishing three convolutional neural networks for three spectral channels; the classification accuracy is 88.24%. Cheon et al. [14] proposed a CNN structure for wafer surface defect classification that can detect unknown defects by clustering the eigenvectors of the same defect types. He et al. [15] proposed a semi-supervised defect classification method to make up for the shortage of samples in supervised training, using a GAN to generate a large number of unlabeled data. However, the above methods only classify defects and do not obtain defect location or defect feature information (such as size and distribution), so some researchers add object detection and image segmentation methods to defect detection. Li et al. [16] used an improved YOLO [17] network to detect six kinds of steel surface defects, achieving 97.55% mAP and 95.86% recall. Su et al.
[18] designed a novel bidirectional attention feature pyramid network and embedded it into the region proposal network to improve the efficiency of Faster-RCNN [19] in detecting surface defects of solar cells. Object detection methods, however, require bounding-box labels for training.
More accurate defect boundaries can be obtained by image segmentation. Tabernik et al. [20] proposed a segmentation framework for crack segmentation that first segments the defect and then classifies it using the segmentation feature map; their experiments show an average accuracy at least 1.9% higher than DeepLabv3+ [21] and U-Net [22]. Tao et al. [23] used an encoder-decoder structure to segment defects, after which the segmented region is cropped out for classification. Wang and Cheng [24] transformed CRF inference into CNN operations, fully combining the CRF with a deep convolutional neural network and significantly improving the defect segmentation effect. However, the variety and complexity of cylindrical metal workpiece defects, together with image contrast fluctuation, make sample annotation time-consuming and costly, which makes supervised learning hard to apply. Weakly supervised learning can effectively alleviate the shortage of labeled defect samples and realize defect detection or segmentation using only image-level annotation. Among such methods, CAM [25] is an important way to realize weak supervision: it generates a saliency map showing the probability that each pixel belongs to a certain defect category. Lin et al. [26] used the CAM method to visualize the predictions of a deep CNN and thus locate LED defects. Chen et al. [27] proposed a robust weakly supervised learning method for surface defect detection that uses transfer learning to obtain the CAM. Xu et al. [28] proposed a weakly supervised detection framework in which a CNN model was trained to identify surface cracks in motor commutators; this method achieved 99.5% recognition accuracy on the KolektorSDD dataset.
The above studies all use the CAM method, but CAM requires replacing the fully connected layer with a global average-pooling layer and then retraining, which raises the training cost of the model and limits its use scenarios. Grad-CAM [29] and Grad-CAM++ [30] solve this problem: both use the gradient information of the last CNN feature map to assign an importance value to each neuron. The difference is that Grad-CAM++ uses a weighted combination of the positive partial derivatives of the last convolutional layer feature maps with respect to a specific class score to generate the visual explanation for the corresponding class label. When one image contains multiple defects, Grad-CAM++ produces more accurate saliency maps, so it has better robustness and interpretability.
The saliency map generated by Grad-CAM++ still contains considerable noise that affects defect segmentation. Adding an attention module to the backbone can effectively alleviate this problem: the attention module emphasizes weak defect features and suppresses image noise. Because of the complex surface and uneven light distribution of cylindrical metal workpieces, an attention module is well suited to surface defect inspection of the injector valve. Attention modules include channel attention and spatial attention, which realize "what to look at" and "where to look" by weighting the feature map along the channel and spatial dimensions, respectively. A channel attention module lets the CNN select feature maps related to the defect, suppressing useless background information; a spatial attention module lets the CNN attend to the defect location in the feature map. Attention mechanisms have been widely used in various tasks [31-33]. Hu et al. [31] used globally averaged features to compute channel-wise attention in their squeeze-and-excitation modules. Ji et al. [32] used an attention module to align the context information between feature maps at different scales and the final saliency map prediction. Woo et al. [33] proposed the Convolutional Block Attention Module (CBAM), which infers attention maps in turn along two independent dimensions (channel and spatial) and multiplies them with the input feature map for adaptive feature refinement. Inspired by the structure of CBAM, this paper proposes the Integrated Attention Module (IAM), which enables the classifier to focus more on defect-related feature maps and regions.
This paper presents a framework for detecting defects on the outer cylindrical metal surface of the fuel injector valve. The contributions of this paper are as follows.
1. Aiming at defect inspection of cylindrical metal workpieces, an Integrated Residual Attention Convolutional Neural Network (IRA-CNN) is proposed. IRA-CNN is composed of multiple Integrated Residual Attention Blocks (IRA-Blocks). Two residual mappings are included in each IRA-Block to improve robustness, and every IRA-Block integrates a distinctive attention module to improve the interpretability of the network. IRA-CNN therefore classifies cylindrical metal workpiece defects effectively.
2. We propose IAM, composed of a channel attention submodule and a spatial attention submodule. In the channel attention submodule, the input proportions of the Global Average Pooling (GAP) and Global Max-Pooling (GMP) layers are adaptively selected by IAM to adjust the output of each IRA-Block. IAM makes full use of the complementary characteristics of the GAP and GMP layers to improve the bilateral nonlinearity and robustness of IRA-CNN, so that the image background is suppressed and the defect area of the cylindrical metal workpiece is highlighted.
3. A weakly supervised learning strategy is proposed to segment injector valve defects using only image-level annotation. After classification by IRA-CNN, a saliency map is generated by Grad-CAM++, and pixel-level segmentation of defects is performed on the saliency map. This greatly simplifies the segmentation task, saves pixel-level labeling time, and achieves better segmentation accuracy for cylindrical metal workpiece defects.
The rest of this study is organized as follows. Section 2 introduces the methodology in detail. Section 3 presents the experiments and related performance evaluation. Section 4 gives the conclusion and discussion.

Methodology
In this section, defect inspection of cylindrical metal workpieces is studied mainly on the injector valve. The structure of surface defect inspection of cylindrical metal workpieces based on weakly supervised learning is shown in Fig. 2. We mainly introduce the architectures of the classification module, the integrated attention module, and the segmentation module.

Classification module
A convolutional neural network extracts local feature information from the image with convolutional layers and compresses the feature information with pooling layers. When detecting surface defects of the injector valve, a sliding window divides the original image into eight pieces, which allows defects to be detected more accurately without reducing the resolution of the original image.
In this paper, IRA-CNN is proposed to extract features of the outer surface of the injector valve. IRA-CNN is composed of multiple IRA-Blocks in series, into which IAM is integrated; the structure of IAM is explained in detail in Section 2.2. EfficientNet [34] and MobileNet [35] have a similar structure in which an attention module is integrated into the convolutional extraction block. In the IRA-Block, we use 3 × 3 convolution kernels to extract features and stack multiple 3 × 3 kernels to expand the receptive field. The IRA-Block is shown in Fig. 3.
A convolutional layer extracts features from a local region, with different convolution kernels acting as different feature extractors. Defect features at different positions are extracted by sliding the convolution kernel over the image. The convolution operation is

Y = Φ(W ⊗ X + b)

where ⊗ denotes element-wise multiplication, X is the input image, W is the weight of the convolution kernel, and b is its bias. After the convolution operation, the output Y is obtained by applying the nonlinear activation function Φ(·).
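The operation above can be sketched in plain NumPy. This is a framework-agnostic illustration, not the IRA-CNN implementation; the function name and the tiny test image are our own.

```python
import numpy as np

def conv2d_single(X, W, b, stride=1):
    """Naive single-channel 2D convolution: slide W over X, take the
    element-wise product with each local patch, sum, add the bias b,
    then apply the activation Phi (here ReLU)."""
    kh, kw = W.shape
    oh = (X.shape[0] - kh) // stride + 1
    ow = (X.shape[1] - kw) // stride + 1
    Y = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = X[i * stride:i * stride + kh, j * stride:j * stride + kw]
            Y[i, j] = np.sum(patch * W) + b
    return np.maximum(Y, 0.0)  # Phi(.): ReLU activation

X = np.arange(16, dtype=float).reshape(4, 4)  # toy 4x4 "image"
W = np.ones((3, 3))                           # toy 3x3 kernel
Y = conv2d_single(X, W, b=0.0)                # output is 2x2
```

Real frameworks implement the same sliding-window product far more efficiently, but the arithmetic is identical.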
To obtain a larger receptive field with fewer parameters, four 3 × 3 convolutional blocks are stacked to extract features, and the number of convolution kernel channels is doubled compared with the previous IRA-Block. After feature extraction, IAM is used to suppress background noise and highlight the defect area. Finally, max-pooling compresses the feature map. IRA-CNN uses four IRA-Blocks to extract features in succession, and the last feature map is sent to the fully connected layers for feature aggregation. The last fully connected layer is the output of the whole network; it has the same number of neurons as the labels and is classified by a Softmax classifier:

P(Y_i) = e^{Y_i} / Σ_{k=1}^{K} e^{Y_k}

where Y_i is the input of the Softmax classifier, P(Y_i) is the output probability, and K is the total number of defect categories. The above is the basic structure of IRA-CNN, which can effectively extract the defect features of the injector valve; the network structure is shown in Table 1.
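As a minimal illustration, the Softmax formula maps the four network outputs (non-defect, dirt, EC, scratch) to class probabilities; the logit values below are made up.

```python
import numpy as np

def softmax(logits):
    """Softmax classifier: P(Y_i) = exp(Y_i) / sum_k exp(Y_k).
    Subtracting the max is the standard numerical-stability trick
    and does not change the result."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

# Four-way output head (K = 4): non-defect, dirt, EC, scratch
p = softmax(np.array([2.0, 0.5, 0.1, -1.0]))
```

The probabilities sum to 1 and preserve the ordering of the logits, so the arg-max class is unchanged by the transform.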

Integrated attention module
Because the surface defect locations of the injector valve are random, IAM is proposed and integrated into the IRA-Block. The proposed IAM cascades a channel attention submodule with a spatial attention submodule. A residual connection links the input and the channel attention map, and another links the channel attention map and the spatial attention map. IRA-CNN can thus suppress unnecessary background areas and highlight the spatial locations of the surface defects of the injector valve. IAM is inspired by CBAM, which cascades channel attention with spatial attention. In its channel attention module, CBAM compresses the spatial information of the feature maps through a GMP layer and a GAP layer, sends each of them through a shared multi-layer perceptron (MLP) with one hidden layer to generate the corresponding feature vectors, and merges the output vectors by element-wise summation. However, forcing the GAP and GMP branches through the shared MLP and summing afterwards does not combine their information well, for the following reason: CBAM is biased toward GMP. Because GAP and GMP share the same layer, the inference can be regarded as GAP and GMP being added directly and sent to the shared layer, and the values of GMP are generally larger than those of GAP, since the ReLU activation function filters out a large number of negative values, leaving many zeros in the feature map and lowering the GAP values. It is therefore biased for CBAM to send GAP and GMP directly into the MLP without feature processing. The proposed IAM instead multiplies the GAP and GMP layers by trainable weights before sending them to the MLP, so the channel attention network can adaptively select the proportions of GAP and GMP input to the MLP. This channel attention module obtains the channel weights after only one inference pass.
Moreover, residual connections are added between the input and the channel attention map, and between the channel attention map and the spatial attention map, which makes network training easier to converge and improves the bilateral nonlinearity and robustness of the network. These are the reasons why IAM achieves better performance than CBAM.
As shown in Fig. 4, given an intermediate feature map M_input ∈ R^(C×H×W) as input, M_IAM is the final output attention map, where C is the number of channels of the current feature map and H and W are its height and width. All of the above operations can be expressed as

M_ICAM = F_CW ⊗ M_input + M_input
M_IAM = F_SW ⊗ M_ICAM + M_ICAM

where ⊗ denotes element-wise multiplication, F_CW ∈ R^(C×1×1) and F_SW ∈ R^(1×H×W) are the channel attention weight and the spatial attention weight respectively, and M_ICAM is the channel attention map. IAM gives IRA-CNN better classification performance and model interpretability and improves its accuracy in locating defects.

Integrated channel attention submodule
In this paper, an integrated channel attention submodule is proposed to obtain the channel attention feature map. The submodule extracts information by squeezing each channel of the feature map so that feature layers with stronger semantic information receive higher weights. The squeezing operations are GAP and GMP, which are shown in [25] and [36] to improve the representation ability of the network. For cylindrical metal workpiece inspection, GAP and GMP have different representation abilities: GMP focuses on the most significant region in the image, compensating for the global view of GAP. Therefore, this paper argues that GAP and GMP should be jointly input into one network. We multiply GAP and GMP by a trainable weight at the input stage, add them, and send the sum to a network to extract information; finally, a residual connection between the input and output helps the network converge. The specific implementation, shown in Fig. 4, is as follows. First, the input feature map is squeezed into the GAP layer F_gap ∈ R^(C×1×1) and the GMP layer F_gmp ∈ R^(C×1×1). A parameter α is introduced, and the weight of GMP is set to e^α/(e^α+1), which is greater than 0 and no more than 1; correspondingly, 1/(e^α+1) is the weight of F_gap, where e is the natural constant. This guarantees that the trained weights stay positive. Then e^α/(e^α+1) F_gmp and 1/(e^α+1) F_gap are added and sent to an MLP to extract information. The hidden layer of the MLP contains C/k neurons, where C is the number of channels after merging the GAP and GMP layers and k is the reduction rate. Finally, the sigmoid function activates the last layer to give the channel weight F_CW, and the integrated channel attention map M_ICAM is obtained by

F_CW = σ(MLP(e^α/(e^α+1) F_gmp + 1/(e^α+1) F_gap))
M_ICAM = F_CW ⊗ M_input + M_input

where MLP denotes sending a feature vector into the shared MLP to generate the corresponding output vector and σ is the sigmoid function.
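The submodule can be sketched in NumPy as follows. Note that e^α/(e^α+1) is exactly the sigmoid of α, and we read the residual connection as adding the input back after reweighting. The variable names, weight shapes, and random test data are our assumptions, not the paper's code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(M, alpha, W1, W2):
    """Integrated channel attention sketch.
    M: feature map of shape (C, H, W); alpha: trainable scalar mixing
    GMP and GAP; W1 (C, C//k) and W2 (C//k, C): shared-MLP weights."""
    f_gap = M.mean(axis=(1, 2))            # global average pooling, (C,)
    f_gmp = M.max(axis=(1, 2))             # global max pooling, (C,)
    w_gmp = sigmoid(alpha)                 # e^a/(e^a+1): weight of GMP
    mixed = w_gmp * f_gmp + (1.0 - w_gmp) * f_gap
    hidden = np.maximum(mixed @ W1, 0.0)   # MLP hidden layer, C//k units
    F_cw = sigmoid(hidden @ W2)            # channel weights F_CW in (0, 1)
    M_icam = M * F_cw[:, None, None] + M   # reweight + residual connection
    return M_icam, F_cw

rng = np.random.default_rng(0)
C, k = 8, 4
M = rng.standard_normal((C, 6, 6))
M_icam, F_cw = channel_attention(M, alpha=0.0,            # alpha=0 -> 0.5/0.5 mix
                                 W1=rng.standard_normal((C, C // k)),
                                 W2=rng.standard_normal((C // k, C)))
```

With α initialized to 0 the GAP and GMP branches contribute equally, matching the initialization described in Section 3.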

Spatial attention submodule
The spatial attention submodule makes the network focus on "where" the defect is, complementing the channel attention submodule. Its input is M_ICAM. First, average-pooling and max-pooling along the channel axis are applied to compute two feature maps F^c_avg and F^c_max; then the two maps are concatenated and passed through a convolutional layer with a 7 × 7 kernel to extract features, and the output is activated to give the spatial weights F_SW. F_SW and M_ICAM are multiplied element-wise, and a residual connection from M_ICAM is added before the final output M_IAM. The above operations can be written as

F_SW = σ(f^{7×7}([AvgPool(M_ICAM); MaxPool(M_ICAM)]))
M_IAM = F_SW ⊗ M_ICAM + M_ICAM

where AvgPool and MaxPool are the average-pooling and max-pooling operations along the channel axis, f^{7×7}(·) is convolution with a 7 × 7 kernel, and σ is the sigmoid function.
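A corresponding NumPy sketch of the spatial attention submodule, with a naive "same"-padded 7 × 7 convolution; the names and random test data are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(M_icam, conv7):
    """Spatial attention sketch. Channel-wise average- and max-pooling
    give two (H, W) maps; they are stacked and convolved with a 7x7
    kernel (conv7, shape (2, 7, 7)) to produce the spatial weights F_SW,
    which are applied with a residual add."""
    f_avg = M_icam.mean(axis=0)               # pooling along the channel axis
    f_max = M_icam.max(axis=0)
    stacked = np.stack([f_avg, f_max])        # (2, H, W)
    pad = 3                                   # 'same' padding for a 7x7 kernel
    padded = np.pad(stacked, ((0, 0), (pad, pad), (pad, pad)))
    H, W = f_avg.shape
    F_sw = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            F_sw[i, j] = np.sum(padded[:, i:i + 7, j:j + 7] * conv7)
    F_sw = sigmoid(F_sw)
    return M_icam * F_sw[None] + M_icam       # reweight + residual connection

rng = np.random.default_rng(1)
M_icam = rng.standard_normal((8, 12, 12))
M_iam = spatial_attention(M_icam, conv7=rng.standard_normal((2, 7, 7)) * 0.1)
```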

Saliency map by grad-CAM++
IAM makes the classification network learn "where to look" and "what to look at". The saliency map is then generated by Grad-CAM++ from the current defect classification score. The saliency map gives the probability that each pixel belongs to a defect; it is therefore both the basis for judging whether the network has really learned the defect features and an important basis for segmentation. Grad-CAM++ uses the gradient information flowing into the last convolutional layer of the CNN to assign an importance value to each neuron for a particular decision of interest: the contribution of each pixel of the feature map to the classification score is obtained by gradient propagation. Its way of obtaining the gradient weights differs from Grad-CAM, and it properly localizes defects when an image contains multiple occurrences of the same class, which Grad-CAM fails to do; the detailed derivation of the gradient weights is given in [30]. To review Grad-CAM++: let S^c be the classification score of defect class c (before the Softmax operation), and let (i, j) and (a, b) be iterators over the same activation map A^k, used to avoid confusion. The gradient weight w^c_k of the feature map is

w^c_k = Σ_{i,j} α^{kc}_{ij} · ReLU(∂S^c/∂A^k_{ij}),
α^{kc}_{ij} = (∂²S^c/(∂A^k_{ij})²) / (2 ∂²S^c/(∂A^k_{ij})² + Σ_{a,b} A^k_{ab} ∂³S^c/(∂A^k_{ij})³)

Finally, the gradient weights are multiplied by the feature maps to obtain the saliency map:

L^c_{ij} = ReLU(Σ_k w^c_k A^k_{ij})

where L^c_{ij} visualizes the importance of each pixel to the current classification score. The saliency map can therefore be used for defect segmentation.
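The weight computation can be sketched as below, assuming the activations and the gradients of the class score with respect to them are already available; a real implementation would obtain the gradients by automatic differentiation, and the random inputs here are synthetic.

```python
import numpy as np

def grad_cam_pp(acts, grads):
    """Grad-CAM++ saliency sketch from the last conv layer.
    acts: activations A^k, shape (K, H, W); grads: dS_c/dA^k, same shape.
    Under the common exponential-output assumption, the higher-order
    derivatives reduce to powers of the first-order gradient."""
    g2 = grads ** 2
    g3 = grads ** 3
    # alpha denominator: 2*(dS)^2 + (sum_ab A^k_ab) * (dS)^3
    denom = 2.0 * g2 + acts.sum(axis=(1, 2), keepdims=True) * g3
    safe = np.where(denom == 0.0, 1.0, denom)
    alpha = np.where(denom == 0.0, 0.0, g2 / safe)
    # gradient weights w_k^c = sum_ij alpha_ij * ReLU(dS/dA_ij)
    w = (alpha * np.maximum(grads, 0.0)).sum(axis=(1, 2))
    # saliency L^c = ReLU(sum_k w_k * A^k)
    return np.maximum((w[:, None, None] * acts).sum(axis=0), 0.0)

rng = np.random.default_rng(2)
acts = np.abs(rng.standard_normal((4, 8, 8)))
grads = rng.standard_normal((4, 8, 8))
L = grad_cam_pp(acts, grads)
```

The final ReLU keeps only pixels that positively influence the class score, which is what makes the map usable as a defect-probability surface.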

Segmentation framework
The saliency map obtained by Grad-CAM++ represents the probability that each pixel is a defect, so it must be binarized to obtain the contour of the defect area. Binarization methods include the fixed-threshold method, maximum-entropy threshold segmentation, and OTSU segmentation [37]. The fixed-threshold method needs a manually chosen threshold, and maximum-entropy thresholding retains many pixels with low defect probability and has poor robustness, so this paper chooses the OTSU method to segment the saliency map. The OTSU method is based on the idea of clustering: the image is divided into foreground and background by gray value. Since variance measures the uniformity of the gray distribution, the greater the variance between background and foreground, the greater the difference between the two parts of the image; when part of the foreground is wrongly assigned to the background, or vice versa, the between-class difference shrinks. The segmentation that maximizes the between-class variance therefore minimizes the probability of misclassification. The OTSU method highlights regions of the saliency map with high defect probability and suppresses those with low probability, forming the binary defect segmentation image. For images classified as non-defect, the saliency map is forced to 0, so the gray values of the defect segmentation image are also 0.
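OTSU's between-class-variance criterion can be implemented in a few lines; the bimodal toy saliency map below is a synthetic example standing in for a real Grad-CAM++ output.

```python
import numpy as np

def otsu_threshold(img):
    """OTSU's method: pick the gray-level threshold that maximizes the
    between-class variance of foreground vs. background."""
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    p = hist / hist.sum()                      # gray-level probabilities
    omega = np.cumsum(p)                       # cumulative class probability
    mu = np.cumsum(p * np.arange(256))         # cumulative class mean
    mu_t = mu[-1]                              # global mean
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b2 = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b2 = np.nan_to_num(sigma_b2)         # 0/0 at the histogram ends
    return int(np.argmax(sigma_b2))

# Bimodal toy saliency map: dark background plus one bright defect blob
sal = np.full((32, 32), 30, dtype=np.uint8)
sal[10:20, 10:20] = 200
t = otsu_threshold(sal)
mask = (sal > t).astype(np.uint8)              # binary defect segmentation
```

On this toy map every threshold between the two modes is equally good, and the recovered mask is exactly the 10 × 10 defect blob.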

Image acquisition system and dataset construction
In this study, the following injector valve machine vision system is used to collect images and detect defects. The system comprises an image acquisition system, an image analysis system, and a mechanical and electrical control system. The mechanical and electrical control system uses one Siemens S7-1500 PLC and two S7-200 PLCs, together with a number of motors that control injector valve feeding and sorting. The image analysis system uses an industrial computer with a GPU for defect detection and interacts with the PLCs over the TCP/IP communication protocol. Besides collecting and analyzing surface defect images, the system also checks whether the outer diameter meets the standard, but this is not the focus of this paper. The image acquisition arrangement is shown on the right side of Fig. 5: from left to right are the camera, the lens, the parallel light source, and the spherical light source. The parallel light source illuminates the middle part of the cylindrical valve, and the spherical light source illuminates its two sides. To make the system easy for readers to understand, the whole injector valve machine vision system is shown in Fig. 5.
There are few datasets of cylindrical metal workpieces captured by industrial cameras, so this paper builds a dataset of injector valve surface defects to train and evaluate the proposed model. The dataset was collected with a Basler acA2440-20gm camera with the exposure set to 2300. The original image size is 2448 × 2048. Since the injector valve occupies only part of the image, the original image was cropped to a 768 × 384 valve region using fixed-threshold binarization. Each valve region was then cut into 8 slices of 192 × 192 pixels, which form the images of the dataset. Because defects usually occupy only a small part of the image, cutting each region into 8 slices is conducive to classification and to generating accurate saliency maps. The dataset construction process is shown in Fig. 6. The dataset contains 6747 images; the training and testing datasets were allocated according to the 5-fold cross-validation method. Pixel-level labels are included in the testing data for dirt and scratch to evaluate segmentation accuracy. Table 2 shows the distribution of the training and testing data.
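The region-slicing step can be sketched as follows; a 768 × 384 region tiles exactly into eight 192 × 192 slices, though the traversal order below is our assumption.

```python
import numpy as np

def slice_region(region, tile=192):
    """Cut a valve region into non-overlapping tile x tile slices.
    A 768 x 384 region yields exactly eight 192 x 192 slices
    (a 4 x 2 grid), row by row."""
    h, w = region.shape[:2]
    assert h % tile == 0 and w % tile == 0, "region must tile evenly"
    return [region[i:i + tile, j:j + tile]
            for i in range(0, h, tile)
            for j in range(0, w, tile)]

region = np.zeros((768, 384), dtype=np.uint8)  # placeholder valve region
slices = slice_region(region)
```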

Evaluation metrics
We use Precision (P), Recall (R), and F-measure to evaluate classification performance, and pixel accuracy (PA) and Mean Intersection over Union (MIoU) to evaluate segmentation performance:

P = TP / (TP + FP)
R = TP / (TP + FN)
F = 2PR / (P + R)
PA = Σ_{i=0}^{k} p_ii / Σ_{i=0}^{k} Σ_{j=0}^{k} p_ij
MIoU = (1/(k+1)) Σ_{i=0}^{k} p_ii / (Σ_{j=0}^{k} p_ij + Σ_{j=0}^{k} p_ji − p_ii)

where TP is the number of defect images correctly predicted as defects, FP is the number of images falsely predicted as defects, and FN is the number of actual defects mistakenly classified as non-defect. p_ii is the number of pixels belonging to class i and also predicted as class i, and p_ij is the number of pixels belonging to class i but predicted as class j.
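The metrics can be computed as below; the per-class counts and the 4 × 4 toy masks are synthetic examples.

```python
import numpy as np

def classification_metrics(tp, fp, fn):
    """Precision, recall, and F-measure from per-class counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = 2 * p * r / (p + r)
    return p, r, f

def pa_miou(pred, gt, num_classes=2):
    """Pixel accuracy and mean IoU for integer-labeled segmentation masks."""
    pa = float((pred == gt).mean())
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union:                          # skip classes absent from both masks
            ious.append(inter / union)
    return pa, float(np.mean(ious))

p, r, f = classification_metrics(tp=90, fp=10, fn=5)

gt = np.zeros((4, 4), dtype=int); gt[:2, :2] = 1      # 2x2 defect block
pred = np.zeros((4, 4), dtype=int); pred[:2, :2] = 1
pred[0, 3] = 1                                        # one false-positive pixel
pa, miou = pa_miou(pred, gt)
```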
Because a single train/test split lacks statistical variability, this paper uses K-fold cross-validation to traverse all samples, which increases the credibility of the classification and segmentation accuracies and verifies that the defect detection results of IRA-CNN are reliable. Specifically, 5-fold cross-validation is used: the dataset is divided into 5 sub-samples, a single sub-sample is kept as the testing dataset while the other 4 form the training dataset, the experiment is conducted 5 times, and the average value is taken as the final test result.
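The 5-fold protocol can be sketched as an index-splitting generator; the interleaved fold assignment is our choice, since any disjoint partition into 5 folds works.

```python
def five_fold_splits(n_samples, k=5):
    """k-fold cross-validation index splits: each fold serves once as the
    testing set while the remaining k-1 folds form the training set."""
    idx = list(range(n_samples))
    folds = [idx[i::k] for i in range(k)]      # k disjoint interleaved folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(five_fold_splits(20))            # toy dataset of 20 samples
```

Over the 5 rounds every sample appears in the testing set exactly once, so the averaged score uses the whole dataset.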

Implementation details
The experimental platform is an industrial computer equipped with an NVIDIA GeForce RTX 2080 (8 GB) graphics card and an Intel i7-9800X CPU; both training and inference are completed on this computer. The model is implemented in PyTorch. The cross-entropy loss function is used, and the Adam optimizer updates the parameters. Following [40], the initial learning rate (λ) is 0.001, the exponential decay rate of the first moment estimate (β1) is 0.9, the exponential decay rate of the second moment estimate (β2) is 0.999, the number of training epochs is 100, and the batch size is 32. In the initial state, this paper assumes that GAP and GMP have the same impact on the classification effect of IAM; therefore α is initialized to 0, so the weights of F_gap and F_gmp are both initialized to 0.5. In Section 3.4.5 we compare different reduction rates; according to the experimental results, k in IAM is uniformly set to 8. In the spatial attention submodule, following [33], the kernel size of the convolutional layer is 7 × 7. The ablation experiments below show that adding IAM significantly improves the classification accuracy.
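A single Adam update with the stated hyper-parameters (λ = 0.001, β1 = 0.9, β2 = 0.999) looks like this; the one-parameter example is synthetic, and in practice the framework's built-in optimizer performs exactly this update per parameter.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. m and v are the running first- and second-moment
    estimates; t is the 1-based step index used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0])
m = np.zeros(1)
v = np.zeros(1)
theta, m, v = adam_step(theta, np.array([0.5]), m, v, t=1)
```

After bias correction the very first step is approximately lr times the sign of the gradient, which is why Adam takes uniformly sized initial steps regardless of gradient magnitude.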

Evaluation
This subsection presents the experimental results. The evaluation consists of four parts: classification of injector valve defects by IRA-CNN, the proposed segmentation framework, the overall evaluation, and the time-efficiency evaluation.

Classification evaluation
For the classification evaluation, since there is no prior work on surface defect detection of the injector valve and the valve is made of stainless steel, we compare our method with metal and steel defect classification methods. The handcrafted features of traditional metal defect classification are as follows. (1) GLCM: a classical method for extracting texture features; following [41], the angle parameter is set to 0, 45, 90, and 135 degrees and the distance parameter to 1, 2, 4, 8, and 16. (2) MLBP: MLBP has the advantages of rotation invariance and gray invariance; the statistical histogram of the LBP feature spectrum is used as the feature vector for classification. (3) HOG: the feature is formed by computing histograms of gradient directions over local regions of the image; the HOG parameters are tuned experimentally, with the block set to 8 × 8, the cell to 4 × 4, and the stride to 4. SVM and MLR are selected as classifiers, with SVM using a linear kernel. In addition, we compared deep learning classification methods such as VGG16 [38] and ResNet50.
We also compare with other weakly supervised defect detection frameworks such as Decaf [39] and RWSLDC [27], of which RWSLDC also has an attention module; in addition, we train a variant with IAM replaced by CBAM and include it in the comparison. As shown in Fig. 7, the classification accuracy of IRA-CNN is higher than that of all other models on the testing dataset, reaching 97.8%. GLCM is typically used to extract texture features and is therefore ineffective where texture is not pronounced; classification using GLCM performs the worst. The SVM classifier performs well on the training dataset but generalizes poorly. Compared with hand-crafted feature extraction, the deep learning methods achieve higher classification accuracy. RWSLDC applies a spatial attention module only to the last feature layer and its network is shallow, so its classification performance is inferior to IRA-CNN. IRA-CNN classifies EC defects best, followed by dirt and scratch, because dirt and scratch occupy a small area of the image and have irregular shapes. The specific data are shown in Table 3: under 5-fold cross-validation, the accuracy of IRA-CNN exceeds 97% in every round, and the precision and recall of EC are higher than those of the other defects. The Precision-Recall curve of each class is drawn in Fig. 8, which includes the PR curves of every round; the AP of each class is close to 1.00, showing that IRA-CNN has impressive classification performance. In addition, Fig. 9 shows the confusion matrices, also covering every round of 5-fold cross-validation. The confusion matrix gives the TP, FP, and TN of each defect type, from which precision and recall are calculated.
It can be observed from Fig. 10 that some dirt defects are easily confused with scratches. These images are labeled 1, 2, 3, and 4. The images labeled 1 and 2 are actually of the dirt type but resemble scratches in shape, while the images labeled 3 and 4 are scratches that resemble dirt. In contrast, EC has distinctive features and is hard to confuse with the other defect types. We therefore conclude that sample images with distinct features are crucial to achieving high classification accuracy.
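Deriving per-class precision and recall from a confusion matrix, as described above, can be sketched like this (the matrix values are toy numbers, not the paper's results):

```python
import numpy as np

def per_class_precision_recall(cm):
    """Per-class precision and recall from a confusion matrix.

    Convention: rows are true classes, columns are predicted classes.
    Precision = TP / column sum, recall = TP / row sum.
    """
    tp = np.diag(cm).astype(float)
    precision = tp / cm.sum(axis=0)
    recall = tp / cm.sum(axis=1)
    return precision, recall

# Toy 3-class matrix, e.g. classes (EC, dirt, scratch)
cm = np.array([[50, 0, 0],
               [0, 45, 5],
               [0, 5, 45]])
p, r = per_class_precision_recall(cm)
print(p, r)
```

With these toy counts, EC is never confused with the other classes (precision and recall 1.0), while dirt and scratch each trade a few samples with the other, mirroring the confusion pattern discussed above.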

Segmentation framework evaluation
Extensive experimental observation shows that EC covers the whole injector valve, so it does not need to be segmented; therefore, only dirt and scratch defects are selected for segmentation in this experiment. The pixel-level annotations of dirt and scratch were labeled with LabelMe. We compare our segmentation module with Ren's and Chen's methods. Chen's SA-CAM is also a weakly supervised segmentation method, and Ren's method is similar to Chen's, so we choose these two as benchmark methods and additionally compare with CAM and Grad-CAM in the ablation experiment. As shown in Fig. 11 and Table 4, the area segmented by Decaf + MLR usually has a high recall due to its large coverage, but its IOU is low. Compared with SA-CAM, the precision of our segmentation framework is improved by about 6.3%. This is because CAM-based segmentation is only sensitive to large defect areas, whereas our framework focuses on all defect areas in the image and therefore covers a tighter region. Compared with the two benchmark methods, the PA and IOU of the dirt defect increase by more than 6%, and those of the scratch defect by more than 1.6%.
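The PA and IOU metrics reported in Table 4 can be computed for binary masks as follows; this is the standard formulation, since the paper does not list its exact implementation:

```python
import numpy as np

def iou_and_pixel_accuracy(pred, gt):
    """IoU and pixel accuracy (PA) for binary segmentation masks.

    IoU = |pred AND gt| / |pred OR gt|; PA = fraction of pixels where
    prediction and ground truth agree (including background pixels).
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / union if union else 1.0
    pa = (pred == gt).mean()
    return iou, pa

pred = np.array([[1, 1, 0, 0]])
gt = np.array([[1, 0, 0, 0]])
print(iou_and_pixel_accuracy(pred, gt))  # IoU = 0.5, PA = 0.75
```

This toy example shows why an over-covering prediction (as with Decaf + MLR) can keep recall high while dragging IoU down: every extra predicted pixel enlarges the union without enlarging the intersection.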

Overall evaluation
From the classification and segmentation evaluations above, we conclude that IRA-CNN is superior to the other classification methods on injector valve classification and performs well in weakly supervised defect segmentation. The reasons are as follows: IAM is integrated into every layer of IRA-CNN, which makes the network pay more attention to the defect area and thus improves classification performance; IAM also helps Grad-CAM++ produce more accurate saliency maps, and Grad-CAM++ makes all defect areas more prominent, which improves segmentation accuracy.
Fig. 11 Comparison of the proposed method with Decaf + MLR and SA-CAM; the last row shows the pixel-level annotation masks

Time-efficiency evaluation
FPS is measured on the testing dataset.
Since IRA-CNN has more parameters than Chen's RWSLDC, its inference is relatively slow, but it is still faster than SVM and other machine-learning classifiers. The training and testing time of each method is given in Table 6.
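Measuring FPS over a test set reduces to dividing the number of images by the total inference time; a minimal sketch, with a trivial stand-in for the model's inference call:

```python
import time

def measure_fps(infer, images):
    """Average frames per second: number of images / total inference time."""
    start = time.perf_counter()
    for img in images:
        infer(img)
    elapsed = time.perf_counter() - start
    return len(images) / elapsed

# `lambda img: sum(img)` is a placeholder for IRA-CNN inference on one image
fps = measure_fps(lambda img: sum(img), [[1, 2, 3]] * 100)
print(fps > 0)
```

In practice one would also run a few warm-up inferences first and synchronize the GPU before stopping the timer, since CUDA kernels launch asynchronously.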

Reduction rate of IAM
The reduction ratio k introduced in Section 2.2.1 is a hyper-parameter that allows us to vary the capacity and computational cost of the IAM in the network. We set the reduction ratio k in IAM to different values; the results are shown in Table 5. They show that the number of parameters decreases as k increases, and the accuracy is highest when k = 8. Therefore, we set k to 8.
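The trade-off can be illustrated by counting the parameters of a squeeze-and-excitation style bottleneck MLP (two fully connected layers, C → C/k → C, biases omitted). This is an assumed structure for the channel attention submodule, and C = 256 is an arbitrary example, not a figure from the paper:

```python
def channel_attention_params(c, k):
    """Parameter count of a C -> C/k -> C bottleneck MLP (no biases).

    A larger reduction ratio k shrinks the hidden width C/k and therefore
    the parameter count, matching the trend reported in Table 5.
    """
    hidden = c // k
    return c * hidden + hidden * c

for k in (4, 8, 16):
    print(k, channel_attention_params(256, k))
```

Accuracy, however, is not monotone in k: too large a ratio starves the attention module of capacity, which is consistent with k = 8 being the best compromise in Table 5.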

Effect of IAM
To prove the effectiveness of the IAM module, the attention module is added or replaced in the ablation studies. As shown in Table 7, the effectiveness and superiority of IAM are demonstrated by comparing against the CBAM variant and against removing IAM. Figure 12 shows the saliency maps generated by Grad-CAM++: the saliency maps with an attention module are better than those without, and the maps generated with IAM are more accurate, with the defect area more prominent and the background more strongly suppressed. Table 4 shows the advantages of Grad-CAM++ over Grad-CAM and CAM. To train CAM, we add a GAP layer after the last feature layer and freeze all previous layers for retraining. CAM performs the worst, because adding a GAP layer to the last layer loses some semantic information and reduces classification accuracy; accordingly, Table 4 shows that the precision of CAM is greatly reduced. Grad-CAM performs better than CAM, but as shown in Fig. 12, when an image contains many dirt defects, Grad-CAM does not cover all of them. Grad-CAM++ therefore has better robustness and adaptability in injector valve segmentation, especially for dirt defects.
Fig. 12 The top row is the ablation experiment of the attention module; the bottom row is the comparison between Grad-CAM and Grad-CAM++
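Turning a Grad-CAM++ saliency map into a binary defect mask can be sketched as min-max normalization followed by thresholding; the paper's actual post-processing pipeline may differ, and the 0.5 threshold here is an illustrative assumption:

```python
import numpy as np

def saliency_to_mask(saliency, thresh=0.5):
    """Binarize a saliency map (e.g. from Grad-CAM++) into a defect mask.

    Minimal sketch: min-max normalize to [0, 1], then threshold. Pixels at
    or above `thresh` are marked as defect (1), the rest as background (0).
    """
    s = saliency - saliency.min()
    rng = s.max()
    if rng > 0:
        s = s / rng
    return (s >= thresh).astype(np.uint8)

sal = np.array([[0.1, 0.9],
                [0.2, 0.8]])
print(saliency_to_mask(sal))
```

A saliency map that highlights all defect regions, as Grad-CAM++ does here, yields a mask covering every defect after thresholding, whereas a map that misses some dirt spots (as observed for Grad-CAM) leaves holes in the mask that no threshold can recover.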

Conclusion
In this paper, the injector valve was taken as representative of cylindrical metal workpieces. Because the variety and complexity of cylindrical metal workpiece defects make pixel-level annotation expensive, an end-to-end weakly supervised learning framework was proposed to classify and segment surface defects. The framework uses only image tag annotation for training and performs defect classification and defect segmentation simultaneously. IAM was proposed to suppress the interference of useless background areas and highlight the defect area, and IRA-CNN was designed with IAM integrated to classify the defects. The accuracy of IRA-CNN is at least 1.9% higher than that of the other methods; 5-fold cross-validation verifies that the accuracy reaches 97.84%. Weakly supervised segmentation is achieved with Grad-CAM++, which generates the saliency maps; pixel-level defect segmentation based on these saliency maps greatly simplifies the image segmentation task. The segmentation precision is at least 6.3% higher than that of the other models, so segmentation precision is improved while labeling time is saved. Because this weakly supervised framework needs only image-level annotation, the cost of manual annotation is reduced.
There are some limitations in our framework. For example, it cannot classify or segment unknown defects, which is a common problem in industrial environments. In the future, we will study how to integrate unclassified defects into the classification network and realize their segmentation, continue to optimize IRA-CNN, and try different convolutional kernel forms to improve classification accuracy.