Light-Trans YOLO: A Lightweight Network Based on Transformer for Defect Detection of Industrial Lace Surface

Lace surface defect detection has always been a crucial step in the industrial production of lace products. However, owing to the complex texture and deformability of lace, and the difficulty of distinguishing minor defects from normal images, the detection of defects on lace surfaces is a challenging but rarely studied task. In this paper, we propose a new lightweight detection framework, Light-Trans YOLO, to detect lace surface defects. First, our backbone network uses the lightweight network C3 GhostNet. In addition, to obtain more complete global information, we add the lightweight Mobile Transformer Block (MTB) to the backbone network. Then we use the proposed standard depth-wise separable convolution (SDSConv) and SDSBottleneck to design a new neck and add Coordinate Attention (CA) at its end, which overcomes the information loss of depth-wise separable convolution and extracts more effective information. We conduct experiments on an industrial lace surface defect dataset collected at lace production sites. The experiments show that the mAP of our model is 96.6%, 7.7% higher than that of YOLOV5s, and that its FPS and F1-score reach 50.3 and 0.93 respectively, indicating that our model achieves a good trade-off between detection accuracy and speed.


INTRODUCTION
Lace surface defect detection is a very important step in product quality control in the actual production of the fabric industry. However, because of the complex texture of lace, its susceptibility to deformation, and the difficulty of distinguishing small defects from normal images, the detection of surface defects in lace has been little studied and is therefore a very meaningful task. Traditional inspection methods are mainly manual, but the characteristics of lace itself and the subjective factors of human inspectors often lead to false and missed detections. Therefore, considering both inspection accuracy and production costs, an efficient and low-cost inspection method to replace manual inspection is necessary. With the development of artificial intelligence and deep learning, deep learning-based detection methods have gradually become the mainstream of object detection. Since AlexNet [1] popularized the convolutional neural network (CNN) in 2012, it has become the primary model for computer vision tasks. Deep learning-based object detectors can be divided into two-stage and single-stage detectors. Typical two-stage detectors include R-CNN [2], Fast R-CNN [3] and Faster R-CNN [4]; the most common single-stage detectors are the YOLO series. However, conventional CNN networks such as YOLO cannot build good connections between the feature extraction network and the feature information of the image. With the increasing integration of natural language processing and computer vision, the Transformer [5] is gradually achieving promising results in object detection and image classification. The deficiencies of the convolution operation cause the loss of much valuable feature information and ignore the association between global and local features, while the Vision Transformer is more robust to interference factors and region size variation in images and can compensate for these shortcomings. Therefore, to reduce the parameters of the model, we add the lightweight Ghost bottleneck module to the original backbone network CSPNet to obtain a new feature extraction network, C3 GhostNet, which has fewer parameters and less computation while maintaining detection accuracy. Besides, we add the Mobile Transformer module to the last layer of the backbone network to obtain more contextual information and different levels of features during feature extraction, which further improves the performance of the detector.
Based on the above analysis and inspired by works such as MobileNet [6], ShuffleNet [7] and EfficientNet [8], this paper proposes a detector, Light-Trans YOLO, for lace defect detection based on YOLOV5. The model integrates C3 GhostNet and a Transformer module to take into account both the global and regional information of the image. In the neck network, we use SDSConv to process the extracted feature information and combine it with SDSBottleneck to design a lightweight neck. In addition, we collect and label an industrial lace surface defect dataset (LSDD) in a real industrial scenario. Experiments show that the proposed Light-Trans YOLO achieves an mAP of 96.6% and an FPS of up to 50.3 on the LSDD, improving detection accuracy while maintaining detection speed. Our main contributions can be summarized as follows:
1. A Transformer-based lightweight network for lace surface defect detection is proposed. A Mobile Transformer module is added to the last layer of the backbone network to strengthen the association of global feature information within the network. The Ghost bottleneck is used in CSPNet to improve the quality of the extracted features and greatly reduce the number of parameters and the computational effort.
2. A new SDSConv is proposed, which greatly alleviates the information loss of depth-wise separable convolution (DSC). A new structure, SDSNeck, is proposed for the neck part of the network, which makes the overall model lightweight.
3. Experiments are conducted on our proposed lace surface defect dataset (LSDD), and the results show that the proposed model achieves 96.6% mAP and 50.3 FPS.
The rest of the article is structured as follows: Section 2 reviews the knowledge and work related to lace defect detection and its applications. Section 3 describes the steps of the whole model and the structure of each part. To test the validity of the proposed model and the necessity of each module, we conduct extensive experiments, whose results are presented in Section 4. Finally, Section 5 concludes the study and gives an outlook on future work.

RELATED WORK

Existing defect detection methods
Existing object detection strategies are mainly divided into traditional methods and deep learning-based methods. Traditional defect detection can be divided into statistical, spectral and model-based methods. For example, Tsai et al. [9] established a weighted variance matrix within the pixel neighborhood and identified defects based on the differences in eigenvalues between the variance matrices. Wen and Xia [10] extracted edge features in images to detect leather surface defects. Given the drawbacks of traditional methods, more and more defect detection tasks adopt deep learning-based methods. Shao et al. [11] proposed a large lace dataset for the first time, focused on the differences between the high-level semantic features extracted by the neural network, and proposed a de-deformation defect detection network for lace surface defect detection. Xu et al. [12] proposed a method for detecting and locating periodic lace surface defects which only required defect-free image samples; it reconstructed contrast twins to increase the morphological similarity of input image pairs and then detected defects on the residual map between the output and the image to be measured. Ma et al. [13] proposed a lightweight aluminum strip defect detector based on YOLOV4, which could effectively simplify the parameters of the model and achieve accurate detection of aluminum strip defects. Besides, the lightweight models mentioned above are promising for a wide range of applications in mobile devices and embedded systems. MobileNet reduced the amount of computation and the number of parameters by using depth-wise separable convolutions to improve the efficiency and accuracy of the model; it was optimized for mobile devices and can be used for tasks such as image classification, object detection and semantic segmentation. ShuffleNet was based on grouped convolutions designed for efficient deep learning, and it controlled computation by dynamically adjusting the number of channels of the grouped convolutions. EfficientNet improved the performance of computer vision tasks through a modular design that balanced model size and accuracy by adjusting the network width, depth and resolution; it provided high accuracy in a variety of computer vision tasks, along with faster inference and lower memory requirements.

Combination of CNN and Transformer
Since the convolution operator has a limited local receptive field, it cannot obtain complete global feature information, while the self-attention mechanism of the Transformer can compensate for this shortcoming of CNNs. Wang et al. [14] combined a transformer with a CNN to investigate, for the first time, the use of transformers in accurate pixel-level surface defect detection. Wang et al. [15] proposed a variant resonant Transformer and designed a new window transfer scheme to further enhance feature transfer between windows and improve the detection effect. Zhang et al. [16] applied a Transformer-based module to the backbone and detection head, which realized the combination of local features and global information and enhanced the dynamic adjustment of the detector to different scales of steel defects.

Overall framework
As can be seen in Fig. 1, in the backbone of the model we use GhostConv and GhostBottleneck to form the new feature extraction network C3 GhostNet; this change compensates for the drawbacks of convolutional neural networks with many parameters and deep layers. We then add the MTB module in the last layer, which takes into account both global and regional feature information. Considering the detection speed required in industrial field production, we further optimize the network structure in the neck with the proposed SDSNeck, reducing the model parameters while preserving information integrity. Finally, we add CA [17] at the end of the SDSNeck to improve the effectiveness of the fused feature information.

Mobile Transformer Block
Due to the structural characteristics of CNNs, only basic visual information can be extracted by the shallow layers of the network. The Transformer gradually overcomes this problem, as it can effectively enhance the connection between shallow feature information and semantic features [18]. In the lace defect detection task, combining a CNN and a Transformer in the feature extraction stage can extract feature-related information over a larger neighborhood, thus improving detection accuracy.
As shown in Fig. 2, MTB [19] is mainly composed of Mobile Self-attention (MOSA) and a Mobile Feed-Forward Network (MOFFN). MOSA proposes a branch sharing mechanism, which can be written as Equation 1, where Q, K, V represent the linear transformation matrices of the input embedding. Compared with Q and K, V needs to retain more semantic information and has a stronger correlation with self-attention; hence, simplifying the computation of Q and K can achieve a balance between performance and loss. The MOFFN branch replaces the FC layer in the normal MLP with the Ghost module and applies the SE module after the Ghost module. Compared with the traditional Transformer, the MOFFN module can effectively extract information in the spatial dimension and improve the computational efficiency of the model. In general, adding the MTB module at the end of the backbone explores the potential of feature representation to extract contextual information more efficiently without increasing the number of parameters.
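For reference, MOSA's branch sharing starts from the standard scaled dot-product self-attention of the original Transformer [5]:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```

Since V carries most of the semantic information, MOSA keeps the V branch intact and lightens the branches that produce Q and K; the exact simplified form is given in Equation 1.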

SDSNeck
To make the network lightweight and reduce the parameters of the model, existing networks mainly rely on the depth-wise separable convolution depicted in Fig. 3(b). Heavy use of depth-wise separable convolutions reduces the parameters, yet as the feature maps flow through the network, much important feature information is lost, resulting in a poor detection effect. As shown in Fig. 3(c), we feed the input into a standard convolution and a depth-wise separable convolution in parallel, and then apply a shuffle operation to uniformly blend the concatenated output features of both; this module, named SDSConv, reduces the parameters of the model and overcomes the problem of information loss. Based on SDSConv, Fig. 4 shows the structure of the SDSBottleneck we design. Then, to improve the reuse rate of features, we propose a cross stage partial network (SDSCSPC) based on SDSBottleneck, shown in Fig. 4, using the one-at-a-time aggregation method. Besides, we make use of CA at the end of the neck to enhance the effectiveness of feature extraction and improve the quality of the feature information.
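A minimal PyTorch sketch may make the SDSConv structure concrete. The even split of output channels between the two branches, the BatchNorm/SiLU choice, and the two-group shuffle are our assumptions for illustration, not details taken from Fig. 3(c).

```python
import torch
import torch.nn as nn

class SDSConv(nn.Module):
    """Sketch of SDSConv: a standard-convolution branch and a depth-wise
    separable branch run in parallel; their outputs are concatenated and
    uniformly blended with a channel shuffle."""
    def __init__(self, c_in, c_out, k=3, s=1, groups=2):
        super().__init__()
        c_half = c_out // 2
        p = k // 2
        # standard convolution branch
        self.std = nn.Sequential(
            nn.Conv2d(c_in, c_half, k, s, p, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        # depth-wise separable branch: depth-wise conv then point-wise conv
        self.dws = nn.Sequential(
            nn.Conv2d(c_in, c_in, k, s, p, groups=c_in, bias=False),
            nn.Conv2d(c_in, c_half, 1, bias=False),
            nn.BatchNorm2d(c_half), nn.SiLU())
        self.groups = groups

    def forward(self, x):
        y = torch.cat([self.std(x), self.dws(x)], dim=1)
        # channel shuffle: interleave the channels of the two branches
        b, c, h, w = y.shape
        y = y.view(b, self.groups, c // self.groups, h, w)
        return y.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 16, 64, 64)
print(SDSConv(16, 32)(x).shape)  # torch.Size([1, 32, 64, 64])
```

Because only half of the output channels pass through a full standard convolution, the module has fewer parameters than a plain convolution of the same width, while the standard branch preserves cross-channel information that a pure depth-wise separable stack would lose.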

EXPERIMENTS

Dataset
We collected and labeled a large number of lace surface defect images in real industrial production scenarios. The dataset (LSDD) contains four types of lace surface defects: hole, broken yarn, jacquard hole and edge. To keep the samples balanced, each of the four defect types is represented by an equal proportion of 800 images, with 2560 images in the training set and 640 images in the validation set. The resolution of the images is set to 512x512, and the ratio of the training set to the validation set is 8:2; some examples of defects are shown in Fig. 5. Before the experiments, we apply pre-processing operations to the lace images, including random cropping, random rotation and Mosaic augmentation.
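As a quick sanity check, the dataset figures above are mutually consistent:

```python
# four defect classes, 800 images each, split 8:2 into train/validation
total = 4 * 800
train = int(total * 0.8)
val = total - train
print(total, train, val)  # 3200 2560 640
```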

Evaluation metrics
To validate the detection performance of our method, mean average precision (mAP), frames per second (FPS) and F1-score are used as the evaluation metrics. F1-score, mAP and FPS are defined in Equations (2)-(4), respectively:

F1-score = (2 × Precision × Recall) / (Precision + Recall)    (2)

mAP = (1/C) × Σ_{c=1}^{C} AP_c    (3)

FPS = F_n / T    (4)

where Precision and Recall represent the precision and recall of model detection, respectively, C is the number of defect classes, AP_c is the average precision of class c, and F_n and T are the total number of frames detected and the total time taken for detection, respectively.
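The three metrics reduce to simple formulas; the following sketch shows how they are computed. The numerical inputs are illustrative values only, not results from our experiments.

```python
def f1_score(precision, recall):
    # Eq. (2): harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

def mean_ap(ap_per_class):
    # Eq. (3): mean of the per-class average precisions over C classes
    return sum(ap_per_class) / len(ap_per_class)

def fps(n_frames, total_seconds):
    # Eq. (4): frames detected per second of inference time
    return n_frames / total_seconds

# illustrative values only
print(round(f1_score(0.96, 0.90), 2))             # 0.93
print(round(mean_ap([0.9, 0.8, 0.95, 0.85]), 3))  # 0.875
print(fps(503, 10))                               # 50.3
```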

Implementation details
To train our detection model, all experiments are run on a Windows 10 operating system with an NVIDIA GeForce RTX 2080 Ti GPU, CUDA 11.1, Python 3.8.13 and PyTorch 1.7.1. In the training process, to avoid the gradient explosion or vanishing that an uninitialized model may cause, we initialize the parameters of the model with a pre-trained model obtained by training YOLOV5 on the PASCAL VOC dataset. We then use the SGD optimizer with a weight decay of 0.0005 and a momentum of 0.937 as defaults, and the learning rate decays from 0.01 to 0.0001. To strengthen the stability of the training process, we adopt a warm-up strategy during the first 3 epochs of training. The batch size is set to 8, and all models and experiments are trained for 500 epochs with the above settings.
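A sketch of the learning-rate schedule implied by these settings may be useful. The cosine decay shape between the endpoints is an assumption; the text only fixes the initial rate (0.01), the final rate (0.0001) and the 3-epoch warm-up.

```python
import math

def lr_at_epoch(epoch, total=500, warmup=3, lr0=0.01, lr_final=0.0001):
    """Linear warm-up over the first `warmup` epochs, then an assumed
    cosine decay from lr0 down to lr_final over the remaining epochs."""
    if epoch < warmup:
        return lr0 * (epoch + 1) / warmup          # linear warm-up
    t = (epoch - warmup) / max(1, total - warmup)  # decay progress in [0, 1]
    return lr_final + 0.5 * (lr0 - lr_final) * (1 + math.cos(math.pi * t))

print(round(lr_at_epoch(0), 5), round(lr_at_epoch(499), 5))  # 0.00333 0.0001
```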
Quantitative analysis: We quantitatively analyze the detection performance of the different models using mAP and FPS. The detection accuracy of all models is shown in Table 1. By comparison, our model achieves the highest detection accuracy for all defects. For hole and edge, which are relatively fixed in position and shape, our model achieves 96.5% and 99.5% detection accuracy, respectively, better than the other models. It achieves 98.1% detection accuracy for jacquard hole, and even for broken yarn, whose defect shape varies greatly, the performance of our model is considerably improved compared to the other models.
We test the detection speed of the above models on the same hardware platform, and the results are shown in Table 1. Our model reaches a detection speed of 50.3 FPS, second only to YOLOV4-tiny. The detection speeds of YOLOV5s, SSD, CenterNet, YOLOX and DETR are all lower than that of our model, and Faster RCNN is the slowest at just 14.6 FPS. In addition, we calculate the model complexity and parameters for each of the seven models; as Table 2 shows, our model has the fewest parameters and the lowest complexity, 5.2M and 8.8 GFLOPs respectively, which means that our network better achieves light-weighting.
Qualitative analysis: The detection results of all models on the four defects are shown in Fig. 6. As can be seen in row b of Fig. 6, all models except ours miss some defects. In the third row, for broken yarn with irregular shape and orientation, YOLOV5s, SSD, CenterNet, YOLOV4-tiny and YOLOX locate only part of the defect area, and the detection results of Faster RCNN and DETR contain more non-target regions, leading to over-detection, while our model locates and identifies the defective regions well. From the overall visual comparison, our model better detects and locates the four lace defects, and it is relatively more accurate and practical than the others.

Ablation experiment
We perform ablation experiments on the LSDD to verify the validity of each module of our model; the results are compared in Table 3. When the MTB module is added at the end of the backbone network of YOLOV5s, the detection accuracy of the model is significantly improved. Our analysis shows that, owing to its structure, MTB can capture the global contextual information of the image through self-attention, thus extracting more complete feature information and improving the detection performance of the model by 7.4%. In addition, employing C3 GhostNet greatly reduces the number of parameters of the model while maintaining the detection accuracy. The SDSNeck further reduces the parameters and complexity of the model, but the detection accuracy decreases as the number of parameters decreases. We therefore add the attention mechanism at the end of the neck, which improves the detection performance of the model with only a small increase in parameters and complexity.

CONCLUSION
In this paper, we proposed a lightweight network, Light-Trans YOLO, for lace surface defect detection. Light-Trans YOLO combines a CNN and the Transformer module MTB in the backbone to improve the feature extraction capability of the network and overcome the problem of missed detection of small defects. To reduce the parameters and complexity of the model while maintaining detection accuracy, a novel lightweight neck network is designed. On the LSDD dataset, Light-Trans YOLO achieves an mAP of 96.6%, an improvement of 7.7% over the baseline, with FPS and F1-score of 50.3 and 0.93, respectively, indicating that our method has outstanding accuracy and great potential for application in real-time detection. In future work, we will continue to extend our dataset to better detect a variety of defects in real industrial production, further improve the performance of Light-Trans YOLO by analyzing the causes of failure cases, and deploy it on mobile devices to improve the utility of the algorithm.

Figure 1: The overall architecture of Light-Trans YOLO. The lace image is first fed into the backbone network, consisting of C3 GhostNet and the Transformer module (MTB), to effectively extract the global and local feature information of the image (M2-M4). The feature map is then input to the proposed lightweight neck network, composed of the proposed SDSConv and SDSCSPC, to better fuse the feature information. The fused image features are then processed by the CA attention mechanism to output more effective information for the detection head (×3, ×6, ×9 represent the numbers of modules).

Figure 3: (a) represents the computational process of standard convolution, (b) is the process of depth-wise separable convolution, and (c) describes our proposed SDSConv; k represents the size of the convolution kernel.

Figure 6: The detection results of the four lace defects by different models; it can be seen that our method outperforms the other models in defect location and identification.

Table 1: The detection accuracy of different models for various defects on LSDD

Table 2: Complexity and number of parameters for different models

Table 3: Effectiveness of different module designs for lace defect detection