Vision-based melt pool monitoring for wire-arc additive manufacturing using deep learning method

Wire-arc additive manufacturing (WAAM) has been widely recognized as a promising alternative for fabricating large-scale components, owing to its high deposition rate and high material utilization. However, anomalies such as humping, spattering, robot suspend, pores and cracking may occur during the deposition process. This study applies deep learning to visual monitoring in order to diagnose different anomalies during the WAAM process. Melt pool images of different anomalies were collected for training and validation by a visual monitoring system. The classification performance of several representative convolutional neural network (CNN) architectures, including ResNet, EfficientNet, VGG-16 and GoogLeNet, was investigated and compared; classification accuracies of 97.62%, 97.45%, 97.15% and 97.25% were achieved by these models, respectively. The results show that CNN models are effective in classifying different types of WAAM melt pool images. Our approach is applicable beyond WAAM and should benefit other additive manufacturing and arc welding techniques.


Introduction
As a directed energy deposition (DED) process, WAAM has emerged as a suitable alternative for fabricating medium-to-large metal components [1][2][3]. WAAM employs an electric arc as the heat source to fuse welding wire and deposits material layer by layer to form a three-dimensional object [4]. The types of electric arc adopted in WAAM mainly include gas metal arc (GMA), gas tungsten arc (GTA) and plasma arc (PA). Compared to other forms of additive manufacturing, the major advantages of WAAM are its high deposition rate and high material utilization [5]. Typically, the deposition rate for laser and electron beam additive manufacturing is about 2-10 g/min, while that of WAAM can exceed 160 g/min [6]. WAAM also has a lower capital cost, as the equipment in a WAAM system is readily available from an array of suppliers in the mature welding industry. Unlike electron beam-based additive manufacturing, the WAAM process does not require a vacuum environment, and in comparison to laser-based methods, WAAM offers a more efficient heat source, especially for reflective metal alloys such as aluminium, copper and magnesium [7]. Owing to these advantages, WAAM has extensive application prospects in the maritime [8], aerospace [9] and automotive [10] industries.
To ensure manufacturing quality, improve the level of automation and meet industrial requirements for WAAM, it is crucial to develop reliable monitoring systems capable of identifying abnormalities and defects during the WAAM process. In recent years, several profound studies on monitoring and control of the WAAM process have been established. For example, Xia et al. [11] developed a visual sensing system to monitor the geometry of the melt pool during the WAAM process and implemented real-time feedback control. Chabot et al. [12] utilized a thermal camera to monitor the thermal distribution and history during the WAAM process. Zhan et al. [13] applied a welding camera to monitor wire deflection during the WAAM process. Zhao et al. [14] collected spectrum signals and welding pool images to monitor anomalies during WAAM. However, studies on monitoring anomalies in the melt pool during the WAAM process are still insufficient. During the WAAM process, anomalies such as humps and spatter can be generated due to poor process parameters or equipment settings. When these anomalies occur, the melt pool exhibits different morphologies, so by classifying the melt pool images, the anomalies generated during the WAAM process can be detected in real time.
In recent years, artificial intelligence (AI) technology has developed remarkably and has been applied in various fields. In laser additive manufacturing, there have been several attempts to apply AI algorithms to process monitoring and optimization. Aminzadeh and Kurfess [15] utilized a Bayesian classifier to classify layer surface quality during powder-bed additive manufacturing, so that defective and unacceptable build regions could be detected. Scime and Beuth [16] applied a CNN to achieve autonomous powder bed anomaly detection and classification. Caggiano et al. [17] collected real-time images during the selective laser melting (SLM) process and detected defect-related patterns through automated image feature learning and feature fusion using CNN models. Kwon et al. [18] developed a deep neural network to classify melt pool images with respect to six laser power labels in SLM; this work was expected to detect abnormal states and separate defective parts non-destructively.
From a technological perspective, WAAM shares a similar process with conventional arc welding, and machine learning has already been applied in the welding field. Liu et al. [19] utilized a CNN-LSTM model to classify images during CO2 welding, categorizing them into welding through, welding deviation and normal welding; this work was expected to achieve online defect recognition for CO2 welding. Xia et al. [20] developed a vision-based monitoring system for the keyhole GTAW process, in which a CNN model (ResNet) was utilized to recognize different welding states. To remove the noise in weld pool images, Feng et al. [21] employed a generative adversarial network (GAN) model to generate de-noised images; furthermore, a voting-based ensemble model combining multiple CNNs was proposed to classify images from multi-source sensing methods, including active vision, passive vision and reverse electrode images (REIs), allowing different welding penetration states to be identified. In research by Ren et al. (2020), welding sound data was transformed into logarithmic time-frequency spectrograms and used as the input of a CNN model for welding penetration state classification.
To the best of our knowledge, few studies have extended the application of deep learning to the WAAM field.
Compared to traditional classification algorithms, deep learning algorithms are able to learn high-level features from data incrementally, and higher accuracy can be obtained. Therefore, this study proposes to employ deep learning in the process monitoring of WAAM. The deposition process of WAAM is time-consuming and may cost considerable manpower; a monitoring system that detects anomalies automatically would help improve production quality and save labour costs. Our study aims to diagnose different process anomalies during the WAAM process, including humping, spattering and robot suspend, and assesses the suitability of several state-of-the-art CNN architectures.
Our paper makes the following contributions: a visual sensing system for the WAAM melt pool was developed; to achieve anomaly detection, this study proposes to apply CNNs to classify different melt pool states, which, to the best of our knowledge, is the first attempt to apply CNNs to the process monitoring of WAAM; and state-of-the-art CNN architectures were investigated and compared. The proposed framework could be applied in automation systems for WAAM, promoting its industrialization. It is applicable beyond WAAM and would benefit other additive manufacturing and arc welding techniques.

Methodology
The abnormal states investigated in this study were categorized into four types: humping defects, robot suspend, spatter and the normal state. The humping phenomenon is a common welding defect produced by the combined action of surface tension forces and fluid flow patterns [22]. Humps can be produced by improper welding process parameters, such as heat input, welding speed, shielding gas flux and the chemical composition of the base metal. The humping defect compromises the mechanical properties and geometrical integrity of the deposited component, so real-time monitoring for humping defects is necessary. During the WAAM process, programming bugs, the robot's displacement restrictor, improper operation and other equipment abnormalities may cause the robot to be suspended during deposition while the welding machine is still working. If this abnormal situation is not diagnosed in time, the deposited part can be damaged, and safety problems may even arise. Spatter can be generated by an unstable welding process, a lack of shielding gas or contamination in the material, and can also be viewed as a defect in the WAAM process. When these anomalies are produced, the melt pool exhibits different characteristics. By learning and classifying the different melt pool images, the process anomalies generated during WAAM can be detected.
The framework of this study is as follows: firstly, a dataset of different welding states was collected by experiments and labelled. The dataset was then split into a training dataset and a testing dataset; the training dataset was used for model training, and the testing dataset for model assessment. To increase the diversity of the training dataset, data augmentation [23] was performed, including reflection, scaling, rotation and translation. Four representative CNN architectures that differ in structure and depth were investigated: GoogLeNet, VGG-16, ResNet and EfficientNet. To accelerate the training process, the models adopted pre-trained weights learned from other datasets, such as ImageNet [24]. The block diagram in Fig. 1 provides an overview of the proposed framework in the training phase.

CNN
Nowadays, deep learning has become a popular method in machine vision and pattern recognition due to its superiority in feature learning and image classification. Within deep learning, the CNN is one of the most popular algorithms and has been successfully applied to object detection, action recognition and image classification [25,26]. A classical CNN architecture consists of a series of successive layers, such as convolutional layers, pooling layers, dropout and fully connected layers. The convolutional layer plays the most fundamental role: it extracts features from the image by connecting each node to a small subset of spatially connected neurons in the input image channels. The main tasks of the pooling layer are to perform downsampling while retaining vital information; it simplifies the spatial dimensions of the feature maps, reduces the number of parameters and helps prevent overfitting. The last layers of a CNN are fully connected layers, which combine the extracted high-level features. In these layers, each neuron is fully connected with the previous layer, and all inputs are combined to generate the output categories. By combining these layers in different strategies, a series of novel architectures has been established.
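As a concrete illustration of this layer sequence, a minimal PyTorch classifier for the four melt pool classes might look as follows; the channel widths are arbitrary assumptions for the sketch, not the architectures evaluated in this study.

```python
import torch
import torch.nn as nn

class SmallMeltPoolCNN(nn.Module):
    """Toy CNN: convolution -> pooling -> dropout -> fully connected layer."""
    def __init__(self, num_classes: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # feature extraction
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsampling 224 -> 112
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                             # downsampling 112 -> 56
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),                             # regularization
            nn.Linear(32 * 56 * 56, num_classes),        # for 224 x 224 input
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```

For a batch of 224 x 224 RGB images, the output is one score per class, which a Softmax turns into class probabilities.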

GoogLeNet
GoogLeNet was introduced by Szegedy et al. [27] and is a CNN architecture with 22 layers. GoogLeNet introduced a new concept called the "inception module" (shown in Fig. 2), which is composed of parallel convolution and pooling branches; there are nine such modules in GoogLeNet. In the inception module, 1 × 1 convolution layers are applied to reduce the dimension before the more expensive 3 × 3 and 5 × 5 convolution layers. Through this design, multi-scale features can be extracted while the computational cost is kept manageable.

VGG
VGG networks are among the most widely used CNN models with deep architectures. The depth of this model is increased to 16 (VGG-16) or 19 (VGG-19) layers, while the number of parameters per layer is kept down by using very small (3 × 3) convolution filters. The VGG architecture secured first place in the localization task and second place in the classification task of the ImageNet ILSVRC 2014 challenge. The basic feature of this model is the use of several successive convolutional layers. The VGG-16 model consists of a series of successive 3 × 3 convolutions, 2 × 2 max pooling layers, and three fully connected layers, with the final layer as the Softmax output. The convolutional layers are organized into five groups, and adjacent groups are linked by a max pooling layer. The number of convolution filters is constant within one group and doubles after each max pooling layer. There are 13 convolutional layers and 3 fully connected layers in VGG-16. In this work, VGG-16 was employed to classify the welding images.

ResNet
The ResNet model was proposed by He et al. [28] and won first place in ILSVRC 2015. When CNN architectures become deeper, a rapid degradation problem may arise. To solve this problem, a novel residual block was proposed. As shown in Fig. 3, shortcut connections are introduced in the residual block, skipping multiple network layers. When the input is x and the learned feature is written as H(x), the aim of residual learning is to learn the residual F(x) = H(x) − x between the learned features and the input. ResNet alleviates the vanishing and exploding gradient problems that exist in traditional deep neural networks. In this study, ResNet-50, which has 50 layers, was selected.
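The residual mapping F(x) = H(x) − x can be sketched as a basic block; this is a simplified illustration (matching input and output channel counts are assumed), not the exact bottleneck block used in ResNet-50.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = F(x) + x (identity shortcut)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        # F(x): two conv layers learn the residual between H(x) and x
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        # shortcut connection skips the stacked layers and adds x back
        return self.relu(residual + x)
```

Because the shortcut passes the input through unchanged, gradients can flow directly to earlier layers, which is what mitigates the degradation problem in very deep networks.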

EfficientNet
EfficientNet was proposed to improve the performance of CNNs by scaling in three dimensions (width, depth and resolution), using a set of fixed scaling coefficients to meet specific constraints. The basic block in EfficientNet is called MBConv, an inverted bottleneck convolution. Shortcuts are applied between the bottlenecks, connecting the smaller number of channels (compared to the expansion layers). MBConv integrates depthwise separable convolution, which reduces the computation by a factor of almost k² compared to traditional layers, where k denotes the kernel size, i.e. the height and width of the two-dimensional convolution window. When pre-trained on ImageNet and fine-tuned on Food-101, EfficientNet-B7 yields a state-of-the-art top-1 accuracy of 93%, equal to the performance of Google's GPipe model with 8.7 times fewer model parameters. This is an important development, as the memory requirements for model inference grow as model parameters increase, meaning that strong performance can be achieved with more modest hardware.
EfficientNet is particularly useful for deep learning on the edge, as it reduces compute cost and battery usage and improves training and inference speed. This kind of model efficiency ultimately enables the use of deep learning on mobile and other edge devices.

Transfer learning
There are two main methods for training CNN models: (1) training from scratch and (2) transfer learning. For CNN training, a large-scale training dataset contributes to a generalizable result; however, in practical applications, it may not be easy to obtain large-scale labelled datasets. Moreover, training from scratch may take too long, and overfitting is another potential risk. To address these issues, transfer learning can be used to retrain CNN models. In the transfer learning strategy, the weights of the convolutional layers are first transferred from a pre-trained model to the new network to be trained. In the new network, the weights of the convolutional layers are fixed, while the fully connected layers are retrained for the new task. Transfer learning is a convenient and effective method to acquire knowledge from related fields, eliminating the need to relearn the entire knowledge system from scratch.

Experimental system
In this study, the dataset was collected by experiments; the experimental system is shown in Fig. 4. The chemical compositions of the wire and substrate used are presented in Table 1. A Xiris XCV-1000e welding camera was utilized to capture melt pool images during the experiments. To suppress the strong arc light, a filter with a 650-nm central wavelength was combined with the camera. Table 2 presents the parameters used for the welding camera.

Dataset
As presented in Table 3, different strategies were applied to produce the defective samples for the dataset. For example, to collect humping samples, the welding position was set to the horizontal position [29]; to produce the spattering samples, a lower shielding gas flux (5-10 L/min) was used. As shown in Fig. 5, four categories of melt pool images were collected: humping, spattering, robot suspend and normal. It can be seen that the melt pool exhibits different morphologies under different welding states. In the humping state, the edge of the melt pool shows large fluctuations, while the normal melt pool has a smoother edge. The spattering phenomenon can be clearly observed in the images of the spattering samples. When the robot was suspended, the melt pool exhibited an elliptical shape. The macro morphology of the welding samples is presented in Fig. 6.
The dataset was divided randomly into two parts: a training set (70%) and a testing set (30%).
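The random 70/30 split can be sketched as follows; the index-shuffling approach and the fixed seed are illustrative assumptions, and any equivalent random split would serve.

```python
import random

def split_dataset(n_samples: int, train_frac: float = 0.7, seed: int = 42):
    """Randomly split sample indices into training and testing sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # deterministic shuffle for a fixed seed
    n_train = int(n_samples * train_frac)
    return indices[:n_train], indices[n_train:]
```

The returned index lists can then be used to select the corresponding image files or dataset entries for each subset.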

Implementation
The CNN models were implemented in Python using the PyTorch machine learning library, and training was carried out on Google Colab, a free cloud service that provides free GPU access [30]. The NumPy, Pandas and Matplotlib libraries were also used.

Results and discussion
The learning rate is a vital parameter in learning algorithms. There is no specific formula for calculating the learning rate; it is usually obtained by trial and error. If a poor value is selected, the loss function may fall into a local optimum and the network's performance will decline. In this study, different common learning rate values were tried (as shown in Table 5), and the best performance was obtained with a learning rate of 0.005. A stochastic gradient descent (SGD) optimizer with momentum was utilized to minimize the loss function, since plain SGD can make erratic updates on non-smooth functions. SGD with momentum updates the weights using a moving average of past gradients, which smooths the update direction.
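The optimizer setup can be sketched as follows. The learning rate of 0.005 matches the value selected in the study; the momentum coefficient of 0.9 and the stand-in linear model are assumptions for the sketch.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 4)                   # stand-in for a CNN classifier
criterion = nn.CrossEntropyLoss()
# lr = 0.005 as selected in the study; momentum = 0.9 is an assumed value
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

def train_step(inputs, labels):
    """One SGD-with-momentum update on a single batch."""
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()                        # compute gradients
    optimizer.step()                       # momentum-smoothed weight update
    return loss.item()
```

Each call to `train_step` performs one weight update; an epoch consists of calling it over every batch of the training set.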

A learning curve visualizes the incremental evolution of a classifier's learning performance over time. The accuracy and loss over epochs for the individual networks, evaluated on the training and validation sets, are illustrated in Fig. 7. An epoch denotes one cycle of weight updates over the full training dataset. The loss value denotes the sum of the errors made on each image during training; a small loss indicates that the classifier learned the features of the training and validation datasets with fewer errors, and a high accuracy indicates better learning performance. Training convergence was achieved for these networks after 20 epochs, which is reasonable since the CNN models had been pre-trained on ImageNet. The fluctuations on the graphs indicate the cost function falling into local minima. As inferred from Fig. 7, all CNN models achieved high accuracy in the training phase, and the training and validation loss values of the various classifiers confirm their learning and generalization ability. The training loss values range from 0 to 0.6 and decreased rapidly during training. The initial training accuracy of EfficientNet is relatively low, but it has a strong learning ability and its accuracy increased rapidly during training; ResNet has a higher initial training accuracy and a faster convergence rate. In the testing phase, the loss value and accuracy fluctuate noticeably for VGG-16 and EfficientNet, which suggests the training occasionally fell into local optima, while ResNet has a smooth curve. This kind of dual learning curve helps in evaluating and selecting a suitable classifier model with an optimized loss and maximum classification accuracy.
To further investigate the classification performance of the different models, the confusion matrices are presented in Fig. 8. In the confusion matrices, the rows represent the actual class and the columns the predicted class. It can be seen that ResNet obtained the highest overall accuracy for melt pool image classification, with a value of 97.62%. For all models in this study, the highest classification accuracy was obtained for the first class, robot suspend, in which all samples were categorized correctly. This is because the features of the first class are far apart from those of the other classes, so they can be separated in the feature space by a linear Softmax function. The lowest accuracy was found for the spattering class, which can be explained by the relative closeness of the spattering, humping and normal classes in some cases. The better performance of ResNet compared to the other models may be due to its deeper architecture: the ResNet-50 model has the deepest structure among these CNN architectures (EfficientNet, VGG-16 and GoogLeNet). However, it should be noted that increasing the number of layers does not always ensure better performance, and overfitting may be induced. EfficientNet achieves a balance between model accuracy and training time: its accuracy is only slightly lower than ResNet's, while its training time is about two-thirds that of ResNet. The lowest overall classification accuracy was achieved by VGG-16, probably because of its relatively shallow architecture compared with the other models. Additionally, performance metrics derived from the confusion matrices can help us judge the performance of the classifiers. The performance metrics for GoogLeNet, VGG-16, ResNet and EfficientNet, consisting of precision, recall and F1-score, are presented in Table 7.
Precision reflects how well the reference images of the ground truth are classified. It is defined as

Precision = TP / (TP + FP)  (1)

where TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives, respectively. Recall is the ratio of correctly predicted instances of a class to all actual instances of that class:

Recall = TP / (TP + FN)  (2)

From Table 7, it can be seen that all CNN models could distinguish the different welding images with a precision over 93%. In particular, the ResNet model ensured the best performance for each class, obtaining the highest values in precision, recall and F1-score. It can also be seen that VGG-16 obtained the lowest results in terms of precision, recall and F1-score, and that EfficientNet obtained a slightly lower precision than ResNet. As a general tendency, it can be observed that performance increases with the overall complexity of the models.
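These per-class metrics can be computed directly from a confusion matrix laid out as in Fig. 8 (rows = actual class, columns = predicted class); the following NumPy sketch is an illustration, not the study's exact evaluation code.

```python
import numpy as np

def per_class_metrics(cm: np.ndarray):
    """Precision, recall and F1 per class from a confusion matrix
    (rows = actual class, columns = predicted class)."""
    tp = np.diag(cm).astype(float)       # correct predictions per class
    fp = cm.sum(axis=0) - tp             # predicted as class, actually other
    fn = cm.sum(axis=1) - tp             # actually class, predicted as other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For instance, a class with 9 true positives and 2 false positives has a precision of 9/11, matching Eq. (1).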
As mentioned above, the deep architecture of a CNN may lead to serious complexity and computational cost. The CNN models used in this study are summarized with their properties in Table 8: EfficientNet-b0 has the fewest parameters, while VGG-16 has the most (VGG-16 [32]: 138 M parameters, 224 × 224 input; ResNet-50 [33]: 25.6 M, 224 × 224; EfficientNet-b0 [34]: 5.3 M, 224 × 224). The large number of parameters in VGG-16 is mainly owing to its large fully connected layers. The number of parameters is determined by the model's space complexity, which combines the total weight parameters and the feature maps in each layer. Due to the curse of dimensionality, the more parameters a model has, the more training data is required. The time complexity determines the training and prediction time of the model; if the complexity is too high, training and prediction require considerable time, which prevents rapid verification of ideas, model improvement and fast prediction (Fig. 9). To intuitively investigate the effectiveness of CNNs in classifying welding images, the T-SNE (T-distributed Stochastic Neighbour Embedding) algorithm was employed to visualize the intrinsic features learned by the CNN models. The T-SNE algorithm has been widely recognized as an effective method to evaluate different types of features, as it integrates dimension reduction and visualization. By projecting them onto a two-dimensional (2D) map, high-dimensional features and the original samples can be intuitively visualized. Figure 10 presents the T-SNE visualization results for each CNN model in this study. Each coloured point represents a sample in the validation dataset projected from the multi-dimensional output of the CNN models. It can be seen that for each model, samples from the same class gather together, and each class is separable. This means the classes are also separable in the high-dimensional feature space.
There is no obvious difference in the dispersion of the classes between these models, which accords with the similar accuracies of the CNN models. In each T-SNE map, the same classes may appear in different locations. This is because T-SNE transforms the samples into a different space that preserves the distances between them, but does not guarantee preserving the values of the data samples: it treats each sample as a point and tries to map the distances from that point to every other sample into another space, taking into account only relative distances, not values. The separability observed in the T-SNE visualizations indicates that all the CNN architectures used in this study are capable of classifying WAAM melt pool images efficiently.
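A T-SNE projection of this kind can be reproduced with scikit-learn. The features below are synthetic stand-ins for the high-dimensional CNN activations (in the study, these would come from the validation images), and the perplexity value is an assumption.

```python
import numpy as np
from sklearn.manifold import TSNE

# Synthetic stand-in for high-dimensional CNN features: four clusters
# of 50 samples each, mimicking the four melt pool classes.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(loc=c, size=(50, 128)) for c in range(4)])

# Project the 128-dimensional features onto a 2D map for visualization.
embedding = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(features)
```

Plotting `embedding` with one colour per class then produces a map like Fig. 10, where well-separated clusters indicate that the classes are separable in the feature space.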

Conclusion
To diagnose the anomalies generated during the WAAM process, a visual monitoring system for the melt pool was developed. The anomalies were categorized into humping, spattering, robot suspend and normal. To recognize these anomalies automatically, CNNs were employed, with the training and validation datasets collected by experiments. Different representative CNN architectures were applied and compared, including GoogLeNet, VGG-16, ResNet and EfficientNet, and a transfer learning strategy was applied during the training phase. The results demonstrated that CNNs can achieve high accuracy in classifying melt pool images: classification accuracies of 97.25%, 97.15%, 97.62% and 97.45% were obtained by GoogLeNet, VGG-16, ResNet and EfficientNet, respectively, with ResNet achieving the best performance.
This work will help to improve the automation level and production quality of WAAM. In future studies, the proposed framework will be generalized by expanding the image classes and dataset, and the CNN models will be optimized further.