Tool wear prediction based on multidomain feature fusion by attention-based depth-wise separable convolutional neural network in manufacturing

Computer numerical control (CNC) machine tools are the foundation of the equipment manufacturing industry, and their technical level is an important indicator of the development level of a country’s equipment manufacturing industry. Tool wear during machining has a great impact on the important performance indicators of CNC machine tools, such as machining accuracy, machining efficiency and reliability. Tool wear monitoring is therefore of great significance for improving the machining efficiency, machining accuracy and reliability of CNC machine tools. Multidomain features (time domain, frequency domain and time–frequency domain) can accurately characterise the degree of tool wear. However, manual feature fusion is time consuming and prevents the improvement of monitoring accuracy. A new tool wear prediction method based on multidomain feature fusion by attention-based depth-wise separable convolutional neural network is proposed to solve these problems. In this method, multidomain features of cutting force and vibration signals are extracted and recombined into feature tensors. The proposed hypercomplex position encoding and high-dimensional self-attention mechanism are used to calculate a new representation of the input feature tensor, which emphasizes the tool wear sensitive information and suppresses large area background noise. The designed depth-wise separable convolutional neural network is used to adaptively extract high-level features that can characterise tool wear from the new representation, and the tool wear is predicted automatically. The proposed method is verified on three run-to-failure data sets of three-flute ball nose cemented carbide tools in a machining centre. Experimental results show that the prediction accuracy of the proposed method is remarkably higher than that of other state-of-the-art methods. Therefore, the proposed tool wear prediction method helps to improve the prediction accuracy and provides effective guidance for decision making in processing.


Introduction
Computer numerical control (CNC) machine tool is the foundation of equipment manufacturing industry, and its technical level is an important indicator to measure the development level of a country's equipment manufacturing industry. Machining accuracy, machining efficiency and reliability level are important performance indicators of CNC machine tools [1]. In the cutting process, the slight wear of cutting tool will reduce the machining accuracy of the machine tool; the severe wear of cutting tool will cause the parts to be scrapped, interruption of cutting process and damage of machine tools, thereby reducing the machining efficiency and reliability of the CNC machine tools [2]. In the actual machining process, approximately 20% of downtime is caused by tool wear [3][4][5][6]. Therefore, real-time online monitoring of tool wear is helpful to improve the surface quality of products and the machining efficiency and reliability of CNC machine tools [7]. To avoid the interruption of machining process, researchers usually collect vibration, cutting force and acoustic emission (AE) signals for tool wear monitoring.
Existing research shows that time domain, frequency domain and time-frequency domain features can characterise the degree of tool wear well. Morgan et al. [8] extracted time domain features (e.g. mean and maximum) and frequency domain features (e.g. wavelet entropy) from cutting force and vibration signals to characterise tool wear. Zhang et al. [9] extracted four energy features on the basis of empirical mode decomposition and eight energy features on the basis of the time-frequency spectrum from AE signals for tool wear monitoring. Wang et al. [10] combined the frequency domain, time domain and time-frequency domain features of force and vibration signals (e.g. variance, maximum wavelet energy) for tool wear monitoring. Zhou et al. [11] selected Hölder exponents as an index for evaluating the singularity of vibration signals for real-time tool wear monitoring under different cutting conditions. Goyal et al. [12] summarised and analysed the existing time domain features (e.g. root mean square and skewness), frequency domain features (e.g. skewness and kurtosis of the band power spectrum) and time-frequency domain features (e.g. wavelet entropy) that could characterise tool wear. However, these extracted features target only specific signals or domains and cannot provide a universal characterisation of tool wear.
As the mainstream method in data-driven tool wear monitoring, machine learning has been widely used. Pandiyan et al. [13] combined a genetic algorithm and a support vector machine (SVM) for belt wear monitoring in the grinding process. The genetic algorithm was used to extract a small number of tool wear sensitive features from the set of time domain and frequency domain features. Kong et al. [14] used an integrated radial basis function-based kernel principal component analysis (KPCA) to fuse the multidomain features extracted from the original signals. Gaussian regression was applied to predict the tool wear value and the corresponding confidence interval in real time. Wang et al. [10] used KPCA to fuse frequency domain, statistical domain and time-frequency domain features and utilised support vector regression (SVR) to predict tool wear. Li et al. [15] used a gradient boosting decision tree to select the optimal feature subset from the set of multidomain features. They applied a hybrid classification restricted Boltzmann machine to identify the tool wear state. Wu et al. [16] extracted multidomain features from vibration and cutting force signals and selected the optimal features in terms of Pearson correlation coefficient, monotonicity and autocorrelation. An adaptive network fuzzy inference system was used to fuse the selected features. Researchers thus usually apply feature dimension reduction or feature selection methods to reduce the number of multidomain features, because the above shallow machine learning methods have a weak ability to learn high-dimensional data and complex nonlinear relationships. Feature selection and feature dimension reduction take time and effort and discard some tool wear sensitive information, thereby limiting the accuracy of tool wear prediction.
In recent years, deep learning has been widely used in tool wear monitoring due to its powerful data mining ability and its ability to overcome the above limitations. Fu et al. [17] designed a deep convolutional neural network (CNN) for tool wear monitoring in the drilling process. The performance of the model was better than that of a radial basis function SVM. Aghazadeh et al. [18] applied a wavelet time-frequency transform to extract wavelet energy features and utilised a CNN to predict tool wear in the milling process. The prediction accuracy was higher than that of Bayesian ridge regression and SVR. Zhao et al. [19] used a CNN to extract local features from the input signals. They utilised a bidirectional long short-term memory (LSTM) network to encode the dependence between features and applied a multilayer perceptron (MLP) to predict tool wear. The proposed model was superior to linear regression, MLP and SVR. Cheng et al. [20] proposed a deep CNN model based on sound signals to recognise the abrasive belt wear state. Chen et al. [21] combined a CNN and a bidirectional LSTM network with an attention mechanism for tool wear prediction in the milling process. Huang et al. [22] proposed a new tool wear prediction method based on a reshaped time series CNN. Its performance was better than that of SVR and SVR + KPCA. Martinez et al. [23] used the Gramian angular difference field to visualise the vibration and force signals, and applied a CNN with the Cifar-10 architecture to identify the tool wear state. Cao et al. [24] proposed a new method of tool wear state recognition based on spindle vibration signals. In this method, translation-invariant wavelet frames were used to obtain the time-frequency spectra of the signals, and a CNN was used to construct the relationship between the spectrum and the tool wear state. Song et al. [25] proposed a deep CNN with the LeNet architecture based on spindle current signals for tool wear state recognition in the milling process. Sun et al. [26] reconstructed the vibration, force and AE signals into images and sent them to the designed residual CNN for tool wear prediction. Li et al. [21] proposed a new residual dense neural network based on vibration signals for online tool wear monitoring. Although these deep learning methods perform better than traditional methods, they can still be improved. The features extracted by these methods often come from specific signals or domains and cannot form a generic solution for any signal or domain. In addition, the feature sets used in these methods have low dimension and small capacity, hindering the further improvement of prediction accuracy.
As a variant of standard convolution, depth-wise separable convolution has achieved remarkable success in image processing, natural language processing, machine fault diagnosis and other fields [27,28]. Shang et al. [27] proposed a densely connected and depth-wise separable CNN to classify polarimetric synthetic aperture radar images. Compared with the standard CNNs, the classification accuracy was improved by 10.2%. Huang et al. [29] proposed a method of rolling bearing remaining useful life prediction based on a transfer depth-wise separable CNN. Experimental results showed that the proposed method can improve the prediction accuracy and robustness. Chollet et al. [30] proposed a new depth-wise separable CNN model. The performance of the proposed model on the ImageNet data set was better than that of Inception V3. Xin et al. [31] proposed a depth-wise separable CNN for fault diagnosis of the attachment of marine current turbines. Compared with the standard CNN, the depth-wise separable CNN has fewer trainable parameters and lower computational complexity, which makes it possible to increase the number of convolution layers and further improve the performance of the model [32]. Although depth-wise separable convolution has achieved remarkable success in image classification and other fields, it has not been applied in the field of tool wear prediction.
The convolution and pooling operations of a depth-wise separable CNN are building blocks that process one local neighbourhood at a time. They can suppress local background edge and impulse noise, but they have difficulty suppressing large areas of slowly changing background noise [33][34][35]. Since an attention mechanism can emphasize useful information and suppress background noise from the perspective of the whole input tensor, existing studies often embed attention mechanisms into CNNs. Zeng et al. [36] proposed a convolutional neural network with a self-attention mechanism for tool wear monitoring. Wang et al. [37] proposed a new multiscale CNN with an attention mechanism to predict the remaining useful life of machinery. Huang et al. [38] proposed a self-attention-based CNN for document classification, which models documents at two levels: from words to sentences and from sentences to documents. Compared with the existing methods, the proposed method had the highest accuracy. The self-attention mechanism cannot capture the position information between different elements in the input tensor, so it is often combined with a position encoding algorithm to enhance its performance. Vaswani et al. [39] combined sinusoidal position encoding and the self-attention mechanism to solve the machine translation task and achieved a higher single-model bilingual evaluation understudy score. Liu et al. [40] proposed a neural network model for tool wear prediction based on sinusoidal position encoding and the self-attention mechanism. Shaw et al. [41] combined relative position encoding and the self-attention mechanism for the machine translation task. Considering that transformer models were easily limited by a fixed-length context, Dai et al. [42] proposed a new transformer-XL model for the natural language modelling task, which combined segmented relative position encoding and the self-attention mechanism.
Although these neural network models with attention mechanism have achieved good performance, they still can be improved. On the one hand, the existing attention mechanisms use sinusoidal position encoding and other position encoding algorithms to encode the position information. These position encoding algorithms cannot model the order relationship (i.e. adjacency and precedence) between different elements in the input tensor. On the other hand, the existing attention mechanisms can only deal with one-dimensional input but cannot deal with high-dimensional input, which hinders the further improvement of its ability to suppress large area of slowly changing background noise.
A new tool wear prediction method based on multidomain feature fusion by attention-based depth-wise separable convolutional neural network (ADSCNN) is proposed to improve the prediction efficiency and accuracy. Firstly, multidomain features are extracted from multisensory signals. Then, the proposed hypercomplex position encoding and high-dimensional self-attention mechanism are used to calculate the new representation of input feature tensor, which emphasizes the tool wear sensitive information and suppresses large area background noise. Finally, the proposed depth-wise separable CNN is used to model the nonlinear relationship between tool wear and the new representation. The rest of this paper is arranged as follows: Sect. 2 briefly introduces the related theories and elaborates the proposed theoretical methods. Section 3 introduces the framework and architecture of the proposed tool wear prediction method. Section 4 presents the experimental data sets. Section 5 analyses, compares and discusses the experimental results. Section 6 summarises the content of this paper.

Hypercomplex position encoding
Since CNNs are easily disturbed by noise in the input tensor, it is necessary to use the proposed high-dimensional self-attention mechanism to enhance useful information in the input tensor and suppress useless information. The proposed high-dimensional self-attention mechanism relies on the proposed hypercomplex position encoding to encode the global absolute position information of the feature vectors together with their sequential and adjacency relationships.
Suppose that a feature matrix FM contains n feature vectors of dimension D, which are expressed as $fm_1, \dots, fm_n$ respectively. The hypercomplex position encoding is expressed as Eq. 1.

$$f(fm_{pos}, pos) = fm_{pos} \odot g(pos), \quad g(pos) \in \mathbb{R}^{D} \tag{1}$$

In the training process of the proposed ADSCNN model, the parameters ($w^i_d$, $\theta^i_d$, $w^j_d$, $\theta^j_d$, $w^k_d$ and $\theta^k_d$) are updated by using an error backpropagation algorithm.

High-dimensional self-attention mechanism
After hypercomplex position encoding, the high-dimensional self-attention mechanism is used to obtain a new representation of the input feature tensor. The high-dimensional self-attention mechanism is expressed as Eq. 6.
where $v_i W^Q$ and $v_j W^K$ are matrices, and $e_{ij}$ represents the similarity between the matrices $v_i W^Q$ and $v_j W^K$. $W^Q$, $W^K$, $W^V$ and $W^c$ are learnable matrices. The function $\mathrm{tr}(\cdot)$ represents the sum of the main diagonal elements of a matrix, and $\|\cdot\|_F$ represents the Frobenius norm of a matrix. The value range of $e_{ij}$ is [0,1]. The closer $e_{ij}$ is to 1, the higher the similarity between the matrices $v_i W^Q$ and $v_j W^K$. The output Attention is a feature tensor with the same shape as the input.
In the training process of the proposed ADSCNN model, the parameters ($W^Q$, $W^K$, $W^V$ and $W^c$) are updated by using an error backpropagation algorithm.
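The exact expression for $e_{ij}$ is not reproduced in this section. Purely as an illustration of how a trace and Frobenius norms can combine into a bounded similarity between two matrices, the sketch below uses the squared normalised Frobenius inner product. This particular form is an assumption, chosen because it lies in [0,1] by the Cauchy–Schwarz inequality, and it may differ from the authors' definition. The function name `matrix_similarity` is hypothetical.

```python
import numpy as np

def matrix_similarity(a, b):
    """Squared normalised Frobenius inner product of two equally shaped matrices.

    Assumed illustrative form, not the paper's equation.
    """
    inner = np.trace(a.T @ b)                                   # tr(A^T B)
    denom = np.linalg.norm(a, "fro") * np.linalg.norm(b, "fro")  # ||A||_F ||B||_F
    return (inner / denom) ** 2                                 # value in [0, 1]

a = np.array([[1.0, 2.0], [3.0, 4.0]])
same_direction = matrix_similarity(a, 2.0 * a)                  # scaled copy of a
orthogonal = matrix_similarity(a, np.array([[4.0, -3.0], [2.0, -1.0]]))
```

A scaled copy of a matrix scores 1 and a matrix with zero Frobenius inner product scores 0, which matches the qualitative behaviour described for $e_{ij}$.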

Depth-wise separable CNN
Depth-wise separable CNNs have a strong ability of adaptive data mining and feature extraction and are widely used in fault diagnosis and prediction. A series of depth-wise separable convolution, standard convolution layers and pooling layers constitute the basic structure of depth-wise separable CNN. The depth-wise separable convolution is composed of depth-wise convolution layer and pointwise convolution layer.
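The saving in trainable parameters that motivates this structure can be checked with simple arithmetic. The sketch below counts weights (biases ignored) for a standard convolution and for a depth-wise plus pointwise pair. The layer sizes (3 × 3 kernel, 64 input channels, 128 output channels) are hypothetical values chosen for illustration, not the sizes of the proposed model.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a k x k standard convolution (biases ignored)."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a k x k depth-wise layer plus a 1 x 1 pointwise layer."""
    depthwise = k * k * c_in   # one k x k kernel per input channel
    pointwise = c_in * c_out   # one 1 x 1 x c_in kernel per output channel
    return depthwise + pointwise

# Hypothetical layer sizes, for illustration only.
std = standard_conv_params(3, 64, 128)
sep = separable_conv_params(3, 64, 128)
ratio = sep / std              # equals 1/c_out + 1/k**2
```

For these sizes the separable pair needs 8,768 weights against 73,728 for the standard layer, roughly 12% of the original count.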
The convolution layers can be regarded as extracting local robust features from the input feature map. The depth-wise convolution layer allocates a convolution kernel to each channel of the input feature map $X^{l-1}$, convolves each channel with the corresponding convolution kernel and generates the output feature map $X^{l}$ through the chosen activation function. The depth-wise convolution layer is expressed as Eq. 10.
where $K$ represents the convolution kernel containing M feature maps. $X^l_m$ represents the m-th feature map of the l-th layer. $f(\cdot)$ represents the activation function. $M_m$ represents the set of inputs used for calculating the m-th output. $b^l_m$ represents the m-th offset of the l-th layer. The input feature map $X^{l-1}$ contains M channels. $X^{l}$ represents the output of the depth-wise convolution layer. The m-th feature map of the convolution kernel $K$ is convolved with the m-th channel of the input feature map $X^{l-1}$.
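The channel-by-channel behaviour described above can be sketched in NumPy. This is a minimal illustration under stated assumptions: valid padding, stride 1, ReLU as $f(\cdot)$, and the sliding-window product without kernel flipping that deep learning frameworks commonly call convolution. The function name `depthwise_conv` and the array layout (height, width, channels) are choices made here, not taken from the paper.

```python
import numpy as np

def depthwise_conv(x, kernels, bias):
    """Depth-wise convolution sketch.

    x: (H, W, M) input feature map; kernels: (k, k, M), one kernel per channel;
    bias: (M,). Assumes valid padding, stride 1 and ReLU activation.
    """
    H, W, M = x.shape
    k = kernels.shape[0]
    out = np.empty((H - k + 1, W - k + 1, M))
    for m in range(M):                       # channel m only sees kernel m
        for p in range(H - k + 1):
            for q in range(W - k + 1):
                patch = x[p:p + k, q:q + k, m]
                out[p, q, m] = np.sum(patch * kernels[:, :, m]) + bias[m]
    return np.maximum(out, 0.0)              # ReLU

x = np.ones((5, 5, 2))
y = depthwise_conv(x, np.ones((3, 3, 2)), np.array([0.0, 1.0]))
```

Because each output channel depends on a single input channel, changing one input channel leaves the other output channels untouched; mixing across channels is left to the pointwise layer.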
In the training process of the proposed ADSCNN model, the parameters (K and b) are updated by using the error backpropagation algorithm.
where $(P^{l-1}_i)_{p,q}$ is the patch in $X^{l-1}_i$ that is multiplied elementwise by the convolution kernel $K_m$ in the convolution process. Loss is the loss function. $\delta^l_m$ is the m-th element of the sensitivities in the l-th layer.
Pointwise convolution layer is a special case of the standard convolution layer. The pointwise convolution layer is mainly used to linearly combine different channels of the output feature map of the depth-wise convolution layer, and the size of its convolution kernel is 1 × 1 × M. M is the number of channels in the output feature map of the depth-wise convolution layer. The mathematical formula of standard convolution layer is suitable for pointwise convolution layer. The standard convolution layer is expressed as Eq. 12.
where $K$ represents the convolution kernel. $f(\cdot)$ represents the activation function. $X^{l}$ represents the output of the standard convolution layer. $X^l_m$ represents the m-th feature map of the l-th layer. $M_m$ represents the set of inputs used for calculating the m-th output. $b^l_m$ represents the m-th offset of the l-th layer.
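Since a pointwise convolution uses a 1 × 1 × M kernel, it reduces to a linear combination of the M channels at every spatial position. A minimal NumPy sketch, assuming ReLU as the activation and a (height, width, channels) layout; the function name `pointwise_conv` and the example weights are illustrative only.

```python
import numpy as np

def pointwise_conv(x, w, b):
    """Pointwise (1 x 1) convolution sketch.

    x: (H, W, M) input; w: (M, C_out), one 1 x 1 x M kernel per output channel;
    b: (C_out,). Assumes ReLU activation.
    """
    mixed = np.tensordot(x, w, axes=([2], [0])) + b   # combine channels per pixel
    return np.maximum(mixed, 0.0)

x = np.ones((3, 3, 2))
w = np.array([[1.0, 2.0],
              [3.0, 4.0]])
out = pointwise_conv(x, w, np.zeros(2))
```

The spatial size is unchanged; only the channel dimension is remixed.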
Similarly, the parameters of the standard convolution layer are updated by using the error backpropagation algorithm. The calculation formula for parameter update is expressed as Eq. 13.
where $(P^{l-1}_i)_{p,q}$ is the patch in $X^{l-1}_i$ that is multiplied elementwise by the convolution kernel $K_m$ in the convolution process.
The pooling layer is mainly used to reduce the spatial size of the input feature map and the number of parameters. The pooling layer removes redundant details, retains the information closely related to the task and reduces the time and space complexities of the whole model. Commonly used pooling operations include average pooling, stochastic pooling and maximum pooling. The average pooling layer is expressed as Eq. 14.
where $\mathrm{down}(\cdot)$ represents the down-sampling function. $\beta^l_m$ is the multiplicative bias, and $b^l_m$ is the additive bias. $f(\cdot)$ denotes the activation function.
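The roles of the down-sampling function and the two biases can be sketched as follows. The sketch assumes non-overlapping square windows, ReLU as the activation, and the composition $f(\beta \cdot \mathrm{down}(X) + b)$ applied per channel; that composition is the usual one for this notation but is stated here as an assumption. The function name `average_pool` is hypothetical.

```python
import numpy as np

def average_pool(x, size, beta, b):
    """Average pooling sketch with multiplicative and additive biases.

    x: (H, W, M); size: window edge; beta, b: (M,) per-channel biases.
    Assumes non-overlapping windows and ReLU activation.
    """
    H, W, M = x.shape
    H2, W2 = H // size, W // size
    x = x[:H2 * size, :W2 * size, :]                      # drop any ragged border
    down = x.reshape(H2, size, W2, size, M).mean(axis=(1, 3))  # down(.)
    return np.maximum(beta * down + b, 0.0)

x = np.arange(16, dtype=float).reshape(4, 4, 1)
pooled = average_pool(x, 2, np.array([1.0]), np.array([0.0]))
```

With a 2 × 2 window, a 4 × 4 map shrinks to 2 × 2, each output being the mean of one window.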
Similarly, the parameters of pooling layer are updated by using the error backpropagation algorithm. The calculation formula for parameter update is expressed as Eq. 15.

Proposed methodology
The framework of the proposed tool wear prediction method is shown in Fig. 1. In the offline modelling stage, multiple sensors are used to measure the cutting force and vibration signals in the milling process, and an optical microscope is used to measure the tool flank wear regularly. The multidomain features of the sensor signals are extracted and normalised, and these features are recombined into multidomain feature tensors. The multidomain feature tensors and the corresponding tool wear values are taken as the sample data sets. The sample data sets are then divided into training and validation data sets. The training data sets are used to train the designed ADSCNN model, and the hyperparameters of the ADSCNN model are adjusted on the basis of the validation data sets. After the training process, a usable ADSCNN model is obtained.
In the online prediction phase, multiple sensors are used to collect the cutting force and vibration signals in real time. Multidomain features are extracted and recombined into feature tensors as test data sets. The test data sets are input into the trained ADSCNN model for real-time tool wear prediction. The output value of the trained ADSCNN model needs inverse normalisation to get the actual tool wear prediction value. Therefore, through offline modelling and online prediction, the proposed method realizes adaptive fusion of multidomain features and tool wear prediction.
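The normalisation scheme is not specified in this section. Assuming min-max scaling of the wear values to [0, 1], the normalisation used during offline modelling and the inverse step used during online prediction can be sketched as follows; the function names and the wear values are made up for illustration.

```python
import numpy as np

def fit_minmax(y):
    """Return the (min, max) of the training wear values."""
    return float(np.min(y)), float(np.max(y))

def normalise(y, lo, hi):
    """Map wear values to [0, 1] (assumed min-max scheme)."""
    return (np.asarray(y, dtype=float) - lo) / (hi - lo)

def inverse_normalise(y_norm, lo, hi):
    """Map normalised model outputs back to physical wear values."""
    return np.asarray(y_norm, dtype=float) * (hi - lo) + lo

wear = np.array([50.0, 100.0, 150.0])   # illustrative values only
lo, hi = fit_minmax(wear)
scaled = normalise(wear, lo, hi)
restored = inverse_normalise(scaled, lo, hi)
```

The scaling constants must come from the training data and be reused unchanged at prediction time, otherwise the inverse step does not return values on the original scale.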

Multisensory data acquisition
In the milling process, the vibration signal is extremely sensitive to tool wear and contains abundant tool wear information. However, the vibration signal is vulnerable to the interference of sensor installation position and machining environment. Therefore, a vibration sensor is often used together with other sensors for tool wear monitoring. As an important tool wear monitoring signal, the cutting force gradually increases with the aggravation of tool wear. For vibration and cutting force signals, the sensitivity of different direction signals to tool wear is different. Therefore, 3D cutting force and 3D vibration signals are collected by using vibration and cutting force sensors, and the tool flank wear is measured regularly by using an optical microscope.

Multidomain feature extraction
The collected multisensor signals contain considerable noise, so multidomain feature extraction is necessary. As tool wear progresses, the time domain features of the signal change, and so do its frequency spectrum and time-frequency spectrum, and hence its frequency domain and time-frequency domain features. As shown in Table 1, 14 time domain features, 2 frequency domain features and 8 wavelet packet energy features are extracted from the vibration and cutting force signals in the x, y and z directions, respectively. The original signal is divided into multiple signal segments to increase the sample size and the generalisation ability of the ADSCNN model. Each signal segment contains 1024 points. As shown in Fig. 2, each column of the x-direction multidomain feature matrix contains 48 features: 24 multidomain features extracted from x-direction vibration signal samples and 24 multidomain features extracted from x-direction cutting force signal samples. The features in the y-direction and z-direction multidomain feature matrices come from the vibration and cutting force signal samples in the y direction and z direction, respectively. Depth-wise separable CNNs have achieved great success in image recognition, image segmentation and other fields, where the input is usually a three-order tensor with square spatial dimensions. Therefore, the time axis and feature axis of the multidomain feature matrices are set to the same dimension. For vibration and cutting force signals, the signals in the three directions are correlated. For each sample, the multidomain feature matrices in the x, y and z directions are therefore spliced into a three-order multidomain feature tensor, which helps the depth-wise separable CNN to model the correlation between different directions. On the basis of the collected signals, multiple three-order multidomain feature tensors are extracted as the input of the subsequent ADSCNN model.
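The segmentation, feature extraction and splicing steps can be sketched in NumPy. The five time domain features below are only an illustrative subset (the full list is in Table 1), and the helper names `segment`, `time_features` and `build_tensor` are hypothetical. A constant segment would give zero standard deviation and is not handled here.

```python
import numpy as np

def segment(signal, length=1024):
    """Split a 1-D signal into non-overlapping segments of `length` points."""
    n = len(signal) // length
    return np.asarray(signal[:n * length], dtype=float).reshape(n, length)

def time_features(seg):
    """Illustrative subset of time domain features for one segment."""
    mean = seg.mean()
    std = seg.std()
    rms = np.sqrt(np.mean(seg ** 2))
    peak = np.max(np.abs(seg))
    skew = np.mean((seg - mean) ** 3) / std ** 3
    kurt = np.mean((seg - mean) ** 4) / std ** 4
    return np.array([mean, rms, peak, skew, kurt])

def build_tensor(fx, fy, fz):
    """Stack x, y and z feature matrices into a three-order tensor."""
    return np.stack([fx, fy, fz], axis=-1)

segs = segment(np.arange(5000), 1024)
feats = time_features(np.array([1.0, -1.0] * 512))
tensor = build_tensor(np.zeros((48, 48)), np.zeros((48, 48)), np.zeros((48, 48)))
```

Stacking three 48 × 48 matrices along the last axis gives a 48 × 48 × 3 tensor, with one slice per direction.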

Design and training of ADSCNN model
As the key component of the proposed tool wear prediction method, the ADSCNN model is mainly used to model the relationship between multidomain features and tool wear. The framework of the proposed ADSCNN model is shown in Fig. 3. Firstly, considering that the feature vectors at different times in each input matrix are interdependent, a new representation of each input feature matrix is calculated by using the proposed hypercomplex position encoding and high-dimensional self-attention mechanism, which emphasizes the tool wear sensitive information and suppresses large area background noise. Specifically, the input three-order feature tensor consists of three feature matrices. The size of each of the three three-order feature tensors is 48 × 48 × 4. The concatenate layer is used to splice the three three-order feature tensors into a larger three-order feature tensor with a size of 96 × 96 × 3. Secondly, a standard convolution layer, six depth-wise separable convolutions and a pooling layer are used to extract a high-level representation from the input feature tensor. Each depth-wise separable convolution consists of a depth-wise convolution layer and a pointwise convolution layer. Compared with standard convolution, depth-wise separable convolution significantly reduces the number of parameters and the computational complexity, which makes it possible to increase the number of convolution layers. Multiple depth-wise separable convolutions are stacked to extract high-level local robust features from the input feature tensor. The pooling layer is mainly used to compress features and to reduce the size of the feature mapping plane, the amount of calculation and the memory consumption. The rectified linear unit function does not suffer from the gradient vanishing problem and converges quickly in the training process. Thus, it is selected as the activation function of the convolution layer and pooling layer. Finally, a dense layer and an output layer are used to establish the nonlinear relationship between the input high-level representation and the normalised tool wear values. Table 2 shows the detailed parameters of each layer of the proposed ADSCNN model. The ADSCNN model is trained by using the Frobenius norm of the difference between the predicted tool wear value $y^p$ and the actual tool wear value $y$. The loss function is expressed as Eq. 16.
where N is the number of samples in the training data sets.
As a typical model training tool, K-fold cross validation is used when training the ADSCNN model to reduce the risk of overfitting and improve its generalisation ability. During the training process, the error backpropagation algorithm is used to update the trainable parameters so as to minimise the loss function.
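The data split behind K-fold cross validation can be sketched as follows. The value of K and the shuffling seed are not given in this section, so both are left as arguments; the function name `k_fold_indices` is hypothetical and K is assumed to be at least 2.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for K-fold cross validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        val = folds[i]                        # fold i is held out once
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train, val

splits = list(k_fold_indices(10, 5))
```

Each sample appears in exactly one validation fold, so every sample is used for both training and validation across the K rounds.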

Performance appraisal of ADSCNN model
To quantitatively and comprehensively evaluate the performance of the proposed ADSCNN model, this paper selects two classic evaluation indicators, namely, root mean squared error (RMSE) and mean absolute error (MAE). The smaller the values of MAE and RMSE, the closer the predicted flank wear width is to the actual value, that is, the smaller the prediction error and the better the model performance. MAE and RMSE are expressed as Eqs. 17 and 18, respectively.

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - y^p_i\right| \tag{17}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - y^p_i\right)^2} \tag{18}$$

where N represents the number of samples in the test data sets, $y_i$ represents the i-th true tool wear value, and $y^p_i$ represents the i-th predicted tool wear value.
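Both indicators follow directly from their standard definitions. A minimal NumPy sketch, with made-up values used only to show the calculation:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Illustrative values only, not experimental results.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
err_mae = mae(y_true, y_pred)
err_rmse = rmse(y_true, y_pred)
```

RMSE squares each error before averaging, so a single large error raises it more than it raises MAE, which is why the two are usually reported together.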
In Table 1, $\{x_i, i = 1, \dots, N\}$ denotes a signal sample. The time domain signal (i.e. the original signal) is transformed into the power spectrum by fast Fourier transform, with $f_i$ denoting the i-th frequency. $P(f_i)$ represents the power spectrum density, and N represents the power spectrum length. For r-level wavelet packet decomposition, $M_{r,j}$ represents the ratio of the energy of the j-th sub band to the total energy of all sub bands.
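The sub band energy ratio can be illustrated with a small wavelet packet decomposition written directly in NumPy. The wavelet family is not stated in this section, so the Haar wavelet is assumed here for simplicity. Sub bands are returned in natural (filter-tree) order rather than frequency order, and the signal length must be divisible by $2^r$. An all-zero signal is not handled. The function names are hypothetical.

```python
import numpy as np

def haar_wavelet_packet(x, level):
    """Full Haar wavelet packet tree; returns 2**level sub bands."""
    bands = [np.asarray(x, dtype=float)]
    for _ in range(level):
        nxt = []
        for b in bands:
            even, odd = b[0::2], b[1::2]
            nxt.append((even + odd) / np.sqrt(2.0))   # low-pass half
            nxt.append((even - odd) / np.sqrt(2.0))   # high-pass half
        bands = nxt
    return bands

def energy_ratios(x, level=3):
    """Ratio of each sub band's energy to the total energy of all sub bands."""
    energies = np.array([np.sum(b ** 2) for b in haar_wavelet_packet(x, level)])
    return energies / energies.sum()

rng = np.random.default_rng(0)
ratios = energy_ratios(rng.standard_normal(1024), level=3)
constant = energy_ratios(np.ones(1024), level=3)
```

A three-level decomposition yields eight sub bands, which is consistent with the eight wavelet packet energy features extracted per signal. A constant signal places all of its energy in the first (lowest) sub band.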

Experimental setup
A set of experimental data measured from a three-flute ball nose milling tool on a computer numerical control machine in dry milling is used to verify the effectiveness of the proposed tool wear prediction method [43]. As shown in Fig. 4, the machine is a high-speed CNC machine tool, the tool is a ball nose milling cutter with three flutes and the workpiece is made of Inconel 718 alloy. Table 3 shows the process parameters of the cutting process. In the milling process, a Kistler quartz 3-component dynamometer is mounted between the workpiece and the machining table to measure the cutting force, and three Kistler piezo accelerometers are mounted on the workpiece to measure the vibration signals in the x, y and z directions, respectively. In addition, a Kistler acoustic emission sensor is mounted on the workpiece to monitor the high-frequency stress wave generated by the milling process. Meanwhile, a DAQ NI PCI1200 is used to collect the signals of the multiple sensors, and the continuous sampling frequency is 50 kHz.
The wear width of the flank of each cutting edge is measured with a LEICA MZ12 optical microscope after each milling.

Data preparation
The tool wear data set consists of three tools (C1, C4 and C6) with complete tool wear data. The data for each tool comprise 315 signal data files and the corresponding tool wear data. Each signal data file corresponds to one cutting process. Here, the tool wear data set C1 is taken as an example for data analysis.
As shown in Fig. 5, the tool wears gradually with the increase in the number of cuts. For each tool, the change law of the wear curve of the three flutes is the same. Tool wear is divided into three stages: initial wear, normal wear and severe wear. In the initial wear stage, the slope of the curve is larger, and the tool wears faster. In the normal wear stage, the slope of the curve is small, and the tool wears slowly. In the severe wear stage, the slope of the curve is large, and the tool wears violently.
The marginal spectra of the milling force and vibration signals of C1 under different tool wear states are shown in Fig. 6. The energy of the marginal spectra of the first three subgraphs (milling force signals) is mainly concentrated in the low frequency range (0-1500 Hz), and these spectra are approximately unimodal. For the first three subgraphs, the amplitude of the frequency distribution curve increases with the increase in tool wear. The amplitude of the frequency distribution curve of milling force in the x direction and z direction is larger than that in the y direction. For the last three subgraphs, the amplitude of the frequency distribution curve also increases with the increase in tool wear. The amplitude of the frequency distribution curve of z-direction vibration is greater than that of y-direction and x-direction vibrations. Therefore, the proposed tool wear monitoring strategy is effective in predicting tool wear in terms of cutting force and vibration signals in the milling process.
Considering that the original milling force and vibration signals contain considerable noise, 24 multidomain features are extracted from each milling force and vibration signal. Three normalised features extracted from the cutting force and vibration signals of C1 are shown in Fig. 7. As shown in Fig. 7, some of the three features decrease and others increase as the number of cuts increases. Therefore, predicting tool wear by using multidomain features is feasible. However, the relationship between tool wear and multidomain features is complex and nonlinear, so the proposed ADSCNN model is needed to model this nonlinear relationship.
In accordance with the multidomain feature extraction method described in the previous section, the multidomain features of multiple sensor signals are extracted and reconstructed into multidomain feature tensors. The sample data sets composed of these multidomain feature tensors are divided into training data sets, validation data sets and test data sets. The detailed description of the experimental data sets is shown in Table 4. The detailed configuration of the computer used to train the ADSCNN model is as follows: Microsoft Windows 10 system, Intel (R) Core (TM) i5-3230 M processor, 4 GB random access memory and 2 Hz CPU frequency. The training time of one epoch is approximately 2.14 s, and the testing time of each sample is only 0.0015 s. Therefore, the proposed ADSCNN model is an efficient tool wear prediction method.

Results and discussion
The performance of the proposed tool wear prediction method based on the ADSCNN model depends on the selection of hyperparameters. The optimal hyperparameters change with the experimental data sets, so they must be optimised whenever the proposed method is applied. The hyperparameters of the proposed ADSCNN model mainly include the position encoding method, the activation function of the dense and output layers, the gradient descent algorithm, the learning rate and the number of epochs. An early stopping strategy removes the need to select the number of epochs; it also helps to avoid overfitting and improves the generalisation performance of the network. Therefore, four hyperparameters remain to be optimised: the position encoding method, the gradient descent algorithm, the learning rate and the activation function of the dense and output layers. Because the wear processes of the three ball-end milling cutters are similar, C1 is used as the case for hyperparameter selection.
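The early stopping idea can be sketched as follows. This is a generic illustration rather than the authors' implementation: the `patience` and `min_delta` parameters, the function name and the loss values are assumptions introduced for the example.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch) for a list of per-epoch validation losses.

    Training stops once the validation loss has failed to improve by more
    than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember this epoch and reset the counter.
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    # Patience never ran out: training used every epoch.
    return best_epoch, len(val_losses) - 1


# Hypothetical validation losses (made-up numbers).
losses = [0.90, 0.50, 0.40, 0.41, 0.42, 0.43, 0.20]
print(early_stopping_epoch(losses, patience=3))
```

In this example the loss stops improving after epoch 2, so training halts at epoch 5 and the weights from epoch 2 would be kept. The later improvement at epoch 6 is never seen, which shows why the patience value itself deserves some care.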

Position encoding algorithm
The self-attention mechanism cannot capture the position information between different elements in the input tensor, so it is often combined with a position encoding algorithm to enhance its performance. Therefore, five configurations are compared (no position encoding, sinusoidal position encoding, relative position encoding, segmented relative position encoding and the proposed hypercomplex position encoding) to demonstrate the superiority of the proposed algorithm. The hypercomplex position encoding algorithm is combined with the high-dimensional self-attention mechanism, and the other position encoding algorithms are combined with the standard self-attention mechanism. Figure 8 shows the validation accuracy under different position encoding algorithms. Figure 9 compares the performance of different flutes of C1 under different position encoding algorithms. As shown in Fig. 8, the ADSCNN model using position encoding algorithms reduces the MAE and RMSE by approximately 50% compared with the ADSCNN model without position encoding. Sinusoidal position encoding is a static, non-learnable encoding algorithm that encodes the position information of each feature vector but cannot represent the relative distance between feature vectors. Relative position encoding, segmented position

Activation function
The activation function directly affects the capacity and parameter optimisation of a neural network model. A variety of activation functions are in wide use, but no complete theoretical guidance exists on how to select the appropriate one for a specific application. Therefore, four common activation functions (elu, selu, sigmoid and linear) are compared to choose a suitable one. Figure 10 shows the trend of validation set accuracy under different activation functions. Figure 11 shows the performance of different flutes of C1 under different activation functions. As shown in Fig. 10, the ADSCNN model using the sigmoid activation function converges faster and to smaller MAE and RMSE than the models using the three unbounded activation functions (linear, elu and selu, the latter two being linear for positive inputs). These three functions are fragile during training and can easily lead to dead neurons. They also do not compress the output amplitude, so the output amplitude can keep growing. The input of the proposed ADSCNN model is normalised multidomain features, and its expected output should lie in the interval [0,1]. The sigmoid activation function constrains the output to the interval [0,1] without causing gradient explosion. As shown in Fig. 11, the ADSCNN model using the sigmoid activation function has the lowest RMSE and MAE for the different flutes of C1. Therefore, the sigmoid activation function is used as the activation function of the dense and output layers.
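The difference in output range between the four activation functions can be checked with a short sketch. The definitions below are the standard textbook forms, with the usual selu constants; they are written out by hand for illustration and are not taken from the authors' code.

```python
import math

# Standard selu constants (Klambauer et al.); rounded values would also do.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def linear(x):
    return x


def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def selu(x):
    return SELU_SCALE * (x if x > 0 else SELU_ALPHA * (math.exp(x) - 1.0))


def sigmoid(x):
    # Numerically stable form that avoids overflow for large |x|.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


for x in (-50.0, -5.0, 0.0, 5.0, 50.0):
    print(x, linear(x), elu(x), selu(x), sigmoid(x))
```

For large positive inputs, linear, elu and selu grow without bound, whereas sigmoid stays inside [0, 1], which matches a target that has been normalised to that interval.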

Gradient descent algorithm
As a popular optimisation algorithm, gradient descent is often used to optimise neural network models. Its variants are often regarded as black box optimisers because theoretical explanations of their relative advantages and disadvantages are lacking. Therefore, common optimisation algorithms (stochastic gradient descent [SGD], adaptive moment estimation [Adam], Adadelta, Adagrad, Adamax and Nadam) are compared to choose a suitable one. The trend of validation set accuracy of C1 under different gradient descent algorithms is shown in Fig. 12, and the performance of different flutes of C1 under different gradient descent algorithms is shown in Fig. 13. As shown in Fig. 12, SGD, Adadelta and Adagrad converge slowly and have a large validation error, whereas Adam, Adamax and Nadam all converge quickly. As shown in Fig. 13, Adam achieves smaller MAE and RMSE than Adamax and Nadam. Therefore, Adam is chosen to train the proposed ADSCNN model.
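For readers unfamiliar with Adam, its update rule can be sketched on a one-dimensional toy problem. The values of `beta1`, `beta2` and `eps` below are commonly used defaults and are assumptions, because this section does not report them. The quadratic objective and the function name `adam_minimise` are hypothetical and stand in for the network loss.

```python
import math


def adam_minimise(grad, x0, lr=0.008, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=3000):
    """Minimise a 1-D function with Adam, given its gradient `grad`."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1.0 - beta1) * g        # first-moment estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 (x - 3).
x_min = adam_minimise(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)
```

Because the step is the ratio of the two moment estimates, its size is roughly bounded by the learning rate regardless of the gradient scale, which is one common explanation for the fast early convergence of Adam-type optimisers.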

Learning rate
In model training, a gradient descent algorithm is used to optimise the parameters of the model. The learning rate is an important hyperparameter of this algorithm: it directly affects both the convergence speed of the model and its convergence error. Choosing an appropriate learning rate is therefore necessary to improve the efficiency and accuracy of the proposed ADSCNN model, and different learning rates are used to train it. The performance metrics (MAE and RMSE) under different learning rates are shown in Fig. 14. As the learning rate increases, RMSE and MAE first decrease and then increase, and both are smallest when the learning rate is approximately 0.008. The smaller the RMSE and MAE, the closer the predicted tool wear value is to the actual flank wear value. Therefore, considering both prediction accuracy and convergence speed, the learning rate is set to 0.008.
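The "decrease first, then increase" behaviour of the error with respect to the learning rate can be reproduced on a toy problem. The sketch below runs plain gradient descent on f(x) = x^2 for a fixed number of steps. The objective, the candidate rates and the function names are hypothetical, and the best rate found here has no relation to the value of 0.008 selected for the ADSCNN model.

```python
def final_error(lr, steps=20, x0=1.0):
    """Distance from the minimum of f(x) = x^2 after `steps` updates."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x  # gradient of x^2 is 2x
    return abs(x)


def sweep(rates, steps=20):
    """Evaluate each candidate learning rate and return the best one."""
    errors = {lr: final_error(lr, steps) for lr in rates}
    best = min(errors, key=errors.get)
    return best, errors


best, errors = sweep([0.001, 0.01, 0.1, 0.5, 0.9, 1.1])
for lr, err in errors.items():
    print(f"lr={lr:<6} error={err:.4g}")
print("best:", best)
```

Rates that are too small leave the iterate far from the minimum within the step budget, and rates that are too large overshoot and eventually diverge, so the error curve is U-shaped.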

Comparison of different tool wear prediction methods
Following the parameter optimisation experiments, the optimal parameters of the proposed ADSCNN model are shown in Table 5. The tool wear prediction values and prediction errors for data sets C1, C4 and C6 are shown in Fig. 15. The tool wear prediction error for each cutting number is less than 5 μm, so the proposed method can accurately predict tool flank wear. To demonstrate its effectiveness, the performance of the proposed method is compared with other advanced methods by using their originally published results. Specifically, traditional machine learning methods (SVR + KPCA and SVR) and advanced deep learning methods (recurrent neural network [RNN], LSTM, CNN, CBLSTM, DCNN + AE and RTSCNN) are used for performance comparison. The comparison results under different evaluation criteria are shown in Table 6. SVR is unsuitable for large-scale data sets, its kernel function is difficult to select, and it has a high prediction error. KPCA improves the prediction accuracy of SVR by fusing multidomain features, but the accuracy is still lower than that of the proposed method. RNN, LSTM, CBLSTM, CNN and DCNN + AE directly model the relationship between multidomain features and tool flank wear, but their prediction error is higher than that of the proposed method. Among all compared methods, the proposed method has the smallest MAE and RMSE, which are 0.47 and 0.69 μm, respectively. The variances of MAE and RMSE are also the smallest, which are 0.18 and 0.38 μm, respectively. The proposed method therefore has the lowest variance and deviation and the best prediction performance. In conclusion, the proposed tool wear prediction method can accurately and effectively model the nonlinear relationship between tool wear and multidomain features and achieve high prediction accuracy.
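The two evaluation criteria used in this comparison, MAE and RMSE, can be computed as in the following minimal sketch. The wear values are made-up numbers for illustration only and are not taken from the experimental data sets.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between measured and predicted wear."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error between measured and predicted wear."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Hypothetical flank wear values in micrometres (made-up numbers).
measured = [100.0, 120.0, 140.0]
predicted = [101.0, 118.0, 143.0]
print(mae(measured, predicted), rmse(measured, predicted))
```

RMSE is never smaller than MAE for the same errors and penalises large individual errors more strongly, which is why the two metrics are usually reported together.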

Conclusion
In this paper, a new tool wear prediction method based on multidomain feature fusion by ADSCNN is proposed for the milling process. This method improves the accuracy of tool wear prediction, mainly for the following reasons. First, the multisensor signals contain considerable tool wear information, and multidomain feature extraction provides comprehensive and rich features to characterise tool wear. Second, the proposed hypercomplex position encoding and high-dimensional self-attention mechanism are used to calculate the new representation of the input feature tensor, which emphasizes the tool wear sensitive information and suppresses large area background noise. Third, the proposed depth-wise separable CNN is used to model the nonlinear relationship between tool wear and the new representation. Fourth, the proposed ADSCNN model adopts a dense layer and an output layer to automatically predict tool wear. Finally, the parameters of the proposed ADSCNN model are optimised on the basis of a variety of metrics, which further improves its prediction accuracy.
Fig. 13 Performance comparison of different flutes of C1 under different gradient descent algorithms
The main contribution of this paper is the automatic prediction of tool wear in the milling process: the ADSCNN model adaptively fuses multidomain features, so the feature selection and feature dimension reduction steps of conventional feature engineering can be omitted. In addition, the innovation is that the proposed hypercomplex position encoding and high-dimensional self-attention mechanism enhance the tool wear sensitive information contained in the input feature tensor and suppress the noise, and the proposed depth-wise separable CNN can model the nonlinear relationship between tool wear and the new representation well. The experimental results on three milling tool run-to-failure data sets show that the proposed tool wear prediction method based on the ADSCNN model is better than other state-of-art methods in terms of overall prediction performance. The findings serve as a reference for its application in machining.
Fig. 15 The prediction results of the proposed tool wear prediction method
Table 6 Comparison results of different tool wear prediction methods (RMSE and MAE in μm)
Method             RMSE                 MAE
SVR [10]           11.9681 ± 3.3337     9.3770 ± 2.0422
SVR + KPCA [10]    5.4428 ± 1.5894      3.9583 ± 0.9371
CNN [44]           14.0428 ± 5.5588     11.0000 ± 1.3000
DCNN + AE [45]     2.2385 ± 0.7105      1.5708 ± 0.5485
CBLSTM [19]        9.2333 ± 1.9140      7.2333 ± 1.0263
LSTM [19]          13.7333 ± 6.2164     12.1667 ± 6.2292
RNN [19]           15.7333 ± 6.2164     12.1667 ± 6.2292
RSTCNN [22]        2.2009 ± 0.9012      1.5658 ± 0.6037
ADSCNN             0.6878 ± 0.3831      0.4738 ± 0.1822