## 3.1 Dynamic Convolution

In recent years, methods involving dynamic convolutions, such as CondConv [30], DynamicConv [31], and DyNet [32], have garnered significant attention from researchers. These methods render convolutional designs more flexible and adaptive without increasing network depth or width. The fundamental concept behind dynamic convolutions is to adjust convolutional parameters adaptively based on input variations.

The dynamic convolution method proposed by DyNet is adopted in this paper, as depicted in Fig. 2. DyNet enables the model to dynamically learn the importance of different channels and to weight their features accordingly, thereby extracting more informative features. This approach also effectively reduces the model's parameter count: compared with traditional fixed convolution kernels, dynamic convolution kernels have fewer parameters, conserving memory and computational resources. Additionally, the approach enhances the model's utilization of diverse channel features, improving its ability to model complex data.

Dynamic convolution kernels are generated based on input data and predicted coefficients. As a result, the model can adaptively adjust convolution kernels for different input samples. This further reinforces the model's generalization capability and adaptability, enabling it to accommodate new tasks better when handling different data types. The formula for employing dynamic convolutions is as follows:

$${\tilde{O}_t}={\tilde{w}_t} \otimes x=\left(\sum\limits_{i=1}^{g_t}{\eta_t^i \cdot w_t^i}\right) \otimes x=\sum\limits_{i=1}^{g_t}{\left(\eta_t^i \cdot \left(w_t^i \otimes x\right)\right)}=\sum\limits_{i=1}^{g_t}{\left(\eta_t^i \cdot O_t^i\right)} \tag{1}$$

Where \({\tilde{O}_t}\) represents the output of the dynamic convolution kernel, \({\tilde{w}_t}\) represents the dynamic convolution kernel, \(x\) stands for the input, \(g_t\) represents the number of groups, \(\eta_t^i\) represents the predicted coefficients, \(w_t^i\) represents the *i*-th fixed convolution kernel, and \(O_t^i\) represents the output of the *i*-th fixed convolution kernel.
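Equation (1) holds because convolution is linear in the kernel: fusing the fixed kernels with the coefficients first, then convolving once, gives the same result as convolving with each kernel and weighting the outputs. The following sketch checks this numerically for a 1-D case; the function name, the valid/stride-1 correlation, and the sample values are our illustrative choices, not the paper's implementation.

```python
import numpy as np

def dynamic_conv1d(x, kernels, coeffs):
    """Sketch of Eq. (1): fuse the g_t fixed kernels w_t^i with the
    predicted coefficients eta_t^i, then run one valid, stride-1
    1-D correlation with the input x."""
    fused = sum(c * w for c, w in zip(coeffs, kernels))   # tilde{w}_t
    k = len(fused)
    return np.array([np.dot(fused, x[i:i + k]) for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernels = [np.array([1.0, 0.0, -1.0]), np.array([0.5, 0.5, 0.5])]
coeffs = [0.7, 0.3]

# Left and right sides of Eq. (1): fuse-then-convolve vs. convolve-then-weight.
out_fused = dynamic_conv1d(x, kernels, coeffs)
out_weighted = sum(c * dynamic_conv1d(x, [w], [1.0]) for c, w in zip(coeffs, kernels))
assert np.allclose(out_fused, out_weighted)
```

Because the fused kernel is built before the convolution runs, only one convolution is executed per input, which is where the parameter and computation savings described above come from.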

## 3.2 Cross-Mix Module

The primary idea of ResNet [33] is to add the input of a layer to its output through residual blocks, a process referred to as a "skip connection" or "residual connection." This design enables the network to learn residual functions, which helps alleviate the vanishing-gradient problem and facilitates the training of very deep networks. ShuffleNet [34], on the other hand, is a network architecture that achieves feature communication and fusion by grouping and shuffling channels. ShuffleNet can extract richer and more diverse feature information, further enhancing the model's performance and generalization ability.

This paper proposes the Cross-Mix module based on the concepts of the above models; it is primarily composed of three parts: Split, Cross, and Mix, as illustrated in Fig. 3.

The process can be understood as follows: Initially, the input is split into two parts equally along the length dimension (Split1 and Split2). Assuming the input is \(x=[{x_1},{x_2}, \cdots ,{x_n}]\), the formula for the Split part is as follows:

$${\text{Split1}}=[{x_1},{x_2}, \cdots ,{x_{n/2}}] \tag{2}$$

$${\text{Split2}}=[{x_{n/2+1}},{x_{n/2+2}}, \cdots ,{x_n}] \tag{3}$$
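The split in Eqs. (2)-(3) is an even cut along the length dimension. A minimal sketch, with illustrative values and variable names of our own choosing:

```python
import numpy as np

# Even split along the length dimension, per Eqs. (2)-(3).
x = np.arange(1, 9)                         # x = [x_1, ..., x_8], so n = 8
split1, split2 = x[:len(x) // 2], x[len(x) // 2:]
# split1 holds [x_1 .. x_4], split2 holds [x_5 .. x_8]
```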

In the Cross part, dynamic convolution and batch normalization are applied to the two split parts, producing two new feature maps (DyConv1 and DyConv2). If the channel dimensions of the original split parts (Split1 and Split2) differ from those of the corresponding DyConv feature maps, 1×1 convolutions are employed to adjust the channel dimensions of the split parts to match. For example, if Split1 has \(C_1\) channels and DyConv1 has \(C_2\) channels, a 1×1 convolution is applied to Split1 so that \(C_1=C_2\).

Assuming the input is \(x=[{x_1},{x_2}, \cdots ,{x_n}]\), batch normalization is applied to it; the formulas for the BN layer are as follows:

$$\mu =\frac{1}{n}\sum\limits_{i=1}^{n} {x_i} \tag{4}$$

$${\sigma^2}=\frac{1}{n}\sum\limits_{i=1}^{n} {({x_i} - \mu)^2} \tag{5}$$

$${\hat{x}_i}=\frac{{x_i} - \mu}{\sqrt{{\sigma^2}+\epsilon}} \tag{6}$$

$${y_i}=\gamma {\hat{x}_i}+\beta \tag{7}$$

Where \(\mu\) represents the mean of the input, \(\sigma^2\) represents the variance of the input, \(\hat{x}_i\) represents the normalized output, and \(\epsilon\) is a small constant that keeps the denominator from being zero. \(y_i\) represents the final output, which restores the range and distribution of the data by multiplying the normalized output by the scaling parameter \(\gamma\) and adding the offset parameter \(\beta\).
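Equations (4)-(7) can be sketched directly; the default \(\gamma=1\), \(\beta=0\), \(\epsilon=10^{-5}\) and the sample batch are our illustrative assumptions.

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch normalization per Eqs. (4)-(7): normalize by the batch
    mean and variance, then rescale with gamma and shift with beta."""
    mu = x.mean()                          # Eq. (4)
    var = ((x - mu) ** 2).mean()           # Eq. (5)
    x_hat = (x - mu) / np.sqrt(var + eps)  # Eq. (6)
    return gamma * x_hat + beta            # Eq. (7)

y = batch_norm(np.array([1.0, 2.0, 3.0, 4.0]))
# With gamma = 1 and beta = 0, y has (approximately) zero mean and unit variance.
```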

Suppose the input tensor is \(X\) with dimensions \((H, W, C_{in})\), where \(H\) and \(W\) are the spatial height and width and \(C_{in}\) is the number of input channels, and the output tensor is \(Y\) with dimensions \((H, W, C_{out})\), where \(C_{out}\) is the number of output channels. The formula for the 1×1 convolution is as follows:

$${Y_{ijc}}=\sum\limits_{k=1}^{C_{in}} {{X_{ijk}}\cdot {W_{kc}}}+{b_c} \tag{8}$$

Where \(Y_{ijc}\) represents the output tensor at \((i, j, c)\), \(X_{ijk}\) represents the input tensor at \((i, j, k)\), \(W_{kc}\) represents the weight of the convolution kernel, and \(b_c\) is the bias.
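Equation (8) says a 1×1 convolution is simply a per-pixel linear map over the channel axis, i.e. a matrix multiply; the shapes and random values below are illustrative only.

```python
import numpy as np

# 1x1 convolution as a channel-wise matrix multiply, per Eq. (8).
H, W, C_in, C_out = 4, 4, 3, 5
X = np.random.rand(H, W, C_in)
Wk = np.random.rand(C_in, C_out)    # W_{kc}
b = np.random.rand(C_out)           # b_c

Y = X @ Wk + b                      # Y_{ijc} = sum_k X_{ijk} W_{kc} + b_c

# Sanity check against the explicit summation of Eq. (8).
Y_explicit = np.einsum('ijk,kc->ijc', X, Wk) + b
assert Y.shape == (H, W, C_out)
assert np.allclose(Y, Y_explicit)
```

This is why a 1×1 convolution can change the channel count \(C_{in} \to C_{out}\) without touching the spatial dimensions, which is exactly how the Cross part aligns Split1/Split2 with the DyConv feature maps.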

Two approaches, A and B, are incorporated in the Mix part; they theoretically yield identical outcomes. In Approach A, DyConv1 is added to Split2, yielding the mixed feature map Mix1, and Split1 is added to DyConv2, yielding the mixed feature map Mix2. Finally, Mix1 and Mix2 are concatenated along the length dimension to form the final output feature map. The formulas for Approach A are as follows:

$$\text{Mix1}=\mathrm{Add}(\text{DyConv1},\ \text{Split2})=\text{DyConv1}+\text{Split2} \tag{9}$$

$$\text{Mix2}=\mathrm{Add}(\text{Split1},\ \text{DyConv2})=\text{Split1}+\text{DyConv2} \tag{10}$$

$$\text{Output}=\mathrm{Concat}(\text{Mix1},\ \text{Mix2}) \tag{11}$$

In Approach B, DyConv1 is first concatenated with Split1, producing the mixed feature map Mix1, and Split2 is concatenated with DyConv2, producing the mixed feature map Mix2. Finally, Mix1 and Mix2 are added to form the output feature map. The formulas for Approach B are as follows:

$$\text{Mix1}=\mathrm{Concat}(\text{DyConv1},\ \text{Split1}) \tag{12}$$

$$\text{Mix2}=\mathrm{Concat}(\text{Split2},\ \text{DyConv2}) \tag{13}$$

$$\text{Output}=\mathrm{Add}(\text{Mix1},\ \text{Mix2})=\text{Mix1}+\text{Mix2} \tag{14}$$
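That the two Mix approaches coincide can be verified numerically: both produce the concatenation of DyConv1 + Split2 and Split1 + DyConv2. In the sketch below the DyConv outputs are stand-ins (seeded random arrays of matching shape), since only the add/concat algebra is being checked.

```python
import numpy as np

# Stand-ins for the four feature maps; shapes match so add/concat are defined.
rng = np.random.default_rng(0)
split1, split2 = rng.random(4), rng.random(4)
dyconv1, dyconv2 = rng.random(4), rng.random(4)

# Approach A (Eqs. 9-11): add first, then concatenate.
out_a = np.concatenate([dyconv1 + split2, split1 + dyconv2])

# Approach B (Eqs. 12-14): concatenate first, then add.
out_b = np.concatenate([dyconv1, split1]) + np.concatenate([split2, dyconv2])

assert np.allclose(out_a, out_b)
```

The equivalence holds because concatenation and element-wise addition commute when the concatenated halves are aligned the same way in both operands.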

Through the design of the Cross-Mix module, the network not only fuses information along the channel dimension but also divides and merges information along the length dimension, which helps it learn more meaningful and diverse representations of vibration signals. Cross-Mix combines features from one input part with the other, facilitating information exchange and potential feature enhancement. This strengthens the model's feature representation capacity and improves its performance to a certain extent. By employing Cross-Mix, the model effectively increases the number of feature maps, enabling it to better capture complex patterns and relationships within the data, thereby enhancing its performance and generalization ability.

## 3.3 DCCMN Model

Based on the Cross-Mix module proposed in the previous section, this paper introduces the Dynamic Convolution Cross-Mix Network (DCCMN) model, as depicted in Fig. 4. Since the Cross-Mix module involves operations such as split, dynamic convolution, and mix, it introduces many parameters and considerable computational complexity. Additionally, using the Cross-Mix module directly in the early layers of the model could lead to information loss.

To address these potential issues and balance model complexity against performance, the DCCMN model employs large convolution kernels and simple operations in its initial stages. For instance, the first convolutional layer has a kernel size of 16 with a stride of 4, and the second convolutional layer has a kernel size of 8 with a stride of 2. Batch normalization (BN) and max-pooling layers are added after each convolutional layer. This design facilitates the rapid learning of low-level features from the input data, reduces excessive parameters and computation, aids model convergence, and mitigates the risk of overfitting. The Cross-Mix modules are then gradually introduced for feature fusion: the model incorporates four Cross-Mix modules, which capture higher-level feature representations while maintaining performance and training efficiency. A global average pooling layer is added after the last module to reduce feature dimensionality, and the model's final output is obtained with the softmax activation function.
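The downsampling effect of the stem can be sketched with the standard 1-D convolution output-length formula. The input length of 1024 and zero padding below are our assumptions for illustration; the text specifies only the kernel sizes and strides, and omits the pooling parameters.

```python
def conv_out_len(n, kernel, stride, pad=0):
    """Standard 1-D convolution output length: floor((n + 2*pad - k) / s) + 1."""
    return (n + 2 * pad - kernel) // stride + 1

# Shape walk-through of the two stem convolutions (padding and input
# length assumed; max-pooling omitted since its parameters are not given).
n = 1024                      # assumed input signal length
n = conv_out_len(n, 16, 4)    # first conv: kernel 16, stride 4 -> 253
n = conv_out_len(n, 8, 2)     # second conv: kernel 8, stride 2 -> 123
```

The aggressive early striding is what keeps the parameter-heavy Cross-Mix modules operating on short sequences.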

The global average pooling layer is calculated as follows:

$${y_c}=\frac{1}{H \cdot W}\sum\limits_{i=1}^{H} \sum\limits_{j=1}^{W} {x_{ijc}} \tag{15}$$

Where \(y_c\) is the output of the \(c\)-th channel, \(x_{ijc}\) is the input feature map at \((i, j, c)\), and \(H\) and \(W\) are the height and width of the feature map, respectively.
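Equation (15) reduces each channel to the mean over its spatial positions, collapsing \((H, W, C)\) to \((C,)\); a minimal sketch with an illustrative tensor:

```python
import numpy as np

# Global average pooling per Eq. (15): mean over the spatial dims per channel.
X = np.arange(24, dtype=float).reshape(2, 3, 4)   # (H, W, C) = (2, 3, 4)
y = X.mean(axis=(0, 1))                           # y_c for each channel c
assert y.shape == (4,)
```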

The calculation of softmax is as follows:

$$p(i)=\frac{e^{\delta_i}}{\sum\nolimits_{k=1}^{K} {e^{\delta_k}}} \tag{16}$$

Where \(p(i)\) represents the probability of each output, the sum of all \(p(i)\) is 1, and \(K\) is the number of classes in the multi-class classification problem.
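Equation (16) can be sketched as follows; the max-subtraction is a standard numerical-stability safeguard (it leaves \(p(i)\) unchanged) rather than something stated in the text, and the sample logits are illustrative.

```python
import numpy as np

def softmax(delta):
    """Softmax per Eq. (16), with the usual max-shift for numerical
    stability; shifting all logits by a constant does not change p(i)."""
    e = np.exp(delta - np.max(delta))
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
assert np.isclose(p.sum(), 1.0)   # probabilities sum to 1
```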