We used the AE scalograms dataset to compare the performance of existing state-of-the-art deep-learning algorithms; classification results are used to highlight their relative performance.
Resizing is the process of changing the dimensions of an image to match the input layer of a network. Because the input images varied in size, we resized them before training according to the input dimensions required by each network. To make a comprehensive evaluation, an 80–20% training–testing ratio was adopted: the dataset was randomly divided into 80% for training and 20% for testing (26,343 training samples and 6,586 testing samples per class). Of the training data, 20% was reserved as a validation set. Experiments were carried out with fourteen deep-learning classification architectures: VGG16, VGG19, DarkNet19, DarkNet53, ShuffleNet, GoogleNet, NASNet-Mobile, SqueezeNet, Inception-V3, MobileNet-v2, DenseNet201, ResNet18, ResNet50, and ResNet101. All models were trained and tested on a laptop with the following configuration: Intel Core i7 2.80 GHz processor, NVIDIA GeForce GTX 1070 GPU, and 16 GB RAM. MATLAB R2020b was used for the experiments.
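A minimal MATLAB sketch of this data preparation, assuming the scalogram images sit in class-named subfolders (the folder name `scalograms` is illustrative, and the 224-by-224 target size applies only to networks such as EfficientNet-b0 or MobileNet-v2; other architectures require their own input sizes):

```matlab
% Load scalogram images; class labels are taken from the subfolder names.
imds = imageDatastore('scalograms', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');

% 80/20 train-test split, then hold out 20% of the training data for validation.
[imdsTrain, imdsTest] = splitEachLabel(imds, 0.8, 'randomized');
[imdsTrain, imdsVal]  = splitEachLabel(imdsTrain, 0.8, 'randomized');

% Resize images on the fly to the input size the chosen network expects.
inputSize = [224 224];
augTrain = augmentedImageDatastore(inputSize, imdsTrain);
augVal   = augmentedImageDatastore(inputSize, imdsVal);
augTest  = augmentedImageDatastore(inputSize, imdsTest);
```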
Thereafter, evaluation is carried out on the validation set to minimize overfitting, and the unseen testing set is used to evaluate performance once the training procedure and parameter selection have been completed. Instead of randomly selecting images from the classes, uniform sampling was used to build training batches of 128, 64, and 16. Training used a learning rate of 0.0001, decaying exponentially by 0.1% every 50 iterations, with Stochastic Gradient Descent optimization and a validation patience of 5.
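These settings map onto MATLAB's `trainingOptions` roughly as follows (a sketch, not the exact training script: the built-in `'piecewise'` schedule drops the rate per epoch, so the 0.1%-every-50-iterations exponential decay is only approximated here):

```matlab
% SGD with momentum; early stopping after 5 stalled validation checks.
options = trainingOptions('sgdm', ...
    'InitialLearnRate',    1e-4, ...
    'LearnRateSchedule',   'piecewise', ...
    'LearnRateDropFactor', 0.999, ...   % ~0.1% decay per drop
    'LearnRateDropPeriod', 1, ...       % per epoch; per-iteration decay needs a custom training loop
    'MiniBatchSize',       128, ...     % batch sizes of 64 and 16 were also tested
    'Shuffle',             'every-epoch', ...
    'ValidationData',      augVal, ...
    'ValidationPatience',  5, ...
    'Verbose',             false);
% net = trainNetwork(augTrain, lgraph, options);  % lgraph: the chosen pretrained network's layer graph
```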
5.1 Performance Comparison
Image classification provides three well-accepted standard evaluation metrics: overall accuracy, average accuracy, and the confusion matrix. The performance evaluation metrics derived from the confusion matrix are given by Equations 4 to 7. The sensitivity (recall) of a classification model is the percentage of true cases (TP) that are correctly classified, as defined by Equation 5. The model's specificity relates to the percentage of true non-cases, i.e., AE signals from non-corrosion sources that are classified as non-corrosion sources. The F1-score, represented by Equation 6, was used to compare classification models; it is calculated as the weighted harmonic mean of the precision and recall values [18]. Precision (Equation 4) is the likelihood that a randomly chosen AE signal classified as a corrosion case is in fact a corrosion case. Recall, on the other hand, is the probability that a randomly picked corrosion instance is correctly identified as such. Because there was no prior information favouring precision or recall, the beta value was set to 1, and the F-score is termed the F1-score.
Table 2. Accuracy measures of the deep-learning classification models

| Model | Precision | Recall | F1-score | Accuracy (%) |
|---|---|---|---|---|
| NASNet-Mobile | 0.83 | 0.94 | 0.88 | 87.2 |
| ResNet-50 | 0.84 | 0.90 | 0.87 | 86.4 |
| Inception-V3 | 0.89 | 0.90 | 0.89 | 89.2 |
| ResNet-18 | 0.82 | 0.94 | 0.88 | 86.6 |
| MobileNet-v2 | 0.89 | 0.92 | 0.91 | 90.3 |
| ShuffleNet | 0.87 | 0.91 | 0.89 | 88.8 |
| Xception | 0.85 | 0.91 | 0.88 | 87.6 |
| EfficientNet-b0 | 0.88 | 0.95 | 0.91 | 91.0 |
\(Precision = \frac{TP}{TP+FP}\)  (4)

\(Recall = \frac{TP}{TP+FN}\)  (5)

\(F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall}\)  (6)

\(Accuracy = \frac{TP+TN}{TP+TN+FN+FP}\)  (7)
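A short MATLAB sketch of computing these metrics from a trained network's test-set predictions (assuming a binary corrosion/non-corrosion labeling, with corrosion as the positive class in the first row and column of the confusion matrix):

```matlab
% Predict on the held-out test set and build the confusion matrix.
predLabels = classify(net, augTest);
trueLabels = imdsTest.Labels;
C = confusionmat(trueLabels, predLabels);   % rows = true class, columns = predicted class

TP = C(1,1);  FN = C(1,2);  FP = C(2,1);  TN = C(2,2);

precision = TP / (TP + FP);                                % Eq. (4)
recall    = TP / (TP + FN);                                % Eq. (5)
f1        = 2 * precision * recall / (precision + recall); % Eq. (6)
accuracy  = (TP + TN) / (TP + TN + FN + FP);               % Eq. (7)
```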
Table 3. Complexity comparison of the deep-learning detection networks

| Model | No. of Layers | No. of Parameters (million) | Processing Time (s) |
|---|---|---|---|
| NASNet-Mobile | 913 | 4.2 | 7.70 |
| ResNet-50 | 177 | 23.5 | 4.90 |
| Inception-V3 | 315 | 21.8 | 8.77 |
| ResNet-18 | 71 | 11.1 | 2.40 |
| MobileNet-v2 | 154 | 2.2 | 5.50 |
| ShuffleNet | 172 | 0.862 | 2.30 |
| Xception | 170 | 20.8 | 14.32 |
| EfficientNet-b0 | 290 | 4.0 | 7.709 |
Table 2 summarizes the accuracy performance of the models, along with the precision, recall, and F1-score of each deep-learning model. 'Average accuracy' denotes the mean of the per-class accuracies. We observe a large difference in performance across the different models. Figure 4 displays a line graph showing that EfficientNet-b0 secured the highest accuracy.
Besides the pretrained CNNs listed in Table 2, VGG16, VGG19, AlexNet, and GoogleNet were also evaluated; their training accuracy remained around 50%, and their training loss was unknown. NASNet-Large is a costly network in terms of computational resources and training time due to its large number of layers; even with a minimum batch size of 4, the system used in this investigation could not support it. EfficientNet-b0 achieved the highest accuracy of 91.0%, followed by MobileNet-v2 with 90.3%. ShuffleNet and Inception-V3 also achieved reasonable accuracies of 88.8% and 89.2%, respectively. The highest recall and F1-score were achieved by EfficientNet-b0, while the best precision (0.89) was shared by Inception-V3 and MobileNet-v2; a good balance between precision and recall is found in the case of Inception-V3.
5.2 Computational Time Evaluation
Our next step was to investigate how fast each network processes a given dataset, since speed affects detection performance in practice. The underlying goal of the dataset is to build an intelligent platform that can be used by a variety of autonomous robots and equipment to improve the timeliness and efficiency of precision structural health management; such a platform must run in real time and with high accuracy. Since each detection network was tested on a single machine, the evaluation metric is the processing time (PT) required to process 1,000 images. Figure 4 shows the results of the detection-time assessment.
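A minimal sketch of how this PT could be measured in MATLAB (the `tic`/`toc` wall-clock timing, the batch size, and the choice of the first 1,000 test images are assumptions, not details from the text):

```matlab
% Measure the processing time (PT) for 1,000 test images.
imds1k = subset(imdsTest, 1:1000);                 % 1,000-image evaluation subset
aug1k  = augmentedImageDatastore(inputSize, imds1k);

tic;
classify(net, aug1k, 'MiniBatchSize', 128);
PT = toc;                                          % wall-clock seconds per 1,000 images
fprintf('Processing time: %.2f s per 1000 images\n', PT);
```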
The results in Table 3 demonstrate that Xception takes the most time to compute, with a PT of 14.32 seconds, followed by Inception-V3 and EfficientNet-b0 with PT values of 8.77 and 7.7 seconds, respectively. ResNet-18 and ShuffleNet are the fastest classification models, with processing times of 2.4 and 2.3 seconds, respectively. However, EfficientNet-b0 combines the best classification accuracy with a comparatively low processing time of 7.7 seconds. The number of learnable parameters of each model is also listed in Table 3. Furthermore, the processing times are displayed diagrammatically in Fig. 4.
5.3 Effect of learning rate on EfficientNet-b0
When training deep neural networks, the initial learning rate plays a key role in determining the step size at which the optimizer modifies the model's weights. For the model to function at its best and achieve effective convergence, choosing the right initial learning rate is crucial.
The findings in Fig. 6 show the necessity of balancing convergence speed and stability during training: a learning rate of 0.01 leads to the maximum testing accuracy. A high testing accuracy of 90.510% is obtained with a learning rate of 0.00001, although this is still below that attained with a learning rate of 0.01. A learning rate that is too low can result in slower convergence, which may cause the model to underfit the data. It is important to keep in mind, nevertheless, that selecting the ideal learning rate involves a trade-off between preventing underfitting and preventing overfitting. The testing accuracy reaches 89.200% when the learning rate is set lower, at 0.001; this may be the result of a slower convergence rate, which enables the model to adjust its weights more sensitively. On the other hand, a learning rate of 0.0001 results in a slightly lower testing accuracy of 88.760%, suggesting that the model may converge too slowly at this rate.
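A sketch of the learning-rate sweep behind these numbers (the candidate rates are those reported above; `lgraph` is assumed to be an EfficientNet-b0 layer graph already adapted to the number of classes, and the other options follow the earlier training setup):

```matlab
% Sweep the initial learning rate for EfficientNet-b0 and record test accuracy.
learnRates = [0.01 0.001 0.0001 0.00001];
testAcc = zeros(size(learnRates));

for i = 1:numel(learnRates)
    opts = trainingOptions('sgdm', ...
        'InitialLearnRate',   learnRates(i), ...
        'MiniBatchSize',      128, ...
        'ValidationData',     augVal, ...
        'ValidationPatience', 5, ...
        'Verbose',            false);
    net = trainNetwork(augTrain, lgraph, opts);
    pred = classify(net, augTest);
    testAcc(i) = mean(pred == imdsTest.Labels);   % fraction correctly classified
end
```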
The model's performance could be further improved by adjusting the learning rate hyperparameter using strategies like learning rate schedules or adaptive learning rate optimizers.
5.4 Effect of L2 Regularization on EfficientNet-b0
We examined the outcomes of evaluating the EfficientNet-b0 model's accuracy under various L2 regularization hyperparameter settings. L2 regularization, also known as weight decay, is a popular method for preventing overfitting in deep learning models: it adds a penalty term to the loss function based on the magnitudes of the model's weights. The results are displayed in Fig. 6.
The testing accuracy results for EfficientNet-b0 across L2 regularization values spanning from 0.00001 to 0.05 are shown in Fig. 6. Below, we examine the ramifications of these findings and how L2 regularization affects the model's performance.
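In MATLAB, this sweep amounts to varying the `'L2Regularization'` training option (the values below are those reported in the text; the remaining settings are assumed to match the earlier setup):

```matlab
% Sweep the L2 (weight-decay) coefficient for EfficientNet-b0.
l2Values = [0.00001 0.0001 0.005 0.01 0.05];
testAcc  = zeros(size(l2Values));

for i = 1:numel(l2Values)
    opts = trainingOptions('sgdm', ...
        'InitialLearnRate',   1e-4, ...
        'L2Regularization',   l2Values(i), ...  % penalty on the squared weight magnitudes
        'MiniBatchSize',      128, ...
        'ValidationData',     augVal, ...
        'ValidationPatience', 5, ...
        'Verbose',            false);
    net = trainNetwork(augTrain, lgraph, opts);
    pred = classify(net, augTest);
    testAcc(i) = mean(pred == imdsTest.Labels);
end
```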
As the L2 regularization hyperparameter changes, so does the testing accuracy: it varies from a high of 91.0% at L2 = 0.0001 to a low of 80.8% at L2 = 0.05. This shows that although some levels of L2 regularization aid accuracy, excessive regularization can result in considerable accuracy losses. With L2 = 0.0001, the model obtains the highest testing accuracy of 91.0%, indicating that a moderate dose of L2 regularization is advantageous for avoiding overfitting and enhancing the model's generalization to new data. In comparison to larger values (e.g., 0.01, 0.05), accuracy is improved with small values (e.g., 0.00001) and moderate values (e.g., 0.0001). This sensitivity highlights the significance of hyperparameter tuning to determine the appropriate regularization strength for a given dataset and model architecture. It is interesting to note that accuracy decreases when L2 is moved away from this optimum in either direction, whether set too high (e.g., 0.005, 0.01, 0.05) or too low (e.g., 0.00001). This exemplifies the trade-off between under-regularization, which results in overfitting and poor generalization, and over-regularization, which inhibits the model's capacity to learn complicated patterns.