The DL models in this study were built with Python 3.6 in the Anaconda Spyder environment. Experiments were run on a computer with an Intel Core i7-9700 processor, 16 GB of RAM, and an NVIDIA GeForce RTX 2060 graphics card under Microsoft Windows 10.
3.1 Deep-learning training results and model evaluation
We trained four CNN model architectures and compared the results of training with data in 128 × 128 pixel format and in 224 × 224 pixel format, as well as the results of training with batch size = 4, batch size = 8, batch size = 16, and batch size = 32. ResNet-34 performed poorly in the prediction of sandblasting defects. To identify the cause of this, we performed a comparison using GoogLeNet v4, which also employs residual learning. We also added a comparison using the default input size of GoogLeNet v4, which is 299 × 299 pixels.
For ease of description, we use [128, 4] to denote input data in 128 × 128 pixel format with batch size = 4 and [128, 8] to denote input data in 128 × 128 pixel format with batch size = 8. Similarly, there are [128, 16], [128, 32], [224, 4], [224, 8], [224, 16], [224, 32], [299, 4], [299, 8], [299, 16], and [299, 32].
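For illustration, these [image size, batch size] combinations can be enumerated with a few lines of Python. This is a minimal sketch with illustrative variable names, not code taken from the study:

```python
from itertools import product

# Image sizes (square, in pixels) and mini-batch sizes compared in this study
image_sizes = [128, 224, 299]
batch_sizes = [4, 8, 16, 32]

# Each training configuration is denoted [size, batch], e.g. [128, 4]
configurations = [[size, batch] for size, batch in product(image_sizes, batch_sizes)]
print(configurations)  # 12 combinations in total
```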
Figure 5 presents the historical accuracy rates of the AlexNet model. The differences among the results from the 12 sets of training data were small, and the accuracy rates all converged to 1.
Figure 6 presents the historical accuracy rates of the VGG-16 model. Although the accuracy rates for [128, 4], [128, 8], and [128, 16] fluctuated somewhat, they still converged to 1. The VGG-16 model performed poorly only when trained with [224, 4]: its accuracy rates could not effectively converge to 1 late in the training period, remaining around 0.65.
As VGG-16 is deeper and has more parameters than the other models, the RAM of our computer could not accommodate the input data in 224 × 224 pixel format with batch size = 32 or the input data in 299 × 299 pixel format. Training with these configurations could not be completed, and consequently, no results are reported for them.
Figure 7 presents the historical accuracy rates of the GoogLeNet v1 model. The differences among the results from the 12 sets of training data were small, and the accuracy rates all converged to 1.
Figure 8 displays the historical accuracy rates of the training results of the GoogLeNet v4 model. The training accuracy rates resulting from the 12 sets of training data all converged to 1; however, the validation accuracy rates fluctuated and could not effectively converge to 1.
Figure 9 shows the historical accuracy rates of the training results of the ResNet-34 model. The training accuracy rates resulting from the 12 sets of training data all converged to 1; however, the validation accuracy rates fluctuated sharply and could not converge to 1.
The objective of this study was to detect sandblasting defects in investment castings, so we attached more importance to the “unqualified” prediction results. An “unqualified” casting predicted as “unqualified” represents a true positive (TP), whereas an “unqualified” casting predicted as “qualified” represents a false negative (FN). Similarly, a “qualified” casting predicted as “qualified” represents a true negative (TN), whereas a “qualified” casting predicted as “unqualified” represents a false positive (FP).

We used recall, precision, and the F1 score to evaluate the model training results; their formulas are shown in Eqs. (1) through (3). Table 1 displays the training time, average time per image, and the other evaluation indices obtained by applying the 54 sets of trained weights to the test datasets.
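For reference, the conventional definitions of these three indices in terms of the TP, FN, and FP counts defined above are reproduced here; we assume the paper's Eqs. (1) through (3) follow these standard forms:

```latex
\begin{align}
\text{Recall}    &= \frac{TP}{TP + FN} \tag{1} \\
\text{Precision} &= \frac{TP}{TP + FP} \tag{2} \\
\text{F1 score}  &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}
\end{align}
```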
In terms of overall training time, AlexNet took the least time among the five models. GoogLeNet v1 was close to AlexNet in training time, followed by ResNet-34; on the whole, GoogLeNet v4 took the longest. In terms of average prediction time on the test dataset, AlexNet was the fastest of the five models, taking less than 3 milliseconds per image, followed by GoogLeNet v1 at 1.9 to 3.2 milliseconds per image. GoogLeNet v4 was the slowest, taking 9.6 to 12.4 milliseconds per image.
Table 1. Evaluation indices for each training model

| Compared parameters | Training time (min) | Average prediction time (ms) | Recall | Precision | F1 score |
| --- | --- | --- | --- | --- | --- |
| AlexNet | | | | | |
| [128, 4] | 39 | 2.9126 | 96.79% | 99.53% | 98.14% |
| [128, 8] | 20 | 0.9566 | 100.00% | 99.06% | 99.53% |
| [128, 16] | 11 | 0.9554 | 99.53% | 99.53% | 99.53% |
| [128, 32] | 10 | 1.0026 | 99.53% | 100.00% | 99.76% |
| [224, 4] | 92 | 1.1961 | 99.53% | 100.00% | 99.76% |
| [224, 8] | 51 | 1.1221 | 98.60% | 100.00% | 99.30% |
| [224, 16] | 30 | 1.1928 | 99.53% | 100.00% | 99.76% |
| [224, 32] | 27 | 1.1171 | 99.52% | 98.11% | 98.81% |
| [299, 4] | 167 | 1.6361 | 97.22% | 99.06% | 98.13% |
| [299, 8] | 94 | 1.5217 | 90.21% | 100.00% | 94.85% |
| [299, 16] | 56 | 1.5170 | 99.07% | 100.00% | 99.53% |
| [299, 32] | 49 | 1.4792 | 97.22% | 99.06% | 98.13% |
| ResNet-34 | | | | | |
| [128, 4] | 63 | 3.4315 | - | 0.00% | - |
| [128, 8] | 32 | 3.1633 | - | 0.00% | - |
| [128, 16] | 16 | 3.1761 | 100.00% | 0.94% | 1.87% |
| [128, 32] | 10 | 3.1393 | - | 0.00% | - |
| [224, 4] | 113 | 3.6855 | - | 0.00% | - |
| [224, 8] | 60 | 3.6504 | - | 0.00% | - |
| [224, 16] | 32 | 3.6421 | - | 0.00% | - |
| [224, 32] | 28 | 3.6302 | - | 0.00% | - |
| [299, 4] | 181 | 4.2100 | 96.79% | 99.53% | 98.14% |
| [299, 8] | 96 | 4.1471 | 100.00% | 99.06% | 99.53% |
| [299, 16] | 58 | 4.1420 | 99.53% | 99.53% | 99.53% |
| [299, 32] | 50 | 4.1393 | 99.53% | 100.00% | 99.76% |
| GoogLeNet v1 | | | | | |
| [128, 4] | 43 | 2.2590 | 83.00% | 96.70% | 89.32% |
| [128, 8] | 20 | 2.0037 | 99.51% | 96.70% | 98.09% |
| [128, 16] | 12 | 1.9985 | 99.43% | 81.60% | 89.64% |
| [128, 32] | 10 | 1.9783 | 98.87% | 82.55% | 89.97% |
| [224, 4] | 96 | 2.5093 | 99.48% | 89.62% | 94.29% |
| [224, 8] | 52 | 2.2668 | 97.12% | 95.28% | 96.19% |
| [224, 16] | 31 | 2.2733 | 98.97% | 90.57% | 94.58% |
| [224, 32] | 27 | 2.2327 | 98.26% | 53.30% | 69.11% |
| [299, 4] | 171 | 3.2084 | 96.80% | 100.00% | 98.38% |
| [299, 8] | 96 | 2.7287 | 98.97% | 90.57% | 94.58% |
| [299, 16] | 57 | 2.7042 | - | 0.00% | - |
| [299, 32] | 50 | 2.6987 | - | 0.00% | - |
| GoogLeNet v4 | | | | | |
| [128, 4] | 43 | 2.2590 | 83.00% | 96.70% | 89.32% |
| [128, 8] | 20 | 2.0037 | 99.51% | 96.70% | 98.09% |
| [128, 16] | 12 | 1.9985 | 99.43% | 81.60% | 89.64% |
| [128, 32] | 10 | 1.9783 | 98.87% | 82.55% | 89.97% |
| [224, 4] | 96 | 2.5093 | 99.48% | 89.62% | 94.29% |
| [224, 8] | 52 | 2.2668 | 97.12% | 95.28% | 96.19% |
| [224, 16] | 31 | 2.2733 | 98.97% | 90.57% | 94.58% |
| [224, 32] | 27 | 2.2327 | 98.26% | 53.30% | 69.11% |
| [299, 4] | 171 | 3.2084 | 96.80% | 100.00% | 98.38% |
| [299, 8] | 96 | 2.7287 | 98.97% | 90.57% | 94.58% |
| [299, 16] | 57 | 2.7042 | - | 0.00% | - |
| [299, 32] | 50 | 2.6987 | - | 0.00% | - |
| VGG-16 | | | | | |
| [128, 4] | 84 | 2.9479 | 98.56% | 96.70% | 97.62% |
| [128, 8] | 54 | 2.2591 | 24.24% | 7.55% | 11.51% |
| [128, 16] | 25 | 2.5013 | 92.17% | 100.00% | 95.93% |
| [128, 32] | 17 | 2.4478 | 66.46% | 100.00% | 79.85% |
| [224, 4] | 194 | 4.6935 | 99.07% | 100.00% | 99.53% |
| [224, 8] | 109 | 4.8705 | 98.95% | 88.68% | 93.53% |

Note: "-" indicates that a denominator equaled 0 during the calculation process, so the value could not be calculated.
Recall measures the proportion of actual positives that are correctly predicted as positive; its equation is shown in Eq. (1). AlexNet and GoogLeNet v1 predicted high proportions of the "unqualified" samples as "unqualified", with recall values exceeding 90.00% in most configurations; the recall of AlexNet with [128, 8] even reached 100.00%. In contrast, VGG-16 presented low recall values with [128, 8] and [128, 32]. We therefore speculate that batch size may affect the model training results.

Precision measures the proportion of true positives among all predicted positives; its equation is shown in Eq. (2). AlexNet displayed exceptional performance in this respect, achieving 100.00% precision with six of the datasets. The precision of VGG-16 with [128, 16], [128, 32], and [224, 4] also reached 100.00%. Compared with these two models, GoogLeNet v1 showed slightly poorer performance, achieving 100.00% precision with only one dataset. Although GoogLeNet v4 achieved 100.00% precision in three datasets, the quality of a model cannot be determined from a single index; other indices, such as the F1 score, must also be included for a comprehensive assessment.

The F1 score is a balanced measure of recall and precision; its equation is shown in Eq. (3). On the whole, AlexNet exceeded 98.00% with most of the datasets, reaching 99.76% with [128, 32], [224, 4], and [224, 16], followed by 99.53% with [128, 8], [128, 16], and [299, 16]. Although the F1 score of VGG-16 with [224, 4] was also 99.53%, VGG-16 was still slightly inferior to AlexNet overall. GoogLeNet v1 achieved F1 scores of 98.09% and 98.38% only with [128, 8] and [299, 4]; its ten other results were markedly lower, with F1 scores between 69.11% and 96.19% where they could be computed. A comprehensive review of Table 1 reveals that GoogLeNet v4 and ResNet-34 did not perform as expected in defect prediction. We therefore speculate that the investment-casting images used in this study were not suitable for training defect-detection models with residual-learning architectures.
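As a concrete illustration of how these indices (and the "-" entries in Table 1) are computed, the following minimal Python sketch derives recall, precision, and F1 from confusion-matrix counts and returns None whenever a denominator is zero; the function name and example counts are hypothetical, not taken from the study:

```python
def evaluate(tp, fn, fp):
    """Compute recall, precision, and F1 score from confusion-matrix counts.

    Returns None for any index whose denominator is zero, mirroring the
    "-" entries in Table 1.
    """
    recall = tp / (tp + fn) if (tp + fn) > 0 else None
    precision = tp / (tp + fp) if (tp + fp) > 0 else None
    if recall is not None and precision is not None and (precision + recall) > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = None
    return recall, precision, f1

# Hypothetical counts: 212 unqualified castings detected, 2 missed, 1 false alarm
print(evaluate(tp=212, fn=2, fp=1))
# A model that never predicts "unqualified" (tp = 0, fp = 0) has an undefined precision
print(evaluate(tp=0, fn=214, fp=0))
```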
We employed YOLO v3 to detect whether investment castings appeared in the machine-vision images, so we defined only one YOLO v3 category: "sample". Thus, when training YOLO v3, we did not investigate the impact of data size or batch size. A model with good object-detection performance should identify as many of the "sample" objects appearing in the images as possible while avoiding labeling other regions as "samples"; in other words, while increasing recall, a certain level of precision must also be maintained. The area under the precision-recall curve, i.e., the average precision (AP), is therefore generally used for evaluation. Calculations revealed that the AP for the "sample" category in this study was 99.83%, indicating that the YOLO v3 model presented excellent performance.
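For context, AP is the area under the precision-recall curve. One common way to approximate it is numerical integration over sampled recall-precision points, sketched below with hypothetical values; this is not necessarily the exact interpolation scheme used by the YOLO v3 evaluation code:

```python
import numpy as np

# Hypothetical (recall, precision) points sampled from a precision-recall curve,
# ordered by increasing recall; real values would come from the ranked "sample"
# detections produced by YOLO v3.
recall = np.array([0.0, 0.2, 0.4, 0.6, 0.8, 1.0])
precision = np.array([1.0, 1.0, 0.999, 0.999, 0.998, 0.996])

# Average precision approximated as the area under the precision-recall curve
ap = np.trapz(precision, recall)
print(f"AP = {ap:.4f}")
```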
3.2 Design and practical application of graphical user interface
To make it convenient for users to perform sandblasting-defect detection, we employed the PyQt5 library to design a graphical user interface (GUI). After images are captured under sufficient lighting, YOLO v3 analyzes them to detect investment castings, and the images are then sent to the CNN for defect detection. Upon starting, the application automatically loads the trained YOLO v3 weights and CNN weights. Before the sandblasting type of an investment casting is predicted, a fixed region of the camera image is first extracted (the dimensions of this region depend on the image size designated during CNN training). To prevent capture failure, we added a detection boundary, shown as the off-white frames in the images on the left of Fig. 10. When YOLO v3 detects an investment casting but the center point of its bounding box falls outside the detection boundary, the sandblasting type is not predicted and the frame displays warning text, as shown by the orange frame in the right image of Fig. 10(a). In contrast, if the center of the bounding box falls within the detection boundary, the CNN predicts the sandblasting type of the investment casting, as shown in Fig. 10(b): a green frame in the right image indicates a "qualified" casting and a red frame indicates an "unqualified" casting.
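The boundary check described above can be summarized by a small helper like the following; this is an illustrative sketch (hypothetical function and argument names), not the study's GUI code:

```python
def frame_action(box, boundary):
    """Decide how to handle a casting detected by YOLO v3.

    box and boundary are (x_min, y_min, x_max, y_max) tuples in image
    coordinates; boundary is the off-white detection boundary drawn on screen.
    """
    # Center point of the detected investment-casting bounding box
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2

    inside = boundary[0] <= cx <= boundary[2] and boundary[1] <= cy <= boundary[3]
    if not inside:
        # Center outside the boundary: show an orange frame with warning text
        # and skip CNN prediction (Fig. 10(a))
        return "warning"
    # Center inside the boundary: crop the designated region and run the CNN;
    # the GUI then draws a green frame ("qualified") or a red frame
    # ("unqualified") according to the prediction (Fig. 10(b))
    return "predict"
```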