To increase computational speed under reasonable computing resources, the entire training dataset was divided into manageable batches known as minibatches. In this study, the minibatch size was set to the maximum that the available computing resources could handle (batch_size = 16). Minibatching generally improves model performance. However, the relationship between batch size and performance is not yet well understood; some studies have shown that models trained with small batch sizes perform better [19].
The image data-processing, CNN model architecture, training, validation, and testing codes were written in Python using TensorFlow’s implementation of the Keras high-level API. The complete code, including data creation, data pre-processing, model design, and validation, is available in the GitHub repository at https://github.com/Smart-Structure-Design-Laboratory/ENDMILL-CNN.
6.1 Train, validation, and test data
Training and validation data were taken from the original Taguchi orthogonal array experiments. The data consist of the raw pixel intensities of the images and their associated class labels. The first and last 10 images captured from each experiment were discarded because of unsatisfactory machining surface and image quality. The default format of the images taken by the microscope camera was PNG in RGB color space. The RGB images were pre-processed: converted to grayscale, cropped to 240×240 pixels, and saved as NumPy arrays for model input.
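A minimal sketch of this pre-processing step is shown below, assuming OpenCV and NumPy; the directory layout, center-crop position, and function name are illustrative assumptions rather than the exact code in the repository.

```python
import glob
import os

import cv2
import numpy as np


def preprocess_images(image_dir, crop_size=240):
    """Convert RGB PNG images to grayscale, center-crop to crop_size x crop_size,
    and stack them into a single NumPy array for model input."""
    arrays = []
    for path in sorted(glob.glob(os.path.join(image_dir, "*.png"))):
        img = cv2.imread(path)                        # loaded as BGR by OpenCV
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # collapse to one channel
        h, w = gray.shape
        top, left = (h - crop_size) // 2, (w - crop_size) // 2
        crop = gray[top:top + crop_size, left:left + crop_size]
        arrays.append(crop)
    return np.stack(arrays)[..., np.newaxis]          # shape: (N, 240, 240, 1)


# Example: x_fine = preprocess_images("images/fine"); np.save("fine.npy", x_fine)
```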
As listed in table 9, the number of images in each class is unbalanced, which could bias the results. To balance the classes and to reduce the required computing resources and time, a subset of approximately 500 images per class was randomly sampled.
The validation dataset was also created by random selection, ensuring that no image appears in both the validation and training datasets. The validation dataset contained approximately 200 images per class.
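The balanced, non-overlapping sampling described above could be implemented roughly as follows; the 500/200 images per class follow the text, while the function name and the use of NumPy's random generator are assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def split_class(images, n_train=500, n_val=200):
    """Randomly sample a balanced train/validation subset for one class.
    `images` is an (N, 240, 240, 1) array of that class's images."""
    idx = rng.permutation(len(images))
    train_idx = idx[:n_train]
    val_idx = idx[n_train:n_train + n_val]  # disjoint from train_idx by construction
    return images[train_idx], images[val_idx]


# Example per class: train_fine, val_fine = split_class(np.load("fine.npy"))
```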
For inference, thirteen additional experiments were performed with arbitrary machining parameters (different from the Taguchi experiments), as listed in table 11. The experimental conditions and machining surface image acquisition methods of these additional experiments are the same as described in Section 3; however, instead of brand-new endmills, near-new endmills (used for 1 to 5 minutes) were used in all 13 additional experiments. A total of 7,347 images was obtained from these experiments, and the images were categorized into four classes, each having 200 images.
Table 11: Additional experiment for test data images
| Spindle Speed | Feed Rate | Depth of Cut | Diameter | Class |
|---|---|---|---|---|
| 24000 | 5 | 0.3 | 3 | Fine |
| 18000 | 10 | 0.7 | 4 | Fine |
| 24000 | 40 | 0.3 | 5 | Fine |
| 21000 | 20 | 0.5 | 3 | Smooth |
| 21000 | 35 | 0.8 | 4 | Smooth |
| 24000 | 10 | 0.2 | 5 | Smooth |
| 12000 | 20 | 0.3 | 3 | Rough |
| 3000 | 15 | 0.2 | 4 | Rough |
| 24000 | 15 | 0.9 | 5 | Rough |
| 24000 | 9 | 0.5 | 1 | Coarse |
| 21000 | 11 | 0.8 | 2 | Coarse |
| 15000 | 10 | 0.9 | 3 | Coarse |
| 12000 | 15 | 0.6 | 4 | Coarse |
6.2 CNN model performance comparison
Since CNNs are memory- and computation-intensive, the number of filters in each convolution layer was limited to 64 because of workstation hardware limitations. Hyperparameter tuning is a technique that randomly selects hyperparameters according to certain rules and ranges and explores optimal values by training and evaluating models. Four hyperparameters (number of Conv2D filters, Dense layer units, learning_rate, and weight_decay) were tuned on a particular CNN model, which took approximately 4 days on the workstation. The tuned and untuned models showed only slight performance differences, so, to save computation time, all models reported in this study are untuned.
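As an illustration of this random search over the four hyperparameters, a sketch using Keras Tuner is given below; the search ranges, the layer layout, and the use of Keras Tuner and tf.keras.optimizers.AdamW (TensorFlow ≥ 2.11) are assumptions, not the exact tuning script used in the study.

```python
import keras_tuner as kt
import tensorflow as tf


def build_model(hp):
    """Build a small CNN whose tunable hyperparameters mirror the four tuned here:
    Conv2D filters, Dense units, learning_rate, and weight_decay."""
    filters = hp.Int("filters", min_value=16, max_value=64, step=16)
    units = hp.Int("units", min_value=32, max_value=128, step=32)
    learning_rate = hp.Choice("learning_rate", values=[1e-2, 1e-3, 1e-4])
    weight_decay = hp.Float("weight_decay", min_value=1e-6, max_value=1e-4, sampling="log")

    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(240, 240, 1)),
        tf.keras.layers.Conv2D(filters, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(units, activation="relu"),
        tf.keras.layers.Dense(4, activation="softmax"),  # four surface classes
    ])
    model.compile(
        optimizer=tf.keras.optimizers.AdamW(learning_rate=learning_rate,
                                            weight_decay=weight_decay),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model


tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=20,
                        directory="tuning", project_name="endmill_cnn")
# tuner.search(x_train, y_train, epochs=10, batch_size=16,
#              validation_data=(x_val, y_val))
```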
Five optimizers (SGD, RAdam, AdamW, Adamax, and RMSprop) were tested with the CNN architecture described above. However, because of the poor performance of RMSprop, only the results of the other four optimizers are reported here. The hyperparameters used for each optimizer are listed in table 12. Each model was run three times, for 10, 20, and 30 epochs. In all cases, the training and validation errors saturated within a relatively small number of epochs because of the large amount of training data and the small number of classes.
Table 12: Various values of hyperparameters for optimizers
| Model | Optimizer | Learning rate | Weight decay | Momentum | Rho, ρ | Beta, β₁ | Beta, β₂ | Epsilon, ε |
|---|---|---|---|---|---|---|---|---|
| CNN | Adamax | 1e-3 | - | - | - | 0.9 | 0.999 | 1e-07 |
|  | AdamW | 1e-4 | 1e-6 | - | - | 0.9 | 0.999 | 1e-07 |
|  | RectifiedAdam | 1e-4 | 0.0 | - | - | 0.9 | 0.999 | 1e-07 |
|  | SGD | 1e-2 | - | 0.0 | - | - | - | - |
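The optimizer settings of table 12 correspond roughly to the following Keras configurations; this is a sketch assuming RectifiedAdam from the TensorFlow Addons package and AdamW from tf.keras.optimizers (TensorFlow ≥ 2.11; earlier versions provide it in TensorFlow Addons).

```python
import tensorflow as tf
import tensorflow_addons as tfa  # provides RectifiedAdam

# Adamax, AdamW, RectifiedAdam, and SGD configured with the values in table 12.
optimizers = {
    "Adamax": tf.keras.optimizers.Adamax(
        learning_rate=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-07),
    "AdamW": tf.keras.optimizers.AdamW(
        learning_rate=1e-4, weight_decay=1e-6,
        beta_1=0.9, beta_2=0.999, epsilon=1e-07),
    "RectifiedAdam": tfa.optimizers.RectifiedAdam(
        learning_rate=1e-4, weight_decay=0.0,
        beta_1=0.9, beta_2=0.999, epsilon=1e-07),
    "SGD": tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.0),
}

# Each optimizer is passed to model.compile(), and the model is trained
# three times, for 10, 20, and 30 epochs (batch_size=16).
```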
6.3 Train, validation and test accuracies
Figure 7 shows radar plots of the CNN models with the four optimizers in terms of epochs, learning_rate, train accuracy, validation accuracy, and test accuracy. Note that the epochs were normalized (0-1) for clarity. All four optimizers showed outstanding accuracy of over 90% regardless of the number of epochs. Models trained for 10 epochs performed better, while models trained for 30 epochs showed degraded performance due to overfitting.
The accuracy of the model varied with the choice of optimizer; the best results were obtained with RAdam, followed by AdamW, Adamax, and SGD, respectively. SGD appears to have a convergence problem due to its characteristically fixed learning rate. On the other hand, RAdam achieved more stable initial training than the other optimizers because of its adaptive learning rate and succeeded in finding the best optimum. AdamW showed a performance improvement with increasing epochs, although the difference was not significant, which is thought to be due to the effect of weight decay as training progressed.
The RectifiedAdam (RAdam) optimizer with 10 epochs showed the best performance, with a training accuracy of 96.30% and a validation accuracy of 93.31%, as shown in figure 8. The model achieves a good fit, with the training and validation learning curves converging and no sign of overfitting or underfitting. To test the generalization of the CNN model, i.e., to check how accurately it can predict new data unrelated to the training data, the model was evaluated on the test dataset using the model.predict() function. The test accuracy was 92.91%, which closely matches the validation accuracy.
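A minimal sketch of this test-set evaluation is shown below, assuming x_test and integer class labels y_test are available and model is the trained RAdam model; the variable names are illustrative.

```python
import numpy as np

# Predict class probabilities for the held-out test images and compare the
# argmax of each prediction with the true label to obtain test accuracy.
probs = model.predict(x_test, batch_size=16)
pred_labels = np.argmax(probs, axis=1)
test_accuracy = np.mean(pred_labels == y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```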
Evaluating a classifier is often significantly trickier than evaluating a regressor [20]. Several performance evaluations were performed on the best-performing CNN model, including the precision-recall curve shown in figure 9 (a), the receiver operating characteristic (ROC) curve, and the confusion matrix. The precision-recall curve summarizes the trade-off between recall (the true positive rate, also known as sensitivity) and precision (the positive predictive value). The ROC curve is a widely used performance evaluation metric for supervised classification problems that summarizes the trade-off between the true positive rate and the false positive rate. Figure 9 (b) shows the ROC curve; the dotted line in the figure represents the ROC curve of a purely random classifier. A well-performing classifier stays far from the central dotted line, towards the top-left corner.
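Both curves can be computed per class from the test-set probabilities, for example with scikit-learn as sketched below; the class ordering and the reuse of probs and y_test from the evaluation sketch above are assumptions.

```python
from sklearn.metrics import auc, precision_recall_curve, roc_curve
from sklearn.preprocessing import label_binarize

classes = ["fine", "smooth", "rough", "coarse"]  # assumed label order
y_true_bin = label_binarize(y_test, classes=range(len(classes)))  # one-vs-rest labels

for i, name in enumerate(classes):
    precision, recall, _ = precision_recall_curve(y_true_bin[:, i], probs[:, i])
    fpr, tpr, _ = roc_curve(y_true_bin[:, i], probs[:, i])
    print(f"{name}: AUC = {auc(fpr, tpr):.3f}")
```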
In the precision-recall curve, precision begins to drop beyond a recall of 0.82. In particular, the poor performance of the rough class is noticeable, apparently due to the lack of sufficient training and evaluation data for that class compared to the others. In addition, failure to extract distinct features due to the inconspicuous ridge-valley patterns between the rough and smooth classes might be another possible reason. The fine, smooth, and coarse classes show excellent results, with precision higher than 0.8 even as recall approaches 1, indicating that the model is satisfactorily trained.
The overall trend of the ROC curve matches the precision-recall curve well. The area under the curve (AUC) is a measure of classifier performance; a perfect classifier has an AUC of 1. Here, two of the classes have an AUC of 1, while the other two have very high AUC values, which confirms that the model generalizes well. Visual observation shows that the rough class has the lowest AUC, for the possible causes explained above.
Cross-validation is often used to evaluate models, but it is not generally preferred for classification problems, especially those with skewed datasets; for classification tasks, the confusion matrix is generally considered superior. Figure 10 shows the confusion matrix for the model with the RAdam optimizer. The most noticeable value in the confusion matrix is the predicted label for the rough class: a significant number of coarse and smooth images were mispredicted as rough. This is consistent with the precision-recall and ROC curves and reconfirms that the machined surface of the rough class is difficult to distinguish from those of the smooth and coarse classes. Similarly, a small fraction of smooth images were incorrectly predicted as fine, possibly because of training data with surface roughness close to 1 µm, since 1 µm was taken as the boundary distinguishing the fine class from the smooth class; the classifier might have been confused because the average surface roughness of the fine class is 0.7382 µm. The other predicted classes matched the actual classes well.
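A confusion matrix like the one in figure 10 can be computed from the test predictions, for example with scikit-learn; the sketch below reuses pred_labels and y_test from the evaluation sketch above.

```python
from sklearn.metrics import confusion_matrix

# Rows are the actual classes and columns the predicted classes; off-diagonal
# counts show, e.g., how many coarse or smooth images were predicted as rough.
cm = confusion_matrix(y_test, pred_labels)
print(cm)
```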
From all of the model performance indicators above, it can be concluded that DL models with a proper architecture, trained on relatively small surface roughness image datasets, can achieve significant accuracy.