Deep ensemble transfer learning-based approach for classifying hot-rolled steel strips surface defects

Over the last few years, advanced deep learning-based computer vision algorithms have been revolutionizing the manufacturing field, making it possible to solve several hard industrial problems, including flaw detection in various materials. Identifying steel surface defects is considered one of the most important such tasks in the steel industry. In this paper, we propose a deep learning-based model to classify six of the most common steel strip surface defects using the NEU-CLS dataset. We investigate the effectiveness of two state-of-the-art CNN architectures (MobileNet-V2 and Xception) combined with the transfer learning approach; the proposed approach uses an ensemble of these two pre-trained Convolutional Neural Networks. To perform a comparative analysis of the proposed architectures, several evaluation metrics are adopted, including loss, accuracy, precision, recall, F1-score, and execution time. The experimental results show that the proposed deep ensemble learning approach achieves an accuracy of 99.72%, higher than MobileNet-V2 (98.61%) and Xception (99.17%), while preserving fast execution times and small model sizes.


Introduction
Nowadays, hot-rolled steel strips are among the most important metal products, used in a wide range of industrial applications, including automotive, aeronautics, energy, and military. However, during the production process, several factors can affect the surface quality of steel strips, causing defects such as scratches, crazing, and inclusions. These surface defects degrade steel product quality, resulting in significant economic losses and considerable danger to worker safety due to reduced material strength and lifetime. Therefore, to increase steel strip productivity and minimize economic losses, the identification of common hot-rolled steel strip defects is a very important task.
Traditional methods used to identify hot-rolled steel strip defects are mainly based on visual observation by experienced inspectors. Due to many factors, manual steel surface defect identification is a hard, inefficient, and time-consuming task, especially for online defect classification on high-speed production lines [1]. According to [2], inspectors relying on such a method achieve a recognition rate of only 80% while inspecting around 0.05% of the total steel production. Thus, to overcome these issues, supervisors need an automatic system to identify the different types of steel surface defects.
With the recent advancement in artificial intelligence and computer vision techniques and technologies, several approaches were proposed to achieve automatic steel strip surface defects classification, including statistical, filter-based, and classical machine learning approaches.
Traditional machine learning-based methods for surface defect recognition consist of two main stages: feature extraction and classification. To extract visual features, several methods have been adopted over the years, including Scale-Invariant Feature Transform (SIFT), Local Binary Pattern (LBP), Gray Level Co-occurrence Matrix (GLCM), and Histogram of Oriented Gradients (HOG), among others. The extracted features are then fed into a classifier, such as a support vector machine (SVM), Random Forest (RF), or K-Nearest Neighbor (KNN), which predicts the defect class. The authors in [3] developed a method for hot-rolled steel surface defect recognition based on Adjacent Evaluation Completed Local Binary Patterns (AECLBPs) and an SVM classifier, achieving an accuracy of 98.93%. Mentouri et al. [4] combined a Binarized Statistical Image Feature (BSIF) extractor with a KNN classifier, achieving an accuracy of 99.18%. Ashour et al. [1] developed a steel surface defect classification method combining the Gray Level Co-occurrence Matrix (GLCM) and Discrete Shear Transform (DST) techniques; the proposed DST-GLCM approach achieved a classification rate of 96%. Zaghdoudi et al. [5] proposed an approach based on the Binary Gabor Pattern (BGP) to extract significant features of hot-rolled steel strip surface defects; after applying Principal Component Analysis (PCA), the reduced features were fed to an SVM classifier, achieving an accuracy of 99.33%. However, these methods face several limitations in complex scenes because they rely on handcrafted features designed by domain experts, which can be very hard to obtain in many applications, and they reach their limits in terms of accuracy on large and complex data. Moreover, migrating such algorithms to similar problems is very difficult, as new feature extractors must be designed for each new task.
Recently, deep learning algorithms have emerged as a new paradigm, outperforming the classical methods for classifying hot-rolled surface defects. The success of Convolutional Neural Network (CNN) models in computer vision-related tasks has motivated researchers to adopt them to solve several problems, including image classification [6], object detection [7], and image segmentation [8]. Due to their powerful feature extraction capabilities, CNNs have become one of the most effective deep learning solutions for automatic steel surface defect recognition. The continuous evolution of computational hardware (GPUs), software (TensorFlow, PyTorch), and data availability allows researchers to implement more advanced algorithms that achieve impressive results. The authors in [9] proposed an improved version of the VGG-19 model consisting of only 18 layers, where the first 15 layers are frozen (not trainable); they also introduced a maximum and average feature extraction module for further improvements, providing an accuracy of 97.62%. Jain et al. [10] proposed a GAN-based approach to generate synthetic data for fine-tuning a pre-trained CNN architecture on the NEU-CLS dataset [3], achieving an accuracy of 99.11%. Wang et al. [11] proposed an improved VGG-based model with a reduced number of parameters, called VGG-ADB, providing an accuracy of 99.63%. Li et al. [12] compared several CNN architectures for steel surface defect classification, including GoogLeNet, ResNet-18, MobileNet-V2, Vision Transformer (ViT), and CNN-T. The proposed CNN-T provided the best results among the tested models, achieving an accuracy of 99.17% compared to 96.94%, 98.06%, 98.33%, and 98.89% achieved by ViT, GoogLeNet, ResNet-18, and MobileNet-V2, respectively. However, all the aforementioned CNN-based architectures rely on a single-branch CNN model; the achieved results can be improved using an ensemble architecture.
Therefore, in this paper, we develop an efficient deep learning model that combines transfer learning with an ensemble of deep CNN architectures to classify hot-rolled steel strip surface defects from images. The main contributions of the current paper are as follows.
• A deep learning-based model is proposed to classify hot-rolled steel strips' surface defects with high performance.
• A feature fusion method based on an ensemble of pre-trained CNN models is used to improve the overall defect classification accuracy.
• A comparative analysis of the proposed architecture with other studies is performed to show the effectiveness of the proposed deep ensemble transfer learning approach over other available approaches.
• The impact of combining two state-of-the-art CNN models on the classification of hot-rolled steel strip surface defects is investigated against single-branch CNN architectures.
The remainder of the paper is organized as follows. In Section 2, we briefly introduce the concept and theoretical basis of CNN architecture and transfer learning technique. Section 3 details the proposed CNN model architectures. In Section 4, we describe the used dataset, implementation details, and evaluation of the proposed model by comparing it with state-of-the-art methods. Finally, we conclude the current paper while providing some future research directions in Section 5.

Theoretical basis
In the last few years, deep learning algorithms have been widely adopted to solve several computer vision-related tasks, including steel surface defect classification. Most recent advanced image classification techniques are based on a particular kind of deep neural network, the Convolutional Neural Network (CNN), which provides state-of-the-art results in image classification, object detection, and image segmentation. Therefore, in this section, we provide an overview of the CNN concept and the architectures adopted in our study.

CNN overview
CNNs are a special kind of deep learning algorithm mostly used to solve computer vision-related tasks. Modern CNN architectures originated from the famous LeNet architecture [13]. As shown in Fig. 1, a standard CNN architecture consists of two main parts: feature extraction and a classifier. The feature extraction part (or base model) consists of a series of convolutional and pooling layers stacked on top of each other to extract relevant features from the input image. The classifier part (or top model) consists of fully connected (dense) layers that utilize the features obtained from the feature extraction part to classify the content of the input image according to the list of classes, which in our study are steel surface defects.

Feature extraction part (base model)
The feature extraction part consists of two main building blocks, which are convolutional and pooling layers. The convolutional layer is the main building block in CNN architectures that consists of convolutional filters, where their main role is to detect features within the input image, including edges, lines, and shapes, among other visual features. To introduce nonlinearity to the convolutional layer outputs, each of them is followed by a non-linear activation function, such as Sigmoid, Rectified Linear Unit (ReLU), and Leaky ReLU. Based on the dimensions of the applied filters, the outputs of the convolutional layers are feature maps with new dimensions. Then, the generated feature maps are fed into a pooling layer, which downsamples the generated feature maps into smaller ones with lower dimensions. The size of the pooling layers' output depends on the hyperparameter selection. Then, the new feature map is fed to the next convolutional layer.
Classifier part (top model)

The output of the last feature extraction layer passes through a flattening layer, or in some cases a Global Average Pooling layer, to obtain a 1-dimensional array. This array is then fed to an ordinary fully connected network of dense layers, in which every node is connected to all nodes of the adjacent layers. The final layer performs the high-level classification using a softmax function to predict the probability of each class.
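The base/top split described above can be sketched in a few lines of Keras; the layer counts and widths here are illustrative assumptions, not the architecture used in this paper:

```python
# Minimal sketch of a CNN with a feature extraction part (base model)
# and a classifier part (top model). Layer sizes are illustrative only.
from tensorflow import keras
from tensorflow.keras import layers

def build_cnn(num_classes=6, input_shape=(200, 200, 1)):
    inputs = keras.Input(shape=input_shape)
    # Base model: stacked convolution + pooling blocks extract features.
    x = layers.Conv2D(32, 3, activation="relu", padding="same")(inputs)
    x = layers.MaxPooling2D(2)(x)
    x = layers.Conv2D(64, 3, activation="relu", padding="same")(x)
    x = layers.MaxPooling2D(2)(x)
    # Global average pooling collapses each feature map to a single value,
    # yielding a 1-dimensional feature vector.
    x = layers.GlobalAveragePooling2D()(x)
    # Top model: dense layers ending in a softmax over the classes.
    x = layers.Dense(128, activation="relu")(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)

model = build_cnn()
```

Here, Global Average Pooling is used instead of flattening, which keeps the head independent of the spatial size of the last feature map.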
Several CNN architectures have been introduced since LeNet, including AlexNet [14], VGG [15], GoogLeNet [16], and ResNet [17], to name a few. More recently, advanced CNN architectures have appeared, including lightweight and hybrid models. Lightweight models, such as MobileNet [18][19][20] and SqueezeNet [21], were mainly proposed for embedded systems with low computational power; in general, these architectures provide higher processing speed at the cost of lower accuracy. More complex and even more efficient CNN architectures combining traditional CNN concepts have also been proposed, including Inception-ResNet [22] and the Xception network [23], which combine Inception and ResNet ideas; the Xception architecture replaces the Inception modules with depthwise separable convolution modules. In this study, we combined the concept of transfer learning with the MobileNet-V2 and Xception networks to achieve steel surface defect classification. These two models were selected over other competing architectures for several reasons: pre-trained versions are readily available in the Keras module; both produce last feature maps of the same spatial size, so no additional layers are needed to align them; MobileNet-V2 is a lightweight model that can be executed on devices with low computational resources; and there is some evidence that the Xception model generalizes better than other available models outside the ImageNet dataset on which it was originally trained.

Transfer learning
Training deep learning models from scratch requires high computational resources, which forces us to use strategies that reduce the training time and processing power needed to accomplish the desired task. Transfer learning is one of the most effective solutions to overcome such issues, while also improving accuracy on small datasets. Therefore, in our research, we adopted the transfer learning technique to address some common deep learning issues, including the lack of data and long training times, by initializing the models' parameters from a pre-trained network. The transfer learning paradigm aims to transfer the knowledge of a pre-trained model to a similar problem in a different application, which in our case is an image classification task. Therefore, MobileNet-V2 and Xception CNNs pre-trained on the ImageNet dataset, provided as part of the Keras library, were adopted as the main CNN architectures in our proposed model to extract steel surface defect features. However, because these models were trained on the ImageNet dataset, which differs from the targeted dataset in this study, they do not work well out of the box for steel surface defect classification. To this end, we placed a custom classifier on top of the original feature extraction parts of the MobileNet-V2 and Xception models and retrained only the classifier part.
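As a minimal sketch of this scheme, assuming a MobileNet-V2 base and an illustrative dense head (not the paper's exact classifier):

```python
# Transfer-learning sketch: load an ImageNet-pre-trained base, freeze it,
# and train only a custom classifier on top. The dense head below is an
# illustrative assumption, not the classifier used in the paper.
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(200, 200, 3))
base.trainable = False  # keep the ImageNet feature extractor fixed

inputs = keras.Input(shape=(200, 200, 3))
x = base(inputs, training=False)        # inference mode for frozen layers
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(6, activation="softmax")(x)  # six defect classes
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```

With the base frozen, `model.fit` updates only the dense head, which is what makes training fast even on modest hardware.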

Proposed approaches
In this section, we provide detailed information about the adopted approaches, including MobileNet-V2, Xception, and Ensemble Learning.

Transfer learning using MobileNet-V2 and Xception networks
MobileNet-V2 [18] is an improved version of its predecessor MobileNet-V1 [20]. It is a lightweight and efficient deep CNN architecture released by Google in 2018, mainly designed for mobile devices and systems with low processing power. Like MobileNet-V1, it is based on depthwise separable convolutions, which reduce the number of parameters and hence the computational cost. Most deep CNN architectures possess a huge number of parameters and require heavy computation: compared to state-of-the-art architectures such as VGG-16 and ResNet-50, which have 138.5 and 25.6 million parameters respectively, MobileNet-V2 has only 3.5 million parameters. This reduced computational requirement is one of the main reasons for selecting MobileNet-V2 in the current paper; it is adopted to improve identification speed, which could help develop a hot-rolled steel strip surface defect identification system that works in real-time. Moreover, MobileNet-V2 introduces two main blocks for further improvements in both accuracy and speed: linear bottlenecks and inverted residuals.
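A quick back-of-the-envelope computation illustrates why depthwise separable convolutions cut the parameter count (bias terms omitted for simplicity):

```python
# Parameter counts for a standard convolution versus a depthwise separable
# one (depthwise + pointwise), ignoring bias terms.
def standard_conv_params(k, c_in, c_out):
    # One k x k filter per output channel, spanning all input channels.
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    # Depthwise: one k x k filter per input channel.
    # Pointwise: a 1 x 1 convolution mixing the channels.
    return k * k * c_in + c_in * c_out

k, c_in, c_out = 3, 64, 128
std = standard_conv_params(k, c_in, c_out)   # 73728
sep = separable_conv_params(k, c_in, c_out)  # 8768
print(f"standard: {std}, separable: {sep}, ratio: {sep / std:.3f}")
```

For a 3 × 3 convolution with 64 input and 128 output channels, the separable variant needs roughly 12% of the parameters of the standard one, matching the well-known reduction factor of about 1/c_out + 1/k².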
Xception, which stands for Extreme Inception, is a deep CNN architecture consisting of 71 layers. It is an extended version of the Inception network, developed by François Chollet [24] in 2017. In the original paper, the author showed that Xception outperformed Inception-V3, VGG-16, and ResNet-152, achieving a top-5 accuracy of 94.5% on the ImageNet dataset while reducing the number of parameters. The Xception model combines the Inception and ResNet ideas, replacing Inception modules with depthwise separable convolutions to reduce the number of parameters.

Transfer learning using the proposed ensemble learning model
Several studies have targeted hot-rolled steel surface defect classification using different CNN architectures, including AlexNet, VGGNet, and ResNet, to name a few. However, a single CNN model may not provide the desired results on a given dataset. Therefore, to improve the classification performance, we developed an ensemble of deep learning models consisting of the feature extraction parts of two state-of-the-art CNN architectures, the MobileNet-V2 and Xception models. The proposed model combines the strengths of the two fine-tuned models to obtain a more accurate and reliable model.
Ensemble learning is an advanced technique that combines multiple deep learning algorithms to improve on the performance of the individual models, resulting in a more reliable model. As shown in Fig. 2, the features extracted by the two base models are fused and passed through a stack of dense layers. The ReLU activation function is applied to each of the dense layers except the final layer. In addition, dropout with a probability of 20% and batch normalization layers are applied after the first two dense layers to overcome the over-fitting problem. The final layer uses a softmax function to output the probability of each predicted class: crazing, scratches, inclusion, patches, pitted surface, and rolled-in scale.
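A hedged sketch of this feature-fusion ensemble in Keras follows; the dense-layer widths are assumptions, as the exact head is specified in Fig. 2 of the paper:

```python
# Feature-fusion ensemble: frozen MobileNet-V2 and Xception feature
# extractors run in parallel, their pooled features are concatenated, and a
# shared dense head (ReLU, batch norm, 20% dropout on the first two dense
# layers, softmax over six classes) is trained on top.
# Dense-layer widths (256, 128, 64) are illustrative assumptions.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(200, 200, 3))

mnet = keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, input_shape=(200, 200, 3))
xcep = keras.applications.Xception(
    weights="imagenet", include_top=False, input_shape=(200, 200, 3))
mnet.trainable = False
xcep.trainable = False

f1 = layers.GlobalAveragePooling2D()(mnet(inputs, training=False))
f2 = layers.GlobalAveragePooling2D()(xcep(inputs, training=False))
x = layers.Concatenate()([f1, f2])  # fuse the two feature vectors

for units in (256, 128):  # first two dense layers: BN + 20% dropout
    x = layers.Dense(units, activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
x = layers.Dense(64, activation="relu")(x)
outputs = layers.Dense(6, activation="softmax")(x)
ensemble = keras.Model(inputs, outputs)
```

Because both backbones downsample a (200 × 200) input to the same spatial size, the pooled feature vectors can be concatenated directly without extra alignment layers.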
To make the comparison between the adopted approaches more realistic, we applied the same classifier part for the three models, which are MobileNet-V2, Xception, and the proposed Ensemble Learning model.

Experimental results
In this section, we evaluate the proposed approaches for classifying hot-rolled steel strip surface defects using the NEU-CLS dataset. First, we describe the dataset and the data preparation process, including data augmentation and data splitting. Then, the implementation details and experimental results are provided.

Dataset
The cornerstone of trusted deep learning-based models is data collection and preprocessing, a stage that requires a significant amount of effort. To evaluate the performance of the proposed model for steel surface defect classification, the Northeastern University (NEU-CLS) database [3] was used in the experiment. The NEU-CLS dataset is a publicly available dataset for hot-rolled steel strip surface defect classification that has been widely adopted in scientific research papers published in reputed journals, including [10] and [11]. This dataset includes six of the most frequent surface defects in hot-rolled steel strips: Crazing (Cr), Inclusion (In), Rolled-in Scale (RS), Patches (Pa), Scratches (Sc), and Pitted Surface (PS), as shown in Fig. 3. It provides a total of 1800 grayscale images with a resolution of (200 × 200) pixels, split equally into the six categories, with 300 samples per category.

Implementation details
In this experiment, we divided the dataset into training and testing sets at a ratio of 80% and 20%, respectively.
The training set consists of a total of 1440 images with 240 samples per defect type, while the testing set contains 360 images with 60 samples per class. Deep learning models require a large amount of data to improve their performance. Therefore, to increase the size of the training set, data augmentation techniques, including rotation and shifting, are applied using TensorFlow's ImageDataGenerator to generate additional samples. We created an HDF5 dataset file to speed up data loading on Google Colab and to fix the dataset partitioning, making the comparison between models more reliable.
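The augmentation step might look as follows in TensorFlow; the rotation and shift ranges here are illustrative assumptions, not the exact values used in the paper:

```python
# Sketch of the rotation/shifting augmentation applied to the training set.
# The range values are illustrative assumptions.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,        # random rotations (degrees)
    width_shift_range=0.1,    # random horizontal shifting
    height_shift_range=0.1,   # random vertical shifting
    fill_mode="nearest")

# Dummy batch standing in for (200 x 200) grayscale defect images.
images = np.random.rand(8, 200, 200, 1).astype("float32")
labels = np.zeros((8, 6), dtype="float32")
augmented, _ = next(datagen.flow(images, labels, batch_size=8))
print(augmented.shape)  # (8, 200, 200, 1)
```

Each call to the generator yields a freshly transformed batch, so the model never sees exactly the same image twice during training.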
As shown in Table 1, the width and height of the input images were both set to 200 pixels. The developed models were trained for 100 epochs with a batch size of 64 images. We used the Adam algorithm for model optimization.

Results and discussions
To evaluate the performance of the proposed models in recognizing hot-rolled steel strip surface defects and to make the experimental results more convincing, four main evaluation metrics were adopted: accuracy, recall, precision, and F1-score. These metrics can be measured according to Eqs. 1-4.
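In terms of the confusion-matrix counts TP, TN, FP, and FN, these metrics take their standard form:

```latex
\mathrm{Accuracy}  = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)

\mathrm{Recall}    = \frac{TP}{TP + FN} \qquad (2)

\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (3)

\mathrm{F1\text{-}score} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (4)
```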
where TP, TN, FP, and FN represent the True Positive, True Negative, False Positive, and False Negative counts, respectively. Table 2 summarizes the classification results achieved by the developed deep learning models on the test set in terms of loss, accuracy, precision, recall, F1-score, and inference time. The obtained results are based on the hardware specifications described in Section 4.2; results may differ depending on the machine configuration. The proposed ensemble learning-based model provides a higher accuracy of 99.72% and a lower loss of 0.052, demonstrating its effectiveness over the MobileNet-V2 and Xception models, which achieve only 98.61% and 99.17%, respectively. The ensemble learning model is also trained in around 29.51 min, compared to MobileNet-V2 (29.54 min) and Xception (36.58 min). Note that, using the transfer learning technique, we trained only the classifier part of all three models, with the feature extraction layers set as not trainable. However, this improvement comes at the cost of a slightly higher inference time.
For a better understanding of how the classifiers perform, in addition to the aforementioned evaluation metrics, we provide the loss, inference time, and confusion matrices of the proposed networks. The confusion matrix reports the recognition rate of a deep learning model by displaying the classification accuracy for each defect category. Figure 4 shows the confusion matrices of the adopted deep learning models.
The proposed ensemble learning model outperformed the MobileNet-V2 and Xception models: it identifies five of the six defect types on the test dataset with an accuracy of 100% (Fig. 4(c)), misclassifying only one image out of 360, an Inclusion defect classified as a Pitted Surface defect, which could be due to the strong similarity of the visual patterns of these two classes. Figures 5, 6, and 7 show the training and validation accuracies and losses of the models under evaluation: MobileNet-V2, Xception, and the Ensemble Learning-based model. They demonstrate that these models are free of under-fitting and over-fitting problems. It can also be seen that the ensemble learning-based model converges faster than the two other models, exceeding 90% accuracy in only 5 epochs. According to Fig. 7, the proposed model provides accuracies of 99.93% and 99.72% on the training and test datasets, respectively. Thus, our model can classify the defects in new data with an accuracy of 99.72%, which indicates the excellent generalizability of the proposed Ensemble Learning model. The proposed model also achieved training and validation losses of 0.033 and 0.052, respectively.
To further illustrate the effectiveness of the proposed model, the obtained accuracy is compared with state-of-the-art results achieved in recent studies published in high-ranked journals using the same NEU-CLS dataset. To this end, we selected two studies based on classical machine learning models and six deep learning-based models. According to the results presented in Table 3, our model achieves higher accuracy than AECLBP [3], DST-GLCM [1], the custom CNN architecture proposed in [25], VGG16-ADB [11], Improved VGG-19 [9], the fine-tuned CNN proposed in [10], CNN-T [12], and Multi-SE-ResNet34 with an attention mechanism [26] by 0.79%, 3.72%, 0.67%, 0.09%, 2.1%, 0.61%, 0.55%, and 0.52%, respectively.

Conclusions and future directions
In the current paper, we investigated the effectiveness of combining the feature extraction parts of two state-of-the-art CNN models to classify six different hot-rolled steel strip surface defects from the NEU-CLS dataset. We selected MobileNet-V2 and Xception as the main CNN architectures due to their high performance in the computer vision field in terms of accuracy and processing speed. Compared to the two other adopted models, the experimental results showed that the proposed ensemble learning model provides a higher classification rate of 99.72% and lower training time, while preserving a competitive processing speed of 90 ms per (200 × 200)-pixel image. Moreover, according to Table 3, the proposed approach achieved the highest accuracy among the state-of-the-art deep learning models developed in recent studies published in high-ranked journals. As future work, we intend to validate and evaluate the proposed approach on other defects and more challenging datasets with different image resolutions and conditions, such as varying lighting and noise levels.