This section describes the datasets and evaluation metrics, presents the implementation results with comparisons to other works, and provides a detailed quantitative assessment.
A. Datasets and Evaluation Metrics
This paper uses the WM-811K dataset and the 21-Defect dataset to evaluate the effectiveness of the proposed architecture. The 21-Defect dataset, shown in Fig. 5, was extracted by [15] from real wafer maps in industry to provide more classes. It contains a total of 16388 wafer maps in 21 categories. Since it has more categories than WM-811K and some categories occur rarely, its data imbalance is more severe than that of WM-811K; we expect the proposed method to handle such harder tasks as well. For both datasets, we use 70% of the data as the training set and 30% as the test set. The training set is fed to the GAN for training, the generated data are merged back into the training set, and the merged data are then used to train the classification network. The test set is kept independent: it is never seen by the GAN and never mixed with GAN-generated data, ensuring the fairness of the test results. Since WM-811K is collected from different lots, the wafer map dimensions vary with the die sizes of the lots, so we resize all data to 64\(\times\)64 to preserve the features of each size as much as possible.
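The preprocessing above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a wafer map is a 2D grid of die labels and uses nearest-neighbor interpolation for the resize; the function names are our own.

```python
import random

def resize_nearest(wafer, out_h=64, out_w=64):
    """Nearest-neighbor resize of a 2D wafer map (a list of rows of
    die labels) to a fixed out_h x out_w grid, so maps from lots with
    different die sizes share one input resolution."""
    in_h, in_w = len(wafer), len(wafer[0])
    return [[wafer[i * in_h // out_h][j * in_w // out_w]
             for j in range(out_w)]
            for i in range(out_h)]

def split_70_30(samples, seed=0):
    """Shuffle and split a sample list into 70% train / 30% test."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * 0.7)
    return shuffled[:cut], shuffled[cut:]
```

Only the training portion of the split would then be passed to the GAN; the test portion stays untouched.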
Most existing GAN methods rely on human evaluation. However, human evaluation is often biased toward the quality of the generated samples and ignores their diversity. We therefore evaluate G2LGAN with the method proposed by [22], which includes the Inception Score (IS), Mode Score (MS), Fréchet Inception Distance (FID), Kernel Maximum Mean Discrepancy (Kernel MMD), Wasserstein Distance (WD), and the 1-nearest-neighbor (1-NN) classifier. For a good model, IS and MS should be as high as possible, FID, Kernel MMD, and WD should be as low as possible, and the 1-NN accuracy should be as close to 0.5 as possible.
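The 1-NN criterion can be made concrete with a small sketch. Assuming each sample is already reduced to a feature vector (in practice these come from a pretrained network), a leave-one-out 1-NN classifier labels real features 1 and generated features 0; an accuracy near 0.5 means the two distributions are indistinguishable. The function name is ours, not from [22].

```python
def one_nn_accuracy(real_feats, gen_feats):
    """Leave-one-out 1-NN two-sample test: classify each feature
    vector by its nearest other vector (squared Euclidean distance)
    and return the fraction labeled correctly as real vs. generated."""
    data = [(f, 1) for f in real_feats] + [(f, 0) for f in gen_feats]
    correct = 0
    for i, (fi, yi) in enumerate(data):
        best_j, best_d = None, float("inf")
        for j, (fj, _) in enumerate(data):
            if j == i:
                continue
            d = sum((a - b) ** 2 for a, b in zip(fi, fj))
            if d < best_d:
                best_d, best_j = d, j
        correct += data[best_j][1] == yi
    return correct / len(data)
```

If the generated features collapse far from the real ones, the accuracy approaches 1; if they mimic the real distribution, it drops toward 0.5.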
For the classification network, most existing methods use accuracy to judge model performance. When the dataset is imbalanced, however, accuracy overestimates the model and loses fairness. Therefore, we use precision, recall, and F1-score as the evaluation metrics of our model. Because we consider the metrics of every category equally important, we use macro-averaging rather than weighted averaging when aggregating the multi-class metrics.
B. Implementation Results
We evaluate the effectiveness of the data augmentation network and the classification network separately. For the data augmentation network, the proposed method is compared with CGAN, ACGAN, and BAGAN. Since the official source code of these methods is not available, we re-implemented them and obtained results similar to those reported. For the wafer classification network, we train the model on the augmented data and compare it with state-of-the-art works.
Our method is implemented in TensorFlow. All models are trained with adaptive moment estimation (Adam). The G2LGAN optimizer uses β1 = 0 and β2 = 0.9, with a learning rate of 0.0004 for the discriminator and 0.0001 for the generator. G2LGAN first trains for 3 epochs on all the data and then for 10 epochs on each class of data. Each epoch contains 10000 iterations, and the batch size is set to 64. The optimizer of the classification network uses β1 = 0.9 and β2 = 0.999 with an initial learning rate of 0.1. We train on WM-811K plus the virtual data generated by G2LGAN for 1000 epochs, decaying the learning rate by a factor of 10 at epochs 200, 500, and 800.
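The step-decay schedule for the classification network can be written as a small function; this is a sketch matching the numbers stated above (initial rate 0.1, divided by 10 at epochs 200, 500, and 800), with a function name of our choosing.

```python
def step_decay_lr(epoch, initial_lr=0.1, milestones=(200, 500, 800), factor=10):
    """Step-decay schedule: divide the learning rate by `factor`
    once for every milestone epoch already reached."""
    drops = sum(epoch >= m for m in milestones)
    return initial_lr / (factor ** drops)
```

So epochs 0-199 use 0.1, epochs 200-499 use 0.01, epochs 500-799 use 0.001, and epochs 800-999 use 0.0001.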
C. Quantitative Assessment of G2LGAN
We compare with conditional image GANs that can generate specified classes; the generated images are shown in Fig. 4. Take Donut as an example: it accounts for only 2.17% (555 images) of the patterned classes, making it one of the minority classes. From Fig. 6, we can see that G2LGAN generates better results than the other methods: it not only preserves the global features of the wafer map but also reproduces the class-specific features of Donut.
However, evaluating a model only by its generated images tends to focus on visual quality and ignore diversity. We therefore conduct a quantitative analysis of each method, with the results shown in Table II. When trained on WM-811K at 64×64 resolution, G2LGAN achieves an IS of 8.763, an FID of 15.241, and a 1-NN accuracy of 0.531. All methods score low on IS and MS. This is because IS and MS are computed with an Inception network pre-trained on ImageNet. As shown in Table III, when we classify the WM-811K dataset with InceptionNet pre-trained on ImageNet, more than half of the data are classified as Petri dish, and more than 90% fall into only 4 of the 1000 ImageNet classes. Since the spatial distribution of wafer map images differs greatly from that of ImageNet data, the WM-811K data cannot exhibit enough diversity under the ImageNet classes no matter how well they are generated, so the IS and MS scores remain low.
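This sensitivity of IS to the classifier's label distribution is visible from its definition, IS = exp(E_x KL(p(y|x) || p(y))): if almost all samples land in a few ImageNet classes, the KL terms are small and the score stays low. A minimal sketch of the computation from per-sample class probabilities (function name is ours):

```python
import math

def inception_score(probs):
    """Inception Score from per-sample class probabilities p(y|x):
    exp of the mean KL divergence between each sample's distribution
    and the marginal p(y) over all samples."""
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[j] for p in probs) / n for j in range(k)]
    mean_kl = sum(
        sum(p[j] * math.log(p[j] / marginal[j]) for j in range(k) if p[j] > 0)
        for p in probs
    ) / n
    return math.exp(mean_kl)
```

When every sample yields the same distribution (no diversity), the KL terms vanish and IS = 1, its minimum; IS grows only when confident predictions are spread over many classes.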
The proposed method scores 15.241, 0.478, and 7.402 on FID, MMD, and WD, respectively. These three distance-based metrics use InceptionNet pre-trained on ImageNet to extract features and then measure the distance between the real data and the data generated by each method. Since feature maps are compared directly rather than classification results, the dataset has less impact on these metrics. The 1-NN accuracy of G2LGAN is 0.531. The ideal 1-NN accuracy for a GAN is 0.5, meaning the generated results are so realistic that the 1-NN classifier cannot distinguish real from generated data. The experimental results show that the proposed method is the best on all metrics, and its 1-NN accuracy in particular is very close to the ideal value. In the next subsection, we combine the generated data with the training data to further verify whether G2LGAN works well in a real application.
D. Quantitative Assessment of Wafer Map Classification
We combine the images generated by G2LGAN with the training set and apply undersampling to balance the dataset. Our undersampling differs from traditional random undersampling, which permanently deletes excess data: we instead randomly subsample the majority classes anew at each epoch. The advantage of this approach is that no valid data is deleted, and data left out in one epoch may be trained in a later one. For classes with fewer than 5000 original samples, we generate data to bring the count up to 5000. The balanced data distribution is shown in Table IV.
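The balancing logic can be sketched as below. This is our own illustration of the scheme described in the text, under the assumption that minority classes are topped up to the 5000-sample target with generated data while majority classes are re-subsampled each epoch; the function names are hypothetical.

```python
import random

def balance_counts(class_counts, target=5000):
    """Per class, how many synthetic samples to generate: classes
    below `target` are topped up; larger classes get none and are
    handled by undersampling instead."""
    return {c: max(0, target - n) for c, n in class_counts.items()}

def epoch_undersample(samples_by_class, target=5000, seed=0):
    """Per-epoch undersampling: majority classes are randomly
    subsampled to `target` for this epoch only, so nothing is
    deleted and unused samples can be drawn in later epochs."""
    rng = random.Random(seed)
    batch = []
    for c, samples in samples_by_class.items():
        if len(samples) > target:
            batch.extend(rng.sample(samples, target))
        else:
            batch.extend(samples)
    return batch
```

Calling `epoch_undersample` with a fresh seed each epoch yields a different subsample of the majority classes every time.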
Table V shows the classification results of different works, with the best result marked in red. In [10], VGG16 is used as the classification network, which therefore has the largest number of parameters; when training data is insufficient, such a large network structure leads to overfitting. [15] and [16] build their classification networks from standard CNN components, and a standard CNN uses more parameters than depthwise separable convolution. [17] uses only three convolutional layers and one fully connected layer to achieve a low parameter count, but too few parameters leave the model unable to effectively infer the correct defect category. In [18], depthwise separable convolution is also used as the backbone of the classification network.
[14] and [19] are the newest works. In [14], a multi-granularity GAN is used to generate synthetic wafer maps for WMDR, which is similar to our method. However, our G2LGAN achieves better classification accuracy since it extracts global features in the first stage and then fine-tunes the model on each class in the second stage. In [19], an additional 2DPCA framework is used to extract more features from wafer maps, but it also increases the overall complexity of the classification model. In contrast, we focus on extracting low-dimensional features and reducing the number of convolutions on high-dimensional features, which lowers the parameter count while maintaining high accuracy.
Compared with current methods, our proposed method is the best on almost all metrics. Since most methods do not use undersampling to limit the amount of data in the majority classes, their accuracy is overestimated. Because our classification network uses depthwise separable convolution and focuses on low-dimensional features, it maintains accuracy with a low parameter count.
A network architecture with fewer parameters commonly sacrifices model accuracy. We therefore emphasize data augmentation to balance the dataset and improve model performance. In Table V, we also report our method without G2LGAN as a reference. The results show that augmenting the dataset with G2LGAN increases accuracy by 9.16%, F1-score by 8.76%, precision by 6.31%, and recall by 11.39%, demonstrating that the data generated by G2LGAN brings an obvious and effective improvement.
Table VI shows the confusion matrix, which demonstrates the prediction accuracy of the proposed method for each class. The numbers in the matrix give the distribution of predicted versus actual labels, and the subscripts on the diagonal give the recall of each class. The None class is clearly misclassified more often than the others. Since we want to keep the test set fair, we do not balance it, which further explains why the proposed method reaches only 90.9% precision. The per-class performance is shown in Table VII. Note that Donut performs worst among the WM-811K categories because it has fewer test samples, so a small number of misclassifications easily pulls down its score.
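The layout of such a confusion matrix, with per-class recall along the diagonal, can be reproduced with a short sketch (function name is ours; rows are true labels, columns are predictions):

```python
def confusion_with_recall(y_true, y_pred, classes):
    """Build a confusion matrix (row = true class, column = predicted
    class) and the per-class recall that annotates its diagonal."""
    idx = {c: k for k, c in enumerate(classes)}
    n = len(classes)
    mat = [[0] * n for _ in range(n)]
    for t, p in zip(y_true, y_pred):
        mat[idx[t]][idx[p]] += 1
    recall = []
    for k in range(n):
        row_sum = sum(mat[k])  # number of true samples of class k
        recall.append(mat[k][k] / row_sum if row_sum else 0.0)
    return mat, recall
```

Because recall divides by the (unbalanced) row sums of the true labels, a class with few test samples, like Donut, can see its recall drop sharply from just one or two misclassifications.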
We also conduct experiments on 21-Defect. Since this dataset is very small, with some categories containing only single-digit numbers of samples, we first apply rotation and flipping as preliminary data augmentation and then use G2LGAN to generate data for each category. Since 21-Defect is a non-public dataset, we first show the difference between results with and without G2LGAN. As shown in Table VIII, using G2LGAN improves every metric by 15%. Compared with our previous work [18], although our F1-score is only 0.1% higher, our model is 70% smaller.
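The preliminary rotation-and-flip augmentation can be sketched as follows. This is an illustrative implementation for square wafer maps, assuming the full dihedral set of 8 variants (4 rotations of the map and of its horizontal flip); the exact set used in the experiments is not specified in the text.

```python
def augment_rot_flip(wafer):
    """Return 8 variants of a square 2D wafer map: the 4 rotations
    of the original followed by the 4 rotations of its horizontal
    flip (the first variant is the original itself)."""
    def rot90(m):
        # Rotate a 2D list 90 degrees clockwise.
        return [list(row) for row in zip(*m[::-1])]

    def hflip(m):
        # Mirror each row left-to-right.
        return [row[::-1] for row in m]

    variants = []
    for base in (wafer, hflip(wafer)):
        m = base
        for _ in range(4):
            variants.append(m)
            m = rot90(m)
    return variants
```

For a class with only single-digit sample counts, this alone multiplies the data eightfold before G2LGAN generates the rest.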
WM-811K and 21-Defect have overlapping categories, including Center, Donut, Edge-Arc (Edge-Loc), Edge-Ring, Random, Scratch-Arc & Scratch-Line, and Near-full. We feed the overlapping-category data of 21-Defect into the classification model trained on WM-811K to test whether data outside the training set can be correctly classified. The confusion matrix of the classification results is shown in Table IX: the horizontal axis is the 21-Defect data label, and the vertical axis is the label predicted by the Our+ model trained on WM-811K. The experimental results show that most overlapping categories are accurately classified, indicating that our model does not overfit and is robust.