Defect inspection in semiconductor images using FAST-MCD method and neural network

Most defect inspection methods used in semiconductor manufacturing require design layout or golden die images. Unlike methods that require such additional information, this paper presents a method for automatic inspection of defects in semiconductor images using a single image. First, we devise a method to classify images into four types: flat, linear, patterned, and complex, using cosine similarity. For linear and patterned images, we obtain defect-free images that retain the structure. A flat image is then obtained by subtracting the defect-free image from the input image. The FAST-MCD method then estimates the parameters of the inlier distribution of the flat image and uses them to detect defects. A segmentation neural network is used to detect defects in complex images. Unlike conventional methods that only work on a specific structure, our method classifies structures and finds defects in each structure. We use 16 defective images in our experiments; our method detects all 16 defective images, while the conventional methods detect fewer.


Introduction
Speed, accuracy, and repeatability are required for defect inspection in semiconductor manufacturing. These requirements are becoming more stringent as the fabrication process has become more sophisticated in recent years. Defects in semiconductors affect the appearance, functionality, efficiency, and stability of devices. Manual inspection is subjective, and its precision depends on the inspector's condition, such as eye fatigue. Therefore, automatic optical inspection continues to improve to detect defects and increase yield in semiconductor manufacturing [12,20]. Non-destructive visual inspection is critical in the industry to assist or replace subjective and repetitive manual inspection processes.
Correspondence: Chang-Ock Lee (colee@kaist.edu), Jinkyu Yu (hortensia@kaist.ac.kr), Songhee Han (shee33.han@samsung.com). 1 Department of Mathematical Sciences, KAIST, Daejeon 34141, Korea. 2 Samsung Electronics, Yongin, Gyeonggi-do 17113, Korea. Published online: 3 October 2023, The International Journal of Advanced Manufacturing Technology (2023) 129:1547-1565.

Defect inspection methods in semiconductor images can be classified into four types: model-based algorithms, neural networks, the Die-to-Database (D2DB) method, and the Die-to-Die (D2D) method. Many algorithms have been developed to find anomalies in various images, e.g., phase only transform [1], principal component analysis [6], self-similarity [9], discrete cosine transform [33], independent component analysis [34], and a-contrario detection [17]. Most of these methods assume a specific structure, such as flat or patterned, and are designed to fit that structure. Recently, as neural networks have shown good performance in imaging problems, many methods using neural networks have been proposed [10,55,57,60]. However, unlike many neural network problems, semiconductor images have no benchmark data. Neural networks also have the disadvantage that their results are difficult to interpret. Most methods for finding defects, especially in semiconductor manufacturing, are D2DB methods or D2D methods. Traditional D2DB methods [31,36,48] require preprocessing to align the database with an image; inspection is then performed using the aligned database. There have been attempts to apply neural networks [30,39,42] to the D2DB method, but an alignment step is still needed. Traditional D2D methods [21,50,64] use golden die images to form a difference image from which defects are found. As with the D2DB methods, a neural network [3] has been applied to the D2D method, but golden die images are still needed to train the network. In addition, the D2DB and D2D methods are very sensitive to the alignment process. A multiple scanning image method [37] is possible if other sensor images are available.
In this paper, we present a method for inspecting defects that removes the ambiguity of neural networks as much as possible, using one image without additional information. First, we present a method for classifying images into four types: flat, linear, patterned, and complex, using cosine similarity. A flat image is an image in which the background, excluding defects, is almost constant up to Gaussian noise. A linear image is an image that is shift invariant in a certain direction. If a particular shape appears repeatedly with a certain period, it is a patterned image. A complex image is one in which all three of the above characteristics are absent. For linear images and patterned images, we reconstruct defect-free images. Then, a flat image is created by subtracting the defect-free image from the input image. Under the assumption of Gaussian noise, the histogram of a flat image follows a normal distribution. Defects are outliers in this distribution and can bias its parameters, so we minimize their influence by estimating the inlier distribution. This distribution can be estimated using the minimum covariance determinant (MCD) method [46], a highly robust estimator of multivariate location and scatter. The MCD method finds the part of the data with the minimum covariance determinant consisting only of inlier data. Then, defects are found by thresholding with respect to the inlier distribution. We use a segmentation neural network for complex images. Figure 1 shows the four typical images and their defect regions.
The rest of this paper is organized as follows: In Sect. 2, we briefly review the literature on model-based algorithms for single image inspection and explain basic tools. Section 3 describes the classifier that classifies images into four types and two ways to remove the structure of linear and patterned images. A segmentation neural network is also described in Sect. 3. Experimental results for several data sets are given in Sect. 4. We conclude this paper with remarks in Sect. 5.

Previous works
This section briefly introduces conventional methods for finding anomalies in a single image u ∈ R h×w .As mentioned in Sect. 1, most of these methods work on specific structures such as flat or patterned.

Works for flat images
For flat images, there are several ways to find defects. The simplest method [6] uses the mean and standard deviation of the image. Let μ_u and σ_u be the mean and standard deviation of the image u, respectively. Then, the binary image ỹ representing defects is obtained using the threshold as follows:

ỹ_ij = 1 if |u_ij − μ_u| > c σ_u, and ỹ_ij = 0 otherwise.

The constant c is usually assigned a value between 3 and 5. Because the mean and standard deviation of the entire image are used, the results will vary if the image has a large defect. Therefore, a method for estimating the inlier distribution without being affected by defects is needed, which is presented in Sect. 2.2.2.
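As a sketch, the global-threshold rule above takes only a few lines of NumPy (a minimal illustration, not the implementation of [6]; the function name and the default c = 4 are ours):

```python
import numpy as np

def threshold_defects(u, c=4.0):
    """Binary defect map: flag pixels more than c standard deviations
    from the global mean (c is usually chosen between 3 and 5)."""
    mu, sigma = u.mean(), u.std()
    return (np.abs(u - mu) > c * sigma).astype(np.uint8)
```

Note that a single large defect inflates both `mu` and `sigma`, which is exactly the weakness the inlier-distribution estimate of Sect. 2.2.2 addresses.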
Another method is to divide the image into two regions using linear discriminant analysis (LDA) [50]. This method finds the optimal threshold t*. Let C_0(t) = {u_ij | u_ij < t} and C_1(t) = {u_ij | u_ij ≥ t}, where u_ij is the value of u at the pixel (i, j). Let μ_i(t) and σ²_i(t) be the mean and variance of the set C_i(t) for i = 0, 1. The farther apart the means of the two sets C_0(t) and C_1(t) and the smaller the variances of the sets, the better the division. That is, the objective function J(t) can be written as

J(t) = (μ_0(t) − μ_1(t))² / (σ²_0(t) + σ²_1(t)).

Then, we find the value t* which maximizes the objective function J(t). Since this method always divides the image into two sets, it is not suitable for defect-free images.
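A brute-force search for t* might look as follows (a sketch of the LDA thresholding idea in [50], not the authors' code; a small epsilon guards against division by zero when both classes are constant):

```python
import numpy as np

def lda_threshold(u):
    """Exhaustively search the candidate thresholds (the distinct pixel
    values) for t* maximizing J(t) = (mu0 - mu1)^2 / (s0^2 + s1^2)."""
    vals = np.sort(u.ravel())
    best_t, best_j = None, -np.inf
    for t in np.unique(vals)[1:]:          # skip the min so C_0 is non-empty
        c0, c1 = vals[vals < t], vals[vals >= t]
        j = (c0.mean() - c1.mean()) ** 2 / (c0.var() + c1.var() + 1e-12)
        if j > best_j:
            best_t, best_j = t, j
    return best_t
```

On a defect-free (unimodal) image the search still returns some threshold, illustrating why this method always splits the image into two sets.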

Works for linear images
There is a method to find defects in directional textured images [6]. This method uses principal component analysis (PCA) to separate defects and background structures.
To apply PCA, the average of the column vectors of the image matrix is first set to zero. The normalized eigenvalues are then used to find the directional textured background. If the normalized eigenvalue is greater than 1, the principal component represents defect-free background. Otherwise, the principal component represents defects. This method is invariant to horizontal or vertical shifting, rotation, and illumination changes of the directional texture. However, in the case of an image with a vertical linear structure as shown in Fig. 1b, the linear structure is removed in the process of setting the average of the column vectors to zero, so the defect is judged to be the main structure. Therefore, this PCA-based method is not suitable for images with vertical linear structures.

Works for patterned images
There are several methods to find anomalies in patterned images. The simplest method [9] first chooses an appropriate patch size for each image and then checks how often the patch centered on each pixel appears in the image. For each patch q, we find the k most similar patches q_i for i = 1, ..., k. Then, the reconstructed patch q̄ is obtained by averaging {q_i}:

q̄ = (1/k) Σ_{i=1}^{k} q_i,

and a pixel is flagged as defective when its patch q differs from the reconstruction q̄ by more than a constant a. This method is highly sensitive to the patch size.
Another method finds the lattice vectors which generate the pattern. If the lattice vectors generating the pattern are known, it is easy to remove the pattern. Traditional methods for detecting pattern repetition use autocorrelation [32] or the fast Fourier transform (FFT) [53]. The autocorrelation of an image u, denoted by ac ∈ R h×w, is defined as

ac(x, y) = Σ_{i,j} u(i, j) u(i + x, j + y).

This autocorrelation ac has the largest value at the origin. Therefore, global thresholding is not suitable for finding peak points. There is a method [32] to find peak points, which uses local maxima of a smoothed ac as the peak points. The basic idea of the method using the FFT is that the frequency with the maximum value of the FFT is related to the number of repetitions. When the number of repetitions of the pattern is large enough, for example, if (x*, y*) is the index at which the FFT has its maximum value, the periods in the x and y directions can be approximated by h/x* and w/y*, respectively, so that (h/x*, 0) and (0, w/y*) can be used as lattice vectors. However, if the image has fewer repetitions of the pattern (i.e., small x*, y*), then h/x* and w/y* cannot be said to be approximations of the periods. Therefore, this FFT-based method is not suitable for images with few repeated patterns.
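Both period-detection ideas can be sketched with NumPy's FFT (helper names are ours; circular autocorrelation is computed via the Wiener-Khinchin theorem, and for simplicity the period is estimated from the 1-D profile averaged across the other axis):

```python
import numpy as np

def autocorrelation(u):
    """Circular autocorrelation ac(x, y) = sum_ij u(i, j) u(i+x, j+y),
    computed with the FFT (Wiener-Khinchin theorem)."""
    F = np.fft.fft2(u)
    return np.real(np.fft.ifft2(F * np.conj(F)))

def dominant_period(u, axis=0):
    """Estimate the repetition period along one axis from the FFT peak:
    period ~ size / argmax |FFT|, skipping the zero frequency."""
    n = u.shape[axis]
    spectrum = np.abs(np.fft.fft(u.mean(axis=1 - axis)))
    k = 1 + np.argmax(spectrum[1:n // 2])   # ignore the DC component
    return n / k
```

The `ac.max() == ac[0, 0]` property is why global thresholding of the autocorrelation fails, and a small `k` (few repetitions) makes `n / k` an unreliable period estimate, as discussed above.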

Cosine similarity
In this section, we briefly review cosine similarity, which is widely used in image problems such as face verification and clustering [24,40,58,62]. In this paper, it will be used to classify images and to find periods in the case of patterned images.
Let u ∈ R h×w be an image, K ∈ R k×l be a kernel, and 1_{k×l} be a matrix of size k × l with all entries equal to 1. Then, the cosine similarity CS ∈ R h×w is calculated as

CS = (u * K) / ( sqrt(u² * 1_{k×l}) · ‖K‖_F ),

where * is the convolution and ‖·‖_F is the Frobenius norm. When computing the convolution, we use reflection padding on u to get a cosine similarity CS of the same size as the image u. Note that the square, square root, and division operations are calculated entrywise. A large entry in CS means that the image u has a kernel-like structure near the same indices.
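A minimal NumPy/SciPy sketch of the cosine similarity map (an illustration, not the authors' code: we use `scipy.ndimage.correlate` with reflection padding, i.e., correlation rather than flipped convolution, which matches the similarity interpretation; a small epsilon guards against division by zero):

```python
import numpy as np
from scipy.ndimage import correlate

def cosine_similarity_map(u, K):
    """CS = (u * K) / (sqrt(u^2 * 1_{kxl}) * ||K||_F), computed entrywise,
    with reflection padding so CS has the same size as u.  By the
    Cauchy-Schwarz inequality every entry is at most 1."""
    ones = np.ones_like(K)
    num = correlate(u.astype(float), K.astype(float), mode='reflect')
    local_norm = np.sqrt(correlate(u.astype(float) ** 2, ones, mode='reflect'))
    return num / (local_norm * np.linalg.norm(K) + 1e-12)
```

An entry equals 1 exactly where the local patch is a positive multiple of the kernel, which is what makes the map useful for detecting repeated structure.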

Minimum covariance determinant (MCD) method
This section describes a statistical technique for estimating the inlier distribution. Since the average and covariance matrix are extremely sensitive to outliers, a robust estimator is essential, and the MCD method [46] is one of the most widely used estimators [22,23]. In this paper, for defective images, the MCD method will be used to estimate the inlier distribution without being affected by defects. For the sake of completeness, we introduce the MCD method. Let {x_i}_{i=1}^n be a finite sample of data in R^d with a distribution F, where d is the number of random variables. The MCD is determined by choosing a subset S = {x_{i_j}}_{j=1}^s of size n/2 ≤ s ≤ n which minimizes the determinant of the covariance matrix computed from the subset S. Then, α = 1 − s/n is the portion of samples not contained in the subset S. Assume that the distribution F has a density of the form

f(x) = |Σ|^{−1/2} g( (x − μ)^T Σ^{−1} (x − μ) ),

where g : R_+ → R_+ is a non-increasing function. Then, F is an elliptically symmetric, unimodal distribution. From the average μ_S and covariance matrix Σ_S of the MCD-solution S, the average μ and covariance matrix Σ of the inlier distribution can be obtained by

μ = μ_S,  Σ = c_α Σ_S,   (2)

with the consistency factor c_α given by

c_α = (1 − α) / P(χ²_{d+2} ≤ q_α),

where q_α > 0 satisfies P(χ²_d ≤ q_α) = 1 − α. Here, the chi-squared densities involve the gamma function Γ (see [5] for more details). However, calculating the covariance determinants of all (n choose s) subsets is too difficult.

FAST-MCD method
In this section, we introduce the FAST-MCD method [47] to quickly find the MCD-solution S. First, we consider the Mahalanobis distance, which measures how much each sample point x_i deviates. For a given average vector μ and covariance matrix Σ, the Mahalanobis distance of a point x is defined as

d_M(x, μ, Σ) = sqrt( (x − μ)^T Σ^{−1} (x − μ) ).

The main part of the FAST-MCD method is called the concentration step (C-step), which is described in Algorithm 1. Through the C-step, it holds that det(Σ_k) ≥ det(Σ_{k+1}). Since the sequence {det(Σ_k)} is monotone and bounded below, it converges. However, there is no guarantee that det(Σ_k) converges to det(Σ_S) for the MCD-solution S. Therefore, the FAST-MCD method has different limits for different choices of the initial subset S_1 (see [47] for more details). Despite the lack of theory for the convergence to det(Σ_S), the FAST-MCD method has been applied in various fields [2] and empirically proven to produce good results [61]. In Table 1, we show by example that the FAST-MCD method estimates the inlier distribution well.

Algorithm 1 C-step in the FAST-MCD method.
Let S_1 be an initial subset of size s. Compute the mean vector μ_1 and covariance matrix Σ_1 for S_1. Set k = 1.
while det(Σ_k) decreases do
Compute the Mahalanobis distances d_M(x_i, μ_k, Σ_k) for all i = 1, ..., n.
S_{k+1}: the subset of s vectors selected in order of smallest Mahalanobis distance.
Compute the mean vector μ_{k+1} and covariance matrix Σ_{k+1} for S_{k+1}. Set k = k + 1.
end while
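The C-step is straightforward to implement; below is a minimal NumPy sketch (helper names are ours; the full FAST-MCD of [47] adds multiple random starts and nested subsampling, which are omitted here):

```python
import numpy as np

def c_step(X, idx, s):
    """One concentration step: from subset idx compute (mu, Sigma), then
    keep the s points with smallest squared Mahalanobis distance."""
    mu = X[idx].mean(axis=0)
    cov = np.atleast_2d(np.cov(X[idx].T))
    diff = X - mu
    d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
    return np.argsort(d2)[:s]

def fast_mcd_subset(X, s, n_iter=20, seed=0):
    """Iterate C-steps from one random initial subset until det(Sigma_k)
    stops decreasing (the determinant is monotone non-increasing)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=s, replace=False)
    prev_det = np.inf
    for _ in range(n_iter):
        idx = c_step(X, idx, s)
        det = np.linalg.det(np.atleast_2d(np.cov(X[idx].T)))
        if det >= prev_det - 1e-15:
            break
        prev_det = det
    return idx
```

With well-separated outliers, a few C-steps typically drive the subset onto the inlier cluster, which is the behavior Table 1 illustrates.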

Methods
In this section, we introduce a method of inspecting defects in a single image. First, we propose a method using cosine similarity to classify images into four types: flat, linear, patterned, and complex. For linear and patterned images, we present how to reconstruct a defect-free image. A flat image with the structure removed is obtained by subtracting the defect-free image from the input image. Then, we use the FAST-MCD method to detect defects in flat images. Finally, a segmentation neural network detects defects in complex images. Figure 2 shows the flow chart of our whole algorithm.

Image classification
For convenience, we assume a gray scale image. First, we divide an image u ∈ R h×w into M × N subimages. Then, we calculate the cosine similarity CS_i with the kernel K_i = i-th subimage for i = 1, ..., MN. A large entry in CS_i means that the image u has a K_i-like structure near the same indices. We find the region P_i = {(x, y) | CS_i(x, y) > t_i for x = 1, ..., h and y = 1, ..., w} for a threshold t_i and call it the repeated region. The CS_i's of a flat image and a patterned image are different. Since CS_i depends on the structure of the image, the threshold t_i used to obtain P_i must be set adaptively for each image. Therefore, t_i is selected as a value between the maximum value 1 and the minimum value of CS_i; results for various ratios between the maximum and minimum values are given in Appendix A.1. Here, we use t_i = 0.85 + 0.15 min CS_i. For each P_i, we consider the centroid of K_i. Then, we overlap the repeated regions {P_i} based on the centroid of each K_i and call the result the overall repeated region P ⊂ [1, h] × [1, w]. If M or N is so large that the kernel K_i becomes smaller than the repeating pattern, the cosine similarity CS_i cannot find the pattern. In Appendix A.2, there is a one dimensional example showing that a kernel of small size cannot find the pattern.
As described in Appendix A.1, we compute the moment tensor I of the connected region R containing the center of the domain [1, h] × [1, w]. If an image has a linear structure, P has a long connected region R with a large axis ratio, defined as the ratio of the large and small eigenvalues of I, along the dominant direction, defined as the direction of the major eigenvector. If the axis ratio is greater than 25, we determine that the image is linear, as discussed in Appendix A.1.
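The axis-ratio test can be sketched as follows (a minimal version using the second central moments of the region's pixel coordinates; the eigenvalue ratio is large for elongated regions and close to 1 for isotropic ones):

```python
import numpy as np

def axis_ratio(mask):
    """Ratio of the large and small eigenvalues of the second-moment
    tensor of a binary region; large values indicate a linear region."""
    ys, xs = np.nonzero(mask)
    xc, yc = xs.mean(), ys.mean()
    Ixx = np.sum((xs - xc) ** 2)
    Iyy = np.sum((ys - yc) ** 2)
    Ixy = np.sum((xs - xc) * (ys - yc))
    eigvals = np.linalg.eigvalsh(np.array([[Ixx, Ixy], [Ixy, Iyy]]))
    return eigvals[1] / max(eigvals[0], 1e-12)
```

A thin vertical line yields a ratio far above the threshold 25, while a square blob yields a ratio near 1.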
For an image to have a pattern, it must be repeated at least three times.If the pattern is repeated three times in one direction, then P has five high value regions in a straight line.If the pattern is repeated three times in a triangular shape, then P has seven high value regions and can form three straight lines, each containing three high value regions.For each high value region, we can extract a peak point as the centroid of the region.That is, P of a patterned image has at least five peak points on one or two straight lines.
For a Gaussian noised flat image, the histogram of d²_M for the MCD-solution S follows the chi-squared distribution. The Jensen-Shannon divergence (JSD) is commonly used to measure the distance between two distributions x and y [14]:

JSD(x, y) = (1/2) KL(x ‖ m) + (1/2) KL(y ‖ m),  m = (x + y)/2,

where KL denotes the Kullback-Leibler divergence. Note that JSD(x, y) is bounded by log 2. If the JSD between the histogram of d²_M for the MCD-solution S and the chi-squared distribution is less than 5 log 2/100, we determine that the image is flat (see Appendix A.3 for more details). Now, images can be classified using the information of the repeated region as follows:
1. Linear image: axis ratio of the connected component R ≥ 25.
2. Patterned image: at least 5 peak points in P forming one or two straight lines.
3. Flat image: JSD between the histogram of d²_M for the MCD-solution S and the chi-squared distribution ≤ 5 log 2/100.
Figure 3 shows the cosine similarities and overall repeated regions for the four types of images. The axis ratio of R is displayed at the top of P. Note that the second row, which has a linear structure, shows a higher axis ratio than the others. For the third row with a patterned structure, the yellow line in the last column represents the line passing through five or more peak points including the center of P. Figure 4 shows the results of our image classification method.
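The JSD criterion is easy to compute from two histograms (a sketch; `eps` guards the logarithms and both inputs are renormalized to probability vectors):

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two histograms, normalized to
    probability vectors; the value is bounded above by log 2."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Identical histograms give a JSD of 0 (well below the flatness threshold 5 log 2/100), while disjoint histograms attain the upper bound log 2.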

Defect inspection by image type
As mentioned in Sect. 1, for linear and patterned images, we present two methods to reconstruct defect-free images.The difference between the defect-free image and the input image becomes a flat image containing defects.For flat images, we use the FAST-MCD method to estimate the inlier distribution and find the defects.The segmentation neural network is applied to inspect complex images.

Removal of structure in linear images
An image in which the axis ratio of the connected component R is greater than 25 is judged to have a linear structure. For linear images, we compute the direction of the major eigenvector of the moment tensor I to get the dominant direction. The defect-free image can be obtained by taking, along each dominant line, the median of the average intensity of the line and the intensities at its two ends. If the line has no defects, the average value is chosen as the median. Otherwise, one of the two end values is chosen as the median. Then, we can obtain a flat image by subtracting the defect-free image from the input image. Finding defects in the flat image can be done as in Sect. 3.2.3. Figure 5b shows a long connected region R colored in green. It has an axis ratio 77.039 > 25 and a vertical major eigenvector. Hence, it is classified as a linear image. Figure 5d shows the defect-free image obtained using the dominant lines with the same direction as the major eigenvector. In Fig. 5e, the defects are prominent in the flattened image where the linear structure is removed.

Fig. 4 Examples of images for the four types: flat [37], linear [11,37,52], patterned [41,63,67], and complex [18,50,51]. (All flat images and the third linear image have permission from IOP Science, and the first and second complex images have permission from Elsevier and Springer, respectively.)
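The median-of-line-statistics reconstruction can be sketched for a vertical dominant direction (a simplification: we assume the dominant lines are exactly the image columns; an arbitrary direction would require sampling along rotated lines):

```python
import numpy as np

def remove_vertical_structure(u):
    """For each vertical line, take the median of the line's average
    intensity and the intensities at its two ends.  A defect on the line
    biases the average, so an end value is chosen instead; otherwise the
    average survives the median."""
    line_avg = u.mean(axis=0)
    top, bottom = u[0, :], u[-1, :]
    col_val = np.median(np.vstack([line_avg, top, bottom]), axis=0)
    defect_free = np.tile(col_val, (u.shape[0], 1))
    return u - defect_free, defect_free
```

Subtracting the reconstruction leaves a flat residual in which only the defect stands out, ready for the FAST-MCD detection of Sect. 3.2.3.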

Removal of structure in patterned images
If an image is not linear, we consider a straight line passing through the center of the overall repeated region P. As mentioned in Sect. 3.1, if there exist one or two straight lines passing through at least five peak points, the image is judged to have a patterned structure. For a patterned image, we extract two lattice vectors {w_1, w_2} (see Appendix A.4 for details on how to extract the lattice vectors). Depending on the pattern of the image, one lattice vector can be the zero vector. Using the two lattice vectors {w_1, w_2}, we create lattice points (see Algorithm 2, which uses the notation ⌈x⌉ for the smallest integer greater than or equal to x). We overlap the image u ∈ R h×w so that the top left of the image is located at each lattice point. After taking the average values of the overlapped images, a defect-free image can be obtained by cropping the averaged image of size h × w in the middle. Since the defects do not appear repeatedly, the average image gives a defect-free image. Then, a flat image can be obtained by subtracting the defect-free image from the input image. Finding defects in the flat image can be done as in Sect. 3.2.3. Figure 6 shows a graphical description of lattice point generation; it shows the case when the top left of the input image is placed on the orange dot. After the overlapping process, the averaged image is obtained. Then, we can obtain a defect-free image by cropping the green box. Figure 7c shows the lattice points generated by Algorithm 2. The lattice points appear regularly in the upper right corner of each patterned circle. In Fig. 7e, the defects are prominent in the flattened image where the patterned structure is removed.
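The overlap-and-average idea can be sketched with circular shifts (a simplification of the procedure above: `np.roll` wraps around instead of cropping an enlarged average, and the shift range `reps` is our own parameter):

```python
import numpy as np

def pattern_average(u, w1, w2, reps=3):
    """Average copies of u shifted by integer combinations of the lattice
    vectors w1 and w2.  Since defects do not repeat with the lattice,
    their intensity is strongly attenuated in the average."""
    acc = np.zeros_like(u, dtype=float)
    n = 0
    for a in range(-reps, reps + 1):
        for b in range(-reps, reps + 1):
            shift = a * np.asarray(w1) + b * np.asarray(w2)
            acc += np.roll(u, (int(shift[0]), int(shift[1])), axis=(0, 1))
            n += 1
    return acc / n
```

For a perfectly periodic image the average reproduces the input exactly, while a single defect is divided by the number of overlapped copies.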

Detecting defects in flat images
For an image which is neither linear nor patterned, we check whether the image is flat. To do this, we compute the JSD between the histogram h_S(x) of d²_M for the MCD-solution S and the probability density function of the chi-squared distribution. If the JSD is less than 5 log 2/100, then the image is judged to have a flat background. This section describes how to find defects in flat images using the FAST-MCD method of Sect. 2.2.3.
A histogram of a flat image with Gaussian noise follows a normal distribution N(μ, Σ), whose density has the elliptical form of Sect. 2.2.2 with the non-increasing function g(r²) = (2π)^{−d/2} e^{−r²/2}. Therefore, the consistency factor c_α in (2) can be used to estimate the inlier distribution. Table 1 shows the estimation results for the gray scale flat image in Fig. 1a. Since we assume Gaussian noise, the square of the Mahalanobis distance, d²_M(x_i, μ, Σ), of the inlier part follows a chi-squared distribution. We find defects with the threshold d²_M(x_i, μ, Σ) > χ²_{1,p}. Here, p can be adjusted according to the level of defect detection. For example, p = 0.99 means that approximately 1% of the area is detected in a defect-free flat image. For our purpose, a defect-free image should be judged to be defect-free. Therefore, we use the threshold p = 1 − 1/(4hw) for an h × w image, which means that about 0.25 pixels are detected in a defect-free flat image, regardless of the size of the image. From now on, we will use α = 0.25 and p = 1 − 1/(4hw) for gray scale images. Figure 8b shows that the MCD-solution S contains no defects. Figure 8d shows that the histogram of d²_M for S is similar to the chi-squared distribution (i.e., the MCD-solution S follows a Gaussian distribution).
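The chi-squared thresholding step can be sketched as follows for gray scale (d = 1) images with the default p = 1 − 1/(4hw) (a sketch: we use `scipy.stats.chi2` for the quantile, and `mu` and `var` would come from the FAST-MCD inlier estimate):

```python
import numpy as np
from scipy.stats import chi2

def detect_flat_defects(u, mu, var, p=None):
    """Flag pixels whose squared Mahalanobis distance (in 1-D simply
    (u - mu)^2 / var) exceeds the chi-squared quantile chi2_{1,p}.
    Default p = 1 - 1/(4hw): about 0.25 pixels fire on a defect-free image."""
    h, w = u.shape
    if p is None:
        p = 1.0 - 1.0 / (4 * h * w)
    d2 = (u - mu) ** 2 / var
    return d2 > chi2.ppf(p, df=1)
```

On a defect-free Gaussian image the expected number of flagged pixels is 0.25 independent of the image size, which is the design goal of this choice of p.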

Detecting defects in complex images
An image that is not judged to be flat, linear, or patterned is called complex and is inspected for defects through a segmentation neural network. Let f_θ : R h×w → [0, 1] h×w be a segmentation network with parameters θ, which gives a probability output. Let y = f_θ(x) be the probability output for an input image x. Let ŷ be the ground-truth (label) segmentation of the input x: ŷ_ij = 1 if (i, j) belongs to the target region and 0 otherwise.
The dice score (DS) is used to measure the performance on segmentation problems, defined by

DS(A, B) = 2|A ∩ B| / (|A| + |B|).

It takes its maximum value 1 when A = B. Similarly, DS(1 − A, 1 − B) gives the performance on the background. The generalized dice score (GDS)

GDS(y, ŷ) = 2 (w_D Σ_ij y_ij ŷ_ij + w_B Σ_ij (1 − y_ij)(1 − ŷ_ij)) / (w_D Σ_ij (y_ij + ŷ_ij) + w_B Σ_ij (2 − y_ij − ŷ_ij))   (3)

is used to evaluate multiple class segmentation [8], and it can be used to measure tiny segmentations, since it reduces the well-known correlation between the dice overlap and the region size. From this GDS, the generalized dice loss L_GD(y, ŷ) = 1 − GDS(y, ŷ) is widely used in small segmentation problems [54]. Originally, the weights w_D and w_B are determined by the class ratios of the total training dataset, as in the weighted cross entropy loss, so the same weights w_D and w_B are used for all images. But there is a difference between the cross entropy loss and the dice loss. Since the cross entropy loss is calculated for each pixel, weights can be given using the number (area) of pixels of each class in the total training dataset. On the other hand, as the dice loss is calculated for each image, it is not appropriate to use fixed weights for a dataset with various sizes of class areas. Our training dataset contains images with various defect sizes, with 0.0002 ≤ (1/hw) Σ_ij ŷ_ij ≤ 0.3889. So, instead of using fixed weights, we use adaptive weights w_D and w_B computed for each image (4). We also use the boundary loss [29]

L_B(y, ŷ) = Σ_ij φ(ŷ)_ij (y_ij − ŷ_ij),

where φ(ŷ) is the signed distance function to the boundary ∂ŷ of ŷ, negative inside ŷ and positive outside. Then, our loss function is a weighted sum of these two losses [29]:

L(y, ŷ) = λ L_GD(y, ŷ) + (1 − λ) L_B(y, ŷ).

During training, the weight λ was initially set to 1 and decreased gradually to 0.5 at the end of training. The Adam optimizer is used to minimize the loss function with parameters β_1 = 0.9, β_2 = 0.999, ε = 10^{−8}, and learning rate 0.001. The network architecture, based on U-Net [45] with ResNet [19], is shown in Fig. 9.
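A NumPy sketch of the two losses (for illustration only: the adaptive weights w_D and w_B are taken here as the standard generalized-dice choice 1/area², an assumption since the paper's exact formula (4) is not reproduced in this excerpt; `phi` must be a precomputed signed distance map):

```python
import numpy as np

def dice(a, b, eps=1e-7):
    """Dice score DS(A, B) = 2|A and B| / (|A| + |B|)."""
    return (2 * np.sum(a * b) + eps) / (np.sum(a) + np.sum(b) + eps)

def generalized_dice_loss(y, y_hat, eps=1e-7):
    """L_GD = 1 - GDS with per-image weights 1/area^2 for defect (D) and
    background (B); zero for a perfect prediction."""
    w_d = 1.0 / (y_hat.sum() ** 2 + eps)
    w_b = 1.0 / ((1 - y_hat).sum() ** 2 + eps)
    inter = w_d * (y * y_hat).sum() + w_b * ((1 - y) * (1 - y_hat)).sum()
    total = w_d * (y.sum() + y_hat.sum()) + w_b * ((1 - y).sum() + (1 - y_hat).sum())
    return 1.0 - 2.0 * inter / (total + eps)

def boundary_loss(y, y_hat, phi):
    """L_B = sum_ij phi(y_hat)_ij (y_ij - y_hat_ij), with phi the signed
    distance to the boundary (negative inside, positive outside)."""
    return np.sum(phi * (y - y_hat))

def total_loss(y, y_hat, phi, lam):
    """Weighted sum of the two losses; lambda decays from 1 to 0.5."""
    return lam * generalized_dice_loss(y, y_hat) + (1 - lam) * boundary_loss(y, y_hat, phi)
```

For a perfect prediction both losses vanish (up to the epsilon), so the total loss is near zero regardless of λ.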
Figure 10d shows that the histogram of d²_M for S and the chi-squared distribution are different. This means that some structure is present in the input image. Remark 1. We might consider applying the segmentation network to all images. However, we observed that for most images the neural network does not give results as accurate as our proposed mixed method. Also, we do not know what the neural network will do for new, untrained structural images.

Pre-processing and post-processing
Before applying the proposed method, we take a denoising step. In denoising methods based on isotropic diffusion, diffusion at the edges can smear the edges and remove the texture of objects. However, denoising methods based on anisotropic diffusion consider both spatial distance and intensity difference, thus preserving edges while reducing noise in non-edge regions. We use Perona-Malik anisotropic diffusion [44], the most popular model, to denoise the image:

∂u/∂t = div( g(|∇u|) ∇u ),  g(s) = 1 / (1 + (s/κ)²),

where κ is a constant. We use κ = 0.14 and the time increment Δt = 0.1 with five iterations.
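A minimal Perona-Malik iteration (a sketch assuming the conductance g(s) = 1/(1 + (s/κ)²), one of the two classic Perona-Malik choices; boundaries are handled periodically via `np.roll` for brevity, whereas a production version would reflect):

```python
import numpy as np

def perona_malik(u, kappa=0.14, dt=0.1, iters=5):
    """Anisotropic diffusion u_t = div(g(|grad u|) grad u): large gradients
    (edges) get small conductance and are preserved, flat regions diffuse."""
    u = u.astype(float).copy()
    g = lambda d: 1.0 / (1.0 + (d / kappa) ** 2)
    for _ in range(iters):
        # one-sided differences to the four neighbors (periodic boundary)
        dn = np.roll(u, 1, axis=0) - u
        ds = np.roll(u, -1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u, 1, axis=1) - u
        u += dt * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u
```

The flux between two pixels is antisymmetric, so the scheme conserves the mean while reducing the variance of noisy regions.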
If a defect appears on the boundary of the image, it is not known whether it is an actual defect or a part of the structure. Therefore, defects appearing on the boundary of the image are excluded. For the remaining defects, we perform morphological opening and closing to remove dot defects (noise) and to connect nearby defects, respectively. We use structuring elements with a radius of 1 pixel for opening and a radius of 5 pixels for closing for 256 × 256 images. If there is a hole inside a defect, we fill it in during post-processing.
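The post-processing chain can be sketched with `scipy.ndimage` (an illustrative helper, not the authors' code; disk-shaped structuring elements of radius 1 and 5 as in the text, with border-touching components discarded before the morphology):

```python
import numpy as np
from scipy import ndimage

def postprocess(mask, r_open=1, r_close=5):
    """Discard defects touching the image boundary, remove dot noise by
    opening, connect nearby defects by closing, and fill interior holes."""
    labeled, n = ndimage.label(mask)
    border = set(labeled[0, :]) | set(labeled[-1, :]) | set(labeled[:, 0]) | set(labeled[:, -1])
    keep = np.isin(labeled, [i for i in range(1, n + 1) if i not in border])
    disk = lambda r: (np.hypot(*np.mgrid[-r:r + 1, -r:r + 1]) <= r)
    out = ndimage.binary_opening(keep, structure=disk(r_open))
    out = ndimage.binary_closing(out, structure=disk(r_close))
    return ndimage.binary_fill_holes(out)
```

Opening with the small element erases isolated dots; closing with the large element bridges gaps between nearby defect fragments before the holes are filled.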

Experimental results
The proposed method was implemented to evaluate the performance of defect inspection for images with various structures. Since there is no testing database for semiconductor defects, we used 171 images from the literature [4, 11, 13, 15, 16, 18, 21, 25-28, 35, 37, 38, 41, 49-52, 56, 59, 63, 65, 67]; see [66] for specific image information. The size of the images under inspection is 256 × 256. To train the network, we use the following data augmentation strategy: • Normalization and the use of the complement, • Eight types of rotation and flipping.
For a 256 × 256 gray scale image u, we use the normalized image ū = (u − μ)/(10σ) + 0.5 and the complementary image ū_c = 1 − ū, where μ and σ² are the average and the variance of u, respectively. Then, we rotate the image by π/2, π, and 3π/2 radians and flip these vertically. The network is trained with 155 × 2 × 8 = 2,480 defective images, and the validation dataset has 16 defective images and corresponding defect-free images. We created the defect-free images using the exemplar-based inpainting method [7] and manual processes with some graphical tools. Figure 11 shows examples of defect-free images generated with graphical tools. We chose the parameters of the neural network with the highest GDS(ỹ, ŷ) in (3) for the 16 defective images in the validation dataset, where ỹ is the threshold result for y ≥ 0.5. The weights are the same as in (4). The network is trained for 200 epochs with a batch size of 64.
Figure 12 shows the defect inspection results for defective images. We show the input images, ground truths, results of our method, neural network-1 [55], neural network-2 [57], and the self-similarity method [9], in order from the left to the right column. We generated the ground truth with thresholding and some manual processing. Since we do not have a design layout, we cannot implement the D2DB inspection methods. The neural network-1 (NN-1) extracts features using 2D convolutions with kernel size 5×5, the ReLU activation function, and 2×2 max pooling. The NN-1 method gives 32 × 32 outputs. To find the region in the 256 × 256 input images, we use bicubic interpolation. The neural network-2 (NN-2) method, which takes 512 × 512 images as inputs, has a W-shape cascaded autoencoder architecture. Each U-shaped autoencoder uses dilated 2D convolutions and the ReLU activation function to extract features and has skip connections. To use this network architecture, we again obtained 512 × 512 images using bicubic interpolation. We trained the parameters of the networks with our dataset. The number of parameters is 15.4M, 61.9M, and 11.7M for NN-1, NN-2, and our method, respectively. NN-2, a fully machine learning method, is inferior to our method despite using more than 5 times as many parameters. For the self-similarity method, NFA = 10^{−10} was used. The self-similarity method is known to work well for images with repetitive structures. However, if a specific structure does not appear repeatedly even though it is not a defect, the self-similarity method judges it as a defect. Table 2 shows the mean of GDS, standard deviation (std) of GDS, mean of IoU, and std of IoU for the 16 defective validation images.
Here, the IoU score is another metric mainly used in segmentation problems, with the formula

IoU(A, B) = |A ∩ B| / |A ∪ B|,

where |·| denotes the number of pixels. We also provide the number of true positives for the 16 defective validation images and the number of false positives for the 16 defect-free validation images.
All model-based algorithms were implemented in MATLAB. All neural networks were implemented in Python with PyTorch [43], and all computations were performed on a cluster equipped with Intel Xeon Gold 6148 (2.4GHz, 40C) CPUs, NVIDIA RTX 3090 GPUs, and the operating system Ubuntu 18.04 64 bits.

Conclusion
Unlike the defect inspection methods mainly used in semiconductor manufacturing, we proposed a method of inspecting defects in a single image. Our method consists of classifying images and detecting defects according to each type. Cosine similarity, the moment tensor, and the JSD were used to classify the image types. We proposed two methods for removing structures: one for a linear structure and the other for a repeated pattern. For a linear structured image, we found the dominant angle and removed the linear structure by subtracting the median of the average intensity and the intensities at both ends of each dominant line. For a repeated patterned image, we selected two lattice vectors and generated the lattice points. By overlapping the input image at each lattice point and averaging, we obtained the defect-free reference image. From the difference between the input image and the reference image, we obtained a flat image. The FAST-MCD method was used to detect defects in flat images. For an image with a complex structure, we found the defects using a segmentation network.
Among the existing methods, most model-based inspection methods for a single image assume a special image structure (e.g., flat or patterned). Our method has the advantage of being more general in that it classifies the types of images and finds defects according to type. Depending on the type of image in which a defect occurs, it is possible to know in which process the defect occurred. Machine learning methods have the disadvantage that it is difficult to explain the reason for a result. Our method reduces this ambiguity by classifying images into four types, detecting defects in flat, linear, and patterned images by statistical methods, and applying machine learning only to complex images.

A.1 Selection of the parameter t_i
Here, we experimentally show the role of the parameter t_i used in the image classification in Sect. 3.1. We obtain the repeated region P_i = {(x, y) | CS_i(x, y) > t_i for x = 1, …, h and y = 1, …, w} using the threshold t_i. We overlap the P_i based on the centroid of each kernel and call the result the overall repeated region P ⊂ [1, h] × [1, w]. Since the value of CS_i at the centroid of K_i is always 1, P_i contains the centroid of K_i, which implies that P contains the center of the domain [1, h] × [1, w]. For R, the connected region containing the center of the domain, we compute the moment tensor

I = [ I_xx  I_xy ; I_xy  I_yy ],

where the components are defined as

I_xx = Σ_{(x,y)∈R} (y − ȳ)²,  I_yy = Σ_{(x,y)∈R} (x − x̄)²,  I_xy = −Σ_{(x,y)∈R} (x − x̄)(y − ȳ),

with (x̄, ȳ) the centroid of R. For the moment tensor I, we compute the eigenvalues and corresponding eigenvectors. If an image has a linear structure, P has a long connected region R with a large axis ratio, defined as the ratio of the large and small eigenvalues. Figure 13 shows the overall repeated region P for various t_i = (1 − r) + r min CS_i with r = 0.1, 0.15, 0.2, 0.25, 0.3. The axis ratio of R is displayed at the top of each P. The maximum axis ratio of the nonlinear images is 7.53, and the minimum of the linear images is 41.83. Therefore, we take the value 25 as the threshold for the determination of linear images; linear images always have an axis ratio greater than 25, regardless of the value of t_i. The peak points of the patterned images are aligned on specific lines.
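The axis-ratio test can be sketched as follows. This is a minimal numpy illustration assuming the connected region R is given as a boolean mask; it is not the paper's implementation (which was in MATLAB):

```python
import numpy as np

def axis_ratio(mask):
    """Axis ratio of a connected region given as a boolean mask.

    Builds the second-moment tensor of the region about its centroid
    and returns the ratio of its larger to smaller eigenvalue.
    A long, thin (linear) region yields a large ratio."""
    ys, xs = np.nonzero(mask)
    xs = xs - xs.mean()
    ys = ys - ys.mean()
    I = np.array([[np.sum(ys**2), -np.sum(xs * ys)],
                  [-np.sum(xs * ys), np.sum(xs**2)]])
    ev = np.linalg.eigvalsh(I)  # eigenvalues in ascending order
    return ev[1] / ev[0]

# Example: a 3x61 horizontal strip is strongly elongated.
strip = np.zeros((21, 81), dtype=bool)
strip[9:12, 10:71] = True
print(axis_ratio(strip) > 25)  # elongated -> classified as linear
```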

A.2 Kernel size for cosine similarity
Here, we give a one-dimensional example of the cosine similarity for different kernel sizes. Consider a long vector in which (1, 0, 1, 1, 0) is repeated, and compute the cosine similarities for three one-dimensional kernels of different sizes. If the kernel does not include a full period of the pattern, as in the first case, the repeated region of CS cannot capture the pattern. On the other hand, if the kernel covers the pattern, as in the second and third cases, the repeated region of CS finds the pattern. Therefore, we suggest small values of M and N, such as 2, 3, and 4, so that the kernel includes the pattern.

[Algorithm 3: GHT method of extracting the lattice vectors [32]. Let {v_i}_{i=1}^n be the peak points in vector representation, sorted in ascending order of length, with v_1 = (0, 0). Build a score matrix L over pairs of vectors, with the notation [a] for the rounded integer of a, scoring how close the peak points are to the parallelogram grid generated by each pair. Let (î, ĵ) be the entry with the highest value in L. Select the pair with the smallest lengths in {v_î, v_ĵ, v_î + v_ĵ, v_î − v_ĵ} as the lattice vectors.]
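The one-dimensional example can be reproduced with a short sketch. The sliding-window cosine similarity and the 0.999 peak threshold are illustrative assumptions, not the paper's exact 2D definition:

```python
import numpy as np

def sliding_cosine_similarity(signal, kernel):
    """Cosine similarity between `kernel` and every window of `signal`."""
    k = np.asarray(kernel, dtype=float)
    out = []
    for i in range(len(signal) - len(k) + 1):
        w = np.asarray(signal[i:i + len(k)], dtype=float)
        denom = np.linalg.norm(w) * np.linalg.norm(k)
        out.append(w @ k / denom if denom > 0 else 0.0)
    return np.array(out)

period = [1, 0, 1, 1, 0]
signal = period * 6  # long vector with the pattern repeated

# A kernel covering a full period peaks exactly at multiples of 5 ...
cs_full = sliding_cosine_similarity(signal, signal[:5])
# ... while a too-short kernel also peaks at spurious offsets,
# so the repeated region cannot identify the true pattern.
cs_short = sliding_cosine_similarity(signal, signal[:2])
print(np.flatnonzero(cs_full > 0.999)[:4])   # period-5 peaks only
print(np.flatnonzero(cs_short > 0.999)[:4])  # extra, spurious peaks
```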

A.3 JSD between histogram and chi-squared distribution
We explain the details of the JSD between the histogram of d²_M for the MCD-solution S and the chi-squared distribution. Let h_S(x) be the histogram of d²_M(x_i, μ̂, Σ̂) for the MCD-solution S. For a defect-free image, the maximum value of d²_M in the MCD-solution S is close to χ²_{1,1−α}, where χ²_{1,p} is defined to satisfy P(X > χ²_{1,p}) = p for X ~ χ²_1. However, for defective images, the MCD-solution S appears outside the defects, and the maximum value of d²_M in the MCD-solution S increases; the increment depends on the size of the defects. Therefore, we use a slightly modified chi-squared distribution as follows. Let f_1(x) be the probability density function of the χ²_1 distribution and l be the maximum value of d²_M in the MCD-solution S. Then, the cropped probability density function f̃_1(x) is

f̃_1(x) = f_1(x) / ∫_0^l f_1(t) dt for 0 ≤ x ≤ l, and f̃_1(x) = 0 otherwise.

In Sect. 3.1, we measure the JSD between h_S(x) and f̃_1(x) to check whether h_S(x) is close to f̃_1(x) or not. Figure 14 shows the JSDs for flat and complex images. For flat images, the JSD is below 0.01 log 2. For complex images, the JSD is greater than 0.1 log 2. Therefore, we take the value 0.05 log 2 as the threshold for the determination of flat images.

[Algorithm 4: Our method of extracting the lattice vectors. Let {l_i} be the set of straight lines passing through the center of P, in descending order of the number of peak points on l_i; for lines with the same number of peak points, the line with the smaller distance between points comes first. For i = 1, 2: let {v_j | j = 1, …, n_i} be the vectors on line l_i and v̂_i the minimum-length vector on l_i; set ŵ_i as the average of the vectors v_j / [a_j], where a_j = ‖v_j‖ / ‖v̂_i‖ and [a] denotes the rounded integer of a. Select the pair with the smallest lengths in {ŵ_1, ŵ_2, ŵ_1 + ŵ_2, ŵ_1 − ŵ_2} as the lattice vectors {w_1, w_2}.]
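The JSD check against the chi-squared distribution can be sketched as follows. The sample size, the bin count, and the use of exact χ²_1 bin probabilities via the error function are illustrative assumptions, not the paper's implementation:

```python
import numpy as np
from math import erf

def jsd(p, q):
    """Jensen-Shannon divergence (in nats) between two discrete distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = (p + q) / 2
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical defect-free case: squared Mahalanobis distances of inliers
# follow a chi^2_1 distribution (the square of a standard normal).
rng = np.random.default_rng(0)
d2 = rng.normal(size=20000) ** 2
l = d2.max()                              # crop point of the density
bins = np.linspace(0.0, l, 51)
h, _ = np.histogram(d2, bins=bins)        # histogram h_S
# Cropped chi^2_1 bin masses via the CDF F(x) = erf(sqrt(x/2));
# jsd() renormalizes, which implements the cropping to [0, l].
f = np.diff([erf(np.sqrt(b / 2)) for b in bins])
print(jsd(h.astype(float), f) < 0.05 * np.log(2))  # below flat-image threshold
```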

A.4 Lattice vector extraction
In this appendix, we briefly describe how to extract the lattice vectors from the overall repeated region P in Sect. 3.2.2. Before starting, the locations of the peak points are regarded as vectors originating from the center of the domain [1, h] × [1, w]. The main idea of the generalized Hough transform (GHT) method [32] for extracting lattice vectors is to build a parallelogram grid with each pair of linearly independent vectors and to score how close the peak points are to the grid (see Algorithm 3). The use of 1/‖v_i‖ leads to a high score in L for v_i with small length. However, this method only uses the vectors v_î and v_ĵ to determine the lattice vectors and does not use other vectors that are constant multiples of them.
We modified the method slightly to use more of the linearly dependent vectors when determining the lattice vectors. We consider straight lines passing through the center of the domain. As mentioned in Sect. 3.1, a straight line passing through three peak points is judged to carry pattern information; if a line passes through more peak points, it represents the pattern better. To find the lattice vectors, we take the two straight lines containing the largest numbers of peak points. If two lines have the same number of peak points, we take the one with the smaller distance between the points. The proposed lattice vector extraction algorithm is described in Algorithm 4. The lattice points generated from the lattice vectors extracted by our algorithm are more accurate because the error is reduced by using more linearly dependent vectors. The amount of computation of our extraction algorithm is also less than that of the GHT method.
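Lattice point generation from two lattice vectors (cf. Fig. 6) can be sketched as follows. The (row, col) coordinate convention and the brute-force coefficient range are illustrative assumptions:

```python
import numpy as np

def lattice_points(w1, w2, h, w):
    """Lattice points a*w1 + b*w2 (about the domain center) that fall
    inside the domain [0, h) x [0, w), for integer coefficients a, b."""
    w1, w2 = np.asarray(w1, float), np.asarray(w2, float)
    center = np.array([h / 2, w / 2])
    bound = int(max(h, w))  # generous range for the integer coefficients
    pts = []
    for a in range(-bound, bound + 1):
        for b in range(-bound, bound + 1):
            p = center + a * w1 + b * w2
            if 0 <= p[0] < h and 0 <= p[1] < w:
                pts.append(np.round(p).astype(int))
    return np.array(pts)

# Two orthogonal lattice vectors of length 8 on a 32x32 domain
# give a 4x4 grid of lattice points.
pts = lattice_points((8, 0), (0, 8), 32, 32)
print(len(pts))  # 16 points on an 8-pixel square grid
```

Averaging the input image overlapped at each such lattice point then yields the defect-free reference image described in Sect. 3.2.2.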
Figure 15a shows the lattice points obtained by the two methods on an input image. Blue and red dots represent the lattice points generated by the GHT method and by our method, respectively. The flattened image obtained with the blue lattice points is shown in Fig. 15b; near the boundary of the image, traces of the structure remain. Figure 15c shows the flattened image obtained by our method. Compared to the GHT method, our method produces more accurate lattice points.

Fig. 1 Four typical types of semiconductor images and defect regions. (The first and third images are used with permission from IOP Science and IEEE, respectively)

Fig. 2 The flow chart of our algorithm

Fig. 3 Cosine similarities and repeated regions for the four types of images. The first column shows the input images with 2 × 2 partitions. Columns 2–5 show each cosine similarity CS_i, and blue contours show the repeated region P_i. The centroids of the subimages used as kernels are indicated

Fig. 5 Inspection process of a linear image. a Input image, b overall repeated region P. The red dot indicates the center of P, and the green region shows the connected region R containing the center of the domain. The red lines in c show the dominant lines. The defect-free image is shown in d. e shows the flattened image with the linear structure removed. The defects detected by the FAST-MCD method are shown in f

Fig. 6 Fig. 7
Fig. 6 Graphical description of lattice point generation using two lattice vectors w_1 and w_2, denoted as red and blue arrows, respectively. The white dots are the lattice points in [1, 2h] × [1, 2w] generated from these two lattice vectors

Fig. 8 Fig. 9
Fig. 8 Inspection process of a flat image. a Input image. b MCD-solution S. The defects detected by the FAST-MCD method are shown in c. The histogram of d²_M for the MCD-solution S and the probability density function of the chi-squared distribution are shown in d

Fig. 10 Inspection process of a complex image. a Input image. b MCD-solution S. The defects detected by the segmentation network are shown in c. The histogram of d²_M for the MCD-solution S and the probability density function of the chi-squared distribution are shown in d

Fig. 11 Examples of defect-free images generated by graphical tools. a and c are defective images, and b and d are defect-free images

Fig. 12 Comparison of several methods for defective images. Our proposed method classifies images into four types: flat, linear, patterned, and complex. The red and blue contours show the ground truth and the segmentation region, respectively

Fig. 13 Fig. 14
Fig. 13 Experiment results for various t_i. The first column shows the input images. Columns 2 through 6 show P for r = 0.1 to r = 0.3. The red dot indicates the center of the domain, and the green region shows the connected region R containing the center of the domain. The blue dots indicate the peak points

Algorithm 4 Our method of extracting the lattice vectors

Fig. 15 Lattice points and results of the GHT method and our method