Rapid and Flexible Semantic Segmentation of Electron Microscopy Data Using Few-Shot Machine Learning

Sarah Akers, Elizabeth Kautz, Andrea Trevino-Gavito, Matthew Olszta, Bethany Matthews, Le Wang, Yingge Du, and Steven Spurgeon (steven.spurgeon@pnnl.gov, https://orcid.org/0000-0003-1218-839X), Pacific Northwest National Laboratory


I. INTRODUCTION
Material microstructures govern the functionality of many important technologies, including catalysts, energy storage devices, and emerging quantum computing architectures.
Scanning transmission electron microscopy (STEM) has long served as a foundational tool to study microstructures because of its ability to simultaneously resolve structure, chemistry, and defects with atomic-scale resolution for a range of materials classes. 1-3 STEM has helped elucidate the nature of microstructural features ranging from complex dislocation networks to secondary phases and point defects, leading to refined structure-property models. 2,4,5 Traditionally, STEM images have been analyzed by a domain expert manually or semi-automatically, utilizing a priori knowledge of the system to identify known and unknown features. While this approach is suitable for measuring a limited number of features in small data volumes, it is impractical for samples possessing high-density, rare, or noisy features. 6,7 Moreover, manual approaches are difficult to scale to multiple data modalities and cannot be performed at high speed, hindering our ability to perform in situ and complementary or correlative studies that harness the full potential of modern instruments. 8 At a more fundamental level, variability in how such measurements are conducted and a lack of standardized approaches contribute to the broader issue of reproducibility in experimentation. 9 Though these limitations apply to all materials classes, they are particularly pronounced for complex oxides, whose properties are heavily influenced by even trace amounts of unwanted defects. 10-12 Hence, there is an urgent need to develop new approaches to characterize microstructural features with greater accuracy, speed, and statistical rigor than is possible with existing methodologies.
A central challenge in quantitatively describing microscopy image data (i.e. micrographs) is the wide variety of possible microstructural features and data modalities. The same instrument that is used to examine interfaces at atomic resolution one day may be used to examine particle morphology or grain boundary distributions at lower magnification the next. In every study, the goal is to extract quantitative and semantically meaningful microstructural descriptors to link measurements to underlying physical models. 13,14 For example, estimating the area fraction of a specific phase or the abundance of a feature through image segmentation is an important part of understanding synthesis products and phase transformation kinetics. 15-19 Although several image segmentation methods exist (e.g. Otsu thresholding, 20 the watershed algorithm, 21 k-means clustering 22 ), these are often not easily generalizable across material systems and image types, and may require significant tailored image preprocessing.
Machine learning (ML) methods, specifically convolutional neural networks (CNNs), have recently been adopted for the recognition and characterization of microstructural data across length scales. [23][24][25][26] Classification tasks have been performed to either assign a label to an entire image that represents a material or microstructure class (e.g. "dendritic," "equiaxed," etc.), [26][27][28][29] or to assign a label to each pixel in the image so that they are classified into discrete categories. 25,[30][31][32] The latter classification type is segmentation of an image to identify local features (e.g. line defects, phases, crystal structures), referred to as semantic segmentation. However, many challenges remain in the practical application of semantic segmentation methods, such as the large data set size required for training and the difficulty of developing methods that are generalizable to a wide variety of data. Typically, data analysis via deep learning methods requires large amounts of labeled training data (such as the large image data set available through the ImageNet database). 33,34 The ability to analyze data sets on the basis of limited training data, as often encountered in microscopy, 35,36 is an important frontier in materials and data science. Recent advances have led to developments that allow human-level performance in one-shot, or few-shot learning problems, 37,38 but there are limited studies on such methods in the materials science domain. While many characterization tools may provide just a few data points, a single electron micrograph (and potentially additional imaging / spectral channels) may encompass many microstructural features of interest. The one-shot or few-shot learning concept also has significant implications for the study of transient or unstable materials, as well as those where limited samples are available for analysis due to long lead-time experimentation (such as corrosion or neutron irradiation studies). 
In other cases, there exist data from previous studies that may be very limited or poorly understood, for which advanced data analysis methods could be applied. 39 In this work, we present a rapid and flexible approach to recognition and segmentation of STEM images using few-shot machine learning. Three oxide materials systems were selected for model development (epitaxial heterostructures of SrTiO 3 (STO) / Ge, La 0.8 Sr 0.2 FeO 3 (LSFO) thin films, and MoO 3 nanoparticles) due to the range of microstructural features they possess and their importance in semiconductor, spintronic, and catalysis applications. 40,41 We demonstrate that with only 5-8 sub-images (termed chips) that represent examples of a specific microstructural feature (e.g. a crystal motif or particular particle morphology), our model yields segmentation results comparable to those produced by a domain expert for all oxide systems studied here. The successful image mapping can be attributed to the low noise sensitivity and high learning capability of few-shot machine learning in comparison to other segmentation methods (e.g. Otsu thresholding, watershed, k-means clustering). The few-shot approach rapidly identifies varying microstructural features across STEM data streams, which can inform real-time image data collection and analysis. More broadly, our findings underscore the power of image-driven machine learning to enable improved microstructural characterization for materials discovery and design.

II. RESULTS AND DISCUSSION
A deep learning approach known as few-shot learning was developed for semantic segmentation of STEM images. The premise of this few-shot learning model is to use very few labeled examples (< 10) per class for the model to identify regions of an image that correspond to each class. The general approach to image segmentation using few-shot learning is schematically described in Figure 1 and involves breaking an input image into a grid of sub-images (referred to herein as chips), model initialization, inference, and output of a segmented micrograph. The process of chipping relies on domain-specific knowledge of the material's microstructure, as indicated in the annotations in Figure 1.A.
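The chipping step above can be sketched with NumPy; the function name `chip_image` and the non-overlapping grid layout are illustrative assumptions, not the exact implementation used in this work:

```python
import numpy as np

def chip_image(image, chip_size):
    """Break a 2-D image into a grid of square chips (sub-images).

    Chips are taken on a non-overlapping grid; any remainder at the
    right/bottom edges that does not fill a full chip is discarded.
    Returns an array of shape (n_rows, n_cols, chip_size, chip_size).
    """
    h, w = image.shape
    n_rows, n_cols = h // chip_size, w // chip_size
    chips = (
        image[: n_rows * chip_size, : n_cols * chip_size]
        .reshape(n_rows, chip_size, n_cols, chip_size)
        .swapaxes(1, 2)
    )
    return chips

# Example: a 950 x 950 px image chipped at 95 x 95 px gives a 10 x 10 grid
rng = np.random.default_rng(0)
image = rng.random((950, 950))
grid = chip_image(image, 95)
print(grid.shape)  # (10, 10, 95, 95)
```

Each entry `grid[r, c]` is then one chip, which can serve as either a labeled support example or a query during inference.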

A. Preprocessing
To separate and measure distinct phases, which have varying contrast in the STEM images, preprocessing of the original image data was required. A histogram equalization (HE) technique designed to enhance local image quality without introducing global artifacts, termed contrast-limited adaptive HE (CLAHE), 42,43 was selected for use in this work. The details of the CLAHE implementation are described in Table I. CLAHE was first performed on the original images, and the processed image was then sectioned into a set of smaller sub-images, as shown in Figure 1.B. The chip size varied between 95 × 95 pixels and 32 × 32 pixels; however, all chips are resized to 256 × 256 pixels in the ResNet101 embedding module. The variable size allowed each chip to be large enough to capture a microstructural motif and small enough to provide granularity between adjoining spatial regions, as shown in Figure 1. The final preprocessing step is an enhancement technique 44 that marks the position and size of atomic columns using a Laplacian of Gaussians (LoG) blob detection routine. 45 This step was used on the LSFO system to enhance the extremely subtle differences between classes.
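As a sketch, the CLAHE and LoG steps map onto standard scikit-image routines (`equalize_adapthist`, `blob_log`); the `kernel_size`, `clip_limit`, and sigma values below are illustrative placeholders, not the parameters from Table I:

```python
import numpy as np
from skimage import exposure, feature

rng = np.random.default_rng(0)
image = rng.random((512, 512))  # stand-in for a normalized STEM image

# Contrast-limited adaptive histogram equalization (CLAHE): local
# equalization in tiles, with clipping to limit noise amplification.
equalized = exposure.equalize_adapthist(image, kernel_size=64, clip_limit=0.03)

# Laplacian of Gaussians (LoG) blob detection marks candidate atomic-column
# positions and sizes; each returned row is (y, x, sigma).
blobs = feature.blob_log(equalized, min_sigma=2, max_sigma=6, threshold=0.1)
print(equalized.shape)
```

In the described pipeline, chips would be cut from the CLAHE-equalized image, with the blob-marking enhancement applied only where class differences are subtle (here, the LSFO system).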

B. Model Architecture
The few-shot model takes as input the preprocessed STEM image, typically with high resolution on the order of 3000 × 3000 pixels, that has been broken down into a series of smaller chips, x_ik, typically not larger than 100 × 100 pixels. A handful of these chips are used as examples, or a support set, to define each of one or several classes. While most image applications of few-shot learning use disparate x_ik to define a support set for each class (S_k), 46-50 here S_k was created by breaking the original image into a grid of smaller sub-images. A subset of chips was labeled for each class. The set of N labeled examples for k = 1, ..., K classes makes up the support set, defined by:

$S_k = \{(x_i, y_i) \mid y_i = k\}$

where x_i represents an image i and y_i is the corresponding true class label.
A Prototypical Network 51 was selected in this work, given its lightweight design and simplicity. This model is based on the premise that each S_k may be represented by a single prototype, c_k. To compute c_k, each x_ik first goes through an embedding function f_φ, which maps a D-dimensional image into an M-dimensional representation through learnable parameters φ. The transformed chips, z_ik = f_φ(x_ik), then define the prototype for class k as the mean vector of the embedded support points, as follows:

$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\phi(x_i)$

After class prototypes are created, an untrained Prototypical Network classifies a new data point, or query q_i, by first transforming the query through the embedding function and then calculating a distance, e.g. the Euclidean distance, between the embedded query vector and each of the class prototype vectors. After the distances are computed, a softmax normalizes the distances into class probabilities, and the class with the highest probability becomes the label for the query. 51 The final output of the model, for each q_i, is the respective class label.
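The prototype computation and nearest-prototype classification can be sketched in NumPy as follows; `classify_queries` is an illustrative name, and the toy mean-pool "embedding" stands in for the ResNet101 embedding module used in this work:

```python
import numpy as np

def classify_queries(support, labels, queries, embed):
    """Prototypical-network inference (sketch; names are illustrative).

    support: (N, H, W) labeled support chips
    labels:  (N,) integer class labels in 0..K-1
    queries: (Q, H, W) query chips
    embed:   embedding function f_phi mapping images to M-dim vectors
    Returns (Q, K) class probabilities and (Q,) predicted labels.
    """
    z_support = embed(support)                      # (N, M)
    z_query = embed(queries)                        # (Q, M)
    K = int(labels.max()) + 1
    # Class prototype c_k: mean of the embedded support points of class k
    prototypes = np.stack(
        [z_support[labels == k].mean(axis=0) for k in range(K)]
    )                                               # (K, M)
    # Squared Euclidean distance from each query to each prototype
    d = ((z_query[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=2)
    # Softmax over negative distances: smaller distance -> higher probability
    e = np.exp(-d - (-d).max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return probs, probs.argmax(axis=1)

# Toy usage: a global mean-pool "embedding" in place of ResNet101
embed = lambda x: x.mean(axis=(1, 2))[:, None]
support = np.concatenate([np.zeros((5, 8, 8)), np.ones((5, 8, 8))])
labels = np.array([0] * 5 + [1] * 5)
queries = np.stack([np.full((8, 8), 0.1), np.full((8, 8), 0.9)])
probs, preds = classify_queries(support, labels, queries, embed)
print(preds.tolist())  # [0, 1]
```

The query near the all-zeros support class receives label 0 and the query near the all-ones class receives label 1, matching the minimum-distance rule described above.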

C. Model Inference
In order to quantify phase fractions in a STEM image (which can range from nm to µm in spatial dimension), each chip is used as a query point, q_i, so that the entire set of query points, Q, makes up the full image. The size of Q is determined by the size of each chip and the size of the full image, as shown in Table II. All q_i first go through the embedding function, and distances to each prototype are computed using the selected distance function. The network then produces a distribution over each of the K classes by computing a softmax over the distances and assigns a class label according to the highest normalized value. 51 The model-specific implementation and parameters are given in Table II. While the selection of model parameters is often tedious, specific model parameters in the few-shot context are generally straightforward, since we often leverage pretrained models for the embedding architecture. Here, a residual network with 101 layers, ResNet101, 52 was used as the embedding architecture. ResNet was specifically selected owing to its success in several related image recognition tasks. 52 Model weights for ResNet101 are available from PyTorch 53 (pytorch/vision v0.6.0), as trained on the image database ImageNet. 54 Additionally, the Euclidean distance metric was used, since this metric generally performs well across a wide variety of benchmark datasets and classification tasks. 51 These pretrained models come with specified parameters and trained model weights. However, any embedding architecture may be used, especially those well-suited for segmentation tasks. 55 The similarity module can likewise be any few-shot or meta-learning architecture; however, training such a module requires at least one GPU and may take several days to reach convergence given a sizeable database, e.g. a typical image database like ImageNet 54 contains 14 million images.
The scope of this manuscript covers only the former case: using an untrained few-shot model and pure inference to make judgments about an image.
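Once every chip in Q has a predicted label, the labels can be painted back onto the image grid to form the segmentation map; `labels_to_segmentation` below is a hypothetical helper illustrating this step, not part of the described pipeline:

```python
import numpy as np

def labels_to_segmentation(pred_labels, grid_shape, chip_size):
    """Paint each chip's predicted class label back onto the image grid.

    Produces a superpixel segmentation map: every pixel belonging to a
    chip shares that chip's label.
    """
    label_grid = np.asarray(pred_labels).reshape(grid_shape)
    # Kronecker product expands each grid cell into a chip_size x chip_size block
    return np.kron(label_grid, np.ones((chip_size, chip_size), dtype=int))

# A 4 x 4 grid of 32 x 32 px chips -> a 128 x 128 px label image
preds = [0, 0, 1, 1] * 4
seg = labels_to_segmentation(preds, (4, 4), 32)
print(seg.shape)  # (128, 128)
```

Phase fractions then follow directly, e.g. `(seg == k).mean()` gives the area fraction of class k.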

D. Classification
The segmentation output of few-shot classification using the Prototypical architecture for the three oxide systems is shown in Figure 2. The model output is a superpixel classification, i.e. every pixel that belongs to a chip receives the same label and corresponding color. For the MoO 3 nanoparticles, it is necessary to collect image montages to survey the wide variety of possible particle morphologies. Here again the few-shot method successfully distinguishes several nanoparticle orientations from the carbon support background, with minimal instances of inaccurate labeling. Note the ability of the few-shot approach to accommodate the visual complexity of S 1 seen in Figure 3 (row C, middle), with a range of shapes, contrast, and sizes defining this 'flat' category. While S 1 here is defined with several more chips than the others, the model is able to reasonably perform a segmentation task impossible for contrast-based methods alone.
The ability of this model to generalize to different material systems is demonstrated in Figure 2, which illustrates that varying microstructural features were successfully mapped for STO, LSFO, and MoO 3 .

E. Comparison to Other Methods
Initially, several image analysis techniques were explored in an effort to quantify microstructural features of interest in specific micrographs, i.e. segmentation. It was immediately obvious that no single segmentation method would perform well in the absence of preprocessing steps, such as contrast adjustments, smoothing, and sharpening. Ideally, the aim of preprocessing in these analyses is to globally minimize artificial contrast textures and locally emphasize object edges, a critical noise reduction step for most segmentation routines. 57 Given that preprocessing and segmentation are often inseparable, 58 we examine comparable segmentation methods in the context of both segmentation and preprocessing together. In an effort to compare the few-shot approach with more widely-used segmentation methods, an example image from the STO / Ge system was analyzed using techniques with varying noise sensitivity and segmentation capabilities, with results presented in Figure 3.
The simplest approach to segmentation falls under a family of thresholding techniques shown in the first row of Figure 3. The three methods shown in the top row are designed to separate pixels in an image into two or more classes, based on an intensity threshold. The threshold in these methods is determined using information about the distribution of pixel intensities either globally (top row left) or locally using a neighborhood of pixels (top row center and right). The neighborhood methods are commonly more sensitive to noise, while Otsu's more global technique appears to separate foreground pixels (light) from background (dark) relatively well.
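The global-versus-local thresholding contrast can be illustrated with scikit-image on a synthetic image; the parameter values below are illustrative, not those used to generate Figure 3:

```python
import numpy as np
from skimage.filters import threshold_otsu, threshold_local

rng = np.random.default_rng(0)
# Synthetic image: a bright foreground block on a dark, noisy background
image = rng.normal(0.2, 0.05, (128, 128))
image[32:96, 32:96] += 0.6

# Global (Otsu) threshold: one cutoff from the whole intensity histogram
global_mask = image > threshold_otsu(image)

# Local (adaptive) threshold: the cutoff varies with a pixel neighborhood,
# which makes it more sensitive to noise
local_mask = image > threshold_local(image, block_size=35)

print(round(global_mask.mean(), 2))  # foreground fraction -> 0.25
```

On this cleanly bimodal image the global Otsu mask recovers the foreground fraction exactly, while the local mask also flags noise fluctuations in the flat background, mirroring the behavior described for the top row of Figure 3.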
Moving beyond simple thresholding, we begin to look towards separating pixels into classes other than background and foreground. The segmentation methods shown in Figure 3 typically have the ability to separate intensities into multiple classes, again defined by the distribution of pixel intensities in the image. Two classes are specified for these routines in order to demonstrate the premise that, ideally, the image could be segmented according to the two distinct microstructures. These approaches also typically involve blurring filters and/or morphological operations 59 in order to remove pixels that are not part of a larger group or shape. While shape edges are more defined in the middle row of Figure 3 than in the top row, we note that the resulting segmentation still appears to be background/foreground and misses the distinction between microstructures. One obvious limitation of a direct implementation of these methods is that the resulting classes will always be based on intensity and not on the size or shape of the underlying microstructures. It may be possible to layer these methods with a shape detection routine, where shapes of approximately the same size may be clustered into the same class. However, we found that clustering shapes after foreground/background segmentation was not able to distinctly separate microstructural features in an unsupervised manner, i.e. without tedious manual intervention.
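The blur, intensity-clustering, and morphological-cleanup sequence described above can be sketched as follows; the tiny `kmeans_intensity` routine is a stand-in for library k-means implementations, and all parameters are illustrative:

```python
import numpy as np
from scipy import ndimage

def kmeans_intensity(image, k=2, iters=20):
    """Cluster pixel intensities into k classes with a minimal k-means.

    Centroids are initialized in ascending order and stay ordered, so
    label k-1 corresponds to the brightest cluster.
    """
    flat = image.ravel()
    centroids = np.linspace(flat.min(), flat.max(), k)
    for _ in range(iters):
        labels = np.abs(flat[:, None] - centroids[None, :]).argmin(axis=1)
        centroids = np.array(
            [flat[labels == j].mean() if np.any(labels == j) else centroids[j]
             for j in range(k)]
        )
    return labels.reshape(image.shape)

rng = np.random.default_rng(1)
image = rng.normal(0.2, 0.05, (64, 64))
image[16:48, 16:48] += 0.6
# Blur first (noise reduction), then cluster intensities into 2 classes
blurred = ndimage.gaussian_filter(image, sigma=1)
seg = kmeans_intensity(blurred, k=2)
# Morphological opening removes stray pixels not part of a larger shape
cleaned = ndimage.binary_opening(seg == 1, structure=np.ones((3, 3)))
print(cleaned.shape)
```

As the text notes, the resulting classes are still purely intensity-based: the bright cluster is recovered, but two microstructures of similar intensity would land in the same class.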
Rather than adding a shape clustering routine to an already segmented image, we implemented the few-shot approach described above.

III. CONCLUSION AND FUTURE WORK
Here we developed a flexible few-shot learning approach to STEM image segmentation that can significantly accelerate mapping and identification of phases, defects, and other microstructural features of interest in comparison to more traditional image processing methods. We studied three different materials systems (STO / Ge, LSFO, and MoO 3 ), with varying atomic-scale features and hence diversity in image data for model development.
Segmented images using the few-shot learning approach show good qualitative agreement with original micrographs.
When compared to other techniques, we find that noise sensitivity and/or labeling capability remain challenges for adaptive segmentation and clustering algorithms. The few-shot techniques explored in this manuscript provide powerful resources to combat these issues and remain flexible enough to accommodate a suite of materials. While few-shot machine learning has been increasingly successful in rapidly generalizing to new classification tasks containing only a few samples with supervised information, it is a known problem that the empirical risk minimizer can be unreliable. 60

A. Experimental Materials and Methods
The three experimental systems were prepared as follows. SrTiO 3 films were deposited onto Ge substrates using molecular beam epitaxy (MBE), as described elsewhere. 40 Because of its beam sensitivity, the STO / Ge images shown were collected using a frame-averaging approach: a series of 10 frames was acquired with 1024 × 1024 px sampling and a 2 µs px −1 dwell time, then non-rigidly aligned and upsampled 2× using the SmartAlign plugin. 62 Tens of images were collected from each material system, and a range of selected defect features was used in this study.
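For intuition, the signal-to-noise benefit of frame averaging can be illustrated with a simplified sketch that averages frames directly; the actual workflow uses non-rigid alignment via SmartAlign, which this omits by assuming perfectly registered frames:

```python
import numpy as np

rng = np.random.default_rng(2)
truth = np.zeros((256, 256))  # idealized noise-free frame
# Ten noisy acquisitions of the same frame (unit-variance Gaussian noise)
frames = [truth + rng.normal(0.0, 1.0, truth.shape) for _ in range(10)]
# Averaging n frames of uncorrelated noise reduces the noise std by sqrt(n)
averaged = np.mean(frames, axis=0)
print(round(float(averaged.std()), 2))  # ~1/sqrt(10) -> 0.32
```

In practice drift between frames would blur this direct average, which is why non-rigid alignment precedes the averaging step.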

B. Computational Methods
The specific implementation of the preprocessing techniques and the parameters for the few-shot model are described in Tables I and II.

VI. COMPETING INTERESTS STATEMENT
The authors declare no competing interests.

Figure 1 A schematic of the few-shot approach to segmentation. The raw STO / Ge image (A) is broken into several smaller chips (B), and a few user-defined chips are used to represent desired segmentation classes in the support set (C). Each chip then acts as a query and is compared against a prototype (D), defined by the support set, and categorized according to the minimum Euclidean distance between the query and each prototype, yielding the segmented image (E).

Figure 3 An image from the STO / Ge system (top) is analyzed with a suite of image processing techniques with varying noise sensitivity and labelling capabilities. The thresholding techniques (top) typically separate background from foreground in an image. The strict segmentation techniques (middle) have the ability to separate the image further into multiple classes, though the classes are defined solely on pixel intensity.