Similar classes latent distribution modelling-based oversampling method for imbalanced image classification

Learning an unbiased classifier from imbalanced image datasets is challenging since the classifier may be strongly biased toward the majority class. To address this issue, some generative model-based oversampling methods have been proposed. However, most of these methods pay little attention to boundary samples, so the samples they generate contribute little to learning an unbiased classifier. In this paper, we focus on boundary samples and propose a similar classes latent distribution modelling-based oversampling method. Specifically, we first model each class as a different von Mises–Fisher distribution, thereby aligning feature learning with the class distributions. Furthermore, we develop a distance minimization loss function, which brings latent representations from similar classes close to each other. In this way, the generator can capture more shared features during training. In addition, we propose a boundary sampling strategy, which uses latent variables near the decision boundary to generate boundary samples. These samples expand the minority decision region and reshape the decision boundary. Experiments on four imbalanced image datasets show that the proposed method achieves promising performance in terms of Recall, Precision, F1-score, and G-mean.


Introduction
In recent years, image classification has been an attractive field of research in computer vision [1]. The progress of image classification depends on high-quality, large-scale datasets, such as ILSVRC 2012 [2] and MS COCO [3]. In the real world, however, image datasets often have imbalanced distributions [4], meaning that one or more classes (majority classes) have far more samples than the others (minority classes) [5]. When imbalanced image datasets are used to train classifiers, the classifiers may be skewed toward learning the features of the majority class [6,7]. Such a biased learning process may lead to poor classification performance for the minority class. Since imbalanced data distributions occur in many practical classification tasks, such as anomaly detection [8,9], medical image classification [10,11], and object detection [12], they are a significant challenge that both industry and academia must face [7].
The goal in imbalanced image classification is to train an unbiased classifier that can correctly predict the class label of the data [13]. Among the existing methods, the Synthetic Minority Over-sampling Technique (SMOTE) is one of the most representative data-level methods [14,15]. The basic idea of SMOTE is to rebalance the dataset by synthesizing new minority class samples [16,17]. This technique has several extensions [18-21], which were designed to identify the boundary between the majority class and the minority class. However, traditional oversampling methods use the Euclidean distance as a similarity metric. As a result, they are unsuitable for handling high-dimensional imbalanced data such as images and audio [17].
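As a concrete illustration (not code from this paper), SMOTE's interpolation step can be sketched in a few lines of Python; the function name and calling convention are illustrative only:

```python
import random

def smote_sample(x, neighbor, rng=random.Random(0)):
    """Synthesize one minority sample by linear interpolation between a
    minority sample `x` and one of its minority-class nearest neighbors."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbor)]
```

The synthesized point always lies on the segment between the two inputs, which is exactly why, on high-dimensional image data, such Euclidean interpolation can produce implausible samples.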
Recently, deep generative models, such as variational autoencoders (VAEs) [22] and generative adversarial networks (GANs) [23], have been used to generate high-dimensional synthetic samples [13,24-29]. However, although most existing methods generate high-fidelity samples, those samples may be far from the decision boundary and thus contribute little to imbalanced image classification [30]. Recent studies have shown that samples near the decision boundary (called boundary samples in this paper) are more critical for training unbiased classifiers than samples far from the boundary [18,31]. To handle this issue, Mullick et al. [30] proposed Generative Adversarial Minority Oversampling (GAMO), a method that generates boundary samples. However, GAMO may suffer from mode collapse, resulting in a lack of diversity in the generated samples, which may not improve classifier performance significantly [32]. Guo et al. [33] proposed a Discriminative Variational Autoencoding Adversarial Network (DVAAN). The effectiveness of DVAAN may depend on accurately selecting two similar classes. Besides, DVAAN is a binary classification framework, which cannot conveniently handle multi-class imbalanced data directly.
Hence, in this paper, a similar classes latent distribution modelling-based oversampling method is proposed, which generates boundary samples in the minority class to help train an unbiased classifier. Specifically, we design a similar classes modelling network (SCN), an improved network based on VAE-GAN [34]. The proposed oversampling method consists of two steps. In the first step, the von Mises-Fisher (vMF) distribution is introduced as the prior and posterior distribution of SCN, which prevents the KL divergence from forcing all latent variables to concentrate on one point. By doing this, the encoder can model each class as a different vMF distribution, which aligns feature learning with the class distributions. This effectively improves the model's learning of latent features from imbalanced data [33,35]. In the second step, a Distance Minimization loss function (DM loss) is proposed to reduce the inter-class distance of similar classes, making their latent distributions closer. By clearly modelling class boundaries, the generator can learn common features of similar classes from latent variables around the decision boundary; these features are then fused into the generated samples through adversarial training. For oversampling, we propose a boundary sampling strategy, which forms new sampling regions in the middle of two similar classes. After training converges, the generator generates boundary samples from the latent variables in this region.
Our main contributions can be summarized as follows:
1. We propose a similar classes latent distribution modelling network (SCN), which models each class as a different latent distribution to clearly distinguish the minority class from the majority class.
2. We design a Distance Minimization loss function (DM loss) for the encoder, which brings similar classes closer in the latent space. Consequently, the generator can learn shared latent features from their decision region.
3. A boundary sampling strategy is proposed to generate boundary samples using the latent variables in the middle of two similar classes.
4. Extensive experiments on four imbalanced image datasets demonstrate the superiority of the proposed method.
The rest of this paper is structured as follows: Sect. 2 briefly discusses related work. Section 3 describes the proposed method in detail. In Sect. 4, the proposed method is evaluated through comparison and ablation experiments. Section 5 concludes this article.

Related work
Over the last few decades, experts and scholars have proposed various solutions to the imbalanced learning problem. These approaches are divided into two categories: data-level and algorithm-level. First, some algorithm-level and data-level methods are introduced, and then recent research on boundary samples-based oversampling methods is presented. Some concepts used in this paper are also introduced.

Algorithm-level methods
Traditional classification algorithms treat every class equally. Under class imbalance, however, not all classes play an equal role in learning: most of the time, the learning process is dominated by the majority class, causing low classification accuracy for the minority class. Algorithm-level methods aim to modify existing classification algorithms to reduce their bias toward the majority class. Cost-sensitive learning is one of the most effective algorithm-level methods [6]. It adjusts the training process by assigning additional cost to misclassified samples [7]; misclassified minority samples are assigned higher costs through a cost matrix. Some representative works are presented below. Khan et al. [5] jointly optimized the misclassification cost and the network weight parameters. The focal loss proposed by Lin et al. [12] makes the network pay more attention to misclassified samples by weighting the classification loss of different classes. Reference [36] proposed a class rectification loss combined with hard sample mining, with the goal of finding the boundary of the minority class and making the majority class less dominant. Although cost-sensitive learning is easy to combine with deep learning, setting the values of the cost matrix can be difficult [7].
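As a small numerical illustration of the focal loss of Lin et al. [12] for a single positive example (the default values of alpha and gamma are the commonly reported ones, not taken from this paper):

```python
import math

def focal_loss(p, gamma=2.0, alpha=0.25):
    """Focal loss for one positive example with predicted probability p.
    The (1 - p)^gamma factor down-weights easy examples (p near 1), so
    training focuses on hard, misclassified ones."""
    return -alpha * (1.0 - p) ** gamma * math.log(p)
```

Evaluating it shows the intended effect: a confidently correct prediction (p = 0.9) contributes orders of magnitude less loss than a badly wrong one (p = 0.1).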

Data-level methods
Data-level methods rebalance skewed class distributions by applying re-sampling strategies. Re-sampling takes the form of oversampling or undersampling: the former creates new minority class samples, and the latter removes existing samples from the majority class. These techniques balance the number of samples in the majority and minority classes.
Recently, many scholars have combined Generative Adversarial Networks (GANs) with re-sampling to handle class imbalance [37]. Douzas et al. [25] introduced the Conditional GAN (CGAN), a method that generates class-specific samples to augment the minority class. Odena et al. [38] proposed the Auxiliary Classifier GAN (ACGAN) to improve the quality of generated samples. However, these models have performed poorly on extremely class-imbalanced datasets, failing to generate the required minority samples. A more advanced method to overcome this issue is the Multiple Fake Class GAN (MFC-GAN) [28], which generates plausible samples by incorporating additional fake classes. In the above studies, the instability of the GAN training procedure remains a challenge in practice. To handle this issue, Suh et al. [17] proposed the Classification Enhancement GAN (CEGAN), where the classifier is constructed as an independent network and the GAN is optimized with the Wasserstein distance [39]. Other methods combine a GAN with an additional generative model, such as a VAE. Mariani et al. [40] proposed the Balancing GAN (BAGAN). This technique uses an autoencoder to initialize GAN training, which lets the GAN start adversarial training from a more stable point and helps improve the quality of the generated samples. BAGAN-GP [41] is an improved version of BAGAN. Furthermore, the Data Augmentation GAN (DA-GAN) proposed by Antoniou et al. [26] uses a similar initialization strategy to generate high-fidelity samples.


Boundary samples-based oversampling methods
Research shows that samples near the decision boundary are more likely to be misclassified, so learning a robust classification model requires boundary samples [31]. For this reason, some methods based on the decision boundary have previously been proposed in the literature, including Borderline-SMOTE [18], Safe-Level-SMOTE [19], and ADASYN [21].
In deep learning, a recently proposed method called Generative Adversarial Minority Oversampling (GAMO) [30] expands the boundary of the minority class by playing an adversarial game between a generator and a classifier to generate boundary samples. Although this model performs well on imbalanced image datasets, studies suggest that GAMO may suffer from mode collapse [32]. Guo et al. [33] proposed a network that combines a Discriminative VAE and a GAN (DVAAN). Specifically, DVAAN models two similar classes as a latent two-component mixture distribution and generates boundary samples for imbalanced learning. However, this method relies heavily on human intuition to choose the similar classes. In addition, DVAAN is a binary classification framework, which makes it hard to handle multi-class imbalanced data directly.

VAE-GAN
Since our proposed network is an improved VAE-GAN [34], it is necessary to introduce the related concepts before delving into the proposed method.
VAE-GAN combines a variational autoencoder with a generative adversarial network: the GAN is used to generate high-quality samples, while the VAE models the data in a latent space. The objective function of VAE-GAN therefore consists of the adversarial loss of the GAN and the negative evidence lower bound (ELBO) of the VAE.
The first term of Eq. (1) is the feature-wise reconstruction loss, which reconstructs z into data samples using the features extracted from the ℓ-th layer of the discriminator, denoted F_D^ℓ(·). The second term is the KL divergence, which forces the approximate posterior distribution q_φ(z | x) to match the prior p(z), usually assumed to be N(0, I):

L_VAE = −E_{q_φ(z|x)}[log p_θ(F_D^ℓ(x) | z)] + D_KL(q_φ(z | x) ‖ p(z)).  (1)

Furthermore, as shown in Eq. (2), VAE-GAN samples from the data distribution p_r(x), the posterior distribution q_φ(z | x), and the prior distribution p(z) during adversarial training:

L_GAN = log D(x) + log(1 − D(G(z))) + log(1 − D(G(z_p))),  z ∼ q_φ(z | x), z_p ∼ p(z).  (2)

Proposed method
This section presents the similar classes latent distribution modelling-based oversampling method. The innovation of our method is to generate boundary samples using latent variables near the decision boundary. To this end, Sect. 3.1 introduces how SCN models the latent space of similar classes. Section 3.2 explains how to generate boundary samples. Section 3.3 gives an example to better illustrate the proposed method.

Overall network
As shown in Fig. 1, SCN consists of five parts. We briefly introduce the function of each part below.
• The similar classes selector C_sim: a classification model trained with the imbalanced data. C_sim outputs the similar class label of the current sample according to the maximum misclassification posterior probability. By doing this, SCN can reshape the skewed decision boundary in a more targeted way.
• The encoder network E: a batch of image samples, class labels, and similar class labels is fed into the encoder E. Based on the class labels, E maps the image samples to different latent distributions. Furthermore, during training, E gradually guides the latent distributions of similar classes closer to each other.
• The generator network G: the function of G is the same as in a GAN; it tries to generate class-specific samples from the different latent distributions.
• The discriminator network D: D learns to distinguish between "real" and "fake" samples.
• The classifier network C: C is an independent network used to improve the quality of the generated samples.
The overall flow of the proposed method is shown in Fig. 2. The first step models each class as a different latent distribution (details in Sect. 3.1.1). The second step guides the latent distributions of similar classes closer to each other (details in Sect. 3.1.2). The training details of SCN are introduced in Sect. 3.1.3.

vMF distribution-based latent variable modelling
Our goal is to model each class as a latent distribution with a clear boundary, thereby aligning feature learning with the class distribution. Specifically, consider a dataset (X, Y) = {(x_i, y_i)}, i = 1, 2, …, n, where x_i represents a training sample and y_i represents its label. For any (x_i, y_i), we force

z_i^{y_i} ∼ q_φ(z | x_i, y_i) = Enc(x_i, y_i),  x̂_i ∼ p_θ(x | z_i^{y_i}) = Dec(z_i^{y_i}),  (3)

where Enc(·) and Dec(·) represent the encoder and the decoder (also the generator of the GAN), respectively; q_φ(·) and p_θ(·) are parameterized by the encoder and decoder, respectively; z_i^{y_i} is the latent variable of class y_i; and x̂_i represents the sample generated from this latent variable.
However, the original VAE-GAN does not meet the above requirements. Equation (4) describes the value range of each part of the ELBO, where the reconstruction loss and the KL divergence are both nonnegative:

−ELBO = L_rec + D_KL(q_φ(z | x) ‖ p(z)) ≥ 0.  (4)
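The pull of the KL term toward a single point can be seen from its closed form for a one-dimensional Gaussian posterior against an N(0, 1) prior, which is minimized only at μ = 0, σ = 1. A quick check in plain Python (our illustration, not part of the paper):

```python
import math

def kl_to_standard_normal(mu, sigma):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) for one latent dimension."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 2.0 * math.log(sigma) - 1.0)
```

Any class whose latent mean moves away from the origin pays an increasing KL penalty, which is why all class means are dragged toward the same point under this prior.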
During VAE-GAN training, the KL divergence continuously forces q_φ(z | x) to match p(z). The prior distribution is usually N(0, I), which causes the encoder to concentrate the latent distributions of all classes around one point, forming a single cluster in which the class boundaries cannot be distinguished (see Fig. 2a). To address this issue, it is necessary to introduce a prior distribution that does not affect the mean of the posterior when the KL divergence is optimized. The von Mises-Fisher (vMF) distribution can be viewed as the analogue of the Gaussian distribution on a (d − 1)-dimensional unit hypersphere. Its probability density is defined as

f(z; μ, κ) = C_d(κ) exp(κ μ^T z),  C_d(κ) = κ^{d/2−1} / ((2π)^{d/2} I_{d/2−1}(κ)),  (5)

where μ denotes the mean direction, κ ∈ ℝ≥0 denotes the concentration of the latent variables around μ, I_v is the modified Bessel function of the first kind at order v, and C_d(κ) is a normalizing constant. When κ = 0, the vMF distribution reduces to the uniform distribution on the hypersphere. SCN introduces vMF(μ, κ = 0) as the prior distribution and vMF(μ, κ) as the posterior distribution to depict the latent representation. Following the derivation method of [42], the KL divergence in Eq. (6) is obtained.
Crucially, the KL divergence term in Eq. (6) depends only on κ; the mean direction μ is optimized only through the reconstruction loss. The encoder can therefore effectively learn to model a different latent distribution for each class based on the class information, without being affected by the KL divergence (see Fig. 2b).
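For d = 3 the Bessel function I_{1/2} has a closed form, so the vMF density can be checked without special-function libraries. This small sketch (our illustration under the standard vMF definition, not SCN code) also shows the κ = 0 case reducing to the uniform density 1/(4π) on the sphere:

```python
import math

def vmf_pdf_3d(z, mu, kappa):
    """vMF density on the unit sphere in R^3.
    For d = 3, I_{1/2}(k) = sqrt(2/(pi*k)) * sinh(k), which gives the
    closed-form normalizer C_3(k) = k / (4*pi*sinh(k))."""
    if kappa == 0.0:
        return 1.0 / (4.0 * math.pi)  # uniform distribution on the sphere
    c3 = kappa / (4.0 * math.pi * math.sinh(kappa))
    dot = sum(zi * mi for zi, mi in zip(z, mu))  # mu^T z for unit vectors
    return c3 * math.exp(kappa * dot)
```

The density is largest in the mean direction μ and decays with the angle from μ at a rate controlled by κ, which is exactly the "concentration" role described above.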

The distance minimization loss function
If the class distributions are far apart, it is difficult for the generator to learn the shared latent features of similar classes, and hence to generate boundary samples. In this section, a Distance Minimization loss function (DM loss) is proposed to guide the latent distributions of similar classes closer together.
Specifically, SCN first uses the similar classes selector C_sim to obtain the similar class labels of the training samples. C_sim is a classification model pretrained on the imbalanced dataset. When a sample (x, y) is input to C_sim, it outputs the similar class label y_sim of x based on the highest misclassification probability. This way of selecting similar classes reflects both the shared features that the classifier has extracted and the skewed direction of the decision boundary. Furthermore, from the conclusions of [22,31] on the relationship between latent variables and generated samples, the following corollaries can be inferred:

Corollary 1 The distance between two latent variables is proportional to the distance between the corresponding generated samples.
This proportional relationship is described in Fig. 3.
Corollary 2 When the generated sample x̂^y is distributed near its similar class in the feature space, the classification probability of x̂^y satisfies p(y | x̂^y) ∝ exp(w_y^T x̂^y), where w_y and w_sim represent the classification weight vectors of the current class y and its similar class y_sim, respectively. According to [43], when the weight vectors are normalized to ‖w‖ = 1, the posterior classification probability becomes p(y | x̂^y) ∝ exp(‖x̂^y‖ cos θ_y), where θ_y represents the angle between x̂^y and w_y, and θ_sim represents the angle between x̂^y and w_sim. The classification result then depends only on θ_y and θ_sim. If θ_y ≈ θ_sim, the generated sample x̂^y is close to the decision boundary. In addition, the angle θ_{y,sim} between w_y and w_sim can be used to approximate the distance between class y and class y_sim in the feature space (see Fig. 3): the smaller θ_{y,sim} is, the closer the similar classes are in the feature space.
Based on the above two corollaries, the encoder is trained to minimize θ_{y,sim}. The encoder makes the latent distributions of similar classes closer, allowing the generator to learn more shared features from the latent variables in the decision region; the generator can then map these latent boundary variables to boundary samples through adversarial training. (When the generated sample and its similar class are far from the decision boundary, their latent distributions are also far from it; conversely, when both are concentrated near the decision boundary, so are their latent distributions.) Furthermore, if the angle θ_{y,sim} is too small, the latent distributions of similar classes may overlap. In this case, it is difficult for the discriminator to separate the class-specific features in the overlapping parts, which causes the generator to produce samples of the wrong class. Therefore, we prevent the latent distributions from coming too close by restricting

θ_y + m ≤ θ_sim,  (10)

thus making sure that the generator produces samples of the correct class. The margin parameter m is used to ensure that the generated samples x̂^y can be correctly classified by the classifier; in experiments, setting m to 0.2 effectively keeps the class boundary distinguishable. Equation (10) can be converted into the following hinge form:

L_DM = θ_{y,sim} + max(0, θ_y − θ_sim + m).  (11)

The DM loss is added to the objective function of the encoder. When the latent distributions overlap and the generator starts producing samples of the wrong class, the second term of L_DM acts as a penalty term: it forces the encoder to pay more attention to the class boundary and then gradually separates the overlapping parts of the latent distributions. The role of the DM loss is shown in Fig. 2c.
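Under one plausible reading of the DM loss (minimize the inter-class angle θ_{y,sim} plus a hinge on the sample angles θ_y and θ_sim with margin m), a plain-Python sketch might look as follows; the function names and the exact hinge form are our assumptions, not the authors' released code:

```python
import math

def angle(u, v):
    """Angle in radians between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / (nu * nv))))

def dm_loss(w_y, w_sim, x_gen, margin=0.2):
    """DM loss sketch: the first term pulls the similar classes together,
    while the hinge keeps the generated sample at least `margin` closer
    in angle to its own class weight than to the similar class weight."""
    theta_y_sim = angle(w_y, w_sim)   # angle between the two class weights
    theta_y = angle(x_gen, w_y)       # generated sample vs. its own class
    theta_sim = angle(x_gen, w_sim)   # generated sample vs. similar class
    return theta_y_sim + max(0.0, theta_y - theta_sim + margin)
```

When the generated sample already satisfies the margin constraint, the hinge term vanishes and only the inter-class angle is minimized; otherwise the penalty grows with the violation, matching the behavior described above.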

Training of SCN
Following the conclusion of Suh et al. [17], the discriminator and the classifier are designed as two independent networks to stabilize training. In addition, a feature-wise reconstruction loss is used, which enables the reconstructed samples to retain more details. The objective function of SCN is defined in Eqs. (12)-(15), where x is the training data sampled from the data distribution p_r(x), z is the latent vector sampled from the vMF distribution q_φ(z | x, y), and z̃ is the noise vector sampled from a uniform distribution p_z̃(z̃). y represents the class label of sample x, μ denotes the mean direction vector of the vMF distribution, and κ denotes the concentration around μ. Note that L_D, L_G, L_E, and L_C are the loss functions of the discriminator, generator, encoder, and classifier, respectively. Since minority class samples are scarce among the real data, training the classifier with real samples alone would easily overfit the majority class; therefore, our classifier is trained with both real and generated samples. The training details of SCN are summarized in Algorithm 1. To stably model a different distribution for each class on the hypersphere, the DM loss is added to the encoder every 5 epochs, gradually pulling similar classes closer.

Boundary sampling strategy
When the proposed network has been trained to convergence, maximum likelihood estimation is applied to obtain the distribution center of each class; for the vMF mean direction this takes the form

μ̂ = Σ_i z_i / ‖Σ_i z_i‖.  (16)

Since the latent distribution is modeled on the unit hypersphere, the cosine distance can be used as the distance measure. Each time, a minority class center μ_anchor is chosen as the "anchor", and μ_near, the class center with the shortest cosine distance from μ_anchor, is found; the new sampling center μ_new is then obtained by interpolating between μ_anchor and μ_near. The sampling distribution formed around μ_new is located in the decision region of the two similar classes, so the generator can generate boundary minority class samples using the latent variables from this region. These samples are closer to the decision boundary in feature space. The detailed process of the boundary sampling strategy is summarized in Algorithm 2.
In addition, the parameter λ ∈ (0, 1] can be used to adjust the position of the sampling center, thereby controlling the style of the generated image. As shown in Fig. 4, the digits "9" and "4" are two similar classes; when SCN generates the digit "9", λ controls how similar the generated "9" is to the digit "4": the greater the value of λ, the closer the sampling region is to the digit "4", and the more the generated sample resembles it. Incorporating these boundary samples with shared features into the training set can help the classifier learn a more robust decision boundary. In experiments, setting λ to 0.6 achieves good results.
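The sampling-center computation can be sketched as follows, assuming the new center is an interpolation between the anchor and its nearest similar-class center, projected back onto the unit hypersphere (the exact interpolation formula is garbled in our copy, so this form and all names are our assumptions):

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def boundary_center(mu_anchor, other_centers, lam=0.6):
    """Find the class center nearest to the minority anchor in cosine
    distance, interpolate toward it by lam, and re-project onto the
    unit hypersphere to obtain the new sampling center."""
    def cos_dist(u, v):  # inputs assumed to be unit vectors
        return 1.0 - sum(a * b for a, b in zip(u, v))
    mu_near = min(other_centers, key=lambda c: cos_dist(mu_anchor, c))
    mixed = [(1.0 - lam) * a + lam * b for a, b in zip(mu_anchor, mu_near)]
    return normalize(mixed)
```

With this form, a larger λ moves the sampling center toward the similar class, matching the behavior of the style parameter described above.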

An example of the proposed method
This section uses a simple example to better illustrate the proposed method. As shown in Fig. 5, consider an imbalanced image dataset in which the digits "6" and "7" are majority classes and the digit "5" is the minority class; the digits "5" and "6" are similar classes. The circles represent latent variables, and the blue dotted line represents the decision boundary learned from the imbalanced data.
In the early stages of training, each class forms a different vMF distribution (Fig. 5a) using the method in Sect. 3.1.1. Then, under the guidance of the DM loss in Sect. 3.1.2, the similar classes gradually move closer (Fig. 5b: the yellow circle gradually moves toward the blue circle). The generator can learn shared latent features from the latent variables between the similar classes. After training converges, the sampling method of Sect. 3.2 forms a sampling region between the red and blue circles (Fig. 5c: the orange circle). The parameter λ can be used to adjust the position of this sampling region. Finally, latent variables are sampled from the region to generate boundary samples, thereby repairing the skewed decision boundary (Fig. 5d: the blue dotted line moves to the red dotted line).
Datasets

MNIST is a dataset for digit image classification, consisting of 28 × 28-pixel grayscale handwritten digit images. MNIST has ten classes corresponding to the digits 0-9, with 50,000 training and 10,000 test samples.
Fashion-MNIST contains images of clothing articles. The dataset consists of 60,000 training samples and 10,000 test samples. Each sample is a 28 × 28 grayscale image associated with a label from ten classes: T-shirts/tops, trousers, pullovers, dresses, coats, sandals, shirts, sneakers, bags, and ankle boots.
CIFAR-10 is made up of 60,000 32 × 32-pixel color images, with 50,000 training samples and 10,000 test samples divided into ten classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks.
CINIC-10 is constructed from CIFAR-10 and ImageNet. This dataset also consists of 32 × 32 color images in ten classes. The training and test sets each contain 90,000 samples.
Fashion-MNIST samples are more complex than MNIST samples. For MNIST, it is easy to observe which classes are similar, such as the digits "5" and "6", and different comparison methods tend to reach the same conclusion. For Fashion-MNIST, however, different comparison methods may identify different similar classes. Therefore, we use MNIST to evaluate the impact of the imbalance ratio on the comparison methods, and Fashion-MNIST to evaluate the effectiveness of generating boundary samples.
CINIC-10 contains 60,000 images from CIFAR-10 and 210,000 images from ImageNet, and it contains considerable noise; for example, "cows" and "goats" appear in the "deer" class [47]. Therefore, CIFAR-10 is used to evaluate the feature extraction ability of the comparison methods under different imbalance ratios, while CINIC-10 is used to test their performance in the presence of noise.
To generate imbalance for evaluation, we created step-imbalance and long-tailed datasets following the experiments of Buda et al. [15] and Mullick et al. [30]. Examples of the step-imbalance and long-tailed distributions are shown in Fig. 6.
There are two factors to consider before creating an imbalanced dataset. The first is the Imbalance Ratio (IR) between the majority and minority classes, which determines the learning difficulty of the imbalanced problem; several IR settings are therefore needed for each dataset to comprehensively verify our method's performance. The IR is defined as

IR = max_i{N_i} / min_i{N_i},  (17)

where N_i is the number of samples in class i. The second factor is that the classifier can easily distinguish majority from minority classes if they have clear class boundaries; as a result, different classes should be chosen as the majority class in the experiments. Based on the first factor, IR is set to 10, 30, 50, and 100. Based on the second factor, we chose different classes as the majority class. All experiments are evaluated on the original test set, and the average performance is reported. The details of the imbalanced datasets are listed in Tables 1 and 2.
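A step-imbalanced subset with a given IR can be built by subsampling, as sketched below (our illustration of the general recipe; function and variable names are not from the paper):

```python
import random

def step_imbalance(samples_by_class, majority_classes, ir, rng=random.Random(0)):
    """Build a step-imbalanced subset: majority classes keep n_maj samples,
    minority classes are subsampled to n_maj // ir, so IR = n_maj / n_min."""
    n_maj = min(len(s) for s in samples_by_class.values())
    n_min = max(1, n_maj // ir)
    out = {}
    for label, samples in samples_by_class.items():
        k = n_maj if label in majority_classes else n_min
        out[label] = rng.sample(samples, k)
    return out
```

For a long-tailed variant, the per-class count would instead decay gradually from n_maj to n_maj // ir across the class order.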

Comparison methods
The performance of the proposed method is evaluated by comparing it to the six methods listed below.

Evaluation metrics
The most commonly used metric for evaluating classifier performance is overall accuracy. However, when the data are imbalanced, the majority classes dominate this score, leading to a highly misleading assessment. Therefore, to evaluate the classification performance more accurately, we use Precision, Recall, G-mean, and F1-score as evaluation metrics. They are listed in Table 3, where TP, TN, FP, and FN stand for true positives, true negatives, false positives, and false negatives, respectively. Precision represents the proportion of samples predicted positive that are actually positive. Recall reflects the proportion of actual positive samples that are correctly predicted. The F1-score is the harmonic mean of Precision and Recall. The G-mean (geometric mean score) tries to maximize the accuracy of each class while keeping these accuracies balanced.
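The metrics of Table 3 can be computed directly from confusion counts. A minimal binary-case sketch (for multi-class evaluation, the G-mean generalizes to the geometric mean of the per-class recalls):

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Precision, Recall, F1-score, and G-mean from binary confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    gmean = math.sqrt(recall * specificity)
    return precision, recall, f1, gmean
```

The G-mean drops sharply when either class accuracy is low, which is why it is preferred over overall accuracy for imbalanced data.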

Experimental setup
All models use the same network architecture on the same dataset; "Appendix 1" provides the detailed network architectures. B-SMOTE and ADASYN are implemented using the imbalanced-learn library [48]. LeNet-5 [44] is used as the classification model on all datasets, optimized with stochastic gradient descent (SGD) with a momentum of 0.9 and a learning rate of 0.01. For MNIST and Fashion-MNIST, all comparison methods and SCN are trained for 100 epochs with a batch size of 128, using the Adam optimizer with β1 = 0, β2 = 0.999, and a learning rate of 0.0001. For CIFAR-10 and CINIC-10, the number of epochs and the batch size are increased to 300 and 256, respectively, and the learning rate is 0.0002.

Comparative experiment
First, a classifier is trained on the original imbalanced dataset, forming a Baseline for comparing model performance. Then each method above is used to augment the imbalanced dataset, and the classifier is trained again with the augmented dataset. Finally, classifier performance is compared on the four datasets.

Results on step imbalance datasets
The comparison results on the step-imbalance MNIST are shown in Table 4. Our method outperformed the comparison methods on these datasets. Since the features of MNIST samples are not complex, all comparison methods can extract discriminative features for the different classes; therefore, these methods perform better than the Baseline at low imbalance ratios. At high imbalance ratios, however, the performance of all methods except SCN and GAMO slips below the Baseline. This is because the number of minority class samples decreases significantly, and these methods prefer to learn the features of the majority class while ignoring the minority class, leading them to generate low-quality minority class samples. Our method uses the DM loss to penalize overlapping distributions, ensuring clear boundaries between the different class distributions; thus, even at high imbalance ratios, SCN can generate plausible minority class samples. Table 5 reports the comparison results on the step-imbalance Fashion-MNIST. All methods were significantly better than the Baseline. As the imbalance ratio increases, the performance of B-SMOTE and ADASYN drops sharply because the samples they synthesize lack diversity and the classifier easily overfits them. Interestingly, when IR = 30, the performance of DVAAN is close to that of GAMO in terms of Precision and Recall, but when IR = 50, it is much lower than that of SCN and GAMO. This is because Fashion-MNIST samples have more complex features than MNIST samples, and different methods extract different features. As shown in Fig. 11, SCN generates samples that mix "Coat" and "Bag" features, while GAMO generates samples that mix "Coat" and "Pullover" features.
To further verify the performance of SCN, we investigate how the performance of the comparison methods changes as IR varies. Note that, under each IR, DVAAN chooses similar classes by visual intuition. As shown in Fig. 7, when IR=20, the performance of GAMO, ADASYN, B-SMOTE, and SCN is relatively close. When IR=60, the performance degradation of SCN starts to slow down, which indicates that SCN can handle highly imbalanced datasets. Notably, the performance of DVAAN is very volatile because the decision boundaries of Fashion-MNIST shift considerably under different IRs, which means the human eye is not reliable for selecting similar classes. Compared with other methods that generate boundary samples, SCN identifies similar classes in a more principled way: we train a similar classes selector on the original imbalanced data. This selector reveals which decision boundaries are biased toward which classes, so that SCN can generate targeted samples to reshape the class boundaries.
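The paper does not spell out the selector's decision rule here; one plausible minimal sketch (names and the confusion-matrix heuristic are our illustrative assumptions) picks, for each minority class, the class its samples are most often misclassified as by a classifier pretrained on the imbalanced data:

```python
import numpy as np

def select_similar_class(conf_mat, minority_idx):
    """For a minority class, return the class its samples are most often
    misclassified as -- a proxy for the neighbor that shares a biased
    decision boundary with it."""
    row = conf_mat[minority_idx].astype(float).copy()
    row[minority_idx] = -np.inf  # ignore correct predictions
    return int(np.argmax(row))

# Toy confusion matrix from a classifier trained on imbalanced data:
# minority class 2 is mostly confused with class 0.
cm = np.array([[90,  5,  5],
               [ 4, 92,  4],
               [30, 10, 60]])
print(select_similar_class(cm, 2))  # → 0
```

Because the confusion matrix is recomputed from the trained classifier, the selected similar class can change with IR, unlike a fixed human choice.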
CIFAR-10 is a 3-channel color image dataset with rich foreground and background content, which increases the difficulty of feature learning. The results in Table 6 show that when IR=10, the performance of ACGAN is below the Baseline, which means that the wrong features are extracted. Meanwhile, the performance of B-SMOTE and ADASYN is lower than the Baseline because they use the Euclidean distance to calculate the similarity between samples, which is prone to finding the wrong neighbor samples; as shown in Fig. 12, the samples they generate have many artifacts. When IR=30, as shown in Fig. 12, BAGAN-GP suffers from mode collapse, so its performance is also lower than the Baseline. When IR=50, the differences between methods become more pronounced. Although the performance of both GAMO and SCN is significantly higher than the Baseline, the classifier of GAMO is adversarial to the generator, which limits the feature learning ability of the generator. Our method models the different classes as latent distributions, so that feature learning is aligned with the class distributions, making it easier to learn features from complex images. Furthermore, we generate samples in the decision region of the minority class, which further improves the performance of SCN at high imbalance ratios.
The performance comparison results on CINIC-10 are listed in Table 7. Since the CINIC-10 dataset contains a large number of samples from the ImageNet [2] dataset, it has many noisy samples [47]. In addition, CINIC-10 has 9000 samples per class, 1.8 times as many as CIFAR-10. The minority classes sampled from CINIC-10 therefore have a more dispersed class distribution, so it is more difficult for GAN-based methods to fit their distributions. The advantage of SCN is that we first model the scattered data as latent variables of a specific distribution (the vMF distribution). The classifier then gradually concentrates these scattered latent variables around their respective class distributions during training, which makes it easier for SCN to fit the distribution of the minority class. In addition, owing to the constraint of the DM loss, even when the data are scattered, the distributions do not overlap, as can be observed from the modelling results of SCN (Fig. 8d). Therefore, SCN performs better than GAN-based methods, especially under highly imbalanced situations. Compared with DVAAN, which also models the latent distribution, SCN is superior under different IRs because its similar classes selector obtains relatively accurate boundary information.
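With all classes modelled as vMF distributions on the unit hypersphere (and, as a simplifying assumption we make here, a shared concentration kappa), scoring a latent variable against the class distributions reduces to cosine similarity with the class mean directions; a hedged numpy sketch, not the paper's exact formulation:

```python
import numpy as np

def vmf_class_scores(z, class_means, kappa=10.0):
    """Unnormalised vMF log-likelihoods (up to an additive constant) of a
    latent vector under each class: kappa * <z, mu_c>, after projecting
    both the latent and the class means onto the unit hypersphere."""
    z = z / np.linalg.norm(z)
    mus = class_means / np.linalg.norm(class_means, axis=1, keepdims=True)
    return kappa * (mus @ z)

# A latent near the first class's mean direction scores highest for class 0.
mus = np.array([[1.0, 0.0], [0.0, 1.0]])
print(int(np.argmax(vmf_class_scores(np.array([0.9, 0.1]), mus))))  # → 0
```

This is why scattered samples can still be pulled toward a compact directional cluster: only the direction of the latent matters, not its magnitude.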

Results on long-tailed datasets
To further verify the superiority of our method, we conduct comparative experiments on long-tailed datasets. Unlike step imbalance, every class in a long-tailed dataset has a different number of samples, so the decision boundary may be biased from the head classes toward the tail classes, or from the middle classes toward the tail classes. In other words, long-tailed datasets have complex decision boundaries. Table 8 gives the experimental results of the different methods. On the MNIST and Fashion-MNIST datasets, traditional oversampling methods can also alleviate long-tailed imbalanced learning to a certain extent, which may be because they generate boundary samples that are critical for classification. Moreover, since MNIST and Fashion-MNIST are grayscale images, Euclidean distance can be used to measure sample similarity. On CIFAR-10 and CINIC-10, however, traditional oversampling methods fall below the Baseline. Compared with BAGAN-GP, the other methods that focus on boundary samples perform better, which shows that boundary samples are more helpful in reshaping the class boundary. Among them, SCN is the most suitable for dealing with complex class boundaries because its similar classes selector dynamically selects different similar classes according to the actual situation, thereby generating boundary samples that are more critical for classification. Furthermore, Fig. 8 shows the modelling results of SCN on the long-tailed datasets. In the latent space, each class has a clear boundary and a tight intra-class distribution. Even for the minority classes of CINIC-10, SCN forms obvious clusters. We can also observe some outliers in the modelling results, which may be sample noise, but SCN prevents them from overlapping with other distributions.
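The G-mean reported throughout these comparisons rewards balanced per-class performance; its standard definition, the geometric mean of per-class recalls, can be sketched as:

```python
import numpy as np

def g_mean(y_true, y_pred, n_classes):
    """Geometric mean of per-class recalls: high only when every class,
    including the minority classes, is recalled well."""
    recalls = [(y_pred[y_true == c] == c).mean() for c in range(n_classes)]
    return float(np.prod(recalls) ** (1.0 / n_classes))

y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 0])
print(round(g_mean(y_true, y_pred, 2), 3))  # sqrt(0.75 * 0.5) ≈ 0.612
```

Because a single zero-recall class drives the product to zero, G-mean exposes classifiers that sacrifice the tail classes, which plain accuracy hides.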

Ablation study
This section conducts ablation experiments on the DM loss and the boundary sampling strategy.

Ablation study for DM loss
To verify the role of the DM loss, we create three ablation variants. The results are listed in Table 9: SCN outperforms all of the ablation variants.
The modelling results are visualized to better illustrate the role of the DM loss. As shown in Fig. 9, SCN-A models the latent distribution using only the reconstruction loss, and thus forms many overlapping clusters under the influence of the imbalanced data. Because SCN-B lacks the DM loss, it can only model each class as a different latent distribution. SCN-C is the complete opposite of SCN-B: it pulls similar classes closer, but causes overlap between the class distributions. In the modelling results of SCN, similar classes are close to each other while the class boundaries remain distinguishable. Therefore, the classification loss and the DM loss are complementary.
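The exact form of the DM loss is given earlier in the paper; purely as an illustration of the idea (our own simplified stand-in, not the paper's equation), a pairwise penalty on the distance between similar-class latent means captures the pull that the classification loss then counterbalances:

```python
import numpy as np

def dm_loss(class_means, similar_pairs):
    """Simplified distance-minimisation (DM) penalty: the mean squared
    Euclidean distance between the latent means of each
    (minority class, similar class) pair. Minimising it pulls similar
    classes together; the classification loss keeps their boundaries apart."""
    total = 0.0
    for i, j in similar_pairs:
        diff = class_means[i] - class_means[j]
        total += float(diff @ diff)
    return total / len(similar_pairs)

means = np.array([[1.0, 0.0], [0.8, 0.6], [-1.0, 0.0]])
print(round(dm_loss(means, [(0, 1)]), 6))  # → 0.4, classes 0 and 1 are close
```

Training only this term (as in SCN-C) collapses similar classes onto each other, which is exactly the overlap the ablation exposes.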

Ablation study for sampling strategy
Many existing deep generative model-based methods generate samples by sampling at the distribution centers. To demonstrate that our sampling strategy contributes more to training a classifier, we designed an experiment comparing different sampling centers: we create an ablation variant (SCN-Center) that takes the distribution center of each class as the sampling center. The experimental results are listed in Table 10. SCN-Center outperforms the Baseline because it generates samples of the correct class to augment the imbalanced dataset. However, SCN-Center generates samples around the center of the class distribution; since these samples are far from the class boundary, they contribute little to training the classifier. This is why the performance of SCN-Center lags behind that of SCN.

Fig. 9 Modelling results of the three variants and SCN

Table 9 Results of the ablation study for the DM loss on the MNIST dataset (bold indicates the best performance)

Table 10 Results of the ablation study for the sampling strategy on the MNIST dataset, under long-tailed, step imbalance (IR=30), and step imbalance (IR=50) settings (bold indicates the best performance)
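The contrast between center sampling and boundary sampling can be sketched as follows; this is a hedged toy version (interpolation weights, names, and the spherical projection are our assumptions, not the paper's exact strategy):

```python
import numpy as np

def boundary_latents(minority_mu, similar_mu, n, alpha=0.5, rng=None):
    """Toy boundary sampling: draw latent variables on the segment between a
    minority-class mean and its similar class's mean, with mixing weights
    just above alpha so samples sit near the shared boundary but on the
    minority side, then project back onto the unit hypersphere."""
    rng = np.random.default_rng(rng)
    w = rng.uniform(alpha, alpha + 0.2, size=(n, 1))  # biased toward minority
    z = w * minority_mu + (1.0 - w) * similar_mu
    return z / np.linalg.norm(z, axis=1, keepdims=True)

z = boundary_latents(np.array([1.0, 0.0]), np.array([0.0, 1.0]), n=5, rng=0)
print(z.shape)  # → (5, 2)
```

Decoding such latents yields samples inside the minority decision region but near its edge, which is what lets them expand that region, unlike SCN-Center's samples around the distribution center.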

Statistical analysis
To further verify the effectiveness of our method, we conduct a statistical analysis of the performance of SCN and the comparison methods. Following the recommendations in [49], we apply the Friedman test [50] to statistically rank SCN and all comparison methods on the imbalanced data described in Sect. 4.3, and then use the Holm test [51] to verify whether the ranking differences between SCN and the other methods are statistically significant. All statistical tests are conducted with KEEL [52]. Table 11 presents the ranking results of the Friedman test: SCN outperforms the other methods on all evaluation metrics. The adjusted p values obtained by the Holm test are listed in Table 12. At the 0.05 significance level, there is a significant difference between SCN and every comparison method except GAMO in Precision, Recall, and F1-score, which means that SCN is statistically better than these methods. In addition, the classification performance of SCN is better than that of GAMO on the G-mean. Overall, the nonparametric tests demonstrate that SCN achieves satisfactory performance.
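The Friedman step of this procedure is also available outside KEEL; a minimal sketch with made-up scores (the data below are illustrative, not the paper's results, and a Holm correction would then adjust the pairwise p values):

```python
from scipy.stats import friedmanchisquare

# Hypothetical F1-scores of three methods, paired by dataset (four datasets).
scn    = [0.93, 0.88, 0.71, 0.69]
gamo   = [0.92, 0.86, 0.68, 0.64]
bsmote = [0.90, 0.80, 0.55, 0.50]

# The test ranks the methods within each dataset and asks whether the
# average ranks differ more than chance would allow.
stat, p = friedmanchisquare(scn, gamo, bsmote)
print(round(stat, 2), p < 0.05)
```

Here the same ordering holds on every dataset, so the rank differences come out significant; with noisier orderings the test correctly loses power.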

Conclusion and future work
This paper proposes a similar classes latent distribution modelling-based oversampling method to alleviate the difficulty of learning from imbalanced image datasets. First, we model each class as a different vMF distribution to reduce the difficulty of learning from a single unimodal distribution. Second, a distance minimization loss function is introduced into the encoder, which brings similar classes closer so that the generator can learn shared latent features in the decision region of similar classes. In addition, similar classes are selected using a pretrained classifier, which effectively avoids the bias caused by human selection. Since this classifier is trained on the imbalanced datasets, it also reflects the skewed direction of the decision boundary, so our method can fix the skewed class boundary in a more targeted way. Finally, we design a boundary sampling strategy that samples latent variables in the decision region to generate boundary samples. By adding these samples to the training set, the classifier can learn more difficult-to-classify (similar) features, further improving its robustness to imbalanced image datasets. Extensive experiments demonstrate that the proposed method effectively handles the imbalance problem. On CIFAR-10 and CINIC-10 with step imbalance (IR=50), G-mean improves by about 5% and 6%, respectively; with long-tailed imbalance, G-mean is enhanced by about 3% and 5% on CIFAR-10 and CINIC-10, respectively. In future work, we will consider combining manifold learning methods to capture the manifold structure of the training data and thereby select similar classes more accurately. We also plan to apply our method to real-world scenarios beyond benchmark datasets.