Class-specific extreme learning machine based on overall distribution for addressing binary imbalance problem

Class imbalance occurs when the training dataset contains significantly fewer samples of one class than of another. The conventional extreme learning machine (ELM) gives the same importance to all samples, which leads to results that favor the majority class. To address this intrinsic deficiency, modifications of ELM have been developed, such as the weighted ELM (WELM) and the WELM based on the overall distribution (ODW-ELM). Recently, the class-specific ELM (CS-ELM) was designed for class imbalance learning, where it was shown that its derivation of the output weights, β, is more efficient compared to the class-specific cost regulation ELM (CCRELM) for handling the class imbalance problem. Motivated by CCRELM, X. Luo et al. proposed the classifier ODW-ELM, which is also not efficient for imbalanced learning. In this work, a novel class-specific ELM based on the overall distribution (OD-CSELM) and its kernelized version (OD-CSKELM) are proposed to address the binary class imbalance problem more effectively. OD-CSELM and OD-CSKELM are motivated by CS-ELM. In addition, the computational complexity of OD-CSELM and OD-CSKELM is significantly lower than that of WELM and kernelized WELM, respectively. The proposed work is evaluated using benchmark real-world imbalanced datasets. The experimental results demonstrate that the proposed work gives better generalization performance than the other classifiers for class imbalance learning.


Introduction
Learning from imbalanced class distributions has drawn increasing consideration from the data mining community, and many classification algorithms have been presented (He and Garcia 2009; Sarmanova and Albayrak 2013; Haixiang et al. 2017). Almost all real-world classification problems do not have a uniform class distribution. The classes whose number of samples is below the average number of samples per class are termed the minority classes, and the classes whose number of samples is above the average are termed the majority classes. Some examples of imbalanced learning are fraud detection (Wei et al. 2013; Zakaryazad and Duman 2016), software defect prediction (Wang and Yao 2013; Krawczyk et al. 2016), imbalanced data for wireless sensor networks (Patel et al. 2020), classification of an imbalanced multi-modal stroke dataset (Bhattacharya et al. 2020), etc. The problem associated with class imbalance learning is that conventional classifiers usually misclassify most of the minority class samples as majority class samples. For problems such as fraud detection, minority class detection has greater importance than the majority class. Therefore, a higher recall value for the minority class is appropriate for any imbalanced classification problem.
During the last decades, various methods have been proposed to handle class imbalance problems (He and Garcia 2009; Galar et al. 2012). The methods available for imbalanced classification problems (Galar et al. 2012) can be broadly categorized as data level methods, algorithmic level methods, and cost-sensitive methods. Data level methods such as oversampling and undersampling (He and Garcia 2009; Liu et al. 2009) alter the data space to reduce the impact of the class imbalance. The undersampling method randomly selects a fraction of the majority class samples and balances the data distribution at the cost of information loss.
For example, the EasyEnsemble and BalanceCascade (Liu et al. 2009) algorithms employ undersampling for balancing the dataset. The oversampling method randomly duplicates samples of the minority classes to increase the cardinality of the minority class, which may lead to over-fitting. Informed oversampling approaches, on the other hand, such as the synthetic minority over-sampling technique (SMOTE) (Chawla et al. 2002), generate synthetic minority class samples to balance the class distribution; each synthetic sample is interpolated between a minority sample and one of its nearest minority neighbours.
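As an illustration of this interpolation step, the following minimal sketch (our own illustration; the function name, neighbour count k, and RNG seed are assumptions, not part of the reference SMOTE implementation) generates synthetic minority class samples:

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, rng=np.random.default_rng(0)):
    """Minimal sketch of SMOTE's interpolation step.

    X_min: (N+, n) array holding only the minority class samples;
    each synthetic sample lies on the segment between a minority
    sample and one of its k nearest minority neighbours.
    """
    # Pairwise Euclidean distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                  # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]    # k nearest neighbours per sample

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))             # pick a random minority sample
        j = rng.choice(neighbours[i])            # pick one of its neighbours
        gap = rng.random()                       # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```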
Algorithmic-level methods directly modify the classifier design to address imbalanced learning. Cost-sensitive methods assign a higher penalty for misclassifying the minority class samples than the majority class samples; that is, misclassification of minority class samples is much more costly. Examples include WELM (Zong et al. 2013) and the weighted support vector machine (WSVM) (Yang et al. 2007). To find the optimal solution, these methods minimize the weighted cumulative error with respect to each sample.
Extreme learning machine (Huang et al. 2006) has drawn increasing attention among researchers around the world. It is a single-hidden-layer feed-forward neural network (SLFN) with random weights between the input and the hidden layer. It computes the weights between the hidden layer and the output layer analytically utilizing the Moore-Penrose (MP) pseudo-inverse. This makes ELM much faster than standard neural networks, which require great effort in hyperparameter tuning. Standard ELM (Janakiraman et al. 2015, 2016) does not take the class imbalance problem into consideration viably. Several extensions of ELM such as WELM (Zong et al. 2013), Boosting WELM (Li et al. 2014), CCRELM (Xiao et al. 2017), CS-ELM (Raghuwanshi and Shukla 2018a), class-specific kernelized ELM (CSKELM) (Raghuwanshi and Shukla 2018b), SMOTE-based class-specific kernelized ELM (Raghuwanshi and Shukla 2021), and UnderBagging reduced kernelized WELM (Raghuwanshi and Shukla 2018) have been designed to address imbalanced learning effectively.
As mentioned above, random weights are employed to map the input data to the feature space, and these weights remain unaltered during the training phase. Due to this, some of the samples, usually those near the decision boundary, are misclassified in certain realizations. It has been stated in Iosifidis and Gabbouj (2015) that ELM with the Gaussian kernel, also referred to as kernelized ELM (KELM), outperforms ELM with the sigmoid node for small- and medium-size datasets, and that for most of these datasets it also runs faster, although it is relatively sluggish on larger datasets. WELM (Zong et al. 2013) minimizes the weighted cumulative error with respect to each sample. WELM uses two weighting schemes to assign class-wise weights to the samples; these schemes assign more weight to increase the impact of the minority class while diminishing the relative impact of the majority class. It has been stated in He and Ma (2013) that the Lagrangian multiplier α corresponding to each minority class sample must be larger than the α of the majority class samples. The proposed algorithm employs class-specific regularization parameters instead of assigning weights to the training samples.
As mentioned above, the Lagrangian multipliers α_i corresponding to the minority class samples should be greater than those of the majority class samples. Therefore, this work strengthens the regularization parameter for the minority class samples in contrast to the majority class samples, giving more importance to the minority class. Equivalently, the proposed algorithm assigns a weight to each class instead of assigning weights to the individual training samples, and the weight of the minority class is strengthened compared to that of the majority class. ODW-ELM (Luo et al. 2018) is an extension of CCRELM and WELM. It has been shown in Raghuwanshi and Shukla (2018a) that the derivation of the output weights β in CS-ELM is more efficient compared to CCRELM (Xiao et al. 2017) for imbalanced learning.
Regarding the solution β of CCRELM (Xiao et al. 2017): if λ+ and λ− are not of the same order of magnitude, for example, λ+ = 2^20 and λ− = 2^2, then I/λ+ + I/λ− ≈ I/λ−. As a result, the solution formula of β of CCRELM becomes almost the same as that of standard ELM with a single regularization parameter. Motivated by CCRELM, X. Luo et al. proposed the method ODW-ELM, which is also not efficient for imbalanced learning. In this work, a novel class-specific ELM based on the overall distribution (OD-CSELM) and its kernelized version (OD-CSKELM) are proposed to address the binary class imbalance problem more effectively.
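The order-of-magnitude argument is easy to verify numerically (plain Python, with the values from the example above):

```python
lam_pos, lam_neg = 2.0**20, 2.0**2
print(1/lam_pos + 1/lam_neg)   # 0.2500009536743164
print(1/lam_neg)               # 0.25 -> nearly identical, so the term
                               # contributed by lam_pos effectively vanishes
```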
The main contributions of this paper are highlighted below.
1. OD-CSELM is an extension of CS-ELM and a cost-sensitive method to handle imbalanced problems. It has been stated in Raghuwanshi and Shukla (2018a) that CS-ELM performs better than WELM and also has a lower computational cost.
2. The proposed OD-CSKELM is an extension of CSKELM and a cost-sensitive method to handle imbalanced problems. It has been stated in Raghuwanshi and Shukla (2018b) that CSKELM performs better than KWELM and also has a lower computational cost.
3. The proposed work also has a lower computational cost in contrast to WELM and KWELM for imbalanced learning.
4. The proposed OD-CSELM and OD-CSKELM take advantage of the overall distribution framework.
5. The proposed OD-CSELM and OD-CSKELM use information theory to compute the weights and assign a weight to each class instead of to every instance to handle imbalanced classification problems.
6. This work presents comprehensive experiments comparing the OD-CSELM and OD-CSKELM methods with the other tested methods.
7. A statistical significance test is also conducted to check whether the new classifiers significantly outperform the other classifiers in terms of G-mean and AUC.
The rest of the paper is organized as follows. Section 2 elaborates related work in detail. Section 3 explains the proposed work. Section 4 presents the experimental setup and the result analysis. The last section concludes the paper along with future work.

Extreme learning machine
ELM was first proposed by Huang et al. (2006, 2012) for generalized single-hidden-layer feed-forward neural networks. ELM is much faster than standard neural networks, which require iterative hyperparameter tuning. ELM randomly allocates the input weights and the hidden layer biases, and the output weights of ELM are determined analytically utilizing the Moore-Penrose pseudo-inverse. Here, the input vector x_i = [x_i1, x_i2, ..., x_in]^T ∈ R^n has the desired output t_i ∈ R^m, u_j = [u_j1, u_j2, ..., u_jn] are the weights connecting the jth hidden neuron with the input neurons, and b_j is the bias of the jth hidden neuron. These weights remain unaltered over the training phase. The output of the hidden layer for the ith sample is given as follows:

φ(x_i) = [G(u_1 · x_i + b_1), G(u_2 · x_i + b_2), ..., G(u_L · x_i + b_L)]

Here, G(·) represents the hidden layer activation function and L is the number of hidden neurons. The output of the hidden layer for all the training samples is represented by the matrix Φ = [φ(x_1)^T, φ(x_2)^T, ..., φ(x_N)^T]^T. In the original ELM, the following optimization problem is formulated:

Minimize: L_ELM = (1/2)||β||^2 + (λ/2) Σ_{i=1}^{N} ||ξ_i||^2
Subject to: φ(x_i) β = t_i^T − ξ_i^T, i = 1, ..., N    (3)

Here, λ is a regularization parameter, ξ_i is the tolerable error of the ith training sample, ||β||^2 is the parameter of the separating hyperplane, also known as the structural risk, and ||ξ||^2 is the sum of squared errors, also known as the empirical risk. The structural risk maximizes the margin between the separating classes (Deng et al. 2009). Here, β is the output weight matrix between the hidden and the output layer. The solution of the mathematical model described in (3) is obtained by Huang et al. (2012) as

β = (I/λ + Φ^T Φ)^(-1) Φ^T T

Here, I represents the identity matrix of appropriate dimension. For the small training sample case, the solution determined in Huang et al. (2012) is given below:

β = Φ^T (I/λ + Φ Φ^T)^(-1) T    (5)

Here, I represents the identity matrix of N × N dimension. Applying Mercer's condition, the kernel matrix of KELM is defined as illustrated in Eq. (6):

Ω = Φ Φ^T, with Ω_{ij} = K(x_i, x_j)    (6)

The output can be determined by the following expression:

f(x) = φ(x) β = [K(x, x_1), ..., K(x, x_N)] (I/λ + Ω)^(-1) T

Here, f(x) = [f_1(x), ..., f_m(x)] is the output vector. The predicted label of x is defined as label(x) = arg max_{k ∈ {1, ..., m}} f_k(x).
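For concreteness, a minimal Python sketch of regularized ELM training and prediction under the two solutions above is given below; the names, the sigmoid activation choice, and the default values are our assumptions rather than a prescription of the original papers.

```python
import numpy as np

def train_elm(X, T, L=100, lam=2.0**10, rng=np.random.default_rng(0)):
    """Regularized ELM: X is (N, n), T is the (N, m) target matrix."""
    N, n = X.shape
    U = rng.uniform(-1, 1, size=(L, n))          # random input weights (fixed)
    b = rng.uniform(-1, 1, size=L)               # random hidden biases (fixed)
    Phi = 1.0 / (1.0 + np.exp(-(X @ U.T + b)))   # sigmoid hidden layer output
    if N > L:    # beta = (I/lam + Phi^T Phi)^(-1) Phi^T T
        beta = np.linalg.solve(np.eye(L) / lam + Phi.T @ Phi, Phi.T @ T)
    else:        # small-sample case (5): beta = Phi^T (I/lam + Phi Phi^T)^(-1) T
        beta = Phi.T @ np.linalg.solve(np.eye(N) / lam + Phi @ Phi.T, T)
    return U, b, beta

def predict_elm(X, U, b, beta):
    Phi = 1.0 / (1.0 + np.exp(-(X @ U.T + b)))
    return np.argmax(Phi @ beta, axis=1)         # index of the largest output
```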

Weighted extreme learning machine
WELM (Zong et al. 2013) was proposed for handling imbalanced classification problems effectively. In WELM, each training sample is allocated a weight, and the weight associated with the minority class samples is relatively larger than that of the majority class samples. Thus, the impact of the minority class is strengthened, whereas the relative impact of the majority class is diminished. The two weighting schemes empirically used to compute the weight matrix, W, utilizing the class distribution are given below.

Weighting scheme W1: W_ii = 1/q_k for a sample x_i belonging to the kth class. Here, q_k is the total number of samples belonging to the kth class.

Weighting scheme W2: W_ii = 0.618/q_k if q_k > q_avg, and W_ii = 1/q_k otherwise. Here, q_avg represents the average number of samples per class. The weight W_ii is assigned to the ith sample. The mathematical model of WELM (Zong et al. 2013) can be described as

Minimize: L_WELM = (1/2)||β||^2 + (λ/2) Σ_{i=1}^{N} W_ii ||ξ_i||^2
Subject to: φ(x_i) β = t_i^T − ξ_i^T, i = 1, ..., N    (11)

Here, a diagonal matrix W = diag(W_ii) is associated to allocate a weight to each training sample x_i. Based on the Karush-Kuhn-Tucker (KKT) theorem, the solution of (11) is delineated below:

β = Φ^T (I/λ + W Φ Φ^T)^(-1) W T
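A minimal sketch of the two weighting schemes and the small-sample WELM solution, assuming the reconstructed forms above (the helper names are ours):

```python
import numpy as np

def welm_weights(y, scheme="W1"):
    """Per-sample weights W_ii of WELM from the class labels y."""
    classes, counts = np.unique(y, return_counts=True)
    q = dict(zip(classes, counts))
    q_avg = counts.mean()
    if scheme == "W1":
        return np.array([1.0 / q[c] for c in y])                  # W_ii = 1/q_k
    # W2: golden-ratio discount for classes above the average size
    return np.array([0.618 / q[c] if q[c] > q_avg else 1.0 / q[c] for c in y])

def train_welm(Phi, T, w, lam=2.0**10):
    """beta = Phi^T (I/lam + W Phi Phi^T)^(-1) W T without forming diag(w)."""
    N = Phi.shape[0]
    WPhiPhiT = (w[:, None] * Phi) @ Phi.T        # W Phi Phi^T, with W diagonal
    return Phi.T @ np.linalg.solve(np.eye(N) / lam + WPhiPhiT, w[:, None] * T)
```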
The solution for the kernelized WELM (KWELM) of (11), described in Zong et al. (2013), can be rewritten as follows:

f(x) = [K(x, x_1), ..., K(x, x_N)] (I/λ + W Ω)^(-1) W T    (17)
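A corresponding sketch of the KWELM decision function with the Gaussian kernel, assuming the form K(x_i, x_j) = exp(−||x_i − x_j||^2/σ) used later in this paper (the helper names are ours):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=2.0**2):
    """K_ij = exp(-||a_i - b_j||^2 / sigma) for row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma)

def kwelm_predict(X_train, T, w, X_test, lam=2.0**10, sigma=2.0**2):
    """f(x) = K(x, X) (I/lam + W Omega)^(-1) W T with diagonal W = diag(w)."""
    N = len(X_train)
    Omega = gaussian_kernel(X_train, X_train, sigma)
    alpha = np.linalg.solve(np.eye(N) / lam + w[:, None] * Omega, w[:, None] * T)
    return np.argmax(gaussian_kernel(X_test, X_train, sigma) @ alpha, axis=1)
```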

WELM based on overall distribution (ODW-ELM)
ODW-ELM (Luo et al. 2018) minimizes the weighted least squares error to handle the class imbalance problem. ODW-ELM uses information theory-based weighting schemes, which are shown in Luo et al. (2018) to be more effective than the weighting schemes of WELM. ODW-ELM assigns a weight to each class instead of assigning weights to all the training samples, and it therefore does not need a weight matrix, which makes it computationally more efficient than WELM. In the weighting scheme of ODW-ELM, given in Eq. (14), a weight r_k is assigned to each of the m classes, where r_k is computed from p_k, the probability corresponding to the kth class.

The problem formulation of ODW-ELM is reproduced below:

Minimize: L_ODW-ELM = (1/2)||β||^2 + (λ/2) Σ_{k=1}^{m} r_k ||ξ_k||^2
Subject to: Φ_k β = T_k − ξ_k, k = 1, ..., m

Here, ξ_k represents the error vector of the kth class, and Φ_k and T_k are the hidden layer output and the target matrix of the kth class, respectively. The solution of the above problem is determined in Luo et al. (2018) as Eq. (16). Using Eqs. (6) and (16), the kernelized ODW-ELM (ODW-KELM) is obtained.

Proposed work

Class-specific ELM based on overall distribution (OD-CSELM)
The size of the weight matrix of WELM has a significant effect on its computational complexity. The proposed algorithm assigns weights to the classes instead of assigning weights to the individual training samples, so the proposed OD-CSELM is significantly faster than WELM and handles larger class imbalance problems more effectively. The mathematical formulation of OD-CSELM is given below:

Minimize: L_OD-CSELM = (1/2)||β||^2 + (λ r+/2)||ξ+||^2 + (λ r−/2)||ξ−||^2
Subject to: Φ+ β = T+ − ξ+ and Φ− β = T− − ξ−    (18)

The Lagrangian function of (18) is given below:

L_D ELM = (1/2)||β||^2 + (λ r+/2)||ξ+||^2 + (λ r−/2)||ξ−||^2 − α+^T (Φ+ β − T+ + ξ+) − α−^T (Φ− β − T− + ξ−)
Here, L_D ELM represents the dual optimization problem. N+ represents the number of samples belonging to the minority class, and N− represents the number of samples belonging to the majority class. The proposed work weights the classes as per the weighting approach of ODW-ELM mentioned in Eq. (14); it uses the class-specific weights r+ and r−, which are computed from the class probabilities as given in Eq. (22). Here, p+ is the probability of the majority class and p− is the probability of the minority class, and α_i is the Lagrangian coefficient for the equality constraint corresponding to the sample x_i. Φ+ and Φ− are the outputs of the hidden layer corresponding to the minority class and the majority class training samples, respectively. The vectors α+ and α− are the Lagrangian coefficients for the equality constraints corresponding to the minority class and the majority class training samples, respectively. The vectors T+ and T− are the actual outputs of the minority class and the majority class training samples, respectively.
The solution of the above dual optimization problem is obtained by setting the partial derivatives of L_D ELM with respect to the variables (β, ξ+, ξ−, α+, α−) to zero, which yields Eqs. (31)-(33) together with the equality constraints:

∂L/∂β = 0 ⟹ β = Φ+^T α+ + Φ−^T α−
∂L/∂ξ+ = 0 ⟹ α+ = λ r+ ξ+
∂L/∂ξ− = 0 ⟹ α− = λ r− ξ−
∂L/∂α+ = 0 ⟹ Φ+ β = T+ − ξ+
∂L/∂α− = 0 ⟹ Φ− β = T− − ξ−

Substituting the error terms through Eqs. (37) and (38) into Eq. (36) determines the output weight β for the case N > L:

β = (I/λ + r+ Φ+^T Φ+ + r− Φ−^T Φ−)^(-1) (r+ Φ+^T T+ + r− Φ−^T T−)    (43)

For the small training sample case (N ≤ L), Eq. (43) can be rewritten by solving the KKT conditions for α+ and α− instead, giving the equivalent dual form. In both cases, the output function of OD-CSELM is f(x) = φ(x) β.

Algorithm 1: OD-CSELM
Output: OD-CSELM model for classification.
1: Initialize the class-specific weights r+ and r− using (22)
2: Randomly select the input weights U and the hidden biases b
3: Compute the hidden layer outputs Φ+ and Φ−
4: Determine the output weight β by employing (43)
5: Return β
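Algorithm 1 can be sketched in a few lines of Python. The closed form below is one consistent reading of the derivation in this section (the N > L case of Eq. (43)), not necessarily the exact published equation:

```python
import numpy as np

def train_od_cselm(Phi_pos, T_pos, Phi_neg, T_neg, r_pos, r_neg, lam=2.0**10):
    """beta = (I/lam + r+ Phi+^T Phi+ + r- Phi-^T Phi-)^(-1)
              (r+ Phi+^T T+ + r- Phi-^T T-)

    Phi_pos/Phi_neg: hidden layer outputs of the minority/majority samples;
    r_pos/r_neg: class-specific weights, with r_pos > r_neg.
    """
    L = Phi_pos.shape[1]
    A = (np.eye(L) / lam
         + r_pos * (Phi_pos.T @ Phi_pos)
         + r_neg * (Phi_neg.T @ Phi_neg))
    rhs = r_pos * (Phi_pos.T @ T_pos) + r_neg * (Phi_neg.T @ T_neg)
    return np.linalg.solve(A, rhs)
```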

Class-specific kernelized ELM based on overall distribution (OD-CSKELM)
The kernelized version of OD-CSELM, referred to as OD-CSKELM, is obtained by applying Mercer's condition and replacing Φ Φ^T with the kernel matrix Ω of Eq. (6); the output of OD-CSKELM is computed using Eq. (46). OD-CSKELM with the Gaussian kernel function maps the data from the input space to the feature space using K(x_i, x_j) = exp(−||x_i − x_j||^2/σ). Here, x_N+ and x_N− represent the training samples belonging to the positive and the negative class, respectively.
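By analogy with KWELM, the decision function of OD-CSKELM can be sketched as below, where only the two distinct weights r+ and r− replace WELM's full per-sample weight matrix; this analogy is our assumption, not the paper's exact Eq. (46):

```python
import numpy as np

def od_cskelm_predict(X, T, is_minority, r_pos, r_neg, X_test,
                      lam=2.0**10, sigma=2.0**2):
    """Kernel analogue of OD-CSELM: minority rows weighted by r_pos,
    majority rows by r_neg, with a Gaussian kernel matrix Omega."""
    r = np.where(is_minority, r_pos, r_neg)              # per-class weights
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    Omega = np.exp(-d2 / sigma)                          # Gaussian kernel matrix
    alpha = np.linalg.solve(np.eye(len(X)) / lam + r[:, None] * Omega,
                            r[:, None] * T)
    d2_test = ((X_test[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.argmax(np.exp(-d2_test / sigma) @ alpha, axis=1)
```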

Computational cost analysis
The following subsections analyze the computational complexities of ELM, KELM, WELM, KWELM, and the proposed work.

Computational cost of ELM
The computational cost of determining the output weight matrix, β, of ELM is derived in Raghuwanshi and Shukla (2018a). Computing the hidden layer output Φ has a complexity of O(nLN), the product Φ^T Φ costs O(L^2 N), its inverse costs O(L^3), and the product Φ^T T costs O(mLN), so the overall complexity is O(L^3 + L^2 N + mLN + nLN).

Computational cost of KELM
The kernel matrix, Ω, is determined with a computational complexity of O(nN^2). The computational complexity of computing the inverse of the kernel matrix of size N × N is O(N^3). The computational complexity of the output weight matrix, β, given in (5) is therefore O(N^3 + nN^2 + mN^2).

Computational cost of WELM
The computational cost of determining the output weight matrix, β, of WELM is derived in Raghuwanshi and Shukla (2018a). In addition to the operations required by ELM, WELM multiplies by the N × N weight matrix W, which adds O(LN^2), so the overall complexity is O(L^3 + L^2 N + LN^2 + mLN + nLN).

Computational cost of KWELM
The kernel matrix, Ω, is determined with a computational complexity of O(nN^2). The matrix W Ω is determined with a computational complexity of O(N^3 + nN^2). The computational complexity of computing the inverse of the kernel matrix of size N × N is O(N^3). The computational complexity of the output weight matrix, β, given in (17) is therefore O(N^3 + nN^2 + mN^2).
For large training sets, the time taken to train the WELM classifier increases significantly due to the multiplication by the weight matrix W of size N × N.

Computational cost of OD-CSELM
The matrix multiplications Φ+^T Φ+ and Φ+^T T+ have computational complexities of O(L^2 N+) and O(mLN+), respectively, and the corresponding majority class terms cost O(L^2 N−) and O(mLN−). The computational cost of the output weight matrix, β, derived in (43) is therefore O(L^3 + L^2 N + mLN + nLN), which avoids the O(LN^2) weight-matrix multiplication required by WELM.

Computational cost of OD-CSKELM
The complexities for evaluating the class-wise kernel terms sum to O(nN^2). The computational cost of computing the inverse of the kernel matrix of size N × N is O(N^3). The computational cost of the output weight matrix, β, given in (46) is therefore O(N^3 + nN^2 + mN^2); unlike KWELM, no multiplication by an N × N weight matrix is required.

Experimental setting and result evaluation

Dataset description
The experiments were performed to evaluate the proposed classifiers using 31 datasets obtained from online repositories, including the UCI machine learning repository (Lichman 2013) and the KEEL data repository (Alcalá et al. 2010), which are available in the fivefold cross-validation format. The imbalance ratio (IR) varies over the datasets and is calculated as

IR = (#majority class samples) / (#minority class samples)

Here, # stands for "number of". A brief description of the datasets used is listed in Table 1. Normalization is done to map the attribute values into the range [−1, 1], which can be expressed as follows:

x' = 2 (x − min_n) / (max_n − min_n) − 1

Here, x is the original attribute value, x' is the normalized attribute value, max_n is the maximum value of feature n, and min_n is the minimum value of feature n.
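For reference, the IR and the [−1, 1] normalization above translate directly into code (a straightforward sketch; the function names are ours):

```python
import numpy as np

def imbalance_ratio(y, minority_label=1):
    """IR = (#majority class samples) / (#minority class samples)."""
    n_min = int(np.sum(y == minority_label))
    return (len(y) - n_min) / n_min

def minmax_scale(X):
    """Map every feature n to [-1, 1]: x' = 2 (x - min_n)/(max_n - min_n) - 1."""
    mn, mx = X.min(axis=0), X.max(axis=0)
    return 2.0 * (X - mn) / (mx - mn) - 1.0
```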

Parameter settings
The proposed OD-CSELM employs hidden neurons with the sigmoid node, and OD-CSKELM maps the input data to the kernel space using the Gaussian kernel function. This work reports the optimal results obtained for OD-CSKELM by performing a grid search for the regularization parameter, λ, on {2^-18, 2^-16, ..., 2^48, 2^50} and the kernel parameter, σ, on {2^-18, 2^-16, ..., 2^18, 2^20}. The proposed OD-CSELM with the sigmoid node uses random weights between the input and the hidden layer, so its performance fluctuates; therefore, this work reports the average performance of this classifier over 10 independent trials. This work reports the optimal results obtained for OD-CSELM with the sigmoid node by performing a grid search for the regularization parameter, λ, on {2^-18, 2^-16, ..., 2^48, 2^50} and the number of hidden neurons, L, on {10, 20, ..., 990, 1000}.
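The grid search can be sketched as follows; evaluate_gmean is a hypothetical placeholder standing in for fivefold cross-validated training and scoring of OD-CSKELM:

```python
from itertools import product

def evaluate_gmean(lam, sigma):
    """Placeholder: train OD-CSKELM with fivefold CV, return the mean G-mean."""
    return 0.0  # replace with actual cross-validation

lambdas = [2.0**p for p in range(-18, 52, 2)]   # 2^-18, 2^-16, ..., 2^48, 2^50
sigmas  = [2.0**p for p in range(-18, 22, 2)]   # 2^-18, 2^-16, ..., 2^18, 2^20
best_lam, best_sigma = max(product(lambdas, sigmas),
                           key=lambda ps: evaluate_gmean(*ps))
```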

Evaluation of imbalanced datasets
The assessment metrics traditionally employed to measure classifier performance are given below; not all of them are suitable when the dataset is imbalanced.
Overall prediction accuracy = (TP + TN) / (TP + TN + FP + FN)

Here, TP is the number of correctly classified minority samples, TN is the number of correctly classified majority samples, FP is the number of falsely classified majority samples, and FN is the number of falsely classified minority samples.

G-mean = sqrt(R_TP × (1 − R_FP))

Here, R_TP = TP/(TP + FN) is the true-positive rate and R_FP = FP/(FP + TN) is the false-positive rate.
Overall prediction accuracy is not an appropriate metric for imbalanced learning. For example, consider a dataset that has 92 samples belonging to the majority class and 8 samples belonging to the minority class. A classifier that predicts all the samples as majority class samples will have 92% prediction accuracy. The G-mean metric is better than the overall prediction accuracy metric when the class distribution is not uniform; for the aforementioned example, G-mean will be equal to zero. The receiver operating characteristic (ROC) graph (Bradley 1997) assesses the algorithm performance by changing the threshold value for the algorithm score to get distinct values of R_TP and R_FP. In the ROC graph, the X-axis is R_FP and the Y-axis is R_TP. The area under the curve (AUC) (Fawcett 2003; Huang and Ling 2005) can be employed to compute the performance of the method. The ideal point on the ROC graph is (0, 1), for which all the minority samples are classified correctly and no majority samples are misclassified as the minority. The bigger the AUC, the better the generalization performance of the method. If the method is of the hard type, i.e., its predicted outcome is discrete class labels, then AUC can be defined as

AUC = (1 + R_TP − R_FP) / 2
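The metrics above translate directly into code; the following sketch (ours) computes the G-mean and the hard-classifier AUC from the confusion counts:

```python
import numpy as np

def imbalance_metrics(y_true, y_pred, minority=1):
    """Return (G-mean, AUC) for discrete (hard) predictions."""
    tp = np.sum((y_true == minority) & (y_pred == minority))
    fn = np.sum((y_true == minority) & (y_pred != minority))
    tn = np.sum((y_true != minority) & (y_pred != minority))
    fp = np.sum((y_true != minority) & (y_pred == minority))
    r_tp = tp / (tp + fn)                    # true-positive rate
    r_fp = fp / (fp + tn)                    # false-positive rate
    gmean = np.sqrt(r_tp * (1.0 - r_fp))     # sqrt(sensitivity * specificity)
    auc = (1.0 + r_tp - r_fp) / 2.0          # AUC for a hard classifier
    return gmean, auc
```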

Experiment results
The effectiveness of the proposed methods was investigated in terms of G-mean and AUC; the results are listed in Tables 2 and 3. The experimental results for ELM and WELM are taken from Zong et al. (2013). The experimental results of weighted kernel-based SMOTE (WKSMOTE) (Mathew et al. 2018) in terms of G-mean for some of the datasets are taken from Mathew et al. (2018), and the results for the remaining datasets are obtained by experimentation. The experimental results of instance weighted SMOTE (IW-SMOTE) (Zhang et al. 2022) in terms of G-mean and AUC for some of the datasets are taken from Zhang et al. (2022), and the results for the remaining datasets are obtained by experimentation; this work uses the code made available online by Zhang et al. (2022) for evaluating IW-SMOTE. It can be observed from Tables 2 and 3 that OD-CSKELM outperforms the rest of the methods. The order of effectiveness of the compared methods based on AUC can be observed from Table 3 as: ELM < KELM < ODW-KELM < ODW-ELM < IW-SMOTE < WKSMOTE < WELM < OD-CSELM < OD-CSKELM. A similar ordering based on G-mean can be observed from Table 2. According to these observations, the performance of our method is higher than that of the other state-of-the-art methods, so we conclude that the proposed method is efficient for imbalanced classification problems. Regarding model implementation, our proposal was designed in Matlab, and the codes are available at https://drive.google.com/drive/folders/1cUaK54ijImrCkHQcvOohzaMOVK5d-3Im?usp=sharing.

The statistical test analysis
For further comparison of classifier performance, this paper utilizes the Wilcoxon signed-rank test, as recommended by Demšar (2006). The Wilcoxon signed-rank test is a nonparametric statistical test for comparing two paired, or related, samples (Demšar 2006); this paper uses it to compare the accuracy of the algorithms in consideration. The test calculates d_i, the difference between the performance scores of the two compared classifiers on the ith of N datasets. The next step is to rank the d_i in ascending order of absolute value, with mean ranks allocated in case of ties. All positive and negative d_i are summed separately and reported as R+ and R−, respectively. Here, R+ is the sum of ranks for the datasets on which the first method performs better than the second, and R− is the sum of ranks for the opposite case. If the difference between R+ and R− is sufficiently large, the null hypothesis, which assumes that no difference exists between the two classifiers, is rejected. The Wilcoxon signed-rank test can also compare the obtained p-value with a given confidence level α to decide whether the null hypothesis should be rejected. The results of the test are shown in Tables 4 and 5. The significance level for this test is fixed at 0.05; if the p-value is lower than 0.05, there is a significant difference between the two algorithms, and the smaller the p-value, the more statistically significant the difference. The mean ranking of each method is additionally computed by utilizing the Friedman aligned-ranks test (Demšar 2006) for the G-mean and AUC scores of the methods in consideration. The proposed OD-CSKELM is elected as the control method, as it attains the highest mean ranks in the Friedman aligned-ranks test, as shown in Fig. 1a and b. It can be observed in Fig. 1 that OD-CSKELM outperforms the other state-of-the-art methods. The Friedman test (Demšar 2006) determines the rank of each algorithm on the N considered problems. Let S_j be the mean rank of the jth method among a set of p methods; under the null hypothesis, all p methods perform equally well. The Friedman statistic is calculated as follows:

χ_F^2 = (12N / (p(p + 1))) [Σ_{j=1}^{p} S_j^2 − p(p + 1)^2/4]

with the corresponding F-statistic

F_F = ((N − 1) χ_F^2) / (N(p − 1) − χ_F^2)

Here, S_j is the mean rank of each of the p algorithms over the N problems. If the null hypothesis is rejected, the post hoc Bonferroni-Dunn test (Demšar 2006) is carried out to verify the pair-wise differences between the compared methods. The G-mean of the 10 algorithms on the 31 binary class problems is given in Table 2. The F-distribution has (10 − 1, (10 − 1) × (31 − 1)) = (9, 270) degrees of freedom, and at the level of significance α = 0.05 the critical value of F(9, 270) is 1.88. Because F_F = 30.09 > 1.88, we reject the null hypothesis.
The differences between the ranks of ELM, WELM, ODW-ELM, KELM, KWELM, ODW-KELM, WKSMOTE, and IW-SMOTE with respect to OD-CSKELM are greater than the critical difference of 0.8274. So, it can be concluded from the statistical test that OD-CSKELM outperforms ELM, WELM, ODW-ELM, KELM, KWELM, ODW-KELM, WKSMOTE, and IW-SMOTE. The difference in the mean rank of OD-CSELM is lower than 0.8274, which shows that there is no significant difference between OD-CSELM and OD-CSKELM. The CD diagrams along with the mean ranks of these algorithms in terms of G-mean and AUC are illustrated in Fig. 3, which shows the Bonferroni-Dunn test at α = 0.05. On the horizontal axis, the mean ranks of the compared methods are shown in increasing order, and groups of algorithms that are not significantly different at α = 0.05 are connected. The proposed method OD-CSKELM achieves the best rank and significantly outperforms the other tested algorithms in terms of G-mean and AUC. Moreover, the box plots in Fig. 2a and b show that OD-CSKELM has a more compact box than the other tested methods; OD-CSKELM is considered the best among all compared algorithms since it has the highest median for all performance indexes. The mean training times of ELM, WELM, ODW-ELM, OD-CSELM, KELM, KWELM, ODW-KELM, WKSMOTE, and OD-CSKELM are reported in Table 6. The proposed OD-CSELM and OD-CSKELM take less training time than WELM and KWELM, respectively, as can be observed from Table 6 and Fig. 4.
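For reproducing the testing procedure itself (not the paper's numbers), both tests are available in SciPy; the scores below are illustrative stand-ins:

```python
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

# Illustrative per-dataset G-mean scores (random stand-ins, one per dataset)
rng = np.random.default_rng(1)
scores = {name: rng.uniform(0.7, 0.95, size=31)
          for name in ["ELM", "WELM", "OD-CSELM", "OD-CSKELM"]}

# Pairwise Wilcoxon signed-rank test at alpha = 0.05
stat, p = wilcoxon(scores["OD-CSKELM"], scores["WELM"])
print(f"Wilcoxon p = {p:.4f} (reject the null hypothesis if p < 0.05)")

# Friedman test across all compared methods
chi2, p_f = friedmanchisquare(*scores.values())
print(f"Friedman p = {p_f:.4f}")
```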

Conclusion
The conclusions of the proposed work are given below.
1. Imbalanced classification problems are widely present in real-world applications. However, most traditional classification algorithms are inclined toward the majority class, as they were originally designed to address balanced classification problems. This work proposes and evaluates OD-CSELM and the kernelized version of OD-CSELM to address the class imbalance problem more effectively.
2. OD-CSELM and OD-CSKELM assign a weight to each class instead of assigning weights to all the training samples to handle the class imbalance problem.
3. The proposed OD-CSELM and OD-CSKELM have considerably reduced computational cost in contrast to WELM and KWELM, respectively.
4. The proposed algorithms are assessed using benchmark real-world imbalanced datasets downloaded from the KEEL dataset repository. The experimental results indicate the superiority of the proposed work in contrast to the rest of the classifiers for most of the assessed datasets.
5. The advantage of OD-CSELM and OD-CSKELM is also demonstrated by the Wilcoxon signed-rank test and the Friedman test with the post hoc Bonferroni-Dunn test.
Funding The authors have not disclosed any funding.
Data Availability Enquiries about data availability should be directed to the authors.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Informed Consent Informed consent was obtained from all individual participants included in the study.

Human and Animal Rights
This article does not contain any studies with human participants or animals performed by any of the authors.