Development and evaluation of M + 1-way classification mechanism realized through identifying foreign patterns

In this study, we are concerned with a new design methodology for an M + 1-way classification mechanism. The intent is to reduce the cost of erroneous predictions caused by insufficient evidence. The study is motivated by the notion of three-way decisions, which has been successfully used in various application areas to build human-centric systems. In contrast to traditional multi-class classification, one additional class is added to the proposed architecture to represent the reject decision made on foreign patterns, which exhibit significant differences compared to the patterns used for constructing the classification models. A collection of information granules is constructed on the basis of available experimental evidence to form a compact and interpretable representation of the feature space occupied by the native patterns. The patterns located outside the regions occupied by these information granules are identified and filtered out prior to classification, so that only the native patterns are subject to classification. The proposed methodology leads to a human-centric and human-interactive construct in which the rejected patterns are left for further processing. Different distance functions are utilized in the construction of information granules. The performance of the proposed architecture is evaluated on one synthetic dataset and a collection of publicly available datasets.


Introduction
With the continued rapid growth of data generated in various fields by information technology, classification has become a key step in data analysis and data mining tasks. It is no surprise that the field of pattern recognition is thriving and has become an integral part of data science. A broad spectrum of methodologies, algorithms, and applications in this field has been reported in the literature (Shahmoradi and Bagheri Shouraki 2018; Zhu et al. 2017a; Al Sayaydeh et al. 2019; Wang et al. 2020).
In traditional classification applications, once classifiers are built through knowledge accumulation or statistical information extraction, an appropriate class (category) is assigned to previously unseen instances with unknown labels. One specific label is predicted by the classification algorithm for each input instance. However, in some circumstances, this strategy could be problematic since the decision is made without adequate information or is associated with a low confidence level. Suppose that an English-based speech recognition model has been trained to perform speech signal classification with the aim of transforming human speech into a written format. It is apparent that wrong decisions would be made when Arabic speech signals are input into the speech model. Instead of trying to predict the actual classes of the speech signals, a more natural alternative is to reject the non-English speech signals. The issue with classification with rejection is that we have only partial knowledge about the instances used for building the classification model, not knowing the characteristics of other instances. To distinguish these two kinds of instances existing in the overall feature space, we refer to the instances whose labels are known before the classifiers are constructed as native patterns (Homend et al. 2017; Homend and Pedrycz 2018) while all the other patterns that are located outside the feature space occupied by the native patterns are called foreign patterns. Classifiers are usually built on the basis of subsets of native patterns. As human-centricity has become an essential characteristic of various models, a good classifier should be able to accurately classify native patterns and at the same time effectively reject foreign patterns. Human-centric models are of great relevance in the realization of interpretable artificial intelligence (Hudec et al. 2021; Holzinger et al. 2022). The extension of the traditional binary classification mechanism through classifying the instances into yes, maybe and no classes contributes to the enhancement of explainability of machine learning models (Hudec et al. 2021). The responsibility of determining the classes of the foreign patterns is assigned to the users. Such pursuits could address the limitations of traditional classifiers and build an algorithmic environment supporting human-centric processing. Foreign patterns could be distinguished and rejected prior to or after classifying native patterns. Classification with rejection is performed with the objective of maximizing the classification accuracy of nonrejected instances while minimizing the percentage of rejected instances. The framework of classification with rejection has been used in a wide range of fields and disciplines, such as software defect prediction (Mesquita et al. 2019), text categorization (Fumera et al. 2003), image processing (Giacinto et al. 2000; Lin et al. 2018), and rule selection (Capitaine and Frelicot 2012). The effectiveness of three visualization spaces, namely the ROC space, the error-rejection space, and the cost-reject space, in supporting classification with a reject option is evaluated and discussed in Hanczar (2019). To quantify the performance of the rejection framework in classification, a set of performance indices is proposed in Condessa et al. (2017) to evaluate the quality of classified and rejected instances and determine a trade-off between rejection and misclassification.

Author contact: Xiubin Zhu (xbzhu@mail.xidian.edu.cn), Huimin Zhang (cchuiminzhang@outlook.com)
The idea of classification with rejection is also closely related to the notion of three-way decisions proposed by Yao (2010) and Fujita et al. (2016), since the process of classification is almost identical to decision making and classification could also be viewed as a special kind of decision problem. In order to solve the dilemma encountered in traditional two-way decisions, where one definite decision should be made for each case even in conditions of uncertainty, the three-way decision strategy adds one additional option, namely non-commitment, which neither accepts nor rejects a decision. In this way, the determination of an immediate decision is deferred until sufficient evidence has been accumulated or further investigation has been made. As the upper and lower bounds of rough sets divide the universe into three pairwise disjoint regions (positive, boundary and negative regions), rough sets are used as a suitable vehicle supporting three-way decisions (Liang et al. 2015, 2017). The utilization of three-way decisions in classification tasks has been widely studied in the literature. An active learning mechanism is employed to realize a tri-partition of the feature space in Min et al. (2019) such that different strategies are applied to instances coming from different regions. The evaluation of the effectiveness of the three-way strategy in classification problems through the classification quality and decision cost measures is discussed in Liu (2021). The combination of the principle of justifiable granularity with the theory of rough sets for solving classification problems has been studied in Ju et al. (2019). Several other studies of three-way classification models realized through the use of specific objective functions, confusion matrices, etc., also offer new insight into classification problems (Zhang and Yao 2017; Xu et al. 2020). Shadowed sets, which are isomorphic with a three-way logic, provide another interesting mechanism for realizing three-way decision making (Pedrycz 1998; Zhao and Yao 2019; Yue et al. 2020).
Three-way classification methods have become a popular technique for building efficient classification vehicles and constructing human-centric computing environments. Most existing methods decide whether to accept or reject an instance as a member of a certain class after the classification process. However, research on the identification of foreign and native patterns prior to classification is still lacking in the literature, and relatively little research has been devoted to the characterization of geometrical spaces for native and foreign patterns (Homend et al. 2017). The Fuzzy C-Means (FCM) clustering method is used as a vehicle to reveal the structure in the data in Homend et al. (2017).
The intent of this study is to establish an algorithmic environment that supports the characterization of feature spaces occupied by native patterns in the framework of Granular Computing (GrC) (Yao et al. 2013; Zhu et al. 2020). GrC has become a human-centric methodological and developmental environment and has gained much attention (Zhu et al. 2018, 2021). Information granules, which are formed on the basis of available experimental evidence, could be expressed in terms of diversified formalisms (such as fuzzy sets, rough sets, and shadowed sets). These formed granular descriptors provide a comprehensive description of data and could be used as a vehicle to complete classification, prediction, and regression tasks (Zhu et al. 2017b). As information granules are formed in the feature space of numeric data and characterized by the prototypes, size, and some other auxiliary parameters, we revisit the fundamental steps of constructing information granules, and make necessary modifications and improvements such that these granular descriptors could form a concise and compact characterization of the target feature space. This study attempts to investigate how information granules could characterize the regions occupied by native patterns such that foreign patterns could be accurately distinguished.
In this study, we elaborate on the construction of information granules through fuzzy clustering with the aim of forming an explicit description of the complex geometry of the regions of the native patterns. The novelties of this study are mainly reflected in the following aspects. First, as the characterization of a certain region of the feature space through a collection of information granules has not been studied, it is worth investigating how well the regions of native and foreign patterns are covered by the formed information granules. Second, information granules are constructed by invoking the principle of justifiable granularity, which takes into account both the coverage and specificity measures to quantify the representation quality of the granular descriptors. Third, core information granules and residual information granules are constructed in a cooperative manner to precisely characterize the spaces occupied by native patterns.
This study is organized as follows: The problem of classification with reject option is formulated in Sect. 2. Section 3 briefly reviews the Fuzzy C-Means algorithm while the construction of information granules through adhering to the principle of justifiable granularity is presented in Sect. 4. A series of experiments is reported in Sect. 5 and conclusions are offered in Sect. 6.

Problem formulation
Suppose that we are given a dataset D composed of N instances, $D = \{(x_i, y_i)\}$, $i = 1, 2, \ldots, N$, where $x_i$ represents the attributes of the i-th instance and $y_i \in \{L_1, L_2, \ldots, L_M\}$ denotes the associated class label of the i-th instance. The values of the attributes could be numeric or categorical. Since the labels of these instances are known, the instances available in the dataset D are referred to as native patterns. A classification model T is constructed on the basis of available native patterns and used to predict the target label of an n-dimensional unseen instance. The output of the classification model could be one of the M labels in $\{L_1, L_2, \ldots, L_M\}$. The foreign patterns are identified before invoking the standard classification model T and an additional label $L_{M+1}$ representing the non-commitment decision is assigned to the potential foreign patterns. Instances that are not labeled as foreign patterns are processed and classified by the classification model T in the same way as in traditional machine learning practices. The overall process is illustrated in Fig. 1.
The characterization of the feature space occupied by native patterns is realized through constructing a collection of information granules on the basis of the available native patterns. The aim is to construct a collection of compact information granules to cover the available native patterns as much as possible while maintaining the homogeneity of these granular constructs. The construction of information granules is completed through the guidance of certain performance indices. The union of volumes occupied by these granular descriptors could be regarded as a concise description of the space of native patterns. Instances that fall inside the boundaries of these information granules are treated as native ones and are subject to classification while instances that are located outside the union space are rejected and marked as $L_{M+1}$.
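The decision flow described above can be illustrated with a minimal Python sketch. It assumes spherical granules stored as (prototype, radius) pairs and Euclidean distance; the names `REJECT`, `inside_any_granule`, `predict_with_reject` and the toy threshold classifier are hypothetical and serve illustration only.

```python
import math

REJECT = "L_M+1"  # hypothetical name for the additional reject label


def inside_any_granule(x, granules):
    """Return True if x lies within at least one spherical granule.

    Each granule is a (prototype, radius) pair; Euclidean distance is assumed.
    """
    return any(math.dist(x, proto) <= radius for proto, radius in granules)


def predict_with_reject(x, granules, classifier):
    """M + 1-way decision: reject foreign patterns, classify native ones."""
    if not inside_any_granule(x, granules):
        return REJECT        # foreign pattern: the decision is left to the user
    return classifier(x)     # native pattern: standard M-way classification


# Toy usage: two granules and a made-up threshold classifier.
granules = [((0.2, 0.2), 0.15), ((0.7, 0.6), 0.20)]
classifier = lambda x: "L_1" if x[0] < 0.5 else "L_2"

print(predict_with_reject((0.25, 0.25), granules, classifier))  # inside the first granule
print(predict_with_reject((0.95, 0.05), granules, classifier))  # outside both granules
```

The membership test runs before the classifier is ever called, which mirrors the idea that foreign patterns are filtered out prior to classification.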
Keeping this objective in mind, we propose a two-phase construction of information granules to form a concise description of the native feature space in this study. The visualization of the two-stage development process is shown in Fig. 2. The native patterns are granulated by running a fuzzy clustering algorithm. Subsequently, a collection of clusters that are characterized by a partition matrix and their corresponding prototypes is produced. Next, information granules centered around these prototypes are built to characterize the data structure and the geometry of the native feature space. The number of clusters specified when running the clustering algorithm and the size of each information granule are subject to optimization such that the spaces occupied by the formed information granules closely fit the regions of native patterns. The optimization process is guided by two essential criteria, one of which concerns the coverage of the native patterns while the other concerns the specificity of the formed information granules.

Clustering algorithm: fuzzy C-means
Clustering algorithms play a significant role in discovering the intrinsic structure of available experimental evidence. The results returned by the clustering algorithm are information granules that are described by their prototypes. We briefly review the employed fuzzy clustering algorithm, namely Fuzzy C-Means (FCM), in this section. Due to the Euclidean distance used in the clustering algorithm, spherical information granules are constructed. The construction of information granules is conducted in an unsupervised mode. All the native patterns are treated equally no matter which class they belong to. The objective of this study is to produce a compact description of the feature space occupied by native patterns using well-constructed information granules. To accomplish this goal, clustering algorithms are employed to reveal the underlying structure of patterns and provide the foundation on which more descriptive information granules could be formed.
By forming a collection of clusters, a concise representation of the space occupied by patterns coming from each class could be produced. As one of the most commonly used mechanisms for realizing fuzzy clustering, the FCM algorithm has been effectively used in system modeling and knowledge representation (Zhu et al. 2017a).
The clustering process is performed in an iterative manner guided by the following performance index:

$$J = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, d^{2}(v_i, x_k)$$

where $c$ denotes the number of clusters specified in advance, $v_1, v_2, \ldots, v_c$ represent the prototypes of the clusters, and $u_{ik}$ stands for the membership degree of the k-th instance of the dataset D to the local prototype $v_i$. The fuzzification coefficient $m$ ($m > 1.0$) determines the shape of the membership functions of the clusters. The value $m = 2.0$ is the one most frequently used in the literature. In FCM, the distance $d(v_i, x_k)$ between the prototype $v_i$ and instance $x_k$ is assumed to be the Euclidean one, which is governed by the following expression:

$$d(v_i, x_k) = \sqrt{\sum_{p=1}^{n} (v_{ip} - x_{kp})^{2}}$$

where $v_{ip}$ and $x_{kp}$ denote the p-th coordinate of the corresponding prototype and instance. The FCM algorithm starts with randomly initialized values of the membership degrees $u_{ik}$, $i = 1, 2, \ldots, c$, $k = 1, 2, \ldots, N$, that are subject to the constraint $\sum_{i=1}^{c} u_{ik} = 1$. The prototypes are determined in the following way:

$$v_i = \frac{\sum_{k=1}^{N} u_{ik}^{m} x_k}{\sum_{k=1}^{N} u_{ik}^{m}}$$

Next, the membership degree $u_{ik}$ of the instance $x_k$ to the prototype $v_i$ is updated as follows:

$$u_{ik} = \frac{1}{\sum_{j=1}^{c} \left( \frac{d(v_i, x_k)}{d(v_j, x_k)} \right)^{2/(m-1)}}$$

Subsequently, the steps for calculating the prototypes and the matrix of membership degrees are repeated iteratively until the predefined number of iterations has been reached or the improvement in the performance index J is less than a preset minimum threshold.
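For readers who prefer code, the iteration just described can be sketched in plain Python (standard library only, to keep the example self-contained). This is an illustrative implementation of the standard FCM updates and not the authors' code; the function name `fcm`, the default tolerance, the seed, and the synthetic two-blob data are assumptions.

```python
import math
import random


def fcm(data, c, m=2.0, max_iter=100, tol=1e-6, seed=0):
    """Fuzzy C-Means on a list of equal-length numeric tuples.

    Returns (prototypes, u, J), where u[i][k] is the membership of
    instance k in cluster i and J is the final value of the objective.
    """
    rng = random.Random(seed)
    n, dim = len(data), len(data[0])

    # Random memberships, normalised so that each column sums to 1.
    u = [[rng.random() + 1e-9 for _ in range(n)] for _ in range(c)]
    for k in range(n):
        col = sum(u[i][k] for i in range(c))
        for i in range(c):
            u[i][k] /= col

    prev_j = math.inf
    for _ in range(max_iter):
        # Prototype update: weighted mean with weights u_ik^m.
        protos = []
        for i in range(c):
            w = [u[i][k] ** m for k in range(n)]
            total = sum(w)
            protos.append(tuple(
                sum(w[k] * data[k][p] for k in range(n)) / total
                for p in range(dim)))

        # Euclidean distances, floored to avoid division by zero.
        d = [[max(math.dist(protos[i], data[k]), 1e-12) for k in range(n)]
             for i in range(c)]

        # Membership update from distance ratios.
        expo = 2.0 / (m - 1.0)
        for k in range(n):
            for i in range(c):
                u[i][k] = 1.0 / sum((d[i][k] / d[j][k]) ** expo
                                    for j in range(c))

        # Objective J; stop when it changes by less than tol.
        j_val = sum(u[i][k] ** m * d[i][k] ** 2
                    for i in range(c) for k in range(n))
        if abs(prev_j - j_val) < tol:
            break
        prev_j = j_val
    return protos, u, j_val


# Toy usage on two well-separated synthetic blobs (made-up data).
rng = random.Random(1)
data = ([(rng.gauss(0.2, 0.02), rng.gauss(0.2, 0.02)) for _ in range(30)]
        + [(rng.gauss(0.8, 0.02), rng.gauss(0.8, 0.02)) for _ in range(30)])
protos, u, j_val = fcm(data, c=2)
print(sorted(protos), j_val)
```

The sketch stops on the change in J, as in the description above; a stopping rule based on the change of the partition matrix would be an equally reasonable choice.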

Construction of information granules
Once the clustering algorithm has completed, we obtain a collection of prototypes, around which a family of information granules could be built to form a concrete representation of the feature space occupied by the native patterns. We expect each information granule to cover as many patterns belonging to the current class as possible and at the same time to keep high specificity, which means that the information granule is justified by sufficient experimental evidence and comes with sound semantics. The coverage measure articulates how well the formed information granule represents the native feature space. The specificity limits the unrestrained increase of the volume of the information granule, since too large an information granule would occupy space not belonging to native patterns and hinder the interpretation of the granular descriptors (Pedrycz and Homenda 2013).

Construction of core information granules
Suppose that the clustering algorithm is performed on the dataset D and a family of prototypes $v_1, v_2, \ldots, v_{c_1}$ is obtained upon the completion of the clustering algorithm. Information granules $X_1, X_2, \ldots, X_{c_1}$ are formed around these prototypes. The optimization of the size of the information granule is guided by the principle of justifiable granularity, which concerns both the coverage and specificity of the formed information granule (Pedrycz and Homenda 2013).

Fig. 2 A two-phase construction of information granules
The coverage criterion is calculated as the number of instances belonging to the current class that are covered by the formed information granule versus the total number of instances in the current class. For an information granule $X_i$ whose size (radius) is equal to $\rho_i$, the coverage $\mathrm{cov}(X_i)$ is thus the fraction of these instances whose distance to the prototype $v_i$ does not exceed $\rho_i$. As the size of the information granule $X_i$ increases, more and more instances coming from the native dataset fall within the boundaries of the current information granule. However, if the size of the information granule is too large, it may cover too much space not occupied by the native patterns. This would result in false acceptance of foreign patterns in the subsequent classification process. Through constructing more specific information granules, well-defined semantics could be associated with the granular data and the characterization of the structure of the native patterns could become more concise and accurate. The specificity of the information granule, $\mathrm{sp}(X_i)$, is measured in terms of its size and decreases as the radius $\rho_i$ grows. Both the coverage and the specificity of the information granule could be changed by varying the size of the information granule. The larger the information granule is, the higher its coverage becomes. Higher values of the specificity are associated with more compact information granules. As the size of the information granule increases, more native instances are included and the specificity value becomes lower. Since we expect the formed information granules to cover as many instances as possible and at the same time to be specific with respect to their contents (indicated by high specificity values), these two criteria are in conflict. To accommodate these two conflicting demands, the coverage and specificity criteria are combined in a multiplicative format as follows:

$$Q(X_i) = \mathrm{cov}(X_i) \cdot \mathrm{sp}(X_i)^{\alpha}$$

The weight factor $\alpha$ ($\alpha > 0$) controls the degree of influence of the specificity on the produced information granule. The significance of the specificity is discounted when $\alpha < 1$, which means that less specific information granules are formed. When $\alpha$ is set to zero, the influence of the specificity criterion is completely eliminated. Higher values of $\alpha$ stress the importance of the specificity criterion in constructing information granules and produce more specific granular data.
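To make the interplay between coverage, specificity and the weight factor concrete, the following Python sketch sizes a single spherical granule. It rests on assumptions that are not taken from the text: specificity is modelled as the linear function 1 - rho / rho_max, where rho_max is the largest observed distance, and the index is taken as coverage times specificity raised to the power `alpha`. The function names and the list of distances are hypothetical.

```python
def coverage(dists, rho):
    """Fraction of patterns whose distance to the prototype is at most rho."""
    return sum(d <= rho for d in dists) / len(dists)


def specificity(rho, rho_max):
    """Assumed linear specificity: 1 for a point-sized granule, 0 at rho_max."""
    return max(0.0, 1.0 - rho / rho_max)


def optimal_radius(dists, alpha=1.0):
    """Maximise Q(rho) = cov(rho) * sp(rho) ** alpha over candidate radii.

    Coverage only changes at observed distances and specificity decreases
    with rho, so it suffices to try each observed distance as a radius.
    """
    rho_max = max(dists)
    if rho_max == 0.0:
        return 0.0, 1.0  # all patterns coincide with the prototype
    best_rho, best_q = 0.0, -1.0
    for rho in sorted(set(dists)):
        q = coverage(dists, rho) * specificity(rho, rho_max) ** alpha
        if q > best_q:
            best_rho, best_q = rho, q
    return best_rho, best_q


# Made-up distances of seven patterns to one prototype.
dists = [0.05, 0.08, 0.10, 0.12, 0.15, 0.40, 0.90]
for alpha in (0.0, 1.0, 8.0):
    print(alpha, optimal_radius(dists, alpha))
```

With these toy distances, a larger weight factor selects a smaller radius, which is the qualitative behaviour described above; other specificity functions would shift the exact optimum.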
The performance index $Q(X_i)$ is optimized with respect to the size of the information granule and the optimal size $\rho_{i,opt}$ is determined as the one that maximizes the performance index:

$$\rho_{i,opt} = \arg\max_{\rho_i} Q(X_i)$$

The number of clusters ($c_1$) specified when running the clustering algorithm and the value of the weight factor ($\alpha$) are subject to optimization. An increase in the number of clusters leads to a decrease of the performance index J. The optimal number of clusters is determined as the one beyond which the improvement in the performance index J becomes less visible. Once the number of information granules has been determined, the coverage $R_1$ of the overall set of instances by the current group of information granules serves as a good indicator for determining the weight factor.
One could start with a low value of the weight factor $\alpha$ and monitor the behavior of the performance index $R_1$. With the increase in the value of the weight factor, more specific information granules are formed and the overall coverage decreases. The optimal value of the weight factor is determined as the one that results in a significant decrease in the coverage criterion.
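The monitoring procedure above can be sketched as a simple sweep over the weight factor. The sketch below is illustrative only and makes several assumptions: each granule is sized on the patterns closest to its prototype, specificity is the linear function 1 - rho / rho_max, and the distance matrix `dist` is made-up data; `best_radius`, `overall_coverage` and `sweep_alpha` are hypothetical helper names.

```python
def best_radius(dists, alpha):
    """Radius maximising cov * sp ** alpha, with assumed sp = 1 - rho / max(dists)."""
    rho_max = max(dists) or 1.0
    n = len(dists)
    scored = ((sum(d <= rho for d in dists) / n
               * (1.0 - rho / rho_max) ** alpha, rho)
              for rho in set(dists))
    return max(scored)[1]


def overall_coverage(dist, alpha):
    """Share of patterns lying inside at least one optimised granule.

    dist[i][k] is the distance from prototype i to pattern k. Each granule
    is sized on the patterns closest to its prototype (an assumption).
    """
    c, n = len(dist), len(dist[0])
    owner = [min(range(c), key=lambda i: dist[i][k]) for k in range(n)]
    radii = []
    for i in range(c):
        own = [dist[i][k] for k in range(n) if owner[k] == i]
        radii.append(best_radius(own, alpha) if own else 0.0)
    covered = sum(any(dist[i][k] <= radii[i] for i in range(c))
                  for k in range(n))
    return covered / n


def sweep_alpha(dist, alphas):
    """Return [(alpha, coverage)] so that the drop in coverage can be inspected."""
    return [(a, overall_coverage(dist, a)) for a in alphas]


# Made-up distances of nine patterns to two prototypes.
dist = [[0.02, 0.0, 0.02, 0.05, 0.20, 0.58, 0.60, 0.63, 0.40],
        [0.62, 0.60, 0.58, 0.55, 0.40, 0.02, 0.0, 0.03, 0.20]]
for alpha, cov in sweep_alpha(dist, [0, 0.5, 1, 2, 4]):
    print(alpha, round(cov, 3))
```

Inspecting the printed pairs for the value at which coverage starts to fall noticeably corresponds to the selection rule described above; the threshold for a "significant" decrease remains a judgment call.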
In this way, a family of information granules is constructed on the basis of native patterns to form a dense and specific representation of the core spaces occupied by the native patterns. We refer to these information granules as core information granules as they concern the primary structure of the native patterns.

Formation of residual information granules
While the core information granules provide a compact and concise description of the structure of the native dataset, residual information granules are constructed on the basis of the native patterns that are not covered by the core information granules, such that a complete portrait of the native feature space is formed.
Denote the native patterns falling outside the boundaries of the core information granules as $D'$, where $D' = \{(x'_k, y'_k) \mid x'_k \in D \text{ and } x'_k \notin X_1 \cup X_2 \cup \cdots \cup X_{c_1}\}$. The construction of residual information granules follows the same general scheme as proposed in the previous section. The difference is that the formation of residual information granules is only related to the uncovered native patterns. The number of residual information granules $c_2$ is also chosen by observing its influence on the improvement of the performance index J. The weight factor $\beta$ specified when applying the principle of justifiable granularity is optimized such that the residual feature space is covered by compact residual information granules.
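Extracting the uncovered native patterns can be written in a few lines of Python. The sketch assumes spherical core granules given as (prototype, radius) pairs and Euclidean distance; `uncovered_patterns` and the toy data are hypothetical.

```python
import math


def uncovered_patterns(patterns, core_granules):
    """Return the native patterns lying outside every core granule.

    patterns: list of (x, y) pairs; core_granules: list of (prototype, radius).
    """
    return [(x, y) for x, y in patterns
            if all(math.dist(x, v) > rho for v, rho in core_granules)]


# Toy usage with one core granule (made-up values).
patterns = [((0.20, 0.20), "A"), ((0.90, 0.90), "B"), ((0.25, 0.20), "A")]
core = [((0.20, 0.20), 0.10)]
print(uncovered_patterns(patterns, core))
```

The returned subset would then be clustered again to obtain the prototypes of the residual information granules.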
When one wishes to predict the label of a previously unseen instance, it is necessary to check whether the instance falls within the regions represented by the information granules (both the core information granules and the residual information granules) formed on the basis of native patterns. If the instance is recognized as a foreign pattern, that is, it is not covered by any of the formed information granules, then it is rejected immediately, and its further processing is left to the user. Instances that are covered by at least one of the formed information granules are classified in a standard way.
It is worth noting that no classification decision is made with respect to foreign patterns since their characteristics are unknown when constructing classifiers. Another problem is that foreign patterns which exhibit characteristics similar to those of native patterns are likely to escape rejection. When native and foreign patterns overlap strongly in the feature space, the performance of the rejection mechanism could deteriorate.

Experimental studies
In this section, we experiment with a series of publicly available datasets coming from the UCI (https://archive.ics.uci.edu/ml/index.php) and KEEL Machine Learning repositories (https://sci2s.ugr.es/keel/datasets.php) (Alcalá-Fdez et al. 2011; Dua et al. 2019). We assume that the values of each feature have been normalized to the unit range [0, 1] using the Min-Max normalization method. Since the purpose of the experiments is to present the idea of using information granules to characterize the feature space occupied by native patterns and to gain insight into the performance of the proposed method in identifying foreign patterns, each dataset is split into two disjoint subsets, namely native and foreign subsets, according to class labels. The patterns contained in the native subset are treated as native patterns, based on which information granules are formed. The native subset is partitioned into training and testing sets in the proportion of 70% to 30%.
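The data preparation described above can be sketched as follows. This is a minimal standard-library illustration and not the authors' pipeline: the helper names, the random seed, the handling of constant features, and the toy data are assumptions.

```python
import random


def min_max_normalize(rows):
    """Scale every feature column to [0, 1]; constant columns map to 0."""
    lo = [min(col) for col in zip(*rows)]
    hi = [max(col) for col in zip(*rows)]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(row, lo, hi)) for row in rows]


def split_native_foreign(rows, labels, foreign_classes):
    """Split patterns into native and foreign subsets according to class labels."""
    native = [(x, y) for x, y in zip(rows, labels) if y not in foreign_classes]
    foreign = [(x, y) for x, y in zip(rows, labels) if y in foreign_classes]
    return native, foreign


def train_test_split(items, train_ratio=0.7, seed=0):
    """Shuffle and split a list into training and testing parts."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = round(train_ratio * len(items))
    return items[:cut], items[cut:]


# Toy usage: ten made-up patterns, class 2 treated as foreign.
rows = [(float(i), 10.0 * i) for i in range(10)]
labels = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
normalized = min_max_normalize(rows)
native, foreign = split_native_foreign(normalized, labels, {2})
train, test = train_test_split(native)
print(len(native), len(foreign), len(train), len(test))
```

Only the native training part would be passed on to clustering and granule construction; the foreign subset is held back purely for evaluating rejection.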
When fuzzy clustering algorithms are employed to construct information granules, the maximal number of iterations is set to 100 (which is sufficient to guarantee the convergence of the algorithms), and the value of the fuzzification coefficient m is assumed to be equal to 2.0. The clustering algorithms terminate once the maximum number of iterations is reached or the distance between the partition matrices in two consecutive iterations is less than a predefined threshold $\varepsilon = 10^{-3}$. The optimal number of information granules in each stage is determined experimentally. We sweep through values of $c_1$ and $c_2$ ranging from 2 to 15 and record the optimized values of the performance index J. There is a decreasing tendency when J is plotted as a function of the number of clusters. The optimal number of information granules is determined as the point beyond which the improvement in the performance index J becomes less visible.
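One possible way to turn the "improvement becomes less visible" rule into code is a relative-gain threshold, sketched below. The threshold of 10%, the function name `elbow`, and the J values are assumptions chosen for illustration; in the study the choice is made by inspecting the plot of J.

```python
def elbow(j_by_c, min_rel_gain=0.10):
    """Smallest c for which adding one more cluster improves J by less than min_rel_gain."""
    cs = sorted(j_by_c)
    for c, c_next in zip(cs, cs[1:]):
        if j_by_c[c] <= 0.0:
            return c  # J cannot improve any further
        gain = (j_by_c[c] - j_by_c[c_next]) / j_by_c[c]
        if gain < min_rel_gain:
            return c
    return cs[-1]


# Made-up values of J for c = 2..6.
j_values = {2: 100.0, 3: 60.0, 4: 40.0, 5: 37.0, 6: 35.0}
print(elbow(j_values))
```

A different threshold, or a curvature-based criterion, may select a different number of clusters, so the output should be read as a starting point rather than a definitive choice.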
The classification accuracy of native patterns is not the focus of this study. The performance of the proposed classification method with rejection option is evaluated in terms of the performance indices pertinent to the effectiveness of the rejection mechanism. We use the notations $ACC_{training}$ and $ACC_{testing}$ to denote the ratio of training/testing native patterns that are correctly identified as native versus the cardinality of the training/testing dataset, respectively. The percentage of correctly rejected foreign patterns is denoted as $REJ_{foreign}$.
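The three indices can be computed directly from the granules, as in the following Python sketch. It assumes spherical granules given as (prototype, radius) pairs and Euclidean distance; the function names, dictionary keys, and toy data are hypothetical.

```python
import math


def accepted(x, granules):
    """True if x falls inside at least one (prototype, radius) granule."""
    return any(math.dist(x, v) <= rho for v, rho in granules)


def rejection_indices(train, test, foreign, granules):
    """Acceptance rates of native patterns and rejection rate of foreign ones."""
    def rate(xs, want_accept):
        hits = sum(accepted(x, granules) == want_accept for x in xs)
        return hits / len(xs)

    return {
        "ACC_training": rate(train, True),
        "ACC_testing": rate(test, True),
        "REJ_foreign": rate(foreign, False),
    }


# Toy usage with one granule and made-up patterns.
granules = [((0.5, 0.5), 0.2)]
train = [(0.5, 0.5), (0.6, 0.5), (0.9, 0.9)]
test = [(0.45, 0.5)]
foreign = [(0.0, 0.0), (0.55, 0.5)]
print(rejection_indices(train, test, foreign, granules))
```

Note that these indices only measure how well native and foreign patterns are separated; the accuracy of the downstream classifier on accepted patterns is a separate matter.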
The results obtained by the rejection mechanism are compared with those achieved by the unsupervised approach based on FCM membership degrees proposed in Homend et al. (2017) and by one-class Support Vector Machines (SVM). The optimal values of the parameters ($\varepsilon$, $\delta$) are determined experimentally by sweeping through the combinations of possible values and applying the algorithm proposed in Homend et al. (2017). The Radial Basis Function kernel is used when running one-class SVM and the outlier fraction r is chosen with the aim of achieving a trade-off between the acceptance rate of the native patterns and the rejection rate of the foreign patterns.

Banana dataset
The banana dataset coming from the KEEL Machine Learning Repository is a synthetic dataset composed of patterns that belong to several clusters with a banana shape. The patterns belonging to class -1 with attribute value At1 less than -0.6 are treated as foreign patterns while the others are considered as native patterns, as shown in Fig. 3. The native dataset consists of 4200 patterns while the foreign dataset is composed of 1100 patterns. It is visible that there is a nonlinear boundary between the space occupied by the native and foreign patterns (Fig. 4).
When the FCM algorithm is employed to identify the underlying structure of the data, the plots of the optimal values of the performance index J versus different numbers of clusters are shown in Fig. 3. The number of information granules to be constructed is set to seven, as the improvement in the performance index J becomes less obvious when we attempt to partition the native patterns into more clusters. The weight factor in this case is set to $\alpha = 3.5$ by observing the change of the overall coverage of the formed core information granules versus different values of the weight factor. These core information granules form a concise and compact granular representation of the available native patterns. Subsequently, when the residual information granules are constructed on the basis of the uncovered native patterns, the number of information granules is chosen as $c_2 = 6$ and the weight factor $\beta$ is set to 4.5. The plots of the core and residual information granules are shown in Fig. 5.
The values of the performance indices quantifying the rejection quality of the proposed mechanism, the number of information granules specified for the core part and residual part, and the corresponding weight factors are reported in Table 1. For comparative reasons, the rejection quality quantified in terms of the prediction accuracy obtained using FCM membership degrees and one-class SVM along with the corresponding parameters are also reported. The parameter column in the table contains the values of ($c_1$, $\alpha$, $c_2$, $\beta$) for the proposed method, the values of (c, $\varepsilon$, $\delta$) for the method proposed in Homend et al. (2017) and the parameter r for one-class SVM. An increase in the accuracy of rejection of foreign patterns results in a lower acceptance of native patterns. Therefore, the key to the identification of foreign patterns lies in a concise characterization of the feature spaces occupied by native patterns. As reported in Table 1, the proposed algorithm characterizes the shape of the feature space occupied by the native patterns more faithfully and leads to the highest rejection quality on foreign patterns among the compared methods.

Other real-world datasets
In the following experiments, we follow the same general scheme as discussed in the previous experiments. For each dataset, some patterns belonging to specific classes are chosen as foreign patterns, and are treated as previously unseen data points whose feature space is different from the available experimental evidence. The characteristics of the datasets, including the number of patterns, the number of features, the classes treated as native patterns and foreign patterns, along with the number of native patterns and foreign patterns, are reported in Table 2. The experiments are performed in the same manner as in the previous section. We report the values of performance indices obtained for the native training patterns, native testing patterns and foreign patterns in Table 3. It is evident that the geometry occupied by native patterns could be effectively characterized by the formed core information granules and residual information granules. The core information granules define the regions occupied by the native patterns with higher density while the residual information granules identify the space in which the native patterns are distributed with a relatively low density. These two types of information granules form a precise characterization of the feature space corresponding to the native patterns. The same tendency is observed as in the previous experiments: an increase in the rejection rate of foreign patterns is associated with a higher rejection rate of native patterns, especially when there is an overlap between native and foreign patterns. Thus, a sound compromise between the rejection rate of foreign patterns and the acceptance rate of native patterns should be strived for. Through forming a collection of information granules adhering to the principle of justifiable granularity, the boundaries between the native and foreign patterns could be captured, which helps in identifying feature spaces with complex boundaries. The proposed information granule-based feature space modeling methodology could effectively identify the feature spaces occupied by different classes of patterns and improve the rejection quality of foreign patterns.

Conclusions
Distinguishing between native and foreign patterns is of great significance in practical classification applications. Since classifiers are usually built on the basis of native patterns, and no knowledge about foreign patterns is available at the classifier construction stage, the classification of foreign patterns using these classifiers is prone to errors and could lead to the omission of significant discoveries. Classification with a rejection option could alleviate this problem by rejecting the patterns that are identified as foreign patterns. The human-centric nature is stressed in the rejection option as further analysis of the rejected patterns is left to the users. An information granule-based method is proposed to characterize the feature space occupied by native patterns. Information granules are constructed in a cooperative manner by running clustering algorithms and applying the principle of justifiable granularity. Two types of information granules, namely core information granules and residual information granules, are formed on the basis of native patterns that are available at the initial stage. The patterns falling outside the area covered by the information granules are identified as foreign patterns and are rejected while the remaining patterns are classified by the classifiers in a traditional way.
Experimental studies demonstrate the performance of the proposed rejection mechanism. Further studies could focus on investigating other mechanisms for characterizing the shape of the native feature space and on the realization of the corresponding rejection mechanisms.