Covering rough set-based incremental feature selection for mixed decision system

Covering rough sets conceptualize different types of features with their respective induced coverings. By integrating these coverings into a single covering, covering rough set-based feature selection finds valuable features from a mixed decision system with symbolic, real-valued, missing-valued and set-valued features. Existing approaches to covering rough set-based feature selection, however, are intractable on large mixed data. Therefore, an efficient strategy of incremental feature selection is proposed by presenting a mixed data set as a sequence of sample subsets. Once a new sample subset comes in, the relative discernible relation of each feature is updated to disclose an incremental feature selection scheme that decides the strategies of increasing informative features and removing redundant features. The incremental scheme is applied to establish two incremental feature selection algorithms for large or dynamic mixed data sets. The first algorithm updates the feature subset upon the sequential arrival of sample subsets and returns the reduct when no further sample subsets are obtained. The second one merely updates the relative discernible relations and finds the reduct once no further subsets arrive. Extensive experiments demonstrate that the two proposed incremental algorithms, especially the second one, speed up covering rough set-based feature selection without sacrificing too much classification performance.


Introduction
Rough set theory (Pawlak 1982) is a frequently used mathematical tool for conceptualizing uncertain and insufficient knowledge hidden in data. Its theoretical foundation is the equivalence relation, under which samples sharing identical feature descriptions are said to be equivalent or indiscernible. As a result, rough set theory is limited to processing data sets with symbolic features and cannot directly handle other types of features, such as real-valued, missing-valued and set-valued ones. It is not uncommon that an actual data set (called a mixed decision system or mixed data in this paper) includes different types of features. Typically, some of these features are redundant or irrelevant to the classification task and may worsen the classification performance (Chen and Yang 2014; Zhang et al. 2016). Hence, it is desirable to choose informative features from mixed data to improve the performance of learning algorithms.
Currently, much effort has been made to extend rough sets to process other types of features. For example, the similarity relation (Stefanowski and Tsoukias 1999) was employed to handle missing-valued data sets. To process set-valued data sets, Guan and Wang (2006) and Qian et al. (2009) presented the tolerance relation and the binary dominance relation, respectively. The neighborhood relation was defined in Yao (1998, 2006) to handle real-valued data sets. Based on the above binary relations, various types of rough set models, such as the neighborhood rough set model, were developed in Slowinski and Vanderpooten (2000), Qian et al. (2012), Hu et al. (2008) and Wang et al. (2018). Additionally, by using the fuzzy relation, fuzzy rough sets were proposed to work on real-valued data sets (Dubois and Prade 1990; Jensen and Shen 2004, 2007; Tsang et al. 2008). Actually, the family of subsets of the universe generated by each aforesaid relation composes a covering of the universe, where α-cut sets are used to induce a covering of the universe based on the fuzzy relation (Chen et al. 2007; Du et al. 2011). Thus, these models fall into the category of covering rough sets introduced by Zakowski (1983). By aggregating the coverings induced by various types of features into a single covering, covering rough sets can be viewed as a method of handling mixed data sets.
Feature selection, as one important application of rough sets, has found its way into the domains of data mining and uncertainty reasoning. Compared with existing feature selection methods, rough set-based feature selection does not require any prior knowledge, such as the number of selected features or the probability distribution. Several feature selection algorithms in the field of rough sets have been presented to tackle various kinds of data sets. These rough set methods are roughly split into discernibility-based ones and uncertainty measure-based ones. The typical representative of the discernibility-based algorithms is the discernibility matrix-based method. For symbolic data sets, Skowron and Rauszer (1992) calculated a reduct by introducing the discernibility matrix to construct the discernibility function. To improve this discernibility matrix-based approach, sample pair selection in Chen et al. (2012) was presented to obtain all minimal elements with no need for computing the discernibility matrix, so that the search range and the runtime are reduced. For real-valued data sets, the discernibility matrix-based method was extended to fuzzy rough sets in Tsang et al. (2008). For mixed data sets, Chen et al. (2007) generalized the discernibility matrix of the classical case to that of covering rough sets, where the discernibility function was formed to compute all reducts. Wang et al. (2014) simplified the complex discernibility matrices into simpler ones and designed a discernibility matrix-based heuristic approach to reducts. A simpler discernibility matrix of a covering decision system, which essentially has the same form as the one in Wang et al. (2014), was also constructed in Wang et al. (2015).
The discernibility matrix-based algorithms can obtain all reducts (Chen et al. 2012). However, the discernibility matrix has a complex structure and is not suitable for large data sets. To speed up the computation of a reduct, the uncertainty measure-based algorithms use different uncertainty measures to heuristically find a single reduct (Slezak 2002; Wang et al. 2002; Liang and Xu 2002). For symbolic data sets, Qian et al. (2010) developed the positive approximation method to accelerate the heuristic process of finding reducts by using three representative information entropy measures. For real-valued data sets, a feature selection method based on fuzzy rough sets was presented in Jensen and Shen (2004) to preserve the dependence function. Hu et al. (2006) presented a feature selection method that applies information entropy to evaluate the significance of a feature subset. For heterogeneous data sets with symbolic and real-valued features, the neighborhood rough set model (Hu et al. 2008) was presented, assigning different thresholds to distinct types of features. Chen and Yang (2014) proposed a heuristic feature selection method using the discernibility relations of heterogeneous features. For mixed data sets with different types of features, Dong et al. (2016) employed the relative discernible relation to calculate minimal elements and developed algorithms for reducts which were experimentally demonstrated to be effective.
The above feature selection algorithms have been extensively studied from the theoretical perspective. However, they have an intensive computation cost when handling large mixed data. An intuitive solution is to split the mixed data into a family of sample subsets which arrive successively, and then apply an incremental technique to handle the sample subsets one by one. Such mixed data sets are sometimes referred to as dynamic mixed data in this paper. As a newly acquired sample subset comes in, the above-mentioned non-incremental feature selection methods need to re-select a valuable feature subset from all historical data samples plus the newly arrived samples (Li et al. 2007). As a consequence, these non-incremental methods are uneconomical for dynamic mixed data, as older data sets are repeatedly processed. Conversely, incremental feature selection methods update a feature subset by making the best of the previous results learned from the historical data samples, with no need to re-compute the feature subset from the combination of the historical samples and the newly acquired samples (Chen et al. 2015). Rough set-based incremental feature selection (Sang et al. 2020, 2021) has been investigated in the natural scenario where a single sample or a sample group varies. For a single arriving sample, the discernibility matrix was updated to incrementally select a feature subset in Orlowska and Orlowski (1992). However, Hu et al. (2005) experimentally demonstrated the high computational complexity of the method in Orlowska and Orlowski (1992). Thus, Hu et al. (2007) investigated incremental mechanisms for the positive region to develop an incremental feature selection algorithm. The modified discernibility matrix was incrementally updated in Ming (2007) to compute all reducts. The core was incrementally computed in Feng and Zhang (2012) to expedite the acquisition of a reduct.
The minimal elements in variable precision rough sets were updated in Chen et al. (2016) to study the incremental feature selection process. When a sample group dynamically varies, some incremental algorithms based on rough sets have been studied. For example, Shu and Shen (2013) incrementally calculated the positive region to update attribute reduction for dynamic incomplete decision systems. Liang et al. (2014) proposed a group incremental feature selection algorithm by incrementally computing three representative information entropy measures. Yang et al. (2018) investigated incremental feature selection algorithms based on fuzzy rough sets by utilizing the strategies of increasing and removing features. Considering the utility of incoming samples, a novel incremental feature selection method was presented by combining an active sample selection process with an incremental feature selection scheme. When adding and deleting samples, covering element reduction was performed in Lang et al. (2015) by incrementally computing lower and upper approximations of sets. When samples vary, Jing et al. presented incremental feature selection algorithms based on knowledge granularity (Jing et al. 2016) and the multi-granulation view (Jing et al. 2017). An active incremental feature selection algorithm was presented in Zhang et al. (2020) using fuzzy rough set-based information entropy with incoming representative instances.
The above-mentioned incremental feature selection methods are effective in selecting a feature subset based on different rough set models. However, they are confined to homogeneous data sets. When processing mixed data sets, these incremental methods have to preprocess a mixed data set into a homogeneous one, which easily leads to information loss. Therefore, they cannot effectively meet the challenge of mixed data samples being continuously added. Inspired by these observations, we investigate incremental feature selection algorithms based on covering rough sets by representing a mixed data set as a sequence of sample subsets. The major contributions of this paper are as follows.
(i) The relative discernible relation of each feature is calculated to characterize the core and feature selection in covering rough sets. (ii) Upon the arrival of a sample subset, we apply the incremental mode to renew the discernible relation of every feature, which remains valid for all subsequent mixed data samples. (iii) Using the renewed discernible relations, we investigate an incremental feature selection scheme that reveals the strategies of increasing informative features and removing redundant features. (iv) The incremental scheme is applied to present two incremental feature selection algorithms: one updates the optimal feature subset as sample subsets arrive, and the other finds the optimal feature subset once no further sample subsets are obtained. (v) The comparison results show that the two proposed incremental algorithms, especially the second one, expedite the computation of the optimal feature subset from mixed data without sacrificing too much classification performance.
The outline of the paper is given below. We introduce the preliminaries related to our paper in Sect. 2. Section 3 employs the relative discernible relation to present a feature selection method in covering rough sets. Two incremental feature selection algorithms are developed in Sect. 4. Experimental comparisons are conducted in Sect. 5. This paper is summarized in Sect. 6.

Preliminaries
This section lists some basic concepts used in the paper.

Covering rough sets
Definition 1 (Bonikowski et al. 1998) Let U be a universe and C a family of nonempty subsets of U. If ∪C = U, then C is called a covering of U. A partition of the universe is clearly a covering of it. As noted in Chen et al. (2007), Couso and Dubois (2011) and Chen et al. (2014), a covering can be generated by various kinds of features, such as symbolic, real-valued, set-valued and missing-valued features. In detail, this paper employs the equivalence relation (Pawlak 1982) to generate the covering of a symbolic feature a, and its covering is C_a = {[x]_a : x ∈ U} with [x]_a = {y ∈ U : a(x) = a(y)}. The neighborhood relation (Yao 1998) is employed to generate the covering of a real-valued feature a, and its covering is C_a = {δ_a(x) : x ∈ U} with δ_a(x) = {y ∈ U : Δ_a(x, y) ≤ δ}. Here, δ is the neighborhood radius and Δ_a(x, y) is the distance between x and y with respect to a. The tolerance relation (Guan and Wang 2006) is used to generate the covering of a set-valued feature a, and its covering is computed as C_a = {T_a(x) : x ∈ U} where T_a(x) = {y ∈ U : a(x) ∩ a(y) ≠ ∅}. We utilize the characteristic relation (Grzymala-Busse 2004) to induce the covering of a missing-valued feature, where ? and * represent the lost value and the 'do not care' value, respectively. Accordingly, the covering of a missing-valued feature a is calculated as C_a = {K_a(x) : x ∈ U}, where K_a(x) is the characteristic set of x. The coverings generated by single features then need to be aggregated to describe a feature subset. For this reason, the definition of covering information system is introduced.
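As a concrete illustration, the three explicit relations above can be sketched as follows; the toy data and helper functions are hypothetical, not the paper's implementation, and the distance Δ_a for the real-valued feature is taken to be the absolute difference.

```python
def symbolic_covering(values):
    # [x]_a = {y : a(y) == a(x)}: the equivalence relation partitions U.
    return [frozenset(j for j, w in enumerate(values) if w == v)
            for v in values]

def neighborhood_covering(values, delta):
    # delta_a(x) = {y : |a(x) - a(y)| <= delta}: the neighborhood relation.
    return [frozenset(j for j, w in enumerate(values) if abs(w - v) <= delta)
            for v in values]

def tolerance_covering(values):
    # T_a(x) = {y : a(x) & a(y) != {}}: the tolerance relation for set values.
    return [frozenset(j for j, w in enumerate(values) if v & w)
            for v in values]

symbolic = ["red", "blue", "red"]    # toy symbolic feature over U = {0, 1, 2}
real = [0.1, 0.15, 0.9]              # toy real-valued feature
setval = [{1, 2}, {2, 3}, {4}]       # toy set-valued feature

print(symbolic_covering(symbolic))        # samples 0 and 2 share a block
print(neighborhood_covering(real, 0.1))   # samples 0 and 1 are delta-close
print(tolerance_covering(setval))         # overlapping value sets tolerate
```

Each helper returns one covering element per sample, which is how the per-feature coverings are consumed in the rest of this sketch series.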
Definition 4 (Chen et al. 2007) Suppose (U, A) is a covering information system and B = {c_{i_1}, . . . , c_{i_k}} is a subset of A, where C_{i_j} is the covering generated by c_{i_j}. The induced covering of B is defined as Cov(B) = {∩_{j=1}^{k} K_{i_j} : K_{i_j} ∈ C_{i_j}, ∩_{j=1}^{k} K_{i_j} ≠ ∅}.
According to Definition 4, coverings generated by different types of features can be aggregated into a reasonable covering. Specifically, we assume B = {c_{i_1}, · · · , c_{i_k}} includes symbolic, real-valued, set-valued and missing-valued features. The covering C_{i_j} of the feature c_{i_j} can be generated in the aforementioned way. Applying Definition 4, the induced covering of the feature subset B is then obtained. The induced covering of the feature set can be used to construct lower and upper approximations of a sample set in covering rough sets. Therefore, it is possible to employ covering rough sets to handle mixed data with different types of features. In this paper, a mixed data set is denoted as a covering decision system defined as follows.
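The aggregation step can be sketched per sample by intersecting each sample's covering elements across features; this per-sample view is a simplification of Definition 4, and the two per-feature coverings below are hypothetical toy data (one symbolic-style, one neighborhood-style), not the paper's code.

```python
# Per-feature coverings over U = {0, 1, 2}: for each sample x, the block
# of x in the covering generated by that feature (hypothetical toy data).
cov_symbolic = [frozenset({0, 2}), frozenset({1}), frozenset({0, 2})]
cov_real = [frozenset({0, 1}), frozenset({0, 1}), frozenset({2})]

def induced_covering(coverings):
    # Intersect, for every sample, its covering elements across features,
    # aggregating the per-feature coverings into a single covering.
    n = len(coverings[0])
    blocks = []
    for x in range(n):
        block = frozenset(range(n))
        for cov in coverings:
            block &= cov[x]
        blocks.append(block)
    return blocks

print(induced_covering([cov_symbolic, cov_real]))
```

Sample 0's induced block is {0, 2} ∩ {0, 1} = {0}: features that are individually too coarse jointly separate the sample from all others.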

Covering rough set-based feature selection
Given (U, A, D) and B ⊆ A, the positive region of D relative to B is defined as pos_B(D) = ∪_{X ∈ U/D} B(X), where B(X) is the lower approximation of X with respect to the induced covering of B. By substituting coverings for features, the covering rough set model can be regarded as a mathematical tool for reducing superfluous features. Feature selection in the covering rough set model is described below.
Definition 7 (Chen et al. 2007) Given a covering decision system (U, A, D), P ⊆ A is said to be a reduct (also called an optimal feature subset) of A if the following conditions hold: (1) pos_A(D) = pos_P(D); (2) ∀c ∈ P, pos_A(D) ≠ pos_{P−{c}}(D). By Definition 7, a reduct of (U, A, D) is a minimal feature subset preserving the positive region. The discernibility matrix-based method, a frequently used approach to computing reducts, is given below.
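The positive region can be sketched as follows, assuming each sample's induced covering block is already available; the blocks and decision labels are hypothetical toy data, not the paper's code.

```python
def positive_region(blocks, labels):
    # pos_B(D): samples whose induced covering block lies inside a single
    # decision class, i.e. the union of lower approximations of U/D.
    return frozenset(x for x, block in enumerate(blocks)
                     if len({labels[y] for y in block}) == 1)

# Hypothetical induced blocks over U = {0, 1, 2, 3} and decision labels:
blocks_A = [{0}, {1}, {2, 3}, {2, 3}]
labels = [0, 1, 1, 1]
print(positive_region(blocks_A, labels))   # every block is label-pure here
```

A candidate P is then a reduct in the sense of Definition 7 when pos_P(D) equals pos_A(D) and dropping any single feature of P breaks that equality.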
Definition 8 (Chen et al. 2007) Given (U, A, D), c_{ij} = {c_k ∈ A : x_j ∉ (x_i)_{c_k}} is the discernibility feature set of x_i and x_j that meet one of the two requirements: (1) x_i ∈ pos_A(D) and x_j ∉ pos_A(D); (2) x_i, x_j ∈ pos_A(D) and D(x_i) ≠ D(x_j). The discernibility function of (U, A, D) is denoted by f(U, A, D) = ∧{∨c_{ij} : c_{ij} ≠ ∅}. By the distribution and absorption laws, the minimal disjunctive form of f(U, A, D) yields all reducts, and the set of features appearing in singleton discernibility feature sets is said to be the core of (U, A, D). In reality, one reduct is sufficient for many practical applications. To calculate a reduct, we give the next theorems.
Theorem 1 A feature c belongs to the core of (U, A, D) iff there exists a discernibility feature set c_{ij} = {c}.

Theorem 2 P ⊆ A is a reduct of (U, A, D) iff it satisfies the next requirements: (1) for ∀c_{ij} ≠ ∅, P ∩ c_{ij} ≠ ∅; (2) for ∀c_k ∈ P, there exists c_{ij} ≠ ∅ that satisfies (P − {c_k}) ∩ c_{ij} = ∅.
Theorem 1 states that the core is made up of the singletons of a discernibility matrix, while Theorem 2 develops an easy method of judging whether a subset of features is an optimal one of the covering decision system. In Theorem 2, the condition (1) shows that a reduct is enough to together discern sample pairs whose discernibility feature sets are non-empty; the condition (2) indicates that every feature in a reduct P is separately essential. Hence, a reduct is just a minimum subset distinguishing sample pairs whose discernibility feature sets are non-empty.
Feature selection based on the relative discernible relation

Definition 9 (Dong et al. 2016) Suppose U = {x_1, · · · , x_n}. The relative discernible relation of c ∈ A is defined as Dis_D({c}) = {(x_i, x_j) : c ∈ c_{ij}}, and that of B ⊆ A as Dis_D(B) = ∪_{c∈B} Dis_D({c}). The relative discernible relation of c ∈ A is in fact the collection of sample pairs distinguished by c. Based on the relative discernible relation, the next two theorems are developed to describe the core and feature selection of a mixed data set, respectively.
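Under the same toy conventions as before, Definition 9 can be sketched as the set of qualifying sample pairs (in the sense of Definition 8) that each feature discerns; `cov`, `labels` and `pos` below are assumed illustrative inputs, not the paper's code.

```python
def relative_discernible(cov, labels, pos):
    # Dis_D({c}): qualifying sample pairs that feature c tells apart,
    # i.e. x_j lies outside x_i's covering element for c.
    n = len(labels)
    dis = {c: set() for c in cov}
    for i in range(n):
        for j in range(n):
            ok = (i in pos and j not in pos) or \
                 (i in pos and j in pos and labels[i] != labels[j])
            if i == j or not ok:
                continue
            for c, blocks in cov.items():
                if j not in blocks[i]:
                    dis[c].add((i, j))
    return dis

# Toy per-feature coverings over U = {0, 1, 2}, decisions, positive region:
cov = {"a": [{0, 1}, {0, 1}, {2}], "b": [{0}, {1}, {2}]}
labels = [0, 0, 1]
pos = {0, 1, 2}                 # all samples are consistent in this toy case
dis = relative_discernible(cov, labels, pos)
print(dis["a"], dis["b"])
```

Both features here discern the same four cross-class pairs, so either one alone would preserve Dis_D(A) in this toy system.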
Theorem 3 A feature c_k belongs to the core of (U, A, D) iff Dis_D({c_k}) − Dis_D(A − {c_k}) ≠ ∅. According to Theorem 3, if a sample pair can merely be discerned by c_k and cannot be discerned by any feature of A − {c_k}, then c_k belongs to the core.
Theorem 4 P ⊆ A is a reduct of (U, A, D) iff the next requirements hold: (1) Dis_D(P) = Dis_D(A); (2) for ∀c ∈ P, Dis_D(P − {c}) ≠ Dis_D(A). In light of the two requirements, this theorem is proved according to Theorem 2.

By Theorem 4, a reduct of a mixed data set is a minimum feature subset keeping the relative discernible relation of the whole set of features. Based on Theorems 3 and 4, a feature selection algorithm is designed to calculate a reduct from a mixed data set.
Algorithm 1 employs the relative discernible relation to find a reduct from a mixed data set.
Step 1 calculates the induced covering generated by each feature and the induced covering generated by the feature set; its computational complexity is O(|U|²|A|). In Step 2, we calculate the positive region, which has the computational complexity of O(|U||A|). In Step 3, we calculate the relative discernible relations of every feature and of A, which has the computational complexity of O(|U|²|A|). In Step 4, we calculate the core of a mixed data set, which has the computational complexity of O(|U|²). In Steps 6∼9, we choose the feature that discerns the most sample pairs at every iteration, which has a computational complexity of O(|U|²|A|). In a word, the computational complexity of Algorithm 1 is O(|U|²|A|).
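The core computation and the greedy loop of Algorithm 1 can be sketched as follows; the `dis` dictionary of relative discernible relations is a hypothetical toy input, and the greedy tie-breaking is one plausible instantiation rather than the paper's exact procedure.

```python
# Hypothetical relative discernible relations Dis_D({c}) per feature:
dis = {
    "a": {(0, 2), (1, 2)},
    "b": {(0, 2)},
    "c": {(1, 2), (0, 1)},
}
dis_A = set().union(*dis.values())            # Dis_D(A)

# Core (Theorem 3): features that alone discern some pair that no other
# feature discerns.
core = {c for c, pairs in dis.items()
        if pairs - set().union(*(p for d, p in dis.items() if d != c))}

# Greedy selection (Steps 6~9): start from the core and repeatedly pick
# the feature discerning the most still-uncovered sample pairs.
red = set(core)
covered = set().union(*(dis[c] for c in red)) if red else set()
while covered != dis_A:
    best = max(dis, key=lambda c: len(dis[c] - covered))
    red.add(best)
    covered |= dis[best]
print(sorted(red))
```

In this toy input only "c" discerns the pair (0, 1), so it is the core; one more feature then suffices to cover Dis_D(A), matching Theorem 4's condition (1).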
As a matter of fact, Algorithm 1 is a non-incremental method of feature selection in covering rough sets, as it fails to reuse the previous results learned from the historical data samples when new data samples arrive. When this happens, we need to put the historical (old) data set and the new data set together and then run a non-incremental algorithm, like our proposed Algorithm 1, on the combined data set to obtain a new reduct. This is certainly wasteful of computation time, as the results of the historical data set are effectively thrown away. In Sect. 4, we concentrate on the incremental scheme of feature selection in covering rough sets for handling dynamic mixed data, as well as large mixed data.

Incremental scheme of feature selection in covering rough sets
This section presents the incremental scheme of feature selection in covering rough sets, under the assumption that a mixed data set can be segmented into a collection of sample subsets. The sample subsets arrive successively, so that feature selection can be performed incrementally in covering rough sets. The relative discernible relation of every feature is first updated upon the sequential arrival of sample subsets. The incremental feature selection scheme is then revealed to decide the strategies of increasing informative features and removing redundant features. Based on the incremental scheme, two incremental algorithms are presented that begin from the empty set to calculate an optimal feature subset of a mixed data set.

Incremental setup and symbols
In this subsection, we describe the incremental setup and the notations used.

To use the incremental technique to obtain an optimal feature subset of (U, A, D), U is segmented into a sample subset sequence {U_k}_{k=1}^{m}, where every subset is said to be an incoming sample subset. The subsets obviously satisfy U = ∪_{k=1}^{m} U_k and U_i ∩ U_j = ∅ for i ≠ j. Under the incremental setting, we add sample subsets one at a time. The previously obtained historical samples are stored in a temporary set T, which is initialized as the empty set. In detail, with the arrival of U_1, the temporary set T becomes U_1; upon the arrival of U_2, T is updated to U_1 ∪ U_2; · · · ; as U_m arrives, T is updated to the set of all samples U. In general, T is updated with the sequential arrival of the first, · · · , mth sample subsets.
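The setup can be illustrated with a few lines of code (hypothetical sample indices):

```python
# U is split into m disjoint, successively arriving sample subsets, and
# the temporary set T accumulates every subset seen so far.
U = list(range(10))
m = 5
size = len(U) // m
subsets = [set(U[k * size:(k + 1) * size]) for k in range(m)]

T = set()
for U_k in subsets:
    T |= U_k          # after the kth arrival, T = U_1 ∪ ... ∪ U_k
print(T == set(U))    # once U_m has arrived, T is the whole universe
```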
In (T, A, D), the relative discernible relation of c ∈ A is denoted by Dis_D^T({c}) and that of the feature set A by Dis_D^T(A); the corresponding relations of (U_k, A, D) and (T ∪ U_k, A, D) are denoted analogously.

Updating the relative discernible relation
According to Sect. 2, it is necessary to calculate the relative discernible relation of every feature when selecting a feature subset from a mixed data set. In this subsection, the incremental mode is thus used to calculate the relative discernible relation of every feature incrementally whenever a sample subset arrives.
As a decision sub-system (U_k, A, D) is merged into (T, A, D), some samples that are consistent in T or U_k may become inconsistent in the new data set (T ∪ U_k, A, D), which results in the following two facts. One is that some sample pairs of Dis_D^T({c}) and Dis_D^{U_k}({c}) can no longer be discerned in (T ∪ U_k, A, D). The other is that some sample pairs which are not discerned in (T, A, D) or (U_k, A, D) can be discerned in (T ∪ U_k, A, D). By the above two facts, the relative discernible relation of every feature is incrementally calculated.
The following theorem lists sample pairs which cannot be distinguished in (T ∪ U k , A, D).
Theorem 5 In (T ∪ U_k, A, D), the following conclusions hold: By Theorem 5, we can delete from Dis_D^T({c}) and Dis_D^{U_k}({c}) the sample pairs that can no longer be discerned in (T ∪ U_k, A, D), which involve the samples that are consistent in T or U_k but inconsistent in T ∪ U_k. Hence, we have the following theorem.
Theorem 6 In (T ∪ U_k, A, D), the following conclusions hold, where the set involved consists of the samples that are consistent in U_k but inconsistent in T ∪ U_k. Based on these discussions, the next theorem is given.
Theorem 7 In (T ∪ U_k, A, D), the following conclusions hold: By Theorems 6 and 7, we can incrementally find the sample pairs that cannot be discerned in (T, A, D) or (U_k, A, D) but can be discerned in (T ∪ U_k, A, D). Furthermore, it is necessary to judge whether each sample in T and each sample in U_k are discernible in (T ∪ U_k, A, D), which is a key step of incrementally calculating the discernible relations of ∀c ∈ A and A.
According to Theorem 8, we can obtain the sample pairs between T and U_k that can be discerned in (T ∪ U_k, A, D). By Theorems 5∼8, an incremental algorithm is designed to update the relative discernible relation of every feature.
Algorithm 2 Updating the relative discernible relations
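The spirit of the update behind Algorithm 2 can be sketched as follows; the helper names, the symbolic discernibility test and the toy data are illustrative assumptions, not the paper's exact theorems or code.

```python
def update_dis(dis_T, old_idx, new_idx, values, labels, pos):
    # pos: positive region of the merged system (T ∪ U_k).
    def qualifies(i, j):
        return (i in pos and j not in pos) or \
               (i in pos and j in pos and labels[i] != labels[j])
    # Drop old pairs invalidated by newly created inconsistencies
    # (in the spirit of Theorem 5).
    dis = {(i, j) for (i, j) in dis_T if qualifies(i, j)}
    # Only pairs touching the newcomers need to be examined afresh
    # (in the spirit of Theorems 6~8).
    for i in new_idx:
        for j in old_idx + new_idx:
            if i != j and values[i] != values[j]:  # symbolic feature discerns
                if qualifies(i, j):
                    dis.add((i, j))
                if qualifies(j, i):
                    dis.add((j, i))
    return dis

# T = {0, 1} (value "a", decisions 0 and 1, hence inconsistent, so Dis over
# T alone is empty); U_k = {2, 3} (value "b", both decision 0).
values, labels = ["a", "a", "b", "b"], [0, 1, 0, 0]
pos = {2, 3}       # in T ∪ U_k, only the label-pure block {2, 3} is consistent
print(update_dis(set(), [0, 1], [2, 3], values, labels, pos))
```

The saving comes from never re-examining pairs wholly inside T: only |U_k| × |T ∪ U_k| pairs are touched per arrival.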

Incremental feature selection scheme in covering rough sets
With an incoming sample subset arriving, one of the following two cases will occur: Case 1, Dis_D(A) = Dis_D(red_T), and Case 2, Dis_D(A) ≠ Dis_D(red_T). In terms of the two cases, an incremental feature selection scheme is investigated to disclose the strategies of increasing informative features and removing redundant features. In Case 1, Theorem 9 indicates that red_T either properly includes an optimal feature subset of (T ∪ U_k, A, D) or is itself an optimal feature subset of (T ∪ U_k, A, D). The following strategy of removing redundant features is presented to incrementally calculate a reduct of (T ∪ U_k, A, D).
First feature removal strategy The feature c can be removed from red_T if Dis_D(red_T − {c}) = Dis_D(A). The strategy shows that, in the new mixed data, red_T − {c} can still distinguish the sample pairs which are discerned by A. By Theorem 4, the feature c can thus be removed from red_T. If the strategy is unavailable for ∀c ∈ red_T, red_T is just a reduct of (T ∪ U_k, A, D) by Theorem 4. In detail, we employ the forward greedy search to delete redundant features from red_T: at each loop, we delete a feature c with Dis_D(red_T − {c}) = Dis_D(A) from red_T, and repeat this process until the removal of any remaining feature no longer satisfies the first feature removal strategy. Accordingly, an optimal feature subset of the updated temporary data set is calculated incrementally.
Since red_T does not preserve the relative discernible relation of A in Case 2, red_T is not an optimal feature subset of the updated mixed data set. So, some informative features B ⊆ A − red_T should be added into red_T until Dis_D(red_T ∪ B) = Dis_D(A) holds. The following strategy of adding features is proposed by Theorem 4.
Feature addition strategy If B ⊆ A−red T is a minimum addition subset that satisfies Dis D (red T ∪ B) = Dis D (A), B can be put into red T .
By the feature addition strategy, red_T ∪ B can distinguish the sample pairs which are distinguished by A. This fact indicates that a reduct of (T ∪ U_k, A, D) is contained in red_T ∪ B. Some features in red_T ∪ B, however, may be redundant. Moreover, since B is a minimum addition subset satisfying Dis_D(red_T ∪ B) = Dis_D(A), Dis_D(red_T ∪ (B − {a})) ≠ Dis_D(A) holds for ∀a ∈ B. Therefore, we only need to delete redundant features from red_T. Thus, the following strategy of removing redundant features is presented to calculate an optimal feature subset of (T ∪ U_k, A, D).
Second feature removal strategy The feature c can be removed from red_T if Dis_D((red_T − {c}) ∪ B) = Dis_D(A). By Theorem 4, when the removal condition is not satisfied for any c ∈ red_T, red_T ∪ B is just a reduct of (T ∪ U_k, A, D). When it is satisfied for some c ∈ red_T, c can be removed from red_T. Similar to the first feature removal strategy, we use the forward greedy search to delete redundant features from red_T: at each loop, a feature c with Dis_D((red_T − {c}) ∪ B) = Dis_D(A) is deleted from red_T, and this process is repeated until the second feature removal strategy is no longer satisfied. Thus, a reduct of (T ∪ U_k, A, D) is obtained by continuously applying the second feature removal strategy.
To sum up, the technical details are given below. If a newly acquired sample subset satisfies Case 1, we employ the first feature removal strategy to renew the optimal feature subset of the updated mixed data set. If it satisfies Case 2, we utilize the feature addition strategy to put a feature subset B ⊆ A − red_T into the current reduct red_T, and then apply the second feature removal strategy to remove redundant features from red_T.
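The case analysis above can be sketched as follows, assuming the relative discernible relations have already been updated for T ∪ U_k; `dis` maps each hypothetical feature to Dis_D({c}), `red_T` is the previous reduct, and the greedy choices are one plausible instantiation rather than the paper's exact procedure.

```python
def union_of(dis, feats):
    # Dis_D(B) as the union of the per-feature relations of B.
    return set().union(*(dis[c] for c in feats)) if feats else set()

def update_reduct(dis, red_T):
    dis_A = union_of(dis, dis)
    red = set(red_T)
    # Case 2: feature addition strategy (greedy minimum-style addition).
    while union_of(dis, red) != dis_A:
        covered = union_of(dis, red)
        best = max(sorted(set(dis) - red),
                   key=lambda c: len(dis[c] - covered))
        red.add(best)
    # Removal strategy: greedily drop now-redundant old features.
    for c in sorted(red_T):
        if c in red and union_of(dis, red - {c}) == dis_A:
            red.discard(c)
    return red

dis = {"a": {(0, 1)}, "b": {(0, 1), (0, 2)}, "c": {(1, 2)}}
print(update_reduct(dis, {"a"}))
```

Here the old reduct {"a"} no longer covers Dis_D(A), so "b" and "c" are added; "a" then becomes redundant and is removed, illustrating both strategies in one pass.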

Incremental algorithms for feature selection from mixed data
Two incremental feature selection algorithms are presented to incrementally calculate an optimal feature subset from a mixed data set based on the above strategies of increasing and removing features.

Incremental-1 Update an optimal feature subset as sample subsets arrive successively.
With the sequential arrival of sample subsets, the incremental process updates a reduct by the updated relative discernible relation of every feature. More specifically, if a newly acquired sample subset is in Case 1, the first feature removal strategy is performed; if it is in Case 2, the feature addition strategy is first performed and the second feature removal strategy is then used. A reduct of the whole mixed data set is acquired by repeatedly applying the incremental process until no further sample subset is left. Based on this incremental process, the following incremental algorithm is presented to incrementally obtain an optimal feature subset in covering rough sets. Figure 1 depicts the working principle of Incremental-1.
Algorithm 3 incrementally calculates an optimal feature subset beginning from an empty set. In Step 1, T and the results of (T, A, D) are initialized with empty sets. A new sample subset is obtained in Step 3. Step 4 incrementally calculates the relative discernible relations of ∀c ∈ A and A; its computational complexity is determined by the number of newly examined sample pairs and the sizes |Dis_D^T({c})| of the stored relations. Step 6, which has the computational complexity of O(|T ∪ U_k|²|A|), judges whether the newly acquired sample subset meets Case 1 or Case 2. Steps 6∼10 perform the strategy of increasing features, which has the computational complexity of O(|A|). In Steps 11∼14, the strategy of removing features is performed, which has the computational complexity of O(|A|). Hence, the whole computational complexity of Algorithm 3 is dominated by the incremental update of the relative discernible relations over the m arriving sample subsets.
However, Incremental-1 needs to update the optimal feature subset at each iteration, which is uneconomical for large mixed data sets. In order to expedite Incremental-1, we refrain from incrementally calculating the selected feature subset as sample subsets arrive successively, which results in the next incremental process.
Incremental-2 Compute an optimal feature subset when there is no sample subset left.

Algorithm 4 Second incremental feature selection algorithm in covering rough sets (denoted by Incremental-2)
Input: (1) the sequence of sample subsets U = ∪_{k=1}^{m} U_k; (2) the set of features A; (3) the decision feature D. Output: an optimal feature subset of (U, A, D): red.

The second incremental process starts from an empty set to calculate an optimal feature subset of the whole mixed data set once no further sample subsets are obtained. With sample subsets arriving continuously, the relative discernible relations of every feature and of the feature set are merely updated, without incrementally finding the optimal feature subset. Once no further sample subset is added, the final discernible relations of every feature and of the feature set are obtained on the whole mixed data set. With the obtained relations, the feature addition strategy and the second feature removal strategy are utilized to calculate an optimal feature subset from the whole mixed data set. The working principle of Incremental-2 is depicted in Fig. 2.
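The structural difference between the two drivers can be sketched schematically; `update_relations` and `select_reduct` are hypothetical stand-ins for Algorithm 2 and the selection strategies, reduced to trivial set operations so that the control flow is the only point.

```python
def update_relations(state, U_k):
    return state + [U_k]                # stand-in: fold in the new subset

def select_reduct(state):
    return sorted(set().union(*state))  # stand-in for the greedy selection

def incremental_1(subsets):
    state, red = [], []
    for U_k in subsets:
        state = update_relations(state, U_k)
        red = select_reduct(state)      # reduct refreshed at every arrival
    return red

def incremental_2(subsets):
    state = []
    for U_k in subsets:
        state = update_relations(state, U_k)
    return select_reduct(state)         # selection performed only once

subsets = [{1, 2}, {3}, {2, 4}]
assert incremental_1(subsets) == incremental_2(subsets)
```

Both drivers end at the same result, but Incremental-2 pays the selection cost once instead of m times, which is exactly the source of its speedup.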
By the second incremental process, we present Algorithm 4 to calculate a reduct in covering rough sets. In Step 1 of Algorithm 4, T and the results of (T, A, D) are initialized as empty sets. The relative discernible relations of ∀c ∈ A and A are updated in Steps 2∼4, whose computational complexity is determined by the number of newly examined sample pairs and the sizes of the stored relations. The feature addition strategy is performed in Steps 7∼11, which has the computational complexity of O(|A|). The second feature removal strategy is performed in Steps 12∼16, which has the computational complexity of O(|A|). Therefore, the whole computational complexity of Algorithm 4 is dominated by updating the relative discernible relations over the m sample subsets.

Remark If new sample subsets with various feature types arrive in a random sequence, these sample subsets are put into our proposed incremental models one at a time. Specifically, as a new sample subset is randomly added into Incremental-1, the optimal feature subset is incrementally updated based on the updated relative discernible relations of each feature and of the whole feature set. When the next sample subset arrives randomly, the same procedure is performed by Incremental-1. Furthermore, when a new sample subset is randomly put into Incremental-2, only the relative discernible relations of each feature and of the whole feature set are updated. After all new sample subsets have been put into Incremental-2, the optimal feature subset is calculated based on the updated relative discernible relations. Thus, the optimal feature subsets obtained by Incremental-1 and Incremental-2 can be used in the classification task.

Experimental results
This section compares Incremental-2 with Incremental-1, Non-Incremental (Algorithm 1) and the consistency-based feature selection method (Dash and Liu 2003) (denoted by Consistency). The main concern is the comparison of time efficiency, i.e., the runtime of obtaining an optimal feature subset.

Experimental setup
The hardware environment: the experiments of Sects. 5.2∼5.3 are performed on a Windows 7 PC with an Intel(R) Xeon(R) CPU E5-2620 0 @ 2.00 GHz and 80 GB memory. The experiments of Sect. 5.4 are performed on a Windows 10 system with an Intel(R) Core(TM) i7-7700 CPU @ 3.60 GHz and 32.0 GB memory.
The software environment: MATLAB 2013b. Data sets: all fourteen data sets used in our experiments are from the UCI Machine Learning Repository, where 'Annealing' in Table 1 includes 798 training samples and 100 test samples. 'Australian,' 'Japan,' 'Mammographic mass' and 'Nursery' are used to further confirm the time efficiency of our proposed incremental algorithms. The detailed information is summarized in Table 1, where 'Missing features' reports the number of missing-valued features. These data sets cover a wide range of numbers of samples, features and classes, and include symbolic, real-valued and missing-valued features, denoted by 's,' 'r' and 'm,' respectively.
Data partitioning: Before performing feature selection, each data set in Table 1 is randomly split into 10 equal-sized sample subsets. These 10 sample subsets constitute a sample subset sequence.
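The partitioning step can be sketched in a few lines of Python; this is an illustrative sketch of the protocol, not the authors' code, and the function name `partition` is ours.

```python
import random

def partition(n_samples, k=10, seed=0):
    """Randomly split sample indices into k (nearly) equal-sized subsets,
    forming the sample subset sequence used in the experiments."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Strided slicing yields k subsets whose sizes differ by at most one.
    return [idx[i::k] for i in range(k)]
```

Feeding the returned list to an incremental algorithm one element at a time reproduces the "successive arrival of sample subsets" setting used throughout this section.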
Covering induced by a feature: The equivalence relation is used to generate the covering of a symbolic feature; the neighborhood relation is used to generate the covering of a real-valued feature.
Classification accuracy: The classifiers used in the experiments are the k-nearest neighbor function 'fitcknn' and the random forest function 'TreeBagger' in the MATLAB Toolbox, where the number of nearest neighbors is set to 3 and the number of trees is set to 500. Tenfold cross-validation is used to compute the classification accuracy. Specifically, the data set before or after feature selection is randomly split into 10 parts of equal size, where nine of them are used as the training set and the remaining one as the test set. At each round, we use 'fitcknn' or 'TreeBagger' to train a model on the training set and compute the classification accuracy on the test set, i.e., the ratio of correctly classified test samples to all samples in the test set. The average of the classification accuracies over the 10 rounds is reported as the final classification accuracy.
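The tenfold cross-validation protocol just described can be sketched as follows. This is a generic Python sketch under our own naming (`tenfold_accuracy`, `train_fn`, `predict_fn`); the callables stand in for classifiers such as 'fitcknn' or 'TreeBagger'.

```python
import random

def tenfold_accuracy(samples, labels, train_fn, predict_fn, seed=0):
    """Sketch of the tenfold cross-validation protocol: 10 random folds,
    each used once as the test set while the other 9 form the training set."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::10] for i in range(10)]
    accuracies = []
    for fold in folds:
        test = set(fold)
        train = [i for i in idx if i not in test]
        model = train_fn([samples[i] for i in train],
                         [labels[i] for i in train])
        # Fold accuracy: correctly classified test samples / test set size.
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)
```

The final reported figure is the mean of the 10 per-fold accuracies, exactly as in the protocol above.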

Comparison and analysis of Incremental-2 and covering rough set-based methods
This section compares Incremental-2 with Non-Incremental and Incremental-1 on the fourteen selected data sets. The experimental results are shown in Fig. 3 and Tables 2 and 3. Figure 3 displays how the runtime of Non-Incremental, Incremental-1 and Incremental-2 varies with the successive arrival of sample subsets. In each sub-figure of Fig. 3, the horizontal axis is the index of every incoming sample subset, whereas the vertical axis is the runtime of each comparison method. From Fig. 3, we can observe that both Incremental-2 and Incremental-1 are consistently faster than Non-Incremental as each sample subset arrives. The main reason is that Non-Incremental lacks an effective mechanism for dealing with dynamic data sets. With the arrival of a new sample subset, Non-Incremental has to be rerun to calculate an optimal feature subset from all currently available samples, including the historical samples and the newly added ones. As a result, Non-Incremental costs a large amount of computation time. Furthermore, we can see that Incremental-2 is more efficient than Incremental-1. This is because, as the first nine subsets arrive, Incremental-1 needs to update the optimal feature subset by incrementally updating the discernible relation of each feature, while Incremental-2 merely calculates the relative discernible relations incrementally, with no need to update the optimal feature subset. Therefore, Incremental-2 saves running time as sample subsets continuously arrive.

Table 2 shows the runtime of the three comparison algorithms, where 'Total Time' represents the sum of the runtime of a method on the fourteen data sets. We can observe from Table 2 that the Total Time of Incremental-2 (299070.42 seconds) is much less than that of Non-Incremental (427047.18 seconds) and Incremental-1 (337612.82 seconds).
Moreover, Incremental-2 is more efficient than Non-Incremental and Incremental-1 on each selected data set, including the four suggested data sets. For instance, on 'Annealing,' the runtime of Incremental-2 is 115.26 seconds, which is much less than that of Non-Incremental (620.34 seconds) and Incremental-1 (135.19 seconds); on 'Crowdsourced,' the runtime of Incremental-2 is 125051.27 seconds, which is much lower than that of Non-Incremental (194859.87 seconds) and Incremental-1 (144789.44 seconds). On the suggested data set 'Australian,' the runtime of Incremental-2 is 15.24 seconds, which is less than that of Non-Incremental (121.91 seconds) and Incremental-1 (15.61 seconds); on the suggested data set 'Nursery,' the runtime of Incremental-2 is 32263.92 seconds, which amounts to 80.8% and 92.8% of that of Non-Incremental and Incremental-1, respectively. These facts well demonstrate the time efficiency of Incremental-2.

(Fig. 3: Runtime of Non-Incremental, Incremental-1 and Incremental-2 with the successive arrival of sample subsets)

Table 3 summarizes the size and the classification accuracy of the features selected by Non-Incremental, Incremental-1 and Incremental-2, where 'Raw' is the result on the original data set. In Table 3, 3-NN and RF represent the classification accuracies of the selected features with the k-nearest neighbor classifier and the random forest classifier, respectively. From Table 3, we can see that all three comparison algorithms remove redundant features from each selected data set. The three algorithms choose different features, since Non-Incremental and Incremental-1 compute the optimal feature subset upon the sequential arrival of sample subsets, whereas Incremental-2 finds a reduct only when no further sample subsets arrive.
Although the average number of features selected by Incremental-2 (9) is larger than that of Incremental-1 (4.57), it is smaller than that of Non-Incremental (12.14). These facts imply that, in comparison with Non-Incremental, Incremental-1 and Incremental-2 can delete more redundant features.
In Table 3, the average 3-NN classification accuracy of Incremental-2 (0.79) is higher than that of Non-Incremental (0.78) and Incremental-1 (0.74), and the average RF classification accuracy of Incremental-2 (0.89) is higher than that of Non-Incremental (0.88) and Incremental-1 (0.79). This indicates that, in contrast with Non-Incremental and Incremental-1, the proposed Incremental-2 selects a feasible optimal feature subset and obtains good generalization. We can also see that the 3-NN classification accuracy of Incremental-2 (0.79) is lower than that of 'Raw' (0.81), and the RF classification accuracy of Incremental-2 (0.89) is lower than that of 'Raw' (0.90). This may be because some features removed by Incremental-2 are informative to the 3-NN classifier and the random forest classifier. However, the difference between the 3-NN accuracies of Incremental-2 and 'Raw' is just 0.02, and the difference between their RF accuracies is only 0.01. This implies that Incremental-2 is feasible, given the negligible difference in classification accuracy.
To sum up, in contrast with Incremental-1 and Non-Incremental, our proposed Incremental-2 selects an optimal feature subset more efficiently while sacrificing only a negligible amount of classification accuracy.

Comparison of Incremental-2 and Consistency
In this subsection, we compare Incremental-2 with Consistency (Dash and Liu 2003) on the fourteen selected data sets. Before performing this experiment, we first clarify the executing process of Consistency, by which we obtain the results of the consistency-based feature selection algorithm. As is well known, a feature selection method consists of subset generation, subset evaluation and a stopping criterion. In our experiment, we start from the empty feature set and, at each loop, add to the selected subset the one feature that maximizes the increment of consistency. This is the subset generation step; the subset evaluation is embedded in it by maximizing the increment of consistency. The consistency-based feature selection algorithm does not stop until the consistency of the original feature set is reached.
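The forward search just outlined can be sketched as follows. This is our own illustrative Python rendering of the procedure, with the Dash–Liu consistency measure computed as one minus the inconsistency rate; the function names are ours, not from the original implementation.

```python
from collections import Counter, defaultdict

def consistency(data, labels, feats):
    """Consistency of a feature subset: 1 minus the inconsistency rate of the
    projection of the data onto `feats` (Dash and Liu style)."""
    groups = defaultdict(list)
    for row, y in zip(data, labels):
        groups[tuple(row[f] for f in feats)].append(y)
    # Inconsistent samples in a group: all but the majority-class ones.
    inconsistent = sum(len(g) - max(Counter(g).values())
                       for g in groups.values())
    return 1 - inconsistent / len(data)

def consistency_forward_selection(data, labels):
    """Forward search: repeatedly add the feature that maximally increases
    consistency until the full-set consistency is matched."""
    all_feats = list(range(len(data[0])))
    target = consistency(data, labels, all_feats)
    selected = []
    while consistency(data, labels, selected) < target:
        best = max((f for f in all_feats if f not in selected),
                   key=lambda f: consistency(data, labels, selected + [f]))
        selected.append(best)
    return selected
```

Note that this sketch presumes symbolic features, which is precisely the limitation of Consistency discussed next.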
Consistency is a consistency-based feature selection approach that seeks the minimum subset which classifies samples as consistently as the full feature set under the best-first search strategy. It parallels rough set-based feature selection methods, so it is indispensable to compare it with Incremental-2. However, Consistency can only handle data sets with symbolic features. When facing real-valued data sets, it must discretize the real-valued features; when dealing with missing-valued data sets, it has to fill in the missing feature values. Such preprocessing can easily lead to information loss, which may worsen the classification accuracy of the learning algorithms. Furthermore, due to the different mechanisms, we only compare the size and the classification accuracy of the selected features rather than the runtime. The experimental results are summarized in Table 4.

Table 4 reports the number of features chosen by Incremental-2 and Consistency. From Table 4, we can easily see that the average size of the features chosen by Incremental-2 (9) is smaller than that of Consistency (11.29). In detail, on 'Steel,' the number of features selected by Incremental-2 is only 7, while that of Consistency is 20. On 'Statlog,' the number chosen by Incremental-2 is just 11, whereas that of Consistency is 31. These facts imply that our proposed Incremental-2 can delete more redundant features than Consistency on most of the selected data sets.
Moreover, we can see from Table 4 that the 3-NN accuracy of Incremental-2 (0.79) is higher than that of Consistency (0.78), and the RF accuracy of Incremental-2 (0.89) is better than that of Consistency (0.87). For example, with 6 selected features on 'Dermatology,' the 3-NN accuracy of Incremental-2 is 0.61, while that of Consistency is just 0.48; the RF accuracy of Incremental-2 is 0.69, whereas that of Consistency is 0.65. Furthermore, despite selecting fewer features on some data sets, Incremental-2 achieves better accuracy than Consistency. For instance, on 'Steel,' the 3-NN and RF accuracies of Incremental-2 are 0.59 and 0.75, while those of Consistency are 0.49 and 0.72. These facts indicate that Incremental-2 achieves higher classification accuracy than Consistency.

The effect of the neighborhood radius on Incremental-1 and Incremental-2

In this section, we show the effect of changing the neighborhood radius on Incremental-1 and Incremental-2 for the first nine data sets of Table 1. Here, the neighborhood radius varies from zero to one with a step of 0.05. The experimental results are shown in Figs. 4, 5 and 6.

Figure 4 depicts how the runtime changes with the neighborhood radius. From Fig. 4, we can see that the proposed Incremental-2 is faster than Incremental-1 at nearly every neighborhood radius, confirming the time efficiency of Incremental-2 concluded in Sect. 5.2. The main reason is that, at the arrival of each sample subset, Incremental-2 updates the relative discernible relations without updating the optimal feature subset, thereby saving the runtime of searching for an optimal feature subset each time a sample subset arrives. On the data sets 'Dermatology' and 'Mammographic mass,' which have no real-valued features, no neighborhood relations are generated, so the runtime of Incremental-1 and Incremental-2 fluctuates only slightly with the neighborhood radius. On the remaining seven data sets, the runtime of both Incremental-1 and Incremental-2 changes with the neighborhood radius. Specifically, on 'Credit,' 'Australian,' 'Japan,' 'German' and 'Steel,' the overall trend is that the runtime of Incremental-1 and Incremental-2 increases as the neighborhood radius grows, whereas on 'Annealing' and 'Sick' the overall trend is that it decreases. Therefore, for data sets with real-valued features, the neighborhood radius influences the runtime of Incremental-1 and Incremental-2; meanwhile, Incremental-2 is faster than Incremental-1 at nearly every neighborhood radius.
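The neighborhood relation underlying these experiments can be sketched briefly. This is an illustrative Python sketch under our own naming (`neighborhood_covering`), assuming feature values normalized to [0, 1] to match the radius grid used in this section.

```python
def neighborhood_covering(values, radius):
    """Covering induced by a real-valued feature: each sample's neighborhood
    contains every sample whose feature value lies within `radius` of it."""
    return [{j for j, v in enumerate(values) if abs(v - values[i]) <= radius}
            for i in range(len(values))]
```

With radius 0 the neighborhoods collapse to the equivalence classes of identical values, which is consistent with why purely symbolic data sets such as 'Dermatology' and 'Mammographic mass' are insensitive to the radius.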
Figure 5 shows how the number of features selected by Incremental-1 and Incremental-2 changes with the neighborhood radius. We can see from Fig. 5 that the number of features selected by Incremental-2 is larger than that selected by Incremental-1 at each neighborhood radius. This is because, in contrast with Incremental-2, Incremental-1 removes more redundant features. Moreover, on the two symbolic data sets 'Dermatology' and 'Mammographic mass,' the number of features selected by Incremental-1 and Incremental-2 remains essentially invariant across neighborhood radii, since these data sets do not require generating the neighborhood relations of real-valued features.

Figure 6 shows the variation of the 3-NN classification accuracies with the neighborhood radius. From Fig. 6, we can observe that the 3-NN classification accuracy of Incremental-2 is higher than that of Incremental-1 at nearly every neighborhood radius. This may be because Incremental-1 removes some features that are informative to the 3-NN classifier. Furthermore, on the two symbolic data sets 'Dermatology' and 'Mammographic mass,' the 3-NN classification accuracies of Incremental-1 and Incremental-2 fluctuate only slightly, again because these data sets do not need the neighborhood relations of real-valued features.
In summary, the neighborhood radius affects the performance of Incremental-1 and Incremental-2 on data sets with real-valued features, and determining a proper neighborhood radius remains difficult. In contrast with Incremental-1, Incremental-2 obtains a satisfactory feature subset in a shorter time at nearly every neighborhood radius. Compared with Incremental-1, Non-Incremental and Consistency, Incremental-2 can efficiently obtain an optimal feature subset from mixed data sets without sacrificing too much classification accuracy.

Conclusion
In the covering rough set model, an incremental feature selection scheme is investigated to accelerate the computation of a reduct from large mixed data by segmenting it into a sample subset sequence. The incremental scheme reveals the strategies of adding informative features and removing redundant features by incrementally updating the relative discernible relation of every feature. Two incremental feature selection approaches are presented based on these strategies. The first incrementally calculates both the relative discernible relations and the optimal feature subset as sample subsets arrive continuously, and returns the final reduct when all subsets have been added. The second merely renews the relative discernible relations as sample subsets arrive successively, and then calculates the reduct when all subsets have arrived. Our results show two facts: (1) the presented incremental methods accelerate the acquisition of the optimal feature subset in covering rough sets without sacrificing too much classification accuracy; (2) our proposed Incremental-2 is more efficient than Incremental-1. On the basis of the above theoretical and experimental results, further studies are planned as follows: (1) we will apply the presented incremental methods to practical applications; (2) we will investigate the effect of the number of sample subsets on the runtime and the classification accuracy of the proposed incremental methods; (3) to improve the classification performance of the learning algorithms, we will investigate how to establish novel incremental feature selection methods in covering rough sets.