Pearson’s Redundancy Multi-Filtering with BAT Algorithm for Selecting High Dimensional Imbalanced Features


 Feature selection plays a vital role in every data analysis application. It aims to choose a prominent set of features after removing redundant and irrelevant features from the original feature set. High-dimensional datasets pose a challenging task for machine learning algorithms, and many state-of-the-art solutions have been developed to handle this issue. High dimensionality combined with class imbalance in the dataset makes the task even harder. To overcome this issue, this paper introduces a novel method, the Pearson's Redundancy Based Multi Filter algorithm with an improved BAT algorithm (PRBMF-iBAT), to obtain multiple feature subsets. PRBMF is implemented using multiple filters to obtain highly relevant features. The iBAT algorithm then uses these features to find the best subset of features for classification. The results show that PRBMF-iBAT performs better in terms of Accuracy, Precision, Recall and F-Measure on three microarray datasets with an SVM classifier. The proposed system achieves a highest accuracy of 97.99%, outperforming the existing rCBR-BGOA algorithm.


Introduction
The advancement in technology has paved the way for the enormous growth of data in various sectors such as banking, healthcare, communications, media, education, transportation, consumer trade and sports. This in turn has driven many researchers towards developing classifier models. Even though many models have been created and tested, classifying high-dimensional imbalanced data still remains a leading issue in the research community. Microarray data are characterized by high dimensionality, a small number of samples and imbalanced class distribution. Data is said to be high dimensional when it has a large number of attributes, many of which may be irrelevant or redundant. This raises many issues, such as high computational cost, high memory usage, difficulty in interpreting the model and a decline in prediction accuracy.
High-dimensional datasets reduce the performance of a classifier model; to overcome this issue, feature selection is essential. Feature selection is the most significant technique for selecting relevant features when developing a model, improving prediction performance while reducing computational cost [1]. It reduces the number of features by selecting the appropriate ones for building the model.
In addition to high dimensionality, the machine learning and data mining communities have turned their attention towards imbalanced class distribution. Most datasets are both imbalanced (an unequal ratio of positive and negative samples) and high dimensional (many attributes). Classes with a greater number of samples are termed majority classes, and classes with a smaller number of samples are called minority classes. Conventional classification algorithms tend to favour the majority class when the class distribution is unequal, and their performance degrades further with high dimensionality. Various solutions have been proposed to handle this imbalance issue, including sampling techniques [2,3], cost-sensitive learning methods [4,5] and ensemble techniques [6]. Resampling techniques include oversampling and undersampling: the former increases the number of samples in the minority class, whereas the latter reduces the number of samples in the majority class. The most commonly used oversampling technique is the Synthetic Minority Over-sampling Technique (SMOTE) proposed by Chawla et al. [7].
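The interpolation idea behind SMOTE can be illustrated with a minimal numpy sketch (this is not the paper's implementation; the function and parameter names are ours):

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between each minority sample and one of its k nearest minority
    neighbours, as in SMOTE (Chawla et al.)."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        # distances from sample i to every minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```

In practice, enough synthetic samples are generated to bring the minority class up to the size of the majority class.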
In many situations exhaustive search is not feasible, so global search methods such as evolutionary algorithms are used, and many researchers have hybridised such approaches to select relevant features. To handle both high dimensionality and class imbalance, the main contributions of this research are as follows.
1. A data pre-processing technique is proposed in which feature selection is combined with data sampling.
2. In addition, a meta heuristic algorithm is improvised to choose optimal subset of features.
The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 provides pedestals of the proposed method followed by proposed methodology in Section 4. Section 5 elaborates the experimental results and finally, the conclusion and future work are summarized in Section 6.

Related Work
Mohan Allam et al. [8] proposed a Feature Selection - Iterative Teaching Learning approach for Parkinson's Disease datasets. The proposed algorithm proved better in terms of training time and error rate by selecting the best optimal features. Its major limitation is that it does not process irrelevant high-dimensional attributes.
Poolsawad et al. [9] investigate resampling techniques to evaluate classification algorithms' performance. The authors state that resampling is one of the easiest methods to balance the minority and majority class proportions. The work shows that both methods have an impact on sensitivity and reduce the prediction error rate for the minority class. The authors conclude that reducing the majority class samples is the more suitable resampling method for clinical datasets, since the error rates of the minority class are reduced.
The major disadvantage of this approach is that oversampling might duplicate the minority samples, while undersampling reduces the number of instances. It is also very important that the number of instances in each class be large enough for the classifier model to fit.
Sotiris K. et al. [10] state that an imbalanced dataset is one which has fewer instances in one class compared to the other. Learning classifiers from imbalanced datasets creates problems in domains such as feature selection, information retrieval, medical diagnosis and text classification. The primary level of solution to this problem is the data level, and the secondary level is the algorithmic level. Data-level solutions are categorized into undersampling and oversampling approaches, while algorithmic solutions are based on learning techniques. This method seems to be less effective for larger datasets.
MINDEX_IB, a partition-based feature selection algorithm, was proposed by Hemlata Pant and Reena Srivastava [11] for handling imbalanced datasets. This method uses micro-clustering for partitioning the attribute domain and finds the relevance of an attribute from the statistical measures of the microcluster.
Haoyue Liu et al. [12] proposed a weighted Gini index (FI-FSw) to deal with imbalanced classification problems. The authors revised the original Gini index using an imbalance-ratio-dependent weight. The main limitation of this work is the difficulty of selecting the first feature when multiple features achieve the same frequency.
Kuan-Ching et al. [13] introduced a cost-sensitive feature selection algorithm. Considering the issue that classifiers favour the majority class on imbalanced datasets, Zhang et al. [15] used the F-measure as the performance metric for their feature selection method.
They introduced a Support Vector Machine (SSVM) algorithm which selects relevant features by making use of a weighted vector along with symmetric uncertainty, integrated with the harmony search algorithm to choose optimal feature subsets.
Zhen et al. [16] proposed WELM to solve multi-class imbalance problems, combining data-level handling, algorithmic-level handling and an ensemble technique to improve classifier performance. A class-oriented feature selection method is used at the data level; the algorithmic level uses a modified extreme learning machine (ELM) to improve the input nodes with high discrimination power; and an ensemble technique is further used to strengthen the model. WELM proved effective and outperformed other existing methods when applied to eight genetic datasets.
The multiclass imbalanced class distribution issue is addressed by Du et al. [17] using an improved genetic algorithm, with EG-mean as a fitness function that favours the minority class. Du showed that this method improves the precision rate of the minority class in multi-class imbalanced datasets with respect to the size of the feature subsets.
Rung-Ching Chen et al. [18] clearly showed that Random Forest is the best classifier by comparing RF with different classifiers. The authors combined Random Forest (RF), Support Vector Machine (SVM), K-Nearest Neighbor (KNN) and Linear Discriminant Analysis (LDA) with different feature selection methods (RF, RFE and Boruta) to select the best classifier based on the accuracy of each.

Pearson's Redundancy Based Filter (PRBF)
Pearson's χ² test is used to measure the difference between the probability distributions of two variables. Consider n independent observations of two random variables X and X′ in the training data. The Pearson χ² test is valid only if the number of observations is greater than 100.
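For illustration, the Pearson χ² statistic between two discrete (or discretized) features can be computed from their contingency table; this sketch and its names are ours, not the paper's code:

```python
import numpy as np

def pearson_chi2(x, y):
    """Pearson chi-squared statistic between two discrete variables:
    sum over contingency-table cells of (O - E)^2 / E."""
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    observed = np.zeros((len(xs), len(ys)))
    np.add.at(observed, (xi, yi), 1)          # build contingency table
    total = observed.sum()
    # expected counts under independence of x and y
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / total
    return ((observed - expected) ** 2 / expected).sum()
```

A large statistic indicates that the two variables' distributions differ markedly from what independence would predict.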

Fisher Score Filter Approach
Fisher score, proposed by Gu et al. [21], is a heuristic approach for calculating a feature's score using the Fisher ratio. Let μ_(i,k) and σ_(i,k) be the mean and standard deviation of the i-th feature in the k-th class, μ_i the overall mean of feature i, and n_k the number of samples in class k. The Fisher score for feature i can then be calculated as

F(i) = Σ_k n_k (μ_(i,k) − μ_i)² / Σ_k n_k σ_(i,k)²

Features with the highest Fisher score values are selected. These selected features are suboptimal, since the scores are evaluated individually; the main disadvantage of the Fisher score is that it fails to select features with high aggregated discriminative power and may retain redundant ones.
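The score above can be sketched directly (a minimal illustration under the usual Fisher-score definition; names are ours):

```python
import numpy as np

def fisher_score(x, y):
    """Fisher score of one feature x w.r.t. class labels y:
    sum_k n_k (mu_k - mu)^2 / sum_k n_k sigma_k^2."""
    mu = x.mean()
    num = den = 0.0
    for c in np.unique(y):
        xc = x[y == c]
        num += len(xc) * (xc.mean() - mu) ** 2   # between-class scatter
        den += len(xc) * xc.var()                # within-class scatter
    return num / den
```

A feature whose class-conditional means are far apart relative to its within-class variance receives a high score.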

BAT Algorithm
The BAT algorithm, proposed by Yang [22], is based on the echolocation behaviour of microbats and mimics their behaviour while catching prey. Bats use echolocation to find prey; when a bat gets closer to its prey, its pulse rate and frequency increase while its loudness decreases. A set of interactive parameters (position, velocity, pulse rate, loudness and frequency) is assigned to each bat; these affect both the quality of the solution and the time needed to obtain it, which makes the algorithm more complicated than other metaheuristic algorithms [23]. Each bat i updates its frequency, velocity and position at time step t as
f_i = f_min + (f_max − f_min) β .………… (4)
v_i^t = v_i^(t−1) + (x_i^(t−1) − x*) f_i .………… (5)
x_i^t = x_i^(t−1) + v_i^t .………… (6)
where β ∈ [0,1] is a random number and x* is the global best solution, found after comparing the solutions of all n bats. After obtaining the global best solution, a new solution for each bat is generated using a random walk:
x_new = x_old + ε A^t .………… (7)
where ε ∈ [−1,1] is a random number and A^t is the average loudness of all the bats at time t.
As the iterations proceed, each bat's loudness A_i and pulse emission rate r_i are updated: the loudness is decreased and the pulse rate increased whenever the bat accepts an improved solution.
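The update loop described above can be put in a compact sketch (illustrative parameter values and names, not those of the paper):

```python
import numpy as np

def bat_minimize(fn, dim, n_bats=20, iters=100, fmin=0.0, fmax=2.0,
                 alpha=0.9, gamma=0.9, seed=0):
    """Minimal sketch of Yang's BAT algorithm minimizing fn over [-5, 5]^dim."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_bats, dim))   # positions
    v = np.zeros((n_bats, dim))             # velocities
    A = np.ones(n_bats)                     # loudness
    r0 = rng.uniform(0, 1, n_bats)          # initial pulse rates
    r = r0.copy()
    fit = np.apply_along_axis(fn, 1, x)
    best = x[fit.argmin()].copy()
    for t in range(1, iters + 1):
        for i in range(n_bats):
            f = fmin + (fmax - fmin) * rng.random()   # eq. (4)
            v[i] += (x[i] - best) * f                 # eq. (5)
            xi = x[i] + v[i]                          # eq. (6)
            if rng.random() > r[i]:                   # local random walk, eq. (7)
                xi = best + rng.uniform(-1, 1, dim) * A.mean()
            fi = fn(xi)
            if fi <= fit[i] and rng.random() < A[i]:  # accept new solution
                x[i], fit[i] = xi, fi
                A[i] *= alpha                             # decrease loudness
                r[i] = r0[i] * (1 - np.exp(-gamma * t))   # increase pulse rate
            if fi < fn(best):
                best = xi.copy()
    return best, fn(best)
```

On a simple convex function such as the sphere function, this sketch steadily drives the population towards the optimum.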

Proposed PRBMF-iBATMethodology
The proposed PRBMF algorithm proceeds as follows.
Step 1: Apply the SMOTE algorithm to the high-dimensional imbalanced dataset to obtain a balanced class distribution.
Step 2: For each feature f_i in the training data and class C, initialize S = ∅.
Step 3: For i = 1 to m (m is the number of features in the dataset)
Step 4: Calculate the Fisher score value of feature f_i in the training data.
Step 5: End for
Step 6: Sort the features by their Fisher score values.
Step 7: Choose the top-ranked feature set S.
Step 8: Calculate the feature-feature SU and feature-class SU for each feature f_i in S.
Step 9: For i = 1 to n
Step 10: For j = i + 1 to n
Step 11: Calculate SU(i,c), SU(j,c) and SU(i,j).
Step 12: If SU(i,c) >= SU(j,c) and SU(i,j) >= SU(j,c)
Step 13: remove f_j; else add f_j to the selected set of features FS.
Step 14: End for
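Steps 8-13 can be sketched using the standard symmetric-uncertainty definition, SU(X,Y) = 2·I(X;Y)/(H(X)+H(Y)) (a minimal illustration; the names and tie-breaking details are ours):

```python
import numpy as np

def entropy(x):
    """Shannon entropy (bits) of a discrete variable."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    joint = np.array([f"{a}|{b}" for a, b in zip(x, y)])  # paired symbols
    mi = hx + hy - entropy(joint)   # I(X;Y) = H(X) + H(Y) - H(X,Y)
    return 2 * mi / (hx + hy) if hx + hy else 0.0

def su_redundancy_filter(X, c):
    """Steps 9-13: drop f_j when f_i is at least as relevant as f_j
    and their mutual redundancy SU(i,j) is at least f_j's relevance."""
    n = X.shape[1]
    removed = set()
    for i in range(n):
        if i in removed:
            continue
        for j in range(i + 1, n):
            if j in removed:
                continue
            su_ic = symmetric_uncertainty(X[:, i], c)
            su_jc = symmetric_uncertainty(X[:, j], c)
            su_ij = symmetric_uncertainty(X[:, i], X[:, j])
            if su_ic >= su_jc and su_ij >= su_jc:
                removed.add(j)
    return [k for k in range(n) if k not in removed]
```

A feature that is an exact copy of an already-kept, more relevant feature is always removed, since SU of a variable with itself is 1.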

Proposed Improved BAT Algorithm
An improved BAT algorithm is applied to the output of phase I. All the features from phase I are given as input to the iBAT algorithm, which yields the optimal subset of features. In the iBAT algorithm, a new bat position is obtained by calculating the mean between the previous position and the current best solution x*.
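This mean-based position update can be stated in a one-line sketch (the function name is ours):

```python
import numpy as np

def ibat_position_update(x_old, x_best):
    """iBAT position update: move to the mean of the bat's previous
    position and the current global best solution x*."""
    return (np.asarray(x_old, dtype=float) + np.asarray(x_best, dtype=float)) / 2.0
```

Each step therefore pulls a bat halfway towards the global best, which tightens the search around promising regions faster than the standard velocity-driven update.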

Experimental Results and Discussions
The effectiveness of the proposed PRBMF-iBAT on high-dimensional imbalanced datasets is discussed in this section. Three microarray datasets are used to evaluate the proposed method. Implementation was carried out on an Intel Core i5 @ 2.42 GHz CPU with 16 GB RAM on the Microsoft Windows 10 platform, using MATLAB R2015. The datasets used are Lung Cancer [14], Prostate Tumor from the repository, and SRBCT [14]. The details of the datasets are listed in Table 1.
Classifiers tend to be biased towards the majority class, resulting in high accuracy while the minority class receives little consideration, so accuracy alone is not an appropriate measure for imbalanced datasets; the F-measure is a more appropriate measure to evaluate the efficiency of the proposed method. The proposed PRBMF-iBAT is compared with the Robust Correlation Based Redundancy and Binary Grasshopper Optimization Algorithm (rCBR-BGOA) [14] for the Support Vector Machine, and five-fold cross validation is performed to compare the results. Figure 3 shows a pictorial representation of the comparative analysis of the proposed and existing approaches. The two methods are also compared on computational time; Figure 4 clearly shows that the proposed method classifies the data in less time than the existing method for all three datasets.
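The reported metrics follow directly from the binary confusion matrix, as in this generic sketch (not the paper's MATLAB code):

```python
import numpy as np

def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F-measure for a binary task
    (positive class = 1)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1
```

Because precision and recall both focus on the positive (minority) class, the F-measure penalizes a classifier that simply predicts the majority class, unlike raw accuracy.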

Conclusion and Future Enhancements
The objective of the proposed method is to select an optimal set of features from high-dimensional imbalanced datasets. A novel approach, PRBMF-iBAT, is proposed to overcome the issues in selecting features from such datasets. PRBMF-iBAT involves three phases: the first phase applies the SMOTE algorithm to handle the imbalance issue, followed by PRBMF ensembled with the improved BAT algorithm to choose the optimal set of features. The results obtained in this paper show that the proposed PRBMF-iBAT outperforms the existing rCBR-BGOA approach in terms of accuracy and computational time.

Conflict of Interest
This paper has not been communicated anywhere else; it is now communicated only to your esteemed journal for publication, with the knowledge of all co-authors.

Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.

Funding
There is no funding.