Automated Feature Selection and Classification for High-Dimensional Biomedical Data

Background: Automated machine learning (AutoML) aims to automate the building of accurate predictive models, including the creation of complex data preprocessing pipelines. Although successful in many fields, AutoML tools struggle to produce good results on biomedical datasets, especially given the high dimensionality of the data. Results: In this paper, we explore the automation of feature selection in these scenarios. We analyze which feature selection techniques are ideally included in an automated system, determine how to efficiently find the ones that best fit a given dataset, integrate this into an existing AutoML tool (TPOT), and evaluate it on four very different yet representative types of biomedical data: microarray, mass spectrometry, clinical and survey datasets. We focus on feature selection rather than latent feature generation since we often want to explain the model predictions in terms of the intrinsic features of the data. Conclusion: Our experiments show that for none of these datasets do we need more than 200 features to accurately explain the output; additional features did not increase the quality significantly. We also find that the automated machine learning results improve significantly after adding additional feature selection methods and prior knowledge on how to select and tune them.


Introduction
Although biomedical datasets often contain a large number of features, only a subset of these features is usually relevant for the modeling and analysis in a particular application. Models that explain the output based on only the most relevant features are desirable because of computational efficiency as well as model interpretability. Therefore, there is a clear benefit in reducing the number of features by removing the irrelevant ones. There exists extensive work in feature selection [1,2], which has resulted in multiple selection methods [3,4,5,6] and evaluations of their performance [7,8].
In this work, we aim to automate the task of feature selection, especially for high-dimensional datasets, which are prevalent in biomedical studies but less well covered by currently used automated machine learning (AutoML) methods.
We evaluate a wide range of feature selection methods to make an informed decision on which methods should be incorporated into automated machine learning frameworks. In particular, we consider three classes of methods: filter methods, wrapper methods and embedded methods, and integrate them as additions to TPOT, an automated machine learning tool, to explore the benefits of these new components. As part of this integration, we introduce a quality metric which, apart from the accuracy of the resulting models, also takes into account the number of preserved features, balancing accuracy against interpretability. Different combinations of feature selection methods and meta-parameters are tested according to this metric. To evaluate the effectiveness of the resulting framework, four datasets were used.
In the remainder of this paper, we first cover related work and the contributions of this paper in Section 2. Section 3 provides the necessary background. Section 4 describes our method for adapting TPOT to work better on high-dimensional data. Section 5 covers our experimental setup, including details of the datasets under study. Section 6 interprets the results of our experiments and Section 7 concludes the work with a discussion.

Related work
A good overview of basic feature selection techniques can be found in [9]. Numerous algorithms have been proposed for feature selection [3,4,5,6,10,11] and various performance evaluation tests have been developed [1,7,8,12,13]. Most of these focus on datasets with much smaller numbers of features (no more than 1,000) than are commonly encountered in biomedical datasets. Other feature selection research focuses on specific types of feature selection, e.g. wrapper methods [14,15,16] or embedded methods [17]. A small number of articles focuses on larger feature set sizes. Chen et al. [18] considered larger general datasets and used support vector machines for feature selection. Xing et al. [19] focused on feature selection for a microarray dataset of about 7,000 features and showed that feature selection is beneficial. Other work looked at 35 datasets with between 37 and 50,000 features, trying to identify the best performing feature selection types [20]. They used five advanced, less generally applicable algorithms, whereas the ones applied in our work are more general and intuitive. Some other examples of automated machine learning approaches that search for pipelines and perform dimensionality reduction are [21,22,23].
Georges et al. [24] compared feature selection techniques with regard to the similarity in feature subset outcome. They also considered feature selection based on dataset reproducibility. The tested techniques are rather specific and as such less applicable for an extension for an automated machine learning tool.
Pes et al. [25] focused on feature selection for imbalanced classes. They conclude that feature selection contributed quite significantly to the efficiency of the algorithms, but that no algorithm clearly outperformed the others. This work also highlights the need for feature selection algorithms in automated machine learning, specifically for imbalanced datasets.
Panicker et al. [26] investigated the quality of feature selection algorithms and their influence on the classification quality. The outcome was that feature selection is beneficial for classification, however there is no clear best feature selection algorithm.
The main characteristics of our work on feature selection for biomedical data are the following:
• To balance predictive accuracy and model interpretability, we aim to find feature subsets that are as small as possible while keeping the performance as high as possible. This was seldom a goal in previous works [27]. To this end, we developed a new metric, the FS score, that incorporates both the feature subset size and predictive performance. The utility of this score is evaluated on different case studies.
• We focus on high-dimensional biomedical datasets with more than 1000 features, where feature selection has an especially large impact. This is not often done, whereas these types of datasets are becoming increasingly prevalent.
• We leverage very general feature selection algorithms, and as a result we expect that non-expert users can more easily understand and trust the techniques. Moreover, the considered methods are also quite diverse in their properties, and our analysis shows the benefits and drawbacks of different types of methods.
• Automation of feature selection techniques is added to the framework. Automated machine learning is a very promising approach to extracting information from a dataset. Properly implementing feature selection in automated machine learning opens it up for high-dimensional datasets.

Background
Here we give a brief overview of the considered feature selection methods. We also discuss the concept of automated machine learning as a possible technique to be added to the framework, together with a tool that implements it.

Feature Selection Methods
Feature selection is a way to perform dimensionality reduction: a subset of features is chosen to represent the complete sample space [28]. Several techniques are available to choose a representative subset, and their effectiveness has been tested in numerous works [12,29]. These techniques can be grouped into three categories: filter methods, wrapper methods and embedded methods [9]. A thorough explanation of these three categories can be found in Saeys et al. [9].
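The three categories can be illustrated with one scikit-learn representative each. This is only a sketch on toy data, not the paper's exact configuration; the specific estimators and parameter values below are illustrative choices.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectFromModel, SelectKBest,
                                       SequentialFeatureSelector, f_classif)
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a high-dimensional dataset.
X, y = make_classification(n_samples=80, n_features=20, n_informative=5,
                           random_state=0)

# Filter: rank each feature individually (ANOVA F-test), keep the top 5.
X_filter = SelectKBest(f_classif, k=5).fit_transform(X, y)

# Wrapper: forward selection, scoring candidate subsets with a classifier.
sfs = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                n_features_to_select=5, direction="forward")
X_wrapper = sfs.fit_transform(X, y)

# Embedded: an L1-penalised model zeroes out coefficients of unused features,
# so selection happens as a by-product of training.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
X_embedded = SelectFromModel(lasso, prefit=True).transform(X)
```

Note the characteristic trade-off visible even here: the filter runs one statistical test per feature, the wrapper refits a classifier for every candidate subset, and the embedded method pays only for a single (regularised) model fit.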

Automated Machine Learning: Tree-based Pipeline Optimization Tool
Tree-based Pipeline Optimization Tool (TPOT) is a Python-based tool that implements automated machine learning (AutoML). Using genetic programming over tree structures, candidate pipelines are formed and evaluated to find the best solution for a particular dataset. The pipeline backbones consist of preprocessing and machine learning algorithms from scikit-learn, but several additional algorithms are present as well (e.g. a one-hot encoding algorithm). Considering the capabilities of TPOT to cope with challenges in biomedical data, several methods are available. It has several normalisation/standardisation scalers (StandardScaler, RobustScaler, MinMaxScaler) to tackle feature heterogeneity between different datasets and errors. It also has some feature selection operators (VarianceThreshold, SelectKBest, SelectPercentile). Wrapper methods, which may perform better in some situations, are not included, however. Missing values are imputed with the median before the evolutionary algorithm starts, and non-numeric data needs to be converted manually.
A wide variety of algorithms is present in TPOT, as well as numerous hyperparameter values for those algorithms. This is both an advantage and a disadvantage, since it is very likely that many, if not most, combinations of algorithms and hyperparameters are not a useful outcome. TPOT chooses mutations randomly, but better algorithms could possibly be selected based on guiding heuristics; for example, meta-features could be used to improve algorithm selection.

Feature Selection Quality
To estimate the quality of the feature selection, multiple machine learning algorithms are explored [30]. Classification of the datasets is done with several machine learning algorithms, and validation and test scores show how well the data can be classified after feature selection. The quality is described by the accuracy of the machine learning algorithms: the number of correct classifications divided by the total number of classifications. In the evaluation, five different classifiers from scikit-learn are used: logistic regression, decision trees, nearest neighbour, support vector machines and Naive Bayes. Better feature selection algorithms have relatively high accuracy, as they are better at preserving the right features. To capture possible overfitting, both a validation and a test score are computed for the accuracy. For every experiment a training set and a test set are created, with the test set being 25% of the complete dataset. The validation score is computed using the leave-one-out technique on the training set, and the test score is computed as the classification score on the test set. Since the samples are not evenly distributed over the classes, the precision, recall and F1 scores are also computed to find potential bias in the result.
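Assuming a scikit-learn setup, the evaluation protocol above can be sketched as follows; the dataset and classifier here are small stand-ins, not the paper's data.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import (LeaveOneOut, cross_val_score,
                                     train_test_split)

X, y = load_wine(return_X_y=True)

# 25% of the data is held out as the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=5000)

# Validation score: leave-one-out accuracy on the training set only.
val_acc = cross_val_score(clf, X_tr, y_tr, cv=LeaveOneOut()).mean()

# Test score: accuracy on the untouched 25% split.
test_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)

# F1 (here macro-averaged) guards against bias from uneven class sizes.
test_f1 = f1_score(y_te, clf.predict(X_te), average="macro")
```

A large gap between `val_acc` and `test_acc` would be the overfitting signal the text describes.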
In some cases it is convenient for the quality metric to take the feature subset size into account; standard metrics such as accuracy are not sufficient to capture this aspect. Therefore, we introduce a new quality metric, the FS score (Equation 1). In the FS score, the original score, in this work accuracy, is adjusted by multiplying it by a factor that depends on the number of features. This factor contains a constant β, a value in the range [0, 1] that can be chosen to express how strongly the number of features influences the score. In practice β < 0.025, so the reduction does not have too much influence on the score.
The factor #features is the absolute feature count rather than a relative one (for example a percentage of features). The reason for using an absolute count lies in the goal of the data analysis: the number of relevant features can be quite limited. Useful results need relations between the input and the output, and these relations need to be as simple as possible, which, among other things, is reflected in a smaller number of features. If the number of features were relative to the total number of features, the input size could change significantly. Take for example the Micro-organisms and the Psoriasis datasets provided for this study: 2% of the total features would be 26 and 1,100 features for these datasets, respectively. Intuitively, the relations between 26 features and the output should be easier to grasp than relations involving 1,100 features.
Figure 1: An example of the impact of the correction factor on the score, in this case accuracy. The shown correction factor uses β = 0.005. The x-axis shows the number of features and the y-axis the value of the original accuracy, the correction factor and the FS score.
To put the impact of the FS score in perspective, Figure 1 gives an example of how the outcome is computed. The figure shows the impact on the performance of a method when using a certain number of features. After applying the correction factor, an optimum emerges for which the number of features matters, whereas the old optimal score did not take the number of features into account. The example (with β = 0.005) gives a good indication of the desired outcome. For every 10 features, the FS score is reduced by about 10%, which meant for the example filter method that an optimum was reached at 50 features (Figure 1). Because of this empirically found trade-off, β = 0.005 is chosen in this project when using the FS score.
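Since Equation 1 itself is not reproduced in this text, the sketch below assumes a multiplicative per-feature penalty of (1 − β), which matches the described behaviour only qualitatively; the paper's exact correction factor may differ.

```python
def fs_score(accuracy, n_features, beta=0.005):
    """Accuracy discounted by the (absolute) number of kept features.

    ASSUMED penalty form: each kept feature multiplies the score by (1 - beta).
    """
    return accuracy * (1.0 - beta) ** n_features

# A fixed accuracy is worth less as the subset grows ...
assert fs_score(0.80, 50) > fs_score(0.80, 200)
# ... so a slightly less accurate but far smaller subset can win.
assert fs_score(0.78, 20) > fs_score(0.80, 200)
```

Whatever its exact form, the key property is the one these assertions check: the metric trades a small amount of accuracy for a much smaller, more interpretable feature subset.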

TPOT feature selection integration
As already mentioned, TPOT is an effective tool for finding the best machine learning pipeline for a certain dataset. Two restrictions hinder optimization regarding feature selection:
1. Lack of warm start. TPOT has a vast array of machine learning and preprocessing algorithms to find the best possible pipeline. Because the number of possibilities is very high, a lot of time may be wasted searching in wrong directions. For feature selection, a pre-defined selection of pipelines (also known as a warm start) would improve efficiency.
2. Limited feature selection possibilities. Several filter and embedded methods are present (Subsection 3.1), but all of these select a percentage of the features, so a large number of features can still be present in the result. On top of that, no wrapper methods are available.
To resolve these restrictions, two corresponding additions are made to TPOT.
Focused feature selection An option is added to always start the original population with a feature selection algorithm in the pipeline. Due to this start, the search for a good feature selection method is bypassed immediately, which should result in a more optimized final pipeline.

Alternative feature selection algorithm set An option is added to use an alternative set of feature selection algorithms. This set consists of filter, wrapper and embedded methods, with hyper-parameters predefined to use an upper bound of 200 features. This upper bound was chosen based on the aforementioned FS score, as the correction factor already has a big impact at 200 features: a factor of 0.2 on the final score.
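One hypothetical way to express such a restricted algorithm set is through TPOT's custom `config_dict` mechanism, which maps an operator's import path to a hyper-parameter grid. The operators and grids below are illustrative assumptions, not the paper's actual set.

```python
# Hypothetical TPOT config fragment capping selection at 200 features.
fs_config = {
    # Filter: the absolute number of selected features never exceeds 200.
    "sklearn.feature_selection.SelectKBest": {
        "k": list(range(10, 201, 10)),
        "score_func": {"sklearn.feature_selection.f_classif": None},
    },
    # Embedded: selection from an L1-penalised linear model.
    "sklearn.feature_selection.SelectFromModel": {
        "threshold": [1e-4, 1e-3, 1e-2],
        "estimator": {
            "sklearn.linear_model.LogisticRegression": {
                "penalty": ["l1"],
                "solver": ["liblinear"],
                "C": [0.01, 0.1, 1.0],
            }
        },
    },
}
# Such a dict would then be passed as TPOTClassifier(config_dict=fs_config, ...).
```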

Implementation
All methods and additions to TPOT are implemented in Python version 3.6 within an Anaconda 3 environment. The code is available at https://github.com/TimBeishuizen/cBioF.

Datasets
Four datasets were used as a case study for the feature selection algorithms. Two are microarray datasets used for research on psoriasis [31,32,33,34] and cancer [35]. The other two are mass spectrometry datasets, used for research on cancer [36] and micro-organisms [37]. All four datasets have a large number of features, varying from 1,300 to 54,675, with a number of samples varying from 200 to 580 (Table 2). All of these datasets are related to classification, since tests are done for different test subject groups; hence the emphasis on feature selection algorithms for classification. Considering the large number of features, it can be expected that many of them are irrelevant or redundant and as such lend themselves to feature selection.

Micro-array Datasets
Micro-array data contains information on the expression levels of a large number of genes [38]. These expression levels are usually used for genomics-level research, such as genome mapping, transcription factor activity analysis and pathogen identification.
Micro-array data are known to present challenges in data quality and need normalisation [39]. The thousands of features present in the data also indicate that feature selection is very important, so that irrelevant genes can be ruled out [37]. Aside from that, size may also be an issue: due to the size of the data, not all analyses will perform optimally in both time and quality. We use two microarray datasets:
• Psoriasis microarray dataset This dataset comprises five different datasets consisting of 54,675 features, all corresponding to gene expression [31,32,33,34]. Samples were collected from three different test subject groups: affected skin from test subjects suffering from psoriasis (214 samples), unaffected skin from test subjects suffering from psoriasis (209 samples) and skin from healthy test subjects (85 samples). Combining these three sample types gives 508 samples. Since the data comes from five different experiments, the data is normalized for every experiment.
• Arcene: Cancer microarray dataset This dataset, called the Arcene dataset, is used in a challenge focusing on classification problems with a low number of samples but a large number of features [35]. It has the same number of features as the Psoriasis dataset, 54,675, corresponding to gene expression. It also has 383 samples corresponding to nine different test subject groups. The challenge did not provide labels for the test subject groups. The groups also differ in size: one group contains 150 samples and the others vary from 16 to 47 samples.
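The per-experiment normalisation mentioned for the psoriasis data can be sketched as a within-batch z-score. This is an assumed implementation for illustration, not necessarily the procedure used in the study.

```python
import numpy as np

def normalise_per_experiment(X, experiment_ids):
    """Z-score every feature within each source experiment separately,
    so batch-level shifts between experiments do not dominate the
    combined dataset."""
    X = X.astype(float).copy()
    for exp in np.unique(experiment_ids):
        rows = experiment_ids == exp
        mu = X[rows].mean(axis=0)
        sd = X[rows].std(axis=0)
        # Guard against constant features (zero standard deviation).
        X[rows] = (X[rows] - mu) / np.where(sd == 0.0, 1.0, sd)
    return X

# Two toy "experiments" with different offsets end up on a common scale.
rng0, rng1 = np.random.RandomState(0), np.random.RandomState(1)
X = np.vstack([rng0.normal(5, 1, (10, 3)), rng1.normal(-3, 2, (10, 3))])
ids = np.array([0] * 10 + [1] * 10)
Xn = normalise_per_experiment(X, ids)
```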

Mass Spectrometry Datasets
Mass spectrometry data contains information on proteins and peptides [40,41]. Analysis of this information is used in studies about proteins, e.g., proteomics [42]. Mass spectrometry is a technique that can be used to find how much of a certain protein is present [43]. Challenges in the analysis of mass spectrometry datasets are mainly found in the way the data is produced. Expression is given for several proteins and can differ significantly between proteins, so some type of normalisation is useful. Aside from that, a large number of proteins is usually tested at the same time, so feature selection can be helpful to remove irrelevant features. We use two mass spectrometry datasets:

• RSCTC: Cancer mass spectrometry dataset
This dataset was created in the context of a classification problem to distinguish cancer patterns from normal patterns [36]. It was created for the 'Neural Information Processing Systems' conference by merging three mass spectrometry datasets. It consists of 10,000 features corresponding to either spectra of the mass spectrometry or probe variables without any predictive power. The samples are taken from two groups: patients with ovarian or prostate cancer and control patients. No labels are given to the groups; however, it is known that one group has 88 samples and the other 112, for a total of 200 samples.

• Micro organisms mass spectrometry dataset
This dataset is created to back up a proposed method for routinely performing direct mass spectrometry based bacterial species identification [37]. It consists of 1,300 features corresponding to different spectra of the mass spectrometry data and 20 test subject groups corresponding to Gram positive and negative bacterial species. Gram classification is a result of a Gram stain test [44]. The groups differ in size varying from 11 to 60 samples, making a total of 571 samples.

Evaluation of the Basic Filter Method in Combination with Classification Methods
In the first set of evaluation experiments we fix the feature selection algorithm template and evaluate it in combination with different classification methods. The basic filter method selecting the top n features [9] is used as the feature selection method. The changing meta-variables are the dataset, the ranking method, the feature preservation values and the classification method (Table 3). The range of feature preservation values was chosen to reflect both the ability to show the impact of separate features (more impact from fewer features) and the relevance of keeping that number of features (feature selection becomes irrelevant when more than 1,000 features are kept). All of this together results in a total of 4 (datasets) × 2 (ranking methods) × 11 (top-n values) × 5 (classification methods) = 440 experiments. These experiments are visualized in eight plots, one for every combination of dataset and ranking method. These plots show the change in quality for different numbers of preserved features.
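The structure of this experiment grid can be sketched as follows, with synthetic data standing in for the four real datasets and a reduced grid (two classifiers, three top-n values) so the sketch stays small; the full study crosses all five classifiers and eleven top-n values.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, f_classif,
                                       mutual_info_classif)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for one high-dimensional dataset.
X, y = make_classification(n_samples=120, n_features=300, n_informative=10,
                           random_state=0)

rankers = {"t-test/ANOVA": f_classif,
           "mutual information": mutual_info_classif}
classifiers = {"logistic regression": LogisticRegression(max_iter=2000),
               "naive Bayes": GaussianNB()}
top_n_values = [10, 50, 200]

results = {}
for (rname, rank), n, (cname, clf) in product(rankers.items(), top_n_values,
                                              classifiers.items()):
    # Filter selection: keep only the top-n ranked features, then score.
    Xs = SelectKBest(rank, k=n).fit_transform(X, y)
    results[(rname, n, cname)] = cross_val_score(clf, Xs, y, cv=5).mean()
```

Plotting each `results` slice against `top_n_values` reproduces the kind of quality-versus-preserved-features curves the experiments visualize.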

Feature Selection Algorithms Evaluation
The second set of experiments fixes the classification method in order to compare the feature selection methods. Logistic regression is chosen as the classification method. Finally, a combination of the earlier proposed FS score and the computation time is shown as well, for β = 0.005, to show the relation between computation time on the one hand and the FS score on the other. Other feature selection methods, such as the backwards elimination sequential method, the simulated annealing stochastic search method and the embedded backwards elimination method [9], are not evaluated. These methods are omitted because of their poor scalability with regard to computation time, which makes them unfit for datasets with this many features.

TPOT Integration Evaluation
The final experiment evaluates whether our additions to TPOT improve its utility on biomedical problems. This experiment consists of multiple runs of TPOT; all steps are also shown in an explanatory table (Table 1). All four datasets are tested (Subsection 5.1) and the accuracy is replaced by the feature-sensitive FS score with β = 0.005 (Equation 1), as previously discussed (Subsection 4.1). For testing TPOT, an optimization time of 120 minutes (two hours) was chosen as a reasonable constraint for running each experiment 5 times; a single algorithm was not allowed to run for longer than ten minutes; a population size of 12 was chosen so as not to be too selective at the start; and the training set size within TPOT was set to 0.90, a common choice when few samples are present. Furthermore, tests were done both with and without focused feature selection and both with and without the alternative feature selection algorithm set, as these additions must be tested for their quality. This gives a total of 4 (datasets) × 2 (pipeline selection) × 2 (feature selection set) = 16 different experiments, each rerun 5 times.

Results
In this section we analyze the results of all the proposed experiments.

Feature Selection Exploration Results
The results in Figure 2 show a clear difference in score quality across different datasets. Only for the Arcene dataset was a significant difference observed between using Mutual Information and T-test/ANOVA. It therefore seems that the type of ranking method has little influence on the measurement quality. One interesting aspect was that methods using Mutual Information had a longer computation time than methods using the T-test/ANOVA. Figure 3 shows that all wrapper algorithms preserved fewer than 65 features for these settings, whereas the performance averages around 75%. The filter and embedded methods performed worse, with an overall lower performance score than the wrapper methods, even when more features were present. No immediate conclusions can be drawn from the filter and embedded algorithm results alone. Looking only at the wrapper algorithms, some further observations can be made. Ordering the features before using a wrapper method structurally gave a better result than using a random ordering. Also, a threshold of α = 0.001 usually resulted in more features and in higher scores compared with a threshold of α = 0.01. Comparing the algorithms, the floating search with ordering performed best, whereas the other algorithms performed more similarly.

Evaluation of the TPOT Feature Selection Integration Results
The results of this experiment set are summarized in Table 5. The optimization process is also recorded and shown in a plot for the four datasets (Figure 4). The five experiment reruns are averaged into one complete result. The results show that after two hours of running, regular TPOT always performs worst. According to the optimization processes, the high feature selection algorithm set shows much bigger stepwise improvements, accompanied by fewer small stepwise improvements. This shows the earlier observed trade-off between computation time and feature selection quality for these specific feature selection algorithms: they take longer to compute, but they also give a much better result.
The choice for the best TPOT algorithm is not conclusive. For high feature selection, all three new possibilities show improved results. The TPOT variant that is feature selection focused and has the new feature selection algorithm set performs the best for datasets with a very large number of features (RSCTC and Psoriasis), but the trade-off in computation time seems to hurt the performance for smaller datasets. The algorithm that is only feature selection focused has a steady performance for datasets with a lower number of features, but this performance does not seem to be better than the algorithm with only the new feature selection algorithm set.

Discussion
There are several works focusing on the possibility of feature selection, both from a data analytical [7] and a biomedical [1,2,8] point of view. They usually put the emphasis on datasets with a low number of features, for which feature selection is less relevant. In that regard this paper complements the existing work, since we handle datasets with a significantly higher number of features, at least 1,000. The relevance of datasets with a large number of features will only increase over time due to new and improved techniques for data acquisition and storage. We believe that our work is a good start towards the challenge of tackling feature selection on such big datasets.
Of the various feature selection methods discussed, several were not tested due to computation time constraints. The wrapper methods backwards elimination and simulated annealing, as well as the embedded backwards elimination, were all too computationally intensive to be relevant for this research. On datasets with fewer features, these methods may show better results and could be possible candidates for feature selection.
After evaluation of the results in the first experiment setup, it could be concluded that there was not a big difference between using T-test/ANOVA or Mutual Information as a ranking method. Both gave similar results, with the exception of one dataset.
A further conclusion can be drawn from looking at the accuracy as a function of the number of features preserved. Beyond a threshold of 200 features, additional features did not raise the validation score as much as the first 200 features did. This indicated a second rule of thumb: no more than 200 features need to be preserved after using a filter method. This absolute number was not intuitive and much lower than expected. More research is needed on these preserved 200 features and why no other features were needed to predict the output; cluster analyses, for example, may give insight into this phenomenon.
After evaluation of all three kinds of methods (filter, wrapper and embedded), wrapper methods were significantly better at selecting a smaller fraction of features while preserving a similar test score. Our intuitive expectation was that wrapper methods would be the best in efficiency, and embedded methods were expected to outperform both filter and wrapper methods when looking at both quality and computation time. This was not the case, however: embedded methods had a near-identical quality to filter methods and performed much worse than wrapper methods. The outcome indicated that wrapper methods differ significantly in their results, while filter and embedded methods are nearly identical.
Since, unlike filter methods, wrapper methods take dependencies between features into account, they kept these dependencies to a minimum. This makes wrapper methods more suitable for handling multicollinearity and, in that respect, more useful than filter methods. If dependencies between the features themselves are important, however, wrapper methods might not be the best choice; further research could investigate this. A downside of the wrapper methods was that they took much more computation time than the filter and embedded methods. Therefore, if feature interdependencies do not matter, a filter method should be preferred.
There was a difference in quality within the wrapper methods. The forward selection and PTA all showed promising results and therefore should be considered for the framework. Floating search showed the best results, however took significantly longer in computation time. Within the wrapper methods, an ordering beforehand showed improved results. Backward elimination wrapper algorithms, stochastic search algorithms and embedded backwards elimination algorithms all were significantly worse in computation time than the other wrapper algorithms and therefore should not be considered in these cases.
A recommendation for the threshold in wrapper methods is not trivial. A higher threshold of α = 0.01 gave a smaller feature subset at the cost of a lower classification score. If a smaller subset or more influential features are desired, a higher threshold should be chosen, whereas a lower threshold is preferable when a higher quality of the feature subset is desired.
Both additions to TPOT were beneficial according to the experiment. When the initial number of features was very high (50,000), a combination of an initial bias towards feature selection and feature selection algorithms focusing on preserving only 200 features performed best. For lower numbers of features, adding only one of the improvements was better, as computation time became the limiting factor when using both.
Figure: The average validation F1-scores shown per dataset and rank. T and MI refer to t-test/ANOVA and mutual information, respectively.
Figure: The optimization process for the different TPOT algorithms for the micro-organisms dataset.