Assessing the Limitations of Relief-Based Algorithms in Detecting Higher-Order Interactions

Background: The investigation of epistasis becomes increasingly complex as more loci are considered due to the exponential expansion of possible interactions. Consequently, selecting key features that influence epistatic interactions is crucial for effective downstream analyses. Recognizing this challenge, this study investigates the efficiency of Relief-Based Algorithms (RBAs) in detecting higher-order epistatic interactions, which may be critical for understanding the genetic architecture of complex traits. RBAs are uniquely non-exhaustive, eliminating the need to construct features for every possible interaction and thus improving computational tractability. Motivated by previous research indicating that some RBAs rank predictive features involved in higher-order epistasis as highly negative, we explore the utility of absolute value ranking of RBA feature weights as an alternative method to capture complex interactions. We evaluate ReliefF, MultiSURF, and MultiSURFstar on simulated genetic datasets that model various patterns of genotype-phenotype associations, including 2-way to 5-way genetic interactions, and compare their performance to two control methods: a random shuffle and mutual information. Results: Our findings indicate that while RBAs effectively identify lower-order (2 to 3-way) interactions, their capability to detect higher-order interactions is significantly limited, primarily by large feature count but also by signal noise. Specifically, we observe that RBAs are successful in detecting fully penetrant 4-way XOR interactions using an absolute value ranking approach, but this is restricted to datasets with a minimal number of total features. Conclusions: These results highlight the inherent limitations of current RBAs and underscore the need for enhanced detection capabilities for the investigation of epistasis, particularly in datasets with large feature counts and complex higher-order interactions.


Background
Feature selection, the process of reducing the number of features in a dataset while preserving predictive power, is crucial when leveraging machine learning (ML) approaches in computational genetics and bioinformatics, as datasets regularly contain hundreds of thousands to millions of genetic loci. By identifying and focusing on loci (features) that have significant phenotypic associations, researchers can reduce computational demands and boost both the efficiency and performance of ML models [1]. However, the investigation of epistasis, interactions among genes and genetic loci, complicates this task considerably. This is because the number of possible interactions grows exponentially with the number of loci n when all k-wise combinations are considered, while the number of combinations of k loci for a fixed k increases polynomially with n [2]. This expansion poses a formidable challenge for feature selection, particularly when balancing the detection of main effects with the vast pool of potential interactions. Thus, developing a robust and efficient feature selection strategy is crucial for navigating and prioritizing features when employing ML techniques in the analysis of epistasis in large, complex genetic datasets.
Relief-based algorithms (RBAs), a family of filter-based feature selection methods, are particularly effective for genetic analyses as they can detect both main effects and interactions without exhaustively searching the entire parameter space [3]. This makes RBAs more efficient than wrapper methods and, unlike embedded methods, they can operate independently of any ML algorithm, facilitating deployment in any analysis pipeline. Furthermore, RBAs explore feature interactions without explicitly creating new variables for each possible n-way interaction [3,4]. RBAs achieve both efficiency and thorough evaluation by iteratively updating feature weights (or proxy statistics) that measure a feature's relevance to predicting the endpoint value, based on the concept of 'near hits' and 'near misses' in the training set. These hits and misses are categorized by comparing feature value differences between instance pairs. The nearest neighbors of a particular instance that are of the same class (or of similar outcome value for continuous endpoints) are called the nearest hits, while the nearest neighbors of the opposite class are called the nearest misses. These identified neighboring instances are then used to update feature weights. After execution, the RBA outputs a weight (feature importance) for each feature, ranging from -1 to 1, prioritizing maximally positive features [5]. As a result, RBAs rank features by their relative predictive power and, depending on the RBA employed, score features in terms of main effects, interactions, or both. Although we focus here on the effectiveness of RBAs in genetic datasets, these algorithms are versatile, supporting any context and accommodating datasets with categorical or continuous features, missing data, noisy data, and binary, multi-class, or continuous outcomes. Comprehensive details and further exploration of these algorithms can be found in Urbanowicz et al. 2018 [3], which provides an extensive overview of the implementation and applications of RBAs in various domains.
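The hit/miss update described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation of the original single-neighbor Relief update, not the skrebate code used in this study; the function name, Manhattan distance, and range normalization are our choices for the sketch:

```python
import numpy as np

def relief_scores(X, y):
    """Illustrative sketch of the original Relief update: for each target
    instance, the nearest same-class neighbor (hit) pulls a feature's weight
    down by its value difference, while the nearest opposite-class neighbor
    (miss) pushes it up. Weights land in [-1, 1]."""
    n, p = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # avoid division by zero for constant features
    W = np.zeros(p)
    for i in range(n):
        d = np.abs(X - X[i]).sum(axis=1)  # Manhattan distance to target
        d[i] = np.inf                     # never pick the target itself
        same = y == y[i]
        hit = int(np.argmin(np.where(same, d, np.inf)))
        miss = int(np.argmin(np.where(~same, d, np.inf)))
        W += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / span / n
    return W
```

On a toy dataset where one feature determines the class and another is random noise, the predictive feature accumulates a strongly positive weight while the noise feature stays near zero.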
Due to the capability of RBAs to explore and detect both main and interaction effects, a previous study benchmarked various RBAs in these contexts using a variety of simulated genetic datasets with known predictive features [4]. Using the Python package skrebate, a scikit-learn-compatible [6] collection of Relief-based algorithms, that study compared multiple RBAs on their capacity to detect main, epistatic, and heterogeneous effects. Although the capability and efficacy of feature interaction detection depend on the specific RBA employed, multiple algorithms identified 2-way and 3-way interactions with varying success. However, all RBAs struggled to detect predictive features in datasets with simulated higher-order (4-way and 5-way) interactions. In these experiments, the known predictive features were consistently given highly negative scores, and thus disregarded by the traditional feature ranking approach.
Two questions arise from the above observations: (1) Are RBAs detecting higher-order interactions beyond 3-way by assigning highly negative scores to predictive features? (2) If so, can an alternative ranking system be developed that highlights predictive features involved in higher-order interactions while also detecting other associations?
In the original research paper introducing the Relief algorithm, Kira and Rendell posit that "statistically, the relevance level of a predictive feature is expected to be larger than zero and that of an irrelevant one is expected to be zero (or negative)" [5]. A subsequent study, however, indicates that predictive features might receive more negative scores under certain conditions [7]. Specifically, RBAs like ReliefF encounter difficulties in distinguishing between predictive and non-predictive (random) features as the number of nearest neighbors increases, particularly within noisy datasets with complex associations [4]. Initially, RBAs tend to assign negative scores to random features due to minor asymmetries in their update mechanisms, where differences with nearest neighbors from the same class lead to negative updates and those from different classes result in positive updates. As the number of neighbors increases, these updates begin to balance out, often resulting in near-zero estimates for random attributes. Additionally, the presence of noise in the dataset further complicates the interpretation of these updates, potentially leading to the erroneous assignment of negative scores to predictive features. This influence of noise underscores the sensitivity of RBAs and raises considerations for their use in settings with complex higher-order interactions. Such over-generalizations can mask the true discriminative power of predictive features, especially as the algorithm's sensitivity to noise escalates in high-dimensional settings. Consequently, in scenarios involving complex interactions, there may be a marked increase in the assignment of negative scores to genuinely predictive features. This possibility suggests that the behavior could be leveraged in the investigation of epistasis.
Detecting higher-order interactions in biological systems is a complex and challenging problem [2,8]. As previously stated, the number of possible interactions expands as the number of loci n and the interaction order k increase. Considering the possibility that RBAs assign negative scores to predictive features in complex interactions, along with their non-exhaustive search capabilities, these algorithms could substantially improve the efficiency and effectiveness of epistasis detection. One approach to determine whether RBAs detect higher-order interactions by assigning features negative scores is to invert the ranking so that the most negatively scored features are ranked highest. However, this approach would fail to detect positively scored features, which have experimentally been shown to indicate main, heterogeneous, and lower-order interaction effects [4]. Instead, we propose an alternate solution: rank RBA feature scores by their absolute value in an attempt to capture all informative features, regardless of the type of association. The purpose of this study is to determine whether absolute ranking enhances the ability of RBAs to detect higher-order feature interactions while retaining the ability to identify other patterns of single-feature and multi-feature association.

Feature Importance Estimation Algorithms
We chose to evaluate the RBAs ReliefF [9], MultiSURF [4], and MultiSURFstar [10] (hereinafter referred to as MultiSURF*) for this study, due to their reliable performance over a large variety of problem domains, with higher-order feature interactions being their most notable limitation [4]. We used the skrebate v0.62 Python package [4], which includes all three algorithms, and have made all analysis scripts available in release v0.7.1 on GitHub [11]. For the ReliefF algorithm, 10 vs. 100 nearest neighbors (NN) were examined for the run parameter traditionally labeled k; to avoid confusion with the binomial coefficient notation (n choose k), we henceforth refer to this run parameter as NN. These two values of NN replicate the analyses of the benchmarking paper [4], where NN = 10 yielded more positive scores for predictive features in higher-order experiments while NN = 100 yielded more negative scores [4]. We expect each setting of NN to yield different strengths and weaknesses. In particular, for higher-order interactions we would expect ReliefF with NN = 100 to outperform ReliefF with NN = 10 when using absolute value ranking, as non-predictive feature weights should trend toward zero at higher NN settings (dependent on overall data sample size) [7]. In all figures, standard rankings for these RBAs are denoted as ReliefF-10NN, ReliefF-100NN, MultiSURF, and MultiSURF*, while absolute value rankings are denoted as ReliefF-10NN ABS, ReliefF-100NN ABS, MultiSURF ABS, and MultiSURF* ABS.

ReliefF
ReliefF is an improvement over the original Relief algorithm [5,12] and is now one of the most well-known and most used RBAs to date. In addition to discrete and continuous endpoints, ReliefF can handle multi-class and incomplete datasets as well [13,14]. Unlike Relief, which uses only the single nearest hit and nearest miss to update feature weights, ReliefF requires a run parameter, NN, denoting the number of nearest neighbors used for feature scoring. In each training cycle, a target instance is chosen, and the NN nearest hits and NN nearest misses to that target are then used to iteratively update feature scores [3,7].
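The NN-neighbor generalization can be sketched as follows. This is an illustrative NumPy re-implementation, not the skrebate code evaluated in this study; the function name, distance metric, and normalization are our choices:

```python
import numpy as np

def relieff_scores(X, y, nn=10):
    """Illustrative sketch of ReliefF: average feature-value differences over
    the nn nearest hits and nn nearest misses of each target instance,
    rather than over a single nearest hit and miss."""
    n, p = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    W = np.zeros(p)
    for i in range(n):
        d = np.abs(X - X[i]).sum(axis=1)  # Manhattan distance to target
        d[i] = np.inf                     # exclude the target itself
        same = y == y[i]
        hits = np.argsort(np.where(same, d, np.inf))[:nn]
        misses = np.argsort(np.where(~same, d, np.inf))[:nn]
        diff_hit = np.abs(X[i] - X[hits]).mean(axis=0) / span
        diff_miss = np.abs(X[i] - X[misses]).mean(axis=0) / span
        W += (diff_miss - diff_hit) / n
    return W
```

Averaging over nn neighbors makes the estimate more robust to noise than the single-neighbor update, at the cost of the sensitivity trade-offs discussed above.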

MultiSURF*
Instead of using a fixed number of nearest neighbors for scoring, MultiSURF* identifies nearest neighbors as well as 'farthest instances' using a boundary threshold, T_i, based on the mean pairwise distance between the target instance and all others. It also uses a "dead-band" zone, based on the standard deviation of pairwise distances between the target instance and all others, to exclude ambiguously 'near' or 'far' instances from affecting feature scores. It was previously demonstrated that the inclusion of inverse scoring for 'far' instances improves power to detect pure pairwise interactions in Relief-based feature ranking [10,15]. While effective, 'far' scoring incurs additional computational cost and was also demonstrated to negate the ability of Relief-based algorithms to detect univariate effects in feature ranking and selection [4]. This makes MultiSURF* only suitable for detecting 2- or 3-way interactions.
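The threshold and dead-band can be illustrated with the sketch below. The exact dead-band width used by MultiSURF*/MultiSURF is not restated here, so the half-standard-deviation default is our assumption for this sketch, as are the function name and Manhattan distance:

```python
import numpy as np

def near_far_masks(X, i, deadband=0.5):
    """Illustrative sketch of the MultiSURF* neighborhood: instances whose
    distance to target i is below T_i minus the dead-band are 'near', those
    above T_i plus the dead-band are 'far', and instances inside the
    dead-band are excluded from scoring. The half-standard-deviation
    dead-band width is an assumption of this sketch."""
    d = np.abs(X - X[i]).sum(axis=1)   # distances from the target
    others = np.arange(len(X)) != i
    T = d[others].mean()               # boundary threshold T_i
    D = deadband * d[others].std()     # dead-band half-width
    near = others & (d < T - D)
    far = others & (d > T + D)
    return near, far
```

'Near' instances update scores as in other RBAs; MultiSURF* additionally inverse-scores the 'far' set, while MultiSURF (below) simply ignores it.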

MultiSURF
MultiSURF uses the same threshold (T_i) and dead-band introduced in MultiSURF* to identify nearest neighbors; however, it omits 'far' scoring. This reduces computational complexity and regains the efficacy of the algorithm in detecting univariate effects, at the expense of a small loss of power to detect 2- or 3-way interactions [4]. This makes MultiSURF efficient for broader applications, particularly in large datasets where simplicity and processing speed are crucial. This approach provides robust analysis while minimizing computational burdens and removing the need for hyperparameter optimization [3,4].

Score Ranking Schemes
We assess the performance of each aforementioned RBA in identifying known predictive features in simulated genetic datasets by utilizing two ranking schemes: standard and absolute value. For standard ranking, we sort feature scores in descending order from the most positive to the most negative. This is the traditional strategy used inherently by RBAs. In absolute value ranking, we instead first convert all feature scores to their absolute value and then rank in descending order, such that highly positive and highly negative scores are prioritized. The aim is to have one algorithm and ranking scheme that can capture the highly positive scores of univariate effects and lower-order interaction effects as well as the hypothesized highly negative scores for higher-order interaction effects.
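The two schemes amount to a one-line difference in the sort key. The scores below are hypothetical, chosen so that one strongly negative feature (as hypothesized for higher-order interactions) changes rank between the schemes:

```python
import numpy as np

# Hypothetical feature scores from an RBA run; feature 1 is strongly
# negative, as hypothesized for a feature in a higher-order interaction.
scores = np.array([0.30, -0.45, 0.02, -0.01, 0.10])

# Standard ranking: most positive score first (the default RBA behavior).
standard = np.argsort(-scores)

# Absolute value ranking: largest magnitude first, so strongly negative
# features rise to the top alongside strongly positive ones.
absolute = np.argsort(-np.abs(scores))
```

Under standard ranking the negative feature falls to last place; under absolute value ranking it moves to first.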

Data Simulation
Table 1 presents a detailed overview of 2,100 simulated datasets with binary endpoints for classification, building on those used in a previous Relief-algorithm benchmarking study [4]. The 'configurations' column details all variations across experiments per genotype-phenotype association pattern, including the number of predictive and total features, heritability (indicative of noise level), the number of instances, and model architecture complexity, categorized as easy (E) or hard (H) [16]. Each configuration produces 30 replicate datasets using random seeds. Datasets are generated using custom scripts as well as the GAMETES v2.2 software package [17], which simulates a range of genotype-phenotype relationships, including univariate, multivariate, and interaction effects. In each simulated dataset, predictive and non-predictive features are known beforehand. Unlike the original benchmarking study, which primarily used datasets with only 20 features, we increase the minimum feature count to 100 for a more robust evaluation, except in exclusive-or (XOR) datasets. These XOR experiments model pure, strict, and clean 2-way to 5-way non-linearly separable interactions with full penetrance [18]. "Pure" refers to an epistatic interaction with no main effects, "strict" refers to no lower-order effects (e.g., a 3-way interaction has no underlying 2-way interactions), and "clean" refers to no noise (heritability = 1; lower heritabilities are termed "noisy"). Thus, these XOR datasets serve as straightforward toy examples of feature interactions. Feature counts for these XOR datasets range from 20 to 100 in 20-feature increments. This setup results in 20 distinct experiments across four orders of interaction and five feature levels, enabling a systematic evaluation of each RBA's ability to detect low- and high-order interactions as feature counts increase. The XOR datasets contain the only clean interactions simulated in this project, yet all simulated interactions are both pure and strict. These datasets are available upon request and are similar to datasets publicly available on GitHub [19] that were used in previous benchmarking [4].
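For intuition, a clean, fully penetrant k-way XOR pattern like the one described above can be generated in a few lines. This toy uses binary features and a parity class rather than the three-state genotypes of the study's GAMETES/custom datasets, so it only illustrates the pure, strict, clean structure:

```python
import numpy as np

def xor_dataset(order, n_features, n_instances, seed=0):
    """Toy sketch of a clean k-way XOR dataset: the class is the parity of
    the first `order` binary features (pure and strict, since no subset of
    fewer than `order` features carries any signal); all remaining features
    are random noise."""
    rng = np.random.default_rng(seed)
    X = rng.integers(0, 2, size=(n_instances, n_features))
    y = X[:, :order].sum(axis=1) % 2  # parity of the predictive features
    return X, y
```

With heritability 1 (no noise), the class label is fully determined by the predictive features, matching the "clean" definition above.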

Experimental Evaluation
We replicate the evaluation method from the previous benchmarking study [4] to compare ranking performance between methods. As negative controls, we add a random shuffle method that randomly shuffles the features to yield their rankings, as well as a non-RBA (mutual information) best suited to detecting univariate effects. Specifically, we used a scikit-learn implementation of mutual information [20] with default settings. Thus, we explore a total of ten different methods: the standard and absolute value rankings of the four RBAs, as well as random shuffle and mutual information (each with 30 replicate dataset analyses).
For each dataset configuration: (1) The predictive features are known ahead of time. For each of the 30 replicate datasets and each ranking scheme, we find the ranking of the lowest-ranked predictive feature. For example, in a dataset with 100 features and 2 predictive features, if the RBA ranks the predictive features 1st and 5th, we save the 5th position. In total, we obtain 300 such rankings (30 per method). Only the worst-ranked positions, or the "weakest links", are considered, as missing even a single predictive feature can cause RBAs to fail on feature selection problems with interactions. (2) Given n features in a dataset, there are n possible ranking positions. For each position, we compute the percentage of the 30 saved rankings that were placed higher than that particular position. This process is repeated for each method, and in total we obtain 10*n percentages (n per method). (3) For analysis and visualization, we create heatmaps that summarize the experiments across all ten methods. Each heatmap consists of ten rows: one for each of the eight RBA methods (the four RBAs under standard and absolute ranking), plus random shuffle and mutual information. Each row has n grid squares. The percentages computed in the previous step are represented in the heatmap, where low percentages are encoded orange-to-white and high percentages blue.
In general, an effective feature scoring and ranking method should maximize the percentages calculated in this analysis. High percentages indicate that the saved predictive feature rankings sit above most other feature positions; that is, all of the predictive features are consistently placed in the top rankings, which is the desired result for RBAs. Therefore, a more intensely and consistently blue row in the heatmap signifies a higher performance level.
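The "weakest link" statistic from step (1) can be sketched as follows; the function name is ours, and ranks are 1-indexed with 1 as the best position:

```python
import numpy as np

def weakest_link_rank(scores, predictive_idx, use_abs=False):
    """Rank all features from best (1) to worst (n) under the chosen
    ranking scheme and return the worst rank among the known predictive
    features, i.e. the 'weakest link' position saved per replicate."""
    key = np.abs(scores) if use_abs else np.asarray(scores)
    order = np.argsort(-key)                 # best-to-worst feature indices
    rank_of = np.empty(len(key), dtype=int)
    rank_of[order] = np.arange(1, len(key) + 1)
    return int(rank_of[list(predictive_idx)].max())
```

For instance, with scores [0.5, -0.6, 0.1, 0.0] and predictive features 0 and 1, standard ranking places feature 1 last (weakest link 4), while absolute value ranking places it first (weakest link 2).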

Clean XOR Low and High-Order Epistasis
Figure 1 shows results for the clean XOR datasets of increasing epistatic order and total feature count. All RBAs perform equally well at detecting predictive features for 2-way epistasis at all feature counts, in contrast with mutual information and the random shuffle negative control. However, in 3-way datasets, only ReliefF-10NN and ReliefF-100NN, using both standard and absolute ranking, consistently show high power as feature count increases, with ReliefF-10NN yielding the highest observed power. MultiSURF predicts well at 20 features, but performance starts to diminish as features are added, with standard MultiSURF always outperforming MultiSURF ABS. Interestingly, both MultiSURF rankings perform better at 100 features than at 80 features, but only marginally so. MultiSURF* struggles in 3-way experiments, but MultiSURF* ABS outperforms standard MultiSURF* at all feature counts (only marginally so at 40 features and above). This occurs because standard MultiSURF* ranks predictive features with more negative scores.
In 4-way experiments at 20 features, standard ReliefF-10NN, ReliefF-10NN ABS, MultiSURF ABS, and MultiSURF* ABS display high power. However, the power of these RBAs immediately diminishes in data with 40 features. This drop-off is steeper for MultiSURF ABS and MultiSURF* ABS compared to ReliefF-10NN. Finally, for 5-way datasets with 20 features, standard ReliefF-10NN, standard MultiSURF*, ReliefF-10NN ABS, ReliefF-100NN ABS, and MultiSURF ABS display marginal power in contrast with the random shuffle, with standard ReliefF-10NN achieving the highest power. However, as seen in 4-way, performance sharply diminishes as features are added. Unexpectedly, mutual information provides some marginal power in 4-way and 5-way experiments, especially when feature counts are intermediate (40-80 features). ReliefF-10NN ABS consistently outperforms ReliefF-100NN ABS in higher-order interactions. This observation is surprising, since ReliefF-100NN ABS is expected to perform better in higher-order experiments, as non-predictive features should be scored near zero, leaving opportunity for informative features to be scored more negatively. Indeed, we observe average scores of non-predictive features in 4-way and 5-way XOR experiments closer to zero for ReliefF-100NN ABS (0.0019) compared to ReliefF-10NN ABS (0.0061). However, this trend also holds for predictive features, with ReliefF-100NN ABS scoring predictive features closer to zero on average in 4-way and 5-way experiments (ReliefF-10NN ABS: 0.011; ReliefF-100NN ABS: 0.0024). Thus, unexpectedly, ReliefF-10NN ABS showcases better performance on higher-order XOR interactions than ReliefF-100NN ABS.
Fig. 1 Heatmap results for pure and clean XOR epistatic datasets with increasing epistatic order and feature counts. The scale for power, as the frequency of success, is to the right of the heatmaps. Plot legend is located on the bottom right.

Noisy 2-way Epistasis
Figure 2 shows results for the core set of 2-way epistatic interaction datasets with varying heritability (noise) and training instances. These datasets are comparable to the core 2-way datasets in the original benchmarking paper [4] but with a higher feature count (100 vs. 20) and the addition of absolute value rankings. The most difficult datasets, with low heritability (high noise) and few training instances, are toward the bottom left of the figure, while datasets with high heritability (low noise) and more training instances are toward the top right. Standard ranking performs slightly better on very noisy problems. Also, for very noisy problems, increasing the number of training instances increases the performance of standard ranking substantially, but not as much for the absolute value methods. We find that RBAs assign predictive features scores with absolute values very close to zero in noisier datasets.
In contrast, both ranking methods generally assign much larger positive values to predictive features in less noisy problems. As a result, when there is less noise, using absolute ranking does not harm the ability to detect predictive features in these experiments. However, in very noisy datasets, where all feature scores are close to zero with no strongly positive or negative values, absolute ranking confounds 'slightly good' and 'slightly bad' features, making standard ranking perform better. Despite these trends, MultiSURF* is observed to have the highest power in the most difficult dataset configurations when compared to other RBAs. Additionally, we observe that ReliefF-10NN outperforms ReliefF-100NN in dataset configurations with 200 training instances; however, ReliefF-100NN is superior as training instances increase.
Figure 3 shows results for 2-way epistatic datasets where the total feature count increases by orders of magnitude from 100 to 100,000, with heritability set at 0.4. The standard and absolute ranking methods perform similarly well for datasets with 100 and 1,000 features. At 10,000 features, standard ranking slightly outperforms absolute ranking in all RBAs, with MultiSURF* having the highest power; however, MultiSURF* ABS performs nearly as well as standard MultiSURF* at this feature count. This relationship is not observed at 100,000 features, where standard ranking outperforms absolute value ranking for MultiSURF and MultiSURF*. Standard MultiSURF* has the highest power at this feature count, with MultiSURF* ABS outperforming all other absolute value rankings as well. Thus, MultiSURF* may excel over other RBAs when feature counts are very high for 2-way epistasis detection. ReliefF-10NN struggles at 10,000 features; however, at 100,000 features, ReliefF-10NN slightly outperforms ReliefF-100NN. Interestingly, absolute value rankings of both ReliefF-10NN and ReliefF-100NN outperform their respective standard rankings at 100,000 features, due to these RBAs assigning more negative values to predictive features under standard ranking. Thus, in very large datasets, using an absolute value ranking scheme may be optimal when employing ReliefF.
Fig. 3 Heatmap results for detecting noisy 2-way epistatic interactions with a heritability of 0.4 and increasing non-predictive features. The scale for power, as the frequency of success, is to the right of the heatmap. Plot legend is located on the bottom right.
Figure 4 presents results for datasets that model heterogeneity between two independent 2-way interactions. This means there are two subgroups of instances, with one 2-way interaction being predictive for one subgroup of data instances and the other 2-way interaction being predictive for the remaining instances. One dataset models equal subgroup prevalence (50:50) while the other models unequal subgroup prevalence (75:25). Both the standard and absolute value rankings perform similarly well in the equal prevalence case. However, standard ranking significantly outperforms absolute value ranking in the unequal prevalence case. An exception to this is ReliefF-10NN, which has the lowest power at both subgroup prevalences, with standard ranking outperforming absolute value ranking.
Upon further inspection of feature score outputs for the unequal prevalence dataset (75:25), we notice that the four RBAs with standard ranking give the two predictive features associated with the more prevalent subgroup highly positive scores, the two predictive features associated with the less prevalent subgroup scores closer to zero, and the non-predictive features slightly negative scores. This explains the poor performance of absolute ranking, as non-predictive features will be close in value to features of the less prevalent subgroup once the absolute value is taken. The predictive features associated with the less prevalent subgroup receive scores very close to zero because the algorithm sees them as non-predictive for most instances, which decreases their score. Taken together, predictive features associated with the less prevalent subgroup are ranked above the non-predictive features when using standard ranking, but below or similar to them when using absolute value ranking. Notably, MultiSURF* achieves the highest power in the 75:25 experiment under both standard and absolute value ranking.

Noisy 3-Way Epistasis
All RBAs struggle to detect a 3-way epistatic interaction with a heritability of 0.2 (Figure 5). Notably, a heritability of 0.2 is used here rather than 0.4 due to a known limitation of the GAMETES simulator [4].
Fig. 5 Heatmap results for noisy 3-way epistatic datasets with a heritability of 0.2. The scale for power, as the frequency of success, is to the right of the heatmap.

Non-Epistatic Associations
Standard and absolute rankings performed similarly for all non-epistatic experiments, with the major exception being MultiSURF* (File S1). As previously observed, MultiSURF* struggles to detect single-locus and additive effects, as it is tailored to detect feature interactions [3,4]. However, in non-epistatic experiments with more than one predictive feature (additive models), MultiSURF* ABS outperforms MultiSURF* (File S1). This is due to MultiSURF* giving predictive features negative scores in some replicates, which rise in ranking when the absolute value is taken. Additionally, in the single-feature (single main effect) experiment with varying levels of heritability and model architecture difficulty, ReliefF-100NN outperforms ReliefF-10NN at low heritability levels under the hard model architecture; however, the difference is marginal.

Discussion
Our results show that RBAs using an absolute value ranking can only reliably detect higher-order interactions when penetrance and/or heritability is high (low noise) and feature count is low (Figure 1). However, even when these optimal conditions are met, all RBAs struggle to detect 5-way interactions, regardless of feature count and the ranking method employed. In the original benchmarking study [4], only 20 features were used in most experiments, except for a noisy 2-way experiment that increased the feature count from 100 to 100,000, which we have replicated in this study with the addition of absolute value ranking. Typically, studies exploring epistasis consider large sets of features, reflecting the extensive genomic datasets now available [2,21]. For a more robust analysis, we increase the minimum number of features to 100 for all experiments in this study. An exception to this are the XOR experiments, which are designed to provide pure, strict, and clean toy examples of how increasing feature counts affect the detection of higher-order epistasis. We recognize that 100 features may still be limited as a benchmark for many real-world datasets. However, it is advisable to clean/filter/prune datasets when studying genetic interactions, as epistasis investigation incurs substantial computational demands, particularly in large genomic datasets. We consider 100 features a practical choice for ensuring computational tractability and for modeling the typical scale of datasets used in focused post-hoc analyses. Moreover, our noisy 2-way experiment with increasing feature count, encompassing up to 100,000 features, effectively demonstrates the limitations of RBAs on their own in large datasets (Figure 3). Notably, wrapper algorithms such as TuRF [22] have been shown to dramatically boost the performance of individual RBA algorithms in larger feature spaces. Additionally, our XOR experiments highlight the limitations of RBAs in detecting higher-order interactions without the need to explore larger feature spaces (Figure 1). As mentioned earlier, this study was inspired by results from the previous benchmarking study, where predictive features for higher-order epistasis received highly negative scores from most RBAs. This, and the work of Robnik-Šikonja and Kononenko [7], led us to hypothesize that RBAs might detect complex interactions by assigning highly negative scores to predictive features in complex and/or noisy datasets, and that an absolute value ranking could effectively rank these features alongside main, additive, and lower-order epistatic effects. However, our results clearly demonstrate that increases in feature count significantly limit the detection of higher-order epistatic events (Figure 1) as well as 2-way epistasis, as demonstrated in our experiment with 100,000 features (Figure 3). Moreover, our results illustrate that absolute value ranking with ReliefF-10NN, MultiSURF, and MultiSURF* has only one niche application: 4-way epistasis with high penetrance (low noise) and low feature count. Thus, we only recommend employing absolute value ranking in these RBAs under these specific conditions. Interestingly, standard ReliefF-10NN displays the highest power levels and the least power decay across 4-way simulated XOR datasets as feature count increases. It also demonstrates the highest power in the simulated noisy 3-way epistasis experiment (Figure 5). Thus, when feature counts are low, ReliefF-10NN with standard ranking could be a viable "best option" for 3-way and higher-order epistasis.
Another replicated observation is the performance of MultiSURF* in 2-way epistasis experiments. When conditions are easy (low noise and lower feature counts), all RBAs perform similarly well. However, as conditions become more difficult with increasing noise and feature count, MultiSURF* provides the highest power in most configurations (Figures 2, 3, and 4). Thus, as previously observed [4], we recommend employing MultiSURF* when investigating 2-way epistasis alone, especially under more challenging conditions.

Conclusions
In this study, we explore the effectiveness of multiple RBAs, and of absolute value ranking, in identifying low- vs. higher-order epistatic interactions (i.e., 2-, 3-, 4-, and 5-way). Our findings reveal that RBAs generally struggle to detect 4- and 5-way interactions. Notable exceptions are the use of standard ranking with ReliefF-10NN and absolute value ranking with ReliefF-10NN, MultiSURF, and MultiSURF*, which displayed intermediate to high power on fully penetrant 4-way XOR interactions, but only in datasets with around 20 features. Therefore, these RBAs exhibit substantial limitations in detecting interactions beyond 3-way. We do note, however, that MultiSURF* excels at detecting 2-way epistatic events, especially under more difficult conditions, but at the expense of failing to detect univariate or additive effects. Future research will focus on investigating whether these higher-order limitations persist across models of epistasis other than XOR or GAMETES-simulated genetic datasets. Additionally, we will investigate whether wrapper algorithms, including TuRF [22], IterRelief [23], and VLS Relief [24], can enhance the detection of higher-order interactions with both standard and absolute ranking. We also aim to determine whether other RBAs and different ranking schemes can identify higher-order interactions.

Consent for publication
Not applicable.

Availability of data and materials
Simulated datasets used in this work are available upon request. Similar datasets are publicly available on GitHub at https://github.com/EpistasisLab/rebatebenchmark. All analysis scripts are available in the v0.7.1 release of skrebate at https://github.com/UrbsLab/scikit-rebate.

Fig. 2 Heatmap results for noisy 2-way epistatic (core) datasets with varying levels of heritability and training instances. The scale for power, as the frequency of success, is to the right of the heatmap. Plot legend is located on the bottom right. E and H stand for easy and hard model architecture difficulty, respectively.

Fig. 4 Heatmap results for detecting two independent heterogeneous 2-way epistatic interactions with a heritability of 0.4. The left heatmap shows results for a 50:50 split (equal subgroup prevalence), and the right heatmap shows results for a 75:25 split (unequal subgroup prevalence). The scale for power, as the frequency of success, is to the right of the heatmap.

Ethics approval and consent to participate
Not applicable.

Table 1
Simulation study datasets. 30 replicates of each configuration are generated. 'Simulation method' is either 'C' (custom script) or 'G' (GAMETES). 'Configuration Variations' describes further variations to a given dataset group. For example, 50:50/75:25 refers to the ratio of instance subgroup prevalence in heterogeneous problems.

Standard RBA rankings, except those of MultiSURF*, display intermediate power in this experiment. These intermediate results are likely due to the lower heritability of the model. MultiSURF* ABS marginally outperforms standard MultiSURF*, as also observed in the XOR experiments, for the same reason: predictive features are given negative scores by standard MultiSURF*.