Gaining insights in datasets in the shade of “garbage in, garbage out” rationale: Feature space distribution fitting

This article emphasizes comprehending the “Garbage In, Garbage Out” (GIGO) rationale and ensuring dataset quality in Machine Learning (ML) applications to achieve high and generalizable performance. An initial step should be added to the ML workflow in which researchers evaluate the insights gained by quantitative analysis of the datasets' sample and feature spaces. This study contributes towards that goal by suggesting a technique to quantify datasets in terms of feature frequency distribution characteristics, which provides a unique insight into how frequent the features are in the available dataset samples. The technique was demonstrated on 11 benign and malign (malware) Android application datasets belonging to six academic Android mobile malware classification studies. The permissions requested by applications, such as CALL_PHONE, compose a relatively high‐dimensional binary feature space. The results showed that the distributions fit well into two of the four long right‐tail statistical distributions considered (log‐normal, exponential, power law, and Poisson). Specifically, log‐normal was the most frequently exhibited statistical distribution, except for two malign datasets that were exponential. This study also explores an analysis of the features that fit or do not fit the statistical distribution, which enhances the insights into the feature space. Finally, the study compiles examples of phenomena in the literature exhibiting these statistical distributions, which should be considered when interpreting the fitted distributions. In conclusion, conducting well‐formed statistical methods provides a clear understanding of the datasets and of intra‐class and inter‐class differences before proceeding with selecting features and building a classifier model. Feature distribution characteristics should be among the aspects analyzed beforehand.


| INTRODUCTION
Classification is a specific problem or task in Machine Learning (ML) at which a computer program (i.e., a classifier) improves its performance through learning from experience (Mitchell, 1997). The experience is gained by providing labeled examples (i.e., training and validation datasets) of one or more classes that share common properties or characteristics (i.e., features) to a classifier that maps the properties into the class labels. The classifier's success is evaluated on a different set of labeled examples (i.e., a test dataset). After the supervised learning and testing phases, the classifier can determine the class of unknown or unlabeled new examples. The definition of classification suggests that classifiers and datasets are the two essential inputs in an ML application. From a dataset perspective, the dependence on training, validation, and test datasets implies that insufficient datasets cause low performance even with optimal (i.e., well-modeled, robust) classifiers, which is also called "Garbage In, Garbage Out" (GIGO). In one of the earliest highlights of the data dependency of any algorithm, Babbage (1864) faced the provoking question "if you put into the machine wrong figures, will the right answers come out?" and was puzzled by the confusion of ideas behind it. We can now rephrase the question as "is garbage in, gold out possible?" Years later, the literature began to draw attention to practices that expect such "gold." Tweedie et al. (1994), for example, identified the approaches in observational studies (e.g., on the effect of passive smoking) as GIGO, where insufficient data was not a primary consideration. GIGO was also underlined as a caution in quantitative data analysis in all scientific inquiry (Arcury & Quandt, 1998).
The term GIGO was also used in the literature to
• address the practices in clinical research and question whether researchers can draw conclusions based on a tiny sample (Heuser, 1998),
• express the unreliability of data where scientific research is conducted over operational or administrative datasets (e.g., epidemiologic research on administrative databases and billing systems maintained by healthcare providers and institutions; Grimes, 2010), and
• present the criticality of data and stress that approaches and methods that do not avoid GIGO will be ineffective in critical areas such as cancer detection and staging (O'Hurley et al., 2014).
In summary, GIGO has become a colloquial recognition that poor data entry leads to unreliable data output, which in turn necessitates highly accurate, valid, and complete information collection (Kilkenny & Robinson, 2018). Contrary to common belief, the sample size is not necessarily an indication of dataset quality or, in other words, "scientific knowledge is impossible with small-sample classification" (Dougherty & Dalton, 2013). Data quality and the need for insight into data are also prerequisites for knowledge discovery processes, data mining algorithms, and machine learning workflows as well as for handling big data (Hoerl et al., 2014). Triguero et al. (2019), for example, highlight this in the scope of big data and note that certain insights can be helpful to transform big data into smart data. Classification is a focused research field in machine learning, where several algorithms have been proposed. Researchers try to model a robust classifier based on one of those algorithms to apply it to a specific classification problem. Dougherty et al. (2005) adapted the signal theory of optimal robust filters to classifiers and addressed two types of robust classifiers: minimax and Bayesian robust classifiers. The former is a classifier whose worst performance over all states is better than the other classifiers' worst performances. The latter has an expected performance better than that of the other classifiers. In the signal-processing theory of robust filtering, full distributional knowledge and other factors, such as feature selection and design via training from sample data, are central. Just like a filter to be applied in nondesign settings, a classifier (a filter estimating a class label) should exhibit robust performance across a range of conditions. Dougherty et al. conclude that no matter how precisely the classifier is designed, it may not perform well relative to the actual distributions.

| GIGO VERSUS OPTIMAL CLASSIFIERS IN MACHINE LEARNING
Figure 1, which depicts the ML-based classification workflow with the essential activities conducted by researchers, highlights the effect of the GIGO rationale. The figure significantly extends and conceptualizes the four combinations of the GIGO rationale pictured in El Naqa et al. (2015). Note that specifying the whole workflow in terms of the introduced "space" concept is a distinctive approach of this study with respect to the literature. In this manner, the specific scope, the transformation of a dataset into different forms (sample space, feature space, and subsample space), and the other input (model space) and output (metric space) can be distinguished easily.
Figure 1 introduces the four possible combinations of the two inputs: a sufficient/insufficient dataset with an optimal/nonoptimal classifier. It is likely that
• in Case 1 (the only win-win scenario): an optimal classifier trained on a sufficient dataset exhibits high performance,
• in Case 2: an optimal classifier trained on an insufficient dataset shows lower performance,
• in Case 3: a nonoptimal classifier trained on a sufficient dataset exhibits low performance, and
• in Case 4 (the worst case): a nonoptimal classifier trained on an insufficient dataset shows the most inadequate performance.
Therefore, dataset sufficiency plays a decisive role. Researchers who build a classifier that is trained and tested on a dataset publish their classification performances in terms of standard metrics such as accuracy, true positive rate, or F1 (Canbek, Baykal, & Sagiroglu, 2017) or more robust metrics such as Matthews Correlation Coefficient (MCC) and Balanced Accuracy (BACC) (Canbek et al., 2021). The classifiers are compared with other classifiers that are trained and tested on different datasets via the same performance metrics. Only a few studies have compared or analyzed datasets from an ML perspective. For example, Gauen et al. (2017) compare eight visual image datasets and focus on the distribution of object locations in the image and the ratio of the object size to the image size. They discovered that the labeled objects in many datasets are centered in the image, except for datasets with network camera pictures. Although ML research and education fundamentally focus on activities that assume ground-truth or gold standard datasets are already available (Geiger et al., 2021), dataset sufficiency and reliability assurance should be defined, measured, and achieved unconditionally. From a qualitative perspective, several aspects are relevant for assessing the datasets, such as their origins, collection methods, assumptions and conditions (especially for survey datasets), data imputations, and subject matter experts' opinions. Those aspects can differentiate whether the datasets are actively solicited (e.g., surveys or the ones with human labeling) or passively (i.e., as is) acquired (Lew & Schumacher, 2020). From a quantitative perspective, ML-based classification, mostly with the supervised learning approach, first needs to be sure that a dataset exhibits diversity and representativeness close to the reality in the problem domain. Any dataset (i.e., a sample in statistics) that does not represent the reality (i.e., the population in statistics), that captures only a small part of the reality (biased or not rich datasets), or that contains errors (low-quality datasets) can be regarded as "garbage" in the GIGO rationale. Garbage can also be injected into datasets unintentionally or deliberately:
• Unintentional garbage, which occurs not only by injection but also by omission, causes a biased dataset. Such biases can be observed even in widespread ML applications such as face recognition. In a recent study, NIST conducted a test of 189 mostly commercial algorithms on 18 million images of 8.5 million people contained in four large photograph datasets collected in U.S. governmental applications such as visa or border crossing (Grother et al., 2019). The results, which indicate severe bias, showed that false positives (erroneous association of samples of two persons) are highest in West and East African and East Asian people and lowest in Eastern European people. False positives are also higher in women than in men and in the oldest and youngest people than in middle-aged adults. False negatives (failure to associate one person in two images) are higher in Asian and American Indian people than in white and African American individuals in one dataset, and higher in African and Caribbean as well as older individuals in another dataset. The report clearly shows that classifiers trained or tested on biased datasets that do not reflect the natural distribution of the real-case instances cause misclassifications, which people who are not well represented can regard as "garbage."
• Deliberate injections or techniques are called damages or attacks and are studied under the discipline of adversarial ML (Kaloudi & Jingyue, 2020). The attacks, which are called poisoning in the training phase and evasion in the test or production phases, and the defenses against them have recently been studied in the literature (Biggio & Roli, 2018). For example, Mahloujifar et al. (2019) studied those attacks in image classification, where a new instance looks similar to an existing instance, and suggested that a "concentration of measure" depending on the test dataset's distribution can indicate the robustness to adversarial perturbations.
FIGURE 1 The suggested conceptualization of ML classification workflow in phases through specific spaces and GIGO rationale (quantitative criteria?)
Hence, the frequency distribution of binary features is an introductory quantitative summary, yet it is one of the first mathematical insights into labeled ML datasets. Therefore, some cross-checking assessments of the feature distribution might help to sense garbage that should be examined further. Such an exploratory analysis can be conducted in two ways:
• Inter-class: distribution differences between positive-class and negative-class datasets in a specific classification application, and
• Intra-class: distribution differences between two or more datasets of the same class (e.g., positive-class datasets) in a specific classification problem domain.
Note that inter-class and intra-class comparisons are known techniques (i.e., maximum inter-class deviation and minimum intra-class variation) in clustering and feature selection (Asfour et al., 2021; Sahu et al., 2017). However, these assessments are not intended to help feature selection directly, because feature selection should occur after ensuring dataset sufficiency. No matter how effective feature selection is, it should be based on a sufficient dataset. Such an exploratory analysis should therefore be conducted before proceeding with selecting features and building a classifier model.
Note that some statistical methods are already used to describe datasets (sample-space size, feature-space size, class ratios, etc.). The statistics related to the shape of the feature distribution, such as skewness, kurtosis, and the number of peaks, can also be analyzed (Piringer et al., 2008). However, those statistical approaches summarize a dataset based on a single attribute that is usually continuous. Moreover, interpreting and comparing statistical figures alone is not convenient, and such figures are generally not suitable for discrete or qualitative features.
Knowing the feature space distribution characteristics of a dataset compared with those of other datasets in the same domain (i.e., intra-class assessment) can present the nature of the data and provide high-level situational awareness of the datasets, samples, and contents. Binary features are simple yet common in today's datasets. Recent practices also emphasize the high effectiveness of binary features (Chen et al., 2021). A researcher whose dataset has a distribution that differs from the ones in the same domain should examine further why and how it is different or whether it is a biased dataset. Hence, the dataset could be sufficient. Alternatively, researchers who wish to enrich their datasets usually merge new datasets acquired from other sources without analyzing them. They cannot be sure how their datasets differ from the existing ones. Proceeding with the ML workflow (e.g., feature selection) without analyzing such aspects may lead to unrealistic or ungrounded classification models, as in the examples given above. Note that because only the feature counts or frequencies are needed to find a distribution fit, it is not a costly operation (a single pass over the database with addition operations). Establishing a rather generic quantitative analysis can lead to a qualitative analysis of the datasets and help enhance the benchmarking datasets in specific domains (Canbek et al., 2018).
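As a minimal sketch of the single-pass counting mentioned above, the following Python snippet (written for illustration, not taken from the study's R script) tallies how many samples contain each binary feature and then sorts the counts in decreasing order. The function name and the toy samples are hypothetical.

```python
from collections import Counter

def count_feature_frequencies(samples):
    """Count, in one pass, how many samples contain each binary feature."""
    counts = Counter()
    for features in samples:
        # set() guards against a feature being listed twice in one sample
        counts.update(set(features))
    return counts

# Hypothetical toy dataset: each sample is the set of permissions one app requests.
samples = [
    {"INTERNET", "CALL_PHONE"},
    {"INTERNET"},
    {"INTERNET", "SEND_SMS", "CALL_PHONE"},
]

freq = count_feature_frequencies(samples)

# Feature frequencies sorted in decreasing order (ties broken by name).
ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
print(ranked)
```

The cost grows linearly with the number of feature occurrences, which is consistent with the claim that the counting step is inexpensive.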

| FITTING BINARY FEATURE FREQUENCIES INTO A STATISTICAL DISTRIBUTION
The feature space-frequency distribution is the frequency of binary feature occurrences sorted in decreasing order. Such distributions generally follow size or frequency trends that are intuitively stated as "trivial many and vital few" or "useful many and vital few" (Juran & Godfrey, 1999). Specifically, four nonnormal long right-tail statistical distributions are described in the literature (Joo et al., 2017): log-normal, exponential, power law, and Poisson. The Poisson distribution is usually the distribution of "count data" that shows the (non-negative) counts of a single dependent variable or a combination of dependent variables. Because we have the counts per feature in the feature space here, Poisson or similar distributions such as binomial or gamma-count distributions should not be expected to fit.
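For reference, the textbook forms of the four candidate distributions can be written as below. These are the standard continuous densities restricted to values at or above a lower bound x_min, plus the usual Poisson mass function. The symbols α, λ, μ, and σ are generic parameters introduced here for illustration, and fitting software may use discrete analogues instead, so this is a notational aid rather than the exact forms fitted in the study.

```latex
\begin{align*}
\text{Power law:}\quad & p(x) = \frac{\alpha - 1}{x_{\min}} \left(\frac{x}{x_{\min}}\right)^{-\alpha}, \quad x \ge x_{\min},\ \alpha > 1 \\
\text{Exponential:}\quad & p(x) = \lambda\, e^{-\lambda (x - x_{\min})}, \quad x \ge x_{\min},\ \lambda > 0 \\
\text{Log-normal:}\quad & p(x) \propto \frac{1}{x} \exp\!\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right), \quad x \ge x_{\min},\ \sigma > 0 \\
\text{Poisson:}\quad & P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots
\end{align*}
```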

| Distribution examples found in natural and unnatural phenomena
Power law (also known as the 80:20 rule or Pareto principle), exponential, and log-normal distributions are addressed in the literature, which tends to search for a specific statistical distribution to discover the characteristics of many natural and unnatural phenomena. The example phenomena reflecting these distributions are compiled and categorized from Limpert et al. (2001), Milojević (2010), Newman (2004), White et al. (2008), or other resources with the given references as follows:
• Natural phenomena fitting log-normal: element concentration in the Earth's crust, latent periods (from infection to the first symptoms) of infectious diseases [e.g., the incubation period of Coronavirus disease 2019 (COVID-19), SARS (severe acute respiratory syndrome), and MERS (Middle East Respiratory Syndrome); Backer et al., 2020], the abundance of bacteria on plants;
• Unnatural phenomena fitting log-normal: number of letters per word, number of words per sentence, age of first marriage in Western;
• Natural phenomena fitting exponential: damage in nuclear power incidents and accidents before 1980 (Wheatley et al., 2017), moderate-sized disasters (observed sea-level variations, wind velocity, annual river floods) (Pisarenko & Rodkin, 2010), the arrival rate of cosmic ray alpha particles or Geiger counter ticks (Tobias, 2012);
• Unnatural phenomena fitting exponential: time to failure patterns (also in natural phenomena) (Frank, 2009), modeling malware propagation delays (Wang & Murynets, 2013), frequency of Korean family names (power law in family names in the world), intervals between aircraft arrivals to major airports (Willemain et al., 2004), the inter-arrival times of 911 calls (Albert, 2011), the time between goals in World Cup football matches (Chu, 2003), the dispersion of U.S. incomes, which was qualified as a kind of thermal equilibrium (Bartels, 2012);
• Natural phenomena fitting power law: island sizes, lake sizes, flood magnitudes, species body sizes, individual body sizes (White et al., 2008), basic community structure descriptors (number of species, links, and links per species) with the area (Galiana et al., 2022); and
• Unnatural phenomena fitting power law: author productivity, citations received by papers, scattering of scientific literature (Milojević, 2010), component sizes in component-based software development (Sharma & Pendharkar, 2022).
The above phenomena are provided to introduce the statistical distributions by examples revealing their diversities and to allow the researchers to relate them to the distributions observed in their datasets.

| Methods for testing distribution fits
A statistical distribution could fit a given distribution (i.e., the truth) for the values (x, binary features in our case) greater than or equal to a minimum value (x_min). For each statistical distribution to be fitted to the truth, an algorithm first estimates the minimum value by minimizing the Kolmogorov-Smirnov statistic (Clauset et al., 2009) and then estimates the statistical distribution parameters. Two tests were used to validate the plausibility of the estimated power law, log-normal, and exponential fits:
• a bootstrap test for each estimated distribution, yielding a goodness-of-fit value and a pl-value (plausibility value), and
• Vuong's test, yielding the total likelihood ratio and the one-sided and two-sided pl-values for comparing the first candidate distribution fit against a second fit whose x_min is set equal to the first fit's x_min.
For example, a log-normal fit estimated with a specific x_min value is denoted as ln*, whereas a power law fit that is calculated based on the same x_min value is denoted as pl. The comparison is expressed as pl vs ln*.
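The x_min estimation step described above can be sketched as follows. This is a simplified illustration in Python, assuming a continuous power law with a closed-form maximum-likelihood exponent. It is not the study's R implementation, and the function names and the synthetic data are hypothetical.

```python
import math

def fit_power_law_tail(values, xmin):
    """Fit a continuous power law to values >= xmin.

    Returns (alpha, ks) where alpha is the maximum-likelihood exponent and
    ks is the Kolmogorov-Smirnov distance between the empirical and fitted
    CDFs of the tail, or None if the tail is too small or degenerate.
    """
    tail = sorted(v for v in values if v >= xmin)
    n = len(tail)
    log_sum = sum(math.log(v / xmin) for v in tail)
    if n < 2 or log_sum == 0.0:
        return None
    alpha = 1.0 + n / log_sum
    ks = 0.0
    for i, v in enumerate(tail):
        model_cdf = 1.0 - (v / xmin) ** (1.0 - alpha)
        # Compare the model CDF with the empirical CDF just before and after v.
        ks = max(ks, abs((i + 1) / n - model_cdf), abs(i / n - model_cdf))
    return alpha, ks

def estimate_xmin(values, min_tail=5):
    """Scan candidate x_min values and keep the one with the smallest KS distance."""
    best = None
    for xmin in sorted(set(values)):
        if sum(1 for v in values if v >= xmin) < min_tail:
            break  # remaining candidates leave even fewer tail points
        fit = fit_power_law_tail(values, xmin)
        if fit is None:
            continue
        alpha, ks = fit
        if best is None or ks < best[2]:
            best = (xmin, alpha, ks)
    return best

# Synthetic illustration only: a power-law tail (alpha = 2.5, lower bound 10)
# built from evenly spaced quantiles, plus non-power-law values below 10.
tail_part = [10.0 * (1.0 - (i + 0.5) / 200) ** (-1.0 / 1.5) for i in range(200)]
noise_part = [float(k) for k in range(1, 10) for _ in range(20)]
data = tail_part + noise_part

best = estimate_xmin(data)
print(best)
```

On this synthetic input the scan is expected to discard the values below 10 and recover an exponent close to 2.5. Real feature frequencies are discrete counts, so a discrete variant of the fit would normally be preferable.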
In the bootstrap test, the pl-value indicates the plausibility of the given statistical distribution by simulating multiple instances of the truth and re-inferring the fitted distribution (Gillespie, 2015). In Vuong's test, the Kullback-Leibler information criterion is used to measure the closeness of the two given statistical distributions to the truth through likelihood-ratio statistics, testing the hypothesis that they are equally close (Vuong, 1989). Besides the sign of the test value, which specifies which statistical distribution has the better fit (positive for the first and negative for the second statistical distribution), the following pl-values are provided:
• the pl-value (one-sided) indicates the plausibility of the better statistical distribution if it exists, and
• the pl-value (two-sided) shows whether both distributions are equally close to or far from the truth.
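A minimal sketch of the likelihood-ratio comparison described above is given below, assuming the common normal approximation for Vuong's statistic and continuous exponential and power law densities on the tail. The function names, the population-variance choice, and the synthetic data are assumptions made for illustration and do not reproduce the study's software.

```python
import math

def vuong_test(loglik_a, loglik_b):
    """Compare two non-nested fits evaluated on the same observations.

    loglik_a and loglik_b hold per-observation log-likelihoods.
    Returns (z, p_one_sided, p_two_sided); z > 0 favours fit A.
    """
    n = len(loglik_a)
    diffs = [a - b for a, b in zip(loglik_a, loglik_b)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / n  # population variance
    z = sum(diffs) / math.sqrt(n * var)
    p_one = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z > z) under equal closeness
    p_two = math.erfc(abs(z) / math.sqrt(2))    # P(|Z| > |z|)
    return z, p_one, p_two

def exponential_loglik(tail, xmin):
    """Per-observation log-likelihood of a shifted exponential with MLE rate."""
    rate = len(tail) / sum(v - xmin for v in tail)
    return [math.log(rate) - rate * (v - xmin) for v in tail]

def power_law_loglik(tail, xmin):
    """Per-observation log-likelihood of a continuous power law with MLE exponent."""
    alpha = 1.0 + len(tail) / sum(math.log(v / xmin) for v in tail)
    return [math.log((alpha - 1.0) / xmin) - alpha * math.log(v / xmin) for v in tail]

# Synthetic illustration only: evenly spaced quantiles of a shifted exponential.
xmin = 1.0
tail = [xmin - 2.0 * math.log(1.0 - (i + 0.5) / 200) for i in range(200)]

z, p_one, p_two = vuong_test(exponential_loglik(tail, xmin),
                             power_law_loglik(tail, xmin))
print(z, p_one, p_two)
```

Both fits share the same x_min, mirroring the convention in the text that the second candidate is evaluated at the first candidate's x_min. A positive z with a small one-sided value here indicates that the exponential fit is closer to the synthetic data than the power law fit.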

| Online supplementary experimentation platform, software, and datasets
Whether a statistical distribution fits a given distribution should be examined thoroughly, preferably by cross-checking with different methods. Hence, a software library with a comprehensive set of statistical tests was implemented in R (a software environment for statistical computing and graphics), based on the poweRlaw package (Gillespie, 2015), to assess the fit of various statistical distributions.
The online materials provided are as follows. The open-source code implementation, the datasets (in open office spreadsheet and R data format), and other supplementary materials (e.g., the charts and tables provided in this study and the complete dump of the distribution fit tests) are available online so that readers can review them in detail and use them in their own work.

| Case study
The mobile malware classification problem was chosen as the case study domain because it is a critical emerging cyber security field where ML-based classification approaches are highly studied and practiced in the literature and industry to enhance the capacities related to the human factor (Andrade & Yoo, 2019) (see Box 1). The method is verified by a demonstration that examines and compares the negative (benign) and positive (malign) datasets used in various binary classification (malware classification) studies based on binary features (application permission requests), as summarized in Table 1. Permissions are credentials requested by Android applications before they can use specific system data and facilities such as sending SMS. They are binary flags for the Android platform's primary access control to provide privacy and secure data/information in mobile devices. Many studies in Android malware classification include examining the permission feature frequency distribution in their datasets (Canbek, Sagiroglu, et al., 2017). Although they highlight the dramatic decrease in the frequencies, the distribution that provides valuable insight to qualify the datasets has not been analyzed before. Interestingly, the distributions seem to exhibit common characteristics for all datasets at first glance.

| Case study datasets
Although the literature proposes several approaches to detect Android mobile malware, the datasets are not as diverse as the approaches (Canbek et al., 2018). Table 2 lists the basic quantitative information for the datasets and introduces the related studies. The two dimensions, namely, sample-space size (m) and feature-space size (n), and the prevalence (PREV) values are listed. In the related literature, authors are observed to compare their malware classification performance with others, most of which are based on different benign and malign datasets. The case study can help gain insights into those datasets.

BOX 1 Introduction to Android mobile-malware classification problem
Android is a mobile platform that provides various mobile applications. Android applications can be developed by anyone and released on third-party application markets besides the official market named Google Play. Despite this diversity, the platform could be the target of malicious people who develop applications, or make injections into existing applications, that expose end-users to risks. Malware authors develop and use different techniques in those applications, which appear legitimate, to overcome the platform's security or exploit human factors. Therefore, mobile malware detection, which is labeling a given application as "benign" (negative) or "malign" ("positive," also known as "malware"), is one of the urgent areas to be studied by the security sector and academia. Experts examine the applications manually with the help of specialized tools (e.g., reverse engineering software) and decide whether they are benign or malign. This human-involved process is called malware analysis. In addition to dynamic malware analysis, which concentrates on applications' behavior observed at run-time, static malware analysis examines binaries, files, and code to distinguish Android malware from benign applications. Android's permission mechanism limits the specific operations performed by applications or provides ad hoc access to particular data at the end-user's discretion. For example, if an application must initiate a phone call without going through the standard dialer user interface for the user to confirm the call, it must declare or request the CALL_PHONE permission. Please refer to the Android API (Application Programming Interface) documentation for the list of the permissions and their descriptions at https://developer.android.com/reference/android/Manifest.permission.html. More information can also be found in Appendix B (Canbek, 2021).

This study reviewed 11 academic studies providing Android mobile benign and malign datasets, listed in Table 2, and selected six datasets for comparison. The DS0 dataset listed in the first row of Table 2 has not only a higher number of samples but also the highest number of malware (positive-class examples) compared with the other datasets. Note that two published datasets, one from 2011 and one from 2012, were combined into one dataset (DS5; Peng et al., 2012). The six datasets (DS6-DS11) encountered in the literature were excluded from this study for the following reasons. The DS9 dataset (Peiravian & Zhu, 2013) is the same as the original DS4 dataset (Jiang & Zhou, 2013). The datasets DS8 (Canfora et al., 2013) and DS10 (Felt et al., 2011) lack one class. Only the top 10 permissions were published for DS6 (Hoffmann et al., 2013), and only the top 20 permissions were published for DS7 (Sarma et al., 2012), so the whole feature space could not be obtained for this study.

| Initial analysis of the feature space-frequency distributions
Table 3 shows the feature space-frequency distribution graph per dataset for each class and the most plausible fit distributions with the ntail ratio (i.e., the ratio of fitted features), which are described below. As seen in the related mini graphs for each dataset in Table 3, all the samples demonstrate a long right tail. The right tail holds rarely requested permissions (low amplitude in the graphs), while the dominating short left part holds frequently requested permissions (high amplitude in the graphs).
Another interesting finding in Table 3 is related to inter-class analysis (i.e., analyzing positive-class versus negative-class datasets). Comparing the permission frequency distribution curves per dataset per class, we can see that the curves of the malign datasets flatten later (the decay is small) compared with those of the benign datasets (sharp drop). The reason behind the sharp drop in the benign datasets is that regular applications use a few specific permissions, whereas malware needs to request a more comprehensive set of permissions. Figure 2 shows the distributions for all datasets in one graph per class, where the y-axis is transformed into a logarithmic scale. Be aware that the x-axis naturally does not represent the same feature sequence across datasets.
However, the inferences from these graphs may be misleading. They should be verified from a statistical viewpoint (Newman, 2004) because the distributions may or may not actually be one of the statistical distributions, namely, power law, log-normal, Poisson, and exponential distributions. If a dataset's features fit into a statistical distribution, the distribution defines the characteristics or nature of the feature space in the dataset. It provides more possibilities for inter-class, intra-class, and extra-class analysis because each statistical distribution has its own theoretical and practical implications for interpretation (Joo et al., 2017).

| RESULTS
Figure 3 shows the charts for benign datasets, whereas Figure 4 shows those for malign datasets; all of them are generated by an R script provided online (DsFeatFreqDistFit.R, see Section 3.3.2). For comparison, the charts for the most plausible statistical distribution fit and the second most plausible one are provided for the same dataset. Figure 3a and b show the log-normal distribution as the best fit and the power law distribution as the second-best fit, respectively, for benign DS0. Because frequency distributions of a wide variety of phenomena tend to follow, at least approximately, a power law distribution (White et al., 2008), it was expected that feature frequencies would fit a power law (at least for benign/negative datasets, due to their naturalness compared with the malign/positive datasets, whose features are required for malicious purposes). However, the results were different.
Both Figure 3 and Table 4 show that all benign datasets exhibit a log-normal distribution. The benign fits are valid at high ntail percentages except for the benign DS1 dataset with only 254 samples. The second most plausible fit is the power law distribution for all datasets. As stated above, the Poisson distribution is not a plausible fit for the feature frequency distribution of any dataset, no matter what the class is. The feature frequency distributions of malign datasets differ from each other. Unlike benign datasets, malign DS3 and DS4 exhibit an exponential distribution, and DS2 is also close to an exponential distribution (higher ntail ratio but lower pl-value).
This should not be considered a generalized rule. However, considering the example phenomena above, an interesting finding is that log-normal and partly power law (e.g., except earthquakes, solar flares, and war intensities) distributions represent stable phenomena like benign applications' permission request distribution. In contrast, the exponential distribution describes somewhat chaotic phenomena like malign applications' permission request distribution. Another finding is that datasets with a high number of samples tend to exhibit a log-normal distribution.
TABLE 4 Plausibility of permission feature frequency fits into log-normal (ln), exponential (ex), and power law (pl) statistical distributions
Although it is not a focus in the literature, the number of fitted features was also taken into account because the fitted and unfitted features provide more insight into the datasets and classes. Table 4 summarizes the analysis of the plausibility of permission feature frequency distribution fits into log-normal, exponential, and power law statistical distributions. "ntail" indicates the ratio of fitted features to the number of features (n) in the dataset as a percentage. Examining the results, benign datasets have higher ntail ratios compared to malign datasets.
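The ntail ratio defined above reduces to a simple count. The following Python sketch illustrates it, taking fitted features to be those with a frequency of at least x_min; the function name and the frequency values are hypothetical.

```python
def ntail_ratio(frequencies, xmin):
    """Percentage of features whose frequency is at least xmin (the fitted features)."""
    fitted = sum(1 for f in frequencies if f >= xmin)
    return 100.0 * fitted / len(frequencies)

# Hypothetical feature frequencies sorted in decreasing order.
frequencies = [950, 400, 120, 35, 8, 3, 1, 1]

ratio = ntail_ratio(frequencies, xmin=8)
print(ratio)  # 62.5, since 5 of the 8 features have a frequency of at least 8
```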
The tabular presentation is useful to see the corresponding values all at once, but how well the distributions fit the truth may not be easily sensed. Therefore, compact charts were prepared to show the original feature frequency distribution (the truth) and the plausible statistical distribution.

| FITTED/UNFITTED FEATURES ANALYSIS
In this study, further analysis was conducted on the fitted features, where x ≥ x_min, and the unfitted features, where x < x_min. Figure 5 shows the result of our analysis of permission feature spaces for all benign and malign datasets. Note that C+ and C− denote the fitted and unfitted features for class C (e.g., malign and benign), respectively, whereas C++ and C−− denote the exclusive fitted and unfitted features (i.e., only fitted or unfitted in one class, not in the other), respectively.
There are three sets in Figure 5:
• The green one (B+) is the intersection of fitted features of the benign datasets with 38 common features
• The red one (M+) is the same for malign datasets with 22 common features
• The orange one (M−) is the intersection of unfitted features of the malign datasets with 14 common features
Note that the intersection of unfitted features of the benign datasets (B−) is empty in this study. Fitted or unfitted features are determined in the most plausible statistical distribution per dataset. The DsFeatFreqDistFit.R script finds and displays the fitted and unfitted features per dataset. The features are then intersected per class by using another functionality (getCommonFeatures) in the provided package. The features are also provided in the extra materials online at https://github.com/gurol/dsfeatfreqdist. This approach could be useful for gaining insight into the feature space, especially regarding the inter-dataset and inter-class analyses, which could be valuable for feature selection and dataset comparison activities.
FIGURE 5 Venn diagram for fitted/unfitted features. The numbers in braces show the feature counts. "+" superscript denotes fitted features whereas "−" denotes unfitted ones
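The set operations behind this analysis can be sketched in Python as follows. This is an illustration and not the getCommonFeatures functionality of the provided R package. The function names, the frequency counts, and the x_min values are hypothetical, and the exclusive sets are taken here as plain set differences between the per-class intersections.

```python
def split_by_xmin(freq, xmin):
    """Split feature names into fitted (count >= xmin) and unfitted (count < xmin)."""
    fitted = {name for name, count in freq.items() if count >= xmin}
    return fitted, set(freq) - fitted

def common(feature_sets):
    """Features present in every set of a non-empty list (per-class intersection)."""
    return set.intersection(*feature_sets)

# Hypothetical (frequency table, x_min) pairs for two datasets per class.
benign = [
    ({"INTERNET": 90, "RECORD_AUDIO": 20, "SET_WALLPAPER": 2}, 10),
    ({"INTERNET": 80, "RECORD_AUDIO": 15, "SET_WALLPAPER": 12}, 10),
]
malign = [
    ({"INTERNET": 95, "SET_WALLPAPER": 40, "RECORD_AUDIO": 3, "CLEAR_APP_CACHE": 1}, 10),
    ({"INTERNET": 70, "SET_WALLPAPER": 30, "RECORD_AUDIO": 5, "CLEAR_APP_CACHE": 2}, 10),
]

benign_splits = [split_by_xmin(freq, xmin) for freq, xmin in benign]
malign_splits = [split_by_xmin(freq, xmin) for freq, xmin in malign]

b_plus = common([fitted for fitted, _ in benign_splits])       # B+
b_minus = common([unfitted for _, unfitted in benign_splits])  # B-
m_plus = common([fitted for fitted, _ in malign_splits])       # M+
m_minus = common([unfitted for _, unfitted in malign_splits])  # M-

m_plus_exclusive = m_plus - b_plus     # M++: fitted only in the malign class
b_plus_exclusive = b_plus - m_plus     # B++: fitted only in the benign class
m_minus_exclusive = m_minus - b_minus  # M--: unfitted only in the malign class

print(m_plus_exclusive, b_plus_exclusive, m_minus_exclusive)
```

With these toy inputs the exclusive fitted features are SET_WALLPAPER for the malign class and RECORD_AUDIO for the benign class, and the benign unfitted intersection is empty.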
For binary classification, that is, classifying benign and malign Android mobile applications, the discriminative features having inter-class differences should be taken into account while performing these activities. Therefore, the features in the M++, B++, and M−− parts could be discriminative or could be evaluated on a new type of malware. Considering malware detection, the M++ and M−− features should be examined first among the 122 permission features. For the sake of saving space, our initial interpretations highlight the following features:
• in M++: WRITE_HISTORY_BOOKMARKS is a dangerous type permission,
• in M−−: ACCESS_MOCK_LOCATION, CLEAR_APP_CACHE, and WRITE_CALENDAR are dangerous, SET_PREFERRED_APPLICATIONS has been deprecated since Android API 7, and BATTERY_STATS and BIND_WALLPAPER are signature or system type permissions.
The other features are standard type permissions.
Because fitted features are the most frequent features, the exclusive fitted features of each class (M++ and B++) are the most discriminative ones. We cannot say the same for common (nonexclusive) fitted features (BM+) that are observed in both classes. For example, an Android application that requires the SET_WALLPAPER and/or WRITE_HISTORY_BOOKMARKS permissions in M++ is more likely to be classified as a malign application. In other words, those permissions are not among the frequently requested (fitted) permissions of benign applications. Likewise, it is rare that a malign application requests, for example, RECORD_AUDIO or USE_CREDENTIALS, which are exclusive fitted features in benign datasets (B++).
BOX 2 Prepare your data and run your experiment using the provided script
To reproduce the results presented in this manuscript or conduct a similar experiment for your datasets, please prepare a spreadsheet and run the commands in R or RStudio according to the instructions given below. For more information and to download the script files (DsFeatFreqDistFit.R and utils.R), visit https://github.com/gurol/DsFeatFreqDistFit.
Note that a larger no_sim_count dramatically increases the time needed to run the code. Use 3, for example, for the first attempts. The commands above and comments can be found in the DsFeatFreqDistFit.R file.
Exclusive unfitted features in malign datasets (M−−) represent the less frequently observed binary features (permissions, e.g., WRITE_CALENDAR) that are requested by malign applications. Finally, the features fitted in benign datasets but unfitted in malign datasets represent the permissions that are more likely requested by benign applications (e.g., BLUETOOTH).
Note that those inferences give more insights to domain experts, such as malware analysts, who know the features in detail. Determining the distribution of the features could be useful for other activities. For instance, the geometric mean should be used to determine the central tendency of a log-normal distribution instead of the arithmetic mean (Box 2).
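As a small illustration of the last point, the following Python sketch compares the two means on a right-skewed sample. The numbers are hypothetical and are not taken from the study's datasets.

```python
import math
import statistics

# Hypothetical right-skewed feature frequencies (one very frequent feature).
counts = [1, 2, 4, 8, 1024]

arithmetic = statistics.mean(counts)           # pulled up by the largest value
geometric = statistics.geometric_mean(counts)  # exp of the mean of the logs

# The geometric mean equals exp(mean(log x)); for log-normal data this
# estimates exp(mu), which is the median of the distribution.
via_logs = math.exp(statistics.fmean(math.log(x) for x in counts))

print(f"arithmetic mean: {arithmetic:.2f}")
print(f"geometric mean:  {geometric:.2f}")
```

The arithmetic mean is dominated by the single largest count, whereas the geometric mean stays close to the bulk of the values.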

| CONCLUSION
In the shade of the Garbage In, Garbage Out (GIGO) rationale, high classification performance could be possible only when an optimal (well-modeled or robust) classifier is trained on sufficient datasets. This requirement is especially crucial in domains where proper benchmark datasets are not available. This study suggests that one of the initial insights (before conducting an ML workflow, e.g., feature selection) into the sufficiency of datasets in ML-based classification studies is quantifying the binary feature space distribution of datasets. Hence, the distribution of binary features is essential to gain an initial insight.
The approach has been tested on 11 Android malware/benign application datasets in the literature, and the results are interpreted. This study, with an in-depth look at the feature space distribution, provides insight into datasets by examining their similarity to various statistical distributions. Interestingly, it was observed that the features in our example benign/malign application datasets exhibit a long right tail (holding the rare features, while the short left part holds the frequent features). Therefore, we can look for a fit to the log-normal, exponential, power law, and Poisson statistical distributions. In the 11 experimental datasets, all the benign datasets exhibit a log-normal distribution rather than an exponential, power law, or Poisson distribution. In the malign datasets, a higher plausibility of exponential distributions was observed: two malign datasets fit exponential distributions. Considering the findings, the distribution of the feature space among the samples in a dataset should also be analyzed to see whether it is close to precisely one of the probability distributions. If it is, the fitted statistical distribution should be provided as an informative meta-feature together with the distribution parameters. The parameters are the ntail ratio for all types of distribution fits, the mean and standard deviation parameters for the log-normal fit, the rate parameter for the exponential fit, or the alpha parameter (also known as the exponent or scaling parameter) for the power law fit. The plausibility test results, such as those in Table 4, could also be provided for further information, which would be a good habit for avoiding publication bias.
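To make the suggested meta-feature parameters concrete, the sketch below computes them for the tail of a frequency list. It is only an approximation under stated assumptions and not the fitting procedure of DsFeatFreqDistFit.R. The function name and the toy data are hypothetical. The ntail ratio is assumed to be ntail divided by the feature-space size n. The power law alpha and exponential rate use the textbook closed-form maximum likelihood estimators for continuous data. The log-normal parameters are plain moments of the log values that ignore the truncation at x_min.

```python
import math


def tail_meta_features(freqs, x_min):
    """Rough meta-feature parameters for the tail (x >= x_min) of a sample.

    All estimators are simplified continuous-data approximations.
    """
    tail = [x for x in freqs if x >= x_min]
    if not tail:
        raise ValueError("no values at or above x_min")
    n, ntail = len(freqs), len(tail)

    # Log-normal: mean and standard deviation of log(x), truncation ignored.
    logs = [math.log(x) for x in tail]
    meanlog = sum(logs) / ntail
    sdlog = math.sqrt(sum((v - meanlog) ** 2 for v in logs) / ntail)

    # Exponential shifted to start at x_min: rate = 1 / (mean(x) - x_min).
    excess = sum(x - x_min for x in tail)
    rate = ntail / excess if excess > 0 else math.inf

    # Continuous power law: alpha = 1 + ntail / sum(log(x / x_min)).
    log_ratio_sum = sum(math.log(x / x_min) for x in tail)
    alpha = 1 + ntail / log_ratio_sum if log_ratio_sum > 0 else math.inf

    return {
        "ntail": ntail,
        "ntail_ratio": ntail / n,  # assumed definition: ntail / n
        "lognormal_meanlog": meanlog,
        "lognormal_sdlog": sdlog,
        "exponential_rate": rate,
        "powerlaw_alpha": alpha,
    }


# Hypothetical feature frequencies and cut-off.
meta = tail_meta_features([1, 2, 4, 8, 16], x_min=2)
for name, value in meta.items():
    print(f"{name}: {value:.4f}")
```

In practice only the parameters of the most plausible distribution would be reported alongside the ntail ratio. A dedicated fitting library that handles discrete data and truncation would be preferable to these closed forms.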
This study is also a reference for ML studies in terms of compiling the natural and unnatural phenomena that exhibit the power law, log-normal, and exponential statistical distributions. The compiled examples give hints about the observed feature distribution fits. In this regard, it is found that permission requests of benign applications follow log-normal distributions, a sort of stable phenomenon. In contrast, malware's permission requests tend to follow exponential distributions, relatively chaotic phenomena. This similarity found in the Android mobile platform could be looked for in benign/malign software on other mobile and desktop platforms. The initial findings represented in Figure 5 could be evaluated further by Android malware domain experts as well as ML researchers in feature selection, and the approach could be followed in other domains. Taken together, this study highlights that such exploratory analyses should be involved more in ML studies. It demonstrates the method along with ready-to-use open-source scripts and comprehensive accompanying materials for researchers who come from different disciplines and may be unfamiliar with the applicable statistical methods.

ENDNOTES
1 The proportion of total positive samples (m_P), e.g., having a malign characteristic, in total sample size [m_P + m_N]
2 The long right-tail statistical distribution is evident only in the tail (x_min > x). Poisson decays exponentially, power law decays polynomially. The sharp drop is not surprising, whereas the long right-tail is surprising.

3.3.1 | Online experimentation platform (https://codeocean.com/capsule/3624528/tree/v1)
The reproducible detailed results can be obtained via an online experimentation capsule in CodeOcean. No programming is required.

3.3.2 | The dataset feature frequency distribution fitting software and datasets (https://github.com/gurol/DsFeatFreqDistFit)

Comparison of 11 negative and positive-class datasets feature space-frequency distributions with mini graphs and the name of the most plausible distribution fits (n is the feature-space size and the values in braces are ntail)

FIGURE 2 Frequency distribution in a log scale (y-axis) of the feature space (x-axis) per dataset for each class. Dashed lines are added to see the linear trend

FIGURE 3 Feature frequency distribution and power law, log-normal, exponential, and Poisson fits per benign dataset (for DS0, DS1, and DS2: Left charts: The best fit, right charts: The second-best fit). X-axis: Feature counts, y-axis: Ranks (generated via the DsFeatFreqDistFit.R script)

FIGURE 4 Feature frequency distribution and power law, log-normal, exponential, and Poisson fit per malign dataset (for DS2 and DS3: Left charts: The best fit, right charts: The second-best fit). X-axis: Feature counts, y-axis: Ranks (generated via the DsFeatFreqDistFit.R script)

DS1, DS2, DS3, and DS5) and one positive-only dataset (DS4).
TABLE 2 Case study datasets: Summary of sample and feature spaces of the benign (negative) and malign (positive) datasets
a The positive-class datasets contain AMGP samples.