This article emphasises the negative consequences of the “Garbage In, Garbage Out” (GIGO) principle and the importance of ensuring dataset quality in Machine Learning (ML) based classification applications to achieve high and generalisable performance. Researchers should integrate the insights gained from quantitative analysis of the datasets’ sample and feature spaces into the initial ML workflow. As a specific contribution towards this goal, a complete approach is suggested for quantifying datasets in terms of feature frequency distribution characteristics (i.e. how frequently the features occur across the samples comprising the datasets). The approach is demonstrated on eleven benign and malign (malware) Android application datasets belonging to six academic Android mobile malware classification studies. The permissions requested by the applications, such as CALL_PHONE, compose a relatively high-dimensional binary feature space. The results show that the distributions fit well to two of the four long right-tailed statistical distributions considered: log-normal, exponential, power law, and Poisson. Specifically, log-normal was the most frequently exhibited distribution, except for two malign datasets that followed an exponential distribution. The study also analyses which features fit, and which deviate from, the fitted distributions, further enhancing the insight into the feature space.
Further, the study compiles examples of phenomena from the literature that exhibit these statistical distributions and that should be considered when interpreting the fitted distributions. In conclusion, applying well-formed statistical methods provides a clear understanding of the datasets and of the intra-class and inter-class differences before proceeding to feature selection and classifier model building. Feature distribution characteristics should be among the first properties to analyse beforehand.
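For illustration only, the Python sketch below shows one way such a feature-frequency analysis could be set up. It assumes a NumPy/SciPy environment and a binary application-by-permission matrix, and it compares the continuous candidates with a Kolmogorov–Smirnov statistic, with SciPy’s Pareto distribution standing in for the heavy-tailed power law; it is a minimal sketch of the general technique, not the exact procedure used in the study.

    import numpy as np
    from scipy import stats

    def feature_frequency_fits(X):
        """Fit candidate right-tailed distributions to per-feature frequencies.

        X is assumed to be a binary (n_samples x n_features) matrix, one column
        per requested permission (e.g. CALL_PHONE); a 1 means the application
        requests that permission.
        """
        freqs = np.asarray(X).sum(axis=0)        # how many samples exhibit each feature
        freqs = freqs[freqs > 0].astype(float)   # keep positive counts only

        # Continuous candidates; SciPy's Pareto stands in for the heavy-tailed
        # power law, and the discrete Poisson is handled separately below.
        candidates = {"log-normal": stats.lognorm,
                      "exponential": stats.expon,
                      "power law (Pareto)": stats.pareto}

        results = {}
        for name, dist in candidates.items():
            params = dist.fit(freqs)
            # Kolmogorov-Smirnov statistic: lower values indicate a closer fit.
            ks_stat, p_value = stats.kstest(freqs, dist.cdf, args=params)
            results[name] = {"params": params, "ks": ks_stat, "p": p_value}

        # Poisson rate estimated by the sample mean; a chi-square goodness-of-fit
        # test would be the appropriate check for this discrete candidate.
        results["Poisson"] = {"rate": freqs.mean()}
        return results

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Synthetic 500 apps x 200 permission-like binary features with uneven
        # request probabilities, purely to exercise the function.
        probs = rng.uniform(0.01, 0.6, size=200)
        X_demo = (rng.random((500, 200)) < probs).astype(int)
        for name, info in feature_frequency_fits(X_demo).items():
            print(name, info)

In practice, the synthetic matrix would be replaced by the actual permission vectors extracted from the benign and malign applications in each dataset, and the fitted parameters and goodness-of-fit measures would then be compared within and across classes.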