MultiDMet: Designing a Hybrid Multidimensional Metrics Framework to Predictive Modeling for Performance Evaluation and Feature Selection

doi:10.21203/rs.3.rs-3111777/v1


Research Article


https://doi.org/10.21203/rs.3.rs-3111777/v1

This work is licensed under a CC BY 4.0 License

Version 1


In a competitive digital age where data volumes keep growing, the ability to extract meaningful knowledge from high-dimensional data using machine learning (ML) and data mining (DM) techniques, and to make decisions based on that knowledge, is becoming increasingly important in all business domains. Nevertheless, high-dimensional data remains a major challenge for classification algorithms because of its high computational cost and storage requirements. The 2016 Demographic and Health Survey of Ethiopia (EDHS 2016), which is publicly available and was used as the data source for this study, contains several features that may not be relevant to the prediction task. In this paper, we developed a hybrid multidimensional metrics framework for predictive modeling that supports both model performance evaluation and feature selection, in order to overcome feature selection challenges and to select the best model among those available in DM and ML. The proposed hybrid metrics were used to measure the efficiency of the predictive models. Experimental results show that the decision tree algorithm is the most efficient model. Its higher score of HMM (m, r) = 0.47 indicates an overall significant model that meets almost all of the user's requirements, unlike the classical metrics, which use a single criterion to select the most appropriate model. The ANNs, on the other hand, were found to be the most computationally intensive for our prediction task. Moreover, the type of data and the class sizes of the dataset (unbalanced data) have a significant impact on the efficiency of the model, especially on the computational cost, and can hamper the interpretability of the model's parameters. The efficiency of the predictive model could be improved with other feature selection algorithms (especially hybrid metrics) that take the views of domain experts into account, since understanding of the business domain has a significant impact.

Predictive Modeling

Hybrid Metrics

Feature Selection

Model Selection

Algorithm Analysis

Machine Learning

In today's digital age, we are experiencing an explosion in data volume, data variety, and data dimensions. With the increase of high dimensional data in the last decade, feature selection becomes a necessary step in the field of machine learning, data mining, pattern recognition and statistics as one of the solutions to overcome the curse of dimensionality. Feature selection refers to the process of identifying a subset of the most relevant features that provides representative results for the original set of features [1–5]. As data sets grow larger over time, the ability to extract knowledge hidden in these large data sets and make decisions based on the extracted knowledge is becoming increasingly important in all business domains. The application of data mining techniques is at the heart of the pattern recognition and knowledge extraction process in all domains [6, 7]. Nevertheless, high-dimensional data (HDD) poses a major challenge to classification algorithms for both machine learning (ML) and data mining (DM) models due to its high computational cost and storage requirements [8]. Hypothetically, a larger number of features implies higher discriminative power in classification. In practice, however, this is not always true because the collected features are often not all equally informative, as some of them may be interdependent. In addition, high-dimensional data can cause the problem of the "curse of dimensionality" [9].

Feature extraction and selection methods are used in isolation or in combination to improve performance, including estimation accuracy, visualization, and understandability of the learned knowledge [10, 11]. In general, features can be categorized as relevant, irrelevant, or redundant. The best subset is the one with the fewest dimensions that contribute the most to learning accuracy [12]. The Demographic and Health Survey of Ethiopia dataset (EDHS 2016), which was used as the data source for this study, contains detailed information on respondents' background characteristics that may not be relevant for predicting contraceptive use [13]. In general, health care is considered "information rich" but "knowledge poor" due to the lack of effective analytical tools to discover hidden relationships and trends in the data. Data mining is used in biomedical sciences and research because it is a rapidly developing technology that can extract useful knowledge, and scientific decision making for the diagnosis and treatment of diseases based on databases is becoming increasingly important [14]. It can also improve the management of hospital information and promote the development of telemedicine and community medicine [15]. It is becoming increasingly popular in the healthcare industry because of its efficiency, for example in the health insurance sector, which can use it to minimize fraud and abuse [16–19]. DM and ML algorithms search for meaningful patterns in raw datasets and help to make scientific decisions from the database.

On the other hand, both DM and ML are computationally expensive for large datasets, which may contain several features irrelevant to the analysis. Reducing the dimensionality can therefore effectively reduce these costs. In summary, feature selection methods help to reduce the dimensionality of the feature space (limiting memory requirements and speeding up the algorithm), improve the runtime of the learning algorithms, improve data quality, increase performance, and make it easier to understand or visualize the data and the process that generated it [12]. Several feature selection algorithms have been proposed and studied for ML and DM applications, and they can be broadly classified into three categories: filter, wrapper, and embedded methods. Filter methods use inherent properties of the data, such as information-based measures, distances, or statistical information, to evaluate the quality of a selected subset. However, the resulting accuracy is not guaranteed [20–22]. Wrapper methods use the prediction accuracy of a predetermined learning algorithm to determine the relevance of the selected subsets, and the accuracy of the learning algorithm is usually high. However, the generality of the selected features is limited (they tend to overfit on small training sets), and wrappers are very computationally intensive for high-dimensional data [23–26]. Embedded methods include feature selection as part of the training process and are usually specific to certain learning algorithms, such as SVM, ANN, and DT classifiers, and thus may be more efficient than the other two categories [27–30]. In this work, we developed a hybrid multidimensional metrics framework for predictive modeling, covering performance evaluation and feature selection, to overcome the challenges in feature selection and to select the best model among those used in DM and ML. The contributions are fourfold:

1) We proposed a novel feature selection method based on multidimensional metrics, including but not limited to the correlation test, the chi-square test, expert testimony, and established knowledge.

2) We developed a hybrid multidimensional metric for model selection that combines confusion matrix analysis, ROC curve analysis, statistical significance, practicality or applicability, computation time, simplicity of rule extraction, and consistency with existing knowledge.

3) We compared the proposed metrics with the classical approaches most commonly used in feature selection and model selection.

4) We showed researchers, the scientific community, and academia how these hybrid metrics can be used with DM and ML algorithms in a broader healthcare setting for prediction tasks based on high-dimensional data, on similar or other platforms.

The remainder of this paper is organized as follows. Related work is presented in Section II. Section III describes the framework and problem statement. Experimental results, discussions, and evaluation metrics are presented in Section IV. Conclusions and recommendations are provided in Section V.

This section has two parts: a) applications of predictive modeling in the healthcare industry, with the goal of addressing the fundamental challenges encountered in both feature selection and model selection, supported by real-world datasets; and b) research specific to fertility and the use of contraceptive methods.

A. Applications of Predictive Models in Healthcare Industry

Healthcare seems to be 'information rich' but 'knowledge poor' due to the lack of effective analysis tools to discover hidden relationships and trends in data. DM and ML are two effective techniques for analyzing data and finding hidden patterns that can be used for medical decision making [31]. ML can benefit the health care system in various ways, such as shortening treatment time and detecting the causes and symptoms of diseases [32]. The application of data mining techniques has long been encouraged in healthcare. For example, health insurance fraud and abuse have led many health insurers to try to reduce their losses by using data mining tools to find and prosecute offenders [16, 33]. In the commercial world, data mining tools are mainly used for fraud detection. Data mining is actively used in diagnosis and treatment, healthcare resource management, customer relationship management, and fraud and anomaly detection [34]. Numerous healthcare companies, hospitals, and pharmaceutical manufacturing facilities use data mining tools because of their efficiency [34]. The predominance of data mining and machine learning algorithms over traditional statistical methods in healthcare applications and predictive tasks is probably due mainly to their promising accuracy and more reliable results: they process large amounts of data more efficiently, are more flexible, and can handle any type of data [35].

A surveillance system that uses data mining techniques to detect new and interesting patterns in infection control data has been implemented at the College of Alabama [36]. Data mining techniques have been used to examine reporting practices based on International Classification of Diseases, 9th Revision, codes (risk factors). By reconstructing patient profiles, cluster and association analyses can show how risk factors are reported [37]. To improve its ability to prospectively identify high-risk patients, American Healthways uses predictive modeling technology [38]. Data mining tools have been used for fertility demographic analysis to determine which attributes have the greatest impact on a country's fertility rate [39]. In summary, data mining in healthcare is used to evaluate the effectiveness of treatments, manage healthcare, manage customer relationships, and detect fraud and abuse, among other applications [39]. In addition, previous research has demonstrated the application of data mining techniques to predictive tasks, including HIV testing [35], cancer [40], heart disease [41], tuberculosis [42], kidney dialysis [43], diabetes [44], dengue fever [45], hepatitis C [46], and IVF [47].

The most commonly used algorithms for prediction tasks in DM and ML include Decision Trees, Naïve Bayes Classifier, Artificial Neural Networks, Support Vector Machines, and K-Nearest Neighbor [48], based on their accuracy performance. However, the efficiency of a model is not just about accuracy. Many clients have multiple lists of requirements that need to be considered when using a predictive model for its prediction task (flexible metrics). In an environment like healthcare, where data can accumulate over time and potentially take on characteristics of Big Data, it is extremely important to analyze the characteristics and identify latent relationships between the characteristics. However, high-dimensional data has become a real challenge for predictive tasks in DM and ML algorithms because the data may come from multiple sources, which in turn affects performance, is computationally intensive, and introduces the problem of overfitting. Therefore, in order to remove irrelevant and redundant features, feature selection is a necessary preprocessing step of the classification process, which serves to reduce computation time and improve learning accuracy, especially for high-dimensional datasets [49, 50].

B. Application of Data Mining Models to Contraceptive Use

Contraceptive use is considered critical for protecting women's health and rights, influencing fertility and population growth, and promoting economic development, especially in much of sub-Saharan Africa [51]. Data mining consists of the application of algorithms to identify and analyze information to create patterns or models [52]. In a study conducted in southern Brazil, data mining was used to analyze the profile of contraceptive method use in a college population [53]. The study found that the results obtained with the generated rules were largely consistent with the literature and global epidemiology and revealed significant vulnerabilities in the college population [53]. The study validated its results based on accuracy, sensitivity, specificity, and area under the ROC curve and obtained higher or at least similar values compared to recent studies using the same methodology [54]. Another study was conducted to examine in detail how a particular data mining method called General Unary Hypotheses Automaton (GUHA) helps predict women's use of contraceptive methods based on knowledge of their demographic and socioeconomic characteristics [55].

Another study used a data mining approach to analyze patterns of contraceptive use in India by comparing contraceptive use among groups of women with different demographic, economic, cultural, and social characteristics. Decision tree classification and regression algorithms were applied to identify groups of women with different social, economic, cultural, and demographic characteristics who use different contraceptive methods, and then to analyze how the pattern of contraceptive methods differs among these groups [56, 57]. The study found that currently married, nonpregnant women aged 15–49 years in India can be classified into 13 mutually exclusive groups based on six characteristics: surviving children, household standard of living, religion, women's schooling, husband's education, and place of residence. The observed differences in patterns of contraceptive use have important policy and programmatic implications related to universal access to family planning. Another study, conducted at the Pratama Hasanah Pekanbaru clinic, examined contraceptive use data using the C4.5 decision tree. The study found that the collected contraceptive use data needed to be evaluated to determine the pattern of contraceptive choice. Nine attributes (age, duration of use, menstrual cycle, recently married, recently delivered, breastfeeding, already having offspring, health problems, and more than four children) were used to determine this pattern, and the class label was contraceptive use. The classification model achieved 93.15% accuracy [58].

Another study used a data mining classification algorithm to predict the duration of contraceptive use of productive couples, adopting the CRISP-DM process method [59]. Data mining techniques were applied in several experiments using the 2017 Demography and Health Survey of Indonesia (DHSI). The results showed that the AdaBoost technique produced the best-performing contraceptive use prediction model, with a classification accuracy of 85.1%. Data mining was also used to predict the likelihood of contraceptive method use among women aged 15–49 years using the 2005 DHS of Ethiopia, although that survey no longer represents the current sociodemographic status of the population [60]. The experimental results showed that the J48 decision tree performed better than Naïve Bayes, and the model was reported to achieve an accuracy of 82.85% in correctly detecting contraceptive method users. Furthermore, in many DM and ML applications the best-fit model is chosen using only a few criteria, mostly predictive accuracy: the model with the highest accuracy is taken to fit the data best. However, accuracy alone does not give a complete picture of the problem domain, because decision makers may require health interventions based on their policy priorities, and several features may affect the prediction of the outcome. This raises an important question for the scientific community: when should a particular model be used to get a complete picture of the variables relevant to the target concept?
To this end, we designed a novel and more flexible approach to both feature selection and model selection in predictive modeling. It suits our specific problem by taking into account hybrid multidimensional metrics based on the user's requirements, unlike the classical approaches, which rely mainly on unidimensional criteria.

As the world grows in complexity, overwhelming us with the data it generates, data mining becomes the only hope for clarifying the patterns that underlie it [61]. However, both ML and DM processes incur high computational costs when dealing with high-dimensional data, which may contain several features irrelevant to the analysis. Feature selection is therefore an essential step in DM and ML, applied before any algorithm is used to train a classifier, in order to avoid overfitting, improve model performance, and provide faster and more cost-effective models. Selecting optimal features adds an extra layer of complexity to modeling: instead of just finding optimal parameters for the full set of features, the optimal feature subset must first be found and the model parameters then optimized [62]. Several feature selection algorithms have been proposed and studied for ML and DM applications, and they are broadly classified into three categories: filter, wrapper, and embedded methods. Filter methods use inherent properties of the data, such as information-based measures, distances, or statistical information, to evaluate the quality of a selected subset. Because they select the feature subset independently of the learning classifier, they are suitable for high-dimensional datasets, generalize well, and have a low computational cost. However, the resulting accuracy is not guaranteed [20–22]. Wrapper methods use the predictive accuracy of a predetermined learning algorithm to determine the relevance of the selected subsets, and the accuracy of the learning algorithm is usually high. However, the generality of the selected features is limited (they tend to overfit on small training sets), and wrappers are computationally expensive for high-dimensional data [23–26].
Embedded methods incorporate feature selection as part of the training process. Because they are designed for specific classifiers [30], they may be more efficient than the other two categories [27–29].

To this end, there is a need for a more flexible standard or framework whose multidimensional metrics help to view the complete picture of the prediction task. Accordingly, different DM models were tested to select the best-fit model for the contraceptive dataset used in this paper. However, the literature offers no fixed guideline or rule for picking the best model for a given problem. Moreover, many predictive algorithms do not perform well with large feature spaces, because many of the features may be irrelevant or redundant with respect to the target variable. In this paper, we designed a hybrid multidimensional metrics framework for predictive modeling, covering performance evaluation and feature selection, to address the challenges encountered in feature selection and to pick the best model among those used in DM and ML. The CRISP-DM method was applied in this study because it places additional emphasis on understanding the business perspective and on deployment [63]: it begins with understanding the business and ends with the deployment of the system. Figure 1 (a) depicts the architecture of the KDD process of the proposed hybrid multidimensional metrics for feature and model selection in predictive modeling, which comprises six essential components: the method used, the data preprocessing phase, feature selection, the modeling and evaluation phases, and knowledge representation. Figure 1 (b) shows the architecture of the hybrid multidimensional metrics for feature and model selection, which comprises two major parts: the feature selection phase and the model selection phase. Details are provided in the following sections.

A. Feature selection

Several recent studies have addressed feature selection and clustering together under a single, unified criterion [64]. Feature selection is important in any data mining task because the abundance of potential features is a serious obstacle to the efficiency of most learning algorithms [65]. Popular methods such as k-nearest neighbor, C4.5, and backpropagation are slowed down by the presence of many features, especially if most of these features are redundant or irrelevant to the learning task [65]. In this paper, we applied the proposed hybrid multidimensional criteria to aggregate features from one or more sources into a target dataset pertinent to the data mining goals.

1. Hybrid Multidimensional Metrics for Feature Selection

Table 1 below presents the multidimensional metrics designed for feature selection in predictive modeling. The proposed metrics consider multiple dimensions of a feature when deciding whether to retain it for further analysis, including but not limited to the following criteria: the data type of the feature; the type of the target feature (categorical or continuous); a statistical measure of the relationship between the feature and the target, namely a correlation test for continuous features and a chi-square test of independence for categorical features; consistency with established knowledge; experts' claims; simplicity in time and interpretability; and practicability and applicability. Consider a national dataset or high-dimensional database ${D_B}$ with attributes ${X_k}\left( {k=1,2,...,N} \right)$, in which detailed information on respondents' background characteristics was collected from a nationally representative sample that provides estimates at the national and regional levels. In this paper, features were selected from ${D_B}$ using the proposed hybrid metrics, according to their pertinence to the data mining goals. A new target dataset was thus prepared for the prediction task, with contraceptive method use $\left( {CU} \right)$, a binary outcome, as the response feature. Table 2 below gives the pseudocode for feature selection (Algorithm 1).

Table 1

Multidimensional metrics designed for feature selection in predictive modeling
Criterion	Measures	List of features (A₁, ..., A_k)
Data types	Categorical or continuous
Target feature	Categorical or continuous
Correlation test	Continuous
Chi-square test	Categorical
Established knowledge	Positive, Negative, Neutral, Unstudied
Expertise’s claim	State in scientific manner
Simplicity in time and interpretability	Meaningful and clarity
Applicability and Practicality	Model’s direct impact on the domain

Table 2

A Hybrid Multidimensional Metrics for Feature Selection (Algorithm 1)
Input: 1) Load national database ${D_B}$; 2) Response feature $CU$; 3) The number of respondents N
Output: Target dataset ${T_{Atr}}$
1. ${T_{Atr}}=\left[ {} \right]$
2. for $k=1,2,...,n$ do
3.     Data types: Identify as categorical or continuous
4.     Target variable: Identify as categorical or continuous
5.     Apply data transformation when appropriate
6.     for continuous variables do
7.         Compute Correlation test
8.     end for
9.     for categorical variables do
10.         Compute Chi-square test using equation (1)
11.     end for
12.     Established knowledge: Identify as Positive, Negative, Neutral, Unstudied
13.     Expertise’s claim: State in scientific manner
14.     Simplicity in time and interpretability: Identify as yes or no
15.     Practicability and Applicability: Identify as yes or no
16. end for
17. Obtain the Attribute vector of the $k^{th}$ respondent
18. return ${T_{Atr}}$
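The chi-square step of Algorithm 1 can be illustrated with a minimal Python sketch. Equation (1) is not reproduced in this section, so the sketch assumes the standard Pearson chi-square statistic for a contingency table; the function name `chi_square_statistic`, the toy data, and the rule of retaining a feature when the statistic exceeds the 5% critical value are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter

def chi_square_statistic(feature, target):
    """Return (statistic, degrees of freedom) for two categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, target))   # observed cell counts
    row_totals = Counter(feature)
    col_totals = Counter(target)
    statistic = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n   # expected count under independence
            statistic += (observed[(r, c)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof

# Hypothetical toy data: place of residence versus contraceptive use (CU).
residence = ["urban"] * 4 + ["rural"] * 4
cu = [1, 1, 1, 0, 0, 0, 0, 1]
statistic, dof = chi_square_statistic(residence, cu)

# 3.841 is the 5% critical value of the chi-square distribution with 1 degree
# of freedom; using it as the retention rule is an assumption of this sketch.
retain = statistic > 3.841
```

In the framework described above, this test is only one of several criteria, so a feature that fails it could still be retained on the basis of established knowledge or expert claims.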

B. Missing value Handling

Missing values play an important role in the data cleaning process, and several methods have been proposed to process missing data and avoid the problems it causes. When the dataset is small or the number of missing fields is large, deleting every record with a missing field is not feasible. Moreover, the fact that a value is missing may itself be significant. A widely applied approach is to calculate a substitute value for missing fields, for example the median or mean of the variable [66]. In this paper, missing values of categorical variables were replaced by the modal value of the variable [67]. Features with at most five percent (5%) missing values were retained for further analysis; the others were discarded. The WEKA preprocessing filter that replaces missing values with the most frequent (modal) value was used for this purpose.
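A minimal sketch of this cleaning step in plain Python (the study itself used WEKA). It assumes that the 5% rule means "keep a feature if at most 5% of its values are missing"; the names `impute_mode` and `clean`, the use of `None` as the missing marker, and the toy columns are hypothetical.

```python
from collections import Counter

MISSING = None            # assumed marker for a missing value
MAX_MISSING_RATE = 0.05   # assumed reading of the 5% rule

def impute_mode(column):
    """Replace missing entries with the most frequent observed value."""
    present = [v for v in column if v is not MISSING]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is MISSING else v for v in column]

def clean(table):
    """Drop features with too many missing values, impute the rest."""
    cleaned = {}
    for name, column in table.items():
        rate = sum(v is MISSING for v in column) / len(column)
        if rate <= MAX_MISSING_RATE:
            cleaned[name] = impute_mode(column)
    return cleaned

# Hypothetical categorical features with 20 respondents each.
table = {
    "religion": ["orthodox"] * 12 + ["muslim"] * 7 + [MISSING],  # 5% missing
    "occupation": ["farmer"] * 10 + [MISSING] * 10,              # 50% missing
}
cleaned = clean(table)
```

Here `religion` is kept and its single gap is filled with the modal value, while `occupation` is discarded.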

C. Data Transformation and Reduction

Data mining often requires data integration, that is, the merging of data from multiple data sources [68]. In data transformation, the collected attributes were transformed into forms appropriate for data mining tools. The process included feature construction, in which new features were constructed from the given set of features and added to help the mining process [64, 69]. To make the analysis procedures manageable and cost-effective, the data needed to be reduced. Data reduction techniques include data discretization, a data transformation method that reduces the number of values of a continuous attribute by dividing its range into intervals [63, 64]. In this paper, some features were discretized to reduce their number of distinct values, to obtain knowledge (patterns), and to make the dataset suitable for data mining tools. Almost all of the selected features were transformed from their original state so that they could be easily understood and interpreted. For instance, the ethnicity feature had 46 distinct values, which were converted into ten categories: Afar, Guragie, Tigrean, Amara, Somalie, Sidama, Nuwer, Welaita, Oromo, and Others.
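The two reductions described above, discretizing a continuous attribute and collapsing a many-valued categorical attribute, might look as follows in Python. The nine retained ethnicity names come from the text; the five-year age bands, the function names, and the sample inputs are assumptions made only for illustration.

```python
from bisect import bisect_right

# Hypothetical cut points that split ages 15-49 into five-year intervals.
AGE_CUTS = [20, 25, 30, 35, 40, 45]
AGE_LABELS = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]

def discretize(value, cut_points, labels):
    """Map a continuous value to the label of the interval it falls in."""
    return labels[bisect_right(cut_points, value)]

# Categories kept as-is; every other ethnicity value becomes "Others".
KEPT_ETHNICITIES = {"Afar", "Guragie", "Tigrean", "Amara", "Somalie",
                    "Sidama", "Nuwer", "Welaita", "Oromo"}

def collapse_ethnicity(value):
    """Reduce the original distinct values to ten categories."""
    return value if value in KEPT_ETHNICITIES else "Others"

age_groups = [discretize(a, AGE_CUTS, AGE_LABELS) for a in (19, 20, 49)]
ethnic_groups = [collapse_ethnicity(e) for e in ("Oromo", "Hadiya")]
```

`bisect_right` places a value equal to a cut point in the upper interval, so age 20 falls in "20-24" rather than "15-19".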

D. Methods of Training and Testing

In predictive data mining, classifiers must be trained before they can reliably be used on new data [70]. The more instances a classifier is exposed to during the training phase, the more reliable it becomes. Once it is trained, however, the classifier must also be tested to give confidence that it works. To predict the performance of a classifier on new data, its error rate must be assessed on an independent test set that played no part in the formation of the classifier [71]. The standard way of predicting the error rate of a learning technique is stratified 10-fold cross-validation. The data is divided randomly into 10 parts in which each class is represented in approximately the same proportion as in the full dataset. Each part is held out in turn and the learning scheme is trained on the remaining nine-tenths; its error rate is then calculated on the held-out part. Thus, the learning procedure is executed a total of 10 times on different training sets. Finally, the 10 error estimates are averaged to yield an overall error estimate [71]. The WEKA 3.7 tool provides options to test on the same set the classifier is trained on (use training set), to test on user-specified test data (supplied test set), to test with k-fold cross-validation, and to train on a percentage of the data and test on the remainder (percentage split). In this paper, 10-fold cross-validation was used for the prediction task.
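The stratified 10-fold procedure described above can be sketched in a few lines of Python. This is an illustration of the general technique, not of WEKA's internals; the helper names, the fixed random seed, and the majority-class learner used in the example are assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)   # deal each class out round-robin
    return folds

def cross_validate(X, y, train_fn, predict_fn, k=10, seed=0):
    """Average the held-out error rate over k stratified folds."""
    errors = []
    for held_out in stratified_folds(y, k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(y)) if i not in held]
        model = train_fn([X[i] for i in train_idx], [y[i] for i in train_idx])
        wrong = sum(predict_fn(model, X[i]) != y[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / k

# Hypothetical learner: always predict the majority class of the training set.
def train_majority(X_train, y_train):
    return Counter(y_train).most_common(1)[0][0]

def predict_majority(model, x):
    return model

y = [1] * 30 + [0] * 70          # unbalanced toy target (30% users)
X = list(range(len(y)))          # placeholder features
folds = stratified_folds(y)
error = cross_validate(X, y, train_majority, predict_majority)
```

With 30% positives, every fold holds three users and seven non-users, so the majority-class learner has a cross-validated error of about 0.3, which shows why accuracy alone can be misleading on unbalanced data.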

E. Methods of Analysis and Evaluation of the Models

The outputs of several experiments with the classification models were analyzed and evaluated using the hybrid multidimensional metrics listed below. The complexity of each model, in terms of the number of trees and leaves, was also evaluated. Furthermore, the models were evaluated using F-measures to test their statistical significance at the 5% level of significance for prediction purposes. In this paper, we designed hybrid multidimensional criteria for model selection.

1. The ROC Curve

ROC (Receiver Operating Characteristic) curves are a useful tool for comparing classification models [72]. The performance of the classifiers under different parameter settings was compared by examining their ROC curves. The ROC curve shows the trade-off between the true positive rate (i.e., true contraceptive users) and the false positive rate (i.e., false contraceptive users) for a given model. Models can also be compared with respect to their speed, robustness, scalability, and interpretability, which may have an influence on the model [52]. The ROC curve is drawn in a two-dimensional plane: the vertical axis (Y-axis, sensitivity) represents the true contraceptive user rate (TCUR), and the horizontal axis (X-axis, 1 − specificity) represents the false contraceptive user rate (FCUR).
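As an illustration of how the points of such a curve are obtained, the following Python sketch sweeps a decision threshold over predicted scores and records (FCUR, TCUR) pairs, then computes the area under the curve with the trapezoidal rule. The scores, labels, and function names are hypothetical, and the sketch assumes both classes are present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as ordered (x, y) points."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical predicted probabilities of contraceptive use and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

For this toy example the area is 8/9: eight of the nine user and non-user pairs are ranked in the correct order.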

2. The Confusion Matrix

Previous studies on data mining and machine learning techniques have often used a confusion matrix to measure the performance of models in terms of accuracy, sensitivity, and specificity, as depicted in Table 3 below. The confusion matrix is a matrix representation of the classification results. In a two-class prediction problem laid out as in Table 3, the lower right cell holds the number of samples classified as true that are actually true (true positives, i.e., true users), and the upper left cell holds the number of samples classified as false that are actually false (true negatives, i.e., true non-users). The other two cells hold the misclassified samples: the lower left cell holds the number of samples classified as false that are actually true (false negatives, i.e., false non-users), and the upper right cell holds the number of samples classified as true that are actually false (false positives, i.e., false contraceptive users). Once the confusion matrices were constructed, the accuracy, sensitivity, and specificity of each model were calculated using the formulas below. These are the three measures used for model performance evaluation.

$$\text{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \qquad (2)$$

$$\text{Sensitivity}=\frac{TP}{TP+FN} \qquad (3)$$

$$\text{Specificity}=\frac{TN}{TN+FP} \qquad (4)$$

Table 3

Summary of two-class prediction problem
	Predicted value for contraceptive use
Actual value of current contraceptive use		No	Yes	Total
	No	TN	FP	TN + FP
	Yes	FN	TP	FN + TP
	Total	TN + FN	FP + TP	Grand = TP + FP + TN + FN
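As a worked instance of Eqs. (2) to (4), the short Python sketch below computes the three measures from the four cells of Table 3. The counts are the Naïve Bayes confusion matrix reported later in Table 10 (scenario 1); the function name `confusion_metrics` is a hypothetical helper introduced only for illustration.

```python
def confusion_metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity and specificity from a 2x2 confusion matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,      # Eq. (2)
        "sensitivity": tp / (tp + fn),      # Eq. (3): true contraceptive user rate
        "specificity": tn / (tn + fp),      # Eq. (4): true non-user rate
    }


# Naive Bayes, scenario 1 (Table 10): rows are actual No / Yes,
# columns are predicted No / Yes.
nb = confusion_metrics(tn=10576, fp=1795, fn=1365, tp=1947)
for name, value in nb.items():
    print(f"{name}: {value:.2%}")
```

The values agree with the 79.85%, 58.78%, and 85.49% reported for Naïve Bayes in Table 10, up to rounding in the last digit.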

3. Data imbalance problem: Handling and testing the effect of data imbalance in the target variable

4. Model’s Statistical Significance: The overall significance of the model, measured using an F-test or a paired test

5. Practicability and Applicability: The significance of the model in terms of its direct impact on the institution, as determined by the manager. This is quite different from statistical significance

6. Simplicity of Model Interpretation: The clarity of the rules extracted from the model, from the user’s side

7. Established knowledge: identified as: Positive, Negative, Neutral (borderline significance), Unstudied

8. Computational cost: Algorithm’s simplicity in terms of time and space

9. Hybrid Multidimensional Metrics for Model Selection: Considers multidimensional scenarios.

2. Hybrid Multidimensional Metrics for Model Selection

Consider a national dataset or high-dimensional database ${D_B}$ with features ${X_k}\left( {k=1,2,...,N} \right)$, in which detailed information was collected on the background characteristics of the respondents. The features were selected using the proposed hybrid metrics for feature selection described in Table 2, in line with the data mining goals, and a new target dataset was prepared for the predictive task. The response feature is contraceptive method use $\left( {CU} \right)$, which is a binary outcome. Now suppose that n requirements are received from the user or organization that intends to use the model, and define $R{I_i}=\left\{ {{R_i}} \right\}_{{i=1}}^{n}$ as the set of the user’s requirement indicators for each model, where $\left\{ {{R_i}} \right\}_{{i=1}}^{n} \in \left\{ {0,1} \right\}$. Each indicator is binary: a value of 1 means the corresponding user requirement is selected, and a value of 0 means it is not. Suppose further that the list of models to be compared against the user’s requirements is given as $M_{k}^{{i,R}}={\left[ {M_{1}^{{i,R}},M_{2}^{{i,R}},...,M_{k}^{{i,R}}} \right]^T}.$ We propose a hybrid multidimensional metric that computes the overall significance of each model from the user’s requirements and the weights that reflect their importance, defined as:

$$HMM\left( {m,r} \right)=\frac{{\sum\limits_{{r=1}}^{R} {\sum\limits_{{m=1}}^{k} {{w_i}.R{I_i}} } }}{M}$$

(5)

A higher $HMM\left( {m,r} \right)$ indicates a model that is significant overall and satisfies almost all of the user’s requirements, unlike the classical metrics, which use a single criterion to pick the best-fit model. Table 4 gives the pseudocode for the hybrid multidimensional metrics for model selection, provided as Algorithm 2.

Table 4

The proposed Hybrid Multidimensional Metrics for Model Selection (Algorithm 2)
Input: 1) Target dataset ${T_{Atr}}$; 2) Response variable $CU$; 3) The number of respondents N
Output: 1) Selected model $M_k$; 2) The metric $HMM(m,r)$; 3) Knowledge representation: $CU$
1. Start
2. for $m=1,2,...,k$ do
3. Compute ROC Values for each model
4. Assign weight for each model based on user’s requirements
5. Compute Confusion matrix for each model
6. Assign weight for each model based on user’s requirements
7. Test the effect of Data Imbalance problem for each model
8. Test Statistical Significance using F-test
9. Practicability and Applicability: Identify as yes or no
10. Simplicity of Model Interpretation: Identify as yes or no
11. Established knowledge: Identify as Positive, Negative, Neutral, Unstudied
12. Computational cost: identify as High, Moderate, and Low
13. Compute the hybrid multidimensional metrics using Eq. (5)
14. end for
15. Find $m,r$ $s.t.$ $HM{M_{mr}}=\max \left( {HMM\left( {m,r} \right)} \right)$
16. return $M_k$; $HMM(m,r)$; $CU$

Table 5 below describes the criteria and measures of the hybrid multidimensional metrics designed for model selection in predictive modeling.

Table 5

A Hybrid Multidimensional Metrics designed for Model selection to predictive modeling
		List of Models
Criterion	Measures
ROC value	Categorical or continuous	M₁
Confusion matrix	Accuracy, Specificity and Sensitivity	M₂
Data Imbalance Problem	Compare the differences	.
Statistical Significance	F-test	.
Practicability and Applicability	Model’s direct impact on the domain	.
Simplicity of Model Interpretation	Meaningful and clarity
Established knowledge	(Positive, Negative, Neutral, Unstudied)	.
Computational cost	In time	M_k

A. Chi-square Test Analysis

In this paper, we used the EDHS 2016 as the data source, with contraceptive use as the application, to address the main challenges of feature and model selection encountered in predictive modeling. A chi-square test was used to test the association between each feature and contraceptive use, in order to decide whether to retain the feature for further analysis in the prediction task (Table 6). Accordingly, the socio-demographic factors marital status, religion, wealth index, region, place of residence, ethnicity, and highest education level were found to be significantly associated with contraceptive method use (P-value < 0.001). However, features with more than 5% missing values were discarded from further analysis; for instance, husband’s education level was discarded (Table 6). As an example, Table 6 shows marital status, an attribute with six levels: respondents were asked about contraception use, and the difference between the categories of marital status was tested using the P-value. The association between marital status and contraception use was found to be significant (P-value < 0.001). The effect on contraception use also differs across the levels of marital status. Hence, marital status was included as a potential predictor in the model.

Table 6

Statistical association of socio-demographic attributes related to Contraception use using Chi-square test, EDHS 2016
No	Features	Category	Contraception Use		P-value
No	Features	Category	Yes	No
1	Marital status	Divorced	122	756	.000
		Married	2887	6715
		Living with partner	93	129
		No longer living with partner	52	200
		Never in union	132	4146
		Widowed	26	425
2	Religion	Catholic	22	69	.000
		Protestant	670	2144
		Orthodox	1845	4568
		Muslim	761	5448
		Traditional	4	84
		Others	10	62
3	Highest level of Education	No education	5686	1347	.000
		Primary	4040	1173
		Secondary	1782	456
		Higher	863	336
4	Wealth Index combined	Poorest	3562	332	.000
		Poorer	1610	436
		Middle	1502	500
		Richer	1498	544
		Richest	4199	1500
5	Husbands Education level	No education	901	3530	.000
		Primary	1141	1913
		Secondary	494	732
		Higher	423	597
		Don’t know	21	72
		Missing	5527	332
6	Ethnicity	Afar	13	934	.000
		Guragie	153	502
		Tigrean	469	1436
		Amara	1186	2502
		Oromo	743	2868
		Welaita	71	251
		Sidama	144	211
		Nuwer	4	280
		Somalie	20	1443
		Others	509	1944
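The chi-square test of association behind Table 6 can be sketched in a few lines of Python. The example below uses the marital status counts from Table 6 (Yes and No columns) and compares the Pearson statistic with the standard 5% critical value for 5 degrees of freedom (about 11.07). The function name `chi_square` is a hypothetical helper; the study itself does not state which software routine produced the P-values.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Marital status vs. contraception use (Table 6): [Yes, No] per category.
marital = [
    [122, 756],    # Divorced
    [2887, 6715],  # Married
    [93, 129],     # Living with partner
    [52, 200],     # No longer living with partner
    [132, 4146],   # Never in union
    [26, 425],     # Widowed
]
stat, dof = chi_square(marital)

CRITICAL_5PCT_DF5 = 11.07  # chi-square critical value, alpha = 0.05, 5 d.f.
significant = stat > CRITICAL_5PCT_DF5
print(f"chi-square = {stat:.1f}, d.f. = {dof}, significant = {significant}")
```

A statistic far above the critical value corresponds to the very small P-value shown as .000 in the table.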

Table 7 presents the statistical association of the knowledge features with contraception use. Media exposure, intention to use contraception, having heard of family planning, ever having heard of AIDS, ever having heard of STI, recent sexual activity, and knowledge of the ovulatory cycle were found to be significantly associated with contraceptive method use (P-value < 0.001). The feature on HIV transmission during pregnancy was discarded from the analysis (Table 7). For example, media exposure is an attribute with two levels (Yes or No), and the difference in contraceptive use between respondents with and without media exposure was found to be statistically significant. Hence, media exposure was included as a potential predictor to train the model.

Table 7

Statistical association of knowledge attributes related to Contraception use using Chi-square test, EDHS 2016
No	Attributes	Category	Contraception Use		P-value
No	Attributes	Category	No	Yes
1	Media exposure	No	6715	1412	.000
1	Media exposure	Yes	5656	1900	.000
2	Contraception Use Intention	Doesn’t Intend to use	6708	0	.000
		Non-user intends to use later	5663	0
		Using modern method	0	3217
		Using traditional method	0	95
3	Knowledge of Ovulatory Cycle	After period ended: 2	3085	936	.000
		At any time:5	2557	530
		Before period begins:4	879	274
		Middle of cycle:3	2694	1005
		During her period:1	374	108
		Don’t know:8	2782	459
4	Heard Family planning	No	8223	1788	.000
4	Heard Family planning	Yes	4148	1524	.000
5	Ever heard of STI	No	1170	75	.000
5	Ever heard of STI	Yes	11201	3237	.000
6	Recent sexual activity	Active last 4 weeks	4832	2723	.000
		Never had sex	3709	12
		No active: No postpartum abstinence	2937	517
		No active: Postpartum abstinence	893	60
7	Ever heard of AIDS	No	1233	81	.000
7	Ever heard of AIDS	Yes	11138	3231	.000

B. Features Pattern Analysis

Pattern analysis was carried out to understand the effect of each feature, at every category level, on contraception use. Of the study participants, almost 22% were in the 15 to 19 age group, and only 1.35% of the total participants were reported as contraceptive users (Fig. 2). Similarly, 18% were in the 20 to 24 age group, and only 4.5% of the total participants were contraceptive users. A further 18% were in the 25 to 29 age group, of whom 30% were reported as contraceptive users. The 25 to 29 and 30 to 34 age groups had higher proportions of contraceptive users than the other age groups, while the proportion of contraceptive users declined in the 15 to 19 and 40 to 49 age groups (Fig. 2). The two lines are not parallel, which indicates that contraceptive use varies across the age groups of the respondents.

Of the study participants who were asked about contraceptive method use in the survey, 12.06%, 11.79%, 11.63%, 10.96%, and 10.72% were from the Oromiya, SNNP, Addis Ababa, Amhara, and Tigray regions, respectively (Fig. 3).

Among the study participants with higher education (7.64%), only 2.14% were reported as contraceptive users. However, participants with no education (45%) reported the lowest proportion of contraceptive method use (8.59%). The graph shows that the gap in contraceptive use is largest for participants with no education. Contraceptive use decreases as educational level decreases (Fig. 4).

Among the study participants who were married (51%), only 18% were reported as contraceptive users. However, participants who had never been in a union (27%) reported the lowest proportion of contraceptive method use (0.84%) (Fig. 5).

Of the study participants, 65% were rural residents, and only 13% of rural residents were reported as contraceptive users. On the other hand, only 8.5% of the urban residents were reported as contraceptive users (Fig. 6).

Among the study participants, Muslims and Orthodox constituted 40% and 41%, respectively, of whom only 5% and 12% were reported as contraceptive users. However, followers of the Catholic and traditional religions showed a distinct pattern, without the large gap observed among the other religions (Fig. 7).

C. Experimentations

The classifiers used 15,683 instances to train the predictive models for contraceptive use. Different data mining algorithms, namely decision trees (J48, random tree, and random forest), Naïve Bayes, and artificial neural networks (ANNs), were used to train the classifiers. The five classifiers were trained under two scenarios and with varying testing parameters. The performance of the data mining models was evaluated using the 10-fold cross-validation test option, as it is the standard for controlling bias. The two scenarios differ in the attribute selection adopted to train the models: the classical approach and the proposed approach.

a) In the classical approach, we used both feature selection and search method algorithms from the available Weka packages. Accordingly, five attributes (Ethnicity, knowledge any method, current marital status, recent sexual activity and ever been tested for HIV) were selected using the classifier subset evaluator algorithm with both the BestFirst and GreedyStepwise search methods.

b) In the second approach, we applied the hybrid multidimensional metrics approach to feature selection, and the 18 selected features (socio-demographic determinants, knowledge related to contraception use, knowledge related to AIDS and/or STI, exposure to mass-media, and knowledge on family planning) were used in all experiments. Current contraception methods use (CCMU), a binary outcome, is the response variable of the study. The features used in this study are listed in Table 8.

Table 8

List of possible attributes for predicting the model for contraceptive use, EDHS 2016
Rank	Attributes	Contribution of each attribute to the model	Data type	Distinct values
1	Recent_Sexual_Activity	0.12728	Text	4
2	Curent_Marital_Status	0.08209	Text	6
3	Ethnicity	0.06039	Numeric	46
4	Num_Living_Children	0.05981	Text	4
5	Ever_been_Tested_HIV	0.04436	Text	2
6	AgeGroup	0.04222	Text	9
7	Region	0.04219	Text	11
8	WI_Combined	0.02728	Text	5
9	Religion	0.02639	Text	6
10	Desire_For_More_Children	0.02579	Text	3
11	Knowledge_Any_method	0.01613	Text	3
12	Ever_Heard_AIDS	0.01125	Text	2
13	Ever_Heard_STI	0.01086	Text	2
14	Knowledge_Ovulatory_Cycle	0.01041	Text	6
15	Heard_FP	0.00793	Text	2
16	Media_Exposure	0.00654	Text	2
17	Place of residence	0.00328	Text	2
18	Highest_LevEducation	0.00255	Text	4

The efficiency of the predictive models was evaluated using the proposed hybrid multidimensional metrics for model selection, as shown in Table 9 below. These performance measures are designed to fulfil the user’s requirements.

Table 9

Summarization of various experimentations applying with different testing parameters
S.No	Experimentation of models	Testing options	No. of attributes	Attribute selection
Scenario 1	Naïve Bayes	Training; Cross validation; Percentage	5	CfsSubsetEval + BestFirst; CfsSubsetEval + GreedyStepwise
	Decision tree (J48)		5
	Random tree		5
	Random forest		5
	Artificial Neural Networks		5
Scenario 2	Naïve Bayes	Training; Cross validation; Percentage	18	Proposed approach
	Decision tree (J48)		18
	Random tree		18
	Random forest		18
	Artificial Neural Networks		18

Figure 8 shows, for demonstration purposes, an artificial neural network (ANN) that takes a sample of features (individual inputs p₁, p₂, ..., p_R) to build the predictive model for contraceptive use. Each input is weighted by the corresponding element w_{1,1}, w_{1,2}, ..., w_{1,R} of the weight matrix W. The ANN predictive model was trained with no hidden layer, with two hidden neurons, and with two layers of hidden neurons to see whether the prediction power could be improved. However, the results for the three configurations were similar. Therefore, we recommend modeling contraceptive use with an ANN without hidden layers, for simplicity of model interpretation.

D. Comparison Analysis for the classifiers

1. The ROC Curve Analysis

The ROC value of the Naïve Bayes model for contraception use was found to be 85.1%. The ROC curve for Naïve Bayes rises sharply from zero, showing that the true positive rate initially exceeds the false positive rate. The curve then flattens as it encounters fewer true positives and more false positives. The area under the curve for the Naïve Bayes model was found to be 85.2% (Fig. 9).

2. The Confusion Matrix

Extensive experiments were carried out with different testing options (training set, cross-validation, and percentage split), but the comparison used the cross-validation (CV) test option only, as it is the standard for controlling bias. With the CV test option, Naïve Bayes achieved an accuracy of 79.85%, a sensitivity of 58.78%, and a specificity of 85.49%, and it had the lowest computation time in seconds. Similarly, the ANN (Multilayer Perceptron) classifier achieved an accuracy of 80.24%, a sensitivity of 44.89%, and a specificity of 89.70%, with the highest computation time in seconds. Moreover, the decision tree algorithms (J48, RT, and RF) achieved better accuracy than the above-mentioned classifiers (NB and ANNs), as seen in Table 10. Judged on accuracy alone, the J48 decision tree of scenario two is the best predictor (Table 11). However, we need to check whether the performance measures achieved by each model are statistically significant at the 5% level of significance, for further analysis and for future prediction purposes. This statistical test of model significance uses the F-test in DM models (Table 12). The complete set of results used to compare the performance of each model is presented in tabular form (Tables 10 and 11).

Table 10

Comparison of performance of different Classifiers, **scenario 1 (n = 5)**
Evaluation criteria	Naïve Bayes		Decision tree (J48)		Decision tree (random tree)		Decision tree (random forest)		Neural networks		Class
Confusion matrix	10576	1795	11676	695	11586	785	11586	785	11097	1274	No
	1365	1947	2320	992	2235	1077	2232	1080	1825	1487	Yes
Accuracy (%)	79.85%		80.77%		80.74%		80.76%		80.24%
Sensitivity (%)	58.78%		29.95%		32.51%		32.60%		44.89%
Specificity (%)	85.49%		94.38%		93.65%		93.65%		89.70%
ROC area	0.841		0.805		0.841		0.844		0.841
Computation time (seconds)	0.01		0.02		0.09		0.7		61.16

Table 11

Comparison of performance of different Classifiers, **scenario 2 (n = 18)**
Evaluation criteria	Naïve Bayes		Decision tree (J48)		Decision tree (random tree)		Decision tree (random forest)		Neural networks		Class
Confusion matrix	9680	2691	11416	955	10732	1639	11376	995	10964	1407	No
	868	2444	1810	1502	1845	1467	1847	1465	1680	1632	Yes
Accuracy (%)	77.30%		82.36%		77.78%		81.87%		80.32%
Sensitivity (%)	73.79%		45.35%		44.29%		44.23%		49.27%
Specificity (%)	78.25%		92.28%		86.75%		91.96%		88.63%
ROC area	0.851		0.817		0.691		0.855		0.840
Computation time (seconds)	0.00		0.24		0.07		3.78		518.2

3. Model Evaluation for Data Imbalance Problem

1. Data Imbalanced Case

In this paper, receiver operating characteristic (ROC) curve analysis was also used to measure the performance of the models. With the imbalanced data, four of the five classifiers achieved ROC values above 81%; the exception was the random tree, with 69.1%. Judged on accuracy alone, the decision tree (J48) is the best predictor (Tables 10 and 11). However, a paired two-tailed comparison was carried out using the paired corrected test option to measure the differences in performance among the models in predicting contraceptive use at the 5% level of significance, for further analysis and for future prediction purposes (Table 12). This test of model significance uses the F-measure in data mining models. Four data mining models (decision tree (J48), decision tree (random tree), decision tree (random forest), and neural networks (MLP)) were compared against the Naïve Bayes model for the same number of inputs. All the models used in this paper are efficient enough (prediction power exceeds 77%) to predict contraceptive method use among women, since none of the models differed significantly from Naïve Bayes in F-measure (Table 12). Unlike statistical tests that report a P-value to measure significance, WEKA uses three symbols (v/ /*) to report the differences between models: v means the model performs significantly better than the baseline (a victory), a blank means there is no significant difference, and * means the model performs significantly worse.

Table 12

Model Evaluation for the classifiers, Paired corrected Tester-measure, Confidence: 0.05 (two tailed), *for* **imbalance data**
Dataset	(1) Naïve Bayes	(2) Decision tree (J48)	(3) Random tree	(4) Random forest tree	(5) Neural networks
DataSet_CPR_2018_19_Model: F-measures	(1) 0.84 \| 0.89 0.88 0.88 0.87
	(v/ /*) \| (0/1/0) (0/1/0) (0/1/0) (0/1/0)

2. Handling the problem of Imbalanced Data

About 21% of the respondents were reported as contraceptive users, so the target classes were considered imbalanced, which might bias the evaluation of the classifiers. To balance the two classes and avoid the dominance of one over the other, an equal number of contraceptive users and non-users was sampled at random using the WEKA 3.7.7 pre-processing option. The overall significance of the models on the balanced data was then compared with that on the imbalanced data above, using the same performance measures for prediction. The original sample size was 15,683; after adjusting for the data imbalance, the resampled size became 6586. In other words, the experiments below were re-run with equal numbers of contraceptive users and non-users. Table 13 shows the effect of the imbalance in the target variable, evaluated with the same measures after the adjustment. The models trained on imbalanced data achieved slightly higher overall performance than the models trained on the balanced dataset, possibly because one target class dominated the other. Despite these slight differences, the other four classifiers have ROC values above 81%, and the ROC value of the random tree improved to 74.80%. This indicates that, given the features as input, the classifiers are efficient at predicting whether an individual is a contraceptive user or not. Judged on accuracy alone, the decision tree (random forest algorithm) is the best predictor (Table 13).
However, we need to check whether the performance measures achieved by each model are statistically significant at the 5% level of significance, for further analysis and for future prediction purposes (Table 14).
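The balancing step can be illustrated with a minimal Python sketch of random undersampling, in which every class is cut down to the size of the smallest class. This is not the WEKA pre-processing filter itself, only an assumed equivalent; the function name `undersample` and the fixed seed are hypothetical, and the class counts (12,371 non-users and 3,312 users) are the row totals of the confusion matrices in Tables 10 and 11.

```python
import random


def undersample(labels, seed=0):
    """Return sorted row indices with an equal number of rows per class."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    # Every class is reduced to the size of the smallest class.
    n_min = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_min))
    return sorted(keep)


# Class sizes taken from the confusion-matrix row totals (Tables 10 and 11).
labels = ["No"] * 12371 + ["Yes"] * 3312
keep = undersample(labels)
balanced = [labels[i] for i in keep]
print(balanced.count("No"), balanced.count("Yes"))
```

Undersampling discards most of the majority class, so it trades a smaller training set for equal class sizes; this is one plausible reason for the slightly lower overall performance on the balanced data.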

Table 13

Comparison of performance of different Classifiers, for **balanced data**
Evaluation criteria	Naïve Bayes		Decision tree (J48)		Decision tree (random tree)		Decision tree (random forest)		Neural networks		Class
Confusion matrix	2176	1136	2325	987	2475	837	2405	907	2477	835	No
	391	2921	454	2858	964	2348	477	2835	782	2530	Yes
Accuracy (%)	76.94%		78.24%		72.81%		79.10%		75.58%
Sensitivity (%)	65.70%		70.19%		74.72%		72.61%		76.38%
Specificity (%)	88.19%		86.29%		70.89%		85.59%		74.78%
ROC (%)	84.80%		81.70%		74.80%		86.70%		84.20%
Computation time (seconds)	0.0		0.13		0.07		0.19		260.59

Table 14 presents the paired two-tailed comparison (paired corrected test) used to measure the differences in performance among the models in predicting contraception use by women at the 5% level of significance, after adjusting for the data imbalance problem. Four data mining models (decision tree (J48), decision tree (random tree), decision tree (random forest), and neural networks (MLP)) were compared against the Naïve Bayes model for the same number of inputs. There were statistically significant differences between the decision tree models (both the J48 and random forest algorithms) and the Naïve Bayes model in predicting contraceptive method use (Table 14). The performance of these decision tree models was considered a victory (significantly better) compared with the Naïve Bayes model. Nevertheless, all the models used in this paper are efficient enough (prediction power exceeds 77%) to predict contraceptive method use among women.

Table 14

Model Evaluation for the classifiers, paired corrected Tester-measure, Confidence: 0.05 (two tailed); **after adjusting the data imbalance problem**
Dataset	(1) Naïve Bayes	(2) Decision tree (J48)	(3) Random tree	(4) Random forest tree	(5) Neural networks
DataSet_CPR_2018_19_Model: F-measures	(1) 0.74 \| (2) 0.76 v (3) 0.73 (4) 0.77 v (5) 0.76
	(v/ /*) \| (0/1/0) (0/1/0) (0/1/0) (0/1/0)

4. Hybrid Multidimensional Metrics for Model Selection

A hybrid multidimensional metric was used to compute the overall significance of each model, taking into account the user’s requirements and the weights that reflect their importance, as defined in Eq. (5). A higher $HMM\left( {m,r} \right)$ indicates a model that is significant overall and satisfies almost all of the user’s requirements, unlike the classical metrics, which use a single criterion to pick the best-fit model (Table 15). Accordingly, the decision tree (J48) was found to be the best-fit model for the prediction task based on the hybrid metrics criterion. On the other hand, the ANNs were found to be the most computationally expensive for our prediction task.

Table 15

Hybrid multidimensional metrics criterion for final model selection
Metrics	Requirement’s indicator	Classifier’s weight score
Metrics	Requirement’s indicator	NB	DT	RT	RF	ANNs
Roc values	1	0.15	0.15	0.15	2/5	0.15
Accuracy	1	0.15	0.15	0.15	2/5	0.15
data imbalance problem handled	1	0.15	0.15	0.15	2/5	0.15
statistical significance	1	0.13	0.305	0.13	0.305	0.13
practicability and applicability of the model	1	0.15	2/5	0.15	0.15	0.15
simplicity of model interpretation	1	0.15	2/5	0.15	0.15	0.15
consistency to the established knowledge	1	0.15	2/5	0.15	0.15	0.15
algorithm’s simplicity in terms of time and space	1	0.15	2/5	0.15	0.15	0.15
	$HMM\left( {m,r} \right)$:	0.236	0.47	0.25	0.42	0.236
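The scores in the last row of Table 15 can be approximately reproduced with the short Python sketch below. It reads Eq. (5) as the sum of the weighted requirement indicators for one model divided by M, and it assumes that M is the number of models compared (five); that reading is an assumption, chosen because it recovers the reported 0.236, 0.47, 0.42, and 0.236 for NB, DT, RF, and ANNs. Entries written as 2/5 in the table are entered as 0.4, and the names `WEIGHTS`, `INDICATORS`, and `hmm_score` are hypothetical.

```python
# Weight scores per classifier, in the row order of Table 15:
# ROC, accuracy, imbalance handled, statistical significance,
# practicability, interpretability, established knowledge, computational cost.
WEIGHTS = {
    "NB":   [0.15, 0.15, 0.15, 0.13, 0.15, 0.15, 0.15, 0.15],
    "DT":   [0.15, 0.15, 0.15, 0.305, 0.4, 0.4, 0.4, 0.4],
    "RT":   [0.15, 0.15, 0.15, 0.13, 0.15, 0.15, 0.15, 0.15],
    "RF":   [0.4, 0.4, 0.4, 0.305, 0.15, 0.15, 0.15, 0.15],
    "ANNs": [0.15, 0.15, 0.15, 0.13, 0.15, 0.15, 0.15, 0.15],
}
# All eight requirements are selected (indicator = 1) in Table 15.
INDICATORS = [1] * 8


def hmm_score(weights, indicators, n_models):
    """Weighted sum of selected requirements, divided by the number of models.

    Assumption: M in Eq. (5) is the number of models being compared.
    """
    return sum(w * ri for w, ri in zip(weights, indicators)) / n_models


scores = {
    model: hmm_score(weights, INDICATORS, len(WEIGHTS))
    for model, weights in WEIGHTS.items()
}
best = max(scores, key=scores.get)
for model, score in scores.items():
    print(f"{model}: {score:.3f}")
print("selected model:", best)
```

Under this reading the selected model is the decision tree (J48), in line with the text. The same arithmetic gives 0.236 for the random tree, whereas Table 15 lists 0.25, so the sketch should be taken as an approximation of the authors' calculation rather than a confirmed reconstruction.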

Two scenarios were considered for the feature and model selection used to train the models. The classical approach employed the most commonly used feature selection algorithms; however, this approach has been criticized for failing to draw a complete picture of the prediction task. We therefore proposed hybrid multidimensional metrics for both feature and model selection as an efficient approach that covers the full set of user requirements. Experimental results revealed that all the predictive models used in this study except the random tree were able to predict whether an individual was a contraceptive user or not, given the socio-demographic determinants, knowledge related to contraception use, knowledge related to AIDS and/or STI, exposure to mass-media, and knowledge on family planning as inputs, with a predictive power of more than 81%. Slight differences were observed due to the data imbalance, and the classifiers had ROC values above 81% after adjusting for the data imbalance problem. However, there was a statistically significant difference between the decision tree models and the Naïve Bayes model in predicting contraceptive use (after adjusting for the data imbalance problem). In conclusion, the decision tree (J48) was found to be the best-fit model for the prediction task based on the hybrid metrics criterion: its higher score of $HMM\left( {m,r} \right)$ = 0.47 indicates a model that is significant overall and satisfies almost all of the user’s requirements, unlike the classical metrics, which rely on a single criterion to pick the best-fit model and so overlook practicality and several other characteristics of the model. On the other hand, the ANNs were found to be the most computationally expensive for our prediction task. Specifically, this paper concluded that:

The efficiency of a predictive model is better measured with a multidimensional criterion of performance measures, as this approach is more flexible in accommodating the user’s requirements.
The decision tree (J48) is the most efficient model (with a score of $HMM\left( {m,r} \right)$ = 0.47) and was statistically a victory model for the balanced data.
The nature of the data and the class size of the dataset (balanced or imbalanced data) have a negative impact on the efficiency or prediction power of the model.

The following recommendations are forwarded to academia, the scientific community, and the healthcare industry for future work on feature and model selection in similar or different prediction tasks:

Efficiency of a model is a multi-dimensional phenomenon. Hence, more flexible model selection criteria, such as hybrid metrics, can be applied, covering the scalability of the model, the accuracy and specificity of the model, computational time cost, simplicity of the model, and others.
Efficiency of a predictive model could be improved through more flexible feature selection algorithms (specifically flexible hybrid metrics) that take the knowledge of domain experts into account, as understanding the business domain has a significant effect.
The class size or data imbalance problem needs to be handled; an equal number of instances of both target classes could be taken to minimize the bias that could be introduced to the model.
Data transformation techniques, specifically on continuous features, need to be addressed to make the features suitable for the prediction task and to make the analysis procedures manageable and cost-effective.

Ethical approval and consent to participate

Not applicable.

Availability of data and materials

The data source for this study is the EDHS 2016, which is publicly available.

Consent to publish

Not Applicable.

Funding

This work has not been supported by a funding source.

Conflict of interest

The authors declare that they have no competing interests.

Author contribution

Conceptualization, T.G.H. and T.A.; Data curation, T.G.H. and T.A.; Formal analysis, T.G.H. and T.A.; Investigation, T.G.H. and T.A.; Methodology, T.G.H. and T.A.; Project administration, T.G.H. and T.A.; Resources, T.G.H. and T.A.; Visualization, T.G.H. and T.A.; Writing—original draft, T.G.H.; Writing—review & editing, T.A. All authors reviewed the manuscript.

Acknowledgment

The authors would like to acknowledge the MEASURE Demographic and Health Survey (DHS) authority for authorizing us to access all the datasets and documents needed for this work.


