A. Chi-square Test Analysis
In this paper, we used EDHS 2016 data on contraceptive use to address the main challenges of feature selection and model selection encountered in predictive modeling. A chi-square test was used to assess the association between each feature and contraceptive use, in order to decide whether to retain the feature for the prediction task (Table 6). Accordingly, the socio-demographic factors marital status, religion, wealth index, region, place of residence, ethnicity, and highest education level were found to be significantly associated with contraceptive method use (P-value < 0.001). However, features with more than 5% missing values were discarded from further analysis; for instance, husband's education level was discarded (Table 6). As an example, Table 6 shows that for marital status, an attribute with six levels, the difference in contraceptive use between the categories was tested with the chi-square test. The association between marital status and contraceptive use was found to be significant (P-value < 0.001), which indicates that the proportion of contraceptive users differs across the levels of marital status. Hence, marital status was included as a potential predictor in the model.
Table 6
Statistical association of socio-demographic attributes related to Contraception use using Chi-square test, EDHS 2016
| No | Features | Category | Contraception use: Yes | Contraception use: No | P-value |
|---|---|---|---|---|---|
| 1 | Marital status | Divorced | 122 | 756 | < 0.001 |
| | | Married | 2887 | 6715 | |
| | | Living with partner | 93 | 129 | |
| | | No longer living with partner | 52 | 200 | |
| | | Never in union | 132 | 4146 | |
| | | Widowed | 26 | 425 | |
| 2 | Religion | Catholic | 22 | 69 | < 0.001 |
| | | Protestant | 670 | 2144 | |
| | | Orthodox | 1845 | 4568 | |
| | | Muslim | 761 | 5448 | |
| | | Traditional | 4 | 84 | |
| | | Others | 10 | 62 | |
| 3 | Highest level of education | No education | 1347 | 5686 | < 0.001 |
| | | Primary | 1173 | 4040 | |
| | | Secondary | 456 | 1782 | |
| | | Higher | 336 | 863 | |
| 4 | Wealth index combined | Poorest | 332 | 3562 | < 0.001 |
| | | Poorer | 436 | 1610 | |
| | | Middle | 500 | 1502 | |
| | | Richer | 544 | 1498 | |
| | | Richest | 1500 | 4199 | |
| 5 | Husband's education level | No education | 901 | 3530 | < 0.001 |
| | | Primary | 1141 | 1913 | |
| | | Secondary | 494 | 732 | |
| | | Higher | 423 | 597 | |
| | | Don't know | 21 | 72 | |
| | | Missing | 332 | 5527 | |
| 6 | Ethnicity | Afar | 13 | 934 | < 0.001 |
| | | Guragie | 153 | 502 | |
| | | Tigrean | 469 | 1436 | |
| | | Amara | 1186 | 2502 | |
| | | Oromo | 743 | 2868 | |
| | | Welaita | 71 | 251 | |
| | | Sidama | 144 | 211 | |
| | | Nuwer | 4 | 280 | |
| | | Somalie | 20 | 1443 | |
| | | Others | 509 | 1944 | |
Table 7 presents the statistical association between knowledge-related features and contraceptive use. Media exposure, intention to use contraception, having heard of family planning, ever having heard of AIDS, ever having heard of STIs, recent sexual activity, and knowledge of the ovulatory cycle were found to be significantly associated with contraceptive method use (P-value < 0.001), whereas the feature "HIV transmitted during pregnancy" was discarded from the analysis (Table 7). For example, for media exposure, an attribute with two levels (Yes or No), the difference in contraceptive use between respondents with and without media exposure was found to be statistically significant. Hence, media exposure was included as a potential predictor to train the model.
Table 7
Statistical association of knowledge attributes related to Contraception use using Chi-square test, EDHS 2016
| No | Attributes | Category | Contraception use: No | Contraception use: Yes | P-value |
|---|---|---|---|---|---|
| 1 | Media exposure | No | 6715 | 1412 | < 0.001 |
| | | Yes | 5656 | 1900 | |
| 2 | Contraception use intention | Doesn't intend to use | 6708 | 0 | < 0.001 |
| | | Non-user intends to use later | 5663 | 0 | |
| | | Using modern method | 0 | 3217 | |
| | | Using traditional method | 0 | 95 | |
| 3 | Knowledge of ovulatory cycle | After period ended: 2 | 3085 | 936 | < 0.001 |
| | | At any time: 5 | 2557 | 530 | |
| | | Before period begins: 4 | 879 | 274 | |
| | | Middle of cycle: 3 | 2694 | 1005 | |
| | | During her period: 1 | 374 | 108 | |
| | | Don't know: 8 | 2782 | 459 | |
| 4 | Heard family planning | No | 8223 | 1788 | < 0.001 |
| | | Yes | 4148 | 1524 | |
| 5 | Ever heard of STI | No | 1170 | 75 | < 0.001 |
| | | Yes | 11201 | 3237 | |
| 6 | Recent sexual activity | Active last 4 weeks | 4832 | 2723 | < 0.001 |
| | | Never had sex | 3709 | 12 | |
| | | Not active: no postpartum abstinence | 2937 | 517 | |
| | | Not active: postpartum abstinence | 893 | 60 | |
| 7 | Ever heard of AIDS | No | 1233 | 81 | < 0.001 |
| | | Yes | 11138 | 3231 | |
B. Features Pattern Analysis
Pattern analysis was carried out to understand the effect of each feature, at every category level, on contraceptive use. Of the study participants, almost 22% were in the 15 to 19 age group, and only 1.35% of all participants were contraceptive users in this group (Fig. 2). Similarly, 18% were in the 20 to 24 age group, and only 4.5% of all participants were contraceptive users in this group. A further 18% were in the 25 to 29 age group, of whom 30% were reported as contraceptive users. The 25 to 29 and 30 to 34 age groups had higher proportions of contraceptive users than the other age groups, whereas the proportion of users was lower in the 15 to 19 and 40 to 49 age groups (Fig. 2). The two lines in the figure are not parallel, which indicates that contraceptive use varies across the age groups of the respondents.
Of the study participants who were asked about contraceptive method use in the survey, 12.06%, 11.79%, 11.63%, 10.96%, and 10.72% were from the Oromiya, SNNP, Addis Ababa, Amhara, and Tigray regions, respectively (Fig. 3).
Participants with higher education made up 7.64% of the sample, and only 2.14% of all participants were contraceptive users with higher education. However, participants with no education (45% of the sample) reported the lowest proportion of contraceptive method use relative to the size of their group (8.59% of all participants). The graph shows a large gap between users and non-users among participants with no education. Contraceptive use decreases as educational level decreases (Fig. 4).
Among the study participants, 51% were married, and only 18% of all participants were married contraceptive users. However, participants who had never been in a union (27%) reported the lowest proportion of contraceptive method use (0.84%) (Fig. 5).
Of the study participants, 65% were rural residents, and only 13% of all participants were rural residents who reported using contraceptives. By comparison, only 8.5% of all participants were urban residents who reported using contraceptives (Fig. 6).
Among the study participants, Muslims and Orthodox Christians constituted 40% and 41% of the sample, respectively, and only 5% and 12% of all participants were contraceptive users from these groups. However, Catholic and traditional religion followers showed a different pattern, without the large gap between users and non-users observed among the other religions (Fig. 7).
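The percentages quoted in this pattern analysis can be reproduced from the cross-tabulated counts. The sketch below does so for the two largest religious groups using the counts in Table 6 and the sample size of 15,683 reported in the Experimentations section. The helper name `pattern` is hypothetical, and the sketch assumes that both percentages are expressed relative to the whole sample.

```python
TOTAL_RESPONDENTS = 15683  # sample size reported for model training

# (users, non-users) for the two largest religious groups, copied from Table 6.
counts = {"Orthodox": (1845, 4568), "Muslim": (761, 5448)}


def pattern(users, non_users, total=TOTAL_RESPONDENTS):
    """Return (group share of sample, users as share of sample), in percent."""
    group_share = 100 * (users + non_users) / total
    user_share = 100 * users / total
    return group_share, user_share


shares = {name: pattern(*pair) for name, pair in counts.items()}
for name, (group_share, user_share) in shares.items():
    print(f"{name}: {group_share:.1f}% of sample, "
          f"users are {user_share:.1f}% of sample")
```

Rounded to whole percentages, these values agree with the 40%/41% group shares and the 5%/12% user shares quoted for Muslims and Orthodox Christians.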
C. Experimentations
The classifiers were trained on 15,683 instances to build predictive models of contraceptive use. Several data mining algorithms were used: decision trees (J48, random tree, and random forest), Naïve Bayes, and artificial neural networks (ANNs). The five classifiers were trained under two scenarios and with varying testing options. Model performance was evaluated using 10-fold cross-validation, a standard option for controlling bias. The two scenarios differ in the attribute selection approach used to train the models: the classical approach and the proposed approach.
a) In the classical approach, we used the attribute evaluation and search method algorithms available in the Weka packages. Accordingly, five attributes (ethnicity, knowledge of any method, current marital status, recent sexual activity, and ever been tested for HIV) were selected using the classifier subset evaluator algorithm with both the BestFirst and GreedyStepwise search methods.
b) In the second approach, we applied the hybrid multidimensional metrics approach for feature selection, and the resulting 18 selected features (socio-demographic determinants, knowledge related to contraception use, knowledge related to AIDS and/or STIs, exposure to mass media, and knowledge of family planning) were used in all experiments. Current contraception methods use (CCMU), a binary outcome, is the response variable of the study. The features used in this study are listed in Table 8.
Table 8
List of possible attributes for predicting the model for contraceptive use, EDHS 2016
| Rank | Attributes | Contribution of each attribute to the model | Data type | Distinct values | % of missing values |
|---|---|---|---|---|---|
| 1 | Recent_Sexual_Activity | 0.12728 | Text | 4 | 0 |
| 2 | Curent_Marital_Status | 0.08209 | Text | 6 | 0 |
| 3 | Ethnicity | 0.06039 | Numeric | 46 | 0 |
| 4 | Num_Living_Children | 0.05981 | Text | 4 | 0 |
| 5 | Ever_been_Tested_HIV | 0.04436 | Text | 2 | 0 |
| 6 | AgeGroup | 0.04222 | Text | 9 | 0 |
| 7 | Region | 0.04219 | Text | 11 | 0 |
| 8 | WI_Combined | 0.02728 | Text | 5 | 0 |
| 9 | Religion | 0.02639 | Text | 6 | 0 |
| 10 | Desire_For_More_Children | 0.02579 | Text | 3 | 0 |
| 11 | Knowledge_Any_method | 0.01613 | Text | 3 | 0 |
| 12 | Ever_Heard_AIDS | 0.01125 | Text | 2 | 0 |
| 13 | Ever_Heard_STI | 0.01086 | Text | 2 | 0 |
| 14 | Knowledge_Ovulatory_Cycle | 0.01041 | Text | 6 | 0 |
| 15 | Heard_FP | 0.00793 | Text | 2 | 0 |
| 16 | Media_Exposure | 0.00654 | Text | 2 | 0 |
| 17 | Place of residence | 0.00328 | Text | 2 | 0 |
| 18 | Highest_LevEducation | 0.00255 | Text | 4 | 0 |
The predictive models were evaluated using the proposed hybrid multidimensional metrics for model selection; the experimental setups are summarized in Table 9. These performance measures are designed to fulfil the user's requirements.
Table 9
Summary of the experiments carried out with different testing options
| S.No | Experimentation of models | Testing options | No. of attributes | Selection attributes |
|---|---|---|---|---|
| Scenario 1 | Naïve Bayes | Training; cross-validation; percentage split | 5 | CfsSubsetEval + BestFirst; CfsSubsetEval + GreedyStepwise |
| | Decision tree (J48) | | 5 | |
| | Random tree | | 5 | |
| | Random forest | | 5 | |
| | Artificial neural networks | | 5 | |
| Scenario 2 | Naïve Bayes | Training; cross-validation; percentage split | 18 | Proposed approach |
| | Decision tree (J48) | | 18 | |
| | Random tree | | 18 | |
| | Random forest | | 18 | |
| | Artificial neural networks | | 18 | |
Figure 8 shows, for demonstration purposes, how the artificial neural network (ANN) takes a sample of features (individual inputs p1, p2, ..., pR) to build the predictive model for contraceptive use. Each input is weighted by the corresponding element w1,1, w1,2, ..., w1,R of the weight matrix W. The ANN was trained with three configurations to check whether predictive power could be improved: no hidden layers, two hidden neurons, and two layers of hidden neurons. However, the results for the three configurations were similar. Therefore, we recommend modeling contraceptive use with an ANN without hidden layers, for simplicity of model interpretation.
D. Comparison Analysis for the classifiers
1. The ROC Curve Analysis
The ROC value of the Naïve Bayes model of contraception use was found to be 85.1%. The ROC curve for Naïve Bayes rises sharply from the origin, showing that the true positive rate initially grows much faster than the false positive rate. The curve then flattens as it encounters fewer true positives and more false positives. The area under the curve for the Naïve Bayes model was found to be 85.2% (Fig. 9).
2. The Confusion Matrix
Extensive experiments were run with different testing options (training set, cross-validation, and percentage split), but the models were compared using the cross-validation (CV) option only, as it is a standard for controlling bias. With the CV option, Naïve Bayes achieved an accuracy of 79.85%, a sensitivity of 58.78%, and a specificity of 85.49%, with the lowest computation time in seconds. Similarly, the ANN (multilayer perceptron) classifier achieved an accuracy of 80.24%, a sensitivity of 44.89%, and a specificity of 89.70%, with the highest computation time in seconds. Moreover, the decision tree algorithms (J48, RT, and RF) achieved better accuracy than the NB and ANN classifiers (Table 10). Judged by accuracy alone, the J48 decision tree of scenario two is the best predictor (Table 11). However, we need to check whether the differences in performance between the models are statistically significant at the 5% level of significance, for further analysis and for future prediction. This statistical test of model significance is carried out on the F-measure of the data mining models (Table 12). The complete set of results used to compare the performance of the models is given in tabular format (Tables 10 and 11).
Table 10
Comparison of performance of different Classifiers, scenario 1 (n = 5)
| Evaluation criteria | Naïve Bayes | Decision tree (J48) | Decision tree (random tree) | Decision tree (random forest) | Neural networks | Class |
|---|---|---|---|---|---|---|
| Confusion matrix | 10576, 1795 | 11676, 695 | 11586, 785 | 11586, 785 | 11097, 1274 | No |
| | 1365, 1947 | 2320, 992 | 2235, 1077 | 2232, 1080 | 1825, 1487 | Yes |
| Accuracy (%) | 79.85 | 80.77 | 80.74 | 80.76 | 80.24 | |
| Sensitivity (%) | 58.78 | 29.95 | 32.51 | 32.60 | 44.89 | |
| Specificity (%) | 85.49 | 94.38 | 93.65 | 93.65 | 89.70 | |
| ROC (%) | 84.1 | 80.5 | 84.1 | 84.4 | 84.1 | |
| Computation time (seconds) | 0.01 | 0.02 | 0.09 | 0.7 | 61.16 | |
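The accuracy, sensitivity, and specificity rows can be recomputed from the confusion matrices. The sketch below does this for the Naïve Bayes matrix of Table 10. It assumes that the rows of each matrix are the actual classes (No, then Yes), that the columns are the predicted classes in the same order, and that "Yes" (contraceptive user) is the positive class; the function name `metrics` is hypothetical.

```python
def metrics(tn, fp, fn, tp):
    """Accuracy, sensitivity and specificity (in percent) from a 2x2 matrix."""
    total = tn + fp + fn + tp
    return {
        "accuracy": 100 * (tp + tn) / total,
        "sensitivity": 100 * tp / (tp + fn),  # true positive rate
        "specificity": 100 * tn / (tn + fp),  # true negative rate
    }


# Naive Bayes, scenario 1 (Table 10): row "No" = (10576, 1795), row "Yes" = (1365, 1947).
nb = metrics(tn=10576, fp=1795, fn=1365, tp=1947)
for name, value in nb.items():
    print(f"{name}: {value:.2f}%")
```

Under these assumptions the computed values agree with the reported 79.85%, 58.78%, and 85.49% to within rounding.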
Table 11
Comparison of performance of different Classifiers, scenario 2 (n = 18)
| Evaluation criteria | Naïve Bayes | Decision tree (J48) | Decision tree (random tree) | Decision tree (random forest) | Neural networks | Class |
|---|---|---|---|---|---|---|
| Confusion matrix | 9680, 2691 | 11416, 955 | 10732, 1639 | 11376, 995 | 10964, 1407 | No |
| | 868, 2444 | 1810, 1502 | 1845, 1467 | 1847, 1465 | 1680, 1632 | Yes |
| Accuracy (%) | 77.30 | 82.36 | 77.78 | 81.87 | 80.32 | |
| Sensitivity (%) | 73.79 | 45.35 | 44.29 | 44.23 | 49.27 | |
| Specificity (%) | 78.25 | 92.28 | 86.75 | 91.96 | 88.63 | |
| ROC (%) | 85.1 | 81.7 | 69.1 | 85.5 | 84.0 | |
| Computation time (seconds) | 0.00 | 0.24 | 0.07 | 3.78 | 518.2 | |
3. Model Evaluation for Data Imbalance Problem
1. Data Imbalanced Case
In this paper, receiver operating characteristic (ROC) curve analysis was also used to measure the performance of the models. With the imbalanced data, four of the classifiers achieved ROC values above 81%; the exception was the random tree, with 69.1%. Judged by accuracy alone, the decision tree (J48) is the best predictor in both scenarios (Tables 10 and 11). However, a paired two-tailed comparison was carried out using the paired corrected test option to measure the differences in performance among the models in predicting contraceptive use at the 5% level of significance, for further analysis and for future prediction (Table 12). The test was applied to the F-measure of the data mining models. Four data mining models (decision tree (J48), decision tree (random tree), decision tree (random forest), and neural network (MLP)) were compared against the Naïve Bayes model, given the same inputs. All the models used in this paper are efficient enough (predictive power exceeds 77%) to predict contraceptive method use among women, and none of the four models differed significantly from Naïve Bayes in F-measure (Table 12). Instead of reporting a P-value, WEKA uses three symbols (v/ /*) to mark the differences between models: v indicates that the model performs significantly better than the baseline (a victory), a blank indicates no significant difference, and * indicates that the model performs significantly worse.
Table 12
Model evaluation for the classifiers: paired corrected t-test on the F-measure, significance level 0.05 (two-tailed), for imbalanced data
| Dataset | (1) Naïve Bayes | (2) Decision tree (J48) | (3) Random tree | (4) Random forest tree | (5) Neural networks |
|---|---|---|---|---|---|
| DataSet_CPR_2018_19_Model: F-measures | 0.84 | 0.89 | 0.88 | 0.88 | 0.87 |
| (v/ /*) | | (0/1/0) | (0/1/0) | (0/1/0) | (0/1/0) |
2. Handling the problem of Imbalanced Data
About 21% of the respondents were reported as contraceptive users, so the class distribution of the target variable was unbalanced, which might bias the evaluation of the classifiers. To prevent one class from dominating the other, equal numbers of contraceptive users and non-users were drawn at random using the WEKA 3.7.7 pre-processing option. The results on the balanced data were then compared with those on the unbalanced data, to see whether the models differ on the performance measures used for prediction. The original sample size was 15,683; after adjusting for the data imbalance, the resampled size was 6586. In other words, the experiments below were re-run with equal numbers of contraceptive users and non-users. Table 13 shows the effect of the imbalance in the target variable, evaluated with the same measures after the adjustment. The models trained on the unbalanced data achieved slightly higher overall performance than the models trained on the balanced data, possibly because one target class dominated the other. Despite these slight differences, four of the classifiers have ROC values above 81%, and the ROC value of the random tree improved to 74.80%. This indicates that, given the features as input, the classifiers are efficient at predicting whether an individual is a contraceptive user or not. Moreover, judged by accuracy alone, the decision tree (random forest algorithm) is the best predictor (Table 13).
However, we need to check whether the differences in performance between the models are statistically significant at the 5% level of significance, for further analysis and for future prediction (Table 14).
Table 13
Comparison of performance of different Classifiers, for balanced data
| Evaluation criteria | Naïve Bayes | Decision tree (J48) | Decision tree (random tree) | Decision tree (random forest) | Neural networks | Class |
|---|---|---|---|---|---|---|
| Confusion matrix | 2176, 1136 | 2325, 987 | 2475, 837 | 2405, 907 | 2477, 835 | No |
| | 391, 2921 | 454, 2858 | 964, 2348 | 477, 2835 | 782, 2530 | Yes |
| Accuracy (%) | 76.94 | 78.24 | 72.81 | 79.10 | 75.58 | |
| Sensitivity (%) | 65.70 | 70.19 | 74.72 | 72.61 | 76.38 | |
| Specificity (%) | 88.19 | 86.29 | 70.89 | 85.59 | 74.78 | |
| ROC (%) | 84.80 | 81.70 | 74.80 | 86.70 | 84.20 | |
| Computation time (seconds) | 0.0 | 0.13 | 0.07 | 0.19 | 260.59 | |
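The balancing step, drawing equal numbers of users and non-users at random, can be sketched as random undersampling of the majority class. This is an illustration only and not the WEKA pre-processing filter used in the study: the function `undersample`, the fixed seed, and the toy records are assumptions, and the class sizes (3312 users, 12,371 non-users) are taken from the confusion-matrix row totals in Tables 10 and 11.

```python
import random


def undersample(records, label_of, seed=0):
    """Randomly keep the same number of records from every class."""
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(label_of(rec), []).append(rec)
    # Every class is cut down to the size of the smallest class.
    n = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)
    return balanced


# Toy stand-in for the survey records: (class label, respondent index).
records = [("Yes", i) for i in range(3312)] + [("No", i) for i in range(12371)]
balanced = undersample(records, label_of=lambda rec: rec[0])

n_yes = sum(1 for rec in balanced if rec[0] == "Yes")
n_no = len(balanced) - n_yes
print(n_yes, n_no)
```

Undersampling discards majority-class records, which is why the balanced sample is much smaller than the original one.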
Table 14 presents a paired two-tailed comparison (paired corrected t-test) of the differences in performance among the models in predicting contraception use by women at the 5% level of significance, after adjusting for the data imbalance. Four data mining models (decision tree (J48), decision tree (random tree), decision tree (random forest), and neural network (MLP)) were compared against the Naïve Bayes model, given the same inputs. There were statistically significant differences between two of the decision tree models (J48 and random forest) and the Naïve Bayes model in predicting contraceptive method use (Table 14). The performance of these two decision tree models was marked as a victory (significantly better) compared with the Naïve Bayes model. Nevertheless, all the models used in this paper are efficient enough (predictive power exceeds 77%) to predict contraceptive method use among women.
Table 14
Model evaluation for the classifiers: paired corrected t-test on the F-measure, significance level 0.05 (two-tailed), after adjusting for the data imbalance
| Dataset | (1) Naïve Bayes | (2) Decision tree (J48) | (3) Random tree | (4) Random forest tree | (5) Neural networks |
|---|---|---|---|---|---|
| DataSet_CPR_2018_19_Model: F-measures | 0.74 | 0.76 v | 0.73 | 0.77 v | 0.76 |
| (v/ /*) | | (0/1/0) | (0/1/0) | (0/1/0) | (0/1/0) |
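The paired comparison behind Tables 12 and 14 can be illustrated with a corrected resampled t-test. The sketch below assumes that WEKA's paired corrected tester follows the variance correction of Nadeau and Bengio, in which the usual 1/k term is inflated by the test-to-train size ratio. The per-fold F-measures are hypothetical values invented for illustration (they are not results from the study), and the function name `corrected_paired_t` is also hypothetical.

```python
import math
from statistics import mean, variance


def corrected_paired_t(scores_a, scores_b, test_train_ratio):
    """Corrected resampled t statistic for paired per-fold scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    var = variance(diffs)  # sample variance of the per-fold differences
    if var == 0:
        m = mean(diffs)
        return 0.0 if m == 0 else math.copysign(math.inf, m)
    # The 1/k term is inflated by n_test / n_train to account for the
    # overlap between training sets across folds.
    return mean(diffs) / math.sqrt((1 / k + test_train_ratio) * var)


# Hypothetical per-fold F-measures from one run of 10-fold cross-validation.
j48 = [0.77, 0.75, 0.78, 0.76, 0.77, 0.75, 0.78, 0.76, 0.77, 0.75]
nb = [0.74, 0.73, 0.75, 0.74, 0.74, 0.73, 0.75, 0.73, 0.75, 0.74]

# With 10 folds, each test fold is 1/9 the size of its training set.
t = corrected_paired_t(j48, nb, test_train_ratio=1 / 9)

# Two-tailed 5% critical value of Student's t with k - 1 = 9 degrees of freedom.
T_CRIT = 2.262
print(f"t = {t:.2f}, significant at 5%: {abs(t) > T_CRIT}")
```

A statistic beyond the critical value in favour of the compared model corresponds to the "v" marker, one beyond it in favour of the baseline corresponds to "*", and anything in between is reported as no difference.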
4. Hybrid Multidimensional Metrics for Model Selection
A hybrid multidimensional metric was used to compute the overall significance of each model, taking into account both the user's requirements and the weights that express their importance, as defined in Eq. (5). A higher \(HMM\left( {m,r} \right)\) indicates a model that better satisfies the full set of user requirements, unlike classical metrics that use a single criterion to pick the best-fit model (Table 15). Accordingly, the decision tree (J48) was found to be the best-fit model for the prediction task under the hybrid metrics criterion. On the other hand, the ANN was found to be the most computationally expensive model for our prediction task.
Table 15
Hybrid multidimensional metrics criterion for final model selection
| Metrics | Requirement's indicator | NB | DT | RT | RF | ANNs |
|---|---|---|---|---|---|---|
| ROC values | 1 | 0.15 | 0.15 | 0.15 | 2/5 | 0.15 |
| Accuracy | 1 | 0.15 | 0.15 | 0.15 | 2/5 | 0.15 |
| Data imbalance problem handled | 1 | 0.15 | 0.15 | 0.15 | 2/5 | 0.15 |
| Statistical significance | 1 | 0.13 | 0.305 | 0.13 | 0.305 | 0.13 |
| Practicability and applicability of the model | 1 | 0.15 | 2/5 | 0.15 | 0.15 | 0.15 |
| Simplicity of model interpretation | 1 | 0.15 | 2/5 | 0.15 | 0.15 | 0.15 |
| Consistency with established knowledge | 1 | 0.15 | 2/5 | 0.15 | 0.15 | 0.15 |
| Algorithm's simplicity in terms of time and space | 1 | 0.15 | 2/5 | 0.15 | 0.15 | 0.15 |
| \(HMM\left( {m,r} \right)\) | | 0.236 | 0.47 | 0.25 | 0.42 | 0.236 |
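The bottom row of Table 15 can be approximated with a weighted sum. Because Eq. (5) is not reproduced in this section, the sketch below rests on an assumption: each score is the sum of requirement indicator times classifier weight, divided by the number of candidate models (five). This assumption reproduces the printed values for NB, DT, RF, and ANNs; it yields 0.236 rather than the printed 0.25 for RT. The function name `hmm_score` is hypothetical.

```python
# Classifier weights per requirement, copied column by column from Table 15
# (2/5 written as 0.4). Order of requirements: ROC, accuracy, imbalance handled,
# statistical significance, practicability, interpretability, consistency,
# time/space simplicity.
weights = {
    "NB": [0.15, 0.15, 0.15, 0.13, 0.15, 0.15, 0.15, 0.15],
    "DT": [0.15, 0.15, 0.15, 0.305, 0.4, 0.4, 0.4, 0.4],
    "RT": [0.15, 0.15, 0.15, 0.13, 0.15, 0.15, 0.15, 0.15],
    "RF": [0.4, 0.4, 0.4, 0.305, 0.15, 0.15, 0.15, 0.15],
    "ANNs": [0.15, 0.15, 0.15, 0.13, 0.15, 0.15, 0.15, 0.15],
}
indicators = [1] * 8  # every requirement is switched on in Table 15


def hmm_score(model_weights, requirement_indicators, n_models):
    """Assumed form of HMM(m, r): indicator-weighted sum over n_models."""
    weighted = sum(r * w for r, w in zip(requirement_indicators, model_weights))
    return weighted / n_models


scores = {m: hmm_score(w, indicators, len(weights)) for m, w in weights.items()}
best = max(scores, key=scores.get)
for model, score in scores.items():
    print(f"{model}: {score:.3f}")
print("best model:", best)
```

Under this assumed form, the decision tree (DT) obtains the highest score, in line with the model selected in the text.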