Predictive models and under-five mortality determinants in Ethiopia: evidence from the 2016 Ethiopian Demographic and Health Survey

Background: There is a dearth of literature on predictive models estimating under-five mortality risk in Ethiopia. In this study, we develop a spatial map and predictive models to identify the sociodemographic determinants of under-five mortality in Ethiopia. Methods: The study data were drawn from the 2016 Ethiopian Demographic and Health Survey. We used three predictive models to predict under-five mortality within this sample: random forests, logistic regression, and k-nearest neighbors. For each model, measures of model accuracy and Receiver Operating Characteristic curves were used to evaluate predictive power. Results: There are considerable regional variations in under-five mortality rates in Ethiopia. The under-five mortality prediction ability was found to be moderate to low for the models considered, with the random forest model showing the best performance. Maternal age at birth, sex of a child, previous birth interval, water source, health facility delivery services, antenatal and post-natal care checkups, breastfeeding behavior and household size were found to be significantly associated with under-five mortality in Ethiopia. Conclusions: The random forest machine learning algorithm produces a higher predictive power for under-five mortality risk factors for the study sample. There is a need to improve the quality of and access to health care services to enhance childhood survival chances in the country. Male children, children born after short birth intervals, children from households with unimproved water sources, children delivered at facilities without Cesarean section (CS) services, as well as children whose mothers do not receive antenatal and post-natal care, who are not breastfed immediately and who live in smaller households all have increased risks for under-five mortality in Ethiopia.
This study highlights the potential of machine learning methods in predicting under-five mortality risk factors and points to crucial areas for policy development. Our findings reinforce the need to improve the quality of and access to health care services such as antenatal, delivery, and post-natal care, as well as family planning services, to enhance childhood survival chances in the country. The findings also suggest that expanding access to improved drinking water would help to substantially reduce under-five mortality in the country in the future.

The under-five mortality rate in low-income countries was 69 deaths per 1000 live births in 2017, almost 14 times the rate in high-income countries (5 deaths per 1000 live births) [1]. More than half of these deaths are due to infectious diseases (such as pneumonia and diarrhea) that are preventable and treatable through simple, affordable interventions [2].
Despite considerable improvements over the past decades, sub-Saharan Africa remains the region with the highest level of under-five mortality in the world, with about half of the global under-five mortality burden [1]. Ethiopia has been found to have the fifth-highest number of newborn deaths in the world, following India, Pakistan, Nigeria, and the Democratic Republic of Congo [3]. It is estimated that about 472,000 children die in Ethiopia each year before their fifth birthday, which places Ethiopia sixth among the countries of the world in terms of the absolute number of under-five deaths [4]. In Ethiopia, the under-five mortality rate declined by two thirds, from 204 per 1,000 live births in 1990 to 58 per 1,000 live births in 2016, thereby achieving the target for Millennium Development Goal 4 (MDG 4) [5]. Despite this achievement, the under-five mortality rate in Ethiopia is still higher than those of many low- and middle-income countries (LMIC).
Previous studies have provided much evidence on the socioeconomic and demographic factors that are associated with under-five mortality in Ethiopia [6][7][8], using traditional regression models. In this study, we ascertain the determinants of under-five mortality in Ethiopia using non-traditional regression models, drawing on nationally representative data. Specifically, we employ machine learning techniques to predict under-five mortality in this sample. The main aim is to determine the best predictive model and to highlight the potential of machine learning techniques for estimating sociodemographic effects on under-five mortality in future research. We also first develop a spatial visualization of the under-five mortality rate by region in Ethiopia. The goal is to highlight visually the spatial disparities in under-five mortality in the country and to inform and strengthen appropriate policies or intervention strategies aimed at reducing under-five mortality in the country.

Data source
This study draws on data from the 2016 Ethiopian Demographic and Health Survey (EDHS), the most recent in the demographic and health survey series, which is conducted every five years. The EDHS is a nationally representative household survey that collects data on a wide range of population, health and nutrition indicators with the aim of improving maternal and child health in Ethiopia [9]. The survey used a multi-stage stratified sampling technique based on the 2007 National Population and Housing Census of Ethiopia to select respondents from a total of 645 clusters (202 urban and 443 rural) [9]. A total of 10,641 children under age 5 of mothers selected from these 645 clusters were included in this study. Deaths were identified from retrospective information obtained from mothers about children who died under age five within the five years preceding the survey (2011 to 2016).

Study variables
In this study, the outcome variable, under-five mortality, was measured in two ways to suit the three different models used. For the logistic regression and random forest models, the primary outcome of interest was under-five mortality categorized as being alive (coded as 0) or dead (coded as 1). Under-five mortality was also defined as the death of a child after birth through 60 months of life.
The predictors (features) used in this study include community, household, individual and health services factors. The community factors comprised residence type (urban/rural) and geographical region (Tigray, Afar, Amhara, Oromia, Somali, Benishangul-Gumuz, Southern Nations Nationalities and People Region (SNNPR), Gambella, Harari, Dire Dawa, and Addis Ababa). The household factors were the source of drinking water (improved/unimproved), time to water source, toilet facility (improved/unimproved), household wealth index (low, middle, high) and household size. The individual-level factors consisted of maternal and child characteristics. Maternal factors included mother's age at birth (<20, >20), education (no education, primary, secondary/higher), contraceptive use (Yes/No) and mother's body mass index (BMI) (underweight/overweight and normal). Child factors included whether the child was wanted (wanted then, wanted later, not at all), sex of the child, birth order (1-2, 3/later), births in the last 5 years, and previous birth interval (<2, 2-4, >4 years), as well as whether the child was breastfed within 1 hour of birth. The health services factors included antenatal visits (0, 1-4, 5+ visits), place and mode of delivery services (facility with Cesarean Section (CS) services, facility without CS, home), and postnatal visits within two months after delivery (Yes/No).
The selection of these predictor variables was based on information from existing literature on the subject.

Analytic strategy
The R programming language (version 3.6.0) and the caret package [10] were used to perform the data processing and analysis. We first developed a spatial map of crude under-five mortality rates by region in Ethiopia to document the regional disparities in under-five mortality in the country. To do so, we estimated the rates of under-five mortality by region and then merged them with an Ethiopian regional shapefile before mapping them.
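The regional rate calculation described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study. It assumes the crude rate is the weighted number of deaths per 1,000 children in the sample; the record layout, the function name, and the toy values are all hypothetical.

```python
from collections import defaultdict


def crude_u5mr_by_region(records):
    """Return weighted deaths per 1,000 children for each region.

    Each record is a (region, died, weight) tuple, where died is 1 or 0
    and weight is a survey weight. This layout is hypothetical and is
    not the EDHS file schema.
    """
    deaths = defaultdict(float)
    children = defaultdict(float)
    for region, died, weight in records:
        deaths[region] += weight * died
        children[region] += weight
    return {r: 1000.0 * deaths[r] / children[r] for r in children}


# Toy records, invented for illustration only.
sample = [
    ("Afar", 1, 1.0), ("Afar", 0, 1.0), ("Afar", 0, 2.0),
    ("Tigray", 0, 1.5), ("Tigray", 0, 0.5), ("Tigray", 1, 0.5),
]
rates = crude_u5mr_by_region(sample)

# Stand-in for the shapefile merge: join the rates onto a list of region
# names; regions with no data receive None.
shapefile_regions = ["Afar", "Tigray", "Harari"]
merged = [(name, rates.get(name)) for name in shapefile_regions]
```

Joining on the region name is the step that, in a real workflow, would attach each rate to its polygon before mapping.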
We also used three widely accepted machine learning algorithms, logistic regression, a random forest model (RFM), and k-nearest neighbors (KNN), to predict under-five mortality in Ethiopia. These three models were selected for the following reasons. First, logistic regression is typically used to analyze binary data and is commonly used as an inferential tool in population health research, but it can also be used as a binary classification model.
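To illustrate how a fitted logistic regression serves both as an inferential tool and as a binary classifier, the following minimal Python sketch converts coefficients into a predicted probability, a class label, and odds ratios. The coefficient values, the variable names, and the 0.5 threshold are hypothetical and are not estimates from this study.

```python
import math


def predict_probability(intercept, coefs, features):
    """Apply the logistic function to the linear predictor."""
    linear = intercept + sum(coefs[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-linear))


def classify(probability, threshold=0.5):
    """Return 1 (dead) if the probability reaches the threshold, else 0 (alive)."""
    return 1 if probability >= threshold else 0


def odds_ratios(coefs):
    """Exponentiate each coefficient to obtain its odds ratio."""
    return {name: math.exp(beta) for name, beta in coefs.items()}


# Hypothetical coefficients, invented for illustration only.
intercept = -1.0
coefs = {"male": 0.4, "improved_water": -0.7}
child = {"male": 1, "improved_water": 0}

p = predict_probability(intercept, coefs, child)
label = classify(p)
ratios = odds_ratios(coefs)
```

An odds ratio above 1 corresponds to a positive coefficient (higher odds of death) and one below 1 to a negative coefficient (lower odds).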
Second, the KNN model was chosen for its ability to detect both linear and nonlinear boundaries between groups. The KNN method relies on finding the best value of k so that the k closest observations are used to predict the value of a given observation. "Closeness" of observations is usually measured using a distance metric such as the Euclidean distance between observations.
Third, from a predictive modeling perspective, random forests are commonly used in machine learning because they are highly flexible and often provide better predictive performance. Random forests repeatedly resample the training data set, each time using a random subset of predictor variables to grow a decision tree. After many of these trees are formed, the forest is examined to see which variables consistently produce a better prediction. In this regard, machine learning techniques draw on a learning process that extracts useful information from the data generation process of previous observations [11]. Machine learning is touted as a prominent application of artificial intelligence technology for ensuring good health and social care for an entire population through preventive strategies and protection from diseases [12].
We randomly selected 80% of the original data as a training set, which was used with 10-fold cross-validation to tune the model parameters. The remaining 20% random sample was held out as test data to estimate the measures of model performance.
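The partitioning described above, together with the down-sampling of the majority class that the analysis applies to the training data, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the helper names, the seed, and the toy rows are hypothetical, and the sketch down-samples the training set once rather than reproducing how caret handles resampling.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of rows and cut it into training and test parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def down_sample(rows, label_of, seed=0):
    """Randomly drop rows so every class is as small as the rarest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


def k_fold_indices(n, k=10):
    """Assign the indices 0..n-1 to k folds in round-robin order."""
    return [list(range(i, n, k)) for i in range(k)]


# Toy rows (id, died), invented for illustration: 10 deaths among 200 children.
rows = [(i, 1 if i % 20 == 0 else 0) for i in range(200)]
train, test = train_test_split(rows)
balanced = down_sample(train, label_of=lambda row: row[1])
folds = k_fold_indices(len(balanced), k=10)
```

In 10-fold cross-validation each fold serves once as the validation set while the other nine are used for fitting; the held-out 20% is touched only for the final performance estimates.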
Because the outcome is unbalanced (only a small fraction of children in the data died), the training data were down-sampled so that the training set contained equal proportions of children who were alive after 5 years and children who had died before 5 years. The performance of these algorithms was evaluated using several metrics, including the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC), which are useful in deciding which model provides the best discriminatory power between the dead and alive cases. The positive and negative predictive accuracy of each model was also calculated to show how well the model performs in predicting the alive and dead cases, respectively. The results from all of the above models were weighted using person weights provided by the DHS. For the logistic regression model, we infer the importance and significance of predictors using traditional t-statistics and odds ratios derived from the model estimation; these are not available for the random forest and KNN methods. For these models, the Mean Decrease in Gini is calculated as a measure of variable importance.

Descriptive results of the background characteristics
Table 1 shows under-five mortality by sample characteristics. Among the 10,641 under-five children in the sample, mortality prevalence differed significantly by sex, with female children experiencing higher mortality (6.7%) than males (4.2%). There were also considerable differences by birth interval, with under-five mortality being more prevalent among children with birth intervals of 2-4 years and over 4 years (4.455 and 4.53%, respectively). Under-five mortality was also significantly more prevalent among children using unimproved water sources (5.8%) than among those using improved water sources (2.9%).
Significant differences were also observed regarding antenatal visits and postnatal care, with under-five mortality being considerably more prevalent among children whose mothers did not receive antenatal (5.6%) and postnatal care (4.2%). Children who were breastfed more than one hour after birth had a significantly higher prevalence of death (9.8%) than those breastfed within one hour of birth (4.5%), while there was also evidence of a significant difference in under-five mortality by the number of people in the household. The rest of the characteristics did not show any significant difference in mortality prevalence among their categories.

Predicting under-five mortality
Below, we report results from the three machine learning models (logistic regression, random forests, and k-nearest neighbors) used to predict the under-five mortality outcome (Table 2). The under-five mortality prediction accuracy was found to be low for all models, between 46.3% and 67.2% on the test set, with the random forest model having the highest overall accuracy. The random forest model had high sensitivity, meaning that it was accurate at distinguishing the alive cases from the dead cases, but low specificity, meaning that it was not good at discerning the dead cases. Additional metrics show that the model is relatively good at predicting both positive (alive) and negative (dead) cases. The model was able to correctly identify 70% of dead cases (28/(28+12)), which suggests it is relatively good at predicting the dead cases. The logistic and KNN models both show lower overall accuracy (59.9% and 46.3%, respectively), and lower sensitivity, specificity, and positive and negative predictive values.
The results for the receiver operating characteristic (ROC) curves are shown in Figure 1. Among the three machine learning models employed in this study, the curve for the random forest model shows the highest AUC value, indicating that it is the best of the models considered at separating the two classes.
The logistic regression model is the only one of the three that allows for direct interpretation of the model coefficients. Table 3 shows the estimated odds ratios and confidence intervals for the model parameters. The model was estimated with the full survey design and weighted to be representative of the population. Factors associated with an increased risk of under-five mortality were male sex, higher birth order and being born in a facility without C-section services. Protective factors were longer birth intervals, an improved water source, having received antenatal and postnatal care, and larger household size.
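The performance measures reported in this section can be computed from a confusion matrix and from predicted scores, as in the minimal Python sketch below. Following the text, "alive" is treated as the positive class and "dead" as the negative class. All counts, scores, and function names are hypothetical and are not the study's results; the AUC is computed with the pairwise-ranking definition rather than by tracing the ROC curve.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Summary measures from confusion-matrix counts (positive = alive)."""
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "sensitivity": tp / (tp + fn),   # share of alive cases predicted alive
        "specificity": tn / (tn + fp),   # share of dead cases predicted dead
        "ppv": tp / (tp + fp),           # share of predicted-alive that are alive
        "npv": tn / (tn + fn),           # share of predicted-dead that are dead
    }


def auc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical counts and scores, invented for illustration only.
metrics = confusion_metrics(tp=80, fn=20, fp=30, tn=20)
example_auc = auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes, which is why it is a convenient single number for comparing models.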

Discussion
The study develops a series of predictive models and a regional map for under-five mortality in Ethiopia using machine learning techniques. The spatial map provides evidence of considerable regional disparities in under-five mortality rates in Ethiopia, similar to what has been found in Ghana [13]. Tigray and some regions in the central part of the country show the lowest under-five mortality rates, whereas regions in the eastern and western parts of the country have the highest. Providing evidence on the underlying risk factors may help to better understand the spatial variations of under-five mortality in the country. Regarding the predictive models, the prediction accuracy and AUC statistics are highest for the random forest model, indicating its higher predictive power compared to the other models considered here. The random forest model shows that household size, time to the water source, breastfeeding behavior, number of recent births, child sex and length of birth intervals are the strongest predictors of child mortality.
The logistic regression model shows that a child's sex, preceding birth interval, water source, place and mode of delivery, antenatal care checkup, postnatal care, household size, and breastfeeding behavior are significantly associated with under-five mortality in Ethiopia. In this study, children of teenage mothers show a higher risk of under-five mortality than children of older mothers. In addition, male children show a significantly higher risk of dying before age five compared with female children. This is consistent with the finding of a cross-sectional study conducted in Bangladesh [14]. It has been shown that male children have an increased risk of dying in the first month of life because of high vulnerability to infectious disease. This is because female neonates are more likely to develop early fetal lung maturity in the first week of life, which may result in a lower incidence of respiratory diseases in female compared with male neonates [15].
In this study, higher birth order appears to be associated with a significantly higher risk of under-five mortality. Similarly, the unfavorable effect of higher birth order on childhood survival chances has been well documented in Africa [16] as well as in some parts of Asia [17,18], and may be due to fierce competition for scarce household resources. Also, the risk of under-five mortality is significantly higher among children with a preceding birth interval of less than 2 years than among children with a birth interval of more than 2 years. There is much evidence that longer birth intervals improve the survival chances of succeeding children [19,20]. A short preceding birth interval can be said to influence under-five mortality through three main mechanisms. First, closely spaced births may cause depletion of the mother. The second mechanism is sibling competition, while the third is the transmission of infectious diseases between closely spaced children [21]. While the first mechanism is biological, the last two are said to be behavioral effects of a short preceding birth interval [22].
Furthermore, this study finds that the use of an unimproved source of drinking water is associated with an increased risk of under-five mortality. Lack of access to clean water has been considered one of the important factors that contribute to more than 80 percent of child deaths in the world [23]. There is also considerable evidence from studies in developing countries that household sanitation and a clean water supply promote child health and survival [24,25]. In Ethiopia, the proportion of the population using improved drinking-water sources is only 57%, and the proportion using improved sanitation is less than five percent [2]. This may have serious implications for under-five mortality levels in the country. This study further provides evidence that children whose mothers do not use any contraceptives have a significantly higher risk of under-five mortality than their counterparts whose mothers use modern contraceptives.
This study also finds that delivery in health facilities without CS services and at home is associated with a higher under-five mortality risk. This may be mainly related to the handling of delivery complications that can raise under-five mortality risk. Health facilities with CS services are very scarce in Ethiopia; even where they are available, transportation challenges encourage women to deliver at home even when facility-based delivery is available at minimal cost [26]. Moreover, this study provides evidence of a positive effect of antenatal and postnatal care checkups on under-five survival chances. This is consistent with the significant association observed in the literature between antenatal and postnatal care and lower under-five mortality risk [27,28]. The implication is that children whose mothers do not receive antenatal and postnatal care services may experience more proximate under-five mortality risk factors, such as congenital and infectious diseases, than their counterparts. This study has also shown a considerable positive effect of early initiation of breastfeeding on childhood survival chances.
Breastfeeding has long been shown to be an important protective factor against under-five mortality, particularly in developing countries [29,30], and has a key part to play in childhood survival interventions. Quite surprisingly, larger household size appears to be associated with reduced under-five mortality risk in this study, contrary to what has been documented in the literature [18]. However, this may well be explained by household-level contextual factors in the country, such as the availability of considerable social support from siblings.
This study is not without limitations. The survey comprised only surviving women, and since neonatal and maternal deaths may occur concurrently, this may have led to an underestimation of the under-five mortality rates. Also, cross-sectional survey data such as the DHS provide only a snapshot of the situation, unlike a longitudinal approach. There may also be recall bias or non-disclosure of deaths by mothers, which may lead to an underestimate of the number of deaths. Nevertheless, the machine learning techniques used provided a strong case for predicting the underlying risk factors of under-five mortality in the study sample.

Conclusions
This study provides evidence of considerable regional disparities in under-five mortality rates in Ethiopia, with the highest rates observed in the Afar, Benishangul-Gumuz and Somali regions. In this study, the random forest model provides modestly higher predictive power than the logistic regression and k-nearest neighbor models in predicting under-five mortality risks in Ethiopia. Under-five mortality in Ethiopia is significantly associated with maternal age at birth, sex of a child, previous birth interval, water source, health facility delivery services, antenatal and post-natal care checkups, household size and breastfeeding immediacy. Children of teenage mothers, male children, children born after short birth intervals, children from households with unimproved water sources, children delivered at facilities without CS services, as well as children whose mothers do not receive antenatal and post-natal care, who are not breastfed immediately and who live in smaller households all have increased risks for under-five mortality in Ethiopia.

Ethics approval and consent to participate
The study used secondary data from the EDHS. Ethical approval was not applicable.

Availability of data and methods
The datasets analyzed in this study are available on The DHS Program website.

Figures
Figure 1 Spatial distribution of crude under-five mortality rates by regions in Ethiopia. Source: Created by the authors using estimates from EDHS.
Variable importance measures for Logistic Regression Model
Variable importance measures for K-nearest neighbor Model