This study draws on data from the 2016 Ethiopian Demographic and Health Survey (EDHS), the most recent in a series of demographic and health surveys conducted every five years. The EDHS is a nationally representative household survey that collects data on a wide range of population, health, and nutrition indicators with the aim of improving maternal and child health in Ethiopia. The survey used a multi-stage stratified sampling technique based on the 2007 National Population and Housing Census of Ethiopia to select respondents from a total of 624 clusters (187 urban and 437 rural). A total of 10,641 children under age 5 of mothers selected from 645 clusters were included in this study. This was based on retrospective information obtained from mothers about children that died under age five within the five years preceding the survey (2011 to 2016).
In this study, the outcome variable – under-five mortality – was measured in two ways to suit the three different models used. For the logistic regression and random forest models, the primary outcome of interest was under-five mortality categorized as being alive (coded as 0) or dead (coded as 1). Under-five mortality was also defined as the death of a child after birth through 60 months of life.
The predictors (features) used in this study included community, household, individual, and health services factors. The community factors comprised residence type (urban/rural) and geographical region (Tigray, Afar, Amhara, Oromia, Somali, Benishangul-Gumuz, Southern Nations Nationalities and People Region (SNNPR), Gambella, Harari, Dire Dawa, and Addis Ababa). The household factors were the source of drinking water (improved/unimproved), time to water source, toilet facility (improved/unimproved), household wealth index (low, middle, high), and household size. The individual-level factors consisted of maternal and child characteristics. Maternal factors included mother’s age at birth (<20, >20), education (no education, primary, secondary/higher), contraceptive use (Yes/No), and mother’s body mass index (BMI) (underweight/overweight and normal). Child factors included whether the child was wanted (wanted then, wanted later, not at all), sex of the child, birth order (1-2, 3/later), births in the last 5 years, previous birth interval (<2, 2-4, >4 years), and whether the child was breastfed within 1 hour of birth. The health services factors included antenatal visits (0, 1-4, 5+ visits), place and mode of delivery services (facility with Cesarean Section (CS) services, facility without CS, home), and postnatal visits within two months after delivery (Yes/No). These predictor variables were selected on the basis of the existing literature on the subject.
The R programming language (version 3.6.0) and the caret package were used to perform the data processing and analysis. We first developed a spatial map of crude under-five mortality rates by region to document the regional disparities in under-five mortality in Ethiopia. To do so, we estimated the rates of under-five mortality by region and then merged them with an Ethiopian regional shapefile before mapping them.
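The regional rate calculation described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used in the study: the record layout, the function name `crude_u5_mortality_by_region`, and the toy values are assumptions made for the example, and survey weights and the shapefile merge are omitted.

```python
from collections import defaultdict


def crude_u5_mortality_by_region(records):
    """Return crude under-five deaths per 1,000 children for each region.

    `records` is an iterable of (region, died) pairs, where `died` is 1 if
    the child died before age five and 0 otherwise (hypothetical layout).
    """
    deaths = defaultdict(int)
    children = defaultdict(int)
    for region, died in records:
        children[region] += 1
        deaths[region] += died
    # Unweighted rate; a survey analysis would apply sampling weights here.
    return {region: 1000 * deaths[region] / children[region]
            for region in children}


# Toy records invented for illustration only (not EDHS values).
toy_records = [
    ("Afar", 1), ("Afar", 0), ("Afar", 0), ("Afar", 0),
    ("Tigray", 1), ("Tigray", 0), ("Tigray", 0), ("Tigray", 0), ("Tigray", 0),
]
rates = crude_u5_mortality_by_region(toy_records)
for region, rate in sorted(rates.items()):
    print(f"{region}: {rate:.1f} deaths per 1,000 children")
```

The resulting dictionary is keyed by region name, which is the form one would typically need in order to join the rates to the attribute table of a regional shapefile.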
We also used three widely accepted machine learning algorithms, namely logistic regression, a random forest model (RFM), and K-nearest neighbors (KNN), to predict under-five mortality in Ethiopia. These three models were selected for the following reasons. First, logistic regression is typically used to analyze binary data and is commonly used as an inferential tool in population health research, but it can also be used as a binary classification model. Second, the KNN model was chosen for its ability to detect both linear and nonlinear boundaries between groups. The KNN method relies on finding the best value of k so that the k closest observations are used to predict the value of a given observation. “Closeness” of observations is usually measured using a distance metric such as the Euclidean distance between observations. Third, from a predictive modeling perspective, the random forest model is commonly used in machine learning because it is highly flexible and often provides strong predictive performance. Random forests repeatedly resample the training data set, each time using a random subset of the predictor variables to grow a decision tree. After many of these trees are formed, the forest is examined to see which variables consistently produce better predictions. More generally, machine learning techniques draw on a learning process that extracts useful information from the data generation process of previous observations. Machine learning is touted as a prominent application of artificial intelligence technology for ensuring good health and social care for an entire population through preventive strategies and protection from diseases.
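To make the KNN idea concrete, the following is a minimal sketch of majority-vote classification using Euclidean distance. It is written in plain Python for illustration and is not the caret implementation used in the study; the function names and the toy points are assumptions, and categorical predictors would first need a numeric encoding.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_X, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest
    training observations."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy training data invented for illustration: two well-separated groups.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.5, 0.5), k=3))
print(knn_predict(train_X, train_y, (5.5, 5.5), k=3))
```

In practice k is a tuning parameter: an odd k avoids tied votes in a binary problem, and candidate values are usually compared by cross-validation.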
We randomly selected 80% of the original data as a training set, on which 10-fold cross-validation was used to tune the model parameters. The remaining 20% was held out as test data to estimate model performance. Because the outcome is unbalanced (only a small fraction of children in the data died), the training data were down-sampled so that the training set contained equal proportions of children who were alive after 5 years and children who had died before 5 years. The performance of the algorithms was evaluated using several metrics, including the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC), which are useful in deciding which model best discriminates between the dead and alive cases. The positive and negative predictive accuracy of each model was also calculated to show how well the model predicts the dead and alive cases, respectively. The results from all of the above models were weighted using person weights provided by the DHS. For the logistic regression model, we inferred the importance and significance of predictors using traditional t-statistics and odds ratios derived from the model estimation; these are not available for the random forest and KNN methods. For those models, the Mean Decrease in Gini, a measure of variable importance, was calculated instead.
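The down-sampling and evaluation steps can be illustrated with a short sketch. It is written in plain Python under simplifying assumptions (no survey weights, no cross-validation, a fixed classification threshold), and the function names and toy values are hypothetical rather than taken from the study. The AUC is computed through its rank interpretation: the probability that a randomly chosen death receives a higher predicted score than a randomly chosen survivor.

```python
import random


def downsample(rows, label_of, seed=0):
    """Randomly drop majority-class rows until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [r for r in rows if label_of(r) == 1]
    neg = [r for r in rows if label_of(r) == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def predictive_values(labels, predictions):
    """Return (positive predictive value, negative predictive value)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    npv = tn / (tn + fn) if tn + fn else float("nan")
    return ppv, npv


# Toy data invented for illustration: (child id, died) with 2 deaths out of 10.
toy_rows = [(i, 1 if i < 2 else 0) for i in range(10)]
balanced = downsample(toy_rows, label_of=lambda row: row[1])

# Toy test-set labels and predicted probabilities of death.
labels = [1, 1, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.2, 0.1]
predictions = [1 if s >= 0.35 else 0 for s in scores]  # assumed threshold

print(len(balanced), auc(labels, scores), predictive_values(labels, predictions))
```

Down-sampling is applied to the training data only; the held-out test set keeps its original class balance so that the predictive values reflect how rare deaths actually are.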