This study draws on data from the 2016 Ethiopian Demographic and Health Survey (EDHS), the most recent round of a demographic and health survey series that is conducted every five years. The EDHS is a nationally representative household survey that collects data on a wide range of population, health and nutrition indicators with the aim of improving maternal and child health in Ethiopia. The survey used a multi-stage stratified sampling technique based on the 2007 National Population and Housing Census of Ethiopia to select respondents from a total of 624 clusters (187 urban and 437 rural). A total of 10,641 children under age 5 of mothers selected from 645 clusters were included in this study. Mortality was ascertained from retrospective information that mothers provided about children who died before age five in the five years preceding the survey (2011 to 2016).
In this study, the outcome variable, under-five mortality, was measured in two ways to suit the three models used. For the logistic regression and random forest models, the outcome was a binary indicator of whether the child was alive (coded as 0) or dead (coded as 1). For the Cox proportional hazards model, under-five mortality was defined as the death of a child at any time from birth through 59 months of life.
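The two outcome codings can be illustrated with a minimal sketch. The study's analysis was carried out in R; the Python below is only an illustration, and the argument names (`alive`, `age_at_death_months`, `age_at_interview_months`) are hypothetical rather than the EDHS variable names. Treating surviving children as censored at their age at interview (capped at 59 months) is an assumption, since the text does not describe the censoring rule.

```python
from typing import NamedTuple, Optional

MAX_FOLLOW_UP_MONTHS = 59  # under-five window: birth through 59 months


class Outcome(NamedTuple):
    died: int         # binary outcome for logistic regression / random forest
    time_months: int  # follow-up time for the Cox model
    event: int        # 1 = death observed in the window, 0 = censored


def code_outcome(alive: bool,
                 age_at_death_months: Optional[int],
                 age_at_interview_months: Optional[int]) -> Outcome:
    """Derive both outcome codings for one child (hypothetical field names)."""
    if not alive:
        if age_at_death_months is None:
            raise ValueError("age at death is required for a deceased child")
        if age_at_death_months <= MAX_FOLLOW_UP_MONTHS:
            return Outcome(died=1, time_months=age_at_death_months, event=1)
        # Death after 59 months falls outside the under-five window.
        return Outcome(died=0, time_months=MAX_FOLLOW_UP_MONTHS, event=0)
    if age_at_interview_months is None:
        raise ValueError("age at interview is required for a living child")
    # Assumption: living children are censored at their current age.
    follow_up = min(age_at_interview_months, MAX_FOLLOW_UP_MONTHS)
    return Outcome(died=0, time_months=follow_up, event=0)
```

The `died` field corresponds to the 0/1 coding used by the classifiers, while the `(time_months, event)` pair is the form a survival model typically expects.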
The predictors used in this study included community, household, individual and health services factors. The community factors comprised residence type (urban/rural) and geographical region (Tigray, Afar, Amhara, Oromia, Somali, Benishangul-Gumuz, Southern Nations, Nationalities, and Peoples' Region (SNNPR), Gambella, Harari, Dire Dawa, and Addis Ababa). The household factors were the source of drinking water (improved/unimproved), toilet facility (improved/unimproved) and household wealth (poor, middle, rich). The individual-level factors consisted of maternal and child characteristics. Maternal factors included mother’s age at birth (<20, 20–29, 30–39, 40–49), education (no education, primary, secondary/higher), contraceptive use (Yes/No) and mother’s nutritional status measured by her body mass index (BMI) (underweight/overweight and normal). Child factors were the sex of the child, the delivery services of the health facility (facility with Cesarean Section (CS) services, facility without CS, home), birth order (1–2, 3 or later) and previous birth interval (<2, 2–4, >4 years). The health services factors included the desire for the previous pregnancy (child wanted then, wanted later, not wanted at all), antenatal visits (0, 1–4, 5+ visits), and postnatal visits within two months after delivery (Yes/No). These predictor variables were selected on the basis of the existing literature on the subject.
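As an illustration of how the categorical predictors above might be derived from raw values, the following Python sketch bins three of them. It is not the study's R code: the function names are hypothetical, the birth interval is assumed to be recorded in months, and the boundary handling (24 to 48 months counted as "2-4") is an assumption because the text gives only the category labels.

```python
from typing import Optional


def mother_age_group(age_years: int) -> str:
    """Mother's age at birth: <20, 20-29, 30-39, 40-49."""
    if age_years < 20:
        return "<20"
    if age_years < 30:
        return "20-29"
    if age_years < 40:
        return "30-39"
    return "40-49"


def antenatal_group(visits: int) -> str:
    """Antenatal visits: 0, 1-4, 5+."""
    if visits == 0:
        return "0"
    if visits <= 4:
        return "1-4"
    return "5+"


def birth_interval_group(interval_months: Optional[int]) -> Optional[str]:
    """Previous birth interval in years: <2, 2-4, >4 (input assumed in months).

    First births have no previous interval; this sketch leaves them as None
    because the text does not say how they were handled.
    """
    if interval_months is None:
        return None
    if interval_months < 24:
        return "<2"
    if interval_months <= 48:
        return "2-4"
    return ">4"
```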
The R programming language (version 3.6.0) was used for data processing and analysis. We first produced a map of crude under-five mortality rates by region to document the regional disparities in under-five mortality in Ethiopia. To do so, we aggregated the number of under-five deaths by region and merged the regional totals with an Ethiopian regional shapefile before mapping them.
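The aggregate-then-merge step can be sketched as follows. This is a Python illustration rather than the R code used in the study. It assumes that the crude rate is expressed as deaths per 1,000 sampled children and ignores sampling weights, neither of which is specified in the text. The function names, the `u5mr_per_1000` field, and the representation of shapefile attributes as a list of dictionaries are all hypothetical.

```python
from collections import defaultdict


def crude_u5_mortality_by_region(records):
    """Aggregate (region, died) pairs into deaths per 1,000 children.

    `died` is 0 or 1. The rate is unweighted, which is an assumption.
    """
    deaths = defaultdict(int)
    children = defaultdict(int)
    for region, died in records:
        children[region] += 1
        deaths[region] += died
    return {r: 1000.0 * deaths[r] / children[r] for r in children}


def attach_rates(shape_rows, rates, key="region"):
    """Left-join the rates onto shapefile attribute rows by region name.

    Regions with no matching rate get None, so mismatched names are visible.
    """
    return [dict(row, u5mr_per_1000=rates.get(row[key])) for row in shape_rows]


# Toy usage with made-up records (not EDHS data).
toy = [("Afar", 1), ("Afar", 0), ("Afar", 0), ("Afar", 0),
       ("Tigray", 0), ("Tigray", 0)]
toy_rates = crude_u5_mortality_by_region(toy)
toy_map_rows = attach_rates([{"region": "Afar"}, {"region": "Harari"}], toy_rates)
```

Joining on the region name means that spelling differences between the survey and the shapefile (for example "Benishangul-Gumuz") have to be reconciled before the merge.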
We also used three widely accepted machine learning algorithms (logistic regression, random forest, and the Cox proportional hazards model) to predict under-five mortality in Ethiopia. Machine learning techniques build models from previous observations, and these models can then be used to make predictions for new data. The fitted model is thus the result of a learning process that extracts useful information about the data-generating process behind the previous observations. Machine learning offers a cheaper, faster, and better way of predicting population health problems, and it is considered to be the most prominent application of artificial intelligence technology for ensuring good health and social care for an entire population through preventive strategies and protection from diseases.
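For reference, the two parametric models named above have the following standard textbook forms. The notation is introduced here for illustration and is not taken from the study: $Y_i$ is the binary mortality indicator for child $i$, $\mathbf{x}_i$ the vector of predictors, $h_0(t)$ an unspecified baseline hazard, and $\boldsymbol{\beta}$, $\boldsymbol{\gamma}$ the coefficient vectors. The random forest has no comparable closed form, since it averages the predictions of many decision trees.

```latex
\begin{align}
\log\frac{\Pr(Y_i = 1 \mid \mathbf{x}_i)}{1 - \Pr(Y_i = 1 \mid \mathbf{x}_i)}
  &= \beta_0 + \mathbf{x}_i^{\top}\boldsymbol{\beta}
  && \text{(logistic regression)} \\
h(t \mid \mathbf{x}_i)
  &= h_0(t)\,\exp\!\left(\mathbf{x}_i^{\top}\boldsymbol{\gamma}\right),
  \quad 0 \le t \le 59 \text{ months}
  && \text{(Cox proportional hazards)}
\end{align}
```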
The models were trained and tested using a set of features extracted from the 2016 Ethiopian Demographic and Health Survey datasets. Algorithms trained and tested on nationally representative data of this kind can identify children at risk of under-five mortality more accurately. In all experiments, performance was measured on a 30% random sample held out as test data, which was not used in cross-validation or model selection. On the remaining 70% of the data, 10-fold cross-validation was used to tune the model parameters. The performance of the algorithms was evaluated using precision, the Receiver Operating Characteristic (ROC) curve, and the Area Under the Curve (AUC), which together give a sound basis for validating the results.
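The evaluation protocol described above can be sketched in a few standard-library functions. This is a Python illustration, not the study's R code. The function names, the fixed seed, and the absence of stratification are assumptions, and the AUC is computed through its rank interpretation (the probability that a randomly chosen death receives a higher score than a randomly chosen survivor) rather than by integrating the ROC curve.

```python
import random


def train_test_split_indices(n, test_fraction=0.30, seed=0):
    """Shuffle row indices and hold out a random test_fraction as test data."""
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    n_test = round(n * test_fraction)
    return idx[n_test:], idx[:n_test]  # (train, test)


def k_fold_indices(train_idx, k=10):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [train_idx[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield fit, folds[i]


def precision(y_true, y_pred):
    """Share of predicted deaths (1) that are true deaths."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fp) if (tp + fp) else 0.0


def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count half.

    Quadratic in the sample size, which is acceptable for a sketch.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Split sizes for the study's sample of 10,641 children.
train_idx, test_idx = train_test_split_indices(10641)
cv_folds = list(k_fold_indices(train_idx, k=10))
```

Because the test indices are drawn before any cross-validation takes place, parameter tuning on the ten folds never touches the held-out 30%, which matches the separation described in the text.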