We first describe the Malawi Service Provision Assessment (SPA) (6) dataset, followed by the methods for the development and evaluation of BN models and the comparison of statistical models. Finally, we describe the details of the decision tree that we developed for decision analysis.
2.1 The SPA Dataset
The SPA survey was conducted between July 2013 and February 2014 by the Ministry of Health of Malawi, with support from the Demographic and Health Surveys (DHS) Program, to assess the status of health facilities and quality of healthcare in Malawi. Data were collected from 1,060 facilities comprising 97 hospitals, 489 health centers, 55 dispensaries, 369 clinics, and 28 health posts across three major regions in the country, and are representative at the national level by facility type and managing authority (6). These data have been used previously in studies to assess the quality of care and treatment for pneumonia in Malawi (18) and are publicly available from the DHS Program (19).
The survey dataset contains observations on 3,441 encounters with children aged 2 to 59 months presenting to an outpatient healthcare facility. For each encounter, the data contain demographic details (age, date of birth, and sex), clinical features (duration of illness, fever, diarrhea, anemia, etc.), the mRDT result (if available), and the provider’s diagnosis.
2.2 Data Preprocessing
We assumed the result of the mRDT that is recorded in the dataset to be the gold standard malaria diagnosis. The mRDT has high sensitivity (0.997) and specificity (0.995) for the diagnosis of malaria (20) and is recommended for confirmation of disease by both the WHO and Malawi’s malaria management guidelines (21). Thus, if the mRDT result was positive, we considered malaria to be present, and if the test result was negative, we considered malaria to be absent. This variable is referred to as ‘malaria’ or ‘malaria diagnosis’ in the following sections.
Ideally, an mRDT result would be available for every encounter in the dataset, but this is not the case. Of the 3,441 encounters, an mRDT result was recorded for only 1,139, and we restricted our analyses to these encounters. Table 1 lists the variables that we selected for modeling. These variables were chosen based on their inclusion in childhood illness management guidelines (2) as well as on expert domain knowledge. Two of the variables are continuous (age and duration of illness) and the remaining variables are categorical. We discretized the continuous variables because the BN algorithms we used are designed for discrete variables. We discretized age in months (< 2, 2–12, 13–24, 25–60, > 60) based on the varying epidemiology of the disease in children of different ages. We discretized the duration of illness by number of days as shown in Table 1. Every predictor variable had one or more missing values, and we denoted them with a special value called ‘Unknown’. The target variable, malaria, is binary, taking the values ‘Positive’ or ‘Negative’ that represent the mRDT result.
Table 1: Variables and values that were included in the models.
| Variable | Values |
|---|---|
| Target variable | |
| Malaria | Present; Absent |
| Predictor variables | |
| Age | Less than 2 months; 2–12 months; 13–24 months; 25–60 months; Over 60 months; Unknown |
| Duration of Illness | Less than or equal to 2 days; 3–15 days; 16–30 days; Over 30 days; Unknown |
| Conscious | Yes; No; Unknown |
| Anemia | Present; Absent; Unknown |
| Convulsions | |
| Cough or Difficulty Breathing (CDB) | |
| Diarrhea | |
| History of Fever | |
| Fever (temperature > 37.5 °C) | |
| Lethargy | |
| Malnutrition | |
| Unable to Feed | |
| Vomiting | |

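The discretization described above can be sketched with pandas as follows. This is a minimal illustration, not the authors' preprocessing code: the column names `age_months` and `illness_days` and the toy values are hypothetical, and the bin edges assume that age and duration are recorded as whole months and whole days.

```python
import numpy as np
import pandas as pd

# Hypothetical column names and toy values; not taken from the SPA files.
encounters = pd.DataFrame({
    "age_months": [1, 2, 12, 13, 30, 61, np.nan],
    "illness_days": [1, 3, 15, 16, 45, np.nan, 2],
})

# Left-closed bins [lo, hi); assumes whole months and whole days.
AGE_EDGES = [-np.inf, 2, 13, 25, 61, np.inf]
AGE_LABELS = ["Less than 2 months", "2-12 months", "13-24 months",
              "25-60 months", "Over 60 months"]
DURATION_EDGES = [-np.inf, 3, 16, 31, np.inf]
DURATION_LABELS = ["Less than or equal to 2 days", "3-15 days",
                   "16-30 days", "Over 30 days"]


def discretize(values, edges, labels):
    """Bin a numeric column and map missing values to 'Unknown'."""
    binned = pd.cut(values, bins=edges, labels=labels, right=False)
    return binned.cat.add_categories("Unknown").fillna("Unknown")


encounters["age_group"] = discretize(
    encounters["age_months"], AGE_EDGES, AGE_LABELS)
encounters["duration_group"] = discretize(
    encounters["illness_days"], DURATION_EDGES, DURATION_LABELS)
print(encounters)
```

Treating ‘Unknown’ as an ordinary category keeps every encounter in the dataset instead of dropping rows with missing predictors.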

We randomly split the dataset into 80% training dataset and 20% test dataset, stratified on the target variable, which is the malaria diagnosis. The models were developed using the training dataset and were evaluated on the test dataset.
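A stratified 80/20 split of this kind can be sketched with scikit-learn's `train_test_split`. The text does not state which tool was used for the split, so the function choice, the random seed, and the synthetic data (including its arbitrary malaria prevalence) are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the 1,139 encounters with an mRDT result.
# The 30% prevalence is arbitrary and not taken from the SPA data.
data = pd.DataFrame({
    "fever": rng.choice(["Present", "Absent", "Unknown"], size=1139),
    "malaria": rng.choice(["Present", "Absent"], size=1139, p=[0.3, 0.7]),
})

# Stratifying on the target keeps malaria prevalence nearly identical
# in the training and test partitions.
train, test = train_test_split(
    data, test_size=0.2, stratify=data["malaria"], random_state=42)

print(len(train), len(test))
print(train["malaria"].eq("Present").mean(),
      test["malaria"].eq("Present").mean())
```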
2.3 Bayesian Network Models
A BN model is a probabilistic graphical model that is specified by a graphical structure and a set of numerical parameters. The graphical structure consists of nodes representing variables and arcs denoting associations between pairs of variables. In this paper, we use nodes and variables interchangeably. Each node in the network has an accompanying conditional probability table that constitutes the parameters of the node. A BN model can be used as a classifier where the model provides the posterior probability distribution of a target node (such as a disease diagnosis) given the values of all other nodes (such as clinical features) in the network (22). Several approaches are available to construct a BN model. In the first approach, both the structure and the parameters are specified manually using expert knowledge. In the second approach, the structure is specified manually and the parameters are estimated from data. In a third approach, both the structure and parameters are automatically estimated from data; a variety of algorithms have been developed to automatically derive BN models in this way. In this study, we used the second and third approaches to develop two BN models for prediction of malaria using the GeNIe Modeler tool (33). For the first model (manual model) we manually specified the structure based on domain knowledge and for the second model (hybrid model) we automatically derived the structure and subsequently modified it using expert knowledge. Using GeNIe Modeler, we computed the parameters of each node from the dataset by estimating the conditional probability distribution of the node given the values of its parent nodes (14).
Using the GeNIe Modeler tool, we developed manual and hybrid BN models to predict malaria from the predictor variables listed in Table 1. Based on domain knowledge of malaria from experts and the literature, for the manual model we modeled clinical features as conditionally independent of each other given malaria. Specifically, a clinical feature that was a symptom or a sign was represented as a child of the malaria node to create a Naïve Bayes structure. A feature that was not a sign or a symptom was represented as a parent of the malaria node. For example, a sign such as convulsions was represented as a child of malaria with the arc directed from malaria to convulsions. This encodes clinical knowledge that malaria can cause convulsions. As another example, age was represented as a parent of malaria with the arc directed from age to malaria. This denotes knowledge that younger children may be more vulnerable to contracting malaria than older children. In a Naïve Bayes disease model, each sign or symptom node has a single incoming arc from the disease node with no arcs among them. We used the GeNIe Modeler to estimate the parameters of the manual model as conditional probabilities from the training dataset.
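To make the structure of the manual model concrete, the sketch below computes the posterior probability of malaria for a network in which age is a parent of malaria and each sign or symptom is a child. It is an illustration in plain Python rather than GeNIe Modeler, and every probability in the tables is a made-up placeholder, not an estimate from the SPA data.

```python
# Hypothetical conditional probability tables (placeholders only).
# P(malaria = Present | age group)
P_MALARIA_GIVEN_AGE = {"2-12 months": 0.40, "25-60 months": 0.25}

# P(feature value | malaria state), one table per child node.
P_FEATURE_GIVEN_MALARIA = {
    "fever": {
        "Present": {"Present": 0.80, "Absent": 0.20},
        "Absent": {"Present": 0.30, "Absent": 0.70},
    },
    "convulsions": {
        "Present": {"Present": 0.10, "Absent": 0.90},
        "Absent": {"Present": 0.02, "Absent": 0.98},
    },
}


def posterior_malaria(age_group, findings):
    """Return P(malaria = Present | age, findings).

    Uses the factorization P(malaria | age) * prod_i P(feature_i | malaria)
    and normalizes over the two malaria states. Features that are not
    observed are simply left out of the product.
    """
    prior = P_MALARIA_GIVEN_AGE[age_group]
    score = {"Present": prior, "Absent": 1.0 - prior}
    for feature, value in findings.items():
        for state in score:
            score[state] *= P_FEATURE_GIVEN_MALARIA[feature][state][value]
    return score["Present"] / (score["Present"] + score["Absent"])


p = posterior_malaria("2-12 months",
                      {"fever": "Present", "convulsions": "Absent"})
print(round(p, 3))
```

Because the child nodes are conditionally independent given malaria, each observed finding contributes one multiplicative factor, which is what keeps the model easy to inspect.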
While the manual model is simple and interpretable, the conditional independence assumption is overly simplistic. Hence, we developed a hybrid model in which we first derived automatically an Augmented Naïve Bayes (ANB) model using GeNIe Modeler and then modified it guided by expert knowledge. The ANB algorithm in GeNIe Modeler enables efficient learning of an ANB model (see description below) from data. Each arc in the ANB model was then evaluated by an expert for clinical plausibility. We also used the GeNIe Modeler to estimate the parameters of the hybrid model as conditional probabilities from the training dataset.
The ANB model (14) extends the Naïve Bayes model by allowing arcs among child nodes. For example, in a Naïve Bayes model, diarrhea and convulsions are linked only by incoming arcs from the malaria node while in the ANB model, an additional arc may be included from diarrhea to convulsions that implies convulsions are associated with both malaria and diarrhea.
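In terms of the joint distribution, the difference between the two structures for this example can be written as follows, where M, D, and C are shorthand (introduced here for illustration) for malaria, diarrhea, and convulsions, and all other nodes are omitted:

```latex
% Naive Bayes: D and C each depend only on M
P(M, D, C) = P(M)\, P(D \mid M)\, P(C \mid M)

% ANB with an added arc from D to C
P(M, D, C) = P(M)\, P(D \mid M)\, P(C \mid M, D)
```

The only change is that the conditional probability table for convulsions is indexed by both malaria and diarrhea.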
2.4 Comparison Models
For comparison with the diagnostic predictions of the BN models, we derived several commonly used statistical models, including logistic regression and random forests, to predict malaria. Instead of discretizing the continuous variables (age and duration of illness), we scaled them so that the values had unit variance; when a variable had missing values, we imputed them with the mean of its non-missing values. We treated the categorical variables in the same way as for the BN models. We trained the models using the scikit-learn library (23) in Python.
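One way to set up such comparison models in scikit-learn is sketched below. The column names, the synthetic data, the hyperparameters, the use of one-hot encoding for the categorical variables, and the order of imputation and scaling are all assumptions for illustration; the text does not specify them. Note also that `StandardScaler` centers the values in addition to scaling them to unit variance.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

NUMERIC = ["age_months", "illness_days"]   # hypothetical column names
CATEGORICAL = ["fever", "vomiting"]        # subset of Table 1, for brevity


def make_model(estimator):
    """Mean-impute and scale numeric columns, one-hot encode the rest."""
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric_steps, NUMERIC),
        ("cat", OneHotEncoder(handle_unknown="ignore"), CATEGORICAL),
    ])
    return Pipeline([("preprocess", preprocess), ("model", estimator)])


# Synthetic training data with some missing ages (illustration only).
rng = np.random.default_rng(0)
n = 200
age = rng.integers(2, 60, n).astype(float)
age[::10] = np.nan
X = pd.DataFrame({
    "age_months": age,
    "illness_days": rng.integers(1, 30, n).astype(float),
    "fever": rng.choice(["Present", "Absent", "Unknown"], n),
    "vomiting": rng.choice(["Present", "Absent", "Unknown"], n),
})
y = rng.choice(["Present", "Absent"], n)

models = {
    "logistic_regression": make_model(LogisticRegression(max_iter=1000)),
    "random_forest": make_model(
        RandomForestClassifier(n_estimators=100, random_state=0)),
}

probabilities = {}
for name, model in models.items():
    model.fit(X, y)
    classes = list(model.named_steps["model"].classes_)
    # Probability assigned to the 'Present' class for each encounter.
    probabilities[name] = model.predict_proba(X)[:, classes.index("Present")]
```

Keeping ‘Unknown’ as its own category mirrors the handling of missing categorical values in the BN models.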
2.5 Evaluation
We applied the BN, logistic regression, and random forest models to predict the probability of malaria in the test dataset. Using these predictions, we computed the area under the Receiver Operating Characteristic curve (AUC) and the Brier score for each model. The AUC value indicates the diagnostic discrimination performance of the model, where perfect performance has an AUC of 1. The Brier score measures both the calibration and discrimination of probabilistic predictions and ranges from 0 to 1 with a score of 0 indicating perfect predictive performance. We converted the probability into a binary prediction of malaria present or absent by using a probability threshold of 0.5. With the binary predictions, we computed accuracy, sensitivity, specificity, precision, and F1 score for each model.
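These metrics can be computed from predicted probabilities as in the sketch below, which uses scikit-learn's metric functions. The helper name `evaluate`, the toy labels and probabilities, and the choice to treat a probability of exactly 0.5 as a positive prediction are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, brier_score_loss,
                             confusion_matrix, f1_score, precision_score,
                             roc_auc_score)


def evaluate(y_true, p_malaria, threshold=0.5):
    """Compute discrimination, calibration, and threshold-based metrics.

    y_true: 1 if malaria is present, 0 if absent.
    p_malaria: predicted probability that malaria is present.
    """
    y_true = np.asarray(y_true)
    p_malaria = np.asarray(p_malaria, dtype=float)
    # Assumption: probabilities at or above the threshold count as positive.
    y_pred = (p_malaria >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "auc": roc_auc_score(y_true, p_malaria),
        "brier": brier_score_loss(y_true, p_malaria),
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# Toy example (not study data).
metrics = evaluate([1, 0, 1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7, 0.4, 0.1])
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The AUC and Brier score use the probabilities directly, whereas the remaining metrics depend on the chosen threshold.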
2.6 Decision Tree
To conserve the use of mRDT in a resource-constrained setting such as a rural health post in Malawi, we developed a decision tree to compare the consequences of using and not using the mRDT. The decision tree integrates the probability of having malaria with the costs of testing and treatment, and identifies the optimal decision (relative to a set of probabilities and utilities), namely whether or not to use the mRDT, for a specific patient.
Figure 1: Illustrative decision tree that integrates predictions from BN models with example costs. Malaria+ and malaria− represent malaria present and absent, respectively, F refers to the clinical features of the patient, and C is the associated cost.
The decision tree that we developed is shown in Figure 1 and uses a common approach to model sequential decisions (24). The decision is driven by the expected costs of testing and treatment, which are denoted by the ‘mRDT?’ and ‘Treat?’ nodes. We calculated the expected cost of the [mRDT?=no] branch using the probability of malaria from the BN model and the costs associated with each decision as
In Figure 1, P(malaria+ | F) is the probability from the BN model that malaria is present given the clinical features of the patient, and P(malaria− | F) is the probability from the BN model that malaria is absent given the features. The costs (shown in the hexagons) in the decision tree are from the perspective of a payer of healthcare costs, such as the government of Malawi, and depend on the resources used, including mRDT and ACT drugs. We used the following costs based on the literature: an mRDT costs US $0.60 (8) and a course of ACT for uncomplicated malaria costs US $1.00 (25). We estimated the cost of mistakenly not treating a child with malaria at US $16.60, based on the assumption that the cost may rise to 10 times the cost of mRDT and ACT drugs for uncomplicated malaria if the untreated disease becomes severe and results in a hospital admission.
We computed the expected cost of the [mRDT?=yes] branch as
where
and
In the above equations, P(malaria+ | mRDT+, F) is the probability of malaria being present given that the mRDT result is positive and the clinical features of the patient, and P(malaria− | mRDT−, F) is the probability of malaria being absent given that the mRDT result is negative and the clinical features of the patient. P(mRDT+ | F) and P(mRDT− | F) represent the probabilities of the mRDT being positive or negative, respectively, given the clinical features of the patient. These probabilities are obtained from the BN model and are assumed to be equal to P(malaria+ | F) and P(malaria− | F), respectively. We have made the assumption that a positive result on the mRDT is equivalent to the child having malaria.
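The roll-back of the two branches can be sketched in Python as follows. The three unit costs are those stated in the text, but the terminal payoffs are assumptions made for this illustration and may differ from the values in Figure 1: treating costs one ACT course whether or not malaria is present, not treating costs US $16.60 if malaria is present and nothing otherwise, and the mRDT cost is added to every outcome in the testing branch. The function names are hypothetical.

```python
COST_MRDT = 0.60        # cost of one mRDT (from the text)
COST_ACT = 1.00         # cost of one ACT course (from the text)
COST_UNTREATED = 16.60  # cost of untreated malaria (from the text)


def treat_node_cost(p_malaria):
    """Expected cost at a 'Treat?' node, taking the cheaper option.

    Assumed payoffs: treating costs COST_ACT regardless of disease status;
    not treating costs COST_UNTREATED only if malaria is present.
    """
    cost_if_treat = COST_ACT
    cost_if_no_treat = p_malaria * COST_UNTREATED
    return min(cost_if_treat, cost_if_no_treat)


def expected_cost_no_test(p_malaria):
    """Expected cost of the [mRDT?=no] branch given P(malaria+ | F)."""
    return treat_node_cost(p_malaria)


def expected_cost_test(p_positive, p_mal_given_pos=1.0, p_mal_given_neg=0.0):
    """Expected cost of the [mRDT?=yes] branch.

    p_positive is P(mRDT+ | F). The defaults encode the stated assumption
    that a positive mRDT is equivalent to the child having malaria.
    """
    after_positive = treat_node_cost(p_mal_given_pos)
    after_negative = treat_node_cost(p_mal_given_neg)
    return (COST_MRDT
            + p_positive * after_positive
            + (1.0 - p_positive) * after_negative)


p = 0.2  # example value of P(malaria+ | F) = P(mRDT+ | F)
print(expected_cost_no_test(p), expected_cost_test(p))
```

Under these assumed payoffs, the branch with the lower expected cost is the preferred strategy for a given patient.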
We also performed a sensitivity analysis to determine how dependent the strategy selection is on the probability of malaria. We varied the probability of malaria, P(malaria+ | F) or P(mRDT+ | F), from 0 to 1 and calculated the expected costs of using and not using the mRDT to determine the probability ranges in which a child may be treated based on clinical features alone without performing the mRDT.
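A one-way sensitivity analysis of this kind can be sketched by sweeping the probability over a grid and comparing the two expected costs at each point. The unit costs come from the text; the payoff structure (treating costs one ACT course regardless of disease status, not treating costs US $16.60 only when malaria is present, and a positive mRDT always leads to treatment) is an assumption of this sketch, so the resulting ranges illustrate the method and are not the study's results.

```python
import numpy as np

COST_MRDT, COST_ACT, COST_UNTREATED = 0.60, 1.00, 16.60  # from the text


def cost_no_test(p):
    """Expected cost without the mRDT: cheaper of treating and not treating."""
    return np.minimum(COST_ACT, p * COST_UNTREATED)


def cost_test(p):
    """Expected cost with the mRDT, treating only children who test positive."""
    return COST_MRDT + p * COST_ACT


p_grid = np.linspace(0.0, 1.0, 1001)
skip_test = cost_no_test(p_grid) < cost_test(p_grid)
treat_is_cheaper = p_grid * COST_UNTREATED >= COST_ACT

# Probabilities at which not testing is cheaper, split by the implied action.
no_test_no_treat = p_grid[skip_test & ~treat_is_cheaper]
no_test_treat = p_grid[skip_test & treat_is_cheaper]

print("Withhold test and treatment up to p =", no_test_no_treat.max())
print("Treat without testing from p =", no_test_treat.min())
```

Changing any of the three cost constants and rerunning the sweep shows how the decision boundaries move with the assumed costs.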