We first describe the Malawi Service Provision Assessment (SPA) dataset (6), followed by the methods for developing and evaluating the BN models and for comparing them with other statistical models. Finally, we describe the decision tree that we developed for decision analysis.
2.1 The SPA Dataset
The SPA survey was conducted between July 2013 and February 2014 by the Ministry of Health of Malawi, with support from the Demographic and Health Surveys (DHS) Program, to assess the status of health facilities and the quality of healthcare in Malawi. Data were collected from 1,060 facilities comprising 97 hospitals, 489 health centers, 55 dispensaries, 369 clinics, and 28 health posts across three major regions in the country, and are representative at the national level by facility type and managing authority (6). These data have been used previously in studies to assess the quality of care and treatment for pneumonia in Malawi (24) and are freely available from the DHS Program (25).
The survey dataset contains observations on 3,441 encounters with children aged 2 to 59 months presenting to an outpatient healthcare facility. For each encounter, the dataset contains demographic details (age, date of birth, and sex), clinical features (duration of illness, fever, diarrhea, anemia, etc.), the mRDT result (if available), and the provider’s diagnosis.
2.2 Data Preprocessing
We assumed the result of the mRDT that is recorded in the dataset to be the gold standard malaria diagnosis. The mRDT has high sensitivity and specificity (0.997 and 0.995 respectively) for the diagnosis of malaria (26) and is recommended for confirmation of disease by both the WHO and Malawi’s malaria management guidelines (27). Thus, if the mRDT result was positive, we considered malaria to be present, and if the test result was negative, we considered malaria to be absent. This variable is referred to as ‘malaria’ or ‘malaria diagnosis’ in the following sections.
Ideally, an mRDT result would be available for every encounter in the dataset, but this is not the case. Of the 3,441 encounters, an mRDT result was recorded for only 1,139, and we restricted our analyses to these encounters. Table 1 lists the variables that we selected for modeling. These variables were chosen based on their inclusion in childhood illness management guidelines (2) as well as on expert domain knowledge. Two of the variables are continuous (age and duration of illness) and the remaining variables are categorical. We discretized the continuous variables because the BN algorithms we used are designed for discrete variables. We discretized age in months (< 2, 2-12, 13-24, 25-60, > 60) based on the varying epidemiology of the disease in children of different ages. We discretized the duration of illness by number of days as shown in Table 1. Every predictor variable had one or more missing values, which we encoded with a special value called ‘Unknown’. The target variable, malaria, is binary, taking the values ‘Positive’ or ‘Negative’ that represent the mRDT result.
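Table 1: Variables and values that were included in the models.

| Variable | Values |
| --- | --- |
| Target variable | |
| Malaria | Present; Absent |
| Predictor variables | |
| Age | Less than 2 months; 2-12 months; 13-24 months; 25-60 months; Over 60 months; Unknown |
| Duration of Illness | Less than or equal to 2 days; 3-15 days; 16-30 days; Over 30 days; Unknown |
| Conscious | Yes; No; Unknown |
| Anemia | Present; Absent; Unknown |
| Convulsions | Present; Absent; Unknown |
| Cough or Difficulty Breathing (CDB) | Present; Absent; Unknown |
| Diarrhea | Present; Absent; Unknown |
| History of Fever | Present; Absent; Unknown |
| Fever (temperature > 37.5 °C) | Present; Absent; Unknown |
| Lethargy | Present; Absent; Unknown |
| Malnutrition | Present; Absent; Unknown |
| Unable to Feed | Present; Absent; Unknown |
| Vomiting | Present; Absent; Unknown |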
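The discretization described in Section 2.2 can be illustrated with a minimal sketch. The function names are hypothetical, and the handling of non-integer ages or durations at the bin boundaries is an assumption, since the text gives only the category labels.

```python
def discretize_age(months):
    """Map age in months to an age category from Table 1.

    None stands for a missing value (an assumed encoding of missingness).
    """
    if months is None:
        return "Unknown"
    if months < 2:
        return "Less than 2 months"
    if months <= 12:
        return "2-12 months"
    if months <= 24:
        return "13-24 months"
    if months <= 60:
        return "25-60 months"
    return "Over 60 months"


def discretize_duration(days):
    """Map duration of illness in days to a duration category from Table 1."""
    if days is None:
        return "Unknown"
    if days <= 2:
        return "Less than or equal to 2 days"
    if days <= 15:
        return "3-15 days"
    if days <= 30:
        return "16-30 days"
    return "Over 30 days"


def encode_categorical(value):
    """Replace a missing categorical value with the special value 'Unknown'."""
    return "Unknown" if value is None else value
```

For example, `discretize_age(18)` returns `"13-24 months"` and `encode_categorical(None)` returns `"Unknown"`.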
2.3 Bayesian Network Models
A BN model is a probabilistic graphical model that is specified by a graphical structure and a set of numerical parameters (14). The graphical structure consists of nodes representing variables and arcs denoting associations between pairs of variables. In this paper, we use the terms node and variable interchangeably. Each node in the network has an accompanying conditional probability table that constitutes the parameters of the node. A BN model can be used as a classifier, where the model provides the posterior probability distribution of a target node (such as a disease diagnosis) given the values of all other nodes (such as clinical features) in the network (23). Several approaches are available to construct a BN model. In the first approach, both the structure and the parameters are specified manually using expert knowledge. In the second approach, the structure is specified manually and the parameters are estimated from data. In the third approach, both the structure and the parameters are estimated automatically from data; a variety of algorithms have been developed to derive BN models in this way. In this study, we used the second and third approaches to develop two BN models for the prediction of malaria from the variables listed in Table 1, using the GeNIe Modeler tool (28). For the first model (manual model), we specified the structure manually based on domain knowledge. For the second model, we used the GeNIe Modeler to automatically derive the structure of a Tree Augmented Naïve Bayes model (described later). For both models, we used the GeNIe Modeler to compute the parameters of each node from the dataset by estimating the conditional probability distribution of the node given the values of its parent nodes (23).
Manual Model. Based on domain knowledge of malaria from experts and the literature, for the manual model, we modeled clinical features as conditionally independent of each other given malaria. Specifically, a clinical feature that was a symptom or a sign was represented as a child of the malaria node to create a Naïve Bayes structure. A feature that was not a sign or a symptom was represented as a parent of the malaria node. For example, a sign such as convulsions was represented as a child of malaria with the arc directed from malaria to convulsions. This encodes clinical knowledge that malaria can cause convulsions. As another example, age was represented as a parent of malaria with the arc directed from age to malaria. This denotes knowledge that younger children may be more vulnerable to contracting malaria than older children. In a Naïve Bayes disease model, each sign or symptom node has a single incoming arc from the disease node with no arcs among them.
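To make the structure of the manual model concrete, the following minimal sketch estimates the conditional probability tables by counting and computes the posterior probability of malaria as P(malaria | age) multiplied by the product of P(feature | malaria) over the child features, then normalized. It is not the GeNIe Modeler's procedure: the function names are hypothetical, only one parent (age) is modeled, and the additive smoothing constant `alpha` is an assumption introduced here to avoid zero probabilities.

```python
from collections import Counter


def fit_manual_bn(records, symptoms, target="malaria", parent="age", alpha=1.0):
    """Estimate CPTs by smoothed counting (alpha is an assumed smoothing prior)."""
    classes = sorted({r[target] for r in records})
    parent_values = sorted({r[parent] for r in records})

    # P(target | parent): the parent node points into the target node.
    joint = Counter((r[parent], r[target]) for r in records)
    prior = {}
    for a in parent_values:
        total = sum(joint[(a, c)] for c in classes) + alpha * len(classes)
        for c in classes:
            prior[(a, c)] = (joint[(a, c)] + alpha) / total

    # P(symptom | target): each symptom is a child of the target node.
    class_counts = Counter(r[target] for r in records)
    cpts = {}
    for s in symptoms:
        values = sorted({r[s] for r in records})
        counts = Counter((r[s], r[target]) for r in records)
        for c in classes:
            total = class_counts[c] + alpha * len(values)
            for v in values:
                cpts[(s, v, c)] = (counts[(v, c)] + alpha) / total

    return {"classes": classes, "prior": prior, "cpts": cpts,
            "symptoms": list(symptoms), "parent": parent}


def posterior(model, case):
    """Posterior over the target for one case.

    Values never seen during fitting raise KeyError in this sketch.
    """
    scores = {}
    for c in model["classes"]:
        p = model["prior"][(case[model["parent"]], c)]
        for s in model["symptoms"]:
            p *= model["cpts"][(s, case[s], c)]
        scores[c] = p
    z = sum(scores.values())
    return {c: p / z for c, p in scores.items()}
```

Because missing values are encoded as ‘Unknown’, they are treated here like any other category of a variable.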
Tree Augmented Naïve Bayes Model. While the manual model is simple and interpretable, the conditional independence assumption may be overly simplistic. Hence, we developed a second model by automatically deriving a Tree Augmented Naïve Bayes (TAN) model using the GeNIe Modeler. The TAN model extends the Naïve Bayes model by allowing arcs among child nodes (21). For example, in a Naïve Bayes model, diarrhea and vomiting are linked only through their incoming arcs from the malaria node, whereas in the TAN model an additional arc may be included from diarrhea to vomiting, which implies that vomiting is associated with both malaria and diarrhea. The TAN algorithm in the GeNIe Modeler enables efficient learning of both the structure and the parameters of a TAN model.
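One common way to learn the arcs among child nodes in a TAN model is to weight each pair of features by their conditional mutual information given the target and then keep a maximum-weight spanning tree over the features. The sketch below illustrates that general idea; whether the GeNIe Modeler follows exactly these steps is not stated in the text, and the function names and the choice of the first feature as the tree root are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import log


def conditional_mutual_information(records, x, y, c):
    """Empirical estimate of I(X; Y | C) in nats."""
    n = len(records)
    n_xyc = Counter((r[x], r[y], r[c]) for r in records)
    n_xc = Counter((r[x], r[c]) for r in records)
    n_yc = Counter((r[y], r[c]) for r in records)
    n_c = Counter(r[c] for r in records)
    total = 0.0
    for (vx, vy, vc), k in n_xyc.items():
        # p(x, y, c) * log( p(x, y | c) / (p(x | c) * p(y | c)) )
        total += (k / n) * log(k * n_c[vc] / (n_xc[(vx, vc)] * n_yc[(vy, vc)]))
    return total


def tan_feature_arcs(records, features, target):
    """Directed feature-to-feature arcs of a maximum spanning tree (Prim's algorithm).

    In the full TAN model the target is additionally a parent of every feature.
    """
    weight = {
        frozenset(pair): conditional_mutual_information(records, pair[0], pair[1], target)
        for pair in combinations(features, 2)
    }
    in_tree = [features[0]]  # the root choice is arbitrary (assumption)
    arcs = []
    while len(in_tree) < len(features):
        parent, child = max(
            ((u, v) for u in in_tree for v in features if v not in in_tree),
            key=lambda edge: weight[frozenset(edge)],
        )
        arcs.append((parent, child))
        in_tree.append(child)
    return arcs
```

Each feature other than the root ends up with exactly one feature parent in addition to the target, which keeps the conditional probability tables small.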
2.4 Comparison Models
For comparison with the diagnostic predictions of the BN models, we derived two commonly used statistical models, logistic regression and random forest, to predict malaria. Instead of discretizing the continuous variables, age and duration of illness, we scaled them so that the values had unit variance; when a variable had missing values, we imputed them as the mean of its non-missing values. We treated the categorical variables in the same way as for the BN models. We derived and evaluated the models using the scikit-learn library (29) in Python. The logistic regression model was derived using the L2 penalty, and the value of the regularization hyperparameter was chosen by a search over seven possible values (0.001, 0.01, 0.1, 1, 10, 100, 1000). The hyperparameters (and the values over which the search was performed) for the random forest model included the number of trees in the forest (100, 200, 500), the criterion for the split (“gini”, “entropy”), the maximum depth of a tree (4, 5, 6, 7, 8), and the number of features considered (square root of total features, log of total features, total features).
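The hyperparameter searches described above can be written down as scikit-learn parameter grids. The following is a sketch under several assumptions that the text does not confirm: the seven regularization values are mapped to scikit-learn's `C` parameter, "log of total features" is mapped to `"log2"`, and the selection metric (`roc_auc`) and the number of inner folds are illustrative choices. The helper names are hypothetical.

```python
# Candidate values taken from the text; the mapping to parameter names is assumed.
LOGISTIC_GRID = {"C": [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
FOREST_GRID = {
    "n_estimators": [100, 200, 500],
    "criterion": ["gini", "entropy"],
    "max_depth": [4, 5, 6, 7, 8],
    "max_features": ["sqrt", "log2", None],  # None means all features
}


def grid_size(grid):
    """Number of hyperparameter combinations in a grid."""
    size = 1
    for values in grid.values():
        size *= len(values)
    return size


def build_searches(inner_folds=5, scoring="roc_auc"):
    """Create grid searches; inner_folds and scoring are assumptions."""
    # Imported here so the grids can be inspected without scikit-learn installed.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    # LogisticRegression applies an L2 penalty by default.
    logistic = GridSearchCV(LogisticRegression(max_iter=1000), LOGISTIC_GRID,
                            scoring=scoring, cv=inner_folds)
    forest = GridSearchCV(RandomForestClassifier(random_state=0), FOREST_GRID,
                          scoring=scoring, cv=inner_folds)
    return {"logistic_regression": logistic, "random_forest": forest}
```

Calling `fit` on each search object with the training portion of a cross-validation iteration keeps the test fold out of hyperparameter selection.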
2.5 Derivation and Evaluation
We derived and evaluated the manual BN, TAN, logistic regression, and random forest models using 10-fold cross-validation. The dataset was divided into 10 folds, stratified on malaria diagnosis. Over 10 iterations, each fold was used as the test set in turn and the remaining folds were combined to form the training set. For the manual BN model, the structure was fixed across all iterations and we re-estimated the parameters from the training set in each iteration. For the TAN model, we re-estimated both the structure and the parameters from the training set in each iteration. For the logistic regression and random forest models, the hyperparameters were chosen using the training set in each iteration.
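Stratified fold assignment of the kind described above can be sketched in a few lines. The function names and the fixed random seed are assumptions for illustration; in practice a library routine such as scikit-learn's `StratifiedKFold` would typically be used.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each index to one of k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [None] * len(labels)
    offset = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin, continuing where the previous class stopped.
        for j, i in enumerate(indices):
            fold_of[i] = (j + offset) % k
        offset += len(indices)
    return fold_of


def cross_validation_splits(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) with each fold used once as the test set."""
    fold_of = stratified_folds(labels, k, seed)
    for fold in range(k):
        test = [i for i, f in enumerate(fold_of) if f == fold]
        train = [i for i, f in enumerate(fold_of) if f != fold]
        yield train, test
```

Model fitting, including any structure learning or hyperparameter search, would then use only the `train` indices of each split.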
During each iteration of cross-validation, we applied the models to predict the probability of malaria in the test set. Using these predictions, we computed the area under the Receiver Operating Characteristic curve (AUC). The AUC value indicates the diagnostic discrimination performance of the model, where perfect performance has an AUC of 1. Then, we converted the probability into a binary prediction of malaria present or absent using two probability thresholds: the default threshold of 0.5 and an optimal threshold obtained by maximizing the Youden Index. The threshold that maximizes the Youden Index is the threshold that optimizes the model’s performance when equal weight is given to sensitivity and specificity (30). With the binary predictions, we computed balanced accuracy (BAC), sensitivity, specificity, and the net reclassification improvement (NRI) at both thresholds. BAC is the average of sensitivity and specificity and is more useful than accuracy when the target classes are imbalanced. NRI quantifies how well a new model correctly reclassifies children with and without malaria compared to a baseline model (31). NRI is computed as:
(sensitivity of new model – sensitivity of baseline model) + (specificity of new model – specificity of baseline model).
For statistical comparisons, we used DeLong’s test to compare the AUCs of two models (32), the paired two-sample Wilcoxon test to compare the BACs of two models (33), and McNemar’s Chi-Square test to compare the sensitivities and specificities of two models (33).
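The evaluation measures defined in this section can be illustrated with a minimal sketch. It assumes labels coded as 1 (malaria present) and 0 (malaria absent) and a prediction rule of "positive when the probability is at or above the threshold"; the function names are hypothetical and the statistical tests are not covered.

```python
def sensitivity_specificity(y_true, y_prob, threshold):
    """Sensitivity and specificity when predicting positive if prob >= threshold."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    tn = sum(1 for y, p in pairs if y == 0 and p < threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_prob):
    """AUC as the probability that a positive case outranks a negative case."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def youden_threshold(y_true, y_prob):
    """Observed probability that maximizes sensitivity + specificity - 1."""
    def youden_index(t):
        sens, spec = sensitivity_specificity(y_true, y_prob, t)
        return sens + spec - 1
    return max(sorted(set(y_prob)), key=youden_index)


def balanced_accuracy(sens, spec):
    """Average of sensitivity and specificity."""
    return (sens + spec) / 2


def nri(new, baseline):
    """NRI from (sensitivity, specificity) pairs of a new and a baseline model."""
    return (new[0] - baseline[0]) + (new[1] - baseline[1])
```

The pairwise AUC computation is quadratic in the number of cases, which is acceptable for a dataset of this size but would be replaced by a rank-based computation for larger ones.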
2.6 Decision Tree Development
To conserve mRDTs in a resource-constrained setting such as a rural health post in Malawi, we developed a decision tree to compare the consequences of using and not using the mRDT. The decision tree integrates the probability of having malaria (obtained from a predictive model) with the costs of testing and treatment, and identifies the optimal decision (relative to a set of probabilities and utilities), that is, whether or not to use the mRDT, for a specific patient.
The decision tree that we developed is shown in Figure 1 and uses a common approach to model sequential decisions (34). The decision is driven by the expected costs of testing and treatment that are denoted by ‘mRDT?’ and ‘Treat?’ nodes. We calculated the expected cost of the [mRDT?=no] branch using the probability of malaria from a predictive model and costs associated with each decision as
In Figure 1, P(malaria+|F) is the probability that malaria is present given the clinical features of the patient. P(malaria-|F) is the probability that malaria is absent given the features. The costs (shown in the hexagons) in the decision tree are from the perspective of a payer of healthcare costs, such as the government of Malawi, and depend on the resources used, including mRDT and ACT drugs. We used the following costs based on the literature: an mRDT costs US $0.60 (8) and a course of ACT for uncomplicated malaria costs US $1.00 (35). We estimated the cost of mistakenly not treating a child with malaria at US $16.60 based on the assumption that the cost may go up to 10 times the cost of mRDT and ACT drugs for uncomplicated malaria if the untreated disease becomes severe, resulting in hospital admission.
We computed the expected cost of the [mRDT?=yes] branch as
In the above equations, P(malaria+|mRDT+, F) is the probability of malaria being present given that the mRDT result is positive and the clinical features of the patient, and P(malaria-|mRDT-, F) is the probability of malaria being absent given that the mRDT result is negative and the clinical features of the patient. P(mRDT+|F) and P(mRDT-|F) represent the probabilities of mRDT being positive or negative, respectively, given the clinical features of the patient. These probabilities are obtained from a model such as the manual BN model and are assumed to be equal to P(malaria+|F) and P(malaria-|F) respectively. We have assumed that a positive result on the mRDT is equivalent to the child having malaria.
We also performed a sensitivity analysis to determine how dependent the strategy selection is on the probability of malaria. We varied the probability of malaria, P(malaria+|F) or P(mRDT+|F), from 0 to 1 and calculated the expected costs of using and not using the mRDT to determine the probability ranges in which a child may be treated based on clinical features alone without performing mRDT.
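The comparison of expected costs and the sensitivity analysis can be sketched as follows. This is an illustration under stated assumptions, not a reproduction of the decision tree in Figure 1: it assumes that treating costs the price of an ACT course whether or not malaria is present, that not treating a child without malaria costs nothing, that a tested child is treated only when the mRDT is positive, and that the mRDT result is equivalent to the true malaria status (so that P(mRDT+|F) equals P(malaria+|F)). The unit costs are those given in the text; the function names are hypothetical.

```python
COST_MRDT = 0.60        # US$, one mRDT (from the text)
COST_ACT = 1.00         # US$, one ACT course (from the text)
COST_UNTREATED = 16.60  # US$, not treating a child who has malaria (from the text)


def expected_cost_without_mrdt(p):
    """Cheaper of treating or not treating without a test; p is P(malaria+|F)."""
    treat = COST_ACT               # assumed: same cost whether or not malaria is present
    no_treat = p * COST_UNTREATED  # assumed: no cost when malaria is absent
    return min(("treat", treat), ("do not treat", no_treat), key=lambda o: o[1])


def expected_cost_with_mrdt(p):
    """Test everyone, then treat only test-positive children (test assumed perfect)."""
    return COST_MRDT + p * COST_ACT


def preferred_strategy(p):
    """Strategy with the lowest expected cost for a given probability of malaria."""
    action, cost_no_test = expected_cost_without_mrdt(p)
    cost_test = expected_cost_with_mrdt(p)
    if cost_test < cost_no_test:
        return "mRDT", cost_test
    return f"no mRDT, {action}", cost_no_test


def sweep(steps=100):
    """Preferred strategy as the probability of malaria varies from 0 to 1."""
    return [(i / steps, *preferred_strategy(i / steps)) for i in range(steps + 1)]
```

Scanning the output of `sweep()` for the points at which the preferred strategy changes gives the probability ranges in which, under these assumptions, a child would be managed on clinical features alone.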