Can We Predict the Burden of Wasting in Crisis-Affected Countries? Findings from Somalia and South Sudan

Background: Sample surveys are the mainstay of surveillance for wasting in settings affected by crises, but are burdensome and have limited geographical coverage due to insecurity and other access issues. As a possible complement to surveys, we explored a statistical approach to predict the prevalent burden of wasting malnutrition for small population strata in two crisis-affected countries, Somalia (2014-2018) and South Sudan (2015-2018). Methods: For each country, we sourced datasets generated by humanitarian actors or other entities on insecurity, displacement, food insecurity, access to services, epidemic occurrence and other factors on the causal pathway to malnutrition. We merged these with datasets of sample household anthropometric surveys done at administrative level 3 (district, county) as part of nutritional surveillance, and, for each of several outcomes including binary and continuous indices based on either weight-for-height or middle-upper-arm circumference, fitted and evaluated the predictive performance of generalised linear models and, as an alternative, machine learning random forests. Results: We developed models based on 85 ground surveys in Somalia and 175 in South Sudan. Livelihood type, measles incidence, vegetation index and water price were important predictors in Somalia, and livelihood, rainfall and terms of trade (purchasing power) in South Sudan. However, both generalised linear models and random forests had low performance for both binary and continuous anthropometric outcomes. Conclusions: Predictive models had disappointing performance and are not usable for action. The range of data used and their quality probably limited our analysis. The predictive approach remains theoretically attractive, and deserves further evaluation with larger datasets across multiple settings.

Drawing from an a priori causal framework of factors leading to wasting (Additional file 1, Figure S5), we identified potential predictor variables collected at the desired resolution, and merged these with individual child-level data from SMART surveys designed to be representative of single strata. We fitted various candidate models to a training data subset, and evaluated their predictive accuracy on a validation data subset, as well as on cross-validation.

Study population and timeframe
For Somalia (including Somaliland and Puntland), we sourced predictor and anthropometric survey data from January 2014 to December 2018 inclusive.
During this period, Somalia's population rose from about 12.8M to 14.5M [10]. Surveys were done in 22 (29%) of Somalia's 75 districts. For South Sudan, the analysis spanned January 2015 to April 2018, and featured surveys from 63 (80%) of the country's 79 counties, as per 2013 administrative borders. South Sudan's population declined from 10.2M to 9.7M during the period, reflecting refugee movements to neighbouring countries [11].
Data sources
Anthropometric surveys. We accessed reports and raw datasets of 177 SMART surveys from South Sudan (two were excluded due to very unusual values, leaving 175 analysis-eligible), and 167 from Somalia (82 were excluded: 76, mainly done before 2016, were representative of livelihood zones rather than districts, and thus could not be coupled with predictor data; five appeared to have followed a non-representative sampling design; one had no available dataset, leaving 85 analysis-eligible). For each survey, we inspected the report to identify any possible bias sources and, in particular, any reported restriction of the effective sampling frame due to insecurity or inaccessibility (e.g. if a report stated that two out of 12 boma, South Sudan's administrative level 3 unit, could not be included in the sample, we approximated the sampling coverage as 10/12 ≈ 83%). We also rescaled the ENA software-reported quality score for the survey (a composite of several indicators including proportion of outlier values, digit preference and properties of the distribution of observed values, ranging from 0% = best to 50% = worst [12]) to a 0-100% range, where best = 100%. We reanalysed all surveys by converting the raw anthropometric readings (weight, height or length, age, middle-upper arm circumference or MUAC) into z-score indices as per the World Health Organization 2006 standardised anthropometric distributions using the anthro package in R, flagging and excluding all observations with missing values, values more than 5 z-scores from the mean and/or ages outside the allowed range (6-59mo). Lastly, we classified all children into severe wasting or wasting according to two alternative definitions: (i) bilateral oedema and/or weight-for-height (WHZ) < -3Z (severe wasting) or < -2Z (wasting); (ii) bilateral oedema and/or MUAC < 115mm (severe wasting) or < 125mm (wasting) [13].
We fitted generalised linear models (binomial for severe wasting and wasting, gaussian otherwise) with standard errors adjusted for cluster design to verify concordance with point estimates and 95% confidence intervals (CI) contained in the survey reports.
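The exclusion and classification rules above can be illustrated with a minimal sketch. It is written in Python for illustration only (the analysis itself was done in R) and assumes that z-scores have already been computed against the WHO 2006 standards. The `Child` record, the function names, the reading of "from the mean" as the reference mean of zero, and the treatment of the age range as 6 to under 60 months are all assumptions of this sketch rather than details given in the text.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Child:
    """Hypothetical child record; WHZ is assumed to be precomputed (WHO 2006)."""
    age_months: Optional[float]
    whz: Optional[float]        # weight-for-height z-score
    muac_mm: Optional[float]    # arm circumference in millimetres
    oedema: bool = False        # bilateral oedema


def is_eligible(c: Child, z_limit: float = 5.0) -> bool:
    """Exclude missing values, extreme z-scores and out-of-range ages."""
    if c.age_months is None or c.whz is None or c.muac_mm is None:
        return False
    if not (6 <= c.age_months < 60):   # assumed reading of "6-59mo"
        return False
    return abs(c.whz) <= z_limit       # assumed: distance from reference mean 0


def wasting_flags_whz(c: Child) -> Tuple[bool, bool]:
    """Definition (i): returns (severe wasting, wasting) based on WHZ or oedema."""
    return (c.oedema or c.whz < -3, c.oedema or c.whz < -2)


def wasting_flags_muac(c: Child) -> Tuple[bool, bool]:
    """Definition (ii): returns (severe wasting, wasting) based on MUAC or oedema."""
    return (c.oedema or c.muac_mm < 115, c.oedema or c.muac_mm < 125)
```

Under both definitions "wasting" includes severe cases, so each function returns two nested binary outcomes rather than mutually exclusive categories.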
Predictors. We developed a causal framework of wasting (Additional file 1, Figure S5) based on existing evidence and plausibility reasoning. We used this framework to identify factors potentially predicting the outcomes of interest. We searched for candidate predictor data representing these factors online and through contacts with humanitarian actors in both Somalia and South Sudan, the main desirable characteristics of datasets being stratification by stratum and month, and that data be generated routinely for programmatic purposes, i.e. realistically available without further primary data collection. Most datasets had already been sourced as part of similar projects to retrospectively estimate mortality in both countries [10,11]. Candidate predictors for both Somalia and South Sudan are detailed in Table 1 and Table 2, respectively. Each predictor dataset was subjected to data cleaning to remove obvious errors. We excluded predictors that were missing for ≥30% of strata or ≥30% of months. Remaining completeness problems were resolved through interpolation (humanitarian presence), manual imputation (missing market data points were attributed a weighted average of the geographically nearest market's value and the mean of all other non-missing markets, with 0.7 and 0.3 weights respectively) and automatic imputation using the mice R package [14] (water price, severe wasting and wasting treatment quality). To reduce stochastic noise in the time series, we computed three-month window rolling means for all time-varying predictors, and applied moderate local spline smoothing to terms of trade or market price variables. Where appropriate, we computed per-population rates using stratum-month population figures previously estimated as part of mortality estimation projects for each country.
Briefly, these combine available base estimates (census projections in South Sudan; quality-weighted averages of four alternative sources in Somalia), natural growth assumptions and data on refugee as well as internal displacement to and from each stratum, by month. While for both countries data on food security and nutritional therapeutic services were available (Table 1, Table 2) and moderately predictive (data not shown), we ultimately decided to exclude them as candidate predictors for two reasons: (i) we considered that improved prediction could plausibly result in better targeting of these humanitarian services, which in turn would result in improved nutrition, a reverse-causal effect whose future size the model might fail to predict; and (ii) we assumed that end-users would benefit from a model that could be used to predict malnutrition burden even where none of these services were available, e.g. due to access constraints.
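Two of the predictor-preparation steps described above, the weighted manual imputation of missing market values and the three-month rolling mean, can be sketched as follows. This is an illustrative Python sketch, not the code used in the analysis. The function names are hypothetical, and the use of a trailing (rather than centred) window is an assumption, since the text does not specify the window alignment. The spline smoothing and mice-based imputation are not reproduced here.

```python
from typing import List, Optional, Sequence


def impute_market_value(nearest: float, others: Sequence[float],
                        w_nearest: float = 0.7, w_others: float = 0.3) -> float:
    """Weighted average of the nearest market's value (weight 0.7) and the
    mean of all other non-missing markets (weight 0.3)."""
    mean_others = sum(others) / len(others)
    return w_nearest * nearest + w_others * mean_others


def rolling_mean(series: Sequence[float], window: int = 3) -> List[Optional[float]]:
    """Trailing rolling mean (assumed alignment); positions without a full
    window of preceding values are returned as None."""
    out: List[Optional[float]] = []
    for i in range(len(series)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = series[i + 1 - window: i + 1]
            out.append(sum(chunk) / window)
    return out
```

For example, a market with a missing price whose nearest neighbour reports 10 and whose other markets average 25 would be attributed 0.7 × 10 + 0.3 × 25 = 14.5.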

Predictive models
We explored two prediction approaches, as follows.
Generalised linear modelling. We first split the data by period into a training set (consisting of approximately the chronologically first 70% of the data) and a 'holdout' (i.e. validation) set (the most recent 30%). For each anthropometric indicator, we fitted generalised linear models (GLM) to individual child observations in the training dataset, with robust standard errors to account for the cluster sampling design of most surveys, a quasi-binomial distribution for binary outcomes (severe wasting, wasting) and a gaussian distribution for continuous outcomes (WHZ, MUAC), which we did not transform as they were normally distributed. We specified model weights as the product of survey quality score and survey sample coverage.
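The chronological split and the model weights described above can be sketched as follows. This Python sketch is illustrative only. It assumes that the split is made at survey level on a sortable date key, and that the ENA score (0% = best, 50% = worst) is rescaled linearly. Neither detail is stated explicitly in the text, and the function names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(records: Sequence[T], key: Callable[[T], str],
                        train_fraction: float = 0.7) -> Tuple[List[T], List[T]]:
    """Earliest ~70% of records form the training set, the rest the holdout."""
    ordered = sorted(records, key=key)
    cut = round(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def rescale_quality(ena_score_pct: float) -> float:
    """Assumed linear rescaling: 0% (best) -> 1.0, 50% (worst) -> 0.0."""
    return 1.0 - ena_score_pct / 50.0


def survey_weight(ena_score_pct: float, sampling_coverage: float) -> float:
    """Model weight = rescaled quality score x sampling coverage (a proportion)."""
    return rescale_quality(ena_score_pct) * sampling_coverage
```

A survey with an ENA score of 25% that reached 10 of 12 planned units would thus receive a weight of 0.5 × 10/12 ≈ 0.42 under these assumptions.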
After visual inspection, we categorised continuous predictors, and selected categorical versus continuous versions of these based on linearity of the association and the smallest-possible Chi-square (for binary outcomes) or F-test (continuous outcomes) p-value testing whether the univariate model provided better fit than a null model. We also used this p-value to select among candidate lags for each predictor; however, we modelled climate variables (rainfall, [...] We then fitted models consisting of all possible combinations of predictors, and shortlisted the best 10% based on predictive accuracy (lowest mean square error, MSE) of model predictions at stratum-month level, relative to observations in the holdout dataset. We manually selected the best fixed effects model among these based on relative accuracy on holdout data, accuracy on external data simulated through leave-one-out cross-validation (LOOCV) [18], the plausibility of observed associations, and model parsimony (while the latter characteristic is relatively unimportant for prediction, in practice we wished to avoid users of the model having to collect a large amount of predictor data). Lastly, we explored plausible two-way interactions.
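The exhaustive search over predictor combinations and the shortlisting of the best 10% by holdout MSE can be sketched as below. This is an illustrative Python sketch. The model-fitting step is deliberately left as a user-supplied function (`holdout_mse`, a hypothetical name) that is assumed to fit a model with the given predictors on the training set and return its MSE on the holdout set. The rounding-up of the 10% cut-off is also an assumption.

```python
from itertools import combinations
from math import ceil
from typing import Callable, Iterator, List, Sequence, Tuple


def mse(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Mean square error between predictions and observations."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def all_subsets(predictors: Sequence[str]) -> Iterator[Tuple[str, ...]]:
    """Every non-empty combination of candidate predictors."""
    for k in range(1, len(predictors) + 1):
        yield from combinations(predictors, k)


def shortlist(predictors: Sequence[str],
              holdout_mse: Callable[[Tuple[str, ...]], float],
              fraction: float = 0.10) -> List[Tuple[float, Tuple[str, ...]]]:
    """Score each combination and keep the best `fraction` (lowest MSE first)."""
    scored = sorted((holdout_mse(s), s) for s in all_subsets(predictors))
    keep = max(1, ceil(fraction * len(scored)))
    return scored[:keep]
```

With p candidate predictors the search covers 2^p - 1 combinations, which is why such an approach is only practical for a modest number of candidates.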
We also fitted mixed models (with stratum as a random effect, given that in both countries surveys were repeated in many districts / counties). The latter, however, offered inconsistent accuracy advantages over fixed effects models on either cross-validation or holdout datasets. Furthermore, we assumed that end users would be most interested in predicting malnutrition prevalence in hard-to-survey districts / counties, i.e. where no a priori random effects would be estimable. For these reasons, we discarded mixed models altogether.
Machine learning. After splitting data as above, we used the ranger package [19] to grow random forest (RF) regression models on the training dataset, aggregated at stratum-month level. This approach makes minimal assumptions about data structure. Briefly, it partitions the data according to various randomly generated 'trees', where each node is defined by a particular value of one of the predictor variables, with branches being the resulting split in the data; the 'depth' of each tree is defined by the number of variables that are used to create nodes; randomness is introduced by the choice of variables to build any given tree, values at which splits occur, and the order of variables in the tree structure. The distribution of the outcome arising from the partitions in each tree is compared to the observed data to determine accuracy. RF averages predictions across a large ensemble of trees. We grew RFs with 1000 trees, using all candidate predictors as above, and computed prediction CIs using a jack-knife estimator [20].
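The ensemble-averaging idea behind RF can be illustrated with a deliberately reduced sketch: bagged one-split regression trees ('stumps'), each grown on a bootstrap sample using one randomly chosen predictor. This is not the ranger algorithm and omits deeper trees, per-node variable sampling and the jack-knife CIs. All function names are hypothetical, and the Python code is for illustration only.

```python
import random
from statistics import mean
from typing import List, Optional, Sequence, Tuple

Stump = Tuple[int, Optional[float], float, float]  # feature, cut, left mean, right mean


def fit_stump(rows: Sequence[Sequence[float]], targets: Sequence[float],
              feature: int) -> Stump:
    """Find the single split on `feature` that minimises squared error."""
    values = sorted(set(r[feature] for r in rows))
    overall = mean(targets)
    best = (float("inf"), None, overall, overall)  # fallback: no split possible
    for lo, hi in zip(values, values[1:]):
        cut = (lo + hi) / 2
        left = [t for r, t in zip(rows, targets) if r[feature] <= cut]
        right = [t for r, t in zip(rows, targets) if r[feature] > cut]
        ml, mr = mean(left), mean(right)
        sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
        if sse < best[0]:
            best = (sse, cut, ml, mr)
    return (feature, best[1], best[2], best[3])


def predict_stump(stump: Stump, row: Sequence[float]) -> float:
    feature, cut, ml, mr = stump
    if cut is None:
        return ml
    return ml if row[feature] <= cut else mr


def grow_forest(rows: Sequence[Sequence[float]], targets: Sequence[float],
                n_trees: int = 1000, seed: int = 1) -> List[Stump]:
    """Each stump sees a bootstrap sample and one randomly chosen predictor."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest


def predict_forest(forest: Sequence[Stump], row: Sequence[float]) -> float:
    """The ensemble prediction is the average over all trees."""
    return mean([predict_stump(s, row) for s in forest])
```

Because every tree predicts a mean of observed training outcomes, the averaged prediction always lies within the range of the training data, one reason tree ensembles tend to shrink extreme predictions towards the centre.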

Performance evaluation
For both the GLM and RF approach, we present various metrics of predictive accuracy, for estimation: (i) effective coverage, defined here as the proportion of stratum-months for which the predicted point estimate fell within the 95% or 80%CIs of the observed data; (ii) relative bias, defined as [...] best models for 'now-casting' (i.e. prediction of malnutrition based on data collected up to the present). We also explored models for forecasting malnutrition 3 months into the future (i.e. prediction based on data collected up to 3 months previously), but found that these had low performance (data not shown). All analysis was done using R software [21] through the RStudio [22] platform.
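Effective coverage, as defined above, and the sensitivity and specificity of threshold-based classification reported in the Results can be sketched as follows. This is an illustrative Python sketch with hypothetical function names. The 5% default threshold is the one mentioned in the Results for severe wasting, and treating a prevalence exactly at the threshold as "above" is an assumption.

```python
from typing import Optional, Sequence, Tuple


def effective_coverage(predictions: Sequence[float],
                       intervals: Sequence[Tuple[float, float]]) -> float:
    """Proportion of stratum-months whose predicted point estimate falls
    within the CI of the observed (survey) estimate."""
    hits = sum(1 for p, (lo, hi) in zip(predictions, intervals) if lo <= p <= hi)
    return hits / len(predictions)


def sensitivity_specificity(predicted: Sequence[float], observed: Sequence[float],
                            threshold: float = 0.05
                            ) -> Tuple[Optional[float], Optional[float]]:
    """Classify each stratum-month as at/above or below a prevalence threshold
    and compare predicted with observed classes."""
    tp = fn = tn = fp = 0
    for p, o in zip(predicted, observed):
        if o >= threshold:
            if p >= threshold:
                tp += 1
            else:
                fn += 1
        else:
            if p >= threshold:
                fp += 1
            else:
                tn += 1
    sens = tp / (tp + fn) if (tp + fn) else None   # undefined without positives
    spec = tn / (tn + fp) if (tn + fp) else None   # undefined without negatives
    return sens, spec
```

Both functions operate on stratum-month level estimates, so small numbers of stratum-months in the holdout set translate directly into small denominators for sensitivity and specificity.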

Results
Anthropometric survey patterns
Details of eligible surveys from Somalia are reported in Table 3 and [...]

Performance of Somalia models
GLM model coefficients and performance metrics for Somalia are shown in Table 5: odds ratios, OR < 1 and linear coefficients > 0 indicate a protective effect, and vice versa. Four predictors consistently featured in the most predictive models: livelihood, measles occurrence over the previous trimester (a non-zero incidence was strongly associated with worse anthropometric status), NDVI over the previous semester and average market price of water in the second-to-last trimester (also negatively associated with anthropometric status). Theoretical predictor variables including intensity of armed conflict, terms of trade and rainfall intensity compared to the regional mean were inconsistently associated with the outcomes, and marginally useful for prediction. Generally, predictive performance was low: models yielded mostly upward-biased predictions that fell within the observed survey CIs for only 23-80% of stratum-months, depending on the outcome; while denominators were very small, only the model for severe wasting (WFH + oedema) reached a moderate combination of sensitivity and specificity to classify prevalence as per the 5% threshold. Graphs of predictions versus observations support this pattern; Figure 3 shows results for severe wasting (WFH + oedema), while remaining graphs are in the Additional file 1.
Performance of South Sudan models
Table 6 shows GLM predictions for South Sudan. Here, the most significant associations were with livelihood type, total rainfall and terms of trade. Predictive performance was also low (Figure 4), with bias as high as +46.8% for severe wasting (WFH + oedema) prevalence, coverage no better than 75% across all outcomes and no instance of high sensitivity and specificity for classification.
RF models had far better fit to the training data than GLMs, but performed similarly on cross-validation and holdout data. The most important variables were livelihood, terms of trade, uptake of measles vaccination and intensity of insecurity (Additional file 1).

Discussion
In this study we combined previously collected anthropometric household survey data with a range of potential population-level predictor datasets quantifying theoretical factors causally associated with wasting burden in crisis settings, to explore whether key quantities such as severe wasting or wasting prevalence could be estimated through prediction, as a complement to ground surveys. Resulting predictive models based on either GLM or machine learning approaches had disappointing performance in both Somalia and South Sudan across several anthropometric outcomes. Generally, predictive accuracy was better for outcomes based on WFH than on MUAC, but even for the former our models would not, in our opinion, provide actionable information.
Models to predict wasting risk at the individual or household level exist [23,24]. While we did not search the literature systematically due to insufficient resources, we are aware of only two other population-level predictive studies. Osgood-Zimmerman et al. [25] [27] have used geospatial and remotely sensed covariates to map stunting prevalence, while Lentz et al. [28] have also demonstrated the potential of a GLM-based approach for predicting food insecurity in Malawi. We have previously used the same datasets as in this study to develop reasonably predictive models of population-level death rate (a farther-downstream and thus potentially even more multifactorial outcome), albeit only for retrospective estimation [10,11].
Given the above, we expected better predictive performance. It is plausible that additional data on factors causally associated with wasting, including infant and young child feeding practices, use of food security coping strategies, dietary diversity, access to water, sanitation and hygiene services and health service utilisation would have improved prediction: these data are sometimes generated in crisis settings through cross-sectional surveys, but to our knowledge are not typically available at the granular level required for our predictive problem. It is also likely that problems with available data quality constrained model accuracy. Non-differential error or misclassification arising from measurement problems (e.g. imprecise child anthropometric measurements) and data entry errors would generally reduce model goodness-of-fit and bias estimated associations towards the null: observed-versus-predicted graphs generally suggest 'regression dilution' [29], a phenomenon whereby predictions align around an underestimated linear slope, consistent with high noise in predictor variables. Differential error may also have affected model accuracy in various ways. For example, the predictive value of certain variables would have been dampened if anthropometric surveys had systematically underestimated wasting in the very locations where those predictors exhibited their most extreme values, as might be plausible for surveys done in very remote, insecure locations and thus constrained by time, local staff competency or the need to exclude unreachable communities from the effective sampling frame. We attempted to mitigate such bias by down-weighting lower-quality surveys with evidence of sampling frame selection bias, but models without this weight were not substantively different (data not shown). Pragmatically, these data quality limitations illustrate the challenges of prediction based on data not collected for research.
Our study aim was not to explore associations: as such, we focussed on accuracy and, for example, ignored significant effect modifications that did not improve prediction. Observed GLM associations and variable importance metrics for RF are nonetheless informative. Measles incidence and rainfall or NDVI had plausible associations with most outcomes in both countries, while water price had a very strong association in Somalia. Terms of trade, however, were only important in South Sudan. We did not see strong associations with forced displacement or armed conflict intensity, as documented elsewhere [30], and, critically, rainfall abnormalities (as opposed to total precipitation) were not an important predictor in any model. A recent review of 90 studies concludes that wasting is understudied relative to stunting; the review also finds that, while adequate rainfall during the growing season has been associated with less wasting, relationships with drought and armed conflict are inconclusive [31]. Indeed, the interplay of unusual climate events and armed conflict has proved challenging for food security prediction [32]. More generally, our and others' findings underscore the context-specific complexity of causal pathways leading to wasting. They may also reflect the relative noisiness of different datasets, i.e. their accuracy.
Aside from data limitations, our analysis does not thoroughly explore available predictive methods. Among GLM-based approaches, it is possible that different transformations of outcomes or predictors, as well as methods to identify the most informative variables, such as lasso regression, could have yielded improved performance. Among machine learning methods, boosted regression trees could have reduced bias. We note however that these methods would need to yield very considerable improvements over those we used in order to produce useful predictions.

Conclusions
This analysis suggests that predictive modelling for wasting burden in crisis settings may not be an immediately viable alternative to ground surveys, at least in the countries studied. Given the potential benefit of such an approach [5], we nonetheless recommend further study, possibly in other settings, using larger datasets and more advanced machine learning methods (boosted regression trees, support vectors, neural networks) and/or Bayesian frameworks. To facilitate such research, as well as other publicly beneficial analyses, humanitarian actors should systematically make key datasets, including but not limited to anthropometric surveys, publicly available in curated, accessible form [33]. These include, but are not limited to, service data from different sectors (e.g. outpatient consultations; vaccination coverage; anthropometric screening data among outpatient children and pregnant women; admissions and exit outcomes for management of wasting; water availability and quality; coverage of excreta disposal; food security service beneficiaries and Kcal equivalents); market data (e.g. staple prices); morbidity and mortality surveillance data; cross-sectional surveys measuring food security, dietary diversity and infant and young child feeding practices; protection assessments; surveys of perceptions of affected populations; humanitarian presence and activity who-does-what-where matrices; and alternative data on insecurity (e.g. incidents monitored by the UN country team) or humanitarian access (e.g. road safety). A simple principle could be to publish all data barring any whose public availability could place humanitarian actors or affected people at unacceptable risk; aggregation and anonymisation may mitigate such risks. Lastly, any studies to date to predict population-level nutrition burden should be synthesised to identify actionable evidence and guide further analysis.

Trends in key survey indicators, South Sudan. Each dot represents the point estimate of a single survey. Box plots indicate the median and inter-quartile range, and whiskers the 95% percentile interval.

Figure 3
GLM-predicted versus observed severe wasting (WFH + oedema) prevalence, Somalia, by district-month, on training data, LOOCV and holdout data. Shaded channels indicate different absolute deviance of predictions. Vertical dotted lines denote commonly used severe wasting prevalence thresholds.

Figure 4
GLM-predicted versus observed severe wasting (WFH + oedema) prevalence, South Sudan, by district-month, on training data, LOOCV and holdout data. Shaded channels indicate different absolute deviance of predictions. Vertical dotted lines denote commonly used severe wasting prevalence thresholds.

Supplementary Files
This is a list of supplementary files associated with this preprint: malnutpredpaperadditional1.docx