Machine Learning Methods for “Small-n, Large-p” Problems: Understanding the Complex Drivers of Modern-Day Slavery

An estimated 40 million people are in some form of modern slavery across the globe. Understanding the factors that make any particular individual or geographical region vulnerable to such abuse is essential for the development of effective interventions and policy. Efforts to isolate and assess the importance of individual drivers statistically are impeded by two key challenges: data scarcity and high dimensionality. The hidden nature of modern slavery restricts available datapoints; and the large number of candidate variables that are potentially predictive of slavery inflates the feature space exponentially. The result is a highly problematic "small n, large p" setting, where overfitting and multicollinearity can render more traditional statistical approaches inapplicable. Recent advances in non-parametric computational methods, however, offer scope to overcome such challenges. We present an approach that combines non-linear machine learning models and strict cross-validation methods with novel variable importance techniques, emphasising the importance of stability of model explanations via Rashomon-set analysis. This approach is used to model the prevalence of slavery in 48 countries, with results bringing to light important predictive factors, such as a country's capacity to protect the physical security of women, which has previously been under-emphasized in the literature. Out-of-sample estimates of slavery prevalence are then made for countries where no survey data currently exists.

In recent years, the Walk Free Foundation (WFF) has made valuable progress in this area, providing estimates of slavery incidence across 48 countries in 2016 and 2018, based upon surveys from the Gallup World Poll (GWP). These estimates have been further extrapolated out-of-sample to countries where no GWP survey data existed, using theoretically-driven risk models [8, 2]. While this has proven a highly beneficial exercise [9, 10], methods used to produce such estimates are not without their limitations [7, 10, 11, 12, 13, 14, 15]; and despite recent advances in Multiple Systems Estimation (MSE) [3, 16, 17, 18], assessment of national slavery prevalence to any degree of accuracy remains an active research area.

A further problem in modelling slavery is that, again due to its hidden nature, there exists an extensive range of candidate independent variables to investigate, while at the same time a shortage of prevalence estimates to regress against. Therefore, aggregated data used to model slavery prevalence in a population will likely be "small n, large p" in nature [19]. This issue is symptomatic of many "wicked problems" [20] facing the computational social sciences, with too few experimental units, n, available to researchers in comparison to the vast number of potential driving factors, p, to be modelled. In such situations, dangers of overfitting and multicollinearity can render traditional statistical approaches inapplicable; and as a result, investigation of the factors underlying modern slavery remains a challenging statistical task.

Despite a rich literature considering the drivers of modern slavery, most findings stem from qualitative case data, victim/survivor interviews and small-scale surveys. While a depth of insight is obtainable from such methods, findings are necessarily drawn from small samples, limiting generalizability to specific regions, industries, and forms of slavery. Few of the predictors hypothesised, and detailed in the previous section, have been studied statistically, restricting our understanding of their impacts and interactions. As a consequence, not enough is known about the extent to which any individual driver engenders slavery incidence; nor how factors combine to allow the phenomenon to perpetuate, limiting the formation of informed national policy.

There are a few recent exceptions; researchers have used national prevalence estimates from GWP surveys [2] to statistically consider relationships between slavery and specific phenomena such as fishing [32] and globalisation [33]. The WFF have also established a theory-driven Vulnerability Model [34] to summarize factors impacting national prevalence. The WFF framework groups 23 national-level risk variables into five major dimensions: Governance Issues; Lack of Basic Needs; Effects of Conflict; Inequality; and Disenfranchised Groups [34], and allows a vulnerability score to be formulated for each country using its scores on each dimension. However, the extent to which individual factors predict slavery is not reported, nor is there indication of how factors relate to one another. The WFF framework currently drops variables when collinearities occur (e.g. the Gender Inequality Index [34]), and the framework's combined vulnerability score (which considers all five dimensions together) shows only moderate correlation with country prevalence estimates (r = 0.33) [8].

Other quantitative studies have predominantly restricted themselves to linear hypothetico-deductive approaches [32, 33], focusing on single or small sets of independent variables. While this is an understandable situation given limited data, it has left an open challenge of assessing the problem domain via an inductive and computational approach. Exploration of slavery drivers requires machinery that can accommodate the high-dimensional and non-linear nature of modern slavery, while also accounting for interactions and collinearities between explanatory variables. In this work, we extend the WFF's valuable groundwork in the field, introducing a computational model able to directly map relationships between vulnerability indicators and national slavery prevalence. Our explicit target is to quantify the importance of individual factors in predicting slavery prevalence, despite a "small n, large p" context. Central to the approach is the first application of Rashomon-set analysis [35].

This is a critical issue, with models that appear to provide good 'fit' to data offering no guarantees of out-of-sample generalizability in reality. Such cases undermine the efficacy of any explanations emerging from models; and this adds risk to any real-world interventions formed from them. This issue is exacerbated in high-dimensional settings where multiple variables are modelled simultaneously [37]. Partly, this may be why hypothetico-deductive approaches remain so prevalent in the social sciences, with studies tending to focus on a few key hypotheses in order to prevent the increase of family-wise error rates. While this does indeed help to prevent overfitting, the ability to uncover new, and perhaps unexpected, predictors is often lost. Furthermore, the ability to assess relative variable importances across a wide range of features is foregone, with interactions between variables left unmodelled; a situation which is particularly problematic in settings such as slavery, where driving factors are unlikely to act in isolation.

In response to issues of multicollinearity, non-linearity and overfitting, the field of machine learning has sought to develop alternative ways to guard against model overfitting, while accommodating both non-linear and multivariate data. This has produced a rich array of cross-validation techniques: methods which support inductive experimental setups whilst defending against p-hacking and 'procedural overfitting' [37]. Cross-validation aims to find model parameterisations that maximise generalizability, rather than minimizing regression residuals (whilst additionally reducing the influence/bias of the researcher in the model specification and model selection phases). Due to their inductive nature, such frameworks permit discovery of potential new predictors, and crucially allow comparative assessment of the importance of variables, even in non-linear settings.

Cross-validation was, however, designed with large datasets in mind. If stable model interpretations are to be found using smaller datasets, we require additional considerations and alterations to a typical machine learning methodology. The steps required to achieve this are outlined in detail in the Method section, and centre on integration of feature compression/selection directly into the modelling process, rather than being undertaken a priori as has been common in the field. Reduction of the variable space at some point is unavoidable if the "large p" problem [38] is to be handled: a set of k, potentially latent, variables must be identified, where k < p. Unlike methods such as partial least squares, which identify such latent variables via parametric modelling assumptions, our approach treats k as a hyper-parameter to be identified as part of model class exploration. To traverse the model space, M, we employ full, leave-one-out cross-validation (LOOCV), allowing model parameters to be optimized whilst using all of the "small n" data. This further ensures that insights extracted from fitted models will generalize to the whole dataset as much as possible. Together, these steps allow for meaningful, potentially novel insights to be extracted via a non-linear, inductive approach even in "small n, large p" settings.
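To make this concrete, the sketch below shows how such a workflow might be assembled with the scikit-learn stack used elsewhere in this study. Feature compression sits inside the pipeline, so the number of latent components k is tuned jointly with the downstream model's hyper-parameters under LOOCV; the placeholder data, the candidate values of k, and the other grid choices are illustrative rather than the exact grid used here.

```python
# Minimal sketch: feature compression tuned *inside* LOOCV, not a priori.
# X and y are placeholders for the country-by-indicator matrix and the
# prevalence estimates; NMF requires the features to be non-negative.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import NMF
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, LeaveOneOut

rng = np.random.default_rng(0)
X = rng.random((48, 106))   # stand-in for the 48 x 106 indicator matrix
y = rng.random(48)          # stand-in for national prevalence estimates

pipe = Pipeline([
    ("nmf", NMF(init="nndsvda", max_iter=500, random_state=0)),
    ("tree", DecisionTreeRegressor(random_state=0)),
])

param_grid = {
    "nmf__n_components": [4, 5, 6, 7, 8],   # k treated as a hyper-parameter
    "tree__max_depth": [3, 4, 5, 6],
    "tree__max_features": [0.3, 0.5, 1.0],  # the 'MF' restriction per split
}

# LOOCV: every country is held out exactly once, so model selection
# optimises out-of-sample error (MAE) while still using all "small n" data.
search = GridSearchCV(pipe, param_grid, cv=LeaveOneOut(),
                      scoring="neg_mean_absolute_error", n_jobs=-1)
search.fit(X, y)
print(search.best_params_, -search.best_score_)
```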

In this study, we apply the "small n, large p" workflow outlined above.

Table 1. The best performing parameterisation for each modelling pipeline. 'Theory-based' refers to feature selection based on the literature (N = 34). MF refers to allowing the 'max features' hyper-parameter to be tuned, so that the decision tree could not select from all components at each split. A Lasso regression (L1-norm regularisation) was used instead of a linear regression when no feature compression was used, due to the high number of independent variables.

Table 2. The Rashomon set.

This allowed identification of the optimal latent variable compression for each predictive mechanism/pipeline.
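As an illustration of how such a Rashomon set might be assembled, the sketch below filters the cross-validated grid-search results (the `search` object from the previous sketch) to all pipelines whose LOOCV error falls under a chosen cut-off; the 0.27 MAE threshold mirrors Figure 1, and the column names follow scikit-learn's `cv_results_` convention.

```python
# Minimal sketch: extract a Rashomon set of near-optimal models from the
# grid-search results, rather than trusting only the single best model.
import pandas as pd

results = pd.DataFrame(search.cv_results_)
results["mae"] = -results["mean_test_score"]   # scores were negated MAE

rashomon_threshold = 0.27                      # cut-off used in Figure 1
rashomon_set = (results.loc[results["mae"] < rashomon_threshold,
                            ["params", "mae"]]
                .sort_values("mae"))
print(rashomon_set.head(10))
```

Explanations are then trusted only where these near-optimal models agree, which is the stability check at the heart of the Rashomon-set analysis.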

The model classes explored ranged from linear and regularised regression-based models, to non-linear approaches such as decision trees and random forest models (also chosen due to high interpretability). Partial least squares was also examined due to its internal identification of latent structures.
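Under the same assumed scikit-learn stack, these candidate classes can be expressed as interchangeable pipeline stages, as sketched below; the estimator settings are placeholders, and the full sixteen combinations evaluated are those listed in Table 1.

```python
# Minimal sketch: model classes and compression options as swappable
# pipeline stages (settings are placeholders, not the tuned values).
from itertools import product
from sklearn.pipeline import Pipeline
from sklearn.decomposition import NMF
from sklearn.linear_model import Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.cross_decomposition import PLSRegression

compression_steps = {
    "raw": None,   # no compression (Lasso replaces plain linear regression)
    "nmf": NMF(init="nndsvda", max_iter=500, random_state=0),
}
model_steps = {
    "lasso": Lasso(alpha=0.01),
    "tree": DecisionTreeRegressor(random_state=0),
    "forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "pls": PLSRegression(n_components=6),   # internal latent structure
}

pipelines = {}
for (c_name, c_step), (m_name, m_step) in product(compression_steps.items(),
                                                  model_steps.items()):
    steps = ([("compress", c_step)] if c_step is not None else [])
    steps.append(("model", m_step))
    pipelines[f"{c_name}->{m_name}"] = Pipeline(steps)
```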

The best performing parameterization for each of these modelling pipelines is detailed in Table 1, with Figure 1 illustrating the distribution of performances for all models with MAE less than 0.27.

The best performing model used the full feature pool (p = 106), compressed the feature space via latent NMF components (k = 6), then used a (non-linear) decision tree model with a restricted number of 'max features' (MF) available at each split (to help prevent the tree from getting stuck in a local minimum). For the full set of model parameters see Appendix B. The best model's performance was a significant improvement on both the mean (MAE = 0.366, p < 0.001) and median (MAE = 0.349, p < 0.001) baseline predictions, analysed using a Wilcoxon signed-rank test.

Predictions using leave-one-out analysis compared to the 'actual' slavery prevalence estimations can be viewed in Figure 2, accompanied by the distribution of 10,000 bootstrapped predictions to illustrate the associated uncertainty. The graph highlights that the model was less effective at predicting a subset of countries, in particular those with the highest prevalence of slavery, which it tended to underestimate¹.
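The baseline comparison can be reproduced in outline as follows: per-country absolute errors from the LOOCV predictions are paired with those of a constant (mean or median) predictor and compared with a Wilcoxon signed-rank test, here via scipy. The arrays are placeholders carried over from the earlier sketches.

```python
# Minimal sketch: are the model's paired per-country errors significantly
# smaller than those of a constant-prediction baseline?
import numpy as np
from scipy.stats import wilcoxon
from sklearn.model_selection import cross_val_predict, LeaveOneOut

y_pred = cross_val_predict(search.best_estimator_, X, y, cv=LeaveOneOut())

model_err = np.abs(y - y_pred)
baseline_err = np.abs(y - np.median(y))  # swap in np.mean(y) for the mean baseline

stat, p_value = wilcoxon(model_err, baseline_err, alternative="less")
print(f"baseline MAE = {baseline_err.mean():.3f}, p = {p_value:.4f}")
```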

In the model interpretation phase, the best model, M, was fit to the data, and the resulting NMF components and variable loadings analysed to determine component themes (see Appendix, Figure A3 for a breakdown of the model). Components' relative importances were then compared (see Table 2). The full set of variable loadings can be viewed in Appendix, Figure A2. Interestingly, the Physical Security of Women component is negatively correlated with Access to Resources (r = -0.44). The effect of the latter relationship is particularly apparent in Figure 5, with Rashomon models M2 and M3 requiring only one of those components to perform well. This suggests that there may additionally exist non-linear relationships between the two components, which cannot be fully captured in the correlation coefficient.

[Figure caption] The x axis is ordered by the MAE between our best model and the 'actual' prevalence as estimated by the GWP survey data.

Figure 6. A heat map illustrating that (for the best performing model) the partial dependency of the prevalence prediction is especially high when both Access to Resources and Physical Security of Women are low. Here, a lower score for Physical Security of Women indicates less security.
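The component themes themselves are read off the NMF loadings matrix (the H matrix); a sketch of that inspection step follows, with hypothetical indicator names standing in for the real variable labels.

```python
# Minimal sketch: characterise each NMF component by its top-loading
# variables (rows of H map components onto the original p indicators).
import numpy as np

feature_names = np.array([f"indicator_{i}" for i in range(X.shape[1])])

nmf = search.best_estimator_.named_steps["nmf"]
H = nmf.components_                      # shape (k, p)

for idx, loadings in enumerate(H):
    top = feature_names[np.argsort(loadings)[::-1][:5]]
    print(f"Component {idx + 1}: " + ", ".join(top))
```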

To understand the unanticipated relationship between Access to Resources and the Physical Security of Women more fully, partial dependency plots were analysed to look for non-linear dependencies between the variables. Figure 6 illustrates that the physical security of women is only predictive of slavery in contexts where access to resources is low. In other words, women are particularly vulnerable to being exploited in areas where there is poor access to fuel, electricity, piped/clean water, sanitation, and education (see variable loadings of the Access to Resources component in Appendix, Figure A2). To further highlight the non-linear interactions behind the Physical Security of Women component, Individual Conditional Expectation (ICE) graphs shown in Figure 7 indicate that when a (lack of) physical security for women is extremely high, this dramatically increases the partial dependency of the prevalence prediction, more than any other component. The important role that Physical Security of Women can play in the prediction of slavery prevalence is only made available here using a methodology that allows non-linear interactions of this nature to be modelled.
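Both kinds of plot are available off the shelf; the sketch below shows how a two-component heat map in the spirit of Figure 6, and ICE curves in the spirit of Figure 7, might be produced with scikit-learn's inspection tools. Because dependence is examined in the compressed component space, the decision-tree stage is inspected directly on the NMF-transformed data; the component indices are placeholders.

```python
# Minimal sketch: partial dependence and ICE over NMF components.
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

best = search.best_estimator_
W = best.named_steps["nmf"].transform(X)     # countries in component space
tree = best.named_steps["tree"]
names = [f"C{i + 1}" for i in range(W.shape[1])]

# Two-way partial dependence (cf. Figure 6): joint effect of two components
# (indices stand in for Access to Resources / Physical Security of Women).
PartialDependenceDisplay.from_estimator(tree, W, features=[(4, 2)],
                                        feature_names=names)

# ICE curves (cf. Figure 7): 'both' overlays per-country curves on the average.
PartialDependenceDisplay.from_estimator(tree, W, features=[2], kind="both",
                                        feature_names=names)
plt.show()
```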

Predicting prevalence for countries with no survey data

[Figure caption] The grey box plots illustrate the distribution of 10,000 bootstrapped LOOCV predictions (using the full pipeline NMF->DT) to help illustrate the uncertainty associated with our model's predictions. The box shows the quartiles of the bootstrapped predictions, while the whiskers extend to show the rest of the distribution, except for points determined to be "outliers" (using a function of the inter-quartile range), which are not plotted. The x axis is ordered by the disagreement between our model's predictions and the predictions made by the GSI model. The countries displayed are those which had <10% missing independent variable data. For all 172 out-of-sample predictions and information on missing data, see Appendix, Table A1.

Models built from the full feature pool produced better predictions than those using features selected from the qualitative literature. Given the generous selection of variables entered into the theory-selected feature pool, this highlights the potential of an inductive methodology to uncover novel predictors, even in "small n, large p" contexts. Model class comparisons unequivocally showed that non-linear models gave better predictions than their linear counterparts, supporting the notion that our models are capturing new non-linearities that have not been analysed before. Finally, allowing guided feature compression, parameterizable as part of a grid search, uniformly outperformed traditional approaches that incorporate latent structures, such as partial least squares.
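Bootstrapped prediction intervals of the kind shown in the grey box plots can be sketched as follows: the training countries are resampled with replacement, the NMF->DT pipeline is refit on each resample, and the spread of refitted predictions for the held-out country is recorded. A reduced replicate count keeps the example light; the study itself used 10,000 replicates.

```python
# Minimal sketch: bootstrap uncertainty for one held-out country's prediction,
# reusing the pipeline and tuned parameters from the earlier sketches.
import numpy as np
from sklearn.base import clone

rng = np.random.default_rng(0)
held_out = 0                                    # index of the left-out country
train = np.delete(np.arange(len(y)), held_out)

boot_preds = []
for _ in range(500):                            # study used 10,000 replicates
    sample = rng.choice(train, size=len(train), replace=True)
    model = clone(pipe).set_params(**search.best_params_)
    model.fit(X[sample], y[sample])
    boot_preds.append(model.predict(X[held_out:held_out + 1])[0])

q1, med, q3 = np.percentile(boot_preds, [25, 50, 75])
print(f"bootstrap quartiles: {q1:.3f} / {med:.3f} / {q3:.3f}")
```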

The majority of the models in our Rashomon set benefited from identification of latent components. This broadly validates the utility of summarising predictors of slavery into k latent factors (first done by the WFF in the construction of the Vulnerability Model [34]). Whilst the final model decided on by the WFF consists of 5 factors, the naturally occurring solution (based on eigenvalues greater than 1) actually consisted of 6 factors [34], corroborating our bottom-up findings. The present study, however, extends the WFF's initial work by constructing components directly shaped by the outcome variable, slavery prevalence.

In this way, domain expertise and computational modelling can be combined to build prediction models, harnessing the advantages of a data-driven approach (objectivity, quantification, and out-of-sample predictions) combined with expert human judgement and additional contextual information.

In this study, while key limitations of traditional regression approaches were mitigated against, several limitations remain inherent to the data. The dependent variable, although derived from survey data collected using a representative sampling methodology [48], nonetheless represents an estimate rather than a direct measure of slavery. Consequently, prevalence estimates must be considered more a reflection of slavery risk across the given population, rather than corresponding to formal incidence numbers. Further, the estimates we produce are not able to offer indication of the breakdown in the typology of slavery occurring within a country. Recent work has focused on establishing proxy indicators for specific types of vulnerability/exploitation, or exploitation within specific industries, using small-scale surveys married to digital traces such as mobile phone records [49], satellite imagery [50, 51, 52, 53], and vessel data [54]. Such approaches can help to bolster the data in this domain without putting vulnerable individuals at further risk.

This study applied machine learning methods to the context of modern slavery to better understand the drivers of this ongoing phenomenon. Using rigorous cross-validation approaches, careful consideration of model stability, and an in-depth variable importance analysis, stable predictive components emerged using data characterised as "small n, large p". Notably, a novel predictive component was found in the Physical Security of Women, which was shown to predict prevalence non-linearly, and in association with other components (in particular Access to Resources). Such non-linear relationships, combined with the challenges traditional methods face when dealing with multicollinearity, may well be the reason that this component, and its ability to support quantitative prediction of slavery, has been previously overlooked. The findings make the case for further exploration of data-driven, inductive approaches and out-of-sample prediction to complement existing methodologies being used to study complex social problems such as modern-day slavery.

The methodology used in this study was exploratory, inductive and data-driven. This involved evaluating a number of different approaches and model classes to assess what configuration of method and model produced the most accurate predictions. Firstly, the utility of using all the features versus a smaller pool of features selected from the literature (the approach used by the GSI [34]) was tested. This theory-driven selection process was undertaken using a literature review and by consulting with domain experts. The 106 features were reduced to 35, spanning poverty, globalisation and the country's wealth, education, politics, violence, and conflict (the full list can be found in the Appendix, Figure A6). Secondly, the value of using feature decomposition prior to modelling versus inputting the raw variables into the model was assessed. Non-negative Matrix Factorisation (NMF) was chosen as it forces the matrices to be non-negative, making the emergent components easier to interpret than when using other methods such as Principal Component Analysis (PCA). Finally, three different model classes (also selected for their interpretability), linear regression, decision tree, and random forest, with optimised meta-parameters, were evaluated. The resultant sixteen combinations of different methods and model types can be found in Table 1. Leave-one-out cross-validation (LOOCV) was used to evaluate the model parameters selected, to ensure that the model was generalizable to the full data set. Given the lack of data, and that fitting the model for explanation rather than pure prediction was the goal of the study, no data was held back to create a separate test set.

…have been queried [7, 33]. Therefore, this analysis only uses estimates which were derived directly from the GWP survey data (as in [33]).

4. Given that the sampling procedure for both the 2016 and 2018 Gallup World Poll was random [48], there was no known reason to assume that the 2016 survey would have directly influenced the 2018 survey; and thus the country prevalence estimates for 2016 and 2018 were treated as independent.
5. Features were selected based on past literature, the authors' intuition, and conversations with domain experts. Initially there were 137 variables in the feature pool, before 31 were discarded due to missing data, duplication, or measuring a phenomenon too similar to the dependent variable.

6. For ease, the 2018 SDG data was used for both 2016 and 2018, due to the multiple sources and dates from which the data were collected.

7. The CART regression tree method was chosen due to features being too highly correlated to perform regression-based imputation methods, such as predictive mean matching (causing singularities in the matrix). Further, as linear models such as NMF and linear regressions were being utilised in this analysis, a non-linear imputation method was preferable, to avoid the variables' collinearity being inflated by the imputation process.

Figure A3. The top-loading variables for each NMF component:

(a) Component 1: The top loading variable is Democracy. Other high loading variables directly related to democracy include bans on political parties, freedom of discussion, and liberal democracy.
(b) Component 2: The high loading variables are armed conflict, an ongoing (or historical) mass killing, political terror, battle deaths, and enclaves of rape (i.e. in refugee or military camps).
(c) Component 3: Physical Security of Women. The top loading variable represents the physical security of women. Other relevant variables include the unequal distribution of power, and the laws/protection around reporting rape.

(d) Component 4: Social Inequality and Discrimination
The two highest loading variables depict whether there has ever been a mass killing, such as a genocide; and the unequal distribution of power between social groups.
(e) Component 5: Access to Resources. High loading relevant variables include access to fuel, electricity, water, and sanitation. Registered births, poor literacy, and a lack of infant vaccines also denote poor access to infrastructure/services.

(f) Component 6: Religious and Political Freedoms
High loading variables include religious and political freedoms, political and physical rights, an independent judicial system, foreign and domestic freedom of movement, freedom to assemble, political imprisonment, and freedom of movement for women.

Figure A4. The normalised variable loadings onto the components (H matrix from the NMF) for the Rashomon models with the same pipeline as the best model (all the features and the model class NMF->DT (MF)). Some differences in loadings occur due to variances in the NMF parameterisation (different random seeds being used for coordinate descent, and different constants multiplying the regularisation terms (alpha)), yet the overarching themes remain stable.

Appendix B: Best Model Parameters

The best performing model (MAE = 0.227) used the following meta-parameters for the NMF and Decision Tree functions in the Scikit-learn Python package: random state = 45; NMF k components = 6; NMF solver = 'cd'; NMF tolerance = 0.005; NMF alpha = 2; NMF max iterations = 350; max depth of the tree = 6; tree max features = 0.3; tree minimum samples to split on = 3.
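For concreteness, these meta-parameters map onto scikit-learn roughly as below. Newer scikit-learn releases split NMF's single alpha argument into alpha_W/alpha_H, so the regularisation term is expressed under that assumption; X and y remain placeholders for the (non-negative, imputed) indicator matrix and prevalence estimates.

```python
# Minimal sketch: the Appendix B meta-parameters as a scikit-learn pipeline.
# In scikit-learn >= 1.2, NMF's `alpha` became `alpha_W`/`alpha_H`;
# `alpha_W=2` below stands in for the reported 'NMF alpha = 2'.
from sklearn.pipeline import Pipeline
from sklearn.decomposition import NMF
from sklearn.tree import DecisionTreeRegressor

best_pipe = Pipeline([
    ("nmf", NMF(n_components=6, solver="cd", tol=0.005,
                alpha_W=2, max_iter=350, random_state=45)),
    ("tree", DecisionTreeRegressor(max_depth=6, max_features=0.3,
                                   min_samples_split=3, random_state=45)),
])
# best_pipe.fit(X, y) would reproduce the reported configuration, given the
# original country-by-indicator matrix.
```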