Beyond building damage: estimating and understanding non-recovery following disasters

Following a disaster, crucial decisions about recovery resources often focus on immediate impact, partly due to a lack of detailed information on who will struggle to recover. Here we analyze surveyed data on reconstruction, together with secondary data commonly available after a disaster, to estimate a metric of non-recovery: the probability that a household could not fully reconstruct within five years after an earthquake. Analyzing data from the 2015 Nepal earthquake, we find that non-recovery is associated with a wide range of factors beyond building damage, such as ongoing risks, population density, and remoteness. Had such information been available after the 2015 earthquake, it would have highlighted that many damaged areas differed in their ability to reconstruct because of these factors. More generally, moving beyond damage data to evaluate and quantify non-recovery will support effective post-disaster decisions that consider pre-existing differences in the ability to recover.


Introduction
predict the probability of non-recovery, which is able to capture nonlinear influences and interactions between variables and performed better than traditional modeling methods we also tested. Our model identifies important and realistic factors affecting non-recovery and also predicts a tangible outcome: non-recovery.

Figure 2. Study area in Nepal and non-recovery estimation approach. (a) The study area considered here comprises the 11 rural districts outside of Kathmandu Valley affected by the 2015 Nepal earthquake. The areas in blue were originally classified as severely hit (higher impact) and those in green as lower impact. (b) The model for non-recovery is calibrated on surveyed recovery outcomes and uses readily available predictor variables representing sociodemographic, environmental, and geographic factors likely to influence recovery capacity. Outputs include a spatial estimate of non-recovery, the relative influence of each variable, and a performance metric obtained by validating the model on a test set (see Methods for more information).

Our analysis reveals that eight predictors explain the probability of a damaged household completing reconstruction after the 2015 Nepal earthquake (Table 1). We categorise these predictors into three main categories: 1) hazard exposure, 2) rural accessibility and poverty, and 3) reconstruction complexity. Each of these categories has roughly the same number of predictors, indicating that all three are important for predicting non-recovery. These empirically identified categories linked with impeded recovery are consistent with those defined in other resilience studies in Nepal [14, 15] and broader frameworks of vulnerability [16][17][18]. The range of predictors indicates that reconstruction depends on a collection of socioeconomic, environmental, and geographic factors. Many studies recognize recovery and resilience as a multifaceted process with social and economic dimensions [19-21]. However, existing, rapidly available post-disaster information systems do not clearly acknowledge or account for this.

We calculated the marginal effect of each variable to evaluate its relative influence on predicted non-recovery, as shown in Figure 3. This figure shows the average relationship between each variable and reconstruction ability. Each variable generally has a trend where greater values lead to higher probabilities of non-recovery. However, these relationships are not purely monotonic and vary from household to household. This variation points towards the diverse and complex reality of recovery experienced by affected households. Because random forest models capture interactions between variables, these relationships represent the influence of one variable given the inclusion of all the other variables in the model.

investigation.

Additionally, areas with greater prevalence of pre-existing food poverty were less likely to recover. This relationship provides evidence that already marginalized communities face additional challenges during reconstruction. It also potentially reflects the intertwined relationship between food security, building damage, and reconstruction [25, 31], consistent with existing research in these areas.

tap water exhibits a similar relationship: greater prevalence of tap water in a region is associated with higher probability of non-recovery. Again, while infrastructure access can be viewed as promoting resilience, here it seems to be related to slowed reconstruction and warrants further research.

Topographic slope shows an influence on non-recovery beyond its link to hazard and accessibility. This is likely due to the difficulty of reconstructing on steep slopes or the increased costs associated with the retaining walls necessary in hillside communities [33].

Spatial distribution of non-recovery given damage
The model can be used to map the estimates of non-recovery. Figure 4a shows the probability of a household with a damaged construction types in the mountains. In contrast, Figure 4a shows that non-recovery is predicted to be likely scattered throughout the center, west, east, and south of the study region. This shows a pattern of non-recovery dictated by the spatial pattern of the social, geographic, and environmental predictors included (Figure ??). The map of non-recovery points to areas that were not originally estimated as the most impacted, but that would require support during their recovery due to their socioeconomic and geographic make-up.

To shift the focus from damaged buildings to vulnerable communities, we propose emphasizing and quantifying non-recovery broadly-applicable factors of vulnerability in technical information that can be used as a basis for recovery policies.
In addition to Nepal's multihazard risk, the country's geography and changing political landscape make its recovery unique.
Rural households face varying levels of remoteness to the nearest municipality, primarily due to the Himalayas' rugged terrain and the inability to access roads. After decades of a monarchist government (which transitioned to a multiparty democracy in the 1990s) [50, 51], Nepal underwent a decentralization and devolution process in 2015-2017 that transferred governing power from the central to local governments located in these municipality headquarters throughout the country [52]. Therefore, the importance of local governments for reconstruction increased throughout the recovery period [32, 53, 54].

Survey data
The field survey data used in this study were collected by The Asia Foundation (TAF) and local partner Inter Disciplinary surveyed, five of which were classified as "Severely-Hit" in the Post-disaster Needs Assessment, three as "Crisis-Hit," two as "Hit with Heavy Losses," and one "Hit," in order of most affected to least affected.

In this study, we considered households from the six rural districts classified as severely-hit and crisis-hit since these districts

Predictor data
We represented factors we expect to influence non-recovery with a set of 31 variables, which come from openly available census, remote-sensing, or modeled datasets. We used these variables rather than questions from the survey data because the goal is to implement this model to predict areas of non-recovery in the weeks after an earthquake. Therefore, we used predictor data accessible after an event, whereas survey data would take years to collect. Here, we describe predictor data for only those eight variables that were selected through the variable selection process as most important for predicting non-recovery. All other variables that were considered are listed in Table ??. Each predictor variable was produced or aggregated to different spatial scales (cells, wards, and LGUs), noted in Table ??.

To merge with the survey data, we extracted the value of each predictor at each household location. Once merged, we split the combined dataset into six folds using stratified random sampling to ensure each fold had roughly the same proportion of households that are reconstructed and not reconstructed as the full dataset. We also visually inspected whether each fold covered the same spatial distribution of the study area as the full dataset. We used five folds (84%) as the training set to build the model of non-recovery and one fold (16%) as the test set for evaluating how the model would perform on a future dataset.
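
The stratified split described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function name `stratified_folds`, the round-robin assignment, the seed, and the toy labels are all hypothetical, and the spatial coverage check mentioned in the text is not reproduced.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=6, seed=0):
    """Assign record indices to n_folds folds, stratified by outcome label."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold receives a near-equal
        # share of each outcome class.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds

# Hypothetical toy outcomes: 1 = not reconstructed, 0 = reconstructed.
labels = [1] * 30 + [0] * 90
folds = stratified_folds(labels)

# Hold out one fold as the test set and train on the remaining five.
test_idx = folds[-1]
train_idx = [i for fold in folds[:-1] for i in fold]
```

With these toy labels, every fold keeps the same 1:3 ratio of non-reconstructed to reconstructed households as the full dataset.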

For the spatial prediction of non-recovery over the study region (Figure 4a), we converted each proxy to the same resolution of 300 m by 300 m by resampling raster data or converting ward and LGU data to cells.

Models to predict probability of non-recovery
We developed a statistical relationship between the surveyed response of non-recovery (Y) and the suite of proxies (X) using the training set. Our goal was to predict the probability that a damaged household has not completed reconstruction given its proxy values, P(Y = 1 | X = x). We used a random forest, which is a non-parametric statistical model that averages the results of many individual, decorrelated decision trees [61]. Here we extended the typical random forest to predict the probability of each household belonging to each reconstruction outcome (1 = not reconstructed, 0 = reconstructed) [62]. To grow one tree in the random forest, a bootstrapped sample of the training dataset is recursively split into distinct subsets. Each split divides the data at that split, or parent node, into two child nodes. The parent node is split using the proxy variable that minimizes the mean squared error among a set of randomly selected features (mtry). For probability estimation, we continued to grow the tree until we reached the minimum nodesize of 10% of the bootstrapped sample. The probability of each node was the proportion of Y = 1's.
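
To make the splitting rule concrete, the sketch below finds the split on a single proxy that minimizes the mean squared error and reports the node probability as the proportion of Y = 1's. It is an illustrative simplification, not the authors' implementation: the helper names (`leaf_probability`, `node_sse`, `best_split`) and the toy data are hypothetical, and the random selection of mtry candidate features and the bootstrapping are left out.

```python
def leaf_probability(y):
    """Node probability: proportion of Y = 1 among the node's outcomes."""
    return sum(y) / len(y)

def node_sse(y):
    """Sum of squared errors of 0/1 outcomes around the node probability."""
    p = leaf_probability(y)
    return sum((yi - p) ** 2 for yi in y)

def best_split(x, y, min_node_size):
    """Return (threshold, mse) for the MSE-minimizing split on x, or None."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = None
    # Each child node must hold at least min_node_size records.
    for k in range(min_node_size, n - min_node_size + 1):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates identical values
        left = [yi for _, yi in pairs[:k]]
        right = [yi for _, yi in pairs[k:]]
        mse = (node_sse(left) + node_sse(right)) / n
        if best is None or mse < best[1]:
            best = ((pairs[k - 1][0] + pairs[k][0]) / 2, mse)
    return best

# Hypothetical toy data: slope in degrees, 1 = not reconstructed.
slope = [2, 4, 5, 7, 20, 22, 25, 30]
not_rebuilt = [0, 0, 0, 0, 1, 1, 1, 1]
threshold, mse = best_split(slope, not_rebuilt, min_node_size=2)
```

In this toy case the outcomes separate perfectly, so the chosen threshold falls midway between 7 and 20 and both child nodes have probabilities of exactly 0 and 1.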

This process was repeated for a designated number of trees (ntree). For our model, we tuned the hyperparameters using a grid search that minimized the mean squared error.

Because the random forest model is non-parametric, it does not require assumptions of the distribution of the data or specification of interaction terms. This is attractive for predicting non-recovery if a sufficient amount of training data is available because it allows for nonlinear relationships between the predictor variables and reconstruction outcome and for unexpected interactions to occur. We found the random forest outperformed (explained below) the standard probability prediction model, the logistic regression, both on the training and test sets (Figure ??).

Variable selection
To prevent overfitting and for practicality, we reduced the number of variables used in the non-recovery model. We ensured that none of the predictor variables were highly collinear by manually removing all but one variable from each set of variables with a Pearson correlation coefficient greater than 0.75 over the entire study region. Many of these variables tended to be a variation of the same class of predictors (e.g. remoteness to municipality versus remoteness to financial institutions).
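
The collinearity screen can be illustrated as follows. This is a hedged sketch under assumptions rather than the procedure actually applied: the text describes a manual choice of which variable to keep, whereas this stand-in greedily keeps the first variable encountered, and it compares the absolute value of the correlation against the threshold. The function names and toy values are hypothetical.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def drop_collinear(columns, threshold=0.75):
    """Keep variables in order, skipping any that is correlated above the
    threshold (in absolute value, an assumption) with one already kept."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept

# Hypothetical toy predictors; the two remoteness measures nearly coincide.
columns = {
    "remoteness_municipality": [1, 2, 3, 4, 5, 6],
    "remoteness_bank": [1.1, 2.0, 3.2, 3.9, 5.1, 6.0],
    "slope": [3, 1, 4, 1, 5, 2],
}
kept = drop_collinear(columns)
```

Here `kept` retains `remoteness_municipality` and `slope`, and drops `remoteness_bank` as a near-duplicate of the first remoteness measure.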

The variable selection occurred in two stages: one automatic and one manual. The automatic variable selection for the random forest was done by inserting a simulated noise variable and selecting all the proxy variables with a greater Gini importance [63] than that noise variable. To account for variation in the variable selection due to sample location, we repeated the model building process 1000 times using a bootstrapped sample of the training data. Through this automatic selection, we narrowed down the 31 original predictors to 12 variables that occurred more than 75% of the time in the 1000 models, shown in Supplementary Figure ??, and retrained a new random forest using these variables.
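
The noise-variable screen can be sketched as below. Note the substitution: instead of the Gini importance from a full random forest, this example scores each variable by the largest Gini impurity decrease of a single threshold split (a decision stump), purely to keep the example self-contained. The function names, the uniform noise variable, the toy data, and the use of 200 rather than 1000 bootstrap repetitions are all assumptions made for illustration.

```python
import random

def gini(pos, n):
    """Gini impurity of a node with `pos` positives among `n` records."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2 * p * (1 - p)

def stump_importance(x, y):
    """Largest Gini impurity decrease from one threshold split on x
    (a simplified stand-in for random forest Gini importance)."""
    pairs = sorted(zip(x, y))
    n, total_pos = len(pairs), sum(y)
    parent = gini(total_pos, n)
    best, left_pos = 0.0, 0
    for k in range(1, n):
        left_pos += pairs[k - 1][1]
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # no threshold separates identical values
        child = (k * gini(left_pos, k)
                 + (n - k) * gini(total_pos - left_pos, n - k)) / n
        best = max(best, parent - child)
    return best

def selection_frequency(columns, y, n_boot=200, seed=0):
    """Share of bootstrap repetitions in which each variable beats noise."""
    rng = random.Random(seed)
    n = len(y)
    counts = {name: 0 for name in columns}
    for _ in range(n_boot):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        y_b = [y[i] for i in rows]
        noise = [rng.random() for _ in range(n)]     # simulated noise variable
        noise_score = stump_importance(noise, y_b)
        for name, values in columns.items():
            if stump_importance([values[i] for i in rows], y_b) > noise_score:
                counts[name] += 1
    return {name: c / n_boot for name, c in counts.items()}

# Hypothetical toy data: one informative and one uninformative variable.
y = [0] * 50 + [1] * 50
columns = {"remoteness": list(range(100)), "constant": [1.0] * 100}
freq = selection_frequency(columns, y)
selected = [name for name, f in freq.items() if f > 0.75]
```

Only variables that beat the noise variable in more than 75% of the repetitions are retained, mirroring the threshold stated in the text.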

Once we reduced the variables through this automatic selection, we then manually inspected whether the remaining 12 variables provided predictive relationships that were consistent with other studies of Nepal's reconstruction. We removed an additional four variables (percentage with thatch roof, monsoon month precipitation, dry month precipitation, and percentage Dalit caste), as the trends found here were unexplained in the literature.

Recovery outcome-predictor variable relationships
The partial dependence plots shown in Figure 3 provide insight into so-called "black-box" statistical methods, like the random forest [61]. The dark red line is the average marginal effect of a proxy of interest, $X_S$, on the random forest function, $f(X)$, when all other complementary proxies, $X_C$, vary over the training data used to build the model of non-recovery. The resulting partial dependence function on $X_S$ can be estimated with:

$$\hat{f}_{X_S}(X_S) = \frac{1}{N} \sum_{i=1}^{N} f(X_S, X_{C_i})$$

where $X_{C_i}$ are the values of the proxy variables in the training data of size $N$. Here, we show these relationships for the training data, as indicated by the light red lines, which is the partial dependence function $\hat{f}_{X_S}(X_S)^{(i)}$ disaggregated for each household and centered to the minimum value of X

Validation
To evaluate the logistic regression and random forest models' performance, we calculated the area under the receiver operating