a. Study area
In 2018, we conducted a randomized controlled trial to evaluate active case detection among forest-going populations31 in southern Lao PDR, where 95% of the country's malaria transmission is concentrated32. The data used in this study were collected among the forest-goers enrolled in the Focal Test-And-Treat (FTAT) arm (Fig. 1), an intervention administered continuously in seven health center catchment areas (HCCAs) in Champasak province between March and November.
b. Data sources
FTAT survey
Fifteen teams of two peer navigators (PNs) were employed to scout forest-fringe areas in FTAT HCCAs for individuals presumed to engage in forest-going activities. To be eligible, these targeted “forest-goers” had to be older than 15 years and to have slept outside a village on more than one night in the previous month. PNs themselves were recruited from the local communities of forest-goers and trained to conduct various surveillance activities, including blood collection, malaria testing and referrals for treatment.
Upon recruitment of forest-goers in FTAT, PNs conducted an epidemiological survey covering the demographic, behavioral, occupational, and malaria knowledge and practice domains. To understand the mobility patterns of this population, PNs offered a convenience sample of the forest-goers a GPS logger that would record GPS coordinates as they carried it.
GPS data
In May, 53 GPS loggers (I-gotU 120) were dispatched across the 15 PN teams to be offered to enrolled forest-goers and carried for about two months. During that first cycle, loggers were configured to collect GPS coordinates every 30 minutes and were retrieved in July/August by the PN teams for data downloading. A second cycle of data collection was started in September with 69 GPS loggers configured to collect GPS coordinates every 15 minutes. Loggers were retrieved in November for data downloading. Recruiting PN teams also carried GPS loggers, configured to collect GPS coordinates every 30 minutes over the two cycles.
To simplify instructions, the GPS loggers were configured so that they could not be turned off by forest-goers or PNs. The logging intervals selected, 15 to 30 minutes, afforded an estimated 7 to 12 days of battery life. Loggers could be charged from wall outlets with regular phone chargers. To avoid battery depletion while on forest trips or off the grid, external charging devices (Verbatim®) and two sets of four individual AA lithium batteries were provided to recruited forest-goers. Participants were instructed to carry the GPS loggers at all times, to charge them frequently (at least once a week) and to meet the PNs again after two months for retrieval of the GPS loggers. PNs demonstrated all aspects of GPS logger use, including charging, to recruited forest-goers.
GPS logger retrieval questionnaire
After roughly two months, PNs met again with forest-goers to collect the GPS loggers in exchange for a $10 monetary incentive. Upon retrieval, a short questionnaire was administered to assess the feasibility of using GPS loggers to record the mobility patterns of forest-goers. In particular, the survey asked about forest-goers’ charging practices and logger use over the two-month study period.
c. GPS data processing
Data cleaning
The advertised precision of the I-gotU GPS loggers used in this study is 10 m. However, the manufacturer warns of possible large errors in the GPS coordinates collected, notably when the logger stays indoors for long periods of time and cannot connect with the satellites. To remove those erroneous GPS points, we used a filtering algorithm that identifies GPS points unusually far away from both the previous and next GPS points. See supplemental materials S1 for details.
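The filtering idea can be illustrated with a minimal Python sketch. It is not the algorithm described in supplemental materials S1: the function names, the 5 km threshold (`max_jump_m`) and the coordinates are all hypothetical, chosen only to show how a point far from both of its neighbours can be flagged.

```python
import math

def haversine_m(p, q):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000 * math.asin(math.sqrt(a))

def drop_spikes(points, max_jump_m=5000.0):
    """Drop points farther than max_jump_m from BOTH the previous and next point.

    max_jump_m is an assumed, illustrative threshold, not the study's value.
    The first and last points have only one neighbour and are always kept.
    """
    kept = []
    for i, p in enumerate(points):
        if 0 < i < len(points) - 1:
            far_prev = haversine_m(points[i - 1], p) > max_jump_m
            far_next = haversine_m(p, points[i + 1]) > max_jump_m
            if far_prev and far_next:
                continue  # isolated jump: treat as an erroneous fix
        kept.append(p)
    return kept

# Made-up track: the third fix jumps roughly 150 km away and back.
track = [(14.9000, 105.8600), (14.9005, 105.8605),
         (15.9000, 106.8600), (14.9010, 105.8610)]
cleaned = drop_spikes(track)
```

Requiring the point to be far from both neighbours keeps genuine fast movement, where each point is far from only one side of a consistent displacement, while removing single-fix jumps.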
Significant locations
The data collected by a GPS logger form a time series of GPS points that traces a trajectory (Fig. 2a). A cluster of GPS points indicates a location visited frequently or for long periods of time by the person carrying the GPS logger (or a location where the GPS logger was left behind). Using a method developed by Barraquand and Benhamou33 and implemented in the adehabitatLT34 package in R35 (version 4.0.5), we computed the residence time spent within a moving 50 m-radius circular window centered on every GPS point of the trajectory. Then, we used the biased random bridge kernel method36, implemented in the adehabitatHR34 R35 package, to estimate the utilization distribution (UD) on a 30 m by 30 m grid around the trajectory. The UD is a concept widely used in animal movement ecology that measures the utilization of space via the intensity of GPS point occurrence on the map. A significant location was defined as a 100 m-radius circle centered on a local maximum of the UD surface that contains at least one GPS point of the trajectory with a residence time above 2 hours. Simply put, a significant location is a 100 m-radius circle where the GPS logger stayed for more than 2 hours at least once along the trajectory.
Significant locations were mapped on top of earth terrain layers, using ESRI imagery in the leaflet R package, along with the GPS tracks, and were classified by visual inspection as forest, forest-fringe/rice field or village-based locations. Residence time at village-based significant locations, as well as the home village self-reported by forest-goers in the FTAT questionnaire, was additionally used to identify forest-goers' home locations. Finally, we used PNs' GPS tracks as well as their self-reported home villages to identify significant locations that resulted from our study’s activities, such as follow-up meetings at PNs' homes. GPS coordinates of forest-goers’ and PNs’ home villages were extracted from a list of geo-referenced villages in the province provided by the national malaria control program.
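As a rough illustration of the residence-time idea, the Python sketch below computes, for each GPS point, the length of the uninterrupted stretch of fixes that stay within 50 m of it. This is a deliberate simplification and not the adehabitatLT implementation: the published method also handles short excursions outside the circle and interpolates entry and exit times. The function name, the projected coordinates in metres and the example track are assumptions.

```python
import math

def residence_times(track, radius_m=50.0):
    """Simplified residence time per fix.

    track: list of (t_seconds, x_m, y_m), sorted by time, in projected metres.
    Returns, for each fix, the duration in seconds of the contiguous run of
    fixes (backward and forward) lying within radius_m of that fix.
    """
    def inside(k, xc, yc):
        return math.hypot(track[k][1] - xc, track[k][2] - yc) <= radius_m

    out = []
    for i, (_, xi, yi) in enumerate(track):
        first = i
        while first > 0 and inside(first - 1, xi, yi):
            first -= 1
        last = i
        while last < len(track) - 1 and inside(last + 1, xi, yi):
            last += 1
        out.append(track[last][0] - track[first][0])
    return out

# Made-up track, one fix every 30 minutes: six fixes around one spot,
# then two fixes while moving away.
xy = [(0, 0), (10, 5), (5, -5), (20, 10), (15, 0), (8, 8), (500, 400), (1000, 800)]
track = [(1800 * k, x, y) for k, (x, y) in enumerate(xy)]
rt = residence_times(track)
long_stay = [r > 2 * 3600 for r in rt]  # candidate points for a significant location
```

In this toy track the first six fixes each get a residence time of 2.5 hours and would pass the 2-hour criterion, while the two moving fixes get zero.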
Outdoor trips
A trip was defined as a series of consecutive GPS points between two GPS points recorded at the forest-goer's home location. Trips going through an outdoor-based significant location (forest or forest-fringe/rice field) qualified as outdoor trips (Fig. 2B), but trips where a forest-goer toured the forest for hours without stopping at a single location (Fig. 2C) could also be classified as outdoor trips. To identify those other outdoor trips (Fig. 2C), we first used a random forest algorithm to learn the relationship between our classification of significant locations as outdoor or village-based and the following covariates: number of OpenStreetMap37 buildings or places, total 2015 population and average 2018 tree crown cover within 100 m, and distance to the closest village in the province. Tree crown cover layers came from Hansen38 and population from WorldPop39. We then used the fitted model to classify GPS points outside significant locations as outdoor or village-based. Finally, outdoor trips were defined as trips that include an outdoor-based significant location or a series of consecutive outdoor GPS points adding up to more than two hours. Simply put, an outdoor trip is a trip where the forest-goer spent more than two consecutive hours outdoors. Trips going through a significant location that resulted from our study’s activities were discarded as unrepresentative of the forest-goers’ routine.
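The trip definition and the two-hour rule can be sketched in Python as follows. This is only an illustration under stated assumptions: each fix is assumed to already carry an `at_home` flag and an `outdoor` flag (the latter standing in for either the significant-location classification or the random forest prediction), and the function names and example data are hypothetical.

```python
def split_trips(fixes):
    """Split a track into trips.

    fixes: list of (t_seconds, at_home, outdoor), sorted by time.
    A trip is a run of away-from-home fixes enclosed by two at-home fixes;
    fixes before the first or after the last at-home fix are ignored.
    """
    trips, current, seen_home = [], [], False
    for fix in fixes:
        if fix[1]:                      # fix recorded at the home location
            if seen_home and current:
                trips.append(current)   # close the trip that just ended
            current, seen_home = [], True
        elif seen_home:
            current.append(fix)
    return trips

def longest_outdoor_run_s(trip):
    """Longest stretch of consecutive outdoor fixes, first to last fix, in seconds."""
    best, start = 0, None
    for t, _, outdoor in trip:
        if outdoor:
            if start is None:
                start = t
            best = max(best, t - start)
        else:
            start = None
    return best

def is_outdoor_trip(trip, min_s=2 * 3600):
    """Apply the 'more than two consecutive hours outdoors' rule."""
    return longest_outdoor_run_s(trip) > min_s

# Made-up fixes every 30 minutes: a long outdoor trip, then a short village errand.
fixes = ([(0, True, False)]
         + [(1800 * k, False, True) for k in range(1, 7)]
         + [(1800 * 7, True, False), (1800 * 8, False, False),
            (1800 * 9, False, True), (1800 * 10, False, False),
            (1800 * 11, True, False)])
trips = split_trips(fixes)
flags = [is_outdoor_trip(t) for t in trips]
```

Measuring a run from its first to its last fix is one possible convention; with 15- or 30-minute logging intervals the true time outdoors may be somewhat longer than the value computed this way.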
d. Cluster analysis
For each outdoor trip, we computed the mobility pattern parameters listed in Table 1. They were selected to characterize forest-goers’ exposure to the dominant malaria vectors in the GMS, An. dirus and An. minimus13,14, throughout the trip. Four domains were covered. Two domains, forest surroundings and timing of the trips, pertain directly to the ecology of these mosquitoes, which thrive in a forested environment and bite during nighttime and around twilight and dawn hours (6 pm and 6 am). The two other domains, pace and fragmentation of the trips, reflect the possible organization and habits of those trips and can influence vector control options. For instance, it may be easier to carry bed nets over short distances, and frequently visited locations along trips may be arranged to offer better mosquito protection.
Table 1. Mobility patterns variables.
| Domain | Forest | Pace | Fragmentation | Timing |
|---|---|---|---|---|
| Variables | Average 2018 tree crown cover | Duration | Number of different significant location | Overnight trip |
| | Max 2018 tree crown cover | Distance | Proportion of trip spent at significant location | Trip around twilight and/or dawn hours (6 am and/or 6 pm) |
| | Proportion of trip where 2018 tree crown cover > 50% | Max speed | | |
| | Population density | | | |
Mobility patterns variables computed for each of the outdoor trips and used as features in the clustering algorithm (after normalization, standardization, and projection onto the principal components).
Variables in Table 1 were standardized by subtracting the mean and dividing by the standard deviation, and right-skewed variables (pace and population density) were log-transformed. We then used principal component analysis to project the variables onto the principal components (PCs) that captured 95% of the variability in the dataset. Next, hierarchical clustering with the complete-linkage method was applied to the selected PCs to explore the clustering structure of the data. The hierarchical clustering algorithm starts with one observation per “leaf” (= cluster) and progressively merges the most similar clusters, one pair at a time, until all observations are grouped in a single cluster. An advantage of hierarchical clustering over other clustering algorithms such as k-means is that the number of desired clusters, k, does not need to be set in advance. Instead, the resulting dendrogram represents the clustering structure for all k from 1 to n, the number of observations. The length of the tree branches quantifies the dissimilarity between the leaves and can be used to assess how many clusters should represent the structure of the data. The intra-class correlation coefficient (ICC) for the input variables in Table 1 was also computed for different choices of k to evaluate how many clusters would best capture the variability in the dataset.
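To make the agglomerative procedure concrete, here is a minimal pure-Python sketch of standardization followed by complete-linkage clustering. It skips the principal component projection and the ICC computation, is far less efficient than a library routine, and uses hypothetical function names and made-up feature values (a log-duration and a tree crown cover percentage per trip).

```python
import math
from statistics import mean, stdev

def standardize(rows):
    """Subtract the column mean and divide by the column standard deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]

def complete_linkage(rows, k):
    """Agglomerative clustering with complete linkage, stopped at k clusters.

    Starts with one cluster per observation and repeatedly merges the two
    clusters whose FARTHEST members are closest. Returns lists of row indices.
    """
    clusters = [[i] for i in range(len(rows))]

    def linkage(a, b):
        return max(math.dist(rows[i], rows[j]) for i in a for j in b)

    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Made-up trips: three short trips in open land, three long trips under canopy.
trips = [[1.0, 10], [1.2, 12], [0.9, 11], [8.0, 80], [8.5, 85], [7.8, 78]]
groups = complete_linkage(standardize(trips), k=2)
groups_sorted = sorted(sorted(g) for g in groups)
```

Stopping at a fixed k is only a convenience for the example; running the loop down to a single cluster and recording each merge distance would yield the heights needed to draw the dendrogram.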
Finally, mobility pattern characteristics in Table 1 were summarized for each of the clusters identified and plotted to determine the heterogeneity between the clusters, describe their distributions across the trips, and attempt to classify the type of trips identified in each of the clusters.
e. Regression analysis
Nighttime outdoor trips in clusters with high forest penetration were classified as “high-risk” trips given the higher probability of exposure to malaria vectors. Gradient boosting trees were then used to assess which of the forest-goers’ socio-demographic and behavioral characteristics collected in the FTAT survey best predicted their likelihood of engaging in such high-risk trips. Gradient boosting was selected because it is a supervised learning algorithm that can accommodate missing values and model non-linearities. Importantly, its implementation in the GPboost40 R35 package allows for forest-goer-level random effects, which account for the correlation among multiple outdoor trips made by the same forest-goer. Automated grid search and 4-fold cross-validation were used to select the tuning parameters.
Results are presented using SHAP (SHapley Additive exPlanations) values41, a tool increasingly used to interpret machine learning models. SHAP values attribute an importance value to each feature for each prediction. They make it possible to rank features by their ability to predict the outcome and to visualize the adjusted non-linear relationship between each predictor and the outcome.
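The attribution principle behind SHAP values can be shown with an exact Shapley computation on a toy model. The Python sketch below is not the procedure used with the fitted gradient boosting model: it enumerates all feature coalitions, which is only practical for a handful of features, and it replaces "absent" features with a single baseline vector rather than averaging over background data. The toy prediction function and its three features are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one prediction.

    Features outside a coalition take their value from `baseline`
    (a simplifying assumption). Returns one attribution per feature.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                s = set(s)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi

def toy_risk(z):
    """Hypothetical risk score with one additive term and one interaction."""
    return 0.1 + 0.3 * z[0] + 0.2 * z[1] * z[2]

x = [1, 1, 1]
baseline = [0, 0, 0]
phi = shapley_values(toy_risk, x, baseline)
```

In this example the attributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is split equally between the two features involved.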