Setting and village selection
This study was conducted in 2016 in two counties located in the hilly regions of Sichuan Province, China. County surveillance records were used to select ten villages at high risk of reemergent or ongoing S. japonicum transmission. We conducted a census in each village, attempted to geocode the location of all households using handheld Global Positioning System (GPS) devices, and surveyed each household for S. japonicum infection, as described below. The number of households in the selected villages ranged from 19 to 75, with between 50 to 250 residents residing in each village at the time of data collection. We restrict this analysis to households for which: a) GPS coordinates were successfully recorded, and b) at least one household resident was tested for S. japonicum infection during the 2016 infection surveys. Of the 463 households identified during the census, a total of 283 households (61.1%) had both GPS and infection survey data and are therefore included in this analysis. See Figure 1 for details on household exclusion and inclusion.
Data collection and sources
Human infection data were collected in July 2016 as a part of ongoing research efforts in the region assessing persistent schistosomiasis hotspots. All village residents over the age of five were invited to participate in the study. Each participating individual was asked to provide three stool samples on consecutive days. All samples were labelled with the date of collection and participant ID numbers and stored in a cooler or cool room (ideally <10°C) until they could be transported to the central laboratory for processing. Samples were examined using the miracidium hatching test, following standard protocols (26). In brief, for each sample, 30 grams of stool was suspended in water (pH range of 6.8 – 7.2), strained to remove large particles (80 head nylon mesh), strained again to retain the remaining solids (280 head nylon mesh), and then suspended in water at room temperature (28 – 30°C). At 2-, 4- and 8- hours after suspension, the samples were examined for the presence of miracidia for at least 2-minutes each time. An individual was considered positive if any of the three hatch tests were positive.
The habitats of O. hupensis snails in the present study were determined during a national survey on O. hupensis that was conducted in 2016. Snail habitats were first identified by trained professionals from county anti-schistosomiasis control stations using historical records that date back to the 1950s. Historical habitats were digitized by global positioning and geographic information systems (GIS). Surveys of Oncomelania snails were then conducted in the field via transect walks between the months of April and October 2016 using standard systematic sampling methods (27). Briefly, each historic, existing or suspected snail environment was divided into sampling frames set every 5-10 meters along the water line for linear features (e.g., ditches, rivers, etc.) and every 5-10 meters along the periphery of polygon features (e.g., flooded paddy fields, ponds, etc.), with parallel lines extending from each to form a set of sampling frames of between 25 – 100 m2 covering each site. The majority of existing or suspected sites were characterized by shallow, stagnant or moving water (e.g. a stream, pond, rice paddy or irrigation ditch), as these conditions are the preferred habitat of the amphibious freshwater O. hupensis snails (28). For each site, ~20% of the sampling frames were randomly selected to be investigated on foot for the presence of snails. The digitized maps were updated using handheld GPS devices to document present and absent habitat locations, shapes, and whether and how historic habitats had been destroyed or changed (e.g., land use change via urbanization). For each site, the environment type was recorded as either a polygon or line feature in the dataset. Polygons most frequently represented rice paddy fields in the area, though they occasionally represented other habitat sites such as a small pond, a dry field or a beach. Meanwhile, line features most commonly represented dirt or concrete irrigation ditches used for flooding rice paddies, though it could also correspond with streams or other narrow waterways. For the purposes of this analysis, all polygons are referred to as “fields”, while all lines are referred to as “ditches”.
Data on waterbodies, waterways and roads in Sichuan Province, China were obtained from the OpenStreetMap project, downloaded on November 11th, 2021 (29). The OpenStreetMap project draws on local communities of mappers to build a knowledgeable database detailing roads, waterways, transportation and other built and natural environment features (30). OpenStreetMap Contributors use aerial imagery, handheld GPS devices, and field maps, both to generate the data and to verify the accuracy of the open data on a regular basis (30). The roads data from the OpenStreetMap project ranges from national freeways and motorways down to gravel tracks and paths, while the waterways and waterbodies included permanent water features such as large rivers, streams, canals, lakes and reservoirs. Details on the OpenStreetMap data used in this analysis can be found at: https://download.geofabrik.de/osm-data-in-gis-formats-free.pdf (31).
Elevation Data was obtained from the Earth Observation Research Center Japan Aerospace Exploration Agency’s (JAXA EORC) Advanced Land Observing Satellite (ALOS) global digital surface model, which has a horizontal resolution of approximately 30 meters (32). To calculate indices of vegetation and waterbody coverage, the U.S. Geological Survey’s (USGS) Earth Resources Observation and Science (EROS) Center’s image library from the Landsat Satellite 8 – Collection 1 was accessed from the USGS Earth Explorer website (https://earthexplorer.usgs.gov/) to obtain data on surface reflectance bands 2 – 5, as well as the QA band (33). The Landsat-8 satellite repeats its orbital pattern every 16-days (34), resulting in a total of 12 available observations across 2016 that occurred prior to our July infection surveys, which were downloaded for use in this study. The National Aeronautics and Space Administration’s (NASA) pre-processed Moderate Resolution Imaging Spectroradiometer (MODIS) Terra satellite imagery database was also accessed to obtain 250-meter resolution data on vegetation at 16-day intervals at all available timepoints in 2016 prior to the infection surveys (35).
Variable definitions and generation
Outcome variable
S. japonicum infection survey results from the ten study villages were aggregated to the household level and spatially joined to the geographic location of the home. To avoid issues with multicollinearity resulting from residents of the same household having the same values for all environmental predictors used in this analysis, the outcome was a binary measure of household infection status, with “0” indicating no infections detected among participating household members, and “1” indicating that one or more household member tested S. japonicum positive. Each household was represented in the geospatial dataset as a point feature.
Explanatory variables
Model 1: Snail survey data
Using the snail survey data collected in 2016, predictors were generated to assess how the household’s position in relation to surrounding snail habitats influence household-level S. japonicum infection risk (Table 1). The geocoded snail habitat data was divided into two categories: present snail habitat sites, and absent snail habitat sites. Present snail habitat sites were those sites where one or more snails were identified during the survey period, while absent snail habitat sites were those where snails were not found during the 2016 survey. The data were further grouped into “ditches” (i.e., line features deemed suitable for snail habitation) and “fields” (i.e., polygon features deemed suitable for snail habitation), resulting in four snail habitat categories: present ditches, present fields, absent ditches, and absent fields.
Table 1
List of predictors generated for each model.
Snail survey data model
|
Environmental predictors model
|
Presence sites (one or more snails found)
|
Open data
|
Ditches where snails were present
|
Built and natural environment
|
Distance from home to the nearest identified present ditch (km)
|
Distance to nearest waterway (km)
|
Total length of present ditches within 0.25 km radius of the home (km)
|
Distance to nearest waterbody (km)
|
Total length of present ditches within 0.5 km radius of the home (km)
|
Distance to nearest road (km)
|
Total length of present ditches within 1 km radius of the home (km)
|
Elevation (m)
|
Fields where snails were present
|
|
Distance from home to the nearest identified present field (km)
|
Remotely sensed data
|
Total area of present fields within 0.25 km radius of the home (km2)
|
Normalized Difference Water Index (NDWI)
|
Total area of present fields within 0.5 km radius of the home (km2)
|
Average NDWI within 0.25 km radius of the home
|
Total area of present fields within 1 km radius of the home (km2)
|
Average NDWI within 0.5 km radius of the home
|
|
Average NDWI within 1 km radius of the home
|
Absence sites (no snails found)
|
Normalized Difference Vegetation Index (NDVI)
|
Ditches where snails were absent
|
Average NDVI within 0.25 km radius of the home
|
Distance from home to the nearest identified absent ditch (km)
|
Average NDVI within 0.5 km radius of the home
|
Total length of absent ditches within 0.25 km radius of the home (km)
|
Average NDVI within 1 km radius of the home
|
Total length of absent ditches within 0.5 km radius of the home (km)
|
Enhanced Vegetation Index (EVI)
|
Total length of absent ditches within 1 km radius of the home (km)
|
Average EVI within 0.25 km radius of the home
|
Fields where snails were absent
|
Average EVI within 0.5 km radius of the home
|
Distance from home to the nearest identified absent field (km)
|
Average EVI within 1 km radius of the home
|
Total area of absent fields within 0.25 km radius of the home (km2)
|
|
Total area of absent fields within 0.5 km radius of the home (km2)
|
Other predictors included in the models
|
Total area of absent fields within 1 km radius of the home (km2)
|
Number of people tested in the household (N)
|
Other predictors included in the models
|
|
Number of people tested in the household (N)
|
|
NDWI = Normalized Difference Water Index
|
NDVI = Normalized Difference Vegetation Index
|
EVI = Enhanced Vegetation Index
|
For the snail survey data models, present sites are those where at least one snail was found, while absent sites are those where no snails were found during the 2016 snail surveys. For the environmental data models, NDVI and NDWI were calculated using Landsat-8, Collection 1 satellite data collected on January 23rd, February 8th, and April 28th (dates where there was < 30% cloud cover). Pre-processed EVI data from NASA’s Moderate Resolution Imaging Spectroradiometer (MODIS) Terra satellite was averaged across a total of 12 observations occurring at 16-day intervals between January 1st and July 10th, 2016.
Using ArcGIS Pro software (36), three different buffer sizes (0.25, 0.5 and 1.0 kilometer (km) radius length) were generated and applied to each household location using the “Buffer” analysis tool. These buffer radius lengths were defined such that the largest buffer (1 km) generally spanned the entire village area for a centrally located household, whereas the smallest buffer (0.25 km) spanned the immediate surroundings of a given household. The “Summarize Within” analysis tool was used to calculate the total length (km) of present ditches and absent ditches that fell inside each household buffer area. This step was repeated for the present fields and absent fields, calculating the total area of fields (km2) that were encapsulated by each household buffer. The “Near” analysis tool was then used to calculate the geodesic distance (in meters) between each household point and the nearest present ditch, absent ditch, present field, and absent field. The “Join Field” data management tool was used to join all of the newly created variables summarizing the length and area of ditches and fields into a single table, which was then exported to the project’s geodatabase for use in R. Including separate measures for the distance to the nearest ditch and nearest field, as well as the total length and area of ditches and fields within a given buffer area helped us determine whether these features varied in their relative predictive capacity or in their respective spatial scales (i.e. buffer sizes) of influence.
Model 2: Open-source Environmental data
Open-source environmental and remotely sensed data were compiled to create a geospatial dataset containing a range of hypothesized environmental (built and natural) predictors of household S. japonicum infection. Potential environmental predictors were selected if they were 1) previously identified or hypothesized in the literature to serve as predictors of schistosomiasis infection or snail habitat sites; and 2) made publicly available at a 250-meter resolution or finer for the entire study area. Most of the predictors represented natural features of the environment (e.g., waterbodies, elevation, vegetation indices, etc.). Human-made environmental features like roads were also included, as the relative remoteness or connectedness of a given household was hypothesized to be a factor associated with schistosomiasis infection status. Roads and waterways from the OpenStreetMaps project were included as line features, while waterbodies were water features coded as polygons. The “Near” Analysis tool was then used to calculate the geodesic distance (km) between each household and the nearest road, waterway, or waterbody.
Prior studies have suggested that elevation is negatively associated with the presence of O. hupensis snails (17, 37). We used 30-meter resolution elevation data from JAXA EORC’s ALOS satellite to extract the elevation (m) value that corresponded with each household point location.
The presence of either water or vegetation can provide opportunities for water contact and have the potential to impact human infection risk. In this study, we use the NDWI (38) to estimate water content across the study area, and the NDVI (39) and the Enhanced Vegetation index (EVI) (40) to describe vegetation health and density in the study area. The NDWI identifies water features and distinguishes them from soil and vegetation surfaces (38). The NDVI index is chlorophyll-sensitive and provides a measure of crop and vegetation health, while the EVI is more sensitive to canopy variations and performs particularly well in high biomass regions (41). As such, the two vegetation measures complement each other and are frequently used jointly in vegetation studies (41). Whereas NDWI and NDVI were calculated using data from the Landsat-8, Collection 1, pre-processed EVI data at 250-meter resolution was downloaded directly from NASA’s MODIS data library (35).
There was a total of 12 Landsat-8 satellite images collected between January and July of 2016, all of which were examined and processed to remove all medium to high confidence clouds, cloud shadows or other sources of terrain occlusion. To do this, the QA band files made available by USGS for each of the Landsat-8 observations (34) were used in ArcGIS Pro with the “Remap” raster function, to recode all grid cells corresponding with terrain occlusion as “No Data”, while all clear or low confidence cloud cells were set to equal 1. The “Clip” Raster function was then used to remove all cells obscured by cloud cover from each corresponding Landsat-8 image. Overall, cloud cover was high between January 2016 – July 2016 over the study area, ranging from 9.97–100%, with an average cloud coverage of 66.13% across the 12 satellite observations made within that period. To assess the extent to which the removal of cloud cover and terrain occlusion would result in missing data for each household at each time point, the “Raster to Points” tool was used to convert the cloud-corrected satellite data to a grid of points, and the “Extract Multi Values to Points” geoprocessing tool was used to extract the data corresponding to each household’s point location. A total of 3 of the 12 Landsat-8 collections had < 30% cloud coverage and had cloud-corrected data available for between 98.5–100% of households. The remaining 9 collections had cloud cover ranging from 33–100%, which resulted in data for between 0–65% of households. As such, we restricted our use of the Landsat data to the 3 collections with < 30% cloud coverage (collected on January 23rd, February 8th, and April 28th of 2016).
Using ArcGIS Pro’s “Raster Calculator” Image Analyst tool, the NDWI on January 23rd, February 8th and April 28th were each calculated from the cloud-corrected Green and Near Infrared (NIFR) Landsat Surface Reflectance bands (bands 3 and 5 in Landsat-8, respectively), using the following formula developed by McFeeters (1996) (38):
$$NDWI=\frac{\left(Green-NIFR\right)}{(Green+NIFR)}$$
NDVI on January 23rd, February 8th and April 28th was calculated from the cloud-corrected Red and NIFR Landsat Surface Reflectance bands (bands 4 and 5 in Landsat-8, respectively), using the following formula (39):
$$NDVI=\frac{\left(NIFR-Red\right)}{(NIFR+Red)}$$
The “Mosaic to New Raster” tool was then used to calculate an estimate of the average NDWI and NDVI across the three time points with sufficient data coverage.
EVI is calculated based on Blue, Red and NIFR Reflectance bands, as well as a soil adjustment factor (L), and two coefficients (C1 and C2) used to correct for aerosol scattering, as shown in the following formula (39, 41):
$$EVI= (1+L)*\frac{\left(NIR – Red\right)}{(NIFR +(\text{C}1*Red) –(\text{C}2 * Blue) + L)}$$
All 12 of the pre-processed, cloud-corrected EVI measures at 16-day intervals for the period between January – July 2016 were downloaded from NASA’s MODIS Terra Satellite Imagery database (35), and were joined into a single, average EVI measure using the Mosaic to New Raster” tool in ArcGIS Pro.
Each of the final NDWI, NDVI and EVI measures were converted to a grid of points using the “Raster to Multi Points” tool. As was done for the snail data, three different buffers sizes were generated around each household point (0.25, 0.5 and 1.0 km radius), and the “Summarize Within” analysis tool was then used to calculate the average NDWI, NDVI and EVI in the 0.25, 0.5 and 1 km area surround each household. See Table 1 for a summary of all predictors included in the environmental predictors model.
Analysis
To assess the predictive capacity of snail survey and environmental data to predict household S. japonicum infection status, and to compare the predictive performance of the model constructed using snail survey data to one that exclusively used open-source environmental data as predictors, a Random Forests (RF) machine learning approach was used. After the snail habitat dataset and the environmental predictors dataset were each generated in ArcGIS Pro, the R-ArcGIS Bridge from the ‘arcgisbinding’ R package was used to facilitate an easy transfer of data between ArcGIS and Rstudio for the RF analysis (42). Each dataset was split 75/25 for training and validation, respectively. For each training dataset, we oversampled the minority class to correct for class imbalance in our outcome variable (13.8% of households were S. japonicum positive). In total, three different balanced training datasets were generated for the snail data, and three for the environmental data, yielding a total of six balanced datasets that were used for RF model training. This approach allowed us to assess the stability of model performance metrics and variable importance rankings in light of our oversampling approach. The ‘caret’ package in R was used to perform a 10-fold cross validation process to tune each model, helping to determine the optimal maximum node size to use and the number of variables to try at each branch. For each RF model, we specified 5000 trees per forest, as a high number of trees is recommended to help stabilize variable importance rankings (43).
The reserved validation data was used to test each model and calculate performance statistics (accuracy, Cohen’s kappa statistic, receiver operator curve (ROC) area under the curve (AUC), sensitivity, specificity, positive predictive value (PPV) and the negative predictive value (NPV)). To compare performance between models, the best model was defined as the one with the highest kappa value, followed by accuracy and ROC AUC, respectively. The kappa statistic was selected as our main metric for indicating model performance because our reserved validation datasets had a high degree of class imbalance (13.8% of households were S. japonicum positive), and the kappa statistic was developed to help correct for bias related to over-rewarding the prediction of the majority class (44). Model accuracy was also compared to the No Information Rate (NIR), which indicates what the accuracy that would be expected to be if the majority class were predicted every time (NIR = 0.859). A high NIR value results when there is a high degree of class imbalance for the outcome of interest, as was the case in this study. Finally, in the event of a tie in the kappa and accuracy of two models, the ROC AUC was used to select a final, top performing model.
To determine which predictors were the most influential in predicting S. japonicum infection status in our models, the mean decrease in accuracy (MDA) values of predictors were visualized in variable importance plots for each model. For each of the three environmental data models and three snail data models, the top ten predictors indicated by the model’s MDA plots were given a score of 10 to 1 (10 being the score of the top predictor). Variable scores were then summed across the three models to create a three-model summary score of 0 to 30, 30 being the highest score possible, while a score of 0 indicates that the variable was never ranked among the top ten predictors. Simple logistic regression models and lowess plots were examined to determine the direction of association between household S. japonicum infection status and each predictor.
Prediction mapping
Using the top performing RF model, a map of the predicted probability of S. japonicum infection across the entire study area was generated. Within ArcGIS Pro, the “Raster to Point” tool was used to generate a grid of points covering the entire study area surface. The grid dataset was then exported to R using the R-ArcGIS Bridge to calculate the predicted probability of infection at each point across the study area. These predicted probabilities were added to the grid dataset and exported back to ArcGIS Pro for mapping. Finally, the “Point to Raster” tool was used to transform the predicted probabilities into a raster surface, using the “Mean” method for the cell assignment type. All analyses were conducted in ArcGIS Pro 2.8.3 and RStudio Version 4.1.2 (36, 42).