Objective, scope, and data sources:
Plan-EO’s mission is to produce, curate, and disseminate spatial data products relating to the distribution of enteric pathogens and their environmental and sociodemographic determinants. Our approach is to compile, maintain, and grow a large database of georeferenced results from studies that diagnosed EIDs in children in LMICs, along with spatiotemporally matched covariates. From 2018 to 2022, under a previous project named Global Earth Observation for Monitoring Enteric Diseases (GEO-MED), we began sourcing and compiling a central repository of stool-level microdata collected at study sites in numerous LMICs that together represent the broadest and most representative range of currently available climate zones and environmental contexts. The rationale was that data from multiple sites and studies can offer insights into the general epidemiology of EIDs that might be biased by, or not apparent from, considering just a single location [35, 36]. Drawing broad, generalizable conclusions about the impact of the environment on enteropathogens therefore requires combining data from locations that are representative of diverse ecological zones [37]. Several initial analyses have been published using this database [19, 37, 38], and Plan-EO is now maintaining and expanding this existing resource.
Through professional networks and exploratory literature reviews using online search engines and databases (such as PubMed and ResearchGate), published studies are identified that meet the following criteria: a). analyzed stool samples collected from children under 5 years of age; b). used PCR or equivalent molecular diagnostics to detect enteropathogens in samples (ensuring comparable sensitivity across studies and pathogens); c). were carried out in one or more LMICs (as defined by the OECD [39]); d). recorded the dates of sample collection and the approximate location of study subjects’ residences (to enable spatiotemporal referencing).
Priority is given to studies that have large numbers of samples, that diagnosed multiple enteropathogens of different taxa (viruses, bacteria, protozoa) in the same samples, and that took place in countries or contexts not yet represented in the database. An initial list of pathogens has been selected based on their being either highly endemic or responsible for high diarrheal disease morbidity in LMICs [40], and so as to be representative of the three major enteropathogen taxa. These include five enteric viruses (adenovirus, astrovirus, norovirus, rotavirus and sapovirus), three bacteria (Campylobacter, ETEC and Shigella) and two protozoa (Cryptosporidium and Giardia). A saved search is scheduled using the National Center for Biotechnology Information (NCBI) online tool so that newly published studies that may be eligible for collaboration are summarized in automated monthly emails.
Investigators on eligible studies are contacted with a request for access to data from individual participants and, if they respond and agree, data use agreements (DUAs) are established with the collaborating institution. Variables requested from contributing studies include:
- Infection status for each pathogen diagnosed in each stool sample.
- Date of sample collection.
- Subjects’ age on that date.
- Whether the sample was collected during a diarrheal episode (cases) or while the subject was asymptomatic (controls).
- Country, study, and site in which the subject was recruited and whether the study was health facility- or community-based.
- Geographic data. Household location coordinates are used where available; otherwise, subjects are georeferenced to the centroid of their neighborhood, village, or district or, where such information is unavailable, to the geographical location of the health facility that recruited them.
- Additional subject-level factors such as sex, anthropometric and feeding status, and household, maternal and clinical information where available (see Table 1).
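The fallback order for geographic data listed above can be illustrated with a minimal Python sketch. The class and function names (`SubjectLocation`, `georeference`) and the source labels are hypothetical and chosen only for illustration; they do not describe the project's actual processing code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# (latitude, longitude) in decimal degrees
Coord = Tuple[float, float]


@dataclass
class SubjectLocation:
    """Hypothetical container for whatever location data a study provides."""
    household: Optional[Coord] = None        # household coordinates
    admin_centroid: Optional[Coord] = None   # neighborhood, village or district centroid
    health_facility: Optional[Coord] = None  # recruiting health facility


def georeference(loc: SubjectLocation) -> Tuple[Optional[Coord], str]:
    """Return the most precise available coordinate and a label for its source."""
    if loc.household is not None:
        return loc.household, "household"
    if loc.admin_centroid is not None:
        return loc.admin_centroid, "centroid"
    if loc.health_facility is not None:
        return loc.health_facility, "health_facility"
    return None, "missing"
```

Keeping the source label alongside the coordinate makes it possible to treat facility-referenced samples differently from household-referenced ones at the covariate extraction stage.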
Figure 1 depicts the flow of data and processes within the Plan-EO project. Once a DUA is fully executed, study-specific databases are securely transferred using a link to an encrypted, cloud-based folder, saved to a secure HIPAA-compliant server (A.), and subsequently deleted from the cloud. Study-specific datasets are then processed and combined into a pooled central study database (B.) with a standardized format and list of variables in accordance with the PRIME-IPD tool for verification and standardization of study datasets retrieved for Individual Participant Data Meta-Analyses (IPD-MA) [41]. Sample data for which coordinates are unavailable are georeferenced by cross-referencing them with online mapping tools and other sources to obtain their latitude and longitude in decimal degrees. The original, study-specific identifiers (IDs) are removed along with any HIPAA-classified IDs, and each subject is instead assigned a unique ID that is specific to this project and cannot be matched back to the original, study-specific IDs. Pathogen positivity data are then linked with covariates, which fall into three main categories:
a). Subject- and household-level covariates: Most contributing studies conduct baseline and/or follow-up assessments of information relevant to EID transmission risk and vulnerability. Examples are summarized in Table 1. These data are recoded to match standard variable definitions, units, and categories as closely as possible. Where these variables are missing or were not collected by some studies, values are imputed or interpolated (C.) based on household survey data according to methods described previously [19]. Briefly, equivalent data are extracted from individual child-level microdata collected in Demographic and Health Surveys (DHS) [42], Multiple Indicator Cluster Surveys (MICS) [43], and some country-specific surveys and combined into a parallel pooled survey database (D.) that is coded identically to the pooled study database. Survey data from the same survey strata (region and urban/rural status) in which the study sites were located are appended to the study database. Various methods can then be applied to interpolate or impute missing values based on this locally relevant information.
b). Environmental spatial covariates: A set of time-static environmental and sociodemographic spatial covariates is compiled in raster file format based on their hypothesized or demonstrated associations with diarrheal disease outcomes (E.) [44]. These are summarized in Table 2. Having georeferenced each sample to the approximate location of the subject’s residence, the variable values are extracted at these coordinate locations using spatial analytical tools (F.). For samples georeferenced to health facilities, covariates are averaged over a theoretical catchment area represented by a 20 km buffer around the facility location using the ArcMap Zonal Statistics tool; otherwise, they are extracted to household or community coordinates using the Extract Values to Points tool [45].
c). Time-varying hydrometeorological variables: A set of historical daily EO- and model-based re-analysis-derived estimates of hydrometeorological variables has been selected based on their demonstrated or hypothesized potential to influence enteric pathogen transmission [37]. These are extracted (F.) from version 2.1 of the Global Land Data Assimilation System (GLDAS – G.) [46] and are summarized in Table 3. Because of the lagged effect of weather on pathogen transmission, daily hydrometeorological variables are averaged or summed over a 7-day lagged period of exposure spanning 9 to 3 days prior to the date of sample collection (t-9 to t-3, where t0 is the date of sample collection), using methods described previously [37]. This time window and lag period can be adjusted according to the incubation period of specific pathogens.
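The lagged exposure window described for the hydrometeorological variables can be sketched in Python as follows. This is a minimal illustration that assumes daily values are held in a dictionary keyed by date; the function names (`lag_window`, `summarize_exposure`) and parameters are hypothetical, and the sketch is not the extraction code used with GLDAS.

```python
from datetime import date, timedelta
from statistics import mean
from typing import Dict, List


def lag_window(sample_date: date, start_lag: int = 3, end_lag: int = 9) -> List[date]:
    """Dates from t-end_lag to t-start_lag inclusive, oldest first."""
    return [sample_date - timedelta(days=k) for k in range(end_lag, start_lag - 1, -1)]


def summarize_exposure(daily: Dict[date, float], sample_date: date,
                       how: str = "mean", start_lag: int = 3, end_lag: int = 9) -> float:
    """Average or sum a daily variable over the lagged exposure window.

    Raises KeyError if any day in the window is absent from `daily`.
    """
    values = [daily[d] for d in lag_window(sample_date, start_lag, end_lag)]
    if how == "sum":
        return sum(values)
    if how == "mean":
        return mean(values)
    raise ValueError("how must be 'mean' or 'sum'")


# Usage: toy daily series for January 2020 where each value equals the day of month.
daily = {date(2020, 1, d): float(d) for d in range(1, 32)}
window_mean = summarize_exposure(daily, date(2020, 1, 10))             # mean over 1-7 January
window_total = summarize_exposure(daily, date(2020, 1, 10), how="sum")  # sum over 1-7 January
```

With the defaults the window covers seven days (t-9 to t-3). Changing `start_lag` and `end_lag` is one way to adapt the window to the incubation period of a specific pathogen, and choosing `how="sum"` suits cumulative quantities where a total is more meaningful than an average.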
Statistical methods:
The resulting database is in a format flexible enough for numerous statistical modeling approaches to be applied to address specific research questions, make inferences about underlying biological processes, and generate prediction maps that identify geographical foci of transmission risk. For example, in a preliminary analysis of Shigella, generalized multivariable models were fitted within a Bayesian framework to derive population-level conditional effects of the predictors [38]. The effect estimates from model outputs can then be extrapolated to all unobserved locations within the target domain for which covariate raster values are available to make predictions (H.). Household-level variables, such as water supply, sanitation coverage, and women’s education, have been geospatially mapped across LMICs by the Local Burden of Disease (LBD) project [47, 48], and Plan-EO investigators are finalizing our own improved estimates of these and others (such as housing material, crowding, and livestock ownership – I.). Furthermore, subnational data on host-level factors such as breastfeeding and nutritional status, which also determine pathogen infection risk, can be sourced from LBD and household surveys [49–51]. By including model terms for symptom status (diarrheal or asymptomatic) and study type (health facility- or community-based), it is possible to make separate predictions for positivity in asymptomatic individuals, those experiencing a diarrheal episode, and those seeking care for diarrhea. The models will be re-fitted and the results updated each time a new study database is added.
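To illustrate how terms for symptom status and study type allow separate predictions for the three groups, the following Python sketch uses a plain logistic link with fixed coefficients. It is a deliberate simplification and not the Bayesian model of [38]: the coefficient values, the function names, and the mapping of "seeking care" to a diarrheal sample from a health facility-based study are all assumptions made for illustration.

```python
import math
from typing import Dict


def inv_logit(x: float) -> float:
    """Map a linear predictor onto the probability scale."""
    return 1.0 / (1.0 + math.exp(-x))


# Assumed coding: (diarrheal indicator, health facility-based indicator)
SCENARIOS = {
    "asymptomatic": (0, 0),       # asymptomatic, community-based
    "diarrheal_episode": (1, 0),  # diarrheal, community-based
    "care_seeking": (1, 1),       # diarrheal, health facility-based
}


def scenario_predictions(base_lp: float, beta_diarrhea: float,
                         beta_facility: float) -> Dict[str, float]:
    """Predicted positivity per scenario at one location.

    base_lp is the linear predictor from the environmental covariates at that
    location; the two betas are hypothetical effect estimates.
    """
    return {
        name: inv_logit(base_lp + beta_diarrhea * d + beta_facility * f)
        for name, (d, f) in SCENARIOS.items()
    }


# Usage with made-up values: one grid cell, three predicted prevalences.
preds = scenario_predictions(base_lp=0.0, beta_diarrhea=0.5, beta_facility=0.3)
```

Applying the same function to every grid cell's linear predictor would yield one prediction surface per scenario.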
Dissemination and stakeholder engagement:
Plan-EO will be established as an interinstitutional initiative consisting of two components:
a). An interactive web-based dashboard: We will establish a data access and visualization system and a suite of interactive maps to collate and disseminate the data products (comparable to WorldPop [33], the Malaria Atlas Project [31], or the DHS Program’s Spatial Data Repository [52]). It will be built using an open-source platform and will provide users with an interactive portal to explore the resulting pathogen-specific risk maps (J.) and the pre-processed environmental and EO-derived spatial data outputs. This repository of products will be continually updated and made publicly available to the research and stakeholder communities, both for viewing within the webpage itself and for download in commonly used raster formats such as TIFF. Upon visiting the Plan-EO homepage, the user will be presented with a world map-based interface and a series of drop-down menus with options to choose which pathogen to view and whether to view observed or predicted prevalence. The observed prevalence option will display pin icons at locations where the prevalence of the selected pathogen has been measured by a study, with colors corresponding to the type of study design and size proportional to the number of samples analyzed. When the user clicks on a pin, a smaller window will appear giving more information about the study site and a hyperlink to the publication in PubMed, as shown in the illustrative example in Figure 3a. The locations will be based solely on information reported in the publication (e.g., district centroids, named health facilities), and the window will report only aggregated statistics and no subject-specific information. The predicted prevalence option will display the gridded model output surface as a map layer, as illustrated in Figure 3b. The user will be able to zoom in and pan over the map and click on locations to obtain prediction values.
As the project progresses, we will build a catalogue of layers, including predictions for each pathogen and the covariates being produced. These layers can be superimposed on the map, toggled on and off, downloaded as files, imported into a GIS, and used in further analyses by the end user.
b). An international consortium of investigators: A global network of collaborating researchers (with a majority being early-career and/or from LMICs) will be fostered and coordinated from the Plan-EO headquarters at the University of Virginia (UVA). Investigators from contributing studies will be invited to join the Plan-EO network, and their names, institutional affiliations and contact information will be entered into a database. This will be used both to track the details of individuals to be included as co-authors on publications that rely on their data and as a mailing list of contacts to whom emails will be sent periodically with updates regarding preliminary results, publications, new members, and manuscripts for review.
Ethical considerations:
All health information used in the Plan-EO project will be secondary data from studies and surveys that have already been carried out in LMICs by investigators at various institutions around the world and that obtained consent for future use of health information from subjects’ caregivers. All investigators with access to the main Plan-EO database will have completed certifications in responsible human subject research. The original study-specific databases will be securely deleted from Plan-EO servers when the project ends unless superseding DUAs are established. The project’s data management and transfer plan has received ethical approval from the IRB of the UVA School of Medicine (IRB-HSR #220353), and the protocol has been registered as an IPD-MA in the PROSPERO prospective register of systematic reviews (CRD42023384709). All publications will follow the PRISMA-IPD [53] guidelines for IPD-MAs and the GATHER [54] guidelines for disease burden estimation.