For this analysis the topic of childhood elevated blood lead levels (EBLL) was selected because of its historical record of research, it will help public policy decision makers to better understand the R&D production in this field, and because data on this topic is accessible. Also, spatiotemporal analysis an appropriate methodology for public health topics (Gui-Quan Sun et al., 2016). The data was extracted from a body of public health research articles on the topic of childhood EBLL. This corpus was curated by the librarians from the U.S. Environmental Protection Agency as a representative list of all available childhood EBLL research between 1990 and 2019. All the research articles identified were conducted in the U.S. or Canada and written between 1990 and 2019. Three information content aggregators were used to extract the body of research — PubMed, Web of Science, and ProQuest. Each of the information content aggregators were filtered for English language, scholarly articles, and peer-reviewed articles. The search terms used for PubMed included: “child” or “child, preschool” or “infant” or “infant, newborn” or “adolescent” and “lead/blood.” The search terms used for Web of Science included: “blood lead” or “lead level*” or “lead exposure” or “exposure to lead” or “Pb exposure” or “Pb level*” or “environmental lead” and “children” or “pediatric.” And the search terms used for ProQuest included: “blood lead” or “lead levels” or “lead exposure” or “exposure to lead” or “Pb exposure” or “Pb levels” or “environmental lead” and “child*” or “pediatric.” The output from these three content aggregators, on the topic of childhood EBLL, in the U.S. and Canada, and between 1990 and 2019 was 1,073 scientific journal articles. These articles were used to extract data for this analysis.
The first step for data extraction was to identify the geographic aggregation unit that would be used for the analysis. The criteria for selecting a geographic aggregation unit were 1) a unit that could easily be extracted from all the journal articles, and 2) a unit that is commonly used to aggregate other data. The U.S. state level geographic unit was selected. The decision to use this aggregation unit excluded Canada from the analysis. Next, the geography data was manually extracted from the 1,073 childhood EBLL research articles. The articles’ titles and abstracts were manually reviewed, and the U.S. state where the research was conducted was extracted and documented; this is referred to as the “research location.” Study location does not necessarily align with the location of the authors’ affiliation. Articles that did not have a study location were excluded from the analysis.
In addition, article subtopics were manually extracted and identified as either “environmental” or “non-environmental” studies. Environmental studies were divided into subcategories: “indoor”, “outdoor”, or “both”. “Indoor” refers to potential indoor Pb sources and/or measurements inside a residence or structure, such as paint or dust concentrations. “Outdoor” refers to potential outdoor Pb sources and/or measurements, such as ambient air or soil measures. And “both” refers to studies that included both “indoor” and “outdoor” measurement and/or modeling components. Articles classified as “non-environmental” typically focused on blood Pb and health outcomes without any consideration of environmental exposure sources. This process was conducted by two separate researchers, independent of one another, and then validated for quality assurance and control. This data and manual data extraction process is the same used in a previous study which demonstrated the use of research location for a knowledge synthesis methodology (Grokhowsky, 2023).
The publication years, publication geographies, and journal names were provided by the content aggregators, and stored in the metadata file included with the literature search. To access this data, the research articles’ metadata was parsed, and the publication years were stored, publication zip codes were stored, and journal names were stored. To identify the publication location at the U.S. state-level, each zip code was cross-referenced with the U.S. Census Bureau’s zip code shapefiles and the corresponding state was recorded. The output of this process provided a column of the years the articles were published, U.S. states where the articles were published, and the journal names that published the articles. The U.S. state where the articles were published is referred to as the “publication location.”
Using the data extracted from the 1,073 childhood EBLL research articles, quantities by geography were calculated for 1) the observed number of journals that published childhood EBLL research in each state; 2) the observed number of articles published in each state; 3) the observed number of articles researched in each state; 4) the observed number of articles published each year in each state; and 5) the number of articles for each subtopic in each state. Correlations for these data points were calculated. The publication year and subtopic frequencies were then derived into binary indicator variables. This was done by replacing the frequency values equal to 0 with a 0 and frequency values greater than 0 with a 1 (e.g., a state with 0 articles written about a subtopic would be recorded as 0 while a state with 1 or more articles written about a subtopic would be recorded as 1). This was conducted for both the quantity of articles published each year and the quantity of articles for each subtopic. Then the derived indicator variables were encoded as categorical values.
Time series analysis was performed by grouping the quantity of childhood EBLL research articles by publication year. The quantity of articles was adjusted for overall research growth equal to 4% per annum (Johnson et al., 2018). An exponential decay formula was used to make this adjustment. The decayed article quantities were evaluated using a time series plot resulting in three distinct periods (1990–1995, 1996–2015, and 2016–2019) and confirmed using Chow tests. These same time periods were found in a previous analysis of this data (Grokhowsky, 2023). Chow tests were conducted by creating three binary indicator variables for each time period and regressing them to the decayed quantity of research in each location using ordinary least squares regression. Thus, three additional indicator variables were derived to represent the time periods published.
In addition to extracting and grouping data from the research articles, feature variables were identified and obtained at the U.S. state level. Sociodemographic variables and structure ages have previously been correlated with EBLL (Xue et al, 2022; Teye SO et al., 2021; Hauptman et al, 2021). Therefore, sociodemographic and structure age data were collected as potential feature variables. Additionally, economic and environmental variables were obtained, as these may also influence research production on childhood EBLL. All feature variables were collected for the timespan between 1990 and 2019, and the mean values per U.S. state were used. In some cases, the timespan was averaged by the available dates between 1990 and 2019 (i.e., mean of data collected every three years or mean of data collected every decade). The primary data sources included the U.S. Census Bureau, the U.S. Bureau of Economic Analysis, the U.S. Bureau of Labor and Statistics, the U.S Environmental Protection Agency, the U.S. Department of Agriculture, the U.S. Geological Survey, and the National Science Foundation. Each feature variable was downloaded using either the governing body’s data portal or an application portal interface (API). All available data for the time period between 1990 and 2019 was downloaded and averaged to create the feature variable dataset. This resulted in 89 feature variables (Table S-1). For the sociodemographic variables that were reported as total units, the per capita value was calculated to account for the total population. Also, feature variables with 5% or fewer missing data were imputed using simulations or mean values from the surrounding U.S. states. If more than 5% of the data was missing the variable was removed from the analysis. Imputation was used for social and economic factors and neighborhood averaging was used for physical factors. For example, a simulated imputation procedure was completed for Wyoming’s “Management Earnings” variable, and the average values were calculated using Maryland’s and Virginia’s values for the District of Columbia’s “Soil Pb Mean,” and “Soil Pb Median” variables. It was done this way because Maryland and Virginia are the two U.S. states surrounding the District of Columbia. In this case we assumed that near things are more alike than far things (O’Sullivan and Unwin, 2010).
Next, education and age of structure data were separately aggregated into derived feature variables. Sociodemographic factors, including maternal and paternal education level, may be associated with higher EBLL for children living in the home, therefore the variables representing the population based on education level were combined into two separate, continuous variables that measure the population of: 1) “High School Education and Lower” and 2) “Some College Education and Higher.” Furthermore, the variables for the quantity of structures based on their year built were aggregated. The aggregation thresholds were based on the dates Pb-based paints and leaded gasoline were historically used. The variables that represent the quantity of structures built between 2005 or later, 2000-04, 1990-99, 1980-89, 1970-79, 1960-69, 1950-59, 1940-49, and 1939 or older, were combined into four derived variables: 1) “Structures Built After 1970”; 2) “Structures Built Before 1970”; 3) “Structures Built Before 1960”; and 4) “Structures Built Before 1950.” Lastly, all continuous, feature variables were standardized to account for the vastly different scales between them.
Before estimating the population of the quantity of EBLL research articles by publication location, research location, and the quantity of EBLL research publishers, exploratory analysis of the observed data was conducted. This included summary statistics and bar charts to display the observed quantity of articles written by geography and subtopic. Next, TFPOBS was calculated using the observed quantity of childhood EBLL research per publication location and the observed quantity of childhood EBLL research per research location. The Cobb-Douglas production function was used to calculate the Solow residuals as the TFPOBS. The Cobb-Douglas production function: log y = log A + α log K + β log L;y = observed values, K = capital, and L = labor was used (Vinood, 2008). The observed values (y) equaled the quantity of research published in each region summed with the quantity of research conducted in each region, the capital measurement (K) equaled the sum of state and federal R&D obligations, and the labor measurement (L) equaled the observed number of childhood EBLL publishers per state. The observed number of childhood EBLL publishers per state was used to represent the labor measurement because of strong correlations between the number of these publishers with both the number of articles per research location and the number of science and engineering articles published per 1,000 graduate students, quantity of observed research articles by research location, and quantity of observed research articles by publication location. Hence, we assumed the number of publishers in geographic region is a strong representation of the scientific research publication labor in that geographic region. The observed quantities of research articles by publication location and research location were decayed by 4% per annum (Johnson et al., 2018). The Solow residuals were stored for the model as the TFPOBS. TFPOBS residuals were assessed, and the estimates were visualized by the 10th and 90th percentiles of the TFPOBS value on a map.
The number of childhood EBLL journals per state, the number of childhood EBLL articles written per publication location, and the number of childhood EBLL articles written per research location were estimated to an infinite population with three separate random forest regression models. Each random forest regression model was calculated, and the variables were extracted in order of importance. The top 5 most important feature variables were then used as the only features for estimation of the revised random forest regression models, for each of the three observed quantities. After the models were estimated, a repeated cross validation was calculated for each model. The original models were then tuned for the maximum number of trees and re-run. These final models were used to capture estimates of the population of the quantity of EBLL journals per state, the quantity of EBLL articles written by publication location, and the quantity of EBLL articles written by research location. These estimated values were then used to calculate the TFPEST with the Cobb-Douglas production function using the total quantity of EBLL articles (i.e., publication location and research location) and EBLL journals. For the TFPEST measurements, the dependent variable (y) equaled the estimated quantity of research published in each region summed with the estimated quantity of research conducted in each region, the capital measurement (K) equaled the sum of state and federal R&D obligations, and the labor measurement (L) equaled the estimated number of childhood EBLL journals per state. The Solow residuals were stored for the model. Residual analysis, calculated from the TFPOBS minus the TFPEST was visualized and checked for normality, and the estimates were visualized by quantile bins. Next, global and local spatial autocorrelation was calculated, using the queen’s neighborhood, for the residuals (i.e., TFPOBS minus the TFPEST) and the TFPEST to assess spatial independence of the residuals and local clusters of the estimated values. Spatial autocorrelation was only assessed for the 48 continental U.S. because Hawaii and Alaska are not directly connected to the mainland, which would cause incorrect results in the cluster analysis. Significant (i.e., p < 0.05) Local Indicators of Spatial Association (LISA) were mapped.
All analysis was produced using R 4.1.2 (2021-11-01) statistical computing and graphics language. The integrated development environment (IDE) used was Rstudio 2022.12.0. R Markdown 2.20 was used to format code chunks and produce dynamic code and commentary documents.