The use of text-mining for knowledge synthesis, including literature reviews, has been gaining traction as modern Natural Language Processing tools have progressed. However, there is a trade-off between accuracy and precision on the one hand and workload time on the other. The data extraction approaches with the longest workload times typically achieve the best validation metrics, with manual data extraction being one example. Currently, few text-mining tools have demonstrated data extraction from research texts within minutes while maintaining acceptable validation metrics. By combining a previously demonstrated generalized approach for determining research production with modern text-mining tools for extracting research locations and subtopics, this work presents a methodology with higher-than-average validation metrics and significantly improved workload times.
Coupling two reliable Named Entity Recognition toolkits for research location extraction, a Latent Dirichlet Allocation model for unsupervised subtopic detection, and a Large Language Model for subtopic clustering makes it possible to extract research locations from more than 1,000 public health research articles and cluster their subtopics in less than 5 minutes. Validation yields F1 measures of 92% for research location extraction and 71% for subtopic clustering. Additionally, applying these results to a previously constructed spatiotemporal generalization process reproduces the generalized results with a correlation coefficient of 97%, which differs significantly from random chance (p < 0.001). The mapped fitted values are reliably close to the validation model’s fitted values.
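For readers unfamiliar with the F1 measure reported above, the following is a minimal sketch of how it can be computed for location extraction on a single article. It assumes a set-based comparison between extracted and manually annotated locations, which may differ from the evaluation protocol actually used in this work. The function name `precision_recall_f1` and the example place names are hypothetical and serve only as illustration.

```python
def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 for one article's extracted items.

    Assumption: items are compared by exact string match; the study's own
    matching rules are not specified here.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical data: one spurious extraction ("Lancet") and one miss ("Tanzania").
predicted_locations = {"Kenya", "Uganda", "Nairobi", "Lancet"}
gold_locations = {"Kenya", "Uganda", "Nairobi", "Tanzania"}

precision, recall, f1 = precision_recall_f1(predicted_locations, gold_locations)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of precision and recall, so a high value requires both few spurious extractions and few missed locations.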