The repository covers all fields of medical science that involve studies on living organisms or viruses. Purely chemical studies and chemical analyses fall outside the scope of Meta-Phill because they are rarely targeted for pooled evidence synthesis. Potential stakeholders are Meta-Phill developers; editors, publishers, and librarians of peer-reviewed journals; individual researchers; research communities; clinicians; policymakers; and systematic reviewers.
Resources:
Each Meta-Phill metadata record contains 9 fields and averages 3.6 kilobytes, so covering all PubMed records alone would yield a repository as large as 126 gigabytes. Handling a dataset of this size requires a powerful server or cluster of servers with high-performance CPUs, a large amount of RAM, and sufficient storage space. The metadata is stored in the MySQL relational database management system (RDBMS).
The website is hosted on a Linux server with a 4-core CPU and 4 gigabytes of RAM, and it is written in PHP. The metadata table is served online through an interface between a MySQL table and the DataTables library. We configured DataTables with the SearchPanes, Select, FixedColumns, Buttons, SearchBuilder, and DateTime extensions.
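The storage estimate can be checked with a short calculation. The sketch below assumes decimal units (1 kilobyte = 1,000 bytes, 1 gigabyte = 10^9 bytes); the function name is hypothetical, and the record count it returns is only what the two stated figures imply, not a number reported by the authors.

```python
def implied_record_count(total_gigabytes, kilobytes_per_record):
    """Number of records implied by a total size and an average record size.

    Assumes decimal units: 1 KB = 1e3 bytes, 1 GB = 1e9 bytes.
    """
    total_bytes = total_gigabytes * 1e9
    bytes_per_record = kilobytes_per_record * 1e3
    return total_bytes / bytes_per_record


# Figures stated in the text: 126 GB in total, 3.6 KB per record on average.
records = implied_record_count(126, 3.6)
print(f"{records:,.0f} records")
```

Under these assumptions the two figures correspond to roughly 35 million records.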
Metadata automated generation:
The metadata generator engine is an application written in Python (https://github.com/Metaphill/Engine.git). For studies indexed in PubMed, abstracts are first saved as text files via the Entrez API. Article DOIs are also used to retrieve abstracts and save them as text files. The engine code defines a list of OpenAI API keys, randomly selects one for use, and sends a separate prompt for each study aspect, such as study design and study population. The exact prompts are as follows:
prompts = {
    "Study Design": "Given the following abstract, what is the study design? please don't write sentences. just give me maximum 5 words. \n\nAbstract: ",
    "Study Population/Disease/Situation": "Given the following abstract, what is the study population? Don't write sentences. name those. \n\nAbstract: ",
    "Study Comparison/Prognostic Factor": "Given the following abstract, what is the study comparison? Don't write sentences. name those. \n\nAbstract: ",
    "Study Exposure/Intervention": "Given the following abstract, what is the study exposure/intervention? Don't write sentences. name those. \n\nAbstract: ",
    "Study Primary Outcome": "Given the following abstract, what is the study primary outcome? Don't write sentences. name those. \n\nAbstract: ",
}
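The retrieval and prompting steps described above might be wired together roughly as follows. This is a minimal sketch, not the engine's actual code: the function names and the request dictionary layout are hypothetical, only two of the five prompts are reproduced, and it assumes one key is drawn per abstract (the text does not say whether the draw happens per abstract or per prompt). The URL follows the public NCBI E-utilities EFetch pattern; no network or OpenAI call is made here.

```python
import random
from urllib.parse import urlencode

# Public NCBI E-utilities endpoint for fetching records.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Subset of the prompt stems listed above (the remaining three are analogous).
PROMPTS = {
    "Study Design": "Given the following abstract, what is the study design? "
                    "please don't write sentences. just give me maximum 5 words. "
                    "\n\nAbstract: ",
    "Study Primary Outcome": "Given the following abstract, what is the study "
                             "primary outcome? Don't write sentences. name those. "
                             "\n\nAbstract: ",
}


def efetch_abstract_url(pmid):
    """Build the EFetch URL that returns a PubMed abstract as plain text."""
    params = {"db": "pubmed", "id": str(pmid),
              "rettype": "abstract", "retmode": "text"}
    return f"{EFETCH}?{urlencode(params)}"


def build_requests(abstract, api_keys, rng=random):
    """Pick one API key at random and pair it with one full prompt per field.

    Assumption: a single key is drawn per abstract and reused for all fields.
    """
    key = rng.choice(api_keys)
    return [
        {"api_key": key, "field": field, "prompt": stem + abstract}
        for field, stem in PROMPTS.items()
    ]


if __name__ == "__main__":
    print(efetch_abstract_url(12345))  # placeholder PMID for illustration
    for request in build_requests("Example abstract text.", ["key-A", "key-B"]):
        print(request["field"], "->", request["api_key"])
```

Each request dictionary would then be passed to whichever client the engine uses to query the model; that call is deliberately left out.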
To determine the exact study design, the Universal Sentence Encoder (USE) model was applied to the Study Design column. The source code is available on GitHub. Full lists of ICD disease classifications and MeSH terms were downloaded and embedded in a similar Python application that uses TensorFlow and the USE model to select the best-matching term for the study population.
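Best-match selection of this kind usually reduces to embedding the free-text answer and every candidate term, then keeping the term with the highest cosine similarity. The sketch below illustrates only that selection logic. To stay runnable without TensorFlow it substitutes a toy character-trigram embedding for the USE model, so `toy_embed`, `best_match`, and the example vocabulary are all hypothetical stand-ins rather than the engine's code.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for the USE model: sparse vector of character-trigram counts."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, vocabulary, embed=toy_embed):
    """Return the vocabulary term whose embedding is closest to the query."""
    query_vector = embed(query)
    return max(vocabulary, key=lambda term: cosine(query_vector, embed(term)))


if __name__ == "__main__":
    # Illustrative labels only; the real lists are the ICD and MeSH terms.
    designs = ["Randomized controlled trial", "Cohort study", "Case report"]
    print(best_match("randomised controlled trial", designs))
```

Swapping `embed` for a function that returns USE vectors would leave the selection step unchanged, provided `cosine` is adapted to dense arrays.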
Human supervision:
Two trained researchers were asked to review the data produced by ChatGPT and check its validity. More than 800 entries were validated in 6 working hours. Personnel involved in this step always have more than 1 year of research experience and more than 10 published articles. Final results are supervised by these trained research experts, who compare the title of the study with the prompt responses and, where an AI error is suspected, consult the abstract and full text.
New metadata registration:
For journals that intend to be included in Meta-Phill, metadata for their articles is generated and uploaded to the repository by human staff. Individual articles can also be included on request to the admin email. Journals could ask authors to provide this metadata at the submission stage, which would improve metadata quality and lower costs.
R Shiny-based application for study classification:
The supplied application takes CSV files as input, which users can easily download from the repository. It applies hierarchical clustering to the Jaro-Winkler distances calculated between the text inputs of each row. The resulting dendrogram is cut at a specified similarity threshold to form the clusters. Users can change this threshold to suit the set of articles exported from the Meta-Phill repository.
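The clustering step can be illustrated outside R with a small, dependency-free sketch. It is not the Shiny application's code. It assumes complete linkage, a Winkler prefix weight of 0.1, and a cut expressed as a maximum distance (1 minus similarity); none of these settings is stated in the text, and all function names are hypothetical.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    hit1, hit2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not hit2[j] and s2[j] == ch:
                hit1[i] = hit2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = out_of_order = 0
    for i in range(len(s1)):
        if hit1[i]:
            while not hit2[k]:
                k += 1
            out_of_order += s1[i] != s2[k]
            k += 1
    t = out_of_order / 2
    return (matches / len(s1) + matches / len(s2) + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def cluster(texts, max_distance):
    """Complete-linkage agglomeration, stopped at the given cut height."""
    n = len(texts)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist[i][j] = dist[j][i] = 1 - jaro_winkler(texts[i], texts[j])
    groups = [[i] for i in range(n)]
    while len(groups) > 1:
        best = min(
            (max(dist[i][j] for i in groups[a] for j in groups[b]), a, b)
            for a in range(len(groups)) for b in range(a + 1, len(groups))
        )
        if best[0] > max_distance:
            break  # next merge would exceed the cut, so stop here
        _, a, b = best
        groups[a] += groups.pop(b)
    labels = [0] * n
    for label, members in enumerate(groups):
        for i in members:
            labels[i] = label
    return labels


if __name__ == "__main__":
    rows = ["randomized controlled trial", "randomised controlled trial",
            "cohort study", "cohort studies"]
    print(cluster(rows, max_distance=0.2))
```

Raising `max_distance` merges more loosely related rows into the same cluster, which mirrors the effect of the user-adjustable threshold described above.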