Overview
We undertook a parallel data extraction process with a team of human reviewers and a set of AI implementations, and then evaluated the quality of the AI responses using an independent evaluation team. Two teams were assembled, and each was given a set of relevant literature for analysis. The first team consisted of human extractors (KO, JD, DK, RS, RAJ) with experience in various forms of evidence synthesis (e.g., (31, 35)), who undertook a traditional evidence extraction from a set of peer-reviewed literature. The second team (SS, FB, MA, RT), with modelling expertise, employed three AI implementations to produce outputs following the same methods as the first team. Both the human reviewers and the AI models were asked to answer a series of qualitative questions about each of the CBFM-related papers. The AI-enabled team then compared the human responses to the AI model responses and ranked their level of similarity.
Topic
Given the expertise of the human reviewers and the previous screening research (Spillias et al. 2024), this case study continued investigating and annotating data from the existing literature on community-based fisheries management (CBFM). CBFM is an approach to fisheries management in which local coastal communities and fishers are responsible for managing their coastal region and resources in an effort to ensure their sustainable use (36). Given the variability and diversity of terminology and language used in CBFM research, this literature provides a robust test of the capabilities of AI to extract and synthesise relevant information. We chose this case study because it provided the opportunity to elicit a range of data types and complexities with which to test AI.
Human extraction procedure
The human reviewers created an initial list of questions. In an initial pilot round, all reviewers analysed three randomly chosen papers. The extracted data were compiled and compared by one reviewer (KO) to identify similarities and differences. The human team then met to discuss their analyses, differences, and possible ambiguities in the question phrasing. As a result, the extraction questions were modified to form a final list of eleven questions, each accompanied by a short explanation (Table 1). After the pilot round, 93 papers identified in Spillias et al. (2024) were randomly distributed equally among the five human reviewers. When reading the full text, a reviewer could exclude a paper from further review if they judged that it did not fit the research question and/or that the extraction questions could not be answered from its content. The reviewers excluded 60 papers this way, leaving 33 papers for the full analysis. For each question and paper, the reviewers gave a short answer to the question (Response) and identified one or more passages that supported the answer (Context). If no answer was possible because the information was not in the paper, no answer was given. Because we were interested in identifying drivers of quality in AI-extracted results, the human reviewers also recorded their perceived difficulty in answering each extraction question for each paper as Easy, Medium, or Hard.
Table 1
Contextual Questions and Coding Notes posed to the Human and AI extractors
Question | Coding Notes |
Which country was the study conducted in? | If multiple countries are part of a single case study, or if there are multiple case studies - code them separately at the country level |
Provide some background as to the drivers and/or motivators of community-based fisheries management. | Why is the CBFM in place/being used? Separate to the benefits of the CBFM - could be because there is strong existing community ownership, could be because all other approaches have failed, could be because of an inherent mistrust of government? |
What management mechanisms are used? | Tangible mechanisms being used to manage the fishery - could be physical limitations like gear limits, size limits or timing limits. |
Which groups of people are involved in the management as part of the CBFM case-studies? Choices: Community-members, Researchers, Practitioners, Government, NGOs, Other | Involved directly in the management of the fishery, NOT in the conducting of the study (collecting data for the study) |
What benefits of Community-Based Fisheries Management are reported in this case study? | Physical or social-ecological benefits - could include things like increased fish counts or larger fish size, or improved social outcomes for communities |
What are the indicators of success of CBFM? | What are the data sources for measuring success - community perception, fish size etc. |
How was the data on benefits collected? | What methods were adopted - catch and release programs, qualitative survey etc. |
What are the reported barriers to success of Community-Based Fisheries Management? | Focus on things that are reported around the case study in question - NOT general examples. What was hindering the success of the CBFM - e.g. poor community buy-in, poaching etc. |
Guidelines for future implementation of CBFM? | This can be more general. In light of the study, what are the take home messages for future CBFM projects - NOT future research directions, or things to consider for future studies |
How does the community monitor the system they are managing? | Within the case study, what are the community groups you have already identified doing to monitor the fishery - are they conducting fish surveys, monitoring community catches etc.? |
How does the community make decisions? | How do they make decisions around the management of the fishery - e.g. all decisions are passed on by the matai, or individual villages make decisions on anything that happens from their shoreline. |
AI extraction procedure
We used three AI implementations to extract data from the 33 papers retained by the human reviewers: (i) a single call to GPT-4-Turbo (GPT4x1); (ii) three calls to GPT-4-Turbo, with the outputs synthesised/summarised by a single additional call to GPT-4-Turbo (GPT4x3); and (iii) the data extraction feature offered by Elicit (Elicit.com). The scripts used to access GPT-4-Turbo via the API are available online at https://github.com/s-spillias/AI_Extraction (see Table 1 for the prompts).
We accessed GPT-4-Turbo using the Microsoft Azure API in late January 2024. Mirroring the procedure of the human reviewers, the prompt to GPT-4-Turbo (GPT4x1) explicitly requested that the AI return both a Response (answer to the extraction question) and a Context (a passage of text from the paper that supports the Response). The AI was also prompted to return the output 'No Data/No Context' if it was not possible to answer the question from the paper. Following the prompt, we provided the AI with a cleaned version of the paper text, with metadata and backmatter removed. For the GPT4x3 implementation, we employed the preceding strategy three times, harvested the 'Response' portion of each output, and then synthesised the Responses, again using GPT-4-Turbo (see Table 1 for the prompt). The Context passages were automatically concatenated, rather than synthesised or otherwise modified, to ensure accurate capture of the relevant passages.
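As an illustration only, a minimal sketch of the single-call strategy (GPT4x1) is shown below, assuming the openai Python client configured for Azure; the deployment name, API version, prompt wording, and function names are placeholders rather than the exact values used in the scripts linked above.

```python
import os
from openai import AzureOpenAI  # openai >= 1.x Azure client

# Hypothetical configuration; endpoint, key, and API version are placeholders.
client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-02-01",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

def extract(question: str, coding_notes: str, paper_text: str) -> str:
    """Request a Response and a supporting Context passage for one question."""
    prompt = (
        "You are extracting data for an evidence synthesis.\n"
        f"Question: {question}\n"
        f"Coding notes: {coding_notes}\n"
        "Return a short 'Response:' and a verbatim 'Context:' passage from the "
        "paper that supports it. If the paper does not contain the information, "
        "return 'No Data/No Context'.\n\n"
        f"Paper text (metadata and backmatter removed):\n{paper_text}"
    )
    completion = client.chat.completions.create(
        model="gpt-4-turbo",  # name of the Azure deployment (placeholder)
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return completion.choices[0].message.content

# For GPT4x3, the call above would be run three times per question, the three
# 'Response' portions synthesised by a further GPT-4-Turbo call, and the
# 'Context' passages concatenated unchanged.
```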
To produce extraction data from Elicit, we uploaded the papers as PDFs to Elicit's online portal on January 29th, 2024. We provided a short column name for each question and then put the unaltered text of the question and its explanation into the 'Description' and 'Instructions' fields, respectively (see Fig. 5). We also enabled the 'High-accuracy Mode' feature for all columns. Elicit automatically returns 'Supporting' and 'Reasoning' passages from the text to support the response provided, which we used as the 'Context' for our evaluation.
For the Context strings returned by the AI, we verified their presence in the articles (i.e., that the AI had not 'hallucinated') by string-matching the Context returned by the AI with the full text of the article. We identified 29 instances where the Context returned by the AI did not match any strings in the full text. We manually investigated each one and confirmed that, with the exception of one Context string, all were present in their respective articles, indicating an essentially near-zero rate of hallucination.
Evaluation procedure
Following the human extraction process, and working independently from the extraction team, an evaluation team (SS, MA, RT, FB) developed a procedure for evaluating the quality of the AI outputs in comparison to the human extractions. Initially, a fully blind process was considered, in which the source of the extractions (human or AI) would not be revealed to the evaluators. This approach was ultimately rejected because, without knowing the source of the extractions, the evaluation team might lack a proper baseline for judging the quality of the responses. Consequently, while the evaluators were aware of which extractions were AI-generated and which were human-generated, they were not informed about which specific AI implementation produced each extraction. This partial blinding was intended to reduce bias in the evaluation process while still providing a reference point for quality assessment. The resulting procedure is similar to evaluation processes developed by other groups (37).
To assess the quality of the AI-generated responses, the evaluation team established three criteria: (i) whether the Context provided by the AI was appropriate evidence for the question; (ii) whether the Response was an appropriate synthesis of the Context in response to the extraction question; and (iii) how the AI output compared to the human output, with the human extractions serving as a 'Gold Standard'. A three-point scale (-1 for Poor, 0 for Fair, and 1 for Good) was used for grading. A custom program was designed in Python to facilitate this evaluation process (Fig. 6). We also performed the same evaluation procedure using GPT-4-Turbo as an evaluator to provide further support for the assessments of the human evaluators; this was done through the API and was repeated five times for each question-paper pair.
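As a rough sketch only (the grading prompt and function below are illustrative, not the exact wording or code used), the repeated AI-assisted evaluation could look like the following, where `client` is an Azure OpenAI client such as the one in the earlier extraction sketch.

```python
GRADES = {"Poor": -1, "Fair": 0, "Good": 1}  # three-point evaluation scale

def ai_evaluate(client, question, context, ai_response, human_response, n_repeats=5):
    """Ask GPT-4-Turbo to grade one question-paper pair; repeated five times."""
    prompt = (
        "Grade the AI extraction as Poor, Fair, or Good on three criteria: "
        "(i) Context to Question, (ii) Response to Context, and "
        "(iii) AI Response to Human Response.\n"
        f"Question: {question}\nContext: {context}\n"
        f"AI Response: {ai_response}\n"
        f"Human (Gold Standard) Response: {human_response}"
    )
    grades = []
    for _ in range(n_repeats):
        completion = client.chat.completions.create(
            model="gpt-4-turbo",  # placeholder deployment name
            messages=[{"role": "user", "content": prompt}],
        )
        grades.append(completion.choices[0].message.content)
    return grades
```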
During the initial assessment, the evaluation team found that some response pairings required the additional context of the full paper to be evaluated properly. A 'Flag' criterion was therefore added to the custom program so that such cases could be followed up, in particular to investigate the possibility that the AI responses had provided more detailed and/or accurate information than the Gold Standard human extraction.
Follow-up Verification
We performed two follow-up verification checks to ensure that the AI was returning valid responses from the articles. First, for every AI output, we verified that the Context returned by the AI was indeed present in the full-text article. We did this using an automated script implementing a three-step verification process, designed to account for potential discrepancies due to OCR errors, formatting changes, and minor variations in the text. The first step normalised the Context string extracted by the AI and the corresponding passage from the full-text article: all punctuation was removed, all characters were converted to lowercase, and newline characters were replaced with spaces, reducing the variability between strings caused by differences in formatting and case sensitivity. The normalised strings were then compared for similarity. In the second step, we used the fuzz.partial_ratio function from the fuzzywuzzy library to calculate the degree of similarity between the AI-generated Context and the text of the full article; if the similarity score exceeded a predefined threshold of 90%, the strings were considered similar enough to be a match, allowing a degree of tolerance for minor differences in wording or spelling. In the third step, we performed a cosine similarity calculation using the CountVectorizer from the scikit-learn library (38): the two normalised strings were transformed into a document-term matrix representing the frequency of terms within each string, the cosine similarity between the two term-frequency vectors was calculated, and a score greater than 0.5 was considered indicative of a significant match. A Context string was considered valid if it was contained within the full-text passage, exhibited a high degree of fuzzy similarity with the passage, or had a cosine similarity score indicating a strong match. By employing these methods, we were able to robustly verify the AI-generated Context against the source articles, ensuring that the AI did not 'hallucinate' or fabricate information not present in the original texts. A condensed sketch of this check is given below.
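The sketch follows the thresholds described above; the function names are ours for illustration and do not correspond to the actual script.

```python
import re
from fuzzywuzzy import fuzz                                  # fuzzy string matching
from sklearn.feature_extraction.text import CountVectorizer  # term-frequency vectors
from sklearn.metrics.pairwise import cosine_similarity

def normalise(text: str) -> str:
    """Lowercase, strip punctuation, and replace newlines with spaces."""
    text = text.lower().replace("\n", " ")
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def context_is_valid(context: str, full_text: str) -> bool:
    """Check an AI-returned Context string against the full-text article."""
    ctx, doc = normalise(context), normalise(full_text)

    # Step 1: exact containment after normalisation.
    if ctx in doc:
        return True

    # Step 2: fuzzy partial match with a threshold of 90 (out of 100).
    if fuzz.partial_ratio(ctx, doc) > 90:
        return True

    # Step 3: cosine similarity of term-frequency vectors, threshold 0.5.
    dtm = CountVectorizer().fit_transform([ctx, doc])
    return cosine_similarity(dtm[0], dtm[1])[0, 0] > 0.5
```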
Second, the evaluation team had the opportunity to ‘flag’ question/output pairs that they felt warranted additional investigation. For each of these flagged data points, two authors (SS, KO) manually investigated the AI response and assessed whether it was faithfully reporting information from the paper.
Statistical Analysis
We calculated inter-rater reliability using Cohen's kappa statistic to evaluate agreement between the human reviewers and each AI implementation on the presence or absence of contextual data.
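For illustration, Cohen's kappa on such presence/absence codes can be computed with scikit-learn; the vectors below are invented, not study data.

```python
from sklearn.metrics import cohen_kappa_score

# Presence/absence of contextual data for the same question-paper pairs,
# coded 1 = data returned, 0 = no data (illustrative values only).
human = [1, 1, 0, 1, 0, 1, 1, 0]
ai    = [1, 0, 0, 1, 1, 1, 1, 0]

print(f"Cohen's kappa: {cohen_kappa_score(human, ai):.2f}")
```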
We created confusion matrices to compare the frequencies at which each AI implementation and the human extractors returned similar responses, by either providing a response (data) or not (no data) to a given question from each paper. This was calculated separately for each AI implementation (Elicit, GPT4x1, and GPT4x3) and independently for each response value. All values in each confusion matrix are reported as proportions of the total, summing to 1. These matrices have four quadrants and show how frequently: the human did not provide a response but the AI did (quadrant 1, top right, false positive); the human and the AI both provided a response (quadrant 2, top left, true positive); the human provided a response but the AI did not (quadrant 3, bottom left, false negative); and neither the human nor the AI provided a response (quadrant 4, bottom right, true negative).
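A minimal sketch of one such matrix, expressed as proportions of the total, is shown below; the data frame is invented for illustration only.

```python
import pandas as pd

# Illustrative presence/absence codes for one AI implementation.
df = pd.DataFrame({
    "human": ["data", "data", "no data", "data", "no data", "no data"],
    "ai":    ["data", "no data", "no data", "data", "data", "no data"],
})

# Cross-tabulation normalised by the grand total, so the four quadrants sum to 1.
matrix = pd.crosstab(df["human"], df["ai"], normalize="all")
print(matrix)
```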
Several criteria were assessed throughout this study, including how well the AI extracted the relevant context from the paper for the specific question (Context to Question), how well the AI responded to the relevant context (Response to Context), and how the AI response compared to the human response (AI Response to Human Response).
We performed all statistical analyses using R software (39). Significance was determined using an α-critical level of 0.05 for all tests. A two-tailed one-sample t-test was used to compare the assessed values for each AI implementation and each criterion against the 'fair' extraction score of 0. In subsequent analyses, we employed a linear mixed-modelling approach (lmer function from the lme4 package (40)) to assess the impacts of different factors (AI implementation, difficulty, and question) on the assessed values. Where applicable, evaluator, extractor, and paper were included as random effects. An analysis of variance (ANOVA) test was performed to assess the overall significance of each variable. When required, pairwise comparisons were then performed on the estimated marginal means (emmeans and pairs functions from the emmeans package (41)), and Tukey's method was applied to determine the significance of comparisons.
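The analyses were run in R; purely as an illustration of the first of these tests, comparing the assessed values for one AI implementation and criterion against the 'fair' score of 0 amounts to a one-sample t-test, sketched here in Python with invented scores.

```python
from scipy import stats

# Invented assessed values (-1 = Poor, 0 = Fair, 1 = Good) for one
# AI implementation and one criterion.
scores = [1, 0, 1, -1, 1, 0, 1, 1, 0, 1]

t_stat, p_value = stats.ttest_1samp(scores, popmean=0)  # two-tailed by default
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")            # compare p with alpha = 0.05
```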
To examine the overall results, we created a linear mixed-effects model of the effect of each AI implementation and criterion on assessed quality (see S2 Statistical Methods for the model specification). This model included the AI implementations and their interaction with the criteria as fixed effects, with evaluator and paper as random effects. Additionally, we constructed individual models to assess the effect of the AI implementation on assessed quality for each criterion independently.
We then investigated the impact of the questions on the assessed quality of the AI responses using a linear mixed-effects model with question as the fixed effect and evaluator and paper as random effects.
To evaluate the effect of difficulty, as ranked by the human extraction team, on the assessed quality, we used a linear mixed-effects model including the interaction between difficulty and AI implementation as the fixed effect, with paper and extractor as random effects. This model investigated the relationship between the assessed quality of each question-paper pair and the interaction between the ranked difficulty of the data point and the type of AI implementation.
Finally, a linear model was also fitted to check that the identity of the extractor did not have an effect on the assessed quality.