Article screening
After applying the eligibility criteria, 53 articles were included in the review (Fig. 2). A total of 1900 studies were initially retrieved from scholarly databases; 716(39.6%) of these were removed as duplicates. Of the 1184 unique references screened by title and abstract, 679(57.3%) were excluded for not having a gastrointestinal focus and 276(23.3%) for not using NLP or not describing NLP methods or validation. A further 86(7.3%) articles were reviews only, and 16(1.4%) focused only on gastrointestinal disease risk factors. See Supplement J for details of all abstracts screened and Supplement F for inter-observer agreement results during screening. The full PRISMA flow diagram is provided in Fig. 2.
During full-text screening of 126 studies, the main reasons for exclusion were being available only in abstract form 57(45.2%), performing only weak validation 4(3.2%) or not providing sufficient details about NLP methods or validation 4(3.2%). A total of 3(2.4%) studies were excluded due to irrelevant indication (limited gastroenterology focus), 2(1.6%) were first published outside the date range, 2(1.6%) were focused primarily on reviewing the existing literature and one (0.8%) study was a sub-study focused on consensus building. See Supplement I for full details of the excluded studies.
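The screening counts above can be cross-checked with simple arithmetic. The following is a minimal sketch that only re-uses the numbers stated in the text; the variable names and the `pct` helper are illustrative and not part of the review's methods.

```python
# Counts transcribed from the screening paragraphs above.
retrieved = 1900
duplicates = 716
full_text_screened = 126
full_text_exclusions = {
    "abstract only": 57,
    "weak validation": 4,
    "insufficient NLP detail": 4,
    "irrelevant indication": 3,
    "outside date range": 2,
    "literature review": 2,
    "consensus sub-study": 1,
}


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


# References left after de-duplication.
unique_refs = retrieved - duplicates

# Studies remaining after all full-text exclusions.
included = full_text_screened - sum(full_text_exclusions.values())

# Each full-text exclusion reason as a share of the full-text set.
exclusion_shares = {
    reason: pct(count, full_text_screened)
    for reason, count in full_text_exclusions.items()
}

print(unique_refs, included, exclusion_shares)
```

The full-text percentages in the text use the 126 full-text studies as the denominator, whereas the title and abstract percentages use the 1184 unique references.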
Key characteristics of included studies
Of the 53 included studies, 29(54.7%) were published in biomedical informatics or computer science journals, 19(35.8%) were published in gastroenterology clinical journals, and 5(9.4%) were published in non-gastroenterology-focused clinical journals.
A total of 18(34.0%) studies were based on data from a single centre, and 35(66.0%) were multi-site or registry. Regarding technological maturity, 47(88.7%) studies were performed in a development/lab environment. In comparison, 6(11.3%) studies were launched as part of a clinical pilot, and only one (1.9%) was deployed as part of a production clinical human-in-the-loop system(33). No systems are currently being used unsupervised in production.
In terms of clinical focus, 22(41.5%) studies focused primarily on obtaining additional information from clinical investigations, compared to 20(37.7%) studies focused on detecting/extracting diagnoses and 10(18.9%) studies focused on improving the monitoring of a disease or calculating surveillance intervals. Only a single study (1.9%) focused on treatment/management(34).
The total number of documents available to investigators ranged from 101(35) to 14.6 million(36), with up to 610,684(37) individual patients in the available sample population. However, given the high costs involved in annotation, high-quality manually annotated model development document samples varied only between 101(35) and 6836(38), and manually annotated validation document samples ranged from 100(39) to 2988(40) in size.
Study tools/methods used
The authors used a wide array of methodologies/tools: 26(49.1%) studies used RB methods, 15(28.3%) a hybrid (ML + RB) approach, 10(18.9%) a single ML model and 2(3.8%) an ML-ensemble(38, 41). Popular established open-source tools included CLAMP(42), cTAKES(43) and pyConText(44)/medspaCy(45). Python 15(28.3%) was the most frequently mentioned programming language other than Structured Query Language (SQL), followed by Java 10(18.9%), Prolog 3(5.7%) and Perl 1(1.9%). Four commercial algorithms (I2E™, EHRead™, ClixNLP™ and EasyCIE™) were mentioned across 5(9.4%) studies. Table 3 provides an overview of the primary open-source NLP tools described.
Table 3
Key NLP Tools Currently Used in Gastroenterology / Hepatology
Tool | Description | Link | Example Usage |
Commonly Used Ontologies / Clinical Data Models |
ICD-10 | WHO International Classification of Diseases version 10 | https://icd.who.int/browse10/2010/en | Coding of gastroenterology diagnoses on discharge summaries as a validation standard |
SNOMED-CT | SNOMED Clinical Terminology system. | https://www.snomed.org/get-snomed | Coding of gastroenterology diagnoses on discharge summaries as a validation standard |
UMLS Metathesaurus | Open-source compendium of controlled vocabularies curated by the US National Library of Medicine | http://www.nlm.nih.gov/research/umls/ | Standardisation of free-text terms to aid with tokenisation (breaking up) of free text |
OMOP | Observational Medical Outcomes Partnership Common Data Model | https://www.ohdsi.org/data-standardization/ | Mapping of clinical information to a standardised data model to aid interoperability |
Java-Based Open-Source Tools |
cTAKES | Open-source NLP system for information extraction from electronic medical record clinical free text | http://ctakes.apache.org/ | Used to process and extract concepts such as diarrhoea from free text |
GATE | Suite of tools for NLP tasks, including information extraction | https://gate.ac.uk/ | Used to extract concepts such as hepatitis from clinical free text |
MALLET | Java-based package for statistical NLP, document classification, clustering, topic modelling and information extraction | http://mallet.cs.umass.edu/ | Used to build a text-to-model pipeline, perhaps to diagnose IBD and perform NLP analysis on that model |
CLAMP | Clinical Language Annotation, Modelling and Processing Toolkit | https://clamp.uth.edu/ | Used to annotate clinical free-text, perhaps for training a model for diagnosis of pancreatic cysts in radiology reports |
Python-Based Open-Source Tools |
NLTK | Python’s natural language processing toolkit | https://www.nltk.org/ | Identify abdominal pain tokens in clinic letters |
spaCy | Self-described as industrial-strength natural language processing in Python | https://spacy.io/ | Label patients with polyps with colouring and build a pipeline |
medspaCy | Successor to pyConTextNLP combining the original implementation with spaCy | https://github.com/medspacy/medspacy | Build a fully functional app annotating endoscopy reports |
CheXpert-labeler | Initially developed to help label chest X-rays, but adapted in some studies to review CTs and MRIs | https://github.com/stanfordmlgroup/chexpert-labeler | Label radiology reports of patients with, for instance, pancreatic cysts |
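To illustrate the general idea behind the rule-based (RB) approaches used in roughly half of the included studies, the following is a minimal sketch of target-term matching with a simple negation check. It is not the algorithm of pyConText, medspaCy or any included study: the term lists, the `find_mentions` function and the four-token window are hypothetical choices made for illustration only.

```python
import re

# Hypothetical lexicons for illustration only; real systems use much larger
# curated vocabularies (for example, terms mapped to UMLS or SNOMED CT).
TARGET_TERMS = ["diarrhoea", "abdominal pain", "polyp"]
NEGATION_TRIGGERS = ["no", "denies", "without", "negative for"]


def find_mentions(text, window=4):
    """Return (term, negated) pairs for each target term found in `text`.

    A mention is flagged as negated when a negation trigger appears within
    the `window` word tokens that precede it in the same sentence.
    """
    results = []
    # Naive sentence split on whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text.lower()):
        for term in TARGET_TERMS:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, sentence):
                preceding = sentence[:match.start()]
                context = " ".join(re.findall(r"[a-z]+", preceding)[-window:])
                negated = any(
                    re.search(r"\b" + re.escape(trigger) + r"\b", context)
                    for trigger in NEGATION_TRIGGERS
                )
                results.append((term, negated))
    return results


report = "Patient denies diarrhoea. A 5 mm polyp was seen in the sigmoid colon."
mentions = find_mentions(report)
print(mentions)
```

Hand-written rules of this kind are transparent and easy to audit, but they depend on the exact wording anticipated by the rule author. For example, this sketch would miss the plural "polyps".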
Demographics of the included studies
Only 30(56.6%) of studies reported patient demographics. Ages ranged from 16(46) to 85(47) years, while gender balance ranged from 1.8%(48) to 63%(49) female. Only 17(32.1%) studies reported underlying ethnicity, and detailed information on participant socioeconomic status or comorbidities was provided in only 5(9.4%) of the studies. A full breakdown of the reported study populations is provided in Supplement G.
Study purpose and primary findings
By subspecialty, 21(39.6%) of studies focused on colonoscopy, 13(24.5%) on liver disease, 7(13.2%) on inflammatory bowel disease (IBD), 4(7.5%) on gastroscopy, 4(7.5%) on pancreatic pathology, 2(3.8%) on gastrointestinal bleeding, one (1.9%) on endoscopic retrograde cholangiopancreatography (ERCP) and one (1.9%) on optimisation of sedation in endoscopic practice more generally. Figure 3 presents a summary of the primary clinical areas of application.
As anticipated, classification tasks account for 32(60.4%) studies, given that prediction and automation typically depend upon accurate classification. Of these, 19(59.4%) focus specifically on disease case identification. Colonoscopy studies currently cover a broader array of clinical tasks than the other subspecialties. Complete results of all included studies are provided in Supplement H.
Colonoscopy
Gourevitch et al. examined pathologist variation in colorectal adenoma classification and reported substantial average variations in reported adenoma detection rates (ADR) between endoscopists (28.5%–42.4%), dependent purely on the reporting pathologist(50). Blumenthal et al. predicted colonoscopy non-attendance with an AUC of 0.70(51). Li et al. achieved 100% precision and recall while stratifying a sample of 300 Lynch syndrome mismatch repair status reports(52). Shi et al. achieved 94% precision and recall in identifying cancers in family histories. Paterson et al. achieved precision and recall of 0.861 and 0.885, respectively, for predicting colonoscopy indication(53). Hoogendoorn et al. achieved an AUC of 0.896 for predicting colorectal cancer at a population level by including information derived from NLP(36).
A systematic review has already been performed regarding the automated detection of adenomas using NLP, finding a pooled precision of 99.7% for these studies(16). However, the studies included in this review were rule-based and thus likely brittle. Table 4 summarises the key results of all colonoscopy result extraction studies focusing on polyp detection, where data was available.
Table 4
Colonoscopy Result Extraction Studies
Study | Study Aim | Outcome | Model | Accuracy | Precision | Recall | F1 Score |
Adenoma-Including Studies |
Syed 2022(54) | Extract clinical concepts from colonoscopy reports | Polyp Detection | DL(BERT) | NR | 0.91 | 0.94 | 0.92 |
Vithayathil 2022(55) | Develop a large colonoscopy-based longitudinal cohort | Adenoma Detection | RB | 1 | 1 | 1 | 1 |
Nayor 2018(56) | Automate calculation of ADR | Adenoma Detection | RB | 1 | 1 | 1 | 1 |
Laique 2021(57) | Extract clinical information from colonoscopy reports. | Polyp Detection | RB | 0.96 | 0.99 | 0.92 | 0.96 |
Tinmouth 2023(58) | Identify colorectal adenomas in pathology reports | Non-Advanced Adenomas | RB | 0.99 | 1 | 0.99 | 0.99 |
Lee 2019(47) | Identify colonoscopy quality and polyp findings. | Polyps > 10mm | Commercial – I2E | 0.95 | 1 | 0.91 | 0.95 |
Fevrier 2020(37) | Extracting Polyp Variables | Adenoma Detection | RB | NR | 0.99 | 0.97 | 0.98 |
Bae 2022(59) | Focusing on polyp detection | Adenoma Detection | RB | 0.99 | 1 | 0.99 | 0.99 |
Non-Adenoma Studies |
Redd 2022(60) | Identify colorectal cancer in US military Veterans. | Colorectal Cancer | ML – LDA & DNN | 0.99 | 0.91 | 0.97 | 0.94 |
Parthasarathy 2020(61) | Automatically Diagnose Serrated Polyposis Syndrome (SPS). | Serrated Polyposis Syndrome | RB | 0.93 | NR | NR | NR |
Ternois 2018(62) | Automatic coding system for colonoscopies | Attribute reports to CCAM codes | RB | NR | 0.92 | 0.92 | 0.92 |
Footnote: NR = Not Reported. Precision(PPV) = TP/(TP + FP). Recall(Sensitivity) = TP/(TP + FN). Confidence intervals were reported in only a minority of studies.
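The metric definitions in the footnote can be written out as a short sketch. The F1 score is the standard harmonic mean of precision and recall. The function names and the confusion-matrix counts below are hypothetical and are not taken from any included study; only the final check uses the precision and recall reported for Syed 2022 in Table 4.

```python
def precision(tp, fp):
    """Positive predictive value: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Sensitivity: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, for illustration only.
tp, fp, fn = 94, 9, 6
p = precision(tp, fp)
r = recall(tp, fn)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))

# Consistency check against a reported row of Table 4 (Syed 2022):
# precision 0.91 and recall 0.94 should give the reported F1 of 0.92.
syed_f1 = round(f1_score(0.91, 0.94), 2)
print(syed_f1)
```

Because F1 is a harmonic mean, it sits closer to the lower of the two inputs, so a system with very high precision but modest recall (or the reverse) cannot achieve a high F1.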
Harrington et al. attempted to personalise colorectal cancer screening follow-up plans, achieving a maximum AUC of 0.65 for this task(63). Three studies focused on clinical decision support for colorectal cancer surveillance interval calculation, each taking a different approach. Wadia et al.’s decision support system divided reports into actionable and non-actionable, achieving precision and recall of 92.8% and 98.9%, respectively(64). Peterson et al.’s algorithm achieved an accuracy of 92% for assigning recommended surveillance intervals for colonoscopy(39), while Karwa et al. reported 100% accuracy at the same task(65). Human surveillance judgements, in comparison, exhibited significantly more deviation from guidelines, with a tendency towards earlier surveillance.
Endoscopic retrograde cholangiopancreatography (ERCP) and endoscopic sedation
Shen et al.’s human-in-the-loop clinical decision support system (CDSS), which aimed to pre-emptively identify patients at higher risk of sedation errors(33), reduced the sedation-type error rate from 0.39% to 0.037%. Although the system had a high recall(sensitivity) of 89.2%, it suffered from low precision (28.5%). Imler et al.’s study focused on automated RB quality metric extraction for ERCP(66). The model identified 13 pre-, intra- and post-procedure quality measures from free text; however, the algorithm struggled more with complex concepts such as precut sphincterotomy (84% precision) and pancreatic stent placement (90% precision).
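The practical meaning of the sedation CDSS figures can be made concrete with a little arithmetic. The sketch below uses only the error rates and the precision quoted above; the derived quantities (relative reduction and alerts per true positive) are computed here for illustration and are not reported in the source study.

```python
# Figures quoted in the text for Shen et al.'s sedation CDSS.
baseline_error_rate = 0.39   # sedation-type error rate before, in %
post_error_rate = 0.037      # sedation-type error rate after, in %
alert_precision = 0.285      # proportion of alerts that were true positives

# Relative reduction in the error rate (derived, not reported in the source).
relative_reduction = (baseline_error_rate - post_error_rate) / baseline_error_rate

# With precision p, on average 1/p alerts must be reviewed to find one true
# positive, and (1 - p)/p of those alerts are false alarms.
alerts_per_true_positive = 1 / alert_precision
false_alerts_per_true_positive = (1 - alert_precision) / alert_precision

print(f"Relative error reduction: {relative_reduction:.1%}")
print(f"Alerts reviewed per true positive: {alerts_per_true_positive:.1f}")
print(f"False alerts per true positive: {false_alerts_per_true_positive:.1f}")
```

This is one reason a human-in-the-loop design may tolerate low precision: a reviewer can dismiss false alerts cheaply, while high recall keeps missed cases rare.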
Gastrointestinal bleeding
Two studies used a combination of RB and ML/DL models to detect gastrointestinal bleeding in clinical free text: one in the emergency department (ED)(40) and the other in intensive care (ICU)(67). Taggart et al.’s ICU study achieved precision of 62.7% (RB) and 55.9% (ML) and recall of 91.1% (RB) and 84.9% (ML) on MIMIC-III(68), while Shung et al.’s study achieved precision of 72.0% (RB) and 84.0% (DL) and recall of 87.0% (RB) and 90% (DL) for detecting bleeding among ED clinical text narratives. In both studies, the NLP approach outperformed the use of ICD codes alone, but the transformer-based approach was strongest overall.
Gastroscopy
Half of the gastroscopy studies focused on identifying gastric pathology from reports. The ML-ensemble model proposed by Ding et al. achieved an AUC of 0.891 for predicting gastric cancer from gastroscopy report text(38). However, even this model was associated with a 25.6% missed diagnosis rate. Song et al. reported higher performance when extracting ten different gastric diseases from 1,000 validation gastroscopy reports, achieving a precision of ≥ 97.2%(69) in their centre.
McVay et al. used a 250-patient holdout set to detect dysphagia(70) and achieved a precision of 98.6% and an F1 score of 91.1% on this task. Finally, Nguyen Wenker et al. attempted to detect Barrett’s dysplasia in gastroscopy reports. They achieved 93.2% precision in this task, although the algorithm could not effectively discriminate between low- and high-grade dysplasia(71).
Inflammatory bowel disease (IBD)
Stidham et al. used an RB algorithm to identify the status of many skin, eye and joint-related IBD extra-intestinal manifestations (EIM), achieving an average recall of 92% for EIM presence(72). Kurowski et al. created a computational Crohn’s disease state model with symptomatic/asymptomatic, active/inactive and tested/untested states, identifying that 20% of patients were lost to follow-up every 24 months(46). Zand et al. classified flare-line conversations with IBD patients, finding that 90% of the dialogues could be assigned to one of seven categories(73). Walker et al. achieved a precision of 79% and recall of 92% for detecting liver-test derangement in an IBD cohort(74).
Montoto et al. achieved precision and recall of 88% and 98%, respectively, for the diagnosis of Crohn’s, 91% and 71% for disease flare and 86% and 94% for Vedolizumab(75) across a Spanish cohort. Gomollón et al. then built upon this work by attempting to predict disease flare among that cohort, achieving precision and recall of 67% and 71%, respectively, using a random forest model and two years of input data(76). Finally, Hou et al. achieved precision and recall of 87% and 96.6% for detecting low-grade dysplasia in IBD surveillance biopsies within a US cohort(77).
Liver
Bell et al. found that donor text narratives strongly predicted liver utilisation(AUC = 0.81) but not 30-day(AUC = 0.53) or 1-year mortality(AUC = 0.52)(34). Koola et al. phenotyped hepatorenal syndrome (HRS) with precision and recall ranging from 53–73% and 65–84%, respectively, with the final phenotyping algorithm achieving an AUC of 0.93(48) on a small cohort.
Chang et al. achieved 98.4% precision and 90% recall in identifying patients with cirrhosis(78). Redman et al. and Van Vleck et al. achieved 89–91.8% precision and 90–93% recall for identifying obesity-related liver disease from liver imaging reports(79, 80). Heidemann et al. attempted to identify drug-induced liver injury (DILI) cases(49); however, their four-term RB system achieved precision and recall of only 64% and 53%. In another study, Wang X et al. attempted to attribute the causality of idiopathic DILI, reaching a precision of 86% and recall of 82% with their system(81).
The six remaining studies, which focused on identifying liver cancer, predominantly hepatocellular carcinoma (HCC), in radiology reports, are summarised in Table 5.
Table 5
NLP Liver Cancer Identification Results
Study | Clinical Focus | Imaging Modalities | Accuracy | Precision | Recall | F1 Score |
Yim 2017(35) | Identifying and Classifying Tumour-event Attributes | Not Specified | NR | 0.83–0.88 | 0.68–0.76 | 0.72 |
Tariq 2022(82) | HCC | US/MR using templating | NR | 0.97 (MR); 0.68 (US) | 0.96 (MR); 0.66 (US) | 0.95 (MR); 0.67 (US) |
Liu W 2022(41) | Liver Metastases in Colorectal Cancer | CT/MRI | 0.96 | NR | NR | NR |
Liu H 2021(83) | Predicting the Phrase: ‘hyperintense enhancement in the arterial phase.’ | CT Only | 0.98 | 0.98 | 0.99 | 0.98 |
Sada 2016(84) | HCC | CT/MRI | NR | 0.68 | 0.75 | 0.71 |
Wang T 2022(85) | HCC | Predominantly US with some CT/MRI | 0.99 | 0.86 | 1 | 0.92 |
Table Footnote: NR = Not Reported. Precision(PPV) = TP/(TP + FP). Recall(Sensitivity) = TP/(TP + FN).
Pancreas
Three systems reported precision ranging between 33–99% and recall of 25–99.9% for detecting pancreatic cysts in radiological examinations(86–88). Collectively, these studies covered 269,221 individual patients, but substantial heterogeneity of methods, environments, and underlying imaging studies renders reliable meta-analysis challenging. Xie et al. achieved precision and recall of 85.5–100% and 88.7–98.7% for various chronic pancreatitis features(89), finding a higher ten-year mortality (32.5% vs 21.2%) in those with more advanced radiological features.
Quality Assessment
Algorithm running costs were explored in only 6(11.3%) studies, while model explainability was mentioned in only 5(9.4%) studies. However, generalisability was explicitly mentioned by 34(64.2%) of the studies. Open-source code was made available in only 5(9.4%) studies. Supplement D summarises the quality appraisal results for each study.
Risk of Bias Assessment
All studies were assessed across ten areas of potential bias. All studies scored low for deviation bias (a measure of unclear aims). Only 5(9.4%) studies scored a low risk of bias across all domains. Validation bias was the most common, with only 13(24.5%) of studies scoring as low risk in this domain. Supplement E summarises the ROB results.