This study was conducted using anonymized, self-reported patient case data obtained in 1994 from the study “Early diagnosis of AMI using clinical and electrocardiographic data at presentation: derivation and evaluation of logistic regression models” by Kennedy et al, with HF as a coauthor. Data were collected at 2 sites, Edinburgh and Sheffield, UK.19 Patients with a chief complaint of chest pain reported their symptoms to a clinician, who recorded them on a standardized questionnaire developed for the original Kennedy et al study. Diagnosis by an ED physician was based on clinical presentation, ECG review, and serial cardiac enzyme measurements.19 ECG results and cardiac enzymes in the dataset were not entered into the apps because they would not be available in the home setting. Symptom and risk factor data were deidentified, coded as present or absent, and compiled into a dataset of 1872 patients. The first dataset in the current study (D1) was extracted from the main set, comprised 697 cases with symptoms focused on describing cardiac events, and recorded true diagnoses in 5 categories: 1-MI with Q wave, 2-MI without Q wave, 3-unstable angina, 4-stable angina, and 5-noncardiac causes. Symptoms included as D1 variables are shown in appendix table 7. The D2 dataset included a greater variety of symptoms across 1175 cases, and its final diagnoses included specific noncardiac diagnoses.
Box 3: creation and characteristics of datasets used in the study.
The overall dataset had a higher prevalence of acute coronary syndrome (AMIs and unstable APs) than a typical ED population in the UK or US today, so the S1 sample created from these data was a high-risk group. 50 cases were randomly selected from D1 using a random number generator. The average age in this sample was 60.76 years and 44% were female. 24% were smokers, 10% ex-smokers, 16% had a family history of cardiac events, 10% had diabetes, 22% had hypertension, and 6% had hyperlipidemia.
The S2 noncardiac sample was created to understand how the apps responded to lower risk cases of chest pain. This sample was derived from D2, which included D1’s cardiac variables as well as additional symptoms. The 29 lowest risk cases were selected from D2 according to the following criteria: the patient had to be 50 years or younger, with no prior heart attack or angina, and the given gold standard diagnosis had to be noncardiac and low risk. The average age in this sample was 37.1 years and 17.2% were female. 37.9% were smokers, 10.3% ex-smokers, 20.7% had a family history of cardiac events, 3.4% (1 patient) had diabetes, 6.9% had hypertension, and 0% had hyperlipidemia. Since anxiety was a common diagnosis in this group and D2 included no mental health symptoms, the researcher entering cases had to reference the given true diagnosis to include a pertinent symptom (eg “anxious”) for entry.
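A minimal sketch of this sample-construction step is shown below, assuming the deidentified cases are held in pandas DataFrames d1 and d2 with hypothetical column names (age, prev_mi, prev_angina, diagnosis_category); the actual variable coding follows appendix table 7, and the study’s exact selection among equally low-risk D2 cases is not specified here.

```python
import pandas as pd

def make_s1(d1: pd.DataFrame, n: int = 50, seed: int = 0) -> pd.DataFrame:
    """Randomly draw the 50-case high-risk S1 sample from D1."""
    return d1.sample(n=n, random_state=seed)  # stand-in for the number generator

def make_s2(d2: pd.DataFrame, n: int = 29) -> pd.DataFrame:
    """Filter D2 to the lowest-risk noncardiac cases: age 50 or under,
    no prior MI or angina, and a low-risk noncardiac gold standard."""
    low_risk = d2[
        (d2["age"] <= 50)
        & (d2["prev_mi"] == 0)
        & (d2["prev_angina"] == 0)
        & (d2["diagnosis_category"] == "noncardiac")
    ]
    return low_risk.head(n)  # the study's exact choice of the final 29 may differ
```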
Case Entry
The same researcher entered all S1 and S2 cases into WebMD, Isabel, and Ada and standardized symptom wording across cases. All symptoms present in a case were entered into the symptom checker unless the app interface did not allow their entry. The Ada chatbot required some alterations to the case entry process. For Ada chatbot questions about symptoms that were not listed in D2, the researcher input “I don’t know” instead of “no,” so as not to dismiss lines of questioning that might have led to the correct final diagnosis. For patients with a final diagnosis of “anxiety,” “panic attack,” or “anxiety attack,” the researcher answered “yes” to the chatbot question, “Has he/she been feeling anxious lately?” However, when asked “Has this patient been diagnosed with anxiety or depression?” the researcher answered “no,” since the patient may have been seeking such a diagnosis for the first time, and D2 did not provide past medical history for noncardiac conditions. Anxiety was the only symptom added beyond those in the database.
Matching
An app suggestion was counted as a “true match” with the final diagnosis provided by the physician, and therefore as an accurate diagnosis, if at least one of the following was true:
- The conditions in the pair are exact matches.
- The conditions are alternative names for one another.
- One condition is a more precise description of the other.
- It is reasonable to assume two different doctors might use the two different descriptions to label the same condition.
- One condition directly causes the other.
- One condition conveys the nature of the other in a less precise manner.
- Both conditions are highly related and share many symptoms.
Table 1: Determining Diagnostic Matches

| Sample | Matching process | True match | Categorical match |
| --- | --- | --- | --- |
| S1 cardiac | Blind categorical matching | Identification of the correct category of diagnosis (2-heart attack, 3-unstable angina, 4-stable angina, 5-noncardiac) | … |
| S2 noncardiac | Blind pair matching | Identification of a diagnosis that matches the gold-standard diagnosis according to a set of criteria | A match that is not a true match, but both the true and suggested diagnoses relate to the same region of the body and similar mechanisms of disease |
S1: Blind Categorical Matching for True Matches
Each of the S1 cases was assigned one of these redefined gold standard categories, broadened to accommodate most app suggestions: 1-other cardiac diagnoses, 2-heart attack (MI), 3-unstable angina, 4-stable angina, and 5-noncardiac. As shown in Table 1, categorical matching was therefore used to determine true matches. All remaining diagnoses were sent to a medical professional (HF) for categorization; every one of these undetermined diagnoses fell into either the miscellaneous cardiac category 1 or the noncardiac category 5. The medical professional was blinded to which symptom checkers and cases were associated with the undetermined diagnoses. The researcher and medical professional then discussed the final categorizations.
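An illustrative sketch of how categorical matches translate into an accuracy figure for S1, assuming each app suggestion has already been assigned one of the 5 redefined categories through the blinded step above; the case identifiers and values are hypothetical.

```python
GOLD = {"case_01": 2, "case_02": 5}              # case id -> gold category
SUGGESTED = {"case_01": [2, 1], "case_02": [5]}  # categorized app suggestions

def true_match_rate(gold: dict, suggested: dict) -> float:
    """Fraction of cases whose gold category appears among the app's
    categorized suggestions."""
    hits = sum(gold[c] in suggested.get(c, []) for c in gold)
    return hits / len(gold)
```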
S2: Blind Diagnostic Pair Matching for True and Categorical Matches
Since S2 cases had specific gold standard diagnoses, a blinded pair matching process was used to determine whether a diagnostic suggestion was considered a match. First, the primary researcher assigned a preliminary “match” status to suggestions that exactly matched the wording of the gold standard diagnosis and a “categorical match” status to suggestions that were not identical to the gold standard but related to the same organ system. Diagnostic categories included gastric conditions, respiratory conditions, viral infections, and mood disorders. This metric was created to add additional information on app accuracy since S2 true match sensitivities were low, and identifying the general disease category can inform the types of care a patient may seek. Then, the researcher created pairs of one gold standard diagnosis and one corresponding suggested diagnosis. The order of listed pairs was randomized and sent to the medical professional for blinded true and categorical pair matching. Finally, the researcher and medical professional discussed and agreed on all final match and categorical match designations.
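The order randomization in that step can be illustrated with a short sketch; the diagnosis pairs below are hypothetical examples, not cases from the study.

```python
import random

def blind_pairs(pairs, seed=0):
    """Shuffle (gold standard, app suggestion) pairs so the blinded
    reviewer cannot infer which case or app produced a suggestion."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled

sheet = blind_pairs([
    ("anxiety attack", "panic disorder"),         # hypothetical pair
    ("musculoskeletal pain", "costochondritis"),  # hypothetical pair
])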
Development and Testing of Prediction Models for Acute Cardiac Events
We compared the results of the 3 symptom checkers to a machine learning algorithm trained on the original chest pain datasets to predict AMI. Wang et al developed a logistic regression model on the same original dataset to predict the probability of AMI from patient-reported factors. Their study demonstrated 1) that their models could accurately predict AMI probability based solely on patient-reported data and 2) that the datasets we used were, at baseline, predictive of acute cardiac events. We therefore ran their logistic regression model (LR model) on the S1 and S2 samples to compare AMI identification rates with the 3 symptom checkers. Table 2 shows the LR model from the Wang study with its beta coefficients.
Table 2: Variables, logistic regression beta coefficients, and P values for the Wang LR model [Wang et al, 2001]

| Variable | Beta | P value |
| --- | --- | --- |
| Intercept | −6.1005 | 0.0001 |
| Age | 0.0674 | 0.0001 |
| Smokes | 0.7002 | 0.0020 |
| L arm pain | 0.7165 | 0.0005 |
| Pleuritic pain | −2.9265 | 0.0048 |
| Sharp | −1.0132 | 0.0017 |
| Sweating | 1.1307 | 0.0001 |
| Nausea | 0.9580 | 0.0007 |
| Episodic | −2.0136 | 0.0100 |
| Previous ang | −0.9689 | 0.0001 |
| Previous MI | −0.7715 | 0.0013 |
| Sex | 0.5236 | 0.0195 |
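To make the table concrete, the following sketch applies these published coefficients through the standard logistic function to produce an AMI probability for a single case; the variable names and the 0/1 coding (eg sex = 1 for male) are assumptions for illustration, not the original study’s encoding.

```python
import math

# Beta coefficients transcribed from Table 2 (Wang et al, 2001)
BETAS = {
    "age": 0.0674, "smokes": 0.7002, "l_arm_pain": 0.7165,
    "pleuritic_pain": -2.9265, "sharp": -1.0132, "sweating": 1.1307,
    "nausea": 0.9580, "episodic": -2.0136, "previous_ang": -0.9689,
    "previous_mi": -0.7715, "sex": 0.5236,
}
INTERCEPT = -6.1005

def ami_probability(case: dict) -> float:
    """Logistic model: P(AMI) = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    z = INTERCEPT + sum(beta * case.get(name, 0) for name, beta in BETAS.items())
    return 1.0 / (1.0 + math.exp(-z))

# eg, a 60-year-old male smoker with sweating and nausea (assumed coding)
p = ami_probability({"age": 60, "sex": 1, "smokes": 1, "sweating": 1, "nausea": 1})
```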
The original Wang model was trained on the whole dataset collected in Edinburgh.20 The S1 and S2 test sets were found to include 14 cases that also appeared in the model’s training set. We therefore removed those 14 test cases and retrained the logistic regression parameters on the remaining Edinburgh cases available for this study, using an 80:20 split for training and validation. The LR model was then run on the S1 and S2 sets to produce a calculated probability of AMI for each case. By adjusting the model’s cutoff threshold for declaring a positive case, we generated AMI/no-AMI predictions at varying levels of sensitivity and specificity. For the S1 sample, the goal was to predict 100% of AMI patients to ensure safety. The cutoff threshold was then adjusted to match each symptom checker’s sensitivity to AMI so that app performance could be compared. The probability threshold originally set on the S1 dataset was then applied to the S2 dataset to determine the percentage of these negative cases correctly classified as negative for ACS and low risk by the LR model.
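A sketch of this retraining and threshold-adjustment step under stated assumptions: X and y are the Edinburgh cases with the 14 overlapping test cases already removed, X_s1 and y_s1 are the S1 features and AMI labels as NumPy arrays, and the cutoff is set to the highest value that still flags every true AMI in S1 (the 100% sensitivity goal).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def refit_and_threshold(X, y, X_s1, y_s1):
    # 80:20 split for training and validation, mirroring the text
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    val_acc = model.score(X_val, y_val)  # sanity check on the refit

    probs = model.predict_proba(X_s1)[:, 1]
    # highest cutoff that still captures 100% of S1 AMI cases
    cutoff = probs[np.asarray(y_s1) == 1].min()
    preds = probs >= cutoff
    return model, cutoff, preds, val_acc
```

Matching each app’s sensitivity would then amount to sweeping the cutoff until the model’s S1 sensitivity equals that app’s, before applying the same cutoff to S2.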