This study was conducted using anonymized, self-reported patient case data obtained in 1994 from the study “Early diagnosis of AMI using clinical and electrocardiographic data at presentation: derivation and evaluation of logistic regression models” by Kennedy et al, with HF as a coauthor. Data were collected at 2 sites, Edinburgh and Sheffield, UK.19 Patients with a chief complaint of chest pain reported their symptoms to a clinician, who recorded them on a standardized questionnaire developed for the original Kennedy et al study. Diagnosis by an ED physician was based on clinical presentation, ECG review, and serial cardiac enzyme measurements.19 ECG results and cardiac enzymes in the dataset were not entered into the apps because they would not be available in the home setting. Symptom and risk factor data were deidentified, coded as present or absent, and compiled into a dataset of 1872 patients. The first dataset in the current study (D1) was extracted from the main set, comprised 697 cases with symptoms focused on describing cardiac events, and recorded true diagnoses in 5 categories: 1-MI with Q wave, 2-MI without Q wave, 3-unstable angina, 4-stable angina, and 5-noncardiac causes. Symptoms included as D1 variables are shown in appendix table 7. The D2 dataset included a greater variety of symptoms across 1175 cases, and its final diagnoses included specific noncardiac diagnoses.
Box 3: creation and characteristics of datasets used in the study.
The overall dataset had a higher prevalence of acute coronary syndrome (AMIs and unstable APs) than a typical ED population in the UK or US today, so the S1 sample created from these data was a high-risk group. 50 cases were randomly selected from D1 using a random number generator. The average age in this sample was 60.76 years and 44% were female. 24% were smokers, 10% ex-smokers, 16% had a family history of cardiac events, 10% had diabetes, 22% had hypertension, and 6% had hyperlipidemia.
The S2 noncardiac sample was created to understand how the apps responded to lower risk cases of chest pain. This sample was derived from D2, which included D1’s cardiac variables as well as additional symptoms. The 29 lowest risk cases were selected from D2 according to the following criteria: the patient had to be 50 years or younger, with no prior heart attack or angina, and the given gold standard diagnosis had to be noncardiac and low risk. The average age in this sample was 37.1 years and 17.2% were female. 37.9% were smokers, 10.3% ex-smokers, 20.7% had a family history of cardiac events, 3.4% (1 patient) had diabetes, 6.9% had hypertension, and 0% had hyperlipidemia. Since anxiety was a common diagnosis in this group and D2 included no mental health symptoms, the researcher entering cases had to reference the given true diagnosis to include a pertinent symptom (eg “anxious”) for entry.
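A minimal sketch of this sample-construction step is shown below, assuming the deidentified cases are held in pandas DataFrames d1 and d2 with hypothetical column names (age, prev_mi, prev_angina, diagnosis_category); the actual variable coding follows appendix table 7, and the study’s exact selection among equally low-risk D2 cases is not specified here.

```python
import pandas as pd

def make_s1(d1: pd.DataFrame, n: int = 50, seed: int = 0) -> pd.DataFrame:
    """Randomly draw the 50-case high-risk S1 sample from D1."""
    return d1.sample(n=n, random_state=seed)  # stand-in for the number generator

def make_s2(d2: pd.DataFrame, n: int = 29) -> pd.DataFrame:
    """Filter D2 to the lowest-risk noncardiac cases: age 50 or under,
    no prior MI or angina, and a low-risk noncardiac gold standard."""
    low_risk = d2[
        (d2["age"] <= 50)
        & (d2["prev_mi"] == 0)
        & (d2["prev_angina"] == 0)
        & (d2["diagnosis_category"] == "noncardiac")
    ]
    return low_risk.head(n)  # the study's exact choice of the final 29 may differ
```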
Case Entry
The same researcher entered all S1 and S2 cases into WebMD, Isabel, and Ada and standardized symptom wording across cases. All symptoms present in a case were entered into the symptom checker unless the app interface did not allow their entry. The Ada chatbot required some alterations to the case entry process. For Ada chatbot questions about symptoms that were not listed in D2, the researcher input “I don’t know” instead of “no,” so as not to dismiss lines of questioning that might have led to the correct final diagnosis. For patients with a final diagnosis of “anxiety,” “panic attack,” or “anxiety attack,” the researcher answered “yes” to the chatbot question, “Has he/she been feeling anxious lately?” However, when asked “Has this patient been diagnosed with anxiety or depression?” the researcher answered “no,” since the patient may have been seeking such a diagnosis for the first time, and D2 did not provide past medical history for noncardiac conditions. Anxiety was the only symptom added beyond those in the database.
Matching
An app suggestion was counted as a “true match” with the final diagnosis provided by the physician, and therefore as an accurate diagnosis, if at least one of the following was true:
- The conditions in the pair are exact matches.
- The conditions are alternative names for one another.
- One condition is a more precise description of the other.
- It is reasonable to assume two different doctors might use the two different descriptions to label the same condition.
- One condition directly causes the other.
- One condition conveys the nature of the other in a less precise manner.
- Both conditions are highly related and share many symptoms.
Table 1: Determining Diagnostic Matches

| Sample | Matching process | True match | Categorical match |
| --- | --- | --- | --- |
| S1 cardiac | Blind categorical matching | Identification of the correct category of diagnosis (2-heart attack, 3-unstable angina, 4-stable angina, 5-noncardiac) | … |
| S2 noncardiac | Blind pair matching | Identification of a diagnosis that matches the gold-standard diagnosis according to a set of criteria | A match that is not a true match, but both the true and suggested diagnoses relate to the same region of the body and similar mechanisms of disease |
S1: Blind Categorical Matching for True Matches
Each of the S1 cases was assigned one of these redefined gold standard categories, broadened to accommodate most app suggestions: 1-other cardiac diagnoses, 2-heart attack (MI), 3-unstable angina, 4-stable angina, and 5-noncardiac. As shown in Table 1, categorical matching was therefore used to determine true matches. All remaining diagnoses were sent to a medical professional (HF) for categorization; every one of these undetermined diagnoses fell into either the miscellaneous cardiac category 1 or the noncardiac category 5. The medical professional was blinded to which symptom checkers and cases were associated with the undetermined diagnoses. The researcher and medical professional then discussed the final categorizations.
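An illustrative sketch of how categorical matches translate into an accuracy figure for S1, assuming each app suggestion has already been assigned one of the 5 redefined categories through the blinded step above; the case identifiers and values are hypothetical.

```python
GOLD = {"case_01": 2, "case_02": 5}              # case id -> gold category
SUGGESTED = {"case_01": [2, 1], "case_02": [5]}  # categorized app suggestions

def true_match_rate(gold: dict, suggested: dict) -> float:
    """Fraction of cases whose gold category appears among the app's
    categorized suggestions."""
    hits = sum(gold[c] in suggested.get(c, []) for c in gold)
    return hits / len(gold)
```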
S2: Blind Diagnostic Pair Matching for True and Categorical Matches
Since S2 cases had specific gold standard diagnoses, a blinded pair matching process was used to determine whether a diagnostic suggestion was considered a match. First, the primary researcher assigned a preliminary “match” status to suggestions that exactly matched the wording of the gold standard diagnosis and a “categorical match” status to suggestions that were not identical to the gold standard but related to the same organ system. Diagnostic categories included gastric conditions, respiratory conditions, viral infections, and mood disorders. This metric was created to add additional information on app accuracy since S2 true match sensitivities were low, and identifying the general disease category can inform the types of care a patient may seek. Then, the researcher created pairs of one gold standard diagnosis and one corresponding suggested diagnosis. The order of listed pairs was randomized and sent to the medical professional for blinded true and categorical pair matching. Finally, the researcher and medical professional discussed and agreed on all final match and categorical match designations.
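The order randomization in that step can be illustrated with a short sketch; the diagnosis pairs below are hypothetical examples, not cases from the study.

```python
import random

def blind_pairs(pairs, seed=0):
    """Shuffle (gold standard, app suggestion) pairs so the blinded
    reviewer cannot infer which case or app produced a suggestion."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled

sheet = blind_pairs([
    ("anxiety attack", "panic disorder"),         # hypothetical pair
    ("musculoskeletal pain", "costochondritis"),  # hypothetical pair
])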
Development and Testing of Prediction Models for Acute Cardiac Events
We compared the results of the 3 symptom checkers to a machine learning algorithm trained on the original chest pain datasets to predict AMI. Wang et al developed a logistic regression model on the same original dataset to predict the probability of AMI from patient-reported factors. Their study demonstrated 1) that their models could accurately predict AMI probability based solely on patient-reported data and 2) that the datasets we used were, at baseline, predictive of acute cardiac events. We therefore ran their logistic regression model (LR model) on the S1 and S2 samples to compare AMI identification rates with the 3 symptom checkers. Table 2 shows the LR model from the Wang study with its beta coefficients.
Table 2: Variables, logistic regression beta coefficients, and P values for the Wang LR model [Wang et al, 2001]

| Variable | Beta | P value |
| --- | --- | --- |
| Intercept | −6.1005 | 0.0001 |
| Age | 0.0674 | 0.0001 |
| Smokes | 0.7002 | 0.0020 |
| L arm pain | 0.7165 | 0.0005 |
| Pleuritic pain | −2.9265 | 0.0048 |
| Sharp | −1.0132 | 0.0017 |
| Sweating | 1.1307 | 0.0001 |
| Nausea | 0.9580 | 0.0007 |
| Episodic | −2.0136 | 0.0100 |
| Previous ang | −0.9689 | 0.0001 |
| Previous MI | −0.7715 | 0.0013 |
| Sex | 0.5236 | 0.0195 |
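To make the table concrete, the following sketch applies these published coefficients through the standard logistic function to produce an AMI probability for a single case; the variable names and the 0/1 coding (eg sex = 1 for male) are assumptions for illustration, not the original study’s encoding.

```python
import math

# Beta coefficients transcribed from Table 2 (Wang et al, 2001)
BETAS = {
    "age": 0.0674, "smokes": 0.7002, "l_arm_pain": 0.7165,
    "pleuritic_pain": -2.9265, "sharp": -1.0132, "sweating": 1.1307,
    "nausea": 0.9580, "episodic": -2.0136, "previous_ang": -0.9689,
    "previous_mi": -0.7715, "sex": 0.5236,
}
INTERCEPT = -6.1005

def ami_probability(case: dict) -> float:
    """Logistic model: P(AMI) = 1 / (1 + exp(-(intercept + sum(beta_i * x_i))))."""
    z = INTERCEPT + sum(beta * case.get(name, 0) for name, beta in BETAS.items())
    return 1.0 / (1.0 + math.exp(-z))

# eg, a 60-year-old male smoker with sweating and nausea (assumed coding)
p = ami_probability({"age": 60, "sex": 1, "smokes": 1, "sweating": 1, "nausea": 1})
```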
The original Wang model was trained on the whole dataset collected in Edinburgh.20 The S1 and S2 test sets were found to include 14 cases that also appeared in the model’s training set. We therefore removed those 14 test cases and retrained the logistic regression parameters on the remaining Edinburgh cases available for this study, using an 80:20 split for training and validation. The LR model was then run on the S1 and S2 sets to produce a calculated probability of AMI for each case. By adjusting the model’s cutoff threshold for declaring a positive case, we generated AMI/no-AMI predictions at varying levels of sensitivity and specificity. For the S1 sample, the goal was to predict 100% of AMI patients to ensure safety. The cutoff threshold was then adjusted to match each symptom checker’s sensitivity to AMI so that app performance could be compared. The probability threshold originally set on the S1 dataset was then applied to the S2 dataset to determine the percentage of these negative cases correctly classified as negative for ACS and low risk by the LR model.
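A sketch of this retraining and threshold-adjustment step under stated assumptions: X and y are the Edinburgh cases with the 14 overlapping test cases already removed, X_s1 and y_s1 are the S1 features and AMI labels as NumPy arrays, and the cutoff is set to the highest value that still flags every true AMI in S1 (the 100% sensitivity goal).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def refit_and_threshold(X, y, X_s1, y_s1):
    # 80:20 split for training and validation, mirroring the text
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    val_acc = model.score(X_val, y_val)  # sanity check on the refit

    probs = model.predict_proba(X_s1)[:, 1]
    # highest cutoff that still captures 100% of S1 AMI cases
    cutoff = probs[np.asarray(y_s1) == 1].min()
    preds = probs >= cutoff
    return model, cutoff, preds, val_acc
```

Matching each app’s sensitivity would then amount to sweeping the cutoff until the model’s S1 sensitivity equals that app’s, before applying the same cutoff to S2.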