A Trend of Eight-Years Big Data Analytics of Electronic Medical Records to Review and Study Diagnosis and Treatment of Coronary Artery Disease in Different Genders

Background: Cardiovascular Disease (CVD), and Coronary Artery Disease (CAD) in particular, is one of the leading causes of death and morbidity in the United States. Notably, women continue to have worse outcomes than men. The causes of these discrepancies have yet to be fully elucidated. The main objective of this study is to detect gender discrepancies in outcome using data analytics to risk stratify ~32,000 patients with CAD among the 960,129 patients treated at UCSF Medical Center over an eight-year period. As an implementation of clinical care, this study's long-term goal is to improve precision diagnosis and, ultimately, management of CAD for both early detection and identification of patients at risk for rapid progression of the disease. Methods: We designed and implemented a multidimensional framework to trace patients from admission through treatment as a path of events. The time between events for a similar set of paths was calculated. Then the average waiting time for each step of the treatment was calculated for men and women. Finally, we applied statistical analysis to determine differences in time between diagnosis and treatment steps for men and women. Discussions: There were statistically significant gender-based differences in the common path of diagnosis and treatment of patients with CAD. The average time from the first visit to diagnostic Cardiac Catheterization was more than 2 months longer for women than for men (358.77 vs. 291.83 days). By contrast, the difference in average time from diagnostic Cardiac Catheterization to treatment Cardiac Catheterization and Coronary Artery Bypass Grafting (CABG) was not significant. Women with CAD requiring revascularization have a significantly longer interval between their first physician encounter indicative of CVD and their first diagnostic cardiac catheterization compared to men. Avoiding the delay in diagnosis and treatment will provide a better outcome for patients at risk.

Statistical analysis. It shows the statistical analysis from the very first event to diagnostic Cardiac Catheterization, and from diagnostic Cardiac Catheterization to treatment Cardiac Catheterization and CABG.


Introduction
Cardiovascular Disease (CVD) encompasses a broad range of conditions. Coronary Artery Disease (CAD), commonly referred to as Ischemic Heart Disease (IHD), is the leading cause of death and morbidity in the United States and globally. Aggarwal et al. [1], in their study on sex differences in CVD, suggested that despite advances in treatment and survival, it is still the leading cause of death among women. For example, compared to men, women are less likely to be accurately diagnosed. Several non-traditional health occurrences in women predispose them to CVD, including early menopause and menarche, gestational diabetes mellitus, and hypertension. Gender, ethnic, racial, and age discrepancies within CVD diagnosis and treatment exist and have been well reported [1][2][3][4]. Nearly half of all African American adults (47.7 percent of women and 46.0 percent of men) have some form of CVD [3]. Although the overall guidelines and management of CVD are similar in most aspects for both genders, gender-based variations in the pathophysiology, symptomatology, presentation, efficacy of diagnostic tests, and response to pharmacological interventions do exist.
We summarized these differences in one of our previous works on gender-based differences in CVD [5]. The etiology of the differences is less well understood. In a study by Ong et al. [6], the authors suggested that sex hormones affecting blood pressure could play a major role in the disparate development of CVD. Even though higher diastolic blood pressure was reported in men, higher systolic pressure, which is a greater risk factor for CVD, was reported in women. Regardless of etiology, it is apparent that women have poorer outcomes than men when it comes to CAD in particular. A major reason may be a delay in diagnosis or a different treatment algorithm as compared to their male counterparts.
The main objective of this study is to find these discrepancies using a multidimensional data analytics framework to risk stratify CAD as a subgroup of the CVD patient population at UCSF. We create a cohort selection that allows for simple manipulation and search of the data within the Clinical and Research Data Warehouse (CRDW). This facilitates rapid familiarization with, and hypothesis testing of, the data set. As an example, we hypothesized that there are gender-based discrepancies in the diagnosis and treatment of CAD.
With such a large patient database and the infrastructure for data abstraction in place at the UCSF Bakar Computational Health Sciences Institute, we have been able to describe these discrepancies. We believe that specific studies for individual patients based on medical record profiles with demographic information are more accurate for improving health outcomes in patients with CAD. Our long-term goal is to translate the multi-dimensional big data that is generated at the University of California System to directly improve and assist clinical care decision making. This translation would ultimately improve outcomes for patients and reduce cost.
In previous work, we created a database as a comprehensive resource for research, comprising 126 papers and 68 datasets relevant to CAD diagnosis, extracted from the scientific literature between 1992 and 2018 [7]. We showed significant research outcomes on Mayo Clinic patient data by implementing a novel model for survival analysis [8], a recommendation and treatment plan for new patients based on patient similarity [9], and a novel prediction model [10]. To date, none of our prior research has specifically discussed gender-based differences. A systematic review of gender-based studies of diagnosis and treatment of CAD in the last 20 years [5] shows discrepancies in outcomes of CAD between men and women [2,3,6,11-22]. However, the causes of these discrepancies have yet to be fully elucidated and require further detailed analysis to design interventions and structures to minimize bias. Studies suggest that knowledge and awareness of bias reduce discrimination, and therefore our publication will aid in decreasing physician bias [23]. In addition, unique situations germane to women, such as pregnancy and hormone therapy, make it more challenging to diagnose female patients, some of them young, with CAD in a timely manner.
In this study, we traced potential paths to shed light on the causes of gender discrepancies, and by using path analysis, we uncovered delays and possible gender-based differences in diagnosis and treatment of CAD. These differences in healthcare delivery, methodology, diagnosis, procedure, and the time interval between diagnostic procedure and therapeutics may have a significant impact on patients' health and outcome.
Personalized treatment for individuals based on a particular EMR profile may significantly reduce unnecessary treatments and cost, and potentially the morbidity and mortality of downstream procedures associated with incorrect or late diagnosis. Moreover, improved precision may change the debate surrounding the standard guidelines based on gender and other individual characteristics of the patients. Suggesting new guidelines based on patient characteristics will help providers in both detection and management of patients at risk of rapid progression of CAD, and of CVD generally, and will be an innovation in clinical care. The findings of this study will allow us to better identify the systemic causes of discrepancies within CAD treatment and pinpoint the best methods for intervention to reduce them.
In the following sections, we describe the overview of the study design from the hypothesis definition to future work. We explain the study cohort, followed by data preprocessing, the data dictionary, data processing, and data analytics. Next, we show the validation and results. In the last section we discuss the results, study limitations, and next steps. Finally, we conclude with the impact of the study, its innovation, and its perspective on clinical competencies in medical records and patient care.

Study Design and Overview
This study is designed around a basic workflow of several steps: hypothesis definition, study cohort and population, data dictionary creation, data preprocessing, data processing, data analytics, validation and results, and finally future steps. Figure 1 illustrates the major components of the study from hypothesis definition to future plan.
We define the existence of discrepancies across different genders in:
diagnosis and the time of diagnosis.
procedures, including invasive and noninvasive procedures.
the time interval between diagnosis, medication order, and procedure.

Study Cohort and Population
Our data analytics was built using EMR data on 960,129 patients admitted to UCSF between July 2011 and December 2018. This study does not include any human subjects or experimental protocol. All data from the De-Identified Clinical Data Warehouse (De-ID CDW) were authorized for access as "de-identified" by the University of California San Francisco, and all IDs and metadata (e.g., location) have been removed. All methods were carried out in accordance with relevant guidelines and regulations at UCSF.
De-ID CDW is a de-identified database copy of high-value EHR data. Therefore, this data is not subject to HIPAA restrictions on research use, and hence IRB approval, an honest broker intermediary, and informed consent were waived by the UCSF Research data team committee. The De-ID CDW system accelerates the research process by permitting UCSF investigators to locate research data and by encouraging an exploratory approach to hypothesis generation. The De-ID CDW is available to the UCSF research community.
After authorization to access "de-identified" EMR data for research, and in consultation with cardiac, thoracic, and vascular surgeons, cardiologists, and cardiovascular epidemiologists, the following cohort identification criteria were developed:
Patients with Coronary Artery Disease (CAD), commonly referred to as Ischemic Heart Disease (IHD), based on the ICD10 code (I20-I25), were included.
Patients with missing values, specifically for the ICD10 code, were excluded.
Patients defined as unknown or with an unspecified definition were excluded.
To be included in this cohort, patients needed to meet the above criteria, leading to a cohort size of 32,904 CAD patients. Vitals such as cholesterol (HDL), cholesterol (LDL), cholesterol (total), systolic blood pressure, diastolic blood pressure, BMI, and age have been considered. Demographic characteristics such as ethnicity have been considered. Smoking status (never smoked, current every day smoker, former smoker, passive smoke exposure) is a very important characteristic and has been considered as well. Co-morbidities (e.g. hypertension, liver disease, hyperlipidemia, diabetes, dialysis) for patients with CAD are calculated for both genders. All vitals, characteristics, and co-morbidities are shown in Fig. 2. This data set consisted of de-identified patient ID, demographic information (e.g. gender), and diagnosis based on ICD10 code, as shown in Table 1. The details of the ICD codes are described in supplementary material Table S1 (ICD10details). For procedure codes, we used Current Procedural Terminology (CPT) and the date of procedure services for both invasive and non-invasive procedures. For medication, we used the medication code, medication name, and date of the orders.
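The cohort criteria above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the study's implementation: the record layout, the sample rows, and the reading of "unknown/unspecified" as applying to the gender field are all hypothetical.

```python
import re

# Hypothetical record layout: (patient_id, gender, icd10_code).
# Field names and rows are illustrative, not the De-ID CDW schema.
RECORDS = [
    ("p1", "Female", "I25.10"),
    ("p2", "Male", "I21.4"),
    ("p3", "Female", None),       # missing ICD10 code -> excluded
    ("p4", "Unknown", "I20.0"),   # unknown gender (assumed meaning) -> excluded
    ("p5", "Male", "E11.9"),      # not in the I20-I25 block
    ("p1", "Female", "I20.9"),    # same patient, second CAD diagnosis
]

# ICD10 block I20-I25 (ischemic heart diseases), with an optional subcode.
CAD_CODE = re.compile(r"^I2[0-5](\.\w+)?$")


def select_cad_cohort(records):
    """Return {patient_id: gender} for patients with at least one I20-I25 code."""
    cohort = {}
    for patient_id, gender, code in records:
        if code is None or gender not in ("Female", "Male"):
            continue  # exclusion criteria
        if CAD_CODE.match(code.strip().upper()):
            cohort[patient_id] = gender
    return cohort


cohort = select_cad_cohort(RECORDS)
print(cohort)  # {'p1': 'Female', 'p2': 'Male'}
```

Keying the result by patient ID means a patient with several qualifying diagnoses is counted once.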

Data Preprocessing
Patients whose medical history does not include at least one element from the set of CPT codes were eliminated from the initial patient cohort. By doing so, the number of patients was reduced to ~23,000. Before proceeding, the CPT codes were mapped and translated to procedure names (e.g. EKG, CABG for Coronary Artery Bypass Grafting) based on our dictionary. The medication history data set contains patient ID, medication name, medication code, therapeutic class, pharmaceutical class, pharmaceutical subclass, date the medication was ordered, and gender. A similar translation was done for medications based on the medication dictionary, and medications were assigned to main classes.
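A minimal sketch of this filtering and translation step follows. The dictionary keys are placeholders rather than real CPT codes, and the data layout is hypothetical. Dropping unmapped events for retained patients is an assumption the text does not confirm.

```python
# Hypothetical code-to-name dictionary. The keys are placeholders, not real
# CPT codes, and this is not the study's dictionary.
PROCEDURE_NAMES = {
    "CPT_A": "EKG",
    "CPT_B": "Diagnostic Cardiac Catheterization",
    "CPT_C": "CABG",
}

# Hypothetical procedure history: patient_id -> list of (code, date) pairs.
HISTORY = {
    "p1": [("CPT_A", "2014-03-01"), ("CPT_C", "2015-01-10")],
    "p2": [("CPT_X", "2013-07-22")],  # no code of interest -> patient dropped
    "p3": [("CPT_B", "2016-05-05"), ("CPT_X", "2016-06-01")],
}


def preprocess(history, names):
    """Keep patients with at least one known code; translate codes to names."""
    kept = {}
    for patient_id, events in history.items():
        # Assumption: events with unmapped codes are discarded.
        translated = [(names[code], when) for code, when in events if code in names]
        if translated:
            kept[patient_id] = translated
    return kept


cleaned = preprocess(HISTORY, PROCEDURE_NAMES)
print(sorted(cleaned))  # ['p1', 'p3']
```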

Data Processing and Statistical Analytics
Our approach was based on time-series patient data. Because of the large and diverse patient cohort at UCSF, we could follow each patient from the initial interaction with the UCSF medical system through any medication orders and invasive/noninvasive CAD-related procedures over months and years of treatment. For each patient, the sequence of events was created from the time of initial presentation to the UCSF medical system to the last invasive procedure recorded as of the date of data extraction (e.g. CABG as one of the important targets). We have implemented methods to determine the first suspicion of CAD by providers (primary care and/or cardiologist). We measured the time between different events (e.g. the time between prescribing aspirin or any other medication and ordering the EKG test, or from EKG test to CABG) and found the sequence of events for each patient and for groups of similar patients.
One of the novelties of this study is tracing multiple dimensions of patients' treatment over time; that is, we look at both medications and procedures over the course of treatment. We merged the sequences of medication orders and procedures into a single time-series sequence from the time of admission to the end of treatment as recorded in the EMR. Event time was defined as the time from the date of the first event (e.g. prescribing aspirin, ordering a stress test) until the date of the next event (e.g. ordering an EKG test), and so on for each subsequent event. All medications and procedures from the dictionary can count as the first event in the patient records. We explored all existing event paths (e.g. aspirin => EKG test => diagnostic Cardiac Catheterization => CABG) for individual patients. Then we calculated the time interval, in days, between each pair of events. The data set was divided into separate data sets for men and women. For each set, we grouped the rows with the same "Path" and compiled the days spanned into a list containing the day counts from different patients.
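The path construction described above can be sketched as follows. The event log and names are hypothetical. For brevity the sketch pairs only consecutive events, whereas the study may also have considered non-adjacent pairs, such as first event to diagnostic catheterization.

```python
from collections import defaultdict
from datetime import date

# Hypothetical merged event log: (patient_id, gender, event_name, event_date).
EVENTS = [
    ("p1", "F", "Aspirin", date(2014, 1, 1)),
    ("p1", "F", "EKG", date(2014, 3, 2)),
    ("p1", "F", "Diagnostic Cath", date(2014, 12, 27)),
    ("p2", "M", "Aspirin", date(2015, 5, 1)),
    ("p2", "M", "EKG", date(2015, 5, 31)),
    ("p3", "F", "Aspirin", date(2016, 2, 1)),
    ("p3", "F", "EKG", date(2016, 2, 21)),
]


def days_by_path(events):
    """Map (gender, 'A => B') to the list of day gaps seen across patients."""
    per_patient = defaultdict(list)
    for patient_id, gender, name, when in events:
        per_patient[(patient_id, gender)].append((when, name))

    grouped = defaultdict(list)
    for (_, gender), timeline in per_patient.items():
        timeline.sort()  # chronological order
        # Assumption: only consecutive events are paired.
        for (d1, e1), (d2, e2) in zip(timeline, timeline[1:]):
            grouped[(gender, f"{e1} => {e2}")].append((d2 - d1).days)
    return dict(grouped)


paths = days_by_path(EVENTS)
print(paths[("F", "Aspirin => EKG")])  # [60, 20]
```

Each list of day gaps is then the input to the per-path summary statistics for that gender.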
Upon completion of the list of days for each path, the mean, standard deviation, and number of patients (essentially the length of the list of days) were calculated for both the men and women data sets. As the very last step, the men and women sets were merged, or concatenated, on the same paths. Then, two-sample t-tests were performed for each row to evaluate whether the differences between the average delay days for men and women were statistically significant. Differences in delay time between groups were assessed with the p-value. Table 3 shows a few examples of the results of data analytics.
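As a rough illustration of the per-path comparison, the sketch below computes a two-sample t statistic from two lists of delay days. The text does not say whether a pooled-variance or unequal-variance test was used, so Welch's variant is an assumption here. The p-value uses a normal approximation that is only reasonable for large samples, and the sample values are invented.

```python
import math
from statistics import mean, stdev


def welch_t(sample_a, sample_b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(sample_a), len(sample_b)
    va = stdev(sample_a) ** 2 / na  # squared standard error, group A
    vb = stdev(sample_b) ** 2 / nb  # squared standard error, group B
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df


def approx_two_sided_p(t):
    """Normal approximation to the two-sided p-value (large samples only)."""
    return math.erfc(abs(t) / math.sqrt(2))


# Hypothetical delay-day lists for one path (not study data).
women_days = [400, 350, 380, 420, 330, 360]
men_days = [300, 280, 310, 290, 270, 320]

t_stat, dof = welch_t(women_days, men_days)
p_value = approx_two_sided_p(t_stat)
print(f"t = {t_stat:.2f}, df = {dof:.1f}, approx p = {p_value:.3g}")
```

Where SciPy is available, `scipy.stats.ttest_ind(women_days, men_days, equal_var=False)` gives the exact t-distribution p-value and is preferable for small groups.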

Results
After processing the data, the next step is to search for evidence of a delay in definitive diagnosis and treatment of CAD in women compared to men. The first step in validating this hypothesis is to determine the first encounter with a physician at which a patient was suspected of having cardiovascular disease. We included treatment paths with suspicion of potential cardiovascular disease that combined both procedures and medications. Initially, our experts determined that aspirin is one of the drugs frequently ordered early upon encountering a patient at risk of cardiovascular disease. Thus, as a first analytic step, we calculated the time from the first aspirin prescription (other medications have been considered too) to the first diagnostic Cardiac Catheterization, and then from diagnostic Cardiac Catheterization to treatment procedures such as percutaneous coronary intervention (treatment Cardiac Catheterization) and coronary artery bypass graft (CABG). In the medication data, 40 medication codes were considered as aspirin, spanning 2 therapeutic classes defined as analgesics and antiplatelet, which include the pharmaceutical classes analgesic antipyretics, salicylates, analgesics, salicylate and non-salicylate comb, bulk chemicals, and platelet aggregation inhibitors. These medications fall under the pharmaceutical subclasses defined as salicylate analgesics, salicylate analgesics with non-salicylate analgesics combinations, and salicylate analgesics buffered. Our dictionary with complete information about aspirin and its classifications is in supplementary material Table S6 (aspirin). Table 4 shows data analytics for all different paths with aspirin as a starting point. Table 5.
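The first analytic step (time from the first aspirin order to the first diagnostic Cardiac Catheterization) might look like the sketch below. The patient records and event names are hypothetical. Skipping patients whose first catheterization precedes their first aspirin order is a simplifying assumption.

```python
from datetime import date
from statistics import mean

# Hypothetical per-patient records: patient_id -> (gender, [(event, date), ...]).
PATIENTS = {
    "p1": ("F", [("Aspirin", date(2013, 1, 10)),
                 ("Aspirin", date(2013, 6, 1)),
                 ("Diagnostic Cath", date(2014, 1, 10))]),
    "p2": ("M", [("Aspirin", date(2015, 3, 1)),
                 ("Diagnostic Cath", date(2015, 9, 27))]),
    "p3": ("F", [("Diagnostic Cath", date(2016, 4, 4))]),  # no aspirin -> skipped
}


def first_date(events, name):
    """Earliest date on which the named event occurs, or None if absent."""
    dates = [when for event, when in events if event == name]
    return min(dates) if dates else None


def mean_aspirin_to_cath_days(patients):
    """Mean days from first aspirin order to first diagnostic cath, per gender."""
    gaps = {}
    for gender, events in patients.values():
        start = first_date(events, "Aspirin")
        end = first_date(events, "Diagnostic Cath")
        # Assumption: patients with a cath before any aspirin order are skipped.
        if start is not None and end is not None and end >= start:
            gaps.setdefault(gender, []).append((end - start).days)
    return {gender: mean(days) for gender, days in gaps.items()}


averages = mean_aspirin_to_cath_days(PATIENTS)
print(averages)  # {'F': 365, 'M': 210}
```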
In the next step, we considered procedures as well as medications as the starting point, to find the path between the very first event (any related medication or procedure, because for some patients the first event is a noninvasive procedure and not a medication order) at the time of admission and subsequent steps such as diagnostic Cardiac Catheterization and treatment Cardiac Catheterization. As shown in

Discussion
In this study, we explored the use of data analytics to reveal gender-based discrepancies in the diagnosis and treatment of CAD. We hope that recognizing a clear delay in diagnosis (i.e. time to diagnostic catheterization in women) will change practice and result in improved outcomes for women with CAD through early detection. We have implemented methods to determine the first suspicion of CAD by providers (primary care and/or cardiologist). We measured the time interval between different events (e.g. the time between prescribing a medication and ordering the cardiac stress test) and found the sequence of events for each patient and for groups of similar patients. We used statistical analyses to find the differences between women and men. Our results, based on the analysis of a subset of patients with CAD, support the hypothesis that there are discrepancies in the diagnosis and treatment of CAD based on patient demographic characteristics such as gender.
As the first step, we used landmarks (e.g. aspirin initiation) as a trigger for early suspicion of CAD, followed by other markers and medications (e.g. beta-blockers, statins), and then by noninvasive and invasive cardiac procedures. As the next step, we used all identifiable cardiovascular-related medications as a starting point, instead of just aspirin, to expand the patient cohort, and found significant discrepancies. Finally, we changed the starting point to include both medications and procedures.
We discovered that when women with the eventual diagnosis of severe CAD are started on aspirin, it takes them longer than men to start beta-blockers, drugs known to reduce cardiovascular risk. Our analysis shows that women who have undergone CABG waited on average 358 days to get the "Gold Standard" diagnostic Cardiac Catheterization, followed by an extra 127 days to undergo CABG for severe CAD. Men who have undergone CABG waited on average 291 days to get the "Gold Standard" diagnostic Cardiac Catheterization, followed by an extra 77 days to undergo CABG for severe CAD. From a starting point of any first event (e.g. aspirin order, cardiac stress test order), on average it takes over two months longer for women to undergo CABG compared to men. In patients with left main and multi-vessel CAD or unstable angina, the risk of a CAD event is high. For example, if 50 percent are at risk of some event in 6 months (ACS, STEMI, NSTEMI, or sudden cardiac death), then it can be extrapolated that a delay of 2 months would result in a 17 percent increased risk for women compared to men. Our goal was to simplify hypothesis testing as much as possible for healthcare providers and researchers. We showed that the kind of data analytics used in this study is sufficient to find the discrepancies within cardiovascular diagnosis and treatment. While our work focused on the UCSF data, we anticipate that our approach can be applied to other databases of patient data with similar levels of success (e.g. UC System-wide data). Based on our analysis, the interval from the first event to diagnostic Cardiac Catheterization is the one with a statistically significant difference (p-value = 0.000119), while the differences for diagnostic Cardiac Catheterization to CABG and for diagnostic Cardiac Catheterization to treatment Cardiac Catheterization are not statistically significant.
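The arithmetic behind the extrapolation in the paragraph above can be checked with a short calculation. It assumes, as the example implies, that risk accrues linearly over the 6-month window. The 30.44-day average month length is a conversion assumption introduced here.

```python
# Example figures from the text: 50 percent event risk over 6 months, and a
# 2-month delay. Linear accrual of risk over the window is an assumption.
risk_over_window = 0.50
window_months = 6
delay_months = 2

extra_risk = risk_over_window * delay_months / window_months
print(f"{extra_risk:.1%}")  # 16.7%, reported as about 17 percent

# Mean days from first visit to diagnostic Cardiac Catheterization (abstract).
women_days, men_days = 358.77, 291.83
gap_days = women_days - men_days
gap_months = gap_days / 30.44  # assumed average month length in days
print(f"{gap_days:.2f} days, about {gap_months:.1f} months")
```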
In summary, our research has important implications for initiatives aimed at improving the use of EMR to find the possible reasons for a different outcome in women versus men, or for differences based on other patient characteristics. Several efforts are devoted to finding the different outcomes and risk factors in different genders, but the reasons for the differences are not yet fully identified.
Our work is not without limitations. First, a key limitation of this study is the lack of reliable medication history before admission to UCSF. Because of this limitation, we may not have captured some of the medications that patients had been taking in the years prior to their first admission to UCSF. We are planning to overcome this limitation by considering unstructured clinical notes in our future study. With that, we will have access to the history of patients before 2011, which is the starting point of data collection in our study cohort. Second, because UCSF Medical Center is a tertiary referral center for CABG, some patients are admitted when they have already had a diagnostic Cardiac Catheterization. For this group of patients, who arrived from other institutions, the code for diagnostic Catheterization is sometimes not entered. As a result, the number of patients with the path from diagnostic Cardiac Catheterization to CABG is reduced.
Although the number of patients with CABG is 752, a subset of the patients have the starting point of CAD on admission to UCSF before undergoing therapeutic procedures at UCSF (CABG or therapeutic Catheterization).
We are planning to find a patient profile that describes rapidly progressive CAD and to flag these patients for frequent and regular cardiovascular assessment. We will develop interactive visualization tools for providers, payers, and researchers to assist with personalized treatment plans for individual patients with specific characteristics, based on new guidelines and suggestions delivered as EMR order sets. Our long-term goal is to translate the multi-dimensional big data, including EMR, that is generated at the University of California System to directly improve and assist clinical care decision making, which would ultimately improve outcomes for patients and reduce cost. Moreover, this study lays the foundation for developing novel translational interventions through powerful big data-driven analytics that leverage the wide availability of UC System patient data.
As an implementation of clinical care, this study's goal is to improve precision diagnosis and, ultimately, management of CVD for both early detection and identification of patients at risk for rapid progression of the disease. As a clinical care outcome, we will provide the protocol in an EMR order set format for early detection of severe CAD in patients at risk for rapid progression. As an example, for a woman with a history of hormone therapy, pregnancy with hypertension at an early age, family history, and increased BMI, we can expedite more sensitive testing (stratified and varied order sets depending on that patient's risk profile) instead of long-term therapy with medications (e.g. aspirin, statins, beta-blockers), and diagnose the CAD expeditiously. As assistive tools for providers, payers, and researchers, we are planning to deliver interactive visualization tools, EMR order sets, and a recommendation system for accessing, searching, and reusing data, together with guidelines for the treatment of individual patients with specific characteristics. The outcome of this research lays the foundation for developing novel translational interventions through powerful big data-driven analytics that leverage the wide availability of UC System patient data.

Conclusion
Although the overall guidelines and management of CAD are similar for both genders, gender-based variations in the pathophysiology, symptomatology, presentation, efficacy of diagnostic tests, and response to pharmacological interventions do exist. When features and predictive variables differ between men and women, decision making based on unified platforms and guidelines for diagnosis and treatment appears to lead to poorer outcomes in women than in men. Therefore, studies on CAD based on individual characteristics (e.g. demographics) will have a big impact on the diagnosis and treatment of CAD.
There are discrepancies in the delivery of healthcare in general across different genders. Women with severe CAD requiring revascularization have a significantly longer interval between their first physician encounter indicative of cardiovascular disease and their first diagnostic cardiac catheterization compared to men. These differences in healthcare delivery, methodology, diagnostic procedure, and the time interval between diagnostic procedure and therapeutics may have a significant impact on patients' health and outcome.
Personalized treatment for individuals based on a specific EMR profile and demographic characteristics may significantly reduce unnecessary treatments and costs, and potentially the morbidity and mortality of downstream procedures associated with wrong or late diagnosis. Moreover, improved precision may change the debate surrounding the standard guidelines based on gender and other individual characteristics of the patients. Developing updated gender-based guidelines will help providers with both early detection and management of individual patients at risk of rapid progression of CAD, and of CVD generally, and will be an innovation in clinical care.

Figure 1. Study Overview and Architecture. This figure illustrates the major components of the study from hypothesis definition to the results and future plan.