Cardiovascular Phenotyping in Breast Cancer Patients Treated With HER2-Targeted Therapies Using Informatics Approaches

Abstract

Background
Cardiotoxicity is a serious adverse event associated with some of the most effective breast cancer therapies. Currently, it is difficult to predict which patients will develop cardiotoxicity due to the multiplicity of clinical, behavioral, and biological factors involved.

Methods
Here we describe an effort to apply biomedical informatics approaches to patient data from MedStar Health's EHR systems to discover and characterize factors that contribute to cardiotoxicity in a real-world breast cancer population.

Results
Data wrangling techniques, including merging data from disparate clinical systems, data transformation, and de-identification of personal health information (PHI), were applied to the raw clinical data to produce a structured, integrated dataset for predictive analysis and hypothesis generation. Using this dataset as input, we showed how predictive models can be developed to identify patients at high risk for cardiotoxicity.

Conclusions
We demonstrate how such models can be used for hypothesis generation and data exploration with the ultimate goal of developing applications for precision medicine.

Background
Electronic health records (EHR) contain a variety of rich information regarding patient diagnoses, treatments, and health outcomes. Analysis of these data using informatics techniques can uncover important information regarding drug efficacy, outcomes, and adverse events at both the individual patient and population levels [1][2][3][4][5]. Extracting knowledge from raw data mined from clinical systems requires extensive data wrangling to shape, clean, validate, and transform the data into a form that can be readily consumed by downstream analysis processes [6].
We employ EHR data mining and analysis to address the problem of cardiotoxicity in a real-world cohort of HER2-positive breast cancer patients treated with trastuzumab. HER2-positive tumors are defined by overexpression of human epidermal growth factor receptor 2 (HER2/neu) and comprise approximately 20%-25% of all breast cancers; if untreated, they have the poorest prognosis among breast cancer subtypes. The monoclonal antibody trastuzumab is among the most effective treatments for HER2-positive breast cancer. Recognized potential cardiac side effects of trastuzumab are an asymptomatic decline in left ventricular ejection fraction (LVEF, the fraction of blood that leaves the left ventricle during contraction), known as left ventricular systolic dysfunction (LVSD) [7], and an increased risk of congestive heart failure (CHF) [3]. Trastuzumab is often paired with anthracycline chemotherapy, which appears to lead to potentially more permanent LVSD and a greater incidence of CHF than the use of trastuzumab alone [7][8][9].
Despite widespread use, many questions remain unanswered about which patients will be the most susceptible to side effects from trastuzumab and about the optimal treatment protocols [10].
Here we report on our efforts to extract, integrate, and analyze data from EHR systems at the MedStar Washington Hospital Center (MWHC), a hospital in the MedStar Health network. This project was evaluated by the Georgetown-Howard Universities Center for Clinical and Translational Science Institutional Review Board and was approved for detailed review (GU IRB #: 2016-1255).

Methods
Patient data including demographics, drug administrations, and cardiology information were extracted from MedStar clinical systems and prepared for analysis. We then performed data visualization, statistical modeling, and other analysis to identify the factors most predictive of cardiotoxic events. Our ultimate aim was to develop a framework for clinical decision support and precision medicine. Figure 1 shows our overall analysis workflow.

Clinical EHR Systems
To address the critical need to use EHR data to better understand trastuzumab-related cardiotoxicity in breast cancer patients, we identified patients with available diagnosis, lab, demographic, and cardiology information. This required a cohort discovery strategy to identify data sources residing in disparate systems across the MedStar Health network. Our initial data source was ARIA, the oncology EHR system that contains patient demographics, diagnoses, lab results, drug orders, and clinical notes, among other data elements. The multi-modality image management system Xcelera was used as a data source for echocardiogram data from MWHC.

Clinical Data Extraction, Filtering, Integration, and Cleaning
We investigated patients diagnosed with breast cancer who were treated with trastuzumab and had valid echocardiogram data from MWHC available for analysis. Figure 2 illustrates our data extraction and filtering process.
In order to identify the patients in our cohort, we executed queries against ARIA using the ICD-9 diagnosis codes for female breast cancer (174.0, 174.1, 174.2, 174.3, 174.4, 174.5, 174.6, 174.8, and 174.9) identifying a set of 11,560 patients for further consideration. Next we queried the drug administration tables for these patients and determined that 702 of these patients received trastuzumab at a MedStar facility.
Using medical record numbers (MRNs) from these 702 patients, we queried the MWHC Xcelera system for LVEF, left ventricular dimensions and mass, and parameters of diastolic function. Of these, 307 patients had an MRN associated with MWHC, and we were able to obtain echocardiogram data for 160 of them.
Next, we identified a baseline LVEF measurement for each patient from an echocardiogram acquired within a period of two years prior to the first administration of chemotherapy. This required merging the drug administration information with data from echocardiograms and formulating temporal queries to ensure that the LVEF measurements occurred within a two-year window prior to trastuzumab administration. Of the 160 patients, we were able to identify 95 patients with valid baseline LVEF measurements and additional measurements after trastuzumab administration.
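The temporal query for a baseline measurement can be illustrated with a minimal Python sketch. It is not the query we ran against the clinical systems: the function names, the tuple layout, the approximation of two years as 730 days, and the choice of the most recent eligible echocardiogram as baseline are all assumptions made for illustration.

```python
from datetime import date, timedelta

# Assumption: "two years" is approximated as 730 days.
BASELINE_WINDOW = timedelta(days=730)


def select_baseline_lvef(echos, first_dose_date, window=BASELINE_WINDOW):
    """Return the (date, lvef) echo to use as baseline, or None.

    echos: iterable of (date, lvef) tuples for one patient.
    Assumption: an echo on the day of the first dose still counts as
    pre-treatment, and the most recent eligible echo is the baseline.
    """
    eligible = [
        (echo_date, lvef)
        for echo_date, lvef in echos
        if first_dose_date - window <= echo_date <= first_dose_date
    ]
    if not eligible:
        return None
    return max(eligible, key=lambda echo: echo[0])


def has_follow_up(echos, first_dose_date):
    """True if at least one echo was acquired after the first dose."""
    return any(echo_date > first_dose_date for echo_date, _ in echos)


# Hypothetical patient: one echo too old, two in the window, one follow-up.
echos = [
    (date(2012, 1, 1), 60.0),
    (date(2014, 6, 1), 58.0),
    (date(2015, 1, 10), 55.0),
    (date(2015, 6, 1), 45.0),
]
first_dose = date(2015, 2, 1)
baseline = select_baseline_lvef(echos, first_dose)
keep_patient = baseline is not None and has_follow_up(echos, first_dose)
```

A patient stays in the analysis set only when both conditions hold, which mirrors the requirement for a valid baseline plus at least one later measurement.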
A patient was determined to have a cardiac event if the LVEF dropped below 50 and by more than 10% from baseline, or if the LVEF dropped by more than 16% from baseline. This is consistent with clinical guidelines [11]. Using these guidelines, we identified 21 patients with cardiotoxic events.
We then produced a consolidated file containing the study data for downstream analysis. This required de-identification of PHI, including patient MRNs and procedure dates, and the calculation of derived variables such as age at baseline and time to cardiotoxic event. Table 1 compiles descriptive statistics about our patient cohort.
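The event rule can be written as a short predicate. The sketch below is illustrative only and rests on one stated assumption: the drops of 10% and 16% are read as absolute LVEF percentage points rather than relative changes. The function and parameter names are hypothetical.

```python
def is_cardiotoxic_event(baseline_lvef, follow_up_lvef,
                         lower_limit=50.0,
                         drop_with_low_lvef=10.0,
                         drop_alone=16.0):
    """Apply the two-part event rule to a single follow-up measurement.

    Assumption: drops are absolute LVEF percentage points.
    """
    drop = baseline_lvef - follow_up_lvef
    below_limit_with_drop = (
        follow_up_lvef < lower_limit and drop > drop_with_low_lvef
    )
    large_drop = drop > drop_alone
    return below_limit_with_drop or large_drop


def first_event(baseline_lvef, follow_ups):
    """Return the earliest (time, lvef) follow-up meeting the rule, or None.

    follow_ups: iterable of (time, lvef) pairs; time may be a date or a
    day offset, as long as the values are mutually comparable.
    """
    for when, lvef in sorted(follow_ups, key=lambda item: item[0]):
        if is_cardiotoxic_event(baseline_lvef, lvef):
            return when, lvef
    return None


# Hypothetical follow-ups given as (days since first dose, LVEF).
event = first_event(60.0, [(30, 58.0), (90, 47.0), (60, 49.0)])
```

Returning the earliest qualifying measurement also yields the time point needed for a time-to-event outcome.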
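As an illustration of the derived variables, the following sketch replaces dates with an age and a day count and swaps the MRN for a study identifier. The record layout, the field names, and the choice of the first trastuzumab dose as the time origin are assumptions for this example; dropping direct identifiers in this way is only one part of a full de-identification procedure.

```python
from datetime import date


def age_in_years(birth_date, on_date):
    """Whole years elapsed between birth_date and on_date."""
    years = on_date.year - birth_date.year
    if (on_date.month, on_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def to_study_record(record, study_id):
    """Build an analysis record with no MRN and no calendar dates.

    Assumption: time to event is counted in days from the first dose.
    """
    event_date = record["event_date"]
    days_to_event = (
        (event_date - record["first_dose_date"]).days
        if event_date is not None
        else None
    )
    return {
        "study_id": study_id,
        "age_at_baseline": age_in_years(
            record["birth_date"], record["baseline_date"]
        ),
        "days_to_event": days_to_event,
    }


# Hypothetical source record.
raw = {
    "mrn": "12345",
    "birth_date": date(1960, 5, 20),
    "baseline_date": date(2015, 1, 10),
    "first_dose_date": date(2015, 2, 1),
    "event_date": date(2015, 8, 1),
}
study_record = to_study_record(raw, "P001")
```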

Results
Through the extensive data wrangling efforts described above, we were able to create a dataset for the investigation of cardiotoxic events associated with a specific, widespread breast cancer therapy. We further analyzed the dataset using a combination of data visualization and modeling techniques to extract clinically meaningful information and create a prototype clinical decision support framework. Our analysis was able to reproduce findings from other research groups showing that LVEF at baseline is an important factor for predicting cardiotoxicity [9].

Data Visualization
We employed a number of techniques to visualize the data for our patient cohort. The EventFlow software [12,13] allowed us to visualize, align, and query large amounts of temporal data related to drug dosing and echocardiogram measurements. Figure 3 shows patients aligned by diagnosis. Patients 994 and 995 clearly show a series of low LVEF values after trastuzumab administration, indicating a cardiotoxic event. This software greatly enhanced our multidisciplinary team's collaborative data analysis sessions by allowing for real-time data exploration and hypothesis generation.

Statistical and Probabilistic Network Modeling
A Cox proportional hazards regression framework was considered, with time to cardiotoxic event as the outcome. In addition to the variables indicated in Table 1, we also considered treatment with 12 other drugs (dichotomous yes/no variables). Race was coded as either "Black or African American" or "Other." We employed the LASSO approach to identify the variables most associated with this outcome via the "glmnet" package in R [14,15], using leave-one-out cross validation (CV) to select the model with the minimum CV error. Due to several missing echocardiogram measurements, we considered 5 missing data imputations using the "mice" R package [16]. For 4 of the 5 imputed datasets, the recommended model selected only baseline LVEF; for the remaining imputed dataset, it additionally selected septal thickness. In a univariate Cox regression, baseline LVEF showed a highly significant association with time to cardiotoxic event (estimated HR per 10 units = 0.45, p=0.0048).
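To help interpret the reported hazard ratio, the sketch below rescales it to other increments of baseline LVEF. It relies only on the log-linear form of the Cox model, in which the hazard ratio for a change of d units is exp(beta * d); the function name is hypothetical and the code does not refit any model.

```python
import math


def rescale_hazard_ratio(hr, from_units, to_units):
    """Convert a hazard ratio per `from_units` to one per `to_units`.

    Under the Cox model, HR(d) = exp(beta * d), so
    HR(to) = HR(from) ** (to / from).
    """
    return hr ** (to_units / from_units)


hr_per_10 = 0.45  # reported estimate per 10 units of baseline LVEF

# Implied log-hazard coefficient per single LVEF unit.
beta_per_unit = math.log(hr_per_10) / 10

hr_per_1 = rescale_hazard_ratio(hr_per_10, 10, 1)
hr_per_5 = rescale_hazard_ratio(hr_per_10, 10, 5)

print(f"beta per unit: {beta_per_unit:.4f}")
print(f"HR per 1 unit: {hr_per_1:.3f}")
print(f"HR per 5 units: {hr_per_5:.3f}")
```

Read this way, an HR of 0.45 per 10 units means that a baseline LVEF 10 units higher is associated with a 55% lower estimated hazard of a cardiotoxic event.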
In addition, a probabilistic network model of cardiotoxicity was developed using the BayesiaLab software [17]. The Markov Blanket learning algorithm [18] was applied to the data using the K-Means method with K=3 bins to discretize the continuous variables and a structural complexity influence coefficient of 0.35. This parameter balances data fitting against network complexity in the learning algorithm. In addition to the variables indicated in Table 1, the TNM staging variables (stage, primary tumor, regional lymph nodes, distant metastasis, and grade) and treatment information with 14 other drugs (including 14 patients treated with doxorubicin, an anthracycline) were included in the analysis. The algorithm identified baseline LVEF and left ventricular end diastolic dimension (LVEDD) as important factors for predicting cardiotoxicity. After performing 5-fold CV, only LVEF was found to be robust, as it was included in 4 of the 5 models generated. LVEDD was found in only 2 of the 5 models generated. The model had an in-sample classification accuracy of 82% and a 5-fold CV accuracy of 75%. Probabilistic networks are valuable for hypothesis generation and can be dynamically queried to produce belief measures given information that is known about a patient. For example, our multidisciplinary team was able to play "what if" by setting the model variables to specific values of interest to dynamically compute the probability of a cardiotoxic event.
In summary, we successfully created a valuable dataset for use in cardiotoxicity research and demonstrated the use of this dataset by creating predictive models of cardiotoxicity. We hope that others can benefit from our experiences and that this methodology can be extended to other disease areas.
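The discretization step can be illustrated with a generic one-dimensional K-means (Lloyd's algorithm) in plain Python. This is a sketch of the general technique with K=3, not BayesiaLab's implementation, whose initialization and stopping rules may differ; the function names, the starting centers, and the sample values are assumptions.

```python
def kmeans_1d(values, k=3, iterations=100):
    """Cluster scalar values with Lloyd's algorithm; return sorted centers.

    Assumption: initial centers are spread evenly over the sorted values
    so that the result is deterministic.
    """
    xs = sorted(values)
    n = len(xs)
    centers = [xs[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda j: abs(x - centers[j]))
            clusters[nearest].append(x)
        # Keep the old center if a cluster received no points.
        new_centers = [
            sum(cluster) / len(cluster) if cluster else centers[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers)


def discretize(value, centers):
    """Return the bin index (0 = lowest) of the center nearest to value."""
    return min(range(len(centers)), key=lambda j: abs(value - centers[j]))


# Hypothetical baseline LVEF values.
lvef_values = [35, 38, 40, 52, 55, 57, 65, 68, 70]
centers = kmeans_1d(lvef_values, k=3)
bins = [discretize(v, centers) for v in lvef_values]
```

Each continuous variable is replaced by its bin index, giving the network learning algorithm a three-level categorical variable.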

Discussion
Our study provided useful data for assessing cardiotoxicity in breast cancer patients treated with trastuzumab, as well as an important set of lessons learned. A major challenge in using clinical data in hospital EHR systems is that these systems are largely designed for single-patient clinical care and not for research on large patient cohorts. Bulk EHR data extraction is difficult and requires intimate knowledge of backend data stores as well as technical and administrative access. The majority of the time spent on this effort involved getting access to and wrangling data from numerous clinical systems in order to create a structured dataset that could be used for statistical analysis and modeling. A further challenge we faced was incomplete or missing data. Out of a total of 11,560 female breast cancer patients, only 702 were recorded as having received trastuzumab. However, we would expect 20%-25% of the breast cancer patients to have HER2-positive tumors and most of those patients to be treated with trastuzumab, which means we are capturing only 24%-30% of the expected patient population. Many patients were also excluded from our analysis because we could not locate valid LVEF measurements or determine a baseline LVEF value. This illustrates the general problem of incomplete data in EHR systems, which can be due to patients being seen at outside institutions or being lost to follow-up. In addition, some of the patients with missing LVEF data may have had multigated acquisition (MUGA) scans rather than echocardiograms, and therefore their LVEF values would not be represented in our dataset. In the future, MUGA data can be extracted from patient records, increasing the number of patients in the analysis.
Data completeness can also be improved by using proxy diagnoses to supplement missing ICD-9 codes. For example, a female patient treated with trastuzumab who has evidence of breast cancer in her medical record can be regarded as a breast cancer patient even if no breast cancer-associated ICD-9 code is found. This technique will allow more patients to be considered in the analysis.
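The capture estimate above follows from simple arithmetic, sketched here in Python. The calculation assumes, as an upper bound on the expected count, that every HER2-positive patient would receive trastuzumab; the function name is hypothetical.

```python
def capture_range(treated, total, her2_low=0.20, her2_high=0.25):
    """Return (lowest, highest) fraction of expected patients captured.

    Assumption: all HER2-positive patients are expected to be treated.
    """
    expected_low = total * her2_low    # fewest expected HER2-positive
    expected_high = total * her2_high  # most expected HER2-positive
    return treated / expected_high, treated / expected_low


lowest, highest = capture_range(treated=702, total=11560)
print(f"Captured {lowest:.0%} to {highest:.0%} of the expected population")
```

With 11,560 patients, the expected HER2-positive population is between 2,312 and 2,890 patients, so the 702 treated patients correspond to roughly 24% to 30% of it.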
Another issue is the possible overrepresentation of cases due to medical surveillance bias. Patients having complete data in our dataset may have been followed closely by the treating physician because they were determined to be at high risk for a cardiotoxic event. This would result in an overrepresentation of high-risk patients in our analysis set. We found 21 potential events in the 95 patients with sufficient data for analysis. This represents a 22% incidence rate, which is significantly higher than the incidence rate typically found in clinical practice when using trastuzumab alone or with non-anthracycline chemotherapy agents. For example, a retrospective analysis of patients enrolled in seven phase II and III clinical trials found that 3%-7% of patients receiving only trastuzumab had cardiac dysfunction, while 27% of patients receiving trastuzumab and anthracycline plus cyclophosphamide had cardiac dysfunction; however, most of the patients receiving only trastuzumab had been on anthracycline therapy previously [19]. In our study, only 14 patients were treated concomitantly with doxorubicin, a type of anthracycline, and 4 of them had cardiotoxic events, which made it difficult to perform inference on this group.
Once created, the structured dataset allowed us to conduct collaborative data exploration and analysis with a multidisciplinary team of clinician researchers and informatics scientists. The common variable found to be associated with cardiotoxicity across our analyses was low baseline LVEF. We note that 9 patients in our study had LVEF below the normal limit of 50. In practice, the decision to consider trastuzumab needs to weigh risks and benefits. We also know that at least 3 of these 9 patients were in the SAFE-HEaRt study, which enrolled patients with baseline LVEF between 40% and 50% who are on applicable heart medications [20]. We also cannot exclude the possibility that some of them had a later baseline LVEF measurement, possibly through MUGA, which we missed.
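The difficulty of inference on the small doxorubicin subgroup can be illustrated by comparing confidence intervals for the two observed proportions. The Wilson score interval used below is our choice for this illustration and was not part of the analysis reported above; the function name is hypothetical.

```python
import math


def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = events / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    )
    return center - half_width, center + half_width


# Counts taken from the text: 21 of 95 overall, 4 of 14 with doxorubicin.
overall = wilson_interval(21, 95)
doxorubicin = wilson_interval(4, 14)

for label, events, n, (low, high) in [
    ("all patients", 21, 95, overall),
    ("doxorubicin subgroup", 4, 14, doxorubicin),
]:
    print(f"{label}: {events}/{n} = {events / n:.1%}, "
          f"interval {low:.1%} to {high:.1%}")
```

The interval for the 14-patient subgroup is several times wider than the one for the full cohort, which is why the subgroup estimate supports only tentative conclusions.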

Research and Clinical applications
COVID-19 patients with cancer could be at higher risk of adverse cardiac events as a result of cancer treatment [21]. NCI has launched the COVID-19 in Cancer Patients Study (NCCAPS) [22], which will help answer questions about COVID-19's impact on cancer patients. The study is now open to adults and will later be expanded to include children. Our work on this paper could be applied to this use case to reduce the number of cardiotoxic events experienced by COVID-19 patients with cancer.
In 2019, Pishvaian et al. [23] presented a new concept called the Virtual Molecular Tumor Board (VMTB), which allowed clinicians to combine expert-curated data and data from clinical systems with data from molecular diagnostics (MolDx) reports to develop consensus on treatments. It used interconnected cloud-based virtual computing techniques and reduced the time needed for a clinician to assess a patient's tumor profile and suitability for clinical trials from 14 to 4 days. The cleaned dataset and the predictive models from this paper could be used in conjunction with the information presented at a VMTB to enable better matches for clinical trials and also reduce the number of cardiotoxic events experienced by patients in the clinical trials.

Conclusions And Future Work
We plan to continue our work to develop a rich resource that connects clinical cardiology and cancer data by refining our predictive models and adding new data sources, such as data extracted from unstructured clinical notes using natural language processing techniques and data from other institutions. Increasing the number of patients in our analysis will enable us to create more accurate models, which will lead to a better understanding of cardiotoxicity in real-world breast cancer patients. This will ultimately lead to the development of decision support tools in an oncology setting with the goal of reducing the number of cardiotoxic events experienced by patients.

Authors' contributions
MH and SM designed the concept, analyzed the data, and drafted the manuscript. BC, RJ, SB, AA, SR, RT, and FA extracted the data from various electronic systems and helped with data analysis. KB and YG contributed to review and revision of the manuscript. PP and AB provided the clinical context and motivation for the project and reviewed and edited the manuscript. All authors read and approved the final manuscript.