We examined the corpus using an NLP-based approach and identified seven topics regarding BRCA1/2 (3.1). The performance of the rule-based and machine learning-based NLP systems was evaluated and compared (3.2). A timeline view of patients’ medical journeys regarding BRCA1/2 was visualized, and the temporal distribution of topics demonstrated a representative pattern for Precision Medicine practice (3.3). Data quality examination revealed incompleteness and discrepancies in the capture of genetic information in unstructured clinical notes; combining different sources is therefore necessary (3.4). Finally, using cleaned RWD, we were able to identify a significant association between BRCA1/2 mutation and prescription of PARP inhibitors (3.5).
3.1 Topic Identification to Address Contextual Variability in Unstructured Clinical Notes
In total, we extracted 1,179 sentences containing the keywords BRCA1 or BRCA2 for 122 patients; for the remaining 74 of the 196 patients in the cohort, no records matched these keywords. After removing duplicate sentences, 682 unique sentences remained. Table S1 listed the 10 sentences with the lowest sf-ipf values.
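The extraction and de-duplication step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the keyword pattern, the naive sentence splitter, and the function names are all assumptions introduced here, and real clinical notes would need a dedicated sentence tokenizer.

```python
import re

# Hypothetical keyword pattern; the text only states that the keywords were BRCA1 and BRCA2.
BRCA_PATTERN = re.compile(r"\bBRCA\s*[12]\b", re.IGNORECASE)


def split_sentences(note):
    """Naive splitter on ., ! or ? followed by whitespace (an assumption for illustration)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", note) if s.strip()]


def extract_brca_sentences(notes_by_patient):
    """Return (patient_id, sentence) pairs for sentences that mention BRCA1 or BRCA2."""
    hits = []
    for patient_id, notes in notes_by_patient.items():
        for note in notes:
            for sentence in split_sentences(note):
                if BRCA_PATTERN.search(sentence):
                    hits.append((patient_id, sentence))
    return hits


def unique_sentences(hits):
    """De-duplicate on case- and whitespace-normalized text, keeping first-seen order."""
    seen = {}
    for _, sentence in hits:
        key = " ".join(sentence.lower().split())
        seen.setdefault(key, sentence)
    return list(seen.values())


# Toy notes (fabricated for illustration only, not study data).
toy_notes = {
    "p1": ["Patient counseled on BRCA1/2 testing. Follow up in 2 weeks."],
    "p2": ["Patient counseled on BRCA1/2 testing.", "BRCA2 result negative."],
    "p3": ["No genetic discussion today."],
}
toy_hits = extract_brca_sentences(toy_notes)
patients_with_hits = {patient_id for patient_id, _ in toy_hits}
```

Counting `toy_hits`, `patients_with_hits`, and `unique_sentences(toy_hits)` gives the toy analogues of the three figures reported above (all matching sentences, patients with at least one match, and unique sentences).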
After reviewing sentences with sf-ipf < 0.5 (n=46) plus 50 additional randomly selected sentences in the initial phase, two abstractors and one domain expert grouped the sentences into seven topics: Information, Evaluation, Insurance, Order, Negative (Mutation), Positive (Mutation), and VUS. Three of them (Negative, Positive, VUS) were related to patient genetic information. The scope of each topic was also defined explicitly to enable evaluation by abstractors (Table 2). During rule development, the abstractors found no sentence that could not be classified into one of the seven topics.
Table 2 Definition of Topics for Sentences
| Topic | Definition |
|---|---|
| Information | General information about guidelines, genetic testing panels, pathways, biological implications, etc. |
| Evaluation | Physician estimating the risk of having a BRCA1/2 mutation, the benefit of taking a genetic test, etc.; recording any BRCA1/2 mutation-carrier relative and the risk of being a BRCA1/2 carrier based on family history |
| Insurance | Concerns over insurance coverage |
| Order | Physician recording the ordering of tests or waiting for results |
| Negative (Mutation) | Negative mutation result from genetic test |
| Positive (Mutation) | Positive mutation found from genetic test |
| VUS | VUS found from genetic test |
3.2 NLP System for Automatic Topic Classification
Identifying topic-indicating words and assigning topics to each sentence with regular expressions was an iterative evaluation process that continued until the top PMI words and the topic-indicating words converged. From the 682 unique BRCAness-related sentences in EHRs, a word cloud was generated (Figure 2a), with word sizes proportional to their frequencies in the corpus. Although some frequent words were implicitly related to BRCAness, such as “breast”, “ovarian”, “cancer”, “family”, and “history”, the word cloud lacked the clinical context needed to cluster words that share a similar topic. After rule-based topic classification, specific words could be assigned to topics related to either the type of medical encounter (e.g., Information, Evaluation) or the genetic result (Positive/Negative/VUS). As shown in Figure 2b, the highly sparse pattern of the word-topic matrix indicated the specificity of representative words; e.g., “family”, “history”, and “hereditary” were exclusive to the “Evaluation” topic.
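As a rough illustration of how regular-expression rules and PMI can interact in such a loop, the sketch below assigns one topic per sentence and then scores each word against each topic. The patterns, the rule order, the fallback to “Information”, and the per-sentence counting are assumptions made for this example only; they are not the rule set developed in the study.

```python
import math
import re
from collections import Counter

# Hypothetical topic-indicating patterns, checked in order (illustrative only).
TOPIC_RULES = [
    ("VUS", re.compile(r"\bvus\b|uncertain significance", re.I)),
    ("Negative", re.compile(r"\bnegative\b|no (pathogenic )?mutation", re.I)),
    ("Positive", re.compile(r"\bpositive\b|\bpathogenic\b", re.I)),
    ("Insurance", re.compile(r"\binsurance\b|\bcoverage\b", re.I)),
    ("Order", re.compile(r"\border(ed)?\b|\bpending\b", re.I)),
    ("Evaluation", re.compile(r"\bfamily history\b|\brisk\b|\bhereditary\b", re.I)),
]


def assign_topic(sentence):
    """Return the first topic whose pattern matches; fall back to 'Information'."""
    for topic, pattern in TOPIC_RULES:
        if pattern.search(sentence):
            return topic
    return "Information"


def pmi_table(sentences):
    """PMI(word, topic) = log2(P(word, topic) / (P(word) * P(topic))), counted per sentence."""
    n = len(sentences)
    topic_counts, word_counts, joint_counts = Counter(), Counter(), Counter()
    for sentence in sentences:
        topic = assign_topic(sentence)
        topic_counts[topic] += 1
        # Each word is counted at most once per sentence.
        for word in set(re.findall(r"[a-z0-9]+", sentence.lower())):
            word_counts[word] += 1
            joint_counts[(word, topic)] += 1
    return {
        (word, topic): math.log2(
            (joint / n) / ((word_counts[word] / n) * (topic_counts[topic] / n))
        )
        for (word, topic), joint in joint_counts.items()
    }
```

Words with high PMI for a topic are candidates for new topic-indicating patterns; in an iterative setup, the rules would be revised and PMI recomputed until the two lists stop changing.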
The final evaluation of our rule-based system was conducted by two abstractors with an inter-rater agreement (kappa statistic) of 0.95; the system achieved an overall precision of 0.87, recall of 0.93, and F-measure of 0.91. Performance metrics of the machine learning classifiers were listed in Table 3 and Table 4. The four-topic classifier achieved better overall performance than the seven-topic classifier (overall F-measure of 0.9 vs. 0.82). In the seven-topic setting, recall was low for “Order” (0.5) and “Insurance” (0.625). This may be due to the limited number of instances for these two topics (29 patients with the “Order” topic and 8 patients with the “Insurance” topic).
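For reference, per-class precision, recall, and F-measure can be computed from gold and predicted labels as in the sketch below. The function name and the toy labels are assumptions for illustration; the F-measure here is the harmonic mean of precision and recall, which is consistent with the per-class rows reported in Table 4 (for example, precision 1 and recall 0.5 give 0.667).

```python
from collections import Counter


def per_class_metrics(gold, predicted):
    """Return {label: (precision, recall, f_measure)} for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    metrics = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_n = tp[label] + fp[label]
        gold_n = tp[label] + fn[label]
        precision = tp[label] / predicted_n if predicted_n else 0.0
        recall = tp[label] / gold_n if gold_n else 0.0
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        metrics[label] = (precision, recall, f_measure)
    return metrics


# Toy labels (fabricated): one of two "Order" sentences is misclassified as "Negative".
toy_gold = ["Order", "Order", "Negative", "Negative"]
toy_pred = ["Order", "Negative", "Negative", "Negative"]
toy_metrics = per_class_metrics(toy_gold, toy_pred)
```

With few instances of a class, a single missed sentence moves recall substantially, which is one way to read the low recall for the rarer topics.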
Feature importance of the two random forest classifiers was calculated based on Gini impurity/information gain[47] and was provided in Table S2 and Table S3. Feature importance from the random forest classifiers agreed with our rule-based PMI calculation in the majority of cases. For example, “vus”, “negative”, “pathogenic”, “order”, “risk”, and “analysis” were considered important topic indicators, and they ranked high in both “average impurity decrease” and “number of nodes using that attribute”.
Table 3 Performance of Machine Learning System on Four-Topic Classification
| Class | Precision | Recall | F-Measure |
|---|---|---|---|
| Information | 0.933 | 0.925 | 0.929 |
| Positive | 0.757 | 0.8 | 0.778 |
| Negative | 0.879 | 0.879 | 0.879 |
| VUS | 1 | 0.958 | 0.979 |
| Overall | 0.901 | 0.899 | 0.9 |
Table 4 Performance of Machine Learning System on Seven-Topic Classification
| Class | Precision | Recall | F-Measure |
|---|---|---|---|
| Information | 0.714 | 0.645 | 0.678 |
| Evaluation | 0.911 | 0.895 | 0.903 |
| Insurance | 1 | 0.625 | 0.769 |
| Order | 1 | 0.5 | 0.667 |
| Positive | 0.707 | 0.829 | 0.763 |
| Negative | 0.789 | 0.909 | 0.845 |
| VUS | 0.92 | 0.958 | 0.939 |
| Overall - ML | 0.833 | 0.823 | 0.82 |
| Overall - Rule | 0.87 | 0.93 | 0.91 |
3.3 Temporal Visualization and Examination of Topics
To examine temporal patterns in topic distributions, BRCA1/2-related events were mapped to a timeline. Figure 3a listed individual timelines for 5 patients. Patients shared a similar temporal pattern in their medical journey, starting with “Evaluation” and “Information”, followed by “Order”, optionally “Insurance”, and finally genetic information (“Positive”, “Negative”, and “VUS”). Figure 3b showed a heatmap of count percentages by the temporal order of topics across all patients (1 is the earliest encounter and 14 the last; the total number of encounters varies by patient). The results in Figure 3b agreed with our observation from the individual view in Figure 3a that “Evaluation” and “Information” encounters often appear earliest in the timeline. The initial “Order” of genetic tests followed immediately after “Evaluation” was performed and “Information” was communicated to patients. “Insurance” occurred more frequently after the initial “Order”, and it sometimes took several encounters to receive confirmation from insurance companies before proceeding with genetic tests. Result-related topics (“Positive”, “Negative”, and “VUS”) were mentioned repeatedly because, at every encounter, physicians refer back to the medical history to initiate or change treatment plans. Among the result-related topics, “Positive” and “VUS” results were reported and documented earlier than “Negative” results.
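One way to build the kind of summary behind a topic-by-encounter-order heatmap is sketched below. It is an illustration under stated assumptions: each distinct date in a patient's record is treated as one encounter, dates are ISO-formatted strings, and percentages are normalized within each topic. How the published heatmap was normalized is not stated here, and the function name and toy timelines are hypothetical.

```python
from collections import Counter, defaultdict


def topic_position_percentages(timelines):
    """Map topic -> {encounter_rank: percent of that topic's mentions at that rank}.

    timelines: {patient_id: [(iso_date, topic), ...]}; rank 1 is the earliest encounter.
    """
    counts = defaultdict(Counter)
    for events in timelines.values():
        # Assumption: each distinct date is one encounter; ISO dates sort chronologically.
        dates = sorted({date for date, _ in events})
        rank = {date: i + 1 for i, date in enumerate(dates)}
        for date, topic in events:
            counts[topic][rank[date]] += 1
    result = {}
    for topic, by_rank in counts.items():
        total = sum(by_rank.values())
        result[topic] = {r: 100.0 * c / total for r, c in sorted(by_rank.items())}
    return result


# Toy timelines (fabricated for illustration only).
toy_timelines = {
    "A": [("2019-01-10", "Evaluation"), ("2019-01-10", "Information"),
          ("2019-02-01", "Order"), ("2019-03-05", "Positive")],
    "B": [("2020-05-02", "Evaluation"), ("2020-06-01", "Order"),
          ("2020-06-01", "Insurance"), ("2020-07-15", "Negative"),
          ("2020-08-15", "Negative")],
}
toy_heatmap = topic_position_percentages(toy_timelines)
```

Each inner dictionary corresponds to one row of the heatmap; topics that concentrate at low ranks are those that tend to open the patient journey.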
3.4 BRCA1/2-related Real-world Data Quality in Unstructured Clinical Notes
We examined the data quality of RWD from unstructured sources with a focus on the completeness and discrepancies of genetic test results. We compared the results documented in unstructured clinical notes in the EHR with Foundation Medicine reports. Among Foundation-positive patients (N=12), 75% (N=9) had matching EHR records of their positive BRCA1/2 mutation. For VUS, missingness was much greater: only 5 of 24 patients (20.8%) had their VUS recorded in the EHR. From Figure 2b, we could see more positive cases from EHR records (N=48) than Foundation-provided positive patients (N=9). The reason is that our information extraction system also extracted previous germline BRCA1/2 panel test results.
Among all data elements in Table 1, five data elements, “Variant_Type”, “Variant_Source”, “Variant_Pathogeneity_Reported”, “Variant_Classification”, and “HGVS_short”, were considered most relevant to represent genetic information. Completeness of genetic information captured in clinical notes was listed in the last column of Table 1. We found that the current capture rate of all data elements was low. “Variant_Type” (8 out of 16, 50%), “Variant_Pathogeneity_Reported” (7 out of 16, 43.8%), “Variant_Classification” (1 out of 16, 6.3%), and “HGVS_short” (4 out of 16, 25%) should be recorded for both positive and VUS patients, whereas “Variant_Source” (24 out of 91, 26.4%) should be available for all negative, positive, and VUS patients because it can be derived from the genetic testing panel type. In some cases, “Variant_Type” was not extracted explicitly but could be inferred from extracted information that follows HGVS nomenclature; for example, S34F and c.7759C>T are each a “single-nucleotide variant” and amplification is a “copy number variation”. Among the five data elements, “Variant_Classification” was least frequently documented: only one patient, with a “splice_site” variant, had it documented.
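The capture rates quoted above are simple proportions of documented cases over eligible patients; the sketch below recomputes them from the counts given in the text. The helper name and the choice of half-up rounding to one decimal place are assumptions made here so that the output matches the reported figures.

```python
from decimal import Decimal, ROUND_HALF_UP


def capture_rate(captured, eligible):
    """Percentage of eligible patients with the element documented, to one decimal place."""
    pct = Decimal(100 * captured) / Decimal(eligible)
    # Half-up rounding is an assumption; it reproduces the percentages quoted in the text.
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# (captured, eligible) counts as reported in the text above.
reported_counts = {
    "Variant_Type": (8, 16),
    "Variant_Pathogeneity_Reported": (7, 16),
    "Variant_Classification": (1, 16),
    "HGVS_short": (4, 16),
    "Variant_Source": (24, 91),
}
capture_rates = {name: capture_rate(c, n) for name, (c, n) in reported_counts.items()}

# Element with the lowest capture rate.
least_documented = min(capture_rates, key=capture_rates.get)
```

Keeping the denominators explicit matters here because “Variant_Source” is evaluated against all tested patients while the other elements apply only to positive and VUS patients.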
Figure 4 displayed a timeline view of a single patient (Patient 3). In this view, we could see that the patient discussed the risk of having a BRCA1/2 mutation and the benefits of genetic testing with the physician on 08/30/2017. After the discussion, the patient took the test, and the results came back positive. However, there was a discrepancy in the clinical notes: the result was first documented as a positive mutation in BRCA2 but later revised to BRCA1 (marked with a rectangle). This example demonstrated the benefit of using a timeline view to perform data quality checks in the future.
3.5 Mutation – Medication Association
Figure 5 showed patients’ genetic mutation status validated by Foundation Medicine reports, patients’ PARP inhibitor discussion status from clinical notes, and patients’ PARP inhibitor prescription status from UDP, CDM reports, and clinical notes combined. The “PARPi discussion” patient type referred to patients with only a discussion of this drug, while “PARPi prescription” referred to patients with confirmed prescriptions from UDP, CDM reports, or the “current_medication” section of clinical notes. “PARPi discussion+prescription” referred to patients with both a discussion of this drug in clinical notes and a confirmed prescription. Because no patient had “PARPi prescription” without “PARPi discussion”, we did not display “PARPi prescription” in the figure. Fisher's exact test was used to test for differences between patients with different BRCA1/2 mutation status. The test on the count data revealed a strong association between patients’ BRCA1/2 mutation status and the prescription of PARP inhibitors (p-value = 0.0004).
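For readers unfamiliar with the test, a two-sided Fisher's exact test on a 2x2 table can be sketched in pure Python as below. The contingency counts in the example are hypothetical placeholders, since the underlying table is not reproduced in this section, and the two-sided rule used (summing all tables no more probable than the observed one) is one common convention rather than a statement of the exact procedure used in the study.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum every table at least as extreme (no more probable) than the observed one;
    # the small relative tolerance guards against floating-point ties.
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)
    )


# Hypothetical counts for illustration only (NOT the study data):
# rows = BRCA1/2 mutation positive / not positive,
# columns = PARP inhibitor prescribed / not prescribed.
hypothetical_p = fisher_exact_two_sided(6, 6, 2, 60)
```

In practice the same test is available as `scipy.stats.fisher_exact`; the hand-rolled version above is only meant to make the calculation explicit.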