In this study, we generated an integrative and standardized data resource for RWD in precision oncology in four steps: 1) we semi-automatically collected RWD from EHRs for the key elements (e.g., gene, variant, disease, drug) of a previously proposed precision oncology knowledge model; 2) we normalized the collected data for further data integration; 3) we integrated the collected data using a schema from the Genetic Testing Ontology (GTO)[16], which captures the semantic meaning and semantic relations in the collected data; 4) we generated PO2RDF using D2RQ[17]. The workflow of this study is shown in Fig. 1.
Oncology Cohort
Our cohort includes a total of 2,593 patients who underwent Foundation Medicine tumor mutation testing. Foundation Medicine offers three different types of tumor panels, which cover a range of 709 genes. All patients in the cohort have granted research authorization and are older than 18 years. This research project was approved by the Mayo Clinic Institutional Review Board (IRB# 13-009317) and followed the ethical standards of the responsible committee on human experimentation.
Data Retrieval
Based on the institutional oncology cohort, we semi-automatically collected RWD from genetic reports and electronic health records (EHRs). To integrate data from genetic reports and EHRs, we linked patient IDs by comparing 1) patient clinic number, 2) first and last name and 3) date of birth. Following our previously proposed precision oncology knowledge model[18], we extracted three types of data elements: “genetic information” (“gene” + “variant”), “disease” and “drug”. The data sources used to retrieve the three data elements are listed in Table 1. While “genetic information” was extracted from genetic reports only, “disease” and “drug” were retrieved from multiple sources, including genetic reports, the unified data platform (UDP), which is a structured clinical data warehouse of Mayo Clinic[19], and unstructured clinical notes. “Disease” was drawn from both genetic reports and the UDP, and we extracted only cancer-related diagnosis information. To extract “drug” concepts from unstructured clinical notes, we leveraged a dictionary from HemOnc.org[20], which curates comprehensive oncology medication knowledge. Sentences in patients’ clinical notes that mentioned drug concepts were extracted using MedTagger[21], a natural language processing (NLP) system. MedTagger supports a series of NLP processes, including dictionary-based concept indexing, keyword mention lookup, and regular expression matching[22]. Both drug brand names and chemical names were looked up, and all matches were normalized to the chemical name.
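The patient linkage step can be illustrated with a minimal sketch. It assumes that a genetic report and an EHR record refer to the same patient only when all three criteria agree exactly after trivial normalization; the field names (`clinic_number`, `first_name`, `last_name`, `date_of_birth`) and the exact-match rule are hypothetical and are not taken from the study's implementation.

```python
from datetime import date


def normalize_name(name):
    """Lower-case and collapse whitespace so trivial formatting differences do not block a match."""
    return " ".join(name.lower().split())


def link_key(record):
    """Composite key built from clinic number, first and last name, and date of birth."""
    return (
        str(record["clinic_number"]).strip(),
        normalize_name(record["first_name"]),
        normalize_name(record["last_name"]),
        record["date_of_birth"],
    )


def link_records(report_records, ehr_records):
    """Return (report, ehr) pairs for which all three linkage criteria agree."""
    ehr_index = {link_key(r): r for r in ehr_records}
    pairs = []
    for report in report_records:
        match = ehr_index.get(link_key(report))
        if match is not None:
            pairs.append((report, match))
    return pairs


# Toy records (fabricated for illustration only).
reports = [
    {"clinic_number": "100", "first_name": "Ann ", "last_name": "Lee",
     "date_of_birth": date(1960, 5, 1)},
    {"clinic_number": "200", "first_name": "Bob", "last_name": "Ray",
     "date_of_birth": date(1955, 2, 3)},
]
ehrs = [
    {"clinic_number": 100, "first_name": "ann", "last_name": "LEE",
     "date_of_birth": date(1960, 5, 1)},
    {"clinic_number": 200, "first_name": "Bob", "last_name": "Ray",
     "date_of_birth": date(1955, 2, 4)},  # date of birth differs, so no link
]
linked = link_records(reports, ehrs)
```

In this toy input only the first report is linked, because the second pair disagrees on date of birth.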
Table 1
Data Source | Gene | Variant | Disease | Drug |
Genetic Reports | Y | Y | Y | |
UDP | | | Y | Y |
Clinical notes | | | | Y |
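The dictionary-based drug lookup and the normalization of brand names to chemical names can be sketched as follows. This is not MedTagger and does not use its API; it is a simplified regular-expression illustration. The three-entry dictionary stands in for the HemOnc.org vocabulary, and the sentence splitter is a naive assumption.

```python
import re

# Tiny stand-in for the HemOnc.org dictionary (brand name -> chemical name).
BRAND_TO_CHEMICAL = {
    "tarceva": "erlotinib",
    "iressa": "gefitinib",
    "erbitux": "cetuximab",
}
CHEMICAL_NAMES = set(BRAND_TO_CHEMICAL.values())

# Match longer terms first; word boundaries avoid partial-word hits.
ALL_TERMS = sorted(set(BRAND_TO_CHEMICAL) | CHEMICAL_NAMES, key=len, reverse=True)
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in ALL_TERMS) + r")\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_drug_mentions(note):
    """Return (sentence, chemical_name) pairs for every drug mention in a note."""
    mentions = []
    for sentence in split_sentences(note):
        for match in PATTERN.finditer(sentence):
            term = match.group(1).lower()
            # Brand names map to their chemical name; chemical names map to themselves.
            mentions.append((sentence, BRAND_TO_CHEMICAL.get(term, term)))
    return mentions


note = "Patient started Tarceva in March. No adverse events. Later switched to gefitinib."
mentions = extract_drug_mentions(note)
```

Here `mentions` contains one pair per drug-bearing sentence, each with the normalized chemical name, and the sentence without a drug mention is dropped.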
Data Normalization
To facilitate data manipulation and integration, we normalized the RWD extracted from multiple sources. In this study, we mapped “gene”, “variant”, “disease” and “drug” concepts to the Unified Medical Language System (UMLS)[23] via the batch processing function offered by the MetaMap API[24]. The mapping results generated by MetaMap include the UMLS preferred terms along with mapping scores. For variants that could not be mapped to UMLS concepts, we manually normalized the variant names to HGVS nomenclature[25].
Data Integration
We leveraged the schema of a previously developed ontology, GTO, to integrate the collected RWD. GTO defines seven primary classes, namely ‘Diseases’, ‘Gene’, ‘Variant’, ‘Test’, ‘Phenotype’, ‘Risk’ and ‘Drug’, and the relationships among these classes[16]. We used four of GTO’s primary classes, namely ‘Diseases’, ‘Gene’, ‘Variant’ and ‘Drug’, together with the following object properties: ‘AssociatedWithGene’ (Domain: ‘Disease’ and Range: ‘Gene’), ‘MayTreatedBy’ (Domain: ‘Disease’ and Range: ‘Drug’), ‘HasContraindicationWith’ (Domain: ‘Drug’ and Range: ‘Disease’), and ‘AssociatedWithVariant’ (Domain: ‘Gene’ and Range: ‘Variant’).
We inherited GTO’s data properties, especially the identifiers that link to external knowledgebases such as Online Mendelian Inheritance in Man (OMIM)[26] and the National Drug File Reference Terminology (NDF-RT)[27]. In addition, we added identifiers as data properties that link to other precision oncology knowledgebases, such as CIViC_Entrez_ID for identifying ‘Gene’ and CIViC_DOID for identifying ‘Disease’ in CIViC. We also incorporated drugs’ brand names (Brand_Name) and categories (Drug_Category) according to HemOnc as additional data properties, and we added a new class, ‘Patient’, to our data schema. The data properties defined for each class, along with the related object properties, are shown in Table 2.
‘Disease’ and ‘Gene’ relationships were considered valid for diagnoses made up to one year before the genetic test. ‘Drug’ and ‘Gene’ associations (object properties) were considered valid for drug prescriptions made up to one year after the genetic test and include targeted therapies only. ‘Disease’ and ‘Drug’ associations (object properties) were considered valid for drug prescriptions made after the disease diagnosis. For an individual patient, we counted each ‘Disease’ and ‘Drug’ association only once.
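The validity windows above can be expressed as simple date predicates. The sketch below is illustrative only: it assumes that “one year” means 365 days, that window boundaries are inclusive, and that a prescription on the day of diagnosis counts as “after” it. None of these details are specified in the text, and all function names are hypothetical.

```python
from datetime import date, timedelta

ONE_YEAR = timedelta(days=365)  # assumption: one year approximated as 365 days


def disease_gene_valid(diagnosis_date, test_date):
    """Diagnosis made no more than one year before the genetic test (inclusive bounds assumed)."""
    return test_date - ONE_YEAR <= diagnosis_date <= test_date


def drug_gene_valid(prescription_date, test_date, is_targeted_therapy):
    """Targeted-therapy prescription made no more than one year after the genetic test."""
    return is_targeted_therapy and test_date <= prescription_date <= test_date + ONE_YEAR


def disease_drug_valid(diagnosis_date, prescription_date):
    """Prescription made on or after the disease diagnosis (same-day inclusion assumed)."""
    return prescription_date >= diagnosis_date


def unique_disease_drug_pairs(events):
    """Count each (patient, disease, drug) association once per patient.

    events: iterable of (patient_id, disease, drug, diagnosis_date, prescription_date).
    """
    return {
        (patient_id, disease, drug)
        for patient_id, disease, drug, diagnosed, prescribed in events
        if disease_drug_valid(diagnosed, prescribed)
    }


# Toy events (fabricated): two valid prescriptions of the same drug, one before diagnosis.
events = [
    ("P001", "NSCLC", "erlotinib", date(2018, 6, 1), date(2018, 7, 1)),
    ("P001", "NSCLC", "erlotinib", date(2018, 6, 1), date(2018, 9, 1)),
    ("P001", "NSCLC", "cetuximab", date(2018, 6, 1), date(2018, 5, 1)),
]
pairs = unique_disease_drug_pairs(events)
```

Using a set makes the once-per-patient counting rule explicit: repeated prescriptions of the same drug for the same disease collapse into one association.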
Table 2
Description of Data Properties and Related Object Properties
Class | Data Property | Related Object Property |
Patient | Patient_ID, Date_of_Birth, Race, Ethnicity, Gender | HasMutGene, HasVariant, HasDisease, TreatedBy |
Gene | Gene_Name, UMLS_CUI, OMIM_ID, CIViC_Gene_ID, OncoKB_Gene_ID, PharmGKB_Gene_ID | AssociatedWithGene, AssociatedWithVariant, MayTargetedBy |
Variant | Var_Name, UMLS_CUI, ClinVar_ID, dbSNP_ID, CIViC_Var_ID, OncoKB_Var_ID | AssociatedWithVariant |
Disease | Disease_Name, UMLS_CUI, OMIM_ID, CIViC_DOID, OncoKB_Disease_ID, PharmGKB_Disease_ID | AssociatedWithGene, MayTreatedBy, HasContraindicationWith |
Drug | Drug_Name, Brand_Name, Drug_Category, UMLS_CUI, NUI (NDF-RT Unique Identifier), CIViC_Drug_ID, OncoKB_Drug_ID, PharmGKB_Drug_ID | MayTreatedBy, HasContraindicationWith, MayTargetedBy |
PO2RDF Generation
To generate PO2RDF, we applied D2RQ, which transforms data in a relational database to RDF. The mapping tool of D2RQ creates a default mapping file by analyzing the schema of an existing database. To map our data to the GTO schema, we manually customized this mapping file. The data were then published as RDF through the D2RQ server and can be queried via a D2RQ SPARQL endpoint. We also loaded an RDF dump from D2RQ into Virtuoso[28] to run federated queries. Figure 2 shows a detailed RDF representation of two patients. “Variant” elements are not shown owing to space limitations.
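Conceptually, the relational-to-RDF transformation turns each database row into a set of subject-predicate-object triples. The sketch below illustrates that idea in plain Python; it is not D2RQ and does not reproduce its mapping language. The namespace `http://example.org/po2rdf/`, the URI patterns, and the denormalized row layout are all hypothetical, while the property names `Patient_ID`, `HasMutGene` and `TreatedBy` are taken from Table 2.

```python
BASE = "http://example.org/po2rdf/"  # hypothetical namespace, not the published one
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def uri(*parts):
    """Build an N-Triples URI term under the hypothetical base namespace."""
    return "<" + BASE + "/".join(str(p) for p in parts) + ">"


def literal(value):
    """Build a plain N-Triples string literal, escaping backslashes and quotes."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def patient_row_to_triples(row):
    """Map one (denormalized, illustrative) patient row to RDF triples."""
    subject = uri("patient", row["patient_id"])
    triples = [
        (subject, RDF_TYPE, uri("Patient")),
        (subject, uri("Patient_ID"), literal(row["patient_id"])),
    ]
    for gene in row.get("mutated_genes", []):
        triples.append((subject, uri("HasMutGene"), uri("gene", gene)))
    for drug in row.get("drugs", []):
        triples.append((subject, uri("TreatedBy"), uri("drug", drug)))
    return triples


def to_ntriples(triples):
    """Serialize triples, one statement per line, each terminated by ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


row = {"patient_id": "P001", "mutated_genes": ["EGFR"], "drugs": ["erlotinib"]}
triples = patient_row_to_triples(row)
serialized = to_ntriples(triples)
```

In a D2RQ setup this logic lives declaratively in the customized mapping file rather than in application code; the sketch only shows the shape of the output.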
Use Cases
To demonstrate the usability of PO2RDF, we retrieved triples involving ‘Gene’ and ‘Drug’ from PO2RDF. We then performed association rule analysis[29] to evaluate the significance of real-world associations between mutated genes and selected oncology drugs. Specifically, we examined drugs associated with the gene “EGFR”, which is most commonly identified and targeted in lung cancer[30], colorectal cancer[31, 32] and melanoma[33] patients. EGFR inhibitors were initially approved to treat non-small cell lung cancer (NSCLC) and appear to be most effective in patients with adenocarcinoma histology[30]. Even though the current FDA-approved indications for EGFR inhibitors are mostly for NSCLC, these drugs are also used off-label[31–33] for other cancers in real-world settings. Therefore, the results of our association analysis could potentially provide RWE to clinicians and the FDA regarding the real-world utility of targeted therapies, especially any deviations from guidelines or drug labels.
We calculated the confidence of each {“Drug”, “EGFR”} transaction (Eq. 1). The support of X with respect to a set of transactions T is defined as the proportion of transactions t in the dataset that contain the item X (Eq. 2). Each individual patient was considered one transaction (t), and our cohort of 2,593 patients was considered the total transaction set T.
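The support and confidence measures can be illustrated with a short sketch that uses the standard association rule definitions: the support of an itemset is the fraction of transactions containing it, and the confidence of a rule X ⇒ Y is supp(X ∪ Y) / supp(X). The rule direction used here (EGFR ⇒ drug) is an assumption made for illustration, and the four toy transactions are fabricated, not study data.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent => consequent: supp(A and C) / supp(A)."""
    supp_antecedent = support(antecedent, transactions)
    if supp_antecedent == 0:
        return 0.0
    joint = set(antecedent) | set(consequent)
    return support(joint, transactions) / supp_antecedent


# Each patient is one transaction: the set of mutated genes and prescribed drugs.
transactions = [
    {"EGFR", "erlotinib"},
    {"EGFR"},
    {"KRAS"},
    {"EGFR", "erlotinib", "cetuximab"},
]

supp_egfr = support({"EGFR"}, transactions)
conf_egfr_erlotinib = confidence({"EGFR"}, {"erlotinib"}, transactions)
```

With these toy transactions, three of four patients carry EGFR and two of those three received erlotinib, so the confidence of EGFR ⇒ erlotinib is two thirds.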