After approval of the protocol by the Institutional Review Board of Severance Hospital (4-2021-1241), Seoul, Republic of Korea, we established a development server using a Windows-based, 12-core computer with 64 GB of memory and Serial Attached SCSI (SAS) disk drives of 100 GB and 2 TB. Operational servers, constituting High Availability (HA) systems, included a database (DB) server (2Ea) with a 10-core CPU, 128 GB memory, OS SSD 100 GB of storage/SCL 2019, DB Safer, Hiware, EMS, and Backup (DB), and a web-based server with a 12-core CPU, 64 GB memory, OS SSD 100 GB of storage/Hiware, EMS, and Backup (File).
The dataflow and computational modules are illustrated in Fig. 1. All data processing, transfer, and storage were performed within the network infrastructure of the hospital. We received authorization for access to all digital records from the EMR system and billing data from the Oncology Care System. To improve the data quality and mitigate the risks associated with erroneous or omitted data, we tailored the selection approaches for each cancer type. The selection was based on the International Classification of Diseases for Oncology (ICD) and physician-assigned ICD-M codes as well as validity criteria designated by the cancer registration program. A comprehensive breakdown of the selection methodologies for each cancer type is provided in Table 1.
Table 1
Selection methods for each cancer type
Cancer Id | Cancer Type | DBName | Criteria |
01 | Breast cancer | YCDL_BRST | (1) Cancer Registry : ICDOCda=C50% AND available = Y AND ICDOCdMb <M9590 |
02 | Colorectal cancer | YCDL_CLRC | (1) Cancer Registry: ICDOCd=(C18%, C19%, C20%) AND ICDOCdM = M81403(Adenocarcinoma) AND available = Y |
03 | Lung cancer | YCDL_LUNG | (1) Cancer Registry : ICDOCd = C34% AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2' |
04 | Gastric cancer | YCDL_GSTR | (1) Cancer Registry : ICDOCd = C16% AND available = Y AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2' |
05 | Liver cancer | YCDL_LVER | (1) Cancer Registry : ICDOCd = C22.0 AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2' |
06 | Melanoma | YCDL_MLNM | (1) Cancer Registry: ICDOCdM_EngNm (pathology) LIKE '%Melanoma%' AND available = Y (2) The cancer diagnosis group = D0023(Malignant melanoma) in CAP systemc. (3) There are records of '%Melanoma%', '%Malignant Spitz%' in the pathology diagnosis results. (4) There are records of '%Melanoma%', '%Malignant Spitz%' in the imaging test. (excluded '%r/o%') |
07 | Kidney cancer | YCDL_KDNY | (1) Cancer Registry : ICDOCd = C64% AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2' |
08 | Prostate cancer | YCDL_PRST | (1) Cancer Registry : ICDOCd = C61% AND available = Y AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2' |
09 | Thyroid cancer | YCDL_THRD | (1) Cancer Registry : ICDOCd = C73% AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2' |
10 | Pancreatic cancer | YCDL_PNCT | (1) Cancer Registry : ICDOCd = C25% AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2' |
11 | Bile duct cancer | YCDL_BLDT | (1) Cancer Registry : ICDOCd=(C22.1, C23.9, C24.0, C24.1, C24.8, C24.9) AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2' |
aICDOCd = ICD-O ( International Classification of Diseases for Oncology) Codes |
bICDOCdM = Morphology section of the ICD-O Code |
cCAP system = Chemotherapy Assistance Program for ordering oncology medications |
Subsequently, a patient-centric data model was developed, underpinned by the patient identification numbers dispensed by the hospital information system. This served as a linchpin for linking the anonymized datasets. In the clinical data extraction stage, we developed an Extract-Transform-Load (ETL) process, which includes Natural Language Processing (NLP), for each feature (Fig. 2). It facilitated the daily movement of data from the DSC source DB to the YCDL target DB. The DSC DB is a reservoir containing raw medical text, (semi-)unstructured data, imaging files, next-generation sequencing (NGS) results, and Extensible Markup Language (XML) formats.
In the initial phase of data processing, we tailored the database corpus from the DSC DB, optimizing the extraction and management of medical terminology, abbreviations, and recurrent misspellings (e.g., within pathology reports). Subsequently, the procured data underwent transformation through a specialized ETL algorithm designed to harmonize terminology based on assertions and the interrelationships of medical concepts. NLP was instrumental in utilizing CT and MRI interpretation counts from follow-up visits as criteria for individual selection. Parsing the raw medical text from radiological reports revealed the indications for disease recurrence and its associated patterns. The subsequent preprocessing phase employed tokenization techniques to structure the extracted data. SQL queries were harnessed to mine data from the primary DSC DB, facilitated by a data manipulation language (DML) management interface. For certain datasets requiring intricate extraction protocols, bespoke ETL strategies were devised using Python scripts crafted for each specific operation (Supplementary Fig. 1).
Using NGS data, Tier 1 pathogenic variants were systematically collected from the EMR. A significant portion of the procedural steps were automated, employing specialized bioinformatics tools for both processing and interpretation, as depicted in Supplementary Fig. 2. Certain elements, including family and smoking histories, were retained in our medical record system in XML format and subsequently extracted and transferred using the ETL process.
Following its development, we applied this framework to our electronic health data from its inception in 2006. The profiles were updated using electronic health records, ensuring a comprehensive view of relevant oncological components over time. The present analysis is based on data collected up to March 2022. Key constituents of these profiles included demographics, diagnoses, clinical examination reports, pathology reports, treatment histories, and encounter specifics (Tables 2 and 3). The structures of these individual profiles were categorized into common, cancer-specific, and index columns. The common features held universal information across multiple cancer types (e.g., age, sex, and cancer diagnosis date) and accounted for 817 features, which was nearly 80% of the total. The cancer-specific features contained data relevant only to specific cancer types (for instance, pulmonary function test in lung cancer) and comprised approximately 20% of the tables (Table 4).
Table 2
Tables in the DSC database
DB No. | DB Name | DB code | Table No. | Table Name | Table Description |
1 | Patients | PT | 1 | CNCR_PATINFO | Patient basic information |
1 | Patients | PT | 2 | CNCR_BODYINFO | Body measurement information |
2 | Diagnosis | DG | 3 | CNCR_DX | Diagnoses relating to a hospital visits |
2 | Diagnosis | DG | 4 | CNCR_CRDINFO | Copayment Decreasing Policy |
2 | Diagnosis | DG | 5 | CNCR_CSLT | Consultant Information |
3 | Examination | EM | 6 | CNCR_LAB | Events relating to laboratory tests |
3 | Examination | EM | 7 | CNCR_IMAGE | Events relating to Imaging test |
4 | Pathology | PH | 8 | CNCR_PATHOLOGY | Events relating to Pathology |
5 | Operation | OP | 9 | CNCR_OP | Surgery |
6 | Treatment | TX | 10 | CNCR_REGIMEN | Chemo- therapy |
6 | Treatment | TX | 11 | CNCR_RT | Radiation-therapy |
6 | Treatment | TX | 12 | CNCR_DRUG | Medicines prescribed |
6 | Treatment | TX | 13 | CNCR_PROC | Procedure (included medical operation) |
7 | Progress | TE | 14 | CNCR_FRM | Clinical Forms |
8 | aCancer registry | TM | 15 | CNCR_TUMOR_RGT | Tumor Registry (personal details and cancer diagnosis) |
8 | Cancer registry | TM | 16 | CNCR_TUMOR_TRANS | Tumor Registry (included cancer recurrence/metastasis) |
8 | Cancer registry | TM | 17 | CNCR_TUMOR_TRC | Tumor Registry (included cancer patient follow-up) |
8 | Cancer registry | TM | 18 | CNCR_TUMOR_TRET | Tumor Registry (included cancer treatment) |
aCancer registry = database of information on cancer patients |
Table 3
Tables in the YCDL database and number of variables
| Table Category | Table Name | Table Description | Common | Breast | Colorectal | Lung | Gastric | Liver | Melanoma | Kidney | Prostate | Thyroid | Prostate | Bile duct | Total |
1 | PT | Patient | PT_BASIC | Patient Basic Information | 20 | | | | | | | | | | | | 20 |
2 | PT | Patient | PT_PHIS | Past History | 39 | 9 | 12 | | | | | | 1 | 2 | | | 63 |
3 | PT | Patient | PT_SHIS | Smoking History | 29 | | | | | | | | | | | | 29 |
4 | PT | Patient | PT_DRNK | Drinking History | 33 | | | | | | | | | | | | 33 |
5 | PT | Patient | PT_FMHS | Family History | 41 | | | | | | | | | | | | 41 |
6 | PT | Patient | PT_BDMS | Body Measurement | 27 | | | | | | | | | | | | 27 |
7 | DG | Diagnosis | DG_INFO | Visit Information | 27 | | | | | | | | | | | | 27 |
8 | DG | Diagnosis | DG_ECHI | Copayment Policy | 27 | | | | | | | | | | | | 27 |
9 | DG | Diagnosis | DG_CNCR | Cancer Diagnosis | 37 | | | | | | | | | | | | 37 |
10 | DG | Diagnosis | DG_CONS | Consultant | 28 | | | | | | | | | | | | 28 |
11 | EM | Examination | EM_LAB | Laboratory Test | 30 | | | | | | | | | | | | 30 |
12 | EM | Examination | EM_IMEX | Imaging Test | 28 | | | 2 | | 2 | | | | | 2 | 2 | 36 |
13 | EM | Examination | EM_GENE | Genetic Test | 28 | | | | | | | | | | | | 28 |
14 | EM | Examination | EM_FCLT | Function Test | 28 | 6 | | 5 | | 2 | | | | 6 | 2 | 2 | 51 |
15 | PH | Pathology | PH_BPSY | Biopsy | 29 | 7 | 13 | 4 | 13 | 5 | 5 | 3 | 7 | 3 | 4 | 4 | 97 |
16 | PH | Pathology | PH_SRGC | Histopathology | 32 | 29 | 36 | 20 | 19 | 34 | 23 | 24 | 16 | 25 | 20 | 33 | 311 |
17 | PH | Pathology | PH_IMML | Immuno-histology | 29 | 13 | 17 | 9 | 10 | 10 | 17 | 15 | 8 | 15 | 12 | 12 | 167 |
18 | OP | Operation | OP_INFO | Operation Information | 28 | | | | | | | | | | | | 28 |
19 | OP | Operation | OP_OPNN | Operation opinion | 38 | 27 | 31 | 13 | 19 | 46 | 6 | 5 | 6 | 12 | 4 | 14 | 221 |
20 | OP | Operation | OP_COMP | Operation Complication | 25 | | | | | | | | | | | | 25 |
21 | TX | Treatment | TX_CHTH | Chemotherapy | 36 | | | | | | | | | | | | 36 |
22 | TX | Treatment | TX_RTH | Radiotherapy | 41 | | | | | | | | | | | | 41 |
23 | TX | Treatment | TX_PRSC | Drug | 29 | | | | | | | | | | | | 29 |
24 | TX | Treatment | TX_MOPR | Procedure | 33 | | | | | | | | | | | | 33 |
25 | TE | Follow-Up | TE_MTST | Follow-Up Metastasis | 27 | | | | | | | | | | | | 27 |
26 | TE | Follow-Up | TE_RCRN | Follow-Up Relapse | 23 | | | | | | | | | | | | 23 |
27 | TE | Follow-Up | TE_DEAD | Dead | 25 | | | | | | | | | | | | 25 |
Table 4
Column characteristics by cancer type
| Number of Common Columns (A) | Number of Cancer-specific Columns (B) | Number of Index Columns (C) | Number of Total Columns (D) | Percentage of Cancer-specific Columns (B/D) |
Breast | 817 | 91 | 459 | 908 | 10% |
Colorectal | 817 | 109 | 459 | 926 | 12% |
Lung | 817 | 53 | 459 | 870 | 6% |
Gastric | 817 | 61 | 459 | 878 | 7% |
Liver | 817 | 99 | 459 | 916 | 11% |
Melanoma | 817 | 51 | 459 | 868 | 6% |
Kidney | 817 | 47 | 459 | 864 | 5% |
Prostate | 817 | 38 | 459 | 855 | 4% |
Thyroid | 817 | 63 | 459 | 880 | 7% |
Pancreatic | 817 | 44 | 459 | 861 | 5% |
Bile duct | 817 | 67 | 459 | 884 | 8% |
We developed a web-based computational platform for data quality control (QC) that scrutinizes potential data defects both automatically and manually on a daily basis, focusing on minimizing the role of the human component (Fig. 3). All data extracted and stored in the YCDL_cancer data repository were continuously evaluated and optimized to establish high-quality data outputs, adhering to standardized data and terminology. Programs for logical checks were configured to evaluate the distribution and continuity of data extracted by the SCL. Based on the QC results, the ETL code was continuously modified, thereby refining the QC logic to enhance the quality and accuracy of the automation. We examined four data quality measures (completeness, timeliness and usefulness, consistency, and accuracy) for all variables, in accordance with established data standards and pertinent aspects of data quality (Table 5). For instance, the logic was set such that the birth date of individuals would precede the date of the initial diagnosis. The analyses revealed that the batch processing method accurately identified erroneous data points, aligned with the established logic. Each piece of data was meticulously reviewed and optimized by a Quality Control Manager. Significant discrepancies or inaccuracies prompted an in-depth examination of the source data and respective ETL processes. Moreover, a hierarchy of data sources was established to resolve conflicts. The QC steps were continuously iterated within distinct closed-loop systems, adhered to operational ontology, and executed by independent QC personnel. This methodology gradually enhanced the accuracy of the cleansed target data with minimal intervention (Supplementary Fig. 3). We assessed the completeness of each individual’s accumulated features, including fundamental characteristics such as date of birth, initial diagnosis date, age, diagnosis code (ICD), TNM and overall stages, and ICDO morphology code.
Table 5
Data quality check criteria
Quality Indicators | Detailed Quality Indicators | Diagnostic Targets | Remarks |
Completeness | Individual Completeness | Columns or input values defined to exist but are Null | |
Conditional Completeness | Checking for NOT NULL constraints | |
Structural Completeness | Implementation based on the physical model designed from the schema, including data types | Verified at the DB design stage |
Validity | Code Validity | Whether codes defined in the common code are used | |
Format Validity | Errors in data format | Verified at the DB design stage |
Boolean Validity | Diagnosis based on columns with Y/N, 0/1 criteria | |
Date Validity | Errors based on date formats | |
Range Validity | Diagnosis based on Min, Max, and Normal range of the column | |
Temporal Relationship Validity | Diagnosis of data that deviates from predetermined sequential relationships | |
Consistency | Referential Integrity | Diagnosis of operation rules for PK (Primary Key) items | Verified at the DB design stage |
Accuracy | Logical Relationship Accuracy | Data diagnosis according to logical relationships, e.g., when item A is n, item B should be at least m | |
Derived Item Accuracy | Diagnosis of derived data, e.g., whether the sum of item A and item B is equal | |
The developed data warehouse showcased survival graphs by tumor stages and demonstrated the framework's ability to expedite data collection for quick clinical hypothesis testing. Kaplan–Meier survival graphs were generated in all cancer types according to tumor stage with 95% confidence intervals. Survival time was defined as the time interval between initial diagnosis and death or the last follow-up. To demonstrate the efficiency of our data framework as a proof-of-concept for swiftly generating and evaluating clinical hypotheses, we present a detailed chronological progression of a previously published retrospective study. The clinical question chosen by one of the authors was whether the peripheral blood neutrophil-to-lymphocyte ratio before, during, or after neoadjuvant chemoradiotherapy for locally advanced rectal cancer is associated with an increased risk of distant metastases after primary rectal cancer surgery.
To underscore the capabilities of our data framework for clinical applications, we then developed a clinical decision-support system that offers (a) a longitudinal view of the complete patient journey, (b) PACS-integrated three-dimensional tumor display, and (c) summary of longitudinal changes in the form of graphs. This system was supported by a Docker-based microservice architecture and overseen by a Python-based API server for backend operations. The framework was implemented as a web application using JavaScript. This design bolsters the accessibility of the system, guarantees platform independence, and ensures that users can access services across various device types. Manual tumor segmentation data are required to use the three-dimensional tumor display with a longitudinal tumor-tracking function. If deep learning-based tumor auto-segmentation algorithms are developed, these models can be integrated into the pipeline. The PACS-integrated method enables physicians to comprehensively track changes in overall trajectory patterns over a long period, fosters an environment that better explains the disease course to patients, and facilitates communication with referring physicians. Longitudinal changes in the overall disease burden were automatically generated using the prepared manual contours and displayed as graphs. The images were de-identified; however, if another image of the same patient was transferred later, the new images were allocated the same de-identified number, facilitating tumor tracking.