Development of a continuous multimodal data supply chain for oncology and an expandable clinical decision support system

doi:10.21203/rs.3.rs-3864430/v1


https://doi.org/10.21203/rs.3.rs-3864430/v1

This work is licensed under a CC BY 4.0 License


Background: This study presents an inter-departmental collaboration to create a comprehensive oncology data framework and clinical decision support system, integrating clinical, genomic, and imaging data.

Methods: A real-time data supply chain was established in an academic cancer center to capture unstructured health data from various sources. For each cancer type, specific selection approaches were developed. We predefined 817 features applicable across various cancers, along with a median of 61 histology-specific features, and developed a customized Extract-Transform-Load (ETL) algorithm for each feature. Standard criteria for data quality control (QC) were formulated, and a web-based computational QC platform enabling manual and automatic inspections was created. This framework has been applied to electronic health data since 2006.

Findings: The data supply chain captured features from 171,128 individuals across 11 cancer types. It updates individual profiles daily and conducts QC checks, including 143 logical comparisons, processing an average of 81 cases daily. Continuous automatic and manual data QC within closed-loop, human-in-the-loop systems ensured accuracy. Using the developed data warehouse, we were able to showcase survival graphs by tumor stages and demonstrate the framework's ability to expedite data collection for quick clinical hypothesis testing. A dashboard displaying patients’ cancer journeys, including landmark events, treatment progress, and longitudinal tumor tracking, was also developed.

Interpretation: We have developed an automatically updated data supply chain that comprehensively synthesizes multimodal medical data and assesses data QC, aimed at directly assisting clinical decision-making for individual patients. Ongoing learning is essential, depending on the purpose of data use, and further research into its applicability in other environments is required.

Health sciences/Diseases/Cancer

Health sciences/Health care/Patient education

Health sciences/Medical research/Outcomes research

data warehouse

continuous capture

multimodal oncologic information

electronic health record

medical imaging

genomic data

big data

Oncology data is multidimensional and diverse, encompassing a vast array of information such as patient characteristics, stage, tumor, and imaging data.¹ The advent of electronic medical records (EMR) and emerging data sources has caused a transformative surge in health information.² This data deluge often exceeds human cognitive limits for decision-making³ and has led oncology professionals to spend more time navigating EMR than engaging with patients to seek fragmented health data from disparate sources, which exacerbates burnout.⁴

Fortunately, rapid advancements in computational techniques, notably machine learning and artificial intelligence (AI), herald new possibilities for harnessing extensive and intricate medical data for individualized, data-driven care.⁵ These technologies have demonstrated potential in refining imaging⁶ and pathology diagnostics,⁷ prognosticating clinical outcomes, optimizing radiation treatment planning,⁸ and accelerating drug development.⁹,¹⁰ AI has also significantly impacted foundational research in oncology.¹¹

However, challenges related to validation and generalizability¹² mean that the current methodologies for data management and model development fall short of the maturity required for broad-scale AI adoption. Transitioning from the present ad-hoc data aggregation and curation approach to a dynamic "metadata supply chain" is essential for providing contextualized, robust data in real-time.¹³ By capturing pivotal data in real-time, this metadata supply chain can lay the groundwork for a clinical decision support system that vividly maps the patient journey, potentially transforming clinical workloads.

Within this framework, our objective was to present our collaborative endeavor for establishing a comprehensive data supply chain in oncology. This system seamlessly integrates clinical, genomic, and imaging data, representing a persistent, flexible, and expandable model. The infrastructure holds the potential to expedite the development of clinical decision support systems and AI applications for risk stratification, diagnosis, and treatment in oncology, paving the way for individualized patient-centered care.

After approval of the protocol by the Institutional Review Board of Severance Hospital (4-2021-1241), Seoul, Republic of Korea, we established a development server using a Windows-based, 12-core computer with 64 GB of memory and Serial Attached SCSI (SAS) disk drives of 100 GB and 2 TB. Operational servers, constituting High Availability (HA) systems, included a database (DB) server (2Ea) with a 10-core CPU, 128 GB memory, OS SSD 100 GB of storage/SCL 2019, DB Safer, Hiware, EMS, and Backup (DB), and a web-based server with a 12-core CPU, 64 GB memory, OS SSD 100 GB of storage/Hiware, EMS, and Backup (File).

The dataflow and computational modules are illustrated in Fig. 1. All data processing, transfer, and storage were performed within the network infrastructure of the hospital. We received authorization for access to all digital records from the EMR system and billing data from the Oncology Care System. To improve the data quality and mitigate the risks associated with erroneous or omitted data, we tailored the selection approaches for each cancer type. The selection was based on the International Classification of Diseases for Oncology (ICD-O) codes and physician-assigned ICD-O morphology (ICD-M) codes, as well as validity criteria designated by the cancer registration program. A comprehensive breakdown of the selection methodologies for each cancer type is provided in Table 1.

Table 1

Selection methods for each cancer type
Cancer Id	Cancer Type	DBName	Criteria
01	Breast cancer	YCDL_BRST	(1) Cancer Registry: ICDOCd^a = C50% AND available = Y AND ICDOCdM^b < M9590
02	Colorectal cancer	YCDL_CLRC	(1) Cancer Registry: ICDOCd = (C18%, C19%, C20%) AND ICDOCdM = M81403 (Adenocarcinoma) AND available = Y
03	Lung cancer	YCDL_LUNG	(1) Cancer Registry : ICDOCd = C34% AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2'
04	Gastric cancer	YCDL_GSTR	(1) Cancer Registry : ICDOCd = C16% AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2'
05	Liver cancer	YCDL_LVER	(1) Cancer Registry : ICDOCd = C22.0 AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2'
06	Melanoma	YCDL_MLNM	(1) Cancer Registry: ICDOCdM_EngNm (pathology) LIKE '%Melanoma%' AND available = Y (2) The cancer diagnosis group = D0023(Malignant melanoma) in CAP system^c. (3) There are records of '%Melanoma%', '%Malignant Spitz%' in the pathology diagnosis results. (4) There are records of '%Melanoma%', '%Malignant Spitz%' in the imaging test. (excluded '%r/o%')
07	Kidney cancer	YCDL_KDNY	(1) Cancer Registry : ICDOCd = C64% AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2'
08	Prostate cancer	YCDL_PRST	(1) Cancer Registry : ICDOCd = C61% AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2'
09	Thyroid cancer	YCDL_THRD	(1) Cancer Registry : ICDOCd = C73% AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2'
10	Pancreatic cancer	YCDL_PNCT	(1) Cancer Registry : ICDOCd = C25% AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2'
11	Bile duct cancer	YCDL_BLDT	(1) Cancer Registry : ICDOCd=(C22.1, C23.9, C24.0, C24.1, C24.8, C24.9) AND available = Y AND ICDOCdM < M9590 AND ICDOCdM NOT LIKE '%/2'
^aICDOCd = ICD-O (International Classification of Diseases for Oncology) codes
^bICDOCdM = Morphology section of the ICD-O Code
^cCAP system = Chemotherapy Assistance Program for ordering oncology medications
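
The registry criteria in Table 1 read like SQL predicates, so the lung cancer rule can be sketched as a query. This is an illustration only: the table name `registry`, the `patient_id` column, and the sample rows are invented, and the morphology codes are assumed to be stored as text in the form `M8140/3` so that the string comparison against `M9590` and the `'%/2'` pattern behave as the table suggests.

```python
import sqlite3

# Hypothetical registry rows; patient IDs and values are invented for illustration.
ROWS = [
    ("P001", "C34.1", "Y", "M8140/3"),  # meets every condition
    ("P002", "C34.1", "Y", "M8140/2"),  # behavior code /2: excluded
    ("P003", "C34.9", "N", "M8070/3"),  # available = N: excluded
    ("P004", "C34.3", "Y", "M9680/3"),  # morphology not below M9590: excluded
    ("P005", "C50.4", "Y", "M8500/3"),  # site is not C34: excluded
]

def select_lung(rows):
    """Apply the Table 1 lung cancer criteria to an in-memory registry table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE registry "
        "(patient_id TEXT, ICDOCd TEXT, available TEXT, ICDOCdM TEXT)"
    )
    con.executemany("INSERT INTO registry VALUES (?, ?, ?, ?)", rows)
    cur = con.execute(
        """
        SELECT patient_id FROM registry
        WHERE ICDOCd LIKE 'C34%'
          AND available = 'Y'
          AND ICDOCdM < 'M9590'
          AND ICDOCdM NOT LIKE '%/2'
        ORDER BY patient_id
        """
    )
    selected = [row[0] for row in cur.fetchall()]
    con.close()
    return selected

selected = select_lung(ROWS)
```

With these sample rows only `P001` is selected. The other cancer types in Table 1 would differ mainly in the site pattern and, for melanoma, in the additional text-based conditions.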

Subsequently, a patient-centric data model was developed, underpinned by the patient identification numbers dispensed by the hospital information system. This served as a linchpin for linking the anonymized datasets. In the clinical data extraction stage, we developed an Extract-Transform-Load (ETL) process, which includes Natural Language Processing (NLP), for each feature (Fig. 2). It facilitated the daily movement of data from the DSC source DB to the YCDL target DB. The DSC DB is a reservoir containing raw medical text, (semi-)unstructured data, imaging files, next-generation sequencing (NGS) results, and Extensible Markup Language (XML) formats.

In the initial phase of data processing, we tailored the database corpus from the DSC DB, optimizing the extraction and management of medical terminology, abbreviations, and recurrent misspellings (e.g., within pathology reports). Subsequently, the procured data underwent transformation through a specialized ETL algorithm designed to harmonize terminology based on assertions and the interrelationships of medical concepts. NLP was instrumental in utilizing CT and MRI interpretation counts from follow-up visits as criteria for individual selection. Parsing the raw medical text from radiological reports revealed the indications for disease recurrence and its associated patterns. The subsequent preprocessing phase employed tokenization techniques to structure the extracted data. SQL queries were harnessed to mine data from the primary DSC DB, facilitated by a data manipulation language (DML) management interface. For certain datasets requiring intricate extraction protocols, bespoke ETL strategies were devised using Python scripts crafted for each specific operation (Supplementary Fig. 1).
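
The terminology handling described above (abbreviations, recurrent misspellings, tokenization) can be sketched as follows. This is not the authors' ETL code: the lookup tables are hypothetical, and the rule-out filter simply mirrors the `'%r/o%'` exclusion listed for melanoma in Table 1.

```python
import re

# Hypothetical lookup tables; a real corpus would be curated from the source reports.
ABBREVIATIONS = {"adenoca": "adenocarcinoma", "mets": "metastasis", "ln": "lymph node"}
MISSPELLINGS = {"adenocarcinma": "adenocarcinoma", "metastais": "metastasis"}

def tokenize(text):
    """Lowercase the text and split it into tokens, keeping 'r/o' as one token."""
    return re.findall(r"[a-z0-9]+(?:/[a-z0-9]+)*", text.lower())

def normalize(tokens):
    """Correct known misspellings first, then expand known abbreviations."""
    out = []
    for tok in tokens:
        tok = MISSPELLINGS.get(tok, tok)
        tok = ABBREVIATIONS.get(tok, tok)
        out.append(tok)
    return out

def mentions(text, term):
    """True if `term` occurs in a sentence that is not a rule-out ('r/o') statement."""
    for sentence in re.split(r"[.;\n]+", text):
        tokens = normalize(tokenize(sentence))
        if term in tokens and "r/o" not in tokens:
            return True
    return False

tokens = normalize(tokenize("Adenocarcinma, LN mets"))
```

Here `tokens` becomes `["adenocarcinoma", "lymph node", "metastasis"]`, and `mentions("Lesion in left arm. r/o melanoma", "melanoma")` returns `False` because the only mention sits in a rule-out sentence.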

Using NGS data, Tier 1 pathogenic variants were systematically collected from the EMR. A significant portion of the procedural steps were automated, employing specialized bioinformatics tools for both processing and interpretation, as depicted in Supplementary Fig. 2. Certain elements, including family and smoking histories, were retained in our medical record system in XML format and subsequently extracted and transferred using the ETL process.
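
As a rough illustration of pulling family and smoking histories out of XML, the sketch below uses Python's standard `xml.etree.ElementTree`. The XML layout, element names, and attribute names are all invented, since the source does not describe the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout; element and attribute names are invented for illustration.
SAMPLE = """
<form patient="P001">
  <smoking status="former" packYears="20.5"/>
  <family>
    <relative relation="mother" cancer="breast"/>
    <relative relation="uncle" cancer="colorectal"/>
  </family>
</form>
"""

def extract_history(xml_text):
    """Flatten one XML form into a dictionary ready to load into a target table."""
    root = ET.fromstring(xml_text)
    smoking = root.find("smoking")
    pack_years = smoking.get("packYears") if smoking is not None else None
    return {
        "patient_id": root.get("patient"),
        "smoking_status": smoking.get("status") if smoking is not None else None,
        "pack_years": float(pack_years) if pack_years is not None else None,
        "family_history": [
            (rel.get("relation"), rel.get("cancer")) for rel in root.iter("relative")
        ],
    }

record = extract_history(SAMPLE)
```

Missing elements map to `None` rather than raising, which lets a downstream completeness check flag them instead of stopping the load.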

Following its development, we applied this framework to our electronic health data from its inception in 2006. The profiles were updated using electronic health records, ensuring a comprehensive view of relevant oncological components over time. The present analysis is based on data collected up to March 2022. Key constituents of these profiles included demographics, diagnoses, clinical examination reports, pathology reports, treatment histories, and encounter specifics (Tables 2 and 3). The structures of these individual profiles were categorized into common, cancer-specific, and index columns. The common features held universal information across multiple cancer types (e.g., age, sex, and cancer diagnosis date) and accounted for 817 features, which was nearly 80% of the total. The cancer-specific features contained data relevant only to specific cancer types (for instance, pulmonary function test in lung cancer) and comprised approximately 20% of the tables (Table 4).

Table 2

Tables in the DSC database
DB No.	DB Name	DB code	Table No.	Table Name	Table Description
1	Patients	PT	1	CNCR_PATINFO	Patient basic information
1	Patients	PT	2	CNCR_BODYINFO	Body measurement information
2	Diagnosis	DG	3	CNCR_DX	Diagnoses relating to hospital visits
2	Diagnosis	DG	4	CNCR_CRDINFO	Copayment Decreasing Policy
2	Diagnosis	DG	5	CNCR_CSLT	Consultant Information
3	Examination	EM	6	CNCR_LAB	Events relating to laboratory tests
3	Examination	EM	7	CNCR_IMAGE	Events relating to Imaging test
4	Pathology	PH	8	CNCR_PATHOLOGY	Events relating to Pathology
5	Operation	OP	9	CNCR_OP	Surgery
6	Treatment	TX	10	CNCR_REGIMEN	Chemotherapy
6	Treatment	TX	11	CNCR_RT	Radiation therapy
6	Treatment	TX	12	CNCR_DRUG	Medicines prescribed
6	Treatment	TX	13	CNCR_PROC	Procedure (included medical operation)
7	Progress	TE	14	CNCR_FRM	Clinical Forms
8	^aCancer registry	TM	15	CNCR_TUMOR_RGT	Tumor Registry (personal details and cancer diagnosis)
8	Cancer registry	TM	16	CNCR_TUMOR_TRANS	Tumor Registry (included cancer recurrence/metastasis)
8	Cancer registry	TM	17	CNCR_TUMOR_TRC	Tumor Registry (included cancer patient follow-up)
8	Cancer registry	TM	18	CNCR_TUMOR_TRET	Tumor Registry (included cancer treatment)
^aCancer registry = database of information on cancer patients

Table 3

Tables in the YCDL database and number of variables
No.	Category Code	Table Category	Table Name	Table Description	Common	Breast	Colorectal	Lung	Gastric	Liver	Melanoma	Kidney	Prostate	Thyroid	Pancreatic	Bile duct	Total
1	PT	Patient	PT_BASIC	Patient Basic Information	20												20
2	PT	Patient	PT_PHIS	Past History	39	9	12						1	2			63
3	PT	Patient	PT_SHIS	Smoking History	29												29
4	PT	Patient	PT_DRNK	Drinking History	33												33
5	PT	Patient	PT_FMHS	Family History	41												41
6	PT	Patient	PT_BDMS	Body Measurement	27												27
7	DG	Diagnosis	DG_INFO	Visit Information	27												27
8	DG	Diagnosis	DG_ECHI	Copayment Policy	27												27
9	DG	Diagnosis	DG_CNCR	Cancer Diagnosis	37												37
10	DG	Diagnosis	DG_CONS	Consultant	28												28
11	EM	Examination	EM_LAB	Laboratory Test	30												30
12	EM	Examination	EM_IMEX	Imaging Test	28			2		2					2	2	36
13	EM	Examination	EM_GENE	Genetic Test	28												28
14	EM	Examination	EM_FCLT	Function Test	28	6		5		2				6	2	2	51
15	PH	Pathology	PH_BPSY	Biopsy	29	7	13	4	13	5	5	3	7	3	4	4	97
16	PH	Pathology	PH_SRGC	Histopathology	32	29	36	20	19	34	23	24	16	25	20	33	311
17	PH	Pathology	PH_IMML	Immuno-histology	29	13	17	9	10	10	17	15	8	15	12	12	167
18	OP	Operation	OP_INFO	Operation Information	28												28
19	OP	Operation	OP_OPNN	Operation opinion	38	27	31	13	19	46	6	5	6	12	4	14	221
20	OP	Operation	OP_COMP	Operation Complication	25												25
21	TX	Treatment	TX_CHTH	Chemotherapy	36												36
22	TX	Treatment	TX_RTH	Radiotherapy	41												41
23	TX	Treatment	TX_PRSC	Drug	29												29
24	TX	Treatment	TX_MOPR	Procedure	33												33
25	TE	Follow-Up	TE_MTST	Follow-Up Metastasis	27												27
26	TE	Follow-Up	TE_RCRN	Follow-Up Relapse	23												23
27	TE	Follow-Up	TE_DEAD	Dead	25												25

Table 4

Column characteristics by cancer type
	Number of Common Columns (A)	Number of Cancer-specific Columns (B)	Number of Index Columns (C)	Number of Total Columns (D)	Percentage of Cancer-specific Columns (B/D)
Breast	817	91	459	908	10%
Colorectal	817	109	459	926	12%
Lung	817	53	459	870	6%
Gastric	817	61	459	878	7%
Liver	817	99	459	916	11%
Melanoma	817	51	459	868	6%
Kidney	817	47	459	864	5%
Prostate	817	38	459	855	4%
Thyroid	817	63	459	880	7%
Pancreatic	817	44	459	861	5%
Bile duct	817	67	459	884	8%
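
The derived columns of Table 4 can be recomputed from columns A and B. In the table, the total (D) equals A + B for every cancer type, and the percentage is B/D rounded to the nearest whole number. The short check below uses only the values printed in Table 4; the function name is ours.

```python
from statistics import median

COMMON = 817  # column A, identical for every cancer type in Table 4
SPECIFIC = {  # column B, copied from Table 4
    "Breast": 91, "Colorectal": 109, "Lung": 53, "Gastric": 61, "Liver": 99,
    "Melanoma": 51, "Kidney": 47, "Prostate": 38, "Thyroid": 63,
    "Pancreatic": 44, "Bile duct": 67,
}

def summarize(common, specific):
    """Return {cancer: (total columns D, percentage of cancer-specific columns)}."""
    table = {}
    for cancer, b in specific.items():
        total = common + b
        table[cancer] = (total, round(100 * b / total))
    return table

table = summarize(COMMON, SPECIFIC)
median_specific = median(SPECIFIC.values())
value_range = (min(SPECIFIC.values()), max(SPECIFIC.values()))
```

The recomputed totals and percentages agree with Table 4, and the median of 61 cancer-specific features with a range of 38 to 109 matches the figures quoted in the text.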

We developed a web-based computational platform for data quality control (QC) that scrutinizes potential data defects both automatically and manually on a daily basis, focusing on minimizing the role of the human component (Fig. 3). All data extracted and stored in the YCDL_cancer data repository were continuously evaluated and optimized to establish high-quality data outputs, adhering to standardized data and terminology. Programs for logical checks were configured to evaluate the distribution and continuity of data extracted by the SCL. Based on the QC results, the ETL code was continuously modified, thereby refining the QC logic to enhance the quality and accuracy of the automation. We examined four data quality measures (completeness, timeliness and usefulness, consistency, and accuracy) for all variables, in accordance with established data standards and pertinent aspects of data quality (Table 5). For instance, the logic was set such that the birth date of individuals would precede the date of the initial diagnosis. The analyses revealed that the batch processing method accurately identified erroneous data points, in line with the established logic. Each piece of data was reviewed and optimized by a Quality Control Manager. Significant discrepancies or inaccuracies prompted an in-depth examination of the source data and respective ETL processes. Moreover, a hierarchy of data sources was established to resolve conflicts. The QC steps were continuously iterated within distinct closed-loop systems, adhered to operational ontology, and executed by independent QC personnel. This methodology gradually enhanced the accuracy of the cleansed target data with minimal intervention (Supplementary Fig. 3). We assessed the completeness of each individual’s accumulated features, including fundamental characteristics such as date of birth, initial diagnosis date, age, diagnosis code (ICD), TNM and overall stages, and ICD-O morphology code.

Table 5

Data quality check criteria
Quality Indicators	Detailed Quality Indicators	Diagnostic Targets	Remarks
Completeness	Individual Completeness	Columns or input values defined to exist but are Null
	Conditional Completeness	Checking for NOT NULL constraints
	Structural Completeness	Implementation based on the physical model designed from the schema, including data types	Verified at the DB design stage
Validity	Code Validity	Whether codes defined in the common code are used
	Format Validity	Errors in data format	Verified at the DB design stage
	Boolean Validity	Diagnosis based on columns with Y/N, 0/1 criteria
	Date Validity	Errors based on date formats
	Range Validity	Diagnosis based on Min, Max, and Normal range of the column
	Temporal Relationship Validity	Diagnosis of data that deviates from predetermined sequential relationships
Consistency	Referential Integrity	Diagnosis of operation rules for PK (Primary Key) items	Verified at the DB design stage
Accuracy	Logical Relationship Accuracy	Data diagnosis according to logical relationships, e.g., when item A is n, item B should be at least m
Accuracy	Derived Item Accuracy	Diagnosis of derived data, e.g., whether the sum of item A and item B is equal
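
A few of the indicators in Table 5 can be expressed as simple record-level checks. The sketch below is a minimal illustration, not the platform's implementation: the field names (`birth_date`, `first_dx_date`, `sex`, `deceased`) and the ISO date format are assumptions.

```python
from datetime import date

def parse_date(value):
    """Return a date for an ISO 'YYYY-MM-DD' string, or None if missing or malformed."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None

def qc_record(rec):
    """Return a list of (indicator, field) findings for one patient record."""
    findings = []
    # Individual completeness: required fields must not be empty.
    for field in ("birth_date", "first_dx_date", "sex"):
        if rec.get(field) in (None, ""):
            findings.append(("completeness", field))
    # Boolean validity: flag columns must hold Y or N.
    if rec.get("deceased") not in ("Y", "N"):
        findings.append(("boolean_validity", "deceased"))
    # Date validity: non-empty date fields must parse.
    birth = parse_date(rec.get("birth_date"))
    dx = parse_date(rec.get("first_dx_date"))
    for field, parsed in (("birth_date", birth), ("first_dx_date", dx)):
        if rec.get(field) not in (None, "") and parsed is None:
            findings.append(("date_validity", field))
    # Temporal relationship validity: birth must precede the initial diagnosis.
    if birth and dx and not birth < dx:
        findings.append(("temporal_validity", "birth_date/first_dx_date"))
    return findings

clean = qc_record({"birth_date": "1960-05-01", "first_dx_date": "2015-03-10",
                   "sex": "F", "deceased": "N"})
```

A record that passes every check yields an empty list, so findings can be counted per indicator and routed to manual review.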

The developed data warehouse showcased survival graphs by tumor stages and demonstrated the framework's ability to expedite data collection for quick clinical hypothesis testing. Kaplan–Meier survival graphs were generated in all cancer types according to tumor stage with 95% confidence intervals. Survival time was defined as the time interval between initial diagnosis and death or the last follow-up. To demonstrate the efficiency of our data framework as a proof-of-concept for swiftly generating and evaluating clinical hypotheses, we present a detailed chronological progression of a previously published retrospective study. The clinical question chosen by one of the authors was whether the peripheral blood neutrophil-to-lymphocyte ratio before, during, or after neoadjuvant chemoradiotherapy for locally advanced rectal cancer is associated with an increased risk of distant metastases after primary rectal cancer surgery.
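
For readers unfamiliar with the Kaplan–Meier method, the product-limit estimate can be sketched in a few lines. The example uses the survival-time definition given above (initial diagnosis to death or last follow-up); the input data are made up, and the 95% confidence intervals reported in the study are not computed here.

```python
from collections import Counter

def kaplan_meier(durations, events):
    """Product-limit survival estimate.

    durations: time from initial diagnosis to death or last follow-up.
    events: 1 if the patient died at that time, 0 if censored.
    Returns a list of (time, survival probability) at each death time.
    """
    deaths = Counter(t for t, e in zip(durations, events) if e)
    exits = Counter(durations)  # every patient leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Patients censored at time t still count as at risk at t.
            survival *= 1 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve

# Made-up follow-up times in months; 1 = death, 0 = censored.
curve = kaplan_meier([6, 12, 12, 20, 30], [1, 1, 0, 1, 0])
```

For this toy input the estimate steps down to about 0.8, 0.6, and 0.3 at months 6, 12, and 20. Stratifying by tumor stage would mean calling the function once per stage group.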

To underscore the capabilities of our data framework for clinical applications, we then developed a clinical decision-support system that offers (a) a longitudinal view of the complete patient journey, (b) PACS-integrated three-dimensional tumor display, and (c) summary of longitudinal changes in the form of graphs. This system was supported by a Docker-based microservice architecture and overseen by a Python-based API server for backend operations. The framework was implemented as a web application using JavaScript. This design bolsters the accessibility of the system, guarantees platform independence, and ensures that users can access services across various device types. Manual tumor segmentation data are required to use the three-dimensional tumor display with a longitudinal tumor-tracking function. If deep learning-based tumor auto-segmentation algorithms are developed, these models can be integrated into the pipeline. The PACS-integrated method enables physicians to comprehensively track changes in overall trajectory patterns over a long period, fosters an environment that better explains the disease course to patients, and facilitates communication with referring physicians. Longitudinal changes in the overall disease burden were automatically generated using the prepared manual contours and displayed as graphs. The images were de-identified; however, if another image of the same patient was transferred later, the new images were allocated the same de-identified number, facilitating tumor tracking.
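
The last point, reusing the same de-identified number when later images of the same patient arrive, is what makes longitudinal tracking possible. The following in-memory sketch shows the idea under stated assumptions: the `ANON-` numbering scheme, the class and function names, and the sample volumes are invented, and a production system would persist the mapping securely rather than hold it in a dictionary.

```python
import itertools

class Deidentifier:
    """Assigns each patient ID a stable de-identified number (in-memory sketch)."""

    def __init__(self):
        self._mapping = {}
        self._counter = itertools.count(1)

    def pseudonym(self, patient_id):
        # A patient seen before gets the number issued the first time.
        if patient_id not in self._mapping:
            self._mapping[patient_id] = f"ANON-{next(self._counter):06d}"
        return self._mapping[patient_id]

def burden_series(studies, deid):
    """Sum contoured lesion volumes per study and order them by date per patient."""
    series = {}
    for patient_id, study_date, lesion_volumes_ml in studies:
        key = deid.pseudonym(patient_id)
        series.setdefault(key, []).append((study_date, sum(lesion_volumes_ml)))
    for points in series.values():
        points.sort()  # ISO date strings sort chronologically
    return series

# Made-up studies: (hospital patient ID, study date, lesion volumes in mL).
STUDIES = [
    ("H123", "2021-06-01", [4.0, 1.5]),
    ("H999", "2021-02-01", [2.0]),
    ("H123", "2021-01-15", [6.0, 2.0]),
]
series = burden_series(STUDIES, Deidentifier())
```

Because both `H123` studies resolve to the same de-identified number, their summed volumes form a single time series that can be plotted as the disease-burden graph.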

At the time of analysis, the DB contained records of the feature sets of 171,128 individuals diagnosed with 11 different cancers at a single academic cancer center between January 2006 and March 2022 (Table 6). For each individual, 817 essential features in the common columns and a median of 61 features (range: 38–109) in the cancer-type-specific columns were updated daily and continuously.

Table 6

Number of patients added each year in every cohort
	2006	2007	2008	2009	2010	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022.3	Total
Breast	764	811	737	837	959	880	891	740	938	1,078	1,358	1,409	1,811	1,754	1,526	1,671	253	18,417
Colorectal	882	942	1,006	1,158	1,090	1,207	1,143	979	1,154	1,254	1,438	1,413	1,433	1,387	1,149	1,046	222	18,903
Lung	712	740	756	768	817	840	884	902	943	1,143	1,257	1,409	1,653	1,922	1,737	1,909	334	18,726
Gastric	1,829	1,815	1,944	1,865	1,914	1,964	1,836	1,696	1,899	1,915	1,949	1,918	1,862	1,522	1,267	1,433	254	28,882
Liver	657	589	656	737	679	655	621	569	589	609	657	610	616	646	565	489	112	10,056
Melanoma	49	43	73	74	87	67	58	75	86	99	113	91	108	107	123	105	24	1,382
Kidney	207	245	265	311	305	300	317	301	358	358	440	439	466	588	430	436	41	5,807
Prostate	378	529	621	755	697	760	774	713	711	779	1,007	1,105	1,195	1,432	1,289	1,152	206	14,103
Thyroid	1,535	2,372	2,841	2,748	2,783	2,619	2,909	2,615	2,037	1,792	1,975	2,206	3,058	3,201	2,805	3,575	695	41,766
Pancreatic	246	254	274	298	290	313	362	324	326	442	546	455	526	605	545	631	174	6,611
Bile duct	324	330	316	340	338	418	397	331	360	409	463	472	519	453	470	423	112	6,475
Total	7,583	8,670	9,489	9,891	9,959	10,023	10,192	9,245	9,401	9,878	11,203	11,527	13,247	13,617	11,906	12,870	2,427	171,128

During the QC process, we established a comprehensive set of 143 human-driven logical comparisons, including 70 focused on identifying missing data, 41 ensuring temporal validity (e.g., the completion date of radiotherapy should coincide with or follow its initiation date), 15 pinpointing outlier data (such as age at menarche between 8 and 20 years), 13 selecting the relevant values among multiple time points, and 4 dedicated to spotting duplicated or inconsistent data. The QC logic outcomes showed consistent results across 11 different cancer types, comprising a total of 1,523 datasets. We initially set the estimated daily QC case count to 10%, which translated to approximately 81 cases per day.
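
The two example rules quoted above can be written as small predicates, and the category counts can be tallied to confirm that they add up to 143. The function names are ours; the thresholds and counts come directly from the text.

```python
from datetime import date

# Number of logical comparisons per category, as reported in the text.
RULE_COUNTS = {
    "missing data": 70,
    "temporal validity": 41,
    "outliers": 15,
    "selection among multiple time points": 13,
    "duplicated or inconsistent data": 4,
}

def rt_dates_valid(rt_start, rt_end):
    """Temporal rule: radiotherapy must end on or after the day it started."""
    return rt_end >= rt_start

def menarche_age_plausible(age_years):
    """Outlier rule: age at menarche is expected to fall between 8 and 20 years."""
    return 8 <= age_years <= 20

total_rules = sum(RULE_COUNTS.values())
same_day_ok = rt_dates_valid(date(2020, 3, 1), date(2020, 3, 1))
```

The five categories sum to 143, and a same-day start and completion passes the temporal rule because the text allows the completion date to coincide with the initiation date.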

We generated survival graphs for each of the 11 distinct cancers in our dataset, segmented by tumor stages (Fig. 4). Notably, except for prostate cancer, there is a significant variation in survival rates depending on the stage of cancer; generally, higher stages are associated with lower survival rates.

The efficacy of our data framework in rapidly generating and evaluating clinical hypotheses was demonstrated in a study published in 2022. In December 2020, following the development of the proposal and subsequent approval from the institutional review board, researchers requested baseline data on patients, tumors, and treatments, as well as peripheral blood neutrophil and lymphocyte counts, spanning the period between the initial diagnosis and the date of primary rectal surgery. Data abstraction for 1,386 individuals was efficiently executed using our framework, and its accuracy and reliability were confirmed by an experienced oncologist. This enabled researchers to commence a pilot analysis in January 2021, only a month after the initial data acquisition.

We successfully developed a clinical decision support system with four layouts. In the upper-left layout, shown in Fig. 5, three selected image series are displayed alongside their corresponding three-dimensional tumor visualizations, using DICOM files of individually, manually contoured lesions (Supplementary Video 1, Fig. 5a). In the middle-upper layout, the output of the longitudinal tumor tracking is displayed as a graph (Fig. 5b). A section has been reserved for the future integration of models that predict individual patient outcomes (Fig. 5c). The lower layout presents a comprehensive overview of a patient's healthcare journey, allowing users to intuitively understand the chronological sequence of events and progression of the patient's treatment (Fig. 5d, Supplementary Video 2). This offers holistic and interactive patient summaries on a graphical timeline anchored by real-time data captured within our framework. Users can easily assess patient data in a temporal context with a single click, and the depth of information can be fine-tuned using zoom features and pop-up boxes.

In this study, we successfully developed a cancer-specific information technology infrastructure designed to facilitate the longitudinal collection of comprehensive health data, an accomplishment realized through extensive cross-departmental collaboration. Using this framework, we established an automated DB that continually updates the data from 171,128 individuals, each distinguished by over 800 unique features. Manually collating such an expansive array of features is challenging. To ensure data integrity, we initially implemented rigorous data quality control methods, starting with manual logic applications and subsequently transitioning to an automated management system. This approach, conducted within closed-loop systems, led to a steady enhancement in data precision. Furthermore, we validated the caliber of our automatically harvested data by assessing survival rates in relation to cancer stage. Of practical significance, our system highlights the potential for real-time capture of disease state and treatment data, exemplified by a proof-of-concept for rapid clinical hypothesis testing and offering a holistic view of a patient's journey with a single click. This not only alleviates the clinical burden but also optimizes the research workflow.

Understanding the crucial role of a reliable automatic data supply chain, several research groups have collaboratively developed frameworks to capture and transport oncologic data.¹⁴,¹⁵ Morin et al.¹⁴ introduced MEDomics, an information technology infrastructure that integrates seamlessly with multiple EHR DBs to ensure uniform data collection. Their research amassed data from nearly 175,000 patients with cancer at the University of California, San Francisco, between 2010 and 2019. Employing rule-based selection techniques, they identified individuals with high-quality data, narrowing them down to 3,782 breast cancer and 2,054 lung cancer cases. Lower-quality data were more prevalent among individuals located further away from the institution, a trend associated with increased mortality rates. Jung et al.¹⁵ showcased ROOT, an auto-updating data warehouse that consolidates comprehensive clinical data of 67,617 individuals diagnosed with head and neck, thoracic, and esophageal cancers at the Samsung Medical Center in Korea between 2008 and 2020. These endeavors underscore the importance of data governance and active participation of all stakeholders. Considering geographic disparities and practice variations, in-house development might be best positioned to cater to the specific needs of end users.

Building an automated data warehouse using oncology EMR data poses inherent challenges because of the varying degrees of data completeness, inconsistencies, and conflicting or evolving records.¹⁶ In this context, a nationwide initiative was launched to create a comprehensive cancer data library aimed at standardizing terminology and classification within our country. Concurrently, institutional efforts aimed to gather extensive feedback and integrate preexisting registries from diverse cancer groups. The recent proposal of Operational Ontology for Oncology (O3) seeks to achieve multi-institutional and multi-stakeholder consensus, lowering the barriers for collaborative information aggregation.¹⁷ Our next task involves identifying differences and similarities between our defined features and the variables proposed by O3, and if possible, updating the necessary parts. To manage the vast variability of data sources and types, we devised algorithms that harness structured data from diverse origins and process unstructured content using ETL procedures. ETL operations present unique challenges, especially when dealing with components presenting multiple ETL-related complications. Collaboration with team members well-versed in treatment workflows and medical informatics, combined with close cooperation with the IT department, was pivotal in understanding the system functionality and nuances of data interpretation. Both data governance and ethical deliberation are instrumental in ensuring data security and patient privacy.

In the absence of formalized frameworks, challenges may arise in query fulfillment and data management.¹⁸ However, our data supply chain addresses this issue through an end-to-end workflow for data quality assurance, ensuring continual evaluation and improvement. Conflicting, missing, or incorrect data were identified through human-driven logical comparisons and rectified by making logical corrections or adjusting the algorithms. Since its implementation, the quality assurance workflow has been continuously refined, accumulating data checks across multiple cycles. This iterative process enhances data quality and reduces the need for human intervention. Engagement with various groups familiar with the data sources and limitations is essential.

The YCDL framework has numerous potential clinical and research applications. Although few studies have evaluated clinically relevant outcomes in oncology care, emerging evidence suggests that clinical decision support systems using EMR data can positively impact care quality.¹⁹ A recent randomized controlled trial by Hong et al.²⁰ demonstrated accurate triaging of patients with cancer and reduced acute care rates using an EMR-based machine learning algorithm. Coombs et al.²¹ showed that a proposed machine learning tool using real-world EMR data could identify patients with cancer at risk for a 60-day emergency department visit. Another potential application is the generation and rapid testing of clinical hypotheses, as suggested by Morin et al.,¹⁴ which would not have been feasible using traditional data approaches. The YCDL enabled the collection of a vast amount of data, including laboratory results and patient features, thereby facilitating the first pilot analysis. Additionally, automatic flagging of eligible patients for clinical trials shows promise.²¹

The real-time metadata supply chain can automatically display patient histories, thereby eliminating the need for physicians to navigate through numerous pages. With advancements in systemic drugs, patients with stage IV cancer now live longer and have complex treatment histories.²² A quick overview of a patient’s cancer journey allows physicians to efficiently characterize both the disease and the individual, potentially reducing burnout and ensuring quality care.²³ Commercial clinical decision support tools such as NAVIFY Oncology Hub,²⁴ Syapse,²⁵ and Flatiron Assist²⁶ are undergoing evaluation for integration into the EMR system to provide a comprehensive view of a patient's journey. With emerging local therapies,²⁷ AI can play a significant role in detecting and segmenting normal tissues and tumors,²⁸ as well as tracking lesions over time in relation to treatment.²⁹ However, further research on tumor auto-segmentation is warranted.

This study has several limitations that should be considered when interpreting our results. First, our method represents the experience of a single institution, and large-scale adjustments may be necessary for implementation elsewhere. Second, the data supply chain approach is designed as an expandable infrastructure that accommodates updated ontologies and evolving demands. Establishing strong leadership in data governance, implementing sharing agreements, and promoting open science practices are essential for a robust metadata supply chain. This requires dedicated departments to ensure job security. Collaborative efforts such as workshops and knowledge transfers promote an understanding of the benefits offered by the metadata supply chain and AI technologies. Future work will incorporate additional cancer types such as brain tumors and rare malignancies. Once the ETL process is finalized, we aim to make it publicly accessible. Our hospital primarily diagnoses and follows up with patients within our institution; however, inter-hospital data sharing may become necessary in certain cases. Finally, the current framework captures only survival and recurrence data, despite the growing recognition of the importance of quality of life and toxicity profiles as critical outcomes.

In conclusion, this study emphasizes the importance of leveraging computational methods and real-time data supply chains to address the challenges posed by the overwhelming volume of health information in oncology. Our collaborative endeavor to construct a robust data framework not only demonstrates its potential to enhance personalized care, facilitate AI applications, and refine clinical decision-making, but also serves as the cornerstone for acquiring comprehensive health data over time. To promote the widespread adoption of data supply chains and AI technologies, it is imperative to emphasize strong data governance, embrace open science practices, and foster collaborative efforts within the healthcare community.

Authors and contributors

JSC, JSK, and SJS designed the research. ESP and JEC collected the data. SJS verified the raw data. ESP, JSK, and SJS developed ETL and CDSS. JSC, ESP, JEC, JSL, JSK, and SJS analyzed the results. JSC wrote the manuscript, ESP and SJS critically revised the manuscript, and all authors provided feedback. All authors had full access to all the data in the study and read and approved the final manuscript.

Competing interests:

All authors declare no financial or non-financial competing interests.

Acknowledgments:

Portions of the content of this paper were presented at the 2023 CARO-COMP Joint Scientific Meeting (September 22, 2023, Montreal, Canada) and the Practical Big Data Workshop 2023 (May 19, 2023, Ann Arbor, MI).

This study was funded by the Big data Center at the National Cancer Center of Korea (grant number: 2020-data-we08). The funder played no role in study design, data collection, data analysis, data interpretation, or writing of this manuscript.

Data availability:

The Yonsei University Health System (YUHS) inaugurated the Severance Data Portal (SDP), a comprehensive medical big data platform, on 2 May 2023 (available at: https://sobig.yuhs.ac/portal). The SDP provides an accessible portal tailored for the research community, with a focus on medical investigations. It is supported by a 'Data Lake' search portal that empowers researchers to locate and harness extensive data sets aligned with their specific research goals. In the forthcoming expansion phase, YUHS intends to enhance the platform through the integration of pioneering digital medical imaging information systems, such as Picture Archiving and Communication Systems (PACS), along with digital pathology data and genomic analysis datasets. Access to the SDP is governed by stringent policies devised to safeguard patient confidentiality and to ensure adherence to all pertinent legal and ethical standards. Researchers aiming to utilize the SDP must submit an access application specifying the proposed data usage, which is then subjected to a thorough review process to ensure compliance with established data governance criteria.

References

1. Figueiredo, E. B. d., Dametto, M., Rosa, F. d. F. & Bonacin, R. A Multidimensional Framework for Semantic Electronic Health Records in Oncology Domain. In 2021 IEEE 30th International Conference on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE) 165–170 (2021).
2. Heart, T., Ben-Assuli, O. & Shabtai, I. A review of PHR, EMR and EHR integration: A more personalized healthcare and public health policy. Health Policy and Technology 6, 20–25 (2017).
3. Abernethy, A. P. et al. Rapid-learning system for cancer care. J Clin Oncol 28, 4268–4274 (2010).
4. Shanafelt, T. D. et al. Relationship Between Clerical Burden and Characteristics of the Electronic Environment With Physician Burnout and Professional Satisfaction. Mayo Clinic Proceedings 91, 836–848 (2016).
5. Davenport, T. & Kalakota, R. The potential for artificial intelligence in healthcare. Future Healthc J 6, 94–98 (2019).
6. Huang, S. C. et al. Fusion of medical imaging and electronic health records using deep learning: a systematic review and implementation guidelines. NPJ Digit Med 3, 136 (2020).
7. Cui, M. & Zhang, D. Y. Artificial intelligence and computational pathology. Lab Invest 101, 412–422 (2021).
8. Huynh, E. et al. Artificial intelligence in radiation oncology. Nat Rev Clin Oncol 17, 771–781 (2020).
9. Perez-Lopez, R., Reis-Filho, J. S. & Kather, J. N. A framework for artificial intelligence in cancer research and precision oncology. NPJ Precis Oncol 7, 43 (2023).
10. Shreve, J. T., Khanani, S. A. & Haddad, T. C. Artificial Intelligence in Oncology: Current Capabilities, Future Opportunities, and Ethical Considerations. Am Soc Clin Oncol Educ Book 42, 1–10 (2022).
11. Moor, M. et al. Foundation models for generalist medical artificial intelligence. Nature 616, 259–265 (2023).
12. Ramspek, C. L. et al. External validation of prognostic models: what, why, how, when and where? Clin Kidney J 14, 49–58 (2021).
13. Chung, C. & Jaffray, D. A. Cancer Needs a Robust "Metadata Supply Chain" to Realize the Promise of Artificial Intelligence. Cancer Res 81, 5810–5812 (2021).
14. Morin, O. et al. An artificial intelligence framework integrating longitudinal electronic health records with real-world data enables continuous pan-cancer prognostication. Nat Cancer 2, 709–722 (2021).
15. Jung, H. A. et al. Real-time autOmatically updated data warehOuse in healThcare (ROOT): an innovative and automated data collection system. Transl Lung Cancer Res 10, 3865–3874 (2021).
16. Kanas, G. et al. Use of electronic medical records in oncology outcomes research. Clinicoecon Outcomes Res 2, 1–14 (2010).
17. Mayo, C. S. et al. Operational Ontology for Oncology (O3) - A Professional Society Based, Multi-Stakeholder, Consensus Driven Informatics Standard Supporting Clinical and Research use of "Real-World" Data from Patients Treated for Cancer: Operational Ontology for Radiation Oncology. Int J Radiat Oncol Biol Phys https://doi.org/10.1016/j.ijrobp.2023.05.033 (2023).
18. Khare, R. et al. Design and Refinement of a Data Quality Assessment Workflow for a Large Pediatric Research Network. EGEMS (Wash DC) 7, 36 (2019).
19. Pawloski, P. A., Brooks, G. A., Nielsen, M. E. & Olson-Bullis, B. A. A Systematic Review of Clinical Decision Support Systems for Clinical Oncology Practice. J Natl Compr Canc Netw 17, 331–338 (2019).
20. Hong, J. C. et al. System for High-Intensity Evaluation During Radiation Therapy (SHIELD-RT): A Prospective Randomized Study of Machine Learning-Directed Clinical Evaluations During Radiation and Chemoradiation. J Clin Oncol 38, 3652–3661 (2020).
21. Coombs, L. et al. A machine learning framework supporting prospective clinical decisions applied to risk prediction in oncology. NPJ Digit Med 5, 117 (2022).
22. Colicchio, T. K., Cimino, J. J. & Del Fiol, G. Unintended Consequences of Nationwide Electronic Health Record Adoption: Challenges and Opportunities in the Post-Meaningful Use Era. J Med Internet Res 21, e13313 (2019).
23. Pivovarov, R. & Elhadad, N. Automated methods for the summarization of electronic health records. J Am Med Inform Assoc 22, 938–947 (2015).
24. Goh, E. et al. Remote evaluation of NAVIFY Oncology Hub using clinical simulation. Journal of Clinical Oncology 41, e13622-e13622 (2023).
25. Hirsch, J., Ford, J. M., Nadauld, L. & Hsu, A. Design and implementation of an informatics infrastructure for actionable precision oncology. Journal of Clinical Oncology 33, e17521-e17521 (2015).
26. Maniago, R. et al. Implementation of an EHR-embedded decision support tool in community oncology practices. Journal of Clinical Oncology 39, 274–274 (2021).
27. Liu, W., Bahig, H. & Palma, D. A. Oligometastases: Emerging Evidence. J Clin Oncol 40, 4250–4260 (2022).
28. Primakov, S. P. et al. Automated detection and segmentation of non-small cell lung cancer computed tomography images. Nat Commun 13, 3423 (2022).
29. Cai, J. et al. Deep Lesion Tracker: Monitoring Lesions in 4D Longitudinal Imaging Studies. arXiv 2012.04872 (2020).

Supplementary Files

Supplementaryfigures.docx
Supplementaryvideo1.mp4 (Supplementary video 1)
Supplementaryvideo2.mp4 (Supplementary video 2)

Editorial decision: revise
24 Apr, 2024
Review #4 received at journal
04 Apr, 2024
Reviewer #4 agreed at journal
26 Mar, 2024
Review #2 received at journal
16 Feb, 2024
Reviewer #3 agreed at journal
13 Feb, 2024
Reviewer #2 agreed at journal
06 Feb, 2024
Reviewer #1 agreed at journal
05 Feb, 2024
Reviewers invited by journal
02 Feb, 2024
Editor assigned by journal
15 Jan, 2024
Submission checks completed at journal
15 Jan, 2024
First submitted to journal
14 Jan, 2024
