Dataset
In this study, we used EMR data from Gachon Gil Medical Center, which in 2016 became the first hospital in Korea to introduce IBM’s WfO. Gachon Gil Medical Center has obtained reliable data through WfO and through multi-disciplinary medical treatments involving face-to-face interaction between patients and three or more cancer treatment specialists. Data were collected from patients who had undergone colorectal cancer surgery between 2004 and 2012. The dataset covered demographics as well as disease, cancer, tumor, treatment, survival, and genetic characteristics. This standard information is based on the colorectal cancer Common Data Model (CDM) definition employed by five domestic hospitals, including Gachon Gil Medical Center.
The EMR data of the Gachon Gil Medical Center are divided into Scan EMR, XML EMR, and Database EMR according to the storage method. In Scan EMR and XML EMR, data may be deleted or entered incorrectly while the medical record administrator checks the record. Therefore, to verify the reliability and integrity of the extracted dataset, several colorectal cancer specialists and medical record administrators collaborated on a chart review.
The chart review was a detailed three-step process carried out over a six-month period. In the first step, the extracted data were checked to ensure that they were properly mapped to the codes described in the colorectal cancer CDM definition document and had been extracted from the correct location by the normal method. In the second step, to ensure the reliability of the extracted data, incorrect data, such as redundant records and incorrect inputs, were identified and removed. This chart review process was repeated at monthly intervals under the supervision of a colorectal cancer specialist. In the final step, to reduce unnecessary bias in the deep-learning model training, the colorectal cancer specialists selected first-priority variables that are highly related to survival. Table 1 presents the categories and the variables in each category.
Table 1. Dataset description
| Type | Category | Variables |
|---|---|---|
| Input Variables | Demographics | Age, Sex, ASA, BMI, Smoking History |
| Input Variables | Disease Characteristics | DM History, Pulmonary Disease, Liver Disease, Heart Disease, Kidney Disease |
| Input Variables | Cancer Characteristics | Prior Diagnosis Cancer, Initial CEA, Perforation, Obstruction, Emergency, Lymphovascular Invasion, Perineural Invasion, Distal Resection Margin, Radial Margin, Harvested Lymph Node, Positive Lymph Node, Early Complication, Recurrence |
| Input Variables | Tumor Characteristics | Hereditary Colorectal Tumor, Tumor Location (Pathology), Histologic Type, TNM Stage (Pathology) |
| Input Variables | Genetic Characteristics | K-ras, N-ras, BRAF |
| Input Variables | Treatment Characteristics | Postoperative Chemotherapy |
| Input Variables | Survival Characteristics | Overall Survival |
| Target Variables | Chemotherapy | Postoperative Chemotherapy Regimen (5-FU/LV, XELODA, FOLFOX, FOLFIRI, Surveillance) |
Data Preprocessing and Oversampling
Data Preprocessing
Data preprocessing is essential for obtaining correct analysis results. If it is not performed correctly, the relationships between variables may be distorted and accurate results may not be obtained[22], so careful preprocessing is needed to build a solid model. In this study, we focused on missing-value processing and on categorical and continuous variable processing before constructing the deep-learning models.
First, we checked the missing-value ratio of each variable and excluded any variable whose ratio was >80%, because such a variable cannot supply enough samples for training. Additionally, all instances with a missing value in the prediction target class “Post OP Chemo Regimen” were excluded. Continuous variables such as “Age,” “ASA,” and “CEA” have different ranges, and if training is performed without adjusting these ranges, overfitting may occur and training may not proceed normally[23]. Therefore, each continuous variable was rescaled to the range [-1, 1] using min-max normalization. Categorical variables consist mostly of character data rather than numeric data and therefore cannot be used directly in computation. One-hot encoding was thus applied to convert each categorical variable into a vector of 0s and 1s. Fig 1 shows the data preprocessing process, including the data oversampling step.
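The preprocessing steps described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' pipeline: the toy records, the helper names (`drop_sparse_columns`, `min_max_scale`, `one_hot`), and the use of `None` to mark missing values are all assumptions made for illustration.

```python
TARGET = "Post OP Chemo Regimen"


def missing_ratio(rows, column):
    """Fraction of records whose value for `column` is missing (None)."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def drop_sparse_columns(rows, columns, threshold=0.8):
    """Keep only columns whose missing-value ratio does not exceed `threshold`."""
    return [c for c in columns if missing_ratio(rows, c) <= threshold]


def min_max_scale(values, low=-1.0, high=1.0):
    """Linearly rescale values to [low, high] using the observed min and max."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:  # constant column: map everything to the lower bound
        return [low for _ in values]
    return [low + (v - v_min) * (high - low) / span for v in values]


def one_hot(values):
    """Encode each categorical value as a 0/1 indicator vector."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical toy records (not study data); None marks a missing value.
records = [
    {"Age": 40, "Sex": "M", "K-ras": None, TARGET: "FOLFOX"},
    {"Age": 60, "Sex": "F", "K-ras": None, TARGET: None},
    {"Age": 80, "Sex": "M", "K-ras": None, TARGET: "XELODA"},
    {"Age": 50, "Sex": "F", "K-ras": None, TARGET: "Surveillance"},
]

# 1) Drop variables with more than 80% missing values.
features = drop_sparse_columns(records, ["Age", "Sex", "K-ras"])
# 2) Drop instances whose target value is missing.
labelled = [r for r in records if r[TARGET] is not None]
# 3) Rescale a continuous variable and encode a categorical one.
scaled_age = min_max_scale([r["Age"] for r in labelled])
sex_categories, sex_encoded = one_hot([r["Sex"] for r in labelled])
```

In practice the minimum and maximum used for scaling would normally be computed on the training split only and then reused for the test split, so that no test information leaks into training.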
Data Oversampling
In most real-world data, the classes of the target variable are unevenly distributed[24, 25]; data of this form are called “imbalanced data.” In particular, medical data generated in a clinical environment tend to be severely imbalanced. A class with a relatively small proportion is usually called a “Minor Class” and a class with a large proportion a “Major Class”[26]. If a model is trained on imbalanced data, it is highly likely that the Minor Class will not be learned properly and that all data will be classified as the Major Class[27]. Various methods, such as undersampling and oversampling, have been proposed to solve this problem. Undersampling balances the class proportions by removing some data from the Major Class, whereas oversampling balances them by replicating data from the Minor Class. Undersampling is generally used when the amount of data is sufficient. Because the dataset used in this study is not large enough, undersampling would make it difficult to train a model properly. Therefore, we addressed the data imbalance by oversampling with the Bootstrap Resampling algorithm[28], which allows effective inference with a small amount of data. Oversampled data were added only to the minority classes in the training set so that test performance was not affected. Fig 2 shows the bootstrapping-based oversampling process.
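Bootstrap-based oversampling of the training set can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `bootstrap_oversample` is hypothetical, and the choice to resample every smaller class up to the size of the largest class is an assumption, since the text does not state the target class ratio.

```python
import random
from collections import Counter


def bootstrap_oversample(samples, labels, seed=0):
    """Resample smaller classes with replacement until each matches the largest class.

    Intended for the training split only; the test split is left untouched.
    """
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(items) for items in by_class.values())  # assumed balance target

    out_samples, out_labels = list(samples), list(labels)
    for y, items in by_class.items():
        # rng.choices draws with replacement, i.e. a bootstrap draw from this class.
        extra = rng.choices(items, k=target - len(items))
        out_samples.extend(extra)
        out_labels.extend([y] * len(extra))
    return out_samples, out_labels


# Hypothetical training split: 6 FOLFOX, 3 XELODA, 1 FOLFIRI (sample ids 0..9).
train_x = list(range(10))
train_y = ["FOLFOX"] * 6 + ["XELODA"] * 3 + ["FOLFIRI"] * 1

balanced_x, balanced_y = bootstrap_oversample(train_x, train_y)
class_counts = Counter(balanced_y)
```

Because the extra samples are copies of existing minority-class records, splitting into training and test sets before oversampling keeps duplicated records out of the test set.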
Structure of Chemotherapy Recommender
To predict and recommend treatment methods, we developed a deep feed-forward neural network called the Colorectal Cancer Chemotherapy Recommender (C3R); the feed-forward network is the most basic implementation of the Deep Neural Network (DNN). The model is designed as a perceptron structure with three layer types, in the order [Input Layer] – [Hidden Layer] – [Output Layer]. The nodes composing each layer are [Input: 54] – [Hidden: 64] – [Hidden: 128] – [Hidden: 256] – [Hidden: 64] – [Output: 5]. We tuned the hyperparameters with a grid search over the following ranges: Layers ∈ {1, 2, 3, 4}, Batch size ∈ {32, 64, 128, 256}, Learning Rate ∈ {0.1, 0.01, 0.001, 0.0001}, Optimization Algorithm ∈ {Adam[29], Adadelta[30], RMSProp[31]}. The grid search selected a batch size of 64, a learning rate of 0.001, and the Adam optimizer. A dropout layer is added between the hidden layers to prevent overfitting. ReLU[32] was used as the activation function of every layer except the output layer, which uses Softmax[33]. The Softmax activation function maps the output-layer values to probabilities between 0 and 1 that sum to 1; this can be expressed as follows:
See formula 1 in the supplementary files.
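For orientation, the standard softmax definition is given below; formula 1 presumably takes this form. The symbols $z_i$ (the pre-activation value of output node $i$) and $K$ (the number of output classes, 5 for C3R) are notation assumed here rather than taken from the supplementary file.

```latex
\operatorname{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \dots, K
```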
The returned probability value is defined as the Chemotherapy Recommendation Index, and according to this value, priority can be determined for suggesting an appropriate treatment method to the patient. Fig 3 shows the detailed structure of C3R.
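How the output-layer values become a ranked recommendation can be sketched as follows. This is an illustrative sketch only: the logit values are hypothetical, the order of the regimens in the output layer is an assumption, and `rank_regimens` is a name introduced here.

```python
import math

# Assumed output-node order; the actual order used by C3R is not stated.
REGIMENS = ["5-FU/LV", "XELODA", "FOLFOX", "FOLFIRI", "Surveillance"]


def softmax(logits):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rank_regimens(logits, labels=REGIMENS):
    """Return (regimen, recommendation index) pairs, highest index first."""
    scores = softmax(logits)
    return sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)


# Hypothetical output-layer values for one patient.
ranking = rank_regimens([0.2, 1.5, 3.1, -0.4, 0.9])
first_choice, second_choice = ranking[0][0], ranking[1][0]
```

Sorting by the index gives a priority order, so the first and second entries correspond to the model's first and second suggestions for the patient.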
Model Verification and Evaluation Method
To evaluate the performance of the proposed C3R model, we used a confusion matrix. We then compared the diagnosis concordance rate between the C3R model and the Gachon Colorectal Cancer Treatment Protocol (GCCTP) and NCCN guidelines to verify the validity of C3R. The comparative indicators were Top-1 Accuracy and Top-2 Accuracy, because the treatment methods proposed in each guideline are ordered by priority. A C3R recommendation counts toward Top-1 Accuracy if it is among the preferred treatment methods proposed by the guideline, and toward Top-2 Accuracy if it is among the next-priority treatments. Fig 4 shows the model verification process, including the model performance evaluation.
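The Top-1 and Top-2 concordance calculation can be sketched as below. Two assumptions are made here: each guideline is represented per patient as a pair of regimen sets (preferred, next priority), and Top-2 Accuracy is cumulative, so that a Top-1 match also counts toward Top-2. The function name `concordance` and the example data are hypothetical.

```python
def concordance(recommendations, guideline_options):
    """Return (Top-1, Top-2) concordance between model output and a guideline.

    recommendations: one recommended regimen per patient.
    guideline_options: per patient, a (preferred, next_priority) pair of sets.
    """
    top1 = top2 = 0
    for rec, (preferred, next_priority) in zip(recommendations, guideline_options):
        if rec in preferred:
            top1 += 1
            top2 += 1  # assumption: Top-2 is cumulative and includes Top-1 matches
        elif rec in next_priority:
            top2 += 1
    n = len(recommendations)
    return top1 / n, top2 / n


# Hypothetical recommendations and guideline options for four patients.
model_recs = ["FOLFOX", "XELODA", "Surveillance", "FOLFIRI"]
guideline = [
    ({"FOLFOX"}, {"XELODA"}),
    ({"FOLFOX"}, {"XELODA", "5-FU/LV"}),
    ({"Surveillance"}, set()),
    ({"FOLFOX"}, {"XELODA"}),
]

top1_accuracy, top2_accuracy = concordance(model_recs, guideline)
```

Here two of the four recommendations match a preferred option and one more matches a next-priority option, giving 0.5 and 0.75 under the cumulative assumption.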
Gachon Colorectal Cancer Treatment Protocol
First, we used the GCCTP, which colorectal cancer specialists at the Gachon Gil Medical Center use to determine treatment options for patients. The GCCTP is a rule-based treatment recommendation system built on the empirical knowledge of many colorectal cancer specialists. It allows a colorectal cancer specialist to diagnose a patient’s condition and determine treatment options according to information such as the patient’s demographics, TNM stage, and risk factors. Fig 5 shows an example of a colorectal cancer treatment protocol for a case without metastasis.
NCCN Guidelines
The NCCN guidelines, published by experts from 28 cancer centers in the United States, reflect expert opinion and serve as an international standard for cancer care. They cover the diagnosis, treatment decisions, and treatments for 97% of cancers in the United States and are updated annually with new medical evidence to provide optimal clinical guidance for treating cancer patients. The NCCN guidelines are divided into rectal cancer and colon cancer guidelines. Version 2 of 2019 was used for the verification of colon cancer, and version 2 of 2018 was used for the verification of rectal cancer[34, 35].
Performance Evaluation Metrics
We used the confusion matrix to evaluate the performance of the C3R model. The confusion matrix, which is typically used for evaluating the performance of an algorithm[36], compares the actual results with the model prediction results in table form, using four categories: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). A TP is a case predicted as true that is actually true; a TN is a case predicted as false that is actually false; an FP is a case predicted as true that is actually false; and an FN is a case predicted as false that is actually true. These counts can be used to generate various evaluation indicators, such as accuracy, sensitivity, specificity, precision, recall, F1-score, and area under the curve (AUC)[37]. In this study, we used precision, recall, F1-score, and AUC, which remain informative under class imbalance.
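As a minimal sketch of how these indicators follow from the confusion-matrix counts, the function below computes precision, recall, and F1-score for one regimen treated one-vs-rest. The labels are hypothetical toy data, the helper name `per_class_metrics` is introduced here, and AUC is omitted because it requires predicted probabilities rather than hard labels.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 for `positive` treated as one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical actual and predicted regimens for six patients.
y_true = ["FOLFOX", "FOLFOX", "FOLFOX", "XELODA", "XELODA", "Surveillance"]
y_pred = ["FOLFOX", "FOLFOX", "XELODA", "XELODA", "FOLFOX", "Surveillance"]

folfox_metrics = per_class_metrics(y_true, y_pred, "FOLFOX")
surveillance_metrics = per_class_metrics(y_true, y_pred, "Surveillance")
```

For FOLFOX the counts are TP = 2, FP = 1, and FN = 1, so precision, recall, and F1-score all equal 2/3; averaging such per-class values is one common way to summarize a five-class model.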