Data-Driven Treatment Pathways Mining for Early Breast Cancer Using cSPADE Algorithm and System Clustering

Background: The cancer data in this study are multidimensional, multilayered, and chronologically ordered, which makes treatment paths difficult to extract. A new data mining scheme was therefore needed to extract breast cancer treatment paths effectively. This study aimed to determine whether the cSPADE algorithm combined with system clustering can effectively identify the treatment pathways for early breast cancer. Methods: We applied data mining technology to the electronic medical records (EMR) of 6891 early breast cancer patients to mine treatment pathways. We provided a method of extracting data from the EMR and performed three-stage mining: determining the treatment stage through the cSPADE algorithm → system clustering for treatment plan extraction → cSPADE mining of sequence patterns for treatment. The Kolmogorov-Smirnov test and correlation analysis were used to cross-validate the sequence rules of early breast cancer treatment pathways. Results: We identified 55 sequence rules for early breast cancer treatment, 3 preoperative neoadjuvant chemotherapy regimens, 3 postoperative chemotherapy regimens, and 2 chemotherapy regimens for patients without surgery. Pearson and Spearman correlation tests were performed under 5-fold cross-validation. At the significance level of P < 0.05, all correlation coefficients of support, confidence and lift were greater than 0.89. Using the Kolmogorov-Smirnov test, we found no significant differences between the sequence distributions. Conclusions: The cSPADE algorithm combined with system clustering can achieve hierarchical and vertical mining of breast cancer treatment models. Treatment pathways uncovered in this way make it possible to evaluate real-world breast cancer treatment behavior and provide a reference for the redesign and optimization of treatment pathways.


Background
At present, breast cancer ranks first among the diseases affecting women worldwide and is the leading cause of cancer deaths [1]. Given the high incidence of cancer, reducing mortality and prolonging survival have become a major research focus. To this end, a variety of treatment methods have been developed, including surgery, radiotherapy, chemotherapy, endocrine therapy and targeted therapy, and breast cancer treatment has begun to move toward personalized and precise treatment [2,3]. How to use these treatments effectively to improve the effect and quality of care is the problem we face.
However, the current level of breast cancer diagnosis and treatment in China lags behind that in European and North American countries. There are obvious gaps in diagnosis and treatment equipment and technology between regions and between hospitals. Inadequate compliance and a lack of standardization in guidelines, specifications and paths have hindered the improvement of the overall level of care [4]. Moreover, Chinese women differ from women in Europe and the United States in many aspects, such as breast structure, sex hormone levels, eating habits, and living environment. Foreign guidelines and clinical pathways may therefore not be completely applicable [5].
However, building a new clinical pathway from scratch is time-consuming for medical staff because it involves multidisciplinary collaboration, sequential design, and outcome measurement [6]. Traditional path development requires multiple experienced doctors, nurses and other related personnel to spend much time collecting data, ensuring that the data are fully evidence-based, and discussing them together [6]. Differences in personal experience often lead to differences in opinion, which cause deviations between the established path and actual practice. There are also some defects in the quality of clinical pathway design [7].
In recent years, advances in informatics, software engineering, mathematics and other disciplines have given rise to interdisciplinary research, and treatment pathway construction methods based on process mining technology have appeared [8][9][10][11]. The development of electronic medical records (EMR) makes it possible to extract and optimize treatment paths [12][13][14]. Most process mining algorithms can automatically build process patterns that are easy to understand and can be used for process optimization [15][16][17]. Recently, many studies have focused on developing sequential pattern mining methods to discover real-world treatment behavior patterns from clinical data, which has become a research hotspot [18][19][20]. However, current research focuses on mining drug treatment patterns [21][22][23].
The cancer data in this study are multidimensional, multilayered, and chronologically ordered, which makes treatment paths difficult to extract. A new data mining scheme was therefore needed to extract breast cancer treatment paths effectively. Sequence data consist of a series of ordered elements or events and may not include specific time concepts; examples include customer shopping sequences, website click streams and biological sequences. Such data describe events not at a single point in time but across a large number of points in time. Sequential pattern mining seeks to find the order among these data items. At present, sequential pattern mining has been applied to flood alarms [24], gene regulatory sequences [25], electronic health record (EHR) workflows [26], atrial fibrillation treatment paths [27], and disease comorbidity patterns [28].
This study pioneered the joint use of two algorithms from the field of data mining: the cSPADE algorithm and system clustering. On this basis, a three-stage mining method is proposed: determining the treatment stage through the cSPADE algorithm → system clustering for treatment plan extraction → cSPADE mining of sequence patterns for treatment. This method extracts breast cancer treatment pathways, which can be used to evaluate real-world breast cancer treatment, and provides a reference for the redesign and optimization of treatment pathways.

Materials
West China Hospital of Sichuan University is one of the top general hospitals in China and admits more than 260,000 inpatients from across the country and abroad every year. We extracted data for 6891 patients with stage 0-III breast cancer treated at West China Hospital of Sichuan University from 2011 to 2017. The average age of the patients was 48.67 ± 10.41 years. Patients were tracked longitudinally by registration number. There were 41,070 inpatient medical records, 381,830 outpatient medical records, and more than 10 million doctor's orders. We extracted the patients' general information and clinical characteristics, diagnoses, admission and discharge dates, and all inpatient and outpatient orders. The data used in this study are anonymous. Although under Chinese law this retrospective EMR study does not require the approval of the regional ethics review committee, we applied for and obtained the approval of the ethics committee of West China Hospital of Sichuan University (approval number: 2017-255).

Data Preparation
We labeled medical orders related to surgery, radiotherapy, chemotherapy, targeted therapy, and endocrine therapy. Fifty-nine of the 6891 breast cancer patients did not undergo primary antitumor treatment; that is, 6,832 patients ultimately received primary treatment. There were 5758 types of doctor's orders. Orders not related to breast cancer surgery, chemotherapy, radiotherapy, endocrine therapy, or targeted therapy, such as primary care or saline, were excluded. The remaining 138 original medical order names related to surgery, radiotherapy, chemotherapy, endocrine therapy, and targeted therapy were then labeled.

Mining the treatment pathways of early breast cancer based on the cSPADE algorithm and systematic clustering
The data in this study exist in multiple dimensions: doctor's orders, electronic medical records, outpatient records and others. They also have multiple levels. The first level of breast cancer treatment includes surgery, radiotherapy, chemotherapy, targeted therapy, etc. The second level includes the different options within each treatment; surgery, for example, includes radical surgery, modified radical surgery, breast-conserving surgery, and breast reconstruction. Treatments also follow a chronological order, such as chemotherapy first → surgery, or surgery first → chemotherapy. There are no methods for addressing such complicated path mining in the literature regarding the treatment of cancer patients. This study pioneered the joint use of two algorithms from the field of data mining: the cSPADE algorithm and system clustering. The cSPADE algorithm was used to mine sequential patterns of time-ordered treatment paths, and system clustering was then used to reduce the dimensionality of the different treatment methods.
The three-stage mining method we propose is as follows: determining the treatment stage through the cSPADE algorithm → systematic clustering for treatment plan extraction → cSPADE mining of the sequential patterns of treatment. In the first step, we identified the different treatment stages. In the second step, we summarized the typical treatment plans for each stage, either by combining findings from the literature with clinicians' suggestions or by cluster analysis. In the third step, we used sequential pattern mining to link the treatment plans of the different treatment stages in chronological order, find the corresponding rules, and finally determine the main treatment path. The process of data mining is shown in Fig. 1.
<Fig. 1 about here>
Figure 1 Mining process of early breast cancer treatment
Identifying all frequent sequential patterns in a transaction database, such as a large EMR database, requires efficient algorithms to handle large search spaces, and many different algorithms have been developed. In 2001, Zaki described an algorithm called sequential pattern discovery using equivalence classes (SPADE) that uses many strategies to make sequential pattern mining more efficient [29]. Sequential pattern mining usually starts with a transaction database, where each transaction has three fields: a "sequence" corresponding to the subject of the sequence (in this study, the patient's medical record number); a "transaction time" (in this study, the time of the doctor's order); and the "transaction-related items" (in this study, the doctor's orders). SPADE starts with a horizontal database layout, as shown in Table 1, but then converts the dataset into a vertical "id list" for each item, which contains all sequence IDs and transaction times at which the item occurs. Storing vertical id lists allows sequential patterns to be found by intersecting the id lists. For example, the intersection of the id lists of two items can be used to find a sequential pattern such as (unilateral mastectomy, exemestane tablets). This method minimizes the number of database scans required. SPADE also takes advantage of common prefixes between sequences to reduce memory requirements. cSPADE is a version of SPADE that supports constraints on sequences.

Mining Primary Breast Cancer Treatment Pathways
A total of 6,832 patients underwent primary treatment in this study. The main treatments for early breast cancer include surgery, chemotherapy, radiotherapy, endocrine therapy, and targeted therapy. In this study, the cSPADE algorithm was used to mine sequence patterns for early breast cancer treatment, and the resulting sequence patterns were used as the primary treatment pathways for early breast cancer.
We set the minimum support to 0.15 and the maximum sequence length to 10.
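The vertical id-list join described above can be illustrated with a minimal Python sketch. It is not the authors' implementation (the study used R), and it omits SPADE's equivalence classes and cSPADE's constraints. The toy records, the item names, and the helper names `to_vertical`, `temporal_join`, and `support` are assumptions made for illustration only.

```python
from collections import defaultdict

# Toy horizontal database: (sequence id, transaction time, item).
# Patient ids, times, and item names are made up for illustration.
HORIZONTAL = [
    (1, 1, "chemotherapy"), (1, 2, "surgery"), (1, 3, "endocrine therapy"),
    (2, 1, "surgery"), (2, 2, "endocrine therapy"),
    (3, 1, "surgery"), (3, 2, "chemotherapy"),
    (4, 1, "radiotherapy"),
]


def to_vertical(rows):
    """Convert horizontal rows into one id list of (sid, time) pairs per item."""
    idlists = defaultdict(list)
    for sid, time, item in rows:
        idlists[item].append((sid, time))
    return dict(idlists)


def temporal_join(first, second):
    """Id list for the pattern 'first -> second'.

    Keeps each (sid, time) of `second` whose sequence also contains an
    earlier occurrence of `first`.
    """
    earliest = {}
    for sid, t in first:
        earliest[sid] = min(t, earliest.get(sid, t))
    return [(sid, t) for sid, t in second
            if sid in earliest and earliest[sid] < t]


def support(idlist, n_sequences):
    """Fraction of sequences that contain the pattern at least once."""
    return len({sid for sid, _ in idlist}) / n_sequences


MIN_SUPPORT = 0.15  # the threshold reported in the text

vertical = to_vertical(HORIZONTAL)
n_patients = len({sid for sid, _, _ in HORIZONTAL})
pattern = temporal_join(vertical["surgery"], vertical["endocrine therapy"])
pattern_support = support(pattern, n_patients)
print(pattern, pattern_support, pattern_support >= MIN_SUPPORT)
```

Because each join works only on the two id lists involved, the support of a longer pattern can be computed without rescanning the full database, which is the property the text attributes to SPADE.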

Extraction of the breast cancer treatment plan
In the first step of data mining, we identified the primary treatment path for early breast cancer and the stages of treatment: preoperative stage (neoadjuvant chemotherapy) → surgery → postoperative chemotherapy → radiotherapy → endocrine therapy. To analyze the different treatment plans in more detail, we then subdivided the treatment plans within each stage.
Breast cancer surgery methods are divided into expanded radical surgery, radical surgery, modified radical surgery, breast-conserving surgery, breast reconstruction, sentinel lymph node biopsy, and supraclavicular lymph node dissection.
Because chemotherapy may be given before surgery, after surgery, or to patients who do not undergo surgery, we grouped chemotherapy orders by their timing relative to surgery and analyzed the chemotherapy regimens in these three groups. Because multiple drugs were used in the same chemotherapy course, we clustered the orders of preoperative, postoperative, and nonoperative chemotherapy separately, based on the co-occurrence of drugs.
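Clustering drugs by co-occurrence can be sketched as follows. This is a hedged illustration rather than the authors' procedure: it assumes that "system clustering" means agglomerative hierarchical clustering, and the Jaccard distance, average linkage, cut-off value, and placeholder drug names are all assumptions not stated in the text.

```python
from itertools import combinations

# Toy data: each set holds the drugs ordered together in one chemotherapy
# course. Drug names and courses are placeholders, not study data.
COURSES = [
    {"drug_A", "drug_B"}, {"drug_A", "drug_B"}, {"drug_A", "drug_B", "drug_C"},
    {"drug_C", "drug_D"}, {"drug_C", "drug_D"}, {"drug_E"},
]


def jaccard_distance(a, b, courses):
    """1 - (courses containing both drugs) / (courses containing either)."""
    both = sum(1 for c in courses if a in c and b in c)
    either = sum(1 for c in courses if a in c or b in c)
    return 1.0 - both / either


def average_linkage(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def agglomerate(courses, cut):
    """Merge the two closest clusters until the closest pair exceeds `cut`."""
    drugs = sorted(set().union(*courses))
    dist = {frozenset((a, b)): jaccard_distance(a, b, courses)
            for a, b in combinations(drugs, 2)}
    clusters = [[d] for d in drugs]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        (i, j), d = min(
            (((i, j), average_linkage(clusters[i], clusters[j], dist))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1],
        )
        if d > cut:
            break
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]  # j > i, so index i is unaffected
    return sorted(clusters)


regimens = agglomerate(COURSES, cut=0.5)
print(regimens)
```

Each resulting cluster groups drugs that tend to be ordered together, which is one way to read a cluster as a candidate regimen; in practice the cut-off would be chosen with clinical input.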
After referring to the relevant literature and consulting breast cancer radiotherapy experts, we did not subdivide the radiotherapy plan. Endocrine drugs were classified into aromatase inhibitors, selective estrogen receptor modulators, and progestins according to their original categories.

Mining Of Secondary Treatment Pathways For Early Breast Cancer
After identifying the different treatment stages of the primary treatment pathway, we subdivided the treatment plans and, on this basis, examined the secondary pathways for early breast cancer treatment. We continued to use the R language and the cSPADE algorithm, setting support = 0.02, confidence = 0.5, and lift = 1. After the sequence rules were generated, redundant rules were removed.
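The three rule measures can be made concrete with a short Python sketch. It uses the standard definitions (confidence as the support of the whole rule divided by the support of its antecedent, lift as confidence divided by the support of the consequent) and treats the reported values as minimum thresholds, which is an assumption. The toy sequences and the helper names `contains`, `support`, and `rule_metrics` are illustrative and do not come from the study.

```python
# Toy treatment sequences, one per patient (illustrative, not study data).
DB = [
    ["surgery", "chemotherapy", "radiotherapy"],
    ["surgery", "chemotherapy"],
    ["surgery", "endocrine therapy"],
    ["radiotherapy"],
    ["endocrine therapy"],
]


def contains(sequence, pattern):
    """True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(step in it for step in pattern)


def support(db, pattern):
    """Fraction of sequences in `db` that contain `pattern`."""
    return sum(contains(seq, pattern) for seq in db) / len(db)


def rule_metrics(db, antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent => consequent."""
    rule_support = support(db, antecedent + consequent)
    confidence = rule_support / support(db, antecedent)
    lift = confidence / support(db, consequent)
    return rule_support, confidence, lift


# Thresholds as reported in the text, read here as minimum values.
MIN_SUPPORT, MIN_CONFIDENCE, MIN_LIFT = 0.02, 0.5, 1.0

sup, conf, lift = rule_metrics(DB, ["surgery"], ["chemotherapy"])
keep = sup >= MIN_SUPPORT and conf >= MIN_CONFIDENCE and lift >= MIN_LIFT
print(sup, conf, lift, keep)
```

A lift above 1 means that chemotherapy follows surgery more often than the overall frequency of chemotherapy would suggest, which is why a lift threshold of 1 filters out uninformative rules.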

Cross-validation
To verify the stability and accuracy of the results, we used 5-fold cross-validation. The Kolmogorov-Smirnov test, Pearson correlation analysis and Spearman correlation analysis were used to compare the sequence rules of early breast cancer treatment pathways across folds.
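As a rough sketch of this comparison, the Python code below computes the Pearson and Spearman correlations and the two-sample Kolmogorov-Smirnov statistic for one rule measure observed in a training fold and a validation fold. The paired values are invented, the pairing of rules between folds is an assumption about how the comparison was set up, and p-values are omitted because they would require a statistics library.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def ks_statistic(x, y):
    """Largest gap between the two empirical distribution functions."""
    points = sorted(set(x) | set(y))
    return max(
        abs(sum(v <= p for v in x) / len(x) - sum(v <= p for v in y) / len(y))
        for p in points
    )


# Hypothetical support values of the same four rules in two folds.
train_support = [0.40, 0.25, 0.10, 0.05]
valid_support = [0.38, 0.27, 0.12, 0.04]

r = pearson(train_support, valid_support)
rho = spearman(train_support, valid_support)
d = ks_statistic(train_support, valid_support)
print(r, rho, d)
```

High correlations together with a small Kolmogorov-Smirnov statistic would indicate that the rule measures behave similarly in both folds, which is the sense of stability described in the text.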

Results
Overall situation of early breast cancer treatment
This study identified 30 primary breast cancer treatment pathways, as shown in Table 2. Patterns of length one included surgery, radiotherapy, chemotherapy, and endocrine therapy. The support for surgery was 0.8622658, indicating that 86.2% of patients had surgery. Targeted therapy, however, did not appear among the frequent sequence patterns, indicating that the proportion of patients who received targeted therapy was less than 15%. The longest of these sequence patterns had 4 items. According to the initial treatment, we divided the patterns of 2-4 items into neoadjuvant chemotherapy + surgery, surgery-based treatment, chemotherapy-based treatment, and other treatments (radiotherapy and endocrine treatment).

Results of early breast cancer chemotherapy regimen clustering
Table 3 shows the results of clustering. Clustering yielded 3 types of preoperative chemotherapy, 3 types of postoperative chemotherapy, and 2 types of nonoperative chemotherapy.
As a result, we found 55 rules for breast cancer treatment, which we use as secondary pathways for early breast cancer treatment. Table 4 shows the most supported sequence rules for different treatments and different numbers of items. The rules contained between 2 and 5 items. According to the initial treatment, we grouped these rules into neoadjuvant chemotherapy + surgery, surgery-based treatment, and radiotherapy + endocrine therapy. Within each major category, the rules were sorted in descending order by the number of items and by the support for each rule.
Neoadjuvant chemotherapy was mainly preoperative chemotherapy 1. Surgical treatments after neoadjuvant chemotherapy were mainly radical surgery and modified radical surgery. Among the pathways with surgery as the main treatment, the pathway from modified radical surgery to postoperative chemotherapy 1 had the highest proportion (60.9%), followed by breast-conserving surgery, radical surgery, and modified radical surgery + breast reconstruction surgery. Among the pathways dominated by radiotherapy and endocrine therapy, selective estrogen receptor modulators were the most commonly used endocrine regimens, followed by aromatase inhibitors.

Results Of Cross-validation
Pearson and Spearman correlation tests were performed under 5-fold cross-validation. Table 5 shows the results of cross-validation. At the significance level of P < 0.05, all correlation coefficients of support, confidence and lift were greater than 0.89. Using the Kolmogorov-Smirnov test, we found no significant differences between the sequence distributions.
Table 5 Cross-validation of sequence rules for early breast cancer treatment pathways

Discussion
Previous studies have used frequent itemsets and association rule mining to infer relationships between drugs, laboratory results, and problems [30][31][32]. However, these data mining techniques cannot capture temporal information. Identifying all frequent sequence patterns in a transaction database, especially in a large electronic medical record database, requires efficient algorithms to handle large search spaces. Sequential pattern mining is a data mining technique for identifying patterns of ordered events in a database, and it can be used for pattern recognition and prediction. Wright AP et al. [21] used sequential pattern mining to evaluate whether the method can effectively identify the temporal relationships between diabetes drugs and accurately predict the next drug likely to be prescribed for a patient. Tang C et al. [22] examined the use of sequential pattern mining techniques in large clinical datasets on the treatment and drug use patterns of childhood pneumonia. Zhan C et al. [23] discovered the side effects of drugs by mining prescription sequences.
However, the above studies explored only the use of drugs, whereas cancer treatment is often comprehensive, including surgery, chemotherapy, radiotherapy, etc., and each treatment method comprises a variety of drugs, technologies and other components. The multidimensional and multilevel data make analysis difficult. How to discover information useful for clinical treatment and hospital management is the challenge we face. To perform data mining at different granularities without generating too much information, we designed a three-step data mining method that combines the cSPADE algorithm with cluster analysis to mine early breast cancer treatment paths. Through first-step sequence pattern mining, we found 30 frequent sequence patterns and determined the treatment stages of early breast cancer. Based on these treatment stages, we used guidelines and literature, expert consultation and cluster analysis to classify treatment options. Finally, we applied sequential pattern mining again and mined 55 sequence rules as secondary treatment paths for early breast cancer. When clustering chemotherapeutic drugs, we drew on previous studies that used cluster analysis to obtain typical treatment plans. For example, Jingfeng C et al. [33] used AP clustering to extract typical treatment plans from EMR.
This study demonstrates the effectiveness of sequential pattern mining, which can be used to determine the order in which different treatments are used, to mine treatment paths from the real world, and to reconstruct stepwise usage patterns similar to those recommended. After mining the sequence patterns, we used 5-fold cross-validation to explore the stability of the cSPADE results for sequence rule mining of early breast cancer treatment paths. Table 5 shows the number of patterns found in each iteration and the results of the statistical tests performed on each row: Kolmogorov-Smirnov, Pearson, and Spearman correlation tests for support, confidence, and lift. All measures indicated that, in each iteration, the patterns were similar at a significance level of P < 0.05. All Pearson and Spearman correlation coefficients are higher than 0.75 [19], and the smallest correlation coefficient is 0.89, which means that the metric values in the training and validation sets are similar. The p-values of the Kolmogorov-Smirnov tests for the similarity of the distributions are all above 0.7, so no significant difference is found in the distributions.

Conclusion
Large-scale electronic medical record data are available, providing a unique opportunity to discover patterns using data mining methods. We combined the cSPADE algorithm and system clustering and adopted a three-stage mining process to realize multilevel, vertical mining of breast cancer treatment patterns. We cross-validated the sequence rules for early breast cancer treatment pathways and confirmed the stability of the results. Our results show that this approach can be used to generate potentially interesting and clinically meaningful cancer treatment sequence patterns and to manage clinical pathways in real-world cancer treatment, thereby guiding clinical practice.

Abbreviations
EMR: electronic medical records; EHR: electronic health records; SPADE: sequential pattern discovery using equivalence classes; cSPADE: SPADE with constraints on sequences

Availability of data and material
The datasets used and analyzed during the current study are available from the corresponding author on reasonable request.

Competing interests
All the authors declare that they have no conflicts of interest.

Trends in the use of different treatments for breast cancer