Predicting Endometriosis Onset Using Machine Learning Algorithms


Background

Endometriosis is a common, progressive disorder in which tissue similar to the lining of the uterus grows on other parts of the body, such as the ovaries, fallopian tubes, bowel, and other reproductive organs. It is one of the most common causes of pelvic pain and infertility in women. In the US, one in every ten women of reproductive age has endometriosis. The actual cause of endometriosis is still unknown, and the condition is difficult to diagnose. There are several theories regarding the cause; however, none has been scientifically proven.
Methods

In this paper, we try to identify the drivers of endometriosis diagnoses by leveraging advanced Machine Learning (ML) algorithms. The primary risks of infertility and other health complications can be minimized to a great extent if the likelihood of endometriosis can be predicted well in advance, so that proper medical care and treatment can be given to the affected patients. To demonstrate feasibility, Logistic Regression (LR) and eXtreme Gradient Boosting (XGB) algorithms were trained on 36 months of medical history data.
Results

The machine learning models were used to predict the likelihood of disease for qualified patients from the healthcare claims patient level database. Several features directly and indirectly related to the condition were identified as important for accurate prediction of its onset, including selected diagnosis and procedure codes.
Conclusions

Leveraging the machine learning approaches can aid early prediction of the disease and offer an opportunity for patients to receive the needed medical treatment earlier in the patient journey. Creating a typing tool that can be integrated into the Electronic Health Records (EHR) systems and easily accessed by healthcare providers could further aid the objective of improving the diagnosis activities and inform the diagnostic processes that would result in timely and precise diagnosis, ultimately increasing patient care and quality of life.


Background
Recent advancement in Artificial Intelligence (AI) and Machine Learning (ML) has provided the opportunity for AI and ML application in the healthcare area, while also slowly improving on the performance benchmark set by the classical statistical techniques [1]. In recent years, healthcare service providers have also shown interest in data science and machine learning for disease diagnosis. Disease prediction using data mining and machine learning algorithms on patient medical history, such as disease diagnoses, medical and surgical procedures, therapeutics, and treatments, has been slowly introduced to aid decision making processes [2,3,4]. Many statistical and machine learning techniques have been applied to either pathological or clinical data to study the disease in detail and also predict its likelihood of occurrence. Deep learning algorithms such as Convolutional Neural Network (CNN) have been found to predict disease onset and progression with a greater precision compared to analyzing just medical image data [5].
Since healthcare is one of the leading industries with a large amount of structured and unstructured data, it is imperative to use the known advanced techniques to extract the hidden data patterns. Machine Learning algorithms, with the help of big data technology, have made it easier to mine the vast amount of unstructured data and have aided in making important decisions related to patients' health [6]. Due to their high precision and robustness in comparison to conventional statistical methods, many medical scientists have been attracted to these models to understand the key drivers of disease onset and progression. Artificial Intelligence, Machine Learning, and big data have been playing a pivotal role in improving healthcare infrastructure, patient care, disease diagnosis, prediction and forecasting, and drug discovery, thereby reducing medical costs, shortening the time to diagnosis and treatment, and enhancing patients' quality of life and access to healthcare [7].
With this motivation in mind, we selected endometriosis as the condition to study in this article. Endometriosis is one of the most common disorders seen in women of menstruating age, in which tissue like the endometrial lining grows on the outer part of the uterus and other organs of the pelvic region. The signs and symptoms vary from patient to patient, with some patients having mild symptoms while others experience moderate to severe symptoms. The most common symptoms of endometriosis are pelvic pain, dysmenorrhea, and infertility. There is no guaranteed treatment for endometriosis at this time; however, with an early diagnosis and available medical and surgical options, healthcare providers can reduce the risks of potential complications and improve the quality of life for their patients. If we can identify or predict the probability of endometriosis onset by analyzing the medical history of diagnosed patients, the results might benefit both the healthcare providers' diagnostic process and patients' well-being and quality of life. In this study, the Logistic Regression (LR) and eXtreme Gradient Boosting (XGB) algorithms were used to predict endometriosis occurrence from the medical history of the diagnosed patients.
The remainder of the article is organized as follows: in Sect. 2, we briefly review the project objectives; in Sect. 3, we describe the methods used in data preparation, feature engineering, feature selection, and model training and validation; in Sect. 4, we present the model outputs and results; and in Sect. 5, we conclude the study with a summary of our findings.

Objectives
The following objectives will be addressed in this article:
i. Train machine learning algorithms to predict the likelihood of endometriosis.
ii. Identify the most significant medical events in the patient journey that lead to the diagnosis of endometriosis.
iii. Score the entire database using the best performing trained models.
iv. Profile patients using the predicted scores.

Methods Overview
The data source for this project is the healthcare claims patient level database, with a study time period from January 31, 2019 to December 31, 2019. Two patient cohorts, study target and control, were established using endometriosis ICD-10 diagnosis codes. As endometriosis is a female-only condition, female patients aged 18 and older formed the study target cohort. A control cohort is often used to create a patient sample to compare with the study target cohort and is selected using cohort matching algorithms. For both the study target and control cohorts, 36 months of patient medical history prior to the first disease event in 2019 were extracted. The healthcare claims patient level data include diagnosis codes, medical and surgical codes, and therapeutics and treatments prescribed, all at the transactional level.
A number of analytical methods were leveraged for the analysis, from rules-based patient qualification criteria to Machine Learning algorithms, to derive the probability of endometriosis onset. The following sub-sections of the article present a detailed explanation of each of the selected methods. The healthcare claims patient level dataset considered in the analysis is specific to the US healthcare market.

Healthcare claims patient level database
The healthcare claims patient level database is an anonymized longitudinal patient data set that can be used by organizations that are directly or indirectly associated with healthcare [9,41]. There has been increasing interest in patient-level data, as researchers, healthcare providers, and pharmaceutical companies are realizing the potential of creating better comparisons of effective treatment outcomes by analyzing longitudinal data that represent individual patient-based experiences and interactions with the US healthcare system [42].
The healthcare claims patient level database leveraged for this study consists of medical, hospital, and prescription claims across all payment types [10,44]. The database covers more than 317 million patients in the US, spans more than 17 years of medical health history, and includes more than 1.9 million healthcare providers [43]. Figure 1 presents a summary of the information in the database.

Cohort selection
For this study, we identified 314,101 confirmed endometriosis patients in 2019 in the healthcare claims patient level database, using predefined ICD-10 diagnosis codes (Table 1). Female patients aged 18 and above were selected for the study target cohort. For the control cohort, a random sample of 3 million female patients meeting the same age criterion was extracted from the database. To select from these 3 million patients a control cohort equal in size to the study target cohort, a technique known as 'propensity score matching' was used [18]. The propensity matching algorithm [19], a statistical technique, selects the control cohort based on characteristics, or covariates, similar to those observed in the study target cohort. The covariates considered for selection were patient age and medical history [20]. Table 2 presents the distribution comparison between the study target and control cohorts by age and Census geographies. The patient age variable was created by grouping ages into ranges, and US states were grouped into regions.
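The matching step can be illustrated with a minimal sketch. It assumes a propensity score estimated by a small logistic regression on two scaled covariates, followed by greedy 1:1 nearest-neighbour matching without replacement. The covariates, cohort sizes, and function names are hypothetical, and the exact matching variant used in the study is not specified in the text.

```python
import math
import random

def fit_logistic(X, y, lr=0.1, epochs=500):
    """Fit a logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def propensity(x, w, b):
    """Estimated probability that a patient with covariates x belongs to the target cohort."""
    return 1.0 / (1.0 + math.exp(-(sum(wj * xj for wj, xj in zip(w, x)) + b)))

def match_controls(target, pool, w, b):
    """Greedy 1:1 nearest-neighbour matching on propensity score, without replacement."""
    available = {i: propensity(x, w, b) for i, x in enumerate(pool)}
    matched = []
    for x in target:
        score = propensity(x, w, b)
        best = min(available, key=lambda i: abs(available[i] - score))
        matched.append(best)
        del available[best]  # each control patient is used at most once
    return matched

random.seed(0)
# Hypothetical covariates: [age scaled to 0-1, prior-claim count scaled to 0-1]
target = [[random.uniform(0.3, 0.6), random.uniform(0.4, 0.9)] for _ in range(20)]
pool = [[random.uniform(0.0, 1.0), random.uniform(0.0, 1.0)] for _ in range(200)]
w, b = fit_logistic(target + pool, [1] * len(target) + [0] * len(pool))
matched_idx = match_controls(target, pool, w, b)
```

The result is one distinct control index per target patient, so the control cohort has the same size as the target cohort.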

Data extraction
The next step in the analysis process was to extract the medical history of the patients from the healthcare claims patient level database. To ensure that only healthcare history data prior to the first condition event were extracted, the event date was established for each patient in the target cohort. For the control cohort, the first activity in 2019 was considered the event date.
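As a rough illustration of the event-date logic, the sketch below keeps only the claims that fall in the 36 calendar months before each patient's event date. The tuple layout, the patient IDs, and the choice to exclude the event date itself are assumptions made for this example, as is the code D_N80_0.

```python
import calendar
from datetime import date

def months_before(d, months):
    """Return the date `months` calendar months before `d`, clamping the day if needed."""
    total = d.year * 12 + (d.month - 1) - months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def lookback_claims(claims, event_dates, months=36):
    """Keep claims dated within `months` before (and excluding) each patient's event date."""
    kept = []
    for patient_id, claim_date, code in claims:
        event = event_dates.get(patient_id)
        if event is None:
            continue  # patient has no event date, so no window can be defined
        if months_before(event, months) <= claim_date < event:
            kept.append((patient_id, claim_date, code))
    return kept

# Hypothetical event dates and claims
event_dates = {"p1": date(2019, 6, 15), "p2": date(2019, 3, 1)}
claims = [
    ("p1", date(2018, 1, 10), "D_R10_2"),  # inside the window
    ("p1", date(2016, 6, 14), "D_N94_6"),  # one day before the window opens
    ("p1", date(2019, 6, 15), "D_N80_0"),  # the event itself, excluded
    ("p2", date(2016, 3, 1), "P_00840"),   # first day of the window
]
history = lookback_claims(claims, event_dates)
```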
Using the event date of each patient, 36 months of medical history data were extracted. The historical data captured all the medical events in the patient history, including diagnoses of comorbid conditions, medical and surgical procedures, therapeutics, and treatments prescribed to patients. Only the top 1,000 diagnosis codes, top 800 medical and surgical procedures, and top 500 prescribed drugs were considered for further analysis, as these top codes constituted more than 80% of the total data. A pivot table was created in which the transaction-level data were aggregated by anonymized patient ID. After the historical medical claims data were preprocessed for each cohort independently, the two datasets were integrated into a single data frame with more than 2,600 features. The dataset was then standardized and split into training and test sets using a 70:30 ratio [21]. The training set is used to identify the key features of endometriosis onset, while the test set is used to validate whether these features predict condition onset accurately [22]. Splitting the data into training and test sets helps to assess the model's performance and its ability to generalize to unseen data [23].
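The pivot and split steps can be sketched as follows, using only the Python standard library. The transactions are toy data, the function names are hypothetical, and standardization is omitted for brevity. In practice a data frame library would likely be used, but the logic is the same: keep the most frequent codes, count them per patient, then shuffle and split 70:30.

```python
import random
from collections import Counter, defaultdict

def pivot_top_codes(transactions, top_n):
    """Aggregate (patient_id, code) rows into per-patient counts over the top_n most frequent codes."""
    freq = Counter(code for _, code in transactions)
    top = [code for code, _ in freq.most_common(top_n)]
    keep = set(top)
    counts = defaultdict(Counter)
    for patient_id, code in transactions:
        if code in keep:
            counts[patient_id][code] += 1
    patients = sorted({pid for pid, _ in transactions})
    matrix = [[counts[pid][code] for code in top] for pid in patients]
    return patients, top, matrix

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle row indices with a fixed seed and split them into train and test parts."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(rows) * test_fraction)
    test = [rows[i] for i in idx[:n_test]]
    train = [rows[i] for i in idx[n_test:]]
    return train, test

# Toy transaction-level data: one row per (patient, code) claim
transactions = [
    ("p1", "D_R10_2"), ("p1", "D_R10_2"), ("p1", "P_00840"),
    ("p2", "D_R10_2"), ("p2", "R_ACETAMINOPHEN"),
    ("p3", "P_00840"),
]
patients, features, matrix = pivot_top_codes(transactions, top_n=2)
train, test = train_test_split(list(range(10)))
```

In this toy example the least frequent code (R_ACETAMINOPHEN) is dropped, and each patient becomes one row of counts over the two remaining codes.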

Machine Learning algorithms' overview
Machine Learning algorithms can be grouped into two categories: supervised and unsupervised learning.

3.4.a. Supervised learning algorithms
Supervised learning is the process of training machine learning algorithms to learn a mapping from the input space (X) to the output space (Y), i.e. Y = f(X) [25]. The major objective is to approximate the mapping function (f) so that when a new data point (x) is added, we can predict its outcome (y) [26]. Supervised learning algorithms are mainly used for classification and prediction problems [32]. The most popular supervised algorithms include logistic regression, decision trees (DTs), random forest (RF), extreme gradient boosting, support vector machines (SVMs), Naïve Bayes, adaptive boosting (AdaBoost), and artificial neural networks (ANNs) [31].

3.4.b. Unsupervised learning algorithms
Unsupervised learning algorithms, on the other hand, try to learn the hidden patterns within the input dataset (X) [28]. These models are called unsupervised because, unlike in supervised learning, there is no supervision to guide them [29]. The algorithms are left to their own devices to learn, discover, and showcase the patterns in the input data (X). They are highly popular for tasks such as discovering natural clusters, dimension reduction, and anomaly detection. k-Means clustering, principal component analysis (PCA), factor analysis (FA), singular value decomposition (SVD), and the apriori algorithm (association rules) are some popular examples of unsupervised learning algorithms [31].
Depending on the study objectives and the available data, algorithms are explored, tested for performance and data type fit, and selected accordingly. We framed endometriosis onset prediction as a supervised classification problem and selected Logistic Regression and XGB models to develop a highly predictive algorithm of the disease onset. SVM, RF, AdaBoost, ANN, and others were also explored as options for disease prediction; however, Logistic Regression and XGB were selected to predict the condition onset. Logistic Regression allows study of the odds of endometriosis occurrence for a given medical event [15], while XGB offers more flexibility in fine-tuning the hyperparameters than other tree-based algorithms [11].

Logistic Regression
Logistic Regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist [14,15]. Mathematically, a binary logistic model has a dependent variable with two possible values, where the two values are labeled "0" and "1" [33]. Outputs with more than two values are modeled by multinomial logistic regression.
Logistic Regression is used in various fields, including healthcare and social sciences [34].
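For reference, the binary logistic model described above can be written in its standard textbook form. The notation is ours rather than the article's: x_j denotes the j-th feature (for example, the count of a diagnosis code in the patient history) and the beta_j are the fitted coefficients.

```latex
P(Y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j} \beta_j x_j\right)}},
\qquad
\log \frac{P(Y = 1 \mid \mathbf{x})}{1 - P(Y = 1 \mid \mathbf{x})} = \beta_0 + \sum_{j} \beta_j x_j
```

Under this form, a one-unit increase in x_j multiplies the odds of the outcome by e^{\beta_j} with the other features held fixed, which is what makes the coefficients interpretable as odds ratios for individual medical events.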

eXtreme Gradient Boosting
Gradient boosting is a machine learning algorithm that builds an ensemble of weak prediction models, mostly decision trees [11]. An individual tree is a simple, often unreliable, model, but when multiple trees are grouped together, they can create a robust algorithm [12]. XGB starts by creating a first simple tree [35] and then progresses sequentially, building upon the weaker learners, with each iteration revising the previous tree until an optimal point is reached, such as the specified number of trees (estimators) used to build the solution [36].
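The sequential idea can be shown with a deliberately simplified sketch: plain gradient boosting with squared-error loss on one-split trees (stumps), where each new stump is fit to the residuals of the current ensemble. This is not the XGBoost implementation, which adds regularization and second-order gradient information. The data and the parameter names `n_estimators` and `learning_rate` are illustrative assumptions.

```python
def fit_stump(x, residuals):
    """Find the single-threshold split of a 1-D feature that minimises squared error."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda v: lm if v <= t else rm

def boost(x, y, n_estimators=20, learning_rate=0.3):
    """Fit stumps sequentially, each one to the residuals left by the ensemble so far."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps, history = [], []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
        # Track training mean squared error after each added stump
        history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))

    def predict(v):
        return base + learning_rate * sum(s(v) for s in stumps)

    return predict, history

# Toy data: the outcome becomes more likely as the feature value grows
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [0, 0, 0, 1, 0, 1, 1, 1]
predict, history = boost(x, y)
```

The training error recorded in `history` falls as stumps are added, which is the "building upon the weaker learners" behaviour described above.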

Chi-Square Test
The Chi-square test is one of the most widely used non-parametric tests [37], often utilized to test the independence between observed and expected frequencies of one or more attributes in a contingency table, popularly known as the 'goodness of fit' test [38]. In this work, the Chi-square test is used to identify the top significant features given the dependent variable (Y) [40].
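As an illustration of chi-square feature ranking, the sketch below computes the Pearson chi-square statistic of a 2x2 contingency table (feature present or absent versus target or control) and keeps the highest-scoring features. It assumes binary presence flags; library implementations, and the one used in the study, may compute the statistic on counts instead. The data and function names are hypothetical.

```python
def chi_square_2x2(feature, label):
    """Pearson chi-square statistic for a binary feature against a binary label."""
    n = len(label)
    observed = [[0, 0], [0, 0]]
    for f, y in zip(feature, label):
        observed[f][y] += 1
    stat = 0.0
    for i in (0, 1):
        for j in (0, 1):
            row_total = observed[i][0] + observed[i][1]
            col_total = observed[0][j] + observed[1][j]
            expected = row_total * col_total / n
            if expected > 0:
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat

def top_k_features(columns, label, k):
    """Rank features by chi-square statistic (largest first) and keep the top k."""
    scores = {name: chi_square_2x2(values, label) for name, values in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy data: 1 = target cohort, 0 = control cohort
label = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "SPCLT_FM": [1, 0, 1, 0, 1, 0, 1, 0],  # equally common in both cohorts
    "D_R10_2": [1, 1, 1, 0, 0, 0, 0, 0],   # concentrated in the target cohort
}
selected = top_k_features(columns, label, k=1)
```

A feature that is equally frequent in both cohorts scores zero, while one concentrated in the target cohort scores high and is retained.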
Logistic Regression, being the simplest of the machine learning algorithms, was selected as the base model for the analysis and used as a benchmark for the other models' performance. Both Logistic Regression and XGB models were trained, and the top 1,000 features from each algorithm were selected out of the more than 2,600 features used in the model runs. To decrease the number of data elements and to select only the variables most important for predicting the condition onset, we also used a Chi-Square test to identify the top 1,000 features. As a next step, the unique features from each model were utilized to train the final machine learning model to predict the endometriosis occurrence probability. Algorithms were trained in Python 3.5 using the 'scikit-learn' and 'xgboost' libraries.

Results
Section 1 of this work describes endometriosis and its associated signs and symptoms, such as 'painful periods', 'lower abdominal and pelvic pain', 'heavy bleeding during periods', 'pain during urination and bowel movement', 'constipation and diarrhea', 'infertility', 'painful sexual intercourse', etc. [16,17]. Identifying these prominent medical events from patients' medical history is the objective of this work. Hence, it is desirable to validate the model performance by analyzing whether the top features would help predict endometriosis onset. Table 4 presents the top features identified by the machine learning models that are directly or indirectly associated with endometriosis. Features such as 'non inflammatory disorder of uterus (D_N85_8)' and 'pelvic and perineal pain (D_R10_2)' are diagnosis codes associated with the risks and symptoms of endometriosis [45]. Procedure codes such as 'anesthesia of lower abdomen for laparoscopy (P_00840)' and 'vaginal hysterectomy including biopsy (P_00944)' are the top procedures often associated with the diagnosis as well as the treatment of endometriosis [45].
Furthermore, the machine learning models suggest that patients often consult with specialists including 'emergency medicine (SPCLT_EM)', 'family medicine (SPCLT_FM)', 'obstetrics and gynecology (SPCLT_OBG)' when experiencing related symptoms and gynecological issues. Overall, the machine learning models selected top features closely related to the onset of endometriosis, which implies that when tracking any of the features the condition onset could be diagnosed sooner.

Feature selection for market definition
The top features from all three algorithms that were specific to the target cohort were identified. These features proved important in diagnosing endometriosis and were selected as the patient scoring criteria. The therapeutics as well as medical and surgical procedure codes specific to endometriosis treatment, such as Orilissa, Marilissa, and Lupron Depot, were excluded. Around 9.5 million female patients aged 18 and above qualified for scoring.

Propensity model training and validation
Using the selected top features, the Logistic Regression and XGB models were re-trained. Because the number of features was reduced, we initially observed a drop in model performance. After several iterations and hyper-parameter tuning, the predictive power of XGB improved significantly compared to the previous iterations; however, we did not see any improvement in the Logistic Regression model results. Interestingly, both models were able to identify additional new features aligned with endometriosis (Table 5), and the re-trained models also identified all the top features discussed in Sect. 4.1. Table 6 shows that the XGB model performed better than the Logistic Regression model. Figure 3 shows the Receiver Operating Characteristic (ROC) curves on the test set for both re-trained models. The Area under the ROC Curve (AUC) values of the LR and XGB models on the test set were 0.87 and 0.96, respectively. Figure 4 suggests that the XGB model was able to differentiate target from control more accurately than the LR model. Hence, we used the XGB model to score the qualified patients.
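To make the AUC comparison concrete, the following sketch computes AUC directly from its rank interpretation: the probability that a randomly chosen target patient receives a higher score than a randomly chosen control patient, with ties counted as half. The labels and scores are toy values, not the study's data, and the function name is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Toy test set: 1 = target cohort, 0 = control cohort
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)  # 8 of the 9 pairs are ordered correctly
```

Read this way, an AUC of 0.96 means that a target patient outranks a control patient in about 96% of such pairs. The pairwise loop is quadratic and is meant for illustration only; a sorting-based computation would be needed at the scale of the study.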

Scoring qualified patients
The last step of the model evaluation is to score qualified patients to assess the model's ability to predict condition onset. A complete 36-month medical history was extracted for the 9.5 million qualified patients, including diagnosis codes, medical and surgical procedure codes, medications and treatments prescribed, as well as practitioners' therapy expertise and Board-Certified Specialty. After data pre-processing, the likelihood of endometriosis was predicted using the trained XGB model.
The probability distribution of the 9.5 million scored patients is shown in Fig. 5. We observed that most of the predicted probability values are concentrated near either 0 or 1. Considering 0.5 as the threshold, the XGB model suggests that around 36% of the scored patients are likely to be diagnosed with endometriosis sometime in the future. If practitioners can leverage the significant variables in diagnosing the condition onset, they can give these patients special medical care and advice in time, thereby reducing the risks of endometriosis and its related complications.
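The thresholding step amounts to the small calculation sketched below. The probabilities are invented for illustration, and treating a score exactly equal to the threshold as flagged is an assumption. The histogram helper mirrors the kind of binned distribution a figure such as Fig. 5 would summarize.

```python
def flag_share(probabilities, threshold=0.5):
    """Fraction of patients whose predicted probability is at or above the threshold."""
    return sum(p >= threshold for p in probabilities) / len(probabilities)

def histogram(probabilities, bins=10):
    """Count probabilities in equal-width bins over [0, 1]; 1.0 falls in the last bin."""
    counts = [0] * bins
    for p in probabilities:
        counts[min(int(p * bins), bins - 1)] += 1
    return counts

# Invented scores, concentrated near 0 and 1 as described in the text
probs = [0.02, 0.05, 0.10, 0.48, 0.55, 0.93, 0.97, 0.99, 0.01, 0.03]
share = flag_share(probs)  # 4 of the 10 scores are at or above 0.5
counts = histogram(probs)
```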

Discussion
Overall, the machine learning models have identified top features that can explain endometriosis onset in advance. As noted in Tables 4 and 5 of the Results section (Sect. 4), these features include diagnosis codes, medical and surgical procedure codes, as well as physician specialties that often support patients through their healthcare journey.
For the preliminary Logistic Regression, XGB, and Chi-Square runs, as noted in Table 4, the following top variables were identified as important in predicting the condition onset: 1) diagnosis codes such as 'non inflammatory disorder of uterus (D_N85_8)', 'dysmenorrhea (D_N94_6)', 'pelvic and perineal pain (D_R10_2)', and 'unspecified condition associated with female genital organs and menstrual cycle (D_N94_9)', which clearly show association with the risks and symptoms of endometriosis [45]; 2) medical and surgical procedure codes such as 'anesthesia of lower abdomen for laparoscopy (P_00840)', 'vaginal hysterectomy including biopsy (P_00944)', 'cystourethroscopy (P_52000)', and 'laparoscopy, surgical with fulguration or excision of lesions of the ovary, peritoneal surface (P_58662)', which are associated with the diagnosis as well as the treatment of endometriosis [45].
From the patient medical journey and healthcare access side, the machine learning models suggest that patients often consult with specialists, including 'emergency medicine (SPCLT_EM)', 'family medicine (SPCLT_FM)', and 'obstetrics and gynecology (SPCLT_OBG)', when experiencing endometriosis-related symptoms and gynecological issues. Patients with a history of endometriosis or untreated endometriosis are at a higher risk of developing either ovarian cancer or 'endometriosis associated adenocarcinoma,' which can also serve as an indicator of potential occurrence of the condition [52,53,54]. The machine learning models selected 'hematology/oncology (SPCLT_HO)' as one of the top healthcare provider specialties. This finding suggests that if a patient has any of the signs and symptoms noted above, a consultation with an oncologist is recommended [55,56]. Overall, the machine learning models selected top features directly related to the onset of endometriosis, which implies that when tracking any of the features the condition onset could be diagnosed sooner.
As noted in Table 5 above, the Logistic Regression and XGB models identified additional features that are important in predicting the likelihood of endometriosis. The models suggest that features like 'submucous leiomyoma of uterus (D_D25_0)', 'ovarian cyst (D_N83_291)', 'hypertrophy of uterus (D_N85_2)', 'excessive bleeding in the premenopausal period (D_N92_4)', 'deep dyspareunia (D_N94_5)', 'female infertility associated with anovulation (D_N97_0)', 'premenstrual tension syndrome (D_94_3)', 'hormone replacement therapy (D_Z79_890)', and 'family history of malignant neoplasm of ovary' are highly significant in predicting the likelihood of endometriosis. Several articles also support the models' findings that fibroids, ovarian cysts, infertility, menstrual period complications, family history of neoplasm of ovary, hormone therapy, etc. have a strong association with endometriosis [48]. Recent clinical research also supports that women of reproductive age with 'chronic stress' are at a higher risk of developing endometriosis [47].
The machine learning models have also identified the drugs Acetaminophen (R_ACETAMINOPHEN), Megestrol acetate (R_MEGESTROL_ACETATE), and Lidocaine HCl (R_LIDOCAINE_HCL) as strong predictors of endometriosis; these drugs are often prescribed, respectively, as an analgesic, for birth control and treatment of endometrial cancer, and to numb the skin or muscles. Furthermore, features such as 'submucous leiomyoma of uterus (D_D25_0)' and 'hypertrophy of uterus (D_N85_2)' are significant predictors [49,50] of the disease onset; however, more clinical research is needed to support this statement, as these conditions have symptoms similar to those of endometriosis, but patients with them are less likely to develop endometriosis [51].
Overall, the top data elements present the key features that should be considered when diagnosing endometriosis in adult women in order to decrease the time to diagnosis. As noted in Sect. 4.4 of the article, when these variables are used in the diagnostic process, we can predict the condition onset with high accuracy and differentiate accurately between patients with and without the disease.

Conclusions
In this article, we validated the crucial role of AI and ML in disease diagnosis, prediction, and forecasting. We analyzed the medical history of patients with endometriosis using machine learning algorithms and re-trained an XGB model on selected important features, which was then applied to predict the likelihood of endometriosis occurrence in the adult female population. Early prediction of the disease can offer an opportunity for patients to receive needed medical treatment earlier in the patient journey. Creating a typing tool that can be integrated into Electronic Health Records (EHR) systems and easily accessed by healthcare providers could further aid the objective of improving diagnosis activities and inform the diagnostic processes that would result in timely and precise diagnosis, ultimately improving patient care and quality of life. In our future work, we plan to explore advanced deep learning algorithms to further enhance the model performance and increase the accuracy of the machine learning models in predicting the likelihood of the disease onset.