As artificial intelligence (AI) continues to develop, its applications in medicine are rapidly evolving. AI is an umbrella term describing several methods for processing information1. Machine learning, a form of AI useful for predicting outcomes, has been the subject of a series of medical publications in recent years2-4. The ability to collect vast amounts of data via electronic medical records (EMR), combined with advanced computer processing, has facilitated this work5. Although machine learning can address many types of problems, including regression, clustering, and reinforcement learning, classification of an outcome is one of its most common medical uses. Machine learning classifiers could be helpful in managing the COVID-19 pandemic by predicting surge capacity for a geographically heterogeneous intensive care unit (ICU) workforce or by directing therapeutics to the patients most likely to benefit in resource-depleted hospitals, given variable patient response6-9.
Machine learning classification uses one or more independent variables to assign a class or label to a dependent variable, such as death or survival in the case of a binary predictor10. As with a clinical prediction tool, a rule or solution is derived from a set of examples (patients) and then tested on a validation group. Propensity score matching is constructed in a similar way to a machine learning classifier11. In ICU medicine, the most common algorithms used to make predictions are linear and logistic regression. These are the basis for Acute Physiology and Chronic Health Evaluation (APACHE), Simplified Acute Physiology Score (SAPS), and Sequential Organ Failure Assessment (SOFA) scoring. Other algorithms, including support vector machines (SVM), decision trees, and ensembles of multiple algorithms using the same independent variables, can be used to derive this decision rule4. APACHE, SOFA, and SAPS are excellent predictors of mortality but exclude potentially useful information, including radiology, pharmacy, and clinical impression. To capture textual information contained within the chart and convert it into useful predictive information, natural language processing (NLP) can be used.
NLP is a machine learning technique in which a body of text (corpus) is deconstructed into single-word or multiple-word units (tokens) that serve as independent variables for predicting some dependent variable. As in logistic or linear regression, the words are converted to data points from which a prediction, such as mortality, is made. NLP in conjunction with basic physiologic parameters has recently been used to devise an ICU risk prediction system with very good performance3. These authors used a database of over 100,000 patients and logistic regression to arrive at the risk predictor.
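The deconstruction of a corpus into tokens can be illustrated with a minimal sketch. It is not the code used in this study; the function names (tokenize, build_vocabulary, vectorize) and the two example notes are hypothetical and invented purely for illustration. It assumes a simple bag-of-words representation in which each single word and each adjacent word pair becomes one column of counts.

```python
import re
from collections import Counter


def tokenize(note, max_n=2):
    """Lowercase a note and return its word n-grams of length 1..max_n."""
    words = re.findall(r"[a-z0-9]+", note.lower())
    tokens = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


def build_vocabulary(corpus, max_n=2):
    """Map every distinct token in the corpus to a column index."""
    vocab = sorted({tok for note in corpus for tok in tokenize(note, max_n)})
    return {tok: idx for idx, tok in enumerate(vocab)}


def vectorize(note, vocabulary, max_n=2):
    """Turn one note into a row of token counts (the independent variables)."""
    counts = Counter(tokenize(note, max_n))
    row = [0] * len(vocabulary)
    for tok, count in counts.items():
        if tok in vocabulary:  # tokens unseen during derivation are ignored
            row[vocabulary[tok]] = count
    return row


# Hypothetical chart notes, invented for illustration only.
corpus = [
    "Patient intubated for respiratory failure",
    "Patient extubated, breathing comfortably",
]
vocabulary = build_vocabulary(corpus)
features = [vectorize(note, vocabulary) for note in corpus]
```

Each row of features can then be passed to a classifier, alongside physiologic variables, in the same way as any other set of numeric predictors.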
Machine learning classification performance tends to improve as more examples become available for deriving the clinical prediction rule. As such, large, organized databases are typically necessary to arrive at meaningful results. More commonly, only smaller amounts of patient data are available for analysis, hindering progress. In this study, we constructed a hospital risk predictor from limited physiologic variables and chart notes, using XGBoost tree, logistic regression, and SVM algorithms, in a relatively limited dataset of ICU patients.
Availability of Data and Materials:
The datasets generated and analyzed in the current study are from the publicly available Medical Information Mart for Intensive Care – III (MIMIC-III) database5. This de-identified relational database houses physiologic parameters and clinical data for over 50,000 patients from 2001 to 2012 at Beth Israel Deaconess Hospital, in collaboration with Massachusetts Institute of Technology. Both organizations’ institutional review boards (IRBs) approved a waiver of consent for this de-identified dataset12. New York Medical College also deemed this work to be IRB exempt. This study was carried out in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement13. All methods were performed in accordance with the relevant guidelines and regulations. The specific code, with details of hyperparameter tuning, is available on GitHub (https://github.com/Cobritra/NLPpublication.git).