Explainable Machine Learning Prediction for Mortality of COVID-19 in the Colombian Population

The COVID-19 pandemic, which began in late 2019, has become a global public health problem, resulting in large numbers of people infected and dead. One of the greatest challenges in dealing with the disease is to identify the people who are most at risk of becoming infected, seriously ill and dying from the virus, so that they can be isolated in a targeted manner and mortality rates can thus be reduced. This article proposes the use of machine learning, specifically neural networks and random forest, to build two complementary models that estimate the probability that a person will die from COVID-19. The models are trained with the demographic information and medical history of two population groups: on the one hand, 43,000 people who died from COVID-19 in Colombia during 2020, and on the other hand, a random sample of 43,000 people who became ill with COVID-19 during the same period but later recovered. After training, the neural network classification model reached an accuracy of 88%. However, transparency is a major requirement for the explainability of COVID-19 prognosis. Therefore, a complementary random forest model is trained that allows the identification of the most significant predictors of mortality from COVID-19.


Introduction
Artificial intelligence has recently been used in medicine for the detection and treatment of diseases. One of the most successful fields of application of artificial intelligence in disease detection has been machine learning. Machine learning consists of the use of computer algorithms to train mathematical and statistical models from large volumes of data that can include diagnostic images, laboratory tests and medical records. Through this training, the models can identify patterns in the data, which can then be applied to analyze new data sets. For example, a machine learning model can be trained with a large set of patient cell images, telling the algorithm which ones belong to cancer patients and which ones do not. After training, the model is expected to be able to identify from a new image of the cells whether the patient has cancer or not1. Some machine learning algorithms such as SVM, random forest (RF) and neural networks (NN)2-4 have been successfully used in the diagnosis of diseases such as diabetes5,6, Alzheimer's7, heart disease8, cancer9,10, liver cirrhosis11 and chronic kidney disease12-14, among others15-19.
The COVID-19 pandemic, which originated in China in late 2019 and is caused by a coronavirus similar to those that cause the common cold and severe acute respiratory syndrome (SARS), has become a major public health problem; by the end of 2020 it had infected more than 82.7 million people worldwide and caused more than 1.8 million deaths20. The general isolation measures taken by governments worldwide have not been sufficient to contain the advance of the disease and have had a negative impact on the local and global economy. Figure 1 graphically presents the growth in the number of daily cases of COVID-19 worldwide20. One of the great challenges in dealing with the disease is to identify the people who are most at risk of becoming infected, seriously ill and dying from the virus, in order to isolate them selectively and thus reduce mortality rates. This article proposes the use of machine learning, specifically neural networks, to build a model that estimates the probability that a person will die from COVID-19. The model takes as input the demographic information and the history of diagnosed diseases of the individuals, coded according to the International Classification of Diseases (ICD). Previous works21 have presented neural network models capable of predicting the risk of pathologies such as chronic kidney disease from a similar set of data22-25. These models have obtained accuracy values greater than 95%. Other works related to the use of machine learning in the diagnosis of COVID-19 propose models for the detection of new cases26,27, the use of neural networks for case detection from chest X-rays28-30, the use of models to identify factors that influence patient mortality31,32, as well as to identify other environmental factors that may affect the spread of the epidemic33.
Despite the strong performance of neural networks, they work as black-box systems and their effectiveness is limited by their inability to explain their predictions to experts. The problem of explainability in artificial intelligence is not new34, but the rise of machine learning as a very successful classification technique has created the need to understand how these systems make a prediction, in order to increase their reliability and users' trust in them35. Therefore, this paper proposes an alternative prediction model using random forests that enables the identification of the most significant predictors of mortality from COVID-19.

Results
This section evaluates the two previously trained machine learning models. For this, a set of classifier evaluation metrics derived from the confusion matrix was applied.

Neural network evaluation
The objective of metrics for the evaluation of classification algorithms is to identify the predictive ability of the model. As can be seen in the confusion matrix (Figure 2), the network correctly classified 88% of the cases and failed in 12%. Most of these errors are false positives, that is, cases in which the model predicts that the person will die but he or she recovers.
From the confusion matrix, the set of metrics presented in Table 1 can be obtained. These metrics confirm, on the one hand, the predictive ability of the classifier, as measured by the accuracy metric, and on the other hand, a tendency of the model to produce false positives. For binary classification algorithms, a metric known as area under the curve (AUC) can also be used37. This metric summarizes the balance between detecting true positives and avoiding false positives. It is computed from the receiver operating characteristic (ROC) curve, which plots the true positive rate on the y-axis against the false positive rate on the x-axis. Figure 3 shows the ROC curve obtained along with its AUC value.
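To illustrate how an ROC curve and its AUC value are obtained from a classifier's scores, the following minimal sketch sweeps the decision threshold and integrates the resulting curve with the trapezoidal rule. The function names `roc_points` and `auc`, as well as the labels and scores, are hypothetical and are not taken from the study.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Rank the examples from the highest to the lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = died, 0 = recovered) and predicted probabilities.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(y_true, scores)
area = auc(curve)
print(f"AUC = {area:.3f}")
```

An AUC of 1.0 corresponds to a classifier that ranks every positive example above every negative one, while 0.5 corresponds to random ranking.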

Comparison with random forest model
Table 2 presents a comparison of the main metrics obtained after applying the neural network and random forest models. The table shows that the best classifier is the neural network, considering an accuracy value of 88%.
The sensitivity metric, which measures the proportion of positive examples correctly classified, shows a better performance of the neural network than of random forest. The same occurs for specificity, where the neural network obtains a higher value than the random forest model. The precision metric indicates the proportion of examples predicted as positive that are truly positive, and the recall metric measures how complete the results are and is equivalent to the sensitivity of the model. The F-measure is a balance between precision and recall and summarizes the performance of a classification algorithm in a single metric. Finally, the area under the curve (AUC) of the neural network model again exceeds that of random forest, reflecting its better balance between identifying true positives and avoiding false positives.
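The metrics discussed above can all be derived from the four cells of a binary confusion matrix. The sketch below shows the standard formulas; the function name `classification_metrics` and the counts used in the example are hypothetical and do not reproduce the values reported in Tables 1 and 2.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common evaluation metrics from a binary confusion matrix."""
    sensitivity = tp / (tp + fn)   # also called recall or true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)     # share of predicted positives that are correct
    # F-measure: harmonic mean of precision and recall.
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
    }


# Hypothetical counts chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=90, fp=15, tn=85, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```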

Significant predictors
A very important feature of random forest is that, although it is a black-box model, it is possible to know the importance that the algorithm gives to each input variable. Importance measures the impact that each variable has on the final prediction of the model. In the case of the model trained with the demographic and health care data, the distribution of values indicated in Table 3 was obtained. This table lists the 10 variables with the greatest importance for the model. In first place is the variable Age, followed by a set of diagnoses among which are hypertension (I10), diabetes mellitus (E10.9, E11.9 and E10.8), obesity (E66.9 and E66.0), chronic obstructive pulmonary disease (J44.9) and chronic renal disease (N18.9). These diagnoses coincide with the medical theory that points to these variables as the main risk factors for death from COVID-1938. Other diagnoses such as prostate hyperplasia (N40) may be associated with aging, considering that the main risk factor is the age of the patients.

Discussion
The neural network and the model trained with random forest were able to identify patients at risk of dying from COVID-19, with accuracy values of 88% and 87%, respectively. This demonstrates the predictive capacity of both models and their effectiveness in identifying patterns in large data sets, as well as their application in the early detection of diseases and other health risk conditions. Future work includes applying the model to the total population of Colombia to identify the people with the highest level of risk within the health system and to take the necessary measures for their protection. It is also planned to create a web application that, applying the trained model, allows people to consult their individual risk level.

Methods
This section describes the techniques used, following the CRISP-DM methodology, to perform the data capture and treatment, training and optimization of the machine learning models built with the aim of predicting the risk of death from COVID-19. The CRISP-DM methodology is considered a standard for the life cycle of data analysis projects and includes descriptions of the basic phases of a project, the tasks required in each phase and the relationships between the tasks. Among its advantages over other methodologies is its flexibility, since it can be easily adapted to any data exploitation project, such as machine learning39. A general overview of this methodology is presented in Fig. 4.

Data collection
The data set required for the creation of the machine learning models was obtained from the SEGCOVID and RIPS databases of the Colombian Ministry of Health and Social Protection. SEGCOVID is a web application that records the follow-up of suspected and confirmed cases of COVID-1940. The RIPS database (Registro Individual de Prestación de Servicios de Salud) contains information on medical care provided to all members of the health system in Colombia since 200941. Two samples were taken from the population: one corresponding to 43,000 people who died in Colombia from COVID-19 during 2020, and the other to a group of 43,000 people who fell ill from COVID-19 during the same period but subsequently recovered. The next step was to integrate the two groups of people in the same table called "Patients", adding for each record the fields: ID or unique anonymized identifier of the person, sex, age, ethnicity, place of residence, and the label "Dead", which indicates with a "1" that the person belongs to the group of deceased or with a "0" that he or she belongs to the group of recovered patients. Figure 5 shows some example data taken from the "Patients" table. The total number of records in this table is 86,000. Finally, the "Diagnoses" table was created, in which all the diagnoses identified in the RIPS database were added for each person in the "Patients" table, as well as the number of times medical care was provided for each diagnosis. This set of diagnoses included the diseases detected in these individuals up to December 31, 2020. This table has 1,076,718 records, corresponding to 86,000 persons and 8,111 diagnoses of diseases. Figure 6 shows some example data taken from the "Diagnoses" table.

Data cleaning and preparation
This process transforms the data set obtained from the database and generates a new data model with which to train and test the neural network. This transformation includes feature selection operations, table joining, row-to-column transformation, categorical variable handling, null value treatment and other cleaning operations required to obtain the final dataset.
From the initial exploration of the data, several variables were considered for the data set; however, they were not included due to a lack of data, as in the case of laboratory test results and family history, or due to data quality problems, as in the case of variables related to people's weight and height. Finally, the following variables were
To create the data model required for training the neural network, it is necessary to have all of a patient's information in the same record. For this reason, it was necessary to perform a transformation from rows to columns so that all the pathologies that a patient has are in the same row. Once this transformation is executed, a single dataframe is generated containing a record for each of the 86,000 persons in the sample, with 8,112 columns: one for each possible disease within the database, in addition to the person's unique identifier. The final step is to fill in the blank values with zeros. This occurs in cases where the person does not have a given disease.
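The row-to-column transformation described above can be sketched as follows. This is a minimal illustration using only the Python standard library rather than the dataframe operations used in the study; the function name `pivot_diagnoses`, the column name `ID` and the sample rows are hypothetical. The sketch also assumes that each cell stores the number of care events recorded for the diagnosis, whereas the source does not state whether counts or a 0/1 flag were kept.

```python
from collections import defaultdict

# Hypothetical "Diagnoses" rows: (person ID, ICD code, number of care events).
diagnosis_rows = [
    (1, "I10", 4),
    (1, "E11.9", 2),
    (2, "J44.9", 1),
    (3, "I10", 3),
]
person_ids = [1, 2, 3, 4]  # person 4 has no recorded diagnosis


def pivot_diagnoses(rows, ids):
    """Turn one row per (person, diagnosis) into one row per person."""
    codes = sorted({code for _, code, _ in rows})
    counts = defaultdict(dict)
    for pid, code, n in rows:
        counts[pid][code] = counts[pid].get(code, 0) + n
    # One record per person; diagnoses the person does not have become 0.
    table = [
        {"ID": pid, **{code: counts[pid].get(code, 0) for code in codes}}
        for pid in ids
    ]
    return codes, table


codes, table = pivot_diagnoses(diagnosis_rows, person_ids)
for record in table:
    print(record)
```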

Categorical variables contain descriptions within a set of finite elements that cannot be converted into numerical values. This is the case with the variables sex, ethnicity and department in the table "Patients". For these cases it is necessary to create as many columns as there are possible domain values in each variable and fill them in with values "1" or "0" depending on the option that applies to each person.
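The handling of categorical variables described above is commonly called one-hot encoding. The following sketch illustrates it with the standard library only; the function name `one_hot`, the generated column names and the sample records are hypothetical.

```python
def one_hot(records, column):
    """Replace a categorical column with one 0/1 column per domain value."""
    values = sorted({record[column] for record in records})
    encoded = []
    for record in records:
        # Copy every field except the categorical one being encoded.
        row = {key: value for key, value in record.items() if key != column}
        for value in values:
            row[f"{column}_{value}"] = 1 if record[column] == value else 0
        encoded.append(row)
    return encoded


# Hypothetical "Patients" records.
patients = [
    {"ID": 1, "sex": "F", "age": 71},
    {"ID": 2, "sex": "M", "age": 54},
]
encoded = one_hot(patients, "sex")
print(encoded)
```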
A final step required for the creation of the final data set consists of joining the two tables worked on so far: the "Patients" and the "Diagnoses" tables. Each table contains only one record for each person, so the process consists of joining the row of the "Patients" table with the row of the "Diagnoses" table for the same person. Figure 7 shows graphically the creation of the final dataset.
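Because each table holds exactly one record per person, the join is a one-to-one match on the person identifier. A minimal sketch follows, assuming a key column named `ID`; the function name `join_on_id` and the sample records are hypothetical.

```python
def join_on_id(patients, diagnoses):
    """One-to-one join of two tables that share the key column "ID"."""
    diagnoses_by_id = {row["ID"]: row for row in diagnoses}
    joined = []
    for patient in patients:
        merged = dict(patient)
        diagnosis_row = diagnoses_by_id[patient["ID"]]
        # Append the diagnosis columns, skipping the duplicated key.
        merged.update({k: v for k, v in diagnosis_row.items() if k != "ID"})
        joined.append(merged)
    return joined


# Hypothetical records from the two prepared tables.
patients = [{"ID": 1, "age": 71, "Dead": 1}, {"ID": 2, "age": 54, "Dead": 0}]
diagnoses = [{"ID": 2, "I10": 0, "J44.9": 1}, {"ID": 1, "I10": 4, "J44.9": 0}]

final_dataset = join_on_id(patients, diagnoses)
print(final_dataset)
```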

Training and test datasets
Once the data was cleaned and prepared, it was used to train the neural network. At this stage of the process it is necessary to separate the data set into three subsets: training, validation, and testing. The training data set is composed of those examples that will allow the model to learn from the characteristics and identify the patterns hidden in the data. The validation set makes it possible to identify whether the model is learning correctly as it is being trained. This is important considering that machine learning models can be overfitted. Before proceeding with the creation of the datasets, three additional operations are required. The first consists of a review of the variables used for training the model. The set of variables up to this moment is composed of 8,152 variables obtained after applying data cleaning and preparation processes. As a result of this review it was identified that the ID variable is not relevant for the training of the model.
Once the column is removed, the resulting data set is composed of 86,000 records and 8,151 variables. The following operation consists of separating the data into two different structures that within the Python language are known as dataframes. On the one hand, we have the X dataframe, which contains the characteristics or input variables of the model, and on the other hand we have the Y dataframe, which contains the class or output variable of the model. The third operation is to normalize the data. Each variable of the X dataframe contains a range of values different from the other variables or characteristics. This can be a problem for training the model because variables with higher values may have more weight than the others. To solve this situation, a process of data normalization is performed, so that all variables are in a range of values between 0 and 1. Finally, the separation of the training and test data sets is performed using the train_test_split function, belonging to the scikit-learn library. In this case 30% of the data was used for testing and the remaining 70% of the examples were used to train the model. The validation data set will be obtained from the training set during the construction of the neural network.
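The normalization and the 70/30 separation described above can be sketched as follows. The study used the train_test_split function of scikit-learn; the code below is a standard-library stand-in written only for illustration, and the function names `min_max_scale` and `split_train_test`, the seed and the sample values are hypothetical.

```python
import random


def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - lo) / (hi - lo) for value in column]


def split_train_test(rows, test_fraction=0.3, seed=42):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical ages and row indices.
scaled_ages = min_max_scale([20, 50, 80])
rows = list(range(10))
train, test = split_train_test(rows)
print(scaled_ages, len(train), len(test))
```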

Definition of the neural network topology
The topology or architecture of the network defines the number of layers, as well as the number of neurons per layer and how they are connected. The topology of a network is directly related to the complexity of the tasks that can be learned by it42. Generally, networks with a greater number of layers and neurons can identify more complex patterns, although they consume a greater amount of computational resources, especially processing capacity and memory space. The proposed neural network is composed of 5 layers according to the diagram in Fig. 8. The input layer corresponds to the characteristics or input variables of the network, composed of 8,150 nodes or neurons. The following 3 layers correspond to the hidden layers of the model and contain 500, 100 and 50 neurons, respectively. The last layer corresponds to the neuron that represents the only class of the binary classification problem. The objective of the training is to obtain the values corresponding to the optimal weights (W) for each layer of the network, as well as the bias values43.
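To give a sense of the size of this topology, the following sketch counts the weights and biases to be learned. It assumes that consecutive layers are fully connected, which the text does not state explicitly; the function name `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of each fully connected layer."""
    per_layer = []
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        # n_in * n_out weights and one bias per output neuron.
        per_layer.append(n_in * n_out + n_out)
    return per_layer


# Input layer, three hidden layers and the output neuron, as described above.
layer_sizes = [8150, 500, 100, 50, 1]
per_layer = count_parameters(layer_sizes)
total_parameters = sum(per_layer)
print(per_layer, total_parameters)
```

Under this assumption, almost all of the parameters sit in the first hidden layer, because of the large number of input variables.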

Model training
The initial model of the neural network was implemented using the Keras library44 and the TensorFlow framework45 in Python. This library facilitates the creation and evaluation of classifiers with neural networks. The Sequential class allows new layers to be added to the network, indicating for each one the number of neurons, as well as the activation function to be implemented. Once the model was defined, it was compiled using an optimization algorithm. The use of the gradient descent algorithm is common, although in practice a variation of this algorithm known as stochastic gradient descent (SGD) is used, which is less computationally costly42. Although stochastic gradient descent is an algorithm that works well in general, it has several problems that sometimes make it difficult to train neural networks. One of the algorithms that has been created to solve these problems is Adam46. This algorithm combines techniques taken from other methods such as RMSProp47 and SGD with momentum to improve the speed with which it converges in the search for an optimal solution. Adam also reduces the probability that the algorithm stops in an intermediate position and is not able to advance in the search for a local or global minimum. For these reasons Adam was used for the training of the neural network proposed in this work.
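The way Adam combines a momentum term with an RMSProp-style scaling of the gradient can be illustrated on a one-dimensional problem. The sketch below is not the Keras implementation used in the study; the function name `adam_minimize`, the toy objective and the hyperparameter values are assumptions chosen for illustration.

```python
import math


def adam_minimize(grad, w, steps=2000, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a one-dimensional function given its gradient, using Adam."""
    m = 0.0  # running mean of the gradients (momentum term)
    v = 0.0  # running mean of the squared gradients (RMSProp-style term)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction compensates for m and v starting at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Toy objective f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = adam_minimize(lambda w: 2 * (w - 3.0), w=0.0)
print(w_final)
```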
One of the most critical elements for neural network training is the activation function. The activation function is the mechanism by which artificial neurons process information and propagate it through the network. The activation function chosen for the neural network training was the rectified linear unit (ReLU). This is the most widely used activation unit in practice today48.
Neural networks are models with great power of representation. The large number of parameters and layers added to a neural network makes it easy for the network to learn the training data too well. Sometimes, the network may even memorize the training data and classify them perfectly. However, when this happens, the model loses its ability to generalize and, when evaluated on data not seen during training, shows a low predictive capacity42. This phenomenon, known as overfitting, is common in machine learning: the algorithm models the training data too well while losing the ability to generalize to unseen data. Regularization techniques try to solve this problem.
Dropout is a relatively recent regularization technique that has been widely adopted49. During training it randomly deactivates a fraction of the neurons, which discourages the network from memorizing the training data. In this model, dropout was applied to the output of each layer of the network, with a rate hyperparameter equal to 0.5. After training the neural network for 10 iterations or epochs, it was found that the model converged in the third iteration and obtained an accuracy of 88% on both the training and the validation datasets. Figure 9 shows the point of convergence of the model, as well as a possible overfitting that occurs if training continues beyond the third iteration.

Figure 1
Daily growth in the number of COVID-19 cases worldwide20.
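The ReLU activation and the dropout regularization described in the model training section can be sketched in a few lines. This is an illustration rather than the Keras layers used in the study; it assumes the "inverted" dropout variant, in which the surviving activations are rescaled during training so that nothing changes at prediction time. The function names and sample values are hypothetical.

```python
import random


def relu(values):
    """Rectified linear unit: negative inputs become zero."""
    return [max(0.0, value) for value in values]


def dropout(values, rate=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        # At prediction time every unit is kept unchanged.
        return list(values)
    rng = rng or random.Random()
    keep = 1.0 - rate
    return [value / keep if rng.random() < keep else 0.0 for value in values]


# Hypothetical pre-activation values of one layer.
pre_activations = [-1.5, 0.0, 2.0, 3.0]
activated = relu(pre_activations)
dropped = dropout(activated, rate=0.5, rng=random.Random(0))
inference = dropout(activated, rate=0.5, training=False)
print(activated, dropped, inference)
```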

Figure 2
Confusion matrix of the neural network model.

Covid-19 Data Set.

Figure 8
Topology of the neural network with 5 layers.