This section describes the techniques used, following the CRISP-DM methodology, for data capture and preparation and for the training and optimization of the machine learning models built to predict the risk of death from COVID-19. The CRISP-DM methodology is considered a standard for the data analysis project life cycle and includes descriptions of the basic project phases, the tasks required in each phase and the relationships between those tasks. Among its advantages over other methodologies is its flexibility, since it can be easily adapted to any data exploitation project, including machine learning [43]. A general overview of this methodology is presented in Figure 4.
The dataset required for creating the machine learning models was obtained from the SEGCOVID and RIPS databases of the Colombian Ministry of Health and Social Protection. SEGCOVID is a web application that records the follow-up of suspected and confirmed cases of COVID-19 [44]. The RIPS database (Registro Individual de Prestación de Servicios de Salud) contains information on the medical care provided to all members of the health system in Colombia since 2009 [45]. Two samples were taken from the population: one of 43,000 people who died in Colombia from COVID-19 during 2020, and another of 43,000 people who fell ill with COVID-19 during the same period but subsequently recovered. The next step was to integrate the two groups into a single table called "Patients", adding for each record the following fields: ID (a unique anonymized identifier of the person), sex, age, ethnicity, place of residence, and the label "Dead", which is "1" if the person belongs to the group of deceased patients and "0" if he or she belongs to the group of recovered patients. Figure 5 shows some example data taken from the "Patients" table. The total number of records in this table was 86,000.
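As an illustration, this integration step can be expressed in a few lines of pandas. The following is a minimal sketch, assuming each sample has already been extracted into its own dataframe with the demographic fields listed above; the file and column names are hypothetical.

    import pandas as pd

    # Hypothetical extracts: one file per sample, already anonymized
    deceased = pd.read_csv("deceased_2020.csv")    # 43,000 records
    recovered = pd.read_csv("recovered_2020.csv")  # 43,000 records

    # Label each group and integrate both into a single "Patients" table
    deceased["Dead"] = 1
    recovered["Dead"] = 0
    patients = pd.concat([deceased, recovered], ignore_index=True)  # 86,000 rows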
Finally, the "Diagnoses" table was created, in which all the diagnoses identified in the RIPS database were added for each person in the "Patients" table, as well as the number of medical treatments provided for this diagnosis. This set of diagnoses included the diseases detected in these individuals up to December 31, 2020. This table has 1,076,718 records, corresponding to 86,000 persons and 8,111 diagnoses of diseases. Figure 6 shows some example data taken from the “Diagnoses” table.
The database is privately owned by the government of Colombia; nevertheless, all methods were carried out in accordance with the relevant guidelines and regulations. The Colombian committee of scientific medicine approved the present research. The information used for the construction of the machine learning models was obtained from the databases of the Ministry of Health and Social Protection of the Republic of Colombia. The authors are authorized to access this information, as well as to generate statistics and analytical models based on it. To guarantee the privacy of personal information, the data were anonymized, and no authorization from patients is required for their statistical and scientific use. Providing access to public information is a service that Colombian state entities offer to citizens in order to encourage its use and exploitation. As a general rule enshrined in the Constitution, all citizens of Colombia have access to public information, except in the cases established by law as exceptions, namely: documents subject to reserve according to the Constitution and/or the law; documents related to defense or national security; criminal investigations that have not passed the investigation stage; and documents related to personal data whose disclosure may violate individuals' rights to privacy and intimacy, without prejudice to other exceptions that a law may establish. This confirms that all research was conducted in accordance with the relevant legal guidelines and regulations. The identity of the human participants in the present research has been protected in accordance with data protection law, and all experiments were validated in accordance with the Declaration of Helsinki.
The distribution of the main demographic features for the groups of recovered and deceased patients is presented below. The sex variable shows a distribution of 48.5% women and 51.5% men in the group of recovered patients, and 47.23% women and 52.77% men in the group of deceased patients. Males thus outnumber females in both groups, and their proportion is slightly higher among the deceased than among the recovered. Figure 7 shows the distribution of patients by sex for each dataset.
The age variable shows that most deaths occurred among patients over 50 years of age, especially in the 60 to 90 age group. In contrast, the group of recovered persons is concentrated among those under 50. This indicates that the probability of death increases with age, making this variable one of the main risk factors for COVID-19 mortality. Figure 8 shows the distribution of patient age for each dataset.
The last variable analyzed for the "Patients" table is the patient's department of residence. The maps show more cases in Bogota, the central region, Antioquia and the Atlantic coast, although the distribution is similar for the two population groups. Figure 9 shows the distribution of persons by department for each dataset.
Data cleaning and preparation
This process transforms the dataset obtained from the database and generates a new data model with which to train and test the neural network. The transformation includes feature selection operations, table joins, row-to-column transformations, categorical variable handling, null value treatment and other cleaning operations required to obtain the final dataset.
From the initial exploration of the data, several additional variables were considered for inclusion in the dataset; however, they were not included, either because of a lack of data, as in the case of laboratory test results and family history, or because of data quality problems, as in the case of variables related to people's weight and height. The following variables were ultimately selected for training the model:
- Sex: categorical variable with two domain values: "F" (female) and "M" (male).
- Age: numerical variable with integer values between 0 and 130, which stores the age of the patient.
- Ethnicity: categorical variable with four domain values: "INDIGENOUS", "AFROCOLOMBIAN", "OTHER ETHNICITIES" and "NONE".
- Department: categorical variable with 32 domain values corresponding to the Colombian department codes, which represent the patient's geographical location.
- Diagnoses: categorical variables corresponding to the ICD codes of the 8,111 diseases identified in the "Diagnoses" table. For each diagnosis, the number of treatments reported in the RIPS for the person is stored; if the person does not have the diagnosis, the number of treatments is 0.
- Deceased: numerical variable with two values: 1 if the person died and 0 otherwise (the "Dead" label in the "Patients" table). This variable is used as the label for each element of the dataset.
To create the data model required for training the neural network, all of a patient's information had to be in a single record. For this reason, a row-to-column transformation was performed so that all of a patient's pathologies appeared in the same row. Once this transformation was executed, a single dataframe was generated containing a record for each of the 86,000 persons in the sample, with 8,112 columns: one for each of the 8,111 possible diseases within the database, plus the person's unique identifier. The final step was to fill the blank values with zeros, which occurred wherever a person did not have a given disease.
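This row-to-column transformation corresponds to a pivot operation. A minimal pandas sketch, reusing the hypothetical column names introduced earlier:

    # One row per person, one column per ICD diagnosis code; cells hold
    # the number of treatments, with 0 where the person lacks the diagnosis
    diagnoses_wide = diagnoses.pivot_table(index="ID",
                                           columns="ICD_CODE",
                                           values="TREATMENTS",
                                           aggfunc="sum",
                                           fill_value=0).reset_index()
    # Result: 86,000 rows x 8,112 columns (8,111 diagnoses plus the ID)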
Categorical variables take their values from a finite set of descriptions that cannot be directly converted into numerical values. This was the case for the variables sex, ethnicity and department in the "Patients" table. For these variables, it was necessary to create as many columns as there are domain values for each variable and to fill them with "1" or "0" depending on the option that applies to each person (one-hot encoding), as sketched below.
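This encoding can be obtained with the get_dummies function of pandas; the column names below are hypothetical.

    # One new 0/1 column per domain value of each categorical variable
    patients_encoded = pd.get_dummies(patients,
                                      columns=["SEX", "ETHNICITY", "DEPARTMENT"],
                                      dtype=int)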
The last step in the creation of the final dataset consisted of joining the two tables built thus far: "Patients" and "Diagnoses". Each table contained exactly one record per person, so the process consisted of joining each row in the "Patients" table with the row in the "Diagnoses" table for the same person. Figure 10 graphically shows the creation of the final dataset.
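Since both tables hold exactly one record per person, the join reduces to a one-to-one merge on the anonymized identifier; a sketch under the same hypothetical names:

    # One-to-one merge on the anonymized person identifier
    dataset = patients_encoded.merge(diagnoses_wide, on="ID", how="left")
    dataset = dataset.fillna(0)  # persons with no diagnoses recorded in RIPS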
Training and test datasets
Once the data were cleaned and prepared, they were used to train the neural network. At this stage of the process, it was necessary to separate the dataset into three subsets: training, validation and testing. The training dataset was composed of examples that allowed the model to learn the characteristics and identify the patterns hidden in the data. The validation set determined whether the model was learning correctly as it was being trained, which is important because machine learning models can overfit.

Before creating these datasets, three additional operations were required. The first consisted of a review of the variables used for model training. The set of variables obtained after the data cleaning and preparation processes comprised 8,152 variables. This review identified that the ID variable is not relevant for model training; once this column was removed, the resulting dataset was composed of 86,000 records and 8,151 variables. The second operation consisted of separating the data into two structures known as dataframes in Python: the X dataframe contains the characteristics or input variables of the model, and the Y dataframe contains the class or output variable. The third operation normalized the data. Each variable of the X dataframe spans a range of values different from that of the other variables, which can be a problem for training because variables with larger values may carry more weight than the others. To solve this, the data were normalized so that all variables fell within the range 0 to 1.

Finally, the separation of the training and test datasets was performed using the train_test_split function of the scikit-learn library: 30% of the data were used for testing, and the remaining 70% of the examples were used to train the model. The validation dataset was obtained from the training set during the construction of the neural network.
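The three operations and the final split can be sketched as follows, in the order described above. The MinMaxScaler and the random_state value are assumptions: the former matches the [0, 1] normalization described in the text, and the latter merely makes the split reproducible.

    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler

    # Operation 1: drop the identifier, which is irrelevant for training
    dataset = dataset.drop(columns=["ID"])

    # Operation 2: separate the features (X) from the class label (Y)
    X = dataset.drop(columns=["Dead"])  # 8,150 input variables
    Y = dataset["Dead"]

    # Operation 3: normalize every feature to the [0, 1] range
    X_scaled = MinMaxScaler().fit_transform(X)

    # 70% of the examples for training, 30% for testing
    X_train, X_test, Y_train, Y_test = train_test_split(
        X_scaled, Y, test_size=0.3, random_state=42)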
Definition of the neural network topology
The topology or architecture of the network defines the number of layers, the number of neurons per layer and how they are connected. The network topology is directly related to the complexity of the tasks that the network can learn [46]. Generally, networks with more layers and neurons can identify more complex patterns, although they consume more computational resources, especially processing capacity and memory space. The proposed neural network was composed of 5 layers, as shown in the diagram in Figure 11. The input layer corresponds to the characteristics or input variables of the network and is composed of 8,150 nodes or neurons. The following 3 layers are the hidden layers of the model and contain 500, 100 and 50 neurons, respectively. The last layer consists of a single neuron that represents the output of the binary classification problem. The objective of the training was to obtain the optimal weight values (W) for each layer of the network, as well as the bias values [47].
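A minimal Keras sketch of this topology is shown below. The ReLU activations and the dropout layers anticipate the training subsection that follows, and the sigmoid output is an assumption consistent with a single-neuron binary classifier.

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(8150,)),             # input layer: 8,150 variables
        layers.Dense(500, activation="relu"),   # hidden layer 1
        layers.Dropout(0.5),
        layers.Dense(100, activation="relu"),   # hidden layer 2
        layers.Dropout(0.5),
        layers.Dense(50, activation="relu"),    # hidden layer 3
        layers.Dropout(0.5),
        layers.Dense(1, activation="sigmoid"),  # single output neuron
    ])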
Neural network model training
The initial model of the neural network was implemented using the Keras library [48] and the TensorFlow framework [49] in Python. This library facilitates the creation and evaluation of neural network classifiers. Its Sequential class allows new layers to be added to the network, indicating the number of neurons in each layer as well as the activation function to be used. Once the model was defined, it was compiled using an optimization algorithm. The gradient descent algorithm is commonly used, although in practice a variation known as stochastic gradient descent (SGD), which is less computationally costly, is preferred [46]. Although SGD works well in general, it has several problems that sometimes make it difficult to train neural networks. One of the algorithms created to solve these problems is Adam [50]. This algorithm combines techniques taken from other methods, such as RMSProp [51] and SGD with momentum, to improve the speed at which it converges toward an optimal solution. Adam also reduces the probability that the algorithm becomes stuck at an intermediate point and cannot advance in the search for a local or global minimum. For these reasons, Adam was used to train the neural network proposed in this work.
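The compilation step could then look as follows; the binary cross-entropy loss is an assumption, being the standard choice for a single sigmoid output.

    # Adam optimizer; binary cross-entropy is assumed for the binary label
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])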
One of the most critical elements for neural network training is the activation function, the mechanism by which artificial neurons process the information that is propagated through the network. The activation function chosen for the neural network training was the ReLU (rectified linear unit) function, which is the most widely used activation unit in practice today [52].

Neural networks are models with great representational power. The large number of parameters and layers in a neural network makes it easy for the network to learn the training data too well. Sometimes, the network may memorize a set of training data, even classifying it perfectly; however, when this happens, the model loses its ability to generalize, and when it is evaluated on data not seen during training or testing, low predictive capacity can be observed [46]. This phenomenon is known as overfitting and is common in machine learning: the algorithm models the training data too well while losing the ability to generalize to unseen data. Regularization techniques attempt to solve this problem. Dropout is a modern regularization technique that has been widely adopted [53]: by randomly deactivating a fraction of neurons during training, it prevents the network from memorizing the training data. In this model, the dropout technique was applied to the output of each network layer, with a dropout rate of 0.5.

After training the neural network for 10 iterations or epochs, it was found that the model converged at the third iteration, reaching an accuracy of 88% on both the training and validation datasets. Figure 12 shows the point of convergence of the model, as well as the overfitting that can occur if the model continues to be trained beyond the third iteration.
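A training run of this kind could be reproduced along the following lines; the validation_split fraction and the batch size are assumptions, since the text states only that the validation data were taken from the training set.

    # Train for 10 epochs, holding out part of the training set for validation
    history = model.fit(X_train, Y_train,
                        epochs=10,
                        validation_split=0.2,  # assumed fraction
                        batch_size=32)         # assumed batch size
    # history.history["accuracy"] and history.history["val_accuracy"] give
    # the curves of Figure 12; convergence was observed around the third epoch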
Random forest model training
The second model was trained using the random forest algorithm [54]. This algorithm arose as a response to one of the main drawbacks of decision trees: their low predictive capacity. The solution to this problem is the use of an ensemble, or combination, of several decision trees. Ensembles of trees can be created using bagging or boosting techniques; random forest models are based on the combination of trees using bagging [55]. Bagging methods train several trees separately, make predictions on the test data and then obtain the final prediction as the mode of all the individual predictions. The random forest model combines the bagging method with a random variable selection technique that adds diversity to the decision trees: each tree is trained with a different subset of variables, and each variable has the opportunity to appear in more than one tree.
The random forest algorithm was implemented through the RandomForestClassifier class included in scikit-learn version 0.21. This class trains classifiers efficiently, allowing the number of trees to be created, as well as other important hyperparameters, to be specified to optimize the final model's performance. Figure 13 shows a summary of the parameters used to train the random forest model.
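An equivalent call is sketched below, with illustrative hyperparameter values standing in for those reported in Figure 13.

    from sklearn.ensemble import RandomForestClassifier

    # Hyperparameter values here are illustrative placeholders; the actual
    # values used in the study are those summarized in Figure 13
    rf = RandomForestClassifier(n_estimators=100,
                                random_state=42,
                                n_jobs=-1)
    rf.fit(X_train, Y_train)
    print("Test accuracy:", rf.score(X_test, Y_test))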
The source code required to reproduce the results presented in this paper, together with the training and testing sets, is available at https://github.com/grvasquezm/Covid19Mortality