Algorithm
Complete subsets can be obtained by deleting rows and columns that contain missing data. Different deletion schemes generate different intact subsets, and the optimal intact subset, the one retaining the most rows and columns, can then be used in downstream statistical analysis.
1. Finding the optimal intact subset by the traversing method
For an original dataset, let the number of rows be n and the number of columns be f. Take dataset A in Fig. 1, with n = 8 and f = 4, as an example. The symbol 'V' denotes an observed value and blank cells denote missing values. If we delete X4 and the rows that still contain missing values, the subset contains four rows (n2, n5, n6, n8), three columns (X1, X2, X3) and 12 values. Similarly, if we delete X1 and X2, the intact subset contains two rows (n2, n8), two columns (X3, X4) and 4 values. Deleting different columns therefore retains different amounts of data (see Table 1 for further details). By traversing all deletion schemes in Table 1, we find that deleting X4 is the best choice because the resulting subset retains the most values.
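For reference, the traversing method amounts to a brute-force enumeration of all column subsets. The sketch below is a minimal illustration rather than the authors' implementation: it assumes the data are held in a pandas DataFrame with NaN marking missing cells, and the toy dataset reproduces the counts of Table 1 (the exact row arrangement of Fig. 1 is assumed).

```python
import itertools

import numpy as np
import pandas as pd

def best_subset_by_traversal(df: pd.DataFrame):
    """Try every way of deleting columns, drop incomplete rows,
    and return the deletion scheme that retains the most values."""
    best_retained, best_drop = -1, None
    cols = list(df.columns)
    for k in range(len(cols) + 1):
        for drop in itertools.combinations(cols, k):
            subset = df.drop(columns=list(drop)).dropna()
            retained = subset.shape[0] * subset.shape[1]
            if retained > best_retained:
                best_retained, best_drop = retained, list(drop)
    return best_drop, best_retained

# Toy version of dataset A (1 = observed, NaN = missing); rows n1..n8
A = pd.DataFrame({
    "X1": [1, 1, 1, np.nan, 1, 1, np.nan, 1],
    "X2": [1, 1, np.nan, np.nan, 1, 1, np.nan, 1],
    "X3": [np.nan, 1, np.nan, 1, 1, 1, np.nan, 1],
    "X4": [np.nan, 1, np.nan, np.nan, np.nan, np.nan, np.nan, 1],
})
print(best_subset_by_traversal(A))  # (['X4'], 12), matching Table 1
```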
2. The derivation of the total number of deleting ways
The traversing method has a fatal shortcoming: high computational cost. Even in this small example there were \(N={C}_{4}^{0}+{C}_{4}^{1}+{C}_{4}^{2}+{C}_{4}^{3}+{C}_{4}^{4}={2}^{4}=16\) deletion schemes (Table 1).
Table 1
All deletion schemes for dataset A in Fig. 1

| Number of deleted columns | Combination of deleted columns | Number of retained rows | Number of retained columns | Number of retained values |
| --- | --- | --- | --- | --- |
| 0 | none | 2 | 4 | 8 |
| 1 | X1 | 2 | 3 | 6 |
| 1 | X2 | 2 | 3 | 6 |
| 1 | X3 | 2 | 3 | 6 |
| 1 | X4 | 4 | 3 | 12 |
| 2 | X1, X2 | 2 | 2 | 4 |
| 2 | X1, X3 | 2 | 2 | 4 |
| 2 | X1, X4 | 4 | 2 | 8 |
| 2 | X2, X3 | 2 | 2 | 4 |
| 2 | X2, X4 | 4 | 2 | 8 |
| 2 | X3, X4 | 5 | 2 | 10 |
| 3 | X1, X2, X3 | 2 | 1 | 2 |
| 3 | X1, X2, X4 | 5 | 1 | 5 |
| 3 | X1, X3, X4 | 5 | 1 | 5 |
| 3 | X2, X3, X4 | 6 | 1 | 6 |
| 4 | X1, X2, X3, X4 | 0 | 0 | 0 |

There were 16 deletion schemes in total, among which deleting X4 was the best.
Generally, for a dataset with f columns there are \(N={2}^{f}\) deletion schemes, so the number of traversals grows exponentially with f. For example, when f = 12, \(N={2}^{12}=4096\); when f = 50, \(N={2}^{50}\approx 1.1\times {10}^{15}\), which far exceeds the computing power of ordinary computing devices. Therefore, it is necessary to explore a substitute algorithm that simplifies the computation.
3. The basic principle and steps of OIS.method
To avoid traversing all deletion schemes, we use an indicator that combines the missingness information of both rows and columns to determine the deletion order of the columns. Denote this indicator as Sm.
\({S}_{m}=\sum _{i=1}^{n}\frac{{a}_{i,j}}{f}\times 100+{b}_{j}\times 100\) (1)
where n is the number of rows, f is the number of columns, \(a\) is the number of observed values in a row (Step 1), \({a}_{i,j}\) is the number of observed values in the i-th row, applied at the cell in the j-th column, and \({b}_{j}\) is the number of observed values in the j-th column.
Take dataset A in Fig. 1 as an example. The detailed steps of OIS.method are as follows, and the flowchart of OIS.method is shown in Fig. 1.
Step 1: Compute the value of \(a\)
Count the number of observed values in each row and record it as \(a\).
Step 2: Replace
Replace all observed values with ‘100’, and replace missing data with ‘\(\frac{{a}_{i,j}}{f}\)’.
Step 3: Compute the value of Sm
Compute the Sm value of each column according to Formula (1). Then sort the columns by their Sm values from smallest to largest.
Step 4: Delete columns
Delete columns in the order obtained in Step 3: when k columns are to be deleted, delete the k columns with the smallest Sm values, so only one combination of deleted columns needs to be examined for each k.
Cell represents the amount of data in the intact subset: \(Cell={n}_{subset}\times {f}_{subset}\). Cell reaches its maximum at a certain combination of deleted columns (Fig. 2).
Step 5: Choose the optimal deletion scheme
In this example, Cell reaches its maximum when X4 is deleted.
OIS.method thus avoids traversing all deletion schemes by identifying the optimal deletion order via the Sm values. OIS.method was also validated on two additional datasets; the validation process is shown in Supplementary Tables 1–2, and all results were consistent with those of the traversing method.
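Steps 1–5 can be condensed into a short script. The sketch below is a minimal, illustrative reading of the procedure rather than the authors' released code; it assumes NaN marks missing values and takes Sm for a column as the column sum of the replaced matrix from Step 2 (observed cells contribute 100, missing cells contribute \(a_i/f\)), which may differ in scaling from Formula (1) as printed.

```python
import numpy as np
import pandas as pd

def ois_method(df: pd.DataFrame):
    """Sketch of OIS.method: rank columns by Sm, delete them in that order,
    and keep the deletion scheme whose intact subset maximizes Cell."""
    n, f = df.shape
    observed = df.notna()
    # Step 1: number of observed values in each row (a)
    a = observed.sum(axis=1).to_numpy()
    # Step 2: observed cells -> 100, missing cells -> a_i / f
    replaced = np.where(observed.to_numpy(), 100.0, (a / f)[:, None])
    # Step 3: Sm of each column, sorted from smallest to largest
    sm = pd.Series(replaced.sum(axis=0), index=df.columns)
    order = sm.sort_values().index.tolist()
    # Steps 4-5: delete columns following that order and track Cell
    best_cell, best_drop = -1, None
    for k in range(f + 1):
        drop = order[:k]
        subset = df.drop(columns=drop).dropna()
        cell = subset.shape[0] * subset.shape[1]
        if cell > best_cell:
            best_cell, best_drop = cell, drop
    return best_drop, best_cell
```

Applied to the toy reconstruction of dataset A used in the earlier sketch, this loop inspects only f + 1 = 5 candidate schemes instead of \(2^{f}=16\) and again selects X4 with Cell = 12.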
1.1 Series of simulated datasets 1
Step 1: Generate datasets with missing values
First, we generated 25 categorical datasets using the Python sklearn library. These datasets covered all combinations of the numbers of rows (11, 2000, 5000, 7500, 10 000) and the numbers of columns (11, 20, 50, 80, 100). Second, we set different patterns of missingness: the missing rates were 20%, 30%, 40%, 50%, 60%, 70% and 80%, and the distributions of missingness were the random, exponential, uniform and lognormvariate distributions. In total, we generated \(25\times 7\times 4=700\) datasets with missing data.
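The text does not state which sklearn generator was used or how the missingness distributions were mapped onto cell positions; as an assumption-laden sketch, the snippet below uses make_classification and masks cells completely at random, which corresponds only to the 'random distribution' case.

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)

def simulate_missing(n_rows, n_cols, missing_rate):
    """Generate one labelled dataset and mask a fraction of its entries
    completely at random (assumed reading of the 'random distribution')."""
    X, y = make_classification(n_samples=n_rows, n_features=n_cols,
                               n_informative=min(n_cols, 5), random_state=0)
    mask = rng.random(X.shape) < missing_rate
    X_missing = X.copy()
    X_missing[mask] = np.nan
    return X_missing, y

X_missing, y = simulate_missing(n_rows=2000, n_cols=20, missing_rate=0.4)
```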
Step 2: Handle missing values
We applied five control methods to these datasets to handle the missing data and compared them against OIS.method, which yielded \(700\times 6=4200\) intact datasets. The individual methods in this comparison were as follows (a minimal usage sketch follows the list):
⑴IterativeImputer: Missing values are estimated by modeling features with missing values as functions of other features in a cyclic manner.
⑵IterativeSVD: a low-rank matrix imputation method based on singular value decomposition (SVD). The observation matrix is decomposed into three matrices, and their product supplies the imputed values at the corresponding positions. The SVD and imputation are repeated iteratively, with each round's imputed matrix replacing the previous one.
⑶K-Nearest Neighbor: Missing values were imputed using K-Nearest Neighbor (KNN) imputation [8].
⑷Random Forest: Missing values were imputed based on a random forest model [9].
⑸Simple Fill: Missing values were imputed with the mean or median of each column.
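Three of these imputers are available directly in scikit-learn; a minimal sketch of how they might be invoked is shown below (IterativeSVD and the random-forest imputer live in other packages such as fancyimpute and missingpy, and their exact configuration is not specified here, so they are omitted).

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

imputers = {
    "IterativeImputer": IterativeImputer(random_state=0),
    "KNN": KNNImputer(n_neighbors=5),            # K-Nearest Neighbor imputation
    "SimpleFill": SimpleImputer(strategy="mean"),
}

def impute_all(X_missing: np.ndarray) -> dict:
    """Return one completed copy of X_missing per imputation method."""
    return {name: imp.fit_transform(X_missing) for name, imp in imputers.items()}
```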
Step 3: Build and evaluate machine learning models
Five machine learning models were established on the 4200 intact datasets (a construction sketch follows the list):
⑴Logistic Regression: performs classification by fitting a regression formula for the classification boundary from the existing data.
⑵Multilayer Perceptron (MLP): it has three parts: ① the input layer receives the training data; ② the hidden layers contain several layers of neurons; ③ the output layer produces the results [10].
⑶Support Vector Machine (SVM): It can find an optimal hyperplane to distinguish different types of samples. It has been widely used in binary classification problems [11].
⑷Random Forest: uses bootstrap resampling to draw multiple samples from the original data and fits a decision tree to each sample using random node splitting. The predictions of the individual trees are aggregated, and the final result is obtained by voting.
⑸Extreme Gradient Boosting (XGBoost): an ensemble learning algorithm based on gradient boosting [12]. One of its major advantages is that XGBoost provides L1 and L2 regularization, which reduces overfitting and improves the generalization performance of the models.
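Hyperparameters are not reported in this section, so the sketch below simply instantiates the five classifiers with near-default settings (scikit-learn plus the xgboost package); it is illustrative only.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(max_iter=500),
    "SVM": SVC(probability=True),   # probability=True so that AUC can be computed
    "RandomForest": RandomForestClassifier(n_estimators=100),
    "XGBoost": XGBClassifier(eval_metric="logloss"),
}
```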
The predictive abilities of the machine learning models were evaluated with five indicators: AUC (area under the ROC curve), accuracy, precision, recall and F1-score. The confusion matrix used to define these indicators is shown in Table 2 [13].
Table 2
Confusion matrix

| | True value: Positive | True value: Negative |
| --- | --- | --- |
| Predicted value: Positive | True Positive (TP) | False Positive (FP) |
| Predicted value: Negative | False Negative (FN) | True Negative (TN) |

\(Accuracy=(TP+TN)/(TP+FN+FP+TN)\), \(Precision=TP/(TP+FP)\), \(Recall=TP/(TP+FN)\), \(F1\text{-}score=2\times Precision\times Recall/(Precision+Recall)\).
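The evaluation protocol (for example, how the data were split) is not described here; the sketch below assumes a simple hold-out split and computes the five indicators with scikit-learn, the imputed features X and labels y being placeholders.

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split

def evaluate(model, X, y):
    """Fit a classifier on a hold-out split and return the five indicators."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    y_prob = model.predict_proba(X_te)[:, 1]
    return {
        "AUC": roc_auc_score(y_te, y_prob),
        "Accuracy": accuracy_score(y_te, y_pred),
        "Precision": precision_score(y_te, y_pred),
        "Recall": recall_score(y_te, y_pred),
        "F1-score": f1_score(y_te, y_pred),
    }
```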
The better a model's prediction performance, the better the corresponding method has preserved the information needed for classification; the indicator values therefore reflect the abilities of the missing-data processing methods.
The above procedures were repeated 10 times.
1.2 Series of simulated datasets 2
All variables in simulated datasets 1 contribute to the outcome variable, whereas real-world datasets contain many redundant variables. Therefore, to better mimic real-world datasets, we introduced redundant variables into the dataset with 500 samples, 10 features, a missing rate of 50% and an exponential distribution of missingness. The multiples of redundant variables were 0, 1, 1.5, 3 and 8.
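How the redundant variables were constructed is not detailed; one possible reading, sketched below, appends pure-noise columns that carry no information about the outcome, with the multiple controlling how many such columns are added per original feature.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_redundant(X: np.ndarray, multiple: float) -> np.ndarray:
    """Append noise columns; their count is `multiple` times the number of
    original features (an assumed interpretation of 'redundant variables')."""
    n_rows, n_cols = X.shape
    n_redundant = int(round(multiple * n_cols))
    noise = rng.normal(size=(n_rows, n_redundant))
    return np.hstack([X, noise])

# Multiples studied in the text: 0, 1, 1.5, 3, 8 (here 1.5 -> 15 extra columns)
X_expanded = add_redundant(rng.normal(size=(500, 10)), multiple=1.5)
```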
OIS.method and the five control methods (same as above) were adopted to handle the missing values. The remaining steps were the same as before.
In machine learning, the information provided by columns becomes more important when the dataset contains many samples. We therefore added a power exponent to \({f}_{subset}\) when calculating Cell in OIS.method: \(Cell={n}_{subset}\times {f}_{subset}^{e}\), where e took the values 1, 1.2, 1.5, 1.8 and 2.
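Relative to the earlier OIS.method sketch, only the objective changes; the snippet below illustrates the weighted criterion (the subset sizes are hypothetical, the exponent grid is from the text).

```python
def weighted_cell(n_subset: int, f_subset: int, e: float) -> float:
    """Column-weighted criterion: Cell = n_subset * f_subset ** e."""
    return n_subset * f_subset ** e

# Hypothetical subset sizes, for illustration only
for e in (1, 1.2, 1.5, 1.8, 2):
    print(e, weighted_cell(n_subset=400, f_subset=6, e=e))
```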
The above procedures were repeated 10 times.
2. Validation in real-world datasets
OIS.method and three control methods (simple imputation, random forest, and modified random forest) were applied to two real-world classification tasks. Results are presented as mean ± standard deviation.
2.1 Task 1: risk prediction of hypotension during dialysis in dialysis patients
We collected the data of all renal dialysis patients in Sichuan Provincial People's Hospital. The dataset contained the records of 314,534 patients, each made up of 496 features. The missing rate was defined as the number of missing values divided by the total amount of data; the missing rate of this dataset was 42.54%.
After handling the missing data, we established three machine learning models: Random Forest, Gradient Boosting and Logistic Regression. The same five indicators as above were used to reflect the performance of the different methods.
2.2 Task 2: risk prediction of drug adverse reaction in elderly patients with type 2 diabetes
We collected the data of all elderly patients with type 2 diabetes in Sichuan Provincial People's Hospital. The dataset contained the records of 1,224 patients, each made up of 279 features. The missing rate of this dataset was 21.75%.
After handling the missing data, we established two machine learning models: an ensemble learning model and a Passive Aggressive classifier. The remaining steps were the same as before.
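The specific ensemble method is not named in the text; purely as an illustration, the sketch below pairs scikit-learn's PassiveAggressiveClassifier with a soft-voting ensemble as a stand-in.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression, PassiveAggressiveClassifier

models_task2 = {
    # Placeholder ensemble: the actual ensemble learner used is not specified
    "EnsembleLearning": VotingClassifier(
        estimators=[("lr", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(n_estimators=100))],
        voting="soft"),
    "PassiveAggressive": PassiveAggressiveClassifier(max_iter=1000),
}
```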